Introduction
Bike-sharing demand analysis refers to the study of factors that impact the usage of bike-sharing services and the demand for bikes at different times and locations. The purpose of this analysis is to understand the patterns and trends in bike usage and make predictions about future demand. This post will examine how statistical machine-learning methods can analyze the given data.
This article will use a small subset of this dataset and just focus on the functionality. Please note that the chances of inaccuracy are high for such a small subset of the dataset. Feel free to use the complete dataset for your analysis.
Learning Objectives:
- Accurately predict the number of bike rentals for a given time period and location based on historical data and other relevant factors.
- Identify and analyze the key factors that influence bike rental demand, like weather conditions, holidays, and events.
- Develop and evaluate predictive models that can effectively forecast bike rental demand using techniques like regression analysis, time series analysis, and machine learning algorithms.
- Use the forecasting results to optimize bike inventory and resources, ensuring that bike-sharing companies can meet customer demand and maximize revenue.
- Continuously monitor and evaluate the forecasting accuracy, refine the models, and improve accuracy and reliability.
Dataset on Kaggle: https://www.kaggle.com/c/bike-sharing-demand
This article was published as a part of the Data Science Blogathon.
What is Bike Sharing Demand Forecasting?
Bike-sharing demand forecasting aims to provide bike-sharing companies with the insights and tools they need to make data-driven decisions and effectively manage their operations.
Factors often considered during bike sharing demand analysis include weather conditions, seasonality, day of the week, holiday periods, and events. Demographic information about users, like age, gender, and income. It can be used to understand usage patterns.
Methods used in bike-sharing demand analysis include statistical models like time-series analysis, regression analysis, and machine learning algorithms. Bike-sharing companies can use the analysis results to optimize their operations, distribution, pricing strategies, and marketing campaigns. Additionally, the findings can inform city planners in developing bike-friendly infrastructure and policies.
Why Bike Sharing System?
Bike-sharing systems have become increasingly popular in recent years due to their many benefits, which include:
- Affordable and Sustainable Transportation: Bike-sharing systems provide an affordable and sustainable mode of transportation, especially for short trips. They are a low-cost alternative to owning a personal bike and can help reduce reliance on private cars and sharing cars, which can positively impact the environment.
- Health and comfort: Bike-sharing systems promote physical action and exercise, positively impacting health and comfort. Regular cycling can help reduce the risk of heart disease, stroke, and other chronic diseases.
- Convenience: Bike-sharing systems are often located in densely populated city areas, making them a convenient mode of transportation for short trips. They can be easily accessed, making them a flexible and convenient option for commuters and tourists alike.
- Reduced Traffic Congestion: Bike-sharing systems can help reduce traffic congestion by providing an alternative mode of transportation for short trips. This can have a positive impact on town mobility.
In summary, bike-sharing systems provide multiple benefits, including affordable and sustainable transportation, health and comfort, convenience, reduced traffic congestion, and tourism and economic development. These benefits have contributed to the popularity of bike-sharing systems in many cities around the world.
Problem Statement
The problem statement for bike-sharing demand is predicting the number of bikes that will be rented from a bike-sharing system at a given time based on factors such as weather, day of the week, and time of day. The purpose is to build a predictive model that can accurately forecast bike rental demand to optimize bike allocation and improve the bike-sharing system’s overall efficiency.
The problem statement may involve answering specific questions such as:
- What is the expected demand for bikes during peak hours, weekdays, or weekends?
- How does weather (e.g., wind, temperature, precipitation) affect bike rental demand?
- Are any specific locations or routes with higher or lower demand for bikes?
- How can we optimize the bike-sharing system to meet fluctuating demand and minimize operational costs?
- Can the bike-sharing system expand or improve to better serve users’ needs and promote sustainable transportation?
The problem statement for bike-sharing demand analysis typically involves predicting bike rental demand and optimizing bike allocation to improve the bike-sharing system’s efficiency and sustainability.
The company management wants:
- To create a model of the demand for shared bikes with the available independent variables.
- To understand the demand dynamics of the market using the model.
Reading and Understanding the Data
To build a bike-sharing demand forecasting model, it’s important to start by reading and understanding the data. The key steps involved in this process are loading, exploring, cleaning, preprocessing, and visualizing the data. By following these steps, analysts can gain a deeper understanding of the data and identify any issues that need addressing before building the bike-sharing demand forecasting model. This helps ensure the model is accurate and reliable, which is essential for optimizing bike-sharing operations.
import pandas as pd
bikeshare_df = pd.read_csv("day.csv")
print(bikeshare_df.head())
bike_sharing.info()
bike_sharing.describe()
Visualizing the Data
Visualizing the data is an important step in the bike-sharing demand forecasting process. It can help identify patterns and trends that may not be immediately apparent from raw data.
import matplotlib.pyplot as plt
import seaborn as sns
#Plotting pairplot of all the numeric variables
sns.pairplot(bike_sharing[["temp","atemp","hum","windspeed","casual","registered","cnt"]])
plt.show()
#Plotting box plot of continuous variables
plt.figure(figsize=(20, 12))
plt.subplot(2,3,1)
plt.boxplot(bike_sharing["temp"])
plt.subplot(2,3,2)
plt.boxplot(bike_sharing["atemp"])
plt.subplot(2,3,3)
plt.boxplot(bike_sharing["hum"])
plt.subplot(2,3,4)
plt.boxplot(bike_sharing["windspeed"])
plt.subplot(2,3,5)
plt.boxplot(bike_sharing["casual"])
plt.subplot(2,3,6)
plt.boxplot(bike_sharing["registered"])
plt.show()
Visualising Categorical Variables
#Plotting box plot of categorical variables
plt.figure(figsize=(20, 12))
plt.subplot(3,3,1)
sns.boxplot(x = 'season', y = 'cnt', data = bike_sharing)
plt.subplot(3,3,2)
sns.boxplot(x = 'yr', y = 'cnt', data = bike_sharing)
plt.subplot(3,3,3)
sns.boxplot(x = 'mnth', y = 'cnt', data = bike_sharing)
plt.subplot(3,3,4)
sns.boxplot(x = 'holiday', y = 'cnt', data = bike_sharing)
plt.subplot(3,3,5)
sns.boxplot(x = 'weekday', y = 'cnt', data = bike_sharing)
plt.subplot(3,3,6)
sns.boxplot(x = 'workingday', y = 'cnt', data = bike_sharing)
plt.subplot(3,3,7)
sns.boxplot(x = 'weathersit', y = 'cnt', data = bike_sharing)
plt.show()
Data Preparation
Data preparation is a crucial step in bike-sharing demand forecasting, as it involves cleaning, transforming, and organizing the data to make it suitable for analysis. By preparing the data in this way, analysts can ensure that the data is suitable for analysis and that any biases or errors in the data are addressed. This can lead to more accurate and reliable forecasting models that can help bike-sharing companies optimize their operations and better meet customer demand.
Dropping unnecessary columns instant, dteday, casual & registered
- instant – It is just a sequence number of rows
- dteday – It is not required since columns for year & month already exists
- casual – This variable cannot be predicted.
- registered – This variable cannot be predicted.
bike_sharing.drop(columns=["instant","dteday","casual","registered"],axis=1,inplace =True)
bike_sharing.head()
Dummy Variables
season_type = pd.get_dummies(bike_sharing['season'], drop_first = True)
season_type.rename(columns={2:"season_summer", 3:"season_fall", 4:"season_winter"},inplace=True)
season_type.head()
weather_type = pd.get_dummies(bike_sharing['weathersit'], drop_first = True)
weather_type.rename(columns={2:"weather_mist_cloud", 3:"weather_light_snow_rain"},inplace=True)
weather_type.head()
#Concatenating new dummy variables to the main dataframe
bike_sharing = pd.concat([bike_sharing, season_type, weather_type], axis = 1)
#Dropping columns season & weathersit since we have already created dummies for them
bike_sharing.drop(columns=["season", "weathersit"],axis=1,inplace =True)
#Analysing dataframe after dropping columns
bike_sharing.info()
Creating derived variables for the categorical variable month
#Creating year_quarter derived columns from month columns.
#Note that last quarter has not been created since we need only 3 columns to define the four quarters.
bike_sharing["Quarter_JanFebMar"] = bike_sharing["mnth"].apply(lambda x: 1 if x<=3 else 0)
bike_sharing["Quarter_AprMayJun"] = bike_sharing["mnth"].apply(lambda x: 1 if 4<=x<=6 else 0)
bike_sharing["Quarter_JulAugSep"] = bike_sharing["mnth"].apply(lambda x: 1 if 7<=x<=9 else 0)
#Dropping column mnth since we have already created dummies.
bike_sharing.drop(columns=["mnth"],axis=1,inplace =True)
bike_sharing["weekend"] = bike_sharing["weekday"].apply(lambda x: 0 if 1<=x<=5 else 1)
bike_sharing.drop(columns=["weekday"],axis=1,inplace =True)
bike_sharing.drop(columns=["workingday"],axis=1,inplace =True)
bike_sharing.head()
#Analysing dataframe after dropping columns weekday & workingday
bike_sharing.info()
#Plotting correlation heatmap to analyze the linearity between the variables in the dataframe
plt.figure(figsize = (16, 10))
sns.heatmap(bike_sharing.corr(), annot = True, cmap="Greens")
plt.show()
#Dropping column temp since it is very highly collinear with the column atemp.
#Further,the column atemp is more appropriate for modelling compared to column temp from human perspective.
bike_sharing.drop(columns=["temp"],axis=1,inplace =True)
bike_sharing.head()
Splitting the Data into Training and Testing Sets
Splitting the data into training and testing sets is a critical step in bike-sharing demand forecasting. It enables analysts to evaluate the performance of their forecasting models on unseen data. The general approach is to use historical data to train the model and then test the model’s performance on a separate, holdout set of data.
#Importing library
from sklearn.model_selection import train_test_split
# We specify this so that the train and test data set always have the same rows, respectively
np.random.seed(0)
bike_sharing_train, bike_sharing_test = train_test_split(bike_sharing, train_size = 0.7, test_size = 0.3, random_state = 100)
Rescaling the training dataframe using the MinMax scaling function after the split to achieve optimum beta coefficients for all features.
#importing library
from sklearn.preprocessing import MinMaxScaler
#assigning variable to scaler
scaler = MinMaxScaler()
# Applying scaler to all the columns except the derived and 'dummy' variables that are already in 0 & 1.
numeric_var = ['atemp','hum','windspeed','cnt']
bike_sharing_train[numeric_var] = scaler.fit_transform(bike_sharing_train[numeric_var])
# Analysing the train dataframe after scaling
bike_sharing_train.head()
By splitting the data into training and testing sets, analysts can evaluate the performance of their forecasting models on unseen data and ensure that the models are robust and reliable. This can help bike-sharing companies optimize their operations and better meet customer demand.
y_train = bike_sharing_train.pop('cnt')
X_train = bike_sharing_train
print (y_train.head())
print (X_train.head())
Building the Linear Model
Building a linear model for bike-sharing demand forecasting involves creating a model that uses linear regression to predict bike rental demand based on a set of input variables. The linear regression model is trained using the training set, with the input variables used to predict the target variable (bike rental demand). The model is optimized to minimize the error between the predicted and actual demands in the training set.
Using the LinearRegression function from SciKit Learn and Recursive Feature Elimination (RFE):
# Importing RFE and LinearRegression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
# Running RFE with the output number of the variable equal to 12
lm = LinearRegression()
lm.fit(X_train, y_train)
rfe = RFE(lm, 12) # running RFE
rfe = rfe.fit(X_train, y_train)
list(zip(X_train.columns,rfe.support_,rfe.ranking_))
By building a linear model for bike-sharing demand forecasting, analysts can develop a simple yet effective forecasting system to optimize bike-sharing operations and improve customer satisfaction. However, it’s important to note that linear models may have limitations in capturing more complex patterns and relationships in the data, so other modeling techniques (such as decision trees or neural networks) can be more accurate predictions.
# Creating X_test dataframe with RFE selected variables
X_train_rfe = X_train[columns_rfe]
X_train_rfe
Residual Analysis of the Training Data
Residual analysis is an essential step in evaluating the performance of a linear model for bike-sharing demand forecasting. Residuals are the difference between the predicted demand and the actual demand, and analyzing these residuals can help identify any patterns or biases in the model’s predictions.
#using the final model lr5 on train data to predict y_train_cnt values
y_train_cnt = lr5.predict(X_train_lr5)
# Plotting the histogram of the error terms
fig = plt.figure()
sns.distplot((y_train - y_train_cnt), bins = 20)
fig.suptitle('Error Terms', fontsize = 20)
plt.xlabel('Errors', fontsize = 18)
plt.scatter(y_train,(y_train - y_train_cnt))
plt.show()
Making Predictions Using the Final Model lr5
To make predictions using the final linear model for bike-sharing demand forecasting (lr5), you will need to provide values for the input variables and use the model to generate a prediction for the target variable (bike rental demand).
#Applying the scaling on the test sets
numeric_vars = ['atemp','hum','windspeed','cnt']
bike_sharing_test[numeric_vars] = scaler.transform(bike_sharing_test[numeric_vars])
bike_sharing_test.describe()
Dividing into X_test and y_test
y_test = bike_sharing_test.pop('cnt')
X_test = bike_sharing_test
# Adding constant variable to test dataframe
X_test_lr5 = sm.add_constant(X_test)
# Updating X_test_lr5 dataframe by dropping the variables as analyzed from the above models
X_test_lr5 =X_test_lr5.drop(["atemp", "hum", "season_fall", "Quarter_AprMayJun", "weekend","Quarter_JanFebMar"], axis = 1)
# Making predictions using the fifth model
y_pred = lr5.predict(X_test_lr5)
Model Evaluation
Model evaluation is a critical step in assessing the performance of a bike-sharing demand forecasting model. Use various metrics to evaluate the performance of a model, including mean absolute error (MAE), root mean squared error (RMSE), and coefficient of determination (R-squared).
# Plotting y_test and y_pred to understand the spread
fig = plt.figure()
plt.scatter(y_test, y_pred)
fig.suptitle('y_test vs y_pred', fontsize = 20)
plt.xlabel('y_test', fontsize = 18)
plt.ylabel('y_pred', fontsize = 16)
You should evaluate the model’s performance using metrics such as MAE, RMSE, and R-squared. MAE and RMSE measure the average magnitude of the errors between the predicted and actual values. R-squared measures the proportion of variance in the target variable, explained by the input variables.
#importing library and checking mean squared error
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, y_pred)
print('Mean_Squared_Error :' ,mse)
#importing library and checking R2
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)
Conclusion
This study aimed to improve the bike-sharing activities of Capital Bikeshare and support the reinvention of the city transportation system. This comprehensive exploratory data analysis on their publicly available data helped us understand and analyze the underlying patterns and characteristics of the bike-share network and to work on this data to achieve data-driven results.
We performed an analysis on the growth in popularity of bike-share over the two years, 2011–2012, and the effect of the seasonal and day factors on the ridership patterns. The impacts of seasonal and weather parameters were to understand the ridership pattern in Washington, DC. Analysis of the trip data helped to understand the characteristics of the locality where the stations are located.
Keeping these inferences in mind, we could suggest the following recommendations:
- Most of the rentals are for commuting to workshops and colleges on a daily basis. So CaBi should launch more stations near these landmarks to reach out to their main customers.
- Planning for more sharing bikes to stations must consider the peak rental hours, i.e., 7–9 am and 5–6 pm.
- The offer should not be a fixed price. Instead, it should be based on seasonal variations to promote bike usage during the fall and winter seasons.
- Data about the most used routes can help build roads/lanes dedicated to bikes specifically.
- Due to the low usage of bikes at night, it would be better to do bike maintenance at night. Removing some bikes from the streets at night time will not cause trouble for the customers.
- Converting registered customers to casual customers on the weekends by providing them with discounts and coupons.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
By Analytics Vidhya, May 27, 2023.