Introduction
According to a report by the International Energy Agency (IEA), the lifecycle of buildings, from construction to demolition, was responsible for 37% of global energy- and process-related CO2 emissions in 2020. Yet it is possible to drastically reduce buildings’ energy consumption by combining easy-to-implement fixes with state-of-the-art strategies. In this article, we will develop a data science solution to predict Site EUI using machine learning techniques and then deploy the model on Streamlit Cloud for inference. Such a solution can be used by government agencies or construction firms to keep energy intensity within a certain threshold and help reduce carbon emissions. So let’s get started with this exciting project.
Site energy intensity is an important metric to measure for any building to understand its energy utilization and identify opportunities for energy efficiency. It also gives us an indication of the building’s overall CO2 emissions and helps us control the emissions.
Site EUI is a function of a building’s energy consumption relative to its size, typically expressed as energy used per square foot per year. It depends on various characteristics of the building as well as weather factors at its location.
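To make the definition concrete, here is a minimal sketch (the function and variable names are hypothetical; the kBtu-per-square-foot-per-year convention is the one commonly used for EUI, not something defined by this dataset):
# Illustrative only: Site EUI is annual site energy use divided by gross floor area
def site_eui(annual_energy_kbtu, floor_area_sqft):
    return annual_energy_kbtu / floor_area_sqft

print(site_eui(1_200_000, 50_000))  # 24.0 kBtu per square foot per year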
Project Background
In this article, we aim to explore and identify the drivers of site energy intensity for different types of buildings located in various states of the USA. Measuring Site EUI helps in the following ways:
- Identifying the energy consumption of buildings, which further helps control energy consumption levels and curb CO2 emissions
- Reducing the energy consumption of buildings
- Saving on the cost of a building’s energy requirements
- Improving indoor comfort
- Complying with regulations and energy standards
- Improving property value
In this project, the dataset consists of building characteristics (e.g., floor area, facility type, building class), weather data for the location of the building (e.g., annual average temperature, annual total precipitation) as well as the energy usage for the building in the given year, measured as Site Energy Usage Intensity (Site EUI). Each row in the data corresponds to a single building observed in a given year.
We will use various data science techniques to understand and explore the data and find insights into the energy intensity levels of the buildings, followed by machine learning techniques to predict Site EUI. We will also apply an Explainable AI (XAI) technique to understand the predictions of the machine learning model. After developing the model, we will deploy it on Streamlit Cloud for real-time and batch predictions.
Problem Statement
You are provided with two datasets:
(1) The ‘train_dataset’, where the observed values of the Site EUI for each row are provided.
(2) The ‘x_test’ dataset, where the observed values of the Site EUI for each row are removed and provided separately in ‘y_test’.
Your task is to predict the Site EUI for each row (using the complete training dataset), given the characteristics of the building and the weather data for the location of the building. Use the test sets for validation and testing. The target variable is site_eui for the predictive analytics problem.
Evaluation Metric: Root Mean Squared Error (RMSE)
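As a quick refresher, RMSE is the square root of the mean squared error between the observed and predicted values. A minimal sketch with made-up numbers:
# Illustrative RMSE computation with dummy values (not from the dataset)
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([100.0, 250.0, 75.0])
y_pred = np.array([110.0, 240.0, 80.0])
print(np.sqrt(mean_squared_error(y_true, y_pred)))  # ~8.66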
Pre-requisites
This project is for intermediate and advanced learners who are looking to increase their knowledge about explainable AI and the deployment of machine learning models. Beginners in the data science field can also work on this project if they are familiar with the following tools:
- Knowledge of Python programming language.
- Basic data science libraries like pandas, matplotlib, seaborn, and scikit-learn.
- Familiarity with machine learning techniques such as regression, classification, etc.
Dataset Description
The dataset is taken from the WiDS Datathon 2022 and consists of about 100K observations of buildings in the USA over 7 years. Let’s look at the dataset description before moving ahead with the project workflow.
Dataset source: Click Here
- id: building id
- Year_Factor: an anonymized year in which the weather and energy usage factors were observed
- State_Factor: anonymized state in which the building is located
- building_class: building classification
- facility_type: building usage type
- floor_area: floor area (in square feet) of the building
- year_built: the year in which the building was constructed
- energy_star_rating: the energy star rating of the building
- ELEVATION: elevation of the building location
- january_min_temp: the minimum temperature in January (in Fahrenheit) at the location of the building
- january_avg_temp: the average temperature in January (in Fahrenheit) at the location of the building
- january_max_temp: maximum temperature in January (in Fahrenheit) at the location of the building
- cooling_degree_days: the cooling degree days for a given day are the number of degrees by which the daily average temperature exceeds 65 degrees Fahrenheit; the daily values are summed to produce an annual total at the location of the building (see the short sketch after this list)
- heating_degree_days: the heating degree days for a given day are the number of degrees by which the daily average temperature falls below 65 degrees Fahrenheit; the daily values are summed to produce an annual total at the location of the building
- precipitation_inches: annual precipitation in inches at the location of the building
- snowfall_inches: annual snowfall in inches at the location of the building
- snowdepth_inches: annual snow depth in inches at the location of the building
- avg_temp: average temperature over a year at the location of the building
- days_below_30F: total number of days below 30 degrees Fahrenheit at the location of the building
- days_below_20F: total number of days below 20 degrees Fahrenheit at the location of the building
- days_below_10F: total number of days below 10 degrees Fahrenheit at the location of the building
- days_below_0F: total number of days below 0 degrees Fahrenheit at the location of the building
- days_above_80F: total number of days above 80 degrees Fahrenheit at the location of the building
- days_above_90F: total number of days above 90 degrees Fahrenheit at the location of the building
- days_above_100F: total number of days above 100 degrees Fahrenheit at the location of the building
- days_above_110F: total number of days above 110 degrees Fahrenheit at the location of the building
- direction_max_wind_speed: wind direction for maximum wind speed at the location of the building. Given in 360-degree-compass point directions (e.g. 360 = north, 180 = south, etc.).
- direction_peak_wind_speed: wind direction for peak wind gust speed at the location of the building. Given in 360-degree compass point directions (e.g. 360 = north, 180 = south, etc.).
- max_wind_speed: maximum wind speed at the location of the building
- days_with_fog: number of days with fog at the location of the building
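To make the degree-day definitions above concrete (see the cooling_degree_days entry), here is a minimal sketch of how such annual totals could be computed from daily average temperatures; the temperature list is hypothetical, and a full year would contain 365 values:
# Hypothetical daily average temperatures (Fahrenheit)
daily_avg_temps = [28, 45, 64, 66, 71, 90, 88, 70, 55, 40]

cooling_degree_days = sum(max(0, t - 65) for t in daily_avg_temps)  # degrees above 65F
heating_degree_days = sum(max(0, 65 - t) for t in daily_avg_temps)  # degrees below 65F
print(cooling_degree_days, heating_degree_days)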
Now that we understand the project’s problem statement, let’s quickly move on to loading the dataset and performing exploratory data analysis.
Loading the Dataset
I have used the Kaggle environment to work on this project, and the dataset was loaded from the above-mentioned source in the Kaggle input directory. You may use your local environment or any cloud-based IDE to work on this project.
Note: I will also attach the GitHub repository at the end of this article so that you can look at it for your reference.
# importing all the Python libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import KNNImputer
from statsmodels.stats.outliers_influence import variance_inflation_factor
#importing algorithms
from sklearn.model_selection import train_test_split, GridSearchCV
from xgboost import XGBRegressor
import xgboost
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score
import sklearn
import shap
import optuna
import joblib
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# Load the dataset
df = pd.read_csv("/kaggle/input/site-energy-intensity-dataset/train_dataset.csv")
test_df = pd.read_csv("/kaggle/input/site-energy-intensity-dataset/x_test.csv")
df.head()
The output of the above code
Our train dataset has about 75K rows and 64 columns, while the test dataset contains about 25K rows of observations. We have stored the test dataset in the test_df variable for later use in our project.
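A quick shape check confirms these sizes (the exact counts depend on the dataset version):
# Print the dimensions of the train and test dataframes
print(df.shape, test_df.shape)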
Data Understanding
Now that we have the dataset loaded into our environment, we can use some of pandas’ built-in functions to get a description of the dataset.
# columns with null values
df.isnull().sum()[df.isnull().sum() != 0]
The output of the above code
# find the correlation between numerical columns
df.corr().T
The output of the above code
Using the corr() function, we compute correlation coefficients between the numerical columns, which lets us infer which columns influence the target variable.
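To focus on the target specifically, we can also sort the numeric columns by their absolute correlation with site_eui; a minimal sketch built on the same corr() output:
# Top 10 numeric columns by absolute correlation with the target
df.corr()['site_eui'].drop('site_eui').abs().sort_values(ascending=False).head(10)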
# descriptive statistics of dataframe
df.describe()
The output of the above code
Due to the large output size, I have shown only a portion of each code output to convey its purpose. You can check out the GitHub repository linked at the end of the article for the full outputs.
Some observations from the data understanding phase:
- 6 Columns have missing values
- Energy intensity reflects the usage of electricity and heat (the higher the usage, the higher the energy intensity and CO2 emissions per building)
- Building characteristics and weather data over a year are given (every month’s min, max, and mean temperature)
- Samples are from various states of the USA
Exploratory Data Analysis
In the next step, we will do exploratory data analysis to find some actionable insights from the dataset, which can further help us to develop accurate machine learning models.
Heatmap of Dataframe
# heatmap to visualize multi collinearity in dataset
plt.figure(figsize=(30,20))
sns.set(font_scale=0.8)
sns.heatmap(df.corr(), annot=True, cmap=plt.cm.CMRmap_r)
The output of the above code
Using Seaborn’s heatmap function, we can visualize the correlation between columns. The above heatmap shows the relationships between the various columns and highlights highly correlated columns so that we can take further action.
Note: for clear code outputs and images, please refer to the code repository linked at the end of the article.
Observations:
- The dataset has multiple inter-correlated columns, which is not ideal for a regression problem (a short sketch for listing such pairs follows this list)
- Temperature columns during the winter and summer months correlate with each other
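As a complement to the heatmap, the strongly correlated pairs can also be listed programmatically; a minimal sketch, where the 0.9 threshold is an arbitrary illustrative choice:
# List column pairs with absolute correlation above a chosen threshold
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # keep the upper triangle only
pairs = upper.stack().sort_values(ascending=False)
print(pairs[pairs > 0.9].head(20))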
# create obj_cols list with two object columns from the dataset
obj_cols = ['State_Factor','building_class']
# Using for loop plot the stripplot for both object columns
for col in obj_cols:
    plt.figure(figsize=(5,5))
    sns.stripplot(x=col, y='site_eui', data=df)
    plt.show()
The output of the above code
Relations Between Numerical Columns and Target
# create a new dataframe to extract all the columns from the dataset
ndf = df.drop(['id','site_eui'], axis=1)
target = df['site_eui']
col_list = [col for col in ndf.columns if ndf[col].dtype != 'object']
# plot the scatterplot of all numerical columns with target variable
fig, ax = plt.subplots(30, 2, figsize=(12, 60))
for idx, col in enumerate(col_list):
    sns.scatterplot(x=col, y='site_eui', data=df, ax=ax[idx//2, idx%2])
    ax[idx//2, idx%2].grid(visible=True, linestyle="--", color="#0b1324")
    ax[idx//2, idx%2].set_title(f'{col}', pad=12)
plt.suptitle('Relations between columns and target', fontsize=15, fontweight="bold",
             color="#0b1324")
plt.show()
The output of the above code
Observations:
- The ‘ELEVATION’ column shows some relationship with the target column.
- There’s no strong correlation between the dependent columns and the target column.
- The ‘Year_Factor’ and ‘year_built’ columns do not relate to the target column, so we will drop them.
- ‘days_with_fog’ will be filled with its mean value (KNN-based imputation is another option; see the sketch after this list), and the other three wind-related columns with 1.0, their mode, since most of their values are 1.
- ‘energy_star_rating’ will be filled using the mean value, as it relates to the target variable.
- We will remove the ‘facility_type’ column from the categorical columns to avoid high dimensionality and noise.
- We will remove the ‘days_above_110F’ column.
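Since KNNImputer is imported above but the imputation code shown later uses simpler strategies, here is a hedged sketch of what a KNN-based imputation of ‘days_with_fog’ could look like; the choice of companion columns and n_neighbors is illustrative, not the author’s exact code:
# Hedged sketch: KNN-based imputation of 'days_with_fog' using related weather columns
from sklearn.impute import KNNImputer

weather_cols = ['days_with_fog', 'avg_temp', 'precipitation_inches', 'snowfall_inches']
imputed = pd.DataFrame(KNNImputer(n_neighbors=5).fit_transform(df[weather_cols]),
                       columns=weather_cols, index=df.index)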
Data Pre-processing and Feature Engineering
After EDA, the next step in any data science project lifecycle is data pre-processing and feature engineering, which help us develop accurate, high-quality machine learning models. Let’s start with handling missing values.
Handling Missing Values
# the last 4 columns have more than 50% missing values
(ndf.isnull().sum()[ndf.isnull().sum() != 0]/len(target))*100
The output of the above code
The above code shows the percentage of missing values in each column that has them. In the next code block, we will fill in the missing values.
# remove noisy features
remove_col = ['Year_Factor','facility_type','year_built','days_above_110F']
n_df = ndf.drop(columns=remove_col, axis=1)
# fill the missing values of the columns
n_df['energy_star_rating'] = n_df['energy_star_rating'].fillna(n_df['energy_star_rating'].mean())
n_df['direction_max_wind_speed'] = n_df['direction_max_wind_speed'].fillna(1.0)
n_df['direction_peak_wind_speed'] = n_df['direction_peak_wind_speed'].fillna(1.0)
n_df['max_wind_speed'] = n_df['max_wind_speed'].fillna(1.0)
n_df['days_with_fog'] = n_df['days_with_fog'].fillna(n_df['days_with_fog'].mean())
Using the fillna() method, we fill the missing values: the mean for ‘energy_star_rating’ and ‘days_with_fog’, and a constant value of 1.0 for the three wind-related columns. One can consider many other approaches to handle missing values depending on the problem type.
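One such alternative, shown here as a hedged sketch rather than the author’s approach, is group-wise imputation, e.g., filling ‘energy_star_rating’ with the median of its building class:
# Hedged alternative (not applied in this pipeline): median imputation within each building class
filled_rating = ndf.groupby('building_class')['energy_star_rating'].transform(
    lambda s: s.fillna(s.median()))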
Treating Multicollinearity
In our dataset, many columns are highly correlated with one another, which can lead to unstable or inaccurate predictions for our target variable. To tackle this multicollinearity problem, we will use the Variance Inflation Factor (VIF) method; VIF can be calculated with a built-in function from the statsmodels library. Let’s identify the multicollinear features and remove them before modeling.
# VIF calculations
def get_vif(X):
    X_matrix = np.array(X)
    vif = [variance_inflation_factor(X_matrix, i) for i in range(X_matrix.shape[1])]
    vif_factors = pd.DataFrame()
    vif_factors['columns'] = X.columns
    vif_factors['VIF'] = vif
    return vif_factors
X = n_df.drop(columns=['State_Factor','building_class'], axis=1)
vif_factors = get_vif(X)
# total 56 columns
vif_factors.info()
# find the columns with large VIF (vif > 4)
cols_with_large_vif = vif_factors[vif_factors.VIF > 4]['columns']
cols_with_large_vif
As per the book ‘Machine Learning Using Python’ by Manaranjan Pradhan and U Dinesh Kumar, if a column’s VIF value is higher than 4, it is highly correlated with other columns, and in that case we should remove such columns before applying machine learning algorithms. Next, we will do feature engineering while keeping multicollinearity in mind.
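For reference, here is a minimal sketch of the direct approach of dropping the flagged columns and rechecking the VIF values; the article instead reduces multicollinearity through the seasonal feature aggregation shown in the next section:
# Hedged sketch: drop the high-VIF columns and verify that the remaining VIFs are lower
X_reduced = X.drop(columns=list(cols_with_large_vif))
print(get_vif(X_reduced).sort_values('VIF', ascending=False).head(10))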
Feature Engineering
As we know, some of the monthly columns within a season are highly correlated, so we will aggregate the values within each season to reduce the total number of columns.
# Averaging seasonal (i.e., winter and summer) columns to reduce the number of columns and multicollinearity
n_df['Avg_min_temp_winter'] = (n_df['january_min_temp'] + n_df['february_min_temp']
+ n_df['march_min_temp'] +
n_df['april_min_temp'] + n_df['october_min_temp']
+ n_df['november_min_temp'] + n_df['december_min_temp'])/7
n_df['Avg_max_temp_winter'] = (n_df['january_max_temp'] + n_df['february_max_temp'] +
n_df['march_max_temp'] +
n_df['april_max_temp'] + n_df['october_max_temp'] +
n_df['november_max_temp'] + n_df['december_max_temp'])/7
n_df['Avg_temp_winter'] = (n_df['january_avg_temp'] + n_df['february_avg_temp'] +
n_df['march_avg_temp'] +
n_df['april_avg_temp'] + n_df['october_avg_temp'] +
n_df['november_avg_temp'] + n_df['december_avg_temp'])/7
n_df['Avg_min_temp_summer'] = (n_df['may_min_temp'] + n_df['june_min_temp'] +
n_df['july_min_temp'] + n_df['august_min_temp'] + n_df['september_min_temp'])/5
n_df['Avg_max_temp_summer'] = (n_df['may_max_temp'] + n_df['june_max_temp'] +
n_df['july_max_temp'] + n_df['august_max_temp'] + n_df['september_max_temp'])/5
n_df['Avg_temp_summer'] = (n_df['may_avg_temp'] + n_df['june_avg_temp'] + n_df['july_avg_temp']
+ n_df['august_avg_temp'] + n_df['september_avg_temp'])/5
n_df['Avg_days_below30F'] = (n_df['days_below_30F'] + n_df['days_below_20F'] +
n_df['days_below_10F'] + n_df['days_below_0F'])/4
# columns to remove from the dataset
months_cols = list(n_df.iloc[:,5:41].columns) + ['days_below_30F','days_below_20F',
'days_below_10F','days_below_0F','direction_max_wind_speed',
'direction_peak_wind_speed','snowdepth_inches','avg_temp','days_above_90F']
xdf = n_df.drop(columns=months_cols, axis=1)
xdf.info()
# final dataframe of 18 features to continue with modeling and evaluation
zdf = xdf.drop(columns=['State_Factor','building_class'], axis=1)
The output of the above code
In the above code, we performed feature engineering using aggregation and then removed the unnecessary columns from the dataframe. Finally, we stored all the remaining features in a variable called ‘zdf’ to apply machine learning modeling techniques to our regression task.
Modeling and Evaluation
In this section, we will apply regression modeling techniques like XGBoost and Random Forest, followed by hyperparameter tuning to build accurate regression models and predict Site EUI to optimize the energy usage of the building.
# zdf and target are the final dataset for the time being for baseline modeling
X_train, X_val, y_train, y_val = train_test_split(zdf, target,
test_size=0.3, random_state=42)
print(X_train.shape, X_val.shape, y_train.shape, y_val.shape)
--------------------------------[output]--------------------------------------
(53029, 18) (22728, 18) (53029,) (22728,)
In the above code, we split the data into training and validation sets for modeling and evaluation purposes. Now we will write a modeling function to train models using the XGBoost and Random Forest algorithms.
# modelling function for xgb and rf
def modeling(X_train, y_train, X_test, y_test, **kwargs):
    scores = {}
    models = []
    if 'xgb' in kwargs.keys() and kwargs['xgb']:
        xgb = XGBRegressor()
        xgb.fit(X_train._get_numeric_data(), np.ravel(y_train, order="C"))
        y_pred = xgb.predict(X_test._get_numeric_data())
        scores['xgb'] = [np.sqrt(mean_squared_error(y_test, y_pred))]
        models.append(xgb)
    if 'rf' in kwargs.keys() and kwargs['rf']:
        rf = RandomForestRegressor(n_estimators=200, max_depth=8)
        rf.fit(X_train, y_train)
        y_pred = rf.predict(X_test)
        scores['rf'] = [np.sqrt(mean_squared_error(y_test, y_pred))]
        models.append(rf)
    return scores
# call the function using train and val datasets
modeling(X_train, y_train, X_val, y_val, xgb=True, rf=True)
------------------------------[Output]-----------------------------------
{'xgb': [48.27921059144114], 'rf': [49.49613354383458]}
By training the models with the default hyperparameters, we get an RMSE of 48.27 for XGBoost and 49.49 for the Random Forest model. We will further optimize the model using hyperparameter tuning techniques.
Hyperparameter Tuning
gkf = KFold(n_splits=3, shuffle=True, random_state=42).split(X=X_train, y=y_train)
# A parameter grid for random forest
params = {
'n_estimators': range(100, 200, 100),
'ccp_alpha': [0.0, 0.1],
'criterion': ['squared_error'],
'max_depth': [5,6],
'min_samples_split': [2,3,5],
}
rf_estimator = RandomForestRegressor()
gsearch = GridSearchCV(
estimator= rf_estimator,
param_grid= params,
scoring='neg_mean_squared_error',
n_jobs=-1,
cv=gkf,
verbose=1,
)
rf_model = gsearch.fit(X=X_train, y=y_train)
(gsearch.best_params_, gsearch.best_score_)
--------------------------------[output]---------------------------------------
Fitting 3 folds for each of 12 candidates, totaling 36 fits
({'ccp_alpha': 0.1,
'criterion': 'squared_error',
'max_depth': 6,
'min_samples_split': 3,
'n_estimators': 100},
-2754.1303968681623)
In the above code, we used the grid search technique to search through the parameter grid and print the best parameters after fitting all the different combinations of parameters to the model.
Using grid search, we found the best parameters for the model. After retraining with optimal hyperparameters, the RMSE drops to 46.51 from the previous score of 48.27, which is a reasonable improvement. Now we can save the model using joblib for inference. We will also look at explainable AI to understand how the model arrives at a given prediction.
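The XGBoost tuning and retraining step itself is not shown here (the full code is in the GitHub repository linked at the end). As a hedged sketch, the tuned model referenced below as xgb_estimator could be produced and saved roughly along these lines; the hyperparameter values are illustrative placeholders, not the author’s exact ones:
# Hedged sketch: retrain XGBoost with tuned hyperparameters (illustrative values)
# and persist both models with joblib for later inference
xgb_estimator = XGBRegressor(n_estimators=500, max_depth=6, learning_rate=0.05,
                             subsample=0.8, colsample_bytree=0.8, random_state=42)
xgb_estimator.fit(X_train, y_train)
print(np.sqrt(mean_squared_error(y_val, xgb_estimator.predict(X_val))))  # validation RMSE

joblib.dump(xgb_estimator, 'xgb_model.pkl')
joblib.dump(rf_model.best_estimator_, 'rf_model.pkl')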
Model Explainability
Machine learning model explainability has recently become an important area, aiming to explain complex models and make them interpretable. In this section, we will use the SHAP library to explain our trained XGBoost model.
# initialize shap module
shap.initjs()
sample_set = X_val.sample(100)
# shap values
shap_values = shap.TreeExplainer(xgb_estimator).shap_values(sample_set)
Running the above line may print a deprecation warning: 'ntree_limit is deprecated, use `iteration_range` or model slicing instead.'
# plot the summary bar plot of SHAP values
shap.summary_plot(shap_values, sample_set, plot_type="bar")
Summary plot of the model
The above code outputs the summary plot, which tells us which features are the major contributors to predicting the target. In our case, we can see that ‘energy_star_rating’ and ‘floor_area’ are the top contributors, followed by the ‘snowfall_inches’ and ‘ELEVATION’ features. Similarly, we can plot a summary plot without bars to understand the impact of each feature in more detail.
# summary plot without bars (beeswarm of SHAP values)
shap.summary_plot(shap_values, sample_set)
The output of the above code
Now, let’s look at the Decision plot to understand the variation in output by the model to predict the target variable.
# plot the decision plot to understand the predictions
shap.decision_plot(shap.TreeExplainer(xgb_estimator).expected_value[0],
shap_values[50:100],
feature_names=sample_set.columns.tolist()
)
The output of the above code
For more information on SHAP values, I’d recommend visiting the official documentation, which you can find here.
Deploying Model on Streamlit Cloud
In this section, we will look at deploying the saved model in a cloud environment using the Streamlit library. If you are new to Streamlit, then I highly recommend going through my previous article published on Analytics Vidhya, a detailed guide on Streamlit; you can read it by visiting the link here.
Now, let’s get started with Streamlit App deployment.
1. Create a project repository structure as shown in the below image:
We need to create a project repository structure that enables web application deployment on Streamlit Cloud. In the above repo, we have the ‘app.py’ file, which is the Python script that will run on Streamlit’s managed cloud service. After deploying and running the script, it will automatically generate a web URL where you can access the app and make inferences.
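Here is a hedged sketch of what the inference script could look like (not the exact app.py from the repository; the model file name and column handling are assumptions). It loads the saved model and serves batch predictions from an uploaded CSV whose columns match the model’s training features; a real-time form would additionally collect all 18 features through input widgets:
# app.py -- minimal illustrative sketch of the Streamlit inference script
import joblib
import pandas as pd
import streamlit as st

st.title('Site EUI Prediction')

model = joblib.load('xgb_model.pkl')  # assumed name of the model saved earlier

# Batch prediction: upload a CSV whose columns match the model's training features
uploaded = st.file_uploader('Upload a CSV with the model input features', type='csv')
if uploaded is not None:
    batch = pd.read_csv(uploaded)
    batch['predicted_site_eui'] = model.predict(batch)
    st.dataframe(batch)
    st.download_button('Download predictions', batch.to_csv(index=False),
                       file_name='predictions.csv')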
2. Log in to your streamlit account to push the repo to the cloud:
As shown in the above image, you must specify the repository where your project files live, the branch name, and the main file path (i.e., the Python file name). Finally, just click the ‘Deploy!’ button, and your web application will be live in a few minutes.
It’s that simple to deploy apps using streamlit, and for more information on the project code files, you can refer to the GitHub repository — here.
Here are some of the website screenshots from the deployed app.
Github repository link for detailed code: Click here
Conclusion
In conclusion, this article covered the end-to-end project lifecycle for predicting site energy intensity, which helps control the energy usage of buildings and implement energy efficiency measures that cut costs and energy usage while helping to preserve the environment at large. Let’s summarise the lessons learned from this article.
- We learned how project scoping works and reviewed the dataset description before starting the project.
- We performed exploratory data analysis to find critical insights for data preparation and modeling.
- We applied feature engineering techniques like aggregation and data pre-processing techniques like handling missing values.
- We carried out machine learning modeling and evaluation for a regression task.
- We used model explainability to make the model more interpretable and finally deployed it on the cloud using a low-code tool like Streamlit.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
By Analytics Vidhya, April 19, 2023.