Introduction
Weather drives a huge share of what happens in the real world. It is so influential that almost any machine learning forecasting model benefits from incorporating it as an input.
Think about the following scenarios:
- A public transport agency tries to forecast delays and congestion in the system
- An energy provider would like to estimate the amount of solar electricity generation tomorrow for the purpose of energy trading
- Event organizers need to anticipate the number of attendees in order to ensure safety standards are met
- A farm needs to schedule the harvesting operations for the upcoming week
It is fair to say that in any of the scenarios above, a model that ignores weather is either useless or, at best, not as good as it could be.
Surprisingly, while there are a lot of online resources focusing on how to forecast weather itself, there’s virtually nothing that shows how to obtain & use weather data effectively as a feature, i.e. as an input to predict something else. This is what this post is about.
Overview
First we’ll highlight the challenges associated with using weather data for modelling, which models are commonly used, and what providers are out there. Then we’ll run a case study and use data from one of the providers to build a machine learning model that forecasts taxi rides in New York.
At the end of this post you will have learned about:
- Challenges around using weather data for modelling
- Which weather models and providers exist
- Typical ETL & feature building steps for time series data
- Evaluation of feature importances using SHAP values
Challenges
Measured vs. Forecasted Weather
For an ML model in production we need both (1) live data to produce predictions in real time and (2) a body of historical data to train a model capable of producing them.
Obviously, when making live predictions, we will use the current weather forecast as an input, as it is the most up-to-date estimate of what is going to happen. For instance, when predicting how much solar energy will be produced tomorrow, the model input we need is what the forecast says about tomorrow's weather.
What about Model Training?
If we want the model to perform well in the real world, training data needs to reflect live data. For model training, there is a choice to be made between historical measurements and historical forecasts. Historical measurements reflect only the outcome, i.e. what weather stations actually recorded. The live model, however, will consume forecasts, not measurements, since measurements are not yet available at the time the model makes its prediction.
If there is a chance to obtain historical forecasts, they should always be preferred as this trains the model under the exact same conditions as are available at the time of live predictions.
Consider this example: whenever it is very cloudy, a solar farm produces little electricity. A model trained on historical measurements will learn that a high value of the cloud-coverage feature means, with near certainty, that little electricity will be produced. A model trained on historical forecasts, on the other hand, will learn that there is another dimension to this: forecast distance. When predicting several days ahead, a high cloud-coverage value is only an estimate and does not guarantee that the day in question will actually be cloudy. In such cases the model will learn to rely on this feature only partially and to weigh other features as well when predicting solar generation.
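To make this concrete, here is a purely illustrative sketch (all column names are hypothetical, not taken from any specific provider) of how a training set built from historical forecasts keeps the forecast distance as an explicit feature:
import pandas as pd

# Each row is one forecast for one target hour; the forecast distance
# (hours between issue time and target time) is kept as a feature.
train = pd.DataFrame({
    "target_moment": pd.to_datetime(["2023-06-01 12:00"] * 3),
    "forecast_distance": [6, 48, 120],        # hours ahead at which the forecast was issued
    "cloud_cover": [0.9, 0.9, 0.9],           # same forecasted value, very different certainty
    "solar_generation_mwh": [1.2, 1.2, 1.2],  # observed outcome used as the label
})
A model trained on rows like these can learn that a cloud-cover value forecast 120 hours out is a much weaker signal than the same value forecast only 6 hours out.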
Format
Not all weather data is created equal. Many factors can rule out a given dataset as even remotely useful for your problem. The main ones are:
- Granularity: are there records for every hour, every 3 hours, daily?
- Variables: does it include the feature(s) I need?
- Spatial Resolution: how many km² does one record refer to?
- Horizon: how far out does the forecast go?
- Forecast Updates: how often is a new forecast created?
Additionally, the shape or format of the data can be cumbersome to work with. Every extra ETL step you have to build is another opportunity to introduce bugs, and the time-dependent nature of the data can make this work quite frustrating.
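For example, if a provider delivers 3-hourly forecasts but your target series is hourly, a small resampling step is usually needed. Here is a minimal sketch with hypothetical column names, using linear interpolation as one possible choice:
import pandas as pd

# Hypothetical 3-hourly forecast data
weather_3h = pd.DataFrame({
    "timestamp": pd.date_range("2023-06-01", periods=4, freq="3H", tz="UTC"),
    "temperature_at_2m": [18.0, 21.0, 24.0, 22.0],
})

# Upsample to hourly granularity; interpolate continuous variables linearly
# (categorical variables would typically be forward-filled instead).
weather_1h = (
    weather_3h.set_index("timestamp")
    .resample("1H")
    .interpolate(method="linear")
    .reset_index()
)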
Live vs. Old Data
Data older than a day or a week often comes in the form of CSV dumps, FTP servers, or at best a separate API endpoint, frequently with different fields than the live forecast endpoint. This creates a risk of mismatched data and can blow up the complexity of your ETL.
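As an illustration (with entirely hypothetical field names), harmonising a historical dump with a live endpoint often boils down to a renaming and selection step that is easy to get subtly wrong:
import pandas as pd

# Hypothetical historical dump with its own field names
# (in reality this is often a CSV or FTP download)
hist = pd.DataFrame({
    "valid_time": ["2023-06-01 00:00", "2023-06-01 03:00"],
    "t2m_degC": [18.0, 21.0],
})

# Rename and select columns so the historical data matches the live endpoint's schema
hist = hist.rename(columns={"valid_time": "timestamp", "t2m_degC": "temperature_at_2m"})
hist = hist[["timestamp", "temperature_at_2m"]]
hist["timestamp"] = pd.to_datetime(hist["timestamp"], utc=True)

# From here on, both sources share one schema and can feed the same pipeline.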
Costs
Costs vary enormously depending on the provider and on which types of weather data are required. For instance, some providers charge per coordinate, which becomes a problem when many locations are needed. Obtaining historical weather forecasts in particular is generally difficult and costly.
Weather Models
Numerical weather prediction models, as they are often called, simulate the physical behavior of all the different aspects of weather. There are plenty of them, varying in format (see above), the parts of the globe they cover, and accuracy.
Here’s a quick list of the most widely used weather models:
- GFS: the best-known standard model, widely used, global
- CFS: less accurate than GFS, for long-term climate forecasts, global
- ECMWF: most accurate but expensive model, global
- UM: most accurate model for the UK, a global version is also available
- WRF: open source code to produce DIY regional weather forecasts
Providers
Providers bring the data from weather models to the end user. Often they also run their own proprietary forecasting models on top of the standard weather models. Here are some well-known providers:
- AccuWeather
- MetOffice
- OpenWeatherMap
- AerisWeather
- DWD (Germany)
- Meteogroup (UK)
BlueSky API
For the machine learning use case, the providers mentioned above either do not offer historical forecasts at all, or getting and combining the data is cumbersome and expensive. In contrast, blueskyapi.io offers a simple API that returns both live and historical forecasts in the same format, which makes the data pipelining very straightforward. The underlying data comes from GFS, the most widely used weather model.
Case Study: New York Taxi Rides
Imagine you own a taxi business in NYC and want to forecast the number of taxi rides in order to optimize your staff and fleet planning. Since you have access to NYC's historical combined taxi data, you decide to make use of it and build a machine learning model.
We’ll use data that can be downloaded from the NYC website here.
First some imports:
import pandas as pd
import numpy as np
import holidays
import datetime
import pytz
from dateutil.relativedelta import relativedelta
from matplotlib import pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_percentage_error
import shap
import pyarrow
Preprocessing Taxi Data
timezone = pytz.timezone("US/Eastern")
dates = pd.date_range("2022-04", "2023-03", freq="MS", tz=timezone)
To get our taxi dataset, we need to loop through the files and create an aggregated dataframe with counts per hour. This will take about 20s to complete.
aggregated_dfs = []
for date in dates:
    print(date)
    df = pd.read_parquet(
        f"./data/yellow_tripdata_{date.strftime('%Y-%m')}.parquet",
        engine="pyarrow",
    )
    df["timestamp"] = pd.DatetimeIndex(
        df["tpep_pickup_datetime"], tz=timezone, ambiguous="NaT"
    ).floor("H")
    # data cleaning: the raw files sometimes include wrong timestamps
    df = df[
        (df.timestamp >= date) &
        (df.timestamp < date + relativedelta(months=1))
    ]
    aggregated_dfs.append(
        df.groupby(["timestamp"]).agg({"trip_distance": "count"}).reset_index()
    )

df = pd.concat(aggregated_dfs).reset_index(drop=True)
df.columns = ["timestamp", "count"]
Let’s have a look at the data. First 2 days:
df.head(48).plot("timestamp", "count")
Everything:
fig, ax = plt.subplots()
fig.set_size_inches(20, 8)
ax.plot(df.timestamp, df["count"])
ax.xaxis.set_major_locator(plt.MaxNLocator(10))
Interestingly, we can see that around some of the holidays the number of taxi rides drops noticeably. From a time series perspective there is no obvious trend or heteroscedasticity in the data.
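If you want to check this a bit more formally, the seasonal_decompose import from above can be put to work. A minimal sketch, assuming the hourly df from the previous step and a dominant weekly cycle, might look like this:
# Decompose the hourly series into trend, weekly seasonality, and residual.
# period=24*7 assumes one dominant weekly cycle; other choices are possible.
series = (
    df.dropna(subset=["timestamp"])
    .set_index("timestamp")["count"]
    .asfreq("H")
    .interpolate()
)
decomposition = seasonal_decompose(series, model="additive", period=24 * 7)
decomposition.plot()
plt.show()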
Feature Engineering Taxi Data
Next, we’ll add a couple of typical features used in time series forecasting.
Encode timestamp pieces
df["hour"] = df["timestamp"].dt.hour
df["day_of_week"] = df["timestamp"].dt.day_of_week
Encode holidays
us_holidays = holidays.UnitedStates()
df["date"] = df["timestamp"].dt.date
df["holiday_today"] = [ind in us_holidays for ind in df.date]
df["holiday_tomorrow"] = [ind + datetime.timedelta(days=1) in us_holidays for ind in df.date]
df["holiday_yesterday"] = [ind - datetime.timedelta(days=1) in us_holidays for ind in df.date]
BlueSky Weather Data
Now we come to the interesting bit: the weather data. Below is a walkthrough on how to use the BlueSky weather API. For Python users, it is available via pip:
pip install blueskyapi
However, it is also possible to just use cURL.
BlueSky's basic API is free. It is recommended to get an API key via the website, as this raises the amount of data that can be pulled from the API.
With the paid subscription you can obtain additional weather variables, more frequent forecast updates, better granularity, and so on, but for this case study that is not needed.
import blueskyapi
client = blueskyapi.Client() # use API key here to boost data limit
We need to pick the location, forecast distances, and weather variables of interest. Let’s get a full year worth of weather forecasts to match the taxi data.
# New York
lat = 40.5
lon = -74.0  # NYC is at roughly 74°W; if the API expects 0-360 longitudes, use 286.0 instead
weather = client.forecast_history(
lat=lat,
lon=lon,
min_forecast_moment="2022-04-01T00:00:00+00:00",
max_forecast_moment="2023-04-01T00:00:00+00:00",
forecast_distances=[3,6], # hours ahead
columns=[
'precipitation_rate_at_surface',
'apparent_temperature_at_2m',
'temperature_at_2m',
'total_cloud_cover_at_convective_cloud_layer',
'wind_speed_gust_at_surface',
'categorical_rain_at_surface',
'categorical_snow_at_surface'
],
)
weather.iloc[0]
That's all we had to do to obtain the weather data!
Join Data
We need to ensure the weather data gets mapped correctly to the taxi data. For that we need the target moment a weather forecast was made for. We get this by adding forecast_moment + forecast_distance:
weather["target_moment"] = weather.forecast_moment + pd.to_timedelta(
weather.forecast_distance, unit="h"
)
A typical issue when joining data is the data type and timezone awareness of the timestamps. Let’s match up the timezones to ensure we join them correctly.
df["timestamp"] = [timezone.normalize(ts).astimezone(pytz.utc) for ts in df["timestamp"]]
weather["target_moment"] = weather["target_moment"].dt.tz_localize('UTC')
As a last step, we join to each timestamp in the taxi data the weather forecast whose target moment is nearest to it:
d = pd.merge_asof(df, weather, left_on="timestamp", right_on="target_moment", direction="nearest")
d.iloc[0]
Our dataset is complete!
Model
Before modelling it usually makes sense to check a couple more things, such as whether the target variable is stationary and whether there is any missingness or anomalies in the data; a minimal sketch of such checks follows below. For the sake of this blog post, though, we will keep things simple and fit an out-of-the-box random forest model on the features we extracted and created.
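Purely for illustration, such checks on the joined dataframe d could look like this (the augmented Dickey-Fuller test from statsmodels tests for a unit root, i.e. non-stationarity):
from statsmodels.tsa.stattools import adfuller

# How much missingness is there per column?
print(d.isnull().sum())

# ADF test on the target: a small p-value suggests the series is stationary.
adf_stat, p_value, *_ = adfuller(d["count"].dropna())
print(f"ADF statistic: {adf_stat:.2f}, p-value: {p_value:.4f}")
With that quick look done, we drop incomplete rows and assemble the feature matrix and target: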
d = d[~d.isnull().any(axis=1)].reset_index(drop=True)
X = d[
[
"day_of_week",
"hour",
"holiday_today",
"holiday_tomorrow",
"holiday_yesterday",
"precipitation_rate_at_surface",
"apparent_temperature_at_2m",
"temperature_at_2m",
"total_cloud_cover_at_convective_cloud_layer",
"wind_speed_gust_at_surface",
"categorical_rain_at_surface",
"categorical_snow_at_surface"
]
]
y = d["count"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, shuffle=False  # shuffle=False keeps the time order intact
)
rf = RandomForestRegressor()
rf.fit(X_train, y_train)
pred_train = rf.predict(X_train)
plt.figure(figsize=(50,8))
plt.plot(y_train)
plt.plot(pred_train)
plt.show()
pred_test = rf.predict(X_test)
plt.figure(figsize=(50,8))
plt.plot(y_test.reset_index(drop=True))
plt.plot(pred_test)
plt.show()
As expected, some accuracy is lost on the test set compared to the training set. This could be improved, but overall the predictions look reasonable, albeit often conservative for the very high values.
print("MAPE is", round(mean_absolute_percentage_error(y_test,pred_test) * 100, 2), "%")
MAPE is 17.16 %
Model Without Weather
To confirm that adding weather data improved the model, let’s compare it with a benchmark model that is fitted on everything but the weather data:
X = d[
[
"day_of_week",
"hour",
"holiday_today",
"holiday_tomorrow",
"holiday_yesterday"
]
]
y = d["count"]
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.33, random_state=42, shuffle=False
)
rf0 = RandomForestRegressor(random_state=42)
rf0.fit(X_train, y_train)
pred_train = rf0.predict(X_train)
pred_test = rf0.predict(X_test)
print("MAPE is", round(mean_absolute_percentage_error(y_test,pred_test) * 100, 2), "%")
MAPE is 17.76 %
Adding weather data improved the taxi ride forecast MAPE by 0.6 percentage points (from 17.76% down to 17.16%). While this may not seem like much, depending on the business such an improvement can have a significant operational impact.
Feature Importance
Besides the metrics, let's have a look at the feature importances. We are going to use the SHAP package, which uses SHAP values to explain the individual, marginal contribution of each feature to the model, i.e. how much each feature contributes on top of all the other features.
explainer = shap.Explainer(rf)
shap_values = explainer(X_test)
This will take a couple of minutes, as it runs plenty of "what if" scenarios across the features: how would the prediction change if a given feature value were different or missing?
shap.plots.beeswarm(shap_values)
We can see that by far the most important explanatory variables are the hour of the day and the day of the week. This makes perfect sense: taxi ride counts are highly cyclical, with demand varying a lot over the day and the week. Some of the weather data turned out to be useful as well. When it is cold, there are more cab rides; to some degree, though, temperature may also act as a proxy for general yearly seasonality in taxi demand. Another important feature is the wind gust speed, with fewer cabs being used when gusts are stronger. A hypothesis here could be that there is less traffic overall during stormy weather.
Further Model Improvements
- Consider creating more features from existing data, for instance lagging the target variable by a day or a week (see the sketch after this list).
- Frequent retraining of the model will make sure trends are always captured. This will have a big impact when using the model in the real world.
- Consider adding more external data, such as NY traffic & congestion data.
- Consider other timeseries models and tools such as Facebook Prophet.
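As an illustration of the first point, lag features on the joined dataframe d could be created along these lines (the fixed shifts assume a complete hourly series; gaps would need to be handled first):
# Lag the target by one day and one week (24 and 168 hourly steps).
d = d.sort_values("timestamp").reset_index(drop=True)
d["count_lag_1d"] = d["count"].shift(24)
d["count_lag_1w"] = d["count"].shift(24 * 7)
# The first rows now contain NaNs for the lags and would be dropped before training.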
Conclusion
That's it! You have created a simple weather-aware model that can be used in practice.
In this article we discussed the importance of weather data in forecasting models across various sectors, the challenges associated with using it effectively, and the available numerical weather prediction models and providers, highlighting BlueSky API as a cost-effective and efficient way to obtain both live and historical forecasts. Through a case study on forecasting New York taxi rides, this article provided a hands-on demonstration of using weather data in machine learning, teaching you all the basic skills you need to get started:
- Typical ETL & feature building steps for time series data
- Weather data ETL and feature building via BlueSky API
- Fitting and evaluating a simple random forest model for timeseries
- Evaluation of feature importances using shap values
Key Takeaways
- While weather data can be extremely complex to integrate into existing machine learning models, modern weather data services such as BlueSky API greatly reduce the workload.
- The integration of BlueSky’s weather data into the model enhanced predictive accuracy in the New York taxi case study, highlighting that weather plays a visible practical role in daily operations.
- Many sectors, such as retail, agriculture, energy, and transport, benefit in similar or greater ways and therefore need good weather forecast integrations to improve their own forecasting, operational efficiency, and resource allocation.
Frequently Asked Questions
Q1. How can weather data be used in time series forecasting?
A. Weather data can be incorporated into time series forecasting models as a set of external variables or covariates, also called features, to forecast some other time-dependent target variable. Unlike many other features, weather data is both conceptually and practically more complicated to add to such a model. The article explains how to do this correctly.
Q2. What should I consider when choosing weather data for my model?
A. It is important to consider aspects such as accuracy, granularity, forecast horizon, forecast updates, and the relevance of the weather data. You should ensure it is reliable and corresponds to the location of interest. Also, not all weather variables may be impactful to your operations, so feature selection is crucial to avoid overfitting and enhance model performance.
Q3. How can integrating weather data improve business operations?
A. There are many possible reasons. For instance, by integrating weather data, businesses can anticipate fluctuations in demand or supply caused by weather changes and adjust accordingly. This can help optimize resource allocation, reduce waste, and improve customer service by preparing for expected changes.
Q4. Why use machine learning to capture the impact of weather?
A. Machine learning algorithms can automatically identify patterns in historical data, including subtle relationships between weather changes and operational metrics. They can handle large volumes of data, accommodate multiple variables, and improve over time as they are exposed to more data.
By Analytics Vidhya, July 25, 2023.