Introduction
Have you ever wished for a system that predicts the efficiency of electric vehicles and that anyone can use easily? With machine learning, we can estimate an electric vehicle's efficiency from its specifications, and thanks to ZenML and MLflow, turning that idea into a working system is now practical. In this project, we take a technical deep dive into how data science, machine learning, and MLOps combine to build such a system, and you will see how we use ZenML for electric vehicles.
Learning Objectives
In this article, we will:
- Learn what ZenML is and how to use it in an end-to-end machine-learning pipeline.
- Understand the role of MLflow in creating an experiment tracker for machine learning models.
- Explore the deployment process for machine learning models and how to set up a prediction service.
- Discover how to create a user-friendly Streamlit app for interacting with machine learning model predictions.
This article was published as a part of the Data Science Blogathon.
Understanding Electric Vehicle Efficiency
- Electric vehicle (EV) efficiency refers to how efficiently an EV can convert the electrical energy from its battery into a driving range. It is typically measured in miles per kWh (kilowatt hour).
- Factors like motor and battery efficiency, weight, aerodynamics, and auxiliary loads impact EV efficiency. So it’s clear that if we optimize those areas, we can improve our EV efficiency. For consumers, choosing an EV with higher efficiency results in a better driving experience.
- In this project, we will build an end-to-end machine-learning pipeline to predict electric vehicle efficiency using real-world EV data. Predicting efficiency accurately can guide EV manufacturers in optimizing designs.
- We will use ZenML, an MLOps framework, to automate the workflow for training, evaluating, and deploying machine learning models. ZenML provides capabilities for metadata tracking, artifact management, and model reproducibility across stages of the ML lifecycle.
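To make the miles-per-kWh definition above concrete, here is a minimal Python sketch. The function name and the sample numbers are illustrative assumptions, not values taken from the project's dataset.

```python
def miles_per_kwh(range_miles: float, energy_used_kwh: float) -> float:
    """Return driving efficiency as miles travelled per kWh consumed."""
    if energy_used_kwh <= 0:
        raise ValueError("energy_used_kwh must be positive")
    return range_miles / energy_used_kwh


# Hypothetical example: 240 miles driven on 60 kWh of battery energy.
efficiency = miles_per_kwh(240.0, 60.0)
print(f"{efficiency:.1f} miles per kWh")  # prints "4.0 miles per kWh"
```

A higher value means the vehicle travels farther on the same amount of energy.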
Data Collection
For this project, we collect the data from Kaggle, an online platform offering many datasets for data science and machine learning projects. You can use data from any other source as well. This dataset is what we train our prediction model on. All the files and templates are in my GitHub repository: https://github.com/Dhrubaraj-Roy/Predicting-Electric-Vehicle-Efficiency.git
Problem Statement
Efficient electric vehicles are the future, but predicting their range accurately is very difficult.
Solution
Our project combines data science and MLOps to create a precise model for forecasting electric vehicle efficiency, benefiting consumers and manufacturers.
Set Up a Virtual Environment
Why do we want to set up a Virtual Environment?
It isolates our project's dependencies so that they do not conflict with other projects on our system.
Creating a Virtual Environment
# On Windows
python -m venv myenv
# then for activation
myenv\Scripts\activate
# On Linux/macOS
python3 -m venv myenv
# then for activation
source myenv/bin/activate
It helps keep our environment clean.
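If you are unsure whether the environment is actually active, a quick check from Python can help. This is a small sketch using only the standard library; the function name is an illustrative assumption.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("Virtual environment active:", in_virtual_environment())
```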
Working on the Project
With our environment ready, we need to install ZenML. What is ZenML? It is a machine learning operations (MLOps) framework for managing end-to-end machine learning pipelines, and we chose it because it manages those pipelines efficiently. To use it, you need to install ZenML together with its server.
Use this command in your terminal to install the ZenML server:
pip install "zenml[server]"
This is not the end; after installing the ZenML server, we need to create a ZenML repository with the following command:
zenml init
Why We Use `zenml init`: `zenml init` is used to initialize a ZenML repository, creating the structure necessary to manage machine learning pipelines and experiments effectively.
Requirements Installation
To satisfy the project dependencies, we use a ‘requirements.txt’ file. It should list the following dependencies:
catboost==1.0.4
joblib==1.1.0
lightgbm==3.3.2
optuna==2.10.0
streamlit==1.8.1
xgboost==1.5.2
markupsafe==1.1.1
zenml==0.35.1
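Because every dependency above is pinned with `==`, the file is easy to read programmatically, for example to compare the pins with what is installed. The helper below is a hypothetical sketch, not part of the project, and it only understands exact `name==version` pins.

```python
def parse_pins(requirements_text: str) -> dict:
    """Map each package name to its pinned version from 'name==version' lines."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, sep, version = line.partition("==")
        if sep:  # only keep exact pins
            pins[name.strip().lower()] = version.strip()
    return pins


requirements = """\
catboost==1.0.4
joblib==1.1.0
lightgbm==3.3.2
optuna==2.10.0
streamlit==1.8.1
xgboost==1.5.2
markupsafe==1.1.1
zenml==0.35.1
"""

pins = parse_pins(requirements)
print(pins["zenml"])  # prints "0.35.1"
```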
Organizing the Project
When working on a data science project, we should organize everything properly. Let me break down how we keep things structured in our project:
Creating Folders
We organize our project into folders. There are some folders we need to create.
- Model Folder: First, we need to create a model folder. It contains the core files for our machine-learning models: ‘data_cleaning.py’, ‘evaluation.py’, and ‘model_dev.py’. These files are like different tools that help us throughout the project.
- Steps Folder: This folder serves as the control center for our project. It holds one file for each stage of our data science process. ‘ingest_data.py’ handles data input, just like gathering materials for your project. ‘clean_data.py’ is the part of the project where you clean and prepare the materials for the main job. ‘model_train.py’ is where we train our machine learning model, like shaping your materials into the final product. ‘evaluation.py’ evaluates our model, where we check how well the final product performs.
Pipelines Folder
This is where we assemble our pipeline, similar to setting up a production line for your project. Inside the ‘pipelines’ folder, ‘training_pipeline.py’ acts as the primary production machine. In this file, we import ‘ingest_df’ from ‘ingest_data.py’ together with the other steps, so that the pipeline can ingest the data, clean it up, train the model, and evaluate its performance. To run the entire project, use ‘run_pipeline.py’, which is like pushing the start button on your production line:
python run_pipeline.py
Here, you can see the file structure of the project-
This structure helps us to run our project smoothly, just like a well-structured workspace helps you create a project effectively.
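The layout described above can be sketched in a few lines of Python. The folder and file names below are an assumption pieced together from the names mentioned in this article (including ‘config.py’, which is introduced later); the repository itself may differ in casing or contain more files.

```python
import tempfile
from pathlib import Path

# Assumed layout, pieced together from the file names mentioned in this article.
LAYOUT = {
    "model": ["data_cleaning.py", "evaluation.py", "model_dev.py"],
    "steps": ["ingest_data.py", "clean_data.py", "model_train.py",
              "evaluation.py", "config.py"],
    "pipelines": ["training_pipeline.py"],
}
TOP_LEVEL_FILES = ["run_pipeline.py", "requirements.txt"]


def create_skeleton(root: Path) -> list:
    """Create empty folders and files under root and return their relative paths."""
    created = []
    for folder, files in LAYOUT.items():
        (root / folder).mkdir(parents=True, exist_ok=True)
        for name in files:
            path = root / folder / name
            path.touch()
            created.append(path.relative_to(root).as_posix())
    for name in TOP_LEVEL_FILES:
        (root / name).touch()
        created.append(name)
    return sorted(created)


# Build the skeleton in a throwaway directory and print the resulting tree.
with tempfile.TemporaryDirectory() as tmp:
    paths = create_skeleton(Path(tmp))
    for p in paths:
        print(p)
```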
Setting up Pipeline
After organizing the project and configuring the pipeline, the next step is to execute it. Now, you might have a question: what is a pipeline? A pipeline is a set of automated steps that streamline the deployment, monitoring, and management of machine learning models from development to production. Running ‘run_pipeline.py’ acts as the power switch for your production line: it ensures that all defined steps in your data science project are executed in the correct sequence, from data ingestion and cleaning to model training and evaluation. The ‘zenml up’ command then starts the local ZenML dashboard, where you can inspect the pipeline and its runs.
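The idea of a pipeline as an ordered chain of steps can be illustrated without any framework. The sketch below is plain Python, not the ZenML API; the step functions and the sample records are hypothetical, and a real ZenML pipeline adds features such as artifact and metadata tracking on top of this basic idea.

```python
from typing import Callable, List


def ingest() -> List[dict]:
    """Stand-in for data ingestion: returns a few hypothetical EV records."""
    return [
        {"range_km": 400, "battery_kwh": 80.0},
        {"range_km": 250, "battery_kwh": None},  # incomplete record
        {"range_km": 300, "battery_kwh": 50.0},
    ]


def clean(rows: List[dict]) -> List[dict]:
    """Drop records with missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]


def train(rows: List[dict]) -> float:
    """Toy 'model': the average km per kWh across the cleaned records."""
    return sum(r["range_km"] / r["battery_kwh"] for r in rows) / len(rows)


def run_pipeline(first_step: Callable, *later_steps: Callable):
    """Run the steps in order, feeding each output into the next step."""
    result = first_step()
    for step in later_steps:
        result = step(result)
    return result


model = run_pipeline(ingest, clean, train)
print(model)  # prints "5.5"
```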
Data Cleaning
In the ‘model’ folder, you’ll find a file called ‘data_cleaning.py’, which is responsible for data cleaning. Within this file, you’ll discover: Column Cleanup: A section dedicated to identifying and removing unnecessary columns from the dataset, making it more orderly and easier to work with. DataDivideStrategy Class: This class defines how we divide our data into training and testing sets. It’s like planning how to arrange your materials for your project.
class DataDivideStrategy(DataStrategy):
    """
    Data dividing strategy which divides the data into train and test data.
    """
    def handle_data(self, data: pd.DataFrame) -> Union[pd.DataFrame, pd.Series]:
        """
        Divides the data into train and test data.
        """
        try:
            # Assuming "Efficiency" is your target variable
            # Separating the features (X) and the target (y) from the dataset
            X = data.drop("Efficiency", axis=1)
            y = data["Efficiency"]
            # Splitting the data into training and testing sets with a 80-20 split
            X_train, X_test, y_train, y_test = train_test_split(
                X, y, test_size=0.2, random_state=42
            )
            # Returning the divided datasets
            return X_train, X_test, y_train, y_test
        except Exception as e:
            # Logging an error message (including the exception) if anything fails
            logging.error("Error in dividing the data into train and test data: {}".format(e))
            raise e
- It takes a dataset and separates it into training and testing data (80-20 split), returning the divided datasets. If any errors occur during this process, it logs an error message.
- DataCleaning Class: The ‘DataCleaning’ class is a set of rules and methods to ensure our data is in the best shape possible. handle_data Method: This method is like a versatile tool that applies whichever strategy we pass in, ensuring the data is ready for the next steps in our project.
- DataPreProcessStrategy Class: This is the main cleaning class, where the actual cleaning of our data happens.
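The classes above follow the strategy pattern: ‘DataCleaning’ holds the data and delegates the work to whichever strategy it receives. Here is a minimal, dependency-free sketch of that pattern. The two strategy classes are hypothetical simplifications, and the split is a simple ordered 80-20 cut rather than the shuffled `train_test_split` used in the project.

```python
from abc import ABC, abstractmethod
from typing import List, Tuple


class DataStrategy(ABC):
    """Interface shared by every data-handling strategy."""

    @abstractmethod
    def handle_data(self, data: List[dict]):
        ...


class DropMissingStrategy(DataStrategy):
    """Hypothetical cleaning strategy: remove rows that contain None."""

    def handle_data(self, data: List[dict]) -> List[dict]:
        return [row for row in data if None not in row.values()]


class DivideStrategy(DataStrategy):
    """Hypothetical dividing strategy: first 80% for training, rest for testing."""

    def handle_data(self, data: List[dict]) -> Tuple[List[dict], List[dict]]:
        cut = int(len(data) * 0.8)
        return data[:cut], data[cut:]


class DataCleaning:
    """Applies whichever strategy it is given to the data."""

    def __init__(self, data: List[dict], strategy: DataStrategy) -> None:
        self.data = data
        self.strategy = strategy

    def handle_data(self):
        return self.strategy.handle_data(self.data)


rows = [{"Efficiency": float(i)} for i in range(9)] + [{"Efficiency": None}]
cleaned = DataCleaning(rows, DropMissingStrategy()).handle_data()
train_rows, test_rows = DataCleaning(cleaned, DivideStrategy()).handle_data()
print(len(train_rows), len(test_rows))  # prints "7 2"
```

Adding a new way to process the data then only requires a new strategy class; ‘DataCleaning’ itself stays unchanged.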
Now, we move on to the ‘Steps’ folder. Inside, there’s a file called ‘clean_data.py.’ This file is dedicated to data cleaning. Here’s what happens here:
- We import ‘DataCleaning’, ‘DataDivideStrategy’, and ‘DataPreProcessStrategy’ from ‘data_cleaning’. This is like getting the right tools and materials from your toolbox to continue working on your project effectively.
import logging
from typing import Tuple
import pandas as pd
from model.data_cleaning import DataCleaning, DataDivideStrategy, DataPreProcessStrategy
from zenml import step
from typing_extensions import Annotated
@step
def clean_df(data: pd.DataFrame) -> Tuple[
    Annotated[pd.DataFrame, 'X_train'],
    Annotated[pd.DataFrame, 'X_test'],
    Annotated[pd.Series, 'y_train'],
    Annotated[pd.Series, 'y_test'],
]:
    """
    Data cleaning step which preprocesses the data and divides it into train and test data.
    Args:
        data: pd.DataFrame
    """
    try:
        # First pass: clean the raw data
        preprocess_strategy = DataPreProcessStrategy()
        data_cleaning = DataCleaning(data, preprocess_strategy)
        preprocessed_data = data_cleaning.handle_data()
        # Second pass: split the cleaned data into train and test sets
        divide_strategy = DataDivideStrategy()
        data_cleaning = DataCleaning(preprocessed_data, divide_strategy)
        X_train, X_test, y_train, y_test = data_cleaning.handle_data()
        logging.info("Data Cleaning Complete")
        return X_train, X_test, y_train, y_test
    except Exception as e:
        logging.error(e)
        raise e
- First, it imports necessary libraries and modules, including logging, pandas, and various data-cleaning strategies.
- The @step decorator marks a function as a step in a machine-learning pipeline. This step takes a DataFrame, preprocesses it, and divides it into training and testing data.
- In that step, it uses data cleaning and division strategies, logging the process and returning the split data as specified data types. For example, our X_train and X_test are DataFrame, and y_test and y_train are Series.
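The `Annotated[...]` return types above attach a name to each output. The sketch below shows, with the standard library only, how such names can be read back from a function signature. It assumes Python 3.9 or newer (for `typing.Annotated`), the function is hypothetical, and it is not meant to describe how ZenML processes annotations internally.

```python
from typing import Annotated, List, Tuple, get_args, get_type_hints


def split_values(values: List[int]) -> Tuple[
    Annotated[List[int], "train"],
    Annotated[List[int], "test"],
]:
    """Hypothetical function with two named outputs (80-20 split)."""
    cut = int(len(values) * 0.8)
    return values[:cut], values[cut:]


def output_names(func) -> List[str]:
    """Read the name attached to each output through Annotated metadata."""
    return_hint = get_type_hints(func, include_extras=True)["return"]
    # get_args on Annotated[T, "name"] yields (T, "name"); keep the name.
    return [get_args(item)[1] for item in get_args(return_hint)]


print(output_names(split_values))  # prints "['train', 'test']"
```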
Create a Simple Linear Regression Model
Now, let’s talk about creating ‘model_dev.py’ in the model folder. In this file, we mostly work on building the machine learning model.
- Simple Linear Regression Model: In this file, we create a simple linear regression model. Our main goal is to focus on MLOps, not building a complex model. It’s like building a basic prototype of your MLOps project.
This structured approach ensures that we have a clean and organized data-cleaning process, and our model development follows a clear blueprint, keeping the focus on MLOps efficiency rather than building an intricate model. In the future, we will update our model.
import logging
from abc import ABC, abstractmethod
import pandas as pd
from sklearn.linear_model import LinearRegression
from typing import Dict
import optuna  # Import the optuna library
# Rest of your code...
class Model(ABC):
    """
    Abstract base class for all models.
    """
    @abstractmethod
    def train(self, X_train, y_train):
        """
        Trains the model on the given data.
        Args:
            X_train: Training data
            y_train: Target data
        """
        pass
class LinearRegressionModel(Model):
    """
    LinearRegressionModel that implements the Model interface.
    """
    def train(self, X_train, y_train, **kwargs):
        try:
            reg = LinearRegression(**kwargs)  # Create a Linear Regression model
            reg.fit(X_train, y_train)  # Fit the model to the training data
            # Log a message indicating training is complete
            logging.info('Training complete')
            # Return the trained model
            return reg
        except Exception as e:
            # Log an error message if an exception occurs
            logging.error("error in training model: {}".format(e))
            # Raise the exception for further handling
            raise e
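For readers who want to see what fitting a simple linear regression involves, here is a dependency-free sketch of ordinary least squares for a single feature. It is an illustration only: the project relies on scikit-learn's `LinearRegression`, and the function name and data below are hypothetical.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data that lies exactly on y = 2x + 1.
slope, intercept = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(slope, intercept)  # prints "2.0 1.0"
```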
Improvements in ‘model_train.py’ for Model Development
In the ‘model_train.py’ file, we make several important additions to our project:
Importing Linear Regression Model: We import ‘LinearRegressionModel’ from ‘model.model_dev’. Our ‘model_train.py’ file is set up to work with this specific type of machine-learning model.
def train_model(
    X_train: pd.DataFrame,
    X_test: pd.DataFrame,
    y_train: pd.Series,
    y_test: pd.Series,
    config: ModelNameConfig,
) -> RegressorMixin:
    """
    Train a regression model based on the specified configuration.
    Args:
        X_train (pd.DataFrame): Training data features.
        X_test (pd.DataFrame): Testing data features.
        y_train (pd.Series): Training data target.
        y_test (pd.Series): Testing data target.
        config (ModelNameConfig): Model configuration.
    Returns:
        RegressorMixin: Trained regression model.
    """
    try:
        model = None
        # Check the specified model in the configuration
        if config.model_name == "linear_regression":
            # Enable MLflow auto-logging
            autolog()
            # Create an instance of the LinearRegressionModel
            model = LinearRegressionModel()
            # Train the model on the training data
            trained_model = model.train(X_train, y_train)
            # Return the trained model
            return trained_model
        else:
            # Raise an error if the model name is not supported
            raise ValueError("Model name not supported")
    except Exception as e:
        # Log and raise any exceptions that occur during model training
        logging.error(f"Error in train model: {e}")
        raise e
This code trains a regression model (e.g., linear regression) based on a chosen configuration. It checks if the selected model is supported, uses MLflow for logging, trains the model on provided data, and returns the trained model. If the chosen model is not supported, it will raise an error.
Method ‘Train Model’: The ‘model_train.py’ file defines a function called ‘train_model’, which trains a ‘LinearRegressionModel’ and returns the fitted regressor.
Importing RegressorMixin: We import ‘RegressorMixin’ from sklearn.base, the scikit-learn module that holds the base classes for estimators. RegressorMixin is the mixin class shared by scikit-learn’s regression estimators, which makes it a convenient return type for any trained regressor.
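Selecting a model by name, as ‘train_model’ does with an if/else, can also be written as a lookup table, which scales better once more models are supported. The sketch below is a hypothetical alternative in plain Python; the config class, the registry, and the toy "mean baseline" trainer are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ModelConfig:
    """Hypothetical stand-in for a model configuration object."""
    model_name: str = "mean_baseline"


def train_mean_baseline(targets: List[float]) -> float:
    """Toy trainer: the 'model' is just the mean of the targets."""
    return sum(targets) / len(targets)


# Registry of supported model names; only one entry in this sketch.
TRAINERS: Dict[str, Callable[[List[float]], float]] = {
    "mean_baseline": train_mean_baseline,
}


def train_from_config(targets: List[float], config: ModelConfig) -> float:
    """Pick the trainer named in the config, or fail for unknown names."""
    try:
        trainer = TRAINERS[config.model_name]
    except KeyError:
        raise ValueError("Model name not supported") from None
    return trainer(targets)


print(train_from_config([1.0, 2.0, 3.0], ModelConfig()))  # prints "2.0"
```

With this design, supporting another model means adding one entry to the registry instead of another branch.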
Configuring Model Settings and Performance Evaluation
Create ‘config.py’ in the ‘Steps’ folder: In the ‘steps’ folder, we create a file named ‘config.py’. It contains a class called ‘ModelNameConfig’, which serves as a configuration guide for your machine learning model and specifies its settings and options.
# Import the necessary class from ZenML for configuring model parameters
from zenml.steps import BaseParameters
# Define a class named ModelNameConfig that inherits from BaseParameters
class ModelNameConfig(BaseParameters):
    """
    Model Configurations:
    """
    # Define attributes for model configuration with default values
    model_name: str = "linear_regression"  # Name of the machine learning model
    fine_tuning: bool = False  # Flag for enabling fine-tuning
- It allows you to choose the model’s name and whether to do fine-tuning. Fine-tuning is like making small refinements to an already working machine-learning model for better performance on specific tasks.
- Evaluation: In the ‘src’ or ‘model’ folder, we create a file named ‘evaluation.py.’ This file contains an abstract class called ‘evaluation’ and a method called ‘calculate_score.’ These are the tools we use to measure how well our machine-learning model is performing.
- Evaluation Strategies: We introduce specific evaluation strategies, such as Mean Squared Error (MSE). Each strategy class contains a ‘calculate_score’ method for assessing the model’s performance.
- Implementing Evaluation in ‘Steps’: We implement these evaluation strategies in ‘evaluation.py’ within the ‘steps’ folder. This is like setting up the quality control process in our project.
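The evaluation strategies described above can be sketched without any libraries. The class names mirror those used later in ‘evaluate_model’ (MSE, R2, RMSE), but the implementations here are an assumption written in plain Python; the project's own ‘evaluation.py’ may compute them differently, for example with scikit-learn.

```python
import math
from abc import ABC, abstractmethod
from typing import Sequence


class Evaluation(ABC):
    """Interface for every evaluation strategy."""

    @abstractmethod
    def calculate_score(self, y_true: Sequence[float], y_pred: Sequence[float]) -> float:
        ...


class MSE(Evaluation):
    """Mean Squared Error: average of the squared prediction errors."""

    def calculate_score(self, y_true, y_pred):
        return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


class RMSE(Evaluation):
    """Root Mean Squared Error: square root of the MSE."""

    def calculate_score(self, y_true, y_pred):
        return math.sqrt(MSE().calculate_score(y_true, y_pred))


class R2(Evaluation):
    """R-squared: 1 minus residual sum of squares over total sum of squares."""

    def calculate_score(self, y_true, y_pred):
        mean_true = sum(y_true) / len(y_true)
        ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
        ss_tot = sum((t - mean_true) ** 2 for t in y_true)
        return 1 - ss_res / ss_tot


# Hypothetical targets and predictions.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
mse = MSE().calculate_score(y_true, y_pred)
rmse = RMSE().calculate_score(y_true, y_pred)
r2 = R2().calculate_score(y_true, y_pred)
print(mse)  # prints "0.375"
```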
Quantifying Model Performance with the ‘Evaluate Model’ Method
Method ‘Evaluate Model‘: In ‘evaluation.py’ within the ‘steps’ folder, we create a method called ‘evaluate_model’ that returns performance metrics like R-squared (R2) score and Root Mean Squared Error (RMSE).
@step(experiment_tracker=experiment_tracker.name)
def evaluate_model(
    model: RegressorMixin, X_test: pd.DataFrame, y_test: pd.Series
) -> Tuple[
    Annotated[float, "r2"],
    Annotated[float, "rmse"],
]:
    """
    Evaluate a machine learning model's performance using various metrics and log the results.
    Args:
        model: RegressorMixin - The machine learning model to evaluate.
        X_test: pd.DataFrame - The test dataset's feature values.
        y_test: pd.Series - The actual target values for the test dataset.
    Returns:
        Tuple[float, float] - A tuple containing the R2 score and RMSE.
    """
    try:
        # Make predictions using the model
        prediction = model.predict(X_test)
        # Calculate Mean Squared Error (MSE) using the MSE class
        mse_class = MSE()
        mse = mse_class.calculate_score(y_test, prediction)
        mlflow.log_metric("mse", mse)
        # Calculate R2 score using the R2 class
        r2_class = R2()
        r2 = r2_class.calculate_score(y_test, prediction)
        mlflow.log_metric("r2", r2)
        # Calculate Root Mean Squared Error (RMSE) using the RMSE class
        rmse_class = RMSE()
        rmse = rmse_class.calculate_score(y_test, prediction)
        mlflow.log_metric("rmse", rmse)
        return r2, rmse  # Return R2 score and RMSE
    except Exception as e:
        logging.error("error in evaluation: {}".format(e))
        raise e
These additions in ‘model_train.py,’ ‘config.py,’ and ‘evaluation.py’ enhance our project by introducing machine learning model training, configuration, and thorough evaluation, ensuring that our project meets high-quality standards.
Run the Pipeline
Next, we update the ‘training_pipeline’ file to run the pipeline successfully. ZenML is an open-source MLOps framework designed to streamline and standardize machine learning workflow management. To see your pipeline in the ZenML dashboard, you can use the command ‘zenml up’.
Now, we proceed to implement the experiment tracker and deploy the model:
- Importing MLflow: In the ‘model_train.py’ file, we import ‘mlflow.’ MLflow is a versatile tool that helps us manage the machine learning model’s lifecycle, track experiments, and maintain a detailed record of each project.
- Experiment Tracker: Now, you might have a question: what is an experiment tracker? An experiment tracker is a system for monitoring and organizing experiments, allowing us to keep a record of our project’s progress. In our code, we access the experiment tracker through ‘zenml.client’ and ‘mlflow,’ ensuring we can effectively manage our experiments. You can see the model_train.py code for better understanding.
- Autologging with MLflow: We use the ‘autolog’ feature from ‘mlflow.sklearn’ to automatically log various aspects of our machine learning model’s performance. This simplifies the experiment tracking process, providing valuable insights into how well our model is doing.
- Logging Metrics: We log specific metrics like Mean Squared Error (MSE) using ‘mlflow.log_metric’ in our ‘evaluation.py’ file. This allows us to keep track of the model’s performance during the project.
Before running the ‘run_deployment.py’ script, you must install some integrations using ZenML. Integrations connect your pipeline to external tools, such as the environment where you deploy your model.
Zenml Integration
ZenML provides integrations with MLOps tools. A very important step is to install ZenML’s integration with MLflow, using this command:
zenml integration install mlflow -y
This integration helps us manage those experiments efficiently.
Experiment Tracking
Experiment tracking is a critical aspect of MLOps. We use Zenml and MLflow to monitor, record, and manage all aspects of our machine-learning experiments, facilitating efficient experimentation and reproducibility.
Register Experiment Tracker:
zenml experiment-tracker register mlflow_tracker --flavor=mlflow
Register Model Deployer:
zenml model-deployer register mlflow --flavor=mlflow
Stack:
zenml stack register mlflow_stack -a default -o default -d mlflow -e mlflow_tracker --set
Deployment
Deployment is the final step in our pipeline, and it’s an essential part of our project. Our goal is not just to build the model; we want it deployed on the internet so that users can use it.
Deployment Pipeline Configuration: The deployment pipeline is defined in a Python file named ‘deployment_pipeline.py’. This pipeline manages the deployment tasks.
Deployment Trigger: There’s a step named ‘deployment_trigger’. The excerpt below shows the related ‘DeploymentTriggerConfig’ class and the ‘dynamic_importer’ step:
class DeploymentTriggerConfig(BaseParameters):
    min_accuracy = 0
@step(enable_cache=False)
def dynamic_importer() -> str:
    """Downloads the latest data from a mock API."""
    data = get_data_for_test()
    return data
This code defines a class `DeploymentTriggerConfig` with a minimum accuracy parameter. In this case, it’s zero. It also defines a pipeline step, dynamic_importer, that downloads data from a mock API, with caching disabled for this step.
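The article does not show the body of the ‘deployment_trigger’ step, so the following is only a plausible sketch of the decision it makes: deploy when the model's score reaches the configured minimum. The dataclass, the function signature, and the use of `>=` are assumptions.

```python
from dataclasses import dataclass


@dataclass
class TriggerConfig:
    """Hypothetical stand-in for DeploymentTriggerConfig."""
    min_accuracy: float = 0.0


def deployment_trigger(accuracy: float, config: TriggerConfig) -> bool:
    """Return True when the model's score meets the configured threshold."""
    return accuracy >= config.min_accuracy


print(deployment_trigger(0.85, TriggerConfig(min_accuracy=0.8)))  # prints "True"
print(deployment_trigger(0.75, TriggerConfig(min_accuracy=0.8)))  # prints "False"
```

With the default threshold of zero, any model with a non-negative score would be deployed.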
Prediction Service Loader
The ‘prediction_service_loader’ step retrieves the prediction service started by the deployment pipeline. It is used to manage and interact with the deployed model.
def prediction_service_loader(
    pipeline_name: str,
    pipeline_step_name: str,
    running: bool = True,
    model_name: str = "model",
) -> MLFlowDeploymentService:
    """Get the prediction service started by the deployment pipeline.
    Args:
        pipeline_name: name of the pipeline that deployed the MLflow prediction
            server
        pipeline_step_name: the name of the step that deployed the MLflow prediction
            server
        running: when this flag is set, the step only returns a running service
        model_name: the name of the model that is deployed
    """
    # get the MLflow model deployer stack component
    mlflow_model_deployer_component = MLFlowModelDeployer.get_active_model_deployer()
    # fetch existing services with same pipeline name, step name and model name
    existing_services = mlflow_model_deployer_component.find_model_server(
        pipeline_name=pipeline_name,
        pipeline_step_name=pipeline_step_name,
        model_name=model_name,
        running=running,
    )
    if not existing_services:
        raise RuntimeError(
            f"No MLflow prediction service deployed by the "
            f"{pipeline_step_name} step in the {pipeline_name} "
            f"pipeline for the '{model_name}' model is currently "
            f"running."
        )
    return existing_services[0]
This code defines a function `prediction_service_loader` that retrieves a prediction service started by a deployment pipeline.
- It takes inputs like the pipeline name, step name, and model name.
- The function checks for existing services matching these parameters and returns the first one found. If none are found, it will raise an error.
Predictor
The ‘predictor’ step runs inference requests against the prediction service. It processes incoming data and returns predictions.
@step
def predictor(
    service: MLFlowDeploymentService,
    data: str,
) -> np.ndarray:
    """Run an inference request against a prediction service"""
    service.start(timeout=10)  # should be a NOP if already started
    data = json.loads(data)  # Parse the input data from a JSON string into a Python dictionary.
    data.pop("columns")
    data.pop("index")
    columns_for_df = [  # Define a list of column names for creating a DataFrame.
        "Acceleration",
        "TopSpeed",
        "Range",
        "FastChargeSpeed",
        "PriceinUK",
        "PriceinGermany",
    ]
    df = pd.DataFrame(data["data"], columns=columns_for_df)
    json_list = json.loads(json.dumps(list(df.T.to_dict().values())))
    data = np.array(json_list)  # Convert the JSON list into a NumPy array.
    prediction = service.predict(data)
    return prediction
- This code defines a function called `predictor` used for making predictions with an ML model deployed via MLflow. It starts the service, processes input data from a JSON format, converts it into a NumPy array, and returns the model’s predictions. The function expects data with the specific features of an electric vehicle, such as acceleration, top speed, and range.
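The JSON handling in `predictor` expects a payload with ‘columns’, ‘index’, and ‘data’ keys, which matches the shape pandas produces with `orient="split"`. The sketch below reproduces the reshaping step in plain Python; the helper name and the sample values are hypothetical, and only two of the six feature columns are used to keep it short.

```python
import json
from typing import Dict, List


def split_payload_to_records(payload: str, columns: List[str]) -> List[Dict[str, float]]:
    """Turn a 'split'-style JSON payload into one dict per row."""
    data = json.loads(payload)
    # Drop the metadata that is not needed for prediction.
    data.pop("columns", None)
    data.pop("index", None)
    return [dict(zip(columns, row)) for row in data["data"]]


# Hypothetical payload with two of the feature columns used above.
payload = json.dumps({
    "columns": ["Acceleration", "TopSpeed"],
    "index": [0, 1],
    "data": [[4.6, 233], [10.0, 160]],
})
records = split_payload_to_records(payload, ["Acceleration", "TopSpeed"])
print(records[0])  # prints "{'Acceleration': 4.6, 'TopSpeed': 233}"
```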
Deployment Execution: The ‘run_deployment.py’ script triggers the deployment process. It takes a ‘--config’ option that selects what the script does: ‘deploy’ for deploying the model, ‘predict’ for running predictions, or ‘deploy_and_predict’ for both.
Deployment Status and Interaction: The script also provides information about the status of the MLflow prediction server, including how to start and stop it. It uses MLflow for model deployment.
Min Accuracy Threshold: The ‘min_accuracy’ parameter can be specified to set a minimum accuracy threshold for model deployment. If the model meets that threshold, it will be deployed.
Docker Configuration: Docker is used for managing the deployment environment, and the Docker settings are defined in the deployment pipeline.
This deployment process focuses on deploying machine learning models and running predictions in a controlled and configurable manner.
- Deploying our model is as simple as running the ‘run_deployment.py’ script. Use this:
python3 run_deployment.py --config deploy
Prediction
Once our model is deployed, it is ready for predictions.
- Run Predictions: Execute predictions using the following command –
python3 run_deployment.py --config predict
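Both commands above pass a mode through ‘--config’. As a rough sketch of how such an option can be handled, the example below uses `argparse` from the standard library. The project's actual script may use a different command-line library, and the default value and helper names here are assumptions.

```python
import argparse

# The three modes named in this article.
MODES = ("deploy", "predict", "deploy_and_predict")


def build_parser() -> argparse.ArgumentParser:
    """Create a parser that accepts one of the three modes via --config."""
    parser = argparse.ArgumentParser(description="Run deployment and/or prediction.")
    parser.add_argument(
        "--config",
        choices=MODES,
        default="deploy_and_predict",  # assumed default for this sketch
        help="Which part of the workflow to run.",
    )
    return parser


def planned_actions(mode: str) -> list:
    """Translate a mode into the ordered list of actions to perform."""
    actions = []
    if mode in ("deploy", "deploy_and_predict"):
        actions.append("deploy")
    if mode in ("predict", "deploy_and_predict"):
        actions.append("predict")
    return actions


# Parse an explicit argument list instead of the real command line.
args = build_parser().parse_args(["--config", "predict"])
print(planned_actions(args.config))  # prints "['predict']"
```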
Streamlit App
The Streamlit app provides a user-friendly interface for interacting with our model’s predictions. Streamlit simplifies the creation of interactive, web-based data science applications, making it easy for users to explore and understand the model’s predictions. Again, you can find the code on GitHub for the Streamlit app.
- Launch the Streamlit app with the following command: streamlit run streamlit_app.py
With this, you can explore and interact with our model’s predictions.
- Streamlit app makes our model’s predictions user-friendly and accessible online; users can easily interact with and understand the results. Here you can see the picture of how the Streamlit app looks on the web –
Conclusion
In this article, we’ve delved into an exciting project that demonstrates the power of MLOps in predicting electric vehicle efficiency. We’ve learned about ZenML and MLflow, which are crucial in creating an end-to-end machine-learning pipeline. We’ve also explored the data collection process, the problem statement, and the solution for predicting electric vehicle efficiency.
This project highlights the significance of efficient electric vehicles and how MLOps can be harnessed to create models for forecasting efficiency. We’ve covered essential steps, including setting up a virtual environment, model development, configuring model settings, and evaluating model performance. Finally, we looked at experiment tracking, deployment, and user interaction through a Streamlit app. With this project, we’re one step closer to shaping the future of electric vehicles.
Key Takeaways
- Seamless Integration: The “End-to-End Predicting Electric Vehicle Efficiency Pipeline with Zenml” project exemplifies the seamless integration of data collection, model training, evaluation, and deployment. It highlights the immense potential of MLOps in reshaping the electric vehicle industry.
- GitHub Project: For further exploration, you can access the project on GitHub: https://github.com/Dhrubaraj-Roy/Predicting-Electric-Vehicle-Efficiency.git
- MLOps Course: To deepen your understanding of MLOps, we recommend watching our comprehensive course: MLOps Course.
- This project showcases the potential of MLOps in reshaping the electric vehicle industry, providing valuable insights, and contributing to a greener future.
Frequently Asked Questions
A. MLflow manages the end-to-end machine learning lifecycle, enabling experiment tracking, model packaging, and deployment, making it easier to develop and deploy machine learning models.
A. MLOps and DevOps serve distinct but complementary purposes: MLOps is tailored for the machine learning lifecycle, while DevOps focuses on software development. Neither is better; their integration can optimize end-to-end development and deployment.
A. Yes, MLOps often involves coding for developing machine learning models and automating deployment and management processes.
A. MLflow simplifies machine learning development by providing tools for experiment tracking, model versioning, and model deployment.
A. Yes, ZenML is a fully open-source MLOps framework that makes the transition from local development to production pipelines as easy as 1 line of code.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
By Analytics Vidhya, October 31, 2023.