Getting started with MLflow

This tutorial focuses on getting you started with MLflow.

For the purposes of this tutorial, we will assume that you are working in some form of Jupyter notebook. In my case, I use Google Colab.

Step 1. Install MLflow

Our first step is to install the mlflow library. It is available on PyPI or, if you are behind a corporate firewall, from your company’s local PyPI mirror:

pip install mlflow
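
You can verify the installation and check which version was installed:

mlflow --version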

Step 2. Start the MLflow server and set the tracking URI

Running locally

Next, we kick off an instance of the MLflow server running locally:

% mlflow server --host 127.0.0.1 --port 5000

Any other open port can be used if 5000 is unavailable.

See below for how to access the UI if this tutorial is being executed in Google Colab or another online Jupyter notebook.

We set the tracking URI in Python as follows:

import mlflow
mlflow.set_tracking_uri("http://127.0.0.1:5000")

If the tracking URI is not set, runs will be logged to the local filesystem.
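
As a quick check, you can print where runs are currently being recorded; with no tracking server configured, this defaults to a local mlruns directory:

import mlflow
print(mlflow.get_tracking_uri())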

Running within a remote notebook

If we’re running in a remote Jupyter notebook such as Google Colab, we don’t have direct access to the local MLflow server. Instead, we will use ngrok to create a tunnel to the MLflow server, which we can then use to access the UI.

It works as follows:

  1. Sign up for an ngrok account and obtain an auth token
  2. Install ngrok
  3. Start the ngrok server
  4. Open the ngrok tunnel
  5. Access the UI

I will not delve into great detail, but you can read more about ngrok here. You can also check out this blog post, where I detail how to set up an app on Colab and create an ngrok tunnel so it can be accessed from the internet.
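
The code below drives ngrok through the pyngrok wrapper, which is a separate package available on PyPI:

pip install pyngrok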

The code looks like this:

from pyngrok import ngrok

# Start the MLflow UI in the background on port 5000
get_ipython().system_raw("mlflow ui --port 5000 &")

# Terminate any ngrok tunnels left over from previous runs
ngrok.kill()

# NGROK_AUTH_TOKEN is the auth token from your ngrok dashboard
ngrok.set_auth_token(NGROK_AUTH_TOKEN)

# Open an HTTPS tunnel to the MLflow UI
ngrok_tunnel = ngrok.connect(addr="5000", proto="http", bind_tls=True)
print("MLflow Tracking UI:", ngrok_tunnel.public_url)

Step 3. Prepare the dataset and train the model

We will use the California Housing dataset. We will do the following:

  1. Load the housing dataset and preprocess it
  2. Train the model using a simple linear regression
  3. Evaluate the model

First, we import the necessary libraries:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import mlflow
from mlflow.models import infer_signature

We can download the California Housing dataset from within scikit-learn.

from sklearn.datasets import fetch_california_housing
california_housing_dset = fetch_california_housing(as_frame=True)

We can quickly inspect the dataset to see what we have:

print(california_housing_dset.DESCR)

print(california_housing_dset.data.columns)

print(california_housing_dset.data.head())

print(california_housing_dset.target_names)
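
The dataset contains 20,640 samples with eight numeric features (MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, and Longitude) and a single target, MedHouseVal, the median house value of a block group expressed in hundreds of thousands of dollars.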

We can build a pipeline to train the model.

X = california_housing_dset.frame.iloc[:, :-1]
y = california_housing_dset.frame.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features, then fit a linear regression on the training set
regression_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', LinearRegression())
])
regression_pipeline.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = regression_pipeline.predict(X_test)
r2 = r2_score(y_test, y_pred)
print("R2:", r2)

Step 4. Log the model and its metadata to MLflow

Here we show how to log our model to MLflow along with its metadata:

# Create a new MLflow Experiment
mlflow.set_experiment("MLflow Quickstart")

# Start an MLflow run
with mlflow.start_run():

    # Log the loss metric
    mlflow.log_metric("r2", r2)

    # Set a tag that we can use to remind ourselves what this run was for
    mlflow.set_tag("Training Info", "Basic LR model for California Housing data")

    # Infer the model signature
    signature = infer_signature(X_train, regression_pipeline.predict(X_train))

    # Log the model
    model_info = mlflow.sklearn.log_model(
        sk_model=regression_pipeline,
        artifact_path="CA_Housing_model",
        signature=signature,
        input_example=X_train,
        registered_model_name="mlflow-tracking-quickstart",
    )
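
Because we passed registered_model_name, the model is also registered in the MLflow Model Registry, which lets us load it later by name and version instead of by run ID.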

Step 5. Load the model as a Python Function (pyfunc) and use it for inference

# Load the model back for predictions as a generic Python Function model
loaded_model = mlflow.pyfunc.load_model(model_info.model_uri)

predictions = loaded_model.predict(X_test)

housing_feature_names = california_housing_dset.feature_names

# Combine the test features, actual targets, and predictions into one DataFrame
result = pd.DataFrame(X_test, columns=housing_feature_names)
result["actual_value"] = y_test
result["predicted_value"] = predictions

result[:4]

Step 6. View the Run in the MLflow UI

Running locally

In order to see the results of our run, we can navigate to the MLflow UI. Since we have already started the Tracking Server at http://localhost:5000, we can simply navigate to that URL in our browser.

Running within a remote notebook

If we’re running within a remote Jupyter notebook such as Google Colab, we can simply navigate to the tunnel URL that was printed above:

https://e2fd-104-155-194-144.ngrok-free.app

and select the appropriate Experiment.

Here are screenshots of the run results:

[Screenshot: MLflow experiments page]
[Screenshot: MLflow experiment summary]
[Screenshot: MLflow experiment model artifact page]

Using XGBoost instead of Linear Regression

One point to note is that with standard linear regression, we didn’t have to deal with hyperparameters. To demonstrate logging hyperparameters, we will use XGBoost in place of steps 3 and 4, designated as steps 3a and 4a respectively.

Step 3a. Prepare the dataset and train the model (XGBoost)

We first import the necessary libraries. Next, we run a grid search (GridSearchCV) to find the optimal hyperparameters for our model.

Using the optimal hyperparameters, we train the model.

Finally, we run a prediction and calculate the r2 score to evaluate the model.

import xgboost as xgb
from sklearn.model_selection import GridSearchCV

#=========================================================================
# exhaustively search for the optimal hyperparameters
#=========================================================================
# set up our search grid
param_grid = {"max_depth":     [4, 5, 6],
              "n_estimators":  [500, 600, 700],
              "learning_rate": [0.01, 0.015]}

# base estimator for the grid search
xgb_regressor = xgb.XGBRegressor(eval_metric='rmsle')

# try out every combination of the above values
search = GridSearchCV(xgb_regressor, param_grid, cv=5).fit(X_train, y_train)

print("The best hyperparameters are ", search.best_params_)

# retrain using the best hyperparameters found by the search
xgb_regressor = xgb.XGBRegressor(learning_rate=search.best_params_["learning_rate"],
                                 n_estimators=search.best_params_["n_estimators"],
                                 max_depth=search.best_params_["max_depth"],
                                 eval_metric='rmsle')

xgb_regressor.fit(X_train, y_train)

y_pred = xgb_regressor.predict(X_test)

r2 = r2_score(y_test, y_pred)
print("R2:", r2)

Step 4a. Log the model and its metadata to MLflow (XGBoost)

Here we show how to log our model to MLflow along with its metadata. This time we also log the hyperparameters.

# Create a new MLflow Experiment
mlflow.set_experiment("MLflow Quickstart - XGB")

# Start an MLflow run
with mlflow.start_run():
    # Log the best hyperparameters found by the grid search
    mlflow.log_params(search.best_params_)

    # Log the loss metric
    mlflow.log_metric("r2", r2)

    # Set a tag that we can use to remind ourselves what this run was for
    mlflow.set_tag("Training Info", "XGB model for California Housing data")

    # Infer the model signature
    signature = infer_signature(X_train, xgb_regressor.predict(X_train))

    # Log the model
    model_info = mlflow.sklearn.log_model(
        sk_model=xgb_regressor,
        artifact_path="CA_Housing_model",
        signature=signature,
        input_example=X_train,
        registered_model_name="mlflow-tracking-quickstart-xgb",
    )

We can then repeat the code in Step 5.
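
As a minimal sketch (assuming this is the first registered version of the model), the registered XGBoost model can also be loaded by name from the Model Registry:

# Load version 1 of the registered model (version number assumed here)
loaded_xgb = mlflow.pyfunc.load_model("models:/mlflow-tracking-quickstart-xgb/1")
xgb_predictions = loaded_xgb.predict(X_test)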

Conclusion

In this tutorial, we explored the fundamentals of MLflow and demonstrated how to use it to manage the machine learning lifecycle. We covered the installation of MLflow, setting up a tracking server, training and logging models with Linear Regression and XGBoost, and using the MLflow UI to visualize results. By following these steps, you can leverage MLflow to streamline your machine learning projects, track experiments, and collaborate with others. Whether you’re working on a local machine or a remote Jupyter notebook, MLflow provides a flexible and scalable platform for machine learning development.