Streamlining Your Machine Learning Workflow with Scikit-learn and Joblib

Let’s learn how to streamline our workflow with Scikit-Learn and Joblib.


If you use a data science distribution such as Anaconda, the packages required for this tutorial should already be installed. If not, you can install them with the following command.

pip install -U pandas scikit-learn joblib

Once the packages are ready, we will use scikit-learn’s fetch_openml function to download the data from OpenML. We will use the Titanic dataset for our tutorial.

from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

titanic = fetch_openml("titanic", version=1, as_frame=True)
df = titanic.frame

features = ['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']
X = df[features]
y = df['survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

With everything in place, let’s move into the central part of the tutorial.

Streamlining Machine Learning Workflow

Streamlining a machine learning workflow means making every stage of it as efficient and reproducible as possible, from data preprocessing through deployment and even monitoring.

Scikit-Learn and Joblib make this straightforward. In Scikit-Learn, the Pipeline class chains preprocessing steps and a model into a single estimator. For example, here is how we consolidate the preprocessing and model development parts.

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

numerical_features = ['age', 'fare', 'pclass', 'sibsp', 'parch']
categorical_features = ['sex', 'embarked']

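# Numerical columns: fill missing values with the mean, then standardize
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Categorical columns: fill missing values with the mode, then one-hot encode
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Route each group of columns to its own transformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Chain the preprocessing and the classifier into one estimator
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000))
])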

In the code above, we combine the preprocessing steps and the model into a single Pipeline object, so one call to fit or predict runs every step in order.
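To make the behavior of a combined pipeline concrete, here is a minimal sketch on a tiny hand-made DataFrame. The toy data, the column subset, and every name prefixed with toy_ are assumptions for illustration only; this is not the Titanic dataset. The sketch shows that new rows with missing values or an unseen category can be passed straight to predict, because the fitted preprocessing is applied automatically.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data (not the Titanic dataset), with missing numbers on purpose
toy = pd.DataFrame({
    "age": [22.0, 38.0, np.nan, 35.0, 54.0, 2.0],
    "fare": [7.25, 71.28, 7.92, 53.10, 51.86, np.nan],
    "sex": ["male", "female", "female", "female", "male", "female"],
})
toy_target = [0, 1, 1, 1, 0, 1]

# Same structure as the tutorial pipeline, on fewer columns
toy_preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", StandardScaler()),
    ]), ["age", "fare"]),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["sex"]),
])

toy_pipeline = Pipeline(steps=[
    ("preprocessor", toy_preprocessor),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# One call fits the imputers, scaler, encoder, and classifier in order
toy_pipeline.fit(toy, toy_target)

# New rows with gaps and a category never seen during fit
new_rows = pd.DataFrame({
    "age": [np.nan, 30.0],
    "fare": [10.0, np.nan],
    "sex": ["other", "male"],
})

# The fitted preprocessing runs automatically before the classifier
predictions = toy_pipeline.predict(new_rows)
transformed_shape = toy_pipeline.named_steps["preprocessor"].transform(new_rows).shape
print(predictions, transformed_shape)
```

Because handle_unknown='ignore' is set, the unseen category is encoded as all zeros rather than raising an error, which is usually what you want at prediction time.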

With the pipeline ready, you can train your model and save the model object using Joblib. Here is how you can do that.

import joblib

# Fit the preprocessing steps and the classifier in one call
pipeline.fit(X_train, y_train)

joblib.dump(pipeline, 'sample_pipeline_model.pkl')
load_model = joblib.load('sample_pipeline_model.pkl')

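By saving the fitted pipeline with Joblib, you keep the preprocessing steps and the model together in a single file. You can load that file and use it for prediction in another environment, provided compatible versions of scikit-learn and its dependencies are installed there.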
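A quick way to check a saved pipeline is to confirm that the reloaded object predicts exactly what the original does. The sketch below assumes synthetic stand-in data and a smaller pipeline so that it runs without downloading anything; X_demo, y_demo, demo_pipeline, and the file name are hypothetical names chosen for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: 100 rows, 3 numeric features, binary target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

demo_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])
demo_pipeline.fit(X_demo, y_demo)

# Save to a temporary directory and load the pipeline back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_pipeline.pkl")
    joblib.dump(demo_pipeline, model_path)
    restored_pipeline = joblib.load(model_path)

# The restored pipeline should reproduce the original predictions exactly
same_predictions = np.array_equal(
    demo_pipeline.predict(X_demo), restored_pipeline.predict(X_demo)
)
print(same_predictions)
```

Joblib files are pickle-based, so only load files that come from a source you trust.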

Joblib is not only helpful for saving your pipeline; it can also parallelize your workflow when the work is resource-intensive. For example, we can train and evaluate the pipeline on several different train-test splits in parallel with the following code.

from joblib import Parallel, delayed

def train_and_evaluate(seed):
    # Each seed produces a different train-test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=seed)
    pipeline.fit(X_train, y_train)
    return pipeline.score(X_test, y_test)

results = Parallel(n_jobs=4)(delayed(train_and_evaluate)(i) for i in range(10))
print(f"Average Accuracy: {sum(results)/len(results)}")

The output:

Average Accuracy: 0.7816793893129771

In the code above, Parallel runs the tasks simultaneously across four workers (n_jobs=4), while delayed wraps the function we want to parallelize and postpones its execution until Parallel dispatches it to a worker. By using Joblib to parallelize our workflow, we can speed up resource-intensive work that splits into independent tasks.
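The mechanics of Parallel and delayed are easier to see on a trivial task. The sketch below uses math.factorial as a stand-in workload; that choice is an assumption for illustration, as is n_jobs=2. In current Joblib releases, calling a delayed function does not run it but returns the function together with its arguments, which Parallel later executes.

```python
import math

from joblib import Parallel, delayed

# Calling the wrapped function captures the call instead of running it
func, args, kwargs = delayed(math.factorial)(5)
print(args)  # (5,)

# Parallel consumes the captured calls, runs them on 2 workers,
# and returns the results in the same order as the inputs
factorials = Parallel(n_jobs=2)(delayed(math.factorial)(n) for n in range(6))
print(factorials)  # [1, 1, 2, 6, 24, 120]
```

For tasks this small, the overhead of starting workers outweighs any gain; parallelism pays off when each task takes substantial time, as model training does.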

Try to master the Scikit-Learn pipeline and Joblib to streamline your whole machine learning workflow.
