How to Create Pipelines in Scikit-learn for More Efficient Data Processing

Let’s learn how to build pipelines in Scikit-Learn for more efficient data processing.

Preparation

Ensure that NumPy, Pandas, and Scikit-Learn are installed in your environment. If not, install them via pip with the following command:

pip install numpy pandas scikit-learn

Then, import the relevant Python packages into your environment:

import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

Once the packages are imported successfully, we create a small sample dataset.

sample_data = pd.DataFrame({
    'age': [25, 30, np.nan, 43, 50, 60],
    'income': [50000, 60000, 40000, np.nan, 90000, 86000],
    'gender': ['male', 'female', 'female', 'male', np.nan, 'male'],
    'occupation': ['engineer', 'doctor', 'engineer', 'artist', 'doctor', 'artist'],
    'churn': [1, 0, 1, 0, 1, 1]
})
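
The age, income, and gender columns intentionally contain missing values. As an optional sanity check (not part of the main workflow), we can count them per column:

# Optional: count missing values in each column
print(sample_data.isna().sum())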

Then, let’s separate the features from the target so we can simulate using the pipeline in a machine learning workflow.

X = sample_data.drop('churn', axis=1)
y = sample_data['churn']

Creating Scikit-Learn Pipeline for Efficient Data Preprocessing

The Scikit-Learn pipeline streamlines and automates the data workflow of a machine learning project by chaining multiple processing steps into a single unit.

Let’s set up the data preprocessing with Scikit-Learn. A typical data science project requires a lot of preprocessing, such as missing-data imputation and categorical encoding, and the Scikit-Learn pipeline lets us combine these steps into one object.

First, we define separate preprocessing pipelines for the numerical and categorical features; each one chains the required steps.

# Numerical features preprocessing
numerical_features = ['age', 'income']
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical features preprocessing
categorical_features = ['gender', 'occupation']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
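
If you want to sanity-check a sub-pipeline on its own before combining everything, you can fit and transform just its columns. This optional snippet (not part of the main workflow) applies the numerical pipeline to the age and income columns:

# Optional check: run only the numerical sub-pipeline
# Missing age and income values are imputed with the median, then scaled
print(numerical_transformer.fit_transform(X[numerical_features]))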

Then, we combine the two pipelines with a ColumnTransformer so that each set of columns receives the right preprocessing.

# Combine preprocessing for numerical and categorical features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)
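
Before attaching a model, it can be useful to see what the preprocessor actually produces. The sketch below assumes scikit-learn 1.0 or newer for get_feature_names_out:

# Optional check: transform the features without a model attached
transformed = preprocessor.fit_transform(X)
print(transformed.shape)  # (6, 7): two scaled numerical columns plus five one-hot columns

# Inspect the generated feature names (scikit-learn >= 1.0)
print(preprocessor.get_feature_names_out())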

Next, we use a Scikit-Learn pipeline to combine the preprocessor with a machine learning algorithm.

# Create the pipeline with ML
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])

Finally, we can fit the pipeline with the sample data we have.

pipeline.fit(X, y)

The output:

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['age', 'income']),
                                                 ('cat',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('onehot',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['gender', 'occupation'])])),
                ('classifier', LogisticRegression())])
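
Once fitted, the whole pipeline can be reused for inference: raw input goes in, and imputation, scaling, encoding, and prediction all happen in one call. A minimal sketch with made-up new observations:

# Hypothetical new customers; the values are for illustration only
new_data = pd.DataFrame({
    'age': [35, np.nan],
    'income': [72000, 58000],
    'gender': ['female', 'male'],
    'occupation': ['doctor', 'engineer']
})

# The pipeline imputes, scales, and encodes before predicting churn
print(pipeline.predict(new_data))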

Mastering the Scikit-Learn pipeline will streamline your data science workflow and make your projects easier to reproduce.
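
One concrete payoff: because every preprocessing step is refit inside each training fold, the same pipeline object can be passed straight to cross-validation without leaking information from the held-out data. A minimal sketch on our tiny sample (cv=2 because there are only six rows and two churn=0 examples):

from sklearn.model_selection import cross_val_score

# cv=2 keeps both classes represented in every fold of this tiny dataset
scores = cross_val_score(pipeline, X, y, cv=2)
print(scores)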
