How to Integrate Scikit-learn with Pandas for Easier Data Manipulation

Let’s learn how to integrate Scikit-Learn with Pandas for Data Manipulation.


This tutorial requires the NumPy, Pandas, and Scikit-Learn Python packages. If they are not already installed in your environment, install them with pip:

pip install numpy pandas scikit-learn

Next, import the following packages:

import pandas as pd
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

With the packages imported, we can create our sample dataset.

df = pd.DataFrame({
    'salary': [1000, 2000, 1500, 3000, 4550, 6000, 2500],
    'size': ['L', 'M', 'M', 'S', 'L', 'S', 'M'],
    'target': [0, 1, 0, 1, 0, 1, 1]
})

X = df[['salary', 'size']]
y = df['target']

Scikit-Learn and Pandas Integration for Data Manipulation

Pandas is a widely used Python package for data manipulation. Many Pandas functions can be plugged directly into a Scikit-Learn pipeline by wrapping them in a FunctionTransformer. Let’s take an example in the code below.

def get_dummies_size(df):
    return pd.get_dummies(df, columns=['size'])

# Using FunctionTransformer to integrate pd.get_dummies
dummies_transformer = FunctionTransformer(get_dummies_size)

# Creating a pipeline
pipeline = Pipeline(steps=[
    ('dummies', dummies_transformer),
    ('classifier', LogisticRegression())
])

# Apply only the preprocessing step and inspect the result
preprocessed_X = pipeline.named_steps['dummies'].fit_transform(X)
print(preprocessed_X)


The output:

   salary  size_L  size_M  size_S
0    1000    True   False   False
1    2000   False    True   False
2    1500   False    True   False
3    3000   False   False    True
4    4550    True   False   False
5    6000   False   False    True
6    2500   False    True   False

By using FunctionTransformer from Scikit-Learn, we can integrate the Pandas one-hot encoding function (pd.get_dummies) into a Scikit-Learn pipeline. This keeps the data preprocessing and the model together in one object.
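
The example above only runs the preprocessing step. As a minimal sketch of running the whole pipeline end to end, the code below also fits the classifier and predicts on new rows. It rests on assumptions that are not part of the original example: the helper name get_dummies_fixed, the EXPECTED_COLUMNS list, the dtype=int argument, max_iter=1000, and the new_rows data are all illustrative. The reindex call is there because pd.get_dummies only creates columns for the categories it actually sees, so new data that lacks a category would otherwise have fewer columns than the data the model was trained on.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Same sample data as in the article
df = pd.DataFrame({
    'salary': [1000, 2000, 1500, 3000, 4550, 6000, 2500],
    'size': ['L', 'M', 'M', 'S', 'L', 'S', 'M'],
    'target': [0, 1, 0, 1, 0, 1, 1],
})
X = df[['salary', 'size']]
y = df['target']

# Assumption: the full set of output columns is known up front
EXPECTED_COLUMNS = ['salary', 'size_L', 'size_M', 'size_S']

def get_dummies_fixed(frame):
    # One-hot encode as integers, then align to a fixed column layout so
    # data that is missing a category still yields the same columns
    encoded = pd.get_dummies(frame, columns=['size'], dtype=int)
    return encoded.reindex(columns=EXPECTED_COLUMNS, fill_value=0)

pipeline = Pipeline(steps=[
    ('dummies', FunctionTransformer(get_dummies_fixed)),
    ('classifier', LogisticRegression(max_iter=1000)),
])

# Fit preprocessing and classifier together
pipeline.fit(X, y)

# Hypothetical new rows; note that size 'L' does not appear here
new_rows = pd.DataFrame({'salary': [1800, 5200], 'size': ['M', 'S']})
predictions = pipeline.predict(new_rows)
print(predictions)
```

With only seven training rows the predictions themselves are not meaningful. The point is that fit and predict now run the Pandas step automatically.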

We can also chain several Pandas functions for data manipulation in the same Scikit-Learn pipeline, each wrapped in its own FunctionTransformer.

def bin_salary(df):
    df = df.copy()
    df['binned_salary'] = pd.cut(df['salary'], bins=3, labels=['low', 'medium', 'high'])
    return df

def ordinal_encode_salary(df):
    df = df.copy()
    ordinal_mapping = {'low': 0, 'medium': 1, 'high': 2}
    df['binned_salary'] = df['binned_salary'].map(ordinal_mapping)
    return df

binning_transformer = FunctionTransformer(bin_salary)
ordinal_transformer = FunctionTransformer(ordinal_encode_salary)

pipeline = Pipeline(steps=[
    ('binning', binning_transformer),
    ('ordinal', ordinal_transformer),
    ('classifier', LogisticRegression())
])

# Apply the preprocessing steps one at a time and print each result
preprocessed_X = pipeline.named_steps['binning'].fit_transform(X)
print(preprocessed_X)

preprocessed_X = pipeline.named_steps['ordinal'].fit_transform(preprocessed_X)
print(preprocessed_X)

The output:

   salary size binned_salary
0    1000    L           low
1    2000    M           low
2    1500    M           low
3    3000    S        medium
4    4550    L          high
5    6000    S          high
6    2500    M           low

   salary size binned_salary
0    1000    L             0
1    2000    M             0
2    1500    M             0
3    3000    S             1
4    4550    L             2
5    6000    S             2
6    2500    M             0

In the code above, we wrap the Pandas calls for binning (pd.cut) and mapping (Series.map) in plain functions, then wrap each function in a FunctionTransformer. Placing the transformers in a pipeline applies them in sequence, so the output of the binning step becomes the input of the ordinal encoding step.
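
The pipeline above is only used step by step. Fitting it as a whole would likely fail, because the text column size is still present when the data reaches the classifier. The sketch below shows one possible way to make it fit end to end. The fixed bin edges in SALARY_EDGES, the helper names bin_salary_fixed and keep_numeric, and max_iter=1000 are assumptions added for illustration. Fixed edges are used because pd.cut with bins=3 recomputes equal-width bins from whatever data it receives, so new rows could be binned differently from the training rows.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Same sample data as in the article
df = pd.DataFrame({
    'salary': [1000, 2000, 1500, 3000, 4550, 6000, 2500],
    'size': ['L', 'M', 'M', 'S', 'L', 'S', 'M'],
    'target': [0, 1, 0, 1, 0, 1, 1],
})
X = df[['salary', 'size']]
y = df['target']

# Hypothetical fixed bin edges (an assumption, not taken from the article)
SALARY_EDGES = [0, 2500, 4500, np.inf]
SALARY_LABELS = ['low', 'medium', 'high']

def bin_salary_fixed(frame):
    # Bin with explicit edges so every call uses the same boundaries
    frame = frame.copy()
    frame['binned_salary'] = pd.cut(
        frame['salary'], bins=SALARY_EDGES, labels=SALARY_LABELS
    )
    return frame

def ordinal_encode_salary(frame):
    # Map the bin labels to integers that preserve their order
    frame = frame.copy()
    mapping = {'low': 0, 'medium': 1, 'high': 2}
    frame['binned_salary'] = frame['binned_salary'].map(mapping).astype(int)
    return frame

def keep_numeric(frame):
    # Drop non-numeric columns (here: 'size') before the classifier
    return frame.select_dtypes(include='number')

pipeline = Pipeline(steps=[
    ('binning', FunctionTransformer(bin_salary_fixed)),
    ('ordinal', FunctionTransformer(ordinal_encode_salary)),
    ('numeric_only', FunctionTransformer(keep_numeric)),
    ('classifier', LogisticRegression(max_iter=1000)),
])

pipeline.fit(X, y)
predictions = pipeline.predict(X)
print(predictions)
```

Dropping size is the simplest choice for this sketch. One-hot encoding it in an extra step would keep that information in the model.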

You can streamline your data science project by mastering Scikit-Learn with Pandas for data manipulation.
