How to Visualize Decision Boundaries Using Scikit-learn

Let’s learn how to visualize decision boundaries with Scikit-Learn.

Preparation

First, let’s install the necessary Python packages if you haven’t already, or upgrade them if it’s been a while. You can skip this part if they are already installed and up to date.

pip install -U pandas numpy matplotlib scikit-learn

Then, we import the Python packages we need for this tutorial.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.inspection import DecisionBoundaryDisplay

With the packages imported, we can create sample data for this tutorial.

# Create sample data
X, y = make_classification(n_features=2, n_redundant=0, n_informative=2,
                           random_state=1, n_clusters_per_class=1)
rng = np.random.RandomState(2)
X += 2 * rng.uniform(size=X.shape)

df = pd.DataFrame(X, columns=['Feature 1', 'Feature 2'])
df['Target'] = y

X_train, X_test, y_train, y_test = train_test_split(
    df.drop('Target', axis=1), df['Target'], test_size=0.2, random_state=42
)

In the code above, we generate a dataset with two informative features and then add uniform random noise to both features to make the classes overlap slightly.
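Before training anything, it can help to eyeball the data. The short snippet below is optional and just for inspection; it plots the two features colored by class so you can see how separable the classes look.

# Quick look at the raw data before modeling
plt.scatter(df['Feature 1'], df['Feature 2'], c=df['Target'], edgecolors='k')
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("Sample data by class")
plt.show()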

Decision Boundaries Visualization Using Scikit-Learn

A decision boundary is the surface that separates classes in a classification problem, such as a line in 2D or a plane in 3D. For a binary classification problem with two features, it can be as simple as a straight line, but it becomes more complex depending on the classifier and the dataset.

Visualizing decision boundaries helps the audience understand how the classifier assigns a class to a new point, and it can reveal whether the model generalizes well or is underfitting or overfitting.

Let’s try to visualize the Decision Boundaries. First, we need to train the classifier.

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

clf = LogisticRegression()
clf.fit(X_train, y_train)
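Before plotting anything, it is worth noting that for logistic regression the decision boundary is simply the line where the predicted probability equals 0.5, i.e. where coef · x + intercept = 0. The snippet below is a minimal sketch (not required for the tutorial) that computes that line directly from the fitted coefficients.

# The logistic regression boundary is where coef . x + intercept = 0.
# With two features we can solve for Feature 2 as a function of Feature 1.
w1, w2 = clf.coef_[0]
b = clf.intercept_[0]

x1_vals = np.linspace(X_train[:, 0].min(), X_train[:, 0].max(), 100)
x2_vals = -(w1 * x1_vals + b) / w2  # the boundary line in (scaled) feature space

print(f"Boundary: Feature 2 = {-w1 / w2:.3f} * Feature 1 + {-b / w2:.3f}")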

Once we have trained the classification model, we can use Scikit-Learn’s DecisionBoundaryDisplay to visualize the decision boundary.

db = DecisionBoundaryDisplay.from_estimator(
    clf,
    X_train,
    response_method="predict"
)

# Plot training and test points with different colors for each class
scatter_train_0 = plt.scatter(X_train[y_train == 0][:, 0], X_train[y_train == 0][:, 1], 
                              c='blue', edgecolors='k', marker='o', label='Train class 0')
scatter_train_1 = plt.scatter(X_train[y_train == 1][:, 0], X_train[y_train == 1][:, 1], 
                              c='red', edgecolors='k', marker='o', label='Train class 1')
scatter_test_0 = plt.scatter(X_test[y_test == 0][:, 0], X_test[y_test == 0][:, 1], 
                             c='lightblue', marker='x', label='Test class 0')
scatter_test_1 = plt.scatter(X_test[y_test == 1][:, 0], X_test[y_test == 1][:, 1], 
                             c='black', marker='x', label='Test class 1')

# Add legend
plt.legend()

plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("Decision Boundary of Logistic Regression with Prediction Method")
plt.show()

Decision Boundary with Prediction Method

The image above clearly shows where the classifier separates the two classes and where our data points fall relative to the decision boundary. Most points sit on the correct side, but some do not, and a few are very close to the boundary.
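If you are curious what DecisionBoundaryDisplay is doing behind the scenes, the sketch below is a rough illustrative equivalent (not the library’s actual implementation): it builds a grid over the feature space, predicts a class for every grid point, and draws the result with contourf.

# Rough manual equivalent of DecisionBoundaryDisplay (illustrative only)
x_min, x_max = X_train[:, 0].min() - 1, X_train[:, 0].max() + 1
y_min, y_max = X_train[:, 1].min() - 1, X_train[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                     np.linspace(y_min, y_max, 200))

# Predict a class for every point on the grid and reshape back to 2D
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, edgecolors='k')
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("Manual grid-based decision boundary")
plt.show()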

We can also get a more informative visualization by using the prediction probabilities from the classifier. You only need to change the response_method parameter to “predict_proba”.

db = DecisionBoundaryDisplay.from_estimator(
    clf,
    X_train,
    response_method="predict_proba"
)

Decision Boundary with Prediction Probability Method

This visualization shades the plot by predicted probability, which makes it easier to see how confident the classifier is across the feature space and how close each data point is to the decision boundary.
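To make the probability map easier to read, one option (a small sketch using the display object’s ax_ and surface_ attributes) is to overlay the training points and attach a colorbar:

# Probability map with the training points and a colorbar overlaid
db = DecisionBoundaryDisplay.from_estimator(
    clf,
    X_train,
    response_method="predict_proba",
    cmap="RdBu_r",
    alpha=0.8
)
db.ax_.scatter(X_train[:, 0], X_train[:, 1], c=y_train,
               cmap="RdBu_r", edgecolors='k')
# For binary problems the surface shows the probability of the positive class
plt.colorbar(db.surface_, ax=db.ax_, label="Probability of class 1")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()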

Try to use the Decision Boundaries Visualization to understand your model better.
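As mentioned earlier, the shape of the boundary depends on the classifier. As a final sketch (using sklearn.svm.SVC, which is not part of the tutorial above), an RBF-kernel SVM trained on the same data produces a curved boundary instead of a straight line:

from sklearn.svm import SVC

# A non-linear classifier on the same data gives a curved boundary
svm_clf = SVC(kernel="rbf", gamma=2)
svm_clf.fit(X_train, y_train)

DecisionBoundaryDisplay.from_estimator(svm_clf, X_train, response_method="predict")
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, edgecolors='k')
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("Decision Boundary of an RBF SVM")
plt.show()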
