Optimizing Scikit-learn Models for Better Performance

Let’s learn how to optimize our scikit-learn models to improve their performance.

Preparation

In this tutorial, we need the scikit-learn, scipy, numpy, and pandas packages. If you haven’t installed them, you can do so with the following command.

pip install -U pandas scikit-learn scipy numpy

With the package ready, we will prepare our sample dataset. In this tutorial, we will use the built-in wine dataset from scikit-learn.

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

wine = load_wine()
X, y = wine.data, wine.target
feature_names = wine.feature_names

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

With everything ready, let’s move on to the next part.

Scikit-Learn Model Optimization

When we talk about model performance, we mean the metrics we use to evaluate the model, such as accuracy, RMSE, or ROC-AUC. In scikit-learn, many machine learning algorithms can be optimized to perform better.

The way we optimize a model varies depending on the task, so here we will focus on general approaches that apply broadly. Let’s start with data preprocessing.

There are many ways to improve model performance with data preprocessing. The most powerful is feature engineering, but it requires domain knowledge. Other common techniques are imputing missing values and scaling the features.

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Filling Missing Data
imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)

# Data Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_imputed)
X_test_scaled = scaler.transform(X_test_imputed)
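As a brief illustration of feature engineering, here is a minimal sketch that derives a ratio feature from two existing wine features. The `flavanoid_ratio` column is a hypothetical example; whether such a ratio is actually informative depends on domain knowledge of wine chemistry.

```python
import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine()
df = pd.DataFrame(wine.data, columns=wine.feature_names)

# Hypothetical engineered feature: the share of phenols that are flavanoids.
df['flavanoid_ratio'] = df['flavanoids'] / df['total_phenols']
print(df[['flavanoids', 'total_phenols', 'flavanoid_ratio']].head())
```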

Removing outliers from the data can also improve the model’s performance, although we should first understand why a data point is considered an outlier. In this example, we remove outliers based on the Z-score.

import numpy as np
from scipy.stats import zscore

# Keep only the rows where every feature's Z-score is below the threshold
z_scores = np.abs(zscore(X_train_scaled))
threshold = 3
mask = (z_scores < threshold).all(axis=1)

X_train_no_outliers = X_train_scaled[mask]
y_train_no_outliers = y_train[mask]

The next way to optimize model performance is to select only the most important features. In this example, we use Recursive Feature Elimination (RFE) to keep the five most important features.

from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE

model = LogisticRegression(max_iter=10000)
rfe = RFE(model, n_features_to_select=5)
X_train_rfe = rfe.fit_transform(X_train_no_outliers, y_train_no_outliers)
X_test_rfe = rfe.transform(X_test_scaled)

selected_features = np.array(feature_names)[rfe.support_]
print(selected_features)

The output:

['alcohol' 'flavanoids' 'color_intensity' 'hue' 'proline']

From the model standpoint, we can also improve performance through hyperparameter optimization, for example with a grid search.

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('classifier', LogisticRegression(max_iter=10000))
])

param_grid = {
    'classifier__C': [0.1, 1, 10],
    'classifier__solver': ['liblinear', 'lbfgs']
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train_rfe, y_train_no_outliers)

best_model = grid_search.best_estimator_
print(grid_search.best_params_)

Lastly, we should use an appropriate evaluation method to assess the model’s robustness. Cross-validation is a common choice.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import cross_val_score

y_pred = best_model.predict(X_test_rfe)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, average='weighted'))
print("Recall:", recall_score(y_test, y_pred, average='weighted'))
print("F1 Score:", f1_score(y_test, y_pred, average='weighted'))

scores = cross_val_score(best_model, X_train_no_outliers, y_train_no_outliers, cv=5, scoring='accuracy')
print("Cross-Validation Accuracy Scores:", scores)

The output:

Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1 Score: 1.0
Cross-Validation Accuracy Scores: [0.96296296 0.96296296 1.         0.96153846 1.        ]
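To summarize the fold scores, we can report their mean and standard deviation, which gives a more compact picture of the model’s robustness. The sketch below plugs in the five fold scores from the output above.

```python
import numpy as np

# The five cross-validation fold scores from the run above
scores = np.array([0.96296296, 0.96296296, 1.0, 0.96153846, 1.0])
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
# -> CV accuracy: 0.977 +/- 0.018
```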

Try to master all these methods to improve your machine learning models’ performance.
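As a closing sketch, the steps above can also be combined into a single pipeline, so that imputation, scaling, and feature selection are refit inside each cross-validation fold and no information leaks from the validation data. This is one possible arrangement, not the only one; the step names here are arbitrary.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=42)

# Every preprocessing step lives inside the pipeline, so GridSearchCV
# refits each step per fold and avoids data leakage.
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('rfe', RFE(LogisticRegression(max_iter=10000), n_features_to_select=5)),
    ('classifier', LogisticRegression(max_iter=10000)),
])

param_grid = {
    'classifier__C': [0.1, 1, 10],
    'classifier__solver': ['liblinear', 'lbfgs'],
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)
print("Best params:", grid_search.best_params_)
print("Test accuracy:", grid_search.best_estimator_.score(X_test, y_test))
```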
