How to Use Scikit-learn’s RandomizedSearchCV for Efficient Hyperparameter Tuning

Let’s learn how to tune hyperparameters efficiently with Scikit-Learn’s RandomizedSearchCV.

Preparation

You must install the Pandas, SciPy, and Scikit-Learn packages for the tutorial to work. The following command will do that.

pip install -U pandas scikit-learn scipy

Once everything is installed, we must import the packages we will use in this tutorial.

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import randint

Next, we load the Iris dataset as our sample data and hold out 20% of it as a test set.

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

With the dataset ready, let’s try to perform hyperparameter tuning with RandomizedSearchCV.

Efficient Hyperparameter Tuning with RandomizedSearchCV

Hyperparameter tuning is the process of selecting the optimal hyperparameter combination for a machine learning model. Hyperparameters are parameters that we set before the training begins and are not learned from the data, unlike the model parameters, which are acquired during the learning process.
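To make the distinction concrete, here is a small sketch (an illustration added for this explanation, with arbitrarily chosen values) that sets two hyperparameters on a random forest before training and then inspects what the model learned from the data after fitting.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# Hyperparameters: chosen by us before training starts.
# The values 20 and 3 are arbitrary illustrative choices.
clf = RandomForestClassifier(n_estimators=20, max_depth=3, random_state=0)
hyperparams = clf.get_params()
print(hyperparams["n_estimators"], hyperparams["max_depth"])

# Model parameters: learned from the data during fit().
clf.fit(X, y)
n_trees = len(clf.estimators_)          # the fitted decision trees
importances = clf.feature_importances_  # one value per input feature
print(n_trees, importances.shape)
```

The attributes ending in an underscore only exist after fitting, which is a handy way to tell learned state apart from the settings we pass in.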

One option is to try every combination and keep the one that scores best. However, an exhaustive search quickly becomes expensive in time and compute as the number of combinations grows. This is especially true in machine learning research, where the number of combinations can easily run into the millions.

RandomizedSearchCV makes tuning more efficient by evaluating only a fixed number of randomly sampled combinations instead of all of them, which gives better coverage of a large search space for the same budget. We can also use it to narrow down the promising hyperparameter ranges before running an exhaustive search.
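To get a feel for what random sampling means here, the sketch below draws a handful of candidates with scikit-learn’s ParameterSampler. The search space is a reduced version of the one used later in this tutorial, and n_iter=5 and random_state=0 are illustrative choices only.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

# A reduced search space: two integer distributions and one list of options.
param_dist = {
    "n_estimators": randint(10, 200),
    "max_depth": randint(1, 20),
    "bootstrap": [True, False],
}

# Draw 5 random candidates; random_state makes the draw repeatable.
candidates = list(ParameterSampler(param_dist, n_iter=5, random_state=0))
for params in candidates:
    print(params)
```

Each printed dictionary is one candidate that a randomized search would train and cross-validate, instead of walking through every possible combination.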

Let’s try RandomizedSearchCV on the sample data. First, we need to instantiate the model.

model = RandomForestClassifier()

Then, we define the hyperparameter distributions to sample from. You need to know which hyperparameters the model accepts before you set them.

param_dist = {
    'n_estimators': randint(10, 200),
    'max_depth': randint(1, 20),
    'min_samples_split': randint(2, 11),
    'min_samples_leaf': randint(1, 11),
    'bootstrap': [True, False]
}

In the example above, I defined 649,800 possible combinations of hyperparameters (190 × 19 × 9 × 10 × 2, because randint excludes its upper bound). If we tried out all the combinations, it would take a lot of time, so we sample only a subset of them using RandomizedSearchCV.
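The size of this search space can be checked with a quick count. The sketch below assumes only that randint(low, high) draws integers from low up to, but not including, high, so each integer hyperparameter contributes high - low distinct values.

```python
from math import prod

# (low, high) bounds passed to randint; the upper bound is exclusive.
int_ranges = {
    "n_estimators": (10, 200),
    "max_depth": (1, 20),
    "min_samples_split": (2, 11),
    "min_samples_leaf": (1, 11),
}
n_bootstrap_options = 2  # [True, False]

# Number of distinct values each integer hyperparameter can take.
values_per_param = {name: high - low for name, (low, high) in int_ranges.items()}

# The search space size is the product of the per-parameter counts.
total_combinations = prod(values_per_param.values()) * n_bootstrap_options

print(values_per_param)
print(total_combinations)  # 649800
```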

randomized_search = RandomizedSearchCV(
    model,
    param_distributions=param_dist,
    n_iter=100,
    cv=5,
    scoring='accuracy', 
    n_jobs=-1
)
randomized_search.fit(X_train, y_train)

We only sample 100 out of 649,800 possible combinations in the code above. You can control the number of sampled combinations with the n_iter parameter.
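As a quick check of how n_iter behaves, the sketch below runs a much smaller search and counts the evaluated candidates. The reduced search space, n_iter=8, cv=3, and the random_state values are assumptions made to keep the example fast and repeatable; they are not the settings used above.

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions={
        "n_estimators": randint(10, 50),
        "max_depth": randint(1, 20),
    },
    n_iter=8,          # number of sampled candidates (illustrative value)
    cv=3,
    scoring="accuracy",
    random_state=42,   # fixes which candidates get sampled
)
search.fit(X, y)

# cv_results_["params"] holds one entry per sampled candidate.
n_candidates = len(search.cv_results_["params"])
print(n_candidates)
```

Setting random_state on the search is optional, but without it each run samples different candidates, so the reported best parameters can change from run to run.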

We can see the best sampled hyperparameter combination and its cross-validation score using the following code.

print(f"Best Parameters: {randomized_search.best_params_}")
print(f"Best Cross-Validation Score: {randomized_search.best_score_}")

The output (your values may differ, because the search above does not set random_state):

Best Parameters: {'bootstrap': True, 'max_depth': 15, 'min_samples_leaf': 1, 'min_samples_split': 6, 'n_estimators': 70}
Best Cross-Validation Score: 0.9666666666666666
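Beyond the single best result, it can be useful to compare all sampled candidates. The sketch below loads cv_results_ into a Pandas DataFrame and sorts it by rank; it reruns a smaller search (n_iter=6, cv=3, both illustrative) so that it is self-contained.

```python
import pandas as pd
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions={
        "n_estimators": randint(10, 50),
        "max_depth": randint(1, 20),
    },
    n_iter=6,
    cv=3,
    random_state=42,
)
search.fit(X, y)

# One row per sampled candidate, with its mean and std CV score.
results = pd.DataFrame(search.cv_results_)
columns = ["rank_test_score", "mean_test_score", "std_test_score", "params"]
top = results[columns].sort_values("rank_test_score").head(5)
print(top)
```

If several candidates score almost the same, the simpler one (for example, fewer trees) may be the more practical choice.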

This is the best combination among the 100 samples we tried. You can run several searches, narrowing the hyperparameter ranges each time, to shrink the space that a later search has to cover.

Combining RandomizedSearchCV with an exhaustive search over the narrowed ranges can help you find a better model.
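One way this combination could look is sketched below: a coarse randomized search, followed by a GridSearchCV over a small neighbourhood of its best result, and a final check on the held-out test set. The neighbourhood offsets, the reduced search space, and the small n_iter and cv values are all assumptions chosen to keep the example short.

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Stage 1: coarse random search over wide ranges.
coarse = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions={
        "n_estimators": randint(10, 100),
        "max_depth": randint(1, 20),
    },
    n_iter=8,
    cv=3,
    random_state=42,
)
coarse.fit(X_train, y_train)
best_n = int(coarse.best_params_["n_estimators"])
best_depth = int(coarse.best_params_["max_depth"])

# Stage 2: exhaustive grid in a small neighbourhood of the stage-1 winner.
# The +/- offsets are arbitrary illustrative choices.
fine_grid = {
    "n_estimators": sorted({max(10, best_n - 10), best_n, best_n + 10}),
    "max_depth": sorted({max(1, best_depth - 1), best_depth, best_depth + 1}),
}
fine = GridSearchCV(RandomForestClassifier(random_state=42), fine_grid, cv=3)
fine.fit(X_train, y_train)

# Score the refined model on data that neither search has seen.
test_accuracy = fine.best_estimator_.score(X_test, y_test)
print(fine.best_params_, test_accuracy)
```

Scoring on X_test at the very end gives an estimate that is not influenced by the tuning itself, which is the reason for holding that split out at the start.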
