How to Use Feature Selection Techniques with Scikit-learn to Improve Your Model


Let’s learn how to perform feature selection to improve your machine learning model.

Preparation

We will use the NumPy, Pandas, and Scikit-learn Python packages, so make sure they are installed in your environment. If they are not, install them via pip with the following command:

pip install numpy pandas scikit-learn

Then, import the Python packages that we will use with the following code:

import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2, RFE
from sklearn.linear_model import Lasso, LogisticRegression

With that, we will create our sample data. For this tutorial, we will generate 15 numerical and 5 categorical features, which become 30 features after one-hot encoding, to show how feature selection works.

np.random.seed(42)
n_samples = 200

# Create non-negative numerical features (the chi2 test used later requires non-negative values)
X_numerical = np.random.rand(n_samples, 15) * 100

# Create categorical features
X_categorical = np.random.choice(['A', 'B', 'C'], size=(n_samples, 5))

# Combine numerical and categorical features
X = np.hstack((X_numerical, X_categorical))
y = np.random.choice([0, 1], size=n_samples)

numerical_feature_names = [f'num_{i}' for i in range(15)]
categorical_feature_names = [f'cat_{i}' for i in range(5)]
feature_names = numerical_feature_names + categorical_feature_names

data = pd.DataFrame(X, columns=feature_names)
data['target'] = y

# np.hstack cast every value to a string, so restore the numeric dtypes
data[numerical_feature_names] = data[numerical_feature_names].astype(float)

# One-hot encode the categorical features
data = pd.get_dummies(data, columns=categorical_feature_names)
X = data.drop('target', axis=1)
y = data['target']
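
As a quick sanity check, you can confirm that one-hot encoding expanded the data to 30 feature columns:

print(X.shape)  # expected: (200, 30) -- 15 numerical + 15 one-hot encoded columns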

Feature Selection Techniques with Scikit-learn

The purpose of feature selection is to choose a subset of relevant features from the available ones, which can improve the performance of a machine learning model. Scikit-learn provides many feature selection techniques that we can try.

First, we will try the filter method. This technique uses statistical measures to score each feature and ranks the features by that score. Note that the chi2 score function requires non-negative feature values, which is why we generated non-negative data above.

selector = SelectKBest(score_func=chi2, k=10)
X_kbest = selector.fit_transform(X, y)

# Get the selected feature names
selected_features = X.columns[selector.get_support()]
print(selected_features)

The output:

Index(['num_0', 'num_1', 'num_2', 'num_3', 'num_4', 'num_5', 'num_8', 'num_9', 'num_14', 'cat_0_A'], dtype='object')

From the 30 features, we end up with the 10 highest-scoring ones according to the statistical measure. Adjust the k parameter to specify how many features you want to retain.
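
To see why these features were chosen, you can inspect the chi2 score the fitted selector assigned to each feature. Here is a minimal sketch reusing the selector and X defined above:

# Pair each feature with its chi2 score and show the top 10
scores = pd.DataFrame({'Feature': X.columns, 'Chi2 Score': selector.scores_})
print(scores.sort_values('Chi2 Score', ascending=False).head(10))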

Next, we will try the wrapper method. It treats feature selection as a search problem, where combinations of features are tested and compared against each other. One technique that falls under this method is Recursive Feature Elimination (RFE).

model = LogisticRegression()
rfe = RFE(model, n_features_to_select=10)
rfe = rfe.fit(X, y)

# Get the ranking of features (rank 1 means the feature was selected)
feature_ranking = pd.DataFrame({'Feature': X.columns, 'Ranking': rfe.ranking_})
feature_ranking = feature_ranking.sort_values('Ranking')

selected_features = feature_ranking[feature_ranking['Ranking'] == 1]['Feature']
print(selected_features)

The output:

15    cat_0_A
16    cat_0_B
17    cat_0_C
20    cat_1_C
22    cat_2_B
23    cat_2_C
24    cat_3_A
25    cat_3_B
26    cat_3_C
27    cat_4_A
Name: Feature, dtype: object

Like the previous technique, we can set the number of features to select. However, RFE relies on a machine learning model to judge feature importance: the process recursively removes the least important features according to the fitted model until the intended number of features remains.
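
If you would rather not fix the number of features in advance, Scikit-learn’s RFECV variant chooses it via cross-validation. A minimal sketch, reusing the X and y from above (the max_iter bump just avoids convergence warnings on this unscaled data):

from sklearn.feature_selection import RFECV

# RFECV eliminates features recursively and keeps the count that
# maximizes the cross-validated score
rfecv = RFECV(LogisticRegression(max_iter=1000), step=1, cv=5, scoring='accuracy')
rfecv.fit(X, y)

print(f'Optimal number of features: {rfecv.n_features_}')
print(X.columns[rfecv.support_])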

Lastly, we will try the embedded method, a technique tied to a specific machine learning algorithm: the selection happens during the model training process itself.

lasso = Lasso(alpha=0.1)  # higher alpha -> stronger shrinkage, more zero coefficients
lasso.fit(X, y)

# Get the coefficients of the features
feature_coefficients = pd.DataFrame({'Feature': X.columns, 'Coefficient': lasso.coef_})

selected_features = feature_coefficients[feature_coefficients['Coefficient'] != 0]['Feature']
print(selected_features)

The output:

0      num_0
1      num_1
2      num_2
3      num_3
4      num_4
5      num_5
6      num_6
8      num_8
9      num_9
10    num_10
11    num_11
13    num_13
14    num_14
Name: Feature, dtype: object

LASSO is a linear model with L1 regularization, which can shrink feature coefficients all the way to zero. Features whose coefficients are zero are effectively eliminated from the model, so the non-zero ones form our selected subset.
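
Scikit-learn also offers SelectFromModel, a convenience wrapper around this embedded approach that works with any estimator exposing coef_ or feature_importances_. A minimal sketch with the same Lasso model:

from sklearn.feature_selection import SelectFromModel

# Keep features whose absolute coefficient exceeds the threshold
# (for an L1-penalized model, effectively the non-zero coefficients)
sfm = SelectFromModel(Lasso(alpha=0.1))
sfm.fit(X, y)

print(X.columns[sfm.get_support()])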

These techniques can help improve your machine learning model, so keep exploring feature selection beyond the methods covered here.
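
As a final check, you can compare cross-validated scores with and without the selected features. Here is a minimal sketch using the SelectKBest output (X_kbest) from earlier; on this random toy data any difference is just noise, but on real data this tells you whether selection actually helped:

from sklearn.model_selection import cross_val_score

# max_iter raised only to avoid convergence warnings on unscaled data
model = LogisticRegression(max_iter=1000)

# Score on all 30 features vs. the 10 chosen by SelectKBest
print(cross_val_score(model, X, y, cv=5).mean())
print(cross_val_score(model, X_kbest, y, cv=5).mean())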
