How to Simplify Data with Dimensionality Reduction Techniques in Scikit-learn

Let’s learn how to perform Dimensionality Reduction with Scikit-Learn.


First, install the following Python libraries if you haven’t already.

pip install -U pandas scikit-learn matplotlib

After that, import all the packages used in this tutorial.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

Then, we load the Iris sample data for our tutorial.

iris = load_iris()
X = iris.data    # feature matrix (one row per flower)
y = iris.target  # integer class label for each row

df = pd.DataFrame(X, columns=iris.feature_names)
df['target'] = y

The sample dataset contains four features, which we will try to reduce to fewer dimensions.
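Before reducing anything, it can help to confirm what was loaded. The snippet below is a minimal, self-contained sketch that rebuilds the DataFrame from `load_iris()` and prints its shape, the number of samples per class, and each feature’s mean and standard deviation.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the same DataFrame used in this tutorial
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target

print(df.shape)                     # (150, 5): four features plus 'target'
print(df['target'].value_counts())  # number of samples per class
print(df.describe().loc[['mean', 'std']])  # per-column mean and spread
```

The features have different means and spreads, which is one reason the standardization step later in the tutorial matters.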

Dimensionality Reduction with Scikit-Learn

Dimensionality reduction is a technique for reducing the number of variables in a dataset while preserving as much of the relevant information as possible. It’s often used with high-dimensional data, where having too many features can hurt model performance.

We will first try Principal Component Analysis (PCA). This technique finds the directions of greatest variance in the data (the eigenvectors of the covariance matrix, ranked by their eigenvalues) and projects the data onto the top few of them, so that as much variance as possible is preserved. PCA is a popular dimensionality reduction technique because it is usable in many situations, but it works best when the relationships in the data are linear.

Because PCA is driven by variance, features measured on larger scales would dominate the components, so it’s better to standardize the features before using the method.

# Standardizing the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Applying PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
X_pca = pd.DataFrame(X_pca, columns=['PC1', 'PC2'])

The n_components hyperparameter sets how many dimensions PCA keeps, so you can adjust it to reduce the features to your intended number.
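One way to choose a value for n_components is to look at how much variance each component explains. The sketch below is self-contained and assumes the same standardized Iris data. It fits PCA with all components, prints the cumulative explained variance, and then passes a float to n_components so that scikit-learn keeps the fewest components reaching that share of variance. The 0.95 threshold is an arbitrary choice for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = load_iris().data
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components to inspect the variance each one explains
pca_full = PCA().fit(X_scaled)
cumulative = pca_full.explained_variance_ratio_.cumsum()
print(cumulative)

# A float between 0 and 1 keeps the fewest components whose
# cumulative explained variance reaches that threshold
pca_95 = PCA(n_components=0.95)
X_95 = pca_95.fit_transform(X_scaled)
print(pca_95.n_components_, X_95.shape)
```

If the cumulative curve flattens after the first few components, the remaining ones add little information and can usually be dropped.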

Another technique we will try is Linear Discriminant Analysis (LDA). Unlike PCA, it is supervised: it uses the class labels and reduces dimensionality by maximizing class separability, which is useful for classification tasks. The technique works by computing the within-class and between-class scatter matrices and maximizing the ratio of between-class to within-class variance. As a consequence, LDA can produce at most one fewer component than the number of classes.

Using the data we have standardized previously, we can also use LDA to reduce the dimension.

# Applying LDA
lda = LDA(n_components=2)
X_lda = lda.fit_transform(X_scaled, y)
X_lda = pd.DataFrame(X_lda, columns=['LDA1', 'LDA2'])
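LDA limits n_components to the smaller of the number of classes minus one and the number of features. The following self-contained sketch computes that limit for Iris, fits LDA with it, and prints how much of the between-class variance each discriminant axis explains. The variable name `max_components` is only illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)
y = iris.target

# LDA allows at most min(n_classes - 1, n_features) components
n_classes = len(iris.target_names)
max_components = min(n_classes - 1, X_scaled.shape[1])

lda = LinearDiscriminantAnalysis(n_components=max_components)
X_lda = lda.fit_transform(X_scaled, y)  # labels are required, unlike PCA

print(max_components, X_lda.shape)
print(lda.explained_variance_ratio_)  # share of between-class variance per axis
```

Asking for more components than this limit raises an error, so on a two-class problem LDA can only project onto a single axis.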

Lastly, we will try the t-distributed Stochastic Neighbor Embedding (t-SNE) technique. It converts pairwise distances between points in the high-dimensional space into probabilities that the points are neighbors, then searches for a low-dimensional layout that keeps those probabilities as similar as possible. It’s a preferred technique for visualizing high-dimensional data in 2 or 3 dimensions, especially when the structure is non-linear.

# Applying t-SNE (random_state fixes the otherwise random embedding)
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)
df_tsne = pd.DataFrame(X_tsne, columns=['t-SNE1', 't-SNE2'])
df_tsne['target'] = y
target_names = iris.target_names

# Create a scatter plot with one color per species
plt.figure(figsize=(10, 8))
colors = ['r', 'g', 'b']
for target, color in zip(range(len(target_names)), colors):
    subset = df_tsne[df_tsne['target'] == target]
    plt.scatter(subset['t-SNE1'], subset['t-SNE2'], color=color, label=target_names[target], alpha=0.6)

plt.title('t-SNE Visualization of the Iris Dataset')
plt.xlabel('t-SNE1')
plt.ylabel('t-SNE2')
plt.legend()
plt.show()

t-SNE result for Iris dataset with 2 components
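The shape of a t-SNE plot depends on its hyperparameters and on random initialization. The sketch below is self-contained and refits t-SNE with a few perplexity values (roughly, the number of neighbors each point considers) while fixing random_state so each run is repeatable. The perplexity values 5, 30, and 50 and the seed 42 are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE

X_scaled = StandardScaler().fit_transform(load_iris().data)

results = {}
for perplexity in (5, 30, 50):  # must stay below the number of samples
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=42)
    embedding = tsne.fit_transform(X_scaled)
    results[perplexity] = embedding
    # kl_divergence_ is the final value of the objective t-SNE minimizes
    print(perplexity, embedding.shape, tsne.kl_divergence_)
```

Plotting each embedding in `results` with the scatter code above shows how the clusters tighten or spread as perplexity changes. Distances between clusters in a t-SNE plot should be read with care, since the method mainly preserves local neighborhoods.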

You can also visualize the results from the other techniques, but t-SNE is a common go-to for visualizing high-dimensional data.
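As a sketch of how the PCA and LDA results could be visualized in the same way, the self-contained example below recomputes both two-dimensional projections and draws them side by side. The dictionary name `embeddings` and the axis labels are illustrative choices, not part of the original tutorial.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)
y = iris.target

# Two-dimensional projections from each technique
embeddings = {
    'PCA': PCA(n_components=2).fit_transform(X_scaled),
    'LDA': LinearDiscriminantAnalysis(n_components=2).fit_transform(X_scaled, y),
}

# One panel per technique, one color per species
fig, axes = plt.subplots(1, len(embeddings), figsize=(12, 5))
for ax, (name, emb) in zip(axes, embeddings.items()):
    for target, label in enumerate(iris.target_names):
        mask = y == target
        ax.scatter(emb[mask, 0], emb[mask, 1], label=label, alpha=0.6)
    ax.set_title(f'{name} projection of the Iris dataset')
    ax.set_xlabel('Component 1')
    ax.set_ylabel('Component 2')
    ax.legend()
fig.tight_layout()
```

Call `plt.show()` to display the figure. Because LDA uses the labels, its panel typically separates the classes more cleanly than the PCA panel.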

Overall, dimensionality reduction can help improve machine learning performance and your understanding of the data.

