Let’s learn how to perform Dimensionality Reduction with Scikit-Learn.

## Preparation

First, install the following Python libraries if you haven’t already. You can skip this step if you already have them installed.

```shell
pip install -U pandas scikit-learn matplotlib
```

After that, import all the packages used in this tutorial.

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
```

Then, we load the Iris sample dataset for this tutorial.

```python
iris = load_iris()
X = iris.data
y = iris.target

df = pd.DataFrame(X, columns=iris.feature_names)
df['target'] = y
```

The dataset contains four features that we will try to reduce.

## Dimensionality Reduction with Scikit-Learn

Dimensionality reduction is a technique for reducing the number of variables in a dataset while preserving as much relevant information as possible. It's often used with high-dimensional data, where having too many features can hurt model performance.

We will first try PCA (Principal Component Analysis). This technique computes the eigenvectors and eigenvalues of the data's covariance matrix and projects the data onto the directions that preserve the most variance. PCA is a popular dimensionality reduction technique because it is usable in many situations, but it works best when the relationships in the data are linear.

Because PCA is sensitive to the scale of each feature, it's better to standardize the features before applying it.

```python
# Standardizing the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Applying PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
X_pca = pd.DataFrame(X_pca, columns=['PC1', 'PC2'])
```

You can reduce the features to any intended number with PCA by tweaking the **n_components** hyperparameter.
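To decide how many components to keep, you can inspect how much variance each component explains. A minimal sketch, refitting PCA on the same scaled Iris features without limiting the number of components:

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components to see the full variance breakdown
pca_full = PCA().fit(X_scaled)
print(pca_full.explained_variance_ratio_)
```

For Iris, the first two components explain most of the variance, which is why `n_components=2` is a reasonable choice here.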

Another technique we will try is Linear Discriminant Analysis (LDA). It reduces dimensionality by maximizing class separability, which makes it useful for classification tasks. It works by computing the within-class and between-class scatter matrices and finding the projection that maximizes the ratio of between-class to within-class variance.

Using the data we standardized previously, we can also use LDA to reduce the dimensionality. Note that LDA is supervised, so it needs the class labels as well.

```python
# Applying LDA
lda = LDA(n_components=2)
X_lda = lda.fit_transform(X_scaled, y)
X_lda = pd.DataFrame(X_lda, columns=['LDA1', 'LDA2'])
```
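Unlike PCA, LDA caps `n_components` at `min(n_features, n_classes - 1)`; with the three Iris classes, that cap is 2. LDA also exposes an `explained_variance_ratio_` attribute you can inspect, as in this sketch:

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# n_components must be <= min(n_features, n_classes - 1) = 2 for Iris
lda = LDA(n_components=2).fit(X_scaled, y)
print(lda.explained_variance_ratio_)
```

For Iris, the first discriminant direction captures nearly all of the between-class separation on its own.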

Lastly, we will try t-distributed Stochastic Neighbor Embedding (t-SNE). This technique converts pairwise similarities between high-dimensional points into probabilities and finds a low-dimensional embedding that preserves them, making cluster structure visible. It's a preferred technique for visualizing high-dimensional, especially non-linear, data in two or three dimensions.

```python
tsne = TSNE(n_components=2)
X_tsne = tsne.fit_transform(X_scaled)

df_tsne = pd.DataFrame(X_tsne, columns=['t-SNE1', 't-SNE2'])
df_tsne['target'] = y
target_names = iris.target_names

# Create a scatter plot
plt.figure(figsize=(10, 8))
colors = ['r', 'g', 'b']
for target, color in zip(range(len(target_names)), colors):
    subset = df_tsne[df_tsne['target'] == target]
    plt.scatter(subset['t-SNE1'], subset['t-SNE2'], c=color,
                label=target_names[target], alpha=0.6)

plt.title('t-SNE Visualization of the Iris Dataset')
plt.xlabel('t-SNE1')
plt.ylabel('t-SNE2')
plt.legend(title='Target')
plt.show()
```

*t-SNE result for the Iris dataset with 2 components*

You can also visualize the results of the other techniques the same way, but t-SNE is the go-to choice for high-dimensional visualization.
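For example, the PCA projection computed earlier can be plotted with the same pattern. A self-contained sketch, recomputing the projection so the snippet runs on its own:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)

# Project onto the first two principal components
X_pca = pd.DataFrame(PCA(n_components=2).fit_transform(X_scaled),
                     columns=['PC1', 'PC2'])
X_pca['target'] = iris.target

plt.figure(figsize=(10, 8))
for target, color in zip(range(len(iris.target_names)), ['r', 'g', 'b']):
    subset = X_pca[X_pca['target'] == target]
    plt.scatter(subset['PC1'], subset['PC2'], c=color,
                label=iris.target_names[target], alpha=0.6)
plt.title('PCA Visualization of the Iris Dataset')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend(title='Target')
plt.show()
```

Since PCA is deterministic, this plot is reproducible across runs, whereas t-SNE can produce a different layout each time unless you fix `random_state`.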

Overall, dimensionality reduction can both improve machine learning performance and deepen your understanding of the data.

## Additional Resources

- What Is Dimension Reduction In Data Science?
- Dimensionality Reduction Techniques in Data Science
- 6 Dimensionality Reduction Algorithms With Python