How to Leverage Scikit-learn’s Built-in Datasets for Machine Learning Practice

Let’s use Scikit-Learn’s built-in datasets for your machine learning practice.
Preparation

We must install the Scikit-Learn package, as we will use it throughout the tutorial. We also need the Pandas and Matplotlib packages.

pip install -U pandas scikit-learn matplotlib

With the packages installed, let’s get into the main part of the tutorial.
Leverage Scikit-Learn’s Built-In Datasets

Scikit-Learn ships with many open datasets that we can use for machine learning experiments. Most of the tasks we want to try out can be facilitated with these built-in datasets. Let’s explore what is available.

from sklearn import datasets
print(datasets.__all__)
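The names in `datasets.__all__` follow a naming convention: `load_*` functions return small datasets bundled with the package, `fetch_*` functions download larger datasets on first use, and `make_*` functions generate synthetic data. A quick sketch to group them:

```python
from sklearn import datasets

# group the public dataset utilities by their prefix convention
loaders = [n for n in datasets.__all__ if n.startswith(("load_", "fetch_", "make_"))]
bundled = [n for n in loaders if n.startswith("load_")]
print("bundled loaders:", bundled)
```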

There are many built-in datasets we can use. Let’s try an example for each of three tasks: Tabular Classification, Tabular Regression, and Image Classification.

For the Tabular Classification task, we can use the built-in Iris dataset. Let’s examine the Iris dataset and use it for classification model training.

import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df_iris = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df_iris['target'] = iris.target

df_iris.info()

The output:

RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
 4   target             150 non-null    int32  
dtypes: float64(4), int32(1)
memory usage: 5.4 KB
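Before training, it can help to check what the integer target values mean. The loader exposes this via the `target_names` attribute:

```python
from sklearn.datasets import load_iris

iris = load_iris()
# map each integer class label to its species name
print(dict(enumerate(iris.target_names)))
```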

As you can see, the dataset contains four different columns with one target column. Let’s use it for classification model training.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=200)  # raise the iteration cap to avoid a ConvergenceWarning
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print(classification_report(y_test, y_pred))

The output:

             precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30

The Iris dataset is a great built-in dataset for experiments in the classification task.
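As a convenience, most loaders also accept `as_frame=True` (available since scikit-learn 0.23), which returns the data as pandas objects directly, skipping the manual DataFrame construction above:

```python
from sklearn.datasets import load_iris

# as_frame=True returns pandas objects; .frame combines features and target
iris_df = load_iris(as_frame=True).frame
print(iris_df.shape)
```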

How about the Tabular Regression task? We can use the California Housing dataset. It’s a built-in dataset for regression tasks where we try to predict the housing price.

from sklearn.datasets import fetch_california_housing

california = fetch_california_housing()

df_california = pd.DataFrame(data=california.data, columns=california.feature_names)
df_california['target'] = california.target

df_california.info()

The output:

RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   MedInc      20640 non-null  float64
 1   HouseAge    20640 non-null  float64
 2   AveRooms    20640 non-null  float64
 3   AveBedrms   20640 non-null  float64
 4   Population  20640 non-null  float64
 5   AveOccup    20640 non-null  float64
 6   Latitude    20640 non-null  float64
 7   Longitude   20640 non-null  float64
 8   target      20640 non-null  float64
dtypes: float64(9)
memory usage: 1.4 MB

There are eight features for the predictors and one target column to predict in this California Housing dataset.

We can use the dataset above for the Regression modeling task:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

X_train, X_test, y_train, y_test = train_test_split(california.data, california.target, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
print("R^2 Score:", r2_score(y_test, y_pred))

The output:

Mean Squared Error: 0.5558915986952426
R^2 Score: 0.5757877060324521
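The same train-and-evaluate pattern applies to any of the built-in regression datasets. As a sketch, here is a variant that adds a feature-scaling step via a Pipeline, shown on the bundled `load_diabetes` dataset (chosen here only to avoid the download that `fetch_california_housing` performs):

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

# load_diabetes is bundled with scikit-learn, so no download is needed
diabetes = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    diabetes.data, diabetes.target, test_size=0.2, random_state=42
)

# scaling keeps features on comparable ranges before the linear model
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
pipe.fit(X_train, y_train)
score = r2_score(y_test, pipe.predict(X_test))
print("R^2 Score:", score)
```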

Lastly, we can use a built-in image dataset called the Digits dataset for image classification tasks. Let’s examine it.

from sklearn.datasets import load_digits
import matplotlib.pyplot as plt

digits = load_digits()
fig, axes = plt.subplots(1, 5, figsize=(10, 3))

for ax, image, label in zip(axes, digits.images, digits.target):
    ax.set_axis_off()
    ax.imshow(image)
    ax.set_title(f'Target: {label}')

plt.show()
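Note the two views the loader provides: `digits.images` holds the raw 8x8 pixel grids (useful for plotting, as above), while `digits.data` holds the same images flattened into 64-dimensional feature vectors, which is the shape the classifier expects:

```python
from sklearn.datasets import load_digits

digits = load_digits()
# images: (n_samples, 8, 8) pixel grids; data: (n_samples, 64) flattened vectors
print(digits.images.shape)
print(digits.data.shape)
```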

(Figure: the first five Digits images with their target labels)

The dataset contains collections of handwritten number images with their actual target. We can use them to experiment with an image classification task using the code below.

from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, classification_report

X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.2, random_state=42)

model = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print(classification_report(y_test, y_pred))

The output:

             precision    recall  f1-score   support

           0       0.97      1.00      0.99        33
           1       1.00      1.00      1.00        28
           2       0.97      1.00      0.99        33
           3       1.00      0.97      0.99        34
           4       1.00      1.00      1.00        46
           5       0.96      0.94      0.95        47
           6       0.97      0.97      0.97        35
           7       1.00      0.97      0.99        34
           8       0.97      1.00      0.98        30
           9       0.97      0.97      0.97        40

    accuracy                           0.98       360
   macro avg       0.98      0.98      0.98       360
weighted avg       0.98      0.98      0.98       360

There are still many more built-in datasets in Scikit-Learn. Find the one that is suitable for your use case.
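When no built-in dataset fits, the `make_*` generators in the same module can produce synthetic data with the shape and difficulty you need. A minimal sketch:

```python
from sklearn.datasets import make_classification

# generate a synthetic binary classification dataset with 10 features,
# 5 of which actually carry signal
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=42
)
print(X.shape, y.shape)
```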
