One of the most common clustering algorithms in machine learning is known as **k-means clustering**.

K-means clustering is a technique in which we place each observation in a dataset into one of *K* clusters.

The end goal is to have *K* clusters in which the observations within each cluster are quite similar to each other while the observations in different clusters are quite different from each other.

In practice, we use the following steps to perform K-means clustering:

**1. Choose a value for K.**

- First, we must decide how many clusters we’d like to identify in the data. Often we have to simply test several different values for *K* and analyze the results to see which number of clusters seems to make the most sense for a given problem.

**2. Randomly assign each observation to an initial cluster, from 1 to K.**

**3. Perform the following procedure until the cluster assignments stop changing.**

- For each of the *K* clusters, compute the cluster *centroid*. This is simply the vector of the *p* feature means for the observations in the *k*th cluster.
- Assign each observation to the cluster whose centroid is closest. Here, *closest* is defined using Euclidean distance.
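To make the three steps above concrete, here is a minimal NumPy sketch of the procedure. The function name `simple_kmeans`, its parameters, the toy data, and the way an empty cluster is re-seeded are all assumptions made for illustration; this is not how scikit-learn implements the algorithm.

```python
import numpy as np

def simple_kmeans(X, k, seed=0, max_iter=100):
    """Toy k-means that follows the steps described above (illustrative only)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Step 2: randomly assign each observation to one of the k clusters,
    # arranged so that every cluster starts with at least one observation
    labels = rng.permutation(np.arange(n) % k)
    for _ in range(max_iter):
        # Step 3a: each centroid is the vector of feature means in its cluster
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j)
            else X[rng.integers(n)]  # assumption: re-seed an empty cluster
            for j in range(k)
        ])
        # Step 3b: assign each observation to the closest centroid (Euclidean)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Stop once the cluster assignments no longer change
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centroids

# Hypothetical toy data: two well-separated groups of three points each
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
labels, centroids = simple_kmeans(X, k=2)
print(labels)
print(centroids)
```

The cluster numbers themselves are arbitrary; what matters is which observations end up sharing a label.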

The following step-by-step example shows how to perform k-means clustering in Python by using the **KMeans** class from the **sklearn** module.

**Step 1: Import Necessary Modules**

First, we’ll import all of the modules that we will need to perform k-means clustering:

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
```

**Step 2: Create the DataFrame**

Next, we’ll create a DataFrame that contains the following three variables for 20 different basketball players:

- points
- assists
- rebounds

The following code shows how to create this pandas DataFrame:

```python
#create DataFrame
df = pd.DataFrame({'points': [18, np.nan, 19, 14, 14, 11, 20, 28, 30, 31,
                              35, 33, 29, 25, 25, 27, 29, 30, 19, 23],
                   'assists': [3, 3, 4, 5, 4, 7, 8, 7, 6, 9, 12, 14,
                               np.nan, 9, 4, 3, 4, 12, 15, 11],
                   'rebounds': [15, 14, 14, 10, 8, 14, 13, 9, 5, 4,
                                11, 6, 5, 5, 3, 8, 12, 7, 6, 5]})

#view first five rows of DataFrame
print(df.head())
```

```
   points  assists  rebounds
0    18.0      3.0        15
1     NaN      3.0        14
2    19.0      4.0        14
3    14.0      5.0        10
4    14.0      4.0         8
```

We will use k-means clustering to group together players that are similar based on these three metrics.

**Step 3: Clean & Prep the DataFrame**

Next, we’ll perform the following steps:

- Use **dropna()** to drop rows with NaN values in any column
- Use **StandardScaler()** to scale each variable to have a mean of 0 and a standard deviation of 1

The following code shows how to do so:

```python
#drop rows with NA values in any columns
df = df.dropna()

#create scaled DataFrame where each variable has mean of 0 and standard dev of 1
scaled_df = StandardScaler().fit_transform(df)

#view first five rows of scaled DataFrame
print(scaled_df[:5])
```

```
[[-0.86660275 -1.22683918  1.72722524]
 [-0.72081911 -0.96077767  1.45687694]
 [-1.44973731 -0.69471616  0.37548375]
 [-1.44973731 -0.96077767 -0.16521285]
 [-1.88708823 -0.16259314  1.45687694]]
```

**Note**: We use scaling so that each variable has equal importance when fitting the k-means algorithm. Otherwise, the variables with the widest ranges would have too much influence.
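As a quick check on what the scaling step does, the sketch below compares **StandardScaler()** with a manual z-score on a small made-up array. The array values are hypothetical and chosen only so that one column has a much wider range than the other.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: column 0 has a much wider range than column 1
X = np.array([[10.0, 1.0], [20.0, 2.0], [30.0, 3.0], [40.0, 6.0]])

# Manual z-score: subtract the column mean, divide by the population
# standard deviation (ddof=0), which is what StandardScaler uses
manual = (X - X.mean(axis=0)) / X.std(axis=0)

scaled = StandardScaler().fit_transform(X)

# Both approaches should agree, and each scaled column should have
# mean 0 and standard deviation 1
print(np.allclose(manual, scaled))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, a one-unit difference means "one standard deviation" in every column, so no single variable dominates the Euclidean distances.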

**Step 4: Find the Optimal Number of Clusters**

To perform k-means clustering in Python, we can use the **KMeans** class from the **sklearn** module.

This class is instantiated using the following basic syntax:

**KMeans(init='random', n_clusters=8, n_init=10, random_state=None)**

where:

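- **init**: Controls the initialization technique.
- **n_clusters**: The number of clusters to place observations in.
- **n_init**: The number of initializations to perform. With a value of 10, the k-means algorithm runs 10 times from different starting points and returns the run with the lowest SSE. This was the default in older scikit-learn releases; more recent releases default to 'auto', so it is safest to set this value explicitly.
- **random_state**: An integer value you can pick to make the results of the algorithm reproducible.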
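The effect of **random_state** can be seen by fitting the same model twice. The sketch below uses synthetic data generated with NumPy (an assumption for illustration, not the basketball dataset) and checks that two fits with the same seed give identical results.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical synthetic data: two groups of 20 points in two dimensions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(20, 2)),
               rng.normal(6, 1, size=(20, 2))])

# Fit the same configuration twice with the same random_state
fit_a = KMeans(init="random", n_clusters=2, n_init=10, random_state=1).fit(X)
fit_b = KMeans(init="random", n_clusters=2, n_init=10, random_state=1).fit(X)

# With a fixed seed, both fits should produce the same labels and SSE
print(np.array_equal(fit_a.labels_, fit_b.labels_))
print(np.isclose(fit_a.inertia_, fit_b.inertia_))
```

Leaving **random_state** as None means each run may start from different initial centroids, so cluster numbering and occasionally the clusters themselves can differ between runs.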

The most important argument in this function is n_clusters, which specifies how many clusters to place the observations in.

However, we don’t know beforehand how many clusters are optimal, so one common approach is to create a plot that displays the number of clusters along with the SSE (sum of squared errors) of the model.

Typically when we create this type of plot we look for an “elbow” where the sum of squares begins to “bend” or level off. This is typically the optimal number of clusters.

The following code shows how to create this type of plot that displays the number of clusters on the x-axis and the SSE on the y-axis:

```python
#initialize kmeans parameters
kmeans_kwargs = {
    "init": "random",
    "n_init": 10,
    "random_state": 1,
}

#create list to hold SSE values for each k
sse = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, **kmeans_kwargs)
    kmeans.fit(scaled_df)
    sse.append(kmeans.inertia_)

#visualize results
plt.plot(range(1, 11), sse)
plt.xticks(range(1, 11))
plt.xlabel("Number of Clusters")
plt.ylabel("SSE")
plt.show()
```

In this plot it appears that there is an elbow or “bend” at k = **3 clusters**.

Thus, we will use 3 clusters when fitting our k-means clustering model in the next step.

**Note**: In real-world applications, it’s recommended to use this plot in combination with domain expertise to pick how many clusters to use.
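The SSE values in the elbow plot come from the **inertia_** attribute of the fitted model. As a rough illustration of what that number measures, the sketch below recomputes it by hand as the sum of squared distances from each observation to the centroid of its assigned cluster. The synthetic data and the variable names `assigned_centroids` and `manual_sse` are assumptions for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical synthetic data: two groups of 15 points in two dimensions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(15, 2)),
               rng.normal(5, 1, size=(15, 2))])

kmeans = KMeans(init="random", n_clusters=2, n_init=10, random_state=1).fit(X)

# Look up the centroid of the cluster each observation was assigned to
assigned_centroids = kmeans.cluster_centers_[kmeans.labels_]

# SSE: sum of squared Euclidean distances to those centroids
manual_sse = ((X - assigned_centroids) ** 2).sum()

# The manual value should match inertia_ up to floating-point error
print(manual_sse, kmeans.inertia_)
```

Because adding clusters can only bring observations closer to their nearest centroid, the SSE keeps shrinking as k grows, which is why we look for the bend in the curve rather than its minimum.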

**Step 5: Perform K-Means Clustering with Optimal *K***

The following code shows how to perform k-means clustering on the dataset using the optimal value for *k* of 3:

```python
#instantiate the k-means class, using optimal number of clusters
kmeans = KMeans(init="random", n_clusters=3, n_init=10, random_state=1)

#fit k-means algorithm to data
kmeans.fit(scaled_df)

#view cluster assignments for each observation
kmeans.labels_
```

```
array([1, 1, 1, 1, 1, 1, 2, 2, 0, 0, 0, 0, 2, 2, 2, 0, 0, 0])
```

The resulting array shows the cluster assignments for each observation in the DataFrame.

To make these results easier to interpret, we can add a column to the DataFrame that shows the cluster assignment of each player:

```python
#append cluster assignments to original DataFrame
df['cluster'] = kmeans.labels_

#view updated DataFrame
print(df)
```

```
    points  assists  rebounds  cluster
0     18.0      3.0        15        1
2     19.0      4.0        14        1
3     14.0      5.0        10        1
4     14.0      4.0         8        1
5     11.0      7.0        14        1
6     20.0      8.0        13        1
7     28.0      7.0         9        2
8     30.0      6.0         5        2
9     31.0      9.0         4        0
10    35.0     12.0        11        0
11    33.0     14.0         6        0
13    25.0      9.0         5        0
14    25.0      4.0         3        2
15    27.0      3.0         8        2
16    29.0      4.0        12        2
17    30.0     12.0         7        0
18    19.0     15.0         6        0
19    23.0     11.0         5        0
```

The **cluster** column contains a cluster number (0, 1, or 2) that each player was assigned to.

Players that belong to the same cluster have roughly similar values for the **points**, **assists**, and **rebounds** columns.
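One way to see how the clusters differ is to summarize each one by its average values. The sketch below rebuilds the tutorial's dataset so that it runs on its own, then uses a pandas `groupby` to compute per-cluster means in the original units. The summary step and the name `profile` are additions for illustration, and the cluster numbers you see may differ depending on your scikit-learn version.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Same data as in the tutorial, with rows containing NaN dropped
df = pd.DataFrame({'points': [18, np.nan, 19, 14, 14, 11, 20, 28, 30, 31,
                              35, 33, 29, 25, 25, 27, 29, 30, 19, 23],
                   'assists': [3, 3, 4, 5, 4, 7, 8, 7, 6, 9, 12, 14,
                               np.nan, 9, 4, 3, 4, 12, 15, 11],
                   'rebounds': [15, 14, 14, 10, 8, 14, 13, 9, 5, 4,
                                11, 6, 5, 5, 3, 8, 12, 7, 6, 5]}).dropna()

# Scale, fit with 3 clusters, and attach the labels to the DataFrame
scaled = StandardScaler().fit_transform(df)
kmeans = KMeans(init="random", n_clusters=3, n_init=10, random_state=1).fit(scaled)
df['cluster'] = kmeans.labels_

# Average points, assists, and rebounds per cluster (original units)
profile = df.groupby('cluster').mean().round(1)
print(profile)

# Number of players in each cluster
print(df['cluster'].value_counts().sort_index())
```

Reading the rows of `profile` side by side gives a rough description of each group, for example which cluster tends to contain the higher scorers or the stronger rebounders.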

**Note**: The complete documentation for the **KMeans** class is available in the official **sklearn** documentation.
