Clustering is a technique in machine learning that attempts to find clusters of observations within a dataset.
The goal is to find clusters such that the observations within each cluster are quite similar to each other, while observations in different clusters are quite different from each other.
The easiest way to perform hierarchical clustering in SAS is to use PROC CLUSTER.
The following example shows how to use PROC CLUSTER in practice.
Example: How to Use PROC CLUSTER in SAS
Suppose we have the following dataset that contains information about points, assists and rebounds for 20 different basketball players:
/*create dataset; player_ID numbers the players 1 to 20 in the order they are read*/
data my_data;
    input points assists rebounds;
    player_ID = _N_;
    datalines;
18 3 15
20 3 14
19 4 14
14 5 10
14 4 8
15 7 14
20 8 13
28 7 9
30 6 5
31 9 4
35 12 11
33 14 6
29 9 5
25 9 5
25 4 3
27 3 8
29 4 12
30 12 7
19 5 6
23 11 5
;
run;

/*view dataset*/
proc print data=my_data;
run;

Suppose we would like to perform clustering to identify “clusters” of players that have similar stats to each other.
The following code shows how to use PROC CLUSTER in SAS to perform clustering and save the resulting cluster tree to a dataset named clustd:

/*perform clustering using points, assists and rebounds variables*/
proc cluster data=my_data method=average outtree=clustd;
    var points assists rebounds;
    id player_ID;
run;
The first tables in the output provide information about how the clustering was performed:
A dendrogram is also produced so that we can visually inspect the similarity between observations in the dataset:
The y-axis shows the individual observations and the x-axis shows the average distance between clusters.
From this dendrogram, the observations appear to group naturally into three clusters.
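Reading the number of clusters off a dendrogram is a judgment call. As a rough cross-check, PROC CLUSTER can also print statistics that are often consulted when choosing the number of clusters. The sketch below is an optional addition rather than part of the original example; it assumes the my_data dataset created earlier, and the choice of print=10 is arbitrary.

```sas
/*optional sketch: rerun the clustering and request extra statistics*/
/*ccc    = cubic clustering criterion*/
/*pseudo = pseudo F and pseudo t-squared statistics*/
/*print=10 limits the cluster history to the last 10 merges (arbitrary choice)*/
proc cluster data=my_data method=average ccc pseudo print=10;
    var points assists rebounds;
run;
```

With only 20 observations these statistics should be treated as a loose guide at most, so the dendrogram remains the main basis for choosing three clusters here.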
We can then use PROC TREE with the ncl=3 option to tell SAS to assign each observation in the original dataset to one of three clusters:
/*assign each observation to one of three clusters*/
proc tree data=clustd noprint ncl=3 out=clusts;
    copy points assists rebounds;
    id player_ID;
run;

/*sort observations by cluster*/
proc sort data=clusts;
    by cluster;
run;

/*view cluster assignments*/
proc print data=clusts;
    id player_ID;
run;
The resulting dataset shows each of the original observations along with the cluster they belong to:
For example, we can see that the players with IDs 2, 3, 1, 4, 5, 7, 6 and 19 all belong to cluster 1.
This tells us that these eight players are “similar” across the points, assists and rebounds variables.
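One way to see what “similar” means here is to summarize each cluster. The following is a minimal sketch, assuming the clusts dataset produced by PROC TREE above, which holds the CLUSTER variable along with the copied points, assists and rebounds variables. The choice of statistics and of maxdec=1 is illustrative only.

```sas
/*count how many players fall in each cluster*/
proc freq data=clusts;
    tables cluster;
run;

/*summarize points, assists and rebounds within each cluster*/
proc means data=clusts mean min max maxdec=1;
    class cluster;
    var points assists rebounds;
run;
```

Comparing the per-cluster means is a quick way to describe each group, for example to check whether one cluster is made up mostly of lower-scoring players.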
Note: For this example we chose to use average as the linkage method for clustering. Refer to the SAS documentation for a complete list of other linkage methods you can use.
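As an illustration of switching the linkage method, the sketch below reruns the same clustering with Ward's minimum-variance method by changing only the method= option. It assumes the my_data dataset from this example. Ward's method is just one alternative and is not necessarily a better choice for this data.

```sas
/*same clustering as before, but using Ward's method instead of average linkage*/
proc cluster data=my_data method=ward;
    var points assists rebounds;
run;
```

Different linkage methods can produce different dendrograms, so it is worth checking whether the same three groups still appear.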