Principal components analysis (PCA) is an unsupervised machine learning technique that seeks to find principal components – linear combinations of the predictor variables – that explain a large portion of the variation in a dataset.
The easiest way to perform PCA in SAS is to use PROC PRINCOMP, which has the following basic syntax:
proc princomp data=my_data out=out_data outstat=stats;
    var var1 var2 var3;
run;
Here is what each option and statement does:
- data: The name of the dataset to use for PCA.
- out: The name of the dataset to create that contains all original data along with the principal component scores.
- outstat: The name of the dataset to create that contains the means, standard deviations, number of observations, correlation coefficients, eigenvalues, and eigenvectors.
- var: The variables from the input dataset to use for PCA.
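PROC PRINCOMP also accepts a few other commonly used options. The sketch below reuses the placeholder names from the basic syntax (my_data, var1 to var3) and introduces out_cov as a hypothetical output dataset name. It uses N= to limit the number of components computed, PREFIX= to rename the score variables, and COV to analyze the covariance matrix instead of the default correlation matrix:

```sas
/* sketch: compute only the first two components and name them PC1 and PC2 */
/* cov switches the analysis from the correlation matrix to the covariance matrix */
proc princomp data=my_data out=out_cov n=2 prefix=PC cov;
    var var1 var2 var3;
run;
```

With COV the variables are not standardized, so variables with larger variances dominate the components. That is usually only appropriate when all variables are measured in the same units.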
The following step-by-step example shows how to use PROC PRINCOMP in practice to perform principal components analysis in SAS.
Step 1: Create Dataset
Suppose we have the following dataset that contains various information about 20 basketball players:
/*create dataset*/
data my_data;
    input points assists rebounds;
    datalines;
22 8 4
29 7 3
10 4 12
5 5 15
35 6 2
8 3 10
10 4 8
8 4 3
2 5 17
4 5 19
9 9 4
7 6 4
31 5 3
4 6 13
5 7 8
8 8 4
10 4 8
20 4 6
25 8 8
18 8 3
;
run;

/*view dataset*/
proc print data=my_data;
run;
Step 2: Perform Principal Components Analysis
We can use PROC PRINCOMP to perform principal components analysis using the points, assists and rebounds variables in the dataset:
/*perform principal components analysis*/
proc princomp data=my_data out=out_data outstat=stats;
    var points assists rebounds;
run;
The first portion of the output shows various descriptive statistics including the mean and standard deviations of each input variable, a correlation matrix, and the values for the eigenvalues and eigenvectors:
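The same statistics are also saved in the dataset named in the outstat option. As a minimal sketch, assuming the stats dataset follows the documented OUTSTAT= layout (a _TYPE_ variable identifies each row, with EIGENVAL rows holding the eigenvalues and SCORE rows holding the eigenvectors), the eigenvalues and eigenvectors can be printed like this:

```sas
/* sketch: print only the eigenvalue and eigenvector rows of the outstat dataset */
/* assumes Step 2 has already been run so that the stats dataset exists */
proc print data=stats noobs;
    where _TYPE_ in ('EIGENVAL', 'SCORE');
run;
```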
The next portion of the output displays a Scree Plot and a Variance Explained plot:
When we perform PCA, we’re often interested in understanding what percentage of the total variation in the dataset can be explained by each principal component.
The table in the output titled Eigenvalues of the Correlation Matrix allows us to see exactly what percentage of total variation is explained by each principal component:
- The first principal component explains 61.7% of the total variation in the dataset.
- The second principal component explains 26.51% of the total variation in the dataset.
- The third principal component explains 11.79% of the total variation in the dataset.
Notice that all of the percentages sum to 100%.
The plot titled Variance Explained then allows us to visualize these values.
The x-axis displays the principal component and the y-axis displays the percentage of total variance explained by each individual principal component.
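If these percentages are needed in a dataset rather than only in the printed output, one option is to capture the table with ODS OUTPUT. The following is a sketch that assumes the ODS table is named Eigenvalues, as listed in the PROC PRINCOMP documentation, and uses eigen_table as a hypothetical dataset name:

```sas
/* sketch: save the eigenvalue table produced by PROC PRINCOMP to a dataset */
ods output Eigenvalues=eigen_table;
proc princomp data=my_data;
    var points assists rebounds;
run;

/* view the saved eigenvalues, proportions and cumulative proportions */
proc print data=eigen_table noobs;
run;
```

Because the analysis is based on the correlation matrix, the eigenvalues sum to the number of variables (3 in this example), so each proportion is simply the eigenvalue divided by 3.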
Step 3: Create Biplot to Visualize Results
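To visualize the results of PCA for a given dataset we can create a biplot, which is a plot that displays every observation in a dataset on a plane that is formed by the first two principal components. Strictly speaking, a full biplot also overlays the variable loadings; the simplified version used here shows only the observation scores.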
We can use the following syntax in SAS to create a biplot:
/*create dataset with column called obs to represent row numbers of original data*/
data biplot_data;
    set out_data;
    obs=_n_;
run;

/*create biplot using values from first two principal components*/
proc sgplot data=biplot_data;
    scatter x=Prin1 y=Prin2 / datalabel=obs;
run;
The x-axis displays the first principal component, the y-axis displays the second principal component, and the individual observations from the dataset are shown inside the plot as tiny circles.
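Principal component scores are centered at zero, so reference lines at 0 can make it easier to see which observations score above or below average on each component. The following variation is only a sketch that assumes the biplot_data dataset created above; the REFLINE statements are the only addition:

```sas
/* sketch: same scatter plot with reference lines at zero on both axes */
proc sgplot data=biplot_data;
    scatter x=Prin1 y=Prin2 / datalabel=obs;
    refline 0 / axis=x;
    refline 0 / axis=y;
run;
```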
Observations that are next to each other on the plot have similar values across the three variables of points, assists and rebounds.
For example, on the far left side of the plot we can see that observations #9 and #10 are extremely close to each other.
If we refer to the original dataset, we can see the following values for these observations:
- Observation #9: 2 points, 5 assists, 17 rebounds
- Observation #10: 4 points, 5 assists, 19 rebounds
The values are similar across each of the three variables, which explains why these observations are so close to each other on the biplot.
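To check this directly, the original values can be printed next to the component scores for just these two rows. This is a small sketch that assumes the out_data dataset from Step 2 and the default score variable names Prin1 and Prin2:

```sas
/* sketch: print rows 9 and 10 with their original values and first two scores */
proc print data=out_data(firstobs=9 obs=10);
    var points assists rebounds Prin1 Prin2;
run;
```

Here firstobs=9 and obs=10 restrict the output to observations 9 through 10.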
We also saw from the table in the output titled Eigenvalues of the Correlation Matrix that the first two principal components account for 88.21% of the total variation in the dataset.
Since this percentage is so high, it’s reasonable to analyze which observations in the biplot are near each other, because the two principal components that make up the biplot account for most of the variation in the dataset.