When we would like to calculate the correlation between two continuous variables, we typically use the Pearson correlation coefficient.
However, when one of the variables is continuous and the other is a binary categorical variable, we can use something known as point biserial correlation.
Point biserial correlation is used to calculate the correlation between a binary categorical variable (a variable that can only take on two values) and a continuous variable and has the following properties:
- Point biserial correlation can range between -1 and 1.
- For each group created by the binary variable, it is assumed that the continuous variable is normally distributed with equal variances.
- For each group created by the binary variable, it is assumed that there are no extreme outliers.
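For reference, the coefficient is commonly written in the following form. The symbols are notation introduced here for illustration: M_1 and M_0 are the mean values of the continuous variable in the groups coded 1 and 0, n_1 and n_0 are the group sizes, n = n_1 + n_0, and s_n is the standard deviation of all n values computed with n in the denominator.

```latex
r_{pb} = \frac{M_1 - M_0}{s_n} \sqrt{\frac{n_1 n_0}{n^2}},
\qquad
s_n = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}
```

This is numerically the same as the Pearson correlation coefficient computed with the binary variable coded as 0 and 1.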
The following example shows how to calculate a point biserial correlation in practice.
Example: Calculating a Point Biserial Correlation
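Suppose a college professor would like to determine if there is a correlation between gender and score on a particular aptitude exam.
He collects the following exam scores for 12 males and 12 females in his class:
Females: 77, 78, 79, 79, 82, 84, 85, 88, 89, 91, 91, 94
Males: 84, 84, 84, 85, 85, 86, 86, 86, 89, 91, 94, 98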
Since gender is a categorical variable and score is a continuous variable, it makes sense to calculate a point-biserial correlation between the two variables.
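Before computing the coefficient, the assumptions listed earlier can be checked informally. The following is a minimal sketch in R, assuming the exam scores from this example with females coded as 0 and males coded as 1. The Shapiro-Wilk test, the F test for equal variances, and the 1.5 × IQR boxplot rule are choices made here for illustration and are not the only way to check these assumptions.

```r
# Exam scores from the example; 0 = female, 1 = male
gender <- factor(c(rep(0, 12), rep(1, 12)))
score <- c(77, 78, 79, 79, 82, 84, 85, 88, 89, 91, 91, 94,
           84, 84, 84, 85, 85, 86, 86, 86, 89, 91, 94, 98)

# Shapiro-Wilk normality test within each group (one p-value per group)
normality <- tapply(score, gender, function(x) shapiro.test(x)$p.value)

# F test comparing the variance of the two groups
variance_test <- var.test(score ~ gender)

# Scores beyond 1.5 * IQR from the hinges within each group
outliers <- tapply(score, gender, function(x) boxplot.stats(x)$out)

print(normality)
print(variance_test$p.value)
print(outliers)
```

Small p-values from the first two tests, or any flagged scores in the last step, would suggest interpreting the correlation with more caution. With only 12 observations per group, these tests have limited power, so plots of each group are a useful complement.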
The professor can use any statistical software (including Excel, R, Python, SPSS, Stata) to calculate the point-biserial correlation between the two variables.
The following code shows how to calculate the point-biserial correlation in R, using the value 0 to represent females and 1 to represent males for the gender variable:
#define values for gender
gender <- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)

#define values for score
score <- c(77, 78, 79, 79, 82, 84, 85, 88, 89, 91, 91, 94, 84, 84, 84, 85, 85, 86, 86, 86, 89, 91, 94, 98)

#calculate point-biserial correlation
cor.test(gender, score)

	Pearson's product-moment correlation

data:  gender and score
t = 1.3739, df = 22, p-value = 0.1833
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.1379386  0.6147832
sample estimates:
      cor 
0.2810996 
From the output we can see that the point biserial correlation coefficient is 0.281 and the corresponding p-value is 0.1833.
Since the correlation coefficient is positive and we coded males as 1 and females as 0, this indicates that scores tend to be higher for males in this sample (i.e. scores tend to increase as gender “increases” from 0 to 1).
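To see where the sign and size of the coefficient come from, it can help to compute it by hand from the two group means. The sketch below assumes the same data and 0/1 coding as above; the variable names (m0, m1, n0, n1, r_pb) are introduced here for illustration only.

```r
# Same data as in the example; 0 = female, 1 = male
gender <- c(rep(0, 12), rep(1, 12))
score <- c(77, 78, 79, 79, 82, 84, 85, 88, 89, 91, 91, 94,
           84, 84, 84, 85, 85, 86, 86, 86, 89, 91, 94, 98)

m1 <- mean(score[gender == 1])  # mean score for males
m0 <- mean(score[gender == 0])  # mean score for females
n1 <- sum(gender == 1)          # number of males
n0 <- sum(gender == 0)          # number of females
n  <- length(score)             # total sample size

# sd() uses an n - 1 denominator, so the scaling factor uses n * (n - 1)
s <- sd(score)
r_pb <- (m1 - m0) / s * sqrt(n1 * n0 / (n * (n - 1)))

print(c(female_mean = m0, male_mean = m1))
print(r_pb)
print(cor(gender, score))  # Pearson correlation on the 0/1 coding
```

The group means are 84.75 for females and about 87.67 for males, and the hand-computed value agrees with the coefficient reported by cor.test(). The coefficient is positive because the group coded 1 has the higher mean; swapping the 0/1 coding would flip the sign without changing the magnitude.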
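However, since the p-value (0.1833) is not less than .05, this correlation coefficient is not statistically significant. In other words, the data do not provide sufficient evidence of an association between gender and score.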
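The significance test for a point biserial correlation is equivalent to a two-sample t-test that assumes equal variances, which is one way to sanity-check the p-value. The following sketch, again assuming the data and coding from this example, compares the two; the object names pb and tt are illustrative.

```r
# Same data as in the example; 0 = female, 1 = male
gender <- c(rep(0, 12), rep(1, 12))
score <- c(77, 78, 79, 79, 82, 84, 85, 88, 89, 91, 91, 94,
           84, 84, 84, 85, 85, 86, 86, 86, 89, 91, 94, 98)

# Test of the point biserial correlation
pb <- cor.test(gender, score)

# Pooled-variance two-sample t-test (males minus females)
tt <- t.test(score[gender == 1], score[gender == 0], var.equal = TRUE)

print(c(t_cor = unname(pb$statistic), t_ttest = unname(tt$statistic)))
print(c(p_cor = pb$p.value, p_ttest = tt$p.value))
```

Both tests give the same t statistic on 22 degrees of freedom and the same p-value, so the conclusion about significance does not depend on which of the two is reported.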
The following tutorials explain how to calculate point biserial correlation using different statistical software:
How to Calculate Point-Biserial Correlation in Excel
How to Calculate Point-Biserial Correlation in R
How to Calculate Point-Biserial Correlation in Python