Quadratic Discriminant Analysis in R (Step-by-Step)


Quadratic discriminant analysis is a method you can use when you have a set of predictor variables and you’d like to classify a response variable into two or more classes. It is considered to be the non-linear equivalent to linear discriminant analysis.

This tutorial provides a step-by-step example of how to perform quadratic discriminant analysis in R.

Step 1: Load Necessary Libraries

First, we’ll load the necessary libraries for this example:

library(MASS)
library(ggplot2)

Step 2: Load the Data

For this example, we’ll use the built-in iris dataset in R. The following code shows how to load and view this dataset:

#attach iris dataset to make it easy to work with
attach(iris)

#view structure of dataset
str(iris)

'data.frame':	150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 ...

We can see that the dataset contains 5 variables and 150 total observations.

For this example we’ll build a quadratic discriminant analysis model to classify which species a given flower belongs to.

We’ll use the following predictor variables in the model:

  • Sepal.length
  • Sepal.Width
  • Petal.Length
  • Petal.Width

And we’ll use them to predict the response variable Species, which takes on the following three potential classes:

  • setosa
  • versicolor
  • virginica

Step 3: Create Training and Test Samples

Next, we’ll split the dataset into a training set to train the model on and a testing set to test the model on:

#make this example reproducible
set.seed(1)

#Use 70% of dataset as training set and remaining 30% as testing set
sample <- sample(c(TRUE, FALSE), nrow(iris), replace=TRUE, prob=c(0.7,0.3))
train <- iris[sample, ]
test <- iris[!sample, ] 

Step 4: Fit the QDA Model

Next, we’ll use the qda() function from the MASS package to fit the QDA model to our data:

#fit QDA model
model <- qda(Species~., data=train)

#view model output
model

Call:
qda(Species ~ ., data = train)

Prior probabilities of groups:
    setosa versicolor  virginica 
 0.3207547  0.3207547  0.3584906 

Group means:
           Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa         4.982353    3.411765     1.482353   0.2411765
versicolor     5.994118    2.794118     4.358824   1.3676471
virginica      6.636842    2.973684     5.592105   2.0552632 

Here is how to interpret the output of the model:

Prior probabilities of group: These represent the proportions of each Species in the training set. For example, 35.8% of all observations in the training set were of species virginica.

Group means: These display the mean values for each predictor variable for each species.

Step 5: Use the Model to Make Predictions

Once we’ve fit the model using our training data, we can use it to make predictions on our test data:

#use QDA model to make predictions on test data
predicted <- predict(model, test)

names(predicted)

[1] "class"     "posterior" "x"   

This returns a list with two variables:

  • class: The predicted class
  • posterior: The posterior probability that an observation belongs to each class

We can quickly view each of these results for the first six observations in our test dataset:

#view predicted class for first six observations in test set
head(predicted$class)

[1] setosa setosa setosa setosa setosa setosa
Levels: setosa versicolor virginica

#view posterior probabilities for first six observations in test set
head(predicted$posterior)

   setosa   versicolor    virginica
4       1 7.224770e-20 1.642236e-29
6       1 6.209196e-26 8.550911e-38
7       1 1.248337e-21 8.132700e-32
15      1 2.319705e-35 5.094803e-50
17      1 1.396840e-29 9.586504e-43
18      1 7.581165e-25 8.611321e-37

Step 6: Evaluate the Model

We can use the following code to see what percentage of observations the QDA model correctly predicted the Species for:

#find accuracy of model
mean(predicted$class==test$Species)

[1] 1

It turns out that the model correctly predicted the Species for 100% of the observations in our test dataset.

In the real-world an QDA model will rarely predict every class outcome correctly, but this iris dataset is simply built in a way that machine learning algorithms tend to perform very well on it.

You can find the complete R code used in this tutorial here.

Leave a Reply

Your email address will not be published.