What is a “Good” Accuracy for Machine Learning Models?

When using classification models in machine learning, one metric we often use to assess the quality of a model is accuracy.

Accuracy is simply the percentage of all observations that are correctly classified by the model.

It is calculated as:

Accuracy = (# True Positives + # True Negatives) / (Total Sample Size)
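As a minimal sketch, the formula above can be written as a small Python function. The function name and the counts in the usage line are hypothetical and chosen only for illustration.

```python
def accuracy(true_positives, true_negatives, total):
    """Fraction of observations that were classified correctly."""
    if total <= 0:
        raise ValueError("total must be positive")
    return (true_positives + true_negatives) / total


# Hypothetical counts, used only to show the call
print(accuracy(true_positives=30, true_negatives=50, total=100))  # 0.8
```

Multiplying the result by 100 gives the accuracy as a percentage.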

One question that students often have about accuracy is:

What is considered a “good” value for the accuracy of a machine learning model?

While the accuracy of a model can range between 0% and 100%, there is no universal threshold that we use to determine if a model has “good” accuracy or not.

Instead, we typically compare the accuracy of our model to the accuracy of some baseline model.

A baseline model is one that simply predicts every observation in a dataset to belong to the most common class.

In practice, any classification model that has a higher accuracy than a baseline model can be considered “useful,” and the greater the difference in accuracy between our model and the baseline model, the better.
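One way to sketch the baseline idea in Python is to count the class labels and score a model that always predicts the most common one. The helper name and the toy labels below are assumptions made for illustration, not part of the original example.

```python
from collections import Counter


def baseline_accuracy(labels):
    """Return the most common class and the accuracy of always predicting it."""
    if not labels:
        raise ValueError("labels must not be empty")
    most_common_class, count = Counter(labels).most_common(1)[0]
    return most_common_class, count / len(labels)


# Hypothetical labels: 7 observations of class "no" and 3 of class "yes"
labels = ["no"] * 7 + ["yes"] * 3
majority, acc = baseline_accuracy(labels)
print(majority, acc)  # no 0.7
```

Any model we fit on these labels would need to beat 0.7 to improve on the baseline.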

The following example shows how to roughly determine if a classification model has “good” accuracy or not.

Example: Determining if a Model Has “Good” Accuracy

Suppose we use a logistic regression model to predict whether or not 400 different college basketball players get drafted into the NBA.

The following confusion matrix summarizes the predictions made by the model:

                        Predicted: Drafted    Predicted: Not Drafted
Actual: Drafted                120                     40
Actual: Not Drafted             70                    170

Here is how to calculate the accuracy of this model:

• Accuracy = (# True Positives + # True Negatives) / (Total Sample Size)
• Accuracy = (120 + 170) / (400)
• Accuracy = 0.725

The model correctly predicted the outcome for 72.5% of players.
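The same figure can be checked observation by observation. The sketch below rebuilds per-player labels from the counts in this example. The 120 true positives and 170 true negatives come from the calculation above, while the 40 false negatives and 70 false positives are derived from the totals stated in this example (160 drafted and 240 not drafted). The variable names are illustrative.

```python
# 1 = drafted, 0 = not drafted
# Order of groups: true positives, false negatives, false positives, true negatives
y_true = [1] * 120 + [1] * 40 + [0] * 70 + [0] * 170
y_pred = [1] * 120 + [0] * 40 + [1] * 70 + [0] * 170

# Accuracy is the share of players whose prediction matches the actual outcome
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

print(correct, len(y_true), accuracy)  # 290 400 0.725
```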

To get an idea of whether or not that accuracy is “good”, we can calculate the accuracy of a baseline model.

In this example, the most common outcome for the players was to not get drafted. Specifically, 240 out of 400 players did not get drafted.

A baseline model would be one that simply predicts every single player to not get drafted.

The accuracy of this model would be calculated as:

• Accuracy = (# True Positives + # True Negatives) / (Total Sample Size)
• Accuracy = (0 + 240) / (400)
• Accuracy = 0.6

This baseline model would correctly predict the outcome for 60% of players.

In this scenario, our logistic regression model improves accuracy by 12.5 percentage points over the baseline model (72.5% vs. 60%), so we would consider our model to at least be “useful.”

In practice, we would likely fit several different classification models and choose the final model as the one that offers the greatest boost in accuracy compared to a baseline model.
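A rough sketch of that selection step is shown below. Only the logistic regression accuracy and the baseline accuracy come from this example. The other two candidate models and their accuracies are hypothetical placeholders.

```python
baseline_accuracy = 240 / 400  # always predict "not drafted" (from the example)

# Only the logistic regression figure comes from the example;
# the other two entries are hypothetical placeholders.
candidate_accuracies = {
    "logistic_regression": 290 / 400,
    "hypothetical_model_a": 0.70,
    "hypothetical_model_b": 0.65,
}

# Gain in accuracy of each candidate over the baseline
gains = {name: acc - baseline_accuracy for name, acc in candidate_accuracies.items()}
best_model = max(gains, key=gains.get)

print(best_model, round(gains[best_model], 3))  # logistic_regression 0.125
```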

Cautions on Using Accuracy to Assess Model Performance

Accuracy is a commonly used metric because it’s easy to interpret.

For example, if we say that a model is 90% accurate, we know that it correctly classified 90% of observations.

However, accuracy does not take into account how the classes are distributed in the data.

For example, suppose 90% of all players do not get drafted into the NBA. If we have a model that simply predicts every player to not get drafted, the model would correctly predict the outcome for 90% of the players.

This value seems high, but the model is actually unable to correctly predict any player who gets drafted.
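The scenario above can be reproduced with a short sketch. The sample size of 1,000 players is an assumption made only so that the 90% / 10% split yields whole numbers.

```python
# Hypothetical sample of 1,000 players in which 90% are not drafted
y_true = [0] * 900 + [1] * 100  # 1 = drafted, 0 = not drafted
y_pred = [0] * len(y_true)      # model that always predicts "not drafted"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# How many of the drafted players did the model actually identify?
drafted_found = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
drafted_total = sum(y_true)

print(accuracy)                            # 0.9
print(drafted_found, "of", drafted_total)  # 0 of 100
```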

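An alternative metric that is often used is the F1 score, which is the harmonic mean of precision and recall. Unlike accuracy, it reflects how well the model identifies the positive class rather than only the overall share of correct predictions.

For example, if the data is highly imbalanced (e.g. 90% of all players do not get drafted and 10% do get drafted), then the F1 score will typically provide a better assessment of model performance than accuracy.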
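As a minimal sketch, the F1 score can be computed from confusion-matrix counts as 2 * TP / (2 * TP + FP + FN), treating “drafted” as the positive class. The function name, the convention of returning 0 when the denominator is zero, and the 1,000-player sample are assumptions. The counts for the logistic regression model use the 120 true positives from the example, with 70 false positives and 40 false negatives derived from its totals (240 not drafted, 160 drafted).

```python
def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    denominator = 2 * tp + fp + fn
    # Convention assumed here: return 0.0 when the denominator is zero
    return 2 * tp / denominator if denominator else 0.0


# Always predicting "not drafted" on a hypothetical 900 / 100 split:
# accuracy would be 0.9, but no drafted player is identified
print(f1_score(tp=0, fp=0, fn=100))  # 0.0

# Logistic regression model from the earlier example
print(round(f1_score(tp=120, fp=70, fn=40), 3))  # 0.686
```

The always-“not drafted” model scores 90% on accuracy but 0 on F1, which makes its failure on the minority class visible.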