When using classification models in machine learning, a common metric that we use to assess the quality of the model is the **F1 Score**.

This metric is calculated as:

**F1 Score** = 2 * (Precision * Recall) / (Precision + Recall)

where:

**Precision**: Correct positive predictions relative to total positive predictions

**Recall**: Correct positive predictions relative to total actual positives
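To make the formula concrete, here is a minimal Python sketch that computes precision, recall, and the F1 score directly from confusion-matrix counts (the function name and argument names are our own, chosen for illustration):

```python
def f1_score_from_counts(tp: int, fp: int, fn: int) -> float:
    """Compute the F1 score from raw confusion-matrix counts."""
    precision = tp / (tp + fp)  # correct positive predictions / total positive predictions
    recall = tp / (tp + fn)     # correct positive predictions / total actual positives
    return 2 * (precision * recall) / (precision + recall)
```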

For example, suppose we use a logistic regression model to predict whether or not 400 different college basketball players get drafted into the NBA.

The following confusion matrix summarizes the predictions made by the model:

|                            | Actual: Drafted     | Actual: Not Drafted |
|----------------------------|---------------------|---------------------|
| **Predicted: Drafted**     | 120 (True Positive) | 70 (False Positive) |
| **Predicted: Not Drafted** | 40 (False Negative) | 170 (True Negative) |

Here is how to calculate the F1 score of the model:

Precision = True Positive / (True Positive + False Positive) = 120 / (120 + 70) = **0.6316**

Recall = True Positive / (True Positive + False Negative) = 120 / (120 + 40) = **0.75**

F1 Score = 2 * (0.6316 * 0.75) / (0.6316 + 0.75) = **0.6857**
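We can verify this arithmetic with the helper function defined above:

```python
>>> round(f1_score_from_counts(tp=120, fp=70, fn=40), 4)
0.6857
```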

**What is a Good F1 Score?**

One question students often have is:

**What is a good F1 score?**

In the simplest terms, a higher F1 score is generally better.

Recall that F1 scores range from 0 to 1, with 1 representing a model that perfectly classifies each observation into the correct class and 0 representing a model that makes no correct positive predictions at all (zero true positives).

To illustrate this, suppose we have a logistic regression model that produces the following confusion matrix:

|                            | Actual: Drafted     | Actual: Not Drafted |
|----------------------------|---------------------|---------------------|
| **Predicted: Drafted**     | 240 (True Positive) | 0 (False Positive)  |
| **Predicted: Not Drafted** | 0 (False Negative)  | 160 (True Negative) |

Here is how to calculate the F1 score of the model:

Precision = True Positive / (True Positive + False Positive) = 240 / (240 + 0) = **1**

Recall = True Positive / (True Positive + False Negative) = 240 / (240 + 0) = **1**

F1 Score = 2 * (1 * 1) / (1 + 1) = **1**
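Running the same helper function on these counts confirms the perfect score:

```python
>>> f1_score_from_counts(tp=240, fp=0, fn=0)
1.0
```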

The F1 score is equal to 1 because the model perfectly classifies each of the 400 observations into the correct class.

Now consider another logistic regression model that simply predicts that every player gets drafted:

|                            | Actual: Drafted     | Actual: Not Drafted  |
|----------------------------|---------------------|----------------------|
| **Predicted: Drafted**     | 160 (True Positive) | 240 (False Positive) |
| **Predicted: Not Drafted** | 0 (False Negative)  | 0 (True Negative)    |

Here is how to calculate the F1 score of the model:

Precision = True Positive / (True Positive + False Positive) = 160 / (160 + 240) = **0.4**

Recall = True Positive / (True Positive + False Negative) = 160 / (160 + 0) = **1**

F1 Score = 2 * (0.4 * 1) / (0.4 + 1) = **0.5714**
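Note that for any baseline that predicts the positive class for every observation, recall is always 1 and precision equals the proportion of actual positives, so the F1 score simplifies to 2p / (p + 1), where p is that proportion. A quick check in Python:

```python
>>> p = 160 / 400        # proportion of players who were actually drafted
>>> round(2 * p / (p + 1), 4)
0.5714
```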

This would be considered a **baseline model**: one that makes the same prediction for every single observation in the dataset, and against which we can compare our logistic regression model.

The greater our F1 score is relative to the baseline model's, the more useful our model is.

Recall from earlier that our model had an F1 score of **0.6857**. This isn’t much greater than **0.5714**, which indicates that our model is more useful than a baseline model but not by much.

**On Comparing F1 Scores**

In practice, we typically use the following process to pick the “best” model for a classification problem (a code sketch follows these steps):

**Step 1:** Fit a baseline model that makes the same prediction for every observation.

**Step 2:** Fit several different classification models and calculate the F1 score for each model.

**Step 3:** Choose the model with the highest F1 score as the “best” model, verifying that it produces a higher F1 score than the baseline model.
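Here is a rough sketch of this workflow in Python with scikit-learn, assuming the features and 0/1 labels are already available as `X` and `y` (both hypothetical here, and the candidate models shown are just examples):

```python
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

# X, y are assumed to hold the features and 0/1 labels (hypothetical data).
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "baseline": DummyClassifier(strategy="constant", constant=1),  # Step 1: predict positive for everyone
    "logistic": LogisticRegression(max_iter=1000),                 # Step 2: candidate models
    "random forest": RandomForestClassifier(random_state=0),
}

# Step 3: compare F1 scores and pick the highest, checking it beats the baseline.
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: F1 = {f1_score(y_test, model.predict(X_test)):.4f}")
```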

There is no specific value that is considered a “good” F1 score, which is why we generally pick the classification model that produces the highest F1 score.

**Additional Resources**

F1 Score vs. Accuracy: Which Should You Use?

How to Calculate F1 Score in R

How to Calculate F1 Score in Python