To evaluate the performance of a model on a dataset, we need to measure how well the model predictions match the observed data.

For regression models, the most commonly used metric is the mean squared error (MSE), which is calculated as:

MSE = (1/n)*Σ(y_{i} – f(x_{i}))^{2}

where:

- **n:** The total number of observations
- **y_{i}:** The response value of the i^{th} observation
- **f(x_{i}):** The predicted response value of the i^{th} observation

The closer the model predictions are to the observations, the smaller the MSE will be.
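As a concrete illustration, here is a minimal NumPy sketch of the MSE formula above; the observed and predicted values are small hypothetical arrays, not data from the text:

```python
import numpy as np

# Hypothetical observed responses and model predictions
y = np.array([3.0, 5.0, 7.5, 9.0])       # y_i: observed values
y_hat = np.array([2.8, 5.5, 7.0, 9.4])   # f(x_i): predicted values

# MSE = (1/n) * sum((y_i - f(x_i))^2)
mse = np.mean((y - y_hat) ** 2)
print(mse)
```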

However, what we really care about is the **test MSE** – the MSE when our model is applied to unseen data – because the goal is to predict well on new data, not merely to fit the data we already have.

For example, it’s nice if a model that predicts stock market prices has a low MSE on historical data, but we *really* want to be able to use the model to accurately forecast future data.
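To make the train-vs-test distinction concrete, here is a small NumPy sketch (the sine-plus-noise data and the degree-9 polynomial are illustrative assumptions): a flexible model can score a very low MSE on its own training data while scoring a much higher MSE on fresh data drawn from the same process.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_true(x):
    return np.sin(x)

# Small noisy training set and a separate test set from the same process
x_train = np.sort(rng.uniform(0, 3, 12))
y_train = f_true(x_train) + rng.normal(0, 0.2, 12)
x_test = np.sort(rng.uniform(0, 3, 100))
y_test = f_true(x_test) + rng.normal(0, 0.2, 100)

# A flexible degree-9 polynomial can track the training noise closely
coefs = np.polyfit(x_train, y_train, deg=9)
train_mse = np.mean((y_train - np.polyval(coefs, x_train)) ** 2)
test_mse = np.mean((y_test - np.polyval(coefs, x_test)) ** 2)

print(train_mse, test_mse)  # training MSE is far lower than test MSE
```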

It turns out that the expected test MSE can always be decomposed into three parts. The first two are:

**(1) The variance:** Refers to the amount by which our function *f* would change if we estimated it using a different training set.

**(2) The bias:** Refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model.

Written in mathematical terms:

Test MSE = Var(f̂(x_{0})) + [Bias(f̂(x_{0}))]^{2} + Var(ε)

Test MSE = Variance + Bias^{2} + Irreducible error

The third term, the irreducible error, is the error that cannot be reduced by any model simply because there always exists *some* noise in the relationship between the set of explanatory variables and the response variable.
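The decomposition can be checked numerically. The sketch below is an illustrative simulation (the quadratic truth, the linear model, and all parameter values are assumptions): it refits a straight line on many independent training sets, estimates the variance and squared bias of the prediction at a fixed point x0, and compares variance + bias² + Var(ε) against a direct estimate of the test MSE at x0.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.5                      # noise std, so Var(eps) = 0.25
x0 = 1.5                         # fixed test point
n_sims, n_train = 5000, 30

def f_true(x):
    return x ** 2                # assumed "real-life" relationship

preds = np.empty(n_sims)
for i in range(n_sims):
    x = rng.uniform(0, 3, n_train)
    y = f_true(x) + rng.normal(0, sigma, n_train)
    b1, b0 = np.polyfit(x, y, deg=1)   # fit a simple straight line
    preds[i] = b1 * x0 + b0            # prediction f_hat(x0)

variance = preds.var()                         # Var(f_hat(x0))
bias_sq = (preds.mean() - f_true(x0)) ** 2     # [Bias(f_hat(x0))]^2
irreducible = sigma ** 2                       # Var(eps)

# Direct estimate of the expected test MSE at x0, using fresh noise
y0 = f_true(x0) + rng.normal(0, sigma, n_sims)
test_mse = np.mean((y0 - preds) ** 2)

print(variance + bias_sq + irreducible, test_mse)  # the two agree closely
```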

Models that have **high bias** tend to have **low variance**. For example, linear regression models tend to have high bias (they assume a simple linear relationship between the explanatory variables and the response variable) and low variance (model estimates won’t change much from one sample to the next).

Conversely, models that have **low bias** tend to have **high variance**. For example, complex non-linear models tend to have low bias (they do not assume a particular form for the relationship between the explanatory variables and the response variable) and high variance (model estimates can change a lot from one training sample to the next).
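This contrast can be demonstrated directly by refitting both kinds of model on many fresh training samples and measuring how much the prediction at a fixed point fluctuates. The setup below (sine-plus-noise data, a degree-1 vs a degree-9 polynomial) is a hypothetical sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
x0 = 1.5  # fixed point at which predictions are compared

def fit_and_predict(deg):
    # Draw a fresh training sample, refit the model, predict at x0
    x = rng.uniform(0, 3, 20)
    y = np.sin(x) + rng.normal(0, 0.3, 20)
    return np.polyval(np.polyfit(x, y, deg), x0)

simple = np.array([fit_and_predict(1) for _ in range(500)])
flexible = np.array([fit_and_predict(9) for _ in range(500)])

print(simple.var(), flexible.var())  # the flexible model varies far more
```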

**The Bias-Variance Tradeoff**

The **bias-variance tradeoff** refers to the tradeoff that takes place when we choose to lower bias, which typically increases variance, or to lower variance, which typically increases bias.

The following chart offers a way to visualize this tradeoff:

The total error decreases as model complexity increases, but only up to a point. Beyond that point, the increase in variance outweighs the decrease in bias, and the total error begins to rise.
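One way to trace this curve numerically is to fit models of increasing complexity to the same data and track the test MSE. The sketch below uses hypothetical sine-plus-noise data, with polynomial degree as the complexity knob:

```python
import numpy as np

rng = np.random.default_rng(3)
x_train = np.sort(rng.uniform(0, 3, 25))
y_train = np.sin(x_train) + rng.normal(0, 0.2, 25)
x_test = np.linspace(0.1, 2.9, 200)
y_test = np.sin(x_test) + rng.normal(0, 0.2, 200)

# Test MSE at several levels of model complexity (polynomial degree)
test_mse = {}
for deg in [0, 1, 3, 9, 15]:
    coefs = np.polyfit(x_train, y_train, deg)
    test_mse[deg] = np.mean((y_test - np.polyval(coefs, x_test)) ** 2)

print(test_mse)  # error falls, bottoms out, then rises with complexity
```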

In practice, we only care about minimizing the total error of a model, not necessarily minimizing the variance or bias. It turns out that the way to minimize the total error is to strike the right balance between variance and bias.

In other words, we want a model that is complex enough to capture the true relationship between the explanatory variables and the response variable, but not overly complex such that it finds patterns that don’t really exist.

When a model is too complex, it **overfits** the data. This happens because it works too hard to find patterns in the training data that are just caused by random chance. This type of model is likely to perform poorly on unseen data.

But when a model is too simple, it **underfits** the data. This happens because it assumes the true relationship between the explanatory variables and the response variable is simpler than it actually is.

The way to pick optimal models in machine learning is to strike the balance between bias and variance such that we can minimize the test error of the model on future unseen data.

In practice, the most common way to minimize test MSE is to use cross-validation.
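As a rough sketch of how this works, here is a from-scratch k-fold cross-validation loop in NumPy that selects a polynomial degree (the sine-plus-noise data-generating process and the candidate degrees are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 3, 60))
y = np.sin(x) + rng.normal(0, 0.2, 60)

def cv_mse(x, y, deg, k=5):
    """Average held-out MSE over k folds for a degree-`deg` polynomial."""
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, k)
    errs = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)          # everything not held out
        coefs = np.polyfit(x[train], y[train], deg)
        errs.append(np.mean((y[fold] - np.polyval(coefs, x[fold])) ** 2))
    return np.mean(errs)

# Pick the degree with the lowest cross-validated test MSE estimate
scores = {deg: cv_mse(x, y, deg) for deg in range(0, 10)}
best = min(scores, key=scores.get)
print(best, scores[best])
```

The held-out fold plays the role of unseen data on each pass, so the averaged score approximates the test MSE without touching a true test set.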