In statistics and machine learning, we collect data, build models from it and make inferences. With too little data, the model is unlikely to represent the truth since it's biased towards what it has seen. With too much data, the model could become overly complex if it attempts to deal with all the variations it sees.

Ideally, we want models to have low bias and low variance. In practice, lower bias leads to higher variance, and vice versa. This tension is called the Bias-Variance Trade-off, also known as the Bias-Variance Dilemma.

There are techniques to address this trade-off. The idea is to get the right balance of bias and variance that's acceptable for the problem. A good model must be,

Rich enough to express underlying structure in data and simple enough to avoid fitting spurious patterns.

## Discussion

• Could you explain bias-variance trade-off with examples?

Suppose we collect income data from multiple cities across professions. While income can be correlated with profession, there will be variations across cities due to differing lifestyles, cost of living, tax rules, etc. For example, a doctor in London would have a higher income than a doctor in Leicester. This can be called heterogeneity in data. In regression modelling, a model built on this data will have high variance and predictions may not be accurate.

We can overcome this by making the data more homogeneous by splitting the data by city. Thus, we'll end up with multiple models, one per city. Each model is biased to its city but has lower variance. We could also choose to split the data by state or region. The amount of variance that can be tolerated will dictate bias.

Consider KNN classification as another example. At low $$k$$, predictions are not consistent due to high variations. When we consider more neighbours, we get better predictions as variance is reduced. However, if $$k$$ is too high we start considering neighbours that are "too far", which contributes to increased bias.
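The effect of $$k$$ can be checked with a small simulation. The sketch below is only illustrative: it uses KNN regression rather than classification so that squared bias and variance are easy to compute, and the process `true_f`, the noise level, the sample sizes and the query point `x0` are all assumptions made for this example.

```python
import math
import random
import statistics


def true_f(x):
    """Hypothetical underlying process, chosen only for illustration."""
    return math.sin(2 * math.pi * x)


def knn_predict(xs, ys, x0, k):
    """Average the targets of the k training points nearest to x0."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    return sum(ys[i] for i in nearest) / k


def bias_variance_at(x0, k, n_train=50, n_models=300, noise_sd=0.3, seed=0):
    """Train n_models KNN regressors on fresh samples; summarize predictions at x0."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_models):
        xs = [rng.random() for _ in range(n_train)]
        ys = [true_f(x) + rng.gauss(0, noise_sd) for x in xs]
        preds.append(knn_predict(xs, ys, x0, k))
    bias_sq = (statistics.fmean(preds) - true_f(x0)) ** 2
    variance = statistics.pvariance(preds)
    return bias_sq, variance


# Sweep k from a single neighbour to the entire training set.
results = {k: bias_variance_at(x0=0.25, k=k) for k in (1, 5, 25, 50)}
for k, (bias_sq, variance) in results.items():
    print(f"k={k:2d}  bias^2={bias_sq:.4f}  variance={variance:.4f}")
```

With these settings, variance should be largest at $$k=1$$ and squared bias largest at $$k=50$$, where every prediction is just the mean of all training targets.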

• What's the intuition behind bias-variance trade-off?

Assume we're building a prediction model. We build multiple models from different data samples. Bias is a measure of how far the predictions are from the true value. Variance is a measure of variability across these models. This is illustrated graphically with a bulls-eye diagram.

Intuitively, we can say that a biased model is too simple. It's unable to capture essential patterns in the training data. We say that such a model is underfitting the data.

On the other hand, a model with high variance is a complex model. It's in fact sensitive to the training data. It's overfitting the data. When it sees new data, it's unable to predict correctly since it's overfitted to the training data. We might also state that such a model does not generalize well. A simpler model would have done better but if it becomes too simple it also becomes biased.

In summary, a biased model is underfitted and of low complexity. A model of high variance is overfitted and of high complexity.

• What's the math behind bias-variance trade-off?

Given x-y data points, we can represent the relationship as $$Y = f(X) + \epsilon$$, where $$\epsilon$$ is the error term of Normal distribution $$N(0,\sigma_\epsilon)$$. Let $$\hat{f}(X)$$ be an estimate of $$f(X)$$ obtained via any modelling technique such as linear regression or KNN. Let's use mean squared error for the prediction error. Thus, the expected prediction error at point $$x$$ is,

\begin{align}\Bbb{E}[(y - \hat{f}(x))^2] = & \; \Bbb{E}[(f(x) - \Bbb{E}[\hat{f}(x)])^2] \\ & + \Bbb{E}[(\hat{f}(x) - \Bbb{E}[\hat{f}(x)])^2] \\ & + {\sigma_\epsilon}^2\end{align}

The first term is the squared bias of the estimator. The second term is the variance of the estimator. The third term is simply noise. A perfect model would eliminate both bias and variance, but not the noise. Noise contributes to what we call irreducible error. $$\Bbb{E}[\hat{f}(x)]$$ is the average prediction from various estimators. Each estimator is trained on a different sampling of the dataset.
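For readers who want the intermediate steps, the decomposition can be derived as follows. This sketch assumes that $$\epsilon$$ has zero mean and variance $${\sigma_\epsilon}^2$$ and is independent of $$\hat{f}(x)$$. The shorthand $$\bar{f}(x) = \Bbb{E}[\hat{f}(x)]$$ is introduced here only for brevity.

\begin{align}
\Bbb{E}[(y - \hat{f}(x))^2] &= \Bbb{E}[(f(x) - \hat{f}(x) + \epsilon)^2] \\
&= \Bbb{E}[(f(x) - \hat{f}(x))^2] + 2\,\Bbb{E}[\epsilon]\,\Bbb{E}[f(x) - \hat{f}(x)] + \Bbb{E}[\epsilon^2] \\
&= \Bbb{E}[(f(x) - \hat{f}(x))^2] + {\sigma_\epsilon}^2 \\
\Bbb{E}[(f(x) - \hat{f}(x))^2] &= \Bbb{E}[(f(x) - \bar{f}(x) + \bar{f}(x) - \hat{f}(x))^2] \\
&= (f(x) - \bar{f}(x))^2 + 2\,(f(x) - \bar{f}(x))\,\Bbb{E}[\bar{f}(x) - \hat{f}(x)] + \Bbb{E}[(\hat{f}(x) - \bar{f}(x))^2] \\
&= (f(x) - \bar{f}(x))^2 + \Bbb{E}[(\hat{f}(x) - \bar{f}(x))^2]
\end{align}

Both cross terms vanish: the first because $$\Bbb{E}[\epsilon] = 0$$, the second because $$\Bbb{E}[\bar{f}(x) - \hat{f}(x)] = 0$$ by definition of $$\bar{f}(x)$$. Since $$f(x) - \bar{f}(x)$$ is not random, its square equals the squared bias term of the decomposition.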

The idea of separating and analysing bias and variance terms of the prediction error is called bias-variance decomposition.

Perfect models don't exist. In practice, we aim for a model that attempts to minimize the error, neither underfitting nor overfitting.

• Could you explain specific examples of high/low bias/variance?

In this example, we use $$f(x)$$ for the underlying process (purple) and $$\hat{f}(x)$$ for our estimate of the process (orange). Individual fitted functions (orange) are averaged to give $$\Bbb{E}[\hat{f}(x)]$$ (green).

Consider a nonlinear process $$f(x)$$ (top figure). We don't know that it's nonlinear. We attempt to fit a linear function $$\hat{f}(x)$$ to the data. The data may also contain noise. We take different samples of the data and find suitable fits. None of the lines are close to $$f(x)$$. Thus, our fits are all biased, in fact, biased towards linear functions. However, the lines don't differ much from one another, which implies low variance.

Now consider a linear process $$f(x)$$ (bottom figure). We now attempt to fit a nonlinear function $$\hat{f}(x)$$ to the data. The functions are complex enough to fit the data and the noise. In fact, they're overfitting. Each nonlinear curve fits its own data and the curves all look different. Thus, our model $$\hat{f}(x)$$ exhibits high variance. When we average these fits, we get a line that's close to the original process. Thus, there's low bias.
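Both cases can be reproduced numerically. In the sketch below, the quadratic and linear processes, the noise level and the query point are assumptions made for illustration. A 1-nearest-neighbour fit stands in for the "complex nonlinear function" of the bottom figure because it passes through every noisy training point; it's not necessarily the model used in the figures.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns a prediction function."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return lambda x: a + b * x


def fit_nearest(xs, ys):
    """1-nearest-neighbour fit: reproduces every training point, noise included."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def summarize(process, fit, x0, n_fits=400, n_train=30, noise_sd=0.5, seed=1):
    """Return (bias^2, variance) of the prediction at x0 over many resampled datasets."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_fits):
        xs = [rng.uniform(-1, 1) for _ in range(n_train)]
        ys = [process(x) + rng.gauss(0, noise_sd) for x in xs]
        preds.append(fit(xs, ys)(x0))
    bias_sq = (statistics.fmean(preds) - process(x0)) ** 2
    return bias_sq, statistics.pvariance(preds)


# Top-figure analogue: nonlinear process, linear fit.
line_on_curve = summarize(lambda x: 3 * x * x, fit_line, x0=0.0)
# Bottom-figure analogue: linear process, very flexible fit.
flexible_on_line = summarize(lambda x: 2 * x, fit_nearest, x0=0.0)

print("line on curve    (bias^2, variance):", line_on_curve)
print("flexible on line (bias^2, variance):", flexible_on_line)
```

The linear fit should show a large squared bias with small variance, while the flexible fit should show the opposite pattern.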

• How can I calculate the bias and variance of my model?

Bias and variance can be calculated only when we have multiple estimators, each trained on a different dataset. In practice, we usually have a single dataset to train an estimator. In such a case, we can use bootstrapping or cross validation.
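One way to apply bootstrapping is sketched below: resample the single training set with replacement, fit one model per resample, and compare the resulting predictions on held-out points. The helper names, the synthetic data and the choice of a straight-line model are assumptions for illustration. Because the true $$f(x)$$ is unknown in practice, the bias term here is measured against the noisy targets and therefore also absorbs the noise.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns a prediction function."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return lambda x: a + b * x


def bootstrap_bias_variance(fit, x_train, y_train, x_test, y_test,
                            n_rounds=200, seed=0):
    """Estimate (mse, bias^2, variance) from bootstrap resamples of one training set."""
    rng = random.Random(seed)
    n = len(x_train)
    all_preds = []  # one list of test-set predictions per bootstrap round
    for _ in range(n_rounds):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        model = fit([x_train[i] for i in idx], [y_train[i] for i in idx])
        all_preds.append([model(x) for x in x_test])
    per_point = list(zip(*all_preds))  # predictions grouped by test point
    mean_preds = [statistics.fmean(p) for p in per_point]
    bias_sq = statistics.fmean((m - y) ** 2 for m, y in zip(mean_preds, y_test))
    variance = statistics.fmean(statistics.pvariance(p) for p in per_point)
    mse = statistics.fmean((p - y) ** 2
                           for preds in all_preds for p, y in zip(preds, y_test))
    return mse, bias_sq, variance


# Synthetic data (assumed): y = 1 + 2x plus Gaussian noise.
rng = random.Random(42)
xs = [rng.random() for _ in range(60)]
ys = [1 + 2 * x + rng.gauss(0, 0.3) for x in xs]
mse, bias_sq, variance = bootstrap_bias_variance(
    fit_line, xs[:40], ys[:40], xs[40:], ys[40:])
print(f"MSE={mse:.4f}  bias^2={bias_sq:.4f}  variance={variance:.4f}")
```

With these definitions, the average squared error equals the sum of the bias and variance terms, which is a quick sanity check on any implementation.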

We don't usually calculate bias and variance. Instead, the dataset is divided into training and test sets. The model is trained on the training set. It's then evaluated for the prediction error on the test set. This is equivalent to selecting the best one from a candidate list of estimators using bias and variance of each estimator. This similarity can also be observed with the error curves for $${Bias}^2 + Variance$$ and test set.

Overfitting can be observed when the training set error drops but the test set error increases. This is often an indication to consider a simpler model. Equivalently, in neural networks, it's an indication to stop the training process.
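The stopping rule described above can be sketched as a simple patience check. The function name, the `patience` parameter and the error values are hypothetical. The held-out errors are assumed to come from a validation split that's kept separate from the data used to fit the model.

```python
def early_stopping_epoch(val_errors, patience=3):
    """Return the index of the epoch whose model should be kept.

    Scanning stops once the held-out error has failed to improve
    for `patience` consecutive epochs.
    """
    best_epoch, best_error = 0, float("inf")
    for epoch, error in enumerate(val_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
        elif epoch - best_epoch >= patience:
            break
    return best_epoch


# Hypothetical per-epoch errors: training error keeps falling,
# held-out error turns upward after epoch 3.
train_errors = [0.90, 0.60, 0.45, 0.35, 0.28, 0.22, 0.17, 0.13]
val_errors = [0.95, 0.70, 0.58, 0.52, 0.55, 0.59, 0.64, 0.70]

stop_at = early_stopping_epoch(val_errors)
print(stop_at)  # prints 3
```

A larger `patience` tolerates noisy error curves at the cost of extra training epochs.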

It's important that the test set is not used for training. Otherwise, it's difficult to assess the model's performance.

• What are some possible methods to overcome bias-variance trade-off?

A common myth is that bias should be minimized at the expense of variance. It's important to minimize both. Resampling techniques such as bagging and cross validation help to reduce variance without increasing bias. With such techniques we build multiple models and predict using an ensemble of these models. A specific example is random forests used for classification. The variance of a single decision tree is reduced by random forests. The penalty is memory and computation due to multiple models.
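The variance reduction from bagging can be seen in a small experiment. This is a minimal sketch under stated assumptions: a 1-nearest-neighbour regressor stands in for the high-variance base learner (random forests use decision trees instead), and the linear process, noise level and ensemble size are chosen only for illustration.

```python
import random
import statistics


def fit_nearest(xs, ys):
    """High-variance base learner: predict the target of the nearest training point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def fit_bagged(xs, ys, rng, n_models=25):
    """Average n_models base learners, each trained on a bootstrap resample."""
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        models.append(fit_nearest([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: statistics.fmean(m(x) for m in models)


def prediction_variance(make_model, x0=0.5, n_datasets=300, n_train=30,
                        noise_sd=0.5, seed=2):
    """Variance of the prediction at x0 across independently drawn training sets."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_datasets):
        xs = [rng.random() for _ in range(n_train)]
        ys = [2 * x + rng.gauss(0, noise_sd) for x in xs]
        preds.append(make_model(xs, ys, rng)(x0))
    return statistics.pvariance(preds)


single_var = prediction_variance(lambda xs, ys, rng: fit_nearest(xs, ys))
bagged_var = prediction_variance(fit_bagged)
print(f"single model variance: {single_var:.4f}")
print(f"bagged model variance: {bagged_var:.4f}")
```

The bagged predictor trains 25 models per dataset instead of one, which is the memory and computation penalty mentioned above.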

Bagging reduces variance with little effect on bias. Boosting is a technique to reduce bias. In practice, boosting can hurt performance on noisy data. Moreover, boosting is known to increase variance at an exponentially decaying rate, which some call the exponential bias-variance trade-off.

• Is bias-variance trade-off applicable to neural networks?

Historically, it was believed that the trade-off applies to neural networks as well. To address the trade-off, techniques such as early stopping and dropout are used to avoid overfitting.

In the 2010s, neural networks challenged the classical bias-variance trade-off. The classical U-shaped risk curve was replaced with a double-descent risk curve. While bias decreases monotonically, variance first increases and then decreases after a point called the interpolation threshold. Beyond this point, as more parameters are added, the network performs better. In fact, this behaviour has been observed not just with neural networks but also with ensemble methods such as boosting and random forests.

In particular, performance is influenced by both width and depth of the network. Bias decreases as width increases. Variance decreases as width increases beyond the threshold. As depth increases, bias decreases and variance increases by a lesser amount. Deeper networks generalize better and this is mainly due to lower bias.

• Where exactly is the bias-variance trade-off relevant?

Bias-variance trade-off applies to supervised machine learning. It applies to both classification problems and regression problems. In general, it can be a useful conceptual framework when modelling any complex system.

The trade-off has been useful in analyzing human cognition. Given limited training data, we rely on high-bias, low-variance heuristics. These heuristics are fairly simple but generalize well to a wide variety of situations. Tasks such as object recognition use some "hard wiring" that's later fine tuned by experience. Do humans learn concepts based on prototypes (high bias, low variance) or exemplar models (low bias, high variance)? This sort of question can be investigated by the bias-variance trade-off.

In program analysis, precise abstractions may not lead to better results and bias-variance trade-off has been used to explain this. In fact, a tool produced using cross validation had better running time, found new defects and experienced fewer timeouts.

In reinforcement learning with partial observability, there's a similar trade-off between asymptotic bias and overfitting. A smaller state representation might decrease the risk of overfitting but at the cost of increasing asymptotic bias.

## Milestones

1952

In a paper titled On empirical spectral analysis of stochastic processes, Grenander introduces what he calls the uncertainty principle. He states, "if we want high resolvability we have to sacrifice some precision of the estimate and vice versa." The term resolvability relates to bias whereas the term precision relates to variance.

1975

Given discrete, noisy observations, Wahba and Wold show how a smooth curve can be fitted to the data via cross validation. Smoothing can be done to control variance or bias. Cross validation helps to control both and obtain a better fit.

1986

Hastie and Tibshirani discuss the bias-variance trade-off in the context of regression modelling. This is just an example to show that the trade-off is well known by the 1980s.

1992

Geman et al. note that a feedforward neural network trained by error backpropagation is essentially nonparametric regression. It's a model-free approach but requires lots of training data and is slow to converge. A model-based approach learns faster but is also biased: it can't address complex inference problems. They therefore state the trade-off clearly, "whereas incorrect models lead to high bias, truly model-free inference suffers from high variance."

1995

Historically, bias-variance trade-off started in regression with squared loss as the loss function. For classification problems, zero-one loss is used. For classification, Kong and Dietterich show that ensembles can reduce bias. In 1996, Breiman shows that ensembles can reduce variance.

1998

Schapire et al. show that ensembles enlarge the margins and thereby enable models to generalize better.

2000

Domingos proposes a unified bias-variance decomposition that can be applied to any loss function (squared loss, zero-one loss, etc.). The decomposition is not always additive. He notes that bias-variance trade-off behaviour is dependent on the loss function. Domingos also shows that Schapire's margin-based approach is equivalent to the bias-variance-based approach. An ensemble's generalization error can be expressed either in terms of the distribution of the margins or as a bias-variance decomposition of the error.

2004

Valentini and Dietterich perform bias-variance analysis on Support Vector Machines (SVMs) to get insights into how SVMs learn. They observe the expected bias-variance trade-off but they also see complex relationships, especially in Gaussian and polynomial kernels. They propose how bias-variance decomposition can be used to develop ensemble methods using SVMs as base learners.

Oct 2018

Neal et al. observe that since the mid-2010s, empirical results show that wider networks generalize better. The classical U-shaped test error curve due to bias-variance trade-off is being defied by neural networks. In their experiments, they show that bias and variance decrease as more parameters are added to the network.

## Sample Code

# Source: https://machinelearningmastery.com/calculate-the-bias-variance-trade-off/
# Accessed 2020-09-17

# estimate the bias and variance for a regression model
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from mlxtend.evaluate import bias_variance_decomp
# load the dataset (the CSV file has no header row)
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
# separate into inputs (all but the last column) and outputs (last column)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
# split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# define the model
model = LinearRegression()
# estimate bias and variance over 200 bootstrap rounds
mse, bias, var = bias_variance_decomp(model, X_train, y_train, X_test, y_test,
                                      loss='mse', num_rounds=200, random_seed=1)
# summarize results
print('MSE: %.3f' % mse)
print('Bias: %.3f' % bias)
print('Variance: %.3f' % var)

