# Bias-Variance Trade-off

## Summary

In statistics and machine learning, we collect data, build models from this data and make inferences. With too little data, the model is most likely not representative of the truth since it's biased towards what it sees. With too much data, the model could become overly complex if it attempts to account for all the variations it sees.

Ideally, we want models to have low bias and low variance. In practice, lower bias leads to higher variance, and vice versa. For this reason, we call it Bias-Variance Trade-off, also called *Bias-Variance Dilemma*.

There are techniques to address this trade-off. The idea is to get the right balance of bias and variance that's acceptable for the problem. A good model must be rich enough to express underlying structure in data and simple enough to avoid fitting spurious patterns.

## Milestones

2018

## Discussion

Could you explain bias-variance trade-off with examples?

Suppose we collect income data from multiple cities across professions. While income can be correlated with profession, there will be variations across cities due to differing lifestyles, cost of living, tax rules, etc. For example, a doctor in London would have a higher income than a doctor in Leicester. This can be called heterogeneity in data. In regression modelling, a model built on this data will have high variance and predictions may not be accurate.

We can overcome this by making the data more homogeneous by splitting the data by city. Thus, we'll end up with multiple models, one per city. Each model is biased to its city but has lower variance. We could also choose to split the data by state or region. The amount of variance that can be tolerated will dictate bias.

Consider KNN classification as another example. At low \(k\), predictions are not consistent due to high variations. When we consider more neighbours, we get better predictions as variance is reduced. However, if \(k\) is too high we start considering neighbours that are "too far", which contributes to increased bias.
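The effect of \(k\) can be reproduced with a small simulation. The sketch below is only an illustration under stated assumptions: it swaps classification for KNN regression to keep the code short, and the process `true_f`, the noise level, the sample size and the query point `x0` are all hypothetical choices. It draws many training sets and measures how the prediction at one point varies (variance) and how far its average is from the truth (squared bias).

```python
import random
import statistics

def true_f(x):
    """Hypothetical underlying process, chosen only for this illustration."""
    return x * x

def sample_training_set(rng, n=50, noise_sd=0.3):
    """Draw n noisy observations of true_f with x uniform on [0, 1]."""
    xs = [rng.random() for _ in range(n)]
    return [(x, true_f(x) + rng.gauss(0.0, noise_sd)) for x in xs]

def knn_predict(train, x0, k):
    """Average the targets of the k training points closest to x0."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x0))[:k]
    return statistics.fmean(y for _, y in nearest)

def bias_sq_and_variance(k, x0=0.9, n_sets=500, seed=0):
    """Estimate squared bias and variance of the KNN prediction at x0."""
    rng = random.Random(seed)
    preds = [knn_predict(sample_training_set(rng), x0, k) for _ in range(n_sets)]
    bias_sq = (statistics.fmean(preds) - true_f(x0)) ** 2
    return bias_sq, statistics.pvariance(preds)

# k = 50 uses every training point, so the prediction is the global mean.
results = {k: bias_sq_and_variance(k) for k in (1, 5, 50)}
for k, (bias_sq, variance) in results.items():
    print(f"k={k:2d}  bias^2={bias_sq:.4f}  variance={variance:.4f}")
```

Under these assumptions, variance should shrink as \(k\) grows, while squared bias should grow sharply once \(k\) is large enough to pull in neighbours that are far from the query point.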

What's the intuition behind bias-variance trade-off?

Assume we're building a prediction model. We build multiple models from different data samples. **Bias** is a measure of how far the predictions, averaged across these models, are from the true value. **Variance** is a measure of the variability of predictions across these models. This is illustrated graphically with a bulls-eye diagram.

Intuitively, we can say that a biased model is too simple. It's unable to capture essential patterns in the training data. We say that such a model is **underfitting** the data.

On the other hand, a model with high variance is a complex model. It's in fact sensitive to the training data. It's **overfitting** the data. When it sees new data, it's unable to predict correctly since it's overfitted to the training data. We might also state that such a model **does not generalize** well. A simpler model would have done better but if it becomes too simple it also becomes biased.

In summary, a biased model is underfitted and of low complexity. A model of high variance is overfitted and of high complexity.

What's the math behind bias-variance trade-off?

Given x-y data points, we can represent the relationship as \(Y = f(X) + \epsilon\), where \(\epsilon\) is a Normally distributed error term with zero mean and variance \({\sigma_\epsilon}^2\), that is, \(N(0,{\sigma_\epsilon}^2)\). Let \(\hat{f}(X)\) be an estimate of \(f(X)\) obtained via any modelling technique such as linear regression or KNN. Let's use mean squared error for the prediction error. Thus, the expected prediction error at point \(x\) is,

$$\begin{align}\Bbb{E}[(y - \hat{f}(x))^2] = & \; \Bbb{E}[(f(x) - \Bbb{E}[\hat{f}(x)])^2] \\ & + \Bbb{E}[(\hat{f}(x) - \Bbb{E}[\hat{f}(x)])^2] \\ & + {\sigma_\epsilon}^2\end{align}$$

The first term is the squared bias of the estimator. The second term is the variance of the estimator. The third term is simply noise. A perfect model would eliminate both bias and variance, but not the noise. Noise contributes to what we call *irreducible error*. \(\Bbb{E}[\hat{f}(x)]\) is the average prediction from various estimators. Each estimator is trained on a different sampling of the dataset.

The idea of separating and analysing bias and variance terms of the prediction error is called *bias-variance decomposition*.

Perfect models don't exist. In practice, we aim for a model that attempts to minimize the error, neither underfitting nor overfitting.
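The three-term formula for the expected prediction error follows from a short expansion. As a sketch, assume the noise \(\epsilon\) has zero mean and is independent of \(\hat{f}(x)\), and write \(\bar{f}(x) = \Bbb{E}[\hat{f}(x)]\), a shorthand introduced only for this derivation. Since \(y = f(x) + \epsilon\), the error at \(x\) splits into three parts,

$$\begin{align}y - \hat{f}(x) = & \; (f(x) - \bar{f}(x)) + (\bar{f}(x) - \hat{f}(x)) + \epsilon\end{align}$$

Squaring and taking expectations gives,

$$\begin{align}\Bbb{E}[(y - \hat{f}(x))^2] = & \; (f(x) - \bar{f}(x))^2 + \Bbb{E}[(\hat{f}(x) - \bar{f}(x))^2] + \Bbb{E}[\epsilon^2] \\ & + 2\,(f(x) - \bar{f}(x))\,\Bbb{E}[\bar{f}(x) - \hat{f}(x)] \\ & + 2\,(f(x) - \bar{f}(x))\,\Bbb{E}[\epsilon] + 2\,\Bbb{E}[(\bar{f}(x) - \hat{f}(x))\,\epsilon]\end{align}$$

All three cross terms vanish: \(\Bbb{E}[\bar{f}(x) - \hat{f}(x)] = 0\) by the definition of \(\bar{f}(x)\), \(\Bbb{E}[\epsilon] = 0\), and the last term factorizes into a product of two zero means because of independence. With \(\Bbb{E}[\epsilon^2] = {\sigma_\epsilon}^2\), what remains is squared bias plus variance plus noise. Because \(f(x) - \bar{f}(x)\) is not random, wrapping the squared bias in an outer expectation leaves it unchanged.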

Could you explain specific examples of high/low bias/variance?

In this example, we use \(f(x)\) for the underlying process (purple) and \(\hat{f}(x)\) for our estimate of the process (orange). Individual fitted functions (orange) are averaged to give \(\Bbb{E}[\hat{f}(x)]\) (green).

Consider a nonlinear process \(f(x)\) (top figure). We don't know that it's nonlinear. We attempt to fit a linear function \(\hat{f}(x)\) to the data. The data may also contain noise. We take different samples of the data and find suitable fits. None of the lines are close to \(f(x)\). Thus, our fits are all biased, in fact, biased towards linear functions. However, the lines do not differ much from one another, which implies low variance.

Now consider a linear process \(f(x)\) (bottom figure). We now attempt to fit a nonlinear function \(\hat{f}(x)\) to the data. The functions are complex enough to fit the data and the noise. In fact, it's overfitting. Each nonlinear curve fits its own data and the curves all look different. Thus, our model \(\hat{f}(x)\) exhibits high variance. When we average these fits, we get a line that's close to the original process. Thus, there's low bias.
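Both cases can be imitated numerically. The following is a minimal sketch, not a reproduction of the figures: the sine process, the line \(2x\), the six fixed design points, the noise level and the query point `X0` are all assumptions made for illustration. A least-squares line is fitted to the nonlinear process, and a polynomial that interpolates every noisy point is fitted to the linear process.

```python
import math
import random
import statistics

XS = [i / 5 for i in range(6)]   # fixed design points 0.0, 0.2, ..., 1.0
NOISE_SD = 0.2                   # assumed noise level
X0 = 0.1                         # query point where the fits are compared

def fit_line(xs, ys):
    """Ordinary least-squares line; returns a function x -> prediction."""
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return lambda x: y_bar + slope * (x - x_bar)

def fit_interpolating_poly(xs, ys):
    """Lagrange polynomial of degree len(xs) - 1 through every noisy point."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            weight = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    weight *= (x - xj) / (xi - xj)
            total += yi * weight
        return total
    return predict

def simulate(process, fit, n_trials=2000, seed=0):
    """Refit on fresh noisy samples; return (squared bias, variance) at X0."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_trials):
        ys = [process(x) + rng.gauss(0.0, NOISE_SD) for x in XS]
        preds.append(fit(XS, ys)(X0))
    bias_sq = (statistics.fmean(preds) - process(X0)) ** 2
    return bias_sq, statistics.pvariance(preds)

# Nonlinear process, linear fit: expected to show high bias, low variance.
underfit = simulate(lambda x: math.sin(2 * math.pi * x), fit_line)
# Linear process, interpolating fit: expected to show low bias, high variance.
overfit = simulate(lambda x: 2 * x, fit_interpolating_poly)

print(f"line on sine:        bias^2={underfit[0]:.4f}  variance={underfit[1]:.4f}")
print(f"polynomial on line:  bias^2={overfit[0]:.4f}  variance={overfit[1]:.4f}")
```

The interpolating polynomial reproduces a straight line exactly when there is no noise, so its average prediction stays close to the true process, while each individual fit chases the noise in its own sample.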

How can I calculate the bias and variance of my model?

Bias and variance can be calculated only when we have multiple estimators, each trained on a different dataset. In practice, we usually have a single dataset to train an estimator. In such a case, we can use bootstrapping or cross validation.
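As a rough sketch of the bootstrapping idea, the code below resamples a single dataset with replacement, refits a simple estimator on each resample and looks at how much the prediction at one point varies. The dataset, the least-squares line used as the estimator and the helper names `fit_line` and `bootstrap_variance` are hypothetical. Note that this only approximates the variance term; estimating bias would additionally need the true process, which is unknown in practice.

```python
import random
import statistics

def fit_line(data):
    """Ordinary least-squares line through (x, y) pairs."""
    xs, ys = zip(*data)
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in data) / sxx
    return lambda x: y_bar + slope * (x - x_bar)

def bootstrap_variance(data, x0, n_resamples=500, seed=0):
    """Variance of the prediction at x0 across bootstrap refits."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_resamples):
        resample = rng.choices(data, k=len(data))  # sample with replacement
        preds.append(fit_line(resample)(x0))
    return statistics.pvariance(preds)

# Hypothetical single dataset: y = 1 + 2x plus Gaussian noise (sd = 0.5).
data_rng = random.Random(42)
dataset = [(x, 1 + 2 * x + data_rng.gauss(0.0, 0.5))
           for x in (data_rng.random() for _ in range(40))]

variance_at_half = bootstrap_variance(dataset, x0=0.5)
print(f"bootstrap variance of the prediction at x=0.5: {variance_at_half:.4f}")
```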

We don't usually calculate bias and variance. Instead, the dataset is divided into training and test sets. The model is trained on the training set. It's then evaluated for the prediction error on the test set. This is equivalent to selecting the best one from a candidate list of estimators using bias and variance of each estimator. This similarity can also be observed with the error curves for \({Bias}^2 + Variance\) and test set.

Overfitting can be observed when the training set error drops but the test set error increases. This is often an indication to consider a simpler model. Equivalently, in neural networks, it's an indication to stop the training process.

It's important that the test set is not used for training. Otherwise, it's difficult to assess the model's performance.
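The train/test procedure can be sketched in a few lines. The example below is an assumption-laden illustration: it uses a hypothetical noisy sine process and KNN regression, where a smaller \(k\) means a more flexible model, and compares mean squared error on the training and test sets for three values of \(k\).

```python
import math
import random
import statistics

def knn_predict(train, x0, k):
    """Average the targets of the k training points closest to x0."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x0))[:k]
    return statistics.fmean(y for _, y in nearest)

def mse(train, points, k):
    """Mean squared error of a KNN model built on `train`, scored on `points`."""
    return statistics.fmean((y - knn_predict(train, x, k)) ** 2 for x, y in points)

# Hypothetical data: noisy sine with x uniform on [0, 1].
rng = random.Random(0)
data = [(x, math.sin(2 * math.pi * x) + rng.gauss(0.0, 0.3))
        for x in (rng.random() for _ in range(700))]
train, test = data[:200], data[200:]   # the test set is never used for fitting

train_mse, test_mse = {}, {}
for k in (1, 7, 200):                  # from most flexible to least flexible
    train_mse[k] = mse(train, train, k)
    test_mse[k] = mse(train, test, k)
    print(f"k={k:3d}  train MSE={train_mse[k]:.3f}  test MSE={test_mse[k]:.3f}")
```

At \(k=1\) every training point is its own nearest neighbour, so the training error vanishes while the test error does not, which is the overfitting signature described above. At \(k=200\) the model predicts the training mean everywhere and underfits.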

What are some possible methods to overcome bias-variance trade-off?

A common myth is that bias should be minimized even at the expense of variance. It's important to minimize both. Resampling techniques such as bagging and cross validation help to reduce variance without increasing bias. By such techniques we build multiple models and predict using an **ensemble** of these models. A specific example is random forests used for classification. The variance of a single decision tree is reduced by random forests. The penalty is memory and computation due to multiple models. Bagging reduces variance with little effect on bias.
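The variance reduction from bagging can be illustrated with a toy simulation. This is a sketch under assumptions, not a random forest: a 1-nearest-neighbour regressor stands in for the high-variance base learner in place of a decision tree, and the process `true_f`, the noise level and the number of bootstrap models are hypothetical choices.

```python
import random
import statistics

def true_f(x):
    """Hypothetical underlying process, chosen only for this illustration."""
    return x * x

def sample_training_set(rng, n=30, noise_sd=0.3):
    """Draw n noisy observations of true_f with x uniform on [0, 1]."""
    return [(x, true_f(x) + rng.gauss(0.0, noise_sd))
            for x in (rng.random() for _ in range(n))]

def one_nn_predict(train, x0):
    """High-variance base learner: return the target of the closest point."""
    return min(train, key=lambda p: abs(p[0] - x0))[1]

def bagged_predict(train, x0, rng, n_models=25):
    """Average 1-NN predictions over bootstrap resamples of the training set."""
    preds = []
    for _ in range(n_models):
        resample = rng.choices(train, k=len(train))  # sample with replacement
        preds.append(one_nn_predict(resample, x0))
    return statistics.fmean(preds)

def compare(x0=0.5, n_trials=300, seed=0):
    """Variance at x0 of a single model and of a bagged ensemble."""
    rng = random.Random(seed)
    single, bagged = [], []
    for _ in range(n_trials):
        train = sample_training_set(rng)
        single.append(one_nn_predict(train, x0))
        bagged.append(bagged_predict(train, x0, rng))
    return statistics.pvariance(single), statistics.pvariance(bagged)

single_var, bagged_var = compare()
print(f"single model variance:    {single_var:.4f}")
print(f"bagged ensemble variance: {bagged_var:.4f}")
```

The cost mentioned in the text is visible here too: each bagged prediction needs 25 model fits instead of one.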

**Boosting** is a technique to reduce bias. In practice, boosting can hurt performance on noisy data. Moreover, boosting is known to increase variance at an exponentially decaying rate, which some call *exponential bias-variance trade-off*.

Is bias-variance trade-off applicable to neural networks?

Historically, it was believed that the trade-off applies to neural networks as well. To address the trade-off, early stopping and dropout are techniques to avoid overfitting.

In the 2010s, neural networks challenged the classical bias-variance trade-off. The classical **U-shaped risk curve** was replaced with a **double-descent risk curve**. While bias decreases monotonically, variance first increases and then decreases after a point called the **interpolation threshold**. Beyond this point, as more parameters are added, the network performs better. In fact, this behaviour has been observed not just with neural networks but also with ensemble methods such as boosting and random forests.

In particular, performance is influenced by both width and depth of the network. Bias decreases as width increases. Variance decreases as width increases beyond the threshold. As depth increases, bias decreases and variance increases by a lesser amount. Deeper networks generalize better and this is mainly due to lower bias.

Where exactly is the bias-variance trade-off relevant?

Bias-variance trade-off applies to supervised machine learning. It applies to both classification problems and regression problems. In general, it can be a useful conceptual framework when modelling any complex system.

The trade-off has been useful in analyzing human cognition. Given limited training data, we rely on high-bias, low-variance heuristics. These heuristics are fairly simple but generalize well to a wide variety of situations. Tasks such as object recognition use some "hard wiring" that's later fine-tuned by experience. Do humans learn concepts based on prototypes (high bias, low variance) or exemplar models (low bias, high variance)? This sort of question can be investigated using the bias-variance trade-off.

In program analysis, precise abstractions may not lead to better results and bias-variance trade-off has been used to explain this. In fact, a tool produced using cross validation had better running time, found new defects and experienced fewer timeouts.

In reinforcement learning with partial observability, there's a similar trade-off between asymptotic bias and overfitting. A smaller state representation might decrease the risk of overfitting but at the cost of increasing asymptotic bias.

## Sample Code

## References

- Belkin, Mikhail, Daniel Hsu, Siyuan Ma, and Soumik Mandal. 2019. "Reconciling modern machine-learning practice and the classical bias–variance trade-off." PNAS, vol. 116, no. 32, pp. 15849-15854. Accessed 2020-09-17.
- Briscoe, Erica, and Jacob Feldman. 2011. "Conceptual complexity and the bias/variance tradeoff." Cognition, vol. 118, pp. 2-16, Elsevier B.V. Accessed 2020-09-17.
- Cornell University. 2005. "Bias/Variance Tradeoff." CS578, Cornell University. Accessed 2020-09-17.
- Domingos, Pedro. 2000. "A Unified Bias-Variance Decomposition and its Applications." In Proc. 17th International Conf. on Machine Learning, pp. 231-238, Morgan Kaufmann. Accessed 2020-09-17.
- Fortmann-Roe, Scott. 2012. "Understanding the Bias-Variance Tradeoff." June. Accessed 2020-09-17.
- Francois-Lavet, Vincent, Guillaume Rabusseau, Joelle Pineau, Damien Ernst, and Raphael Fonteneau. 2020. "On Overfitting and Asymptotic Bias in Batch Reinforcement Learning with Partial Observability." Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, pp. 5055-5059, July. Accessed 2020-09-17.
- Geman, Stuart, Elie Bienenstock, and René Doursat. 1992. "Neural Networks and the Bias/Variance Dilemma." Neural Computation, vol. 4, no. 1, pp. 1-58, January. Accessed 2020-09-17.
- Grenander, Ulf. 1952. "On empirical spectral analysis of stochastic processes." Arkiv för Matematik, vol. 1, no. 6, pp. 503-531. Accessed 2020-09-17.
- Hastie, Trevor and Robert Tibshirani. 1986. "Generalized Additive Models." Statistical Science, vol. 1, no. 3, pp. 297-310. Accessed 2020-09-17.
- Neal, Brady, Sarthak Mittal, Aristide Baratin, Vinayak Tantia, Matthew Scicluna, Simon Lacoste-Julien, and Ioannis Mitliagkas. 2019. "A Modern Take on the Bias-Variance Tradeoff in Neural Networks." arXiv, v4, December 18. Accessed 2020-09-17.
- Rojas, Raúl. 2015. "The Bias-Variance Dilemma." February 10. Accessed 2020-09-17.
- Sharma, Rahul, Aditya V. Nori, and Alex Aiken. 2014. "Bias-Variance Tradeoffs in Program Analysis." POPL ’14, ACM, January 22-24. Accessed 2020-09-17.
- Stansbury, Dustin. 2020. "Model Selection: Underfitting, Overfitting, and the Bias-Variance Tradeoff." The Clever Machine, July 20. Accessed 2020-09-17.
- Valentini, Giorgio, and Thomas G. Dietterich. 2004. "Bias-Variance Analysis of Support Vector Machines for the Development of SVM-Based Ensemble Methods." Journal of Machine Learning Research, vol. 5, pp. 725-775. Accessed 2020-09-17.
- Wahba, G. and S. Wold. 1975. "A completely automatic french curve: fitting spline functions by cross validation." Communications in Statistics, vol. 4, no. 1. doi: 10.1080/03610927508827223. Accessed 2020-09-19.
- Wikipedia. 2020. "Bias–variance tradeoff." Wikipedia, September 10. Accessed 2020-09-17.
- Wågberg, Johan. 2020. "Lecture 5 – Cross-validation and the bias-variance trade-off." In: Statistical Machine Learning, Uppsala University. Accessed 2020-09-17.
- Yang, Zitong, Yaodong Yu, Chong You, Jacob Steinhardt, and Yi Ma. 2020. "Rethinking Bias-Variance Trade-off for Generalization of Neural Networks." arXiv, v2, March 21. Accessed 2020-09-17.
- Yu, Lean, Kin Keung Lai, Shouyang Wang, and Wei Huang. 2006. "A Bias-Variance-Complexity Trade-Off Framework for Complex System Modeling." In: M. Gavrilova et al. (eds.), ICCSA 2006, LNCS 3980, pp. 518-527, Springer-Verlag Berlin Heidelberg. Accessed 2020-09-17.

## Tags

## See Also

- Overfitting and Underfitting
- Ensemble Learning
- Boosting (Machine Learning)
- Regression Modelling
- Analysis of Variance
- Machine Learning

## Further Reading

- Geman, Stuart, Elie Bienenstock, and René Doursat. 1992. "Neural Networks and the Bias/Variance Dilemma." Neural Computation, vol. 4, no. 1, pp. 1-58, January. Accessed 2020-09-17.
- Rojas, Raúl. 2015. "The Bias-Variance Dilemma." February 10. Accessed 2020-09-17.
- Neal, Brady, Sarthak Mittal, Aristide Baratin, Vinayak Tantia, Matthew Scicluna, Simon Lacoste-Julien, and Ioannis Mitliagkas. 2019. "A Modern Take on the Bias-Variance Tradeoff in Neural Networks." arXiv, v4, December 18. Accessed 2020-09-17.
- Neal, Brady. 2019. "On the Bias-Variance Tradeoff: Textbooks Need an Update." M.Sc. Thesis, Université de Montréal, December 10. Accessed 2020-09-17.
- Brownlee, Jason. 2016. "Gentle Introduction to the Bias-Variance Trade-Off in Machine Learning." Machine Learning Mastery, March 18. Updated 2019-10-25. Accessed 2020-09-17.
- Brownlee, Jason. 2020. "How to Calculate the Bias-Variance Trade-off with Python." Machine Learning Mastery, August 19. Updated 2020-08-26. Accessed 2020-09-17.