Linear Regression
Summary

Linear regression is a statistical technique used to establish the relationship between variables in a dataset. The equation \(y = mx + c\) describes a linear relationship between dependent variable \(y\) and independent variable \(x\). We may state that \(y\) depends on \(x\). Given sufficient data, linear regression estimates the values of coefficient \(m\) and constant \(c\). In a geometric interpretation, \(m\) is the slope and \(c\) is the intercept. In an alternative notation, these are expressed as \(β_1\) and \(β_0\) respectively.
Variable \(y\) is also called the response or predicted variable. Variable \(x\) is also called the predictor variable. The reason for this is that once the model parameters \(β_0\) and \(β_1\) are estimated, we can make predictions of \(y\) given any value of \(x\).
Linear regression is a field of statistics. This article looks at the types, important assumptions and techniques in linear regression.
Discussion
Could you explain linear regression with some examples? Linear regression is frequently used by businesses to understand the link between advertising budget and revenue. In other words, it answers the question "For every advertising dollar I spend, how much will my revenue increase?" This can be modelled as \(Revenue=β_0+β_1 \cdot AdSpending\).
\(β_0\) represents the total expected revenue when ad spending is zero. The coefficient \(β_1\) represents the average increase in total revenue when ad spending is increased by one unit. When \(β_1<0\), higher ad spending is associated with lower revenue. When \(β_1\) is close to zero, ad spending has little effect on revenue. When \(β_1>0\), higher ad spending leads to higher revenue. The model thus aids decision making: a company may decrease or increase its ad spending based on the value of \(β_1\).
In the figure, the red line is the best-fit straight line, \(y = 4.187 - 0.356x\). The values \(β_0=4.187\) and \(β_1=-0.356\) are what regression analysis has estimated from the available data. Yield falls by 0.356% for every 1% increase in cultivated area. This model can now be used to make predictions; that is, given an area, we can predict the yield.
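A straight-line fit of this kind can be sketched in Python with NumPy. The data below is synthetic (generated around the figure's coefficients for illustration, not the actual dataset behind the chart):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: yield falls with cultivated area (illustrative values only)
area = rng.uniform(1, 10, size=50)
yield_pct = 4.187 - 0.356 * area + rng.normal(0, 0.1, size=50)

# np.polyfit with degree 1 performs a least-squares straight-line fit;
# it returns coefficients from highest degree down: [slope, intercept]
beta1, beta0 = np.polyfit(area, yield_pct, deg=1)

# The fitted model can now make predictions: given an area, predict the yield
new_area = 5.0
predicted_yield = beta0 + beta1 * new_area
```

On this synthetic data the estimates land close to the generating values, beta0 near 4.187 and beta1 near -0.356.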
What are the main types of linear regression models? The main types of linear regression models are:
 Simple Linear Regression: This is the most basic type and deals with a single predictor variable. Predicting revenue from ad spending is an example.
 Multiple Linear Regression: Aka multivariable linear regression. This is applicable when there are many predictor variables. An example is predicting wine prices, which depend on mean growing season temperature, harvest rainfall, winter rainfall, and more.
 Hierarchical Linear Model: Aka multilevel regression. Such a model captures the natural hierarchy in predictor variables. Analysis involves a hierarchy of regressions, such as A regressed on B, and B regressed on C. For example, students are nested within classrooms, classrooms within schools, and schools within districts. So a student's test score can be modelled based on overall performance at different levels.
What's the mathematical notation of a linear regression model? We consider the general case of multiple linear regression with \(k\) independent variables. The model therefore has to estimate \(k+1\) parameters: constant \(β_0\) and coefficients \(β_j\) for \(1\le j\le k\). The mathematical notation of this model is given in the figure.
In linear algebra, this is expressed as \(Y = X \cdot β + ϵ\), where \(X\) is an \(m \times (k+1)\) matrix, \(β\) is a \((k+1)\)-dimensional vector, \(Y\) and \(ϵ\) are \(m\)-dimensional vectors, and \(m\) is the number of observations or data points.
Regression analysis is simply finding the \(β\) that minimizes the error term \(ϵ\). This leads us to what's called the normal equation: \(β=(X^TX)^{-1} \cdot X^TY\). As is apparent, this equation is in a form that solves for the model parameters \(β\).
In Machine Learning (ML), it's common to use \(θ\) instead of \(β\) for the model parameters.
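The normal equation translates directly into NumPy. This is a minimal sketch with synthetic data; note that in practice `np.linalg.lstsq` is preferred over explicitly inverting \(X^TX\) for numerical stability:

```python
import numpy as np

rng = np.random.default_rng(0)
m, k = 100, 3                       # m observations, k predictors

# Design matrix X: a column of 1s (for the constant β0) plus k predictor columns
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, k))])
true_beta = np.array([2.0, 0.5, -1.0, 3.0])
Y = X @ true_beta + rng.normal(scale=0.1, size=m)

# Normal equation: β = (XᵀX)⁻¹ XᵀY
beta_normal = np.linalg.inv(X.T @ X) @ X.T @ Y

# Numerically preferred equivalent
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
```

Both solutions agree, and with low noise they recover the generating parameters closely.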
What are some estimation methods in linear regression? Methods commonly used for estimating the model parameters (also called estimates) are:
 Ordinary Least Squares (OLS): This looks at the sum of squared differences between each observed value and its prediction from the model. The method attempts to minimize this sum.
 Method of Moments (MoM): This uses moments, which are the expectations of the powers of a random variable. The number of moments to be calculated is equal to the number of unknown parameters. The resulting system of equations is then solved.
 Maximum Likelihood Estimate (MLE): This seeks to maximize the likelihood function. In other words, we determine the estimates that make the observed values most probable. The MLE method is applicable when the probability distribution of the error terms is known.
Related to OLS are more sophisticated methods including Weighted Least Squares (WLS) and Generalized Least Squares (GLS). Less common ones are Least Median Squares and Least Trimmed Squares. In fact, the minimization need not be of the squares. We could minimize on Least Absolute Deviations, Huber, Bisquare, etc.
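As a sketch of how WLS differs from OLS, the closed form \(β=(X^TWX)^{-1}X^TWY\) can be applied to synthetic data whose error variance grows with the predictor (the setting where WLS is appropriate). The data and weights here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 200
x = rng.uniform(0, 10, m)
X = np.column_stack([np.ones(m), x])

# Heteroscedastic noise: error spread grows with x
sigma = 0.1 + 0.3 * x
y = 1.0 + 2.0 * x + rng.normal(scale=sigma)

# OLS: minimize unweighted squared residuals
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# WLS: weight each observation by the inverse of its error variance
W = np.diag(1.0 / sigma**2)
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
```

Both estimators are unbiased here, but WLS gives the more precise estimates because it down-weights the noisy observations.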
How do I evaluate the performance of a linear regression model? Models are rarely perfect and there's a need to measure how good a model really is. The differences between predicted values and actual values are called residuals. Model evaluation and validation of the assumptions can be performed on the residuals; this field of study is called Residual Analysis.
Mean Absolute Error (MAE) and Mean Squared Error (MSE) are two ways to quantify the residuals. MAE looks at absolute differences. MSE looks at the square of the differences. Another measure is Root Mean Squared Error (RMSE), which is the square root of MSE. RMSE has the same unit as the output variable, making it easier to interpret.
Perhaps the most widely used statistical measure is R-Squared (R²). It quantifies the proportion of the variation explained by the model. The closer it is to 1, the better the model explains the data. R² is also called the Coefficient of Determination.
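All four metrics are simple functions of the residuals. A minimal sketch (the `regression_metrics` helper and sample values are illustrative, not from the article):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    residuals = y_true - y_pred
    mae = np.mean(np.abs(residuals))          # mean of absolute residuals
    mse = np.mean(residuals**2)               # mean of squared residuals
    rmse = np.sqrt(mse)                       # same unit as the output variable
    ss_res = np.sum(residuals**2)
    ss_tot = np.sum((y_true - np.mean(y_true))**2)
    r2 = 1 - ss_res / ss_tot                  # proportion of variation explained
    return mae, mse, rmse, r2

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.1, 7.3, 8.9])
mae, mse, rmse, r2 = regression_metrics(y_true, y_pred)
# mae = 0.175, mse = 0.0375, r2 = 0.9925
```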
What are the main assumptions when constructing a linear regression model? In linear regression, we usually make the following assumptions:
 Linearity: The dependent variable Y is related to the independent variables X in a linear way.
 Independence: Observations are independent of one another. We could also say that the residuals are independent of Y. In time-series data, observations are not correlated. When data doesn't meet this assumption, we have a problem called autocorrelation.
 Normality: Residuals are normally distributed. Equivalently, at a fixed observation X, the dependent variable Y is normally distributed.
 Homoscedasticity: The residuals have the same variance at all predicted or fitted points Y. When data doesn't meet this assumption, we have a problem called heteroscedasticity.
If one or more of these assumptions is violated, what the model predicts may be incorrect or even deceptive.
How can I determine if linear regression is appropriate for a particular set of data? A scatterplot can help us validate the linearity assumption. For multiple linear regression, 2D pairwise scatterplots, rotating plots, and dynamic graphs can help.
To validate the independence assumption, a scatterplot of residuals versus fitted values shouldn't show any pattern.
To validate the normality assumption, a normal probability plot, a residual histogram or a quantile-quantile plot can be used.
To validate the homoscedasticity assumption, do a scatterplot of residuals against the fitted values. A cone-shaped pattern implies that the residuals vary more for some predicted values than others, thus invalidating the assumption. There are also several statistical tests to check for homoscedasticity: Bartlett's Test, Box's Test, Brown-Forsythe Test, Hartley's F-max Test, Levene's Test, and the Breusch-Pagan Test.
We also expect predictor variables to be independent of one another. A scatterplot of one independent variable against another can validate this. We can also calculate the correlation coefficients pairwise for all independent variables. Correlation coefficients close to ±1 imply high correlation. Low model coefficients or a high Variance Inflation Factor (VIF) indicate correlated variables.
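The VIF for predictor \(j\) is \(1/(1-R_j^2)\), where \(R_j^2\) comes from regressing predictor \(j\) on the remaining predictors. A sketch with synthetic data, where one predictor is deliberately a near-copy of another:

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor for each column of predictor matrix X."""
    m, k = X.shape
    vifs = []
    for j in range(k):
        # Regress column j on an intercept plus all other columns
        others = np.hstack([np.ones((m, 1)), np.delete(X, j, axis=1)])
        target = X[:, j]
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        pred = others @ beta
        r2 = 1 - np.sum((target - pred)**2) / np.sum((target - target.mean())**2)
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

rng = np.random.default_rng(7)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
x3 = x1 + 0.05 * rng.normal(size=100)   # nearly a copy of x1: collinear
X = np.column_stack([x1, x2, x3])
vifs = vif(X)
# VIFs for x1 and x3 are large; the VIF for the independent x2 stays near 1
```

A common rule of thumb treats VIF values above 5 or 10 as a sign of problematic multicollinearity.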
How do autocorrelation, multicollinearity, and heteroscedasticity affect linear regression estimates? Autocorrelation exists when multiple observations of a predictor variable are not independent. This is common in time-series data but could occur in other scenarios such as samples drawn from a cluster or geographic area. Autocorrelation can be detected with the Durbin-Watson test. Due to autocorrelation, OLS estimators will be inefficient. The estimated variance of regression coefficients will be biased and inconsistent. Hypothesis testing will be invalid and R² will be overestimated.
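The Durbin-Watson statistic can be computed directly from the residuals as \(\sum_t (e_t - e_{t-1})^2 / \sum_t e_t^2\): values near 2 indicate no autocorrelation, values toward 0 indicate positive autocorrelation, and values toward 4 indicate negative autocorrelation. A sketch comparing independent and autocorrelated synthetic series:

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson statistic: ~2 no autocorrelation, ->0 positive, ->4 negative."""
    diff = np.diff(residuals)
    return np.sum(diff**2) / np.sum(residuals**2)

rng = np.random.default_rng(3)
n = 500

# Independent residuals: statistic should be close to 2
independent = rng.normal(size=n)

# AR(1) residuals: each value strongly correlated with the previous one
correlated = np.zeros(n)
for t in range(1, n):
    correlated[t] = 0.9 * correlated[t - 1] + rng.normal()

dw_indep = durbin_watson(independent)    # close to 2
dw_corr = durbin_watson(correlated)      # well below 2
```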
A correlation between two or more predictor variables is referred to as multicollinearity. Though the overall model fit is not affected, multicollinearity can increase the variance of estimates and make them sensitive to model changes. The model becomes harder to interpret, and it's harder to determine the precise effect of each predictor.
Heteroscedasticity means unequal spread of the residuals. It's a problem because OLS regression assumes constant spread. Though this doesn't introduce bias to the estimates, it does make them less precise. It produces smaller p-values because OLS regression doesn't detect the increase in the variance of the estimates. Thus, we may wrongly conclude that an estimate is statistically significant.
How do we model the interaction of independent variables? When one independent variable has a distinct effect on the outcome based on the values of another independent variable, we call this an interaction.
Assume that a cholesterol-lowering medication is being evaluated in a clinical trial. The drug's effect depends on both the dose administered and the patient's sex. Without interaction between dose and sex, the effect increases at a fixed slope with respect to the dose regardless of the sex.
With interaction, we can no longer ask what the drug's effect is, since for every unit dose the incremental effect depends on the sex. Dose affects males differently from females, and this is what interaction is about. We can see in the figure that the slope is steeper for males than for females. We could use two separate linear models, one for each sex. But it's easier to enhance a single model to handle the interaction. Such a model can be written as \(Y = β_0 + β_1 \cdot dose + β_2 \cdot sex + β_3 \cdot dose \cdot sex\). If there's no interaction, \(β_3\) is zero.
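The interaction model can be fitted by simply adding a dose·sex product column to the design matrix. A sketch with synthetic clinical-style data (the coefficients 1.5 and 0.8 are illustrative assumptions, not trial results):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
dose = rng.uniform(0, 10, n)
sex = rng.integers(0, 2, n)          # 0 = female, 1 = male (coded as a dummy variable)

# Synthetic effect: slope 1.5 for females, 1.5 + 0.8 = 2.3 for males
effect = 2.0 + 1.5 * dose + 0.5 * sex + 0.8 * dose * sex + rng.normal(scale=0.3, size=n)

# Design matrix [1, dose, sex, dose*sex] matching Y = β0 + β1·dose + β2·sex + β3·dose·sex
X = np.column_stack([np.ones(n), dose, sex, dose * sex])
beta, *_ = np.linalg.lstsq(X, effect, rcond=None)
# beta[3] estimates the interaction β3: the extra slope per unit dose for males
```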
What are random effects, fixed effects and mixed effects models? When model parameters \(β\) are random variables, it's called a random effects model. Otherwise, we have a fixed effects model. A model that considers both fixed and random effects is called a mixed effects model.
Mathematically, a mixed effects model is written as \(Y = X \cdot β + Z \cdot u + ϵ\), where the first term models the fixed effects and the second term models the random effects.
Let's assume a study involving 10 people. Repeated measurements are collected from each person. However, these individuals are only "random" samples from a larger population. This sampling is accounted for by the random effects term of the model.
The random effects term also models a hierarchy of distinct populations (hence it relates to multilevel regression). An example hierarchy is students, schools, and districts. The random effects term models the variations at school and district levels. Perhaps a good definition that clarifies this is:
Fixed-effect parameters describe the relationships of the covariates to the dependent variable for an entire population; random effects are specific to clusters of subjects within a population.
What software packages are useful for solving linear regression problems? In Python, scikit-learn is perhaps the most helpful package for linear regression. In particular, sklearn.linear_model.LinearRegression and sklearn.metrics are relevant.
In R, there are many packages with functions to perform linear regression. Package stats is the most useful. Package car can help with ANOVA analysis, residual analysis and testing the assumptions. Package MASS enables Generalized Least Squares (GLS) and robust fitting of linear models. Package caret streamlines the model training process and includes ML algorithms. Package glmnet enables many types of linear regression. Package BLR supports Bayesian linear regression, which is a subset of linear regression. Package Lars supports Lasso regression efficiently.
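A minimal sketch of the scikit-learn workflow, using synthetic data in the spirit of the ad-spending example (assumes scikit-learn is installed):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(11)
X = rng.uniform(0, 10, size=(100, 1))               # ad spending (illustrative units)
y = 4.0 + 2.5 * X[:, 0] + rng.normal(scale=0.5, size=100)

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

intercept = model.intercept_     # estimate of β0
slope = model.coef_[0]           # estimate of β1
r2 = r2_score(y, y_pred)         # coefficient of determination
mse = mean_squared_error(y, y_pred)
```

`LinearRegression` fits by ordinary least squares; `sklearn.metrics` provides the evaluation measures discussed earlier.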
Milestones
Sir Francis Galton and Karl Pearson reveal that Galton's research on the genetic traits of sweet peas is probably the first example of linear regression.^{}
Sir Francis Galton proposes the notion of linear regression for the first time.^{}
Pearson's first rigorous discussion of correlation and regression is published in the Philosophical Transactions of the Royal Society of London. Pearson credits Bravais (1846) with discovering the first mathematical formulas for correlation.^{}
Pearson's theory explains how the regression slope is determined.
Pearson develops a theory for multiple regression. He also makes novel advances in other areas of statistics such as chi-square.
Ghiselli explains a simpler proof for the product-moment approach than Pearson's.
Computational procedures for a complete linear regression analysis are formulated.
Sample Code
References
 Agrawal, Raghav. 2021. "Know The Best Evaluation Metrics for Your Regression Model!" Analytics Vidhya, May 19. Accessed 2021-12-21.
 Bellemare, Marc F. 2011. "A Primer on Linear Regression." Handout PPS232S.01, v2.0, Sanford School of Public Policy, Duke University, Durham, August. Accessed 2021-12-22.
 Bergen, Elizabeth, Kara Fikrig, and Heather Grab. 2018. "Mixed Effects Models." Entom 4940: Advanced Statistical Methods in Ecology, April 24. Accessed 2021-12-24.
 Cankaya, Soner, G. Tamer Kayaalp, Levent Sangun, Yalcin Tahtali, and Mustafa Akar. 2006. "A Comparative Study of Estimation Methods for Parameters in Multiple Linear Regression Model." J. Appl. Anim. Res., GSP, India, vol. 29, pp. 43-47. Accessed 2021-12-25.
 Ciaburro, Giuseppe. 2018. "R packages for regression." In: Regression Analysis with R, Packt Publishing, January. Accessed 2021-12-21.
 Clupeid. 2015. "How do you know when a linear regression model is appropriate?" Socratic Q&A, November 06. Accessed 2021-12-20.
 DataTechNotes. 2019. "Regression Model Accuracy (MAE, MSE, RMSE, R-squared) Check in R." DataTechNotes, February 14. Accessed 2021-12-21.
 Frost, Jim. 2017. "Heteroscedasticity in Regression Analysis." Statistics By Jim, August 13. Updated 2019-03-15. Accessed 2021-12-25.
 Jain, Kunal. 2015. "Scikit-learn (sklearn) in Python – the most important Machine Learning tool I learnt last year!" Analytics Vidhya, January 5. Accessed 2021-12-21.
 Joseph, Lawrence. 2019. "Interactions in Multiple Linear Regression." EPIB-621: Data Analysis in the Health Sciences, Dept. of Epidemiology and Biostatistics, McGill University. Accessed 2021-12-20.
 Kumari, Khushbu and Suniti Yadav. 2018. "Linear regression analysis study." Journal of the Practice of Cardiovascular Sciences, vol. 4, no. 1, pp. 33-36, May 04. Accessed 2021-12-19.
 Liu, Ching-Ti, Jacqueline Milton, and Avery McIntosh. 2016. "Simple Linear Regression." In: Correlation and Regression with R, School of Public Health, Boston University, January 6. Accessed 2021-12-23.
 Luo, Xianghong. 2016. "A Comparison of Three Estimation Methods In Linear Regression Analysis." 4th International Conference on Machinery, Materials and Information Technology Applications (ICMMITA 2016), Advances in Computer Science Research, Atlantis Press, vol. 71, pp. 498-502. Accessed 2021-12-25.
 Midway, Steve. 2021. "Chapter 9: Random Effects." In: Data Analysis in R, December 5. Accessed 2021-12-24.
 Minitab. 2013. "What Are the Effects of Multicollinearity and When Can I Ignore Them?" Blog, Minitab, May 2. Accessed 2021-12-25.
 Ng, Andrew. 2021. "Normal Equation." In: Machine Learning, Stanford University, via Coursera. Accessed 2021-12-23.
 O'Hair, Allison. 2017. "Video 3: Multiple Linear Regression." Section 2.2: An Introduction to Linear Regression, MIT 15.071 The Analytics Edge, MIT OpenCourseWare. Accessed 2021-12-23.
 Oluwole. 2020. "Autocorrelation, Heteroscedasticity, and Multicollinearity." Technical Notes of Ehi Kioya, February 21. Accessed 2021-12-20.
 PennState. 2021. "Assumptions for the SLR Model." Sec. 9.2.3 in: STAT 500 Applied Statistics, The Pennsylvania State University. Accessed 2021-12-24.
 Prutor. 2021. "Least squares." Prutor. Accessed 2021-12-22.
 Python for Data Science. 2021. "Mixed Effect Regression." Python for Data Science. Accessed 2021-12-24.
 Sridharan, Ramesh. 2015a. "Chapter 3: Linear Regression." In: 6.S085 Statistics for Research Projects, MIT. Accessed 2021-12-23.
 Sridharan, Ramesh. 2015b. "Chapter 4: Regression Diagnostics and Advanced Regression Topics." In: 6.S085 Statistics for Research Projects, MIT. Accessed 2021-12-23.
 Stanton, Jeffrey M. 2001. "Galton, Pearson, and the Peas: A Brief History of Linear Regression for Statistics Instructors." Journal of Statistics Education, vol. 9, no. 3. doi: 10.1080/10691898.2001.11910537. Accessed 2021-12-19.
 Statistics Solutions. 2021. "Autocorrelation." Complete Dissertation, Statistics Solutions, June 22. Accessed 2021-12-25.
 StatTrek. 2021. "Linear Regression." StatTrek. Accessed 2021-12-22.
 Ullah, Muhammad Imdad. 2020. "Consequences of Autocorrelation." Basic Statistics and Data Analysis, itfeature, November 5. Accessed 2021-12-25.
 Wikipedia. 2021a. "Linear regression." Wikipedia, November 22. Accessed 2021-12-20.
 Wikipedia. 2021b. "Fixed effects model." Wikipedia, June 22. Accessed 2021-12-20.
 Zach. 2020a. "Examples of Using Linear Regression in Real Life." Statology, May 19. Accessed 2021-12-20.
 Zach. 2020b. "The Four Assumptions of Linear Regression." Statology, January 08. Accessed 2021-12-20.
Further Reading
 scikit-learn. 2021. "Linear Regression Example." scikit-learn, v1.0.1, October. Accessed 2021-12-16.
 Machine Learning Glossary. 2020. "Linear Regression." Machine Learning Glossary, September 7. Accessed 2021-12-16.
 Brownlee, Jason. 2016. "Linear Regression for Machine Learning." Machine Learning Mastery, March 25. Updated 2020-08-15. Accessed 2021-12-16.
See Also
 Regression Modelling
 Types of Regression
 Generalized Linear Model
 Multicollinearity
 Residual Analysis
 Machine Learning