# Linear Regression

Linear regression is a statistical technique used to establish the relationship between variables in a dataset. The equation $$y = mx + c$$ describes a linear relationship between dependent variable $$y$$ and independent variable $$x$$. We may state that $$y$$ depends on $$x$$. Given sufficient data, linear regression estimates the values of coefficient $$m$$ and constant $$c$$. In a geometric interpretation, $$m$$ is the slope and $$c$$ is the intercept. In an alternative notation, these are expressed as $$β_1$$ and $$β_0$$ respectively.

Variable $$y$$ is also called the response or predicted variable. Variable $$x$$ is also called the predictor variable. The reason for this is that once the parameters of the model, $$β_0$$ and $$β_1$$, are estimated, we can predict $$y$$ for any value of $$x$$.

Linear regression is a field of statistics. This article looks at the types, important assumptions and techniques in linear regression.

## Discussion

• Could you explain linear regression with some examples?

Linear regression is frequently used by businesses to understand the link between advertising budget and revenue. In other words, it answers the question "For every advertising dollar I spend, how much will my revenue increase?" This can be modelled as $$Revenue=β_0+β_1 \cdot AdSpending$$.

$$β_0$$ represents the total expected revenue when ad spending is zero. The coefficient $$β_1$$ represents the average increase in total revenue when ad spending is increased by one unit. When $$β_1<0$$, higher ad spending is associated with lower revenue. When $$β_1$$ is close to zero, ad spending has little effect on revenue. When $$β_1>0$$, higher ad spending is associated with higher revenue. The model thus aids decision making: a company may decrease or increase its ad spending based on the value of $$β_1$$.

In the figure, the red line is the best-fit straight line, $$y=4.187-0.356x$$. The values $$β_0=4.187$$ and $$β_1=-0.356$$ are what regression analysis has estimated from available data. Yield falls by 0.356% for every 1% increase in cultivated area. This model can now be used to make predictions; that is, given an area, we can predict the yield.
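Such a best-fit line can be estimated by least squares with a few lines of code. Below is a minimal sketch using simulated ad-spending data; all numbers are illustrative assumptions, not values from the article's figure.

```python
import numpy as np

# Simulated ad-spending data; the true parameters are illustrative assumptions
rng = np.random.default_rng(42)
ad_spending = rng.uniform(0, 10, 50)
revenue = 3.0 + 1.5 * ad_spending + rng.normal(0, 0.5, 50)  # true β0=3.0, β1=1.5

# Least-squares fit of a degree-1 polynomial: returns (slope, intercept)
beta1, beta0 = np.polyfit(ad_spending, revenue, deg=1)
print(beta0, beta1)  # estimates close to 3.0 and 1.5
```

Given the fitted line, predictions are simply `beta0 + beta1 * x` for any new value of `x`.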

• What are the main types of linear regression models?

The main types of linear regression models are:

• Simple Linear Regression: This is the most basic type and deals with a single predictor variable. Predicting revenue from ad spending is an example.
• Multiple Linear Regression: Aka multivariable linear regression. This is applicable when there are many predictor variables. An example is predicting wine prices, which depend on mean growing season temperature, harvest rainfall, winter rainfall, and more.
• Hierarchical Linear Model: Aka multilevel regression. Such a model captures the natural hierarchy in predictor variables. Analysis involves a hierarchy of regressions, such as A regressed on B, and B regressed on C. For example, students are nested within classrooms, classrooms within schools, and schools within districts. So a student's test score can be modelled based on overall performance at these different levels.

• What's the mathematical notation of a linear regression model?

We consider the general case of multiple linear regression with $$k$$ independent variables. The model therefore has to estimate $$k+1$$ parameters: constant $$β_0$$ and coefficients $$β_j$$ for $$1 \le j \le k$$. As given in the figure, the model is written as $$Y = β_0 + β_1X_1 + β_2X_2 + \dots + β_kX_k + ϵ$$, where $$ϵ$$ is the error term.

In linear algebra, this is expressed as $$Y = X \cdot β + ϵ$$, where $$X$$ is an $$m \times (k+1)$$ matrix, $$β$$ is a $$(k+1)$$-dimensional vector, $$Y$$ and $$ϵ$$ are $$m$$-dimensional vectors, and $$m$$ is the number of observations or data points.

Regression analysis essentially finds the $$β$$ that minimizes the sum of squared errors. Setting the derivative of this sum to zero leads us to what's called the normal equation: $$β=(X^TX)^{-1} \cdot X^TY$$. As is apparent, this equation directly solves for the model parameters $$β$$.

In Machine Learning (ML), it's common to use $$θ$$ instead of $$β$$ for the model parameters.
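The normal equation can be applied directly with numpy. The sketch below uses simulated data with illustrative parameter values; note that `np.linalg.solve` is preferred over explicitly inverting $$X^TX$$ for numerical stability.

```python
import numpy as np

rng = np.random.default_rng(0)
m, k = 100, 2                              # m observations, k predictors
predictors = rng.normal(size=(m, k))
true_beta = np.array([3.0, 1.0, 2.0])      # β0, β1, β2 (illustrative values)

# Design matrix: prepend a column of ones so β0 acts as the intercept
X = np.column_stack([np.ones(m), predictors])
Y = X @ true_beta + rng.normal(0, 0.1, m)  # add small noise ϵ

# Normal equation β = (XᵀX)⁻¹XᵀY, solved without forming an explicit inverse
beta = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta)  # close to [3.0, 1.0, 2.0]
```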

• What are some estimation methods in linear regression?

Methods commonly used for estimating the model parameters (also called estimates) are:

• Ordinary Least Squares (OLS): This looks at the sum of squared differences between each observed value and its prediction from the model. The method minimizes this sum.
• Method of Moments (MoM): This uses moments, which are the expectations of powers of a random variable. The number of moments to calculate equals the number of unknown parameters. The resulting system of equations is then solved.
• Maximum Likelihood Estimate (MLE): This seeks to maximize the likelihood function. In other words, we determine the estimates that make the observed values most probable. The MLE method is applicable when the probability distribution of the error terms is known.

Related to OLS are more sophisticated methods including Weighted Least Squares (WLS) and Generalized Least Squares (GLS). Less common ones are Least Median Squares and Least Trimmed Squares. In fact, the minimization need not be of squares: we could instead minimize Least Absolute Deviations or robust loss functions such as Huber and bisquare.
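For simple linear regression, the Method of Moments yields closed-form estimates, $$β_1 = cov(x,y)/var(x)$$ and $$β_0 = \bar{y} - β_1\bar{x}$$, which coincide with the OLS estimates. A minimal sketch on simulated data (the parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 5, 200)
y = 2.0 + 0.8 * x + rng.normal(0, 0.2, 200)   # true β0=2.0, β1=0.8

# Equate sample moments with model moments and solve
beta1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)
beta0 = y.mean() - beta1 * x.mean()
print(beta0, beta1)  # close to 2.0 and 0.8
```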

• How do I evaluate the performance of a linear regression model?

Models are rarely perfect and there's a need to measure how good a model really is. The differences between predicted values and actual values are called residuals. Both model evaluation and validation of the assumptions can be performed on the residuals; this field of study is called Residual Analysis.

Mean Absolute Error (MAE) and Mean Squared Error (MSE) are two ways to quantify the residuals. MAE looks at absolute differences. MSE looks at the square of the differences. Another measure is Root Mean Squared Error (RMSE) that's the square root of MSE. RMSE has the same unit as the output variable, making it easier to interpret.

Perhaps the most widely used statistical measure is R-Squared (R2). It quantifies the proportion of the variation explained by the model. The closer it is to 1, the better the model explains the data. R2 is also called the Coefficient of Determination.
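These measures are straightforward to compute from the residuals. Below is a minimal sketch with made-up values; in practice, sklearn.metrics provides equivalents such as mean_absolute_error, mean_squared_error and r2_score.

```python
import numpy as np

# Made-up actual and predicted values, for illustration only
y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.8, 5.4, 7.0, 10.2])
residuals = y_true - y_pred

mae = np.mean(np.abs(residuals))                 # Mean Absolute Error
mse = np.mean(residuals**2)                      # Mean Squared Error
rmse = np.sqrt(mse)                              # same unit as the output variable
# R² = 1 − RSS/TSS: proportion of variation explained by the model
r2 = 1 - np.sum(residuals**2) / np.sum((y_true - y_true.mean())**2)
print(mae, mse, rmse, r2)
```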

• What are the main assumptions when constructing a linear regression model?

In linear regression, we usually make the following assumptions:

• Linearity: The dependent variable Y is related to the independent variables X in a linear way.
• Independence: Observations are independent of one another; equivalently, the residuals are independent of each other. In time-series data, this means successive observations must not be correlated. When data doesn't meet this assumption, we have a problem called autocorrelation.
• Normality: Residuals are normally distributed. Equivalently, at a fixed observation X, the dependent variable Y is normally distributed.
• Homoscedasticity: The residuals have the same variance at all predicted or fitted points Y. When data doesn't meet this assumption, we have a problem called heteroscedasticity.

If one or more of these assumptions is violated, what the model predicts may be incorrect or even deceptive.

• How can I determine if linear regression is appropriate for a particular set of data?

A scatterplot can help us validate the linearity assumption. For multiple linear regression, 2-D pairwise scatterplots, rotating plots, and dynamic graphs can help.

To validate the independence assumption, a scatterplot of residuals versus fitted values shouldn't show any pattern.

To validate the normality assumption, a normal probability plot, a residual histogram or a quantile-quantile plot can be used.

To validate the homoscedasticity assumption, do a scatterplot of residuals against the fitted values. A cone-shaped pattern implies that the residuals vary more for some predicted values than others, thus invalidating the assumption. There are also lots of statistical tests to check for homoscedasticity: Bartlett's Test, Box's Test, Brown-Forsythe Test, Hartley's Fmax Test, Levene's Test, and Breusch-Pagan Test.
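The idea behind the Breusch-Pagan Test can be sketched in a few lines: regress the squared residuals on the predictors and compute the Lagrange multiplier statistic $$n \cdot R^2$$, which is large under heteroscedasticity. The simulated data below is illustrative; in practice a library routine such as statsmodels' het_breuschpagan would be used.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x = rng.uniform(1, 10, n)
y = 1.0 + 2.0 * x + rng.normal(0, 0.3 * x)   # noise grows with x: heteroscedastic

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Auxiliary regression of squared residuals on the predictors;
# LM = n·R² follows χ²(1) under homoscedasticity (5% critical value ≈ 3.84)
g = resid**2
gamma, *_ = np.linalg.lstsq(X, g, rcond=None)
r2_aux = 1 - np.sum((g - X @ gamma)**2) / np.sum((g - g.mean())**2)
lm = n * r2_aux
print(lm > 3.84)  # True: heteroscedasticity detected
```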

We also expect predictor variables to be independent of one another. A scatterplot of one independent variable with another can validate this. We can also calculate the correlation coefficients pairwise for all independent variables. Correlation coefficients close to ±1 imply high correlation. Low model coefficients or high Variance Inflation Factor (VIF) indicate correlated variables.
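VIF can be computed directly from its definition: regress each predictor on the remaining ones and take $$1/(1-R^2)$$. A minimal numpy sketch with simulated predictors (the data is illustrative; a rule of thumb flags VIF above 5 or 10):

```python
import numpy as np

def vif(X):
    """VIF_j = 1/(1 − R²_j), where R²_j comes from regressing
    column j of X on the remaining columns (plus an intercept)."""
    n, k = X.shape
    out = []
    for j in range(k):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ coef
        r2 = 1 - np.sum(resid**2) / np.sum((X[:, j] - X[:, j].mean())**2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(3)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)             # independent of x1
x3 = x1 + rng.normal(0, 0.1, 200)     # nearly collinear with x1
print(vif(np.column_stack([x1, x2, x3])))  # high for x1 and x3, near 1 for x2
```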

• How do autocorrelation, multicollinearity, and heteroscedasticity affect linear regression estimates?

Autocorrelation exists when successive observations of a variable are not independent of one another. This is common in time-series data but could occur in other scenarios such as samples drawn from a cluster or geographic area. Autocorrelation can be detected with the Durbin-Watson test. Due to autocorrelation, OLS estimators will be inefficient. The estimated variance of regression coefficients will be biased and inconsistent. Hypothesis testing will be invalid. R2 will be overestimated.
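The Durbin-Watson statistic has a simple closed form over the residuals. The sketch below compares independent residuals with simulated positively autocorrelated (AR(1)) residuals; the data is illustrative.

```python
import numpy as np

def durbin_watson(resid):
    """DW = Σ(e_t − e_{t−1})² / Σe_t²; ≈2 means no autocorrelation,
    values toward 0 suggest positive, toward 4 negative autocorrelation."""
    return np.sum(np.diff(resid)**2) / np.sum(resid**2)

rng = np.random.default_rng(4)
white = rng.normal(size=1000)           # independent residuals
ar1 = np.zeros(1000)                    # AR(1) residuals: positively correlated
for t in range(1, 1000):
    ar1[t] = 0.8 * ar1[t - 1] + rng.normal()
print(durbin_watson(white), durbin_watson(ar1))  # ≈2 vs well below 2
```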

A correlation between two or more predictor variables is referred to as multicollinearity. Though the overall model fit is not affected, multicollinearity can increase the variance of the estimates and make them sensitive to model changes. The model becomes harder to interpret and it's harder to determine the precise effect of each predictor.

Heteroscedasticity means unequal spread of the residuals. It's a problem because OLS regression assumes constant spread. Though this doesn't introduce bias into the estimates, it does make them less precise. It produces smaller p-values because OLS regression doesn't detect the increase in the variance of the estimates. Thus, we may wrongly conclude that an estimate is statistically significant.

• How do we model the interaction of independent variables?

When one independent variable has a distinct effect on the outcome based on the values of another independent variable, we call this an interaction.

Assume that a cholesterol-lowering medication is being evaluated in a clinical trial. The drug's effect depends on both the dose administered and the patient's sex. Without interaction between dose and sex, the effect increases at a fixed slope with respect to dose, regardless of sex.

With interaction, we can no longer ask what's the drug's effect, since for every unit dose the incremental effect depends on the sex. Dose affects males differently from females and this is what interaction is about. We can see in the figure that the slope is steeper for males than for females. We could use two separate linear models, one for each sex. But it's easier to enhance a single model to handle the interaction. Such a model can be written as $$Y = β_0 + β_1 \cdot dose + β_2 \cdot sex + β_3 \cdot dose \cdot sex$$. If there's no interaction, $$β_3$$ is zero.
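Fitting such an interaction model amounts to adding a product column dose·sex to the design matrix. The sketch below uses simulated data; the coefficient values and the 0/1 sex coding are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
dose = rng.uniform(0, 10, n)
sex = rng.integers(0, 2, n)               # 0/1 coding, illustrative
# Simulated effect with interaction: the dose slope is steeper when sex == 1
effect = 1.0 + 0.4 * dose + 0.2 * sex + 0.5 * dose * sex + rng.normal(0, 0.2, n)

# Design matrix includes the product column dose·sex
X = np.column_stack([np.ones(n), dose, sex, dose * sex])
beta, *_ = np.linalg.lstsq(X, effect, rcond=None)
print(beta)  # close to [1.0, 0.4, 0.2, 0.5]; β3 ≈ 0.5 captures the interaction
```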

• What are random effects, fixed effects and mixed effects models?

When model parameters $$β$$ are random variables, it's called a random effects model. Otherwise, we have a fixed effects model. A model that considers both fixed and random effects is called a mixed effects model.

Mathematically, a mixed effects model is written as, $$Y = X \cdot β + Z \cdot u + ϵ$$ where the first term models the fixed effects and the second term models the random effects.

Let's assume a study involving 10 people. Repeated measurements are collected from each person. However, these individuals are only "random" samples from a larger population. This sampling is accounted for by the random effects term of the model.

The random effects term also models a hierarchy of distinct populations (hence it relates to multilevel regression). An example hierarchy is students, schools, and districts. Random effects term models the variations at school and district levels. Perhaps a good definition that clarifies this is,

Fixed-effect parameters describe the relationships of the covariates to the dependent variable for an entire population, random effects are specific to clusters of subjects within a population.
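As a rough illustration of the structure $$Y = X \cdot β + Z \cdot u + ϵ$$, we can simulate test scores with a fixed slope for study hours plus random school-level intercepts. Demeaning by school removes the random intercepts and recovers the fixed slope (the "within" estimator). All names and values are illustrative assumptions; a full mixed-model fit would use a package such as statsmodels (MixedLM) or R's lme4.

```python
import numpy as np

rng = np.random.default_rng(6)
n_schools, pupils = 20, 30
school = np.repeat(np.arange(n_schools), pupils)    # school id per student
hours = rng.uniform(0, 10, n_schools * pupils)      # study hours (fixed effect)
u = rng.normal(0, 2.0, n_schools)                   # random school intercepts

# Y = Xβ + Zu + ϵ: fixed slope 1.5 for hours, plus school-level random effect
score = 50 + 1.5 * hours + u[school] + rng.normal(0, 1.0, n_schools * pupils)

def demean_by_school(v):
    """Subtract each school's mean from its students' values."""
    means = np.array([v[school == s].mean() for s in range(n_schools)])
    return v - means[school]

# Demeaning removes the random intercepts; the within-school slope
# estimates the fixed effect
hours_c, score_c = demean_by_school(hours), demean_by_school(score)
slope = np.sum(hours_c * score_c) / np.sum(hours_c**2)
print(slope)  # close to 1.5
```
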

• What software packages are useful for solving linear regression problems?

In Python, scikit-learn is perhaps the most helpful package for linear regression. In particular, sklearn.linear_model.LinearRegression and sklearn.metrics are relevant.

In R, there are many packages with functions to perform linear regression. Package stats is most useful. Package car can help with ANOVA analysis, residual analysis and testing the assumptions. Package MASS enables Generalized Least Squares (GLS) and robust fitting of linear models. Package caret streamlines the model training process and includes ML algorithms. Package glmnet enables many types of linear regression. Package BLR supports Bayesian linear regression. Package lars fits Lasso regression efficiently.

## Milestones

1875

Sir Francis Galton conducts research on the genetic traits of sweet peas, later identified by Karl Pearson as probably the first example of linear regression.

1894

Sir Francis Galton proposes the notion of linear regression for the first time.

1896

Pearson's first rigorous discussion of correlation and regression is published in the Philosophical Transactions of the Royal Society of London. Pearson credits Bravais (1846) with discovering the first mathematical formulas for correlation.

1922

Pearson's theory explains how the regression slope can be estimated.

1938

Pearson develops a theory for multiple regression. He also makes novel advances in other areas of statistics such as chi-square.

1981

Ghiselli explains a simpler proof for the product-moment approach than Pearson's.

1990

Computations for a complete linear regression analysis are formalized.

## Sample Code

```python
# Source: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
# Accessed 2021-12-22
>>> import numpy as np
>>> from sklearn.linear_model import LinearRegression
>>> X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
>>> # y = 1 * x_0 + 2 * x_1 + 3
>>> y = np.dot(X, np.array([1, 2])) + 3
>>> reg = LinearRegression().fit(X, y)
>>> reg.score(X, y)
1.0
>>> reg.coef_
array([1., 2.])
>>> reg.intercept_
3.0...
>>> reg.predict(np.array([[3, 5]]))
array([16.])
```


## Cite As

Devopedia. 2022. "Linear Regression." Version 32, February 15. Accessed 2022-06-15. https://devopedia.org/linear-regression