# Regression modelling


## Summary

Regression is a method to mathematically formulate the relationship between variables, which can then be used to estimate, interpolate and extrapolate. Suppose we want to estimate the weight of individuals, which is influenced by height, diet, workout, etc. Here, *Weight* is the **predicted** variable, while *Height*, *Diet* and *Workout* are **predictor** variables.

The predicted variable is a **dependent** variable in the sense that it depends on the predictors. Predictors are also called **independent** variables. Regression reveals to what extent the predicted variable is affected by the predictors; in other words, how variation in the predictors results in variation of the predicted variable. The predicted variable is mathematically represented as \(Y\). The predictor variables are represented as \(X_1\), \(X_2\), \(X_3\), etc. This mathematical relationship is often called the **regression model**.

Regression is a branch of statistics. There are many types of regression. Regression is commonly used for prediction and forecasting.

## Discussion

What's a typical process for performing regression analysis? Collect data for both predictor and predicted variables, ensuring a sufficient number of data points. Use a suitable estimation technique to arrive at the mathematical formula relating the predicted and predictor variables. Give error bounds.

When predictor variables are given for a new data point, estimate the predicted variable.

I've heard of Least Squares. What's this and how is it related to regression? **Least Squares** is a term that signifies that the sum of squared errors is at a minimum. The error is defined as the difference between the observed value and the predicted value. The objective of regression estimation is to produce the least squared errors. When the error on the training data approaches zero, we term it *overfitting*.

The **Least Squares Method** provides linear equations with unknowns that can be solved for any given data. The unknowns are the regression parameters. These linear equations are called *Normal Equations*, and they are derived using calculus to minimize the squared errors. Other algorithms (*ANN*, *KNN*, etc.) also attempt to minimize squared error unless the objective states otherwise.

Could you explain the difference between interpolation and extrapolation w.r.t. regression? Simply put, interpolation is estimation in familiar territory, while extrapolation is estimation where not much data is available, either because it was not collected or because it cannot be collected.

We can interpolate missing data points using regression. For instance, if we want to estimate height given weight and the data collection process missed out certain weights, we can use regression to interpolate. The missing data can be estimated by other means too; estimation of missing data is called *imputation*. Height and weight data is bound by nature and can be sourced.

Say we want to estimate the future weight of an individual given historical weight variations of that individual. This is extrapolation. In regression, we call it *forecasting*, and it is solved using a distinct set of techniques called **Time Series Regression**.

What is correlation? How is it related to regression? Correlation helps identify variables that can be applied to regression modelling. Correlations between each predictor and the predicted variable are examined to decide which predictors should be included in the model.

Correlation defines the association between two variables. The effect of \(X\) (or \(X_1\), \(X_2\), \(X_3\)...) on \(Y\) can thus be quantified:

- **Positive Correlation**: \(Y\) goes up/down as \(X\) goes up/down. The correlation coefficient will be in the range [0,1].
- **Negative Correlation**: \(Y\) goes up/down as \(X\) goes down/up. The correlation coefficient will be in the range [-1,0].
- **No Correlation**: \(Y\) doesn't go up/down as \(X\) goes up/down. The correlation coefficient will be close to 0.

The *correlation coefficient* \(r\) has the following formula:

$$r=\frac{\sum_{i=1}^n(x_i-\bar x)(y_i-\bar y)}{\sqrt{\sum_{i=1}^n(x_i-\bar x)^2 \sum_{i=1}^n(y_i-\bar y)^2}}$$
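The formula above can be computed directly. A minimal sketch in pure Python, using made-up data, where a perfectly linear relationship yields \(r=1\) and a perfectly inverse one yields \(r=-1\):

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient r for paired samples,
    computed exactly as in the formula above."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5]
print(correlation(xs, [2, 4, 6, 8, 10]))   # 1.0, perfectly positive
print(correlation(xs, [10, 8, 6, 4, 2]))   # -1.0, perfectly negative
```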

The correlation coefficient is a measure of *linear* association strength. Non-linearity is not quantified by the above formula. The *square* of the correlation coefficient gives the proportion of variation in one variable explained by variation in the other. For example, if \(r=0.8\), then \(r^2=0.64\): 64% of the variation in rainfall is explained by the number of trees, and 36% is due to other factors.

Could you give examples of non-linear correlation? A non-linear relationship can exhibit a monotonic positive pattern, a monotonic negative pattern, or both patterns together. We call a relationship *curvilinear* when it's monotonic positive or monotonic negative throughout.

How can we do data analysis when relationships are non-linear? The correlation coefficient formula applies only to linear relationships. One common approach is to transform a non-linear relationship into a linear form. If the relationship is curvilinear, we can apply transformations directly. Common transformations include logarithmic or inverse transformations.

If the relationship is non-linear but not curvilinear, we can split the data into distinct segments. Data within some segments may be linear. In other segments, if it's curvilinear, transformations can be applied to make them linear. Analysis is thus segment-wise.
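As an illustration of such a transformation, here is a sketch with hypothetical data following a power law \(y = 2x^{1.5}\), which is curvilinear in the original scale but becomes a straight line after a log-log transform, so ordinary least squares can then recover the parameters:

```python
import math

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x (closed-form solution)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
        sum((x - x_bar) ** 2 for x in xs)
    return y_bar - b * x_bar, b

# Hypothetical curvilinear data generated from y = 2 * x^1.5
xs = [1, 2, 4, 8, 16]
ys = [2 * x ** 1.5 for x in xs]

# Log-log transform turns the power law into a straight line:
# log y = log 2 + 1.5 * log x
a, b = fit_line([math.log(x) for x in xs], [math.log(y) for y in ys])
print(math.exp(a), b)   # ≈ 2.0 and 1.5, recovering the original parameters
```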

What is a causal relationship in regression? Causality or causation refers to the idea that variation in a predictor \(X\) *causes* variation in the predicted variable \(Y\). This is distinct from regression, which is more about predicting \(Y\) based on its correlation with \(X\). Regression does not claim that \(Y\) is caused by \(X\). Causality is better explained through the following examples:

- Higher exam score \((X)\) results in higher earnings \((Y)\)
- More trees \((X)\) cause more rainfall \((Y)\)
- Deeper research \((X)\) leads to more complete knowledge \((Y)\)
- Regular exercise \((X)\) results in better health \((Y)\)
- Ambient weather \((X_1)\) and more factory machines \((X_2)\) influence power consumption \((Y)\)

All pairs of variables that have a causal relationship will exhibit significant correlation.

Does strong correlation always imply a causal relationship? No. Sometimes correlations are purely coincidental. For example, non-commercial space launches and sociology doctorates awarded are completely unrelated, yet they are strongly correlated. This is called a **Spurious Correlation**. It is a clear case where correlation does not imply causation, and data analysis should avoid reading meaning into such correlations.
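Spurious correlation is easy to reproduce with simulated data: two trending series generated independently frequently show a large correlation coefficient by pure chance. A sketch using two independent random walks (the seed is arbitrary; try several seeds to see how often \(|r|\) comes out large):

```python
import math
import random

def correlation(xs, ys):
    """Pearson correlation coefficient r for paired samples."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def random_walk(rng, n):
    """Cumulative sum of random +/-1 steps."""
    walk, level = [], 0
    for _ in range(n):
        level += rng.choice([-1, 1])
        walk.append(level)
    return walk

# The two walks are generated independently, so any correlation
# between them is coincidental, not causal.
rng = random.Random(42)
a = random_walk(rng, 200)
b = random_walk(rng, 200)
print(correlation(a, b))
```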

Could you explain the regression model? We call it a *model* when the relationship between variables is in a well-defined mathematical form: \(Y=f(X)\). For instance, a linear relationship can be written as \(f(X)=a+b_1X_1+b_2X_2+b_3X_3\), where \(a\) is a constant and \(b_1,\,b_2,\,b_3\) are regression coefficients. \(a\) is the constant effect, while a unit change in \(X_1\) will result in a \(b_1\) unit change in \(Y\), holding the other predictors fixed.
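As a concrete sketch with a single predictor and hypothetical height/weight data, the coefficients of \(Y = a + bX\) can be estimated by solving the normal equations in closed form:

```python
def fit_least_squares(xs, ys):
    """Fit y = a + b*x by solving the normal equations.

    Setting the derivatives of the squared error to zero gives
    b = Sxy / Sxx and a = y_bar - b * x_bar.
    """
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx          # slope: effect of a unit change in x
    a = y_bar - b * x_bar  # intercept: constant effect
    return a, b

# Hypothetical height (cm) vs weight (kg) data
heights = [150, 160, 170, 180, 190]
weights = [55, 60, 68, 75, 82]
a, b = fit_least_squares(heights, weights)
print(a, b)   # each extra cm of height adds b kg to the predicted weight
```

With multiple predictors, the same idea generalizes to a system of linear normal equations solved simultaneously.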

How do we measure the accuracy of a regression model? The accuracy of a regression model is relative to a base model that always predicts the mean. The measure, called **R-Squared**, is defined via squared deviations from the mean:

- For the base model, the sum of squared deviations of the actual values \(Y\) from the mean value \(E(Y)\) is referred to as *Total Variance* or *SST (Total Sum of Squares)*: $$SST=\sum_{i=1}^n(y_i-\bar y)^2$$
- For the regression model, the sum of squared deviations of the estimated values \(\widehat Y\) from the mean value \(E(Y)\) is referred to as *Explained Variance* or *SSR (Regression Sum of Squares)*: $$SSR=\sum_{i=1}^n(\widehat y_i-\bar y)^2$$
- The accuracy of the model is called *R-Squared*: $$R^2=\frac{\text{Explained Variance}}{\text{Total Variance}} = \frac{SSR}{SST}$$

The higher the \(R^2\), the larger the explained variance and the lower the unexplained variance. Hence, a higher \(R^2\) value is desired. For example, if \(R^2=0.8\), 80% of the variation in the data is explained by the model.
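Computing \(R^2\) as defined above is a few lines of code. A sketch with hypothetical actuals and model predictions (for ordinary least-squares fits with an intercept, this SSR/SST ratio coincides with the usual definition \(1 - SSE/SST\)):

```python
def r_squared(ys, y_hats):
    """R^2 = SSR / SST, comparing model predictions to the
    mean-only base model."""
    y_bar = sum(ys) / len(ys)
    sst = sum((y - y_bar) ** 2 for y in ys)          # total variance
    ssr = sum((yh - y_bar) ** 2 for yh in y_hats)    # explained variance
    return ssr / sst

# Hypothetical actual values and predictions from some fitted model
ys     = [3.0, 5.0, 7.0, 9.0]
y_hats = [3.5, 4.5, 7.5, 8.5]
print(r_squared(ys, y_hats))   # 0.85: 85% of the variation is explained
```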

How do you classify the different types of regression? Regression techniques can be classified in many ways and here's one way to do it:

- **Parametric / Non-parametric**: If the relationship between variables conforms to a well-defined mathematical specification such as linear, exponential, inverse, or another curvilinear form, it's called parametric. Otherwise, it's called non-parametric, of which K-NN, Neural Networks and piecewise linear regression are examples.
- **Variable selection**: With *Forward Selection*, one builds the model with the most important predictor and later adds other predictors one by one. With *Backward Elimination*, one builds the model with all predictors and later eliminates the less important predictors one by one.
- **Distribution of dependent variable**: We call it *Linear Regression* when the dependent variable has a Normal distribution. Otherwise, for non-Normal distributions in the exponential family, we can use link functions. For example, for a dependent variable with a Binomial distribution, we use the logistic link function and call this *Logistic Regression*; for a Poisson distribution, we use the log link function; Beta and Gamma distributions have their own link functions.
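To make the link-function idea concrete, here is a sketch of the logistic (logit) link used in logistic regression. The coefficients are hypothetical; the point is that the link maps a probability to the whole real line, so an unbounded linear predictor \(a + bX\) can drive a bounded probability:

```python
import math

def logit(p):
    """Logit link: maps a probability in (0,1) to the real line."""
    return math.log(p / (1 - p))

def inv_logit(z):
    """Inverse link (sigmoid): maps a linear predictor back to (0,1)."""
    return 1 / (1 + math.exp(-z))

# Hypothetical linear predictor a + b*x driving a probability
z = 0.5 + 1.2 * 1.0
p = inv_logit(z)
print(p)   # ≈ 0.85, a valid probability strictly between 0 and 1
```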

What's the bias-variance trade-off in regression? In general, a model built on a large, heterogeneous dataset tends to have low bias but high variance, while a model restricted to a smaller, more homogeneous subset has lower variance but higher bias. Reducing one typically increases the other, hence the trade-off.

Suppose we collect income data from multiple cities across professions. While income can be correlated with profession, there will be variations across cities due to differing lifestyles, cost of living, tax rules, etc. This can be called heterogeneity in the data. A model built on this data will have high variance and its predictions may not be accurate. We can overcome this by making the data more homogeneous, that is, by splitting the data by city. Thus, we'll end up with multiple models, one per city. Each model is biased to its city but has lower variance.

When we have geography-wise data that can be drilled down to locality, the decision to split at a particular level of geography is itself a bias-variance trade-off. For example, we can split by city, by state, by region (East Coast vs West Coast), by country, by continent, etc. The amount of variance that can be tolerated will dictate the bias.

What's the main challenge in regression and how do we overcome it? *Overfitting* is the main challenge in regression. Overfitting occurs when the model fits the training data so closely that it captures noise rather than the underlying relationship. Such a model will not fit any other data. *Regularization* is a technique used to avoid overfitting by penalizing large regression coefficients. For parametric models, there are regression routines that address overfitting concerns; Lasso regression and Ridge regression are two such routines.
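To illustrate regularization, here is a sketch of Ridge regression for the simplest case of a single centered predictor, where the penalized objective has a closed-form solution. The data and penalty values are made up; the point is that the penalty shrinks the coefficient toward zero as it grows:

```python
def ridge_slope(xs, ys, lam):
    """Ridge estimate of the slope for a single centered predictor.

    Adding the penalty lam * b^2 to the squared error changes the
    least-squares solution Sxy / Sxx into b = Sxy / (Sxx + lam),
    which shrinks the slope toward zero as lam grows.
    """
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / (sxx + lam)

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
print(ridge_slope(xs, ys, 0.0))    # 2.0, the ordinary least-squares slope
print(ridge_slope(xs, ys, 10.0))   # 1.0, shrunk toward zero by the penalty
```

Lasso uses an absolute-value penalty instead, which can shrink coefficients exactly to zero and thus also performs variable selection.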



## See Also

- Types of regression
- Overfitting and underfitting
- Least squares
- Analysis of variance
- Classification
- Machine learning

## Further Reading

- Ramcharan, Rodney. 2006. "Regressions: Why Are Economists Obsessed with Them?" Finance & Development, IMF, March, vol. 23, no. 1. Retrieved 2018-03-16.
- Armstrong, J. Scott. 2012. "Illusions in Regression Analysis." International Journal of Forecasting, July, vol. 28, pp. 689-694. Retrieved 2018-03-16.
- Rawlings, John O., Sastry G. Pantula, and David A. Dickey. 1998. "Applied Regression Analysis: A Research Tool." Second Edition. Springer-Verlag New York, Inc. Retrieved 2018-03-16.
- ablongman.com. 2018. "Measures of Relationship." Retrieved 2018-03-16.