Regression Modelling

Regression is a method to mathematically formulate the relationship between variables, which can then be used to estimate, interpolate and extrapolate. Suppose we want to estimate the weight of individuals, which is influenced by height, diet, workout, and so on. Here, weight is the predicted variable. Height, diet and workout are predictor variables.

The predicted variable is a dependent variable in the sense that it depends on the predictors. Predictors are also called independent variables. Regression reveals to what extent the predicted variable is affected by the predictors, that is, how much variation in the predictors results in variation of the predicted variable. The predicted variable is mathematically represented as \(Y\). The predictor variables are represented as \(X_1\), \(X_2\), \(X_3\), etc. This mathematical relationship is often called the regression model.

Regression is a branch of statistics with many variants. It's commonly used for prediction and forecasting.

Discussion

  • What's a typical process for performing regression analysis?

    First select a suitable predicted variable with acceptable measurement qualities such as reliability and validity. Likewise, select the predictors. When there's a single predictor, we call it bivariate analysis; anything more, we call it multivariate analysis.

    Collect a sufficient number of data points. Use a suitable estimation technique to arrive at the mathematical relationship between predicted and predictor variables. No model is perfect. Hence, give error bounds.

    Finally, assess the model's stability by applying it to different samples of the same population. When predictor values are given for a new data point, estimate the predicted variable. If the model is stable, its accuracy should not decrease. This process is called model cross-validation.
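    As an illustration of cross-validation by split samples, here's a minimal sketch in Python. The data, the 80/20 split and all variable names are assumptions for illustration, not taken from any referenced study.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Assumed synthetic data: height (cm) predicting weight (kg).
    height = rng.uniform(150, 190, 200)
    weight = 0.9 * height - 90 + rng.normal(0, 5, 200)

    # Split into a training sample and a hold-out sample (80/20).
    idx = rng.permutation(len(height))
    train, test = idx[:160], idx[160:]

    # Fit a straight line on the training sample only.
    b, a = np.polyfit(height[train], weight[train], 1)

    def rmse(h, w):
        # Root mean squared prediction error of the fitted line.
        return np.sqrt(np.mean((w - (a + b * h)) ** 2))

    # A stable model's error should not degrade much on the hold-out sample.
    print("train RMSE:", rmse(height[train], weight[train]))
    print("test RMSE: ", rmse(height[test], weight[test]))
    ```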

  • I've heard of Least Squares. What's this and how is it related to regression?
    The least squares regression line. Source: Sultana 2014, slide 6.

    Least Squares is a term that signifies that the sum of squared errors is at a minimum. The error is defined as the difference between observed value and predicted value. The objective of regression estimation is to produce the least squared error. When the error on the training data approaches zero, the model may be overfitting.

    The Least Squares Method provides linear equations with unknowns that can be solved for any given data. The unknowns are the regression parameters. These linear equations are called Normal Equations. They are derived using calculus to minimize the squared errors.

    Other algorithms (Artificial Neural Network (ANN), K-Nearest Neighbour (KNN), etc.) also attempt to minimize squared error unless the objective states otherwise.
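    The normal equations can be written compactly as \(X^TX\,b = X^Ty\), where \(X\) is the design matrix and \(b\) holds the regression parameters. A minimal sketch that solves them directly with NumPy, on made-up data:

    ```python
    import numpy as np

    rng = np.random.default_rng(1)

    # Made-up data: one predictor, linear relationship plus noise.
    x = rng.uniform(0, 10, 50)
    y = 3.0 + 2.0 * x + rng.normal(0, 1, 50)

    # Design matrix with a column of ones for the intercept.
    X = np.column_stack([np.ones_like(x), x])

    # Solve the normal equations X'X b = X'y for the parameters.
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    print("intercept, slope:", beta)  # close to 3.0 and 2.0
    ```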

  • Could you explain the difference between interpolation and extrapolation w.r.t. regression?
    Time Series Forecasting. Source: Zhao 2011.

    Simply put, interpolation is estimation within the range of available data, while extrapolation is estimation where little data is available, either because it wasn't collected or because it cannot be collected.

    We can interpolate missing data points using regression. For instance, if we want to estimate height given weight and the data collection process missed out certain weights, we can use regression to interpolate. This missing data can be estimated by other means too. Estimating missing data is called imputation.

    Height and weight data is bounded by nature and can be sourced. Say we want to estimate the future weight of an individual given that individual's historical weight variations. This is extrapolation. In regression, we call it forecasting. It's solved using a distinct set of techniques called Time Series Regression.

  • What is correlation? How is it related to regression?
    Types of correlations. Source: Statistics How To 2018.

    Correlation helps identify variables that can be used for regression modelling. The correlation between each predictor and the predicted variable is examined to decide which predictors to include in the model.

    Correlation quantifies the association between two variables. The effect of \(X\) (or \(X_1\), \(X_2\), \(X_3\)...) on \(Y\) can thus be characterized:

    • Positive Correlation: \(Y\) goes up/down as \(X\) goes up/down. Correlation coefficient will be in the range [0,1].
    • Negative Correlation: \(Y\) goes up/down as \(X\) goes down/up. Correlation coefficient will be in the range [-1,0].
    • No Correlation: \(Y\) doesn't go up/down as \(X\) goes up/down. Correlation coefficient will be close to 0.

    Correlation coefficient \(r\) has the following formula:

    $$r=\frac{\sum_{i=1}^n(x_i-\bar x)(y_i-\bar y)}{\sqrt{\sum_{i=1}^n(x_i-\bar x)^2 \sum_{i=1}^n(y_i-\bar y)^2}}$$

    An equivalent formula that substitutes the mean values \(\bar x\) and \(\bar y\) with their individual sample points \(x_i\) and \(y_i\) is published in Wikipedia. More formally, \(r\) is called Pearson Product Moment Correlation (PPMC).
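    As a minimal sketch (NumPy, with invented height-weight numbers), the formula above can be coded directly and checked against NumPy's built-in:

    ```python
    import numpy as np

    x = np.array([150, 160, 165, 170, 180, 185], dtype=float)  # heights (cm), invented
    y = np.array([52, 58, 63, 68, 77, 82], dtype=float)        # weights (kg), invented

    # Pearson correlation coefficient from the formula above.
    xd, yd = x - x.mean(), y - y.mean()
    r = np.sum(xd * yd) / np.sqrt(np.sum(xd ** 2) * np.sum(yd ** 2))

    print(r)                        # close to +1: strong positive correlation
    print(np.corrcoef(x, y)[0, 1])  # NumPy's built-in gives the same value
    ```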

  • What's the right interpretation of correlation coefficient?
    Different samples with the same correlation coefficient although their regression lines may differ. Source: Stanton 2001, fig. 2.

    Correlation coefficient \(r\) is a measure of linear association strength. It doesn't quantify non-linearity. The proportion of variation explained is given by the square of the coefficient: with \(r=0.8\), \(r^2=0.64\), so 64% of variation in one variable is explained by variation in the other. For example, 64% of variation in rainfall is explained by the number of trees; the remaining 36% is due to factors other than the number of trees.

    It will be apparent from the formula that \(r\) factors in the sample variance. On an X-Y scatterplot, regression lines may have different slopes due to different sample variances even when the samples share the same correlation coefficient. In other words, \(r\) is not simply the slope of the regression line.

  • Could you give examples of non-linear correlation?
    Illustrating linear, non-linear and no correlation types. Source: Johnivan 2011.

    A non-linear correlation is where the relationship between the variables cannot be expressed by a straight line. We call this relationship curvilinear.

    A non-linear relationship can exhibit a monotonic positive pattern, a monotonic negative pattern, or both patterns together.

  • How can we do data analysis when relationships are non-linear?
    Transformations for non-linear relationships. Source: Teknomo 2017.

    The correlation coefficient formula applies only to linear relationships. One common approach for non-linear correlations is to transform the variables into linear form. If the relationship is curvilinear, we can apply transformations directly. Common transformations include logarithmic and inverse transformations.

    If the relationship is non-linear but not curvilinear, we can split the data into distinct segments. Data within some segments may be linear. In other segments, if it's curvilinear, transformations can be applied to make it linear. Analysis is thus segment-wise, sometimes called segmented regression. As an example, the yield of mustard is not affected by soil salinity at low values; above a threshold, there's a negative linear relationship. Such a dataset can be segmented at the threshold.
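    To illustrate the transformation idea, here's a sketch with invented exponential-growth data, where taking the logarithm of \(Y\) makes the relationship linear so that ordinary least squares applies:

    ```python
    import numpy as np

    rng = np.random.default_rng(2)

    # Invented data following y = 2 * exp(0.5 x) with multiplicative noise.
    x = np.linspace(1, 10, 50)
    y = 2.0 * np.exp(0.5 * x) * rng.lognormal(0, 0.1, 50)

    # The raw x-y relationship is curvilinear; log(y) versus x is linear.
    slope, intercept = np.polyfit(x, np.log(y), 1)

    print("recovered rate: ", slope)              # close to 0.5
    print("recovered scale:", np.exp(intercept))  # close to 2.0
    ```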

  • What is causal relationship in regression?
    Bivariate chart indicating correlation between exam scores and income. Source: Goldstein 2017.

    One study about college education showed a positive correlation between SAT scores of incoming students and their earnings after graduation. But can we state that graduating from elite colleges (which admit high SAT scorers) played a causal role in the higher salaries? That's a question of causality.

    Causality or causation refers to the idea that variation in a predictor \(X\) causes variation in the predicted variable \(Y\). This is distinct from regression, which is more about predicting \(Y\) based on its correlation with \(X\). Regression does not claim that \(Y\) is caused by \(X\).

    Here are some possible examples of causality. High scores lead to higher earnings. Regular exercise results in better health. Current season influences power consumption. All pairs of variables that have a causal relationship will exhibit significant correlation.

  • Does strong correlation always imply causal relationship?
    An example where correlation does not imply causation. Source: Stark 2017.

    No. Sometimes correlations are purely coincidental. For example, non-commercial space launches and sociology doctorates awarded are completely unrelated but the image shows them to be strongly correlated. This is called a Spurious Correlation. This is a clear case where correlation does not imply causation.

    Another example is when ice cream sales are positively correlated with violent crime. However, violent crime is not caused by ice cream sales. It so happens that there's a confounding variable, which in this case is weather. Hot weather influences both ice cream sales and violent crimes. It's therefore obvious that,

    correlation does not always imply causation

    Correlation shouldn't be mistaken for causation. Look at the physical mechanism causing such a relationship. For example, is rain driving the sale of your product? Data may show a correlation. It need not be causal unless your product is an umbrella. However, proving causality is hard. At best, we can do randomized trials to establish causality.

    Regression is a useful tool in either predictive or causal analysis. With the growth of Big Data, it's being used more often for predictive analysis.

  • Could you explain the regression model?

    We call it a model when the relationship between variables is in a well-defined mathematical form: \(Y=f(X)\).

    For instance, a linear relationship can be written as \(f(X)=a+b_1X_1+b_2X_2+b_3X_3\), where \(a\) is a constant and \(b_1,\,b_2,\,b_3\) are regression coefficients. \(a\) is the constant effect, while a unit change in \(X_1\) results in a \(b_1\) unit change in \(Y\), holding the other predictors fixed.

    It's important to note that linearity is in terms of the coefficients, not in terms of predictor variables. For example, this model is still linear though it's quadratic in terms of \(X_1\): \(f(X)=a+b_1X_1+b_2X_1^2\).
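    A minimal sketch (invented data) showing that a model quadratic in \(X_1\) is still fitted by ordinary linear least squares, because it's linear in the coefficients:

    ```python
    import numpy as np

    rng = np.random.default_rng(3)

    # Invented data from y = 1 + 2*x + 0.5*x^2 plus noise.
    x1 = rng.uniform(-3, 3, 100)
    y = 1.0 + 2.0 * x1 + 0.5 * x1 ** 2 + rng.normal(0, 0.3, 100)

    # Design matrix [1, x1, x1^2]: quadratic in x1, linear in the coefficients.
    X = np.column_stack([np.ones_like(x1), x1, x1 ** 2])

    # Ordinary least squares still applies.
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    print("a, b1, b2:", coeffs)  # close to 1.0, 2.0, 0.5
    ```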

  • How do we measure the accuracy of a regression model?
    R-Squared comparison. Source: Statwing Docs 2018.

    The accuracy of a regression model is measured relative to a base model that simply predicts the mean. This measure, called R-Squared, is based on squared deviations from the mean value, defined below:

    • For base model, the sum of squared deviation of actual value \(Y\) from mean value \(E(Y)\) is referred to as Total Variance or SST (Total Sum of Squares). $$SST=\sum_{i=1}^n(y_i-\bar y)^2$$
    • For regression model, the sum of squared deviation of estimated value \(\widehat Y \) from mean value \(E(Y)\) is referred to as Explained Variance or SSR (Regression Sum of Squares). $$SSR=\sum_{i=1}^n(\widehat y_i-\bar y)^2$$
    • The accuracy of the model is called R-Squared. $$R^2=\frac{\text{Explained Variance}}{\text{Total Variance}} = \frac{SSR}{SST}$$

    The higher the \(R^2\), the larger the explained variance and the lower the unexplained variance. Hence, a higher \(R^2\) value is desired. For example, if \(R^2=0.8\), 80% of the variation in the data is explained by the model.
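    Continuing the same sketch style (NumPy, invented data), \(R^2\) follows directly from SST and SSR:

    ```python
    import numpy as np

    rng = np.random.default_rng(4)

    # Invented data with a linear trend plus noise.
    x = rng.uniform(0, 10, 80)
    y = 4.0 + 1.5 * x + rng.normal(0, 2, 80)

    # Fit a simple linear model and compute fitted values.
    b, a = np.polyfit(x, y, 1)
    y_hat = a + b * x

    # Total and explained (regression) sums of squares.
    sst = np.sum((y - y.mean()) ** 2)
    ssr = np.sum((y_hat - y.mean()) ** 2)

    print("R-squared:", ssr / sst)
    ```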

  • What are some challenges with regression and how to overcome them?

    High multicollinearity is a challenge. It means one or more independent variables are highly correlated with, or nearly linear combinations of, other independent variables. This makes it difficult to estimate the coefficients. One possible solution is to increase the sample size.

    Another challenge is non-constant error variance, also called heteroscedasticity. An example of this is when the observations "funnel out" as we move along the regression line. One solution is to use Weighted Least Squares (WLS).

    Regression assumes that errors from one observation are not related to other observations. This is often not true with time series data. Autocorrelated errors are therefore a challenge. One approach is to estimate the pattern in the errors and refine the regression model.

    Another problem is overfitting, which occurs when the model is "too well-trained" on the sample. Such a model generalizes poorly to other data. Regularization is the technique used to avoid overfitting. For parametric models, there are regression routines that address overfitting. Lasso regression and ridge regression are two such routines.
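    As a sketch of how ridge regression stabilizes coefficients under collinearity (closed-form solution in NumPy; the data and penalty value are arbitrary assumptions):

    ```python
    import numpy as np

    rng = np.random.default_rng(5)

    # Invented data with two nearly collinear predictors.
    x1 = rng.uniform(0, 10, 60)
    x2 = x1 + rng.normal(0, 0.01, 60)            # almost a copy of x1
    y = 3.0 + 1.0 * x1 + 1.0 * x2 + rng.normal(0, 1, 60)

    X = np.column_stack([np.ones_like(x1), x1, x2])
    lam = 1.0                                    # arbitrary ridge penalty

    # Ordinary least squares: coefficients become unstable under collinearity.
    ols = np.linalg.solve(X.T @ X, X.T @ y)

    # Ridge regression: add lam*I (excluding the intercept) to stabilize.
    penalty = lam * np.eye(X.shape[1])
    penalty[0, 0] = 0.0                          # don't penalize the intercept
    ridge = np.linalg.solve(X.T @ X + penalty, X.T @ y)

    print("OLS coefficients:  ", ols)
    print("Ridge coefficients:", ridge)
    ```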

  • Could you share some tips for beginners getting into regression modelling?

    Here are a few useful tips:

    • It's known that when not enough data is collected, \(R^2\) is overestimated. Collect sufficient data.
    • Use partial F-test to identify predictors that can explain most of the variance in the predicted variable. Try to select as few predictors as possible to simplify analysis.
    • Try different techniques for cross-validation, such as independent samples or split samples.
    • Starting with lots of predictors might result in bad analysis. Start with a narrower focus.
    • Analysis is sensitive to bad data. Be careful about how data is collected.
    • Let decision makers be aware of the error term. See if the predictions make sense. Don't blindly believe in data: combine it with intuition.

Milestones

1795

Carl Friedrich Gauss invents the method of least squares. He doesn't publish the method until much later in 1809. He uses it to predict the position of the celestial body named Ceres. Squared error is easy to compute and the error from this method is also normally distributed.

1805

Adrien-Marie Legendre publishes his invention of the method of least squares independently of Gauss. He uses it for the determination of orbits of comets.

1875

Francis Galton analyzes the sizes of mother and daughter sweet-pea seeds. He also makes a 2D-plot comparing the two, thereby obtaining the first insights into regression. He presents his first regression line in 1877. He notices that extreme values are "dampened" in the next generation whose values are closer to the mean. The idea of regression to the mean starts with Galton. Galton initially uses the term reversion rather than regression.

1896

Karl Pearson gives a mathematical treatment of correlation and regression using product-moment method.

1898
Multivariate regression: three generations of ancestors pass on their influence. Source: Stanton 2001, fig. 3.

Francis Galton considers the role of previous generations of ancestors on one individual, thus recognizing that multiple variables can affect the predicted variable. The idea of multivariate regression starts here but is developed only later by Karl Pearson.

1915

R. A. Fisher gives the exact sampling distribution of the correlation coefficient. Rigorous mathematical treatment of multivariate analysis also starts with Fisher through his z-transformation and F distribution.

1962

G.E.P. Box and P.W. Tidwell investigate transformations on predictor variables. Such transformations become useful to maintain the assumptions of independence, normality and variance homogeneity.

1970

A.E. Hoerl and R.W. Kennard look into the problem of near linear dependencies in the predictors. They propose ridge regression as a solution that uses suitable biasing parameters.

References

  1. Allison, Paul. 2014. "Prediction vs. Causation in Regression Analysis." Statistical Horizons, July 8. Accessed 2020-07-24.
  2. Bush, Joshua. 2018. "The Difference Between Bivariate & Multivariate Analyses." Sciencing, June 04. Accessed 2018-08-30.
  3. Gallo, Amy. 2015. "A Refresher on Regression Analysis." Harvard Business Review, November 04. Accessed 2018-08-30.
  4. Goldstein, Zachary. 2017. "What data science reveals about the SAT, earnings, and poverty in higher education." Coding it Forward Blog, May 22. Retrieved 2018-03-16.
  5. Gupta, Prashant. 2017. "Regularization in Machine Learning." Towards Data Science, on Medium, November 15. Accessed 2020-07-24.
  6. Hocking, R. R. 1983. "Developments in Linear Regression Methodology: 1959-1982." Technometrics 25, no. 3, pp. 219-30. Accessed 2018-08-30.
  7. Johnivan. 2011. "Scatter Diagrams." STPM Further Mathematics T, Aug 8. Retrieved 2018-03-16.
  8. Kahane, Leo H. 2001. "Regression Basics." Sage Publications. Accessed 2020-07-24.
  9. Kiernan, Diane. 2014. "Chapter 7: Correlation and Simple Linear Regression." In: Natural Resources Biometrics, Open SUNY Textbooks, January 16. Accessed 2020-07-24.
  10. Kopf, Dan. 2015. "The Discovery of Statistical Regression." Priceonomics, November 06. Accessed 2018-08-30.
  11. Memidex. 2013. "Curvilinear correlations." Memidex, June 26. Accessed 2018-08-30.
  12. Palmer, Phillip B, and Dennis G O'Connell. 2009. "Regression Analysis for Prediction: Understanding the Process." Cardiopulmonary Physical Therapy Journal, September, vol. 20, no. 3, pp. 23–26. Accessed 2018-08-30.
  13. Philosophy Terms. 2016. "Causality." Philosophy Terms, October 10. Updated 2018-10-25. Accessed 2020-07-24.
  14. Rao, C. Radhakrishna. 1983. "Multivariate Analysis: Some Reminiscences on Its Origin and Development." Sankhyā: The Indian Journal of Statistics, Series B (1960-2002) 45, no. 2, pp. 284-99. Accessed 2018-08-30.
  15. Schmitt, Peter, Jonas Mandel, and Mickael Guedj. 2015. "A Comparison of Six Methods for Missing Data Imputation." Journal of Biometrics & Biostatistics, vol. 6, no. 224. Accessed 2018-08-30.
  16. Smith, Martha K. 2014. "Overfitting." Common Mistakes in Using Statistics: Spotting and Avoiding Them, University of Texas at Austin, June 13. Accessed 2018-08-30.
  17. Stanton, Jeffrey M. 2001. "Galton, Pearson, and the Peas: A Brief History of Linear Regression for Statistics Instructors." Journal of Statistics Education, vol. 9, no. 3. Accessed 2018-08-30.
  18. Stark, Ian. 2017. "Lecture 19: Data Scales; Correlation and Causation." The University of Edinburgh, Mar 28. Retrieved 2018-03-16.
  19. Statistics How To. 2018. "Correlation Coefficient: Simple Definition, Formula, Easy Steps." Updated 2018-03-14. Retrieved 2018-03-16.
  20. Statwing Docs. 2018. "A user-friendly guide to linear regression." Statwing Documentation. Retrieved 2018-03-16.
  21. Sultana, Nahid. 2014. "Introduction to Statistics and Probability: Looking at Data-Relationships, Chapter 2, Part 3.", SlideShare, March 19. Accessed 2018-03-28.
  22. Teknomo, Kardi. 2017. "NonLinear Transformation." Revoledu. Retrieved 2018-03-16.
  23. Weisstein, Eric W. 2009. "Normal Equation." MathWorld--A Wolfram Web Resource, February 08. Accessed 2018-08-30.
  24. Wikipedia. 2018a. "Regression analysis." Wikipedia, July 13. Accessed 2018-08-30.
  25. Wikipedia. 2018b. "Pearson correlation coefficient." Wikipedia, August 09. Accessed 2018-08-30.
  26. Wikipedia. 2020. "Nonlinear regression." Wikipedia, May 16. Accessed 2020-07-24.
  27. Zhao, Yanchang. 2011. "Time Series Analysis and Mining with R." RDataMining, August 23. Accessed 2018-03-28.

Further Reading

  1. Gallo, Amy. 2015. "A Refresher on Regression Analysis." Harvard Business Review, November 04. Accessed 2018-08-30.
  2. Ramcharan, Rodney. 2006. "Regressions: Why Are Economists Obssessed with Them?" Finance & Development, IMF, March, vol. 23, no. 1. Retrieved 2018-03-16.
  3. Kopf, Dan. 2015. "The Discovery of Statistical Regression." Priceonomics, November 06. Accessed 2018-08-30.
  4. Armstrong, J. Scott. 2012. "Illusions in Regression Analysis." International Journal of Forecasting, July, vol. 28, pp. 689-694. Retrieved 2018-03-16.
  5. Rawlings, John O., Sastry G. Pantula, and David A. Dickey. 1998. "Applied Regression Analysis: A Research Tool." Second Edition. Springer-Verlag New York, Inc. Retrieved 2018-03-16.
  6. ablongman.com. 2018. "Measures of Relationship." Retrieved 2018-03-16.
