Types of Regression
Summary
Regression is widely used for prediction or forecasting where given one or more independent variables we try to predict another variable. For example, given advertising expense, we can predict sales. Given a mother's smoking status and the gestation period, we can predict the baby's birth weight.
There are many types of regression models; one source mentions as many as 35 different models. An analyst or statistician must select a model that makes sense for the problem. Models differ based on the number of independent variables, the type of the dependent variable, and how these two are related to each other.
Regression comes from statistics. It's one of many techniques used in machine learning.
Discussion
Could you introduce regression?
Suppose there's a dependent or response variable \(Y_i\) and independent variables or predictors \(X_i\). The essence of regression is to estimate a function \(f(X_i,\beta)\) that models how the dependent variable is related to the predictors. Adding an error term or residual \(\epsilon_i\), we get \(Y_i = f(X_i,\beta) + \epsilon_i\), for scalar \(Y_i\) and vector \(X_i\).
The residual is not seen in the data: it's the difference between the observed value \(Y_i\) and what the model predicts. With the goal of minimizing the residuals, regression estimates the model parameters or coefficients \(\beta\) from data. There are many ways to do this, and the process is called estimation.
Regression modelling also makes important assumptions. The sampled data should represent the population. There are no measurement errors in the predictor values. Residuals have zero mean (when conditioned on \(X_i\)) and constant variance. Residuals are also uncorrelated with one another. More assumptions are used depending on the model type and estimation technique.
Regression uncovers useful relationships, that is, how predictors are correlated with the response variable. Regression makes no claim that predictors influence or cause the outcome. Correlation should not be confused with causation.
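As a minimal sketch of these ideas, the following Python snippet fits a simple linear model and inspects its residuals. The use of numpy and scikit-learn is our choice here, and the data is synthetic, generated from a known "true" model purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: Y = 2 + 3*X + noise. The true coefficients are known here
# only because we generated the data; in practice they must be estimated.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2 + 3 * X[:, 0] + rng.normal(0, 1, size=100)

model = LinearRegression().fit(X, y)   # estimates beta from data
residuals = y - model.predict(X)       # observed minus predicted

print("Estimated coefficients:", model.intercept_, model.coef_[0])
print("Residual mean (should be near zero):", residuals.mean())
```

The estimated intercept and slope should land close to the true values 2 and 3, and the residual mean near zero, consistent with the zero-mean assumption above.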
How do you classify the different types of regression?
Regression techniques can be classified in many ways:
- Number of Predictors: We can distinguish between Simple Regression (a single predictor) and Multiple Regression (several predictors).
- Outcome-Predictors Relationship: When this is linear, we can apply Linear Regression or its many variants. If the relationship is non-linear, we can apply Polynomial Regression or Spline Regression. More generally, when the relationship is known it's Parametric Regression, otherwise it's Non-parametric Regression.
- Predictor Selection: With multiple predictors, sometimes not all of them are important. Best Subsets Regression or Stepwise Regression can find the right subset of predictors. We could penalize too many predictors in the model using Ridge Regression, Lasso Regression or Elastic Net Regression.
- Correlated Predictors: If predictors are correlated, one approach is to transform them into fewer predictors that are linear combinations of the original ones. Principal Component Regression (PCR) and Partial Least Squares (PLS) Regression are two ways to do this (see the sketch after this list).
- Outcome Type: When predicting categorical data, we can apply Logistic Regression. When the outcome is a count variable, we can apply Poisson Regression or Negative Binomial Regression. In fact, a suitable regression method can often be inferred from the distribution of the dependent variable.
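To illustrate the correlated-predictors case, here's a hedged sketch of Principal Component Regression as a scikit-learn pipeline. The library and the synthetic nearly-collinear data are our choices; the article names only the technique:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

# Two highly correlated predictors plus noise.
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.05, size=200)   # nearly collinear with x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(size=200)

# PCR: standardize, project onto principal components, then regress.
pcr = make_pipeline(StandardScaler(), PCA(n_components=1), LinearRegression())
pcr.fit(X, y)
print("R^2 on training data:", pcr.score(X, y))
```

A single principal component captures almost all the variation shared by the two collinear predictors, sidestepping the instability that multicollinearity causes in ordinary least squares.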
What are the types of linear regression models?
Simple Regression involves only one predictor. For example, \(Y_i = \beta_0 + \beta_{1}X_{1i} + \epsilon_i\).
If we generalize to many predictors, the term Multiple Linear Regression is used. Consider the two-predictor model \(Y_i = \beta_0 + \beta_{1}X_{1i} + \beta_{2}X^2_{2i} + \epsilon_i\). Although there's a squared term, the model is still linear in the parameters.
To represent many Multiple Linear Regression models in a compact form we can use the General Linear Model. This generalization allows us to work with many dependent variables that share the same independent variables. It also subsumes different statistical procedures including ANOVA, ANCOVA, OLS, the t-test and the F-test.
The General Linear Model assumes that \(Y_i \sim N(X^T_i\beta,\sigma^2)\), that is, the response variable is normally distributed with a mean that's a linear combination of the predictors. A larger class of models, called the Generalized Linear Model (GLM), allows \(Y_i\) to follow any distribution in the exponential family. The General Linear Model is a special case of the GLM.
If some effects on the response are random rather than fixed (for example, when observations come in groups or clusters), the Generalized Linear Mixed Model (GLMM) can be used.
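The sketch below fits a Poisson GLM, one of the exponential-family cases mentioned above. We use statsmodels here purely as an illustration (the article names R's glm() and scikit-learn, not statsmodels), and the count data is synthetic:

```python
import numpy as np
import statsmodels.api as sm

# Poisson GLM: counts whose log-mean is linear in the predictor.
rng = np.random.default_rng(2)
x = rng.uniform(0, 2, size=200)
y = rng.poisson(np.exp(0.5 + 1.2 * x))       # true coefficients 0.5 and 1.2

X = sm.add_constant(x)                        # adds the intercept column
poisson_glm = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(poisson_glm.params)                     # estimates near 0.5 and 1.2
```

Swapping the family (Binomial, Gamma, Gaussian) switches among the GLM special cases without changing the rest of the code.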
Could you compare linear and logistic regression?
Since logistic regression deals with categorical outcomes, it predicts the probability of an outcome rather than a continuous value. Predictions should therefore be restricted to the range 0-1. This is done by transforming the linear regression equation to the logit scale. This is the natural log of the odds of being in one category versus the other categories.
For this reason, logistic regression may be seen as a particular case of GLM. Logit is used as the link function that relates predictors to the outcome.
Logistic regression shares with linear regression many of the assumptions: independence of errors, linearity (but in the logit scale), absence of multicollinearity among predictors, and lack of influential outliers.
There are three types of logistic regression (a short sketch follows the list):
- Binary: Only two outcomes. Example: predict whether a student passes a test. When all predictors are categorical, such models are called logit models.
- Nominal: More than two unordered outcomes. Also called Multinomial Logistic Regression. Example: predict the colour of an iPhone model a customer is likely to buy.
- Ordinal: More than two ordered outcomes. Example: predicting a medical condition (good, stable, serious, critical).
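Here's a minimal binary example with scikit-learn, using synthetic pass/fail data of our own devising. Note that the model outputs probabilities in [0, 1], per the discussion above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Binary outcome: pass/fail predicted from hours studied.
rng = np.random.default_rng(3)
hours = rng.uniform(0, 10, size=(200, 1))
p = 1 / (1 + np.exp(-(hours[:, 0] - 5)))      # logistic curve centred at 5
passed = rng.binomial(1, p)

clf = LogisticRegression().fit(hours, passed)
# Probabilities, not continuous values, are predicted -- always in [0, 1].
print(clf.predict_proba([[4.0], [6.0]])[:, 1])
```

For the nominal case, scikit-learn's LogisticRegression also supports multinomial fitting over more than two classes.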
Could you explain parametric versus non-parametric regression?
Linear models and even non-linear models are parametric models, since we know (or make an educated guess about) how the outcome relates to the predictors. Once the model form is fixed, the task is to estimate the parameters \(\beta\) of the model. If we have problems in this estimation, we can revise the model and try again.
Non-parametric regression is more suitable when we have no idea how the outcome relates to the predictors. Usually when the relationship is non-linear, we can adopt non-parametric regression. For example, one study attempting to predict the logarithm of wage from age found that non-parametric regression approaches outperformed simple linear and polynomial regression methods.
Parametric models have a finite set of parameters that try to capture everything about the observed data. Model complexity is bounded even with unbounded data. Non-parametric models are more flexible: the model improves as more data is observed. We can view them as having infinitely many parameters, or as estimating functions directly. An artificial neural network with infinitely many hidden units is equivalent to non-parametric regression.
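The contrast is easy to see in code. Below, a parametric straight-line fit and a non-parametric k-nearest-neighbours fit are compared on deliberately non-linear synthetic data; the libraries and data are our choices for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# A clearly non-linear relationship that a straight line cannot capture.
rng = np.random.default_rng(4)
X = np.sort(rng.uniform(0, 6, size=(200, 1)), axis=0)
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

linear = LinearRegression().fit(X, y)               # parametric: fixed form
knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)  # non-parametric

print("Linear R^2:", linear.score(X, y))            # poor fit
print("k-NN R^2:  ", knn.score(X, y))               # much better fit
```

The k-NN regressor has no fixed functional form; its predictions adapt locally as more data arrives, which is exactly the flexibility described above.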
What are some specialized regression models?
We note a few of these with brief descriptions, followed by a short code sketch:
- Robust Regression: Better suited than linear regression for handling outliers or influential observations. Observations are weighted so that outliers carry less influence.
- Huber Regression: To handle outliers better, this optimizes a combination of squared error and absolute error.
- Quantile Regression: Linear regression predicts the mean of the dependent variable; quantile regression predicts the median or, more generally, any quantile. For example, predicting the 0.25 quantile (25th percentile) of a house price means that there's a 25% chance that the actual price is below the predicted value.
- Functional Sequence Regression: Sometimes predictors affect the outcome in a time-dependent manner. This model includes the time component. For example, onion weight depends on environmental factors at various stages of the onion's growth.
- Regression Tree: Use a decision tree to split the predictor space at internal nodes. Terminal nodes or leaves represent predictions, which are the mean of data points in each partitioned region.
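The sketch below illustrates two of these: Huber regression resisting injected outliers, and a regression tree predicting region means. The scikit-learn estimators and the synthetic data are our choices:

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(5)
X = rng.uniform(0, 10, size=(100, 1))
y = 2 * X[:, 0] + rng.normal(size=100)
y[:5] += 50                                    # inject a few gross outliers

# Huber loss limits the influence of the outliers on the fitted slope.
print("OLS slope:  ", LinearRegression().fit(X, y).coef_[0])
print("Huber slope:", HuberRegressor().fit(X, y).coef_[0])

# A regression tree predicts the mean of each partitioned region.
tree = DecisionTreeRegressor(max_depth=3).fit(X, y)
print("Tree prediction at X=5:", tree.predict([[5.0]])[0])
```

The Huber slope should stay close to the true value 2 while the ordinary least squares slope is pulled away by the outliers.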
Could you share examples to illustrate a few regression methods?
In a production plant, there's a linear correlation between water consumption and amount of production. Simple regression suffices in this case, giving the fit as:
Water = 2273 + 0.0799 Production
Thus, even without any production, 2273 units of water are consumed. Every unit of production increases water consumption by 0.0799 units. Both predictor and outcome are continuous variables.
As an example of multiple linear regression, let's predict the birth weight of a baby (continuous variable) based on two predictors: whether the mother is a smoker or non-smoker (categorical variable) and the gestation period (continuous variable). We represent non-smokers as 0 and smokers as 1. The regression equation is:
Wgt = -2390 + 143.10 Gest - 244.5 Smoke
If we plot this, we'll actually see two parallel lines, one for smokers and one for non-smokers.
One study looked at the number of cigarettes college students smoked per day. It predicted this count from gender, birth order, education level, social/psychological factors, and so on. The study used Poisson regression, negative binomial regression, and many others.
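To make the parallel-lines point concrete, here's a tiny sketch evaluating the quoted birth-weight fit. The coefficients come directly from the equation above; the function name is our own:

```python
def predicted_birth_weight(gestation_weeks: float, smoker: bool) -> float:
    """Evaluate the fitted equation Wgt = -2390 + 143.10*Gest - 244.5*Smoke."""
    return -2390 + 143.10 * gestation_weeks - 244.5 * (1 if smoker else 0)

# Same gestation period: the smoker line sits 244.5 units below.
print(predicted_birth_weight(38, smoker=False))  # 3047.8
print(predicted_birth_weight(38, smoker=True))   # 2803.3
```

The binary Smoke predictor only shifts the intercept, which is why the two fitted lines are parallel.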
With so many types of regression models, how do I select a suitable one?
To apply linear regression, the main assumptions must be met: linearity, independence, constant variance and normality. Linearity can be checked via graphical analysis: a plot of residuals versus predicted values can show non-linearity, as can a goodness-of-fit test. Non-linear relations can be made linear using transformations of the predictors and/or the outcome, such as log, square root or power transformations. Try adding transformations of current predictors. Try semi-parametric or non-parametric models.
In practice, linear regression is sensitive to outliers and cross-correlations. Piecewise linear regression, particularly for time series data, can be a better approach. Non-parametric regression can be used when there's an unknown non-linear relationship. Support Vector Regression (SVR) is an example of non-parametric regression.
When overfitting is a problem, use cross-validation to evaluate models. Ridge, lasso and elastic net models can help tackle overfitting. They can also handle multicollinearity. Quantile regression is suited to handling outliers.
For predicting counts, use negative binomial regression if variance is larger than the mean. Poisson regression can be used only if variance equals the mean.
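As a hedged sketch of the cross-validation advice, scikit-learn's RidgeCV and LassoCV choose the penalty strength by cross-validation. The dataset here is synthetic, generated with make_regression:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV, LassoCV

# Cross-validation picks the penalty strength, guarding against overfitting.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=6)

ridge = RidgeCV(alphas=[0.1, 1.0, 10.0]).fit(X, y)
lasso = LassoCV(cv=5).fit(X, y)

print("Chosen ridge alpha:", ridge.alpha_)
print("Non-zero lasso coefficients:", np.sum(lasso.coef_ != 0))
```

Lasso's zeroed-out coefficients also act as predictor selection, which connects back to the stepwise and best-subsets approaches mentioned earlier.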
What are some tips to analyze model statistics?
Well-known model performance metrics include R-squared (R2), Root Mean Squared Error (RMSE), Residual Standard Error (RSE) and Mean Absolute Error (MAE). We also have metrics that penalize additional predictors: Adjusted R2, the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC) and Mallows's Cp. The higher the R2 or Adjusted R2, the better the model. For all other metrics, a lower value implies a better model.
A high t-statistic implies the coefficient is probably non-zero. A low p-value on the t-statistic gives confidence in the estimate. Insignificant individual coefficients combined with a low p-value for the model as a whole can imply multicollinearity. While the t-test is applied to individual coefficients, the F-test is applied to the overall model.
Two models can be compared graphically. For example, the coefficients and their confidence intervals can be plotted and compared visually.
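As a small illustration, the snippet below computes three of these metrics with scikit-learn (listed in the software section below) on toy predictions of our own:

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Toy observed values and model predictions, for illustration only.
y_true = np.array([3.0, 5.0, 7.5, 9.0])
y_pred = np.array([2.8, 5.3, 7.0, 9.4])

print("R^2: ", r2_score(y_true, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_true, y_pred)))
print("MAE: ", mean_absolute_error(y_true, y_pred))
```

RMSE penalizes large errors more heavily than MAE, so comparing the two hints at whether a few big misses dominate the error.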
What software packages support regression?
In R, the functions `lm()`, `summary()`, `residuals()` and `predict()` in the `base` package enable linear regression. For GLM, we can use the `glm()` function. Use the `quantreg` package for quantile regression; `glmnet` for ridge, lasso and elastic net regression; `pls` for principal component regression; `plsdepot` for PLS regression; `e1071` for Support Vector Regression (SVR); `ordinal` for ordinal regression; `MASS` for negative binomial regression; and `survival` for Cox regression. Other useful packages are `stats`, `car`, `caret`, `sgd`, `BLR`, `Lars`, and `nlme`.
In Python, `scikit-learn` provides a number of modules and functions for regression. Use the module `sklearn.linear_model` for linear regression including logistic, Poisson, gamma, Huber, ridge, lasso, and elastic net; `sklearn.svm` for SVR; `sklearn.neighbors` for k-nearest neighbours regression; `sklearn.isotonic` for isotonic regression; `sklearn.metrics` for regression metrics; and `sklearn.ensemble` for ensemble methods for regression.
Milestones
Francis Galton plots in 1877 what may be called the first regression line. It concerns the size of sweet-pea seeds, correlating the size of daughter seeds against that of mother seeds. Such an analysis comes about in the course of investigating Darwin's mechanism for heredity. Through these experiments, Galton also introduces the concept of "reversion to the mean", later called regression to the mean.
In 1970, Hoerl and Kennard note that least squares estimation is unbiased but can give poor results if there's multicollinearity among the predictors. To improve the estimation they propose a biased estimation approach that they call Ridge Regression. Ridge regression uses standardized variables, that is, the outcome and predictors are centred by their means and divided by their standard deviations. By introducing some bias, the variance of the estimator is controlled.
In 1972, D.R. Cox applies regression to life-table analysis. Among the sampled individuals, he observes either the time to "failure" or that the individual is removed from the study (called censoring). Moreover, the distribution of survival times is often skewed. For these reasons, linear regression is not suitable. Cox instead uses a hazard function that incorporates the age-specific failure rate. In later years, this approach is simply called Cox Regression.
The same year, Nelder and Wedderburn introduce the Generalized Linear Model (GLM). As examples, they relate GLM to the normal, binomial (probit analysis), Poisson (contingency tables), and gamma (variance components) distributions. However, it's only in the 1980s that GLM becomes popular, due to the work of McCullagh and Nelder.
In 2002, De'ath proposes the Multivariate Regression Tree (MRT). The history of regression trees goes back to the 1960s. With the release of the CART (Classification and Regression Tree) software in 1984, regression trees become better known. However, CART is limited to a single response variable. MRT extends CART to multivariate response data.
References
- Analytics University. 2017. "35 Types of Regression Models used in Data Science." Analytics University, on YouTube, September 19. Accessed 2020-11-11.
- Artigue, Heidi and Gary Smith. 2019. "The principal problem with principal components regression." Cogent Mathematics & Statistics, vol. 6, no. 1. Accessed 2020-11-15.
- Bartocha, Kamil. 2014. "Linear Regression vs Logistic Regression vs Poisson Regression." MarketingDistillery, via SlideShare, November 23. Accessed 2020-11-11.
- Bhalla, Deepanshu. 2018. "15 Types of Regression in Data Science." Listen Data, March. Accessed 2020-11-11.
- Bock, Tim. 2020. "What is Linear Regression?" Blog, Display R. Accessed 2020-11-11.
- Bolker, Ben. 2018. "Generalized linear mixed models." Accessed 2020-11-12.
- Brannick, Michael T. 2020. "Logistic Regression." College of Arts & Sciences, Univ. of South Florida. Accessed 2020-11-14.
- Cho, Wanhyun, Myung Hwan Na, Yuha Park, Deok Hyeon Kim, and Yongbeen Cho. 2020. "Prediction of Weights during Growth Stages of Onion Using Agricultural Data Analysis Method." Applied Sciences, MDPI, 10(6), 2094, March 19. Accessed 2020-11-12.
- Ciaburro, Giuseppe. 2018. "R packages for regression." In: Regression Analysis with R, Packt Publishing Limited, January. Accessed 2020-11-11.
- Cox, D.R. 1972. "Regression Models and Life-Tables." Journal of the Royal Statistical Society, Series B (Methodological), vol. 34, no. 2, pp. 187-220. Accessed 2020-11-15.
- Cramer, J.S. 2002. "The Origins of Logistic Regression." Tinbergen Institute Discussion Paper, TI 2002-119/4, November. Accessed 2020-11-15.
- De'ath, Glenn. 2002. "Multivariate Regression Trees: a new technique for modeling species–environment relationships." Ecology, Ecological Society of America, vol. 83, no. 4, pp. 1105-1117. Accessed 2020-11-16.
- Dye, Steven. 2020. "Quantile Regression." Towards Data Science, February 13. Accessed 2020-11-12.
- Explorium. 2019. "The Complete Guide to Decision Trees." Blog, Explorium, December 10. Accessed 2020-11-16.
- Gardner, William, Edward Patrick Mulvey, and Esther C. Shaw. 1995. "Regression Analyses of Counts and Rates: Poisson, Overdispersed Poisson, and Negative Binomial Models." Psychological Bulletin, vol. 118, no. 3, pp. 392-404. Accessed 2020-11-11.
- Ghahramani, Zoubin. 2015. "Parametric vs Nonparametric Models." Part II of Bayesian Inference, The Machine Learning Summer School, Max Planck Institute for Intelligent Systems, Tübingen, Germany, July 13-24. Accessed 2020-11-11.
- Gillham, Nicholas W. 2009. "Cousins: Charles Darwin, Sir Francis Galton and the birth of eugenics." Royal Statistical Society, vol. 6, no. 3, pp. 132-135, September. Accessed 2020-11-15.
- Grace-Martin, Karen. 2008. "Regression Models for Count Data." The Analysis Factor, October 24. Updated 2018-05-02. Accessed 2020-11-11.
- Grace-Martin, Karen. 2009. "Multiple Regression Model: Univariate or Multivariate GLM?" The Analysis Factor, April 20. Updated 2018-05-02. Accessed 2020-11-11.
- Granville, Vincent. 2014. "10 types of regressions. Which one to use?" Blog, Data Science Central, July 21. Accessed 2020-11-11.
- Hoerl, Arthur E. and Robert W. Kennard. 1970. "Ridge Regression: Biased Estimation for Nonorthogonal Problems." Technometrics, vol. 12, no. 1, pp. 55-67, February. Accessed 2020-11-15.
- Kabacoff, Robert I. 2020. "Multiple (Linear) Regression." Quick-R, Datacamp. Accessed 2020-11-11.
- Kassambara, Alboukadel. 2018. "Penalized Regression Essentials: Ridge, Lasso & Elastic Net." STHDA, March 11. Accessed 2020-11-12.
- Kassambara, Alboukadel. 2018b. "Regression Model Accuracy Metrics: R-square, AIC, BIC, Cp and more." STHDA, March 11. Accessed 2020-11-14.
- Khurram, Tauqeer. 2020. "Different Types of Regression Analysis to Know." Tech Funnel, March 18. Accessed 2020-11-11.
- Koenker, Roger, and Gilbert Bassett. 1978. "Regression Quantiles." Econometrica, vol. 46, no. 1, pp. 33-50, January. Accessed 2020-11-15.
- Koenker, Roger and Kevin F. Hallock. 2001. "Quantile Regression." Journal of Economic Perspectives, vol. 15, no. 4, pp. 143-156. Accessed 2020-11-11.
- Kopf, Dan. 2015. "The Discovery of Statistical Regression." Priceonomics, November 6. Accessed 2020-11-14.
- Legendre, Pierre, and Louis Legendre. 2012. "Multivariate regression trees (MRT)." Sec. 8.11 in: Developments in Environmental Modelling, Elsevier, vol. 24, pp. 337-424. doi: 10.1016/B978-0-444-53868-0.50008-3. Accessed 2020-11-11.
- Liu, Ching-Ti, Jacqueline Milton, and Avery McIntosh. 2016. "Simple Linear Regression." Boston University School of Public Health, January 6. Accessed 2020-11-14.
- Long, Jacob. 2020. "Tools for summarizing and visualizing regression models." Vignettes, jtools, on CRAN, June 22. Accessed 2020-11-11.
- Mahmoud, Hamdy F. F. 2014. "Parametric versus Semi/nonparametric Regression Models." Laboratory for Interdisciplinary Statistical Analysis, Univ. of Colarado Boulder, July 23. Accessed 2020-11-11.
- Mahmoud, Hamdy F. F. 2014b. "Parametric versus Semi/nonparametric Regression Models." Laboratory for Interdisciplinary Statistical Analysis, Univ. of Colarado Boulder, July 23. Accessed 2020-11-12.
- Marquardt, Donald W. and Ron Snee. 1975. "Ridge Regression in Practice." The American Statistician, vol. 29, no. 1, February. Accessed 2020-11-11.
- NCSS. 2020a. "Chapter 565: Cox Regression." NCSS Statistical Software. Accessed 2020-11-15.
- NCSS. 2020b. "Chapter 335: Ridge Regression." NCSS Statistical Software. Accessed 2020-11-15.
- Nelder, J. A. and R. W. M. Wedderburn. 1972. "Generalized Linear Models." Journal of the Royal Statistical Society. Series A (General), vol. 135, no. 3, pp. 370-384. Accessed 2020-11-15.
- Owen, Art B. 2006. "A robust hybrid of lasso and ridge regression." Stanford University, October. Accessed 2020-11-11.
- PennState. 2020a. "Introduction to Generalized Linear Models." Sec. 6.1 in: STAT 504 Analysis of Discrete Data, The Pennsylvania State University. Accessed 2020-11-11.
- PennState. 2020b. "Example on Birth Weight and Smoking." Sec. 8.1 in: STAT 501 Regression Methods, The Pennsylvania State University. Accessed 2020-11-11.
- PennState. 2020c. "Logistic Regression." Sec. 15.1 in: STAT 501 Regression Methods, The Pennsylvania State University. Accessed 2020-11-14.
- Philosophy Terms. 2016. "Causality." Philosophy Terms, October 10. Updated 2018-10-25. Accessed 2020-11-12.
- Princeton University. 2020a. "Interpreting Regression Output." Data and Statistical Services, Princeton University Library, Princeton University. Accessed 2020-11-14.
- Rao, C. Radhakrishna. 1983. "Multivariate Analysis: Some Reminiscences on Its Origin and Development." Sankhyā: The Indian Journal of Statistics, Series B (1960-2002) 45, no. 2, pp. 284-99. Accessed 2020-11-15.
- STHDA. 2020. "Regression Analysis Essentials For Machine Learning." STHDA. Accessed 2020-11-11.
- Sagar, Chaitanya. 2017. "Building Regression Models in R using Support Vector Regression." KDNuggets, March. Accessed 2020-11-14.
- Sharareh, Parami , Tapak Leili, Moghimbeigi Abbas, Poorolajal Jalal, and Ghaleiha Ali. 2020. "Determining correlates of the average number of cigarette smoking among college students using count regression models." Scientific Reports, 10, Article number: 8874, June 1. Accessed 2020-11-11.
- Steorts, Rebecca C. 2017. "Tree Based Methods: Regression Trees." Chapter 8 ISL, STA 325, Duke University. Accessed 2020-11-11.
- Stoltzfus, Jill C. 2011. "Logistic Regression: A Brief Primer." Academic Emergency Medicine, 18:1099-1104. Accessed 2020-11-11.
- Tibshirani, Robert. 1996. "Regression shrinkage and selection via the lasso." J. Royal. Statist. Society, Series B (Methodological), vol. 58, no. 1, pp. 267-288. Accessed 2020-11-15.
- UCLA. 2020a. "Regression Models with Count Data." Statistical Consulting Group, UCLA. Accessed 2020-11-11.
- UCLA. 2020b. "Introduction to Generalized Linear Mixed Models." Statistical Consulting Group, UCLA. Accessed 2020-11-12.
- UCLA. 2020c. "Robust Regression." Stata Data Analysis Examples, UCLA. Accessed 2020-11-12.
- Wikipedia. 2020a. "Regression analysis." Wikipedia, October 20. Accessed 2020-11-11.
- Wikipedia. 2020b. "General linear model." Wikipedia, November 9. Accessed 2020-11-11.
- jvriesem. 2017. "When should linear regression be called “machine learning”?" CrossValidated, StackExchange, March 20. Accessed 2020-11-11.
- scikit-learn. 2020a. "API Reference." v0.23.2, scikit-learn, August. Accessed 2020-11-11.
- scikit-learn. 2020b. "sklearn.linear_model.HuberRegressor." v0.23.2, scikit-learn, August. Accessed 2020-11-11.
Further Reading
- Bhalla, Deepanshu. 2018. "15 Types of Regression in Data Science." Listen Data, March. Accessed 2020-11-11.
- Statistics Solutions. 2020. "Selection Process for Multiple Regression." Statistics Solutions, June 23. Accessed 2020-11-11.
- Princeton University. 2020b. "Introduction to Regression." Data and Statistical Services, Princeton University Library, Princeton University. Accessed 2020-11-11.
- scikit-learn. 2020c. "Support Vector Regression (SVR) using linear and non-linear kernels." v0.23.2, scikit-learn, August. Accessed 2020-11-11.
- Long, Jacob. 2020. "Tools for summarizing and visualizing regression models." Vignettes, jtools, on CRAN, June 22. Accessed 2020-11-11.
- Koenker, Roger and Kevin F. Hallock. 2001. "Quantile Regression." Journal of Economic Perspectives, vol. 15, no. 4, pp. 143-156. Accessed 2020-11-11.
See Also
- Regression Modelling
- Linear Regression
- Logistic Regression
- Stepwise Regression
- Support Vector Regression
- Generalized Linear Regression