Logistic Regression
 Summary

Suppose we're asked to classify emails into two categories: spam or not spam. Compare this with another application that attempts to predict product sales given recent advertising expense. Unlike the second example, in which the target variable is continuous, email classification predicts a categorical variable.
Logistic regression is a statistical method used for classifying a target variable that is categorical in nature. It is an extension of a linear regression model. It uses a logistic function to estimate the probability of a target variable belonging to a particular class or category.
Discussion
Could you explain logistic regression with an example? Consider email classification as an example. To be able to predict if an email is spam or not, we will extract relevant information from the emails such as:
 Sender of the email
 Number of typos in the email
 Occurrence of words or phrases such as "offer", "prize", "free gift", etc.
The above information is converted into a vector of numerical features. These numerical features are linearly combined and then transformed using a logistic function to give a score in the range 0 to 1. This score is the probability that the email is spam. If the probability is higher than 50%, the email is classified as spam.
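This pipeline can be sketched in Python. The feature values and weights below are made up for illustration; a real model would learn its weights from labelled emails.

```python
import math

def predict_spam_probability(features, weights, bias):
    """Linearly combine numerical features, then squash with the logistic function."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1 / (1 + math.exp(-z))

# Hypothetical feature vector: [sender reputation, number of typos, count of spammy phrases]
features = [0.2, 7, 3]
weights = [-1.5, 0.4, 1.1]   # illustrative weights a trained model might produce
bias = -2.0

p = predict_spam_probability(features, weights, bias)
label = "spam" if p > 0.5 else "not spam"
print(p, label)
```

Only the thresholding at 0.5 turns the continuous score into a class; the model itself outputs a probability.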
What are different types of logistic regression? There are three types of logistic regression:
 Binary or binomial: where the dependent variable can have only two outcomes. Examples: spam/not spam, dead/alive, pass/fail.
 Multiclass or multinomial: where the dependent variable is classified into three or more categories and these categories are not ordered. Examples: types of cuisines (Italian, Mediterranean, Chinese).
 Ordinal: where the dependent variable is classified into three or more categories and these categories are ordered. Examples: movie rating (1-5).
Why can't I use linear regression for predicting classes? In classification problems, we are predicting the probability that the outcome variable belongs to a particular class. If linear regression is used for classification, it treats the classes or categories as numbers and fits the line that minimises the distance between the data points and the line. The scores that lie along this best-fit line cannot be interpreted as probabilities, so a meaningful threshold cannot be set to distinguish the classes.
Also, the linear regression model fits a straight line that can extrapolate: values can go out of range, below 0 or above 1 (−∞ to +∞). Since probability lies in a fixed range between 0 and 1, logistic regression applies a logistic function so that the dependent variable only takes values between 0 and 1.
What is the logistic function? The logistic function, also known as the sigmoid function, is an S-shaped curve that takes any real-valued number and transforms it into a number between 0 and 1 using the following equation:
$$f(x)= \frac{1}{1+e^{-x}}$$
As x approaches +∞, f(x) approaches 1, and as x approaches −∞, f(x) approaches 0:
$$f(x) = \frac{1}{1+e^{-\infty}} = \frac{1}{1+0} = 1 \quad \text{as } x \to +\infty$$
$$f(x) = \frac{1}{1+e^{+\infty}} = \frac{1}{\infty} = 0 \quad \text{as } x \to -\infty$$
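A quick numerical check of these limits, using Python's standard math module:

```python
import math

def sigmoid(x):
    """Logistic function: maps any real number into (0, 1)."""
    return 1 / (1 + math.exp(-x))

print(sigmoid(0))    # 0.5, the midpoint of the S-curve
print(sigmoid(10))   # ~0.99995, approaching 1 for large positive x
print(sigmoid(-10))  # ~0.0000454, approaching 0 for large negative x
```

The curve is symmetric about its midpoint: sigmoid(x) + sigmoid(−x) = 1 for any x.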
What are GLMs and how are they relevant to logistic regression? Generalized Linear Models (GLMs) are a class of regression models that generalize ordinary linear regression to cases where a linear model does not fit well. They're applicable when the outcome variable follows a distribution other than the normal, such as the binomial, exponential or Poisson distributions.
A GLM is represented by the following equation:
$$\large{g(E(y))=\beta_0+\beta_1{}x_{1}+\ldots+\beta_p{}x_{p}}$$
Where,
 \(E(y)\) is the mean value or the expected value of the outcome variable that follows an assumed distribution
 \(\beta_0+\beta_1{}x_{1}+\ldots+\beta_p{}x_{p}\) is the linear predictor, i.e. the weighted sum of features, where \(\beta_i\) are the weights and \(x_i\) the explanatory variables.
 \(g\) is the link function that mathematically links the expected value of the outcome variable and the linear predictor.
GLM is a generalised form of linear regression, and logistic regression is a specific type of GLM. For logistic regression, we can derive a specific link function \(g\) called the logit function.
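To make the link-function idea concrete, here is a small sketch showing that the logit link maps the expected value (a probability) back to the linear predictor; the value of z below is arbitrary:

```python
import math

def sigmoid(z):
    """Maps the linear predictor z to a probability, i.e. E(y) = p."""
    return 1 / (1 + math.exp(-z))

def logit(p):
    """The logit link function: g(p) = ln(p / (1 - p))."""
    return math.log(p / (1 - p))

z = 0.75          # an arbitrary value of the linear predictor b0 + b1*x1
p = sigmoid(z)    # expected value of the outcome
print(logit(p))   # recovers z (up to floating-point error)
```

In GLM terms, applying the link function to the expected value recovers the linear predictor: g(E(y)) = z.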
What is the logistic regression equation and the logit function? Let's start with the linear regression equation:
$$y=\beta_0+\beta_1{}x_{1}\qquad(1)$$
We derive the link function for logistic regression. In linear regression, \(y\) is a continuous variable. Since we want a probability for logistic regression, we will wrap the linear predictor in a logistic function so that the values do not go below 0 or beyond 1. We will denote this as probability with \(p\):
$$p=\frac{1}{1+e^{-(\beta_0+\beta_1{}x_{1})}}\qquad(2)$$
As an example, consider modelling the probability that a person, given their age, will die within the next five years. Changing \(\beta_0\) shifts the resulting S-curve along the x-axis, while changing \(\beta_1\) affects its steepness.
Using (1) we can rewrite (2) as:
$$p=\frac{1}{1+e^{-y}}=\frac{e^y}{1+ e^y}\qquad(3)$$
If \(p\) is the probability that an email is spam, then the probability of a nonspam email can be written as:
$$q=1-p=1-\frac{1}{1+e^{-y}}=\frac{1}{1+e^y}\qquad(4)$$
Dividing (3) by (4) we get,
$$\frac{p}{1-p}=e^y$$
Taking the natural logarithm on both sides and substituting the value of \(y\), we get the logistic regression equation:
$$\ln\left(\frac{p}{1-p}\right)=\beta_0+\beta_1{}x_{1}$$
\(p/(1-p)\) is the odds. \(\ln(p/(1-p))\) is the link function, called the logit function. The output values of this function are called logits.
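One practical consequence of this equation: because \(\ln(p/(1-p))\) is linear in the features, increasing a feature by one unit multiplies the odds by e raised to that feature's coefficient. A small sketch with made-up coefficients:

```python
import math

def probability(b0, b1, x):
    """Logistic regression prediction for a single feature x."""
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

def odds(p):
    """Odds in favour of the positive class: p / (1 - p)."""
    return p / (1 - p)

b0, b1 = -3.0, 0.8   # hypothetical fitted coefficients

# Increasing x by one unit multiplies the odds by e^b1, whatever the starting x.
ratio = odds(probability(b0, b1, 2)) / odds(probability(b0, b1, 1))
print(ratio)          # equals e^0.8
print(math.exp(b1))
```

This multiplicative-odds interpretation is why logistic regression coefficients are often reported as exponentiated values.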
What is the cost function for logistic regression? A cost function quantifies the error between the predicted value and the expected value. The weights of the features in the model are estimated by minimising this cost function.
The cost function used in logistic regression is known as Log Loss or the Negative Log-Likelihood (NLL). It is the negative average of the log of the correctly predicted probabilities for each instance in the training data.
$$-\frac{1}{N}\sum_{i=1}^N\left[y_i\cdot\ln(p(y_i))+(1-y_i)\cdot\ln(1-p(y_i))\right]$$
Where,
 \(N\) is the number of training samples
 \(y_i\) is the actual value of the i-th sample
 \(p(y_i)\) is the predicted probability of the i-th sample
We simplify this equation for the two possible outcomes for a single training sample:
 True output y=1 (positive): \(-(1\cdot\ln(p) + (1-1)\cdot\ln(1-p)) = -\ln(p)\)
 True output y=0 (negative): \(-(0\cdot\ln(p) + (1-0)\cdot\ln(1-p)) = -\ln(1-p)\)
Because the scale is logarithmic, the loss decreases slowly as the predicted probability gets closer to the true label, but increases rapidly as the predicted probability diverges from it. This has the effect of heavily penalising confident incorrect predictions.
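The log loss formula above can be implemented directly. The two prediction vectors below are made up to show how the loss penalises predictions that diverge from the true labels:

```python
import math

def log_loss(y_true, y_pred):
    """Negative average log-likelihood over all training samples."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

y_true = [1, 0, 1, 1]
confident = [0.9, 0.1, 0.8, 0.9]   # predictions close to the true labels
poor      = [0.6, 0.6, 0.4, 0.5]   # predictions far from the true labels

print(log_loss(y_true, confident))  # small loss
print(log_loss(y_true, poor))       # much larger loss
```

Note that a predicted probability of exactly 0 or 1 would make the logarithm blow up, which is why implementations usually clip predictions away from the extremes.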
Milestones
The logistic function is introduced in a series of three papers by Pierre François Verhulst between 1838 and 1847. He uses it as a model of population growth by adjusting the exponential growth model, under the guidance of Adolphe Quetelet.
The term regression is coined by Francis Galton to describe a biological phenomenon. He observes that the heights of descendants of tall ancestors tend to regress down towards a normal average, a phenomenon also known as regression toward the mean.
Wilson and Worcester use the logistic model in bioassay, the first known application of its kind.
Cox introduces the multinomial logit model, extending logistic regression to dependent variables with more than two categories.
Daniel McFadden links the multinomial logit to the theory of discrete choice, specifically Luce's choice axiom, showing that the multinomial logit follows from the assumption of independence of irrelevant alternatives and interpreting the odds of alternatives as relative preferences. This gives a theoretical foundation for logistic regression. In 2000, McFadden is awarded the Nobel Memorial Prize in Economic Sciences for this contribution.
Sample Code
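The article's original sample code is not available here. In its place is a minimal from-scratch sketch that fits a logistic regression by gradient descent on the log loss; the one-feature dataset is made up for illustration:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def train_logistic_regression(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi)))
            err = p - yi   # derivative of log loss w.r.t. the linear predictor
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def predict(w, b, xi):
    """Predicted probability that the sample belongs to class 1."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi)))

# Toy dataset: one feature, class 1 when the feature value is large
X = [[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]]
y = [0, 0, 0, 1, 1, 1]
w, b = train_logistic_regression(X, y)
print(predict(w, b, [0.8]))  # low probability
print(predict(w, b, [3.8]))  # high probability
```

In practice one would use a library implementation (e.g. scikit-learn's LogisticRegression), which adds regularisation and better optimisers, but the gradient step above is the core of the method.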
References
 Analytics Vidhya. 2015. "Simple Guide to Logistic Regression in R and Python." Blog, on Analytics Vidhya, November 1. Accessed 2021-01-15.
 Bock, Tim. 2018. "What is Linear Regression?" Blog, on Displayr, April 5. Updated 2020-12-09. Accessed 2021-01-22.
 Brownlee, Jason. 2016. "Logistic Regression for Machine Learning." Blog, on Machine Learning Mastery, April 1. Updated 2020-08-15. Accessed 2021-01-04.
 Goel, Aman. 2018. "4 Logistic Regressions Examples to Help You Understand." Post, on Magoosh, May 21. Accessed 2021-01-07.
 Grace-Martin, Karen. 2015. "What is a Logit Function and Why Use Logistic Regression?" The Analysis Factor, May 11. Updated 2018-12-14. Accessed 2021-01-22.
 HolyPython. 2020. "Logistic Regression History." Blog, on HolyPython, July 29. Accessed 2021-01-21.
 Jaiswal, Sonoo. 2021. "Linear Regression vs Logistic Regression." Tutorial, on Javatpoint. Accessed 2021-01-08.
 Krzyk, Kamil. 2018. "Coding Deep Learning for Beginners — Linear Regression (Part 2): Cost Function." Towards Data Science, on Medium, August 8. Accessed 2021-01-18.
 Lumen Learning. 2021. "Introduction to Logistic Regression." In: Introduction to Statistics, Lumen Learning. Accessed 2021-01-22.
 Mcdonald, Conor. 2018. "Log Loss: A short note." Blog, on Wordpress, March 3. Accessed 2021-01-19.
 Megha270396. 2020. "Binary Cross Entropy aka Log Loss - The cost function used in Logistic Regression." Blog, on Analytics Vidhya, November 9. Accessed 2021-01-18.
 Molnar, Christoph. 2021. "Interpretable Machine Learning." GitHub, January 4. Accessed 2021-01-04.
 Polamuri, Saimadhu. 2017. "How the Logistic Regression Model Works." Blog, on Dataspirant, March 2. Accessed 2021-01-04.
 Reddy, Sushmith. 2020. "Understanding the log loss function." Analytics Vidhya, on Medium, July 6. Accessed 2021-01-18.
 Sheldon, Kerby. 2019. "Generalized Linear Models." Notes, Department of Statistics, University of Michigan, December 9. Accessed 2021-01-15.
 Swaminathan, Saishruthi. 2018. "Logistic Regression — Detailed Overview." Towards Data Science, on Medium, March 18. Accessed 2021-01-06.
 van den Berg, Ruben Geert. 2020. "Logistic Regression – Simple Introduction." SPSS Tutorials. Accessed 2021-01-22.
 Waseem, Mohammad. 2020. "How To Implement Classification In Machine Learning?" Blog, Edureka.co, July 21. Accessed 2021-01-05.
 Wikipedia. 2020a. "Logistic Regression." Wikipedia, December 18. Accessed 2021-01-04.
 Wikipedia. 2020b. "Regression analysis." Wikipedia, December 21. Accessed 2021-01-21.
 Wikipedia. 2021c. "Logistic Function." Wikipedia, January 10. Accessed 2021-01-21.
Further Reading
 Brooks-Bartlett, Jonny. 2018. "Probability concepts explained: Maximum likelihood estimation." Towards Data Science, on Medium, January 3. Accessed 2021-01-18.
 Agarwal, Rahul. 2019. "The 5 Classification Evaluation Metrics Every Data Scientist Must Know." Blog, on KDnuggets, October. Accessed 2021-01-18.
 Ray, Sunil. 2017. "Commonly used Machine Learning Algorithms (with Python and R Codes)." Blog, on Analytics Vidhya, September 7. Accessed 2021-01-18.
See Also
 Multinomial Logistic Regression
 Linear Regression
 Types of Regression
 Regression Modelling
 Statistical Classification
 Machine Learning Model