Logistic Regression

Article Info

Contributed by
2 authors

Last updated on
2022-01-19 15:36:26

Logistic regression model. Source: Polamuri 2017.

Suppose we're asked to classify emails into two categories: spam or not spam. Compare this with another application that attempts to predict product sales given recent advertising expense. Unlike the second example in which the target variable is continuous, email classification predicts a categorical variable.

Logistic regression is a statistical method used for classifying a target variable that is categorical in nature. It is an extension of a linear regression model. It uses a logistic function to estimate the probability of a target variable belonging to a particular class or category.

Discussion

Could you explain logistic regression with an example?
Logistic regression is used for email classification. Source: Waseem 2020.
Consider email classification as an example. To be able to predict if an email is spam or not, we will extract relevant information from the emails such as:
- Sender of the email
- Number of typos in the email
- Occurrence of words or phrases such as "offer", "prize", "free gift", etc.
The above information is converted into a vector of numerical features. These numerical features are linearly combined and then transformed using a logistic function to give a score in the range 0 to 1. This score is the estimated probability that the email is spam. If the probability is higher than 50%, the email is classified as spam; otherwise it is classified as not spam.
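The steps above can be sketched in a few lines of Python. The feature values, weights and bias below are made-up numbers chosen only for illustration; a real model would learn the weights from labelled emails.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def spam_probability(features, weights, bias):
    """Linearly combine the features, then squash the score with the logistic function."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical features: [sender is unknown (0/1), number of typos, count of spam phrases]
features = [1, 4, 3]
# Hypothetical weights and bias (assumptions, not learned values)
weights = [0.8, 0.3, 0.9]
bias = -3.0

p = spam_probability(features, weights, bias)
label = "spam" if p > 0.5 else "not spam"
print(f"P(spam) = {p:.3f} -> {label}")
```

With these numbers the linear score is positive, so the probability is above 0.5 and the email is labelled as spam.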
What are different types of logistic regression?
There are three types of logistic regression:
- Binary or binomial: where the dependent variable can have only two outcomes. Examples: spam/not-spam, dead/alive, pass/fail.
- Multiclass or multinomial: where the dependent variable is classified into three or more categories and these categories are not ordered. Examples: types of cuisines (Italian, Mediterranean, Chinese).
- Ordinal: where the dependent variable is classified into three or more categories and these categories are ordered. Examples: movie rating (1-5).
Why can't I use linear regression for predicting classes?
Linear Regression vs Logistic Regression. Source: Jaiswal 2021.
In classification problems, we are predicting the probability that the outcome variable belongs to a particular class. If linear regression is used for classification, it will treat the classes or categories as numbers. It will fit the best line that minimises the distance between the data points and the line. The linear regression equation would just give scores that lie along the best fit line. These scores cannot be interpreted as probabilities. A meaningful threshold cannot be set to distinguish the classes.
Also, a linear regression model fits a straight line that extrapolates without bound: its output can be any value from -∞ to ∞, including values below 0 or above 1. Since a probability must lie between 0 and 1, logistic regression applies a logistic function so that the predicted value always stays between 0 and 1.
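A small sketch of the out-of-range problem, assuming a made-up dataset of hours studied against a pass/fail label. It fits a least-squares line by hand and then extrapolates:

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Toy data (made up): hours studied vs. pass (1) / fail (0)
hours = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
passed = [0, 0, 0, 0, 1, 1, 1, 1]

slope, intercept = fit_line(hours, passed)

# Extrapolating beyond the observed range gives values that are not valid probabilities
low = slope * 0.0 + intercept
high = slope * 12.0 + intercept
print(f"prediction at 0 hours:  {low:.2f}")
print(f"prediction at 12 hours: {high:.2f}")
```

The first prediction falls below 0 and the second rises above 1, so neither can be read as a probability.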
What is the logistic function?
Logistic function. Source: Molnar 2021.
The logistic function, also known as the sigmoid function, is an S-shaped curve that takes any real-valued number and maps it to a number between 0 and 1 using the following equation:
$$f(x)= \frac{1}{1+e^{-x}}$$
In the above image, as $x$ approaches ∞, $f(x)$ approaches 1, and as $x$ approaches -∞, $f(x)$ approaches 0:
$$\lim_{x\to\infty} f(x) = \frac{1}{1+0} = 1, \qquad \text{since } e^{-x}\to 0$$
$$\lim_{x\to-\infty} f(x) = 0, \qquad \text{since } e^{-x}\to\infty \text{ and } \frac{1}{1+e^{-x}}\to 0$$
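The limiting behaviour can be checked numerically. The sketch below is one possible implementation; the branch for negative inputs uses the equivalent form $e^x/(1+e^x)$ to avoid overflow, which is an implementation choice and not something the sources prescribe.

```python
import math

def sigmoid(x):
    """Logistic function f(x) = 1 / (1 + e^(-x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, exp(-x) can overflow, so use the equivalent form e^x / (1 + e^x)
    ex = math.exp(x)
    return ex / (1.0 + ex)

for x in (-10, -2, 0, 2, 10):
    print(f"f({x:>3}) = {sigmoid(x):.5f}")
```

The printed values climb from near 0 to near 1 and pass through exactly 0.5 at $x=0$.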
What are GLMs and how are they relevant to logistic regression?
Generalized Linear Models (GLMs) are a class of regression models that extend linear regression to cases where an ordinary linear model does not fit well. They're applicable when the outcome variable follows a non-normal distribution such as binomial, exponential, Poisson, etc.
A GLM is represented by the following equation:
$$\large{g(E(y))=\beta_0+\beta_1{}x_{1}+\ldots+\beta_p{}x_{p}}$$
Where,
- $E(y)$ is the mean value or the expected value of the outcome variable that follows an assumed distribution
- $\beta_0+\beta_1{}x_{1}+\ldots+\beta_p{}x_{p}$ is the linear predictor, i.e. the weighted sum of features, where each $\beta$ is a weight and each $x$ is an explanatory variable.
- $g$ is the link function that mathematically links the expected value of the outcome variable and the linear predictor.
GLM is a generalised form of linear regression and logistic regression is a specific type of GLM. For logistic regression, we can derive a specific link function $g$ called the logit function.
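As an illustration, these are the commonly used (canonical) link functions for three familiar outcome distributions, writing $\mu = E(y)$. This summary comes from standard GLM theory rather than from the sources cited in this article:

$$\begin{aligned}
\text{Normal (linear regression):}\quad & g(\mu)=\mu \\
\text{Poisson (count data):}\quad & g(\mu)=\ln(\mu) \\
\text{Binomial (logistic regression):}\quad & g(\mu)=\ln\left(\frac{\mu}{1-\mu}\right)
\end{aligned}$$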
What is the logistic regression equation and the logit function?
Effect of coefficients on the logistic function. Source: van den Berg 2020.
Let's start with the linear regression equation:
$$y=\beta_0+\beta_1{}x_{1}\qquad(1)$$
We derive the link function for logistic regression. In linear regression, $y$ is a continuous variable. Since we want a probability for logistic regression, we will wrap the linear predictor in a logistic function so that the values do not go below 0 or beyond 1. We will denote this as probability with $p$:
$$p=\frac{1}{1+e^{-(\beta_0+\beta_1{}x_{1})}}\qquad(2)$$
The figure shows the probability that a person of a given age will die within the next five years. We note that changing $\beta_0$ shifts the curve while changing $\beta_1$ affects steepness.
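The effect of the two coefficients can be sketched numerically. The coefficient values below are hypothetical and are not taken from the figure. The sketch relies on two facts that follow from equation (2): $p=0.5$ where $\beta_0+\beta_1 x_1=0$, and the slope of the curve at that point is $\beta_1/4$.

```python
import math

def probability(x, beta0, beta1):
    """Equation (2): p = 1 / (1 + e^-(beta0 + beta1 * x))."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

def midpoint(beta0, beta1):
    """Value of x at which p = 0.5, i.e. where beta0 + beta1 * x = 0."""
    return -beta0 / beta1

# Hypothetical coefficients, chosen only for illustration
base = (-6.0, 0.1)
shifted = (-8.0, 0.1)    # smaller beta0: same shape, curve moves to the right
steeper = (-12.0, 0.2)   # larger beta1 (beta0 scaled to keep the midpoint): sharper rise

for name, (b0, b1) in [("base", base), ("shifted", shifted), ("steeper", steeper)]:
    # The slope of the curve at its midpoint is beta1 / 4
    print(f"{name:8s} midpoint x = {midpoint(b0, b1):5.1f}, slope at midpoint = {b1 / 4:.3f}")
```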
Using (1) we can rewrite (2) as:
$$p=\frac{1}{1+e^{-y}}=\frac{e^y}{1+ e^y}\qquad(3)$$
If $p$ is the probability that an email is spam, then the probability of a non-spam email can be written as:
$$q=1-p=1-\frac{1}{1+e^{-y}}=\frac{1}{1+e^y}\qquad(4)$$
Dividing (3) by (4) we get,
$$\frac{p}{1-p}=e^y$$
Taking the natural logarithm of both sides and substituting the value of $y$ from (1), we get the logistic regression equation:
$$\ln\left(\frac{p}{1-p}\right)=\beta_0+\beta_1{}x_{1}$$
$p/(1-p)$ is the odds of the event. $\ln(p/(1-p))$, the log of the odds, is the link function or logit function. The output values from this function are called logits.
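A short sketch of these quantities in Python. It converts probabilities to odds and logits, and checks that raising $x_1$ by one unit multiplies the odds by $e^{\beta_1}$. The coefficients are hypothetical values used only for illustration.

```python
import math

def odds(p):
    """Odds of an event with probability p."""
    return p / (1.0 - p)

def logit(p):
    """Log-odds: the link function ln(p / (1 - p))."""
    return math.log(odds(p))

def inverse_logit(z):
    """Logistic function, which maps a logit back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

for p in (0.1, 0.5, 0.8):
    print(f"p = {p:.1f}  odds = {odds(p):.3f}  logit = {logit(p):+.3f}")

# Hypothetical coefficients: raising x1 by one unit adds beta1 to the logit,
# which multiplies the odds by e^beta1
beta0, beta1 = -1.0, 0.7
odds_at_2 = odds(inverse_logit(beta0 + beta1 * 2))
odds_at_3 = odds(inverse_logit(beta0 + beta1 * 3))
print(f"odds multiplier per unit of x1: {odds_at_3 / odds_at_2:.4f}")
print(f"e^beta1:                        {math.exp(beta1):.4f}")
```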
What is the cost function for logistic regression?
Log loss curve. Source: Mcdonald 2018.
A cost function quantifies the error between the predicted value and the expected value. The weights of the features in the model are estimated by minimising this cost function.
The cost function used in logistic regression is known as Log Loss or Negative Log-Likelihood (NLL). It is the negative average, over all instances in the training data, of the log of the probability that the model assigns to the true class.
$$-\frac{1}{N}\sum_{i=1}^N\left[y_i\cdot\ln(p(y_i))+(1-y_i)\cdot\ln(1-p(y_i))\right]$$
Where,
- $N$ is the number of training samples
- $y_i$ is the actual label (0 or 1) of the $i$-th sample
- $p(y_i)$ is the predicted probability that the $i$-th sample belongs to class 1
We simplify this equation for the two possible outcomes for a single training sample:
- True output y=1 (positive): $-(1\cdot\ln(p) + (1-1)\cdot\ln(1-p)) = -\ln(p)$
- True output y=0 (negative): $-(0\cdot\ln(p) + (1-0)\cdot\ln(1-p)) = -\ln(1-p)$
Also, in the above graph we can see that, because the loss is logarithmic, it decreases slowly as the predicted probability gets closer to the true label. But as the predicted probability diverges from the true label, the loss increases rapidly. This has the effect of heavily penalising confident but incorrect predictions.
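The log loss formula can be sketched directly in Python. The labels and probabilities below are made up, and clipping the probabilities with a small `eps` is an added safeguard (an assumption, not part of the formula) so that the logarithm stays finite.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Average negative log-likelihood for binary labels (0 or 1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Clip p away from exactly 0 and 1 so that log() stays finite
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

# Made-up labels and predicted probabilities of class 1
y_true = [1, 0, 1, 0]
confident_and_right = [0.9, 0.1, 0.8, 0.2]
confident_and_wrong = [0.1, 0.9, 0.2, 0.8]

print(f"good predictions: {log_loss(y_true, confident_and_right):.4f}")
print(f"bad predictions:  {log_loss(y_true, confident_and_wrong):.4f}")

# Loss for a single positive sample (y = 1) as the predicted probability drops
for p in (0.9, 0.5, 0.1, 0.01):
    print(f"p = {p:<4}  loss = {log_loss([1], [p]):.4f}")
```

The loss for a positive sample grows slowly between p = 0.9 and p = 0.5, then much faster as p approaches 0.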

Milestones

1838

The logistic function is introduced in a series of three papers by Pierre François Verhulst between 1838 and 1847. He uses it as a model of population growth by adjusting the exponential growth model, under the guidance of Adolphe Quetelet.

1889

The term regression is coined by Francis Galton to describe a biological phenomenon. He observes that the heights of descendants of tall ancestors tend to regress down towards a normal average, a phenomenon also known as regression toward the mean.

1943

Wilson and Worcester use the logistic model in bioassay, the first known application of its kind.

1966

Cox introduces the multinomial logit model, which extends logistic regression to outcomes with more than two categories.

1973

Daniel McFadden links the multinomial logit to the theory of discrete choice, specifically Luce's choice axiom, showing that the multinomial logit follows from the assumption of independence of irrelevant alternatives and interpreting odds of alternatives as relative preferences. This gives a theoretical foundation for logistic regression. In 2000, McFadden is awarded the Nobel Prize for this contribution.

Sample Code

# Source: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
# Accessed 2021-01-22
 
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
 
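X, y = load_iris(return_X_y=True)
# Fit a logistic regression classifier on the full Iris dataset
clf = LogisticRegression(random_state=0).fit(X, y)
clf.predict(X[:2, :])        # predicted class labels for the first two samples
clf.predict_proba(X[:2, :])  # class probabilities for the first two samples
clf.score(X, y)              # mean accuracy on the training data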

# Source: https://towardsdatascience.com/logistic-regression-on-mnist-with-pytorch-b048327f8d19
# Accessed 2021-01-22
 
# Logistic Regression on MNIST with PyTorch
 
import torch
import torchvision.transforms as transforms
import torchvision.datasets as dsets

# Hyperparameters (defined before the data loaders that use batch_size)
batch_size = 100
n_iters = 3000
input_dim = 784   # 28 x 28 pixels, flattened
output_dim = 10   # digits 0 to 9
lr_rate = 0.001

# download=True fetches MNIST into ./data if it is not already there
train_dataset = dsets.MNIST(root='./data', train=True,
                    transform=transforms.ToTensor(), download=True)
test_dataset = dsets.MNIST(root='./data', train=False,
                    transform=transforms.ToTensor(), download=True)

train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                    batch_size=batch_size, shuffle=True)
test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                    batch_size=batch_size, shuffle=False)

# Number of passes over the training set needed for n_iters weight updates
epochs = n_iters // (len(train_dataset) // batch_size)

class LogisticRegression(torch.nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        self.linear = torch.nn.Linear(input_dim, output_dim)

    def forward(self, x):
        # Return raw logits; CrossEntropyLoss applies log-softmax internally
        return self.linear(x)

model = LogisticRegression(input_dim, output_dim)
criterion = torch.nn.CrossEntropyLoss()  # computes log-softmax and then the negative log-likelihood
optimizer = torch.optim.SGD(model.parameters(), lr=lr_rate)

iteration = 0
for epoch in range(epochs):
    for images, labels in train_loader:
        images = images.view(-1, 28 * 28)  # flatten each image to a 784-vector

        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        iteration += 1
        if iteration % 500 == 0:
            # Evaluate accuracy on the test set without tracking gradients
            correct = 0
            total = 0
            with torch.no_grad():
                for test_images, test_labels in test_loader:
                    test_images = test_images.view(-1, 28 * 28)
                    test_outputs = model(test_images)
                    _, predicted = torch.max(test_outputs, 1)
                    total += test_labels.size(0)
                    correct += (predicted == test_labels).sum().item()
            accuracy = 100 * correct / total
            print("Iteration: {}. Loss: {}. Accuracy: {}.".format(iteration, loss.item(), accuracy))

# Source: https://stats.idre.ucla.edu/r/dae/logit-regression/
# Accessed 2021-01-22
 
library(aod)  # provides wald.test(), used below

mydata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
mydata$rank <- factor(mydata$rank)
mylogit <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial")
 
summary(mylogit)
 
# CIs using profiled log-likelihood
confint(mylogit)
 
# CIs using standard errors
confint.default(mylogit)
 
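# Wald test for the overall effect of rank (coefficients 4 to 6)
wald.test(b = coef(mylogit), Sigma = vcov(mylogit), Terms = 4:6)
# Wald test that the coefficients for rank 2 and rank 3 are equal
l <- cbind(0, 0, 0, 1, -1, 0)
wald.test(b = coef(mylogit), Sigma = vcov(mylogit), L = l)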
 
# odds ratios only
exp(coef(mylogit))
 
# odds ratios and 95% CI
exp(cbind(OR = coef(mylogit), confint(mylogit)))
 
# prediction on new data
newdata1 <- with(mydata, data.frame(gre = mean(gre), gpa = mean(gpa), rank = factor(1:4)))
newdata1$rankP <- predict(mylogit, newdata = newdata1, type = "response")