# Naive Bayes Classifier

Naive Bayes is a probabilistic classifier that returns the probability of a test point belonging to a class rather than just the label of the test point. It's among the most basic Bayesian network models but, combined with kernel density estimation, it can attain high levels of accuracy. Unlike many other ML algorithms, which can typically perform regression as well as classification, this algorithm is applicable to classification tasks only.

The Naive Bayes algorithm is considered naive because the assumptions it makes are virtually impossible to find in real-life data. It uses conditional probability to calculate a product of individual feature probabilities: the algorithm assumes that the presence or absence of a particular feature is unrelated to the presence or absence of any other feature, given the class variable (conditional independence of features).

## Discussion

• Could you explain the Naive Bayes classifier with examples?

Consider two groups of insects, grasshoppers and katydids. By studying antenna lengths from many insect samples, we can discern patterns and compute probabilities. For example, given an antenna length of 3 cm, the insect is more likely to be a grasshopper than a katydid. The Naive Bayes classifier is a technique to perform such a classification. Antenna length is a feature used to classify an insect into one of the two classes.

Suppose the antenna length is 5 cm and probabilities computed from observed samples indicate that both classes are equally likely. In this case, classification can be improved by considering more features, such as abdomen length. The NB classifier assumes that features are independent of one another.
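
The intuition above can be sketched in a few lines of Python. The priors and per-feature likelihoods below are hypothetical, chosen only to show how the independence assumption lets us multiply per-feature probabilities:

```python
# Hypothetical probabilities estimated from insect samples.
prior = {"grasshopper": 0.5, "katydid": 0.5}
p_antenna_5cm = {"grasshopper": 0.3, "katydid": 0.3}  # ambiguous on its own
p_abdomen_8cm = {"grasshopper": 0.1, "katydid": 0.6}  # abdomen length breaks the tie

# Naive Bayes score: prior times the product of per-feature likelihoods.
scores = {c: prior[c] * p_antenna_5cm[c] * p_abdomen_8cm[c] for c in prior}
prediction = max(scores, key=scores.get)
print(prediction)  # katydid
```

With antenna length alone the classes tie; adding the second feature resolves the ambiguity.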

Consider the statement "Officer Drew arrested me." Is Drew male or female? We can answer this by gathering data on the officer: height, eye colour and long/short hair. Then we look up a police database of all officers and apply the NB classifier. This problem uses three independent features and two classes (male or female).

• What is Bayes' Theorem and how is it relevant to the NB classifier?

Bayes' theorem (aka Bayes' rule) is based on conditional probability: the probability of a particular outcome is conditioned on another event having occurred. Given two events A and B, Bayes' theorem states that,

$$P(A|B) = \frac{P(A⋂B)}{P(B)} = \frac{P(A) \cdot P(B|A)}{P(B)}$$

where $$P(A)$$ and $$P(B)$$, called marginal or prior probabilities, are the probabilities of events A and B occurring; $$P(A|B)$$, called the posterior probability, is the probability of event A occurring given that event B has occurred; $$P(B|A)$$, called the likelihood, is the probability of event B occurring given that event A has occurred; and $$P(A⋂B)$$ is the joint probability of both events occurring. $$P(A|B)$$ and $$P(B|A)$$ are also called conditional probabilities.

Suppose you have drawn a red card from a deck of playing cards. What's the probability that it's a four? We apply conditional probability. There are 26 possible red cards and two of them are fours. Thus, $$P(four|red)=2/26=1/13$$. Bayes' theorem allows us to reformulate the problem as follows:

$$P(four|red) = P(four) \cdot P(red|four) / P(red)\\= (4/52 \cdot 2/4) / (26/52)\\= 1/13$$
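
The card computation can be verified with exact arithmetic, for instance using Python's fractions module:

```python
from fractions import Fraction

# P(four | red) = P(four) * P(red | four) / P(red) on a 52-card deck
p_four = Fraction(4, 52)           # four fours in the deck
p_red_given_four = Fraction(2, 4)  # two of the four fours are red
p_red = Fraction(26, 52)           # half the deck is red

p_four_given_red = p_four * p_red_given_four / p_red
print(p_four_given_red)  # 1/13
```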

• What are the types of the NB classifier?

scikit-learn implements three Naive Bayes variants, each based on a different probability distribution: Bernoulli, multinomial, and Gaussian.

Bernoulli Naive Bayes

The predictors in this case are boolean variables, taking only two values such as True/False or Yes/No. This variant is used when the data follows a multivariate Bernoulli distribution.
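
A minimal sketch with scikit-learn's BernoulliNB on hypothetical boolean features (think presence/absence of three words in a message; the labels are made up):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Each row: presence (1) or absence (0) of three words; 1 = spam, 0 = not spam.
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 0, 1],
              [0, 1, 1]])
y = np.array([1, 1, 0, 0])

clf = BernoulliNB()
clf.fit(X, y)
print(clf.predict([[1, 0, 0]]))  # presence of word 0 suggests class 1
```

Note that BernoulliNB also penalizes the absence of a feature, unlike the multinomial variant.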

Multinomial Naive Bayes

Feature vectors represent the frequencies with which events are generated by a multinomial distribution. This is the event model most commonly used for document classification. For example, to decide whether a document belongs to the 'Legal' or 'Human Resources' category, this variant uses the frequencies of the words in the document as features.
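
For instance, a sketch with scikit-learn's MultinomialNB on hypothetical word counts for tiny 'Legal' and 'Human Resources' documents:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Columns: counts of the words ["contract", "court", "salary", "leave"].
X = np.array([[3, 2, 0, 0],   # Legal
              [2, 3, 0, 1],   # Legal
              [0, 0, 3, 2],   # HR
              [0, 1, 2, 3]])  # HR
y = np.array(["Legal", "Legal", "HR", "HR"])

clf = MultinomialNB()
clf.fit(X, y)
print(clf.predict([[1, 2, 0, 0]]))  # contract/court counts suggest Legal
```

In practice the count matrix would come from a vectorizer such as scikit-learn's CountVectorizer rather than being written by hand.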

Gaussian Naive Bayes

It is used for numerical/continuous features. The distribution of continuous values is assumed to be Gaussian, and the likelihood probabilities are therefore computed from a Gaussian distribution.
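
A minimal sketch with scikit-learn's GaussianNB, reusing the insect example with hypothetical antenna and abdomen lengths in cm:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Columns: antenna length, abdomen length (hypothetical measurements in cm).
X = np.array([[2.0, 4.0], [2.5, 4.5], [3.0, 5.0],   # grasshoppers
              [6.0, 8.0], [6.5, 8.5], [7.0, 9.0]])  # katydids
y = np.array(["grasshopper"] * 3 + ["katydid"] * 3)

clf = GaussianNB()
clf.fit(X, y)
print(clf.predict([[2.8, 4.8]]))  # near the grasshopper cluster
```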

• How would you use Naive Bayes classifier for categorical features?

The categorical distribution extends the Bernoulli distribution to a discrete variable with more than two possible outcomes, such as the roll of a die. In contrast to the multinomial distribution, which models multiple drawings, the categorical distribution gives the probabilities of the different outcomes for a single drawing.

The features should be encoded using a label encoding technique, assigning each category a unique number.

It is given by:

$$P(x_i = t | y = c; \alpha) = \frac{N_{tic} + \alpha}{N_c + \alpha n_i}$$

where $$N_{tic}$$ is the number of times category $$t$$ appears in the samples of feature $$x_i$$ that belong to class $$c$$; $$N_c$$ is the total number of samples with class $$c$$; $$\alpha$$ is the Laplace smoothing parameter used to handle the zero frequency problem; and $$n_i$$ is the number of available categories of feature $$x_i$$.
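
scikit-learn implements this as CategoricalNB. A small sketch on hypothetical label-encoded features:

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Label-encoded categorical features (hypothetical):
# column 0: outlook (0=sunny, 1=rainy); column 1: wind (0=weak, 1=strong)
X = np.array([[0, 0], [0, 1], [1, 1], [1, 1]])
y = np.array([1, 1, 0, 0])  # 1 = play, 0 = don't play

clf = CategoricalNB(alpha=1.0)  # alpha is the Laplace smoothing parameter
clf.fit(X, y)
print(clf.predict([[0, 0]]))
```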

• What is Laplace smoothing in the context of the NB classifier?

Laplace smoothing is a smoothing technique used in Naive Bayes to solve the zero probability problem. Consider text classification, where the aim is to determine whether a review is positive or negative. From the training data we build a likelihood table, whose values we use when classifying a review. But what if a word in the review was not present in the training dataset? For example, suppose a test review has three words, $$x_1 x_2 x'$$, where $$x_1$$ and $$x_2$$ are present in the training data but $$x'$$ is not. This is where Laplace smoothing comes into the picture:

$$P(x'|positive) = \frac{(\text{number of reviews with } x' \text{ and target outcome} = positive) + \alpha}{N + \alpha K}$$

K denotes the number of dimensions (features) in the data.

N is the number of reviews with the target outcome=positive.

α represents the smoothing parameter.
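
The formula can be written as a one-line helper; the counts below are hypothetical:

```python
def smoothed_likelihood(count, n, k, alpha=1.0):
    """Laplace-smoothed likelihood: (count + alpha) / (N + alpha * K)."""
    return (count + alpha) / (n + alpha * k)

# x' never appears in the 10 positive training reviews; 6 features (K = 6).
p_unseen = smoothed_likelihood(count=0, n=10, k=6)
print(p_unseen)  # 0.0625 rather than zero
```

Without smoothing the unseen word would zero out the whole product of likelihoods.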

• Can we use the NB classifier when features are not independent?

Feature importance is the process of evaluating features based on how effective they are in predicting the target variable. Naive Bayes classifiers do not provide an intrinsic technique for determining the importance of features. They predict the class with the highest probability by computing the conditional and unconditional probabilities associated with the features. As a result, no coefficients are generated or associated with the features used to train the model. However, there are post-hoc methods for analysing the model after it has been trained. One of these is Permutation Importance, which is neatly implemented in scikit-learn.

Permutation feature importance is a model inspection technique that can be used with any fitted estimator when the data is tabular. For a given dataset, the permutation importance function computes the feature importance of the estimator. The n_repeats option specifies how many times a feature is randomly shuffled before returning a sample of feature importances.
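
A sketch of permutation importance applied to a Gaussian NB model on synthetic data, where only the first feature actually determines the class:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.inspection import permutation_importance

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)  # class depends only on feature 0

clf = GaussianNB().fit(X, y)
result = permutation_importance(clf, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)  # feature 0 should dominate
```

Shuffling feature 0 destroys the model's accuracy, while shuffling the uninformative features barely changes it.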

• What are some applications of the NB classifier?

Text classification / Spam Filtering / Sentiment Analysis: Naive Bayes classifiers are commonly employed in text classification (owing to good results in multi-class problems and the independence assumption) and enjoy a greater success rate than many other techniques. As a result, they are widely used in spam filtering (identifying spam e-mail) and sentiment analysis (in social media analysis, to identify positive and negative customer sentiments).

Recommendation Systems: the Naive Bayes classifier combined with collaborative filtering builds a recommendation system that uses machine learning and data mining techniques to filter unseen information and predict whether a user would like a given resource.

Multi-class Prediction: this algorithm is also well-known for its multi-class prediction capability, estimating the probabilities of multiple target classes.

Real-time Prediction: Naive Bayes is a fast, eager-learning classifier, so it can be used to make predictions in real time.

• How is the NB classifier related to logistic regression?

Given input features $$X$$, both NB classifier and logistic regression predict an output class, that is, output $$Y$$ is categorical. Logistic regression directly estimates $$P(Y|X)$$ whereas NB classifier applies the Bayes theorem and estimates $$P(Y)$$ and $$P(X|Y)$$. As such, we call logistic regression a discriminative classifier and NB a generative classifier.

It's been observed that on small training datasets, NB classifier does better than logistic regression. If more training samples are available, logistic regression does better. While logistic regression has a lower asymptotic error, NB classifier may converge faster to its higher asymptotic error.

It's known that the Gaussian Naive Bayes (GNB) classifier is closely related to logistic regression. Parameters of one model can be expressed in terms of the other. Moreover, asymptotically both converge to the same classifier when GNB assumptions hold. When the assumptions don't hold, such as dependence among features, logistic regression does better because it adjusts its parameters to give a better fit.

• What are the advantages and disadvantages of the NB classifier?

Advantages: Naive Bayes is simple to implement. The conditional probabilities are easy to compute and can be determined directly, without iterations, so the approach is useful when training speed is critical. If the conditional independence assumption actually holds, the results can be very good. The algorithm also predicts classes faster than many other classification algorithms.

Disadvantages: the assumption of independent predictors is the main limitation of Naive Bayes. The algorithm implicitly assumes that all attributes are independent of one another, but in practice it is very hard to obtain a set of predictors that are totally independent. Also, if a categorical variable in the test dataset has a category that was not observed in the training dataset, the model will assign it zero probability and will be unable to make a prediction. This is commonly referred to as the zero frequency problem, and smoothing techniques such as Laplace estimation can remedy it.

## Milestones

1763

The Royal Society publishes a paper on probability by Thomas Bayes after his death in 1761. Titled Essay Towards Solving a Problem in the Doctrine of Chances, it details what would later become famous as Bayesian inference. The basic idea is to revise predictions in the light of new evidence. Decades later (early 19th century), Pierre-Simon Laplace develops and popularizes Bayesian probability.

1940

The Bayesian approach is applied during the Second World War and sees a revival in the years after it. Earlier, the Bayesian approach had been criticized, while the frequentist approach developed by R.A. Fisher had been favoured since the mid-1920s.

1960

Maron and Kuhns apply Bayes' Theorem to the task of Information Retrieval (IR). The probability of retrieving a relevant document given a query can be computed from the prior probability of document relevance and conditional probability of user making a particular query given the relevant document. Over the next forty years, Naive Bayes is the main technique in IR until machine learning techniques become popular.

1968

Hughes considers a two-class pattern recognition problem. The model considers $$n$$ discrete values that can be measured and $$m$$ sample patterns. He shows that for a given $$m$$, there's an optimal $$n$$ that minimizes the pattern recognition error. This is shown in the figure (right) for the case of equal class probabilities. The figure (left) also shows an example of $$n=5$$ in which values 1-3 imply class $$c_1$$ and values 4-5 imply class $$c_2$$.

1973

Duda and Hart use the Naive Bayes classifier in pattern recognition.

1992

Langley et al. present an analysis of Bayesian classifiers considering noisy classes and noise-free attributes. They find that the Naive Bayes classifier gives comparable results to the C4 algorithm that induces decision trees. They conclude that despite its simplicity, the Naive Bayes classifier deserves more research attention.

1997

Domingos and Pazzani show that even when attributes are not independent, the Bayesian classifier does well. It can be optimal under zero-one loss (misclassification rate). It's optimal under squared error loss only when the independence assumption holds.

1998

Kasif et al. propose a probabilistic framework for memory-based reasoning (MBR). Such a framework can be used for classification tasks. They note that a probabilistic graphical model is really another way of looking at the Naive Bayes classifier.

## Cite As

Devopedia. 2022. "Naive Bayes Classifier." Version 14, March 31. Accessed 2022-10-09. https://devopedia.org/naive-bayes-classifier