Naive Bayes Classifier
 Summary

Naive Bayes is a probabilistic classifier: it returns the probability that a test point belongs to each class rather than only a label. It's among the simplest Bayesian network models, but when combined with kernel density estimation it can attain higher accuracy. The algorithm applies to classification tasks only, unlike many other ML algorithms that can perform both regression and classification.
The algorithm is called naive because the assumption it makes rarely holds in real-life data. It uses conditional probability, computing the probability of a class as a product of the individual probabilities of the features. This means the algorithm assumes that the presence or absence of a specific feature is unrelated to the presence or absence of any other feature, given the class variable (conditional independence of features).
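The product of individual probabilities can be written out explicitly. As a sketch, with \(y\) denoting the class and \(x_1, \dots, x_n\) the features (symbols chosen here for illustration), the independence assumption lets the posterior factorize, and the predicted class \(\hat{y}\) is the one with the largest numerator:

$$P(y \mid x_1, \dots, x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \dots, x_n)}, \qquad \hat{y} = \arg\max_{y} P(y) \prod_{i=1}^{n} P(x_i \mid y)$$

The denominator is the same for every class, so it can be ignored when only the most probable class is needed.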
Discussion
Could you explain the Naive Bayes classifier with examples? Consider two groups of insects, grasshoppers and katydids. By studying the antenna lengths of many insect samples, we can discern some patterns and compute probabilities. For example, given an antenna length of 3 cm, the insect is more likely to be a grasshopper than a katydid. The Naive Bayes classifier is a technique to perform such a classification. Antenna length is a feature that's used to classify an insect into one of two classes.
Suppose the antenna length is 5 cm. Probabilities computed from observed samples tell us that both classes are equally likely. In this case, classification can be improved by considering more features such as abdomen length. The NB classifier assumes that features are independent of one another.
Consider the statement "Officer Drew arrested me." Is Drew male or female? We can answer this by gathering data on the officer: height, eye colour and long/short hair. Then we look up a police database of all officers and apply the NB classifier. This problem uses three features, assumed to be independent, and two classes (male or female).
What is Bayes' Theorem and how is it relevant to the NB classifier? Bayes' theorem (aka Bayes' rule) works on conditional probability. In conditional probability, the occurrence of a particular outcome is conditioned on the outcome of another event occurring. Given two events A and B, Bayes' theorem states that,
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{P(A) \cdot P(B \mid A)}{P(B)}$$
where \(P(A)\) and \(P(B)\), called marginal or prior probabilities, are the probabilities of events A and B occurring; \(P(A \mid B)\), called the posterior probability, is the probability of event A occurring given that event B has occurred; \(P(B \mid A)\), called the likelihood, is the probability of event B occurring given that event A has occurred; and \(P(A \cap B)\) is the joint probability of both events occurring. \(P(A \mid B)\) and \(P(B \mid A)\) are also called conditional probabilities.
Suppose you have drawn a red card from a deck of playing cards. What's the probability that it's a four? We apply conditional probability. There are 26 possible red cards and two of them are fours. Thus, \(P(\text{four} \mid \text{red}) = 2/26 = 1/13\). Bayes' Theorem allows us to reformulate the problem as follows:
$$P(\text{four} \mid \text{red}) = \frac{P(\text{four}) \cdot P(\text{red} \mid \text{four})}{P(\text{red})}\\= \frac{(4/52) \cdot (2/4)}{26/52}\\= \frac{1}{13}$$
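The card calculation can be checked by brute-force counting. The following is a minimal sketch in Python; the helper `prob` and the deck encoding are illustrative choices, not part of any library.

```python
from fractions import Fraction
from itertools import product

# Build a standard 52-card deck as (rank, suit) pairs.
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))


def prob(event, given=lambda card: True):
    """P(event | given), computed by counting cards in the deck."""
    pool = [card for card in deck if given(card)]
    hits = [card for card in pool if event(card)]
    return Fraction(len(hits), len(pool))


def is_four(card):
    return card[0] == "4"


def is_red(card):
    return card[1] in ("hearts", "diamonds")


# Direct conditional probability: count fours among the red cards.
direct = prob(is_four, given=is_red)

# Bayes' theorem: P(four | red) = P(four) * P(red | four) / P(red).
via_bayes = prob(is_four) * prob(is_red, given=is_four) / prob(is_red)

print(direct, via_bayes)  # 1/13 1/13
```

Both routes give the same answer, which is the point of the theorem: the posterior can be obtained from the prior and the likelihood when counting directly is not possible.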
What are the types of the NB classifier? Three commonly used variants, all implemented in scikit-learn, are based on different probability distributions: Bernoulli, multinomial, and Gaussian.
Bernoulli Naive Bayes
The predictors in this case are boolean variables, so the only possible values are 'True' and 'False' (or equivalently 'Yes' and 'No'). We use this variant when the data follows a multivariate Bernoulli distribution.
Multinomial Naive Bayes
Feature vectors represent the frequencies with which particular events were generated by a multinomial distribution. This is the event model most commonly used for document classification. For example, to decide whether a document belongs to the 'Legal' or the 'Human Resources' category, this variant uses the frequencies of the words present in the document as features.
Gaussian Naive Bayes
It is used for numerical (continuous) features. The continuous values are assumed to follow a Gaussian distribution within each class, so the likelihoods are computed from the Gaussian density.
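To make the Gaussian variant concrete, here is a minimal from-scratch sketch in Python. It is not the scikit-learn implementation; the function names and the insect measurements (antenna and abdomen length in cm) are made up for illustration, echoing the grasshopper/katydid example.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate the prior and the per-feature mean and variance of each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9  # small floor avoids a zero variance
            for col, m in zip(zip(*rows), means)
        ]
        model[label] = (n / len(X), means, variances)
    return model


def log_gaussian(x, mean, var):
    """Log of the Gaussian density with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def predict(model, row):
    """Return the class with the highest log prior plus summed log likelihoods."""
    scores = {}
    for label, (prior, means, variances) in model.items():
        scores[label] = math.log(prior) + sum(
            log_gaussian(x, m, v) for x, m, v in zip(row, means, variances)
        )
    return max(scores, key=scores.get)


# Hypothetical training data: [antenna length, abdomen length] in cm.
X = [[2.0, 5.5], [3.0, 6.0], [2.5, 5.0], [3.5, 6.5],
     [7.0, 3.0], [8.0, 2.5], [6.5, 3.5], [7.5, 2.0]]
y = ["grasshopper"] * 4 + ["katydid"] * 4

model = fit_gaussian_nb(X, y)
print(predict(model, [3.0, 5.8]))  # grasshopper
print(predict(model, [7.2, 2.8]))  # katydid
```

Working with log probabilities turns the product of likelihoods into a sum, which avoids numerical underflow when there are many features.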
How would you use Naive Bayes classifier for categorical features? The categorical distribution is an extension of the Bernoulli distribution to a discrete variable with more than two possible outcomes, such as the roll of a die. Unlike the multinomial distribution, which describes the outcomes of multiple draws, the categorical distribution gives the probabilities of the possible outcomes of a single draw.
The features should be encoded using label encoding, so that each category is assigned a unique number.
The probability of category \(t\) in feature \(i\) given class \(c\) is estimated as:
\(P(x_i = t \mid y = c; \alpha) = \frac{N_{tic} + \alpha}{N_c + \alpha n_i}\)
\(N_{tic}\) = Number of times category \(t\) appears in feature \(i\) among the samples that belong to class \(c\)
\(N_c\) = Total number of samples with class \(c\)
\(\alpha\) = Laplace smoothing parameter used to handle the zero-frequency problem
\(n_i\) = Number of available categories of feature \(i\)
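The estimate above amounts to counting. Here is a minimal sketch in Python for a single label-encoded feature; the function name and the data (eye colour encoded as 0, 1, 2, with classes "m" and "f") are hypothetical and only loosely follow the Officer Drew example.

```python
from fractions import Fraction


def categorical_likelihood(feature_values, labels, t, c, n_categories, alpha=1):
    """Smoothed estimate of P(x_i = t | y = c) for one categorical feature."""
    # N_c: number of samples with class c.
    n_c = sum(1 for label in labels if label == c)
    # N_tic: number of samples of class c in which the feature takes category t.
    n_tic = sum(
        1 for value, label in zip(feature_values, labels) if value == t and label == c
    )
    return Fraction(n_tic + alpha, n_c + alpha * n_categories)


# Hypothetical label-encoded eye colour: 0 = blue, 1 = brown, 2 = green.
eye = [0, 1, 1, 2, 1, 0, 1, 1]
sex = ["m", "m", "m", "m", "f", "f", "f", "f"]

print(categorical_likelihood(eye, sex, t=1, c="m", n_categories=3))           # 3/7
print(categorical_likelihood(eye, sex, t=2, c="f", n_categories=3))           # 1/7
print(categorical_likelihood(eye, sex, t=2, c="f", n_categories=3, alpha=0))  # 0
```

The last call shows why smoothing matters: green eyes never occur for class "f" in this toy data, so without smoothing the likelihood is exactly zero.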
What is Laplace smoothing in the context of the NB classifier? Laplace smoothing is a technique used in Naive Bayes to solve the problem of zero probability. Consider text classification, where the aim is to determine whether a review is positive or negative. Based on the training data, we create a likelihood table. We use the likelihood table values when querying a review, but what if a word in the review was not present in the training dataset? For example, suppose a test query has the form: Query review = x1 x2 x'.
Let the test sample have three words, where x1 and x2 are present in the training data but x' is not. The likelihood of x' would then be zero, which makes the whole product of likelihoods zero. Laplace smoothing avoids this by adding a small count to every estimate:
\(P(x' \mid \text{positive}) = \frac{(\text{number of reviews with } x' \text{ and target outcome positive}) + \alpha}{N + \alpha k}\)
\(k\) denotes the number of dimensions (features) in the data.
\(N\) is the number of reviews with the target outcome positive.
\(\alpha\) represents the smoothing parameter.
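The effect of smoothing on an unseen word can be sketched in a few lines of Python. The reviews below are made up, and taking \(k\) to be the vocabulary size is an assumption made here for illustration; the function name is likewise hypothetical.

```python
import math
from fractions import Fraction


def word_likelihood(word, class_reviews, k, alpha=1):
    """(number of class reviews containing word + alpha) / (N + alpha * k)."""
    n = len(class_reviews)
    count = sum(1 for review in class_reviews if word in review)
    return Fraction(count + alpha, n + alpha * k)


# Hypothetical positive reviews, each reduced to its set of words.
positive_reviews = [{"great", "food"}, {"great", "service"}, {"good", "food"}]
k = 6  # assumed number of features: the size of a six-word training vocabulary

query = ["great", "food", "tasty"]  # "tasty" never appears in the training data

unsmoothed = math.prod(word_likelihood(w, positive_reviews, k, alpha=0) for w in query)
smoothed = math.prod(word_likelihood(w, positive_reviews, k, alpha=1) for w in query)

print(unsmoothed)  # 0
print(smoothed)    # 1/81
```

With \(\alpha = 0\) the single unseen word wipes out the evidence from the other two words. With \(\alpha = 1\) every word keeps a small non-zero likelihood.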
Can we use the NB classifier when features are not independent? Feature importance is the process of evaluating features by how successful they are in predicting the target variable. Naive Bayes classifiers do not provide an intrinsic technique for determining the relevance of features. They predict the class with the highest probability by computing the conditional and unconditional probabilities associated with the features. As a result, no coefficients are generated or associated with the features used to train the model. However, there are post hoc methods for analysing the model after it has been trained. One of these is permutation importance, which is implemented in scikit-learn.
Permutation feature importance is a model inspection technique that can be used with any fitted estimator when the data is tabular. For a given dataset, the permutation importance function computes the feature importances of an estimator. The n_repeats option specifies how many times each feature is randomly shuffled, and the function returns a sample of feature importances.
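The idea behind permutation importance can be sketched without any library. The code below is an illustrative stand-in, not scikit-learn's implementation: it shuffles one column at a time and records the mean drop in accuracy. The classifier `predict_label` is a hypothetical rule that looks only at the first feature, and the data is made up.

```python
import random


def accuracy(predict, X, y):
    """Fraction of rows for which the prediction matches the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j replaced by its shuffled values.
            shuffled = [row[:j] + [value] + row[j + 1:] for row, value in zip(X, column)]
            drops.append(baseline - accuracy(predict, shuffled, y))
        importances.append(sum(drops) / n_repeats)
    return importances


def predict_label(row):
    """Hypothetical fitted model: uses feature 0 and ignores feature 1."""
    return int(row[0] > 5)


X = [[1, 7], [2, 3], [3, 9], [4, 1], [5, 6], [6, 2], [7, 8], [8, 4], [9, 5], [10, 0]]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

importances = permutation_importance(predict_label, X, y)
print(importances)  # feature 1 scores 0.0 because the model never reads it
```

Because the technique only needs predictions, the same loop would work with a trained Naive Bayes model in place of `predict_label`.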
What are some applications of the NB classifier? Text classification, spam filtering, sentiment analysis: Naive Bayes classifiers are commonly employed in text classification, where they often do well compared to other techniques, owing to good results in multiclass problems and the independence assumption. As a result, they're widely used in spam filtering (identifying spam email) and sentiment analysis (identifying positive and negative customer sentiments in social media).
Recommendation system: The Naive Bayes classifier and collaborative filtering can be combined to build a recommendation system that uses machine learning and data mining techniques to filter unseen data and predict whether a user would like a given resource.
Multiclass prediction: The algorithm is also well known for its multiclass prediction capability. It can predict the probability of each of several classes of the target variable.
Real-time prediction: Naive Bayes is an eager learner and is fast. It can therefore be used to make predictions in real time.
How is the NB classifier related to logistic regression? Given input features \(X\), both NB classifier and logistic regression predict an output class, that is, output \(Y\) is categorical. Logistic regression directly estimates \(P(Y \mid X)\) whereas NB classifier applies Bayes' theorem and estimates \(P(Y)\) and \(P(X \mid Y)\). As such, we call logistic regression a discriminative classifier and NB a generative classifier.
It's been observed that on small training datasets, NB classifier does better than logistic regression. If more training samples are available, logistic regression does better. While logistic regression has a lower asymptotic error, NB classifier may converge faster to its higher asymptotic error.
It's known that the Gaussian Naive Bayes (GNB) classifier is closely related to logistic regression. Parameters of one model can be expressed in terms of the other. Moreover, asymptotically both converge to the same classifier when GNB assumptions hold. When the assumptions don't hold, such as dependence among features, logistic regression does better because it adjusts its parameters to give a better fit.
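As a sketch of how the parameters map, assume a binary class \(Y \in \{0, 1\}\) with \(\pi = P(Y = 1)\), and Gaussian features whose variance \(\sigma_i^2\) is the same in both classes while the means \(\mu_{i0}\) and \(\mu_{i1}\) differ. These assumptions and symbols are introduced here for illustration. Applying Bayes' theorem and expanding the Gaussian densities gives a posterior of logistic form:

$$P(Y = 1 \mid X) = \frac{1}{1 + \exp\left(w_0 + \sum_{i} w_i X_i\right)}, \qquad w_i = \frac{\mu_{i0} - \mu_{i1}}{\sigma_i^2}, \qquad w_0 = \ln\frac{1 - \pi}{\pi} + \sum_{i} \frac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2}$$

The quadratic terms in \(X_i\) cancel only because the variance is shared across classes; with class-specific variances the decision boundary is no longer linear.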
What are some disadvantages and advantages of the NB classifier? Advantages: Naive Bayes is simple to implement. The conditional probabilities are simple to compute. The probabilities can be estimated directly, with no need for iterations, which makes the approach useful when training speed is critical. If the conditional independence assumption holds, it can give very good results. The algorithm also predicts classes faster than many other classification algorithms.
Disadvantages: The assumption of independent predictors is the main limitation of Naive Bayes. Naive Bayes implicitly assumes that all attributes are independent of one another. In practice, it is very hard to obtain a set of predictors that are totally independent. If a categorical variable in the test dataset has a category that was not observed in the training dataset, the model will assign it a zero probability and will be unable to make a prediction. This is commonly referred to as the zero-frequency problem. A smoothing technique can remedy this. Laplace estimation is one of the most basic smoothing techniques.
Milestones
The Royal Society publishes a paper on probability by Thomas Bayes after his death in 1761. It's titled Essay Towards Solving a Problem in the Doctrine of Chances and details what would later become famous as Bayesian inference. The basic idea is to revise predictions based on new evidence. Decades later (early 19th century), Pierre-Simon Laplace develops and popularizes Bayesian probability.
Bayesian approach is applied during the Second World War. It sees a revival in the years after the war. Earlier, Bayesian approach had been criticized. The frequentist approach developed by R.A. Fisher had been favoured since the mid-1920s.
Maron and Kuhns apply Bayes' Theorem to the task of Information Retrieval (IR). The probability of retrieving a relevant document given a query can be computed from the prior probability of document relevance and conditional probability of user making a particular query given the relevant document. Over the next forty years, Naive Bayes is the main technique in IR until machine learning techniques become popular.
Hughes considers a two-class pattern recognition problem. The model considers \(n\) discrete values that can be measured and \(m\) sample patterns. He shows that for a given \(m\), there's an optimal \(n\) that minimizes the pattern recognition error. This is shown in the figure (right) for the case of equal class probabilities. The figure (left) also shows an example of \(n=5\) in which values 1-3 imply class \(c_1\) and values 4-5 imply class \(c_2\).
Duda and Hart use the Naive Bayes classifier in pattern recognition.
Langley et al. present an analysis of Bayesian classifiers considering noisy classes and noise-free attributes. They find that the Naive Bayes classifier gives comparable results to the C4 algorithm that induces decision trees. They conclude that despite its simplicity, the Naive Bayes classifier deserves more research attention.
Domingos and Pazzani show that even when attributes are not independent, the Bayesian classifier does well. It can be optimal under zero-one loss (misclassification rate). It's optimal under squared error loss only when the independence assumption holds.
Kasif et al. propose a probabilistic framework for memory-based reasoning (MBR). Such a framework can be used for classification tasks. They note that a probabilistic graphical model is really another way of looking at the Naive Bayes classifier.
References
 Bazett, Trefor. 2017. "Bayes' Theorem - The Simplest Case." Trefor Bazett, on YouTube, November 19. Accessed 2022-02-23.
 Berrar, Daniel. 2018. "Bayes’ Theorem and Naive Bayes Classifier." Encyclopedia of Bioinformatics and Computational Biology, vol. 1, Elsevier, pp. 403–412. Accessed 2022-02-07.
 Chauhan, Nagesh Singh. 2020. "Introduction to the Naïve Bayes Algorithm." KDnuggets, June 8. Accessed 2022-01-22.
 Domingos, P., and M. Pazzani. 1997. "On the Optimality of the Simple Bayesian Classifier under Zero-One Loss." Machine Learning, vol. 29, pp. 103–130. doi: 10.1023/A:1007413511361. Accessed 2022-03-30.
 Encyclopaedia Britannica. 2022. "Thomas Bayes." Encyclopedia Britannica, January 1. Accessed 2022-03-29.
 Gandhi, Rohith. 2018. "Naive Bayes Classifier." Towards Data Science, on Medium, May 5. Accessed 2022-01-22.
 HolyPython. 2020. "Naive Bayes Classifier History." HolyPython, July 29. Accessed 2022-01-23.
 Hughes, G. 1968. "On the mean accuracy of statistical pattern recognizers." IEEE Transactions on Information Theory, vol. 14, no. 1, pp. 55–63, January. doi: 10.1109/TIT.1968.1054102. Accessed 2022-03-29.
 Jayaswal, Vaibhav. 2020. "Laplace smoothing in Naïve Bayes algorithm." Towards Data Science, on Medium, November 22. Accessed 2022-02-22.
 Kasif, Simon, Steven Salzberg, David Waltz, John Rachlin, and David W. Aha. 1998. "A probabilistic framework for memory-based reasoning." Artificial Intelligence, vol. 104, no. 1–2, pp. 287–311. Accessed 2022-03-30.
 Kaviani, Pouria and Sunita Dhotre. 2017. "Short Survey on Naive Bayes Algorithm." International Journal of Advance Engineering and Research Development, vol. 4, no. 11, pp. 607–611, November. Accessed 2022-02-23.
 Keogh, Eamonn. 2011. "Naïve Bayes Classifier." Computational Entomology, University of California, Riverside. Accessed 2022-02-07.
 Kumar, Naresh. 2019. "Advantages and Disadvantages of Naive Bayes in Machine Learning." The Professionals Point, March 2. Accessed 2022-02-23.
 Langley, Pat, Wayne Iba, and Kevin Thompson. 1992. "An analysis of Bayesian classifiers." Proceedings of the tenth national conference on Artificial intelligence (AAAI'92), AAAI Press, pp. 223–228. Accessed 2022-03-29.
 Lewis, David D. 1998. "Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval." In: Nédellec, C., and Rouveirol, C. (eds), Machine Learning: ECML-98, ECML 1998, Lecture Notes in Computer Science, vol. 1398, Springer. doi: 10.1007/BFb0026666. Accessed 2022-03-29.
 lukeprog. 2011. "A History of Bayes' Theorem." LessWrong, August 29. Accessed 2022-03-29.
 MachineLearningInterview. 2021. "How does Naive Bayes Classifier Work? What are the pros and cons with Naive Bayes Classifier?" MachineLearningInterview, on YouTube, July 30. Accessed 2022-01-23.
 Maron, M. E., and J. L. Kuhns. 1960. "On relevance, probabilistic indexing, and information retrieval." Journal of the ACM, vol. 7, no. 3, pp. 216–244, July. doi: 10.1145/321033.321035. Accessed 2022-03-29.
 Mitchell, Tom M. 2000. "Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression." Chapter 3 in: Machine Learning, Draft, October 1. Accessed 2022-03-31.
 Nelson, Daniel. 2020. "What is Bayes Theorem?" AI Masterclass, Unite.AI, August 23. Accessed 2022-01-21.
 Ng, Andrew and Michael Jordan. 2001. "On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes." In: T. Dietterich, S. Becker and Z. Ghahramani (eds.), Advances in Neural Information Processing Systems 14 (NIPS 2001). Accessed 2022-03-31.
 Rastogi, Rahul. 2020. "Naive Bayes & its Mathematical Implementation." On Medium, June 24. Accessed 2022-01-23.
 Reddy, Suman Kumar. 2020a. "Categorical Naive Bayes Classifier implementation in Python." Blog, iNeuron, October 31. Accessed 2022-01-23.
 Reddy, Suman Kumar. 2020b. "Feature Importance in Naive Bayes Classifiers." Blog, iNeuron, October 31. Accessed 2022-02-23.
 Rish, I. 2001. "An empirical study of the naive Bayes classifier." T.J. Watson Research Center, IBM. Accessed 2022-02-07.
 Santhosh, Gautham. 2020. "Understanding Naive Bayes in the real world." On Medium, February 07. Accessed 2022-01-23.
 scikit-learn. 2001. "Permutation feature importance." scikit-learn. Accessed 2022-02-23.
 scikit-learn. 2021. "1.9. Naive Bayes." scikit-learn v1.0.2, December. Accessed 2022-01-23.
 Stecanella, Bruno. 2017. "A practical explanation of a Naive Bayes classifier." Blog, MonkeyLearn, May 07. Accessed 2022-01-23.
 Wikipedia. 2022. "Naive Bayes classifier." Wikipedia, March 4. Accessed 2022-01-21.
 Yang S. 2019. "An Introduction to Naïve Bayes Classifier." Towards Data Science, on Medium, September 9. Accessed 2022-02-22.
Further Reading
 James H. Martin. 2021. "Naive Bayes and Sentiment Classification." web.stanford.edu, December 29. Accessed 2022-01-23.
 Srijith Rajeev. 2019. "Naive Bayes and Sentiment Classification." www.commonlounge.com, December 29. Accessed 2022-01-23.
 Vikramkumar. 2014. "Bayes and Naive Bayes Classifier." Arxiv.org, April 03. Accessed 2022-01-23.