Naive Bayes Classifier

Naive Bayes is a probabilistic classifier that returns the probability of a test point belonging to each class rather than just a class label. It's among the simplest Bayesian network models, but when combined with kernel density estimation it can achieve high accuracy. The algorithm is applicable to classification tasks only, unlike many other ML algorithms that can perform regression as well as classification.

The algorithm is considered naive because the assumption it makes is virtually impossible to satisfy in real-life data. It computes the class-conditional probability as a product of individual feature probabilities. In other words, it assumes that the presence or absence of a particular feature is unrelated to the presence or absence of any other feature (absolute independence of features), given the class variable.

Discussion

  • Could you explain the Naive Bayes classifier with examples?
    Two features with histograms of antenna length. Source: Adapted from Keogh 2011, slide 3.

    Consider two groups of insects, grasshoppers and katydids. By studying the antenna lengths from many insect samples, we can discern some patterns and compute probabilities. For example, given an antenna length of 3 cm, the insect is more likely to be a grasshopper than a katydid. The Naive Bayes classifier is a technique to perform such a classification. Antenna length is a feature used to classify an insect into one of two classes.

    Suppose the antenna length is 5 cm. Probabilities computed from observed samples show that both classes are equally likely. In this case, classification can be improved by considering more features such as abdomen length. The NB classifier assumes that features are independent of one another.

    Consider the statement "Officer Drew arrested me." Is Drew male or female? We can answer this by gathering data on the officer: height, eye colour and long/short hair. Then we look up a police database of all officers and apply the NB classifier. This problem uses three independent features and two classes (male or female).
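
    A minimal sketch of this idea in Python follows. The ten-officer database, features and counts are entirely made up for illustration, not from any real data source.

    ```python
    # Hypothetical counts from a made-up database of 10 officers.
    counts = {
        "male":   {"n": 6, "over_170cm": 4, "blue_eyes": 2, "long_hair": 1},
        "female": {"n": 4, "over_170cm": 1, "blue_eyes": 2, "long_hair": 3},
    }
    total = sum(c["n"] for c in counts.values())

    # Observed evidence about Drew.
    evidence = ["over_170cm", "blue_eyes", "long_hair"]

    scores = {}
    for cls, c in counts.items():
        prior = c["n"] / total
        # Naive assumption: features are conditionally independent given the
        # class, so the likelihood is a product of per-feature probabilities.
        likelihood = 1.0
        for feat in evidence:
            likelihood *= c[feat] / c["n"]
        scores[cls] = prior * likelihood   # proportional to P(class | evidence)

    prediction = max(scores, key=scores.get)
    print(scores, "->", prediction)
    ```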

  • What is Bayes' Theorem and how is it relevant to the NB classifier?
    Bayes theorem. Source: Bazett 2017.

    Bayes theorem (aka Bayes rule) works on conditional probability, where the probability of a particular outcome is conditioned on another event having occurred. Given two events A and B, Bayes theorem states that,

    $$P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{P(A) \cdot P(B|A)}{P(B)}$$

    where \(P(A)\) and \(P(B)\), called marginal or prior probabilities, are the probabilities of events A and B occurring; \(P(A|B)\), called the posterior probability, is the probability of event A occurring given that event B has occurred; \(P(B|A)\), called the likelihood, is the probability of event B occurring given that event A has occurred; \(P(A \cap B)\) is the joint probability of both events occurring. \(P(A|B)\) and \(P(B|A)\) are also called conditional probabilities.

    Suppose you have drawn a red card from a deck of playing cards. What's the probability that it's a four? We apply conditional probability. There are 26 possible red cards and two of them are fours. Thus, \(P(four|red)=2/26=1/13\). Bayes' theorem allows us to reformulate the problem as follows:

    $$P(four|red) = P(four) \cdot P(red|four) / P(red)\\= (4/52 \cdot 2/4) / (26/52)\\= 1/13$$
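
    The same arithmetic can be checked in a few lines of Python:

    ```python
    # Verifying the playing-card example with Bayes' theorem.
    p_four = 4 / 52           # prior P(four)
    p_red = 26 / 52           # prior P(red)
    p_red_given_four = 2 / 4  # likelihood P(red | four): two of the four fours are red

    p_four_given_red = p_four * p_red_given_four / p_red
    print(p_four_given_red)   # 0.0769... == 1/13
    ```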

  • What are the types of the NB classifier?
    Types of naive bayes classifier. Source: Rastogi 2020.
    Types of naive bayes classifier. Source: Rastogi 2020.

    scikit-learn implements three naive Bayes variants, each based on a different probability distribution: Bernoulli, multinomial, and Gaussian.

    Bernoulli Naive Bayes

    The predictors in this case are boolean variables, so the only values are 'True' and 'False' (or equivalently 'Yes' and 'No'). We use this variant when the data follows a multivariate Bernoulli distribution.

    Multinomial Naive Bayes

    Feature vectors represent the frequencies with which events were generated by a multinomial distribution. This is the event model most commonly used for document classification. For example, to decide whether a document belongs to the 'Legal' or the 'Human Resources' category, this variant uses the frequency of words in the document as features.

    Gaussian Naive Bayes

    This variant is used for numerical or continuous features. The distribution of continuous values is assumed to be Gaussian, and the likelihoods are therefore computed from the Gaussian distribution.
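
    A minimal sketch of the three variants using scikit-learn's BernoulliNB, MultinomialNB and GaussianNB classes; the tiny arrays are made up purely for illustration.

    ```python
    import numpy as np
    from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB

    y = np.array([0, 0, 1, 1])

    # BernoulliNB: binary (present/absent) features.
    X_bool = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 1]])
    print(BernoulliNB().fit(X_bool, y).predict([[1, 0, 0]]))

    # MultinomialNB: count features, e.g. word frequencies in a document.
    X_counts = np.array([[3, 0, 1], [2, 1, 0], [0, 4, 2], [1, 3, 3]])
    print(MultinomialNB().fit(X_counts, y).predict([[0, 2, 1]]))

    # GaussianNB: continuous features assumed Gaussian within each class.
    X_cont = np.array([[5.1, 3.5], [4.9, 3.0], [6.2, 2.9], [5.9, 3.2]])
    print(GaussianNB().fit(X_cont, y).predict([[6.0, 3.0]]))
    ```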

  • How would you use Naive Bayes classifier for categorical features?

    The categorical distribution extends the Bernoulli distribution to a discrete variable with more than two possible outcomes, such as the roll of a die. Unlike the multinomial distribution, which models counts over multiple draws, the categorical distribution gives the probabilities of the different outcomes for a single draw.

    The features should be encoded using label encoding, with each category assigned a unique number.

    The probability of category t of feature i, given class c, is given by:

    $$P(x_i = t | y = c; \alpha) = \frac{N_{tic} + \alpha}{N_c + \alpha n_i}$$

    where \(N_{tic}\) is the number of times category t appears in the samples \(x_i\) belonging to class c; \(N_c\) is the total number of samples with class c; \(\alpha\) is the Laplace smoothing parameter used to handle the zero-frequency problem; and \(n_i\) is the number of available categories of feature i.
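
    A minimal sketch using scikit-learn's CategoricalNB with ordinal (label) encoding; the toy weather data is made up for illustration only.

    ```python
    import numpy as np
    from sklearn.preprocessing import OrdinalEncoder
    from sklearn.naive_bayes import CategoricalNB

    X_raw = np.array([
        ["sunny", "hot"], ["sunny", "mild"], ["rainy", "mild"],
        ["rainy", "cool"], ["overcast", "hot"], ["overcast", "cool"],
    ])
    y = np.array([0, 0, 1, 1, 1, 0])   # e.g. 1 = "play", 0 = "don't play"

    # Each category is mapped to an integer index, as CategoricalNB expects.
    enc = OrdinalEncoder()
    X = enc.fit_transform(X_raw)

    clf = CategoricalNB(alpha=1.0)     # alpha is the Laplace smoothing parameter
    clf.fit(X, y)
    print(clf.predict(enc.transform([["sunny", "cool"]])))
    ```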

  • What is Laplace smoothing in the context of the NB classifier?

    Laplace smoothing is a technique used in Naive Bayes to solve the problem of zero probability. Consider text categorization, where the aim is to determine if a review is positive or negative. From the training data, we create a likelihood table. We use the likelihood table values when classifying a review, but what if a word in the review was not present in the training dataset? For example, consider a test review \(x_1 x_2 x'\).

    Suppose the test review has three words, where \(x_1\) and \(x_2\) are present in the training data but \(x'\) is not. This is where Laplace smoothing comes into the picture.

    $$P(x'|positive) = \frac{(\text{number of reviews with } x' \text{ and positive outcome}) + \alpha}{N + \alpha K}$$

    K denotes the number of dimensions (features) in the data.

    N is the number of reviews with the target outcome=positive.

    α represents the smoothing parameter.
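
    A minimal sketch of the smoothed likelihood for the unseen word \(x'\), using hypothetical counts:

    ```python
    alpha = 1      # smoothing parameter
    K = 3          # number of features (distinct words considered)
    N = 10         # number of reviews with the positive outcome

    count_x_prime_positive = 0   # x' never appears in positive training reviews

    # Without smoothing the likelihood is 0, which wipes out the whole product.
    p_unsmoothed = count_x_prime_positive / N
    # With Laplace smoothing the likelihood stays small but non-zero.
    p_smoothed = (count_x_prime_positive + alpha) / (N + alpha * K)

    print(p_unsmoothed, p_smoothed)   # 0.0  0.0769...
    ```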

  • How do we determine the importance of features in the NB classifier?

    Feature importance is the process of evaluating features by how useful they are in predicting the target variable. Naive Bayes classifiers do not provide an intrinsic technique for determining the relevance of features. They predict the class with the highest probability by computing the conditional and unconditional probabilities associated with the features. As a result, no coefficients are generated or associated with the features used to train the model. However, the model can be analysed post-hoc after it has been trained. One such strategy is permutation importance, which is implemented in scikit-learn.

    Permutation feature importance is a model inspection technique that can be used for any fitted estimator when the data is tabular. For a given dataset, the permutation importance function computes the feature importance of an estimator. The n_repeats option specifies how many times each feature is randomly shuffled, returning a sample of feature importances.
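
    A minimal sketch using scikit-learn's permutation_importance on a Gaussian NB model fitted to the Iris dataset; the choice of dataset is illustrative only.

    ```python
    from sklearn.datasets import load_iris
    from sklearn.naive_bayes import GaussianNB
    from sklearn.inspection import permutation_importance

    X, y = load_iris(return_X_y=True)
    model = GaussianNB().fit(X, y)

    # Each feature is shuffled n_repeats times; the drop in score estimates
    # how much the model relies on that feature.
    result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
    for name, mean in zip(load_iris().feature_names, result.importances_mean):
        print(f"{name}: {mean:.3f}")
    ```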

  • What are some applications of the NB classifier?

    Text classification / spam filtering / sentiment analysis: Naive Bayes classifiers are commonly employed in text classification, where they achieve a higher success rate than many other techniques owing to good results on multi-class problems and the independence assumption. As a result, they are widely used in spam filtering (detecting spam e-mail) and sentiment analysis (in social media analysis, to identify positive and negative customer sentiments). A text-classification sketch appears at the end of this answer.

    Recommendation systems: The Naive Bayes classifier can be combined with collaborative filtering to build a recommendation system that filters unseen information and predicts whether a user would like a given resource.

    Multi-class prediction: The algorithm is also well known for its multi-class prediction capability, estimating the probability of each target class.

    Real-time prediction: Naive Bayes is an eager learner and is fast to train and evaluate. As a result, it can be used to make predictions in real time.
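
    To illustrate the text-classification application mentioned above, here's a minimal sketch that pairs a word-count vectorizer with multinomial naive Bayes; the tiny corpus and labels are made up for illustration.

    ```python
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    docs = [
        "win a free prize now", "free money offer",
        "meeting agenda attached", "project status update",
    ]
    labels = ["spam", "spam", "ham", "ham"]

    vec = CountVectorizer()
    X = vec.fit_transform(docs)          # word-frequency feature vectors

    clf = MultinomialNB().fit(X, labels)
    print(clf.predict(vec.transform(["free prize offer"])))   # likely 'spam'
    ```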

  • How is the NB classifier related to logistic regression?

    Given input features \(X\), both NB classifier and logistic regression predict an output class, that is, output \(Y\) is categorical. Logistic regression directly estimates \(P(Y|X)\) whereas NB classifier applies the Bayes theorem and estimates \(P(Y)\) and \(P(X|Y)\). As such, we call logistic regression a discriminative classifier and NB a generative classifier.

    It's been observed that on small training datasets, NB classifier does better than logistic regression. If more training samples are available, logistic regression does better. While logistic regression has a lower asymptotic error, NB classifier may converge faster to its higher asymptotic error.

    It's known that the Gaussian Naive Bayes (GNB) classifier is closely related to logistic regression. Parameters of one model can be expressed in terms of the other. Moreover, asymptotically both converge to the same classifier when GNB assumptions hold. When the assumptions don't hold, such as dependence among features, logistic regression does better because it adjusts its parameters to give a better fit.
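
    A minimal sketch of this comparison, training GaussianNB and logistic regression on increasing amounts of data from scikit-learn's breast cancer dataset; the dataset choice and sample sizes are illustrative assumptions.

    ```python
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    # Compare test accuracy as the training set grows.
    for n in (20, 50, 200, len(X_train)):
        nb = GaussianNB().fit(X_train[:n], y_train[:n])
        lr = make_pipeline(StandardScaler(),
                           LogisticRegression(max_iter=1000))
        lr.fit(X_train[:n], y_train[:n])
        print(n, round(nb.score(X_test, y_test), 3),
              round(lr.score(X_test, y_test), 3))
    ```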

  • What are some disadvantages and advantages of the NB classifier?
    Strength and weakness of naive bayes. Source: MachineLearningInterview 2021.

    Advantages: Naive Bayes is simple to implement. The conditional probabilities are easy to compute and can be determined directly, without iteration, so the approach is useful when training speed is critical. If the conditional independence assumption actually holds, results can be excellent. The algorithm also predicts classes faster than many other classification algorithms.

    Disadvantages: The premise of independent predictors is the main limitation of Naive Bayes. The model implicitly assumes that all attributes are independent of one another, but in practice it's very hard to obtain a set of predictors that are completely independent. If a categorical variable in the test dataset has a category that was not observed in the training dataset, the model assigns it zero probability and is unable to make a prediction. This is commonly referred to as the zero-frequency problem. Smoothing can remedy this; Laplace estimation is one of the most basic smoothing techniques.

Milestones

1763

The Royal Society publishes a paper on probability by Thomas Bayes after his death in 1761. It's titled Essay Towards Solving a Problem in the Doctrine of Chances and details what would later become famous as Bayesian inference. The basic idea is to revise predictions based on new evidence. Decades later (early 19th century), Pierre-Simon Laplace develops and popularizes Bayesian probability.

1940

The Bayesian approach is applied during the Second World War and sees a revival in the years after it. Earlier, the Bayesian approach had been criticized, and the frequentist approach developed by R.A. Fisher had been favoured since the mid-1920s.

1960

Maron and Kuhns apply Bayes' Theorem to the task of Information Retrieval (IR). The probability of retrieving a relevant document given a query can be computed from the prior probability of document relevance and conditional probability of user making a particular query given the relevant document. Over the next forty years, Naive Bayes is the main technique in IR until machine learning techniques become popular.

1968
Probability-based rules (left) and finite data set accuracy (right). Source: Adapted from Hughes 1968.

Hughes considers a two-class pattern recognition problem. The model considers \(n\) discrete values that can be measured and \(m\) sample patterns. He shows that for a given \(m\), there's an optimal \(n\) that minimizes the pattern recognition error. This is shown in the figure (right) for the case of equal class probabilities. The figure (left) also shows an example of \(n=5\) in which values 1-3 imply class \(c_1\) and values 4-5 imply class \(c_2\).

1973

Duda and Hart use the Naive Bayes classifier in pattern recognition.

1992

Langley et al. present an analysis of Bayesian classifiers considering noisy classes and noise-free attributes. They find that the Naive Bayes classifier gives comparable results to the C4 algorithm that induces decision trees. They conclude that despite its simplicity, the Naive Bayes classifier deserves more research attention.

1997

Domingos and Pazzani show that even when attributes are not independent, the Bayesian classifier does well. It can be optimal under zero-one loss (misclassification rate). It's optimal under squared error loss only when the independence assumption holds.

1998

Kasif et al. propose a probabilistic framework for memory-based reasoning (MBR). Such a framework can be used for classification tasks. They note that a probabilistic graphical model is really another way of looking at the Naive Bayes classifier.

References

  1. Bazett, Trefor. 2017. "Bayes' Theorem - The Simplest Case." Trefor Bazett, on YouTube, November 19. Accessed 2022-02-23.
  2. Berrar, Daniel. 2018. "Bayes’ Theorem and Naive Bayes Classifier." Encyclopedia of Bioinformatics and Computational Biology, vol. 1, Elsevier, pp. 403-412. Accessed 2022-02-07.
  3. Chauhan, Nagesh Singh. 2020. "Introduction to the Naïve Bayes Algorithm." KDnuggets, June 8. Accessed 2022-01-22.
  4. Domingos, P., and M. Pazzani. 1997. "On the Optimality of the Simple Bayesian Classifier under Zero-One Loss." Machine Learning, vol. 29, pp. 103–130. doi: 10.1023/A:1007413511361. Accessed 2022-03-30.
  5. Encyclopaedia Britannica. 2022. "Thomas Bayes." Encyclopedia Britannica, January 1. Accessed 2022-03-29.
  6. Gandhi, Rohith. 2018. "Naive Bayes Classifier." Towards Data Science, on Medium, May 5. Accessed 2022-01-22.
  7. HolyPython. 2020. "Naive Bayes Classifier History." HolyPython, July 29. Accessed 2022-01-23.
  8. Hughes, G. 1968. "On the mean accuracy of statistical pattern recognizers." IEEE Transactions on Information Theory, vol. 14, no. 1, pp. 55-63, January. doi: 10.1109/TIT.1968.1054102. Accessed 2022-03-29.
  9. Jayaswal, Vaibhav. 2020. "Laplace smoothing in Naïve Bayes algorithm." Towards Data Science, on Medium, November 22. Accessed 2022-02-22.
  10. Kasif, Simon, Steven Salzberg, David Waltz, John Rachlin, and David W. Aha. 1998. "A probabilistic framework for memory-based reasoning." Artificial Intelligence, vol. 104, no. 1–2, pp. 287-311. Accessed 2022-03-30.
  11. Kaviani, Pouria and Sunita Dhotre. 2017. "Short Survey on Naive Bayes Algorithm." International Journal of Advance Engineering and Research Development, vol. 4, no. 11, pp. 607-611, November. Accessed 2022-02-23.
  12. Keogh, Eamonn. 2011. "Naïve Bayes Classifier." Computational Entomology, University of California, Riverside. Accessed 2022-02-07.
  13. Kumar, Naresh. 2019. "Advantages and Disadvantages of Naive Bayes in Machine Learning." The Professionals Point, March 2. Accessed 2022-02-23.
  14. Langley, Pat, Wayne Iba, and Kevin Thompson. 1992. "An analysis of Bayesian classifiers." Proceedings of the tenth national conference on Artificial intelligence (AAAI'92), AAAI Press, pp. 223–228. Accessed 2022-03-29.
  15. Lewis, David D. 1998. "Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval." In: Nédellec, C., and Rouveirol, C. (eds), Machine Learning: ECML-98, ECML 1998, Lecture Notes in Computer Science, vol. 1398, Springer. doi: 10.1007/BFb0026666. Accessed 2022-03-29.
  16. MachineLearningInterview. 2021. "How does Naive Bayes Classifier Work? What are the pros and cons with Naive Bayes Classifier?" MachineLearningInterview, on YouTube, July 30. Accessed 2022-01-23.
  17. Maron, M. E., and J. L. Kuhns. 1960. "On relevance, probabilistic indexing, and information retrieval." Journal of the ACM, vol. 7, no. 3, pp. 216-244, July. doi: 10.1145/321033.321035. Accessed 2022-03-29.
  18. Mitchell, Tom M. 2000. "Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression." Chapter 3 in: Machine Learning, Draft, October 1. Accessed 2022-03-31.
  19. Nelson, Daniel. 2020. "What is Bayes Theorem?" AI Masterclass, Unite.AI, August 23. Accessed 2022-01-21.
  20. Ng, Andrew and Michael Jordan. 2001. "On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes." In: T. Dietterich, S. Becker and Z. Ghahramani (eds.), Advances in Neural Information Processing Systems 14 (NIPS 2001). Accessed 2022-03-31.
  21. Rastogi, Rahul. 2020. "Naive Bayes & its Mathematical Implementation." On Medium, June 24. Accessed 2022-01-23.
  22. Reddy, Suman Kumar. 2020a. "Categorical Naive Bayes Classifier implementation in Python." Blog, iNeuron, October 31. Accessed 2022-01-23.
  23. Reddy, Suman Kumar. 2020b. "Feature Importance in Naive Bayes Classifiers." Blog, iNeuron, October 31. Accessed 2022-02-23.
  24. Rish, I. 2001. "An empirical study of the naive Bayes classifier." T.J. Watson Research Center, IBM. Accessed 2022-02-07.
  25. Santhosh, Gautham. 2020. "Understanding Naive Bayes in the real world." On Medium, February 07. Accessed 2022-01-23.
  26. Scikit-learn. 2001. "Permutation feature importance." Scikit-learn. Accessed 2022-02-23.
  27. Stecanella, Bruno. 2017. "A practical explanation of a Naive Bayes classifier." Blog, MonkeyLearn, May 07. Accessed 2022-01-23.
  28. Wikipedia. 2022. "Naive Bayes classifier." Wikipedia, March 4. Accessed 2022-01-21.
  29. Yang S. 2019. "An Introduction to Naïve Bayes Classifier." Towards Data Science, on Medium, September 9. Accessed 2022-02-22.
  30. lukeprog. 2011. "A History of Bayes' Theorem." LessWrong, August 29. Accessed 2022-03-29.
  31. scikit-learn. 2021. "1.9. Naive Bayes." scikit-learn v1.0.2, December. Accessed 2022-01-23.

Further Reading

  1. James H. Martin. 2021. "Naive Bayes and Sentiment Classification." web.stanford.edu, December 29. Accessed 2022-01-23.
  2. Srijith Rajeev. 2019. "Naive Bayes and Sentiment Classification." www.commonlounge.com, December 29. Accessed 2022-01-23.
  3. Vikramkumar. 2014. "Bayes and Naive Bayes Classifier." Arxiv.org, April 03. Accessed 2022-01-23.
