Probability for Data Scientists
In mathematics, the symbol \(\pi\), pronounced "pi", denotes the ratio of the circumference of any circle to its diameter. \(\pi\) is a constant: it does not vary with the size of the circle. But many other quantities in the world are not constant.
Let's assume the letter \(X\) denotes the height of adults in India. \(X\) can take any positive real value for a randomly chosen individual. Hence \(X\) is a variable that takes random values in a range of positive real numbers.
Probability measures how likely or unlikely an outcome is, where the outcome is a value taken by a random variable. For instance, we can ask, "What's the probability of picking an Indian adult male who is above six feet?"
Probability is intimately related to another branch of mathematics called Statistics. Both are of fundamental importance to the field of Data Science.
Discussion
How do we mathematically define the probability of an event? Mathematically, when all outcomes are equally likely, probability is the ratio of the number of desired outcomes to the number of all possible outcomes:
$$P(Outcome)=\frac{n(Desired\ Outcomes)}{n(All\ Outcomes)}$$
Any desired outcome is a subset of all possible outcomes. The value of probability therefore ranges from zero to one. The limits have the following interpretation:
 Zero: the outcome will never occur.
 One: the outcome is guaranteed to occur.
The set of all outcomes is called the Sample Space. When the number of outcomes is large or grouping outcomes is more suitable for a study, it's common to group one or more outcomes into what we call an Event. An outcome can be part of multiple events.
Could you illustrate probability with some simple examples? If we toss a coin, there are only two possible outcomes: head or tail. Assuming both outcomes are equally likely, the probability of getting a head is 1/2 = 0.5. Likewise, the probability of getting a tail is also 0.5. Taken together, the probability of getting either head or tail is 0.5 + 0.5 = 1. This makes sense since there are no other outcomes besides head or tail.
Let's roll a die. The probability of getting an odd prime number (3 or 5) is 2/6 ≈ 0.33. The probability of getting a number greater than 6 is 0/6 = 0. The probability of getting a number less than or equal to 6 is 6/6 = 1.
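The die example can be checked by enumerating the sample space. The following is a minimal sketch, assuming a fair six-sided die; the helper name `probability` is introduced here for illustration only.

```python
from fractions import Fraction

def probability(event, sample_space):
    """Classical probability: favourable outcomes over all equally likely outcomes."""
    favourable = [outcome for outcome in sample_space if event(outcome)]
    return Fraction(len(favourable), len(sample_space))

die = [1, 2, 3, 4, 5, 6]  # sample space of a fair six-sided die (assumption)

p_odd_prime = probability(lambda x: x in (3, 5), die)
p_above_six = probability(lambda x: x > 6, die)
p_at_most_six = probability(lambda x: x <= 6, die)

print(p_odd_prime, p_above_six, p_at_most_six)  # 1/3 0 1
```

Using `Fraction` keeps the results exact, so 2/6 is reported as 1/3 rather than a rounded decimal.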
Why does probability work? Any random variable, however random, has identifiable characteristics. For example, the variable may be most probable at one value, with probability dropping off at neighbouring values. This variability around the most probable value helps us to model random variables. In technical terms, we call this the distribution of the random variable. When we plot the number of occurrences against the value, we get a distribution curve. Random variables are typically characterized by their average value, variability (spread), skewness (asymmetry) and kurtosis ("tailedness").
For example, let's consider the response time of a computing system. When the system is under high load, the average response time increases. What's more interesting is that the spread of response time around this average is also larger. Thus, under different loads the response time random variable exhibits different characteristics.
Exceptions (popularly called Outliers) will affect probability generalisation. They have to be kept out when building realistic models.
How is probability related to distributions? Probability looks at the likelihood of a specific outcome or event. Distribution looks at all outcomes or events.
Let's take the example of a coin toss. We know from theory that the probability of a head is 0.5. However, there's also an experimental approach. For example, experimenting with 100 tosses might result in 49 turning out to be heads. Hence, the estimated probability of a head is 0.49. Repeated independent tosses of this kind are called Bernoulli Trials. When we list probabilities for all outcomes (head and tail), we end up with a Bernoulli Distribution.
We can perform a variation of the coin-toss experiment. We can toss a coin 32 times and call this a single experiment. We repeat this experiment many times, say 50,000 times. Finally, we calculate the probability of getting 4 heads in an experiment of 32 coin tosses, as the fraction of experiments with that result. Such a series of experiments is called Binomial Trials. When we list the probabilities for all outcomes, we end up with a Binomial Distribution.
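The binomial probabilities for this experiment can also be computed exactly rather than by repetition. The sketch below assumes a fair coin and reads "4 heads" as exactly 4 heads; `binomial_pmf` is a name chosen here for illustration.

```python
from math import comb

def binomial_pmf(k, n, p=0.5):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n_tosses = 32
# Binomial Distribution: one probability per possible number of heads (0..32)
distribution = [binomial_pmf(k, n_tosses) for k in range(n_tosses + 1)]

p_four_heads = distribution[4]
print(f"P(4 heads in 32 tosses) = {p_four_heads:.3e}")
print(f"Expected such experiments out of 50,000 = {50_000 * p_four_heads:.2f}")
```

Under these assumptions the probability is roughly 8 in a million, so even 50,000 repetitions would often show no experiment with exactly 4 heads. This is one reason to prefer the exact formula for rare outcomes.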
What are probability distributions and how are they useful? Probabilities, when identified and listed for all possible outcomes, form a Probability Distribution. For instance, if we find the probabilities of adult males in India having heights in the ranges 0-3.5, 3.5-4, 4-4.5, 4.5-5, 5-5.5, 5.5-6 and 6+ feet, we have a probability distribution. Such a distribution is closely related to the concept of a histogram. With a histogram, we plot the count of values within each range. With a distribution, we convert these counts into probabilities. In both cases, a graphical plot helps us to easily read the average, variability, skewness, kurtosis, outliers, etc.
Given the distribution, an event can be simulated at random within the boundaries of the distribution. In other words, to create random variables for simulation purposes, we need the distribution. For example, let's consider the number of people arriving at an ATM every 60 minutes. This can be modelled as a Poisson distribution. We can simulate queues at the ATM, calculate waiting times and decide if another ATM needs to be installed. Likewise, if we know distributions of outcomes in a game, we can simulate winning odds and take appropriate risks.
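As a rough sketch of the ATM simulation, hourly arrivals can be drawn from a Poisson distribution using only the standard library. The arrival rate (12 per hour) and the ATM capacity (15 customers per hour) are hypothetical values chosen for illustration, and the sampler uses Knuth's multiplication method.

```python
import math
import random

def sample_poisson(rate, rng):
    """Draw one Poisson-distributed count using Knuth's multiplication method."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

rng = random.Random(42)      # fixed seed so the run is repeatable
rate_per_hour = 12           # hypothetical average arrivals per hour
capacity_per_hour = 15       # hypothetical number of customers one ATM can serve

arrivals = [sample_poisson(rate_per_hour, rng) for _ in range(10_000)]
mean_arrivals = sum(arrivals) / len(arrivals)
overload_fraction = sum(a > capacity_per_hour for a in arrivals) / len(arrivals)

print(f"mean arrivals per hour: {mean_arrivals:.2f}")
print(f"fraction of hours over capacity: {overload_fraction:.3f}")
```

The fraction of overloaded hours is the kind of number that could inform a decision on a second ATM. A fuller model would also simulate service times and queue lengths.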
When does probability work? Often we are unable to gather data from the entire sample space or population. We typically collect a sample of data from the population. Probability estimates are reliable when the sample size is large.
For instance, we wish to find the probability of an Indian adult male being six feet or taller. A sample of 100 will not give a reliable number. A sample of 10,000 will be more reliable. The more, the better. Stated formally as the Law of Large Numbers, the probability of an event estimated from a sample converges to the actual value in the population as the sample size grows.
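The effect of sample size can be illustrated with a small simulation. This is a sketch only: the population proportion `true_p = 0.1` is an assumed value, not a real statistic about heights.

```python
import random

rng = random.Random(7)   # fixed seed so the run is repeatable
true_p = 0.1             # assumed population proportion, for illustration only

def sample_proportion(n):
    """Estimate the proportion from a random sample of size n."""
    return sum(rng.random() < true_p for _ in range(n)) / n

estimates = {n: sample_proportion(n) for n in (100, 10_000, 1_000_000)}

for n, estimate in estimates.items():
    print(f"n={n:>9,}  estimate={estimate:.4f}  error={abs(estimate - true_p):.4f}")
```

The error typically shrinks as the sample grows, which is what the Law of Large Numbers predicts. Any single small sample can still happen to land close to the true value by chance.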
What are axioms of Probability? Probability rests on a few basic rules, called the Axioms of Probability. These were formulated by Russian mathematician Andrei Kolmogorov.
These axioms can be explained as follows:
 The probability of any event is a non-negative real number.
 The probability of the entire sample space is one. This follows from the fact that there are no events outside the sample space.
 The probability of the union of two mutually exclusive events is the sum of their individual probabilities.
What are mutually exclusive and non-mutually exclusive events? Let's say there are two events denoted \(A\) and \(B\). If \(A\) and \(B\) don't occur together, they're mutually exclusive. They're also called disjoint events. For instance, cooking is event \(A\) and cycling is event \(B\). These two are mutually exclusive: a person doesn't cook and cycle at the same time.
$$P(A\ and\ B) = 0\ or\ negligible \\ P(A\ or\ B)=P(A)+P(B)-P(A\ and\ B) \\ \Rightarrow P(A\ or\ B)=P(A)+P(B),\ since\ P(A\ and\ B)=0$$
On the contrary, if \(A\) and \(B\) can happen together, they are non-mutually exclusive events. For instance, cooking is event \(A\) and listening to music is event \(B\). They can happen at the same time.
$$P(A\ and\ B)\neq0 \\ \Rightarrow P(A\ and\ B) = P(A)+P(B)-P(A\ or\ B) \\ \Rightarrow P(A\ or\ B)=P(A)+P(B)-P(A\ and\ B)$$
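Both cases of the addition rule can be verified on a small sample space. The sketch below assumes a fair six-sided die and uses Python sets for events; the event names are chosen for illustration.

```python
from fractions import Fraction

die = {1, 2, 3, 4, 5, 6}  # sample space of a fair die (assumption)

def p(event):
    """Probability of an event, given as a set of outcomes."""
    return Fraction(len(event & die), len(die))

odd, even, prime = {1, 3, 5}, {2, 4, 6}, {2, 3, 5}

# Mutually exclusive: odd and even never occur together
p_odd_and_even = p(odd & even)
p_odd_or_even = p(odd | even)

# Non-mutually exclusive: 3 and 5 are both odd and prime
p_odd_and_prime = p(odd & prime)
p_odd_or_prime = p(odd | prime)

print(p_odd_or_even, p_odd_or_prime)  # 1 2/3
```

For the overlapping events, adding \(P(odd)\) and \(P(prime)\) counts 3 and 5 twice. Subtracting \(P(odd\ and\ prime)\) removes the double count.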
Could you explain joint, conditional and marginal probabilities? Let's consider two events: buying Bread \(A\), buying Jam \(B\). Marginal probability is the proportion of customers who bought Bread regardless of whether they bought Jam or not. It's called marginal because it occurs at the margins of the probability table (see figure). Joint probability is the proportion of customers who bought both Bread and Jam. Conditional probability is the proportion of customers who bought Bread among those who bought Jam, and vice versa.
Marginal Probability
$$P(A) = \frac{n(Customers\ buying\ Bread)}{n(Customers)} = \frac{90}{1000} \\ P(B) = \frac{n(Customers\ buying\ Jam)}{n(Customers)} = \frac{50}{1000}$$
Joint Probability
$$P(Customers\ buying\ Bread\ and\ Jam) = P(A\ and\ B) \\ = \frac{n(A\ and\ B)}{n(Customers)} = \frac{40}{1000}$$
Conditional Probability
$$P(Customers\ buying\ Bread\ when\ they\ already\ bought\ Jam) \\ = P(A|B)=\frac{n(A\ and\ B)}{n(B)} = \frac{40}{50} \\ P(Customers\ buying\ Jam\ when\ they\ already\ bought\ Bread) = P(B|A)=\frac{n(A\ and\ B)}{n(A)}= \frac{40}{90}$$
Conditional Probability reduces the sample space based on condition. Rather than considering all customers (1000), we only consider customers who bought Bread (90) or customers who bought Jam (50), that is, the marginal numbers.
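The three kinds of probability can be reproduced from the counts used above (1000 customers, 90 buying Bread, 50 buying Jam, 40 buying both). This is a minimal sketch; the variable names are chosen for illustration.

```python
from fractions import Fraction

n_customers, n_bread, n_jam, n_both = 1000, 90, 50, 40

# Marginal probabilities
p_bread = Fraction(n_bread, n_customers)
p_jam = Fraction(n_jam, n_customers)

# Joint probability
p_both = Fraction(n_both, n_customers)

# Conditional probabilities: joint divided by the marginal of the condition
p_bread_given_jam = p_both / p_jam
p_jam_given_bread = p_both / p_bread

print(p_bread_given_jam, p_jam_given_bread)  # 4/5 4/9
```

Dividing the joint probability by a marginal gives the same result as dividing the raw counts (40/50 and 40/90), since the total of 1000 cancels out.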
How is odds different from probability? Odds is defined as the ratio of the chances of an event happening to the chances of the same event not happening. Consider the ratio of customers buying milk to those not buying milk. If this ratio is more than 1 then the odds are in favour of the hypothesis (buying milk), else the odds are against the hypothesis.
Consider customers buying bread and jam:
$$Odds(Buying\ Bread\ with\ Jam) = \frac{P(Buying\ Bread\ with\ Jam)}{P(Buying\ Bread\ without\ Jam)}=\frac{\frac{40}{1000}}{\frac{90-40}{1000}}=\frac{40}{50}=0.8 \\ Odds(Buying\ Jam\ with\ Bread) = \frac{P(Buying\ Jam\ with\ Bread)}{P(Buying\ Jam\ without\ Bread)}=\frac{\frac{40}{1000}}{\frac{50-40}{1000}}=\frac{40}{10}=4$$
For every 0.8 customers who buy Bread and Jam, one customer will buy only Bread. For every 4 customers who buy Jam and Bread, one customer will buy only Jam. Thus, the Odds(Buying Jam with Bread) > Odds(Buying Bread with Jam). This implies,
 Jam drives Bread purchase.
 A sizable share of Bread buyers buy Bread without Jam.
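Odds and probability are related by odds = p / (1 - p). As a sketch, the two odds above can be recovered from conditional probabilities computed from the same counts (90 Bread buyers, 50 Jam buyers, 40 buying both); the function name `odds` is chosen here for illustration.

```python
from fractions import Fraction

def odds(p):
    """Convert a probability into odds in favour of the event."""
    return p / (1 - p)

p_jam_given_bread = Fraction(40, 90)  # share of Bread buyers who also buy Jam
p_bread_given_jam = Fraction(40, 50)  # share of Jam buyers who also buy Bread

odds_bread_with_jam = odds(p_jam_given_bread)  # Bread with Jam vs Bread without Jam
odds_jam_with_bread = odds(p_bread_given_jam)  # Jam with Bread vs Jam without Bread

print(odds_bread_with_jam, odds_jam_with_bread)  # 4/5 4
```

A probability always lies between 0 and 1, whereas odds can take any non-negative value. Odds of 1 correspond to a probability of 0.5.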
What is Bayes' Theorem? Given hypothesis H and evidence E, Bayes' Theorem can be written as \(P(H|E) = P(E|H) \cdot P(H) / P(E)\). Bayes' Theorem, also called Bayes' Rule or Bayes' Law, uses prior probability \(P(H)\), accounts for new evidence \(P(E|H)\) and results in posterior probability \(P(H|E)\).
Often, prior probability is sourced from experts due to challenges in evaluating it from evidence. Bayesian probability basically revises probability with every new piece of evidence. This probability will converge to its true value over many revisions on repeated evidence.
The Bayesian approach is used in fields such as epistemology, statistics, and inductive logic. It relies on conditional probabilities and empirical learning. The key insight of the theorem is "that a hypothesis is confirmed by any body of data that its truth renders probable".
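Bayes' Theorem can be checked against the Bread and Jam counts from the earlier section, where \(P(A)=90/1000\), \(P(B)=50/1000\) and \(P(B|A)=40/90\). This is a minimal sketch; the function name `posterior` is chosen for illustration.

```python
from fractions import Fraction

def posterior(prior, likelihood, evidence):
    """Bayes' Theorem: P(H|E) = P(E|H) * P(H) / P(E)."""
    return likelihood * prior / evidence

p_bread = Fraction(90, 1000)          # prior P(A)
p_jam = Fraction(50, 1000)            # evidence P(B)
p_jam_given_bread = Fraction(40, 90)  # likelihood P(B|A)

p_bread_given_jam = posterior(p_bread, p_jam_given_bread, p_jam)
print(p_bread_given_jam)  # 4/5
```

The result matches the conditional probability \(P(A|B)=40/50\) obtained directly from the counts.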
Could you give some applications of Bayes Probability? One application is in spam filtering. The idea is to classify an email as spam. The email may or may not contain the word "Viagra" and not all mails with this word may be spam. We calculate the probabilities based on our prior knowledge of the number of spam mails received. \(P(spam)\) is prior knowledge of spam mails in the inbox. But the probability of the word appearing in a previous spam mail can give us a better estimate. \(P(Viagra|spam)\) is the likelihood and \(P(Viagra)\) is the marginal likelihood.
$$P(spam|Viagra) = P(spam) \cdot \frac{P(Viagra|spam)}{P(Viagra)} \\ where\ P(Viagra)=P(Viagra|spam) \cdot P(spam)+P(Viagra|not\ spam) \cdot P(not\ spam)$$
\(P(Viagra|spam)/P(Viagra)\) is the evidence from data: how much more likely the word Viagra is in spam mail than in mail overall. \(P(spam|Viagra)\) is the posterior probability of a mail being spam given that it contains the word Viagra. When the word is as common in spam as in mail overall, the ratio is 1 and \(P(spam|Viagra)=P(spam)\). When the word is more common in spam, \(P(spam|Viagra) > P(spam)\); when it's less common in spam, \(P(spam|Viagra) < P(spam)\). When 100% of mails with Viagra are spam, \(P(spam|Viagra)=1\).
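A worked example may help make the spam calculation concrete. All numbers below are hypothetical, chosen only for illustration: 20% of mail is spam, the word appears in 5% of spam mails and in 0.1% of non-spam mails. The marginal \(P(Viagra)\) is obtained with the law of total probability, weighting each conditional probability by the share of spam and non-spam mail.

```python
from fractions import Fraction

# Hypothetical inputs, not measured data
p_spam = Fraction(1, 5)                 # prior P(spam)
p_word_given_spam = Fraction(1, 20)     # likelihood P(Viagra|spam)
p_word_given_ham = Fraction(1, 1000)    # P(Viagra|not spam)

# Marginal likelihood via the law of total probability
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Posterior via Bayes' Theorem
p_spam_given_word = p_spam * p_word_given_spam / p_word

print(p_word, p_spam_given_word, float(p_spam_given_word))
```

With these assumed inputs the posterior is 25/27, about 0.93. The word raises the spam probability well above the 0.2 prior because it's far more common in spam than in mail overall.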
What are Frequentist and Bayesian approaches to Probability? Frequentists lean on the Law of Large Numbers to back their probability estimate. For instance, a coin toss has equal probability of head or tail. This is derived from a large number of trials. Frequentists believe any deviation from equal probability is due to chance.
Bayesians argue that belief or prior knowledge should be accounted for while calculating probability. Belief suggests a probability. New evidence may notch up or notch down the probability and form a new belief. Bayesians do not require backing from the Law of Large Numbers, but leverage it where applicable. Probability may be revised with each new piece of evidence, eventually converging to the true probability after repeated revisions. For instance, the probability of a new robot failing at a task starts with a belief, say \(p\), and as new evidence arrives, we revise \(p\).
Frequentist and Bayesian approaches can be applied to all estimates, including probability. While the two approaches are distinct, Bayesian probability complements Frequentist probability when:
 System is not yet stable.
 There's insufficient data to get backing from Law of Large Numbers.
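The robot example can be sketched with a Beta prior, a common way (though not the only one, and not one prescribed above) to encode a belief about a probability and revise it with new evidence. The prior and the observed counts below are hypothetical.

```python
from fractions import Fraction

# Hypothetical prior belief: failure probability around 0.2,
# encoded as Beta(2, 8), i.e. 2 pseudo-failures and 8 pseudo-successes
prior_failures, prior_successes = 2, 8
prior_mean = Fraction(prior_failures, prior_failures + prior_successes)

# Hypothetical new evidence: 4 failures in 10 trials
trials, failures = 10, 4

# Frequentist estimate uses the observed data only
frequentist_estimate = Fraction(failures, trials)

# Bayesian posterior mean for a Beta prior with binomial data
posterior_mean = Fraction(
    prior_failures + failures,
    prior_failures + prior_successes + trials,
)

print(prior_mean, frequentist_estimate, posterior_mean)  # 1/5 2/5 3/10
```

With little data, the posterior sits between the prior belief and the observed frequency. As trials accumulate, the data dominate the prior and the two estimates move closer together.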
Milestones
Sixteenth century Italian mathematician Girolamo Cardano is interested in gambling, to which he applies mathematics. Although gambling has been around for centuries, randomness is not a recognized concept. People continue to believe in Gods and oracles, until the Renaissance. Cardano's work is the first of its kind, although his work is published only much later in 1663. Today we know that he made some fundamental mistakes.
In a series of letters analyzing the problem of points, French mathematicians Blaise Pascal and Pierre de Fermat develop what can be seen as the foundations of a mathematical theory of probability. Their work is popularized by Christian Huygens in a publication of 1657.
Jakob (James) Bernoulli publishes Ars Conjectandi, in which he introduces many important concepts: permutations, a priori, a posteriori, Bernoulli trials, random variable. Bernoulli shows that the probability of an event can be approximated by the frequency of occurrence of the event from a large number of trials. This later comes to be called the Law of Large Numbers.
Thomas Bayes' now famous work on probability is posthumously published. The Bayesian approach to probability is adopted and popularized by Pierre-Simon Laplace, until it's challenged in the early 20th century by mathematicians R. A. Fisher and Jerzy Neyman. To Bayes, probability is a measure of personal belief or reasonable expectation of the event.
With his publication of Analytical Theory of Probability, Pierre-Simon Laplace brings together recent developments in the field. This important work shows the application of probability to scientific problems. Thus, probability is no longer just about games of chance. Laplace himself states,
It is remarkable that probability, which began with the consideration of games of chance, should have become the most important object of human knowledge... [It] is at bottom nothing but common sense reduced to calculus... It teaches us to avoid the illusions which often mislead us.
Russian mathematician A. N. Kolmogorov provides an axiomatic basis for the mathematical theory of probability, thus laying the foundations for a modern treatment of the subject.
References
 Aldrich, John. 2005. "Figures from the History of Probability and Statistics." University of Southampton, June. Updated October 2012. Accessed 2018-04-22.
 Apostol, Tom M. 1969. "Calculus, Volume 2: A short history of probability." Second Edition, John Wiley & Sons, June. Accessed 2018-04-23.
 Bayes, Thomas. 1763. "An essay towards solving a Problem in the Doctrine of Chances." Philosophical Transactions of the Royal Society of London, Vol. 53, pp. 370-418. Accessed 2018-04-22.
 Bellhouse, David. 2005. "Decoding Cardano's Liber de Ludo Aleae." Historia Mathematica, Vol. 32, No. 2, May, pp. 180-202. Accessed 2018-04-23.
 Brooks-Bartlett, Jonny. 2018. "Probability concepts explained: probability distributions (introduction part 3)." Towards Data Science, on Medium, September 10. Accessed 2020-08-18.
 Brownlee, Jason. 2019. "A Gentle Introduction to Joint, Marginal, and Conditional Probability." Machine Learning Mastery, September 27. Updated 2020-05-06. Accessed 2020-08-18.
 Buckingham, Steven D. 2011. "Bench philosophy: Bayesian statistics: Confidence Multiplied by Evidence." Lab Times Online, April. Updated 2012-11-10. Accessed 2018-04-22.
 CFI. 2020. "Poisson Distribution." CFI Education, June 6. Accessed 2020-08-18.
 Cimbala, John M. 2010. "Probability Density Functions." ME345, Penn State Univ, January 20. Accessed 2020-08-18.
 Cruzan, Jeff. 2018. "Probability and Statistics: Discrete Probability." xaktly.com. Accessed 2018-04-29.
 DeepAI. 2019. "Odds (Probability)." ML Glossary and Terms, DeepAI, May 17. Accessed 2020-08-18.
 Fernandez-Granda, Carlos. 2017. "Probability and Statistics for Data Science." Center for Data Science, NYU, August. Accessed 2020-08-18.
 Ghemri, Lila. 2020. "Probabilistic Learning – Classification using Naïve Bayes." CS497, Department of Computer Science, Texas Southern University. Accessed 2020-08-18.
 Haslwanter, Thomas. 2016. "Characterizing a Distribution." In: An Introduction to Statistics with Python, Springer. Accessed 2020-08-18.
 Joyce, James. 2003. "Bayes’ Theorem." Stanford Encyclopedia of Philosophy, June 28. Updated 2003-09-30. Accessed 2020-08-18.
 Kirkpatrick, K. L. 2012. "Sample Space, Events and Probability." Dept of Math, Univ of Illinois. Accessed 2020-08-18.
 Lightner, James E. 1991. "A Brief Look at the History of Probability and Statistics." The Mathematics Teacher, vol. 84, no. 8, November, pp. 623-630. Accessed 2018-04-22.
 NIST. 2003. "Poisson Distribution." Section 1.3.6.6.19 in: Engineering Statistics Handbook, NIST/SEMATECH, June 1. Accessed 2020-08-18.
 One Minute Economics. 2017. "Probabilities Explained in One Minute - Probability Definition, Formula and Misconceptions." YouTube, April 11. Accessed 2018-04-29.
 Orloff, Jeremy and Jonathan Bloom. 2014. "Comparison of frequentist and Bayesian inference." Introduction to Probability and Statistics, Class 20, 18.05, MIT OpenCourseWare, Spring. Accessed 2018-05-09.
 Owen, Sean. 2015. "Common Probability Distributions: The Data Scientist’s Crib Sheet." Cloudera Blog, December 3. Accessed 2018-04-29.
 Pannetier, Alain. 2012. "Assymetric Normal Probability Distribution." Mathematics, StackExchange, August 31. Accessed 2018-04-29.
 Routledge, Richard. 2018. "Law of large numbers." Encyclopædia Britannica. Accessed 2018-05-07.
 Shafer, Glen. 1993. "The Early Development of Mathematical Probability." SemanticScholar. Accessed 2018-04-22.
 Sourget, Camille. 2018. "First edition of a founding work of the theory of probability." Accessed 2018-04-23.
 Stomp on Step1. 2018. "Definition and Calculation of Odds Ratio & Relative Risk." Accessed 2018-04-29.
 Taylor, Courtney. 2017. "What Are Probability Axioms?" ThoughtCo., September 28. Accessed 2018-04-29.
 Walker, John. 2018. "Introduction to Probability and Statistics." The RetroPsychoKinesis Project, University of Kent at Canterbury, UK. Accessed 2018-04-29.
 Weisstein, Eric W. 2008a. "Bernoulli Distribution." MathWorld - A Wolfram Web Resource, November 23. Accessed 2020-08-18.
 Weisstein, Eric W. 2008b. "Binomial Distribution." MathWorld - A Wolfram Web Resource, November 23. Accessed 2020-08-18.
 Wikipedia. 2018a. "Bayesian probability." Wikipedia, April 10. Accessed 2018-04-22.
 Wikipedia. 2018b. "Outcome (probability)." Wikipedia, April 18. Accessed 2018-05-11.
 Wikipedia. 2020. "Bayes' theorem." Wikipedia, August 13. Accessed 2020-08-18.
See Also
 Data Science
 Probability Distributions
 Sampling and Estimation
 Hypothesis Testing and Types of Errors
 Market Basket Analysis
 Confusion Matrix