# Probability for Data Scientists

## Summary

In mathematics, the notation \(\pi\), pronounced "pi", denotes the ratio of the circumference of any circle to its diameter. \(\pi\) is a constant: it does not vary with the size of the circle. But many other quantities in the world are not constant.

Let's say the letter \(X\) denotes the height of adults in India. \(X\) can take any positive real value for a randomly chosen individual. Hence \(X\) is a variable that takes random values in a range of positive real numbers.

Probability measures how likely or unlikely an outcome is, where the outcome is the value of a **Random Variable**. For instance, we can ask, "What's the probability of picking an Indian adult male who is above six feet tall?"

Probability is intimately related to another branch of mathematics called *Statistics*. Both are of fundamental importance to the field of Data Science.

## Milestones

## Discussion

How do we mathematically define the probability of an event? Probability is the ratio of the number of desired outcomes to the number of all possible outcomes:

$$P(Outcome)=\frac{n(Desired\ Outcomes)}{n(All\ Outcomes)}$$

Any desired outcome is a subset of all possible outcomes. The value of probability therefore ranges from zero to one. The limits have the following interpretation:

- Zero: the outcome will never occur.
- One: the outcome is guaranteed to occur.

The set of all outcomes is called the **Sample Space**. When the number of outcomes is large, or when grouping outcomes is more suitable for a study, it's common to group one or more outcomes into what we call an **Event**. An outcome can be part of multiple events.

Could you illustrate probability with some simple examples? If we toss a coin, there are only two possible outcomes: head or tail. Assuming both outcomes are equally likely, the probability of getting a head is 1/2 = 0.5. Likewise, the probability of getting a tail is also 0.5. Taken together, the probability of getting either head or tail is 0.5 + 0.5 = 1. This makes sense since there are no other outcomes besides head or tail.

Let's roll a die. The probability of getting an odd prime number (3 or 5) is 2/6 ≈ 0.33. The probability of getting a number greater than 6 is 0/6 = 0. The probability of getting a number less than or equal to 6 is 6/6 = 1.
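The die examples above can be checked with a short Python sketch using the counting definition of probability:

```python
from fractions import Fraction

def probability(desired, sample_space):
    """Classical probability: count of desired outcomes over all outcomes."""
    return Fraction(len(set(desired) & set(sample_space)), len(sample_space))

die = set(range(1, 7))  # sample space of a fair six-sided die

p_odd_prime = probability({3, 5}, die)  # odd primes: 2/6
p_above_six = probability({7}, die)     # impossible event: 0/6
p_up_to_six = probability(die, die)     # certain event: 6/6

print(p_odd_prime, p_above_six, p_up_to_six)  # 1/3 0 1
```

Using `Fraction` keeps the ratios exact rather than rounding them to decimals.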

Why does probability work? Any random variable, however random, has its own identifiable characteristics. For example, the variable may be highly probable at one value, with probabilities dropping at neighbouring values. This variability around the most probable value helps us model random variables. In technical terms, we call this the *distribution of the random variable*. When we plot the number of occurrences against the value, we get a *distribution curve*. Random variables are typically characterised by their average value, variability (spread), skewness (asymmetry) and kurtosis ("tailedness"). For example, let's consider the response time of a computing system. When the system is under high load, the average response time increases. What's more interesting is that the spread of response times around this average also increases. Thus, under different loads the response time random variable exhibits different characteristics.
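As an illustration, we can simulate the response-time example. The load levels and the normal distributions below are made-up assumptions, chosen only to show how the average and the spread change together:

```python
import random
import statistics

random.seed(42)

# Hypothetical response times in milliseconds: under high load both the
# average and the spread (variability) are assumed to be larger.
low_load = [random.gauss(100, 5) for _ in range(10_000)]
high_load = [random.gauss(250, 40) for _ in range(10_000)]

for name, sample in (("low load", low_load), ("high load", high_load)):
    mean = statistics.mean(sample)
    spread = statistics.stdev(sample)
    print(f"{name}: mean={mean:.1f} ms, spread={spread:.1f} ms")
```

The printed summaries show the high-load sample with both a larger mean and a larger spread, mirroring the description above.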

Exceptions (popularly called **Outliers**) will affect probability generalisation. They have to be kept out when building realistic models.

How is probability related to distributions? Probability looks at the likelihood of a specific outcome or event. A distribution looks at all outcomes or events.

Let's take the example of a coin toss. We know from theory that the probability of a head is 0.5. However, there's also an experimental approach. For example, experimenting with 100 tosses might result in 49 heads. Hence, the experimental probability of a head is 0.49. Such a series of tosses is termed **Bernoulli Trials**. When we list the probabilities for all outcomes (head and tail), we end up with a **Bernoulli Distribution**.

We can perform a variation of the coin-toss experiment. We toss a coin 32 times and call this a single experiment. We repeat this experiment many times, say 50,000 times. Finally, we calculate the probability of getting 4 heads in an experiment of 32 coin tosses. Such a series of experiments is called **Binomial Trials**. When we list the probabilities for all outcomes, we end up with a **Binomial Distribution**.

What are probability distributions and how are they useful? Probability, when identified and listed for all possible outcomes, is called a **Probability Distribution**. For instance, if we find the probabilities of adult males in India with heights in the ranges 0-3.5, 3.5-4, 4-4.5, 4.5-5, 5-5.5, 5.5-6 and 6+ feet, we have a probability distribution. Such a distribution is closely related to the concept of a histogram. With a histogram, we plot the count of values within each range. With a distribution, we convert these counts into probabilities. In both cases, a graphical plot helps us easily read the average, variability, skewness, kurtosis, outliers, etc.

Given the distribution, an event can be simulated at random within the boundaries of the distribution. In other words, to create random variables for simulation purposes, we need the distribution. For example, let's consider the number of people arriving at an ATM every 60 minutes. This can be modelled as a Poisson distribution. We can simulate queues at the ATM, calculate waiting times and decide if another ATM needs to be installed. Likewise, if we know the distributions of outcomes in a game, we can simulate winning odds and take appropriate risks.

When does probability work? Often we are unable to gather data from the entire sample space or population. We typically collect a sample of data from the population. Probability works when the sample size is large.
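The ATM example can be simulated once we assume an arrival rate; the rate of 12 customers per hour below is invented for illustration. Poisson arrival counts are generated here from exponential inter-arrival times:

```python
import random
import statistics

random.seed(7)
rate = 12  # assumed arrivals per hour (illustrative)

def arrivals_in_an_hour(rate):
    """Count Poisson arrivals in one hour via exponential inter-arrival times."""
    t, count = 0.0, 0
    while True:
        t += random.expovariate(rate)  # hours until the next customer
        if t > 1.0:
            return count
        count += 1

counts = [arrivals_in_an_hour(rate) for _ in range(10_000)]
# A Poisson variable's mean and variance both equal the rate.
print(statistics.mean(counts), statistics.pvariance(counts))
```

Both printed numbers come out close to 12, the hallmark of a Poisson distribution; the same simulated counts could feed a queueing model for waiting times.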

For instance, suppose we wish to find the probability of an Indian adult male being six feet or taller. A sample size of 100 will not give a reliable number. However, a sample size of 10,000 will be more reliable. The more, the better. Stated formally as the **Law of Large Numbers**, the probability of an event estimated from a sample converges to its actual value in the population as the sample size grows.

What are the axioms of probability? There are basic rules in probability, called the **Axioms of Probability**. These were formulated by the Russian mathematician Andrei Kolmogorov. The axioms can be stated as follows:

- The probability of any event is a non-negative real number.
- The probability of the entire sample space is one. This follows from the fact that there are no events outside the sample space.
- The probability of the union of two mutually exclusive events is the sum of their individual probabilities.
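The Law of Large Numbers described above is easy to watch in action. In this sketch, the empirical probability of heads from simulated coin tosses drifts towards the true value 0.5 as the sample grows:

```python
import random

random.seed(3)

def estimate_p_head(tosses):
    """Empirical probability of heads in a given number of simulated tosses."""
    heads = sum(random.random() < 0.5 for _ in range(tosses))
    return heads / tosses

# Larger samples give estimates closer to the true value 0.5.
for n in (100, 10_000, 1_000_000):
    print(n, estimate_p_head(n))
```

With 100 tosses the estimate can easily be off by several percentage points; with a million tosses it rarely strays far from 0.5.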

What are mutually exclusive and non-mutually exclusive events? Let us say there are two events denoted \(A\) and \(B\). If \(A\) and \(B\) can't occur together, they're mutually exclusive. They're also called **disjoint events**. For instance, cooking is event \(A\) and cycling is event \(B\). These two are mutually exclusive: a person doesn't cook and cycle at the same time.

$$P(A\ and\ B) = 0 \\ \Rightarrow P(A\ or\ B)=P(A)+P(B)-P(A\ and\ B) \\ \Rightarrow P(A\ or\ B)=P(A)+P(B),\ since\ P(A\ and\ B)=0$$

On the contrary, if \(A\) and \(B\) can happen together, they are non-mutually exclusive events. For instance, cooking is event \(A\) and listening to music is event \(B\). They can happen at the same time.

$$P(A\ and\ B)\neq 0 \\ \Rightarrow P(A\ or\ B)=P(A)+P(B)-P(A\ and\ B)$$
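Both identities can be verified on a small sample space such as a die roll. Here \(A\) ("even number") and \(B\) ("greater than 3") overlap, while \(C\) and \(D\) are disjoint:

```python
from fractions import Fraction

die = set(range(1, 7))

def p(event):
    """Probability of an event over the die's sample space."""
    return Fraction(len(event & die), len(die))

# Non-mutually exclusive events: even numbers, and numbers above 3.
A, B = {2, 4, 6}, {4, 5, 6}
assert p(A | B) == p(A) + p(B) - p(A & B)  # inclusion-exclusion holds

# Mutually exclusive events: the correction term vanishes.
C, D = {1}, {2}
assert p(C & D) == 0
assert p(C | D) == p(C) + p(D)

print(p(A | B), p(C | D))  # 2/3 1/3
```
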

Could you explain joint, conditional and marginal probabilities? Let's consider two events: buying Bread (\(A\)) and buying Jam (\(B\)). Marginal probability is the proportion of customers who bought Bread, regardless of whether they bought Jam. It's called marginal because it occurs at the margins of the probability table (see figure). Joint probability is the proportion of customers who bought both Bread and Jam. Conditional probability is the proportion of customers who are likely to buy Bread when they've already bought Jam, and vice versa.

*Marginal Probability*

$$P(A) = \frac{n(Customers\ buying\ Bread)}{n(Customers)} = \frac{90}{1000} \\ P(B) = \frac{n(Customers\ buying\ Jam)}{n(Customers)} = \frac{50}{1000}$$

*Joint Probability*

$$P(Customers\ buying\ Bread\ and\ Jam) = P(A\ and\ B) = \frac{n(A\ and\ B)}{n(Customers)} = \frac{40}{1000}$$

*Conditional Probability*

$$P(Customers\ buying\ Bread\ when\ they\ already\ bought\ Jam) \\ = P(A|B)=\frac{n(A\ and\ B)}{n(B)} = \frac{40}{50} \\ P(Customers\ buying\ Jam\ when\ they\ already\ bought\ Bread) \\ = P(B|A)=\frac{n(A\ and\ B)}{n(A)}= \frac{40}{90}$$

Conditional probability reduces the sample space based on the condition. Rather than considering all customers (1000), we only consider customers who bought Bread (90) or customers who bought Jam (50), that is, the marginal counts.
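Using the counts from the example (1000 customers, 90 buying Bread, 50 buying Jam, 40 buying both), these probabilities are a few exact divisions:

```python
from fractions import Fraction

customers, bread, jam, both = 1000, 90, 50, 40

p_bread = Fraction(bread, customers)       # marginal P(A) = 90/1000
p_jam = Fraction(jam, customers)           # marginal P(B) = 50/1000
p_both = Fraction(both, customers)         # joint P(A and B) = 40/1000

# Conditioning shrinks the sample space to the marginal count.
p_bread_given_jam = Fraction(both, jam)    # P(A|B) = 40/50
p_jam_given_bread = Fraction(both, bread)  # P(B|A) = 40/90

# Chain rule: the joint probability factorises through either conditional.
assert p_both == p_bread_given_jam * p_jam
assert p_both == p_jam_given_bread * p_bread
```

The two assertions confirm the chain rule \(P(A\ and\ B) = P(A|B)\,P(B) = P(B|A)\,P(A)\) on this data.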

How are odds different from probability? Odds are defined as the ratio of the chances of an event happening to the chances of the same event not happening.

Consider the ratio of customers buying milk to those not buying milk. If this ratio is more than 1, the odds are in favour of the hypothesis (buying milk); otherwise, the odds are against the hypothesis. Now consider customers buying bread and jam:

$$Odds(Buying\ Bread\ with\ Jam) = \frac{P(Buying\ Bread\ with\ Jam)}{P(Buying\ Bread\ without\ Jam)}=\frac{\frac{40}{1000}}{\frac{90-40}{1000}}=\frac{40}{50}=0.8 \\ Odds(Buying\ Jam\ with\ Bread) = \frac{P(Buying\ Jam\ with\ Bread)}{P(Buying\ Jam\ without\ Bread)}=\frac{\frac{40}{1000}}{\frac{50-40}{1000}}=\frac{40}{10}=4$$

For every 0.8 customers who buy Bread with Jam, one customer buys Bread without Jam. For every 4 customers who buy Jam with Bread, one customer buys Jam without Bread. Thus, Odds(Buying Jam with Bread) > Odds(Buying Bread with Jam). This implies:

- Jam drives Bread purchases.
- A sizable share of Bread buyers prefer Bread without Jam.
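With the same counts, the odds computation is direct. Since both chances share the same denominator (all customers), the 1/1000 factors cancel and a ratio of counts suffices:

```python
from fractions import Fraction

customers, bread, jam, both = 1000, 90, 50, 40

# Odds = P(event happening) / P(event not happening); the common
# 1/customers factor cancels, leaving a ratio of counts.
odds_bread_with_jam = Fraction(both, bread - both)  # 40/50
odds_jam_with_bread = Fraction(both, jam - both)    # 40/10

print(float(odds_bread_with_jam), float(odds_jam_with_bread))  # 0.8 4.0
```
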

What is Bayes' Theorem? Given hypothesis \(H\) and evidence \(E\), Bayes' Theorem can be written as \(P(H|E) = P(E|H) \cdot P(H) / P(E)\). Bayes' Theorem, also called Bayes' Rule or Bayes' Law, starts with a **prior** probability \(P(H)\), accounts for new **evidence** via \(P(E|H)\) and results in a **posterior** probability \(P(H|E)\). Often, the prior probability is sourced from experts due to the challenges of estimating it from evidence. Bayesian probability revises the probability with every new piece of evidence. The probability converges to its true value over many revisions on repeated evidence.

The Bayesian approach is used in fields such as epistemology, statistics, and inductive logic. It relies on conditional probabilities and empirical learning. The key insight of the theorem is "that a hypothesis is confirmed by any body of data that its truth renders probable".

Could you give some applications of Bayesian probability? One application is spam filtering, where the idea is to classify an email as spam or not. An email may or may not contain the word "Viagra", and not all mails with this word are spam. We calculate probabilities based on our prior knowledge of spam mails received: \(P(spam)\) is the prior proportion of spam mails in the inbox. The probability of the word appearing in previous spam mails gives us a better estimate: \(P(Viagra|spam)\) is the **likelihood** and \(P(Viagra)\) is the **marginal likelihood**.

$$P(spam|Viagra) = P(spam) \cdot \frac{P(Viagra|spam)}{P(Viagra)} \\ where\ P(Viagra)=P(Viagra|spam)\,P(spam)+P(Viagra|not\ spam)\,P(not\ spam)$$

\(P(Viagra|spam)/P(Viagra)\) is the factor by which the evidence (the word) updates our belief. \(P(spam|Viagra)\) is the posterior probability of a mail being spam given that it contains the word. When the word is no more likely in spam mails than in mail overall, \(P(spam|Viagra)=P(spam)\): the word carries no evidence. When the word is more common in spam mails, \(P(spam|Viagra) > P(spam)\).
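A numerical sketch of the spam filter; the prior and the likelihoods below are invented for illustration, not taken from any real mailbox:

```python
def posterior(prior, likelihood, marginal):
    """Bayes' theorem: P(H|E) = P(E|H) * P(H) / P(E)."""
    return likelihood * prior / marginal

p_spam = 0.2             # assumed prior: 20% of mail is spam
p_word_given_spam = 0.4  # assumed likelihood of the word in spam
p_word_given_ham = 0.01  # assumed likelihood of the word in non-spam

# Marginal likelihood via the law of total probability.
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

p_spam_given_word = posterior(p_spam, p_word_given_spam, p_word)
print(round(p_spam_given_word, 3))  # about 0.909, well above the 0.2 prior
```

Because the word is far more common in spam than in other mail, a single occurrence lifts the belief from 20% to roughly 91%.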

What are the Frequentist and Bayesian approaches to probability? *Frequentists* lean on the **Law of Large Numbers** to back their probability estimates. For instance, a coin toss has equal probability of head or tail. This is derived from a large number of trials. Frequentists believe any deviation from equal probability is due to chance.

*Bayesians* argue that **belief or prior knowledge** should be accounted for while calculating probability. Belief suggests a probability. New evidence may notch the probability up or down and form a new belief. Bayesians do not require the backing of the Law of Large Numbers, but leverage it where applicable. Probability may be revised with each new piece of evidence, eventually converging to the true probability after repeated revisions. For instance, the probability of a new robot failing at a task starts with a belief, say \(p\); as new evidence arrives, we revise \(p\).

Frequentist and Bayesian approaches can be applied to all estimates, including probability. While the two approaches are distinct, Bayesian probability complements Frequentist probability when:

- The system is not yet stable.
- There's insufficient data to get backing from the Law of Large Numbers.
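The robot example can be sketched with Beta-Bernoulli updating, a standard conjugate scheme for revising a probability with each new observation. The flat prior and the observation sequence below are invented for illustration:

```python
# Start from a flat prior belief Beta(1, 1) about the robot's failure
# probability, then revise it after each observed attempt.
alpha, beta = 1, 1  # prior pseudo-counts: failures, successes
observations = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]  # 1 = failure, 0 = success

for outcome in observations:
    alpha += outcome                 # one more observed failure
    beta += 1 - outcome              # one more observed success
    p_fail = alpha / (alpha + beta)  # posterior mean after this evidence
    print(f"after {alpha + beta - 2} attempts: P(failure) = {p_fail:.3f}")
```

Each observation nudges the belief; after these ten attempts (2 failures), the posterior mean settles at 3/12 = 0.25, and with more evidence it would converge to the robot's true failure rate.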

## References

- Aldrich, John. 2005. "Figures from the History of Probability and Statistics." University of Southampton, June. Updated October 2012. Accessed 2018-04-22.
- Apostol, Tom M. 1969. "Calculus, Volume 2: A short history of probability." Second Edition, John Wiley & Sons, June. Accessed 2018-04-23.
- Bayes, Thomas. 1763. "An essay towards solving a Problem in the Doctrine of Chances." Philosophical Transactions of the Royal Society of London, Vol. 53, pp. 370-418. Accessed 2018-04-22.
- Bellhouse, David. 2005. "Decoding Cardano's Liber de Ludo Aleae." Historia Mathematica, Vol. 32, No. 2, May, pp. 180-202. Accessed 2018-04-23.
- Brooks-Bartlett, Jonny. 2018. "Probability concepts explained: probability distributions (introduction part 3)." Towards Data Science, on Medium, September 10. Accessed 2020-08-18.
- Brownlee, Jason. 2019. "A Gentle Introduction to Joint, Marginal, and Conditional Probability." Machine Learning Mastery, September 27. Updated 2020-05-06. Accessed 2020-08-18.
- Buckingham, Steven D. 2011. "Bench philosophy: Bayesian statistics: Confidence Multiplied by Evidence." Lab Times Online, April. Updated 2012-11-10. Accessed 2018-04-22.
- CFI. 2020. "Poisson Distribution." CFI Education, June 6. Accessed 2020-08-18.
- Cimbala, John M. 2010. "Probability Density Functions." ME345, Penn State Univ, January 20. Accessed 2020-08-18.
- Cruzan, Jeff. 2018. "Probability and Statistics: Discrete Probability." xaktly.com. Accessed 2018-04-29.
- DeepAI. 2019. "Odds (Probability)." ML Glossary and Terms, DeepAI, May 17. Accessed 2020-08-18.
- Fernandez-Granda, Carlos. 2017. "Probability and Statistics for Data Science." Center for Data Science, NYU, August. Accessed 2020-08-18.
- Ghemri, Lila. 2020. "Probabilistic Learning –Classification using Naïve Bayes." CS497, Department of Computer Science, Texas Southern University. Accessed 2020-08-18.
- Haslwanter, Thomas. 2016. "Characterizing a Distribution." In: An Introduction to Statistics with Python, Springer. Accessed 2020-08-18.
- Joyce, James. 2003. "Bayes’ Theorem." Stanford Encyclopedia of Philosophy, June 28. Updated 2003-09-30. Accessed 2020-08-18.
- Kirkpatrick, K. L. 2012. "Sample Space, Events and Probability." Dept of Math, Univ of Illinois. Accessed 2020-08-18.
- Lightner, James E. 1991. "A Brief Look at the History of Probability and Statistics." The Mathematics Teacher, vol. 84, no. 8, November, pp. 623-630. Accessed 2018-04-22.
- NIST. 2003. "Poisson Distribution." Section 1.3.6.6.19 in: Engineering Statistics Handbook, NIST/SEMATECH, June 1. Accessed 2020-08-18.
- One Minute Economics. 2017. "Probabilities Explained in One Minute - Probability Definition, Formula and Misconceptions." Youtube, April 11. Accessed 2018-04-29.
- Orloff, Jeremy and Jonathan Bloom. 2014. "Comparison of frequentist and Bayesian inference." Introduction to Probability and Statistics, Class 20 18.05, MIT OpenCourseWare, Spring." Accessed 2018-05-09.
- Owen, Sean. 2015. "Common Probability Distributions: The Data Scientist’s Crib Sheet". Cloudera Blog, December 3. Accessed 2018-04-29.
- Pannetier, Alain. 2012. "Assymetric Normal Probability Distribution." Mathematics, StackExchange, August 31. Accessed 2018-04-29.
- Routledge, Richard. 2018. "Law of large numbers." Encyclopædia Britannica. Accessed 2018-05-07.
- Shafer, Glen. 1993. "The Early Development of Mathematical Probability." SemanticScholar. Accessed 2018-04-22.
- Sourget, Camille. 2018. "First edition of a founding work of the theory of probability." Accessed 2018-04-23.
- Stomp on Step1. 2018. "Definition and Calculation of Odds Ratio & Relative Risk." Accessed 2018-04-29.
- Taylor, Courtney. 2017. "What Are Probability Axioms?" ThoughtCo., September 28. Accessed 2018-04-29.
- Walker, John. 2018. "Introduction to Probability and Statistics." The RetroPsychoKinesis Project, University of Kent at Canterbury, UK. Accessed 2018-04-29.
- Weisstein, Eric W. 2008a. "Bernoulli Distribution." MathWorld--A Wolfram Web Resource, November 23. Accessed 2020-08-18.
- Weisstein, Eric W. 2008b. "Binomial Distribution." MathWorld--A Wolfram Web Resource, November 23. Accessed 2020-08-18.
- Wikipedia. 2018a. "Bayesian probability." Wikipedia, April 10. Accessed 2018-04-22.
- Wikipedia. 2018b. "Outcome (probability)." Wikipedia, April 18. Accessed 2018-05-11.
- Wikipedia. 2020. "Bayes' theorem." Wikipedia, August 13. Accessed 2020-08-18.

## Tags

## See Also

- Data Science
- Probability Distributions
- Sampling and Estimation
- Hypothesis Testing and Types of Errors
- Market Basket Analysis
- Confusion Matrix

## Further Reading

- Probability: the basics
- Joint, Marginal, and Conditional Probabilities
- Bayes' Theorem and Conditional Probability
- Frequentist And Bayesian Approaches In Statistics