# Probability for Data Scientists

## Summary

In mathematics, the symbol \(\pi\), pronounced "pi", denotes the ratio of the circumference of any circle to its diameter. \(\pi\) is a constant: it does not vary with the size of the circle. But many other quantities in the world are not constant.

Let's assume the letter \(X\) denotes the height of an adult in India. For a randomly chosen individual, \(X\) can take any positive real value. Hence \(X\) is a variable that takes random values in a range of positive real numbers.

Probability measures how likely or unlikely an outcome is, where the outcome is a value taken by a **Random Variable**. For instance, we can ask, "What's the probability of picking an Indian adult male who is above six feet?"

Probability is intimately related to another branch of mathematics called *Statistics*. Both are of fundamental importance to the field of Data Science.

## Milestones

## Discussion

How do we mathematically define the probability of an event? When all outcomes are equally likely, probability is the ratio of the number of desired outcomes to the number of all possible outcomes:

$$P(Outcome)=\frac{n(Desired\ Outcomes)}{n(All\ Outcomes)}$$

The desired outcomes form a subset of all possible outcomes. The value of probability therefore ranges from zero to one. The limits have the following interpretation:

- Zero: the outcome will never occur.
- One: the outcome is guaranteed to occur.

The set of all outcomes is called the **Sample Space**. When the number of outcomes is large or grouping outcomes is more suitable for a study, it's common to group one or more outcomes into what we call an **Event**. An outcome can be part of multiple events.

Could you illustrate probability with some simple examples? If we toss a coin, there are only two possible outcomes: head or tail. Assuming both outcomes are equally likely, the probability of getting a head is 1/2 = 0.5. Likewise, the probability of getting a tail is also 0.5. Taken together, the probability of getting either head or tail is 0.5 + 0.5 = 1. This makes sense since there are no other outcomes besides head or tail.

Let's roll a die. The probability of getting an odd prime number (3 or 5) is 2/6 ≈ 0.33. The probability of getting a number greater than 6 is 0/6 = 0. The probability of getting a number less than or equal to 6 is 6/6 = 1.
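The die examples can be reproduced by counting outcomes directly. The following is a minimal sketch, assuming a fair six-sided die; the names `SAMPLE_SPACE` and `probability` are illustrative and not part of any library.

```python
from fractions import Fraction

# Sample space of a fair six-sided die (each face assumed equally likely).
SAMPLE_SPACE = [1, 2, 3, 4, 5, 6]


def probability(event):
    """Return n(desired outcomes) / n(all outcomes) for a predicate `event`."""
    desired = [outcome for outcome in SAMPLE_SPACE if event(outcome)]
    return Fraction(len(desired), len(SAMPLE_SPACE))


p_odd_prime = probability(lambda x: x in (3, 5))   # odd primes on a die
p_above_six = probability(lambda x: x > 6)         # impossible event
p_at_most_six = probability(lambda x: x <= 6)      # certain event

print(p_odd_prime, float(p_odd_prime))
print(p_above_six, p_at_most_six)
```

Using `Fraction` keeps the ratios exact, so the impossible and certain events come out as exactly zero and one.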

Why does probability work? Any random variable, however random, has its own identifiable characteristics. For example, the variable may be highly probable at one value, with probabilities dropping at neighbouring values. This variability around the most probable value helps us to model random variables. In technical terms, we call this the *distribution of the random variable*. When we plot the number of occurrences against the value, we get a *distribution curve*. Random variables are typically modelled with average value, variability (spread), skewness (asymmetry) and kurtosis ("tailedness").

For example, let's consider the response time of a computing system. When the system is under high load, the average response time increases. What's more interesting is that the spread of response time around this average also increases. Thus, under different loads the response time random variable exhibits different characteristics.

Exceptions (popularly called **Outliers**) will affect probability generalisation. They have to be kept out when building realistic models.

How is probability related to distributions? Probability looks at the likelihood of a specific outcome or event. A distribution looks at all outcomes or events.

Let's take the example of a coin toss. We know from theory that the probability of a head is 0.5. However, there's also an experimental approach. For example, an experiment with 100 tosses might result in 49 heads. Hence, the estimated probability of a head is 0.49. Such repeated tosses, each with only two possible outcomes, are termed **Bernoulli Trials**.

When we list probabilities for all outcomes (head and tail), we end up with a **Bernoulli Distribution**.

We can perform a variation of the coin-toss experiment. We can toss a coin 32 times and call this a single experiment. We repeat this experiment many times, say 50,000 times. Finally, we calculate the probability of getting exactly 4 heads in an experiment of 32 coin tosses, as the fraction of experiments that gave 4 heads. Such a series of experiments is called **Binomial Trials**. When we list the probabilities for all outcomes (0 to 32 heads), we end up with a **Binomial Distribution**.

What are probability distributions and how are they useful? Probability, when identified and listed for all possible outcomes, is called a **Probability Distribution**. For instance, if we find probabilities of adult males in India with heights in ranges of 0-3.5, 3.5-4, 4-4.5, 4.5-5, 5-5.5, 5.5-6 and 6+ feet, we have a probability distribution. Such a distribution is closely related to the concept of a histogram. With a histogram, we plot the count of values within each range. With a distribution, we convert these counts into probabilities. In both cases, a graphical plot helps us to easily read the average, variability, skewness, kurtosis, outliers, etc.

Given the distribution, an event can be simulated at random within the boundaries of the distribution. In other words, to create random variables for simulation purposes, we need the distribution. For example, let's consider the number of people arriving at an ATM every 60 minutes. This can be modelled as a Poisson distribution. We can simulate queues at the ATM, calculate waiting times and decide if another ATM needs to be installed. Likewise, if we know the distributions of outcomes in a game, we can simulate winning odds and take appropriate risks.
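The ATM example can be sketched as a small simulation. The code below is only an illustration under stated assumptions: the mean of 12 arrivals per hour and the threshold of 18 are hypothetical values, and the sampler uses Knuth's multiplication method because Python's standard `random` module has no built-in Poisson generator. It also shows how a histogram of counts is converted into probabilities.

```python
import math
import random
from collections import Counter


def sample_poisson(mean, rng):
    """Draw one Poisson-distributed count using Knuth's multiplication method."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


rng = random.Random(42)        # fixed seed so the run is repeatable
MEAN_ARRIVALS_PER_HOUR = 12    # hypothetical average, not from the article
HOURS = 10_000                 # number of simulated hours

arrivals = [sample_poisson(MEAN_ARRIVALS_PER_HOUR, rng) for _ in range(HOURS)]

# Convert counts (a histogram) into probabilities (an empirical distribution).
histogram = Counter(arrivals)
distribution = {k: n / HOURS for k, n in sorted(histogram.items())}

sample_mean = sum(arrivals) / HOURS
# Estimated probability that more than 18 people arrive in one hour.
p_overload = sum(p for k, p in distribution.items() if k > 18)

print(f"sample mean = {sample_mean:.2f}")
print(f"P(arrivals > 18) ~ {p_overload:.3f}")
```

A decision such as installing another ATM would then rest on quantities like `p_overload`, estimated from the simulated distribution rather than from a single observation.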

When does probability work? Often we are unable to gather data from the entire sample space or population. We typically collect a sample of data from the population. Probability estimates become reliable when the sample size is large.
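The effect of sample size can be seen in a small simulation. This is a sketch only, assuming a fair coin and using a fixed random seed; the sample sizes are arbitrary choices. Individual runs fluctuate, but the estimate from a large sample tends to sit much closer to 0.5 than the estimate from a small one.

```python
import random

rng = random.Random(7)  # fixed seed so the run is repeatable


def estimate_p_head(num_tosses):
    """Fraction of heads in `num_tosses` simulated tosses of a fair coin."""
    heads = sum(rng.random() < 0.5 for _ in range(num_tosses))
    return heads / num_tosses


# Estimate the same probability from samples of increasing size.
estimates = {n: estimate_p_head(n) for n in (100, 10_000, 1_000_000)}
for n, p in estimates.items():
    print(f"n = {n:>9,}: estimate = {p:.4f}, error = {abs(p - 0.5):.4f}")
```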

For instance, we wish to find the probability of an Indian adult male being six feet or taller. A sample size of 100 will not give a reliable number. However, a sample size of 10,000 will be more reliable. The more, the better. Stated formally as the **Law of Large Numbers**, the probability of an event estimated from a sample will converge to the actual value for the population as the sample size grows.

What are the axioms of probability? Probability follows a few basic rules that are taken as self-evident. These rules are called the **Axioms of Probability**. They were formulated by the Russian mathematician Andrei Kolmogorov. These axioms can be explained as follows:

- The probability of any event is a non-negative real number.
- The probability of the entire sample space is one. This follows from the fact that there are no outcomes outside the sample space.
- The probability of the union of two mutually exclusive events is the sum of their individual probabilities.
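In symbols, the three rules above can be written as follows. The notation is an assumption made here for illustration: \(S\) denotes the sample space, and \(E\), \(A\) and \(B\) denote events.

$$P(E) \geq 0 \\ P(S) = 1 \\ P(A\ or\ B) = P(A) + P(B),\ if\ A\ and\ B\ are\ mutually\ exclusive$$

Kolmogorov's third axiom is usually stated for countably many mutually exclusive events; the two-event form shown here matches the wording of the list.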

What are mutually exclusive and non-mutually exclusive events? Let us say there are two events denoted \(A\) and \(B\). If \(A\) and \(B\) cannot happen together, they are mutually exclusive events. They are also called **disjoint events**. For instance, raining is event \(A\) and cycling is event \(B\). Largely, these two are mutually exclusive.

$$P(A\ and\ B) = 0\ or\ negligible \\ \Rightarrow P(A\ or\ B)=P(A)+P(B)-P(A\ and\ B) \\ \Rightarrow P(A\ or\ B)=P(A)+P(B),\ since\ P(A\ and\ B)=0$$

On the contrary, if \(A\) and \(B\) can happen together, they are non-mutually exclusive events. For instance, cycling is event \(A\) and listening to music is event \(B\). These can happen at the same time.

$$P(A\ and\ B)\neq0 \\ \Rightarrow P(A\ and\ B) = P(A)+P(B)-P(A\ or\ B) \\ \Rightarrow P(A\ or\ B)=P(A)+P(B)-P(A\ and\ B)$$
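Both cases can be checked by treating events as sets of outcomes. The sketch below assumes a fair die; the events `even`, `odd` and `above_three` are illustrative choices, not taken from the text.

```python
from fractions import Fraction

SAMPLE_SPACE = frozenset(range(1, 7))  # faces of a fair die


def prob(event):
    """Probability of an event (a set of outcomes) under equally likely outcomes."""
    return Fraction(len(event & SAMPLE_SPACE), len(SAMPLE_SPACE))


even = {2, 4, 6}
odd = {1, 3, 5}
above_three = {4, 5, 6}

# Mutually exclusive: no outcome is both even and odd, so probabilities simply add.
p_even_and_odd = prob(even & odd)
p_even_or_odd = prob(even | odd)

# Not mutually exclusive: 4 and 6 are both even and above three,
# so the overlap must be subtracted once.
lhs = prob(even | above_three)
rhs = prob(even) + prob(above_three) - prob(even & above_three)

print(p_even_and_odd, p_even_or_odd)
print(lhs, rhs)
```

Set intersection (`&`) plays the role of "and", while set union (`|`) plays the role of "or".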

How is odds different from probability? Odds is defined as the ratio of the probability of an event happening to the probability of the same event not happening. Consider the following:

$$\frac{P(Buying\ milk)}{P(Not\ buying\ milk)}$$

The above ratio tells us whether milk-buying customers outnumber customers not buying milk. If the ratio is more than 1, the odds are in favour of the hypothesis (buying milk). If it is less than 1, the odds are against the hypothesis.
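Since an event either happens or does not, the probability of it not happening is one minus the probability of it happening, which gives a direct conversion between odds and probability. The following is a minimal sketch; the function names and the value 0.75 are hypothetical and chosen only for illustration.

```python
def odds_from_probability(p):
    """Odds in favour of an event whose probability is p (0 <= p < 1)."""
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1 - p)


def probability_from_odds(odds):
    """Inverse conversion: probability of an event given its odds."""
    if odds < 0:
        raise ValueError("odds must be non-negative")
    return odds / (1 + odds)


P_BUYING_MILK = 0.75  # hypothetical value, not from the article
milk_odds = odds_from_probability(P_BUYING_MILK)

print(milk_odds)                         # odds in favour of buying milk
print(probability_from_odds(milk_odds))  # converts back to the probability
```

Unlike probability, odds are not bounded by one: they grow without limit as the probability approaches one.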

Could you explain Joint Probability and Conditional Probability? Let's explain with an example. Joint probability is the proportion of customers who buy both Bread and Jam. Conditional probability is the proportion of customers who buy Bread given that they've already bought Jam, or vice versa.

Let \(A\) be the event that a customer buys Bread and \(B\) the event that a customer buys Jam. The numbers are for illustrative purposes.

$$P(A) = \frac{n(Customers\ buying\ Bread)}{n(Customers)} = \frac{90}{1000} \\ P(B) = \frac{n(Customers\ buying\ Jam)}{n(Customers)} = \frac{50}{1000}$$

*Joint Probability*

$$P(Customers\ buying\ Bread\ and\ Jam) = P(A\ and\ B) \\ = \frac{n(A\ and\ B)}{n(Customers)} = \frac{40}{1000}$$

*Conditional Probability*

$$P(Customers\ buying\ Bread\ when\ they\ already\ bought\ Jam) \\ = P(A|B)=\frac{n(A\ and\ B)}{n(B)} = \frac{40}{50} \\ P(Customers\ buying\ Jam\ when\ they\ already\ bought\ Bread) = P(B|A)=\frac{n(A\ and\ B)}{n(A)}= \frac{40}{90}$$

Conditional Probability reduces the sample space based on the condition. Rather than considering all customers (1000), we only consider customers who bought Jam (50) for \(P(A|B)\), or customers who bought Bread (90) for \(P(B|A)\).
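The same numbers can be worked through in code. This sketch uses the illustrative counts from the example and shows that dividing the joint probability by the probability of the condition gives the same result as dividing the counts; the variable names are assumptions made for this example.

```python
from fractions import Fraction

# Illustrative counts from the Bread and Jam example.
N_CUSTOMERS = 1000
N_BREAD = 90          # n(A)
N_JAM = 50            # n(B)
N_BREAD_AND_JAM = 40  # n(A and B)

p_bread = Fraction(N_BREAD, N_CUSTOMERS)
p_jam = Fraction(N_JAM, N_CUSTOMERS)
p_joint = Fraction(N_BREAD_AND_JAM, N_CUSTOMERS)

# Conditional probability: joint probability divided by the probability of the condition.
p_bread_given_jam = p_joint / p_jam      # P(A|B)
p_jam_given_bread = p_joint / p_bread    # P(B|A)

print(p_joint, p_bread_given_jam, p_jam_given_bread)
```

The total number of customers cancels out in the division, which is why the conditional probabilities depend only on the reduced sample space.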

Can you extend the above examples to explain odds?

$$Odds(Buying\ Bread\ with\ Jam) = \frac{P(Buying\ Bread\ with\ Jam)}{P(Buying\ Bread\ without\ Jam)}=\frac{\frac{40}{1000}}{\frac{90-40}{1000}}=\frac{40}{50}=0.8 \\ Odds(Buying\ Jam\ with\ Bread) = \frac{P(Buying\ Jam\ with\ Bread)}{P(Buying\ Jam\ without\ Bread)}=\frac{\frac{40}{1000}}{\frac{50-40}{1000}}=\frac{40}{10}=4$$

For every 4 customers who buy Bread with Jam, 5 customers buy only Bread. For every 4 customers who buy Jam with Bread, one customer buys only Jam. Thus, Odds(Buying Jam with Bread) > Odds(Buying Bread with Jam). This implies:

- Jam drives Bread purchase.
- A sizable share of Bread buyers prefer Bread without Jam.

What is Bayes Probability? Bayes Probability uses **prior** probability, accounts for new **evidence** and results in **posterior** probability. Often, the prior probability is sourced from experts when it is hard to estimate from data. Bayes probability basically revises the probability with every new piece of evidence. This probability will converge to its true value over many revisions on repeated evidence.

Could you give some applications of Bayes Probability? One application is in spam filtering. The idea is to classify an email as spam if it contains the word "Viagra". But not every email with this word may be spam. Hence we need to calculate the probabilities based on our prior knowledge of the number of spam mails received. \(P(spam)\) is the prior probability of a mail in the inbox being spam.

$$P(spam|Viagra) = P(spam) * \frac{P(Viagra|spam)}{P(Viagra)} \\ where\ P(Viagra)=P(Viagra|spam)*P(spam)+P(Viagra|not\ spam)*P(not\ spam)$$
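A numeric sketch may help here. It applies Bayes' theorem with \(P(Viagra)\) expanded by the law of total probability, that is, as the probability of the word in spam weighted by \(P(spam)\) plus the probability of the word in non-spam weighted by \(P(not\ spam)\). All the numbers and the function name `posterior_spam` are hypothetical, chosen only for illustration.

```python
def posterior_spam(p_spam, p_word_given_spam, p_word_given_not_spam):
    """P(spam | word) from Bayes' theorem, with P(word) from total probability."""
    p_not_spam = 1 - p_spam
    p_word = p_word_given_spam * p_spam + p_word_given_not_spam * p_not_spam
    return p_spam * p_word_given_spam / p_word


# Hypothetical inputs: 20% of mail is spam, the word appears in 40% of spam
# and in 1% of non-spam mail.
posterior = posterior_spam(p_spam=0.2, p_word_given_spam=0.4, p_word_given_not_spam=0.01)
print(round(posterior, 3))

# If the word never appears in non-spam mail, every mail containing it is spam.
only_in_spam = posterior_spam(p_spam=0.2, p_word_given_spam=0.4, p_word_given_not_spam=0.0)
print(only_in_spam)
```

With these inputs the posterior is far higher than the prior of 0.2, because the word is much more common in spam than in mail overall.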

The word *Viagra* could be in spam mails and non-spam mails. \(P(Viagra|spam)/P(Viagra)\) is the evidence from data: it compares how often the word *Viagra* appears in spam mails with how often it appears in mails overall. \(P(spam|Viagra)\) is the posterior probability of a mail being spam given that it has the word *Viagra* in it. When 100% of mails with *Viagra* are spam, then \(P(spam|Viagra)=1\). When less than 100% of mails with *Viagra* are spam, then \(P(spam|Viagra) < 1\). When the word is more common in spam mails than in mails overall, \(P(spam|Viagra) > P(spam)\).

What are Frequentist and Bayesian approaches to Probability? *Frequentists* lean on the **Law of Large Numbers** to back their probability estimate. For instance, a coin toss has equal probability of head or tail. This is derived from a large number of trials. Frequentists believe any deviation from equal probability is due to chance.

*Bayesians* argue that **belief or prior knowledge** should be accounted for while calculating probability. Belief suggests a probability. New evidence may notch up or notch down the probability and form a new belief. Bayesians do not require the backing of the Law of Large Numbers, but leverage it where applicable. Probability may be revised with each new piece of evidence, eventually converging to the true probability after repeated revisions. For instance, the probability of a new robot failing at a task starts with a belief, say \(p\), and as new evidence arrives, we revise \(p\).

Frequentist and Bayesian approaches can be applied to all estimates, including probability. While the two approaches are distinct, Bayesian probability complements Frequentist probability when:

- System is not yet stable.
- There's insufficient data to get backing from Law of Large Numbers.
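The robot example can be sketched in code to contrast the two approaches. This is an illustration under loud assumptions: the prior belief is encoded as a Beta distribution (a common modelling choice that the text itself does not prescribe), the true failure probability of 0.3 is hypothetical, and the data is simulated.

```python
import random

rng = random.Random(3)   # fixed seed so the run is repeatable
TRUE_P_FAIL = 0.3        # hypothetical; unknown to the observer in practice

# Prior belief encoded as Beta(alpha, beta): an assumption of this sketch.
alpha, beta = 1.0, 9.0   # prior mean = alpha / (alpha + beta) = 0.1

failures = trials = 0
for _ in range(2000):
    failed = rng.random() < TRUE_P_FAIL
    trials += 1
    failures += failed
    # Bayesian revision: every observation updates the belief.
    alpha += failed
    beta += not failed

frequentist_estimate = failures / trials        # relative frequency of failures
bayesian_estimate = alpha / (alpha + beta)      # mean of the revised belief

print(frequentist_estimate, bayesian_estimate)
```

With only a handful of trials the Bayesian estimate stays close to the prior belief, while with many trials both estimates are driven by the data and end up close to each other.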

## See Also

- Data Science
- Probability Distributions
- Sampling and Estimation
- Hypothesis Testing and Types of Errors
- Market Basket Analysis
- Confusion Matrix

## Further Reading

- Probability: the basics
- Joint, Marginal, and Conditional Probabilities
- Bayes' Theorem and Conditional Probability
- Frequentist And Bayesian Approaches In Statistics