# Probability for Data Scientists

In mathematics, the symbol $$\pi$$, pronounced "pi", denotes the ratio of the circumference of any circle to the diameter of the same circle. $$\pi$$ is a constant: it doesn't vary with the size of the circle. But many other quantities in the world are not constant.

Let's assume the letter $$X$$ denotes the height of adults in India. For a randomly chosen individual, $$X$$ can take any value within a range of positive real numbers. Hence $$X$$ is a variable that takes random values, that is, a random variable.

Probability measures how likely or unlikely an outcome is, where the outcome is a value (or range of values) taken by a random variable. For instance, we can ask, "What's the probability of picking an Indian adult male who is above six feet?"

Probability is intimately related to another branch of mathematics called Statistics. Both are of fundamental importance to the field of Data Science.

## Discussion

• How do we mathematically define the probability of an event?

Mathematically, when all outcomes are equally likely, probability is the ratio of the number of desired outcomes to the number of all possible outcomes:

$$P(Outcome)=\frac{n(Desired\ Outcomes)}{n(All\ Outcomes)}$$

Any desired outcome is a subset of all possible outcomes. The value of probability therefore ranges from zero to one. The limits have the following interpretation:

• Zero: the outcome will never occur.
• One: the outcome is guaranteed to occur.

The set of all outcomes is called the Sample Space. When the number of outcomes is large or grouping outcomes is more suitable for a study, it's common to group one or more outcomes into what we call an Event. An outcome can be part of multiple events.

• Could you illustrate probability with some simple examples?

If we toss a coin, there are only two possible outcomes: head or tail. Assuming both outcomes are equally likely, the probability of getting a head is 1/2 = 0.5. Likewise, the probability of getting a tail is also 0.5. Taken together, the probability of getting either head or tail is 0.5 + 0.5 = 1. This makes sense since there are no other outcomes besides head or tail.

Let's roll a die. The probability of getting an odd prime number (3 or 5) is 2/6 ≈ 0.33. The probability of getting a number greater than 6 is 0/6 = 0. The probability of getting a number less than or equal to 6 is 6/6 = 1.
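The die examples can be checked by simply counting outcomes. The following is a minimal sketch that assumes a fair six-sided die; the helper name `probability` is only illustrative.

```python
from fractions import Fraction

def probability(event, sample_space):
    """Classical probability: favourable outcomes / all (equally likely) outcomes."""
    favourable = [outcome for outcome in sample_space if event(outcome)]
    return Fraction(len(favourable), len(sample_space))

die = [1, 2, 3, 4, 5, 6]  # sample space of a fair die

p_odd_prime = probability(lambda x: x in (3, 5), die)
p_above_six = probability(lambda x: x > 6, die)
p_at_most_six = probability(lambda x: x <= 6, die)

print(p_odd_prime, p_above_six, p_at_most_six)  # 1/3 0 1
```

Using `Fraction` keeps the results exact, so 2/6 shows up as 1/3 rather than a rounded 0.33.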

• Why does probability work?

Any random variable, however random, will have its own identifiable characteristics. For example, one value may be highly probable, with probabilities dropping off at neighbouring values. This variability around the most probable value helps us to model random variables. In technical terms, we call this the distribution of the random variable. When we plot the number of occurrences against the value, we get a distribution curve. Random variables are typically modelled with average value, variability (spread), skewness (asymmetry) and kurtosis ("tailedness").

For example, let's consider the response time of a computing system. When the system is under high load, the average response time increases. What's more interesting is that the spread of response time around this average is also larger. Thus, under different loads the response time random variable exhibits different characteristics.
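The four summary characteristics can be computed directly from a sample. The sketch below uses population formulas for the standardised third and fourth moments; the two response-time samples are made-up numbers chosen only to mimic a lightly and a heavily loaded system.

```python
from statistics import fmean, pstdev

def describe(values):
    """Return (mean, standard deviation, skewness, excess kurtosis) of a sample."""
    mean = fmean(values)
    std = pstdev(values)
    # Standardise each value, then average the 3rd and 4th powers.
    z = [(v - mean) / std for v in values]
    skewness = fmean([x ** 3 for x in z])
    excess_kurtosis = fmean([x ** 4 for x in z]) - 3  # 0 for a normal distribution
    return mean, std, skewness, excess_kurtosis

# Hypothetical response times in milliseconds (illustrative data only).
low_load = [10, 11, 9, 10, 12, 10, 11, 9]
high_load = [20, 35, 18, 60, 25, 22, 90, 30]

for name, sample in (("low load", low_load), ("high load", high_load)):
    mean, std, skew, kurt = describe(sample)
    print(f"{name}: mean={mean:.1f} std={std:.1f} skew={skew:.2f} kurtosis={kurt:.2f}")
```

With these made-up samples, both the mean and the spread come out larger under high load, which is the pattern described above.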

Exceptions (popularly called Outliers) will affect probability generalisation. They have to be kept out when building realistic models.

• How is probability related to distributions?

Probability looks at the likelihood of a specific outcome or event. Distribution looks at all outcomes or events.

Let's take the example of a coin toss. We know from theory that the probability of a head is 0.5. However, there's also an experimental approach. For example, experimenting with 100 tosses might result in 49 turning out to be heads. Hence, the estimated probability of a head is 0.49. Each toss in such an experiment is termed a Bernoulli Trial. When we list probabilities for all outcomes (head and tail), we end up with a Bernoulli Distribution.

We can perform a variation of the coin-toss experiment. We can toss a coin 32 times and call this a single experiment. We repeat this experiment many times, say 50,000 times. Finally, we calculate the proportion of experiments in which exactly 4 heads turned up among the 32 coin tosses. Such a series of experiments is called Binomial Trials. When we list the probabilities for all possible numbers of heads (0 to 32), we end up with a Binomial Distribution.
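The binomial experiment can be sketched both exactly and by simulation. The code below assumes a fair coin and a fixed random seed (both are choices made here for illustration). Note that exactly 4 heads in 32 fair tosses is a rare result, with an exact probability of 35960/2^32, or roughly 8.4 in a million, so even 50,000 experiments will seldom observe it. The comparison is therefore also printed for more common head counts.

```python
import random
from collections import Counter
from math import comb

def binomial_pmf(k, n, p=0.5):
    """Exact probability of k heads in n tosses with head probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def simulate_heads(n, experiments, p=0.5, seed=0):
    """Run `experiments` experiments of n tosses; count how often each head total occurs."""
    rng = random.Random(seed)
    return Counter(
        sum(rng.random() < p for _ in range(n)) for _ in range(experiments)
    )

EXPERIMENTS = 50_000
counts = simulate_heads(n=32, experiments=EXPERIMENTS)

for k in (4, 12, 16):
    simulated = counts[k] / EXPERIMENTS
    print(f"{k} heads: simulated={simulated:.6f} exact={binomial_pmf(k, 32):.6f}")
```

Listing `binomial_pmf(k, 32)` for every `k` from 0 to 32 gives the full binomial distribution, and those 33 probabilities sum to one.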

• What are probability distributions and how are they useful?

Probability, when identified and listed for all possible outcomes, is called a Probability Distribution. For instance, if we find probabilities of adult males in India with heights in ranges of 0-3.5, 3.5-4, 4-4.5, 4.5-5, 5-5.5, 5.5-6 and 6+ feet, we have a probability distribution. Such a distribution is closely related to the concept of a histogram. With a histogram, we plot the count of values within each range. With a distribution, we convert these counts into probabilities. In both cases, a graphical plot helps us to easily read the average, variability, skewness, kurtosis, outliers, etc.

Given the distribution, an event can be simulated at random within the boundaries of the distribution. In other words, to create random variables for simulation purposes, we need the distribution. For example, let's consider the number of people arriving at an ATM every 60 minutes. This can be modelled as a Poisson distribution. We can simulate queues at the ATM, calculate waiting times and decide if another ATM needs to be installed. Likewise, if we know the distributions of outcomes in a game, we can simulate winning odds and take appropriate risks.
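As a rough sketch of such a simulation, the code below draws hourly ATM arrivals from a Poisson distribution using Knuth's multiplication method, since Python's standard `random` module has no Poisson sampler. The arrival rate and the hourly capacity are hypothetical values picked for illustration; a realistic study would estimate them from data and model the queue itself.

```python
import math
import random

def poisson_sample(lam, rng):
    """Draw one Poisson(lam) value using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

def simulate_arrivals(lam, hours, seed=42):
    """Simulate the number of arrivals in each of `hours` one-hour windows."""
    rng = random.Random(seed)
    return [poisson_sample(lam, rng) for _ in range(hours)]

RATE_PER_HOUR = 12      # hypothetical average number of arrivals per hour
CAPACITY_PER_HOUR = 15  # hypothetical number of customers one ATM can serve per hour

arrivals = simulate_arrivals(RATE_PER_HOUR, hours=10_000)
average = sum(arrivals) / len(arrivals)
overloaded = sum(a > CAPACITY_PER_HOUR for a in arrivals) / len(arrivals)

print(f"average arrivals per hour: {average:.2f}")
print(f"share of hours above capacity: {overloaded:.3f}")
```

The share of overloaded hours is one simple input to the decision on installing another ATM.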

• When does probability work?

Often we are unable to gather data from the entire sample space or population. We typically collect a sample of data from the population. Probability estimates are reliable when the sample size is large.

For instance, we wish to find the probability of an Indian adult male being six feet or taller. A sample size of 100 will not give a reliable number. However, a sample size of 10,000 will be more reliable. The more, the better. Stated formally as the Law of Large Numbers, the probability of an event estimated from a sample will converge to the actual value in the population as the sample size grows.
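The effect of sample size can be seen in a small simulation. This sketch assumes a made-up "true" proportion of 0.05 for the population and a fixed seed; it then reports the estimate obtained from progressively larger samples.

```python
import random

def estimates_by_sample_size(p_true, sample_sizes, seed=1):
    """Estimate p_true from growing samples; return {sample size: estimate}."""
    rng = random.Random(seed)
    estimates = {}
    drawn = hits = 0
    for n in sorted(sample_sizes):
        # Keep drawing individuals until the sample reaches size n.
        while drawn < n:
            hits += rng.random() < p_true
            drawn += 1
        estimates[n] = hits / n
    return estimates

P_TRUE = 0.05  # hypothetical population proportion, for illustration only
estimates = estimates_by_sample_size(P_TRUE, [100, 1_000, 10_000, 100_000])

for n, estimate in estimates.items():
    print(f"n={n:>7}: estimate={estimate:.4f} error={abs(estimate - P_TRUE):.4f}")
```

Individual runs fluctuate, but the error at the largest sample sizes is typically much smaller than at n = 100.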

• What are the axioms of probability?

Probability rests on a few basic rules, called the Axioms of Probability. They were formulated by Russian mathematician Andrei Kolmogorov.

These axioms can be explained as follows:

• The probability of any event is a non-negative real number.
• The probability of the entire sample space is one. This follows from the fact that there are no events outside the sample space.
• The probability of the union of two mutually exclusive events is the sum of their individual probabilities.

• What are mutually exclusive and non-mutually exclusive events?

Let's say there are two events denoted by $$A$$ and $$B$$. If $$A$$ and $$B$$ can't occur together, they're mutually exclusive. They're also called disjoint events. For instance, cooking is event $$A$$ and cycling is event $$B$$. These two are mutually exclusive: a person doesn't cook and cycle at the same time.

$$P(A\ and\ B) = 0\ or\ negligible \\ \Rightarrow P(A\ or\ B)=P(A)+P(B)-P(A\ and\ B) \\ \Rightarrow P(A\ or\ B)=P(A)+P(B), since\ P(A\ and\ B)=0$$

On the contrary, if $$A$$ and $$B$$ can happen together, they are non-mutually exclusive events. For instance, cooking is event $$A$$ and listening to music is event $$B$$. They can happen at the same time.

$$P(A\ and\ B)\neq0 \\ \Rightarrow P(A\ and\ B) = P(A)+P(B)-P(A\ or\ B) \\ \Rightarrow P(A\ or\ B)=P(A)+P(B)-P(A\ and\ B)$$
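Both forms of the addition rule can be verified by counting with sets. Since the cooking and cycling examples come without numbers, this sketch switches to events on a fair die (an assumption made here): odd and even are mutually exclusive, while odd and prime overlap.

```python
from fractions import Fraction

sample_space = set(range(1, 7))  # outcomes of a fair die

def p(event):
    """Probability of an event, given as a set of outcomes."""
    return Fraction(len(event & sample_space), len(sample_space))

odd, even, prime = {1, 3, 5}, {2, 4, 6}, {2, 3, 5}

# Mutually exclusive: no outcome is both odd and even.
print(p(odd & even))                       # 0
print(p(odd | even), p(odd) + p(even))     # 1 1

# Non-mutually exclusive: 3 and 5 are both odd and prime.
print(p(odd & prime))                                       # 1/3
print(p(odd | prime), p(odd) + p(prime) - p(odd & prime))   # 2/3 2/3
```

Here `&` is set intersection ("and") and `|` is set union ("or").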

• Could you explain joint, conditional and marginal probabilities?

Let's consider two events: buying Bread $$A$$, buying Jam $$B$$. Marginal probability is the proportion of customers who bought Bread regardless of whether they bought Jam or not. It's called marginal because it occurs at the margins of the probability table (see figure). Joint probability is the proportion of customers who bought both Bread and Jam. Conditional probability is the proportion of customers who bought Bread among those who bought Jam, and vice versa.

Marginal Probability

$$P(A) = \frac{n(Customers\ buying\ Bread)}{n(Customers)} = \frac{90}{1000} \\ P(B) = \frac{n(Customers\ buying\ Jam)}{n(Customers)} = \frac{50}{1000}$$

Joint Probability

$$P(Customers\ buying\ Bread\ and\ Jam) = P(A\ and\ B) \\ = \frac{n(A\ and\ B)}{n(Customers)} = \frac{40}{1000}$$

Conditional Probability

$$P(Customers\ buying\ Bread\ when\ they\ already\ bought\ Jam) \\ = P(A|B)=\frac{n(A\ and\ B)}{n(B)} = \frac{40}{50} \\ P(Customers\ buying\ Jam\ when\ they\ already\ bought\ Bread) = P(B|A)=\frac{n(A\ and\ B)}{n(A)}= \frac{40}{90}$$

Conditional probability reduces the sample space based on the condition. Rather than considering all customers (1000), we consider only the customers who bought Jam (50) when computing $$P(A|B)$$, or only the customers who bought Bread (90) when computing $$P(B|A)$$. These are the marginal counts.
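The Bread and Jam numbers can be reproduced in a few lines. This sketch uses the counts from the example (1000 customers, 90 buying Bread, 50 buying Jam, 40 buying both); the variable names are only illustrative.

```python
from fractions import Fraction

n_customers, n_bread, n_jam, n_both = 1000, 90, 50, 40

# Marginal probabilities: one event, regardless of the other.
p_bread = Fraction(n_bread, n_customers)
p_jam = Fraction(n_jam, n_customers)

# Joint probability: both events together.
p_both = Fraction(n_both, n_customers)

# Conditional probabilities: the sample space shrinks to the given event.
p_bread_given_jam = p_both / p_jam      # same as n_both / n_jam
p_jam_given_bread = p_both / p_bread    # same as n_both / n_bread

print(p_bread, p_jam, p_both)                 # 9/100 1/20 1/25
print(p_bread_given_jam, p_jam_given_bread)   # 4/5 4/9
```

Dividing the joint probability by a marginal probability gives the same result as dividing the raw counts, because the total of 1000 customers cancels out.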

• How are odds different from probability?

Odds are defined as the ratio of the chances of an event happening to the chances of the same event not happening. Consider the ratio of customers buying milk to those not buying milk. If this ratio is more than 1, the odds are in favour of the hypothesis (buying milk); otherwise, the odds are against the hypothesis.

$$Odds(Buying\ Bread\ with\ Jam) = \frac{P(Buying\ Bread\ with\ Jam)}{P(Buying\ Bread\ without\ Jam)}=\frac{\frac{40}{1000}}{\frac{90-40}{1000}}=\frac{40}{50}=0.8 \\ Odds(Buying\ Jam\ with\ Bread) = \frac{P(Buying\ Jam\ with\ Bread)}{P(Buying\ Jam\ without\ Bread)}=\frac{\frac{40}{1000}}{\frac{50-40}{1000}}=\frac{40}{10}=4$$
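Odds and probability carry the same information and can be converted into each other. The sketch below reuses the counts from the Bread and Jam example; the two helper functions are illustrative names, not part of any library.

```python
from fractions import Fraction

def odds(p):
    """Convert a probability into odds in favour of the event."""
    return p / (1 - p)

def probability_from_odds(o):
    """Convert odds in favour back into a probability."""
    return o / (1 + o)

# Among the 90 Bread buyers, 40 also bought Jam.
p_jam_given_bread = Fraction(40, 90)
# Among the 50 Jam buyers, 40 also bought Bread.
p_bread_given_jam = Fraction(40, 50)

print(odds(p_jam_given_bread))   # 4/5, i.e. 0.8
print(odds(p_bread_given_jam))   # 4
print(probability_from_odds(Fraction(4)))  # 4/5
```

Odds range from zero to infinity, with 1 marking an even chance, whereas probability stays between zero and one.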

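• Jam drives Bread purchase: a customer who has bought Jam is four times as likely to buy Bread as not, whereas a customer who has bought Bread is less likely to buy Jam than not.

• What is Bayes' Theorem?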

Given hypothesis H and evidence E, Bayes' Theorem can be written as $$P(H|E) = P(E|H) \cdot P(H) / P(E)$$. Bayes' Theorem, also called Bayes' Rule or Bayes' Law, uses the prior probability $$P(H)$$, accounts for new evidence through the likelihood $$P(E|H)$$ and results in the posterior probability $$P(H|E)$$.

Often, prior probability is sourced from experts due to challenges in evaluating it from evidence. Bayesian probability basically revises the probability with every new piece of evidence. This probability will converge to its true value over many revisions on repeated evidence.

Bayesian approach is used in fields such as epistemology, statistics, and inductive logic. It relies on conditional probabilities and empirical learning. The key insight of the theorem is "that a hypothesis is confirmed by any body of data that its truth renders probable".

• Could you give some applications of Bayes Probability?

One application is in spam filtering. The idea is to classify an email as spam or not. The email may or may not contain the word "Viagra", and not all mails with this word are spam. We start with our prior knowledge of the proportion of spam mails received: $$P(spam)$$ is the prior probability of a mail in the inbox being spam. The probability of the word appearing in previous spam mails gives us a better estimate. $$P(Viagra|spam)$$ is the likelihood and $$P(Viagra)$$ is the marginal likelihood.

$$P(spam|Viagra) = P(spam) \cdot \frac{P(Viagra|spam)}{P(Viagra)} \\ where\ P(Viagra)=P(Viagra|spam) \cdot P(spam)+P(Viagra|not\ spam) \cdot P(not\ spam)$$

The ratio $$P(Viagra|spam)/P(Viagra)$$ is the evidence from data: how much more often the word Viagra appears in spam than in mail overall. $$P(spam|Viagra)$$ is the posterior probability of a mail being spam given that it contains the word Viagra. When the word is as common in spam as in mail overall, the ratio is 1 and $$P(spam|Viagra)=P(spam)$$. When the word is more common in spam than in mail overall, $$P(spam|Viagra) > P(spam)$$. When every mail containing Viagra is spam, $$P(spam|Viagra)=1$$.
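A small numeric sketch may help here. All three input probabilities below are hypothetical values chosen for illustration, not measurements. The marginal likelihood $$P(Viagra)$$ is obtained with the law of total probability, weighting each conditional probability by the prior of its class.

```python
from fractions import Fraction

def posterior(prior, likelihood, likelihood_if_not):
    """Bayes' Theorem for a yes/no hypothesis: P(H|E) = P(E|H) * P(H) / P(E)."""
    # Law of total probability: P(E) = P(E|H) P(H) + P(E|not H) P(not H)
    evidence = likelihood * prior + likelihood_if_not * (1 - prior)
    return likelihood * prior / evidence

# Hypothetical inputs (assumptions made for this example only).
p_spam = Fraction(1, 5)                   # prior: 20% of mails are spam
p_word_given_spam = Fraction(3, 10)       # 30% of spam mails contain the word
p_word_given_not_spam = Fraction(1, 100)  # 1% of other mails contain the word

p_spam_given_word = posterior(p_spam, p_word_given_spam, p_word_given_not_spam)
print(p_spam_given_word, float(p_spam_given_word))  # 15/17 0.8823529411764706
```

With these made-up inputs, seeing the word raises the probability of spam from 0.2 to about 0.88. If the word were equally common in both classes, the posterior would equal the prior.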

• What are Frequentist and Bayesian approaches to Probability?

Frequentists lean on the Law of Large Numbers to back their probability estimate. For instance, a coin toss has equal probability of head or tail. This is derived from a large number of trials. Frequentists believe any deviation from equal probability is due to chance.

Bayesians argue that belief or prior knowledge should be accounted for while calculating probability. Belief suggests a probability. New evidence may notch up or notch down the probability and form a new belief. Bayesians do not require backing from the Law of Large Numbers, but leverage it where applicable. Probability may be revised with a new piece of evidence, eventually converging to the true probability after repeated revisions. For instance, the probability of a new robot failing at a task starts with a belief, say $$p$$, and as new evidence arrives, we revise $$p$$.

Frequentist and Bayesian approaches can be applied to all estimates, including probability. While the two approaches are distinct, Bayesian probability complements Frequentist probability when:
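One common way to make this kind of revision concrete is a Beta prior over $$p$$, updated by counting failures and successes. This is a modelling choice assumed here for illustration, not something prescribed by the text; the starting belief and the observed attempts are hypothetical.

```python
from fractions import Fraction

def update(failures, successes, failed):
    """Return the updated pseudo-counts after observing one attempt."""
    return (failures + 1, successes) if failed else (failures, successes + 1)

def belief(failures, successes):
    """Current estimate of the failure probability p (mean of the Beta belief)."""
    return Fraction(failures, failures + successes)

# Prior belief: one pseudo-failure and one pseudo-success, so p = 1/2 (an assumption).
failures, successes = 1, 1
history = [belief(failures, successes)]

# Hypothetical evidence: True means the robot failed the task.
for failed in [True, False, False, False, False, False]:
    failures, successes = update(failures, successes, failed)
    history.append(belief(failures, successes))

print([str(p) for p in history])
# ['1/2', '2/3', '1/2', '2/5', '1/3', '2/7', '1/4']
```

Each observation nudges the belief up or down, and the prior's influence fades as evidence accumulates.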

• System is not yet stable.
• There's insufficient data to get backing from Law of Large Numbers.

## Milestones

1564

Sixteenth-century Italian mathematician Girolamo Cardano is interested in gambling, to which he applies mathematics. Although gambling has been around for centuries, randomness is not a recognized concept. Until the Renaissance, people continue to believe in gods and oracles. Cardano's work is the first of its kind, although it's published only much later, in 1663. Today we know that he made some fundamental mistakes.

1654

In a series of letters analyzing the problem of points, French mathematicians Blaise Pascal and Pierre de Fermat develop what can be seen as the foundations of a mathematical theory of probability. Their work is popularized by Christiaan Huygens in a publication of 1657.

1713

Jakob (James) Bernoulli publishes Ars Conjectandi, in which he introduces many important concepts: permutations, a priori, a posteriori, Bernoulli trials, random variable. Bernoulli shows that the probability of an event can be approximated by the frequency of occurrence of the event over a large number of trials. This later comes to be called the Law of Large Numbers.

1763

Thomas Bayes' now famous work on probability is posthumously published. The Bayesian approach to probability is adopted and popularized by Pierre-Simon Laplace, until it's challenged in the early 20th century by mathematicians R. A. Fisher and Jerzy Neyman. To Bayes, probability is a measure of personal belief or reasonable expectation of the event.

1812

With his publication of Analytical Theory of Probability, Pierre-Simon Laplace brings together recent developments in the field. This important work shows the application of probability to scientific problems. Thus, probability is no longer just about games of chance. Laplace himself states,

It is remarkable that probability, which began with the consideration of games of chance, should have become the most important object of human knowledge... [It] is at bottom nothing but common sense reduced to calculus... It teaches us to avoid the illusions which often mislead us.

1933

Russian mathematician A. N. Kolmogorov provides an axiomatic basis for the mathematical theory of probability, thus laying the foundations for a modern treatment of the subject.


## Cite As

Devopedia. 2020. "Probability for Data Scientists." Version 23, September 15. Accessed 2020-11-24. https://devopedia.org/probability-for-data-scientists

Contributed by 3 authors.

Last updated on 2020-09-15 09:51:09.