# Exploratory Data Analysis

## Summary

Exploratory Data Analysis (EDA) consists of techniques that are typically applied to gain insight into a dataset before doing any formal modelling.

EDA helps us uncover the underlying structure of a dataset, identify important variables, detect outliers and anomalies, and test underlying assumptions. With EDA, we identify relevant variables, their transformations, and the interactions among variables with respect to the model we want to build. EDA can also reveal missing data that matters for the models we intend to build.

EDA uses techniques of *statistical graphics* but has a broader scope. It's an approach rather than just a set of techniques. The general idea is:

"Let the data speak for themselves... Exploratory Data Analysis is not “fishing” or “torturing” the data set until it confesses."

## Milestones

## Discussion

### What's the recommended process for doing Exploratory Data Analysis?

One can follow these steps:

- Look at the structure of the data: number of data points, number of features, feature names, data types, etc.
- When dealing with multiple data sources, check for consistency across datasets.
- Identify what each variable signifies (its measure) and keep that meaning in mind when computing metrics.
- Calculate key metrics for each variable (summary analysis): a. Measures of central tendency (mean, median, mode); b. Measures of dispersion (range, quartile deviation, mean deviation, standard deviation); c. Measures of skewness and kurtosis.
- Investigate visuals: a. Histogram for each variable; b. Scatterplot to correlate variables.
- Calculate metrics and visuals per category for categorical variables (nominal, ordinal).
- Identify outliers and mark them. Based on context, either discard outliers or analyze them separately.
- Estimate missing points using *data imputation techniques*.
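The first steps above (structure, data types, missing points, summary metrics) can be sketched with Python's standard library. This is only an illustration under stated assumptions: the `records` data, its column names, and the helper functions `describe_structure` and `summarize_numeric` are hypothetical and not part of any particular tool.

```python
import statistics

# Hypothetical records for illustration; None marks a missing value.
records = [
    {"age": 34, "income": 52000.0, "segment": "retail"},
    {"age": 41, "income": None, "segment": "retail"},
    {"age": 29, "income": 48000.0, "segment": "corporate"},
    {"age": 52, "income": 250000.0, "segment": "corporate"},
    {"age": 38, "income": 61000.0, "segment": None},
]

def describe_structure(rows):
    """Return row count, column names, observed types and missing counts."""
    columns = list(rows[0].keys())
    types = {
        col: sorted({type(r[col]).__name__ for r in rows if r[col] is not None})
        for col in columns
    }
    missing = {col: sum(r[col] is None for r in rows) for col in columns}
    return {"rows": len(rows), "columns": columns, "types": types, "missing": missing}

def summarize_numeric(rows, col):
    """Summary statistics for one numeric column, skipping missing values."""
    values = [r[col] for r in rows if r[col] is not None]
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }

structure = describe_structure(records)
income_summary = summarize_numeric(records, "income")
print(structure)
print(income_summary)
```

In this made-up sample the mean income is pulled well above the median by a single large value, which is the kind of signal that prompts a closer look at outliers.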

### What are the data types used in EDA?

In statistics and Machine Learning, data types are also called *levels of measurement*. Four common ones are used:

- **Nominal**: This is qualitative, not quantitative; e.g. Religious Preference: 1 = Buddhist, 2 = Muslim, 3 = Christian, 4 = Jewish, 5 = Other.
- **Ordinal**: An ordinal scale indicates ordering or direction in addition to providing nominal information; e.g. Low/Medium/High or Faster/Slower. Ranking an experience as a "nine" on a 1-10 scale tells us that it was higher than an experience ranked as a "six".
- **Interval**: Interval scales provide information about order, and also the ability to compare ranges; e.g. temperature measured on either a Fahrenheit or Celsius scale: measured in Fahrenheit units, the difference between a temperature of 46 and 42 is the same as the difference between 72 and 68.
- **Ratio**: In addition to possessing the qualities of nominal, ordinal, and interval scales, a ratio scale has an absolute zero, a point where none of the quality being measured exists; e.g. income, years of work experience, number of children.
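The level of measurement determines which summaries make sense. The following is a small sketch using only Python's `statistics` module; the sample values and the `ORDER` list are assumptions made up for illustration.

```python
import statistics

# Nominal: labels with no order, so only counting (and the mode) is meaningful.
religion = ["Buddhist", "Christian", "Muslim", "Christian", "Other"]
nominal_mode = statistics.mode(religion)

# Ordinal: labels with an order, so ranks and the median are meaningful.
ORDER = ["Low", "Medium", "High"]  # assumed ordering of the categories
ratings = ["Low", "High", "Medium", "Medium", "High"]
ranks = [ORDER.index(r) for r in ratings]
ordinal_median = ORDER[statistics.median_low(ranks)]

# Interval: differences are meaningful, but the zero point is arbitrary.
temps_f = [42, 46, 68, 72]
same_difference = (temps_f[1] - temps_f[0]) == (temps_f[3] - temps_f[2])

# Ratio: an absolute zero makes ratios meaningful.
years_experience = [2, 4]
times_as_much = years_experience[1] / years_experience[0]

print(nominal_mode, ordinal_median, same_difference, times_as_much)
```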

### What are measures of central tendency?

A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set. These include the following:

- **Mean**: The mean is equal to the sum of all the values in the data set divided by the number of values in the data set. This is also called the *arithmetic mean*. Other means such as the *geometric mean* and *harmonic mean* are also sometimes useful.
- **Median**: The median is the middle score for a set of data that has been arranged in order of magnitude. For example, given an ordered list of student marks, [14, 35, 45, 55, 55, 56, 58, 65, 87, 89, 92], the median is 56 because there are 5 items before it and 5 items after it.
- **Mode**: The mode is the most frequent score in the data set. For the above data set of student marks, the mode is 55 because it occurs more often than any other value.
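These measures can be checked on the student marks from the example with Python's `statistics` module. This is a minimal sketch; the variable names are chosen here for illustration.

```python
import statistics

# Student marks from the example above.
marks = [14, 35, 45, 55, 55, 56, 58, 65, 87, 89, 92]

arithmetic_mean = statistics.mean(marks)      # sum of values / number of values
median = statistics.median(marks)             # middle value of the ordered list
mode = statistics.mode(marks)                 # most frequent value
geometric_mean = statistics.geometric_mean(marks)
harmonic_mean = statistics.harmonic_mean(marks)

print(f"mean={arithmetic_mean:.2f} median={median} mode={mode}")
print(f"geometric={geometric_mean:.2f} harmonic={harmonic_mean:.2f}")
```

For positive data the harmonic mean never exceeds the geometric mean, which in turn never exceeds the arithmetic mean.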

### What are measures of dispersion?

Measures of dispersion describe the spread of the data, or its variation around a central value.

- **Range** is the difference between the largest and the smallest value in the data set. This is the simplest measure, but it's based on extreme values and tells nothing about the data in between.
- **Standard Deviation** is therefore a better measure. For roughly normal data, a value within ±1 SD of the mean is considered normal; a value beyond ±3 SD is considered extremely abnormal. One alternative is a simple measure called **Mean Absolute Deviation (MAD)**. Another alternative, often used as a measurement of error, is **Root Mean Square Anomaly (RMSA)**.
- **Quartile Deviation** is a good measure if one wants the spread of data around the central region. It is half of what's called the **Interquartile Range (IQR)**. A variation of this that considers all data is the **Median Absolute Deviation**, which is also abbreviated MAD.

### What are skewness and kurtosis?

Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the central point.

Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution. That is, data sets with high kurtosis tend to have heavy tails, or outliers. Data sets with low kurtosis tend to have light tails, or lack of outliers.
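The dispersion, skewness and kurtosis measures described above can be computed on the same student marks. The sketch below makes several assumptions: it uses population (not sample) formulas, moment-based definitions of skewness and kurtosis, and the default "exclusive" quartile method of `statistics.quantiles`. Other conventions give slightly different numbers. The helper `central_moment` is a name introduced here.

```python
import statistics

marks = [14, 35, 45, 55, 55, 56, 58, 65, 87, 89, 92]
n = len(marks)
mean = statistics.mean(marks)
median = statistics.median(marks)

# Range and standard deviation (population form).
value_range = max(marks) - min(marks)
std_dev = statistics.pstdev(marks)

# Mean absolute deviation (around the mean) and
# median absolute deviation (around the median).
mean_abs_dev = sum(abs(x - mean) for x in marks) / n
median_abs_dev = statistics.median([abs(x - median) for x in marks])

# Quartiles, interquartile range and quartile deviation.
q1, q2, q3 = statistics.quantiles(marks, n=4)
iqr = q3 - q1
quartile_deviation = iqr / 2

def central_moment(values, k):
    """k-th central moment: average of (x - mean) ** k."""
    m = statistics.mean(values)
    return sum((x - m) ** k for x in values) / len(values)

# Moment-based skewness and kurtosis; a normal distribution has kurtosis 3.
skewness = central_moment(marks, 3) / central_moment(marks, 2) ** 1.5
kurtosis = central_moment(marks, 4) / central_moment(marks, 2) ** 2
excess_kurtosis = kurtosis - 3

print(f"range={value_range} sd={std_dev:.2f} IQR={iqr} QD={quartile_deviation}")
print(f"mean abs dev={mean_abs_dev:.2f} median abs dev={median_abs_dev}")
print(f"skewness={skewness:.3f} excess kurtosis={excess_kurtosis:.3f}")
```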

### Can measures of central tendency, dispersion, skewness and kurtosis be the same for different datasets?

Yes, it's possible. Statistician Francis Anscombe came up with four datasets to illustrate the importance of graphing data before analyzing it, and to show the effect of outliers on statistical properties. This is now called *Anscombe's quartet*. It comprises four datasets that have nearly identical simple statistical properties, yet appear very different when graphed. Each dataset consists of eleven (x, y) points. Anscombe's quartet emphasizes the importance of looking at your data, not just the summary statistics and parameters you compute from it.
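The near-identical statistics can be verified directly. The sketch below uses the quartet's values as commonly published and computes means, sample variances and the Pearson correlation for each dataset; `pearson_r` is a small helper written here rather than a library call.

```python
import statistics

# Anscombe's quartet; datasets I to III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

summary = {
    name: {
        "mean_x": statistics.mean(xs),
        "mean_y": statistics.mean(ys),
        "var_x": statistics.variance(xs),
        "var_y": statistics.variance(ys),
        "r": pearson_r(xs, ys),
    }
    for name, (xs, ys) in quartet.items()
}

for name, stats in summary.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

The printed rows agree to two decimal places on the means and the correlation, even though the four scatterplots look nothing alike.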

### What are outliers and how do we handle them?

Any observation that appears to deviate markedly from other observations in the sample is considered an outlier. Identifying an observation as an outlier depends on the underlying distribution of the data. Determining whether an observation is an outlier or not is a subjective exercise.

Context dictates whether to focus on or get rid of outliers. For example, in an income distribution, a luxury brand company would focus on the outliers (the rich people) while a Government public distribution system would choose to get rid of the outliers. It's recommended that you generate a normal probability plot of the data before applying an outlier test.

Outliers can also come in different flavours, depending on the environment: point outliers, contextual outliers, or collective outliers.
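One way to mark point outliers is the 1.5 × IQR rule of thumb. This rule is a common convention and is not prescribed by the discussion above; the threshold `k`, the function `iqr_outliers` and the income values are assumptions for illustration. Whether flagged values are discarded or studied separately remains a matter of context.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical incomes (in thousands); one value is far from the rest.
incomes = [10, 12, 11, 13, 12, 11, 10, 95]

flagged = iqr_outliers(incomes)
remaining = [v for v in incomes if v not in flagged]

print("flagged:", flagged)
print("remaining:", remaining)
```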

### What are visual aids for exploratory analysis?

Data can be represented visually in many ways with programming languages and visualization packages. Programming languages such as R, Python, Matlab and SAS provide libraries for creating data visuals. There are also dedicated visualization platforms such as Tableau, Qlikview, and PowerBI that even non-programmers and traditional data analysts can use to make visuals.

Histograms and scatterplots are widely used for exploratory analysis to quickly understand the structure of data and the inter-relations of variables. However, numerous other charts can be used to create visuals that can be reused and that stay relevant for a long time.

### What should we look for in a histogram?

A **histogram** is a graphical representation of data that uncovers underlying structure in the form of a frequency distribution; that is, how often a particular value occurs. From a histogram one can assess the following:

- Data symmetry
- Peaks: a single peak implies a homogeneous dataset whereas multiple peaks imply heterogeneity (more than one class within the dataset)
- Outliers and their strength: ignore if few and far away, but analyse separately if substantial
- Data errors of commission, such as incorrectly recorded values

For example, the accompanying image shows a histogram of two peaks, implying two distinct classes. Additional data informs us that the peaks are due to gender differentiation. If we split the data by gender, we will get two histograms, each with a single peak. Thus, when we see multimodal histograms (more than one peak), there's room to split the data. For every peak, we can build a different model.
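A histogram is simply a count of values per bin, so the two-peak situation can be reproduced without a plotting library. The sketch below uses synthetic data: the two groups, their means and spreads, the bin edges and the function `histogram_counts` are all assumptions made up for illustration.

```python
import random

random.seed(0)

# Synthetic measurements from two hypothetical groups with different centres.
group_a = [random.gauss(162, 5) for _ in range(500)]
group_b = [random.gauss(178, 5) for _ in range(500)]
heights = group_a + group_b

def histogram_counts(values, low, high, bin_width):
    """Count values falling into equal-width bins covering [low, high)."""
    n_bins = int((high - low) / bin_width)
    counts = [0] * n_bins
    for v in values:
        if low <= v < high:
            index = min(int((v - low) // bin_width), n_bins - 1)
            counts[index] += 1
    return counts

counts = histogram_counts(heights, low=140, high=200, bin_width=4)

# Text rendering: one '#' per five observations in the bin.
for i, count in enumerate(counts):
    left = 140 + 4 * i
    print(f"{left:3d}-{left + 4:3d} | {'#' * (count // 5)}")
```

Running `histogram_counts` on `group_a` and `group_b` separately gives the per-group histograms that result from splitting the data.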

### What should we look for in a scatterplot?

A **scatterplot** plots two variables against each other to reveal the *underlying relationship* between them. It can show the following:

- Data symmetry
- Clusters
- Correlation between variables
- Extreme values or outliers

A scatterplot is two dimensional (two variables) while a histogram is one dimensional (one variable). Hence we should pay more attention to outliers in scatterplots. For example, in the accompanying image, Employee #2 and Employee #19 are both outliers when we consider their test scores and sales performance. However, if we analyze the data in either of these variables separately, they will not appear as outliers.
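The point that an observation can be unremarkable in each variable yet stand out in two dimensions can be sketched numerically. The approach below, flagging points whose residual from a least-squares line exceeds two standard deviations of the residuals, is one simple choice made here for illustration and not a method named above; the data values are invented.

```python
import statistics

# Invented data: the first ten points follow a near-linear trend, while the
# last point (3, 8.5) lies inside the range of both variables but off the trend.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 3]
y = [1.1, 2.0, 2.9, 4.2, 5.0, 6.1, 6.9, 8.0, 9.1, 10.0, 8.5]

def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((a - mx) ** 2 for a in xs)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

slope, intercept = fit_line(x, y)
residuals = [b - (slope * a + intercept) for a, b in zip(x, y)]

# Flag points whose residual is more than two standard deviations from zero.
threshold = 2 * statistics.pstdev(residuals)
joint_outliers = [i for i, r in enumerate(residuals) if abs(r) > threshold]

# The flagged point is not extreme in either variable taken alone.
x_within_range = min(x[:-1]) <= x[-1] <= max(x[:-1])
y_within_range = min(y[:-1]) <= y[-1] <= max(y[:-1])

print("joint outliers at indices:", joint_outliers)
print("within x range:", x_within_range, "within y range:", y_within_range)
```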

### What's a pair plot and what's its utility?

A **pair plot** is a plot that helps comprehend the underlying structure of a variable and its relationship with other variables in a single visual. Basically, it's a combination of histogram and scatterplot in one visual. This can help us notice patterns that may not be obvious when analyzed separately.

### How do we handle missing data?

Data is rarely complete and may have missing points. Data can be missing for various reasons: not captured, captured but not available, etc. In such circumstances, it's normal to estimate the missing value and proceed with analysis. This process is called **imputation**. There are many standard imputation procedures and algorithms to estimate missing data.
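As one of the simplest imputation procedures, missing values can be replaced with the median of the observed values. This is a minimal sketch: the choice of the median, the function `impute_median` and the sample data are assumptions for illustration, and other procedures may suit a given dataset better.

```python
import statistics

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]

# Hypothetical ages with two missing entries.
ages = [34, None, 29, 52, None, 38]
completed = impute_median(ages)

print(completed)
```

Filling with a single value keeps the centre of the data in place but shrinks its spread, so dispersion measures computed afterwards should be read with that in mind.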

## References

- Cain, Lance. 2018. "Bimodal Distribution: Definition & Example." Chapter 19, Lesson 19, CAHSEE Math Exam: Help and Review, Study.com. Accessed 2018-04-14.
- Criteria Corp. 2018. "What is an Outlier?" Accessed 2018-04-14.
- Filliben, James J. and Alan Heckert. 2003. "Exploratory Data Analysis." Chapter 1 in NIST/SEMATECH e-Handbook of Statistical Methods. Updated March 2018. Accessed 2018-04-15.
- Fiori, Anna M. and Michele Zenga. 2009. "Karl Pearson and the Origin of Kurtosis." International Statistical Review, vol. 77, no. 1, pp. 40-50. Accessed 2018-04-14.
- Friendly, Michael and Daniel J. Denis. 2001. "Milestones in the history of thematic cartography, statistical graphics, and data visualization." Accessed 2018-04-14.
- Grace-Martin, Karen. 2018. "Seven Ways to Make up Data: Common Methods to Imputing Missing Data." The Analysis Factor. Accessed 2018-04-14.
- Grosser, Zach. 2018. "Accessible Colors for Data Visualization. The Data Viz Project by Ferdio." The Corner, Square's Technical Blog, January 11. Accessed 2018-04-14.
- Ho Yu, Chong. 2017. "Exploratory Data Analysis." Oxford Bibliographies, November 29. Accessed 2018-04-14.
- IRI. 2018. "Measures of Dispersion." Statistical Tutorial, International Research Institute for Climate and Society, Columbia University. Accessed 2018-04-15.
- Joshi, Purva. 2016. "Measures of Dispersion." Biostatistics, Biology Discussion, August 24. Accessed 2018-04-15.
- Kukaswadia, Atif. 2013. "John Snow – The First Epidemiologist." PLOS Blogs, March 11. Accessed 2018-04-14.
- Laerd Statistics. 2018a. "Measures of Central Tendency." Laerd Statistics. Accessed 2018-04-15.
- Laerd Statistics. 2018b. "Absolute Deviation & Variance." Laerd Statistics. Accessed 2018-04-15.
- Lile, Samantha. 2017. "44 Types of Graphs Perfect for Every Top Industry." Visme Blog, July 5. Accessed 2018-04-15.
- Manikandan, S. 2011. "Measures of central tendency: The mean." J Pharmacol Pharmacother. Apr-Jun; vol. 2, no. 2, pp. 140–142. Accessed 2018-04-15.
- Math Open Reference. 2011. "Outlier." Accessed 2018-04-15.
- Montgomery, Jacob. 2016. "Measures of Central Tendency." Quantitative Political Methods, Department of Political Science, Washington University in St. Louis, September 5. Accessed 2018-04-15.
- Pinterest. 2018. "Levels of measurement." Saved to Research Methods by Leah Fiorentino. Accessed 2018-04-15.
- Sander, Liz. 2016. "Telling stories with data using the grammar of graphics." CodeWords, Issue Six, March, Recurse Center. Accessed 2018-04-15.
- Santoyo, Sergio. 2017. "A Brief Overview of Outlier Detection Techniques." Towards Data Science, September 12. Accessed 2018-04-14.
- Sharma, Megha. 2017. "Descriptive Statistics in R." Data Analytics Edge, June 16. Accessed 2018-04-15.
- Sommer, Barbara A. 2006. "Levels of measurement." Quantification: Outline, Psychology 41, Research Methods SSI'06, UC Davis. Accessed 2018-04-15.
- Stephanie. 2017. "Semi Interquartile Range / Quartile Deviation." Statistics How To, March 7. Accessed 2018-04-15.
- Turner, Stephen. 2016. "Using and Abusing Data Visualization: Anscombe’s Quartet and Cheating Bonferroni." R-bloggers, February 26. Accessed 2018-04-14.
- Waskom, Michael. 2018. "seaborn.pairplot." Accessed 2018-04-14.

## See Also

- Data Science
- Confirmatory Data Analysis
- Data imputation techniques
- Tools for Exploratory Data Analysis
- Probability distributions for data scientists
- Probability for data scientists

## Further Reading

- Filliben, James J. and Alan Heckert. 2003. "Exploratory Data Analysis." Chapter 1 in NIST/SEMATECH e-Handbook of Statistical Methods. Updated March 2018. Accessed 2018-04-15.
- Lile, Samantha. 2017. "44 Types of Graphs Perfect for Every Top Industry." Visme Blog, July 5. Accessed 2018-04-15.
- Sander, Liz. 2016. "Telling stories with data using the grammar of graphics." CodeWords, Issue Six, March, Recurse Center. Accessed 2018-04-15.
- Siddiqi, Adnan. 2018. "Introduction to Exploratory Data Analysis in Python." Python Pandemonium, March 3. Accessed 2018-04-15.
- Ganguly, Ambarish. 2017. "Little Book on Exploratory Data Analysis." October 1. Accessed 2018-04-15.