Exploratory Data Analysis
Exploratory Data Analysis (EDA) consists of techniques that are typically applied to gain insight into a dataset before doing any formal modelling.
EDA helps us uncover the underlying structure of the dataset, identify important variables, detect outliers and anomalies, and test underlying assumptions. With EDA, we identify relevant variables, their transformations, and interactions among variables with respect to the model we want to build. EDA can also reveal missing data that may be relevant to building the desired models.
EDA uses techniques of statistical graphics but has a broader scope. It's an approach rather than just a set of techniques. The general idea is:
Let the data speak for themselves... Exploratory Data Analysis is not “fishing” or “torturing” the data set until it confesses.
Discussion
What's the recommended process for doing Exploratory Data Analysis? One can follow these steps:
 Look at the structure of the data: number of data points, number of features, feature names, data types, etc.
 When dealing with multiple data sources, check for consistency across datasets.
 Identify what the data signifies (called measures) for each data point, and keep this in mind when computing metrics.
 Calculate key metrics for each feature (summary analysis): a. Measures of central tendency (Mean, Median, Mode); b. Measures of dispersion (Range, Quartile Deviation, Mean Deviation, Standard Deviation); c. Measures of skewness and kurtosis.
 Investigate visuals: a. Histogram for each variable; b. Scatterplot to correlate variables.
 Calculate metrics and visuals per category for categorical variables (nominal, ordinal).
 Identify outliers and mark them. Based on context, either discard outliers or analyze them separately.
 Estimate missing points using data imputation techniques.
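The first few steps above (structure, data types, summary metrics, missing points) can be sketched in a few lines of Python. This is only a minimal illustration using the standard library; the `records` data and the `profile` helper are hypothetical and not part of the original article.

```python
from statistics import mean, median

# Hypothetical toy dataset: each dict is one data point, each key a feature.
records = [
    {"age": 34, "city": "Pune", "income": 52000.0},
    {"age": 29, "city": "Delhi", "income": None},   # missing income
    {"age": 41, "city": "Pune", "income": 61000.0},
    {"age": 38, "city": None, "income": 58000.0},   # missing city
]

def profile(rows):
    """Per feature: observed type names, missing count, and basic statistics."""
    summary = {}
    for name in rows[0].keys():
        present = [r[name] for r in rows if r[name] is not None]
        info = {
            "types": sorted({type(v).__name__ for v in present}),
            "missing": len(rows) - len(present),
        }
        # Mean and median are only computed for numeric features.
        if present and all(isinstance(v, (int, float)) for v in present):
            info["mean"] = mean(present)
            info["median"] = median(present)
        summary[name] = info
    return summary

report = profile(records)
print(f"{len(records)} data points, {len(report)} features")
for name, info in report.items():
    print(name, info)
```

In practice a dataframe library would usually do this work; the sketch only shows what such a first look is meant to surface.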
What are the data types used in EDA? In statistics and Machine Learning, data types are also called levels of measurement. Four common ones are used:
 Nominal: This is qualitative, not quantitative; e.g., Religious Preference: 1 = Buddhist, 2 = Muslim, 3 = Christian, 4 = Jewish, 5 = Other.
 Ordinal: An ordinal scale indicates ordering or direction in addition to providing nominal information; e.g., Low/Medium/High or Faster/Slower. Ranking an experience as a "nine" on a 1-10 scale tells us that it was higher than an experience ranked as a "six".
 Interval: Interval scales provide information about order and also allow differences between values to be compared; e.g., temperature measured on either a Fahrenheit or Celsius scale: measured in Fahrenheit units, the difference between a temperature of 46 and 42 is the same as the difference between 72 and 68.
 Ratio: In addition to possessing the qualities of nominal, ordinal, and interval scales, a ratio scale has an absolute zero, a point where none of the quality being measured exists; e.g., income, years of work experience, number of children.
What are measures of central tendency? A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data. These include the following:
 Mean: Mean is equal to the sum of all the values in the data set divided by the number of values in the data set. This is also called arithmetic mean. Other means such as geometric mean and harmonic mean are also sometimes useful.
 Median: Median is the middle score for a set of data that has been arranged in order of magnitude. For example, given an ordered list of student marks, [14 35 45 55 55 56 58 65 87 89 92], the median is 56 because it is the middle mark: there are 5 items before it and 5 items after it.
 Mode: Mode is the most frequent score in our data set. For the above data set of student marks, the mode is 55 because 55 occurs more often than any other value.
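As a quick check of the student marks example, the three measures can be computed with Python's standard `statistics` module. The variable names are illustrative only.

```python
from statistics import mean, median, mode

# Student marks from the example above.
marks = [14, 35, 45, 55, 55, 56, 58, 65, 87, 89, 92]

marks_mean = mean(marks)      # sum of values divided by count
marks_median = median(marks)  # middle value of the ordered list
marks_mode = mode(marks)      # most frequent value

print("mean:", round(marks_mean, 2))  # 59.18
print("median:", marks_median)        # 56
print("mode:", marks_mode)            # 55
```

The mean (about 59.18) sits above the median (56) because the three high marks pull it upward.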
What are measures of dispersion? Measures of dispersion are important for describing the spread of the data, or its variation around a central value.
Range is the difference between the smallest value and the largest value in the data set. This is the simplest measure but it's based on extreme values and tells nothing about the data in between.
Standard Deviation is therefore a better measure. A value within ±1 SD from mean is considered normal; a value beyond ±3 SD is considered extremely abnormal. One alternative to this is a simple measure called Mean Absolute Deviation (MAD). Another alternative, often used as a measurement of error, is Root Mean Square Anomaly (RMSA).
If one desires the spread of data around the central region of data, Quartile Deviation is a good measure. This is half of what's called Interquartile Range (IQR). A variation of this that considers all data is called Median Absolute Deviation (MAD).
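The dispersion measures above can be compared on the same student marks used earlier. This is a rough sketch with the standard library. It assumes the sample form of the standard deviation and the default `exclusive` method of `statistics.quantiles`; other quartile conventions give slightly different values.

```python
from statistics import mean, median, quantiles, stdev

marks = [14, 35, 45, 55, 55, 56, 58, 65, 87, 89, 92]

# Range: largest minus smallest value.
value_range = max(marks) - min(marks)

# Sample standard deviation.
sd = stdev(marks)

# Mean absolute deviation: average distance from the mean.
center = mean(marks)
mean_abs_dev = mean([abs(x - center) for x in marks])

# Quartile deviation: half of the interquartile range (Q3 - Q1).
q1, _, q3 = quantiles(marks, n=4)
quartile_dev = (q3 - q1) / 2

# Median absolute deviation: median distance from the median.
med = median(marks)
median_abs_dev = median([abs(x - med) for x in marks])

print("range:", value_range)
print("standard deviation:", round(sd, 2))
print("mean absolute deviation:", round(mean_abs_dev, 2))
print("quartile deviation:", quartile_dev)
print("median absolute deviation:", median_abs_dev)
```

Note that both the mean and median absolute deviations are commonly abbreviated MAD, so it helps to name them explicitly in code.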
What are skewness and kurtosis? Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the central point.
Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution. That is, data sets with high kurtosis tend to have heavy tails, or outliers. Data sets with low kurtosis tend to have light tails, or lack of outliers.
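One common way to quantify both is through central moments. The sketch below assumes the simple population (moment-based) definitions, with kurtosis reported as excess over the normal distribution's value of 3. Statistical packages often apply sample-size corrections, so their results may differ slightly. The function names are illustrative.

```python
from statistics import fmean

def central_moment(values, k):
    """k-th central moment: average of (x - mean) ** k."""
    mu = fmean(values)
    return fmean([(x - mu) ** k for x in values])

def skewness(values):
    """Moment-based skewness: m3 / m2^1.5. Zero for symmetric data."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution scores about 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 1, 10]  # one large value stretches the right tail

print("skewness (symmetric):", skewness(symmetric))
print("skewness (right-skewed):", round(skewness(right_skewed), 3))
print("excess kurtosis (symmetric):", round(excess_kurtosis(symmetric), 3))
```

A positive skewness indicates a longer right tail; a negative excess kurtosis indicates lighter tails than a normal distribution.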
Can measures of central tendency, dispersion, skewness and kurtosis be the same for different datasets? Yes, it's possible. Statistician Francis Anscombe came up with four datasets to illustrate the importance of graphing data before analyzing it, and to show the effect of outliers on statistical properties. This is now called Anscombe's quartet. It comprises four datasets that have nearly identical simple statistical properties, yet appear very different when graphed. Each dataset consists of eleven (x,y) points.
Anscombe's quartet emphasizes the importance of looking at your data, not just the summary statistics and parameters you compute from it.
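The near-identical summary statistics can be verified directly. The sketch below uses the quartet's values as commonly reproduced from Anscombe's paper, and computes the Pearson correlation from first principles. The helper name `pearson_r` is illustrative.

```python
from statistics import fmean

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from its definition."""
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Datasets I to III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

stats = {}
for name, (xs, ys) in quartet.items():
    # (mean of x, mean of y, correlation) for each dataset
    stats[name] = (fmean(xs), round(fmean(ys), 2), round(pearson_r(xs, ys), 3))
    print(name, stats[name])
```

All four rows come out almost the same, even though plotting the datasets shows a linear trend, a curve, a line with one outlier, and a vertical cluster with one extreme point.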
What are outliers and how to handle outliers? Any observation that appears to deviate markedly from other observations in the sample is considered an outlier. Identifying an observation as an outlier depends on the underlying distribution of the data. Determining whether an observation is an outlier or not is a subjective exercise.
Context dictates whether to focus on or get rid of outliers. For example, in an income distribution, a luxury brand company would focus on the outliers (the rich people) while a Government public distribution system would choose to get rid of the outliers. It's recommended that you generate a normal probability plot of the data before applying an outlier test.
Outliers can also come in different flavours, depending on the environment: point outliers, contextual outliers, or collective outliers.
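As an illustration of marking point outliers, the sketch below applies the widely used 1.5 × IQR rule (Tukey's fences). This rule is an assumption made for the example, not a method prescribed by the article, and both the threshold `k` and the `incomes` data are hypothetical.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical incomes (in thousands) with one very rich individual.
incomes = [10, 12, 11, 13, 12, 11, 10, 12, 95]

flagged = iqr_outliers(incomes)
print("flagged as outliers:", flagged)
```

The function only marks candidates. Whether to discard them or study them separately remains a contextual decision, as described above.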
What are the visual aids for exploratory analysis? Data can be represented visually in many ways with programming languages and visualization packages. Programming languages such as R, Python, Matlab, SAS, etc. provide libraries for creating data visuals. In JavaScript, we have D3.js, NVD3, FusionCharts and Chart.js. In Python, we have Matplotlib, Seaborn, Bokeh and Plotly.
There are dedicated visualization platforms such as Tableau, Qlikview, and PowerBI in the market that even non-programmers and traditional data analysts can use to make visuals.
Histograms and scatterplots are widely used for exploratory analysis to quickly understand the structure of data and interrelations of variables. However, numerous other charts can be used to create visuals that have repeat purpose and long shelf life.
What should we look for in a histogram or a distribution? A histogram represents the underlying structure in the form of a frequency distribution; that is, how often a particular value occurs. Visually, a histogram is similar to a bar chart. While a bar chart has bars for individual values, in a histogram it's more common to group a range of values into a single bin. Often 5-15 bins should be considered, depending on the range of values in the dataset. With too few bins, the graph will not be detailed enough to interpret the distribution.
In fact, due to binning, histograms can plot both categorical and continuous variables. Bar charts are only for categorical variables.
Histograms help us see data symmetry, peaks, outliers or data error through omission. In the accompanying figure, two peaks imply two distinct classes. Additional data informs us that the peaks are due to gender differentiation. If we split the data by gender, we will get two histograms, each with a single peak. Thus, when we see multimodal histograms (more than one peak), there's room to split the data. For every peak, we can build a different model.
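To make the idea of binning concrete, here is a minimal sketch of equal-width binning in plain Python, applied to the student marks from earlier. The `histogram` helper is illustrative; it assumes at least two distinct values, and plotting libraries provide their own binning routines.

```python
def histogram(values, bins):
    """Count values falling into equal-width bins between min and max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins  # assumes hi > lo
    counts = [0] * bins
    for v in values:
        # The maximum value would index one past the end, so clamp it.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return edges, counts

marks = [14, 35, 45, 55, 55, 56, 58, 65, 87, 89, 92]
edges, counts = histogram(marks, bins=5)

# Text rendering: one row per bin, one '#' per observation.
for left, right, n in zip(edges, edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f} | {'#' * n}")
```

Changing `bins` shows the trade-off discussed above: too few bins hide the shape, while too many leave most bins nearly empty.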
What should we look for in a scatterplot? A scatterplot is a mechanism to plot two variables and see the underlying relationship between them. A scatterplot can reveal data symmetry, clusters, correlation between variables, and extreme values or outliers. The plot is a series of dots "scattered" in two dimensions. Often a line is drawn across these dots. Unlike in a line graph, the line doesn't connect the actual points. The line, often called the regression line, shows the trend and can be used as a predictive tool.
A scatterplot is two-dimensional (two variables) while a histogram is one-dimensional (one variable). Hence we should pay more attention to outliers in scatterplots. For example, in the accompanying image, Employee #2 and Employee #19 are both outliers when we consider their test scores and sales performance. However, if we analyze the data in either of these variables separately, they will not appear as outliers.
In technical jargon, a histogram provides Univariate Visualization. A scatterplot provides Bivariate Visualization.
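The regression line drawn over a scatterplot is typically an ordinary least-squares fit. The sketch below assumes simple linear regression with one predictor. The data and the `fit_line` helper are hypothetical.

```python
from statistics import fmean

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept

# Hypothetical points lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)
print("slope:", slope, "intercept:", intercept)

# Using the line as a predictive tool for an unseen x.
predicted = slope * 5 + intercept
print("prediction at x = 5:", predicted)
```

A single outlier can tilt such a line noticeably, which is one reason to inspect the scatterplot before trusting the fit.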
What's a pair plot and what's its utility? A pair plot is a plot that helps comprehend the underlying structure of a variable and its relationship with other variables in a single visual. Basically, it's a combination of histogram and scatterplot in one visual. This can help us notice patterns that may not be obvious when analyzed separately.
How do we handle missing data? Data is rarely complete and may have missing points. Data can be missing due to various reasons: not captured, captured but may not be available, etc. In such circumstances, it's normal to estimate the missing value and proceed with analysis. This process is called imputation. There are many standard imputation procedures and algorithms to estimate missing data.
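One of the simplest imputation procedures replaces each missing value with the mean or median of the observed values. The sketch below is illustrative only: missing points are represented as `None`, and the `impute` helper and `ages` data are hypothetical. More elaborate procedures exist and are often preferable.

```python
from statistics import mean, median

def impute(values, strategy=median):
    """Replace None entries with a statistic of the observed values."""
    observed = [v for v in values if v is not None]
    fill = strategy(observed)
    return [fill if v is None else v for v in values]

# Hypothetical ages with two missing points.
ages = [34, None, 41, 38, None, 29]

median_filled = impute(ages)                # fills with the median, 36.0
mean_filled = impute(ages, strategy=mean)   # fills with the mean, 35.5

print(median_filled)
print(mean_filled)
```

Filling with a single value keeps the central tendency roughly intact but shrinks the apparent spread, so dispersion measures should be interpreted with that in mind.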
Milestones
John Snow uses a dot plot on a map of London to analyze the 1854 cholera outbreak. He suspects water contamination at the Broad Street pump. The mapped data presents a compelling visual that this could be true. Although not strictly EDA, this is an example of using data visualization to confirm a hypothesis.
Dmitri Mendeleev organizes known chemical elements into a periodic table. This visual suggests some undiscovered elements. This is a good example of EDA leading to new discoveries.
Francis Galton creates a bivariate frequency chart that evolves later to today's more familiar correlation diagram. He uses it to analyze the relationship between the heights of parents and adult children. In earlier experiments from the 1870s, he did a similar correlation study with sweet-pea seeds.
Karl Pearson proposes the kurtosis coefficient as a way to measure the degree of flatness of frequency distributions. Along with the skewness coefficient proposed earlier, he challenges the notion that most distributions are normal or should be transformed to normality. Instead, we should accurately represent observed data.
Statistician Francis Anscombe constructs Anscombe's quartet to demonstrate the importance of graphing data before analyzing it and the effect of outliers on statistical properties.
John W. Tukey, often considered the father of EDA, publishes "Exploratory Data Analysis" at a time when computer-aided visualization was still nascent. He introduces new plots such as the stem-and-leaf plot and the five-point boxplot. He implies that Confirmatory Data Analysis (CDA) can suffer from confirmation bias due to predetermined hypotheses. EDA is a more open-minded approach to discover patterns in data and to answer specific scientific questions.
Just as languages have grammar, Leland Wilkinson formalizes a grammar for making graphs. Called Grammar of Graphics, it defines a structure to combine graph elements so that data can be shown in meaningful ways. This later inspires others to implement the same in popular languages (R, Python, Julia, D3).
References
 Bierly, Melissa. 2016. "10 Python Data Visualization Libraries for Any Field." Mode Blog, June 8. Accessed 2020-07-21.
 Bourke, Daniel. 2019. "A Gentle Introduction to Exploratory Data Analysis." Towards Data Science, on Medium, January 13. Accessed 2020-07-22.
 Cain, Lance. 2018. "Bimodal Distribution: Definition & Example." Chapter 19, Lesson 19, CAHSEE Math Exam: Help and Review, Study.com. Accessed 2018-04-14.
 Chapman, Cameron. 2019. "A Complete Overview of the Best Data Visualization Tools." Toptal, March 14. Accessed 2020-07-21.
 CK-12. 2020. "4.6 Interpreting Histograms." Probability and Statistics Concepts, CK-12. Accessed 2020-07-21.
 Conrad, Alainia. 2019. "Power BI vs Tableau vs Qlikview." Blog, SelectHub, March 26. Accessed 2020-07-21.
 Criteria Corp. 2018. "What is an Outlier?" Accessed 2018-04-14.
 Dunn, Kevin. 2020. "2.4. Histograms and probability distributions." Process Improvement Using Data, May 5. Accessed 2020-07-21.
 Evergreen, Stephanie. 2010. "Scatterplot." BetterEvaluation, December 15. Updated 2014-10-02. Accessed 2020-07-21.
 Filliben, James J. and Alan Heckert. 2003. "Exploratory Data Analysis." Chapter 1 in NIST/SEMATECH e-Handbook of Statistical Methods. Updated March 2018. Accessed 2018-04-15.
 Fiori, Anna M. and Michele Zenga. 2009. "Karl Pearson and the Origin of Kurtosis." International Statistical Review, vol. 77, no. 1, pp. 40-50. Accessed 2018-04-14.
 Friendly, Michael and Daniel J. Denis. 2001. "Milestones in the history of thematic cartography, statistical graphics, and data visualization." Accessed 2018-04-14.
 Ghosh, Aindrila, Mona Nashaat, James Miller, Shaikh Quader, and Chad Marston. 2018. "A comprehensive review of tools for exploratory analysis of tabular industrial datasets." Visual Informatics, Elsevier B.V., vol. 2, pp. 235-253. Accessed 2020-07-22.
 Grace-Martin, Karen. 2018. "Seven Ways to Make up Data: Common Methods to Imputing Missing Data." The Analysis Factor. Accessed 2018-04-14.
 Grosser, Zach. 2018. "Accessible Colors for Data Visualization. The Data Viz Project by Ferdio." The Corner, Square's Technical Blog, January 11. Accessed 2018-04-14.
 Ho Yu, Chong. 2017. "Exploratory Data Analysis." Oxford Bibliographies, November 29. Accessed 2018-04-14.
 InData Labs. 2017. "Exploratory Data Analysis: the Best way to Start a Data Science Project." Medium, June 19. Accessed 2018-05-03.
 IRI. 2018. "Measures of Dispersion." Statistical Tutorial, International Research Institute for Climate and Society, Columbia University. Accessed 2018-04-15.
 Joshi, Purva. 2016. "Measures of Dispersion." Biostatistics, Biology Discussion, August 24. Accessed 2018-04-15.
 Koehrsen, Will. 2018. "Visualizing Data with Pairs Plots in Python." Towards Data Science, on Medium, April 7. Accessed 2020-07-21.
 Kukaswadia, Atif. 2013. "John Snow – The First Epidemiologist." PLOS Blogs, March 11. Accessed 2018-04-14.
 Laerd Statistics. 2018a. "Measures of Central Tendency." Laerd Statistics. Accessed 2018-04-15.
 Laerd Statistics. 2018b. "Absolute Deviation & Variance." Laerd Statistics. Accessed 2018-04-15.
 Lile, Samantha. 2017. "44 Types of Graphs Perfect for Every Top Industry." Visme Blog, July 5. Accessed 2018-04-15.
 Manikandan, S. 2011. "Measures of central tendency: The mean." J Pharmacol Pharmacother. Apr-Jun; vol. 2, no. 2, pp. 140–142. Accessed 2018-04-15.
 Math Open Reference. 2011. "Outlier." Accessed 2018-04-15.
 Montgomery, Jacob. 2016. "Measures of Central Tendency." Quantitative Political Methods, Department of Political Science, Washington University in St. Louis, September 5. Accessed 2018-04-15.
 o0sfz8. 2014. "Talk at Digital Humanities 2014." DH Lab, Georgia Tech, July 24. Accessed 2018-05-03.
 Pinterest. 2018. "Levels of measurement." Saved to Research Methods by Leah Fiorentino. Accessed 2018-04-15.
 Rao, C. Radhakrishna. 1983. "Multivariate Analysis: Some Reminiscences on Its Origin and Development." Sankhyā: The Indian Journal of Statistics, Series B (1960-2002) 45, no. 2, pp. 284-99. Accessed 2018-08-30.
 Sander, Liz. 2016. "Telling stories with data using the grammar of graphics." CodeWords, Issue Six, March, Recurse Center. Accessed 2018-04-15.
 Santoyo, Sergio. 2017. "A Brief Overview of Outlier Detection Techniques." Towards Data Science, September 12. Accessed 2018-04-14.
 Sharma, Megha. 2017. "Descriptive Statistics in R." Data Analytics Edge, June 16. Accessed 2018-04-15.
 Sommer, Barbara A. 2006. "Levels of measurement." Quantification: Outline, Psychology 41, Research Methods SSI'06, UC Davis. Accessed 2018-04-15.
 SOS. 2015. "Histograms." Statistics Online Support, The University of Texas at Austin, June. Accessed 2020-07-21.
 Stephanie. 2017. "Semi Interquartile Range / Quartile Deviation." Statistics How To, March 7. Accessed 2018-04-15.
 Turner, Stephen. 2016. "Using and Abusing Data Visualization: Anscombe’s Quartet and Cheating Bonferroni." R-bloggers, February 26. Accessed 2018-04-14.
 Waskom, Michael. 2018. "seaborn.pairplot." Accessed 2018-04-14.
Further Reading
 Filliben, James J. and Alan Heckert. 2003. "Exploratory Data Analysis." Chapter 1 in NIST/SEMATECH e-Handbook of Statistical Methods. Updated March 2018. Accessed 2018-04-15.
 Lile, Samantha. 2017. "44 Types of Graphs Perfect for Every Top Industry." Visme Blog, July 5. Accessed 2018-04-15.
 Sander, Liz. 2016. "Telling stories with data using the grammar of graphics." CodeWords, Issue Six, March, Recurse Center. Accessed 2018-04-15.
 Siddiqi, Adnan. 2018. "Introduction to Exploratory Data Analysis in Python." Python Pandemonium, March 3. Accessed 2018-04-15.
 Ganguly, Ambarish. 2017. "Little Book on Exploratory Data Analysis." October 1. Accessed 2018-04-15.
 InData Labs. 2017. "Exploratory Data Analysis: the Best way to Start a Data Science Project." Medium, June 19. Accessed 2018-05-03.
See Also
 Data Science
 Confirmatory Data Analysis
 Data Imputation
 Tools for Exploratory Data Analysis
 Probability Distributions
 Probability for Data Scientists