• John Snow's dot map showing locations of cholera cases. Source: Friendly and Denis 2001, 1850+: Dot map of disease.
    image
  • Francis Galton's correlation chart. Source: Friendly and Denis 2001, 1850+: Dot map of disease.
    image
  • Cover of Tukey's classic on Exploratory Data Analysis. Source: o0sfz8 2014.
    image
  • Levels of measurement. Source: Pinterest 2018.
    image
  • Comparing mean and median can tell us about skewness. Source: Montgomery 2016.
    image
  • Measures of dispersion. Source: Joshi 2016.
    image
  • Illustrating skewness and kurtosis in a distribution. Source: Sharma 2017.
    image
  • Anscombe's quartet. Source: Turner 2016.
    image
  • Outlier example in linear regression. Source: Math Open Reference 2011.
    image
  • Various charts to aid exploratory data analysis. Source: Grosser 2018.
    image
  • Average Body Weight. Source: Cain 2018.
    image
  • Scatter plot with outliers in two dimensions. Source: Criteria Corp 2018.
    image
  • Pairs plot for Iris Data. Source: Waskom 2018.
    image

Exploratory Data Analysis

Summary

Exploratory Data Analysis (EDA) consists of techniques that are typically applied to gain insight into a dataset before doing any formal modelling.

EDA helps us uncover the underlying structure of a dataset, identify important variables, detect outliers and anomalies, and test underlying assumptions. With EDA, we identify relevant variables, their transformations, and interactions among variables with respect to the model we want to build. EDA can also reveal missing data that may affect the models we want to build.

EDA uses techniques of statistical graphics but has a broader scope. It's an approach rather than just a set of techniques. The general idea is:

Let the data speak for themselves... Exploratory Data Analysis is not “fishing” or “torturing” the data set until it confesses.

Milestones

1855
image

John Snow uses a dot plot on a map of London to analyze the 1854 cholera outbreak. He suspects water contamination at the Broad Street pump. The mapped data offers compelling visual evidence that this could be true. Although not strictly EDA, this is an example of using data visualization to confirm a hypothesis.

1869

Dmitri Mendeleev organizes known chemical elements into a periodic table. This visual suggests some undiscovered elements. This is a good example of EDA leading to new discoveries.

1875
image

Francis Galton creates a correlation diagram to analyze the relationship between the sizes of mother and daughter sweet-pea seeds.

1905

Karl Pearson proposes the kurtosis coefficient as a way to measure the degree of flatness of frequency distributions. Using it along with the skewness coefficient proposed earlier, he challenges the notion that most distributions are normal or should be transformed to normality. Instead, we should accurately represent observed data.

1973

Statistician Francis Anscombe constructs Anscombe's quartet to demonstrate the importance of graphing data before analyzing it and the effect of outliers on statistical properties.

1977
image

John W. Tukey, often considered the father of EDA, publishes "Exploratory Data Analysis" at a time when computer-aided visualization was still nascent. He introduces new plots such as the stem-and-leaf plot and the boxplot, which is based on the five-number summary. He implies that Confirmatory Data Analysis (CDA) can suffer from confirmation bias due to predetermined hypotheses. EDA is a more open-minded approach to discovering patterns in data and answering specific scientific questions.

1999

Just as languages have grammar, Leland Wilkinson formalizes a grammar for making graphs. Called Grammar of Graphics, it defines a structure to combine graph elements so that data can be shown in meaningful ways. This later inspires others to implement the same in popular languages (R, Python, Julia, D3).

Discussion

  • What's the recommended process for doing Exploratory Data Analysis?

    One can follow these steps:

    • Look at the structure of the data: number of data points, number of features, feature names, data types, etc.
    • When dealing with multiple data sources, check for consistency across datasets.
    • Identify what each variable signifies (its measure) and keep that in mind when computing metrics.
    • Calculate key metrics for each variable (summary analysis): a. Measures of central tendency (Mean, Median, Mode); b. Measures of dispersion (Range, Quartile Deviation, Mean Deviation, Standard Deviation); c. Measures of skewness and kurtosis.
    • Investigate visuals: a. Histogram for each variable; b. Scatterplot to correlate variables.
    • Calculate metrics and visuals per category for categorical variables (nominal, ordinal).
    • Identify outliers and mark them. Based on context, either discard outliers or analyze them separately.
    • Estimate missing points using data imputation techniques.
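
    As a rough illustration of the first few steps, the sketch below inspects structure, counts missing values and computes one metric per category. It uses only the Python standard library. The records, the field names (species, length, width) and the use of None for a missing value are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from statistics import fmean

# Hypothetical records; None marks a missing value.
records = [
    {"species": "a", "length": 5.1, "width": 3.5},
    {"species": "a", "length": 4.9, "width": None},
    {"species": "b", "length": 6.4, "width": 3.2},
    {"species": "b", "length": 6.9, "width": 3.1},
]

# Structure: number of data points, feature names, observed data types.
n_rows = len(records)
features = list(records[0].keys())
types = {
    f: sorted({type(r[f]).__name__ for r in records if r[f] is not None})
    for f in features
}

# Missing values per feature.
missing = {f: sum(r[f] is None for r in records) for f in features}

# Per-category metric: mean length for each species.
groups = defaultdict(list)
for r in records:
    groups[r["species"]].append(r["length"])
mean_length = {species: fmean(values) for species, values in groups.items()}

print(n_rows, features)
print(types)
print(missing)
print(mean_length)
```

    In practice a data-frame library would do the same work in fewer lines; the plain-Python form is used here only to keep each step visible.
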
  • What are the data types used in EDA?
    image
    Levels of measurement. Source: Pinterest 2018.

    In statistics and Machine Learning, data types are also called levels of measurement. Four common ones are used:

    • Nominal: This is qualitative, not quantitative; e.g. Religious Preference: 1 = Buddhist, 2 = Muslim, 3 = Christian, 4 = Jewish, 5 = Other. The numbers are only labels and imply no order.
    • Ordinal: An ordinal scale indicates ordering or direction in addition to providing nominal information; e.g. Low/Medium/High or Faster/Slower. Ranking an experience as a "nine" on a 1-10 scale tells us that it was rated higher than an experience ranked as a "six".
    • Interval: Interval scales provide information about order and also allow differences to be compared; e.g. temperature measured on either a Fahrenheit or Celsius scale: measured in Fahrenheit units, the difference between temperatures of 46 and 42 is the same as the difference between 72 and 68.
    • Ratio: In addition to possessing the qualities of nominal, ordinal, and interval scales, a ratio scale has an absolute zero, a point where none of the quality being measured exists; e.g. income, years of work experience, number of children.
  • What are measures of central tendency?
    image
    Comparing mean and median can tell us about skewness. Source: Montgomery 2016.

    A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data. These include the following:

    • Mean: Mean is equal to the sum of all the values in the data set divided by the number of values in the data set. This is also called arithmetic mean. Other means such as geometric mean and harmonic mean are also sometimes useful.
    • Median: Median is the middle score for a set of data that has been arranged in order of magnitude. For example, given an ordered list of student marks, [14 35 45 55 55 56 58 65 87 89 92], the median is 56 because there are 5 items before it and 5 items after it.
    • Mode: Mode is the most frequent score in our data set. For the above data set of student marks, the mode is 55 because it occurs more often than any other value.
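
    The three measures can be checked on the student marks above with Python's standard statistics module. This is a minimal sketch; the variable names are illustrative only.

```python
from statistics import geometric_mean, harmonic_mean, mean, median, mode

marks = [14, 35, 45, 55, 55, 56, 58, 65, 87, 89, 92]

arithmetic = mean(marks)   # sum of values / number of values = 651 / 11
middle = median(marks)     # 6th of the 11 ordered values
most_common = mode(marks)  # value that occurs most often

print(round(arithmetic, 2), middle, most_common)  # 59.18 56 55

# Other means mentioned above; both require positive values.
print(round(geometric_mean(marks), 2), round(harmonic_mean(marks), 2))
```
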
  • What are measures of dispersion?
    image
    Measures of dispersion. Source: Joshi 2016.

    Measures of dispersion are important for describing the spread of the data, or its variation around a central value.

    Range is the difference between the smallest value and the largest value in the data set. This is the simplest measure but it's based on extreme values and tells nothing about the data in between.

    Standard Deviation is therefore a better measure because it uses every value. A value within ±1 SD of the mean is considered normal; a value beyond ±3 SD is considered extremely abnormal. One alternative to this is a simple measure called Mean Absolute Deviation. Another alternative, often used as a measurement of error, is Root Mean Square Anomaly (RMSA).

    If one desires the spread of data around the central region of data, Quartile Deviation is a good measure. This is half of what's called Interquartile Range (IQR). A variation of this that considers all data is called Median Absolute Deviation. Note that the abbreviation MAD is used for both Mean Absolute Deviation and Median Absolute Deviation, so it's worth stating which one is meant.
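
    The sketch below computes these measures for the student marks used earlier, with Python's standard statistics module. Two choices here are assumptions rather than part of the definitions above: the population form of standard deviation (pstdev) and the default "exclusive" quartile method of statistics.quantiles. Other quartile conventions can give slightly different IQR values.

```python
from statistics import fmean, median, pstdev, quantiles

marks = [14, 35, 45, 55, 55, 56, 58, 65, 87, 89, 92]

# Range: largest value minus smallest value.
data_range = max(marks) - min(marks)  # 92 - 14 = 78

# Standard deviation (population form, an assumption for this sketch).
sd = pstdev(marks)

# Mean absolute deviation: average distance from the mean.
centre = fmean(marks)
mean_abs_dev = fmean([abs(x - centre) for x in marks])

# Quartiles, interquartile range and quartile deviation (half the IQR).
q1, q2, q3 = quantiles(marks, n=4)
iqr = q3 - q1
quartile_dev = iqr / 2

# Median absolute deviation: median distance from the median.
med = median(marks)
median_abs_dev = median([abs(x - med) for x in marks])

print(data_range, iqr, quartile_dev, median_abs_dev)
print(round(sd, 2), round(mean_abs_dev, 2))
```
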

  • What are skewness and kurtosis?
    image
    Illustrating skewness and kurtosis in a distribution. Source: Sharma 2017.

    Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the central point.

    Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution. That is, data sets with high kurtosis tend to have heavy tails, or outliers. Data sets with low kurtosis tend to have light tails, or lack of outliers.
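
    One common way to quantify both is through standardized central moments. The sketch below assumes the moment-based definitions (third moment for skewness, fourth moment minus 3 for excess kurtosis relative to a normal distribution). Other estimators with small-sample corrections exist, and the two sample lists are hypothetical.

```python
from statistics import fmean


def central_moment(values, k):
    """k-th central moment: average of (x - mean) ** k."""
    m = fmean(values)
    return fmean([(x - m) ** k for x in values])


def skewness(values):
    """Moment-based skewness; 0 for perfectly symmetric data."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution scores 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]  # one large value in the right tail

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_skewed), excess_kurtosis(right_skewed))
```

    The symmetric list has zero skewness and negative excess kurtosis (light tails), while the single large value in the second list produces positive skewness and positive excess kurtosis (a heavy right tail).
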

  • Can measures of central tendency, dispersion, skewness and kurtosis be the same for different datasets?
    image
    Anscombe's quartet. Source: Turner 2016.

    Yes, it's possible. Statistician Francis Anscombe came up with four datasets to illustrate the importance of graphing data before analyzing it, and to show the effect of outliers on statistical properties. This is now called Anscombe's quartet. It comprises four datasets that have nearly identical simple statistical properties, yet appear very different when graphed. Each dataset consists of eleven (x,y) points.

    Anscombe's quartet emphasizes the importance of looking at your data, not just the summary statistics and parameters you compute from it.
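
    The near-identical statistics can be verified directly. The sketch below uses the commonly published values of Anscombe's quartet and computes the mean, sample variance and Pearson correlation for each dataset with the standard library. The helper name pearson is illustrative only.

```python
from statistics import fmean, variance

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


summary = {
    name: (fmean(xs), variance(xs), fmean(ys), variance(ys), pearson(xs, ys))
    for name, (xs, ys) in quartet.items()
}

for name, stats in summary.items():
    print(name, [round(s, 2) for s in stats])
```

    All four rows print nearly the same numbers, which is exactly why the datasets have to be plotted to tell them apart.
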

  • What are outliers and how do we handle them?
    image
    Outlier example in linear regression. Source: Math Open Reference 2011.

    Any observation that appears to deviate markedly from other observations in the sample is considered an outlier. Identifying an observation as an outlier depends on the underlying distribution of the data. Determining whether an observation is an outlier or not is a subjective exercise.

    Context dictates whether to focus on or get rid of outliers. For example, in an income distribution, a luxury brand company would focus on the outliers (the rich people) while a government public distribution system would choose to get rid of the outliers. It's recommended that you generate a normal probability plot of the data before applying an outlier test.

    Outliers can also come in different flavours, depending on the environment: point outliers, contextual outliers, or collective outliers.
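
    The text above does not prescribe a detection rule. As one possible sketch for point outliers, the code below applies the widely used 1.5 × IQR fences associated with Tukey's boxplot. The income figures are hypothetical, and the multiplier 1.5 is a convention rather than a fixed law. Flagged values are kept in a separate list so that they can be either discarded or analyzed on their own, depending on context.

```python
from statistics import quantiles

# Hypothetical annual incomes in thousands.
incomes = [32, 35, 36, 38, 40, 41, 43, 45, 47, 250]

# Quartiles with the module's default "exclusive" method.
q1, _, q3 = quantiles(incomes, n=4)
iqr = q3 - q1

# Fences: values outside [q1 - 1.5*IQR, q3 + 1.5*IQR] are flagged.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = [x for x in incomes if x < lower or x > upper]
typical = [x for x in incomes if lower <= x <= upper]

print(lower, upper)
print(outliers)  # [250]
print(typical)
```
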

  • What are visual aids for exploratory analysis?
    image
    Various charts to aid exploratory data analysis. Source: Grosser 2018.

    Data can be represented visually in many ways with programming languages and visualization packages. Programming languages such as R, Python, MATLAB, and SAS provide libraries for creating data visuals. There are dedicated visualization platforms such as Tableau, QlikView, and Power BI in the market that even non-programmers and traditional data analysts can use to make visuals.

    Histograms and scatterplots are widely used for exploratory analysis to quickly understand the structure of data and inter-relations of variables. However, numerous other charts can be used to create visuals that can be reused and that stay relevant for a long time.

  • What should we look for in a histogram?
    image
    Average Body Weight. Source: Cain 2018.

    A histogram is a graphical representation of data that uncovers underlying structure in the form of a frequency distribution; that is, how often each value or range of values occurs. From a histogram one can assess the following:

    • Data symmetry
    • Peaks: a single peak implies a homogeneous dataset whereas multiple peaks imply heterogeneity (more than one class within the dataset)
    • Outliers and their strength: ignore if few and far away but analyze separately if substantial
    • Data error through commission

    For example, the accompanying image shows a histogram of two peaks, implying two distinct classes. Additional data informs us that the peaks are due to gender differentiation. If we split the data by gender, we will get two histograms, each with a single peak. Thus, when we see multimodal histograms (more than one peak), there's room to split the data. For every peak, we can build a different model.
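
    The idea of counting values per bin and looking for more than one peak can be sketched without any plotting library. The body weights below are hypothetical, the bin width of 10 is an arbitrary choice, and the peak test (a bin taller than both neighbours) is deliberately naive. A different bin width can change how many peaks appear.

```python
from collections import Counter

# Hypothetical body weights in kg, mixing two groups.
weights = [52, 55, 57, 58, 60, 61, 63, 64, 66, 71,
           78, 80, 82, 83, 85, 86, 88, 90, 93]

bin_width = 10
# Map each value to the lower edge of its bin, then count per bin.
counts = Counter((w // bin_width) * bin_width for w in weights)

# Text histogram: one row per bin, one mark per observation.
for edge in sorted(counts):
    print(f"{edge:3d}-{edge + bin_width - 1:3d} | {'#' * counts[edge]}")

# Naive peak detection: a bin with more observations than both neighbours.
peaks = [
    edge for edge in sorted(counts)
    if counts[edge] > counts.get(edge - bin_width, 0)
    and counts[edge] > counts.get(edge + bin_width, 0)
]
print(peaks)  # [60, 80]
```

    Two peaks suggest that the data may contain two classes, in which case splitting it and examining each part separately is a reasonable next step.
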

  • What should we look for in a scatterplot?
    image
    Scatter plot with outliers in two dimensions. Source: Criteria Corp 2018.

    A scatterplot plots two variables against each other to reveal the underlying relationship between them. It can show the following:

    • Data symmetry
    • Clusters
    • Correlation between variables
    • Extreme values or outliers

    A scatterplot is two-dimensional (two variables) while a histogram is one-dimensional (one variable). Hence we should pay more attention to outliers in scatterplots. For example, in the accompanying image, Employee #2 and Employee #19 are both outliers when we consider their test scores and sales performance. However, if we analyze the data in either of these variables separately, they will not appear as outliers.

    In technical jargon, a histogram provides univariate visualization while a scatterplot provides bivariate visualization.
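
    The point that an observation can be unremarkable in each variable yet unusual in combination can be checked numerically. In the sketch below the data is hypothetical and unrelated to the employees in the image. The two cutoffs (an absolute z-score above 2 for each variable, and a residual larger than twice the residual standard deviation around a least-squares line) are assumptions chosen for illustration.

```python
from statistics import fmean, pstdev

# Hypothetical data: y roughly equals x, except for the last point (3, 8.0).
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 3]
ys = [1.1, 1.9, 3.2, 3.8, 5.1, 6.0, 6.9, 8.2, 8.8, 10.1, 8.0]


def z_scores(values):
    """Distance of each value from the mean, in standard deviations."""
    m, s = fmean(values), pstdev(values)
    return [(v - m) / s for v in values]


# Univariate view: flag a point if either variable alone looks extreme.
univariate = [
    i for i, (zx, zy) in enumerate(zip(z_scores(xs), z_scores(ys)))
    if abs(zx) > 2 or abs(zy) > 2
]

# Bivariate view: fit a least-squares line and inspect the residuals.
mx, my = fmean(xs), fmean(ys)
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
intercept = my - slope * mx
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
cutoff = 2 * pstdev(residuals)
bivariate = [i for i, r in enumerate(residuals) if abs(r) > cutoff]

print(univariate)  # []
print(bivariate)   # [10]
```

    The last point has ordinary x and y values, so neither histogram would reveal it. It stands out only when the two variables are considered together.
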

  • What's a pair plot and what's its utility?
    image
    Pairs plot for Iris Data. Source: Waskom 2018.

    A pair plot helps us comprehend the structure of each variable and its relationship with the other variables in a single visual. Basically, it combines histograms and scatterplots in one grid. This can help us notice patterns that may not be obvious when variables are analyzed separately.

  • How do we handle missing data?

    Data is rarely complete and may have missing points. Data can be missing for various reasons: it was never captured, or it was captured but is not available. In such circumstances, it's common to estimate the missing values and proceed with analysis. This process is called imputation. There are many standard imputation procedures and algorithms to estimate missing data.
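
    As a minimal sketch of one of the simplest procedures, the code below replaces each missing value with the median of the observed values. The readings are hypothetical and None is assumed to mark a missing point. Median imputation is only one option; it preserves the centre of the data but reduces its apparent spread, so other methods may suit a given dataset better.

```python
from statistics import median

# Hypothetical readings; None marks a missing point.
readings = [12.0, None, 15.0, 11.0, None, 14.0]

# Estimate from the values that were actually observed.
observed = [r for r in readings if r is not None]
fill_value = median(observed)

# Replace each missing point with the estimate, keeping the original order.
imputed = [fill_value if r is None else r for r in readings]

print(fill_value)  # 13.0
print(imputed)     # [12.0, 13.0, 15.0, 11.0, 13.0, 14.0]
```
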

References

  1. Cain, Lance. 2018. "Bimodal Distribution: Definition & Example." Chapter 19, Lesson 19, CAHSEE Math Exam: Help and Review, Study.com. Accessed 2018-04-14.
  2. Criteria Corp. 2018. "What is an Outlier?" Accessed 2018-04-14.
  3. Filliben, James J. and Alan Heckert. 2003. "Exploratory Data Analysis." Chapter 1 in NIST/SEMATECH e-Handbook of Statistical Methods. Updated March 2018. Accessed 2018-04-15.
  4. Fiori, Anna M. and Michele Zenga. 2009. "Karl Pearson and the Origin of Kurtosis." International Statistical Review, vol. 77, no. 1, pp. 40-50. Accessed 2018-04-14.
  5. Friendly, Michael and Daniel J. Denis. 2001. "Milestones in the history of thematic cartography, statistical graphics, and data visualization." Accessed 2018-04-14.
  6. Grace-Martin, Karen. 2018. "Seven Ways to Make up Data: Common Methods to Imputing Missing Data." The Analysis Factor. Accessed 2018-04-14.
  7. Grosser, Zach. 2018. "Accessible Colors for Data Visualization. The Data Viz Project by Ferdio." The Corner, Square's Technical Blog, January 11. Accessed 2018-04-14.
  8. Ho Yu, Chong. 2017. "Exploratory Data Analysis." Oxford Bibliographies, November 29. Accessed 2018-04-14.
  9. IRI. 2018. "Measures of Dispersion." Statistical Tutorial, International Research Institute for Climate and Society, Columbia University. Accessed 2018-04-15.
  10. InData Labs. 2017. "Exploratory Data Analysis: the Best way to Start a Data Science Project." Medium, June 19. Accessed 2018-05-03.
  11. Joshi, Purva. 2016. "Measures of Dispersion." Biostatistics, Biology Discussion, August 24. Accessed 2018-04-15.
  12. Kukaswadia, Atif. 2013. "John Snow – The First Epidemiologist." PLOS Blogs, March 11. Accessed 2018-04-14.
  13. Laerd Statistics. 2018a. "Measures of Central Tendency." Laerd Statistics. Accessed 2018-04-15.
  14. Laerd Statistics. 2018b. "Absolute Deviation & Variance." Laerd Statistics. Accessed 2018-04-15.
  15. Lile, Samantha. 2017. "44 Types of Graphs Perfect for Every Top Industry." Visme Blog, July 5. Accessed 2018-04-15.
  16. Manikandan, S. 2011. "Measures of central tendency: The mean." J Pharmacol Pharmacother. Apr-Jun; vol. 2, no. 2, pp. 140–142. Accessed 2018-04-15.
  17. Math Open Reference. 2011. "Outlier." Accessed 2018-04-15.
  18. Montgomery, Jacob. 2016. "Measures of Central Tendency." Quantitative Political Methods, Department of Political Science, Washington University in St. Louis, September 5. Accessed 2018-04-15.
  19. Pinterest. 2018. "Levels of measurement." Saved to Research Methods by Leah Fiorentino. Accessed 2018-04-15.
  20. Sander, Liz. 2016. "Telling stories with data using the grammar of graphics." CodeWords, Issue Six, March, Recurse Center. Accessed 2018-04-15.
  21. Santoyo, Sergio. 2017. "A Brief Overview of Outlier Detection Techniques." Towards Data Science, September 12. Accessed 2018-04-14.
  22. Sharma, Megha. 2017. "Descriptive Statistics in R." Data Analytics Edge, June 16. Accessed 2018-04-15.
  23. Sommer, Barbara A. 2006. "Levels of measurement." Quantification: Outline, Psychology 41, Research Methods SSI'06, UC Davis. Accessed 2018-04-15.
  24. Stephanie. 2017. "Semi Interquartile Range / Quartile Deviation." Statistics How To, March 7. Accessed 2018-04-15.
  25. Turner, Stephen. 2016. "Using and Abusing Data Visualization: Anscombe’s Quartet and Cheating Bonferroni." R-bloggers, February 26. Accessed 2018-04-14.
  26. Waskom, Michael. 2018. "seaborn.pairplot." Accessed 2018-04-14.
  27. o0sfz8. 2014. "Talk at Digital Humanities 2014." DH Lab, Georgia Tech, July 24. Accessed 2018-05-03.

Further Reading

  1. Filliben, James J. and Alan Heckert. 2003. "Exploratory Data Analysis." Chapter 1 in NIST/SEMATECH e-Handbook of Statistical Methods. Updated March 2018. Accessed 2018-04-15.
  2. Lile, Samantha. 2017. "44 Types of Graphs Perfect for Every Top Industry." Visme Blog, July 5. Accessed 2018-04-15.
  3. Sander, Liz. 2016. "Telling stories with data using the grammar of graphics." CodeWords, Issue Six, March, Recurse Center. Accessed 2018-04-15.
  4. Siddiqi, Adnan. 2018. "Introduction to Exploratory Data Analysis in Python." Python Pandemonium, March 3. Accessed 2018-04-15.
  5. Ganguly, Ambarish. 2017. "Little Book on Exploratory Data Analysis." October 1. Accessed 2018-04-15.
  6. InData Labs. 2017. "Exploratory Data Analysis: the Best way to Start a Data Science Project." Medium, June 19. Accessed 2018-05-03.

Top Contributors

Last update: 2018-05-03 10:03:37 by arvindpdmn
Creation: 2018-04-14 06:35:49 by arjun

Cite As

Devopedia. 2018. "Exploratory Data Analysis." Version 22, May 3. Accessed 2018-07-20. https://devopedia.org/exploratory-data-analysis