Exploratory Data Analysis

Exploratory Data Analysis (EDA) consists of techniques that are typically applied to gain insight into a dataset before doing any formal modelling.

EDA helps us uncover the underlying structure of the dataset, identify important variables, detect outliers and anomalies, and test underlying assumptions. With EDA, we identify relevant variables, their transformations, and interactions among variables with respect to the model we want to build. EDA can also reveal missing data that may affect the models we want to build.

EDA uses techniques of statistical graphics but has a broader scope. It's an approach rather than just a set of techniques. The general idea is,

Let the data speak for themselves... Exploratory Data Analysis is not “fishing” or “torturing” the data set until it confesses.

Discussion

  • What's the recommended process for doing Exploratory Data Analysis?
    A typical EDA process. Source: Ghosh et al. 2018, fig. 3.

    One can follow these steps:

    • Look at the structure of the data: number of data points, number of features, feature names, data types, etc.
    • When dealing with multiple data sources, check for consistency across datasets.
    • Identify what the data signifies (its measures) for each data point, and keep this in mind when deriving metrics.
    • Calculate key metrics for each variable (summary analysis): a. Measures of central tendency (Mean, Median, Mode); b. Measures of dispersion (Range, Quartile Deviation, Mean Deviation, Standard Deviation); c. Measures of skewness and kurtosis.
    • Investigate visuals: a. Histogram for each variable; b. Scatterplot to correlate variables.
    • Calculate metrics and visuals per category for categorical variables (nominal, ordinal).
    • Identify outliers and mark them. Based on context, either discard outliers or analyze them separately.
    • Estimate missing points using data imputation techniques.
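
    The first step above, looking at the structure of the data, can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the toy rows and the helper name describe_structure are hypothetical, and in practice a library such as pandas would usually do this job.

```python
# Hypothetical toy dataset: each row is a dict; None marks a missing value.
rows = [
    {"name": "Asha", "age": 34, "city": "Pune"},
    {"name": "Ben", "age": None, "city": "Leeds"},
    {"name": "Chen", "age": 29, "city": None},
]


def describe_structure(rows):
    """Return row count, feature names, observed types and missing counts."""
    features = list(rows[0].keys())
    summary = {"n_rows": len(rows), "n_features": len(features), "features": {}}
    for feature in features:
        values = [row.get(feature) for row in rows]
        present = [v for v in values if v is not None]
        summary["features"][feature] = {
            # Names of the Python types seen among the non-missing values.
            "types": sorted({type(v).__name__ for v in present}),
            "missing": len(values) - len(present),
        }
    return summary


structure = describe_structure(rows)
print(structure["n_rows"], "rows,", structure["n_features"], "features")
for feature, info in structure["features"].items():
    print(feature, info)
```

    Seeing types and missing counts per feature early on makes it easier to decide which summary metrics and charts are appropriate in the later steps.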
  • What are the data types used in EDA?
    Levels of measurement. Source: Pinterest 2018.

    In statistics and Machine Learning, data types are also called levels of measurement. Four common ones are used:

    • Nominal: This is qualitative, not quantitative; e.g. Religious Preference: 1 = Buddhist, 2 = Muslim, 3 = Christian, 4 = Jewish, 5 = Other.
    • Ordinal: An ordinal scale indicates ordering or direction in addition to providing nominal information; e.g. Low/Medium/High or Faster/Slower. Ranking an experience as a "nine" on a 1-10 scale tells us that it was rated higher than an experience ranked as a "six".
    • Interval: Interval scales provide information about order and also allow differences to be compared; e.g. temperature measured on either the Fahrenheit or Celsius scale: in Fahrenheit units, the difference between a temperature of 46 and 42 is the same as the difference between 72 and 68.
    • Ratio: In addition to possessing the qualities of nominal, ordinal, and interval scales, a ratio scale has an absolute zero, a point where none of the quality being measured exists; e.g. income, years of work experience, number of children.
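
    One practical consequence of these levels is that they limit which summaries are meaningful. The sketch below encodes a conventional mapping from level to measures of central tendency; the mapping table, the helper name summarise and the sample values are illustrative assumptions, not taken from the sources cited here.

```python
import statistics

# Conventional mapping (an assumption for illustration): which measures of
# central tendency are usually considered meaningful at each level.
MEANINGFUL_CENTRE = {
    "nominal": ["mode"],
    "ordinal": ["mode", "median"],
    "interval": ["mode", "median", "mean"],
    "ratio": ["mode", "median", "mean"],
}

FUNCS = {
    "mode": statistics.mode,
    "median": statistics.median,
    "mean": statistics.mean,
}


def summarise(values, level):
    """Compute only the measures that make sense for the given level."""
    return {name: FUNCS[name](values) for name in MEANINGFUL_CENTRE[level]}


religion_codes = [1, 3, 3, 2, 5, 3]   # nominal: the numbers are only labels
experience_rank = [6, 9, 7, 9, 8]     # ordinal: order matters, gaps may not

print(summarise(religion_codes, "nominal"))
print(summarise(experience_rank, "ordinal"))
```

    Averaging the religion codes would produce a number, but it would not describe anything, which is why only the mode is reported for nominal data.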
  • What are measures of central tendency?
    Comparing mean and median can tell us about skewness. Source: Dugar 2018.

    A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data. These include the following:

    • Mean: Mean is equal to the sum of all the values in the data set divided by the number of values in the data set. This is also called arithmetic mean. Other means such as geometric mean and harmonic mean are also sometimes useful.
    • Median: Median is the middle score for a set of data that has been arranged in order of magnitude. For example, given an ordered list of student marks, [14 35 45 55 55 56 58 65 87 89 92], median is 56 because it is the middle mark since there are 5 items before it, 5 items after it.
    • Mode: Mode is the most frequent score in our data set. For the above data set of student marks, mode is 55 because 55 is repeated for the maximum number of times.
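
    The three measures can be checked on the student marks used above. This is a small sketch with Python's standard statistics module; the variable names are arbitrary.

```python
import statistics

marks = [14, 35, 45, 55, 55, 56, 58, 65, 87, 89, 92]

mean = statistics.mean(marks)      # sum of the values / number of values
median = statistics.median(marks)  # middle value of the ordered data
mode = statistics.mode(marks)      # most frequent value

print(f"mean={mean:.2f} median={median} mode={mode}")
```

    Here the mean (651 / 11, about 59.18) is a little higher than the median (56), which hints at a mild skew towards higher marks.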
  • What are measures of dispersion?
    Measures of dispersion. Source: Banerjee 2020.

    Measures of dispersion are important for describing the spread of the data, or its variation around a central value.

    Range is the difference between the smallest value and the largest value in the data set. This is the simplest measure but it's based on extreme values and tells nothing about the data in between.

    Standard Deviation is therefore a better measure. A value within ±1 SD from mean is considered normal; a value beyond ±3 SD is considered extremely abnormal. One alternative to this is a simple measure called Mean Absolute Deviation (MAD). Another alternative, often used as a measurement of error, is Root Mean Square Anomaly (RMSA).

    If one desires the spread of data around the central region of data, Quartile Deviation is a good measure. This is half of what's called Interquartile Range (IQR). A variation of this that considers all data is called Median Absolute Deviation. It is also abbreviated MAD, so it's worth stating whether the mean-based or the median-based deviation is meant.
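
    The measures above can be computed on the same student marks used earlier. This is a sketch with the standard statistics module; note that quartiles have several competing definitions, so the Quartile Deviation may differ slightly between tools. The code assumes the default "exclusive" method of statistics.quantiles.

```python
import statistics

marks = [14, 35, 45, 55, 55, 56, 58, 65, 87, 89, 92]

# Range: largest value minus smallest value.
data_range = max(marks) - min(marks)

# Sample standard deviation (divides by n - 1).
std_dev = statistics.stdev(marks)

# Mean Absolute Deviation: average distance from the mean.
mean = statistics.mean(marks)
mean_abs_dev = sum(abs(x - mean) for x in marks) / len(marks)

# Quartile Deviation: half of the interquartile range (Q3 - Q1).
q1, _, q3 = statistics.quantiles(marks, n=4)
quartile_dev = (q3 - q1) / 2

# Median Absolute Deviation: median distance from the median.
median = statistics.median(marks)
median_abs_dev = statistics.median([abs(x - median) for x in marks])

print(f"range={data_range} sd={std_dev:.2f} mean_abs_dev={mean_abs_dev:.2f}")
print(f"quartile_dev={quartile_dev} median_abs_dev={median_abs_dev}")
```

    The median-based measures are less affected by the extreme marks (14 and 92) than the range or the standard deviation.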

  • What are skewness and kurtosis?
    Illustrating skewness and kurtosis in a distribution. Source: Sharma 2017.

    Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the central point.

    Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution. That is, data sets with high kurtosis tend to have heavy tails, or outliers. Data sets with low kurtosis tend to have light tails, or lack of outliers.
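
    Both quantities can be computed from central moments. The sketch below uses the simple population (moment-based) definitions, with kurtosis reported as excess over the normal distribution's value of 3. Statistical packages often apply small-sample corrections, so their numbers may differ. The function names and sample data are illustrative.

```python
def central_moment(values, order):
    """Average of (x - mean) ** order over all values."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** order for x in values) / len(values)


def skewness(values):
    """Third standardized moment: 0 for symmetric data."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Fourth standardized moment minus 3 (the value for a normal curve)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]  # one large value stretches the right tail

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_skewed), excess_kurtosis(right_skewed))
```

    A positive skewness indicates a longer right tail and a negative one a longer left tail. Negative excess kurtosis indicates lighter tails than a normal distribution.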

  • Can measures of central tendency, dispersion, skewness and kurtosis be the same for different datasets?
    Anscombe's quartet. Source: Turner 2016.

    Yes, it's possible. Statistician Francis Anscombe came up with four datasets to illustrate the importance of graphing data before analyzing it, and to show the effect of outliers on statistical properties. This is now called Anscombe's quartet. It comprises four datasets that have nearly identical simple statistical properties, yet appear very different when graphed. Each dataset consists of eleven (x,y) points.

    Anscombe's quartet emphasizes the importance of looking at your data, not just the summary statistics and parameters you compute from it.
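
    The near-identical statistics can be verified directly. The sketch below uses the quartet's values as they are commonly reproduced from Anscombe's 1973 paper and computes the mean, sample variance and Pearson correlation for each dataset. The helper pearson_r is written out by hand for illustration.

```python
import statistics

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared by datasets I-III
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


summary = {}
for name, (xs, ys) in quartet.items():
    summary[name] = {
        "mean_x": round(statistics.mean(xs), 2),
        "var_x": round(statistics.variance(xs), 2),
        "mean_y": round(statistics.mean(ys), 2),
        "var_y": round(statistics.variance(ys), 2),
        "r": round(pearson_r(xs, ys), 2),
    }
    print(name, summary[name])
```

    The printed rows agree to about two decimal places, yet dataset II is a curve and dataset IV is a vertical cluster with a single extreme point, which only a plot reveals.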

  • What are outliers and how do we handle them?
    Outlier example in linear regression. Source: Math Open Reference 2011.

    Any observation that appears to deviate markedly from other observations in the sample is considered an outlier. Identifying an observation as an outlier depends on the underlying distribution of the data. Determining whether an observation is an outlier or not is a subjective exercise.

    Context dictates whether to focus on or get rid of outliers. For example, in an income distribution, a luxury brand company would focus on the outliers (the rich people) while a Government public distribution system would choose to get rid of the outliers. It's recommended that you generate a normal probability plot of the data before applying an outlier test.

    Outliers can also come in different flavours, depending on the environment: point outliers, contextual outliers, or collective outliers.
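
    Since the decision is subjective, any automated rule is only a starting point for marking candidates. One common convention, swapped in here as an example and not taken from the sources above, is Tukey's fences: flag values more than 1.5 times the IQR below the first quartile or above the third. The function name and readings are hypothetical.

```python
import statistics


def tukey_outliers(values, k=1.5):
    """Return values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


readings = [10, 12, 11, 13, 12, 11, 10, 12, 95]
outliers = tukey_outliers(readings)
print(outliers)
```

    Flagged points can then be discarded or analyzed separately, depending on context. The multiplier k is a tunable assumption; a larger value flags only more extreme points.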

  • What are the visual aids for exploratory analysis?
    Various charts to aid exploratory data analysis. Source: Grosser 2018.

    Data can be represented visually in many ways with programming languages and visualization packages. Programming languages such as R, Python, Matlab, SAS, etc. provide libraries for creating data visuals. In JavaScript, we have D3.js, NVD3, FusionCharts and Chart.js. In Python, we have Matplotlib, Seaborn, Bokeh and Plotly.

    There are dedicated visualization platforms such as Tableau, QlikView, and Power BI in the market that even non-programmers and traditional data analysts can use to make visuals.

    Histograms and scatterplots are widely used for exploratory analysis to quickly understand the structure of data and inter-relations of variables. However, numerous other charts can be used to create visuals that have repeat purpose and long shelf life.

  • What should we look for in a histogram or a distribution?
    Distribution of average body weight. Source: Cain 2018.

    A histogram represents the underlying structure in the form of a frequency distribution; that is, how often a particular value occurs. Visually, a histogram is similar to a bar chart. While a bar chart has bars for individual values, in a histogram it's more common to group together a range of values into a single bin. Often 5-15 bins should be considered depending on the range of values in the dataset. With too few bins, the graph will not be detailed enough to interpret the distribution; with too many, random noise can obscure its overall shape.

    In fact, because of binning, histograms can plot continuous variables as well as discrete numeric ones. Bar charts are meant for categorical variables.

    Histograms help us see data symmetry, peaks, outliers or data error through omission. In the figure, two peaks imply two distinct classes. Additional data informs us that the peaks are due to gender differentiation. If we split the data by gender, we will get two histograms, each with a single peak. Thus, when we see multimodal histograms (more than one peak), there's room to split the data. For every peak, we can build a different model.
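
    Binning itself is simple to sketch. The function below groups values into equal-width bins and prints a text histogram. The body weights are hypothetical numbers chosen to show two peaks, and the function assumes the data contains at least two distinct values.

```python
def histogram_counts(values, bins):
    """Count values falling into equal-width bins between min and max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins  # assumes hi > lo
    counts = [0] * bins
    for v in values:
        # The maximum value would land in bin number `bins`; keep it in the last bin.
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    return counts


# Hypothetical body weights in kg with two clusters.
weights = [52, 54, 55, 56, 58, 59, 78, 80, 81, 83, 84, 85]
counts = histogram_counts(weights, bins=6)

lo, width = min(weights), (max(weights) - min(weights)) / 6
for i, count in enumerate(counts):
    start = lo + i * width
    print(f"{start:5.1f} to {start + width:5.1f} | {'#' * count}")
```

    The empty bins in the middle separate two peaks, the kind of pattern that suggests splitting the data before modelling.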

  • What should we look for in a scatterplot?
    Scatter plot with outliers in two dimensions. Source: Criteria Corp 2018.

    A scatterplot is a mechanism to plot two variables and see the underlying relationship between them. A scatterplot can reveal data symmetry, clusters, correlation between variables, and extreme values or outliers. The plot is a series of dots "scattered" in two dimensions. Often a line is drawn across these dots. Unlike in a line graph, this line doesn't connect the actual points. The line, often called the regression line, shows the trend and can be used as a predictive tool.

    A scatterplot is two-dimensional (two variables) while a histogram is one-dimensional (one variable). Hence we should pay more attention to outliers in scatterplots. For example, in the accompanying image, Employee #2 and Employee #19 are both outliers when we consider their test scores and sales performance. However, if we analyze the data in either of these variables separately, they will not appear as outliers.

    In technical jargon, histogram provides Univariate Visualization. Scatterplot provides Bivariate Visualization.
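
    The regression line drawn across a scatterplot is typically an ordinary least-squares fit. A minimal sketch of that fit follows; the function name fit_line and the test-score and sales numbers are hypothetical, chosen so the result is easy to verify by hand.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


test_scores = [1, 2, 3, 4, 5]
sales = [3, 5, 7, 9, 11]  # lies exactly on y = 2x + 1

slope, intercept = fit_line(test_scores, sales)
print(f"sales = {slope:.1f} * score + {intercept:.1f}")

# Residuals (actual minus predicted) show how far each point is from the trend.
residuals = [y - (slope * x + intercept) for x, y in zip(test_scores, sales)]
```

    Points with unusually large residuals are candidates for the bivariate outliers described above, even when each of their coordinates looks ordinary on its own.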

  • What's a pair plot and what's its utility?
    Pairs plot for Iris Data. Source: Waskom 2018.

    Pair plot is a plot that helps comprehend the underlying structure of a variable and its relationship with other variables in a single visual. Basically, it's a combination of histogram and scatterplot in one visual. This can help us notice patterns that may not be obvious when analyzed separately.
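
    The layout of a pair plot can be described as a grid: cell (row, column) shows a scatterplot of the column variable against the row variable, and the diagonal shows each variable's own distribution. The sketch below only builds that layout and draws nothing. In practice a plotting library would render it (the figure above comes from seaborn's pairplot), and the function name and tuple format here are illustrative assumptions.

```python
def pair_plot_layout(features):
    """Describe which chart goes in each cell of a pair plot grid."""
    grid = []
    for row_feature in features:
        row = []
        for col_feature in features:
            if row_feature == col_feature:
                # Diagonal: univariate view of a single variable.
                row.append(("histogram", row_feature))
            else:
                # Off-diagonal: bivariate view, x = column, y = row.
                row.append(("scatter", col_feature, row_feature))
        grid.append(row)
    return grid


iris_features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
layout = pair_plot_layout(iris_features)
for row in layout:
    print([cell[0] for cell in row])
```

    For n variables the grid has n histograms and n × (n - 1) scatterplots, so pair plots are most readable for a handful of variables.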

  • How do we handle missing data?

    Data is rarely complete and may have missing points. Data can be missing due to various reasons: not captured, captured but may not be available, etc. In such circumstances, it's normal to estimate the missing value and proceed with analysis. This process is called imputation. There are many standard imputation procedures and algorithms to estimate missing data.
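
    As one simple example among the many procedures, missing values can be replaced by the median or mean of the observed values. The sketch below assumes missing entries are represented by None; the function name impute and the ages are hypothetical.

```python
import statistics


def impute(values, strategy="median"):
    """Replace None entries with the median (default) or mean of the rest."""
    observed = [v for v in values if v is not None]
    if strategy == "median":
        fill = statistics.median(observed)
    else:
        fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]


ages = [34, None, 29, 41, None, 38]
print(impute(ages))           # median of the observed ages fills the gaps
print(impute(ages, "mean"))   # mean of the observed ages fills the gaps
```

    Filling every gap with a single value shrinks the apparent spread of the variable, so it helps to keep track of which points were imputed.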

Milestones

1855
John Snow's dot map showing locations of cholera cases. Source: Friendly and Denis 2001, 1850+: Dot map of disease.

John Snow uses a dot plot on a map of London to analyze the 1854 cholera outbreak. He suspects water contamination at the Broad Street pump. The mapped data presents a compelling visual that this could be true. Although not strictly EDA, this is an example of using data visualization to confirm a hypothesis.

1869

Dmitri Mendeleev organizes known chemical elements into a periodic table. This visual suggests some undiscovered elements. This is a good example of EDA leading to new discoveries.

1885
Francis Galton's bivariate frequency chart. Source: Rao 1983, fig. 1.

Francis Galton creates a bivariate frequency chart that evolves later to today's more familiar correlation diagram. He uses it to analyze the relationship between the heights of parents and adult children. In earlier experiments from the 1870s, he did a similar correlation study with sweet-pea seeds.

1905

Karl Pearson proposes the kurtosis coefficient as a way to measure the degree of flatness of frequency distributions. Along with the skewness coefficient proposed earlier, he uses it to challenge the notion that most distributions are normal or should be transformed to normality. Instead, we should accurately represent observed data.

1973

Statistician Francis Anscombe constructs Anscombe's quartet to demonstrate the importance of graphing data before analyzing it and the effect of outliers on statistical properties.

1977
Cover of Tukey's classic on Exploratory Data Analysis. Source: o0sfz8 2014.

John W. Tukey, often considered the father of EDA, publishes "Exploratory Data Analysis" at a time when computer-aided visualization is still nascent. He introduces new plots such as the stem-and-leaf plot and the boxplot, which is based on a five-number summary. He implies that Confirmatory Data Analysis (CDA) can suffer from confirmation bias due to predetermined hypotheses. EDA is a more open-minded approach to discover patterns in data and to answer specific scientific questions.

1999

Just as languages have grammar, Leland Wilkinson formalizes a grammar for making graphs. Called Grammar of Graphics, it defines a structure to combine graph elements so that data can be shown in meaningful ways. This later inspires others to implement the same in popular languages (R, Python, Julia, D3).

References

  1. Banerjee, Priyam. 2020. "Statistics: Gauge the Spread of Your Data." Towards Data Science, May 30. Accessed 2020-07-12.
  2. Bierly, Melissa. 2016. "10 Python Data Visualization Libraries for Any Field." Mode Blog, June 8. Accessed 2020-07-21.
  3. Bourke, Daniel. 2019. "A Gentle Introduction to Exploratory Data Analysis." Towards Data Science, on Medium, January 13. Accessed 2020-07-22.
  4. CK-12. 2020. "4.6 Interpreting Histograms." Probability and Statistics Concepts, CK-12. Accessed 2020-07-21.
  5. Cain, Lance. 2018. "Bimodal Distribution: Definition & Example." Chapter 19, Lesson 19, CAHSEE Math Exam: Help and Review, Study.com. Accessed 2018-04-14.
  6. Chapman, Cameron. 2019. "A Complete Overview of the Best Data Visualization Tools." Toptal, March 14. Accessed 2020-07-21.
  7. Conrad, Alainia. 2019. "Power BI vs Tableau vs Qlikview." Blog, SelectHub, March 26. Accessed 2020-07-21.
  8. Criteria Corp. 2018. "What is an Outlier?" Accessed 2018-04-14.
  9. Dugar, Diva. 2018. "Skew and Kurtosis: 2 Important Statistics terms you need to know in Data Science." Codeburst.io, on Medium, August 23. Accessed 2020-07-12.
  10. Dunn, Kevin. 2020. "2.4. Histograms and probability distributions." Process Improvement Using Data, May 5. Accessed 2020-07-21.
  11. Evergreen, Stephanie. 2010. "Scatterplot." BetterEvaluation, December 15. Updated 2014-10-02. Accessed 2020-07-21.
  12. Filliben, James J. and Alan Heckert. 2003. "Exploratory Data Analysis." Chapter 1 in NIST/SEMATECH e-Handbook of Statistical Methods. Updated March 2018. Accessed 2018-04-15.
  13. Fiori, Anna M. and Michele Zenga. 2009. "Karl Pearson and the Origin of Kurtosis." International Statistical Review, vol. 77, no. 1, pp. 40-50. Accessed 2018-04-14.
  14. Friendly, Michael and Daniel J. Denis. 2001. "Milestones in the history of thematic cartography, statistical graphics, and data visualization." Accessed 2018-04-14.
  15. Ghosh, Aindrila, Mona Nashaat, James Miller, Shaikh Quader, and Chad Marston. 2018. "A comprehensive review of tools for exploratory analysis of tabular industrial datasets." Visual Informatics, Elsevier B.V., vol. 2, pp. 235-253. Accessed 2020-07-22.
  16. Grace-Martin, Karen. 2018. "Seven Ways to Make up Data: Common Methods to Imputing Missing Data." The Analysis Factor. Accessed 2018-04-14.
  17. Grosser, Zach. 2018. "Accessible Colors for Data Visualization. The Data Viz Project by Ferdio." The Corner, Square's Technical Blog, January 11. Accessed 2018-04-14.
  18. Ho Yu, Chong. 2017. "Exploratory Data Analysis." Oxford Bibliographies, November 29. Accessed 2018-04-14.
  19. IRI. 2018. "Measures of Dispersion." Statistical Tutorial, International Research Institute for Climate and Society, Columbia University. Accessed 2018-04-15.
  20. InData Labs. 2017. "Exploratory Data Analysis: the Best way to Start a Data Science Project." Medium, June 19. Accessed 2018-05-03.
  21. Koehrsen, Will. 2018. "Visualizing Data with Pairs Plots in Python." Towards Data Science, on Medium, April 7. Accessed 2020-07-21.
  22. Kukaswadia, Atif. 2013. "John Snow – The First Epidemiologist." PLOS Blogs, March 11. Accessed 2018-04-14.
  23. Laerd Statistics. 2018a. "Measures of Central Tendency." Laerd Statistics. Accessed 2018-04-15.
  24. Laerd Statistics. 2018b. "Absolute Deviation & Variance." Laerd Statistics. Accessed 2018-04-15.
  25. Lile, Samantha. 2017. "44 Types of Graphs Perfect for Every Top Industry." Visme Blog, July 5. Accessed 2018-04-15.
  26. Manikandan, S. 2011. "Measures of central tendency: The mean." J Pharmacol Pharmacother. Apr-Jun; vol. 2, no. 2, pp. 140–142. Accessed 2018-04-15.
  27. Math Open Reference. 2011. "Outlier." Accessed 2018-04-15.
  28. Pinterest. 2018. "Levels of measurement." Saved to Research Methods by Leah Fiorentino. Accessed 2018-04-15.
  29. Rao, C. Radhakrishna. 1983. "Multivariate Analysis: Some Reminiscences on Its Origin and Development." Sankhyā: The Indian Journal of Statistics, Series B (1960-2002) 45, no. 2, pp. 284-99. Accessed 2018-08-30.
  30. SOS. 2015. "Histograms." Statistics Online Support, The University of Texas at Austin, June. Accessed 2020-07-21.
  31. Sander, Liz. 2016. "Telling stories with data using the grammar of graphics." CodeWords, Issue Six, March, Recurse Center. Accessed 2018-04-15.
  32. Santoyo, Sergio. 2017. "A Brief Overview of Outlier Detection Techniques." Towards Data Science, September 12. Accessed 2018-04-14.
  33. Sharma, Megha. 2017. "Descriptive Statistics in R." Data Analytics Edge, June 16. Accessed 2018-04-15.
  34. Sommer, Barbara A. 2006. "Levels of measurement." Quantification: Outline, Psychology 41, Research Methods SSI'06, UC Davis. Accessed 2018-04-15.
  35. Stephanie. 2017. "Semi Interquartile Range / Quartile Deviation." Statistics How To, March 7. Accessed 2018-04-15.
  36. Turner, Stephen. 2016. "Using and Abusing Data Visualization: Anscombe’s Quartet and Cheating Bonferroni." R-bloggers, February 26. Accessed 2018-04-14.
  37. Waskom, Michael. 2018. "seaborn.pairplot." Accessed 2018-04-14.
  38. o0sfz8. 2014. "Talk at Digital Humanities 2014." DH Lab, Georgia Tech, July 24. Accessed 2018-05-03.

Further Reading

  1. Filliben, James J. and Alan Heckert. 2003. "Exploratory Data Analysis." Chapter 1 in NIST/SEMATECH e-Handbook of Statistical Methods. Updated March 2018. Accessed 2018-04-15.
  2. Lile, Samantha. 2017. "44 Types of Graphs Perfect for Every Top Industry." Visme Blog, July 5. Accessed 2018-04-15.
  3. Sander, Liz. 2016. "Telling stories with data using the grammar of graphics." CodeWords, Issue Six, March, Recurse Center. Accessed 2018-04-15.
  4. Siddiqi, Adnan. 2018. "Introduction to Exploratory Data Analysis in Python." Python Pandemonium, March 3. Accessed 2018-04-15.
  5. Ganguly, Ambarish. 2017. "Little Book on Exploratory Data Analysis." October 1. Accessed 2018-04-15.
  6. InData Labs. 2017. "Exploratory Data Analysis: the Best way to Start a Data Science Project." Medium, June 19. Accessed 2018-05-03.

Cite As

Devopedia. 2022. "Exploratory Data Analysis." Version 30, February 15. Accessed 2024-06-25. https://devopedia.org/exploratory-data-analysis
Contributed by 6 authors.

Last updated on 2022-02-15 11:50:45.