Exploratory Data Analysis

Exploratory Data Analysis (EDA) consists of techniques that are typically applied to gain insight into a dataset before doing any formal modelling.

EDA helps us uncover the underlying structure of the dataset, identify important variables, detect outliers and anomalies, and test underlying assumptions. With respect to the model we want to build, EDA shows which variables are relevant, how they might be transformed, and how they interact with one another. EDA can also reveal missing data that matters for building the desired models.

EDA uses techniques of statistical graphics but has a broader scope. It's an approach rather than just a set of techniques. The general idea is:

Let the data speak for themselves... Exploratory Data Analysis is not “fishing” or “torturing” the data set until it confesses.

Discussion

What's the recommended process for doing Exploratory Data Analysis?
A typical EDA process. Source: Ghosh et al. 2018, fig. 3.
One can follow these steps:
- Look at the structure of the data: number of data points, number of features, feature names, data types, etc.
- When dealing with multiple data sources, check for consistency across datasets.
- Identify what each variable signifies (called measures) for the data points, and keep that meaning in mind when computing metrics.
- Calculate key metrics for each variable (summary analysis): a. Measures of central tendency (Mean, Median, Mode); b. Measures of dispersion (Range, Quartile Deviation, Mean Deviation, Standard Deviation); c. Measures of skewness and kurtosis.
- Investigate visuals: a. Histogram for each variable; b. Scatterplot to correlate variables.
- Calculate metrics and visuals per category for categorical variables (nominal, ordinal).
- Identify outliers and mark them. Based on context, either discard outliers or analyze them separately.
- Estimate missing points using data imputation techniques.
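The first step in the list above (looking at the structure of the data) can be sketched with the Python standard library alone. This is only a minimal illustration: the records, the column names and the `describe_structure` helper are hypothetical, and missing values are assumed to be encoded as `None`.

```python
# Hypothetical records; in practice these would be loaded from a file or database.
records = [
    {"name": "A", "age": 34, "city": "Pune"},
    {"name": "B", "age": None, "city": "Delhi"},
    {"name": "C", "age": 29, "city": None},
    {"name": "D", "age": 41, "city": "Pune"},
]


def describe_structure(rows):
    """Summarise row count, feature names, observed types and missing counts."""
    features = list(rows[0].keys()) if rows else []
    summary = {"n_rows": len(rows), "n_features": len(features), "features": {}}
    for feature in features:
        values = [row.get(feature) for row in rows]
        present = [v for v in values if v is not None]
        summary["features"][feature] = {
            # Type names actually observed among the non-missing values
            "types": sorted({type(v).__name__ for v in present}),
            # Assumption: a missing value is stored as None
            "missing": len(values) - len(present),
        }
    return summary


structure = describe_structure(records)
print(structure["n_rows"], "rows,", structure["n_features"], "features")
for feature, info in structure["features"].items():
    print(feature, info)
```

A dataframe library would normally provide this kind of overview directly; the point here is only to show what "structure" means in practice.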
What are the data types used in EDA?
Levels of measurement. Source: Pinterest 2018.
In statistics and Machine Learning, data types are also called levels of measurement. Four common ones are used:
- Nominal: This is qualitative, not quantitative; e.g. Religious Preference: 1 = Buddhist, 2 = Muslim, 3 = Christian, 4 = Jewish, 5 = Other. The numbers are only labels.
- Ordinal: An ordinal scale indicates ordering or direction in addition to providing nominal information; e.g. Low/Medium/High or Faster/Slower. Ranking an experience as a "nine" on a 1-10 scale tells us that it was higher than an experience ranked as a "six", but not by how much.
- Interval: Interval scales provide information about order and also allow differences between values to be compared; e.g. temperature measured on either the Fahrenheit or Celsius scale: in Fahrenheit units, the difference between 46 and 42 is the same as the difference between 72 and 68.
- Ratio: In addition to possessing the qualities of nominal, ordinal, and interval scales, a ratio scale has an absolute zero, a point where none of the quality being measured exists; e.g. income, years of work experience, number of children.
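The practical difference between interval and ratio scales shows up in simple arithmetic. The sketch below uses made-up temperature readings to show that equal differences survive a change of unit on an interval scale, whereas ratios do not, which is why "twice as hot" is not meaningful in Celsius or Fahrenheit.

```python
def celsius_to_fahrenheit(celsius):
    """Standard unit conversion between two interval scales."""
    return celsius * 9 / 5 + 32


# Illustrative readings in degrees Celsius
c_low, c_mid, c_high = 10.0, 20.0, 30.0
f_low, f_mid, f_high = (celsius_to_fahrenheit(c) for c in (c_low, c_mid, c_high))

# Differences are meaningful: equal steps in Celsius are equal steps in Fahrenheit
equal_steps_c = (c_mid - c_low) == (c_high - c_mid)
equal_steps_f = (f_mid - f_low) == (f_high - f_mid)

# Ratios are not meaningful: the "ratio" depends on the arbitrary zero of each scale
ratio_c = c_mid / c_low
ratio_f = f_mid / f_low

print("Equal steps preserved:", equal_steps_c and equal_steps_f)
print("Ratio in Celsius:", ratio_c, "Ratio in Fahrenheit:", ratio_f)
```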
What are measures of central tendency?
Comparing mean and median can tell us about skewness. Source: Dugar 2018.
A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data. These include the following:
- Mean: Mean is equal to the sum of all the values in the data set divided by the number of values in the data set. This is also called arithmetic mean. Other means such as geometric mean and harmonic mean are also sometimes useful.
- Median: Median is the middle score for a set of data that has been arranged in order of magnitude. For example, given an ordered list of student marks, [14 35 45 55 55 56 58 65 87 89 92], median is 56 because it is the middle mark since there are 5 items before it, 5 items after it.
- Mode: Mode is the most frequent score in our data set. For the above data set of student marks, mode is 55 because 55 is repeated for the maximum number of times.
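The three measures can be checked against the student marks used above with Python's built-in `statistics` module. This is a small sketch; the geometric and harmonic means are included only because the text mentions them.

```python
import statistics

# Student marks from the example above
marks = [14, 35, 45, 55, 55, 56, 58, 65, 87, 89, 92]

mean_mark = statistics.mean(marks)      # sum of marks (651) divided by 11
median_mark = statistics.median(marks)  # middle value of the ordered list
mode_mark = statistics.mode(marks)      # most frequent value

# Other means, occasionally useful for rates or multiplicative data
geometric = statistics.geometric_mean(marks)
harmonic = statistics.harmonic_mean(marks)

print(f"mean={mean_mark:.2f} median={median_mark} mode={mode_mark}")
print(f"geometric mean={geometric:.2f} harmonic mean={harmonic:.2f}")
```

Here the mean (about 59.2) is larger than the median (56), which hints at a slight skew towards higher marks.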
What are measures of dispersion?
Measures of dispersion. Source: Banerjee 2020.
Measures of dispersion are important for describing the spread of the data, or its variation around a central value.
Range is the difference between the smallest value and the largest value in the data set. This is the simplest measure but it's based on extreme values and tells nothing about the data in between.
Standard Deviation is therefore a better measure since it takes every value into account. For roughly normally distributed data, a value within ±1 SD of the mean is considered normal, while a value beyond ±3 SD is considered extremely abnormal. One alternative is a simpler measure called Mean Absolute Deviation (MAD). Another alternative, often used as a measurement of error, is Root Mean Square Anomaly (RMSA).
If one wants the spread of data around the central region, Quartile Deviation is a good measure. This is half of what's called the Interquartile Range (IQR). A variation of this that considers all data is called Median Absolute Deviation (also abbreviated MAD, so the acronym must be read in context).
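As a rough sketch, the measures named above can be computed for the same list of student marks using only the standard library. Two choices here are assumptions rather than anything the text prescribes: the population form of the standard deviation (`pstdev`), and the default "exclusive" quartile method of `statistics.quantiles`. Other conventions give slightly different numbers.

```python
import statistics

marks = [14, 35, 45, 55, 55, 56, 58, 65, 87, 89, 92]

# Range: depends only on the two extreme values
data_range = max(marks) - min(marks)

# Standard deviation (population form, an assumption for this sketch)
std_dev = statistics.pstdev(marks)

# Mean absolute deviation: average distance from the mean
mean = statistics.mean(marks)
mean_abs_dev = sum(abs(x - mean) for x in marks) / len(marks)

# Quartile deviation: half the interquartile range
q1, _, q3 = statistics.quantiles(marks, n=4)
iqr = q3 - q1
quartile_dev = iqr / 2

# Median absolute deviation: median distance from the median
median = statistics.median(marks)
median_abs_dev = statistics.median([abs(x - median) for x in marks])

print(f"range={data_range} sd={std_dev:.2f} mean abs dev={mean_abs_dev:.2f}")
print(f"IQR={iqr} quartile dev={quartile_dev} median abs dev={median_abs_dev}")
```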
What are skewness and kurtosis?
Illustrating skewness and kurtosis in a distribution. Source: Sharma 2017.
Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the central point.
Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution. That is, data sets with high kurtosis tend to have heavy tails, or outliers. Data sets with low kurtosis tend to have light tails, or lack of outliers.
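Both quantities can be computed from central moments. The sketch below uses the simple moment-based (population) formulas, which is only one of several conventions in use, and reports excess kurtosis (kurtosis minus 3) so that a normal distribution scores about zero. The two small datasets are invented for illustration.

```python
def central_moment(values, k):
    """k-th central moment: average of (x - mean)^k."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** k for x in values) / len(values)


def skewness(values):
    """Moment-based skewness: 0 for symmetric data, > 0 for a long right tail."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution gives about 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0


symmetric = [1, 2, 3, 4, 5]         # mirror image around 3
right_skewed = [1, 1, 2, 2, 3, 10]  # one large value stretches the right tail

print("symmetric:   ", skewness(symmetric), excess_kurtosis(symmetric))
print("right-skewed:", skewness(right_skewed), excess_kurtosis(right_skewed))
```

The symmetric dataset has zero skewness and negative excess kurtosis (light tails), while the single large value in the second dataset produces positive skewness and a heavier tail.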
Can measures of central tendency, dispersion, skewness and kurtosis be the same for different datasets?
Anscombe's quartet. Source: Turner 2016.
Yes, it's possible. Statistician Francis Anscombe came up with four datasets to illustrate the importance of graphing data before analyzing it, and to show the effect of outliers on statistical properties. This is now called Anscombe's quartet. It comprises four datasets that have nearly identical simple statistical properties, yet appear very different when graphed. Each dataset consists of eleven (x,y) points.
Anscombe's quartet emphasizes the importance of looking at your data, not just the summary statistics and parameters you compute from it.
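The near-identical summary statistics can be verified directly. The sketch below hard-codes the published quartet values and computes the mean and sample variance of x and y, plus the Pearson correlation, for each dataset. The `pearson_r` helper is written out by hand only to keep the example independent of the Python version.

```python
import statistics

# Anscombe's quartet: datasets I-III share the same x values
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


summary = {
    name: {
        "mean_x": statistics.mean(xs),
        "mean_y": statistics.mean(ys),
        "var_x": statistics.variance(xs),  # sample variance
        "var_y": statistics.variance(ys),
        "r": pearson_r(xs, ys),
    }
    for name, (xs, ys) in quartet.items()
}

for name, stats in summary.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

The printed rows are almost indistinguishable, which is exactly why plotting the four datasets is needed to tell them apart.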
What are outliers and how do we handle them?
Outlier example in linear regression. Source: Math Open Reference 2011.
Any observation that appears to deviate markedly from other observations in the sample is considered an outlier. Identifying an observation as an outlier depends on the underlying distribution of the data. Determining whether an observation is an outlier or not is a subjective exercise.
Context dictates whether to focus on or get rid of outliers. For example, in an income distribution, a luxury brand company would focus on the outliers (the rich people) while a government public distribution system would choose to get rid of them. Since many outlier tests assume normally distributed data, it's recommended that you generate a normal probability plot of the data before applying an outlier test.
Outliers can also come in different flavours, depending on the environment: point outliers, contextual outliers, or collective outliers.
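One common convention for marking point outliers, not the only one and not prescribed by this article, is the 1.5 × IQR rule: values far outside the middle half of the data are flagged. The sketch below assumes that rule, uses hypothetical income figures, and returns inliers and outliers separately so that, depending on context, the outliers can be discarded or analyzed on their own.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Split values into (inliers, outliers) using fences at k * IQR from the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    inliers = [v for v in values if lower <= v <= upper]
    outliers = [v for v in values if v < lower or v > upper]
    return inliers, outliers


# Hypothetical annual incomes in thousands; one value is far from the rest
incomes = [32, 35, 36, 38, 40, 41, 43, 45, 47, 250]

inliers, outliers = iqr_outliers(incomes)
print("Outliers to review:", outliers)
print("Mean with outliers:   ", statistics.mean(incomes))
print("Mean without outliers:", statistics.mean(inliers))
```

The multiplier `k` is a tunable assumption; a larger value flags only more extreme points.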
What are the visual aids for exploratory analysis?
Various charts to aid exploratory data analysis. Source: Grosser 2018.
Data can be represented visually in many ways with programming languages and visualization packages. Programming languages such as R, Python, MATLAB, SAS, etc. provide libraries for creating data visuals. In JavaScript, we have D3.js, NVD3, FusionCharts and Chart.js. In Python, we have Matplotlib, Seaborn, Bokeh and Plotly.
There are dedicated visualization platforms such as Tableau, QlikView, and Power BI in the market that even non-programmers and traditional data analysts can use to make visuals.
Histograms and scatterplots are widely used for exploratory analysis to quickly understand the structure of data and inter-relations of variables. However, numerous other charts can be used to create visuals that can be reused and have a long shelf life.
What should we look for in a histogram or a distribution?
Distribution of average body weight. Source: Cain 2018.
A histogram represents the underlying structure in the form of a frequency distribution; that is, how often a particular value occurs. Visually, a histogram is similar to a bar chart. While a bar chart has bars for individual values, in a histogram it's more common to group together a range of values into a single bin. Often 5-15 bins should be considered depending on the range of values in the dataset. With too few bins, the graph will not be detailed enough to interpret the distribution; with too many, random fluctuations can obscure its overall shape.
In fact, due to binning, histograms can plot continuous variables as well as discrete ones, whereas bar charts are mainly for categorical variables.
Histograms help us see data symmetry, peaks, outliers or data error through omission. In the figure, two peaks imply two distinct classes. Additional data informs us that the peaks are due to gender differentiation. If we split the data by gender, we will get two histograms, each with a single peak. Thus, when we see multimodal histograms (more than one peak), there's room to split the data. For every peak, we can build a different model.
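Binning itself is simple to sketch without a plotting library. The example below groups hypothetical body-weight values into equal-width bins and prints a text histogram. The data and the `histogram_counts` helper are made up for illustration, and the helper assumes the values are not all identical.

```python
def histogram_counts(values, n_bins=8):
    """Return (bin_edges, counts) for equal-width bins spanning min..max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # assumes hi > lo
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin instead of overflowing
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts


# Hypothetical body weights (kg) drawn from two groups
weights = [50, 52, 54, 55, 56, 57, 58, 61, 63,
           72, 76, 78, 79, 80, 81, 82, 84, 90]

edges, counts = histogram_counts(weights, n_bins=8)
for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left:5.1f} - {right:5.1f} | {'#' * count}")
```

With these values the counts rise, fall to an empty bin, then rise again, which is the two-peak (bimodal) shape that suggests splitting the data.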
What should we look for in a scatterplot?
Scatter plot with outliers in two dimensions. Source: Criteria Corp 2018.
A scatterplot is a way to plot two variables and see the underlying relationship between them. It can reveal data symmetry, clusters, correlation between variables, and extreme values or outliers. The plot is a series of dots "scattered" in two dimensions. Often a line is drawn across these dots. Unlike in a line graph, this line doesn't connect the actual points. The line, often called a regression line, shows the trend and can be used as a predictive tool.
A scatterplot is two-dimensional (two variables) while a histogram is one-dimensional (one variable). Hence we should pay more attention to outliers in scatterplots. For example, in the accompanying image, Employee #2 and Employee #19 are both outliers when we consider their test scores and sales performance. However, if we analyze the data in either of these variables separately, they will not appear as outliers.
In technical jargon, a histogram provides univariate visualization while a scatterplot provides bivariate visualization.
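The regression line drawn through a scatterplot is commonly obtained by ordinary least squares. The sketch below assumes that method and uses invented test-score and sales figures. It fits a straight line and then uses it to predict a value for an unseen score.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # assumes the x values are not all identical
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: aptitude test score vs. sales performance
test_scores = [55, 60, 65, 70, 75, 80, 85, 90]
sales = [31, 36, 38, 45, 47, 52, 58, 60]

slope, intercept = fit_line(test_scores, sales)
predicted = slope * 72 + intercept  # trend-based estimate for a score of 72
print(f"sales ~ {slope:.2f} * score + {intercept:.2f}; predicted at 72: {predicted:.1f}")
```

Points that sit far from this line are candidates for bivariate outliers, even if each coordinate looks unremarkable on its own.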
What's a pair plot and what's its utility?
Pairs plot for Iris Data. Source: Waskom 2018.
A pair plot helps us comprehend the underlying structure of each variable and its relationship with the other variables in a single visual. Basically, it's a combination of histograms and scatterplots in one visual. This can help us notice patterns that may not be obvious when variables are analyzed separately.
How do we handle missing data?
Data is rarely complete and may have missing points. Data can be missing for various reasons: it was not captured, it was captured but is not available, and so on. In such circumstances, it's normal to estimate the missing value and proceed with analysis. This process is called imputation. There are many standard imputation procedures and algorithms to estimate missing data.
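The simplest imputation procedures replace each missing value with the mean or median of the observed values. The sketch below shows only that basic idea. The `impute` helper and the age values are hypothetical, and missing entries are assumed to be stored as `None`. More elaborate procedures estimate each missing value from other variables.

```python
import statistics


def impute(values, strategy="median"):
    """Replace None entries with the median or mean of the observed values."""
    if strategy not in ("median", "mean"):
        raise ValueError("strategy must be 'median' or 'mean'")
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute when every value is missing")
    if strategy == "median":
        fill = statistics.median(observed)
    else:
        fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical ages with two missing entries
ages = [34, None, 29, 41, None, 38]

print("median-imputed:", impute(ages))
print("mean-imputed:  ", impute(ages, strategy="mean"))
```

The median is less sensitive to outliers than the mean, which is one reason to prefer it for skewed variables.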

Milestones

1855

John Snow's dot map showing locations of cholera cases. Source: Friendly and Denis 2001, 1850+: Dot map of disease.

John Snow plots cholera cases as dots on a map of London to analyze the 1854 cholera outbreak. He suspects water contamination at the Broad Street pump. The mapped data presents compelling visual evidence that this could be true. Although not strictly EDA, this is an example of using data visualization to confirm a hypothesis.

1869

Dmitri Mendeleev organizes known chemical elements into a periodic table. This visual suggests some undiscovered elements. This is a good example of EDA leading to new discoveries.

1885

Francis Galton's bivariate frequency chart. Source: Rao 1983, fig. 1.

Francis Galton creates a bivariate frequency chart that evolves later to today's more familiar correlation diagram. He uses it to analyze the relationship between the heights of parents and adult children. In earlier experiments from the 1870s, he did a similar correlation study with sweet-pea seeds.

1905

Karl Pearson proposes the kurtosis coefficient as a way to measure the degree of flatness of frequency distributions. Using it along with the skewness coefficient proposed earlier, he challenges the notion that most distributions are normal or should be transformed to normality. Instead, we should accurately represent observed data.

1973

Statistician Francis Anscombe constructs Anscombe's quartet to demonstrate the importance of graphing data before analyzing it and the effect of outliers on statistical properties.

1977

Cover of Tukey's classic on Exploratory Data Analysis. Source: o0sfz8 2014.

John W. Tukey, often considered the father of EDA, publishes "Exploratory Data Analysis" at a time when computer-aided visualization is still nascent. He introduces new plots such as the stem-and-leaf plot and the boxplot based on the five-number summary. He implies that Confirmatory Data Analysis (CDA) can suffer from confirmation bias due to a predetermined hypothesis. EDA is a more open-minded approach to discover patterns in data and to answer specific scientific questions.

1999

Just as languages have grammar, Leland Wilkinson formalizes a grammar for making graphs. Called Grammar of Graphics, it defines a structure to combine graph elements so that data can be shown in meaningful ways. This later inspires others to implement the same in popular languages (R, Python, Julia, D3).