Data Preparation

Article Info

Contributed by
4 authors

Last updated on
2020-03-25 13:35:28

Data Mining
Data Fusion
Data Formatting
Data Imputation
Data Archiving
Data Pipeline Patterns

Article Versions

8 2020-03-25 13:35:28
2004,1999 8,2004

By arvindpdmn

Minor update to answer.
7 2020-03-25 02:17:20
1999,1998 7,1999

By arvindpdmn

No comments. Publishing.
6 2020-03-24 18:23:11
1998,1997 6,1998

By gurumoorthyP

Completing article content. Milestones added. More images added. Article can be reviewed.
5 2020-03-23 08:34:41
1997,1996 5,1997

By gurumoorthyP

Summary text added along with a custom image.
4 2020-03-22 13:59:04
1996,1504 4,1996 1

By gurumoorthyP

Work in progress. Identified two more questions. Added images with citations. Added See Also items.

Chat Room

Submitting ...

You are editing an existing chat message.
2019-12-03 04:31:19
-

By arvindpdmn

Hi Authors. This is a reminder to complete this article. Thanks.

Data preparation enables data analytics. Source: Devopedia 2020.

Raw data is usually not suitable for direct analysis. This is because the data might come from different sources in different formats. Moreover, real-world data is not clean. Some data points might be missing. Some others might be out of range. There could be duplicates. Data preparation is therefore an essential task that transforms or prepares data into a form that's suitable for analysis.

Data preparation assumes that data has already been collected. However, others may consider data collection and data ingestion as part of data preparation. Within data preparation, it's common to identify sub-stages that might include data pre-processing, data wrangling, and data transformation.

Useful insights from data via analytics is the final goal in today's data-driven world. However, data preparation is an important task. Poorly prepared data can make analytics more difficult and ineffective.

Discussion

What's the typical pipeline for data preparation?
Typical pipeline for data preparation. Source: Pearlman 2018.
The first step of a data preparation pipeline is to gather data from various sources and locations. Before any processing is done, we wish to discover what the data is about. At this stage, we understand the data within the context of business goals. Visualization of the data is also helpful here. The next stage is to cleanse the data of missing values and invalid values. We also reformat data to standard forms. Next we transform the data for a specific outcome or audience. We can enrich data by merging different datasets to enable richer insights. Finally, we store the data or directly send it out for analytics.
In the context of machine learning, data pre-processing converts raw data into clean data. This involves iteratively cleaning, integration, transformation and reduction of data. When an ML model is being prototyped, often data scientists may wish to go back to convert the data into a more useful form. This could be called data wrangling or data munging. This involves filtering, grouping, sorting, aggregating or decomposing data as required for modelling.
What are some common tasks involved in data preparation?
Some data preparation tasks to solve specific problems. Source: Gill 2018.
Data preparation involves one or more of the following tasks:
- Aggregation: Multiple columns are reduced to fewer columns. Records are summarized.
- Anonymization: Sensitive values are removed for the sake of privacy.
- Augmentation: Expand the dataset size without collecting more data. For example, image data is augmented via cropping or rotating.
- Blending: Combine and link related data from various sources. For example, combine an employee's HR data with payroll data.
- Decomposing: Decompose a data column that has sub-fields. For example, "6 ounces butter" is decomposed into three columns representing value, unit and ingredient.
- Deletion: Duplicates and outliers are removed. Exploratory Data Analysis (EDA) may be used to identify outliers.
- Formatting: Data is modified to a consistent form. For example, 2019-Jul-01, 2019-07-1, and 1/7/19 are changed to a single form, such as 2019-07-01.
- Imputation: Fill missing values using estimates from available data.
- Labelling: Data is labelled for supervised machine learning.
- Normalization: Data is scaled or shifted, perhaps to a range of 0-1.
- Sampling: For a quick analysis, select a small representative sample.
What attributes are important in the context of data preparation?
Results of a survey of 384 respondents about data preparation attributes. Source: Stodder 2016, fig. 3.
It's important to measure the data preparation pipeline and assess how well it delivers data for analytics. To start with, data quality is measured by its accuracy, completeness, consistency, timeliness, believability, added value, interpretability, and accessibility. The pipeline must be able to validate data, spot problems and give tools to understand why they're occurring.
With the growing demand for near real-time analytics, we expect frequent data refreshing. Ideally, data quality is maintained even when refresh rate is high. In practice, there may be a trade-off.
Data must be easy to access even with growing data variety and volume. Data lakes and Hadoop are enablers in this regard.
Data must also conform to data models, domain models and database schemas. Data preparation must check for conformance.
Data preparation must ensure consistency across various datasets. For example, variations due to spelling or abbreviations must be handled. One sports dataset may use the term 'soccer' while another may use the term 'football'.
Unlike ETL systems, data preparation pipeline must be more flexible for ad hoc processing.
What are some challenges involved in preparing data?
Results of a survey of 311 respondents about data preparation challenges. Source: Stodder 2016, fig. 4.
A common challenge faced by businesses is the diversity of data sources, siloed or proprietary tools, tedious processes, and regulatory compliance. Incompatible data formats, messy data and unbalanced data are further challenges.
When manual processes are used, businesses spend more time preparing data than analyzing it. Therefore, organizations must invest in tools and automation. Indeed, data scientists should get involved with teams tasked with data preparation.
It's been said that "bad data is bad for business". Organizations are unsure of data's quality and hence lack confidence in using it for decision making. This can be solved by investing early in data collection and preparation. Profile the data landscape. Improve data quality at source.
Some applications (such as fraud detection or industrial IoT apps) may require data preparation in real time. With large data volumes, collecting, preparing and storing data at scale is a challenge. In production, the preparation pipeline should be repeatable, handle errors and tuned for performance. It should work for initial data, incremental data and streamed data.
Could you share some best practices for data preparation?
Check data types and formats. Check if data is accurate. Graph the data to get a sense of the distribution and outliers. If these are not as expected, question your assumptions. Label and annotate the graphs. Backup the data. Document or record the steps so that they can be repeated when new data comes in.
Data professionals shouldn't rely too much on IT departments. Adopt data systems and tools that are user friendly.
Data literacy, which is the ability to read, analyze and argue from data, is an essential skill today. Engineers, data scientists and business users must talk in a common language. Prepare data with a good understanding of the context and the business goals.
Profile the data first. Start with a small sample. Iteratively, try different cleaning strategies and discuss the results with business stakeholders. Keep in mind that data may change in the future. Data preparation is therefore not a do-once-and-forget task.
Which programming languages are well suited for data preparation tasks?
Querying, filtering, and sampling in R and Pandas. Source: Pandas Docs 2020.
Two well-known libraries for data preparation are pandas (Python) and dplyr (R). Apache Spark, more a framework than a language, is also suited for data preparation. Apache Spark enables fast in-memory data processing in a distributed architecture.
The main data structure in pandas is the DataFrame. The method df.describe() is a quick way to describe the data. A more detailed profile can be obtained using the pandas-profiling package. Missing values are represented as NaN or "not a number". Methods df.fillna() and df.dropna() help in dealing with NaN values. A couple of useful tutorials using Pandas are by Jaiswal and by Osei.
In R, the equivalent data structure is data.frame. A brief structure of the data can be seen using str(). Missing data is marked as NA. Packages dplyr and tidyr can be used for data preparation.
Spark provides APIs in Python, R, Java, Scala and SQL. When Spark is used, it's possible to convert between Spark DataFrames and Pandas DataFrames using Apache Arrow.
Could you list some tools that aid data preparation with minimal coding?
Market map of self-service data preparation tools. Source: Howard 2019.
Data analysis can't happen without data preparation. To enable this with minimal or no coding, it's best to use a self-service data preparation tool. This helps data analysts, business owners and data scientists. There are many of these in the market today (2019): Altair Monarch, Alteryx, ClearStory, Datameer, DataWatch, Dialogue, Improvado, LavaStorm, Microstrategy, Oracle, Paxata, Qlik, Quest, SAP, SAS, Tableau Prep, TIBCO, Trifacta, and Zaloni.
Many tools can combine data from different sources. They include visualization and exploratory data analysis. For data preparation, they can clean, augment, transform, or anonymize data. Some self-service tools include analytics and cataloguing. Good tools manage metadata and offer a search feature. They can track data sources and data transformations.
While developing data preparation pipelines, tools should support real-time visualizations. This facilitates quick debugging of the processing logic. However, it's been noted that many tools are unable to handle big data or fast enough on complex queries.
When choosing a suitable tool, some aspects to consider are features, pricing, performance, usability, collaboration, licensing model, vendor viability, customer support, enterprise integration, security and ecosystem.

Milestones

2000

This decade sees the growing use of data in organizations. However, data is controlled and managed by IT departments. Data scientists work with IT staff. Data preparation involves coding and specialized expertise. Meanwhile, Business Intelligence (BI) tools show the benefits of data visualization and reporting. Users shift from spreadsheets to BI tools. This leads to more and more requests for data access and analysis. IT is soon overwhelmed by the workload.

May
2005

In a report titled Competing on Analytics, researchers note that companies are competing based on data and analytics. Statistical analysis and predictive modelling have become important. However, the report makes no mention of data preparation.

2010

To cope with the shift towards data-centric operations, IT open up data to other departments. While this brings speed, data also get siloed and inconsistent across datasets. Spreadsheets are error prone. ETL tools are rigid and therefore not suited for rapid prototyping. In this context, self-service data preparation tools emerge. These require almost no coding and enable rapid iterations.

Sep
2018

Google Cloud Platform announces the general availability of Google Cloud Dataprep, a cloud offering from Trifacta. This is a self-managed data preparation tool that integrates well with other parts of Google Cloud such as BigQuery and Cloud Dataflow. Prepared data can then be analyzed in Google Data Studio.

2019

A study by the International Data Corporation (IDC) finds that 33% of time is spent on data preparation, 32% on analytics and only 13% on data science. Respondents note that "too much time is spent on data preparation". They also recognize that they need to automate data preparation to acceleration their analytics pipeline.

References

Article Stats

1752

Words

Authors

Edits

Chats

Likes

7347

Hits

Cite As

Devopedia. 2020. "Data Preparation." Version 8, March 25. Accessed 2023-11-12. https://devopedia.org/data-preparation

Contributed by
4 authors

Last updated on
2020-03-25 13:35:28

tools data data analysis

Data Mining
Data Fusion
Data Formatting
Data Imputation
Data Archiving
Data Pipeline Patterns

Data Preparation

Discussion

Milestones

References

Further Reading

Article Stats

Author-wise Stats for Article Edits

Cite As

See Also

Login