Data Preparation
Raw data is usually not suitable for direct analysis. This is because the data might come from different sources in different formats. Moreover, real-world data is not clean. Some data points might be missing. Some others might be out of range. There could be duplicates. Data preparation is therefore an essential task that transforms or prepares data into a form that's suitable for analysis.
Data preparation assumes that data has already been collected. However, others may consider data collection and data ingestion as part of data preparation. Within data preparation, it's common to identify sub-stages that might include data pre-processing, data wrangling, and data transformation.
In today's data-driven world, the final goal is to obtain useful insights from data via analytics. Data preparation is an essential step towards this goal: poorly prepared data makes analytics more difficult and ineffective.
Discussion
What's the typical pipeline for data preparation? The first step of a data preparation pipeline is to gather data from various sources and locations. Before any processing is done, we wish to discover what the data is about. At this stage, we understand the data within the context of business goals. Visualization of the data is also helpful here. The next stage is to cleanse the data of missing values and invalid values. We also reformat data to standard forms. Next we transform the data for a specific outcome or audience. We can enrich data by merging different datasets to enable richer insights. Finally, we store the data or directly send it out for analytics.
In the context of machine learning, data pre-processing converts raw data into clean data. This involves iterative cleaning, integration, transformation and reduction of data. When an ML model is being prototyped, data scientists often go back and convert the data into a more useful form. This could be called data wrangling or data munging. It involves filtering, grouping, sorting, aggregating or decomposing data as required for modelling.
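The wrangling operations mentioned above (filtering, grouping, sorting, aggregating) can be sketched in pandas. The dataset and column names here are purely illustrative:

```python
import pandas as pd

# Hypothetical sales dataset used only for illustration
df = pd.DataFrame({
    "region": ["north", "south", "north", "south", "north"],
    "product": ["A", "A", "B", "B", "A"],
    "units": [10, 4, 7, 12, 3],
})

# Filtering: keep only rows matching a condition
big_orders = df[df["units"] > 5]

# Grouping and aggregating: summarize units per region
per_region = df.groupby("region")["units"].sum()

# Sorting: rank regions by total units
ranked = per_region.sort_values(ascending=False)
```

Each step returns a new DataFrame or Series, so steps can be chained or reordered as the model's needs change.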
What are some common tasks involved in data preparation? Data preparation involves one or more of the following tasks:
- Aggregation: Multiple columns are reduced to fewer columns. Records are summarized.
- Anonymization: Sensitive values are removed for the sake of privacy.
- Augmentation: Expand the dataset size without collecting more data. For example, image data is augmented via cropping or rotating.
- Blending: Combine and link related data from various sources. For example, combine an employee's HR data with payroll data.
- Decomposing: Decompose a data column that has sub-fields. For example, "6 ounces butter" is decomposed into three columns representing value, unit and ingredient.
- Deletion: Duplicates and outliers are removed. Exploratory Data Analysis (EDA) may be used to identify outliers.
- Formatting: Data is modified to a consistent form. For example, 2019-Jul-01, 2019-07-1, and 1/7/19 are changed to a single form, such as 2019-07-01.
- Imputation: Fill missing values using estimates from available data.
- Labelling: Data is labelled for supervised machine learning.
- Normalization: Data is scaled or shifted, perhaps to a range of 0-1.
- Sampling: For a quick analysis, select a small representative sample.
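A few of the tasks listed above (formatting, imputation, normalization) can be sketched in pandas. The dataset is a toy example; the date variants are taken from the formatting task above:

```python
import pandas as pd
from datetime import datetime

# Toy dataset with a missing value and inconsistent date formats (illustrative only)
df = pd.DataFrame({
    "date": ["2019-Jul-01", "2019-07-1", "1/7/19"],
    "amount": [100.0, None, 300.0],
})

# Formatting: try each known variant and emit one canonical form
def to_iso(s):
    for fmt in ("%Y-%b-%d", "%Y-%m-%d", "%d/%m/%y"):
        try:
            return datetime.strptime(s, fmt).strftime("%Y-%m-%d")
        except ValueError:
            pass
    return None

df["date"] = df["date"].map(to_iso)

# Imputation: fill the missing amount with the column mean
df["amount"] = df["amount"].fillna(df["amount"].mean())

# Normalization: rescale amounts to the range 0-1
lo, hi = df["amount"].min(), df["amount"].max()
df["amount_scaled"] = (df["amount"] - lo) / (hi - lo)
```

All three date variants collapse to 2019-07-01, the missing amount becomes the mean of the observed values, and the scaled column lies in 0-1.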
What attributes are important in the context of data preparation? It's important to measure the data preparation pipeline and assess how well it delivers data for analytics. To start with, data quality is measured by its accuracy, completeness, consistency, timeliness, believability, added value, interpretability, and accessibility. The pipeline must be able to validate data, spot problems and give tools to understand why they're occurring.
With the growing demand for near real-time analytics, we expect frequent data refreshing. Ideally, data quality is maintained even when refresh rate is high. In practice, there may be a trade-off.
Data must be easy to access even with growing data variety and volume. Data lakes and Hadoop are enablers in this regard.
Data must also conform to data models, domain models and database schemas. Data preparation must check for conformance.
Data preparation must ensure consistency across various datasets. For example, variations due to spelling or abbreviations must be handled. One sports dataset may use the term 'soccer' while another may use the term 'football'.
Unlike ETL systems, data preparation pipelines must be flexible enough for ad hoc processing.
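Handling variations such as the soccer/football example above amounts to mapping known synonyms onto one canonical term. A minimal sketch, with a hypothetical synonym table:

```python
import pandas as pd

# Hypothetical synonym table mapping known variants to a canonical term
synonyms = {"soccer": "football"}

uk = pd.Series(["football", "rugby"])
us = pd.Series(["soccer", "rugby"])

# Map known variants to the canonical term; leave other values untouched
uk_std = uk.map(lambda v: synonyms.get(v, v))
us_std = us.map(lambda v: synonyms.get(v, v))
```

After standardization, both datasets use 'football', so joins and aggregations across them behave consistently.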
What are some challenges involved in preparing data? Common challenges faced by businesses include the diversity of data sources, siloed or proprietary tools, tedious processes, and regulatory compliance. Incompatible data formats, messy data and unbalanced data are further challenges.
When manual processes are used, businesses spend more time preparing data than analyzing it. Therefore, organizations must invest in tools and automation. Indeed, data scientists should get involved with teams tasked with data preparation.
It's been said that "bad data is bad for business". Organizations are unsure of data's quality and hence lack confidence in using it for decision making. This can be solved by investing early in data collection and preparation. Profile the data landscape. Improve data quality at source.
Some applications (such as fraud detection or industrial IoT apps) may require data preparation in real time. With large data volumes, collecting, preparing and storing data at scale is a challenge. In production, the preparation pipeline should be repeatable, handle errors and be tuned for performance. It should work for initial data, incremental data and streamed data.
Could you share some best practices for data preparation? Check data types and formats. Check if data is accurate. Graph the data to get a sense of the distribution and outliers. If these are not as expected, question your assumptions. Label and annotate the graphs. Back up the data. Document or record the steps so that they can be repeated when new data comes in.
Data professionals shouldn't rely too much on IT departments. Adopt data systems and tools that are user friendly.
Data literacy, which is the ability to read, analyze and argue from data, is an essential skill today. Engineers, data scientists and business users must talk in a common language. Prepare data with a good understanding of the context and the business goals.
Profile the data first. Start with a small sample. Iteratively, try different cleaning strategies and discuss the results with business stakeholders. Keep in mind that data may change in the future. Data preparation is therefore not a do-once-and-forget task.
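The advice above, to profile a small sample before committing to a cleaning strategy, can be sketched in pandas. The dataset here is synthetic, for illustration only:

```python
import numpy as np
import pandas as pd

# Synthetic dataset standing in for a large raw table (illustrative only)
rng = np.random.default_rng(0)
df = pd.DataFrame({"value": rng.normal(50, 10, size=10_000)})

# Work on a reproducible 1% sample first
sample = df.sample(frac=0.01, random_state=42)

# Quick profile: count, mean, std, quartiles
profile = sample["value"].describe()
```

Fixing `random_state` makes the sample reproducible, so a cleaning strategy tried today can be re-run on the same rows tomorrow.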
Which programming languages are well suited for data preparation tasks? Two well-known libraries for data preparation are pandas (Python) and dplyr (R). Apache Spark, more a framework than a language, is also suited for data preparation. Apache Spark enables fast in-memory data processing in a distributed architecture.
The main data structure in pandas is the DataFrame. The method df.describe() is a quick way to describe the data. A more detailed profile can be obtained using the pandas-profiling package. Missing values are represented as NaN or "not a number". Methods df.fillna() and df.dropna() help in dealing with NaN values. A couple of useful tutorials using Pandas are by Jaiswal and by Osei.
In R, the equivalent data structure is data.frame. A brief structure of the data can be seen using str(). Missing data is marked as NA. Packages dplyr and tidyr can be used for data preparation.
Spark provides APIs in Python, R, Java, Scala and SQL. When Spark is used, it's possible to convert between Spark DataFrames and Pandas DataFrames using Apache Arrow.
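The pandas methods named above can be demonstrated on a tiny DataFrame with missing values:

```python
import numpy as np
import pandas as pd

# Small DataFrame where missing values appear as NaN
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

# df.describe() summarizes each numeric column; counts exclude NaN
summary = df.describe()

# fillna() replaces NaN with a chosen value; dropna() removes rows containing NaN
filled = df.fillna(0)
complete = df.dropna()
```

Note that describe() reports a count of 2 for each column here, since one value in each is NaN, and dropna() keeps only the single fully populated row.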
Could you list some tools that aid data preparation with minimal coding? Data analysis can't happen without data preparation. To enable this with minimal or no coding, it's best to use a self-service data preparation tool. This helps data analysts, business owners and data scientists. There are many of these in the market today (2019): Altair Monarch, Alteryx, ClearStory, Datameer, DataWatch, Dialogue, Improvado, LavaStorm, Microstrategy, Oracle, Paxata, Qlik, Quest, SAP, SAS, Tableau Prep, TIBCO, Trifacta, and Zaloni.
Many tools can combine data from different sources. They include visualization and exploratory data analysis. For data preparation, they can clean, augment, transform, or anonymize data. Some self-service tools include analytics and cataloguing. Good tools manage metadata and offer a search feature. They can track data sources and data transformations.
While developing data preparation pipelines, tools should support real-time visualizations. This facilitates quick debugging of the processing logic. However, it's been noted that many tools are unable to handle big data or aren't fast enough on complex queries.
When choosing a suitable tool, some aspects to consider are features, pricing, performance, usability, collaboration, licensing model, vendor viability, customer support, enterprise integration, security and ecosystem.
Milestones
This decade sees the growing use of data in organizations. However, data is controlled and managed by IT departments. Data scientists work with IT staff. Data preparation involves coding and specialized expertise. Meanwhile, Business Intelligence (BI) tools show the benefits of data visualization and reporting. Users shift from spreadsheets to BI tools. This leads to more and more requests for data access and analysis. IT is soon overwhelmed by the workload.
2005
To cope with the shift towards data-centric operations, IT opens up data to other departments. While this brings speed, data also gets siloed and inconsistent across datasets. Spreadsheets are error prone. ETL tools are rigid and therefore not suited for rapid prototyping. In this context, self-service data preparation tools emerge. These require almost no coding and enable rapid iterations.
2018
Google Cloud Platform announces the general availability of Google Cloud Dataprep, a cloud offering from Trifacta. This is a self-managed data preparation tool that integrates well with other parts of Google Cloud such as BigQuery and Cloud Dataflow. Prepared data can then be analyzed in Google Data Studio.
A study by the International Data Corporation (IDC) finds that 33% of time is spent on data preparation, 32% on analytics and only 13% on data science. Respondents note that "too much time is spent on data preparation". They also recognize that they need to automate data preparation to accelerate their analytics pipeline.
References
- Altair. 2020. "What is Data Preparation?" Altair. Accessed 2020-03-22.
- Anderson, Alan, and David Semmelroth. 2016. "8 best practices in data preparation." Dummies, March 26. Accessed 2020-03-22.
- Cariou, Bertrand. 2019. "Improving data quality for machine learning and analytics with Cloud Dataprep." Blog, Google Cloud, May 6. Accessed 2020-03-23.
- Databricks Docs. 2020. "Optimize conversion between Apache Spark and pandas DataFrames." Databricks for SQL developers, Databricks Docs, March 12. Accessed 2020-03-24.
- Davenport, Thomas H. 2006. "Competing on Analytics." Harvard Business Review, January. Accessed 2020-03-24.
- DeZyre. 2016. "Apache Spark makes Data Processing & Preparation Faster." DeZyre, Iconiq Inc., March 5. Accessed 2020-03-22.
- Friedman, Hailey. 2019. "Top 23 Data Preparation Tools." Blog, Improvado, August 3. Accessed 2020-03-23.
- Gill, Jagreet Kaur. 2018. "Data Preparation Process, Preprocessing and Data Wrangling." Blog, XenonStack, December 23. Accessed 2020-03-22.
- GlobalNewswire. 2018. "Announcing General Availability of Google Cloud Dataprep by Trifacta." GlobalNewswire, September 20. Accessed 2020-03-24.
- Grace-Martin, Karen. 2009. "Seven Ways to Make up Data: Common Methods to Imputing Missing Data." The Analysis Factor, February 4. Updated 2019-02-26. Accessed 2020-03-23.
- Howard, Philip. 2019. "Data Discovery and Catalogues." Bloor Research International, April 17. Updated 2019-05-01. Accessed 2020-03-22.
- Johnson, Jodi. 2019. "Do You Speak Data Prep?" Infogix, April 17. Accessed 2020-03-24.
- Lawton, George. 2019. "7 enterprise use cases for real-time streaming analytics." Search Business Analytics, TechTarget, March 28. Accessed 2020-03-23.
- Mayo, Matthew. 2019. "7 Steps to Mastering Data Preparation for Machine Learning with Python — 2019 Edition." KDNuggets, June. Accessed 2020-03-22.
- Menon, Ramesh. 2019. "The Top 5 Data Preparation Challenges to Get Big Data Pipelines to Run in Production." Infoworks, February 12. Updated 2019-06-19. Accessed 2020-03-22.
- Nguyen, Tony. 2016. "Data Mining: Data preparation." SlideShare, April 27. Accessed 2020-03-22.
- Opidi, Alfrick. 2019. "Solving Data Challenges In Machine Learning With Automated Tools." TOPBOTS, September 19. Accessed 2020-03-22.
- Pandas Docs. 2020. "Comparison with R / R libraries." Getting Started, Pandas, v1.0.1, February 5. Accessed 2020-03-24.
- Pearlman, Shana. 2018. "What is Data Preparation?" talend, April 1. Updated 2019-03-07. Accessed 2020-03-22.
- Penchikala, Srini. 2017. "Data Preparation Pipelines: Strategy, Options and Tools." InfoQ, April 16. Accessed 2020-03-22.
- Press, Gil. 2013. "A Very Short History Of Data Science." Forbes, May 28. Accessed 2020-03-24.
- Ross, Andrew. 2018. "The most common mistakes in data preparation." Information Age, Bonhill Group Plc, August 20. Accessed 2020-03-22.
- Stodder, David. 2016. "Improving Data Preparation for Business Analytics." Best Practices Report, TDWI. Accessed 2020-03-22.
- Tischler, Robert, and Timm Grosser. 2017. "Data Preparation - Refining Raw Data into Value." Research Study, CXP Group. Accessed 2020-03-22.
- Torres, Liz. 2020. "3 most common data preparation challenges—and how to solve them." Blog, Experian Data Quality, March 10. Accessed 2020-03-22.
- Trout, William. 2019. "The Citizen Data Scientist: Data Preparation, AI, and Self-Serve Analytics in Wealth Management." Celent, February 12. Accessed 2020-03-22.
- University of Wisconsin-Madison. 2014. "R for Researchers: Data preparation." University of Wisconsin-Madison, April 21. Accessed 2020-03-24.
- Yaddow, Wayne. 2019. "AI and BI Projects Are Bogged Down With Data Preparation Tasks." TDWI, August 16. Accessed 2020-03-22.
- Ziko, Alex. 2019. "Seven Essential Data Prep Tasks for Self-Service Analytics." Blog, Rapid Insight, January 10. Updated 2020-02-26. Accessed 2020-03-22.
Further Reading
- Stodder, David. 2016. "Improving Data Preparation for Business Analytics." Best Practices Report, TDWI. Accessed 2020-03-22.
- TDWI. 2017. "Data Preparation: Challenges Facing Every Enterprise." eBook, TDWI. Accessed 2020-03-22.
- Predictive Analytics Today. 2019. "Top 21 Self Service Data Preparation Software in 2020." Predictive Analytics Today, Darkdata Analytics Inc., August 7. Updated 2020-01-12. Accessed 2020-03-22.
- Joshi, Darshan. 2019. "Data Processing Pipeline Patterns." Blog, Informatica, August 20. Accessed 2020-03-22.
- Trifacta. 2020. "Data Preparation Challenges and the Trifacta Solution." Trifacta. Accessed 2020-03-22.
- Harris, Jim. 2018. "The five D's of data preparation." SAS Insights, SAS, January 24. Accessed 2020-03-22.