Wrangle (Language)

Real-world data is messy. Before we can do any useful analysis on data, we need to clean or format it in a manner that's acceptable to data analysis or visualization tools. This process is often called data wrangling. We can do this interactively but it's better to record all actions into a script or write a custom program. This will help us document how the data was wrangled and also repeat the process with new data.

Wrangle is a proprietary language for automating the task of data wrangling. It's owned and managed by Trifacta. Once wrangled, data can be exported in CSV or JSON formats, which are supported by most analysis or visualization tools.

The essence of Wrangle is this:

Spend less time fighting with your data and more time learning from it.

Discussion

  • In Wrangle, what are transforms, functions and recipes?

    Here are the three essentials aspects of Wrangle:

    • Transform: This is an action that modifies the dataset in a specific way. Typically, a dataset is in table format with rows and columns. Transforms accept parameters to set the context in terms of rows, columns or conditions. Examples include deduplicate, delete, derive, drop, extract, filter, flatten, nest, pivot, replace, etc.
    • Function: Like in any programming language, a Wrangler function is a computation unit. It works on one of more columns of data. Functions can be passed as parameters to transforms. There are lots of functions that are usually organized into function categories such as aggregate (COUNT, MAX), logical (AND, OR), comparison (ISODD, LESSTHAN), math (DIVIDE, SQRT), date (DATEADD, NOW), string (LOWER, FIND), nested (ARRAYCONCAT, LISTSUM), type (IFNULL, ISMISSING), window (FILL, ROLLINGSUM), and others (RAND, RANGE).
    • Recipe: This is a sequence of transforms applied to a dataset.

    It's common to name transforms in lowercase and functions in uppercase. Transforms and functions are the building blocks from which powerful recipes can be built.

  • Could you describe the Wrangle syntax with some examples?
    Illustrating the use of split transform. Source: Adapted from Trifacta Docs 2019i.
    Illustrating the use of split transform. Source: Adapted from Trifacta Docs 2019i.

    Wrangle transforms follow the general syntax of the transform's name followed by optional parameters: (transform) param1:(expression) param2:(expression). Let's look at a few examples to understand this.

    To create a new column called "circumference" based on another column of values named "diameter", we can apply the derive transform: derive type:single value: (3.14159 * diameter) as: 'circumference'

    To create a column of Boolean values indicating big orders, we can use the derive transform: derive type:single value:IF(order > 1000000, true, false) as:'bigOrder'

    By passing a regular expression parameter to the replace transform, we can delete last two digits from "qty" column: replace col: qty on: /^\d$|^\d\d$/ with: '' global: true

    Suppose a column named "myCol" is in the format of key-value pairs, such as {"brand":"Subaru","model":"Impreza","color","green"}. We can separate this into three columns using the unnest transform: unnest col:myCol keys:'brand','model','color'

    We can use filter transform to keep only rows that match a specific condition: filter row: (row_number >= 25 && firstName == 'Steve') action: Keep

  • Which data types are supported in Wrangle?

    Basic data types include String, Integer, Decimal, and Boolean. String size is limited by size of a row of data, which is about 1 MB. Integers can be safely used within the range (-2^53 + 1) to (2^53 - 1). Decimals are limited to 15 floating point digits. It's not clear if these are limits of the Wrangle language or the Trifacta Wrangler family of products.

    Other data types include Social Security Number, Phone Number, Email Address, Credit Card, Gender Data, Zip Code, State, IP Address, URL, HTTP Code and Datetime. However, some of these won't be useful for non-U.S. data.

    Array type can be used to group together a sequence of values. Arrays can be nested; that is, an array containing another array. Arrays can be ragged; that is, the number of items in an array can vary from one data record to the next.

    To store key-value pairs, we can use the Object type. This can also be nested like in arrays but there's no ordering of items within the type.

  • What are some concerns about the Wrangle language?
    Transformer page of Wrangle tool interactively suggests possible data transforms. Source: Trifacta Docs 2019g.
    Transformer page of Wrangle tool interactively suggests possible data transforms. Source: Trifacta Docs 2019g.

    Wrangle language is not an open standard. It's owned and managed by Trifacta. It's supported in Trifacta's Wrangler family of products. It's also available within partner products such as Google's Cloud Dataprep. While learning the language is useful, it's not essential since these products often generate the language syntax for the user via contextual suggestions and steps.

    Because it's not open, there's a vendor lock-in. This means that any data preparation workflows that you build with Wrangle can't be reused when you move to another platform. For example, AWS has its own tool for data preparation called Glue. ETL workflows can be programmed in Glue using Python, PySpark extensions and Scala. However, anything written in Wrangle can't be reused in AWS. Only the wrangled data can be exported and imported into other tools or platforms.

Milestones

2001
Transforms showing how to clean name format differences. Source: Raman and Hellerstein 2001, fig. 6.
Transforms showing how to clean name format differences. Source: Raman and Hellerstein 2001, fig. 6.

Researchers from UC Berkeley publish a paper titled Potter’s Wheel: An Interactive Data Cleaning System. They identify a number of transforms that can help with data wrangling.

May
2011
A declarative script written in JavaScript. Source: Kandel et al. 2011, fig. 7.
A declarative script written in JavaScript. Source: Kandel et al. 2011, fig. 7.

Led by researchers at the Stanford Visualization Group, a paper titled Wrangler: Interactive Visual Specification of Data Transformation Scripts is published. While the tool is visual and interactive in nature, the paper also talks about a declarative transformation language that has evolved along with the tool. Among the transforms supported are delete, extract, cut, split, lookups, joins, fold, unfold, fill, lag; plus sorting and aggregating functions such as sum, min, max and mean.

Feb
2014

Stanford's Wrangler is commercialized with the release of Data Transformation Platform. An alpha version of this was available back in April 2013 and the company behind it, Trifacta, itself was founded in 2012.

Oct
2015

Trifacta releases Wrangler, a free desktop application for data wrangling. The has limited features compared to the commercial offering that's renamed to Trifacta Wrangler Enterprise.

Mar
2017

Google launches Cloud Dataprep for data preparation in the cloud. Under the hood, it makes use of Trifacta Wrangler Enterprise along with the Wrangle language.

Apr
2018

Trifacta Wrangler is released as a cloud product. The desktop version will no longer be updated and cease operation (implying data loss) in August 2019.

Mar
2019

New functions are introduced including ARRAYINDEXOF, ARRAYRIGHTINDEXOF, ARRAYSLICE and ARRAYMERGEELEMENTS.

May
2019

Functions RANK and DENSERANK are introduced.

References

  1. AWS Docs. 2019. "Programming ETL Scripts." AWS Glue Developer Guide, July 24. Accessed 2019-08-01.
  2. Kandel, Sean, Andreas Paepcke, Joseph Hellerstein, and Jeffrey Heer. 2011. "Wrangler: Interactive Visual Specification of Data Transformation Scripts." Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 3363-3372, ACM, May 7-12. Accessed 2019-08-01.
  3. Novet, Jordan. 2015. "Trifacta releases free Wrangler data transformation app for Mac and Windows." Venture Beat, October 19. Accessed 2019-08-01.
  4. Novet, Jordan. 2017. "Google launches Cloud Dataprep, an embedded version of Trifacta." Venture Beat, March 09. Accessed 2019-08-01.
  5. Raman, Vijayshankar and Joseph M. Hellerstein. 2001. "Potter’s Wheel: An Interactive Data Cleaning System." Accessed 2019-08-01.
  6. Stanford Visualization Group. 2013. "Data Wrangler." Accessed 2019-08-01.
  7. Trifacta Docs. 2018a. "Integer Data Type." Reference, November 26. Accessed 2019-08-01.
  8. Trifacta Docs. 2019a. "Supported Data Types." Reference, May 10. Accessed 2019-08-01.
  9. Trifacta Docs. 2019b. "String Data Type." Reference, June 12. Accessed 2019-08-01.
  10. Trifacta Docs. 2019c. "Array Data Type." Reference, February 11. Accessed 2019-08-01.
  11. Trifacta Docs. 2019d. "Object Data Type." Reference, February 11. Accessed 2019-08-01.
  12. Trifacta Docs. 2019e. "Supported File Formats." July 29. Accessed 2019-08-01.
  13. Trifacta Docs. 2019f. "Wrangle Language." June 11. Accessed 2019-08-01.
  14. Trifacta Docs. 2019g. "Transformer Page." Reference, June 13. Accessed 2019-08-01.
  15. Trifacta Docs. 2019h. "Filter Transform." Wrangle Language, March 09. Accessed 2019-08-01.
  16. Trifacta Docs. 2019i. "Split Transform." Wrangle Language, January 10. Accessed 2019-08-01.
  17. Trifacta Docs. 2019j. "Release Notes for Trifacta Wrangler." July 31. Accessed 2019-08-01.
  18. Wikipedia. 2019. "Trifacta." Wikipedia, April 08. Accessed 2019-08-01.

Further Reading

  1. Trifacta Docs. 2019f. "Wrangle Language." June 11. Accessed 2019-08-01.
  2. Kandel, Sean, Andreas Paepcke, Joseph Hellerstein, and Jeffrey Heer. 2011. "Wrangler: Interactive Visual Specification of Data Transformation Scripts." Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 3363-3372, ACM, May 7-12. Accessed 2019-08-01.
  3. Raman, Vijayshankar and Joseph M. Hellerstein. 2001. "Potter’s Wheel: An Interactive Data Cleaning System." Accessed 2019-08-01.

Article Stats

Author-wise Stats for Article Edits

Author
No. of Edits
No. of Chats
DevCoins
3
0
818
1
0
56
1
0
36
1
0
7
1097
Words
1
Likes
2476
Hits

Cite As

Devopedia. 2022. "Wrangle (Language)." Version 6, February 15. Accessed 2023-11-13. https://devopedia.org/wrangle-language
Contributed by
4 authors


Last updated on
2022-02-15 11:53:47