Wrangle (Language)

Article Info

Contributed by
4 authors

Last updated on
2022-02-15 11:53:47

Data Preparation
Data Visualization
Data Mining

Article Versions

6 2022-02-15 11:53:47
3364,3189 6,3364

By devbot5S

Migrating blockquotes to markdown syntax
5 2022-01-26 08:44:06
3189,1521 5,3189

By arvindpdmn

Corrected ref URL
4 2019-08-01 17:28:02
1521,1500 4,1521

By arvindpdmn

Improved old answers. Completed rest of the article.
3 2019-07-16 09:46:12
1500,1498 3,1500

By masture

Added answer to first question
2 2019-07-16 09:44:55
1498,1477 2,1498

By avisatna

Added answer for Wrangle syntax with an example

Chat Room

Submitting ...

You are editing an existing chat message.

Real-world data is messy. Before we can do any useful analysis on data, we need to clean or format it in a manner that's acceptable to data analysis or visualization tools. This process is often called data wrangling. We can do this interactively but it's better to record all actions into a script or write a custom program. This will help us document how the data was wrangled and also repeat the process with new data.

Wrangle is a proprietary language for automating the task of data wrangling. It's owned and managed by Trifacta. Once wrangled, data can be exported in CSV or JSON formats, which are supported by most analysis or visualization tools.

The essence of Wrangle is this:

Spend less time fighting with your data and more time learning from it.

Discussion

In Wrangle, what are transforms, functions and recipes?
Here are the three essentials aspects of Wrangle:
- Transform: This is an action that modifies the dataset in a specific way. Typically, a dataset is in table format with rows and columns. Transforms accept parameters to set the context in terms of rows, columns or conditions. Examples include deduplicate, delete, derive, drop, extract, filter, flatten, nest, pivot, replace, etc.
- Function: Like in any programming language, a Wrangler function is a computation unit. It works on one of more columns of data. Functions can be passed as parameters to transforms. There are lots of functions that are usually organized into function categories such as aggregate (COUNT, MAX), logical (AND, OR), comparison (ISODD, LESSTHAN), math (DIVIDE, SQRT), date (DATEADD, NOW), string (LOWER, FIND), nested (ARRAYCONCAT, LISTSUM), type (IFNULL, ISMISSING), window (FILL, ROLLINGSUM), and others (RAND, RANGE).
- Recipe: This is a sequence of transforms applied to a dataset.
It's common to name transforms in lowercase and functions in uppercase. Transforms and functions are the building blocks from which powerful recipes can be built.
Could you describe the Wrangle syntax with some examples?
Illustrating the use of split transform. Source: Adapted from Trifacta Docs 2019i.
Wrangle transforms follow the general syntax of the transform's name followed by optional parameters: (transform) param1:(expression) param2:(expression). Let's look at a few examples to understand this.
To create a new column called "circumference" based on another column of values named "diameter", we can apply the derive transform: derive type:single value: (3.14159 * diameter) as: 'circumference'
To create a column of Boolean values indicating big orders, we can use the derive transform: derive type:single value:IF(order > 1000000, true, false) as:'bigOrder'
By passing a regular expression parameter to the replace transform, we can delete last two digits from "qty" column: replace col: qty on: /^\d$|^\d\d$/ with: '' global: true
Suppose a column named "myCol" is in the format of key-value pairs, such as {"brand":"Subaru","model":"Impreza","color","green"}. We can separate this into three columns using the unnest transform: unnest col:myCol keys:'brand','model','color'
We can use filter transform to keep only rows that match a specific condition: filter row: (row_number >= 25 && firstName == 'Steve') action: Keep
Which data types are supported in Wrangle?
Basic data types include String, Integer, Decimal, and Boolean. String size is limited by size of a row of data, which is about 1 MB. Integers can be safely used within the range (-2^53 + 1) to (2^53 - 1). Decimals are limited to 15 floating point digits. It's not clear if these are limits of the Wrangle language or the Trifacta Wrangler family of products.
Other data types include Social Security Number, Phone Number, Email Address, Credit Card, Gender Data, Zip Code, State, IP Address, URL, HTTP Code and Datetime. However, some of these won't be useful for non-U.S. data.
Array type can be used to group together a sequence of values. Arrays can be nested; that is, an array containing another array. Arrays can be ragged; that is, the number of items in an array can vary from one data record to the next.
To store key-value pairs, we can use the Object type. This can also be nested like in arrays but there's no ordering of items within the type.
What are some concerns about the Wrangle language?
Transformer page of Wrangle tool interactively suggests possible data transforms. Source: Trifacta Docs 2019g.
Wrangle language is not an open standard. It's owned and managed by Trifacta. It's supported in Trifacta's Wrangler family of products. It's also available within partner products such as Google's Cloud Dataprep. While learning the language is useful, it's not essential since these products often generate the language syntax for the user via contextual suggestions and steps.
Because it's not open, there's a vendor lock-in. This means that any data preparation workflows that you build with Wrangle can't be reused when you move to another platform. For example, AWS has its own tool for data preparation called Glue. ETL workflows can be programmed in Glue using Python, PySpark extensions and Scala. However, anything written in Wrangle can't be reused in AWS. Only the wrangled data can be exported and imported into other tools or platforms.

Milestones

2001

Researchers from UC Berkeley publish a paper titled Potter’s Wheel: An Interactive Data Cleaning System. They identify a number of transforms that can help with data wrangling.

May
2011

A declarative script written in JavaScript. Source: Kandel et al. 2011, fig. 7.

Led by researchers at the Stanford Visualization Group, a paper titled Wrangler: Interactive Visual Specification of Data Transformation Scripts is published. While the tool is visual and interactive in nature, the paper also talks about a declarative transformation language that has evolved along with the tool. Among the transforms supported are delete, extract, cut, split, lookups, joins, fold, unfold, fill, lag; plus sorting and aggregating functions such as sum, min, max and mean.

Feb
2014

Stanford's Wrangler is commercialized with the release of Data Transformation Platform. An alpha version of this was available back in April 2013 and the company behind it, Trifacta, itself was founded in 2012.

Oct
2015

Trifacta releases Wrangler, a free desktop application for data wrangling. The has limited features compared to the commercial offering that's renamed to Trifacta Wrangler Enterprise.

Mar
2017

Google launches Cloud Dataprep for data preparation in the cloud. Under the hood, it makes use of Trifacta Wrangler Enterprise along with the Wrangle language.

Apr
2018

Trifacta Wrangler is released as a cloud product. The desktop version will no longer be updated and cease operation (implying data loss) in August 2019.

Mar
2019

New functions are introduced including ARRAYINDEXOF, ARRAYRIGHTINDEXOF, ARRAYSLICE and ARRAYMERGEELEMENTS.

May
2019

Functions RANK and DENSERANK are introduced.

References

Article Stats

1097

Words

Authors

Edits

Chats

Likes

2476

Hits

Cite As

Devopedia. 2022. "Wrangle (Language)." Version 6, February 15. Accessed 2023-11-13. https://devopedia.org/wrangle-language

Contributed by
4 authors

Last updated on
2022-02-15 11:53:47

languages data data analysis

Data Preparation
Data Visualization
Data Mining

Wrangle (Language)

Discussion

Milestones

References

Further Reading

Article Stats

Cite As

See Also

Wrangle (Language)

Discussion

Milestones

References

Further Reading

Article Stats

Author-wise Stats for Article Edits

Cite As

See Also

Login