Wrangle (Language)
Real-world data is messy. Before we can do any useful analysis on data, we need to clean or format it in a manner that's acceptable to data analysis or visualization tools. This process is often called data wrangling. We can do this interactively but it's better to record all actions into a script or write a custom program. This will help us document how the data was wrangled and also repeat the process with new data.
Wrangle is a proprietary language for automating the task of data wrangling. It's owned and managed by Trifacta. Once wrangled, data can be exported in CSV or JSON formats, which are supported by most analysis or visualization tools.
The essence of Wrangle is this:
Spend less time fighting with your data and more time learning from it.
Discussion
-
In Wrangle, what are transforms, functions and recipes? Here are the three essentials aspects of Wrangle:
- Transform: This is an action that modifies the dataset in a specific way. Typically, a dataset is in table format with rows and columns. Transforms accept parameters to set the context in terms of rows, columns or conditions. Examples include deduplicate, delete, derive, drop, extract, filter, flatten, nest, pivot, replace, etc.
- Function: Like in any programming language, a Wrangler function is a computation unit. It works on one of more columns of data. Functions can be passed as parameters to transforms. There are lots of functions that are usually organized into function categories such as aggregate (COUNT, MAX), logical (AND, OR), comparison (ISODD, LESSTHAN), math (DIVIDE, SQRT), date (DATEADD, NOW), string (LOWER, FIND), nested (ARRAYCONCAT, LISTSUM), type (IFNULL, ISMISSING), window (FILL, ROLLINGSUM), and others (RAND, RANGE).
- Recipe: This is a sequence of transforms applied to a dataset.
It's common to name transforms in lowercase and functions in uppercase. Transforms and functions are the building blocks from which powerful recipes can be built.
-
Could you describe the Wrangle syntax with some examples? Wrangle transforms follow the general syntax of the transform's name followed by optional parameters:
(transform) param1:(expression) param2:(expression)
. Let's look at a few examples to understand this.To create a new column called "circumference" based on another column of values named "diameter", we can apply the
derive
transform:derive type:single value: (3.14159 * diameter) as: 'circumference'
To create a column of Boolean values indicating big orders, we can use the
derive
transform:derive type:single value:IF(order > 1000000, true, false) as:'bigOrder'
By passing a regular expression parameter to the
replace
transform, we can delete last two digits from "qty" column:replace col: qty on: /^\d$|^\d\d$/ with: '' global: true
Suppose a column named "myCol" is in the format of key-value pairs, such as
{"brand":"Subaru","model":"Impreza","color","green"}
. We can separate this into three columns using theunnest
transform:unnest col:myCol keys:'brand','model','color'
We can use
filter
transform to keep only rows that match a specific condition:filter row: (row_number >= 25 && firstName == 'Steve') action: Keep
-
Which data types are supported in Wrangle? Basic data types include
String
,Integer
,Decimal
, andBoolean
. String size is limited by size of a row of data, which is about 1 MB. Integers can be safely used within the range (-2^53 + 1) to (2^53 - 1). Decimals are limited to 15 floating point digits. It's not clear if these are limits of the Wrangle language or the Trifacta Wrangler family of products.Other data types include Social Security Number, Phone Number, Email Address, Credit Card, Gender Data, Zip Code, State, IP Address, URL, HTTP Code and Datetime. However, some of these won't be useful for non-U.S. data.
Array
type can be used to group together a sequence of values. Arrays can be nested; that is, an array containing another array. Arrays can be ragged; that is, the number of items in an array can vary from one data record to the next.To store key-value pairs, we can use the
Object
type. This can also be nested like in arrays but there's no ordering of items within the type. -
What are some concerns about the Wrangle language? Wrangle language is not an open standard. It's owned and managed by Trifacta. It's supported in Trifacta's Wrangler family of products. It's also available within partner products such as Google's Cloud Dataprep. While learning the language is useful, it's not essential since these products often generate the language syntax for the user via contextual suggestions and steps.
Because it's not open, there's a vendor lock-in. This means that any data preparation workflows that you build with Wrangle can't be reused when you move to another platform. For example, AWS has its own tool for data preparation called Glue. ETL workflows can be programmed in Glue using Python, PySpark extensions and Scala. However, anything written in Wrangle can't be reused in AWS. Only the wrangled data can be exported and imported into other tools or platforms.
Milestones
2011

Led by researchers at the Stanford Visualization Group, a paper titled Wrangler: Interactive Visual Specification of Data Transformation Scripts is published. While the tool is visual and interactive in nature, the paper also talks about a declarative transformation language that has evolved along with the tool. Among the transforms supported are delete, extract, cut, split, lookups, joins, fold, unfold, fill, lag; plus sorting and aggregating functions such as sum, min, max and mean.
2014
2015
2017
2018
2019
References
- AWS Docs. 2019. "Programming ETL Scripts." AWS Glue Developer Guide, July 24. Accessed 2019-08-01.
- Kandel, Sean, Andreas Paepcke, Joseph Hellerstein, and Jeffrey Heer. 2011. "Wrangler: Interactive Visual Specification of Data Transformation Scripts." Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 3363-3372, ACM, May 7-12. Accessed 2019-08-01.
- Novet, Jordan. 2015. "Trifacta releases free Wrangler data transformation app for Mac and Windows." Venture Beat, October 19. Accessed 2019-08-01.
- Novet, Jordan. 2017. "Google launches Cloud Dataprep, an embedded version of Trifacta." Venture Beat, March 09. Accessed 2019-08-01.
- Raman, Vijayshankar and Joseph M. Hellerstein. 2001. "Potter’s Wheel: An Interactive Data Cleaning System." Accessed 2019-08-01.
- Stanford Visualization Group. 2013. "Data Wrangler." Accessed 2019-08-01.
- Trifacta Docs. 2018a. "Integer Data Type." Reference, November 26. Accessed 2019-08-01.
- Trifacta Docs. 2019a. "Supported Data Types." Reference, May 10. Accessed 2019-08-01.
- Trifacta Docs. 2019b. "String Data Type." Reference, June 12. Accessed 2019-08-01.
- Trifacta Docs. 2019c. "Array Data Type." Reference, February 11. Accessed 2019-08-01.
- Trifacta Docs. 2019d. "Object Data Type." Reference, February 11. Accessed 2019-08-01.
- Trifacta Docs. 2019e. "Supported File Formats." July 29. Accessed 2019-08-01.
- Trifacta Docs. 2019f. "Wrangle Language." June 11. Accessed 2019-08-01.
- Trifacta Docs. 2019g. "Transformer Page." Reference, June 13. Accessed 2019-08-01.
- Trifacta Docs. 2019h. "Filter Transform." Wrangle Language, March 09. Accessed 2019-08-01.
- Trifacta Docs. 2019i. "Split Transform." Wrangle Language, January 10. Accessed 2019-08-01.
- Trifacta Docs. 2019j. "Release Notes for Trifacta Wrangler." July 31. Accessed 2019-08-01.
- Wikipedia. 2019. "Trifacta." Wikipedia, April 08. Accessed 2019-08-01.
Further Reading
- Trifacta Docs. 2019f. "Wrangle Language." June 11. Accessed 2019-08-01.
- Kandel, Sean, Andreas Paepcke, Joseph Hellerstein, and Jeffrey Heer. 2011. "Wrangler: Interactive Visual Specification of Data Transformation Scripts." Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 3363-3372, ACM, May 7-12. Accessed 2019-08-01.
- Raman, Vijayshankar and Joseph M. Hellerstein. 2001. "Potter’s Wheel: An Interactive Data Cleaning System." Accessed 2019-08-01.
Article Stats
Cite As
See Also
- Data Preparation
- Data Visualization
- Data Mining