# R Data Structures

## Summary

R is an object-oriented language and all data structures are objects. R doesn't provide programmers direct access to memory and all data must be accessed via symbols or variables that refer to objects.

Since vectorized operation is an important aspect of R, R does not have any scalars. The most basic data structure is a `vector`, which is a sequence of data items. Thus, a single integer value is treated as an integer vector of unit length. The most versatile data structure is the `list`, while the most common one used for data analysis is the `data.frame`.
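The point about scalars can be checked at the R console. The following is a minimal sketch; the variable names `x` and `doubled` are illustrative only.

```r
# A "scalar" is simply a vector of length one.
x <- 42L
is.vector(x)   # TRUE
length(x)      # 1
x[1]           # 42: indexing works as on any other vector

# Operations are vectorized: they apply to every element at once.
doubled <- c(1, 2, 3) * 2
doubled        # 2 4 6
```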

The terms *data type* and *mode* usually refer to what is stored (integer, character, etc.). The term *data structure* usually refers to how data is stored, that is, the containers (vector, list, etc.).

## Milestones

## Discussion

### What data types are available in R?

R has many data types. The common ones include `integer`, `double` (real numbers), `complex`, `logical` and `character`. Types `integer` and `double` are together termed `numeric`. There's no separate "string" type. Instead, the `character` type is sufficient to denote strings.

Integers are specified with the suffix "L", such as `23L` or `-2L`. Real numbers are specified without this suffix, such as `2.3` or `23`. Examples of complex numbers are `-2+3i` and `-45i`. The logical type can take the values `TRUE` or `FALSE`. These have the short forms `T` and `F`, though `T` and `F` are ordinary variables that can be reassigned, so the full names are safer. Character values are written within a matching pair of single or double quotes, such as `"R"`, `'R'` or `"This is R!"`.
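The literals mentioned above can be inspected with `typeof`. This is a small sketch rather than an exhaustive list of types; note that R reports real numbers as `"double"`.

```r
# typeof() reports the type of each literal.
typeof(23L)      # "integer"
typeof(2.3)      # "double"
typeof(23)       # "double": no L suffix, so not an integer
typeof(-2+3i)    # "complex"
typeof(TRUE)     # "logical"
typeof("R")      # "character"

# Both integer and double values count as numeric.
is.numeric(23L)  # TRUE
is.numeric(2.3)  # TRUE

# Single and double quotes produce the same string.
identical("R", 'R')  # TRUE
```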

### Could you compare vector, matrix, array, list and data.frame?

The following data structures are common in R:

- `vector`: Contains a sequence of items of the same type. This is the most basic structure. Items of a vector can be accessed using `[]`. The function `length` returns the number of items.
- `list`: Represented as a vector but can contain items of different types. Different items can have different lengths. A single item of a list can be accessed using `[[]]`. This is a recursive data type: lists can contain other lists.
- `array`: An n-dimensional structure that expands on a vector. Under the hood, this has a `dim` attribute and optionally a `dimnames` attribute, which don't exist for plain vectors. Like vectors, all items must be of the same underlying type.
- `matrix`: A two-dimensional array.
- `data.frame`: While all columns of a matrix have the same type, different columns of a data frame can have different types.
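The structures listed above can be created and accessed as in the following sketch. All object names and values here (`v`, `l`, `m`, `a`, `df`) are made up for illustration.

```r
# Atomic vector: all items share one type.
v <- c(10, 20, 30)
length(v)             # 3
v[2]                  # 20
attributes(v)         # NULL: a plain vector has no dim attribute

# List: items may differ in type and length, and lists can nest.
l <- list(name = "R", scores = c(90, 85), nested = list(a = 1))
l[["scores"]]         # 90 85
l[["nested"]][["a"]]  # 1

# Matrix: a two-dimensional array, filled column by column.
m <- matrix(1:6, nrow = 2)
dim(m)                # 2 3
m[2, 3]               # 6

# Array: the same idea extended to n dimensions.
a <- array(1:24, dim = c(2, 3, 4))
dim(a)                # 2 3 4

# Data frame: columns of equal length but possibly different types.
df <- data.frame(id = 1:3, label = c("a", "b", "c"),
                 stringsAsFactors = FALSE)
sapply(df, class)     # id: "integer", label: "character"
```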

Formally, vectors are of two kinds: atomic vectors (items of the same type) and lists. In practice, when we say vectors we are referring to atomic vectors.

### What are factors?

Consider the sex of a person. This variable can have only two possibilities or categories: male or female. We call this categorical data, and *factors* are used to represent such data. Roughly equivalent to an "enum" type in C, a factor represents a finite set of values.

Under the hood, factors are nothing more than integer vectors, with each integer representing one category. In R, the possible values are called *levels*. Thus, though sex may contain the values "male" or "female", these are stored not as characters but as integers. In addition, factors can be ordered or unordered. For example, sex may be defined as an unordered factor. An Olympic medal may be defined as an ordered factor such that Bronze < Silver < Gold.
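The two examples above, sex and Olympic medals, might be coded as follows. This is an illustrative sketch; the sample values are invented.

```r
# Unordered factor: categories with no inherent ranking.
sex <- factor(c("male", "female", "female", "male"))
levels(sex)        # "female" "male" (sorted alphabetically by default)
as.integer(sex)    # 2 1 1 2: the integer codes stored under the hood
is.ordered(sex)    # FALSE

# Ordered factor: levels are ranked in the order given.
medal <- factor(c("Gold", "Bronze", "Silver"),
                levels = c("Bronze", "Silver", "Gold"),
                ordered = TRUE)
is.ordered(medal)     # TRUE
medal[2] < medal[1]   # TRUE: Bronze < Gold
max(medal)            # Gold
```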

### Is the NULL object a special data type?

The R documentation states that "the NULL object has no type and no modifiable properties". Attributes don't apply to NULL. When you want to indicate absence, NULL can be used. A vector or list of zero length is not the same as NULL.
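The difference between NULL and zero-length objects can be tested with `is.null`. In this sketch, the list `cfg` and its item names are hypothetical.

```r
is.null(NULL)            # TRUE
length(NULL)             # 0

# Zero-length vectors and lists are distinct from NULL.
is.null(integer(0))      # FALSE
is.null(list())          # FALSE
identical(list(), NULL)  # FALSE

# NULL commonly signals absence, e.g. a list item that doesn't exist.
cfg <- list(host = "localhost")
is.null(cfg$port)        # TRUE: there is no "port" item
```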

### How to interpret the functions class, mode, typeof and storage.mode?

All these functions can be called on an R object, with differing results. Function `class` gives the object's abstract type whereas `typeof` gives the object's specific type. A good example is a factor: its class is `factor` but its type is `integer`. Another example is a data frame: its class is `data.frame` but its type is `list`.

Function `mode` is similar to `typeof` and exists for compatibility with R's predecessor, the S language. Function `storage.mode` also exists for compatibility with S. It's useful when interfacing with code in other languages. For example, consider a vector of integers. Functions `typeof`, `mode` and `storage.mode` will respectively return `integer`, `numeric` and `integer`. In S, both integers and reals have the same mode and hence `storage.mode` becomes useful.

Hadley Wickham has commented that it's best to avoid using `mode` and `storage.mode` in R. If we need the underlying type, calling `typeof` should be preferred over `storage.mode`.
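The examples mentioned above (a factor, a data frame and an integer vector) can be compared side by side. This is a sketch with invented sample objects `f`, `df`, `i` and `d`.

```r
# Factor: abstract class differs from the underlying type.
f <- factor(c("low", "high", "low"))
class(f)           # "factor"
typeof(f)          # "integer"

# Data frame: a list underneath.
df <- data.frame(x = 1:2, y = c("a", "b"))
class(df)          # "data.frame"
typeof(df)         # "list"

# Integer vector: mode() hides the integer/double distinction.
i <- c(1L, 2L, 3L)
typeof(i)          # "integer"
mode(i)            # "numeric"
storage.mode(i)    # "integer"

# Double vector: same mode, different type.
d <- c(1.5, 2.5)
typeof(d)          # "double"
mode(d)            # "numeric"
storage.mode(d)    # "double"
```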

### What are some basic operations on R vectors?

Here are some basic operations on vectors:

- **Combining**: We can combine vectors into a single vector. Eg. `v <- c(v1, v2)` to combine v1 and v2 into v.
- **Indexing**: Indexing starts from 1. Negative numbers select all elements except those specified. Eg. `v[1]` for the first element; `v[-3]` for all elements except the third one. Indexing may be treated as a special case of subsetting.
- **Subsetting**: We can select a subset of a vector by using integer vectors for indexing. Eg. `v[c(1,3,5)]` to select the first, third and fifth elements; `v[1:3]` to select the first three elements. We can also use a logical vector to subset a vector. Eg. `v[v > 5]` to select all elements whose value is greater than 5.
- **Coercing**: Since vectors contain elements of a single type, values of different types are coerced to a single type. Eg. `v <- c(12L, 2.2, TRUE)` coerces to doubles [12.0, 2.2, 1.0]; `v <- c(2.2, TRUE, "Hi")` coerces to characters ["2.2", "TRUE", "Hi"].
- **Converting**: We can explicitly convert the type. Eg. `as.integer(c(3, 2.2, TRUE))` becomes [3, 2, 1], since the fractional part is dropped; `as.numeric(c(2.2, TRUE, "Hi"))` becomes [2.2, NA, NA], where NA stands for "Not Available".
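The operations listed above can be tried together in one session. The sketch below assumes two sample vectors, `v1` and `v2`, whose values are chosen only for illustration.

```r
v1 <- c(2, 4, 6)
v2 <- c(8, 10)

# Combining
v <- c(v1, v2)         # 2 4 6 8 10

# Indexing (1-based; negative indices exclude)
v[1]                   # 2
v[-3]                  # 2 4 8 10

# Subsetting with integer and logical vectors
v[c(1, 3, 5)]          # 2 6 10
v[1:3]                 # 2 4 6
v[v > 5]               # 6 8 10

# Coercing: mixed types are promoted to a common type
typeof(c(12L, 2.2, TRUE))    # "double"
c(2.2, TRUE, "Hi")           # "2.2" "TRUE" "Hi"

# Converting explicitly
as.integer(c(3, 2.2, TRUE))  # 3 2 1 (fractional part is dropped)
# "TRUE" and "Hi" cannot be parsed as numbers, so they become NA (with a warning)
suppressWarnings(as.numeric(c(2.2, TRUE, "Hi")))  # 2.2 NA NA
```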

### Are there datasets to understand the different data structures?

R comes with many datasets for experimental analysis and learning. These can be listed by typing `data()` in the R console. Details of each dataset can be obtained by using `?` or `help`. For example, for help on the "rivers" dataset, type either `?rivers` or `help(rivers)`.

An example of a vector is the "rivers" dataset. An example of a vector with names given to each observation is "precip". There are plenty of examples of data.frame: "airquality", "mtcars", "iris". An example of a list is the "state.center" dataset.
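The structure of these built-in datasets can be confirmed directly. This sketch assumes the standard `datasets` package is attached, as it is in a default R session.

```r
# "rivers": a plain numeric vector without names
is.numeric(rivers)       # TRUE
is.null(names(rivers))   # TRUE

# "precip": a numeric vector whose items carry names
head(names(precip), 3)   # the first few observation names

# "mtcars": a data frame
class(mtcars)            # "data.frame"

# "state.center": a list with items x and y
class(state.center)      # "list"
names(state.center)      # "x" "y"
```

Calling `str` on any of these objects gives a compact view of its structure.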

In "CO2", the "Plant" variable is an ordered factor whereas the "Type" variable is an unordered factor. Dataset "Titanic" is of class `table`, which is a type of array. This data structure records counts of combinations of factor levels.

An object can belong to multiple classes. As an example, try `class(CO2)` and `str(CO2)`. It's a data.frame but also belongs to other classes.

### Can you give examples of data structures beyond the core ones given by R?

Developers can create their own data structures that build on top of the basic ones. A popular one is **data.table**, which is based on data.frame. It offers a simplified and consistent syntax for handling data. Another example is **tibble**, which retains the effective parts of data.frame and does less work so that developers can catch problems early on.

If you want to display data frames in HTML with conditional formatting (like in Microsoft Excel), **formattable** is a suitable package to use. Another package named **dplyr** isn't exactly a data structure. Rather, it offers a number of functions for manipulating data. It comes as part of the *tidyverse* collection of R packages targeted towards data science.

## Sample Code

## References

- CRAN data.table. 2018. "Introduction to data.table." data.table vignette, May 7. Accessed 2018-05-11.
- Ceballos, Maite and Nicolás Cardiel. 2013. "Data structure." First Steps in R. Accessed 2018-05-11.
- Colton, Arianne and Sean Chen. 2016. "Advanced R: Cheat Sheet." RStudio, February. Accessed 2018-05-11.
- Cotton, Richie. 2016. "A comprehensive survey of the types of things in R. 'mode' and 'class' and 'typeof' are insufficient." StackOverflow, October 21. Accessed 2018-05-11.
- Müller, Kirill and Hadley Wickham. 2018. "tibble." Part of the tidyverse. Accessed 2018-05-11.
- R Core Team. 2018. "R Language Definition." v3.5.0, CRAN, April 23. Accessed 2018-05-11.
- Ren, Kun. 2016. "Formattable data frame." Formattable vignettes, CRAN, August 5. Accessed 2018-05-12.
- Wickham, Hadley. 2016. "Replying to @richierocks." Twitter, October 21. Accessed 2018-05-11.
- Wickham, Hadley. 2018. "Data structures." Advanced R, April 26. Accessed 2018-05-11.
- Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2018. "dplyr." Part of the tidyverse. Accessed 2018-05-12.
- data.table Wiki. 2018. "Home." Rdatatable, GitHub. Accessed 2018-05-12.

## See Also

- R (Language)
- R Plotting Systems
- R Object-Oriented Programming
- R data.table
- Vectorization in R
- R dplyr

## Further Reading

- R Core Team. 2018. "R Language Definition." v3.5.0, CRAN, April 23. Accessed 2018-05-11.
- Wickham, Hadley. 2018. "Data structures." Advanced R, April 26. Accessed 2018-05-11.
- Ceballos, Maite and Nicolás Cardiel. 2013. "Data structure." First Steps in R. Accessed 2018-05-11.
- Blischak, John, Daniel Chen, Harriet Dashnow, and Denis Haine (eds). 2016. "Data Types and Structures." Software Carpentry: Programming with R, June. Accessed 2018-05-11.