R Data Structures
R is an object-oriented language and all data structures are objects. R doesn't provide programmers direct access to memory and all data must be accessed via symbols or variables that refer to objects.
Since vectorized operation is an important aspect of R, R does not have any scalars. The most basic data structure is a vector, which is a sequence of data items. Thus, a single integer value is treated as an integer vector of unit length. The most versatile data structure is the list while the most common one used for data analysis is the data.frame.
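As a minimal sketch of the "no scalars" point, the snippet below treats a single integer as a vector of length one. The variable names x and y are chosen only for illustration.

```r
# A single value is an integer vector of length 1, not a separate scalar type.
x <- 42L
is.vector(x)   # TRUE
length(x)      # 1
x[1]           # indexing works as on any other vector

# Vectorized arithmetic: the length-1 vector is recycled across 1:3.
y <- x + 1:3   # 43 44 45
```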
The terms data type and mode usually refer to what is stored (integer, character, etc.). The term data structure usually refers to how data is stored, that is, the containers (vector, list, etc.).
Discussion
- What data types are available in R?
R has many data types. The common ones are integer, real, complex, logical and character. R stores reals as type double, and that is the name typeof reports. Types integer and real are together termed numeric. There's no separate "string" type; the character type is sufficient to denote strings.
Integers are specified with the suffix "L", such as 23L or -2L. Real numbers are specified without this suffix, such as 2.3 or 23. Examples of complex numbers are -2+3i and -45i. The logical type takes the values TRUE or FALSE, which have the short forms T and F. Character values are enclosed in a matching pair of single or double quotes, such as "R", 'R' or "This is R!".
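The literals mentioned above can be checked with typeof. This is only an illustrative sketch; the name literal_types is not from the original text.

```r
# Collect the type that R reports for each kind of literal.
literal_types <- c(
  integer_literal   = typeof(23L),     # "integer"
  real_literal      = typeof(23),      # "double": no L suffix, so it is a real
  complex_literal   = typeof(-2+3i),   # "complex"
  logical_literal   = typeof(TRUE),    # "logical"
  character_literal = typeof("R")      # "character"
)
print(literal_types)

# Both integer and real values count as numeric.
is.numeric(23L)       # TRUE
is.numeric(2.3)       # TRUE

# Single and double quotes produce the same character value.
identical("R", 'R')   # TRUE
```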
- Could you compare vector, matrix, array, list and data.frame?
The following data structures are common in R:
  - vector: Contains a sequence of items of the same type. This is the most basic structure. Items of a vector can be accessed using []. Function length can be called to know the number of items.
  - list: Represented as a vector but can contain items of different types. Different items can have different lengths. Items of a list can be accessed using [[]]. This is a recursive data type: lists can contain other lists.
  - array: An n-dimensional structure that expands on a vector. Under the hood, this has dim and optionally dimnames attributes, which don't exist for vectors. Like vectors, all items must be of the same underlying type.
  - matrix: A two-dimensional array.
  - data.frame: While all columns of a matrix have the same type, different columns of a data frame can have different types.
Formally, vectors are of two kinds: atomic vectors (items of the same type) and lists. In practice, when we say vectors we are referring to atomic vectors.
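The comparison above can be sketched by building one small object of each kind. All object names and values here are made up for illustration.

```r
# Atomic vector: items of one type, accessed with [].
v <- c(10, 20, 30)
length(v)            # 3
is.null(dim(v))      # TRUE: plain vectors have no dim attribute

# List: items of different types and lengths, accessed with [[]].
l <- list(id = 1L, name = "a", scores = c(90, 85))
l[["scores"]]        # 90 85

# Matrix and array: vectors that carry a dim attribute.
m <- matrix(1:6, nrow = 2)           # 2 rows, 3 columns
a <- array(1:24, dim = c(2, 3, 4))   # 3-dimensional
dim(m)               # 2 3
dim(a)               # 2 3 4

# Data frame: columns may differ in type.
# stringsAsFactors = FALSE keeps the behaviour the same across R versions.
df <- data.frame(name = c("x", "y"), value = c(1.5, 2.5),
                 stringsAsFactors = FALSE)
column_classes <- sapply(df, class)  # name: "character", value: "numeric"
```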
- What are factors?
Consider the sex of a person. This variable can have only two possibilities or categories: male or female. We call this categorical data and factors are used to represent such data. Roughly equivalent to an "enum" type in C, a factor represents a finite set of values.
Under the hood, factors are nothing more than integer vectors with each integer representing one category. In R, the possible values are called levels.
Thus, though sex may contain the values "male" or "female", these are stored not as characters but as integers. In addition, factors can be ordered or unordered. For example, sex may be defined as an unordered factor. An Olympic medal may be defined as an ordered factor such that Bronze < Silver < Gold.
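The two factors described above might be created as follows. This is a sketch with made-up observations, not data from any real source.

```r
# Unordered factor: levels default to the sorted unique values.
sex <- factor(c("male", "female", "female", "male"))
levels(sex)        # "female" "male"
as.integer(sex)    # 2 1 1 2, the integer codes behind the labels
typeof(sex)        # "integer"

# Ordered factor: the levels argument fixes the ordering Bronze < Silver < Gold.
medal <- factor(c("Silver", "Gold", "Bronze"),
                levels = c("Bronze", "Silver", "Gold"),
                ordered = TRUE)
is.ordered(medal)      # TRUE
medal[1] < medal[2]    # TRUE: Silver < Gold
```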
- Is the NULL object a special data type?
R documentation states that "the NULL object has no type and no modifiable properties". Attributes don't apply to NULL. When you want to indicate absence, NULL can be used. A vector or list of zero length is not the same as NULL.
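A short sketch of these points follows. The helper find_first_negative is hypothetical and exists only to show NULL being returned to signal absence.

```r
# NULL can be tested for explicitly.
is.null(NULL)                  # TRUE

# Zero-length vectors and lists are not NULL.
is.null(integer(0))            # FALSE
is.null(list())                # FALSE
identical(NULL, list())        # FALSE

# Hypothetical helper: returns NULL when there is nothing to report.
find_first_negative <- function(x) {
  hits <- x[x < 0]
  if (length(hits) == 0) NULL else hits[1]
}

absent  <- find_first_negative(c(1, 2, 3))    # NULL
present <- find_first_negative(c(1, -2, -3))  # -2
```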
- How to interpret the functions class, mode, typeof and storage.mode?
All these functions can be called on R objects, with differing results. Function class gives the object's abstract type whereas typeof gives the object's specific type. A good example is a factor: its class is factor but its type is integer. Another example is a data frame: its class is data.frame but its type is list.
Function mode is similar to typeof and it exists for compatibility with R's predecessor, the S language. Function storage.mode also exists for compatibility with S. It's useful when interfacing to code in other languages. For example, consider a vector of integers. Functions typeof, mode and storage.mode will respectively return integer, numeric and integer. In S, both integers and reals have the same mode and hence storage.mode becomes useful.
Hadley Wickham has commented that it's best to avoid using mode and storage.mode in R. If we need the underlying type, calling typeof should be preferred over storage.mode.
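To see the four functions side by side, one could apply them to a few objects. The helper describe below is hypothetical and written only for this comparison.

```r
# Hypothetical helper: report all four descriptions of an object at once.
describe <- function(x) {
  c(class        = class(x)[1],
    typeof       = typeof(x),
    mode         = mode(x),
    storage.mode = storage.mode(x))
}

int_info    <- describe(1:3)                # integer, integer, numeric, integer
factor_info <- describe(factor("a"))        # factor,  integer, numeric, integer
df_info     <- describe(data.frame(a = 1))  # data.frame, list, list, list

print(rbind(int_info, factor_info, df_info))
```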
- What are some basic operations on R vectors?
Here are some basic operations on vectors:
  - Combining: We can combine vectors into a single vector. Eg. v <- c(v1, v2) to combine v1 and v2 into v.
  - Indexing: Indexing starts from 1. Negative numbers imply selecting all others except those specified. Eg. v[1] for the first element; v[-3] for all elements except the third one. Indexing may be treated as a special case of subsetting.
  - Subsetting: We can select a subset of a vector by using integer vectors for indexing. Eg. v[c(1,3,5)] to select the first, third and fifth elements; v[1:3] to select the first three elements. We can also use a logical vector to subset a vector. Eg. v[v > 5] to select elements with value greater than 5.
  - Coercing: Since vectors contain elements of a single type, values are coerced to a single type if they are different. Eg. v <- c(12L, 2.2, TRUE) coerces to doubles [12.0, 2.2, 1.0]; v <- c(2.2, TRUE, "Hi") coerces to characters ["2.2", "TRUE", "Hi"].
  - Converting: May be called explicit coercing, where we convert the type ourselves. Eg. as.integer(c(3, 2.2, TRUE)) becomes [3, 2, 1] since the fractional part is dropped; as.numeric(c(2.2, TRUE, "Hi")) becomes [2.2, NA, NA], where NA stands for "Not Available". The second element is NA because c() has already coerced TRUE to the string "TRUE", which can't be read as a number.
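The operations listed above can be tried on a small example. The values in v1 and v2 are arbitrary and chosen only for illustration.

```r
v1 <- c(2, 4, 6)
v2 <- c(8, 10)

# Combining
v <- c(v1, v2)                 # 2 4 6 8 10

# Indexing (1-based); a negative index excludes that position
first       <- v[1]            # 2
without_3rd <- v[-3]           # 2 4 8 10

# Subsetting with integer and logical vectors
odd_positions <- v[c(1, 3, 5)] # 2 6 10
first_three   <- v[1:3]        # 2 4 6
above_five    <- v[v > 5]      # 6 8 10

# Coercing: mixed inputs are promoted to one common type
to_double    <- c(12L, 2.2, TRUE)    # 12.0 2.2 1.0
to_character <- c(2.2, TRUE, "Hi")   # "2.2" "TRUE" "Hi"

# Converting: as.integer truncates; unparseable strings become NA (with a warning)
as_int <- as.integer(c(3, 2.2, TRUE))                       # 3 2 1
as_num <- suppressWarnings(as.numeric(c(2.2, TRUE, "Hi")))  # 2.2 NA NA
```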
- Are there datasets to understand the different data structures?
R comes with many datasets for experimental analysis and learning. These can be listed by typing data() in the R console. Details of each dataset can be obtained by using ? or help. For example, for help on the "rivers" dataset type either ?rivers or help(rivers). It's been said that datasets "mtcars", "iris", "ToothGrowth", "PlantGrowth" and "USArrests" are commonly used by researchers.
An example of a vector is the "rivers" dataset. An example of a vector with names given to each observation is "precip". There are plenty of examples for data.frame: "airquality", "mtcars", "iris". An example of a list is the "state.center" dataset.
In "CO2", the "Plant" variable is an ordered factor whereas the "Type" variable is an unordered factor. Dataset "Titanic" is of class table, which is a type of array. This data structure records counts of combinations of factor levels.
An object can belong to multiple classes. As an example, try class(CO2) and str(CO2). It's a data.frame but also belongs to other classes.
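The built-in datasets named above can be inspected directly. This sketch assumes the standard datasets package is attached, which is the default in an R session.

```r
# Plain numeric vector versus a vector with names
is.vector(rivers)          # TRUE
head(names(precip), 3)     # names of the first few observations

# Data frames and a list
class(iris)                # "data.frame"
class(state.center)        # "list"
names(state.center)        # "x" "y"

# Ordered versus unordered factors inside CO2
is.ordered(CO2$Plant)      # TRUE
is.ordered(CO2$Type)       # FALSE

# Titanic is a table, which is a kind of array
class(Titanic)             # "table"
is.array(Titanic)          # TRUE

# CO2 belongs to several classes, data.frame among them
class(CO2)
inherits(CO2, "data.frame")   # TRUE
```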
- Can you give examples of data structures beyond the core ones given by R?
Developers can create their own data structures that build on top of the basic ones. A popular one is data.table, which is based on data.frame. It offers a simplified and consistent syntax for handling data.
Another example is tibble, which retains the effective parts of data.frame and does less work so that developers can catch problems early on.
If you want to display data frames in HTML with conditional formatting (like in Microsoft Excel), formattable is a suitable package to use.
Another package named dplyr isn't exactly a data structure. Rather, it offers a number of functions for manipulating data. It comes as part of the tidyverse collection of R packages targeted towards data science.
Further Reading
- R Core Team. 2018. "R Language Definition." v3.5.0, CRAN, April 23. Accessed 2018-05-11.
- Wickham, Hadley. 2018. "Data structures." Advanced R, April 26. Accessed 2018-05-11.
- Ceballos, Maite and Nicolás Cardiel. 2013. "Data structure." First Steps in R. Accessed 2018-05-11.
- Blischak, John, Daniel Chen, Harriet Dashnow, and Denis Haine (eds). 2016. "Data Types and Structures." Software Carpentry: Programming with R, June. Accessed 2018-05-11.
See Also
- R (Language)
- R Plotting Systems
- R Object-Oriented Programming
- R data.table
- Vectorization in R
- R dplyr