R Data Structures

Five data structures in R. Source: Wickham 2018.

R is an object-oriented language and all data structures are objects. R doesn't provide programmers direct access to memory and all data must be accessed via symbols or variables that refer to objects.

Since vectorized operation is an important aspect of R, R does not have any scalars. The most basic data structure is a vector, which is a sequence of data items. Thus, a single integer value is treated as an integer vector of unit length. The most versatile data structure is the list while the most common one used for data analysis is the data.frame.
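
A quick console check illustrates the absence of scalars. This is only a minimal sketch, and the variable names x and doubled are illustrative:

```r
x <- 42L                      # looks like a scalar
is.vector(x)                  # TRUE: it is an integer vector
length(x)                     # 1: a vector of unit length
x[1]                          # indexing works as on any other vector
doubled <- c(1, 2, 3) * 2     # vectorized arithmetic gives 2 4 6
```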

The terms data type and mode usually refer to what is stored (integer, character, etc.). The term data structure usually refers to how data is stored, that is, the containers (vector, list, etc.).

Discussion

  • What data types are available in R?

    R has many data types. The common ones include integer, real (called double in R), complex, logical and character. Integer and real types are together termed numeric. There's no separate "string" type. Instead, the character type is used to denote strings.

    Integers are specified with a suffix "L", such as 23L or -2L. Real numbers are specified without this suffix, such as 2.3 or 23. Examples of complex numbers are -2+3i and -45i. Logical type can take values TRUE or FALSE. These have short forms T and F, though T and F are ordinary variables that can be reassigned, so TRUE and FALSE are safer. Character type can be specified by a matching pair of single or double quotes, such as "R", 'R' or "This is R!".
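
    The literals above can be inspected with typeof, as in this small sketch. The vector name types is only illustrative:

    ```r
    types <- c(typeof(23L),            # "integer"
               typeof(23),             # "double" (the real type)
               typeof(-2+3i),          # "complex"
               typeof(TRUE),           # "logical"
               typeof("This is R!"))   # "character"
    is.numeric(23L)                    # TRUE: integers are numeric
    is.numeric(2.3)                    # TRUE: reals are numeric
    identical('R', "R")                # TRUE: quote style does not matter
    ```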

  • Could you compare vector, matrix, array, list and data.frame?
    Example representations of different data structures in R. Source: Ceballos and Cardiel 2013.

    The following data structures are common in R:

    • vector: Contains a sequence of items of the same type. This is the most basic structure. Items of a vector can be accessed using []. Function length can be called to know the number of items.
    • list: Behaves like a vector but can contain items of different types and different lengths. Items of a list can be accessed using [[]]. This is a recursive data type: lists can contain other lists.
    • array: An n-dimensional structure that expands on a vector. Under the hood, this has dim and optionally dimnames attributes, which don't exist for vectors. Like vectors, all items must be of the same underlying type.
    • matrix: A two-dimensional array.
    • data.frame: While all columns of a matrix have the same type, with data frames, different columns can have different types.

    Formally, vectors can be said to be of two types: atomic vectors (items of same type) and lists. In practice, when we say vectors we are referring to atomic vectors.
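
    The five structures can be compared side by side, as in the following sketch. The object names v, l, m, a and d are illustrative, and stringsAsFactors is set explicitly so that the result does not depend on the R version:

    ```r
    v <- c(10, 20, 30)                          # atomic vector: one type
    l <- list(1L, "a", c(TRUE, FALSE))          # list: mixed types and lengths
    m <- matrix(1:6, nrow = 2)                  # matrix: a two-dimensional array
    a <- array(1:24, dim = c(2, 3, 4))          # array of three dimensions
    d <- data.frame(id = 1:3, name = c("x", "y", "z"),
                    stringsAsFactors = FALSE)   # columns of different types

    length(v)          # 3
    l[[3]]             # TRUE FALSE
    attributes(v)      # NULL: a plain vector has no dim attribute
    attributes(m)      # a dim attribute of 2 3
    is.array(m)        # TRUE: a matrix is an array
    sapply(d, class)   # id is "integer", name is "character"
    is.list(d)         # TRUE: a data frame is built on a list
    ```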

  • What are factors?

    Consider the sex of a person. This variable can have only two possibilities or categories: male or female. We call this categorical data and factors are used to represent such data. Roughly equivalent to an "enum" type in C, factor represents a finite set of values.

    Under the hood, these are nothing more than integer vectors with each integer representing one category. In R, these possible values are called levels.

    Thus, though sex may contain values "male" or "female", these are stored not as characters but as integers. In addition, factors can be ordered or unordered. For example, sex may be defined as an unordered factor. An Olympic medal may be defined as an ordered factor such that Bronze < Silver < Gold.
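
    A short sketch of both kinds of factor follows. The sample values and the names sex and medal are made up for illustration:

    ```r
    sex <- factor(c("male", "female", "female", "male"))
    levels(sex)            # "female" "male": alphabetical by default
    as.integer(sex)        # 2 1 1 2: the underlying integer codes
    typeof(sex)            # "integer"
    is.ordered(sex)        # FALSE: unordered factor

    medal <- factor(c("Silver", "Gold", "Bronze"),
                    levels = c("Bronze", "Silver", "Gold"),
                    ordered = TRUE)
    medal[1] < medal[2]    # TRUE: Silver < Gold
    ```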

  • Is the NULL object a special data type?

    R documentation states that "the NULL object has no type and no modifiable properties". Attributes don't apply to NULL. When you want to indicate absence, NULL can be used. A vector or list of zero length is not the same as NULL.
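
    The difference between NULL and zero-length objects can be checked directly. This is a minimal sketch, and the list x is only an illustration:

    ```r
    is.null(NULL)            # TRUE
    length(NULL)             # 0
    is.null(integer(0))      # FALSE: a zero-length vector is not NULL
    is.null(list())          # FALSE: an empty list is not NULL

    x <- list(a = 1, b = 2)
    x$b <- NULL              # assigning NULL removes the element
    names(x)                 # "a"
    ```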

  • How to interpret the functions class, mode, typeof and storage.mode?
    Comparing the outputs of class, mode, typeof and storage.mode. Source: Cotton 2016.

    All these functions can be called on any R object, with differing results. Function class represents the object's abstract type whereas typeof is the object's specific type. A good example is a factor: its class is factor but its type is integer. Another example is a data frame: its class is data.frame but its type is list.

    Function mode is similar to typeof and it exists for compatibility with R's predecessor, the S language. Function storage.mode also exists for compatibility with S. It's useful when interfacing to code in other languages. For example, consider a vector of integers. Functions typeof, mode and storage.mode will respectively return integer, numeric and integer. In S, both integers and reals have the same mode and hence storage.mode becomes useful.

    Hadley Wickham has commented that it's best to avoid using mode and storage.mode in R. If we need the underlying type, calling typeof should be preferred over storage.mode.
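
    The four functions can be compared on a few objects with a small helper. The function describe below is a hypothetical convenience written for this sketch, not part of R:

    ```r
    # Hypothetical helper: collect the four descriptions of an object
    describe <- function(x) {
      c(class = class(x)[1], typeof = typeof(x),
        mode = mode(x), storage.mode = storage.mode(x))
    }

    describe(1:3)                 # integer, integer, numeric, integer
    describe(c(1.5, 2))           # numeric, double, numeric, double
    describe(factor("a"))         # factor, integer, numeric, integer
    describe(data.frame(x = 1))   # data.frame, list, list, list
    ```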

  • What are some basic operations on R vectors?

    Here are some basic operations on vectors:

    • Combining: We can combine vectors into a single vector. Eg. v <- c(v1, v2) to combine v1 and v2 into v.
    • Indexing: Indexing starts from 1. Negative numbers imply selecting all others except those specified. Eg. v[1] for first element; v[-3] for elements except the third one. Indexing may be treated as a special case of subsetting.
    • Subsetting: We can select a subset of a vector by using integer vectors for indexing. Eg. v[c(1,3,5)] to select first, third and fifth element; v[1:3] to select the first three elements. We can also use a logical vector to subset a vector. Eg. v[v > 5] to select elements with value greater than 5.
    • Coercing: Since vectors contain elements of a single type, values are coerced to a single type if they are different. Eg. v <- c(12L, 2.2, TRUE) coerces to doubles [12.0, 2.2, 1.0]; v <- c(2.2, TRUE, "Hi") coerces to characters ["2.2", "TRUE", "Hi"].
    • Converting: May be called explicit coercing. Convert the type. Eg. as.integer(c(3, 2.2, TRUE)) becomes [3, 2, 1] since the fractional part is dropped; as.numeric(c(2.2, TRUE, "Hi")) becomes [2.2, NA, NA], where NA stands for "Not Available".
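
    These operations can be tried together in one short session. The values below are made up for illustration, and suppressWarnings is used only to hide the expected "NAs introduced by coercion" warning:

    ```r
    v1 <- c(1, 2, 3)
    v2 <- c(7, 8, 9)
    v  <- c(v1, v2)                   # combining: 1 2 3 7 8 9
    v[1]                              # indexing: first element
    v[-3]                             # all except the third: 1 2 7 8 9
    v[c(1, 3, 5)]                     # subsetting by position: 1 3 8
    v[v > 5]                          # subsetting by logical vector: 7 8 9

    typeof(c(12L, 2.2, TRUE))         # "double": implicit coercion
    mixed <- c(2.2, TRUE, "Hi")       # "2.2" "TRUE" "Hi"
    converted <- suppressWarnings(as.numeric(mixed))   # 2.2 NA NA
    ```
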
  • Are there datasets to understand the different data structures?
    View in RStudio of 'rivers' from package datasets. Source: Devopedia 2020.

    R comes with many datasets for experimental analysis and learning. These can be listed by typing data() in the R console. Details of each dataset can be obtained by using ? or help. For example, for help on "rivers" dataset type either ?rivers or help(rivers). It's been said that datasets "mtcars", "iris", "ToothGrowth", "PlantGrowth" and "USArrests" are commonly used by researchers.

    An example of a vector is the "rivers" dataset. An example of a vector with names given to each observation is "precip". There are plenty of examples for data.frame: "airquality", "mtcars", "iris". An example of a list is the "state.center" dataset.

    In "CO2", "Plant" variable is an ordered factor whereas "Type" variable is an unordered factor. Dataset "Titanic" is of class table, which is a type of array. This data structure records counts of combinations of factor levels.

    An object can belong to multiple classes. As an example, try class(CO2) and str(CO2). It's a data.frame but also belongs to other classes.
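
    The claims about these datasets can be verified at the console. This sketch assumes the standard datasets package is attached, as it is by default:

    ```r
    class(rivers)                  # "numeric": a plain vector
    head(names(precip), 3)         # names attached to the first observations
    class(state.center)            # "list"
    is.ordered(CO2$Plant)          # TRUE: ordered factor
    is.ordered(CO2$Type)           # FALSE: unordered factor
    class(Titanic)                 # "table"
    is.array(Titanic)              # TRUE: a table is a type of array
    class(CO2)                     # several classes, data.frame among them
    ```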

  • Can you give examples of data structures beyond the core ones given by R?

    Developers can create their own data structures that can build on top of the basic ones. One popular one is called data.table, which is based on data.frame. It offers a simplified and consistent syntax for handling data.

    Another example is tibble, which retains the effective parts of data.frame and does less work so that developers can catch problems early on.

    If you want to display data frames in HTML with conditional formatting (like in Microsoft Excel), formattable is a suitable package to use.

    Another package named dplyr isn't exactly a data structure. Rather, it offers a number of functions for manipulating data. It comes as part of the tidyverse collection of R packages targeted towards data science.

Milestones

2006

Version 1.0 of data.table is released.

Apr 2013

R version 3.0.0 is released. This version supports vectors longer than 2^31 - 1 elements. This applies to raw, logical, integer, double, complex and character vectors, as well as lists. Elements of character vectors are limited to 2^31 - 1 bytes.

Jan 2014

Hadley Wickham releases version 0.1 of dplyr, which is meant "to provide a consistent set of verbs that help you solve the most common data manipulation challenges." Version 1.0.0 is released in May 2020.

Apr 2020

R version 4.0.0 is released. This version uses stringsAsFactors = FALSE as default when reading tabular data via data.frame() or read.table(). Previous versions used to convert strings to factors by default.
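
The effect of this default can be seen by setting the argument explicitly, which behaves the same on any R version. This is a minimal sketch with made-up values, and the names old_style and new_style are illustrative:

```r
s <- c("x", "y", "x")
old_style <- data.frame(a = s, stringsAsFactors = TRUE)    # default before 4.0.0
new_style <- data.frame(a = s, stringsAsFactors = FALSE)   # default since 4.0.0
class(old_style$a)    # "factor"
class(new_style$a)    # "character"
```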

Sample Code

  • # Source: http://rpubs.com/arvindpdmn/r-basics-for-beginners
     
    # Some examples of vector
    c(1, 2.0, 0.3)              # create a numeric vector
    c("a", "bc", "def")         # create vector of character
    c(1, 0.3, 2L, "xyz")        # coercion to a single mode: character
    vector(length=4)            # create a vector of length 4 of logical mode
    c(1:10)                     # vector of integers 1-10
    c(1:5, 11:15)               # vector of non-contiguous integers
     
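    # Indexing and subsetting on vectors
    v1 <- c(3, 8, 1, 6, 2, 9, 4)          # sample vector for the examples below
    v2 <- 1:20                            # sample sequence for the examples below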
    v1[1]
    v1[2:5]
    v1[c(1:3, 5, 7)]                      # note that vectors can be used as indices for subsetting
    v1 > 4                                # get a logical vector satisfying the condition  
    v1[v1 > 4]                            # get items greater than 4
    v1[-1]                                # get a vector but ignore 1st element
    v1[c(-1,-length(v1))]                 # get a vector but ignore 1st and last elements
    v1[!(v1 %in% 1:5)] <- -99             # replace all values not in range [1,5] with -99
    v2[v2 %% 3 == 1]                      # pick every third element using a logical vector
    v2[which(v2 %% 3 == 1)]               # pick every third element: which() obtains indices
    v2[seq(1, length(v2), by=3)]          # pick every third element using indices
     
    # Some examples of matrix
    matrix(1:6, nrow=2, ncol=3)                    # create a matrix
    matrix(c(1:3, 10L, 11L, 12L), nrow=2, ncol=3)  # all values are integers: type stays integer, mode is numeric
    m1 <- matrix(1:6, nrow=2, ncol=3, byrow=T)     # create a matrix by filling rows first
    dim(m1)
    colnames(m1) <- c("a", "b", "c")
    rownames(m1) <- c("u", "v")
    dimnames(m1)
    nrow(m1)                                       # no. of rows
    ncol(m1)                                       # no. of columns
    m1["u",]                                       # display named row
    m1[1,]                                         # display 1st row
    m1[,"a"]                                       # display named column
    m1[,2]                                         # display 2nd column
    m1[,c(1,3)]                                    # display 1st and 3rd columns
    cbind(m1, d=c(7,8))                            # append a new column
    rbind(w=c(7,8,9), m1)                          # prepend a new row
    dim(m1) <- c(3, 2)                             # resize the matrix
    length(m1)                                     # works like in a vector
    m1[1]                                          # works like in a vector
     
    # Conversion between vector and matrix
    ten <- 1:10
    matrix(ten, nrow=2)                            # vector to matrix, vector not changed
    dim(ten) <- c(2,5)                             # vector to matrix in-place
    as.vector(ten)                                 # matrix to vector, matrix not changed
    dim(ten) <- c(1,10)                            # remains a matrix with modified dims
     
    # Some examples of array
    a1 <- array(1:5, c(2,4,3))                     # create array of 3 dimensions, recycle 1:5
    a1 <- array(1:24, c(2,4,3))                    # create array of 3 dimensions
    length(a1)                                     # works like in a vector
    a1[24]                                         # works like in a vector
    dimnames(a1) <- list(c("a", "b"),              # assign names
                         c("u", "v", "w", "x"), 
                         c("p", "q", "r"))
    dimnames(a1)                                   # display dimension names
    a1["a",,]                                      # display "a" values
    a1[,"u","r"]                                   # display "u" and "r" values
    a2 <- 1:100                                    
    dim(a2) <- c(10, 5, 2)                         # Transform the vector into a 3-D array
    class(a2)
    a2[,1,]
     
    # Some examples of factor
    gender <- factor(c(rep("male", 5),             # rep is used to repeat a value
                       rep("female", 8)))
    levels(gender)                                 # levels are in alphabetic order
    gender <- factor(gender, ordered=T)            # levels are ordered
    levels(gender)                                 # levels are in alphabetic order
    gender[gender < "male"]                        # possible when levels are ordered
    gender <- factor(gender,                    
                     ordered=T,
                     levels=c("male", "female"))   # explicitly specify a different order
    gender
    levels(gender) <- c("female", "male")          # renames the levels: every "male" becomes "female" and vice versa
    gender
     
    # Some examples of list
    list(1, "a", TRUE, 1+4i)                       # a list of varied modes
    rl <- list(list(1:4), list("a","b", 3))        # a list can contain other lists
    is.recursive(rl)
    mylist <- list(idx = c("a", "b"),              # create a list with two named columns
                   values = c(10, 12.1, 14, 12))
    mylist$idx                                     # display column named idx
    mylist[["idx"]]                                # display column named idx
    mylist[[1]]                                    # display 1st column
    mylist$values                                  # display column named values
    mylist$idx[2]                                  # display 2nd item of idx column
    mylist[[2]][[1]]                               # display 1st item of 2nd column
    mylist[[2]][1]                                 # display 1st item of 2nd column
    mylist[[c(2,1)]]                               # display 1st item of 2nd column
    mylist[c(2,1)]                                 # display 2nd column, then 1st column
    names(mylist)                                  # display names of list object (column names)
    names(mylist) <- c("x", "y")                   # change names of list object (column names)
    mylist$extra <- c(1:4)                         # add another column
     
    # Some examples of data.frame
    data.frame(a = c("x","y","z"), b = 1:3, c = T) # create a data frame directly
    m1 <- matrix(1:9, 3, 3, 
                 dimnames = list(NULL, c("a", "b", "c")))
    cbind(m1, d = c(T, F, F))                      # will result in a matrix after coercion to integer
    df <- cbind(data.frame(m1), d = c(T, F, F))    # convert matrix to data.frame and add column
    df$a                                           # display column named "a"
    df[c("a", "d")]                                # display columns "a" and "d"
    df$z <- df$a + df$b                            # create a new column from two other columns
    df[1,]                                         # display row 1 as a data.frame
    df[1]                                          # display column 1 as a data.frame
    df[,1]                                         # display column 1 as a vector
    df[c(2,3,4,1,5)]                               # display columns in a different order
    ddf <- df[c("c","a","b","b")]                  # reorder columns plus repeat column "b"
    names(ddf)                                     # extra column will have name "b.1"
    names(ddf)[4] <- "d"                           # rename the extra column
    df$c[1] <- 0                                   # change a specific item
    df[df$b %in% c(4,6),]                          # display rows if "b" column has value 4 or 6
    df[!(names(df) %in% c("a"))]                   # display all columns except "a"
     
    d <- data.frame(a=1:5, b=6:10, c=11:15)
    d$b[c(1,3)] <- NA                              # set couple of elements to NA
    sum(is.na(d$b))                                # count of NA values
    any(is.na(d$b))
    all(is.na(d$b))
    d[d$a %in% c(4,2),]                            # use logical vector to subset
    d[(d$a>3 | d$c>6),]                            # subset by conditions: condition returns logical vector
    d[(d$a>3 & d$c>6),]
    d[with(d, a>3 & c>6),]                         # use names as if they are variables
    subset(d, a>3 & c>6, select=c("a", "c"))       # use subset() function and display selected columns
    d[which(d$b>7),]                               # use which() when some values are NA: which() returns indices
    sort(d$b, na.last=T)                           # NA values are retained and come at the end
    d[order(-d$c),]                                # descending order by column c values
    d$d <- c(5,6,5,6,5)
    d[order(d$d,-d$a),]                            # ascending order by d, then descending order by a
    data.matrix(d)                                 # convert to matrix via coercion
    cbind(rbind(d, d), e = c(8, 9))                # value 8 and 9 are recycled to make a full column
    d$d <- NULL                                    # delete column d
     
    a <- data.frame(id=1:5, bid=3:7, value=rnorm(5))
    b <- data.frame(id=3:7, value=rnorm(5), err=rnorm(5))
    merge(a, b, by.x="bid", by.y="id", all=T)      # merge from multiple datasets
    intersect(names(a), names(b))
     
    data.frame(x = 1:3, y = matrix(1:9, nrow = 3)) # create data.frame from a list and a matrix
    df <- data.frame(x = 1:3)
    df$y <- list(1:2, 1:3, 1:4)                    # add a column containing lists, each of different length
    data.frame(x = 1:3, 
               y = I(list(1:2, 1:3, 1:4)))         # I() to treat list as one unit, not separate columns
     

References

  1. CRAN. 2013. "Changes in R 3.0.0." R News, CRAN. Accessed 2020-07-25.
  2. CRAN. 2020. "Previous Releases of R for Windows." CRAN, June. Accessed 2020-07-25.
  3. CRAN data.table. 2018. "Introduction to data.table." data.table vignette, May 7. Accessed 2018-05-11.
  4. Ceballos, Maite and Nicolás Cardiel. 2013. "Data structure." First Steps in R. Accessed 2018-05-11.
  5. Colton, Arianne and Sean Chen. 2016. "Advanced R: Cheat Sheet." RStudio, February. Accessed 2018-05-11.
  6. Cotton, Richie. 2016. "A comprehensive survey of the types of things in R. 'mode' and 'class' and 'typeof' are insufficient." StackOverflow, October 21. Accessed 2018-05-11.
  7. Dalgaard, Peter. 2020. "R 4.0.0 is released." Email, April 24. Accessed 2020-07-25.
  8. DataCamp. 2020. "Data Type Conversion." Quick-R, DataCamp. Accessed 2020-07-25.
  9. DataFlair. 2019. "8 R Vector Operations with Examples – A Complete Guide for R Programmers." DataFlair, July 6. Accessed 2020-07-25.
  10. Hugh-Jones, David. 2018. "Everything I know about R subsetting." February 8. Accessed 2020-07-25.
  11. Müller, Kirill and Hadley Wickham. 2018. "tibble." Part of the tidyverse. Accessed 2018-05-11.
  12. Peng, Roger D. 2018. "R Nuts and Bolts." Chapter 4 in: R Programming for Data Science, September 18. Accessed 2020-07-25.
  13. R Core Team. 2018. "R Language Definition." v3.5.0, CRAN, April 23. Accessed 2018-05-11.
  14. Ren, Kun. 2016. "Formattable data frame." Formattable vignettes, CRAN, August 5. Accessed 2018-05-12.
  15. STHDA Wiki. 2020. "R Built-in Data Sets." STHDA Wiki, Statistical Tools for High-Throughput Data Analysis. Accessed 2020-07-25.
  16. Tidyverse GitHub. 2020. "tidyverse/dplyr." Tidyverse GitHub, July 22. Accessed 2020-07-25.
  17. Wickham, Hadley. 2016. "Replying to @richierocks." Twitter, October 21. Accessed 2018-05-11.
  18. Wickham, Hadley. 2018. "Data structures." Advanced R, April 26. Accessed 2018-05-11.
  19. Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2018. "dplyr." Part of the tidyverse. Accessed 2018-05-12.
  20. data.table Wiki. 2018. "Home." Rdatatable, GitHub. Accessed 2018-05-12.

Further Reading

  1. R Core Team. 2018. "R Language Definition." v3.5.0, CRAN, April 23. Accessed 2018-05-11.
  2. Wickham, Hadley. 2018. "Data structures." Advanced R, April 26. Accessed 2018-05-11.
  3. Ceballos, Maite and Nicolás Cardiel. 2013. "Data structure." First Steps in R. Accessed 2018-05-11.
  4. Blischak, John, Daniel Chen, Harriet Dashnow, and Denis Haine (eds). 2016. "Data Types and Structures." Software Carpentry: Programming with R, June. Accessed 2018-05-11.

Cite As

Devopedia. 2021. "R Data Structures." Version 6, June 28. Accessed 2024-06-25. https://devopedia.org/r-data-structures
Last updated on
2021-06-28 16:25:06
