# R Data Structures

Created by arvindpdmn on 2017-05-06 14:37:13. Last updated by arvindpdmn on 2020-01-06 04:02:41.

• Figure: Five data structures in R. Source: Wickham 2018.
• Figure: Example representations of different data structures in R. Source: Ceballos and Cardiel 2013.
• Figure: Comparing the outputs of class, mode, typeof and storage.mode. Source: Cotton 2016.

## Summary

R is an object-oriented language and all data structures are objects. R doesn't provide programmers direct access to memory and all data must be accessed via symbols or variables that refer to objects.

Since vectorized operation is an important aspect of R, R does not have any scalars. The most basic data structure is a vector, which is a sequence of data items. Thus, a single integer value is treated as an integer vector of unit length. The most versatile data structure is the list while the most common one used for data analysis is the data.frame.
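One way to see that R has no scalars is to inspect a single value. The following is a minimal sketch in base R; the variable names `x` and `y` are illustrative only.

```r
# A single value is a vector of length one
x <- 42L
is.vector(x)     # TRUE
length(x)        # 1
typeof(x)        # "integer"

# Vectorized arithmetic applies to every element without an explicit loop
y <- c(1, 2, 3) * 2
y                # 2 4 6
```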

The terms data type and mode usually refer to what is stored (integer, character, etc.). The term data structure usually refers to how data is stored, that is, the container (vector, list, etc.).

## Milestones

2006

Version 1.0 of data.table is released.

## Discussion

• What data types are available in R?

R has many data types. The common ones are integer, real (double), complex, logical and character. The integer and real types are together termed numeric. There's no separate "string" type: the character type denotes strings.

Integers are specified with the suffix "L", such as 23L or -2L. Real numbers are specified without this suffix, such as 2.3 or 23. Examples of complex numbers are -2+3i and -45i. The logical type takes the values TRUE or FALSE. These have the short forms T and F, though T and F are ordinary variables that can be reassigned, so the full names are safer. A character value is written between a matching pair of single or double quotes, such as "R", 'R' or "This is R!".
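The literals described above can be checked with `typeof`. This is a small sketch in base R; the names `values` and `types` are made up for the example.

```r
# One literal of each common type
values <- list(23L, 23, -2+3i, TRUE, "R")
types <- sapply(values, typeof)
types                 # "integer" "double" "complex" "logical" "character"

# Both integers and reals count as numeric
is.numeric(23L)       # TRUE
is.numeric(2.3)       # TRUE

# Single and double quotes produce the same character value
identical("R", 'R')   # TRUE
```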

• Could you compare vector, matrix, array, list and data.frame?

The following data structures are common in R:

• vector: Contains a sequence of items of the same type. This is the most basic structure. Items of a vector can be accessed using []. The function length returns the number of items.
• list: Represented as a vector but can contain items of different types. Different items can have different lengths. Items of a list can be accessed using [[]]. This is a recursive data type: lists can contain other lists.
• array: An n-dimensional structure that expands on a vector. Under the hood, this has dim and optionally dimnames attributes, which don't exist for vectors. Like vectors, all items must be of the same underlying type.
• matrix: A two-dimensional array.
• data.frame: While all columns of a matrix have same type, with data frames, different columns can have different types.

Formally, vectors can be said to be of two types: atomic vectors (items of same type) and lists. In practice, when we say vectors we are referring to atomic vectors.
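The structures compared above can be created and inspected as follows. This is an illustrative sketch in base R; all object names and values are invented for the example, and `stringsAsFactors = FALSE` is set explicitly so that the result does not depend on the R version.

```r
v <- c(10, 20, 30)                           # atomic vector: items of one type
l <- list(1L, "a", c(TRUE, FALSE))           # list: items of different types and lengths
m <- matrix(1:6, nrow = 2)                   # matrix: a two-dimensional array
a <- array(1:24, dim = c(2, 3, 4))           # array with three dimensions
d <- data.frame(id = 1:3, name = c("x", "y", "z"),
                stringsAsFactors = FALSE)    # columns of different types

v[2]               # 20
l[[3]]             # TRUE FALSE
is.null(dim(v))    # TRUE: a plain vector has no dim attribute
dim(m)             # 2 3
dim(a)             # 2 3 4
sapply(d, typeof)  # id is "integer", name is "character"
is.list(d)         # TRUE: a data frame is a list of equal-length columns
```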

• What are factors?

Consider the sex of a person. This variable can have only two possibilities or categories: male or female. We call this categorical data and factors are used to represent such data. Roughly equivalent to an "enum" type in C, factor represents a finite set of values.

Under the hood, these are nothing more than integer vectors with each integer representing one category. In R, these possible values are called levels.

Thus, though sex may contain values "male" or "female", these are not characters but integers. In addition, factors can be ordered or unordered. For example, sex may be defined as unordered factor. Olympics medal may be defined as ordered factor such that Bronze < Silver < Gold.
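The points about levels, integer storage and ordering can be seen in a short sketch. It uses base R only; the sample values are invented for illustration.

```r
# Unordered factor: levels default to alphabetical order
sex <- factor(c("male", "female", "female", "male"))
levels(sex)        # "female" "male"
typeof(sex)        # "integer"
as.integer(sex)    # 2 1 1 2
is.ordered(sex)    # FALSE

# Ordered factor with an explicit ordering Bronze < Silver < Gold
medal <- factor(c("Silver", "Gold", "Bronze"),
                levels = c("Bronze", "Silver", "Gold"),
                ordered = TRUE)
medal < "Gold"     # TRUE FALSE TRUE
```

Comparison operators such as `<` are meaningful only for ordered factors.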

• Is the NULL object a special data type?

R documentation states that "the NULL object has no type and no modifiable properties". Attributes don't apply to NULL. When you want to indicate absence, NULL can be used. A vector or list of zero length is not the same as NULL.
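The difference between NULL and zero-length objects can be checked directly. The following is a minimal sketch in base R; the list `cfg` and its item names are hypothetical.

```r
is.null(NULL)            # TRUE
length(NULL)             # 0

# Zero-length vectors and lists have length 0 but are not NULL
is.null(integer(0))      # FALSE
is.null(list())          # FALSE
identical(NULL, list())  # FALSE

# Looking up an item that is absent from a list gives NULL
cfg <- list(a = 1, b = 2)
is.null(cfg$zzz)         # TRUE
```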

• How to interpret the functions class, mode, typeof and storage.mode?

All these functions can be called on R objects, with differing results. Function class gives the object's abstract type whereas typeof gives the object's specific type. A good example is a factor: its class is factor but its type is integer. Another example is a data frame: its class is data.frame but its type is list.

Function mode is similar to typeof and exists for compatibility with R's predecessor, the S language. Function storage.mode also exists for compatibility with S. It's useful when interfacing to code in other languages. For example, consider a vector of integers. Functions typeof, mode and storage.mode will respectively return integer, numeric and integer. In S, both integers and reals have the same mode, and hence storage.mode becomes useful for telling them apart.

Hadley Wickham has commented that it's best to avoid using mode and storage.mode in R. If we need the underlying type, calling typeof should be preferred over storage.mode.
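The four functions can be compared side by side. The sketch below uses base R; the helper `describe` and the sample objects are hypothetical names introduced only for this comparison.

```r
# Hypothetical helper: collect the four descriptions of an object
describe <- function(x) {
  c(class = class(x)[1],
    typeof = typeof(x),
    mode = mode(x),
    storage.mode = storage.mode(x))
}

samples <- list(int = 1:3,
                dbl = 2.5,
                fct = factor(c("a", "b")),
                df  = data.frame(x = 1))

# One row per object, one column per function
res <- t(sapply(samples, describe))
res
# int: class "integer",    typeof "integer", mode "numeric", storage.mode "integer"
# dbl: class "numeric",    typeof "double",  mode "numeric", storage.mode "double"
# fct: class "factor",     typeof "integer", mode "numeric", storage.mode "integer"
# df:  class "data.frame", typeof "list",    mode "list",    storage.mode "list"
```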

• What are some basic operations on R vectors?

Here are some basic operations on vectors:

• Combining: We can combine vectors into a single vector. Eg. v <- c(v1, v2) to combine v1 and v2 into v.
• Indexing: Indexing starts from 1. Negative numbers select all elements except those specified. Eg. v[1] for the first element; v[-3] for all elements except the third one. Indexing may be treated as a special case of subsetting.
• Subsetting: We can select a subset of a vector by using integer vectors for indexing. Eg. v[c(1,3,5)] to select the first, third and fifth elements; v[1:3] to select the first three elements. We can also use a logical vector to subset a vector. Eg. v[v > 5] to select all elements whose value is greater than 5.
• Coercing: Since vectors contain elements of a single type, values are coerced to a single type if they are different. Eg. v <- c(12L, 2.2, TRUE) coerces to doubles [12.0, 2.2, 1.0]; v <- c(2.2, TRUE, "Hi") coerces to characters ["2.2", "TRUE", "Hi"].
• Converting: Convert the type. Eg. as.integer(c(3, 2.2, TRUE)) becomes [3, 2, 1] since the fractional part is dropped; as.numeric(c(2.2, TRUE, "Hi")) becomes [2.2, NA, NA] with a warning, where NA stands for "Not Available".
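The operations listed above can be tried on a small vector. This is a sketch in base R with invented sample values; `suppressWarnings` is used only to keep the conversion warning out of the output.

```r
v1 <- c(2L, 4L, 6L)
v2 <- c(8L, 10L)

v <- c(v1, v2)               # combining: 2 4 6 8 10
v[1]                         # indexing: 2
v[-3]                        # all except the third: 2 4 8 10
v[c(1, 3, 5)]                # subsetting by position: 2 6 10
v[v > 5]                     # subsetting by logical vector: 6 8 10

typeof(c(12L, 2.2, TRUE))    # coercion to "double"
typeof(c(2.2, TRUE, "Hi"))   # coercion to "character"

as.integer(c(3, 2.2, TRUE))  # conversion drops the fraction: 3 2 1
num <- suppressWarnings(as.numeric(c(2.2, TRUE, "Hi")))
num                          # 2.2 NA NA
```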

• Are there datasets to understand the different data structures?

R comes with many datasets for experimental analysis and learning. These can be listed by typing data() in the R console. Details of each dataset can be obtained by using ? or help. For example, for help on "rivers" dataset type either ?rivers or help(rivers).

An example of a vector is the "rivers" dataset. An example of a vector with names given to each observation is "precip". There are plenty of examples of data.frame: "airquality", "mtcars", "iris". An example of a list is the "state.center" dataset.

In "CO2", "Plant" variable is an ordered factor whereas "Type" variable is an unordered factor. Dataset "Titanic" is of class table, which is a type of array. This data structure records counts of combinations of factor levels.

An object can belong to multiple classes. As an example, try class(CO2) and str(CO2). It's a data.frame but also belongs to other classes.
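The claims about these built-in datasets can be verified from the console. The sketch below assumes the standard datasets package that ships with R is attached, which is the default.

```r
is.vector(rivers)            # TRUE: a plain numeric vector
head(names(precip), 3)       # precip carries a name for each observation
class(state.center)          # "list"
names(state.center)          # "x" "y"
class(mtcars)                # "data.frame"

is.ordered(CO2$Plant)        # TRUE: ordered factor
is.ordered(CO2$Type)         # FALSE: unordered factor

class(Titanic)               # "table"
is.array(Titanic)            # TRUE
dim(Titanic)                 # 4 2 2 2

class(CO2)                   # several classes, one of which is "data.frame"
inherits(CO2, "data.frame")  # TRUE
```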

• Can you give examples of data structures beyond the core ones given by R?

Developers can create their own data structures that build on top of the basic ones. A popular one is data.table, which is based on data.frame. It offers a simplified and consistent syntax for handling data.

Another example is tibble, which retains the effective parts of data.frame and does less work so that developers can catch problems early on.

If you want to display data frames in HTML with conditional formatting (like in Microsoft Excel), formattable is a suitable package to use.

Another package named dplyr isn't exactly a data structure. Rather, it offers a number of functions for manipulating data. It comes as part of the tidyverse collection of R packages targeted towards data science.

## Sample Code

• # Source: http://rpubs.com/arvindpdmn/r-basics-for-beginners

# Some examples of vector
c(1, 2.0, 0.3)              # create a numeric vector
c("a", "bc", "def")         # create vector of character
c(1, 0.3, 2L, "xyz")        # coercion to a single mode: character
vector(length=4)            # create a vector of length 4 of logical mode
c(1:10)                     # vector of integers 1-10
c(1:5, 11:15)               # vector of non-contiguous integers

# Indexing and subsetting on vectors
v1 <- c(1:5, 11:15)                   # sample data (assumed; not defined in the original listing)
v2 <- 1:10                            # sample data (assumed; not defined in the original listing)
v1[1]                                 # get the 1st element
v1[2:5]
v1[c(1:3, 5, 7)]                      # note that vectors can be used as indices for subsetting
v1 > 4                                # get a logical vector satisfying the condition
v1[v1 > 4]                            # get items greater than 4
v1[-1]                                # get a vector but ignore 1st element
v1[c(-1,-length(v1))]                 # get a vector but ignore 1st and last elements
v1[!(v1 %in% 1:5)] <- -99             # replace all values not in range [1,5] with -99
v2[v2 %% 3 == 1]                      # pick every third element using a logical vector
v2[which(v2 %% 3 == 1)]               # pick every third element: which() obtains indices
v2[seq(1, length(v2), by=3)]          # pick every third element using indices

# Some examples of matrix
matrix(1:6, nrow=2, ncol=3)                    # create a matrix
matrix(c(1:3, 10L, 11L, 12L), nrow=2, ncol=3)  # all values are integers, so the mode is numeric
m1 <- matrix(1:6, nrow=2, ncol=3, byrow=T)     # create a matrix by filling rows first
dim(m1)
colnames(m1) <- c("a", "b", "c")
rownames(m1) <- c("u", "v")
dimnames(m1)
nrow(m1)                                       # no. of rows
ncol(m1)                                       # no. of columns
m1["u",]                                       # display named row
m1[1,]                                         # display 1st row
m1[,"a"]                                       # display named column
m1[,2]                                         # display 2nd column
m1[,c(1,3)]                                    # display 1st and 3rd columns
cbind(m1, d=c(7,8))                            # append a new column
rbind(w=c(7,8,9), m1)                          # prepend a new row
dim(m1) <- c(3, 2)                             # resize the matrix
length(m1)                                     # works like in a vector
m1[4]                                          # single index works like in a vector (index is illustrative)

# Conversion between vector and matrix
ten <- 1:10
matrix(ten, nrow=2)                            # vector to matrix, vector not changed
dim(ten) <- c(2,5)                             # vector to matrix in-place
as.vector(ten)                                 # matrix to vector, matrix not changed
dim(ten) <- c(1,10)                            # remains a matrix with modified dims

# Some examples of array
a1 <- array(1:5, c(2,4,3))                     # create array of 3 dimensions, recycle 1:5
a1 <- array(1:24, c(2,4,3))                    # create array of 3 dimensions
length(a1)                                     # works like in a vector
a1[10]                                         # single index works like in a vector (index is illustrative)
dimnames(a1) <- list(c("a", "b"),              # assign names
                     c("u", "v", "w", "x"),
                     c("p", "q", "r"))
dimnames(a1)                                   # display dimension names
a1["a",,]                                      # display "a" values
a1[,"u","r"]                                   # display "u" and "r" values
a2 <- 1:100
dim(a2) <- c(10, 5, 2)                         # Transform the vector into a 3-D array
class(a2)
a2[,1,]

# Some examples of factor
gender <- factor(c(rep("male", 5),             # rep is used to repeat a value
                   rep("female", 8)))
levels(gender)                                 # levels are in alphabetic order
gender <- factor(gender, ordered=T)            # levels are ordered
levels(gender)                                 # levels are in alphabetic order
gender[gender < "male"]                        # possible when levels are ordered
gender <- factor(gender,
                 ordered=T,
                 levels=c("male", "female"))   # explicitly specify a different order
gender
levels(gender) <- c("female", "male")          # has the effect of swapping the levels
gender

# Some examples of list
list(1, "a", TRUE, 1+4i)                       # a list of varied modes
rl <- list(list(1:4), list("a","b", 3))        # a list can contain other lists
is.recursive(rl)
mylist <- list(idx = c("a", "b"),              # create a list with two named columns
               values = c(10, 12.1, 14, 12))
mylist$idx                                     # display column named idx
mylist[["idx"]]                                # display column named idx
mylist[[1]]                                    # display 1st column
mylist$values                                  # display column named values
mylist$idx[2]                                  # display 2nd item of idx column
mylist[[2]][1]                                 # display 1st item of 2nd column
mylist[[2]][[1]]                               # display 1st item of 2nd column
mylist[[c(2,1)]]                               # display 1st item of 2nd column
mylist[c(2,1)]                                 # display 2nd column, then 1st column
names(mylist)                                  # display names of list object (column names)
names(mylist) <- c("x", "y")                   # change names of list object (column names)
mylist$extra <- c(1:4)                         # add another column

# Some examples of data.frame
data.frame(a = c("x","y","z"), b = 1:3, c = T) # create a data frame directly
m1 <- matrix(1:9, 3, 3,
             dimnames = list(NULL, c("a", "b", "c")))
cbind(m1, d = c(T, F, F))                      # will result in a matrix after coercion to integer
df <- cbind(data.frame(m1), d = c(T, F, F))    # convert matrix to data.frame and add column
df$a                                           # display column named "a"
df[c("a", "d")]                                # display columns "a" and "d"
df$z <- df$a + df$b                            # create a new column from two other columns
df[1,]                                         # display row 1 as a data.frame
df[1]                                          # display column 1 as a data.frame
df[,1]                                         # display column 1 as a vector
df[c(2,3,4,1,5)]                               # display columns in a different order
ddf <- df[c("c","a","b","b")]                  # reorder columns plus repeat column "b"
names(ddf)                                     # extra column will have name "b.1"
names(ddf)[4] <- "d"                           # rename the extra column
df$c[2] <- 0                                   # change a specific item (index is illustrative)
df[df$b %in% c(4,6),]                          # display rows if "b" column has value 4 or 6
df[!(names(df) %in% c("a"))]                   # display all columns except "a"

d <- data.frame(a=1:5, b=6:10, c=11:15)
d$b[c(1,3)] <- NA                              # set couple of elements to NA
sum(is.na(d$b))                                # count of NA values
any(is.na(d$b))
all(is.na(d$b))
d[d$a %in% c(4,2),]                            # use logical vector to subset
d[(d$a>3 | d$c>6),]                            # subset by conditions: condition returns logical vector
d[(d$a>3 & d$c>6),]
d[with(d, a>3 & c>6),]                         # use names as if they are variables
subset(d, a>3 & c>6, select=c("a", "c"))       # use subset() function and display selected columns
d[which(d$b>7),]                               # use which() when some values are NA: which() returns indices
sort(d$b, na.last=T)                           # NA values are retained and come at the end
d[order(-d$c),]                                # descending order by column c values
d$d <- c(5,6,5,6,5)
d[order(d$d,-d$a),]                            # ascending order by d, then descending order by a
data.matrix(d)                                 # convert to matrix via coercion
cbind(rbind(d, d), e = c(8, 9))                # value 8 and 9 are recycled to make a full column
d$d <- NULL                                    # delete column d

a <- data.frame(id=1:5, bid=3:7, value=rnorm(5))
b <- data.frame(id=3:7, value=rnorm(5), err=rnorm(5))
merge(a, b, by.x="bid", by.y="id", all=T)      # merge from multiple datasets
intersect(names(a), names(b))

data.frame(x = 1:3, y = matrix(1:9, nrow = 3)) # create data.frame from a vector and a matrix
df <- data.frame(x = 1:3)
df$y <- list(1:2, 1:3, 1:4)                    # add a column containing lists, each of different length
data.frame(x = 1:3,
           y = I(list(1:2, 1:3, 1:4)))         # I() to treat list as one unit, not separate columns

## Cite As

Devopedia. 2020. "R Data Structures." Version 4, January 6. Accessed 2020-01-28. https://devopedia.org/r-data-structures