Day Two:
Data structures

~30 min

Overview

Questions

  • What kinds of objects does R have?
  • How can I manage them?
  • How can I control the flow of code execution?
  • How can I define custom functions?

Lesson Objectives

To be able to do/use

  • Write for loops and if conditional statements
  • Create functions

Data Structures

(atomic) vectors [side]

Atomic vectors are homogeneous/flat objects, i.e. all the objects composing the sequence must be of the same type, and cannot contain other (nested) sequences.
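A minimal sketch of what "flat" and "homogeneous" mean in practice, assuming base R only. The names `flat` and `mixed` are made up for this illustration and do not come from the course scripts.

```r
# Illustrative names (assumptions): flat, mixed

# Nested calls to c() do not nest: the result is one flat sequence
flat <- c(1, c(2, 3))
length(flat) # 3
typeof(flat) # "double"

# Mixing types does not fail: all elements are coerced to one common type
mixed <- c(1, "a")
typeof(mixed) # "character"
```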

Four main types (?typeof) of atomic vectors:

Logical ?is.logical
typeof(TRUE)
[1] "logical"
c(is.atomic(FALSE), is.logical(FALSE))
[1] TRUE TRUE


Integer ?is.integer
typeof(1:3)
[1] "integer"
c(is.atomic(1:3), is.integer(1:3))
[1] TRUE TRUE


Double ?is.double
typeof(1.2)
[1] "double"
c(is.atomic(1.2), is.double(1.2))
[1] TRUE TRUE


Character ?is.character
typeof("Hello supervisors")
[1] "character"
c( # line breaks don't break execution
  is.atomic("Hello supervisors"),
  is.character("Hello supervisors")
)
[1] TRUE TRUE


Elements can be named

c(one = 1, two = 2, three = 3)
  one   two three 
    1     2     3 

(other) vectors - Factors [side]

Other structures in R are built on atomic vectors, i.e. they are of one of the base types but carry additional structure (such structures are called classes, see ?class).

Factors are discrete variables (i.e. based on ?integer) with labels.

Factors ?is.factor
gender <- factor(
  c("male", "female", "female"),
  levels = c("female", "male", "other")
)
gender
[1] male   female female
Levels: female male other
is.factor(gender)
[1] TRUE
typeof(gender)
[1] "integer"
class(gender)
[1] "factor"
as.character(gender)
[1] "male"   "female" "female"
as.integer(gender)
[1] 2 1 1
Levels ?levels
levels(gender)
[1] "female" "male"   "other" 
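The sketch below suggests why declaring the levels matters: levels belong to the factor even when no element uses them, and values outside the declared levels become missing. It reuses the `gender` factor from above; `counts` and `unknown` are illustrative names.

```r
gender <- factor(
  c("male", "female", "female"),
  levels = c("female", "male", "other")
)

# Unused levels are still counted (with zero occurrences)
counts <- table(gender)
names(counts)     # "female" "male" "other"
as.vector(counts) # 2 1 0
nlevels(gender)   # 3

# A value that is not among the declared levels becomes NA
unknown <- factor("unknown", levels = c("female", "male", "other"))
is.na(unknown) # TRUE
```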

Tip

You can investigate the internal structure of any R object using ?str.

Structure ?str
str(gender)
 Factor w/ 3 levels "female","male",..: 2 1 1

(other) vectors - Dates / Date-times [side]

Other structures in R are built on atomic vectors, i.e. they are of one of the base types but carry additional structure (such structures are called classes, see ?class).

Dates are counts (based on ?double) of days since 1970-01-01.

Dates ?as.Date
date <- as.Date("1970-01-10")
date
[1] "1970-01-10"
typeof(date)
[1] "double"
class(date)
[1] "Date"
as.double(date)
[1] 9
str(date)
 Date[1:1], format: "1970-01-10"
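Because a date is stored as a count of days, ordinary arithmetic works on it. A small sketch under that assumption; `later` is an illustrative name.

```r
date <- as.Date("1970-01-10")

# Adding a number adds that many days
later <- date + 30
format(later) # "1970-02-09"

# Subtracting two dates gives the number of days between them
as.double(later - date) # 30

# The origin of the count is day zero
as.double(as.Date("1970-01-01")) # 0
```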

Date-times are counts (based on ?double) of seconds since 1970-01-01 00:00:00 UTC.

Date-times ?as.POSIXct
date_time <- as.POSIXct(
  "1970-01-10 13:10",
  tz = "UTC"
)
date_time
[1] "1970-01-10 13:10:00 UTC"
typeof(date_time)
[1] "double"
class(date_time)
[1] "POSIXct" "POSIXt" 
as.double(date_time)
[1] 825000
str(date_time)
 POSIXct[1:1], format: "1970-01-10 13:10:00"

list (vectors) [side]

List vectors are heterogeneous/nestable objects, i.e. objects composing the sequence can be of distinct types, and can contain other (nested) sequences.

List ?list
db_list <- list(
  age = c(70, 85, 69),
  height = c(1.5, 1.72, 1.81),
  at_risk = c(TRUE, FALSE, TRUE),
  gender = factor(
    c("male", "female", "female"),
    levels = c("female", "male", "other")
  )
)
db_list
$age
[1] 70 85 69

$height
[1] 1.50 1.72 1.81

$at_risk
[1]  TRUE FALSE  TRUE

$gender
[1] male   female female
Levels: female male other


str(db_list)
List of 4
 $ age    : num [1:3] 70 85 69
 $ height : num [1:3] 1.5 1.72 1.81
 $ at_risk: logi [1:3] TRUE FALSE TRUE
 $ gender : Factor w/ 3 levels "female","male",..: 2 1 1
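The example above holds vectors of different types; the sketch below adds the "nestable" part, with a list stored inside another list. The object `nested` and its element names are hypothetical.

```r
# Hypothetical object: a list that contains another list
nested <- list(
  id = 1L,
  tags = c("a", "b"),
  inner = list(flag = TRUE)
)

length(nested) # 3: the inner list counts as a single element
typeof(nested) # "list"

# Each element keeps its own type
unname(vapply(nested, typeof, character(1)))
# "integer" "character" "list"
```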

(other) lists - data frames (and tibble) [side]

Other structures in R are built on list vectors, i.e. they are heterogeneous sequences of objects.

Data frames are ordered lists of equally sized, homogeneous, named vectors, i.e. they are used for tabular data:

  • ordered list of columns of information, with headers (?names)
  • in a column there is one type of information (homogeneous)
  • all columns have the same ?length, i.e. number of rows (?nrow)
Data frames ?data.frame

Tip

During the course we will see, explain, and use tibbles (from the package {tibble}): a modern, enhanced version of data frames, with a better display and a stricter, more consistent structure than standard data frames.

db_df <- data.frame(
  age = c(70, 85, 69),
  height = c(1.5, 1.72, 1.81),
  at_risk = c(TRUE, FALSE, TRUE),
  gender = factor(
    c("male", "female", "female"),
    levels = c("female", "male", "other")
  )
)
db_df
names(db_df)
[1] "age"     "height"  "at_risk" "gender" 
nrow(db_df)
[1] 3
ncol(db_df)
[1] 4
dim(db_df)
[1] 3 4
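A short check, using a reduced two-column version of the data frame above, that a data frame really is a list of columns and that columns must have compatible lengths. The object `bad` is illustrative only.

```r
db_df <- data.frame(
  age = c(70, 85, 69),
  at_risk = c(TRUE, FALSE, TRUE)
)

typeof(db_df)  # "list"
class(db_df)   # "data.frame"
is.list(db_df) # TRUE

# The length of the list is the number of columns
length(db_df) == ncol(db_df) # TRUE

# Columns of incompatible lengths (3 and 2) are rejected with an error
bad <- try(data.frame(a = 1:3, b = 1:2), silent = TRUE)
inherits(bad, "try-error") # TRUE
```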

vectors as trains [side]

Important

R works on vectors of two types only:

  • Atomic (homogeneous / flat)
  • List (heterogeneous / nested)

Think of objects in R (any objects in R!) as a train (either atomic or list) made of wagons:

  • a train (i.e., a vector) is a sequence of wagons (i.e., objects, homogeneous or heterogeneous, possibly other trains)
  • wagons have content (i.e., the data they contain)
  • wagons can have labels (i.e., names)

x <- list(a = 1:3, b = "a", 4:6) (image adapted from Advanced-R)
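To connect the picture with R output, the train `x` can be inspected directly; this is only a sketch of one way to look at its wagons, contents, and labels.

```r
x <- list(a = 1:3, b = "a", 4:6)

length(x) # 3 wagons
names(x)  # "a" "b" "": the third wagon has no label

# Content of each wagon, with its type
str(x)
```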

Subsetting - subset [side]

Important

Subsetting an object (i.e., a train) mainly means performing one of two operations:

  • create another object (i.e., another train) with a subset of its elements (i.e., wagons)
  • extract the content of a (single) object (i.e., the content of a wagon)

You can select more than one object/wagon when subsetting, but a single one only when extracting!

Three ways to identify elements (i.e., wagons):

Original
db_df
Subset by position [
db_df[c(2, 3)]
Subset by names [
db_df[c("height", "age")]
Subset by logic [
db_df[c(TRUE, FALSE, FALSE, TRUE)]
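As a hedged illustration of what the three calls above return, the sketch below rebuilds `db_df` and looks at the column names of each result. The names `by_position`, `by_name`, and `by_logic` are illustrative.

```r
db_df <- data.frame(
  age = c(70, 85, 69),
  height = c(1.5, 1.72, 1.81),
  at_risk = c(TRUE, FALSE, TRUE),
  gender = factor(
    c("male", "female", "female"),
    levels = c("female", "male", "other")
  )
)

by_position <- db_df[c(2, 3)]
by_name <- db_df[c("height", "age")]
by_logic <- db_df[c(TRUE, FALSE, FALSE, TRUE)]

names(by_position) # "height" "at_risk"
names(by_name)     # "height" "age" (order follows the request)
names(by_logic)    # "age" "gender"

# In every case the result is still a data frame (another train)
is.data.frame(by_position) # TRUE
```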

Subsetting data - coordinates [optional]

Important

A data frame (like a tibble) can be seen as a “matrix” (or a table).

Data frame values can be subset using the “[rows, cols]” notation.

coordinates [rows, cols]
db_df[3, 2]
[1] 1.81
db_df[3, "age"]
[1] 69
Leave a coordinate empty to get everything along it: [, cols] or [rows, ]
db_df[, "age"]
[1] 70 85 69
db_df[3, ]
multiple selection [

Tip

Use the additional argument drop = FALSE to keep the data frame structure. With tibbles, subsetting by coordinates always returns a tibble!

db_df[3, 1:2]
db_df[3:2, c(2, 4)]
db_df[, "age", drop = FALSE]
db_df[3, 2, drop = FALSE]
db_df[3, "age", drop = FALSE]
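A small sketch of the effect of `drop`, on a reduced two-column version of `db_df`: selecting a single column by coordinates simplifies to a plain vector unless `drop = FALSE` is given. The names `dropped`, `kept`, and `one_row` are illustrative.

```r
db_df <- data.frame(
  age = c(70, 85, 69),
  height = c(1.5, 1.72, 1.81)
)

# Default behaviour: a single column is simplified to a vector
dropped <- db_df[, "age"]
is.data.frame(dropped) # FALSE
class(dropped)         # "numeric"

# With drop = FALSE the data frame structure is kept
kept <- db_df[, "age", drop = FALSE]
is.data.frame(kept) # TRUE
dim(kept)           # 3 1

# A single row spanning several columns stays a data frame anyway
one_row <- db_df[3, ]
is.data.frame(one_row) # TRUE
```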

Subsetting data - extract [side]

Important

Subsetting an object (i.e., a train) mainly means performing one of two operations:

  • create another object (i.e., another train) with a subset of its elements (i.e., wagons)
  • extract the content of a (single) object (i.e., the content of a wagon)

You can select more than one object/wagon when subsetting, but a single one only when extracting!

Two ways to identify a (single!) element (i.e., a wagon):

  • with its position.
  • with its name, if it has a name.
Extracting [[
db_df[[1]]
[1] 70 85 69
db_df[["height"]]
[1] 1.50 1.72 1.81
Extracting $
db_df$height
[1] 1.50 1.72 1.81
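To make the difference between subsetting and extracting concrete, the sketch below compares `[` with `[[` and `$` on a reduced two-column version of `db_df`. The names `wagon_train` and `wagon_content` are illustrative.

```r
db_df <- data.frame(
  age = c(70, 85, 69),
  height = c(1.5, 1.72, 1.81)
)

# [ returns a smaller train: still a data frame, with one column
wagon_train <- db_df["height"]
is.data.frame(wagon_train) # TRUE

# [[ returns the content of the wagon: a plain double vector
wagon_content <- db_df[["height"]]
is.data.frame(wagon_content) # FALSE
is.double(wagon_content)     # TRUE

# $ is a shorthand for [[ with a name
identical(db_df$height, db_df[["height"]]) # TRUE
```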

Your turn (main: B; bk1: C; bk2: A)

Your turn

  1. Before evaluating it, in the pad, under section 2.2. Ex7, write (on a new line) the result you expect (possibly an error).

  2. Before evaluating it, in the pad, under section 2.2. Ex8, write (on a new line) the result you expect (possibly an error).

  3. Then, open the script 04-atomic-vectors.R and follow the instructions step by step.

  4. Then, open the script 05-subsetting.R and follow the instructions step by step.

15:00

Important

  • Coercion goes from the most specific to the most general type: logical > integer > double > character

  • Subset operations can be chained directly on the same object.

  • It is crucial to know whether you are working on a subset of an object or on its content.
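The points above can be sketched as follows, assuming base R only; the list `x` is the one used in the train picture, reduced to its two named wagons.

```r
# Coercion: combining types yields the most general one involved
typeof(c(TRUE, 1L))  # "integer"
typeof(c(1L, 2.5))   # "double"
typeof(c(2.5, "a"))  # "character"

# Logicals coerced to numbers become 1 (TRUE) and 0 (FALSE)
sum(c(TRUE, FALSE, TRUE)) # 2

# Chained subsetting: each operation works on the result of the previous one
x <- list(a = 1:3, b = "a")
is.list(x["a"]) # TRUE: a subset (still a train)
x[["a"]]        # 1 2 3: the content of the wagon
x[["a"]][2]     # 2: a subset of that content
```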

My turn

YOU: Connect to our pad (https://bit.ly/ubep-rws-pad-ed3) and write your questions & doubts there (and tell me if I am too slow or too fast)

ME: Connect to the Day-2 project in RStudio cloud

Control flow [optional]

If-then
if (cond) {
  # <code>
}
x <- 3

print("start")
if (x > 3) {
  print("ok")
}
print("end")
[1] "start"
[1] "end"


If-then-else
if (cond) {
  # <code>
} else {
  # <code>
}
print("start")
if (x > 3) {
  print("ok")
} else {
  print("ko")
}
print("end")
[1] "start"
[1] "ko"
[1] "end"


Tip

You don’t need to test if a logical is TRUE or FALSE, they are already TRUE or FALSE!

is_to_print <- TRUE

if (is_to_print) {
  print("ok")
} else {
  print("ko")
}
[1] "ok"
For loops
for (<var> in <vector>) {
  # <code>
}
print("start")
for (i in seq_len(x)) {
  print(paste("i is:", i))
}
print("end")
[1] "start"
[1] "i is: 1"
[1] "i is: 2"
[1] "i is: 3"
[1] "end"
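A further sketch of a for loop that does some work: it accumulates a total over a vector. It also hints at why `seq_len()` and `seq_along()` are often preferred to `1:n`, since they produce an empty sequence when there is nothing to iterate over. The names `values`, `total`, and `n_iter` are illustrative.

```r
values <- c(70, 85, 69)

# Accumulate the sum one element at a time
total <- 0
for (i in seq_along(values)) {
  total <- total + values[[i]]
}
total # 224

# seq_len(0) is empty, so the loop body never runs
n_iter <- 0
for (i in seq_len(0)) {
  n_iter <- n_iter + 1
}
n_iter # 0

# 1:0 is not empty: it counts down from 1 to 0
length(1:0) # 2
```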

Functions [optional]

Definition
name <- function(args) {
  # body code of the function
}
sum_one <- function(x) {
  x + 1
}
sum_one(x = 3)
[1] 4
A function always returns its last evaluated object.
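As a complement, a minimal sketch that combines an implicit return with an explicit early exit via `return()`. The function `safe_sqrt` is hypothetical and only meant to illustrate the mechanism.

```r
# Hypothetical helper: square root that gives NA for negative input
safe_sqrt <- function(x) {
  if (x < 0) {
    return(NA_real_) # explicit early exit
  }
  sqrt(x) # last evaluated object: returned implicitly
}

safe_sqrt(9)  # 3
safe_sqrt(-1) # NA
```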


Default
name <- function(args = default) {
  # body code of the function
}
sum_one <- function(x = 3) {
  x + 1
}
sum_one()
sum_one(x = 3)
[1] 4
[1] 4


Positional arg match ?function
x_exp_y <- function(x, y) {
  x^y
}
x_exp_y(2, 3)
[1] 8


Tip

To avoid confusion, don’t mix positional and named argument matching in a non-standard order: start with positional arguments and, once you name an argument, name all the subsequent ones! The following call works, but it is harder to read:

x_exp_y(y = 3, 2)
[1] 8
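For comparison, a sketch of calls that follow the suggested convention, together with a reminder that swapping purely positional arguments changes the result.

```r
x_exp_y <- function(x, y) {
  x^y
}

# Positional first, then named: easy to read
x_exp_y(2, y = 3) # 8

# All named: the order no longer matters
x_exp_y(y = 3, x = 2) # 8

# Purely positional: the order matters
x_exp_y(3, 2) # 9
```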

Your turn [optional]

Your turn

  1. Before evaluating it, in the pad, under section 2.2. Ex9, write (on a new line) the result you expect from the following computation:
<code>
  2. Then, open the script 06-cond-and-funs.R and follow the instructions step by step.
20:00

Important

  • ?if returns the last computed value, if any, or NULL otherwise (which may happen if there is no else).
  • ?for returns NULL
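Both statements can be checked by assigning the result of `if` and `for` to a name. This is a sketch; `no_else`, `with_else`, and `loop_value` are illustrative names.

```r
x <- 3

# The condition is FALSE and there is no else: the value is NULL
no_else <- if (x > 3) "ok"
is.null(no_else) # TRUE

# With an else branch, the value of the evaluated branch is returned
with_else <- if (x > 3) "ok" else "ko"
with_else # "ko"

# A for loop always evaluates to NULL, whatever its body computes
loop_value <- for (i in seq_len(x)) i
is.null(loop_value) # TRUE
```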

My turn [optional]

YOU: Connect to our pad (https://bit.ly/ubep-rws-pad-ed3) and write your questions & doubts there (and tell me if I am too slow or too fast)

ME: Connect to the Day-2 project in RStudio cloud (https://bit.ly/ubep-rws-rstudio): script 04-cond_loop.R

Break

10:00

Acknowledgment

To create the current lesson we explored, used, and adapted content from the following resources:

The slides are made with Posit’s Quarto, an open-source scientific and technical publishing system, powered in R by Yihui Xie’s knitr.

Additional Resources

License

This work by Corrado Lanera, Ileana Baldi, and Dario Gregori is licensed under CC BY 4.0