Day Three:
Transform
Types

30 (+30) min approx

Overview

Questions

  • How to handle factors effectively in R/Tidyverse?
  • How to handle dates and time in R/Tidyverse?
  • How to handle strings in R/Tidyverse?

Lesson Objectives

To be able to

  • perform basic factor data management
  • convert textual dates/times into R date/time objects
  • use simple regular expressions and the main str_* functions to manage strings

Manage principal formats

Factors - why

Using strings for categories is not always the best choice. Factors are the best way to represent categories in R.

  • sorting issues
x1 <- c("Dec", "Apr", "Jan", "Mar")
sort(x1)
[1] "Apr" "Dec" "Jan" "Mar"
  • missing/wrong levels issues
x2 <- c("Dec", "Apr", "Jam", "Mar")
x2
[1] "Dec" "Apr" "Jam" "Mar"
  • tabulation issues
table(x1)
x1
Apr Dec Jan Mar 
  1   1   1   1 

Factors - how

Define a set of possible values (levels), as a standard character vector.

month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun", 
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
month_levels
 [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct"
[11] "Nov" "Dec"

And define a variable as factor, specifying the levels using that.

Base

y1_base <- factor(x1, levels = month_levels)
y1_base
[1] Dec Apr Jan Mar
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

Tidyverse - (forcats)

library(tidyverse)

y1_tidy <- fct(x1, levels = month_levels)
y1_tidy
[1] Dec Apr Jan Mar
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
sort(y1_base)
[1] Jan Mar Apr Dec
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
sort(y1_tidy)
[1] Jan Mar Apr Dec
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

Factors - why tidyverse ({forcats})

If we don’t provide explicit levels, the levels are the unique values in the vector, sorted alphabetically in base R, or in the order of appearance in forcats.

Base

factor(x1)
[1] Dec Apr Jan Mar
Levels: Apr Dec Jan Mar

Tidyverse - (forcats)

fct(x1)
[1] Dec Apr Jan Mar
Levels: Dec Apr Jan Mar



If the input contains values that are not among the specified levels, base R silently converts them to missing values (NA), while forcats throws an (informative!) error.

y2_base <- x2 |> 
  factor(levels = month_levels)
y2_base
[1] Dec  Apr  <NA> Mar 
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
y2_tidy <- x2 |> 
  fct(levels = month_levels)
Error in `fct()`:
! All values of `x` must appear in `levels` or `na`
ℹ Missing level: "Jam"
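
Before converting, it can help to list the values that are not valid levels. The sketch below uses base R's setdiff() for that, and then the `na` argument of fct(), which the error message above mentions, to declare a known-bad value as missing on purpose. The object names unexpected and y2_na are only illustrative.

```r
library(forcats)

month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
x2 <- c("Dec", "Apr", "Jam", "Mar")

# Values present in the data that are not valid levels
unexpected <- setdiff(x2, month_levels)
unexpected

# Explicitly declare "Jam" as a missing value instead of failing
y2_na <- fct(x2, levels = month_levels, na = "Jam")
y2_na
```

Unlike the silent conversion in base R, here the decision to treat "Jam" as missing is written in the code, so it stays visible to whoever reads the script later.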

Factors - reorder levels

It can be useful to reorder levels, e.g., when plotting.

We can use forcats::fct_reorder to reorder levels. Its first argument is the factor to reorder, and its second argument is a numeric vector whose values determine the new order of the levels.

Tip

Often, the numerical value you use to reorder a factor is another variable in your dataset!

library(tidyverse)

# sample dataset from `{forcats}`
# ?gss_cat for information
gss_cat 

relig_summary <- gss_cat |> 
  group_by(relig) |> 
  summarize(
    tv_hours = tvhours |> 
      mean(na.rm = TRUE)
  )
relig_summary

relig_summary |> 
  ggplot(aes(
    x = tv_hours,
    y = relig
  )) +
  geom_point()

relig_summary |> 
  ggplot(aes(
    x = tv_hours,
    y = relig |> 
      fct_reorder(tv_hours)
  )) +
  geom_point()
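
To see what fct_reorder does to the levels themselves, independent of any plot, here is a minimal sketch on a toy factor. The names f, score, and reordered are made up for illustration.

```r
library(forcats)

f <- factor(c("a", "b", "c"))
score <- c(3, 1, 2)  # one numeric value per element of f

levels(f)
# [1] "a" "b" "c"

# Levels are sorted by the (median) score observed for each level
reordered <- fct_reorder(f, score)
levels(reordered)
# [1] "b" "c" "a"

# .desc = TRUE sorts from the largest to the smallest score
levels(fct_reorder(f, score, .desc = TRUE))
# [1] "a" "c" "b"
```

Only the order of the levels changes; the values stored in the factor stay the same, which is why the plot gets re-sorted without touching the data.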

Factors - reorder levels

There are also many other useful functions in forcats to reorder levels, e.g., fct_infreq and fct_rev. To see all of them, refer to its website https://forcats.tidyverse.org/.

gss_cat |>
  mutate(
    marital = marital |>
      # order by frequency
      fct_infreq() |>
      # reverse the order
      fct_rev()
  ) |>
  ggplot(aes(x = marital)) +
  geom_bar()
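
The same two functions can be inspected on a small toy factor, without a plot. This is only a sketch; f and by_freq are hypothetical names.

```r
library(forcats)

f <- factor(c("b", "a", "b", "c", "b", "a"))

levels(f)
# [1] "a" "b" "c"

# Most frequent level first: b (3 times), a (2 times), c (once)
by_freq <- fct_infreq(f)
levels(by_freq)
# [1] "b" "a" "c"

# Reverse the current order of the levels
levels(fct_rev(by_freq))
# [1] "c" "a" "b"
```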

Factors - modify (AKA recode) levels

We can also modify levels, e.g., to change the wording, or to merge some of them together.

Change the wording

gss_cat |>
  mutate(
    partyid = fct_recode(partyid,
      "Republican, strong"    = "Strong republican",
      "Republican, weak"      = "Not str republican",
      "Independent, near rep" = "Ind,near rep",
      "Independent, near dem" = "Ind,near dem",
      "Democrat, weak"        = "Not str democrat",
      "Democrat, strong"      = "Strong democrat",
      "Other"                 = "No answer",
      "Other"                 = "Don't know",
      "Other"                 = "Other party"
    )
  ) |>
  count(partyid)

Important

forcats::fct_recode will leave the levels that aren’t explicitly mentioned as is, and will warn you if you accidentally refer to a level that doesn’t exist.

To combine groups, you can assign multiple old levels to the same new level, or… use forcats::fct_collapse!

Merge levels

gss_cat |>
  mutate(
    partyid = fct_collapse(partyid,
      "other" = c("No answer", "Don't know", "Other party"),
      "rep" = c("Strong republican", "Not str republican"),
      "ind" = c("Ind,near rep", "Independent", "Ind,near dem"),
      "dem" = c("Not str democrat", "Strong democrat")
    )
  ) |>
  count(partyid)
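
For a side-by-side view of renaming versus merging, here is a minimal sketch on a toy factor. The data and the names answers, recoded, and collapsed are invented for illustration.

```r
library(forcats)

answers <- factor(c("yes", "y", "no", "n", "maybe"))

# Rename single levels: new name on the left, old name on the right.
# Levels that are not mentioned ("y", "n", "maybe") are left as they are.
recoded <- fct_recode(answers, "Yes" = "yes", "No" = "no")
levels(recoded)

# Merge several old levels into one new level
collapsed <- fct_collapse(answers,
  "Yes" = c("yes", "y"),
  "No"  = c("no", "n")
)
table(collapsed)
```

With fct_collapse each new level is written once together with all the old levels it absorbs, which tends to be easier to read than repeating the same new name on several fct_recode lines.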

Dates and Time

In the Tidyverse, the main package to manage dates and time is lubridate.

Reminder

  • Dates are stored as counts (based on doubles; see ?double) of days since 1970-01-01.
  • Date-times are stored as counts (based on doubles; see ?double) of seconds since 1970-01-01 00:00:00 UTC.

To get the current date or date-time you can use today() or now():

library(tidyverse)
today()
now()
[1] "2024-02-29"
[1] "2024-02-29 11:16:17 CET"
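
The reminder above can be checked directly by converting date and date-time objects to plain numbers. This is a small sketch; d and dt are illustrative names.

```r
library(lubridate)

# Ten days after the origin (1970-01-01): stored as the number 10
d <- ymd("1970-01-11")
as.numeric(d)
# [1] 10

# One minute after the origin, in UTC: stored as 60 seconds
dt <- ymd_hms("1970-01-01 00:01:00")
as.numeric(dt)
# [1] 60

# Because dates are counts, subtracting them gives a time difference
ymd("2024-03-01") - ymd("2024-02-29")
```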

Dates and Time - conversion from strings

Dates

ymd("2017-01-31")
mdy("January 31st, 2017")
dmy("31-Jan-2017")
[1] "2017-01-31"
[1] "2017-01-31"
[1] "2017-01-31"

Date-times

ymd_hms("2017-01-31 20:11:59")
mdy_hm("01/31/2017 20:11")
mdy_h("01/31/2017 20")

# Force date-time supplying a timezone
ymd("2017-01-31", tz = "UTC")
[1] "2017-01-31 20:11:59 UTC"
[1] "2017-01-31 20:11:00 UTC"
[1] "2017-01-31 20:00:00 UTC"
[1] "2017-01-31 UTC"
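
The parsing functions are vectorised, so they can be applied to a whole column at once. The sketch below, with made-up input strings, also shows what happens when a value cannot be parsed: it becomes NA and lubridate emits a warning (suppressed here only to keep the output short).

```r
library(lubridate)

# Different separators are accepted within the same vector
ymd(c("2017-01-31", "2017/02/28", "20170315"))

# A value that cannot be read as year-month-day becomes NA
parsed <- suppressWarnings(ymd(c("2017-01-31", "not a date")))
parsed
is.na(parsed)
# [1] FALSE  TRUE
```

Counting the NAs after parsing is a quick way to find out whether some input strings do not follow the expected order.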

Date <-> Date-time conversion

as_datetime(today()) |> 
  str()
as_date(now()) |> 
  str()
 POSIXct[1:1], format: "2024-02-29"
 Date[1:1], format: "2024-02-29"

Extracting/Changing components

We can extract or modify components from date/date-time objects using:

  • year()
  • month()
  • day()
  • hour()
  • minute()
  • second()
  • wday() (day of the week)
  • yday() (day of the year)
  • week() (week of the year)
  • quarter() (quarter of the year).

Extract

(today_now <- now())
year(today_now)
month(today_now)
day(today_now)
hour(today_now)
minute(today_now)
second(today_now)
wday(today_now)
yday(today_now)
week(today_now)
quarter(today_now)
[1] "2024-02-29 11:16:17 CET"
[1] 2024
[1] 2
[1] 29
[1] 11
[1] 16
[1] 17.83648
[1] 5
[1] 60
[1] 9
[1] 1

Change

year(today_now)  <- 2020
today_now
month(today_now) <- 12
today_now
day(today_now) <- 30
today_now
hour(today_now) <- 17
today_now
minute(today_now) <- 14
today_now
second(today_now) <- 56
today_now
[1] "2020-02-29 11:16:17 CET"
[1] "2020-12-29 11:16:17 CET"
[1] "2020-12-30 11:16:17 CET"
[1] "2020-12-30 17:16:17 CET"
[1] "2020-12-30 17:14:17 CET"
[1] "2020-12-30 17:14:56 CET"
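
In practice, the extractor functions are mostly used inside mutate() to derive new columns from a date-time column. A minimal sketch, assuming a hypothetical tibble visits with a column when:

```r
library(dplyr)
library(lubridate)

visits <- tibble(
  when = ymd_hms(c("2024-02-29 11:16:17", "2020-12-30 17:14:56"))
)

visits_parts <- visits |>
  mutate(
    year = year(when),
    month = month(when),
    day = day(when),
    hour = hour(when),
    # 1 = Sunday, unless the lubridate.week.start option was changed
    weekday = wday(when)
  )
visits_parts
```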

Your turn (main: B; bk1: C; bk2: A)

Your turn

…and:

  1. Under the sections 4.2. Ex26 and Ex27 of the pad, write (on a new line) your answers to the questions listed there.

  2. Then, open the scripts 19-factors.R and 20-date-time.R and follow the instructions step by step.

15:00

Tip

  • Factors, created with factor() in base R or with fct() in forcats, are the best way to represent categories in R. The two functions work similarly, but forcats is more informative and more flexible.

  • Date and Date-time are counts of days/seconds since 1970-01-01. Managing them in R is not easy, but lubridate makes it easier.

My turn

YOU: Connect to our pad (https://bit.ly/ubep-rws-pad-ed3) and write your questions and doubts there (and tell me if I am too slow or too fast).

ME: Connect to the Day-4 project in RStudio cloud (https://bit.ly/ubep-rws-rstudio): script 14-factors.R

Strings - Regular Expressions

Regular expressions are a powerful tool for matching text patterns. They are used in many programming languages to find and manipulate strings; in the Tidyverse, the stringr package provides the functions to work with them.

Base syntax for regular expressions

  • . matches any single character (except a newline)
  • * matches the preceding item zero or more times
  • + matches the preceding item one or more times
  • ? matches the preceding item zero or one time
  • ^ matches the start of a string
  • $ matches the end of a string
  • [] matches any one of the characters inside the square brackets
  • [^] matches any one character not inside the square brackets
  • | matches the pattern on either its left or its right
  • () groups part of a pattern, so that quantifiers and | apply to the whole group

Example

The following match any string that:

  • a: contains a (try str_view("banana", "a"))
  • ^a: starts with a
  • a$: ends with a
  • ^a$: is exactly a (the same a is both the start and the end)
  • ^a.*a$: starts and ends with a, with any number of characters in between
  • ^a.+a$: starts and ends with a, with at least one character in between
  • ^a[bc]+a$: starts and ends with a, with only b or c (at least one) in between
  • ^a(b|c)d$: starts with a, followed by either b or c, followed by a final d
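
The patterns listed above can be tried on a small vector. The sketch below uses str_subset(), which returns only the strings that match; the vector words is made up for illustration.

```r
library(stringr)

words <- c("a", "aba", "abca", "abd", "banana")

str_subset(words, "^a$")
# [1] "a"

str_subset(words, "^a.*a$")
# [1] "aba"  "abca"

str_subset(words, "^a[bc]+a$")
# [1] "aba"  "abca"

str_subset(words, "^a(b|c)d$")
# [1] "abd"
```

Note that "a" alone does not match ^a.*a$, because that pattern needs two separate a characters, one at the start and one at the end.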

Tip

To match special characters literally, you need to escape them with a double backslash (\\), because the backslash is itself an escape character in R strings. I.e., you need to use \\., \\*, \\+, \\?, \\^, \\$, \\[, \\], \\|, \\(, \\).

To match a backslash, you need \\\\.
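
A short sketch of the difference between an unescaped and an escaped dot, on made-up strings. It also shows stringr's fixed(), which treats the whole pattern as literal text and so avoids escaping altogether.

```r
library(stringr)

files <- c("a.b", "axb")

# Unescaped: the dot matches any character
str_detect(files, "a.b")
# [1] TRUE TRUE

# Escaped: the dot matches only a literal dot
str_detect(files, "a\\.b")
# [1]  TRUE FALSE

# fixed() compares the pattern literally, no regular expression involved
str_detect(files, fixed("a.b"))
# [1]  TRUE FALSE

# A literal backslash needs four backslashes in the pattern
path <- "C:\\temp"  # the string C:\temp
str_detect(path, "\\\\")
# [1] TRUE
```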

Strings - {stringr}

The stringr package provides a consistent set of functions for working with strings, and it is designed to work consistently with the pipe.

Functions

  • str_detect(): does a string contain a pattern?
  • str_which(): which strings match a pattern?
  • str_subset(): subset of strings that match a pattern
  • str_sub(): extract a sub-string by position
  • str_replace(): replace the first match with a replacement
  • str_replace_all(): replace all matches with a replacement
  • str_remove(): remove the first match
  • str_remove_all(): remove all matches
  • str_split(): split up a string into pieces
  • str_extract(): extract the first match
  • str_extract_all(): extract all matches
  • str_locate(): locate the first match
  • str_locate_all(): locate all matches
  • str_count(): count the number of matches
  • str_length(): the number of characters in a string

Tip

Because all stringr functions start with str_, in RStudio you can type str_ and then press TAB to see all the available functions.

Examples

library(tidyverse)

x <- c("apple", "banana", "pear")
str_detect(x, "[aeiou]")
[1] TRUE TRUE TRUE
str_which(x, "[aeiou]")
[1] 1 2 3
str_subset(x, "[aeiou]")
[1] "apple"  "banana" "pear"  
str_sub(x, 1, 3)
[1] "app" "ban" "pea"
str_replace(x, "[aeiou]", "x")
[1] "xpple"  "bxnana" "pxar"  
str_replace_all(x, "[aeiou]", "x")
[1] "xpplx"  "bxnxnx" "pxxr"  
str_remove(x, "[aeiou]")
[1] "pple"  "bnana" "par"  
str_remove_all(x, "[aeiou]")
[1] "ppl" "bnn" "pr" 
str_split(x, "[aeiou]")
[[1]]
[1] ""    "ppl" ""   

[[2]]
[1] "b" "n" "n" "" 

[[3]]
[1] "p" ""  "r"
str_extract(x, "[aeiou]")
[1] "a" "a" "e"
str_extract_all(x, "[aeiou]")
[[1]]
[1] "a" "e"

[[2]]
[1] "a" "a" "a"

[[3]]
[1] "e" "a"
str_locate(x, "[aeiou]")
     start end
[1,]     1   1
[2,]     2   2
[3,]     2   2
str_locate_all(x, "[aeiou]")
[[1]]
     start end
[1,]     1   1
[2,]     5   5

[[2]]
     start end
[1,]     2   2
[2,]     4   4
[3,]     6   6

[[3]]
     start end
[1,]     2   2
[2,]     3   3
str_count(x, "[aeiou]")
[1] 2 3 2
str_length(x)
[1] 5 6 4
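
The examples above work on a plain character vector. The same functions can be used on the columns of a data frame inside filter() and mutate(); here is a minimal sketch, assuming a hypothetical tibble fruits with a column name.

```r
library(dplyr)
library(stringr)

fruits <- tibble(name = c("apple", "banana", "pear"))

fruits_ab <- fruits |>
  # keep only the names starting with a or b
  filter(str_detect(name, "^[ab]")) |>
  mutate(
    n_vowels = str_count(name, "[aeiou]"),
    short = str_sub(name, 1, 3)
  )
fruits_ab
```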

Strings - concatenate

  • str_c: takes any number of vectors as arguments and returns a character vector of the concatenated values.

  • str_glue: takes a string and interpolates values into it.

library(tidyverse)

tibble(
    x = c("apple", "banana", "pear"),
    y = c("red", "yellow", "green"),
    z = c("round", "long", "round")
  ) |> 
  mutate(
    fruit = str_c(x, y, z),
    fruit_space = str_c(x, y, z, sep = " "),
    fruit_comma = str_c(x, y, z, sep = ", "),
    fruit_glue = str_glue("I like {x}, {y} and {z} fruits")
  )
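
Two more details of str_c and str_glue, sketched on toy values: the collapse argument joins a whole vector into a single string, missing values propagate through str_c, and str_glue can evaluate any R expression inside the braces. The names fruits, n, and msg are illustrative.

```r
library(stringr)

fruits <- c("apple", "banana", "pear")

# collapse turns a vector into one single string
str_c(fruits, collapse = ", ")
# [1] "apple, banana, pear"

# A missing value makes the whole result missing
str_c("fruit: ", c("apple", NA))
# [1] "fruit: apple" NA

# Anything inside {} is evaluated as R code
n <- length(fruits)
msg <- str_glue("There are {n} fruits; the first is {fruits[1]}.")
msg
# There are 3 fruits; the first is apple.
```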

Your turn (main: C; bk1: A; bk2: B)

Your turn

…and:

  1. Before evaluating it, in the pad, under the section 4.2. Ex8, write (on a new line) your answer: how can you match all file names that are R scripts (i.e., ending with .r or .R)? Report your proposal for a regular expression.

  2. Then, open the script 21-strings.R and follow the instructions step by step.

10:00

Tip

  • All functions in stringr start with str_, so you can type str_ and then press TAB to see all the available functions.

  • You can use str_view to see how a regular expression matches a string.

  • str_glue is a powerful tool to concatenate strings and variables.

My turn

YOU: Connect to our pad (https://bit.ly/ubep-rws-pad-ed3) and write your questions and doubts there (and tell me if I am too slow or too fast).

ME: Connect to the Day-4 project in RStudio cloud (https://bit.ly/ubep-rws-rstudio): script 16-strings.R

Homework

Posit’s RStudio Cloud Workspace

  • Project: Day-4
  • Instructions:
    • Go to: https://bit.ly/ubep-rws-website
    • The text is the Day-4 assessment under the tab “Summative Assessments”.
  • Script to complete on RStudio: solution.R

Acknowledgment

To create the current lesson, we explored, used, and adapted content from the following resources:

The slides are made using Posit’s Quarto open-source scientific and technical publishing system, powered in R by Yihui Xie’s knitr.

License

This work by Corrado Lanera, Ileana Baldi, and Dario Gregori is licensed under CC BY 4.0

Break

10:00