Day Three:
Transform
Types

30 (+30) min approx

Overview

Questions

How to handle factors effectively in R/Tidyverse?
How to handle dates and time in R/Tidyverse?
How to handle strings in R/Tidyverse?

Lesson Objectives

To be able to

perform basic factor data management.
convert textual dates/times into R date/time objects.
use simple regular expressions and the main str_* functions to manage strings.

Manage principal formats

Factors - why

Using strings for categories is not always the best choice. Factors are the best way to represent categories in R.

sorting issues: character vectors sort alphabetically, not in calendar order

x1 <- c("Dec", "Apr", "Jan", "Mar")
sort(x1)

[1] "Apr" "Dec" "Jan" "Mar"

missing/wrong levels issues: nothing flags a typo such as "Jam"

x2 <- c("Dec", "Apr", "Jam", "Mar")
x2

[1] "Dec" "Apr" "Jam" "Mar"

tabulation issues: values that never occur (here, the other eight months) do not appear in the table

table(x1)

x1
Apr Dec Jan Mar 
  1   1   1   1

Factors - how

Define a set of possible values (levels), as a standard character vector.

month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun", 
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
month_levels

 [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct"
[11] "Nov" "Dec"

Then define the variable as a factor, passing that vector as its levels.

Base

y1_base <- factor(x1, levels = month_levels)
y1_base

[1] Dec Apr Jan Mar
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

Tidyverse - (forcats)

library(tidyverse)

y1_tidy <- fct(x1, levels = month_levels)
y1_tidy

[1] Dec Apr Jan Mar
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

sort(y1_base)

[1] Jan Mar Apr Dec
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

sort(y1_tidy)

[1] Jan Mar Apr Dec
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
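
The tabulation issue mentioned earlier is also solved by explicit levels. The following is a minimal base R sketch (the object names y1 and counts are only illustrative): once the levels are declared, table() reports every month, including those with zero occurrences.

```r
# Levels in calendar order, as defined in the slides
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
x1 <- c("Dec", "Apr", "Jan", "Mar")

# A factor keeps track of all declared levels, observed or not
y1 <- factor(x1, levels = month_levels)

# table() now has one cell per level, in level order
counts <- table(y1)
counts

# Months that never occur in the data
names(counts)[as.vector(counts) == 0]
```

With a plain character vector, the eight unobserved months would simply be absent from the table.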

Factors - why tidyverse (`{forcats}`)

If we don’t provide explicit levels, the levels are the unique values in the vector, sorted alphabetically in base R, or in the order of appearance in forcats.

Base

factor(x1)

[1] Dec Apr Jan Mar
Levels: Apr Dec Jan Mar

Tidyverse - (forcats)

fct(x1)

[1] Dec Apr Jan Mar
Levels: Dec Apr Jan Mar

If some of the values used to create a factor are not among the levels (e.g., because of a typo), base R silently converts them to missing (NA), while forcats throws an (informative!) error.

y2_base <- x2 |> 
  factor(levels = month_levels)
y2_base

[1] Dec  Apr  <NA> Mar 
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

y2_tidy <- x2 |> 
  fct(levels = month_levels)

Error in `fct()`:
! All values of `x` must appear in `levels` or `na`
ℹ Missing level: "Jam"
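
One possible way to spot such values before building the factor is to compare the data with the allowed levels. This is only a sketch in base R; the names unexpected and x2_fixed are illustrative, and month.abb is the base R constant holding the same twelve English abbreviations as month_levels.

```r
# month.abb is a built-in base R constant: "Jan", "Feb", ..., "Dec"
month_levels <- month.abb
x2 <- c("Dec", "Apr", "Jam", "Mar")

# Values present in the data but not among the allowed levels
unexpected <- setdiff(x2, month_levels)
unexpected

# Fix the typo explicitly, then build the factor
x2_fixed <- x2
x2_fixed[x2_fixed == "Jam"] <- "Jan"
y2_fixed <- factor(x2_fixed, levels = month_levels)
y2_fixed
```

Correcting the raw values explicitly documents the decision, instead of letting the typo turn into an NA.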

Factors - reorder levels

It can be useful to reorder levels, e.g., when plotting information.

We can use forcats::fct_reorder to reorder levels. Its first argument is the factor to reorder, and its second argument is a numeric vector whose values determine the new order of the levels.

Tip

Often, the numerical value you use to reorder a factor is another variable in your dataset!

library(tidyverse)

# sample dataset from `{forcats}`
# ?gss_cat for information
gss_cat

relig_summary <- gss_cat |> 
  group_by(relig) |> 
  summarize(
    tv_hours = tvhours |> 
      mean(na.rm = TRUE)
  )
relig_summary

Natural

relig_summary |> 
  ggplot(aes(
    x = tv_hours,
    y = relig
  )) +
  geom_point()

Reordered

relig_summary |> 
  ggplot(aes(
    x = tv_hours,
    y = relig |> 
      fct_reorder(tv_hours)
  )) +
  geom_point()
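
To see what the reordering does without a plot, here is a small sketch on made-up data (the toy data frame and its columns group and value are hypothetical, not part of gss_cat). It only inspects the levels before and after fct_reorder().

```r
library(forcats)

# Hypothetical toy data: three groups with one numeric value each
toy <- data.frame(
  group = factor(c("a", "b", "c")),
  value = c(3, 1, 2)
)

# Default (alphabetical) level order
levels(toy$group)

# Levels sorted by increasing value: "b" (1), "c" (2), "a" (3)
reordered <- fct_reorder(toy$group, toy$value)
levels(reordered)

# Same, but by decreasing value
reordered_desc <- fct_reorder(toy$group, toy$value, .desc = TRUE)
levels(reordered_desc)
```

Only the order of the levels changes; the values of the factor stay attached to the same rows.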

Factors - reorder levels

There are also many other useful functions in forcats to reorder levels, e.g., fct_infreq and fct_rev. To see all of them, refer to its website https://forcats.tidyverse.org/.

gss_cat |>
  mutate(
    marital = marital |>
      # order by frequency
      fct_infreq() |>
      # reverse the order
      fct_rev()
  ) |>
  ggplot(aes(x = marital)) +
  geom_bar()
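
The effect of the two functions on the levels can be checked on a tiny made-up factor (f is a hypothetical example, unrelated to gss_cat):

```r
library(forcats)

# Hypothetical factor: "z" occurs 3 times, "y" twice, "x" once
f <- factor(c("x", "y", "y", "z", "z", "z"))
levels(f)

# Most frequent level first
by_freq <- fct_infreq(f)
levels(by_freq)

# Reverse the order: least frequent level first
by_freq_rev <- fct_rev(by_freq)
levels(by_freq_rev)
```

In a bar chart, the combination used above therefore draws the bars from the least to the most frequent category.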

Factors - modify (AKA recode) levels

We can also modify levels, e.g., to change the wording, or to merge some of them together.

Change the wording

gss_cat |>
  mutate(
    partyid = fct_recode(partyid,
      "Republican, strong"    = "Strong republican",
      "Republican, weak"      = "Not str republican",
      "Independent, near rep" = "Ind,near rep",
      "Independent, near dem" = "Ind,near dem",
      "Democrat, weak"        = "Not str democrat",
      "Democrat, strong"      = "Strong democrat",
      "Other"                 = "No answer",
      "Other"                 = "Don't know",
      "Other"                 = "Other party"
    )
  ) |>
  count(partyid)

Important

forcats::fct_recode will leave the levels that aren’t explicitly mentioned as is, and will warn you if you accidentally refer to a level that doesn’t exist.

To combine groups, you can assign multiple old levels to the same new level, or… use forcats::fct_collapse!

Merge levels

gss_cat |>
  mutate(
    partyid = fct_collapse(partyid,
      "other" = c("No answer", "Don't know", "Other party"),
      "rep" = c("Strong republican", "Not str republican"),
      "ind" = c("Ind,near rep", "Independent", "Ind,near dem"),
      "dem" = c("Not str democrat", "Strong democrat")
    )
  ) |>
  count(partyid)
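
The two approaches to merging can be compared on a small made-up factor (the object f and the level names below are hypothetical). This sketch suggests that, for a simple merge, fct_recode() with repeated new names and fct_collapse() yield the same values.

```r
library(forcats)

# Hypothetical factor with two shades of blue and one red
f <- factor(c("navy", "sky", "red", "navy"))

# Merge by assigning the same new level to several old levels
recoded <- fct_recode(f, "blue" = "navy", "blue" = "sky")

# Merge by listing the old levels once per new level
collapsed <- fct_collapse(f, blue = c("navy", "sky"))

as.character(recoded)
as.character(collapsed)
table(collapsed)
```

fct_collapse() tends to be easier to read when many old levels map to few new ones.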


Dates and Time

In the Tidyverse, the main package to manage dates and time is lubridate.

Reminder

Dates are stored as counts (numbers, usually doubles; see ?double) of days since 1970-01-01.
Date-times are stored as counts (numbers, usually doubles) of seconds since 1970-01-01 00:00:00 UTC.

To get the current date or date-time you can use today() or now():

library(tidyverse)

today()
now()

[1] "2024-02-29"
[1] "2024-02-29 11:16:17 CET"
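
The reminder about dates being counts can be checked directly in base R. A minimal sketch (the object names are illustrative):

```r
# A date nine days after the origin
d <- as.Date("1970-01-10")
days_since_origin <- as.numeric(d)
days_since_origin

# A date-time one minute after the origin (time zone set explicitly)
dt <- as.POSIXct("1970-01-01 00:01:00", tz = "UTC")
seconds_since_origin <- as.numeric(dt)
seconds_since_origin

# Because dates are numbers, differences are simple subtractions
gap_days <- as.numeric(as.Date("2017-01-31") - as.Date("2017-01-01"))
gap_days
```

This is why arithmetic on dates works, and also why a time zone matters for date-times but not for dates.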

Dates and Time - conversion from strings

Dates

ymd("2017-01-31")
mdy("January 31st, 2017")
dmy("31-Jan-2017")

[1] "2017-01-31"
[1] "2017-01-31"
[1] "2017-01-31"

Date-times

ymd_hms("2017-01-31 20:11:59")
mdy_hm("01/31/2017 20:11")
mdy_h("01/31/2017 20")

# Force date-time supplying a timezone
ymd("2017-01-31", tz = "UTC")

[1] "2017-01-31 20:11:59 UTC"
[1] "2017-01-31 20:11:00 UTC"
[1] "2017-01-31 20:00:00 UTC"
[1] "2017-01-31 UTC"
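
When a string cannot be parsed, lubridate returns NA and emits a warning rather than stopping. The sketch below uses a made-up input vector (raw is hypothetical) and silences the warning only to keep the output short; in real work the warning is useful.

```r
library(lubridate)

# Hypothetical input: two valid spellings of the same day, one invalid entry
raw <- c("2017-01-31", "20170131", "not a date")

# suppressWarnings() hides the "failed to parse" warning for this demo only
parsed <- suppressWarnings(ymd(raw))
parsed

# Which entries could not be converted?
is.na(parsed)
```

Checking is.na() on the result right after parsing is a simple way to catch malformed entries early.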

Date <-> Date-time conversion

as_datetime(today()) |> 
  str()
as_date(now()) |> 
  str()

 POSIXct[1:1], format: "2024-02-29"
 Date[1:1], format: "2024-02-29"

Extracting/Changing components

We can extract or modify components from date/date-time objects using:

year()
month()
day()
hour()
minute()
second()
wday() (day of the week)
yday() (day of the year)
week() (week of the year)
quarter() (quarter of the year).

Extract

(today_now <- now())
year(today_now)
month(today_now)
day(today_now)
hour(today_now)
minute(today_now)
second(today_now)
wday(today_now)
yday(today_now)
week(today_now)
quarter(today_now)

[1] "2024-02-29 11:16:17 CET"
[1] 2024
[1] 2
[1] 29
[1] 11
[1] 16
[1] 17.83648
[1] 5
[1] 60
[1] 9
[1] 1

Change

year(today_now)  <- 2020
today_now
month(today_now) <- 12
today_now
day(today_now) <- 30
today_now
hour(today_now) <- 17
today_now
minute(today_now) <- 14
today_now
second(today_now) <- 56
today_now

[1] "2020-02-29 11:16:17 CET"
[1] "2020-12-29 11:16:17 CET"
[1] "2020-12-30 11:16:17 CET"
[1] "2020-12-30 17:16:17 CET"
[1] "2020-12-30 17:14:17 CET"
[1] "2020-12-30 17:14:56 CET"
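
Several components can also be changed in a single call with update(), which returns a modified copy instead of changing the object in place. A short sketch, assuming lubridate is attached (object names are illustrative); it also shows how week_start changes the numbering returned by wday().

```r
library(lubridate)

d <- ymd("2024-02-29")  # a Thursday

# Change year, month, and day of the month at once; d itself is not modified
changed <- update(d, year = 2020, month = 12, mday = 30)
changed

# Day of the week, counting from Sunday (week_start = 7)
wday_sunday_first <- wday(d, week_start = 7)
wday_sunday_first

# Day of the week, counting from Monday (week_start = 1)
wday_monday_first <- wday(d, week_start = 1)
wday_monday_first
```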

Your turn (main: B; bk1: C; bk2: A)

Your turn

Connect to our pad (https://bit.ly/ubep-rws-pad-ed3)
Connect to the Day-4 project in RStudio cloud (https://bit.ly/ubep-rws-rstudio)

…and:

Under the sections 4.2. Ex26 and Ex27 of the pad, write (in a new line) your answers to the questions reported.
Then, open the scripts 19-factors.R and 20-date-time.R and follow the instructions step by step.

15:00

Tip

Factors, created with factor() in base R or with fct() in forcats, are the best way to represent categories in R. The two work similarly, but forcats is more informative and more flexible.
Dates and date-times are counts of days/seconds since 1970-01-01. Managing them in R is not easy, but lubridate makes it easier.

My turn

YOU: Connect to our pad (https://bit.ly/ubep-rws-pad-ed3) and write your questions and doubts there (and tell me if I am too slow or too fast)

ME: Connect to the Day-4 project in RStudio cloud (https://bit.ly/ubep-rws-rstudio): script 14-factors.R

Strings - Regular Expressions

Regular expressions are a powerful tool for matching text patterns. They are used in many programming languages to find and manipulate strings; in the Tidyverse, the stringr package provides the functions to work with them.

Base syntax for regular expressions

. matches any character
* matches the preceding item zero or more times
+ matches the preceding item one or more times
? matches the preceding item zero or one time
^ matches the start of a string
$ matches the end of a string
[] matches any one of the characters inside the square brackets
[^] matches any one character not inside the square brackets
| matches either the pattern on its left or the one on its right
() groups part of a pattern, e.g., to limit the scope of |

Example

Each of the following patterns matches any string that:

a contains a (try str_view("banana", "a"))
^a starts with a
a$ ends with a
^a$ is exactly a (the same single a is both the start and the end)
^a.*a$ starts and ends with a, with any number of characters (even none) in between
^a.+a$ starts and ends with a, with at least one character in between
^a[bc]+a$ starts and ends with a, with one or more b or c characters (and nothing else) in between
^a(b|c)d$ starts with a, followed by either b or c, followed by a final d.
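
The patterns above can be tried on a small made-up vector (x is hypothetical); str_subset() keeps only the strings that match, which makes the differences between the patterns easy to see.

```r
library(stringr)

# Hypothetical test strings
x <- c("a", "aa", "aba", "abca", "abd", "banana")

only_a      <- str_subset(x, "^a$")        # exactly "a"
a_any_a     <- str_subset(x, "^a.*a$")     # a ... a, anything (or nothing) between
a_some_a    <- str_subset(x, "^a.+a$")     # a ... a, at least one character between
a_bc_a      <- str_subset(x, "^a[bc]+a$")  # only b or c between the two a
a_b_or_c_d  <- str_subset(x, "^a(b|c)d$")  # "abd" or "acd"
ends_with_a <- str_subset(x, "a$")

only_a
a_any_a
a_some_a
a_bc_a
a_b_or_c_d
ends_with_a
```

Note that "a" alone does not match ^a.*a$, because that pattern needs two separate a characters.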

Tip

To match special characters literally, you need to escape them with a double backslash (\\) in an R string. I.e., you need to use \\., \\*, \\+, \\?, \\^, \\$, \\[, \\], \\|, \\(, \\).

To match a backslash, you need \\\\.
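
A short sketch of escaping in practice, on a made-up vector (s is hypothetical). It also shows stringr's fixed(), which treats the whole pattern as literal text and so avoids escaping altogether.

```r
library(stringr)

# Hypothetical strings; "a\\b" is the three characters a, backslash, b
s <- c("a.b", "axb", "a\\b")

# Unescaped dot: matches any character, so all three strings match
dot_any <- str_detect(s, "a.b")

# Escaped dot: matches only a literal "."
dot_literal <- str_detect(s, "a\\.b")

# fixed(): the pattern is compared as plain text, no escaping needed
dot_fixed <- str_detect(s, fixed("a.b"))

# Four backslashes in R source match one literal backslash
has_backslash <- str_detect(s, "\\\\")

dot_any
dot_literal
dot_fixed
has_backslash
```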

Strings - `{stringr}`

The stringr package provides a consistent set of functions for working with strings, and it is designed to work consistently with the pipe.

Functions

str_detect(): does a string contain a pattern?
str_which(): which strings match a pattern?
str_subset(): subset of strings that match a pattern
str_sub(): extract a sub-string by position
str_replace(): replace the first match with a replacement
str_replace_all(): replace all matches with a replacement
str_remove(): remove the first match
str_remove_all(): remove all matches
str_split(): split up a string into pieces
str_extract(): extract the first match
str_extract_all(): extract all matches
str_locate(): locate the first match
str_locate_all(): locate all matches
str_count(): count the number of matches
str_length(): the number of characters in a string

Tip

Because all stringr functions start with str_, in RStudio you can type str_ and then press TAB to see all the available functions.

Examples

Detect

library(tidyverse)

x <- c("apple", "banana", "pear")
str_detect(x, "[aeiou]")

[1] TRUE TRUE TRUE

str_which(x, "[aeiou]")

[1] 1 2 3

Subset

str_subset(x, "[aeiou]")

[1] "apple"  "banana" "pear"

str_sub(x, 1, 3)

[1] "app" "ban" "pea"

Replace

str_replace(x, "[aeiou]", "x")

[1] "xpple"  "bxnana" "pxar"

str_replace_all(x, "[aeiou]", "x")

[1] "xpplx"  "bxnxnx" "pxxr"

str_remove(x, "[aeiou]")

[1] "pple"  "bnana" "par"

str_remove_all(x, "[aeiou]")

[1] "ppl" "bnn" "pr"

Split

str_split(x, "[aeiou]")

[[1]]
[1] ""    "ppl" ""   

[[2]]
[1] "b" "n" "n" "" 

[[3]]
[1] "p" ""  "r"

str_extract(x, "[aeiou]")

[1] "a" "a" "e"

str_extract_all(x, "[aeiou]")

[[1]]
[1] "a" "e"

[[2]]
[1] "a" "a" "a"

[[3]]
[1] "e" "a"

Locate

str_locate(x, "[aeiou]")

     start end
[1,]     1   1
[2,]     2   2
[3,]     2   2

str_locate_all(x, "[aeiou]")

[[1]]
     start end
[1,]     1   1
[2,]     5   5

[[2]]
     start end
[1,]     2   2
[2,]     4   4
[3,]     6   6

[[3]]
     start end
[1,]     2   2
[2,]     3   3

Count

str_count(x, "[aeiou]")

[1] 2 3 2

str_length(x)

[1] 5 6 4

Strings - concatenate

str_c: takes any number of vectors as arguments and returns a character vector of the concatenated values.
str_glue: takes a string and interpolates values into it.

library(tidyverse)

tibble(
    x = c("apple", "banana", "pear"),
    y = c("red", "yellow", "green"),
    z = c("round", "long", "round")
  ) |> 
  mutate(
    fruit = str_c(x, y, z),
    fruit_space = str_c(x, y, z, sep = " "),
    fruit_comma = str_c(x, y, z, sep = ", "),
    fruit_glue = str_glue("I like {x}, {y} and {z} fruits")
  )
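
A few behaviours of str_c() and str_glue() are easy to miss, so here is a brief sketch outside a data frame (all object names are illustrative): sep joins the arguments element by element, collapse merges a whole vector into one string, and a missing value makes the result missing.

```r
library(stringr)

# sep: placed between the arguments, element by element
joined <- str_c("a", "b", sep = "-")

# collapse: merges all elements of the result into a single string
collapsed <- str_c(c("apple", "pear"), collapse = ", ")

# Shorter arguments are recycled: one prefix, two results
prefixed <- str_c("fruit: ", c("apple", "pear"))

# A missing value propagates to the result
with_na <- str_c("x", NA)

# str_glue(): expressions inside {} are evaluated and inserted
name <- "pear"
glued <- str_glue("I like {name}s")

joined
collapsed
prefixed
with_na
glued
```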

Your turn (main: C; bk1: A; bk2: B)

Your turn

Connect to our pad (https://bit.ly/ubep-rws-pad-ed3)
Connect to the Day-4 project in RStudio cloud (https://bit.ly/ubep-rws-rstudio)

…and:

Before evaluating it, in the pad, under the section 4.2. Ex8, write (in a new line) your answer: how can you match all file names that are R scripts (i.e., ending with .r or .R)? Report your proposal for a regular expression.
Then, open the script 21-strings.R and follow the instructions step by step.

10:00

Tip

All functions in stringr start with str_, so you can type str_ and then press TAB to see all the available functions.
You can use str_view to see how a regular expression matches a string.
str_glue is a powerful tool to concatenate strings and variables.

My turn

YOU: Connect to our pad (https://bit.ly/ubep-rws-pad-ed3) and write your questions and doubts there (and tell me if I am too slow or too fast)

ME: Connect to the Day-4 project in RStudio cloud (https://bit.ly/ubep-rws-rstudio): script 16-strings.R

Homework

Posit’s RStudio Cloud Workspace

Project: Day-4
Instructions:
- Go to: https://bit.ly/ubep-rws-website
- The text is the Day-4 assessment under the tab “Summative Assessments”.
Script to complete on RStudio: solution.R

Acknowledgment

To create the current lesson, we explored, used, and adapted content from the following resources:

The slides are made using Posit’s Quarto open-source scientific and technical publishing system, powered in R by Yihui Xie’s knitr.

Additional resources

Luis D. Verde Arregoitia Data Cleaning with R

License

This work by Corrado Lanera, Ileana Baldi, and Dario Gregori is licensed under CC BY 4.0

Break

10:00
