30 (+30) min approx
str_* functions to manage stringsUsing strings for categories is not always the best choice. Factors are the best way to represent categories in R.
Define a set of possible values (levels), as a standard character vector.
 [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct"
[11] "Nov" "Dec"
And define a variable as factor, specifying the levels using that.
Base
Tidyverse - (forcats)
{forcats})If we don’t provide explicit levels, the levels are the unique values in the vector, sorted alphabetically in base R, or in the order of appearance in forcats.
Tidyverse - (forcats)
If there are wrong values in the values used to create a factor, they are included as missing (NA) in base R silently, while forcats throws an (informative!) error.
It could be useful to reordering levels, e.g. when plotting information.
We can use forcats::fct_relevel to reorder levels. Its first argument is the factor to reorder, and the following argument is a numeric vector you want to use to reorder the levels.
Tip
Often, the numerical value you use to reorder a factor is another variable in your dataset!
It could be useful to reordering levels, e.g. when plotting information.
We can use forcats::fct_relevel to reorder levels. Its first argument is the factor to reorder, and the following argument is a numeric vector you want to use to reorder the levels.
Tip
Often, the numerical value you use to reorder a factor is another variable in your dataset!
It could be useful to reordering levels, e.g. when plotting information.
We can use forcats::fct_relevel to reorder levels. Its first argument is the factor to reorder, and the following argument is a numeric vector you want to use to reorder the levels.
Tip
Often, the numerical value you use to reorder a factor is another variable in your dataset!
There are also many other useful functions in forcats to reorder levels, e.g., fct_infreq and fct_rev. To see all of them, refer to its website https://forcats.tidyverse.org/.
We can also modify levels, e.g., to change the wording, or to merge some of them together.
gss_cat |>
  mutate(
    partyid = fct_recode(partyid,
      "Republican, strong"    = "Strong republican",
      "Republican, weak"      = "Not str republican",
      "Independent, near rep" = "Ind,near rep",
      "Independent, near dem" = "Ind,near dem",
      "Democrat, weak"        = "Not str democrat",
      "Democrat, strong"      = "Strong democrat",
      "Other"                 = "No answer",
      "Other"                 = "Don't know",
      "Other"                 = "Other party"
    )
  ) |>
  count(partyid)Important
forcats::fct_recode will leave the levels that aren’t explicitly mentioned as is, and will warn you if you accidentally refer to a level that doesn’t exist.
To combine groups, you can assign multiple old levels to the same new level, or… use forcats::fct_collapse!
We can also modify levels, e.g., to change the wording, or to merge some of them together.
Important
forcats::fct_recode will leave the levels that aren’t explicitly mentioned as is, and will warn you if you accidentally refer to a level that doesn’t exist.
To combine groups, you can assign multiple old levels to the same new level, or… use forcats::fct_collapse!
In the Tidyverse, the main package to manage dates and time is lubridate.
Remind
To get the current date or date-time you can use today() or now():
[1] "2017-01-31"
[1] "2017-01-31"
[1] "2017-01-31"
[1] "2017-01-31 20:11:59 UTC"
[1] "2017-01-31 20:11:00 UTC"
[1] "2017-01-31 20:00:00 UTC"
[1] "2017-01-31 UTC"
We can extract or modify components from date/date-time objects using:
year()month()day()hour()minute()second()wday() (day of the week)yday() (day of the year)week() (week of the year)quarter() (quarter of the year).[1] "2024-02-29 11:16:17 CET"
[1] 2024
[1] 2
[1] 29
[1] 11
[1] 16
[1] 17.83648
[1] 5
[1] 60
[1] 9
[1] 1
[1] "2020-02-29 11:16:17 CET"
[1] "2020-12-29 11:16:17 CET"
[1] "2020-12-30 11:16:17 CET"
[1] "2020-12-30 17:16:17 CET"
[1] "2020-12-30 17:14:17 CET"
[1] "2020-12-30 17:14:56 CET"
Your turn
Connect to our pad (https://bit.ly/ubep-rws-pad-ed3)
Connect to the Day-4 project in RStudio cloud (https://bit.ly/ubep-rws-rstudio)
…and:
Under the sections 4.2. Ex26 Ex27 of the pad, write (in a new line) your answer to the questions reported.
Then, open the scripts 19-factors.R and 20-date-time.R and follow the instruction step by step.
15:00
Tip
factors from base R, or forcats::fct from forcats are the best way to represent categories in R. They work similarly, but forcats is more informative and more flexible.
Date and Date-time are counts of days/seconds since 1970-01-01. Managing them in R is not easy, but lubridate makes it easier.
YOU: Connect to our pad (https://bit.ly/ubep-rws-pad-ed3) and write there questions & doubts (and if I am too slow or too fast)
ME: Connect to the Day-4 project in RStudio cloud (https://bit.ly/ubep-rws-rstudio): script 14-factors.R
Regular expressions are a powerful tool for matching text patterns. They are used in many programming languages to find and manipulate strings, and in R are implemented in the stringr package.
Base syntax for regular expressions
. matches any character* matches zero or more times+ matches one or more times? matches zero or one time^ matches the start of a string$ matches the end of a string[] matches any one of the characters inside[^] matches any character not inside the square brackets| matches the pattern either on the left or the right() groups together the pattern on the left and the rightExample
The following match any string that:
a contains a (str_view("banana", "a"): )^a starts with a
a$ ends with a
^a$ starts and ends with a
^a.*a$ starts and ends with a, with any number of characters in between^a.+a$ starts and ends with a, with at least one character in between^a[bc]+a$ starts and ends with a, with at least one b or c in between^a(b|c)d$ starts with a, followed by either b or c, followed by an endingd.Tip
To match special characters, you need to escape them with a double backslash (\\). I.e., you need to use \\., \\*, \\+, \\?, \\^, \\$, \\[, \\], \\|, \\(, \\).
To match a backslash, you need \\\\.
{stringr}
The stringr package provides a consistent set of functions for working with strings, and it is designed to work consistently with the pipe.
Functions
str_detect(): does a string contain a pattern?str_which(): which strings match a pattern?str_subset(): subset of strings that match a patternstr_sub(): extract a sub-string by positionstr_replace(): replace the first match with a replacementstr_replace_all(): replace all matches with a replacementstr_remove(): remove the first matchstr_remove_all(): remove all matchesstr_split(): split up a string into piecesstr_extract(): extract the first matchstr_extract_all(): extract all matchesstr_locate(): locate the first matchstr_locate_all(): locate all matchesstr_count(): count the number of matchesstr_length(): the number of characters in a stringTip
Because all stringr functions start with str_, in RStudio you can type str_ and then pressing TAB to see all its available functions.
Examples
str_c: takes any number of vectors as arguments and returns a character vector of the concatenated values.
str_glue: takes a string and interpolates values into it.
library(tidyverse)
tibble(
    x = c("apple", "banana", "pear"),
    y = c("red", "yellow", "green"),
    z = c("round", "long", "round")
  ) |> 
  mutate(
    fruit = str_c(x, y, z),
    fruit_space = str_c(x, y, z, sep = " "),
    fruit_comma = str_c(x, y, z, sep = ", "),
    fruit_glue = str_glue("I like {x}, {y} and {z} fruits")
  )Your turn
Connect to our pad(https://bit.ly/ubep-rws-pad-ed3)
Connect to the Day-4 project in RStudio cloud (https://bit.ly/ubep-rws-rstudio)
…and:
Before to evaluate it, in the pad, under the section 4.2. Ex8, write (in a new line) how can you match all files names that are R scripts (i.e., ending with .r or .R)? Report you option for a regular expression.
Then, open the script 21-strings.R and follow the instruction step by step.
10:00
Tip
All functions in stringr start with str_, so you can type str_ and then pressing TAB to see all its available functions.
You can use str_view to see how a regular expression matches a string.
str_glue is a powerful tool to concatenate strings and variables.
YOU: Connect to our pad (https://bit.ly/ubep-rws-pad-ed3) and write there questions & doubts (and if I am too slow or too fast)
ME: Connect to the Day-4 project in RStudio cloud (https://bit.ly/ubep-rws-rstudio): script 16-strings.R
solution.R
To create the current lesson, we explored, used, and adapted content from the following resources:
The slides are made using Posit’s Quarto open-source scientific and technical publishing system powered in R by Yihui Xie’s Kintr.
This work by Corrado Lanera, Ileana Baldi, and Dario Gregori is licensed under CC BY 4.0
10:00

UBEP’s R training for supervisors