Day Three:
Transform
Datasets

~30 min

Overview

Questions

  • How to modify (even create) the content of a (possibly new) variable?
  • How to modify or summarize data by group?

Lesson Objectives

To be able to

  • Change content of variable programmatically, possibly using content from other variables.
  • Perform operation on groups of data both obtaining results for each observation (same results for observation within same group) or for each group.

Mutate [side-by-side]

We can add new columns which are calculated from existing ones.

We can use simple algebra

library(tidyverse)
library(here)
library(rio)

db <- here("data-raw", "Copenhagen_clean.xlsx") |> 
  import(setclass = "tibble")

db |> 
  # select just to return few results
  select(id, incubation) |> 
  mutate(
    incubation_days = incubation / 24
  )

Mutate [side-by-side]

We can also add new columns which are calculated from existing ones.

We can use functions on variables

library(tidyverse)

db |> 
  # select just to return few results
  select(id, incubation) |> 
  mutate(
    incubation_norm = (
      incubation - mean(incubation, na.rm = TRUE)
    ) / sd(incubation, na.rm = TRUE) 
  )

Mutate [side-by-side]

We can also add new columns which are calculated from existing ones.

We can use variables just created

library(tidyverse)

db |> 
  # select just to return few results
  select(id, age, group, class, case) |> 
  mutate(
    adult = (age > 18) & (
      (group != "student") |
      is.na(class)
    ),
    adult_case = adult & case
  )

Mutate [side-by-side]

We can also add new columns which are calculated from existing ones.

Pay attention on vectorized Vs. summary functions

library(tidyverse)

sample_df <- tibble(
  x = c(1, 5, 7),
  y = c(3, 2, NA)
)

sample_df |> 
  mutate(
    # rows element-wise
    min_vec = pmin(x, y, na.rm = TRUE),
    max_vec = pmax(x, y, na.rm = TRUE),
    # cols global
    min_all = min(x, y, na.rm = TRUE),
    max_all = max(x, y, na.rm = TRUE),
  )

Warning

  • Summary functions (e.g., min, max):
    • Takes: vectors.
    • Returns: a single value.
  • Vectorized functions (e.g., pmin, pmax):
    • Takes: vectors.
    • Returns: vectors (the same length as the input).

Conditional - Binary: dplyr::if_else [side-by-side]

To mutate a variable accordingly to a binary condition

library(tidyverse)

db |> 
  mutate(
    age_class = if_else(
      age >= 18,
      "adult",
      "child"
    )
  ) |> 
  select(age, age_class)

Important

dplyr::if_else requires compatible types in the output.

Conditional - Subsequent: dplyr::case_when [side]

To mutate a variable accordingly to multiple subsequent conditions

library(tidyverse)

db |> 
  mutate(
    age_class = case_when(
      age >  24 ~ "adult (prof)",
      age >= 18 ~ "adult (stud)",
      age >= 15 ~ "young (stud)",
      TRUE      ~ "child"
    )
  ) |> 
  select(age, age_class)

Important

dplyr::case_when takes condition ~ value pairs. condition must be a logical vector; when it’s TRUE, the valule will be used.

  • If none of the cases match, the output gets an NA.
  • Conditions are considered in order, so you should put the most specific case first!
  • TRUE ~ <dafault_value> is used to specify the “default”/catch all value.

Apply Functions Across Columns - dplyr::across [side]

When working with datasets, it’s often necessary to apply transformations or calculations across multiple columns. The dplyr::across function, used within dplyr::mutate, enables these operations to be performed succinctly and efficiently.

Tip

You can use all of the selectors we have seen before to select the columns to apply the function to. In particular, where is very useful to select the columns to apply the function to!

library(tidyverse)

db |> 
  mutate(
    id = as.character(id),
    class = fct(class),
    across(  # apply a function to selected columns
      where(is.double),  # selector
      scale              # function
    )
  )

Grouping and summarizing

Grouping

With .by argument, we can group rows into groups meaningful for your analysis by one or more variables, and then dplyr::mutate each row group-by-group.

library(tidyverse)

db |> 
  mutate(
    mean_age = age |> 
      mean(na.rm = TRUE),
    n = n(),
    n_teachers = sum(
      group == "teacher",
      na.rm = TRUE
    ),
    .by = class
  ) |> 
  select(
    class, mean_age, n, n_teachers,
    everything()
  )

Summarizing

With .by argument, we can group rows into groups meaningful for your analysis by one or more variables, and then dplyr::summarize to a single row per group.

db |> 
  summarize(
    mean_age = age |> 
      mean(na.rm = TRUE),
    n = n(),
    n_teachers = sum(
      group == "teacher",
      na.rm = TRUE
    ),
    .by = class
  )

Counts

If we want to count the number of rows in each group, we can use simply dplyr::count instead of dplyr::group_by and dplyr::summarize.

db |> 
  count(class)
db |> 
  count(class, group)

Your turn (main: B; bk1: C; bk2: A)

Your turn

  1. Before to try it, in the pad, under the section 3.2. Ex19 write your guess respect the output of using dplyr::mutate assigning the same name of an already existing variable. E.g.
library(tidyverse)

db |> 
  mutate(
  age = age * 365.25
)
  1. Then, open the script 13-mutate.R and follow the instruction step by step.
25:00

Important

As all the other verbs in the Tidyverse, dplyr::mutate

  • It takes a data frame in input, always.
  • It returns a data frame in output, always.
  • It doesn’t change it’s input, never.

My turn

YOU: Connect to our pad (https://bit.ly/ubep-rws-pad-ed3) and write there questions & doubts (and if I am too slow or too fast)

ME: Connect to the Day-3 project in RStudio cloud (https://bit.ly/ubep-rws-rstudio): script 13-transforming.R

Break

10:00

Acknowledgment

To create the current lesson, we explored, used, and adapted content from the following resources:

The slides are made using Posit’s Quarto open-source scientific and technical publishing system powered in R by Yihui Xie’s Kintr.

Additional resources

License

This work by Corrado Lanera, Ileana Baldi, and Dario Gregori is licensed under CC BY 4.0

References