Day Three - Assessment

UBEP’s R training for supervisors

Infrastructure, Import, and Tidying

Learning Objectives

At the end of the assessment, participants should have demonstrated their ability to: (16)

  • Know that exists specialized importing function and how to access their documentation, if in needs to use customized imports. In particular, the ones in the {readr}, and {haven} packages. (7,8)

  • Know how to manage table headers, and missing data at import stage using {janitor} (clean_names, remove_empty ), and {unheadr} (mash_colnames) , {tidyr} (fill).(2,3,9)

  • Convert a dataset to a tidy one using {tidyr} (separate, unite, pivot_longer, pivot_wider).(3)

  • Manage strings using basic regular expression and the package {stringr} .(4)

  • Manage factors using functions in the package {forcats}. (5)

  • Manage dates-time objects using functions in the package {lubridate}. (6)

  • Use the base {dplyr} verbs (i.e., filter, select, mutate), adverbs (i.e., across, where, starts_with, ends_with, contains, matches, all_of, any_of), and constructors (i.e., tibble, and tribble). (1)

Preamble

Instructions

This section is for context only. Nothing has to be done in this section by the participants.

This assessment is the third of four ones regarding the UBEP’s R training for supervisors for the ECDC.

To solve the exercise follow the text and find the assessments in which you need to fill the ___ where required in the code. Next, go to the corresponding section of the R script solution.R and try/run sequentially your code.

All the exercise are presented in a tabset panel with a tab containing all the missing parts, and a tab with the solved, completed (including output) code. This version of the file has the solution exposed.

You can access to a dedicated R/RStudio environment on Posit Cloud at connecting here. You need to create a free account on Posit Cloud, and next accept to join the “R training for supervisors” workspace. Inside that space, you can enter in the “day-3” project, and find inside all the data, script and resources useful to complete the assessment.

The text and examples in the present document are UBEP’s variation/adaptation from the ECDC EPIET Outbreak Investigation, that can be found on GitHub at https://github.com/EPIET/OutbreakInvestigation.(10)

The present work is released under the GPL-3 License.

Data preparation

Objectives

Instructions

This section is for context only. Nothing has to be done in this section by the participants.

At the end of the case study, participants should be able to:

  • Conduct an investigation to identify the source of an outbreak
  • Apply the ten steps of an outbreak investigation
  • Explain the epidemiological and microbiological contributions to foodborne outbreak investigations
  • Perform data cleaning and analysis preparation steps using R
  • Perform descriptive, univariable and stratified analyses using R
  • Critically evaluate the results from statistical and microbiological analyses and identify food vehicles most likely associated with becoming ill
  • Understand the importance of writing outbreak reports (developing an analytical plan)

The Alert

On November 14th 2006 the director of a high school in Greater Copenhagen, Denmark, contacted the regional public health authorities to inform them about an outbreak of diarrhoea and vomiting among participants from a school dinner party held on the 11th of November 2006. Almost all students and teachers of the school (750 people) attended the party.

The first people fell ill the same night and by 14 November, the school had received reports of diarrhoeal illness from around 200 - 300 students and teachers, many of whom also reported vomiting.

Your mission

Your group has been tasked with investigating this outbreak; you have just received the information above.

The epidemiologists in the outbreak team decided to perform a retrospective cohort study in order to identify the food item that was the vehicle of the outbreak. The cohort was defined as students and teachers who had attended the party at the high school on 11th of November 2006.

A questionnaire was designed to conduct a survey on food consumption and on presentation of the illness. Information about the survey and a link to the questionnaire was circulated to students and teachers via the school’s intranet with the request that everyone who attended the school party on 11th of November 2006 should fill in the questionnaire.

Practically all students and teachers check the intranet on a daily basis, because it is the school’s main communication channel for information about courses, homework assignments, cancellation of lessons etc. The school’s intranet was accessible for ill students or teachers from home so that everyone in the cohort could potentially participate and the response rate could be maximised. Additionally, the information about the investigation was also displayed on the screen in the main hall of the school.

These data were then exported from the survey tool and saved as Copenhagen_raw.xlsx.1

Load packages

In this section, you will become familiar with the raw data set provided.

Before to start, open your RStudio project day-3 within the course Posit Cloud Workspace by clicking on it.2 You can find your exercise within the solution.R file in the assessment/ folder. That script reports all the unsolved chunks of R code reported in this very document, including all the once under the “Exercise” tabs. Within the project’s folder there is a folder named data-raw/. In the data-raw/ folder there is our raw file Copenhagen_raw.xlsx.

First of all attach the packages we will use to the current working R session.

library(tidyverse)
library(gtsummary)
library(here)
library(janitor)
library(unheadr)
library(rio)
# linelist_raw <- import(here("Copenhagen_raw.xlsx"))
# linelist_raw <- import(here("data-raw/Copenhagen_raw.xlsx"))
# linelist_raw <- import(here("../data-raw/Copenhagen_raw.xlsx"))
# linelist_raw <- import("data-raw/Copenhagen_raw.xlsx")
# linelist_raw <- import("data-raw/Copenhagen_raw.xlsx")
# linelist_raw <- import("../Copenhagen_raw.xlsx")
# linelist_raw <- read(here("Copenhagen_raw.xlsx"))
# linelist_raw <- read(here("data-raw/Copenhagen_raw.xlsx"))
# linelist_raw <- read(here("../data-raw/Copenhagen_raw.xlsx"))
# linelist_raw <- read("data-raw/Copenhagen_raw.xlsx")
# linelist_raw <- read("data-raw/Copenhagen_raw.xlsx")
# linelist_raw <- read("../Copenhagen_raw.xlsx")

linelist_raw
Code
# import(here("Copenhagen_raw.xlsx"))
linelist_raw <- import(here("data-raw/Copenhagen_raw.xlsx"))
# linelist_raw <- import(here("../data-raw/Copenhagen_raw.xlsx"))
# linelist_raw <- import("data-raw/Copenhagen_raw.xlsx")
# linelist_raw <- import("data-raw/Copenhagen_raw.xlsx")
# linelist_raw <- import("../Copenhagen_raw.xlsx")
# linelist_raw <- read(here("Copenhagen_raw.xlsx"))
# linelist_raw <- read(here("data-raw/Copenhagen_raw.xlsx"))
# linelist_raw <- read(here("../data-raw/Copenhagen_raw.xlsx"))
# linelist_raw <- read("data-raw/Copenhagen_raw.xlsx")
# linelist_raw <- read("data-raw/Copenhagen_raw.xlsx")
# linelist_raw <- read("../Copenhagen_raw.xlsx")

linelist_raw

You should now see that you have a data.frame called linelist_raw in your environment tab. You should also be able to see how many observations and variables it contains.

You can view the data by clicking on the name of it (linelist_raw) in the environment pane, or alternatively, type and execute View(linelist_raw) in the R console.

As you can see, the data are quite messy, especially in the first rows. If you navigate them you can find that:

  • it seams that there are leading headers for each group of columns (in the first of them)
  • the “real” headers seams span across multiple rows (three)
  • the presence of logical column with all visible rows as NA could be highlight the presence of empty lines
  • the sex column seams to report only the a single (the first) occurrence of female and male
Instructions

In this section, participants should fill in ___ the proper instructions to correctly import the data set.

Suggestion: Explore the piped code sequentially, one istruction a time, looking on all the intermediate results step by step.

# read and import the dataset
# use the working instructions used previously
# In the first line, create the path to the data set only
linelist <- ___("___") |>  # here is the path only
  # import the data
  # don't consider the first row (headers' groups) as column names!
  ___(col_names = FALSE) |>
  # mash together the four rows composing the headers
  # merge the groups' headers with the proper headers in a single row
  # (tip.: use a function in the `{unheadr}`)
  ___(
    n_name_rows = ___,  # how many rows has to be mashed?
    keep_names = FALSE,  # do we count/consider current colnames?
    sliding_headers = ___  # are there grouping headers'name (1st row)
  ) |> 
  # remove empty cols (tip.: use a function in the `{janitor}`)
  ___(which = c("___")) |> 
  # clean all the names
  # convert them using standard cammel_case convention
  # (tip.: use a function in the `{janitor}`)
  ___() |> 
  # fill the `sex` column to have a complete data set (remind that we
  # have mashed the groups'heaaders with the columns'names)
  ___(demo_sex) |> 
  # Because of our importing strategy, all columns are pure characters.
  # We can use the `parse_guess` function of the {readr} package to 
  # mutate every columns to a suitable proper type across each one.
  ___(
    ___(everything(), \(x) parse_guess(x, guess_integer = TRUE))
  )



linelist
Code
# read and import the dataset
# use the working instructions used previously
# In the first line, create the path to the data set only
linelist <- here("data-raw/Copenhagen_raw.xlsx") |> 
  # import the data
  # don't consider the first row (headers' groups) as column names
  import(col_names = FALSE) |>
  # mash together the four rows composing the headers
  # merge the groups' headers with the proper headers in a single row
  # (tip.: use a function in the `{unheadr}`)
  mash_colnames(
    n_name_rows = 4,  # how many rows has to be mashed?
    keep_names = FALSE, # do we count/consider current colnames?
    sliding_headers = TRUE  # are there grouping headers'name (1st row)
  ) |>  
  # remove empty columns (tip.: use a function in the `{janitor}`)
  remove_empty(which = c("cols")) |> 
  # clean all the names
  # convert them using standard cammel_case convention
  # (tip.: use a function in the `{janitor}`)
  clean_names() |> 
  # fill the `sex` column to have a complete data set (remind that we
  # have mashed the groups'heaaders with the columns'names)
  fill(demo_sex) |> 
  # Because of our importing strategy, all columns are pure characters.
  # We can use the `parse_guess` function of the {readr} package to 
  # mutate every columns to a suitable proper type across each one.
  mutate(
    across(everything(), \(x) parse_guess(x, guess_integer = TRUE))
  )

linelist
1
Note: If you would like to throw away the groups’ names at the beginning of each variable names, you could ignore the first line when importing the dataset (import(skip = 1, col_names = FALSE)) and remove the sliding_headers option to the mash_colnames call, considering three cols now only (mash_colnames(3, keep_names = FALSE)).

Exploring the data

Before you can begin analysing the data, you will need to become familiar with its contents and check for any errors, or values that don’t make sense.

In addition, it is advisable to consider what format or class each variable (column) should be in. This is something you can include in your analysis plan. For example, if you know you will be creating an epidemic curve of onset dates, you will need to ensure that the onset dates have been correctly interpreted by R as date class on import and are not being read as character strings. The column class or type is particularly important if you plan on performing any calculations on that column.

We will see how to do that, in the next lesson.

Anyway we can have a look at the current summary of column types and values by the skim function of the {skimr} package.

skimr::skim(linelist)
Data summary
Name linelist
Number of rows 397
Number of columns 40
_______________________
Column type frequency:
character 2
numeric 38
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
demo_sex 0 1.00 4 6 0 2 0
time_day_onset 175 0.56 9 9 0 3 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
demo_id 0 1.00 216.55 123.41 1 112 216 320 435 ▇▇▇▇▇
demo_age 0 1.00 18.57 9.92 8 16 17 18 180 ▇▁▁▁▁
demo_group 0 1.00 0.96 0.20 0 1 1 1 1 ▁▁▁▁▇
demo_class 36 0.91 1.94 0.84 1 1 2 3 3 ▇▁▆▁▇
symptoms_diarrhoea 141 0.64 0.82 0.39 0 1 1 1 1 ▂▁▁▁▇
symptoms_bloody 200 0.50 0.03 0.17 0 0 0 0 1 ▇▁▁▁▁
symptoms_vomiting 179 0.55 0.30 0.46 0 0 0 1 1 ▇▁▁▁▃
symptoms_abdo 152 0.62 0.85 0.35 0 1 1 1 1 ▂▁▁▁▇
symptoms_nausea 170 0.57 0.76 0.43 0 1 1 1 1 ▂▁▁▁▇
symptoms_fever 223 0.44 0.26 0.44 0 0 0 1 1 ▇▁▁▁▃
symptoms_headache 174 0.56 0.63 0.48 0 0 1 1 1 ▅▁▁▁▇
symptoms_joint_pain 207 0.48 0.16 0.37 0 0 0 0 1 ▇▁▁▁▂
time_start_hour 177 0.55 12.49 4.92 3 9 9 15 21 ▁▇▁▆▃
food_meal 9 0.98 0.97 0.17 0 1 1 1 1 ▁▁▁▁▇
food_tuna 16 0.96 0.71 0.45 0 0 1 1 1 ▃▁▁▁▇
food_tuna_d 16 0.96 1.32 1.00 0 0 2 2 3 ▆▅▁▇▂
food_shrimps 17 0.96 0.67 0.47 0 0 1 1 1 ▃▁▁▁▇
food_shrimps_d 17 0.96 1.34 1.04 0 0 2 2 3 ▆▂▁▇▂
food_green 30 0.92 0.59 0.49 0 0 1 1 1 ▆▁▁▁▇
food_green_d 30 0.92 1.14 1.05 0 0 1 2 3 ▇▂▁▇▂
food_veal 15 0.96 0.89 0.31 0 1 1 1 1 ▁▁▁▁▇
food_veal_d 14 0.96 1.83 0.91 0 1 2 2 3 ▂▃▁▇▃
food_pasta 15 0.96 0.88 0.32 0 1 1 1 1 ▁▁▁▁▇
food_pasta_d 15 0.96 1.81 0.91 0 1 2 2 3 ▂▃▁▇▃
food_rocket 24 0.94 0.57 0.50 0 0 1 1 1 ▆▁▁▁▇
food_rocket_d 24 0.94 1.08 1.06 0 0 1 2 3 ▇▂▁▆▂
food_sauce 42 0.89 0.42 0.49 0 0 0 1 1 ▇▁▁▁▆
food_sauce_d 42 0.89 0.83 1.06 0 0 0 2 3 ▇▁▁▃▁
food_bread 18 0.95 0.91 0.29 0 1 1 1 1 ▁▁▁▁▇
food_bread_d 18 0.95 1.75 0.71 0 2 2 2 3 ▁▂▁▇▁
food_champagne 25 0.94 0.87 0.34 0 1 1 1 1 ▁▁▁▁▇
food_champagne_d 25 0.94 1.37 0.93 0 1 1 2 3 ▂▇▁▂▃
food_beer 30 0.92 0.78 0.42 0 1 1 1 1 ▂▁▁▁▇
food_beer_d 35 0.91 1.94 1.23 0 1 3 3 3 ▃▂▁▂▇
food_red_wine 50 0.87 0.23 0.42 0 0 0 0 1 ▇▁▁▁▂
food_red_wine_d 52 0.87 0.45 0.92 0 0 0 0 3 ▇▁▁▁▁
food_white_wine 31 0.92 0.73 0.45 0 0 1 1 1 ▃▁▁▁▇
food_white_wine_d 36 0.91 1.58 1.21 0 0 2 3 3 ▆▅▁▅▇
Instructions

In this section, participants should fill in ___ the proper instructions to correctly import the data set.

Suggestion: Explore the piped code sequentially, one istruction a time, looking on all the intermediate results step by step.

Date time variables

The dates are formatted as day, then month (abbreviated character string), then year, in a character vector; we will convert them to a date object.

Most people fell ill on Saturday 12 November (a day after the implicated meal). However, the data set also includes the hour of onset in a variable called time_start_hour. We will combine the date and hour of onset into a date-time variable, which will provide more appropriate units to construct an epicurve with later.

# Update linelist:
linelist <- linelist |> 
  ___(
    time_day_onset = ___(time_day_onset, tz = "UTC"),
    time_datetime_onset = paste(time_day_onset, time_start_hour) |> 
      ymd_h(truncated = 2)
  )

linelist
1
The argument truncated = 2 will result in dates with missing time_start_hour still being converted to date-time, with the missing time being set to 00:00 (midnight). Whether you want to deal with missing time_start_hour in this way or prefer to code these date-times as NA will depend on how you want them to be represented in your analysis.
Code
# Update linelist:
linelist <- linelist |> 
  mutate(
    time_day_onset = dmy(time_day_onset, tz = "UTC"),
    time_datetime_onset = paste(time_day_onset, time_start_hour) |> 
      ymd_h(truncated = 2)
  )

linelist
1
The argument truncated = 2 will result in dates with missing time_start_hour still being converted to date-time, with the missing time being set to 00:00 (midnight). Whether you want to deal with missing time_start_hour in this way or prefer to code these date-times as NA will depend on how you want them to be represented in your analysis.

Recoding values

There is a variable in the data set called group. In this variable, teachers have been encoded as 0, while students were encoded as 1. We can convert the group variable to a factor and give each factor level the appropriate label.

We will convert to factor also the demo_sex variable

linelist <- linelist |> 
  ___(
    demo_sex = factor(demo_sex),
    demo_group = ___(demo_group, labels = c("teacher", "student")),
    demo_class = factor(demo_class)
  )
linelist
Code
linelist <- linelist |> 
  mutate(
    demo_sex = factor(demo_sex),
    demo_group = factor(demo_group, labels = c("teacher", "student")),
    demo_class = factor(demo_class)
  )
linelist

Having a look to combination of demo_age and demo_group (Table 1) we can highlight and fix some typographical errors.3

Table 1: Age vs. Group comparison for the original data.
demo_group Total
teacher student
demo_age


    8 0 1 1
    15 0 11 11
    16 1 99 100
    17 0 115 115
    18 0 112 112
    19 0 39 39
    20 0 3 3
    26 1 0 1
    29 1 0 1
    30 1 0 1
    31 1 0 1
    32 1 0 1
    33 1 0 1
    34 1 0 1
    39 1 0 1
    43 1 0 1
    54 1 0 1
    56 1 0 1
    58 2 0 2
    59 1 0 1
    65 1 0 1
    180 0 1 1
Total 16 381 397
  • There is one teacher aged 16 (likely digit reversal - should be 61)
  • There is one student aged 8 (likely missing a digit - should be 18)
  • There is one student aged 180 (likely has an extra digit - should be 18)

We will fix those values accordingly. (Table 2)

linelist <- linelist |> 
  ___(
    demo_age = case_when(
      demo_age == 16 & demo_group == "teacher" ~ 61, 
      demo_age %in% c(8, 180) & demo_group == "student" ~ 18, 
      TRUE ~ demo_age
    )
  )
linelist
Code
linelist <- linelist |> 
  mutate(
    demo_age = case_when(
      demo_age == 16 & demo_group == "teacher" ~ 61, 
      demo_age %in% c(8, 180) & demo_group == "student" ~ 18, 
      TRUE ~ demo_age
    )
  )
linelist
Table 2: Age vs. Group comparison after fix
demo_group Total
teacher student
demo_age


    15 0 11 11
    16 0 99 99
    17 0 115 115
    18 0 114 114
    19 0 39 39
    20 0 3 3
    26 1 0 1
    29 1 0 1
    30 1 0 1
    31 1 0 1
    32 1 0 1
    33 1 0 1
    34 1 0 1
    39 1 0 1
    43 1 0 1
    54 1 0 1
    56 1 0 1
    58 2 0 2
    59 1 0 1
    61 1 0 1
    65 1 0 1
Total 16 381 397

Variable Class

Most variables are either binary symptoms or food exposures, that have been encoded as 0 (where the symptom was absent or there was no exposure) or 1 (symptom present or person was exposed to the food item). We will convert them to logicals (i.e., TRUE/FALSE).

# Update linelist
linelist <- linelist |>
  ## Please note: the following sequence of three `___` is a one of the
  ## most useful and used standard patters in R managing data in the 
  ## tidyverse
  ___(
    ___(
      ___(\(x) all(x %in% c(0, 1, NA))),
      as.logical
    )
  )
linelist
Code
# Update linelist
linelist <- linelist |>
  mutate(
    across(
      where(\(x) all(x %in% c(0, 1, NA))),
      as.logical
    )
  )
linelist
Instructions

In this section, participants should fill in ___ the proper instructions to correctly import the data set.

Suggestion: Explore the piped code sequentially, one istruction a time, looking on all the intermediate results step by step.

The final step to undertake before proceeding to descriptive analysis, is to create a new column in the data set to hold the case definition. You can call this column case and set it to TRUE if the individual meets the case definition criteria and FALSE if not.

Variables of interest:

Your exposure of interest is the school dinner party held on 11 November 2006 at 18:00. You may have noticed while skimming the data, that there is a binary variable called food_meal. This variable indicates whether people attended the school dinner party and ate a meal there, or not.

Other variables that will be helpful to include in your case definition are time_datetime_onset (hint: check that case onset date/time is after exposure) and symptom variables (hint: not everyone on the linelist fell ill). The symptoms included in the data set are:

  • symptoms_abdo (abdominal pain)
  • symptoms_diarrhoea
  • symptoms_bloody (bloody diarrhoea)
  • symptoms_nausea
  • symptoms_vomiting
  • symptoms_fever
  • symptoms_headache
  • symptoms_jointpain

Defining a case

To demonstrate how this works, we will first construct an example case definition in words:

A case was defined as a person who:

  • attended the school dinner on 11 November 2006 (i.e. is on the linelist)
  • ate a meal at the school dinner (i.e. was exposed)
  • fell ill after the start of the meal
  • fell ill no later than two days after the school dinner
  • suffered from diarrhoea with or without blood, or vomiting

Non cases were defined as people who:

  • attended the school dinner on 11 November 2006 (i.e. are on the linelist)
  • ate a meal at the school dinner (i.e. were exposed)
  • did not fall ill within the time period of interest
  • did not develop diarrhoea (with or without blood) or vomiting

For the sake of the analysis, we exclude any people from the cohort who didn’t eat at the dinner, because we specifically hypothesise a food item to be the vehicle of infection in this outbreak. Excluding people reduces the sample size and therefore the power slightly, but the investigators considered that this would increase specificity.

The variables needed to define this case definition are:

  • food_meal
  • time_datetime_onset
  • symptoms_diarrhoea
  • symptoms_bloody
  • symptoms_vomiting

Case definition

First, we create a column for the date and time of the meal. We can use this column to determine if onset date/times occured before or after the meal. We can limit the case definition to people who had onset date/times after the meal, since we hypothesise that the meal is where exposure to the pathogen occured.

Moreover, for the case definition, we are primarily interested in three symptoms:

  • symptoms_diarrhoea (without blood)
  • symptoms_bloody (diarrhoea with blood)
  • symptoms_vomiting

However, these are not the only symptoms in the data set. When creating the case definition, it would be easier to refer to these three symptoms if there was one column, indicating whether people had those symptoms or not. We can call this column symptoms_gastro.

linelist <- linelist |>  
  ___(
    meal_datetime = ymd_h("2006-11-11 18"),
    symptoms_gastro = (
      symptoms_diarrhoea | symptoms_bloody | symptoms_vomiting
    ) |>
      # Replace missing values with FALSE
      ___(FALSE)
  )
linelist
Code
# Start with linelist:
linelist <- linelist |>  
  mutate(
    meal_datetime = ymd_h("2006-11-11 18"),
    symptoms_gastro = (
      symptoms_diarrhoea | symptoms_bloody | symptoms_vomiting
    ) |>
      # Replace missing values with FALSE
      replace_na(FALSE)
  )
linelist

Note that we have included people who have no data for all three symptoms of interest, and categorised them as FALSE. This means that they will be defined as non-cases in the next step.

Creating the case definition column

Next, we can create a column for the case definition, in which we will define all the respondents as either cases, non-cases or NA.

We can define non-cases as those who attended the meal, but didn’t develop any gastro symptoms, or if they did, developed them before the meal took place.

We can exclude (define as NA) respondents that answered the survey, but did not attend the meal.

linelist <- linelist |>
  rowwise() |> 
  ___(
    # Consider only vars that start with "food" and do not end with "_d"
    ate_anything = c_across(___("food") & !___("_d")) |> 
      reduce(`|`) |>
      # Replace missing values with FALSE
      ___(FALSE)
  ) |> 
  ungroup()
linelist
Code
linelist <- linelist |>
  rowwise() |> 
  mutate(
    # Consider only vars that start with "food" and do not end with "_d"
    ate_anything = c_across(starts_with("food") & !ends_with("_d")) |> 
      reduce(`|`) |>
      # Replace missing values with FALSE
      replace_na(FALSE)
  ) |> 
  ungroup()
linelist

Next, we can check if there is anyone that said they didn’t eat a meal (or skipped that question) but did consume one or more of the food or drink items at the party (Table 3):

Table 3: People eating something without having the meal.
ate_anything Total
FALSE TRUE
food_meal


    FALSE 4 7 11
    TRUE 0 377 377
    Unknown 9 0 9
Total 13 384 397

We can see from this table that 7 respondents said they didn’t have a meal, but did actually consume items on the party meal menu, according to their answers for the food and drink questions.

We can therefore recode the food_meal column for these individuals as TRUE.

linelist <- linelist |> 
  # modify the food_meal column for people who ate something but didn't
  # have a meal
  ___(
    food_meal = if_else(
      !food_meal & ate_anything,
      true = TRUE,
      false = food_meal
    )
  )
linelist
Code
# Start with linelist:
linelist <- linelist |> 
  # modify the food_meal column for people who ate something but didn't
  # have a meal
  mutate(
    food_meal = if_else(
      !food_meal & ate_anything,
      true = TRUE,
      false = food_meal
    )
  )
linelist

Check teh result (Table 4).

Table 4: People eating something without having the meal, after fix.
ate_anything Total
FALSE TRUE
food_meal


    FALSE 4 0 4
    TRUE 0 384 384
    Unknown 9 0 9
Total 13 384 397

Now that these values are corrected, we can proceed with the case definition:

# Start with linelist:
linelist <- linelist |>  
  # Create case definition:
  ___(
    # Define cases as those with any gastro symptoms with onset after
    # the meal:
    case = case_when(
      # Define excluded individuals as those with no record of a meal:
      !food_meal | is.na(food_meal) ~ NA,
  
      # Cases have any gastro symptoms with onset after the meal:
      symptoms_gastro &
        !is.na(time_datetime_onset) &
        (time_datetime_onset >= meal_datetime) 
      ~ TRUE, 
      
      # All the rest (i.e., `TRUE` evaluates as `TRUE` for all the cases
      # that didn't match a previous condition!) marked as FALSE,
      # i.e., non cases have no gastro symptoms but ate a meal at the party:
      # !symptoms_gastro |
      #   (symptoms_gastro & (time_datetime_onset < meal_datetime))
      TRUE ~ FALSE
    ),
  )
linelist
Code
# Start with linelist:
linelist <- linelist |>  
  # Create case definition:
  mutate(
    case = case_when(
      # Define excluded individuals as those with no record of a meal:
      !food_meal | is.na(food_meal) ~ NA,
  
      # Cases have any gastro symptoms with onset after the meal:
      symptoms_gastro & !is.na(time_datetime_onset) & (time_datetime_onset >= meal_datetime) 
      ~ TRUE, 
      
      # All the rest (i.e., `TRUE` evaluates as `TRUE` for all the cases
      # that didn't match a previous condition!) marked as FALSE,
      # i.e., non cases have no gastro symptoms but ate a meal at the party:
      # !symptoms_gastro |
      #   (symptoms_gastro & (time_datetime_onset < meal_datetime))
      TRUE ~ FALSE
    ),
  )
linelist

Incubation times

A suitable incubation period to use in the case definition can be defined by calculating the time between exposure (to the meal) and onset of symptoms, and then looking at the distribution of these time differences. In this outbreak, incubation periods are easy to calculate, because everyone was exposed at (roughly) the same time and on the same day (eating the meal at the school dinner party).

We will create a column for incubation times (date and time of meal subtracted from the date and time of symptom onset).

linelist <- linelist |>  
  # Update incubation to be NA if case is not TRUE:
  ___(
    incubation = if_else(case, time_datetime_onset - meal_datetime, NA)
  )
linelist
Code
linelist <- linelist |>  
  # Update incubation to be NA if case is not TRUE:
  mutate(
    incubation = if_else(case, time_datetime_onset - meal_datetime, NA)
  )
linelist

We can now refine the case definition and limit the maximum incubation period to 48 hours after the meal, as the data points to a fast-acting bacterial toxin or a virus.

We will update the case definition, this time only changing values which were previously defined as a case but had onset of symptoms after 13 November 2006 at 18:00. Respondents that meet this condition will be reclassified as non cases (i.e. case = FALSE).

# Update the case definition to limit to onset three days after meal
linelist <- linelist |> 
  ___(
    case = case & (time_datetime_onset <= (meal_datetime + ___(2)))
  )
linelist
Code
# Update the case definition to limit to onset three days after meal:
linelist <- linelist |> 
  mutate(
    case = case & (time_datetime_onset <= (meal_datetime + days(2)))
  )
linelist

As it happens, there is no change in the number of cases, because none of them had an onset date and time that was more than two days after the meal. We can double-check this by cross-tabulating the onset date/time and case status (Table 5).

Table 5: Cases and non-cases each onset datetime
case Total
FALSE TRUE Unknown
time_datetime_onset



    2006-11-11 21:00:00 0 9 0 9
    2006-11-12 03:00:00 0 10 0 10
    2006-11-12 09:00:00 0 95 2 97
    2006-11-12 15:00:00 1 63 0 64
    2006-11-12 21:00:00 1 25 0 26
    2006-11-13 03:00:00 0 2 0 2
    2006-11-13 09:00:00 0 6 0 6
    2006-11-13 15:00:00 0 6 0 6
    Unknown 166 0 11 177
Total 168 216 13 397

We can see that the onset dates of all cases were from 11 to 13 November 2006 inclusive; this is why case numbers didn’t change when we updated the case definition.

Exclusions

Ultimately, the investigation team decided to remove respondents that did not meet the definition for a case or a non-case from the data set prior to analysis. We can do this easily with the function drop_na():

linelist <- linelist |> 
  # Remove rows where case is NA:
  drop_na(___)
linelist
Code
linelist <- linelist |> 
  # Remove rows where case is NA:
  drop_na(case)
linelist

Descriptive analysis

Instructions

This section is for context only. Nothing has to be done in this section by the participants.

Goals

In this session, we will perform some descriptive analysis. Typically, we would describe the outbreak in terms of time, place and person. In this data set, we don’t have geospatial information, but we do have time (onset dates and times, incubation periods) and person (age, sex, clinical symptoms). You will:

  • Describe cases by age and sex
  • Describe case symptoms
  • Describe and illustrate incubation periods
  • Create an epidemic curve

Many of the descriptive features can be illustrated graphically, for which we will use the package ggplot2.

Instructions

This section is for context only. Nothing has to be done in this section by the participants.

In theory, you create an analysis plan before data collection - it is an important way to ensure you collect all the data you need and that you only collect the data you need (think: data protection!). However, during outbreak investigations, you may need to deploy questionnaires as soon as possible and before a plan has been developed. In these cases, your experience will be an important resource to fall back on.

In your analysis plan you should create a document to include:

  • research question(s) and hypothesis
  • dataset(s) to be used
  • inclusion / exclusion criteria for records you will use in your analysis
  • list of variables that will be used in the main analysis;
    • outcome variable (being a case)
    • main exposure variables (e.g. food consumed)
    • stratifying variables (e.g. age, sex, group, class)
  • statistical methods
  • key mock-up tables including title, labels and expected format for results
    • univariable
    • bivariable
    • stratified
Instructions

This section is for context only. Nothing has to be done in this section by the participants.

Incubation period

The incubation period of a disease is the time from an exposure resulting in infection until the onset of disease. Because infection usually occurs immediately after exposure, the incubation period is generally the duration of the period from the onset of infection to the onset of disease. (12)

In the previous section, we calculated incubation period by subtracting the meal date and time from the date and time of symptom onset. This gave us the incubation period in hours. We can now look at this on a graph:

Plot of incubation period:

incplot <- linelist |>  
  
  # Remove missing values from incubation column:
  ___(___) |> 
  
  # Create an empty ggplot frame:
  ggplot(aes(x = incubation)) +
  
  # Add a histogram of incubation:
  geom_histogram(
    # Set bin widths to 6 hours:
    binwidth = 6
  ) +
  
  # Adapt scale to better fit data
  scale_x_continuous(breaks = seq(0, 48, 6)) + 
  
  # Label x and y axes:
  labs(
    x = "Incubation period in 6-hour bins",
    y = "Number of cases"
  )

# Print plot:
incplot
Code
incplot <- linelist |>  
  
  # Remove missing values from incubation column:
  drop_na(incubation) |> 
  
  # Create an empty ggplot frame:
  ggplot(aes(x = incubation)) +
  
  # Add a histogram of incubation:
  geom_histogram(
    # Set bin widths to 6 hours:
    binwidth = 6
  ) +
  
  # Adapt scale to better fit data
  scale_x_continuous(breaks = seq(0, 48, 6)) + 
  
  # Label x and y axes:
  labs(
    x = "Incubation period in 6-hour bins",
    y = "Number of cases"
  )

# Print plot:
incplot

Epidemic curve

To create an epicurve, we are going to use ggplot2. ggplot2 is a versatile package that helps create beautiful visualizations in R. There are plenty good in-depth guides to the package, for example: The Epidemiologist R Handbook Chapter on ggplot and ggplot2 - Elegant Graphics for Data Analysis.

First, we can create an epicurve for the date of onset, limiting the input data to cases:

# Fetch data:
epicurve_date <- linelist |> 
  
  # Filter for cases where dayonset is not missing:
  ___(case, !is.na(time_day_onset)) |>
  
  # Add dayonset to ggplot aesthetic:
  ggplot(aes(x = time_day_onset)) + 
  
  # Add the proper `geom_*` for a barplot:
  geom_bar() +
  
  # Update x and y axis labels:
  labs(x = "Date of onset", 
       y = "Number of cases") +
  
  # Remove unnecessary grid lines:
  theme_bw()

# Print epicurve:
epicurve_date
1
Note that case is logical, so we can use it as a filter directly (i.e., case == TRUE is exactly the same as case alone).
Code
# Fetch data:
epicurve_date <- linelist |> 
  
  # Filter for cases where dayonset is not missing:
  filter(case, !is.na(time_day_onset)) |>
  
  # Add dayonset to ggplot aesthetic:
  ggplot(aes(x = time_day_onset)) + 
  
  # Add the proper `geom_*` for a barplot:
  geom_bar() +

  # Update x and y axis labels:
  labs(x = "Date of onset", 
       y = "Number of cases") +
  
  # Remove unnecessary grid lines:
  theme_bw()

# Print epicurve:
epicurve_date
1
Note that case is logical, so we can use it as a filter directly (i.e., case == TRUE is exactly the same as case alone).

As you can see, the epicurve is quite crude, because most cases had their disease onset on 12 November. To increase resolution, we could enhance the time information to include both day and hour.

During the data preparation session, we already concatenated the variables time_day_onset and start_hour together and formatted them with the lubridate package to create a new date-time variable called time_datetime_onset. We can now create the epicurve with this variable:

# Fetch data:
epicurve_hour <- linelist |> 
  
  # Filter for cases where dayonset is not missing:
  ___(___, !is.na(time_datetime_onset)) |> 
  
  # Add onset_datetime to ggplot aesthetic:
  ggplot(aes(x = time_datetime_onset)) + 
  
  # Add the proper `geom_*` for a barplot:
  geom_bar() +
  
  # Adjust x axis scales to a suitable unit:
  scale_x_datetime(
    date_breaks = "6 hours", 
    labels = scales::label_date_short()) +
  
  # Update x and y axis labels:
  labs(
    x = "Date and time of onset", 
    y = "Number of cases"
  ) +

  # Remove unnecessary grid lines:
  theme_bw()

# Print epicurve:
epicurve_hour
Code
# Fetch data:
epicurve_hour <- linelist |> 
  
  # Filter for cases where dayonset is not missing:
  filter(case, !is.na(time_datetime_onset)) |> 
  
  # Add onset_datetime to ggplot aesthetic:
  ggplot(aes(x = time_datetime_onset)) + 
  
  # Add the proper `geom_*` for a barplot:
  geom_bar() +
  
  # Adjust x axis scales to a suitable unit:
  scale_x_datetime(
    date_breaks = "6 hours", 
    labels = scales::label_date_short()) +
  
  # Update x and y axis labels:
  labs(
    x = "Date and time of onset", 
    y = "Number of cases"
  ) +

  # Remove unnecessary grid lines:
  theme_bw()

# Print epicurve:
epicurve_hour

Finally, we could compare the epicurve between the sexes and additionally investigate how teachers versus students were distributed.

# Fetch data:
epicurve_strata <- linelist |> 
  
  # Filter for cases where onset_datetime is not missing:
  ___(___, !is.na(time_datetime_onset)) |> 
  
  # Add factor onset_day to ggplot aesthetic:
  ggplot(aes(x = time_datetime_onset)) + 
  
  # Add the proper `geom_*` for an "overall" barplot:
  geom_bar() +
  
  # Superimpose the same `geom_*` for a barplot
  # stratified by sex:
  geom_bar(aes(fill = demo_sex), position = "dodge") +
  
  # Adjust x axis scales to a suitable unit:
  scale_x_datetime(
    date_breaks = "6 hours", 
    labels = scales::label_date_short()) +
  
  # Stratify by group:
  facet_wrap("demo_group", nrow = 2) +
  
  # Update x and y axis labels:
  labs(
    x = "Date and time of onset", 
    y = "Number of cases", 
    fill = "Sex", 
    title = "Epicurve of the outbreak, overall (gray bars) and stratified by sex (coloured bars)",
    subtitle = str_glue(
      "Copenhagen, November 2006, N = {sum(linelist$case)}"
    )
  ) +
  
  # Add theme:
  theme_bw()

# Print epicurve:
epicurve_strata
Code
# Fetch data:
epicurve_strata <- linelist |> 
  
  # Filter for cases where onset_datetime is not missing:
  filter(case, !is.na(time_datetime_onset)) |> 
  
  # Add factor onset_day to ggplot aesthetic:
  ggplot(aes(x = time_datetime_onset)) + 
  
  # Add the proper `geom_*` for an "overall" barplot:
  geom_bar() +
  
  # Superimpose the same `geom_*` for a barplot
  # stratified by sex:
  geom_bar(aes(fill = demo_sex), position = "dodge") +
  
  # Adjust x axis scales to a suitable unit:
  scale_x_datetime(
    date_breaks = "6 hours", 
    labels = scales::label_date_short()) +
  
  # Stratify by group:
  facet_wrap("demo_group", nrow = 2) +
  
  # Update x and y axis labels:
  labs(
    x = "Date and time of onset", 
    y = "Number of cases", 
    fill = "Sex", 
    title = "Epicurve of the outbreak, overall (gray bars) and stratified by sex (coloured bars)",
    subtitle = str_glue(
      "Copenhagen, November 2006, N = {sum(linelist$case)}"
    )
  ) +

  # Add theme:
  theme_bw()

# Print epicurve:
epicurve_strata

In this section, we will describe the outbreak in terms of personal characteristics (age, sex and symptoms of cases).

Age & sex

During the data cleaning exercise, we already looked at the age characteristics of the cohort, since there were some typographic errors that needed to be corrected. However, we have not yet looked at the distribution of cases and non-cases by sex (Table 6).

Table 6: Frequency and percentage of cases stratified by sex.
demo_sex Total
female male
case


    FALSE 102 (60.71%) 66 (39.29%) 168 (100.00%)
    TRUE 117 (54.17%) 99 (45.83%) 216 (100.00%)
Total 219 (57.03%) 165 (42.97%) 384 (100.00%)

Now we can also look at group (whether the respondent is a student or a teacher) (Table 7).

Table 7: Frequency and percentage of cases stratified by group
demo_group Total
teacher student
case


    FALSE 9 (5.36%) 159 (94.64%) 168 (100.00%)
    TRUE 6 (2.78%) 210 (97.22%) 216 (100.00%)
Total 15 (3.91%) 369 (96.09%) 384 (100.00%)

We can see that the majority of respondents are students; this is true for both cases and non-cases.

Lastly, we will look at school class. The respondents belong to three classes (Table 8).

Table 8: Frequency and percentage of cases stratified by class.
demo_class Total
1 2 3 Unknown
case




    FALSE 66 (39.29%) 46 (27.38%) 39 (23.21%) 17 (10.12%) 168 (100.00%)
    TRUE 68 (31.48%) 57 (26.39%) 73 (33.80%) 18 (8.33%) 216 (100.00%)
Total 134 (34.90%) 103 (26.82%) 112 (29.17%) 35 (9.11%) 384 (100.00%)

Symptoms

In the first part of this case study, we selected three symptoms to use in the case definition; these were diarrhoea (with or without blood) and vomiting. However, other symptoms were also recorded in the survey.

It may be useful to determine which symptoms are most common among cases and non-cases (Table 9).

Table 9: Summary statistics of symptoms by case/non-case.
Characteristic Overall, N = 3841 Non-case N = 1681 Case N = 2161 P value2
Diarrhoea 206 (82%) 0 (0%) 206 (97%) <0.001
    Unknown 132 128 4
Dysentary 5 (2.6%) 0 (0%) 5 (3.3%) 0.6
    Unknown 190 126 64
Vomiting 66 (31%) 0 (0%) 66 (38%) <0.001
    Unknown 169 126 43
Abdominal pain 207 (86%) 44 (88%) 163 (85%) 0.6
    Unknown 142 118 24
Nausea 169 (75%) 34 (74%) 135 (76%) 0.8
    Unknown 160 122 38
Fever 44 (26%) 8 (20%) 36 (27%) 0.3
    Unknown 213 128 85
Headache 137 (62%) 33 (75%) 104 (59%) 0.051
    Unknown 164 124 40
Joint pain 29 (15%) 6 (16%) 23 (15%) >0.9
    Unknown 196 130 66
symptoms_gastro 216 (56%) 0 (0%) 216 (100%) <0.001
1 n (%)
2 Pearson’s Chi-squared test; Fisher’s exact test

Looking at the table, do you think the symptoms selected for the case definition were the right ones, or would you change anything? This might be a bit easier to assess if we look at the symptoms in an ordered bar chart. To do this, we will need to reshape the data, so that one column contains symptoms and another column contains the proportion of respondents with a given symptom in each case.

We will reshape the data, and tally up the counts for each symptom, stratified by case definition.

Instructions

In this section, participants should fill in ___ the proper instructions to correctly reshape the data set of symptoms to have intermediate data set composed by three columns: case, symptoms, and value.

To do that, use the proper pivot_* function of hte {tidyr} package.

tbl_symptoms_case <- linelist |> 
  # Select all symptoms columns, i.e., those starting with "symptoms":
  ___(case, ___("symptoms")) |> 
  # Remove rows with missing values:
  ___() |> 
  
  # Reshape the dataset from wide to long
  ___(
    -___, 
    names_to = "___",
    ___ = "value",
    values_drop_na = TRUE
  ) |> 
  # Keep only TRUE `value`s:
  ___(value) |> 
  # Count the number of cases with each symptom:
  count(symptoms, case)

tbl_symptoms_case
Code
tbl_symptoms_case <- linelist |> 
  # Select all symptoms columns, i.e., those starting with "symptoms":
  select(case, starts_with("symptoms")) |> 
  # Remove rows with missing values:
  drop_na() |> 
  
  # Reshape the dataset from wide to long
  pivot_longer(
    -case, 
    names_to = "symptoms",
    values_to = "value",
    values_drop_na = TRUE
  ) |> 
  # Keep only TRUE `value`s:
  filter(value) |> 
  count(symptoms, case)
tbl_symptoms_case
Code
fig_symptoms_case <- tbl_symptoms_case |> 
  ggplot(aes(
    x = fct_infreq(symptoms), 
    y = n,
  )) +
  geom_bar(stat = "identity") +
  labs(
    x = "Symptoms",
    y = "Proportion of respondents"
  ) +
  coord_flip() +
  facet_wrap(
    case ~ .,
    ncol = 2,
    labeller = c(
      "FALSE" = "Non case",
      "TRUE" = "Case"
    ) |> as_labeller()
  )
fig_symptoms_case

References

1.
Wickham H, François R, Henry L, Müller K, Vaughan D. Dplyr: A grammar of data manipulation [Internet]. 2023. Available from: https://dplyr.tidyverse.org
2.
Firke S. Janitor: Simple tools for examining and cleaning dirty data [Internet]. 2023. Available from: https://github.com/sfirke/janitor
3.
Wickham H, Vaughan D, Girlich M. Tidyr: Tidy messy data [Internet]. 2023. Available from: https://tidyr.tidyverse.org
4.
Wickham H. Stringr: Simple, consistent wrappers for common string operations [Internet]. 2023. Available from: https://stringr.tidyverse.org
5.
Wickham H. Forcats: Tools for working with categorical variables (factors) [Internet]. 2023. Available from: https://forcats.tidyverse.org/
6.
Spinu V, Grolemund G, Wickham H. Lubridate: Make dealing with dates a little easier [Internet]. 2023. Available from: https://lubridate.tidyverse.org
7.
Wickham H, Hester J, Bryan J. Readr: Read rectangular text data [Internet]. 2023. Available from: https://readr.tidyverse.org
8.
Wickham H, Miller E, Smith D. Haven: Import and export SPSS, stata and SAS files [Internet]. 2023. Available from: https://haven.tidyverse.org
9.
Verde Arregoitia LD. Unheadr: Handle data with messy header rows and broken values [Internet]. 2022. Available from: https://github.com/luisDVA/unheadr
10.
EPIET/OutbreakInvestigation [Internet]. [cited 2023 Dec 4]. Available from: https://github.com/EPIET/OutbreakInvestigation
11.
EPIET/OutbreakInvestigation [Internet]. Available from: https://github.com/EPIET/OutbreakInvestigation
12.
Rothman KJ, Greenland S, Lash TL. Modern epidemiology [Internet]. Third edition. Philadelphia: Wolters Kluwer Health/Lippincott Williams & Wilkins Philadelphia; 2008. Available from: http://www.r2library.com/public/ResourceDetail.aspx?authCheck=true&resid=2127