library(tidyverse)
library(gtsummary)
library(here)
library(janitor)
library(unheadr)
library(rio)
Day Three - Assessment
UBEP’s R training for supervisors
Infrastructure, Import, and Tidying
Preamble
This section is for context only. Nothing has to be done in this section by the participants.
This assessment is the third of four ones regarding the UBEP’s R training for supervisors for the ECDC.
To solve the exercise follow the text and find the assessments in which you need to fill the ___
where required in the code. Next, go to the corresponding section of the R script solution.R
and try/run sequentially your code.
All the exercise are presented in a tabset panel with a tab containing all the missing parts, and a tab with the solved, completed (including output) code. This version of the file has the solution exposed.
You can access to a dedicated R/RStudio environment on Posit Cloud at connecting here. You need to create a free account on Posit Cloud, and next accept to join the “R training for supervisors” workspace. Inside that space, you can enter in the “day-3” project, and find inside all the data, script and resources useful to complete the assessment.
The text and examples in the present document are UBEP’s variation/adaptation from the ECDC EPIET Outbreak Investigation, that can be found on GitHub at https://github.com/EPIET/OutbreakInvestigation.(10)
The present work is released under the GPL-3 License.
Data preparation
Objectives
This section is for context only. Nothing has to be done in this section by the participants.
At the end of the case study, participants should be able to:
- Conduct an investigation to identify the source of an outbreak
- Apply the ten steps of an outbreak investigation
- Explain the epidemiological and microbiological contributions to foodborne outbreak investigations
- Perform data cleaning and analysis preparation steps using R
- Perform descriptive, univariable and stratified analyses using R
- Critically evaluate the results from statistical and microbiological analyses and identify food vehicles most likely associated with becoming ill
- Understand the importance of writing outbreak reports (developing an analytical plan)
The Alert
On November 14th 2006 the director of a high school in Greater Copenhagen, Denmark, contacted the regional public health authorities to inform them about an outbreak of diarrhoea and vomiting among participants from a school dinner party held on the 11th of November 2006. Almost all students and teachers of the school (750 people) attended the party.
The first people fell ill the same night and by 14 November, the school had received reports of diarrhoeal illness from around 200 - 300 students and teachers, many of whom also reported vomiting.
Your mission
Your group has been tasked with investigating this outbreak; you have just received the information above.
The epidemiologists in the outbreak team decided to perform a retrospective cohort study in order to identify the food item that was the vehicle of the outbreak. The cohort was defined as students and teachers who had attended the party at the high school on 11th of November 2006.
A questionnaire was designed to conduct a survey on food consumption and on presentation of the illness. Information about the survey and a link to the questionnaire was circulated to students and teachers via the school’s intranet with the request that everyone who attended the school party on 11th of November 2006 should fill in the questionnaire.
Practically all students and teachers check the intranet on a daily basis, because it is the school’s main communication channel for information about courses, homework assignments, cancellation of lessons etc. The school’s intranet was accessible for ill students or teachers from home so that everyone in the cohort could potentially participate and the response rate could be maximised. Additionally, the information about the investigation was also displayed on the screen in the main hall of the school.
These data were then exported from the survey tool and saved as Copenhagen_raw.xlsx
.1
Load packages
In this section, you will become familiar with the raw data set provided.
Before to start, open your RStudio project day-3 within the course Posit Cloud Workspace by clicking on it.2 You can find your exercise within the solution.R
file in the assessment/
folder. That script reports all the unsolved chunks of R code reported in this very document, including all the once under the “Exercise” tabs. Within the project’s folder there is a folder named data-raw/
. In the data-raw/
folder there is our raw file Copenhagen_raw.xlsx
.
First of all attach the packages we will use to the current working R session.
# linelist_raw <- import(here("Copenhagen_raw.xlsx"))
# linelist_raw <- import(here("data-raw/Copenhagen_raw.xlsx"))
# linelist_raw <- import(here("../data-raw/Copenhagen_raw.xlsx"))
# linelist_raw <- import("data-raw/Copenhagen_raw.xlsx")
# linelist_raw <- import("data-raw/Copenhagen_raw.xlsx")
# linelist_raw <- import("../Copenhagen_raw.xlsx")
# linelist_raw <- read(here("Copenhagen_raw.xlsx"))
# linelist_raw <- read(here("data-raw/Copenhagen_raw.xlsx"))
# linelist_raw <- read(here("../data-raw/Copenhagen_raw.xlsx"))
# linelist_raw <- read("data-raw/Copenhagen_raw.xlsx")
# linelist_raw <- read("data-raw/Copenhagen_raw.xlsx")
# linelist_raw <- read("../Copenhagen_raw.xlsx")
linelist_raw
Code
# import(here("Copenhagen_raw.xlsx"))
<- import(here("data-raw/Copenhagen_raw.xlsx"))
linelist_raw # linelist_raw <- import(here("../data-raw/Copenhagen_raw.xlsx"))
# linelist_raw <- import("data-raw/Copenhagen_raw.xlsx")
# linelist_raw <- import("data-raw/Copenhagen_raw.xlsx")
# linelist_raw <- import("../Copenhagen_raw.xlsx")
# linelist_raw <- read(here("Copenhagen_raw.xlsx"))
# linelist_raw <- read(here("data-raw/Copenhagen_raw.xlsx"))
# linelist_raw <- read(here("../data-raw/Copenhagen_raw.xlsx"))
# linelist_raw <- read("data-raw/Copenhagen_raw.xlsx")
# linelist_raw <- read("data-raw/Copenhagen_raw.xlsx")
# linelist_raw <- read("../Copenhagen_raw.xlsx")
linelist_raw
You should now see that you have a data.frame
called linelist_raw
in your environment tab. You should also be able to see how many observations and variables it contains.
You can view the data by clicking on the name of it (linelist_raw
) in the environment pane, or alternatively, type and execute View(linelist_raw)
in the R console.
As you can see, the data are quite messy, especially in the first rows. If you navigate them you can find that:
- it seams that there are leading headers for each group of columns (in the first of them)
- the “real” headers seams span across multiple rows (three)
- the presence of logical column with all visible rows as NA could be highlight the presence of empty lines
- the sex column seams to report only the a single (the first) occurrence of
female
andmale
In this section, participants should fill in ___
the proper instructions to correctly import the data set.
Suggestion: Explore the piped code sequentially, one istruction a time, looking on all the intermediate results step by step.
# read and import the dataset
# use the working instructions used previously
# In the first line, create the path to the data set only
<- ___("___") |> # here is the path only
linelist # import the data
# don't consider the first row (headers' groups) as column names!
___(col_names = FALSE) |>
# mash together the four rows composing the headers
# merge the groups' headers with the proper headers in a single row
# (tip.: use a function in the `{unheadr}`)
___(
n_name_rows = ___, # how many rows has to be mashed?
keep_names = FALSE, # do we count/consider current colnames?
sliding_headers = ___ # are there grouping headers'name (1st row)
|>
) # remove empty cols (tip.: use a function in the `{janitor}`)
___(which = c("___")) |>
# clean all the names
# convert them using standard cammel_case convention
# (tip.: use a function in the `{janitor}`)
___() |>
# fill the `sex` column to have a complete data set (remind that we
# have mashed the groups'heaaders with the columns'names)
___(demo_sex) |>
# Because of our importing strategy, all columns are pure characters.
# We can use the `parse_guess` function of the {readr} package to
# mutate every columns to a suitable proper type across each one.
___(
___(everything(), \(x) parse_guess(x, guess_integer = TRUE))
)
linelist
Code
# read and import the dataset
# use the working instructions used previously
# In the first line, create the path to the data set only
<- here("data-raw/Copenhagen_raw.xlsx") |>
linelist # import the data
# don't consider the first row (headers' groups) as column names
import(col_names = FALSE) |>
# mash together the four rows composing the headers
# merge the groups' headers with the proper headers in a single row
# (tip.: use a function in the `{unheadr}`)
mash_colnames(
n_name_rows = 4, # how many rows has to be mashed?
keep_names = FALSE, # do we count/consider current colnames?
sliding_headers = TRUE # are there grouping headers'name (1st row)
|>
) # remove empty columns (tip.: use a function in the `{janitor}`)
remove_empty(which = c("cols")) |>
# clean all the names
# convert them using standard cammel_case convention
# (tip.: use a function in the `{janitor}`)
clean_names() |>
# fill the `sex` column to have a complete data set (remind that we
# have mashed the groups'heaaders with the columns'names)
fill(demo_sex) |>
# Because of our importing strategy, all columns are pure characters.
# We can use the `parse_guess` function of the {readr} package to
# mutate every columns to a suitable proper type across each one.
mutate(
across(everything(), \(x) parse_guess(x, guess_integer = TRUE))
)
linelist
- 1
-
Note: If you would like to throw away the groups’ names at the beginning of each variable names, you could ignore the first line when importing the dataset (
import(skip = 1, col_names = FALSE)
) and remove thesliding_headers
option to themash_colnames
call, considering three cols now only (mash_colnames(3, keep_names = FALSE)
).
Exploring the data
Before you can begin analysing the data, you will need to become familiar with its contents and check for any errors, or values that don’t make sense.
In addition, it is advisable to consider what format or class each variable (column) should be in. This is something you can include in your analysis plan. For example, if you know you will be creating an epidemic curve of onset dates, you will need to ensure that the onset dates have been correctly interpreted by R as date
class on import and are not being read as character strings. The column class or type is particularly important if you plan on performing any calculations on that column.
We will see how to do that, in the next lesson.
Anyway we can have a look at the current summary of column types and values by the skim
function of the {skimr}
package.
::skim(linelist) skimr
Name | linelist |
Number of rows | 397 |
Number of columns | 40 |
_______________________ | |
Column type frequency: | |
character | 2 |
numeric | 38 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
demo_sex | 0 | 1.00 | 4 | 6 | 0 | 2 | 0 |
time_day_onset | 175 | 0.56 | 9 | 9 | 0 | 3 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
demo_id | 0 | 1.00 | 216.55 | 123.41 | 1 | 112 | 216 | 320 | 435 | ▇▇▇▇▇ |
demo_age | 0 | 1.00 | 18.57 | 9.92 | 8 | 16 | 17 | 18 | 180 | ▇▁▁▁▁ |
demo_group | 0 | 1.00 | 0.96 | 0.20 | 0 | 1 | 1 | 1 | 1 | ▁▁▁▁▇ |
demo_class | 36 | 0.91 | 1.94 | 0.84 | 1 | 1 | 2 | 3 | 3 | ▇▁▆▁▇ |
symptoms_diarrhoea | 141 | 0.64 | 0.82 | 0.39 | 0 | 1 | 1 | 1 | 1 | ▂▁▁▁▇ |
symptoms_bloody | 200 | 0.50 | 0.03 | 0.17 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁ |
symptoms_vomiting | 179 | 0.55 | 0.30 | 0.46 | 0 | 0 | 0 | 1 | 1 | ▇▁▁▁▃ |
symptoms_abdo | 152 | 0.62 | 0.85 | 0.35 | 0 | 1 | 1 | 1 | 1 | ▂▁▁▁▇ |
symptoms_nausea | 170 | 0.57 | 0.76 | 0.43 | 0 | 1 | 1 | 1 | 1 | ▂▁▁▁▇ |
symptoms_fever | 223 | 0.44 | 0.26 | 0.44 | 0 | 0 | 0 | 1 | 1 | ▇▁▁▁▃ |
symptoms_headache | 174 | 0.56 | 0.63 | 0.48 | 0 | 0 | 1 | 1 | 1 | ▅▁▁▁▇ |
symptoms_joint_pain | 207 | 0.48 | 0.16 | 0.37 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▂ |
time_start_hour | 177 | 0.55 | 12.49 | 4.92 | 3 | 9 | 9 | 15 | 21 | ▁▇▁▆▃ |
food_meal | 9 | 0.98 | 0.97 | 0.17 | 0 | 1 | 1 | 1 | 1 | ▁▁▁▁▇ |
food_tuna | 16 | 0.96 | 0.71 | 0.45 | 0 | 0 | 1 | 1 | 1 | ▃▁▁▁▇ |
food_tuna_d | 16 | 0.96 | 1.32 | 1.00 | 0 | 0 | 2 | 2 | 3 | ▆▅▁▇▂ |
food_shrimps | 17 | 0.96 | 0.67 | 0.47 | 0 | 0 | 1 | 1 | 1 | ▃▁▁▁▇ |
food_shrimps_d | 17 | 0.96 | 1.34 | 1.04 | 0 | 0 | 2 | 2 | 3 | ▆▂▁▇▂ |
food_green | 30 | 0.92 | 0.59 | 0.49 | 0 | 0 | 1 | 1 | 1 | ▆▁▁▁▇ |
food_green_d | 30 | 0.92 | 1.14 | 1.05 | 0 | 0 | 1 | 2 | 3 | ▇▂▁▇▂ |
food_veal | 15 | 0.96 | 0.89 | 0.31 | 0 | 1 | 1 | 1 | 1 | ▁▁▁▁▇ |
food_veal_d | 14 | 0.96 | 1.83 | 0.91 | 0 | 1 | 2 | 2 | 3 | ▂▃▁▇▃ |
food_pasta | 15 | 0.96 | 0.88 | 0.32 | 0 | 1 | 1 | 1 | 1 | ▁▁▁▁▇ |
food_pasta_d | 15 | 0.96 | 1.81 | 0.91 | 0 | 1 | 2 | 2 | 3 | ▂▃▁▇▃ |
food_rocket | 24 | 0.94 | 0.57 | 0.50 | 0 | 0 | 1 | 1 | 1 | ▆▁▁▁▇ |
food_rocket_d | 24 | 0.94 | 1.08 | 1.06 | 0 | 0 | 1 | 2 | 3 | ▇▂▁▆▂ |
food_sauce | 42 | 0.89 | 0.42 | 0.49 | 0 | 0 | 0 | 1 | 1 | ▇▁▁▁▆ |
food_sauce_d | 42 | 0.89 | 0.83 | 1.06 | 0 | 0 | 0 | 2 | 3 | ▇▁▁▃▁ |
food_bread | 18 | 0.95 | 0.91 | 0.29 | 0 | 1 | 1 | 1 | 1 | ▁▁▁▁▇ |
food_bread_d | 18 | 0.95 | 1.75 | 0.71 | 0 | 2 | 2 | 2 | 3 | ▁▂▁▇▁ |
food_champagne | 25 | 0.94 | 0.87 | 0.34 | 0 | 1 | 1 | 1 | 1 | ▁▁▁▁▇ |
food_champagne_d | 25 | 0.94 | 1.37 | 0.93 | 0 | 1 | 1 | 2 | 3 | ▂▇▁▂▃ |
food_beer | 30 | 0.92 | 0.78 | 0.42 | 0 | 1 | 1 | 1 | 1 | ▂▁▁▁▇ |
food_beer_d | 35 | 0.91 | 1.94 | 1.23 | 0 | 1 | 3 | 3 | 3 | ▃▂▁▂▇ |
food_red_wine | 50 | 0.87 | 0.23 | 0.42 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▂ |
food_red_wine_d | 52 | 0.87 | 0.45 | 0.92 | 0 | 0 | 0 | 0 | 3 | ▇▁▁▁▁ |
food_white_wine | 31 | 0.92 | 0.73 | 0.45 | 0 | 0 | 1 | 1 | 1 | ▃▁▁▁▇ |
food_white_wine_d | 36 | 0.91 | 1.58 | 1.21 | 0 | 0 | 2 | 3 | 3 | ▆▅▁▅▇ |
In this section, participants should fill in ___
the proper instructions to correctly import the data set.
Suggestion: Explore the piped code sequentially, one istruction a time, looking on all the intermediate results step by step.
Date time variables
The dates are formatted as day, then month (abbreviated character string), then year, in a character vector; we will convert them to a date object.
Most people fell ill on Saturday 12 November (a day after the implicated meal). However, the data set also includes the hour of onset in a variable called time_start_hour
. We will combine the date and hour of onset into a date-time variable, which will provide more appropriate units to construct an epicurve with later.
# Update linelist:
<- linelist |>
linelist ___(
time_day_onset = ___(time_day_onset, tz = "UTC"),
time_datetime_onset = paste(time_day_onset, time_start_hour) |>
ymd_h(truncated = 2)
)
linelist
- 1
-
The argument
truncated = 2
will result in dates with missingtime_start_hour
still being converted to date-time, with the missing time being set to00:00
(midnight). Whether you want to deal with missingtime_start_hour
in this way or prefer to code these date-times asNA
will depend on how you want them to be represented in your analysis.
Code
# Update linelist:
<- linelist |>
linelist mutate(
time_day_onset = dmy(time_day_onset, tz = "UTC"),
time_datetime_onset = paste(time_day_onset, time_start_hour) |>
ymd_h(truncated = 2)
)
linelist
- 1
-
The argument
truncated = 2
will result in dates with missingtime_start_hour
still being converted to date-time, with the missing time being set to00:00
(midnight). Whether you want to deal with missingtime_start_hour
in this way or prefer to code these date-times asNA
will depend on how you want them to be represented in your analysis.
Recoding values
There is a variable in the data set called group
. In this variable, teachers have been encoded as 0
, while students were encoded as 1
. We can convert the group
variable to a factor and give each factor level the appropriate label.
We will convert to factor also the demo_sex
variable
<- linelist |>
linelist ___(
demo_sex = factor(demo_sex),
demo_group = ___(demo_group, labels = c("teacher", "student")),
demo_class = factor(demo_class)
) linelist
Code
<- linelist |>
linelist mutate(
demo_sex = factor(demo_sex),
demo_group = factor(demo_group, labels = c("teacher", "student")),
demo_class = factor(demo_class)
) linelist
Having a look to combination of demo_age
and demo_group
(Table 1) we can highlight and fix some typographical errors.3
demo_group | Total | ||
---|---|---|---|
teacher | student | ||
demo_age | |||
8 | 0 | 1 | 1 |
15 | 0 | 11 | 11 |
16 | 1 | 99 | 100 |
17 | 0 | 115 | 115 |
18 | 0 | 112 | 112 |
19 | 0 | 39 | 39 |
20 | 0 | 3 | 3 |
26 | 1 | 0 | 1 |
29 | 1 | 0 | 1 |
30 | 1 | 0 | 1 |
31 | 1 | 0 | 1 |
32 | 1 | 0 | 1 |
33 | 1 | 0 | 1 |
34 | 1 | 0 | 1 |
39 | 1 | 0 | 1 |
43 | 1 | 0 | 1 |
54 | 1 | 0 | 1 |
56 | 1 | 0 | 1 |
58 | 2 | 0 | 2 |
59 | 1 | 0 | 1 |
65 | 1 | 0 | 1 |
180 | 0 | 1 | 1 |
Total | 16 | 381 | 397 |
- There is one teacher aged 16 (likely digit reversal - should be 61)
- There is one student aged 8 (likely missing a digit - should be 18)
- There is one student aged 180 (likely has an extra digit - should be 18)
We will fix those values accordingly. (Table 2)
<- linelist |>
linelist ___(
demo_age = case_when(
== 16 & demo_group == "teacher" ~ 61,
demo_age %in% c(8, 180) & demo_group == "student" ~ 18,
demo_age TRUE ~ demo_age
)
) linelist
Code
<- linelist |>
linelist mutate(
demo_age = case_when(
== 16 & demo_group == "teacher" ~ 61,
demo_age %in% c(8, 180) & demo_group == "student" ~ 18,
demo_age TRUE ~ demo_age
)
) linelist
demo_group | Total | ||
---|---|---|---|
teacher | student | ||
demo_age | |||
15 | 0 | 11 | 11 |
16 | 0 | 99 | 99 |
17 | 0 | 115 | 115 |
18 | 0 | 114 | 114 |
19 | 0 | 39 | 39 |
20 | 0 | 3 | 3 |
26 | 1 | 0 | 1 |
29 | 1 | 0 | 1 |
30 | 1 | 0 | 1 |
31 | 1 | 0 | 1 |
32 | 1 | 0 | 1 |
33 | 1 | 0 | 1 |
34 | 1 | 0 | 1 |
39 | 1 | 0 | 1 |
43 | 1 | 0 | 1 |
54 | 1 | 0 | 1 |
56 | 1 | 0 | 1 |
58 | 2 | 0 | 2 |
59 | 1 | 0 | 1 |
61 | 1 | 0 | 1 |
65 | 1 | 0 | 1 |
Total | 16 | 381 | 397 |
Variable Class
Most variables are either binary symptoms or food exposures, that have been encoded as 0 (where the symptom was absent or there was no exposure) or 1 (symptom present or person was exposed to the food item). We will convert them to logicals (i.e., TRUE
/FALSE
).
# Update linelist
<- linelist |>
linelist ## Please note: the following sequence of three `___` is a one of the
## most useful and used standard patters in R managing data in the
## tidyverse
___(
___(
___(\(x) all(x %in% c(0, 1, NA))),
as.logical
)
) linelist
Code
# Update linelist
<- linelist |>
linelist mutate(
across(
where(\(x) all(x %in% c(0, 1, NA))),
as.logical
)
) linelist
In this section, participants should fill in ___
the proper instructions to correctly import the data set.
Suggestion: Explore the piped code sequentially, one istruction a time, looking on all the intermediate results step by step.
The final step to undertake before proceeding to descriptive analysis, is to create a new column in the data set to hold the case definition. You can call this column case
and set it to TRUE
if the individual meets the case definition criteria and FALSE
if not.
Variables of interest:
Your exposure of interest is the school dinner party held on 11 November 2006 at 18:00. You may have noticed while skimming the data, that there is a binary variable called food_meal
. This variable indicates whether people attended the school dinner party and ate a meal there, or not.
Other variables that will be helpful to include in your case definition are time_datetime_onset
(hint: check that case onset date/time is after exposure) and symptom variables (hint: not everyone on the linelist fell ill). The symptoms included in the data set are:
symptoms_abdo
(abdominal pain)symptoms_diarrhoea
symptoms_bloody
(bloody diarrhoea)symptoms_nausea
symptoms_vomiting
symptoms_fever
symptoms_headache
symptoms_jointpain
Defining a case
To demonstrate how this works, we will first construct an example case definition in words:
A case was defined as a person who:
- attended the school dinner on 11 November 2006 (i.e. is on the linelist)
- ate a meal at the school dinner (i.e. was exposed)
- fell ill after the start of the meal
- fell ill no later than two days after the school dinner
- suffered from diarrhoea with or without blood, or vomiting
Non cases were defined as people who:
- attended the school dinner on 11 November 2006 (i.e. are on the linelist)
- ate a meal at the school dinner (i.e. were exposed)
- did not fall ill within the time period of interest
- did not develop diarrhoea (with or without blood) or vomiting
For the sake of the analysis, we exclude any people from the cohort who didn’t eat at the dinner, because we specifically hypothesise a food item to be the vehicle of infection in this outbreak. Excluding people reduces the sample size and therefore the power slightly, but the investigators considered that this would increase specificity.
The variables needed to define this case definition are:
food_meal
time_datetime_onset
symptoms_diarrhoea
symptoms_bloody
symptoms_vomiting
Case definition
First, we create a column for the date and time of the meal. We can use this column to determine if onset date/times occured before or after the meal. We can limit the case definition to people who had onset date/times after the meal, since we hypothesise that the meal is where exposure to the pathogen occured.
Moreover, for the case definition, we are primarily interested in three symptoms:
symptoms_diarrhoea
(without blood)symptoms_bloody
(diarrhoea with blood)symptoms_vomiting
However, these are not the only symptoms in the data set. When creating the case definition, it would be easier to refer to these three symptoms if there was one column, indicating whether people had those symptoms or not. We can call this column symptoms_gastro
.
<- linelist |>
linelist ___(
meal_datetime = ymd_h("2006-11-11 18"),
symptoms_gastro = (
| symptoms_bloody | symptoms_vomiting
symptoms_diarrhoea |>
) # Replace missing values with FALSE
___(FALSE)
) linelist
Code
# Start with linelist:
<- linelist |>
linelist mutate(
meal_datetime = ymd_h("2006-11-11 18"),
symptoms_gastro = (
| symptoms_bloody | symptoms_vomiting
symptoms_diarrhoea |>
) # Replace missing values with FALSE
replace_na(FALSE)
) linelist
Note that we have included people who have no data for all three symptoms of interest, and categorised them as
FALSE
. This means that they will be defined as non-cases in the next step.
Creating the case definition column
Next, we can create a column for the case definition, in which we will define all the respondents as either cases, non-cases or NA
.
We can define non-cases as those who attended the meal, but didn’t develop any gastro symptoms, or if they did, developed them before the meal took place.
We can exclude (define as NA
) respondents that answered the survey, but did not attend the meal.
<- linelist |>
linelist rowwise() |>
___(
# Consider only vars that start with "food" and do not end with "_d"
ate_anything = c_across(___("food") & !___("_d")) |>
reduce(`|`) |>
# Replace missing values with FALSE
___(FALSE)
|>
) ungroup()
linelist
Code
<- linelist |>
linelist rowwise() |>
mutate(
# Consider only vars that start with "food" and do not end with "_d"
ate_anything = c_across(starts_with("food") & !ends_with("_d")) |>
reduce(`|`) |>
# Replace missing values with FALSE
replace_na(FALSE)
|>
) ungroup()
linelist
Next, we can check if there is anyone that said they didn’t eat a meal (or skipped that question) but did consume one or more of the food or drink items at the party (Table 3):
ate_anything | Total | ||
---|---|---|---|
FALSE | TRUE | ||
food_meal | |||
FALSE | 4 | 7 | 11 |
TRUE | 0 | 377 | 377 |
Unknown | 9 | 0 | 9 |
Total | 13 | 384 | 397 |
We can see from this table that 7 respondents said they didn’t have a meal, but did actually consume items on the party meal menu, according to their answers for the food and drink questions.
We can therefore recode the food_meal
column for these individuals as TRUE
.
<- linelist |>
linelist # modify the food_meal column for people who ate something but didn't
# have a meal
___(
food_meal = if_else(
!food_meal & ate_anything,
true = TRUE,
false = food_meal
)
) linelist
Code
# Start with linelist:
<- linelist |>
linelist # modify the food_meal column for people who ate something but didn't
# have a meal
mutate(
food_meal = if_else(
!food_meal & ate_anything,
true = TRUE,
false = food_meal
)
) linelist
Check teh result (Table 4).
ate_anything | Total | ||
---|---|---|---|
FALSE | TRUE | ||
food_meal | |||
FALSE | 4 | 0 | 4 |
TRUE | 0 | 384 | 384 |
Unknown | 9 | 0 | 9 |
Total | 13 | 384 | 397 |
Now that these values are corrected, we can proceed with the case definition:
# Start with linelist:
<- linelist |>
linelist # Create case definition:
___(
# Define cases as those with any gastro symptoms with onset after
# the meal:
case = case_when(
# Define excluded individuals as those with no record of a meal:
!food_meal | is.na(food_meal) ~ NA,
# Cases have any gastro symptoms with onset after the meal:
&
symptoms_gastro !is.na(time_datetime_onset) &
>= meal_datetime)
(time_datetime_onset ~ TRUE,
# All the rest (i.e., `TRUE` evaluates as `TRUE` for all the cases
# that didn't match a previous condition!) marked as FALSE,
# i.e., non cases have no gastro symptoms but ate a meal at the party:
# !symptoms_gastro |
# (symptoms_gastro & (time_datetime_onset < meal_datetime))
TRUE ~ FALSE
),
) linelist
Code
# Start with linelist:
<- linelist |>
linelist # Create case definition:
mutate(
case = case_when(
# Define excluded individuals as those with no record of a meal:
!food_meal | is.na(food_meal) ~ NA,
# Cases have any gastro symptoms with onset after the meal:
& !is.na(time_datetime_onset) & (time_datetime_onset >= meal_datetime)
symptoms_gastro ~ TRUE,
# All the rest (i.e., `TRUE` evaluates as `TRUE` for all the cases
# that didn't match a previous condition!) marked as FALSE,
# i.e., non cases have no gastro symptoms but ate a meal at the party:
# !symptoms_gastro |
# (symptoms_gastro & (time_datetime_onset < meal_datetime))
TRUE ~ FALSE
),
) linelist
Incubation times
A suitable incubation period to use in the case definition can be defined by calculating the time between exposure (to the meal) and onset of symptoms, and then looking at the distribution of these time differences. In this outbreak, incubation periods are easy to calculate, because everyone was exposed at (roughly) the same time and on the same day (eating the meal at the school dinner party).
We will create a column for incubation times (date and time of meal subtracted from the date and time of symptom onset).
<- linelist |>
linelist # Update incubation to be NA if case is not TRUE:
___(
incubation = if_else(case, time_datetime_onset - meal_datetime, NA)
) linelist
Code
<- linelist |>
linelist # Update incubation to be NA if case is not TRUE:
mutate(
incubation = if_else(case, time_datetime_onset - meal_datetime, NA)
) linelist
We can now refine the case definition and limit the maximum incubation period to 48 hours after the meal, as the data points to a fast-acting bacterial toxin or a virus.
We will update the case definition, this time only changing values which were previously defined as a case but had onset of symptoms after 13 November 2006 at 18:00. Respondents that meet this condition will be reclassified as non cases (i.e. case = FALSE
).
# Update the case definition to limit to onset three days after meal
<- linelist |>
linelist ___(
case = case & (time_datetime_onset <= (meal_datetime + ___(2)))
) linelist
Code
# Update the case definition to limit to onset three days after meal:
<- linelist |>
linelist mutate(
case = case & (time_datetime_onset <= (meal_datetime + days(2)))
) linelist
As it happens, there is no change in the number of cases, because none of them had an onset date and time that was more than two days after the meal. We can double-check this by cross-tabulating the onset date/time and case status (Table 5).
case | Total | |||
---|---|---|---|---|
FALSE | TRUE | Unknown | ||
time_datetime_onset | ||||
2006-11-11 21:00:00 | 0 | 9 | 0 | 9 |
2006-11-12 03:00:00 | 0 | 10 | 0 | 10 |
2006-11-12 09:00:00 | 0 | 95 | 2 | 97 |
2006-11-12 15:00:00 | 1 | 63 | 0 | 64 |
2006-11-12 21:00:00 | 1 | 25 | 0 | 26 |
2006-11-13 03:00:00 | 0 | 2 | 0 | 2 |
2006-11-13 09:00:00 | 0 | 6 | 0 | 6 |
2006-11-13 15:00:00 | 0 | 6 | 0 | 6 |
Unknown | 166 | 0 | 11 | 177 |
Total | 168 | 216 | 13 | 397 |
We can see that the onset dates of all cases were from 11 to 13 November 2006 inclusive; this is why case numbers didn’t change when we updated the case definition.
Exclusions
Ultimately, the investigation team decided to remove respondents that did not meet the definition for a case or a non-case from the data set prior to analysis. We can do this easily with the function drop_na()
:
<- linelist |>
linelist # Remove rows where case is NA:
drop_na(___)
linelist
Code
<- linelist |>
linelist # Remove rows where case is NA:
drop_na(case)
linelist
Descriptive analysis
This section is for context only. Nothing has to be done in this section by the participants.
Goals
In this session, we will perform some descriptive analysis. Typically, we would describe the outbreak in terms of time, place and person. In this data set, we don’t have geospatial information, but we do have time (onset dates and times, incubation periods) and person (age, sex, clinical symptoms). You will:
- Describe cases by age and sex
- Describe case symptoms
- Describe and illustrate incubation periods
- Create an epidemic curve
Many of the descriptive features can be illustrated graphically, for which we will use the package ggplot2
.
This section is for context only. Nothing has to be done in this section by the participants.
In theory, you create an analysis plan before data collection - it is an important way to ensure you collect all the data you need and that you only collect the data you need (think: data protection!). However, during outbreak investigations, you may need to deploy questionnaires as soon as possible and before a plan has been developed. In these cases, your experience will be an important resource to fall back on.
In your analysis plan you should create a document to include:
- research question(s) and hypothesis
- dataset(s) to be used
- inclusion / exclusion criteria for records you will use in your analysis
- list of variables that will be used in the main analysis;
- outcome variable (being a case)
- main exposure variables (e.g. food consumed)
- stratifying variables (e.g. age, sex, group, class)
- statistical methods
- key mock-up tables including title, labels and expected format for results
- univariable
- bivariable
- stratified
This section is for context only. Nothing has to be done in this section by the participants.
Incubation period
The incubation period of a disease is the time from an exposure resulting in infection until the onset of disease. Because infection usually occurs immediately after exposure, the incubation period is generally the duration of the period from the onset of infection to the onset of disease. (12)
In the previous section, we calculated incubation period by subtracting the meal date and time from the date and time of symptom onset. This gave us the incubation period in hours. We can now look at this on a graph:
Plot of incubation period:
<- linelist |>
incplot
# Remove missing values from incubation column:
___(___) |>
# Create an empty ggplot frame:
ggplot(aes(x = incubation)) +
# Add a histogram of incubation:
geom_histogram(
# Set bin widths to 6 hours:
binwidth = 6
+
)
# Adapt scale to better fit data
scale_x_continuous(breaks = seq(0, 48, 6)) +
# Label x and y axes:
labs(
x = "Incubation period in 6-hour bins",
y = "Number of cases"
)
# Print plot:
incplot
Code
<- linelist |>
incplot
# Remove missing values from incubation column:
drop_na(incubation) |>
# Create an empty ggplot frame:
ggplot(aes(x = incubation)) +
# Add a histogram of incubation:
geom_histogram(
# Set bin widths to 6 hours:
binwidth = 6
+
)
# Adapt scale to better fit data
scale_x_continuous(breaks = seq(0, 48, 6)) +
# Label x and y axes:
labs(
x = "Incubation period in 6-hour bins",
y = "Number of cases"
)
# Print plot:
incplot
Epidemic curve
To create an epicurve, we are going to use ggplot2
. ggplot2
is a versatile package that helps create beautiful visualizations in R
. There are plenty good in-depth guides to the package, for example: The Epidemiologist R Handbook Chapter on ggplot and ggplot2 - Elegant Graphics for Data Analysis.
First, we can create an epicurve for the date of onset, limiting the input data to cases:
# Fetch data:
<- linelist |>
epicurve_date
# Filter for cases where dayonset is not missing:
___(case, !is.na(time_day_onset)) |>
# Add dayonset to ggplot aesthetic:
ggplot(aes(x = time_day_onset)) +
# Add the proper `geom_*` for a barplot:
geom_bar() +
# Update x and y axis labels:
labs(x = "Date of onset",
y = "Number of cases") +
# Remove unnecessary grid lines:
theme_bw()
# Print epicurve:
epicurve_date
- 1
-
Note that
case
is logical, so we can use it as a filter directly (i.e.,case == TRUE
is exactly the same ascase
alone).
Code
# Fetch data:
<- linelist |>
epicurve_date
# Filter for cases where dayonset is not missing:
filter(case, !is.na(time_day_onset)) |>
# Add dayonset to ggplot aesthetic:
ggplot(aes(x = time_day_onset)) +
# Add the proper `geom_*` for a barplot:
geom_bar() +
# Update x and y axis labels:
labs(x = "Date of onset",
y = "Number of cases") +
# Remove unnecessary grid lines:
theme_bw()
# Print epicurve:
epicurve_date
- 1
-
Note that
case
is logical, so we can use it as a filter directly (i.e.,case == TRUE
is exactly the same ascase
alone).
As you can see, the epicurve is quite crude, because most cases had their disease onset on 12 November. To increase resolution, we could enhance the time information to include both day and hour.
During the data preparation session, we already concatenated the variables time_day_onset
and start_hour
together and formatted them with the lubridate
package to create a new date-time variable called time_datetime_onset
. We can now create the epicurve with this variable:
# Fetch data:
<- linelist |>
epicurve_hour
# Filter for cases where dayonset is not missing:
___(___, !is.na(time_datetime_onset)) |>
# Add onset_datetime to ggplot aesthetic:
ggplot(aes(x = time_datetime_onset)) +
# Add the proper `geom_*` for a barplot:
geom_bar() +
# Adjust x axis scales to a suitable unit:
scale_x_datetime(
date_breaks = "6 hours",
labels = scales::label_date_short()) +
# Update x and y axis labels:
labs(
x = "Date and time of onset",
y = "Number of cases"
+
)
# Remove unnecessary grid lines:
theme_bw()
# Print epicurve:
epicurve_hour
Code
# Fetch data:
<- linelist |>
epicurve_hour
# Filter for cases where dayonset is not missing:
filter(case, !is.na(time_datetime_onset)) |>
# Add onset_datetime to ggplot aesthetic:
ggplot(aes(x = time_datetime_onset)) +
# Add the proper `geom_*` for a barplot:
geom_bar() +
# Adjust x axis scales to a suitable unit:
scale_x_datetime(
date_breaks = "6 hours",
labels = scales::label_date_short()) +
# Update x and y axis labels:
labs(
x = "Date and time of onset",
y = "Number of cases"
+
)
# Remove unnecessary grid lines:
theme_bw()
# Print epicurve:
epicurve_hour
Finally, we could compare the epicurve between the sexes and additionally investigate how teachers versus students were distributed.
# Fetch data:
<- linelist |>
epicurve_strata
# Filter for cases where onset_datetime is not missing:
___(___, !is.na(time_datetime_onset)) |>
# Add factor onset_day to ggplot aesthetic:
ggplot(aes(x = time_datetime_onset)) +
# Add the proper `geom_*` for an "overall" barplot:
geom_bar() +
# Superimpose the same `geom_*` for a barplot
# stratified by sex:
geom_bar(aes(fill = demo_sex), position = "dodge") +
# Adjust x axis scales to a suitable unit:
scale_x_datetime(
date_breaks = "6 hours",
labels = scales::label_date_short()) +
# Stratify by group:
facet_wrap("demo_group", nrow = 2) +
# Update x and y axis labels:
labs(
x = "Date and time of onset",
y = "Number of cases",
fill = "Sex",
title = "Epicurve of the outbreak, overall (gray bars) and stratified by sex (coloured bars)",
subtitle = str_glue(
"Copenhagen, November 2006, N = {sum(linelist$case)}"
)+
)
# Add theme:
theme_bw()
# Print epicurve:
epicurve_strata
Code
# Fetch data:
<- linelist |>
epicurve_strata
# Filter for cases where onset_datetime is not missing:
filter(case, !is.na(time_datetime_onset)) |>
# Add factor onset_day to ggplot aesthetic:
ggplot(aes(x = time_datetime_onset)) +
# Add the proper `geom_*` for an "overall" barplot:
geom_bar() +
# Superimpose the same `geom_*` for a barplot
# stratified by sex:
geom_bar(aes(fill = demo_sex), position = "dodge") +
# Adjust x axis scales to a suitable unit:
scale_x_datetime(
date_breaks = "6 hours",
labels = scales::label_date_short()) +
# Stratify by group:
facet_wrap("demo_group", nrow = 2) +
# Update x and y axis labels:
labs(
x = "Date and time of onset",
y = "Number of cases",
fill = "Sex",
title = "Epicurve of the outbreak, overall (gray bars) and stratified by sex (coloured bars)",
subtitle = str_glue(
"Copenhagen, November 2006, N = {sum(linelist$case)}"
)+
)
# Add theme:
theme_bw()
# Print epicurve:
epicurve_strata
In this section, we will describe the outbreak in terms of personal characteristics (age, sex and symptoms of cases).
Age & sex
During the data cleaning exercise, we already looked at the age characteristics of the cohort, since there were some typographic errors that needed to be corrected. However, we have not yet looked at the distribution of cases and non-cases by sex (Table 6).
demo_sex | Total | ||
---|---|---|---|
female | male | ||
case | |||
FALSE | 102 (60.71%) | 66 (39.29%) | 168 (100.00%) |
TRUE | 117 (54.17%) | 99 (45.83%) | 216 (100.00%) |
Total | 219 (57.03%) | 165 (42.97%) | 384 (100.00%) |
Now we can also look at group
(whether the respondent is a student or a teacher) (Table 7).
demo_group | Total | ||
---|---|---|---|
teacher | student | ||
case | |||
FALSE | 9 (5.36%) | 159 (94.64%) | 168 (100.00%) |
TRUE | 6 (2.78%) | 210 (97.22%) | 216 (100.00%) |
Total | 15 (3.91%) | 369 (96.09%) | 384 (100.00%) |
We can see that the majority of respondents are students; this is true for both cases and non-cases.
Lastly, we will look at school class. The respondents belong to three classes (Table 8).
demo_class | Total | ||||
---|---|---|---|---|---|
1 | 2 | 3 | Unknown | ||
case | |||||
FALSE | 66 (39.29%) | 46 (27.38%) | 39 (23.21%) | 17 (10.12%) | 168 (100.00%) |
TRUE | 68 (31.48%) | 57 (26.39%) | 73 (33.80%) | 18 (8.33%) | 216 (100.00%) |
Total | 134 (34.90%) | 103 (26.82%) | 112 (29.17%) | 35 (9.11%) | 384 (100.00%) |
Symptoms
In the first part of this case study, we selected three symptoms to use in the case definition; these were diarrhoea (with or without blood) and vomiting. However, other symptoms were also recorded in the survey.
It may be useful to determine which symptoms are most common among cases and non-cases (Table 9).
Characteristic | Overall, N = 3841 | Non-case N = 1681 | Case N = 2161 | P value2 |
---|---|---|---|---|
Diarrhoea | 206 (82%) | 0 (0%) | 206 (97%) | <0.001 |
Unknown | 132 | 128 | 4 | |
Dysentary | 5 (2.6%) | 0 (0%) | 5 (3.3%) | 0.6 |
Unknown | 190 | 126 | 64 | |
Vomiting | 66 (31%) | 0 (0%) | 66 (38%) | <0.001 |
Unknown | 169 | 126 | 43 | |
Abdominal pain | 207 (86%) | 44 (88%) | 163 (85%) | 0.6 |
Unknown | 142 | 118 | 24 | |
Nausea | 169 (75%) | 34 (74%) | 135 (76%) | 0.8 |
Unknown | 160 | 122 | 38 | |
Fever | 44 (26%) | 8 (20%) | 36 (27%) | 0.3 |
Unknown | 213 | 128 | 85 | |
Headache | 137 (62%) | 33 (75%) | 104 (59%) | 0.051 |
Unknown | 164 | 124 | 40 | |
Joint pain | 29 (15%) | 6 (16%) | 23 (15%) | >0.9 |
Unknown | 196 | 130 | 66 | |
symptoms_gastro | 216 (56%) | 0 (0%) | 216 (100%) | <0.001 |
1 n (%) | ||||
2 Pearson’s Chi-squared test; Fisher’s exact test |
Looking at the table, do you think the symptoms selected for the case definition were the right ones, or would you change anything? This might be a bit easier to assess if we look at the symptoms in an ordered bar chart. To do this, we will need to reshape the data, so that one column contains symptoms and another column contains the proportion of respondents with a given symptom in each case.
We will reshape the data, and tally up the counts for each symptom, stratified by case definition.
In this section, participants should fill in ___
the proper instructions to correctly reshape the data set of symptoms to have intermediate data set composed by three columns: case
, symptoms
, and value
.
To do that, use the proper pivot_*
function of hte {tidyr}
package.
<- linelist |>
tbl_symptoms_case # Select all symptoms columns, i.e., those starting with "symptoms":
___(case, ___("symptoms")) |>
# Remove rows with missing values:
___() |>
# Reshape the dataset from wide to long
___(
-___,
names_to = "___",
___ = "value",
values_drop_na = TRUE
|>
) # Keep only TRUE `value`s:
___(value) |>
# Count the number of cases with each symptom:
count(symptoms, case)
tbl_symptoms_case
Code
<- linelist |>
tbl_symptoms_case # Select all symptoms columns, i.e., those starting with "symptoms":
select(case, starts_with("symptoms")) |>
# Remove rows with missing values:
drop_na() |>
# Reshape the dataset from wide to long
pivot_longer(
-case,
names_to = "symptoms",
values_to = "value",
values_drop_na = TRUE
|>
) # Keep only TRUE `value`s:
filter(value) |>
count(symptoms, case)
tbl_symptoms_case
Code
<- tbl_symptoms_case |>
fig_symptoms_case ggplot(aes(
x = fct_infreq(symptoms),
y = n,
+
)) geom_bar(stat = "identity") +
labs(
x = "Symptoms",
y = "Proportion of respondents"
+
) coord_flip() +
facet_wrap(
~ .,
case ncol = 2,
labeller = c(
"FALSE" = "Non case",
"TRUE" = "Case"
|> as_labeller()
)
) fig_symptoms_case