___(tidyverse)
___(here)
___(rio)
<- here("data-raw/Copenhagen_clean.xlsx") ___
linelist ___() ___
mutate(across(where(is.character), fct))
Day One - Assessment
UBEP’s R training for supervisors
Visualization
Preamble
This section is for context only. Nothing has to be done in this section by the participants.
This assessment is the second of four ones regarding the UBEP’s R training for supervisors for the ECDC.
To solve the exercise follow the text and find the assessments in which you need to fill the ___
where required in the code. Next, go to the corresponding section of the R script solution.R
and try/run sequentially your code.
All the exercise are presented in a tabset panel with a tab containing all the missing parts, and a tab with the solved, completed (including output) code. This version of the file has the solution exposed.
You can access to a dedicated R/RStudio environment on Posit Cloud at connecting here. You need to create a free account on Posit Cloud, and next accept to join the “R training for supervisors” workspace. Inside that space, you can enter in the “day-2” project, and find inside all the data, script and resources useful to complete the assessment.
The text and examples in the present document are UBEP’s variation/adaptation from the ECDC EPIET Outbreak Investigation, that can be found on GitHub at https://github.com/EPIET/OutbreakInvestigation.(3)
The present work is released under the GPL-3 License.
Environment preparation
In this section, participants should attach to the R session the following packages using the library
function: (2)
tidyverse
here
rio
Moreover, they need to correctly pipe the instructions to read the data.
Task: fill in
___
the proper function/command
Code
library(tidyverse)
library(here)
library(rio)
<- here("data-raw/Copenhagen_clean.xlsx") |>
linelist import() |>
mutate(across(where(is.character), fct))
Data preparation
This section is for context only. Nothing has to be done in this section by the participants.
The Alert
On November 14th 2006 the director of a high school in Greater Copenhagen, Denmark, contacted the regional public health authorities to inform them about an outbreak of diarrhoea and vomiting among participants from a school dinner party held on the 11th of November 2006. Almost all students and teachers of the school (750 people) attended the party.
The first people fell ill the same night and by 14 November, the school had received reports of diarrhoeal illness from around 200 - 300 students and teachers, many of whom also reported vomiting.
Your mission
Your group has been tasked with investigating this outbreak; you have just received the information above.
The epidemiologists in the outbreak team decided to perform a retrospective cohort study in order to identify the food item that was the vehicle of the outbreak. The cohort was defined as students and teachers who had attended the party at the high school on 11th of November 2006.
A questionnaire was designed to conduct a survey on food consumption and on presentation of the illness. Information about the survey and a link to the questionnaire was circulated to students and teachers via the school’s intranet with the request that everyone who attended the school party on 11th of November 2006 should fill in the questionnaire.
Practically all students and teachers check the intranet on a daily basis, because it is the school’s main communication channel for information about courses, homework assignments, cancellation of lessons etc. The school’s intranet was accessible for ill students or teachers from home so that everyone in the cohort could potentially participate and the response rate could be maximised. Additionally, the information about the investigation was also displayed on the screen in the main hall of the school.
Exploring the data
Before you can begin analysing the data, you will need to become familiar with its contents and check for any errors, or values that don’t make sense.
For this assessment all the data preprocessing is already done. Participants will find the data already in their environment named as linelist
.
In the following:
The variables are grouped by type (e.g. character or numeric)
n_missing shows the number of observations with missing values for each variable
complete_rate shows the proportion of observations that are not missing
min
andmax
for character variables refer to the number of characters per stringp0
andp100
for numeric variables refer to minimum and maximum values, respectively.For more details, see the help file by typing
?skimr::skim
in the console
Code
::skim(linelist) skimr
Name | linelist |
Number of rows | 384 |
Number of columns | 46 |
_______________________ | |
Column type frequency: | |
factor | 3 |
logical | 24 |
numeric | 16 |
POSIXct | 3 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
sex | 0 | 1.00 | FALSE | 2 | fem: 219, mal: 165 |
group | 0 | 1.00 | FALSE | 2 | stu: 369, tea: 15 |
class | 35 | 0.91 | FALSE | 3 | 1: 134, 3: 112, 2: 103 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
diarrhoea | 132 | 0.66 | 0.82 | TRU: 206, FAL: 46 |
bloody | 190 | 0.51 | 0.03 | FAL: 189, TRU: 5 |
vomiting | 169 | 0.56 | 0.31 | FAL: 149, TRU: 66 |
abdo | 142 | 0.63 | 0.86 | TRU: 207, FAL: 35 |
nausea | 160 | 0.58 | 0.75 | TRU: 169, FAL: 55 |
fever | 213 | 0.45 | 0.26 | FAL: 127, TRU: 44 |
headache | 164 | 0.57 | 0.62 | TRU: 137, FAL: 83 |
jointpain | 196 | 0.49 | 0.15 | FAL: 159, TRU: 29 |
meal | 0 | 1.00 | 1.00 | TRU: 384 |
tuna | 4 | 0.99 | 0.72 | TRU: 272, FAL: 108 |
shrimps | 5 | 0.99 | 0.67 | TRU: 255, FAL: 124 |
green | 18 | 0.95 | 0.59 | TRU: 216, FAL: 150 |
veal | 3 | 0.99 | 0.89 | TRU: 340, FAL: 41 |
pasta | 3 | 0.99 | 0.89 | TRU: 338, FAL: 43 |
rocket | 12 | 0.97 | 0.57 | TRU: 211, FAL: 161 |
sauce | 30 | 0.92 | 0.42 | FAL: 205, TRU: 149 |
bread | 6 | 0.98 | 0.91 | TRU: 345, FAL: 33 |
champagne | 13 | 0.97 | 0.87 | TRU: 323, FAL: 48 |
beer | 18 | 0.95 | 0.78 | TRU: 286, FAL: 80 |
redwine | 38 | 0.90 | 0.23 | FAL: 265, TRU: 81 |
whitewine | 19 | 0.95 | 0.73 | TRU: 266, FAL: 99 |
gastrosymptoms | 0 | 1.00 | 0.56 | TRU: 216, FAL: 168 |
ate_anything | 0 | 1.00 | 1.00 | TRU: 384 |
case | 0 | 1.00 | 0.56 | TRU: 216, FAL: 168 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
id | 0 | 1.00 | 215.80 | 123.65 | 1 | 111.75 | 216.5 | 318.25 | 435 | ▇▇▇▇▇ |
age | 0 | 1.00 | 18.30 | 6.13 | 15 | 16.00 | 17.0 | 18.00 | 65 | ▇▁▁▁▁ |
starthour | 166 | 0.57 | 12.52 | 4.94 | 3 | 9.00 | 9.0 | 15.00 | 21 | ▁▇▁▆▃ |
tunaD | 4 | 0.99 | 1.32 | 1.00 | 0 | 0.00 | 2.0 | 2.00 | 3 | ▆▅▁▇▂ |
shrimpsD | 5 | 0.99 | 1.35 | 1.04 | 0 | 0.00 | 2.0 | 2.00 | 3 | ▆▂▁▇▂ |
greenD | 18 | 0.95 | 1.14 | 1.05 | 0 | 0.00 | 1.0 | 2.00 | 3 | ▇▂▁▇▂ |
vealD | 2 | 0.99 | 1.83 | 0.90 | 0 | 1.00 | 2.0 | 2.00 | 3 | ▂▃▁▇▃ |
pastaD | 3 | 0.99 | 1.81 | 0.91 | 0 | 1.00 | 2.0 | 2.00 | 3 | ▂▃▁▇▃ |
rocketD | 12 | 0.97 | 1.08 | 1.06 | 0 | 0.00 | 1.0 | 2.00 | 3 | ▇▂▁▆▂ |
sauceD | 30 | 0.92 | 0.83 | 1.06 | 0 | 0.00 | 0.0 | 2.00 | 3 | ▇▁▁▃▁ |
breadD | 6 | 0.98 | 1.75 | 0.71 | 0 | 2.00 | 2.0 | 2.00 | 3 | ▁▂▁▇▁ |
champagneD | 13 | 0.97 | 1.37 | 0.93 | 0 | 1.00 | 1.0 | 2.00 | 3 | ▂▇▁▂▃ |
beerD | 23 | 0.94 | 1.95 | 1.23 | 0 | 1.00 | 3.0 | 3.00 | 3 | ▃▂▁▂▇ |
redwineD | 40 | 0.90 | 0.45 | 0.92 | 0 | 0.00 | 0.0 | 0.00 | 3 | ▇▁▁▁▁ |
whitewineD | 24 | 0.94 | 1.58 | 1.21 | 0 | 0.00 | 2.0 | 3.00 | 3 | ▆▅▁▅▇ |
incubation | 168 | 0.56 | 19.03 | 7.93 | 3 | 15.00 | 15.0 | 21.00 | 45 | ▂▇▇▁▁ |
Variable type: POSIXct
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
dayonset | 164 | 0.57 | 2006-11-11 00:00:00 | 2006-11-13 00:00:00 | 2006-11-12 00:00:00 | 3 |
onset_datetime | 164 | 0.57 | 2006-11-11 00:00:00 | 2006-11-13 15:00:00 | 2006-11-12 09:00:00 | 9 |
meal_datetime | 0 | 1.00 | 2006-11-11 18:00:00 | 2006-11-11 18:00:00 | 2006-11-11 18:00:00 | 1 |
Variables of interest:
Your exposure of interest is the school dinner party held on 11 November 2006 at 18:00. You may have noticed while skimming the data, that there is a binary variable called meal
. This variable indicates whether people attended the school dinner party and ate a meal there, or not.
Other variables that will be helpful to include in your case definition are onset_datetime
(hint: check that case onset date/time is after exposure) and symptom variables (hint: not everyone on the linelist fell ill). The symptoms included in the data set are:
abdo
(abdominal pain)diarrhoea
bloody
(bloody diarrhoea)nausea
vomiting
fever
headache
jointpain
Defining a case
A case was defined as a person who:
- attended the school dinner on 11 November 2006 (i.e. is on the linelist)
- ate a meal at the school dinner (i.e. was exposed)
- fell ill after the start of the meal
- fell ill no later than two days after the school dinner
- suffered from diarrhoea with or without blood, or vomiting
Non cases were defined as people who:
- attended the school dinner on 11 November 2006 (i.e. are on the linelist)
- ate a meal at the school dinner (i.e. were exposed)
- did not fall ill within the time period of interest
- did not develop diarrhoea (with or without blood) or vomiting
For the sake of the analysis, we excluded any people from the cohort who didn’t eat at the dinner, because we specifically hypothesise a food item to be the vehicle of infection in this outbreak. Excluding people reduces the sample size and therefore the power slightly, but the investigators considered that this would increase specificity.
The variables needed to define this case definition are:
meal
onset_datetime
diarrhoea
bloody
vomiting
For the case definition, we are primarily interested in three symptoms:
diarrhoea
(without blood)bloody
(diarrhoea with blood)vomiting
However, these are not the only symptoms in the data set. When creating the case definition, it would be easier to refer to these three symptoms if there was one column, indicating whether people had those symptoms or not. We can call this column gastrosymptoms
.
Next, we created a column for the case definition, in which we defined all the respondents as either cases, non-cases or NA
.
We can define non-cases as those who attended the meal, but didn’t develop any gastro symptoms, or if they did, developed them before the meal took place.
We can exclude (define as NA
) respondents that answered the survey, but did not attend the meal.
Incubation times
A suitable incubation period to use in the case definition can be defined by calculating the time between exposure (to the meal) and onset of symptoms, and then looking at the distribution of these time differences. In this outbreak, incubation periods are easy to calculate, because everyone was exposed at (roughly) the same time and on the same day (eating the meal at the school dinner party).
We can see that the onset dates of all cases were from 11 to 13 November 2006 inclusive; this is why case numbers didn’t change when we updated the case definition.
Exclusions
Ultimately, the investigation team decided to remove respondents that did not meet the definition for a case or a non-case from the data set prior to analysis.
Descriptive Analyses
In this section, participants should fill in ___
the proper function/command.
Goals
In this session, we will perform some descriptive analysis. Typically, we would describe the outbreak in terms of time, place and person. In this data set, we don’t have geospatial information, but we do have time (onset dates and times, incubation periods) and person (age, sex, clinical symptoms). You will:
- Describe cases by age and sex
- Describe case symptoms
- Describe and illustrate incubation periods
- Create an epidemic curve
Many of the descriptive features can be illustrated graphically, for which we will use the package ggplot2
.
In theory, you create an analysis plan before data collection - it is an important way to ensure you collect all the data you need and that you only collect the data you need (think: data protection!). However, during outbreak investigations, you may need to deploy questionnaires as soon as possible and before a plan has been developed. In these cases, your experience will be an important resource to fall back on.
In your analysis plan you should create a document to include:
- research question(s) and hypothesis
- dataset(s) to be used
- inclusion / exclusion criteria for records you will use in your analysis
- list of variables that will be used in the main analysis;
- outcome variable (being a case)
- main exposure variables (e.g. food consumed)
- stratifying variables (e.g. age, sex, group, class)
- statistical methods
- key mock-up tables including title, labels and expected format for results
- univariable
- bivariable
- stratified
Incubation period
The incubation period of a disease is the time from an exposure resulting in infection until the onset of disease. Because infection usually occurs immediately after exposure, the incubation period is generally the duration of the period from the onset of infection to the onset of disease. (6)
In the previous section, we calculated incubation period by subtracting the meal date and time from the date and time of symptom onset. This gave us the incubation period in hours. We can now look at this on a graph:
Plot of incubation period:
Produce a histogram
of the incubation
period variable (with missing values dropped out), using the ggplot
function.
Set the following labels:
x axis: “Incubation period in 6-hour bins”
y axis: “Number of cases”
<- linelist |>
incplot
# Remove missing values:
drop_na(incubation) |>
# Create an empty ggplot frame setting `incubation` as aesthetic:
___(___(x = ___)) +
# Add a histogram of incubation using the appropriate geom_*:
___(
# Set bin widths to 6 hours:
binwidth = 6
+
)
# Adapt scale to better fit data
scale_x_continuous(breaks = seq(0, 48, 6)) +
# Set the required labs to the x and y axes:
___(
x = ___,
y = ___
)
# Print plot:
incplot
Code
<- linelist |>
incplot
# Remove missing values:
drop_na(incubation) |>
# Create an empty ggplot frame:
ggplot(aes(x = incubation)) +
# Add a histogram of incubation:
geom_histogram(
# Set bin widths to 6 hours:
binwidth = 6
+
)
# Adapt scale to better fit data
scale_x_continuous(breaks = seq(0, 48, 6)) +
# Label x and y axes:
labs(
x = "Incubation period in 6-hour bins",
y = "Number of cases"
)
# Print plot:
incplot
Epidemic curve
To create an epicurve, we are going to use ggplot2
. ggplot2
is a versatile package that helps create beautiful visualizations in R
. There are plenty good in-depth guides to the package, for example: The Epidemiologist R Handbook Chapter on ggplot and ggplot2 - Elegant Graphics for Data Analysis.
First, we can create an epicurve for the date of onset, limiting the input data to cases:
Produce a daily epicurve using a barplot
of the dayonset
variable (considering case
only, and with missing values dropped out), using the ggplot
function.
Set the following labels:
x axis: “Date of onset”
y axis: “Number of cases”
# Fetch data:
<- linelist |>
epicurve_date
# Filter for cases where dayonset is not missing:
filter(
== TRUE,
case !is.na(dayonset)
) ___
# Add dayonset to ggplot aesthetic:
ggplot(aes(x = dayonset)) ___
# Add the proper `geom_*` for a barplot:
___() +
# Adapt scale to data and adjust axis label angle:
scale_x_datetime(
date_breaks = "1 day",
labels = scales::label_date_short()) +
# Update x and y axis labels:
___(
___ = "Date of onset",
___ = "Number of cases"
+
)
# Remove unnecessary grid lines:
theme_bw()
# Print epicurve:
epicurve_date
Code
# Fetch data:
<- linelist |>
epicurve_date
# Filter for cases where dayonset is not missing:
filter(
== TRUE,
case !is.na(dayonset)
|>
)
# Add dayonset to ggplot aesthetic:
ggplot(aes(x = dayonset)) +
# Add the proper `geom_*` for a barplot:
geom_bar() +
# Adapt scale to data and adjust axis label angle:
scale_x_datetime(
date_breaks = "1 day",
labels = scales::label_date_short()) +
# Update x and y axis labels:
labs(x = "Date of onset",
y = "Number of cases") +
# Remove unnecessary grid lines:
theme_bw()
# Print epicurve:
epicurve_date
As you can see, the epicurve is quite crude, because most cases had their disease onset on 12 November. To increase resolution, we could enhance the time information to include both day and hour.
During the data preparation session, we already concatenated the variables dayonset
and starthour
together and formatted them with the lubridate
package to create a new date-time variable called onset_datetime
. We can now create the epicurve with this variable:
Produce a 6-hours epicurve using a barplot
of the onset_datetime
variable (considering case
only, and with missing values dropped out), using the ggplot
function.
Set the following labels:
x axis: “Date of onset”
y axis: “Number of cases”
# Fetch data:
<- linelist |>
epicurve_hour
# Filter for cases where onset_datetime is not missing:
filter(
== TRUE,
onset_datetime !___(onset_datetime)
) ___
# Add onset_datetime to ggplot aesthetic:
ggplot(aes(___ = onset_datetime)) ___
# Add the proper `geom_*` for a barplot:
___() +
# Adjust x axis scales to a suitable unit:
scale_x_datetime(
date_breaks = "6 hours",
labels = scales::label_date_short()) +
# Update x and y axis labels:
___(
___ = "Date and time of onset",
___ = "Number of cases"
+
)
# Remove unnecessary grid lines:
theme_bw()
# Print epicurve:
epicurve_hour
Code
# Fetch data:
<- linelist |>
epicurve_hour
# Filter for cases where dayonset is not missing:
filter(
== TRUE,
case !is.na(onset_datetime)
|>
)
# Add onset_datetime to ggplot aesthetic:
ggplot(aes(x = onset_datetime)) +
# Add the proper `geom_*` for a barplot:
geom_bar() +
# Adjust x axis scales to a suitable unit:
scale_x_datetime(
date_breaks = "6 hours",
labels = scales::label_date_short()) +
# Update x and y axis labels:
labs(
x = "Date and time of onset",
y = "Number of cases"
+
)
# Remove unnecessary grid lines:
theme_bw()
# Print epicurve:
epicurve_hour
Finally, we could compare the epicurve between the sexes and additionally investigate how teachers versus students were distributed. Here, fill
adds an additional variable to be displayed in the plot: sex
is going to determine the fill-colour of our bars. facet_wrap
splits the graph in two: one each for the two levels of group
.
We will also use the function str_glue()
to add the total number of cases to the sub-title of the plot. str_glue()
is a very useful function that allows you to dynamically create a summary statistic from your data within some normal text.
Produce a 6-hours epicurve using a barplot
of the onset_datetime
variable (considering case
only, and with missing values dropped out), using the ggplot
function. Superimpose the stratification by sex
, and split the chart in two base on the group
variable.
Set the following labels:
x axis: “Date of onset”
y axis: “Number of cases”
fill legend title: “Sex”
# Fetch data:
<- linelist |>
epicurve_strata
# Filter for cases where onset_datetime is not missing:
filter(
== TRUE,
___ !___(___)
|>
)
# Add factor onset_day to ggplot aesthetic:
ggplot(aes(___ = onset_datetime)) +
# Add the proper `geom_*` for an "overall" barplot:
___() +
# Superimpose the same `geom_*` for a barplot
# stratified by sex:
___(___(fill = ___), position = "dodge") +
# Adjust x axis scales to a suitable unit:
scale_x_datetime(
date_breaks = "6 hours",
labels = scales::label_date_short()) +
# Stratify by group:
facet_wrap("group", nrow = 2) +
# Update x and y axis labels:
___(
___ = "Date and time of onset",
___ = "Number of cases",
___ = "Sex",
title = "Epicurve of the outbreak, overall (gray bars) and stratified by sex (coloured bars)",
subtitle = str_glue(
"Copenhagen, November 2006, N = {sum(linelist$case)}"
)+
)
# Add theme:
theme_bw()
# Print epicurve:
epicurve_strata
Code
# Fetch data:
<- linelist |>
epicurve_strata
# Filter for cases where onset_datetime is not missing:
filter(
== TRUE,
case !is.na(onset_datetime)
|>
)
# Add factor onset_day to ggplot aesthetic:
ggplot(aes(x = onset_datetime)) +
# Add the proper `geom_*` for an "overall" barplot:
geom_bar() +
# Superimpose the same `geom_*` for a barplot
# stratified by sex:
geom_bar(aes(fill = sex), position = "dodge") +
# Adjust x axis scales to a suitable unit:
scale_x_datetime(
date_breaks = "6 hours",
labels = scales::label_date_short()) +
# Stratify by group:
facet_wrap("group", nrow = 2) +
# Update x and y axis labels:
labs(
x = "Date and time of onset",
y = "Number of cases",
fill = "Sex",
title = "Epicurve of the outbreak, overall (gray bars) and stratified by sex (coloured bars)",
subtitle = str_glue(
"Copenhagen, November 2006, N = {sum(linelist$case)}"
)+
)
# Add theme:
theme_bw()
# Print epicurve:
epicurve_strata