Day One - Assessment

UBEP’s R training for supervisors

Visualization

Learning Objectives

At the end of the assessment, participants should have demonstrated their ability to: (1) (2)

  • Use the native pipe (|>) and the {ggplot2} plus-sign (+) correctly

  • Use the {rio} (import) for general-purpose import. (2)

  • Create general plots using {ggplot2} package and corresponding grammar; in particular:

    • select/use tidy-data for a plot

    • understand the meaning of “aesthetics” (aes)

    • know the use and the basic geometries (geom_*)

    • ability to facet a plot (facet)

    • be able to modify the basic plot style (theme and labs)

    • be aware that axes can be modified (e.g., coord_flip) and geometries can be personalized (e.g., `position = “dodge”`)

Preamble

Instructions

This section is for context only. Nothing has to be done in this section by the participants.

This assessment is the second of four ones regarding the UBEP’s R training for supervisors for the ECDC.

To solve the exercise follow the text and find the assessments in which you need to fill the ___ where required in the code. Next, go to the corresponding section of the R script solution.R and try/run sequentially your code.

All the exercise are presented in a tabset panel with a tab containing all the missing parts, and a tab with the solved, completed (including output) code. This version of the file has the solution exposed.

You can access to a dedicated R/RStudio environment on Posit Cloud at connecting here. You need to create a free account on Posit Cloud, and next accept to join the “R training for supervisors” workspace. Inside that space, you can enter in the “day-2” project, and find inside all the data, script and resources useful to complete the assessment.

The text and examples in the present document are UBEP’s variation/adaptation from the ECDC EPIET Outbreak Investigation, that can be found on GitHub at https://github.com/EPIET/OutbreakInvestigation.(3)

The present work is released under the GPL-3 License.

Environment preparation

Instructions

In this section, participants should attach to the R session the following packages using the library function: (2)

  • tidyverse

  • here

  • rio

Moreover, they need to correctly pipe the instructions to read the data.

Task: fill in ___ the proper function/command

___(tidyverse)
___(here)
___(rio)

linelist <- here("data-raw/Copenhagen_clean.xlsx") ___ 
  ___() ___ 
  mutate(across(where(is.character), fct))
Code
library(tidyverse)
library(here)
library(rio)

linelist <- here("data-raw/Copenhagen_clean.xlsx") |> 
  import() |> 
  mutate(across(where(is.character), fct))

Data preparation

Instructions

This section is for context only. Nothing has to be done in this section by the participants.

The Alert

On November 14th 2006 the director of a high school in Greater Copenhagen, Denmark, contacted the regional public health authorities to inform them about an outbreak of diarrhoea and vomiting among participants from a school dinner party held on the 11th of November 2006. Almost all students and teachers of the school (750 people) attended the party.

The first people fell ill the same night and by 14 November, the school had received reports of diarrhoeal illness from around 200 - 300 students and teachers, many of whom also reported vomiting.

Your mission

Your group has been tasked with investigating this outbreak; you have just received the information above.

The epidemiologists in the outbreak team decided to perform a retrospective cohort study in order to identify the food item that was the vehicle of the outbreak. The cohort was defined as students and teachers who had attended the party at the high school on 11th of November 2006.

A questionnaire was designed to conduct a survey on food consumption and on presentation of the illness. Information about the survey and a link to the questionnaire was circulated to students and teachers via the school’s intranet with the request that everyone who attended the school party on 11th of November 2006 should fill in the questionnaire.

Practically all students and teachers check the intranet on a daily basis, because it is the school’s main communication channel for information about courses, homework assignments, cancellation of lessons etc. The school’s intranet was accessible for ill students or teachers from home so that everyone in the cohort could potentially participate and the response rate could be maximised. Additionally, the information about the investigation was also displayed on the screen in the main hall of the school.

Exploring the data

Before you can begin analysing the data, you will need to become familiar with its contents and check for any errors, or values that don’t make sense.

Preprocessing

For this assessment all the data preprocessing is already done. Participants will find the data already in their environment named as linelist.

In the following:

  • The variables are grouped by type (e.g. character or numeric)

  • n_missing shows the number of observations with missing values for each variable

  • complete_rate shows the proportion of observations that are not missing

  • min and max for character variables refer to the number of characters per string

  • p0 and p100 for numeric variables refer to minimum and maximum values, respectively.

  • For more details, see the help file by typing ?skimr::skim in the console

Code
skimr::skim(linelist)
Data summary
Name linelist
Number of rows 384
Number of columns 46
_______________________
Column type frequency:
factor 3
logical 24
numeric 16
POSIXct 3
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
sex 0 1.00 FALSE 2 fem: 219, mal: 165
group 0 1.00 FALSE 2 stu: 369, tea: 15
class 35 0.91 FALSE 3 1: 134, 3: 112, 2: 103

Variable type: logical

skim_variable n_missing complete_rate mean count
diarrhoea 132 0.66 0.82 TRU: 206, FAL: 46
bloody 190 0.51 0.03 FAL: 189, TRU: 5
vomiting 169 0.56 0.31 FAL: 149, TRU: 66
abdo 142 0.63 0.86 TRU: 207, FAL: 35
nausea 160 0.58 0.75 TRU: 169, FAL: 55
fever 213 0.45 0.26 FAL: 127, TRU: 44
headache 164 0.57 0.62 TRU: 137, FAL: 83
jointpain 196 0.49 0.15 FAL: 159, TRU: 29
meal 0 1.00 1.00 TRU: 384
tuna 4 0.99 0.72 TRU: 272, FAL: 108
shrimps 5 0.99 0.67 TRU: 255, FAL: 124
green 18 0.95 0.59 TRU: 216, FAL: 150
veal 3 0.99 0.89 TRU: 340, FAL: 41
pasta 3 0.99 0.89 TRU: 338, FAL: 43
rocket 12 0.97 0.57 TRU: 211, FAL: 161
sauce 30 0.92 0.42 FAL: 205, TRU: 149
bread 6 0.98 0.91 TRU: 345, FAL: 33
champagne 13 0.97 0.87 TRU: 323, FAL: 48
beer 18 0.95 0.78 TRU: 286, FAL: 80
redwine 38 0.90 0.23 FAL: 265, TRU: 81
whitewine 19 0.95 0.73 TRU: 266, FAL: 99
gastrosymptoms 0 1.00 0.56 TRU: 216, FAL: 168
ate_anything 0 1.00 1.00 TRU: 384
case 0 1.00 0.56 TRU: 216, FAL: 168

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1.00 215.80 123.65 1 111.75 216.5 318.25 435 ▇▇▇▇▇
age 0 1.00 18.30 6.13 15 16.00 17.0 18.00 65 ▇▁▁▁▁
starthour 166 0.57 12.52 4.94 3 9.00 9.0 15.00 21 ▁▇▁▆▃
tunaD 4 0.99 1.32 1.00 0 0.00 2.0 2.00 3 ▆▅▁▇▂
shrimpsD 5 0.99 1.35 1.04 0 0.00 2.0 2.00 3 ▆▂▁▇▂
greenD 18 0.95 1.14 1.05 0 0.00 1.0 2.00 3 ▇▂▁▇▂
vealD 2 0.99 1.83 0.90 0 1.00 2.0 2.00 3 ▂▃▁▇▃
pastaD 3 0.99 1.81 0.91 0 1.00 2.0 2.00 3 ▂▃▁▇▃
rocketD 12 0.97 1.08 1.06 0 0.00 1.0 2.00 3 ▇▂▁▆▂
sauceD 30 0.92 0.83 1.06 0 0.00 0.0 2.00 3 ▇▁▁▃▁
breadD 6 0.98 1.75 0.71 0 2.00 2.0 2.00 3 ▁▂▁▇▁
champagneD 13 0.97 1.37 0.93 0 1.00 1.0 2.00 3 ▂▇▁▂▃
beerD 23 0.94 1.95 1.23 0 1.00 3.0 3.00 3 ▃▂▁▂▇
redwineD 40 0.90 0.45 0.92 0 0.00 0.0 0.00 3 ▇▁▁▁▁
whitewineD 24 0.94 1.58 1.21 0 0.00 2.0 3.00 3 ▆▅▁▅▇
incubation 168 0.56 19.03 7.93 3 15.00 15.0 21.00 45 ▂▇▇▁▁

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
dayonset 164 0.57 2006-11-11 00:00:00 2006-11-13 00:00:00 2006-11-12 00:00:00 3
onset_datetime 164 0.57 2006-11-11 00:00:00 2006-11-13 15:00:00 2006-11-12 09:00:00 9
meal_datetime 0 1.00 2006-11-11 18:00:00 2006-11-11 18:00:00 2006-11-11 18:00:00 1

Variables of interest:

Your exposure of interest is the school dinner party held on 11 November 2006 at 18:00. You may have noticed while skimming the data, that there is a binary variable called meal. This variable indicates whether people attended the school dinner party and ate a meal there, or not.

Other variables that will be helpful to include in your case definition are onset_datetime (hint: check that case onset date/time is after exposure) and symptom variables (hint: not everyone on the linelist fell ill). The symptoms included in the data set are:

  • abdo (abdominal pain)
  • diarrhoea
  • bloody (bloody diarrhoea)
  • nausea
  • vomiting
  • fever
  • headache
  • jointpain

Defining a case

A case was defined as a person who:

  • attended the school dinner on 11 November 2006 (i.e. is on the linelist)
  • ate a meal at the school dinner (i.e. was exposed)
  • fell ill after the start of the meal
  • fell ill no later than two days after the school dinner
  • suffered from diarrhoea with or without blood, or vomiting

Non cases were defined as people who:

  • attended the school dinner on 11 November 2006 (i.e. are on the linelist)
  • ate a meal at the school dinner (i.e. were exposed)
  • did not fall ill within the time period of interest
  • did not develop diarrhoea (with or without blood) or vomiting

For the sake of the analysis, we excluded any people from the cohort who didn’t eat at the dinner, because we specifically hypothesise a food item to be the vehicle of infection in this outbreak. Excluding people reduces the sample size and therefore the power slightly, but the investigators considered that this would increase specificity.

The variables needed to define this case definition are:

  • meal
  • onset_datetime
  • diarrhoea
  • bloody
  • vomiting

For the case definition, we are primarily interested in three symptoms:

  • diarrhoea (without blood)
  • bloody (diarrhoea with blood)
  • vomiting

However, these are not the only symptoms in the data set. When creating the case definition, it would be easier to refer to these three symptoms if there was one column, indicating whether people had those symptoms or not. We can call this column gastrosymptoms.

Next, we created a column for the case definition, in which we defined all the respondents as either cases, non-cases or NA.

We can define non-cases as those who attended the meal, but didn’t develop any gastro symptoms, or if they did, developed them before the meal took place.

We can exclude (define as NA) respondents that answered the survey, but did not attend the meal.

Incubation times

A suitable incubation period to use in the case definition can be defined by calculating the time between exposure (to the meal) and onset of symptoms, and then looking at the distribution of these time differences. In this outbreak, incubation periods are easy to calculate, because everyone was exposed at (roughly) the same time and on the same day (eating the meal at the school dinner party).

We can see that the onset dates of all cases were from 11 to 13 November 2006 inclusive; this is why case numbers didn’t change when we updated the case definition.

Exclusions

Ultimately, the investigation team decided to remove respondents that did not meet the definition for a case or a non-case from the data set prior to analysis.

Descriptive Analyses

Instructions

In this section, participants should fill in ___ the proper function/command.

Goals

In this session, we will perform some descriptive analysis. Typically, we would describe the outbreak in terms of time, place and person. In this data set, we don’t have geospatial information, but we do have time (onset dates and times, incubation periods) and person (age, sex, clinical symptoms). You will:

  • Describe cases by age and sex
  • Describe case symptoms
  • Describe and illustrate incubation periods
  • Create an epidemic curve

Many of the descriptive features can be illustrated graphically, for which we will use the package ggplot2.

In theory, you create an analysis plan before data collection - it is an important way to ensure you collect all the data you need and that you only collect the data you need (think: data protection!). However, during outbreak investigations, you may need to deploy questionnaires as soon as possible and before a plan has been developed. In these cases, your experience will be an important resource to fall back on.

In your analysis plan you should create a document to include:

  • research question(s) and hypothesis
  • dataset(s) to be used
  • inclusion / exclusion criteria for records you will use in your analysis
  • list of variables that will be used in the main analysis;
    • outcome variable (being a case)
    • main exposure variables (e.g. food consumed)
    • stratifying variables (e.g. age, sex, group, class)
  • statistical methods
  • key mock-up tables including title, labels and expected format for results
    • univariable
    • bivariable
    • stratified

Incubation period

The incubation period of a disease is the time from an exposure resulting in infection until the onset of disease. Because infection usually occurs immediately after exposure, the incubation period is generally the duration of the period from the onset of infection to the onset of disease. (6)

In the previous section, we calculated incubation period by subtracting the meal date and time from the date and time of symptom onset. This gave us the incubation period in hours. We can now look at this on a graph:

Plot of incubation period:

Instructions

Produce a histogram of the incubation period variable (with missing values dropped out), using the ggplot function.

Set the following labels:

  • x axis: “Incubation period in 6-hour bins”

  • y axis: “Number of cases”

incplot <- linelist |>  
  
  # Remove missing values:
  drop_na(incubation) |> 
  
  # Create an empty ggplot frame setting `incubation` as aesthetic:
  ___(___(x = ___)) +
  
  # Add a histogram of incubation using the appropriate geom_*:
  ___(
    # Set bin widths to 6 hours:
    binwidth = 6
  ) +
  
  # Adapt scale to better fit data
  scale_x_continuous(breaks = seq(0, 48, 6)) + 
  
  # Set the required labs to the x and y axes:
  ___(
    x = ___,
    y = ___
  )

# Print plot:
incplot
Code
incplot <- linelist |>  
  
  # Remove missing values:
  drop_na(incubation) |> 
  
  # Create an empty ggplot frame:
  ggplot(aes(x = incubation)) +
  
  # Add a histogram of incubation:
  geom_histogram(
    # Set bin widths to 6 hours:
    binwidth = 6
  ) +
  
  # Adapt scale to better fit data
  scale_x_continuous(breaks = seq(0, 48, 6)) + 
  
  # Label x and y axes:
  labs(
    x = "Incubation period in 6-hour bins",
    y = "Number of cases"
  )

# Print plot:
incplot

Epidemic curve

To create an epicurve, we are going to use ggplot2. ggplot2 is a versatile package that helps create beautiful visualizations in R. There are plenty good in-depth guides to the package, for example: The Epidemiologist R Handbook Chapter on ggplot and ggplot2 - Elegant Graphics for Data Analysis.

First, we can create an epicurve for the date of onset, limiting the input data to cases:

Instructions

Produce a daily epicurve using a barplot of the dayonset variable (considering case only, and with missing values dropped out), using the ggplot function.

Set the following labels:

  • x axis: “Date of onset”

  • y axis: “Number of cases”

# Fetch data:
epicurve_date <- linelist |> 
  
  # Filter for cases where dayonset is not missing:
  filter(
    case == TRUE,
    !is.na(dayonset)
  ) ___
  
  # Add dayonset to ggplot aesthetic:
  ggplot(aes(x = dayonset)) ___ 
  
  # Add the proper `geom_*` for a barplot:
  ___() +
  
  # Adapt scale to data and adjust axis label angle:
  scale_x_datetime(
    date_breaks = "1 day",
    labels = scales::label_date_short()) +
  
  # Update x and y axis labels:
  ___(
    ___ = "Date of onset", 
    ___ = "Number of cases"
  ) +
  
  # Remove unnecessary grid lines:
  theme_bw()

# Print epicurve:
epicurve_date
Code
# Fetch data:
epicurve_date <- linelist |> 
  
  # Filter for cases where dayonset is not missing:
  filter(
    case == TRUE,
    !is.na(dayonset)
  ) |> 
  
  # Add dayonset to ggplot aesthetic:
  ggplot(aes(x = dayonset)) + 
  
  # Add the proper `geom_*` for a barplot:
  geom_bar() +
  
  # Adapt scale to data and adjust axis label angle:
  scale_x_datetime(
    date_breaks = "1 day",
    labels = scales::label_date_short()) +
  
  # Update x and y axis labels:
  labs(x = "Date of onset", 
       y = "Number of cases") +
  
  # Remove unnecessary grid lines:
  theme_bw()

# Print epicurve:
epicurve_date

As you can see, the epicurve is quite crude, because most cases had their disease onset on 12 November. To increase resolution, we could enhance the time information to include both day and hour.

During the data preparation session, we already concatenated the variables dayonset and starthour together and formatted them with the lubridate package to create a new date-time variable called onset_datetime. We can now create the epicurve with this variable:

Instructions

Produce a 6-hours epicurve using a barplot of the onset_datetime variable (considering case only, and with missing values dropped out), using the ggplot function.

Set the following labels:

  • x axis: “Date of onset”

  • y axis: “Number of cases”

# Fetch data:
epicurve_hour <- linelist |> 
  
  # Filter for cases where onset_datetime is not missing:
  filter(
    onset_datetime == TRUE,
    !___(onset_datetime)
  ) ___
  
  # Add onset_datetime to ggplot aesthetic:
  ggplot(aes(___ = onset_datetime)) ___
  
  # Add the proper `geom_*` for a barplot:
  ___() +
  
  # Adjust x axis scales to a suitable unit:
  scale_x_datetime(
    date_breaks = "6 hours", 
    labels = scales::label_date_short()) +
  
  # Update x and y axis labels:
  ___(
    ___ = "Date and time of onset", 
    ___ = "Number of cases"
  ) +

  
  # Remove unnecessary grid lines:
  theme_bw()

# Print epicurve:
epicurve_hour
Code
# Fetch data:
epicurve_hour <- linelist |> 
  
  # Filter for cases where dayonset is not missing:
  filter(
    case == TRUE,
    !is.na(onset_datetime)
  ) |> 
  
  # Add onset_datetime to ggplot aesthetic:
  ggplot(aes(x = onset_datetime)) + 
  
  # Add the proper `geom_*` for a barplot:
  geom_bar() +
  
  # Adjust x axis scales to a suitable unit:
  scale_x_datetime(
    date_breaks = "6 hours", 
    labels = scales::label_date_short()) +
  
  # Update x and y axis labels:
  labs(
    x = "Date and time of onset", 
    y = "Number of cases"
  ) +

  # Remove unnecessary grid lines:
  theme_bw()

# Print epicurve:
epicurve_hour

Finally, we could compare the epicurve between the sexes and additionally investigate how teachers versus students were distributed. Here, fill adds an additional variable to be displayed in the plot: sex is going to determine the fill-colour of our bars. facet_wrap splits the graph in two: one each for the two levels of group.

We will also use the function str_glue() to add the total number of cases to the sub-title of the plot. str_glue() is a very useful function that allows you to dynamically create a summary statistic from your data within some normal text.

Instructions

Produce a 6-hours epicurve using a barplot of the onset_datetime variable (considering case only, and with missing values dropped out), using the ggplot function. Superimpose the stratification by sex, and split the chart in two base on the group variable.

Set the following labels:

  • x axis: “Date of onset”

  • y axis: “Number of cases”

  • fill legend title: “Sex”

# Fetch data:
epicurve_strata <- linelist |> 
  
  # Filter for cases where onset_datetime is not missing:
  filter(
    ___ == TRUE,
    !___(___)
  ) |> 
  
  # Add factor onset_day to ggplot aesthetic:
  ggplot(aes(___ = onset_datetime)) + 
  
  # Add the proper `geom_*` for an "overall" barplot:
  ___() +
  
  # Superimpose the same `geom_*` for a barplot
  # stratified by sex:
  ___(___(fill = ___), position = "dodge") +
  
  # Adjust x axis scales to a suitable unit:
  scale_x_datetime(
    date_breaks = "6 hours", 
    labels = scales::label_date_short()) +
  
  # Stratify by group:
  facet_wrap("group", nrow = 2) +
  
  # Update x and y axis labels:
  ___(
    ___ = "Date and time of onset", 
    ___ = "Number of cases", 
    ___ = "Sex", 
    title = "Epicurve of the outbreak, overall (gray bars) and stratified by sex (coloured bars)",
    subtitle = str_glue(
      "Copenhagen, November 2006, N = {sum(linelist$case)}"
    )
  ) +

  # Add theme:
  theme_bw()

# Print epicurve:
epicurve_strata
Code
# Fetch data:
epicurve_strata <- linelist |> 
  
  # Filter for cases where onset_datetime is not missing:
  filter(
    case == TRUE,
    !is.na(onset_datetime)
  ) |> 
  
  # Add factor onset_day to ggplot aesthetic:
  ggplot(aes(x = onset_datetime)) + 
  
  # Add the proper `geom_*` for an "overall" barplot:
  geom_bar() +
  
  # Superimpose the same `geom_*` for a barplot
  # stratified by sex:
  geom_bar(aes(fill = sex), position = "dodge") +
  
  # Adjust x axis scales to a suitable unit:
  scale_x_datetime(
    date_breaks = "6 hours", 
    labels = scales::label_date_short()) +
  
  # Stratify by group:
  facet_wrap("group", nrow = 2) +
  
  # Update x and y axis labels:
  labs(
    x = "Date and time of onset", 
    y = "Number of cases", 
    fill = "Sex", 
    title = "Epicurve of the outbreak, overall (gray bars) and stratified by sex (coloured bars)",
    subtitle = str_glue(
      "Copenhagen, November 2006, N = {sum(linelist$case)}"
    )
  ) +

  # Add theme:
  theme_bw()

# Print epicurve:
epicurve_strata

References

1.
Wickham H, Chang W, Henry L, Pedersen TL, Takahashi K, Wilke C, et al. ggplot2: Create elegant data visualisations using the grammar of graphics [Internet]. 2023. Available from: https://ggplot2.tidyverse.org
2.
Becker J, Chan C, Schoch D, Leeper TJ. Rio: A swiss-army knife for data i/o [Internet]. 2023. Available from: https://github.com/gesistsa/rio
3.
EPIET/OutbreakInvestigation [Internet]. Available from: https://github.com/EPIET/OutbreakInvestigation
4.
Wickham H. Tidyverse: Easily install and load the tidyverse [Internet]. 2023. Available from: https://tidyverse.tidyverse.org
5.
Müller K. Here: A simpler way to find your files [Internet]. 2020. Available from: https://here.r-lib.org/
6.
Rothman KJ, Greenland S, Lash TL. Modern epidemiology [Internet]. Third edition. Philadelphia: Wolters Kluwer Health/Lippincott Williams & Wilkins Philadelphia; 2008. Available from: http://www.r2library.com/public/ResourceDetail.aspx?authCheck=true&resid=2127