Day One - Assessment

UBEP’s R training for supervisors

Visualization

Learning Objectives

At the end of the assessment, participants should have demonstrated their ability in base R, and in project management and organization. In particular, they should be able to:

  • Create an RStudio project, understanding its advantages over a standard folder organization. Activate {renv} within the project, and manage it using its three main functions (status, restore, and snapshot).

  • Understand and use a folder organization that separates input (e.g., data-raw/, data/, R/) from output (e.g., output/) files, and navigate it using {here}. (1)
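
As a minimal sketch of the second objective, the snippet below builds project-relative paths with {here}. The output file name summary.csv is hypothetical and only serves as an illustration. The three {renv} functions named above are listed as comments rather than executed, because they modify the project library.

```r
library(here)

# here() resolves paths against the project root; when no project root
# is detected it falls back to the current working directory.
raw_path <- here("data-raw", "Copenhagen_clean.xlsx")
out_path <- here("output", "summary.csv")  # hypothetical output file

print(raw_path)
print(out_path)

# The three main {renv} functions, to be run interactively inside the project:
# renv::status()    # are the project library and renv.lock in sync?
# renv::snapshot()  # record the package versions currently in use in renv.lock
# renv::restore()   # reinstall the package versions recorded in renv.lock
```

Because the paths are built relative to the project root, the same script can run unchanged on another machine or in Posit Cloud.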

Preamble

Instructions

This section is for context only. Nothing has to be done in this section by the participants.

This assessment is the first of four for UBEP’s R training for supervisors for the ECDC.

To solve the exercises, follow the text and find the tasks in which you need to fill in the ___ placeholders in the code. Then go to the corresponding section of the R script solution.R and run your code sequentially.

All the exercises are presented in a tabset panel, with one tab containing the code with the missing parts and one tab containing the solved, completed code (including output). This version of the file has the solutions exposed.

You can access a dedicated R/RStudio environment on Posit Cloud. You need to create a free Posit Cloud account, and then agree to join the “R training for supervisors” workspace. Inside that workspace, open the “day-1” project, which contains all the data, scripts, and resources needed to complete the assessment.

The text and examples in the present document are UBEP’s variation/adaptation of the ECDC EPIET Outbreak Investigation, which can be found on GitHub at https://github.com/EPIET/OutbreakInvestigation. (2)

The present work is released under the GPL-3 License.

Environment preparation

Instructions

In this section, participants should attach the following packages to the R session using the library function: (4)

  • tidyverse

  • here

  • rio

For the moment, participants should not worry about code they don’t understand. We will explain it later during the course.

Task: replace each ___ with the proper function/command

___(tidyverse)
___(here)
___(rio)

# the file to be imported is in the data-raw folder and it is called Copenhagen_clean.xlsx
linelist <- here("data-raw", "Copenhagen_clean.xlsx") |>  
  import() |>  
  mutate(across(where(is.character), fct))

Solution:

library(tidyverse)
library(here)
library(rio)

linelist <- here("data-raw", "Copenhagen_clean.xlsx") |> 
  import() |> 
  mutate(across(where(is.character), fct))
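
For readers curious about the last step of the pipeline, the sketch below applies the same mutate(across(where(is.character), fct)) call to a small toy data frame. The toy values are hypothetical and are not taken from the real linelist.

```r
library(dplyr)
library(forcats)

# Toy data (hypothetical values), standing in for the imported linelist
toy <- data.frame(
  sex   = c("female", "male", "female"),
  group = c("student", "teacher", "student"),
  age   = c(16, 45, 17)
)

# Convert every character column to a factor; other columns are untouched
toy_fct <- toy |>
  mutate(across(where(is.character), fct))

str(toy_fct)
```

Here where(is.character) selects the columns to transform and fct() creates factors whose levels follow the order of first appearance.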

Data preparation

Instructions

This section is for context only. Nothing has to be done in this section by the participants.

The Alert

On 14 November 2006, the director of a high school in Greater Copenhagen, Denmark, contacted the regional public health authorities to inform them about an outbreak of diarrhoea and vomiting among participants in a school dinner party held on 11 November 2006. Almost all students and teachers of the school (750 people) attended the party.

The first people fell ill the same night and, by 14 November, the school had received reports of diarrhoeal illness from around 200 to 300 students and teachers, many of whom also reported vomiting.

Your mission

Your group has been tasked with investigating this outbreak; you have just received the information above.

The epidemiologists in the outbreak team decided to perform a retrospective cohort study in order to identify the food item that was the vehicle of the outbreak. The cohort was defined as students and teachers who had attended the party at the high school on 11th of November 2006.

A questionnaire was designed to conduct a survey on food consumption and on presentation of the illness. Information about the survey and a link to the questionnaire were circulated to students and teachers via the school’s intranet, with the request that everyone who attended the school party on 11 November 2006 should fill in the questionnaire.

Practically all students and teachers check the intranet on a daily basis, because it is the school’s main communication channel for information about courses, homework assignments, cancellation of lessons etc. The school’s intranet was accessible for ill students or teachers from home so that everyone in the cohort could potentially participate and the response rate could be maximised. Additionally, the information about the investigation was also displayed on the screen in the main hall of the school.

Exploring the data

Before you can begin analysing the data, you will need to become familiar with its contents and check for any errors, or values that don’t make sense.

Preprocessing

For this assessment, all the data preprocessing has already been done. Participants will find the data already loaded in their environment as linelist.

In the following summary:

  • The variables are grouped by type (e.g. character or numeric)

  • n_missing shows the number of observations with missing values for each variable

  • complete_rate shows the proportion of observations that are not missing

  • min and max for character variables refer to the number of characters per string

  • p0 and p100 for numeric variables refer to minimum and maximum values, respectively.

  • For more details, see the help file by typing ?skimr::skim in the console

skimr::skim(linelist)
Data summary
Name linelist
Number of rows 384
Number of columns 46
_______________________
Column type frequency:
factor 3
logical 24
numeric 16
POSIXct 3
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
sex 0 1.00 FALSE 2 fem: 219, mal: 165
group 0 1.00 FALSE 2 stu: 369, tea: 15
class 35 0.91 FALSE 3 1: 134, 3: 112, 2: 103

Variable type: logical

skim_variable n_missing complete_rate mean count
diarrhoea 132 0.66 0.82 TRU: 206, FAL: 46
bloody 190 0.51 0.03 FAL: 189, TRU: 5
vomiting 169 0.56 0.31 FAL: 149, TRU: 66
abdo 142 0.63 0.86 TRU: 207, FAL: 35
nausea 160 0.58 0.75 TRU: 169, FAL: 55
fever 213 0.45 0.26 FAL: 127, TRU: 44
headache 164 0.57 0.62 TRU: 137, FAL: 83
jointpain 196 0.49 0.15 FAL: 159, TRU: 29
meal 0 1.00 1.00 TRU: 384
tuna 4 0.99 0.72 TRU: 272, FAL: 108
shrimps 5 0.99 0.67 TRU: 255, FAL: 124
green 18 0.95 0.59 TRU: 216, FAL: 150
veal 3 0.99 0.89 TRU: 340, FAL: 41
pasta 3 0.99 0.89 TRU: 338, FAL: 43
rocket 12 0.97 0.57 TRU: 211, FAL: 161
sauce 30 0.92 0.42 FAL: 205, TRU: 149
bread 6 0.98 0.91 TRU: 345, FAL: 33
champagne 13 0.97 0.87 TRU: 323, FAL: 48
beer 18 0.95 0.78 TRU: 286, FAL: 80
redwine 38 0.90 0.23 FAL: 265, TRU: 81
whitewine 19 0.95 0.73 TRU: 266, FAL: 99
gastrosymptoms 0 1.00 0.56 TRU: 216, FAL: 168
ate_anything 0 1.00 1.00 TRU: 384
case 0 1.00 0.56 TRU: 216, FAL: 168

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1.00 215.80 123.65 1 111.75 216.5 318.25 435 ▇▇▇▇▇
age 0 1.00 18.30 6.13 15 16.00 17.0 18.00 65 ▇▁▁▁▁
starthour 166 0.57 12.52 4.94 3 9.00 9.0 15.00 21 ▁▇▁▆▃
tunaD 4 0.99 1.32 1.00 0 0.00 2.0 2.00 3 ▆▅▁▇▂
shrimpsD 5 0.99 1.35 1.04 0 0.00 2.0 2.00 3 ▆▂▁▇▂
greenD 18 0.95 1.14 1.05 0 0.00 1.0 2.00 3 ▇▂▁▇▂
vealD 2 0.99 1.83 0.90 0 1.00 2.0 2.00 3 ▂▃▁▇▃
pastaD 3 0.99 1.81 0.91 0 1.00 2.0 2.00 3 ▂▃▁▇▃
rocketD 12 0.97 1.08 1.06 0 0.00 1.0 2.00 3 ▇▂▁▆▂
sauceD 30 0.92 0.83 1.06 0 0.00 0.0 2.00 3 ▇▁▁▃▁
breadD 6 0.98 1.75 0.71 0 2.00 2.0 2.00 3 ▁▂▁▇▁
champagneD 13 0.97 1.37 0.93 0 1.00 1.0 2.00 3 ▂▇▁▂▃
beerD 23 0.94 1.95 1.23 0 1.00 3.0 3.00 3 ▃▂▁▂▇
redwineD 40 0.90 0.45 0.92 0 0.00 0.0 0.00 3 ▇▁▁▁▁
whitewineD 24 0.94 1.58 1.21 0 0.00 2.0 3.00 3 ▆▅▁▅▇
incubation 168 0.56 19.03 7.93 3 15.00 15.0 21.00 45 ▂▇▇▁▁

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
dayonset 164 0.57 2006-11-11 00:00:00 2006-11-13 00:00:00 2006-11-12 00:00:00 3
onset_datetime 164 0.57 2006-11-11 00:00:00 2006-11-13 15:00:00 2006-11-12 09:00:00 9
meal_datetime 0 1.00 2006-11-11 18:00:00 2006-11-11 18:00:00 2006-11-11 18:00:00 1
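
To make the n_missing and complete_rate columns concrete, here is a minimal base R sketch that reproduces both quantities by hand on a toy data frame. The values are hypothetical and are not taken from linelist.

```r
# Toy data (hypothetical values) with some missing entries
toy <- data.frame(
  diarrhoea = c(TRUE, NA, FALSE, TRUE),
  age       = c(16, 17, NA, NA)
)

# Number of missing values per column
n_missing <- colSums(is.na(toy))

# Proportion of non-missing values per column
complete_rate <- colMeans(!is.na(toy))

print(n_missing)
print(complete_rate)
```

In this toy example, diarrhoea has one missing value out of four (complete rate 0.75) and age has two (complete rate 0.5).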

Variables of interest:

Your exposure of interest is the school dinner party held on 11 November 2006 at 18:00. You may have noticed while skimming the data that there is a binary variable called meal. This variable indicates whether or not people attended the school dinner party and ate a meal there.

Other variables that will be helpful to include in your case definition are onset_datetime (hint: check that case onset date/time is after exposure) and symptom variables (hint: not everyone on the linelist fell ill). The symptoms included in the data set are:

  • abdo (abdominal pain)
  • diarrhoea
  • bloody (bloody diarrhoea)
  • nausea
  • vomiting
  • fever
  • headache
  • jointpain
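
The first hint above (checking that onset comes after exposure) could be sketched as follows. The toy data frame uses the same column names as linelist, but its values are hypothetical.

```r
library(dplyr)

# Toy data (hypothetical values); the meal time is the same for everyone
toy <- data.frame(
  id = 1:3,
  meal_datetime = as.POSIXct("2006-11-11 18:00:00", tz = "UTC"),
  onset_datetime = as.POSIXct(
    c("2006-11-11 15:00:00", "2006-11-12 09:00:00", NA),
    tz = "UTC"
  )
)

# TRUE when onset is after the meal, FALSE when before, NA when onset is missing
toy <- toy |>
  mutate(onset_after_meal = onset_datetime > meal_datetime)

print(toy)
```

A missing onset propagates to NA here, which is a reminder that not everyone on the linelist fell ill.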

Defining a case

A case was defined as a person who:

  • attended the school dinner on 11 November 2006 (i.e. is on the linelist)
  • ate a meal at the school dinner (i.e. was exposed)
  • fell ill after the start of the meal
  • fell ill no later than two days after the school dinner
  • suffered from diarrhoea with or without blood, or vomiting

Non-cases were defined as people who:

  • attended the school dinner on 11 November 2006 (i.e. are on the linelist)
  • ate a meal at the school dinner (i.e. were exposed)
  • did not fall ill within the time period of interest
  • did not develop diarrhoea (with or without blood) or vomiting

For the sake of the analysis, we excluded any people from the cohort who didn’t eat at the dinner, because we specifically hypothesise a food item to be the vehicle of infection in this outbreak. Excluding people reduces the sample size and therefore the power slightly, but the investigators considered that this would increase specificity.

The variables needed for this case definition are:

  • meal
  • onset_datetime
  • diarrhoea
  • bloody
  • vomiting

For the case definition, we are primarily interested in three symptoms:

  • diarrhoea (without blood)
  • bloody (diarrhoea with blood)
  • vomiting

However, these are not the only symptoms in the data set. When creating the case definition, it is easier to refer to these three symptoms if there is a single column indicating whether or not people had any of them. We can call this column gastrosymptoms.
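
One possible way to build such a column is sketched below on toy data with hypothetical values. It assumes that a missing answer counts as "symptom not reported", which is an assumption of this sketch and not necessarily the rule used in the original preprocessing.

```r
library(dplyr)

# Toy data (hypothetical values) with the three symptoms of interest
toy <- data.frame(
  diarrhoea = c(TRUE, FALSE, NA, FALSE),
  bloody    = c(FALSE, FALSE, NA, NA),
  vomiting  = c(NA, FALSE, NA, TRUE)
)

# TRUE if at least one of the three symptoms is reported as TRUE;
# missing values are treated as "not reported" (assumption of this sketch)
toy <- toy |>
  mutate(
    gastrosymptoms = if_any(
      c(diarrhoea, bloody, vomiting),
      ~ !is.na(.x) & .x
    )
  )

print(toy)
```

With this rule the new column never contains NA, so it can be used directly in later conditions.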

Next, we created a column for the case definition, in which we defined all the respondents as either cases, non-cases or NA.

We can define non-cases as those who attended the meal, but didn’t develop any gastro symptoms, or if they did, developed them before the meal took place.

We can exclude (define as NA) respondents that answered the survey, but did not attend the meal.
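
The three rules above (cases, non-cases, and NA) could be expressed with case_when(), as in the following sketch on toy data. The values are hypothetical, the two-day limit on onset is not applied here, and the exact conditions used in the original preprocessing may differ.

```r
library(dplyr)

meal_time <- as.POSIXct("2006-11-11 18:00:00", tz = "UTC")

# Toy data (hypothetical values)
toy <- data.frame(
  meal = c(TRUE, TRUE, TRUE, FALSE),
  gastrosymptoms = c(TRUE, FALSE, TRUE, TRUE),
  onset_datetime = as.POSIXct(
    c("2006-11-12 09:00:00", NA, "2006-11-11 15:00:00", "2006-11-12 03:00:00"),
    tz = "UTC"
  )
)

toy <- toy |>
  mutate(
    has_onset = !is.na(onset_datetime),
    case = case_when(
      # did not eat the meal: excluded
      !meal ~ NA,
      # symptoms with onset at or after the meal: case
      gastrosymptoms & has_onset & onset_datetime >= meal_time ~ TRUE,
      # no symptoms: non-case
      !gastrosymptoms ~ FALSE,
      # symptoms that started before the meal: non-case
      gastrosymptoms & has_onset & onset_datetime < meal_time ~ FALSE
    )
  )

print(toy)
```

Rows that match none of the conditions (for example, symptoms with a missing onset time) are left as NA by case_when().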

Incubation times

A suitable incubation period to use in the case definition can be defined by calculating the time between exposure (to the meal) and onset of symptoms, and then looking at the distribution of these time differences. In this outbreak, incubation periods are easy to calculate, because everyone was exposed at (roughly) the same time and on the same day (eating the meal at the school dinner party).
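
As a minimal sketch of this calculation, the snippet below computes incubation times in hours with difftime(). The onset times are hypothetical toy values, and the 48-hour threshold is one reading of "no later than two days after the school dinner".

```r
# Common exposure time: the school dinner on 11 November 2006 at 18:00
meal_time <- as.POSIXct("2006-11-11 18:00:00", tz = "UTC")

# Toy onset times (hypothetical values)
onset <- as.POSIXct(
  c("2006-11-12 09:00:00", "2006-11-12 15:00:00", "2006-11-13 15:00:00"),
  tz = "UTC"
)

# Incubation time in hours for each person
incubation <- as.numeric(difftime(onset, meal_time, units = "hours"))
print(incubation)  # 15 21 45

# Distribution of the incubation times
print(summary(incubation))

# Flag onsets within 48 hours of the meal (assumed reading of "two days")
within_two_days <- incubation <= 48
print(within_two_days)
```

On the real data, a histogram of incubation would give a quick view of the same distribution.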

We can see that the onset dates of all cases were from 11 to 13 November 2006 inclusive; this is why case numbers didn’t change when we updated the case definition.

Exclusions

Ultimately, the investigation team decided to remove respondents that did not meet the definition for a case or a non-case from the data set prior to analysis.
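
Assuming the case column is coded TRUE for cases, FALSE for non-cases, and NA otherwise, this exclusion could be sketched as follows on toy data with hypothetical values.

```r
library(dplyr)

# Toy data (hypothetical values): NA marks respondents who are neither
# a case nor a non-case
toy <- data.frame(
  id   = 1:4,
  case = c(TRUE, FALSE, NA, TRUE)
)

# Keep only respondents classified as case or non-case
toy_analysis <- toy |>
  filter(!is.na(case))

print(nrow(toy_analysis))  # 3
```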

References

1. Müller K. Here: A simpler way to find your files [Internet]. 2020. Available from: https://here.r-lib.org/
2. EPIET/OutbreakInvestigation [Internet]. Available from: https://github.com/EPIET/OutbreakInvestigation
3. Wickham H. Tidyverse: Easily install and load the tidyverse [Internet]. 2023. Available from: https://tidyverse.tidyverse.org
4. Becker J, Chan C, Schoch D, Leeper TJ. Rio: A swiss-army knife for data i/o [Internet]. 2023. Available from: https://github.com/gesistsa/rio