Day One - Assessment

UBEP’s R training for supervisors

Visualization

Learning Objectives

At the end of the assessment, participants should have demonstrated their ability to: base R, and project management and organization

Create an RStudio project, understanding it’s advantage against standard folders organization. Activate {renv} within the project, and manage it using it’s three main functions (status, restore, and snapshot ). (R-renv?)
Understand and use a folder organization to navigate input (e.g., data-raw/, data/, R/) and output (e.g., output/) files. Ability to navigate them using {here} (here). (1)

Preamble

Instructions

This section is for context only. Nothing has to be done in this section by the participants.

This assessment is the first of four ones regarding the UBEP’s R training for supervisors for the ECDC.

To solve the exercise follow the text and find the assessments in which you need to fill the ___ where required in the code. Next, go to the corresponding section of the R script solution.R and try/run sequentially your code.

All the exercise are presented in a tabset panel with a tab containing all the missing parts, and a tab with the solved, completed (including output) code. This version of the file has the solution exposed.

You can access to a dedicated R/RStudio environment on Posit Cloud at connecting here. You need to create a free account on Posit Cloud, and next accept to join the “R training for supervisors” workspace. Inside that space, you can enter in the “day-1” project, and find inside all the data, script and resources useful to complete the assessment.

The text and examples in the present document are UBEP’s variation/adaptation from the ECDC EPIET Outbreak Investigation, that can be found on GitHub at https://github.com/EPIET/OutbreakInvestigation.(2)

The present work is released under the GPL-3 License.

Environment preparation

Instructions

In this section, participants should attach to the R session the following packages using the library function: (4)

tidyverse
here
rio

For the moment, participants should not worry about code they don’t understand. We will explain it later during the course.

Task: fill in ___ the proper function/command

___(tidyverse)
___(here)
___(rio)

# the file to be imported is in the data-raw folder and it is called Copenhagen_clean.xlsx
linelist <- here("data-raw", "Copenhagen_clean.xlsx") |>  
  import() |>  
  mutate(across(where(is.character), fct))

Code

library(tidyverse)
library(here)
library(rio)

linelist <- here("data-raw", "Copenhagen_clean.xlsx") |> 
  import() |> 
  mutate(across(where(is.character), fct))

Data preparation

Instructions

This section is for context only. Nothing has to be done in this section by the participants.

The Alert

On November 14th 2006 the director of a high school in Greater Copenhagen, Denmark, contacted the regional public health authorities to inform them about an outbreak of diarrhoea and vomiting among participants from a school dinner party held on the 11th of November 2006. Almost all students and teachers of the school (750 people) attended the party.

The first people fell ill the same night and by 14 November, the school had received reports of diarrhoeal illness from around 200 - 300 students and teachers, many of whom also reported vomiting.

Your mission

Your group has been tasked with investigating this outbreak; you have just received the information above.

The epidemiologists in the outbreak team decided to perform a retrospective cohort study in order to identify the food item that was the vehicle of the outbreak. The cohort was defined as students and teachers who had attended the party at the high school on 11th of November 2006.

A questionnaire was designed to conduct a survey on food consumption and on presentation of the illness. Information about the survey and a link to the questionnaire was circulated to students and teachers via the school’s intranet with the request that everyone who attended the school party on 11th of November 2006 should fill in the questionnaire.

Practically all students and teachers check the intranet on a daily basis, because it is the school’s main communication channel for information about courses, homework assignments, cancellation of lessons etc. The school’s intranet was accessible for ill students or teachers from home so that everyone in the cohort could potentially participate and the response rate could be maximised. Additionally, the information about the investigation was also displayed on the screen in the main hall of the school.

Exploring the data

Before you can begin analysing the data, you will need to become familiar with its contents and check for any errors, or values that don’t make sense.

Preprocessing

For this assessment all the data preprocessing is already done. Participants will find the data already in their environment named as linelist.

In the following:

The variables are grouped by type (e.g. character or numeric)
n_missing shows the number of observations with missing values for each variable
complete_rate shows the proportion of observations that are not missing
min and max for character variables refer to the number of characters per string
p0 and p100 for numeric variables refer to minimum and maximum values, respectively.
For more details, see the help file by typing ?skimr::skim in the console

Code

skimr::skim(linelist)

Data summary
Name	linelist
Number of rows	384
Number of columns	46
_______________________
Column type frequency:
factor	3
logical	24
numeric	16
POSIXct	3
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
sex	0	1.00	FALSE	2	fem: 219, mal: 165
group	0	1.00	FALSE	2	stu: 369, tea: 15
class	35	0.91	FALSE	3	1: 134, 3: 112, 2: 103

Variable type: logical

skim_variable	n_missing	complete_rate	mean	count
diarrhoea	132	0.66	0.82	TRU: 206, FAL: 46
bloody	190	0.51	0.03	FAL: 189, TRU: 5
vomiting	169	0.56	0.31	FAL: 149, TRU: 66
abdo	142	0.63	0.86	TRU: 207, FAL: 35
nausea	160	0.58	0.75	TRU: 169, FAL: 55
fever	213	0.45	0.26	FAL: 127, TRU: 44
headache	164	0.57	0.62	TRU: 137, FAL: 83
jointpain	196	0.49	0.15	FAL: 159, TRU: 29
meal	0	1.00	1.00	TRU: 384
tuna	4	0.99	0.72	TRU: 272, FAL: 108
shrimps	5	0.99	0.67	TRU: 255, FAL: 124
green	18	0.95	0.59	TRU: 216, FAL: 150
veal	3	0.99	0.89	TRU: 340, FAL: 41
pasta	3	0.99	0.89	TRU: 338, FAL: 43
rocket	12	0.97	0.57	TRU: 211, FAL: 161
sauce	30	0.92	0.42	FAL: 205, TRU: 149
bread	6	0.98	0.91	TRU: 345, FAL: 33
champagne	13	0.97	0.87	TRU: 323, FAL: 48
beer	18	0.95	0.78	TRU: 286, FAL: 80
redwine	38	0.90	0.23	FAL: 265, TRU: 81
whitewine	19	0.95	0.73	TRU: 266, FAL: 99
gastrosymptoms	0	1.00	0.56	TRU: 216, FAL: 168
ate_anything	0	1.00	1.00	TRU: 384
case	0	1.00	0.56	TRU: 216, FAL: 168

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
id	0	1.00	215.80	123.65	1	111.75	216.5	318.25	435	▇▇▇▇▇
age	0	1.00	18.30	6.13	15	16.00	17.0	18.00	65	▇▁▁▁▁
starthour	166	0.57	12.52	4.94	3	9.00	9.0	15.00	21	▁▇▁▆▃
tunaD	4	0.99	1.32	1.00	0	0.00	2.0	2.00	3	▆▅▁▇▂
shrimpsD	5	0.99	1.35	1.04	0	0.00	2.0	2.00	3	▆▂▁▇▂
greenD	18	0.95	1.14	1.05	0	0.00	1.0	2.00	3	▇▂▁▇▂
vealD	2	0.99	1.83	0.90	0	1.00	2.0	2.00	3	▂▃▁▇▃
pastaD	3	0.99	1.81	0.91	0	1.00	2.0	2.00	3	▂▃▁▇▃
rocketD	12	0.97	1.08	1.06	0	0.00	1.0	2.00	3	▇▂▁▆▂
sauceD	30	0.92	0.83	1.06	0	0.00	0.0	2.00	3	▇▁▁▃▁
breadD	6	0.98	1.75	0.71	0	2.00	2.0	2.00	3	▁▂▁▇▁
champagneD	13	0.97	1.37	0.93	0	1.00	1.0	2.00	3	▂▇▁▂▃
beerD	23	0.94	1.95	1.23	0	1.00	3.0	3.00	3	▃▂▁▂▇
redwineD	40	0.90	0.45	0.92	0	0.00	0.0	0.00	3	▇▁▁▁▁
whitewineD	24	0.94	1.58	1.21	0	0.00	2.0	3.00	3	▆▅▁▅▇
incubation	168	0.56	19.03	7.93	3	15.00	15.0	21.00	45	▂▇▇▁▁

Variable type: POSIXct

skim_variable	n_missing	complete_rate	min	max	median	n_unique
dayonset	164	0.57	2006-11-11 00:00:00	2006-11-13 00:00:00	2006-11-12 00:00:00	3
onset_datetime	164	0.57	2006-11-11 00:00:00	2006-11-13 15:00:00	2006-11-12 09:00:00	9
meal_datetime	0	1.00	2006-11-11 18:00:00	2006-11-11 18:00:00	2006-11-11 18:00:00	1

Variables of interest:

Your exposure of interest is the school dinner party held on 11 November 2006 at 18:00. You may have noticed while skimming the data, that there is a binary variable called meal. This variable indicates whether people attended the school dinner party and ate a meal there, or not.

Other variables that will be helpful to include in your case definition are onset_datetime (hint: check that case onset date/time is after exposure) and symptom variables (hint: not everyone on the linelist fell ill). The symptoms included in the data set are:

abdo (abdominal pain)
diarrhoea
bloody (bloody diarrhoea)
nausea
vomiting
fever
headache
jointpain

Defining a case

A case was defined as a person who:

attended the school dinner on 11 November 2006 (i.e. is on the linelist)
ate a meal at the school dinner (i.e. was exposed)
fell ill after the start of the meal
fell ill no later than two days after the school dinner
suffered from diarrhoea with or without blood, or vomiting

Non cases were defined as people who:

attended the school dinner on 11 November 2006 (i.e. are on the linelist)
ate a meal at the school dinner (i.e. were exposed)
did not fall ill within the time period of interest
did not develop diarrhoea (with or without blood) or vomiting

For the sake of the analysis, we excluded any people from the cohort who didn’t eat at the dinner, because we specifically hypothesise a food item to be the vehicle of infection in this outbreak. Excluding people reduces the sample size and therefore the power slightly, but the investigators considered that this would increase specificity.

The variables needed to define this case definition are:

meal
onset_datetime
diarrhoea
bloody
vomiting

For the case definition, we are primarily interested in three symptoms:

diarrhoea (without blood)
bloody (diarrhoea with blood)
vomiting

However, these are not the only symptoms in the data set. When creating the case definition, it would be easier to refer to these three symptoms if there was one column, indicating whether people had those symptoms or not. We can call this column gastrosymptoms.

Next, we created a column for the case definition, in which we defined all the respondents as either cases, non-cases or NA.

We can define non-cases as those who attended the meal, but didn’t develop any gastro symptoms, or if they did, developed them before the meal took place.

We can exclude (define as NA) respondents that answered the survey, but did not attend the meal.

Incubation times

A suitable incubation period to use in the case definition can be defined by calculating the time between exposure (to the meal) and onset of symptoms, and then looking at the distribution of these time differences. In this outbreak, incubation periods are easy to calculate, because everyone was exposed at (roughly) the same time and on the same day (eating the meal at the school dinner party).

We can see that the onset dates of all cases were from 11 to 13 November 2006 inclusive; this is why case numbers didn’t change when we updated the case definition.

Exclusions

Ultimately, the investigation team decided to remove respondents that did not meet the definition for a case or a non-case from the data set prior to analysis.

References

Müller K. Here: A simpler way to find your files [Internet]. 2020. Available from: https://here.r-lib.org/

EPIET/OutbreakInvestigation [Internet]. Available from: https://github.com/EPIET/OutbreakInvestigation

Wickham H. Tidyverse: Easily install and load the tidyverse [Internet]. 2023. Available from: https://tidyverse.tidyverse.org

Becker J, Chan C, Schoch D, Leeper TJ. Rio: A swiss-army knife for data i/o [Internet]. 2023. Available from: https://github.com/gesistsa/rio