___(tidyverse)
___(here)
___(rio)
# the file to be imported is in the data-raw folder and it is called Copenhagen_clean.xlsx
<- here("data-raw", "Copenhagen_clean.xlsx") |>
linelist import() |>
mutate(across(where(is.character), fct))
Day One - Assessment
UBEP’s R training for supervisors
Visualization
Preamble
This section is for context only. Nothing has to be done in this section by the participants.
This assessment is the first of four ones regarding the UBEP’s R training for supervisors for the ECDC.
To solve the exercise follow the text and find the assessments in which you need to fill the ___
where required in the code. Next, go to the corresponding section of the R script solution.R
and try/run sequentially your code.
All the exercise are presented in a tabset panel with a tab containing all the missing parts, and a tab with the solved, completed (including output) code. This version of the file has the solution exposed.
You can access to a dedicated R/RStudio environment on Posit Cloud at connecting here. You need to create a free account on Posit Cloud, and next accept to join the “R training for supervisors” workspace. Inside that space, you can enter in the “day-1” project, and find inside all the data, script and resources useful to complete the assessment.
The text and examples in the present document are UBEP’s variation/adaptation from the ECDC EPIET Outbreak Investigation, that can be found on GitHub at https://github.com/EPIET/OutbreakInvestigation.(2)
The present work is released under the GPL-3 License.
Environment preparation
In this section, participants should attach to the R session the following packages using the library
function: (4)
tidyverse
here
rio
For the moment, participants should not worry about code they don’t understand. We will explain it later during the course.
Task: fill in
___
the proper function/command
Code
library(tidyverse)
library(here)
library(rio)
<- here("data-raw", "Copenhagen_clean.xlsx") |>
linelist import() |>
mutate(across(where(is.character), fct))
Data preparation
This section is for context only. Nothing has to be done in this section by the participants.
The Alert
On November 14th 2006 the director of a high school in Greater Copenhagen, Denmark, contacted the regional public health authorities to inform them about an outbreak of diarrhoea and vomiting among participants from a school dinner party held on the 11th of November 2006. Almost all students and teachers of the school (750 people) attended the party.
The first people fell ill the same night and by 14 November, the school had received reports of diarrhoeal illness from around 200 - 300 students and teachers, many of whom also reported vomiting.
Your mission
Your group has been tasked with investigating this outbreak; you have just received the information above.
The epidemiologists in the outbreak team decided to perform a retrospective cohort study in order to identify the food item that was the vehicle of the outbreak. The cohort was defined as students and teachers who had attended the party at the high school on 11th of November 2006.
A questionnaire was designed to conduct a survey on food consumption and on presentation of the illness. Information about the survey and a link to the questionnaire was circulated to students and teachers via the school’s intranet with the request that everyone who attended the school party on 11th of November 2006 should fill in the questionnaire.
Practically all students and teachers check the intranet on a daily basis, because it is the school’s main communication channel for information about courses, homework assignments, cancellation of lessons etc. The school’s intranet was accessible for ill students or teachers from home so that everyone in the cohort could potentially participate and the response rate could be maximised. Additionally, the information about the investigation was also displayed on the screen in the main hall of the school.
Exploring the data
Before you can begin analysing the data, you will need to become familiar with its contents and check for any errors, or values that don’t make sense.
For this assessment all the data preprocessing is already done. Participants will find the data already in their environment named as linelist
.
In the following:
The variables are grouped by type (e.g. character or numeric)
n_missing shows the number of observations with missing values for each variable
complete_rate shows the proportion of observations that are not missing
min
andmax
for character variables refer to the number of characters per stringp0
andp100
for numeric variables refer to minimum and maximum values, respectively.For more details, see the help file by typing
?skimr::skim
in the console
Code
::skim(linelist) skimr
Name | linelist |
Number of rows | 384 |
Number of columns | 46 |
_______________________ | |
Column type frequency: | |
factor | 3 |
logical | 24 |
numeric | 16 |
POSIXct | 3 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
sex | 0 | 1.00 | FALSE | 2 | fem: 219, mal: 165 |
group | 0 | 1.00 | FALSE | 2 | stu: 369, tea: 15 |
class | 35 | 0.91 | FALSE | 3 | 1: 134, 3: 112, 2: 103 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
diarrhoea | 132 | 0.66 | 0.82 | TRU: 206, FAL: 46 |
bloody | 190 | 0.51 | 0.03 | FAL: 189, TRU: 5 |
vomiting | 169 | 0.56 | 0.31 | FAL: 149, TRU: 66 |
abdo | 142 | 0.63 | 0.86 | TRU: 207, FAL: 35 |
nausea | 160 | 0.58 | 0.75 | TRU: 169, FAL: 55 |
fever | 213 | 0.45 | 0.26 | FAL: 127, TRU: 44 |
headache | 164 | 0.57 | 0.62 | TRU: 137, FAL: 83 |
jointpain | 196 | 0.49 | 0.15 | FAL: 159, TRU: 29 |
meal | 0 | 1.00 | 1.00 | TRU: 384 |
tuna | 4 | 0.99 | 0.72 | TRU: 272, FAL: 108 |
shrimps | 5 | 0.99 | 0.67 | TRU: 255, FAL: 124 |
green | 18 | 0.95 | 0.59 | TRU: 216, FAL: 150 |
veal | 3 | 0.99 | 0.89 | TRU: 340, FAL: 41 |
pasta | 3 | 0.99 | 0.89 | TRU: 338, FAL: 43 |
rocket | 12 | 0.97 | 0.57 | TRU: 211, FAL: 161 |
sauce | 30 | 0.92 | 0.42 | FAL: 205, TRU: 149 |
bread | 6 | 0.98 | 0.91 | TRU: 345, FAL: 33 |
champagne | 13 | 0.97 | 0.87 | TRU: 323, FAL: 48 |
beer | 18 | 0.95 | 0.78 | TRU: 286, FAL: 80 |
redwine | 38 | 0.90 | 0.23 | FAL: 265, TRU: 81 |
whitewine | 19 | 0.95 | 0.73 | TRU: 266, FAL: 99 |
gastrosymptoms | 0 | 1.00 | 0.56 | TRU: 216, FAL: 168 |
ate_anything | 0 | 1.00 | 1.00 | TRU: 384 |
case | 0 | 1.00 | 0.56 | TRU: 216, FAL: 168 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
id | 0 | 1.00 | 215.80 | 123.65 | 1 | 111.75 | 216.5 | 318.25 | 435 | ▇▇▇▇▇ |
age | 0 | 1.00 | 18.30 | 6.13 | 15 | 16.00 | 17.0 | 18.00 | 65 | ▇▁▁▁▁ |
starthour | 166 | 0.57 | 12.52 | 4.94 | 3 | 9.00 | 9.0 | 15.00 | 21 | ▁▇▁▆▃ |
tunaD | 4 | 0.99 | 1.32 | 1.00 | 0 | 0.00 | 2.0 | 2.00 | 3 | ▆▅▁▇▂ |
shrimpsD | 5 | 0.99 | 1.35 | 1.04 | 0 | 0.00 | 2.0 | 2.00 | 3 | ▆▂▁▇▂ |
greenD | 18 | 0.95 | 1.14 | 1.05 | 0 | 0.00 | 1.0 | 2.00 | 3 | ▇▂▁▇▂ |
vealD | 2 | 0.99 | 1.83 | 0.90 | 0 | 1.00 | 2.0 | 2.00 | 3 | ▂▃▁▇▃ |
pastaD | 3 | 0.99 | 1.81 | 0.91 | 0 | 1.00 | 2.0 | 2.00 | 3 | ▂▃▁▇▃ |
rocketD | 12 | 0.97 | 1.08 | 1.06 | 0 | 0.00 | 1.0 | 2.00 | 3 | ▇▂▁▆▂ |
sauceD | 30 | 0.92 | 0.83 | 1.06 | 0 | 0.00 | 0.0 | 2.00 | 3 | ▇▁▁▃▁ |
breadD | 6 | 0.98 | 1.75 | 0.71 | 0 | 2.00 | 2.0 | 2.00 | 3 | ▁▂▁▇▁ |
champagneD | 13 | 0.97 | 1.37 | 0.93 | 0 | 1.00 | 1.0 | 2.00 | 3 | ▂▇▁▂▃ |
beerD | 23 | 0.94 | 1.95 | 1.23 | 0 | 1.00 | 3.0 | 3.00 | 3 | ▃▂▁▂▇ |
redwineD | 40 | 0.90 | 0.45 | 0.92 | 0 | 0.00 | 0.0 | 0.00 | 3 | ▇▁▁▁▁ |
whitewineD | 24 | 0.94 | 1.58 | 1.21 | 0 | 0.00 | 2.0 | 3.00 | 3 | ▆▅▁▅▇ |
incubation | 168 | 0.56 | 19.03 | 7.93 | 3 | 15.00 | 15.0 | 21.00 | 45 | ▂▇▇▁▁ |
Variable type: POSIXct
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
dayonset | 164 | 0.57 | 2006-11-11 00:00:00 | 2006-11-13 00:00:00 | 2006-11-12 00:00:00 | 3 |
onset_datetime | 164 | 0.57 | 2006-11-11 00:00:00 | 2006-11-13 15:00:00 | 2006-11-12 09:00:00 | 9 |
meal_datetime | 0 | 1.00 | 2006-11-11 18:00:00 | 2006-11-11 18:00:00 | 2006-11-11 18:00:00 | 1 |
Variables of interest:
Your exposure of interest is the school dinner party held on 11 November 2006 at 18:00. You may have noticed while skimming the data, that there is a binary variable called meal
. This variable indicates whether people attended the school dinner party and ate a meal there, or not.
Other variables that will be helpful to include in your case definition are onset_datetime
(hint: check that case onset date/time is after exposure) and symptom variables (hint: not everyone on the linelist fell ill). The symptoms included in the data set are:
abdo
(abdominal pain)diarrhoea
bloody
(bloody diarrhoea)nausea
vomiting
fever
headache
jointpain
Defining a case
A case was defined as a person who:
- attended the school dinner on 11 November 2006 (i.e. is on the linelist)
- ate a meal at the school dinner (i.e. was exposed)
- fell ill after the start of the meal
- fell ill no later than two days after the school dinner
- suffered from diarrhoea with or without blood, or vomiting
Non cases were defined as people who:
- attended the school dinner on 11 November 2006 (i.e. are on the linelist)
- ate a meal at the school dinner (i.e. were exposed)
- did not fall ill within the time period of interest
- did not develop diarrhoea (with or without blood) or vomiting
For the sake of the analysis, we excluded any people from the cohort who didn’t eat at the dinner, because we specifically hypothesise a food item to be the vehicle of infection in this outbreak. Excluding people reduces the sample size and therefore the power slightly, but the investigators considered that this would increase specificity.
The variables needed to define this case definition are:
meal
onset_datetime
diarrhoea
bloody
vomiting
For the case definition, we are primarily interested in three symptoms:
diarrhoea
(without blood)bloody
(diarrhoea with blood)vomiting
However, these are not the only symptoms in the data set. When creating the case definition, it would be easier to refer to these three symptoms if there was one column, indicating whether people had those symptoms or not. We can call this column gastrosymptoms
.
Next, we created a column for the case definition, in which we defined all the respondents as either cases, non-cases or NA
.
We can define non-cases as those who attended the meal, but didn’t develop any gastro symptoms, or if they did, developed them before the meal took place.
We can exclude (define as NA
) respondents that answered the survey, but did not attend the meal.
Incubation times
A suitable incubation period to use in the case definition can be defined by calculating the time between exposure (to the meal) and onset of symptoms, and then looking at the distribution of these time differences. In this outbreak, incubation periods are easy to calculate, because everyone was exposed at (roughly) the same time and on the same day (eating the meal at the school dinner party).
We can see that the onset dates of all cases were from 11 to 13 November 2006 inclusive; this is why case numbers didn’t change when we updated the case definition.
Exclusions
Ultimately, the investigation team decided to remove respondents that did not meet the definition for a case or a non-case from the data set prior to analysis.