R Exercises - COVID
The COVID-19 pandemic brought a lot of bad things, but it is also a source of data to play with…
Here we’re going to use the data gathered by the Johns Hopkins University and published on GitHub, for:
- confirmed cases, and
- deaths.
In case the links above die at some point, here is a version of these files as of April 21st, 2020:
In fact, a quick lookup on the Internet will now return plenty of versions of this data, including cleaner (tidy) versions with country population, etc. There are even R packages to fetch it. The point here is to learn how to read a given dataset, clean it and use it, so we’ll stay with the JHU dataset.
- Download the datafiles archive and unzip it.
Data wrangling
- First, load the `tidyverse` and `lubridate` packages.
- Load the raw .csv files from the links above and store them in `confirmed_raw` and `deaths_raw`. You can provide `read_csv()` with a URL. The numbers given in these tables are cumulative numbers of confirmed cases and deaths.
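A minimal sketch of this loading step. `read_csv()` takes a local path or a URL; here we read a tiny stand-in file written on the fly (so the snippet runs on its own), laid out like the wide JHU tables — in the exercise, point it at the course links or the files from the unzipped archive instead:

```r
library(tidyverse)

# Tiny stand-in file mimicking the wide JHU layout
# (Province/State, Country/Region, Lat, Long, then one column per date)
demo <- tempfile(fileext = ".csv")
writeLines(c("Province/State,Country/Region,Lat,Long,1/22/20,1/23/20",
             ",France,46.2,2.2,0,2"), demo)

# read_csv() accepts a path or a URL directly
confirmed_raw <- read_csv(demo, show_col_types = FALSE)
confirmed_raw
```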
- Using successive operations with the pipe operator `|>`, create a function `tidyfy_jhh(df, name)` that, given a tibble `df` (i.e., the data from JHU we just loaded), will:
  - Rename the columns `Province/State` and `Country/Region` to `state` and `country`
  - Remove the lines from the `Diamond Princess`
  - Remove the columns `state`, `Lat` and `Long`
  - Make the table tidy by having 3 columns: `country`, `date` and `count`
  - Using the `mdy()` function from the `lubridate` package, convert the `date` column to an R Date format
  - Some countries appear several times as their numbers are split by provinces or states: summarize the counts per country and per date as the total number of cases per country and per day
  - Give this column the name of the type of case we look at (confirmed or deaths, a string that is provided in the `name` parameter of `tidyfy_jhh(df, name)`). For this, use the notation `summarize(!!sym(name) := your_result_here)` to tell R that the value of `name` should be used as the column name
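The steps above can be sketched as follows, assuming the wide JHU layout shown earlier (whether the `Diamond Princess` appears as a country or a state varies across versions of the data, so you may need to filter on both):

```r
library(tidyverse)
library(lubridate)

# Sketch of tidyfy_jhh(): one pipe from the wide raw table to a tidy one
tidyfy_jhh <- function(df, name) {
  df |>
    rename(state = `Province/State`, country = `Country/Region`) |>
    filter(country != "Diamond Princess") |>
    select(-state, -Lat, -Long) |>
    # every remaining column except `country` is a date column
    pivot_longer(-country, names_to = "date", values_to = "count") |>
    mutate(date = mdy(date)) |>
    # collapse provinces/states into one total per country and day,
    # naming the result after the `name` argument
    group_by(country, date) |>
    summarize(!!sym(name) := sum(count), .groups = "drop")
}
```

Usage: `confirmed <- tidyfy_jhh(confirmed_raw, "confirmed")` and likewise for `deaths`.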
- Apply this function to the raw tibbles to get tidy `confirmed` and `deaths` tibbles.
Now, join these two tibbles into `alldata`, and successively:
- Add the columns `newcases` and `newdeaths` containing the number of new cases and deaths every day, as `confirmed` and `deaths` contain cumulative numbers. For this, look into the `lag()` function.
- Filter rows to remove data for which the number of cases is lower than the minimum number of cases in China (for easier comparison). This day will be considered as `Day 1`.
- Add a `day` column containing the day number since `Day 1`.
- Filter rows to keep only countries for which we have at least one week of data.

In case you didn’t manage to get there, here is a csv file containing `alldata` so that you can continue with the exercise.
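These join-and-filter steps can be sketched as below. The two small tibbles at the top are stand-ins for the output of `tidyfy_jhh()` so that the snippet runs on its own; the handling of the first day's `newcases` (taking the whole cumulative count, via `lag(..., default = 0)`) is one possible convention:

```r
library(tidyverse)

# Stand-in tidy tibbles; in the exercise these come from tidyfy_jhh()
confirmed <- tibble(country = rep(c("China", "France"), each = 8),
                    date = rep(seq(as.Date("2020-01-22"), by = "day",
                                   length.out = 8), 2),
                    confirmed = rep(c(1, 2, 4, 8, 16, 32, 64, 128), 2))
deaths <- confirmed |>
  mutate(deaths = confirmed %/% 10) |>
  select(-confirmed)

# Threshold: the minimum count observed in China
min_china <- confirmed |>
  filter(country == "China") |>
  pull(confirmed) |>
  min()

alldata <- confirmed |>
  full_join(deaths, by = c("country", "date")) |>
  group_by(country) |>
  arrange(date, .by_group = TRUE) |>
  # daily increments from the cumulative counts
  mutate(newcases  = confirmed - lag(confirmed, default = 0),
         newdeaths = deaths    - lag(deaths,    default = 0)) |>
  # keep data from the day each country reaches China's minimum count;
  # the first remaining day per country becomes Day 1
  filter(confirmed >= min_china) |>
  mutate(day = as.integer(date - min(date)) + 1) |>
  # keep only countries with at least one week of data
  filter(n() >= 7) |>
  ungroup()
```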
Plotting
- Using `ggplot2`, plot (P1) side by side the number of confirmed cases and deaths vs time in 5 countries of your choice.
  - Using `ggrepel`, add the country names as annotations
  - Do the plot using a linear scale, then a log scale
- Plot (P2) the death ratio (i.e., deaths per confirmed case) per country, averaged over time, as a barplot for 50 random countries.
  - Try ordering the countries by decreasing ratio using `reorder()`
  - Flip the plot to have horizontal bars
  - Is there anything wrong?
- To better see the effect of confinement regulations, it’s probably more interesting to plot the number of new confirmed cases.
  - Plot as a barplot the number of new confirmed cases per day for 2 countries of your choice, and add a vertical line corresponding to the date of application of confinement rules. For example, in France and Italy it was on March 17th and March 10th, respectively.
  - Add a smooth line to smooth out the effect of irregular number reporting. You can play with the `span` parameter to get a more or less smooth curve.
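A minimal sketch of the log-scale plot (P1) and of the new-cases barplot with a lockdown marker. The small `alldata` stand-in, the country pair and the lockdown position at day 10 are only illustrative:

```r
library(tidyverse)

# Fake stand-in for `alldata`, for illustration only
alldata <- tibble(country = rep(c("France", "Italy"), each = 20),
                  day = rep(1:20, 2),
                  confirmed = c(round(2^(1:20 / 3)), round(2^(1:20 / 2.5))),
                  newcases = c(diff(c(0, round(2^(1:20 / 3)))),
                               diff(c(0, round(2^(1:20 / 2.5))))))

# P1-style plot: cases vs day, one colour per country
p1 <- ggplot(alldata, aes(day, confirmed, col = country)) +
  geom_line() +
  scale_y_log10()       # drop this line for the linear-scale version

# New confirmed cases per day with a lockdown marker and a smooth line
p3 <- ggplot(filter(alldata, country == "France"),
             aes(day, newcases)) +
  geom_col() +
  geom_vline(xintercept = 10, linetype = "dashed") +
  geom_smooth(span = 0.5, se = FALSE)
```

For the barplot of death ratios (P2), `reorder(country, -ratio)` inside `aes()` plus `coord_flip()` gives ordered horizontal bars.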
Fitting
- Load the library `broom` (vignette). `broom` allows tidying the output of fit functions such as `lm()` or `nls()`, turning their text output into a tibble. Example:
```r
library(broom)
library(tidyverse)
library(ggplot2)
theme_set(theme_bw())

# Create fake data: three noisy straight lines
a <- seq(-10, 10, 1)
d <- tibble(x = rep(a, 3),
            y = c(a*runif(1) + runif(length(a)),
                  a*runif(1) + 1 + runif(length(a)),
                  a*runif(1) + 2 + runif(length(a))),
            T = factor(rep(1:3, each = length(a))))

# Fit the data using nls() by nesting
d_fit <- d |>
  nest(data = -T) |>
  mutate(fit = map(data, ~ nls(data = .,
                               y ~ x*A + B,
                               start = list(A = 1, B = 1))),
         tidied = map(fit, tidy),
         augmented = map(fit, augment))
d_fit
```

```
# A tibble: 3 × 5
  T     data              fit    tidied           augmented
  <fct> <list>            <list> <list>           <list>
1 1     <tibble [21 × 2]> <nls>  <tibble [2 × 5]> <tibble [21 × 4]>
2 2     <tibble [21 × 2]> <nls>  <tibble [2 × 5]> <tibble [21 × 4]>
3 3     <tibble [21 × 2]> <nls>  <tibble [2 × 5]> <tibble [21 × 4]>
```

```r
d_fit |> unnest(augmented)
```

```
# A tibble: 63 × 8
   T     data              fit    tidied       x        y .fitted  .resid
   <fct> <list>            <list> <list>   <dbl>    <dbl>   <dbl>   <dbl>
 1 1     <tibble [21 × 2]> <nls>  <tibble>   -10 -2.84     -2.59  -0.247
 2 1     <tibble [21 × 2]> <nls>  <tibble>    -9 -1.90     -2.29   0.388
 3 1     <tibble [21 × 2]> <nls>  <tibble>    -8 -2.12     -1.98  -0.144
 4 1     <tibble [21 × 2]> <nls>  <tibble>    -7 -1.85     -1.67  -0.182
 5 1     <tibble [21 × 2]> <nls>  <tibble>    -6 -1.18     -1.36   0.183
 6 1     <tibble [21 × 2]> <nls>  <tibble>    -5 -0.503    -1.05   0.547
 7 1     <tibble [21 × 2]> <nls>  <tibble>    -4 -1.07     -0.741 -0.327
 8 1     <tibble [21 × 2]> <nls>  <tibble>    -3 -0.364    -0.432  0.0684
 9 1     <tibble [21 × 2]> <nls>  <tibble>    -2 -0.550    -0.123 -0.427
10 1     <tibble [21 × 2]> <nls>  <tibble>    -1 -0.00995   0.186 -0.196
# ℹ 53 more rows
```

```r
d_fit |> unnest(tidied)
```

```
# A tibble: 6 × 9
  T     data     fit    term  estimate std.error statistic  p.value augmented
  <fct> <list>   <list> <chr>    <dbl>     <dbl>     <dbl>    <dbl> <list>
1 1     <tibble> <nls>  A        0.309   0.0116      26.7  1.56e-16 <tibble>
2 1     <tibble> <nls>  B        0.495   0.0700       7.07 1.01e- 6 <tibble>
3 2     <tibble> <nls>  A        0.430   0.00925     46.4  4.99e-21 <tibble>
4 2     <tibble> <nls>  B        1.53    0.0560      27.3  1.02e-16 <tibble>
5 3     <tibble> <nls>  A        0.149   0.0109      13.7  2.65e-11 <tibble>
6 3     <tibble> <nls>  B        2.49    0.0658      37.8  2.45e-19 <tibble>
```

```r
d_fit |> unnest(tidied) |>
  select(T, term, estimate, std.error) |>
  pivot_wider(names_from = term,
              values_from = c(estimate, std.error))
```

```
# A tibble: 3 × 5
  T     estimate_A estimate_B std.error_A std.error_B
  <fct>      <dbl>      <dbl>       <dbl>       <dbl>
1 1          0.309      0.495      0.0116      0.0700
2 2          0.430      1.53       0.00925     0.0560
3 3          0.149      2.49       0.0109      0.0658
```

```r
ggplot(data = d, aes(x = x, y = y, col = T)) +
  geom_point(alpha = 0.5) +
  geom_line(data = d_fit |> unnest(augmented),
            aes(x = x, y = .fitted, col = T))
```
Here, by combining `nest()` and `purrr::map()`, we can apply the fit to each group. The result is a tibble containing the fit result for each group. Then, using `broom`’s functions `augment()` and `tidy()`, we respectively get the fitted values and residuals, and the fitting parameters.
- Now, using the same procedure as in the example above, fit the total confirmed cases for the 5 countries you chose earlier over the first 100 days using a logistic function, plot the data and their fit, and show the results of the fits. You may want to use `minpack.lm::nlsLM()` instead of `nls()`. Here is how the plot should look, without ever running a `for` loop:
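A sketch of such a logistic fit on synthetic data (a stand-in for one country's first 100 days). It uses the base-R self-starting model `SSlogis()` with `nls()`, which avoids having to guess starting values; as noted above, `minpack.lm::nlsLM()` is a more robust drop-in replacement if convergence is an issue. With several countries, the `nest()` + `map()` pattern from the example applies unchanged:

```r
library(tidyverse)
library(broom)

# Synthetic cumulative-case data following a logistic curve
# (plateau 1e4, midpoint at day 50, scale 6), plus noise
set.seed(1)
d <- tibble(day = 1:100) |>
  mutate(confirmed = 1e4 / (1 + exp((50 - day) / 6)) +
                     rnorm(100, sd = 50))

# Self-starting logistic fit: Asym / (1 + exp((xmid - day) / scal))
fit <- nls(confirmed ~ SSlogis(day, Asym, xmid, scal), data = d)
tidy(fit)   # fitted parameters as a tidy tibble
```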
In fact, this simple logistic model is rather wrong… A better model of epidemic dynamics is the SIR model, for Susceptible, Infectious, Recovered. A description of SIR modeling with R can be found here, for example.