The COVID-19 pandemics is a lot of bad things, but it’s also a source of data to play with…
Here we’re gonna use the data gathered by the John’s Hopkins Hospital and published on GitHub, for: - confirmed cases, and - deaths.
In case the links above die at some point, here is a version of these files as of April 21st, 2020:
In fact, a quick lookup on the Internet will now return plenty of versions for this data, with even cleaner (tidy) versions, including country population, etc. There are even R packages to get this. The point here is to learn how to read a given dataset, clean it and use it, so we’ll stay with the JHU dataset.
First, load the tidyverse
and lubridate
packages
Load the raw .csv files from the links above and store them in
confirmed_raw
and deaths_raw
. You can provide
read_csv()
with an url. The numbers given in these tables are cumulative numbers of
confirmed cases and deaths.
Using successive operations with the pipe operator %>%
create a
function tidyfy_jhh(df, name)
,
that, given a tibble df
(i.e., the data from JHH
we just loaded), will:
Province/State
and
Country/Region
to state
and
country
Diamond Princess
state
, Lat
and
Long
country
,
date
and count
mdy()
function from
the lubridate
package, convert the date
column
to a R Date formatname
parameter of
the function tidyfy_jhh(df, name)
.
For this, use the notation summarize(!!sym(name):= your_result_here)
to tell R that name
should be used as a column name)Apply this function to the raw tibbles to get tidy
confirmed
and deaths
tibbles.
In case you didn’t manage to get there, here are the csv files containing:
Now, join these three tibbles in alldata
, and
successively:
newcases
and newdeaths
containing the number of new cases and deaths every day, as
confirmed
and deaths
contain cumulative
numbers. For this look into the lag()
function.Day 1
.day
column containing the day number since
Day 1
.In case you didn’t manage to get there, here is a csv file containing
alldata
so that you can continue with the exercise.
ggplot2
, plot (P1) side by side
the number of confirmed cases and
deaths vs time in 5 countries of your choice.
ggrepel
, add the country names as annotationreorder()
span
parameter to have a
more or less smooth curve.broom
(vignette).
broom
allows tidying the output of fit functions such as
lm()
or nls()
, taking the text output and
making it into a tibble. Example:library(broom)
library(tidyverse)
library(ggplot2)
theme_set(theme_bw())
# Create fake data
<- seq(-10,10,1)
a <- tibble(x=rep(a,3),
d y=c(a*runif(1)+runif(length(a)),
*runif(1)+1+runif(length(a)),
a*runif(1)+2+runif(length(a))),
aT=factor(rep(1:3,each=length(a)))
)# Fit the data using nls() by nesting
<- d %>%
d_fit nest(data=-T) %>%
mutate(fit=map(data, ~ nls(data = .,
~x*A+B,
ystart=list(A=1, B=1))),
tidied = map(fit, tidy),
augmented = map(fit, augment))
d_fit
## # A tibble: 3 × 5
## T data fit tidied augmented
## <fct> <list> <list> <list> <list>
## 1 1 <tibble [21 × 2]> <nls> <tibble [2 × 5]> <tibble [21 × 4]>
## 2 2 <tibble [21 × 2]> <nls> <tibble [2 × 5]> <tibble [21 × 4]>
## 3 3 <tibble [21 × 2]> <nls> <tibble [2 × 5]> <tibble [21 × 4]>
%>% unnest(augmented) d_fit
## # A tibble: 63 × 8
## T data fit tidied x y .fitted .resid
## <fct> <list> <list> <list> <dbl> <dbl> <dbl> <dbl>
## 1 1 <tibble [21 × 2]> <nls> <tibble [2 × 5]> -10 -7.71 -7.91 0.208
## 2 1 <tibble [21 × 2]> <nls> <tibble [2 × 5]> -9 -7.12 -7.07 -0.0469
## 3 1 <tibble [21 × 2]> <nls> <tibble [2 × 5]> -8 -6.46 -6.23 -0.229
## 4 1 <tibble [21 × 2]> <nls> <tibble [2 × 5]> -7 -5.70 -5.39 -0.314
## 5 1 <tibble [21 × 2]> <nls> <tibble [2 × 5]> -6 -4.76 -4.55 -0.215
## 6 1 <tibble [21 × 2]> <nls> <tibble [2 × 5]> -5 -3.44 -3.71 0.266
## 7 1 <tibble [21 × 2]> <nls> <tibble [2 × 5]> -4 -2.92 -2.87 -0.0559
## 8 1 <tibble [21 × 2]> <nls> <tibble [2 × 5]> -3 -2.46 -2.02 -0.432
## 9 1 <tibble [21 × 2]> <nls> <tibble [2 × 5]> -2 -1.17 -1.18 0.0165
## 10 1 <tibble [21 × 2]> <nls> <tibble [2 × 5]> -1 0.143 -0.341 0.484
## # … with 53 more rows
%>% unnest(tidied) d_fit
## # A tibble: 6 × 9
## T data fit term estim…¹ std.e…² stati…³ p.value augmen…⁴
## <fct> <list> <list> <chr> <dbl> <dbl> <dbl> <dbl> <list>
## 1 1 <tibble [21 × 2]> <nls> A 0.842 0.0109 77.1 3.49e-25 <tibble>
## 2 1 <tibble [21 × 2]> <nls> B 0.501 0.0661 7.58 3.73e- 7 <tibble>
## 3 2 <tibble [21 × 2]> <nls> A 0.0619 0.0102 6.07 7.74e- 6 <tibble>
## 4 2 <tibble [21 × 2]> <nls> B 1.60 0.0618 26.0 2.67e-16 <tibble>
## 5 3 <tibble [21 × 2]> <nls> A 0.730 0.0114 64.0 1.19e-23 <tibble>
## 6 3 <tibble [21 × 2]> <nls> B 2.48 0.0691 35.9 6.26e-19 <tibble>
## # … with abbreviated variable names ¹estimate, ²std.error, ³statistic,
## # ⁴augmented
%>% unnest(tidied) %>%
d_fit select(T, term, estimate, std.error) %>%
pivot_wider(names_from = term,
values_from = c(estimate,std.error))
## # A tibble: 3 × 5
## T estimate_A estimate_B std.error_A std.error_B
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 1 0.842 0.501 0.0109 0.0661
## 2 2 0.0619 1.60 0.0102 0.0618
## 3 3 0.730 2.48 0.0114 0.0691
ggplot(data=d, aes(x=x, y=y, col=T))+
geom_point(alpha=0.5) +
geom_line(data=d_fit %>% unnest(augmented),
aes(x=x,y=.fitted, col=T))
Here, by combining nest()
and purrr::map()
,
we can apply the fit to each group. The result is a tibble containing
the fit result for each group. Then, using broom
’s function
augment()
and tidy()
, we respectively get the
fitted values and residues, and the fitting parameters.
minpack.lm::nlsLM()
instead of nls()
. Here is how
the plot should look like without ever running a
for loop
:In fact, this exponential model is rather wrong… A better model of epidemics dynamics is done by the SIR model, for Susceptible, Infectious, Recovered. A description of the SIR modeling through R can be found here, for example.