R Exercises - COVID

Published

November 6, 2024


The COVID-19 pandemics is a lot of bad things, but it’s also a source of data to play with…

Here we’re gonna use the data gathered by the John’s Hopkins Hospital and published on GitHub, for: - confirmed cases, and - deaths.

In case the links above die at some point, here is a version of these files as of April 21st, 2020:

In fact, a quick lookup on the Internet will now return plenty of versions for this data, with even cleaner (tidy) versions, including country population, etc. There are even R packages to get this. The point here is to learn how to read a given dataset, clean it and use it, so we’ll stay with the JHU dataset.

Data wrangling

  • First, load the tidyverse and lubridate packages
  • Load the raw .csv files from the links above and store them in confirmed_raw and deaths_raw. You can provide read_csv() with an url. The numbers given in these tables are cumulative numbers of confirmed cases and deaths.
urlConfirmed <- "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
urlDeaths <- "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv"
confirmed_raw <- read_csv(urlConfirmed)
deaths_raw    <- read_csv(urlDeaths)
  • Using successive operations with the pipe operator |> create a function tidyfy_jhh(df, name), that, given a tibble df (i.e., the data from JHH we just loaded), will:
    • Rename the columns Province/State and Country/Region to state and country
    • Remove the lines from the Diamond Princess
    • Remove the columns from state, Lat and Long
    • Make the table tidy by having 3 columns : country, date and count
    • Using the mdy() function from the lubridate package, convert the date column to a R Date format
    • Some countries appear several times as their numbers are separated by provinces or states. Summarize the counts per country and per date as the total number of cases per country and per day.
      • Give this column the name of the type of case we look at (confirmed or deaths, string that id provided in the name parameter of the function tidyfy_jhh(df, name). For this, use the notation summarize(!!sym(name):= your_result_here) to tell R that name should be used as a column name)
tidyfy_jhh <- function(df, name) {
    df |> 
        rename(state   = `Province/State`,
               country = `Country/Region`) |>
        filter(country != "Diamond Princess") |>
        select(-c(state, Lat, Long)) |>
        pivot_longer(cols      = -country,
                     names_to  = "date",
                     values_to = "count") |> 
        mutate(date = mdy(date)) |> 
        group_by(country, date) |> 
        summarize(!!sym(name):=sum(count)) |> 
        ungroup()
}
  • Apply this function to the raw tibbles to get tidy confirmed and deaths tibbles.
confirmed <- tidyfy_jhh(confirmed_raw, 'confirmed')
deaths    <- tidyfy_jhh(deaths_raw, 'deaths')
Note

In case you didn’t manage to get there, here are the csv files containing:

Now, join these three tibbles in alldata, and successively:

  • Add the columns newcases and newdeaths containing the number of new cases and deaths every day, as confirmed and deaths contain cumulative numbers. For this look into the lag() function.
  • Filter rows to remove data for which the number of cases is lower than the minimum number of cases in China (for easier comparison). This day will be considered as Day 1.
  • Add a day column containing the day number since Day 1.
  • Filter rows to have only countries for which we have at least one week of data
MIN <- min(confirmed$confirmed[confirmed$country == "China"])
alldata <- confirmed |> 
            full_join(deaths) |> 
            group_by(country) |> 
            mutate(newcases  = confirmed - lag(confirmed),
                   newdeaths = deaths - lag(deaths)) |> 
            filter(confirmed >= MIN) |> 
            mutate(day=1:n()) |> 
            filter(n() >= 7) |> 
            ungroup()
Note

In case you didn’t manage to get there, here is a csv file containing alldata so that you can continue with the exercise.

Plotting

  • Using ggplot2, plot (P1) side by side the number of confirmed cases and deaths vs time in 5 countries of your choice.
    • Using ggrepel, add the country names as annotation
    • Do the plot using linear scale, then a log scale
library(ggplot2)
library(ggrepel)

countries <- c("China","Italy","France","Spain","US")
colors <- c("black","red","seagreen","orange","royalblue")

labs <- alldata |> 
            filter(country %in% countries) |>
            group_by(country) |> 
            filter(date==last(date)) |> 
            ungroup()

P1a <- alldata |> 
    filter(country %in% countries) |>
    ggplot(aes(x=date, y=confirmed, color=country))+
        geom_line(size=1)+
        scale_color_manual(values=colors)+
        xlab("Days")+
        ylab("Number of Confirmed cases ")+
        theme_bw()+
        theme(legend.position = 'none')+
        ggtitle("Confirmed cases")+
        scale_y_log10(
            breaks = 10^(seq(0, 10, by = 1)), 
            minor_breaks = rep(1:9, 2*5+1)*(10^rep(0:10, each=9)), 
            labels = scales::trans_format("log10", 
                scales::math_format(10^.x)))+
        annotation_logticks(sides ='lr')+
        geom_label_repel(data=labs, 
            show.legend=FALSE, 
            force=10,nudge_x=2,
            aes(x=date, y=confirmed, 
                col=country, 
                label = country))

P1b <- alldata |> 
    filter(country %in% countries) |>
    ggplot(aes(x=date, y=deaths, color=country))+
        geom_line(size=1)+
        scale_color_manual(values=colors)+
        xlab("Days")+
        ylab("Number of Deaths")+
        theme_bw()+
        theme(legend.position = 'none')+
        ggtitle("Deaths")+
        scale_y_log10(
            breaks = 10^(seq(0, 10, by = 1)), 
            minor_breaks = rep(1:9, 2*5+1)*(10^rep(0:10, each=9)), 
            labels = scales::trans_format("log10", 
                scales::math_format(10^.x)))+
        geom_label_repel(data=labs, 
            show.legend=FALSE, 
            force=10,nudge_x=2,
            aes(x=date, y=deaths, 
                col=country, 
                label = country))
library(patchwork)
P1a + P1b

  • Plot (P2) the death ratio (i.e. deaths per confirmed cases) per country averaged over time as a barplot, for 50 random countries.
    • Try ordering the countries per decreasing ratio using reorder()
    • Flip the plot to have horizontal bars
    • Is there anything wrong?
P2 <- alldata |> 
    filter(country %in% sample( unique(country), 50) ) |> 
    group_by(country) |> 
    summarize(ratio=mean(deaths/confirmed*100)) |> 
    ggplot(aes(x=reorder(country,ratio), y=ratio, fill=country))+
        geom_bar(stat="identity", position="dodge") +
        ylab("Death ratio [%]")+
        theme_bw()+
        coord_flip()+
        theme(axis.title.y=element_blank(),
              axis.ticks.y=element_blank(),
              legend.position = 'none'
              )+
        ggtitle("Death ratio")
P2

  • To better see the effect of confinement regulation, it’s probably more interesting to plot the number of new confirmed cases.
    • Plot as a barplot the number of new confirmed cases per day for 2 countries of your choice, and add a vertical line corresponding to the date of application of confinement rules. For example, in France and Italy it was on March 17th and March 10th, respectively.
    • Add a smooth line to smooth out the effect or irregular numbers reporting. You can play with the span parameter to have a more or less smooth curve.
conf_france <- as.Date("2020-03-17")
conf_italy  <- as.Date("2020-03-10")
colors <- c('royalblue','orange')
alldata |> 
    filter(country %in% c("France", "Italy")) |>
    filter(newcases>0) |> 
    filter(newcases<12000) |> #remove outliers
    ggplot(aes(x=date, y=newcases, fill=country))+
        scale_fill_manual(values=colors, name="")+
        scale_color_manual(values=colors, name="")+
        geom_bar(stat = 'identity', position = 'dodge', alpha=.7)+
        geom_smooth(span=.15, se=FALSE, aes(color=country))+
        geom_vline(xintercept=conf_france, col=colors[1], lty=2)+
        geom_vline(xintercept=conf_italy, col=colors[2], lty=2)+
        scale_x_date(minor_breaks='1 day')+
        theme(legend.position='none')+
        xlab("Date")+
        ylab("New confirmed cases")+
        ggtitle("New confirmed cases") -> P1
alldata |> 
    filter(country %in% c("France", "Italy")) |>
    filter(newdeaths>0) |> 
    filter(date>as.Date("2020-02-25")) |> 
    ggplot(aes(x=date, y=newdeaths, fill=country))+
        scale_fill_manual(values=colors, name="")+
        scale_color_manual(values=colors, name="")+
        geom_bar(stat = 'identity', position = 'dodge', alpha=.7)+
        geom_smooth(span=.15, se=FALSE, aes(color=country))+
        geom_vline(xintercept=conf_france, col=colors[1], lty=2)+
        geom_vline(xintercept=conf_italy, col=colors[2], lty=2)+
        scale_x_date(minor_breaks='1 day')+
        xlab("Date")+
        ylab("New deaths")+
        ggtitle("New deaths") -> P2
P1+P2

Fitting

  • Load the library broom (vignette). broom allows tidying the output of fit functions such as lm() or nls(), taking the text output and making it into a tibble. Example:
library(broom)
library(tidyverse)
library(ggplot2)
theme_set(theme_bw())
# Create fake data
a <- seq(-10,10,1)
d <- tibble(x=rep(a,3),
            y=c(a*runif(1)+runif(length(a)),
                a*runif(1)+1+runif(length(a)),
                a*runif(1)+2+runif(length(a))),
            T=factor(rep(1:3,each=length(a)))
            )
# Fit the data using nls() by nesting
d_fit <- d |> 
    nest(data=-T) |> 
    mutate(fit=map(data, ~ nls(data = .,
                   y~x*A+B, 
                   start=list(A=1, B=1))),
           tidied = map(fit, tidy),
           augmented = map(fit, augment))
d_fit
# A tibble: 3 × 5
  T     data              fit    tidied           augmented        
  <fct> <list>            <list> <list>           <list>           
1 1     <tibble [21 × 2]> <nls>  <tibble [2 × 5]> <tibble [21 × 4]>
2 2     <tibble [21 × 2]> <nls>  <tibble [2 × 5]> <tibble [21 × 4]>
3 3     <tibble [21 × 2]> <nls>  <tibble [2 × 5]> <tibble [21 × 4]>
d_fit |> unnest(augmented)
# A tibble: 63 × 8
   T     data              fit    tidied       x        y .fitted  .resid
   <fct> <list>            <list> <list>   <dbl>    <dbl>   <dbl>   <dbl>
 1 1     <tibble [21 × 2]> <nls>  <tibble>   -10 -0.179   -0.474   0.295 
 2 1     <tibble [21 × 2]> <nls>  <tibble>    -9 -0.308   -0.377   0.0687
 3 1     <tibble [21 × 2]> <nls>  <tibble>    -8 -0.118   -0.279   0.161 
 4 1     <tibble [21 × 2]> <nls>  <tibble>    -7  0.283   -0.181   0.464 
 5 1     <tibble [21 × 2]> <nls>  <tibble>    -6 -0.430   -0.0835 -0.347 
 6 1     <tibble [21 × 2]> <nls>  <tibble>    -5 -0.359    0.0143 -0.373 
 7 1     <tibble [21 × 2]> <nls>  <tibble>    -4 -0.177    0.112  -0.289 
 8 1     <tibble [21 × 2]> <nls>  <tibble>    -3  0.580    0.210   0.371 
 9 1     <tibble [21 × 2]> <nls>  <tibble>    -2  0.00771  0.307  -0.300 
10 1     <tibble [21 × 2]> <nls>  <tibble>    -1  0.0625   0.405  -0.343 
# ℹ 53 more rows
d_fit |> unnest(tidied)
# A tibble: 6 × 9
  T     data     fit    term  estimate std.error statistic  p.value augmented
  <fct> <list>   <list> <chr>    <dbl>     <dbl>     <dbl>    <dbl> <list>   
1 1     <tibble> <nls>  A       0.0977   0.0108       9.04 2.62e- 8 <tibble> 
2 1     <tibble> <nls>  B       0.503    0.0655       7.68 3.05e- 7 <tibble> 
3 2     <tibble> <nls>  A       0.101    0.00845     11.9  2.80e-10 <tibble> 
4 2     <tibble> <nls>  B       1.56     0.0512      30.5  1.31e-17 <tibble> 
5 3     <tibble> <nls>  A       0.366    0.0120      30.5  1.31e-17 <tibble> 
6 3     <tibble> <nls>  B       2.55     0.0726      35.1  9.85e-19 <tibble> 
d_fit |> unnest(tidied) |> 
  select(T, term, estimate, std.error) |> 
  pivot_wider(names_from = term, 
              values_from = c(estimate,std.error))
# A tibble: 3 × 5
  T     estimate_A estimate_B std.error_A std.error_B
  <fct>      <dbl>      <dbl>       <dbl>       <dbl>
1 1         0.0977      0.503     0.0108       0.0655
2 2         0.101       1.56      0.00845      0.0512
3 3         0.366       2.55      0.0120       0.0726
ggplot(data=d, aes(x=x, y=y, col=T))+
    geom_point(alpha=0.5) + 
    geom_line(data=d_fit |> unnest(augmented), 
        aes(x=x,y=.fitted, col=T))

Here, by combining nest() and purrr::map(), we can apply the fit to each group. The result is a tibble containing the fit result for each group. Then, using broom’s function augment() and tidy(), we respectively get the fitted values and residues, and the fitting parameters.

  • Now then, using the same procedure as in the example above, fit the total Confirmed cases for the 5 countries you chose earlier for the 100 first days using a logistic function, and plot the data and their fit, and show the results of the fits. You may want to use minpack.lm::nlsLM() instead of nls(). Here is how the plot should look like without ever running a for loop:
library(broom)
library(minpack.lm)
countries <- c("China","Italy","France","Spain")
colors    <- c("black","red","seagreen","orange")
fits <- alldata |> 
    filter(country %in% countries) |> 
    filter(day <= 100) |> 
    nest(data=-country) |> 
    mutate(fit=map(data, ~ nlsLM(data = .,
               confirmed ~ a/(1+exp(-b*(day-c))), 
               start=list(a=max(.$confirmed), b=.1, c=15))),
           tidied = map(fit, tidy),
           augmented = map(fit, augment))
fits |> unnest(augmented) |> 
    ggplot(aes(x=day, color=country))+
        geom_point(aes(y=confirmed), alpha=.2, size=3)+
        geom_line(aes(y=.fitted))+
        scale_color_manual(values=colors, name="")+
        theme(legend.position='top')+
        xlab("Day since Day 1")+
        ylab("Total confirmed cases")

# fits |> unnest(tidied) |> 
#   select(country, term, estimate) |> 
#   pivot_wider(names_from = term, 
#               values_from = estimate)

In fact, this exponential model is rather wrong… A better model of epidemics dynamics is done by the SIR model, for Susceptible, Infectious, Recovered. A description of the SIR modeling through R can be found here, for example.