6  Data frames

6.1 Defining a data.frame

6.1.1 Defining a data.frame from vectors

In R, the principal object is the data. Hence the data.frame object, which is basically a table of vectors. A data.frame is a list presented under the form of a table – i.e. a spreadsheet. On a day-to-day basis, you will either define data.frame from existing vectors or other data.frame, or define a data.frame from a file (text, Excel…). In this example, we use test.dat and test.xlsx.

To define a data.frame from known vectors, we just have to do:

x  <- seq(-pi, pi, length = 6) 
y  <- sin(x)
df <- data.frame(x, y) # df is a data.frame (a table)
df
#>            x             y
#> 1 -3.1415927 -1.224647e-16
#> 2 -1.8849556 -9.510565e-01
#> 3 -0.6283185 -5.877853e-01
#> 4  0.6283185  5.877853e-01
#> 5  1.8849556  9.510565e-01
#> 6  3.1415927  1.224647e-16

Then some information about our table are readily accessible, like:

  • The dimension, the number of rows and columns:
dim(df); nrow(df); ncol(df)
#> [1] 6 2
#> [1] 6
#> [1] 2
  • Print the first and last 3 values
head(df, 3); tail(df, 3)
#>            x             y
#> 1 -3.1415927 -1.224647e-16
#> 2 -1.8849556 -9.510565e-01
#> 3 -0.6283185 -5.877853e-01
#>           x            y
#> 4 0.6283185 5.877853e-01
#> 5 1.8849556 9.510565e-01
#> 6 3.1415927 1.224647e-16
  • Print some information on df
str(df)
#> 'data.frame':    6 obs. of  2 variables:
#>  $ x: num  -3.142 -1.885 -0.628 0.628 1.885 ...
#>  $ y: num  -1.22e-16 -9.51e-01 -5.88e-01 5.88e-01 9.51e-01 ...
  • Print some statistics on df
#>        x                y          
#>  Min.   :-3.142   Min.   :-0.9511  
#>  1st Qu.:-1.571   1st Qu.:-0.4408  
#>  Median : 0.000   Median : 0.0000  
#>  Mean   : 0.000   Mean   : 0.0000  
#>  3rd Qu.: 1.571   3rd Qu.: 0.4408  
#>  Max.   : 3.142   Max.   : 0.9511

If not defined when creating the data.frame, the column names will be by default the vector names. To specify your own column names, do it when creating the data.frame:

df <- data.frame(xxx = x, yyy = y)
head(df, 2)
#>         xxx           yyy
#> 1 -3.141593 -1.224647e-16
#> 2 -1.884956 -9.510565e-01

Or, once it’s created, do it using names()

names(df)
#> [1] "xxx" "yyy"
names(df) <- c("X", "Y")
head(df, 2)
#>           X             Y
#> 1 -3.141593 -1.224647e-16
#> 2 -1.884956 -9.510565e-01

6.1.2 Defining a data.frame from a file

6.1.2.1 A text file

Let’s say we have test.dat that looks like this:

# Bash code:
head Data/test.dat
#> x   y
#> 1   2
#> 2   3
  • Then, to read this file into a data.frame, we will use read.table(). If you don’t specify that the file contains a header, read.table() will default to attributing column names that will be V1, V2, V3, etc:
read.table("Data/test.dat")
#>   V1 V2
#> 1  x  y
#> 2  1  2
#> 3  2  3
  • If your file contains column names, you can use the first line as column names, like so:
read.table("Data/test.dat", header=TRUE)
#>   x y
#> 1 1 2
#> 2 2 3
  • If you want to skip some lines before starting the reading, use skip:
read.table("Data/test.dat", skip=1)
#>   V1 V2
#> 1  1  2
#> 2  2  3
  • You can specify your own column names using col.names:
read.table("Data/test.dat", skip=1, col.names = c("A","B"))
#>   A B
#> 1 1 2
#> 2 2 3

6.1.2.2 An Excel file

Now, to read an Excel file, use the readxl library:

library(readxl) # load readxl from tidyverse to read Excel files
read_excel("Data/test.xlsx", sheet=1)
#> # A tibble: 10 × 2
#>        x      y
#>    <dbl>  <dbl>
#>  1     1  5.21 
#>  2     2  6.55 
#>  3     3  3.71 
#>  4     4  0.216
#>  5     5  0.205
#>  6     6  4.60 
#>  7     7 10.3  
#>  8     8 12.9  
#>  9     9 11.1  
#> 10    10  7.28
read_excel("Data/test.xlsx", sheet=2)
#> # A tibble: 4 × 2
#>   hello  world   
#>   <chr>  <chr>   
#> 1 ac     th      
#> 2 asc    thh     
#> 3 ascsa  dthdh   
#> 4 ascacs dthtdhdh

6.2 Accessing values

  • Like with vectors, accessing values is done using the [] notation, except that here there are two indexes: df[row, column]:
df[3,1]
#> [1] -0.6283185
  • In general however, what you want is to access a given column, by its index:
df[,1]
#> [1] -3.1415927 -1.8849556 -0.6283185  0.6283185  1.8849556  3.1415927
df[[1]]# this is a vector too
#> [1] -3.1415927 -1.8849556 -0.6283185  0.6283185  1.8849556  3.1415927
  • Or, preferably, by its name using the $ notation:
df$X
#> [1] -3.1415927 -1.8849556 -0.6283185  0.6283185  1.8849556  3.1415927
  • Finally, you may want to apply filters on your table:
df[df$X < 0, ]
#>            X             Y
#> 1 -3.1415927 -1.224647e-16
#> 2 -1.8849556 -9.510565e-01
#> 3 -0.6283185 -5.877853e-01
  • Using the function subset(), the conditions are applied on column names (no need for df$col_name here, while you need it in the above expression):
subset(df, X>1)
#>          X            Y
#> 5 1.884956 9.510565e-01
#> 6 3.141593 1.224647e-16
subset(df, X>1, select = c(X))
#>          X
#> 5 1.884956
#> 6 3.141593

6.3 Adding columns or rows

6.3.1 Adding columns

  • To add a column, just attribute a value to an column that do not exist yet, it will be created:
# Adding columns
df   <- data.frame(x,y)
df$z <- df$x^2
df
#>            x             y         z
#> 1 -3.1415927 -1.224647e-16 9.8696044
#> 2 -1.8849556 -9.510565e-01 3.5530576
#> 3 -0.6283185 -5.877853e-01 0.3947842
#> 4  0.6283185  5.877853e-01 0.3947842
#> 5  1.8849556  9.510565e-01 3.5530576
#> 6  3.1415927  1.224647e-16 9.8696044
  • You can also create a data.frame of a data.frame:
data.frame(df, z=df$x^2, u=cos(df$x))
#>            x             y         z       z.1         u
#> 1 -3.1415927 -1.224647e-16 9.8696044 9.8696044 -1.000000
#> 2 -1.8849556 -9.510565e-01 3.5530576 3.5530576 -0.309017
#> 3 -0.6283185 -5.877853e-01 0.3947842 0.3947842  0.809017
#> 4  0.6283185  5.877853e-01 0.3947842 0.3947842  0.809017
#> 5  1.8849556  9.510565e-01 3.5530576 3.5530576 -0.309017
#> 6  3.1415927  1.224647e-16 9.8696044 9.8696044 -1.000000
  • Finally, you can use the cbind() function to bind two data.frame column-wise:
df2 <- data.frame(a = 1:length(x), b = 1:length(x))
cbind(df, df2)
#>            x             y         z a b
#> 1 -3.1415927 -1.224647e-16 9.8696044 1 1
#> 2 -1.8849556 -9.510565e-01 3.5530576 2 2
#> 3 -0.6283185 -5.877853e-01 0.3947842 3 3
#> 4  0.6283185  5.877853e-01 0.3947842 4 4
#> 5  1.8849556  9.510565e-01 3.5530576 5 5
#> 6  3.1415927  1.224647e-16 9.8696044 6 6

6.3.2 Adding rows

For this, use the rbind() function.

Attention

The two data.frame must have the same number of columns and the same column names.

rbind(df, df)
#>             x             y         z
#> 1  -3.1415927 -1.224647e-16 9.8696044
#> 2  -1.8849556 -9.510565e-01 3.5530576
#> 3  -0.6283185 -5.877853e-01 0.3947842
#> 4   0.6283185  5.877853e-01 0.3947842
#> 5   1.8849556  9.510565e-01 3.5530576
#> 6   3.1415927  1.224647e-16 9.8696044
#> 7  -3.1415927 -1.224647e-16 9.8696044
#> 8  -1.8849556 -9.510565e-01 3.5530576
#> 9  -0.6283185 -5.877853e-01 0.3947842
#> 10  0.6283185  5.877853e-01 0.3947842
#> 11  1.8849556  9.510565e-01 3.5530576
#> 12  3.1415927  1.224647e-16 9.8696044

6.3.3 Deleting rows/columns

This works like with vectors:

df[-1,]
#>            x             y         z
#> 2 -1.8849556 -9.510565e-01 3.5530576
#> 3 -0.6283185 -5.877853e-01 0.3947842
#> 4  0.6283185  5.877853e-01 0.3947842
#> 5  1.8849556  9.510565e-01 3.5530576
#> 6  3.1415927  1.224647e-16 9.8696044
df[,-1]
#>               y         z
#> 1 -1.224647e-16 9.8696044
#> 2 -9.510565e-01 3.5530576
#> 3 -5.877853e-01 0.3947842
#> 4  5.877853e-01 0.3947842
#> 5  9.510565e-01 3.5530576
#> 6  1.224647e-16 9.8696044

6.4 Tidy up!

6.4.1 What is tidy data?

A good practice in R is to tidy your data. R follows a set of conventions that makes one layout of tabular data much easier to work with than others. Your data will be easier to work with in R if it follows three rules:

  • Each variable in the data set is placed in its own column
  • Each observation is placed in its own row
  • Each value is placed in its own cell

Illustration of tidy data.

Data that satisfies these rules is known as tidy data: you see that thanks to this representation, a 2D table can handle an arbitrary number of variables – this avoids using multi-dimensional arrays or multi-tab Excel documents. Note that it does’t matter if a value is repeated in a column.

Here is an example:

df <- read.csv("Data/population.csv")
df # is not tidy
Show output
#>   year Angers Bordeaux  Brest  Dijon Grenoble LeHavre LeMans  Lille   Lyon
#> 1 1962 115273   278403 136104 135694   156707  187845 132181 239955 535746
#> 2 1968 128557   266662 154023 145357   161616  207150 143246 238554 527800
#> 3 1975 137591   223131 166826 151705   166037  217882 152285 219204 456716
#> 4 1982 136038   208159 156060 140942   156637  199388 147697 196705 413095
#> 5 1990 141404   210336 147956 146703   150758  195854 145502 198691 415487
#> 6 1999 151279   215363 149634 149867   153317  190905 146105 212597 445452
#> 7 2007 151108   235178 142722 151543   156793  179751 144164 225789 472330
#> 8 2012 149017   241287 139676 152071   158346  173142 143599 228652 496343
#>   Marseille Montpellier Nantes   Nice   Paris  Reims Rennes Saint.Etienne
#> 1    778071      118864 240048 292958 2790091 134856 151948        210311
#> 2    889029      161910 260244 322442 2590771 154534 180943        223223
#> 3    908600      191354 256693 344481 2299830 178381 198305        220181
#> 4    874436      197231 240539 337085 2176243 177234 194656        204955
#> 5    800550      207996 244995 342439 2152423 180620 197536        199396
#> 6    798430      225392 270251 342738 2125246 187206 206229        180210
#> 7    852395      253712 283025 348721 2193030 183500 207922        175318
#> 8    852516      268456 291604 343629 2240621 181893 209860        171483
#>   Strasbourg Toulon Toulouse
#> 1     228971 161797   323724
#> 2     249396 174746   370796
#> 3     253384 181801   373796
#> 4     248712 179423   347995
#> 5     252338 167619   358688
#> 6     264115 160639   390350
#> 7     272123 166537   439453
#> 8     274394 164899   453317


library(tidyr)
df <- pivot_longer(df, cols=-year, names_to="city", values_to="pop")
df #is tidy
Show output
#> # A tibble: 160 × 3
#>     year city         pop
#>    <int> <chr>      <int>
#>  1  1962 Angers    115273
#>  2  1962 Bordeaux  278403
#>  3  1962 Brest     136104
#>  4  1962 Dijon     135694
#>  5  1962 Grenoble  156707
#>  6  1962 LeHavre   187845
#>  7  1962 LeMans    132181
#>  8  1962 Lille     239955
#>  9  1962 Lyon      535746
#> 10  1962 Marseille 778071
#> # ℹ 150 more rows


# is not tidy
pivot_wider(df, names_from="city", values_from="pop")
Show output
#> # A tibble: 8 × 21
#>    year Angers Bordeaux  Brest  Dijon Grenoble LeHavre LeMans  Lille   Lyon
#>   <int>  <int>    <int>  <int>  <int>    <int>   <int>  <int>  <int>  <int>
#> 1  1962 115273   278403 136104 135694   156707  187845 132181 239955 535746
#> 2  1968 128557   266662 154023 145357   161616  207150 143246 238554 527800
#> 3  1975 137591   223131 166826 151705   166037  217882 152285 219204 456716
#> 4  1982 136038   208159 156060 140942   156637  199388 147697 196705 413095
#> 5  1990 141404   210336 147956 146703   150758  195854 145502 198691 415487
#> 6  1999 151279   215363 149634 149867   153317  190905 146105 212597 445452
#> 7  2007 151108   235178 142722 151543   156793  179751 144164 225789 472330
#> 8  2012 149017   241287 139676 152071   158346  173142 143599 228652 496343
#> # ℹ 11 more variables: Marseille <int>, Montpellier <int>, Nantes <int>,
#> #   Nice <int>, Paris <int>, Reims <int>, Rennes <int>, Saint.Etienne <int>,
#> #   Strasbourg <int>, Toulon <int>, Toulouse <int>


You can find more information on data import and tidyness on the data-import cheatsheet and on the tidyr package.

Understanding long and wide data with an animation. Source: tidyexplain

6.4.2 Tibbles

A tibble is an enhanced version of the data.frame provided by the tibble package (which is part of the tidyverse). The main advantage of tibble is that it has easier initialization and nicer printing than data.frame.

Moreover, the performance are also enhanced for the reading from files with read_csv(), read_tsv(), read_table() and read_delim() that do the same things as their read.xx() counterparts and return a tibble. Otherwise, the handling is basically the same.

More on tibbles here.

  • Note that when initializing tibbles, the construction is iterative. It means that when creating a second column, one can refer to the first one that was created. This does’t work with data.frames.
# won't work unless a `x` vector was created before
data.frame(x=runif(1e3), y=cumsum(x)) 
#> Error: object 'x' not found
library(tidyverse)
tib <- tibble(x=runif(1e3), y=cumsum(x))
tib
#> # A tibble: 1,000 × 2
#>         x     y
#>     <dbl> <dbl>
#>  1 0.858  0.858
#>  2 0.587  1.44 
#>  3 0.0768 1.52 
#>  4 0.0940 1.62 
#>  5 0.484  2.10 
#>  6 0.440  2.54 
#>  7 0.516  3.06 
#>  8 0.486  3.54 
#>  9 0.436  3.98 
#> 10 0.724  4.70 
#> # ℹ 990 more rows
Attention

Tibbles are quite strict about subsetting. [ always returns another tibble. Contrast this with a data frame: sometimes [ returns a data frame and sometimes it just returns a vector:

head(tib[[1]]) # is a vector
#> [1] 0.85826603 0.58665330 0.07683809 0.09397802 0.48431710 0.43986323
head(tib[,1])  # is a tibble
#> # A tibble: 6 × 1
#>        x
#>    <dbl>
#> 1 0.858 
#> 2 0.587 
#> 3 0.0768
#> 4 0.0940
#> 5 0.484 
#> 6 0.440

Unless you want to get a tibble, I recommend always using the $ notation when you want to get a column as a vector to avoid problems.

  • Another interesting feature of tibbles is that their columns can contain vectors, like usual, but also lists of any R objects like other tibbles, nls() objects, etc. This is called “nesting”, and you can nest and un-nest tibbles using these explicit functions:
tib1 <- tibble(x=1:3, y=1:3)
tib2 <- tibble(x=1:5, y=1:5)
tib  <- tibble(number=1:2, data=list(tib1, tib2))
tib
#> # A tibble: 2 × 2
#>   number data            
#>    <int> <list>          
#> 1      1 <tibble [3 × 2]>
#> 2      2 <tibble [5 × 2]>
Figure 6.1: Excel equivalent to a nested tibble.
tib_unnested <- unnest(tib, data)
tib_unnested
#> # A tibble: 8 × 3
#>   number     x     y
#>    <int> <int> <int>
#> 1      1     1     1
#> 2      1     2     2
#> 3      1     3     3
#> 4      2     1     1
#> 5      2     2     2
#> 6      2     3     3
#> 7      2     4     4
#> 8      2     5     5
tib_unnested_renested <- nest(tib_unnested, data = c(number, y))
tib_unnested_renested
#> # A tibble: 5 × 2
#>       x data            
#>   <int> <list>          
#> 1     1 <tibble [2 × 2]>
#> 2     2 <tibble [2 × 2]>
#> 3     3 <tibble [2 × 2]>
#> 4     4 <tibble [1 × 2]>
#> 5     5 <tibble [1 × 2]>
tib_unnested_renested$data # The `data` column is a list
#> [[1]]
#> # A tibble: 2 × 2
#>   number     y
#>    <int> <int>
#> 1      1     1
#> 2      2     1
#> 
#> [[2]]
#> # A tibble: 2 × 2
#>   number     y
#>    <int> <int>
#> 1      1     2
#> 2      2     2
#> 
#> [[3]]
#> # A tibble: 2 × 2
#>   number     y
#>    <int> <int>
#> 1      1     3
#> 2      2     3
#> 
#> [[4]]
#> # A tibble: 1 × 2
#>   number     y
#>    <int> <int>
#> 1      2     4
#> 
#> [[5]]
#> # A tibble: 1 × 2
#>   number     y
#>    <int> <int>
#> 1      2     5

6.5 Operations in the tidyverse

In the end, base R and the tidyverse package provide many efficient functions to perform most of the tasks you would want to perform recursively, thus allowing avoiding explicit for loops (that are slow).

Here are some examples, and you will find much more here. Take a look at the cheatsheets on tidyr and on dplyr, it’s really helpful.

Let’s work on this tibble:

# Let's create a random tibble
library(tidyverse)
N <- 500
dt <- tibble(x     = rep(runif(N, -1, 1), 3), 
             y     = runif(N*3, -1, 1), 
             signx = ifelse(x>0, "positive", "negative"),
             signy = ifelse(y>0, "positive", "negative")
)
dt
#> # A tibble: 1,500 × 4
#>         x        y signx    signy   
#>     <dbl>    <dbl> <chr>    <chr>   
#>  1  0.747  0.717   positive positive
#>  2  0.385  0.167   positive positive
#>  3 -0.840  0.903   negative positive
#>  4  0.885  0.718   positive positive
#>  5  0.682  0.899   positive positive
#>  6 -0.696  0.197   negative positive
#>  7 -0.661 -0.256   negative negative
#>  8  0.250 -0.00847 positive negative
#>  9  0.430 -0.512   positive negative
#> 10 -0.482  0.245   negative positive
#> # ℹ 1,490 more rows

6.5.1 The pipe operator

In the following, we will introduce the pipe operator |>, that was introduced in the version 4.1 of R. This operator allows a clear syntax for successive operations, as “what is on the left of the operator is given as first argument of what is on the right”. It is thus a good habit to write each operation on a separate line to facilitate the reading. This is particularly helpful when performing multiple nested operations. For example, summary(head(tail(dt),2)), which is hard to read, would translate to:

dt |> 
    tail() |> 
    head(2) |> 
    summary()

Note that before the 4.1 version of R, the pipe operator was only present thanks to the magrittr package, and was written %>%. In magrittr’s pipe, retrieving the piped object is done with the operator ., while it is done with the operator _ in base R’s.

Silly example:

"Hello" |> gsub("o", "e") # replace the substring "Hello" by "o" in string "e"
#> [1] "e"
"Hello" |> gsub("o", "e", x=_) # replace the substring "o" by "e" in string "Hello"
#> [1] "Helle"
"Hello" %>% gsub("o", "e") # replace the substring "Hello" by "o" in string "e"
#> [1] "e"
# /!\ In base R pipe, the '_' needs to  be for a named argument, 
# while it is not necessary for the '%>%' pipe
"Hello" %>% gsub("o", "e", .) # replace the substring "o" by "e" in string "Hello"
#> [1] "Helle"

6.5.2 Sampling data

dt |> slice(1:3)  # by index
#> # A tibble: 3 × 4
#>        x     y signx    signy   
#>    <dbl> <dbl> <chr>    <chr>   
#> 1  0.747 0.717 positive positive
#> 2  0.385 0.167 positive positive
#> 3 -0.840 0.903 negative positive
dt |> slice_sample(n=3) # randomly
#> # A tibble: 3 × 4
#>        x      y signx    signy   
#>    <dbl>  <dbl> <chr>    <chr>   
#> 1  0.385 -0.473 positive negative
#> 2  0.361 -0.748 positive negative
#> 3 -0.514  0.514 negative positive

6.5.3 Operations on groups of a variable

group_by(column) groups by similar values of the wanted column(s) and performs the next operations on each element of the group successively.

Alternatively, you can use the parameter by=column or .by=column in the functions you want to use on groups. The difference between group_by(column) and .by=column is that after group_by(column), the tibble stays grouped – so an ungroup() is needed to remove the grouping.

dt |> 
    group_by(signx)
#> # A tibble: 1,500 × 4
#> # Groups:   signx [2]
#>         x        y signx    signy   
#>     <dbl>    <dbl> <chr>    <chr>   
#>  1  0.747  0.717   positive positive
#>  2  0.385  0.167   positive positive
#>  3 -0.840  0.903   negative positive
#>  4  0.885  0.718   positive positive
#>  5  0.682  0.899   positive positive
#>  6 -0.696  0.197   negative positive
#>  7 -0.661 -0.256   negative negative
#>  8  0.250 -0.00847 positive negative
#>  9  0.430 -0.512   positive negative
#> 10 -0.482  0.245   negative positive
#> # ℹ 1,490 more rows
dt |> 
    group_by(signx) |> 
    slice_sample(n=3)
#> # A tibble: 6 × 4
#> # Groups:   signx [2]
#>         x       y signx    signy   
#>     <dbl>   <dbl> <chr>    <chr>   
#> 1 -0.182  -0.868  negative negative
#> 2 -0.979  -0.0215 negative negative
#> 3 -0.469   0.775  negative positive
#> 4  0.0818  0.151  positive positive
#> 5  0.172   0.152  positive positive
#> 6  0.909   0.995  positive positive
dt |> 
    group_by(signx, signy) |> 
    slice_sample(n=3)
#> # A tibble: 12 × 4
#> # Groups:   signx, signy [4]
#>           x        y signx    signy   
#>       <dbl>    <dbl> <chr>    <chr>   
#>  1 -0.507   -0.338   negative negative
#>  2 -0.382   -0.502   negative negative
#>  3 -0.972   -0.00249 negative negative
#>  4 -0.00435  0.723   negative positive
#>  5 -0.554    0.659   negative positive
#>  6 -0.517    0.458   negative positive
#>  7  0.0548  -0.939   positive negative
#>  8  0.672   -0.425   positive negative
#>  9  0.877   -0.435   positive negative
#> 10  0.202    0.574   positive positive
#> 11  0.747    0.0656  positive positive
#> 12  0.952    0.573   positive positive
dt |> 
    slice_sample(by = c(signx, signy), 
                 n  = 3)
#> # A tibble: 12 × 4
#>          x       y signx    signy   
#>      <dbl>   <dbl> <chr>    <chr>   
#>  1  0.712   0.926  positive positive
#>  2  0.328   0.993  positive positive
#>  3  0.617   0.356  positive positive
#>  4 -0.991   0.854  negative positive
#>  5 -0.840   0.892  negative positive
#>  6 -0.525   0.577  negative positive
#>  7 -0.222  -0.0618 negative negative
#>  8 -0.503  -0.682  negative negative
#>  9 -0.656  -0.612  negative negative
#> 10  0.145  -0.538  positive negative
#> 11  0.435  -0.821  positive negative
#> 12  0.0501 -0.388  positive negative
dt |> 
    group_by(signx, signy) |> 
    slice_sample(n=3) |> 
    ungroup()
#> # A tibble: 12 × 4
#>           x       y signx    signy   
#>       <dbl>   <dbl> <chr>    <chr>   
#>  1 -0.481   -0.0600 negative negative
#>  2 -0.0706  -0.469  negative negative
#>  3 -0.116   -0.535  negative negative
#>  4 -0.869    0.239  negative positive
#>  5 -0.00830  0.375  negative positive
#>  6 -0.796    0.207  negative positive
#>  7  0.145   -0.538  positive negative
#>  8  0.788   -0.594  positive negative
#>  9  0.564   -0.451  positive negative
#> 10  0.0818   0.233  positive positive
#> 11  0.487    0.652  positive positive
#> 12  0.758    0.132  positive positive

6.5.4 Summary by groups of a variable

summarise() returns a single value for each element of the groups.

dt |> 
    group_by(signx) |> 
    summarise(count  = n(),
              mean_x = mean(x), 
              sd_x   = sd(x))
#> # A tibble: 2 × 4
#>   signx    count mean_x  sd_x
#>   <chr>    <int>  <dbl> <dbl>
#> 1 negative   759 -0.528 0.283
#> 2 positive   741  0.503 0.295
dt |> 
    group_by(signx, signy) |> 
    summarise(count  = n(),
              mean_x = mean(x), 
              mean_y = mean(y))
#> # A tibble: 4 × 5
#> # Groups:   signx [2]
#>   signx    signy    count mean_x mean_y
#>   <chr>    <chr>    <int>  <dbl>  <dbl>
#> 1 negative negative   357 -0.518 -0.502
#> 2 negative positive   402 -0.537  0.495
#> 3 positive negative   370  0.517 -0.516
#> 4 positive positive   371  0.488  0.512

6.5.5 Sorting

dt |> arrange(x)
#> # A tibble: 1,500 × 4
#>         x        y signx    signy   
#>     <dbl>    <dbl> <chr>    <chr>   
#>  1 -0.999 -0.509   negative negative
#>  2 -0.999  0.911   negative positive
#>  3 -0.999 -0.343   negative negative
#>  4 -0.993 -0.898   negative negative
#>  5 -0.993  0.571   negative positive
#>  6 -0.993  0.340   negative positive
#>  7 -0.992 -0.00961 negative negative
#>  8 -0.992 -0.528   negative negative
#>  9 -0.992  0.211   negative positive
#> 10 -0.991  0.785   negative positive
#> # ℹ 1,490 more rows
dt |> arrange(x, desc(y))
#> # A tibble: 1,500 × 4
#>         x        y signx    signy   
#>     <dbl>    <dbl> <chr>    <chr>   
#>  1 -0.999  0.911   negative positive
#>  2 -0.999 -0.343   negative negative
#>  3 -0.999 -0.509   negative negative
#>  4 -0.993  0.571   negative positive
#>  5 -0.993  0.340   negative positive
#>  6 -0.993 -0.898   negative negative
#>  7 -0.992  0.211   negative positive
#>  8 -0.992 -0.00961 negative negative
#>  9 -0.992 -0.528   negative negative
#> 10 -0.991  0.937   negative positive
#> # ℹ 1,490 more rows

6.5.6 Merge tables column-wise

At least one column with the exact same name must be present in each table to use the xx_join() functions. There are more possibilities than inner_join() that I show here, see the help for more information.

dt2 <- tibble(signx=c("positive","positive","negative","negative"), 
              signy=c("positive","negative","positive","negative"), 
              value=c(TRUE, FALSE, FALSE, TRUE))
dt2
#> # A tibble: 4 × 3
#>   signx    signy    value
#>   <chr>    <chr>    <lgl>
#> 1 positive positive TRUE 
#> 2 positive negative FALSE
#> 3 negative positive FALSE
#> 4 negative negative TRUE
inner_join(dt, dt2)
#> # A tibble: 1,500 × 5
#>         x        y signx    signy    value
#>     <dbl>    <dbl> <chr>    <chr>    <lgl>
#>  1  0.747  0.717   positive positive TRUE 
#>  2  0.385  0.167   positive positive TRUE 
#>  3 -0.840  0.903   negative positive FALSE
#>  4  0.885  0.718   positive positive TRUE 
#>  5  0.682  0.899   positive positive TRUE 
#>  6 -0.696  0.197   negative positive FALSE
#>  7 -0.661 -0.256   negative negative TRUE 
#>  8  0.250 -0.00847 positive negative FALSE
#>  9  0.430 -0.512   positive negative FALSE
#> 10 -0.482  0.245   negative positive FALSE
#> # ℹ 1,490 more rows

6.5.7 Merge tables row-wise

This works even if there are missing rows.

dt3 <- tibble(a=1:3, b=3:5, c=6:8)
dt4 <- tibble(a=3:1, c=3:5)
bind_rows(dt3, dt4)
#> # A tibble: 6 × 3
#>       a     b     c
#>   <int> <int> <int>
#> 1     1     3     6
#> 2     2     4     7
#> 3     3     5     8
#> 4     3    NA     3
#> 5     2    NA     4
#> 6     1    NA     5

6.5.8 Add/modify a column

mutate(), like $, adds a column if it doesn’t exist, and modifies it if it does.

dt |> mutate(w=seq_along(x), z=sin(x))
#> # A tibble: 1,500 × 6
#>         x        y signx    signy        w      z
#>     <dbl>    <dbl> <chr>    <chr>    <int>  <dbl>
#>  1  0.747  0.717   positive positive     1  0.680
#>  2  0.385  0.167   positive positive     2  0.375
#>  3 -0.840  0.903   negative positive     3 -0.745
#>  4  0.885  0.718   positive positive     4  0.774
#>  5  0.682  0.899   positive positive     5  0.630
#>  6 -0.696  0.197   negative positive     6 -0.641
#>  7 -0.661 -0.256   negative negative     7 -0.614
#>  8  0.250 -0.00847 positive negative     8  0.247
#>  9  0.430 -0.512   positive negative     9  0.417
#> 10 -0.482  0.245   negative positive    10 -0.464
#> # ℹ 1,490 more rows
dt |> mutate(x=seq_along(x))
#> # A tibble: 1,500 × 4
#>        x        y signx    signy   
#>    <int>    <dbl> <chr>    <chr>   
#>  1     1  0.717   positive positive
#>  2     2  0.167   positive positive
#>  3     3  0.903   negative positive
#>  4     4  0.718   positive positive
#>  5     5  0.899   positive positive
#>  6     6  0.197   negative positive
#>  7     7 -0.256   negative negative
#>  8     8 -0.00847 positive negative
#>  9     9 -0.512   positive negative
#> 10    10  0.245   negative positive
#> # ℹ 1,490 more rows

6.5.9 Selecting columns

dt |> select(x)  # only x
#> # A tibble: 1,500 × 1
#>         x
#>     <dbl>
#>  1  0.747
#>  2  0.385
#>  3 -0.840
#>  4  0.885
#>  5  0.682
#>  6 -0.696
#>  7 -0.661
#>  8  0.250
#>  9  0.430
#> 10 -0.482
#> # ℹ 1,490 more rows
dt |> select(-x) # all but x
#> # A tibble: 1,500 × 3
#>           y signx    signy   
#>       <dbl> <chr>    <chr>   
#>  1  0.717   positive positive
#>  2  0.167   positive positive
#>  3  0.903   negative positive
#>  4  0.718   positive positive
#>  5  0.899   positive positive
#>  6  0.197   negative positive
#>  7 -0.256   negative negative
#>  8 -0.00847 positive negative
#>  9 -0.512   positive negative
#> 10  0.245   negative positive
#> # ℹ 1,490 more rows
dt |> select(starts_with("sign"))
#> # A tibble: 1,500 × 2
#>    signx    signy   
#>    <chr>    <chr>   
#>  1 positive positive
#>  2 positive positive
#>  3 negative positive
#>  4 positive positive
#>  5 positive positive
#>  6 negative positive
#>  7 negative negative
#>  8 positive negative
#>  9 positive negative
#> 10 negative positive
#> # ℹ 1,490 more rows
dt |> select(contains("x"))
#> # A tibble: 1,500 × 2
#>         x signx   
#>     <dbl> <chr>   
#>  1  0.747 positive
#>  2  0.385 positive
#>  3 -0.840 negative
#>  4  0.885 positive
#>  5  0.682 positive
#>  6 -0.696 negative
#>  7 -0.661 negative
#>  8  0.250 positive
#>  9  0.430 positive
#> 10 -0.482 negative
#> # ℹ 1,490 more rows

6.5.10 Filtering columns

dt |> filter(signx=="positive")
#> # A tibble: 741 × 4
#>        x        y signx    signy   
#>    <dbl>    <dbl> <chr>    <chr>   
#>  1 0.747  0.717   positive positive
#>  2 0.385  0.167   positive positive
#>  3 0.885  0.718   positive positive
#>  4 0.682  0.899   positive positive
#>  5 0.250 -0.00847 positive negative
#>  6 0.430 -0.512   positive negative
#>  7 0.705  0.230   positive positive
#>  8 0.691  0.496   positive positive
#>  9 0.437  0.446   positive positive
#> 10 0.490  0.0645  positive positive
#> # ℹ 731 more rows
dt |> filter(x<0, y>.1) # multiple filters can be applied at once
#> # A tibble: 358 × 4
#>         x     y signx    signy   
#>     <dbl> <dbl> <chr>    <chr>   
#>  1 -0.840 0.903 negative positive
#>  2 -0.696 0.197 negative positive
#>  3 -0.482 0.245 negative positive
#>  4 -0.341 0.526 negative positive
#>  5 -0.644 0.356 negative positive
#>  6 -0.651 0.621 negative positive
#>  7 -0.164 0.860 negative positive
#>  8 -0.611 0.496 negative positive
#>  9 -0.833 0.826 negative positive
#> 10 -0.673 0.326 negative positive
#> # ℹ 348 more rows

6.5.11 Reorder columns

dt |> relocate(y, .after = signy)
#> # A tibble: 1,500 × 4
#>         x signx    signy           y
#>     <dbl> <chr>    <chr>       <dbl>
#>  1  0.747 positive positive  0.717  
#>  2  0.385 positive positive  0.167  
#>  3 -0.840 negative positive  0.903  
#>  4  0.885 positive positive  0.718  
#>  5  0.682 positive positive  0.899  
#>  6 -0.696 negative positive  0.197  
#>  7 -0.661 negative negative -0.256  
#>  8  0.250 positive negative -0.00847
#>  9  0.430 positive negative -0.512  
#> 10 -0.482 negative positive  0.245  
#> # ℹ 1,490 more rows

6.5.12 Separate columns

The separation is based on standard separators such as “-”, “_”, “.”, ” “, etc. A single separator can be specified with the argument sep, otherwise all separators are used. One must provide the resulting vector of new column names: if one value is NA, this column will be discarded. Examples:

dt5 <- tibble(file=list.files(path="Exo/FTIR/Data/", pattern=".xls"))
dt5
#> # A tibble: 10 × 1
#>    file             
#>    <chr>            
#>  1 sample_0_25C.xls 
#>  2 sample_0_300C.xls
#>  3 sample_1_25C.xls 
#>  4 sample_1_300C.xls
#>  5 sample_2_25C.xls 
#>  6 sample_2_300C.xls
#>  7 sample_3_25C.xls 
#>  8 sample_3_300C.xls
#>  9 sample_4_25C.xls 
#> 10 sample_4_300C.xls
dt5 |> separate(file, c(NA, "sample", "temperature", NA), convert = TRUE)
#> # A tibble: 10 × 2
#>    sample temperature
#>     <int> <chr>      
#>  1      0 25C        
#>  2      0 300C       
#>  3      1 25C        
#>  4      1 300C       
#>  5      2 25C        
#>  6      2 300C       
#>  7      3 25C        
#>  8      3 300C       
#>  9      4 25C        
#> 10      4 300C
dt5 |> separate(file, 
                c("name", "extension"), 
                sep = "\\.")
#> # A tibble: 10 × 2
#>    name          extension
#>    <chr>         <chr>    
#>  1 sample_0_25C  xls      
#>  2 sample_0_300C xls      
#>  3 sample_1_25C  xls      
#>  4 sample_1_300C xls      
#>  5 sample_2_25C  xls      
#>  6 sample_2_300C xls      
#>  7 sample_3_25C  xls      
#>  8 sample_3_300C xls      
#>  9 sample_4_25C  xls      
#> 10 sample_4_300C xls

6.5.13 Apply a function recursively on each element of a column

Take a look at the cheatsheet on the purrr package for more options and a visual help on the map() family. I show here a use of purrr::map(vector, function) that returns a list. map(x, f) applies the function f() to each element of the vector x, putting the result in a separate element of a list: map(x, f) -> list(f(x1), f(x2), ... f(xn)). In case f(xi) returns a single value, you might want to use map_dbl() or map_chr(), for example, that will return a vector of doubles or of characters, respectively.

x <- c(pi, pi/3, pi/2)
map(x, sin)     # returns a list
#> [[1]]
#> [1] 1.224647e-16
#> 
#> [[2]]
#> [1] 0.8660254
#> 
#> [[3]]
#> [1] 1
x |> map_dbl(sin) # returns a vector
#> [1] 1.224647e-16 8.660254e-01 1.000000e+00

Of course, in the above case, it’s a stupid use of the power of map(). A typical use case is when you want to read multiple files, for example:

dt6 <- tibble(file=list.files(path="Exo/spectro2/Data", 
                              pattern = ".txt", 
                              full.names = TRUE)) |> 
    slice(1:5) |> 
    mutate(data = map(file, read_table, col_names = c("w", "Int"))) |> 
    mutate(file = basename(file))
dt6
#> # A tibble: 5 × 2
#>   file         data                  
#>   <chr>        <list>                
#> 1 rubis_01.txt <spc_tbl_ [1,015 × 2]>
#> 2 rubis_02.txt <spc_tbl_ [1,015 × 2]>
#> 3 rubis_03.txt <spc_tbl_ [1,015 × 2]>
#> 4 rubis_04.txt <spc_tbl_ [1,015 × 2]>
#> 5 rubis_05.txt <spc_tbl_ [1,015 × 2]>

This is (almost) equivalent to:

dt6 <- tibble(file=list.files(path="Exo/spectro2/Data", 
                              pattern = ".txt", 
                              full.names = TRUE)) |> 
    slice(1:5) |> 
    mutate(data = map(file, ~read_table(., col_names = c("w", "Int")))) |> 
    mutate(file = basename(file))

You see that you can create the function directly within the call to map using the shortcut map(vector, ~ function(.)). This is useful to provide more arguments to the function – another solution is to write your own function before the call to map() and then call this function in map().

Note that in case you need more parameters, you can use purrr::map2(vector1, vector2, ~function(.x, .y)), where .x and .y refer to vector1 and vector2, respectively (it’s always .x and .y whatever the name of vector1 and vector2).

tibble(x=1:3, y=5:7) |> 
    mutate(sum = map2_dbl(x, y, sum))
#> # A tibble: 3 × 3
#>       x     y   sum
#>   <int> <int> <dbl>
#> 1     1     5     6
#> 2     2     6     8
#> 3     3     7    10
tibble(a=list(tibble(x=1:3, y=5:7), 
              tibble(x=0:3, y=4:7)), 
       b=list(tibble(x=10:13, y=15:18), 
              tibble(x=-1:2,  y=-14:-17))) |> 
    mutate(sumx = map2_dbl(a, b, ~sum(.x$x, .y$x)),
           sumy = map2_dbl(a, b, ~sum(.x$y, .y$y)))
#> # A tibble: 2 × 4
#>   a                b                 sumx  sumy
#>   <list>           <list>           <dbl> <dbl>
#> 1 <tibble [3 × 2]> <tibble [4 × 2]>    52    84
#> 2 <tibble [4 × 2]> <tibble [4 × 2]>     8   -40

6.5.14 Nesting and un-nesting data

dt7 <- dt6 |> 
    mutate(file = basename(file)) |> 
    unnest(data)
dt7
#> # A tibble: 5,075 × 3
#>    file             w   Int
#>    <chr>        <dbl> <dbl>
#>  1 rubis_01.txt 3064.  43.9
#>  2 rubis_01.txt 3064.  47.9
#>  3 rubis_01.txt 3064.  44.5
#>  4 rubis_01.txt 3065.  50.5
#>  5 rubis_01.txt 3065.  50.5
#>  6 rubis_01.txt 3065.  44.5
#>  7 rubis_01.txt 3065.  44.9
#>  8 rubis_01.txt 3066.  39.9
#>  9 rubis_01.txt 3066.  49.5
#> 10 rubis_01.txt 3066.  48.9
#> # ℹ 5,065 more rows
# Nesting data per repeated values in a column (~equivalent to grouping)
dt7 |> nest(data=-file)
#> # A tibble: 5 × 2
#>   file         data                
#>   <chr>        <list>              
#> 1 rubis_01.txt <tibble [1,015 × 2]>
#> 2 rubis_02.txt <tibble [1,015 × 2]>
#> 3 rubis_03.txt <tibble [1,015 × 2]>
#> 4 rubis_04.txt <tibble [1,015 × 2]>
#> 5 rubis_05.txt <tibble [1,015 × 2]>

6.5.15 Providing data to ggplot

dt |> 
    filter(abs(y) > 0.1) |> 
    ggplot(aes(x=x, y=y, color=signy))+
        geom_point()

6.6 Exercises

Download the exercises and solutions from the following repositories, then create a Rstudio project from the unzipped folder:


Exercise 1
  • Create a 3 column data.frame containing 10 random values, their sinus, and the sum of the two first columns.
  • Print the 4 first lines of the table
  • Print the second column
  • Print the average of the third column
  • Using plot(x,y) where x and y are vectors, plot the 2nd column as a function of the first
  • Look into the function write.table() to write a text file containing this data.frame
  • Do the all the same things with a tibble
Solution
# Create a 3 column `data.frame`{.R} containing 10 random values, their sinus, 
# and the sum of the two first columns.
x <- runif(10)
y <- sin(x)
z <- x + y
df <- data.frame(x=x, y=y, z=z)
# Print the 4 first lines of the table
head(df, 4)
#>            x          y          z
#> 1 0.02481736 0.02481481 0.04963217
#> 2 0.97544054 0.82794902 1.80338956
#> 3 0.03751536 0.03750656 0.07502193
#> 4 0.16368246 0.16295254 0.32663500
# Print the second column
df[,2]
#>  [1] 0.02481481 0.82794902 0.03750656 0.16295254 0.53369116 0.47024700
#>  [7] 0.56220368 0.25057748 0.64617038 0.53441411
# Print the average of the third column
mean(df$z); mean(df[3]); mean(df[,3])
#> [1] 0.8421207
#> [1] NA
#> [1] 0.8421207
# Using `plot(x,y)`{.R} where `x` and `y` are vectors, 
# plot the 2nd column as a function of the first
plot(df[,1], df[,2])

plot(df$x, df$y)

# Look into the function `write.table()`{.R} to write a text file 
# containing this `data.frame`{.R}
write.table(df, "Data/some_data.dat", quote = FALSE, row.names = FALSE)
# # # # # # # # # # # # # # # # # 
# Tibble version
library(tidyverse)
df_tib <- tibble(a = runif(10), b = sin(a), c = a + b)
head(df_tib, 4)
#> # A tibble: 4 × 3
#>       a     b     c
#>   <dbl> <dbl> <dbl>
#> 1 0.454 0.439 0.893
#> 2 0.272 0.268 0.540
#> 3 0.200 0.199 0.399
#> 4 0.119 0.119 0.238
df_tib[,2]; df_tib[[2]];
#> # A tibble: 10 × 1
#>         b
#>     <dbl>
#>  1 0.439 
#>  2 0.268 
#>  3 0.199 
#>  4 0.119 
#>  5 0.236 
#>  6 0.175 
#>  7 0.0257
#>  8 0.789 
#>  9 0.525 
#> 10 0.0389
#>  [1] 0.43875863 0.26829860 0.19899558 0.11863278 0.23635077 0.17545042
#>  [7] 0.02566330 0.78946592 0.52540591 0.03894471
mean(df_tib$c); mean(df_tib[3]); mean(df_tib[,3]); mean(df_tib[[3]])
#> [1] 0.5803778
#> [1] NA
#> [1] NA
#> [1] 0.5803778
write.table(df_tib, "Data/some_data.dat", quote = FALSE, row.names = FALSE)
plot(df_tib$a, df_tib$b)

Exercise 2
  • Download the files:
  • Load them into separate data.frames. Look into the options of read.table(), read.csv(), readxl::read_excel(), to get the proper data fields.
  • Add column names to the data.frame containing rubis_01.txt.
  • Print their dimensions.
  • Do the same things with tibbles.
Solution
rubis_01   <- read.table("Data/rubis_01.txt", col.names = c("w", "intensity"))
population <- read.csv("Data/population.csv")
FTIR_rocks <- readxl::read_excel("Data/FTIR_rocks.xlsx")
dim(rubis_01); names(rubis_01)
#> [1] 1015    2
#> [1] "w"         "intensity"
dim(population); names(population)
#> [1]  8 21
#>  [1] "year"          "Angers"        "Bordeaux"      "Brest"        
#>  [5] "Dijon"         "Grenoble"      "LeHavre"       "LeMans"       
#>  [9] "Lille"         "Lyon"          "Marseille"     "Montpellier"  
#> [13] "Nantes"        "Nice"          "Paris"         "Reims"        
#> [17] "Rennes"        "Saint.Etienne" "Strasbourg"    "Toulon"       
#> [21] "Toulouse"
dim(FTIR_rocks); names(FTIR_rocks)
#> [1] 4718    4
#> [1] "wavenumber, cm-1" "rock 1"           "rock 2"           "rock 3"
library(tidyverse)
rubis_01 <- read_table("Data/rubis_01.txt", col_names = c("w", "intensity"))
population <- read_csv("Data/population.csv")
Exercise 3
  • Download the TGA data file ATG.txt
  • Load it into a data.frame. Look into the options of read.table() to get the proper data fields.
  • Do the same with a tibble
Solution
d <- read.table("Data/ATG.txt", 
                skip=12,
                header=FALSE, 
                nrows=4088)
names(d) <- c("Index", "t", "Ts", "Tr", "Value")
head(d)
#>   Index  t      Ts Tr   Value
#> 1     0  0 32.3769 25 32.9680
#> 2     3  3 32.4051 25 32.9655
#> 3     6  6 32.4332 25 32.9619
#> 4     9  9 32.4726 25 32.9582
#> 5    12 12 32.5066 25 32.9544
#> 6    15 15 32.5221 25 32.9504
d <- read.table("Data/ATG.txt", 
                skip=10,
                comment.char="[",
                header=TRUE, 
                nrows=4088)
head(d)
#>   Index  t      Ts Tr   Value
#> 1     0  0 32.3769 25 32.9680
#> 2     3  3 32.4051 25 32.9655
#> 3     6  6 32.4332 25 32.9619
#> 4     9  9 32.4726 25 32.9582
#> 5    12 12 32.5066 25 32.9544
#> 6    15 15 32.5221 25 32.9504
library(tidyverse)
d <- read_table("Data/ATG.txt", 
                skip    = 10,
                comment = "[") |> 
        drop_na()
d
#> # A tibble: 4,088 × 5
#>    Index     t    Ts    Tr Value
#>    <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1     0     0  32.4    25  33.0
#>  2     3     3  32.4    25  33.0
#>  3     6     6  32.4    25  33.0
#>  4     9     9  32.5    25  33.0
#>  5    12    12  32.5    25  33.0
#>  6    15    15  32.5    25  33.0
#>  7    18    18  32.6    25  32.9
#>  8    21    21  32.6    25  32.9
#>  9    24    24  32.6    25  32.9
#> 10    27    27  32.7    25  32.9
#> # ℹ 4,078 more rows
Exercise 4

Download population.csv and load it into a tibble.

  • What are the names of the columns?
  • Are the data tidy? make the table tidy if needed
  • Create a subset containing the data for Montpellier
    • What is the max and min of population in this city?
    • The average population over time?
  • What is the total population in 2012?
  • What is the total population per year?
  • What is the average population per city over the years?
Solution
# Download population.txt and load it into a `data.frame`{.R}.
library(tidyverse)
popul <- read_csv("Data/population.csv")
# What are the names of the columns and the dimension of the table?
names(popul); dim(popul)
#>  [1] "year"          "Angers"        "Bordeaux"      "Brest"        
#>  [5] "Dijon"         "Grenoble"      "LeHavre"       "LeMans"       
#>  [9] "Lille"         "Lyon"          "Marseille"     "Montpellier"  
#> [13] "Nantes"        "Nice"          "Paris"         "Reims"        
#> [17] "Rennes"        "Saint-Etienne" "Strasbourg"    "Toulon"       
#> [21] "Toulouse"
#> [1]  8 21
# Are the data tidy?
head(popul) # no
#> # A tibble: 6 × 21
#>    year Angers Bordeaux  Brest  Dijon Grenoble LeHavre LeMans  Lille   Lyon
#>   <dbl>  <dbl>    <dbl>  <dbl>  <dbl>    <dbl>   <dbl>  <dbl>  <dbl>  <dbl>
#> 1  1962 115273   278403 136104 135694   156707  187845 132181 239955 535746
#> 2  1968 128557   266662 154023 145357   161616  207150 143246 238554 527800
#> 3  1975 137591   223131 166826 151705   166037  217882 152285 219204 456716
#> 4  1982 136038   208159 156060 140942   156637  199388 147697 196705 413095
#> 5  1990 141404   210336 147956 146703   150758  195854 145502 198691 415487
#> 6  1999 151279   215363 149634 149867   153317  190905 146105 212597 445452
#> # ℹ 11 more variables: Marseille <dbl>, Montpellier <dbl>, Nantes <dbl>,
#> #   Nice <dbl>, Paris <dbl>, Reims <dbl>, Rennes <dbl>, `Saint-Etienne` <dbl>,
#> #   Strasbourg <dbl>, Toulon <dbl>, Toulouse <dbl>
popul.tidy <- popul |> 
    pivot_longer(cols=-year,
                 names_to = "city",
                 values_to = "pop"
                )
popul.tidy
#> # A tibble: 160 × 3
#>     year city         pop
#>    <dbl> <chr>      <dbl>
#>  1  1962 Angers    115273
#>  2  1962 Bordeaux  278403
#>  3  1962 Brest     136104
#>  4  1962 Dijon     135694
#>  5  1962 Grenoble  156707
#>  6  1962 LeHavre   187845
#>  7  1962 LeMans    132181
#>  8  1962 Lille     239955
#>  9  1962 Lyon      535746
#> 10  1962 Marseille 778071
#> # ℹ 150 more rows
# Create a subset containing the data for Montpellier
mtp <- subset(popul.tidy, city == "Montpellier")
# I prefer the tidyverse version
mtp <- popul.tidy |> filter(city == "Montpellier")
# What is the max and min of population in this city?
max(mtp$pop)
#> [1] 268456
min(mtp$pop)
#> [1] 118864
range(mtp$pop)
#> [1] 118864 268456
# The average population over time?
mean(mtp$pop)
#> [1] 203114.4
# What is the total population in 2012?
sum(popul.tidy[popul.tidy$year == 2012, "pop"])
#> [1] 7334805
popul.tidy |> 
    filter(year==2012) |> 
    select(pop) |> 
    sum()
#> [1] 7334805
# What is the total population per year?
popul.tidy |> 
    group_by(year) |> 
    summarise(pop_tot=sum(pop))
#> # A tibble: 8 × 2
#>    year pop_tot
#>   <dbl>   <dbl>
#> 1  1962 7349547
#> 2  1968 7550999
#> 3  1975 7298183
#> 4  1982 6933230
#> 5  1990 6857291
#> 6  1999 6965325
#> 7  2007 7235114
#> 8  2012 7334805
# What is the average population per city over the years?
popul.tidy |> 
    group_by(city) |> 
    summarise(pop_ave=mean(pop))
#> # A tibble: 20 × 2
#>    city           pop_ave
#>    <chr>            <dbl>
#>  1 Angers         138783.
#>  2 Bordeaux       234815.
#>  3 Brest          149125.
#>  4 Dijon          146735.
#>  5 Grenoble       157526.
#>  6 LeHavre        193990.
#>  7 LeMans         144347.
#>  8 Lille          220018.
#>  9 Lyon           470371.
#> 10 Marseille      844253.
#> 11 Montpellier    203114.
#> 12 Nantes         260925.
#> 13 Nice           334312.
#> 14 Paris         2321032.
#> 15 Reims          172278 
#> 16 Rennes         193425.
#> 17 Saint-Etienne  198135.
#> 18 Strasbourg     255429.
#> 19 Toulon         169683.
#> 20 Toulouse       382265.
Exercise 5
  • First, load the tidyverse and lubridate package
  • Load people1.csv and people2.csv into pp1 and pp2
  • Create a new tibble pp by using the pipe operator (%>%) and successively:
    • joining the two tibbles into one using inner_join()
    • adding a column age containing the age in years (use lubridate::time_length(x, 'years') with x a time difference in days) by using mutate()
  • Display a summary of the table using str()
  • Using groupe_by() and summarize():
    • Show the number of males and females in the table (use the counter n())
    • Show the average age per gender
    • Show the average size per gender and institution
    • Show the number of people from each country, sorted by descending population (arrange())
  • Using select(), display:
    • only the name and age columns
    • all but the name column
  • Using filter(), show data only for
    • Chinese people
    • From institution ECL and UCBL
    • People older than 22
    • People with a e in their name
Solution
# First, load the `tidyverse` package
library(tidyverse)
# Load people1.csv and people2.csv
pp1  <- read_csv("Data/people1.csv")
pp2  <- read_csv("Data/people2.csv")
# Create a new tibble `pp` by using the pipe operator (`%>%`)
# and successively:
# - joining the two tibbles into one using `inner_join()`
# - adding a column `age` containing the age in years 
#   (use lubridate's `time_length(x, 'years')` with x a time
#   difference in days) by using `mutate()`
pp <- pp1 |> 
        inner_join(pp2) |> 
        mutate(age=time_length(today()-dateofbirth,'years'))
# Display a summary of the table using `str()`
str(pp)
#> tibble [20 × 7] (S3: tbl_df/tbl/data.frame)
#>  $ name       : chr [1:20] "Salem" "Dilruwan-Shanaka-Perera" "Hanna" "Sabin" ...
#>  $ gender     : chr [1:20] "Male" "Male" "Female" "Male" ...
#>  $ origin     : chr [1:20] "Yemen" "Sri Lanka" "Ukraine" "India" ...
#>  $ institution: chr [1:20] "UCBL" "INSA" "ECL" "INSA" ...
#>  $ dateofbirth: Date[1:20], format: "1997-12-26" "1997-03-28" ...
#>  $ size       : num [1:20] 161 172 165 186 176 ...
#>  $ age        : num [1:20] 26.9 27.6 26.9 29.3 29.6 ...
# Using `groupe_by()` and `summarize()`:
# - Show the number of males and females in the table 
#   (use the counter `n()`)
pp |> 
    group_by(gender) |> 
    summarize(count=n())
#> # A tibble: 2 × 2
#>   gender count
#>   <chr>  <int>
#> 1 Female     4
#> 2 Male      16
# - Show the average age per gender
pp |> 
    group_by(gender) |> 
    summarize(age=mean(age))
#> # A tibble: 2 × 2
#>   gender   age
#>   <chr>  <dbl>
#> 1 Female  28.9
#> 2 Male    28.6
# - Show the average size per gender and institution
pp |> 
    group_by(gender, institution) |> 
    summarize(size=mean(size))
#> # A tibble: 4 × 3
#> # Groups:   gender [2]
#>   gender institution  size
#>   <chr>  <chr>       <dbl>
#> 1 Female ECL          178.
#> 2 Male   ECL          168.
#> 3 Male   INSA         174.
#> 4 Male   UCBL         174.
# - Show the number of people from each country, 
#   sorted by descending population
pp |> 
    group_by(origin) |> 
    summarize(count=n()) |> 
    arrange(desc(count))
#> # A tibble: 13 × 2
#>    origin      count
#>    <chr>       <int>
#>  1 China           4
#>  2 Ukraine         4
#>  3 USA             2
#>  4 Afghanistan     1
#>  5 Austria         1
#>  6 Brazil          1
#>  7 Colombia        1
#>  8 Cyprus          1
#>  9 India           1
#> 10 Iran            1
#> 11 Sri Lanka       1
#> 12 Tunisia         1
#> 13 Yemen           1
# Using `select()`, display:
# - only the name and age columns
pp |> select(c(name, age))
#> # A tibble: 20 × 2
#>    name                      age
#>    <chr>                   <dbl>
#>  1 Salem                    26.9
#>  2 Dilruwan-Shanaka-Perera  27.6
#>  3 Hanna                    26.9
#>  4 Sabin                    29.3
#>  5 Benedikt                 29.6
#>  6 Jordyn                   27.7
#>  7 Jennifer                 29.5
#>  8 Yiran                    29.8
#>  9 Leran                    31.9
#> 10 Aymen                    34.7
#> 11 Pavlo                    27.6
#> 12 Saulo                    30.1
#> 13 Nicolas-Estevan          31.0
#> 14 Farzad                   27.9
#> 15 Roein                    25.3
#> 16 Paraskevas               25.4
#> 17 Ihor                     25.1
#> 18 Iryna                    31.7
#> 19 Peng                     27.9
#> 20 Mingyuan                 27.2
# - all but the name column
pp |> select(-name)
#> # A tibble: 20 × 6
#>    gender origin      institution dateofbirth  size   age
#>    <chr>  <chr>       <chr>       <date>      <dbl> <dbl>
#>  1 Male   Yemen       UCBL        1997-12-26   161.  26.9
#>  2 Male   Sri Lanka   INSA        1997-03-28   172.  27.6
#>  3 Female Ukraine     ECL         1997-12-30   165.  26.9
#>  4 Male   India       INSA        1995-08-04   186.  29.3
#>  5 Male   Austria     UCBL        1995-04-25   176.  29.6
#>  6 Female USA         ECL         1997-02-19   176.  27.7
#>  7 Female USA         ECL         1995-05-28   179   29.5
#>  8 Male   China       UCBL        1995-02-04   188.  29.8
#>  9 Male   China       UCBL        1992-12-30   186   31.9
#> 10 Male   Tunisia     INSA        1990-03-03   160.  34.7
#> 11 Male   Ukraine     ECL         1997-04-12   151.  27.6
#> 12 Male   Brazil      ECL         1994-09-24   184.  30.1
#> 13 Male   Colombia    INSA        1993-11-25   184.  31.0
#> 14 Male   Iran        INSA        1996-12-27   183   27.9
#> 15 Male   Afghanistan INSA        1999-07-11   155.  25.3
#> 16 Male   Cyprus      INSA        1999-06-25   176.  25.4
#> 17 Male   Ukraine     ECL         1999-10-03   170.  25.1
#> 18 Female Ukraine     ECL         1993-02-27   192   31.7
#> 19 Male   China       UCBL        1996-12-14   171   27.9
#> 20 Male   China       UCBL        1997-08-21   164.  27.2
# Using `filter()`, show data only for
# - Chinese people
pp |> filter(origin=='China')
#> # A tibble: 4 × 7
#>   name     gender origin institution dateofbirth  size   age
#>   <chr>    <chr>  <chr>  <chr>       <date>      <dbl> <dbl>
#> 1 Yiran    Male   China  UCBL        1995-02-04   188.  29.8
#> 2 Leran    Male   China  UCBL        1992-12-30   186   31.9
#> 3 Peng     Male   China  UCBL        1996-12-14   171   27.9
#> 4 Mingyuan Male   China  UCBL        1997-08-21   164.  27.2
# - From institution ECL and UCBL
pp |> filter(institution %in% c('ECL', 'UCBL'))
#> # A tibble: 13 × 7
#>    name     gender origin  institution dateofbirth  size   age
#>    <chr>    <chr>  <chr>   <chr>       <date>      <dbl> <dbl>
#>  1 Salem    Male   Yemen   UCBL        1997-12-26   161.  26.9
#>  2 Hanna    Female Ukraine ECL         1997-12-30   165.  26.9
#>  3 Benedikt Male   Austria UCBL        1995-04-25   176.  29.6
#>  4 Jordyn   Female USA     ECL         1997-02-19   176.  27.7
#>  5 Jennifer Female USA     ECL         1995-05-28   179   29.5
#>  6 Yiran    Male   China   UCBL        1995-02-04   188.  29.8
#>  7 Leran    Male   China   UCBL        1992-12-30   186   31.9
#>  8 Pavlo    Male   Ukraine ECL         1997-04-12   151.  27.6
#>  9 Saulo    Male   Brazil  ECL         1994-09-24   184.  30.1
#> 10 Ihor     Male   Ukraine ECL         1999-10-03   170.  25.1
#> 11 Iryna    Female Ukraine ECL         1993-02-27   192   31.7
#> 12 Peng     Male   China   UCBL        1996-12-14   171   27.9
#> 13 Mingyuan Male   China   UCBL        1997-08-21   164.  27.2
# - People older than 22 
pp |> filter(age>22)
#> # A tibble: 20 × 7
#>    name                    gender origin     institution dateofbirth  size   age
#>    <chr>                   <chr>  <chr>      <chr>       <date>      <dbl> <dbl>
#>  1 Salem                   Male   Yemen      UCBL        1997-12-26   161.  26.9
#>  2 Dilruwan-Shanaka-Perera Male   Sri Lanka  INSA        1997-03-28   172.  27.6
#>  3 Hanna                   Female Ukraine    ECL         1997-12-30   165.  26.9
#>  4 Sabin                   Male   India      INSA        1995-08-04   186.  29.3
#>  5 Benedikt                Male   Austria    UCBL        1995-04-25   176.  29.6
#>  6 Jordyn                  Female USA        ECL         1997-02-19   176.  27.7
#>  7 Jennifer                Female USA        ECL         1995-05-28   179   29.5
#>  8 Yiran                   Male   China      UCBL        1995-02-04   188.  29.8
#>  9 Leran                   Male   China      UCBL        1992-12-30   186   31.9
#> 10 Aymen                   Male   Tunisia    INSA        1990-03-03   160.  34.7
#> 11 Pavlo                   Male   Ukraine    ECL         1997-04-12   151.  27.6
#> 12 Saulo                   Male   Brazil     ECL         1994-09-24   184.  30.1
#> 13 Nicolas-Estevan         Male   Colombia   INSA        1993-11-25   184.  31.0
#> 14 Farzad                  Male   Iran       INSA        1996-12-27   183   27.9
#> 15 Roein                   Male   Afghanist… INSA        1999-07-11   155.  25.3
#> 16 Paraskevas              Male   Cyprus     INSA        1999-06-25   176.  25.4
#> 17 Ihor                    Male   Ukraine    ECL         1999-10-03   170.  25.1
#> 18 Iryna                   Female Ukraine    ECL         1993-02-27   192   31.7
#> 19 Peng                    Male   China      UCBL        1996-12-14   171   27.9
#> 20 Mingyuan                Male   China      UCBL        1997-08-21   164.  27.2
# - People with a `e` in their name
pp |> filter(grepl('e',name))
#> # A tibble: 10 × 7
#>    name                    gender origin     institution dateofbirth  size   age
#>    <chr>                   <chr>  <chr>      <chr>       <date>      <dbl> <dbl>
#>  1 Salem                   Male   Yemen      UCBL        1997-12-26   161.  26.9
#>  2 Dilruwan-Shanaka-Perera Male   Sri Lanka  INSA        1997-03-28   172.  27.6
#>  3 Benedikt                Male   Austria    UCBL        1995-04-25   176.  29.6
#>  4 Jennifer                Female USA        ECL         1995-05-28   179   29.5
#>  5 Leran                   Male   China      UCBL        1992-12-30   186   31.9
#>  6 Aymen                   Male   Tunisia    INSA        1990-03-03   160.  34.7
#>  7 Nicolas-Estevan         Male   Colombia   INSA        1993-11-25   184.  31.0
#>  8 Roein                   Male   Afghanist… INSA        1999-07-11   155.  25.3
#>  9 Paraskevas              Male   Cyprus     INSA        1999-06-25   176.  25.4
#> 10 Peng                    Male   China      UCBL        1996-12-14   171   27.9

For more interesting exercises in the tidyverse, look at: