In R, data is the central object, hence the data.frame, which is basically a table made of vectors. A data.frame is a list displayed as a table, i.e. like a spreadsheet. On a day-to-day basis, you will either define a data.frame from existing vectors or other data.frames, or from a file (text, Excel…). In this example, we use test.dat and test.xlsx.
To define a data.frame from known vectors, we just have to do:
x <- seq(-pi, pi, length.out = 6)
y <- sin(x)
df <- data.frame(x, y) # df is a data.frame (a table)
df                     # print the table
summary(df)            # summarize each column
#> x y
#> Min. :-3.142 Min. :-0.9511
#> 1st Qu.:-1.571 1st Qu.:-0.4408
#> Median : 0.000 Median : 0.0000
#> Mean : 0.000 Mean : 0.0000
#> 3rd Qu.: 1.571 3rd Qu.: 0.4408
#> Max. : 3.142 Max. : 0.9511
By default, the columns are named after the vectors used to create the data.frame. To specify your own column names, do so when creating the data.frame:
#> X Y
#> 1 -3.141593 -1.224647e-16
#> 2 -1.884956 -9.510565e-01
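The column naming described above can be sketched as follows. This is only a minimal illustration reusing the x and y vectors defined earlier; the object names df_default and df_named are chosen here for the example.

```r
# Vectors as defined earlier in this section
x <- seq(-pi, pi, length.out = 6)
y <- sin(x)

# Default: the column names are taken from the vector names
df_default <- data.frame(x, y)
names(df_default)

# Explicit column names given when creating the data.frame
df_named <- data.frame(X = x, Y = y)
head(df_named, 2)
```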
6.1.2 Defining a data.frame from a file
A text file
Let’s say we have test.dat that looks like this:
# Bash code:
head Data/test.dat
#> x y
#> 1 2
#> 2 3
Then, to read this file into a data.frame, we use read.table(). If you don’t specify that the file contains a header (header = TRUE), read.table() will name the columns V1, V2, V3, etc.
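As a rough sketch of the two behaviours, the example below writes a small stand-in for Data/test.dat to a temporary file so that it runs on its own. The temporary file and the names tmp_file, df_nohead and df_head are assumptions made for this illustration.

```r
# Write a small stand-in for Data/test.dat to a temporary file (illustration only)
tmp_file <- tempfile(fileext = ".dat")
writeLines(c("x y", "1 2", "2 3"), tmp_file)

# Without header = TRUE, the first line is read as data
# and the columns are named V1, V2
df_nohead <- read.table(tmp_file)
names(df_nohead)

# With header = TRUE, the first line provides the column names
df_head <- read.table(tmp_file, header = TRUE)
names(df_head)
df_head
```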
A good practice in R is to tidy your data. R follows a set of conventions that makes one layout of tabular data much easier to work with than others. Your data will be easier to work with in R if it follows three rules:
Each variable in the data set is placed in its own column
Each observation is placed in its own row
Each value is placed in its own cell
Illustration of tidy data.
Data that satisfies these rules is known as tidy data. Thanks to this representation, a 2D table can handle an arbitrary number of variables, which avoids using multi-dimensional arrays or multi-tab Excel documents. Note that it doesn’t matter if a value is repeated in a column.
Here is an example:
df <- read.csv("Data/population.csv")
df # is not tidy
Understanding long and wide data with an animation. Source: tidyexplain
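To make the wide versus long distinction concrete, here is a minimal sketch using tidyr's pivot_longer() and pivot_wider(). The table below is hypothetical (made-up numbers, one column per city) and is not the content of Data/population.csv.

```r
library(tidyr)

# Hypothetical wide table: one column per city (made-up numbers)
wide <- data.frame(
  year        = c(2010, 2011, 2012),
  Montpellier = c(100, 110, 120),
  Lyon        = c(200, 210, 220)
)

# Long (tidy) layout: one row per (year, city) observation
long <- pivot_longer(wide, cols = -year, names_to = "city", values_to = "pop")
long

# Back to the wide layout
wide_again <- pivot_wider(long, names_from = city, values_from = pop)
wide_again
```

In the long layout, each of the three tidy rules holds: year, city and pop are variables in their own columns, and each row is a single observation.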
6.4.2 Tibbles
A tibble is an enhanced version of the data.frame provided by the tibble package (which is part of the tidyverse). The main advantages of a tibble are easier initialization and nicer printing than a data.frame.
Moreover, reading from files is faster with read_csv(), read_tsv(), read_table() and read_delim() (from the readr package), which do the same things as their read.xx() counterparts and return a tibble. Otherwise, the handling is basically the same.
Note that when initializing tibbles, the construction is iterative. It means that when creating a second column, one can refer to the first one that was created. This doesn’t work with data.frames.
# won't work unless an `x` vector was created before
data.frame(x=runif(1e3), y=cumsum(x))
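For comparison, the same construction with a tibble can be sketched as below. The size (5 rows) and the name tb are arbitrary choices for this illustration.

```r
library(tibble)

# Works: `y` is computed from the `x` column created just before it
tb <- tibble(x = runif(5), y = cumsum(x))
tb
```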
Tibbles are quite strict about subsetting. [ always returns another tibble. Contrast this with a data frame: sometimes [ returns a data frame and sometimes it just returns a vector:
Unless you want to get a tibble, I recommend always using the $ notation when you want to get a column as a vector to avoid problems.
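The subsetting behaviour described above can be checked with a small sketch like this one. The table content and the names df and tb are made up for the example.

```r
library(tibble)

df <- data.frame(a = 1:3, b = c("u", "v", "w"))
tb <- as_tibble(df)

# data.frame: `[` returns a vector for one column, a data.frame for several
class(df[, 1])
class(df[, 1:2])

# tibble: `[` always returns a tibble
class(tb[, 1])

# `$` and `[[` return the column as a vector in both cases
tb$a
tb[["a"]]
```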
Another interesting feature of tibbles is that their columns can contain vectors, like usual, but also lists of any R objects like other tibbles, nls() objects, etc. This is called “nesting”, and you can nest and un-nest tibbles using these explicit functions:
tib_unnested_renested$data # The `data` column is a list
#> [[1]]
#> # A tibble: 2 × 2
#> number y
#> <int> <int>
#> 1 1 1
#> 2 2 1
#> [[2]]
#> # A tibble: 2 × 2
#> number y
#> <int> <int>
#> 1 1 2
#> 2 2 2
#> [[3]]
#> # A tibble: 2 × 2
#> number y
#> <int> <int>
#> 1 1 3
#> 2 2 3
#> [[4]]
#> # A tibble: 1 × 2
#> number y
#> <int> <int>
#> 1 2 4
#> [[5]]
#> # A tibble: 1 × 2
#> number y
#> <int> <int>
#> 1 2 5
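As a minimal sketch of nesting and un-nesting, assuming the functions meant above are tidyr's nest() and unnest(), one could write the following. The tibble tib and its columns are hypothetical and are not the data printed above.

```r
library(tidyr)
library(tibble)

# Hypothetical tibble: several measurements per group
tib <- tibble(
  group  = c("a", "a", "b", "b", "b"),
  number = 1:5,
  y      = c(1, 4, 9, 16, 25)
)

# nest(): pack `number` and `y` into a list-column of tibbles, one per group
tib_nested <- nest(tib, data = c(number, y))
tib_nested

# Each element of the list-column is itself a tibble
tib_nested$data[[1]]

# unnest(): back to the flat layout
tib_unnested <- unnest(tib_nested, cols = data)
tib_unnested
```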
6.5 Operations in the tidyverse
In the end, base R and the tidyverse package provide many efficient functions to perform most of the tasks you would want to repeat over rows, columns or groups, which lets you avoid explicit for loops (which are often slower in R).
Here are some examples, and you will find many more here. Take a look at the cheatsheets on tidyr and on dplyr, they are really helpful.
Let’s work on this tibble:
# Let's create a random tibble
library(tidyverse)
N <- 500
dt <- tibble(x     = rep(runif(N, -1, 1), 3),
             y     = runif(N*3, -1, 1),
             signx = ifelse(x>0, "positive", "negative"),
             signy = ifelse(y>0, "positive", "negative"))
dt
In the following, we will introduce the pipe operator |>, that was introduced in the version 4.1 of R. This operator allows a clear syntax for successive operations, as “what is on the left of the operator is given as first argument of what is on the right”. It is thus a good habit to write each operation on a separate line to facilitate the reading. This is particularly helpful when performing multiple nested operations. For example, summary(head(tail(dt),2)), which is hard to read, would translate to:
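The translation of summary(head(tail(dt),2)) into a pipe chain could look like the sketch below. A small stand-in data.frame is used for dt so that the example runs on its own; the pipe works the same way on a tibble.

```r
# Small stand-in for `dt` (illustration only)
dt <- data.frame(x = 1:10, y = (1:10)^2)

# Nested form, read from the inside out
nested <- summary(head(tail(dt), 2))

# Piped form, read from top to bottom
piped <- dt |>
  tail() |>
  head(2) |>
  summary()

# Both forms give the same result
identical(nested, piped)
```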
Note that before version 4.1 of R, the pipe operator was only available through the magrittr package, where it is written %>%. With magrittr’s pipe, the piped object is referred to with the placeholder ., while base R’s pipe uses the placeholder _ (available since R 4.2).
Silly example:
"Hello" |> gsub("o", "e") # replace the substring "Hello" by "o" in string "e"
#> [1] "e"
"Hello" |> gsub("o", "e", x = _) # replace the substring "o" by "e" in string "Hello"
#> [1] "Helle"
"Hello" %>% gsub("o", "e") # replace the substring "Hello" by "o" in string "e"
#> [1] "e"
# /!\ In base R pipe, the '_' needs to be for a named argument,
# while it is not necessary for the '%>%' pipe
"Hello" %>% gsub("o", "e", .) # replace the substring "o" by "e" in string "Hello"
group_by(column) groups together the rows that share the same value in the given column(s); the next operations are then performed on each group separately.
Alternatively, you can use the parameter by=column or .by=column in the functions you want to use on groups. The difference between group_by(column) and .by=column is that after group_by(column), the tibble stays grouped – so an ungroup() is needed to remove the grouping.
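The difference between the two approaches can be sketched as follows. The small table is hypothetical, and the .by argument assumes a recent version of dplyr (1.1.0 or later).

```r
library(dplyr)

# Small hypothetical table
dt <- tibble(
  signx = c("positive", "negative", "positive", "negative"),
  y     = c(1, 2, 3, 4)
)

# group_by(): the result stays grouped until ungroup() is called
with_group_by <- dt |>
  group_by(signx) |>
  mutate(mean_y = mean(y))
is_grouped_df(with_group_by)

# .by: the grouping applies to this call only, the result is not grouped
with_by <- dt |>
  mutate(mean_y = mean(y), .by = signx)
is_grouped_df(with_by)

# Same values once the grouping is removed
all.equal(ungroup(with_group_by)$mean_y, with_by$mean_y)
```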
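By default, the xx_join() functions match rows using all the columns that have the exact same name in both tables, so at least one such column must be present (otherwise, the columns to match must be given with the by argument). There are more possibilities than the inner_join() that I show here, see the help for more information.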
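A minimal sketch of a join on a shared column is given below. The two tables and their content are invented for the illustration.

```r
library(dplyr)

# Two hypothetical tables sharing the column `name`
a <- tibble(name = c("Ann", "Bob", "Cy"),  age     = c(20, 25, 30))
b <- tibble(name = c("Bob", "Cy", "Dee"),  country = c("FR", "CN", "US"))

# inner_join(): keeps only the rows whose `name` is present in both tables
inner <- inner_join(a, b, by = "name")
inner

# left_join(): keeps all the rows of `a`, filling missing matches with NA
left <- left_join(a, b, by = "name")
left
```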
The separation is based on standard separators such as “-”, “_”, “.”, “ ”, etc. A single separator can be specified with the argument sep; otherwise, all separators are used. One must provide the vector of new column names: if one value is NA, the corresponding column will be discarded. Examples:
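Assuming the function described here is tidyr's separate(), its behaviour can be sketched like this. The column id and its values are hypothetical.

```r
library(tidyr)

# Hypothetical column mixing several separators
tib <- tibble::tibble(id = c("A-1_x", "B-2_y"))

# Default: split on any non-alphanumeric separator
s_all <- separate(tib, id, into = c("letter", "number", "tag"))
s_all

# NA in `into` discards the corresponding piece
s_drop <- separate(tib, id, into = c("letter", NA, "tag"))
s_drop

# A single separator given explicitly with `sep`
s_one <- separate(tib, id, into = c("left", "right"), sep = "_")
s_one
```

Note that the new columns are character vectors unless a conversion is requested (for example with convert = TRUE).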
6.5.13 Apply a function recursively on each element of a column
Take a look at the cheatsheet on the purrr package for more options and a visual help on the map() family. I show here a use of purrr::map(vector, function) that returns a list. map(x, f) applies the function f() to each element of the vector x, putting the result in a separate element of a list: map(x, f) ->list(f(x1), f(x2), ... f(xn)). In case f(xi) returns a single value, you might want to use map_dbl() or map_chr(), for example, that will return a vector of doubles or of characters, respectively.
You see that you can create the function directly within the call to map using the shortcut map(vector, ~function(.)). This is useful to provide more arguments to the function – another solution is to write your own function before the call to map() and then call this function in map().
Note that in case you need more parameters, you can use purrr::map2(vector1, vector2, ~function(.x, .y)), where .x and .y refer to vector1 and vector2, respectively (it’s always .x and .y whatever the name of vector1 and vector2).
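The map() family described above can be illustrated with a short sketch. The vectors x and w and the functions applied are arbitrary choices for this example.

```r
library(purrr)

x <- c(1, 4, 9)
w <- c(10, 20, 30)

# map() returns a list: list(f(x1), f(x2), ..., f(xn))
res_list <- map(x, sqrt)
res_list

# map_dbl() returns a vector of doubles when f() returns a single number
res_dbl <- map_dbl(x, sqrt)
res_dbl

# Function defined on the fly with the `~` shortcut
res_formula <- map_dbl(x, ~ .x^2 + 1)
res_formula

# map2(): `.x` runs over the first vector, `.y` over the second
res_map2 <- map2_dbl(x, w, ~ .x + .y)
res_map2
```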
Create a 3-column data.frame containing 10 random values, their sine, and the sum of the first two columns.
Print the first 4 lines of the table
Print the second column
Print the average of the third column
Using plot(x,y) where x and y are vectors, plot the 2nd column as a function of the first
Look into the function write.table() to write a text file containing this data.frame
Do all the same things with a tibble
# Create a 3-column `data.frame` containing 10 random values, their sine,
# and the sum of the first two columns.
x <- runif(10)
y <- sin(x)
z <- x + y
df <- data.frame(x=x, y=y, z=z)
# Print the first 4 lines of the table
head(df, 4)
#> x y z
#> 1 0.62483524 0.58496365 1.2097989
#> 2 0.36899686 0.36068000 0.7296769
#> 3 0.69465978 0.64012409 1.3347839
#> 4 0.07386248 0.07379534 0.1476578
# Look into the function `write.table()` to write a text file
# containing this `data.frame`
write.table(df, "Data/some_data.dat", quote = FALSE, row.names = FALSE)
# # # # # # # # # # # # # # # # #
# Tibble version
library(tidyverse)
df_tib <- tibble(a = runif(10), b = sin(a), c = a + b)
head(df_tib, 4)
#> # A tibble: 4 × 3
#> a b c
#> <dbl> <dbl> <dbl>
#> 1 0.199 0.198 0.397
#> 2 0.523 0.499 1.02
#> 3 0.814 0.727 1.54
#> 4 0.642 0.599 1.24
Create a subset containing the data for Montpellier
What is the max and min of population in this city?
The average population over time?
What is the total population in 2012?
What is the total population per year?
What is the average population per city over the years?
# Download population.txt and load it into a `data.frame`.
library(tidyverse)
popul <- read_csv("Data/population.csv")
# What are the names of the columns and the dimension of the table?
names(popul); dim(popul)
# Create a subset containing the data for Montpellier
mtp <- subset(popul.tidy, city == "Montpellier")
# I prefer the tidyverse version
mtp <- popul.tidy |> filter(city == "Montpellier")
# What is the max and min of population in this city?
max(mtp$pop)
Create a new tibble pp by using the pipe operator (%>%) and successively:
joining the two tibbles into one using inner_join()
adding a column age containing the age in years (use lubridate::time_length(x, 'years') with x a time difference in days) by using mutate()
Display a summary of the table using str()
Using group_by() and summarize():
Show the number of males and females in the table (use the counter n())
Show the average age per gender
Show the average size per gender and institution
Show the number of people from each country, sorted by descending population (arrange())
Using select(), display:
only the name and age columns
all but the name column
Using filter(), show data only for
Chinese people
From institution ECL and UCBL
People older than 22
People with an e in their name
# First, load the `tidyverse` package
library(tidyverse)
# Load people1.csv and people2.csv
pp1 <- read_csv("Data/people1.csv")
pp2 <- read_csv("Data/people2.csv")
# Create a new tibble `pp` by using the pipe operator (`%>%`)
# and successively:
# - joining the two tibbles into one using `inner_join()`
# - adding a column `age` containing the age in years
#   (use lubridate's `time_length(x, 'years')` with x a time
#   difference in days) by using `mutate()`
pp <- pp1 |>
  inner_join(pp2) |>
  mutate(age = time_length(today() - dateofbirth, 'years'))
# Display a summary of the table using `str()`
str(pp)
# Using `group_by()` and `summarize()`:
# - Show the number of males and females in the table
#   (use the counter `n()`)
pp |>
  group_by(gender) |>
  summarize(count = n())
