R is a vectorized language with built-in arithmetic and relational operators, mathematical functions and number types. This means that each mathematical function and operator works on a whole set of data rather than on a single scalar value, as in traditional programming languages, so explicit loops can usually be avoided.
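As a minimal sketch of what vectorization means in practice, the snippet below squares a set of values twice: once with an explicit loop and once with the vectorized operator. The variable names (`values`, `squared_loop`, `squared_vec`) are illustrative and not taken from the text above.

```r
values <- c(1, 5, 3, 12, 4.2)

# Explicit loop: square each element one at a time
squared_loop <- numeric(length(values))
for (i in seq_along(values)) {
  squared_loop[i] <- values[i]^2
}

# Vectorized form: the operator is applied to every element at once
squared_vec <- values^2

# Both approaches give the same result
identical(squared_loop, squared_vec)
#> [1] TRUE
```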
If you want to define a vector by specifying its values, use the function c(), like so:
x is a vector of doubles:
x <- c(1, 5, 3, 12, 4.2)
x
#> [1] 1.0 5.0 3.0 12.0 4.2
x is converted to a vector of strings because it contains a string:
x <- c(1, 5, 3, "hello")
x
#> [1] "1" "5" "3" "hello"
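The conversion seen above can be checked with `class()`. The short sketch below also shows that logical values are coerced to numbers when mixed with numbers; the name `mixed` is only illustrative.

```r
# class() reports the type of the whole vector
class(c(1, 5, 3))           # all numbers
#> [1] "numeric"
class(c(1, 5, 3, "hello"))  # one string turns everything into strings
#> [1] "character"

# Logical values are coerced to numbers when mixed with numbers
mixed <- c(1, TRUE, FALSE)
mixed
#> [1] 1 1 0
```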
To create a sequence of numbers, use start:end for a sequence going from start to end by steps of 1, or use the seq() function, which is more versatile:
1:10
#> [1] 1 2 3 4 5 6 7 8 9 10
seq(-10, 10, by = .5)
#> [1] -10.0 -9.5 -9.0 -8.5 -8.0 -7.5 -7.0 -6.5 -6.0 -5.5 -5.0 -4.5
#> [13] -4.0 -3.5 -3.0 -2.5 -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5
#> [25] 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5
#> [37] 8.0 8.5 9.0 9.5 10.0
seq(-10, 10, length = 6)
#> [1] -10 -6 -2 2 6 10
seq(-10, 10, along = x)
#> [1] -10.000000 -3.333333 3.333333 10.000000
To repeat values, use the rep() function.
To generate random numbers, use rnorm() or runif() for normally or uniformly distributed numbers, respectively:
# vector with 10 random values normally distributed around mean
# with given standard deviation `sd`
rnorm(10, mean=3, sd=1)
#> [1] 4.649907 3.518421 3.505278 2.581017 1.505077 4.070046 2.918621 3.216492
#> [9] 2.649879 1.844394
# vector with 10 random values uniformly distributed between min and max
runif(10, min = 0, max = 1)
#> [1] 0.07467977 0.86038816 0.12199068 0.13169847 0.88626019 0.02991051
#> [7] 0.43487958 0.49711280 0.36083000 0.28517218
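The rep() function mentioned above has no example in this section, so here is a brief sketch of its three most common uses. The input values are arbitrary and chosen only for illustration.

```r
# Repeat the whole vector twice
rep(1:3, times = 2)
#> [1] 1 2 3 1 2 3

# Repeat each element twice before moving on to the next one
rep(1:3, each = 2)
#> [1] 1 1 2 2 3 3

# Give one repetition count per element
rep(c("a", "b"), times = c(2, 1))
#> [1] "a" "a" "b"
```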
Data can be of two different types: numerical or categorical. Let’s say you are measuring the temperature in a room and recording its value over time:
T1 <- c(22.3, 23.5, 26.0, 30.2)
T1 is a vector containing numerical data.
Let’s say that now you are recording the temperature level, which can be low, high or medium:
T2 <- c("low", "low", "medium", "high")
T2 is a vector containing categorical data, i.e. the data in this example can fall into one of 3 categories. For now, however, T2 is a vector of strings, and we need to tell R that it contains categorical data by using the function factor():
T2 <- factor(T2)
T2
#> [1] low low medium high
#> Levels: high low medium
We see here that we now have 3 levels, sorted alphabetically by default. Converting T2 to numbers returns 1, 2 or 3 for each value, according to the position of its level:
as.numeric(T2)
#> [1] 2 2 3 1
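Because the default level order is alphabetical, "high" comes before "low" here. If a different order is wanted, one possible approach is to pass the `levels` argument to factor(). The name `T2_ordered` is hypothetical and used only for this sketch.

```r
T2 <- c("low", "low", "medium", "high")

# Specify the level order explicitly instead of relying on alphabetical order
T2_ordered <- factor(T2, levels = c("low", "medium", "high"))
levels(T2_ordered)
#> [1] "low"    "medium" "high"

# The integer codes now follow the chosen order
as.numeric(T2_ordered)
#> [1] 1 1 2 3
```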
Numerical data can be converted to factors in the same way. This can be useful sometimes, e.g. when plotting with ggplot, as we will see later:
factor(T1)
#> [1] 22.3 23.5 26 30.2
#> Levels: 22.3 23.5 26 30.2
Because R is a vectorized language, you don’t need to loop over all elements of a vector to perform element-wise operations on it. Let’s say that x <- 1:6, then:
x + 2.5
#> [1] 3.5 4.5 5.5 6.5 7.5 8.5
x*2
#> [1] 2 4 6 8 10 12
x %/% 3
#> [1] 0 0 1 1 1 2
x^2.5
#> [1] 1.000000 5.656854 15.588457 32.000000 55.901699 88.181631
y <- c(2.3, 5, 7, 10, 12, 20)
x*y
#> [1] 2.3 10.0 21.0 40.0 60.0 120.0
x*1:2
#> [1] 1 4 3 8 5 12
x*1:4
#> [1] 1 4 9 16 5 12
x %% 2
#> [1] 1 0 1 0 1 0
x %o% y
#> [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,] 2.3 5 7 10 12 20
#> [2,] 4.6 10 14 20 24 40
#> [3,] 6.9 15 21 30 36 60
#> [4,] 9.2 20 28 40 48 80
#> [5,] 11.5 25 35 50 60 100
#> [6,] 13.8 30 42 60 72 120
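The results of `x*1:2` and `x*1:4` above follow from R's recycling rule: the shorter vector is repeated until it matches the length of the longer one. A brief sketch, reusing `x <- 1:6`; `rep_len()` appears here only to make the recycled vector explicit.

```r
x <- 1:6

# 1:2 is recycled to 1 2 1 2 1 2 before the multiplication
identical(x * 1:2, x * rep_len(1:2, length(x)))
#> [1] TRUE

# 1:4 is recycled to 1 2 3 4 1 2; since 6 is not a multiple of 4,
# R also emits a warning, silenced here to show only the value
suppressWarnings(x * 1:4)
#> [1]  1  4  9 16  5 12
```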
Let’s work on this vector x:
x <- c(5, 3, 4, 9, 3)
To access values of x, use the x[i] notation, where i is the index you want to access. i can be (and in fact, is always) a vector.
In R, index numbering starts at 1!
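A few illustrative calls of the x[i] notation on the vector defined above, as a sketch of the most common index forms:

```r
x <- c(5, 3, 4, 9, 3)

x[1]          # first element (indices start at 1)
#> [1] 5
x[c(1, 3)]    # elements 1 and 3
#> [1] 5 4
x[2:4]        # elements 2 to 4
#> [1] 3 4 9
x[length(x)]  # last element
#> [1] 3
x[7]          # an index beyond the end returns NA
#> [1] NA
```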
To remove the elements at indices j from a vector x, use the notation x[-j]:
# remove elements 1 and 3
x[-c(1,3)]
#> [1] 3 9 3
Writing x[-c(1,3)] will just print the result of x[-c(1,3)], but not actually modify x.
To really modify x, you’d need to write x <- x[-c(1,3)].
You can access values with booleans. Values getting a TRUE will be kept, while values with a FALSE will be discarded. An NA is returned if a TRUE is given to a non-existing value (i.e. to an index larger than the size of the vector).
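A small sketch of boolean indexing on the vector x used in this section, including the NA case described above:

```r
x <- c(5, 3, 4, 9, 3)

# Keep elements 1, 3 and 5
x[c(TRUE, FALSE, TRUE, FALSE, TRUE)]
#> [1] 5 4 3

# A TRUE at position 6, beyond the end of x, returns NA
x[c(TRUE, FALSE, TRUE, FALSE, TRUE, TRUE)]
#> [1]  5  4  3 NA
```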
Therefore, you can apply tests on your values and filter them very easily:
x > 4 # is a vector of booleans
#> [1] TRUE FALSE FALSE TRUE FALSE
x[x > 4] # is a filtered vector
#> [1] 5 9
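Tests can also be combined, and `which()` gives the positions rather than the values. The thresholds below are arbitrary and serve only as an illustration:

```r
x <- c(5, 3, 4, 9, 3)

# Combine tests with & (and) or | (or)
x[x > 3 & x < 9]
#> [1] 5 4

# which() returns the positions where the test is TRUE
which(x > 4)
#> [1] 1 4
```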
Finally, values in vectors can be named, and thus can be accessed by their name:
y <- c(age=32, name="John", pet="Cat")
y
#> age name pet
#> "32" "John" "Cat"
y[c('age','pet')] # prints a named vector
#> age pet
#> "32" "Cat"
y[['name']] # prints an un-named vector
#> [1] "John"
And you can access the names of the vector using names():
names(y)
#> [1] "age" "name" "pet"
sort(x)
#> [1] 3 3 4 5 9
stringr::str_sort() might be needed for strings mixing letters and numbers:
#> [1] "a" "ab" "c" "d"
#> [1] "a10" "a2" "ab" "c" "d"
#> [1] "a2" "a10" "ab" "c" "d"
sort(x, decreasing = TRUE)
#> [1] 9 5 4 3 3
rev(x)
#> [1] 3 9 4 3 5
duplicated(x)
#> [1] FALSE FALSE FALSE FALSE TRUE
unique(x)
#> [1] 5 3 4 9
sample(x, 3)
#> [1] 9 3 3
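Related to sort(), the order() function (used in the exercise solution further down) returns the positions that would sort the vector, which makes it possible to reorder a second vector in the same way. A minimal sketch; the `labels` vector is hypothetical and introduced only for this example.

```r
x <- c(5, 3, 4, 9, 3)

# Positions of the elements of x, from smallest to largest value
order(x)
#> [1] 2 5 3 1 4

# Indexing x with these positions is equivalent to sort(x)
identical(x[order(x)], sort(x))
#> [1] TRUE

# The same positions can reorder another vector of the same length
labels <- c("a", "b", "c", "d", "e")
labels[order(x)]
#> [1] "b" "e" "c" "a" "d"
```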
This is quite straightforward:
length(x)
#> [1] 5
summary(x)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 3.0 3.0 4.0 4.8 5.0 9.0
sum(x)
#> [1] 24
mean(x)
#> [1] 4.8
median(x)
#> [1] 4
sd(x)
#> [1] 2.48998
table(x)
#> x
#> 3 4 5 9
#> 2 1 1 1
cumsum(x)
#> [1] 5 8 12 21 24
diff(x)
#> [1] -2 1 5 -6
This is done using the c() notation: you pass the vectors to combine to c(), and the result is a single new vector:
# concatenate vectors
z <- c(-1:4, NA, -x); z
#> [1] -1 0 1 2 3 4 NA -5 -3 -4 -9 -3
Another option is to use the append() function, which offers more options, such as choosing where the new values are inserted.
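As a short sketch of append(), the calls below add values at the end of x and then at a chosen position using the `after` argument. The inserted values are arbitrary.

```r
x <- c(5, 3, 4, 9, 3)

# By default, values are added at the end, like c(x, 0)
append(x, 0)
#> [1] 5 3 4 9 3 0

# The `after` argument inserts the values after a given position
append(x, c(0, 0), after = 2)
#> [1] 5 3 0 0 4 9 3
```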
Download the exercises and solutions from the following repository, then create an RStudio project from the unzipped folder:
Let’s create a named vector containing the ages of the students in the class, the name of each value being the first name of the student. Then:
Plot the histogram of the ages using hist(), and play with the parameter breaks to modify the histogram.
Plot a boxplot of the ages using boxplot().
This exercise is adapted from here.
Open RStudio and create a new R script, save it as population.R in the directory of your choice, say Rcourse/.
Download the population.csv file and save it in your working directory.
A csv file contains raw data stored as plain text and separated by commas (Comma Separated Values). Open it with RStudio.
We can of course directly load such a file with R and store its data in an appropriate format (i.e. a data.frame), but this is for the next chapter. For now, just copy-paste the text into the RStudio script area to:
Create a cities vector containing all the cities listed in population.csv
Create pop_1962 and pop_2012 vectors containing the populations of each city at these years. Print the 2 vectors.
Use names() to name the values of pop_1962 and pop_2012. Print the 2 vectors again. Is there any change?
Create a pop_diff vector to store the population change between 1962 and 2012
# Create a `cities` vector containing all the cities listed in `population.csv`
cities <- c("Angers", "Bordeaux", "Brest", "Dijon", "Grenoble", "Le Havre",
"Le Mans", "Lille", "Lyon", "Marseille", "Montpellier", "Nantes",
"Nice", "Paris", "Reims", "Rennes", "Saint-Etienne", "Strasbourg",
"Toulon", "Toulouse")
# Create a `pop_1962` and `pop_2012` vectors containing the populations
# of each city at these years. Print the 2 vectors.
pop_1962 <- c(115273,278403,136104,135694,156707,187845,132181,239955,
535746,778071,118864,240048,292958,2790091,134856,151948,
210311,228971,161797,323724)
pop_2012 <- c(149017,241287,139676,152071,158346,173142,143599,228652,
496343,852516,268456,291604,343629,2240621,181893,209860,
171483,274394,164899,453317)
pop_1962; pop_2012
#> [1] 115273 278403 136104 135694 156707 187845 132181 239955 535746
#> [10] 778071 118864 240048 292958 2790091 134856 151948 210311 228971
#> [19] 161797 323724
#> [1] 149017 241287 139676 152071 158346 173142 143599 228652 496343
#> [10] 852516 268456 291604 343629 2240621 181893 209860 171483 274394
#> [19] 164899 453317
# Use names() to name values of `pop_1962` and `pop_2012`.
# Print the 2 vectors again. Is there any change?
names(pop_2012) <- names(pop_1962) <- cities
pop_1962; pop_2012
#> Angers Bordeaux Brest Dijon Grenoble
#> 115273 278403 136104 135694 156707
#> Le Havre Le Mans Lille Lyon Marseille
#> 187845 132181 239955 535746 778071
#> Montpellier Nantes Nice Paris Reims
#> 118864 240048 292958 2790091 134856
#> Rennes Saint-Etienne Strasbourg Toulon Toulouse
#> 151948 210311 228971 161797 323724
#> Angers Bordeaux Brest Dijon Grenoble
#> 149017 241287 139676 152071 158346
#> Le Havre Le Mans Lille Lyon Marseille
#> 173142 143599 228652 496343 852516
#> Montpellier Nantes Nice Paris Reims
#> 268456 291604 343629 2240621 181893
#> Rennes Saint-Etienne Strasbourg Toulon Toulouse
#> 209860 171483 274394 164899 453317
# What are the cities with more than 200000 people in 1962?
# For these, how many residents in 2012?
cities200k <- cities[pop_1962>200000]
cities200k; pop_2012[cities200k]
#> [1] "Bordeaux" "Lille" "Lyon" "Marseille"
#> [5] "Nantes" "Nice" "Paris" "Saint-Etienne"
#> [9] "Strasbourg" "Toulouse"
#> Bordeaux Lille Lyon Marseille Nantes
#> 241287 228652 496343 852516 291604
#> Nice Paris Saint-Etienne Strasbourg Toulouse
#> 343629 2240621 171483 274394 453317
# What is the population evolution of Montpellier and Nantes?
pop_2012['Montpellier'] - pop_1962['Montpellier']; pop_2012['Nantes'] - pop_1962['Nantes']
#> Montpellier
#> 149592
#> Nantes
#> 51556
# Create a `pop_diff` vector to store population change between 1962 and 2012
pop_diff <- pop_2012 - pop_1962
# Print cities with a negative change
cities[pop_diff<0]
#> [1] "Bordeaux" "Le Havre" "Lille" "Lyon"
#> [5] "Paris" "Saint-Etienne"
# Print cities which broke the 300000 people barrier between 1962 and 2012
cities[pop_2012>300000 & pop_1962<300000]
#> [1] "Nice"
# Compute the total change in population of the 10 largest cities
# (as of 1962) between 1962 and 2012.
ten_largest <- cities[order(pop_1962, decreasing = TRUE)[1:10]]
sum(pop_2012[ten_largest] - pop_1962[ten_largest])
#> [1] -324432
# Compute the population mean for year 1962
mean(pop_1962)
#> [1] 367477.3
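# Compute the population mean of Paris over 1962 and 2012
mean(c(pop_1962['Paris'], pop_2012['Paris']))
#> [1] 2515356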
# Sort the cities by decreasing order of population for 1962
(pop_1962_sorted <- sort(pop_1962, decreasing = TRUE))
#> Paris Marseille Lyon Toulouse Nice
#> 2790091 778071 535746 323724 292958
#> Bordeaux Nantes Lille Strasbourg Saint-Etienne
#> 278403 240048 239955 228971 210311
#> Le Havre Toulon Grenoble Rennes Brest
#> 187845 161797 156707 151948 136104
#> Dijon Reims Le Mans Montpellier Angers
#> 135694 134856 132181 118864 115273