R is a vectorized language with built-in arithmetic and relational operators, mathematical functions and numeric types. This means that each mathematical function and operator works on a whole set of values rather than on a single scalar value, as is the case in many traditional programming languages, so explicit loops can usually be avoided.
If you want to define a vector by specifying the values, use the function c(), like so:
x is a vector of doubles:
x <- c(1, 5, 3, 12, 4.2)
x
#> [1]  1.0  5.0  3.0 12.0  4.2
x is converted to a vector of strings because it contains a string:
x <- c(1, 5, 3, "hello")
x
#> [1] "1"     "5"     "3"     "hello"
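The conversion to strings seen above is an instance of R's implicit coercion: c() always returns a vector of a single type, and picks the most general type among its arguments. The following is a minimal sketch of that behaviour; the variable names are only illustrative.

```r
# c() coerces its arguments to one common type:
# logical < integer < double < character
a <- c(TRUE, 2L)            # logical + integer
b <- c(1L, 2.5)             # integer + double
s <- c(1, 5, 3, "hello")    # double + character

typeof(a)   # "integer"
typeof(b)   # "double"
typeof(s)   # "character"

# Converting back to numbers gives NA (with a warning, silenced here)
# wherever the string is not a number
back <- suppressWarnings(as.numeric(s))
back        # 1 5 3 NA
```

Checking typeof() after building a vector is a cheap way to notice that a stray string has turned your numbers into text.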
Use start:end for a sequence going from start to end in steps of 1, or use the more versatile seq() function:
1:10
#>  [1]  1  2  3  4  5  6  7  8  9 10
seq(-10, 10, by = .5)
#>  [1] -10.0  -9.5  -9.0  -8.5  -8.0  -7.5  -7.0  -6.5  -6.0  -5.5  -5.0  -4.5
#> [13]  -4.0  -3.5  -3.0  -2.5  -2.0  -1.5  -1.0  -0.5   0.0   0.5   1.0   1.5
#> [25]   2.0   2.5   3.0   3.5   4.0   4.5   5.0   5.5   6.0   6.5   7.0   7.5
#> [37]   8.0   8.5   9.0   9.5  10.0
seq(-10, 10, length.out = 6)
#> [1] -10  -6  -2   2   6  10
# as many values as there are elements in x
seq(-10, 10, along.with = x)
#> [1] -10.000000  -3.333333   3.333333  10.000000
To repeat values, use the rep() function.
For random numbers, use rnorm() or runif() for normally or uniformly distributed numbers, respectively:
# vector with 10 random values normally distributed around `mean`
# with given standard deviation `sd`
rnorm(10, mean = 3, sd = 1)
#>  [1] 2.772932 2.860914 1.633551 2.799405 4.183160 3.050509 3.679921 6.021539
#>  [9] 1.097593 2.352588
# vector with 10 random values uniformly distributed between min and max
runif(10, min = 0, max = 1)
#>  [1] 0.80316437 0.36451792 0.19757124 0.05517922 0.08234020 0.72324840
#>  [7] 0.33825254 0.95372647 0.52669921 0.05505320
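The rep() function mentioned above is not illustrated here, so the following is a small sketch of its most common arguments (times, each and length.out). The values and variable names are arbitrary and only serve the illustration.

```r
# Repeat a single value
zeros <- rep(0, times = 5)
zeros                        # 0 0 0 0 0

# Repeat a whole vector, or each of its elements
whole <- rep(c(1, 2), times = 3)
whole                        # 1 2 1 2 1 2
each  <- rep(c(1, 2), each = 3)
each                         # 1 1 1 2 2 2

# Give a different number of repetitions per element
per_elt <- rep(c("a", "b"), times = c(2, 3))
per_elt                      # "a" "a" "b" "b" "b"

# Repeat until a given length is reached
cycled <- rep(1:3, length.out = 7)
cycled                       # 1 2 3 1 2 3 1
```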
Data can be of two different types: numerical or categorical. Let’s say you are measuring the temperature in a room and recording its value over time:
T1 <- c(22.3, 23.5, 26.0, 30.2)
T1 is a vector containing numerical data.
Let’s say that you are now recording the temperature level, which can be low, medium or high:
T2 <- c("low", "low", "medium", "high")
T2 is a vector containing categorical data, i.e. each value in this example falls into one of 3 categories. For now, however, T2 is a vector of strings, and we need to tell R that it contains categorical data by using the function factor():
T2 <- factor(T2)
T2
#> [1] low    low    medium high
#> Levels: high low medium
We see here that we now have 3 levels, sorted alphabetically by default. Converting T2 to numbers returns, for each value, the index (1, 2 or 3) of its level:
as.numeric(T2)
#> [1] 2 2 3 1
Numerical data can be converted to factors in the same way – this can be useful sometimes, e.g. when plotting with ggplot as we will see later:
factor(T1)
#> [1] 22.3 23.5 26   30.2
#> Levels: 22.3 23.5 26 30.2
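Two points about factors are easy to trip over: the default level order is alphabetical, and as.numeric() on a factor returns level indexes, not the original numbers. The sketch below reuses T1 and T2 from above; the names T2_ordered and f are just illustrative.

```r
T2 <- c("low", "low", "medium", "high")

# Set the level order explicitly instead of relying on alphabetical order
T2_ordered <- factor(T2, levels = c("low", "medium", "high"))
levels(T2_ordered)        # "low" "medium" "high"
as.numeric(T2_ordered)    # 1 1 2 3

T1 <- c(22.3, 23.5, 26.0, 30.2)
f <- factor(T1)

# as.numeric() gives the level indexes, not the temperatures
codes <- as.numeric(f)
codes                     # 1 2 3 4

# Go through as.character() to recover the original values
values <- as.numeric(as.character(f))
values                    # 22.3 23.5 26.0 30.2
```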
Because R is a vectorized language, you don’t need to loop over the elements of a vector to perform element-wise operations on it. Let’s say that x <- 1:6, then:
x + 2.5
#> [1] 3.5 4.5 5.5 6.5 7.5 8.5
x*2
#> [1]  2  4  6  8 10 12
x %/% 3   # integer division
#> [1] 0 0 1 1 1 2
x^2.5
#> [1]  1.000000  5.656854 15.588457 32.000000 55.901699 88.181631
y <- c(2.3, 5, 7, 10, 12, 20)
x*y       # element-wise product
#> [1]   2.3  10.0  21.0  40.0  60.0 120.0
x*1:2     # the shorter vector is recycled
#> [1]  1  4  3  8  5 12
x*1:4     # recycled too, with a warning since 6 is not a multiple of 4
#> [1]  1  4  9 16  5 12
x %% 2    # remainder of the division (modulo)
#> [1] 1 0 1 0 1 0
x %o% y   # outer product
#>      [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,]  2.3    5    7   10   12   20
#> [2,]  4.6   10   14   20   24   40
#> [3,]  6.9   15   21   30   36   60
#> [4,]  9.2   20   28   40   48   80
#> [5,] 11.5   25   35   50   60  100
#> [6,] 13.8   30   42   60   72  120
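The x*1:2 and x*1:4 results above rely on R's recycling rule: when two vectors have different lengths, the shorter one is repeated until it matches the longer one. Here is a small sketch that makes the rule explicit, assuming x <- 1:6 as above; the other variable names are illustrative.

```r
x <- 1:6

# Recycling is equivalent to repeating the shorter vector explicitly
short    <- x * 1:2
explicit <- x * rep(1:2, length.out = length(x))
identical(short, explicit)   # TRUE

# When the longer length is not a multiple of the shorter one,
# R still recycles but emits a warning; here we capture its message
msg <- tryCatch(x * 1:4, warning = function(w) conditionMessage(w))
msg
```

Silent recycling is convenient with scalars (x * 2) but can hide bugs with vectors of unexpected lengths, so it is worth checking length() when a result looks odd.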
Let’s work on this vector x:
x <- c(5, 3, 4, 9, 3)
To access the values of x, use the x[i] notation, where i is the index you want to access. i can be (and in fact always is) a vector.
In R, index numbering starts at 1!
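As a quick illustration of the x[i] notation, here is a minimal sketch using the vector x defined above, with i being a single index, a vector of indexes, or a range.

```r
x <- c(5, 3, 4, 9, 3)

x[1]           # first element: 5
x[c(2, 4)]     # elements 2 and 4: 3 9
x[2:4]         # elements 2 to 4: 3 4 9
x[length(x)]   # last element: 3
x[7]           # index beyond the end of the vector: NA
```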
To remove the elements at indexes j from a vector x, use the notation x[-j]:
# remove elements 1 and 3
x[-c(1,3)]
#> [1] 3 9 3
Writing x[-c(1,3)] will just print the result of x[-c(1,3)], but not actually modify x.
To really modify x, you’d need to write x <- x[-c(1,3)].
You can also access values with booleans. Values matched with a TRUE are kept, while values matched with a FALSE are discarded. A TRUE given to a non-existing value (i.e. to an index larger than the size of the vector) returns an NA.
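A minimal sketch of boolean indexing, assuming x <- c(5, 3, 4, 9, 3) as above (the names keep, kept and too_long are only illustrative):

```r
x <- c(5, 3, 4, 9, 3)

# One boolean per element: TRUE keeps the value, FALSE drops it
keep <- c(TRUE, FALSE, TRUE, FALSE, TRUE)
kept <- x[keep]
kept       # 5 4 3

# A sixth TRUE points to a value that does not exist: NA is returned
too_long <- x[c(TRUE, FALSE, TRUE, FALSE, TRUE, TRUE)]
too_long   # 5 4 3 NA
```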
Therefore, you can apply tests on your values and filter them very easily:
x > 4 # is a vector of booleans
#> [1]  TRUE FALSE FALSE  TRUE FALSE
x[x > 4] # is a filtered vector
#> [1] 5 9
Finally, values in vectors can be named, and thus can be accessed by their name:
y <- c(age=32, name="John", pet="Cat")
y
#>    age   name    pet
#>   "32" "John"  "Cat"
y[c('age','pet')] # prints a named vector
#>   age   pet
#>  "32" "Cat"
y[['name']] # prints an un-named vector
#> [1] "John"
And you can access the names of the vector using names():
names(y)
#> [1] "age"  "name" "pet"
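names() can also be used on the left-hand side of an assignment to set or change names after the vector has been created. The sketch below uses a hypothetical ages vector with made-up names and values.

```r
# Hypothetical data, for illustration only
ages <- c(32, 41, 27)

# Attach names to an existing vector
names(ages) <- c("John", "Mary", "Paul")
ages[["Mary"]]             # 41

# Rename a single element
names(ages)[2] <- "Maria"
names(ages)                # "John" "Maria" "Paul"

# Drop the names altogether
bare <- unname(ages)
bare                       # 32 41 27
```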
sort(x)
#> [1] 3 3 4 5 9
stringr::str_sort() might be needed for strings mixing letters and numbers:
#> [1] "a"  "ab" "c"  "d"
#> [1] "a10" "a2"  "ab"  "c"   "d"
#> [1] "a2"  "a10" "ab"  "c"   "d"
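The code that produced the outputs above is not shown, so the following is only a sketch of the kind of call involved, with a hypothetical input vector s. It assumes the stringr package is installed (the call is skipped otherwise) and uses the numeric = TRUE argument of str_sort(), which compares runs of digits as numbers.

```r
# Hypothetical input mixing letters and numbers
s <- c("c", "a10", "ab", "a2", "d")

# Base sort() compares character by character, so "a10" comes before "a2"
sorted <- sort(s)
print(sorted)

# str_sort(numeric = TRUE) treats digit runs as numbers: "a2" before "a10"
if (requireNamespace("stringr", quietly = TRUE)) {
  print(stringr::str_sort(s, numeric = TRUE))
}
```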
sort(x, decreasing = TRUE)
#> [1] 9 5 4 3 3
rev(x)
#> [1] 3 9 4 3 5
duplicated(x)
#> [1] FALSE FALSE FALSE FALSE  TRUE
unique(x)
#> [1] 5 3 4 9
sample(x, 3)
#> [1] 3 4 5
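The functions above combine naturally with indexing. As a small sketch, still assuming x <- c(5, 3, 4, 9, 3): duplicated() can be used as a boolean filter, and set.seed() makes the random draw of sample() reproducible (the seed value 42 is arbitrary).

```r
x <- c(5, 3, 4, 9, 3)

# Dropping the duplicated elements gives the same result as unique()
no_dup <- x[!duplicated(x)]
no_dup                           # 5 3 4 9
identical(no_dup, unique(x))     # TRUE

# Sorted distinct values
sort(unique(x))                  # 3 4 5 9

# Fixing the seed makes sample() return the same draw every time
set.seed(42); draw1 <- sample(x, 3)
set.seed(42); draw2 <- sample(x, 3)
identical(draw1, draw2)          # TRUE
```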
Computing basic statistics on a vector is quite straightforward:
length(x)
#> [1] 5
summary(x)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#>     3.0     3.0     4.0     4.8     5.0     9.0
sum(x)
#> [1] 24
mean(x)
#> [1] 4.8
median(x)
#> [1] 4
sd(x)
#> [1] 2.48998
table(x)
#> x
#> 3 4 5 9
#> 2 1 1 1
cumsum(x)
#> [1]  5  8 12 21 24
diff(x)
#> [1] -2  1  5 -6
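The statistics above are computed on a vector without missing values. With an NA in the data, most of these functions return NA unless told otherwise. The following sketch uses a hypothetical vector v and the na.rm argument that sum(), mean(), median() and sd() all accept.

```r
# Hypothetical vector with one missing value
v <- c(5, 3, NA, 9, 3)

mean(v)                          # NA: the missing value propagates
m <- mean(v, na.rm = TRUE)       # ignore missing values
m                                # 5
s <- sum(v, na.rm = TRUE)
s                                # 20

# Count the missing values
n_missing <- sum(is.na(v))
n_missing                        # 1
```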
This is done using the c() notation: you basically create a vector of vectors, the result being a new vector:
# concatenate vectors
z <- c(-1:4, NA, -x); z
#>  [1] -1  0  1  2  3  4 NA -5 -3 -4 -9 -3
Another option is to use the append() function, which offers more options, such as choosing the position at which the new values are inserted.
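As a minimal sketch of append(), assuming x <- c(5, 3, 4, 9, 3) as above: by default the new values go at the end, and the after argument sets the index after which they are inserted.

```r
x <- c(5, 3, 4, 9, 3)

# Default: add the values at the end, like c(x, 0, 1)
at_end <- append(x, c(0, 1))
at_end      # 5 3 4 9 3 0 1

# Insert after the 2nd element
in_middle <- append(x, c(0, 1), after = 2)
in_middle   # 5 3 0 1 4 9 3

# after = 0 inserts at the beginning
at_start <- append(x, 99, after = 0)
at_start    # 99 5 3 4 9 3
```

As with x[-j], append() returns a new vector and leaves x untouched unless you assign the result back to x.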
Download the exercises and solutions from the following repository, then create an RStudio project from the unzipped folder:
Let’s create a named vector containing the ages of the students in the class, the name of each value being the first name of the student. Then:
Plot a histogram using hist() and play with the parameter breaks to modify the histogram.
Plot a boxplot using boxplot().
This exercise is adapted from here.
Open RStudio and create a new R script, then save it as population.R in a directory of your choice, say Rcourse/.
Download the population.csv file and save it in your working directory.
A csv file contains raw data stored as plain text, with values separated by commas (Comma Separated Values). Open it with RStudio.
We could of course load such a file directly with R and store its data in an appropriate format (i.e. a data.frame), but this is for the next chapter. For now, just copy-paste the text into the RStudio script area to:
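Create a cities vector containing all the cities listed in population.csv.
Create pop_1962 and pop_2012 vectors containing the populations of each city at these years. Print the 2 vectors.
Use names() to name the values of pop_1962 and pop_2012. Print the 2 vectors again. Are there any changes?
What are the cities with more than 200000 people in 1962? For these, how many residents in 2012?
What is the population evolution of Montpellier and Nantes?
Create a pop_diff vector to store the population change between 1962 and 2012.
Print the cities with a negative change.
Print the cities which broke the 300000 people barrier between 1962 and 2012.
Compute the total change in population of the 10 largest cities (as of 1962) between 1962 and 2012.
Compute the population mean for year 1962.
Sort the cities by decreasing order of population for 1962.
# Create a `cities` vector containing all the cities listed in `population.csv`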
cities <- c("Angers", "Bordeaux", "Brest", "Dijon", "Grenoble", "Le Havre",
"Le Mans", "Lille", "Lyon", "Marseille", "Montpellier", "Nantes",
"Nice", "Paris", "Reims", "Rennes", "Saint-Etienne", "Strasbourg",
"Toulon", "Toulouse")
# Create a `pop_1962` and `pop_2012` vectors containing the populations
# of each city at these years. Print the 2 vectors.
pop_1962 <- c(115273,278403,136104,135694,156707,187845,132181,239955,
535746,778071,118864,240048,292958,2790091,134856,151948,
210311,228971,161797,323724)
pop_2012 <- c(149017,241287,139676,152071,158346,173142,143599,228652,
496343,852516,268456,291604,343629,2240621,181893,209860,
171483,274394,164899,453317)
pop_1962; pop_2012
#> [1] 115273 278403 136104 135694 156707 187845 132181 239955 535746
#> [10] 778071 118864 240048 292958 2790091 134856 151948 210311 228971
#> [19] 161797 323724
#> [1] 149017 241287 139676 152071 158346 173142 143599 228652 496343
#> [10] 852516 268456 291604 343629 2240621 181893 209860 171483 274394
#> [19] 164899 453317
# Use names() to name values of `pop_1962` and `pop_2012`.
# Print the 2 vectors again. Are there any changes?
names(pop_2012) <- names(pop_1962) <- cities
pop_1962; pop_2012
#> Angers Bordeaux Brest Dijon Grenoble
#> 115273 278403 136104 135694 156707
#> Le Havre Le Mans Lille Lyon Marseille
#> 187845 132181 239955 535746 778071
#> Montpellier Nantes Nice Paris Reims
#> 118864 240048 292958 2790091 134856
#> Rennes Saint-Etienne Strasbourg Toulon Toulouse
#> 151948 210311 228971 161797 323724
#> Angers Bordeaux Brest Dijon Grenoble
#> 149017 241287 139676 152071 158346
#> Le Havre Le Mans Lille Lyon Marseille
#> 173142 143599 228652 496343 852516
#> Montpellier Nantes Nice Paris Reims
#> 268456 291604 343629 2240621 181893
#> Rennes Saint-Etienne Strasbourg Toulon Toulouse
#> 209860 171483 274394 164899 453317
# What are the cities with more than 200000 people in 1962?
# For these, how many residents in 2012?
cities200k <- cities[pop_1962>200000]
cities200k; pop_2012[cities200k]
#> [1] "Bordeaux" "Lille" "Lyon" "Marseille"
#> [5] "Nantes" "Nice" "Paris" "Saint-Etienne"
#> [9] "Strasbourg" "Toulouse"
#> Bordeaux Lille Lyon Marseille Nantes
#> 241287 228652 496343 852516 291604
#> Nice Paris Saint-Etienne Strasbourg Toulouse
#> 343629 2240621 171483 274394 453317
# What is the population evolution of Montpellier and Nantes?
pop_2012['Montpellier'] - pop_1962['Montpellier']; pop_2012['Nantes'] - pop_1962['Nantes']
#> Montpellier
#> 149592
#> Nantes
#> 51556
# Create a `pop_diff` vector to store population change between 1962 and 2012
pop_diff <- pop_2012 - pop_1962
# Print cities with a negative change
cities[pop_diff < 0]
#> [1] "Bordeaux" "Le Havre" "Lille" "Lyon"
#> [5] "Paris" "Saint-Etienne"
# Print cities which broke the 300000 people barrier between 1962 and 2012
cities[pop_2012 > 300000 & pop_1962 < 300000]
#> [1] "Nice"
# Compute the total change in population of the 10 largest cities
# (as of 1962) between 1962 and 2012.
ten_largest <- cities[order(pop_1962, decreasing = TRUE)[1:10]]
sum(pop_2012[ten_largest] - pop_1962[ten_largest])
#> [1] -324432
# Compute the population mean for year 1962
mean(pop_1962)
#> [1] 367477.3
#> [1] 2515356
# Sort the cities by decreasing order of population for 1962
(pop_1962_sorted <- sort(pop_1962, decreasing = TRUE))
#> Paris Marseille Lyon Toulouse Nice
#> 2790091 778071 535746 323724 292958
#> Bordeaux Nantes Lille Strasbourg Saint-Etienne
#> 278403 240048 239955 228971 210311
#> Le Havre Toulon Grenoble Rennes Brest
#> 187845 161797 156707 151948 136104
#> Dijon Reims Le Mans Montpellier Angers
#> 135694 134856 132181 118864 115273
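The solution uses both order() and sort(). As a small sketch of how the two relate, here is a reduced example with three cities taken from the 1962 data above (the names pop and idx are only illustrative): sort() returns the sorted values, while order() returns the indexes that would sort them, which is handy for reordering other vectors in the same way.

```r
# Subset of the 1962 populations used in the exercise
pop <- c(Lyon = 535746, Paris = 2790091, Nice = 292958)

# order() gives the positions of the elements in decreasing order
idx <- order(pop, decreasing = TRUE)
idx                                    # 2 1 3

# Indexing with these positions is equivalent to sort()
identical(pop[idx], sort(pop, decreasing = TRUE))   # TRUE

# The same indexes can reorder another vector, e.g. the city names
top2 <- names(pop)[idx][1:2]
top2                                   # "Paris" "Lyon"
```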