5 Vectors

R is a vectorized language with built-in arithmetic and relational operators, mathematical functions and numeric types. This means that each mathematical function and operator works on a whole set of values rather than on a single scalar, as in many traditional programming languages, so explicit loops can usually be avoided.
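As a minimal sketch of what vectorization means in practice, the snippet below computes the same squares twice, first with an explicit loop and then with a single vectorized expression. The variable names (`values`, `squares_loop`, `squares_vec`) are illustrative only.

```r
values <- c(1, 5, 3, 12, 4.2)

# Explicit loop: fill a pre-allocated result one element at a time
squares_loop <- numeric(length(values))
for (i in seq_along(values)) {
  squares_loop[i] <- values[i]^2
}

# Vectorized form: the operator is applied to every element at once
squares_vec <- values^2

# Both approaches give the same result
all.equal(squares_loop, squares_vec)
```

The vectorized form is shorter and is usually the more idiomatic choice in R.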

5.1 Different ways of defining a vector

If you want to define a vector by specifying the values, use the function c(), like so:

Here, x is a vector of doubles:

x <- c(1, 5, 3, 12, 4.2)
x

#> [1]  1.0  5.0  3.0 12.0  4.2

But here, x is converted to a vector of strings because it contains a string:

x <- c(1, 5, 3, "hello")
x

#> [1] "1"     "5"     "3"     "hello"
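The conversion above is an instance of type coercion: c() converts all its elements to the most general type present. The following sketch shows the usual ordering (logical, then integer, then double, then character) and what happens when converting back. The names `mixed` and `back` are illustrative.

```r
# c() coerces all elements to the most general type present:
# logical < integer < double < character
class(c(TRUE, 2L))       # "integer"
class(c(2L, 3.5))        # "numeric"
class(c(3.5, "hello"))   # "character"

# Converting back to numbers: strings that are not numbers become NA
# (R normally emits a warning here, silenced for this sketch)
mixed <- c(1, 5, 3, "hello")
back <- suppressWarnings(as.numeric(mixed))
back
```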

To define a sequence of increasing numbers, either use the notation start:end for a sequence going from start to end in steps of 1, or use the more versatile seq() function:

1:10

#>  [1]  1  2  3  4  5  6  7  8  9 10

seq(-10, 10, by = .5)

#>  [1] -10.0  -9.5  -9.0  -8.5  -8.0  -7.5  -7.0  -6.5  -6.0  -5.5  -5.0  -4.5
#> [13]  -4.0  -3.5  -3.0  -2.5  -2.0  -1.5  -1.0  -0.5   0.0   0.5   1.0   1.5
#> [25]   2.0   2.5   3.0   3.5   4.0   4.5   5.0   5.5   6.0   6.5   7.0   7.5
#> [37]   8.0   8.5   9.0   9.5  10.0

seq(-10, 10, length.out = 6)

#> [1] -10  -6  -2   2   6  10

seq(-10, 10, along.with = x)

#> [1] -10.000000  -3.333333   3.333333  10.000000

To repeat values, use the rep() function:

rep(0, 10)

#>  [1] 0 0 0 0 0 0 0 0 0 0

rep(c(0, 2), 5)

#>  [1] 0 2 0 2 0 2 0 2 0 2

rep(c(0, 2), each = 5)

#>  [1] 0 0 0 0 0 2 2 2 2 2

To create vectors of random numbers, use rnorm() or runif() for normally or uniformly distributed numbers, respectively:

# vector of 10 random values, normally distributed around `mean`
# with standard deviation `sd`
rnorm(10, mean = 3, sd = 1)

#>  [1] 2.772932 2.860914 1.633551 2.799405 4.183160 3.050509 3.679921 6.021539
#>  [9] 1.097593 2.352588

# vector with 10 random values uniformly distributed between min and max
runif(10, min = 0, max = 1)

#>  [1] 0.80316437 0.36451792 0.19757124 0.05517922 0.08234020 0.72324840
#>  [7] 0.33825254 0.95372647 0.52669921 0.05505320
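Because these values are random, running the commands above again gives different numbers. If reproducible results are needed, one common approach is to fix the seed of the random number generator with set.seed() before drawing. The sketch below uses an arbitrary seed value of 42, and the names `first_draw` and `second_draw` are illustrative.

```r
# Fixing the seed makes the "random" draws repeatable
set.seed(42)   # 42 is an arbitrary choice
first_draw <- rnorm(5, mean = 3, sd = 1)

# Resetting the same seed reproduces exactly the same values
set.seed(42)
second_draw <- rnorm(5, mean = 3, sd = 1)

identical(first_draw, second_draw)   # TRUE
```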

5.2 Numerical and categorical data types

Data can be of two different types: numerical or categorical. Let’s say you are measuring the temperature in a room and recording its value over time:

T1 <- c(22.3, 23.5, 26.0, 30.2)

T1 is a vector containing numerical data.

Let’s say that you are now recording the temperature level, which can be low, medium or high:

T2 <- c("low", "low", "medium", "high")

T2 is a vector containing categorical data, i.e. each value falls into one of 3 categories. For now, however, T2 is just a vector of strings, and we need to tell R that it contains categorical data by using the function factor():

T2 <- factor(T2)
T2

#> [1] low    low    medium high  
#> Levels: high low medium

We now have 3 levels, sorted alphabetically by default. Converting T2 to numbers returns, for each value, the position (1, 2 or 3) of its level in that list:

as.numeric(T2)

#> [1] 2 2 3 1
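The alphabetical ordering of the levels is not always meaningful. As a sketch, the `levels` argument of factor() can be used to impose a more natural order. The names `temp_level` and `temp_ordered` are illustrative.

```r
temp_level <- c("low", "low", "medium", "high")

# Default: levels are sorted alphabetically (high, low, medium)
levels(factor(temp_level))

# Explicit levels: low, then medium, then high
temp_ordered <- factor(temp_level, levels = c("low", "medium", "high"))
levels(temp_ordered)
as.numeric(temp_ordered)   # 1 1 2 3
```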

Numerical data can be converted to factors in the same way – this can be useful sometimes, e.g. when plotting with ggplot as we will see later:

factor(T1)

#> [1] 22.3 23.5 26   30.2
#> Levels: 22.3 23.5 26 30.2
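One pitfall is worth a short sketch: calling as.numeric() on a factor built from numbers returns the level positions, not the original values. Going through as.character() first recovers them. The names `temps` and `temps_f` are illustrative.

```r
temps <- c(22.3, 23.5, 26.0, 30.2)
temps_f <- factor(temps)

# Level positions, not temperatures
as.numeric(temps_f)                 # 1 2 3 4

# Convert the labels to strings first, then to numbers
as.numeric(as.character(temps_f))   # 22.3 23.5 26.0 30.2
```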

5.3 Principal operations on vectors

5.3.1 Mathematical operations

Because R is a vectorized language, you don’t need to loop over the elements of a vector to perform element-wise operations on it. Let’s say that x <- 1:6, then:

Addition of a value to all elements:

x + 2.5

#> [1] 3.5 4.5 5.5 6.5 7.5 8.5

Multiplication / division of all elements:

x*2

#> [1]  2  4  6  8 10 12

Integer division:

x %/% 3

#> [1] 0 0 1 1 1 2

Math functions apply on all elements:

sqrt(abs(cos(x)))

#> [1] 0.7350526 0.6450944 0.9949837 0.8084823 0.5325995 0.9798828

Power:

x^2.5

#> [1]  1.000000  5.656854 15.588457 32.000000 55.901699 88.181631

Multiplication of vectors of the same size is performed element by element:

y <- c(2.3, 5, 7, 10, 12, 20)
x*y

#> [1]   2.3  10.0  21.0  40.0  60.0 120.0

Multiplication of vectors of different sizes: the shorter vector is automatically repeated (recycled) as many times as needed to match the length of the longer one. This also works if the length of the longer vector is not a multiple of the length of the shorter one: the last repetition of the shorter vector is truncated, the result is still computed, and R issues a warning (not an error):

x*1:2

#> [1]  1  4  3  8  5 12

x*1:4

#> [1]  1  4  9 16  5 12
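To check that the mismatched lengths produce a warning and not an error, the warning can be intercepted with tryCatch(). This is only a sketch: the names `x6` and `msg` are illustrative, and the exact wording of the message depends on R's language settings.

```r
x6 <- 1:6

# tryCatch() intercepts the warning; the handler returns
# the warning message instead of the product
msg <- tryCatch(x6 * 1:4, warning = function(w) conditionMessage(w))
msg

# suppressWarnings() shows that a result is still computed
suppressWarnings(x6 * 1:4)
```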

Modulo:

x %% 2

#> [1] 1 0 1 0 1 0

Outer product of vectors (the result is a matrix):

x %o% y

#>      [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,]  2.3    5    7   10   12   20
#> [2,]  4.6   10   14   20   24   40
#> [3,]  6.9   15   21   30   36   60
#> [4,]  9.2   20   28   40   48   80
#> [5,] 11.5   25   35   50   60  100
#> [6,] 13.8   30   42   60   72  120

5.3.2 Accessing values

Let’s work on this vector x:

x <- c(5, 3, 4, 9, 3)

5.3.2.1 Accessing values by index

To access values of a vector x, use the x[i] notation, where i is the index you want to access. i can be (and in fact, is always) a vector.

Attention

In R, index numbering starts at 1!

# accessing by indexes
x[3]

#> [1] 4

# access index 1, 5 and 2
x[c(1,5,2)]

#> [1] 5 3 3

To remove elements j from a vector x, use the notation x[-j]:

# remove elements 1 and 3
x[-c(1,3)]

#> [1] 3 9 3

Attention

Writing x[-c(1,3)] will just print the result of x[-c(1,3)], but not actually modify x.

To really modify x, you’d need to write x <- x[-c(1,3)].

5.3.2.2 Filtering values with tests

You can access values with booleans (logical values). Values matched with TRUE are kept, while values matched with FALSE are discarded. A TRUE given at a non-existent position (i.e. at an index larger than the size of the vector) returns NA:

x

#> [1] 5 3 4 9 3

x[c(TRUE,TRUE,TRUE,FALSE,TRUE,TRUE)]

#> [1]  5  3  4  3 NA

Therefore, you can apply tests on your values and filter them very easily:

x > 4 # is a vector of booleans

#> [1]  TRUE FALSE FALSE  TRUE FALSE

x[x > 4] # is a filtered vector

#> [1] 5 9
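Tests can also be combined. The following sketch uses the element-wise logical operators `&`, `|` and `!`, and the function which(), which returns the indexes of the TRUE values. The vector is named `v` here only to keep the example self-contained.

```r
v <- c(5, 3, 4, 9, 3)

# Combine tests element-wise with & (and), | (or), ! (not)
v[v > 3 & v < 9]     # 5 4
v[v == 3 | v == 9]   # 3 9 3
v[!(v > 4)]          # 3 4 3

# which() returns the indexes of the TRUE values
which(v > 4)         # 1 4
```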

5.3.2.3 Accessing values by name

Finally, values in vectors can be named, and thus can be accessed by their name:

y <- c(age=32, name="John", pet="Cat")
y

#>    age   name    pet 
#>   "32" "John"  "Cat"

y[c('age','pet')] # prints a named vector

#>   age   pet 
#>  "32" "Cat"

y[['name']] # prints an un-named vector

#> [1] "John"

And you can access the names of the vector using names():

names(y)

#> [1] "age"  "name" "pet"
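names() can also be used on the left-hand side of an assignment to set or change the names of an existing vector. A minimal sketch, with hypothetical names and ages:

```r
ages <- c(32, 27, 45)                     # hypothetical values
names(ages) <- c("John", "Ann", "Raj")    # hypothetical names

ages["Ann"]      # named element
ages[["Ann"]]    # bare value: 27

# unname() returns a copy of the vector without names
unname(ages)
```

Since all values here are numbers, the vector stays numeric, unlike y above, where mixing numbers and strings turned everything into strings.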

5.3.3 Sorting, removing duplicates, sampling

Sorting by ascending number:

sort(x)

#> [1] 3 3 4 5 9

It works with strings too, but stringr::str_sort() might be needed for strings mixing letters and numbers:

sort(c("c","a","d","ab"))

#> [1] "a"  "ab" "c"  "d"

sort(c("c", "a10", "a2", "d", "ab"))

#> [1] "a10" "a2"  "ab"  "c"   "d"

stringr::str_sort(c("c", "a10", "a2", "d", "ab"), numeric = TRUE)

#> [1] "a2"  "a10" "ab"  "c"   "d"

Sorting by descending number:

sort(x, decreasing = TRUE)

#> [1] 9 5 4 3 3

Inverting the order of the vector:

rev(x)

#> [1] 3 9 4 3 5

Find the indexes that sort the vector:

order(x)

#> [1] 2 5 3 1 4

x[order(x)] # is thus equivalent to `sort(x)`

#> [1] 3 3 4 5 9
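Where order() becomes really useful is for sorting one vector according to the values of another. The sketch below uses hypothetical `scores` and `labels` vectors of the same length.

```r
scores <- c(5, 3, 4, 9, 3)
labels <- c("a", "b", "c", "d", "e")   # hypothetical labels

# Indexes that sort `scores` in increasing order
idx <- order(scores)   # 2 5 3 1 4

# Reorder `labels` so that they follow the sorted scores
labels[idx]            # "b" "e" "c" "a" "d"
scores[idx]            # 3 3 4 5 9
```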

Find duplicates:

duplicated(x)

#> [1] FALSE FALSE FALSE FALSE  TRUE

Remove duplicates:

unique(x)

#> [1] 5 3 4 9

Choose 3 random values:

sample(x, 3)

#> [1] 3 4 5
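By default, sample() draws without replacement, so it cannot return more values than the vector contains. The sketch below shows two variations: a random permutation of the whole vector, and sampling with replacement. The names `pool`, `shuffled` and `draws` are illustrative, and the values obtained differ from one run to the next.

```r
pool <- c(5, 3, 4, 9, 3)

# With no size given, sample() returns a random permutation
shuffled <- sample(pool)

# Sampling with replacement allows more draws than there are elements
draws <- sample(pool, 8, replace = TRUE)

length(shuffled)   # 5
length(draws)      # 8
```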

5.3.4 Maximum and minimum

This is quite straightforward:

# maximum of x and its index
x; max(x); which.max(x)

#> [1] 5 3 4 9 3

#> [1] 9

#> [1] 4

# minimum of x and its index
x; min(x); which.min(x)

#> [1] 5 3 4 9 3

#> [1] 3

#> [1] 2

# range of a vector
range(x)

#> [1] 3 9

5.3.5 Characteristics of vectors

Size of a vector:

length(x)

#> [1] 5

Statistics on vector:

summary(x)

#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>     3.0     3.0     4.0     4.8     5.0     9.0

Sum of all terms:

sum(x)

#> [1] 24

Average value:

mean(x)

#> [1] 4.8

Median value:

median(x)

#> [1] 4

Standard deviation:

sd(x)

#> [1] 2.48998

Count occurrence of values:

table(x)

#> x
#> 3 4 5 9 
#> 2 1 1 1

Cumulative sum:

cumsum(x)

#> [1]  5  8 12 21 24

Term-by-term difference:

diff(x)

#> [1] -2  1  5 -6

5.3.6 Concatenation of vectors

This is done using the c() function: the vectors passed to it are flattened into a single new vector:

# concatenate vectors
z <- c(-1:4, NA, -x); z

#>  [1] -1  0  1  2  3  4 NA -5 -3 -4 -9 -3

Another option is to use the append() function that allows for more options:

# append values
append(x, 4)    # at the end

#> [1] 5 3 4 9 3 4

append(x, 4:1, 3) # or after a given index

#> [1] 5 3 4 4 3 2 1 9 3
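To make the behaviour of the third argument of append() explicit, the sketch below names it (`after`) and compares the result with the equivalent construction using c() and indexing. The names `v`, `inserted` and `manual` are illustrative.

```r
v <- c(5, 3, 4, 9, 3)

# Insert 4, 3, 2, 1 after the 3rd element
inserted <- append(v, 4:1, after = 3)

# The same result built by hand with c() and indexing
manual <- c(v[1:3], 4:1, v[4:5])
identical(inserted, manual)   # TRUE

# after = 0 inserts at the beginning
append(v, 0, after = 0)       # 0 5 3 4 9 3
```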

5.4 Exercises

Download the exercises and solutions from the following repository, then create an RStudio project from the unzipped folder:

Beginner exercises

Exercise 1

Let’s create a named vector containing the ages of the students in the class, the name of each value being the first name of the student. Then:

Compute the average age and its standard deviation
Compute the median age
What is the maximum, minimum and range of the ages in the class?
What are all the student names in the class?
Print the sorted ages by increasing and decreasing order
Print the ages sorted by alphabetically ordered names (increasing and decreasing)
Show a histogram of the ages distribution using hist() and play with the parameter breaks to modify the histogram
Show a boxplot of the ages distribution using boxplot()

Exercise 2

This exercise is adapted from here.

Open RStudio and create a new R script, save it as population.R in a directory of your choice, say Rcourse/.

Download the population.csv file and save it in your working directory.

A csv file contains raw data stored as plain text, with values separated by commas (Comma Separated Values). Open it with RStudio.

We could of course load such a file directly with R and store its data in an appropriate format (i.e. a data.frame), but this is for the next chapter. For now, just copy-paste the text into the RStudio script area to:

Create a cities vector containing all the cities listed in population.csv
Create pop_1962 and pop_2012 vectors containing the populations of each city in these years. Print the 2 vectors.
Use names() to name the values of pop_1962 and pop_2012. Print the 2 vectors again. Is there any change?
What are the cities with more than 200000 people in 1962? For these, how many residents in 2012?
What is the population evolution of Montpellier and Nantes?
Create a pop_diff vector to store population change between 1962 and 2012
Print cities with a negative change
Print cities which broke the 300000 people barrier between 1962 and 2012
Compute the total change in population of the 10 largest cities (as of 1962) between 1962 and 2012.
Compute the population mean for year 1962
Compute the population mean of Paris over these two years
Sort the cities by decreasing order of population for 1962

Solution

# Create a `cities` vector containing all the cities listed in `population.csv`
cities <- c("Angers", "Bordeaux", "Brest", "Dijon", "Grenoble", "Le Havre", 
            "Le Mans", "Lille", "Lyon", "Marseille", "Montpellier", "Nantes", 
            "Nice", "Paris", "Reims", "Rennes", "Saint-Etienne", "Strasbourg", 
            "Toulon", "Toulouse")
# Create a `pop_1962` and `pop_2012` vectors containing the populations 
# of each city at these years. Print the 2 vectors. 
pop_1962 <- c(115273,278403,136104,135694,156707,187845,132181,239955,
              535746,778071,118864,240048,292958,2790091,134856,151948,
              210311,228971,161797,323724)
pop_2012 <- c(149017,241287,139676,152071,158346,173142,143599,228652,
              496343,852516,268456,291604,343629,2240621,181893,209860,
              171483,274394,164899,453317)
pop_1962; pop_2012

#>  [1]  115273  278403  136104  135694  156707  187845  132181  239955  535746
#> [10]  778071  118864  240048  292958 2790091  134856  151948  210311  228971
#> [19]  161797  323724

#>  [1]  149017  241287  139676  152071  158346  173142  143599  228652  496343
#> [10]  852516  268456  291604  343629 2240621  181893  209860  171483  274394
#> [19]  164899  453317

# Use names() to name values of `pop_1962` and `pop_2012`. 
# Print the 2 vectors again. Is there any change?
names(pop_2012) <- names(pop_1962) <- cities
pop_1962; pop_2012

#>        Angers      Bordeaux         Brest         Dijon      Grenoble 
#>        115273        278403        136104        135694        156707 
#>      Le Havre       Le Mans         Lille          Lyon     Marseille 
#>        187845        132181        239955        535746        778071 
#>   Montpellier        Nantes          Nice         Paris         Reims 
#>        118864        240048        292958       2790091        134856 
#>        Rennes Saint-Etienne    Strasbourg        Toulon      Toulouse 
#>        151948        210311        228971        161797        323724

#>        Angers      Bordeaux         Brest         Dijon      Grenoble 
#>        149017        241287        139676        152071        158346 
#>      Le Havre       Le Mans         Lille          Lyon     Marseille 
#>        173142        143599        228652        496343        852516 
#>   Montpellier        Nantes          Nice         Paris         Reims 
#>        268456        291604        343629       2240621        181893 
#>        Rennes Saint-Etienne    Strasbourg        Toulon      Toulouse 
#>        209860        171483        274394        164899        453317

# What are the cities with more than 200000 people in 1962? 
# For these, how many residents in 2012?
cities200k <- cities[pop_1962>200000]
cities200k; pop_2012[cities200k]

#>  [1] "Bordeaux"      "Lille"         "Lyon"          "Marseille"    
#>  [5] "Nantes"        "Nice"          "Paris"         "Saint-Etienne"
#>  [9] "Strasbourg"    "Toulouse"

#>      Bordeaux         Lille          Lyon     Marseille        Nantes 
#>        241287        228652        496343        852516        291604 
#>          Nice         Paris Saint-Etienne    Strasbourg      Toulouse 
#>        343629       2240621        171483        274394        453317

# What is the population evolution of Montpellier and Nantes?
pop_2012['Montpellier'] - pop_1962['Montpellier']; pop_2012['Nantes'] - pop_1962['Nantes']

#> Montpellier 
#>      149592

#> Nantes 
#>  51556

# Create a `pop_diff` vector to store population change between 1962 and 2012
pop_diff <- pop_2012 - pop_1962
# Print cities with a negative change
cities[pop_diff<0]

#> [1] "Bordeaux"      "Le Havre"      "Lille"         "Lyon"         
#> [5] "Paris"         "Saint-Etienne"

# Print cities which broke the 300000 people barrier between 1962 and 2012
cities[pop_2012>300000 & pop_1962<300000]

#> [1] "Nice"

# Compute the total change in population of the 10 largest cities
# (as of 1962) between 1962 and 2012.
ten_largest <- cities[order(pop_1962, decreasing = TRUE)[1:10]]
sum(pop_2012[ten_largest] - pop_1962[ten_largest])

#> [1] -324432

# Compute the population mean for year 1962
mean(pop_1962)

#> [1] 367477.3

# Compute the population mean of Paris over these two years
mean(c(pop_1962['Paris'], pop_2012['Paris']))

#> [1] 2515356

# Sort the cities by decreasing order of population for 1962
(pop_1962_sorted <- sort(pop_1962, decreasing = TRUE))

#>         Paris     Marseille          Lyon      Toulouse          Nice 
#>       2790091        778071        535746        323724        292958 
#>      Bordeaux        Nantes         Lille    Strasbourg Saint-Etienne 
#>        278403        240048        239955        228971        210311 
#>      Le Havre        Toulon      Grenoble        Rennes         Brest 
#>        187845        161797        156707        151948        136104 
#>         Dijon         Reims       Le Mans   Montpellier        Angers 
#>        135694        134856        132181        118864        115273