5  Vectors

R is a vectorized language with built-in arithmetic and relational operators, mathematical functions, and numeric types. Vectorized means that each mathematical function and operator works on a whole set of values rather than on a single scalar value, as in many traditional programming languages, so explicit loops can usually be avoided.
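As a minimal sketch of what this means in practice, the snippet below computes the same square roots twice, once with an explicit loop and once with a single vectorized call. The vector `values` and the names `roots_loop` and `roots_vec` are made up for this illustration.

```r
# Hypothetical example data
values <- c(1, 4, 9, 16)

# Explicit loop: compute each square root one element at a time
roots_loop <- numeric(length(values))
for (i in seq_along(values)) {
  roots_loop[i] <- sqrt(values[i])
}

# Vectorized call: sqrt() is applied to every element at once
roots_vec <- sqrt(values)

roots_vec
#> [1] 1 2 3 4
identical(roots_loop, roots_vec)
#> [1] TRUE
```

Both approaches give the same result, but the vectorized form is shorter and usually faster.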

5.1 Different ways of defining a vector

If you want to define a vector by specifying the values, use the function c(), like so:

  • Here, x is a vector of doubles:
x <- c(1, 5, 3, 12, 4.2)
x
#> [1]  1.0  5.0  3.0 12.0  4.2
  • But here, x is converted to a vector of strings because it contains a string:
x <- c(1, 5, 3, "hello")
x
#> [1] "1"     "5"     "3"     "hello"
  • To define a sequence of increasing numbers, either use the notation start:end for a sequence going from start to end in steps of 1, or use the more versatile seq() function:
1:10
#>  [1]  1  2  3  4  5  6  7  8  9 10
seq(-10, 10, by = .5)
#>  [1] -10.0  -9.5  -9.0  -8.5  -8.0  -7.5  -7.0  -6.5  -6.0  -5.5  -5.0  -4.5
#> [13]  -4.0  -3.5  -3.0  -2.5  -2.0  -1.5  -1.0  -0.5   0.0   0.5   1.0   1.5
#> [25]   2.0   2.5   3.0   3.5   4.0   4.5   5.0   5.5   6.0   6.5   7.0   7.5
#> [37]   8.0   8.5   9.0   9.5  10.0
seq(-10, 10, length.out = 6)
#> [1] -10  -6  -2   2   6  10
seq(-10, 10, along.with = x)
#> [1] -10.000000  -3.333333   3.333333  10.000000
  • To repeat values, use the rep() function:
rep(0, 10)
#>  [1] 0 0 0 0 0 0 0 0 0 0
rep(c(0, 2), 5)
#>  [1] 0 2 0 2 0 2 0 2 0 2
rep(c(0, 2), each = 5)
#>  [1] 0 0 0 0 0 2 2 2 2 2
  • To create vectors of random numbers, use rnorm() or runif() for normally or uniformly distributed numbers, respectively:
# vector with 10 random values normally distributed around `mean`
# with the given standard deviation `sd`
rnorm(10, mean = 3, sd = 1)
#>  [1] 2.619771 3.025152 3.293682 4.490925 4.431168 3.066530 4.233528 3.863887
#>  [9] 3.932504 2.220107
# vector with 10 random values uniformly distributed between min and max
runif(10, min = 0, max = 1)
#>  [1] 0.3529724 0.7712596 0.9803776 0.9240055 0.1539559 0.1982481 0.8150497
#>  [8] 0.2295002 0.6342184 0.8145764
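Because these values are random, running the calls above again gives different numbers. If reproducible results are needed, the base R function set.seed() can be called first. The sketch below assumes an arbitrary seed of 42; `a` and `b` are illustrative names.

```r
set.seed(42)   # 42 is an arbitrary seed chosen for this sketch
a <- rnorm(5, mean = 3, sd = 1)

set.seed(42)   # resetting the same seed restarts the same random sequence
b <- rnorm(5, mean = 3, sd = 1)

identical(a, b)
#> [1] TRUE
```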

5.2 Numerical and categorical data types

Data can be of two different types: numerical or categorical. Let’s say you are measuring the temperature in a room and recording its value over time:

T1 <- c(22.3, 23.5, 26.0, 30.2)

T1 is a vector containing numerical data.

Let’s say that you are now recording the temperature level, which can be low, medium or high:

T2 <- c("low", "low", "medium", "high")

T2 is a vector containing categorical data, i.e. each value in this example falls into one of 3 categories. For now, however, T2 is just a vector of strings, and we need to tell R that it contains categorical data by using the function factor():

T2 <- factor(T2)
T2
#> [1] low    low    medium high  
#> Levels: high low medium

We see here that we now have 3 levels. By default the levels are sorted alphabetically, and converting T2 to numbers gives 1, 2 or 3 according to the level of each value:

as.numeric(T2)
#> [1] 2 2 3 1

Numerical data can be converted to factors in the same way – this can be useful sometimes, e.g. when plotting with ggplot as we will see later:

factor(T1)
#> [1] 22.3 23.5 26   30.2
#> Levels: 22.3 23.5 26 30.2
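Two points about factors are easy to miss, and the sketch below illustrates them under the same data as above. First, the default alphabetical level order can be replaced by passing the `levels` argument to factor(). Second, calling as.numeric() on a factor built from numbers returns the level codes, so going through as.character() is one common way to recover the original values. The names `T2_ordered` and `f` are introduced only for this example.

```r
T2 <- c("low", "low", "medium", "high")

# Explicit level order instead of the default alphabetical one
T2_ordered <- factor(T2, levels = c("low", "medium", "high"))
levels(T2_ordered)
#> [1] "low"    "medium" "high"
as.integer(T2_ordered)
#> [1] 1 1 2 3

# as.numeric() on a factor returns the level codes, not the original values
T1 <- c(22.3, 23.5, 26.0, 30.2)
f <- factor(T1)
as.numeric(f)
#> [1] 1 2 3 4
as.numeric(as.character(f))
#> [1] 22.3 23.5 26.0 30.2
```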

5.3 Principal operations on vectors

5.3.1 Mathematical operations

Because R is a vectorized language, you don’t need to loop over all elements of a vector to perform element-wise operations on it. Let’s say that x <- 1:6, then:

  • Addition of a value to all elements:
x + 2.5
#> [1] 3.5 4.5 5.5 6.5 7.5 8.5
  • Multiplication / division of all elements:
x*2
#> [1]  2  4  6  8 10 12
  • Integer division:
x %/% 3
#> [1] 0 0 1 1 1 2
  • Math functions apply to all elements:
sqrt(abs(cos(x)))
#> [1] 0.7350526 0.6450944 0.9949837 0.8084823 0.5325995 0.9798828
  • Power:
x^2.5
#> [1]  1.000000  5.656854 15.588457 32.000000 55.901699 88.181631
  • Multiplication of vectors of the same size is performed element by element:
y <- c(2.3, 5, 7, 10, 12, 20)
x*y
#> [1]   2.3  10.0  21.0  40.0  60.0 120.0
  • Multiplication of vectors of different sizes: the shorter vector is automatically repeated (recycled) as many times as needed to reach the size of the longer one. This also works if the longer length is not a multiple of the shorter length: the last repetition of the shorter vector is truncated, a result is still returned, and R issues a warning (not an error):
x*1:2
#> [1]  1  4  3  8  5 12
x*1:4
#> [1]  1  4  9 16  5 12
  • Modulo:
x %% 2
#> [1] 1 0 1 0 1 0
  • Outer product of vectors (the result is a matrix):
x %o% y
#>      [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,]  2.3    5    7   10   12   20
#> [2,]  4.6   10   14   20   24   40
#> [3,]  6.9   15   21   30   36   60
#> [4,]  9.2   20   28   40   48   80
#> [5,] 11.5   25   35   50   60  100
#> [6,] 13.8   30   42   60   72  120
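The recycling behaviour described in this list can be checked directly. The sketch below uses the base R functions tryCatch() and conditionMessage() only to capture the warning so that it can be inspected; `msg` is an illustrative name.

```r
x <- 1:6

# Length 6 is a multiple of 2: the shorter vector is recycled silently
x * 1:2
#> [1]  1  4  3  8  5 12

# Length 6 is not a multiple of 4: R signals a warning.
# tryCatch() intercepts it here and returns its text instead of the product.
msg <- tryCatch(x * 1:4, warning = function(w) conditionMessage(w))
is.character(msg)
#> [1] TRUE
```

Printing `msg` shows the text of the warning that R would otherwise display next to the result.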

5.3.2 Accessing values

Let’s work on this vector x:

x <- c(5, 3, 4, 9, 3)

5.3.2.1 Accessing values by index

  • To access values of a vector x, use the x[i] notation, where i is the index you want to access. i can be (and in fact, is always) a vector.
Attention

In R, index numbering starts at 1!

# accessing by indexes
x[3]
#> [1] 4
# access indexes 1, 5 and 2
x[c(1,5,2)] 
#> [1] 5 3 3
  • To remove elements j from a vector x, use the notation x[-j]:
# remove elements 1 and 3
x[-c(1,3)]
#> [1] 3 9 3
Attention

Writing x[-c(1,3)] will just print the result of x[-c(1,3)], but not actually modify x.

To really modify x, you’d need to write x <- x[-c(1,3)].

5.3.2.2 Filtering values with tests

You can access values with booleans: values matched with TRUE are kept and values matched with FALSE are discarded. If a TRUE points to a non-existent value (i.e. to an index larger than the size of the vector), NA is returned:

x
#> [1] 5 3 4 9 3
x[c(TRUE,TRUE,TRUE,FALSE,TRUE,TRUE)]
#> [1]  5  3  4  3 NA

Therefore, you can apply tests on your values and filter them very easily:

x > 4 # is a vector of booleans
#> [1]  TRUE FALSE FALSE  TRUE FALSE
x[x > 4] # is a filtered vector
#> [1] 5 9
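Tests can also be combined. As a short sketch on the same vector, `&` and `|` combine two tests element by element, and the base R function which() returns the indexes at which a test is TRUE:

```r
x <- c(5, 3, 4, 9, 3)

# Combine tests with & (and) and | (or); both work element by element
x[x > 3 & x < 9]
#> [1] 5 4
x[x < 4 | x > 8]
#> [1] 3 9 3

# which() returns the indexes where the test is TRUE
which(x > 4)
#> [1] 1 4
```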

5.3.2.3 Accessing values by name

Finally, values in vectors can be named, and thus can be accessed by their name:

y <- c(age=32, name="John", pet="Cat")
y
#>    age   name    pet 
#>   "32" "John"  "Cat"
y[c('age','pet')] # prints a named vector
#>   age   pet 
#>  "32" "Cat"
y[['name']] # prints a single unnamed value
#> [1] "John"

And you can access the names of the vector using names():

names(y)
#> [1] "age"  "name" "pet"
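Names can also be set after the vector has been created, by assigning to names(). The sketch below uses a hypothetical `ages` vector with made-up names and values; unname() is the base R function that drops the names again.

```r
ages <- c(32, 27, 45)                      # hypothetical values
names(ages) <- c("John", "Mary", "Paul")   # hypothetical names

ages["Mary"]     # single brackets keep the name
#> Mary
#>   27
ages[["Mary"]]   # double brackets drop it
#> [1] 27
unname(ages)     # remove all names
#> [1] 32 27 45
```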

5.3.3 Sorting, removing duplicates, sampling

  • Sorting in ascending order:
sort(x)
#> [1] 3 3 4 5 9
  • It works with strings too, but stringr::str_sort() might be needed for strings mixing letters and numbers:
sort(c("c","a","d","ab")) 
#> [1] "a"  "ab" "c"  "d"
sort(c("c", "a10", "a2", "d", "ab"))
#> [1] "a10" "a2"  "ab"  "c"   "d"
stringr::str_sort(c("c", "a10", "a2", "d", "ab"), numeric = TRUE)
#> [1] "a2"  "a10" "ab"  "c"   "d"
  • Sorting in descending order:
sort(x, decreasing = TRUE) 
#> [1] 9 5 4 3 3
  • Inverting the order of the vector:
rev(x)
#> [1] 3 9 4 3 5
  • Find the indexes that would sort the vector:
order(x)
#> [1] 2 5 3 1 4
x[order(x)] # is thus equivalent to `sort(x)`
#> [1] 3 3 4 5 9
  • Find duplicates:
duplicated(x)
#> [1] FALSE FALSE FALSE FALSE  TRUE
  • Remove duplicates:
unique(x)
#> [1] 5 3 4 9
  • Choose 3 random values:
sample(x, 3)
#> [1] 4 3 9
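A common use of order() is to sort one vector according to the values of another. The sketch below uses two hypothetical vectors, `ages` and `people`, and also shows that filtering out the values flagged by duplicated() gives the same result as unique().

```r
ages   <- c(32, 27, 45)               # hypothetical values
people <- c("John", "Mary", "Paul")   # hypothetical names

idx <- order(ages)   # indexes that sort `ages` in increasing order
idx
#> [1] 2 1 3
people[idx]          # names sorted by increasing age
#> [1] "Mary" "John" "Paul"
people[order(ages, decreasing = TRUE)]
#> [1] "Paul" "John" "Mary"

# Keeping the values not flagged by duplicated() is the same as unique()
x <- c(5, 3, 4, 9, 3)
identical(x[!duplicated(x)], unique(x))
#> [1] TRUE
```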

5.3.4 Maximum and minimum

This is quite straightforward:

# maximum of x and its index
x; max(x); which.max(x) 
#> [1] 5 3 4 9 3
#> [1] 9
#> [1] 4
# minimum of x and its index
x; min(x); which.min(x)
#> [1] 5 3 4 9 3
#> [1] 3
#> [1] 2
# range of a vector
range(x)
#> [1] 3 9
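Note that which.min() and which.max() return only the first matching index. In the vector used here the minimum 3 appears twice, so a test combined with which() is one way to get every position:

```r
x <- c(5, 3, 4, 9, 3)

which.min(x)         # first position of the minimum only
#> [1] 2
which(x == min(x))   # every position holding the minimum
#> [1] 2 5
```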

5.3.5 Characteristics of vectors

  • Size of a vector:
length(x)
#> [1] 5
  • Summary statistics of a vector:
summary(x)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>     3.0     3.0     4.0     4.8     5.0     9.0
  • Sum of all terms:
sum(x)
#> [1] 24
  • Average value:
mean(x)
#> [1] 4.8
  • Median value:
median(x)
#> [1] 4
  • Standard deviation:
sd(x)
#> [1] 2.48998
  • Count occurrences of values:
table(x)
#> x
#> 3 4 5 9 
#> 2 1 1 1
  • Cumulative sum:
cumsum(x)
#> [1]  5  8 12 21 24
  • Term-by-term difference:
diff(x)
#> [1] -2  1  5 -6
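To see how some of these functions relate to each other, the sketch below recomputes the standard deviation by hand from the sample formula (division by n - 1, which is what sd() uses) and checks that diff() undoes cumsum() apart from the first element. The names `n` and `sd_manual` are introduced only for this example.

```r
x <- c(5, 3, 4, 9, 3)
n <- length(x)

# Sample standard deviation: sqrt of the sum of squared deviations over (n - 1)
sd_manual <- sqrt(sum((x - mean(x))^2) / (n - 1))
sd_manual
#> [1] 2.48998
all.equal(sd_manual, sd(x))
#> [1] TRUE

# diff() undoes cumsum(), apart from the first element
diff(cumsum(x))
#> [1] 3 4 9 3
```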

5.3.6 Concatenation of vectors

This is done using the c() notation: you basically create a vector of vectors, the result being a new vector:

# concatenate vectors
z <- c(-1:4, NA, -x); z
#>  [1] -1  0  1  2  3  4 NA -5 -3 -4 -9 -3

Another option is to use the append() function that allows for more options:

# append values
append(x, 4)    # at the end
#> [1] 5 3 4 9 3 4
append(x, 4:1, 3) # or after a given index
#> [1] 5 3 4 4 3 2 1 9 3
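As a quick check of what append() does, the insertion after a given index can also be written with c() and index ranges. The name `manual` is used only for this sketch.

```r
x <- c(5, 3, 4, 9, 3)

# Same insertion written by hand: first 3 elements, new values, remaining elements
manual <- c(x[1:3], 4:1, x[4:5])
manual
#> [1] 5 3 4 4 3 2 1 9 3
identical(manual, append(x, 4:1, after = 3))
#> [1] TRUE
```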

5.4 Exercises

Download the archive with all the exercise files, unzip it in your R class RStudio project, and edit the R files.


Exercise 1

Let’s create a named vector containing the ages of the students in the class, each value being named after the student’s first name. Then:

  • Compute the average age and its standard deviation
  • Compute the median age
  • What is the maximum, minimum and range of the ages in the class?
  • What are all the student names in the class?
  • Print the sorted ages by increasing and decreasing order
  • Print the ages sorted by alphabetically ordered names (increasing and decreasing)
  • Show a histogram of the age distribution using hist() and play with the parameter breaks to modify the histogram
  • Show a boxplot of the age distribution using boxplot()
Exercise 2

This exercise is adapted from here.

Open RStudio and create a new R script, then save it as population.R in a directory of your choice, say Rcourse/.

Download the population.csv file and save it in your working directory.

A csv file contains raw data stored as plain text, with values separated by commas (Comma Separated Values). Open it with RStudio.

We can of course load such a file directly with R and store its data in an appropriate format (i.e. a data.frame), but this is for the next chapter. For now, just copy-paste the text into the RStudio script area to:

  • Create a cities vector containing all the cities listed in population.csv
  • Create pop_1962 and pop_2012 vectors containing the population of each city in these years. Print the 2 vectors.
  • Use names() to name the values of pop_1962 and pop_2012. Print the 2 vectors again. Is there any change?
  • What are the cities with more than 200000 people in 1962? How many residents did these cities have in 2012?
  • What is the population evolution of Montpellier and Nantes?
  • Create a pop_diff vector to store population change between 1962 and 2012
  • Print cities with a negative change
  • Print cities which broke the 300000 people barrier between 1962 and 2012
  • Compute the total change in population of the 10 largest cities (as of 1962) between 1962 and 2012.
  • Compute the population mean for year 1962
  • Compute the population mean of Paris over these two years
  • Sort the cities by decreasing order of population for 1962
Solution
# Create a `cities` vector containing all the cities listed in `population.csv`
cities <- c("Angers", "Bordeaux", "Brest", "Dijon", "Grenoble", "Le Havre", 
            "Le Mans", "Lille", "Lyon", "Marseille", "Montpellier", "Nantes", 
            "Nice", "Paris", "Reims", "Rennes", "Saint-Etienne", "Strasbourg", 
            "Toulon", "Toulouse")
# Create `pop_1962` and `pop_2012` vectors containing the population
# of each city in these years. Print the 2 vectors.
pop_1962 <- c(115273,278403,136104,135694,156707,187845,132181,239955,
              535746,778071,118864,240048,292958,2790091,134856,151948,
              210311,228971,161797,323724)
pop_2012 <- c(149017,241287,139676,152071,158346,173142,143599,228652,
              496343,852516,268456,291604,343629,2240621,181893,209860,
              171483,274394,164899,453317)
pop_1962; pop_2012
#>  [1]  115273  278403  136104  135694  156707  187845  132181  239955  535746
#> [10]  778071  118864  240048  292958 2790091  134856  151948  210311  228971
#> [19]  161797  323724
#>  [1]  149017  241287  139676  152071  158346  173142  143599  228652  496343
#> [10]  852516  268456  291604  343629 2240621  181893  209860  171483  274394
#> [19]  164899  453317
# Use names() to name the values of `pop_1962` and `pop_2012`.
# Print the 2 vectors again. Is there any change?
names(pop_2012) <- names(pop_1962) <- cities
pop_1962; pop_2012
#>        Angers      Bordeaux         Brest         Dijon      Grenoble 
#>        115273        278403        136104        135694        156707 
#>      Le Havre       Le Mans         Lille          Lyon     Marseille 
#>        187845        132181        239955        535746        778071 
#>   Montpellier        Nantes          Nice         Paris         Reims 
#>        118864        240048        292958       2790091        134856 
#>        Rennes Saint-Etienne    Strasbourg        Toulon      Toulouse 
#>        151948        210311        228971        161797        323724
#>        Angers      Bordeaux         Brest         Dijon      Grenoble 
#>        149017        241287        139676        152071        158346 
#>      Le Havre       Le Mans         Lille          Lyon     Marseille 
#>        173142        143599        228652        496343        852516 
#>   Montpellier        Nantes          Nice         Paris         Reims 
#>        268456        291604        343629       2240621        181893 
#>        Rennes Saint-Etienne    Strasbourg        Toulon      Toulouse 
#>        209860        171483        274394        164899        453317
# What are the cities with more than 200000 people in 1962?
# How many residents did these cities have in 2012?
cities200k <- cities[pop_1962>200000]
cities200k; pop_2012[cities200k]
#>  [1] "Bordeaux"      "Lille"         "Lyon"          "Marseille"    
#>  [5] "Nantes"        "Nice"          "Paris"         "Saint-Etienne"
#>  [9] "Strasbourg"    "Toulouse"
#>      Bordeaux         Lille          Lyon     Marseille        Nantes 
#>        241287        228652        496343        852516        291604 
#>          Nice         Paris Saint-Etienne    Strasbourg      Toulouse 
#>        343629       2240621        171483        274394        453317
# What is the population evolution of Montpellier and Nantes?
pop_2012['Montpellier'] - pop_1962['Montpellier']; pop_2012['Nantes'] - pop_1962['Nantes']
#> Montpellier 
#>      149592
#> Nantes 
#>  51556
# Create a `pop_diff` vector to store population change between 1962 and 2012
pop_diff <- pop_2012 - pop_1962
# Print cities with a negative change
cities[pop_diff<0]
#> [1] "Bordeaux"      "Le Havre"      "Lille"         "Lyon"         
#> [5] "Paris"         "Saint-Etienne"
# Print cities which broke the 300000 people barrier between 1962 and 2012
cities[pop_2012>300000 & pop_1962<300000]
#> [1] "Nice"
# Compute the total change in population of the 10 largest cities
# (as of 1962) between 1962 and 2012.
ten_largest <- cities[order(pop_1962, decreasing = TRUE)[1:10]]
sum(pop_2012[ten_largest] - pop_1962[ten_largest])
#> [1] -324432
# Compute the population mean for year 1962
mean(pop_1962)
#> [1] 367477.3
# Compute the population mean of Paris over these two years
mean(c(pop_1962['Paris'], pop_2012['Paris']))
#> [1] 2515356
# Sort the cities by decreasing order of population for 1962
(pop_1962_sorted <- sort(pop_1962, decreasing = TRUE))
#>         Paris     Marseille          Lyon      Toulouse          Nice 
#>       2790091        778071        535746        323724        292958 
#>      Bordeaux        Nantes         Lille    Strasbourg Saint-Etienne 
#>        278403        240048        239955        228971        210311 
#>      Le Havre        Toulon      Grenoble        Rennes         Brest 
#>        187845        161797        156707        151948        136104 
#>         Dijon         Reims       Le Mans   Montpellier        Angers 
#>        135694        134856        132181        118864        115273
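Since the population vectors are named, the separate cities vector is not strictly needed for filtering: the names can be read from the filtered vector itself. The sketch below is one possible variant of the solution, restricted to three cities taken from the data above to keep it short.

```r
# Three cities taken from the data above, to keep the sketch short
pop_1962 <- c(Nantes = 240048, Nice = 292958, Paris = 2790091)
pop_2012 <- c(Nantes = 291604, Nice = 343629, Paris = 2240621)

# Arithmetic on named vectors keeps the names
pop_diff <- pop_2012 - pop_1962

# Cities with a negative change
names(pop_diff)[pop_diff < 0]
#> [1] "Paris"

# Cities which broke the 300000 people barrier
names(which(pop_2012 > 300000 & pop_1962 < 300000))
#> [1] "Nice"
```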