#> Country Name,Country Code,Year,Value
#> Arab World,ARB,1960,92490932
#> Arab World,ARB,1961,95044497
#> Arab World,ARB,1962,97682294
#> Arab World,ARB,1963,100411076
#> Arab World,ARB,1964,103239902
#> Arab World,ARB,1965,106174988
#> Arab World,ARB,1966,109230593
#> Arab World,ARB,1967,112406932
#> Arab World,ARB,1968,115680165
7 Reading/writing all kinds of files
7.1 Reading files
Working in any data-based scientific field, you will encounter many different types of files. ASCII text files are usually predominant, but you may want to read files from coming from Excel, Origin, etc. Here is a non-exhaustive reminder to help you read the kinds of files you often encounter with R. I invite you to visit the RopenSci webpage for more packages, and if this isn’t enough for your need, well, Google is your friend.
7.1.1 Column text files
Most base R functions like read.csv()
or read.table()
have their tidyverse
counterpart, like read_csv()
and read_table()
. I usually prefer the tidyverse
version as it outputs a tibble
instead of a data.frame
. These functions can also be directly provided with an url to the text file.
If your data file is “complicated” – in the sense that it contains lines or columns to be skipped – look at the help on the reader function with ?function_name
.
7.1.1.1 Comma separated values
Input file looks like this:
#> # A tibble: 14,885 × 4
#> `Country Name` `Country Code` Year Value
#> <chr> <chr> <dbl> <dbl>
#> 1 Arab World ARB 1960 92490932
#> 2 Arab World ARB 1961 95044497
#> 3 Arab World ARB 1962 97682294
#> 4 Arab World ARB 1963 100411076
#> 5 Arab World ARB 1964 103239902
#> 6 Arab World ARB 1965 106174988
#> 7 Arab World ARB 1966 109230593
#> 8 Arab World ARB 1967 112406932
#> 9 Arab World ARB 1968 115680165
#> 10 Arab World ARB 1969 119016542
#> # ℹ 14,875 more rows
read_csv("Data/tot_population.csv", skip = 1)
#> # A tibble: 14,884 × 4
#> `Arab World` ARB `1960` `92490932`
#> <chr> <chr> <dbl> <dbl>
#> 1 Arab World ARB 1961 95044497
#> 2 Arab World ARB 1962 97682294
#> 3 Arab World ARB 1963 100411076
#> 4 Arab World ARB 1964 103239902
#> 5 Arab World ARB 1965 106174988
#> 6 Arab World ARB 1966 109230593
#> 7 Arab World ARB 1967 112406932
#> 8 Arab World ARB 1968 115680165
#> 9 Arab World ARB 1969 119016542
#> 10 Arab World ARB 1970 122398374
#> # ℹ 14,874 more rows
read_csv("Data/tot_population.csv", skip = 1, col_names = LETTERS[1:4])
#> # A tibble: 14,885 × 4
#> A B C D
#> <chr> <chr> <dbl> <dbl>
#> 1 Arab World ARB 1960 92490932
#> 2 Arab World ARB 1961 95044497
#> 3 Arab World ARB 1962 97682294
#> 4 Arab World ARB 1963 100411076
#> 5 Arab World ARB 1964 103239902
#> 6 Arab World ARB 1965 106174988
#> 7 Arab World ARB 1966 109230593
#> 8 Arab World ARB 1967 112406932
#> 9 Arab World ARB 1968 115680165
#> 10 Arab World ARB 1969 119016542
#> # ℹ 14,875 more rows
Since version 2.0 of readr
, read_csv()
can also take a vector as argument, which will result in reading all files in the vector successively:
7.1.1.2 Space separated values
Input file looks like this:
#> 3063.7136 43.916748
#> 3063.991 47.916748
#> 3064.2668 44.5
#> 3064.5442 50.5
#> 3064.8201 50.5
#> 3065.0972 44.5
#> 3065.373 44.916748
#> 3065.6504 39.916748
#> 3065.9263 49.5
#> 3066.2034 48.916748
library(tidyverse)
read_table("Data/rubis_01.txt")
read_table("Data/rubis_01.txt", col_names = c("w","int"))
7.1.1.3 Other separators
For tab-separated values, use read_tsv()
. For other exotic separators, look into read_delim()
.
7.1.2 Excel files
For this, use the readxl
library and its function read_excel()
returning a tibble
:
library(readxl)
read_excel("Data/test.xlsx")
#> # A tibble: 10 × 2
#> x y
#> <dbl> <dbl>
#> 1 1 5.21
#> 2 2 6.55
#> 3 3 3.71
#> 4 4 0.216
#> 5 5 0.205
#> 6 6 4.60
#> 7 7 10.3
#> 8 8 12.9
#> 9 9 11.1
#> 10 10 7.28
read_excel("Data/test.xlsx", sheet=2) # specify the sheet by its number or its name
#> # A tibble: 4 × 2
#> hello world
#> <chr> <chr>
#> 1 ac th
#> 2 asc thh
#> 3 ascsa dthdh
#> 4 ascacs dthtdhdh
In case your Excel file contains merged cells, read_excel()
will fill the merged cells by NA
values. If you want to avoid this behavior, use openxlsx::read.xlsx()
(which returns a data.frame
):
read_excel("Data/test.xlsx", sheet=3)
#> # A tibble: 5 × 3
#> a b ...3
#> <chr> <dbl> <chr>
#> 1 <NA> 12 t
#> 2 <NA> 13 h
#> 3 <NA> 14 d
#> 4 b 15 f
#> 5 <NA> 16 g
#> a b b
#> 1 a 12 t
#> 2 a 13 h
#> 3 a 14 d
#> 4 b 15 f
#> 5 b 16 g
7.1.3 Origin files
If you moved to R coming from a workflow where you used Origin, chances are you have some .opj files lying around that you still want to be able to read. Lucky you, the Ropj
library is here:
7.1.4 Matlab files
To read Matlab’s .mat format datasets, use the R.matlab package and its readMat()
function.
7.1.5 Images
You can read an image as a matrix. For example:
For more image processing (pictures or videos), I recommend the imager
package.
7.1.6 Spectroscopic files
In case your spectroscopic data wasn’t saved as an ASCII file but as a spc
or another format, take a look at the lightr
package, for example.
Example of a reading function for an spc file:
7.1.7 Compressed binary data files: HDF, netCDF
HDF: Go to this vignette to see how to read Hierarchical Data Files.
netCDF: see here.
7.2 Reading multiple files into a tidy table
We very often encounter the situation where we need to read multiple similar files into a tidy table. For this, you can use a for
loop: this would work but would be un-R-ly, but in some cases you need to do this because you perform some other operations during the for
loop.
Let’s say we store the list of file names into a vector file_list
and read these files using the function read_function()
:
library(tidyverse)
df <- tibble() # empty initialization
for (file in file_list) {
df_temp <- read_function(file) |>
mutate(name = file) # add the column `name` to make the tibble tidy
df <- bind_rows(df, def_temp)
}
An R-friendly way of doing this would be to avoid using a for
loop:
library(tidyverse)
df <- tibble(name = file_list) |>
mutate(data = map(name, read_function)) |>
unnest(data)
In case you are reading csv files, you can use the fact that since the version 2.0 of readr
, read_csv()
takes a vector as first argument. You’ll also need to use the id
argument to get a column with the list of file names:
7.3 Writing files
7.3.1 Text files
Sometimes, you want to output your data as a csv or an Excel file to share it with others or to save your data. Use the write_csv()
function to write a csv file (prefer the tidyverse
’s write_csv()
to the base R write.csv()
as it is more easy to use):
Note that the write_*()
functions will automatically compress outputs if an appropriate extension is given. Three extensions are currently supported: .gz for gzip compression, .bz2 for bzip2 compression and .xz for lzma compression. See the examples in the help for more information.
If you don’t want to use csv files, look into write_tsv()
for tab-separated values or write_delim()
for any delimiter. In case you want to output fixed width files, look into gdata
’s write.fwf()
.
7.3.2 Excel files
To write Excel files, use the library openxlsx
and its function write.xlsx()
(see the help on the function for more options):
library(openxlsx)
write.xlsx(df, "your_file.xlsx")