7 Reading/writing all kinds of files

7.1 Reading files

Working in any data-based scientific field, you will encounter many different types of files. ASCII text files are usually predominant, but you may want to read files from coming from Excel, Origin, etc. Here is a non-exhaustive reminder to help you read the kinds of files you often encounter with R. I invite you to visit the RopenSci webpage for more packages, and if this isn’t enough for your need, well, Google is your friend.

7.1.1 Column text files

Most base R functions like read.csv() or read.table() have their tidyverse counterpart, like read_csv() and read_table(). I usually prefer the tidyverse version as it outputs a tibble instead of a data.frame. These functions can also be directly provided with an url to the text file.

If your data file is “complicated” – in the sense that it contains lines or columns to be skipped – look at the help on the reader function with ?function_name.

7.1.1.1 Comma separated values

Input file looks like this:

#> Country Name,Country Code,Year,Value
#> Arab World,ARB,1960,92490932
#> Arab World,ARB,1961,95044497
#> Arab World,ARB,1962,97682294
#> Arab World,ARB,1963,100411076
#> Arab World,ARB,1964,103239902
#> Arab World,ARB,1965,106174988
#> Arab World,ARB,1966,109230593
#> Arab World,ARB,1967,112406932
#> Arab World,ARB,1968,115680165

library(tidyverse)
read_csv("Data/tot_population.csv")

#> # A tibble: 14,885 × 4
#>    `Country Name` `Country Code`  Year     Value
#>    <chr>          <chr>          <dbl>     <dbl>
#>  1 Arab World     ARB             1960  92490932
#>  2 Arab World     ARB             1961  95044497
#>  3 Arab World     ARB             1962  97682294
#>  4 Arab World     ARB             1963 100411076
#>  5 Arab World     ARB             1964 103239902
#>  6 Arab World     ARB             1965 106174988
#>  7 Arab World     ARB             1966 109230593
#>  8 Arab World     ARB             1967 112406932
#>  9 Arab World     ARB             1968 115680165
#> 10 Arab World     ARB             1969 119016542
#> # ℹ 14,875 more rows

read_csv("Data/tot_population.csv", skip = 1)

#> # A tibble: 14,884 × 4
#>    `Arab World` ARB   `1960` `92490932`
#>    <chr>        <chr>  <dbl>      <dbl>
#>  1 Arab World   ARB     1961   95044497
#>  2 Arab World   ARB     1962   97682294
#>  3 Arab World   ARB     1963  100411076
#>  4 Arab World   ARB     1964  103239902
#>  5 Arab World   ARB     1965  106174988
#>  6 Arab World   ARB     1966  109230593
#>  7 Arab World   ARB     1967  112406932
#>  8 Arab World   ARB     1968  115680165
#>  9 Arab World   ARB     1969  119016542
#> 10 Arab World   ARB     1970  122398374
#> # ℹ 14,874 more rows

read_csv("Data/tot_population.csv", skip = 1, col_names = LETTERS[1:4])

#> # A tibble: 14,885 × 4
#>    A          B         C         D
#>    <chr>      <chr> <dbl>     <dbl>
#>  1 Arab World ARB    1960  92490932
#>  2 Arab World ARB    1961  95044497
#>  3 Arab World ARB    1962  97682294
#>  4 Arab World ARB    1963 100411076
#>  5 Arab World ARB    1964 103239902
#>  6 Arab World ARB    1965 106174988
#>  7 Arab World ARB    1966 109230593
#>  8 Arab World ARB    1967 112406932
#>  9 Arab World ARB    1968 115680165
#> 10 Arab World ARB    1969 119016542
#> # ℹ 14,875 more rows

Since version 2.0 of readr, read_csv() can also take a vector as argument, which will result in reading all files in the vector successively:

read_csv(c("Data/test1.csv",
           "Data/test2.csv"),
         id = "file", show_col_types = FALSE)

#> # A tibble: 4 × 3
#>   file           name  value
#>   <chr>          <chr> <dbl>
#> 1 Data/test1.csv John      3
#> 2 Data/test1.csv Doe       2
#> 3 Data/test2.csv Colin     5
#> 4 Data/test2.csv Louis     8

7.1.1.2 Space separated values

Input file looks like this:

#> 3063.7136    43.916748
#> 3063.991 47.916748
#> 3064.2668    44.5
#> 3064.5442    50.5
#> 3064.8201    50.5
#> 3065.0972    44.5
#> 3065.373 44.916748
#> 3065.6504    39.916748
#> 3065.9263    49.5
#> 3066.2034    48.916748

library(tidyverse)
read_table("Data/rubis_01.txt")
read_table("Data/rubis_01.txt", col_names = c("w","int"))

7.1.1.3 Other separators

For tab-separated values, use read_tsv(). For other exotic separators, look into read_delim().

7.1.2 Excel files

For this, use the readxl library and its function read_excel() returning a tibble:

library(readxl)
read_excel("Data/test.xlsx")

#> # A tibble: 10 × 2
#>        x      y
#>    <dbl>  <dbl>
#>  1     1  5.21 
#>  2     2  6.55 
#>  3     3  3.71 
#>  4     4  0.216
#>  5     5  0.205
#>  6     6  4.60 
#>  7     7 10.3  
#>  8     8 12.9  
#>  9     9 11.1  
#> 10    10  7.28

read_excel("Data/test.xlsx", sheet=2) # specify the sheet by its number or its name

#> # A tibble: 4 × 2
#>   hello  world   
#>   <chr>  <chr>   
#> 1 ac     th      
#> 2 asc    thh     
#> 3 ascsa  dthdh   
#> 4 ascacs dthtdhdh

In case your Excel file contains merged cells, read_excel() will fill the merged cells by NA values. If you want to avoid this behavior, use openxlsx::read.xlsx() (which returns a data.frame):

read_excel("Data/test.xlsx", sheet=3)

#> # A tibble: 5 × 3
#>   a         b ...3 
#>   <chr> <dbl> <chr>
#> 1 <NA>     12 t    
#> 2 <NA>     13 h    
#> 3 <NA>     14 d    
#> 4 b        15 f    
#> 5 <NA>     16 g

library(openxlsx)
read.xlsx("Data/test.xlsx", fillMergedCells = TRUE, sheet = 3)

#>   a  b b
#> 1 a 12 t
#> 2 a 13 h
#> 3 a 14 d
#> 4 b 15 f
#> 5 b 16 g

7.1.3 Origin files

If you moved to R coming from a workflow where you used Origin, chances are you have some .opj files lying around that you still want to be able to read. Lucky you, the Ropj library is here:

library(Ropj)
read.opj("Data/opjfile.opj")

7.1.4 Matlab files

To read Matlab’s .mat format datasets, use the R.matlab package and its readMat() function.

library(R.matlab)
df <- readMat("yourfile.mat")

7.1.5 Images

You can read an image as a matrix. For example:

library(png)
readPNG("image.png")
library(tiff)
readTIFF("image.tiff")

For more image processing (pictures or videos), I recommend the imager package.

7.1.6 Spectroscopic files

In case your spectroscopic data wasn’t saved as an ASCII file but as a spc or another format, take a look at the lightr package, for example.

Example of a reading function for an spc file:

library(tidyverse)
library(lightr)
read_spc <- function(fname){
    d <- lr_parse_spc(fname) # d is a list
    tibble(w         = d[[1]]$wl, 
           intensity = d[[1]]$processed)
}

7.1.7 Compressed binary data files: HDF, netCDF

HDF: Go to this vignette to see how to read Hierarchical Data Files.

netCDF: see here.

7.2 Reading multiple files into a tidy table

We very often encounter the situation where we need to read multiple similar files into a tidy table. For this, you can use a for loop: this would work but would be un-R-ly, but in some cases you need to do this because you perform some other operations during the for loop.

Let’s say we store the list of file names into a vector file_list and read these files using the function read_function():

library(tidyverse)
df <- tibble() # empty initialization
for (file in file_list) {
    df_temp <- read_function(file) |> 
            mutate(name = file) # add the column `name` to make the tibble tidy
    df <- bind_rows(df, def_temp)
}

An R-friendly way of doing this would be to avoid using a for loop:

library(tidyverse)
df <- tibble(name = file_list) |> 
    mutate(data = map(name, read_function)) |> 
    unnest(data)

In case you are reading csv files, you can use the fact that since the version 2.0 of readr, read_csv() takes a vector as first argument. You’ll also need to use the id argument to get a column with the list of file names:

library(tidyverse)
df <- read_csv(file_list, id = "name")

7.3 Writing files

7.3.1 Text files

Sometimes, you want to output your data as a csv or an Excel file to share it with others or to save your data. Use the write_csv() function to write a csv file (prefer the tidyverse’s write_csv() to the base R write.csv() as it is more easy to use):

library(tidyverse)
write_csv(df, "your_file.csv")

Note that the write_*() functions will automatically compress outputs if an appropriate extension is given. Three extensions are currently supported: .gz for gzip compression, .bz2 for bzip2 compression and .xz for lzma compression. See the examples in the help for more information.

If you don’t want to use csv files, look into write_tsv() for tab-separated values or write_delim() for any delimiter. In case you want to output fixed width files, look into gdata’s write.fwf().

7.3.2 Excel files

To write Excel files, use the library openxlsx and its function write.xlsx() (see the help on the function for more options):

library(openxlsx)
write.xlsx(df, "your_file.xlsx")