1 About the class

1.1 Objectives of the class

The goal of this class is that at the end, the students are able to:

Treat their data with the free and open source language R, i.e.:
- Read, browse, manipulate and plot their data
- Model or simulate their data
Make automatic reporting through Rmarkdown or Quarto
Build a graphical interface with Shiny to interact with their data and output something (a value, a pdf report, a graph…)

What you will learn here will be useful in any scientific domain. The examples in this course are however mainly coming from the type of data you might encounter in Materials Science because, well, it’s what I have on hand…

1.2 Prerequisites

Coding skills: none expected.
The students should come with a laptop with admin rights (i.e. you should be able to install stuff).

1.3 Motivations

1.3.1 Reproducible data treatment: why it matters

Here is an introduction from the Wikipedia page on reproducible research:

In 2016, Nature conducted a survey of 1576 researchers who took a brief online questionnaire on reproducibility in research. According to the survey, more than 70% of researchers have tried and failed to reproduce another scientist’s experiments, and more than half have failed to reproduce their own experiments. […] Although 52% of those surveyed agree there is a significant ‘crisis’ of reproducibility, less than 31% think failure to reproduce published results means the result is probably wrong, and most say they still trust the published literature.¹

Replicability and reproducibility are some of the keys to scientific integrity. Establishing a workflow in which your data are always treated in the same manner is a necessity, because it is a way to:

Minimize errors inherent to human manipulation
Keep track of all the treatments you perform on your data and document your methodology: this allows others to reproduce your data, but also yourself.
Help you to make sense of all your data, and avoid disregarding some data (hence help you keep your scientific integrity)
Gain tremendous amounts of time

Goal of this class

It is the objective of this class to provide you the tools necessary to work within this philosophy.

1.3.2 Why with R and not python?

The eternal question… R was originally designed by statisticians for statisticians and it might still suffers from this “statistics only” label that sticks to it.

Python is a wide spectrum programming language with very efficient numerical libraries used in the computer science community.

R is focused on data treatment, statistics and representation. In R, the object is the data, and base R allows you to read, treat, fit and plot your data very easily – although you will still most certainly need additional packages.

So with python, you can do everything, including treating and analyzing scientific data – with the right packages. With R, you can do less but do very well what you do, and in my opinion more seamlessly (probably because I learned and used R for years before starting with python…). In my opinion, this xkcd comic about python environment is only slightly exaggerated… while for R, installation and maintenance is sooooo easy in comparison…

Each language has his own strengths and weaknesses. To my tastes, I would say that python and R compare like that (although a pythonist would probably say the opposite):

	R	Python
Free and open source	✔✔✔	✔✔✔
IDE	✔✔✔	✔✔✔
Large code repository	✔✔✔	✔✔✔
Large community	✔✔✔	✔✔✔
Notebooks	✔✔✔	✔✔✔
Machine Learning	✔✔✔	✔✔✔
Performances	✔✔	✔✔✔
Ease of installation and maintenance	✔✔✔	✔
Data visualization	✔✔✔	✔
Statistical analysis	✔✔✔	✔
Multi-purpose	✔	✔✔✔
Syntax, productivity, flexibility	✔✔✔	✔✔✔
Rmarkdown	✔✔✔✔✔	✔
Quarto	✔✔✔✔✔	✔✔✔✔✔

Well, it’s all very subjective, really. In the end, I still use both languages, each one for a different purpose:

Let’s say I want to produce an initial atomic configuration for a molecular dynamics simulation, or read a molecular dynamics trajectory and compute some quantities such as a pair correlation or a mean square displacement, or perform some image-based machine learning: python (or even C, if I need to treat large trajectories).
Now if I want to make sense of some experimental measurements or results of simulations, do some fits and produce publication-quality graphs or experimental reports: R.

Both languages are great and being able to use both is the best thing that can happen to you (relatively speaking) – especially since you can combine them in Rmarkdown using the reticulate package, which we will see later in this class.

So, since my goal is to provide you with tools for seamlessly read, make sense, and plot your data in the reproducible science philosophy, let’s go with R. Also, R has a great IDE (Rstudio) that really eases working with data and code. Such a nice IDE is still lacking for python.

1.4 Further reading

This class is indented to provide the students with the tools to handle themselves with R, Rmarkdown and Shiny, and not to provide an extensive review of everything that is possible with R. To go further:

R
- R manual on CRAN
- Some cheatsheets
- The tidyverse website
- Tibbles.
- Tidy your data
- Tips to improve your code
Plotting
- The R Graph Gallery
- The R Graph Cookbook
- The ggplot cheatsheet
  - Another one
  - Another one quite extensive
  - Another one
Rmarkdown
- Rmarkdown complete guide
- Rmarkdown cheatsheet
- Rmarkdown cookbook
- Rmarkdown code chunks
- Rmarkdown mixing languages
Shiny
- The Shiny cheatsheet
- Guide to application layout
- The Shiny Gallery: find what you want to do and adapt it to your needs
- The official Shiny video tutorial
And as always, if you have a question, Google is your friend!

1.5 Teaser

You want to be able to produce interactive plots like these in an automatic experimental report?

You want to produce publication-quality graphs like these?

You want to be able to build graphical interfaces like this to help you in your data treatment?

Stay tuned! You’ve come to the right place.

https://www.nature.com/news/1-500-scientists-lift-the-lid-on-reproducibility-1.19970 ↩︎

# About the class ## Objectives of the class The goal of this class is that at the end, the students are able to: - Treat their data with the **free** and **open source** language **[R](https://www.r-project.org/)**, *i.e.*: + Read, browse, manipulate and plot their data + Model or simulate their data - Make automatic reporting through **[Rmarkdown](https://rmarkdown.rstudio.com/)** or **[Quarto](https://quarto.org/)** - Build a graphical interface with **[Shiny](https://shiny.rstudio.com/)** to interact with their data and output something (a value, a pdf report, a graph...) What you will learn here will be useful in **any** scientific domain. The examples in this course are however mainly coming from the type of data you might encounter in Materials Science because, well, it's what I have on hand... ## Prerequisites - **Coding skills:** none expected. - The students should **come with a laptop** with admin rights (*i.e.* you should be able to install stuff). ## Motivations ### Reproducible data treatment: why it matters Here is an introduction from the [Wikipedia page](https://en.wikipedia.org/wiki/Reproducibility#Reproducible_research) on reproducible research: > In 2016, Nature conducted a survey of 1576 researchers who took a brief online questionnaire on reproducibility in research. According to the survey, **more than 70% of researchers have tried and failed to reproduce another scientist's experiments**, and **more than half have failed to reproduce their own experiments.** [...] Although 52% of those surveyed agree there is a significant 'crisis' of reproducibility, **less than 31% think failure to reproduce published results means the result is probably wrong**, and most say they still trust the published literature.[^1] [^1]: [https://www.nature.com/news/1-500-scientists-lift-the-lid-on-reproducibility-1.19970](https://www.nature.com/news/1-500-scientists-lift-the-lid-on-reproducibility-1.19970) *Replicability* and *reproducibility* are some of the keys to **scientific integrity**. Establishing a workflow in which your data are always treated in the same manner is a **necessity**, because it is a way to: - **Minimize errors** inherent to human manipulation - **Keep track** of all the treatments you perform on your data and document your methodology: this allows others to reproduce your data, but also *yourself*. - Help you to **make sense** of all your data, and avoid disregarding some data (hence help you keep your scientific integrity) - **Gain tremendous amounts of time** ::: {.callout-important} ## Goal of this class **It is the objective of this class to provide you the tools necessary to work within this philosophy.** ::: ### Why with R and not python? The eternal question... [R](https://www.r-project.org/) was originally designed by statisticians for statisticians and it might still suffers from this "statistics only" label that sticks to it. [Python](https://www.python.org/) is a __wide spectrum__ programming language with very efficient numerical libraries used in the computer science community. R is __focused on data treatment__, statistics and representation. In R, _**the object is the data**_, and base R allows you to read, treat, fit and plot your data very easily – although you will still most certainly need additional packages. So with python, you can do everything, including treating and analyzing scientific data – with the right packages. With R, you can do less but do very well what you do, and in my opinion more seamlessly (probably because I learned and used R for years before starting with python...). In my opinion, this [xkcd comic about python environment](https://xkcd.com/1987/) is only *slightly* exaggerated... while for R, installation and maintenance is sooooo easy in comparison... Each language has his own strengths and weaknesses. To my tastes, I would say that python and R compare like that (although a pythonist would probably say the opposite): | | R | Python | |:------------------------------------:|:----------------------------------------:|:------------------------:| | Free and open source | ✔✔✔ | ✔✔✔ | | IDE | ✔✔✔ | ✔✔✔ | | Large code repository | ✔✔✔ | ✔✔✔ | | Large community | ✔✔✔ | ✔✔✔ | | Notebooks | ✔✔✔ | ✔✔✔ | | Machine Learning | ✔✔✔ | ✔✔✔ | | Performances | ✔✔ | ✔✔✔ | | Ease of installation and maintenance | ✔✔✔ | ✔ | | Data visualization | ✔✔✔ | ✔ | | Statistical analysis | ✔✔✔ | ✔ | | Multi-purpose | ✔ | ✔✔✔ | | Syntax, productivity, flexibility | ✔✔✔ | ✔✔✔ | | Rmarkdown | ✔✔✔✔✔ | ✔ | | Quarto | ✔✔✔✔✔ | ✔✔✔✔✔ | Well, it's all very subjective, really. In the end, *I still use both languages*, each one for a different purpose: - Let's say I want to produce an initial atomic configuration for a molecular dynamics simulation, or read a molecular dynamics trajectory and compute some quantities such as a pair correlation or a mean square displacement, or perform some image-based machine learning: **python** (or even _C_, if I need to treat large trajectories). - Now if I want to make sense of some experimental measurements or results of simulations, do some fits and produce publication-quality graphs or experimental reports: **R**. Both languages are great and being able to use both is the best thing that can happen to you (relatively speaking) – _especially since you can combine them in Rmarkdown using the [reticulate](https://rstudio.github.io/reticulate/) package_, which we will see later in this class. So, since my goal is to provide you with tools for seamlessly read, make sense, and plot your data in the reproducible science philosophy, let's go with __R__. Also, R has a great IDE ([Rstudio](https://rstudio.com/)) that really eases working with data and code. Such a nice IDE is still lacking for python. ## Further reading This class is indented to provide the students with the tools to handle themselves with R, Rmarkdown and Shiny, and not to provide an extensive review of everything that is possible with R. To go further: - **R** + [R manual on CRAN](https://cran.r-project.org/doc/manuals/r-release/R-intro.html) + Some [cheatsheets](https://www.rstudio.com/resources/cheatsheets/) + The tidyverse [website](https://ggplot2.tidyverse.org/) + [Tibbles](https://cran.r-project.org/web/packages/tibble/vignettes/tibble.html). + [Tidy your data](http://www.sthda.com/english/wiki/tidyr-crucial-step-reshaping-data-with-r-for-easier-analyses) + [Tips](https://drsimonj.svbtle.com/five-simple-tricks-to-improve-your-r-code) to improve your code - **Plotting** + The [R Graph Gallery](https://www.r-graph-gallery.com/) + The [R Graph Cookbook](https://r-graphics.org/index.html) + The [ggplot cheatsheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-visualization.pdf) * [Another one](http://r-statistics.co/ggplot2-cheatsheet.html) * [Another one](http://www.sthda.com/english/wiki/be-awesome-in-ggplot2-a-practical-guide-to-be-highly-effective-r-software-and-data-visualization) quite extensive * [Another one](http://tutorials.iq.harvard.edu/R/Rgraphics/Rgraphics.html) - **Rmarkdown** + Rmarkdown [complete guide](https://bookdown.org/yihui/rmarkdown/) + Rmarkdown [cheatsheet](https://www.rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf) + Rmarkdown [cookbook](https://dr-harper.github.io/rmarkdown-cookbook/index.html) + Rmarkdown [code chunks](https://bookdown.org/yihui/rmarkdown/r-code.html) + Rmarkdown mixing [languages](https://bookdown.org/yihui/rmarkdown/language-engines.html) - **Shiny** + The Shiny [cheatsheet](https://shiny.rstudio.com/images/shiny-cheatsheet.pdf) + Guide to application [layout](https://shiny.rstudio.com/articles/layout-guide.html) + The [Shiny Gallery](https://shiny.rstudio.com/gallery/): find what you want to do and adapt it to your needs + The official [Shiny video tutorial](https://shiny.rstudio.com/tutorial/) - And as always, if you have a question, [Google](https://www.google.fr/search?source=hp&ei=g0MLXeGwKNLPgweV04-IBA&q=r+how+to&oq=r+how+to) is your friend! ## Teaser ```{r setup, include=FALSE, warning = FALSE, cache=FALSE} remove(list = ls()) library(plotly) library(ggplot2) ``` **You want to be able to produce interactive plots like these in an automatic experimental report?** ```{r, echo=FALSE, warning = FALSE} d <- tibble::tibble(x = seq(-pi,pi,.01), y = sin(x)) P <- ggplot(d, aes(x,y))+geom_point() ggplotly(P, dynamicTicks = TRUE) remove(list = ls()) ``` <br> **You want to produce publication-quality graphs like these?** ```{r} #| layout-ncol: 3 #| echo: false knitr::include_graphics(c("Plots/data_RBM.png", "Plots/Grt_CS07_0.5.png", "Plots/Fig2epskT.png")) ``` <br> **You want to be able to build graphical interfaces like this to help you in your data treatment?** ```{r, echo=FALSE} knitr::include_graphics(c("Plots/shiny.png")) ``` **Stay tuned! You've come to the right place.**