13 Fitting
13.1 What does it mean to “fit data”?
It might sound like a trivial question, but let’s make sure we’re all on the same page…
For the sake of the example, let’s take a look at some experimental measurements of the oscillation period T of a pendulum1. We have measured this period for various lengths L of the pendulum, and it results in such an evolution:
Now, if you recall your high school physics class, you might remember that the period of the pendulum T is given by:
\[ T \simeq 2\pi\sqrt{\frac{L}{g}}, \tag{13.1}\]
where L is the pendulum length and g the acceleration of gravity. Such and experimental design, i.e. measuring T as a function of L, is therefore a means to determining the value of g.
For this, we need to perform a fit of our experimental data by a physical model, i.e. the one of Equation 13.1. When doing so, one has to distinguish between:
- Variables: the period T and the pendulum length L are variables, they are the quantities that vary for a given parameter.
- Parameters: the gravity g is a parameter of the fit that we wish to determine, i.e. it is a constant for this set of measurements.
So now, what exactly is a fit? Fitting a model to experimental data means:
- Finding the model that best describe our data (a line, a Gaussian, a polynomial, Equation 13.1, etc.). You can try several models and compare them to determine which model is the best.
- Finding the set of parameters of the model that best describe our data. This is done through an optimization algorithm.
How do we know the parameters are the best ones to describe our data? For this, we use a tool called the residual error function, \(\chi^2\), such as:
\[ \chi^2=\sum_{i=1}^N\left(y_i(x_i)-m_i(x_i, \{p\})\right)^2, \] where \(y_i(x_i)\) are the N experimental observations of the value \(y\) as a function of \(x\), and \(m_i(x_i, \{p\})\) is the value of the model as a function of \(x\) and the ensemble of parameters \(\{p\}\). We use the squared sum of residuals and not the simple sum of residuals because the residuals can either be positive or negative, and we are interested in minimizing the total distance from our model to our experimental data.
Applied to our pendulum problem, we thus get:
\[ \chi^2=\sum_{i=1}^N\left(T_i-2\pi\sqrt{\frac{L_i}{g}}\right)^2, \] where the summation is performed on the N points that we have recorded in Figure 13.1. In this case there is only one parameter, but you will encounter other models needing more than one parameter (for example if you’re fitting a peak, you’ll need it’s position, width, and height: 3 parameters). The rule of thumb is that the less parameters you introduce the better, “less” being compared to the number of data points.
So, let’s try a few values of g and see what is the resulting residual error \(\chi^2\) in each case:
We see that there is a sweet spot in the values of g for which \(\chi^2\) is minimum:
Fitting a model to experimental data is finding this sweet spot for the ensemble of parameters through an optimization procedure.
There are numerous algorithms out there to perform this optimization, but we will not delve into their mechanics.
Now, let’s get back to R. You will mainly use two functions to perform these optimizations:
Note that, whenever it is possible, it is usually preferable to make your problem a linear one, as it avoids the hassle of providing starting values. For example, in the pendulum case above, taking the square of Equation 13.1 results in a linear evolution of \(T^2\) with respect to \(l\).
13.2 Linear fitting with lm()
Let’s learn how to do simple linear fits with R’s lm()
and plot the results. Let’s start with some fake data stored in a tibble
called d
:
We see that this data shows a linear evolution, for which we might want to extract the slope and the intercept. This is very simply done by applying the lm()
function, like so:
Here, and everywhere else in R, the operator ~
is to be understood as “as a function of”. So with this bit of code, we tell R to “fit y as a function of x using a linear model”.
Now to see the fit results, we can just display fit
, or call summary(fit)
# Summary of the fit
fit
#>
#> Call:
#> lm(formula = y ~ x, data = d)
#>
#> Coefficients:
#> (Intercept) x
#> 1.7652 0.2283
summary(fit)
#>
#> Call:
#> lm(formula = y ~ x, data = d)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.18213 -0.09685 -0.01735 0.10309 0.21678
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 1.76519 0.09858 17.91 9.70e-08 ***
#> x 0.22833 0.01589 14.37 5.37e-07 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.1443 on 8 degrees of freedom
#> Multiple R-squared: 0.9627, Adjusted R-squared: 0.9581
#> F-statistic: 206.6 on 1 and 8 DF, p-value: 5.366e-07
To actually retrieve and store the fit parameters, call coef(fit)
:
# Retrieve the coefficients and errors
coef(fit)
#> (Intercept) x
#> 1.765189 0.228335
coef(fit)[1]
#> (Intercept)
#> 1.765189
To get it properly stored in a tibble, see the broom
package that we describe later in this chapter:
# Summary of the fit
broom::tidy(fit)
#> # A tibble: 2 × 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 1.77 0.0986 17.9 0.0000000970
#> 2 x 0.228 0.0159 14.4 0.000000537
To get the standard error of the fitted parameters and the R2:
summary(fit)$coefficients
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 1.765189 0.09857938 17.90627 9.697010e-08
#> x 0.228335 0.01588751 14.37198 5.366052e-07
summary(fit)$coefficients["x", "Std. Error"]
#> [1] 0.01588751
summary(fit)$r.squared
#> [1] 0.9627133
And finally, to plot the result of the fit:
# Get the fitted paramters and make a string with it to be printed
to_print <- glue("y = {round(coef(fit)[1],2)} + x*{round(coef(fit)[2],2)}")
# Base plot
plot(d, pch = 16, main = "With base plot", sub = to_print)
abline(coef(fit), col="red")
# GGplot versions
ggplot(data=d, aes(x,y)) +
geom_point(cex=3) +
geom_abline(slope = coef(fit)[2], intercept = coef(fit)[1], col="red") +
labs(title = "With ggplot and the parameters you get from the external call to lm()",
subtitle = to_print)
ggplot(data=d, aes(x,y)) +
geom_point(cex=3) +
geom_smooth(method="lm") + # does the fit but does not allow saving the parameters
labs(title = "With ggplot and the geom_smooth() function",
subtitle = to_print)
The function geom_smooth()
will fit the data and display the fitted line, but to retrieve the actual coefficients you still need to run lm()
.
Finally, you may want to impose an intercept that will be 0 or a given value. For this, you will need to add +0
in the formula, like so:
13.3 Nonlinear Least Squares fitting
13.3.1 The nls()
workhorse
You can fit data with your own functions and constraints using nls()
. Here is an example of data we may want to fit, stored into a tibble called df
:
ggplot(data=df, aes(x,y))+
geom_point()+
ggtitle("Some fake data we want to fit with 2 Gaussians")
We first need to define a function to fit our data. We see here that it contains two peaks that look Gaussian, so let’s go with the sum of two Gaussian functions:
# Create a function to fit the data
myfunc <- function(x, y0, x1, x2, A1, A2, sd1, sd2) {
y0 + # baseline
A1 * dnorm(x, mean = x1, sd = sd1) +# 1st Gaussian
A2 * dnorm(x, mean = x2, sd = sd2) # 2nd Gaussian
}
# Fit the data using a user function
fit_nls <- nls(data=df,
y ~ myfunc(x, y0, x1, x2, A1, A2, sd1, sd2),
# Provide starting point for parameters values:
start = list(y0=0, x1=0, x2=1.5,
sd1=.2, sd2=.2,
A1=1, A2=1)
)
summary(fit_nls)
#>
#> Formula: y ~ myfunc(x, y0, x1, x2, A1, A2, sd1, sd2)
#>
#> Parameters:
#> Estimate Std. Error t value Pr(>|t|)
#> y0 -0.001327 0.003424 -0.388 0.6991
#> x1 -0.023505 0.010801 -2.176 0.0316 *
#> x2 1.985089 0.033723 58.864 <2e-16 ***
#> sd1 0.488308 0.010343 47.212 <2e-16 ***
#> sd2 0.941443 0.036434 25.840 <2e-16 ***
#> A1 1.019132 0.027038 37.693 <2e-16 ***
#> A2 0.985925 0.035406 27.846 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.02747 on 114 degrees of freedom
#>
#> Number of iterations to convergence: 8
#> Achieved convergence tolerance: 9.55e-06
And to plot the data and the result of the fit, we use predict(fit)
to retrieve the fitted y values:
13.3.2 Using constraints
In nls()
it is even possible to constraint the fitting by adding lower and upper boundaries. These boundaries are useful when you want to give some physical meaning to your parameters, for example, like forcing the width and amplitude to be positive or above a certain minimum value. However, you have to be careful with these and not provide stupid ones, e.g.:
# Constraining the upper and lower values of the fitting parameters
fit_constr <- nls(data = df,
y ~ myfunc(x, y0, x1, x2, A1, A2, sd1, sd2),
start = list(y0=0, x1=0, x2=3,
sd1=.2, sd2=.2, A1=1, A2=1),
upper = list(y0=Inf, x1=Inf, x2=Inf,
sd1=Inf, sd2=Inf, A1=2, A2=2),
lower = list(y0=-Inf, x1=-Inf, x2=2.9,
sd1=0, sd2=0, A1=0, A2=0),
algorithm = "port"
)
# Plotting the resulting function in blue
ggplot(data=df, aes(x,y))+
ggtitle("Beware of bad constraints!")+
geom_point(size=2, alpha=.5) +
geom_line(aes(y = predict(fit_constr)), color="royalblue", linewidth=1)
13.3.3 A more robust version of nls
Sometimes, nls()
will struggle to converge towards a solution, especially if you provide initial guesses that are too far from the expected values.
fit3 <- nls(data = df,
y ~ myfunc(x, y0, x1, x2, A1, A2, sd1, sd2),
start = list(y0 = 0, x1 = 1, x2 = 5,
sd1 = .2, sd2 = .2, A1 = 10, A2 = 10)
)
#> Error in numericDeriv(form[[3L]], names(ind), env, central = nDcentral): Missing value or an infinity produced when evaluating the model
In that case, you may want to use a more robust nls()
function such as nlsLM()
from the minpack.lm
package.
library(minpack.lm)
fit_nlsLM <- nlsLM(data = df,
y ~ myfunc(x, y0, x1, x2, A1, A2, sd1, sd2),
start = list(y0 = 0, x1 = 1, x2 = 5,
sd1 = .2, sd2 = .2, A1 = 10, A2 = 10)
)
summary(fit_nlsLM)
#>
#> Formula: y ~ myfunc(x, y0, x1, x2, A1, A2, sd1, sd2)
#>
#> Parameters:
#> Estimate Std. Error t value Pr(>|t|)
#> y0 -0.001327 0.003424 -0.388 0.6991
#> x1 -0.023505 0.010801 -2.176 0.0316 *
#> x2 1.985090 0.033723 58.865 <2e-16 ***
#> sd1 0.488308 0.010343 47.212 <2e-16 ***
#> sd2 0.941442 0.036434 25.840 <2e-16 ***
#> A1 1.019133 0.027038 37.693 <2e-16 ***
#> A2 0.985923 0.035406 27.846 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.02747 on 114 degrees of freedom
#>
#> Number of iterations to convergence: 19
#> Achieved convergence tolerance: 1.49e-08
Also, nlsLM()
won’t fail when the fit is exact, whereas nls()
will:
testdf <- tibble(x = seq(-10,10),
y = dnorm(x))
nls(data = testdf,
y ~ A*dnorm(x, sd=B, mean=x0) + y0,
start = list(y0=0, x0=0, A=1, B=1)
)
#> Error in nls(data = testdf, y ~ A * dnorm(x, sd = B, mean = x0) + y0, : number of iterations exceeded maximum of 50
#> Nonlinear regression model
#> model: y ~ A * dnorm(x, sd = B, mean = x0) + y0
#> data: testdf
#> y0 x0 A B
#> 0 0 1 1
#> residual sum-of-squares: 0
#>
#> Number of iterations to convergence: 1
#> Achieved convergence tolerance: 1.49e-08
13.4 The broom library
Thanks to the broom
library, it is easy to retrieve all the fit parameters in a tibble
:
#> # A tibble: 7 × 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 y0 -0.00133 0.00342 -0.388 6.99e- 1
#> 2 x1 -0.0235 0.0108 -2.18 3.16e- 2
#> 3 x2 1.99 0.0337 58.9 3.62e-87
#> 4 sd1 0.488 0.0103 47.2 1.12e-76
#> 5 sd2 0.941 0.0364 25.8 1.76e-49
#> 6 A1 1.02 0.0270 37.7 3.37e-66
#> 7 A2 0.986 0.0354 27.8 1.11e-52
# Get the fitted curve and residuals next to the original data
augment(fit_nls)
#> # A tibble: 121 × 4
#> x y .fitted .resid
#> <dbl> <dbl> <dbl> <dbl>
#> 1 -5 -0.0108 -0.00133 -0.00948
#> 2 -4.9 -0.0453 -0.00133 -0.0440
#> 3 -4.8 -0.0365 -0.00133 -0.0351
#> 4 -4.7 -0.0308 -0.00133 -0.0295
#> 5 -4.6 -0.0393 -0.00133 -0.0379
#> 6 -4.5 -0.0380 -0.00133 -0.0367
#> 7 -4.4 0.0297 -0.00133 0.0310
#> 8 -4.3 -0.0127 -0.00133 -0.0114
#> 9 -4.2 -0.0234 -0.00133 -0.0221
#> 10 -4.1 -0.00933 -0.00133 -0.00800
#> # ℹ 111 more rows
It is then easy to make a recursive fit on your data without using a for loop
, like so:
library(broom)
library(tidyverse)
library(ggplot2)
theme_set(theme_bw())
# Create fake data
a <- seq(-10,10,.1)
centers <- c(-2*pi,pi,pi/6)
widths <- runif(3, min=0.5, max=1)
amp <- runif(3, min=2, max=10)
noise <- .3*runif(length(a))-.15
d <- tibble(x=rep(a,3),
y=c(amp[1]*dnorm(a,mean=centers[1],sd=widths[1])+sample(noise),
amp[2]*dnorm(a,mean=centers[2],sd=widths[2])+sample(noise),
amp[3]*dnorm(a,mean=centers[3],sd=widths[3])+sample(noise)),
T=rep(1:3, each=length(a))
)
# Plot the data
d |> ggplot(aes(x=x, y=y, color=factor(T))) +
geom_point()
# Fit all data
d_fitted <- d |>
nest(data = -T) |>
mutate(fit = map(data, ~ nls(data = .,
y ~ y0 + A*dnorm(x, mean=x0, sd=FW),
start=list(A = max(.$y),
y0 = .01,
x0 = .$x[which.max(.$y)],
FW = .7)
)),
tidied = map(fit, tidy),
augmented = map(fit, augment)
)
d_fitted
#> # A tibble: 3 × 5
#> T data fit tidied augmented
#> <int> <list> <list> <list> <list>
#> 1 1 <tibble [201 × 2]> <nls> <tibble [4 × 5]> <tibble [201 × 4]>
#> 2 2 <tibble [201 × 2]> <nls> <tibble [4 × 5]> <tibble [201 × 4]>
#> 3 3 <tibble [201 × 2]> <nls> <tibble [4 × 5]> <tibble [201 × 4]>
In case you want to provide fit parameters that vary depending on the group you are looking at, use the notation .$column_name
, like is done here.
Then you can see the results for all your data at once:
# data and fit resulting curve
d_fitted |>
unnest(augmented)
#> # A tibble: 603 × 8
#> T data fit tidied x y .fitted .resid
#> <int> <list> <list> <list> <dbl> <dbl> <dbl> <dbl>
#> 1 1 <tibble [201 × 2]> <nls> <tibble> -10 0.0786 0.00734 0.0712
#> 2 1 <tibble [201 × 2]> <nls> <tibble> -9.9 -0.0384 0.00801 -0.0464
#> 3 1 <tibble [201 × 2]> <nls> <tibble> -9.8 -0.0902 0.00900 -0.0992
#> 4 1 <tibble [201 × 2]> <nls> <tibble> -9.7 -0.0193 0.0104 -0.0297
#> 5 1 <tibble [201 × 2]> <nls> <tibble> -9.6 0.108 0.0125 0.0950
#> 6 1 <tibble [201 × 2]> <nls> <tibble> -9.5 0.0778 0.0154 0.0624
#> 7 1 <tibble [201 × 2]> <nls> <tibble> -9.4 -0.0293 0.0194 -0.0487
#> 8 1 <tibble [201 × 2]> <nls> <tibble> -9.3 -0.113 0.0249 -0.138
#> 9 1 <tibble [201 × 2]> <nls> <tibble> -9.2 0.0403 0.0325 0.00781
#> 10 1 <tibble [201 × 2]> <nls> <tibble> -9.1 0.0865 0.0427 0.0438
#> # ℹ 593 more rows
# fit parameters
d_fitted |>
unnest(tidied)
#> # A tibble: 12 × 9
#> T data fit term estimate std.error statistic p.value augmented
#> <int> <list> <list> <chr> <dbl> <dbl> <dbl> <dbl> <list>
#> 1 1 <tibble> <nls> A 7.77 0.0734 106. 1.19e-175 <tibble>
#> 2 1 <tibble> <nls> y0 0.00603 0.00732 0.824 4.11e- 1 <tibble>
#> 3 1 <tibble> <nls> x0 -6.28 0.00888 -707. 0 <tibble>
#> 4 1 <tibble> <nls> FW 0.940 0.00936 100. 3.54e-171 <tibble>
#> 5 2 <tibble> <nls> A 5.79 0.0573 101. 1.13e-171 <tibble>
#> 6 2 <tibble> <nls> y0 -0.000687 0.00698 -0.0984 9.22e- 1 <tibble>
#> 7 2 <tibble> <nls> x0 3.15 0.00658 479. 5.62e-304 <tibble>
#> 8 2 <tibble> <nls> FW 0.631 0.00680 92.8 1.46e-164 <tibble>
#> 9 3 <tibble> <nls> A 8.54 0.0664 129. 4.81e-192 <tibble>
#> 10 3 <tibble> <nls> y0 0.00469 0.00717 0.654 5.14e- 1 <tibble>
#> 11 3 <tibble> <nls> x0 0.523 0.00638 81.9 3.82e-154 <tibble>
#> 12 3 <tibble> <nls> FW 0.801 0.00666 120. 2.51e-186 <tibble>
# fit parameters as a wide table
d_fitted |>
unnest(tidied) |>
select(T, term, estimate, std.error) |>
pivot_wider(names_from = term,
values_from = c(estimate,std.error))
#> # A tibble: 3 × 9
#> T estimate_A estimate_y0 estimate_x0 estimate_FW std.error_A std.error_y0
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 7.77 0.00603 -6.28 0.940 0.0734 0.00732
#> 2 2 5.79 -0.000687 3.15 0.631 0.0573 0.00698
#> 3 3 8.54 0.00469 0.523 0.801 0.0664 0.00717
#> # ℹ 2 more variables: std.error_x0 <dbl>, std.error_FW <dbl>
# plot fit result
d_fitted |>
unnest(augmented) |>
ggplot(aes(x=x, color=factor(T)))+
geom_point(aes(y=y), alpha=0.5, size=3) +
geom_line(aes(y=.fitted))
# plot fit parameters
d_fitted |>
unnest(tidied) |>
ggplot(aes(x=T, y=estimate, color=term))+
geom_point()+
geom_errorbar(aes(ymin=estimate-std.error,
ymax=estimate+std.error),
width=.1)+
facet_wrap(~term, scales="free_y")+
theme(legend.position = "none")
13.5 Exercises
Download the exercises and solutions from the following repositories, then create a Rstudio project from the unzipped folder:
Exercise 1
- Load exo_fit.txt in a
tibble
. - Using
lm()
ornls()
fit each column as a function ofx
and display the “experimental” data and the fit on the same graph.- Tip: Take a look at the function
dnorm()
to define a Gaussian
- Tip: Take a look at the function
Exercise 2
- Load the
tidyverse
library. - Define a function
norm01(x)
, that, given a vectorx
, returns this vector normalized to [0,1]. - Load the Raman spectrum rubis_01.txt, normalize it to [0,1] and plot it
- Define the normalized Lorentzian function
Lorentzian(x, x0, FWHM)
, defined by \(L(x, x_0, FWHM)=\frac{FWHM}{2\pi}\frac{1}{(FWHM/2)^2 + (x-x_0)^2}\) - Guess grossly the initial parameters and plot the resulting curve as a blue dashed line
- Fit the data by a sum of 2 Lorentzians using
nls()
- Add the result on the plot as a red line
- Add the 2 Lorentzian components as area-filled curves with
alpha=0.2
and two different colors
Solution
# Load the `tidyverse` library.
library(tidyverse)
# Define a function `norm01(x)`, that, given a vector `x`, returns this vector normalized to [0,1].
norm01 <- function(x) {(x-min(x))/(max(x)-min(x))}
# Load rubis_1.txt, normalize it to [0,1] and plot it
d <- read_table("Data/rubis_01.txt", col_names=c("w", "Int")) |>
mutate(Int_n = norm01(Int))
P <- d |>
ggplot(aes(x=w, y=Int_n))+
geom_point(alpha=0.5)
P
# Define the Lorentzian function
Lorentzian <- function(x, x0=0, FWHM=1){
FWHM / (2*pi) / ((FWHM/2)^2 + (x - x0)^2)
}
# Guess grossly the initial parameters and plot the resulting curve as a blue dashed line
P+geom_line(aes(y = 0.03 +
3*Lorentzian(w, x0=3160, FWHM=10) +
7*Lorentzian(w, x0=3210, FWHM=10)),
col="blue", lty=2)
# Fit the data by a sum of 2 Lorentzians using `nls`
fit <- nls(data=d,
Int_n ~ y0 + A1*Lorentzian(w,x1,FWHM1) + A2*Lorentzian(w,x2,FWHM2),
start=list(y0=0.03,
x1=3160, FWHM1=10, A1=3,
x2=3200, FWHM2=10, A2=7))
summary(fit)
#>
#> Formula: Int_n ~ y0 + A1 * Lorentzian(w, x1, FWHM1) + A2 * Lorentzian(w,
#> x2, FWHM2)
#>
#> Parameters:
#> Estimate Std. Error t value Pr(>|t|)
#> y0 9.825e-03 6.165e-04 15.94 <2e-16 ***
#> x1 3.173e+03 3.503e-02 90576.82 <2e-16 ***
#> FWHM1 1.076e+01 1.087e-01 98.96 <2e-16 ***
#> A1 1.039e+01 8.028e-02 129.38 <2e-16 ***
#> x2 3.203e+03 2.709e-02 118214.68 <2e-16 ***
#> FWHM2 1.453e+01 8.600e-02 169.00 <2e-16 ***
#> A2 2.112e+01 9.680e-02 218.19 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.01568 on 1008 degrees of freedom
#>
#> Number of iterations to convergence: 10
#> Achieved convergence tolerance: 7.652e-06
# Add the result on the plot as a red line
P + geom_line(aes(y=.03 +
3*Lorentzian(w, x0=3160, FWHM=10) +
7*Lorentzian(w, x0=3210, FWHM=10)),
col="blue", lty=2)+
geom_line(aes(y=predict(fit)), col="red")
# Add the 2 Lorentzian components as area-filled curves with `alpha=0.2` and two different colors
y0 <- coef(fit)['y0']
A1 <- coef(fit)['A1']; A2 <- coef(fit)['A2']
x1 <- coef(fit)['x1']; x2 <- coef(fit)['x2']
FW1 <- coef(fit)['FWHM1']; FW2 <- coef(fit)['FWHM2']
P + geom_line(aes(y=.03 +
3*Lorentzian(w, x0=3160, FWHM=10) +
7*Lorentzian(w, x0=3210, FWHM=10)),
col="blue", lty=2)+
geom_line(aes(y = predict(fit)), col="red")+
geom_area(aes(y = A1*Lorentzian(w, x0=x1, FWHM=FW1)),
fill="royalblue", alpha=.2)+
geom_area(aes(y = A2*Lorentzian(w, x0=x2, FWHM=FW2)),
fill="orange", alpha=.2)
Exercise 3
- Create an
Exercise_fit
folder, and create an Rstudio project linked to this folder -
Download the corresponding data files and unzip them in a folder called
Data
. - Create a new .R file in which you will write your code and save it.
- Like in the previous exercise, we will fit ruby Raman spectra, but we will do it on many files at once. First, using
list.files()
, save inflist
the list of ruby files in theData
folder. - Define the
Lorentzian(x, x0, FWHM)
function - Define the
norm01(x)
function returning the vector x normalized to [0,1] - Create a
read_ruby(filename)
function that, given a file namefilename
, reads the file into a tibble, gives the proper column names, normalizes the intensity to [0,1], and returns the tibble. - Create a
fitfunc(tib)
function that, given a tibbletib
, fits this tibble’s y values as a function of the x values using the sum of two Lorentzians, and returns thenls()
fit result. Make sure to provide “clever” starting parameters, especially for the positions of the two peaks. For example, one peak is where the spectrum maximum is, and the other one is always at an energy roughly 30 cm-1 lower. - Using pipe operations and
map()
, recursively:- Create a tibble with only one column called
file
, which contains the list of file namesflist
. - Create a column
data
containing a list of all read files, obtained by mappingread_ruby()
ontoflist
. - Create a column
fit
containing a list ofnls()
objects, obtained by mappingfitfunc()
onto the list (column)data
- Create a column
tidied
containing a list of tidiednls()
tibbles, obtained by mappingbbroom::tidy()
onto the list (column)fit
- Create a column
augmented
containing a list of augmentednls()
tibbles, obtained by mappingbbroom::augment()
onto the list (column)fit
- Turn the
file
column intorun
that will contain the run number (i.e. the number in the file name) as a numeric value. Use the functionseparate()
to do so.
- Create a tibble with only one column called
- Plot all the experimental data and the fit with black points and red lines into a faceted plot. Play with various ways of plotting this, like a ridge or slider plot.
- Plot the evolution of all fitting parameters as a function of the run number.
- Plot the evolution of the position of the higher energy peak as a function of the file name.
- Fit linearly the evolution of the higher intensity peaks with respect to the run number. Display the fitted parameters and R2. Plot the fit as a red line. Make sure to use proper weighing for the fit.
Solution
# - Using `list.files()`, save in `flist` the list of ruby files in the `Data` folder.
flist <- list.files(path = "Data", pattern = "rubis", full.names = TRUE)
# - Define the `Lorentzian(x, x0, FWHM)` function
Lorentzian <- function(x, x0 = 0, FWHM = 1) {
2 / (pi * FWHM) / (1 + ((x - x0) / (FWHM / 2))^2)
}
# - Define the `norm01(x)` function returning the vector x normalized to [0,1]
norm01 <- function(x) {
(x - min(x)) / (max(x) - min(x))
}
# - Create a `read_ruby(filename)` function that, given a file name `filename`,
# reads the file into a tibble, gives the proper column names, normalizes the
# intensity to [0,1], and returns the tibble.
read_ruby <- function(filename){
read_table(filename, col_names = c("w", "Int")) |>
mutate(Int = norm01(Int))
}
# - Create a `fitfunc(tib)` function that, given a tibble `tib`, fits this
# tibble's *y* values as a function of the *x* values using the sum of two Lorentzians,
# and returns the `nls()` fit result. Make sure to provide "clever" starting parameters,
# especially for the positions of the two peaks. For example, one peak is where
# the spectrum maximum is, and the other one is always at an energy roughly 30 cm^-1^ lower.
fitfunc <- function(tib){
nls(data=tib,
Int ~ y0 + A1*Lorentzian(w,x1,FWHM1)+
A2*Lorentzian(w,x2,FWHM2),
start=list(y0=0.03,
x1=tib$w[which.max(tib$Int)] - 30,
FWHM1=10,
A1=max(tib$Int)*10,
x2=tib$w[which.max(tib$Int)],
FWHM2=10,
A2=max(tib$Int)*10)
)
}
# Test it:
# df <- read_ruby(flist[1])
# fitfunc(df)
#
# - Using pipe operations and `map()`, recursively:
# - Create a tibble with only one column called `file`, which contains the
# list of file names `flist`.
# - Create a column `data` containing a list of all read files,
# obtained by mapping `read_ruby()` onto `flist`.
# - Create a column `fit` containing a list of `nls()` objects,
# obtained by mapping `fitfunc()` onto the list (column) `data`
# - Create a column `tidied` containing a list of tidied `nls()` tibbles,
# obtained by mapping `bbroom::tidy()` onto the list (column) `fit`
# - Create a column `augmented` containing a list of augmented `nls()` tibbles,
# obtained by mapping `bbroom::augment()` onto the list (column) `fit`
# - Turn the `file` column into `run` that will contain the run number
# (*i.e.* the number in the file name) as a numeric value.
# Use the function `separate()`{.R} to do so.
data <- tibble(file=flist) |>
mutate(data = map(file, read_ruby),
fit = map(data, fitfunc),
tidied = map(fit, tidy),
augmented = map(fit, augment)
) |>
separate(file, c(NA, NA, "run", NA), convert = TRUE)
# - Plot all the experimental data and the fit with black points and red lines
# into a faceted plot. Play with various ways of plotting this,
# like a ridge or slider plot.
data |>
unnest(augmented) |>
ggplot(aes(x = w, y = Int)) +
geom_point(alpha=0.5)+
geom_line(aes(y=.fitted), col="red")+
facet_wrap(~run)
data |>
unnest(augmented) |>
ggplot(aes(x = w, y = Int + as.numeric(factor(run)), group=run)) +
geom_point(alpha=0.5)+
geom_line(aes(y = .fitted + as.numeric(factor(run))), col = "red")
P <- data |>
unnest(augmented) |>
ggplot(aes(x = w, y = Int, frame=run)) +
geom_point(alpha = 0.5) +
geom_line(aes(y = .fitted), col = "red")
library(plotly)
ggplotly(P, dynamicTicks = TRUE) |>
animation_opts(5)|>
layout(xaxis = list(autorange=FALSE, range = c(3050, 3550)))
# - Plot the evolution of all fitting parameters as a function of the run number.
data |>
unnest(tidied) |>
ggplot(aes(x = run, y = estimate)) +
geom_point()+
facet_wrap(~term, scales = "free")+
geom_errorbar(aes(ymin = estimate - std.error,
ymax = estimate + std.error),
width=0.5)
# - Plot the evolution of the position of the higher intensity peaks.
data |>
unnest(tidied) |>
filter(term == "x1") |>
ggplot(aes(x = run, y = estimate)) +
geom_point()+
geom_errorbar(aes(ymin = estimate - std.error,
ymax = estimate + std.error),
width=0.5)
# - Fit linearly the evolution of the higher intensity peaks with respect to the
# run number. Display the fitted parameters and R^2^.Plot the fit as a red line.
# Make sure to use proper weighing for the fit.
lmfit <- data |>
unnest(tidied) |>
filter(term == "x1") |>
lm(data=_,
estimate ~ run,
weights = 1/std.error^2)
tidy(lmfit)
#> # A tibble: 2 × 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 3169. 0.557 5686. 6.39e-49
#> 2 run 6.99 0.0756 92.6 4.23e-22
glance(lmfit)
#> # A tibble: 1 × 12
#> r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.998 0.998 57.6 8566. 4.23e-22 1 -36.0 78.1 80.6
#> # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
data |>
unnest(tidied) |>
filter(term == "x1") |>
ggplot(aes(x = run, y = estimate)) +
geom_point()+
geom_errorbar(aes(ymin = estimate - std.error,
ymax = estimate + std.error),
width=0.5)+
geom_line(aes(y = predict(lmfit)), col="red")