Data analysis using ML models with OPTIMEO

Let’s create an experimental_data(temp, cA, cB, cC) function that simulates the yield of a chemical reaction based on temperature and the concentrations of species A, B, and C.

import numpy as np
import pandas as pd
import plotly.graph_objects as go
import plotly.io as pio
pio.renderers.default = "notebook"

def experimental_data(temp, cA, cB, cC):
    """
    This function simulates experimental data based on temperature and concentrations.
    The function is not based on any real experimental data and is purely for demonstration purposes.
    """
    out = .2*temp + .5*temp*cA + (cA)/3 + (1 - cB)**2/2 + (3 - cC)/1.5 + np.random.normal(0, 0.2, len(temp))
    return out

def generate_data(N=100):
    temp = np.random.uniform(0, 100, N)
    cA = np.random.uniform(0, 1, N)
    cB = np.random.uniform(0, 1, N)
    cC = np.random.uniform(0, 1, N)
    exp_response = experimental_data(temp, cA, cB, cC)
    # Create a DataFrame with the generated data
    df = pd.DataFrame({'temp': temp, 
                       'cA': cA, 
                       'cB': cB, 
                       'cC': cC, 
                       'response': exp_response})
    return df

df = generate_data(50)
df.to_csv('dataML.csv', index=False)
df.head()
   temp       cA        cB        cC        response
0  11.251967  0.274572  0.510769  0.320083   5.665817
1  52.767722  0.149580  0.000185  0.751806  16.517494
2  70.363543  0.251513  0.184655  0.169892  25.203428
3  52.331971  0.467241  0.287215  0.154678  25.058860
4  67.584432  0.132317  0.879010  0.868066  19.396625

Now, we will use the OPTIMEO package to analyse the data.

from optimeo.analysis import *

data = pd.read_csv('dataML.csv')
factors = data.columns[:-1]
response = data.columns[-1]
analysis = DataAnalysis(data, factors, response)
analysis
DataAnalysis(data=(50, 5), factors=Index(['temp', 'cA', 'cB', 'cC'], dtype='object'), response=response, model_type=None, split_size=0.2, encoders={})

First, let’s take a look at the correlation between the variables.

analysis.plot_corr()
analysis.plot_pairplot_plotly()
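The numbers behind a correlation plot like this are, presumably, plain Pearson coefficients, which you can also inspect directly with pandas. Here is a minimal sketch; since the CSV may not be at hand, it regenerates data with the same structure as dataML.csv:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
N = 50
temp = rng.uniform(0, 100, N)
cA = rng.uniform(0, 1, N)
cB = rng.uniform(0, 1, N)
cC = rng.uniform(0, 1, N)
# Same functional form as experimental_data above
response = (.2*temp + .5*temp*cA + cA/3 + (1 - cB)**2/2
            + (3 - cC)/1.5 + rng.normal(0, 0.2, N))
data = pd.DataFrame({'temp': temp, 'cA': cA, 'cB': cB,
                     'cC': cC, 'response': response})

# Pearson correlation matrix: the numeric content of a correlation heatmap
corr = data.corr()
print(corr['response'].round(2))
```

Because temp enters both the linear and the interaction term, it dominates the variance of the response and shows the strongest correlation.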

Next, let’s fit a simple linear model:

analysis.compute_linear_model()
analysis.linear_model.summary()
OLS Regression Results
Dep. Variable: response R-squared: 0.918
Model: OLS Adj. R-squared: 0.910
Method: Least Squares F-statistic: 125.2
Date: Mon, 05 May 2025 Prob (F-statistic): 8.97e-24
Time: 13:57:16 Log-Likelihood: -140.98
No. Observations: 50 AIC: 292.0
Df Residuals: 45 BIC: 301.5
Df Model: 4
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -9.5469 2.298 -4.155 0.000 -14.174 -4.919
temp 0.4498 0.021 21.194 0.000 0.407 0.493
cA 24.8669 2.192 11.343 0.000 20.451 29.282
cB -1.6388 2.266 -0.723 0.473 -6.203 2.925
cC -2.2266 2.117 -1.052 0.299 -6.491 2.038
Omnibus: 2.186 Durbin-Watson: 2.252
Prob(Omnibus): 0.335 Jarque-Bera (JB): 1.556
Skew: -0.218 Prob(JB): 0.459
Kurtosis: 2.254 Cond. No. 281.


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
figs = analysis.plot_linear_model()
for fig in figs:
    fig.show()

The equation used for the fit is shown below; you can change it if you want, e.g. to add interaction terms or other polynomial terms:

analysis.write_equation()
'response ~ temp + cA + cB + cC '
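This is a patsy-style model formula, where `a:b` adds only the interaction term and `a*b` expands to `a + b + a:b`. The summary tables above look like statsmodels OLS output, so the following sketch (plain statsmodels, not OPTIMEO, on regenerated data) shows what such a formula fit does under that assumption:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
N = 50
df = pd.DataFrame({'temp': rng.uniform(0, 100, N),
                   'cA': rng.uniform(0, 1, N),
                   'cB': rng.uniform(0, 1, N),
                   'cC': rng.uniform(0, 1, N)})
df['response'] = (.2*df.temp + .5*df.temp*df.cA + df.cA/3
                  + (1 - df.cB)**2/2 + (3 - df.cC)/1.5
                  + rng.normal(0, 0.2, N))

# 'temp:cA' adds the interaction on top of the main effects
model = smf.ols('response ~ temp + temp:cA + cA + cB + cC', data=df).fit()
print(round(model.rsquared, 3))
```

Once the true temp:cA interaction is in the formula, R² jumps to essentially 1, exactly as in the OPTIMEO summary below.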
analysis.equation = 'response ~ temp + temp:cA + cA + cB + cC'
analysis.compute_linear_model()
analysis.linear_model.summary()
OLS Regression Results
Dep. Variable: response R-squared: 1.000
Model: OLS Adj. R-squared: 1.000
Method: Least Squares F-statistic: 4.605e+04
Date: Mon, 05 May 2025 Prob (F-statistic): 1.29e-80
Time: 13:57:17 Log-Likelihood: 10.715
No. Observations: 50 AIC: -9.430
Df Residuals: 44 BIC: 2.042
Df Model: 5
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 2.4806 0.142 17.480 0.000 2.195 2.767
temp 0.1986 0.002 94.721 0.000 0.194 0.203
temp:cA 0.5020 0.004 137.665 0.000 0.495 0.509
cA 0.3120 0.208 1.501 0.140 -0.107 0.731
cB -0.5796 0.111 -5.243 0.000 -0.802 -0.357
cC -0.6712 0.104 -6.475 0.000 -0.880 -0.462
Omnibus: 1.167 Durbin-Watson: 1.788
Prob(Omnibus): 0.558 Jarque-Bera (JB): 0.672
Skew: -0.276 Prob(JB): 0.715
Kurtosis: 3.132 Cond. No. 506.


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
figs = analysis.plot_linear_model()
for fig in figs:
    fig.show()

Now let’s build an ML model to predict the yield from the temperature and the concentrations of A, B, and C.

analysis.model_type = "ElasticNetCV"
# analysis.model_type = "RidgeCV"
# analysis.model_type = "LinearRegression"
# analysis.model_type = "RandomForest"
# analysis.model_type = "GaussianProcess"
# analysis.model_type = "GradientBoosting"
MLmodel = analysis.compute_ML_model()
figs = analysis.plot_ML_model()
for fig in figs:
    fig.show()
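The model names above (ElasticNetCV, RidgeCV, RandomForest, ...) suggest a scikit-learn backend, and the `split_size=0.2` seen in the DataAnalysis repr suggests a 20% hold-out set. Under those assumptions, here is a rough sketch of what such a fit amounts to, using plain scikit-learn on regenerated data:

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
N = 50
# Columns: temp, cA, cB, cC
X = np.column_stack([rng.uniform(0, 100, N),
                     rng.uniform(0, 1, N),
                     rng.uniform(0, 1, N),
                     rng.uniform(0, 1, N)])
y = (.2*X[:, 0] + .5*X[:, 0]*X[:, 1] + X[:, 1]/3 + (1 - X[:, 2])**2/2
     + (3 - X[:, 3])/1.5 + rng.normal(0, 0.2, N))

# Hold out 20% of the points for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
reg = ElasticNetCV(cv=5).fit(X_train, y_train)
print(f"test R^2: {reg.score(X_test, y_test):.2f}")
```

Note that a purely linear ElasticNet cannot capture the temp:cA interaction, so its test score stays below what the extended linear model above achieves.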

And if we want to make a prediction:

new_value = pd.DataFrame({'temp': [50], 
                          'cA': [0.35], 
                          'cB': [0.5], 
                          'cC': [0.5]})
analysis.predict(new_value)
prediction model
0 20.246943 ElasticNetCV
1 20.680577 Linear Model
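Since the data was simulated, we can sanity-check these predictions against the noiseless part of the experimental_data function defined at the top:

```python
def true_yield(temp, cA, cB, cC):
    # experimental_data without the Gaussian noise term
    return .2*temp + .5*temp*cA + cA/3 + (1 - cB)**2/2 + (3 - cC)/1.5

print(round(true_yield(50, 0.35, 0.5, 0.5), 3))  # 20.658
```

Both models land close to the true value of about 20.66, with the linear model (which includes the temp:cA interaction) slightly closer than ElasticNetCV.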