Data analysis using ML models with OPTIMEO

Let’s create an experimental_data(temp, cA, cB, cC) function that simulates the yield of a chemical reaction based on temperature and the concentrations of species A, B, and C.

import numpy as np
import pandas as pd
import plotly.graph_objects as go
import plotly.io as pio
pio.renderers.default = "notebook"

def experimental_data(temp, cA, cB, cC):
    """
    This function simulates experimental data based on temperature and concentrations.
    The function is not based on any real experimental data and is purely for demonstration purposes.
    """
    out = .2*temp + .5*temp*cA + (cA)/3 + (1 - cB)**2/2 + (3 - cC)/1.5 + np.random.normal(0, 0.2, len(temp))
    return out

def generate_data(N=100):
    temp = np.random.uniform(0, 100, N)
    cA = np.random.uniform(0, 1, N)
    cB = np.random.uniform(0, 1, N)
    cC = np.random.uniform(0, 1, N)
    exp_response = experimental_data(temp, cA, cB, cC)
    # Create a DataFrame with the generated data
    df = pd.DataFrame({'temp': temp, 
                       'cA': cA, 
                       'cB': cB, 
                       'cC': cC, 
                       'response': exp_response})
    return df

df = generate_data(50)
df.to_csv('dataML.csv', index=False)
df.head()
   temp       cA        cB        cC        response
0  11.251967  0.274572  0.510769  0.320083   5.665817
1  52.767722  0.149580  0.000185  0.751806  16.517494
2  70.363543  0.251513  0.184655  0.169892  25.203428
3  52.331971  0.467241  0.287215  0.154678  25.058860
4  67.584432  0.132317  0.879010  0.868066  19.396625

Now, we will use the OPTIMEO package to analyse the data.

from optimeo.analysis import *

data = pd.read_csv('dataML.csv')
factors = data.columns[:-1]
response = data.columns[-1]
analysis = DataAnalysis(data, factors, response)
analysis
DataAnalysis(data=(50, 5), factors=Index(['temp', 'cA', 'cB', 'cC'], dtype='object'), response=response, model_type=None, split_size=0.2, encoders={})

First, let’s take a look at the correlation between the variables.

analysis.plot_corr()
analysis.plot_pairplot_plotly()
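The numbers behind a correlation plot like this are, presumably, plain Pearson coefficients, which you can also inspect directly with pandas. Here is a minimal sketch; since the CSV may not be at hand, it regenerates data with the same structure as dataML.csv:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
N = 50
temp = rng.uniform(0, 100, N)
cA = rng.uniform(0, 1, N)
cB = rng.uniform(0, 1, N)
cC = rng.uniform(0, 1, N)
# Same functional form as experimental_data above
response = (.2*temp + .5*temp*cA + cA/3 + (1 - cB)**2/2
            + (3 - cC)/1.5 + rng.normal(0, 0.2, N))
data = pd.DataFrame({'temp': temp, 'cA': cA, 'cB': cB,
                     'cC': cC, 'response': response})

# Pearson correlation matrix: the numeric content of a correlation heatmap
corr = data.corr()
print(corr['response'].round(2))
```

Because temp enters both the linear and the interaction term, it dominates the variance of the response and shows the strongest correlation.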

Next, let’s fit a simple linear model:

analysis.compute_linear_model()
analysis.linear_model.summary()
OLS Regression Results
Dep. Variable: response R-squared: 0.918
Model: OLS Adj. R-squared: 0.910
Method: Least Squares F-statistic: 125.2
Date: Mon, 05 May 2025 Prob (F-statistic): 8.97e-24
Time: 13:57:16 Log-Likelihood: -140.98
No. Observations: 50 AIC: 292.0
Df Residuals: 45 BIC: 301.5
Df Model: 4
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -9.5469 2.298 -4.155 0.000 -14.174 -4.919
temp 0.4498 0.021 21.194 0.000 0.407 0.493
cA 24.8669 2.192 11.343 0.000 20.451 29.282
cB -1.6388 2.266 -0.723 0.473 -6.203 2.925
cC -2.2266 2.117 -1.052 0.299 -6.491 2.038
Omnibus: 2.186 Durbin-Watson: 2.252
Prob(Omnibus): 0.335 Jarque-Bera (JB): 1.556
Skew: -0.218 Prob(JB): 0.459
Kurtosis: 2.254 Cond. No. 281.


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
figs = analysis.plot_linear_model()
for fig in figs:
    fig.show()

The equation used for the fit is shown below; you can change it if you want, e.g. to add interaction terms or other polynomial terms:

analysis.write_equation()
'response ~ temp + cA + cB + cC '
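This is a patsy-style model formula, where `a:b` adds only the interaction term and `a*b` expands to `a + b + a:b`. The summary tables above look like statsmodels OLS output, so the following sketch (plain statsmodels, not OPTIMEO, on regenerated data) shows what such a formula fit does under that assumption:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
N = 50
df = pd.DataFrame({'temp': rng.uniform(0, 100, N),
                   'cA': rng.uniform(0, 1, N),
                   'cB': rng.uniform(0, 1, N),
                   'cC': rng.uniform(0, 1, N)})
df['response'] = (.2*df.temp + .5*df.temp*df.cA + df.cA/3
                  + (1 - df.cB)**2/2 + (3 - df.cC)/1.5
                  + rng.normal(0, 0.2, N))

# 'temp:cA' adds the interaction on top of the main effects
model = smf.ols('response ~ temp + temp:cA + cA + cB + cC', data=df).fit()
print(round(model.rsquared, 3))
```

Once the true temp:cA interaction is in the formula, R² jumps to essentially 1, exactly as in the OPTIMEO summary below.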
analysis.equation = 'response ~ temp + temp:cA + cA + cB + cC'
analysis.compute_linear_model()
analysis.linear_model.summary()
OLS Regression Results
Dep. Variable: response R-squared: 1.000
Model: OLS Adj. R-squared: 1.000
Method: Least Squares F-statistic: 4.605e+04
Date: Mon, 05 May 2025 Prob (F-statistic): 1.29e-80
Time: 13:57:17 Log-Likelihood: 10.715
No. Observations: 50 AIC: -9.430
Df Residuals: 44 BIC: 2.042
Df Model: 5
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 2.4806 0.142 17.480 0.000 2.195 2.767
temp 0.1986 0.002 94.721 0.000 0.194 0.203
temp:cA 0.5020 0.004 137.665 0.000 0.495 0.509
cA 0.3120 0.208 1.501 0.140 -0.107 0.731
cB -0.5796 0.111 -5.243 0.000 -0.802 -0.357
cC -0.6712 0.104 -6.475 0.000 -0.880 -0.462
Omnibus: 1.167 Durbin-Watson: 1.788
Prob(Omnibus): 0.558 Jarque-Bera (JB): 0.672
Skew: -0.276 Prob(JB): 0.715
Kurtosis: 3.132 Cond. No. 506.


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
figs = analysis.plot_linear_model()
for fig in figs:
    fig.show()

Now let’s build an ML model to predict the yield from the temperature and the concentrations of A, B, and C.

analysis.model_type = "ElasticNetCV"
# analysis.model_type = "RidgeCV"
# analysis.model_type = "LinearRegression"
# analysis.model_type = "RandomForest"
# analysis.model_type = "GaussianProcess"
# analysis.model_type = "GradientBoosting"
MLmodel = analysis.compute_ML_model()
figs = analysis.plot_ML_model()
for fig in figs:
    fig.show()
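The model names above (ElasticNetCV, RidgeCV, RandomForest, ...) suggest a scikit-learn backend, and the `split_size=0.2` seen in the DataAnalysis repr suggests a 20% hold-out set. Under those assumptions, here is a rough sketch of what such a fit amounts to, using plain scikit-learn on regenerated data:

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
N = 50
# Columns: temp, cA, cB, cC
X = np.column_stack([rng.uniform(0, 100, N),
                     rng.uniform(0, 1, N),
                     rng.uniform(0, 1, N),
                     rng.uniform(0, 1, N)])
y = (.2*X[:, 0] + .5*X[:, 0]*X[:, 1] + X[:, 1]/3 + (1 - X[:, 2])**2/2
     + (3 - X[:, 3])/1.5 + rng.normal(0, 0.2, N))

# Hold out 20% of the points for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
reg = ElasticNetCV(cv=5).fit(X_train, y_train)
print(f"test R^2: {reg.score(X_test, y_test):.2f}")
```

Note that a purely linear ElasticNet cannot capture the temp:cA interaction, so its test score stays below what the extended linear model above achieves.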

And if we want to make a prediction:

new_value = pd.DataFrame({'temp': [50], 
                          'cA': [0.35], 
                          'cB': [0.5], 
                          'cC': [0.5]})
analysis.predict(new_value)
prediction model
0 20.246943 ElasticNetCV
1 20.680577 Linear Model
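Since the data was simulated, we can sanity-check these predictions against the noiseless part of the experimental_data function defined at the top:

```python
def true_yield(temp, cA, cB, cC):
    # experimental_data without the Gaussian noise term
    return .2*temp + .5*temp*cA + cA/3 + (1 - cB)**2/2 + (3 - cC)/1.5

print(round(true_yield(50, 0.35, 0.5, 0.5), 3))  # 20.658
```

Both models land close to the true value of about 20.66, with the linear model (which includes the temp:cA interaction) slightly closer than ElasticNetCV.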