optimeo.analysis

The analysis module provides tools for data analysis and regression modeling. The main workhorse is the DataAnalysis class, which allows for encoding categorical variables, performing regression analysis, and visualizing results.

It supports both linear regression with the statsmodels package and machine learning models from sklearn. The class also provides methods for plotting Q-Q plots, box plots, histograms, and scatter plots. In addition, the module provides the bootstrap_coefficients function, which uses bootstrap resampling to estimate the variability of model coefficients. The DataAnalysis class is designed to be flexible and extensible, so users can customize the regression analysis process.

You can see an example notebook [here](../examples/MLanalysis.ipynb).

class optimeo.analysis.DataAnalysis(data, factors, response, split_size=0.2, model_type=None)

Bases: object

This class is used to analyze the data and perform regression analysis.

Example

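import pandas as pd
from optimeo.analysis import DataAnalysis

# Load the dataset: the last column is the response, the others are factors
data = pd.read_csv('dataML.csv')
factors = data.columns[:-1].tolist()
response = data.columns[-1]

# Build the analysis object, choose a model type, and fit the model
analysis = DataAnalysis(data, factors, response)
analysis.model_type = "ElasticNetCV"
analysis.compute_ML_model()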

Parameters:

data (pd.DataFrame) – The input data.
factors (list) – The list of factor variables.
response (str) – The response variable.
split_size (float, optional) – The proportion of the dataset to include in the test split. Default is 0.2.
model_type (str, optional) – The type of machine learning model to use. Default is None. Must be one of the following: “ElasticNetCV”, “RidgeCV”, “LinearRegression”, “RandomForest”, “GaussianProcess”, “GradientBoosting”.

Variables:

data (pd.DataFrame) – The input data.
factors (list) – The list of factor variables.
response (str) – The response variable.
encoders (dict) – The encoders for categorical variables.
dtypes (pd.Series) – The data types of the columns.
linear_model (object) – The linear model object.
equation (str) – The equation for the linear model, in the form response ~ var1 + var2 + var1:var2.
model (object) – The machine learning model object.
model_type (str) – The type of machine learning model to use.
split_size (float) – The proportion of the dataset to include in the test split.

Methods:

encode_data() – Encode categorical variables in the input data.
plot_qq() – Plot a Q-Q plot of residuals.
plot_boxplot() – Plot the response distribution as a boxplot.
plot_histogram() – Plot the response distribution as a histogram.
plot_scatter_response() – Plot the response against the factors as scatter plots.
plot_corr() – Plot the factor correlation matrix.
plot_pairplot_seaborn() – Generate a pairplot with seaborn.
plot_pairplot_plotly() – Generate a pairplot with plotly.
write_equation(order=1, quadratic=[]) – Build the formula string for statsmodels.
compute_linear_model(order=1, quadratic=[]) – Fit a linear model with statsmodels.
plot_linear_model() – Visualize the linear model effects.
compute_ML_model(**kwargs) – Fit a machine learning model.
plot_ML_model(features_in_log=False) – Plot the ML model response surfaces.

property data: The input pandas.DataFrame.

encode_data(): Called during initialization. Encodes any categorical variables in the data, using LabelEncoder() from sklearn to convert them to numerical values, and drops rows that contain NaN values.
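The encoding step described for encode_data() can be illustrated with a minimal sketch. The helper name encode_categoricals, the example columns, and the order of operations (dropping NaN rows first, then encoding) are assumptions made for illustration; this is not the library's actual implementation.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def encode_categoricals(df):
    """Hypothetical helper: drop incomplete rows, then label-encode non-numeric columns."""
    clean = df.dropna().copy()
    encoders = {}
    for col in clean.columns:
        if not pd.api.types.is_numeric_dtype(clean[col]):
            enc = LabelEncoder()
            clean[col] = enc.fit_transform(clean[col])
            encoders[col] = enc  # kept so the mapping can be reused or inverted later
    return clean, encoders


# Made-up example data with one categorical column and two incomplete rows
raw = pd.DataFrame({
    "solvent": ["water", "ethanol", None, "water"],
    "temperature": [20.0, 30.0, 40.0, np.nan],
    "yield": [0.51, 0.62, 0.70, 0.55],
})
encoded, encoders = encode_categoricals(raw)
print(encoded)
print(list(encoders["solvent"].classes_))
```

LabelEncoder assigns integer codes in sorted order of the class labels, so keeping the fitted encoder per column is what makes it possible to apply the same mapping to new data later.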

property factors: The names of the columns of the data DataFrame that contain the factor variables.

property encoders: The dictionary of encoders for categorical variables.

property dtypes: The data types of the columns.

property response: The name of the column of the data DataFrame that contains the response variable.

property linear_model: The linear model object.

property equation

The equation for the linear model, in the form response ~ var1 + var2 + var1:var2.

See statsmodels formula examples: https://www.statsmodels.org/dev/examples/notebooks/generated/formulas.html

property model_type: The type of machine learning model to use. Default is None. Must be one of the following: “ElasticNetCV”, “RidgeCV”, “LinearRegression”, “RandomForest”, “GaussianProcess”, “GradientBoosting”.

property model: The machine learning model object.

property split_size: The proportion of the dataset to include in the test split. Default is 0.2.
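To make the meaning of split_size concrete, here is a small sketch using scikit-learn's train_test_split. That DataAnalysis uses this exact routine internally is an assumption; the point is only what a test proportion of 0.2 does to 100 samples.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 made-up samples with two features each
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# test_size plays the role of split_size: 20 % of the rows are held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(len(X_train), len(X_test))
```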

plot_qq()

Plot a Q-Q plot for the response variable.

Returns: fig – The Q-Q plot figure.
Return type: plotly.graph_objs.Figure
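For readers unfamiliar with Q-Q plots, the following sketch computes the points that such a plot displays: the sorted sample against the matching quantiles of a standard normal distribution. The helper qq_points and the plotting positions (i - 0.5) / n are illustrative choices, not necessarily what plot_qq() uses, and the sketch stops short of drawing the plotly figure.

```python
from statistics import NormalDist

import numpy as np


def qq_points(sample):
    """Hypothetical helper: return (theoretical, ordered) pairs for a normal Q-Q plot."""
    ordered = np.sort(np.asarray(sample, dtype=float))
    n = len(ordered)
    # Plotting positions (i - 0.5) / n keep the probabilities strictly inside (0, 1)
    probs = (np.arange(1, n + 1) - 0.5) / n
    theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])
    return theoretical, ordered


rng = np.random.default_rng(0)
theoretical, ordered = qq_points(rng.normal(loc=5.0, scale=2.0, size=500))

# For normally distributed data the points fall near a straight line whose
# slope and intercept approximate the standard deviation and the mean
slope, intercept = np.polyfit(theoretical, ordered, 1)
print(slope, intercept)
```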

plot_boxplot()

Plot a boxplot for the response variable.

Returns: fig – The boxplot figure.
Return type: plotly.graph_objs.Figure

plot_histogram()

Plot a histogram for the response variable.

Returns: fig – The histogram figure.
Return type: plotly.graph_objs.Figure

plot_scatter_response()

Plot a scatter plot of the response variable against the factors.

Returns: fig – The scatter plot figure.
Return type: plotly.graph_objs.Figure

plot_pairplot_seaborn()

Plot a pairplot for the data using seaborn.

Returns: fig – The pairplot figure.
Return type: seaborn.axisgrid.PairGrid

plot_pairplot_plotly()

Plot a pairplot for the data using plotly.

Returns: fig – The plotly figure.
Return type: plotly.graph_objs.Figure

plot_corr()

Plot a correlation matrix for the data.

Returns: fig – The correlation matrix figure.
Return type: seaborn.axisgrid.PairGrid
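As an illustration of the quantity a correlation-matrix plot visualizes, the sketch below computes a correlation matrix with pandas. The use of the Pearson coefficient is an assumption (the source does not state which coefficient plot_corr() uses), and the data are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],  # perfectly correlated with a
    "c": [4.0, 3.0, 2.0, 1.0],  # perfectly anti-correlated with a
})

# Symmetric matrix of pairwise correlation coefficients, with ones on the diagonal
corr = df.corr(method="pearson")
print(corr)
```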

write_equation(order=1, quadratic=[])

Write R-style equation for multivariate fitting procedure using the statsmodels package.

Parameters:

order (int, optional) – The order of the polynomial. Default is 1.
quadratic (list, optional) – The list of quadratic factors. Default is an empty list.

Returns:

The R-style equation.

Return type:

str
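A minimal sketch of how such an R-style formula string might be assembled is given below. The helper build_formula is hypothetical, and the mapping from order and quadratic to terms (pairwise interactions when order is at least 2, one squared term per quadratic factor) is an assumption for illustration; the actual rules of write_equation() may differ.

```python
def build_formula(response, factors, order=1, quadratic=()):
    """Hypothetical helper: assemble an R-style formula string."""
    terms = list(factors)  # main effects
    if order >= 2:
        # Pairwise interaction terms, written a:b as in the formula syntax
        terms += [
            f"{a}:{b}"
            for i, a in enumerate(factors)
            for b in factors[i + 1:]
        ]
    # Squared terms are wrapped in I(...) so that ** is read as arithmetic
    terms += [f"I({q}**2)" for q in quadratic]
    return f"{response} ~ " + " + ".join(terms)


print(build_formula("y", ["T", "pH"]))
# y ~ T + pH
print(build_formula("y", ["T", "pH"], order=2, quadratic=["T"]))
# y ~ T + pH + T:pH + I(T**2)
```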

compute_linear_model(order=1, quadratic=[])

Compute the linear model using the statsmodels package.

Parameters:

order (int, optional) – The order of the polynomial. Default is 1. The parameter is not used if the equation is already set.
quadratic (list, optional) – The list of quadratic factors. Default is an empty list. The parameter is not used if the equation is already set.

Returns:

The fitted linear model.

Return type:

statsmodels.regression.linear_model.RegressionResultsWrapper
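To show what a formula such as y ~ a + b + a:b means for numeric factors, the sketch below fits the equivalent model with plain NumPy least squares instead of statsmodels: an intercept column, one column per factor, and the product a * b for the interaction. The data and coefficient values are made up, and this stand-in returns only coefficients, not the statistics a statsmodels results object provides.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.uniform(0, 1, 200)
b = rng.uniform(0, 1, 200)

# Synthetic response with known coefficients and a little noise
y = 1.0 + 2.0 * a - 3.0 * b + 0.5 * a * b + rng.normal(0, 0.01, 200)

# Design matrix columns: intercept, a, b, a:b
design = np.column_stack([np.ones_like(a), a, b, a * b])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
print(coef)
```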

plot_linear_model()

Plot the linear model using plotly.

Returns: fig – The list of plotly figures.
Return type: list

compute_ML_model(**kwargs)

Compute the machine learning model.

Parameters:

kwargs (dict) – Additional keyword arguments for the model. They override the default parameters listed below.

Default parameters by model type:

ElasticNetCV
- l1_ratio (list), default=[0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1.0]. List of L1 ratios to try.
- cv (int), default=5. Cross-validation folds.
- max_iter (int), default=1000. Maximum iterations.

RidgeCV
- alphas (list), default=[0.1, 1.0, 10.0]. List of alpha values to try.
- cv (int), default=5. Cross-validation folds.

LinearRegression
- fit_intercept (bool), default=True. Whether to calculate the intercept.

RandomForest
- n_estimators (int), default=100. Number of trees in the forest.
- max_depth (int or None), default=None. Maximum depth of trees.
- min_samples_split (int), default=2. Minimum samples required to split a node.
- random_state (int), default=42. Random seed for reproducibility.

GaussianProcess
- kernel (kernel object), default=None. Kernel for the Gaussian Process.
- alpha (float), default=1e-10. Value added to diagonal of kernel matrix.
- normalize_y (bool), default=True. Normalize target values.
- random_state (int), default=42. Random seed for reproducibility.

GradientBoosting
- n_estimators (int), default=100. Number of boosting stages.
- learning_rate (float), default=0.1. Learning rate.
- max_depth (int), default=3. Maximum depth of trees.
- random_state (int), default=42. Random seed for reproducibility.

Returns:

The fitted machine learning model.

Return type:

object
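The override behavior of kwargs can be pictured as a dictionary merge in which user-supplied values win over the documented defaults. The sketch below is one plausible way to do this, not the library's code; DEFAULTS and resolve_params are hypothetical names, and the default values are copied from the parameter list above.

```python
# Documented defaults, keyed by model type
DEFAULTS = {
    "ElasticNetCV": {
        "l1_ratio": [0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1.0],
        "cv": 5,
        "max_iter": 1000,
    },
    "RidgeCV": {"alphas": [0.1, 1.0, 10.0], "cv": 5},
    "LinearRegression": {"fit_intercept": True},
    "RandomForest": {
        "n_estimators": 100,
        "max_depth": None,
        "min_samples_split": 2,
        "random_state": 42,
    },
    "GaussianProcess": {
        "kernel": None,
        "alpha": 1e-10,
        "normalize_y": True,
        "random_state": 42,
    },
    "GradientBoosting": {
        "n_estimators": 100,
        "learning_rate": 0.1,
        "max_depth": 3,
        "random_state": 42,
    },
}


def resolve_params(model_type, **kwargs):
    """Hypothetical helper: merge user kwargs over the defaults for a model type."""
    if model_type not in DEFAULTS:
        raise ValueError(f"Unknown model_type: {model_type!r}")
    # Later entries win, so kwargs override the defaults without mutating them
    return {**DEFAULTS[model_type], **kwargs}


params = resolve_params("RandomForest", n_estimators=500, max_depth=8)
print(params)
```

Building a new dictionary on each call leaves the shared defaults untouched, so one call's overrides cannot leak into the next.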

plot_ML_model(features_in_log=False)

Plot the machine learning model using plotly.

Parameters: features_in_log (bool, optional) – Whether to plot the feature importances in log scale. Default is False.
Returns: fig – The list of plotly figures.
Return type: list

predict(X=None, model='all')

Predict with the machine learning model and the linear model, if they are trained. The stored encoders are used to encode the input data. If X is not provided, the original data is used. If a model has not been trained, a warning is issued.

Parameters: X (pd.DataFrame, optional) – The input features. Default is None, which uses the original data.
Returns: The predicted values with a column indicating the model used.
Return type: pd.DataFrame
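One way to picture a result with "a column indicating the model used" is a long-format table in which the predictions of each model are stacked and labeled. The sketch below is purely illustrative: the helper stack_predictions, the column names prediction and model, and the numbers are all assumptions, not the actual output layout of predict().

```python
import pandas as pd


def stack_predictions(predictions):
    """Hypothetical helper: combine per-model predictions into one long DataFrame."""
    frames = [
        pd.DataFrame({"prediction": values, "model": name})
        for name, values in predictions.items()
    ]
    return pd.concat(frames, ignore_index=True)


# Made-up predictions from two models for the same two samples
stacked = stack_predictions({
    "linear": [1.0, 2.0],
    "ElasticNetCV": [1.1, 1.9],
})
print(stacked)
```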

optimeo.analysis.bootstrap_coefficients(mod, X, y, n_bootstrap=100, random_state=None)

Perform bootstrap resampling to estimate the variability of model coefficients.

Parameters:

mod (object) – The machine learning model.
X (pd.DataFrame) – The input features.
y (pd.Series) – The target variable.
n_bootstrap (int, optional) – The number of bootstrap samples. Default is 100.
random_state (int, optional) – The seed for the random number generator.

Returns:

results – The bootstrapped coefficients.

Return type:

np.ndarray
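The idea behind bootstrap resampling of coefficients can be sketched as follows: draw rows with replacement, refit a fresh copy of the model on each resample, and collect the fitted coefficients. This is a minimal sketch that assumes a scikit-learn style estimator exposing coef_ and an output of shape (n_bootstrap, n_features); the function name bootstrap_coefficients_sketch is hypothetical and the actual implementation may differ.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression


def bootstrap_coefficients_sketch(mod, X, y, n_bootstrap=100, random_state=None):
    """Hypothetical sketch: refit `mod` on resampled data and collect coef_."""
    rng = np.random.default_rng(random_state)
    X = np.asarray(X)
    y = np.asarray(y)
    n_samples = len(y)
    coefs = np.empty((n_bootstrap, X.shape[1]))
    for b in range(n_bootstrap):
        # Sample row indices with replacement to build one bootstrap dataset
        idx = rng.integers(0, n_samples, size=n_samples)
        fitted = clone(mod).fit(X[idx], y[idx])  # clone leaves `mod` untouched
        coefs[b] = np.ravel(fitted.coef_)
    return coefs


# Synthetic data with known coefficients 2 and -1
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

coefs = bootstrap_coefficients_sketch(
    LinearRegression(), X, y, n_bootstrap=50, random_state=1
)
print(coefs.mean(axis=0), coefs.std(axis=0))
```

The spread of each column (for example its standard deviation or percentile range) is the estimate of how variable the corresponding coefficient is.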