optimeo.analysis
The analysis module provides tools for data analysis and regression modeling. The main workhorse is the DataAnalysis class, which allows for encoding categorical variables, performing regression analysis, and visualizing results.
It supports both linear regression using the statsmodels package and machine learning models from sklearn. The class also provides methods for plotting Q-Q plots, box plots, histograms, and scatter plots. It includes functionality for bootstrap resampling to estimate the variability of model coefficients. The DataAnalysis class is designed to be flexible and extensible, allowing users to customize the regression analysis process.
You can see an example notebook [here](../examples/MLanalysis.ipynb).
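The bootstrap resampling mentioned in the overview can be illustrated independently of optimeo. The sketch below resamples rows with replacement, refits an ordinary least-squares model with NumPy, and summarizes the spread of the coefficients. The helper name `bootstrap_coefficients` and the synthetic data are assumptions made for illustration; this is not the code that DataAnalysis runs internally.

```python
import numpy as np

def bootstrap_coefficients(X, y, n_boot=200, seed=0):
    """Refit least squares on resampled rows; return an (n_boot, n_coef) array.

    Hypothetical helper for illustration only, not part of optimeo.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    # Prepend a column of ones so the first coefficient is the intercept.
    X1 = np.column_stack([np.ones(n), X])
    coefs = np.empty((n_boot, X1.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # row indices drawn with replacement
        coefs[b] = np.linalg.lstsq(X1[idx], y[idx], rcond=None)[0]
    return coefs

# Synthetic data: y = 1 + 2*x0 - 0.5*x1 + noise
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

coefs = bootstrap_coefficients(X, y)
coef_mean = coefs.mean(axis=0)  # bootstrap estimate of each coefficient
coef_std = coefs.std(axis=0)    # bootstrap estimate of its variability
```

The standard deviation across resamples gives a rough, model-agnostic measure of how stable each coefficient is.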
- class optimeo.analysis.DataAnalysis(data, factors, response, split_size=0.2, model_type=None)
Bases: object
This class is used to analyze the data and perform regression analysis.
Example

    import pandas as pd
    from optimeo.analysis import DataAnalysis

    data = pd.read_csv('dataML.csv')
    factors = data.columns[:-1].tolist()
    response = data.columns[-1]
    analysis = DataAnalysis(data, factors, response)
    analysis.model_type = "ElasticNetCV"
    analysis.compute_ML_model()

- Parameters:
data (pd.DataFrame) – The input data.
factors (list) – The list of factor variables.
response (str) – The response variable.
split_size (float, optional) – The proportion of the dataset to include in the test split. Default is 0.2.
model_type (str, optional) – The type of machine learning model to use. Default is None. Must be one of the following: “ElasticNetCV”, “RidgeCV”, “LinearRegression”, “RandomForest”, “GaussianProcess”, “GradientBoosting”.
- Variables:
data (pd.DataFrame) – The input data.
factors (list) – The list of factor variables.
response (str) – The response variable.
encoders (dict) – The encoders for categorical variables.
dtypes (pd.Series) – The data types of the columns.
linear_model (object) – The linear model object.
equation (str) – The equation for the linear model, in the form response ~ var1 + var2 + var1:var2.
model (object) – The machine learning model object.
model_type (str) – The type of machine learning model to use.
split_size (float) – The proportion of the dataset to include in the test split.
- property data
The input pandas.DataFrame.
- encode_data()
Called during initialization: encodes categorical variables in the data if there are any. Uses LabelEncoder() from sklearn to convert categorical variables to numerical values. Also drops rows with NaN values.
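The two steps described here, dropping incomplete rows and label-encoding categorical columns, can be sketched with pandas and scikit-learn. The column names, the order of the two steps, and the use of a per-column dictionary of encoders are assumptions for illustration; the actual implementation in optimeo may differ.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "solvent": ["water", "ethanol", None, "water"],
    "temperature": [20.0, 35.0, 50.0, 65.0],
    "yield": [0.41, 0.58, 0.62, 0.77],
})

# Drop rows with missing values so the encoder only sees valid categories.
clean = df.dropna().reset_index(drop=True)

# Keep one fitted encoder per categorical column (assumed layout).
encoders = {}
for column in clean.columns:
    if not pd.api.types.is_numeric_dtype(clean[column]):
        encoder = LabelEncoder()
        clean[column] = encoder.fit_transform(clean[column].tolist())
        encoders[column] = encoder

# The fitted encoder maps integer codes back to the original labels.
decoded = encoders["solvent"].inverse_transform([1, 0]).tolist()
```

LabelEncoder assigns integer codes in sorted label order, so the codes carry no physical meaning; keeping the fitted encoders makes it possible to encode new inputs consistently and to decode results.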
- property factors
The list of names of the columns of the data DataFrame that contain factor variables.
- property encoders
The dictionary of encoders for categorical variables.
- property dtypes
Get the data types of the columns.
- property response
The name of the column of the data DataFrame that contains the response variable.
- property linear_model
Get the linear model.
- property equation
The equation for the linear model, in the form response ~ var1 + var2 + var1:var2. See the statsmodels formula examples: https://www.statsmodels.org/dev/examples/notebooks/generated/formulas.html
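As an illustration of the formula syntax, the following sketch fits an equation of exactly this form with the statsmodels formula API. The synthetic data and coefficient values are assumptions; only `statsmodels.formula.api.ols` and the formula string come from the documented behavior.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data with a known interaction between var1 and var2.
rng = np.random.default_rng(0)
df = pd.DataFrame({"var1": rng.normal(size=50), "var2": rng.normal(size=50)})
df["response"] = (
    1.0
    + 2.0 * df["var1"]
    - 1.0 * df["var2"]
    + 0.5 * df["var1"] * df["var2"]
    + rng.normal(scale=0.1, size=50)
)

# "var1:var2" adds the interaction term only; main effects are listed explicitly.
equation = "response ~ var1 + var2 + var1:var2"
fit = smf.ols(equation, data=df).fit()

params = fit.params          # one entry per term, plus "Intercept"
conf_int = fit.conf_int()    # 95% confidence interval for each coefficient
```

The fitted object is a `RegressionResultsWrapper`, so `fit.summary()` prints the usual regression table.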
- property model_type
The type of machine learning model to use. Default is None. Must be one of the following: “ElasticNetCV”, “RidgeCV”, “LinearRegression”, “RandomForest”, “GaussianProcess”, “GradientBoosting”.
- property model
The machine learning model object.
- property split_size
The proportion of the dataset to include in the test split. Default is 0.2.
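To make the meaning of this proportion concrete, here is a minimal sketch using scikit-learn's `train_test_split`, assuming the split behaves like its `test_size` argument. Whether DataAnalysis calls this function internally is an assumption, as is the fixed `random_state`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)  # 20 samples, 2 factors
y = np.arange(20)

# With split_size = 0.2, 20% of the rows are held out for testing.
split_size = 0.2
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=split_size, random_state=0
)
n_train, n_test = len(X_train), len(X_test)
```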
- plot_qq()
Plot a Q-Q plot for the response variable.
- Returns:
fig – The Q-Q plot figure.
- Return type:
plotly.graph_objs.Figure
- plot_boxplot()
Plot a boxplot for the response variable.
- Returns:
fig – The boxplot figure.
- Return type:
plotly.graph_objs.Figure
- plot_histogram()
Plot a histogram for the response variable.
- Returns:
fig – The histogram figure.
- Return type:
plotly.graph_objs.Figure
- plot_scatter_response()
Plot a scatter plot for the response variable.
- Returns:
fig – The scatter plot figure.
- Return type:
plotly.graph_objs.Figure
- plot_pairplot_seaborn()
Plot a pairplot for the data using seaborn.
- Returns:
fig – The pairplot figure.
- Return type:
seaborn.axisgrid.PairGrid
- plot_pairplot_plotly()
Plot a pairplot for the data using plotly.
- Returns:
fig – The plotly figure.
- Return type:
plotly.graph_objs.Figure
- plot_corr()
Plot a correlation matrix for the data.
- Returns:
fig – The correlation matrix figure.
- Return type:
seaborn.axisgrid.PairGrid
- write_equation(order=1, quadratic=[])
Write an R-style equation for the multivariate fitting procedure using the statsmodels package.
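The exact meaning of `order` and `quadratic` is not documented here. The sketch below shows one plausible way to assemble such an R-style formula, assuming that `order` adds interaction terms among up to `order` factors and that `quadratic` lists the factors that receive a squared term. The helper `build_formula` is hypothetical and is not optimeo's implementation.

```python
from itertools import combinations

def build_formula(response, factors, order=1, quadratic=()):
    """Assemble an R-style formula string (hypothetical helper, not optimeo's code)."""
    terms = list(factors)
    # Assumption: order=k adds all interactions between 2..k distinct factors.
    for k in range(2, order + 1):
        terms += [":".join(combo) for combo in combinations(factors, k)]
    # Assumption: each name in `quadratic` contributes a squared term.
    terms += [f"I({name}**2)" for name in quadratic]
    return f"{response} ~ " + " + ".join(terms)

first_order = build_formula("y", ["temp", "conc"])
second_order = build_formula("y", ["temp", "conc"], order=2, quadratic=["temp"])
```

Under these assumptions `second_order` is `y ~ temp + conc + temp:conc + I(temp**2)`, a string that statsmodels accepts directly. The sketch uses an immutable tuple as the default for `quadratic` to avoid the shared-state pitfall of mutable default arguments.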
- compute_linear_model(order=1, quadratic=[])
Compute the linear model using the statsmodels package.
- Parameters:
- Returns:
The fitted linear model.
- Return type:
statsmodels.regression.linear_model.RegressionResultsWrapper
- plot_linear_model()
Plot the linear model using plotly.
- Returns:
fig – The list of plotly figures.
- Return type:
- compute_ML_model(**kwargs)
Compute the machine learning model.
- Parameters:
kwargs (dict) – Additional keyword arguments for the model. Overrides default parameters.
Default parameters by model type:
ElasticNetCV:
- l1_ratio (list, default=[0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1.0]) – List of L1 ratios to try.
- cv (int, default=5) – Cross-validation folds.
- max_iter (int, default=1000) – Maximum iterations.
RidgeCV:
- alphas (list, default=[0.1, 1.0, 10.0]) – List of alpha values to try.
- cv (int, default=5) – Cross-validation folds.
LinearRegression:
- fit_intercept (bool, default=True) – Whether to calculate the intercept.
RandomForest:
- n_estimators (int, default=100) – Number of trees in the forest.
- max_depth (int or None, default=None) – Maximum depth of trees.
- min_samples_split (int, default=2) – Minimum samples required to split a node.
- random_state (int, default=42) – Random seed for reproducibility.
GaussianProcess:
- kernel (kernel object, default=None) – Kernel for the Gaussian Process.
- alpha (float, default=1e-10) – Value added to the diagonal of the kernel matrix.
- normalize_y (bool, default=True) – Normalize target values.
- random_state (int, default=42) – Random seed for reproducibility.
GradientBoosting:
- n_estimators (int, default=100) – Number of boosting stages.
- learning_rate (float, default=0.1) – Learning rate.
- max_depth (int, default=3) – Maximum depth of trees.
- random_state (int, default=42) – Random seed for reproducibility.
- Returns:
The fitted machine learning model.
- Return type:
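The way keyword arguments override the defaults listed above can be sketched as a lookup table of scikit-learn estimators. The mapping from each model name to a specific scikit-learn class, and the helper `make_model`, are assumptions for illustration; only the parameter names and default values are taken from the list.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.linear_model import ElasticNetCV, LinearRegression, RidgeCV
from sklearn.model_selection import train_test_split

# Defaults copied from the documented list; the estimator classes are assumed.
MODELS = {
    "ElasticNetCV": (ElasticNetCV, dict(
        l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1.0], cv=5, max_iter=1000)),
    "RidgeCV": (RidgeCV, dict(alphas=[0.1, 1.0, 10.0], cv=5)),
    "LinearRegression": (LinearRegression, dict(fit_intercept=True)),
    "RandomForest": (RandomForestRegressor, dict(
        n_estimators=100, max_depth=None, min_samples_split=2, random_state=42)),
    "GaussianProcess": (GaussianProcessRegressor, dict(
        kernel=None, alpha=1e-10, normalize_y=True, random_state=42)),
    "GradientBoosting": (GradientBoostingRegressor, dict(
        n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)),
}

def make_model(model_type, **kwargs):
    """Build an unfitted estimator; kwargs take precedence over the defaults."""
    cls, defaults = MODELS[model_type]
    return cls(**{**defaults, **kwargs})

# Synthetic regression problem: y = 3*x0 - 2*x1 + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Override one default (n_estimators) and keep the rest.
model = make_model("RandomForest", n_estimators=20)
model.fit(X_train, y_train)
score = model.score(X_test, y_test)  # R^2 on the held-out split
```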
- predict(X=None, model='all')
Predict using the machine learning model and the linear model, if they are trained. Use the encoders to encode the data. If X is not provided, use the original data. If the model has not been trained, raise a warning.
- Parameters:
X (pd.DataFrame, optional) – The input features. Default is None, which uses the original data.
- Returns:
The predicted values with a column indicating the model used.
- Return type:
pd.DataFrame
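A result of this shape, predictions from several models stacked into one table with a column naming the model, can be sketched as follows. The column names `prediction` and `model`, the label strings, and the choice of estimators are assumptions for illustration; the actual output layout of `predict` may differ.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.ensemble import RandomForestRegressor

# Synthetic training data: y = 1 + 2*x1 - x2 + noise
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=60), "x2": rng.normal(size=60)})
df["y"] = 1.0 + 2.0 * df["x1"] - df["x2"] + rng.normal(scale=0.1, size=60)
factors = ["x1", "x2"]

# One statsmodels linear model and one scikit-learn model, both trained.
linear_model = smf.ols("y ~ x1 + x2", data=df).fit()
ml_model = RandomForestRegressor(n_estimators=50, random_state=42)
ml_model.fit(df[factors], df["y"])

X_new = pd.DataFrame({"x1": [0.0, 1.0], "x2": [0.0, -1.0]})

# Stack the predictions in long format, tagging each row with its model.
frames = []
for name, values in [
    ("linear", np.asarray(linear_model.predict(X_new))),
    ("RandomForest", ml_model.predict(X_new[factors])),
]:
    frame = X_new.copy()
    frame["prediction"] = values
    frame["model"] = name  # assumed column name
    frames.append(frame)
predictions = pd.concat(frames, ignore_index=True)
```

A long-format table like this is convenient for plotting, since the model column can be used directly as a color or facet variable.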