optimeo.analysis

The analysis module provides tools for data analysis and regression modeling. The main workhorse is the DataAnalysis class, which allows for encoding categorical variables, performing regression analysis, and visualizing results.

It supports both linear regression using the statsmodels package and machine learning models from sklearn. The class also provides methods for plotting Q-Q plots, box plots, histograms, and scatter plots. It includes functionality for bootstrap resampling to estimate the variability of model coefficients. The DataAnalysis class is designed to be flexible and extensible, allowing users to customize the regression analysis process.

You can see an example notebook here.

   1# Copyright (c) 2025 Colin BOUSIGE
   2# Contact: colin.bousige@cnrs.fr
   3#
   4# This program is free software: you can redistribute it and/or modify
   5# it under the terms of the MIT License as published by
   6# the Free Software Foundation, either version 3 of the License, or
   7# any later version. 
   8
   9"""
  10The analysis module provides tools for data analysis and regression modeling.
  11The main workhorse is the `DataAnalysis` class, which allows for encoding categorical variables, performing regression analysis, and visualizing results.
  12
  13It supports both linear regression using the `statsmodels` package and machine learning models from `sklearn`.
  14The class also provides methods for plotting Q-Q plots, box plots, histograms, and scatter plots.
  15It includes functionality for bootstrap resampling to estimate the variability of model coefficients.
  16The `DataAnalysis` class is designed to be flexible and extensible, allowing users to customize the regression analysis process.
  17
  18You can see an example notebook [here](../examples/MLanalysis.html).
  19
  20"""
  21
  22
  23import numpy as np
  24import pandas as pd
  25import statsmodels.formula.api as smf
  26from scipy import stats
  27from scipy.stats import t
  28import plotly.express as px
  29import plotly.graph_objects as go
  30from plotly.subplots import make_subplots
  31from statsmodels.graphics.gofplots import qqplot
  32# import ML models and scaling functions
  33from sklearn.preprocessing import StandardScaler
  34from sklearn.linear_model import LinearRegression, RidgeCV, ElasticNetCV
  35from sklearn.gaussian_process import GaussianProcessRegressor
  36from sklearn.ensemble import RandomForestRegressor
  37from sklearn.ensemble import GradientBoostingRegressor
  38from sklearn.model_selection import train_test_split
  39from sklearn.metrics import root_mean_squared_error, r2_score
  40from sklearn.preprocessing import LabelEncoder
  41from sklearn.pipeline import make_pipeline
  42import seaborn as sns
  43
  44
  45class DataAnalysis:
  46    """
  47    This class is used to analyze the data and perform regression analysis.
  48
  49    Parameters
  50    ----------
  51
  52    data : pd.DataFrame
  53        The input data.
  54    factors : list
  55        The list of factor variables.
  56    response : str
  57        The response variable.
  58    split_size : float, optional
  59        The proportion of the dataset to include in the test split. Default is `0.2`.
  60    model_type : str, optional
  61        The type of machine learning model to use. Default is None. 
  62        Must be one of the following: `"ElasticNetCV"`, `"RidgeCV"`,
  63        `"LinearRegression"`, `"RandomForest"`, `"GaussianProcess"`, `"GradientBoosting"`.
  64    
  65    Attributes
  66    ----------
  67    
  68    data : pd.DataFrame
  69        The input data.
  70    factors : list
  71        The list of factor variables.
  72    response : str
  73        The response variable.
  74    encoders : dict
  75        The encoders for categorical variables.
  76    dtypes : pd.Series  
  77        The data types of the columns.
  78    linear_model : object
  79        The linear model object.
  80    equation : str
  81        The equation for the linear model, in the form `response ~ var1 + var2 + var1:var2`.
  82    model : object
  83        The machine learning model object.
  84    model_type : str
  85        The type of machine learning model to use.
  86    split_size : float
  87        The proportion of the dataset to include in the test split.
  88
  89    Methods
  90    -------
  91    
  92    - **encode_data()**:
  93        Encodes categorical variables in the data. Called during initialization.
  94    - **plot_qq()**:
  95        Plots a Q-Q plot for the response variable using `plotly`.
  96    - **plot_boxplot()**:
  97        Plots a boxplot for the response variable using `plotly`.
  98    - **plot_histogram()**:
  99        Plots a histogram for the response variable using `plotly`.
 100    - **plot_scatter_response()**:
 101        Plots a scatter plot for the response variable using `plotly`.
 102    - **plot_corr()**:
 103        Plots a correlation matrix for the data using `plotly`.
 104    - **plot_pairplot_seaborn()**:
 105        Plots a pairplot for the data using `seaborn`.
 106    - **plot_pairplot_plotly()**:
 107        Plots a pairplot for the data using `plotly`.
 108    - **write_equation(order=1, quadratic=[])**:
 109        Writes R-style equation for multivariate fitting procedure using the `statsmodels` package.
 110    - **compute_linear_model(order=1, quadratic=[])**:
 111        Computes the linear model using the `statsmodels` package.
 112    - **plot_linear_model()**:
 113        Plots the linear model using `plotly`.
 114    - <b>compute_ML_model(**kwargs)</b>:
 115        Computes the machine learning model using `sklearn`.
 116    - **plot_ML_model(features_in_log=False)**:
 117        Plots the machine learning model using `plotly`.
 118    
 119    Example
 120    -------
 121    
 122    ```python
 123    from optimeo.analysis import * 
 124
 125    data = pd.read_csv('dataML.csv')
 126    factors = data.columns[:-1]
 127    response = data.columns[-1]
 128    analysis = DataAnalysis(data, factors, response)
 129    analysis.model_type = "ElasticNetCV"
 130    MLmodel = analysis.compute_ML_model()
 131    figs = analysis.plot_ML_model()
 132    for fig in figs:
 133        fig.show()
 134    ```
 135    """
 136
 137    def __init__(self, 
 138                 data: pd.DataFrame, 
 139                 factors: list, 
 140                 response: str, 
 141                 split_size=.2, 
 142                 model_type=None):
 143        self._dtypes = None
 144        self._encoders = {}
 145        self._linear_model = None
 146        self._model = None
 147        self._equation = ''
 148        self._data = data
 149        self._factors = factors
 150        self._response = response
 151        self._model_type = model_type
 152        self._split_size = split_size
 153        self.encode_data()
 154
 155    @property
 156    def data(self):
 157        """The input `pandas.DataFrame`."""
 158        return self._data
 159
 160    @data.setter
 161    def data(self, value):
 162        if not isinstance(value, pd.DataFrame):
 163            raise ValueError("Data must be a pandas DataFrame.")
 164        self._data = value
 165    
 166    def encode_data(self):
 167        """
 168        Called during initialization: encodes categorical variables in the data if there are any. 
 169        Uses `LabelEncoder()` from `sklearn` to convert categorical variables to numerical values.
 170        Also drops rows with NaN values.
 171        """
 172        self._dtypes = self._data.dtypes
 173        for factor in self._factors:
 174            if self._dtypes[factor] == 'object':
 175                le = LabelEncoder()
 176                self._encoders[factor] = le
 177                self._data[factor] = le.fit_transform([str(d) for d in self._data[factor]])
 178        self.data = self.data[self._factors + [self._response]]
 179        self.data = self.data.dropna(axis=0, how='any')
 180
 181    @property
 182    def factors(self):
 183        """The list of names of the columns of the `data` DataFrame that contain factor variables."""
 184        return self._factors
 185
 186    @factors.setter
 187    def factors(self, value):
 188        if not isinstance(value, list):
 189            raise ValueError("Factors must be a list.")
 190        self._factors = value
 191    
 192    @property
 193    def encoders(self):
 194        """The list of encoders for categorical variables."""
 195        return self._encoders
 196    
 197    @encoders.setter
 198    def encoders(self, value):
 199        if not isinstance(value, dict):
 200            raise ValueError("Encoders must be a dictionary.")
 201        self._encoders = value
 202
 203    @property
 204    def dtypes(self):
 205        """Get the data types of the columns."""
 206        return self._dtypes
 207
 208    @dtypes.setter
 209    def dtypes(self, value):
 210        self._dtypes = value
 211
 212    @property
 213    def response(self):
 214        """The name of the column of the `data` DataFrame that contain the response variable."""
 215        return self._response
 216
 217    @response.setter
 218    def response(self, value):
 219        if not isinstance(value, str):
 220            raise ValueError("Response must be a string.")
 221        self._response = value
 222
 223    @property
 224    def linear_model(self):
 225        """Get the linear model."""
 226        return self._linear_model
 227
 228    @linear_model.setter
 229    def linear_model(self, value):
 230        self._linear_model = value
 231
 232    @property
 233    def equation(self):
 234        """The equation for the linear model, in the form `response ~ var1 + var2 + var1:var2`. This is based on the [statsmodels package](https://www.statsmodels.org/dev/examples/notebooks/generated/formulas.html)."""
 235        return self._equation
 236
 237    @equation.setter
 238    def equation(self, value):
 239        if not isinstance(value, str):
 240            raise ValueError("Equation must be a string.")
 241        self._equation = value
 242
 243    @property
 244    def model_type(self):
 245        """The type of machine learning model to use. Default is None. 
 246            Must be one of the following: `"ElasticNetCV"`, `"RidgeCV"`,
 247            `"LinearRegression"`, `"RandomForest"`, `"GaussianProcess"`, `"GradientBoosting"`."""
 248        return self._model_type
 249
 250    @model_type.setter
 251    def model_type(self, value):
 252        if value is not None and value not in ["ElasticNetCV", "RidgeCV",
 253                                                "LinearRegression", "RandomForest",
 254                                                "GaussianProcess", "GradientBoosting"]:
 255            raise ValueError("Model must be one of the following: "
 256                             "ElasticNetCV, RidgeCV, LinearRegression, "
 257                             "RandomForest, GaussianProcess, GradientBoosting.")
 258        self._model_type = value
 259
 260    @property
 261    def model(self):
 262        """The machine learning model object."""
 263        return self._model
 264
 265    @model.setter
 266    def model(self, value):
 267        self._model = value
 268
 269    @property
 270    def split_size(self):
 271        """The proportion of the dataset to include in the test split. Default is `0.2`."""
 272        return self._split_size
 273
 274    @split_size.setter
 275    def split_size(self, value):
 276        if not isinstance(value, (int, float)):
 277            raise ValueError("Split size must be a number.")
 278        self._split_size = value
 279
 280    def __str__(self):
 281        """Return a string representation of the DataAnalysis object."""
 282        return (f"DataAnalysis(data={self._data.shape}, "
 283                f"factors={self._factors}, "
 284                f"response={self._response}, "
 285                f"model_type={self._model_type}, "
 286                f"split_size={self._split_size}, "
 287                f"encoders={self._encoders})"
 288                )
 289
 290    def __repr__(self):
 291        """Return a string representation of the DataAnalysis object."""
 292        return self.__str__()
 293
 294    def plot_qq(self):
 295        """
 296        Plot a Q-Q plot for the response variable.
 297
 298        Returns
 299        -------
 300        fig : plotly.graph_objs.Figure
 301            The Q-Q plot figure.
 302        """
 303        qqplot_data = pd.Series(self._data[self._response], copy=True)
 304        qqplot_data = qqplot(qqplot_data, line='s').gca().lines
 305        fig = go.Figure()
 306
 307        fig.add_trace({
 308            'type': 'scatter',
 309            'x': qqplot_data[0].get_xdata(),
 310            'y': qqplot_data[0].get_ydata(),
 311            'mode': 'markers',
 312        })
 313
 314        fig.add_trace({
 315            'type': 'scatter',
 316            'x': qqplot_data[1].get_xdata(),
 317            'y': qqplot_data[1].get_ydata(),
 318            'mode': 'lines',
 319            'line': {
 320                'color': 'red'
 321            }
 322        })
 323
 324        fig.update_layout(
 325            title='Q-Q Plot',
 326            xaxis_title='Theoretical Quantiles',
 327            yaxis_title='Sample Quantiles',
 328            plot_bgcolor="white",  # White background
 329            showlegend=False,
 330            margin=dict(l=10, r=10, t=50, b=50),
 331            xaxis=dict(
 332                showgrid=True,  # Enable grid
 333                gridcolor="lightgray",  # Light gray grid lines
 334                zeroline=False,
 335                zerolinecolor="black",  # Black zero line
 336                showline=True,
 337                linewidth=1,
 338                linecolor="black",  # Black border
 339                mirror=True
 340            ),
 341            yaxis=dict(
 342                showgrid=True,  # Enable grid
 343                gridcolor="lightgray",  # Light gray grid lines
 344                zeroline=False,
 345                zerolinecolor="black",  # Black zero line
 346                showline=True,
 347                linewidth=1,
 348                linecolor="black",  # Black border
 349                mirror=True
 350            ),
 351            height=300,
 352        )
 353        return fig
 354
 355    def plot_boxplot(self):
 356        """
 357        Plot a boxplot for the response variable.
 358
 359        Returns
 360        -------
 361        fig : plotly.graph_objs.Figure
 362            The boxplot figure.
 363        """
 364        fig = px.box(self._data, y=self._response, points="all", title='Box Plot')
 365        fig.update_layout(
 366            plot_bgcolor="white",  # White background
 367            legend=dict(bgcolor='rgba(0,0,0,0)'),
 368            margin=dict(l=10, r=10, t=50, b=50),
 369            xaxis=dict(
 370                showgrid=True,  # Enable grid
 371                gridcolor="lightgray",  # Light gray grid lines
 372                zeroline=False,
 373                zerolinecolor="black",  # Black zero line
 374                showline=True,
 375                linewidth=1,
 376                linecolor="black",  # Black border
 377                mirror=True
 378            ),
 379            yaxis=dict(
 380                showgrid=True,  # Enable grid
 381                gridcolor="lightgray",  # Light gray grid lines
 382                zeroline=False,
 383                zerolinecolor="black",  # Black zero line
 384                showline=True,
 385                linewidth=1,
 386                linecolor="black",  # Black border
 387                mirror=True
 388            ), height=300,
 389        )
 390        return fig
 391
 392    def plot_histogram(self):
 393        """
 394        Plot a histogram for the response variable.
 395
 396        Returns
 397        -------
 398        fig : plotly.graph_objs.Figure
 399            The histogram figure.
 400        """
 401        fig = px.histogram(self._data, x=self._response, title='Histogram')
 402        fig.update_layout(
 403            plot_bgcolor="white",  # White background
 404            legend=dict(bgcolor='rgba(0,0,0,0)'),
 405            margin=dict(l=10, r=10, t=50, b=50),
 406            xaxis=dict(
 407                showgrid=True,  # Enable grid
 408                gridcolor="lightgray",  # Light gray grid lines
 409                zeroline=False,
 410                zerolinecolor="black",  # Black zero line
 411                showline=True,
 412                linewidth=1,
 413                linecolor="black",  # Black border
 414                mirror=True
 415            ),
 416            yaxis=dict(
 417                showgrid=True,  # Enable grid
 418                gridcolor="lightgray",  # Light gray grid lines
 419                zeroline=False,
 420                zerolinecolor="black",  # Black zero line
 421                showline=True,
 422                linewidth=1,
 423                linecolor="black",  # Black border
 424                mirror=True
 425            ), height=300,
 426        )
 427        return fig
 428
 429    def plot_scatter_response(self):
 430        """
 431        Plot a scatter plot for the response variable.
 432
 433        Returns
 434        -------
 435        fig : plotly.graph_objs.Figure
 436            The scatter plot figure.
 437        """
 438        fig = px.scatter(x=np.arange(1, len(self._data[self._response]) + 1),
 439                         y=self._data[self._response],
 440                         labels={'x': 'Measurement number', 'y': self._response},
 441                         title='Scatter Plot')
 442        fig.update_layout(
 443            plot_bgcolor="white",  # White background
 444            legend=dict(bgcolor='rgba(0,0,0,0)'),
 445            margin=dict(l=10, r=10, t=50, b=50),
 446            xaxis=dict(
 447                showgrid=True,  # Enable grid
 448                gridcolor="lightgray",  # Light gray grid lines
 449                zeroline=False,
 450                zerolinecolor="black",  # Black zero line
 451                showline=True,
 452                linewidth=1,
 453                linecolor="black",  # Black border
 454                mirror=True
 455            ),
 456            yaxis=dict(
 457                showgrid=True,  # Enable grid
 458                gridcolor="lightgray",  # Light gray grid lines
 459                zeroline=False,
 460                zerolinecolor="black",  # Black zero line
 461                showline=True,
 462                linewidth=1,
 463                linecolor="black",  # Black border
 464                mirror=True
 465            ), height=300,
 466        )
 467        return fig
 468
 469    def plot_pairplot_seaborn(self):
 470        """
 471        Plot a pairplot for the data using seaborn.
 472
 473        Returns
 474        -------
 475        fig : seaborn.axisgrid.PairGrid
 476            The pairplot figure.
 477        """
 478        fig = sns.pairplot(
 479            self._data,
 480            kind="reg",
 481            diag_kind="kde",
 482            plot_kws={"scatter_kws": {"alpha": 0.1}},
 483        )
 484        return fig
 485    
 486    def plot_pairplot_plotly(self):
 487        """
 488        Plot a pairplot for the data using plotly.
 489
 490        Returns
 491        -------
 492        fig : plotly.graph_objs.Figure
 493            The plotly figure.
 494        """
 495        
 496        # Get the column names
 497        columns = self._data.columns
 498        n_cols = len(columns)
 499
 500        # Create subplot grid with partially shared axes
 501        fig = make_subplots(
 502            rows=n_cols,
 503            cols=n_cols,
 504            shared_xaxes=True,  
 505            shared_yaxes=True,  
 506            vertical_spacing=0.05,
 507            horizontal_spacing=0.05
 508        )
 509
 510        # Add scatter plots and regression lines to each subplot
 511        for i in range(n_cols):
 512            for j in range(n_cols):
 513                x_data = self._data[columns[j]]
 514                y_data = self._data[columns[i]]
 515
 516                if i == j:  # Diagonal: Add KDE plot
 517                    # Calculate KDE
 518                    data = self._data[columns[i]].dropna()
 519                    if len(data) > 1:  # Ensure enough data points
 520                        kde_x = np.linspace(data.min(), data.max(), 100)
 521                        kde = stats.gaussian_kde(data)
 522                        kde_y = kde(kde_x)
 523                        # Scale to match y-axis (needed for top left corner)
 524                        kde_y = kde_y / kde_y.max() * np.max(y_data.dropna())  
 525
 526                        # Add KDE plot to diagonal
 527                        fig.add_trace(
 528                            go.Scatter(
 529                                x=kde_x,
 530                                y=kde_y,
 531                                mode='lines',
 532                                fill='tozeroy',
 533                                line=dict(color='rgba(29, 81, 189, 1)'),
 534                                showlegend=False
 535                            ),
 536                            row=i+1, col=j+1
 537                        )
 538                        # Ensure the y-axis is independent for diagonal plots
 539                        # don't know why, doesn't work with top left corner
 540                        fig.update_yaxes(
 541                            matches=None,  # Ensure independent y-axis
 542                            showticklabels=True,  # Show y-axis values
 543                            row=i+1,
 544                            col=j+1
 545                        )
 546                else:  # Off-diagonal: Add scatter plot with regression line
 547                    # Add scatter plot
 548                    fig.add_trace(
 549                        go.Scatter(
 550                            x=x_data,
 551                            y=y_data,
 552                            mode='markers',
 553                            marker=dict(
 554                                color='rgba(29, 81, 189, 0.5)',
 555                                size=5
 556                            ),
 557                            showlegend=False
 558                        ),
 559                        row=i+1, col=j+1
 560                    )
 561
 562                    # Calculate and add regression line
 563                    if len(x_data.dropna()) > 1 and len(y_data.dropna()) > 1:
 564                        # Drop NaN values for regression calculation
 565                        valid_indices = x_data.notna() & y_data.notna()
 566                        x_clean = x_data[valid_indices]
 567                        y_clean = y_data[valid_indices]
 568                        ytype = y_clean.dtype
 569                        xtype = x_clean.dtype
 570                        unique_x = x_clean.unique()
 571                        if len(x_clean) > 1 and len(unique_x)>1 and ytype != 'object' and xtype != 'object':
 572                            # Calculate regression parameters
 573                            slope, intercept, r_value, p_value, std_err = stats.linregress(x_clean, y_clean)
 574                            x_range = np.linspace(x_clean.min(), x_clean.max(), 100)
 575                            y_pred = slope * x_range + intercept
 576
 577                            # Standard error of estimate
 578                            y_fit = slope * x_clean + intercept
 579                            residuals = y_clean - y_fit
 580                            dof = len(x_clean) - 2
 581                            residual_std_error = np.sqrt(np.sum(residuals**2) / dof)
 582
 583                            mean_x = np.mean(x_clean)
 584                            t_val = t.ppf(0.975, dof)  # 95% confidence
 585
 586                            se_line = residual_std_error * np.sqrt(1/len(x_clean) + (x_range - mean_x)**2 / np.sum((x_clean - mean_x)**2))
 587                            y_upper = y_pred + t_val * se_line
 588                            y_lower = y_pred - t_val * se_line
 589
 590                            # Add regression line
 591                            fig.add_trace(
 592                                go.Scatter(
 593                                    x=x_range,
 594                                    y=y_pred,
 595                                    mode='lines',
 596                                    line=dict(color='red', width=2),
 597                                    showlegend=False
 598                                ),
 599                                row=i+1, col=j+1
 600                            )
 601
 602                            # Add confidence interval area
 603                            fig.add_trace(
 604                                go.Scatter(
 605                                    x=np.concatenate([x_range, x_range[::-1]]),
 606                                    y=np.concatenate([y_upper, y_lower[::-1]]),
 607                                    fill='toself',
 608                                    fillcolor='rgba(255, 0, 0, 0.2)',
 609                                    line=dict(color='rgba(255, 0, 0, 0)'),
 610                                    showlegend=False
 611                                ),
 612                                row=i+1, col=j+1
 613                            )
 614
 615        # Update layout and axis properties
 616        fig.update_layout(
 617            margin=dict(l=10, r=10, t=50, b=50),
 618            title="Pair Plot",
 619            height=600,
 620            width=600,
 621            showlegend=False,
 622            plot_bgcolor="white"
 623        )
 624
 625        # Update all axes properties first (hiding tick labels by default)
 626        fig.update_xaxes(
 627            showgrid=True,
 628            gridcolor="lightgray",
 629            zeroline=False,
 630            showline=True,
 631            linewidth=1,
 632            linecolor="black",
 633            mirror=True,
 634            showticklabels=False
 635        )
 636
 637        fig.update_yaxes(
 638            showgrid=True,
 639            gridcolor="lightgray",
 640            zeroline=False,
 641            showline=True,
 642            linewidth=1,
 643            linecolor="black",
 644            mirror=True,
 645            showticklabels=False
 646        )
 647
 648        # Show tick labels and titles only for bottom row and leftmost column
 649        for i, col_name in enumerate(columns):
 650            # Bottom row: show x-axis titles and tick labels
 651            fig.update_xaxes(
 652                title_text=col_name,
 653                showticklabels=True,
 654                row=n_cols,
 655                col=i+1
 656            )
 657
 658            # Leftmost column: show y-axis titles and tick labels
 659            fig.update_yaxes(
 660                title_text=col_name,
 661                showticklabels=True,
 662                row=i+1,
 663                col=1
 664            )
 665            
 666        return fig
 667    
 668    def plot_corr(self):
 669        """
 670        Plot a correlation matrix for the data.
 671
 672        Returns
 673        -------
 674        fig : seaborn.axisgrid.PairGrid
 675            The pairplot figure.
 676        """
 677        corr_matrix = self._data.corr(method='pearson')
 678        mask = np.triu(np.ones_like(corr_matrix, dtype=bool), k=1)
 679        # Apply mask - replace upper triangle with NaN values
 680        corr_matrix_lower = corr_matrix.copy()
 681        corr_matrix_lower.values[mask] = np.nan
 682        
 683        fig = px.imshow(
 684            corr_matrix_lower,
 685            text_auto='.2f',  # Display correlation values
 686            color_continuous_scale='RdBu_r',  # Red to Blue color scale (reversed)
 687            zmin=-1,  # Minimum correlation value
 688            zmax=1,   # Maximum correlation value
 689            aspect="auto",  # Keep aspect ratio adaptive
 690            title="Pearson Correlation Heatmap"
 691        )
 692        # Customize hover template
 693        fig.update_traces(
 694            hovertemplate='<b>%{y} vs %{x}</b><br>Correlation: %{z}<extra></extra>'
 695        )
 696        # Improve layout
 697        fig.update_layout(
 698            coloraxis_colorbar=dict(
 699                title="Correlation",
 700                tickvals=[-1, -0.5, 0, 0.5, 1],
 701                ticktext=["-1", "-0.5", "0", "0.5", "1"],
 702            ),
 703            plot_bgcolor="white",  # White background
 704            legend=dict(bgcolor='rgba(0,0,0,0)'),
 705            margin=dict(l=10, r=10, t=50, b=50),
 706            xaxis=dict(
 707                showgrid=False,  # Enable grid
 708                gridcolor="lightgray",  # Light gray grid lines
 709                zeroline=False,
 710                zerolinecolor="black",  # Black zero line
 711                showline=False,
 712                linewidth=1,
 713                tickangle=-45,
 714                linecolor="black",  # Black border
 715                mirror=True
 716            ),
 717            yaxis=dict(
 718                showgrid=False,  # Enable grid
 719                gridcolor="lightgray",  # Light gray grid lines
 720                zeroline=False,
 721                zerolinecolor="black",  # Black zero line
 722                showline=False,
 723                linewidth=1,
 724                linecolor="black",  # Black border
 725                mirror=True
 726            ), 
 727            height=600,
 728            width=600
 729        )
 730
 731        return fig
 732
 733    def write_equation(self, order=1, quadratic=[]):
 734        """
 735        Write R-style equation for multivariate fitting procedure using the statsmodels package.
 736
 737        Parameters
 738        ----------
 739        order : int, optional
 740            The order of the polynomial. Default is 1.
 741        quadratic : list, optional
 742            The list of quadratic factors. Default is an empty list.
 743
 744        Returns
 745        -------
 746        str
 747            The R-style equation.
 748        """
 749        myfactors = self._factors.copy()
 750        if self._dtypes is not None:
 751            if not isinstance(myfactors, list):
 752                myfactors = myfactors.tolist()  # Convert to list to allow mutable operations
 753            for i in range(len(self._factors)):
 754                if self._dtypes[self._factors[i]] == 'object':
 755                    myfactors[i] = f'C({myfactors[i]})'
 756        eqn = f'{self._response} ~ {myfactors[0]} '
 757        for factor in myfactors[1:]:
 758            eqn += f'+ {factor} '
 759        if order > 1:
 760            for i in range(len(myfactors)):
 761                for j in range(i + 1, len(myfactors)):
 762                    eqn += f'+ {myfactors[i]}:{myfactors[j]} '
 763        if order > 2:
 764            for i in range(len(myfactors)):
 765                for j in range(i + 1, len(myfactors)):
 766                    for k in range(j + 1, len(myfactors)):
 767                        eqn += f'+ {myfactors[i]}:{myfactors[j]}:{myfactors[k]} '
 768        if order > 3:
 769            for i in range(len(myfactors)):
 770                for j in range(i + 1, len(myfactors)):
 771                    for k in range(j + 1, len(myfactors)):
 772                        for l in range(k + 1, len(myfactors)):
 773                            eqn += f'+ {myfactors[i]}:{myfactors[j]}:{myfactors[k]}:{myfactors[l]} '
 774        if len(quadratic) > 0:
 775            for factor in quadratic:
 776                eqn += f'+ I({factor}**2) '
 777        self._equation = eqn
 778        return self._equation
 779
 780    def compute_linear_model(self, order=1, quadratic=[]):
 781        """
 782        Compute the linear model using the statsmodels package.
 783
 784        Parameters
 785        ----------
 786        order : int, optional
 787            The order of the polynomial. Default is 1. The parameter 
 788            is not used if the equation is already set.
 789        quadratic : list, optional
 790            The list of quadratic factors. Default is an empty list.
 791            The parameter is not used if the equation is already set.
 792
 793        Returns
 794        -------
 795        statsmodels.regression.linear_model.RegressionResultsWrapper
 796            The fitted linear model.
 797        """
 798        if self._equation == '':
 799            eqn = self.write_equation(order=order, quadratic=quadratic)
 800        else:
 801            eqn = self._equation
 802        model = smf.ols(formula=eqn, data=self._data)
 803        self._linear_model = model.fit()
 804        return self._linear_model
 805
 806    def plot_linear_model(self):
 807        """
 808        Plot the linear model using plotly.
 809
 810        Returns
 811        -------
 812        fig : list
 813            The list of plotly figures.
 814        """
 815        if self._linear_model is None:
 816            raise Warning("Linear model has not been computed yet.")
 817        fig = [None] * 2
 818        fig[0] = go.Figure()
 819        # Fig[0]: actual vs predicted line
 820        fig[0].add_trace(go.Scatter(x=self._data[self._response],
 821                                    y=self._linear_model.predict(),
 822                                    mode='markers',
 823                                    marker=dict(size=12),
 824                                    name='Actual vs Predicted'))
 825        # Add 1:1 line
 826        fig[0].add_shape(type="line",
 827                      x0=min(self._data[self._response]),
 828                      y0=min(self._data[self._response]),
 829                      x1=max(self._data[self._response]),
 830                      y1=max(self._data[self._response]),
 831                      line=dict(color="Gray", width=1, dash="dash"))
 832        fig[0].update_layout(
 833            plot_bgcolor="white",  # White background
 834            legend=dict(bgcolor='rgba(0,0,0,0)'),
 835            xaxis_title=f'Actual {self._response}',
 836            yaxis_title=f'Predicted {self._response}',
 837            height=500,  # Adjust height as needed
 838            margin=dict(l=10, r=10, t=50, b=50),
 839            xaxis=dict(
 840                showgrid=True,  # Enable grid
 841                gridcolor="lightgray",  # Light gray grid lines
 842                zeroline=False,
 843                zerolinecolor="black",  # Black zero line
 844                showline=True,
 845                linewidth=1,
 846                linecolor="black",  # Black border
 847                mirror=True
 848            ),
 849            yaxis=dict(
 850                showgrid=True,  # Enable grid
 851                gridcolor="lightgray",  # Light gray grid lines
 852                zeroline=False,
 853                zerolinecolor="black",  # Black zero line
 854                showline=True,
 855                linewidth=1,
 856                linecolor="black",  # Black border
 857                mirror=True
 858            )
 859        )
 860        # # # # # # # # # # # # # # # # # # # # #
 861        # Fig[1]: slope values for each factor
 862        # # # # # # # # # # # # # # # # # # # # #
 863        # Plot: Slope values for each factor
 864        res = self._linear_model.params.rename_axis('terms').reset_index()[1:]
 865        res.columns = ['terms', 'slope']
 866        error = self._linear_model.bse.rename_axis('terms').reset_index()[1:]
 867        error.columns = ['terms', 'error']
 868        res['error'] = error['error']
 869        res['pvalue'] = [self._linear_model.pvalues[res['terms']].iloc[i] for i in range(len(res))]
 870
 871        # Sort by p-values
 872        res = res.sort_values(by='pvalue', ascending=False)
 873        # Prepare colors and labels
 874        colors = ['red' if x < 0 else 'green' for x in res['slope']]
 875        res['slope'] = res['slope'].abs()
 876
 877        fig[1] = go.Figure()
 878
 879        # Add bar plot
 880        fig[1].add_trace(go.Bar(
 881            y=[term.replace('I(', '').replace('C(', '').replace(')', '').replace(' ** ', '^') for term in res['terms']],
 882            x=res['slope'],
 883            error_x=dict(type='data', array=res['error']),
 884            marker_color=colors,
 885            orientation='h',
 886            name='Slope',
 887            showlegend=False  # Hide the legend entry for the bar trace
 888        ))
 889
 890        # Update layout for log scale and labels
 891        fig[1].update_layout(
 892            xaxis_title='Magnitude of effect',
 893            xaxis_type="log",
 894            height=500,  # Adjust height as needed
 895            plot_bgcolor="white",  # White background
 896            legend=dict(
 897                orientation="h",  # Horizontal orientation
 898                y=1.1  # Place legend on top
 899            ),
 900            margin=dict(l=10, r=150, t=50, b=50),  # Increase right margin for annotations
 901            xaxis=dict(
 902                showgrid=True,  # Enable grid
 903                gridcolor="lightgray",  # Light gray grid lines
 904                zeroline=False,
 905                zerolinecolor="black",  # Black zero line
 906                showline=True,
 907                linewidth=1,
 908                linecolor="black",  # Black border
 909                mirror=True
 910            ),
 911            yaxis=dict(
 912                showgrid=True,  # Enable grid
 913                gridcolor="lightgray",  # Light gray grid lines
 914                zeroline=False,
 915                zerolinecolor="black",  # Black zero line
 916                showline=True,
 917                linewidth=1,
 918                linecolor="black",  # Black border
 919                mirror=True
 920            )
 921        )
 922
 923        # Add legend for Negative and Positive
 924        fig[1].add_trace(go.Scatter(
 925            x=[None], y=[None], mode='markers',
 926            marker=dict(size=10, color='red'),
 927            name='Negative'
 928        ))
 929        fig[1].add_trace(go.Scatter(
 930            x=[None], y=[None], mode='markers',
 931            marker=dict(size=10, color='green'),
 932            name='Positive'
 933        ))
 934
 935        # Add p-values as annotations outside the plot
 936        for i, p in enumerate(res['slope']):
 937            fig[1].add_annotation(
 938                x=1.025,  # Place annotations outside the plot
 939                y=i,
 940                text=f"p = {self._linear_model.pvalues[res['terms'].iloc[i]]:.2g}",
 941                showarrow=False,
 942                xref="paper",  # Use paper coordinates for x
 943                xanchor='left'
 944            )
 945
 946        return fig
 947
 948    def compute_ML_model(self, **kwargs):
 949        """
 950        Compute the machine learning model.
 951
 952        Parameters
 953        ----------
 954        kwargs : dict
 955            Additional keyword arguments for the model. Overrides default parameters.
 956            
 957        Default Parameters by Model Type
 958        --------------------------------
 959        - **ElasticNetCV:**
 960            - l1_ratio : list, default=[0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1.0].
 961                List of L1 ratios to try.
 962            - cv : int, default=5.
 963                Cross-validation folds.
 964            - max_iter : int, default=1000.
 965                Maximum iterations.
 966        - **RidgeCV:**
 967            - alphas : list, default=[0.1, 1.0, 10.0].
 968                List of alpha values to try.
 969            - cv : int, default=5.
 970                Cross-validation folds.
 971        - **LinearRegression:**
 972            - fit_intercept : bool, default=True.
 973                Whether to calculate the intercept.
 974        - **RandomForest:**
 975            - n_estimators : int, default=100.
 976                Number of trees in the forest.
 977            - max_depth : int or None, default=None.
 978                Maximum depth of trees.
 979            - min_samples_split : int, default=2.
 980                Minimum samples required to split a node.
 981            - random_state : int, default=42.
 982                Random seed for reproducibility.
 983        - **GaussianProcess:**
 984            - kernel : kernel object, default=None.
 985                Kernel for the Gaussian Process.
 986            - alpha : float, default=1e-10.
 987                Value added to diagonal of kernel matrix.
 988            - normalize_y : bool, default=True.
 989                Normalize target values.
 990            - random_state : int, default=42.
 991                Random seed for reproducibility.
 992        - **GradientBoosting:**
 993            - n_estimators : int, default=100.
 994                Number of boosting stages.
 995            - learning_rate : float, default=0.1.
 996                Learning rate.
 997            - max_depth : int, default=3.
 998                Maximum depth of trees.
 999            - random_state : int, default=42.
1000                Random seed for reproducibility.
1001
1002        Returns
1003        -------
1004        object
1005            The fitted machine learning model.
1006        """
1007        if self._model_type is None:
1008            raise ValueError("Model must be provided.")
1009
1010        X = self._data[self._factors]
1011        y = self._data[self._response]
1012        if self._split_size > 0:
1013            X_train, X_test, y_train, y_test = train_test_split(
1014                X, y, test_size=self._split_size)
1015        else:
1016            X_train, X_test, y_train, y_test = X, X, y, y
1017        
1018        # Default parameters for each model type
1019        default_params = {
1020            "ElasticNetCV": {"l1_ratio": [0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1.0],
1021                            "cv": 5, 
1022                            "max_iter": 1000},
1023            "RidgeCV": {"alphas": [0.1, 1.0, 10.0], 
1024                        "cv": 5},
1025            "LinearRegression": {"fit_intercept": True},
1026            "RandomForest": {"n_estimators": 100, 
1027                            "max_depth": None,
1028                            "min_samples_split": 2,
1029                            "random_state": 42},
1030            "GaussianProcess": {"kernel": None, 
1031                                "alpha": 1e-10,
1032                                "normalize_y": True,
1033                                "random_state": 42},
1034            "GradientBoosting": {"n_estimators": 100,
1035                                "learning_rate": 0.1,
1036                                "max_depth": 3,
1037                                "random_state": 42}
1038        }
1039        
1040        # Get default parameters for the selected model
1041        model_defaults = default_params.get(self._model_type, {})
1042        
1043        # Override defaults with any provided kwargs
1044        model_params = {**model_defaults, **kwargs}
1045        
1046        if self._model_type == "ElasticNetCV":
1047            self._model = make_pipeline(StandardScaler(),
1048                                        ElasticNetCV(**model_params))
1049        elif self._model_type == "RidgeCV":
1050            self._model = make_pipeline(StandardScaler(),
1051                                        RidgeCV(**model_params))
1052        elif self._model_type == "LinearRegression":
1053            self._model = make_pipeline(StandardScaler(),
1054                                        LinearRegression(**model_params))
1055        elif self._model_type == "RandomForest":
1056            self._model = make_pipeline(StandardScaler(),
1057                                        RandomForestRegressor(**model_params))
1058        elif self._model_type == "GaussianProcess":
1059            self._model = make_pipeline(StandardScaler(),
1060                                        GaussianProcessRegressor(**model_params))
1061        elif self._model_type == "GradientBoosting":
1062            self._model = make_pipeline(StandardScaler(),
1063                                        GradientBoostingRegressor(**model_params))
1064        
1065        # Fit the model
1066        self._model.fit(X_train, y_train)
1067        return self._model
1068
1069    def plot_ML_model(self, features_in_log=False):
1070        """
1071        Plot the machine learning model using plotly.
1072
1073        Parameters
1074        ----------
1075        features_in_log : bool, optional
1076            Whether to plot the feature importances in log scale. Default is False.
1077
1078        Returns
1079        -------
1080        fig : list
1081            The list of plotly figures.
1082        """
1083        coef_names = self._data[self._factors].columns
1084        X = self._data[self._factors]
1085        y = self._data[self._response]
1086        if self._split_size > 0:
1087            X_train, X_test, y_train, y_test = train_test_split(
1088                X, y, test_size=self._split_size)
1089        else:
1090            X_train, X_test, y_train, y_test = X, X, y, y
1091        bootstrap_coefs = bootstrap_coefficients(
1092            self._model[1], X, y, n_bootstrap=100, random_state=42)
1093        # Calculate mean and standard deviation of coefficients
1094        mean_coefs = np.mean(bootstrap_coefs, axis=0)
1095        std_coefs = np.std(bootstrap_coefs, axis=0)
1096        # make the fig
1097        fig = [None] * 2
1098        fig[0] = go.Figure()
1099        # Add actual vs predicted line
1100        fig[0].add_trace(go.Scatter(x=y_train,
1101                                 y=self._model.predict(X_train),
1102                                 mode='markers',
1103                                 marker=dict(size=12, color='royalblue'),
1104                                 name='Training'))
1105        if self._split_size > 0:
1106            fig[0].add_trace(go.Scatter(x=y_test,
1107                                    y=self._model.predict(X_test),
1108                                    mode='markers',
1109                                    marker=dict(size=12, color='orange'),
1110                                    name='Validation'))
1111        # Add 1:1 line
1112        fig[0].add_shape(type="line",
1113                      x0=min(y_train), y0=min(y_train),
1114                      x1=max(y_train), y1=max(y_train),
1115                      line=dict(color="Gray", width=1, dash="dash"))
1116        fig[0].update_layout(
1117            plot_bgcolor="white",  # White background
1118            legend=dict(bgcolor='rgba(0,0,0,0)'),
1119            xaxis_title=f'Actual {self._response}',
1120            yaxis_title=f'Predicted {self._response}',
1121            height=500,  # Adjust height as needed
1122            margin=dict(l=10, r=10, t=120, b=50),
1123            xaxis=dict(
1124                showgrid=True,  # Enable grid
1125                gridcolor="lightgray",  # Light gray grid lines
1126                zeroline=False,
1127                zerolinecolor="black",  # Black zero line
1128                showline=True,
1129                linewidth=1,
1130                linecolor="black",  # Black border
1131                mirror=True
1132            ),
1133            yaxis=dict(
1134                showgrid=True,  # Enable grid
1135                gridcolor="lightgray",  # Light gray grid lines
1136                zeroline=False,
1137                zerolinecolor="black",  # Black zero line
1138                showline=True,
1139                linewidth=1,
1140                linecolor="black",  # Black border
1141                mirror=True
1142            )
1143        )
1144        # add the score in title
1145        fig[0].update_layout(
1146            title={
1147                'text': f"Predicted vs Actual for {str(self._model[1])}<br>R²_test = {r2_score(y_test, self._model.predict(X_test)):.4f}  -  R²_train = {r2_score(y_train, self._model.predict(X_train)):.4f}<br>RMSE_test = {root_mean_squared_error(y_test, self._model.predict(X_test)):.4f}  -  RMSE_train = {root_mean_squared_error(y_train, self._model.predict(X_train)):.4f}",
1148                'x': 0.45,
1149                'xanchor': 'center'
1150            }
1151        )
1152        # make feature importance plot
1153        if self._model_type != "GaussianProcess":
1154            fig[1] = go.Figure()
1155            pos_coefs = mean_coefs[mean_coefs > 0]
1156            pos_coefs_names = coef_names[mean_coefs > 0]
1157            neg_coefs = mean_coefs[mean_coefs < 0]
1158            neg_coefs_names = coef_names[mean_coefs < 0]
1159            # Add bars for positive mean coefficients
1160            fig[1].add_trace(go.Bar(
1161                y=[f"{pos_coefs_names[i]}" for i in range(len(pos_coefs))],
1162                x=mean_coefs[mean_coefs > 0],
1163                error_x=dict(type='data', array=std_coefs[mean_coefs > 0], visible=True),
1164                orientation='h',
1165                marker_color='royalblue',
1166                name='Positive'
1167            ))
1168            fig[1].add_trace(go.Bar(
1169                y=[f"{neg_coefs_names[i]}" for i in range(len(neg_coefs))],
1170                x=-mean_coefs[mean_coefs < 0],
1171                error_x=dict(type='data', array=std_coefs[mean_coefs < 0], visible=True),
1172                orientation='h',
1173                marker_color='orange',
1174                name='Negative'
1175            ))
1176            # Update layout
1177            fig[1].update_layout(
1178                title={
1179                    'text': f"Features importance for {str(self._model[1])}",
1180                    'x': 0.5,
1181                    'xanchor': 'center'
1182                },
1183                xaxis_title="Coefficient Value",
1184                yaxis_title="Features",
1185                barmode='relative',
1186                margin=dict(l=150)
1187            )
1188            if features_in_log:
1189                fig[1].update_xaxes(type="log")
1190            else:
1191                fig[1].update_xaxes(type="linear")
1192            fig[1].update_layout(
1193                plot_bgcolor="white",  # White background
1194                legend=dict(bgcolor='rgba(0,0,0,0)'),
1195                height=500,  # Adjust height as needed
1196                margin=dict(l=10, r=10, t=120, b=50),
1197                xaxis=dict(
1198                    showgrid=True,  # Enable grid
1199                    gridcolor="lightgray",  # Light gray grid lines
1200                    zeroline=False,
1201                    zerolinecolor="black",  # Black zero line
1202                    showline=True,
1203                    linewidth=1,
1204                    linecolor="black",  # Black border
1205                    mirror=True
1206                ),
1207                yaxis=dict(
1208                    showgrid=True,  # Enable grid
1209                    gridcolor="lightgray",  # Light gray grid lines
1210                    zeroline=False,
1211                    zerolinecolor="black",  # Black zero line
1212                    showline=True,
1213                    linewidth=1,
1214                    linecolor="black",  # Black border
1215                    mirror=True
1216                )
1217            )
1218        else:
1219            print('GaussianProcess does not have coefficients or feature importances')
1220        return fig
1221
1222    def predict(self, X=None, model='all'):
1223        """
1224        Predict using the machine learning model and the linear model,
1225        if they are trained. Use the encoders to encode the data.
1226        If X is not provided, use the original data.
1227        If the model has not been trained, raise a warning.
1228
1229        Parameters
1230        ----------
1231        X : pd.DataFrame, optional
1232            The input features. Default is None, which uses the original data.
1233
1234        Returns
1235        -------
1236        pd.DataFrame
1237            The predicted values with a column indicating the model used.
1238        """
1239        if X is None:
1240            X = self._data[self._factors].copy()
1241        else:
1242            X = X.copy()
1243
1244        # Encode the input data using the same encoders
1245        for factor in self._factors:
1246            if factor in self._encoders:
1247                X[factor] = self._encoders[factor].transform(X[factor].astype(str))
1248
1249        predML = None
1250        predlin = None
1251
1252        if self._model is not None:
1253            # Check if the model is fitted
1254            if not hasattr(self._model, 'predict'):
1255                raise Warning("Model has not been trained yet.")
1256            predML = self._model.predict(X)
1257            predML = pd.DataFrame(predML, columns=['prediction'])
1258            predML['model'] = self._model_type
1259
1260        if self._linear_model is not None:
1261            # Check if the linear model is fitted
1262            if not hasattr(self._linear_model, 'predict'):
1263                raise Warning("Linear model has not been trained yet.")
1264            predlin = self._linear_model.predict(X)
1265            predlin = pd.DataFrame(predlin, columns=['prediction'])
1266            predlin['model'] = 'Linear Model'
1267
1268        if predML is not None and predlin is not None and model == 'all':
1269            # Combine the predictions on top of each other
1270            pred = pd.concat([predML, predlin], axis=0)
1271            pred = pred.reset_index(drop=True)
1272            return pred
1273        elif predML is not None and model == 'ML':
1274            return predML
1275        elif predlin is not None and model == 'linear':
1276            return predlin
1277        else:
1278            raise Warning("No model has been trained yet.")
1279
1280
1281
1282
1283def bootstrap_coefficients(mod, X, y, n_bootstrap=100, random_state=None):
1284    """
1285    Perform bootstrap resampling to estimate the variability of model coefficients.
1286
1287    Parameters
1288    ----------
1289    mod : object
1290        The machine learning model.
1291    X : pd.DataFrame
1292        The input features.
1293    y : pd.Series
1294        The target variable.
1295    n_bootstrap : int, optional
1296        The number of bootstrap samples. Default is 100.
1297    random_state : int, optional
1298        The seed for the random number generator.
1299
1300    Returns
1301    -------
1302    results : np.ndarray
1303        The bootstrapped coefficients.
1304    """
1305    np.random.seed(random_state)
1306    n_samples = X.shape[0]
1307    results = []
1308
1309    for _ in range(n_bootstrap):
1310        # Resample the data
1311        indices = np.random.choice(np.arange(n_samples), size=n_samples, replace=True)
1312        X_resample, y_resample = X.values[indices], y.values[indices]
1313
1314        # Fit the model
1315        if isinstance(mod, RidgeCV):
1316            model = RidgeCV(alphas=np.logspace(-3, 3, 10)).fit(X_resample, y_resample)
1317            results.append(model.coef_)
1318        elif isinstance(mod, ElasticNetCV):
1319            model = ElasticNetCV(alphas=np.logspace(-3, 3, 10), l1_ratio=0.5).fit(X_resample, y_resample)
1320            results.append(model.coef_)
1321        elif isinstance(mod, LinearRegression):
1322            model = LinearRegression().fit(X_resample, y_resample)
1323            results.append(model.coef_)
1324        elif isinstance(mod, RandomForestRegressor):
1325            model = RandomForestRegressor().fit(X_resample, y_resample)
1326            results.append(model.feature_importances_)
1327        elif isinstance(mod, GradientBoostingRegressor):
1328            model = GradientBoostingRegressor().fit(X_resample, y_resample)
1329            results.append(model.feature_importances_)
1330        elif isinstance(mod, GaussianProcessRegressor):
1331            # Gaussian Process does not have coefficients or feature importances
1332            model = GaussianProcessRegressor().fit(X_resample, y_resample)
1333            results.append(np.zeros(X.shape[1]))  # Placeholder for Gaussian Process
1334        else:
1335            raise ValueError(f"Unsupported model type: {mod}")
1336
1337    return np.array(results)
class DataAnalysis:
  46class DataAnalysis:
  47    """
  48    This class is used to analyze the data and perform regression analysis.
  49
  50    Parameters
  51    ----------
  52
  53    data : pd.DataFrame
  54        The input data.
  55    factors : list
  56        The list of factor variables.
  57    response : str
  58        The response variable.
  59    split_size : float, optional
  60        The proportion of the dataset to include in the test split. Default is `0.2`.
  61    model_type : str, optional
  62        The type of machine learning model to use. Default is None. 
  63        Must be one of the following: `"ElasticNetCV"`, `"RidgeCV"`,
  64        `"LinearRegression"`, `"RandomForest"`, `"GaussianProcess"`, `"GradientBoosting"`.
  65    
  66    Attributes
  67    ----------
  68    
  69    data : pd.DataFrame
  70        The input data.
  71    factors : list
  72        The list of factor variables.
  73    response : str
  74        The response variable.
  75    encoders : dict
  76        The encoders for categorical variables.
  77    dtypes : pd.Series  
  78        The data types of the columns.
  79    linear_model : object
  80        The linear model object.
  81    equation : str
  82        The equation for the linear model, in the form `response ~ var1 + var2 + var1:var2`.
  83    model : object
  84        The machine learning model object.
  85    model_type : str
  86        The type of machine learning model to use.
  87    split_size : float
  88        The proportion of the dataset to include in the test split.
  89
  90    Methods
  91    -------
  92    
  93    - **encode_data()**:
  94        Encodes categorical variables in the data. Called during initialization.
  95    - **plot_qq()**:
  96        Plots a Q-Q plot for the response variable using `plotly`.
  97    - **plot_boxplot()**:
  98        Plots a boxplot for the response variable using `plotly`.
  99    - **plot_histogram()**:
 100        Plots a histogram for the response variable using `plotly`.
 101    - **plot_scatter_response()**:
 102        Plots a scatter plot for the response variable using `plotly`.
 103    - **plot_corr()**:
 104        Plots a correlation matrix for the data using `plotly`.
 105    - **plot_pairplot_seaborn()**:
 106        Plots a pairplot for the data using `seaborn`.
 107    - **plot_pairplot_plotly()**:
 108        Plots a pairplot for the data using `plotly`.
 109    - **write_equation(order=1, quadratic=[])**:
 110        Writes R-style equation for multivariate fitting procedure using the `statsmodels` package.
 111    - **compute_linear_model(order=1, quadratic=[])**:
 112        Computes the linear model using the `statsmodels` package.
 113    - **plot_linear_model()**:
 114        Plots the linear model using `plotly`.
 115    - <b>compute_ML_model(**kwargs)</b>:
 116        Computes the machine learning model using `sklearn`.
 117    - **plot_ML_model(features_in_log=False)**:
 118        Plots the machine learning model using `plotly`.
 119    
 120    Example
 121    -------
 122    
 123    ```python
 124    from optimeo.analysis import * 
 125
 126    data = pd.read_csv('dataML.csv')
 127    factors = data.columns[:-1]
 128    response = data.columns[-1]
 129    analysis = DataAnalysis(data, factors, response)
 130    analysis.model_type = "ElasticNetCV"
 131    MLmodel = analysis.compute_ML_model()
 132    figs = analysis.plot_ML_model()
 133    for fig in figs:
 134        fig.show()
 135    ```
 136    """
 137
 138    def __init__(self, 
 139                 data: pd.DataFrame, 
 140                 factors: list, 
 141                 response: str, 
 142                 split_size=.2, 
 143                 model_type=None):
 144        self._dtypes = None
 145        self._encoders = {}
 146        self._linear_model = None
 147        self._model = None
 148        self._equation = ''
 149        self._data = data
 150        self._factors = factors
 151        self._response = response
 152        self._model_type = model_type
 153        self._split_size = split_size
 154        self.encode_data()
 155
 156    @property
 157    def data(self):
 158        """The input `pandas.DataFrame`."""
 159        return self._data
 160
 161    @data.setter
 162    def data(self, value):
 163        if not isinstance(value, pd.DataFrame):
 164            raise ValueError("Data must be a pandas DataFrame.")
 165        self._data = value
 166    
 167    def encode_data(self):
 168        """
 169        Called during initialization: encodes categorical variables in the data if there are any. 
 170        Uses `LabelEncoder()` from `sklearn` to convert categorical variables to numerical values.
 171        Also drops rows with NaN values.
 172        """
 173        self._dtypes = self._data.dtypes
 174        for factor in self._factors:
 175            if self._dtypes[factor] == 'object':
 176                le = LabelEncoder()
 177                self._encoders[factor] = le
 178                self._data[factor] = le.fit_transform([str(d) for d in self._data[factor]])
 179        self.data = self.data[self._factors + [self._response]]
 180        self.data = self.data.dropna(axis=0, how='any')
 181
 182    @property
 183    def factors(self):
 184        """The list of names of the columns of the `data` DataFrame that contain factor variables."""
 185        return self._factors
 186
 187    @factors.setter
 188    def factors(self, value):
 189        if not isinstance(value, list):
 190            raise ValueError("Factors must be a list.")
 191        self._factors = value
 192    
 193    @property
 194    def encoders(self):
 195        """The list of encoders for categorical variables."""
 196        return self._encoders
 197    
 198    @encoders.setter
 199    def encoders(self, value):
 200        if not isinstance(value, dict):
 201            raise ValueError("Encoders must be a dictionary.")
 202        self._encoders = value
 203
 204    @property
 205    def dtypes(self):
 206        """Get the data types of the columns."""
 207        return self._dtypes
 208
 209    @dtypes.setter
 210    def dtypes(self, value):
 211        self._dtypes = value
 212
 213    @property
 214    def response(self):
 215        """The name of the column of the `data` DataFrame that contain the response variable."""
 216        return self._response
 217
 218    @response.setter
 219    def response(self, value):
 220        if not isinstance(value, str):
 221            raise ValueError("Response must be a string.")
 222        self._response = value
 223
 224    @property
 225    def linear_model(self):
 226        """Get the linear model."""
 227        return self._linear_model
 228
 229    @linear_model.setter
 230    def linear_model(self, value):
 231        self._linear_model = value
 232
 233    @property
 234    def equation(self):
 235        """The equation for the linear model, in the form `response ~ var1 + var2 + var1:var2`. This is based on the [statsmodels package](https://www.statsmodels.org/dev/examples/notebooks/generated/formulas.html)."""
 236        return self._equation
 237
 238    @equation.setter
 239    def equation(self, value):
 240        if not isinstance(value, str):
 241            raise ValueError("Equation must be a string.")
 242        self._equation = value
 243
 244    @property
 245    def model_type(self):
 246        """The type of machine learning model to use. Default is None. 
 247            Must be one of the following: `"ElasticNetCV"`, `"RidgeCV"`,
 248            `"LinearRegression"`, `"RandomForest"`, `"GaussianProcess"`, `"GradientBoosting"`."""
 249        return self._model_type
 250
 251    @model_type.setter
 252    def model_type(self, value):
 253        if value is not None and value not in ["ElasticNetCV", "RidgeCV",
 254                                                "LinearRegression", "RandomForest",
 255                                                "GaussianProcess", "GradientBoosting"]:
 256            raise ValueError("Model must be one of the following: "
 257                             "ElasticNetCV, RidgeCV, LinearRegression, "
 258                             "RandomForest, GaussianProcess, GradientBoosting.")
 259        self._model_type = value
 260
 261    @property
 262    def model(self):
 263        """The machine learning model object."""
 264        return self._model
 265
 266    @model.setter
 267    def model(self, value):
 268        self._model = value
 269
 270    @property
 271    def split_size(self):
 272        """The proportion of the dataset to include in the test split. Default is `0.2`."""
 273        return self._split_size
 274
 275    @split_size.setter
 276    def split_size(self, value):
 277        if not isinstance(value, (int, float)):
 278            raise ValueError("Split size must be a number.")
 279        self._split_size = value
 280
 281    def __str__(self):
 282        """Return a string representation of the DataAnalysis object."""
 283        return (f"DataAnalysis(data={self._data.shape}, "
 284                f"factors={self._factors}, "
 285                f"response={self._response}, "
 286                f"model_type={self._model_type}, "
 287                f"split_size={self._split_size}, "
 288                f"encoders={self._encoders})"
 289                )
 290
 291    def __repr__(self):
 292        """Return a string representation of the DataAnalysis object."""
 293        return self.__str__()
 294
 295    def plot_qq(self):
 296        """
 297        Plot a Q-Q plot for the response variable.
 298
 299        Returns
 300        -------
 301        fig : plotly.graph_objs.Figure
 302            The Q-Q plot figure.
 303        """
 304        qqplot_data = pd.Series(self._data[self._response], copy=True)
 305        qqplot_data = qqplot(qqplot_data, line='s').gca().lines
 306        fig = go.Figure()
 307
 308        fig.add_trace({
 309            'type': 'scatter',
 310            'x': qqplot_data[0].get_xdata(),
 311            'y': qqplot_data[0].get_ydata(),
 312            'mode': 'markers',
 313        })
 314
 315        fig.add_trace({
 316            'type': 'scatter',
 317            'x': qqplot_data[1].get_xdata(),
 318            'y': qqplot_data[1].get_ydata(),
 319            'mode': 'lines',
 320            'line': {
 321                'color': 'red'
 322            }
 323        })
 324
 325        fig.update_layout(
 326            title='Q-Q Plot',
 327            xaxis_title='Theoretical Quantiles',
 328            yaxis_title='Sample Quantiles',
 329            plot_bgcolor="white",  # White background
 330            showlegend=False,
 331            margin=dict(l=10, r=10, t=50, b=50),
 332            xaxis=dict(
 333                showgrid=True,  # Enable grid
 334                gridcolor="lightgray",  # Light gray grid lines
 335                zeroline=False,
 336                zerolinecolor="black",  # Black zero line
 337                showline=True,
 338                linewidth=1,
 339                linecolor="black",  # Black border
 340                mirror=True
 341            ),
 342            yaxis=dict(
 343                showgrid=True,  # Enable grid
 344                gridcolor="lightgray",  # Light gray grid lines
 345                zeroline=False,
 346                zerolinecolor="black",  # Black zero line
 347                showline=True,
 348                linewidth=1,
 349                linecolor="black",  # Black border
 350                mirror=True
 351            ),
 352            height=300,
 353        )
 354        return fig
 355
 356    def plot_boxplot(self):
 357        """
 358        Plot a boxplot for the response variable.
 359
 360        Returns
 361        -------
 362        fig : plotly.graph_objs.Figure
 363            The boxplot figure.
 364        """
 365        fig = px.box(self._data, y=self._response, points="all", title='Box Plot')
 366        fig.update_layout(
 367            plot_bgcolor="white",  # White background
 368            legend=dict(bgcolor='rgba(0,0,0,0)'),
 369            margin=dict(l=10, r=10, t=50, b=50),
 370            xaxis=dict(
 371                showgrid=True,  # Enable grid
 372                gridcolor="lightgray",  # Light gray grid lines
 373                zeroline=False,
 374                zerolinecolor="black",  # Black zero line
 375                showline=True,
 376                linewidth=1,
 377                linecolor="black",  # Black border
 378                mirror=True
 379            ),
 380            yaxis=dict(
 381                showgrid=True,  # Enable grid
 382                gridcolor="lightgray",  # Light gray grid lines
 383                zeroline=False,
 384                zerolinecolor="black",  # Black zero line
 385                showline=True,
 386                linewidth=1,
 387                linecolor="black",  # Black border
 388                mirror=True
 389            ), height=300,
 390        )
 391        return fig
 392
 393    def plot_histogram(self):
 394        """
 395        Plot a histogram for the response variable.
 396
 397        Returns
 398        -------
 399        fig : plotly.graph_objs.Figure
 400            The histogram figure.
 401        """
 402        fig = px.histogram(self._data, x=self._response, title='Histogram')
 403        fig.update_layout(
 404            plot_bgcolor="white",  # White background
 405            legend=dict(bgcolor='rgba(0,0,0,0)'),
 406            margin=dict(l=10, r=10, t=50, b=50),
 407            xaxis=dict(
 408                showgrid=True,  # Enable grid
 409                gridcolor="lightgray",  # Light gray grid lines
 410                zeroline=False,
 411                zerolinecolor="black",  # Black zero line
 412                showline=True,
 413                linewidth=1,
 414                linecolor="black",  # Black border
 415                mirror=True
 416            ),
 417            yaxis=dict(
 418                showgrid=True,  # Enable grid
 419                gridcolor="lightgray",  # Light gray grid lines
 420                zeroline=False,
 421                zerolinecolor="black",  # Black zero line
 422                showline=True,
 423                linewidth=1,
 424                linecolor="black",  # Black border
 425                mirror=True
 426            ), height=300,
 427        )
 428        return fig
 429
 430    def plot_scatter_response(self):
 431        """
 432        Plot a scatter plot for the response variable.
 433
 434        Returns
 435        -------
 436        fig : plotly.graph_objs.Figure
 437            The scatter plot figure.
 438        """
 439        fig = px.scatter(x=np.arange(1, len(self._data[self._response]) + 1),
 440                         y=self._data[self._response],
 441                         labels={'x': 'Measurement number', 'y': self._response},
 442                         title='Scatter Plot')
 443        fig.update_layout(
 444            plot_bgcolor="white",  # White background
 445            legend=dict(bgcolor='rgba(0,0,0,0)'),
 446            margin=dict(l=10, r=10, t=50, b=50),
 447            xaxis=dict(
 448                showgrid=True,  # Enable grid
 449                gridcolor="lightgray",  # Light gray grid lines
 450                zeroline=False,
 451                zerolinecolor="black",  # Black zero line
 452                showline=True,
 453                linewidth=1,
 454                linecolor="black",  # Black border
 455                mirror=True
 456            ),
 457            yaxis=dict(
 458                showgrid=True,  # Enable grid
 459                gridcolor="lightgray",  # Light gray grid lines
 460                zeroline=False,
 461                zerolinecolor="black",  # Black zero line
 462                showline=True,
 463                linewidth=1,
 464                linecolor="black",  # Black border
 465                mirror=True
 466            ), height=300,
 467        )
 468        return fig
 469
 470    def plot_pairplot_seaborn(self):
 471        """
 472        Plot a pairplot for the data using seaborn.
 473
 474        Returns
 475        -------
 476        fig : seaborn.axisgrid.PairGrid
 477            The pairplot figure.
 478        """
 479        fig = sns.pairplot(
 480            self._data,
 481            kind="reg",
 482            diag_kind="kde",
 483            plot_kws={"scatter_kws": {"alpha": 0.1}},
 484        )
 485        return fig
 486    
 487    def plot_pairplot_plotly(self):
 488        """
 489        Plot a pairplot for the data using plotly.
 490
 491        Returns
 492        -------
 493        fig : plotly.graph_objs.Figure
 494            The plotly figure.
 495        """
 496        
 497        # Get the column names
 498        columns = self._data.columns
 499        n_cols = len(columns)
 500
 501        # Create subplot grid with partially shared axes
 502        fig = make_subplots(
 503            rows=n_cols,
 504            cols=n_cols,
 505            shared_xaxes=True,  
 506            shared_yaxes=True,  
 507            vertical_spacing=0.05,
 508            horizontal_spacing=0.05
 509        )
 510
 511        # Add scatter plots and regression lines to each subplot
 512        for i in range(n_cols):
 513            for j in range(n_cols):
 514                x_data = self._data[columns[j]]
 515                y_data = self._data[columns[i]]
 516
 517                if i == j:  # Diagonal: Add KDE plot
 518                    # Calculate KDE
 519                    data = self._data[columns[i]].dropna()
 520                    if len(data) > 1:  # Ensure enough data points
 521                        kde_x = np.linspace(data.min(), data.max(), 100)
 522                        kde = stats.gaussian_kde(data)
 523                        kde_y = kde(kde_x)
 524                        # Scale to match y-axis (needed for top left corner)
 525                        kde_y = kde_y / kde_y.max() * np.max(y_data.dropna())  
 526
 527                        # Add KDE plot to diagonal
 528                        fig.add_trace(
 529                            go.Scatter(
 530                                x=kde_x,
 531                                y=kde_y,
 532                                mode='lines',
 533                                fill='tozeroy',
 534                                line=dict(color='rgba(29, 81, 189, 1)'),
 535                                showlegend=False
 536                            ),
 537                            row=i+1, col=j+1
 538                        )
 539                        # Ensure the y-axis is independent for diagonal plots
 540                        # don't know why, doesn't work with top left corner
 541                        fig.update_yaxes(
 542                            matches=None,  # Ensure independent y-axis
 543                            showticklabels=True,  # Show y-axis values
 544                            row=i+1,
 545                            col=j+1
 546                        )
 547                else:  # Off-diagonal: Add scatter plot with regression line
 548                    # Add scatter plot
 549                    fig.add_trace(
 550                        go.Scatter(
 551                            x=x_data,
 552                            y=y_data,
 553                            mode='markers',
 554                            marker=dict(
 555                                color='rgba(29, 81, 189, 0.5)',
 556                                size=5
 557                            ),
 558                            showlegend=False
 559                        ),
 560                        row=i+1, col=j+1
 561                    )
 562
 563                    # Calculate and add regression line
 564                    if len(x_data.dropna()) > 1 and len(y_data.dropna()) > 1:
 565                        # Drop NaN values for regression calculation
 566                        valid_indices = x_data.notna() & y_data.notna()
 567                        x_clean = x_data[valid_indices]
 568                        y_clean = y_data[valid_indices]
 569                        ytype = y_clean.dtype
 570                        xtype = x_clean.dtype
 571                        unique_x = x_clean.unique()
 572                        if len(x_clean) > 1 and len(unique_x)>1 and ytype != 'object' and xtype != 'object':
 573                            # Calculate regression parameters
 574                            slope, intercept, r_value, p_value, std_err = stats.linregress(x_clean, y_clean)
 575                            x_range = np.linspace(x_clean.min(), x_clean.max(), 100)
 576                            y_pred = slope * x_range + intercept
 577
 578                            # Standard error of estimate
 579                            y_fit = slope * x_clean + intercept
 580                            residuals = y_clean - y_fit
 581                            dof = len(x_clean) - 2
 582                            residual_std_error = np.sqrt(np.sum(residuals**2) / dof)
 583
 584                            mean_x = np.mean(x_clean)
 585                            t_val = t.ppf(0.975, dof)  # 95% confidence
 586
 587                            se_line = residual_std_error * np.sqrt(1/len(x_clean) + (x_range - mean_x)**2 / np.sum((x_clean - mean_x)**2))
 588                            y_upper = y_pred + t_val * se_line
 589                            y_lower = y_pred - t_val * se_line
 590
 591                            # Add regression line
 592                            fig.add_trace(
 593                                go.Scatter(
 594                                    x=x_range,
 595                                    y=y_pred,
 596                                    mode='lines',
 597                                    line=dict(color='red', width=2),
 598                                    showlegend=False
 599                                ),
 600                                row=i+1, col=j+1
 601                            )
 602
 603                            # Add confidence interval area
 604                            fig.add_trace(
 605                                go.Scatter(
 606                                    x=np.concatenate([x_range, x_range[::-1]]),
 607                                    y=np.concatenate([y_upper, y_lower[::-1]]),
 608                                    fill='toself',
 609                                    fillcolor='rgba(255, 0, 0, 0.2)',
 610                                    line=dict(color='rgba(255, 0, 0, 0)'),
 611                                    showlegend=False
 612                                ),
 613                                row=i+1, col=j+1
 614                            )
 615
 616        # Update layout and axis properties
 617        fig.update_layout(
 618            margin=dict(l=10, r=10, t=50, b=50),
 619            title="Pair Plot",
 620            height=600,
 621            width=600,
 622            showlegend=False,
 623            plot_bgcolor="white"
 624        )
 625
 626        # Update all axes properties first (hiding tick labels by default)
 627        fig.update_xaxes(
 628            showgrid=True,
 629            gridcolor="lightgray",
 630            zeroline=False,
 631            showline=True,
 632            linewidth=1,
 633            linecolor="black",
 634            mirror=True,
 635            showticklabels=False
 636        )
 637
 638        fig.update_yaxes(
 639            showgrid=True,
 640            gridcolor="lightgray",
 641            zeroline=False,
 642            showline=True,
 643            linewidth=1,
 644            linecolor="black",
 645            mirror=True,
 646            showticklabels=False
 647        )
 648
 649        # Show tick labels and titles only for bottom row and leftmost column
 650        for i, col_name in enumerate(columns):
 651            # Bottom row: show x-axis titles and tick labels
 652            fig.update_xaxes(
 653                title_text=col_name,
 654                showticklabels=True,
 655                row=n_cols,
 656                col=i+1
 657            )
 658
 659            # Leftmost column: show y-axis titles and tick labels
 660            fig.update_yaxes(
 661                title_text=col_name,
 662                showticklabels=True,
 663                row=i+1,
 664                col=1
 665            )
 666            
 667        return fig
 668    
 669    def plot_corr(self):
 670        """
 671        Plot a correlation matrix for the data.
 672
 673        Returns
 674        -------
 675        fig : seaborn.axisgrid.PairGrid
 676            The pairplot figure.
 677        """
 678        corr_matrix = self._data.corr(method='pearson')
 679        mask = np.triu(np.ones_like(corr_matrix, dtype=bool), k=1)
 680        # Apply mask - replace upper triangle with NaN values
 681        corr_matrix_lower = corr_matrix.copy()
 682        corr_matrix_lower.values[mask] = np.nan
 683        
 684        fig = px.imshow(
 685            corr_matrix_lower,
 686            text_auto='.2f',  # Display correlation values
 687            color_continuous_scale='RdBu_r',  # Red to Blue color scale (reversed)
 688            zmin=-1,  # Minimum correlation value
 689            zmax=1,   # Maximum correlation value
 690            aspect="auto",  # Keep aspect ratio adaptive
 691            title="Pearson Correlation Heatmap"
 692        )
 693        # Customize hover template
 694        fig.update_traces(
 695            hovertemplate='<b>%{y} vs %{x}</b><br>Correlation: %{z}<extra></extra>'
 696        )
 697        # Improve layout
 698        fig.update_layout(
 699            coloraxis_colorbar=dict(
 700                title="Correlation",
 701                tickvals=[-1, -0.5, 0, 0.5, 1],
 702                ticktext=["-1", "-0.5", "0", "0.5", "1"],
 703            ),
 704            plot_bgcolor="white",  # White background
 705            legend=dict(bgcolor='rgba(0,0,0,0)'),
 706            margin=dict(l=10, r=10, t=50, b=50),
 707            xaxis=dict(
 708                showgrid=False,  # Enable grid
 709                gridcolor="lightgray",  # Light gray grid lines
 710                zeroline=False,
 711                zerolinecolor="black",  # Black zero line
 712                showline=False,
 713                linewidth=1,
 714                tickangle=-45,
 715                linecolor="black",  # Black border
 716                mirror=True
 717            ),
 718            yaxis=dict(
 719                showgrid=False,  # Enable grid
 720                gridcolor="lightgray",  # Light gray grid lines
 721                zeroline=False,
 722                zerolinecolor="black",  # Black zero line
 723                showline=False,
 724                linewidth=1,
 725                linecolor="black",  # Black border
 726                mirror=True
 727            ), 
 728            height=600,
 729            width=600
 730        )
 731
 732        return fig
 733
 734    def write_equation(self, order=1, quadratic=[]):
 735        """
 736        Write R-style equation for multivariate fitting procedure using the statsmodels package.
 737
 738        Parameters
 739        ----------
 740        order : int, optional
 741            The order of the polynomial. Default is 1.
 742        quadratic : list, optional
 743            The list of quadratic factors. Default is an empty list.
 744
 745        Returns
 746        -------
 747        str
 748            The R-style equation.
 749        """
 750        myfactors = self._factors.copy()
 751        if self._dtypes is not None:
 752            if not isinstance(myfactors, list):
 753                myfactors = myfactors.tolist()  # Convert to list to allow mutable operations
 754            for i in range(len(self._factors)):
 755                if self._dtypes[self._factors[i]] == 'object':
 756                    myfactors[i] = f'C({myfactors[i]})'
 757        eqn = f'{self._response} ~ {myfactors[0]} '
 758        for factor in myfactors[1:]:
 759            eqn += f'+ {factor} '
 760        if order > 1:
 761            for i in range(len(myfactors)):
 762                for j in range(i + 1, len(myfactors)):
 763                    eqn += f'+ {myfactors[i]}:{myfactors[j]} '
 764        if order > 2:
 765            for i in range(len(myfactors)):
 766                for j in range(i + 1, len(myfactors)):
 767                    for k in range(j + 1, len(myfactors)):
 768                        eqn += f'+ {myfactors[i]}:{myfactors[j]}:{myfactors[k]} '
 769        if order > 3:
 770            for i in range(len(myfactors)):
 771                for j in range(i + 1, len(myfactors)):
 772                    for k in range(j + 1, len(myfactors)):
 773                        for l in range(k + 1, len(myfactors)):
 774                            eqn += f'+ {myfactors[i]}:{myfactors[j]}:{myfactors[k]}:{myfactors[l]} '
 775        if len(quadratic) > 0:
 776            for factor in quadratic:
 777                eqn += f'+ I({factor}**2) '
 778        self._equation = eqn
 779        return self._equation
 780
 781    def compute_linear_model(self, order=1, quadratic=[]):
 782        """
 783        Compute the linear model using the statsmodels package.
 784
 785        Parameters
 786        ----------
 787        order : int, optional
 788            The order of the polynomial. Default is 1. The parameter 
 789            is not used if the equation is already set.
 790        quadratic : list, optional
 791            The list of quadratic factors. Default is an empty list.
 792            The parameter is not used if the equation is already set.
 793
 794        Returns
 795        -------
 796        statsmodels.regression.linear_model.RegressionResultsWrapper
 797            The fitted linear model.
 798        """
 799        if self._equation == '':
 800            eqn = self.write_equation(order=order, quadratic=quadratic)
 801        else:
 802            eqn = self._equation
 803        model = smf.ols(formula=eqn, data=self._data)
 804        self._linear_model = model.fit()
 805        return self._linear_model
 806
 807    def plot_linear_model(self):
 808        """
 809        Plot the linear model using plotly.
 810
 811        Returns
 812        -------
 813        fig : list
 814            The list of plotly figures.
 815        """
 816        if self._linear_model is None:
 817            raise Warning("Linear model has not been computed yet.")
 818        fig = [None] * 2
 819        fig[0] = go.Figure()
 820        # Fig[0]: actual vs predicted line
 821        fig[0].add_trace(go.Scatter(x=self._data[self._response],
 822                                    y=self._linear_model.predict(),
 823                                    mode='markers',
 824                                    marker=dict(size=12),
 825                                    name='Actual vs Predicted'))
 826        # Add 1:1 line
 827        fig[0].add_shape(type="line",
 828                      x0=min(self._data[self._response]),
 829                      y0=min(self._data[self._response]),
 830                      x1=max(self._data[self._response]),
 831                      y1=max(self._data[self._response]),
 832                      line=dict(color="Gray", width=1, dash="dash"))
 833        fig[0].update_layout(
 834            plot_bgcolor="white",  # White background
 835            legend=dict(bgcolor='rgba(0,0,0,0)'),
 836            xaxis_title=f'Actual {self._response}',
 837            yaxis_title=f'Predicted {self._response}',
 838            height=500,  # Adjust height as needed
 839            margin=dict(l=10, r=10, t=50, b=50),
 840            xaxis=dict(
 841                showgrid=True,  # Enable grid
 842                gridcolor="lightgray",  # Light gray grid lines
 843                zeroline=False,
 844                zerolinecolor="black",  # Black zero line
 845                showline=True,
 846                linewidth=1,
 847                linecolor="black",  # Black border
 848                mirror=True
 849            ),
 850            yaxis=dict(
 851                showgrid=True,  # Enable grid
 852                gridcolor="lightgray",  # Light gray grid lines
 853                zeroline=False,
 854                zerolinecolor="black",  # Black zero line
 855                showline=True,
 856                linewidth=1,
 857                linecolor="black",  # Black border
 858                mirror=True
 859            )
 860        )
 861        # # # # # # # # # # # # # # # # # # # # #
 862        # Fig[1]: slope values for each factor
 863        # # # # # # # # # # # # # # # # # # # # #
 864        # Plot: Slope values for each factor
 865        res = self._linear_model.params.rename_axis('terms').reset_index()[1:]
 866        res.columns = ['terms', 'slope']
 867        error = self._linear_model.bse.rename_axis('terms').reset_index()[1:]
 868        error.columns = ['terms', 'error']
 869        res['error'] = error['error']
 870        res['pvalue'] = [self._linear_model.pvalues[res['terms']].iloc[i] for i in range(len(res))]
 871
 872        # Sort by p-values
 873        res = res.sort_values(by='pvalue', ascending=False)
 874        # Prepare colors and labels
 875        colors = ['red' if x < 0 else 'green' for x in res['slope']]
 876        res['slope'] = res['slope'].abs()
 877
 878        fig[1] = go.Figure()
 879
 880        # Add bar plot
 881        fig[1].add_trace(go.Bar(
 882            y=[term.replace('I(', '').replace('C(', '').replace(')', '').replace(' ** ', '^') for term in res['terms']],
 883            x=res['slope'],
 884            error_x=dict(type='data', array=res['error']),
 885            marker_color=colors,
 886            orientation='h',
 887            name='Slope',
 888            showlegend=False  # Hide the legend entry for the bar trace
 889        ))
 890
 891        # Update layout for log scale and labels
 892        fig[1].update_layout(
 893            xaxis_title='Magnitude of effect',
 894            xaxis_type="log",
 895            height=500,  # Adjust height as needed
 896            plot_bgcolor="white",  # White background
 897            legend=dict(
 898                orientation="h",  # Horizontal orientation
 899                y=1.1  # Place legend on top
 900            ),
 901            margin=dict(l=10, r=150, t=50, b=50),  # Increase right margin for annotations
 902            xaxis=dict(
 903                showgrid=True,  # Enable grid
 904                gridcolor="lightgray",  # Light gray grid lines
 905                zeroline=False,
 906                zerolinecolor="black",  # Black zero line
 907                showline=True,
 908                linewidth=1,
 909                linecolor="black",  # Black border
 910                mirror=True
 911            ),
 912            yaxis=dict(
 913                showgrid=True,  # Enable grid
 914                gridcolor="lightgray",  # Light gray grid lines
 915                zeroline=False,
 916                zerolinecolor="black",  # Black zero line
 917                showline=True,
 918                linewidth=1,
 919                linecolor="black",  # Black border
 920                mirror=True
 921            )
 922        )
 923
 924        # Add legend for Negative and Positive
 925        fig[1].add_trace(go.Scatter(
 926            x=[None], y=[None], mode='markers',
 927            marker=dict(size=10, color='red'),
 928            name='Negative'
 929        ))
 930        fig[1].add_trace(go.Scatter(
 931            x=[None], y=[None], mode='markers',
 932            marker=dict(size=10, color='green'),
 933            name='Positive'
 934        ))
 935
 936        # Add p-values as annotations outside the plot
 937        for i, p in enumerate(res['slope']):
 938            fig[1].add_annotation(
 939                x=1.025,  # Place annotations outside the plot
 940                y=i,
 941                text=f"p = {self._linear_model.pvalues[res['terms'].iloc[i]]:.2g}",
 942                showarrow=False,
 943                xref="paper",  # Use paper coordinates for x
 944                xanchor='left'
 945            )
 946
 947        return fig
 948
 949    def compute_ML_model(self, **kwargs):
 950        """
 951        Compute the machine learning model.
 952
 953        Parameters
 954        ----------
 955        kwargs : dict
 956            Additional keyword arguments for the model. Overrides default parameters.
 957            
 958        Default Parameters by Model Type
 959        --------------------------------
 960        - **ElasticNetCV:**
 961            - l1_ratio : list, default=[0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1.0].
 962                List of L1 ratios to try.
 963            - cv : int, default=5.
 964                Cross-validation folds.
 965            - max_iter : int, default=1000.
 966                Maximum iterations.
 967        - **RidgeCV:**
 968            - alphas : list, default=[0.1, 1.0, 10.0].
 969                List of alpha values to try.
 970            - cv : int, default=5.
 971                Cross-validation folds.
 972        - **LinearRegression:**
 973            - fit_intercept : bool, default=True.
 974                Whether to calculate the intercept.
 975        - **RandomForest:**
 976            - n_estimators : int, default=100.
 977                Number of trees in the forest.
 978            - max_depth : int or None, default=None.
 979                Maximum depth of trees.
 980            - min_samples_split : int, default=2.
 981                Minimum samples required to split a node.
 982            - random_state : int, default=42.
 983                Random seed for reproducibility.
 984        - **GaussianProcess:**
 985            - kernel : kernel object, default=None.
 986                Kernel for the Gaussian Process.
 987            - alpha : float, default=1e-10.
 988                Value added to diagonal of kernel matrix.
 989            - normalize_y : bool, default=True.
 990                Normalize target values.
 991            - random_state : int, default=42.
 992                Random seed for reproducibility.
 993        - **GradientBoosting:**
 994            - n_estimators : int, default=100.
 995                Number of boosting stages.
 996            - learning_rate : float, default=0.1.
 997                Learning rate.
 998            - max_depth : int, default=3.
 999                Maximum depth of trees.
1000            - random_state : int, default=42.
1001                Random seed for reproducibility.
1002
1003        Returns
1004        -------
1005        object
1006            The fitted machine learning model.
1007        """
1008        if self._model_type is None:
1009            raise ValueError("Model must be provided.")
1010
1011        X = self._data[self._factors]
1012        y = self._data[self._response]
1013        if self._split_size > 0:
1014            X_train, X_test, y_train, y_test = train_test_split(
1015                X, y, test_size=self._split_size)
1016        else:
1017            X_train, X_test, y_train, y_test = X, X, y, y
1018        
1019        # Default parameters for each model type
1020        default_params = {
1021            "ElasticNetCV": {"l1_ratio": [0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1.0],
1022                            "cv": 5, 
1023                            "max_iter": 1000},
1024            "RidgeCV": {"alphas": [0.1, 1.0, 10.0], 
1025                        "cv": 5},
1026            "LinearRegression": {"fit_intercept": True},
1027            "RandomForest": {"n_estimators": 100, 
1028                            "max_depth": None,
1029                            "min_samples_split": 2,
1030                            "random_state": 42},
1031            "GaussianProcess": {"kernel": None, 
1032                                "alpha": 1e-10,
1033                                "normalize_y": True,
1034                                "random_state": 42},
1035            "GradientBoosting": {"n_estimators": 100,
1036                                "learning_rate": 0.1,
1037                                "max_depth": 3,
1038                                "random_state": 42}
1039        }
1040        
1041        # Get default parameters for the selected model
1042        model_defaults = default_params.get(self._model_type, {})
1043        
1044        # Override defaults with any provided kwargs
1045        model_params = {**model_defaults, **kwargs}
1046        
1047        if self._model_type == "ElasticNetCV":
1048            self._model = make_pipeline(StandardScaler(),
1049                                        ElasticNetCV(**model_params))
1050        elif self._model_type == "RidgeCV":
1051            self._model = make_pipeline(StandardScaler(),
1052                                        RidgeCV(**model_params))
1053        elif self._model_type == "LinearRegression":
1054            self._model = make_pipeline(StandardScaler(),
1055                                        LinearRegression(**model_params))
1056        elif self._model_type == "RandomForest":
1057            self._model = make_pipeline(StandardScaler(),
1058                                        RandomForestRegressor(**model_params))
1059        elif self._model_type == "GaussianProcess":
1060            self._model = make_pipeline(StandardScaler(),
1061                                        GaussianProcessRegressor(**model_params))
1062        elif self._model_type == "GradientBoosting":
1063            self._model = make_pipeline(StandardScaler(),
1064                                        GradientBoostingRegressor(**model_params))
1065        
1066        # Fit the model
1067        self._model.fit(X_train, y_train)
1068        return self._model
1069
1070    def plot_ML_model(self, features_in_log=False):
1071        """
1072        Plot the machine learning model using plotly.
1073
1074        Parameters
1075        ----------
1076        features_in_log : bool, optional
1077            Whether to plot the feature importances in log scale. Default is False.
1078
1079        Returns
1080        -------
1081        fig : list
1082            The list of plotly figures.
1083        """
1084        coef_names = self._data[self._factors].columns
1085        X = self._data[self._factors]
1086        y = self._data[self._response]
1087        if self._split_size > 0:
1088            X_train, X_test, y_train, y_test = train_test_split(
1089                X, y, test_size=self._split_size)
1090        else:
1091            X_train, X_test, y_train, y_test = X, X, y, y
1092        bootstrap_coefs = bootstrap_coefficients(
1093            self._model[1], X, y, n_bootstrap=100, random_state=42)
1094        # Calculate mean and standard deviation of coefficients
1095        mean_coefs = np.mean(bootstrap_coefs, axis=0)
1096        std_coefs = np.std(bootstrap_coefs, axis=0)
1097        # make the fig
1098        fig = [None] * 2
1099        fig[0] = go.Figure()
1100        # Add actual vs predicted line
1101        fig[0].add_trace(go.Scatter(x=y_train,
1102                                 y=self._model.predict(X_train),
1103                                 mode='markers',
1104                                 marker=dict(size=12, color='royalblue'),
1105                                 name='Training'))
1106        if self._split_size > 0:
1107            fig[0].add_trace(go.Scatter(x=y_test,
1108                                    y=self._model.predict(X_test),
1109                                    mode='markers',
1110                                    marker=dict(size=12, color='orange'),
1111                                    name='Validation'))
1112        # Add 1:1 line
1113        fig[0].add_shape(type="line",
1114                      x0=min(y_train), y0=min(y_train),
1115                      x1=max(y_train), y1=max(y_train),
1116                      line=dict(color="Gray", width=1, dash="dash"))
1117        fig[0].update_layout(
1118            plot_bgcolor="white",  # White background
1119            legend=dict(bgcolor='rgba(0,0,0,0)'),
1120            xaxis_title=f'Actual {self._response}',
1121            yaxis_title=f'Predicted {self._response}',
1122            height=500,  # Adjust height as needed
1123            margin=dict(l=10, r=10, t=120, b=50),
1124            xaxis=dict(
1125                showgrid=True,  # Enable grid
1126                gridcolor="lightgray",  # Light gray grid lines
1127                zeroline=False,
1128                zerolinecolor="black",  # Black zero line
1129                showline=True,
1130                linewidth=1,
1131                linecolor="black",  # Black border
1132                mirror=True
1133            ),
1134            yaxis=dict(
1135                showgrid=True,  # Enable grid
1136                gridcolor="lightgray",  # Light gray grid lines
1137                zeroline=False,
1138                zerolinecolor="black",  # Black zero line
1139                showline=True,
1140                linewidth=1,
1141                linecolor="black",  # Black border
1142                mirror=True
1143            )
1144        )
1145        # add the score in title
1146        fig[0].update_layout(
1147            title={
1148                'text': f"Predicted vs Actual for {str(self._model[1])}<br>R²_test = {r2_score(y_test, self._model.predict(X_test)):.4f}  -  R²_train = {r2_score(y_train, self._model.predict(X_train)):.4f}<br>RMSE_test = {root_mean_squared_error(y_test, self._model.predict(X_test)):.4f}  -  RMSE_train = {root_mean_squared_error(y_train, self._model.predict(X_train)):.4f}",
1149                'x': 0.45,
1150                'xanchor': 'center'
1151            }
1152        )
1153        # make feature importance plot
1154        if self._model_type != "GaussianProcess":
1155            fig[1] = go.Figure()
1156            pos_coefs = mean_coefs[mean_coefs > 0]
1157            pos_coefs_names = coef_names[mean_coefs > 0]
1158            neg_coefs = mean_coefs[mean_coefs < 0]
1159            neg_coefs_names = coef_names[mean_coefs < 0]
1160            # Add bars for positive mean coefficients
1161            fig[1].add_trace(go.Bar(
1162                y=[f"{pos_coefs_names[i]}" for i in range(len(pos_coefs))],
1163                x=mean_coefs[mean_coefs > 0],
1164                error_x=dict(type='data', array=std_coefs[mean_coefs > 0], visible=True),
1165                orientation='h',
1166                marker_color='royalblue',
1167                name='Positive'
1168            ))
1169            fig[1].add_trace(go.Bar(
1170                y=[f"{neg_coefs_names[i]}" for i in range(len(neg_coefs))],
1171                x=-mean_coefs[mean_coefs < 0],
1172                error_x=dict(type='data', array=std_coefs[mean_coefs < 0], visible=True),
1173                orientation='h',
1174                marker_color='orange',
1175                name='Negative'
1176            ))
1177            # Update layout
1178            fig[1].update_layout(
1179                title={
1180                    'text': f"Features importance for {str(self._model[1])}",
1181                    'x': 0.5,
1182                    'xanchor': 'center'
1183                },
1184                xaxis_title="Coefficient Value",
1185                yaxis_title="Features",
1186                barmode='relative',
1187                margin=dict(l=150)
1188            )
1189            if features_in_log:
1190                fig[1].update_xaxes(type="log")
1191            else:
1192                fig[1].update_xaxes(type="linear")
1193            fig[1].update_layout(
1194                plot_bgcolor="white",  # White background
1195                legend=dict(bgcolor='rgba(0,0,0,0)'),
1196                height=500,  # Adjust height as needed
1197                margin=dict(l=10, r=10, t=120, b=50),
1198                xaxis=dict(
1199                    showgrid=True,  # Enable grid
1200                    gridcolor="lightgray",  # Light gray grid lines
1201                    zeroline=False,
1202                    zerolinecolor="black",  # Black zero line
1203                    showline=True,
1204                    linewidth=1,
1205                    linecolor="black",  # Black border
1206                    mirror=True
1207                ),
1208                yaxis=dict(
1209                    showgrid=True,  # Enable grid
1210                    gridcolor="lightgray",  # Light gray grid lines
1211                    zeroline=False,
1212                    zerolinecolor="black",  # Black zero line
1213                    showline=True,
1214                    linewidth=1,
1215                    linecolor="black",  # Black border
1216                    mirror=True
1217                )
1218            )
1219        else:
1220            print('GaussianProcess does not have coefficients or feature importances')
1221        return fig
1222
1223    def predict(self, X=None, model='all'):
1224        """
1225        Predict using the machine learning model and the linear model,
1226        if they are trained. Use the encoders to encode the data.
1227        If X is not provided, use the original data.
1228        If the model has not been trained, raise a warning.
1229
1230        Parameters
1231        ----------
1232        X : pd.DataFrame, optional
1233            The input features. Default is None, which uses the original data.
1234
1235        Returns
1236        -------
1237        pd.DataFrame
1238            The predicted values with a column indicating the model used.
1239        """
1240        if X is None:
1241            X = self._data[self._factors].copy()
1242        else:
1243            X = X.copy()
1244
1245        # Encode the input data using the same encoders
1246        for factor in self._factors:
1247            if factor in self._encoders:
1248                X[factor] = self._encoders[factor].transform(X[factor].astype(str))
1249
1250        predML = None
1251        predlin = None
1252
1253        if self._model is not None:
1254            # Check if the model is fitted
1255            if not hasattr(self._model, 'predict'):
1256                raise Warning("Model has not been trained yet.")
1257            predML = self._model.predict(X)
1258            predML = pd.DataFrame(predML, columns=['prediction'])
1259            predML['model'] = self._model_type
1260
1261        if self._linear_model is not None:
1262            # Check if the linear model is fitted
1263            if not hasattr(self._linear_model, 'predict'):
1264                raise Warning("Linear model has not been trained yet.")
1265            predlin = self._linear_model.predict(X)
1266            predlin = pd.DataFrame(predlin, columns=['prediction'])
1267            predlin['model'] = 'Linear Model'
1268
1269        if predML is not None and predlin is not None and model == 'all':
1270            # Combine the predictions on top of each other
1271            pred = pd.concat([predML, predlin], axis=0)
1272            pred = pred.reset_index(drop=True)
1273            return pred
1274        elif predML is not None and model == 'ML':
1275            return predML
1276        elif predlin is not None and model == 'linear':
1277            return predlin
1278        else:
1279            raise Warning("No model has been trained yet.")

This class is used to analyze the data and perform regression analysis.

Parameters
  • data (pd.DataFrame): The input data.
  • factors (list): The list of factor variables.
  • response (str): The response variable.
  • split_size (float, optional): The proportion of the dataset to include in the test split. Default is 0.2.
  • model_type (str, optional): The type of machine learning model to use. Default is None. Must be one of the following: "ElasticNetCV", "RidgeCV", "LinearRegression", "RandomForest", "GaussianProcess", "GradientBoosting".
Attributes
  • data (pd.DataFrame): The input data.
  • factors (list): The list of factor variables.
  • response (str): The response variable.
  • encoders (dict): The encoders for categorical variables.
  • dtypes (pd.Series): The data types of the columns.
  • linear_model (object): The linear model object.
  • equation (str): The equation for the linear model, in the form response ~ var1 + var2 + var1:var2.
  • model (object): The machine learning model object.
  • model_type (str): The type of machine learning model to use.
  • split_size (float): The proportion of the dataset to include in the test split.
Methods
  • encode_data(): Encodes categorical variables in the data. Called during initialization.
  • plot_qq(): Plots a Q-Q plot for the response variable using plotly.
  • plot_boxplot(): Plots a boxplot for the response variable using plotly.
  • plot_histogram(): Plots a histogram for the response variable using plotly.
  • plot_scatter_response(): Plots a scatter plot for the response variable using plotly.
  • plot_corr(): Plots a correlation matrix for the data using plotly.
  • plot_pairplot_seaborn(): Plots a pairplot for the data using seaborn.
  • plot_pairplot_plotly(): Plots a pairplot for the data using plotly.
  • write_equation(order=1, quadratic=[]): Writes R-style equation for multivariate fitting procedure using the statsmodels package.
  • compute_linear_model(order=1, quadratic=[]): Computes the linear model using the statsmodels package.
  • plot_linear_model(): Plots the linear model using plotly.
  • compute_ML_model(**kwargs): Computes the machine learning model using sklearn.
  • plot_ML_model(features_in_log=False): Plots the machine learning model using plotly.
Example
from optimeo.analysis import * 

data = pd.read_csv('dataML.csv')
factors = data.columns[:-1]
response = data.columns[-1]
analysis = DataAnalysis(data, factors, response)
analysis.model_type = "ElasticNetCV"
MLmodel = analysis.compute_ML_model()
figs = analysis.plot_ML_model()
for fig in figs:
    fig.show()
DataAnalysis( data: pandas.core.frame.DataFrame, factors: list, response: str, split_size=0.2, model_type=None)
138    def __init__(self, 
139                 data: pd.DataFrame, 
140                 factors: list, 
141                 response: str, 
142                 split_size=.2, 
143                 model_type=None):
144        self._dtypes = None
145        self._encoders = {}
146        self._linear_model = None
147        self._model = None
148        self._equation = ''
149        self._data = data
150        self._factors = factors
151        self._response = response
152        self._model_type = model_type
153        self._split_size = split_size
154        self.encode_data()
data
156    @property
157    def data(self):
158        """The input `pandas.DataFrame`."""
159        return self._data

The input pandas.DataFrame.

def encode_data(self):
167    def encode_data(self):
168        """
169        Called during initialization: encodes categorical variables in the data if there are any. 
170        Uses `LabelEncoder()` from `sklearn` to convert categorical variables to numerical values.
171        Also drops rows with NaN values.
172        """
173        self._dtypes = self._data.dtypes
174        for factor in self._factors:
175            if self._dtypes[factor] == 'object':
176                le = LabelEncoder()
177                self._encoders[factor] = le
178                self._data[factor] = le.fit_transform([str(d) for d in self._data[factor]])
179        self.data = self.data[self._factors + [self._response]]
180        self.data = self.data.dropna(axis=0, how='any')

Called during initialization: encodes categorical variables in the data if there are any. Uses LabelEncoder() from sklearn to convert categorical variables to numerical values. Also drops rows with NaN values.

factors
182    @property
183    def factors(self):
184        """The list of names of the columns of the `data` DataFrame that contain factor variables."""
185        return self._factors

The list of names of the columns of the data DataFrame that contain factor variables.

encoders
193    @property
194    def encoders(self):
195        """The list of encoders for categorical variables."""
196        return self._encoders

The list of encoders for categorical variables.

dtypes
204    @property
205    def dtypes(self):
206        """Get the data types of the columns."""
207        return self._dtypes

Get the data types of the columns.

response
213    @property
214    def response(self):
215        """The name of the column of the `data` DataFrame that contain the response variable."""
216        return self._response

The name of the column of the data DataFrame that contain the response variable.

linear_model
224    @property
225    def linear_model(self):
226        """Get the linear model."""
227        return self._linear_model

Get the linear model.

equation
233    @property
234    def equation(self):
235        """The equation for the linear model, in the form `response ~ var1 + var2 + var1:var2`. This is based on the [statsmodels package](https://www.statsmodels.org/dev/examples/notebooks/generated/formulas.html)."""
236        return self._equation

The equation for the linear model, in the form response ~ var1 + var2 + var1:var2. This is based on the statsmodels package.

model_type
244    @property
245    def model_type(self):
246        """The type of machine learning model to use. Default is None. 
247            Must be one of the following: `"ElasticNetCV"`, `"RidgeCV"`,
248            `"LinearRegression"`, `"RandomForest"`, `"GaussianProcess"`, `"GradientBoosting"`."""
249        return self._model_type

The type of machine learning model to use. Default is None. Must be one of the following: "ElasticNetCV", "RidgeCV", "LinearRegression", "RandomForest", "GaussianProcess", "GradientBoosting".

model
261    @property
262    def model(self):
263        """The machine learning model object."""
264        return self._model

The machine learning model object.

split_size
270    @property
271    def split_size(self):
272        """The proportion of the dataset to include in the test split. Default is `0.2`."""
273        return self._split_size

The proportion of the dataset to include in the test split. Default is 0.2.

def plot_qq(self):
295    def plot_qq(self):
296        """
297        Plot a Q-Q plot for the response variable.
298
299        Returns
300        -------
301        fig : plotly.graph_objs.Figure
302            The Q-Q plot figure.
303        """
304        qqplot_data = pd.Series(self._data[self._response], copy=True)
305        qqplot_data = qqplot(qqplot_data, line='s').gca().lines
306        fig = go.Figure()
307
308        fig.add_trace({
309            'type': 'scatter',
310            'x': qqplot_data[0].get_xdata(),
311            'y': qqplot_data[0].get_ydata(),
312            'mode': 'markers',
313        })
314
315        fig.add_trace({
316            'type': 'scatter',
317            'x': qqplot_data[1].get_xdata(),
318            'y': qqplot_data[1].get_ydata(),
319            'mode': 'lines',
320            'line': {
321                'color': 'red'
322            }
323        })
324
325        fig.update_layout(
326            title='Q-Q Plot',
327            xaxis_title='Theoretical Quantiles',
328            yaxis_title='Sample Quantiles',
329            plot_bgcolor="white",  # White background
330            showlegend=False,
331            margin=dict(l=10, r=10, t=50, b=50),
332            xaxis=dict(
333                showgrid=True,  # Enable grid
334                gridcolor="lightgray",  # Light gray grid lines
335                zeroline=False,
336                zerolinecolor="black",  # Black zero line
337                showline=True,
338                linewidth=1,
339                linecolor="black",  # Black border
340                mirror=True
341            ),
342            yaxis=dict(
343                showgrid=True,  # Enable grid
344                gridcolor="lightgray",  # Light gray grid lines
345                zeroline=False,
346                zerolinecolor="black",  # Black zero line
347                showline=True,
348                linewidth=1,
349                linecolor="black",  # Black border
350                mirror=True
351            ),
352            height=300,
353        )
354        return fig

Plot a Q-Q plot for the response variable.

Returns
  • fig (plotly.graph_objs.Figure): The Q-Q plot figure.
def plot_boxplot(self):
356    def plot_boxplot(self):
357        """
358        Plot a boxplot for the response variable.
359
360        Returns
361        -------
362        fig : plotly.graph_objs.Figure
363            The boxplot figure.
364        """
365        fig = px.box(self._data, y=self._response, points="all", title='Box Plot')
366        fig.update_layout(
367            plot_bgcolor="white",  # White background
368            legend=dict(bgcolor='rgba(0,0,0,0)'),
369            margin=dict(l=10, r=10, t=50, b=50),
370            xaxis=dict(
371                showgrid=True,  # Enable grid
372                gridcolor="lightgray",  # Light gray grid lines
373                zeroline=False,
374                zerolinecolor="black",  # Black zero line
375                showline=True,
376                linewidth=1,
377                linecolor="black",  # Black border
378                mirror=True
379            ),
380            yaxis=dict(
381                showgrid=True,  # Enable grid
382                gridcolor="lightgray",  # Light gray grid lines
383                zeroline=False,
384                zerolinecolor="black",  # Black zero line
385                showline=True,
386                linewidth=1,
387                linecolor="black",  # Black border
388                mirror=True
389            ), height=300,
390        )
391        return fig

Plot a boxplot for the response variable.

Returns
  • fig (plotly.graph_objs.Figure): The boxplot figure.
def plot_histogram(self):
393    def plot_histogram(self):
394        """
395        Plot a histogram for the response variable.
396
397        Returns
398        -------
399        fig : plotly.graph_objs.Figure
400            The histogram figure.
401        """
402        fig = px.histogram(self._data, x=self._response, title='Histogram')
403        fig.update_layout(
404            plot_bgcolor="white",  # White background
405            legend=dict(bgcolor='rgba(0,0,0,0)'),
406            margin=dict(l=10, r=10, t=50, b=50),
407            xaxis=dict(
408                showgrid=True,  # Enable grid
409                gridcolor="lightgray",  # Light gray grid lines
410                zeroline=False,
411                zerolinecolor="black",  # Black zero line
412                showline=True,
413                linewidth=1,
414                linecolor="black",  # Black border
415                mirror=True
416            ),
417            yaxis=dict(
418                showgrid=True,  # Enable grid
419                gridcolor="lightgray",  # Light gray grid lines
420                zeroline=False,
421                zerolinecolor="black",  # Black zero line
422                showline=True,
423                linewidth=1,
424                linecolor="black",  # Black border
425                mirror=True
426            ), height=300,
427        )
428        return fig

Plot a histogram for the response variable.

Returns
  • fig (plotly.graph_objs.Figure): The histogram figure.
def plot_scatter_response(self):
430    def plot_scatter_response(self):
431        """
432        Plot a scatter plot for the response variable.
433
434        Returns
435        -------
436        fig : plotly.graph_objs.Figure
437            The scatter plot figure.
438        """
439        fig = px.scatter(x=np.arange(1, len(self._data[self._response]) + 1),
440                         y=self._data[self._response],
441                         labels={'x': 'Measurement number', 'y': self._response},
442                         title='Scatter Plot')
443        fig.update_layout(
444            plot_bgcolor="white",  # White background
445            legend=dict(bgcolor='rgba(0,0,0,0)'),
446            margin=dict(l=10, r=10, t=50, b=50),
447            xaxis=dict(
448                showgrid=True,  # Enable grid
449                gridcolor="lightgray",  # Light gray grid lines
450                zeroline=False,
451                zerolinecolor="black",  # Black zero line
452                showline=True,
453                linewidth=1,
454                linecolor="black",  # Black border
455                mirror=True
456            ),
457            yaxis=dict(
458                showgrid=True,  # Enable grid
459                gridcolor="lightgray",  # Light gray grid lines
460                zeroline=False,
461                zerolinecolor="black",  # Black zero line
462                showline=True,
463                linewidth=1,
464                linecolor="black",  # Black border
465                mirror=True
466            ), height=300,
467        )
468        return fig

Plot a scatter plot for the response variable.

Returns
  • fig (plotly.graph_objs.Figure): The scatter plot figure.
def plot_pairplot_seaborn(self):
470    def plot_pairplot_seaborn(self):
471        """
472        Plot a pairplot for the data using seaborn.
473
474        Returns
475        -------
476        fig : seaborn.axisgrid.PairGrid
477            The pairplot figure.
478        """
479        fig = sns.pairplot(
480            self._data,
481            kind="reg",
482            diag_kind="kde",
483            plot_kws={"scatter_kws": {"alpha": 0.1}},
484        )
485        return fig

Plot a pairplot for the data using seaborn.

Returns
  • fig (seaborn.axisgrid.PairGrid): The pairplot figure.
def plot_pairplot_plotly(self):
487    def plot_pairplot_plotly(self):
488        """
489        Plot a pairplot for the data using plotly.
490
491        Returns
492        -------
493        fig : plotly.graph_objs.Figure
494            The plotly figure.
495        """
496        
497        # Get the column names
498        columns = self._data.columns
499        n_cols = len(columns)
500
501        # Create subplot grid with partially shared axes
502        fig = make_subplots(
503            rows=n_cols,
504            cols=n_cols,
505            shared_xaxes=True,  
506            shared_yaxes=True,  
507            vertical_spacing=0.05,
508            horizontal_spacing=0.05
509        )
510
511        # Add scatter plots and regression lines to each subplot
512        for i in range(n_cols):
513            for j in range(n_cols):
514                x_data = self._data[columns[j]]
515                y_data = self._data[columns[i]]
516
517                if i == j:  # Diagonal: Add KDE plot
518                    # Calculate KDE
519                    data = self._data[columns[i]].dropna()
520                    if len(data) > 1:  # Ensure enough data points
521                        kde_x = np.linspace(data.min(), data.max(), 100)
522                        kde = stats.gaussian_kde(data)
523                        kde_y = kde(kde_x)
524                        # Scale to match y-axis (needed for top left corner)
525                        kde_y = kde_y / kde_y.max() * np.max(y_data.dropna())  
526
527                        # Add KDE plot to diagonal
528                        fig.add_trace(
529                            go.Scatter(
530                                x=kde_x,
531                                y=kde_y,
532                                mode='lines',
533                                fill='tozeroy',
534                                line=dict(color='rgba(29, 81, 189, 1)'),
535                                showlegend=False
536                            ),
537                            row=i+1, col=j+1
538                        )
539                        # Ensure the y-axis is independent for diagonal plots
540                        # don't know why, doesn't work with top left corner
541                        fig.update_yaxes(
542                            matches=None,  # Ensure independent y-axis
543                            showticklabels=True,  # Show y-axis values
544                            row=i+1,
545                            col=j+1
546                        )
547                else:  # Off-diagonal: Add scatter plot with regression line
548                    # Add scatter plot
549                    fig.add_trace(
550                        go.Scatter(
551                            x=x_data,
552                            y=y_data,
553                            mode='markers',
554                            marker=dict(
555                                color='rgba(29, 81, 189, 0.5)',
556                                size=5
557                            ),
558                            showlegend=False
559                        ),
560                        row=i+1, col=j+1
561                    )
562
563                    # Calculate and add regression line
564                    if len(x_data.dropna()) > 1 and len(y_data.dropna()) > 1:
565                        # Drop NaN values for regression calculation
566                        valid_indices = x_data.notna() & y_data.notna()
567                        x_clean = x_data[valid_indices]
568                        y_clean = y_data[valid_indices]
569                        ytype = y_clean.dtype
570                        xtype = x_clean.dtype
571                        unique_x = x_clean.unique()
572                        if len(x_clean) > 1 and len(unique_x)>1 and ytype != 'object' and xtype != 'object':
573                            # Calculate regression parameters
574                            slope, intercept, r_value, p_value, std_err = stats.linregress(x_clean, y_clean)
575                            x_range = np.linspace(x_clean.min(), x_clean.max(), 100)
576                            y_pred = slope * x_range + intercept
577
578                            # Standard error of estimate
579                            y_fit = slope * x_clean + intercept
580                            residuals = y_clean - y_fit
581                            dof = len(x_clean) - 2
582                            residual_std_error = np.sqrt(np.sum(residuals**2) / dof)
583
584                            mean_x = np.mean(x_clean)
585                            t_val = t.ppf(0.975, dof)  # 95% confidence
586
587                            se_line = residual_std_error * np.sqrt(1/len(x_clean) + (x_range - mean_x)**2 / np.sum((x_clean - mean_x)**2))
588                            y_upper = y_pred + t_val * se_line
589                            y_lower = y_pred - t_val * se_line
590
591                            # Add regression line
592                            fig.add_trace(
593                                go.Scatter(
594                                    x=x_range,
595                                    y=y_pred,
596                                    mode='lines',
597                                    line=dict(color='red', width=2),
598                                    showlegend=False
599                                ),
600                                row=i+1, col=j+1
601                            )
602
603                            # Add confidence interval area
604                            fig.add_trace(
605                                go.Scatter(
606                                    x=np.concatenate([x_range, x_range[::-1]]),
607                                    y=np.concatenate([y_upper, y_lower[::-1]]),
608                                    fill='toself',
609                                    fillcolor='rgba(255, 0, 0, 0.2)',
610                                    line=dict(color='rgba(255, 0, 0, 0)'),
611                                    showlegend=False
612                                ),
613                                row=i+1, col=j+1
614                            )
615
616        # Update layout and axis properties
617        fig.update_layout(
618            margin=dict(l=10, r=10, t=50, b=50),
619            title="Pair Plot",
620            height=600,
621            width=600,
622            showlegend=False,
623            plot_bgcolor="white"
624        )
625
626        # Update all axes properties first (hiding tick labels by default)
627        fig.update_xaxes(
628            showgrid=True,
629            gridcolor="lightgray",
630            zeroline=False,
631            showline=True,
632            linewidth=1,
633            linecolor="black",
634            mirror=True,
635            showticklabels=False
636        )
637
638        fig.update_yaxes(
639            showgrid=True,
640            gridcolor="lightgray",
641            zeroline=False,
642            showline=True,
643            linewidth=1,
644            linecolor="black",
645            mirror=True,
646            showticklabels=False
647        )
648
649        # Show tick labels and titles only for bottom row and leftmost column
650        for i, col_name in enumerate(columns):
651            # Bottom row: show x-axis titles and tick labels
652            fig.update_xaxes(
653                title_text=col_name,
654                showticklabels=True,
655                row=n_cols,
656                col=i+1
657            )
658
659            # Leftmost column: show y-axis titles and tick labels
660            fig.update_yaxes(
661                title_text=col_name,
662                showticklabels=True,
663                row=i+1,
664                col=1
665            )
666            
667        return fig

Plot a pairplot for the data using plotly.

Returns
  • fig (plotly.graph_objs.Figure): The plotly figure.
def plot_corr(self):
669    def plot_corr(self):
670        """
671        Plot a correlation matrix for the data.
672
673        Returns
674        -------
675        fig : seaborn.axisgrid.PairGrid
676            The pairplot figure.
677        """
678        corr_matrix = self._data.corr(method='pearson')
679        mask = np.triu(np.ones_like(corr_matrix, dtype=bool), k=1)
680        # Apply mask - replace upper triangle with NaN values
681        corr_matrix_lower = corr_matrix.copy()
682        corr_matrix_lower.values[mask] = np.nan
683        
684        fig = px.imshow(
685            corr_matrix_lower,
686            text_auto='.2f',  # Display correlation values
687            color_continuous_scale='RdBu_r',  # Red to Blue color scale (reversed)
688            zmin=-1,  # Minimum correlation value
689            zmax=1,   # Maximum correlation value
690            aspect="auto",  # Keep aspect ratio adaptive
691            title="Pearson Correlation Heatmap"
692        )
693        # Customize hover template
694        fig.update_traces(
695            hovertemplate='<b>%{y} vs %{x}</b><br>Correlation: %{z}<extra></extra>'
696        )
697        # Improve layout
698        fig.update_layout(
699            coloraxis_colorbar=dict(
700                title="Correlation",
701                tickvals=[-1, -0.5, 0, 0.5, 1],
702                ticktext=["-1", "-0.5", "0", "0.5", "1"],
703            ),
704            plot_bgcolor="white",  # White background
705            legend=dict(bgcolor='rgba(0,0,0,0)'),
706            margin=dict(l=10, r=10, t=50, b=50),
707            xaxis=dict(
708                showgrid=False,  # Enable grid
709                gridcolor="lightgray",  # Light gray grid lines
710                zeroline=False,
711                zerolinecolor="black",  # Black zero line
712                showline=False,
713                linewidth=1,
714                tickangle=-45,
715                linecolor="black",  # Black border
716                mirror=True
717            ),
718            yaxis=dict(
719                showgrid=False,  # Enable grid
720                gridcolor="lightgray",  # Light gray grid lines
721                zeroline=False,
722                zerolinecolor="black",  # Black zero line
723                showline=False,
724                linewidth=1,
725                linecolor="black",  # Black border
726                mirror=True
727            ), 
728            height=600,
729            width=600
730        )
731
732        return fig

Plot a correlation matrix for the data.

Returns
  • fig (seaborn.axisgrid.PairGrid): The pairplot figure.
def write_equation(self, order=1, quadratic=[]):
734    def write_equation(self, order=1, quadratic=[]):
735        """
736        Write R-style equation for multivariate fitting procedure using the statsmodels package.
737
738        Parameters
739        ----------
740        order : int, optional
741            The order of the polynomial. Default is 1.
742        quadratic : list, optional
743            The list of quadratic factors. Default is an empty list.
744
745        Returns
746        -------
747        str
748            The R-style equation.
749        """
750        myfactors = self._factors.copy()
751        if self._dtypes is not None:
752            if not isinstance(myfactors, list):
753                myfactors = myfactors.tolist()  # Convert to list to allow mutable operations
754            for i in range(len(self._factors)):
755                if self._dtypes[self._factors[i]] == 'object':
756                    myfactors[i] = f'C({myfactors[i]})'
757        eqn = f'{self._response} ~ {myfactors[0]} '
758        for factor in myfactors[1:]:
759            eqn += f'+ {factor} '
760        if order > 1:
761            for i in range(len(myfactors)):
762                for j in range(i + 1, len(myfactors)):
763                    eqn += f'+ {myfactors[i]}:{myfactors[j]} '
764        if order > 2:
765            for i in range(len(myfactors)):
766                for j in range(i + 1, len(myfactors)):
767                    for k in range(j + 1, len(myfactors)):
768                        eqn += f'+ {myfactors[i]}:{myfactors[j]}:{myfactors[k]} '
769        if order > 3:
770            for i in range(len(myfactors)):
771                for j in range(i + 1, len(myfactors)):
772                    for k in range(j + 1, len(myfactors)):
773                        for l in range(k + 1, len(myfactors)):
774                            eqn += f'+ {myfactors[i]}:{myfactors[j]}:{myfactors[k]}:{myfactors[l]} '
775        if len(quadratic) > 0:
776            for factor in quadratic:
777                eqn += f'+ I({factor}**2) '
778        self._equation = eqn
779        return self._equation

Write R-style equation for multivariate fitting procedure using the statsmodels package.

Parameters
  • order (int, optional): The order of the polynomial. Default is 1.
  • quadratic (list, optional): The list of quadratic factors. Default is an empty list.
Returns
  • str: The R-style equation.
def compute_linear_model(self, order=1, quadratic=[]):
781    def compute_linear_model(self, order=1, quadratic=[]):
782        """
783        Compute the linear model using the statsmodels package.
784
785        Parameters
786        ----------
787        order : int, optional
788            The order of the polynomial. Default is 1. The parameter 
789            is not used if the equation is already set.
790        quadratic : list, optional
791            The list of quadratic factors. Default is an empty list.
792            The parameter is not used if the equation is already set.
793
794        Returns
795        -------
796        statsmodels.regression.linear_model.RegressionResultsWrapper
797            The fitted linear model.
798        """
799        if self._equation == '':
800            eqn = self.write_equation(order=order, quadratic=quadratic)
801        else:
802            eqn = self._equation
803        model = smf.ols(formula=eqn, data=self._data)
804        self._linear_model = model.fit()
805        return self._linear_model

Compute the linear model using the statsmodels package.

Parameters
  • order (int, optional): The order of the polynomial. Default is 1. The parameter is not used if the equation is already set.
  • quadratic (list, optional): The list of quadratic factors. Default is an empty list. The parameter is not used if the equation is already set.
Returns
  • statsmodels.regression.linear_model.RegressionResultsWrapper: The fitted linear model.
def plot_linear_model(self):
807    def plot_linear_model(self):
808        """
809        Plot the linear model using plotly.
810
811        Returns
812        -------
813        fig : list
814            The list of plotly figures.
815        """
816        if self._linear_model is None:
817            raise Warning("Linear model has not been computed yet.")
818        fig = [None] * 2
819        fig[0] = go.Figure()
820        # Fig[0]: actual vs predicted line
821        fig[0].add_trace(go.Scatter(x=self._data[self._response],
822                                    y=self._linear_model.predict(),
823                                    mode='markers',
824                                    marker=dict(size=12),
825                                    name='Actual vs Predicted'))
826        # Add 1:1 line
827        fig[0].add_shape(type="line",
828                      x0=min(self._data[self._response]),
829                      y0=min(self._data[self._response]),
830                      x1=max(self._data[self._response]),
831                      y1=max(self._data[self._response]),
832                      line=dict(color="Gray", width=1, dash="dash"))
833        fig[0].update_layout(
834            plot_bgcolor="white",  # White background
835            legend=dict(bgcolor='rgba(0,0,0,0)'),
836            xaxis_title=f'Actual {self._response}',
837            yaxis_title=f'Predicted {self._response}',
838            height=500,  # Adjust height as needed
839            margin=dict(l=10, r=10, t=50, b=50),
840            xaxis=dict(
841                showgrid=True,  # Enable grid
842                gridcolor="lightgray",  # Light gray grid lines
843                zeroline=False,
844                zerolinecolor="black",  # Black zero line
845                showline=True,
846                linewidth=1,
847                linecolor="black",  # Black border
848                mirror=True
849            ),
850            yaxis=dict(
851                showgrid=True,  # Enable grid
852                gridcolor="lightgray",  # Light gray grid lines
853                zeroline=False,
854                zerolinecolor="black",  # Black zero line
855                showline=True,
856                linewidth=1,
857                linecolor="black",  # Black border
858                mirror=True
859            )
860        )
861        # # # # # # # # # # # # # # # # # # # # #
862        # Fig[1]: slope values for each factor
863        # # # # # # # # # # # # # # # # # # # # #
864        # Plot: Slope values for each factor
865        res = self._linear_model.params.rename_axis('terms').reset_index()[1:]
866        res.columns = ['terms', 'slope']
867        error = self._linear_model.bse.rename_axis('terms').reset_index()[1:]
868        error.columns = ['terms', 'error']
869        res['error'] = error['error']
870        res['pvalue'] = [self._linear_model.pvalues[res['terms']].iloc[i] for i in range(len(res))]
871
872        # Sort by p-values
873        res = res.sort_values(by='pvalue', ascending=False)
874        # Prepare colors and labels
875        colors = ['red' if x < 0 else 'green' for x in res['slope']]
876        res['slope'] = res['slope'].abs()
877
878        fig[1] = go.Figure()
879
880        # Add bar plot
881        fig[1].add_trace(go.Bar(
882            y=[term.replace('I(', '').replace('C(', '').replace(')', '').replace(' ** ', '^') for term in res['terms']],
883            x=res['slope'],
884            error_x=dict(type='data', array=res['error']),
885            marker_color=colors,
886            orientation='h',
887            name='Slope',
888            showlegend=False  # Hide the legend entry for the bar trace
889        ))
890
891        # Update layout for log scale and labels
892        fig[1].update_layout(
893            xaxis_title='Magnitude of effect',
894            xaxis_type="log",
895            height=500,  # Adjust height as needed
896            plot_bgcolor="white",  # White background
897            legend=dict(
898                orientation="h",  # Horizontal orientation
899                y=1.1  # Place legend on top
900            ),
901            margin=dict(l=10, r=150, t=50, b=50),  # Increase right margin for annotations
902            xaxis=dict(
903                showgrid=True,  # Enable grid
904                gridcolor="lightgray",  # Light gray grid lines
905                zeroline=False,
906                zerolinecolor="black",  # Black zero line
907                showline=True,
908                linewidth=1,
909                linecolor="black",  # Black border
910                mirror=True
911            ),
912            yaxis=dict(
913                showgrid=True,  # Enable grid
914                gridcolor="lightgray",  # Light gray grid lines
915                zeroline=False,
916                zerolinecolor="black",  # Black zero line
917                showline=True,
918                linewidth=1,
919                linecolor="black",  # Black border
920                mirror=True
921            )
922        )
923
924        # Add legend for Negative and Positive
925        fig[1].add_trace(go.Scatter(
926            x=[None], y=[None], mode='markers',
927            marker=dict(size=10, color='red'),
928            name='Negative'
929        ))
930        fig[1].add_trace(go.Scatter(
931            x=[None], y=[None], mode='markers',
932            marker=dict(size=10, color='green'),
933            name='Positive'
934        ))
935
936        # Add p-values as annotations outside the plot
937        for i, p in enumerate(res['slope']):
938            fig[1].add_annotation(
939                x=1.025,  # Place annotations outside the plot
940                y=i,
941                text=f"p = {self._linear_model.pvalues[res['terms'].iloc[i]]:.2g}",
942                showarrow=False,
943                xref="paper",  # Use paper coordinates for x
944                xanchor='left'
945            )
946
947        return fig

Plot the linear model using plotly.

Returns
  • fig (list): The list of plotly figures.
def compute_ML_model(self, **kwargs):
 949    def compute_ML_model(self, **kwargs):
 950        """
 951        Compute the machine learning model.
 952
 953        Parameters
 954        ----------
 955        kwargs : dict
 956            Additional keyword arguments for the model. Overrides default parameters.
 957            
 958        Default Parameters by Model Type
 959        --------------------------------
 960        - **ElasticNetCV:**
 961            - l1_ratio : list, default=[0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1.0].
 962                List of L1 ratios to try.
 963            - cv : int, default=5.
 964                Cross-validation folds.
 965            - max_iter : int, default=1000.
 966                Maximum iterations.
 967        - **RidgeCV:**
 968            - alphas : list, default=[0.1, 1.0, 10.0].
 969                List of alpha values to try.
 970            - cv : int, default=5.
 971                Cross-validation folds.
 972        - **LinearRegression:**
 973            - fit_intercept : bool, default=True.
 974                Whether to calculate the intercept.
 975        - **RandomForest:**
 976            - n_estimators : int, default=100.
 977                Number of trees in the forest.
 978            - max_depth : int or None, default=None.
 979                Maximum depth of trees.
 980            - min_samples_split : int, default=2.
 981                Minimum samples required to split a node.
 982            - random_state : int, default=42.
 983                Random seed for reproducibility.
 984        - **GaussianProcess:**
 985            - kernel : kernel object, default=None.
 986                Kernel for the Gaussian Process.
 987            - alpha : float, default=1e-10.
 988                Value added to diagonal of kernel matrix.
 989            - normalize_y : bool, default=True.
 990                Normalize target values.
 991            - random_state : int, default=42.
 992                Random seed for reproducibility.
 993        - **GradientBoosting:**
 994            - n_estimators : int, default=100.
 995                Number of boosting stages.
 996            - learning_rate : float, default=0.1.
 997                Learning rate.
 998            - max_depth : int, default=3.
 999                Maximum depth of trees.
1000            - random_state : int, default=42.
1001                Random seed for reproducibility.
1002
1003        Returns
1004        -------
1005        object
1006            The fitted machine learning model.
1007        """
1008        if self._model_type is None:
1009            raise ValueError("Model must be provided.")
1010
1011        X = self._data[self._factors]
1012        y = self._data[self._response]
1013        if self._split_size > 0:
1014            X_train, X_test, y_train, y_test = train_test_split(
1015                X, y, test_size=self._split_size)
1016        else:
1017            X_train, X_test, y_train, y_test = X, X, y, y
1018        
1019        # Default parameters for each model type
1020        default_params = {
1021            "ElasticNetCV": {"l1_ratio": [0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1.0],
1022                            "cv": 5, 
1023                            "max_iter": 1000},
1024            "RidgeCV": {"alphas": [0.1, 1.0, 10.0], 
1025                        "cv": 5},
1026            "LinearRegression": {"fit_intercept": True},
1027            "RandomForest": {"n_estimators": 100, 
1028                            "max_depth": None,
1029                            "min_samples_split": 2,
1030                            "random_state": 42},
1031            "GaussianProcess": {"kernel": None, 
1032                                "alpha": 1e-10,
1033                                "normalize_y": True,
1034                                "random_state": 42},
1035            "GradientBoosting": {"n_estimators": 100,
1036                                "learning_rate": 0.1,
1037                                "max_depth": 3,
1038                                "random_state": 42}
1039        }
1040        
1041        # Get default parameters for the selected model
1042        model_defaults = default_params.get(self._model_type, {})
1043        
1044        # Override defaults with any provided kwargs
1045        model_params = {**model_defaults, **kwargs}
1046        
1047        if self._model_type == "ElasticNetCV":
1048            self._model = make_pipeline(StandardScaler(),
1049                                        ElasticNetCV(**model_params))
1050        elif self._model_type == "RidgeCV":
1051            self._model = make_pipeline(StandardScaler(),
1052                                        RidgeCV(**model_params))
1053        elif self._model_type == "LinearRegression":
1054            self._model = make_pipeline(StandardScaler(),
1055                                        LinearRegression(**model_params))
1056        elif self._model_type == "RandomForest":
1057            self._model = make_pipeline(StandardScaler(),
1058                                        RandomForestRegressor(**model_params))
1059        elif self._model_type == "GaussianProcess":
1060            self._model = make_pipeline(StandardScaler(),
1061                                        GaussianProcessRegressor(**model_params))
1062        elif self._model_type == "GradientBoosting":
1063            self._model = make_pipeline(StandardScaler(),
1064                                        GradientBoostingRegressor(**model_params))
1065        
1066        # Fit the model
1067        self._model.fit(X_train, y_train)
1068        return self._model

Compute the machine learning model.

Parameters
  • kwargs (dict): Additional keyword arguments for the model. Overrides default parameters.
Default Parameters by Model Type
  • ElasticNetCV:
    • l1_ratio : list, default=[0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1.0]. List of L1 ratios to try.
    • cv : int, default=5. Cross-validation folds.
    • max_iter : int, default=1000. Maximum iterations.
  • RidgeCV:
    • alphas : list, default=[0.1, 1.0, 10.0]. List of alpha values to try.
    • cv : int, default=5. Cross-validation folds.
  • LinearRegression:
    • fit_intercept : bool, default=True. Whether to calculate the intercept.
  • RandomForest:
    • n_estimators : int, default=100. Number of trees in the forest.
    • max_depth : int or None, default=None. Maximum depth of trees.
    • min_samples_split : int, default=2. Minimum samples required to split a node.
    • random_state : int, default=42. Random seed for reproducibility.
  • GaussianProcess:
    • kernel : kernel object, default=None. Kernel for the Gaussian Process.
    • alpha : float, default=1e-10. Value added to diagonal of kernel matrix.
    • normalize_y : bool, default=True. Normalize target values.
    • random_state : int, default=42. Random seed for reproducibility.
  • GradientBoosting:
    • n_estimators : int, default=100. Number of boosting stages.
    • learning_rate : float, default=0.1. Learning rate.
    • max_depth : int, default=3. Maximum depth of trees.
    • random_state : int, default=42. Random seed for reproducibility.
Returns
  • object: The fitted machine learning model.
def plot_ML_model(self, features_in_log=False):
1070    def plot_ML_model(self, features_in_log=False):
1071        """
1072        Plot the machine learning model using plotly.
1073
1074        Parameters
1075        ----------
1076        features_in_log : bool, optional
1077            Whether to plot the feature importances in log scale. Default is False.
1078
1079        Returns
1080        -------
1081        fig : list
1082            The list of plotly figures.
1083        """
1084        coef_names = self._data[self._factors].columns
1085        X = self._data[self._factors]
1086        y = self._data[self._response]
1087        if self._split_size > 0:
1088            X_train, X_test, y_train, y_test = train_test_split(
1089                X, y, test_size=self._split_size)
1090        else:
1091            X_train, X_test, y_train, y_test = X, X, y, y
1092        bootstrap_coefs = bootstrap_coefficients(
1093            self._model[1], X, y, n_bootstrap=100, random_state=42)
1094        # Calculate mean and standard deviation of coefficients
1095        mean_coefs = np.mean(bootstrap_coefs, axis=0)
1096        std_coefs = np.std(bootstrap_coefs, axis=0)
1097        # make the fig
1098        fig = [None] * 2
1099        fig[0] = go.Figure()
1100        # Add actual vs predicted line
1101        fig[0].add_trace(go.Scatter(x=y_train,
1102                                 y=self._model.predict(X_train),
1103                                 mode='markers',
1104                                 marker=dict(size=12, color='royalblue'),
1105                                 name='Training'))
1106        if self._split_size > 0:
1107            fig[0].add_trace(go.Scatter(x=y_test,
1108                                    y=self._model.predict(X_test),
1109                                    mode='markers',
1110                                    marker=dict(size=12, color='orange'),
1111                                    name='Validation'))
1112        # Add 1:1 line
1113        fig[0].add_shape(type="line",
1114                      x0=min(y_train), y0=min(y_train),
1115                      x1=max(y_train), y1=max(y_train),
1116                      line=dict(color="Gray", width=1, dash="dash"))
1117        fig[0].update_layout(
1118            plot_bgcolor="white",  # White background
1119            legend=dict(bgcolor='rgba(0,0,0,0)'),
1120            xaxis_title=f'Actual {self._response}',
1121            yaxis_title=f'Predicted {self._response}',
1122            height=500,  # Adjust height as needed
1123            margin=dict(l=10, r=10, t=120, b=50),
1124            xaxis=dict(
1125                showgrid=True,  # Enable grid
1126                gridcolor="lightgray",  # Light gray grid lines
1127                zeroline=False,
1128                zerolinecolor="black",  # Black zero line
1129                showline=True,
1130                linewidth=1,
1131                linecolor="black",  # Black border
1132                mirror=True
1133            ),
1134            yaxis=dict(
1135                showgrid=True,  # Enable grid
1136                gridcolor="lightgray",  # Light gray grid lines
1137                zeroline=False,
1138                zerolinecolor="black",  # Black zero line
1139                showline=True,
1140                linewidth=1,
1141                linecolor="black",  # Black border
1142                mirror=True
1143            )
1144        )
1145        # add the score in title
1146        fig[0].update_layout(
1147            title={
1148                'text': f"Predicted vs Actual for {str(self._model[1])}<br>R²_test = {r2_score(y_test, self._model.predict(X_test)):.4f}  -  R²_train = {r2_score(y_train, self._model.predict(X_train)):.4f}<br>RMSE_test = {root_mean_squared_error(y_test, self._model.predict(X_test)):.4f}  -  RMSE_train = {root_mean_squared_error(y_train, self._model.predict(X_train)):.4f}",
1149                'x': 0.45,
1150                'xanchor': 'center'
1151            }
1152        )
1153        # make feature importance plot
1154        if self._model_type != "GaussianProcess":
1155            fig[1] = go.Figure()
1156            pos_coefs = mean_coefs[mean_coefs > 0]
1157            pos_coefs_names = coef_names[mean_coefs > 0]
1158            neg_coefs = mean_coefs[mean_coefs < 0]
1159            neg_coefs_names = coef_names[mean_coefs < 0]
1160            # Add bars for positive mean coefficients
1161            fig[1].add_trace(go.Bar(
1162                y=[f"{pos_coefs_names[i]}" for i in range(len(pos_coefs))],
1163                x=mean_coefs[mean_coefs > 0],
1164                error_x=dict(type='data', array=std_coefs[mean_coefs > 0], visible=True),
1165                orientation='h',
1166                marker_color='royalblue',
1167                name='Positive'
1168            ))
1169            fig[1].add_trace(go.Bar(
1170                y=[f"{neg_coefs_names[i]}" for i in range(len(neg_coefs))],
1171                x=-mean_coefs[mean_coefs < 0],
1172                error_x=dict(type='data', array=std_coefs[mean_coefs < 0], visible=True),
1173                orientation='h',
1174                marker_color='orange',
1175                name='Negative'
1176            ))
1177            # Update layout
1178            fig[1].update_layout(
1179                title={
1180                    'text': f"Features importance for {str(self._model[1])}",
1181                    'x': 0.5,
1182                    'xanchor': 'center'
1183                },
1184                xaxis_title="Coefficient Value",
1185                yaxis_title="Features",
1186                barmode='relative',
1187                margin=dict(l=150)
1188            )
1189            if features_in_log:
1190                fig[1].update_xaxes(type="log")
1191            else:
1192                fig[1].update_xaxes(type="linear")
1193            fig[1].update_layout(
1194                plot_bgcolor="white",  # White background
1195                legend=dict(bgcolor='rgba(0,0,0,0)'),
1196                height=500,  # Adjust height as needed
1197                margin=dict(l=10, r=10, t=120, b=50),
1198                xaxis=dict(
1199                    showgrid=True,  # Enable grid
1200                    gridcolor="lightgray",  # Light gray grid lines
1201                    zeroline=False,
1202                    zerolinecolor="black",  # Black zero line
1203                    showline=True,
1204                    linewidth=1,
1205                    linecolor="black",  # Black border
1206                    mirror=True
1207                ),
1208                yaxis=dict(
1209                    showgrid=True,  # Enable grid
1210                    gridcolor="lightgray",  # Light gray grid lines
1211                    zeroline=False,
1212                    zerolinecolor="black",  # Black zero line
1213                    showline=True,
1214                    linewidth=1,
1215                    linecolor="black",  # Black border
1216                    mirror=True
1217                )
1218            )
1219        else:
1220            print('GaussianProcess does not have coefficients or feature importances')
1221        return fig

Plot the machine learning model using plotly.

Parameters
  • features_in_log (bool, optional): Whether to plot the feature importances in log scale. Default is False.
Returns
  • fig (list): The list of plotly figures.
def predict(self, X=None, model='all'):
1223    def predict(self, X=None, model='all'):
1224        """
1225        Predict using the machine learning model and the linear model,
1226        if they are trained. Use the encoders to encode the data.
1227        If X is not provided, use the original data.
1228        If the model has not been trained, raise a warning.
1229
1230        Parameters
1231        ----------
1232        X : pd.DataFrame, optional
1233            The input features. Default is None, which uses the original data.
1234
1235        Returns
1236        -------
1237        pd.DataFrame
1238            The predicted values with a column indicating the model used.
1239        """
1240        if X is None:
1241            X = self._data[self._factors].copy()
1242        else:
1243            X = X.copy()
1244
1245        # Encode the input data using the same encoders
1246        for factor in self._factors:
1247            if factor in self._encoders:
1248                X[factor] = self._encoders[factor].transform(X[factor].astype(str))
1249
1250        predML = None
1251        predlin = None
1252
1253        if self._model is not None:
1254            # Check if the model is fitted
1255            if not hasattr(self._model, 'predict'):
1256                raise Warning("Model has not been trained yet.")
1257            predML = self._model.predict(X)
1258            predML = pd.DataFrame(predML, columns=['prediction'])
1259            predML['model'] = self._model_type
1260
1261        if self._linear_model is not None:
1262            # Check if the linear model is fitted
1263            if not hasattr(self._linear_model, 'predict'):
1264                raise Warning("Linear model has not been trained yet.")
1265            predlin = self._linear_model.predict(X)
1266            predlin = pd.DataFrame(predlin, columns=['prediction'])
1267            predlin['model'] = 'Linear Model'
1268
1269        if predML is not None and predlin is not None and model == 'all':
1270            # Combine the predictions on top of each other
1271            pred = pd.concat([predML, predlin], axis=0)
1272            pred = pred.reset_index(drop=True)
1273            return pred
1274        elif predML is not None and model == 'ML':
1275            return predML
1276        elif predlin is not None and model == 'linear':
1277            return predlin
1278        else:
1279            raise Warning("No model has been trained yet.")

Predict using the machine learning model and the linear model, if they are trained. Use the encoders to encode the data. If X is not provided, use the original data. If the model has not been trained, raise a warning.

Parameters
  • X (pd.DataFrame, optional): The input features. Default is None, which uses the original data.
Returns
  • pd.DataFrame: The predicted values with a column indicating the model used.
def bootstrap_coefficients(mod, X, y, n_bootstrap=100, random_state=None):
1284def bootstrap_coefficients(mod, X, y, n_bootstrap=100, random_state=None):
1285    """
1286    Perform bootstrap resampling to estimate the variability of model coefficients.
1287
1288    Parameters
1289    ----------
1290    mod : object
1291        The machine learning model.
1292    X : pd.DataFrame
1293        The input features.
1294    y : pd.Series
1295        The target variable.
1296    n_bootstrap : int, optional
1297        The number of bootstrap samples. Default is 100.
1298    random_state : int, optional
1299        The seed for the random number generator.
1300
1301    Returns
1302    -------
1303    results : np.ndarray
1304        The bootstrapped coefficients.
1305    """
1306    np.random.seed(random_state)
1307    n_samples = X.shape[0]
1308    results = []
1309
1310    for _ in range(n_bootstrap):
1311        # Resample the data
1312        indices = np.random.choice(np.arange(n_samples), size=n_samples, replace=True)
1313        X_resample, y_resample = X.values[indices], y.values[indices]
1314
1315        # Fit the model
1316        if isinstance(mod, RidgeCV):
1317            model = RidgeCV(alphas=np.logspace(-3, 3, 10)).fit(X_resample, y_resample)
1318            results.append(model.coef_)
1319        elif isinstance(mod, ElasticNetCV):
1320            model = ElasticNetCV(alphas=np.logspace(-3, 3, 10), l1_ratio=0.5).fit(X_resample, y_resample)
1321            results.append(model.coef_)
1322        elif isinstance(mod, LinearRegression):
1323            model = LinearRegression().fit(X_resample, y_resample)
1324            results.append(model.coef_)
1325        elif isinstance(mod, RandomForestRegressor):
1326            model = RandomForestRegressor().fit(X_resample, y_resample)
1327            results.append(model.feature_importances_)
1328        elif isinstance(mod, GradientBoostingRegressor):
1329            model = GradientBoostingRegressor().fit(X_resample, y_resample)
1330            results.append(model.feature_importances_)
1331        elif isinstance(mod, GaussianProcessRegressor):
1332            # Gaussian Process does not have coefficients or feature importances
1333            model = GaussianProcessRegressor().fit(X_resample, y_resample)
1334            results.append(np.zeros(X.shape[1]))  # Placeholder for Gaussian Process
1335        else:
1336            raise ValueError(f"Unsupported model type: {mod}")
1337
1338    return np.array(results)

Perform bootstrap resampling to estimate the variability of model coefficients.

Parameters
  • mod (object): The machine learning model.
  • X (pd.DataFrame): The input features.
  • y (pd.Series): The target variable.
  • n_bootstrap (int, optional): The number of bootstrap samples. Default is 100.
  • random_state (int, optional): The seed for the random number generator.
Returns
  • results (np.ndarray): The bootstrapped coefficients.