optimeo.analysis

The analysis module provides tools for data analysis and regression modeling. The main workhorse is the `DataAnalysis` class, which allows for encoding categorical variables, performing regression analysis, and visualizing results.

It supports both linear regression using the `statsmodels` package and machine learning models from `sklearn`. The class also provides methods for plotting Q-Q plots, box plots, histograms, and scatter plots. It includes functionality for bootstrap resampling to estimate the variability of model coefficients. The `DataAnalysis` class is designed to be flexible and extensible, allowing users to customize the regression analysis process.

You can see an example notebook [here](../examples/MLanalysis.html).
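Before the full source below, here is a minimal usage sketch based on the documented API. The CSV file name and column layout are placeholders, not part of the library; any DataFrame with factor columns and a numeric response column works the same way:

```python
import pandas as pd
from optimeo.analysis import DataAnalysis

# 'experiments.csv' is a hypothetical dataset: factor columns first,
# numeric response column last.
data = pd.read_csv('experiments.csv')
factors = list(data.columns[:-1])   # factor column names
response = data.columns[-1]         # response column name

analysis = DataAnalysis(data, factors, response)

# Ordinary least squares via statsmodels, with two-factor interactions
lm = analysis.compute_linear_model(order=2)
print(lm.summary())

# Machine learning model via sklearn; extra kwargs override the defaults
analysis.model_type = "RandomForest"
ml = analysis.compute_ML_model(n_estimators=200)
for fig in analysis.plot_ML_model():
    fig.show()
```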
````python
# Copyright (c) 2025 Colin BOUSIGE
# Contact: colin.bousige@cnrs.fr
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the MIT License as published by
# the Free Software Foundation, either version 3 of the License, or
# any later version.

"""
The analysis module provides tools for data analysis and regression modeling.
The main workhorse is the `DataAnalysis` class, which allows for encoding categorical variables, performing regression analysis, and visualizing results.

It supports both linear regression using the `statsmodels` package and machine learning models from `sklearn`.
The class also provides methods for plotting Q-Q plots, box plots, histograms, and scatter plots.
It includes functionality for bootstrap resampling to estimate the variability of model coefficients.
The `DataAnalysis` class is designed to be flexible and extensible, allowing users to customize the regression analysis process.

You can see an example notebook [here](../examples/MLanalysis.html).

"""


import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats
from scipy.stats import t
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from statsmodels.graphics.gofplots import qqplot
# import ML models and scaling functions
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, RidgeCV, ElasticNetCV
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import root_mean_squared_error, r2_score
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import make_pipeline
import seaborn as sns


class DataAnalysis:
    """
    This class is used to analyze the data and perform regression analysis.

    Parameters
    ----------

    data : pd.DataFrame
        The input data.
    factors : list
        The list of factor variables.
    response : str
        The response variable.
    split_size : float, optional
        The proportion of the dataset to include in the test split. Default is `0.2`.
    model_type : str, optional
        The type of machine learning model to use. Default is None.
        Must be one of the following: `"ElasticNetCV"`, `"RidgeCV"`,
        `"LinearRegression"`, `"RandomForest"`, `"GaussianProcess"`, `"GradientBoosting"`.

    Attributes
    ----------

    data : pd.DataFrame
        The input data.
    factors : list
        The list of factor variables.
    response : str
        The response variable.
    encoders : dict
        The encoders for categorical variables.
    dtypes : pd.Series
        The data types of the columns.
    linear_model : object
        The linear model object.
    equation : str
        The equation for the linear model, in the form `response ~ var1 + var2 + var1:var2`.
    model : object
        The machine learning model object.
    model_type : str
        The type of machine learning model to use.
    split_size : float
        The proportion of the dataset to include in the test split.

    Methods
    -------

    - **encode_data()**:
        Encodes categorical variables in the data. Called during initialization.
    - **plot_qq()**:
        Plots a Q-Q plot for the response variable using `plotly`.
    - **plot_boxplot()**:
        Plots a boxplot for the response variable using `plotly`.
    - **plot_histogram()**:
        Plots a histogram for the response variable using `plotly`.
    - **plot_scatter_response()**:
        Plots a scatter plot for the response variable using `plotly`.
    - **plot_corr()**:
        Plots a correlation matrix for the data using `plotly`.
    - **plot_pairplot_seaborn()**:
        Plots a pairplot for the data using `seaborn`.
    - **plot_pairplot_plotly()**:
        Plots a pairplot for the data using `plotly`.
    - **write_equation(order=1, quadratic=[])**:
        Writes R-style equation for multivariate fitting procedure using the `statsmodels` package.
    - **compute_linear_model(order=1, quadratic=[])**:
        Computes the linear model using the `statsmodels` package.
    - **plot_linear_model()**:
        Plots the linear model using `plotly`.
    - <b>compute_ML_model(**kwargs)</b>:
        Computes the machine learning model using `sklearn`.
    - **plot_ML_model(features_in_log=False)**:
        Plots the machine learning model using `plotly`.

    Example
    -------

    ```python
    from optimeo.analysis import *

    data = pd.read_csv('dataML.csv')
    factors = data.columns[:-1]
    response = data.columns[-1]
    analysis = DataAnalysis(data, factors, response)
    analysis.model_type = "ElasticNetCV"
    MLmodel = analysis.compute_ML_model()
    figs = analysis.plot_ML_model()
    for fig in figs:
        fig.show()
    ```
    """

    def __init__(self,
                 data: pd.DataFrame,
                 factors: list,
                 response: str,
                 split_size=.2,
                 model_type=None):
        self._dtypes = None
        self._encoders = {}
        self._linear_model = None
        self._model = None
        self._equation = ''
        self._data = data
        self._factors = factors
        self._response = response
        self._model_type = model_type
        self._split_size = split_size
        self.encode_data()

    @property
    def data(self):
        """The input `pandas.DataFrame`."""
        return self._data

    @data.setter
    def data(self, value):
        if not isinstance(value, pd.DataFrame):
            raise ValueError("Data must be a pandas DataFrame.")
        self._data = value

    def encode_data(self):
        """
        Called during initialization: encodes categorical variables in the data if there are any.
        Uses `LabelEncoder()` from `sklearn` to convert categorical variables to numerical values.
        Also drops rows with NaN values.
171 """ 172 self._dtypes = self._data.dtypes 173 for factor in self._factors: 174 if self._dtypes[factor] == 'object': 175 le = LabelEncoder() 176 self._encoders[factor] = le 177 self._data[factor] = le.fit_transform([str(d) for d in self._data[factor]]) 178 self.data = self.data[self._factors + [self._response]] 179 self.data = self.data.dropna(axis=0, how='any') 180 181 @property 182 def factors(self): 183 """The list of names of the columns of the `data` DataFrame that contain factor variables.""" 184 return self._factors 185 186 @factors.setter 187 def factors(self, value): 188 if not isinstance(value, list): 189 raise ValueError("Factors must be a list.") 190 self._factors = value 191 192 @property 193 def encoders(self): 194 """The list of encoders for categorical variables.""" 195 return self._encoders 196 197 @encoders.setter 198 def encoders(self, value): 199 if not isinstance(value, dict): 200 raise ValueError("Encoders must be a dictionary.") 201 self._encoders = value 202 203 @property 204 def dtypes(self): 205 """Get the data types of the columns.""" 206 return self._dtypes 207 208 @dtypes.setter 209 def dtypes(self, value): 210 self._dtypes = value 211 212 @property 213 def response(self): 214 """The name of the column of the `data` DataFrame that contain the response variable.""" 215 return self._response 216 217 @response.setter 218 def response(self, value): 219 if not isinstance(value, str): 220 raise ValueError("Response must be a string.") 221 self._response = value 222 223 @property 224 def linear_model(self): 225 """Get the linear model.""" 226 return self._linear_model 227 228 @linear_model.setter 229 def linear_model(self, value): 230 self._linear_model = value 231 232 @property 233 def equation(self): 234 """The equation for the linear model, in the form `response ~ var1 + var2 + var1:var2`. This is based on the [statsmodels package](https://www.statsmodels.org/dev/examples/notebooks/generated/formulas.html).""" 235 return self._equation 236 237 @equation.setter 238 def equation(self, value): 239 if not isinstance(value, str): 240 raise ValueError("Equation must be a string.") 241 self._equation = value 242 243 @property 244 def model_type(self): 245 """The type of machine learning model to use. Default is None. 246 Must be one of the following: `"ElasticNetCV"`, `"RidgeCV"`, 247 `"LinearRegression"`, `"RandomForest"`, `"GaussianProcess"`, `"GradientBoosting"`.""" 248 return self._model_type 249 250 @model_type.setter 251 def model_type(self, value): 252 if value is not None and value not in ["ElasticNetCV", "RidgeCV", 253 "LinearRegression", "RandomForest", 254 "GaussianProcess", "GradientBoosting"]: 255 raise ValueError("Model must be one of the following: " 256 "ElasticNetCV, RidgeCV, LinearRegression, " 257 "RandomForest, GaussianProcess, GradientBoosting.") 258 self._model_type = value 259 260 @property 261 def model(self): 262 """The machine learning model object.""" 263 return self._model 264 265 @model.setter 266 def model(self, value): 267 self._model = value 268 269 @property 270 def split_size(self): 271 """The proportion of the dataset to include in the test split. 
        Default is `0.2`."""
        return self._split_size

    @split_size.setter
    def split_size(self, value):
        if not isinstance(value, (int, float)):
            raise ValueError("Split size must be a number.")
        self._split_size = value

    def __str__(self):
        """Return a string representation of the DataAnalysis object."""
        return (f"DataAnalysis(data={self._data.shape}, "
                f"factors={self._factors}, "
                f"response={self._response}, "
                f"model_type={self._model_type}, "
                f"split_size={self._split_size}, "
                f"encoders={self._encoders})"
                )

    def __repr__(self):
        """Return a string representation of the DataAnalysis object."""
        return self.__str__()

    def plot_qq(self):
        """
        Plot a Q-Q plot for the response variable.

        Returns
        -------
        fig : plotly.graph_objs.Figure
            The Q-Q plot figure.
        """
        qqplot_data = pd.Series(self._data[self._response], copy=True)
        qqplot_data = qqplot(qqplot_data, line='s').gca().lines
        fig = go.Figure()

        fig.add_trace({
            'type': 'scatter',
            'x': qqplot_data[0].get_xdata(),
            'y': qqplot_data[0].get_ydata(),
            'mode': 'markers',
        })

        fig.add_trace({
            'type': 'scatter',
            'x': qqplot_data[1].get_xdata(),
            'y': qqplot_data[1].get_ydata(),
            'mode': 'lines',
            'line': {
                'color': 'red'
            }
        })

        fig.update_layout(
            title='Q-Q Plot',
            xaxis_title='Theoretical Quantiles',
            yaxis_title='Sample Quantiles',
            plot_bgcolor="white",  # White background
            showlegend=False,
            margin=dict(l=10, r=10, t=50, b=50),
            xaxis=dict(
                showgrid=True,  # Enable grid
                gridcolor="lightgray",  # Light gray grid lines
                zeroline=False,
                zerolinecolor="black",  # Black zero line
                showline=True,
                linewidth=1,
                linecolor="black",  # Black border
                mirror=True
            ),
            yaxis=dict(
                showgrid=True,  # Enable grid
                gridcolor="lightgray",  # Light gray grid lines
                zeroline=False,
                zerolinecolor="black",  # Black zero line
                showline=True,
                linewidth=1,
                linecolor="black",  # Black border
                mirror=True
            ),
            height=300,
        )
        return fig

    def plot_boxplot(self):
        """
        Plot a boxplot for the response variable.

        Returns
        -------
        fig : plotly.graph_objs.Figure
            The boxplot figure.
        """
        fig = px.box(self._data, y=self._response, points="all", title='Box Plot')
        fig.update_layout(
            plot_bgcolor="white",  # White background
            legend=dict(bgcolor='rgba(0,0,0,0)'),
            margin=dict(l=10, r=10, t=50, b=50),
            xaxis=dict(
                showgrid=True,  # Enable grid
                gridcolor="lightgray",  # Light gray grid lines
                zeroline=False,
                zerolinecolor="black",  # Black zero line
                showline=True,
                linewidth=1,
                linecolor="black",  # Black border
                mirror=True
            ),
            yaxis=dict(
                showgrid=True,  # Enable grid
                gridcolor="lightgray",  # Light gray grid lines
                zeroline=False,
                zerolinecolor="black",  # Black zero line
                showline=True,
                linewidth=1,
                linecolor="black",  # Black border
                mirror=True
            ), height=300,
        )
        return fig

    def plot_histogram(self):
        """
        Plot a histogram for the response variable.

        Returns
        -------
        fig : plotly.graph_objs.Figure
            The histogram figure.
400 """ 401 fig = px.histogram(self._data, x=self._response, title='Histogram') 402 fig.update_layout( 403 plot_bgcolor="white", # White background 404 legend=dict(bgcolor='rgba(0,0,0,0)'), 405 margin=dict(l=10, r=10, t=50, b=50), 406 xaxis=dict( 407 showgrid=True, # Enable grid 408 gridcolor="lightgray", # Light gray grid lines 409 zeroline=False, 410 zerolinecolor="black", # Black zero line 411 showline=True, 412 linewidth=1, 413 linecolor="black", # Black border 414 mirror=True 415 ), 416 yaxis=dict( 417 showgrid=True, # Enable grid 418 gridcolor="lightgray", # Light gray grid lines 419 zeroline=False, 420 zerolinecolor="black", # Black zero line 421 showline=True, 422 linewidth=1, 423 linecolor="black", # Black border 424 mirror=True 425 ), height=300, 426 ) 427 return fig 428 429 def plot_scatter_response(self): 430 """ 431 Plot a scatter plot for the response variable. 432 433 Returns 434 ------- 435 fig : plotly.graph_objs.Figure 436 The scatter plot figure. 437 """ 438 fig = px.scatter(x=np.arange(1, len(self._data[self._response]) + 1), 439 y=self._data[self._response], 440 labels={'x': 'Measurement number', 'y': self._response}, 441 title='Scatter Plot') 442 fig.update_layout( 443 plot_bgcolor="white", # White background 444 legend=dict(bgcolor='rgba(0,0,0,0)'), 445 margin=dict(l=10, r=10, t=50, b=50), 446 xaxis=dict( 447 showgrid=True, # Enable grid 448 gridcolor="lightgray", # Light gray grid lines 449 zeroline=False, 450 zerolinecolor="black", # Black zero line 451 showline=True, 452 linewidth=1, 453 linecolor="black", # Black border 454 mirror=True 455 ), 456 yaxis=dict( 457 showgrid=True, # Enable grid 458 gridcolor="lightgray", # Light gray grid lines 459 zeroline=False, 460 zerolinecolor="black", # Black zero line 461 showline=True, 462 linewidth=1, 463 linecolor="black", # Black border 464 mirror=True 465 ), height=300, 466 ) 467 return fig 468 469 def plot_pairplot_seaborn(self): 470 """ 471 Plot a pairplot for the data using seaborn. 472 473 Returns 474 ------- 475 fig : seaborn.axisgrid.PairGrid 476 The pairplot figure. 477 """ 478 fig = sns.pairplot( 479 self._data, 480 kind="reg", 481 diag_kind="kde", 482 plot_kws={"scatter_kws": {"alpha": 0.1}}, 483 ) 484 return fig 485 486 def plot_pairplot_plotly(self): 487 """ 488 Plot a pairplot for the data using plotly. 489 490 Returns 491 ------- 492 fig : plotly.graph_objs.Figure 493 The plotly figure. 
494 """ 495 496 # Get the column names 497 columns = self._data.columns 498 n_cols = len(columns) 499 500 # Create subplot grid with partially shared axes 501 fig = make_subplots( 502 rows=n_cols, 503 cols=n_cols, 504 shared_xaxes=True, 505 shared_yaxes=True, 506 vertical_spacing=0.05, 507 horizontal_spacing=0.05 508 ) 509 510 # Add scatter plots and regression lines to each subplot 511 for i in range(n_cols): 512 for j in range(n_cols): 513 x_data = self._data[columns[j]] 514 y_data = self._data[columns[i]] 515 516 if i == j: # Diagonal: Add KDE plot 517 # Calculate KDE 518 data = self._data[columns[i]].dropna() 519 if len(data) > 1: # Ensure enough data points 520 kde_x = np.linspace(data.min(), data.max(), 100) 521 kde = stats.gaussian_kde(data) 522 kde_y = kde(kde_x) 523 # Scale to match y-axis (needed for top left corner) 524 kde_y = kde_y / kde_y.max() * np.max(y_data.dropna()) 525 526 # Add KDE plot to diagonal 527 fig.add_trace( 528 go.Scatter( 529 x=kde_x, 530 y=kde_y, 531 mode='lines', 532 fill='tozeroy', 533 line=dict(color='rgba(29, 81, 189, 1)'), 534 showlegend=False 535 ), 536 row=i+1, col=j+1 537 ) 538 # Ensure the y-axis is independent for diagonal plots 539 # don't know why, doesn't work with top left corner 540 fig.update_yaxes( 541 matches=None, # Ensure independent y-axis 542 showticklabels=True, # Show y-axis values 543 row=i+1, 544 col=j+1 545 ) 546 else: # Off-diagonal: Add scatter plot with regression line 547 # Add scatter plot 548 fig.add_trace( 549 go.Scatter( 550 x=x_data, 551 y=y_data, 552 mode='markers', 553 marker=dict( 554 color='rgba(29, 81, 189, 0.5)', 555 size=5 556 ), 557 showlegend=False 558 ), 559 row=i+1, col=j+1 560 ) 561 562 # Calculate and add regression line 563 if len(x_data.dropna()) > 1 and len(y_data.dropna()) > 1: 564 # Drop NaN values for regression calculation 565 valid_indices = x_data.notna() & y_data.notna() 566 x_clean = x_data[valid_indices] 567 y_clean = y_data[valid_indices] 568 ytype = y_clean.dtype 569 xtype = x_clean.dtype 570 unique_x = x_clean.unique() 571 if len(x_clean) > 1 and len(unique_x)>1 and ytype != 'object' and xtype != 'object': 572 # Calculate regression parameters 573 slope, intercept, r_value, p_value, std_err = stats.linregress(x_clean, y_clean) 574 x_range = np.linspace(x_clean.min(), x_clean.max(), 100) 575 y_pred = slope * x_range + intercept 576 577 # Standard error of estimate 578 y_fit = slope * x_clean + intercept 579 residuals = y_clean - y_fit 580 dof = len(x_clean) - 2 581 residual_std_error = np.sqrt(np.sum(residuals**2) / dof) 582 583 mean_x = np.mean(x_clean) 584 t_val = t.ppf(0.975, dof) # 95% confidence 585 586 se_line = residual_std_error * np.sqrt(1/len(x_clean) + (x_range - mean_x)**2 / np.sum((x_clean - mean_x)**2)) 587 y_upper = y_pred + t_val * se_line 588 y_lower = y_pred - t_val * se_line 589 590 # Add regression line 591 fig.add_trace( 592 go.Scatter( 593 x=x_range, 594 y=y_pred, 595 mode='lines', 596 line=dict(color='red', width=2), 597 showlegend=False 598 ), 599 row=i+1, col=j+1 600 ) 601 602 # Add confidence interval area 603 fig.add_trace( 604 go.Scatter( 605 x=np.concatenate([x_range, x_range[::-1]]), 606 y=np.concatenate([y_upper, y_lower[::-1]]), 607 fill='toself', 608 fillcolor='rgba(255, 0, 0, 0.2)', 609 line=dict(color='rgba(255, 0, 0, 0)'), 610 showlegend=False 611 ), 612 row=i+1, col=j+1 613 ) 614 615 # Update layout and axis properties 616 fig.update_layout( 617 margin=dict(l=10, r=10, t=50, b=50), 618 title="Pair Plot", 619 height=600, 620 width=600, 621 showlegend=False, 622 
plot_bgcolor="white" 623 ) 624 625 # Update all axes properties first (hiding tick labels by default) 626 fig.update_xaxes( 627 showgrid=True, 628 gridcolor="lightgray", 629 zeroline=False, 630 showline=True, 631 linewidth=1, 632 linecolor="black", 633 mirror=True, 634 showticklabels=False 635 ) 636 637 fig.update_yaxes( 638 showgrid=True, 639 gridcolor="lightgray", 640 zeroline=False, 641 showline=True, 642 linewidth=1, 643 linecolor="black", 644 mirror=True, 645 showticklabels=False 646 ) 647 648 # Show tick labels and titles only for bottom row and leftmost column 649 for i, col_name in enumerate(columns): 650 # Bottom row: show x-axis titles and tick labels 651 fig.update_xaxes( 652 title_text=col_name, 653 showticklabels=True, 654 row=n_cols, 655 col=i+1 656 ) 657 658 # Leftmost column: show y-axis titles and tick labels 659 fig.update_yaxes( 660 title_text=col_name, 661 showticklabels=True, 662 row=i+1, 663 col=1 664 ) 665 666 return fig 667 668 def plot_corr(self): 669 """ 670 Plot a correlation matrix for the data. 671 672 Returns 673 ------- 674 fig : seaborn.axisgrid.PairGrid 675 The pairplot figure. 676 """ 677 corr_matrix = self._data.corr(method='pearson') 678 mask = np.triu(np.ones_like(corr_matrix, dtype=bool), k=1) 679 # Apply mask - replace upper triangle with NaN values 680 corr_matrix_lower = corr_matrix.copy() 681 corr_matrix_lower.values[mask] = np.nan 682 683 fig = px.imshow( 684 corr_matrix_lower, 685 text_auto='.2f', # Display correlation values 686 color_continuous_scale='RdBu_r', # Red to Blue color scale (reversed) 687 zmin=-1, # Minimum correlation value 688 zmax=1, # Maximum correlation value 689 aspect="auto", # Keep aspect ratio adaptive 690 title="Pearson Correlation Heatmap" 691 ) 692 # Customize hover template 693 fig.update_traces( 694 hovertemplate='<b>%{y} vs %{x}</b><br>Correlation: %{z}<extra></extra>' 695 ) 696 # Improve layout 697 fig.update_layout( 698 coloraxis_colorbar=dict( 699 title="Correlation", 700 tickvals=[-1, -0.5, 0, 0.5, 1], 701 ticktext=["-1", "-0.5", "0", "0.5", "1"], 702 ), 703 plot_bgcolor="white", # White background 704 legend=dict(bgcolor='rgba(0,0,0,0)'), 705 margin=dict(l=10, r=10, t=50, b=50), 706 xaxis=dict( 707 showgrid=False, # Enable grid 708 gridcolor="lightgray", # Light gray grid lines 709 zeroline=False, 710 zerolinecolor="black", # Black zero line 711 showline=False, 712 linewidth=1, 713 tickangle=-45, 714 linecolor="black", # Black border 715 mirror=True 716 ), 717 yaxis=dict( 718 showgrid=False, # Enable grid 719 gridcolor="lightgray", # Light gray grid lines 720 zeroline=False, 721 zerolinecolor="black", # Black zero line 722 showline=False, 723 linewidth=1, 724 linecolor="black", # Black border 725 mirror=True 726 ), 727 height=600, 728 width=600 729 ) 730 731 return fig 732 733 def write_equation(self, order=1, quadratic=[]): 734 """ 735 Write R-style equation for multivariate fitting procedure using the statsmodels package. 736 737 Parameters 738 ---------- 739 order : int, optional 740 The order of the polynomial. Default is 1. 741 quadratic : list, optional 742 The list of quadratic factors. Default is an empty list. 743 744 Returns 745 ------- 746 str 747 The R-style equation. 
748 """ 749 myfactors = self._factors.copy() 750 if self._dtypes is not None: 751 if not isinstance(myfactors, list): 752 myfactors = myfactors.tolist() # Convert to list to allow mutable operations 753 for i in range(len(self._factors)): 754 if self._dtypes[self._factors[i]] == 'object': 755 myfactors[i] = f'C({myfactors[i]})' 756 eqn = f'{self._response} ~ {myfactors[0]} ' 757 for factor in myfactors[1:]: 758 eqn += f'+ {factor} ' 759 if order > 1: 760 for i in range(len(myfactors)): 761 for j in range(i + 1, len(myfactors)): 762 eqn += f'+ {myfactors[i]}:{myfactors[j]} ' 763 if order > 2: 764 for i in range(len(myfactors)): 765 for j in range(i + 1, len(myfactors)): 766 for k in range(j + 1, len(myfactors)): 767 eqn += f'+ {myfactors[i]}:{myfactors[j]}:{myfactors[k]} ' 768 if order > 3: 769 for i in range(len(myfactors)): 770 for j in range(i + 1, len(myfactors)): 771 for k in range(j + 1, len(myfactors)): 772 for l in range(k + 1, len(myfactors)): 773 eqn += f'+ {myfactors[i]}:{myfactors[j]}:{myfactors[k]}:{myfactors[l]} ' 774 if len(quadratic) > 0: 775 for factor in quadratic: 776 eqn += f'+ I({factor}**2) ' 777 self._equation = eqn 778 return self._equation 779 780 def compute_linear_model(self, order=1, quadratic=[]): 781 """ 782 Compute the linear model using the statsmodels package. 783 784 Parameters 785 ---------- 786 order : int, optional 787 The order of the polynomial. Default is 1. The parameter 788 is not used if the equation is already set. 789 quadratic : list, optional 790 The list of quadratic factors. Default is an empty list. 791 The parameter is not used if the equation is already set. 792 793 Returns 794 ------- 795 statsmodels.regression.linear_model.RegressionResultsWrapper 796 The fitted linear model. 797 """ 798 if self._equation == '': 799 eqn = self.write_equation(order=order, quadratic=quadratic) 800 else: 801 eqn = self._equation 802 model = smf.ols(formula=eqn, data=self._data) 803 self._linear_model = model.fit() 804 return self._linear_model 805 806 def plot_linear_model(self): 807 """ 808 Plot the linear model using plotly. 809 810 Returns 811 ------- 812 fig : list 813 The list of plotly figures. 
814 """ 815 if self._linear_model is None: 816 raise Warning("Linear model has not been computed yet.") 817 fig = [None] * 2 818 fig[0] = go.Figure() 819 # Fig[0]: actual vs predicted line 820 fig[0].add_trace(go.Scatter(x=self._data[self._response], 821 y=self._linear_model.predict(), 822 mode='markers', 823 marker=dict(size=12), 824 name='Actual vs Predicted')) 825 # Add 1:1 line 826 fig[0].add_shape(type="line", 827 x0=min(self._data[self._response]), 828 y0=min(self._data[self._response]), 829 x1=max(self._data[self._response]), 830 y1=max(self._data[self._response]), 831 line=dict(color="Gray", width=1, dash="dash")) 832 fig[0].update_layout( 833 plot_bgcolor="white", # White background 834 legend=dict(bgcolor='rgba(0,0,0,0)'), 835 xaxis_title=f'Actual {self._response}', 836 yaxis_title=f'Predicted {self._response}', 837 height=500, # Adjust height as needed 838 margin=dict(l=10, r=10, t=50, b=50), 839 xaxis=dict( 840 showgrid=True, # Enable grid 841 gridcolor="lightgray", # Light gray grid lines 842 zeroline=False, 843 zerolinecolor="black", # Black zero line 844 showline=True, 845 linewidth=1, 846 linecolor="black", # Black border 847 mirror=True 848 ), 849 yaxis=dict( 850 showgrid=True, # Enable grid 851 gridcolor="lightgray", # Light gray grid lines 852 zeroline=False, 853 zerolinecolor="black", # Black zero line 854 showline=True, 855 linewidth=1, 856 linecolor="black", # Black border 857 mirror=True 858 ) 859 ) 860 # # # # # # # # # # # # # # # # # # # # # 861 # Fig[1]: slope values for each factor 862 # # # # # # # # # # # # # # # # # # # # # 863 # Plot: Slope values for each factor 864 res = self._linear_model.params.rename_axis('terms').reset_index()[1:] 865 res.columns = ['terms', 'slope'] 866 error = self._linear_model.bse.rename_axis('terms').reset_index()[1:] 867 error.columns = ['terms', 'error'] 868 res['error'] = error['error'] 869 res['pvalue'] = [self._linear_model.pvalues[res['terms']].iloc[i] for i in range(len(res))] 870 871 # Sort by p-values 872 res = res.sort_values(by='pvalue', ascending=False) 873 # Prepare colors and labels 874 colors = ['red' if x < 0 else 'green' for x in res['slope']] 875 res['slope'] = res['slope'].abs() 876 877 fig[1] = go.Figure() 878 879 # Add bar plot 880 fig[1].add_trace(go.Bar( 881 y=[term.replace('I(', '').replace('C(', '').replace(')', '').replace(' ** ', '^') for term in res['terms']], 882 x=res['slope'], 883 error_x=dict(type='data', array=res['error']), 884 marker_color=colors, 885 orientation='h', 886 name='Slope', 887 showlegend=False # Hide the legend entry for the bar trace 888 )) 889 890 # Update layout for log scale and labels 891 fig[1].update_layout( 892 xaxis_title='Magnitude of effect', 893 xaxis_type="log", 894 height=500, # Adjust height as needed 895 plot_bgcolor="white", # White background 896 legend=dict( 897 orientation="h", # Horizontal orientation 898 y=1.1 # Place legend on top 899 ), 900 margin=dict(l=10, r=150, t=50, b=50), # Increase right margin for annotations 901 xaxis=dict( 902 showgrid=True, # Enable grid 903 gridcolor="lightgray", # Light gray grid lines 904 zeroline=False, 905 zerolinecolor="black", # Black zero line 906 showline=True, 907 linewidth=1, 908 linecolor="black", # Black border 909 mirror=True 910 ), 911 yaxis=dict( 912 showgrid=True, # Enable grid 913 gridcolor="lightgray", # Light gray grid lines 914 zeroline=False, 915 zerolinecolor="black", # Black zero line 916 showline=True, 917 linewidth=1, 918 linecolor="black", # Black border 919 mirror=True 920 ) 921 ) 922 923 # Add legend for 
        fig[1].add_trace(go.Scatter(
            x=[None], y=[None], mode='markers',
            marker=dict(size=10, color='red'),
            name='Negative'
        ))
        fig[1].add_trace(go.Scatter(
            x=[None], y=[None], mode='markers',
            marker=dict(size=10, color='green'),
            name='Positive'
        ))

        # Add p-values as annotations outside the plot
        for i, p in enumerate(res['slope']):
            fig[1].add_annotation(
                x=1.025,  # Place annotations outside the plot
                y=i,
                text=f"p = {self._linear_model.pvalues[res['terms'].iloc[i]]:.2g}",
                showarrow=False,
                xref="paper",  # Use paper coordinates for x
                xanchor='left'
            )

        return fig

    def compute_ML_model(self, **kwargs):
        """
        Compute the machine learning model.

        Parameters
        ----------
        kwargs : dict
            Additional keyword arguments for the model. Overrides default parameters.

        Default Parameters by Model Type
        --------------------------------
        - **ElasticNetCV:**
            - l1_ratio : list, default=[0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1.0].
              List of L1 ratios to try.
            - cv : int, default=5.
              Cross-validation folds.
            - max_iter : int, default=1000.
              Maximum iterations.
        - **RidgeCV:**
            - alphas : list, default=[0.1, 1.0, 10.0].
              List of alpha values to try.
            - cv : int, default=5.
              Cross-validation folds.
        - **LinearRegression:**
            - fit_intercept : bool, default=True.
              Whether to calculate the intercept.
        - **RandomForest:**
            - n_estimators : int, default=100.
              Number of trees in the forest.
            - max_depth : int or None, default=None.
              Maximum depth of trees.
            - min_samples_split : int, default=2.
              Minimum samples required to split a node.
            - random_state : int, default=42.
              Random seed for reproducibility.
        - **GaussianProcess:**
            - kernel : kernel object, default=None.
              Kernel for the Gaussian Process.
            - alpha : float, default=1e-10.
              Value added to diagonal of kernel matrix.
            - normalize_y : bool, default=True.
              Normalize target values.
            - random_state : int, default=42.
              Random seed for reproducibility.
        - **GradientBoosting:**
            - n_estimators : int, default=100.
              Number of boosting stages.
            - learning_rate : float, default=0.1.
              Learning rate.
            - max_depth : int, default=3.
              Maximum depth of trees.
            - random_state : int, default=42.
              Random seed for reproducibility.

        Returns
        -------
        object
            The fitted machine learning model.
1006 """ 1007 if self._model_type is None: 1008 raise ValueError("Model must be provided.") 1009 1010 X = self._data[self._factors] 1011 y = self._data[self._response] 1012 if self._split_size > 0: 1013 X_train, X_test, y_train, y_test = train_test_split( 1014 X, y, test_size=self._split_size) 1015 else: 1016 X_train, X_test, y_train, y_test = X, X, y, y 1017 1018 # Default parameters for each model type 1019 default_params = { 1020 "ElasticNetCV": {"l1_ratio": [0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1.0], 1021 "cv": 5, 1022 "max_iter": 1000}, 1023 "RidgeCV": {"alphas": [0.1, 1.0, 10.0], 1024 "cv": 5}, 1025 "LinearRegression": {"fit_intercept": True}, 1026 "RandomForest": {"n_estimators": 100, 1027 "max_depth": None, 1028 "min_samples_split": 2, 1029 "random_state": 42}, 1030 "GaussianProcess": {"kernel": None, 1031 "alpha": 1e-10, 1032 "normalize_y": True, 1033 "random_state": 42}, 1034 "GradientBoosting": {"n_estimators": 100, 1035 "learning_rate": 0.1, 1036 "max_depth": 3, 1037 "random_state": 42} 1038 } 1039 1040 # Get default parameters for the selected model 1041 model_defaults = default_params.get(self._model_type, {}) 1042 1043 # Override defaults with any provided kwargs 1044 model_params = {**model_defaults, **kwargs} 1045 1046 if self._model_type == "ElasticNetCV": 1047 self._model = make_pipeline(StandardScaler(), 1048 ElasticNetCV(**model_params)) 1049 elif self._model_type == "RidgeCV": 1050 self._model = make_pipeline(StandardScaler(), 1051 RidgeCV(**model_params)) 1052 elif self._model_type == "LinearRegression": 1053 self._model = make_pipeline(StandardScaler(), 1054 LinearRegression(**model_params)) 1055 elif self._model_type == "RandomForest": 1056 self._model = make_pipeline(StandardScaler(), 1057 RandomForestRegressor(**model_params)) 1058 elif self._model_type == "GaussianProcess": 1059 self._model = make_pipeline(StandardScaler(), 1060 GaussianProcessRegressor(**model_params)) 1061 elif self._model_type == "GradientBoosting": 1062 self._model = make_pipeline(StandardScaler(), 1063 GradientBoostingRegressor(**model_params)) 1064 1065 # Fit the model 1066 self._model.fit(X_train, y_train) 1067 return self._model 1068 1069 def plot_ML_model(self, features_in_log=False): 1070 """ 1071 Plot the machine learning model using plotly. 1072 1073 Parameters 1074 ---------- 1075 features_in_log : bool, optional 1076 Whether to plot the feature importances in log scale. Default is False. 1077 1078 Returns 1079 ------- 1080 fig : list 1081 The list of plotly figures. 
1082 """ 1083 coef_names = self._data[self._factors].columns 1084 X = self._data[self._factors] 1085 y = self._data[self._response] 1086 if self._split_size > 0: 1087 X_train, X_test, y_train, y_test = train_test_split( 1088 X, y, test_size=self._split_size) 1089 else: 1090 X_train, X_test, y_train, y_test = X, X, y, y 1091 bootstrap_coefs = bootstrap_coefficients( 1092 self._model[1], X, y, n_bootstrap=100, random_state=42) 1093 # Calculate mean and standard deviation of coefficients 1094 mean_coefs = np.mean(bootstrap_coefs, axis=0) 1095 std_coefs = np.std(bootstrap_coefs, axis=0) 1096 # make the fig 1097 fig = [None] * 2 1098 fig[0] = go.Figure() 1099 # Add actual vs predicted line 1100 fig[0].add_trace(go.Scatter(x=y_train, 1101 y=self._model.predict(X_train), 1102 mode='markers', 1103 marker=dict(size=12, color='royalblue'), 1104 name='Training')) 1105 if self._split_size > 0: 1106 fig[0].add_trace(go.Scatter(x=y_test, 1107 y=self._model.predict(X_test), 1108 mode='markers', 1109 marker=dict(size=12, color='orange'), 1110 name='Validation')) 1111 # Add 1:1 line 1112 fig[0].add_shape(type="line", 1113 x0=min(y_train), y0=min(y_train), 1114 x1=max(y_train), y1=max(y_train), 1115 line=dict(color="Gray", width=1, dash="dash")) 1116 fig[0].update_layout( 1117 plot_bgcolor="white", # White background 1118 legend=dict(bgcolor='rgba(0,0,0,0)'), 1119 xaxis_title=f'Actual {self._response}', 1120 yaxis_title=f'Predicted {self._response}', 1121 height=500, # Adjust height as needed 1122 margin=dict(l=10, r=10, t=120, b=50), 1123 xaxis=dict( 1124 showgrid=True, # Enable grid 1125 gridcolor="lightgray", # Light gray grid lines 1126 zeroline=False, 1127 zerolinecolor="black", # Black zero line 1128 showline=True, 1129 linewidth=1, 1130 linecolor="black", # Black border 1131 mirror=True 1132 ), 1133 yaxis=dict( 1134 showgrid=True, # Enable grid 1135 gridcolor="lightgray", # Light gray grid lines 1136 zeroline=False, 1137 zerolinecolor="black", # Black zero line 1138 showline=True, 1139 linewidth=1, 1140 linecolor="black", # Black border 1141 mirror=True 1142 ) 1143 ) 1144 # add the score in title 1145 fig[0].update_layout( 1146 title={ 1147 'text': f"Predicted vs Actual for {str(self._model[1])}<br>R²_test = {r2_score(y_test, self._model.predict(X_test)):.4f} - R²_train = {r2_score(y_train, self._model.predict(X_train)):.4f}<br>RMSE_test = {root_mean_squared_error(y_test, self._model.predict(X_test)):.4f} - RMSE_train = {root_mean_squared_error(y_train, self._model.predict(X_train)):.4f}", 1148 'x': 0.45, 1149 'xanchor': 'center' 1150 } 1151 ) 1152 # make feature importance plot 1153 if self._model_type != "GaussianProcess": 1154 fig[1] = go.Figure() 1155 pos_coefs = mean_coefs[mean_coefs > 0] 1156 pos_coefs_names = coef_names[mean_coefs > 0] 1157 neg_coefs = mean_coefs[mean_coefs < 0] 1158 neg_coefs_names = coef_names[mean_coefs < 0] 1159 # Add bars for positive mean coefficients 1160 fig[1].add_trace(go.Bar( 1161 y=[f"{pos_coefs_names[i]}" for i in range(len(pos_coefs))], 1162 x=mean_coefs[mean_coefs > 0], 1163 error_x=dict(type='data', array=std_coefs[mean_coefs > 0], visible=True), 1164 orientation='h', 1165 marker_color='royalblue', 1166 name='Positive' 1167 )) 1168 fig[1].add_trace(go.Bar( 1169 y=[f"{neg_coefs_names[i]}" for i in range(len(neg_coefs))], 1170 x=-mean_coefs[mean_coefs < 0], 1171 error_x=dict(type='data', array=std_coefs[mean_coefs < 0], visible=True), 1172 orientation='h', 1173 marker_color='orange', 1174 name='Negative' 1175 )) 1176 # Update layout 1177 fig[1].update_layout( 
                title={
                    'text': f"Features importance for {str(self._model[1])}",
                    'x': 0.5,
                    'xanchor': 'center'
                },
                xaxis_title="Coefficient Value",
                yaxis_title="Features",
                barmode='relative',
                margin=dict(l=150)
            )
            if features_in_log:
                fig[1].update_xaxes(type="log")
            else:
                fig[1].update_xaxes(type="linear")
            fig[1].update_layout(
                plot_bgcolor="white",  # White background
                legend=dict(bgcolor='rgba(0,0,0,0)'),
                height=500,  # Adjust height as needed
                margin=dict(l=10, r=10, t=120, b=50),
                xaxis=dict(
                    showgrid=True,  # Enable grid
                    gridcolor="lightgray",  # Light gray grid lines
                    zeroline=False,
                    zerolinecolor="black",  # Black zero line
                    showline=True,
                    linewidth=1,
                    linecolor="black",  # Black border
                    mirror=True
                ),
                yaxis=dict(
                    showgrid=True,  # Enable grid
                    gridcolor="lightgray",  # Light gray grid lines
                    zeroline=False,
                    zerolinecolor="black",  # Black zero line
                    showline=True,
                    linewidth=1,
                    linecolor="black",  # Black border
                    mirror=True
                )
            )
        else:
            print('GaussianProcess does not have coefficients or feature importances')
        return fig

    def predict(self, X=None, model='all'):
        """
        Predict using the machine learning model and the linear model,
        if they are trained. Use the encoders to encode the data.
        If X is not provided, use the original data.
        If the model has not been trained, raise a warning.

        Parameters
        ----------
        X : pd.DataFrame, optional
            The input features. Default is None, which uses the original data.
        model : str, optional
            Which predictions to return: 'all' (both models), 'ML', or 'linear'.
            Default is 'all'.

        Returns
        -------
        pd.DataFrame
            The predicted values with a column indicating the model used.
        """
        if X is None:
            X = self._data[self._factors].copy()
        else:
            X = X.copy()

        # Encode the input data using the same encoders
        for factor in self._factors:
            if factor in self._encoders:
                X[factor] = self._encoders[factor].transform(X[factor].astype(str))

        predML = None
        predlin = None

        if self._model is not None:
            # Check if the model is fitted
            if not hasattr(self._model, 'predict'):
                raise Warning("Model has not been trained yet.")
            predML = self._model.predict(X)
            predML = pd.DataFrame(predML, columns=['prediction'])
            predML['model'] = self._model_type

        if self._linear_model is not None:
            # Check if the linear model is fitted
            if not hasattr(self._linear_model, 'predict'):
                raise Warning("Linear model has not been trained yet.")
            predlin = self._linear_model.predict(X)
            predlin = pd.DataFrame(predlin, columns=['prediction'])
            predlin['model'] = 'Linear Model'

        if predML is not None and predlin is not None and model == 'all':
            # Combine the predictions on top of each other
            pred = pd.concat([predML, predlin], axis=0)
            pred = pred.reset_index(drop=True)
            return pred
        elif predML is not None and model == 'ML':
            return predML
        elif predlin is not None and model == 'linear':
            return predlin
        else:
            raise Warning("No model has been trained yet.")


def bootstrap_coefficients(mod, X, y, n_bootstrap=100, random_state=None):
    """
    Perform bootstrap resampling to estimate the variability of model coefficients.

    Parameters
    ----------
    mod : object
        The machine learning model.
    X : pd.DataFrame
        The input features.
    y : pd.Series
        The target variable.
    n_bootstrap : int, optional
        The number of bootstrap samples. Default is 100.
    random_state : int, optional
        The seed for the random number generator.

    Returns
    -------
    results : np.ndarray
        The bootstrapped coefficients.
    """
    np.random.seed(random_state)
    n_samples = X.shape[0]
    results = []

    for _ in range(n_bootstrap):
        # Resample the data
        indices = np.random.choice(np.arange(n_samples), size=n_samples, replace=True)
        X_resample, y_resample = X.values[indices], y.values[indices]

        # Fit the model
        if isinstance(mod, RidgeCV):
            model = RidgeCV(alphas=np.logspace(-3, 3, 10)).fit(X_resample, y_resample)
            results.append(model.coef_)
        elif isinstance(mod, ElasticNetCV):
            model = ElasticNetCV(alphas=np.logspace(-3, 3, 10), l1_ratio=0.5).fit(X_resample, y_resample)
            results.append(model.coef_)
        elif isinstance(mod, LinearRegression):
            model = LinearRegression().fit(X_resample, y_resample)
            results.append(model.coef_)
        elif isinstance(mod, RandomForestRegressor):
            model = RandomForestRegressor().fit(X_resample, y_resample)
            results.append(model.feature_importances_)
        elif isinstance(mod, GradientBoostingRegressor):
            model = GradientBoostingRegressor().fit(X_resample, y_resample)
            results.append(model.feature_importances_)
        elif isinstance(mod, GaussianProcessRegressor):
            # Gaussian Process does not have coefficients or feature importances
            model = GaussianProcessRegressor().fit(X_resample, y_resample)
            results.append(np.zeros(X.shape[1]))  # Placeholder for Gaussian Process
        else:
            raise ValueError(f"Unsupported model type: {mod}")

    return np.array(results)
````
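As a concrete illustration of how `write_equation` assembles the R-style formula and how `predict` stacks the outputs of both fitted models, here is a short sketch continuing the quick-start example above. The factor names `temperature` and `pressure` and the response name `strength` are hypothetical:

```python
# Suppose factors = ['temperature', 'pressure'] and response = 'strength'.
eqn = analysis.write_equation(order=2, quadratic=['temperature'])
# eqn is now:
#   'strength ~ temperature + pressure + temperature:pressure + I(temperature**2) '
analysis.compute_linear_model()      # reuses analysis.equation set above

analysis.model_type = "ElasticNetCV"
analysis.compute_ML_model()

# With both models trained, predict(model='all') returns their predictions
# stacked in one DataFrame, with a 'model' column naming the source model.
preds = analysis.predict(model='all')
print(preds.head())
```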
815 """ 816 if self._linear_model is None: 817 raise Warning("Linear model has not been computed yet.") 818 fig = [None] * 2 819 fig[0] = go.Figure() 820 # Fig[0]: actual vs predicted line 821 fig[0].add_trace(go.Scatter(x=self._data[self._response], 822 y=self._linear_model.predict(), 823 mode='markers', 824 marker=dict(size=12), 825 name='Actual vs Predicted')) 826 # Add 1:1 line 827 fig[0].add_shape(type="line", 828 x0=min(self._data[self._response]), 829 y0=min(self._data[self._response]), 830 x1=max(self._data[self._response]), 831 y1=max(self._data[self._response]), 832 line=dict(color="Gray", width=1, dash="dash")) 833 fig[0].update_layout( 834 plot_bgcolor="white", # White background 835 legend=dict(bgcolor='rgba(0,0,0,0)'), 836 xaxis_title=f'Actual {self._response}', 837 yaxis_title=f'Predicted {self._response}', 838 height=500, # Adjust height as needed 839 margin=dict(l=10, r=10, t=50, b=50), 840 xaxis=dict( 841 showgrid=True, # Enable grid 842 gridcolor="lightgray", # Light gray grid lines 843 zeroline=False, 844 zerolinecolor="black", # Black zero line 845 showline=True, 846 linewidth=1, 847 linecolor="black", # Black border 848 mirror=True 849 ), 850 yaxis=dict( 851 showgrid=True, # Enable grid 852 gridcolor="lightgray", # Light gray grid lines 853 zeroline=False, 854 zerolinecolor="black", # Black zero line 855 showline=True, 856 linewidth=1, 857 linecolor="black", # Black border 858 mirror=True 859 ) 860 ) 861 # # # # # # # # # # # # # # # # # # # # # 862 # Fig[1]: slope values for each factor 863 # # # # # # # # # # # # # # # # # # # # # 864 # Plot: Slope values for each factor 865 res = self._linear_model.params.rename_axis('terms').reset_index()[1:] 866 res.columns = ['terms', 'slope'] 867 error = self._linear_model.bse.rename_axis('terms').reset_index()[1:] 868 error.columns = ['terms', 'error'] 869 res['error'] = error['error'] 870 res['pvalue'] = [self._linear_model.pvalues[res['terms']].iloc[i] for i in range(len(res))] 871 872 # Sort by p-values 873 res = res.sort_values(by='pvalue', ascending=False) 874 # Prepare colors and labels 875 colors = ['red' if x < 0 else 'green' for x in res['slope']] 876 res['slope'] = res['slope'].abs() 877 878 fig[1] = go.Figure() 879 880 # Add bar plot 881 fig[1].add_trace(go.Bar( 882 y=[term.replace('I(', '').replace('C(', '').replace(')', '').replace(' ** ', '^') for term in res['terms']], 883 x=res['slope'], 884 error_x=dict(type='data', array=res['error']), 885 marker_color=colors, 886 orientation='h', 887 name='Slope', 888 showlegend=False # Hide the legend entry for the bar trace 889 )) 890 891 # Update layout for log scale and labels 892 fig[1].update_layout( 893 xaxis_title='Magnitude of effect', 894 xaxis_type="log", 895 height=500, # Adjust height as needed 896 plot_bgcolor="white", # White background 897 legend=dict( 898 orientation="h", # Horizontal orientation 899 y=1.1 # Place legend on top 900 ), 901 margin=dict(l=10, r=150, t=50, b=50), # Increase right margin for annotations 902 xaxis=dict( 903 showgrid=True, # Enable grid 904 gridcolor="lightgray", # Light gray grid lines 905 zeroline=False, 906 zerolinecolor="black", # Black zero line 907 showline=True, 908 linewidth=1, 909 linecolor="black", # Black border 910 mirror=True 911 ), 912 yaxis=dict( 913 showgrid=True, # Enable grid 914 gridcolor="lightgray", # Light gray grid lines 915 zeroline=False, 916 zerolinecolor="black", # Black zero line 917 showline=True, 918 linewidth=1, 919 linecolor="black", # Black border 920 mirror=True 921 ) 922 ) 923 924 # Add legend for 
Negative and Positive 925 fig[1].add_trace(go.Scatter( 926 x=[None], y=[None], mode='markers', 927 marker=dict(size=10, color='red'), 928 name='Negative' 929 )) 930 fig[1].add_trace(go.Scatter( 931 x=[None], y=[None], mode='markers', 932 marker=dict(size=10, color='green'), 933 name='Positive' 934 )) 935 936 # Add p-values as annotations outside the plot 937 for i, p in enumerate(res['slope']): 938 fig[1].add_annotation( 939 x=1.025, # Place annotations outside the plot 940 y=i, 941 text=f"p = {self._linear_model.pvalues[res['terms'].iloc[i]]:.2g}", 942 showarrow=False, 943 xref="paper", # Use paper coordinates for x 944 xanchor='left' 945 ) 946 947 return fig 948 949 def compute_ML_model(self, **kwargs): 950 """ 951 Compute the machine learning model. 952 953 Parameters 954 ---------- 955 kwargs : dict 956 Additional keyword arguments for the model. Overrides default parameters. 957 958 Default Parameters by Model Type 959 -------------------------------- 960 - **ElasticNetCV:** 961 - l1_ratio : list, default=[0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1.0]. 962 List of L1 ratios to try. 963 - cv : int, default=5. 964 Cross-validation folds. 965 - max_iter : int, default=1000. 966 Maximum iterations. 967 - **RidgeCV:** 968 - alphas : list, default=[0.1, 1.0, 10.0]. 969 List of alpha values to try. 970 - cv : int, default=5. 971 Cross-validation folds. 972 - **LinearRegression:** 973 - fit_intercept : bool, default=True. 974 Whether to calculate the intercept. 975 - **RandomForest:** 976 - n_estimators : int, default=100. 977 Number of trees in the forest. 978 - max_depth : int or None, default=None. 979 Maximum depth of trees. 980 - min_samples_split : int, default=2. 981 Minimum samples required to split a node. 982 - random_state : int, default=42. 983 Random seed for reproducibility. 984 - **GaussianProcess:** 985 - kernel : kernel object, default=None. 986 Kernel for the Gaussian Process. 987 - alpha : float, default=1e-10. 988 Value added to diagonal of kernel matrix. 989 - normalize_y : bool, default=True. 990 Normalize target values. 991 - random_state : int, default=42. 992 Random seed for reproducibility. 993 - **GradientBoosting:** 994 - n_estimators : int, default=100. 995 Number of boosting stages. 996 - learning_rate : float, default=0.1. 997 Learning rate. 998 - max_depth : int, default=3. 999 Maximum depth of trees. 1000 - random_state : int, default=42. 1001 Random seed for reproducibility. 1002 1003 Returns 1004 ------- 1005 object 1006 The fitted machine learning model. 
1007 """ 1008 if self._model_type is None: 1009 raise ValueError("Model must be provided.") 1010 1011 X = self._data[self._factors] 1012 y = self._data[self._response] 1013 if self._split_size > 0: 1014 X_train, X_test, y_train, y_test = train_test_split( 1015 X, y, test_size=self._split_size) 1016 else: 1017 X_train, X_test, y_train, y_test = X, X, y, y 1018 1019 # Default parameters for each model type 1020 default_params = { 1021 "ElasticNetCV": {"l1_ratio": [0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1.0], 1022 "cv": 5, 1023 "max_iter": 1000}, 1024 "RidgeCV": {"alphas": [0.1, 1.0, 10.0], 1025 "cv": 5}, 1026 "LinearRegression": {"fit_intercept": True}, 1027 "RandomForest": {"n_estimators": 100, 1028 "max_depth": None, 1029 "min_samples_split": 2, 1030 "random_state": 42}, 1031 "GaussianProcess": {"kernel": None, 1032 "alpha": 1e-10, 1033 "normalize_y": True, 1034 "random_state": 42}, 1035 "GradientBoosting": {"n_estimators": 100, 1036 "learning_rate": 0.1, 1037 "max_depth": 3, 1038 "random_state": 42} 1039 } 1040 1041 # Get default parameters for the selected model 1042 model_defaults = default_params.get(self._model_type, {}) 1043 1044 # Override defaults with any provided kwargs 1045 model_params = {**model_defaults, **kwargs} 1046 1047 if self._model_type == "ElasticNetCV": 1048 self._model = make_pipeline(StandardScaler(), 1049 ElasticNetCV(**model_params)) 1050 elif self._model_type == "RidgeCV": 1051 self._model = make_pipeline(StandardScaler(), 1052 RidgeCV(**model_params)) 1053 elif self._model_type == "LinearRegression": 1054 self._model = make_pipeline(StandardScaler(), 1055 LinearRegression(**model_params)) 1056 elif self._model_type == "RandomForest": 1057 self._model = make_pipeline(StandardScaler(), 1058 RandomForestRegressor(**model_params)) 1059 elif self._model_type == "GaussianProcess": 1060 self._model = make_pipeline(StandardScaler(), 1061 GaussianProcessRegressor(**model_params)) 1062 elif self._model_type == "GradientBoosting": 1063 self._model = make_pipeline(StandardScaler(), 1064 GradientBoostingRegressor(**model_params)) 1065 1066 # Fit the model 1067 self._model.fit(X_train, y_train) 1068 return self._model 1069 1070 def plot_ML_model(self, features_in_log=False): 1071 """ 1072 Plot the machine learning model using plotly. 1073 1074 Parameters 1075 ---------- 1076 features_in_log : bool, optional 1077 Whether to plot the feature importances in log scale. Default is False. 1078 1079 Returns 1080 ------- 1081 fig : list 1082 The list of plotly figures. 
1083 """ 1084 coef_names = self._data[self._factors].columns 1085 X = self._data[self._factors] 1086 y = self._data[self._response] 1087 if self._split_size > 0: 1088 X_train, X_test, y_train, y_test = train_test_split( 1089 X, y, test_size=self._split_size) 1090 else: 1091 X_train, X_test, y_train, y_test = X, X, y, y 1092 bootstrap_coefs = bootstrap_coefficients( 1093 self._model[1], X, y, n_bootstrap=100, random_state=42) 1094 # Calculate mean and standard deviation of coefficients 1095 mean_coefs = np.mean(bootstrap_coefs, axis=0) 1096 std_coefs = np.std(bootstrap_coefs, axis=0) 1097 # make the fig 1098 fig = [None] * 2 1099 fig[0] = go.Figure() 1100 # Add actual vs predicted line 1101 fig[0].add_trace(go.Scatter(x=y_train, 1102 y=self._model.predict(X_train), 1103 mode='markers', 1104 marker=dict(size=12, color='royalblue'), 1105 name='Training')) 1106 if self._split_size > 0: 1107 fig[0].add_trace(go.Scatter(x=y_test, 1108 y=self._model.predict(X_test), 1109 mode='markers', 1110 marker=dict(size=12, color='orange'), 1111 name='Validation')) 1112 # Add 1:1 line 1113 fig[0].add_shape(type="line", 1114 x0=min(y_train), y0=min(y_train), 1115 x1=max(y_train), y1=max(y_train), 1116 line=dict(color="Gray", width=1, dash="dash")) 1117 fig[0].update_layout( 1118 plot_bgcolor="white", # White background 1119 legend=dict(bgcolor='rgba(0,0,0,0)'), 1120 xaxis_title=f'Actual {self._response}', 1121 yaxis_title=f'Predicted {self._response}', 1122 height=500, # Adjust height as needed 1123 margin=dict(l=10, r=10, t=120, b=50), 1124 xaxis=dict( 1125 showgrid=True, # Enable grid 1126 gridcolor="lightgray", # Light gray grid lines 1127 zeroline=False, 1128 zerolinecolor="black", # Black zero line 1129 showline=True, 1130 linewidth=1, 1131 linecolor="black", # Black border 1132 mirror=True 1133 ), 1134 yaxis=dict( 1135 showgrid=True, # Enable grid 1136 gridcolor="lightgray", # Light gray grid lines 1137 zeroline=False, 1138 zerolinecolor="black", # Black zero line 1139 showline=True, 1140 linewidth=1, 1141 linecolor="black", # Black border 1142 mirror=True 1143 ) 1144 ) 1145 # add the score in title 1146 fig[0].update_layout( 1147 title={ 1148 'text': f"Predicted vs Actual for {str(self._model[1])}<br>R²_test = {r2_score(y_test, self._model.predict(X_test)):.4f} - R²_train = {r2_score(y_train, self._model.predict(X_train)):.4f}<br>RMSE_test = {root_mean_squared_error(y_test, self._model.predict(X_test)):.4f} - RMSE_train = {root_mean_squared_error(y_train, self._model.predict(X_train)):.4f}", 1149 'x': 0.45, 1150 'xanchor': 'center' 1151 } 1152 ) 1153 # make feature importance plot 1154 if self._model_type != "GaussianProcess": 1155 fig[1] = go.Figure() 1156 pos_coefs = mean_coefs[mean_coefs > 0] 1157 pos_coefs_names = coef_names[mean_coefs > 0] 1158 neg_coefs = mean_coefs[mean_coefs < 0] 1159 neg_coefs_names = coef_names[mean_coefs < 0] 1160 # Add bars for positive mean coefficients 1161 fig[1].add_trace(go.Bar( 1162 y=[f"{pos_coefs_names[i]}" for i in range(len(pos_coefs))], 1163 x=mean_coefs[mean_coefs > 0], 1164 error_x=dict(type='data', array=std_coefs[mean_coefs > 0], visible=True), 1165 orientation='h', 1166 marker_color='royalblue', 1167 name='Positive' 1168 )) 1169 fig[1].add_trace(go.Bar( 1170 y=[f"{neg_coefs_names[i]}" for i in range(len(neg_coefs))], 1171 x=-mean_coefs[mean_coefs < 0], 1172 error_x=dict(type='data', array=std_coefs[mean_coefs < 0], visible=True), 1173 orientation='h', 1174 marker_color='orange', 1175 name='Negative' 1176 )) 1177 # Update layout 1178 fig[1].update_layout( 
1179 title={ 1180 'text': f"Features importance for {str(self._model[1])}", 1181 'x': 0.5, 1182 'xanchor': 'center' 1183 }, 1184 xaxis_title="Coefficient Value", 1185 yaxis_title="Features", 1186 barmode='relative', 1187 margin=dict(l=150) 1188 ) 1189 if features_in_log: 1190 fig[1].update_xaxes(type="log") 1191 else: 1192 fig[1].update_xaxes(type="linear") 1193 fig[1].update_layout( 1194 plot_bgcolor="white", # White background 1195 legend=dict(bgcolor='rgba(0,0,0,0)'), 1196 height=500, # Adjust height as needed 1197 margin=dict(l=10, r=10, t=120, b=50), 1198 xaxis=dict( 1199 showgrid=True, # Enable grid 1200 gridcolor="lightgray", # Light gray grid lines 1201 zeroline=False, 1202 zerolinecolor="black", # Black zero line 1203 showline=True, 1204 linewidth=1, 1205 linecolor="black", # Black border 1206 mirror=True 1207 ), 1208 yaxis=dict( 1209 showgrid=True, # Enable grid 1210 gridcolor="lightgray", # Light gray grid lines 1211 zeroline=False, 1212 zerolinecolor="black", # Black zero line 1213 showline=True, 1214 linewidth=1, 1215 linecolor="black", # Black border 1216 mirror=True 1217 ) 1218 ) 1219 else: 1220 print('GaussianProcess does not have coefficients or feature importances') 1221 return fig 1222 1223 def predict(self, X=None, model='all'): 1224 """ 1225 Predict using the machine learning model and the linear model, 1226 if they are trained. Use the encoders to encode the data. 1227 If X is not provided, use the original data. 1228 If the model has not been trained, raise a warning. 1229 1230 Parameters 1231 ---------- 1232 X : pd.DataFrame, optional 1233 The input features. Default is None, which uses the original data. 1234 1235 Returns 1236 ------- 1237 pd.DataFrame 1238 The predicted values with a column indicating the model used. 1239 """ 1240 if X is None: 1241 X = self._data[self._factors].copy() 1242 else: 1243 X = X.copy() 1244 1245 # Encode the input data using the same encoders 1246 for factor in self._factors: 1247 if factor in self._encoders: 1248 X[factor] = self._encoders[factor].transform(X[factor].astype(str)) 1249 1250 predML = None 1251 predlin = None 1252 1253 if self._model is not None: 1254 # Check if the model is fitted 1255 if not hasattr(self._model, 'predict'): 1256 raise Warning("Model has not been trained yet.") 1257 predML = self._model.predict(X) 1258 predML = pd.DataFrame(predML, columns=['prediction']) 1259 predML['model'] = self._model_type 1260 1261 if self._linear_model is not None: 1262 # Check if the linear model is fitted 1263 if not hasattr(self._linear_model, 'predict'): 1264 raise Warning("Linear model has not been trained yet.") 1265 predlin = self._linear_model.predict(X) 1266 predlin = pd.DataFrame(predlin, columns=['prediction']) 1267 predlin['model'] = 'Linear Model' 1268 1269 if predML is not None and predlin is not None and model == 'all': 1270 # Combine the predictions on top of each other 1271 pred = pd.concat([predML, predlin], axis=0) 1272 pred = pred.reset_index(drop=True) 1273 return pred 1274 elif predML is not None and model == 'ML': 1275 return predML 1276 elif predlin is not None and model == 'linear': 1277 return predlin 1278 else: 1279 raise Warning("No model has been trained yet.")
This class is used to analyze the data and perform regression analysis.
Parameters
- data (pd.DataFrame): The input data.
- factors (list): The list of factor variables.
- response (str): The response variable.
- split_size (float, optional): The proportion of the dataset to include in the test split. Default is 0.2.
- model_type (str, optional): The type of machine learning model to use. Default is None. Must be one of the following: "ElasticNetCV", "RidgeCV", "LinearRegression", "RandomForest", "GaussianProcess", "GradientBoosting".
Attributes
- data (pd.DataFrame): The input data.
- factors (list): The list of factor variables.
- response (str): The response variable.
- encoders (dict): The encoders for categorical variables.
- dtypes (pd.Series): The data types of the columns.
- linear_model (object): The linear model object.
- equation (str): The equation for the linear model, in the form response ~ var1 + var2 + var1:var2.
- model (object): The machine learning model object.
- model_type (str): The type of machine learning model to use.
- split_size (float): The proportion of the dataset to include in the test split.
Methods
- encode_data(): Encodes categorical variables in the data. Called during initialization.
- plot_qq(): Plots a Q-Q plot for the response variable using plotly.
- plot_boxplot(): Plots a boxplot for the response variable using plotly.
- plot_histogram(): Plots a histogram for the response variable using plotly.
- plot_scatter_response(): Plots a scatter plot for the response variable using plotly.
- plot_corr(): Plots a correlation matrix for the data using plotly.
- plot_pairplot_seaborn(): Plots a pairplot for the data using seaborn.
- plot_pairplot_plotly(): Plots a pairplot for the data using plotly.
- write_equation(order=1, quadratic=[]): Writes an R-style equation for the multivariate fitting procedure using the statsmodels package.
- compute_linear_model(order=1, quadratic=[]): Computes the linear model using the statsmodels package.
- plot_linear_model(): Plots the linear model using plotly.
- compute_ML_model(**kwargs): Computes the machine learning model using sklearn.
- plot_ML_model(features_in_log=False): Plots the machine learning model using plotly.
- predict(X=None, model='all'): Predicts with the trained model(s), encoding categorical inputs with the stored encoders.
Example
from optimeo.analysis import *
data = pd.read_csv('dataML.csv')
factors = list(data.columns[:-1])
response = data.columns[-1]
analysis = DataAnalysis(data, factors, response)
analysis.model_type = "ElasticNetCV"
MLmodel = analysis.compute_ML_model()
figs = analysis.plot_ML_model()
for fig in figs:
fig.show()
def __init__(self,
             data: pd.DataFrame,
             factors: list,
             response: str,
             split_size=.2,
             model_type=None):
    self._dtypes = None
    self._encoders = {}
    self._linear_model = None
    self._model = None
    self._equation = ''
    self._data = data
    self._factors = factors
    self._response = response
    self._model_type = model_type
    self._split_size = split_size
    self.encode_data()
def encode_data(self):
    """
    Called during initialization: encodes categorical variables in the data if there are any.
    Uses `LabelEncoder()` from `sklearn` to convert categorical variables to numerical values.
    Also drops rows with NaN values.
    """
    self._dtypes = self._data.dtypes
    for factor in self._factors:
        if self._dtypes[factor] == 'object':
            le = LabelEncoder()
            self._encoders[factor] = le
            self._data[factor] = le.fit_transform([str(d) for d in self._data[factor]])
    self._data = self._data[self._factors + [self._response]]
    self._data = self._data.dropna(axis=0, how='any')
Called during initialization: encodes categorical variables in the data if there are any.
Uses LabelEncoder() from sklearn to convert categorical variables to numerical values.
Also drops rows with NaN values.
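A minimal sketch of how the stored encoders can be reused afterwards, assuming a hypothetical categorical factor named 'Solvent':

analysis = DataAnalysis(data, factors, response)
le = analysis.encoders.get('Solvent')     # 'Solvent' is an illustrative column name
if le is not None:
    print(le.inverse_transform([0, 1]))   # recover the original category labels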
@property
def factors(self):
    """The list of names of the columns of the `data` DataFrame that contain factor variables."""
    return self._factors
The list of names of the columns of the data DataFrame that contain factor variables.
@property
def encoders(self):
    """The dictionary of encoders for categorical variables, keyed by factor name."""
    return self._encoders
The dictionary of encoders for categorical variables, keyed by factor name.
@property
def dtypes(self):
    """Get the data types of the columns."""
    return self._dtypes
Get the data types of the columns.
@property
def response(self):
    """The name of the column of the `data` DataFrame that contains the response variable."""
    return self._response
The name of the column of the data DataFrame that contains the response variable.
@property
def linear_model(self):
    """Get the linear model."""
    return self._linear_model
Get the linear model.
@property
def equation(self):
    """The equation for the linear model, in the form `response ~ var1 + var2 + var1:var2`. This is based on the [statsmodels package](https://www.statsmodels.org/dev/examples/notebooks/generated/formulas.html)."""
    return self._equation
The equation for the linear model, in the form response ~ var1 + var2 + var1:var2. This is based on the statsmodels package.
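Since compute_linear_model() only auto-generates a formula when the equation is empty, a custom formula can be supplied first. A minimal sketch, assuming illustrative column names x1, x2, y and that equation accepts assignment the same way model_type does in the class example:

analysis.equation = 'y ~ x1 + x2 + I(x1**2)'   # hypothetical factors and response
lm = analysis.compute_linear_model()           # fits the custom formula as-is
print(lm.summary())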
@property
def model_type(self):
    """The type of machine learning model to use. Default is None.
    Must be one of the following: `"ElasticNetCV"`, `"RidgeCV"`,
    `"LinearRegression"`, `"RandomForest"`, `"GaussianProcess"`, `"GradientBoosting"`."""
    return self._model_type
The type of machine learning model to use. Default is None. Must be one of the following: "ElasticNetCV", "RidgeCV", "LinearRegression", "RandomForest", "GaussianProcess", "GradientBoosting".
@property
def model(self):
    """The machine learning model object."""
    return self._model
The machine learning model object.
@property
def split_size(self):
    """The proportion of the dataset to include in the test split. Default is `0.2`."""
    return self._split_size
The proportion of the dataset to include in the test split. Default is 0.2.
def plot_qq(self):
    """
    Plot a Q-Q plot for the response variable.

    Returns
    -------
    fig : plotly.graph_objs.Figure
        The Q-Q plot figure.
    """
    qqplot_data = pd.Series(self._data[self._response], copy=True)
    qqplot_data = qqplot(qqplot_data, line='s').gca().lines
    fig = go.Figure()

    fig.add_trace({
        'type': 'scatter',
        'x': qqplot_data[0].get_xdata(),
        'y': qqplot_data[0].get_ydata(),
        'mode': 'markers',
    })

    fig.add_trace({
        'type': 'scatter',
        'x': qqplot_data[1].get_xdata(),
        'y': qqplot_data[1].get_ydata(),
        'mode': 'lines',
        'line': {
            'color': 'red'
        }
    })

    fig.update_layout(
        title='Q-Q Plot',
        xaxis_title='Theoretical Quantiles',
        yaxis_title='Sample Quantiles',
        plot_bgcolor="white",  # White background
        showlegend=False,
        margin=dict(l=10, r=10, t=50, b=50),
        xaxis=dict(
            showgrid=True,  # Enable grid
            gridcolor="lightgray",  # Light gray grid lines
            zeroline=False,
            zerolinecolor="black",  # Black zero line
            showline=True,
            linewidth=1,
            linecolor="black",  # Black border
            mirror=True
        ),
        yaxis=dict(
            showgrid=True,  # Enable grid
            gridcolor="lightgray",  # Light gray grid lines
            zeroline=False,
            zerolinecolor="black",  # Black zero line
            showline=True,
            linewidth=1,
            linecolor="black",  # Black border
            mirror=True
        ),
        height=300,
    )
    return fig
Plot a Q-Q plot for the response variable.
Returns
- fig (plotly.graph_objs.Figure): The Q-Q plot figure.
def plot_boxplot(self):
    """
    Plot a boxplot for the response variable.

    Returns
    -------
    fig : plotly.graph_objs.Figure
        The boxplot figure.
    """
    fig = px.box(self._data, y=self._response, points="all", title='Box Plot')
    fig.update_layout(
        plot_bgcolor="white",  # White background
        legend=dict(bgcolor='rgba(0,0,0,0)'),
        margin=dict(l=10, r=10, t=50, b=50),
        xaxis=dict(
            showgrid=True,  # Enable grid
            gridcolor="lightgray",  # Light gray grid lines
            zeroline=False,
            zerolinecolor="black",  # Black zero line
            showline=True,
            linewidth=1,
            linecolor="black",  # Black border
            mirror=True
        ),
        yaxis=dict(
            showgrid=True,  # Enable grid
            gridcolor="lightgray",  # Light gray grid lines
            zeroline=False,
            zerolinecolor="black",  # Black zero line
            showline=True,
            linewidth=1,
            linecolor="black",  # Black border
            mirror=True
        ),
        height=300,
    )
    return fig
Plot a boxplot for the response variable.
Returns
- fig (plotly.graph_objs.Figure): The boxplot figure.
def plot_histogram(self):
    """
    Plot a histogram for the response variable.

    Returns
    -------
    fig : plotly.graph_objs.Figure
        The histogram figure.
    """
    fig = px.histogram(self._data, x=self._response, title='Histogram')
    fig.update_layout(
        plot_bgcolor="white",  # White background
        legend=dict(bgcolor='rgba(0,0,0,0)'),
        margin=dict(l=10, r=10, t=50, b=50),
        xaxis=dict(
            showgrid=True,  # Enable grid
            gridcolor="lightgray",  # Light gray grid lines
            zeroline=False,
            zerolinecolor="black",  # Black zero line
            showline=True,
            linewidth=1,
            linecolor="black",  # Black border
            mirror=True
        ),
        yaxis=dict(
            showgrid=True,  # Enable grid
            gridcolor="lightgray",  # Light gray grid lines
            zeroline=False,
            zerolinecolor="black",  # Black zero line
            showline=True,
            linewidth=1,
            linecolor="black",  # Black border
            mirror=True
        ),
        height=300,
    )
    return fig
Plot a histogram for the response variable.
Returns
- fig (plotly.graph_objs.Figure): The histogram figure.
def plot_scatter_response(self):
    """
    Plot a scatter plot for the response variable.

    Returns
    -------
    fig : plotly.graph_objs.Figure
        The scatter plot figure.
    """
    fig = px.scatter(x=np.arange(1, len(self._data[self._response]) + 1),
                     y=self._data[self._response],
                     labels={'x': 'Measurement number', 'y': self._response},
                     title='Scatter Plot')
    fig.update_layout(
        plot_bgcolor="white",  # White background
        legend=dict(bgcolor='rgba(0,0,0,0)'),
        margin=dict(l=10, r=10, t=50, b=50),
        xaxis=dict(
            showgrid=True,  # Enable grid
            gridcolor="lightgray",  # Light gray grid lines
            zeroline=False,
            zerolinecolor="black",  # Black zero line
            showline=True,
            linewidth=1,
            linecolor="black",  # Black border
            mirror=True
        ),
        yaxis=dict(
            showgrid=True,  # Enable grid
            gridcolor="lightgray",  # Light gray grid lines
            zeroline=False,
            zerolinecolor="black",  # Black zero line
            showline=True,
            linewidth=1,
            linecolor="black",  # Black border
            mirror=True
        ),
        height=300,
    )
    return fig
Plot a scatter plot for the response variable.
Returns
- fig (plotly.graph_objs.Figure): The scatter plot figure.
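The four response diagnostics share the same call pattern; a quick usage sketch with the analysis object from the class example:

for fig in (analysis.plot_qq(), analysis.plot_boxplot(),
            analysis.plot_histogram(), analysis.plot_scatter_response()):
    fig.show()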
def plot_pairplot_seaborn(self):
    """
    Plot a pairplot for the data using seaborn.

    Returns
    -------
    fig : seaborn.axisgrid.PairGrid
        The pairplot figure.
    """
    fig = sns.pairplot(
        self._data,
        kind="reg",
        diag_kind="kde",
        plot_kws={"scatter_kws": {"alpha": 0.1}},
    )
    return fig
Plot a pairplot for the data using seaborn.
Returns
- fig (seaborn.axisgrid.PairGrid): The pairplot figure.
def plot_pairplot_plotly(self):
    """
    Plot a pairplot for the data using plotly.

    Returns
    -------
    fig : plotly.graph_objs.Figure
        The plotly figure.
    """

    # Get the column names
    columns = self._data.columns
    n_cols = len(columns)

    # Create subplot grid with partially shared axes
    fig = make_subplots(
        rows=n_cols,
        cols=n_cols,
        shared_xaxes=True,
        shared_yaxes=True,
        vertical_spacing=0.05,
        horizontal_spacing=0.05
    )

    # Add scatter plots and regression lines to each subplot
    for i in range(n_cols):
        for j in range(n_cols):
            x_data = self._data[columns[j]]
            y_data = self._data[columns[i]]

            if i == j:  # Diagonal: Add KDE plot
                # Calculate KDE
                data = self._data[columns[i]].dropna()
                if len(data) > 1:  # Ensure enough data points
                    kde_x = np.linspace(data.min(), data.max(), 100)
                    kde = stats.gaussian_kde(data)
                    kde_y = kde(kde_x)
                    # Scale to match y-axis (needed for top left corner)
                    kde_y = kde_y / kde_y.max() * np.max(y_data.dropna())

                    # Add KDE plot to diagonal
                    fig.add_trace(
                        go.Scatter(
                            x=kde_x,
                            y=kde_y,
                            mode='lines',
                            fill='tozeroy',
                            line=dict(color='rgba(29, 81, 189, 1)'),
                            showlegend=False
                        ),
                        row=i+1, col=j+1
                    )
                    # Ensure the y-axis is independent for diagonal plots
                    # (this does not take effect for the top left corner)
                    fig.update_yaxes(
                        matches=None,  # Ensure independent y-axis
                        showticklabels=True,  # Show y-axis values
                        row=i+1,
                        col=j+1
                    )
            else:  # Off-diagonal: Add scatter plot with regression line
                # Add scatter plot
                fig.add_trace(
                    go.Scatter(
                        x=x_data,
                        y=y_data,
                        mode='markers',
                        marker=dict(
                            color='rgba(29, 81, 189, 0.5)',
                            size=5
                        ),
                        showlegend=False
                    ),
                    row=i+1, col=j+1
                )

                # Calculate and add regression line
                if len(x_data.dropna()) > 1 and len(y_data.dropna()) > 1:
                    # Drop NaN values for regression calculation
                    valid_indices = x_data.notna() & y_data.notna()
                    x_clean = x_data[valid_indices]
                    y_clean = y_data[valid_indices]
                    ytype = y_clean.dtype
                    xtype = x_clean.dtype
                    unique_x = x_clean.unique()
                    if len(x_clean) > 1 and len(unique_x) > 1 and ytype != 'object' and xtype != 'object':
                        # Calculate regression parameters
                        slope, intercept, r_value, p_value, std_err = stats.linregress(x_clean, y_clean)
                        x_range = np.linspace(x_clean.min(), x_clean.max(), 100)
                        y_pred = slope * x_range + intercept

                        # Standard error of estimate
                        y_fit = slope * x_clean + intercept
                        residuals = y_clean - y_fit
                        dof = len(x_clean) - 2
                        residual_std_error = np.sqrt(np.sum(residuals**2) / dof)

                        mean_x = np.mean(x_clean)
                        t_val = t.ppf(0.975, dof)  # 95% confidence

                        se_line = residual_std_error * np.sqrt(1/len(x_clean) + (x_range - mean_x)**2 / np.sum((x_clean - mean_x)**2))
                        y_upper = y_pred + t_val * se_line
                        y_lower = y_pred - t_val * se_line

                        # Add regression line
                        fig.add_trace(
                            go.Scatter(
                                x=x_range,
                                y=y_pred,
                                mode='lines',
                                line=dict(color='red', width=2),
                                showlegend=False
                            ),
                            row=i+1, col=j+1
                        )

                        # Add confidence interval area
                        fig.add_trace(
                            go.Scatter(
                                x=np.concatenate([x_range, x_range[::-1]]),
                                y=np.concatenate([y_upper, y_lower[::-1]]),
                                fill='toself',
                                fillcolor='rgba(255, 0, 0, 0.2)',
                                line=dict(color='rgba(255, 0, 0, 0)'),
                                showlegend=False
                            ),
                            row=i+1, col=j+1
                        )

    # Update layout and axis properties
    fig.update_layout(
        margin=dict(l=10, r=10, t=50, b=50),
        title="Pair Plot",
        height=600,
        width=600,
        showlegend=False,
        plot_bgcolor="white"
    )

    # Update all axes properties first (hiding tick labels by default)
    fig.update_xaxes(
        showgrid=True,
        gridcolor="lightgray",
        zeroline=False,
        showline=True,
        linewidth=1,
        linecolor="black",
        mirror=True,
        showticklabels=False
    )

    fig.update_yaxes(
        showgrid=True,
        gridcolor="lightgray",
        zeroline=False,
        showline=True,
        linewidth=1,
        linecolor="black",
        mirror=True,
        showticklabels=False
    )

    # Show tick labels and titles only for bottom row and leftmost column
    for i, col_name in enumerate(columns):
        # Bottom row: show x-axis titles and tick labels
        fig.update_xaxes(
            title_text=col_name,
            showticklabels=True,
            row=n_cols,
            col=i+1
        )

        # Leftmost column: show y-axis titles and tick labels
        fig.update_yaxes(
            title_text=col_name,
            showticklabels=True,
            row=i+1,
            col=1
        )

    return fig
Plot a pairplot for the data using plotly.
Returns
- fig (plotly.graph_objs.Figure): The plotly figure.
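For reference, the shaded band on the off-diagonal panels is the standard 95% confidence interval for the fitted regression line: with residual standard error s, n points, and mean x̄, the half-width at x is t(0.975, n−2) · s · sqrt(1/n + (x − x̄)² / Σᵢ(xᵢ − x̄)²), which is what se_line computes in the source above.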
def plot_corr(self):
    """
    Plot a correlation matrix for the data.

    Returns
    -------
    fig : plotly.graph_objs.Figure
        The correlation heatmap figure.
    """
    corr_matrix = self._data.corr(method='pearson')
    mask = np.triu(np.ones_like(corr_matrix, dtype=bool), k=1)
    # Apply mask - replace upper triangle with NaN values
    corr_matrix_lower = corr_matrix.copy()
    corr_matrix_lower.values[mask] = np.nan

    fig = px.imshow(
        corr_matrix_lower,
        text_auto='.2f',  # Display correlation values
        color_continuous_scale='RdBu_r',  # Red to Blue color scale (reversed)
        zmin=-1,  # Minimum correlation value
        zmax=1,   # Maximum correlation value
        aspect="auto",  # Keep aspect ratio adaptive
        title="Pearson Correlation Heatmap"
    )
    # Customize hover template
    fig.update_traces(
        hovertemplate='<b>%{y} vs %{x}</b><br>Correlation: %{z}<extra></extra>'
    )
    # Improve layout
    fig.update_layout(
        coloraxis_colorbar=dict(
            title="Correlation",
            tickvals=[-1, -0.5, 0, 0.5, 1],
            ticktext=["-1", "-0.5", "0", "0.5", "1"],
        ),
        plot_bgcolor="white",  # White background
        legend=dict(bgcolor='rgba(0,0,0,0)'),
        margin=dict(l=10, r=10, t=50, b=50),
        xaxis=dict(
            showgrid=False,  # No grid for the heatmap
            gridcolor="lightgray",
            zeroline=False,
            zerolinecolor="black",
            showline=False,
            linewidth=1,
            tickangle=-45,
            linecolor="black",
            mirror=True
        ),
        yaxis=dict(
            showgrid=False,  # No grid for the heatmap
            gridcolor="lightgray",
            zeroline=False,
            zerolinecolor="black",
            showline=False,
            linewidth=1,
            linecolor="black",
            mirror=True
        ),
        height=600,
        width=600
    )

    return fig
Plot a correlation matrix for the data.
Returns
- fig (plotly.graph_objs.Figure): The correlation heatmap figure.
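Both correlation views share the same call pattern; a usage sketch with the analysis object from the class example:

analysis.plot_corr().show()             # lower-triangle Pearson heatmap
analysis.plot_pairplot_plotly().show()  # scatter matrix with KDE diagonals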
def write_equation(self, order=1, quadratic=[]):
    """
    Write an R-style equation for the multivariate fitting procedure using the statsmodels package.

    Parameters
    ----------
    order : int, optional
        The order of the polynomial. Default is 1.
    quadratic : list, optional
        The list of quadratic factors. Default is an empty list.

    Returns
    -------
    str
        The R-style equation.
    """
    myfactors = self._factors.copy()
    if self._dtypes is not None:
        if not isinstance(myfactors, list):
            myfactors = myfactors.tolist()  # Convert to list to allow mutable operations
        for i in range(len(self._factors)):
            if self._dtypes[self._factors[i]] == 'object':
                myfactors[i] = f'C({myfactors[i]})'
    eqn = f'{self._response} ~ {myfactors[0]} '
    for factor in myfactors[1:]:
        eqn += f'+ {factor} '
    if order > 1:
        for i in range(len(myfactors)):
            for j in range(i + 1, len(myfactors)):
                eqn += f'+ {myfactors[i]}:{myfactors[j]} '
    if order > 2:
        for i in range(len(myfactors)):
            for j in range(i + 1, len(myfactors)):
                for k in range(j + 1, len(myfactors)):
                    eqn += f'+ {myfactors[i]}:{myfactors[j]}:{myfactors[k]} '
    if order > 3:
        for i in range(len(myfactors)):
            for j in range(i + 1, len(myfactors)):
                for k in range(j + 1, len(myfactors)):
                    for l in range(k + 1, len(myfactors)):
                        eqn += f'+ {myfactors[i]}:{myfactors[j]}:{myfactors[k]}:{myfactors[l]} '
    if len(quadratic) > 0:
        for factor in quadratic:
            eqn += f'+ I({factor}**2) '
    self._equation = eqn
    return self._equation
Write an R-style equation for the multivariate fitting procedure using the statsmodels package.
Parameters
- order (int, optional): The order of the polynomial. Default is 1.
- quadratic (list, optional): The list of quadratic factors. Default is an empty list.
Returns
- str: The R-style equation.
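A small illustration with hypothetical factors T and P and response yield_: order=2 adds the pairwise interaction and quadratic=['T'] appends a squared term, so the call below would return 'yield_ ~ T + P + T:P + I(T**2) '.

eqn = analysis.write_equation(order=2, quadratic=['T'])
print(eqn)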
def compute_linear_model(self, order=1, quadratic=[]):
    """
    Compute the linear model using the statsmodels package.

    Parameters
    ----------
    order : int, optional
        The order of the polynomial. Default is 1. The parameter
        is not used if the equation is already set.
    quadratic : list, optional
        The list of quadratic factors. Default is an empty list.
        The parameter is not used if the equation is already set.

    Returns
    -------
    statsmodels.regression.linear_model.RegressionResultsWrapper
        The fitted linear model.
    """
    if self._equation == '':
        eqn = self.write_equation(order=order, quadratic=quadratic)
    else:
        eqn = self._equation
    model = smf.ols(formula=eqn, data=self._data)
    self._linear_model = model.fit()
    return self._linear_model
Compute the linear model using the statsmodels package.
Parameters
- order (int, optional): The order of the polynomial. Default is 1. The parameter is not used if the equation is already set.
- quadratic (list, optional): The list of quadratic factors. Default is an empty list. The parameter is not used if the equation is already set.
Returns
- statsmodels.regression.linear_model.RegressionResultsWrapper: The fitted linear model.
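A minimal sketch of fitting and inspecting the model, reusing the analysis object from the class example:

lm = analysis.compute_linear_model(order=2)
print(lm.summary())  # statsmodels report: coefficients, standard errors, p-values, R²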
def plot_linear_model(self):
    """
    Plot the linear model using plotly.

    Returns
    -------
    fig : list
        The list of plotly figures.
    """
    if self._linear_model is None:
        raise Warning("Linear model has not been computed yet.")
    fig = [None] * 2
    fig[0] = go.Figure()
    # Fig[0]: actual vs predicted line
    fig[0].add_trace(go.Scatter(x=self._data[self._response],
                                y=self._linear_model.predict(),
                                mode='markers',
                                marker=dict(size=12),
                                name='Actual vs Predicted'))
    # Add 1:1 line
    fig[0].add_shape(type="line",
                     x0=min(self._data[self._response]),
                     y0=min(self._data[self._response]),
                     x1=max(self._data[self._response]),
                     y1=max(self._data[self._response]),
                     line=dict(color="Gray", width=1, dash="dash"))
    fig[0].update_layout(
        plot_bgcolor="white",  # White background
        legend=dict(bgcolor='rgba(0,0,0,0)'),
        xaxis_title=f'Actual {self._response}',
        yaxis_title=f'Predicted {self._response}',
        height=500,  # Adjust height as needed
        margin=dict(l=10, r=10, t=50, b=50),
        xaxis=dict(
            showgrid=True,  # Enable grid
            gridcolor="lightgray",  # Light gray grid lines
            zeroline=False,
            zerolinecolor="black",  # Black zero line
            showline=True,
            linewidth=1,
            linecolor="black",  # Black border
            mirror=True
        ),
        yaxis=dict(
            showgrid=True,  # Enable grid
            gridcolor="lightgray",  # Light gray grid lines
            zeroline=False,
            zerolinecolor="black",  # Black zero line
            showline=True,
            linewidth=1,
            linecolor="black",  # Black border
            mirror=True
        )
    )
    # # # # # # # # # # # # # # # # # # # # #
    # Fig[1]: slope values for each factor
    # # # # # # # # # # # # # # # # # # # # #
    res = self._linear_model.params.rename_axis('terms').reset_index()[1:]
    res.columns = ['terms', 'slope']
    error = self._linear_model.bse.rename_axis('terms').reset_index()[1:]
    error.columns = ['terms', 'error']
    res['error'] = error['error']
    res['pvalue'] = [self._linear_model.pvalues[res['terms']].iloc[i] for i in range(len(res))]

    # Sort by p-values
    res = res.sort_values(by='pvalue', ascending=False)
    # Prepare colors and labels
    colors = ['red' if x < 0 else 'green' for x in res['slope']]
    res['slope'] = res['slope'].abs()

    fig[1] = go.Figure()

    # Add bar plot
    fig[1].add_trace(go.Bar(
        y=[term.replace('I(', '').replace('C(', '').replace(')', '').replace(' ** ', '^') for term in res['terms']],
        x=res['slope'],
        error_x=dict(type='data', array=res['error']),
        marker_color=colors,
        orientation='h',
        name='Slope',
        showlegend=False  # Hide the legend entry for the bar trace
    ))

    # Update layout for log scale and labels
    fig[1].update_layout(
        xaxis_title='Magnitude of effect',
        xaxis_type="log",
        height=500,  # Adjust height as needed
        plot_bgcolor="white",  # White background
        legend=dict(
            orientation="h",  # Horizontal orientation
            y=1.1  # Place legend on top
        ),
        margin=dict(l=10, r=150, t=50, b=50),  # Increase right margin for annotations
        xaxis=dict(
            showgrid=True,  # Enable grid
            gridcolor="lightgray",  # Light gray grid lines
            zeroline=False,
            zerolinecolor="black",  # Black zero line
            showline=True,
            linewidth=1,
            linecolor="black",  # Black border
            mirror=True
        ),
        yaxis=dict(
            showgrid=True,  # Enable grid
            gridcolor="lightgray",  # Light gray grid lines
            zeroline=False,
            zerolinecolor="black",  # Black zero line
            showline=True,
            linewidth=1,
            linecolor="black",  # Black border
            mirror=True
        )
    )

    # Add legend for Negative and Positive
    fig[1].add_trace(go.Scatter(
        x=[None], y=[None], mode='markers',
        marker=dict(size=10, color='red'),
        name='Negative'
    ))
    fig[1].add_trace(go.Scatter(
        x=[None], y=[None], mode='markers',
        marker=dict(size=10, color='green'),
        name='Positive'
    ))

    # Add p-values as annotations outside the plot
    for i, p in enumerate(res['slope']):
        fig[1].add_annotation(
            x=1.025,  # Place annotations outside the plot
            y=i,
            text=f"p = {self._linear_model.pvalues[res['terms'].iloc[i]]:.2g}",
            showarrow=False,
            xref="paper",  # Use paper coordinates for x
            xanchor='left'
        )

    return fig
Plot the linear model using plotly.
Returns
- fig (list): The list of plotly figures.
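Usage sketch; the two figures are returned as a list:

figs = analysis.plot_linear_model()
figs[0].show()  # actual vs predicted with a 1:1 reference line
figs[1].show()  # |effect| per term on a log axis, colored by sign, annotated with p-values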
def compute_ML_model(self, **kwargs):
    """
    Compute the machine learning model.

    Parameters
    ----------
    kwargs : dict
        Additional keyword arguments for the model. Overrides default parameters.

    Default Parameters by Model Type
    --------------------------------
    - **ElasticNetCV:**
        - l1_ratio : list, default=[0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1.0].
          List of L1 ratios to try.
        - cv : int, default=5.
          Cross-validation folds.
        - max_iter : int, default=1000.
          Maximum iterations.
    - **RidgeCV:**
        - alphas : list, default=[0.1, 1.0, 10.0].
          List of alpha values to try.
        - cv : int, default=5.
          Cross-validation folds.
    - **LinearRegression:**
        - fit_intercept : bool, default=True.
          Whether to calculate the intercept.
    - **RandomForest:**
        - n_estimators : int, default=100.
          Number of trees in the forest.
        - max_depth : int or None, default=None.
          Maximum depth of trees.
        - min_samples_split : int, default=2.
          Minimum samples required to split a node.
        - random_state : int, default=42.
          Random seed for reproducibility.
    - **GaussianProcess:**
        - kernel : kernel object, default=None.
          Kernel for the Gaussian Process.
        - alpha : float, default=1e-10.
          Value added to diagonal of kernel matrix.
        - normalize_y : bool, default=True.
          Normalize target values.
        - random_state : int, default=42.
          Random seed for reproducibility.
    - **GradientBoosting:**
        - n_estimators : int, default=100.
          Number of boosting stages.
        - learning_rate : float, default=0.1.
          Learning rate.
        - max_depth : int, default=3.
          Maximum depth of trees.
        - random_state : int, default=42.
          Random seed for reproducibility.

    Returns
    -------
    object
        The fitted machine learning model.
    """
    if self._model_type is None:
        raise ValueError("Model must be provided.")

    X = self._data[self._factors]
    y = self._data[self._response]
    if self._split_size > 0:
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=self._split_size)
    else:
        X_train, X_test, y_train, y_test = X, X, y, y

    # Default parameters for each model type
    default_params = {
        "ElasticNetCV": {"l1_ratio": [0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1.0],
                         "cv": 5,
                         "max_iter": 1000},
        "RidgeCV": {"alphas": [0.1, 1.0, 10.0],
                    "cv": 5},
        "LinearRegression": {"fit_intercept": True},
        "RandomForest": {"n_estimators": 100,
                         "max_depth": None,
                         "min_samples_split": 2,
                         "random_state": 42},
        "GaussianProcess": {"kernel": None,
                            "alpha": 1e-10,
                            "normalize_y": True,
                            "random_state": 42},
        "GradientBoosting": {"n_estimators": 100,
                             "learning_rate": 0.1,
                             "max_depth": 3,
                             "random_state": 42}
    }

    # Get default parameters for the selected model
    model_defaults = default_params.get(self._model_type, {})

    # Override defaults with any provided kwargs
    model_params = {**model_defaults, **kwargs}

    if self._model_type == "ElasticNetCV":
        self._model = make_pipeline(StandardScaler(),
                                    ElasticNetCV(**model_params))
    elif self._model_type == "RidgeCV":
        self._model = make_pipeline(StandardScaler(),
                                    RidgeCV(**model_params))
    elif self._model_type == "LinearRegression":
        self._model = make_pipeline(StandardScaler(),
                                    LinearRegression(**model_params))
    elif self._model_type == "RandomForest":
        self._model = make_pipeline(StandardScaler(),
                                    RandomForestRegressor(**model_params))
    elif self._model_type == "GaussianProcess":
        self._model = make_pipeline(StandardScaler(),
                                    GaussianProcessRegressor(**model_params))
    elif self._model_type == "GradientBoosting":
        self._model = make_pipeline(StandardScaler(),
                                    GradientBoostingRegressor(**model_params))
    else:
        raise ValueError(f"Unsupported model type: {self._model_type}")

    # Fit the model
    self._model.fit(X_train, y_train)
    return self._model
Compute the machine learning model.
Parameters
- kwargs (dict): Additional keyword arguments for the model. Overrides default parameters.
Default Parameters by Model Type
- ElasticNetCV:
- l1_ratio : list, default=[0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1.0]. List of L1 ratios to try.
- cv : int, default=5. Cross-validation folds.
- max_iter : int, default=1000. Maximum iterations.
- RidgeCV:
- alphas : list, default=[0.1, 1.0, 10.0]. List of alpha values to try.
- cv : int, default=5. Cross-validation folds.
- LinearRegression:
- fit_intercept : bool, default=True. Whether to calculate the intercept.
- RandomForest:
- n_estimators : int, default=100. Number of trees in the forest.
- max_depth : int or None, default=None. Maximum depth of trees.
- min_samples_split : int, default=2. Minimum samples required to split a node.
- random_state : int, default=42. Random seed for reproducibility.
- GaussianProcess:
- kernel : kernel object, default=None. Kernel for the Gaussian Process.
- alpha : float, default=1e-10. Value added to diagonal of kernel matrix.
- normalize_y : bool, default=True. Normalize target values.
- random_state : int, default=42. Random seed for reproducibility.
- GradientBoosting:
- n_estimators : int, default=100. Number of boosting stages.
- learning_rate : float, default=0.1. Learning rate.
- max_depth : int, default=3. Maximum depth of trees.
- random_state : int, default=42. Random seed for reproducibility.
Returns
- object: The fitted machine learning model.
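A sketch of overriding a few of the defaults listed above through kwargs (the values are illustrative):

analysis.model_type = "RandomForest"
model = analysis.compute_ML_model(n_estimators=500, max_depth=5)  # other RandomForest defaults are kept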
def plot_ML_model(self, features_in_log=False):
    """
    Plot the machine learning model using plotly.

    Parameters
    ----------
    features_in_log : bool, optional
        Whether to plot the feature importances in log scale. Default is False.

    Returns
    -------
    fig : list
        The list of plotly figures.
    """
    coef_names = self._data[self._factors].columns
    X = self._data[self._factors]
    y = self._data[self._response]
    if self._split_size > 0:
        # NOTE: this draws a fresh random split; without a fixed random_state it
        # is not necessarily the split used when fitting in compute_ML_model()
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=self._split_size)
    else:
        X_train, X_test, y_train, y_test = X, X, y, y
    bootstrap_coefs = bootstrap_coefficients(
        self._model[1], X, y, n_bootstrap=100, random_state=42)
    # Calculate mean and standard deviation of coefficients
    mean_coefs = np.mean(bootstrap_coefs, axis=0)
    std_coefs = np.std(bootstrap_coefs, axis=0)
    # make the fig
    fig = [None] * 2
    fig[0] = go.Figure()
    # Add actual vs predicted line
    fig[0].add_trace(go.Scatter(x=y_train,
                                y=self._model.predict(X_train),
                                mode='markers',
                                marker=dict(size=12, color='royalblue'),
                                name='Training'))
    if self._split_size > 0:
        fig[0].add_trace(go.Scatter(x=y_test,
                                    y=self._model.predict(X_test),
                                    mode='markers',
                                    marker=dict(size=12, color='orange'),
                                    name='Validation'))
    # Add 1:1 line
    fig[0].add_shape(type="line",
                     x0=min(y_train), y0=min(y_train),
                     x1=max(y_train), y1=max(y_train),
                     line=dict(color="Gray", width=1, dash="dash"))
    fig[0].update_layout(
        plot_bgcolor="white",  # White background
        legend=dict(bgcolor='rgba(0,0,0,0)'),
        xaxis_title=f'Actual {self._response}',
        yaxis_title=f'Predicted {self._response}',
        height=500,  # Adjust height as needed
        margin=dict(l=10, r=10, t=120, b=50),
        xaxis=dict(
            showgrid=True,  # Enable grid
            gridcolor="lightgray",  # Light gray grid lines
            zeroline=False,
            zerolinecolor="black",  # Black zero line
            showline=True,
            linewidth=1,
            linecolor="black",  # Black border
            mirror=True
        ),
        yaxis=dict(
            showgrid=True,  # Enable grid
            gridcolor="lightgray",  # Light gray grid lines
            zeroline=False,
            zerolinecolor="black",  # Black zero line
            showline=True,
            linewidth=1,
            linecolor="black",  # Black border
            mirror=True
        )
    )
    # add the score in title
    fig[0].update_layout(
        title={
            'text': f"Predicted vs Actual for {str(self._model[1])}<br>"
                    f"R²_test = {r2_score(y_test, self._model.predict(X_test)):.4f} - "
                    f"R²_train = {r2_score(y_train, self._model.predict(X_train)):.4f}<br>"
                    f"RMSE_test = {root_mean_squared_error(y_test, self._model.predict(X_test)):.4f} - "
                    f"RMSE_train = {root_mean_squared_error(y_train, self._model.predict(X_train)):.4f}",
            'x': 0.45,
            'xanchor': 'center'
        }
    )
    # make feature importance plot
    if self._model_type != "GaussianProcess":
        fig[1] = go.Figure()
        pos_coefs = mean_coefs[mean_coefs > 0]
        pos_coefs_names = coef_names[mean_coefs > 0]
        neg_coefs = mean_coefs[mean_coefs < 0]
        neg_coefs_names = coef_names[mean_coefs < 0]
        # Add bars for positive mean coefficients
        fig[1].add_trace(go.Bar(
            y=[f"{pos_coefs_names[i]}" for i in range(len(pos_coefs))],
            x=mean_coefs[mean_coefs > 0],
            error_x=dict(type='data', array=std_coefs[mean_coefs > 0], visible=True),
            orientation='h',
            marker_color='royalblue',
            name='Positive'
        ))
        # Add bars for negative mean coefficients (plotted as magnitudes)
        fig[1].add_trace(go.Bar(
            y=[f"{neg_coefs_names[i]}" for i in range(len(neg_coefs))],
            x=-mean_coefs[mean_coefs < 0],
            error_x=dict(type='data', array=std_coefs[mean_coefs < 0], visible=True),
            orientation='h',
            marker_color='orange',
            name='Negative'
        ))
        # Update layout
        fig[1].update_layout(
            title={
                'text': f"Features importance for {str(self._model[1])}",
                'x': 0.5,
                'xanchor': 'center'
            },
            xaxis_title="Coefficient Value",
            yaxis_title="Features",
            barmode='relative',
            margin=dict(l=150)
        )
        if features_in_log:
            fig[1].update_xaxes(type="log")
        else:
            fig[1].update_xaxes(type="linear")
        fig[1].update_layout(
            plot_bgcolor="white",  # White background
            legend=dict(bgcolor='rgba(0,0,0,0)'),
            height=500,  # Adjust height as needed
            margin=dict(l=10, r=10, t=120, b=50),
            xaxis=dict(
                showgrid=True,  # Enable grid
                gridcolor="lightgray",  # Light gray grid lines
                zeroline=False,
                zerolinecolor="black",  # Black zero line
                showline=True,
                linewidth=1,
                linecolor="black",  # Black border
                mirror=True
            ),
            yaxis=dict(
                showgrid=True,  # Enable grid
                gridcolor="lightgray",  # Light gray grid lines
                zeroline=False,
                zerolinecolor="black",  # Black zero line
                showline=True,
                linewidth=1,
                linecolor="black",  # Black border
                mirror=True
            )
        )
    else:
        print('GaussianProcess does not have coefficients or feature importances')
    return fig
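As a point of reference, here is a minimal usage sketch for `plot_ML_model`. The DataFrame and column names are hypothetical, and it assumes the ML model has already been trained; if the class does not fit the model at construction time, call its fitting method before plotting.

import pandas as pd
from optimeo.analysis import DataAnalysis

# Hypothetical data: two numeric factors and one response
df = pd.DataFrame({
    "temperature": [20, 25, 30, 35, 40, 45, 50, 55],
    "pressure":    [1.0, 1.2, 1.1, 1.4, 1.3, 1.6, 1.5, 1.8],
    "yield_":      [0.42, 0.51, 0.49, 0.60, 0.58, 0.71, 0.66, 0.80],
})

da = DataAnalysis(df, factors=["temperature", "pressure"],
                  response="yield_", model_type="RandomForest")
figs = da.plot_ML_model()   # [predicted vs actual, feature importances]
figs[0].show()
if figs[1] is not None:     # figs[1] is None for "GaussianProcess"
    figs[1].show()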
    def predict(self, X=None, model='all'):
        """
        Predict using the machine learning model and the linear model,
        if they are trained. Uses the stored encoders to encode the data.
        If `X` is not provided, the original data is used.
        If no model has been trained, a warning is raised.

        Parameters
        ----------
        X : pd.DataFrame, optional
            The input features. Default is None, which uses the original data.
        model : str, optional
            Which predictions to return: `'all'` (both models), `'ML'`
            (machine learning model only), or `'linear'` (linear model only).
            Default is `'all'`.

        Returns
        -------
        pd.DataFrame
            The predicted values with a column indicating the model used.
        """
        if X is None:
            X = self._data[self._factors].copy()
        else:
            X = X.copy()

        # Encode categorical inputs with the same encoders used for training
        for factor in self._factors:
            if factor in self._encoders:
                X[factor] = self._encoders[factor].transform(X[factor].astype(str))

        predML = None
        predlin = None

        if self._model is not None:
            # Check that the ML model is fitted
            if not hasattr(self._model, 'predict'):
                raise Warning("Model has not been trained yet.")
            predML = self._model.predict(X)
            predML = pd.DataFrame(predML, columns=['prediction'])
            predML['model'] = self._model_type

        if self._linear_model is not None:
            # Check that the linear model is fitted
            if not hasattr(self._linear_model, 'predict'):
                raise Warning("Linear model has not been trained yet.")
            predlin = self._linear_model.predict(X)
            predlin = pd.DataFrame(predlin, columns=['prediction'])
            predlin['model'] = 'Linear Model'

        if predML is not None and predlin is not None and model == 'all':
            # Stack both sets of predictions into a single DataFrame
            pred = pd.concat([predML, predlin], axis=0)
            pred = pred.reset_index(drop=True)
            return pred
        elif predML is not None and model == 'ML':
            return predML
        elif predlin is not None and model == 'linear':
            return predlin
        else:
            raise Warning("No model has been trained yet.")
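For instance, once both models are trained, `predict` can be called on new conditions. The values below are hypothetical and reuse the `da` object from the sketch above; the column names must match `factors`.

new_X = pd.DataFrame({"temperature": [28, 42], "pressure": [1.15, 1.55]})

preds = da.predict(new_X, model='all')
# With model='all', the ML predictions are stacked on top of the
# linear-model ones (model='ML' or model='linear' selects a single model):
#    prediction         model
# 0         ...  RandomForest
# 1         ...  RandomForest
# 2         ...  Linear Model
# 3         ...  Linear Model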
def bootstrap_coefficients(mod, X, y, n_bootstrap=100, random_state=None):
    """
    Perform bootstrap resampling to estimate the variability of model coefficients.

    Parameters
    ----------
    mod : object
        The machine learning model.
    X : pd.DataFrame
        The input features.
    y : pd.Series
        The target variable.
    n_bootstrap : int, optional
        The number of bootstrap samples. Default is 100.
    random_state : int, optional
        The seed for the random number generator.

    Returns
    -------
    results : np.ndarray
        The bootstrapped coefficients (or feature importances), with shape
        `(n_bootstrap, n_features)`.
    """
    np.random.seed(random_state)
    n_samples = X.shape[0]
    results = []

    for _ in range(n_bootstrap):
        # Resample the data with replacement
        indices = np.random.choice(np.arange(n_samples), size=n_samples, replace=True)
        X_resample, y_resample = X.values[indices], y.values[indices]

        # Refit a fresh model on the resampled data and record its
        # coefficients (linear models) or feature importances (tree ensembles)
        if isinstance(mod, RidgeCV):
            model = RidgeCV(alphas=np.logspace(-3, 3, 10)).fit(X_resample, y_resample)
            results.append(model.coef_)
        elif isinstance(mod, ElasticNetCV):
            model = ElasticNetCV(alphas=np.logspace(-3, 3, 10), l1_ratio=0.5).fit(X_resample, y_resample)
            results.append(model.coef_)
        elif isinstance(mod, LinearRegression):
            model = LinearRegression().fit(X_resample, y_resample)
            results.append(model.coef_)
        elif isinstance(mod, RandomForestRegressor):
            model = RandomForestRegressor().fit(X_resample, y_resample)
            results.append(model.feature_importances_)
        elif isinstance(mod, GradientBoostingRegressor):
            model = GradientBoostingRegressor().fit(X_resample, y_resample)
            results.append(model.feature_importances_)
        elif isinstance(mod, GaussianProcessRegressor):
            # Gaussian Process has no coefficients or feature importances:
            # append a placeholder of zeros (no fit needed)
            results.append(np.zeros(X.shape[1]))
        else:
            raise ValueError(f"Unsupported model type: {mod}")

    return np.array(results)
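The returned `(n_bootstrap, n_features)` array lends itself to percentile confidence intervals on the coefficients. A short sketch, reusing the hypothetical `df` from above:

import numpy as np
from sklearn.linear_model import LinearRegression

X = df[["temperature", "pressure"]]
y = df["yield_"]
coefs = bootstrap_coefficients(LinearRegression(), X, y,
                               n_bootstrap=500, random_state=0)
# 2.5th and 97.5th percentiles across bootstrap replicates, per feature
lower, upper = np.percentile(coefs, [2.5, 97.5], axis=0)
for name, lo, hi in zip(X.columns, lower, upper):
    print(f"{name}: 95% CI [{lo:.3f}, {hi:.3f}]")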