Integration of Single Hyperbox-based Estimators with Grid-Search and Random-Search in sklearn

This example shows how to integrate the GFMM classifier with the grid search and random search cross-validation utilities (GridSearchCV and RandomizedSearchCV) provided by scikit-learn.

Note that this example uses the original online learning algorithm of the GFMM model to demonstrate how Grid Search and Random Search work with a hyperbox-based model. The same integration can be applied to all of the other hyperbox-based machine learning algorithms.

[1]:
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from hbbrain.numerical_data.incremental_learner.onln_gfmm import OnlineGFMM

Load the Iris dataset, normalize it to the range [0, 1], and split it into training and testing sets

[2]:
from sklearn.datasets import load_iris
[3]:
df = load_iris()
X = df.data
y = df.target
[4]:
scaler = MinMaxScaler()
X = scaler.fit_transform(X)
[5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=0)
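
For reference, the normalization step above can be reproduced by hand. The following is a minimal sketch, assuming the default feature range of (0, 1), in which MinMaxScaler maps each feature column x to (x - min) / (max - min). The helper name min_max_scale and the toy data are illustrative only and belong to neither scikit-learn nor hyperbox-brain.

```python
def min_max_scale(rows):
    """Rescale each column of a list of rows to the range [0, 1]."""
    columns = list(zip(*rows))
    col_min = [min(c) for c in columns]
    col_max = [max(c) for c in columns]
    scaled = []
    for row in rows:
        scaled.append([
            # A constant column has zero range; map it to 0.0 instead of dividing by zero
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(row, col_min, col_max)
        ])
    return scaled


# Toy data (hypothetical): three samples with two features each
toy = [[5.1, 3.5], [4.9, 3.0], [6.3, 2.5]]
scaled_toy = min_max_scale(toy)
print(scaled_toy)
```

After scaling, the smallest value of each column becomes 0 and the largest becomes 1.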

1. Using Grid Search with 5-fold cross-validation

[6]:
import numpy as np
from sklearn.metrics import accuracy_score
[7]:
parameters = {'theta': np.arange(0.05, 1.01, 0.05), 'theta_min':[1], 'gamma':[0.5, 1, 2, 4, 8, 16]}
[8]:
onln_gfmm = OnlineGFMM()
clf_grid_search = GridSearchCV(onln_gfmm, parameters, cv=5, scoring='accuracy', refit=True)
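
To get a feel for the cost of this search, the size of the grid can be counted by hand. The sketch below uses only the standard library and rebuilds the theta values with a list comprehension as a stand-in for np.arange(0.05, 1.01, 0.05); the variable names are illustrative.

```python
from itertools import product

# Stand-in for np.arange(0.05, 1.01, 0.05): 20 values from 0.05 to 1.0
theta_values = [round(0.05 * i, 2) for i in range(1, 21)]
theta_min_values = [1]
gamma_values = [0.5, 1, 2, 4, 8, 16]

# Grid search evaluates every combination of the listed values
candidates = list(product(theta_values, theta_min_values, gamma_values))
n_candidates = len(candidates)

n_folds = 5
# Each candidate is fitted once per fold; refit=True adds one final fit
# of the best candidate on the whole training set
n_fits = n_candidates * n_folds + 1

print(n_candidates, n_fits)
```

With 20 x 1 x 6 = 120 candidates and 5 folds, the grid search performs 600 cross-validation fits plus one final refit.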
[9]:
clf_grid_search.fit(X_train, y_train)
print("Best average score = ", clf_grid_search.best_score_)
print("Best params: ", clf_grid_search.best_params_)
Best average score =  0.9583333333333334
Best params:  {'gamma': 0.5, 'theta': 0.3, 'theta_min': 1}
[10]:
best_gfmm_grid_search = clf_grid_search.best_estimator_
[11]:
# Testing the performance on the test set
y_pred = best_gfmm_grid_search.predict(X_test)
[12]:
acc_grid_search = accuracy_score(y_test, y_pred)
print(f'Accuracy (grid-search) = {acc_grid_search * 100: .2f}%')
Accuracy (grid-search) =  96.67%
[13]:
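# Alternative: create a new classifier directly from the best parameters found
best_gfmm_grid_search_2 = OnlineGFMM(**clf_grid_search.best_params_)
# The same parameters could also be assigned to an existing estimator:
# best_gfmm_grid_search_2.set_params(**clf_grid_search.best_params_)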
[14]:
# Training
best_gfmm_grid_search_2.fit(X_train, y_train)
[14]:
OnlineGFMM(C=array([2, 1, 0, 1, 2, 2, 1, 2, 0, 0, 1, 0, 2, 2, 1]),
           V=array([[0.44444444, 0.29166667, 0.6440678 , 0.70833333],
       [0.25      , 0.125     , 0.42372881, 0.375     ],
       [0.11111111, 0.45833333, 0.03389831, 0.04166667],
       [0.16666667, 0.        , 0.33898305, 0.375     ],
       [0.38888889, 0.08333333, 0.68221339, 0.58333333],
       [0.77777778, 0.41666667, 0.83050847, 0.70833333],
       [0.47222222, 0.375     , 0.55932203, 0.5       ],
       [0.166666...
       [0.16666667, 0.20833333, 0.59322034, 0.66666667],
       [0.19444444, 0.58333333, 0.10169492, 0.08333333],
       [0.41666667, 1.        , 0.11864407, 0.125     ],
       [0.55555556, 0.20833333, 0.66101695, 0.58333333],
       [0.05555556, 0.125     , 0.05084746, 0.08333333],
       [0.94444444, 0.41666667, 1.        , 0.91666667],
       [1.        , 0.75      , 0.96610169, 0.875     ],
       [0.44444444, 0.5       , 0.6440678 , 0.70833333]]),
           gamma=0.5, theta=0.3, theta_min=0.3)
[15]:
# predict
y_pred_2 = best_gfmm_grid_search_2.predict(X_test)
[16]:
acc_grid_search_2 = accuracy_score(y_test, y_pred_2)
print(f'Accuracy (grid-search) = {acc_grid_search_2 * 100: .2f}%')
Accuracy (grid-search) =  96.67%
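
Both routes are expected to give the same test accuracy: with refit=True, GridSearchCV retrains an estimator with the best parameters on the whole training set and exposes it as best_estimator_. The sketch below illustrates this equivalence with KNeighborsClassifier as a stand-in estimator, since it needs only scikit-learn. The choice of estimator and the n_neighbors grid are assumptions made for illustration and are not part of the original example.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0
)

base = KNeighborsClassifier()  # stand-in for a hyperbox-based estimator
search = GridSearchCV(
    base, {"n_neighbors": [1, 3, 5, 7]}, cv=5, scoring="accuracy", refit=True
)
search.fit(X_train, y_train)

# Route 1: the estimator that GridSearchCV already refit on all of X_train
pred_refit = search.best_estimator_.predict(X_test)

# Route 2: rebuild an unfitted copy, set the best parameters, and train it manually
manual = clone(base).set_params(**search.best_params_)
manual.fit(X_train, y_train)
pred_manual = manual.predict(X_test)

same_predictions = bool(np.array_equal(pred_refit, pred_manual))
print(same_predictions)
```

The equivalence holds for a deterministic estimator; one with internal randomness would also need a fixed random seed.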

2. Using Random Search with 5-fold cross-validation

[17]:
# Using random search with only 20 random combinations of parameters
onln_gfmm_rd_search = OnlineGFMM()
clf_rd_search = RandomizedSearchCV(onln_gfmm_rd_search, parameters, n_iter=20, cv=5, random_state=0)
[18]:
clf_rd_search.fit(X_train, y_train)
print("Best average score = ", clf_rd_search.best_score_)
print("Best params: ", clf_rd_search.best_params_)
Best average score =  0.9583333333333334
Best params:  {'theta_min': 1, 'theta': 0.3, 'gamma': 2}
[19]:
best_gfmm_rd_search = clf_rd_search.best_estimator_
[20]:
# Testing the performance on the test set
y_pred_rd_search = best_gfmm_rd_search.predict(X_test)
[21]:
acc_rd_search = accuracy_score(y_test, y_pred_rd_search)
print(f'Accuracy (random-search) = {acc_rd_search * 100: .2f}%')
Accuracy (random-search) =  96.67%
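
Conceptually, when every entry of the parameter dictionary is a list or array of values rather than a distribution, as is the case here, RandomizedSearchCV draws n_iter distinct combinations from the same grid without replacement. A minimal standard-library sketch of that idea follows. It does not reproduce the actual draws scikit-learn makes for random_state=0, and the theta list is a stand-in for np.arange(0.05, 1.01, 0.05).

```python
import random
from itertools import product

# Same search space as the grid search (theta list is a stand-in for np.arange)
theta_values = [round(0.05 * i, 2) for i in range(1, 21)]
gamma_values = [0.5, 1, 2, 4, 8, 16]
grid = [
    {"theta": t, "theta_min": 1, "gamma": g}
    for t, g in product(theta_values, gamma_values)
]

rng = random.Random(0)  # fixed seed, for repeatability of this sketch only
sampled = rng.sample(grid, 20)  # 20 distinct combinations, no repeats

fraction_explored = len(sampled) / len(grid)
print(len(sampled), fraction_explored)
```

Only one sixth of the 120 combinations is evaluated, which is why random search is cheaper than the full grid search while still reaching the same best cross-validation score in this example.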

Show the explanation for an input sample

[22]:
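sample_need_explain = 10  # index of the test sample to explain
# The sample is passed twice: as the lower and the upper bound of a crisp (point) input
y_pred_input_0, mem_val_classes, min_points_classes, max_points_classes = best_gfmm_rd_search.get_sample_explanation(X_test[sample_need_explain], X_test[sample_need_explain])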
[23]:
print("Predicted class for sample X = %s is %d and real class is %d" % (X_test[sample_need_explain], y_pred_input_0, y_test[sample_need_explain]))
Predicted class for sample X = [0.5        0.25       0.77966102 0.54166667] is 2 and real class is 2
[24]:
print("Membership values:")
for key, val in mem_val_classes.items():
    print("Class %d has the maximum membership value = %f" % (key, val))

for key in min_points_classes:
    print("Class %d has the representative hyperbox: V = %s and W = %s" % (key, min_points_classes[key], max_points_classes[key]))
Membership values:
Class 0 has the maximum membership value = 0.000000
Class 1 has the maximum membership value = 0.805085
Class 2 has the maximum membership value = 0.916667
Class 0 has the representative hyperbox: V = [0.11111111 0.45833333 0.03389831 0.04166667] and W = [0.38888889 0.75       0.11864407 0.20833333]
Class 1 has the representative hyperbox: V = [0.25       0.125      0.42372881 0.375     ] and W = [0.5        0.41666667 0.68220339 0.625     ]
Class 2 has the representative hyperbox: V = [0.38888889 0.08333333 0.68221339 0.58333333] and W = [0.66666667 0.33333333 0.81355932 0.79166667]
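
The membership values printed above are consistent with the GFMM membership function of Gabrys and Bargiela: b(X) is the minimum over all features of min(1 - f(x_u - w, gamma), 1 - f(v - x_l, gamma)), where f is a ramp function clipped to [0, 1]. The sketch below is a plain-Python reading of that formula, not the library's implementation. The function names are hypothetical, and gamma = 2 is taken from the best parameters reported by the random search. The sample is treated as a crisp point, so its lower and upper bounds coincide.

```python
def ramp(r, gamma):
    """Ramp threshold function: gamma * r clipped to the range [0, 1]."""
    return min(1.0, max(0.0, r * gamma))


def membership(x_l, x_u, v, w, gamma):
    """Membership of the input [x_l, x_u] in the hyperbox with min point v and max point w."""
    per_feature = [
        min(1.0 - ramp(xu - wi, gamma), 1.0 - ramp(vi - xl, gamma))
        for xl, xu, vi, wi in zip(x_l, x_u, v, w)
    ]
    return min(per_feature)


# Sample and representative hyperboxes (V, W) copied from the output above
x = [0.5, 0.25, 0.77966102, 0.54166667]
boxes = {
    0: ([0.11111111, 0.45833333, 0.03389831, 0.04166667],
        [0.38888889, 0.75, 0.11864407, 0.20833333]),
    1: ([0.25, 0.125, 0.42372881, 0.375],
        [0.5, 0.41666667, 0.68220339, 0.625]),
    2: ([0.38888889, 0.08333333, 0.68221339, 0.58333333],
        [0.66666667, 0.33333333, 0.81355932, 0.79166667]),
}

memberships = {c: membership(x, x, v, w, gamma=2) for c, (v, w) in boxes.items()}
for c, m in memberships.items():
    print("Class %d has membership value = %f" % (c, m))
```

Under these assumptions the sketch yields 0.000000, 0.805085 and 0.916667 for classes 0, 1 and 2, matching the values reported by get_sample_explanation. Class 2 wins because the sample lies inside its hyperbox on three features and only 0.0417 below the minimum point on the fourth.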

Show the explanation results as a parallel coordinates plot

[25]:
# Create a parallel coordinates graph
best_gfmm_rd_search.show_sample_explanation(X_test[sample_need_explain], X_test[sample_need_explain], min_points_classes, max_points_classes, y_pred_input_0, file_path="par_cord/iris_par_cord.html")
[26]:
# Load parallel coordinates to display on the notebook
from IPython.display import IFrame
# The parallel coordinates plot is loaded from GitHub here for demonstration on Read the Docs
# In a local notebook, load the graph stored at 'par_cord/iris_par_cord.html' instead
IFrame('https://uts-caslab.github.io/hyperbox-brain/docs/tutorials/par_cord/iris_par_cord.html', width=820, height=520)