Integration of Single Hyperbox-based Estimators with Grid-Search and Random-Search in sklearn

This example shows how to integrate the GFMM classifier with the grid search and randomized search cross-validation utilities provided by scikit-learn (GridSearchCV and RandomizedSearchCV).

Note that this example uses the original online learning algorithm of the GFMM model to demonstrate how Grid Search and Random Search work with a hyperbox-based model. The same approach applies to all of the other hyperbox-based machine learning algorithms.

[1]:

import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from hbbrain.numerical_data.incremental_learner.onln_gfmm import OnlineGFMM

Load the Iris dataset, normalize it to the range [0, 1], and build the training and testing datasets.

[2]:

from sklearn.datasets import load_iris

[3]:

df = load_iris()
X = df.data
y = df.target

[4]:

scaler = MinMaxScaler()
X = scaler.fit_transform(X)
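
As a rough illustration of what the normalization step does, the sketch below applies the min-max formula (x - min) / (max - min) to each feature with plain NumPy. This is the transformation MinMaxScaler performs with its default feature range of (0, 1). The toy matrix `X_toy` is hypothetical and is not the Iris data.

```python
import numpy as np

# Hypothetical toy matrix: 3 samples, 2 features (not the Iris data).
X_toy = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [4.0, 40.0]])

col_min = X_toy.min(axis=0)  # per-feature minimum
col_max = X_toy.max(axis=0)  # per-feature maximum

# Rescale each feature to [0, 1]; assumes no constant column (max > min).
X_scaled = (X_toy - col_min) / (col_max - col_min)
print(X_scaled)
```

In this notebook the scaler is fitted on the whole dataset before splitting. For a stricter evaluation, one option is to fit it on the training part only and reuse the fitted scaler on the test part.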

[5]:

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=0)

1. Using Grid Search with 5-fold cross-validation

[6]:

import numpy as np
from sklearn.metrics import accuracy_score

[7]:

parameters = {'theta': np.arange(0.05, 1.01, 0.05), 'theta_min':[1], 'gamma':[0.5, 1, 2, 4, 8, 16]}
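
To get a feel for the cost of the search, the sketch below counts how many parameter combinations this grid contains and how many model fits a 5-fold cross-validation over it implies. It only uses the `parameters` dictionary defined above. The names `n_candidates`, `n_folds`, and `n_fits` are introduced here for illustration.

```python
import numpy as np

# Same grid as in the cell above.
parameters = {'theta': np.arange(0.05, 1.01, 0.05),
              'theta_min': [1],
              'gamma': [0.5, 1, 2, 4, 8, 16]}

# The grid is the Cartesian product of all value lists.
n_candidates = 1
for values in parameters.values():
    n_candidates *= len(values)

n_folds = 5
n_fits = n_candidates * n_folds  # one fit per candidate and per fold

print("Candidates:", n_candidates)
print("Cross-validation fits:", n_fits)
```

With refit=True, scikit-learn performs one additional fit of the best candidate on the full training set after the search.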

[8]:

onln_gfmm = OnlineGFMM()
clf_grid_search = GridSearchCV(onln_gfmm, parameters, cv=5, scoring='accuracy', refit=True)
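
The idea behind cv=5 can be sketched with plain NumPy: the training indices are divided into five folds, and each fold serves once as the validation set while the other four are used for training. This is a simplified illustration only. scikit-learn's own splitter may behave differently (for classifiers it typically stratifies the folds by class label), and the seed and shuffling below are arbitrary choices for this sketch.

```python
import numpy as np

n_samples = 120  # size of X_train after the 80/20 split of the 150 Iris samples
n_folds = 5

rng = np.random.default_rng(0)  # arbitrary seed for this sketch
indices = rng.permutation(n_samples)
folds = np.array_split(indices, n_folds)

for k, val_idx in enumerate(folds):
    # All folds except the k-th form the training part of this round.
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
    print(f"fold {k}: {len(train_idx)} training samples, {len(val_idx)} validation samples")
```

The best_score_ attribute reported by the search is the mean validation score over these rounds for the best parameter combination.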

[9]:

clf_grid_search.fit(X_train, y_train)
print("Best average score = ", clf_grid_search.best_score_)
print("Best params: ", clf_grid_search.best_params_)

Best average score =  0.9583333333333334
Best params:  {'gamma': 0.5, 'theta': 0.3, 'theta_min': 1}

[10]:

best_gfmm_grid_search = clf_grid_search.best_estimator_

[11]:

# Testing the performance on the test set
y_pred = best_gfmm_grid_search.predict(X_test)

[12]:

acc_grid_search = accuracy_score(y_test, y_pred)
print(f'Accuracy (grid-search) = {acc_grid_search * 100: .2f}%')

Accuracy (grid-search) =  96.67%
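
For reference, accuracy_score with its default settings returns the fraction of predictions that match the true labels. The sketch below reproduces that computation with NumPy on hypothetical labels (`y_true_toy` and `y_pred_toy` are made up for illustration). Since the test set here holds 30 samples, an accuracy of 96.67% corresponds to 29 correct predictions.

```python
import numpy as np

# Hypothetical labels for a 30-sample test set; exactly one prediction is wrong.
y_true_toy = np.array([0, 1, 2] * 10)
y_pred_toy = y_true_toy.copy()
y_pred_toy[0] = 1  # introduce a single mistake

# Fraction of matching labels.
acc_toy = np.mean(y_true_toy == y_pred_toy)
print(f'Accuracy = {acc_toy * 100: .2f}%')
```

The space in the `: .2f` format specifier reserves a sign position, which is why the printed value is preceded by two spaces.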

[13]:

# Alternative: create a new classifier directly from the best parameters found
best_gfmm_grid_search_2 = OnlineGFMM(**clf_grid_search.best_params_)
#best_gfmm_grid_search_2.set_params(**clf_grid_search.best_params_)

[14]:

# Training
best_gfmm_grid_search_2.fit(X_train, y_train)

[14]:

OnlineGFMM(C=array([2, 1, 0, 1, 2, 2, 1, 2, 0, 0, 1, 0, 2, 2, 1]),
           V=array([[0.44444444, 0.29166667, 0.6440678 , 0.70833333],
       [0.25      , 0.125     , 0.42372881, 0.375     ],
       [0.11111111, 0.45833333, 0.03389831, 0.04166667],
       [0.16666667, 0.        , 0.33898305, 0.375     ],
       [0.38888889, 0.08333333, 0.68221339, 0.58333333],
       [0.77777778, 0.41666667, 0.83050847, 0.70833333],
       [0.47222222, 0.375     , 0.55932203, 0.5       ],
       [0.166666...
       [0.16666667, 0.20833333, 0.59322034, 0.66666667],
       [0.19444444, 0.58333333, 0.10169492, 0.08333333],
       [0.41666667, 1.        , 0.11864407, 0.125     ],
       [0.55555556, 0.20833333, 0.66101695, 0.58333333],
       [0.05555556, 0.125     , 0.05084746, 0.08333333],
       [0.94444444, 0.41666667, 1.        , 0.91666667],
       [1.        , 0.75      , 0.96610169, 0.875     ],
       [0.44444444, 0.5       , 0.6440678 , 0.70833333]]),
           gamma=0.5, theta=0.3, theta_min=0.3)

[15]:

# predict
y_pred_2 = best_gfmm_grid_search_2.predict(X_test)

[16]:

acc_grid_search_2 = accuracy_score(y_test, y_pred_2)
print(f'Accuracy (grid-search) = {acc_grid_search_2 * 100: .2f}%')

Accuracy (grid-search) =  96.67%

2. Using Random Search with 5-fold cross-validation

[17]:

# Using random search with only 20 random combinations of parameters
onln_gfmm_rd_search = OnlineGFMM()
clf_rd_search = RandomizedSearchCV(onln_gfmm_rd_search, parameters, n_iter=20, cv=5, random_state=0)
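
Conceptually, when every parameter is given as a list, random search draws n_iter distinct combinations from the full grid instead of evaluating all of them. The sketch below mimics this with the standard library. It is not scikit-learn's internal sampler, so the seed used here will not reproduce the combinations that RandomizedSearchCV actually picks. The names `all_candidates` and `sampled` are introduced for illustration.

```python
import itertools
import random

# Same value lists as the grid above, written out in plain Python.
thetas = [round(0.05 * i, 2) for i in range(1, 21)]  # 0.05, 0.10, ..., 1.00
theta_mins = [1]
gammas = [0.5, 1, 2, 4, 8, 16]

# Full grid of (theta, theta_min, gamma) combinations.
all_candidates = list(itertools.product(thetas, theta_mins, gammas))

random.seed(0)  # arbitrary seed for this sketch
sampled = random.sample(all_candidates, 20)  # 20 distinct combinations

print(f"Evaluating {len(sampled)} of {len(all_candidates)} combinations")
```

With 5-fold cross-validation this reduces the number of fits from 600 to 100, at the cost of possibly missing the best grid point.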

[18]:

clf_rd_search.fit(X_train, y_train)
print("Best average score = ", clf_rd_search.best_score_)
print("Best params: ", clf_rd_search.best_params_)

Best average score =  0.9583333333333334
Best params:  {'theta_min': 1, 'theta': 0.3, 'gamma': 2}

[19]:

best_gfmm_rd_search = clf_rd_search.best_estimator_

[20]:

# Testing the performance on the test set
y_pred_rd_search = best_gfmm_rd_search.predict(X_test)

[21]:

acc_rd_search = accuracy_score(y_test, y_pred_rd_search)
print(f'Accuracy (random-search) = {acc_rd_search * 100: .2f}%')

Accuracy (random-search) =  96.67%

Show the explanation for an input sample

[22]:

sample_need_explain = 10
y_pred_input_0, mem_val_classes, min_points_classes, max_points_classes = best_gfmm_rd_search.get_sample_explanation(X_test[sample_need_explain], X_test[sample_need_explain])

[23]:

print("Predicted class for sample X = %s is %d and real class is %d" % (X_test[sample_need_explain], y_pred_input_0, y_test[sample_need_explain]))

Predicted class for sample X = [0.5        0.25       0.77966102 0.54166667] is 2 and real class is 2

[24]:

print("Membership values:")
for key, val in mem_val_classes.items():
    print("Class %d has the maximum membership value = %f" % (key, val))

for key in min_points_classes:
    print("Class %d has the representative hyperbox: V = %s and W = %s" % (key, min_points_classes[key], max_points_classes[key]))

Membership values:
Class 0 has the maximum membership value = 0.000000
Class 1 has the maximum membership value = 0.805085
Class 2 has the maximum membership value = 0.916667
Class 0 has the representative hyperbox: V = [0.11111111 0.45833333 0.03389831 0.04166667] and W = [0.38888889 0.75       0.11864407 0.20833333]
Class 1 has the representative hyperbox: V = [0.25       0.125      0.42372881 0.375     ] and W = [0.5        0.41666667 0.68220339 0.625     ]
Class 2 has the representative hyperbox: V = [0.38888889 0.08333333 0.68221339 0.58333333] and W = [0.66666667 0.33333333 0.81355932 0.79166667]
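
The membership values above can be checked by hand. The sketch below assumes the standard GFMM membership function, in which the membership of a point in a hyperbox [V, W] is the minimum over all dimensions of 1 - f(x - w) and 1 - f(v - x), where f is a ramp with slope gamma clipped to [0, 1]. It also assumes gamma = 2, the value selected by the random search. The function `membership` is written for this illustration and is not part of the hbbrain API.

```python
import numpy as np

def membership(x, v, w, gamma):
    """Membership of point x in the hyperbox [v, w] (illustrative sketch)."""
    # Ramp function: gamma * distance outside the box, clipped to [0, 1].
    above = np.clip((x - w) * gamma, 0.0, 1.0)  # violation of the upper bound
    below = np.clip((v - x) * gamma, 0.0, 1.0)  # violation of the lower bound
    return float(np.min(np.minimum(1.0 - above, 1.0 - below)))

# Sample and representative hyperboxes copied from the output above.
x = np.array([0.5, 0.25, 0.77966102, 0.54166667])
gamma = 2  # assumed: best gamma reported by the random search
boxes = {
    0: (np.array([0.11111111, 0.45833333, 0.03389831, 0.04166667]),
        np.array([0.38888889, 0.75, 0.11864407, 0.20833333])),
    1: (np.array([0.25, 0.125, 0.42372881, 0.375]),
        np.array([0.5, 0.41666667, 0.68220339, 0.625])),
    2: (np.array([0.38888889, 0.08333333, 0.68221339, 0.58333333]),
        np.array([0.66666667, 0.33333333, 0.81355932, 0.79166667])),
}

mem = {label: membership(x, v, w, gamma) for label, (v, w) in boxes.items()}
for label, value in mem.items():
    print("Class %d: membership = %f" % (label, value))
```

Under these assumptions the computed values agree with the printed ones: the sample lies outside the class 2 hyperbox only in the last feature, by about 0.0417, which gives 1 - 2 * 0.0417 ≈ 0.9167.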

Show the explanation results using parallel coordinates

[25]:

# Create a parallel coordinates graph
best_gfmm_rd_search.show_sample_explanation(X_test[sample_need_explain], X_test[sample_need_explain], min_points_classes, max_points_classes, y_pred_input_0, file_path="par_cord/iris_par_cord.html")

[26]:

# Load parallel coordinates to display on the notebook
from IPython.display import IFrame
# For demonstration on readthedocs, the parallel coordinates graph is loaded from GitHub here
# In a local notebook, load the graph stored at 'par_cord/iris_par_cord.html' instead
IFrame('https://uts-caslab.github.io/hyperbox-brain/docs/tutorials/par_cord/iris_par_cord.html', width=820, height=520)
