Integration of Ensemble Models with Hyper-parameter Optimisation in scikit-learn

This example shows how to tune the random hyperboxes classifier with the randomized search cross-validation functionality (RandomizedSearchCV) implemented by scikit-learn.

Note that this example uses the random hyperboxes model and Random Search for illustration only. The other hyperbox-based ensemble learning algorithms in the library can be combined in the same way with any scikit-learn hyper-parameter tuning method.

[1]:
import warnings
warnings.filterwarnings('ignore')  # silence library warnings to keep the output readable
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split
from hbbrain.numerical_data.ensemble_learner.random_hyperboxes import RandomHyperboxesClassifier
from hbbrain.numerical_data.incremental_learner.onln_gfmm import OnlineGFMM

Load the dataset, normalize the numerical features into the range [0, 1], and build the training and testing sets.

This example will use the breast cancer dataset in sklearn for illustration purposes.

[2]:
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import MinMaxScaler
[3]:
# Load the feature matrix X and the class labels y
X, y = load_breast_cancer(return_X_y=True)
[4]:
scaler = MinMaxScaler()
X = scaler.fit_transform(X)
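
As a rough illustration of what the scaling step above does, the sketch below applies the min-max formula x' = (x - min) / (max - min) to each column of a small made-up matrix. The helper name min_max_scale and the toy values are assumptions for illustration only and are not part of scikit-learn or the hyperbox library. Mapping constant columns to 0.0 is a simple convention chosen for this sketch.

```python
def min_max_scale(rows):
    """Rescale each column of a list-of-lists matrix into the range [0, 1]."""
    columns = list(zip(*rows))
    mins = [min(col) for col in columns]
    maxs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled_row = []
        for value, lo, hi in zip(row, mins, maxs):
            span = hi - lo
            # A constant column has span 0; map it to 0.0 to avoid dividing by zero
            scaled_row.append((value - lo) / span if span > 0 else 0.0)
        scaled.append(scaled_row)
    return scaled


# Hypothetical toy data: 3 samples, 2 features
toy = [[2.0, 10.0], [4.0, 30.0], [6.0, 20.0]]
scaled_toy = min_max_scale(toy)
print(scaled_toy)  # [[0.0, 0.0], [0.5, 1.0], [1.0, 0.5]]
```

After this transformation every feature lies in [0, 1], which is the input range stated above for the hyperbox-based models.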
[5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=0)

Using Random Search with 5-fold cross-validation.

[6]:
parameters = {'n_estimators': [20, 30, 50, 100, 200, 500],
              'max_samples': [0.2, 0.3, 0.4, 0.5, 0.6],
              'max_features' : [0.2, 0.3, 0.4, 0.5, 0.6],
              'class_balanced' : [True, False],
              'feature_balanced' : [True, False],
              'n_jobs' : [4],
              'random_state' : [0],
              'base_estimator__theta' : np.arange(0.05, 0.61, 0.05),
              'base_estimator__gamma' : [0.5, 1, 2, 4, 8, 16]}
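
To get a feel for the size of this search space, the sketch below multiplies the number of candidate values per hyper-parameter. The counts are copied by hand from the grid above (12 values for theta, as produced by np.arange(0.05, 0.61, 0.05)). The rounded theta grid at the end is an optional variation, not something the original example does: np.arange builds values by repeated floating-point steps, which can yield entries such as 0.15000000000000002, and rounding gives cleaner values to report.

```python
import math

# Number of candidate values per hyper-parameter, copied from the grid above
grid_sizes = {
    'n_estimators': 6,
    'max_samples': 5,
    'max_features': 5,
    'class_balanced': 2,
    'feature_balanced': 2,
    'n_jobs': 1,
    'random_state': 1,
    'base_estimator__theta': 12,
    'base_estimator__gamma': 6,
}

# Total number of distinct parameter combinations in the full grid
n_combinations = math.prod(grid_sizes.values())
print(n_combinations)  # 43200

# Optional variation (an assumption, not used in the example): a rounded theta grid
theta_grid = [round(0.05 * k, 2) for k in range(1, 13)]
print(theta_grid)
```

With 43,200 possible combinations, an exhaustive grid search would be expensive, which is one reason to sample only a small number of combinations at random.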
[7]:
# Init base learner. This example uses the original online learning algorithm to train a GFMM classifier
base_estimator = OnlineGFMM()
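
The keys 'base_estimator__theta' and 'base_estimator__gamma' in the parameter grid follow scikit-learn's nested-parameter convention, where a double underscore separates a component name from the name of one of its parameters. The toy classes below (ToyBase and ToyEnsemble, both hypothetical and unrelated to the real OnlineGFMM or RandomHyperboxesClassifier implementations) sketch how such a key can be routed to the base learner. This is a simplified illustration, not scikit-learn's actual set_params code.

```python
class ToyBase:
    """Hypothetical stand-in for a base learner with two hyper-parameters."""

    def __init__(self, theta=0.5, gamma=1):
        self.theta = theta
        self.gamma = gamma

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self


class ToyEnsemble:
    """Hypothetical stand-in for an ensemble that wraps a base learner."""

    def __init__(self, base_estimator, n_estimators=10):
        self.base_estimator = base_estimator
        self.n_estimators = n_estimators

    def set_params(self, **params):
        for name, value in params.items():
            component, sep, sub_name = name.partition('__')
            if sep:
                # 'base_estimator__theta' -> forward 'theta' to self.base_estimator
                getattr(self, component).set_params(**{sub_name: value})
            else:
                # Plain names are parameters of the ensemble itself
                setattr(self, name, value)
        return self


ensemble = ToyEnsemble(ToyBase())
ensemble.set_params(n_estimators=50, base_estimator__theta=0.1, base_estimator__gamma=4)
print(ensemble.n_estimators, ensemble.base_estimator.theta, ensemble.base_estimator.gamma)
# 50 0.1 4
```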
[8]:
# Using random search with only 40 random combinations of parameters
random_hyperboxes_clf = RandomHyperboxesClassifier(base_estimator=base_estimator)
clf_rd_search = RandomizedSearchCV(random_hyperboxes_clf, parameters, n_iter=40, cv=5, random_state=0)
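
The general idea behind a randomized search with k-fold cross-validation can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not scikit-learn's implementation: the functions k_fold_indices, random_search and toy_score are hypothetical, the folds are contiguous rather than stratified, candidates are drawn independently (so repeats are possible), and the scoring function is made up instead of training a model. The sample count of 455 assumes an 80% split of the 569 breast cancer samples.

```python
import random


def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def random_search(grid, score_fn, n_iter, k, n_samples, seed=0):
    """Draw n_iter random candidates and keep the best mean validation score."""
    rng = random.Random(seed)
    folds = k_fold_indices(n_samples, k)
    best_score, best_params, n_fits = float('-inf'), None, 0
    for _ in range(n_iter):
        # Pick one value per hyper-parameter at random
        params = {name: rng.choice(values) for name, values in grid.items()}
        # Score the candidate once per validation fold
        fold_scores = [score_fn(params, fold) for fold in folds]
        n_fits += len(folds)
        mean_score = sum(fold_scores) / len(fold_scores)
        if mean_score > best_score:
            best_score, best_params = mean_score, params
    return best_params, best_score, n_fits


def toy_score(params, validation_fold):
    # Made-up objective standing in for "train on the other folds, score on this one"
    return 1.0 - abs(params['theta'] - 0.2) - 0.01 * abs(params['gamma'] - 4)


toy_grid = {'theta': [0.1, 0.2, 0.3], 'gamma': [1, 4, 16]}
best_params, best_score, n_fits = random_search(
    toy_grid, toy_score, n_iter=40, k=5, n_samples=455)
print(n_fits)  # 200 candidate evaluations: 40 candidates x 5 folds
```

With n_iter=40 and cv=5 as in the cell above, the search evaluates 200 model fits. With its default refit setting, RandomizedSearchCV then refits the best configuration once on the whole training set.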
[9]:
clf_rd_search.fit(X_train, y_train)
[9]:
RandomizedSearchCV(cv=5,
                   estimator=RandomHyperboxesClassifier(base_estimator=OnlineGFMM(C=array([], dtype=float64),
                                                                                  V=array([], dtype=float64),
                                                                                  W=array([], dtype=float64))),
                   n_iter=40,
                   param_distributions={'base_estimator__gamma': [0.5, 1, 2, 4,
                                                                  8, 16],
                                        'base_estimator__theta': array([0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 , 0.55,
       0.6 ]),
                                        'class_balanced': [True, False],
                                        'feature_balanced': [True, False],
                                        'max_features': [0.2, 0.3, 0.4, 0.5,
                                                         0.6],
                                        'max_samples': [0.2, 0.3, 0.4, 0.5,
                                                        0.6],
                                        'n_estimators': [20, 30, 50, 100, 200,
                                                         500],
                                        'n_jobs': [4], 'random_state': [0]},
                   random_state=0)
[10]:
print("Best average score = ", clf_rd_search.best_score_)
print("Best params: ", clf_rd_search.best_params_)
Best average score =  0.9714285714285715
Best params:  {'random_state': 0, 'n_jobs': 4, 'n_estimators': 500, 'max_samples': 0.6, 'max_features': 0.5, 'feature_balanced': True, 'class_balanced': False, 'base_estimator__theta': 0.15000000000000002, 'base_estimator__gamma': 16}
[12]:
best_gfmm_rd_search = clf_rd_search.best_estimator_
[13]:
# Testing the performance on the test set
y_pred_rd_search = best_gfmm_rd_search.predict(X_test)
[14]:
acc_rd_search = accuracy_score(y_test, y_pred_rd_search)
print(f'Accuracy (random-search) = {acc_rd_search * 100: .2f}%')
Accuracy (random-search) =  96.49%
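
For reference, the accuracy reported above is the fraction of test samples whose predicted label equals the true label. The sketch below computes it by hand on made-up labels. The helper name accuracy and the toy label lists are assumptions for illustration and do not reproduce the predictions of the tuned model.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of positions where the prediction matches the label."""
    if len(y_true) != len(y_pred):
        raise ValueError('y_true and y_pred must have the same length')
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels: 7 of the 8 predictions are correct
toy_true = [1, 0, 1, 1, 0, 1, 0, 1]
toy_pred = [1, 0, 1, 0, 0, 1, 0, 1]
toy_acc = accuracy(toy_true, toy_pred)
print(f'Accuracy (toy) = {toy_acc * 100:.2f}%')  # Accuracy (toy) = 87.50%
```

Assuming the 20% hold-out of the 569 samples contains 114 test samples, an accuracy of 96.49% would correspond to 110 correct predictions.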