Integration of Ensemble Models with Hyper-parameter Optimisation in Sklearn

This example shows how to integrate the random hyperboxes classifier with the Random Search Cross-Validation functionality implemented by scikit-learn.

Note that this example uses the random hyperboxes model and Random Search for illustration only. Other hyperbox-based ensemble learning algorithms in the library can be combined with any hyper-parameter tuning method in the same way.
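As a rough sketch of that interchangeability, the snippet below uses GridSearchCV instead of RandomizedSearchCV. So that it runs without hbbrain installed, it tunes a scikit-learn Pipeline with LogisticRegression as a stand-in estimator; that choice is an assumption made purely for illustration and is not part of the library. The point is the pattern: any scikit-learn compatible estimator can be passed to the search object, and nested parameters are addressed with a double underscore.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)

# Stand-in estimator (assumption): used here only so the snippet is
# self-contained; a hyperbox-based ensemble would take its place.
stand_in = Pipeline([
    ("scale", MinMaxScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Nested parameters follow the "<component>__<parameter>" convention,
# the same convention behind 'base_estimator__theta' in this example.
param_grid = {"clf__C": [0.1, 1.0, 10.0]}

# Exhaustive grid search with 5-fold cross-validation.
grid_search = GridSearchCV(stand_in, param_grid, cv=5)
grid_search.fit(X, y)

print("Best params:", grid_search.best_params_)
print("Best average score:", grid_search.best_score_)
```

Grid search evaluates every combination, so it only scales to small grids; random search trades completeness for a fixed evaluation budget.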

[1]:

import warnings
warnings.filterwarnings('ignore')  # silence library warnings to keep the cell outputs readable
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split
from hbbrain.numerical_data.ensemble_learner.random_hyperboxes import RandomHyperboxesClassifier
from hbbrain.numerical_data.incremental_learner.onln_gfmm import OnlineGFMM

Load the dataset, normalize the numerical features into the range [0, 1], and build the training and testing datasets.

This example will use the breast cancer dataset in sklearn for illustration purposes.

[2]:

from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import MinMaxScaler

[3]:

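# load_breast_cancer() returns a Bunch object (not a DataFrame) holding the features and labels
data = load_breast_cancer()
X = data.data
y = data.target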

[4]:

scaler = MinMaxScaler()
X = scaler.fit_transform(X)

[5]:

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=0)
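The cells above fit the scaler on the full dataset before splitting it. A common alternative, sketched here using only scikit-learn, is to split first and fit MinMaxScaler on the training portion alone, so that no statistics from the test set influence the preprocessing. Test values can then fall slightly outside [0, 1]; clipping them, as done below, is one assumed way to keep them inside the range used in this example. The variable names with the `_raw` suffix are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X_raw, y = load_breast_cancer(return_X_y=True)

# Split first so the test set stays unseen during preprocessing.
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X_raw, y, train_size=0.8, random_state=0
)

# Fit the scaler on the training data only, then reuse it for the test data.
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train_raw)

# Clip so that test values beyond the training min/max stay within [0, 1].
X_test = np.clip(scaler.transform(X_test_raw), 0.0, 1.0)

print(X_train.shape, X_test.shape)
```

Scores obtained this way may differ slightly from the ones reported in this example, which were produced with the scaler fitted on all samples.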

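Use Random Search with 5-fold cross-validation to tune the hyper-parameters of both the ensemble and its base learner. Parameters of the base learner are addressed with the base_estimator__ prefix.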

[6]:

parameters = {'n_estimators': [20, 30, 50, 100, 200, 500],
              'max_samples': [0.2, 0.3, 0.4, 0.5, 0.6],
              'max_features' : [0.2, 0.3, 0.4, 0.5, 0.6],
              'class_balanced' : [True, False],
              'feature_balanced' : [True, False],
              'n_jobs' : [4],
              'random_state' : [0],
              'base_estimator__theta' : np.arange(0.05, 0.61, 0.05),
              'base_estimator__gamma' : [0.5, 1, 2, 4, 8, 16]}
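To see why a random search is used rather than an exhaustive one, it helps to count the candidates. The sketch below multiplies the number of values per parameter in the dictionary above, which gives 43,200 combinations, and compares that with the budget of 40 iterations used later in this example.

```python
import math

import numpy as np

parameters = {'n_estimators': [20, 30, 50, 100, 200, 500],
              'max_samples': [0.2, 0.3, 0.4, 0.5, 0.6],
              'max_features': [0.2, 0.3, 0.4, 0.5, 0.6],
              'class_balanced': [True, False],
              'feature_balanced': [True, False],
              'n_jobs': [4],
              'random_state': [0],
              'base_estimator__theta': np.arange(0.05, 0.61, 0.05),
              'base_estimator__gamma': [0.5, 1, 2, 4, 8, 16]}

# Size of the full grid: product of the number of values per parameter.
n_combinations = math.prod(len(values) for values in parameters.values())

n_iter = 40  # search budget used in this example
print("Total combinations:", n_combinations)
print(f"Fraction sampled with n_iter={n_iter}: {n_iter / n_combinations:.4%}")
```

With 5-fold cross-validation, each sampled candidate is fitted five times, so the random search performs 200 fits instead of the 216,000 an exhaustive grid search would need.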

[7]:

# Init base learner. This example uses the original online learning algorithm to train a GFMM classifier
base_estimator = OnlineGFMM()

[8]:

# Using random search with only 40 random combinations of parameters
random_hyperboxes_clf = RandomHyperboxesClassifier(base_estimator=base_estimator)
clf_rd_search = RandomizedSearchCV(random_hyperboxes_clf, parameters, n_iter=40, cv=5, random_state=0)
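To get a feel for what a random search will try, scikit-learn's ParameterSampler can list candidates drawn from a parameter dictionary. The sketch below uses a reduced copy of the dictionary above (the name `reduced_parameters` is introduced here for illustration). It is assumed, not verified against this notebook, that RandomizedSearchCV draws its candidates in the same manner for a given random_state.

```python
import numpy as np
from sklearn.model_selection import ParameterSampler

# Reduced copy of the search space, kept small for readability.
reduced_parameters = {
    'n_estimators': [20, 30, 50, 100, 200, 500],
    'max_samples': [0.2, 0.3, 0.4, 0.5, 0.6],
    'base_estimator__theta': np.arange(0.05, 0.61, 0.05),
}

# Draw 5 candidate settings; with list-like values only, candidates are
# sampled without replacement, so no setting is repeated.
candidates = list(ParameterSampler(reduced_parameters, n_iter=5, random_state=0))

for candidate in candidates:
    print(candidate)
```

Because every entry is a finite list, the sampler picks distinct points of the grid; supplying a scipy.stats distribution instead would sample continuous values.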

[9]:

clf_rd_search.fit(X_train, y_train)

[9]:

RandomizedSearchCV(cv=5,
                   estimator=RandomHyperboxesClassifier(base_estimator=OnlineGFMM(C=array([], dtype=float64),
                                                                                  V=array([], dtype=float64),
                                                                                  W=array([], dtype=float64))),
                   n_iter=40,
                   param_distributions={'base_estimator__gamma': [0.5, 1, 2, 4,
                                                                  8, 16],
                                        'base_estimator__theta': array([0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 , 0.55,
       0.6 ]),
                                        'class_balanced': [True, False],
                                        'feature_balanced': [True, False],
                                        'max_features': [0.2, 0.3, 0.4, 0.5,
                                                         0.6],
                                        'max_samples': [0.2, 0.3, 0.4, 0.5,
                                                        0.6],
                                        'n_estimators': [20, 30, 50, 100, 200,
                                                         500],
                                        'n_jobs': [4], 'random_state': [0]},
                   random_state=0)

[10]:

print("Best average score = ", clf_rd_search.best_score_)
print("Best params: ", clf_rd_search.best_params_)

Best average score =  0.9714285714285715
Best params:  {'random_state': 0, 'n_jobs': 4, 'n_estimators': 500, 'max_samples': 0.6, 'max_features': 0.5, 'feature_balanced': True, 'class_balanced': False, 'base_estimator__theta': 0.15000000000000002, 'base_estimator__gamma': 16}
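Beyond the single best configuration, a fitted search object keeps per-candidate results in its cv_results_ attribute. The sketch below ranks candidates by mean cross-validated score. It runs on a small stand-in search over a DecisionTreeClassifier, chosen only so the snippet works without hbbrain; the same attribute can be read from clf_rd_search after fitting.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Stand-in search (assumption): a small decision-tree search replaces the
# random hyperboxes search so that the snippet is self-contained.
search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    {'max_depth': [2, 3, 4, 5, 6], 'min_samples_leaf': [1, 2, 5, 10]},
    n_iter=8, cv=5, random_state=0,
)
search.fit(X, y)

results = search.cv_results_
order = np.argsort(results['rank_test_score'])  # rank 1 is the best candidate

# Show the three best candidates with the spread of their fold scores.
for idx in order[:3]:
    print(f"rank={results['rank_test_score'][idx]} "
          f"mean={results['mean_test_score'][idx]:.4f} "
          f"std={results['std_test_score'][idx]:.4f} "
          f"params={results['params'][idx]}")
```

Looking at the standard deviation next to the mean helps judge whether the top candidates differ meaningfully or sit within fold-to-fold noise.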

[12]:

best_gfmm_rd_search = clf_rd_search.best_estimator_

[13]:

# Testing the performance on the test set
y_pred_rd_search = best_gfmm_rd_search.predict(X_test)

[14]:

acc_rd_search = accuracy_score(y_test, y_pred_rd_search)
print(f'Accuracy (random-search) = {acc_rd_search * 100: .2f}%')

Accuracy (random-search) =  96.49%
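For context, the breast cancer dataset has 569 samples, so the 80/20 split leaves 114 of them for testing. The short calculation below is a plain arithmetic check, not part of the original notebook; it assumes train_test_split rounds the training size down and relates the reported accuracy to a count of correct predictions.

```python
import math

n_samples = 569                        # size of the breast cancer dataset
n_train = math.floor(0.8 * n_samples)  # train_size=0.8, rounded down (assumption)
n_test = n_samples - n_train

reported_accuracy = 0.9649             # value printed above, rounded to 4 digits
n_correct = round(reported_accuracy * n_test)

print(f"test samples = {n_test}")
print(f"correct predictions = {n_correct}, errors = {n_test - n_correct}")
print(f"accuracy = {n_correct / n_test * 100:.2f}%")
```

On a test set of this size, each misclassified sample changes the accuracy by just under 0.9 percentage points, which is worth keeping in mind when comparing tuned models.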