Decision-level Bagging of Hyperbox-based Models with Hyper-parameter Optimisation

This example shows how to use a bagging classifier whose base hyperbox-based models are trained on the full set of features and a subset of samples, and where each base learner is tuned by random search-based hyper-parameter optimisation with k-fold cross-validation.

While the original bagging model in the class DecisionCombinationBagging uses identical base learners with the same hyper-parameters, the cross-validation bagging model in the class DecisionCombinationCrossValBagging allows each base learner to use hyper-parameters adapted to its own training data, found by performing a random search for the best combination of hyper-parameters for each base learner.
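
Conceptually, each base learner receives its own random subsample of the training data together with its own randomised hyper-parameter search, roughly as in the following sketch (illustrative only, not the library's actual implementation; any sklearn-compatible estimator would work here):

[ ]:
# Conceptual sketch of cross-validation bagging -- not hbbrain's internal code.
import numpy as np
from sklearn.model_selection import RandomizedSearchCV

def fit_cross_val_bagging(base_estimator, param_distributions, X, y,
                          n_estimators=20, max_samples=0.5, n_iter=20,
                          k_fold=5, random_state=0):
    rng = np.random.default_rng(random_state)
    n_sub = int(max_samples * len(X))
    learners = []
    for _ in range(n_estimators):
        # each base learner sees its own random subsample of the data
        idx = rng.choice(len(X), size=n_sub, replace=False)
        # and gets its own randomised hyper-parameter search with k-fold CV
        search = RandomizedSearchCV(base_estimator, param_distributions,
                                    n_iter=n_iter, cv=k_fold,
                                    random_state=random_state)
        search.fit(X[idx], y[idx])
        learners.append(search.best_estimator_)  # the tuned base learner
    return learners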

[1]:
import warnings
warnings.filterwarnings('ignore')
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from hbbrain.numerical_data.ensemble_learner.decision_comb_cross_val_bagging import DecisionCombinationCrossValBagging
from hbbrain.numerical_data.incremental_learner.onln_gfmm import OnlineGFMM

Load dataset.

This example will use the breast cancer dataset available in sklearn to demonstrate how to use this ensemble classifier.

[2]:
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import MinMaxScaler
[3]:
df = load_breast_cancer()
X = df.data
y = df.target
[4]:
# Normalise data into the range [0, 1] as hyperbox-based models only work inside the unit cube
scaler = MinMaxScaler()
X = scaler.fit_transform(X)
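
A quick, optional sanity check confirms that every feature now lies within the unit cube:

[ ]:
# All values should lie in [0, 1] after min-max scaling
assert X.min() >= 0.0 and X.max() <= 1.0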
[5]:
# Split data into training, validation and testing sets
Xtr_val, X_test, ytr_val, y_test = train_test_split(X, y, train_size=0.8, random_state=0)
Xtr, X_val, ytr, y_val = train_test_split(Xtr_val, ytr_val, train_size=0.75, random_state=0)
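
It can be helpful to verify the sizes of the resulting splits before training:

[ ]:
# Quick check of the resulting split sizes
print('Training: %d, validation: %d, test: %d samples' % (len(Xtr), len(X_val), len(X_test)))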

This example will use the GFMM classifier with the original online learning algorithm as the base learner. However, any hyperbox-based learning algorithm in this library can also be used to train base learners, as shown in the sketch below.
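
Swapping the base learner only requires passing a different estimator object; for instance (assuming the improved online learning GFMM is available at this import path in your installed hbbrain version):

[ ]:
# Hypothetical alternative base learner -- verify the import path against
# your installed hbbrain version before running
from hbbrain.numerical_data.incremental_learner.iol_gfmm import ImprovedOnlineGFMM
alternative_base_estimator = ImprovedOnlineGFMM()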

1. Using random subsampling to generate training sets for various base learners

Training

[6]:
# Initialise parameters
n_estimators = 20 # number of base learners
max_samples = 0.5 # sampling rate for samples
bootstrap = False # random subsampling without replacement
class_balanced = False # do not use the class-balanced sampling mode
n_jobs = 4 # number of processes used to build the base learners
n_iter = 20 # number of parameter settings that are randomly sampled to choose the best combination of hyper-parameters
k_fold = 5 # number of folds for Stratified K-Fold cross-validation during hyper-parameter tuning
[7]:
# Init a hyperbox-based model used to train base learners
# Using the GFMM classifier with the original online learning algorithm
base_estimator = OnlineGFMM()
[8]:
# Init ranges for the hyper-parameters of base learners used in the random search process for hyper-parameter tuning
base_estimator_params = {'theta': np.arange(0.05, 1.01, 0.05), 'theta_min':[1], 'gamma':[0.5, 1, 2, 4, 8, 16]}
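
The grid above contains 20 values of theta, one value of theta_min and six values of gamma, i.e. 120 combinations in total, of which n_iter = 20 random settings will be evaluated per base learner. The following purely illustrative snippet previews what such random draws look like, using sklearn's ParameterSampler:

[ ]:
# Illustrative only: preview a few random hyper-parameter settings drawn
# from the grid above
from sklearn.model_selection import ParameterSampler
list(ParameterSampler(base_estimator_params, n_iter=3, random_state=0))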
[9]:
dc_cv_bagging_subsampling = DecisionCombinationCrossValBagging(
    base_estimator=base_estimator, base_estimator_params=base_estimator_params,
    n_estimators=n_estimators, max_samples=max_samples, bootstrap=bootstrap,
    class_balanced=class_balanced, n_iter=n_iter, k_fold=k_fold,
    n_jobs=n_jobs, random_state=0)
dc_cv_bagging_subsampling.fit(Xtr, ytr)
[9]:
DecisionCombinationCrossValBagging(base_estimator=OnlineGFMM(C=array([], dtype=float64),
                                                             V=array([], dtype=float64),
                                                             W=array([], dtype=float64)),
                                   base_estimator_params={'gamma': [0.5, 1, 2,
                                                                    4, 8, 16],
                                                          'theta': array([0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 , 0.55,
       0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1.  ]),
                                                          'theta_min': [1]},
                                   n_estimators=20, n_iter=20, n_jobs=4,
                                   random_state=0)
[10]:
print("Training time: %.3f (s)"%(dc_cv_bagging_subsampling.elapsed_training_time))
Training time: 46.519 (s)
[11]:
print('Total number of hyperboxes from all base learners = %d'%dc_cv_bagging_subsampling.get_n_hyperboxes())
Total number of hyperboxes from all base learners = 1168

Prediction

[12]:
y_pred = dc_cv_bagging_subsampling.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f'Testing accuracy = {acc * 100: .2f}%')
Testing accuracy =  94.74%
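
Decision-level combination means that the ensemble output is a majority vote over the class labels predicted by the individual base learners. The snippet below reproduces this vote by hand; it assumes the fitted ensemble exposes its base learners through an estimators_ attribute, as sklearn-style bagging implementations do.

[ ]:
# Illustrative only: recompute the decision-level combination manually.
# Assumes the fitted ensemble stores its base learners in `estimators_`
# (an sklearn-style convention); check your hbbrain version if this fails.
all_preds = np.array([est.predict(X_test) for est in dc_cv_bagging_subsampling.estimators_])
# Majority vote over the base learners (axis 0); labels here are 0/1 integers
majority_vote = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, all_preds)
print('Agreement with the ensemble prediction: %.2f%%' % (100 * np.mean(majority_vote == y_pred)))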

Apply pruning for base learners

[13]:
acc_threshold = 0.5 # minimum accuracy score a hyperbox must reach to be retained
keep_empty_boxes = False # False means hyperboxes that are not used for any prediction on the validation set during the pruning procedure are also removed
dc_cv_bagging_subsampling.simple_pruning_base_estimators(X_val, y_val, acc_threshold, keep_empty_boxes)
[13]:
DecisionCombinationCrossValBagging(base_estimator=OnlineGFMM(C=array([], dtype=float64),
                                                             V=array([], dtype=float64),
                                                             W=array([], dtype=float64)),
                                   base_estimator_params={'gamma': [0.5, 1, 2,
                                                                    4, 8, 16],
                                                          'theta': array([0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 , 0.55,
       0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1.  ]),
                                                          'theta_min': [1]},
                                   n_estimators=20, n_iter=20, n_jobs=4,
                                   random_state=0)
[14]:
print('Total number of hyperboxes from all base learners after pruning = %d'%dc_cv_bagging_subsampling.get_n_hyperboxes())
Total number of hyperboxes from all base learners after pruning = 756

Prediction after doing a pruning procedure

[15]:
y_pred_2 = dc_cv_bagging_subsampling.predict(X_test)
acc_pruned = accuracy_score(y_test, y_pred_2)
print(f'Testing accuracy (after pruning) = {acc_pruned * 100: .2f}%')
Testing accuracy (after pruning) =  96.49%
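
Pruning reduced the total number of hyperboxes from 1168 to 756 while improving the testing accuracy. A quick quantification:

[ ]:
# Summarise the effect of pruning on accuracy and model size
print('Accuracy change after pruning: %+.2f percentage points; %d hyperboxes retained'
      % ((acc_pruned - acc) * 100, dc_cv_bagging_subsampling.get_n_hyperboxes()))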

2. Using random undersampling to generate class-balanced training sets for various base learners
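
In the class-balanced sampling mode, each base learner is trained on a subsample containing the same number of samples from every class. A minimal numpy sketch of such undersampling (illustrative only, not the library's internal routine):

[ ]:
# Illustrative sketch of class-balanced undersampling -- not hbbrain's
# internal implementation
import numpy as np

def balanced_subsample_indices(y, n_per_class, rng):
    idx = []
    for c in np.unique(y):
        members = np.flatnonzero(y == c)  # indices of samples in class c
        idx.append(rng.choice(members, size=n_per_class, replace=False))
    return np.concatenate(idx)

rng = np.random.default_rng(0)
sample_idx = balanced_subsample_indices(ytr, n_per_class=100, rng=rng)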

Training

[16]:
# Initialise parameters
n_estimators = 20 # number of base learners
max_samples = 0.5 # sampling rate for samples
bootstrap = False # random subsampling without replacement
class_balanced = True # use the class-balanced sampling mode
n_jobs = 4 # number of processes used to build the base learners
n_iter = 20 # number of parameter settings that are randomly sampled to choose the best combination of hyper-parameters
k_fold = 5 # number of folds for Stratified K-Fold cross-validation during hyper-parameter tuning
[17]:
# Init a hyperbox-based model used to train base learners
# Using the GFMM classifier with the original online learning algorithm
base_estimator = OnlineGFMM()
[18]:
# Init ranges for the hyper-parameters of base learners used in the random search process for hyper-parameter tuning
base_estimator_params = {'theta': np.arange(0.05, 1.01, 0.05), 'theta_min':[1], 'gamma':[0.5, 1, 2, 4, 8, 16]}
[19]:
dc_cv_bagging_class_balanced = DecisionCombinationCrossValBagging(
    base_estimator=base_estimator, base_estimator_params=base_estimator_params,
    n_estimators=n_estimators, max_samples=max_samples, bootstrap=bootstrap,
    class_balanced=class_balanced, n_iter=n_iter, k_fold=k_fold,
    n_jobs=n_jobs, random_state=0)
dc_cv_bagging_class_balanced.fit(Xtr, ytr)
[19]:
DecisionCombinationCrossValBagging(base_estimator=OnlineGFMM(C=array([], dtype=float64),
                                                             V=array([], dtype=float64),
                                                             W=array([], dtype=float64)),
                                   base_estimator_params={'gamma': [0.5, 1, 2,
                                                                    4, 8, 16],
                                                          'theta': array([0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 , 0.55,
       0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1.  ]),
                                                          'theta_min': [1]},
                                   class_balanced=True, n_estimators=20,
                                   n_iter=20, n_jobs=4, random_state=0)
[20]:
print("Training time: %.3f (s)"%(dc_cv_bagging_class_balanced.elapsed_training_time))
Training time: 32.595 (s)
[21]:
print('Total number of hyperboxes from all base learners = %d'%dc_cv_bagging_class_balanced.get_n_hyperboxes())
Total number of hyperboxes from all base learners = 1407

Prediction

[22]:
y_pred = dc_cv_bagging_class_balanced.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f'Testing accuracy = {acc * 100: .2f}%')
Testing accuracy =  94.74%

Apply pruning for base learners

[23]:
acc_threshold = 0.5 # minimum accuracy score a hyperbox must reach to be retained
keep_empty_boxes = False # False means hyperboxes that are not used for any prediction on the validation set during the pruning procedure are also removed
dc_cv_bagging_class_balanced.simple_pruning_base_estimators(X_val, y_val, acc_threshold, keep_empty_boxes)
[23]:
DecisionCombinationCrossValBagging(base_estimator=OnlineGFMM(C=array([], dtype=float64),
                                                             V=array([], dtype=float64),
                                                             W=array([], dtype=float64)),
                                   base_estimator_params={'gamma': [0.5, 1, 2,
                                                                    4, 8, 16],
                                                          'theta': array([0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 , 0.55,
       0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1.  ]),
                                                          'theta_min': [1]},
                                   class_balanced=True, n_estimators=20,
                                   n_iter=20, n_jobs=4, random_state=0)
[24]:
print('Total number of hyperboxes from all base learners after pruning = %d'%dc_cv_bagging_class_balanced.get_n_hyperboxes())
Total number of hyperboxes from all base learners after pruning = 719

Prediction after doing a pruning procedure

[25]:
y_pred_2 = dc_cv_bagging_class_balanced.predict(X_test)
acc_pruned = accuracy_score(y_test, y_pred_2)
print(f'Testing accuracy (after pruning) = {acc_pruned * 100: .2f}%')
Testing accuracy (after pruning) =  95.61%