Decision-level Bagging of Hyperbox-based Models

This example shows how to use a bagging classifier whose base hyperbox-based models are trained on the full set of features and random subsets of samples, and whose predictions are combined at the decision level.

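At prediction time, a decision-level combiner queries every base learner for a class label and merges the labels, typically by majority voting. A minimal sketch of such a combination step (illustrative only, not the library's internal code; estimators stands for any list of fitted hyperbox-based classifiers, and non-negative integer class labels are assumed):

import numpy as np

def majority_vote(estimators, X):
    # one row of predicted labels per base learner
    all_preds = np.array([est.predict(X) for est in estimators])
    # the combined decision is the most frequent label in each column (sample)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, all_preds)
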
[1]:
import warnings
warnings.filterwarnings('ignore')
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from hbbrain.numerical_data.ensemble_learner.decision_comb_bagging import DecisionCombinationBagging
from hbbrain.numerical_data.incremental_learner.onln_gfmm import OnlineGFMM

Load dataset.

This example uses the breast cancer dataset available in sklearn to demonstrate the use of this ensemble classifier.

[2]:
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import MinMaxScaler
[3]:
df = load_breast_cancer()
X = df.data
y = df.target
[4]:
# Normalise data into the range [0, 1] as hyperbox-based models only work inside the unit cube
scaler = MinMaxScaler()
X = scaler.fit_transform(X)
[5]:
# Split data into training, validation and testing sets
Xtr_val, X_test, ytr_val, y_test = train_test_split(X, y, train_size=0.8, random_state=0)
Xtr, X_val, ytr, y_val = train_test_split(Xtr_val, ytr_val, train_size=0.75, random_state=0)

This example uses the GFMM classifier with the original online learning algorithm as the base learner. However, any hyperbox-based learning algorithm in this library can be used to train the base learners.

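For instance, an alternative incremental learner could be plugged in as the base estimator with no other changes. A sketch, assuming the ImprovedOnlineGFMM class is available at the module path below (check the hbbrain documentation for the exact path):

from hbbrain.numerical_data.incremental_learner.iol_gfmm import ImprovedOnlineGFMM

alt_base_estimator = ImprovedOnlineGFMM(theta=0.1)
alt_bagging = DecisionCombinationBagging(base_estimator=alt_base_estimator,
                                         n_estimators=20, random_state=0)
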
1. Using random subsampling to generate training sets for various base learners

Training

[6]:
# Initialise parameters
n_estimators = 20 # number of base learners
max_samples = 0.5 # fraction of training samples drawn for each base learner
bootstrap = False # random subsampling without replacement
class_balanced = False # do not use the class-balanced sampling mode
n_jobs = 4 # number of processes used to build base learners
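
With bootstrap=False and max_samples=0.5, each base learner is trained on a random half of the training samples, drawn without replacement. A sketch of that sampling step for a single base learner (illustrative only; the ensemble performs this internally):

import numpy as np

rng = np.random.default_rng(0)
# draw 50% of the training indices without replacement
idx = rng.choice(Xtr.shape[0], size=int(0.5 * Xtr.shape[0]), replace=False)
X_sub, y_sub = Xtr[idx], ytr[idx]
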
[7]:
# Initialise the hyperbox-based model used for the base learners
# Use the GFMM classifier with the original online learning algorithm and a maximum hyperbox size of 0.1
base_estimator = OnlineGFMM(theta=0.1)
[8]:
dc_bagging_subsampling = DecisionCombinationBagging(
    base_estimator=base_estimator, n_estimators=n_estimators,
    max_samples=max_samples, bootstrap=bootstrap,
    class_balanced=class_balanced, n_jobs=n_jobs, random_state=0)
[9]:
dc_bagging_subsampling.fit(Xtr, ytr)
[9]:
DecisionCombinationBagging(base_estimator=OnlineGFMM(C=array([], dtype=float64),
                                                     V=array([], dtype=float64),
                                                     W=array([], dtype=float64),
                                                     theta=0.1),
                           n_estimators=20, n_jobs=4, random_state=0)
[10]:
print("Training time: %.3f (s)"%(dc_bagging_subsampling.elapsed_training_time))
Training time: 4.355 (s)
[11]:
print('Total number of hyperboxes from all base learners = %d'%dc_bagging_subsampling.get_n_hyperboxes())
Total number of hyperboxes from all base learners = 3948

Prediction

[12]:
y_pred = dc_bagging_subsampling.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f'Testing accuracy = {acc * 100: .2f}%')
Testing accuracy =  93.86%

Apply pruning to the base learners

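Conceptually, the simple pruning procedure scores every hyperbox by its accuracy over the validation samples it takes part in predicting, and removes the boxes that fall below a threshold. A rough sketch of the selection rule (not the library's internal code; box_correct and box_total are hypothetical per-box counters of correct and total validation predictions):

def select_boxes(box_correct, box_total, acc_threshold=0.5, keep_empty_boxes=False):
    # return the indices of hyperboxes that survive pruning
    kept = []
    for j, (c, t) in enumerate(zip(box_correct, box_total)):
        if t == 0:
            # an "empty" box that never joined a validation prediction
            if keep_empty_boxes:
                kept.append(j)
        elif c / t >= acc_threshold:
            kept.append(j)
    return kept
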
[13]:
acc_threshold = 0.5 # minimum validation accuracy for a hyperbox to be retained
keep_empty_boxes = False # also eliminate hyperboxes that take no part in any prediction during the pruning procedure
dc_bagging_subsampling.simple_pruning_base_estimators(X_val, y_val, acc_threshold, keep_empty_boxes)
[13]:
DecisionCombinationBagging(base_estimator=OnlineGFMM(C=array([], dtype=float64),
                                                     V=array([], dtype=float64),
                                                     W=array([], dtype=float64),
                                                     theta=0.1),
                           n_estimators=20, n_jobs=4, random_state=0)
[14]:
print('Total number of hyperboxes from all base learners after pruning = %d'%dc_bagging_subsampling.get_n_hyperboxes())
Total number of hyperboxes from all base learners after pruning = 2195

Prediction after pruning

[15]:
y_pred_2 = dc_bagging_subsampling.predict(X_test)
acc_pruned = accuracy_score(y_test, y_pred_2)
print(f'Testing accuracy (after pruning) = {acc_pruned * 100: .2f}%')
Testing accuracy (after pruning) =  95.61%

2. Using random undersampling to generate class-balanced training sets for various base learners

Training

[16]:
# Initialise parameters
n_estimators = 20 # number of base learners
max_samples = 0.5 # fraction of training samples drawn for each base learner
bootstrap = False # random subsampling without replacement
class_balanced = True # use the class-balanced sampling mode
n_jobs = 4 # number of processes used to build base learners
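
With class_balanced=True, each base learner receives a training set in which every class contributes the same number of samples, which counters class imbalance. A sketch of one plausible realisation of that undersampling step for a single base learner (the library's exact sampling may differ; the ensemble performs this internally):

import numpy as np

rng = np.random.default_rng(0)
classes, counts = np.unique(ytr, return_counts=True)
n_per_class = int(0.5 * counts.min())
# draw the same number of samples from every class, without replacement
idx = np.concatenate([
    rng.choice(np.flatnonzero(ytr == c), size=n_per_class, replace=False)
    for c in classes
])
X_bal, y_bal = Xtr[idx], ytr[idx]
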
[17]:
# Initialise the hyperbox-based model used for the base learners
# Use the GFMM classifier with the original online learning algorithm and a maximum hyperbox size of 0.1
base_estimator = OnlineGFMM(theta=0.1)
[18]:
dc_bagging_class_balanced = DecisionCombinationBagging(
    base_estimator=base_estimator, n_estimators=n_estimators,
    max_samples=max_samples, bootstrap=bootstrap,
    class_balanced=class_balanced, n_jobs=n_jobs, random_state=0)
[19]:
dc_bagging_class_balanced.fit(Xtr, ytr)
[19]:
DecisionCombinationBagging(base_estimator=OnlineGFMM(C=array([], dtype=float64),
                                                     V=array([], dtype=float64),
                                                     W=array([], dtype=float64),
                                                     theta=0.1),
                           class_balanced=True, n_estimators=20, n_jobs=4,
                           random_state=0)
[20]:
print("Training time: %.3f (s)"%(dc_bagging_class_balanced.elapsed_training_time))
Training time: 0.271 (s)
[21]:
print('Total number of hyperboxes from all base learners = %d'%dc_bagging_class_balanced.get_n_hyperboxes())
Total number of hyperboxes from all base learners = 4010

Prediction

[22]:
y_pred = dc_bagging_class_balanced.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f'Testing accuracy = {acc * 100: .2f}%')
Testing accuracy =  92.11%

Apply pruning to the base learners

[23]:
acc_threshold = 0.5 # minimum validation accuracy for a hyperbox to be retained
keep_empty_boxes = False # also eliminate hyperboxes that take no part in any prediction during the pruning procedure
dc_bagging_class_balanced.simple_pruning_base_estimators(X_val, y_val, acc_threshold, keep_empty_boxes)
[23]:
DecisionCombinationBagging(base_estimator=OnlineGFMM(C=array([], dtype=float64),
                                                     V=array([], dtype=float64),
                                                     W=array([], dtype=float64),
                                                     theta=0.1),
                           class_balanced=True, n_estimators=20, n_jobs=4,
                           random_state=0)
[24]:
print('Total number of hyperboxes from all base learners after pruning = %d'%dc_bagging_class_balanced.get_n_hyperboxes())
Total number of hyperboxes from all base learners after pruning = 2738

Prediction after pruning

[25]:
y_pred_2 = dc_bagging_class_balanced.predict(X_test)
acc_pruned = accuracy_score(y_test, y_pred_2)
print(f'Testing accuracy (after pruning) = {acc_pruned * 100: .2f}%')
Testing accuracy (after pruning) =  94.74%