Model-level Bagging of Hyperbox-based Models

This example shows how to use a bagging classifier that combines its base learners at the model level, producing a single model from many base learners. Each hyperbox-based base learner is trained on the full set of features and a subset of the samples.

[1]:

import warnings
warnings.filterwarnings('ignore')
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from hbbrain.numerical_data.ensemble_learner.model_comb_bagging import ModelCombinationBagging
from hbbrain.numerical_data.incremental_learner.onln_gfmm import OnlineGFMM
from hbbrain.numerical_data.batch_learner.accel_agglo_gfmm import AccelAgglomerativeLearningGFMM

Load dataset.

This example will use the breast cancer dataset available in sklearn to demonstrate how to use this ensemble classifier.

[2]:

from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import MinMaxScaler

[3]:

df = load_breast_cancer()
X = df.data
y = df.target

[4]:

# Normalise data into the range [0, 1], as hyperbox-based models only work in the unit cube
scaler = MinMaxScaler()
X = scaler.fit_transform(X)
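As a rough illustration of what this scaling step does, the sketch below rescales each column of a small toy matrix into [0, 1] using only the standard library. The helper name `min_max_scale` and the toy data are assumptions made for this illustration; the notebook itself relies on sklearn's `MinMaxScaler`.

```python
def min_max_scale(rows):
    """Rescale each column of a list of rows into the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append([
            # A constant column has no spread, so map it to 0.0
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(row, lows, highs)
        ])
    return scaled


toy = [[2.0, 10.0], [4.0, 30.0], [6.0, 20.0]]  # hypothetical data
scaled_toy = min_max_scale(toy)
print(scaled_toy)  # [[0.0, 0.0], [0.5, 1.0], [1.0, 0.5]]
```

After scaling, every feature vector lies inside the unit cube, which is what the hyperbox-based learners expect.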

[5]:

# Split data into training, validation and testing sets
Xtr_val, X_test, ytr_val, y_test = train_test_split(X, y, train_size=0.8, random_state=0)
Xtr, X_val, ytr, y_val = train_test_split(Xtr_val, ytr_val, train_size=0.75, random_state=0)
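A three-way split is typically done in two stages, with the second split applied to the remainder of the first so that the training, validation and testing sets are disjoint. The sketch below shows the resulting sizes for the 569 samples of the breast cancer dataset. The helper `two_stage_split` is hypothetical and only mimics the idea; its shuffling differs from sklearn's `train_test_split`.

```python
import random


def two_stage_split(n_samples, first_train=0.8, second_train=0.75, seed=0):
    """Split sample indices into disjoint train / validation / test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Stage 1: hold out the test set
    n_train_val = int(first_train * n_samples)
    train_val, test = indices[:n_train_val], indices[n_train_val:]
    # Stage 2: split the remainder into training and validation sets
    n_train = int(second_train * len(train_val))
    train, val = train_val[:n_train], train_val[n_train:]
    return train, val, test


train_idx, val_idx, test_idx = two_stage_split(569)
print(len(train_idx), len(val_idx), len(test_idx))  # 341 114 114
```

With these rates the data ends up split roughly 60/20/20.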

This example uses the GFMM classifier with the original online learning algorithm as the base learner. However, any hyperbox-based learning algorithm in this library can also be used to train base learners.

1. Using random subsampling to generate training sets for various base learners

a. Training without pruning for base learners

[6]:

# Initialise parameters
n_estimators = 20 # number of base learners
max_samples = 0.5 # sampling rate for samples
bootstrap = False # random subsampling without replacement
class_balanced = False # do not use the class-balanced sampling mode
n_jobs = 4 # number of processes used to build base learners
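To make the roles of `max_samples` and `bootstrap` concrete, the following sketch draws the index sets that 20 base learners would train on. It is only an illustration under the assumption that sampling works on row indices; `draw_sample_indices` is a hypothetical helper, not part of the library.

```python
import random


def draw_sample_indices(n_samples, max_samples=0.5, bootstrap=False, rng=None):
    """Pick the row indices one base learner would be trained on."""
    if rng is None:
        rng = random.Random(0)
    n_draw = int(max_samples * n_samples)
    if bootstrap:
        # Sampling with replacement: the same row may appear several times
        return [rng.randrange(n_samples) for _ in range(n_draw)]
    # Random subsampling without replacement: every row appears at most once
    return rng.sample(range(n_samples), n_draw)


rng = random.Random(0)
subsets = [draw_sample_indices(100, 0.5, False, rng) for _ in range(20)]
print(len(subsets), len(subsets[0]))  # 20 50
```

Each base learner sees a different half of the samples, which is the source of diversity among the learners.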

[7]:

# Initialise the hyperbox-based model used to train base learners
# Use the GFMM classifier with the original online learning algorithm and a maximum hyperbox size of 0.1
base_estimator = OnlineGFMM(theta=0.1)
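The parameter `theta` bounds how large a hyperbox may grow. A minimal sketch of the usual GFMM expansion test is given below: a hyperbox with minimum point `v` and maximum point `w` may absorb a sample `x` only if no side of the enlarged box exceeds `theta`. The function names are assumptions for this illustration, and the library's online learner performs further steps (such as overlap handling) that are not shown.

```python
def can_expand(v, w, x, theta):
    """True if the box [v, w] can grow to contain x with every side <= theta."""
    return all(max(wj, xj) - min(vj, xj) <= theta for vj, wj, xj in zip(v, w, x))


def expand(v, w, x):
    """Return the smallest box that contains both [v, w] and the point x."""
    new_v = [min(vj, xj) for vj, xj in zip(v, x)]
    new_w = [max(wj, xj) for wj, xj in zip(w, x)]
    return new_v, new_w


v, w = [0.20, 0.20], [0.25, 0.25]          # hypothetical hyperbox
near, far = [0.28, 0.22], [0.50, 0.22]     # hypothetical samples

print(can_expand(v, w, near, theta=0.1))   # True
print(can_expand(v, w, far, theta=0.1))    # False
print(expand(v, w, near))                  # ([0.2, 0.2], [0.28, 0.25])
```

A smaller `theta` therefore yields more, smaller hyperboxes, while a larger value yields fewer, coarser ones.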

[8]:

# Init a hyperbox-based model used to aggregate the resulting hyperboxes from all base learners
# Using the accelerated agglomerative learning algorithm for the GFMM model to do this task
model_level_estimator = AccelAgglomerativeLearningGFMM(theta=0.1, min_simil=0, simil_measure='long')
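The model-level estimator aggregates hyperboxes produced by different base learners. As a simplified sketch of one ingredient of agglomerative aggregation, the code below merges two hyperboxes of the same class only when the merged box still respects `theta`. The helper `merge_boxes` is hypothetical; the similarity test controlled by `min_simil` and `simil_measure`, and the overlap test against other classes, are deliberately left out.

```python
def merge_boxes(box_a, box_b, theta):
    """Merge two (v, w) hyperboxes if no side of the result exceeds theta.

    Returns the merged (v, w) pair, or None when the merged box would be too large.
    """
    (va, wa), (vb, wb) = box_a, box_b
    v = [min(a, b) for a, b in zip(va, vb)]
    w = [max(a, b) for a, b in zip(wa, wb)]
    if all(hi - lo <= theta for lo, hi in zip(v, w)):
        return v, w
    return None


# Hypothetical hyperboxes, all assumed to carry the same class label
box_1 = ([0.10, 0.10], [0.12, 0.15])
box_2 = ([0.15, 0.12], [0.18, 0.16])
box_3 = ([0.50, 0.50], [0.55, 0.55])

print(merge_boxes(box_1, box_2, theta=0.1))  # ([0.1, 0.1], [0.18, 0.16])
print(merge_boxes(box_1, box_3, theta=0.1))  # None
```

Repeating such merges is what allows thousands of base-learner hyperboxes to shrink into a few hundred in the combined model.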

[9]:

model_comb_bagging_subsampling = ModelCombinationBagging(base_estimator=base_estimator, model_level_estimator=model_level_estimator, n_estimators=n_estimators, max_samples=max_samples, bootstrap=bootstrap, class_balanced=class_balanced, n_jobs=n_jobs, random_state=0)
model_comb_bagging_subsampling.fit(Xtr, ytr)

[9]:

ModelCombinationBagging(base_estimator=OnlineGFMM(C=array([], dtype=float64),
                                                  V=array([], dtype=float64),
                                                  W=array([], dtype=float64),
                                                  theta=0.1),
                        model_level_estimator=AccelAgglomerativeLearningGFMM(min_simil=0,
                                                                             simil_measure='long',
                                                                             theta=0.1),
                        n_estimators=20, n_jobs=4, random_state=0)

[10]:

print("Training time: %.3f (s)"%(model_comb_bagging_subsampling.elapsed_training_time))

Training time: 16.647 (s)

[11]:

print('Total number of hyperboxes in all base learners = %d'%model_comb_bagging_subsampling.get_n_hyperboxes())

Total number of hyperboxes in all base learners = 3948

[12]:

print('Number of hyperboxes in the combined model = %d'%model_comb_bagging_subsampling.get_n_hyperboxes_comb_model())

Number of hyperboxes in the combined model = 401

Prediction

Using majority voting from predicted results of all base learners

[13]:

y_pred_voting = model_comb_bagging_subsampling.predict_voting(X_test)

[14]:

acc_voting = accuracy_score(y_test, y_pred_voting)
print(f'Testing accuracy using voting of decisions from base learners = {acc_voting * 100 : .2f}%')

Testing accuracy using voting of decisions from base learners =  93.86%
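Majority voting itself is simple to sketch. The example below assumes each base learner returns one predicted label per sample and picks the most frequent label for each sample. The function `majority_vote` and the toy predictions are illustrative assumptions; how the library breaks ties is not shown here.

```python
from collections import Counter


def majority_vote(predictions_per_learner):
    """Combine per-learner label lists into one label per sample."""
    n_samples = len(predictions_per_learner[0])
    voted = []
    for i in range(n_samples):
        votes = Counter(preds[i] for preds in predictions_per_learner)
        # most_common(1) returns the label with the highest count
        voted.append(votes.most_common(1)[0][0])
    return voted


toy_predictions = [  # three hypothetical learners, three samples
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 1],
]
voted_labels = majority_vote(toy_predictions)
print(voted_labels)  # [0, 0, 1]
```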

Using the final combined single model to make predictions

[16]:

y_pred = model_comb_bagging_subsampling.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f'Testing accuracy of the combined model = {acc * 100: .2f}%')

Testing accuracy of the combined model =  92.98%
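The combined model classifies a sample by the hyperbox with the highest membership value. The sketch below follows the standard GFMM membership function with a ramp threshold function; the sensitivity parameter `gamma=1.0`, the helper names and the two toy hyperboxes are assumptions for this illustration rather than the library's code.

```python
def ramp(r, gamma):
    """Ramp threshold function: r * gamma clipped to the range [0, 1]."""
    return min(1.0, max(0.0, r * gamma))


def membership(x, v, w, gamma=1.0):
    """Membership of x in the hyperbox [v, w]; equals 1.0 inside the box."""
    return min(
        min(1.0 - ramp(xi - wi, gamma), 1.0 - ramp(vi - xi, gamma))
        for xi, vi, wi in zip(x, v, w)
    )


def predict_one(x, boxes, gamma=1.0):
    """Return the class label of the hyperbox with the highest membership."""
    best = max(boxes, key=lambda box: membership(x, box[0], box[1], gamma))
    return best[2]


# Hypothetical hyperboxes given as (min point, max point, class label)
boxes = [
    ([0.1, 0.1], [0.2, 0.2], 0),
    ([0.6, 0.6], [0.7, 0.7], 1),
]
print(membership([0.15, 0.15], boxes[0][0], boxes[0][1]))  # 1.0
print(predict_one([0.15, 0.15], boxes))                    # 0
print(predict_one([0.55, 0.65], boxes))                    # 1
```

Because prediction only needs the merged hyperboxes, the combined model is much cheaper to query than the full set of base learners.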

Apply pruning for the final combined model

[17]:

acc_threshold = 0.5 # minimum accuracy a hyperbox needs to survive pruning
keep_empty_boxes = False # False: hyperboxes that take part in no prediction during pruning are also removed
model_comb_bagging_subsampling.simple_pruning(X_val, y_val, acc_threshold, keep_empty_boxes)

[17]:

ModelCombinationBagging(base_estimator=OnlineGFMM(C=array([], dtype=float64),
                                                  V=array([], dtype=float64),
                                                  W=array([], dtype=float64),
                                                  theta=0.1),
                        model_level_estimator=AccelAgglomerativeLearningGFMM(min_simil=0,
                                                                             simil_measure='long',
                                                                             theta=0.1),
                        n_estimators=20, n_jobs=4, random_state=0)
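The idea behind accuracy-based pruning can be sketched as follows: each validation sample is assigned to the hyperbox that classified it, every hyperbox gets an accuracy score from its assigned samples, and hyperboxes scoring below `acc_threshold` are removed. This is a simplified illustration under stated assumptions (the helper `prune_boxes`, the `>=` comparison and the toy data are all hypothetical); the library's `simple_pruning` may differ in detail.

```python
def prune_boxes(box_labels, winners, y_val, acc_threshold=0.5, keep_empty_boxes=False):
    """Return the indices of the hyperboxes kept after pruning.

    box_labels[j] is the class of hyperbox j and winners[i] is the index of
    the hyperbox that classified validation sample i.
    """
    used = [0] * len(box_labels)
    correct = [0] * len(box_labels)
    for box, label in zip(winners, y_val):
        used[box] += 1
        correct[box] += int(box_labels[box] == label)

    kept = []
    for box in range(len(box_labels)):
        if used[box] == 0:
            # Hyperbox never took part in a prediction on the validation set
            if keep_empty_boxes:
                kept.append(box)
        elif correct[box] / used[box] >= acc_threshold:
            kept.append(box)
    return kept


box_labels = [0, 1, 1, 0]        # hypothetical hyperbox classes
winners = [0, 0, 1, 1, 1, 2]     # winning hyperbox per validation sample
y_val_toy = [0, 0, 1, 0, 0, 1]   # hypothetical validation labels

print(prune_boxes(box_labels, winners, y_val_toy))                         # [0, 2]
print(prune_boxes(box_labels, winners, y_val_toy, keep_empty_boxes=True))  # [0, 2, 3]
```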

[18]:

print('Number of hyperboxes of the combined single model after pruning = %d'%model_comb_bagging_subsampling.get_n_hyperboxes_comb_model())

Number of hyperboxes of the combined single model after pruning = 393

Prediction after doing a pruning procedure for the combined single model

[20]:

y_pred_2 = model_comb_bagging_subsampling.predict(X_test)
acc_pruned = accuracy_score(y_test, y_pred_2)
print(f'Testing accuracy after pruning the final model = {acc_pruned * 100: .2f}%')

Testing accuracy after pruning the final model =  94.74%

b. Training with pruning for base learners

[21]:

model_comb_bagging_subsampling_base_learner_pruning = ModelCombinationBagging(base_estimator=base_estimator, model_level_estimator=model_level_estimator, n_estimators=n_estimators, max_samples=max_samples, bootstrap=bootstrap, class_balanced=class_balanced, n_jobs=n_jobs, random_state=0)
model_comb_bagging_subsampling_base_learner_pruning.fit(Xtr, ytr, is_pruning_base_learners=True, X_val=X_val, y_val=y_val, acc_threshold=acc_threshold, keep_empty_boxes=keep_empty_boxes)

[21]:

ModelCombinationBagging(base_estimator=OnlineGFMM(C=array([], dtype=float64),
                                                  V=array([], dtype=float64),
                                                  W=array([], dtype=float64),
                                                  theta=0.1),
                        model_level_estimator=AccelAgglomerativeLearningGFMM(min_simil=0,
                                                                             simil_measure='long',
                                                                             theta=0.1),
                        n_estimators=20, n_jobs=4, random_state=0)

[22]:

print("Training time: %.3f (s)"%(model_comb_bagging_subsampling_base_learner_pruning.elapsed_training_time))

Training time: 8.254 (s)

[23]:

print('Total number of hyperboxes in all base learners = %d'%model_comb_bagging_subsampling_base_learner_pruning.get_n_hyperboxes())

Total number of hyperboxes in all base learners = 2195

[24]:

print('Number of hyperboxes in the combined model = %d'%model_comb_bagging_subsampling_base_learner_pruning.get_n_hyperboxes_comb_model())

Number of hyperboxes in the combined model = 388

Prediction

Using majority voting from predicted results of all base learners

[25]:

y_pred_voting = model_comb_bagging_subsampling_base_learner_pruning.predict_voting(X_test)

[26]:

acc_voting = accuracy_score(y_test, y_pred_voting)
print(f'Testing accuracy using voting of decisions from base learners = {acc_voting * 100 : .2f}%')

Testing accuracy using voting of decisions from base learners =  95.61%

Using the final combined single model to make predictions

[27]:

y_pred = model_comb_bagging_subsampling_base_learner_pruning.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f'Testing accuracy of the combined model = {acc * 100: .2f}%')

Testing accuracy of the combined model =  94.74%

Apply pruning for the final combined model

[28]:

acc_threshold = 0.5 # minimum accuracy a hyperbox needs to survive pruning
keep_empty_boxes = False # False: hyperboxes that take part in no prediction during pruning are also removed
model_comb_bagging_subsampling_base_learner_pruning.simple_pruning(X_val, y_val, acc_threshold, keep_empty_boxes)

[28]:

ModelCombinationBagging(base_estimator=OnlineGFMM(C=array([], dtype=float64),
                                                  V=array([], dtype=float64),
                                                  W=array([], dtype=float64),
                                                  theta=0.1),
                        model_level_estimator=AccelAgglomerativeLearningGFMM(min_simil=0,
                                                                             simil_measure='long',
                                                                             theta=0.1),
                        n_estimators=20, n_jobs=4, random_state=0)

[29]:

print('Number of hyperboxes of the combined single model after pruning = %d'%model_comb_bagging_subsampling_base_learner_pruning.get_n_hyperboxes_comb_model())

Number of hyperboxes of the combined single model after pruning = 383

Prediction after doing a pruning procedure for the combined single model

[30]:

y_pred_2 = model_comb_bagging_subsampling_base_learner_pruning.predict(X_test)
acc_pruned = accuracy_score(y_test, y_pred_2)
print(f'Testing accuracy after pruning the final model = {acc_pruned * 100: .2f}%')

Testing accuracy after pruning the final model =  94.74%

2. Using random undersampling to generate class-balanced training sets for various base learners

a. Training without pruning for base learners

[31]:

# Initialise parameters
n_estimators = 20 # number of base learners
max_samples = 0.5 # sampling rate for samples
bootstrap = False # random subsampling without replacement
class_balanced = True # use the class-balanced sampling mode
n_jobs = 4 # number of processes used to build base learners
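Class-balanced undersampling can be pictured as drawing the same number of samples from every class, limited by the size of the smallest class. The sketch below uses the class counts of the breast cancer dataset (212 malignant, 357 benign). The helper `balanced_undersample` is hypothetical, and how the library combines this mode with `max_samples` is not shown.

```python
import random
from collections import defaultdict


def balanced_undersample(labels, rng):
    """Draw an equal number of indices from each class, without replacement."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    # The smallest class limits how many samples each class can contribute
    n_per_class = min(len(idxs) for idxs in by_class.values())
    chosen = []
    for idxs in by_class.values():
        chosen.extend(rng.sample(idxs, n_per_class))
    return sorted(chosen)


labels = [0] * 212 + [1] * 357
balanced_idx = balanced_undersample(labels, random.Random(0))
n_class_0 = sum(1 for i in balanced_idx if labels[i] == 0)
n_class_1 = sum(1 for i in balanced_idx if labels[i] == 1)
print(len(balanced_idx), n_class_0, n_class_1)  # 424 212 212
```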

[32]:

# Initialise the hyperbox-based model used to train base learners
# Use the GFMM classifier with the original online learning algorithm and a maximum hyperbox size of 0.1
base_estimator = OnlineGFMM(theta=0.1)

[33]:

# Init a hyperbox-based model used to aggregate the resulting hyperboxes from all base learners
# Using the accelerated agglomerative learning algorithm for the GFMM model to do this task
model_level_estimator = AccelAgglomerativeLearningGFMM(theta=0.1, min_simil=0, simil_measure='long')

[34]:

model_comb_bagging_class_balanced = ModelCombinationBagging(base_estimator=base_estimator, model_level_estimator=model_level_estimator, n_estimators=n_estimators, max_samples=max_samples, bootstrap=bootstrap, class_balanced=class_balanced, n_jobs=n_jobs, random_state=0)
model_comb_bagging_class_balanced.fit(Xtr, ytr)

[34]:

ModelCombinationBagging(base_estimator=OnlineGFMM(C=array([], dtype=float64),
                                                  V=array([], dtype=float64),
                                                  W=array([], dtype=float64),
                                                  theta=0.1),
                        class_balanced=True,
                        model_level_estimator=AccelAgglomerativeLearningGFMM(min_simil=0,
                                                                             simil_measure='long',
                                                                             theta=0.1),
                        n_estimators=20, n_jobs=4, random_state=0)

[35]:

print("Training time: %.3f (s)"%(model_comb_bagging_class_balanced.elapsed_training_time))

Training time: 16.955 (s)

[36]:

print('Total number of hyperboxes in all base learners = %d'%model_comb_bagging_class_balanced.get_n_hyperboxes())

Total number of hyperboxes in all base learners = 4010

[37]:

print('Number of hyperboxes in the combined model = %d'%model_comb_bagging_class_balanced.get_n_hyperboxes_comb_model())

Number of hyperboxes in the combined model = 400

Prediction

Using majority voting from predicted results of all base learners

[38]:

y_pred_voting = model_comb_bagging_class_balanced.predict_voting(X_test)

[39]:

acc_voting = accuracy_score(y_test, y_pred_voting)
print(f'Testing accuracy using voting of decisions from base learners = {acc_voting * 100 : .2f}%')

Testing accuracy using voting of decisions from base learners =  92.11%

Using the final combined single model to make predictions

[40]:

y_pred = model_comb_bagging_class_balanced.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f'Testing accuracy of the combined model = {acc * 100: .2f}%')

Testing accuracy of the combined model =  92.98%

Apply pruning for the final combined model

[41]:

acc_threshold = 0.5 # minimum accuracy a hyperbox needs to survive pruning
keep_empty_boxes = False # False: hyperboxes that take part in no prediction during pruning are also removed
model_comb_bagging_class_balanced.simple_pruning(X_val, y_val, acc_threshold, keep_empty_boxes)

[41]:

ModelCombinationBagging(base_estimator=OnlineGFMM(C=array([], dtype=float64),
                                                  V=array([], dtype=float64),
                                                  W=array([], dtype=float64),
                                                  theta=0.1),
                        class_balanced=True,
                        model_level_estimator=AccelAgglomerativeLearningGFMM(min_simil=0,
                                                                             simil_measure='long',
                                                                             theta=0.1),
                        n_estimators=20, n_jobs=4, random_state=0)

[42]:

print('Number of hyperboxes of the combined single model after pruning = %d'%model_comb_bagging_class_balanced.get_n_hyperboxes_comb_model())

Number of hyperboxes of the combined single model after pruning = 392

Prediction after doing a pruning procedure for the combined single model

[43]:

y_pred_2 = model_comb_bagging_class_balanced.predict(X_test)
acc_pruned = accuracy_score(y_test, y_pred_2)
print(f'Testing accuracy after pruning the final model = {acc_pruned * 100: .2f}%')

Testing accuracy after pruning the final model =  94.74%

b. Training with pruning for base learners

[44]:

model_comb_bagging_class_balanced_base_learner_pruning = ModelCombinationBagging(base_estimator=base_estimator, model_level_estimator=model_level_estimator, n_estimators=n_estimators, max_samples=max_samples, bootstrap=bootstrap, class_balanced=class_balanced, n_jobs=n_jobs, random_state=0)
model_comb_bagging_class_balanced_base_learner_pruning.fit(Xtr, ytr, is_pruning_base_learners=True, X_val=X_val, y_val=y_val, acc_threshold=acc_threshold, keep_empty_boxes=keep_empty_boxes)

[44]:

ModelCombinationBagging(base_estimator=OnlineGFMM(C=array([], dtype=float64),
                                                  V=array([], dtype=float64),
                                                  W=array([], dtype=float64),
                                                  theta=0.1),
                        class_balanced=True,
                        model_level_estimator=AccelAgglomerativeLearningGFMM(min_simil=0,
                                                                             simil_measure='long',
                                                                             theta=0.1),
                        n_estimators=20, n_jobs=4, random_state=0)

[45]:

print("Training time: %.3f (s)"%(model_comb_bagging_class_balanced_base_learner_pruning.elapsed_training_time))

Training time: 7.264 (s)

[46]:

print('Total number of hyperboxes in all base learners = %d'%model_comb_bagging_class_balanced_base_learner_pruning.get_n_hyperboxes())

Total number of hyperboxes in all base learners = 2738

[47]:

print('Number of hyperboxes in the combined model = %d'%model_comb_bagging_class_balanced_base_learner_pruning.get_n_hyperboxes_comb_model())

Number of hyperboxes in the combined model = 395

Prediction

Using majority voting from predicted results of all base learners

[48]:

y_pred_voting = model_comb_bagging_class_balanced_base_learner_pruning.predict_voting(X_test)

[49]:

acc_voting = accuracy_score(y_test, y_pred_voting)
print(f'Testing accuracy using voting of decisions from base learners = {acc_voting * 100 : .2f}%')

Testing accuracy using voting of decisions from base learners =  94.74%

Using the final combined single model to make predictions

[50]:

y_pred = model_comb_bagging_class_balanced_base_learner_pruning.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f'Testing accuracy of the combined model = {acc * 100: .2f}%')

Testing accuracy of the combined model =  94.74%

Apply pruning for the final combined model

[51]:

acc_threshold = 0.5 # minimum accuracy a hyperbox needs to survive pruning
keep_empty_boxes = False # False: hyperboxes that take part in no prediction during pruning are also removed
model_comb_bagging_class_balanced_base_learner_pruning.simple_pruning(X_val, y_val, acc_threshold, keep_empty_boxes)

[51]:

ModelCombinationBagging(base_estimator=OnlineGFMM(C=array([], dtype=float64),
                                                  V=array([], dtype=float64),
                                                  W=array([], dtype=float64),
                                                  theta=0.1),
                        class_balanced=True,
                        model_level_estimator=AccelAgglomerativeLearningGFMM(min_simil=0,
                                                                             simil_measure='long',
                                                                             theta=0.1),
                        n_estimators=20, n_jobs=4, random_state=0)

[52]:

print('Number of hyperboxes of the combined single model after pruning = %d'%model_comb_bagging_class_balanced_base_learner_pruning.get_n_hyperboxes_comb_model())

Number of hyperboxes of the combined single model after pruning = 100

Prediction after doing a pruning procedure for the combined single model

[53]:

y_pred_2 = model_comb_bagging_class_balanced_base_learner_pruning.predict(X_test)
acc_pruned = accuracy_score(y_test, y_pred_2)
print(f'Testing accuracy after pruning the final model = {acc_pruned * 100: .2f}%')

Testing accuracy after pruning the final model =  94.74%
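To compare the four configurations side by side, the short script below collects the figures printed in the cells above and reports how much the model-level combination shrinks the ensemble. The dictionary keys (`1a`, `1b`, `2a`, `2b`) are labels chosen here to match the section numbering; the numbers are copied from the outputs of this example and are not recomputed.

```python
# (hyperboxes in all base learners, hyperboxes in combined model,
#  hyperboxes after pruning the combined model,
#  voting accuracy %, combined-model accuracy %, pruned-model accuracy %)
reported = {
    "1a": (3948, 401, 393, 93.86, 92.98, 94.74),
    "1b": (2195, 388, 383, 95.61, 94.74, 94.74),
    "2a": (4010, 400, 392, 92.11, 92.98, 94.74),
    "2b": (2738, 395, 100, 94.74, 94.74, 94.74),
}

for name, (n_base, n_comb, n_pruned, acc_vote, acc_comb, acc_pruned) in reported.items():
    share = 100.0 * n_comb / n_base  # size of the combined model relative to the ensemble
    print(
        f"{name}: {n_base} -> {n_comb} hyperboxes ({share:.1f}% of the ensemble), "
        f"{n_pruned} after pruning; "
        f"accuracy {acc_vote:.2f} / {acc_comb:.2f} / {acc_pruned:.2f} %"
    )
```

In every configuration the combined model holds far fewer hyperboxes than the base learners together, and after pruning it reaches 94.74% testing accuracy in all four runs reported here.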
