Integration of Ensemble Models with Sklearn Pipeline

This example shows how to integrate the random hyperboxes classifier into the Pipeline class implemented by scikit-learn.

Note that this example uses the random hyperboxes model with the original online learning algorithm for training the base learners. However, the same approach can be used for any ensemble model of GFMM classifiers trained with other learning algorithms.

[1]:

import warnings
warnings.filterwarnings('ignore')
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from hbbrain.numerical_data.incremental_learner.onln_gfmm import OnlineGFMM
from hbbrain.numerical_data.ensemble_learner.random_hyperboxes import RandomHyperboxesClassifier

Load dataset.

This example will use the breast cancer dataset in sklearn for illustration purposes.

[2]:

from sklearn.datasets import load_breast_cancer

[3]:

df = load_breast_cancer()
X = df.data
y = df.target

[4]:

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=0)

Create a pipeline consisting of a pre-processing step (normalization of the data to the range [0, 1]) and a random hyperboxes model.

Note: GFMM classifiers require input data in the range [0, 1].
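
The range requirement above deserves one caveat: MinMaxScaler learns its minimum and maximum from the training data only, so unseen samples can be mapped slightly outside [0, 1]. The sketch below illustrates this on toy arrays (the names X_train_toy and X_test_toy are made up for this illustration) and shows the scaler's clip=True option, which truncates transformed values to the feature range. Whether clipping is needed for a given GFMM model is an assumption to verify; this example does not cover how the classifier treats out-of-range inputs.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data (hypothetical): the test set contains values outside the training range.
X_train_toy = np.array([[1.0], [2.0], [3.0]])
X_test_toy = np.array([[0.0], [2.0], [4.0]])

# Default scaler: the mapping is learned from the training data only,
# so test values below 1.0 or above 3.0 fall outside [0, 1].
plain = MinMaxScaler().fit(X_train_toy)
scaled_plain = plain.transform(X_test_toy)

# clip=True truncates transformed values to the feature range [0, 1].
clipped = MinMaxScaler(clip=True).fit(X_train_toy)
scaled_clipped = clipped.transform(X_test_toy)

print(scaled_plain.ravel())    # values outside [0, 1] are possible
print(scaled_clipped.ravel())  # values are limited to [0, 1]
```

If clipping is wanted in the pipeline of this example, MinMaxScaler(clip=True) could be passed as the 'scaler' step in place of MinMaxScaler().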

[5]:

theta = 0.1
theta_min = 0.1
base_estimator = OnlineGFMM(theta=theta, theta_min=theta_min)
n_estimators = 50
max_samples = 0.5
max_features = 0.5
class_balanced = False
feature_balanced = False
n_jobs = 4
# Initialize the classifier
rh_clf = RandomHyperboxesClassifier(
    base_estimator=base_estimator, n_estimators=n_estimators,
    max_samples=max_samples, max_features=max_features,
    class_balanced=class_balanced, feature_balanced=feature_balanced,
    n_jobs=n_jobs, random_state=0)

[6]:

# Create a pipeline that chains data pre-processing and the classifier
pipe = Pipeline([('scaler', MinMaxScaler()), ('rh_clf', rh_clf)])
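
Once the steps are wrapped in a Pipeline, each step can be reached by name and its parameters can be changed through the '<step>__<parameter>' naming convention of scikit-learn. The sketch below is a minimal illustration that uses scikit-learn's BaggingClassifier as a stand-in ensemble so that it runs without hbbrain; the names demo_pipe and 'clf' are made up for this illustration. The same convention should apply to the 'rh_clf' step, assuming the estimator follows the scikit-learn get_params/set_params interface.

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Stand-in ensemble (assumption): BaggingClassifier replaces
# RandomHyperboxesClassifier so the sketch runs without hbbrain.
demo_pipe = Pipeline([
    ('scaler', MinMaxScaler()),
    ('clf', BaggingClassifier(n_estimators=10, random_state=0)),
])

# Steps are reachable by the names given when building the pipeline.
print(type(demo_pipe.named_steps['scaler']).__name__)

# Nested parameters are addressed as '<step>__<parameter>'.
demo_pipe.set_params(clf__n_estimators=20)
print(demo_pipe.get_params()['clf__n_estimators'])
```

This naming scheme is also what tools such as GridSearchCV expect in their parameter grids when the estimator being tuned is a pipeline.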

Training

[7]:

pipe.fit(X_train, y_train)

[7]:

Pipeline(steps=[('scaler', MinMaxScaler()),
                ('rh_clf',
                 RandomHyperboxesClassifier(base_estimator=OnlineGFMM(C=array([], dtype=float64),
                                                                      V=array([], dtype=float64),
                                                                      W=array([], dtype=float64),
                                                                      theta=0.1,
                                                                      theta_min=0.1),
                                            max_features=0.5, n_estimators=50,
                                            n_jobs=4, random_state=0))])

Prediction

[8]:

acc = pipe.score(X_test, y_test)
print(f'Testing accuracy = {acc * 100: .2f}%')

Testing accuracy =  96.49%
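
The score method above reports mean accuracy only. When per-class behaviour matters, the predictions themselves can be inspected. The sketch below shows one way to do this with predict and a confusion matrix. It uses scikit-learn's BaggingClassifier as a stand-in ensemble so that it runs without hbbrain, so its numbers will differ from the accuracy reported above; the name eval_pipe is made up for this illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Same data and split as in the example above.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0)

# Stand-in ensemble (assumption): BaggingClassifier replaces
# RandomHyperboxesClassifier so the sketch runs without hbbrain.
eval_pipe = Pipeline([
    ('scaler', MinMaxScaler()),
    ('clf', BaggingClassifier(n_estimators=10, random_state=0)),
])
eval_pipe.fit(X_train, y_train)

# predict() applies the fitted scaler, then the classifier.
y_pred = eval_pipe.predict(X_test)

# Accuracy computed from the predictions matches pipe.score().
acc_from_pred = accuracy_score(y_test, y_pred)
acc_from_score = eval_pipe.score(X_test, y_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)
```

With the pipeline from this example, the same calls would be made on pipe instead of eval_pipe.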