Online Learning Algorithm for GFMM using One-hot Encoding for Categorical Features

This example shows how to use the general fuzzy min-max (GFMM) neural network trained by the online (incremental) learning algorithm, in which categorical features are encoded using one-hot encoding.

Note that the numerical features in the training and testing datasets must lie in the range [0, 1], because GFMM classifiers require continuous features in the unit cube. Continuous features therefore need to be normalised before training. Categorical features need no preprocessing: the OneHotOnlineGFMM classifier applies one-hot encoding to the categorical values itself.
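
The normalisation step mentioned above can be sketched as follows. This is only a minimal illustration and not part of the library: the helper name min_max_scale and the toy arrays are hypothetical. It assumes the scaling statistics come from the training set only, and that test values falling outside the training range are clipped back into [0, 1].

```python
import numpy as np

def min_max_scale(X_train, X_test, continuous_idx):
    """Scale the listed continuous columns to [0, 1] using training-set statistics.

    Hypothetical helper for illustration; categorical columns are left untouched.
    """
    X_train = np.array(X_train, dtype=object)  # np.array copies, inputs stay unchanged
    X_test = np.array(X_test, dtype=object)
    train_cont = X_train[:, continuous_idx].astype(float)
    test_cont = X_test[:, continuous_idx].astype(float)
    col_min = train_cont.min(axis=0)
    col_max = train_cont.max(axis=0)
    # Avoid division by zero for constant columns
    span = np.where(col_max > col_min, col_max - col_min, 1.0)
    X_train[:, continuous_idx] = (train_cont - col_min) / span
    # Clip test values that fall outside the range seen during training
    X_test[:, continuous_idx] = np.clip((test_cont - col_min) / span, 0.0, 1.0)
    return X_train, X_test

# Toy data: column 0 is categorical, columns 1 and 2 are continuous
Xtr_demo = np.array([["a", 10.0, 1.0], ["b", 20.0, 3.0], ["a", 30.0, 5.0]], dtype=object)
Xte_demo = np.array([["b", 40.0, 2.0]], dtype=object)
tr, te = min_max_scale(Xtr_demo, Xte_demo, [1, 2])
print(tr)
print(te)
```

Fitting the minimum and maximum on the training set alone keeps information from the test set out of the preprocessing step.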

1. Execute directly from the python file

[1]:

import os
import warnings
warnings.filterwarnings('ignore')
from sklearn.metrics import accuracy_score
import pandas as pd

Get the path to the folder containing this Jupyter notebook file

[2]:

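# __file__ is not defined inside a notebook, so the quoted name is resolved against the current working directory
this_notebook_dir = os.path.dirname(os.path.abspath("__file__"))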
this_notebook_dir

[2]:

'C:\\hyperbox-brain\\examples\\mixed_data'

Get the home folder of the Hyperbox-Brain project

[3]:

from pathlib import Path
project_dir = Path(this_notebook_dir).parent.parent
project_dir

[3]:

WindowsPath('C:/hyperbox-brain')

Create the path to the Python file that implements the GFMM classifier trained by the online learning algorithm, with one-hot encoding for the categorical features of mixed-attribute data

[4]:

onehot_gfmm_file_path = os.path.join(project_dir, Path("hbbrain/mixed_data/onehot_onln_gfmm.py"))
onehot_gfmm_file_path

[4]:

'C:\\hyperbox-brain\\hbbrain\\mixed_data\\onehot_onln_gfmm.py'

Run this file with the -h option to show its usage instructions

[5]:

!python "{onehot_gfmm_file_path}" -h

usage: onehot_onln_gfmm.py [-h] -training_file TRAINING_FILE -testing_file
                           TESTING_FILE -categorical_features
                           CATEGORICAL_FEATURES [--theta THETA]
                           [--theta_min THETA_MIN]
                           [--min_percent_overlap_cat MIN_PERCENT_OVERLAP_CAT]
                           [--gamma GAMMA] [--alpha ALPHA]

The description of parameters

required arguments:
  -training_file TRAINING_FILE
                        A required argument for the path to training data file
                        (including file name)
  -testing_file TESTING_FILE
                        A required argument for the path to testing data file
                        (including file name)
  -categorical_features CATEGORICAL_FEATURES
                        Indices of categorical features

optional arguments:
  --theta THETA         Maximum hyperbox size (in the range of (0, 1])
                        (default: 0.5)
  --theta_min THETA_MIN
                        Mimimum value of the maximum hyperbox size to escape
                        the training loop (in the range of (0, 1]) (default:
                        0.5)
  --min_percent_overlap_cat MIN_PERCENT_OVERLAP_CAT
                        Mimimum rate of numbers of categorical features
                        overlapped for hyperbox expansion (default: 0.5)
  --gamma GAMMA         A sensitivity parameter describing the speed of
                        decreasing of the membership function in each
                        continous dimension (larger than 0) (default: 1)
  --alpha ALPHA         Multiplier showing the decrease of theta in each step
                        (default: 0.9)
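
The interplay of theta, theta_min, and alpha described in the help text can be illustrated with a short sketch. It assumes that each training pass uses the current theta and that theta is multiplied by alpha until it drops below theta_min. The actual stopping rule inside onehot_onln_gfmm.py is not shown here and may include further conditions, and the function name theta_schedule is hypothetical.

```python
def theta_schedule(theta=0.5, theta_min=0.5, alpha=0.9):
    """Return the maximum hyperbox sizes tried, one per assumed training pass."""
    if not 0 < theta_min <= theta <= 1:
        raise ValueError("expected 0 < theta_min <= theta <= 1")
    if not 0 < alpha < 1:
        raise ValueError("expected 0 < alpha < 1, otherwise theta never shrinks")
    schedule = []
    while theta >= theta_min:
        schedule.append(theta)
        theta *= alpha  # shrink the maximum hyperbox size for the next pass
    return schedule

print(theta_schedule(theta=0.1, theta_min=0.1))  # a single pass
print(len(theta_schedule(theta=0.5, theta_min=0.3, alpha=0.9)))
```

Under this assumption, setting theta_min equal to theta gives exactly one pass over the training data.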

Create the path to mixed-attribute training and testing datasets stored in the dataset folder.

This example uses the japanese_credit dataset for illustration purposes. The continuous features in this dataset were normalised into the range of [0, 1], while categorical features were kept unchanged.

[6]:

training_data_file = os.path.join(project_dir, Path("dataset/japanese_credit_train.csv"))
training_data_file

[6]:

'C:\\hyperbox-brain\\dataset\\japanese_credit_train.csv'

[7]:

testing_data_file = os.path.join(project_dir, Path("dataset/japanese_credit_test.csv"))
testing_data_file

[7]:

'C:\\hyperbox-brain\\dataset\\japanese_credit_test.csv'

Run a demo program

[8]:

!python "{onehot_gfmm_file_path}" -training_file "{training_data_file}" -testing_file "{testing_data_file}" -categorical_features "[0, 3, 4, 5, 6, 8, 9, 11,12]" --theta 0.1 --theta_min 0.1 --min_percent_overlap_cat 0.6 --gamma 1

Number of hyperboxes = 166
Testing accuracy =  67.94%

2. Using the OneHotOnlineGFMM algorithm to train a GFMM classifier for mixed-attribute data through its init, fit, and predict functions

[9]:

from hbbrain.mixed_data.onehot_onln_gfmm import OneHotOnlineGFMM
import pandas as pd

Create the mixed-attribute training, validation, and testing datasets.

As in the first part, this example uses the japanese_credit dataset for illustration purposes. The continuous features in this dataset were normalised into the range of [0, 1], while categorical features were kept unchanged.

[10]:

df_train = pd.read_csv(training_data_file, header=None)
df_test = pd.read_csv(testing_data_file, header=None)

Xy_train = df_train.to_numpy()
Xy_test = df_test.to_numpy()

Xtr = Xy_train[:, :-1]
ytr = Xy_train[:, -1].astype(int)

Xtest = Xy_test[:, :-1]
ytest = Xy_test[:, -1].astype(int)

[11]:

val_data_file = os.path.join(project_dir, Path("dataset/japanese_credit_val.csv"))
df_val = pd.read_csv(val_data_file, header=None)
Xy_val = df_val.to_numpy()
Xval = Xy_val[:, :-1]
yval = Xy_val[:, -1].astype(int)

Initialise the parameters

[12]:

theta = 0.1 # maximum hyperbox size for continuous features
theta_min = 0.1 # equal to theta, so only one training loop is performed
min_percent_overlap_cat = 0.6 # minimum rate of categorical features that must overlap for hyperbox expansion
gamma = 1 # sensitivity parameter: how quickly the membership value decreases in each continuous dimension
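
To make the role of gamma concrete, the sketch below implements the membership function commonly given for GFMM in the literature, restricted to continuous features. It is an illustration under that assumption only: how OneHotOnlineGFMM combines this value with the categorical part is not shown in this example, and the names ramp and membership_continuous are hypothetical.

```python
import numpy as np

def ramp(r, gamma):
    """Ramp threshold function: 0 for r*gamma < 0, r*gamma on [0, 1], 1 above."""
    return np.clip(r * gamma, 0.0, 1.0)

def membership_continuous(x, v, w, gamma=1.0):
    """Membership of point x in the hyperbox with minimum point v and maximum point w."""
    x, v, w = (np.asarray(a, dtype=float) for a in (x, v, w))
    above = 1.0 - ramp(x - w, gamma)  # penalty for exceeding the maximum point
    below = 1.0 - ramp(v - x, gamma)  # penalty for falling under the minimum point
    return float(np.min(np.minimum(above, below)))

v_demo, w_demo = [0.1, 0.4], [0.2, 0.6]
print(membership_continuous([0.15, 0.5], v_demo, w_demo))          # inside the box
print(membership_continuous([0.3, 0.5], v_demo, w_demo, gamma=1))  # 0.1 outside in one dimension
print(membership_continuous([0.3, 0.5], v_demo, w_demo, gamma=4))  # larger gamma, faster decay
```

A point inside the hyperbox has full membership, and a larger gamma makes the membership fall off faster with distance from the box.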

Indicate the indices of categorical features in the training data

[13]:

categorical_features = [0, 3, 4, 5, 6, 8, 9, 11, 12]
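
Before training, it can be useful to derive the continuous feature indices as the complement of the categorical ones and to confirm that those columns lie in [0, 1]. The helpers below are a hypothetical sketch, not library functions. The feature count of 15 is inferred from this example, where each hyperbox holds nine categorical arrays and six continuous values.

```python
import numpy as np

def continuous_indices(n_features, categorical_features):
    """Indices of all features that are not listed as categorical."""
    cat = set(categorical_features)
    return [i for i in range(n_features) if i not in cat]

def check_unit_range(X, cont_idx):
    """True if every value in the continuous columns lies in [0, 1]."""
    values = np.asarray(X, dtype=object)[:, cont_idx].astype(float)
    return bool(np.all((values >= 0.0) & (values <= 1.0)))

cat_idx_demo = [0, 3, 4, 5, 6, 8, 9, 11, 12]
cont_idx_demo = continuous_indices(15, cat_idx_demo)
print(cont_idx_demo)

# Toy check on a two-feature array whose column 1 is continuous
X_ok = np.array([["a", 0.2], ["b", 1.0]], dtype=object)
X_bad = np.array([["a", 0.2], ["b", 1.5]], dtype=object)
print(check_unit_range(X_ok, [1]), check_unit_range(X_bad, [1]))
```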

Training

[14]:

onehot_onln_gfmm_clf = OneHotOnlineGFMM(theta=theta, theta_min=theta_min, min_percent_overlap_cat=min_percent_overlap_cat, gamma=gamma)
onehot_onln_gfmm_clf.fit(Xtr, ytr, categorical_features)

[14]:

OneHotOnlineGFMM(C=array([0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1,
       1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0,
       1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1,
       0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0,
       0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0,
       1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1,
       0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 0, 0, 1,...
        1.00000000e-01, 4.00000000e-04],
       [2.67142857e-01, 3.80892857e-01, 2.98245614e-03, 1.79104478e-01,
        6.45000000e-02, 3.00000000e-05],
       [7.48730159e-01, 1.78571429e-01, 1.40350877e-01, 5.97014925e-02,
        0.00000000e+00, 9.90000000e-04],
       [5.38412698e-01, 1.03571429e-02, 5.26315789e-01, 2.98507463e-01,
        0.00000000e+00, 1.50000000e-01]]),
                 min_percent_overlap_cat=0.6, theta=0.1, theta_min=0.1)

[15]:

print('Number of hyperboxes = %d'%onehot_onln_gfmm_clf.get_n_hyperboxes())

Number of hyperboxes = 166

[16]:

print("Training time: %.3f (s)"%onehot_onln_gfmm_clf.elapsed_training_time)

Training time: 1.326 (s)

Prediction

[17]:

y_pred = onehot_onln_gfmm_clf.predict(Xtest)
acc = accuracy_score(ytest, y_pred)
print(f'Accuracy = {acc * 100: .2f}%')

Accuracy =  67.94%

Explaining the predicted result for the input sample by showing membership values and hyperboxes for each class

[18]:

sample_need_explain = 10 # index of the test sample to explain
# For each class, get the membership value and the bounds of the selected hyperbox
y_pred_sample, mem_val_classes, min_points_classes, max_points_classes, cat_points_classes = onehot_onln_gfmm_clf.get_sample_explanation(Xtest[sample_need_explain])
print("Explain samples:")
print("Membership values for classes: ", mem_val_classes)
print("Predicted class = ", y_pred_sample)
print("Minimum points of the selected hyperbox for each class: ", min_points_classes)
print("Maximum points of the selected hyperbox for each class: ", max_points_classes)
print("Categorical features of the selected hyperbox for each class: ", cat_points_classes)

Explain samples:
Membership values for classes:  {0: 0.9315476190476191, 1: 0.8133333333333332}
Predicted class =  0
Minimum points of the selected hyperbox for each class:  {0: array([0.05031746, 0.00892857, 0.        , 0.00094284, 0.05      ,
       0.01286   ]), 1: array([0.21031746, 0.05053571, 0.00140351, 0.        , 0.12      ,
       0.00050875])}
Maximum points of the selected hyperbox for each class:  {0: array([0.14880952, 0.10714286, 0.0877193 , 0.07462687, 0.14      ,
       0.04208   ]), 1: array([0.29095238, 0.14285714, 0.0877193 , 0.01492537, 0.1965    ,
       0.0033975 ])}
Categorical features of the selected hyperbox for each class:  {0: array([array([ True,  True]), array([False,  True,  True]),
       array([ True, False,  True]),
       array([ True,  True,  True, False,  True,  True,  True,  True,  True,
       False,  True,  True,  True, False]),
       array([False,  True,  True,  True,  True,  True, False,  True, False]),
       array([ True,  True]), array([ True,  True]),
       array([ True,  True]), array([ True, False,  True])], dtype=object), 1: array([array([ True,  True]), array([False,  True,  True]),
       array([ True, False,  True]),
       array([False,  True, False, False,  True, False, False, False, False,
        True, False, False,  True, False]),
       array([False, False, False,  True, False, False, False,  True, False]),
       array([False,  True]), array([ True,  True]),
       array([ True,  True]), array([ True, False,  True])], dtype=object)}
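
The Boolean arrays in the output above hold one array per categorical feature, with one entry per category. One plausible reading, stated here as an assumption and not taken from the library source, is that a True entry marks a category the hyperbox has absorbed, and that min_percent_overlap_cat compares the fraction of categorical features on which a sample and a hyperbox share a category against a threshold. The sketch below illustrates that reading with hypothetical helper names and toy categories.

```python
import numpy as np

def one_hot(value, categories):
    """Boolean one-hot vector for a single categorical value."""
    return np.array([value == c for c in categories], dtype=bool)

def expand_categorical(box_bits, sample_bits):
    """Union of the categories covered by the box and those of the sample."""
    return [b | s for b, s in zip(box_bits, sample_bits)]

def overlap_rate(box_bits, sample_bits):
    """Fraction of categorical features where box and sample share a category."""
    hits = [bool(np.any(b & s)) for b, s in zip(box_bits, sample_bits)]
    return sum(hits) / len(hits)

# Toy data with two categorical features
cats_demo = [["a", "b"], ["x", "y", "z"]]
box_demo = [one_hot(v, c) for v, c in zip(("a", "y"), cats_demo)]
sample_demo = [one_hot(v, c) for v, c in zip(("b", "y"), cats_demo)]

rate_demo = overlap_rate(box_demo, sample_demo)
print(rate_demo)  # only the second feature matches
print(rate_demo >= 0.6)  # would not qualify for expansion at a 0.6 threshold
print(expand_categorical(box_demo, sample_demo))
```

Under this reading, an array with many True entries, such as the 14-element array for class 0 above, means that the hyperbox covers most categories of that feature.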

Apply pruning to the trained classifier

[19]:

acc_threshold = 0.5 # minimum accuracy on the validation set for a hyperbox to be retained
keep_empty_boxes = False # discard hyperboxes that take no part in predicting the validation set
onehot_onln_gfmm_clf.simple_pruning(Xval, yval, acc_threshold, keep_empty_boxes)

[19]:

OneHotOnlineGFMM(C=array([0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1,
       0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 1, 1]...
        1.00000000e-01, 4.00000000e-04],
       [2.67142857e-01, 3.80892857e-01, 2.98245614e-03, 1.79104478e-01,
        6.45000000e-02, 3.00000000e-05],
       [7.48730159e-01, 1.78571429e-01, 1.40350877e-01, 5.97014925e-02,
        0.00000000e+00, 9.90000000e-04],
       [5.38412698e-01, 1.03571429e-02, 5.26315789e-01, 2.98507463e-01,
        0.00000000e+00, 1.50000000e-01]]),
                 min_percent_overlap_cat=0.6, theta=0.1, theta_min=0.1)

[20]:

print('Number of hyperboxes after pruning = %d'%onehot_onln_gfmm_clf.get_n_hyperboxes())

Number of hyperboxes after pruning = 162
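
The idea behind accuracy-based pruning can be sketched as follows. This is a simplified illustration under stated assumptions, not the internals of simple_pruning: it assumes that each validation sample is attributed to the single hyperbox that produced its prediction, and the function name select_boxes_to_keep and the toy inputs are hypothetical.

```python
import numpy as np

def select_boxes_to_keep(winner_idx, y_true, y_pred, n_boxes,
                         acc_threshold=0.5, keep_empty_boxes=False):
    """Boolean mask of hyperboxes to retain, based on per-box validation accuracy."""
    winner_idx = np.asarray(winner_idx)
    correct = (np.asarray(y_true) == np.asarray(y_pred)).astype(float)
    used = np.bincount(winner_idx, minlength=n_boxes)                    # samples decided by each box
    right = np.bincount(winner_idx, weights=correct, minlength=n_boxes)  # correct decisions per box
    # Boxes that decided no validation sample follow keep_empty_boxes
    keep = np.full(n_boxes, keep_empty_boxes, dtype=bool)
    mask = used > 0
    keep[mask] = (right[mask] / used[mask]) >= acc_threshold
    return keep

# Toy validation run over four hyperboxes; box 2 never wins
keep_demo = select_boxes_to_keep(
    winner_idx=[0, 0, 1, 1, 1, 3],
    y_true=[0, 0, 1, 1, 1, 0],
    y_pred=[0, 1, 0, 0, 1, 0],
    n_boxes=4,
)
print(keep_demo)
```

In this toy run, box 1 is dropped for low accuracy and box 2 is dropped because it was never used, which mirrors the roles of acc_threshold and keep_empty_boxes in the cell above.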

Make predictions after pruning

[21]:

y_pred = onehot_onln_gfmm_clf.predict(Xtest)
acc = accuracy_score(ytest, y_pred)
print(f'Accuracy after pruning = {acc * 100: .2f}%')

Accuracy after pruning =  69.47%