Extended Improved Online Learning Algorithm with Mixed-Attribute Data for GFMM

This example shows how to use the general fuzzy min-max neural network trained by the extended improved online learning algorithm for mixed-attribute data (EIOL-GFMM).

Note that the numerical features in the training and testing datasets must lie in the range [0, 1] because GFMM classifiers require features in the unit cube. Therefore, continuous features need to be normalised before training. Categorical features need no preprocessing, as EIOL-GFMM does not require any categorical feature encoding.
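The normalisation requirement can be illustrated with a minimal sketch. The helper `minmax_scale_continuous` and the toy array below are hypothetical and not part of hbbrain; the sketch assumes plain min-max scaling is acceptable for your data and that the categorical column indices are known in advance.

```python
import numpy as np

def minmax_scale_continuous(X, categorical_features):
    """Scale continuous columns to [0, 1]; leave categorical columns untouched.

    Hypothetical helper for illustration only. In practice the minimum and
    maximum should be computed on the training data and reused for the
    validation and testing data.
    """
    X = np.array(X, dtype=object)  # np.array copies, so the input is not modified
    cat = set(categorical_features)
    for j in range(X.shape[1]):
        if j in cat:
            continue
        col = X[:, j].astype(float)
        lo, hi = col.min(), col.max()
        if hi > lo:
            X[:, j] = (col - lo) / (hi - lo)
        else:
            X[:, j] = np.zeros_like(col)  # constant column: map everything to 0
    return X

# Toy data: column 0 is categorical, columns 1 and 2 are continuous
X_raw = np.array([["a", 10.0, 200.0],
                  ["b", 20.0, 400.0],
                  ["a", 15.0, 300.0]], dtype=object)
X_scaled = minmax_scale_continuous(X_raw, categorical_features=[0])
print(X_scaled)
```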

1. Execute directly from the Python file

[1]:

import os
import warnings
warnings.filterwarnings('ignore')
from sklearn.metrics import accuracy_score

Get the path to this Jupyter notebook file

[2]:

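# "__file__" is a string literal here (notebooks do not define __file__), so this resolves to the current working directory
this_notebook_dir = os.path.dirname(os.path.abspath("__file__"))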
this_notebook_dir

[2]:

'C:\\hyperbox-brain\\examples\\mixed_data'

Get the home folder of the Hyperbox-Brain project

[3]:

from pathlib import Path
project_dir = Path(this_notebook_dir).parent.parent
project_dir

[3]:

WindowsPath('C:/hyperbox-brain')

Create the path to the Python file containing the implementation of the GFMM classifier using the extended improved online learning algorithm for mixed attribute data

[4]:

eiol_gfmm_file_path = os.path.join(project_dir, Path("hbbrain/mixed_data/eiol_gfmm.py"))
eiol_gfmm_file_path

[4]:

'C:\\hyperbox-brain\\hbbrain\\mixed_data\\eiol_gfmm.py'

Run the file with the -h flag to show its usage instructions

[5]:

!python "{eiol_gfmm_file_path}" -h

usage: eiol_gfmm.py [-h] -training_file TRAINING_FILE -testing_file
                    TESTING_FILE -categorical_features CATEGORICAL_FEATURES
                    [--theta THETA] [--delta DELTA] [--gamma GAMMA]
                    [--alpha ALPHA]

The description of parameters

required arguments:
  -training_file TRAINING_FILE
                        A required argument for the path to training data file
                        (including file name)
  -testing_file TESTING_FILE
                        A required argument for the path to testing data file
                        (including file name)
  -categorical_features CATEGORICAL_FEATURES
                        Indices of categorical features

optional arguments:
  --theta THETA         Maximum hyperbox size (in the range of (0, 1])
                        (default: 0.5)
  --delta DELTA         Maximum changing entropy for categorical features (in
                        the range of (0, 1]) (default: 0.5)
  --gamma GAMMA         A sensitivity parameter describing the speed of
                        decreasing of the membership function in each
                        continous dimension (larger than 0) (default: 1)
  --alpha ALPHA         The trade-off weighting factor between categorical
                        features and numerical features for membership values
                        (in the range of [0, 1]) (default: 0.5)

Create the path to mixed-attribute training and testing datasets stored in the dataset folder.

This example uses the japanese_credit dataset for illustration purposes. The continuous features in this dataset were normalised into the range of [0, 1], while categorical features were kept unchanged.

[6]:

training_data_file = os.path.join(project_dir, Path("dataset/japanese_credit_train.csv"))
training_data_file

[6]:

'C:\\hyperbox-brain\\dataset\\japanese_credit_train.csv'

[7]:

testing_data_file = os.path.join(project_dir, Path("dataset/japanese_credit_test.csv"))
testing_data_file

[7]:

'C:\\hyperbox-brain\\dataset\\japanese_credit_test.csv'

Run a demo program

[8]:

!python "{eiol_gfmm_file_path}" -training_file "{training_data_file}" -testing_file "{testing_data_file}" -categorical_features "[0, 3, 4, 5, 6, 8, 9, 11,12]" --theta 0.1 --delta 0.6 --gamma 1 --alpha 0.5

Number of hyperboxes = 378
Testing accuracy =  82.44%
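The -categorical_features argument above is passed as a quoted, list-like string. How eiol_gfmm.py parses it is not shown in this example; the following is only a sketch of one safe way to turn such a string into a list of integer indices, using a hypothetical helper `parse_feature_indices`.

```python
import ast

def parse_feature_indices(text):
    """Parse a string such as "[0, 3, 4]" into a list of integer indices.

    Hypothetical helper; the actual script may parse its argument differently.
    ast.literal_eval only evaluates Python literals, so it is safer than eval.
    """
    indices = ast.literal_eval(text)
    if not isinstance(indices, (list, tuple)) or not all(isinstance(i, int) for i in indices):
        raise ValueError("expected a list of integer feature indices")
    return list(indices)

cat_idx = parse_feature_indices("[0, 3, 4, 5, 6, 8, 9, 11,12]")
print(cat_idx)
```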

2. Using the EIOL-GFMM algorithm to train a GFMM classifier for mixed-attribute data through its init, fit, and predict functions

[9]:

from hbbrain.mixed_data.eiol_gfmm import ExtendedImprovedOnlineGFMM
import pandas as pd

Create mixed attribute training, validation, and testing data sets.

This example will use the japanese_credit dataset for illustration purposes. The continuous features in this dataset were normalised into the range of [0, 1], while categorical features were kept unchanged.

[10]:

df_train = pd.read_csv(training_data_file, header=None)
df_test = pd.read_csv(testing_data_file, header=None)

Xy_train = df_train.to_numpy()
Xy_test = df_test.to_numpy()

Xtr = Xy_train[:, :-1]
ytr = Xy_train[:, -1].astype(int)

Xtest = Xy_test[:, :-1]
ytest = Xy_test[:, -1].astype(int)

[11]:

val_data_file = os.path.join(project_dir, Path("dataset/japanese_credit_val.csv"))
df_val = pd.read_csv(val_data_file, header=None)
Xy_val = df_val.to_numpy()
Xval = Xy_val[:, :-1]
yval = Xy_val[:, -1].astype(int)
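Before training, it can be worth confirming that every continuous column really lies in [0, 1]. The sketch below uses a hypothetical helper `columns_outside_unit_range` and a toy array rather than the japanese_credit files; it assumes the data are held in a NumPy object array, as produced by `to_numpy()` on a mixed-type DataFrame.

```python
import numpy as np

def columns_outside_unit_range(X, categorical_features):
    """Return indices of continuous columns with values outside [0, 1].

    Hypothetical helper for illustration; not part of hbbrain.
    """
    cat = set(categorical_features)
    bad = []
    for j in range(X.shape[1]):
        if j in cat:
            continue  # categorical columns are not checked
        col = X[:, j].astype(float)
        if np.nanmin(col) < 0.0 or np.nanmax(col) > 1.0:
            bad.append(j)
    return bad

# Toy data: column 0 is categorical, column 2 contains an unnormalised value
X_toy = np.array([["a", 0.2, 1.5],
                  ["b", 0.9, 0.1]], dtype=object)
bad_cols = columns_outside_unit_range(X_toy, categorical_features=[0])
print("Columns needing normalisation:", bad_cols)
```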

Initializing parameters

[12]:

theta = 0.1 # maximum hyperbox size for continuous features
delta = 0.6 # maximum allowed increase in entropy for each categorical dimension after expansion
gamma = 1 # sensitivity parameter: how quickly membership values decrease for continuous features
alpha = 0.5 # trade-off between categorical and continuous features in the final membership value
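To give a feel for what gamma controls, here is a minimal sketch of the standard GFMM membership function for continuous features and a point input, written from the published definition rather than taken from the hbbrain source. The function names are hypothetical, and how alpha combines this value with the categorical part is not covered here.

```python
import numpy as np

def ramp(r, gamma):
    """Ramp threshold function: 0 if r*gamma < 0, r*gamma in [0, 1], 1 above."""
    return np.clip(r * gamma, 0.0, 1.0)

def continuous_membership(x, v, w, gamma=1.0):
    """Membership of point x in the hyperbox with min point v and max point w.

    Sketch of the usual GFMM formula for continuous dimensions only
    (an assumption about what the library computes internally).
    """
    x, v, w = (np.asarray(a, dtype=float) for a in (x, v, w))
    above_max = ramp(x - w, gamma)   # how far x exceeds the max point
    below_min = ramp(v - x, gamma)   # how far x falls below the min point
    return float(np.min(np.minimum(1.0 - above_max, 1.0 - below_min)))

v, w = [0.2, 0.2], [0.4, 0.4]
print(continuous_membership([0.3, 0.3], v, w))           # inside the hyperbox
print(continuous_membership([0.5, 0.3], v, w, gamma=1))  # 0.1 outside, slow decay
print(continuous_membership([0.5, 0.3], v, w, gamma=4))  # same point, faster decay
```

A larger gamma makes the membership value drop faster as the sample moves away from the hyperbox.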

Indicate the indices of categorical features in the training data

[13]:

categorical_features = [0, 3, 4, 5, 6, 8, 9, 11, 12]

a. Training the EIOL-GFMM algorithm with the categorical feature expansion condition (the maximum entropy change threshold) applied to every categorical dimension
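The per-dimension condition can be pictured with a small sketch. It assumes each categorical bound is a dictionary of value counts (as in the explanation outputs of this example) and uses unnormalised Shannon entropy in bits; the exact entropy formula and comparison used by the library may differ, and all function names are hypothetical.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a dictionary of value counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c > 0)

def entropy_change_after_adding(counts, value):
    """Change in entropy if one more occurrence of `value` joins the bound."""
    updated = dict(counts)
    updated[value] = updated.get(value, 0) + 1
    return entropy(updated) - entropy(counts)

def can_expand_every_dimension(cat_bounds, sample, delta):
    """Assumed reading of type_cat_expansion=0: each dimension must pass on its own."""
    return all(entropy_change_after_adding(b, v) <= delta
               for b, v in zip(cat_bounds, sample))

bounds = [{'a': 1}, {'u': 3}]
print(can_expand_every_dimension(bounds, ['a', 'u'], delta=0.6))  # no new values
print(can_expand_every_dimension(bounds, ['b', 'u'], delta=0.6))  # 'b' is new in dimension 0
```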

Training

[14]:

eiol_gfmm_clf = ExtendedImprovedOnlineGFMM(theta=theta, gamma=gamma, delta=delta, alpha=alpha)
eiol_gfmm_clf.fit(Xtr, ytr, categorical_features, type_cat_expansion=0)

[14]:

ExtendedImprovedOnlineGFMM(C=array([0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1,
       0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1,
       1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0,
       1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0,
       0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0,
       0, 1, 1,...
       [8.60317460e-02, 3.39285714e-01, 5.26315789e-02, 0.00000000e+00,
        6.00000000e-02, 2.20600000e-02],
       ...,
       [1.41587302e-01, 2.82142857e-02, 2.98245614e-03, 0.00000000e+00,
        7.20000000e-02, 0.00000000e+00],
       [6.93174603e-01, 3.03571429e-01, 2.45614035e-01, 4.47761194e-02,
        0.00000000e+00, 0.00000000e+00],
       [5.38412698e-01, 1.03571429e-02, 5.26315789e-01, 2.98507463e-01,
        0.00000000e+00, 1.50000000e-01]]),
                           delta=0.6, theta=0.1)

[15]:

print("Number of existing hyperboxes = %d"%(eiol_gfmm_clf.get_n_hyperboxes()))

Number of existing hyperboxes = 378

[16]:

print("Training time: %.3f (s)"%eiol_gfmm_clf.elapsed_training_time)

Training time: 0.991 (s)

Prediction

[17]:

from hbbrain.constants import MANHATTAN_DIS, PROBABILITY_MEASURE

Predict the class label for input samples using a probability measure based on the number of samples included inside the winner hyperboxes for the samples located on the decision boundaries

[18]:

y_pred = eiol_gfmm_clf.predict(Xtest, PROBABILITY_MEASURE)
acc = accuracy_score(ytest, y_pred)
print(f'Accuracy = {acc * 100: .2f}%')

Accuracy =  82.44%

Predict the class label for input samples using the Manhattan distance measure (applied only to continuous features) for the samples located on the decision boundaries

[19]:

y_pred = eiol_gfmm_clf.predict(Xtest, MANHATTAN_DIS)
acc = accuracy_score(ytest, y_pred)
print(f'Accuracy (Manhattan distance for samples on the decision boundaries) = {acc * 100: .2f}%')

Accuracy (Manhattan distance for samples on the decision boundaries) =  78.63%
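The two boundary-handling options only matter when several hyperboxes of different classes tie for the highest membership value. The sketch below shows one way such tie-breaking rules could work. It is not the hbbrain implementation: the function names are hypothetical, and measuring the Manhattan distance to the hyperbox centre is an assumption.

```python
import numpy as np

def pick_by_probability(classes, n_samples):
    """Choose the class whose tied hyperboxes contain the most training samples."""
    classes = np.asarray(classes)
    n_samples = np.asarray(n_samples, dtype=float)
    labels = np.unique(classes)
    totals = np.array([n_samples[classes == c].sum() for c in labels])
    return int(labels[np.argmax(totals)])

def pick_by_manhattan(x, V, W, classes):
    """Choose the class of the tied hyperbox whose centre is nearest to x (L1 distance).

    V and W hold the min and max points of the tied hyperboxes (continuous features only).
    """
    centres = (np.asarray(V, dtype=float) + np.asarray(W, dtype=float)) / 2.0
    dist = np.abs(centres - np.asarray(x, dtype=float)).sum(axis=1)
    return int(np.asarray(classes)[np.argmin(dist)])

tied_classes = [0, 1]
print(pick_by_probability(tied_classes, n_samples=[3, 5]))
print(pick_by_manhattan([0.2, 0.2],
                        V=[[0.1, 0.1], [0.5, 0.5]],
                        W=[[0.3, 0.3], [0.7, 0.7]],
                        classes=tied_classes))
```

In this toy case the two rules disagree, which is one reason the two accuracy figures above can differ.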

Explaining the predicted result for the input sample by showing membership values and hyperboxes for each class

[20]:

sample_need_explain = 1 # index of the test sample to explain
y_pred_input, mem_val_classes, min_points_classes, max_points_classes, dict_cat_bound_classes = eiol_gfmm_clf.get_sample_explanation(Xtest[sample_need_explain])
print("Explain samples:")
print("Membership values for classes: ", mem_val_classes)
print("Predicted class = ", y_pred_input)
print("Minimum continuous points of the selected hyperbox for each class: ", min_points_classes)
print("Maximum continuous points of the selected hyperbox for each class: ", max_points_classes)
print("Categorical bounds of the selected hyperbox for each class: ", dict_cat_bound_classes)

Explain samples:
Membership values for classes:  {0: 0.8441127694859039, 1: 0.9191765873015874}
Predicted class =  1
Minimum continuous points of the selected hyperbox for each class:  {0: array([0.13888889, 0.30357143, 0.06140351, 0.14925373, 0.04      ,
       0.0099    ]), 1: array([0.08603175, 0.30660714, 0.02631579, 0.10447761, 0.048     ,
       0.        ])}
Maximum continuous points of the selected hyperbox for each class:  {0: array([0.13888889, 0.30357143, 0.06140351, 0.14925373, 0.04      ,
       0.0099    ]), 1: array([0.08603175, 0.30660714, 0.02631579, 0.10447761, 0.048     ,
       0.        ])}
Categorical bounds of the selected hyperbox for each class:  {0: array([{'a': 1}, {'u': 1}, {'g': 1}, {'q': 1}, {'v': 1}, {'t': 1},
       {'t': 1}, {'f': 1}, {'g': 1}], dtype=object), 1: array([{'a': 1}, {'u': 1}, {'g': 1}, {'cc': 1}, {'h': 1}, {'t': 1},
       {'t': 1}, {'f': 1}, {'g': 1}], dtype=object)}

Apply pruning to the trained classifier

[21]:

acc_threshold = 0.5 # minimum accuracy a hyperbox needs in order to be retained
keep_empty_boxes = False # discard hyperboxes that take no part in predictions on the validation set
# use a probability measure based on the number of samples included in the hyperbox to handle samples located on the boundary
type_boundary_handling = PROBABILITY_MEASURE
eiol_gfmm_clf.simple_pruning(Xval, yval, acc_threshold, keep_empty_boxes, type_boundary_handling)

[21]:

ExtendedImprovedOnlineGFMM(C=array([0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1,
       0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0,
       1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0,
       1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0,
       0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
       0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1,
       1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0,
       1, 1, 1,...
       [8.60317460e-02, 3.39285714e-01, 5.26315789e-02, 0.00000000e+00,
        6.00000000e-02, 2.20600000e-02],
       ...,
       [1.41587302e-01, 2.82142857e-02, 2.98245614e-03, 0.00000000e+00,
        7.20000000e-02, 0.00000000e+00],
       [6.93174603e-01, 3.03571429e-01, 2.45614035e-01, 4.47761194e-02,
        0.00000000e+00, 0.00000000e+00],
       [5.38412698e-01, 1.03571429e-02, 5.26315789e-01, 2.98507463e-01,
        0.00000000e+00, 1.50000000e-01]]),
                           delta=0.6, theta=0.1)

[22]:

print('Number of hyperboxes after pruning = %d'%eiol_gfmm_clf.get_n_hyperboxes())

Number of hyperboxes after pruning = 358
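The idea behind this pruning step can be sketched as follows, assuming we already know which hyperbox won for each validation sample. The helper `simple_pruning_mask` is hypothetical and only mirrors the meaning of acc_threshold and keep_empty_boxes described above; whether the library uses >= or > for the threshold is not shown in this example.

```python
import numpy as np

def simple_pruning_mask(winner_idx, y_pred, y_val, n_boxes,
                        acc_threshold=0.5, keep_empty_boxes=False):
    """Return a boolean mask of hyperboxes to keep.

    winner_idx[i] is the index of the hyperbox that classified validation
    sample i. A hyperbox is kept if its accuracy over the samples it won is
    at least acc_threshold. Hyperboxes that won no sample are kept only if
    keep_empty_boxes is True.
    """
    winner_idx = np.asarray(winner_idx)
    correct = np.asarray(y_pred) == np.asarray(y_val)
    keep = np.zeros(n_boxes, dtype=bool)
    for b in range(n_boxes):
        used = winner_idx == b
        if not used.any():
            keep[b] = keep_empty_boxes
        else:
            keep[b] = correct[used].mean() >= acc_threshold
    return keep

# Toy case: box 0 is always right, box 1 is right 1 time out of 3, box 2 is never used
winners = [0, 0, 1, 1, 1]
pred    = [0, 0, 1, 1, 1]
truth   = [0, 0, 0, 0, 1]
print(simple_pruning_mask(winners, pred, truth, n_boxes=3))
print(simple_pruning_mask(winners, pred, truth, n_boxes=3, keep_empty_boxes=True))
```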

Make predictions after pruning

Predict the class label for input samples using a probability measure based on the number of samples included inside the winner hyperboxes for the samples located on the decision boundaries

[23]:

y_pred_2 = eiol_gfmm_clf.predict(Xtest, PROBABILITY_MEASURE)
acc = accuracy_score(ytest, y_pred_2)
print(f'Accuracy after pruning = {acc * 100: .2f}%')

Accuracy after pruning =  83.21%

Predict the class label for input samples using the Manhattan distance measure (applied only to continuous features) for the samples located on the decision boundaries

[24]:

y_pred_2 = eiol_gfmm_clf.predict(Xtest, MANHATTAN_DIS)
acc = accuracy_score(ytest, y_pred_2)
print(f'Accuracy (Manhattan distance for samples on the decision boundaries) = {acc * 100: .2f}%')

Accuracy (Manhattan distance for samples on the decision boundaries) =  79.39%

b. Training the EIOL-GFMM algorithm with the categorical feature expansion condition (the maximum entropy change threshold) applied to the average entropy change over all categorical features
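To contrast this averaged condition with a per-dimension one, the sketch below computes the entropy change in each categorical dimension and compares both the maximum and the mean against delta. As before, it assumes count-dictionary bounds and unnormalised Shannon entropy in bits, which may not match the library's formula; the function names are hypothetical.

```python
import math

def shannon_entropy(counts):
    """Shannon entropy (in bits) of a dictionary of value counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c > 0)

def entropy_changes(cat_bounds, sample):
    """Entropy change in each categorical dimension if `sample` were absorbed."""
    changes = []
    for bound, value in zip(cat_bounds, sample):
        updated = dict(bound)
        updated[value] = updated.get(value, 0) + 1
        changes.append(shannon_entropy(updated) - shannon_entropy(bound))
    return changes

delta = 0.6
bounds = [{'a': 1}, {'u': 3}, {'g': 3}]
changes = entropy_changes(bounds, ['b', 'u', 'g'])

passes_per_dimension = max(changes) <= delta               # every dimension must pass
passes_on_average = sum(changes) / len(changes) <= delta   # assumed reading of type_cat_expansion=1
print(changes, passes_per_dimension, passes_on_average)
```

Here one dimension changes a lot while the others do not change at all, so the averaged test passes where the per-dimension test fails. A more permissive test of this kind would tend to produce fewer, larger hyperboxes.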

Training

[25]:

eiol_gfmm_clf = ExtendedImprovedOnlineGFMM(theta=theta, gamma=gamma, delta=delta, alpha=alpha)
eiol_gfmm_clf.fit(Xtr, ytr, categorical_features, type_cat_expansion=1)

[25]:

ExtendedImprovedOnlineGFMM(C=array([0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1,
       1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1,
       1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0,
       1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1,
       1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1,
       1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0,
       1, 0, 1,...
       [2.67142857e-01, 3.80892857e-01, 2.98245614e-03, 1.79104478e-01,
        6.45000000e-02, 3.00000000e-05],
       [7.48730159e-01, 1.78571429e-01, 1.40350877e-01, 5.97014925e-02,
        0.00000000e+00, 9.90000000e-04],
       [5.33015873e-01, 2.32142857e-01, 3.50877193e-02, 0.00000000e+00,
        0.00000000e+00, 2.28000000e-03],
       [5.38412698e-01, 1.03571429e-02, 5.26315789e-01, 2.98507463e-01,
        0.00000000e+00, 1.50000000e-01]]),
                           delta=0.6, theta=0.1)

[26]:

print("Number of existing hyperboxes = %d"%(eiol_gfmm_clf.get_n_hyperboxes()))

Number of existing hyperboxes = 159

[27]:

print("Training time: %.3f (s)"%eiol_gfmm_clf.elapsed_training_time)

Training time: 0.256 (s)

Prediction

Predict the class label for input samples using a probability measure based on the number of samples included inside the winner hyperboxes for the samples located on the decision boundaries

[28]:

y_pred = eiol_gfmm_clf.predict(Xtest, PROBABILITY_MEASURE)
acc = accuracy_score(ytest, y_pred)
print(f'Accuracy = {acc * 100: .2f}%')

Accuracy =  83.97%

Predict the class label for input samples using the Manhattan distance measure (applied only to continuous features) for the samples located on the decision boundaries

[29]:

y_pred = eiol_gfmm_clf.predict(Xtest, MANHATTAN_DIS)
acc = accuracy_score(ytest, y_pred)
print(f'Accuracy (Manhattan distance for samples on the decision boundaries) = {acc * 100: .2f}%')

Accuracy (Manhattan distance for samples on the decision boundaries) =  80.92%

Explaining the predicted result for the input sample by showing membership values and hyperboxes for each class

[30]:

sample_need_explain = 1 # index of the test sample to explain
y_pred_input, mem_val_classes, min_points_classes, max_points_classes, dict_cat_bound_classes = eiol_gfmm_clf.get_sample_explanation(Xtest[sample_need_explain])
print("Explain samples:")
print("Membership values for classes: ", mem_val_classes)
print("Predicted class = ", y_pred_input)
print("Minimum continuous points of the selected hyperbox for each class: ", min_points_classes)
print("Maximum continuous points of the selected hyperbox for each class: ", max_points_classes)
print("Categorical bounds of the selected hyperbox for each class: ", dict_cat_bound_classes)

Explain samples:
Membership values for classes:  {0: 0.818407960199005, 1: 0.8854166666666667}
Predicted class =  1
Minimum continuous points of the selected hyperbox for each class:  {0: array([6.07936508e-02, 3.57142857e-01, 4.38596491e-03, 1.49253731e-02,
       0.00000000e+00, 1.00000000e-05]), 1: array([1.46825397e-01, 4.19642857e-01, 1.75438596e-02, 1.49253731e-02,
       6.00000000e-02, 1.10000000e-04])}
Maximum continuous points of the selected hyperbox for each class:  {0: array([0.15079365, 0.45089286, 0.03508772, 0.02985075, 0.06      ,
       0.05552   ]), 1: array([0.21698413, 0.51785714, 0.10824561, 0.02985075, 0.15      ,
       0.00551   ])}
Categorical bounds of the selected hyperbox for each class:  {0: array([{'b': 2, 'a': 1}, {'u': 3}, {'g': 3}, {'w': 2, 'c': 1},
       {'h': 1, 'v': 2}, {'f': 3}, {'t': 3}, {'f': 3}, {'g': 3}],
      dtype=object), 1: array([{'a': 2}, {'u': 2}, {'g': 2}, {'x': 2}, {'h': 2}, {'t': 2},
       {'t': 2}, {'t': 1, 'f': 1}, {'g': 2}], dtype=object)}

Apply pruning to the trained classifier

[31]:

acc_threshold = 0.5 # minimum accuracy a hyperbox needs in order to be retained
keep_empty_boxes = False # discard hyperboxes that take no part in predictions on the validation set
# use a probability measure based on the number of samples included in the hyperbox to handle samples located on the boundary
type_boundary_handling = PROBABILITY_MEASURE
eiol_gfmm_clf.simple_pruning(Xval, yval, acc_threshold, keep_empty_boxes, type_boundary_handling)

[31]:

ExtendedImprovedOnlineGFMM(C=array([0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1,
       0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1,
       0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1]),
                           D=array([[{'a': 3, 'b': 12}, {'u': 10, 'y': 5}, {'g': 10, 'p': 5},
        {'q': 1, 'w': 4, 'k': 5, 'c': 2, 'i': 1, 'x': 1, 'm': 1},
        {'v': 11, 'h': 3, 'ff': 1}, {'f': 15}, {'t': 5, 'f': 10},
        {'t': 6, 'f': 9}, {'g': 14, 's': 1}],
       [{'b': 2, 'a': 3}, {'u': 4...
       [4.35238095e-01, 2.32142857e-01, 1.75438596e-02, 4.47761194e-02,
        1.14000000e-01, 0.00000000e+00],
       [5.46349206e-01, 1.25000000e-01, 1.22807018e-01, 0.00000000e+00,
        1.15000000e-01, 0.00000000e+00],
       [6.09841270e-01, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00],
       [7.48730159e-01, 1.78571429e-01, 1.40350877e-01, 5.97014925e-02,
        0.00000000e+00, 9.90000000e-04]]),
                           delta=0.6, theta=0.1)

Make predictions after pruning

Predict the class label for input samples using a probability measure based on the number of samples included inside the winner hyperboxes for the samples located on the decision boundaries

[32]:

y_pred = eiol_gfmm_clf.predict(Xtest, PROBABILITY_MEASURE)
acc = accuracy_score(ytest, y_pred)
print(f'Accuracy = {acc * 100: .2f}%')

Accuracy =  82.44%

Predict the class label for input samples using the Manhattan distance measure (applied only to continuous features) for the samples located on the decision boundaries

[33]:

y_pred = eiol_gfmm_clf.predict(Xtest, MANHATTAN_DIS)
acc = accuracy_score(ytest, y_pred)
print(f'Accuracy (Manhattan distance for samples on the decision boundaries) = {acc * 100: .2f}%')

Accuracy (Manhattan distance for samples on the decision boundaries) =  82.44%