Batch-Incremental Learning Algorithm for GFMM using Probability-based Measures for Categorical Features

This example shows how to use the general fuzzy min-max (GFMM) neural network trained by the batch-incremental learning algorithm. Categorical features are encoded using the ordinal encoding method, and the similarity between categorical values is computed from their frequency of occurrence.
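
To make these two ideas concrete, here is a minimal sketch; it is not the library's implementation. It assumes that ordinal encoding maps each distinct category to an integer, and it assumes one plausible frequency-based measure: the distance between two categorical values is the mean absolute difference of their class-conditional frequencies. The helper names `ordinal_encode`, `class_frequencies`, and `freq_distance` are hypothetical.

```python
from collections import Counter, defaultdict


def ordinal_encode(values):
    """Map each distinct categorical value to an integer code (order of first appearance)."""
    codes = {}
    for v in values:
        codes.setdefault(v, len(codes))
    return [codes[v] for v in values], codes


def class_frequencies(encoded, labels):
    """Estimate P(class | categorical value) from co-occurrence counts."""
    counts = defaultdict(Counter)
    for code, label in zip(encoded, labels):
        counts[code][label] += 1
    classes = sorted(set(labels))
    return {
        code: [cnt[c] / sum(cnt.values()) for c in classes]
        for code, cnt in counts.items()
    }


def freq_distance(freqs, a, b):
    """Assumed measure: mean absolute difference of class-conditional frequencies."""
    pa, pb = freqs[a], freqs[b]
    return sum(abs(x - y) for x, y in zip(pa, pb)) / len(pa)


# Toy data: one categorical feature and a binary class label
colours = ["red", "blue", "red", "green", "blue", "red"]
labels = [0, 1, 0, 1, 1, 1]

encoded, codes = ordinal_encode(colours)
freqs = class_frequencies(encoded, labels)
print(codes)
print(freq_distance(freqs, codes["blue"], codes["green"]))  # both only occur with class 1
print(freq_distance(freqs, codes["red"], codes["blue"]))
```

Under this assumed measure, two values that co-occur with the classes in the same proportions are at distance 0, regardless of the integer codes assigned to them.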

Note that the numerical features in the training and testing datasets must lie in the range [0, 1], because GFMM classifiers require features in the unit hypercube. Continuous features therefore need to be normalised before training. Categorical features need no preprocessing, as the FreqCatOnlineGFMM classifier applies the appropriate encoding to categorical values itself.
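
The normalisation step can be sketched as follows. This is a minimal, dependency-free illustration under stated assumptions: the helper names `fit_min_max` and `apply_min_max` are hypothetical, the ranges are learned from the training data only, and test values outside the training range are clipped into [0, 1].

```python
def fit_min_max(rows, continuous_idx):
    """Record the training minimum and maximum of each continuous column."""
    return {
        j: (min(r[j] for r in rows), max(r[j] for r in rows))
        for j in continuous_idx
    }


def apply_min_max(rows, ranges):
    """Rescale continuous columns into [0, 1]; other columns pass through unchanged."""
    scaled = []
    for r in rows:
        r = list(r)
        for j, (lo, hi) in ranges.items():
            span = hi - lo
            value = (r[j] - lo) / span if span > 0 else 0.0
            r[j] = min(1.0, max(0.0, value))  # clip unseen values into the unit range
        scaled.append(r)
    return scaled


# Toy mixed-attribute rows: columns 0 and 2 are continuous, column 1 is categorical
train = [[20.0, "a", 100.0], [30.0, "b", 300.0], [40.0, "a", 200.0]]
test = [[25.0, "b", 400.0]]

ranges = fit_min_max(train, continuous_idx=[0, 2])
train_scaled = apply_min_max(train, ranges)
test_scaled = apply_min_max(test, ranges)
print(train_scaled)
print(test_scaled)
```

For purely numeric arrays, scikit-learn's `MinMaxScaler` performs the same rescaling.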

1. Execute directly from the Python file

[1]:
import os
import warnings
warnings.filterwarnings('ignore')
from sklearn.metrics import accuracy_score

Get the path to the folder containing this Jupyter notebook file

[2]:
# "__file__" is a plain string here, so this resolves to the current working directory
this_notebook_dir = os.path.dirname(os.path.abspath("__file__"))
this_notebook_dir
[2]:
'C:\\hyperbox-brain\\examples\\mixed_data'

Get the home folder of the Hyperbox-Brain project

[3]:
from pathlib import Path
project_dir = Path(this_notebook_dir).parent.parent
project_dir
[3]:
WindowsPath('C:/hyperbox-brain')

Create the path to the Python file that implements the GFMM classifier for mixed-attribute data, using the online learning algorithm with a categorical feature similarity measure based on the frequency of occurrence of categorical values

[4]:
freq_cat_gfmm_file_path = os.path.join(project_dir, Path("hbbrain/mixed_data/freq_cat_onln_gfmm.py"))
freq_cat_gfmm_file_path
[4]:
'C:\\hyperbox-brain\\hbbrain\\mixed_data\\freq_cat_onln_gfmm.py'

Run the file with the -h flag to display its usage instructions

[5]:
!python "{freq_cat_gfmm_file_path}" -h
usage: freq_cat_onln_gfmm.py [-h] -training_file TRAINING_FILE -testing_file
                             TESTING_FILE -categorical_features
                             CATEGORICAL_FEATURES [--theta THETA]
                             [--theta_min THETA_MIN] [--eta ETA]
                             [--gamma GAMMA] [--alpha ALPHA]

The description of parameters

required arguments:
  -training_file TRAINING_FILE
                        A required argument for the path to training data file
                        (including file name)
  -testing_file TESTING_FILE
                        A required argument for the path to testing data file
                        (including file name)
  -categorical_features CATEGORICAL_FEATURES
                        Indices of categorical features

optional arguments:
  --theta THETA         Maximum hyperbox size (in the range of (0, 1])
                        (default: 0.5)
  --theta_min THETA_MIN
                        Mimimum value of the maximum hyperbox size to escape
                        the training loop (in the range of (0, 1]) (default:
                        0.5)
  --eta ETA             Maximum similarity value for each pair of categorical
                        values (in the range of (0, 1] (default: 0.5
  --gamma GAMMA         A sensitivity parameter describing the speed of
                        decreasing of the membership function in each
                        continuous dimension (larger than 0) (default: 1)
  --alpha ALPHA         Multiplier showing the decrease of theta in each step
                        (default: 0.9)
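
For readers who want to see how a command-line interface like this can be declared, here is a minimal `argparse` sketch that mirrors the flags and defaults listed above. It is an illustration, not the script's actual source: the helper names `index_list` and `build_parser` are hypothetical, and parsing `-categorical_features` from a list literal is an assumption based on how the argument is passed in this example.

```python
import argparse
import ast


def index_list(text):
    """Parse a string such as "[0, 3, 4]" into a list of ints (assumed input format)."""
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError) as exc:
        raise argparse.ArgumentTypeError(f"not a list literal: {text!r}") from exc
    if not isinstance(value, (list, tuple)) or not all(isinstance(i, int) for i in value):
        raise argparse.ArgumentTypeError(f"expected a list of integers: {text!r}")
    return list(value)


def build_parser():
    """Declare the flags shown in the help output, with the same defaults."""
    parser = argparse.ArgumentParser(description="The description of parameters")
    required = parser.add_argument_group("required arguments")
    required.add_argument("-training_file", required=True,
                          help="Path to training data file (including file name)")
    required.add_argument("-testing_file", required=True,
                          help="Path to testing data file (including file name)")
    required.add_argument("-categorical_features", required=True, type=index_list,
                          help="Indices of categorical features")
    parser.add_argument("--theta", type=float, default=0.5)
    parser.add_argument("--theta_min", type=float, default=0.5)
    parser.add_argument("--eta", type=float, default=0.5)
    parser.add_argument("--gamma", type=float, default=1)
    parser.add_argument("--alpha", type=float, default=0.9)
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere
args = build_parser().parse_args([
    "-training_file", "train.csv",
    "-testing_file", "test.csv",
    "-categorical_features", "[0, 3, 4, 5, 6, 8, 9, 11,12]",
    "--theta", "0.1", "--theta_min", "0.1", "--eta", "0.6", "--gamma", "1",
])
print(args.categorical_features, args.theta, args.alpha)
```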

Create the path to mixed-attribute training and testing datasets stored in the dataset folder.

This example uses the japanese_credit dataset for illustration purposes. The continuous features in this dataset were normalised into the range of [0, 1], while categorical features were kept unchanged.

[6]:
training_data_file = os.path.join(project_dir, Path("dataset/japanese_credit_train.csv"))
training_data_file
[6]:
'C:\\hyperbox-brain\\dataset\\japanese_credit_train.csv'
[7]:
testing_data_file = os.path.join(project_dir, Path("dataset/japanese_credit_test.csv"))
testing_data_file
[7]:
'C:\\hyperbox-brain\\dataset\\japanese_credit_test.csv'

Run a demo program

[8]:
!python "{freq_cat_gfmm_file_path}" -training_file "{training_data_file}" -testing_file "{testing_data_file}" -categorical_features "[0, 3, 4, 5, 6, 8, 9, 11,12]" --theta 0.1 --theta_min 0.1 --eta 0.6 --gamma 1
Number of hyperboxes = 266
Testing accuracy =  80.92%

2. Using the FreqCatOnlineGFMM algorithm to train a GFMM classifier for mixed-attribute data through its init, fit, and predict functions

[9]:
from hbbrain.mixed_data.freq_cat_onln_gfmm import FreqCatOnlineGFMM
import pandas as pd

Create the mixed-attribute training, validation, and testing datasets.

As in the first part, this example uses the japanese_credit dataset, whose continuous features were normalised into the range [0, 1] while the categorical features were kept unchanged.

[10]:
df_train = pd.read_csv(training_data_file, header=None)
df_test = pd.read_csv(testing_data_file, header=None)

Xy_train = df_train.to_numpy()
Xy_test = df_test.to_numpy()

Xtr = Xy_train[:, :-1]
ytr = Xy_train[:, -1].astype(int)

Xtest = Xy_test[:, :-1]
ytest = Xy_test[:, -1].astype(int)
[11]:
val_data_file = os.path.join(project_dir, Path("dataset/japanese_credit_val.csv"))
df_val = pd.read_csv(val_data_file, header=None)
Xy_val = df_val.to_numpy()
Xval = Xy_val[:, :-1]
yval = Xy_val[:, -1].astype(int)

Initialise the parameters

[12]:
theta = 0.1 # maximum hyperbox size for continuous features
theta_min = 0.1 # equal to theta, so only one training loop is performed
eta = 0.6 # maximum similarity value for each pair of categorical values
gamma = 1 # sensitivity parameter: how fast membership decreases in each continuous dimension
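
The interplay between `theta`, `theta_min`, and the `alpha` multiplier from the command-line help can be sketched as below. This is a reading of the parameter descriptions, not the library's code: it assumes the maximum hyperbox size is multiplied by `alpha` after each training loop and that training stops once it falls below `theta_min`. The generator name `theta_schedule` is hypothetical.

```python
def theta_schedule(theta, theta_min, alpha=0.9):
    """Yield the maximum hyperbox size used in each training loop (assumed schedule)."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    while theta >= theta_min:
        yield theta
        theta *= alpha  # shrink the maximum hyperbox size for the next loop


# theta == theta_min, as in this example: a single training loop
single_loop = list(theta_schedule(0.1, 0.1))
print(single_loop)

# A larger starting theta gives several loops with progressively smaller hyperboxes
many_loops = list(theta_schedule(0.5, 0.1))
print(len(many_loops))
```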

Indicate the indices of categorical features in the training data

[13]:
categorical_features = [0, 3, 4, 5, 6, 8, 9, 11, 12]

Training

[14]:
freq_cat_onln_gfmm_clf = FreqCatOnlineGFMM(theta=theta, theta_min=theta_min, eta=eta, gamma=gamma)
freq_cat_onln_gfmm_clf.fit(Xtr, ytr, categorical_features)
[14]:
FreqCatOnlineGFMM(C=array([0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0,
       1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0,
       1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1,
       0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1,
       0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0,
       0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1,
       0, 0, 1, 1, 1, 0, 1, 0...
       [1.32222222e-01, 3.92857143e-01, 5.26315789e-02, 0.00000000e+00,
        6.00000000e-02, 2.20600000e-02],
       ...,
       [4.35238095e-01, 2.32142857e-01, 1.75438596e-02, 4.47761194e-02,
        7.25000000e-02, 0.00000000e+00],
       [1.29682540e-01, 1.78571429e-02, 4.38596491e-03, 0.00000000e+00,
        1.80000000e-01, 0.00000000e+00],
       [5.38412698e-01, 1.03571429e-02, 5.26315789e-01, 2.98507463e-01,
        0.00000000e+00, 1.50000000e-01]]),
                  eta=0.6, theta=0.1, theta_min=0.1)
[15]:
print('Number of hyperboxes = %d'%freq_cat_onln_gfmm_clf.get_n_hyperboxes())
Number of hyperboxes = 266
[16]:
print("Training time: %.3f (s)"%freq_cat_onln_gfmm_clf.elapsed_training_time)
Training time: 1.256 (s)

Prediction

[17]:
y_pred = freq_cat_onln_gfmm_clf.predict(Xtest)
acc = accuracy_score(ytest, y_pred)
print(f'Accuracy = {acc * 100: .2f}%')
Accuracy =  80.92%

Explain the prediction for an input sample by showing the membership values and the selected hyperbox for each class

[18]:
sample_need_explain = 1
(y_pred_input, mem_val_classes, min_points_classes, max_points_classes,
 dict_min_point_cat_classes, dict_max_point_cat_classes) = freq_cat_onln_gfmm_clf.get_sample_explanation(Xtest[sample_need_explain])
print("Explain samples:")
print("Membership values for classes: ", mem_val_classes)
print("Predicted class = ", y_pred_input)
print("Minimum continuous points of the selected hyperbox for each class: ", min_points_classes)
print("Maximum continuous points of the selected hyperbox for each class: ", max_points_classes)
print("Minimum categorical points of the selected hyperbox for each class: ", dict_min_point_cat_classes)
print("Maximum categorical points of the selected hyperbox for each class: ", dict_max_point_cat_classes)
Explain samples:
Membership values for classes:  {0: 0.6642512077294687, 1: 0.75}
Predicted class =  1
Minimum continuous points of the selected hyperbox for each class:  {0: array([0.1852381 , 0.04017857, 0.04526316, 0.02985075, 0.1       ,
       0.        ]), 1: array([0.03301587, 0.02089286, 0.00578947, 0.02985075, 0.05      ,
       0.        ])}
Maximum continuous points of the selected hyperbox for each class:  {0: array([0.1852381 , 0.04017857, 0.04526316, 0.02985075, 0.1       ,
       0.        ]), 1: array([0.10984127, 0.10714286, 0.07315789, 0.07462687, 0.11      ,
       0.02503   ])}
Minimum categorical points of the selected hyperbox for each class:  {0: array([0.0, 1.0, 0.0, 10.0, 7.0, 1.0, 1.0, 0.0, 0.0], dtype=object), 1: array([0.0, 1.0, 0.0, 7.0, 7.0, 1.0, 1.0, 0.0, 0.0], dtype=object)}
Maximum categorical points of the selected hyperbox for each class:  {0: array([100000, 100000, 100000, 100000, 100000, 100000, 100000, 100000,
       100000]), 1: array([0, 1, 0, 1, 3, 1, 1, 0, 0])}
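
To give some intuition for the membership values reported above, here is a minimal sketch of the usual GFMM membership function for continuous features only. How FreqCatOnlineGFMM combines it with the categorical part is not shown in this example, so this sketch will not reproduce the exact numbers in the output. The function names `ramp` and `continuous_membership` are hypothetical.

```python
def ramp(r, gamma):
    """Threshold function: 0 for negative r, linear in r * gamma, capped at 1."""
    return min(1.0, max(0.0, r * gamma))


def continuous_membership(x, v, w, gamma=1.0):
    """Membership of point x in the hyperbox with minimum point v and maximum point w."""
    return min(
        min(1.0 - ramp(xi - wi, gamma), 1.0 - ramp(vi - xi, gamma))
        for xi, vi, wi in zip(x, v, w)
    )


v = [0.2, 0.2]  # minimum point of a toy hyperbox
w = [0.4, 0.5]  # maximum point of the same hyperbox

inside = continuous_membership([0.3, 0.3], v, w)       # sample inside the hyperbox
outside = continuous_membership([0.6, 0.3], v, w)      # 0.2 beyond the maximum point in dimension 0
steeper = continuous_membership([0.6, 0.3], v, w, gamma=4.0)
print(inside, outside, steeper)
```

A sample inside the hyperbox gets full membership, and the value decreases linearly with the distance to the hyperbox in the worst dimension. A larger `gamma` makes that decrease steeper.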

Apply pruning to the trained classifier

[19]:
acc_threshold = 0.5 # minimum validation accuracy a hyperbox needs to be retained
keep_empty_boxes = False # discard hyperboxes that take no part in predictions on the validation set
freq_cat_onln_gfmm_clf.simple_pruning(Xval, yval, acc_threshold, keep_empty_boxes)
[19]:
FreqCatOnlineGFMM(C=array([0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1,
       0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0,
       1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1,
       0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0,
       1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0,
       0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0,
       1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1,
       0, 0, 1, 1, 1, 1, 0, 1...
       [2.65873016e-01, 2.32142857e-01, 1.40350877e-01, 1.04477612e-01,
        4.95000000e-02, 3.06500000e-02],
       ...,
       [4.35238095e-01, 2.32142857e-01, 1.75438596e-02, 4.47761194e-02,
        7.25000000e-02, 0.00000000e+00],
       [1.29682540e-01, 1.78571429e-02, 4.38596491e-03, 0.00000000e+00,
        1.80000000e-01, 0.00000000e+00],
       [5.38412698e-01, 1.03571429e-02, 5.26315789e-01, 2.98507463e-01,
        0.00000000e+00, 1.50000000e-01]]),
                  eta=0.6, theta=0.1, theta_min=0.1)
[20]:
print('Number of hyperboxes after pruning = %d'%freq_cat_onln_gfmm_clf.get_n_hyperboxes())
Number of hyperboxes after pruning = 246
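
The pruning rule described by `acc_threshold` and `keep_empty_boxes` can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: it assumes each validation sample is attributed to a single winning hyperbox, and that a hyperbox is kept when its accuracy is at least the threshold. The function name `simple_pruning_mask` and its inputs are hypothetical.

```python
from collections import defaultdict


def simple_pruning_mask(winner_ids, is_correct, n_boxes,
                        acc_threshold=0.5, keep_empty_boxes=False):
    """Return one keep/drop flag per hyperbox based on validation accuracy."""
    used = defaultdict(int)  # number of validation samples won by each hyperbox
    hits = defaultdict(int)  # number of those samples that were classified correctly
    for box, ok in zip(winner_ids, is_correct):
        used[box] += 1
        hits[box] += int(ok)

    keep = []
    for box in range(n_boxes):
        if used[box] == 0:
            # Hyperbox never took part in a prediction on the validation set
            keep.append(keep_empty_boxes)
        else:
            keep.append(hits[box] / used[box] >= acc_threshold)
    return keep


# Toy validation run: four hyperboxes, six validation samples
winner_ids = [0, 0, 1, 1, 1, 3]
is_correct = [True, True, False, False, True, True]

mask = simple_pruning_mask(winner_ids, is_correct, n_boxes=4)
print(mask)
```

In this toy run, hyperbox 1 is dropped because only one of its three predictions is correct, and hyperbox 2 is dropped because it never wins a validation sample.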

Make predictions after pruning

[21]:
y_pred = freq_cat_onln_gfmm_clf.predict(Xtest)
acc = accuracy_score(ytest, y_pred)
print(f'Accuracy after pruning = {acc * 100: .2f}%')
Accuracy after pruning =  83.21%