Extended Improved Online Learning Algorithm with Mixed-Attribute Data for GFMM
This example shows how to use a general fuzzy min-max neural network (GFMM) trained by the extended improved online learning algorithm for mixed-attribute data (EIOL-GFMM).
Note that the numerical features in the training and testing datasets must lie in the range [0, 1] because GFMM classifiers require features in the unit cube. Continuous features therefore need to be normalised before training. Categorical features need no preprocessing, as EIOL-GFMM does not require any categorical feature encoding.
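The normalisation requirement above can be met with min-max scaling. The following is a minimal sketch, assuming the continuous columns have already been separated from the categorical ones; the array names and values are made up for illustration, and clipping test values that fall outside the training range is one possible choice rather than something prescribed by EIOL-GFMM.

```python
import numpy as np

# Hypothetical continuous columns only; categorical columns are left untouched.
X_train_cont = np.array([[22.0, 1200.0],
                         [35.0, 3000.0],
                         [58.0, 800.0]])
X_test_cont = np.array([[41.0, 2600.0],
                        [70.0, 500.0]])

# Compute the scaling statistics on the training data only.
col_min = X_train_cont.min(axis=0)
col_range = X_train_cont.max(axis=0) - col_min
col_range[col_range == 0] = 1.0  # avoid division by zero for constant columns


def scale_to_unit_range(X):
    """Min-max scale X with the training statistics and clip into [0, 1]."""
    return np.clip((X - col_min) / col_range, 0.0, 1.0)


X_train_scaled = scale_to_unit_range(X_train_cont)
X_test_scaled = scale_to_unit_range(X_test_cont)
print(X_train_scaled)
print(X_test_scaled)
```

Reusing the training minimum and range for the testing data keeps both sets on the same scale, so a hyperbox size such as theta means the same thing for both.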
1. Execute directly from the Python file
[1]:
import os
import warnings
warnings.filterwarnings('ignore')
from sklearn.metrics import accuracy_score
Get the path to the folder containing this Jupyter notebook file
[2]:
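# "__file__" is passed as a plain string because notebooks do not define __file__,
# so this resolves to the current working directory.
this_notebook_dir = os.path.dirname(os.path.abspath("__file__"))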
this_notebook_dir
[2]:
'C:\\hyperbox-brain\\examples\\mixed_data'
Get the home folder of the Hyperbox-Brain project
[3]:
from pathlib import Path
project_dir = Path(this_notebook_dir).parent.parent
project_dir
[3]:
WindowsPath('C:/hyperbox-brain')
Create the path to the Python file containing the implementation of the GFMM classifier using the extended improved online learning algorithm for mixed attribute data
[4]:
eiol_gfmm_file_path = os.path.join(project_dir, Path("hbbrain/mixed_data/eiol_gfmm.py"))
eiol_gfmm_file_path
[4]:
'C:\\hyperbox-brain\\hbbrain\\mixed_data\\eiol_gfmm.py'
Run the file with the -h flag to show its usage instructions
[5]:
!python "{eiol_gfmm_file_path}" -h
usage: eiol_gfmm.py [-h] -training_file TRAINING_FILE -testing_file
                    TESTING_FILE -categorical_features CATEGORICAL_FEATURES
                    [--theta THETA] [--delta DELTA] [--gamma GAMMA]
                    [--alpha ALPHA]

The description of parameters

required arguments:
  -training_file TRAINING_FILE
                        A required argument for the path to training data file
                        (including file name)
  -testing_file TESTING_FILE
                        A required argument for the path to testing data file
                        (including file name)
  -categorical_features CATEGORICAL_FEATURES
                        Indices of categorical features

optional arguments:
  --theta THETA         Maximum hyperbox size (in the range of (0, 1])
                        (default: 0.5)
  --delta DELTA         Maximum changing entropy for categorical features (in
                        the range of (0, 1]) (default: 0.5)
  --gamma GAMMA         A sensitivity parameter describing the speed of
                        decreasing of the membership function in each
                        continous dimension (larger than 0) (default: 1)
  --alpha ALPHA         The trade-off weighting factor between categorical
                        features and numerical features for membership values
                        (in the range of [0, 1]) (default: 0.5)
Create the paths to the mixed-attribute training and testing datasets stored in the dataset folder.
This example uses the japanese_credit dataset for illustration purposes. The continuous features in this dataset were normalised into the range of [0, 1], while categorical features were kept unchanged.
[6]:
training_data_file = os.path.join(project_dir, Path("dataset/japanese_credit_train.csv"))
training_data_file
[6]:
'C:\\hyperbox-brain\\dataset\\japanese_credit_train.csv'
[7]:
testing_data_file = os.path.join(project_dir, Path("dataset/japanese_credit_test.csv"))
testing_data_file
[7]:
'C:\\hyperbox-brain\\dataset\\japanese_credit_test.csv'
Run a demo program
[8]:
!python "{eiol_gfmm_file_path}" -training_file "{training_data_file}" -testing_file "{testing_data_file}" -categorical_features "[0, 3, 4, 5, 6, 8, 9, 11,12]" --theta 0.1 --delta 0.6 --gamma 1 --alpha 0.5
Number of hyperboxes = 378
Testing accuracy = 82.44%
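Outside a notebook, where the `!python` shortcut is unavailable, the same command can be assembled as an argument list for the standard `subprocess` module. This is a sketch only: the helper name `build_eiol_gfmm_command` and the example paths are hypothetical, while the flags are the ones listed in the help text above.

```python
import json
import sys


def build_eiol_gfmm_command(script_path, training_file, testing_file,
                            categorical_features, theta=0.5, delta=0.5,
                            gamma=1, alpha=0.5):
    """Build the argument list for the eiol_gfmm.py command-line interface."""
    return [
        sys.executable, str(script_path),
        "-training_file", str(training_file),
        "-testing_file", str(testing_file),
        # The indices are passed as a single bracketed string, e.g. "[0, 3, 4]".
        "-categorical_features", json.dumps(list(categorical_features)),
        "--theta", str(theta),
        "--delta", str(delta),
        "--gamma", str(gamma),
        "--alpha", str(alpha),
    ]


cmd = build_eiol_gfmm_command(
    "hbbrain/mixed_data/eiol_gfmm.py",        # hypothetical relative path
    "dataset/japanese_credit_train.csv",
    "dataset/japanese_credit_test.csv",
    [0, 3, 4, 5, 6, 8, 9, 11, 12],
    theta=0.1, delta=0.6, gamma=1, alpha=0.5,
)
print(cmd)
# To execute it: import subprocess; subprocess.run(cmd, check=True)
```

Passing a list rather than a single shell string avoids quoting problems when paths contain spaces.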
2. Using the EIOL-GFMM algorithm to train a GFMM classifier for mixed-attribute data through its init, fit, and predict functions
[9]:
from hbbrain.mixed_data.eiol_gfmm import ExtendedImprovedOnlineGFMM
import pandas as pd
Create the mixed-attribute training, validation, and testing datasets.
As in Section 1, this example uses the japanese_credit dataset, in which the continuous features were normalised into the range of [0, 1] and the categorical features were kept unchanged.
[10]:
df_train = pd.read_csv(training_data_file, header=None)
df_test = pd.read_csv(testing_data_file, header=None)
Xy_train = df_train.to_numpy()
Xy_test = df_test.to_numpy()
Xtr = Xy_train[:, :-1]
ytr = Xy_train[:, -1].astype(int)
Xtest = Xy_test[:, :-1]
ytest = Xy_test[:, -1].astype(int)
[11]:
val_data_file = os.path.join(project_dir, Path("dataset/japanese_credit_val.csv"))
df_val = pd.read_csv(val_data_file, header=None)
Xy_val = df_val.to_numpy()
Xval = Xy_val[:, :-1]
yval = Xy_val[:, -1].astype(int)
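Because the classifier expects continuous features in [0, 1], it can be useful to check the loaded arrays before training. The helper below is a hypothetical sketch (it is not part of hbbrain) that reports which continuous columns of a mixed-type array fall outside the unit range; it is demonstrated on a small made-up array rather than on the dataset itself.

```python
import numpy as np


def find_columns_outside_unit_range(X, categorical_features):
    """Return the indices of continuous columns with values outside [0, 1]."""
    X = np.asarray(X, dtype=object)
    categorical = set(categorical_features)
    bad_columns = []
    for j in range(X.shape[1]):
        if j in categorical:
            continue  # categorical columns are not range-checked
        column = X[:, j].astype(float)
        if np.nanmin(column) < 0.0 or np.nanmax(column) > 1.0:
            bad_columns.append(j)
    return bad_columns


# Made-up mixed-type data: column 0 is categorical, columns 1 and 2 are continuous.
X_demo = np.array([["a", 0.2, 0.9],
                   ["b", 1.4, 0.0]], dtype=object)
bad = find_columns_outside_unit_range(X_demo, categorical_features=[0])
print(bad)
```

With the arrays from this notebook, the equivalent call would be `find_columns_outside_unit_range(Xtr, [0, 3, 4, 5, 6, 8, 9, 11, 12])`, and an empty list would indicate that every continuous column is already normalised.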
Initializing parameters
[12]:
theta = 0.1  # maximum hyperbox size for continuous features
delta = 0.6  # maximum increase in entropy allowed for each categorical dimension after a hyperbox is expanded
gamma = 1  # sensitivity parameter controlling how quickly membership values decrease for continuous features
alpha = 0.5  # trade-off factor weighting the contributions of categorical and continuous features to the final membership value
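To make the roles of gamma and alpha more concrete, the sketch below uses the ramp-based membership function commonly associated with GFMM hyperboxes over continuous features: the membership is 1 inside the hyperbox and decreases linearly with slope gamma outside it. This is an illustration, not the hbbrain implementation. All function names are hypothetical, the categorical membership is taken as a given number, and the weighted average in `combined_membership` (including which term alpha multiplies) is an assumption.

```python
import numpy as np


def ramp(r, gamma):
    """Ramp threshold function: r * gamma limited to the interval [0, 1]."""
    return np.clip(r * gamma, 0.0, 1.0)


def continuous_membership(x, v, w, gamma=1.0):
    """Membership of point x in the hyperbox with min point v and max point w."""
    x, v, w = (np.asarray(a, dtype=float) for a in (x, v, w))
    beyond_max = 1.0 - ramp(x - w, gamma)   # penalty for exceeding the max point
    below_min = 1.0 - ramp(v - x, gamma)    # penalty for falling below the min point
    # The worst dimension determines the overall membership value.
    return float(np.min(np.minimum(beyond_max, below_min)))


def combined_membership(mem_continuous, mem_categorical, alpha=0.5):
    """ASSUMPTION: a weighted average of the two partial membership values."""
    return alpha * mem_continuous + (1.0 - alpha) * mem_categorical


v, w = [0.2, 0.2], [0.4, 0.6]
inside = continuous_membership([0.3, 0.5], v, w, gamma=1.0)
outside_gamma_1 = continuous_membership([0.5, 0.5], v, w, gamma=1.0)
outside_gamma_4 = continuous_membership([0.5, 0.5], v, w, gamma=4.0)
print(inside, outside_gamma_1, outside_gamma_4)
print(combined_membership(outside_gamma_1, 0.5, alpha=0.5))
```

A larger gamma makes the membership drop faster as a sample moves away from the hyperbox, so the same point just outside the box receives a lower value with gamma = 4 than with gamma = 1.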
Indicate the indices of categorical features in the training data
[13]:
categorical_features = [0, 3, 4, 5, 6, 8, 9, 11, 12]
a. Training the EIOL-GFMM algorithm with the categorical expansion condition in which the maximum entropy change threshold is applied to every categorical dimension
Training
[14]:
eiol_gfmm_clf = ExtendedImprovedOnlineGFMM(theta=theta, gamma=gamma, delta=delta, alpha=alpha)
eiol_gfmm_clf.fit(Xtr, ytr, categorical_features, type_cat_expansion=0)
[14]:
ExtendedImprovedOnlineGFMM(C=array([0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1,
0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1,
1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0,
1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,
0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,
0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0,
0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0,
0, 1, 1,...
[8.60317460e-02, 3.39285714e-01, 5.26315789e-02, 0.00000000e+00,
6.00000000e-02, 2.20600000e-02],
...,
[1.41587302e-01, 2.82142857e-02, 2.98245614e-03, 0.00000000e+00,
7.20000000e-02, 0.00000000e+00],
[6.93174603e-01, 3.03571429e-01, 2.45614035e-01, 4.47761194e-02,
0.00000000e+00, 0.00000000e+00],
[5.38412698e-01, 1.03571429e-02, 5.26315789e-01, 2.98507463e-01,
0.00000000e+00, 1.50000000e-01]]),
delta=0.6, theta=0.1)
[15]:
print("Number of existing hyperboxes = %d"%(eiol_gfmm_clf.get_n_hyperboxes()))
Number of existing hyperboxes = 378
[16]:
print("Training time: %.3f (s)"%eiol_gfmm_clf.elapsed_training_time)
Training time: 0.991 (s)
Prediction
[17]:
from hbbrain.constants import MANHATTAN_DIS, PROBABILITY_MEASURE
Predict the class labels of the input samples. For samples located on the decision boundaries, use a probability measure based on the number of samples included in the winner hyperboxes.
[18]:
y_pred = eiol_gfmm_clf.predict(Xtest, PROBABILITY_MEASURE)
acc = accuracy_score(ytest, y_pred)
print(f'Accuracy = {acc * 100:.2f}%')
Accuracy = 82.44%
Predict the class labels of the input samples. For samples located on the decision boundaries, use the Manhattan distance measure (applied only to continuous features).
[19]:
y_pred = eiol_gfmm_clf.predict(Xtest, MANHATTAN_DIS)
acc = accuracy_score(ytest, y_pred)
print(f'Accuracy (Manhattan distance for samples on the decision boundaries) = {acc * 100:.2f}%')
Accuracy (Manhattan distance for samples on the decision boundaries) = 78.63%
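The two boundary-handling options differ only in how a tie is resolved when hyperboxes of different classes give the same maximum membership value. The sketch below is one reading of the descriptions above, not the hbbrain code: the function names are hypothetical, and both the use of sample-count shares and the use of hyperbox centres for the Manhattan distance are assumptions.

```python
import numpy as np


def break_tie_by_sample_count(classes, n_samples):
    """Pick the class whose tied hyperboxes contain the largest share of samples."""
    classes = np.asarray(classes)
    n_samples = np.asarray(n_samples, dtype=float)
    labels = np.unique(classes)
    shares = np.array([n_samples[classes == c].sum() for c in labels]) / n_samples.sum()
    return int(labels[np.argmax(shares)])


def break_tie_by_manhattan(x_continuous, classes, v, w):
    """Pick the class of the tied hyperbox whose centre is closest in L1 distance."""
    x = np.asarray(x_continuous, dtype=float)
    centres = (np.asarray(v, dtype=float) + np.asarray(w, dtype=float)) / 2.0
    distances = np.abs(centres - x).sum(axis=1)
    return int(np.asarray(classes)[np.argmin(distances)])


# Two tied hyperboxes, one per class (made-up values).
tied_classes = [0, 1]
tied_counts = [3, 5]
tied_v = [[0.1, 0.1], [0.5, 0.5]]
tied_w = [[0.3, 0.3], [0.7, 0.7]]

by_count = break_tie_by_sample_count(tied_classes, tied_counts)
by_distance = break_tie_by_manhattan([0.2, 0.2], tied_classes, tied_v, tied_w)
print(by_count, by_distance)
```

In this toy case the two rules disagree, which mirrors why the two prediction modes can report different accuracies on the same test set.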
Explaining the predicted result for the input sample by showing membership values and hyperboxes for each class
[20]:
sample_need_explain = 1
y_pred_input_0, mem_val_classes, min_points_classes, max_points_classes, dict_cat_bound_classes = eiol_gfmm_clf.get_sample_explanation(Xtest[sample_need_explain])
print("Explain samples:")
print("Membership values for classes: ", mem_val_classes)
print("Predicted class = ", y_pred_input_0)
print("Minimum continuous points of the selected hyperbox for each class: ", min_points_classes)
print("Maximum continuous points of the selected hyperbox for each class: ", max_points_classes)
print("Categorical bounds of the selected hyperbox for each class: ", dict_cat_bound_classes)
Explain samples:
Membership values for classes: {0: 0.8441127694859039, 1: 0.9191765873015874}
Predicted class = 1
Minimum continuous points of the selected hyperbox for each class: {0: array([0.13888889, 0.30357143, 0.06140351, 0.14925373, 0.04 ,
0.0099 ]), 1: array([0.08603175, 0.30660714, 0.02631579, 0.10447761, 0.048 ,
0. ])}
Maximum continuous points of the selected hyperbox for each class: {0: array([0.13888889, 0.30357143, 0.06140351, 0.14925373, 0.04 ,
0.0099 ]), 1: array([0.08603175, 0.30660714, 0.02631579, 0.10447761, 0.048 ,
0. ])}
Categorical bounds of the selected hyperbox for each class: {0: array([{'a': 1}, {'u': 1}, {'g': 1}, {'q': 1}, {'v': 1}, {'t': 1},
{'t': 1}, {'f': 1}, {'g': 1}], dtype=object), 1: array([{'a': 1}, {'u': 1}, {'g': 1}, {'cc': 1}, {'h': 1}, {'t': 1},
{'t': 1}, {'f': 1}, {'g': 1}], dtype=object)}
Apply pruning to the trained classifier
[21]:
acc_threshold = 0.5  # minimum accuracy a hyperbox must achieve on the validation set to be retained
keep_empty_boxes = False  # discard hyperboxes that do not take part in any prediction on the validation set
# handle samples located on the decision boundaries with a probability measure based on the number of samples included in each hyperbox
type_boundary_handling = PROBABILITY_MEASURE
eiol_gfmm_clf.simple_pruning(Xval, yval, acc_threshold, keep_empty_boxes, type_boundary_handling)
[21]:
ExtendedImprovedOnlineGFMM(C=array([0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1,
0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0,
1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0,
1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0,
0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1,
1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0,
1, 1, 1,...
[8.60317460e-02, 3.39285714e-01, 5.26315789e-02, 0.00000000e+00,
6.00000000e-02, 2.20600000e-02],
...,
[1.41587302e-01, 2.82142857e-02, 2.98245614e-03, 0.00000000e+00,
7.20000000e-02, 0.00000000e+00],
[6.93174603e-01, 3.03571429e-01, 2.45614035e-01, 4.47761194e-02,
0.00000000e+00, 0.00000000e+00],
[5.38412698e-01, 1.03571429e-02, 5.26315789e-01, 2.98507463e-01,
0.00000000e+00, 1.50000000e-01]]),
delta=0.6, theta=0.1)
[22]:
print('Number of hyperboxes after pruning = %d'%eiol_gfmm_clf.get_n_hyperboxes())
Number of hyperboxes after pruning = 358
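The pruning step can be pictured as a per-hyperbox accuracy filter on the validation set. The sketch below illustrates that idea under stated assumptions and is not the `simple_pruning` implementation: the function name is hypothetical, the inputs (the winning hyperbox for each validation sample and whether that prediction was correct) are assumed to be available, and the comparison with the threshold is assumed to be inclusive.

```python
import numpy as np


def select_boxes_to_keep(winner_ids, is_correct, n_boxes,
                         acc_threshold=0.5, keep_empty_boxes=False):
    """Return a boolean mask of the hyperboxes that survive pruning."""
    winner_ids = np.asarray(winner_ids)
    is_correct = np.asarray(is_correct, dtype=bool)
    # How often each hyperbox was the winner, and how often it was right.
    used = np.bincount(winner_ids, minlength=n_boxes)
    hits = np.bincount(winner_ids, weights=is_correct.astype(float), minlength=n_boxes)
    keep = np.zeros(n_boxes, dtype=bool)
    nonempty = used > 0
    keep[nonempty] = hits[nonempty] / used[nonempty] >= acc_threshold
    # Hyperboxes that never won on the validation set have no accuracy estimate.
    keep[~nonempty] = keep_empty_boxes
    return keep


# Made-up validation results for three hyperboxes.
winners = [0, 0, 1, 1, 1]
correct = [True, True, False, False, True]
mask_drop_empty = select_boxes_to_keep(winners, correct, n_boxes=3)
mask_keep_empty = select_boxes_to_keep(winners, correct, n_boxes=3, keep_empty_boxes=True)
print(mask_drop_empty, mask_keep_empty)
```

Hyperbox 1 is removed because it is right only once in three uses, and hyperbox 2 survives only when `keep_empty_boxes` is set, since it never wins on the validation samples.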
Make predictions after pruning
Predict the class labels of the input samples. For samples located on the decision boundaries, use a probability measure based on the number of samples included in the winner hyperboxes.
[23]:
y_pred_2 = eiol_gfmm_clf.predict(Xtest, PROBABILITY_MEASURE)
acc = accuracy_score(ytest, y_pred_2)
print(f'Accuracy after pruning = {acc * 100:.2f}%')
Accuracy after pruning = 83.21%
Predict the class labels of the input samples. For samples located on the decision boundaries, use the Manhattan distance measure (applied only to continuous features).
[24]:
y_pred_2 = eiol_gfmm_clf.predict(Xtest, MANHATTAN_DIS)
acc = accuracy_score(ytest, y_pred_2)
print(f'Accuracy (Manhattan distance for samples on the decision boundaries) = {acc * 100:.2f}%')
Accuracy (Manhattan distance for samples on the decision boundaries) = 79.39%
b. Training the EIOL-GFMM algorithm with the categorical expansion condition in which the maximum entropy change threshold is applied to the average entropy change over all categorical features
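The difference between the two expansion conditions (options a and b) can be sketched with plain Python. The code below is an illustration under assumptions, not the library's computation: it treats each categorical bound as a dictionary of value counts, measures the change in Shannon entropy (base 2) when one new value is added, and compares either every change or their average with delta. The function names and the exact entropy formulation are hypothetical.

```python
import math


def entropy(counts):
    """Shannon entropy (base 2) of a dictionary of value counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c > 0)


def entropy_changes(cat_bounds, new_values):
    """Entropy change in each categorical dimension after adding one new value."""
    changes = []
    for counts, value in zip(cat_bounds, new_values):
        updated = dict(counts)
        updated[value] = updated.get(value, 0) + 1
        changes.append(entropy(updated) - entropy(counts))
    return changes


def can_expand(cat_bounds, new_values, delta, use_average=False):
    """Check the expansion condition per dimension or on the average change."""
    changes = entropy_changes(cat_bounds, new_values)
    if use_average:
        return sum(changes) / len(changes) <= delta
    return all(change <= delta for change in changes)


# Made-up hyperbox with two categorical dimensions and one incoming sample.
bounds = [{"a": 2}, {"u": 2}]
sample = ["b", "u"]
changes = entropy_changes(bounds, sample)
per_dimension_ok = can_expand(bounds, sample, delta=0.6, use_average=False)
average_ok = can_expand(bounds, sample, delta=0.6, use_average=True)
print(changes, per_dimension_ok, average_ok)
```

Since an average can never exceed the largest individual change, the averaged condition accepts every expansion that the per-dimension condition accepts and possibly more. That is consistent with this example, where option b ends up with fewer hyperboxes than option a.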
Training
[25]:
eiol_gfmm_clf = ExtendedImprovedOnlineGFMM(theta=theta, gamma=gamma, delta=delta, alpha=alpha)
eiol_gfmm_clf.fit(Xtr, ytr, categorical_features, type_cat_expansion=1)
[25]:
ExtendedImprovedOnlineGFMM(C=array([0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1,
1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1,
1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0,
1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1,
1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1,
1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0,
1, 0, 1,...
[2.67142857e-01, 3.80892857e-01, 2.98245614e-03, 1.79104478e-01,
6.45000000e-02, 3.00000000e-05],
[7.48730159e-01, 1.78571429e-01, 1.40350877e-01, 5.97014925e-02,
0.00000000e+00, 9.90000000e-04],
[5.33015873e-01, 2.32142857e-01, 3.50877193e-02, 0.00000000e+00,
0.00000000e+00, 2.28000000e-03],
[5.38412698e-01, 1.03571429e-02, 5.26315789e-01, 2.98507463e-01,
0.00000000e+00, 1.50000000e-01]]),
delta=0.6, theta=0.1)
[26]:
print("Number of existing hyperboxes = %d"%(eiol_gfmm_clf.get_n_hyperboxes()))
Number of existing hyperboxes = 159
[27]:
print("Training time: %.3f (s)"%eiol_gfmm_clf.elapsed_training_time)
Training time: 0.256 (s)
Prediction
Predict the class labels of the input samples. For samples located on the decision boundaries, use a probability measure based on the number of samples included in the winner hyperboxes.
[28]:
y_pred = eiol_gfmm_clf.predict(Xtest, PROBABILITY_MEASURE)
acc = accuracy_score(ytest, y_pred)
print(f'Accuracy = {acc * 100:.2f}%')
Accuracy = 83.97%
Predict the class labels of the input samples. For samples located on the decision boundaries, use the Manhattan distance measure (applied only to continuous features).
[29]:
y_pred = eiol_gfmm_clf.predict(Xtest, MANHATTAN_DIS)
acc = accuracy_score(ytest, y_pred)
print(f'Accuracy (Manhattan distance for samples on the decision boundaries) = {acc * 100:.2f}%')
Accuracy (Manhattan distance for samples on the decision boundaries) = 80.92%
Explaining the predicted result for the input sample by showing membership values and hyperboxes for each class
[30]:
sample_need_explain = 1
y_pred_input_0, mem_val_classes, min_points_classes, max_points_classes, dict_cat_bound_classes = eiol_gfmm_clf.get_sample_explanation(Xtest[sample_need_explain])
print("Explain samples:")
print("Membership values for classes: ", mem_val_classes)
print("Predicted class = ", y_pred_input_0)
print("Minimum continuous points of the selected hyperbox for each class: ", min_points_classes)
print("Maximum continuous points of the selected hyperbox for each class: ", max_points_classes)
print("Categorical bounds of the selected hyperbox for each class: ", dict_cat_bound_classes)
Explain samples:
Membership values for classes: {0: 0.818407960199005, 1: 0.8854166666666667}
Predicted class = 1
Minimum continuous points of the selected hyperbox for each class: {0: array([6.07936508e-02, 3.57142857e-01, 4.38596491e-03, 1.49253731e-02,
0.00000000e+00, 1.00000000e-05]), 1: array([1.46825397e-01, 4.19642857e-01, 1.75438596e-02, 1.49253731e-02,
6.00000000e-02, 1.10000000e-04])}
Maximum continuous points of the selected hyperbox for each class: {0: array([0.15079365, 0.45089286, 0.03508772, 0.02985075, 0.06 ,
0.05552 ]), 1: array([0.21698413, 0.51785714, 0.10824561, 0.02985075, 0.15 ,
0.00551 ])}
Categorical bounds of the selected hyperbox for each class: {0: array([{'b': 2, 'a': 1}, {'u': 3}, {'g': 3}, {'w': 2, 'c': 1},
{'h': 1, 'v': 2}, {'f': 3}, {'t': 3}, {'f': 3}, {'g': 3}],
dtype=object), 1: array([{'a': 2}, {'u': 2}, {'g': 2}, {'x': 2}, {'h': 2}, {'t': 2},
{'t': 2}, {'t': 1, 'f': 1}, {'g': 2}], dtype=object)}
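The categorical bounds in the explanation are dictionaries of counts per categorical value. Assuming each count is the number of training samples with that value absorbed by the hyperbox, converting the counts to relative frequencies can make the explanation easier to read. The helper name below is hypothetical, and the input reuses the first four categorical bounds printed for class 0 above.

```python
def to_relative_frequencies(cat_bounds):
    """Convert each dictionary of value counts into relative frequencies."""
    frequencies = []
    for counts in cat_bounds:
        total = sum(counts.values())
        frequencies.append({value: count / total for value, count in counts.items()})
    return frequencies


# First four categorical bounds of the class 0 hyperbox from the output above.
bounds_class_0 = [{"b": 2, "a": 1}, {"u": 3}, {"g": 3}, {"w": 2, "c": 1}]
freq_class_0 = to_relative_frequencies(bounds_class_0)
for dimension, freq in enumerate(freq_class_0):
    print(dimension, freq)
```

A dimension with a single value at frequency 1.0 means the hyperbox has only ever seen that value there, whereas a split such as two thirds against one third indicates a dimension in which the hyperbox has been expanded to cover several values.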
Apply pruning to the trained classifier
[31]:
acc_threshold = 0.5  # minimum accuracy a hyperbox must achieve on the validation set to be retained
keep_empty_boxes = False  # discard hyperboxes that do not take part in any prediction on the validation set
# handle samples located on the decision boundaries with a probability measure based on the number of samples included in each hyperbox
type_boundary_handling = PROBABILITY_MEASURE
eiol_gfmm_clf.simple_pruning(Xval, yval, acc_threshold, keep_empty_boxes, type_boundary_handling)
[31]:
ExtendedImprovedOnlineGFMM(C=array([0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1,
0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1,
0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1]),
D=array([[{'a': 3, 'b': 12}, {'u': 10, 'y': 5}, {'g': 10, 'p': 5},
{'q': 1, 'w': 4, 'k': 5, 'c': 2, 'i': 1, 'x': 1, 'm': 1},
{'v': 11, 'h': 3, 'ff': 1}, {'f': 15}, {'t': 5, 'f': 10},
{'t': 6, 'f': 9}, {'g': 14, 's': 1}],
[{'b': 2, 'a': 3}, {'u': 4...
[4.35238095e-01, 2.32142857e-01, 1.75438596e-02, 4.47761194e-02,
1.14000000e-01, 0.00000000e+00],
[5.46349206e-01, 1.25000000e-01, 1.22807018e-01, 0.00000000e+00,
1.15000000e-01, 0.00000000e+00],
[6.09841270e-01, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00],
[7.48730159e-01, 1.78571429e-01, 1.40350877e-01, 5.97014925e-02,
0.00000000e+00, 9.90000000e-04]]),
delta=0.6, theta=0.1)
Make predictions after pruning
Predict the class labels of the input samples. For samples located on the decision boundaries, use a probability measure based on the number of samples included in the winner hyperboxes.
[32]:
y_pred = eiol_gfmm_clf.predict(Xtest, PROBABILITY_MEASURE)
acc = accuracy_score(ytest, y_pred)
print(f'Accuracy = {acc * 100:.2f}%')
Accuracy = 82.44%
Predict the class labels of the input samples. For samples located on the decision boundaries, use the Manhattan distance measure (applied only to continuous features).
[33]:
y_pred = eiol_gfmm_clf.predict(Xtest, MANHATTAN_DIS)
acc = accuracy_score(ytest, y_pred)
print(f'Accuracy (Manhattan distance for samples on the decision boundaries) = {acc * 100:.2f}%')
Accuracy (Manhattan distance for samples on the decision boundaries) = 82.44%