{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Extended Improved Online Learning Algorithm with Mixed-Attribute Data for GFMM" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This example shows how to use the general fuzzy min-max neural network trained by the extended improved incremental learning algorithm for mixed-attribute data (EIOL-GFMM).\n", "\n", "Note that the numerical features in the training and testing datasets must be in the range of [0, 1] because GFMM classifiers require features in the unit cube. Therefore, continuous features need to be normalised before training. Categorical features need no preprocessing because EIOL-GFMM does not require any categorical feature encoding." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Execute directly from the Python file" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "import warnings\n", "warnings.filterwarnings('ignore')\n", "from sklearn.metrics import accuracy_score" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Get the path to this Jupyter notebook file" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'C:\\\\hyperbox-brain\\\\examples\\\\mixed_data'" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "this_notebook_dir = os.path.dirname(os.path.abspath(\"__file__\"))\n", "this_notebook_dir" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Get the home folder of the Hyperbox-Brain project" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "WindowsPath('C:/hyperbox-brain')" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pathlib import Path\n", "project_dir = Path(this_notebook_dir).parent.parent\n", "project_dir" ] }, { "cell_type": 
"markdown", "metadata": {}, "source": [ "#### Create the path to the Python file containing the implementation of the GFMM classifier using the extended improved online learning algorithm for mixed attribute data" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'C:\\\\hyperbox-brain\\\\hbbrain\\\\mixed_data\\\\eiol_gfmm.py'" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "eiol_gfmm_file_path = os.path.join(project_dir, Path(\"hbbrain/mixed_data/eiol_gfmm.py\"))\n", "eiol_gfmm_file_path" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Run the found file by showing the execution directions" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "usage: eiol_gfmm.py [-h] -training_file TRAINING_FILE -testing_file\n", " TESTING_FILE -categorical_features CATEGORICAL_FEATURES\n", " [--theta THETA] [--delta DELTA] [--gamma GAMMA]\n", " [--alpha ALPHA]\n", "\n", "The description of parameters\n", "\n", "required arguments:\n", " -training_file TRAINING_FILE\n", " A required argument for the path to training data file\n", " (including file name)\n", " -testing_file TESTING_FILE\n", " A required argument for the path to testing data file\n", " (including file name)\n", " -categorical_features CATEGORICAL_FEATURES\n", " Indices of categorical features\n", "\n", "optional arguments:\n", " --theta THETA Maximum hyperbox size (in the range of (0, 1])\n", " (default: 0.5)\n", " --delta DELTA Maximum changing entropy for categorical features (in\n", " the range of (0, 1]) (default: 0.5)\n", " --gamma GAMMA A sensitivity parameter describing the speed of\n", " decreasing of the membership function in each\n", " continous dimension (larger than 0) (default: 1)\n", " --alpha ALPHA The trade-off weighting factor between categorical\n", " features and numerical features for membership 
values\n", " (in the range of [0, 1]) (default: 0.5)\n" ] } ], "source": [ "!python \"{eiol_gfmm_file_path}\" -h" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Create the path to mixed-attribute training and testing datasets stored in the dataset folder.\n", "This example uses the japanese_credit dataset for illustration purposes. The continuous features in this dataset were normalised into the range of [0, 1], while categorical features were kept unchanged." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'C:\\\\hyperbox-brain\\\\dataset\\\\japanese_credit_train.csv'" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "training_data_file = os.path.join(project_dir, Path(\"dataset/japanese_credit_train.csv\"))\n", "training_data_file" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'C:\\\\hyperbox-brain\\\\dataset\\\\japanese_credit_test.csv'" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "testing_data_file = os.path.join(project_dir, Path(\"dataset/japanese_credit_test.csv\"))\n", "testing_data_file" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Run a demo program" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of hyperboxes = 378\n", "Testing accuracy = 82.44%\n" ] } ], "source": [ "!python \"{eiol_gfmm_file_path}\" -training_file \"{training_data_file}\" -testing_file \"{testing_data_file}\" -categorical_features \"[0, 3, 4, 5, 6, 8, 9, 11,12]\" --theta 0.1 --delta 0.6 --gamma 1 --alpha 0.5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. 
Using the EIOL-GFMM algorithm to train a GFMM classifier for mixed-attribute data through its init, fit, and predict functions" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "from hbbrain.mixed_data.eiol_gfmm import ExtendedImprovedOnlineGFMM\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Create mixed-attribute training, validation, and testing datasets.\n", "This example uses the japanese_credit dataset for illustration purposes. The continuous features in this dataset were normalised into the range of [0, 1], while categorical features were kept unchanged." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "df_train = pd.read_csv(training_data_file, header=None)\n", "df_test = pd.read_csv(testing_data_file, header=None)\n", "\n", "Xy_train = df_train.to_numpy()\n", "Xy_test = df_test.to_numpy()\n", "\n", "Xtr = Xy_train[:, :-1]\n", "ytr = Xy_train[:, -1].astype(int)\n", "\n", "Xtest = Xy_test[:, :-1]\n", "ytest = Xy_test[:, -1].astype(int)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "val_data_file = os.path.join(project_dir, Path(\"dataset/japanese_credit_val.csv\"))\n", "df_val = pd.read_csv(val_data_file, header=None)\n", "Xy_val = df_val.to_numpy()\n", "Xval = Xy_val[:, :-1]\n", "yval = Xy_val[:, -1].astype(int)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Initializing parameters" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "theta = 0.1 # maximum hyperbox size for continuous features\n", "delta = 0.6 # maximum increase in entropy allowed for each categorical dimension after a hyperbox is expanded\n", "gamma = 1 # speed of decreasing degree in the membership values of continuous features\n", "alpha = 0.5 # the trade-off factor for the contribution of categorical features and continuous features to 
final membership value" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Indicate the indices of categorical features in the training data" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "categorical_features = [0, 3, 4, 5, 6, 8, 9, 11, 12]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### a. Training the EIOL-GFMM algorithm with the maximum entropy-change threshold applied to every categorical dimension" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Training" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ExtendedImprovedOnlineGFMM(C=array([0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1,\n", " 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1,\n", " 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0,\n", " 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,\n", " 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,\n", " 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0,\n", " 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0,\n", " 0, 1, 1,...\n", " [8.60317460e-02, 3.39285714e-01, 5.26315789e-02, 0.00000000e+00,\n", " 6.00000000e-02, 2.20600000e-02],\n", " ...,\n", " [1.41587302e-01, 2.82142857e-02, 2.98245614e-03, 0.00000000e+00,\n", " 7.20000000e-02, 0.00000000e+00],\n", " [6.93174603e-01, 3.03571429e-01, 2.45614035e-01, 4.47761194e-02,\n", " 0.00000000e+00, 0.00000000e+00],\n", " [5.38412698e-01, 1.03571429e-02, 5.26315789e-01, 2.98507463e-01,\n", " 0.00000000e+00, 1.50000000e-01]]),\n", " delta=0.6, theta=0.1)" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "eiol_gfmm_clf = ExtendedImprovedOnlineGFMM(theta=theta, gamma=gamma, delta=delta, alpha=alpha)\n", "eiol_gfmm_clf.fit(Xtr, ytr, 
categorical_features, type_cat_expansion=0)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of existing hyperboxes = 378\n" ] } ], "source": [ "print(\"Number of existing hyperboxes = %d\"%(eiol_gfmm_clf.get_n_hyperboxes()))" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training time: 0.991 (s)\n" ] } ], "source": [ "print(\"Training time: %.3f (s)\"%eiol_gfmm_clf.elapsed_training_time)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Prediction" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "from hbbrain.constants import MANHATTAN_DIS, PROBABILITY_MEASURE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Predict the class label for input samples using a probability measure based on the number of samples included inside the winner hyperboxes for the samples located on the decision boundaries" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy = 82.44%\n" ] } ], "source": [ "y_pred = eiol_gfmm_clf.predict(Xtest, PROBABILITY_MEASURE)\n", "acc = accuracy_score(ytest, y_pred)\n", "print(f'Accuracy = {acc * 100: .2f}%')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Predict the class label for input samples using Manhattan distance measure (applied only for continuous features) for the samples located on the decision boundaries" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy (Manhattan distance for samples on the decision boundaries) = 78.63%\n" ] } ], "source": [ "y_pred = eiol_gfmm_clf.predict(Xtest, MANHATTAN_DIS)\n", "acc = accuracy_score(ytest, y_pred)\n", "print(f'Accuracy (Manhattan distance 
for samples on the decision boundaries) = {acc * 100: .2f}%')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Explaining the predicted result for the input sample by showing membership values and hyperboxes for each class" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Explain samples:\n", "Membership values for classes: {0: 0.8441127694859039, 1: 0.9191765873015874}\n", "Predicted class = 1\n", "Minimum continuous points of the selected hyperbox for each class: {0: array([0.13888889, 0.30357143, 0.06140351, 0.14925373, 0.04 ,\n", " 0.0099 ]), 1: array([0.08603175, 0.30660714, 0.02631579, 0.10447761, 0.048 ,\n", " 0. ])}\n", "Maximum continuous points of the selected hyperbox for each class: {0: array([0.13888889, 0.30357143, 0.06140351, 0.14925373, 0.04 ,\n", " 0.0099 ]), 1: array([0.08603175, 0.30660714, 0.02631579, 0.10447761, 0.048 ,\n", " 0. ])}\n", "Categorical bounds of the selected hyperbox for each class: {0: array([{'a': 1}, {'u': 1}, {'g': 1}, {'q': 1}, {'v': 1}, {'t': 1},\n", " {'t': 1}, {'f': 1}, {'g': 1}], dtype=object), 1: array([{'a': 1}, {'u': 1}, {'g': 1}, {'cc': 1}, {'h': 1}, {'t': 1},\n", " {'t': 1}, {'f': 1}, {'g': 1}], dtype=object)}\n" ] } ], "source": [ "sample_need_explain = 1\n", "y_pred_input_0, mem_val_classes, min_points_classes, max_points_classes, dict_cat_bound_classes = eiol_gfmm_clf.get_sample_explanation(Xtest[sample_need_explain])\n", "print(\"Explain samples:\")\n", "print(\"Membership values for classes: \", mem_val_classes)\n", "print(\"Predicted class = \", y_pred_input_0)\n", "print(\"Minimum continuous points of the selected hyperbox for each class: \", min_points_classes)\n", "print(\"Maximum continuous points of the selected hyperbox for each class: \", max_points_classes)\n", "print(\"Categorical bounds of the selected hyperbox for each class: \", dict_cat_bound_classes)" ] }, { "cell_type": "markdown", "metadata": 
{}, "source": [ "### Apply pruning for the trained classifier" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ExtendedImprovedOnlineGFMM(C=array([0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1,\n", " 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0,\n", " 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0,\n", " 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0,\n", " 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,\n", " 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1,\n", " 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0,\n", " 1, 1, 1,...\n", " [8.60317460e-02, 3.39285714e-01, 5.26315789e-02, 0.00000000e+00,\n", " 6.00000000e-02, 2.20600000e-02],\n", " ...,\n", " [1.41587302e-01, 2.82142857e-02, 2.98245614e-03, 0.00000000e+00,\n", " 7.20000000e-02, 0.00000000e+00],\n", " [6.93174603e-01, 3.03571429e-01, 2.45614035e-01, 4.47761194e-02,\n", " 0.00000000e+00, 0.00000000e+00],\n", " [5.38412698e-01, 1.03571429e-02, 5.26315789e-01, 2.98507463e-01,\n", " 0.00000000e+00, 1.50000000e-01]]),\n", " delta=0.6, theta=0.1)" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "acc_threshold = 0.5 # minimum accuracy of hyperboxes being retained\n", "keep_empty_boxes = False # do not keep the hyperboxes which do not join the prediction process on the validation set\n", "# using a probability measure based on the number of samples included in the hyperbox for handling samples located on the boundary\n", "type_boundary_handling = PROBABILITY_MEASURE\n", "eiol_gfmm_clf.simple_pruning(Xval, yval, acc_threshold, keep_empty_boxes, type_boundary_handling)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of hyperboxes after pruning = 358\n" ] } ], "source": [ "print('Number of 
hyperboxes after pruning = %d'%eiol_gfmm_clf.get_n_hyperboxes())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Make prediction after pruning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Predict the class label for input samples using a probability measure based on the number of samples included inside the winner hyperboxes for the samples located on the decision boundaries" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy after pruning = 83.21%\n" ] } ], "source": [ "y_pred_2 = eiol_gfmm_clf.predict(Xtest, PROBABILITY_MEASURE)\n", "acc = accuracy_score(ytest, y_pred_2)\n", "print(f'Accuracy after pruning = {acc * 100: .2f}%')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Predict the class label for input samples using Manhattan distance measure (applied only for continuous features) for the samples located on the decision boundaries" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy (Manhattan distance for samples on the decision boundaries) = 79.39%\n" ] } ], "source": [ "y_pred_2 = eiol_gfmm_clf.predict(Xtest, MANHATTAN_DIS)\n", "acc = accuracy_score(ytest, y_pred_2)\n", "print(f'Accuracy (Manhattan distance for samples on the decision boundaries) = {acc * 100: .2f}%')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### b. Training the EIOL-GFMM algorithm with the maximum entropy-change threshold applied to the average entropy change over all categorical features" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Training" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ExtendedImprovedOnlineGFMM(C=array([0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1,\n", " 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1,\n", " 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1,\n", " 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0,\n", " 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1,\n", " 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1,\n", " 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0,\n", " 1, 0, 1,...\n", " [2.67142857e-01, 3.80892857e-01, 2.98245614e-03, 1.79104478e-01,\n", " 6.45000000e-02, 3.00000000e-05],\n", " [7.48730159e-01, 1.78571429e-01, 1.40350877e-01, 5.97014925e-02,\n", " 0.00000000e+00, 9.90000000e-04],\n", " [5.33015873e-01, 2.32142857e-01, 3.50877193e-02, 0.00000000e+00,\n", " 0.00000000e+00, 2.28000000e-03],\n", " [5.38412698e-01, 1.03571429e-02, 5.26315789e-01, 2.98507463e-01,\n", " 0.00000000e+00, 1.50000000e-01]]),\n", " delta=0.6, theta=0.1)" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "eiol_gfmm_clf = ExtendedImprovedOnlineGFMM(theta=theta, gamma=gamma, delta=delta, alpha=alpha)\n", "eiol_gfmm_clf.fit(Xtr, ytr, categorical_features, type_cat_expansion=1)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of existing hyperboxes = 159\n" ] } ], "source": [ "print(\"Number of existing hyperboxes = %d\"%(eiol_gfmm_clf.get_n_hyperboxes()))" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training time: 0.256 (s)\n" ] } ], "source": [ "print(\"Training time: %.3f 
(s)\"%eiol_gfmm_clf.elapsed_training_time)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Prediction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Predict the class label for input samples using a probability measure based on the number of samples included inside the winner hyperboxes for the samples located on the decision boundaries" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy = 83.97%\n" ] } ], "source": [ "y_pred = eiol_gfmm_clf.predict(Xtest, PROBABILITY_MEASURE)\n", "acc = accuracy_score(ytest, y_pred)\n", "print(f'Accuracy = {acc * 100: .2f}%')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Predict the class label for input samples using Manhattan distance measure (applied only for continuous features) for the samples located on the decision boundaries" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy (Manhattan distance for samples on the decision boundaries) = 80.92%\n" ] } ], "source": [ "y_pred = eiol_gfmm_clf.predict(Xtest, MANHATTAN_DIS)\n", "acc = accuracy_score(ytest, y_pred)\n", "print(f'Accuracy (Manhattan distance for samples on the decision boundaries) = {acc * 100: .2f}%')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Explaining the predicted result for the input sample by showing membership values and hyperboxes for each class" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Explain samples:\n", "Membership values for classes: {0: 0.818407960199005, 1: 0.8854166666666667}\n", "Predicted class = 1\n", "Minimum continuous points of the selected hyperbox for each class: {0: array([6.07936508e-02, 3.57142857e-01, 4.38596491e-03, 1.49253731e-02,\n", " 0.00000000e+00, 1.00000000e-05]), 1: 
array([1.46825397e-01, 4.19642857e-01, 1.75438596e-02, 1.49253731e-02,\n", " 6.00000000e-02, 1.10000000e-04])}\n", "Maximum continuous points of the selected hyperbox for each class: {0: array([0.15079365, 0.45089286, 0.03508772, 0.02985075, 0.06 ,\n", " 0.05552 ]), 1: array([0.21698413, 0.51785714, 0.10824561, 0.02985075, 0.15 ,\n", " 0.00551 ])}\n", "Categorical bounds of the selected hyperbox for each class: {0: array([{'b': 2, 'a': 1}, {'u': 3}, {'g': 3}, {'w': 2, 'c': 1},\n", " {'h': 1, 'v': 2}, {'f': 3}, {'t': 3}, {'f': 3}, {'g': 3}],\n", " dtype=object), 1: array([{'a': 2}, {'u': 2}, {'g': 2}, {'x': 2}, {'h': 2}, {'t': 2},\n", " {'t': 2}, {'t': 1, 'f': 1}, {'g': 2}], dtype=object)}\n" ] } ], "source": [ "sample_need_explain = 1\n", "y_pred_input_0, mem_val_classes, min_points_classes, max_points_classes, dict_cat_bound_classes = eiol_gfmm_clf.get_sample_explanation(Xtest[sample_need_explain])\n", "print(\"Explain samples:\")\n", "print(\"Membership values for classes: \", mem_val_classes)\n", "print(\"Predicted class = \", y_pred_input_0)\n", "print(\"Minimum continuous points of the selected hyperbox for each class: \", min_points_classes)\n", "print(\"Maximum continuous points of the selected hyperbox for each class: \", max_points_classes)\n", "print(\"Categorical bounds of the selected hyperbox for each class: \", dict_cat_bound_classes)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Apply pruning for the trained classifier" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ExtendedImprovedOnlineGFMM(C=array([0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1,\n", " 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1,\n", " 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1]),\n", " D=array([[{'a': 3, 'b': 12}, {'u': 10, 'y': 5}, {'g': 10, 'p': 5},\n", " {'q': 1, 'w': 4, 'k': 5, 'c': 2, 'i': 1, 'x': 1, 'm': 1},\n", " {'v': 11, 'h': 3, 'ff': 1}, {'f': 15}, {'t': 
5, 'f': 10},\n", " {'t': 6, 'f': 9}, {'g': 14, 's': 1}],\n", " [{'b': 2, 'a': 3}, {'u': 4...\n", " [4.35238095e-01, 2.32142857e-01, 1.75438596e-02, 4.47761194e-02,\n", " 1.14000000e-01, 0.00000000e+00],\n", " [5.46349206e-01, 1.25000000e-01, 1.22807018e-01, 0.00000000e+00,\n", " 1.15000000e-01, 0.00000000e+00],\n", " [6.09841270e-01, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,\n", " 0.00000000e+00, 0.00000000e+00],\n", " [7.48730159e-01, 1.78571429e-01, 1.40350877e-01, 5.97014925e-02,\n", " 0.00000000e+00, 9.90000000e-04]]),\n", " delta=0.6, theta=0.1)" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "acc_threshold = 0.5 # minimum accuracy of hyperboxes being retained\n", "keep_empty_boxes = False # do not keep the hyperboxes which do not join the prediction process on the validation set\n", "# using a probability measure based on the number of samples included in the hyperbox for handling samples located on the boundary\n", "type_boundary_handling = PROBABILITY_MEASURE\n", "eiol_gfmm_clf.simple_pruning(Xval, yval, acc_threshold, keep_empty_boxes, type_boundary_handling)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Make prediction after pruning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Predict the class label for input samples using a probability measure based on the number of samples included inside the winner hyperboxes for the samples located on the decision boundaries" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy = 82.44%\n" ] } ], "source": [ "y_pred = eiol_gfmm_clf.predict(Xtest, PROBABILITY_MEASURE)\n", "acc = accuracy_score(ytest, y_pred)\n", "print(f'Accuracy = {acc * 100: .2f}%')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Predict the class label for input samples using Manhattan distance measure (applied only for continuous features) for 
the samples located on the decision boundaries" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy (Manhattan distance for samples on the decision boundaries) = 82.44%\n" ] } ], "source": [ "y_pred = eiol_gfmm_clf.predict(Xtest, MANHATTAN_DIS)\n", "acc = accuracy_score(ytest, y_pred)\n", "print(f'Accuracy (Manhattan distance for samples on the decision boundaries) = {acc * 100: .2f}%')" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.7" } }, "nbformat": 4, "nbformat_minor": 4 }