{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Enhanced Improved Online Learning Algorithm with Mixed-Attribute Data for GFMM"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This example shows how to use the general fuzzy min-max neural network trained by the extended improved incremental learning algorithm for mixed attribute data (EIOL-GFMM)\n",
    "\n",
    "Note that the numerical features in training and testing datasets must be in the range of [0, 1] because the GFMM classifiers require features in the unit cube. Therefore, continuous features need to be normalised before training. For categorical feature, nothing needs to be done as the EIOL-GFMM does not require any categorical feature encoding methods."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Execute directly from the python file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "from sklearn.metrics import accuracy_score"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Get the path to the this jupyter notebook file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'C:\\\\hyperbox-brain\\\\examples\\\\mixed_data'"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "this_notebook_dir = os.path.dirname(os.path.abspath(\"__file__\"))\n",
    "this_notebook_dir"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Get the home folder of the Hyperbox-Brain project"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "WindowsPath('C:/hyperbox-brain')"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from pathlib import Path\n",
    "project_dir = Path(this_notebook_dir).parent.parent\n",
    "project_dir"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Create the path to the Python file containing the implementation of the GFMM classifier using the extended improved online learning algorithm for mixed attribute data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'C:\\\\hyperbox-brain\\\\hbbrain\\\\mixed_data\\\\eiol_gfmm.py'"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "eiol_gfmm_file_path = os.path.join(project_dir, Path(\"hbbrain/mixed_data/eiol_gfmm.py\"))\n",
    "eiol_gfmm_file_path"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Run the found file by showing the execution directions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "usage: eiol_gfmm.py [-h] -training_file TRAINING_FILE -testing_file\n",
      "                    TESTING_FILE -categorical_features CATEGORICAL_FEATURES\n",
      "                    [--theta THETA] [--delta DELTA] [--gamma GAMMA]\n",
      "                    [--alpha ALPHA]\n",
      "\n",
      "The description of parameters\n",
      "\n",
      "required arguments:\n",
      "  -training_file TRAINING_FILE\n",
      "                        A required argument for the path to training data file\n",
      "                        (including file name)\n",
      "  -testing_file TESTING_FILE\n",
      "                        A required argument for the path to testing data file\n",
      "                        (including file name)\n",
      "  -categorical_features CATEGORICAL_FEATURES\n",
      "                        Indices of categorical features\n",
      "\n",
      "optional arguments:\n",
      "  --theta THETA         Maximum hyperbox size (in the range of (0, 1])\n",
      "                        (default: 0.5)\n",
      "  --delta DELTA         Maximum changing entropy for categorical features (in\n",
      "                        the range of (0, 1]) (default: 0.5)\n",
      "  --gamma GAMMA         A sensitivity parameter describing the speed of\n",
      "                        decreasing of the membership function in each\n",
      "                        continous dimension (larger than 0) (default: 1)\n",
      "  --alpha ALPHA         The trade-off weighting factor between categorical\n",
      "                        features and numerical features for membership values\n",
      "                        (in the range of [0, 1]) (default: 0.5)\n"
     ]
    }
   ],
   "source": [
    "!python \"{eiol_gfmm_file_path}\" -h"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Create the path to mixed-attribute training and testing datasets stored in the dataset folder.\n",
    "This example uses the japanese_credit dataset for illustration purposes. The continuous features in this dataset were normalised into the range of [0, 1], while categorical features were kept unchanged."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'C:\\\\hyperbox-brain\\\\dataset\\\\japanese_credit_train.csv'"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "training_data_file = os.path.join(project_dir, Path(\"dataset/japanese_credit_train.csv\"))\n",
    "training_data_file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'C:\\\\hyperbox-brain\\\\dataset\\\\japanese_credit_test.csv'"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "testing_data_file = os.path.join(project_dir, Path(\"dataset/japanese_credit_test.csv\"))\n",
    "testing_data_file"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Run a demo program"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of hyperboxes = 378\n",
      "Testing accuracy =  82.44%\n"
     ]
    }
   ],
   "source": [
    "!python \"{eiol_gfmm_file_path}\" -training_file \"{training_data_file}\" -testing_file \"{testing_data_file}\" -categorical_features \"[0, 3, 4, 5, 6, 8, 9, 11,12]\" --theta 0.1 --delta 0.6 --gamma 1 --alpha 0.5"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Using the EIOL-GFMM algorithm to train a GFMM classifier for mixed-attribute data through its init, fit, and predict functions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "from hbbrain.mixed_data.eiol_gfmm import ExtendedImprovedOnlineGFMM\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Create mixed attribute training, validation, and testing data sets.\n",
    "This example will use the japanese_credit dataset for illustration purposes. The continuous features in this dataset were normalised into the range of [0, 1], while categorical features were kept unchanged."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_train = pd.read_csv(training_data_file, header=None)\n",
    "df_test = pd.read_csv(testing_data_file, header=None)\n",
    "\n",
    "Xy_train = df_train.to_numpy()\n",
    "Xy_test = df_test.to_numpy()\n",
    "\n",
    "Xtr = Xy_train[:, :-1]\n",
    "ytr = Xy_train[:, -1].astype(int)\n",
    "\n",
    "Xtest = Xy_test[:, :-1]\n",
    "ytest = Xy_test[:, -1].astype(int)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "val_data_file = os.path.join(project_dir, Path(\"dataset/japanese_credit_val.csv\"))\n",
    "df_val = pd.read_csv(val_data_file, header=None)\n",
    "Xy_val = df_val.to_numpy()\n",
    "Xval = Xy_val[:, :-1]\n",
    "yval = Xy_val[:, -1].astype(int)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Initializing parameters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "theta = 0.1 # maximum hyperbox size for continuous features\n",
    "delta = 0.6 # The maximum value of the increased entropy degree for each categorical dimension after extended.\n",
    "gamma = 1 # speed of decreasing degree in the membership values of continuous features\n",
    "alpha = 0.5 # the trade-off factor for the contribution of categorical features and continuous features to final membership value"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Indicate the indices of categorical features in the training data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "categorical_features = [0, 3, 4, 5, 6, 8, 9, 11, 12]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### a. Training the EIOL-GFMM algorithm with the categorical feature expansion condition regarding the maximum entropy changing threshold be applied for every categorical dimension"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Training"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "ExtendedImprovedOnlineGFMM(C=array([0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1,\n",
       "       0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1,\n",
       "       1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0,\n",
       "       1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,\n",
       "       0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,\n",
       "       0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0,\n",
       "       0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0,\n",
       "       0, 1, 1,...\n",
       "       [8.60317460e-02, 3.39285714e-01, 5.26315789e-02, 0.00000000e+00,\n",
       "        6.00000000e-02, 2.20600000e-02],\n",
       "       ...,\n",
       "       [1.41587302e-01, 2.82142857e-02, 2.98245614e-03, 0.00000000e+00,\n",
       "        7.20000000e-02, 0.00000000e+00],\n",
       "       [6.93174603e-01, 3.03571429e-01, 2.45614035e-01, 4.47761194e-02,\n",
       "        0.00000000e+00, 0.00000000e+00],\n",
       "       [5.38412698e-01, 1.03571429e-02, 5.26315789e-01, 2.98507463e-01,\n",
       "        0.00000000e+00, 1.50000000e-01]]),\n",
       "                           delta=0.6, theta=0.1)"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "eiol_gfmm_clf = ExtendedImprovedOnlineGFMM(theta=theta, gamma=gamma, delta=delta, alpha=alpha)\n",
    "eiol_gfmm_clf.fit(Xtr, ytr, categorical_features, type_cat_expansion=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of existing hyperboxes = 378\n"
     ]
    }
   ],
   "source": [
    "print(\"Number of existing hyperboxes = %d\"%(eiol_gfmm_clf.get_n_hyperboxes()))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training time: 0.991 (s)\n"
     ]
    }
   ],
   "source": [
    "print(\"Training time: %.3f (s)\"%eiol_gfmm_clf.elapsed_training_time)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Prediction"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "from hbbrain.constants import MANHATTAN_DIS, PROBABILITY_MEASURE"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Predict the class label for input samples using a probability measure based on the number of samples included inside the winner hyperboxes for the samples located on the decision boundaries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy =  82.44%\n"
     ]
    }
   ],
   "source": [
    "y_pred = eiol_gfmm_clf.predict(Xtest, PROBABILITY_MEASURE)\n",
    "acc = accuracy_score(ytest, y_pred)\n",
    "print(f'Accuracy = {acc * 100: .2f}%')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Predict the class label for input samples using Manhattan distance measure (applied only for continuous features) for the samples located on the decision boundaries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy (Manhattan distance for samples on the decision boundaries) =  78.63%\n"
     ]
    }
   ],
   "source": [
    "y_pred = eiol_gfmm_clf.predict(Xtest, MANHATTAN_DIS)\n",
    "acc = accuracy_score(ytest, y_pred)\n",
    "print(f'Accuracy (Manhattan distance for samples on the decision boundaries) = {acc * 100: .2f}%')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Explaining the predicted result for the input sample by showing membership values and hyperboxes for each class"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Explain samples:\n",
      "Membership values for classes:  {0: 0.8441127694859039, 1: 0.9191765873015874}\n",
      "Predicted class =  1\n",
      "Minimum continuous points of the selected hyperbox for each class:  {0: array([0.13888889, 0.30357143, 0.06140351, 0.14925373, 0.04      ,\n",
      "       0.0099    ]), 1: array([0.08603175, 0.30660714, 0.02631579, 0.10447761, 0.048     ,\n",
      "       0.        ])}\n",
      "Maximum continuous points of the selected hyperbox for each class:  {0: array([0.13888889, 0.30357143, 0.06140351, 0.14925373, 0.04      ,\n",
      "       0.0099    ]), 1: array([0.08603175, 0.30660714, 0.02631579, 0.10447761, 0.048     ,\n",
      "       0.        ])}\n",
      "Categorical bounds of the selected hyperbox for each class:  {0: array([{'a': 1}, {'u': 1}, {'g': 1}, {'q': 1}, {'v': 1}, {'t': 1},\n",
      "       {'t': 1}, {'f': 1}, {'g': 1}], dtype=object), 1: array([{'a': 1}, {'u': 1}, {'g': 1}, {'cc': 1}, {'h': 1}, {'t': 1},\n",
      "       {'t': 1}, {'f': 1}, {'g': 1}], dtype=object)}\n"
     ]
    }
   ],
   "source": [
    "sample_need_explain = 1\n",
    "y_pred_input_0, mem_val_classes, min_points_classes, max_points_classes, dict_cat_bound_classes = eiol_gfmm_clf.get_sample_explanation(Xtest[sample_need_explain])\n",
    "print(\"Explain samples:\")\n",
    "print(\"Membership values for classes: \", mem_val_classes)\n",
    "print(\"Predicted class = \", y_pred_input_0)\n",
    "print(\"Minimum continuous points of the selected hyperbox for each class: \", min_points_classes)\n",
    "print(\"Maximum continuous points of the selected hyperbox for each class: \", max_points_classes)\n",
    "print(\"Categorical bounds of the selected hyperbox for each class: \", dict_cat_bound_classes)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Apply pruning for the trained classifier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "ExtendedImprovedOnlineGFMM(C=array([0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1,\n",
       "       0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0,\n",
       "       1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0,\n",
       "       1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0,\n",
       "       0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,\n",
       "       0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1,\n",
       "       1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0,\n",
       "       1, 1, 1,...\n",
       "       [8.60317460e-02, 3.39285714e-01, 5.26315789e-02, 0.00000000e+00,\n",
       "        6.00000000e-02, 2.20600000e-02],\n",
       "       ...,\n",
       "       [1.41587302e-01, 2.82142857e-02, 2.98245614e-03, 0.00000000e+00,\n",
       "        7.20000000e-02, 0.00000000e+00],\n",
       "       [6.93174603e-01, 3.03571429e-01, 2.45614035e-01, 4.47761194e-02,\n",
       "        0.00000000e+00, 0.00000000e+00],\n",
       "       [5.38412698e-01, 1.03571429e-02, 5.26315789e-01, 2.98507463e-01,\n",
       "        0.00000000e+00, 1.50000000e-01]]),\n",
       "                           delta=0.6, theta=0.1)"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "acc_threshold = 0.5 # minimum accuracy of hyperboxes being retained\n",
    "keep_empty_boxes = False # do not keep the hyperboxes which do not join the prediction process on the validation set\n",
    "# using a probability measure based on the number of samples included in the hyperbox for handling samples located on the boundary\n",
    "type_boundary_handling = PROBABILITY_MEASURE\n",
    "eiol_gfmm_clf.simple_pruning(Xval, yval, acc_threshold, keep_empty_boxes, type_boundary_handling)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of hyperboxes after pruning = 358\n"
     ]
    }
   ],
   "source": [
    "print('Number of hyperboxes after pruning = %d'%eiol_gfmm_clf.get_n_hyperboxes())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Make prediction after pruning"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Predict the class label for input samples using a probability measure based on the number of samples included inside the winner hyperboxes for the samples located on the decision boundaries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy after pruning =  83.21%\n"
     ]
    }
   ],
   "source": [
    "y_pred_2 = eiol_gfmm_clf.predict(Xtest, PROBABILITY_MEASURE)\n",
    "acc = accuracy_score(ytest, y_pred_2)\n",
    "print(f'Accuracy after pruning = {acc * 100: .2f}%')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Predict the class label for input samples using Manhattan distance measure (applied only for continuous features) for the samples located on the decision boundaries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy (Manhattan distance for samples on the decision boundaries) =  79.39%\n"
     ]
    }
   ],
   "source": [
    "y_pred_2 = eiol_gfmm_clf.predict(Xtest, MANHATTAN_DIS)\n",
    "acc = accuracy_score(ytest, y_pred_2)\n",
    "print(f'Accuracy (Manhattan distance for samples on the decision boundaries) = {acc * 100: .2f}%')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### b. Training the EIOL-GFMM algorithm with the categorical feature expansion condition regarding the maximum entropy changing threshold be applied for the average changing entropy value over all categorical features."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Training"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "ExtendedImprovedOnlineGFMM(C=array([0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1,\n",
       "       1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1,\n",
       "       1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1,\n",
       "       1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0,\n",
       "       1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1,\n",
       "       1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1,\n",
       "       1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0,\n",
       "       1, 0, 1,...\n",
       "       [2.67142857e-01, 3.80892857e-01, 2.98245614e-03, 1.79104478e-01,\n",
       "        6.45000000e-02, 3.00000000e-05],\n",
       "       [7.48730159e-01, 1.78571429e-01, 1.40350877e-01, 5.97014925e-02,\n",
       "        0.00000000e+00, 9.90000000e-04],\n",
       "       [5.33015873e-01, 2.32142857e-01, 3.50877193e-02, 0.00000000e+00,\n",
       "        0.00000000e+00, 2.28000000e-03],\n",
       "       [5.38412698e-01, 1.03571429e-02, 5.26315789e-01, 2.98507463e-01,\n",
       "        0.00000000e+00, 1.50000000e-01]]),\n",
       "                           delta=0.6, theta=0.1)"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "eiol_gfmm_clf = ExtendedImprovedOnlineGFMM(theta=theta, gamma=gamma, delta=delta, alpha=alpha)\n",
    "eiol_gfmm_clf.fit(Xtr, ytr, categorical_features, type_cat_expansion=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of existing hyperboxes = 159\n"
     ]
    }
   ],
   "source": [
    "print(\"Number of existing hyperboxes = %d\"%(eiol_gfmm_clf.get_n_hyperboxes()))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training time: 0.256 (s)\n"
     ]
    }
   ],
   "source": [
    "print(\"Training time: %.3f (s)\"%eiol_gfmm_clf.elapsed_training_time)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Prediction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Predict the class label for input samples using a probability measure based on the number of samples included inside the winner hyperboxes for the samples located on the decision boundaries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy =  83.97%\n"
     ]
    }
   ],
   "source": [
    "y_pred = eiol_gfmm_clf.predict(Xtest, PROBABILITY_MEASURE)\n",
    "acc = accuracy_score(ytest, y_pred)\n",
    "print(f'Accuracy = {acc * 100: .2f}%')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Predict the class label for input samples using Manhattan distance measure (applied only for continuous features) for the samples located on the decision boundaries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy (Manhattan distance for samples on the decision boundaries) =  80.92%\n"
     ]
    }
   ],
   "source": [
    "y_pred = eiol_gfmm_clf.predict(Xtest, MANHATTAN_DIS)\n",
    "acc = accuracy_score(ytest, y_pred)\n",
    "print(f'Accuracy (Manhattan distance for samples on the decision boundaries) = {acc * 100: .2f}%')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Explaining the predicted result for the input sample by showing membership values and hyperboxes for each class"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Explain samples:\n",
      "Membership values for classes:  {0: 0.818407960199005, 1: 0.8854166666666667}\n",
      "Predicted class =  1\n",
      "Minimum continuous points of the selected hyperbox for each class:  {0: array([6.07936508e-02, 3.57142857e-01, 4.38596491e-03, 1.49253731e-02,\n",
      "       0.00000000e+00, 1.00000000e-05]), 1: array([1.46825397e-01, 4.19642857e-01, 1.75438596e-02, 1.49253731e-02,\n",
      "       6.00000000e-02, 1.10000000e-04])}\n",
      "Maximum continuous points of the selected hyperbox for each class:  {0: array([0.15079365, 0.45089286, 0.03508772, 0.02985075, 0.06      ,\n",
      "       0.05552   ]), 1: array([0.21698413, 0.51785714, 0.10824561, 0.02985075, 0.15      ,\n",
      "       0.00551   ])}\n",
      "Categorical bounds of the selected hyperbox for each class:  {0: array([{'b': 2, 'a': 1}, {'u': 3}, {'g': 3}, {'w': 2, 'c': 1},\n",
      "       {'h': 1, 'v': 2}, {'f': 3}, {'t': 3}, {'f': 3}, {'g': 3}],\n",
      "      dtype=object), 1: array([{'a': 2}, {'u': 2}, {'g': 2}, {'x': 2}, {'h': 2}, {'t': 2},\n",
      "       {'t': 2}, {'t': 1, 'f': 1}, {'g': 2}], dtype=object)}\n"
     ]
    }
   ],
   "source": [
    "sample_need_explain = 1\n",
    "y_pred_input_0, mem_val_classes, min_points_classes, max_points_classes, dict_cat_bound_classes = eiol_gfmm_clf.get_sample_explanation(Xtest[sample_need_explain])\n",
    "print(\"Explain samples:\")\n",
    "print(\"Membership values for classes: \", mem_val_classes)\n",
    "print(\"Predicted class = \", y_pred_input_0)\n",
    "print(\"Minimum continuous points of the selected hyperbox for each class: \", min_points_classes)\n",
    "print(\"Maximum continuous points of the selected hyperbox for each class: \", max_points_classes)\n",
    "print(\"Categorical bounds of the selected hyperbox for each class: \", dict_cat_bound_classes)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Apply pruning for the trained classifier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "ExtendedImprovedOnlineGFMM(C=array([0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1,\n",
       "       0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1,\n",
       "       0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1]),\n",
       "                           D=array([[{'a': 3, 'b': 12}, {'u': 10, 'y': 5}, {'g': 10, 'p': 5},\n",
       "        {'q': 1, 'w': 4, 'k': 5, 'c': 2, 'i': 1, 'x': 1, 'm': 1},\n",
       "        {'v': 11, 'h': 3, 'ff': 1}, {'f': 15}, {'t': 5, 'f': 10},\n",
       "        {'t': 6, 'f': 9}, {'g': 14, 's': 1}],\n",
       "       [{'b': 2, 'a': 3}, {'u': 4...\n",
       "       [4.35238095e-01, 2.32142857e-01, 1.75438596e-02, 4.47761194e-02,\n",
       "        1.14000000e-01, 0.00000000e+00],\n",
       "       [5.46349206e-01, 1.25000000e-01, 1.22807018e-01, 0.00000000e+00,\n",
       "        1.15000000e-01, 0.00000000e+00],\n",
       "       [6.09841270e-01, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,\n",
       "        0.00000000e+00, 0.00000000e+00],\n",
       "       [7.48730159e-01, 1.78571429e-01, 1.40350877e-01, 5.97014925e-02,\n",
       "        0.00000000e+00, 9.90000000e-04]]),\n",
       "                           delta=0.6, theta=0.1)"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "acc_threshold = 0.5 # minimum accuracy of hyperboxes being retained\n",
    "keep_empty_boxes = False # do not keep the hyperboxes which do not join the prediction process on the validation set\n",
    "# using a probability measure based on the number of samples included in the hyperbox for handling samples located on the boundary\n",
    "type_boundary_handling = PROBABILITY_MEASURE\n",
    "eiol_gfmm_clf.simple_pruning(Xval, yval, acc_threshold, keep_empty_boxes, type_boundary_handling)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Make prediction after pruning"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Predict the class label for input samples using a probability measure based on the number of samples included inside the winner hyperboxes for the samples located on the decision boundaries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy =  82.44%\n"
     ]
    }
   ],
   "source": [
    "y_pred = eiol_gfmm_clf.predict(Xtest, PROBABILITY_MEASURE)\n",
    "acc = accuracy_score(ytest, y_pred)\n",
    "print(f'Accuracy = {acc * 100: .2f}%')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Predict the class label for input samples using Manhattan distance measure (applied only for continuous features) for the samples located on the decision boundaries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy (Manhattan distance for samples on the decision boundaries) =  82.44%\n"
     ]
    }
   ],
   "source": [
    "y_pred = eiol_gfmm_clf.predict(Xtest, MANHATTAN_DIS)\n",
    "acc = accuracy_score(ytest, y_pred)\n",
    "print(f'Accuracy (Manhattan distance for samples on the decision boundaries) = {acc * 100: .2f}%')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}