{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Integration of Algorithms for Mixed-Attribute Data with Hyper-parameter Optimisation in Sklearn"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This example shows how to integrate the GFMM classifiers for mixed-attribute data with the randomised search cross-validation functionality (`RandomizedSearchCV`) implemented by scikit-learn.\n",
    "\n",
    "This example uses the extended improved incremental learning algorithm and random search for illustration. The other learning algorithms for mixed-attribute data in the library can be combined with any hyper-parameter tuning method in the same way."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "from sklearn.metrics import accuracy_score\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "from sklearn.model_selection import RandomizedSearchCV\n",
    "from sklearn.model_selection import train_test_split\n",
    "from hbbrain.mixed_data.eiol_gfmm import ExtendedImprovedOnlineGFMM"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load the dataset\n",
    "This example uses the japanese_credit dataset for illustration purposes. The continuous features in this dataset were normalised into the range of [0, 1], while categorical features were kept unchanged. Note that the numerical features in training and testing datasets must be in the range of [0, 1] because the GFMM classifiers require features in the unit cube."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "# Notebooks do not define __file__, so use the current working directory\n",
    "this_notebook_dir = os.getcwd()\n",
    "project_dir = Path(this_notebook_dir).parent.parent"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "training_data_file = os.path.join(project_dir, Path(\"dataset/japanese_credit_train.csv\"))\n",
    "testing_data_file = os.path.join(project_dir, Path(\"dataset/japanese_credit_test.csv\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_train = pd.read_csv(training_data_file, header=None)\n",
    "df_test = pd.read_csv(testing_data_file, header=None)\n",
    "\n",
    "Xy_train = df_train.to_numpy()\n",
    "Xy_test = df_test.to_numpy()\n",
    "\n",
    "Xtr = Xy_train[:, :-1]\n",
    "ytr = Xy_train[:, -1].astype(int)\n",
    "\n",
    "Xtest = Xy_test[:, :-1]\n",
    "ytest = Xy_test[:, -1].astype(int)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Using Random Search with 5-fold cross-validation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
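    "# Candidate values for each hyper-parameter sampled by the random search\n",
    "parameters = {\n",
    "    'theta': np.arange(0.05, 1.01, 0.05),\n",
    "    'delta': np.arange(0.05, 1.01, 0.05),\n",
    "    'alpha': np.arange(0.1, 1.1, 0.1),\n",
    "    'gamma': [0.5, 1, 2, 4, 8, 16],\n",
    "}"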
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Using random search with only 20 random combinations of parameters\n",
    "eiol_gfmm_rd_search = ExtendedImprovedOnlineGFMM()\n",
    "clf_rd_search = RandomizedSearchCV(eiol_gfmm_rd_search, parameters, n_iter=20, cv=5, random_state=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "RandomizedSearchCV(cv=5,\n",
       "                   estimator=ExtendedImprovedOnlineGFMM(C=array([], dtype=float64),\n",
       "                                                        D=array([], dtype=float64),\n",
       "                                                        N_samples=array([], dtype=float64),\n",
       "                                                        V=array([], dtype=float64),\n",
       "                                                        W=array([], dtype=float64)),\n",
       "                   n_iter=20,\n",
       "                   param_distributions={'alpha': array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ]),\n",
       "                                        'delta': array([0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 , 0.55,\n",
       "       0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1.  ]),\n",
       "                                        'gamma': [0.5, 1, 2, 4, 8, 16],\n",
       "                                        'theta': array([0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 , 0.55,\n",
       "       0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1.  ])},\n",
       "                   random_state=0)"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Extra arguments passed to fit() in addition to X and y.\n",
    "# type_cat_expansion=1: the expansion condition for categorical features uses the average entropy change over all categorical features.\n",
    "fit_params={'categorical_features':[0, 3, 4, 5, 6, 8, 9, 11, 12], 'type_cat_expansion':1}\n",
    "clf_rd_search.fit(Xtr, ytr, **fit_params)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Best average score =  0.8209672184355729\n",
      "Best params:  {'theta': 0.5, 'gamma': 2, 'delta': 0.15000000000000002, 'alpha': 0.8}\n"
     ]
    }
   ],
   "source": [
    "print(\"Best average score = \", clf_rd_search.best_score_)\n",
    "print(\"Best params: \", clf_rd_search.best_params_)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "best_gfmm_rd_search = clf_rd_search.best_estimator_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Testing the performance on the test set\n",
    "y_pred_rd_search = best_gfmm_rd_search.predict(Xtest)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy (random-search) =  79.39%\n"
     ]
    }
   ],
   "source": [
    "acc_rd_search = accuracy_score(ytest, y_pred_rd_search)\n",
    "print(f'Accuracy (random-search) = {acc_rd_search * 100: .2f}%')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}