{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Integration of Algorithms for Mixed-Attribute Data with Hyper-parameter Optimisation in Sklearn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This example shows how to integrate the GFMM classifiers for mixed-attribute data with the random search cross-validation functionality (`RandomizedSearchCV`) implemented by scikit-learn.\n", "\n", "Note that this example uses the extended improved incremental learning algorithm and random search for illustration. However, the other learning algorithms for mixed-attribute data in the library can be used in the same way with any hyper-parameter tuning method." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "import pandas as pd\n", "import numpy as np\n", "from sklearn.metrics import accuracy_score\n", "import warnings\n", "warnings.filterwarnings('ignore')\n", "from sklearn.model_selection import RandomizedSearchCV\n", "from sklearn.model_selection import train_test_split\n", "from hbbrain.mixed_data.eiol_gfmm import ExtendedImprovedOnlineGFMM" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load dataset\n", "This example uses the japanese_credit dataset for illustration purposes. The continuous features in this dataset were normalised to the range [0, 1], while the categorical features were kept unchanged. Note that the numerical features in both the training and testing datasets must lie in [0, 1] because the GFMM classifiers require features in the unit cube." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "this_notebook_dir = os.path.dirname(os.path.abspath(\"__file__\"))\n", "project_dir = Path(this_notebook_dir).parent.parent" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "training_data_file = os.path.join(project_dir, Path(\"dataset/japanese_credit_train.csv\"))\n", "testing_data_file = os.path.join(project_dir, Path(\"dataset/japanese_credit_test.csv\"))" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "df_train = pd.read_csv(training_data_file, header=None)\n", "df_test = pd.read_csv(testing_data_file, header=None)\n", "\n", "Xy_train = df_train.to_numpy()\n", "Xy_test = df_test.to_numpy()\n", "\n", "Xtr = Xy_train[:, :-1]\n", "ytr = Xy_train[:, -1].astype(int)\n", "\n", "Xtest = Xy_test[:, :-1]\n", "ytest = Xy_test[:, -1].astype(int)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using Random Search with 5-fold cross-validation" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "parameters = {'theta': np.arange(0.05, 1.01, 0.05), 'delta':np.arange(0.05, 1.01, 0.05), 'alpha':np.arange(0.1, 1.1, 0.1), 'gamma':[0.5, 1, 2, 4, 8, 16]}" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# Using random search with only 20 random combinations of parameters\n", "eiol_gfmm_rd_search = ExtendedImprovedOnlineGFMM()\n", "clf_rd_search = RandomizedSearchCV(eiol_gfmm_rd_search, parameters, n_iter=20, cv=5, random_state=0)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RandomizedSearchCV(cv=5,\n", " estimator=ExtendedImprovedOnlineGFMM(C=array([], dtype=float64),\n", " D=array([], dtype=float64),\n", " N_samples=array([], dtype=float64),\n", " V=array([], dtype=float64),\n", " W=array([], 
dtype=float64)),\n", " n_iter=20,\n", " param_distributions={'alpha': array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ]),\n", " 'delta': array([0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 , 0.55,\n", " 0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1. ]),\n", " 'gamma': [0.5, 1, 2, 4, 8, 16],\n", " 'theta': array([0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 , 0.55,\n", " 0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1. ])},\n", " random_state=0)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Extra arguments passed to the fit function in addition to X and y\n", "# Use the expansion condition for categorical features based on the average entropy change over all categorical features\n", "fit_params={'categorical_features':[0, 3, 4, 5, 6, 8, 9, 11, 12], 'type_cat_expansion':1}\n", "clf_rd_search.fit(Xtr, ytr, **fit_params)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Best average score = 0.8209672184355729\n", "Best params: {'theta': 0.5, 'gamma': 2, 'delta': 0.15000000000000002, 'alpha': 0.8}\n" ] } ], "source": [ "print(\"Best average score = \", clf_rd_search.best_score_)\n", "print(\"Best params: \", clf_rd_search.best_params_)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "best_gfmm_rd_search = clf_rd_search.best_estimator_" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# Evaluate the best estimator on the test set\n", "y_pred_rd_search = best_gfmm_rd_search.predict(Xtest)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy (random-search) = 79.39%\n" ] } ], "source": [ "acc_rd_search = accuracy_score(ytest, y_pred_rd_search)\n", "print(f'Accuracy (random-search) = {acc_rd_search * 100:.2f}%')" ] } ], 
"metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.7" } }, "nbformat": 4, "nbformat_minor": 4 }