Integration of Algorithms for Mixed-Attribute Data with Hyper-parameter Optimisation in Sklearn

This example shows how to integrate the GFMM classifiers for mixed-attribute data with the random search cross-validation functionality (RandomizedSearchCV) implemented by scikit-learn.

Note that this example uses the extended improved incremental learning algorithm and random search for illustration. The other learning algorithms for mixed-attribute data in the library can be used in the same way with any hyper-parameter tuning method.

[1]:
import os
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split
from hbbrain.mixed_data.eiol_gfmm import ExtendedImprovedOnlineGFMM

Load dataset.

This example uses the japanese_credit dataset for illustration purposes. The continuous features in this dataset were normalised into the range of [0, 1], while categorical features were kept unchanged. Note that the numerical features in training and testing datasets must be in the range of [0, 1] because the GFMM classifiers require features in the unit cube.
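
The scaling requirement above can be sketched as follows. This is a minimal illustration using NumPy only, not the preprocessing that produced the dataset files; `minmax_scale_columns` and the toy arrays are hypothetical names introduced here. The minimum and maximum are taken from the training data alone, and clipping out-of-range test values into [0, 1] is an assumption of this sketch.

```python
import numpy as np

def minmax_scale_columns(X_train, X_test, continuous_idx):
    """Scale the selected columns to [0, 1] using training-set minima and maxima."""
    # Copies with object dtype, so categorical values are kept unchanged
    X_train = np.array(X_train, dtype=object)
    X_test = np.array(X_test, dtype=object)
    for j in continuous_idx:
        col_train = X_train[:, j].astype(float)
        col_test = X_test[:, j].astype(float)
        lo, hi = col_train.min(), col_train.max()
        span = hi - lo if hi > lo else 1.0  # avoid division by zero for constant columns
        X_train[:, j] = (col_train - lo) / span
        # Test values outside the training range are clipped into [0, 1]
        X_test[:, j] = np.clip((col_test - lo) / span, 0.0, 1.0)
    return X_train, X_test

# Toy data: column 0 is categorical, column 1 is continuous
X_train_toy = [["a", 10.0], ["b", 20.0], ["a", 30.0]]
X_test_toy = [["b", 15.0], ["a", 40.0]]
X_train_scaled, X_test_scaled = minmax_scale_columns(X_train_toy, X_test_toy, continuous_idx=[1])
print(X_train_scaled)
print(X_test_scaled)
```

Fitting the scaling on the training data only keeps information from the test set out of the preprocessing step.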

[2]:
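from pathlib import Path
# __file__ is not defined inside a notebook, so the literal string "__file__" is used:
# its absolute path resolves against the current working directory.
this_notebook_dir = os.path.dirname(os.path.abspath("__file__"))
project_dir = Path(this_notebook_dir).parent.parent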
[3]:
training_data_file = os.path.join(project_dir, Path("dataset/japanese_credit_train.csv"))
testing_data_file = os.path.join(project_dir, Path("dataset/japanese_credit_test.csv"))
[4]:
df_train = pd.read_csv(training_data_file, header=None)
df_test = pd.read_csv(testing_data_file, header=None)

Xy_train = df_train.to_numpy()
Xy_test = df_test.to_numpy()

Xtr = Xy_train[:, :-1]
ytr = Xy_train[:, -1].astype(int)

Xtest = Xy_test[:, :-1]
ytest = Xy_test[:, -1].astype(int)

Using Random Search with 5-fold cross-validation

[5]:
parameters = {
    'theta': np.arange(0.05, 1.01, 0.05),
    'delta': np.arange(0.05, 1.01, 0.05),
    'alpha': np.arange(0.1, 1.1, 0.1),
    'gamma': [0.5, 1, 2, 4, 8, 16],
}
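
To get a feel for the size of this search space, the sketch below counts the combinations in the full grid and draws 20 of them with scikit-learn's `ParameterSampler`, the utility that generates candidates from a parameter dictionary. It only illustrates the sampling step and does not train any model; the variable names `n_combinations` and `sampled` are introduced here.

```python
import numpy as np
from sklearn.model_selection import ParameterSampler

parameters = {
    'theta': np.arange(0.05, 1.01, 0.05),
    'delta': np.arange(0.05, 1.01, 0.05),
    'alpha': np.arange(0.1, 1.1, 0.1),
    'gamma': [0.5, 1, 2, 4, 8, 16],
}

# Size of the full grid: product of the number of candidate values per parameter
n_combinations = int(np.prod([len(v) for v in parameters.values()]))
print("Number of combinations in the full grid:", n_combinations)

# Draw 20 combinations from the grid, as a random search with n_iter=20 would
sampled = list(ParameterSampler(parameters, n_iter=20, random_state=0))
print("First sampled combination:", sampled[0])
```

Since only a small fraction of the grid is evaluated, random search trades exhaustive coverage for a much shorter running time.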
[6]:
# Using random search with only 20 random combinations of parameters
eiol_gfmm_rd_search = ExtendedImprovedOnlineGFMM()
clf_rd_search = RandomizedSearchCV(eiol_gfmm_rd_search, parameters, n_iter=20, cv=5, random_state=0)
[7]:
# Extra arguments passed to the fit function in addition to X and y
# type_cat_expansion=1: the expansion condition for categorical features uses the average entropy change over all categorical features
fit_params = {'categorical_features': [0, 3, 4, 5, 6, 8, 9, 11, 12], 'type_cat_expansion': 1}
clf_rd_search.fit(Xtr, ytr, **fit_params)
[7]:
RandomizedSearchCV(cv=5,
                   estimator=ExtendedImprovedOnlineGFMM(C=array([], dtype=float64),
                                                        D=array([], dtype=float64),
                                                        N_samples=array([], dtype=float64),
                                                        V=array([], dtype=float64),
                                                        W=array([], dtype=float64)),
                   n_iter=20,
                   param_distributions={'alpha': array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ]),
                                        'delta': array([0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 , 0.55,
       0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1.  ]),
                                        'gamma': [0.5, 1, 2, 4, 8, 16],
                                        'theta': array([0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 , 0.55,
       0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1.  ])},
                   random_state=0)
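
The fit call above relies on RandomizedSearchCV passing extra keyword arguments on to the estimator's own fit method. The sketch below illustrates this forwarding with a hypothetical toy estimator, `ThresholdClassifier`, whose `note` argument stands in for arguments such as `categorical_features`. It is not part of hbbrain, and the toy data are made up for the illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.model_selection import RandomizedSearchCV

class ThresholdClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical toy classifier: predicts 1 when the first feature exceeds `threshold`."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold

    def fit(self, X, y, note=None):
        # `note` plays the role of an extra fit argument; it is simply stored
        self.note_ = note
        self.classes_ = np.unique(y)
        return self

    def predict(self, X):
        return (np.asarray(X)[:, 0] > self.threshold).astype(int)

# Deterministic toy data: 20 samples per class, separable at 0.5 on the first feature
X_toy = np.column_stack([np.linspace(0, 1, 40), np.zeros(40)])
y_toy = (X_toy[:, 0] > 0.5).astype(int)

search = RandomizedSearchCV(ThresholdClassifier(), {'threshold': [0.3, 0.5, 0.7]},
                            n_iter=3, cv=5, random_state=0)
# Keyword arguments other than X and y are handed to ThresholdClassifier.fit
search.fit(X_toy, y_toy, note='forwarded')

print(search.best_estimator_.note_)
print(search.best_params_)
```

The refitted best estimator receives the same extra arguments, which is why `best_estimator_` in this example can be used for prediction without specifying the categorical features again.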
[8]:
print("Best average score = ", clf_rd_search.best_score_)
print("Best params: ", clf_rd_search.best_params_)
Best average score =  0.8209672184355729
Best params:  {'theta': 0.5, 'gamma': 2, 'delta': 0.15000000000000002, 'alpha': 0.8}
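
The value 0.15000000000000002 reported for delta is a floating-point artefact of `np.arange`, not a meaningful difference from 0.15. If cleaner values are preferred in the reported parameters, one option is to round the grid when it is built. This is an assumption of the sketch below, not something the example itself does, and the names `raw_grid` and `rounded_grid` are introduced here.

```python
import numpy as np

# Grid as built in this example: some entries carry floating-point noise
raw_grid = np.arange(0.05, 1.01, 0.05)
print(raw_grid[2])

# Rounding to two decimals removes the noise without changing the grid size
rounded_grid = np.round(raw_grid, 2)
print(rounded_grid[2])
```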
[9]:
best_gfmm_rd_search = clf_rd_search.best_estimator_
[10]:
# Testing the performance on the test set
y_pred_rd_search = best_gfmm_rd_search.predict(Xtest)
[11]:
acc_rd_search = accuracy_score(ytest, y_pred_rd_search)
print(f'Accuracy (random-search) = {acc_rd_search * 100: .2f}%')
Accuracy (random-search) =  79.39%