{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Model-level Bagging of Hyperbox-based Learners with Hyper-parameter Optimisation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This example shows how to use a bagging classifier that combines its base learners at the model level to produce a single model. Each base learner is trained using random search-based hyper-parameter tuning with k-fold cross-validation.\n", "\n", "The original model-level combination bagging classifier, implemented in the class ModelCombinationBagging, trains every base learner with the same hyperparameters. The cross-validation variant, implemented in the class ModelCombinationCrossValBagging, instead performs a random search for each base learner, so each one uses the best combination of hyperparameters found for its own training data." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import warnings\n", "warnings.filterwarnings('ignore')\n", "import numpy as np\n", "from sklearn.metrics import accuracy_score\n", "from sklearn.model_selection import train_test_split\n", "from hbbrain.numerical_data.ensemble_learner.model_comb_cross_val_bagging import ModelCombinationCrossValBagging\n", "from hbbrain.numerical_data.incremental_learner.onln_gfmm import OnlineGFMM\n", "from hbbrain.numerical_data.batch_learner.accel_agglo_gfmm import AccelAgglomerativeLearningGFMM" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load dataset. This example will use the breast cancer dataset available in sklearn to demonstrate how to use this ensemble classifier. 
" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import load_breast_cancer\n", "from sklearn.preprocessing import MinMaxScaler" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "df = load_breast_cancer()\n", "X = df.data\n", "y = df.target" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# Normalise data into the range of [0, 1] as hyperbox-based models only work in the unit cube\n", "scaler = MinMaxScaler()\n", "X = scaler.fit_transform(X)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# Split data into training, validation and testing sets\n", "Xtr_val, X_test, ytr_val, y_test = train_test_split(X, y, train_size=0.8, random_state=0)\n", "# Split the training/validation portion again so that no testing sample ends up in the training or validation set\n", "Xtr, X_val, ytr, y_val = train_test_split(Xtr_val, ytr_val, train_size=0.75, random_state=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**This example will use the GFMM classifier with the original online learning algorithm as base learners. However, any hyperbox-based learning algorithm in this library can also be used to train base learners.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Using random subsampling to generate training sets for various base learners" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### a. 
Training without pruning for base learners" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# Initialise parameters\n", "n_estimators = 20 # number of base learners\n", "max_samples = 0.5 # fraction of samples drawn for each base learner\n", "bootstrap = False # random subsampling without replacement\n", "class_balanced = False # do not use the class-balanced sampling mode\n", "n_jobs = 4 # number of processes used to build base learners\n", "n_iter = 20 # Number of parameter settings that are randomly sampled to choose the best combination of hyperparameters\n", "k_fold = 5 # Number of folds to conduct Stratified K-Fold cross-validation for hyperparameter tuning" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# Init a hyperbox-based model used to train base learners\n", "# Using the GFMM classifier with the original online learning algorithm\n", "base_estimator = OnlineGFMM()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# Init ranges for hyperparameters of base learners to perform a random search process for hyperparameter tuning\n", "base_estimator_params = {'theta': np.arange(0.05, 1.01, 0.05), 'theta_min':[1], 'gamma':[0.5, 1, 2, 4, 8, 16]}" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# Init a hyperbox-based model used to aggregate the resulting hyperboxes from all base learners\n", "# Using the accelerated agglomerative learning algorithm for the GFMM model to do this task\n", "model_level_estimator = AccelAgglomerativeLearningGFMM(theta=0.1, min_simil=0, simil_measure='long')" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ModelCombinationCrossValBagging(base_estimator=OnlineGFMM(C=array([], dtype=float64),\n", " V=array([], dtype=float64),\n", " W=array([], dtype=float64)),\n", " base_estimator_params={'gamma': [0.5, 1, 2, 
4,\n", " 8, 16],\n", " 'theta': array([0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 , 0.55,\n", " 0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1. ]),\n", " 'theta_min': [1]},\n", " model_level_estimator=AccelAgglomerativeLearningGFMM(min_simil=0,\n", " simil_measure='long',\n", " theta=0.1),\n", " n_estimators=20, n_iter=20, n_jobs=4,\n", " random_state=0)" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model_comb_cv_bagging_subsampling = ModelCombinationCrossValBagging(base_estimator=base_estimator, base_estimator_params=base_estimator_params, model_level_estimator=model_level_estimator, n_estimators=n_estimators, max_samples=max_samples, bootstrap=bootstrap, class_balanced=class_balanced, n_iter=n_iter, k_fold=k_fold, n_jobs=n_jobs, random_state=0)\n", "model_comb_cv_bagging_subsampling.fit(Xtr, ytr)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training time: 44.155 (s)\n" ] } ], "source": [ "print(\"Training time: %.3f (s)\"%(model_comb_cv_bagging_subsampling.elapsed_training_time))" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Total number of hyperboxes in all base learners = 1168\n" ] } ], "source": [ "print('Total number of hyperboxes in all base learners = %d'%model_comb_cv_bagging_subsampling.get_n_hyperboxes())" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of hyperboxes in the combined model = 779\n" ] } ], "source": [ "print('Number of hyperboxes in the combined model = %d'%model_comb_cv_bagging_subsampling.get_n_hyperboxes_comb_model())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Prediction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Using majority voting from predicted 
results of all base learners" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "y_pred_voting = model_comb_cv_bagging_subsampling.predict_voting(X_test)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Testing accuracy using voting of decisions from base learners = 94.74%\n" ] } ], "source": [ "acc_voting = accuracy_score(y_test, y_pred_voting)\n", "print(f'Testing accuracy using voting of decisions from base learners = {acc_voting * 100 : .2f}%')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Using the final combined single model to make prediction" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Testing accuracy of the combined model = 88.60%\n" ] } ], "source": [ "y_pred = model_comb_cv_bagging_subsampling.predict(X_test)\n", "acc = accuracy_score(y_test, y_pred)\n", "print(f'Testing accuracy of the combined model = {acc * 100: .2f}%')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Apply pruning for the final combined model" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ModelCombinationCrossValBagging(base_estimator=OnlineGFMM(C=array([], dtype=float64),\n", " V=array([], dtype=float64),\n", " W=array([], dtype=float64)),\n", " base_estimator_params={'gamma': [0.5, 1, 2, 4,\n", " 8, 16],\n", " 'theta': array([0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 , 0.55,\n", " 0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1. 
]),\n", " 'theta_min': [1]},\n", " model_level_estimator=AccelAgglomerativeLearningGFMM(min_simil=0,\n", " simil_measure='long',\n", " theta=0.1),\n", " n_estimators=20, n_iter=20, n_jobs=4,\n", " random_state=0)" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "acc_threshold=0.5 # minimum accuracy score of the unpruned hyperboxes\n", "keep_empty_boxes=False # False means hyperboxes that do not join the prediction process within the pruning procedure are also eliminated\n", "model_comb_cv_bagging_subsampling.simple_pruning(X_val, y_val, acc_threshold, keep_empty_boxes)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of hyperboxes of the combined single model after pruning = 36\n" ] } ], "source": [ "print('Number of hyperboxes of the combined single model after pruning = %d'%model_comb_cv_bagging_subsampling.get_n_hyperboxes_comb_model())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Prediction after doing a pruning procedure for the combined single model" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Testing accuracy after pruning the final model = 88.60%\n" ] } ], "source": [ "y_pred_2 = model_comb_cv_bagging_subsampling.predict(X_test)\n", "acc_pruned = accuracy_score(y_test, y_pred_2)\n", "print(f'Testing accuracy after pruning the final model = {acc_pruned * 100: .2f}%')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### b. 
Training with pruning for base learners" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ModelCombinationCrossValBagging(base_estimator=OnlineGFMM(C=array([], dtype=float64),\n", " V=array([], dtype=float64),\n", " W=array([], dtype=float64)),\n", " base_estimator_params={'gamma': [0.5, 1, 2, 4,\n", " 8, 16],\n", " 'theta': array([0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 , 0.55,\n", " 0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1. ]),\n", " 'theta_min': [1]},\n", " model_level_estimator=AccelAgglomerativeLearningGFMM(min_simil=0,\n", " simil_measure='long',\n", " theta=0.1),\n", " n_estimators=20, n_iter=20, n_jobs=4,\n", " random_state=0)" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model_comb_cv_bagging_subsampling_base_learner_pruning = ModelCombinationCrossValBagging(base_estimator=base_estimator, base_estimator_params=base_estimator_params, model_level_estimator=model_level_estimator, n_estimators=n_estimators, max_samples=max_samples, bootstrap=bootstrap, class_balanced=class_balanced, n_iter=n_iter, k_fold=k_fold, n_jobs=n_jobs, random_state=0)\n", "model_comb_cv_bagging_subsampling_base_learner_pruning.fit(Xtr, ytr, is_pruning_base_learners=True, X_val=X_val, y_val=y_val, acc_threshold=acc_threshold, keep_empty_boxes=keep_empty_boxes)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training time: 44.437 (s)\n" ] } ], "source": [ "print(\"Training time: %.3f (s)\"%(model_comb_cv_bagging_subsampling_base_learner_pruning.elapsed_training_time))" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Total number of hyperboxes in all base learners = 756\n" ] } ], "source": [ "print('Total number of hyperboxes in all base learners = 
%d'%model_comb_cv_bagging_subsampling_base_learner_pruning.get_n_hyperboxes())" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of hyperboxes in the combined model = 613\n" ] } ], "source": [ "print('Number of hyperboxes in the combined model = %d'%model_comb_cv_bagging_subsampling_base_learner_pruning.get_n_hyperboxes_comb_model())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Prediction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Using majority voting from predicted results of all base learners" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "y_pred_voting = model_comb_cv_bagging_subsampling_base_learner_pruning.predict_voting(X_test)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Testing accuracy using voting of decisions from base learners = 96.49%\n" ] } ], "source": [ "acc_voting = accuracy_score(y_test, y_pred_voting)\n", "print(f'Testing accuracy using voting of decisions from base learners = {acc_voting * 100 : .2f}%')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Using the final combined single model to make prediction" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Testing accuracy of the combined model = 88.60%\n" ] } ], "source": [ "y_pred = model_comb_cv_bagging_subsampling_base_learner_pruning.predict(X_test)\n", "acc = accuracy_score(y_test, y_pred)\n", "print(f'Testing accuracy of the combined model = {acc * 100: .2f}%')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Apply pruning for the final combined model" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ 
"ModelCombinationCrossValBagging(base_estimator=OnlineGFMM(C=array([], dtype=float64),\n", " V=array([], dtype=float64),\n", " W=array([], dtype=float64)),\n", " base_estimator_params={'gamma': [0.5, 1, 2, 4,\n", " 8, 16],\n", " 'theta': array([0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 , 0.55,\n", " 0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1. ]),\n", " 'theta_min': [1]},\n", " model_level_estimator=AccelAgglomerativeLearningGFMM(min_simil=0,\n", " simil_measure='long',\n", " theta=0.1),\n", " n_estimators=20, n_iter=20, n_jobs=4,\n", " random_state=0)" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "acc_threshold=0.5 # minimum accuracy score of the unpruned hyperboxes\n", "keep_empty_boxes=False # False means hyperboxes that do not join the prediction process within the pruning procedure are also eliminated\n", "model_comb_cv_bagging_subsampling_base_learner_pruning.simple_pruning(X_val, y_val, acc_threshold, keep_empty_boxes)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of hyperboxes of the combined single model after pruning = 36\n" ] } ], "source": [ "print('Number of hyperboxes of the combined single model after pruning = %d'%model_comb_cv_bagging_subsampling_base_learner_pruning.get_n_hyperboxes_comb_model())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Prediction after doing a pruning procedure for the combined single model" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Testing accuracy after pruning the final model = 88.60%\n" ] } ], "source": [ "y_pred_2 = model_comb_cv_bagging_subsampling_base_learner_pruning.predict(X_test)\n", "acc_pruned = accuracy_score(y_test, y_pred_2)\n", "print(f'Testing accuracy after pruning the final model = {acc_pruned * 100: .2f}%')" ] }, { "cell_type": 
"markdown", "metadata": {}, "source": [ "## 2. Using random undersampling to generate class-balanced training sets for various base learners" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### a. Training without pruning for base learners" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "# Initialise parameters\n", "n_estimators = 20 # number of base learners\n", "max_samples = 0.5 # fraction of samples drawn for each base learner\n", "bootstrap = False # random subsampling without replacement\n", "class_balanced = True # use the class-balanced sampling mode\n", "n_jobs = 4 # number of processes used to build base learners\n", "n_iter = 20 # Number of parameter settings that are randomly sampled to choose the best combination of hyperparameters\n", "k_fold = 5 # Number of folds to conduct Stratified K-Fold cross-validation for hyperparameter tuning" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "# Init a hyperbox-based model used to train base learners\n", "# Using the GFMM classifier with the original online learning algorithm\n", "base_estimator = OnlineGFMM()" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "# Init ranges for hyperparameters of base learners to perform a random search process for hyperparameter tuning\n", "base_estimator_params = {'theta': np.arange(0.05, 1.01, 0.05), 'theta_min':[1], 'gamma':[0.5, 1, 2, 4, 8, 16]}" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "# Init a hyperbox-based model used to aggregate the resulting hyperboxes from all base learners\n", "# Using the accelerated agglomerative learning algorithm for the GFMM model to do this task\n", "model_level_estimator = AccelAgglomerativeLearningGFMM(theta=0.1, min_simil=0, simil_measure='long')" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { 
"text/plain": [ 
"ModelCombinationCrossValBagging(base_estimator=OnlineGFMM(C=array([], dtype=float64),\n", " V=array([], dtype=float64),\n", " W=array([], dtype=float64)),\n", " base_estimator_params={'gamma': [0.5, 1, 2, 4,\n", " 8, 16],\n", " 'theta': array([0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 , 0.55,\n", " 0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1. ]),\n", " 'theta_min': [1]},\n", " class_balanced=True,\n", " model_level_estimator=AccelAgglomerativeLearningGFMM(min_simil=0,\n", " simil_measure='long',\n", " theta=0.1),\n", " n_estimators=20, n_iter=20, n_jobs=4,\n", " random_state=0)" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model_comb_cv_bagging_class_balanced = ModelCombinationCrossValBagging(base_estimator=base_estimator, base_estimator_params=base_estimator_params, model_level_estimator=model_level_estimator, n_estimators=n_estimators, max_samples=max_samples, bootstrap=bootstrap, class_balanced=class_balanced, n_iter=n_iter, k_fold=k_fold, n_jobs=n_jobs, random_state=0)\n", "model_comb_cv_bagging_class_balanced.fit(Xtr, ytr)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training time: 36.885 (s)\n" ] } ], "source": [ "print(\"Training time: %.3f (s)\"%(model_comb_cv_bagging_class_balanced.elapsed_training_time))" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Total number of hyperboxes in all base learners = 1407\n" ] } ], "source": [ "print('Total number of hyperboxes in all base learners = %d'%model_comb_cv_bagging_class_balanced.get_n_hyperboxes())" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of hyperboxes in the combined model = 812\n" ] } ], "source": [ "print('Number of hyperboxes in the combined model = 
%d'%model_comb_cv_bagging_class_balanced.get_n_hyperboxes_comb_model())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Prediction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Using majority voting from predicted results of all base learners" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "y_pred_voting = model_comb_cv_bagging_class_balanced.predict_voting(X_test)" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Testing accuracy using voting of decisions from base learners = 94.74%\n" ] } ], "source": [ "acc_voting = accuracy_score(y_test, y_pred_voting)\n", "print(f'Testing accuracy using voting of decisions from base learners = {acc_voting * 100 : .2f}%')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Using the final combined single model to make prediction" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Testing accuracy of the combined model = 89.47%\n" ] } ], "source": [ "y_pred = model_comb_cv_bagging_class_balanced.predict(X_test)\n", "acc = accuracy_score(y_test, y_pred)\n", "print(f'Testing accuracy of the combined model = {acc * 100: .2f}%')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Apply pruning for the final combined model" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ModelCombinationCrossValBagging(base_estimator=OnlineGFMM(C=array([], dtype=float64),\n", " V=array([], dtype=float64),\n", " W=array([], dtype=float64)),\n", " base_estimator_params={'gamma': [0.5, 1, 2, 4,\n", " 8, 16],\n", " 'theta': array([0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 , 0.55,\n", " 0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1. 
]),\n", " 'theta_min': [1]},\n", " class_balanced=True,\n", " model_level_estimator=AccelAgglomerativeLearningGFMM(min_simil=0,\n", " simil_measure='long',\n", " theta=0.1),\n", " n_estimators=20, n_iter=20, n_jobs=4,\n", " random_state=0)" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "acc_threshold=0.5 # minimum accuracy score of the unpruned hyperboxes\n", "keep_empty_boxes=False # False means hyperboxes that do not join the prediction process within the pruning procedure are also eliminated\n", "model_comb_cv_bagging_class_balanced.simple_pruning(X_val, y_val, acc_threshold, keep_empty_boxes)" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of hyperboxes of the combined single model after pruning = 42\n" ] } ], "source": [ "print('Number of hyperboxes of the combined single model after pruning = %d'%model_comb_cv_bagging_class_balanced.get_n_hyperboxes_comb_model())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Prediction after doing a pruning procedure for the combined single model" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Testing accuracy after pruning the final model = 90.35%\n" ] } ], "source": [ "y_pred_2 = model_comb_cv_bagging_class_balanced.predict(X_test)\n", "acc_pruned = accuracy_score(y_test, y_pred_2)\n", "print(f'Testing accuracy after pruning the final model = {acc_pruned * 100: .2f}%')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### b. 
Training with pruning for base learners" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ModelCombinationCrossValBagging(base_estimator=OnlineGFMM(C=array([], dtype=float64),\n", " V=array([], dtype=float64),\n", " W=array([], dtype=float64)),\n", " base_estimator_params={'gamma': [0.5, 1, 2, 4,\n", " 8, 16],\n", " 'theta': array([0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 , 0.55,\n", " 0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1. ]),\n", " 'theta_min': [1]},\n", " class_balanced=True,\n", " model_level_estimator=AccelAgglomerativeLearningGFMM(min_simil=0,\n", " simil_measure='long',\n", " theta=0.1),\n", " n_estimators=20, n_iter=20, n_jobs=4,\n", " random_state=0)" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model_comb_cv_bagging_class_balanced_base_learner_pruning = ModelCombinationCrossValBagging(base_estimator=base_estimator, base_estimator_params=base_estimator_params, model_level_estimator=model_level_estimator, n_estimators=n_estimators, max_samples=max_samples, bootstrap=bootstrap, class_balanced=class_balanced, n_iter=n_iter, k_fold=k_fold, n_jobs=n_jobs, random_state=0)\n", "model_comb_cv_bagging_class_balanced_base_learner_pruning.fit(Xtr, ytr, is_pruning_base_learners=True, X_val=X_val, y_val=y_val, acc_threshold=acc_threshold, keep_empty_boxes=keep_empty_boxes)" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training time: 31.609 (s)\n" ] } ], "source": [ "print(\"Training time: %.3f (s)\"%(model_comb_cv_bagging_class_balanced_base_learner_pruning.elapsed_training_time))" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Total number of hyperboxes in all base learners = 719\n" ] } ], "source": [ "print('Total number of hyperboxes in all base learners 
= %d'%model_comb_cv_bagging_class_balanced_base_learner_pruning.get_n_hyperboxes())" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of hyperboxes in the combined model = 538\n" ] } ], "source": [ "print('Number of hyperboxes in the combined model = %d'%model_comb_cv_bagging_class_balanced_base_learner_pruning.get_n_hyperboxes_comb_model())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Prediction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Using majority voting from predicted results of all base learners" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [], "source": [ "y_pred_voting = model_comb_cv_bagging_class_balanced_base_learner_pruning.predict_voting(X_test)" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Testing accuracy using voting of decisions from base learners = 95.61%\n" ] } ], "source": [ "acc_voting = accuracy_score(y_test, y_pred_voting)\n", "print(f'Testing accuracy using voting of decisions from base learners = {acc_voting * 100 : .2f}%')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Using the final combined single model to make prediction" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Testing accuracy of the combined model = 89.47%\n" ] } ], "source": [ "y_pred = model_comb_cv_bagging_class_balanced_base_learner_pruning.predict(X_test)\n", "acc = accuracy_score(y_test, y_pred)\n", "print(f'Testing accuracy of the combined model = {acc * 100: .2f}%')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Apply pruning for the final combined model" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/plain": [ 
"ModelCombinationCrossValBagging(base_estimator=OnlineGFMM(C=array([], dtype=float64),\n", " V=array([], dtype=float64),\n", " W=array([], dtype=float64)),\n", " base_estimator_params={'gamma': [0.5, 1, 2, 4,\n", " 8, 16],\n", " 'theta': array([0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 , 0.55,\n", " 0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1. ]),\n", " 'theta_min': [1]},\n", " class_balanced=True,\n", " model_level_estimator=AccelAgglomerativeLearningGFMM(min_simil=0,\n", " simil_measure='long',\n", " theta=0.1),\n", " n_estimators=20, n_iter=20, n_jobs=4,\n", " random_state=0)" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "acc_threshold=0.5 # minimum accuracy score of the unpruned hyperboxes\n", "keep_empty_boxes=False # False means hyperboxes that do not join the prediction process within the pruning procedure are also eliminated\n", "model_comb_cv_bagging_class_balanced_base_learner_pruning.simple_pruning(X_val, y_val, acc_threshold, keep_empty_boxes)" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of hyperboxes of the combined single model after pruning = 42\n" ] } ], "source": [ "print('Number of hyperboxes of the combined single model after pruning = %d'%model_comb_cv_bagging_class_balanced_base_learner_pruning.get_n_hyperboxes_comb_model())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Prediction after doing a pruning procedure for the combined single model" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Testing accuracy after pruning the final model = 90.35%\n" ] } ], "source": [ "y_pred_2 = model_comb_cv_bagging_class_balanced_base_learner_pruning.predict(X_test)\n", "acc_pruned = accuracy_score(y_test, y_pred_2)\n", "print(f'Testing accuracy after pruning the final model = {acc_pruned 
* 100: .2f}%')" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.7" } }, "nbformat": 4, "nbformat_minor": 4 }