The model averaging ensemble allows each ensemble member to contribute an equal amount to the prediction of the ensemble.
We can update the example so that, instead, the contribution of each ensemble member is weighted by a coefficient that indicates the trust or expected performance of the model. Weight values are fractions between 0 and 1, treated like percentages, such that the weights across all ensemble members sum to one.
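For intuition, here is a tiny numeric sketch; the probabilities and weights below are made up purely for illustration and are not taken from the models in this tutorial:

```python
# tiny illustration of a weighted combination of predictions
# (probabilities and weights are made up for demonstration only)
from numpy import array, argmax

# predicted class probabilities from three hypothetical models for one example
p1 = array([0.6, 0.3, 0.1])
p2 = array([0.4, 0.5, 0.1])
p3 = array([0.3, 0.4, 0.3])
# trust in each model; the weights sum to one
w = [0.5, 0.3, 0.2]
# weighted sum of the probabilities, then argmax to pick the class
summed = w[0]*p1 + w[1]*p2 + w[2]*p3
print(summed, argmax(summed))  # approx. [0.48 0.38 0.14], class 0
```

The prediction leans towards the models with the larger weights; the rest of this section automates exactly this calculation across all test examples.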
First, we must update the ensemble_predictions() function to make use of a vector of weights for each ensemble member.
Instead of simply summing the predictions across each ensemble member, we must calculate a weighted sum. We can implement this manually using for loops, but this is terribly inefficient; for example:
```python
# calculate a weighted sum of predictions
def weighted_sum(weights, yhats):
    rows = list()
    # enumerate examples (rows)
    for j in range(yhats.shape[1]):
        row = list()
        # enumerate classes (columns)
        for k in range(yhats.shape[2]):
            # enumerate ensemble members and accumulate the weighted value
            value = 0.0
            for i in range(yhats.shape[0]):
                value += weights[i] * yhats[i,j,k]
            row.append(value)
        rows.append(row)
    return array(rows)
```
Instead, we can use efficient NumPy functions to implement the weighted sum such as einsum() or tensordot().
A full discussion of these functions is out of scope, so please refer to the NumPy API documentation for more information; they can be challenging if you are new to linear algebra and/or NumPy. We will use the tensordot() function to apply the tensor product with the required summing; the updated ensemble_predictions() function is listed below.
```python
# make an ensemble prediction for multi-class classification
def ensemble_predictions(members, weights, testX):
    # make predictions
    yhats = [model.predict(testX) for model in members]
    yhats = array(yhats)
    # weighted sum across ensemble members
    summed = tensordot(yhats, weights, axes=((0),(0)))
    # argmax across classes
    result = argmax(summed, axis=1)
    return result
```
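As a quick sanity check, not part of the tutorial's listing, we can confirm on random data that tensordot() (and the equivalent einsum() call) produce the same result as the manual weighted_sum() loop above, assuming yhats has the shape (members, examples, classes):

```python
# sanity check: the vectorized weighted sums match the manual loop version
# (illustrative only; assumes weighted_sum() and "from numpy import array" are in scope)
from numpy import allclose, einsum, tensordot
from numpy.random import rand

yhats = rand(5, 100, 3)              # 5 members, 100 examples, 3 classes
weights = [0.1, 0.2, 0.3, 0.2, 0.2]  # weights sum to one
loop = weighted_sum(weights, yhats)
dot = tensordot(yhats, weights, axes=((0), (0)))
ein = einsum('ijk,i->jk', yhats, weights)
print(allclose(loop, dot), allclose(loop, ein))  # expect: True True
```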
Next, we must update evaluate_ensemble() to pass along the weights when making the prediction for the ensemble.
```python
# evaluate an ensemble with a given set of weights
def evaluate_ensemble(members, weights, testX, testy):
    # make prediction
    yhat = ensemble_predictions(members, weights, testX)
    # calculate accuracy
    return accuracy_score(testy, yhat)
```
We will use a modest-sized ensemble of five members, a size that appeared to perform well in the model averaging ensemble.
```python
# fit all models
n_members = 5
members = [fit_model(trainX, trainy) for _ in range(n_members)]
```
We can then estimate the performance of each individual model on the test dataset as a reference.
```python
# evaluate each single model on the test set
testy_enc = to_categorical(testy)
for i in range(n_members):
    _, test_acc = members[i].evaluate(testX, testy_enc, verbose=0)
    print('Model %d: %.3f' % (i+1, test_acc))
```
Next, we can use a weight of 1/5 or 0.2 for each of the five ensemble members and use the new functions to estimate the performance of a model averaging ensemble, a so-called equal-weight ensemble.
We would expect this ensemble to perform as well or better than any single model.
```python
# evaluate averaging ensemble (equal weights)
weights = [1.0/n_members for _ in range(n_members)]
score = evaluate_ensemble(members, weights, testX, testy)
print('Equal Weights Score: %.3f' % score)
```
Finally, we can develop a weighted average ensemble.
A simple but exhaustive approach to finding weights for the ensemble members is to grid search values. We can define a coarse grid of weight values from 0.0 to 1.0 in steps of 0.1, then generate all possible five-element vectors with those values. Generating all possible combinations is called a Cartesian product, which can be implemented in Python using the itertools.product() function from the standard library.
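To get a sense of the size of this search, note that an 11-value grid over five members yields 11**5 = 161,051 candidate weight vectors. A quick illustration (not part of the tutorial's listing):

```python
# illustrative only: count the candidate weight vectors produced by the grid
from itertools import product

w = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
candidates = list(product(w, repeat=5))
print(len(candidates))                # 11**5 = 161051
print(candidates[0], candidates[-1])  # (0.0, 0.0, 0.0, 0.0, 0.0) (1.0, 1.0, 1.0, 1.0, 1.0)
```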
A limitation of this approach is that the generated vectors of weights will not sum to one (called the unit norm), as required. We can force each generated weight vector to have a unit norm by calculating the sum of the absolute weight values (called the L1 norm) and dividing each weight by that value. The normalize() function below implements this hack.
```python
# normalize a vector to have unit norm
def normalize(weights):
    # calculate l1 vector norm
    result = norm(weights, 1)
    # check for a vector of all zeros
    if result == 0.0:
        return weights
    # return normalized vector (unit norm)
    return weights / result
```
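For example (with made-up vectors), normalize() rescales any non-zero vector so that its weights sum to one. Note that because norm() returns a NumPy scalar, the division yields a NumPy array even when a plain tuple is passed in, which is what happens inside the grid search below:

```python
# illustrative usage of normalize() with made-up weight vectors
print(normalize((0.5, 1.0, 0.5)))   # rescaled to sum to one: [0.25 0.5 0.25]
print(normalize((0.2, 0.4, 0.4)))   # already sums to one, so (approximately) unchanged
print(normalize((0.0, 0.0, 0.0)))   # a vector of all zeros is returned unchanged
```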
We can now enumerate each weight vector generated by the Cartesian product, normalize it, and evaluate it by making a prediction, keeping the best-performing vector to use in our final weighted average ensemble. Vectors in which all of the weights are equal (including the all-zero vector) are skipped, as they add nothing beyond the equal-weight ensemble evaluated above.
```python
# grid search weights
def grid_search(members, testX, testy):
    # define weights to consider
    w = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
    best_score, best_weights = 0.0, None
    # iterate all possible combinations (cartesian product)
    for weights in product(w, repeat=len(members)):
        # skip if all weights are equal
        if len(set(weights)) == 1:
            continue
        # hack, normalize weight vector
        weights = normalize(weights)
        # evaluate weights
        score = evaluate_ensemble(members, weights, testX, testy)
        if score > best_score:
            best_score, best_weights = score, weights
            print('>%s %.3f' % (best_weights, best_score))
    return list(best_weights)
```
Once the best weights are discovered, we can report the performance of our weighted average ensemble on the test dataset, which we would expect to be better than the best single model and, ideally, better than the model averaging ensemble.
```python
# grid search weights
weights = grid_search(members, testX, testy)
score = evaluate_ensemble(members, weights, testX, testy)
print('Grid Search Weights: %s, Score: %.3f' % (weights, score))
```
The complete example is listed below.
```python
# grid search for coefficients in a weighted average ensemble for the blobs problem
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense
from matplotlib import pyplot
from numpy import mean
from numpy import std
from numpy import array
from numpy import argmax
from numpy import tensordot
from numpy.linalg import norm
from itertools import product

# fit model on dataset
def fit_model(trainX, trainy):
    trainy_enc = to_categorical(trainy)
    # define model
    model = Sequential()
    model.add(Dense(25, input_dim=2, activation='relu'))
    model.add(Dense(3, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    # fit model
    model.fit(trainX, trainy_enc, epochs=500, verbose=0)
    return model

# make an ensemble prediction for multi-class classification
def ensemble_predictions(members, weights, testX):
    # make predictions
    yhats = [model.predict(testX) for model in members]
    yhats = array(yhats)
    # weighted sum across ensemble members
    summed = tensordot(yhats, weights, axes=((0),(0)))
    # argmax across classes
    result = argmax(summed, axis=1)
    return result

# evaluate an ensemble with a given set of weights
def evaluate_ensemble(members, weights, testX, testy):
    # make prediction
    yhat = ensemble_predictions(members, weights, testX)
    # calculate accuracy
    return accuracy_score(testy, yhat)

# normalize a vector to have unit norm
def normalize(weights):
    # calculate l1 vector norm
    result = norm(weights, 1)
    # check for a vector of all zeros
    if result == 0.0:
        return weights
    # return normalized vector (unit norm)
    return weights / result

# grid search weights
def grid_search(members, testX, testy):
    # define weights to consider
    w = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
    best_score, best_weights = 0.0, None
    # iterate all possible combinations (cartesian product)
    for weights in product(w, repeat=len(members)):
        # skip if all weights are equal
        if len(set(weights)) == 1:
            continue
        # hack, normalize weight vector
        weights = normalize(weights)
        # evaluate weights
        score = evaluate_ensemble(members, weights, testX, testy)
        if score > best_score:
            best_score, best_weights = score, weights
            print('>%s %.3f' % (best_weights, best_score))
    return list(best_weights)

# generate 2d classification dataset
X, y = make_blobs(n_samples=1100, centers=3, n_features=2, cluster_std=2, random_state=2)
# split into train and test
n_train = 100
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
print(trainX.shape, testX.shape)
# fit all models
n_members = 5
members = [fit_model(trainX, trainy) for _ in range(n_members)]
# evaluate each single model on the test set
testy_enc = to_categorical(testy)
for i in range(n_members):
    _, test_acc = members[i].evaluate(testX, testy_enc, verbose=0)
    print('Model %d: %.3f' % (i+1, test_acc))
# evaluate averaging ensemble (equal weights)
weights = [1.0/n_members for _ in range(n_members)]
score = evaluate_ensemble(members, weights, testX, testy)
print('Equal Weights Score: %.3f' % score)
# grid search weights
weights = grid_search(members, testX, testy)
score = evaluate_ensemble(members, weights, testX, testy)
print('Grid Search Weights: %s, Score: %.3f' % (weights, score))
```
Running the example first creates the five single models and evaluates their performance on the test dataset.
Your specific results will vary given the stochastic nature of the learning algorithm.
On this run, we can see that model 2 has the best solo performance of about 81.7% accuracy.
Next, a model averaging ensemble is created with a performance of about 80.7%, which improves upon three of the five individual models but falls short of models 2 and 5.
```
(100, 2) (1000, 2)
Model 1: 0.798
Model 2: 0.817
Model 3: 0.798
Model 4: 0.806
Model 5: 0.810
Equal Weights Score: 0.807
```
Next, the grid search is performed. It is pretty slow and may take about twenty minutes on modern hardware. The process could easily be made parallel using libraries such as Joblib; a sketch of this is given after the results below.
Each time a new top performing set of weights is discovered, it is reported along with its performance on the test dataset. We can see that during the run, the process discovered that using model 2 alone resulted in a good performance, until it was replaced with something better.
We can see that the best performance on this run was achieved with weights that focus only on the first and second models, giving an accuracy of 81.8% on the test dataset. This out-performs all of the single models and the model averaging ensemble on the same dataset.
```
>[0. 0. 0. 0. 1.] 0.810
>[0. 0. 0. 0.5 0.5] 0.814
>[0. 0. 0. 0.33333333 0.66666667] 0.815
>[0. 1. 0. 0. 0.] 0.817
>[0.23076923 0.76923077 0. 0. 0. ] 0.818
Grid Search Weights: [0.23076923076923075, 0.7692307692307692, 0.0, 0.0, 0.0], Score: 0.818
```
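As noted above, the search could be parallelized. The sketch below is one possible way to do this with Joblib's Parallel and delayed helpers; it is illustrative only and assumes the members list and the normalize() and evaluate_ensemble() functions defined earlier. The threading backend is used because Keras models do not always survive being copied to worker processes.

```python
# hypothetical sketch: evaluate candidate weight vectors in parallel with Joblib
# assumes members, normalize() and evaluate_ensemble() are defined as above
from itertools import product
from joblib import Parallel, delayed

def grid_search_parallel(members, testX, testy, n_jobs=4):
    w = [i / 10.0 for i in range(11)]
    # keep only candidates where the weights are not all equal, normalized to sum to one
    candidates = [normalize(c) for c in product(w, repeat=len(members)) if len(set(c)) > 1]
    # score every candidate weight vector in parallel
    scores = Parallel(n_jobs=n_jobs, backend='threading')(
        delayed(evaluate_ensemble)(members, c, testX, testy) for c in candidates)
    # return the best-scoring weight vector and its score
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return list(candidates[best]), scores[best]
```

The behaviour matches grid_search() except that progress is not printed as the search runs.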
An alternate approach to finding weights would be a random search, which has been shown to be effective more generally for model hyperparameter tuning.
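A minimal sketch of that idea, using the same helper functions and members list as above (the number of trials is an arbitrary choice), might look as follows:

```python
# hypothetical sketch: random search for ensemble weights
# assumes members, normalize() and evaluate_ensemble() are defined as above
from numpy.random import rand

def random_search(members, testX, testy, n_iter=1000):
    best_score, best_weights = 0.0, None
    for _ in range(n_iter):
        # draw a random weight vector and rescale it to sum to one
        weights = normalize(rand(len(members)))
        score = evaluate_ensemble(members, weights, testX, testy)
        if score > best_score:
            best_score, best_weights = score, weights
            print('>%s %.3f' % (best_weights, best_score))
    return list(best_weights)

# example usage (commented out):
# weights = random_search(members, testX, testy)
# print('Random Search Weights: %s' % weights)
```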