The model averaging ensemble allows each ensemble member to contribute an equal amount to the prediction of the ensemble.
We can update the example so that, instead, the contribution of each ensemble member is weighted by a coefficient that indicates the trust or expected performance of the model. Weight values are fractions between 0 and 1, treated like percentages, such that the weights across all ensemble members sum to one.
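For intuition, here is a tiny numeric sketch; the probabilities and weights below are made up purely for illustration and are not taken from the models in this tutorial:

```python
# tiny illustration of a weighted combination of predictions
# (probabilities and weights are made up for demonstration only)
from numpy import array, argmax

# predicted class probabilities from three hypothetical models for one example
p1 = array([0.6, 0.3, 0.1])
p2 = array([0.4, 0.5, 0.1])
p3 = array([0.3, 0.4, 0.3])
# trust in each model; the weights sum to one
w = [0.5, 0.3, 0.2]
# weighted sum of the probabilities, then argmax to pick the class
summed = w[0]*p1 + w[1]*p2 + w[2]*p3
print(summed, argmax(summed))  # approx. [0.48 0.38 0.14], class 0
```

The prediction leans towards the models with the larger weights; the rest of this section automates exactly this calculation across all test examples.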
First, we must update the ensemble_predictions() function to make use of a vector of weights for each ensemble member.
Instead of simply summing the predictions across each ensemble member, we must calculate a weighted sum. We can implement this manually using for loops, but this is terribly inefficient; for example:
```python
# calculate a weighted sum of predictions
def weighted_sum(weights, yhats):
    rows = list()
    # enumerate examples (rows)
    for j in range(yhats.shape[1]):
        row = list()
        # enumerate classes (columns)
        for k in range(yhats.shape[2]):
            # enumerate ensemble members and accumulate the weighted value
            value = 0.0
            for i in range(yhats.shape[0]):
                value += weights[i] * yhats[i,j,k]
            row.append(value)
        rows.append(row)
    return array(rows)
```
Instead, we can use efficient NumPy functions to implement the weighted sum such as einsum() or tensordot().
A full discussion of these functions is out of scope, so please refer to the NumPy API documentation for more information; they can be challenging if you are new to linear algebra and/or NumPy. We will use the tensordot() function to apply the tensor product with the required summing; the updated ensemble_predictions() function is listed below.
```python
# make an ensemble prediction for multi-class classification
def ensemble_predictions(members, weights, testX):
    # make predictions
    yhats = [model.predict(testX) for model in members]
    yhats = array(yhats)
    # weighted sum across ensemble members
    summed = tensordot(yhats, weights, axes=((0),(0)))
    # argmax across classes
    result = argmax(summed, axis=1)
    return result
```
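As a quick sanity check, not part of the tutorial's listing, we can confirm on random data that tensordot() (and the equivalent einsum() call) produce the same result as the manual weighted_sum() loop above, assuming yhats has the shape (members, examples, classes):

```python
# sanity check: the vectorized weighted sums match the manual loop version
# (illustrative only; assumes weighted_sum() and "from numpy import array" are in scope)
from numpy import allclose, einsum, tensordot
from numpy.random import rand

yhats = rand(5, 100, 3)              # 5 members, 100 examples, 3 classes
weights = [0.1, 0.2, 0.3, 0.2, 0.2]  # weights sum to one
loop = weighted_sum(weights, yhats)
dot = tensordot(yhats, weights, axes=((0), (0)))
ein = einsum('ijk,i->jk', yhats, weights)
print(allclose(loop, dot), allclose(loop, ein))  # expect: True True
```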
Next, we must update evaluate_ensemble() to pass along the weights when making the prediction for the ensemble.
```python
# evaluate an ensemble with a given set of weights
def evaluate_ensemble(members, weights, testX, testy):
    # make prediction
    yhat = ensemble_predictions(members, weights, testX)
    # calculate accuracy
    return accuracy_score(testy, yhat)
```
We will use a modest-sized ensemble of five members, a size that appeared to perform well in the model averaging ensemble.
```python
# fit all models
n_members = 5
members = [fit_model(trainX, trainy) for _ in range(n_members)]
```
We can then estimate the performance of each individual model on the test dataset as a reference.
```python
# evaluate each single model on the test set
testy_enc = to_categorical(testy)
for i in range(n_members):
    _, test_acc = members[i].evaluate(testX, testy_enc, verbose=0)
    print('Model %d: %.3f' % (i+1, test_acc))
```
Next, we can use a weight of 1/5 or 0.2 for each of the five ensemble members and use the new functions to estimate the performance of a model averaging ensemble, a so-called equal-weight ensemble.
We would expect this ensemble to perform as well or better than any single model.
```python
# evaluate averaging ensemble (equal weights)
weights = [1.0/n_members for _ in range(n_members)]
score = evaluate_ensemble(members, weights, testX, testy)
print('Equal Weights Score: %.3f' % score)
```
Finally, we can develop a weighted average ensemble.
A simple but exhaustive approach to finding weights for the ensemble members is to grid search values. We can define a coarse grid of weight values from 0.0 to 1.0 in steps of 0.1, then generate all possible five-element vectors with those values. Generating all possible combinations is called a Cartesian product, which can be implemented in Python using the itertools.product() function from the standard library.
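To get a sense of the size of this search, note that an 11-value grid over five members yields 11**5 = 161,051 candidate weight vectors. A quick illustration (not part of the tutorial's listing):

```python
# illustrative only: count the candidate weight vectors produced by the grid
from itertools import product

w = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
candidates = list(product(w, repeat=5))
print(len(candidates))                # 11**5 = 161051
print(candidates[0], candidates[-1])  # (0.0, 0.0, 0.0, 0.0, 0.0) (1.0, 1.0, 1.0, 1.0, 1.0)
```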
A limitation of this approach is that the generated vectors of weights will not sum to one (called the unit norm), as required. We can force each generated weight vector to have a unit norm by calculating the sum of the absolute weight values (called the L1 norm) and dividing each weight by that value. The normalize() function below implements this hack.
```python
# normalize a vector to have unit norm
def normalize(weights):
    # calculate l1 vector norm
    result = norm(weights, 1)
    # check for a vector of all zeros
    if result == 0.0:
        return weights
    # return normalized vector (unit norm)
    return weights / result
```
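For example (with made-up vectors), normalize() rescales any non-zero vector so that its weights sum to one. Note that because norm() returns a NumPy scalar, the division yields a NumPy array even when a plain tuple is passed in, which is what happens inside the grid search below:

```python
# illustrative usage of normalize() with made-up weight vectors
print(normalize((0.5, 1.0, 0.5)))   # rescaled to sum to one: [0.25 0.5 0.25]
print(normalize((0.2, 0.4, 0.4)))   # already sums to one, so (approximately) unchanged
print(normalize((0.0, 0.0, 0.0)))   # a vector of all zeros is returned unchanged
```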
We can now enumerate each weight vector generated by the Cartesian product, normalize it, and evaluate it by making a prediction, keeping the best-performing vector to use in our final weighted average ensemble. Vectors in which all of the weights are equal (including the all-zero vector) are skipped, as they add nothing beyond the equal-weight ensemble evaluated above.
```python
# grid search weights
def grid_search(members, testX, testy):
    # define weights to consider
    w = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
    best_score, best_weights = 0.0, None
    # iterate all possible combinations (cartesian product)
    for weights in product(w, repeat=len(members)):
        # skip if all weights are equal
        if len(set(weights)) == 1:
            continue
        # hack, normalize weight vector
        weights = normalize(weights)
        # evaluate weights
        score = evaluate_ensemble(members, weights, testX, testy)
        if score > best_score:
            best_score, best_weights = score, weights
            print('>%s %.3f' % (best_weights, best_score))
    return list(best_weights)
```
Once the best weights are discovered, we can report the performance of our weighted average ensemble on the test dataset, which we would expect to be better than the best single model and, ideally, better than the model averaging ensemble.
```python
# grid search weights
weights = grid_search(members, testX, testy)
score = evaluate_ensemble(members, weights, testX, testy)
print('Grid Search Weights: %s, Score: %.3f' % (weights, score))
```
The complete example is listed below.
```python
# grid search for coefficients in a weighted average ensemble for the blobs problem
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense
from matplotlib import pyplot
from numpy import mean
from numpy import std
from numpy import array
from numpy import argmax
from numpy import tensordot
from numpy.linalg import norm
from itertools import product

# fit model on dataset
def fit_model(trainX, trainy):
    trainy_enc = to_categorical(trainy)
    # define model
    model = Sequential()
    model.add(Dense(25, input_dim=2, activation='relu'))
    model.add(Dense(3, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    # fit model
    model.fit(trainX, trainy_enc, epochs=500, verbose=0)
    return model

# make an ensemble prediction for multi-class classification
def ensemble_predictions(members, weights, testX):
    # make predictions
    yhats = [model.predict(testX) for model in members]
    yhats = array(yhats)
    # weighted sum across ensemble members
    summed = tensordot(yhats, weights, axes=((0),(0)))
    # argmax across classes
    result = argmax(summed, axis=1)
    return result

# evaluate an ensemble with a given set of weights
def evaluate_ensemble(members, weights, testX, testy):
    # make prediction
    yhat = ensemble_predictions(members, weights, testX)
    # calculate accuracy
    return accuracy_score(testy, yhat)

# normalize a vector to have unit norm
def normalize(weights):
    # calculate l1 vector norm
    result = norm(weights, 1)
    # check for a vector of all zeros
    if result == 0.0:
        return weights
    # return normalized vector (unit norm)
    return weights / result

# grid search weights
def grid_search(members, testX, testy):
    # define weights to consider
    w = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
    best_score, best_weights = 0.0, None
    # iterate all possible combinations (cartesian product)
    for weights in product(w, repeat=len(members)):
        # skip if all weights are equal
        if len(set(weights)) == 1:
            continue
        # hack, normalize weight vector
        weights = normalize(weights)
        # evaluate weights
        score = evaluate_ensemble(members, weights, testX, testy)
        if score > best_score:
            best_score, best_weights = score, weights
            print('>%s %.3f' % (best_weights, best_score))
    return list(best_weights)

# generate 2d classification dataset
X, y = make_blobs(n_samples=1100, centers=3, n_features=2, cluster_std=2, random_state=2)
# split into train and test
n_train = 100
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
print(trainX.shape, testX.shape)
# fit all models
n_members = 5
members = [fit_model(trainX, trainy) for _ in range(n_members)]
# evaluate each single model on the test set
testy_enc = to_categorical(testy)
for i in range(n_members):
    _, test_acc = members[i].evaluate(testX, testy_enc, verbose=0)
    print('Model %d: %.3f' % (i+1, test_acc))
# evaluate averaging ensemble (equal weights)
weights = [1.0/n_members for _ in range(n_members)]
score = evaluate_ensemble(members, weights, testX, testy)
print('Equal Weights Score: %.3f' % score)
# grid search weights
weights = grid_search(members, testX, testy)
score = evaluate_ensemble(members, weights, testX, testy)
print('Grid Search Weights: %s, Score: %.3f' % (weights, score))
```
Running the example first creates the five single models and evaluates their performance on the test dataset.
Your specific results will vary given the stochastic nature of the learning algorithm.
On this run, we can see that model 2 has the best solo performance of about 81.7% accuracy.
Next, a model averaging ensemble is created with a performance of about 80.7%, which improves upon three of the five individual models but falls short of models 2 and 5.
```
(100, 2) (1000, 2)
Model 1: 0.798
Model 2: 0.817
Model 3: 0.798
Model 4: 0.806
Model 5: 0.810
Equal Weights Score: 0.807
```
Next, the grid search is performed. It is pretty slow and may take about twenty minutes on modern hardware. The process could easily be made parallel using libraries such as Joblib; a sketch of this is given after the results below.
Each time a new top performing set of weights is discovered, it is reported along with its performance on the test dataset. We can see that during the run, the process discovered that using model 2 alone resulted in a good performance, until it was replaced with something better.
We can see that the best performance on this run was achieved with weights that focus only on the first and second models, giving an accuracy of 81.8% on the test dataset. This out-performs all of the single models and the model averaging ensemble on the same dataset.
```
>[0. 0. 0. 0. 1.] 0.810
>[0. 0. 0. 0.5 0.5] 0.814
>[0. 0. 0. 0.33333333 0.66666667] 0.815
>[0. 1. 0. 0. 0.] 0.817
>[0.23076923 0.76923077 0. 0. 0. ] 0.818
Grid Search Weights: [0.23076923076923075, 0.7692307692307692, 0.0, 0.0, 0.0], Score: 0.818
```
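As noted above, the search could be parallelized. The sketch below is one possible way to do this with Joblib's Parallel and delayed helpers; it is illustrative only and assumes the members list and the normalize() and evaluate_ensemble() functions defined earlier. The threading backend is used because Keras models do not always survive being copied to worker processes.

```python
# hypothetical sketch: evaluate candidate weight vectors in parallel with Joblib
# assumes members, normalize() and evaluate_ensemble() are defined as above
from itertools import product
from joblib import Parallel, delayed

def grid_search_parallel(members, testX, testy, n_jobs=4):
    w = [i / 10.0 for i in range(11)]
    # keep only candidates where the weights are not all equal, normalized to sum to one
    candidates = [normalize(c) for c in product(w, repeat=len(members)) if len(set(c)) > 1]
    # score every candidate weight vector in parallel
    scores = Parallel(n_jobs=n_jobs, backend='threading')(
        delayed(evaluate_ensemble)(members, c, testX, testy) for c in candidates)
    # return the best-scoring weight vector and its score
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return list(candidates[best]), scores[best]
```

The behaviour matches grid_search() except that progress is not printed as the search runs.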
An alternate approach to finding weights would be a random search, which has been shown to be effective more generally for model hyperparameter tuning.
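A minimal sketch of that idea, using the same helper functions and members list as above (the number of trials is an arbitrary choice), might look as follows:

```python
# hypothetical sketch: random search for ensemble weights
# assumes members, normalize() and evaluate_ensemble() are defined as above
from numpy.random import rand

def random_search(members, testX, testy, n_iter=1000):
    best_score, best_weights = 0.0, None
    for _ in range(n_iter):
        # draw a random weight vector and rescale it to sum to one
        weights = normalize(rand(len(members)))
        score = evaluate_ensemble(members, weights, testX, testy)
        if score > best_score:
            best_score, best_weights = score, weights
            print('>%s %.3f' % (best_weights, best_score))
    return list(best_weights)

# example usage (commented out):
# weights = random_search(members, testX, testy)
# print('Random Search Weights: %s' % weights)
```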