Model Averaging Ensemble

We can develop a simple model averaging ensemble before we look at developing a weighted average ensemble.

The results of the model averaging ensemble can be used as a point of comparison, as we would expect a well-configured weighted average ensemble to perform better.

First, we need to fit multiple models from which to develop an ensemble. We will define a function named fit_model() to create and fit a single model on the training dataset that we can call repeatedly to create as many models as we wish.

# fit model on dataset
def fit_model(trainX, trainy):
	trainy_enc = to_categorical(trainy)
	# define model
	model = Sequential()
	model.add(Dense(25, input_dim=2, activation='relu'))
	model.add(Dense(3, activation='softmax'))
	model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
	# fit model
	model.fit(trainX, trainy_enc, epochs=500, verbose=0)
	return model

We can call this function to create a pool of 10 models.

# fit all models
n_members = 10
members = [fit_model(trainX, trainy) for _ in range(n_members)]

Next, we can develop a model averaging ensemble.

We don’t know how many members would be appropriate for this problem, so we can create ensembles with different sizes from one to 10 members and evaluate the performance of each on the test set.

We can also evaluate the performance of each standalone model on the test set. This provides a useful point of comparison for the model averaging ensemble, as we expect that the ensemble will out-perform a randomly selected single model on average.

Each model predicts the probabilities for each class label, i.e. each model has three outputs. A single prediction can be converted to a class label by using the argmax() function on the predicted probabilities, i.e. returning the index in the prediction with the largest probability value. We can ensemble the predictions from multiple models by summing the predicted probabilities for each class and applying argmax() to the result. The ensemble_predictions() function below implements this behavior, and a small worked example follows it.

# make an ensemble prediction for multi-class classification
def ensemble_predictions(members, testX):
	# make predictions
	yhats = [model.predict(testX) for model in members]
	yhats = array(yhats)
	# sum across ensemble members
	summed = numpy.sum(yhats, axis=0)
	# argmax across classes
	result = argmax(summed, axis=1)
	return result
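For intuition, the small worked example below uses made-up predicted probabilities (purely illustrative values, not outputs from the fit models) for two models, one sample, and three classes. Summing the probabilities element-wise and taking the argmax selects the first class, which has the largest summed probability.

# worked example of summing probabilities and taking the argmax
from numpy import array
from numpy import argmax
import numpy
# made-up probabilities: two models, one sample, three classes
yhats = array([[[0.6, 0.3, 0.1]],
               [[0.4, 0.5, 0.1]]])
summed = numpy.sum(yhats, axis=0)
print(summed) # [[1.  0.8 0.2]]
print(argmax(summed, axis=1)) # [0]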

We can estimate the performance of an ensemble of a given size by selecting the required number of models from the list of all models, calling the ensemble_predictions() function to make a prediction, then calculating the accuracy of the prediction by comparing it to the true values. The evaluate_n_members() function below implements this behavior.

# evaluate a specific number of members in an ensemble
def evaluate_n_members(members, n_members, testX, testy):
	# select a subset of members
	subset = members[:n_members]
	# make prediction
	yhat = ensemble_predictions(subset, testX)
	# calculate accuracy
	return accuracy_score(testy, yhat)
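For example, assuming the list of fit models and the test set are defined as above, the accuracy of an ensemble of the first five models could be estimated as follows (variable names match those used in this tutorial):

# example usage: score an ensemble of the first five models
score = evaluate_n_members(members, 5, testX, testy)
print('5-member ensemble accuracy: %.3f' % score)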

The score for each ensemble size can be stored so it can be plotted later, and the scores of the individual models are collected and their average performance reported.

# evaluate different numbers of ensembles on hold out set
single_scores, ensemble_scores = list(), list()
for i in range(1, len(members)+1):
	# evaluate model with i members
	ensemble_score = evaluate_n_members(members, i, testX, testy)
	# evaluate the i'th model standalone
	testy_enc = to_categorical(testy)
	_, single_score = members[i-1].evaluate(testX, testy_enc, verbose=0)
	# summarize this step
	print('> %d: single=%.3f, ensemble=%.3f' % (i, single_score, ensemble_score))
	ensemble_scores.append(ensemble_score)
	single_scores.append(single_score)
# summarize average accuracy of a single final model
print('Accuracy %.3f (%.3f)' % (mean(single_scores), std(single_scores)))

Finally, we create a graph that shows the accuracy of each individual model (blue dots) and the performance of the model averaging ensemble as the number of members is increased from one to 10 members (orange line).
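The few lines below, which are also included in the complete example, create this plot.

# plot score vs number of ensemble members
x_axis = [i for i in range(1, len(members)+1)]
pyplot.plot(x_axis, single_scores, marker='o', linestyle='None')
pyplot.plot(x_axis, ensemble_scores, marker='o')
pyplot.show()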

Tying all of this together, the complete example is listed below.

# model averaging ensemble for the blobs dataset
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense
from matplotlib import pyplot
from numpy import mean
from numpy import std
import numpy
from numpy import array
from numpy import argmax

# fit model on dataset
def fit_model(trainX, trainy):
	trainy_enc = to_categorical(trainy)
	# define model
	model = Sequential()
	model.add(Dense(25, input_dim=2, activation='relu'))
	model.add(Dense(3, activation='softmax'))
	model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
	# fit model
	model.fit(trainX, trainy_enc, epochs=500, verbose=0)
	return model

# make an ensemble prediction for multi-class classification
def ensemble_predictions(members, testX):
	# make predictions
	yhats = [model.predict(testX) for model in members]
	yhats = array(yhats)
	# sum across ensemble members
	summed = numpy.sum(yhats, axis=0)
	# argmax across classes
	result = argmax(summed, axis=1)
	return result

# evaluate a specific number of members in an ensemble
def evaluate_n_members(members, n_members, testX, testy):
	# select a subset of members
	subset = members[:n_members]
	# make prediction
	yhat = ensemble_predictions(subset, testX)
	# calculate accuracy
	return accuracy_score(testy, yhat)

# generate 2d classification dataset
X, y = make_blobs(n_samples=1100, centers=3, n_features=2, cluster_std=2, random_state=2)
# split into train and test
n_train = 100
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
print(trainX.shape, testX.shape)
# fit all models
n_members = 10
members = [fit_model(trainX, trainy) for _ in range(n_members)]
# evaluate different numbers of ensembles on hold out set
single_scores, ensemble_scores = list(), list()
for i in range(1, len(members)+1):
	# evaluate model with i members
	ensemble_score = evaluate_n_members(members, i, testX, testy)
	# evaluate the i'th model standalone
	testy_enc = to_categorical(testy)
	_, single_score = members[i-1].evaluate(testX, testy_enc, verbose=0)
	# summarize this step
	print('> %d: single=%.3f, ensemble=%.3f' % (i, single_score, ensemble_score))
	ensemble_scores.append(ensemble_score)
	single_scores.append(single_score)
# summarize average accuracy of a single final model
print('Accuracy %.3f (%.3f)' % (mean(single_scores), std(single_scores)))
# plot score vs number of ensemble members
x_axis = [i for i in range(1, len(members)+1)]
pyplot.plot(x_axis, single_scores, marker='o', linestyle='None')
pyplot.plot(x_axis, ensemble_scores, marker='o')
pyplot.show()

Running the example first reports the performance of each single model as well as the model averaging ensemble of a given size with 1, 2, 3, etc. members.

Your results will vary given the stochastic nature of the training algorithm.

On this run, the average performance of the single models is reported at about 80.4%, and we can see that ensembles with between five and nine members achieve a performance between 80.8% and 81.1%. As expected, a modest-sized model averaging ensemble out-performs a randomly selected single model on average.

(100, 2) (1000, 2)
> 1: single=0.803, ensemble=0.803
> 2: single=0.805, ensemble=0.808
> 3: single=0.798, ensemble=0.805
> 4: single=0.809, ensemble=0.809
> 5: single=0.808, ensemble=0.811
> 6: single=0.805, ensemble=0.808
> 7: single=0.805, ensemble=0.808
> 8: single=0.804, ensemble=0.809
> 9: single=0.810, ensemble=0.810
> 10: single=0.794, ensemble=0.808
Accuracy 0.804 (0.005)

Next, a graph is created comparing the accuracy of single models (blue dots) to the model averaging ensemble of increasing size (orange line).

On this run, the orange line for the ensembles shows performance that is better than or comparable to that of the single models (where comparable, the blue dots may be hidden behind the line).

Line Plot Showing Single Model Accuracy (blue dots) and Accuracy of Ensembles of Increasing Size (orange line)

Now that we know how to develop a model averaging ensemble, we can extend the approach one step further by weighting the contributions of the ensemble members.