Weighted Average MLP Ensemble

An alternative to searching for weight values is to use a directed optimization process.

Optimization is a search process, but instead of sampling the space of possible solutions randomly or exhaustively, it uses any available information to direct the next step in the search, such as moving toward a set of weights that has lower error.

The SciPy library offers many excellent optimization algorithms, including local and global search methods.

SciPy provides an implementation of the Differential Evolution method. It is one of the few stochastic global search algorithms that "just works" for function optimization with continuous inputs, and it works well.
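
As a quick illustration of the API (a minimal sketch on a toy problem, separate from the ensemble example), differential_evolution() can minimize a simple continuous function given only the objective function and the bounds of each input:

# minimal sketch: minimize a simple continuous function with differential evolution
from scipy.optimize import differential_evolution

# sphere function: minimum of 0.0 at x = (0, 0)
def objective(x):
    return x[0]**2.0 + x[1]**2.0

# bounds on each input variable
bounds = [(-5.0, 5.0), (-5.0, 5.0)]
# perform the search
result = differential_evolution(objective, bounds)
# best input found and its evaluation
print(result.x, result.fun)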

The differential_evolution() SciPy function requires that a function be specified to evaluate a set of weights and return a score to be minimized. We can minimize the classification error (1 - accuracy).

As with the grid search, we must normalize the weight vector before we evaluate it. The loss_function() function below will be used as the evaluation function during the optimization process.


# loss function for optimization process, designed to be minimized
def loss_function(weights, members, testX, testy):
    # normalize weights
    normalized = normalize(weights)
    # calculate error rate
    return 1.0 - evaluate_ensemble(members, normalized, testX, testy)
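
As a quick numeric check (with illustrative values, not part of the tutorial's listing), the L1 normalization used by normalize() rescales any non-negative weight vector so that it sums to one:

# quick check of the L1 normalization used by loss_function()
from numpy import array
from numpy.linalg import norm

weights = array([2.0, 1.0, 1.0, 0.5, 0.5])
normalized = weights / norm(weights, 1)
print(normalized)        # [0.4 0.2 0.2 0.1 0.1]
print(normalized.sum())  # 1.0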

We must also specify the bounds of the optimization process. We can define the bounds as a five-dimensional hypercube (i.e. one weight for each of the five ensemble members) with values between 0.0 and 1.0.


# define bounds on each weight
bound_w = [(0.0, 1.0) for _ in range(n_members)]

Our loss function requires three parameters in addition to the weights, which we will provide as a tuple that is then passed along to loss_function() each time a set of weights is evaluated.


# arguments to the loss function
search_arg = (members, testX, testy)

We can now call our optimization process.

We will limit the total number of iterations of the algorithm to 1,000 and use a smaller-than-default tolerance to detect whether the search process has converged.


# global optimization of ensemble weights
result = differential_evolution(loss_function, bound_w, search_arg, maxiter=1000, tol=1e-7)

The result of the call to differential_evolution() is an OptimizeResult object, a dictionary-like structure that contains all kinds of information about the search.

Importantly, the 'x' key contains the optimal set of weights found during the search. We can retrieve the best set of weights, then report them and their performance on the test set when used in a weighted ensemble.


# get the chosen weights
weights = normalize(result['x'])
print('Optimized Weights: %s' % weights)
# evaluate chosen weights
score = evaluate_ensemble(members, weights, testX, testy)
print('Optimized Weights Score: %.3f' % score)
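
Beyond 'x', the result exposes other useful fields (names per the SciPy documentation), such as the best loss found, the number of iterations used, and whether the search converged. For example:

# other useful fields on the OptimizeResult
print('Best loss: %.3f' % result['fun'])
print('Iterations: %d' % result['nit'])
print('Converged: %s' % result['success'])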

Tying all of this together, the complete example is listed below.


# global optimization to find coefficients for weighted ensemble on blobs problem
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense
from numpy import array
from numpy import argmax
from numpy import tensordot
from numpy.linalg import norm
from scipy.optimize import differential_evolution

# fit model on dataset
def fit_model(trainX, trainy):
    trainy_enc = to_categorical(trainy)
    # define model
    model = Sequential()
    model.add(Dense(25, input_dim=2, activation='relu'))
    model.add(Dense(3, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    # fit model
    model.fit(trainX, trainy_enc, epochs=500, verbose=0)
    return model

# make an ensemble prediction for multi-class classification
def ensemble_predictions(members, weights, testX):
    # make predictions
    yhats = [model.predict(testX) for model in members]
    yhats = array(yhats)
    # weighted sum across ensemble members
    summed = tensordot(yhats, weights, axes=((0),(0)))
    # argmax across classes
    result = argmax(summed, axis=1)
    return result

# evaluate an ensemble with the given weights
def evaluate_ensemble(members, weights, testX, testy):
    # make prediction
    yhat = ensemble_predictions(members, weights, testX)
    # calculate accuracy
    return accuracy_score(testy, yhat)

# normalize a vector to have unit norm
def normalize(weights):
    # calculate l1 vector norm
    result = norm(weights, 1)
    # check for a vector of all zeros
    if result == 0.0:
        return weights
    # return normalized vector (unit norm)
    return weights / result

# loss function for optimization process, designed to be minimized
def loss_function(weights, members, testX, testy):
    # normalize weights
    normalized = normalize(weights)
    # calculate error rate
    return 1.0 - evaluate_ensemble(members, normalized, testX, testy)

# generate 2d classification dataset
X, y = make_blobs(n_samples=1100, centers=3, n_features=2, cluster_std=2, random_state=2)
# split into train and test
n_train = 100
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
print(trainX.shape, testX.shape)
# fit all models
n_members = 5
members = [fit_model(trainX, trainy) for _ in range(n_members)]
# evaluate each single model on the test set
testy_enc = to_categorical(testy)
for i in range(n_members):
    _, test_acc = members[i].evaluate(testX, testy_enc, verbose=0)
    print('Model %d: %.3f' % (i+1, test_acc))
# evaluate averaging ensemble (equal weights)
weights = [1.0/n_members for _ in range(n_members)]
score = evaluate_ensemble(members, weights, testX, testy)
print('Equal Weights Score: %.3f' % score)
# define bounds on each weight
bound_w = [(0.0, 1.0) for _ in range(n_members)]
# arguments to the loss function
search_arg = (members, testX, testy)
# global optimization of ensemble weights
result = differential_evolution(loss_function, bound_w, search_arg, maxiter=1000, tol=1e-7)
# get the chosen weights
weights = normalize(result['x'])
print('Optimized Weights: %s' % weights)
# evaluate chosen weights
score = evaluate_ensemble(members, weights, testX, testy)
print('Optimized Weights Score: %.3f' % score)
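
As an aside, the tensordot() call in ensemble_predictions() computes the weighted sum of the per-model probability predictions by collapsing the first axis (the ensemble members) of the (members, samples, classes) array against the weight vector. A small sanity check with made-up numbers (not from the example above) confirms the equivalence:

# sanity check: tensordot over axis 0 equals an explicit weighted sum
from numpy import array, tensordot

# fake predictions: 2 members, 1 sample, 3 classes
yhats = array([[[0.2, 0.5, 0.3]],
               [[0.6, 0.3, 0.1]]])
weights = array([0.75, 0.25])
summed = tensordot(yhats, weights, axes=((0), (0)))
manual = weights[0] * yhats[0] + weights[1] * yhats[1]
print(summed)  # [[0.3 0.45 0.25]]
print(manual)  # same values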

Running the example first creates five single models and evaluates the performance of each on the test dataset.

Your specific results will vary given the stochastic nature of the learning algorithm.

We can see on this run that models 3 and 4 both perform best with an accuracy of about 82.2%.

Next, a model averaging ensemble with all five members is evaluated on the test set, reporting an accuracy of 81.8%, which is better than some, but not all, of the single models.


(100, 2) (1000, 2)
Model 1: 0.814
Model 2: 0.811
Model 3: 0.822
Model 4: 0.822
Model 5: 0.809
Equal Weights Score: 0.818

The optimization process is relatively quick.

We can see that the process found a set of weights that pays most attention to models 3 and 4, and spreads the remaining attention out among the other models, achieving an accuracy of about 82.4%, out-performing the model averaging ensemble and individual models.


Optimized Weights: [0.1660322  0.09652591 0.33991854 0.34540932 0.05211403]
Optimized Weights Score: 0.824

It is important to note that in these examples, we have treated the test dataset as though it were a validation dataset. This was done to keep the examples focused and technically simpler. In practice, the ensemble weights would be chosen and tuned using a validation dataset, and single models, model averaging ensembles, and weighted ensembles would then be compared on a separate test set.
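
For completeness, a minimal sketch of that more rigorous protocol is shown below. It assumes the same helper functions and n_members as the complete example above, and the split sizes are illustrative only:

# sketch: tune ensemble weights on a validation set, then report on a held-out test set
# assumes fit_model(), loss_function(), evaluate_ensemble(), normalize() and
# n_members are defined as in the complete example above
X, y = make_blobs(n_samples=1100, centers=3, n_features=2, cluster_std=2, random_state=2)
trainX, valX, testX = X[:100], X[100:600], X[600:]
trainy, valy, testy = y[:100], y[100:600], y[600:]
members = [fit_model(trainX, trainy) for _ in range(n_members)]
# optimize the weights against the validation set only
bound_w = [(0.0, 1.0) for _ in range(n_members)]
result = differential_evolution(loss_function, bound_w, (members, valX, valy), maxiter=1000, tol=1e-7)
weights = normalize(result['x'])
# the test set is used once, for the final comparison
print('Weighted Ensemble (test): %.3f' % evaluate_ensemble(members, weights, testX, testy))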