How to Choose Loss Functions When Training Deep Learning Neural Networks
Deep learning neural networks are trained using the stochastic gradient descent optimization algorithm.
As part of the optimization algorithm, the error for the current state of the model must be estimated repeatedly. This requires the choice of an error function, conventionally called a loss function, that can be used to estimate the loss of the model so that the weights can be updated to reduce the loss on the next evaluation.
Neural network models learn a mapping from inputs to outputs from example data, and the choice of loss function must match the framing of the specific predictive modeling problem, such as classification or regression. Further, the configuration of the output layer must also be appropriate for the chosen loss function.
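For example, a regression problem is usually paired with a single linear output node and a squared-error loss, whereas a binary classification problem is usually paired with a single sigmoid output node and a cross-entropy loss. Below is a minimal Keras sketch of these two pairings; it is illustrative only, and the model names and the assumption of 20 input features are not part of the worked example later in the tutorial.

from keras.models import Sequential
from keras.layers import Dense

# regression: one linear output node paired with mean squared error loss
reg_model = Sequential([Dense(1, activation='linear', input_dim=20)])
reg_model.compile(loss='mean_squared_error', optimizer='sgd')

# binary classification: one sigmoid output node paired with binary cross-entropy loss
clf_model = Sequential([Dense(1, activation='sigmoid', input_dim=20)])
clf_model.compile(loss='binary_crossentropy', optimizer='sgd')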
In this tutorial, you will discover how to choose a loss function for your deep learning neural network for a given predictive modeling problem.
After completing this tutorial, you will know:
· How to configure a model for mean squared error and variants for regression problems.
· How to configure a model for cross-entropy and hinge loss functions for binary classification.
· How to configure a model for cross-entropy and KL divergence loss functions for multi-class classification.
Let’s get started.
This tutorial is divided into three parts; they are:
1. Regression Loss Functions
   1. Mean Squared Error Loss
   2. Mean Squared Logarithmic Error Loss
   3. Mean Absolute Error Loss
2. Binary Classification Loss Functions
   1. Binary Cross-Entropy Loss
   2. Hinge Loss
   3. Squared Hinge Loss
3. Multi-Class Classification Loss Functions
   1. Multi-Class Cross-Entropy Loss
   2. Sparse Multi-Class Cross-Entropy Loss
   3. Kullback-Leibler Divergence Loss
Regression Loss Functions

A regression predictive modeling problem involves predicting a real-valued quantity.
In this section, we will investigate loss functions that are appropriate for regression predictive modeling problems.
As the context for this investigation, we will use a standard regression problem generator provided by the scikit-learn library in the make_regression() function. This function will generate examples from a simple regression problem with a given number of input variables, statistical noise, and other properties.
We will use this function to define a problem that has 20 input features; 10 of the features will be meaningful and 10 will not be relevant. A total of 1,000 examples will be randomly generated. The pseudorandom number generator will be fixed to ensure that we get the same 1,000 examples each time the code is run.
from sklearn.datasets import make_regression

# generate regression dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=1)
Neural networks generally perform better when the real-valued input and output variables are scaled to a sensible range. For this problem, each of the input variables and the target variable has a Gaussian distribution; therefore, standardizing the data in this case is desirable.
We can achieve this using the StandardScaler transformer class also from the scikit-learn library. On a real problem, we would prepare the scaler on the training dataset and apply it to the train and test sets, but for simplicity, we will scale all of the data together before splitting into train and test sets.
from sklearn.preprocessing import StandardScaler

# standardize dataset
X = StandardScaler().fit_transform(X)
y = StandardScaler().fit_transform(y.reshape(len(y), 1))[:, 0]
Once scaled, the data will be split evenly into train and test sets.
# split into train and test
n_train = 500
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
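As an aside, the leakage-free approach mentioned above would split the data first and then fit the scalers on the training rows only. A minimal sketch is shown below; it assumes the same unscaled X, y, and n_train, it would replace both the standardization and splitting steps above, and the scaler variable names are introduced here purely for illustration.

from sklearn.preprocessing import StandardScaler

# fit the scalers on the training rows only, then apply them to both splits
x_scaler = StandardScaler().fit(X[:n_train, :])
trainX, testX = x_scaler.transform(X[:n_train, :]), x_scaler.transform(X[n_train:, :])
y_scaler = StandardScaler().fit(y[:n_train].reshape(-1, 1))
trainy = y_scaler.transform(y[:n_train].reshape(-1, 1))[:, 0]
testy = y_scaler.transform(y[n_train:].reshape(-1, 1))[:, 0]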
A small Multilayer Perceptron (MLP) model will be defined to address this problem and provide the basis for exploring different loss functions.
The model will expect 20 features as input, as defined by the problem. The model will have one hidden layer with 25 nodes and will use the rectified linear activation function. The output layer will have 1 node, given the single real value to be predicted, and will use the linear activation function.
from keras.models import Sequential
from keras.layers import Dense

# define model
model = Sequential()
model.add(Dense(25, input_dim=20, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(1, activation='linear'))
The model will be fit using stochastic gradient descent with a learning rate of 0.01 and a momentum of 0.9, both sensible default values.
Training will be performed for 100 epochs and the test set will be evaluated at the end of each epoch so that we can plot learning curves at the end of the run.
from keras.optimizers import SGD

# compile model; '...' is a placeholder for the chosen loss function
opt = SGD(lr=0.01, momentum=0.9)
model.compile(loss='...', optimizer=opt)
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=100, verbose=0)
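To plot the learning curves mentioned above, one simple option (assuming matplotlib is installed) is to plot the per-epoch training and validation loss recorded in the returned history object:

from matplotlib import pyplot

# plot the train and test loss recorded during training
pyplot.title('Loss')
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
pyplot.show()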
Now that we have the basis of a problem and model, we can take a look at evaluating three common loss functions that are appropriate for a regression predictive modeling problem.
Although an MLP is used in these examples, the same loss functions can be used when training CNN and RNN models for regression.
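As a concrete illustration of where the loss function plugs in, the '...' placeholder in the compile step above is simply replaced with the name of the chosen loss; for example, for the mean squared error loss covered next:

# compile the model with mean squared error loss
model.compile(loss='mean_squared_error', optimizer=opt)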