Mean Squared Error in Keras

Computes the mean of squares of errors between labels and predictions.

Inherits From: Loss

Main alias: tf.losses.MeanSquaredError

Compat alias for migration (see the Migration guide for more details): tf.compat.v1.keras.losses.MeanSquaredError

tf.keras.losses.MeanSquaredError(
    reduction=losses_utils.ReductionV2.AUTO,
    name='mean_squared_error'
)

Used in the guides and tutorials:
  • Training and evaluation with the built-in methods
  • Making new Layers and Models via subclassing
  • Time series forecasting
  • Intro to Autoencoders
  • Load CSV data
  • Hello, many worlds
  • Recommending movies: ranking

loss = square(y_true - y_pred)

Standalone usage:

y_true = [[0., 1.], [0., 0.]]
y_pred = [[1., 1.], [1., 0.]]
# Using 'auto'/'sum_over_batch_size' reduction type.
mse = tf.keras.losses.MeanSquaredError()
mse(y_true, y_pred).numpy()
0.5
# Calling with 'sample_weight'.
mse(y_true, y_pred, sample_weight=[0.7, 0.3]).numpy()
0.25
# Using 'sum' reduction type.
mse = tf.keras.losses.MeanSquaredError(
    reduction=tf.keras.losses.Reduction.SUM)
mse(y_true, y_pred).numpy()
1.0
# Using 'none' reduction type.
mse = tf.keras.losses.MeanSquaredError(
    reduction=tf.keras.losses.Reduction.NONE)
mse(y_true, y_pred).numpy()
array([0.5, 0.5], dtype=float32)

Usage with the compile() API:

model.compile(optimizer='sgd', loss=tf.keras.losses.MeanSquaredError())

Args

reduction: Type of tf.keras.losses.Reduction to apply to the loss. Default value is AUTO. AUTO indicates that the reduction option will be determined by the usage context. For almost all cases this defaults to SUM_OVER_BATCH_SIZE. When used with tf.distribute.Strategy, outside of built-in training loops such as tf.keras compile and fit, using AUTO or SUM_OVER_BATCH_SIZE will raise an error. Please see this custom training tutorial for more details.
name: Optional name for the instance. Defaults to 'mean_squared_error'.

Methods

from_config


@classmethod
from_config(
    config
)

Instantiates a Loss from its config (output of get_config()).

Args
config Output of get_config().
Returns
A keras.losses.Loss instance.

get_config


get_config()

Returns the config dictionary for a Loss instance.
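
For illustration, a minimal sketch of round-tripping a loss through its config (the variable names are just for this example):

mse = tf.keras.losses.MeanSquaredError()
config = mse.get_config()            # e.g. {'reduction': 'auto', 'name': 'mean_squared_error'}
restored = tf.keras.losses.MeanSquaredError.from_config(config)  # an equivalent Loss instance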

__call__


__call__(
    y_true, y_pred, sample_weight=None
)

Invokes the Loss instance.

Args
y_true Ground truth values. shape = [batch_size, d0, .. dN], except
sparse loss functions such as sparse categorical crossentropy where
shape = [batch_size, d0, .. dN-1]
y_pred The predicted values. shape = [batch_size, d0, .. dN]
sample_weight Optional sample_weight acts as a coefficient for the
loss. If a scalar is provided, then the loss is simply scaled by the
given value. If sample_weight is a tensor of size [batch_size],
then the total loss for each sample of the batch is rescaled by the
corresponding element in the sample_weight vector. If the shape of
sample_weight is [batch_size, d0, .. dN-1] (or can be
broadcasted to this shape), then each loss element of y_pred is
scaled by the corresponding value of sample_weight. (Note
on dN-1: all loss functions reduce by 1 dimension, usually
axis=-1.)
Returns
Weighted loss float Tensor. If reduction is NONE, this has
shape [batch_size, d0, .. dN-1]; otherwise, it is scalar. (Note
dN-1 because all loss functions reduce by 1 dimension, usually
axis=-1.)
Raises
ValueError If the shape of sample_weight is invalid.

Using loss functions

A loss function (or objective function, or optimization score function) is one of the two parameters required to compile a model:

model.compile(loss='mean_squared_error', optimizer='sgd')
from keras import losses
model.compile(loss=losses.mean_squared_error, optimizer='sgd')

You can either pass the name of an existing loss function, or pass a symbolic TensorFlow/Theano function that returns a scalar for each data point and takes the following two arguments:

y_true: True labels. TensorFlow/Theano tensor.

y_pred: Predictions. TensorFlow/Theano tensor of the same shape as y_true.

The objective that is actually optimized is the mean of the output array across all data points.
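
For example, a minimal sketch of such a custom symbolic function written with the Keras backend could look like this (the function name is illustrative):

from keras import backend as K

def my_mse(y_true, y_pred):
    # return one scalar loss value per data point; Keras averages these
    return K.mean(K.square(y_pred - y_true), axis=-1)

model.compile(loss=my_mse, optimizer='sgd')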

Available loss functions

mean_squared_error

keras.losses.mean_squared_error(y_true, y_pred)


mean_absolute_error

keras.losses.mean_absolute_error(y_true, y_pred)


mean_absolute_percentage_error

keras.losses.mean_absolute_percentage_error(y_true, y_pred)


mean_squared_logarithmic_error

keras.losses.mean_squared_logarithmic_error(y_true, y_pred)


squared_hinge

keras.losses.squared_hinge(y_true, y_pred)


hinge

keras.losses.hinge(y_true, y_pred)


categorical_hinge

keras.losses.categorical_hinge(y_true, y_pred)


logcosh

keras.losses.logcosh(y_true, y_pred)

Logarithm of the hyperbolic cosine of the prediction error.

log(cosh(x)) is approximately equal to (x ** 2) / 2 for small x and to abs(x) - log(2) for large x. This means that 'logcosh' works mostly like the mean squared error, but will not be affected as strongly by the occasional wildly incorrect prediction.

Arguments

  • y_true: tensor of true targets.
  • y_pred: tensor of predicted targets.

Returns

A tensor with one scalar loss entry per sample.


huber_loss

keras.losses.huber_loss(y_true, y_pred, delta=1.0)


categorical_crossentropy

keras.losses.categorical_crossentropy(y_true, y_pred, from_logits=False, label_smoothing=0)


sparse_categorical_crossentropy

keras.losses.sparse_categorical_crossentropy(y_true, y_pred, from_logits=False, axis=-1)


binary_crossentropy

keras.losses.binary_crossentropy(y_true, y_pred, from_logits=False, label_smoothing=0)


kullback_leibler_divergence

keras.losses.kullback_leibler_divergence(y_true, y_pred)


poisson

keras.losses.poisson(y_true, y_pred)


cosine_proximity

keras.losses.cosine_proximity(y_true, y_pred, axis=-1)


is_categorical_crossentropy

keras.losses.is_categorical_crossentropy(loss)


Note: when using the categorical_crossentropy loss, your targets should be in categorical format (e.g. if you have 10 classes, the target for each sample should be a 10-dimensional vector that is all zeros except for a 1 at the index corresponding to the class of the sample). To convert integer targets into categorical targets, you can use the Keras utility to_categorical:

from keras.utils import to_categorical
categorical_labels = to_categorical(int_labels, num_classes=None)

When using the sparse_categorical_crossentropy loss, your targets should be integers. If you have categorical targets, you should use categorical_crossentropy.

categorical_crossentropy is another term for multi-class log loss.

You've created a deep learning model in Keras, prepared the data, and now you're wondering which loss function to choose for your problem.

We’ll get to that in a second but first what is a loss function?

In deep learning, the loss is computed to get the gradients with respect to model weights and update those weights accordingly via backpropagation. Loss is calculated and the network is updated after every iteration until model updates don’t bring any improvement in the desired evaluation metric. 

So while you keep using the same evaluation metric, such as F1 score or AUC, on the validation set during long parts of your machine learning project, the loss can be changed, adjusted, and modified to get the best evaluation metric performance.

You can think of the loss function just like you think about the model architecture or the optimizer and it is important to put some thought into choosing it. In this piece we’ll look at:

  • loss functions available in Keras and how to use them,
  • how you can define your own custom loss function in Keras,
  • how to add sample weighing to create observation-sensitive losses,
  • how to avoid nans in the loss,
  • how you can monitor the loss function via plotting and callbacks.

Let’s get into it!

Keras loss functions 101

In Keras, loss functions are passed during the compile stage, as shown below. 

In this example, we’re defining the loss function by creating an instance of the loss class. Using the class is advantageous because you can pass some additional parameters. 

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential()
model.add(layers.Dense(64, kernel_initializer='uniform', input_shape=(10,)))
model.add(layers.Activation('softmax'))

loss_function = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(loss=loss_function, optimizer='adam')

If you want to use a loss function that is built into Keras without specifying any parameters you can just use the string alias as shown below:

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')

You might be wondering: how does one decide which loss function to use?

Often, one of the various loss functions available in Keras will fit your problem. Other times you might have to implement your own custom loss function.

Let’s dive into all those scenarios.

Which loss functions are available in Keras?

Binary Classification

Binary classification loss function comes into play when solving a problem involving just two classes. For example, when predicting fraud in credit card transactions, a transaction is either fraudulent or not. 

Binary Cross Entropy

The Binary Cross entropy will calculate the cross-entropy loss between the predicted classes and the true classes. By default, the sum_over_batch_size reduction is used. This means that the loss will return the average of the per-sample losses in the batch.

y_true = [[0., 1.], [0.2, 0.8],[0.3, 0.7],[0.4, 0.6]]
y_pred = [[0.6, 0.4], [0.4, 0.6],[0.6, 0.4],[0.8, 0.2]]
bce = tf.keras.losses.BinaryCrossentropy(reduction='sum_over_batch_size')
bce(y_true, y_pred).numpy()

The sum reduction means that the loss function will return the sum of the per-sample losses in the batch.

bce = tf.keras.losses.BinaryCrossentropy(reduction='sum')
bce(y_true, y_pred).numpy()

Using the reduction as none returns the full array of the per-sample losses.

bce = tf.keras.losses.BinaryCrossentropy(reduction='none')
bce(y_true, y_pred).numpy()
array([0.9162905 , 0.5919184 , 0.79465103, 1.0549198 ], dtype=float32)

In binary classification, the activation function used is the sigmoid activation function. It constrains the output to a number between 0 and 1. 
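
For instance, a minimal sketch of a binary classifier head with a sigmoid output compiled with binary cross-entropy (the layer sizes and input shape are illustrative):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu', input_shape=(8,)),
    tf.keras.layers.Dense(1, activation='sigmoid')  # output constrained to (0, 1)
])
model.compile(optimizer='adam', loss=tf.keras.losses.BinaryCrossentropy())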

Multiclass classification

Problems involving the prediction of more than one class use different loss functions. In this section we’ll look at a couple:

Categorical Crossentropy

The CategoricalCrossentropy also computes the cross-entropy loss between the true classes and predicted classes. The labels are given in a one-hot format.

y_true = [[0., 1., 0.], [0., 0., 1.]]  # one-hot example labels
y_pred = [[0.05, 0.95, 0.], [0.1, 0.8, 0.1]]
cce = tf.keras.losses.CategoricalCrossentropy()
cce(y_true, y_pred).numpy()

Sparse Categorical Crossentropy

If you have two or more classes and  the labels are integers, the SparseCategoricalCrossentropy should be used. 

y_true = [0, 1, 2]
y_pred = [[0.05, 0.95, 0], [0.1, 0.8, 0.1], [0.1, 0.8, 0.1]]
scce = tf.keras.losses.SparseCategoricalCrossentropy()
scce(y_true, y_pred).numpy()

The Poisson Loss

You can also use the Poisson class to compute the Poisson loss. It's a great choice if your dataset comes from a Poisson distribution, for example the number of calls a call center receives per hour.

y_true = [[0.1, 1.,0.8], [0.1, 0.9,0.1],[0.2, 0.7,0.1],[0.3, 0.1,0.6]]
y_pred = [[0.6, 0.2,0.2], [0.2, 0.6,0.2],[0.7, 0.1,0.2],[0.8, 0.1,0.1]]
p = tf.keras.losses.Poisson()
p(y_true, y_pred).numpy()

Kullback-Leibler Divergence Loss

The relative entropy can be computed using the KLDivergence class. According to the official docs at PyTorch:

KL divergence is a useful distance measure for continuous distributions and is often useful when performing direct regression over the space of (discretely sampled) continuous output distributions. 

y_true = [[0.1, 1.,0.8], [0.1, 0.9,0.1],[0.2, 0.7,0.1],[0.3, 0.1,0.6]]
y_pred = [[0.6, 0.2,0.2], [0.2, 0.6,0.2],[0.7, 0.1,0.2],[0.8, 0.1,0.1]]
kl = tf.keras.losses.KLDivergence()
kl(y_true, y_pred).numpy()

In a multi-class problem, the activation function used is the softmax function.
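
For instance, a minimal sketch of a multi-class head with a softmax output (the number of classes and layer sizes are illustrative):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu', input_shape=(8,)),
    tf.keras.layers.Dense(3, activation='softmax')  # one probability per class, summing to 1
])
model.compile(optimizer='adam', loss=tf.keras.losses.CategoricalCrossentropy())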

Object Detection

The Focal Loss

In classification problems involving imbalanced data and object detection problems, you can use the Focal Loss. The loss introduces an adjustment to the cross-entropy criterion. 

It is done by altering its shape in a way that the loss allocated to well-classified examples is down-weighted. This ensures that the model is able to learn equally from minority and majority classes.

The cross-entropy loss is scaled by a factor that decays to zero as confidence in the correct class increases. This scaling factor down-weights the contribution of easy samples at training time and focuses the model on the challenging ones.

import tensorflow_addons as tfa

y_true = [[0.97], [0.91], [0.03]]
y_pred = [[1.0], [1.0], [0.0]]
sfc = tfa.losses.SigmoidFocalCrossEntropy()
sfc(y_true, y_pred).numpy()
array([0.00010971, 0.00329749, 0.00030611], dtype=float32)

Generalized Intersection over Union

The Generalized Intersection over Union loss from TensorFlow Addons can also be used. The Intersection over Union (IoU) is a very common metric in object detection problems. IoU is, however, not very informative in problems involving non-overlapping bounding boxes.

The Generalized Intersection over Union was introduced to address this challenge that IoU is facing. It ensures that generalization is achieved by maintaining the scale-invariant property of IoU, encoding the shape properties of the compared objects into the region property, and making sure that there is a strong correlation with IoU in the event of overlapping objects. 

gl = tfa.losses.GIoULoss()
boxes1 = tf.constant([[4.0, 3.0, 7.0, 5.0], [5.0, 6.0, 10.0, 7.0]])
boxes2 = tf.constant([[3.0, 4.0, 6.0, 8.0], [14.0, 14.0, 15.0, 15.0]])
loss = gl(boxes1, boxes2)

Regression

In regression problems, you have to calculate the differences between the predicted values and the true values but as always there are many ways to do it.

Mean Squared Error

The MeanSquaredError class can be used to compute the mean square of errors between the predictions and the true values. 

y_true = [12, 20, 29., 60.]
y_pred = [14., 18., 27., 55.]
mse = tf.keras.losses.MeanSquaredError()
mse(y_true, y_pred).numpy()

Use Mean Squared Error when you desire to have large errors penalized more than smaller ones. 

Mean Absolute Percentage Error

The mean absolute percentage error is computed as 100 * mean(abs((y_true - y_pred) / y_true)).

Here's how to use it:

y_true = [12, 20, 29., 60.]
y_pred = [14., 18., 27., 55.]
mape = tf.keras.losses.MeanAbsolutePercentageError()
mape(y_true, y_pred).numpy()

Consider using this loss when you want a loss that you can explain intuitively. People understand percentages easily. The loss is also robust to outliers. 

Mean Squared Logarithmic Error

The mean squared logarithmic error is computed as mean(square(log(y_true + 1) - log(y_pred + 1))).

Here's how to use it:

y_true = [12, 20, 29., 60.]
y_pred = [14., 18., 27., 55.]
msle = tf.keras.losses.MeanSquaredLogarithmicError()
msle(y_true, y_pred).numpy()

Mean Squared Logarithmic Error penalizes underestimates more than it does overestimates. It's a great choice when you prefer not to penalize large errors heavily; it is therefore more robust to outliers.

Cosine Similarity Loss

If your interest is in computing the cosine similarity between the true and predicted values, you'd use the CosineSimilarity class. It is computed as loss = -sum(l2_norm(y_true) * l2_norm(y_pred)).

The result is a number between -1 and 1. 0 indicates orthogonality, while values closer to -1 indicate greater similarity (the loss is the negative of the cosine similarity).

y_true = [[12, 20], [29., 60.]]
y_pred = [[14., 18.], [27., 55.]]
cosine_loss = tf.keras.losses.CosineSimilarity(axis=1)
cosine_loss(y_true, y_pred).numpy()

LogCosh Loss

The LogCosh class computes the logarithm of the hyperbolic cosine of the prediction error.

Here's how to use it standalone:

y_true = [[12, 20], [29., 60.]]
y_pred = [[14., 18.], [27., 55.]]
l = tf.keras.losses.LogCosh()
l(y_true, y_pred).numpy()

LogCosh Loss works like the mean squared error, but will not be so strongly affected by the occasional wildly incorrect prediction. — TensorFlow Docs

Huber loss

For regression problems where you want the loss to be less sensitive to outliers, the Huber loss is used.

y_true = [12, 20, 29., 60.]
y_pred = [14., 18., 27., 55.]
h = tf.keras.losses.Huber()
h(y_true, y_pred).numpy()

Learning Embeddings

Triplet Loss

You can also compute the triplet loss with semi-hard negative mining via TensorFlow addons. The loss encourages the positive distances between pairs of embeddings with the same labels to be less than the minimum negative distance. 

import tensorflow_addons as tfa

model.compile(optimizer='adam',
              loss=tfa.losses.TripletSemiHardLoss(),
              metrics=['accuracy'])

Creating custom loss functions in Keras

Sometimes there is no good loss available or you need to implement some modifications. Let’s learn how to do that.

A custom loss function can be created by defining a function that takes the true values and predicted values as required parameters. The function should return an array of losses. The function can then be passed at the compile stage. 

def custom_loss_function(y_true, y_pred):
   squared_difference = tf.square(y_true - y_pred)
   return tf.reduce_mean(squared_difference, axis=-1)

model.compile(optimizer='adam', loss=custom_loss_function)

Let’s see how we can apply this custom loss function to an array of predicted and true values.

import numpy as np

y_true = [12, 20, 29., 60.]
y_pred = [14., 18., 27., 55.]
cl = custom_loss_function(np.array(y_true),np.array(y_pred))
cl.numpy()

Use of Keras loss weights

During the training process, one can weigh the loss function by observations or samples. The weights can be arbitrary, but a typical choice is class weights (distribution of labels). Each observation is weighted by the fraction of the class it belongs to (reversed) so that the loss for minority class observations is more important when calculating the loss.  

One of the ways to do this is to pass the class weights during the training process. 

The weights are passed using a dictionary that contains the weight for each class. You can compute the weights using Scikit-learn or calculate the weights based on your own criterion. 

weights = {0: 1.01300017, 1: 0.88994364, 2: 1.00704935, 3: 0.97863318,
           4: 1.02704553, 5: 1.10680686, 6: 1.01385603, 7: 0.95770152,
           8: 1.02546573, 9: 1.00857287}
model.fit(x_train, y_train, verbose=1, epochs=10, class_weight=weights)
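
Rather than hard-coding the dictionary, you can compute such weights; a minimal sketch using scikit-learn (assuming y_train holds integer class labels):

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

classes = np.unique(y_train)
class_weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_train)
weights = dict(zip(classes, class_weights))
model.fit(x_train, y_train, epochs=10, class_weight=weights)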

The second way is to pass loss weights at the compile stage. Note that loss_weights in compile() weights the contributions of different model outputs (for multi-output models) to the total loss, rather than weighting individual classes.

weights = [1.013, 0.889, 1.007, 0.978, 1.027,1.106,1.013,0.957,1.025, 1.008]

model.compile(optimizer=tf.keras.optimizers.SGD(),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(),
              loss_weights=weights,
              metrics=['accuracy'])

How to monitor Keras loss function [example]

It is usually a good idea to monitor the loss function on the training and validation set as the model is training. Looking at those learning curves is a good indication of overfitting or other problems with model training.

There are two main options of how this can be done.

Monitor Keras loss using console logs 

The quickest and easiest way to log and look at the losses is simply printing them to the console. 

import tensorflow as tf

mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
   tf.keras.layers.Flatten(input_shape=(28, 28)),
   tf.keras.layers.Dense(512, activation='relu'),
   tf.keras.layers.Dropout(0.2),
   tf.keras.layers.Dense(10, activation='softmax')
])

model.compile(optimizer='sgd',
             loss='sparse_categorical_crossentropy',
             metrics=['accuracy'])

model.fit(x_train, y_train,verbose=1, epochs=10)

The problem with this approach is that those logs can be easily lost, it is difficult to see progress, and when working on remote machines, you may not have access to it.
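
A quick improvement over raw console logs is to plot the History object that model.fit returns; a minimal sketch with matplotlib (assuming a validation split):

import matplotlib.pyplot as plt

history = model.fit(x_train, y_train, validation_split=0.2, epochs=10, verbose=1)

plt.plot(history.history['loss'], label='training loss')
plt.plot(history.history['val_loss'], label='validation loss')
plt.xlabel('epoch')
plt.ylabel('loss')
plt.legend()
plt.show()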

Monitor Keras loss using a callback

Another, cleaner option is to use a callback that will log the loss somewhere at the end of every batch and epoch.

You need to decide where and what you would like to log, but it is really simple. 

For example, logging Keras loss to neptune.ai could look like this:

from keras.callbacks import Callback

class NeptuneCallback(Callback):
    def on_batch_end(self, batch, logs=None):
        for metric_name, metric_value in logs.items():
            neptune_run[f"{metric_name}"].log(metric_value)

    def on_epoch_end(self, epoch, logs=None):
        for metric_name, metric_value in logs.items():
            neptune_run[f"{metric_name}"].log(metric_value)

You can create the monitoring callback yourself or use one of the many available Keras callbacks both in the Keras library and in other libraries that integrate with it, like neptune.ai, TensorBoard, and others.

Once you have the callback ready, you simply pass it to the model.fit(...):

pip install neptune-tensorflow-keras
# the same as above
import neptune.new as neptune
from neptune.new.integrations.tensorflow_keras import NeptuneCallback
 
 
run = neptune.init_run()
 
neptune_callback = NeptuneCallback(run=run)
 
model.fit(
    x_train,
    y_train,
    validation_split=0.2,
    epochs=10,
    callbacks=[neptune_callback],
)

And monitor your experiment learning curves in the web app: 

Note: For the most up-to-date code examples, please refer to the Neptune-Keras integration docs.

With neptune.ai, you can not only track losses, but also other metrics and parameters, as well as artifacts, source code, system metrics and more.

Why Keras loss nan happens

Most of the time, losses you log will be just some regular values, but sometimes you might get nans when working with Keras loss functions. 

When that happens, your model will not update its weights and will stop learning, so this situation needs to be avoided.

There could be many reasons for nan loss but usually, what happens is:

  • nans in the training set will lead to nans in the loss,
  • NumPy infinite in the training set will also lead to nans in the loss,
  • Using a training set that is not scaled,
  • Use of very large l2 regularizers and a learning rate above 1,
  • Use of the wrong optimizer function,
  • Large (exploding) gradients that result in a large update to network weights during training.

So in order to avoid nans in the loss:

  • Check that your training data is properly scaled and doesn’t contain nans;
  • Check that you are using the right optimizer and that your learning rate is not too large;
  • Check whether the l2 regularization is not too large;
  • If you are facing the exploding gradient problem, you can either: re-design the network or use gradient clipping so that your gradients have a certain “maximum allowed model update”.

Vanishing and Exploding Gradients in Neural Network Models: Debugging, Monitoring, and Fixing

Understanding Gradient Clipping (and How It Can Fix Exploding Gradients Problem)

Final thoughts

Hopefully, this article gave you some background into loss functions in Keras.

We’ve covered:

  • Built-in loss functions in Keras,
  • Implementation of your own custom loss functions,
  • How to add sample weighing to create observation-sensitive losses,
  • How to avoid loss nans,
  • How you can visualize loss as your model is training.

For more information, check out the Keras Repository and the TensorFlow Loss Functions documentation.

How to build custom loss functions in Keras for any use case

In this article, there is an in-depth discussion on

  • What are Loss Functions 
  • What are Evaluation Metrics?
  • Commonly used Loss functions in Keras (Regression and Classification)
  • Built-in loss functions in Keras
  • What is the custom loss function?
  • Implementation of common loss functions in Keras
  • Custom Loss Function for Layers i.e Custom Regularization Loss
  • Dealing with NaN values in Keras Loss
  • Why should you use a Custom Loss?
  • Monitoring Keras Loss using callbacks

What are Loss Functions

Loss functions are one of the core parts of a machine learning model. If you’ve been in the field of data science for some time, you must have heard it. Loss functions, also known as cost functions, are special types of functions, which help us minimize the error, and reach as close as possible to the expected output.

In deep learning, the loss is computed to get the gradients for the model weights and update those weights accordingly using backpropagation.


The basic idea is that for each data point there is an actual value and a predicted value; the difference between the actual value and the predicted value is known as the error.

This can be written in equation form as error = hθ(x) - y, where hθ(x) is the prediction and y is the actual value.

So our goal is to minimize the difference between the predicted value hθ(x) and the actual value y. In other words, you have to minimize the value of the cost function. Professor Andrew Ng explains this main idea as choosing the values of θ0 and θ1, which are the weights of a model, such that the prediction hθ(x) is closest to the actual output y.

Here Professor Andrew Ng is using the Mean Squared Error function, which will be discussed later on.

An easy explanation can be said that the goal of a machine learning model is to minimize the cost and maximize the evaluation metric. This can be achieved by updating the weights of a machine learning model using some algorithm such as Gradient Descent.

The weights are updated using the gradient of the cost function, for example θj := θj - α ∂J(θ)/∂θj, where α is the learning rate and J(θ) is the cost function used to update the weights of a machine learning model.

What are Evaluation Metrics

Evaluation metrics are the metrics used to evaluate and judge the performance of a machine learning model. Evaluating a machine learning project is very essential. There are different types of evaluation metrics such as ‘Mean Squared Error’, ‘Accuracy’, ‘Mean Absolute Error’ etc. The cost functions used, such as mean squared error, or binary cross-entropy are also metrics, but they are difficult to read and interpret how our model is performing. So there is a need for other metrics like Accuracy, Precision, Recall, etc. Using different metrics is important because a model may perform well using one measurement from one evaluation metric, but may perform poorly using another measurement from another evaluation metric.


Here you can see the performance of our model using 2 metrics. The first one is loss and the second one is accuracy. Our loss function (which was cross-entropy in this example) has a value of 0.4474, which is difficult to interpret as good or bad on its own, but the accuracy shows that the model currently reaches 80%.

Hence it is important to use different evaluation metrics other than loss/cost function only to properly evaluate the model’s performance and capabilities.

Some of the common evaluation metrics include:

  • Accuracy
  • Precision
  • Recall
  • F-1 Score
  • MSE
  • MAE
  • Confusion Matrix
  • Logarithmic Loss
  • ROC curve

And many more.
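
In Keras, such metrics are passed alongside the loss at compile time; a minimal sketch (the metric choices here are illustrative):

import keras

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy', keras.metrics.Precision(), keras.metrics.Recall()])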

Commonly Used Loss Functions in Machine Learning Algorithms and their Keras Implementation

Common Regression Losses:

Regression is the type of problem where you are going to predict a continuous variable. This means that our variable can be any number, not some specific labels.

For example, when you have to predict prices of houses, it can be a house of any price, so it is a regression problem. 

Some of the common examples of regressions tasks are

  • Prices Prediction
  • Stock Market Prediction
  • Financial Forecasting
  • Trend Analysis
  • Time Series Predictions

And many more.

Consider the regression problem of predicting the price of a house from three features: the size of the house, the number of rooms, and the number of baths. The model checks these features and predicts a continuous number, the price of the house.

Since regression problems deal with predicting a continuous number, you have to use different types of losses than for classification problems. Some of the commonly used loss functions in regression problems are as follows.

Mean Squared Error  

Mean squared error, also known as L2 loss, is mainly used for regression tasks. As the name suggests, it is calculated by taking the mean of the squared error, where the error is the difference between the actual and predicted value.

The mathematical equation for Mean Squared Error is

MSE = (1/n) * Σ (Yi - Ŷi)²

where Ŷi is the predicted value and Yi is the actual value. Mean Squared Error penalizes the model for making errors by taking the square. This is the reason that this loss function is less robust to outliers in the dataset.

Implementation in Keras:

import keras
import numpy as np

y_true = np.array([[10.0, 7.0]])  # sample data
y_pred = np.array([[8.0, 6.0]])

a = keras.losses.MSE(y_true, y_pred)

print(f'Value of Mean Squared Error is {a.numpy()}')

Here the predicted values and the true values are passed to the Mean Squared Error function from keras.losses, which computes the loss. It returns a tf.Tensor object, which has been converted to NumPy to see the value more clearly.

Using via compile Method:

Keras losses can be specified for a deep learning model using the compile method from keras.Model.

model = keras.Sequential([
    keras.layers.Dense(10, input_shape=(1,), activation='relu'),
    keras.layers.Dense(1)
])

And now the compile method can be used to specify the loss and metrics.

model.compile(loss='mse', optimizer='adam')

Now when the model is trained, it will use the Mean Squared Error loss function to compute the loss and update the weights using the Adam optimizer.

model.fit(np.array([[10.0], [20.0], [30.0], [40.0], [50.0], [60.0], [10.0], [20.0]]),
          np.array([6, 12, 18, 24, 30, 36, 6, 12]), epochs=10)

Mean Absolute Error

Mean Absolute Error, also known as L1 loss, is defined as the average of the absolute differences between the actual values and the predicted values.

Mathematically, it can be shown as:

MAE = (1/n) * Σ |Yi - Ŷi|

Mean Absolute Error is a scale-dependent accuracy measure, which means that it uses the same scale as the data being measured; it therefore cannot be used to compare series that use different scales.

Mean Absolute Error is also a common regression loss, which means that it is used to predict a continuous variable.

Standalone Implementation in Keras:

import keras
import numpy as np

y_true = np.array([[10.0, 7.0]])  # dummy data
y_pred = np.array([[8.0, 6.0]])

c = keras.losses.MAE(y_true, y_pred)  # calculating the loss

print(f'Value of Mean Absolute Error is {c.numpy()}')

What you have to do is create the MAE call from keras.losses and pass in the true and predicted labels to calculate the loss using the equation given above.

Implementing using compile method

When working with a deep learning model in Keras, you have to define the model structure first.

model = keras.models.Sequential([
    keras.layers.Dense(10, input_shape=(1,), activation='relu'),
    keras.layers.Dense(1)
])

After defining the model architecture, you have to compile it and use the MAE loss function. Notice that the last layer has a linear (i.e. no) activation function, which means that you are going to predict a continuous variable.

model.compile(loss='mae', optimizer='adam')

You can now simply fit the model to check its progress. Here the model trains on a very small dummy array just to check the progress.

model.fit(np.array([[10.0], [20.0], [30.0], [40.0], [50.0], [60.0], [10.0], [20.0]]),
          np.array([6, 12, 18, 24, 30, 36, 6, 12]), epochs=10)

And you can see the loss value which has been calculated using the MAE formula for each epoch.

Common Classification Losses:

Classification problems are those in which you have to predict a label. This means that the output should be one of the given labels that you have provided to the model.

For example, there is a problem where you have to detect whether the input image belongs to one of the given classes, such as dog, cat, or horse. The model will predict 3 numbers ranging from 0 to 1, and the class with the highest probability will be picked.

If you want to predict whether it is going to rain tomorrow or not, this means that the model can output between 0 and 1, and you will choose the option of rain if it is greater than 0.5, and no rain if it is less than 0.5.

Common Classification Loss:

1. Cross-Entropy

Cross Entropy is one of the most commonly used classification loss functions. You can think of it as a measure of the degree of dissimilarity between two probability distributions. For example, in the task of predicting whether it will rain tomorrow or not, there are two distributions, one for True and one for False.

Cross Entropy has 3 main types.

  a. Binary Cross Entropy

Binary Cross Entropy, as the name suggests, is the cross entropy used between two classes, i.e. in binary classification, where you have to detect whether a sample belongs to class 'A'; if it does not belong to class 'A', then it belongs to class 'B'.

Just like in the example of rain prediction: if it is going to rain tomorrow, then the sample belongs to the rain class, and if there is a low probability of rain tomorrow, then it belongs to the no-rain class.

The mathematical equation for Binary Cross Entropy is

BCE = -(1/N) * Σ [ yi * log(ŷi) + (1 - yi) * log(1 - ŷi) ]

This loss function has 2 parts. If the actual label is 1, the term after '+' becomes 0 because 1 - 1 = 0, so the loss for that sample is -log(ŷi).

And when the label is 0, the first term becomes 0, so the loss in that case is -log(1 - ŷi).

This loss function is also known as the Log Loss function, because of the logarithm in the loss.

Standalone Implementation:

You can create an object for Binary Cross Entropy from keras.losses and then pass in the true and predicted labels.

import keras
import numpy as np

y_true = np.array([[1.0]])
y_pred = np.array([[0.9]])

loss = keras.losses.BinaryCrossentropy()
print(f"BCE LOSS VALUE IS {loss(y_true, y_pred).numpy()}")

Implementation using compile method

To use Binary Cross Entropy in a deep learning model, design the architecture and compile the model while specifying the loss as Binary Cross Entropy.

import keras
import numpy as np

model = keras.models.Sequential([
    keras.layers.Dense(16, input_shape=(2,), activation='relu'),
    keras.layers.Dense(8, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')  # sigmoid for a probability output
])
model.compile(optimizer='sgd', loss=keras.losses.BinaryCrossentropy(), metrics=['acc'])  # binary cross entropy

model.fit(np.array([[10.0, 20.0], [20.0, 30.0], [30.0, 6.0], [8.0, 20.0]]), np.array([1, 1, 0, 1]), epochs=10)

This will train the model using the Binary Cross Entropy loss function.

  b. Categorical Cross Entropy

Categorical Cross Entropy is the cross entropy that is used for multi-class classification. This means that for a single training example, you get n probabilities, where n is the number of classes, and you take the class with the maximum probability.

Mathematically, you can write it as:

CCE = -(1/N) * Σ_{i=1..N} Σ_{c=1..C} 1(yi ∈ Cc) * log(p_model[yi ∈ Cc])

This double sum is over the N examples and C categories. The term 1(yi ∈ Cc) is an indicator that the ith observation belongs to the cth category. p_model[yi ∈ Cc] is the probability predicted by the model for the ith observation to belong to the cth category. When there are more than 2 classes, the neural network outputs a vector of C probabilities, one per class. When the number of categories is just two, the neural network outputs a single probability ŷi, with the other one being 1 minus the output. This is why binary cross entropy looks a bit different from categorical cross entropy, despite being a special case of it.

Standalone Implementation

You will create a Categorical Cross Entropy object from keras.losses and pass in the true and predicted labels; it will calculate the cross entropy and return a Tensor.

Note that you have to provide a one-hot encoded matrix showing the probability for each class, as shown in this example.

import keras
import numpy as np

y_true = [[0, 1, 0], [0, 0, 1]]  # 3 classes

y_pred = [[0.05, 0.95, 0], [0.1, 0.5, 0.4]]

loss = keras.losses.CategoricalCrossentropy()
print(f"CCE LOSS VALUE IS {loss(y_true, y_pred).numpy()}")

Implementation using compile method

When implemented using the compile method, you have to design a model in Keras and compile it using the Categorical Cross Entropy loss. When the model is trained, it calculates the loss based on categorical cross entropy and updates the weights according to the given optimizer.

import keras
import numpy as np
from keras.utils import to_categorical  # to one-hot encode the data

model = keras.models.Sequential([
    keras.layers.Dense(16, input_shape=(2,), activation='relu'),
    keras.layers.Dense(8, activation='relu'),
    keras.layers.Dense(3, activation='softmax')  # softmax for multiclass probability
])

model.compile(optimizer='sgd', loss=keras.losses.CategoricalCrossentropy(), metrics=['acc'])  # categorical cross entropy

model.fit(np.array([[10.0, 20.0], [20.0, 30.0], [30.0, 6.0], [8.0, 20.0], [-1.0, -100.0], [-10.0, -200.0]]),
          to_categorical(np.array([1, 1, 0, 1, 2, 2])), epochs=10)

Here, it will train the model on our dummy dataset.

  c. Sparse Categorical Cross Entropy

Mathematically, there is no difference between Categorical Cross Entropy and Sparse Categorical Cross Entropy, according to the official documentation. Use this cross entropy loss function when there are two or more label classes; labels are expected to be provided as integers. If you want to provide labels using a one-hot representation, please use the CategoricalCrossentropy loss. There should be # classes floating point values per feature for y_pred and a single floating point value per feature for y_true.

As you have seen earlier, for Categorical Cross Entropy a one-hot matrix is passed as the true labels. An example of producing one is as follows:

to_categorical(np.array([1, 1, 0, 1, 2, 2]))

For using sparse categorical cross entropy in Keras, you need to pass in the label encoded labels. You can use sklearn for this purpose.

Let's see this example to understand it better.

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
output = le.fit_transform(np.array(['High chance of Rain', 'No Rain', 'Maybe', 'Maybe', 'No Rain', 'High chance of Rain']))

Here a LabelEncoder object has been created, and the fit_transform method is used to encode the string labels as integers (the classes are sorted alphabetically and assigned integer codes).

Standalone Implementation

To perform a standalone implementation, you need to label-encode the labels. y_pred should contain n floating point values per sample, where n is the total number of classes.

from sklearn.preprocessing import LabelEncoder

t = LabelEncoder()
y_pred = [[0.1, 0.1, 0.8], [0.1, 0.4, 0.5], [0.5, 0.3, 0.2], [0.6, 0.3, 0.1]]
y_true = t.fit_transform(['Rain', 'Rain', 'High Changes of Rain', 'No Rain'])
loss = keras.losses.SparseCategoricalCrossentropy()
print(f"Sparse Categorical Loss is {loss(y_true, y_pred).numpy()}")

Implementation using model.compile

To implement Sparse Categorical Cross Entropy in a deep learning model, you have to design the model and compile it using the sparse categorical cross entropy loss. Remember to label-encode your class labels so that sparse categorical cross entropy can work.

import keras
import numpy as np
from sklearn.preprocessing import LabelEncoder

model = keras.models.Sequential([
    keras.layers.Dense(16, input_shape=(2,), activation='relu'),
    keras.layers.Dense(8, activation='relu'),
    keras.layers.Dense(3, activation='softmax')  # softmax for multiclass probability
])
le = LabelEncoder()
model.compile(optimizer='sgd', loss=keras.losses.SparseCategoricalCrossentropy(), metrics=['acc'])  # sparse categorical cross entropy

Now the model will be trained on the dummy dataset.

model.fit(np.array([[10.0, 20.0], [20.0, 30.0], [30.0, 6.0], [8.0, 20.0], [-1.0, -100.0], [-10.0, -200.0]]),
          le.fit_transform(np.array(['High chance of Rain', 'High chance of Rain', 'High chance of Rain', 'Maybe', 'No Rain', 'No Rain'])),
          epochs=10)

The model has been trained, where the loss is calculated using sparse categorical cross entropy, and the weights have been updated using stochastic gradient descent.

2. Hinge Loss

Hinge loss is a commonly used loss function for classification problems. It is mainly used in problems where you have to do 'maximum-margin' classification; a common example is Support Vector Machines.

(Figure: how maximum-margin classification works. Source: Stanford NLP Group)

The mathematical formula for hinge loss is:

loss = max(0, 1 - y_true * y_pred)

where y_true is the actual label (-1 or +1) and y_pred is the predicted value. When the prediction has the same sign as the label and lies beyond the margin, the loss is zero; when it has the opposite sign, the loss grows linearly with the prediction. This is why it is known as maximum-margin classification.

Standalone Implementation:

To perform a standalone implementation of Hinge loss in Keras, you are going to use the Hinge class from keras.losses.

import keras
import numpy as np

y_true = [[0., 1.], [0., 0.]]
y_pred = [[0.6, 0.4], [0.4, 0.6]]
h = keras.losses.Hinge()
print(f'Value for Hinge Loss is {h(y_true, y_pred).numpy()}')

Implementation using compile Method

To implement Hinge loss using the compile method, you design the model and compile it specifying the loss as Hinge.
Note that Hinge loss works best with tanh as the activation in the last layer.

import keras
import numpy as np
from sklearn.preprocessing import LabelEncoder

model = keras.models.Sequential([
    keras.layers.Dense(16, input_shape=(2,), activation='relu'),
    keras.layers.Dense(8, activation='tanh'),
    keras.layers.Dense(1, activation='tanh')
])

model.compile(optimizer='adam', loss=keras.losses.Hinge(), metrics=['acc'])  # Hinge loss

le = LabelEncoder()

model.fit(np.array([[10.0, 20.0], [20.0, 30.0], [30.0, 6.0], [-1.0, -100.0], [-10.0, -200.0]]),
          le.fit_transform(np.array(['High chance of Rain', 'High chance of Rain', 'High chance of Rain', 'No Rain', 'No Rain'])),
          epochs=10, batch_size=5)

This will train the model using Hinge loss and update the weights using the Adam optimizer.

Custom Loss Functions

So far you have seen some of the important cost functions that are widely used in industry, are easy to understand, and are built into popular deep learning frameworks such as Keras or PyTorch. These built-in loss functions are enough for most typical tasks such as classification or regression.

But there are some tasks, which can not be performed well using these built-in loss functions, and require some other loss that is more suitable for that task. For that purpose, a custom loss function is designed that calculates the error between the predicted value and actual value based on custom criteria.

Why you should use Custom Loss

Artificial Intelligence in general, and Deep Learning in particular, is a very active research field, and various industries use Deep Learning to solve complex problems.

There is a lot of research on how to perform specific tasks using Deep Learning. For example, there is the task of generating different recipes for food from a picture of the food. On Papers with Code (a well-known site for deep learning and machine learning research papers), there are a lot of research papers on this topic.

Now imagine you are reading a research paper where the researchers found that Cross Entropy, Mean Squared Error, or whatever the usual loss function is for that type of problem was not good enough, and modified it according to their needs. This may involve adding some new parameters, or a whole new technique, to achieve better results.

Now when you are implementing that approach, or you have hired data scientists to solve that problem for you, you may find that it is best solved using a specific loss function which is not available by default in Keras, and you need to implement it yourself.

A custom loss function can improve the model's performance significantly, and can be really useful in solving specific problems.

To create a custom loss, you have to take care of some rules.

  1. The loss function must take only two arguments, the true labels and the predicted labels, because these two values are needed to calculate the error in prediction. These arguments are passed by the model itself when the model is being fitted.

For example:

def customLoss(y_true, y_pred):
    # compute the loss from y_true and y_pred here
    return loss

model.compile(loss=customLoss, optimizer='sgd')

   2. Make sure that you make use of y_pred (the predicted value) in the loss function; if you do not, the gradient expression will not be defined and an error can be thrown.

   3. You can then simply use it in the model.compile function just like you would use any other loss function.

Example:

Let's say you want to perform a regression task using a custom loss function that divides the Mean Squared Error by 10. Mathematically, it can be denoted as:

loss = MSE(y_true, y_pred) / 10

Now to implement it in Keras, you need to define a custom loss function with two parameters, the true and predicted values. Then you perform the mathematical operations of the algorithm and return the loss value.

Note that Keras backend functions and TensorFlow mathematical operations are used instead of NumPy functions, so that the operations stay differentiable and work on tensors. Keras backend functions work similarly to NumPy functions.

import keras
import numpy as np
from keras import backend as K
from tensorflow.python.ops import math_ops

def custom_loss(y_true, y_pred):
    diff = math_ops.squared_difference(y_pred, y_true)  # squared difference
    loss = K.mean(diff, axis=-1)  # mean over the last dimension
    loss = loss / 10.0
    return loss

Here you can see a custom function with 2 parameters, the true and predicted values. The first step is to calculate the squared difference between the predicted labels and the true labels using the squared_difference function from TensorFlow Python ops. Then the mean is calculated to complete the mean squared error, and divided by 10 to complete the algorithm. The loss value is then returned.

You can use it in a deep learning model by compiling the model and setting the loss function to the custom loss defined above.

model = keras.Sequential([
    keras.layers.Dense(10, activation='relu', input_shape=(1,)),
    keras.layers.Dense(1)
])

model.compile(loss=custom_loss, optimizer='sgd')

X_train = np.array([[10.0], [20.0], [30.0], [40.0], [50.0], [60.0], [10.0], [20.0]])
y_train = np.array([6.0, 12, 18, 24, 30, 36, 6, 12])  # dummy data

model.fit(X_train, y_train, batch_size=2, epochs=10)

Passing multiple arguments to a Keras Loss Function

Now suppose you want to add some extra parameters to the loss function. In the above formula, the MSE is divided by 10; if you want to divide it by any value given by the user, you need to create a wrapper function that takes those extra parameters.

A wrapper function, in short, is a function whose job is to return another function, with little or no computation of its own. The additional parameters are passed to the wrapper function, while the main 2 parameters remain the same in the inner loss function.

Let's see it with the code.

def wrapper(param1):
    def custom_loss_1(y_true, y_pred):
        diff = math_ops.squared_difference(y_pred, y_true)  # squared difference
        loss = K.mean(diff, axis=-1)  # mean
        loss = loss / param1
        return loss
    return custom_loss_1

To do the standalone computation using Keras, you first call the wrapper to get the inner loss function, and then pass the y_true and y_pred parameters to it.

loss = wrapper(10.0)

final_loss = loss(y_true=[[10.0, 7.0]], y_pred=[[8.0, 6.0]])
print(f"Final Loss is {final_loss.numpy()}")

You can use it in a deep learning model by simply calling the wrapper with an appropriate value for param1.

model1 = keras.Sequential([
    keras.layers.Dense(10, activation='relu', input_shape=(1,)),
    keras.layers.Dense(1)
])

model1.compile(loss=wrapper(10.0), optimizer='sgd')

Here the model has been compiled using the value 10.0 for param1.

The model can now be trained and the results observed.

model1.fit(X_train, y_train, batch_size=2, epochs=10)

Creating Custom Loss for Layers

Loss functions applied to the output of the model (i.e. what you have seen so far) are not the only way to compute losses. Custom losses for custom layers or subclassed models can be computed for quantities that you want to minimize during training, such as regularization losses.

These losses are added using the add_loss() method of keras.layers.Layer.

For example, if you want to add a custom L2 activity regularization to a layer, whose formula is

loss = rate * sum(square(inputs))

you can create your own custom regularizer class, which should inherit from keras.layers.Layer.

from keras.layers import Layer
from tensorflow.math import reduce_sum, square

class MyActivityRegularizer(Layer):

    def __init__(self, rate=1e-2):
        super(MyActivityRegularizer, self).__init__()
        self.rate = rate

    def call(self, inputs):
        self.add_loss(self.rate * reduce_sum(square(inputs)))
        return inputs

Now that the regularization loss has been defined, you can simply add it to any built-in layer, or create your own layer.

import tensorflow as tf
from tensorflow.keras import layers

class SparseMLP(Layer):
    """Stack of Linear layers with our custom regularization loss."""

    def __init__(self, output_dim):
        super(SparseMLP, self).__init__()
        self.dense_1 = layers.Dense(32, activation=tf.nn.relu)
        self.regularization = MyActivityRegularizer(1e-2)
        self.dense_2 = layers.Dense(output_dim)

    def call(self, inputs):
        x = self.dense_1(inputs)
        x = self.regularization(x)
        return self.dense_2(x)

Here a custom SparseMLP layer has been defined that stacks two linear layers with the custom regularization layer in between, adding a loss that regularizes the activations of the deep learning model.

It can be tested:

mlp = SparseMLP(1)
y = mlp(np.random.normal(size=(10, 10)))

print(mlp.losses)  # list containing one float32 scalar

mlp.losses is a list of tf.Tensor objects; an element can be converted to NumPy with, for example, mlp.losses[0].numpy().

Dealing with NaN in Custom Loss in Keras

There are many reasons why a loss function in Keras can give NaN values. If you are new to Keras or practical deep learning, this can be very annoying because you have no idea why Keras is not giving the desired output. Since Keras is a high-level API built over lower-level frameworks such as Theano and TensorFlow, it can be difficult to pinpoint the problem.

People have reported NaN losses for many different reasons.

Some of the main reasons, which are very common, are as follows:

 1. Missing Values in training dataset

This is one of the most common reasons why the loss is nan while training. You should remove all the missing values from your dataset, or fill them using a good strategy, such as filling with the mean. You can check for nan values using NumPy or pandas built-in functions.

print(np.any(np.isnan(X_test)))

And if there are any null values, you can either use the pandas fillna() function to fill them, or the dropna() function to drop those values.

 2. Loss is unable to get traction on training dataset

This means that the custom loss function you designed, is not suitable for the dataset, and the business problem you are trying to solve. You should look at the problem from another perspective, and try to find a suitable loss function for your problem.

 3. Exploding Gradients

Exploding gradients is a very common problem, especially in large neural networks, where the values of your gradients become very large. This problem can be solved using gradient clipping.

In Keras, you can add gradient clipping to your model when compiling it by adding the parameter clipnorm=x to your chosen optimizer. This will clip all gradients whose norm is above the value x.

For example:

opt = keras.optimizers.Adam(clipnorm=1.0)

This will clip all gradients with norm greater than 1.0. You can add it to your model as:

model.compile(loss=custom_loss, optimizer=opt)

Using the RMSProp optimizer function with heavy regularization also helps in diminishing the exploding gradients problem.

 4. Dataset is not scaled

Scaling and normalizing the dataset is important. Unscaled data can lead the neural network to behave very strangely. Hence it is advised to properly scale the data.

There are 2 commonly used scaling methods, and both of them are easily implemented in sklearn, a popular machine learning library in Python.

  1. StandardScaler

from sklearn.preprocessing import StandardScaler
ss = StandardScaler()

X_train_scale = ss.fit_transform(X_train)  # fit on the training data and scale it
X_test_scale = ss.transform(X_test)        # transform the test data onto the same scale as training
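
  2. MinMaxScaler

The other widely used option is min-max scaling; a minimal sketch with sklearn's MinMaxScaler, mirroring the variable names above:

from sklearn.preprocessing import MinMaxScaler
mm = MinMaxScaler()

X_train_scale = mm.fit_transform(X_train)  # fit on the training data and scale it to [0, 1]
X_test_scale = mm.transform(X_test)        # apply the same scaling to the test data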
				
			

5. Dying ReLU problem 

A dead ReLU happens when the relu function always outputs the same value (0 mostly). This means that it takes no part in discriminating between the inputs. Once a ReLU reaches this state, it is unrecoverable because the function gradient at 0 is also 0, so gradient descent will not change the weights and the model will not improve.

This can be improved by using the Leaky ReLU activation function, which has a small positive gradient for negative inputs (for example, y = 0.01x when x < 0).

Hence, it is advised to use Leaky ReLU to avoid NaNs in your loss.
In Keras, you can add a leaky relu layer as follows.

keras.layers.LeakyReLU(alpha=0.3)

6. Not a good choice of optimizer function

If you are using Stochastic Gradient Descent, it is quite likely that you will face the exploding gradients problem. One way to tackle it is by scheduling the learning rate after some epochs, but with more recent advancements and research it has been shown that with a per-parameter adaptive learning rate algorithm, such as the Adam optimizer, you no longer need to schedule the learning rate.

So there is a chance that you are not using the right optimizer function.

To use the Adam optimizer in Keras, you can use it from the keras.optimizers module.

keras.optimizers.Adam(
    learning_rate=0.001,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-07,
    amsgrad=False,
    name="Adam",
    **kwargs
)
model.compile(optimizer=keras.optimizers.Adam(), loss=custom_loss)

7. Wrong Activation Function

The wrong choice of activation function can also lead to very strange behaviour of a deep learning model. For example, if you are working on a multi-class classification problem and use the relu or sigmoid activation function in the final layer instead of softmax (which pairs with the categorical_crossentropy loss), the model can perform very strangely.
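For a multi-class classification problem, a minimal sketch of a suitable final layer and loss pairing (the layer sizes and number of classes below are hypothetical):

from tensorflow import keras

num_classes = 10
model = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    keras.layers.Dense(num_classes, activation="softmax"),  # class probabilities in the final layer
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])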

8. Low Batch Size

It has been observed that optimizers are less stable on very low batch sizes such as 16 or 32 than on batch sizes of 64 or 128.

9. High Learning Rate

A high learning rate can prevent the deep learning model from converging to an optimum; it can overshoot and get lost somewhere in between.

Hence it is advisable to use a lower learning rate, and the value can be refined further with hyperparameter tuning.
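A minimal sketch of compiling with a smaller learning rate, optionally combined with the ReduceLROnPlateau callback (the values and the tiny model below are just illustrative):

from tensorflow import keras

opt = keras.optimizers.Adam(learning_rate=1e-4)  # smaller than the default of 1e-3
reduce_lr = keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=2)

model = keras.Sequential([keras.layers.Dense(1, input_shape=(5,))])
model.compile(optimizer=opt, loss="mse")
# model.fit(X_train, y_train, validation_split=0.2, callbacks=[reduce_lr])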

10. Different file type (for NLP Problems)

If you are working on a text problem, you can check the file's encoding by running the following command.

Linux
$ file -i {input}

OSX
$ file -I {input}

This will report the file's encoding. If it is ISO-8859-1 or us-ascii, try converting the file to utf-8 or utf-16le.
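A rough sketch of re-encoding such a file to UTF-8 in Python (the file names are hypothetical):

# read with the reported encoding and write back out as UTF-8
with open("input.txt", encoding="iso-8859-1") as src:
    text = src.read()
with open("input_utf8.txt", "w", encoding="utf-8") as dst:
    dst.write(text)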

Monitoring Keras Loss using Callbacks

It is important to monitor your loss while training the model so that you can understand the different kinds of behaviour your model is showing. Keras provides many callbacks with which you can monitor the loss. Some of the best known are:

1. CSVLogger

CSVLogger is a callback provided by Keras that saves the results of each epoch to a CSV file, so that the training history can be stored, inspected, and visualized later on.

You can use CSVLogger from keras.callbacks.

				
					from keras.callbacks import CSVLogger

csv_logger = CSVLogger('training.csv')
model.fit(X_train, Y_train, callbacks=[csv_logger])
				
			

This fits the model on the dataset and stores the per-epoch information in a training.csv file, which you can load into a dataframe and visualize.
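A minimal sketch of loading and plotting the logged history (assuming the CSV contains the epoch and loss columns that CSVLogger writes by default):

import pandas as pd
import matplotlib.pyplot as plt

history = pd.read_csv("training.csv")
plt.plot(history["epoch"], history["loss"])  # quick look at the training loss curve
plt.xlabel("epoch")
plt.ylabel("loss")
plt.show()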

2. TerminateOnNaN

Imagine you set the training limit of your model to 1000 epochs and your model starts showing NaN loss. You cannot just sit and stare at the screen while no progress is being made. Keras provides a TerminateOnNaN callback that terminates training whenever a NaN loss is encountered.

				
					import keras
terNan = keras.callbacks.TerminateOnNaN()
model.fit(X_train, Y_train, callbacks=[terNan])
				
			

3. RemoteMonitor

RemoteMonitor is a powerful callback in Keras that can help us monitor and visualize the learning in real time.

To use this callback, you need to clone hualos by François Chollet, the creator of Keras.

				
					git clone https://github.com/fchollet/hualos
cd hualos
python api.py
				
			

You can now access hualos at localhost:9000 in your browser. Next, define the callback and add it to your model while training:

				
from keras.callbacks import RemoteMonitor

monitor = RemoteMonitor()  # sends training data to http://localhost:9000 by default
hist = model.fit(train_X, train_Y, epochs=50, callbacks=[monitor])
				
			

During training, the page at localhost:9000 is updated automatically, and you can see visualizations of the learning in real time.

4. EarlyStopping

EarlyStopping is a very useful callback provided by Keras that lets you stop the training earlier than planned, based on a monitored value.

For example, you set the number of epochs to 100, but your model stops improving after the 10th epoch. Rather than waiting for the full training run to finish before changing the model's architecture, you can use the EarlyStopping callback, which stops training once a chosen criterion is met.

				
					
es = keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=3,
)


				
			

Here, the EarlyStopping callback has been defined with the monitor set to the validation loss. If the validation loss does not improve for 3 consecutive epochs, training is stopped.
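A short sketch of attaching this callback during training (the data, epoch count, and validation split are placeholders):

hist = model.fit(
    X_train, Y_train,
    validation_split=0.2,   # EarlyStopping monitors val_loss, so a validation set is needed
    epochs=100,
    callbacks=[es],
)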

This article should give you a good foundation for dealing with loss functions, especially in Keras: implementing custom loss functions, whether you develop them yourself or implement ones proposed by researchers, avoiding silly errors such as NaNs repeatedly appearing in your loss, and monitoring your loss function in Keras.

Hopefully, now you have a good grip on these topics:

  • What are Loss Functions 
  • What are Evaluation Metrics?
  • Commonly used Loss functions in Keras (Regression and Classification)
  • Built-in loss functions in Keras
  • What is the custom loss function?
  • Why should you use a Custom Loss?
  • Implementation of common loss functions in Keras
  • Custom Loss Function for Layers i.e. Custom Regularization Loss
  • Dealing with NaN values in Keras Loss
  • Monitoring Keras Loss using callbacks

Usage of loss functions

A loss function (or objective function, or optimization score function) is one of the two parameters required to compile a model:

model.compile(loss='mean_squared_error', optimizer='sgd')
from keras import losses

model.compile(loss=losses.mean_squared_error, optimizer='sgd')

You can either pass the name of an existing loss function, or pass a TensorFlow/Theano symbolic function that returns a scalar for each data-point and takes the following two arguments:

  • y_true: True labels. TensorFlow/Theano tensor.
  • y_pred: Predictions. TensorFlow/Theano tensor of the same shape as y_true.

The actual optimized objective is the mean of the output array across all datapoints.

For a few examples of such functions, check out the losses source.
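As an illustration, a minimal sketch of a custom function matching this signature (the function name is made up, and model is assumed to be an existing Keras model):

from keras import backend as K

def mean_fourth_error(y_true, y_pred):
    # returns one scalar per data point: the mean over the last axis of the fourth power of the error
    return K.mean(K.pow(y_true - y_pred, 4), axis=-1)

model.compile(loss=mean_fourth_error, optimizer='sgd')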

Available loss functions

mean_squared_error

mean_squared_error(y_true, y_pred)

mean_absolute_error

mean_absolute_error(y_true, y_pred)

mean_absolute_percentage_error

mean_absolute_percentage_error(y_true, y_pred)

mean_squared_logarithmic_error

mean_squared_logarithmic_error(y_true, y_pred)

squared_hinge

squared_hinge(y_true, y_pred)

hinge

hinge(y_true, y_pred)

categorical_hinge

categorical_hinge(y_true, y_pred)

logcosh

logcosh(y_true, y_pred)

Logarithm of the hyperbolic cosine of the prediction error.

log(cosh(x)) is approximately equal to (x ** 2) / 2 for small x and
to abs(x) - log(2) for large x. This means that ‘logcosh’ works mostly
like the mean squared error, but will not be so strongly affected by the
occasional wildly incorrect prediction. However, it may return NaNs if the
intermediate value cosh(y_pred - y_true) is too large to be represented
in the chosen precision.
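A quick numerical check of these approximations, as a sketch using NumPy only:

import numpy as np

x = np.array([0.1, 1.0, 10.0])
exact = np.log(np.cosh(x))
small_x_approx = x ** 2 / 2               # close to exact for small x
large_x_approx = np.abs(x) - np.log(2.0)  # close to exact for large x
print(exact, small_x_approx, large_x_approx)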


categorical_crossentropy

categorical_crossentropy(y_true, y_pred)

sparse_categorical_crossentropy

sparse_categorical_crossentropy(y_true, y_pred)

binary_crossentropy

binary_crossentropy(y_true, y_pred)

kullback_leibler_divergence

kullback_leibler_divergence(y_true, y_pred)

poisson

poisson(y_true, y_pred)

cosine_proximity

cosine_proximity(y_true, y_pred)

Note: when using the categorical_crossentropy loss, your targets should be in categorical format (e.g. if you have 10 classes, the target for each sample should be a 10-dimensional vector that is all-zeros except for a 1 at the index corresponding to the class of the sample). In order to convert integer targets into categorical targets, you can use the Keras utility to_categorical:

from keras.utils.np_utils import to_categorical

categorical_labels = to_categorical(int_labels, num_classes=None)

Types of Keras Loss Functions Explained for Beginners

Contents

  • 1 Introduction
  • 2 What is Loss Function?
  • 3 Types of Loss Functions in Keras
    • 3.1 1. Keras Loss Function for Classification 
      • 3.1.1 i) Keras Binary Cross Entropy 
        • 3.1.1.1 Syntax of Keras Binary Cross Entropy
        • 3.1.1.2 Keras Binary Cross Entropy Example
      • 3.1.2 ii) Keras Categorical Cross Entropy 
        • 3.1.2.1 Syntax of Keras Categorical Cross Entropy
        • 3.1.2.2 Keras Categorical Cross Entropy Example
      • 3.1.3 iii) Keras KL Divergence
        • 3.1.3.1 Syntax of Keras KL Divergence
        • 3.1.3.2 Keras KL Divergence Example
      • 3.1.4 iv) Keras Poisson Loss Function
        • 3.1.4.1 Syntax of Keras Poisson Loss Function
        • 3.1.4.2 Keras Poisson Loss Function Example
        • 3.1.5 v) Keras Hinge Loss
        • 3.1.5.1 Syntax of Keras Hinge Loss
        • 3.1.5.2 Keras Hinge Loss Example
      • 3.1.6 vi) Keras Squared Hinge Loss
        • 3.1.6.1 Syntax of Squared Hinge Loss in Keras
        • 3.1.6.2 Example of Squared Hinge Loss in Keras
      • 3.1.7 vii) Keras Categorical Hinge Loss
        • 3.1.7.1 Syntax of Keras Categorical Hinge Loss
        • 3.1.7.2 Keras Categorical Hinge Loss Example
    • 3.2 2. Keras Loss Function for Regression
      • 3.2.1 i) Keras Mean Square Error Loss
        • 3.2.1.1 Syntax of Mean Square Error Loss in Keras
        • 3.2.1.2 Keras Mean Square Error Loss Example
      • 3.2.2 ii) Keras Mean Absolute Error Loss
        • 3.2.2.1 Syntax of Mean Absolute Error Loss in Keras
        • 3.2.2.2 Keras Mean Absolute Error Loss Example
      • 3.2.3 iii) Keras Cosine Similarity Loss
        • 3.2.3.1 Syntax of Cosine Similarity Loss in Keras
        • 3.2.3.2 Keras Cosine Similarity Loss Example
      • 3.2.4 iv) Keras Huber Loss Function
        • 3.2.4.1 Syntax of Huber Loss Function in Keras
        • 3.2.4.2 Huber Loss Function in Keras Example
    • 3.3 Keras Custom Loss Function
      • 3.3.1 Keras Custom Loss function Example
    • 3.4 Keras add_loss() API
      • 3.4.1 Keras add_loss() API Example
  • 4 Conclusion

Introduction

In this tutorial, we will look at various types of Keras loss functions for training neural networks. Loss functions are an important part of any neural network training process, as they help the network minimize the error and get as close as possible to the expected output. Here we will go through Keras loss functions for regression and classification, and also see how to create a custom loss function in Keras.

What is Loss Function?

Loss Functions, also known as cost functions, are used for computing the error with the aim that the model should minimize it during training.

Loss Functions also help in finding the slope, i.e. the gradient with respect to the weights used in the model; these weights are then updated during training with the help of backpropagation.

The below animation shows how a loss function works.

  • Read more: Dummies guide to Loss Functions in Machine Learning [with Animation]

Types of Keras Loss Functions


Selecting a loss function is not so easy, so we’ll be going over some prominent loss functions that can be helpful in various instances.

1. Keras Loss Function for Classification 

Let us first understand the Keras loss functions for classification, which are usually computed using probabilistic losses.

i) Keras Binary Cross Entropy 

The binary cross entropy loss function computes the loss between the true labels and the predicted labels for binary classification models, which output a probability between 0 and 1.

[Figure: Types of Keras Loss Functions for Classification]

Syntax of Keras Binary Cross Entropy

Following is the syntax of Binary Cross Entropy Loss Function in Keras.

In [1]:

tf.keras.losses.BinaryCrossentropy(
    from_logits=False, label_smoothing=0, reduction="auto", name="binary_crossentropy"
)
Keras Binary Cross Entropy Example

The example for Keras binary cross entropy below uses two small sets of values as data and then applies the required class from the losses module.

In [3]:

y_true = [[0., 1.], [0., 0.]]
y_pred = [[0.6, 0.4], [0.4, 0.6]]
# Using 'auto'/'sum_over_batch_size' reduction type.  
bce = tf.keras.losses.BinaryCrossentropy()
bce(y_true, y_pred).numpy()

ii) Keras Categorical Cross Entropy 

This is the second type of probabilistic loss function for classification in Keras and is a generalized version of binary cross entropy that we discussed above. Categorical Cross Entropy is used for multiclass classification where there are more than two class labels.

Syntax of Keras Categorical Cross Entropy

Following is the syntax of Categorical Cross Entropy Loss Function in Keras.

In [4]:

tf.keras.losses.CategoricalCrossentropy(
    from_logits=False, label_smoothing=0, reduction="auto", name="categorical_crossentropy"
)
Keras Categorical Cross Entropy Example

The following is an example of Keras categorical cross entropy. y_true denotes the actual probability distribution of the output and y_pred denotes the probability distribution we got from the model.

In [5]:

y_true = [[0, 1, 0], [0, 0, 1]]
y_pred = [[0.05, 0.95, 0], [0.1, 0.8, 0.1]]
# Using 'auto'/'sum_over_batch_size' reduction type.  
cce = tf.keras.losses.CategoricalCrossentropy()
cce(y_true, y_pred).numpy()

iii) Keras KL Divergence

The KL Divergence or Kullback-Leibler Divergence loss function is computed between the actual values and the predicted values; it measures how one probability distribution diverges from another.

Syntax of Keras KL Divergence

Below is the syntax of KL Divergence in Keras –

In [8]:

tf.keras.losses.KLDivergence(reduction="auto", name="kl_divergence")
Keras KL Divergence Example

The KLDivergence() class is used in this case. The resulting loss is not huge, but it is still considerable.

In [9]:

y_true = [[0, 1], [0, 0]]
y_pred = [[0.6, 0.4], [0.4, 0.6]]
# Using 'auto'/'sum_over_batch_size' reduction type.  
kl = tf.keras.losses.KLDivergence()
kl(y_true, y_pred).numpy()

iv) Keras Poisson Loss Function

In the Poisson loss function, we calculate the Poisson loss between the actual value and the predicted value. It is generally used with datasets that follow a Poisson distribution, for example the number of calls received by a call center in an hour.

Syntax of Keras Poisson Loss Function

Following is the syntax of Poisson Loss Function in Keras.

In [6]:

tf.keras.losses.Poisson(reduction="auto", name="poisson")
Keras Poisson Loss Function Example

The Poisson loss function is used in the example below.

In [7]:

y_true = [[0., 1.], [0., 0.]]
y_pred = [[1., 1.], [0., 0.]]
# Using 'auto'/'sum_over_batch_size' reduction type.  
p = tf.keras.losses.Poisson()
p(y_true, y_pred).numpy()

v) Keras Hinge Loss

The above Keras loss functions for classification were using probabilistic loss as their basis for calculation. Now we are going to see some loss functions in Keras that use Hinge Loss for maximum margin classification like in SVM.

The hinge loss is computed between the true values and the predicted values.

Syntax of Keras Hinge Loss

Below is the syntax of Keras Hinge loss –

In [18]:

tf.keras.losses.Hinge(reduction="auto", name="hinge")
Keras Hinge Loss Example

The Hinge() class from tf.keras.losses computes the hinge loss:

In [19]:

y_true = [[0., 1.], [0., 0.]]
y_pred = [[0.6, 0.4], [0.4, 0.6]]
# Using 'auto'/'sum_over_batch_size' reduction type.  
h = tf.keras.losses.Hinge()
h(y_true, y_pred).numpy()

vi) Keras Squared Hinge Loss

The squared hinge loss is calculated using the squared_hinge() function and is similar to the hinge loss discussed above, except that the result is squared.

Syntax of Squared Hinge Loss in Keras

In [22]:

tf.keras.losses.squared_hinge(y_true, y_pred)
Example of Squared Hinge Loss in Keras

In this example, data is first generated using NumPy's random functions, and then the Keras squared hinge loss function calculates the loss.

In [23]:

import numpy as np

y_true = np.random.choice([-1, 1], size=(2, 3))
y_pred = np.random.random(size=(2, 3))
loss = tf.keras.losses.squared_hinge(y_true, y_pred)
assert loss.shape == (2,)
assert np.array_equal(loss.numpy(),np.mean(np.square(np.maximum(1. - y_true * y_pred, 0.)), axis=-1))

vii) Keras Categorical Hinge Loss

The second type of hinge loss function is the categorical hinge loss function. It can help in computing categorical hinge loss between true values and predicted values for multiclass classification.

Syntax of Keras Categorical Hinge Loss

Below is the syntax of Categorical Hinge Loss in Keras –

In [20]:

tf.keras.losses.CategoricalHinge(reduction="auto", name="categorical_hinge")
Keras Categorical Hinge Loss Example

With the CategoricalHinge() function we calculate the final result for categorical hinge loss.

In [21]:

y_true = [[0, 1], [0, 0]]
y_pred = [[0.6, 0.4], [0.4, 0.6]]
# Using 'auto'/'sum_over_batch_size' reduction type.  
h = tf.keras.losses.CategoricalHinge()
h(y_true, y_pred).numpy()

2. Keras Loss Function for Regression

Let us now look at the second type of loss functions in Keras, those used for regression models.

These regression loss functions are calculated from the residual, i.e. the error between the actual value and the predicted value. The below animation shows this concept.

[Figure: Types of Keras Loss Functions for Regression]

Different types of Regression Loss function in Keras are as follows:

i) Keras Mean Square Error Loss

The mean square error in Keras is used for computing the mean square of errors between predicted values and actual values to get the loss.

Syntax of Mean Square Error Loss in Keras

Below is the syntax of Mean Squared Error loss in Keras –

In [10]:

tf.keras.losses.MeanSquaredError(reduction="auto", name="mean_squared_error")
Keras Mean Square Error Loss Example

The below code snippet shows how we can implement mean square error in Keras.

In [11]:

y_true = [[0., 1.], [0., 0.]]
y_pred = [[1., 1.], [1., 0.]]
# Using 'auto'/'sum_over_batch_size' reduction type.  
mse = tf.keras.losses.MeanSquaredError()
mse(y_true, y_pred).numpy()

ii) Keras Mean Absolute Error Loss

The mean absolute error is computed as the mean of the absolute differences between the labels and the predicted values.

Syntax of Mean Absolute Error Loss in Keras

Below is the syntax of mean absolute error loss in Keras –

In [12]:

tf.keras.losses.MeanAbsoluteError(
    reduction="auto", name="mean_absolute_error"
)
Keras Mean Absolute Error Loss Example

With the help of the losses module of Keras, we can use MeanAbsoluteError and apply it to data to compute the mean absolute error loss.

In [13]:

y_true = [[0., 1.], [0., 0.]]
y_pred = [[1., 1.], [1., 0.]]
# Using 'auto'/'sum_over_batch_size' reduction type.  
mae = tf.keras.losses.MeanAbsoluteError()
mae(y_true, y_pred).numpy()

iii) Keras Cosine Similarity Loss

To calculate the loss between the labels and predictions, we use cosine similarity, whose value ranges from -1 to 1. Note that the Keras loss is the negative of the cosine similarity, so a loss of -1 indicates that the predictions are perfectly aligned with the labels.

Syntax of Cosine Similarity Loss in Keras

Below is the syntax of cosine similarity loss in Keras –

In [14]:

tf.keras.losses.CosineSimilarity(
    axis=-1, reduction="auto", name="cosine_similarity"
)
Keras Cosine Similarity Loss Example

In this example, for implementing cosine similarity in Keras, we are going to use cosine_loss function.

In [15]:

y_true = [[0., 1.], [1., 1.]]
y_pred = [[1., 0.], [1., 1.]]
# Using 'auto'/'sum_over_batch_size' reduction type.  
cosine_loss = tf.keras.losses.CosineSimilarity(axis=1)
cosine_loss(y_true, y_pred).numpy()

iv) Keras Huber Loss Function

The Huber loss can be used in regression problems where the data contains outliers, since it is less sensitive to outliers than the mean squared error.

Syntax of Huber Loss Function in Keras

Below is the syntax of Huber Loss function in Keras

In [16]:

tf.keras.losses.Huber(delta=1.0, reduction="auto", name="huber_loss")
Huber Loss Function in Keras Example

Keras library provides Huber function for calculating the Huber loss.

In [17]:

y_true = [[0, 1], [0, 0]]
y_pred = [[0.6, 0.4], [0.4, 0.6]]
# Using 'auto'/'sum_over_batch_size' reduction type.  
h = tf.keras.losses.Huber()
h(y_true, y_pred).numpy()

Keras Custom Loss Function

Despite this wide range of built-in loss functions, there are cases where none of them serves the purpose. In such scenarios, we can build a custom loss function in Keras, which is especially useful for research purposes.

You can pass this custom loss function to Keras as a parameter while compiling the model. The constraint is that the custom loss function must take the true values (y_true) and the predicted values (y_pred) as input and return an array of losses. If your function does not match this signature, you cannot use it as a custom loss in Keras.

Keras Custom Loss function Example

The below code snippet shows how to build a custom loss function. Once this function is created, we use it to compile the model using Keras.

In [24]:

import tensorflow as tf

def custom_loss_function(y_true, y_pred):
    squared_difference = tf.square(y_true - y_pred)
    return tf.reduce_mean(squared_difference, axis=-1)  # one loss value per sample

model.compile(optimizer='adam', loss=custom_loss_function)

Keras add_loss() API

As we saw above, the custom loss function in Keras has a restriction to use a specific signature of having y_true and y_pred as arguments. Keras provides another option of add_loss() API which does not have this constraint.

Keras add_loss() API Example

The cell below contains an example of how the add_loss() function is used to build a loss inside a custom layer.

In [25]:

import tensorflow as tf
from keras.layers import Layer

class Custom_layer(Layer):
  def __init__(self, rate=1e-2):
    super(Custom_layer, self).__init__()
    self.rate = rate

  def call(self, inputs):
    # register an extra loss proportional to the sum of squared activations
    self.add_loss(self.rate * tf.reduce_sum(tf.square(inputs)))
    return inputs
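A brief sketch of using this layer inside a model (the layer sizes are hypothetical); the term registered with add_loss() is added to the total loss during training:

from keras.layers import Dense, Input
from keras.models import Model

inputs = Input(shape=(8,))
x = Dense(16, activation="relu")(inputs)
x = Custom_layer(rate=1e-2)(x)   # contributes rate * sum(x**2) to the total loss
outputs = Dense(1)(x)

model = Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")
# model.losses now includes the tensor registered by add_loss()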
