From Wikipedia, the free encyclopedia
In statistics, the mean squared error (MSE)[1] or mean squared deviation (MSD) of an estimator (of a procedure for estimating an unobserved quantity) measures the average of the squares of the errors—that is, the average squared difference between the estimated values and the actual value. MSE is a risk function, corresponding to the expected value of the squared error loss.[2] The fact that MSE is almost always strictly positive (and not zero) is because of randomness or because the estimator does not account for information that could produce a more accurate estimate.[3] In machine learning, specifically empirical risk minimization, MSE may refer to the empirical risk (the average loss on an observed data set), as an estimate of the true MSE (the true risk: the average loss on the actual population distribution).
The MSE is a measure of the quality of an estimator. As it is derived from the square of Euclidean distance, it is always a positive value that decreases as the error approaches zero.
The MSE is the second moment (about the origin) of the error, and thus incorporates both the variance of the estimator (how widely spread the estimates are from one data sample to another) and its bias (how far off the average estimated value is from the true value).[citation needed] For an unbiased estimator, the MSE is the variance of the estimator. Like the variance, MSE has the same units of measurement as the square of the quantity being estimated. In an analogy to standard deviation, taking the square root of MSE yields the root-mean-square error or root-mean-square deviation (RMSE or RMSD), which has the same units as the quantity being estimated; for an unbiased estimator, the RMSE is the square root of the variance, known as the standard error.
Definition and basic properties[edit]
The MSE either assesses the quality of a predictor (i.e., a function mapping arbitrary inputs to a sample of values of some random variable), or of an estimator (i.e., a mathematical function mapping a sample of data to an estimate of a parameter of the population from which the data is sampled). The definition of an MSE differs according to whether one is describing a predictor or an estimator.
Predictor[edit]
If a vector of predictions is generated from a sample of data points on all variables, and is the vector of observed values of the variable being predicted, with being the predicted values (e.g. as from a least-squares fit), then the within-sample MSE of the predictor is computed as
In other words, the MSE is the mean of the squares of the errors . This is an easily computable quantity for a particular sample (and hence is sample-dependent).
In matrix notation,
where is and is the column vector.
The MSE can also be computed on q data points that were not used in estimating the model, either because they were held back for this purpose, or because these data have been newly obtained. Within this process, known as statistical learning, the MSE is often called the test MSE,[4] and is computed as
Estimator[edit]
The MSE of an estimator with respect to an unknown parameter is defined as[1]
This definition depends on the unknown parameter, but the MSE is a priori a property of an estimator. The MSE could be a function of unknown parameters, in which case any estimator of the MSE based on estimates of these parameters would be a function of the data (and thus a random variable). If the estimator is derived as a sample statistic and is used to estimate some population parameter, then the expectation is with respect to the sampling distribution of the sample statistic.
The MSE can be written as the sum of the variance of the estimator and the squared bias of the estimator, providing a useful way to calculate the MSE and implying that in the case of unbiased estimators, the MSE and variance are equivalent.[5]
Proof of variance and bias relationship[edit]
An even shorter proof can be achieved using the well-known formula that for a random variable , . By substituting with, , we have
But in real modeling case, MSE could be described as the addition of model variance, model bias, and irreducible uncertainty (see Bias–variance tradeoff). According to the relationship, the MSE of the estimators could be simply used for the efficiency comparison, which includes the information of estimator variance and bias. This is called MSE criterion.
In regression[edit]
In regression analysis, plotting is a more natural way to view the overall trend of the whole data. The mean of the distance from each point to the predicted regression model can be calculated, and shown as the mean squared error. The squaring is critical to reduce the complexity with negative signs. To minimize MSE, the model could be more accurate, which would mean the model is closer to actual data. One example of a linear regression using this method is the least squares method—which evaluates appropriateness of linear regression model to model bivariate dataset,[6] but whose limitation is related to known distribution of the data.
The term mean squared error is sometimes used to refer to the unbiased estimate of error variance: the residual sum of squares divided by the number of degrees of freedom. This definition for a known, computed quantity differs from the above definition for the computed MSE of a predictor, in that a different denominator is used. The denominator is the sample size reduced by the number of model parameters estimated from the same data, (n−p) for p regressors or (n−p−1) if an intercept is used (see errors and residuals in statistics for more details).[7] Although the MSE (as defined in this article) is not an unbiased estimator of the error variance, it is consistent, given the consistency of the predictor.
In regression analysis, «mean squared error», often referred to as mean squared prediction error or «out-of-sample mean squared error», can also refer to the mean value of the squared deviations of the predictions from the true values, over an out-of-sample test space, generated by a model estimated over a particular sample space. This also is a known, computed quantity, and it varies by sample and by out-of-sample test space.
Examples[edit]
Mean[edit]
Suppose we have a random sample of size from a population, . Suppose the sample units were chosen with replacement. That is, the units are selected one at a time, and previously selected units are still eligible for selection for all draws. The usual estimator for the is the sample average
which has an expected value equal to the true mean (so it is unbiased) and a mean squared error of
where is the population variance.
For a Gaussian distribution, this is the best unbiased estimator (i.e., one with the lowest MSE among all unbiased estimators), but not, say, for a uniform distribution.
Variance[edit]
The usual estimator for the variance is the corrected sample variance:
This is unbiased (its expected value is ), hence also called the unbiased sample variance, and its MSE is[8]
where is the fourth central moment of the distribution or population, and is the excess kurtosis.
However, one can use other estimators for which are proportional to , and an appropriate choice can always give a lower mean squared error. If we define
then we calculate:
This is minimized when
For a Gaussian distribution, where , this means that the MSE is minimized when dividing the sum by . The minimum excess kurtosis is ,[a] which is achieved by a Bernoulli distribution with p = 1/2 (a coin flip), and the MSE is minimized for Hence regardless of the kurtosis, we get a «better» estimate (in the sense of having a lower MSE) by scaling down the unbiased estimator a little bit; this is a simple example of a shrinkage estimator: one «shrinks» the estimator towards zero (scales down the unbiased estimator).
Further, while the corrected sample variance is the best unbiased estimator (minimum mean squared error among unbiased estimators) of variance for Gaussian distributions, if the distribution is not Gaussian, then even among unbiased estimators, the best unbiased estimator of the variance may not be
Gaussian distribution[edit]
The following table gives several estimators of the true parameters of the population, μ and σ2, for the Gaussian case.[9]
True value | Estimator | Mean squared error |
---|---|---|
= the unbiased estimator of the population mean, | ||
= the unbiased estimator of the population variance, | ||
= the biased estimator of the population variance, | ||
= the biased estimator of the population variance, |
Interpretation[edit]
An MSE of zero, meaning that the estimator predicts observations of the parameter with perfect accuracy, is ideal (but typically not possible).
Values of MSE may be used for comparative purposes. Two or more statistical models may be compared using their MSEs—as a measure of how well they explain a given set of observations: An unbiased estimator (estimated from a statistical model) with the smallest variance among all unbiased estimators is the best unbiased estimator or MVUE (Minimum-Variance Unbiased Estimator).
Both analysis of variance and linear regression techniques estimate the MSE as part of the analysis and use the estimated MSE to determine the statistical significance of the factors or predictors under study. The goal of experimental design is to construct experiments in such a way that when the observations are analyzed, the MSE is close to zero relative to the magnitude of at least one of the estimated treatment effects.
In one-way analysis of variance, MSE can be calculated by the division of the sum of squared errors and the degree of freedom. Also, the f-value is the ratio of the mean squared treatment and the MSE.
MSE is also used in several stepwise regression techniques as part of the determination as to how many predictors from a candidate set to include in a model for a given set of observations.
Applications[edit]
- Minimizing MSE is a key criterion in selecting estimators: see minimum mean-square error. Among unbiased estimators, minimizing the MSE is equivalent to minimizing the variance, and the estimator that does this is the minimum variance unbiased estimator. However, a biased estimator may have lower MSE; see estimator bias.
- In statistical modelling the MSE can represent the difference between the actual observations and the observation values predicted by the model. In this context, it is used to determine the extent to which the model fits the data as well as whether removing some explanatory variables is possible without significantly harming the model’s predictive ability.
- In forecasting and prediction, the Brier score is a measure of forecast skill based on MSE.
Loss function[edit]
Squared error loss is one of the most widely used loss functions in statistics[citation needed], though its widespread use stems more from mathematical convenience than considerations of actual loss in applications. Carl Friedrich Gauss, who introduced the use of mean squared error, was aware of its arbitrariness and was in agreement with objections to it on these grounds.[3] The mathematical benefits of mean squared error are particularly evident in its use at analyzing the performance of linear regression, as it allows one to partition the variation in a dataset into variation explained by the model and variation explained by randomness.
Criticism[edit]
The use of mean squared error without question has been criticized by the decision theorist James Berger. Mean squared error is the negative of the expected value of one specific utility function, the quadratic utility function, which may not be the appropriate utility function to use under a given set of circumstances. There are, however, some scenarios where mean squared error can serve as a good approximation to a loss function occurring naturally in an application.[10]
Like variance, mean squared error has the disadvantage of heavily weighting outliers.[11] This is a result of the squaring of each term, which effectively weights large errors more heavily than small ones. This property, undesirable in many applications, has led researchers to use alternatives such as the mean absolute error, or those based on the median.
See also[edit]
- Bias–variance tradeoff
- Hodges’ estimator
- James–Stein estimator
- Mean percentage error
- Mean square quantization error
- Mean square weighted deviation
- Mean squared displacement
- Mean squared prediction error
- Minimum mean square error
- Minimum mean squared error estimator
- Overfitting
- Peak signal-to-noise ratio
Notes[edit]
- ^ This can be proved by Jensen’s inequality as follows. The fourth central moment is an upper bound for the square of variance, so that the least value for their ratio is one, therefore, the least value for the excess kurtosis is −2, achieved, for instance, by a Bernoulli with p=1/2.
References[edit]
- ^ a b «Mean Squared Error (MSE)». www.probabilitycourse.com. Retrieved 2020-09-12.
- ^ Bickel, Peter J.; Doksum, Kjell A. (2015). Mathematical Statistics: Basic Ideas and Selected Topics. Vol. I (Second ed.). p. 20.
If we use quadratic loss, our risk function is called the mean squared error (MSE) …
- ^ a b Lehmann, E. L.; Casella, George (1998). Theory of Point Estimation (2nd ed.). New York: Springer. ISBN 978-0-387-98502-2. MR 1639875.
- ^ Gareth, James; Witten, Daniela; Hastie, Trevor; Tibshirani, Rob (2021). An Introduction to Statistical Learning: with Applications in R. Springer. ISBN 978-1071614174.
- ^ Wackerly, Dennis; Mendenhall, William; Scheaffer, Richard L. (2008). Mathematical Statistics with Applications (7 ed.). Belmont, CA, USA: Thomson Higher Education. ISBN 978-0-495-38508-0.
- ^ A modern introduction to probability and statistics : understanding why and how. Dekking, Michel, 1946-. London: Springer. 2005. ISBN 978-1-85233-896-1. OCLC 262680588.
{{cite book}}
: CS1 maint: others (link) - ^ Steel, R.G.D, and Torrie, J. H., Principles and Procedures of Statistics with Special Reference to the Biological Sciences., McGraw Hill, 1960, page 288.
- ^ Mood, A.; Graybill, F.; Boes, D. (1974). Introduction to the Theory of Statistics (3rd ed.). McGraw-Hill. p. 229.
- ^ DeGroot, Morris H. (1980). Probability and Statistics (2nd ed.). Addison-Wesley.
- ^ Berger, James O. (1985). «2.4.2 Certain Standard Loss Functions». Statistical Decision Theory and Bayesian Analysis (2nd ed.). New York: Springer-Verlag. p. 60. ISBN 978-0-387-96098-2. MR 0804611.
- ^ Bermejo, Sergio; Cabestany, Joan (2001). «Oriented principal component analysis for large margin classifiers». Neural Networks. 14 (10): 1447–1461. doi:10.1016/S0893-6080(01)00106-X. PMID 11771723.
title | date | categories | tags | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
About loss and loss functions |
2019-10-04 |
|
|
When you’re training supervised machine learning models, you often hear about a loss function that is minimized; that must be chosen, and so on.
The term cost function is also used equivalently.
But what is loss? And what is a loss function?
I’ll answer these two questions in this blog, which focuses on this optimization aspect of machine learning. We’ll first cover the high-level supervised learning process, to set the stage. This includes the role of training, validation and testing data when training supervised models.
Once we’re up to speed with those, we’ll introduce loss. We answer the question what is loss? However, we don’t forget what is a loss function? We’ll even look into some commonly used loss functions.
Let’s go! 😎
[toc]
[ad]
The high-level supervised learning process
Before we can actually introduce the concept of loss, we’ll have to take a look at the high-level supervised machine learning process. All supervised training approaches fall under this process, which means that it is equal for deep neural networks such as MLPs or ConvNets, but also for SVMs.
Let’s take a look at this training process, which is cyclical in nature.
Forward pass
We start with our features and targets, which are also called your dataset. This dataset is split into three parts before the training process starts: training data, validation data and testing data. The training data is used during the training process; more specificially, to generate predictions during the forward pass. However, after each training cycle, the predictive performance of the model must be tested. This is what the validation data is used for — it helps during model optimization.
Then there is testing data left. Assume that the validation data, which is essentially a statistical sample, does not fully match the population it describes in statistical terms. That is, the sample does not represent it fully and by consequence the mean and variance of the sample are (hopefully) slightly different than the actual population mean and variance. Hence, a little bias is introduced into the model every time you’ll optimize it with your validation data. While it may thus still work very well in terms of predictive power, it may be the case that it will lose its power to generalize. In that case, it would no longer work for data it has never seen before, e.g. data from a different sample. The testing data is used to test the model once the entire training process has finished (i.e., only after the last cycle), and allows us to tell something about the generalization power of our machine learning model.
The training data is fed into the machine learning model in what is called the forward pass. The origin of this name is really easy: the data is simply fed to the network, which means that it passes through it in a forward fashion. The end result is a set of predictions, one per sample. This means that when my training set consists of 1000 feature vectors (or rows with features) that are accompanied by 1000 targets, I will have 1000 predictions after my forward pass.
[ad]
Loss
You do however want to know how well the model performs with respect to the targets originally set. A well-performing model would be interesting for production usage, whereas an ill-performing model must be optimized before it can be actually used.
This is where the concept of loss enters the equation.
Most generally speaking, the loss allows us to compare between some actual targets and predicted targets. It does so by imposing a «cost» (or, using a different term, a «loss») on each prediction if it deviates from the actual targets.
It’s relatively easy to compute the loss conceptually: we agree on some cost for our machine learning predictions, compare the 1000 targets with the 1000 predictions and compute the 1000 costs, then add everything together and present the global loss.
Our goal when training a machine learning model?
To minimize the loss.
The reason why is simple: the lower the loss, the more the set of targets and the set of predictions resemble each other.
And the more they resemble each other, the better the machine learning model performs.
As you can see in the machine learning process depicted above, arrows are flowing backwards towards the machine learning model. Their goal: to optimize the internals of your model only slightly, so that it will perform better during the next cycle (or iteration, or epoch, as they are also called).
Backwards pass
When loss is computed, the model must be improved. This is done by propagating the error backwards to the model structure, such as the model’s weights. This closes the learning cycle between feeding data forward, generating predictions, and improving it — by adapting the weights, the model likely improves (sometimes much, sometimes slightly) and hence learning takes place.
Depending on the model type used, there are many ways for optimizing the model, i.e. propagating the error backwards. In neural networks, often, a combination of gradient descent based methods and backpropagation is used: gradient descent like optimizers for computing the gradient or the direction in which to optimize, backpropagation for the actual error propagation.
In other model types, such as Support Vector Machines, we do not actually propagate the error backward, strictly speaking. However, we use methods such as quadratic optimization to find the mathematical optimum, which given linear separability of your data (whether in regular space or kernel space) must exist. However, visualizing it as «adapting the weights by computing some error» benefits understanding. Next up — the loss functions we can actually use for computing the error! 😄
[ad]
Loss functions
Here, we’ll cover a wide array of loss functions: some of them for regression, others for classification.
Loss functions for regression
There are two main types of supervised learning problems: classification and regression. In the first, your aim is to classify a sample into the correct bucket, e.g. into one of the buckets ‘diabetes’ or ‘no diabetes’. In the latter case, however, you don’t classify but rather estimate some real valued number. What you’re trying to do is regress a mathematical function from some input data, and hence it’s called regression. For regression problems, there are many loss functions available.
Mean Absolute Error (L1 Loss)
Mean Absolute Error (MAE) is one of them. This is what it looks like:
Don’t worry about the maths, we’ll introduce the MAE intuitively now.
That weird E-like sign you see in the formula is what is called a Sigma sign, and it sums up what’s behind it: |Ei
|, in our case, where Ei
is the error (the difference between prediction and actual value) and the | signs mean that you’re taking the absolute value, or convert -3 into 3 and 3 remains 3.
The summation, in this case, means that we sum all the errors, for all the n
samples that were used for training the model. We therefore, after doing so, end up with a very large number. We divide this number by n
, or the number of samples used, to find the mean, or the average Absolute Error: the Mean Absolute Error or MAE.
It’s very well possible to use the MAE in a multitude of regression scenarios (Rich, n.d.). However, if your average error is very small, it may be better to use the Mean Squared Error that we will introduce next.
What’s more, and this is important: when you use the MAE in optimizations that use gradient descent, you’ll face the fact that the gradients are continuously large (Grover, 2019). Since this also occurs when the loss is low (and hence, you would only need to move a tiny bit), this is bad for learning — it’s easy to overshoot the minimum continously, finding a suboptimal model. Consider Huber loss (more below) if you face this problem. If you face larger errors and don’t care (yet?) about this issue with gradients, or if you’re here to learn, let’s move on to Mean Squared Error!
Mean Squared Error
Another loss function used often in regression is Mean Squared Error (MSE). It sounds really difficult, especially when you look at the formula (Binieli, 2018):
… but fear not. It’s actually really easy to understand what MSE is and what it does!
We’ll break the formula above into three parts, which allows us to understand each element and subsequently how they work together to produce the MSE.
The primary part of the MSE is the middle part, being the Sigma symbol or the summation sign. What it does is really simple: it counts from i to n, and on every count executes what’s written behind it. In this case, that’s the third part — the square of (Yi — Y’i).
In our case, i
starts at 1 and n is not yet defined. Rather, n
is the number of samples in our training set and hence the number of predictions that has been made. In the scenario sketched above, n
would be 1000.
Then, the third part. It’s actually mathematical notation for what we already intuitively learnt earlier: it’s the difference between the actual target for the sample (Yi
) and the predicted target (Y'i
), the latter of which is removed from the first.
With one minor difference: the end result of this computation is squared. This property introduces some mathematical benefits during optimization (Rich, n.d.). Particularly, the MSE is continuously differentiable whereas the MAE is not (at x = 0). This means that optimizing the MSE is easier than optimizing the MAE.
Additionally, large errors introduce a much larger cost than smaller errors (because the differences are squared and larger errors produce much larger squares than smaller errors). This is both good and bad at the same time (Rich, n.d.). This is a good property when your errors are small, because optimization is then advanced (Quora, n.d.). However, using MSE rather than e.g. MAE will open your ML model up to outliers, which will severely disturb training (by means of introducing large errors).
Although the conclusion may be rather unsatisfactory, choosing between MAE and MSE is thus often heavily dependent on the dataset you’re using, introducing the need for some a priori inspection before starting your training process.
Finally, when we have the sum of the squared errors, we divide it by n — producing the mean squared error.
Mean Absolute Percentage Error
The Mean Absolute Percentage Error, or MAPE, really looks like the MAE, even though the formula looks somewhat different:
When using the MAPE, we don’t compute the absolute error, but rather, the mean error percentage with respect to the actual values. That is, suppose that my prediction is 12 while the actual target is 10, the MAPE for this prediction is [latex]| (10 — 12 ) / 10 | = 0.2[/latex].
Similar to the MAE, we sum the error over all the samples, but subsequently face a different computation: [latex]100% / n[/latex]. This looks difficult, but we can once again separate this computation into more easily understandable parts. More specifically, we can write it as a multiplication of [latex]100%[/latex] and [latex]1 / n[/latex] instead. When multiplying the latter with the sum, you’ll find the same result as dividing it by n
, which we did with the MAE. That’s great.
The only thing left now is multiplying the whole with 100%. Why do we do that? Simple: because our computed error is a ratio and not a percentage. Like the example above, in which our error was 0.2, we don’t want to find the ratio, but the percentage instead. [latex]0.2 times 100%[/latex] is … unsurprisingly … [latex]20%[/latex]! Hence, we multiply the mean ratio error with the percentage to find the MAPE!
Why use MAPE if you can also use MAE?
[ad]
Very good question.
Firstly, it is a very intuitive value. Contrary to the absolute error, we have a sense of how well-performing the model is or how bad it performs when we can express the error in terms of a percentage. An error of 100 may seem large, but if the actual target is 1000000 while the estimate is 1000100, well, you get the point.
Secondly, it allows us to compare the performance of regression models on different datasets (Watson, 2019). Suppose that our goal is to train a regression model on the NASDAQ ETF and the Dutch AEX ETF. Since their absolute values are quite different, using MAE won’t help us much in comparing the performance of our model. MAPE, on the other hand, demonstrates the error in terms of a percentage — and a percentage is a percentage, whether you apply it to NASDAQ or to AEX. This way, it’s possible to compare model performance across statistically varying datasets.
Root Mean Squared Error (L2 Loss)
Remember the MSE?
There’s also something called the RMSE, or the Root Mean Squared Error or Root Mean Squared Deviation (RMSD). It goes like this:
Simple, hey? It’s just the MSE but then its square root value.
How does this help us?
The errors of the MSE are squared — hey, what’s in a name.
The RMSE or RMSD errors are root squares of the square — and hence are back at the scale of the original targets (Dragos, 2018). This gives you much better intuition for the error in terms of the targets.
Logcosh
«Log-cosh is the logarithm of the hyperbolic cosine of the prediction error.» (Grover, 2019).
Well, how’s that for a starter.
This is the mathematical formula:
And this the plot:
Okay, now let’s introduce some intuitive explanation.
The TensorFlow docs write this about Logcosh loss:
log(cosh(x))
is approximately equal to(x ** 2) / 2
for smallx
and toabs(x) - log(2)
for largex
. This means that ‘logcosh’ works mostly like the mean squared error, but will not be so strongly affected by the occasional wildly incorrect prediction.
Well, that’s great. It seems to be an improvement over MSE, or L2 loss. Recall that MSE is an improvement over MAE (L1 Loss) if your data set contains quite large errors, as it captures these better. However, this also means that it is much more sensitive to errors than the MAE. Logcosh helps against this problem:
- For relatively small errors (even with the relatively small but larger errors, which is why MSE can be better for your ML problem than MAE) it outputs approximately equal to [latex]x^2 / 2[/latex] — which is pretty equal to the [latex]x^2[/latex] output of the MSE.
- For larger errors, i.e. outliers, where MSE would produce extremely large errors ([latex](10^6)^2 = 10^12[/latex]), the Logcosh approaches [latex]|x| — log(2)[/latex]. It’s like (as well as unlike) the MAE, but then somewhat corrected by the
log
.
Hence: indeed, if you have both larger errors that must be detected as well as outliers, which you perhaps cannot remove from your dataset, consider using Logcosh! It’s available in many frameworks like TensorFlow as we saw above, but also in Keras.
[ad]
Huber loss
Let’s move on to Huber loss, which we already hinted about in the section about the MAE:
Or, visually:
When interpreting the formula, we see two parts:
- [latex]1/2 times (t-p)^2[/latex], when [latex]|t-p| leq delta[/latex]. This sounds very complicated, but we can break it into parts easily.
- [latex]|t-p|[/latex] is the absolute error: the difference between target [latex]t[/latex] and prediction [latex]p[/latex].
- We square it and divide it by two.
- We however only do so when the absolute error is smaller than or equal to some [latex]delta[/latex], also called delta, which you can configure! We’ll see next why this is nice.
- When the absolute error is larger than [latex]delta[/latex], we compute the error as follows: [latex]delta times |t-p| — (delta^2 / 2)[/latex].
- Let’s break this apart again. We multiply the delta with the absolute error and remove half of delta square.
What is the effect of all this mathematical juggling?
Look at the visualization above.
For relatively small deltas (in our case, with [latex]delta = 0.25[/latex], you’ll see that the loss function becomes relatively flat. It takes quite a long time before loss increases, even when predictions are getting larger and larger.
For larger deltas, the slope of the function increases. As you can see, the larger the delta, the slower the increase of this slope: eventually, for really large [latex]delta[/latex] the slope of the loss tends to converge to some maximum.
If you look closely, you’ll notice the following:
- With small [latex]delta[/latex], the loss becomes relatively insensitive to larger errors and outliers. This might be good if you have them, but bad if on average your errors are small.
- With large [latex]delta[/latex], the loss becomes increasingly sensitive to larger errors and outliers. That might be good if your errors are small, but you’ll face trouble when your dataset contains outliers.
Hey, haven’t we seen that before?
Yep: in our discussions about the MAE (insensitivity to larger errors) and the MSE (fixes this, but facing sensitivity to outliers).
Grover (2019) writes about this nicely:
Huber loss approaches MAE when 𝛿 ~ 0 and MSE when 𝛿 ~ ∞ (large numbers.)
That’s what this [latex]delta[/latex] is for! You are now in control about the ‘degree’ of MAE vs MSE-ness you’ll introduce in your loss function. When you face large errors due to outliers, you can try again with a lower [latex]delta[/latex]; if your errors are too small to be picked up by your Huber loss, you can increase the delta instead.
And there’s another thing, which we also mentioned when discussing the MAE: it produces large gradients when you optimize your model by means of gradient descent, even when your errors are small (Grover, 2019). This is bad for model performance, as you will likely overshoot the mathematical optimum for your model. You don’t face this problem with MSE, as it tends to decrease towards the actual minimum (Grover, 2019). If you switch to Huber loss from MAE, you might find it to be an additional benefit.
Here’s why: Huber loss, like MSE, decreases as well when it approaches the mathematical optimum (Grover, 2019). This means that you can combine the best of both worlds: the insensitivity to larger errors from MAE with the sensitivity of the MSE and its suitability for gradient descent. Hooray for Huber loss! And like always, it’s also available when you train models with Keras.
Then why isn’t this the perfect loss function?
Because the benefit of the [latex]delta[/latex] is also becoming your bottleneck (Grover, 2019). As you have to configure them manually (or perhaps using some automated tooling), you’ll have to spend time and resources on finding the most optimum [latex]delta[/latex] for your dataset. This is an iterative problem that, in the extreme case, may become impractical at best and costly at worst. However, in most cases, it’s best just to experiment — perhaps, you’ll find better results!
[ad]
Loss functions for classification
Loss functions are also applied in classifiers. I already discussed in another post what classification is all about, so I’m going to repeat it here:
Suppose that you work in the field of separating non-ripe tomatoes from the ripe ones. It’s an important job, one can argue, because we don’t want to sell customers tomatoes they can’t process into dinner. It’s the perfect job to illustrate what a human classifier would do.
Humans have a perfect eye to spot tomatoes that are not ripe or that have any other defect, such as being rotten. They derive certain characteristics for those tomatoes, e.g. based on color, smell and shape:
— If it’s green, it’s likely to be unripe (or: not sellable);
— If it smells, it is likely to be unsellable;
— The same goes for when it’s white or when fungus is visible on top of it.If none of those occur, it’s likely that the tomato can be sold. We now have two classes: sellable tomatoes and non-sellable tomatoes. Human classifiers decide about which class an object (a tomato) belongs to.
The same principle occurs again in machine learning and deep learning.
Only then, we replace the human with a machine learning model. We’re then using machine learning for classification, or for deciding about some “model input” to “which class” it belongs.Source: How to create a CNN classifier with Keras?
We’ll now cover loss functions that are used for classification.
Hinge
The hinge loss is defined as follows (Wikipedia, 2011):
It simply takes the maximum of either 0 or the computation [latex] 1 — t times y[/latex], where t
is the machine learning output value (being between -1 and +1) and y
is the true target (-1 or +1).
When the target equals the prediction, the computation [latex]t times y[/latex] is always one: [latex]1 times 1 = -1 times -1 = 1)[/latex]. Essentially, because then [latex]1 — t times y = 1 — 1 = 1[/latex], the max
function takes the maximum [latex]max(0, 0)[/latex], which of course is 0.
That is: when the actual target meets the prediction, the loss is zero. Negative loss doesn’t exist. When the target != the prediction, the loss value increases.
For t = 1
, or [latex]1[/latex] is your target, hinge loss looks like this:
Let’s now consider three scenarios which can occur, given our target [latex]t = 1[/latex] (Kompella, 2017; Wikipedia, 2011):
- The prediction is correct, which occurs when [latex]y geq 1.0[/latex].
- The prediction is very incorrect, which occurs when [latex]y < 0.0[/latex] (because the sign swaps, in our case from positive to negative).
- The prediction is not correct, but we’re getting there ([latex] 0.0 leq y < 1.0[/latex]).
In the first case, e.g. when [latex]y = 1.2[/latex], the output of [latex]1 — t times y[/latex] will be [latex] 1 — ( 1 times 1.2 ) = 1 — 1.2 = -0.2[/latex]. Loss, then will be [latex]max(0, -0.2) = 0[/latex]. Hence, for all correct predictions — even if they are too correct, loss is zero. In the too correct situation, the classifier is simply very sure that the prediction is correct (Peltarion, n.d.).
In the second case, e.g. when [latex]y = -0.5[/latex], the output of the loss equation will be [latex]1 — (1 times -0.5) = 1 — (-0.5) = 1.5[/latex], and hence the loss will be [latex]max(0, 1.5) = 1.5[/latex]. Very wrong predictions are hence penalized significantly by the hinge loss function.
In the third case, e.g. when [latex]y = 0.9[/latex], loss output function will be [latex]1 — (1 times 0.9) = 1 — 0.9 = 0.1[/latex]. Loss will be [latex]max(0, 0.1) = 0.1[/latex]. We’re getting there — and that’s also indicated by the small but nonzero loss.
What this essentially sketches is a margin that you try to maximize: when the prediction is correct or even too correct, it doesn’t matter much, but when it’s not, we’re trying to correct. The correction process keeps going until the prediction is fully correct (or when the human tells the improvement to stop). We’re thus finding the most optimum decision boundary and are hence performing a maximum-margin operation.
It is therefore not surprising that hinge loss is one of the most commonly used loss functions in Support Vector Machines (Kompella, 2017). What’s more, hinge loss itself cannot be used with gradient descent like optimizers, those with which (deep) neural networks are trained. This occurs due to the fact that it’s not continuously differentiable, more precisely at the ‘boundary’ between no loss / minimum loss. Fortunately, a subgradient of the hinge loss function can be optimized, so it can (albeit in a different form) still be used in today’s deep learning models (Wikipedia, 2011). For example, hinge loss is available as a loss function in Keras.
Squared hinge
The squared hinge loss is like the hinge formula displayed above, but then the [latex]max()[/latex] function output is squared.
This helps achieving two things:
- Firstly, it makes the loss value more sensitive to outliers, just as we saw with MSE vs MAE. Large errors will add to the loss more significantly than smaller errors. Note that simiarly, this may also mean that you’ll need to inspect your dataset for the presence of such outliers first.
- Secondly, squared hinge loss is differentiable whereas hinge loss is not (Tay, n.d.). The way the hinge loss is defined makes it not differentiable at the ‘boundary’ point of the chart — also see this perfect answer that illustrates it. Squared hinge loss, on the other hand, is differentiable, simply because of the square and the mathematical benefits it introduces during differentiation. This makes it easier for us to use a hinge-like loss in gradient based optimization — we’ll simply take squared hinge.
[ad]
Categorical / multiclass hinge
Both normal hinge and squared hinge loss work only for binary classification problems in which the actual target value is either +1 or -1. Although that’s perfectly fine for when you have such problems (e.g. the diabetes yes/no problem that we looked at previously), there are many other problems which cannot be solved in a binary fashion.
(Note that one approach to create a multiclass classifier, especially with SVMs, is to create many binary ones, feeding the data to each of them and counting classes, eventually taking the most-chosen class as output — it goes without saying that this is not very efficient.)
However, in neural networks and hence gradient based optimization problems, we’re not interested in doing that. It would mean that we have to train many networks, which significantly impacts the time performance of our ML training problem. Instead, we can use the multiclass hinge that has been introduced by researchers Weston and Watkins (Wikipedia, 2011):
What this means in plain English is this:
For all [latex]y[/latex] (output) values unequal to [latex]t[/latex], compute the loss. Eventually, sum them together to find the multiclass hinge loss.
Note that this does not mean that you sum over all possible values for y (which would be all real-valued numbers except [latex]t[/latex]), but instead, you compute the sum over all the outputs generated by your ML model during the forward pass. That is, all the predictions. Only for those where [latex]y neq t[/latex], you compute the loss. This is obvious from an efficiency point of view: where [latex]y = t[/latex], loss is always zero, so no [latex]max[/latex] operation needs to be computed to find zero after all.
Keras implements the multiclass hinge loss as categorical hinge loss, requiring to change your targets into categorical format (one-hot encoded format) first by means of to_categorical
.
Binary crossentropy
A loss function that’s used quite often in today’s neural networks is binary crossentropy. As you can guess, it’s a loss function for binary classification problems, i.e. where there exist two classes. Primarily, it can be used where the output of the neural network is somewhere between 0 and 1, e.g. by means of the Sigmoid layer.
This is its formula:
It can be visualized in this way:
And, like before, let’s now explain it in more intuitive ways.
The [latex]t[/latex] in the formula is the target (0 or 1) and the [latex]p[/latex] is the prediction (a real-valued number between 0 and 1, for example 0.12326).
When you input both into the formula, loss will be computed related to the target and the prediction. In the visualization above, where the target is 1, it becomes clear that loss is 0. However, when moving to the left, loss tends to increase (ML Cheatsheet documentation, n.d.). What’s more, it increases increasingly fast. Hence, it not only tends to punish wrong predictions, but also wrong predictions that are extremely confident (i.e., if the model is very confident that it’s 0 while it’s 1, it gets punished much harder than when it thinks it’s somewhere in between, e.g. 0.5). This latter property makes the binary cross entropy a valued loss function in classification problems.
When the target is 0, you can see that the loss is mirrored — which is exactly what we want:
Categorical crossentropy
Now what if you have no binary classification problem, but instead a multiclass one?
Thus: one where your output can belong to one of > 2 classes.
The CNN that we created with Keras using the MNIST dataset is a good example of this problem. As you can find in the blog (see the link), we used a different loss function there — categorical crossentropy. It’s still crossentropy, but then adapted to multiclass problems.
This is the formula with which we compute categorical crossentropy. Put very simply, we sum over all the classes that we have in our system, compute the target of the observation and the prediction of the observation and compute the observation target with the natural log of the observation prediction.
It took me some time to understand what was meant with a prediction, though, but thanks to Peltarion (n.d.), I got it.
The answer lies in the fact that the crossentropy is categorical and that hence categorical data is used, with one-hot encoding.
Suppose that we have dataset that presents what the odds are of getting diabetes after five years, just like the Pima Indians dataset we used before. However, this time another class is added, being «Possibly diabetic», rendering us three classes for one’s condition after five years given current measurements:
- 0: no diabetes
- 1: possibly diabetic
- 2: diabetic
That dataset would look like this:
Features | Target |
{ … } | 1 |
{ … } | 2 |
{ … } | 0 |
{ … } | 0 |
{ … } | 2 |
…and so on | …and so on |
However, categorical crossentropy cannot simply use integers as targets, because its formula doesn’t support this. Instead, we must apply one-hot encoding, which transforms the integer targets into categorial vectors, which are just vectors that displays all categories and whether it’s some class or not:
- 0: [latex][1, 0, 0][/latex]
- 1: [latex][0, 1, 0][/latex]
- 2: [latex][0, 0, 1][/latex]
[ad]
That’s what we always do with to_categorical
in Keras.
Our dataset then looks as follows:
Features | Target |
{ … } | [latex][0, 1, 0][/latex] |
{ … } | [latex][0, 0, 1][/latex] |
{ … } | [latex][1, 0, 0][/latex] |
{ … } | [latex][1, 0, 0][/latex] |
{ … } | [latex][0, 0, 1][/latex] |
…and so on | …and so on |
Now, we can explain with is meant with an observation.
Let’s look at the formula again and recall that we iterate over all the possible output classes — once for every prediction made, with some true target:
Now suppose that our trained model outputs for the set of features [latex]{ … }[/latex] or a very similar one that has target [latex][0, 1, 0][/latex] a probability distribution of [latex][0.25, 0.50, 0.25][/latex] — that’s what these models do, they pick no class, but instead compute the probability that it’s a particular class in the categorical vector.
Computing the loss, for [latex]c = 1[/latex], what is the target value? It’s 0: in [latex]textbf{t} = [0, 1, 0][/latex], the target value for class 0 is 0.
What is the prediction? Well, following the same logic, the prediction is 0.25.
We call these two observations with respect to the total prediction. By looking at all observations, merging them together, we can find the loss value for the entire prediction.
We multiply the target value with the log. But wait! We multiply the log with 0 — so the loss value for this target is 0.
It doesn’t surprise you that this happens for all targets except for one — where the target value is 1: in the prediction above, that would be for the second one.
Note that when the sum is complete, you’ll multiply it with -1 to find the true categorical crossentropy loss.
Hence, loss is driven by the actual target observation of your sample instead of all the non-targets. The structure of the formula however allows us to perform multiclass machine learning training with crossentropy. There we go, we learnt another loss function
Sparse categorical crossentropy
But what if we don’t want to convert our integer targets into categorical format? We can use sparse categorical crossentropy instead (Lin, 2019).
It performs in pretty much similar ways to regular categorical crossentropy loss, but instead allows you to use integer targets! That’s nice.
Features | Target |
{ … } | 1 |
{ … } | 2 |
{ … } | 0 |
{ … } | 0 |
{ … } | 2 |
…and so on | …and so on |
Kullback-Leibler divergence
Sometimes, machine learning problems involve the comparison between two probability distributions. An example comparison is the situation below, in which the question is how much the uniform distribution differs from the Binomial(10, 0.2) distribution.
When you wish to compare two probability distributions, you can use the Kullback-Leibler divergence, a.k.a. KL divergence (Wikipedia, 2004):
begin{equation} KL (P || Q) = sum p(X) log ( p(X) div q(X) ) end{equation}
KL divergence is an adaptation of entropy, which is a common metric in the field of information theory (Wikipedia, 2004; Wikipedia, 2001; Count Bayesie, 2017). While intuitively, entropy tells you something about «the quantity of your information», KL divergence tells you something about «the change of quantity when distributions are changed».
Your goal in machine learning problems is to ensure that [latex]change approx 0[/latex].
Is KL divergence used in practice? Yes! Generative machine learning models work by drawing a sample from encoded, latent space, which effectively represents a latent probability distribution. In other scenarios, you might wish to perform multiclass classification with neural networks that use Softmax activation in their output layer, effectively generating a probability distribution across the classes. And so on. In those cases, you can use KL divergence loss during training. It compares the probability distribution represented by your training data with the probability distribution generated during your forward pass, and computes the divergence (the difference, although when you swap distributions, the value changes due to non-symmetry of KL divergence — hence it’s not entirely the difference) between the two probability distributions. This is your loss value. Minimizing the loss value thus essentially steers your neural network towards the probability distribution represented in your training set, which is what you want.
Summary
In this blog, we’ve looked at the concept of loss functions, also known as cost functions. We showed why they are necessary by means of illustrating the high-level machine learning process and (at a high level) what happens during optimization. Additionally, we covered a wide range of loss functions, some of them for classification, others for regression. Although we introduced some maths, we also tried to explain them intuitively.
I hope you’ve learnt something from my blog! If you have any questions, remarks, comments or other forms of feedback, please feel free to leave a comment below! 👇 I’d also appreciate a comment telling me if you learnt something and if so, what you learnt. I’ll gladly improve my blog if mistakes are made. Thanks and happy engineering! 😎
References
Chollet, F. (2017). Deep Learning with Python. New York, NY: Manning Publications.
Keras. (n.d.). Losses. Retrieved from https://keras.io/losses/
Binieli, M. (2018, October 8). Machine learning: an introduction to mean squared error and regression lines. Retrieved from https://www.freecodecamp.org/news/machine-learning-mean-squared-error-regression-line-c7dde9a26b93/
Rich. (n.d.). Why square the difference instead of taking the absolute value in standard deviation? Retrieved from https://stats.stackexchange.com/a/121
Quora. (n.d.). What is the difference between squared error and absolute error? Retrieved from https://www.quora.com/What-is-the-difference-between-squared-error-and-absolute-error
Watson, N. (2019, June 14). Using Mean Absolute Error to Forecast Accuracy. Retrieved from https://canworksmart.com/using-mean-absolute-error-forecast-accuracy/
Drakos, G. (2018, December 5). How to select the Right Evaluation Metric for Machine Learning Models: Part 1 Regression Metrics. Retrieved from https://towardsdatascience.com/how-to-select-the-right-evaluation-metric-for-machine-learning-models-part-1-regrression-metrics-3606e25beae0
Wikipedia. (2011, September 16). Hinge loss. Retrieved from https://en.wikipedia.org/wiki/Hinge_loss
Kompella, R. (2017, October 19). Support vector machines ( intuitive understanding ) ? Part#1. Retrieved from https://towardsdatascience.com/support-vector-machines-intuitive-understanding-part-1-3fb049df4ba1
Peltarion. (n.d.). Squared hinge. Retrieved from https://peltarion.com/knowledge-center/documentation/modeling-view/build-an-ai-model/loss-functions/squared-hinge
Tay, J. (n.d.). Why is squared hinge loss differentiable? Retrieved from https://www.quora.com/Why-is-squared-hinge-loss-differentiable
Rakhlin, A. (n.d.). Online Methods in Machine Learning. Retrieved from http://www.mit.edu/~rakhlin/6.883/lectures/lecture05.pdf
Grover, P. (2019, September 25). 5 Regression Loss Functions All Machine Learners Should Know. Retrieved from https://heartbeat.fritz.ai/5-regression-loss-functions-all-machine-learners-should-know-4fb140e9d4b0
TensorFlow. (n.d.). tf.keras.losses.logcosh. Retrieved from https://www.tensorflow.org/api_docs/python/tf/keras/losses/logcosh
ML Cheatsheet documentation. (n.d.). Loss Functions. Retrieved from https://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html
Peltarion. (n.d.). Categorical crossentropy. Retrieved from https://peltarion.com/knowledge-center/documentation/modeling-view/build-an-ai-model/loss-functions/categorical-crossentropy
Lin, J. (2019, September 17). categorical_crossentropy VS. sparse_categorical_crossentropy. Retrieved from https://jovianlin.io/cat-crossentropy-vs-sparse-cat-crossentropy/
Wikipedia. (2004, February 13). Kullback–Leibler divergence. Retrieved from https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
Wikipedia. (2001, July 9). Entropy (information theory). Retrieved from https://en.wikipedia.org/wiki/Entropy_(information_theory)
Count Bayesie. (2017, May 10). Kullback-Leibler Divergence Explained. Retrieved from https://www.countbayesie.com/blog/2017/5/9/kullback-leibler-divergence-explained
The article contains a brief on various loss functions used in Neural networks.
What is a Loss function?
When you train Deep learning models, you feed data to the network, generate predictions, compare them with the actual values (the targets) and then compute what is known as a loss. This loss essentially tells you something about the performance of the network: the higher it is, the worse your network performs overall.
Loss functions are mainly classified into two different categories Classification loss and Regression Loss. Classification loss is the case where the aim is to predict the output from the different categorical values for example, if we have a dataset of handwritten images and the digit is to be predicted that lies between (0–9), in these kinds of scenarios classification loss is used.
Whereas if the problem is regression like predicting the continuous values for example, if need to predict the weather conditions or predicting the prices of houses on the basis of some features. In this type of case, Regression Loss is used.
In this article, we will focus on the most widely used loss functions in Neural networks.
-
Mean Absolute Error (L1 Loss)
-
Mean Squared Error (L2 Loss)
-
Huber Loss
-
Cross-Entropy(a.k.a Log loss)
-
Relative Entropy(a.k.a Kullback–Leibler divergence)
-
Squared Hinge
Mean Absolute Error (MAE)
Mean absolute error (MAE) also called L1 Loss is a loss function used for regression problems. It represents the difference between the original and predicted values extracted by averaging the absolute difference over the data set.
MAE is not sensitive towards outliers and is given several examples with the same input feature values, and the optimal prediction will be their median target value. This should be compared with Mean Squared Error, where the optimal prediction is the mean. A disadvantage of MAE is that the gradient magnitude is not dependent on the error size, only on the sign of y — ŷ which leads to that the gradient magnitude will be large even when the error is small, which in turn can lead to convergence problems.
When to use it?
Use Mean absolute error when you are doing regression and don’t want outliers to play a big role. It can also be useful if you know that your distribution is multimodal, and it’s desirable to have predictions at one of the modes, rather than at the mean of them.
Example: When doing image reconstruction, MAE encourages less blurry images compared to MSE. This is used for example in the paper Image-to-Image Translation with Conditional Adversarial Networks by Isola et al.
Mean Squared Error (MSE)
Mean Squared Error (MSE) also called L2 Loss is also a loss function used for regression. It represents the difference between the original and predicted values extracted by squared the average difference over the data set.
MSE is sensitive towards outliers and given several examples with the same input feature values, the optimal prediction will be their mean target value. This should be compared with Mean Absolute Error, where the optimal prediction is the median. MSE is thus good to use if you believe that your target data, conditioned on the input, is normally distributed around a mean value, and when it’s important to penalize outliers extra much.
When to use it?
Use MSE when doing regression, believing that your target, conditioned on the input, is normally distributed, and want large errors to be significantly (quadratically) more penalized than small ones.
Example: You want to predict future house prices. The price is a continuous value, and therefore we want to do regression. MSE can here be used as the loss function.
Calculate MAE and MSE using Python
Original target data is denoted by y and predicted label is denoted by (Ŷ) Yhat are the main sources to evaluate the model.
import math
import numpy as np
import matplotlib.pyplot as plt
y = np.array([-3, -1, -2, 1, -1, 1, 2, 1, 3, 4, 3, 5])
yhat = np.array([-2, 1, -1, 0, -1, 1, 2, 2, 3, 3, 3, 5])
x = list(range(len(y)))
#We can visualize them in a plot to check the difference visually.
plt.figure(figsize=(9, 5))
plt.scatter(x, y, color="red", label="original")
plt.plot(x, yhat, color="green", label="predicted")
plt.legend()
plt.show()
# calculate MSE
d = y - yhat
mse_f = np.mean(d**2)
print("Mean square error:",mse_f)
Mean square error: 0.75
# calculate MAE
mae_f = np.mean(abs(d))
print("Mean absolute error:",mae_f)
Mean absolute error: 0.5833333333333334
Huber Loss
Huber Loss is typically used in regression problems. It’s less sensitive to outliers than the MSE as it treats error as square only inside an interval.
Consider an example where we have a dataset of 100 values we would like our model to be trained to predict. Out of all that data, 25% of the expected values are 5 while the other 75% are 10.
An MSE loss wouldn’t quite do the trick, since we don’t really have “outliers”; 25% is by no means a small fraction. On the other hand, we don’t necessarily want to weigh that 25% too low with an MAE. Those values of 5 aren’t close to the median (10 — since 75% of the points have a value of 10), but they’re also not really outliers.
This is where the Huber Loss Function comes into play.
The Huber Loss offers the best of both worlds by balancing the MSE and MAE together. We can define it using the following piecewise function:
Here, (𝛿) delta → hyperparameter defines the range for MAE and MSE.
In simple terms, the above radically says is: for loss values less than (𝛿) delta, use the MSE; for loss values greater than delta, use the MAE. This way Huber loss provides the best of both MAE and MSE.
Set delta to the value of the residual for the data points, you trust.
import numpy as np
import matplotlib.pyplot as plt
def huber(a, delta):
value = np.where(np.abs(a)<delta, .5*a**2, delta*(np.abs(a) - .5*delta))
deriv = np.where(np.abs(a)<delta, a, np.sign(a)*delta)
return value, deriv
h, d = huber(np.arange(-1, 1, .01), delta=0.2)
fig, ax = plt.subplots(1)
ax.plot(h, label='loss value')
ax.plot(d, label='loss derivative')
ax.grid(True)
ax.legend()
In the above figure, you can see how the derivative is a constant for abs(a)>delta
In TensorFlow 2 and Keras, Huber loss can be added to the compile step of your model.
model.compile(loss=tensorflow.keras.losses.Huber(delta=1.5), optimizer='adam', metrics=['mean_absolute_error'])
When to use Huber Loss?
As we already know Huber loss has both MAE and MSE. So when we think higher weightage should not be given to outliers, then set your loss function as Huber loss. We need to manually define is the (𝛿) delta value. Generally, some iterations are needed with the respective algorithm used to find the correct delta value.
Cross-Entropy Loss(a.k.a Log loss)
The concept of cross-entropy traces back into the field of Information Theory where Claude Shannon introduced the concept of entropy in 1948. Before diving into the Cross-Entropy loss function, let us talk about Entropy.
Entropy has roots in physics — it is a measure of disorder, or unpredictability, in a system.
For instance, consider below figure two gases in a box: initially, the system has low entropy, in that the two gasses are completely separable(skewed distribution); after some time, however, the gases blend(distribution where events have equal probability) so the system’s entropy increases. It is said that in an isolated system, the entropy never decreases — the chaos never dims down without external influence.
Entropy
For p(x) — probability distribution and a random variable X, entropy is defined as follows:
Reason for the Negative sign: log(p(x))<0 for all p(x) in (0,1) . p(x) is a probability distribution and therefore the values must range between 0 and 1.
A plot of log(x). For x values between 0 and 1, log(x) <0 (is negative).
Cross-Entropy loss is also called logarithmic loss, log loss, or logistic loss. Each predicted class probability is compared to the actual class desired output 0 or 1 and a score/loss is calculated that penalizes the probability based on how far it is from the actual expected value. The penalty is logarithmic in nature yielding a large score for large differences close to 1 and small score for small differences tending to 0.
Cross-Entropy is expressed by the equation;
Where x represents the predicted results by ML algorithm, p(x) is the probability distribution of “true” label from training samples and q(x) depicts the estimation of the ML algorithm.
Cross-entropy loss measures the performance of a classification model whose output is a probability value between 0 and 1. Cross-entropy loss increases as the predicted probability diverge from the actual label. So predicting a probability of .012 when the actual observation label is 1 would be bad and result in a high loss value. A perfect model would have a log loss of 0.
The graph above shows the range of possible loss values given a true observation. As the predicted probability approaches 1, log loss slowly decreases. As the predicted probability decreases, however, the log loss increases rapidly. Log loss penalizes both types of errors, but especially those predictions that are confident and wrong!
The cross-entropy method is a Monte Carlo technique for significance optimization and sampling.
Binary Cross-Entropy
Binary cross-entropy is a loss function that is used in binary classification tasks. These are tasks that answer a question with only two choices (yes or no, A or B, 0 or 1, left or right).
In binary classification, where the number of classes M equals 2, cross-entropy can be calculated as:
Sigmoid is the only activation function compatible with the binary cross-entropy loss function. You must use it on the last block before the target block.
The binary cross-entropy needs to compute the logarithms of Ŷi and (1-Ŷi), which only exist if Ŷi is between 0 and 1. The softmax activation function is the only one to guarantee that the output is within this range.
Categorical Cross-Entropy
Categorical cross-entropy is a loss function that is used in multi-class classification tasks. These are tasks where an example can only belong to one out of many possible categories, and the model must decide which one.
Formally, it is designed to quantify the difference between two probability distributions.
If 𝑀>2 (i.e. multiclass classification), we calculate a separate loss for each class label per observation and sum the result.
-
M — number of classes (dog, cat, fish)
-
log — the natural log
-
y — binary indicator (0 or 1) if class label c is the correct classification for observation o
-
p — predicted probability observation o is of class 𝑐
Softmax is the only activation function recommended to use with the categorical cross-entropy loss function.
Strictly speaking, the output of the model only needs to be positive so that the logarithm of every output value Ŷi exists. However, the main appeal of this loss function is for comparing two probability distributions. The softmax activation rescales the model output so that it has the right properties.
Sparse Categorical Cross-Entropy
sparse categorical cross-entropy has the same loss function as, categorical cross-entropy which we have mentioned above. The only difference is the format in which we mention 𝑌𝑖(i,e true labels).
If your Yi’s are one-hot encoded, use categorical_crossentropy. Examples for a 3-class classification: [1,0,0] , [0,1,0], [0,0,1]
But if your Yi’s are integers, use sparse_categorical_crossentropy. Examples for above 3-class classification problem: [1] , [2], [3]
The usage entirely depends on how you load your dataset. One advantage of using sparse categorical cross-entropy is it saves time in memory as well as computation because it simply uses a single integer for a class, rather than a whole vector.
Calculate Cross-Entropy Between Class Labels and Probabilities
The use of cross-entropy for classification often gives different specific names based on the number of classes.
Consider a two-class classification task with the following 10 actual class labels (P) and predicted class labels (Q).
# calculate cross entropy for classification problem
from math import log
from numpy import mean
# calculate cross entropy
def cross_entropy_funct(p, q):
return -sum([p[i]*log(q[i]) for i in range(len(p))])
# define classification data p and q
p = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
q = [0.7, 0.9, 0.8, 0.8, 0.6, 0.2, 0.1, 0.4, 0.1, 0.3]
# calculate cross entropy for each example
results = list()
for i in range(len(p)):
# create the distribution for each event {0, 1}
expected = [1.0 - p[i], p[i]]
predicted = [1.0 - q[i], q[i]]
# calculate cross entropy for the two events
cross = cross_entropy_funct(expected, predicted)
print('>[y=%.1f, yhat=%.1f] cross entropy: %.3f' % (p[i], q[i], cross))
results.append(cross)
# calculate the average cross entropy
mean_cross_entropy = mean(results)
print('nAverage Cross Entropy: %.3f' % mean_cross_entropy)
Running the example prints the actual and predicted probabilities for each example. The final average cross-entropy loss across all examples is reported, in this case, as 0.272
>[y=1.0, yhat=0.7] cross entropy: 0.357
>[y=1.0, yhat=0.9] cross entropy: 0.105
>[y=1.0, yhat=0.8] cross entropy: 0.223
>[y=1.0, yhat=0.8] cross entropy: 0.223
>[y=1.0, yhat=0.6] cross entropy: 0.511
>[y=0.0, yhat=0.2] cross entropy: 0.223
>[y=0.0, yhat=0.1] cross entropy: 0.105
>[y=0.0, yhat=0.4] cross entropy: 0.511
>[y=0.0, yhat=0.1] cross entropy: 0.105
>[y=0.0, yhat=0.3] cross entropy: 0.357
Average Cross Entropy: 0.272
Relative Entropy(Kullback–Leibler divergence)
The Relative entropy (also called Kullback–Leibler divergence), is a method for measuring the similarity between two probability distributions. It was refined by Solomon Kullback and Richard Leibler for public release in 1951(paper), KL-Divergence aims to identify the divergence(separation or bifurcation) of a probability distribution given a baseline distribution. That is, for a target distribution, P, we compare a competing distribution, Q, by computing the expected value of the log-odds of the two distributions:
For distributions P and Q of a continuous random variable, the Kullback-Leibler divergence is computed as an integral:
If P and Q represent the probability distribution of a discrete random variable, the Kullback-Leibler divergence is calculated as a summation:
Also, with a little bit of work, we can show that the KL-Divergence is non-negative. It means, that the smallest possible value is zero (distributions are equal) and the maximum value is infinity. We procure infinity when P is defined in a region where Q can never exist. Therefore, it is a common assumption that both distributions exist on the same support.
The closer two distributions get to each other, the lower the loss becomes. In the following graph, the blue distribution is trying to model the green distribution. As the blue distribution comes closer and closer to the green one, the KL divergence loss will get closer to zero.
Lower the KL divergence value, the better we have matched the true distribution with our approximation.
Comparison of Blue and green distribution
The applications of KL-Divergence:
-
Primarily, it is used in Variational Autoencoders. These autoencoders learn to encode samples into a latent probability distribution and from this latent distribution, a sample can be drawn that can be fed to a decoder which outputs e.g. an image.
-
KL divergence can also be used in multiclass classification scenarios. These problems, which traditionally use the Softmax function and use one-hot encoded target data, are naturally suitable to KL divergence since Softmax “normalizes data into a probability distribution consisting of K probabilities proportional to the exponentials of the input numbers”
-
Delineating the relative (Shannon) entropy in information systems,
-
Randomness in continuous time-series.
Calculate KL-Divergence using Python
Consider a random variable with six events as different colors. We may have two different probability distributions for this variable; for example:
import numpy as np
import matplotlib.pyplot as plt
events = ['red', 'green', 'blue', 'black', 'yellow', 'orange']
p = [0.10, 0.30, 0.05, 0.90, 0.65, 0.21]
q = [0.70, 0.55, 0.15, 0.04, 0.25, 0.45]
Plot a histogram for each probability distribution, allowing the probabilities for each event to be directly compared.
# plot first distribution
plt.figure(figsize=(9, 5))
plt.subplot(2,1,1)
plt.bar(events, p, color ='green',align='center')
# plot second distribution
plt.subplot(2,1,2)
plt.bar(events, q,color ='green',align='center')
# show the plot
plt.show()
We can see that indeed the distributions are different.
Next, we can develop a function to calculate the KL divergence between the two distributions.
def kl_divergence(p, q):
return sum(p[i] * np.log(p[i]/q[i]) for i in range(len(p)))
# calculate (P || Q)
kl_pq = kl_divergence(p, q)
print('KL(P || Q): %.3f bits' % kl_pq)
# calculate (Q || P)
kl_qp = kl_divergence(q, p)
print('KL(Q || P): %.3f bits' % kl_qp)
KL(P || Q): 2.832 bits
KL(Q || P): 1.840 bits
Nevertheless, we can calculate the KL divergence using the rel_entr() SciPy function and confirm that our manual calculation is correct.
The rel_entr() function takes lists of probabilities across all events from each probability distribution as arguments and returns a list of divergences for each event. These can be summed to give the KL divergence.
from scipy.special import rel_entr
print("Using Scipy rel_entr function")
bo_1 = np.array(p)
bo_2 = np.array(q)
print('KL(P || Q): %.3f bits' % sum(rel_entr(bo_1,bo_2)))
print('KL(Q || P): %.3f bits' % sum(rel_entr(bo_2,bo_1)))
Using Scipy rel_entr function
KL(P || Q): 2.832 bits
KL(Q || P): 1.840 bits
Let us see how KL divergence can be used with Keras. It’s pretty simple, It just involves specifying it as the used loss function during the model compilation step:
# Compile the model model.compile(loss=keras.losses.kullback_leibler_divergence, optimizer=keras.optimizers.Adam(), metrics=[‘accuracy’])
Squared Hinge
The squared hinge loss is a loss function used for “maximum margin” binary classification problems. Mathematically it is defined as:
where ŷ is the predicted value and y is either 1 or -1.
Thus, the squared hinge loss → 0, when the true and predicted labels are the same and when ŷ≥ 1 (which is an indication that the classifier is sure that it’s the correct label).
The squared hinge loss → quadratically increasing with the error, when when the true and predicted labels are not the same or when ŷ< 1, even when the true and predicted labels are the same (which is an indication that the classifier is not sure that it’s the correct label).
As compared to traditional hinge loss(used in SVM) larger errors are punished more significantly, whereas smaller errors are punished slightly lighter.
Comparison between Hinge and Squared hinge loss
When to use Squared Hinge?
Use the Squared Hinge loss function on problems involving yes/no (binary) decisions. Especially, when you’re not interested in knowing how certain the classifier is about the classification. Namely, when you don’t care about the classification probabilities. Use in combination with the tanh() the activation function in the last layer of the neural network.
A typical application can be classifying email into ‘spam’ and ‘not spam’ and you’re only interested in the classification accuracy.
Let us see how Squared Hinge can be used with Keras. It’s pretty simple, It just involves specifying it as the used loss function during the model compilation step:
#Compile the model
model.compile(loss=squared_hinge, optimizer=tensorflow.keras.optimizers.Adam(lr=0.03), metrics=['accuracy'])
Feel free to connect me on LinkedIn for any query.
Thank you for reading this article, I hope you have found it useful.
References
https://www.machinecurve.com/index.php/2019/10/12/using-huber-loss-in-keras/
https://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html#:~:text=Cross%2Dentropy%20loss%2C%20or%20log,So%20predicting%20a%20probability%20of%20.
https://towardsdatascience.com/cross-entropy-loss-function-f38c4ec8643e
https://towardsdatascience.com/understanding-the-3-most-common-loss-functions-for-machine-learning-regression-23e0ef3e14d3
https://gobiviswa.medium.com/huber-error-loss-functions-3f2ac015cd45
https://www.datatechnotes.com/2019/10/accuracy-check-in-python-mae-mse-rmse-r.html
https://peltarion.com/knowledge-center/documentation/modeling-view/build-an-ai-model/loss-functions/
Перевод
Ссылка на автора
Функция потерь в машинном обучении является мерой того, насколько точно ваша модель ML способна предсказать ожидаемый результат, т.е. основную правду.
Функция потерь примет в качестве входных данных два элемента: выходное значение нашей модели и ожидаемое истинное значение. Выход функции потерь называетсяпотерячто является показателем того, насколько хорошо наша модель справилась с прогнозированием результата.
Высокое значение потери означает, что наша модель работает очень плохо. Низкое значение потерь означает, что наша модель работает очень хорошо.
Выбор правильной функции потерь имеет решающее значение для обучения точной модели. Определенные функции потерь будут иметь определенные свойства и помогут вашей модели учиться особым образом. Некоторые могут придавать большее значение выбросам, другие — большинству.
В этой статье мы рассмотрим 3 наиболее распространенных функции потерь для регрессии машинного обучения. Я объясню, как они работают, их плюсы и минусы, и как их можно наиболее эффективно применять при обучении регрессионным моделям.
(1) Средняя квадратическая ошибка (MSE)
Среднеквадратичная ошибка (MSE), возможно, является самой простой и наиболее распространенной функцией потерь, которую часто преподают на вводных курсах машинного обучения. Чтобы рассчитать MSE, вы берете разницу между предсказаниями вашей модели и основополагающей правдой, возводите ее в квадрат и усредняете ее по всему набору данных.
MSE никогда не будет отрицательным, так как мы всегда возводим в квадрат ошибки. MSE формально определяется следующим уравнением:
гдеNколичество образцов, с которыми мы тестируем Код достаточно прост, мы можем написать его в виде простого кода и построить его с помощью matplotlib:
Преимущество:MSE отлично подходит для гарантии того, что наша обученная модель не имеет прогнозируемых выбросов с огромными ошибками, поскольку MSE придает большее значение этим ошибкам из-за квадратной части функции.
Недостаток:Если наша модель делает одно очень плохое предсказание, то квадратичная часть функции увеличивает ошибку. Тем не менее, во многих практических случаях мы не очень заботимся об этих выбросах и стремимся к более всесторонней модели, которая достаточно хороша для большинства.
(2) Средняя абсолютная ошибка (MAE)
Средняя абсолютная ошибка (MAE) лишь немного отличается по определению от MSE, но, что интересно, обеспечивает почти совершенно противоположные свойства! Чтобы рассчитать MAE, вы берете разницу между предсказаниями вашей модели и основополагающей правдой, применяете абсолютное значение к этой разнице и затем усредняете его по всему набору данных.
MAE, как и MSE, никогда не будет отрицательным, так как в этом случае мы всегда принимаем абсолютное значение ошибок. MAE формально определяется следующим уравнением:
Еще раз наш код очень прост в Python! Мы можем написать его в виде простого numpy и построить его с помощью matplotlib. На этот раз мы нарисуем его красным прямо над MSE, чтобы увидеть, как они сравниваются.
Преимущество:Прелесть MAE заключается в том, что его преимущество напрямую покрывает недостаток MSE. Поскольку мы берем абсолютное значение, все ошибки будут взвешены в одной линейной шкале. Таким образом, в отличие от MSE, мы не будем придавать слишком большой вес нашим выбросам, а наша функция потерь обеспечивает общую и даже меру того, насколько хорошо работает наша модель.
Недостаток:Если мы действительно заботимся о прогнозируемых отклонениях нашей модели, то MAE не будет столь же эффективным. Большие ошибки, возникающие из-за выбросов, в конечном итоге взвешиваются точно так же, как и более низкие ошибки. Это может привести к тому, что наша модель в большинстве случаев будет отличной, но время от времени делает несколько очень плохих прогнозов.
(3) Huber Loss
Теперь мы знаем, что MSE отлично подходит для изучения выбросов, а MAE — для их игнорирования. Но как насчет чего-то посередине?
Рассмотрим пример, где у нас есть набор данных из 100 значений, которые мы хотели бы, чтобы наша модель была обучена прогнозировать. Из всех этих данных 25% ожидаемых значений равны 5, а остальные 75% — 10.
Потеря MSE не вполне сработает, поскольку у нас действительно нет «выбросов»; 25% отнюдь не маленькая доля. С другой стороны, мы не обязательно хотим весить эти 25% слишком низко с MAE. Эти значения 5 не близки к медиане (10 — 75% точек имеют значение 10), но на самом деле они также не являются выбросами.
Наше решение?
Функция Huber Loss,
Huber Loss предлагает лучшее из обоих миров, уравновешивая MSE и MAE. Мы можем определить его, используя следующую кусочную функцию:
По сути, это уравнение гласит: для значений потерь меньше дельты используйте MSE; для значений потерь больше, чем дельта, используйте MAE Это эффективно объединяет лучшее из обоих миров из двух функций потерь!
Использование MAE для больших значений потерь уменьшает вес, который мы придаем выбросам, так что мы по-прежнему получаем хорошо округленную модель. В то же время мы используем MSE для меньших значений потерь, чтобы поддерживать квадратичную функцию вблизи центра.
Это приводит к увеличению значений потерь до тех пор, пока они превышают 1. Как только потеря для этих точек данных падает ниже 1, квадратичная функция понижает их вес, чтобы сосредоточить обучение на точках данных с более высокой ошибкой.
Проверьте код ниже для функции Huber Loss. Мы также строим график потери Хьюбера рядом с MSE и MAE, чтобы сравнить разницу.
Обратите внимание на то, как мы можем получить потери Хубера прямо между MSE и MAE.
Лучшее обоих миров!
Вы захотите использовать потерю Хубера в любое время, когда почувствуете, что вам необходим баланс между приданием выбросам некоторого веса, но не слишком большого. Для случаев, когда выбросы очень важны для вас, используйте MSE! В тех случаях, когда вы не заботитесь о выбросах, используйте MAE!
Нравится учиться?
Следуй за мной по щебет где я публикую все о новейших и лучших ИИ, технологиях и науке! Связаться со мной на LinkedIn слишком!
Рекомендуемое чтение
Хотите узнать больше о машинном обучении? Практическое машинное обучение Книга — лучший ресурс для изучения того, как сделатьреальныйМашинное обучение с Python!
И просто напоследок, я поддерживаю этот блог с помощью партнерских ссылок Amazon на замечательные книги, потому что обмен отличными книгами помогает всем! Как партнер Amazon я зарабатываю на соответствующих покупках.
Tech Blog
Regression loss functions for machine learning
Overview
Loss functions take as input a set of predictions and actual values, and return a metric of the prediction error. This “prediction error metric” guides the machine learning model during training. Model training often consists of a model tuning its inner workings to minimize the output of the loss function for the training data set.
Three of the most useful loss functions for regression problems are described below: mean squared error, mean absolute error and Huber loss. Recommendations about when to apply each of them are also included.
This post focuses on regression problems (i.e. the output variable takes continuous values). If you would like us to cover classification problems (i.e. the output variable takes class labels) on another blog entrance, just email us.
Mean squared error
The most common loss function for regression problems is the mean squared error (MSE). The MSE is calculated by the sum of the squared distance between the target variable (yi) and its predicted value (yip):
The mean squared error function is widely used as it is simple, continuous and differentiable.
A key MSE characteristic is its disproportional sensitivity to large errors compared to small ones. A model trained with MSE will give the same importance to a single error of 5 units compared to 25 errors of 1 unit. In other words, the model will be biased to reduce the largest errors, even if that penalizes the predictions of many common conditions.
The above MSE characteristic is especially important when dealing with outliers that the model fails to predict. These could be outliers caused by corrupted data or random unpredictable processes. A small number of outliers very distant from other observations can impair the model’s predictive ability. Figure 1 shows the prediction of a linear model trained with a mean squared error loss function. The training data is formed by 15 instances, one of them an outlier (see upper right corner in Figure 1). Even though most of the training data set can be well represented by a linear model, the outlier distorts the model prediction when MSE is applied.
Mean absolute error
The mean absolute error (MAE) is the sum of the absolute differences between the actual and the predicted values:
The MAE method has the advantage of not being overly affected by outliers in the training data. Using the previously discussed example, a model trained with the MSA approach will give equal importance to 1 error of 5 units and 5 errors of 1 unit.
The main issue with the MAE is that it is not differentiable at its minimum, see Figure 2. This lack of differentiability can produce convergence issues when training machine learning models.
Huber loss
The Huber loss approach combines the advantages of the mean squared error and the mean absolute error. It is a piecewise-defined function:
where δ is a hyperparameter that controls the split between the two sub-function intervals.
The sub-function for large errors, such as outliers, is the absolute error function. Hence, it avoids the excessive sensitivity to large errors that characterizes MSE. The sub-function for small errors is the squared error making the whole function continuous and differentiable, which overcomes MAE’s convergence issues.
Figure 2 shows the value of squared error, absolute error and the Huber loss as a function of the prediction error. The Huber loss can be seen to be proportional to the absolute value, except for small errors, where it is proportional to the square of the error.
No size fits all in machine learning, and Huber loss also has its drawbacks. Its main disadvantage is the associated complexity. In order to maximize model accuracy, the hyperparameter δ will also need to be optimized which increases the training requirements.
Selecting a loss function
To sum up, we recommend MSE as a default option. It works sufficiently well for the majority of machine learning problems, it is simple, mathematically robust and well supported by most machine learning libraries.
If the training data has outliers that the model fails to predict, the model accuracy with MSE may be affected. The following three options arises:
-
The most accurate approach is to apply the Huber loss function and tune its hyperparameter δ. The hyperparameter should be tuned iteratively by testing different values of δ.
-
The fastest approach is to use MAE. This should be done carefully, however, as convergence issues may appear.
-
If the outliers are not critical to the dataset, MSE can also be applied after removing the outliers from the training data.
What next
In a separate post, we will discuss the extremely powerful quantile regression loss function that allows predictions of confidence intervals, instead of just values.
If you have any questions or there any machine learning topic that you would like us to cover, just email us.