Loss Functions and Metrics

CNTK contains a number of common predefined loss functions. In addition, custom loss functions can be defined as BrainScript expressions.

CrossEntropy(), CrossEntropyWithSoftmax()

Computes the cross entropy

CrossEntropy (y, p)
CrossEntropyWithSoftmax (y, z)

Parameters

  • y: labels (one-hot), or more generally, reference distribution. Must sum up to 1.
  • p (for CrossEntropy()): posterior probability distribution to score against the reference. Must sum up to 1.
  • z (for CrossEntropyWithSoftmax()): input to a Softmax operation to compute the posterior probability distribution to score against the reference

Return value

This operation computes the cross-entropy between y and p, which is defined as:

ce = -sum_i y_i log p_i

with i iterating over all elements of y and p. For CrossEntropyWithSoftmax(), p is computed from the input parameter z as

p = Softmax (z)
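
For reference, Softmax here denotes the standard normalized exponential:

$$p_i = \mathrm{Softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$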

Description

These functions compute the cross-entropy of two probability distributions. This is the most common training criterion (loss function), where y is a one-hot-represented categorical label.

CrossEntropyWithSoftmax() is an optimization that takes advantage of the fact that for one-hot input, the full Softmax distribution is not needed. Instead of a normalized probability, it accepts as its input the argument to the Softmax operation, which is the same as a non-normalized version of log Softmax, also known as "logit". This is the recommended way in CNTK to compute the cross-entropy criterion.

The function's result is undefined if y does not sum up to 1, and likewise p in the case of CrossEntropy() (for CrossEntropyWithSoftmax(), the Softmax guarantees a normalized p). In particular, this function cannot be used for multi-label targets, where y contains a 1 in more than one position. For that case, consider using Sigmoid() instead of Softmax, together with a Logistic() loss, as sketched below.
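
A minimal sketch of that multi-label alternative, following the example pattern used later on this page (it is assumed here that the Logistic() primitive takes the labels first and the elementwise Sigmoid() probabilities second; the 9000-dimensional layout is illustrative only):

labels = Input {9000}          # multi-label target: may contain a 1 in several positions
...
z = W * h + b
p = Sigmoid (z)                # independent per-class probabilities
ce = Logistic (labels, p)      # per-class binary cross-entropy
criterionNodes = (ce)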

Alternative definition

CrossEntropyWithSoftmax() is currently a CNTK primitive which has limitations. A more flexible, recommended alternative is to define it manually as:

CrossEntropyWithSoftmax (y, z) = ReduceLogSum (z) - TransposeTimes (y, z)
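
This identity follows from substituting p = Softmax (z) into the cross-entropy definition above and using that y sums up to 1:

$$-\sum_i y_i \log \frac{e^{z_i}}{\sum_j e^{z_j}} \;=\; \log \sum_j e^{z_j} \;-\; \sum_i y_i z_i$$

which is exactly ReduceLogSum (z) - TransposeTimes (y, z).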

Sparse labels

To compute the cross-entropy with sparse labels (e.g. read using Input{..., sparse=true}), the alternative form above must be used.

Softmax over tensors with rank>1

To compute CrossEntropyWithSoftmax() over tensors of rank>1, e.g. where the task is to determine a location on a 2D grid, yet another alternative form must be used:

CrossEntropyWithSoftmax (y, z, axis=None) = ReduceLogSum (z, axis=axis) - ReduceSum (y .* z, axis=axis)

This form also allows the Softmax operation to be applied along a specific axis only. For example, if the inputs and labels have the shape [10 x 20] and the Softmax should be computed over each of the 20 columns independently, specify axis=1 (see the sketch below).
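
A sketch of that case, with illustrative shapes and names (it is assumed here that Input{} accepts a tensor shape such as (10:20), and that the per-column losses are summed along axis 2 to form a scalar criterion):

labels = Input {(10:20)}       # one 1 among the 10 rows of each of the 20 columns
...
# z = network output, also of shape [10 x 20]
cePerColumn = ReduceLogSum (z, axis=1) - ReduceSum (labels .* z, axis=1)   # shape [1 x 20]
ce = ReduceSum (cePerColumn, axis=2)                                       # scalar training criterion
criterionNodes = (ce)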

Example

labels = Input {9000}
...
z = W * h + b
ce = CrossEntropyWithSoftmax (labels, z)
criterionNodes = (ce)

The same with sparse labels:

labels = Input {9000, sparse=true}
...
z = W * h + b
ce = ReduceLogSum (z) - TransposeTimes (labels, z)
criterionNodes = (ce)

ErrorPrediction{}

Computes the error rate for prediction of categorical labels.

ErrorPrediction (y, z)

Parameters

  • y: categorical labels
  • z: vector of prediction scores, e.g. log probabilities

Return value

0 if the maximum value of z falls at a position where y has a 1 (a correct prediction), 1 otherwise (an error). Averaged over a minibatch, this yields the classification error rate.

Description

This function accepts a vector of posterior probabilities, logits, or other matching scores, where each element represents the matching score of a class or category. The function determines whether the highest-scoring class is equal to the class indicated by the label input y, by comparing the position of the maximum value of z with the position of the 1 in y, and returns 0 for a match and 1 for a mismatch.

Equivalently, the operation can be written out in BrainScript as:

ErrorPrediction_new (L, z, tag='') = Minus (BS.Constants.One, TransposeTimes (L, Hardmax (z)), tag=tag)
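
A typical usage sketch, mirroring the cross-entropy example above (it is assumed here that the metric is registered via the evaluationNodes list, in parallel with criterionNodes):

labels = Input {9000}
...
z = W * h + b
errs = ErrorPrediction (labels, z)
evaluationNodes = (errs)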