1 Theory (50pts)
1.1 Energy Based Models Intuition (15pts)
This question tests your intuitive understanding of Energy-based models and their properties.
(a) (1pt) How do energy-based models allow for modeling situations where the mapping from input $x_i$ to output $y_i$ is not one-to-one but one-to-many?
(b) (2pts) How do energy-based models differ from models that output probabilities?
(c) (2pts) How can you use the energy function $F_W(x, y)$ to calculate a probability $p(y \mid x)$?
(d) (2pts) What are the roles of the loss function and energy function?
(e) (2pts) Can the loss function be equal to the energy function?
(f) (2pts) Why may using only positive examples (pushing down the energy of correct inputs only) lead to a degenerate solution?
(g) (2pts) Briefly explain the three methods that can be used to shape the energy function.
(h) (2pts) Provide an example of a loss function that uses negative examples. Use the following format: $\ell_{\text{example}}(x, y, W) = F_W(x, y)$.
1.2 Negative log-likelihood loss (20pts)
Let’s consider an energy-based model that we are training to classify an input among $n$ classes. $F_W(x, y)$ is the energy of input $x$ paired with class $y$, where $y \in \{1, \dots, n\}$.
(a) (2pts) For a given input $x$, write down an expression for the Gibbs distribution over labels $y$ that this energy-based model specifies. Use $\beta$ for the constant multiplier.
(b) (5pts) Let’s say that for a particular data sample $x$ we have the label $y$. Give the expression for the negative log-likelihood loss, i.e. the negative log-likelihood of the correct label (don’t copy expressions from the slides; show a step-by-step derivation of the loss function from the expression in the previous subproblem). To simplify the calculations in the following subproblem, multiply the loss by $\frac{1}{\beta}$.
(c) (8pts) Now, derive the gradient of that expression with respect to $W$ (just providing the final expression is not enough). Why can it be intractable to compute, and how can we get around the intractability?
(d) (5pts) Explain why the negative log-likelihood loss pushes the energy of the correct example to negative infinity and the energies of all others to positive infinity, no matter how close two examples are, resulting in an energy surface with very sharp edges when $y$ is continuous (this is usually not an issue for discrete $y$ because there is no distance measure between different classes).
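To build intuition for (d), here is a toy illustration (a hedged sketch, not part of the required derivation): it assumes the standard Gibbs/NLL form with $\beta = 1$ and two classes whose energies start nearly equal, and shows that gradient descent on the NLL keeps pushing the two energies apart without bound. The starting values and learning rate are arbitrary.

```python
import torch

# Two free parameters standing in for F_W(x, y_correct) and F_W(x, y_other).
# Starting values and learning rate are illustrative assumptions.
F = torch.tensor([0.0, 0.1], requires_grad=True)  # nearly equal energies
opt = torch.optim.SGD([F], lr=0.1)

for step in range(1000):
    opt.zero_grad()
    # NLL of the correct label (index 0) under p(y | x) ∝ exp(-F_W(x, y)), beta = 1:
    # loss = F[0] + log sum_y exp(-F[y])
    loss = F[0] + torch.logsumexp(-F, dim=0)
    loss.backward()
    opt.step()

print(F.tolist())  # the gap F[1] - F[0] keeps growing (ever more slowly) without bound
```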
1.3 Comparing Contrastive Loss Functions (15pts)
In this problem, we’re going to compare a few contrastive loss functions. We are going to look at the behavior of their gradients and understand the uses of each loss function. In the following subproblems, $m \in \mathbb{R}$ is a margin, $x$ is the input, $y$ is the correct label, and $\bar{y}$ is an incorrect label; $[z]_+ = \max(0, z)$ denotes the positive part. Define each loss in the following format: $\ell_{\text{example}}(x, y, \bar{y}, W) = F_W(x, y)$.
(a) (3pts) The simple loss function is defined as follows:
$$\ell_{\text{simple}}(x, y, \bar{y}, W) = [F_W(x, y)]_+ + [m - F_W(x, \bar{y})]_+$$
Assuming we know the derivative $\frac{\partial F_W(x, y)}{\partial W}$ for any $x, y$, give an expression for the partial derivative of $\ell_{\text{simple}}$ with respect to $W$.
(b) (3pts) The hinge loss function is defined as follows:
$$\ell_{\text{hinge}}(x, y, \bar{y}, W) = [F_W(x, y) - F_W(x, \bar{y}) + m]_+$$
Assuming we know the derivative $\frac{\partial F_W(x, y)}{\partial W}$ for any $x, y$, give an expression for the partial derivative of $\ell_{\text{hinge}}$ with respect to $W$.
(c) (3pts) The square-square loss function is defined as follows:
$$\ell_{\text{square-square}}(x, y, \bar{y}, W) = \left([F_W(x, y)]_+\right)^2 + \left([m - F_W(x, \bar{y})]_+\right)^2$$
Assuming we know the derivative $\frac{\partial F_W(x, y)}{\partial W}$ for any $x, y$, give an expression for the partial derivative of $\ell_{\text{square-square}}$ with respect to $W$.
(d) (6pts) Comparison:
i. Explain how the NLL loss is different from the three losses above.
ii. What is the role of the margin in the hinge loss? Why do we take only the positive part of $F_W(x, y) - F_W(x, \bar{y}) + m$?
iii. How are the simple loss and the square-square loss different from the hinge loss? In what situations would you use the simple loss, and in what situations would you use the square-square loss?
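For reference while working through (d), the three losses above are easy to evaluate numerically. The following is a minimal PyTorch sketch (function names, toy energy values, and the margin value are illustrative, not part of the assignment); running autograd on the toy scalars exposes the subgradient behavior you are asked to reason about.

```python
import torch

def pos(z):
    # positive part: [z]_+ = max(0, z)
    return torch.clamp(z, min=0)

def l_simple(F_y, F_ybar, m):
    # [F_W(x, y)]_+ + [m - F_W(x, ybar)]_+
    return pos(F_y) + pos(m - F_ybar)

def l_hinge(F_y, F_ybar, m):
    # [F_W(x, y) - F_W(x, ybar) + m]_+
    return pos(F_y - F_ybar + m)

def l_square_square(F_y, F_ybar, m):
    # ([F_W(x, y)]_+)^2 + ([m - F_W(x, ybar)]_+)^2
    return pos(F_y) ** 2 + pos(m - F_ybar) ** 2

# Toy energies for one (x, y, ybar) triple; the values are arbitrary.
F_y = torch.tensor(0.5, requires_grad=True)     # energy of the correct label
F_ybar = torch.tensor(1.2, requires_grad=True)  # energy of an incorrect label

loss = l_hinge(F_y, F_ybar, m=1.0)
loss.backward()
# Inside the margin the hinge is active: gradient +1 on F_y, -1 on F_ybar.
print(loss.item(), F_y.grad.item(), F_ybar.grad.item())
```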
2 Implementation (50pts + 30pts)
Please make a copy of this notebook hw4_practice.ipynb and add your solutions. Please use your NYU account to access the notebook. The notebook contains parts marked as TODO, where you should put your code or explanations. The notebook is a Google Colab notebook: copy it to your drive, add your solutions, and then download it and submit it to Brightspace. You’re also free to run it on any other machine, as long as the version you send us can be run on Google Colab.
There are 3 parts in the notebook:
1. (50pts) Part 1 deals with training the energy-based model with your Viterbi implementation (a minimal sketch of Viterbi decoding follows this list).
2. (15pts, Extra Credit) Part 2 introduces the GTN framework, which is popular in Automatic Speech Recognition and Handwriting Recognition.
3. (15pts, Extra Credit) Part 3 is an open-ended part. Here, you will be experimenting with what you have coded on the handwritten data.
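As a starting point for the Viterbi implementation in Part 1, here is a minimal NumPy sketch of Viterbi decoding over a chain of scores. It is only an illustration of the dynamic program: the notebook’s actual interface (array shapes, variable names, and whether you maximize log-scores or minimize energies, which just flips the sign) may differ.

```python
import numpy as np

def viterbi(step_scores, transition_scores):
    """Highest-scoring state path under a chain-structured score.

    step_scores:       (T, S) array, per-step score of each of S states
    transition_scores: (S, S) array, score of moving from state i to state j
    Returns the best state sequence of length T (maximizing total score).
    """
    T, S = step_scores.shape
    best = step_scores[0].copy()            # best score of a path ending in each state
    backptr = np.zeros((T, S), dtype=int)   # argmax predecessor for each (t, state)
    for t in range(1, T):
        # cand[i, j] = score of extending the best path ending in i with transition i -> j
        cand = best[:, None] + transition_scores
        backptr[t] = cand.argmax(axis=0)
        best = cand.max(axis=0) + step_scores[t]
    # Trace the argmax pointers back from the best final state.
    path = [int(best.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]
```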