Homework 4 Solution

Starting from:

$35

LSTM Gradient [4pts] Here, you’ll derive the Backprop Through Time equations for the univariate version of the Long-Term Short-Term Memory (LSTM) architecture.

For reference, here are the computations it performs:

i(t) = (wixx(t) + wihh(t 1))

f(t) = (wfxx(t) + wfhh(t 1))

o(t) = (woxx(t) + wohh(t 1))

g(t) = tanh(wgxx(t) + wghh(t 1))

c(t) = f(t)c(t 1) + i(t)g(t)

h(t) = o(t) tanh(c(t))

(a) [3pts] Derive the Backprop Through Time equations for the activations and the gates:

h(t) =

c(t) =

g(t) =

o(t) =

f(t) =

i(t) =

You don’t need to vectorize anything or factor out any repeated subexpressions.

(b) [1pt] Derive the BPTT equation for the weight wix:

wix =

(The other weight matrices are basically the same, so we won’t make you write those out.)

[optional, no points] Based on your answers above, explain why the gradient doesn’t explode if the values of the forget gates are very close to 1 and the values of the input and output gates are very close to 0. (Your answer should involve both h(t) and c(t).)

Multidimensional RNN [3pts] One of the predecessors to the PixelRNN architecture was the multidimensional RNN (MDRNN). This is like the RNNs we discussed in lecture, except that instead of a 1-D sequence, we have a 2-D grid structure. Analogously to how ordinary RNNs have an input vector and a hidden vector for every time step, MDRNNs have an input vector and hidden vector for every grid square. Each hidden unit receives bottom-up connections from the corresponding input square, as well as recurrent connections from its north and west neighbors as follows:

The activations are computed as follows:

h(i;j) = Winx(i;j) + WWh(i 1;j) + WNh(i;j 1) :

For simplicity, we assume there are no bias parameters. Suppose the grid is G G, the input dimension is D, and the hidden dimension is H.

[1pt] How many weights does this architecture have? How many arithmetic operations are required to compute the hidden activations? (You only need to give big-O, not an exact count.)

[1pt] Suppose that in each step, you can compute as many matrix-vector multiplications as you like. How many steps are required to compute the hidden activations? Explain your answer.

[1pt] Give one advantage and one disadvantage of an MDRNN compared to a conv net.

Reversibility [3pts] In lecture, we discussed reversible generator architectures, which en-able e cient maximum likelihood training. In this question, we consider another (perhaps surprising) example of a reversible operation: gradient descent with momentum. Suppose the parameter vector (and hence also the velocity vector p) are both D-dimensional. Recall that the updates are as follows:

p(k+1)
p(k) rJ ( (k))
(k+1)
(k) + p(k+1)
If we denote s(k) = ( (k); p(k)), then we can think of the above equations as de ning a function

s(k+1) = f(s(k)).

[1pt] Show how to compute the inverse, s(k) = f 1(s(k+1)).

[2pts] Find the determinant of the Jacobian, i.e.

det @s(k+1)=@s(k):

Hint: rst write the Jacobian as a product of two matrices, one for each step of the above algorithm.

2

More products

Lab 9: Scheme Solution

$35

Buy now

Lab 7: Prolog Solution

$35

Buy now

Lab 6 Prolog Solution

$35

Buy now