Hard-Coding a Network. [2pts] In this problem, you need to find a set of weights and biases for a multilayer perceptron which determines if a list of length 4 is in sorted order. More specifically, you receive four inputs $x_1, \ldots, x_4$, where $x_i \in \mathbb{R}$, and the network must output 1 if $x_1 < x_2 < x_3 < x_4$, and 0 otherwise. You will use the following architecture:
All of the hidden units and the output unit use a hard threshold activation function:
\[
\phi(z) = \begin{cases} 1 & \text{if } z \ge 0 \\ 0 & \text{if } z < 0 \end{cases}
\]
Please give a set of weights and biases for the network which correctly implements this function (including cases where some of the inputs are equal). Your answer should include:
- A $3 \times 4$ weight matrix $W^{(1)}$ for the hidden layer
- A 3-dimensional vector of biases $b^{(1)}$ for the hidden layer
- A 3-dimensional weight vector $w^{(2)}$ for the output layer
- A scalar bias $b^{(2)}$ for the output layer
You do not need to show your work.
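Although no code is asked for, a candidate answer is easy to sanity-check numerically. The sketch below (a Python/NumPy harness with zero placeholder parameters, which you would replace by your own $W^{(1)}$, $b^{(1)}$, $w^{(2)}$, $b^{(2)}$) runs the 4-input, 3-hidden-unit, 1-output hard-threshold network on random inputs, plus a couple of inputs with ties, and counts disagreements with the ground-truth sortedness test.

```python
import numpy as np

def hard_threshold(z):
    # phi(z) = 1 if z >= 0, else 0, applied elementwise.
    return (z >= 0).astype(float)

def network(x, W1, b1, w2, b2):
    # Forward pass of the 4-3-1 architecture with hard-threshold units.
    h = hard_threshold(W1 @ x + b1)          # hidden layer (3 units)
    return 1.0 if w2 @ h + b2 >= 0 else 0.0  # output unit

def is_sorted(x):
    # Ground truth: strictly increasing (ties count as "not sorted").
    return float(x[0] < x[1] < x[2] < x[3])

# Placeholder parameters -- substitute your hand-designed values here.
W1, b1 = np.zeros((3, 4)), np.zeros(3)
w2, b2 = np.zeros(3), 0.0

rng = np.random.default_rng(0)
tests = [rng.normal(size=4) for _ in range(10000)]
tests += [np.array([1.0, 1.0, 2.0, 3.0]), np.array([1.0, 2.0, 2.0, 3.0])]  # ties
mismatches = sum(network(x, W1, b1, w2, b2) != is_sorted(x) for x in tests)
print(f"mismatches: {mismatches} / {len(tests)}")
```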
Backprop. Consider a neural network with N input units, N output units, and K hidden units. The activations are computed as follows:
\begin{align*}
z &= W^{(1)} x + b^{(1)} \\
h &= \sigma(z) \\
y &= x + W^{(2)} h + b^{(2)},
\end{align*}
where $\sigma$ denotes the logistic function, applied elementwise. The cost will involve both $h$ and $y$:
\begin{align*}
J &= R + S \\
R &= r^\top h \\
S &= \tfrac{1}{2}\,\|y - s\|^2
\end{align*}
for given vectors $r$ and $s$.
[1pt] Draw the computation graph relating $x$, $z$, $h$, $y$, $R$, $S$, and $J$.
[3pts] Derive the backprop equations for computing $\bar{x} = \partial J / \partial x$. You may use $\sigma'$ to denote the derivative of the logistic function (so you don't need to write it out explicitly).
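Since the derivation is done by hand, a quick numerical check can catch mistakes. Below is an optional finite-difference harness (illustrative only; the dimensions $N = 4$, $K = 3$ and all values are arbitrary) that estimates $\bar{x} = \partial J / \partial x$ for the cost defined above, against which a hand-derived expression can be compared.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2, r, s):
    # z = W1 x + b1, h = sigma(z), y = x + W2 h + b2, J = r^T h + 0.5 * ||y - s||^2
    z = W1 @ x + b1
    h = sigmoid(z)
    y = x + W2 @ h + b2
    return r @ h + 0.5 * np.sum((y - s) ** 2)

def numerical_xbar(x, *params, eps=1e-6):
    # Central-difference estimate of dJ/dx, one coordinate at a time.
    grad = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        grad[i] = (forward(x + e, *params) - forward(x - e, *params)) / (2 * eps)
    return grad

# Arbitrary small instance: N = 4 input/output units, K = 3 hidden units.
rng = np.random.default_rng(0)
N, K = 4, 3
x = rng.normal(size=N)
W1, b1 = rng.normal(size=(K, N)), rng.normal(size=K)
W2, b2 = rng.normal(size=(N, K)), rng.normal(size=N)
r, s = rng.normal(size=K), rng.normal(size=N)

print("numerical xbar:", numerical_xbar(x, W1, b1, W2, b2, r, s))
# Compare against the xbar produced by your derived backprop equations.
```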
Sparsifying Activation Function. [4pts] One of the interesting features of the ReLU activation function is that it sparsifies the activations and the derivatives, i.e. sets a large fraction of the values to zero for any given input vector. Consider the following network:
Note that each $w_i$ refers to the weight on a single connection, not the whole layer. Suppose we are trying to minimize a loss function $\mathcal{L}$ which depends only on the activation of the output unit $y$. (For instance, $\mathcal{L}$ could be the squared error loss $\frac{1}{2}(y - t)^2$.) Suppose the unit $h_1$ receives an input of $-1$ on a particular training case, so the ReLU evaluates to 0. Based only on this information, which of the weight derivatives
\[
\frac{\partial \mathcal{L}}{\partial w_1}, \quad \frac{\partial \mathcal{L}}{\partial w_2}, \quad \frac{\partial \mathcal{L}}{\partial w_3}
\]
are guaranteed to be 0 for this training case? Write YES or NO for each. Justify your answers.
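Returning to the sparsification claim at the start of this problem: the short sketch below (an arbitrary one-hidden-layer ReLU network in NumPy, unrelated to the figure referenced above) illustrates that, for a random input, roughly half of the hidden activations are exactly zero, and the backpropagated derivatives $\bar{z}$ are zero at the same positions, since the ReLU's local derivative vanishes wherever its input is negative.

```python
import numpy as np

rng = np.random.default_rng(0)

# One ReLU hidden layer: h = relu(W x + b), y = v^T h. Sizes are arbitrary.
N, K = 100, 500
x = rng.normal(size=N)
W, b = rng.normal(size=(K, N)), rng.normal(size=K)
v = rng.normal(size=K)

z = W @ x + b
h = np.maximum(z, 0.0)   # ReLU activations: zero wherever z <= 0
y = v @ h

# Backward pass for L = 0.5 * (y - t)^2:
t = 0.0
ybar = y - t
hbar = ybar * v          # dL/dh
zbar = hbar * (z > 0)    # dL/dz: zeroed wherever the ReLU was inactive

print("fraction of zero activations:   ", np.mean(h == 0.0))
print("fraction of zero entries in dL/dz:", np.mean(zbar == 0.0))
# Roughly half the units are inactive, so both vectors are roughly half zeros.
```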