Dropout. [5pts] For this question, you may wish to review the properties of expectation and variance: https://metacademy.org/graphs/concepts/expectation_and_variance

Dropout has an interesting interpretation in the case of linear regression. Recall that the predictions are made stochastically as:

\[ y = \sum_j m_j w_j x_j, \]

where the $m_j$'s are all i.i.d. (independent and identically distributed) Bernoulli random variables with expectation 1/2. (I.e., they are independent for every input dimension and every data point.) We would like to minimize the cost
\[ J = \frac{1}{2N} \sum_{i=1}^N \mathbb{E}\!\left[ \big( y^{(i)} - t^{(i)} \big)^2 \right], \tag{1} \]

where the expectation is with respect to the $m_j^{(i)}$'s.
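As a sanity check on this setup (not part of the assignment), the stochastic predictions and the expected cost can be simulated directly. The sketch below assumes NumPy and uses arbitrary made-up values for $x$, $w$, and $t$; it estimates $\mathbb{E}[y]$ by Monte Carlo and compares it to $\frac{1}{2}\sum_j w_j x_j$, which follows from linearity of expectation.

```python
import numpy as np

# Minimal sketch: sample the Bernoulli masks m_j ~ Bernoulli(1/2) and
# estimate the stochastic prediction's mean and the expected squared error
# by Monte Carlo. All values here are arbitrary placeholders.
rng = np.random.default_rng(0)
D = 4
x = rng.normal(size=D)   # one input vector
w = rng.normal(size=D)   # weight vector
t = 1.0                  # target for this data point

S = 100_000                            # number of mask samples
m = rng.integers(0, 2, size=(S, D))    # i.i.d. Bernoulli(1/2) masks
y = m @ (w * x)                        # one stochastic prediction per sample

print("E[y] (Monte Carlo):        ", y.mean())
print("E[y] (0.5 * sum of w_j x_j):", 0.5 * (w * x).sum())
print("E[(y - t)^2] (Monte Carlo): ", ((y - t) ** 2).mean())
```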
Now we show that this is equivalent to a regularized linear regression problem:
(a) [2pts] Find expressions for $\mathbb{E}[y]$ and $\mathrm{Var}[y]$ for a given $x$ and $w$.
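The following standard facts about Bernoulli moments may be useful here (background, not part of the problem statement): if $m \sim \mathrm{Bernoulli}(1/2)$, then since $m^2 = m$,

\[ \mathbb{E}[m] = \tfrac{1}{2}, \qquad \mathbb{E}[m^2] = \tfrac{1}{2}, \qquad \mathrm{Var}[m] = \mathbb{E}[m^2] - \mathbb{E}[m]^2 = \tfrac{1}{4}, \]

and because the $m_j$'s are independent, the variance of the sum $\sum_j m_j w_j x_j$ is the sum of the variances of the individual terms.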
(b) [1pt] Determine $\tilde{w}_j$ as a function of $w_j$ such that

\[ \mathbb{E}[y] = \tilde{y} = \sum_j \tilde{w}_j x_j. \]

Here, $\tilde{y}$ can be thought of as the (deterministic) predictions made by a different model.
(c) [2pts] Using the model from the previous part, show that the cost $J$ (Eqn. 1) can be written as

\[ J = \frac{1}{2N} \sum_{i=1}^N \big( \tilde{y}^{(i)} - t^{(i)} \big)^2 + R(\tilde{w}_1, \ldots, \tilde{w}_D), \]

where $R$ is a function of the $\tilde{w}_j$'s which does not involve an expectation. I.e., give an expression for $R$. (Note that $R$ will depend on the data, so we call it a "data-dependent regularizer.")
Hint: write the cost in terms of the mean and variance formulas from part (a). For inspiration, you may wish to refer to the derivation of the bias/variance decomposition from the Lecture 12 course notes.
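The decomposition the hint refers to rests on a standard identity (stated here for convenience; the course notes may use different notation): for a random variable $y$ and a constant $t$,

\[ \mathbb{E}\big[ (y - t)^2 \big] = \big( \mathbb{E}[y] - t \big)^2 + \mathrm{Var}[y], \]

which follows by adding and subtracting $\mathbb{E}[y]$ inside the square and noting that the cross term vanishes.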
Binary Addition. [5pts] In this problem, you will implement a recurrent neural network which performs binary addition. The inputs are given as binary sequences, starting with the least significant binary digit. (It is easier to start from the least significant bit, just as you did when adding in grade school.) The sequences will be padded with at least one zero on the end. For instance, the problem
100111 + 110010 = 1011001
would be represented as:
Input 1: 1, 1, 1, 0, 0, 1, 0
Input 2: 0, 1, 0, 0, 1, 1, 0
Correct output: 1, 0, 0, 1, 1, 0, 1
There are two input units corresponding to the two inputs, and one output unit. Therefore, the pattern of inputs and outputs for this example would be:

Time step:        1     2     3     4     5     6     7
Inputs (x1, x2): (1,0) (1,1) (1,0) (0,0) (0,1) (1,1) (0,0)
Target:           1     0     0     1     1     0     1
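For concreteness, a possible NumPy encoding of this example (the variable and function names are my own, not mandated by the assignment):

```python
import numpy as np

# Each time step supplies a 2-vector (input 1, input 2) and a scalar target;
# sequences start from the least significant bit.
x = np.array([[1, 0],
              [1, 1],
              [1, 0],
              [0, 0],
              [0, 1],
              [1, 1],
              [0, 0]])
target = np.array([1, 0, 0, 1, 1, 0, 1])

def bits_to_int(bits):
    # bits[0] is the least significant bit
    return sum(int(b) << i for i, b in enumerate(bits))

# Sanity check: the two addends really do sum to the target.
assert bits_to_int(x[:, 0]) + bits_to_int(x[:, 1]) == bits_to_int(target)
```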
Design the weights and biases for an RNN which has two input units, three hidden units, and one output unit, and which implements binary addition. All of the units use the hard threshold activation function. In particular, specify weight matrices U, V, and W, bias vector $b_h$, and scalar bias $b_y$ for the standard recurrent architecture

\[ h^{(t)} = \phi\big( U x^{(t)} + W h^{(t-1)} + b_h \big), \qquad y^{(t)} = \phi\big( V h^{(t)} + b_y \big), \]

where $\phi$ denotes the hard threshold activation, U maps the inputs to the hidden units, W maps the previous hidden state to the hidden units, and V maps the hidden units to the output.
Hint: In the grade-school algorithm, you add up the values in each column, including the carry. Have one of your hidden units activate if the sum is at least 1, the second if it is at least 2, and the third if it is at least 3.
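To test a candidate solution before submitting, a small simulator along the following lines may help. It is a sketch under the architecture written above (U: 3x2, W: 3x3, V: 1x3); the function names are my own, and the zero weights are placeholders for your answer, not a solution.

```python
import numpy as np

def hard_threshold(z):
    # Outputs 1 where z > 0, else 0 (adjust the convention to match your design).
    return (z > 0).astype(int)

def run_rnn(U, W, V, b_h, b_y, x_seq):
    """Run the hard-threshold RNN over a sequence of 2-bit inputs.

    U: (3, 2) input-to-hidden, W: (3, 3) hidden-to-hidden,
    V: (1, 3) hidden-to-output, b_h: (3,), b_y: scalar.
    """
    h = np.zeros(3, dtype=int)  # hidden state starts at zero
    outputs = []
    for x in x_seq:
        h = hard_threshold(U @ x + W @ h + b_h)
        outputs.append(int(hard_threshold(V @ h + b_y)[0]))
    return outputs

# Placeholder weights -- replace these with your answer before testing.
U = np.zeros((3, 2)); W = np.zeros((3, 3)); V = np.zeros((1, 3))
b_h = np.zeros(3); b_y = 0.0

x_seq = np.array([(1, 0), (1, 1), (1, 0), (0, 0), (0, 1), (1, 1), (0, 0)])
target = [1, 0, 0, 1, 1, 0, 1]
print(run_rnn(U, W, V, b_h, b_y, x_seq) == target)
```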