Version 1
Instructions.
Homework is due Tuesday, March 12, at 11:59pm; no late homework accepted. Everyone must submit individually at gradescope under hw3 and hw3code.
The "written" submission at hw3 must be typed, and submitted in any format gradescope accepts (to be safe, submit a PDF). You may use LaTeX, markdown, google docs, MS word, whatever you like; but it must be typed!
When submitting at hw3, gradescope will ask you to mark out boxes around each of your answers; please do this precisely!
Please make sure your NetID is clear and large on the first page of the homework.
Your solution must be written in your own words. Please see the course webpage for full academic integrity information. Briefly, you may have high-level discussions with at most 3 classmates, whose NetIDs you should place on the first page of your solutions, and you should cite any external reference you use; despite all this, your solution must be written in your own words.
We reserve the right to reduce the auto-graded score for hw3code if we detect funny business (e.g., rather than implementing an algorithm, you keep re-submitting the assignment to the auto-grader, eventually completing a binary search for the answers).
There are no regrade requests on hw3code, which is the code auto-grader; however, you can re-submit and re-grade as many times as you like before the deadline! Start early and report any issues on piazza!
Methods and functions in the template and utility code include docstrings describing the inputs and outputs. The autograder relies on correct implementations of these methods. Follow the docstrings to avoid failing tests.
1. The ln-sum-exp and cross entropy.
(a) Given $z \in \mathbb{R}^k$, prove that $g(z) = \ln \sum_{j=1}^{k} \exp(z_j)$ is convex.
Hint: prove that the Hessian matrix is positive semi-definite, meaning $\nabla^2 g(z) \succeq 0$.
Recall that given a data example $(x, y)$ where $x \in \mathbb{R}^d$ and $y \in \{1, 2, \ldots, k\}$, and a classifier $f : \mathbb{R}^d \to \mathbb{R}^k$, the cross entropy loss is defined as
\[
\ell_{\mathrm{ce}}\bigl(f(x), y\bigr) = -\ln\left(\frac{\exp\bigl(f(x)_y\bigr)}{\sum_{j=1}^{k} \exp\bigl(f(x)_j\bigr)}\right) = -f(x)_y + \ln \sum_{j=1}^{k} \exp\bigl(f(x)_j\bigr).
\]
(b) Let data examples $\bigl((x_i, y_i)\bigr)_{i=1}^{n}$ be given, where $x_i \in \mathbb{R}^d$ and $y_i \in \{1, 2, \ldots, k\}$. Consider the linear predictor $x \mapsto Wx$, where $W \in \mathbb{R}^{k \times d}$. Prove that the empirical risk
\[
\widehat{R}(W) = \frac{1}{n} \sum_{i=1}^{n} \ell_{\mathrm{ce}}(W x_i, y_i)
\]
is convex.
Hint: use part (a) and the fact that convexity is preserved under affine composition and nonnegative combination. (It doesn't matter that this part uses matrix variables; this affine composition property holds for convex functions over matrices, and then when applying the previous part, its input is a vector after the affine transformation.)
(c) For $z \in \mathbb{R}^k$ and $r > 0$, let $g_r(z) = \frac{1}{r} \ln \sum_{j=1}^{k} \exp(r z_j)$. Prove that
\[
\lim_{r \to \infty} g_r(z) = \max_{1 \le j \le k} z_j.
\]
(d) As a corollary, for $z \in \mathbb{R}$ and $r > 0$, define the logistic loss $\ell(z) = \ln(1 + \exp(-z))$ and its scaled version $\ell_r(z) = \frac{1}{r} \ln(1 + \exp(-rz))$. Prove that
\[
\lim_{r \to \infty} \ell_r(z) = \max\{0, -z\} = \mathrm{ReLU}(-z).
\]
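Not required for the written proofs, but the definitions above are easy to sanity-check numerically; the short sketch below (plain PyTorch, no course code assumed) evaluates the cross entropy via logsumexp and watches $g_r(z)$ approach $\max_j z_j$ as $r$ grows.

```python
import torch

# Cross entropy as defined above: -f(x)_y + ln sum_j exp(f(x)_j).
def cross_entropy(logits, y):
    return -logits[y] + torch.logsumexp(logits, dim=0)

# g_r(z) = (1/r) ln sum_j exp(r z_j); part (c) says this tends to max_j z_j.
def g_r(z, r):
    return torch.logsumexp(r * z, dim=0) / r

z = torch.tensor([0.5, -1.0, 2.0, 1.5])
print(cross_entropy(z, y=2).item())
for r in [1.0, 10.0, 100.0, 1000.0]:
    print(r, g_r(z, r).item())   # approaches z.max() = 2.0
```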
Solution. (Your solution here.)
2. On initialization.
Consider a 2-layer network
\[
f(x; W, v) = \sum_{j=1}^{m} v_j \langle w_j, x \rangle,
\]
where $x \in \mathbb{R}^d$, $W \in \mathbb{R}^{m \times d}$ with rows $w_j$, and $v \in \mathbb{R}^m$. For simplicity, the network has a single output, and bias terms are omitted.
Given a data example $(x, y)$ and a loss function $\ell$, consider the empirical risk
\[
\widehat{R}(W, v) = \ell\bigl(f(x; W, v), y\bigr).
\]
Only a single data example will be considered in this problem; the same analysis extends to multiple examples by taking averages.
For each $1 \le j \le m$, derive $\partial \widehat{R} / \partial v_j$ and $\partial \widehat{R} / \partial w_j$.
Consider gradient descent which starts from some $W^{(0)}$ and $v^{(0)}$, and at step $t \ge 0$, updates the weights for each $1 \le j \le m$ as follows:
\[
w_j^{(t+1)} = w_j^{(t)} - \eta \left.\frac{\partial \widehat{R}}{\partial w_j}\right|_{(W^{(t)}, v^{(t)})},
\qquad\text{and}\qquad
v_j^{(t+1)} = v_j^{(t)} - \eta \left.\frac{\partial \widehat{R}}{\partial v_j}\right|_{(W^{(t)}, v^{(t)})},
\]
where $\eta > 0$ is the step size.
Suppose there exist two hidden units $p, q \in \{1, 2, \ldots, m\}$ such that $w_p^{(0)} = w_q^{(0)}$ and $v_p^{(0)} = v_q^{(0)}$.
Prove by induction that for any step $t \ge 0$, it holds that $w_p^{(t)} = w_q^{(t)}$ and $v_p^{(t)} = v_q^{(t)}$.
Remark: as a result, if the neural network is initialized symmetrically, then such a symmetry may persist during gradient descent, and thus the representation power of the network will be limited.
Random initialization is a good way to break symmetry. Moreover, proper random initialization also preserves the squared norm of the input, as formalized below.
First consider the identity activation $\sigma(z) = z$. For each $1 \le j \le m$ and $1 \le k \le d$, initialize $w_{j,k}^{(0)} \sim \mathcal{N}(0, 1/m)$ (i.e., the normal distribution with mean $0$ and variance $1/m$). Prove that
\[
\mathbb{E}\Bigl[\bigl\|W^{(0)} x\bigr\|_2^2\Bigr] = \|x\|_2^2.
\]
Next consider the ReLU activation $\sigma_r(z) = \max\{0, z\}$. For each $1 \le j \le m$ and $1 \le k \le d$, initialize $w_{j,k}^{(0)} \sim \mathcal{N}(0, 2/m)$. Prove that
\[
\mathbb{E}\Bigl[\bigl\|\sigma_r\bigl(W^{(0)} x\bigr)\bigr\|_2^2\Bigr] = \|x\|_2^2.
\]
Hint: linear combinations of Gaussians are again Gaussian! For the second part (with ReLU), consider the symmetry of a Gaussian around 0.
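Again not a proof, but the two expectations above can be estimated by Monte Carlo to see the claimed norm preservation; the sketch below uses only PyTorch (the dimensions d, m and the number of trials are arbitrary choices).

```python
import torch

d, m, trials = 10, 512, 2000
x = torch.randn(d)

def hidden_norm_sq(std, activation):
    """One draw of ||activation(W x)||^2 with W entries i.i.d. N(0, std^2)."""
    W = torch.randn(m, d) * std
    return activation(W @ x).pow(2).sum()

# Identity activation, N(0, 1/m) entries: E ||W x||^2 should equal ||x||^2.
est_id = torch.stack([hidden_norm_sq((1.0 / m) ** 0.5, lambda z: z) for _ in range(trials)]).mean()
# ReLU activation, N(0, 2/m) entries: E ||relu(W x)||^2 should also equal ||x||^2.
est_relu = torch.stack([hidden_norm_sq((2.0 / m) ** 0.5, torch.relu) for _ in range(trials)]).mean()

print(x.pow(2).sum().item(), est_id.item(), est_relu.item())   # all three roughly equal
```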
Solution. (Your solution here.)
3. ResNet.
In this problem, you will implement a simplified ResNet. You do not need to change arguments which are not mentioned here (but you of course could try and see what happens).
Implement a class Block, which is a building block of ResNet. It is described in Figure 2 of He et al. (2016).
The input to Block is of shape $(N, C, H, W)$, where $N$ denotes the batch size, $C$ denotes the number of channels, and $H$ and $W$ are the height and width of each channel. For each data example $x$ with shape $(C, H, W)$, the output of Block is
\[
\mathrm{Block}(x) = \sigma_r\bigl(x + f(x)\bigr),
\]
where $\sigma_r$ denotes the ReLU activation, and $f(x)$ also has shape $(C, H, W)$ and thus can be added to $x$. In detail, $f$ contains the following layers:
i. A Conv2d with $C$ input channels, $C$ output channels, kernel size 3, stride 1, padding 1, and no bias term.
ii. A BatchNorm2d with $C$ features.
iii. A ReLU layer.
iv. Another Conv2d with the same arguments as i. above.
v. Another BatchNorm2d with $C$ features.
Because $3 \times 3$ kernels and padding 1 are used, the convolutional layers do not change the shape of each channel. Moreover, the number of channels is also kept unchanged. Therefore $f(x)$ has the same shape as $x$.
Additional instructions are given in docstrings in hw3.py.
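For orientation, here is one way such a Block could be written in PyTorch; this is only a sketch, and the constructor arguments and attribute names required by the autograder are fixed by the docstring in hw3.py, not by this example.

```python
import torch.nn as nn

class Block(nn.Module):
    """Residual block: Block(x) = relu(x + f(x)), with f = Conv-BN-ReLU-Conv-BN."""
    def __init__(self, num_channels):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(num_channels, num_channels, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(num_channels),
            nn.ReLU(),
            nn.Conv2d(num_channels, num_channels, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(num_channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        # Shapes match, so the skip connection is a plain addition.
        return self.relu(x + self.f(x))
```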
Explain why a Conv2d layer does not need a bias term if it is followed by a BatchNorm2d layer.
Implement a (shallow) ResNet consisting of the following parts:
A Conv2d with 1 input channel, C output channels, kernel size 3, stride 2, padding 1, and no bias term.
A BatchNorm2d with C features.
A ReLU layer.
A MaxPool2d with kernel size 2.
A Block with C channels.
An AdaptiveAvgPool2d which for each channel takes the average of all elements.
A Linear with C inputs and 10 outputs.
Additional instructions are given in docstrings in hw3.py.
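Likewise, a possible layout of the shallow ResNet (a sketch only; the required interface is in the hw3.py docstrings), reusing the Block sketched above:

```python
import torch.nn as nn

class ResNet(nn.Module):
    """Shallow ResNet: Conv-BN-ReLU-MaxPool, one residual Block, global average pool, Linear."""
    def __init__(self, num_channels, num_classes=10):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, num_channels, kernel_size=3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(num_channels),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),
            Block(num_channels),                 # the Block class sketched above
            nn.AdaptiveAvgPool2d(1),             # average over each channel
        )
        self.fc = nn.Linear(num_channels, num_classes)

    def forward(self, x):
        h = self.backbone(x)                     # shape (N, C, 1, 1)
        return self.fc(h.flatten(start_dim=1))   # shape (N, 10)
```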
Train a ResNet with 16 channels on the data given by hw3_utils.torch_digits(), using the cross entropy loss and SGD with learning rate 0.005 and batch size 16, for 30 epochs. You can just reuse your fit and validate routines from hw2. Plot the training and validation cross entropy losses against the number of epochs. Since there is some inconsistency due to random initialization, try 3 runs and produce 3 plots.
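A bare-bones training and validation loop for the setting above might look like the following; how the digits data returned by hw3_utils.torch_digits() is packed into loaders depends on that function's actual return value (see its docstring), so the commented usage at the bottom is only an assumption.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

def run_epoch(model, loader, optimizer=None):
    """Average cross entropy over one pass; updates the model only when an optimizer is given."""
    if optimizer is not None:
        model.train()
    else:
        model.eval()
    total, count = 0.0, 0
    for xb, yb in loader:
        logits = model(xb)
        loss = F.cross_entropy(logits, yb)
        if optimizer is not None:
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        total += loss.item() * xb.shape[0]
        count += xb.shape[0]
    return total / count

# Hypothetical usage, assuming tensors X_train, y_train, X_val, y_val are available:
# model = ResNet(num_channels=16)
# opt = torch.optim.SGD(model.parameters(), lr=0.005)
# train_loader = DataLoader(TensorDataset(X_train, y_train), batch_size=16, shuffle=True)
# val_loader = DataLoader(TensorDataset(X_val, y_val), batch_size=16)
# train_curve = [run_epoch(model, train_loader, opt) for _ in range(30)]
```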
Solution. (Your solution here.)
4. Kernel properties.
(a) Let $A \in \mathbb{R}^{n \times n}$ denote a symmetric positive semi-definite matrix of rank $r$. Prove that there exist $n$ vectors $z_1, z_2, \ldots, z_n \in \mathbb{R}^r$ such that $A_{i,j} = \langle z_i, z_j \rangle$.
Hint: use the eigendecomposition $A = U \Lambda U^\top$.
(b) On the other hand, given $n$ vectors $z_1, z_2, \ldots, z_n \in \mathbb{R}^m$, prove that the matrix $A$ defined by $A_{i,j} = \langle z_i, z_j \rangle$ is positive semi-definite.
Remark: note that the rank of $A$ is at most $m$, and it could be strictly less than $m$. In particular, $m$ could be larger than $n$.
(c) Using (a) and (b), prove that if $K_1(x, y)$ and $K_2(x, y)$ are kernels, then $K(x, y) = K_1(x, y) K_2(x, y)$ is also a kernel.
(d) Using (c), prove that $K(x, y) = (1 + \langle x, y \rangle)^r$ is a kernel for any positive integer $r$. (It is the polynomial kernel with degree $r$.)
(e) Assume $K(x, y) = \exp\bigl(-\|x - y\|_2^2 / (2\sigma^2)\bigr)$ is a kernel in the 1-dimensional case (i.e., $x, y \in \mathbb{R}$). Using (c), prove that $K(x, y)$ is indeed a kernel for any dimension. (It is the Gaussian / RBF kernel.)
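As a small numerical illustration of part (b) (not part of the written proofs): the Gram matrix of a kernel on any finite point set should have no negative eigenvalues, which the sketch below checks for the RBF kernel on random points.

```python
import torch

def rbf_gram(X, sigma=1.0):
    """Gram matrix A with A[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_dists = torch.cdist(X, X).pow(2)
    return torch.exp(-sq_dists / (2 * sigma ** 2))

X = torch.randn(20, 5)
A = rbf_gram(X)
eigvals = torch.linalg.eigvalsh(A)            # real eigenvalues of the symmetric matrix
print(eigvals.min().item() >= -1e-6)          # True: positive semi-definite up to round-off
```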
Solution. (Your solution here.)
5. RBF kernel and nearest neighbors.
Recall that given data examples $\bigl((x_i, y_i)\bigr)_{i=1}^{n}$ and an optimal dual solution $(\hat{\alpha}_i)_{i=1}^{n}$, the RBF kernel SVM makes a prediction as follows:
\[
f(x) = \sum_{i=1}^{n} \hat{\alpha}_i y_i \exp\left(-\frac{\|x - x_i\|_2^2}{2\sigma^2}\right) = \sum_{i \in S} \hat{\alpha}_i y_i \exp\left(-\frac{\|x - x_i\|_2^2}{2\sigma^2}\right),
\]
where $S \subseteq \{1, 2, \ldots, n\}$ is the set of indices of the support vectors.
(a) Given an input $x$, let $T := \arg\min_{i \in S} \|x - x_i\|_2$ denote the set of closest support vectors to $x$, and let $\rho := \min_{i \in S} \|x - x_i\|_2$ denote this smallest distance. (In other words, $T := \{i \in S : \|x - x_i\|_2 = \rho\}$.) Prove that
\[
\lim_{\sigma \to 0} \frac{f(x)}{\exp\bigl(-\rho^2 / (2\sigma^2)\bigr)} = \sum_{i \in T} \hat{\alpha}_i y_i.
\]
Remark: in other words, when the bandwidth becomes small enough, RBF kernel SVM is almost the 1-nearest neighbor predictor with the set of support vectors as the training set.
(b) Consider the XOR dataset:
\[
x_1 = (+1, +1), \quad y_1 = +1; \qquad
x_2 = (-1, +1), \quad y_2 = -1; \qquad
x_3 = (-1, -1), \quad y_3 = +1; \qquad
x_4 = (+1, -1), \quad y_4 = -1.
\]
Verify that $\hat{\alpha} = (1/\beta, 1/\beta, 1/\beta, 1/\beta)$ is an optimal dual solution to the RBF kernel SVM, where
\[
\beta = \left(1 - \exp\left(-\frac{\|x_1 - x_2\|_2^2}{2\sigma^2}\right)\right)^{2} = \left(1 - \exp\left(-\frac{2}{\sigma^2}\right)\right)^{2} > 0.
\]
Hint: prove that the gradient of the dual function is 0 at $\hat{\alpha}$. Since the dual function is concave, and $\hat{\alpha} \ge 0$, it follows that $\hat{\alpha}$ is an optimal dual solution.
Remark: in other words, all four data examples are mapped to support vectors in the reproducing kernel Hilbert space. In light of (a), when $\sigma$ is small enough, $f(x)$ is almost the 1-nearest neighbor predictor on the XOR dataset. In fact, it is also true for large $\sigma$, due to the symmetry of the XOR data.
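A quick numerical check of this remark (not a substitute for the proofs): with the dual solution from (b), the sign of $f(x)$ for a test point near $x_1$ should agree with the 1-nearest-neighbor label once $\sigma$ is small; the symbol $\beta$ below refers to the quantity in the statement of (b).

```python
import torch

X = torch.tensor([[1., 1.], [-1., 1.], [-1., -1.], [1., -1.]])
y = torch.tensor([1., -1., 1., -1.])

def f(x, sigma):
    # beta from part (b): (1 - exp(-||x_1 - x_2||^2 / (2 sigma^2)))^2, with ||x_1 - x_2||^2 = 4.
    beta = (1 - torch.exp(torch.tensor(-2.0 / sigma ** 2))) ** 2
    alpha = torch.ones(4) / beta
    k = torch.exp(-(X - x).pow(2).sum(dim=1) / (2 * sigma ** 2))
    return (alpha * y * k).sum()

x = torch.tensor([0.8, 0.9])      # nearest training point is x_1 = (+1, +1) with label +1
for sigma in [1.0, 0.5, 0.1]:
    print(sigma, torch.sign(f(x, sigma)).item())   # stays +1, matching the nearest neighbor
```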
Solution. (Your solution here.)
6. SVM implementation.
Recall that the dual problem of SVM is
\[
\max_{\alpha \in \mathcal{C}} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j),
\]
where the domain $\mathcal{C} = [0, \infty)^n = \{\alpha : \alpha_i \ge 0\}$ for the hard-margin SVM, and $\mathcal{C} = [0, C]^n = \{\alpha : 0 \le \alpha_i \le C\}$ for the soft-margin SVM.
Equivalently, it can be formulated as a minimization problem
\[
\min_{\alpha \in \mathcal{C}} \; f(\alpha) := \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) - \sum_{i=1}^{n} \alpha_i.
\]
It can be solved by projected gradient descent, which starts from some $\alpha_0 \in \mathcal{C}$ (e.g., $\alpha_0 = 0$) and updates as follows:
\[
\alpha_{t+1} = \Pi_{\mathcal{C}}\bigl[\alpha_t - \eta \nabla f(\alpha_t)\bigr],
\]
where $\eta > 0$ is the learning rate.
Here $\Pi_{\mathcal{C}}[\alpha]$ is the projection of $\alpha$ onto $\mathcal{C}$, defined as the closest point to $\alpha$ in $\mathcal{C}$:
\[
\Pi_{\mathcal{C}}[\alpha] := \arg\min_{\alpha' \in \mathcal{C}} \|\alpha' - \alpha\|_2.
\]
If $\mathcal{C}$ is convex, the projection is uniquely defined.
(a) Prove that
\[
\bigl(\Pi_{[0,\infty)^n}[\alpha]\bigr)_i = \max\{\alpha_i, 0\},
\qquad\text{and}\qquad
\bigl(\Pi_{[0,C]^n}[\alpha]\bigr)_i = \min\bigl\{\max\{0, \alpha_i\}, C\bigr\}.
\]
(b) Implement an svm_solver(), using projected gradient descent formulated as above. See the docstrings in hw3.py for details.
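A minimal sketch of projected gradient descent on the minimization form of the dual given above; the actual svm_solver() signature, and the conventions for the kernel argument and the soft-margin cap, are fixed by the hw3.py docstring, so the interface below (including the kernel(X, X) call returning a Gram matrix) is only an assumption.

```python
import torch

def pgd_dual(X, y, kernel, lr=0.1, num_iters=10000, c=None):
    """Projected gradient descent on f(alpha) = 0.5 * alpha^T (yy^T * K) alpha - sum(alpha)."""
    n = X.shape[0]
    K = kernel(X, X)                         # hypothetical: kernel returning the n x n Gram matrix
    Q = (y.view(-1, 1) * y.view(1, -1)) * K  # Q[i, j] = y_i y_j K(x_i, x_j)
    alpha = torch.zeros(n)
    for _ in range(num_iters):
        grad = Q @ alpha - 1.0               # gradient of the dual objective
        alpha = alpha - lr * grad
        alpha = alpha.clamp(min=0.0)         # projection onto [0, inf)^n ...
        if c is not None:
            alpha = alpha.clamp(max=c)       # ... or onto [0, c]^n for the soft margin
    return alpha
```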
(c) Implement an svm_predictor(), using an optimal dual solution, the training set, and an input. See the docstrings in hw3.py for details.
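And a corresponding prediction sketch, using the bias-free dual form $f(x) = \sum_i \hat{\alpha}_i y_i K(x_i, x)$ as in the RBF formula of the previous problem; again, the real svm_predictor() interface and the kernel convention are defined in hw3.py, so treat this as hypothetical.

```python
def dual_predict(alpha, X_train, y_train, x_test, kernel):
    """Evaluate f(x) = sum_i alpha_i y_i K(x_i, x) at each test point."""
    K = kernel(X_train, x_test)              # hypothetical Gram matrix of shape (n_train, n_test)
    return (alpha * y_train) @ K             # scores of shape (n_test,)
```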
(d) On the area $[-5, 5] \times [-5, 5]$, plot the contour lines of the following kernel SVMs, trained on the XOR data. Different kernels and the XOR data are provided in hw3_utils.py. Learning rate 0.1 and 10000 steps should be enough. To draw the contour lines, you can use hw3.svm_contour().
The polynomial kernel with degree 2.
The RBF kernel with $\sigma = 1$.
The RBF kernel with $\sigma = 2$.
The RBF kernel with $\sigma = 4$.
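For reference, a generic matplotlib sketch of such a contour plot, independent of hw3.svm_contour() (whose exact behavior is defined in hw3.py); the predict argument is assumed to map a (k, 2) tensor of points to k real scores, for instance the dual predictor above with a fixed trained alpha.

```python
import torch
import matplotlib.pyplot as plt

def plot_decision_contour(predict, lo=-5.0, hi=5.0, steps=200):
    """Contour plot of a scalar score function over the square [lo, hi] x [lo, hi]."""
    xs = torch.linspace(lo, hi, steps)
    gx, gy = torch.meshgrid(xs, xs, indexing="ij")
    grid = torch.stack([gx.reshape(-1), gy.reshape(-1)], dim=1)   # shape (steps*steps, 2)
    with torch.no_grad():
        scores = predict(grid).reshape(steps, steps)
    plt.contourf(gx.numpy(), gy.numpy(), scores.numpy(), levels=20)
    plt.colorbar()
    plt.show()
```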
Solution. (Your solution here.)
References
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.