Homework 3 Solution

Conceptual Questions

A1. The following questions should be answerable without referring to external materials. Briefly justify your answers with a few words.

    a. [2 points] True or False: Given a data matrix $X \in \mathbb{R}^{n \times d}$ where $d$ is much smaller than $n$, if we project our data onto a $k$-dimensional subspace using PCA where $k = \operatorname{rank}(X)$, our projection will have 0 reconstruction error (we find a perfect representation of our data, with no information loss).

    b. [2 points] True or False: The maximum margin decision boundaries that support vector machines construct have the lowest generalization error among all linear classifiers.

    c. [2 points] True or False: An individual observation $x_i$ can occur multiple times in a single bootstrap sample from a dataset $X$, even if $x_i$ only occurs once in $X$.

    d. [2 points] True or False: Suppose that the SVD of a square $n \times n$ matrix $X$ is $USV^\top$, where $S$ is a diagonal $n \times n$ matrix. Then the rows of $V$ are equal to the eigenvectors of $X^\top X$.

    e. [2 points] True or False: Performing PCA to reduce the feature dimensionality and then applying the Lasso results in an interpretable linear model.

    f. [2 points] True or False: Choosing $k$ to minimize the k-means objective (see Equation (1) below) is a good way to find meaningful clusters.

    g. [2 points] Say you trained an SVM classifier with an RBF kernel $K(u, v) = \exp\left(-\frac{\|u - v\|_2^2}{2\sigma^2}\right)$. It seems to underfit the training set: should you increase or decrease $\sigma$?








Kernels and the Bootstrap

A2. [5 points] Suppose that our inputs $x$ are one-dimensional and that our feature map is infinite-dimensional: $\phi(x)$ is a vector whose $i$th component is
$$\frac{1}{\sqrt{i!}} \, e^{-x^2/2} \, x^i$$
for all nonnegative integers $i$. (Thus, $\phi$ is an infinite-dimensional vector.) Show that $K(x, x') = e^{-(x - x')^2/2}$ is a kernel function for this feature map, i.e.,
$$\phi(x) \cdot \phi(x') = e^{-(x - x')^2/2}.$$

Hint: Use the Taylor expansion of $e^z$. (This is the one-dimensional version of the Gaussian (RBF) kernel.)

A3. This problem will get you familiar with kernel ridge regression using the polynomial and RBF kernels. First, let's generate some data. Let $n = 30$ and $f(x) = 4\sin(\pi x)\cos(6\pi x^2)$. For $i = 1, \ldots, n$ let each $x_i$ be drawn uniformly at random from $[0, 1]$ and $y_i = f(x_i) + \epsilon_i$ where $\epsilon_i \sim \mathcal{N}(0, 1)$.
For any function $f$, the true error and the train error are respectively defined as
$$\mathcal{E}_{\mathrm{true}}(f) = \mathbb{E}_{XY}\left[(f(X) - Y)^2\right], \qquad \widehat{\mathcal{E}}_{\mathrm{train}}(f) = \frac{1}{n}\sum_{i=1}^{n} \left(f(x_i) - y_i\right)^2.$$

Using kernel ridge regression, construct a predictor
$$\widehat{\alpha} = \arg\min_{\alpha} \; \|K\alpha - y\|_2^2 + \lambda\, \alpha^\top K \alpha, \qquad \widehat{f}(x) = \sum_{i=1}^{n} \widehat{\alpha}_i\, k(x_i, x),$$
where $K_{i,j} = k(x_i, x_j)$ is a kernel evaluation and $\lambda$ is the regularization constant. Include any code you use for your experiments in your submission.
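Below is a minimal illustrative sketch (not required starter code) of how the predictor above can be computed with NumPy. It uses the fact that one minimizer of $\|K\alpha - y\|_2^2 + \lambda \alpha^\top K \alpha$ is $\widehat{\alpha} = (K + \lambda I)^{-1} y$; the function names, hyperparameter values, and random seed are our own placeholders.

```python
import numpy as np

def k_poly(X, Z, d):
    """Polynomial kernel (1 + x^T z)^d for one-dimensional inputs X and Z."""
    return (1.0 + np.outer(X, Z)) ** d

def k_rbf(X, Z, gamma):
    """RBF kernel exp(-gamma * (x - z)^2) for one-dimensional inputs X and Z."""
    return np.exp(-gamma * (X[:, None] - Z[None, :]) ** 2)

def krr_fit(K, y, lam):
    """Solve for alpha-hat: one minimizer is (K + lam * I)^{-1} y."""
    return np.linalg.solve(K + lam * np.eye(K.shape[0]), y)

# Synthetic data as in the problem statement.
rng = np.random.default_rng(0)
n = 30
x = rng.uniform(0, 1, n)
f_true = lambda t: 4 * np.sin(np.pi * t) * np.cos(6 * np.pi * t ** 2)
y = f_true(x) + rng.standard_normal(n)

gamma, lam = 10.0, 1e-3                      # placeholder hyperparameters
alpha = krr_fit(k_rbf(x, x, gamma), y, lam)
grid = np.linspace(0, 1, 200)
y_hat = alpha @ k_rbf(x, grid, gamma)        # f-hat evaluated on a fine grid
```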

    a. [10 points] Using leave-one-out cross validation, find a good $\lambda$ and hyperparameter settings for the following kernels:

$k_{\mathrm{poly}}(x, z) = (1 + x^\top z)^d$ where $d \in \mathbb{N}$ is a hyperparameter,
$k_{\mathrm{rbf}}(x, z) = \exp(-\gamma \|x - z\|^2)$ where $\gamma > 0$ is a hyperparameter¹.

Report the values of $d$, $\gamma$, and the $\lambda$ values for both kernels.
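A sketch of leave-one-out cross validation for this setting, continuing the NumPy sketch above (it reuses `k_rbf`, `x`, and `y`; the search grids are placeholders, not prescribed ranges):

```python
import numpy as np

def loo_cv_error(x, y, kernel, lam):
    """Leave-one-out CV estimate of the squared error of kernel ridge regression."""
    n = len(x)
    errs = []
    for i in range(n):
        mask = np.arange(n) != i
        K_tr = kernel(x[mask], x[mask])
        alpha = np.linalg.solve(K_tr + lam * np.eye(n - 1), y[mask])
        pred = alpha @ kernel(x[mask], x[i:i + 1])
        errs.append((pred[0] - y[i]) ** 2)
    return float(np.mean(errs))

# Example grid search over (lambda, gamma) for the RBF kernel.
best = None
for lam in 10.0 ** np.arange(-5.0, 1.0):
    for gamma in 10.0 ** np.arange(-1.0, 3.0):
        err = loo_cv_error(x, y, lambda a, b: k_rbf(a, b, gamma), lam)
        if best is None or err < best[2]:
            best = (lam, gamma, err)
```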

    b. [10 points] Let $\widehat{f}_{\mathrm{poly}}(x)$ and $\widehat{f}_{\mathrm{rbf}}(x)$ be the functions learned using the hyperparameters you found in part a. For a single plot per function $\widehat{f} \in \{\widehat{f}_{\mathrm{poly}}, \widehat{f}_{\mathrm{rbf}}\}$, plot the original data $\{(x_i, y_i)\}_{i=1}^{n}$, the true $f(x)$, and $\widehat{f}(x)$ (i.e., define a fine grid on $[0, 1]$ to plot the functions).

    c. [5 points] We wish to build bootstrap percentile confidence intervals for $\widehat{f}_{\mathrm{poly}}(x)$ and $\widehat{f}_{\mathrm{rbf}}(x)$ for all $x \in [0, 1]$ from part b.² Use the non-parametric bootstrap with $B = 300$ bootstrap iterations to find 5% and 95% percentiles at each point $x$ on a fine grid over $[0, 1]$.

Specifically, for each bootstrap sample $b \in \{1, \ldots, B\}$, draw uniformly at random with replacement $n$ samples from $\{(x_i, y_i)\}_{i=1}^{n}$, train an $\widehat{f}_b$ using the $b$th resampled dataset, and compute $\widehat{f}_b(x)$ for each $x$ in your fine grid; let the 5th percentile at point $x$ be the largest value $\nu$ such that $\frac{1}{B}\sum_{b=1}^{B} \mathbf{1}\{\widehat{f}_b(x) \le \nu\} \le .05$, and define the 95th percentile analogously.

Plot the 5th and 95th percentile curves on the plots from part b.
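A sketch of the bootstrap percentile curves for the RBF predictor, continuing the sketches above (it reuses `k_rbf`, `x`, `y`, and placeholder hyperparameters; the polynomial kernel is handled identically):

```python
import numpy as np

B = 300
grid = np.linspace(0, 1, 200)
rng = np.random.default_rng(0)
preds = np.empty((B, grid.size))

gamma, lam = 10.0, 1e-3                               # use the values found in part a
for b in range(B):
    idx = rng.integers(0, len(x), size=len(x))        # resample with replacement
    xb, yb = x[idx], y[idx]
    alpha = np.linalg.solve(k_rbf(xb, xb, gamma) + lam * np.eye(len(xb)), yb)
    preds[b] = alpha @ k_rbf(xb, grid, gamma)

lo_curve, hi_curve = np.percentile(preds, [5, 95], axis=0)  # pointwise 5th/95th percentiles
```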


¹ Given a dataset $x_1, \ldots, x_n \in \mathbb{R}^d$, a heuristic for choosing a $\gamma$ in the right ballpark is the inverse of the median of all $\binom{n}{2}$ squared distances $\|x_i - x_j\|_2^2$.
² See Hastie, Tibshirani, Friedman Ch. 8.2 for a review of the bootstrap procedure.


    d. [5 points]  Repeat parts a, b, and c with n = 300, but use 10-fold CV instead of leave-one-out for part a.

    e. [5 points] For this problem, use the $\widehat{f}_{\mathrm{poly}}(x)$ and $\widehat{f}_{\mathrm{rbf}}(x)$ learned in part d. Suppose $m = 1000$ additional samples $(x'_1, y'_1), \ldots, (x'_m, y'_m)$ are drawn i.i.d. the same way the first $n$ samples were drawn.

Use the non-parametric bootstrap with $B = 300$ to construct a confidence interval on $\mathbb{E}\left[(Y - \widehat{f}_{\mathrm{poly}}(X))^2 - (Y - \widehat{f}_{\mathrm{rbf}}(X))^2\right]$. That is, randomly draw with replacement $m$ samples denoted $\{(\widetilde{x}'_i, \widetilde{y}'_i)\}_{i=1}^{m}$ from $\{(x'_i, y'_i)\}_{i=1}^{m}$, compute
$$\frac{1}{m}\sum_{i=1}^{m}\left[(\widetilde{y}'_i - \widehat{f}_{\mathrm{poly}}(\widetilde{x}'_i))^2 - (\widetilde{y}'_i - \widehat{f}_{\mathrm{rbf}}(\widetilde{x}'_i))^2\right],$$
repeat this $B$ times, and find the 5% and 95% percentiles. Report these values.

Using this confidence interval, is there statistically significant evidence to suggest that one of $\widehat{f}_{\mathrm{rbf}}$ and $\widehat{f}_{\mathrm{poly}}$ is better than the other at predicting $Y$ from $X$? (Hint: does the confidence interval contain 0?)
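A sketch of this paired bootstrap; `f_poly_hat` and `f_rbf_hat` stand for the predictors from part d evaluated on arrays of inputs, and `x_new`, `y_new` for the $m$ additional samples (all names are our own placeholders):

```python
import numpy as np

def paired_bootstrap_ci(x_new, y_new, f_poly_hat, f_rbf_hat, B=300, seed=0):
    """Bootstrap 5%/95% percentiles of the mean difference in squared errors."""
    rng = np.random.default_rng(seed)
    m = len(x_new)
    diffs = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, m, size=m)              # resample the m fresh samples
        err_poly = (y_new[idx] - f_poly_hat(x_new[idx])) ** 2
        err_rbf = (y_new[idx] - f_rbf_hat(x_new[idx])) ** 2
        diffs[b] = np.mean(err_poly - err_rbf)
    return np.percentile(diffs, [5, 95])
```

If the resulting interval lies entirely above or below 0, that is evidence that one predictor has lower expected squared error than the other.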
k-means clustering

A4. Given a dataset $x_1, \ldots, x_n \in \mathbb{R}^d$ and an integer $1 \le k \le n$, recall the following k-means objective function:
$$\min_{\pi_1, \ldots, \pi_k} \; \sum_{i=1}^{k} \sum_{j \in \pi_i} \|x_j - \mu_i\|_2^2, \qquad \mu_i = \frac{1}{|\pi_i|} \sum_{j \in \pi_i} x_j, \tag{1}$$
where $\{\pi_i\}_{i=1}^{k}$ is a partition of $\{1, 2, \ldots, n\}$. The objective (1) is NP-hard³ to find a global minimizer of. Nevertheless, Lloyd's algorithm, the commonly-used heuristic which we discussed in lecture, typically works well in practice.

a. [5 points] Implement Lloyd's algorithm for solving the k-means objective (1). Do not use any off-the-shelf implementations, such as those found in scikit-learn. Include your code in your submission.
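A minimal NumPy sketch of Lloyd's algorithm, for illustration only (the assignment asks for your own implementation; the initialization scheme, stopping rule, and names here are our choices):

```python
import numpy as np

def lloyds(X, k, n_iters=100, seed=0):
    """Lloyd's algorithm for k-means; X has shape (n, d). Returns centers, labels, objective trace."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]     # initialize from random data points
    objective = []
    for _ in range(n_iters):
        # Assignment step: squared distances via ||x||^2 - 2 x.c + ||c||^2 (memory-friendly for MNIST).
        dists = (X ** 2).sum(1)[:, None] - 2 * X @ centers.T + (centers ** 2).sum(1)[None, :]
        labels = dists.argmin(axis=1)
        objective.append(dists[np.arange(len(X)), labels].sum())
        # Update step: each center becomes the mean of its assigned points.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels, objective
```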

b. [5 points] Run the algorithm on the training dataset of MNIST with $k = 10$, plotting the objective function (1) as a function of the iteration number. Visualize (and include in your report) the cluster centers as $28 \times 28$ images.

c. [5 points] For $k \in \{2, 4, 8, 16, 32, 64\}$, run the algorithm on the training dataset to obtain centers $\{\mu_i\}_{i=1}^{k}$. If $\{(x_i, y_i)\}_{i=1}^{n}$ and $\{(x'_i, y'_i)\}_{i=1}^{m}$ denote the training and test sets, respectively, plot the training error $\frac{1}{n}\sum_{i=1}^{n} \min_{j=1,\ldots,k} \|\mu_j - x_i\|_2^2$ and the test error $\frac{1}{m}\sum_{i=1}^{m} \min_{j=1,\ldots,k} \|\mu_j - x'_i\|_2^2$ as a function of $k$ on the same plot.
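A short sketch of the error computation, reusing `lloyds` from the sketch above; `X_train` and `X_test` stand for the flattened MNIST training and test matrices (the names are our own):

```python
import numpy as np

def kmeans_error(X, centers):
    """Mean squared distance from each point to its nearest center."""
    X = np.asarray(X, dtype=float)
    dists = (X ** 2).sum(1)[:, None] - 2 * X @ centers.T + (centers ** 2).sum(1)[None, :]
    return dists.min(axis=1).mean()

ks = [2, 4, 8, 16, 32, 64]
train_err, test_err = [], []
for k in ks:
    centers, _, _ = lloyds(X_train, k)
    train_err.append(kmeans_error(X_train, centers))
    test_err.append(kmeans_error(X_test, centers))
```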



B1. Intro to sample complexity

For $i = 1, \ldots, n$ let $(x_i, y_i) \overset{\mathrm{i.i.d.}}{\sim} P_{XY}$ where $y_i \in \{-1, 1\}$ and $x_i$ lives in some set $\mathcal{X}$ ($x_i$ is not necessarily a vector). The 0/1 loss, or risk, for a deterministic classifier $f : \mathcal{X} \to \{-1, 1\}$ is defined as:
$$R(f) = \mathbb{E}_{XY}[\mathbf{1}(f(X) \ne Y)]$$
where $\mathbf{1}(\mathcal{E})$ is the indicator function for the event $\mathcal{E}$ (the function takes the value 1 if $\mathcal{E}$ occurs and 0 otherwise). The expectation is with respect to the underlying distribution $P_{XY}$ on $(X, Y)$. Unfortunately, we don't know $P_{XY}$ exactly, but we do have our i.i.d. samples $\{(x_i, y_i)\}_{i=1}^{n}$ drawn from it.



³ To be more precise, it is both NP-hard in $d$ when $k = 2$ and in $k$ when $d = 2$. See the references on the Wikipedia page for k-means for more details.


Define the empirical risk as
$$\widehat{R}_n(f) = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}(f(x_i) \ne y_i),$$

which is just an empirical estimate of our loss. Suppose that a learning algorithm computes the empirical risk $\widehat{R}_n(f)$ for all $f \in \mathcal{F}$ and outputs the prediction function $\widehat{f}$, the one with the smallest empirical risk. (In this problem, we are assuming that $\mathcal{F}$ is finite.) Suppose that the best-in-class function $f^*$ (i.e., the one that minimizes the true 0/1 loss) is:

$$f^* = \arg\min_{f \in \mathcal{F}} R(f).$$

    a. [2 points] Suppose that for some $f \in \mathcal{F}$, we have $R(f) > \epsilon$. Show that $P(\widehat{R}_n(f) = 0) \le e^{-n\epsilon}$. (You may use the fact that $1 - x \le e^{-x}$.)

b. [2 points] Use the union bound to show that
$$P\left(\exists f \in \mathcal{F} \text{ s.t. } R(f) > \epsilon \text{ and } \widehat{R}_n(f) = 0\right) \le |\mathcal{F}|\, e^{-n\epsilon}.$$

Recall that the union bound says that if $A_1, \ldots, A_k$ are events in a probability space, then
$$P(A_1 \cup A_2 \cup \cdots \cup A_k) \le \sum_{1 \le i \le k} P(A_i).$$

c. [2 points] Solve for the minimum $\epsilon$ such that $|\mathcal{F}|\, e^{-n\epsilon} \le \delta$.


d. [4 points] Use this to show that, with probability at least $1 - \delta$,
$$\widehat{R}_n(\widehat{f}) = 0 \implies R(\widehat{f}) - R(f^*) \le \frac{\log(|\mathcal{F}|/\delta)}{n},$$
where $\widehat{f} = \arg\min_{f \in \mathcal{F}} \widehat{R}_n(f)$.

Context: Note that among a larger number of functions $\mathcal{F}$ there is more likely to exist an $\widehat{f}$ such that $\widehat{R}_n(\widehat{f}) = 0$. However, this increased flexibility comes at the cost of a worse guarantee on the true error, reflected in the larger $|\mathcal{F}|$. This tradeoff quantifies how we can choose function classes $\mathcal{F}$ that overfit. This sample complexity result is remarkable because it depends only on the number of functions in $\mathcal{F}$, not on what they look like. It is among the simplest results in a rich literature known as statistical learning theory. Using a similar strategy, one can use Hoeffding's inequality to obtain a generalization bound when $\widehat{R}_n(\widehat{f}) \ne 0$.
Neural Networks for MNIST

A5. In Homework 1, we used ridge regression to train a classifier for the MNIST data set. Students who did problem B.2 also used a random feature transform. In Homework 2, we used logistic regression to distinguish between the digits 2 and 7. Students who did problem B.4 extended this idea to multinomial logistic regression to distinguish between all 10 digits. In this problem, we will use PyTorch to build a simple neural network classifier for MNIST to further improve our accuracy.

We will implement two different architectures: a shallow but wide network, and a narrow but deeper network. For both architectures, we use $d$ to refer to the number of input features (in MNIST, $d = 28^2 = 784$), $h_i$ to refer to the dimension of the $i$th hidden layer, and $k$ for the number of target classes (in MNIST, $k = 10$). For the non-linear activation, use ReLU. Recall from lecture that
$$\mathrm{ReLU}(x) = \begin{cases} x, & x \ge 0 \\ 0, & \text{otherwise.} \end{cases}$$
Weight Initialization

Consider a weight matrix $W \in \mathbb{R}^{n \times m}$ and bias $b \in \mathbb{R}^{n}$. Note that here $m$ refers to the input dimension and $n$ to the output dimension of the transformation $Wx + b$. Define $\alpha = \frac{1}{\sqrt{m}}$. Initialize all your weight matrices and biases according to $\mathrm{Unif}(-\alpha, \alpha)$.


Training

For this assignment, use the Adam optimizer from torch.optim. Adam is a more advanced form of gradient descent that combines momentum and learning rate scaling, and it often converges faster than regular gradient descent. You can use either full-batch gradient descent or any form of stochastic gradient descent: you are still using Adam, but you may pass it the full data, a single data point, or a batch of data. Use cross-entropy for the loss function and ReLU for the non-linearity.

Implementing the Neural Networks

    a. [10 points] Let $W_0 \in \mathbb{R}^{h \times d}$, $b_0 \in \mathbb{R}^{h}$, $W_1 \in \mathbb{R}^{k \times h}$, $b_1 \in \mathbb{R}^{k}$ and $\sigma(z) : \mathbb{R} \to \mathbb{R}$ some non-linear activation function. Given some $x \in \mathbb{R}^{d}$, the forward pass of the wide, shallow network can be formulated as:
$$F_1(x) = W_1\, \sigma(W_0 x + b_0) + b_1$$
Use $h = 64$ for the number of hidden units and choose an appropriate learning rate. Train the network until it reaches 99% accuracy on the training data and provide a training plot (loss vs. epoch). Finally, evaluate the model on the test data and report both the accuracy and the loss.
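A minimal PyTorch sketch of this wide, shallow network, written under the stated constraint of using only torch.nn.functional.relu and torch.nn.functional.cross_entropy from torch.nn. The helper names, batch size, learning rate, and the assumed tensors `X_train` (shape $N \times 784$) and `y_train` (integer labels) are our own placeholders; the deeper network in part b follows the same pattern with one extra hidden layer.

```python
import math
import torch
import torch.nn.functional as F

d, h, k = 784, 64, 10

def unif_init(shape, fan_in):
    """Unif(-a, a) with a = 1/sqrt(m), where m is the input dimension (see Weight Initialization)."""
    a = 1.0 / math.sqrt(fan_in)
    return (2 * a * torch.rand(shape) - a).requires_grad_(True)

W0, b0 = unif_init((h, d), d), unif_init((h,), d)
W1, b1 = unif_init((k, h), h), unif_init((k,), h)
params = [W0, b0, W1, b1]

def forward(x):
    """F1(x) = W1 ReLU(W0 x + b0) + b1, applied to a batch of rows."""
    return F.relu(x @ W0.T + b0) @ W1.T + b1

optimizer = torch.optim.Adam(params, lr=1e-3)   # learning rate is a placeholder

def train_epoch(X_train, y_train, batch_size=128):
    """One epoch of mini-batch training; returns the average loss over batches."""
    perm = torch.randperm(len(X_train))
    losses = []
    for i in range(0, len(X_train), batch_size):
        idx = perm[i:i + batch_size]
        loss = F.cross_entropy(forward(X_train[idx]), y_train[idx])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return sum(losses) / len(losses)
```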

    b. [10 points] Let $W_0 \in \mathbb{R}^{h_0 \times d}$, $b_0 \in \mathbb{R}^{h_0}$, $W_1 \in \mathbb{R}^{h_1 \times h_0}$, $b_1 \in \mathbb{R}^{h_1}$, $W_2 \in \mathbb{R}^{k \times h_1}$, $b_2 \in \mathbb{R}^{k}$ and $\sigma(z) : \mathbb{R} \to \mathbb{R}$ some non-linear activation function. Given some $x \in \mathbb{R}^{d}$, the forward pass of the network can be formulated as:
$$F_2(x) = W_2\, \sigma(W_1\, \sigma(W_0 x + b_0) + b_1) + b_2$$
Use $h_0 = h_1 = 32$ and perform the same steps as in part a.

    c. [5 points] Compute the total number of parameters of each network and report them. Then compare the number of parameters as well as the test accuracies the networks achieved. Is one of the approaches (wide, shallow vs. narrow, deeper) better than the other? Give an intuition for why or why not.

Using PyTorch: For your solution, you may not use any functionality from the torch.nn module except for torch.nn.functional.relu and torch.nn.functional.cross_entropy. You must implement the networks $F$ from scratch. For starter code and a tutorial on PyTorch, refer to the section material here and B.4 on the previous homework.


PCA

Let's do PCA on the MNIST dataset and reconstruct the digits in the dimensionality-reduced PCA basis.

You will compute your PCA basis using the training dataset only and evaluate the quality of the basis on the test set, similar to the k-means reconstructions above. The 50,000 training examples are each of size $28 \times 28$, so begin by flattening each example into a vector to obtain $X_{\mathrm{train}} \in \mathbb{R}^{50{,}000 \times d}$ and $X_{\mathrm{test}} \in \mathbb{R}^{10{,}000 \times d}$ for $d := 784$.

A6. Let $\mu \in \mathbb{R}^{d}$ denote the average of the training examples in $X_{\mathrm{train}}$, i.e., $\mu = \frac{1}{50{,}000} X_{\mathrm{train}}^\top \mathbf{1}$. Now let $\Sigma = (X_{\mathrm{train}} - \mathbf{1}\mu^\top)^\top (X_{\mathrm{train}} - \mathbf{1}\mu^\top)/50{,}000$ denote the sample covariance matrix of the training examples, and let $\Sigma = U D U^\top$ denote the eigenvalue decomposition of $\Sigma$.
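A short NumPy sketch of this setup; `X_train` is the flattened $50{,}000 \times 784$ training matrix described above (the name is our own placeholder):

```python
import numpy as np

n_train = X_train.shape[0]                 # 50,000
mu = X_train.mean(axis=0)                  # average training example, shape (784,)
Xc = X_train - mu                          # centered data, i.e. X_train - 1 mu^T
Sigma = Xc.T @ Xc / n_train                # sample covariance matrix, shape (784, 784)

# Sigma is symmetric, so eigh applies; it returns eigenvalues in ascending order.
eigvals, U = np.linalg.eigh(Sigma)
eigvals, U = eigvals[::-1], U[:, ::-1]     # reorder so eigvals[0] is the largest
```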

    a. [2 points] If $\lambda_i$ denotes the $i$th largest eigenvalue of $\Sigma$, what are the eigenvalues $\lambda_1$, $\lambda_2$, $\lambda_{10}$, $\lambda_{30}$, and $\lambda_{50}$? What is the sum of eigenvalues $\sum_{i=1}^{d} \lambda_i$?

    b. [5 points] Any example $x \in \mathbb{R}^{d}$ (including those from either the training or test set) can be approximated using just $\mu$ and the first $k$ eigenvalue-eigenvector pairs, for any $k = 1, 2, \ldots, d$. For any $k$, provide a formula for computing this approximation.

    c. [5 points] Using this approximation, plot the reconstruction error from $k = 1$ to 100 (the X-axis is $k$ and the Y-axis is the mean squared reconstruction error) on the training set and the test set (using the $\mu$ and the basis learned from the training set). On a separate plot, plot $1 - \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{d} \lambda_i}$ from $k = 1$ to 100.
    d. [3 points] Now let us get a sense of what the top PCA directions are capturing. Display the first 10 eigenvectors as images, and provide a brief interpretation of what you think they capture.

    e. [3 points] Finally, visualize a set of reconstructed digits from the training set for different values of $k$. In particular, provide the reconstructions for digits 2, 6, 7 with values $k = 5, 15, 40, 100$ (just choose an image of each digit arbitrarily). Show the original image side-by-side with its reconstruction. Provide a brief interpretation, in terms of your perceptions of the quality of these reconstructions and the dimensionality you used.
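For parts b, c, and e, a standard rank-$k$ PCA reconstruction is $\widehat{x} = \mu + U_k U_k^\top (x - \mu)$, where $U_k$ holds the top $k$ eigenvectors. A minimal sketch continuing the setup above (it reuses `mu`, `U`, `X_train`, and `X_test`; the error here averages over both examples and pixels, which is one reasonable reading of "mean squared reconstruction error"):

```python
def reconstruct(X, mu, U, k):
    """Rank-k PCA reconstruction mu + U_k U_k^T (x - mu), applied to each row of X."""
    Uk = U[:, :k]                          # top-k eigenvectors as columns
    return mu + (X - mu) @ Uk @ Uk.T

def recon_mse(X, X_hat):
    """Mean squared reconstruction error over all examples and pixels."""
    return float(((X - X_hat) ** 2).mean())

ks = range(1, 101)
train_curve = [recon_mse(X_train, reconstruct(X_train, mu, U, k)) for k in ks]
test_curve = [recon_mse(X_test, reconstruct(X_test, mu, U, k)) for k in ks]
```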
















































