Homework 6 Solution

Version 2







Instructions.




Homework is due Tuesday, April 30, at 11:59pm; no late homework accepted. Everyone must submit individually on Gradescope under hw6 and hw6code.




The "written" submission at hw6 must be typed, and submitted in any format Gradescope accepts (to be safe, submit a PDF). You may use LaTeX, Markdown, Google Docs, MS Word, whatever you like; but it must be typed!




When submitting at hw6, Gradescope will ask you to mark out boxes around each of your answers; please do this precisely!




Please make sure your NetID is clear and large on the first page of the homework.




Your solution must be written in your own words. Please see the course webpage for full academic integrity information. Briefly, you may have high-level discussions with at most 3 classmates, whose NetIDs you should place on the first page of your solutions, and you should cite any external reference you use; despite all this, your solution must be written in your own words.




We reserve the right to reduce the auto-graded score for hw6code if we detect funny business (e.g., rather than implementing an algorithm, you keep re-submitting the assignment to the auto-grader, eventually completing a binary search for the answers).




There are no regrade requests on hw6code, which is the code auto-grader; however, you can re-submit and re-grade as many times as you like before the deadline! Start early and report any issues on Piazza!




Methods and functions in the template and utility code include docstrings to describe the inputs and outputs. The autograder relies on correct implementations of these methods. Follow the docstrings to avoid failing tests.































































k-means.



Recall the k-means problem with $n$ data points $S = (x_i)_{i=1}^n$ in $\mathbb{R}^d$, $k$ centers $(\mu_j)_{j=1}^k$ in $\mathbb{R}^d$, and corresponding clusters $(S_j)_{j=1}^k$. The objective of k-means is then to minimize the cost
\[
\phi_S(\mu_1, \ldots, \mu_k) = \sum_{i=1}^{n} \min_{j \in \{1, \ldots, k\}} \|x_i - \mu_j\|^2.
\]

In this problem you will develop an alternate formulation of this cost function in terms of pairwise distances.




Let $S_z \subseteq S$ be the cluster induced by using a particular point $z \in \mathbb{R}^d$ as a center. The cost of using any point $z$ as a center is then given by



\[
\phi_{S_z}(z) = \sum_{x \in S_z} \|x - z\|^2.
\]




(a) Let $\mu(S_z)$ be the sample mean of the cluster $S_z$. Prove that the cost $\phi_{S_z}(z)$ is equivalent to




\[
\phi_{S_z}(z) = \phi_{S_z}(\mu(S_z)) + |S_z|\,\|\mu(S_z) - z\|^2.
\]




(b) Show that
\[
\phi_{S_j}(\mu_j) = \frac{1}{2|S_j|} \sum_{a, b \in S_j} \|a - b\|^2.
\]


Conclude that solving the k-means problem is equivalent to solving
\[
\min_{S_1, \ldots, S_k} \sum_{j=1}^{k} \frac{1}{2|S_j|} \sum_{a, b \in S_j} \|a - b\|^2.
\]
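As an optional numerical sanity check of the two identities above (not a substitute for the written proofs), the following sketch assumes NumPy and uses arbitrary random data; all names are illustrative.

import numpy as np

rng = np.random.default_rng(0)
S = rng.normal(size=(20, 3))   # a small random cluster S_z in R^3
z = rng.normal(size=3)         # an arbitrary candidate center
mu = S.mean(axis=0)            # the sample mean mu(S_z)

def cost(points, center):
    """phi(center) = sum of squared distances from the points to the center."""
    return np.sum(np.linalg.norm(points - center, axis=1) ** 2)

# Part (a): phi_{S_z}(z) = phi_{S_z}(mu(S_z)) + |S_z| * ||mu(S_z) - z||^2
print(np.isclose(cost(S, z), cost(S, mu) + len(S) * np.linalg.norm(mu - z) ** 2))

# Part (b): phi_{S_j}(mu_j) = (1 / (2|S_j|)) * sum over ordered pairs (a, b) of ||a - b||^2
pairwise = np.sum((S[:, None, :] - S[None, :, :]) ** 2, axis=-1)
print(np.isclose(cost(S, mu), pairwise.sum() / (2 * len(S))))

Both checks should print True for any data set, which is exactly the content of parts (a) and (b).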


















Solution.




(Your solution here.)


















































































Wasserstein Distance.



Consider two discrete distributions with weights $(\alpha_i)_{i=1}^n$ and $(\beta_j)_{j=1}^m$ on points $(x_i)_{i=1}^n$ and $(z_j)_{j=1}^m$. The Wasserstein distance between these two distributions (let's call them $\mu$ and $\nu$) is




m
X X

W ( ; ) = max if(xi) jf(zj):

kfkLip 1

i=1 j=1




(a) Suppose $n = m$ and $\alpha_i = \beta_i = 1/n$, meaning both distributions are uniform. Show that for any permutation $\pi$ of $(1, \ldots, n)$,
\[
W(\mu, \nu) \le \max_{i} \|x_i - z_{\pi(i)}\|.
\]
Note that this implies $W(\mu, \nu) \le \min_{\pi} \max_{i} \|x_i - z_{\pi(i)}\|$.














(b) Choose $((\alpha_i, x_i))_{i=1}^n$ and $((\beta_j, z_j))_{j=1}^m$ with $m = n$ so that
\[
0 < W(\mu, \nu) = \min_{\pi} \max_{i} \|x_i - z_{\pi(i)}\|.
\]














(c) Choose $((\alpha_i, x_i))_{i=1}^n$ and $((\beta_j, z_j))_{j=1}^m$ with $m = n$ so that
\[
0 < W(\mu, \nu) \le \frac{1}{100} \min_{\pi} \max_{i} \|x_i - z_{\pi(i)}\|.
\]
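As an optional numerical illustration (not part of the required proofs), the sketch below assumes SciPy and compares the one-dimensional Wasserstein distance against the permutation bound from part (a) on small random point sets; the data and names are arbitrary.

import itertools
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
n = 5
x = rng.normal(size=n)   # support of mu, uniform weights 1/n
z = rng.normal(size=n)   # support of nu, uniform weights 1/n

# One-dimensional Wasserstein distance between the two uniform discrete distributions.
w = wasserstein_distance(x, z)

# min over permutations pi of max_i |x_i - z_{pi(i)}|, as in part (a).
bound = min(
    max(abs(x[i] - z[pi[i]]) for i in range(n))
    for pi in itertools.permutations(range(n))
)

print(w, bound, w <= bound + 1e-12)   # the inequality from part (a) should hold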















Solution. (Your solution here.)






















































































































Boosting.



In this problem we will consider boosting applied to interval classifiers on the real line. An interval classifier has the form $h(x) := \mathbf{1}[a \le x \le b]$; let $\mathcal{H}$ denote all such classifiers (meaning for all $a \le b$). Boosting therefore outputs a function of the form




m m

X X

g(x) = jhj(x) = j 1[aj x bj]:




j=1 j=1




For all parts of this problem let $(x_i, y_i)_{i=1}^n$ be a data set of $n$ points $x_i \in \mathbb{R}$ along with associated labels $y_i \in \{-1, +1\}$. Assume that the $x_i$ are in sorted order and distinct, meaning $x_i < x_{i+1}$.

(a) Let $(q_1, \ldots, q_n)$ be any weights on the training set, meaning $q_i \ge 0$ and $\sum_i q_i = 1$. Show that
\[
\min_{h \in \mathcal{H}} \sum_{i=1}^{n} q_i \mathbf{1}[2h(x_i) - 1 \ne y_i] \le \min\left( \sum_{i \in \{1, \ldots, n\} : y_i > 0} q_i,\ \sum_{i \in \{1, \ldots, n\} : y_i < 0} q_i \right).
\]











Remark. This calculation is related to the "weak learning assumption" discussed in lecture. The only difference is that these predictors map to $\{0, 1\}$, rather than $\{-1, +1\}$.

(b) Show that
\[
\min_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}[2h(x_i) - 1 \ne y_i] \le \frac{n - L}{2n},
\]
where $L$ is the length of the longest contiguous subsequence of examples having the same label, meaning $y_j = y_{j+1} = \cdots = y_{j+L-1}$ for some $j$.
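The bound in part (b) can be checked by brute force on a tiny random data set; this is an optional sketch assuming NumPy, where interval endpoints are restricted to the data points (which suffices for the minimum) and the interval containing no data points is handled separately.

import itertools
import numpy as np

rng = np.random.default_rng(1)
n = 8
x = np.sort(rng.normal(size=n))            # sorted, distinct points
y = rng.choice([-1, 1], size=n)            # labels in {-1, +1}

# Error rate of the best interval classifier; 2h(x)-1 maps {0,1} to {-1,+1}.
best = min(
    np.mean((2 * ((x >= a) & (x <= b)).astype(int) - 1) != y)
    for a, b in itertools.combinations_with_replacement(x, 2)
)
best = min(best, np.mean(y == 1))          # an interval containing no data points predicts -1 everywhere

L = max(len(list(run)) for _, run in itertools.groupby(y))   # longest run of equal labels
print(best, (n - L) / (2 * n), best <= (n - L) / (2 * n) + 1e-12)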




(c) Show that there exists an integer $m$, reals $(\alpha_1, \ldots, \alpha_m)$, and interval classifiers $(h_1, \ldots, h_m)$ with $h_j \in \mathcal{H}$ so that, for every $i$,
\[
y_i = \sum_{j=1}^{m} \alpha_j h_j(x_i).
\]
In other words, there exists a perfect boosted interval classifier.
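One concrete way part (c) can be realized (a sketch of a possible construction, not necessarily the intended solution): use a single interval covering all points with weight $-1$, then add an interval with weight $2$ over each maximal run of $+1$ labels. The snippet below, assuming NumPy, builds and verifies such a combination on a toy data set.

import numpy as np

def perfect_boosted_intervals(x, y):
    """Return (alpha, a, b) triples with sum_j alpha_j * 1[a_j <= x_i <= b_j] = y_i,
    given sorted, distinct x and labels y in {-1, +1}."""
    terms = [(-1.0, x[0], x[-1])]            # baseline: predict -1 on every point
    i, n = 0, len(x)
    while i < n:
        if y[i] == 1:                        # start of a maximal run of +1 labels
            j = i
            while j + 1 < n and y[j + 1] == 1:
                j += 1
            terms.append((2.0, x[i], x[j]))  # flip that run from -1 to +1
            i = j + 1
        else:
            i += 1
    return terms

x = np.array([0.1, 0.5, 0.9, 1.3, 2.0, 2.7])
y = np.array([-1, 1, 1, -1, 1, -1])
terms = perfect_boosted_intervals(x, y)
g = sum(a * ((lo <= x) & (x <= hi)) for a, lo, hi in terms)
print(np.array_equal(g, y))   # True: the boosted sum reproduces every label exactly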










Solution. (Your solution here.)











































































Variational Autoencoders.



In this problem you will implement a Variational Autoencoder (VAE) to model points sampled from an unknown distribution. This will be done by constructing an encoder network and a decoder network. The encoder network $f_{\mathrm{enc}} : X \subseteq \mathbb{R}^2 \to \mathbb{R}^h \times \mathbb{R}^h$ takes as input a point $x$ from the input space and outputs parameters $(\mu, v)$ where $v = \log \sigma^2$. The decoder network $f_{\mathrm{dec}} : \mathbb{R}^h \to \mathbb{R}^2$ takes as input a latent vector $z \sim \mathcal{N}(\mu, \sigma^2)$ and outputs an element $\hat{x} \in \mathbb{R}^2$ that we would hope is similar to members of the input space $X$. You will train this model by minimizing the (regularized) empirical risk
\[
\widehat{R}_{\mathrm{VAE}}(f) = \frac{1}{n} \sum_{i=1}^{n} \Big[ \ell\big(f_{\mathrm{dec}}(f_{\mathrm{enc}}(x_i)), x_i\big) + \lambda\, \mathrm{KL}\big(\mathcal{N}(\mu(x_i), \exp(v(x_i)/2)),\ \mathcal{N}(0, I)\big) \Big].
\]

(a) Let $\Sigma = \mathrm{diag}(\sigma^2)$. In your written submission show that
\[
\mathrm{KL}\big(\mathcal{N}(\mu, \Sigma),\ \mathcal{N}(0, I)\big) = \frac{1}{2}\left[ \|\mu\|^2 + \sum_{j=1}^{h} \sigma_j^2 - h - \sum_{j=1}^{h} \log \sigma_j^2 \right],
\]


































where $\mathrm{KL}(p, q) = \int p(x) \ln \frac{p(x)}{q(x)} \, dx$ is the KL divergence between two densities $p, q$. You may use the fact that the KL divergence between two $h$-dimensional normal distributions $\mathcal{N}(\mu_0, \Sigma_0), \mathcal{N}(\mu_1, \Sigma_1)$ is given by




\[
\mathrm{KL}\big(\mathcal{N}(\mu_0, \Sigma_0),\ \mathcal{N}(\mu_1, \Sigma_1)\big) = \frac{1}{2}\left( \mathrm{tr}(\Sigma_1^{-1} \Sigma_0) + (\mu_1 - \mu_0)^\top \Sigma_1^{-1} (\mu_1 - \mu_0) - h + \ln \frac{|\Sigma_1|}{|\Sigma_0|} \right).
\]
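As an optional numerical check of the claimed closed form (assuming PyTorch; not a substitute for the written derivation), one can compare it against torch.distributions on random diagonal Gaussians:

import torch
from torch.distributions import MultivariateNormal, kl_divergence

h = 4
mu = torch.randn(h)
sigma2 = torch.rand(h) + 0.1                      # diagonal variances sigma_j^2

p = MultivariateNormal(mu, covariance_matrix=torch.diag(sigma2))
q = MultivariateNormal(torch.zeros(h), covariance_matrix=torch.eye(h))

# Closed form: (1/2) [ ||mu||^2 + sum_j sigma_j^2 - h - sum_j log sigma_j^2 ]
closed = 0.5 * (mu.pow(2).sum() + sigma2.sum() - h - torch.log(sigma2).sum())

print(torch.allclose(kl_divergence(p, q), closed))   # True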










(b) Use the empirical risk discussed above to implement a VAE in the class VAE. Use ReLU activations between each layer, except use a sigmoid on the last layer of the decoder. Use the Adam optimizer to optimize in the step() function. Make use of the PyTorch library for this. Use torch.optim.Adam(); there is no need to implement it yourself. Please refer to the docstrings in hw6.py for more implementation details.
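For orientation, here is a minimal sketch of what such a VAE class might look like; the layer widths, constructor arguments, and exact method signatures are assumptions on my part, and the actual interface is fixed by the docstrings in hw6.py, which take precedence.

import torch
import torch.nn as nn

class VAE(nn.Module):
    """Illustrative VAE for 2-D data with latent dimension h; not the official template."""
    def __init__(self, h=6, lam=5e-5, lrate=0.01, loss_fn=nn.MSELoss(reduction="sum")):
        super().__init__()
        self.lam, self.loss_fn = lam, loss_fn
        self.enc = nn.Sequential(nn.Linear(2, 50), nn.ReLU(), nn.Linear(50, 50), nn.ReLU())
        self.mu_head = nn.Linear(50, h)        # outputs mu(x)
        self.logvar_head = nn.Linear(50, h)    # outputs v(x) = log sigma^2(x)
        self.dec = nn.Sequential(
            nn.Linear(h, 50), nn.ReLU(), nn.Linear(50, 50), nn.ReLU(),
            nn.Linear(50, 2), nn.Sigmoid(),    # sigmoid only on the decoder's last layer
        )
        self.opt = torch.optim.Adam(self.parameters(), lr=lrate)

    def forward(self, x):
        e = self.enc(x)
        mu, logvar = self.mu_head(e), self.logvar_head(e)
        z = mu + torch.exp(logvar / 2) * torch.randn_like(mu)   # reparameterization trick
        return mu, logvar, self.dec(z)

    def step(self, x):
        """One Adam step on the regularized empirical risk; returns the scalar loss."""
        self.opt.zero_grad()
        mu, logvar, xhat = self(x)
        recon = self.loss_fn(xhat, x) / x.shape[0]
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1 - logvar).sum() / x.shape[0]
        loss = recon + self.lam * kl
        loss.backward()
        self.opt.step()
        return loss.item()

A fit loop would then repeatedly call net.step(X) on the training data and record the returned losses for the risk-vs-iteration plot; again, defer to the hw6.py docstrings for the required signatures and return values.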



(c) Implement the fit function using the net.step() function from the VAE class. See the docstrings in hw6.py for more details.



(d) Fit a VAE on the data generated by generate_data in hw6_utils.py. Use a learning rate of $0.01$, latent space dimension $h = 6$, KL-divergence scaling factor $\lambda = 5 \times 10^{-5}$, and train for 8000 iterations. Use least squares as the loss, that is, let $\ell(f(x), \hat{x}) = \|f(x) - \hat{x}\|_2^2$. Include separate plots of each of the following in your written submission:




Your empirical risk $\widehat{R}_{\mathrm{VAE}}$ on the training data vs. iteration count;

The data points $(x_i)_{i=1}^n$ along with their encoded and decoded approximations $\hat{x} = f_{\mathrm{dec}}(f_{\mathrm{enc}}(x))$;

The data points $(x_i)_{i=1}^n$ along with their encoded and decoded approximations $\hat{x}$, and $n$ generated points $f_{\mathrm{dec}}(z)$ where $z \sim \mathcal{N}(0, I)$.



After you are done training, save your neural network to a file using torch.save(model.cpu().state_dict(), "vae.pb"). You will submit this file to the autograder with your code submission.




(e) What is the difference between the $\hat{x}$ and $f_{\mathrm{dec}}(z)$ in general? Why are they different in the plots?

(f) Repeat part (d), except this time use L1 as your loss, that is, let $\ell(f(x), \hat{x}) = \|f(x) - \hat{x}\|_1 = \sum_{j=1}^{2} |x_j - \hat{x}_j|$. Again, be sure to include the plots in your written submission.




(g) Fit a VAE with $\lambda \in \{1, 0.01, 0.001\}$ and L1 loss on the same data again, but this time only include the last plot from part (d). Discuss your results. Do you expect the VAE to match the true distribution better or worse as you increase $\lambda$? Out of all of the parameters you tried, including $5 \times 10^{-5}$, which seems to give the right balance? Be sure to provide a brief justification for your choice.









Solution.










Naive Bayes (Extra credit!).



Let $X = (X_1, \ldots, X_d)$ be a vector of $d$ binary random variables, labeled by a boolean function $f : \{0,1\}^d \to \{0,1\}$. Naive Bayes proceeds by forming estimates $\widehat{\Pr}$ of various probabilities, and predicting with






\[
\hat{f}(x) = \operatorname*{arg\,max}_{y}\ \widehat{\Pr}(Y = y) \prod_{i=1}^{d} \widehat{\Pr}(X_i = x_i \mid Y = y).
\]


(a) Suppose $f(x) = \mathbf{1}\big[\sum_{j=1}^{d} x_j \ge \frac{d}{2}\big]$, and that the various $\widehat{\Pr}$ estimates in $\hat{f}$ are exact. Show that the naive Bayes predictor $\hat{f}(x)$ classifies perfectly in this case. For this problem you can assume $d$ is odd.

Hint. Use symmetry arguments to make computing the probabilities easier.
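An optional brute-force check of part (a) for small odd $d$, assuming $X$ is uniform over $\{0,1\}^d$ (the problem's exact distributional assumption is not fully legible here, so treat that as a working assumption); the exact probabilities are then computed by enumeration.

import itertools
import numpy as np

d = 3                                          # small odd d; assume X is uniform on {0,1}^d
X = np.array(list(itertools.product([0, 1], repeat=d)))
y = (X.sum(axis=1) >= d / 2).astype(int)       # f(x) = 1[sum_j x_j >= d/2]

def naive_bayes_predict(x):
    """Naive Bayes with exact probabilities under the uniform assumption."""
    scores = {}
    for label in (0, 1):
        mask = (y == label)
        prior = mask.mean()                                        # Pr(Y = label)
        cond = [np.mean(X[mask, i] == x[i]) for i in range(d)]     # Pr(X_i = x_i | Y = label)
        scores[label] = prior * np.prod(cond)
    return max(scores, key=scores.get)

print(all(naive_bayes_predict(x) == yi for x, yi in zip(X, y)))   # True: perfect classification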




(b) Under the same setup from part (a), construct a boolean function $f : \{0,1\}^3 \to \{0,1\}$ for which naive Bayes will be unable to correctly classify every binary vector $x \in \{0,1\}^3$. Be sure to verify that your construction works.









Solution. (Your solution here.)
