Homework 4
Version 2
Instructions.
Homework is due Tuesday, April 2, at 11:59pm; no late homework accepted.
Everyone must submit individually at gradescope under hw4. (There is no hw4code!)
The "written" submission at hw4 must be typed, and submitted in any format gradescope accepts (to be safe, submit a PDF). You may use LaTeX, markdown, google docs, MS word, whatever you like; but it must be typed!
When submitting at hw4, gradescope will ask you to mark out boxes around each of your answers; please do this precisely!
Please make sure your NetID is clear and large on the first page of the homework.
Your solution must be written in your own words. Please see the course webpage for full academic integrity information. Briefly, you may have high-level discussions with at most 3 classmates, whose NetIDs you should place on the first page of your solutions, and you should cite any external reference you use; despite all this, your solution must be written in your own words.
VC dimension.
This problem will show that two different classes of predictors have infinite VC dimension.
Hint: to prove $\mathrm{VC}(\mathcal{H}) = \infty$, it is usually most convenient to show $\mathrm{VC}(\mathcal{H}) \ge n$ for all $n$.
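Recall the standard definition being invoked here: $\mathrm{VC}(\mathcal{H}) \ge n$ means there exist $n$ points shattered by $\mathcal{H}$, i.e.
\[
\exists\, x_1, \dots, x_n \ \text{such that} \ \forall (y_1, \dots, y_n) \in \{-1, +1\}^n \ \exists\, h \in \mathcal{H} \ \text{with} \ h(x_i) = y_i \ \text{for all } i,
\]
so exhibiting such a set for every $n$ yields $\mathrm{VC}(\mathcal{H}) = \infty$.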
(a)
Let $\mathcal{F} := \{\, x \mapsto 2 \cdot \mathbb{1}[x \in C] - 1 \,:\, C \subseteq \mathbb{R}^d \text{ is convex} \,\}$ denote the set of all classifiers whose decision boundary is a convex subset of $\mathbb{R}^d$, for $d \ge 2$. Prove $\mathrm{VC}(\mathcal{F}) = \infty$.
Hint: Consider data examples on the unit sphere $\{\, x \in \mathbb{R}^d : \|x\| = 1 \,\}$.
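As an optional numerical illustration of this hint (not a proof, and not required for the submission), the following sketch places a few points on the unit circle and checks that every labeling is realized by taking the convex set to be the hull of the positively labeled points. It assumes numpy and scipy are available; the choice of $n = 5$ and the LP-based membership test are illustrative choices.

    import itertools
    import numpy as np
    from scipy.optimize import linprog

    def in_hull(p, S):
        # feasibility LP: p lies in conv(S) iff some convex combination of rows of S equals p
        if len(S) == 0:
            return False
        A_eq = np.vstack([S.T, np.ones(len(S))])
        b_eq = np.append(p, 1.0)
        res = linprog(np.zeros(len(S)), A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * len(S))
        return res.success

    n = 5
    angles = 2 * np.pi * np.arange(n) / n
    pts = np.column_stack([np.cos(angles), np.sin(angles)])   # n points on the unit circle

    for labels in itertools.product([-1, 1], repeat=n):
        C = pts[np.array(labels) == 1]                        # convex set: hull of the +1 points
        preds = [1 if in_hull(p, C) else -1 for p in pts]
        assert preds == list(labels)                          # this labeling is realized
    print(f"all {2**n} labelings of {n} circle points realized by convex sets")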
(b)
Given $x \in \mathbb{R}$, let $\mathrm{sgn}$ denote the sign of $x$: $\mathrm{sgn}(x) = 1$ if $x \ge 0$, while $\mathrm{sgn}(x) = -1$ if $x < 0$.
Let $\sigma > 0$ be given, and define $\mathcal{G}_\sigma$ to be the set of (signs of) all RBF classifiers with bandwidth $\sigma$, meaning
\[
\mathcal{G}_\sigma := \left\{\, x \mapsto \mathrm{sgn}\!\left( \sum_{i=1}^{m} \alpha_i \exp\!\left( -\|x - x_i\|^2 / (2\sigma^2) \right) \right) \,:\, m \in \mathbb{Z}_{\ge 0},\ x_1, \dots, x_m \in \mathbb{R}^d,\ \alpha \in \mathbb{R}^m \,\right\}.
\]
Prove $\mathrm{VC}(\mathcal{G}_\sigma) = \infty$.
Remark: the value of $\mathrm{sgn}(0)$ is not important: you have the freedom to choose some nice data examples and avoid this case.
Hint: remember that in hw3 it was proved that if $\sigma$ is small enough, the RBF kernel SVM is close to the 1-nearest-neighbor predictor. In this problem $\sigma$ is fixed, but you have the freedom to choose the data examples. If the distance between data examples is large enough, the RBF kernel SVM can still be close to the 1-nearest-neighbor predictor. Make sure to give an explicit construction of such a dataset.
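To make the class $\mathcal{G}_\sigma$ concrete, here is a minimal sketch (assuming numpy; the centers, coefficients, spacing, and $\sigma = 1$ are arbitrary illustrative choices, not part of the required construction) showing that when the centers are far apart relative to $\sigma$, each center is dominated by its own term, so the classifier agrees with 1-nearest-neighbor on those points.

    import numpy as np

    def rbf_classifier(centers, alpha, sigma):
        # a member of G_sigma: x -> sgn( sum_i alpha_i * exp(-||x - x_i||^2 / (2 sigma^2)) )
        def predict(x):
            sq_dists = np.sum((centers - x) ** 2, axis=1)
            score = alpha @ np.exp(-sq_dists / (2 * sigma ** 2))
            return 1.0 if score >= 0 else -1.0
        return predict

    sigma = 1.0
    spacing = 100.0                                    # much larger than sigma
    centers = spacing * np.arange(6.0).reshape(-1, 1)  # 6 widely separated points in R^1
    alpha = np.array([1.0, -1.0, 1.0, 1.0, -1.0, -1.0])
    f = rbf_classifier(centers, alpha, sigma)
    print([f(c) for c in centers])  # [1.0, -1.0, 1.0, 1.0, -1.0, -1.0]: each center gets its own sign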
Solution. (Your solution here.)
Rademacher complexity of linear predictors.
Let examples $(x_1, \ldots, x_n)$ be given with $\|x_i\| \le R$, along with the class of linear functions $\mathcal{F} := \{\, x \mapsto w^{\mathsf{T}} x \,:\, \|w\| \le W \,\}$.
The goal in this problem is to show $\mathrm{Rad}(\mathcal{F}) \le RW/\sqrt{n}$.
(a) For a fixed sign vector $\varepsilon \in \{-1, +1\}^n$, define $x_\varepsilon := \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i x_i$. Show
\[
\max_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i f(x_i) \le W \, \|x_\varepsilon\|.
\]
Hint: Cauchy-Schwarz!
(b) Show $\mathbb{E}_{\varepsilon} \|x_\varepsilon\|^2 \le R^2/n$.
(c) Now combine the pieces to show $\mathrm{Rad}(\mathcal{F}) \le RW/\sqrt{n}$.
Hint: one missing piece is to write $\|\cdot\| = \sqrt{\|\cdot\|^2}$ and use Jensen's inequality.
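As an optional numerical sanity check (assuming numpy; the synthetic data, dimensions, and constants are arbitrary choices), one can estimate the averaged part-(a) upper bound $W \, \mathbb{E}_\varepsilon \|x_\varepsilon\|$ by sampling sign vectors and compare it with the claimed $RW/\sqrt{n}$:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, R, W = 200, 5, 1.0, 3.0
    X = rng.normal(size=(n, d))
    X = R * X / np.linalg.norm(X, axis=1, keepdims=True)   # examples with ||x_i|| = R

    eps = rng.choice([-1.0, 1.0], size=(10_000, n))        # random sign vectors
    x_eps = eps @ X / n                                    # each row is x_eps for one sign vector
    upper = W * np.linalg.norm(x_eps, axis=1).mean()       # Monte Carlo estimate of W * E ||x_eps||
    print(upper, "vs bound", R * W / np.sqrt(n))           # estimate sits below R W / sqrt(n)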
Solution. (Your solution here.)
Generalization bounds for a few linear predictors.
In this problem, it is always assumed that for any $(x, y)$ sampled from the distribution, $\|x\| \le R$ and $y \in \{-1, +1\}$.
Consider the following version of the soft-margin SVM:
\[
\min_{w \in \mathbb{R}^d} \ \frac{\lambda}{2} \|w\|^2 + \frac{1}{n} \sum_{i=1}^{n} \left\langle 1 - w^{\mathsf{T}} x_i y_i \right\rangle_+ \;=\; \frac{\lambda}{2} \|w\|^2 + \widehat{\mathcal{R}}_{\mathrm{hinge}}(w).
\]
Let $\hat{w}$ denote the (unique!) optimal solution, and $\hat{f}(x) = \hat{w}^{\mathsf{T}} x$.
Prove that for any regularization level $\lambda > 0$, with probability at least $1 - \delta$, it holds that
\[
\mathcal{R}(\hat{f}) \le \widehat{\mathcal{R}}(\hat{f}) + R \sqrt{\frac{8}{\lambda n}} + 3 \left( 1 + R \sqrt{2/\lambda} \right) \sqrt{\frac{\ln(2/\delta)}{2n}}.
\]
Hint: use the fact from slide 5/61 of the first ML Theory lecture that $\|\hat{w}\| \le \sqrt{2/\lambda}$, the linear predictor Rademacher complexity bound from the previous problem, and the Rademacher generalization theorem on slide 57 of the final theory lecture.
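As an optional sketch of the ingredients (assuming numpy; the synthetic data, the plain subgradient method with $1/(\lambda t)$ steps, and all constants are illustrative choices, not the course's prescribed approach), one can fit the objective above, check the norm fact from the hint, and evaluate the two deviation terms appearing in the stated bound:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, R, lam, delta = 500, 3, 1.0, 0.1, 0.05
    X = rng.normal(size=(n, d))
    X = R * X / np.linalg.norm(X, axis=1, keepdims=True)    # ||x_i|| <= R
    y = np.sign(X[:, 0] + 1e-12)                            # labels in {-1, +1}

    # subgradient descent on (lam/2)||w||^2 + (1/n) sum_i <1 - w^T x_i y_i>_+
    w = np.zeros(d)
    for t in range(1, 2001):
        margins = y * (X @ w)
        subgrad = lam * w - (y[:, None] * X)[margins < 1].sum(axis=0) / n
        w -= subgrad / (lam * t)

    # the hint's fact holds for the exact minimizer; this iterate is only approximate
    print(np.linalg.norm(w), "<= sqrt(2/lam) =", np.sqrt(2 / lam))

    rad_term = R * np.sqrt(8 / (lam * n))                   # 2 * (R W / sqrt(n)) with W = sqrt(2/lam)
    conc_term = 3 * (1 + R * np.sqrt(2 / lam)) * np.sqrt(np.log(2 / delta) / (2 * n))
    print("deviation terms:", rad_term, conc_term)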
Solution. (Your solution here.)