Assignment #2 Solution

Problem 1 (20 Points)




This problem investigates how changing the error measure can change the result of the learning process.




You have $N$ data points $y_1, \ldots, y_N$ and wish to estimate a 'representative' value.




(a) [10 pts] If your algorithm is to find the hypothesis $h$ that minimizes the in-sample sum of squared deviations,



$$E_{\text{in}}(h) = \sum_{n=1}^{N} (h - y_n)^2,$$




then show that your estimate will be the in-sample mean,




$$h_{\text{mean}} = \frac{1}{N} \sum_{n=1}^{N} y_n.$$
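This is a derivation problem, but a quick grid search over candidate values of $h$ (a sketch with made-up data, not part of the required proof) illustrates that the minimizer of $E_{\text{in}}$ coincides with the sample mean:

```python
# Sketch: numerically check that the sample mean minimizes the sum of squared
# deviations. The data values below are arbitrary, chosen only for illustration.
import numpy as np

y = np.array([3.0, -1.0, 4.0, 1.0, 5.0])                 # made-up sample
h_grid = np.linspace(y.min() - 1, y.max() + 1, 10001)    # candidate hypotheses h
E_in = ((h_grid[:, None] - y[None, :]) ** 2).sum(axis=1) # E_in(h) for each candidate

h_star = h_grid[E_in.argmin()]
print(h_star, y.mean())   # the two values agree (up to grid resolution)
```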







(b) [10 pts] If your algorithm is to find the hypothesis $h$ that minimizes the in-sample sum of absolute deviations,






$$E_{\text{in}}(h) = \sum_{n=1}^{N} |h - y_n|,$$




then show that your estimate will be the in-sample median $h_{\text{med}}$, which is any value for which half the data points are at most $h_{\text{med}}$ and half the data points are at least $h_{\text{med}}$.
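As with part (a), a quick grid search (a sketch with made-up data, not part of the required proof) confirms that a sample median minimizes the sum of absolute deviations:

```python
# Sketch: numerically check that a sample median minimizes the sum of absolute
# deviations. Data values are made up for illustration.
import numpy as np

y = np.array([3.0, -1.0, 4.0, 1.0, 5.0])
h_grid = np.linspace(y.min() - 1, y.max() + 1, 10001)
E_in = np.abs(h_grid[:, None] - y[None, :]).sum(axis=1)  # sum of absolute deviations

h_star = h_grid[E_in.argmin()]
print(h_star, np.median(y))   # should agree (up to grid resolution)
```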




Problem 2 (20 Points)




Consider




$$e_n(\mathbf{w}) = \max\bigl(0,\; 1 - y_n \mathbf{w}^{\mathsf T}\mathbf{x}_n\bigr).$$




(a) [5 pts] Show that $e_n(\mathbf{w})$ is continuous and differentiable except when $y_n = \mathbf{w}^{\mathsf T}\mathbf{x}_n$.



(b) [5 pts] Show that $e_n(\mathbf{w})$ is an upper bound for the “0-1 loss” $[\![\operatorname{sign}(\mathbf{w}^{\mathsf T}\mathbf{x}_n) \neq y_n]\!]$. Thus, $\frac{1}{N}\sum_{n=1}^{N} e_n(\mathbf{w})$ is an upper bound for the in-sample classification error $E_{\text{in}}(\mathbf{w})$.
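A quick random spot-check of this inequality (a sketch with arbitrary random weights and inputs, purely illustrative and not a substitute for the proof):

```python
# Sketch: verify on random examples that the hinge loss upper-bounds the 0-1 loss.
import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    w, x = rng.normal(size=3), rng.normal(size=3)
    y = rng.choice([-1, 1])
    hinge = max(0.0, 1.0 - y * (w @ x))
    zero_one = float(np.sign(w @ x) != y)
    assert hinge >= zero_one
print("hinge loss >= 0-1 loss on all random trials")
```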



(c) [10 pts] Apply stochastic gradient descent on $\frac{1}{N}\sum_{n=1}^{N} e_n(\mathbf{w})$ (ignoring the singular case of $\mathbf{w}^{\mathsf T}\mathbf{x}_n = y_n$) and derive a new perceptron learning algorithm.
Note: $e_n(\mathbf{w})$ corresponds to the “hinge loss” used for maximum-margin classification, most notably for support vector machines (SVMs).
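For reference, a minimal sketch of the SGD update that part (c) leads to, assuming a fixed learning rate $\eta$ and randomly generated toy data (not the official solution): whenever $y_n \mathbf{w}^{\mathsf T}\mathbf{x}_n < 1$, the per-example gradient of $e_n$ is $-y_n \mathbf{x}_n$, giving a perceptron-like update; otherwise the weights are left unchanged.

```python
# Sketch of SGD on the hinge loss (assumed learning rate and toy data).
# When 1 - y_n * w.x_n > 0, the per-example gradient is -y_n * x_n, so the
# update w <- w + eta * y_n * x_n resembles the perceptron learning rule.
import numpy as np

rng = np.random.default_rng(1)
N, d, eta = 100, 2, 0.1
X = np.c_[np.ones(N), rng.uniform(-1, 1, size=(N, d))]   # prepend bias coordinate
w_true = np.array([0.2, 1.0, -1.0])                      # made-up separating weights
y = np.sign(X @ w_true)

w = np.zeros(d + 1)
for epoch in range(100):
    for n in rng.permutation(N):
        if y[n] * (w @ X[n]) < 1:      # hinge loss active (differentiable here)
            w += eta * y[n] * X[n]     # perceptron-like stochastic gradient step
print("final weights:", w)
```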




Problem 3 (20 Points)




There are a number of bounds on the generalization error $\epsilon$, all holding with probability at least $1 - \delta$.




(a) Original VC bound:

$$\epsilon \le \sqrt{\frac{8}{N}\ln\frac{4\,m_{\mathcal H}(2N)}{\delta}}$$
(b) Rademacher penalty bound:

$$\epsilon \le \sqrt{\frac{2\ln\bigl(2N\,m_{\mathcal H}(N)\bigr)}{N}} + \sqrt{\frac{2}{N}\ln\frac{1}{\delta}} + \frac{1}{N}$$
(c) Parrondo and van den Broek:

$$\epsilon \le \sqrt{\frac{1}{N}\left(2\epsilon + \ln\frac{6\,m_{\mathcal H}(2N)}{\delta}\right)}$$
(d) Devroye:

$$\epsilon \le \sqrt{\frac{1}{2N}\left(4\epsilon(1+\epsilon) + \ln\frac{4\,m_{\mathcal H}(N^2)}{\delta}\right)}$$
Note that (c) and (d) are implicit bounds in $\epsilon$. Fix $d_{\mathrm{VC}} = 50$ and $\delta = 0.05$. Plot these bounds as a function of $N$. Which is the best?
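One possible way to produce the requested plot (a sketch, not the official solution): it assumes the common simplification $m_{\mathcal H}(N) \approx N^{d_{\mathrm{VC}}}$ and solves the implicit bounds (c) and (d) by fixed-point iteration.

```python
# Sketch: compare the four generalization bounds as a function of N.
# Assumes m_H(N) ~= N**d_vc (a standard simplification) and fixed-point
# iteration for the implicit bounds (c) and (d). Requires numpy and matplotlib.
import numpy as np
import matplotlib.pyplot as plt

d_vc, delta = 50, 0.05
N = np.arange(100, 20001, 100).astype(float)

def ln_mH(n):                       # ln m_H(n), assuming m_H(n) ~= n**d_vc
    return d_vc * np.log(n)

vc  = np.sqrt(8.0 / N * (np.log(4.0) + ln_mH(2 * N) - np.log(delta)))
rad = (np.sqrt(2.0 * (np.log(2.0 * N) + ln_mH(N)) / N)
       + np.sqrt(2.0 / N * np.log(1.0 / delta)) + 1.0 / N)

def fixed_point(update, n_iter=1000):
    eps = np.ones_like(N)
    for _ in range(n_iter):
        eps = update(eps)
    return eps

parrondo = fixed_point(lambda e: np.sqrt((2.0 * e + np.log(6.0) + ln_mH(2 * N)
                                          - np.log(delta)) / N))
devroye  = fixed_point(lambda e: np.sqrt((4.0 * e * (1.0 + e) + np.log(4.0)
                                          + ln_mH(N ** 2) - np.log(delta)) / (2.0 * N)))

for eps, label in [(vc, "VC"), (rad, "Rademacher"),
                   (parrondo, "Parrondo & van den Broek"), (devroye, "Devroye")]:
    plt.plot(N, eps, label=label)
plt.xlabel("N"); plt.ylabel("bound on epsilon"); plt.legend(); plt.show()
```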




Problem 4 (20 Points)




The bias-variance decomposition of out-of-sample error is based on squared error measures. Recall that the out-of-sample error is given by

$$E_{\text{out}}\bigl(g^{(\mathcal D)}\bigr) = \mathbb{E}_{\mathbf x}\Bigl[\bigl(g^{(\mathcal D)}(\mathbf x) - f(\mathbf x)\bigr)^2\Bigr] \tag{1}$$




where $\mathbb{E}_{\mathbf x}$ denotes the expected value with respect to the input $\mathbf x$, and the notation $g^{(\mathcal D)}$ makes explicit the dependence of $g$ on the data set $\mathcal D$. From (1) we can remove the dependence on $\mathcal D$ by taking the average:





$$\mathbb{E}_{\mathcal D}\Bigl[E_{\text{out}}\bigl(g^{(\mathcal D)}\bigr)\Bigr] = \mathbb{E}_{\mathcal D}\Bigl[\mathbb{E}_{\mathbf x}\bigl[(g^{(\mathcal D)}(\mathbf x) - f(\mathbf x))^2\bigr]\Bigr] = \mathbb{E}_{\mathbf x}\Bigl[\mathbb{E}_{\mathcal D}\bigl[(g^{(\mathcal D)}(\mathbf x) - f(\mathbf x))^2\bigr]\Bigr].$$

(a) [5 pts] To evaluate $\mathbb{E}_{\mathcal D}\bigl[(g^{(\mathcal D)}(\mathbf x) - f(\mathbf x))^2\bigr]$, we define the ‘average’ hypothesis $\bar g(\mathbf x)$ as

$$\bar g(\mathbf x) \triangleq \mathbb{E}_{\mathcal D}\bigl[g^{(\mathcal D)}(\mathbf x)\bigr]. \tag{2}$$

Now imagine that we have $K$ data sets $\mathcal D_1, \mathcal D_2, \ldots, \mathcal D_K$; then what will be the average hypothesis $\bar g(\mathbf x)$ estimated using these data sets?
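As a concrete illustration of what is being asked (a sketch with a made-up target and learning model, not the assignment's setup), the average hypothesis over $K$ data sets is naturally estimated by averaging the $K$ learned hypotheses pointwise:

```python
# Sketch: estimate g_bar(x) as the pointwise average of hypotheses learned from
# K data sets. Toy setting: a least-squares line through 2 random samples of
# f(x) = sin(pi x), chosen only for illustration.
import numpy as np

rng = np.random.default_rng(2)
K, x_eval = 1000, 0.3

def learn_line(xs, ys):
    a, b = np.polyfit(xs, ys, 1)       # least-squares line a*x + b
    return lambda x: a * x + b

g_values = []
for _ in range(K):                      # one hypothesis per data set D_k
    xs = rng.uniform(-1, 1, size=2)
    ys = np.sin(np.pi * xs)
    g_values.append(learn_line(xs, ys)(x_eval))

g_bar = np.mean(g_values)               # (1/K) * sum_k g^(D_k)(x_eval)
print("estimated g_bar(0.3):", g_bar)
```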




















(b) [10 pts] Let us define

$$\mathrm{var}(\mathbf x) \triangleq \mathbb{E}_{\mathcal D}\Bigl[\bigl(g^{(\mathcal D)}(\mathbf x) - \bar g(\mathbf x)\bigr)^2\Bigr], \quad \text{and} \quad \mathrm{bias}(\mathbf x) \triangleq \bigl(\bar g(\mathbf x) - f(\mathbf x)\bigr)^2.$$
Show that

$$\mathbb{E}_{\mathcal D}\Bigl[\bigl(g^{(\mathcal D)}(\mathbf x) - f(\mathbf x)\bigr)^2\Bigr] = \mathrm{var}(\mathbf x) + \mathrm{bias}(\mathbf x).$$

(c) [5 pts] Show that























$$\mathbb{E}_{\mathcal D}\Bigl[E_{\text{out}}\bigl(g^{(\mathcal D)}\bigr)\Bigr] = \mathrm{bias} + \mathrm{var}$$




where $\mathrm{bias} \triangleq \mathbb{E}_{\mathbf x}[\mathrm{bias}(\mathbf x)]$ and $\mathrm{var} \triangleq \mathbb{E}_{\mathbf x}[\mathrm{var}(\mathbf x)]$.
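A small Monte Carlo sketch (a toy target and a constant-hypothesis model, chosen here only for illustration, not the assignment's setup) that numerically checks the decompositions in parts (b) and (c):

```python
# Sketch: Monte Carlo check that E_D[(g^(D)(x) - f(x))^2] = var(x) + bias(x)
# pointwise, and hence E_D[E_out] ~= bias + var after averaging over x.
import numpy as np

rng = np.random.default_rng(3)
f = lambda x: np.sin(np.pi * x)                 # made-up target
x_grid = np.linspace(-1, 1, 201)
K = 5000                                        # number of simulated data sets

# Learning model: best constant fit to a 2-point data set.
g_matrix = np.empty((K, x_grid.size))
for k in range(K):
    xs = rng.uniform(-1, 1, size=2)
    g_matrix[k, :] = f(xs).mean()

g_bar  = g_matrix.mean(axis=0)
var_x  = ((g_matrix - g_bar) ** 2).mean(axis=0)
bias_x = (g_bar - f(x_grid)) ** 2
lhs_x  = ((g_matrix - f(x_grid)) ** 2).mean(axis=0)

print(np.allclose(lhs_x, var_x + bias_x))            # pointwise decomposition (exact here)
print(lhs_x.mean(), bias_x.mean() + var_x.mean())    # E_D[E_out] vs bias + var
```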




Problem 5 (20 Points)




In support vector machines, the hyperplane $h$ separates the data if and only if it can be represented by weights $(b, \mathbf{w})$ that satisfy










$$\min_{n=1,\ldots,N} y_n(\mathbf{w}^{\mathsf T}\mathbf{x}_n + b) = 1. \tag{3}$$








Consider the data below and a ‘hyperplane’ $(b, \mathbf{w})$ that separates the data.

$$X = \begin{bmatrix} 0 & 0 \\ 2 & 2 \\ 2 & 0 \end{bmatrix}, \qquad \mathbf{y} = \begin{bmatrix} -1 \\ -1 \\ +1 \end{bmatrix}, \qquad \mathbf{w} = \begin{bmatrix} 1.2 \\ -3.2 \end{bmatrix}, \qquad b = -0.5.$$


(a) [10 pts] Compute

$$\rho = \min_{n=1,\ldots,N} y_n(\mathbf{w}^{\mathsf T}\mathbf{x}_n + b).$$




(b) [5 pts] Compute the weights $\frac{1}{\rho}(b, \mathbf{w})$ and show that they satisfy Eq. (3).



(c) [5 pts] Plot both hyperplanes to show that they are the same separator.



Note: This problem gives a concrete example of re-normalizing the weights to show that condition (3) and the following condition are equivalent, as covered in class:




$$y_n(\mathbf{w}^{\mathsf T}\mathbf{x}_n + b) > 0. \tag{4}$$
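A minimal computational sketch for parts (a) and (b), using the data and weights as reconstructed above (verify the numbers against your copy of the assignment):

```python
# Sketch: compute rho for the given separator, rescale the weights by 1/rho,
# and check that the rescaled weights satisfy condition (3).
import numpy as np

X = np.array([[0.0, 0.0], [2.0, 2.0], [2.0, 0.0]])
y = np.array([-1.0, -1.0, +1.0])
w = np.array([1.2, -3.2])
b = -0.5

margins = y * (X @ w + b)
rho = margins.min()                      # part (a)
print("rho =", rho)

b_new, w_new = b / rho, w / rho          # part (b): weights (1/rho)(b, w)
print("min y_n (w'.x_n + b') =", (y * (X @ w_new + b_new)).min())   # should equal 1
```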
