
Problem Set #4 Solution

•  Feel free to talk to other members of the class in doing the homework. I am more concerned that you learn how to solve the problem than that you demonstrate that you solved it entirely on your own. You should, however, write down your solution yourself. Please try to keep the solution brief and clear.

 

•  Please use Piazza first if you have questions about the homework. Also feel free to send us e-mails and come to office hours.

 

•  Please, no handwritten solutions. You will submit your solution manuscript as a single pdf file.

 

•  Please present your algorithms in both pseudocode and English. That is, give a precise formulation of your algorithm as pseudocode and also explain in one or two concise paragraphs what your algorithm does. Be aware that pseudocode is much simpler and more abstract than real code.

 

•  The homework is due at 11:59 PM on the due date. We will be using Compass for collecting the homework assignments. Please submit your solution manuscript as a pdf file via Compass (http://compass2g.illinois.edu). Please do NOT hand in a hard copy of your write-up. Contact the TAs if you are having technical difficulties in submitting the assignment.

 

•  You may not use late submission credit hours for this problem set.

 

1. [PAC Learning - 35 points] In this problem, we are going to prove that the class of two concentric circles in the plane is PAC learnable. This hypothesis class is formally defined as H_2cc = {h_{r1,r2} : r1, r2 ∈ R+ and r1 < r2}, where

h_{r1,r2}(x) = { 1   if r1 ≤ ||x||_2 ≤ r2
               { 0   otherwise

 

For this problem, assume a sample of m points is drawn i.i.d. from some distribution D and that the labels are provided by some target function h*_{r1*,r2*} ∈ H_2cc.

 

(a) [5 points] Describe an algorithm that takes a training sample of m points as described above and returns a hypothesis ĥ_{r1,r2} ∈ H_2cc that makes zero mistakes on the training sample. To simplify the analysis that follows (in (b)), represent your hypothesis as two circles with radii r1, r2 such that r1* ≤ r1 < r2 ≤ r2*.
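A natural choice is the tightest fit to the positive points: take r1 and r2 to be the smallest and largest norms among the positively labeled examples. A minimal sketch in Python (the function name and array conventions are our own, and it assumes at least two positive examples with distinct norms so that r1 < r2):

```python
import numpy as np

def tightest_concentric_fit(X, y):
    """Return (r1, r2), the smallest and largest L2 norms of the positive points.

    X is an (m, 2) array of points; y is an (m,) array of 0/1 labels.
    """
    pos_norms = np.linalg.norm(X[y == 1], axis=1)
    return pos_norms.min(), pos_norms.max()
```

Every positive example lies in the ring [r1*, r2*], so this choice satisfies r1* ≤ r1 < r2 ≤ r2*; positives fall inside [r1, r2] and negatives outside it, giving zero training mistakes.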

 

Recall that PAC learning involves two primary parameters: ε and δ. ε is sometimes called the accuracy parameter; we say that if the true error of a learner is larger than ε, then the learning has “failed”. Our hope is to directly prove that we can find some sample size m such that the probability of drawing a sample of size at least m from D which causes the learner to “fail” is less than δ (which is sometimes called the confidence parameter).

 

(b) Given the hypothesis that you learned in (a), your hypothesis will only make mistakes on positive examples (we ask that you justify that below). For this problem, ε is equal to the probability of drawing a point x from D that is labeled as a positive example and lies in the area between either r1* ≤ ||x||_2 < r1 or r2 < ||x||_2 ≤ r2*; in other words,

ε = Pr_{x∼D} [ r1* ≤ ||x||_2 < r1  or  r2 < ||x||_2 ≤ r2* ]

 

i.  [5 points] Explain why this is the case.

ii. [5 points] What is the probability of drawing a sample of m points from D where none of the points lie in the areas r1* ≤ ||x||_2 < r1 or r2 < ||x||_2 ≤ r2*?

(c) [15 points] Given parameters δ and ε, find a value for m such that, after learning with the algorithm from (a) on m examples, with confidence at least (1 − δ), the probability that the resulting hypothesis will make a mistake on a randomly drawn example is less than ε.

•  Hint:  The following inequality  might be useful:

 

1 − x ≤ e^(−x)
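For orientation, here is the shape such an argument usually takes (a sketch of one standard route, in which each of the two error regions is trimmed to probability mass ε/2; the details are left to you):

```latex
\Pr[\text{learner fails}]
  \;\le\; 2\Bigl(1 - \tfrac{\epsilon}{2}\Bigr)^{m}
  \;\le\; 2e^{-m\epsilon/2}
  \;\le\; \delta
  \quad\Longleftarrow\quad
  m \;\ge\; \frac{2}{\epsilon}\ln\frac{2}{\delta}.
```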

 

(d) [5 points] We could have found a bound on m using another method. Derive this bound; how does it compare to the bound we found in the last step? (Hint: what is the VC dimension of H_2cc?)
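For comparison, the generic VC-based sample-complexity bound usually quoted in class has the following shape (constants vary from textbook to textbook; treat this form as a reference point, not as the required answer):

```latex
m \;\ge\; \frac{1}{\epsilon}\left(4\log_2\frac{2}{\delta}
        \;+\; 8\,\mathrm{VC}(H)\,\log_2\frac{13}{\epsilon}\right)
```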

 

2. [VC Dimension - 5 points] We define a set of concepts

H = {sgn(ax^2 + bx + c); a, b, c ∈ R},

 

where sgn(·) is 1 when the argument · is positive, and 0 otherwise. What is the VC dimension of H? Prove your claim.

Grading note: You will not get any points without proper justification of your answer.
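One way to build intuition (not a substitute for the proof): a quadratic has at most two real roots, so its sign along the line changes at most twice. Taking that characterization as given, this brute-force probe (our own helper, not part of the assignment) enumerates all labelings of n increasing points:

```python
from itertools import product

def sign_changes(labels):
    """Count label changes along points sorted by x."""
    return sum(a != b for a, b in zip(labels, labels[1:]))

def can_shatter(n):
    # A labeling of n increasing points is realizable by sgn(ax^2 + bx + c)
    # iff it has at most two sign changes (a quadratic has <= 2 real roots).
    return all(sign_changes(lab) <= 2 for lab in product([0, 1], repeat=n))

print(can_shatter(3), can_shatter(4))  # expected: True False
```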

 

3. [Kernels - 15 points]

 

(a) [5 points] Write down the dual representation of the Perceptron algorithm.

(b) [5 points] Given two examples x ∈ R^2 and z ∈ R^2, let

K(x, z) = (x^T z)^3 + 49(x^T z + 4)^2 + 64 x^T z.

 

Prove that  this is a valid kernel function.
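To make the dual form from (a) concrete, here is a minimal kernel Perceptron sketch (the training loop and names are our own illustration; the kernel is the one from (b), and labels are assumed to be in {+1, −1}):

```python
import numpy as np

def K(x, z):
    """The kernel from part (b)."""
    s = x @ z
    return s**3 + 49 * (s + 4)**2 + 64 * s

def kernel_perceptron(X, y, epochs=10):
    """Dual Perceptron: alpha[i] counts the mistakes made on example i.

    The decision value on x_j is sum_i alpha[i] * y[i] * K(x_i, x_j),
    which replaces w . x_j in the primal algorithm.
    """
    alpha = np.zeros(len(X))
    for _ in range(epochs):
        for j in range(len(X)):
            score = sum(alpha[i] * y[i] * K(X[i], X[j]) for i in range(len(X)))
            if y[j] * score <= 0:  # mistake on example j: promote it
                alpha[j] += 1
    return alpha
```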

 

(c) [5 points] We wish to learn a Boolean function represented as a monotone DNF (DNF without negated variables) using kernel Perceptron. For this problem, assume that each term in the DNF is of size k, s.t. k ≤ n, the dimensionality of the input. In order to complete this task, we will first define a kernel that maps an example x ∈ {0, 1}^n into a new space of monotone conjunctions of exactly k different variables from the n-dimensional space. Then, we will use the kernel Perceptron to perform our learning task.

Define a kernel K(x, z) = Σ_{c∈C} c(x)c(z), where C is the family of monotone conjunctions containing exactly k different variables, and c(x), c(z) ∈ {0, 1} are the values of c when evaluated on examples x and z, respectively. Show that K(x, z) can be computed in time that is linear in n.
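The usual route to the linear-time claim: c(x)c(z) = 1 exactly when all k variables of c are 1 in both x and z, so K(x, z) counts the k-element subsets of the coordinates where x and z are both 1. A sketch under that reading (the helper name is ours):

```python
from math import comb
import numpy as np

def monotone_conjunction_kernel(x, z, k):
    """K(x, z) = C(s, k), where s = #{i : x_i = z_i = 1}.

    A monotone conjunction of exactly k variables is 1 on both x and z
    iff all k of its variables lie among the s shared-1 coordinates.
    Counting s is one O(n) pass; comb(s, k) returns 0 when s < k.
    """
    s = int(np.sum((np.asarray(x) == 1) & (np.asarray(z) == 1)))
    return comb(s, k)
```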

4. [SVM - 25 points]

 

We have a set of six labeled examples D in the two-dimensional space, D = {(x^(1), y^(1)), ..., (x^(6), y^(6))}, x^(i) ∈ R^2 and y^(i) ∈ {1, −1}, i = 1, 2, ..., 6, listed as follows:

 



i     x1^(i)    x2^(i)    y^(i)
1     −2         0          1
2     −2.4      −1.6        1
3      1.3       2.6       −1
4     −0.3      −2.5        1
5      3         0.2       −1
6      0         2         −1

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Figure 1: Training examples for SVM in question 4.(a)

 

 

 

(a)  [5 points] We want to find a linear classifier where examples x are positive if and only if w · x + θ ≥ 0.

1. [1 point] Find an easy solution (w, θ) that can separate the positive and negative examples given.

 

Define w =                                 Define θ =                                                   

2. [1 point] Recall the Hard SVM formulation:

min_w (1/2)||w||^2                                                  (1)
s.t. y^(i)(w · x^(i) + θ) ≥ 1, ∀(x^(i), y^(i)) ∈ D                  (2)

 

What would the solution be if you solve this optimization problem? (Note: you don’t actually need to solve the optimization problem; we expect you to use a simple geometric argument to derive the same solution SVM optimization would result in).

Define w =                                 Define θ =      

 

3. [3 points] Given your understanding of SVM optimization, how did you derive the SVM solution for the points in Figure 1?

 

 

 

 

 

(b) [15 points] Recall the dual representation of SVM. There exist coefficients αi ≥ 0 such that:

w* = Σ_{i∈I} αi y^(i) x^(i)                                         (3)

 

where I is the set of indices of the support  vectors.

 

1. [5 points] Identify support  vectors from the six examples given.

 

 

Define I =                                  

 

 

2. [5 points] For the support  vectors you have identified, find αi  such that  the dual representation of w∗   is equal to the primal one you found in (a)-2.

 

Define α = {α1 , α2 , ..., α|I | } =                                   

 

 

 

 

3. [5  points]  Compute  the  value  of the  hard  SVM objective  function  for the optimal  solution you found.

 

Objective function value  =                               
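If you want to sanity-check your answers to (a)-2 and (b) numerically, a hard-margin SVM can be approximated in scikit-learn by taking C very large (this cross-check is our own addition, not part of the assignment; note that clf.support_ uses 0-based indices):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[-2, 0], [-2.4, -1.6], [1.3, 2.6], [-0.3, -2.5], [3, 0.2], [0, 2]])
y = np.array([1, 1, -1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e8).fit(X, y)  # huge C ~ hard margin

print("w         =", clf.coef_[0])           # compare with your (a)-2 answer
print("theta     =", clf.intercept_[0])
print("support   =", clf.support_)           # compare with your set I
print("alpha*y   =", clf.dual_coef_[0])      # signed dual coefficients
print("objective =", 0.5 * np.sum(clf.coef_[0] ** 2))
```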

(c)  [5 points] Recall the objective function for soft representation of SVM.

 



min_{w,ξ} (1/2)||w||^2 + C Σ_{i=1}^{m} ξi                           (4)
s.t. y^(i)(w · x^(i) + θ) ≥ 1 − ξi, ξi ≥ 0, ∀(x^(i), y^(i)) ∈ D     (5)

 

where m is the number of examples. Here C is an important parameter. For which trivial value of C does the solution to this optimization problem give the hyperplane that you have found in (a)-2? Comment on the impact on the margin and support vectors when we use C = ∞, C = 1, and C = 0. Interpret what C controls.
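To see empirically what C controls, one can refit the same six points at a few values and watch the geometric margin 2/||w|| and the number of support vectors (our own exploration; scikit-learn requires C > 0, so C = 0 can only be approached with a tiny value):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[-2, 0], [-2.4, -1.6], [1.3, 2.6], [-0.3, -2.5], [3, 0.2], [0, 2]])
y = np.array([1, 1, -1, 1, -1, -1])

for C in [1e8, 1.0, 1e-4]:  # stand-ins for C = infinity, C = 1, C -> 0
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2 / np.linalg.norm(clf.coef_[0])
    print(f"C={C:g}: margin={margin:.3f}, support vectors={len(clf.support_)}")
```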

 

 

 

 

 

 

5. [Boosting - 20 points] Consider the following examples (x, y) ∈ R^2 (i is the example index):

 

i    x          y          Label

1   0          8          −

2   1          4          −

3   3          7          +

4   -2         1          −

5   -1         13        −

6   9          11        −

7   12        7          +

8   -7         -1         −

9   -3         12        +

10 5          9          +

 

In this problem, you will use Boosting to learn a hidden Boolean function from this set of examples. We will use two rounds of AdaBoost to learn a hypothesis for this data set. In each round, AdaBoost chooses a weak learner that minimizes the error ε. As weak learners, use hypotheses of the form (a) f1 ≡ [x > θx] or (b) f2 ≡ [y > θy], for some integers θx, θy (either one of the two forms, not a disjunction of the two). There should be no need to try many values of θx, θy; appropriate values should be clear from the data.
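If you would like to check your table entries mechanically, here is a compact AdaBoost sketch restricted to the stump forms above (our own code, written under the “[x > θ]” reading of the weak learners; it searches integer thresholds only):

```python
import numpy as np

X = np.array([[0, 8], [1, 4], [3, 7], [-2, 1], [-1, 13],
              [9, 11], [12, 7], [-7, -1], [-3, 12], [5, 9]], dtype=float)
y = np.array([-1, -1, 1, -1, -1, -1, 1, -1, 1, 1])  # labels as +/-1

def best_stump(X, y, D):
    """Return (error, feature, theta, predictions) minimizing weighted error."""
    best = None
    for feat in range(X.shape[1]):
        for theta in range(int(X[:, feat].min()) - 1, int(X[:, feat].max()) + 1):
            pred = np.where(X[:, feat] > theta, 1, -1)
            err = D[pred != y].sum()
            if best is None or err < best[0]:
                best = (err, feat, theta, pred)
    return best

D = np.full(len(X), 1 / len(X))           # D0: uniform
for t in range(2):                         # two rounds, as the problem asks
    err, feat, theta, pred = best_stump(X, y, D)
    alpha = 0.5 * np.log((1 - err) / err)  # assumes 0 < err < 1/2
    print(f"round {t + 1}: h = [{'xy'[feat]} > {theta}], "
          f"eps = {err:.3f}, alpha = {alpha:.3f}")
    D = D * np.exp(-alpha * y * pred)      # up-weight mistakes
    D /= D.sum()
    print("  next D =", np.round(D, 3))
```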

 

(a) [5 points] Start the first round with a uniform distribution D0. Place the value for D0 for each example in the third column of Table 1. Write the new representation of the data in terms of the rules of thumb, f1 and f2, in the fourth and fifth columns of Table 1.

(b) [5 points] Find the hypothesis given by the weak learner that minimizes the error ε for that distribution. Place this hypothesis as the heading to the sixth column of Table 1, and give its prediction for each example in that column.

 

 

(1) i   (2) Label   (3) D0   (4) f1 ≡ [x >   ]   (5) f2 ≡ [y >   ]   (6) h1 ≡ [   ]   (7) D1   (8) f1 ≡ [x >   ]   (9) f2 ≡ [y >   ]   (10) h2 ≡ [   ]

 1        −
 2        −
 3        +
 4        −
 5        −
 6        −
 7        +
 8        −
 9        +
10        +

Columns (4)-(6) make up Hypothesis 1 (round 1) and columns (8)-(10) make up Hypothesis 2 (round 2); all remaining cells are left blank for you to fill in.

Table 1: Table for Boosting results

 

 

(c) [5 points] Now compute D1 for each example, find the new best weak learners f1 and f2, and select the hypothesis that minimizes error on this distribution, placing these values and predictions in the seventh to tenth columns of Table 1.

 

(d)  [5 points] Write  down the final hypothesis  produced by AdaBoost.

 

What to submit: Fill out Table 1 as explained, show the computation of α and D1(i), and give the final hypothesis, Hfinal.
