• Feel free to talk to other members of the class in doing the homework. I am more concerned that you learn how to solve the problem than that you demonstrate that you solved it entirely on your own. You should, however, write down your solution yourself. Please try to keep the solution brief and clear.
• Please use Piazza first if you have questions about the homework. Also feel free to send us e-mails and come to office hours.
• Please, no handwritten solutions. You will submit your solution manuscript as a single pdf file.
• Please present your algorithms in both pseudocode and English. That is, give a precise formulation of your algorithm as pseudocode and also explain in one or two concise paragraphs what your algorithm does. Be aware that pseudocode is much simpler and more abstract than real code.
• The homework is due at 11:59 PM on the due date. We will be using Compass for collecting the homework assignments. Please submit your solution manuscript as a pdf file via Compass (http://compass2g.illinois.edu). Please do NOT hand in a hard copy of your write-up. Contact the TAs if you are having technical difficulties in submitting the assignment.
• You may not use late submission credit hours for this problem set.
1. [PAC Learning - 35 points] In this problem, we are going to prove that the class of two concentric circles in the plane is PAC learnable. This hypothesis class is formally defined as H2cc = {h_{r1,r2} : r1, r2 ∈ R+ and r1 < r2}, where

h_{r1,r2}(x) = 1 if r1 ≤ ||x||_2 ≤ r2, and 0 otherwise.

For this problem, assume a sample of m points is drawn i.i.d. from some distribution D and that the labels are provided by some target function h*_{r1*,r2*} ∈ H2cc.
(a) [5 points] Describe an algorithm that takes a training sample of m points as described above and returns a hypothesis ĥ_{r1,r2} ∈ H2cc that makes zero mistakes on the training sample. To simplify the analysis that follows (in (b)), represent your hypothesis as two circles with radii r1, r2 such that r1* ≤ r1 < r2 ≤ r2*. (An illustrative sketch of one such consistent learner is given after part (d).)
Recall that PAC learning involves two primary parameters: ε and δ. ε is sometimes called the accuracy parameter; we say that if the true error of a learner is larger than ε, then the learning has “failed”. Our hope is to directly prove that we can find some sample size m such that the probability of drawing a sample of size at least m from D which causes the learner to “fail” is less than δ (which is sometimes called the confidence parameter).
(b) Given the hypothesis that you learned in (a), your hypothesis will only make mistakes on positive examples (we ask that you justify this below). For this problem, ε is equal to the probability of drawing a point x from D that is labeled as a positive example and lies in one of the two regions r1* ≤ ||x||_2 < r1 or r2 < ||x||_2 ≤ r2*; in other words,

ε = Pr_{x∼D} [ r1* ≤ ||x||_2 < r1  or  r2 < ||x||_2 ≤ r2* ]
i. [5 points] Explain why this is the case.
ii. [5 points] What is the probability of drawing a sample of m points from D where none of the points lie in the regions r1* ≤ ||x||_2 < r1 or r2 < ||x||_2 ≤ r2*?
(c) [15 points] Given parameters δ and ε, find a value for m such that, after learning with the algorithm from (a) on m examples, with confidence at least (1 − δ), the probability that the resulting hypothesis will make a mistake on a randomly drawn example is less than ε.
• Hint: The following inequality might be useful: 1 − x ≤ e^{−x}.
(d) [5 points] We could have found a bound on m using another method. Derive this bound; how does it compare to the bound we found in the last step? (Hint: what is the VC Dimension of H2cc ?).
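As referenced in part (a), here is a minimal Python sketch (not required for your submission) of one natural consistent learner: it returns the tightest annulus containing all positive training points, which satisfies r1* ≤ r1 < r2 ≤ r2* whenever the sample contains positive examples. The function names and the handling of the no-positive-example case are illustrative choices, not part of the problem statement.

```python
import math

def tightest_annulus(sample):
    """sample: list of ((x1, x2), label) pairs with label in {0, 1}.

    Returns (r1, r2), the radii of the smallest annulus containing every
    positive point. Because the labels come from some h* in H2cc (the
    realizability assumption in the problem), this hypothesis also makes
    no mistakes on the negative points.
    """
    pos_norms = [math.hypot(x1, x2) for (x1, x2), y in sample if y == 1]
    if not pos_norms:
        # Illustrative convention: with no positive examples, return an
        # empty annulus so that everything is predicted 0.
        return 0.0, 0.0
    return min(pos_norms), max(pos_norms)

def predict(r1, r2, point):
    """Classify a point with the learned hypothesis h_{r1,r2}."""
    norm = math.hypot(*point)
    return 1 if r1 <= norm <= r2 else 0
```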
2. [VC Dimension - 5 points] We define a set of concepts

H = {sgn(ax^2 + bx + c) : a, b, c ∈ R},

where sgn(·) is 1 when the argument is positive, and 0 otherwise. What is the VC dimension of H? Prove your claim. (A small numerical shattering check is sketched after the grading note below.)
Grading note: You will not get any points without proper justification of your answer.
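As referenced above, the following purely illustrative Python sketch (hypothetical helper names, coarse grid search) probes numerically whether a given set of points on the real line can be shattered by H. A failed grid search does not prove a labeling is unrealizable, so this is only an exploratory aid, not a substitute for the proof the problem asks for.

```python
import itertools

def sgn(v):
    # Matches the problem's convention: 1 if the argument is positive, else 0.
    return 1 if v > 0 else 0

def can_shatter(points, grid=None):
    """Heuristically check whether `points` (a list of reals) can be shattered
    by hypotheses sgn(a*x**2 + b*x + c), searching (a, b, c) over a finite grid.
    Returns (True, None) if every labeling was realized on the grid, otherwise
    (False, labeling) for the first labeling the search failed to realize."""
    if grid is None:
        grid = [v / 2.0 for v in range(-20, 21)]  # coarse grid for a, b, c
    for labeling in itertools.product([0, 1], repeat=len(points)):
        realized = any(
            all(sgn(a * x * x + b * x + c) == y for x, y in zip(points, labeling))
            for a in grid for b in grid for c in grid
        )
        if not realized:
            return False, labeling
    return True, None

# Example usage: probe a small configuration of points.
print(can_shatter([-1.0, 0.0, 1.0]))
```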
3. [Kernels - 15 points]
(a) [5 points] Write down the dual representation of the Perceptron algorithm. (A generic kernel Perceptron sketch is given after part (c) below.)

(b) [5 points] Given two examples x ∈ R^2 and z ∈ R^2, let
K(x, z) = (x^T z)^3 + 49(x^T z + 4)^2 + 64 x^T z.
Prove that this is a valid kernel function.
(c) [5 points] We wish to learn a Boolean function represented as a monotone DNF (a DNF without negated variables) using kernel Perceptron. For this problem, assume that each term in the DNF has size exactly k, where k ≤ n and n is the dimensionality of the input. In order to complete this task, we will first define a kernel that maps an example x ∈ {0, 1}^n into a new space of monotone conjunctions of exactly k different variables from the n-dimensional space. Then, we will use the kernel Perceptron to perform our learning task.

Define a kernel K(x, z) = Σ_{c∈C} c(x)c(z), where C is the family of monotone conjunctions containing exactly k different variables, and c(x), c(z) ∈ {0, 1} are the values of c when evaluated on examples x and z, respectively. Show that K(x, z) can be computed in time that is linear in n.
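As referenced in part (a), the sketch below is a standard dual-form (kernel) Perceptron parameterized by an arbitrary kernel function; the function names are illustrative, and the dot-product kernel shown is only a placeholder, not the specific kernel the problem asks you to analyze.

```python
def kernel_perceptron(examples, kernel, epochs=10):
    """Dual-form Perceptron.

    examples: list of (x, y) pairs with y in {+1, -1}.
    kernel:   function k(x, z) returning a real number.
    Returns the mistake counts alpha[i], one per training example; the learned
    predictor is sign(sum_i alpha[i] * y_i * k(x_i, x)).
    """
    alpha = [0] * len(examples)
    for _ in range(epochs):
        for j, (xj, yj) in enumerate(examples):
            score = sum(a * yi * kernel(xi, xj)
                        for a, (xi, yi) in zip(alpha, examples))
            if yj * score <= 0:   # mistake (or zero score): count an update
                alpha[j] += 1
    return alpha

def dot_kernel(x, z):
    # Placeholder kernel: ordinary dot product over equal-length tuples.
    return sum(a * b for a, b in zip(x, z))
```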
4. [SVM - 25 points]
We have a set of six labeled examples D in the two-dimensional space, D =
{(x(1), y(1)), ..., (x(6), y(6))}, x(i) ∈ R^2 and y(i) ∈ {1, −1}, i = 1, 2, ..., 6, listed as follows:
i    x1(i)    x2(i)    y(i)
1    −2        0        1
2    −2.4     −1.6      1
3     1.3      2.6     −1
4    −0.3     −2.5      1
5     3        0.2     −1
6     0        2       −1
Figure 1: Training examples for SVM in question 4(a). (The figure plots the six points listed in the table above.)
(a) [5 points] We want to find a linear classifier where examples x are positive if and only if w · x + θ ≥ 0.
1. [1 point] Find an easy solution (w, θ) that can separate the positive and negative examples given.
Define w = Define θ =
2. [1 point] Recall the Hard SVM formulation:

min_w (1/2)||w||^2                                        (1)
s.t. y(i)(w · x(i) + θ) ≥ 1, ∀(x(i), y(i)) ∈ D            (2)
What would the solution be if you solve this optimization problem? (Note: you don’t actually need to solve the optimization problem; we expect you to use a simple geometric argument to derive the same solution SVM optimization would result in).
Define w = Define θ =
3. [3 points] Given your understanding of SVM optimization, how did you derive the SVM solution for the points in Figure 1?
(b) [15 points] Recall the dual representation of SVM. There exist coefficients αi ≥ 0 such that:
w* = Σ_{i∈I} αi y(i) x(i)                                 (3)

where I is the set of indices of the support vectors.
1. [5 points] Identify support vectors from the six examples given.
Define I =
2. [5 points] For the support vectors you have identified, find αi such that the dual representation of w∗ is equal to the primal one you found in (a)-2.
Define α = {α1, α2, ..., α|I|} =
3. [5 points] Compute the value of the hard SVM objective function for the optimal solution you found.
Objective function value =
(c) [5 points] Recall the objective function for the soft-margin formulation of SVM:

min_{w,θ,ξ} (1/2)||w||^2 + C Σ_{i=1}^{m} ξi                         (4)
s.t. y(i)(w · x(i) + θ) ≥ 1 − ξi, ξi ≥ 0, ∀(x(i), y(i)) ∈ D         (5)

where m is the number of examples. Here C is an important parameter. For which trivial value of C does the solution to this optimization problem give the hyperplane that you found in (a)-2? Comment on the impact on the margin and the support vectors when we use C = ∞, C = 1, and C = 0. Interpret what C controls. (A numerical sketch exploring these settings follows this problem.)
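As referenced in (c), here is a numerical sanity-check sketch, assuming scikit-learn is available. A very large C approximates the hard-margin SVM of (a)-2, and sweeping C illustrates the soft-margin behavior discussed in (c); this is only a way to verify your answers, not a substitute for the geometric argument requested.

```python
import numpy as np
from sklearn.svm import SVC

# The six training examples from the table above.
X = np.array([[-2, 0], [-2.4, -1.6], [1.3, 2.6], [-0.3, -2.5], [3, 0.2], [0, 2]])
y = np.array([1, 1, -1, 1, -1, -1])

# 1e6 approximates the hard-margin case; sklearn requires C > 0, so a very
# small C stands in for the C = 0 discussion.
for C in [1e6, 1.0, 1e-3]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w, theta = clf.coef_[0], clf.intercept_[0]
    margin = 2.0 / np.linalg.norm(w)
    print(f"C={C:g}: w={w}, theta={theta:.3f}, "
          f"margin={margin:.3f}, support vectors={clf.support_.tolist()}")
    # clf.dual_coef_ holds alpha_i * y_i for the support vectors, which can be
    # compared against the dual representation in part (b).
```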
5. [Boosting - 20 points] Consider the following examples (x, y) ∈ R2 (i is the example index):
 i     x     y    Label
 1     0     8      −
 2     1     4      −
 3     3     7      +
 4    −2     1      −
 5    −1    13      −
 6     9    11      −
 7    12     7      +
 8    −7    −1      −
 9    −3    12      +
10     5     9      +
In this problem, you will use Boosting to learn a hidden Boolean function from this set of examples. We will use two rounds of AdaBoost to learn a hypothesis for this data set. In each round, AdaBoost chooses a weak learner that minimizes the error ε. As weak learners, use hypotheses of the form (a) f1 ≡ [x > θx] or (b) f2 ≡ [y > θy], for some integers θx, θy (either one of the two forms, not a disjunction of the two). There should be no need to try many values of θx, θy; appropriate values should be clear from the data.
(a) [5 points] Start the first round with a uniform distribution D0. Place the value for D0 for each example in the third column of Table 1. Write the new representation of the data in terms of the rules of thumb, f1 and f2, in the fourth and fifth columns of Table 1.
(b) [5 points] Find the hypothesis given by the weak learner that minimizes the error
for that distribution. Place this hypothesis as the heading to the sixth column of Table 1, and give its prediction for each example in that column.
                       Hypothesis 1                            Hypothesis 2
 i    Label   D0      f1 ≡       f2 ≡       h1 ≡       D1      f1 ≡       f2 ≡       h2 ≡
                      [x >   ]   [y >   ]   [      ]           [x >   ]   [y >   ]   [      ]
(1)   (2)     (3)     (4)        (5)        (6)        (7)     (8)        (9)        (10)
 1     −
 2     −
 3     +
 4     −
 5     −
 6     −
 7     +
 8     −
 9     +
10     +

Table 1: Table for Boosting results
(c) [5 points] Now compute D1 for each example, find the new best weak learners f1 and f2, and select the hypothesis that minimizes the error on this distribution, placing these values and predictions in the seventh to tenth columns of Table 1.
(d) [5 points] Write down the final hypothesis produced by AdaBoost.
What to submit: Fill out Table 1 as explained, show the computation of α and D1(i), and give the final hypothesis, H_final.
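For reference, assuming the standard AdaBoost formulation (with labels and weak-learner outputs taken in {−1, +1}), the quantities requested above are defined as follows:

```latex
% Round-t quantities for weak hypothesis h_t under distribution D_t:
\epsilon_t = \sum_{i : h_t(x_i) \neq y_i} D_t(i), \qquad
\alpha_t   = \frac{1}{2} \ln\!\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)

% Distribution update, with Z_t the normalizer making D_{t+1} sum to 1:
D_{t+1}(i) = \frac{D_t(i)\, \exp\bigl(-\alpha_t\, y_i\, h_t(x_i)\bigr)}{Z_t}

% Final hypothesis after T rounds:
H_{\mathrm{final}}(x) = \mathrm{sign}\!\left(\sum_{t=1}^{T} \alpha_t\, h_t(x)\right)
```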