CS 446 Homework 2

• Soft-margin SVM: 4pts

In the lecture, we learned about the duality of hard-margin SVM. Given data examples $((x_i, y_i))_{i=1}^N$, the primal form of hard-margin SVM is
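presumably the standard bias-free form (consistent with the hint $f(x) = \hat{w}^\top x$ used below):
\[
\min_{w} \ \frac{1}{2}\|w\|_2^2 \quad \text{subject to} \quad y_i\, w^\top x_i \ge 1, \quad i = 1, \dots, N.
\]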

• SVM, RBF Kernel and Nearest Neighbor: 6pts

1. (1pt) Suppose that given data examples $((x_i, y_i))_{i=1}^N$, an optimal dual solution to the hard-margin SVM is $(\hat{\alpha}_i)_{i=1}^N$. Write the prediction on a new $x$ as a function of $(\hat{\alpha}_i)_{i=1}^N$, $((x_i, y_i))_{i=1}^N$, and $x$.

Hint: The prediction on $x$ is $f(x) = \hat{w}^\top x$.

2. (1pt) Now we apply the kernel trick to the prediction function $f(x)$. Recall that the RBF kernel (Gaussian kernel) is

\[
\kappa(x_i, x_j) = \exp\!\left( -\frac{\|x_i - x_j\|_2^2}{2\sigma^2} \right).
\]

What is $f_\sigma(x)$, the prediction on $x$ using the RBF kernel?
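For concreteness, here is a minimal NumPy sketch of the RBF kernel; the kernels you should actually use for the coding part are the ones provided in hw2_utils.py, so treat this purely as an illustration.

```python
import numpy as np

def rbf_kernel(xi, xj, sigma=1.0):
    """Illustrative RBF (Gaussian) kernel between two vectors:
    kappa(x_i, x_j) = exp(-||x_i - x_j||_2^2 / (2 * sigma^2))."""
    diff = np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))
```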

3. (4pts) Denote by $S \subset \{1, 2, \dots, n\}$ the set of indices of support vectors. Given an input $x$, let $T := \operatorname{argmin}_{i \in S} \|x - x_i\|_2$ denote the set of closest support vectors to $x$, and let $\rho := \min_{i \in S} \|x - x_i\|_2$ denote this smallest distance. (In other words, $T = \{i \in S : \|x - x_i\|_2 = \rho\}$.) Prove that
\[
\lim_{\sigma \to 0} \frac{f_\sigma(x)}{\exp(-\rho^2/2\sigma^2)} = \sum_{i \in T} \hat{\alpha}_i y_i.
\]

Hint: Split up the sum over elements of S into two sums: one over i ∈ T and one over i ∈ S \ T .

Remark: In other words, when the bandwidth σ becomes small enough, the RBF kernel SVM is almost the 1-nearest neighbor predictor with the set of support vectors as the training set. Note that while nearest neighbors will not be introduced until Lecture 11, solving this problem does not require knowledge of them.

Remark 2: In this limit, the prediction function $f_\sigma : x \mapsto \phi(x)^\top w$ depends only on the labels of the closest support vectors of $x$, i.e., the elements of the set $T$.


• Decision Tree and AdaBoost: 12pts

The figure below shows a dataset $D = \{(x^{(i)}, y^{(i)})\}_{i=1}^{6}$ (where $x^{(i)} \in \mathbb{R}^2$, $y^{(i)} \in \mathbb{R}$), containing six data points with two features $x_1$ and $x_2$. For example, $x^{(1)} = [1\ 2]^\top$ and $x^{(2)} = [2\ 1]^\top$.

The label $y^{(i)}$ can take on the values 1 (blue) or −1 (green).

The decision tree defined in class can only support discrete-valued instances. Here, we extend this concept to general continuous spaces. A continuous-valued decision attribute needs to be represented using a comparison operator (≥, <). Specifically, in each round, unlike discrete-valued tree building, which chooses only which feature to use as the current decision attribute, we will specify both a feature ($x_1$ or $x_2$) and a threshold $\tau$, and then create two descendant nodes at the current node. Of the data points in the current node, those below the threshold will be put into the child node “$x_j < \tau$”, and those at or above the threshold will be put into the child node “$x_j \ge \tau$” ($j = 1$ or $2$).

Note: We assume $\tau$ can only take integer values. Please describe the split rule as “$x_j \ge \tau$”, such as $x_1 \ge 1$ or $x_2 \ge 2$; do not use answers like $x_1 \ge 2.5$ or $x_2 > 2$. Also make sure to state the predicted label in each of the two child nodes “$x_j < \tau$” and “$x_j \ge \tau$” ($j = 1$ or $2$).

Note: Use log2(·) in the calculation of entropy.

1. (1pt) What is the sample entropy of $D$? Show each step of your calculation.

2. (2pts) What is the maximum information gain if we split the root into two child nodes? What is the rule for this split? Show each step of your information-gain calculation. You do not need to prove that your split is optimal.

3. (3pts) After the first split in 2., how do we further split the child nodes based on maximum information gain? Please also give the information gain for each split.
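If you want to sanity-check your hand calculations, a short script along the following lines computes the base-2 entropy and the information gain of a candidate threshold split; `entropy` and `info_gain` are illustrative helper names, not part of the provided code.

```python
import math

def entropy(labels):
    """Sample entropy (base 2) of a list of +/-1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    h = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        h -= p * math.log2(p)
    return h

def info_gain(points, labels, feature, tau):
    """Information gain of the split x_feature >= tau (feature is 0 or 1)."""
    left = [y for x, y in zip(points, labels) if x[feature] < tau]
    right = [y for x, y in zip(points, labels) if x[feature] >= tau]
    n = len(labels)
    cond = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - cond
```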

AdaBoost. A decision stump is a one-level decision tree; it classifies cases using only one attribute. In this problem, you will run through $T = 2$ steps of AdaBoost with decision stumps as weak learners on the dataset $D$. For notational simplicity, we denote $x^{(1)} = [1, 2]^\top$, $x^{(2)} = [2, 1]^\top$, $x^{(3)} = [3, 4]^\top$, $x^{(4)} = [4, 6]^\top$, $x^{(5)} = [5, 3]^\top$, $x^{(6)} = [6, 5]^\top$.
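As a reminder (assuming the standard AdaBoost formulation from lecture; the exact normalization may differ in your notes), the quantities asked for below are typically computed as
\[
\epsilon_t = \sum_{i:\, f_t(x^{(i)}) \ne y^{(i)}} \gamma_{t,i}, \qquad
\alpha_t = \frac{1}{2}\ln\frac{1 - \epsilon_t}{\epsilon_t}, \qquad
\gamma_{t+1,i} \propto \gamma_{t,i}\, \exp\bigl(-\alpha_t\, y^{(i)} f_t(x^{(i)})\bigr),
\]
with $\gamma_{1,i} = 1/6$ and $\gamma_{t+1}$ normalized to sum to one.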

4. (4pts) For each iteration $t = 1, 2$, compute the sample weights $\gamma_t \in \mathbb{R}^6$, the weighted error rate $\epsilon_t$, the weight of the decision stump $\alpha_t$, and the decision stump $f_t$.

Note:

◦ Please describe the split rule as xj ≥ τ or xj < τ where τ is an integer.

◦ If you find multiple solutions, giving one is okay.

5. (2pts) Following 4., write down the rule of the classifier you constructed as a formula. Does your solution classify every case correctly? Show your work.

Note:

◦ You may use the function sign in the decision rule, where
\[
\operatorname{sign}(x) = \begin{cases} 1 & \text{for } x \ge 0, \\ -1 & \text{for } x < 0. \end{cases}
\]


• Learning Theory: 14pts

1. (2pts) Generalization bound. Imagine tossing a biased coin that lands heads with probability $p$, and let our hypothesis $h$ be the one that always guesses heads. Denote the true error rate by $R(h)$ (here $R(h) = p$) and the empirical error rate by $\hat{R}_S(h) = \hat{p}$, where $\hat{p}$ is the empirical probability of heads on the training sample $S$, which is drawn i.i.d. At least how many samples are needed so that, with probability (i.e., confidence) 95%, we have accuracy $|R(h) - \hat{R}_S(h)| \le 0.05$?
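Assuming the concentration bound from lecture is Hoeffding's inequality, a natural starting point is
\[
\Pr\bigl(|R(h) - \hat{R}_S(h)| > \epsilon\bigr) \le 2\exp(-2m\epsilon^2)
\]
for a sample of size $m$; setting $\epsilon = 0.05$ and bounding the right-hand side by $0.05$ yields the required sample size.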

2. VC Dimensions. In the lecture we learned about the VC dimension, which captures some notion of the complexity or capacity of a set of functions $\mathcal{F}$.

Note: The straightforward proof strategy to show that the VC dimension of a set of classifiers is k is to first show that there exists a set of k points which is shattered by the set of classifiers. Then, show that any set of k + 1 points cannot be shattered. You can do that by showing that for any set of k + 1 points, there exists an assignment of labels which cannot be correctly classified using F.

Notation: The indicator function is defined as follows:
\[
\mathbf{1}\{\text{condition}\} = \begin{cases} 1 & \text{if the condition is true,} \\ -1 & \text{if the condition is false.} \end{cases}
\]

We will now find the VC dimension of some basic classifiers.

(a) (4pts) 1D Affine Classifier

Let’s start with a fairly simple problem. Consider $\mathcal{X} = \mathbb{R}$ and $\mathcal{Y} = \{0, 1\}$. Affine classifiers are of the form
\[
\mathcal{F}_{\text{affine}} = \{\mathbf{1}\{wx + w_0 \ge 0\} : \mathcal{X} \to \mathbb{R} \mid w, w_0 \in \mathbb{R}\}.
\]
Show what $VC(\mathcal{F}_{\text{affine}})$ is and prove your result.

Hint: Try less than a handful of points.

(b) (4pts) General Affine Classifier

We will now go one step further. Given some dimensionality $k \ge 1$, consider $\mathcal{X} = \mathbb{R}^k$ and $\mathcal{Y} = \{0, 1\}$. Affine classifiers in $k$ dimensions are of the form
\[
\mathcal{F}_{\text{affine}}^{k} = \{\mathbf{1}\{w^\top x + w_0 \ge 0\} : \mathcal{X} \to \mathbb{R} \mid w \in \mathbb{R}^k, w_0 \in \mathbb{R}\}.
\]
Show what $VC(\mathcal{F}_{\text{affine}}^{k})$ is and prove your result.


Hint: Note that $w^\top x + w_0$ can be written as $[x^\top \;\; 1]\begin{bmatrix} w \\ w_0 \end{bmatrix}$. Moreover, consider putting all data points into a matrix, e.g.,
\[
X = \begin{bmatrix} (x^{(1)})^\top & 1 \\ (x^{(2)})^\top & 1 \\ \vdots & \vdots \end{bmatrix}.
\]

(c) (4pts) Cosine classifier

Consider the classifier based on the cosine function:
\[
\mathcal{F}_{\cos} = \{\mathbf{1}\{\cos(cx) \ge 0\} : \mathcal{X} \to \mathbb{R} \mid c \in \mathbb{R}\}.
\]
Show what $VC(\mathcal{F}_{\cos})$ is and prove your result.


• Coding: SVM, 24pts

Recall that the dual problem of SVM is
\[
\max_{\alpha \in C} \ \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j),
\]
where the domain $C = [0, \infty)^n = \{\alpha : \alpha_i \ge 0\}$ for hard-margin SVM, and you already derived the domain for soft-margin SVM in Q1.

Equivalently, it can be formulated as a minimization problem:
\[
\min_{\alpha \in C} \ f(\alpha) := \frac{1}{2}\sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) - \sum_{i=1}^{n} \alpha_i.
\]

It can be solved by projected gradient descent, which starts from some $\alpha_0 \in C$ (e.g., $\mathbf{0}$) and updates as follows:
\[
\alpha_{t+1} = \Pi_C\bigl[\alpha_t - \eta \nabla f(\alpha_t)\bigr].
\]

Here $\Pi_C[\alpha]$ is the projection of $\alpha$ onto $C$, defined as the closest point to $\alpha$ in $C$:
\[
\Pi_C[\alpha] := \operatorname{argmin}_{\alpha' \in C} \|\alpha' - \alpha\|_2.
\]

If $C$ is closed and convex, the projection is uniquely defined.
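For intuition only, here is a minimal NumPy sketch of projected gradient descent under the hard-margin domain $C = [0, \infty)^n$, where the projection is simply clamping at zero; your actual svm_solver() must follow the interface documented in hw2.py, not this sketch.

```python
import numpy as np

def projected_gradient_descent_sketch(K, y, lr=0.1, num_iters=10000):
    """Illustrative solver for min_{alpha >= 0} 0.5 * a' Q a - sum(a),
    where Q[i, j] = y_i y_j K(x_i, x_j). Not the required svm_solver() API."""
    n = y.shape[0]
    Q = (y[:, None] * y[None, :]) * K   # element-wise y_i y_j K(x_i, x_j)
    alpha = np.zeros(n)                 # start from alpha_0 = 0, which lies in C
    for _ in range(num_iters):
        grad = Q @ alpha - np.ones(n)   # gradient of f(alpha)
        alpha = alpha - lr * grad       # gradient step
        alpha = np.maximum(alpha, 0.0)  # projection onto C = [0, inf)^n
    return alpha
```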

1. (10pts) Implement svm_solver(), using projected gradient descent as formulated above. See the docstrings in hw2.py for details.

2. (10pts) Implement svm_predictor(), using an optimal dual solution, the training set, and an input. See the docstrings in hw2.py for details.

3. (4pts) On the region [−5, 5] × [−5, 5], plot the contour lines of the following kernel SVMs, trained on the XOR data. The different kernels and the XOR data are provided in hw2_utils.py. A learning rate of 0.1 and 10000 steps should be enough. To draw the contour lines, you can use hw2.svm_contour(); a generic alternative is sketched after the list below. Include your plots in the report.

◦ The polynomial kernel with degree 2.

◦ The RBF kernel with σ = 1.

◦ The RBF kernel with σ = 2.

◦ The RBF kernel with σ = 4.
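If you prefer not to rely on hw2.svm_contour() (its exact signature is documented in the provided files), a plain matplotlib sketch along the following lines also works; here `predict` is an assumed placeholder for whatever prediction function you build on top of svm_predictor().

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_contours(predict, lims=(-5.0, 5.0), grid_size=200):
    """Draw contour lines of a scalar prediction function over a square region.

    `predict` is a placeholder: any function mapping an (N, 2) array of points
    to an (N,) array of real-valued predictions.
    """
    xs = np.linspace(lims[0], lims[1], grid_size)
    xx, yy = np.meshgrid(xs, xs)
    pts = np.stack([xx.ravel(), yy.ravel()], axis=1)
    zz = predict(pts).reshape(xx.shape)
    cs = plt.contour(xx, yy, zz)             # contour lines of the prediction
    plt.clabel(cs, inline=True, fontsize=8)  # label the level curves
    plt.xlabel("x1")
    plt.ylabel("x2")
    plt.show()
```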
