Starting point Your repository will now have a directory ‘homework3/’. Please do not change the name of this repository or the names of any files we have added to it. Please perform a git pull to retrieve these files. You will find within it:
A Python script pegasos.py, which you will amend by adding code for the questions in Sect. 5.
A script pegasos.sh. You will use it to generate output files after you complete pegasos.py.
Dataset: mnist_subset.json
Submission Instructions The following will constitute your submission:
The Python script pegasos.py, amended with the code you added for Sect. 5. Be sure to commit your changes!
A PDF report named firstname_lastname_USCID.pdf, which contains your solutions for the questions in Sect. 1, Sect. 2, and Sect. 3. For your written report, your answers must be typeset with LaTeX and generated as a PDF file. No handwritten submission will be permitted. There are many free integrated LaTeX editors that are convenient to use; please Google search for them and choose the one(s) you like the most. This tutorial, http://www.andy-roberts.net/writing/latex, seems like a good one.
A .json file pegasos.json, which will be the output of pegasos.sh.
Collaboration You may discuss with your classmates. However, you need to write your own solutions and submit separately. Also, in your written report, you need to list, for each problem, the classmates with whom you discussed it. Please consult the syllabus for what is and is not acceptable collaboration.
Algorithmic component
1 Kernels
[Recommended maximum time spent: 1 hour]
Consider the following kernel function:
\[
k(x, x') = \begin{cases} 1, & \text{if } x = x',\\ 0, & \text{otherwise,} \end{cases} \qquad \forall\, x, x' \in \mathbb{R}^D. \tag{1}
\]
Q1.1 Prove that this is a valid kernel. You can apply Mercer's theorem mentioned in the lectures (lec10.pdf) and assume that the N points {x_1, ..., x_N} you pick are distinct (i.e., x_i ≠ x_j if i ≠ j).
What to submit: No more than 10 lines of proof.
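Not required for your proof, but if you want a quick numerical sanity check, the sketch below (Python, assuming numpy is available; the points are arbitrary illustrative choices) builds the Gram matrix of this kernel on a few distinct points and inspects its eigenvalues:

import numpy as np

# A few distinct points in R^D (arbitrary, for illustration only).
X = np.array([[0.0, 1.0], [2.0, -1.0], [3.0, 0.5]])

def k(x, xp):
    """The kernel of eq. (1): 1 if x equals x', and 0 otherwise."""
    return 1.0 if np.array_equal(x, xp) else 0.0

# Gram matrix on distinct points; here it is simply the identity.
K = np.array([[k(xi, xj) for xj in X] for xi in X])
print(K)
print(np.linalg.eigvalsh(K))  # all eigenvalues are >= 0, consistent with PSD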
Q1.2 Suppose now you are given a training set {(x_n ∈ R^D, y_n ∈ R)}_{n=1}^N for a regression problem, where x_i ≠ x_j if i ≠ j. Show that by using this kernel, training a kernel ridge regressor (lec10.pdf) with λ = 0 will always lead to a training objective of 0, meaning that all the training examples are predicted accurately by the learned regressor. The training objective of kernel ridge regression is as follows:
\[
J(\alpha) = \frac{1}{2}\,\alpha^T K^T K \alpha \;-\; y^T K \alpha \;+\; \frac{\lambda}{2}\,\alpha^T K \alpha \;+\; \frac{1}{2}\, y^T y, \tag{2}
\]
where K ∈ R^{N×N} is the kernel matrix, y = [y_1, y_2, ..., y_N]^T, and α ∈ R^N. The learned regressor is
\[
f(x) = [k(x, x_1),\, k(x, x_2),\, \ldots,\, k(x, x_N)]\,\alpha^*, \tag{3}
\]
where α* = argmin_α J(α).
What to submit: No more than 5 lines of derivations, plus 1 line to show what α* is.
Q1.3 Although the learned regressor can accurately predict the value of each training example, it does not generalize to the test data. Specifically, show that for any x with x ≠ x_n, ∀n = 1, 2, ..., N, the predicted value is always 0.
What to submit: No more than 5 lines of derivations.
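If it helps to see eq. (2) and eq. (3) in action, here is a small numerical sketch (not part of what you submit; it assumes numpy, and the targets y as well as the particular α plugged in are arbitrary illustrative choices):

import numpy as np

# With the kernel of eq. (1) and N distinct training points, K is the identity.
y = np.array([0.7, -1.2, 2.0])    # arbitrary regression targets
K = np.eye(len(y))
lam = 0.0                         # the lambda = 0 setting of Q1.2

def J(alpha):
    """Training objective of kernel ridge regression, eq. (2)."""
    return (0.5 * alpha @ K.T @ K @ alpha - y @ K @ alpha
            + 0.5 * lam * alpha @ K @ alpha + 0.5 * y @ y)

print(J(np.zeros(len(y))))        # objective at alpha = 0
print(J(y))                       # objective at one particular choice of alpha

# Eq. (3) at a test point x different from every x_n: all kernel values vanish,
# so the prediction [k(x, x_1), ..., k(x, x_N)] alpha* is 0 for any alpha*.
print(np.zeros(len(y)) @ y)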
2 Support Vector Machines
[Recommended maximum time spent: 2 hours]
Consider the dataset consisting of points (x, y), where x is a real value and y ∈ {−1, 1} is the class label. Let's start with three points: (x_1, y_1) = (−1, 1), (x_3, y_3) = (0, −1), (x_2, y_2) = (1, 1).
Figure 1: Three data points considered in Q2
Q2.1 Can the three points shown in Figure 1, in their current one-dimensional feature space, be perfectly separated with a linear separator? Why or why not?
What to submit: No more than 3 lines of reasoning.
Q2.2 Now we define a simple feature mapping φ(x) = [x, x²]^T to transform the three points from the one-dimensional to a two-dimensional feature space. Plot the transformed points in the new two-dimensional feature space (use any package you prefer for the plot, e.g., Matplotlib, PowerPoint). Is there a linear decision boundary that can separate the points in this new feature space? Why or why not?
What to submit: The plot and no more than 3 lines of reasoning.
Q2.3 Given the feature mapping φ(x) = [x, x²]^T, write down the kernel function k(x, x'). Moreover, write down the 3 × 3 kernel (or Gram) matrix K based on k(x_i, x_j) of the three data points. Verify that K is a positive semi-definite (PSD) matrix. You may want to show this by the definition of a PSD matrix: a symmetric N × N real matrix M is said to be positive semi-definite if the scalar z^T M z is non-negative for every non-zero column vector z of N real numbers.
What to submit: The form of kernel function k, the kernel matrix K, and no more than 10 lines of proof to show that K is PSD.
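A quick way to double-check your hand-computed Gram matrix and its eigenvalues (this does not replace the short PSD proof you submit; Python with numpy, with the point ordering x1, x2, x3 as in the statement above):

import numpy as np

xs = np.array([-1.0, 1.0, 0.0])   # x1, x2, x3 from the problem statement

def phi(x):
    """Feature mapping of Q2.2: phi(x) = [x, x^2]^T."""
    return np.array([x, x ** 2])

def k(x, xp):
    """Kernel induced by phi: k(x, x') = phi(x)^T phi(x')."""
    return float(phi(x) @ phi(xp))

K = np.array([[k(a, b) for b in xs] for a in xs])
print(K)                          # the 3 x 3 kernel (Gram) matrix
print(np.linalg.eigvalsh(K))      # non-negative eigenvalues, consistent with PSD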
Q2.4 Now you are going to learn a hard-margin SVM classifier on this dataset with k(x_i, x_j). Recall the primal and dual formulations you learned in lecture 11 (lec11.pdf). Write down the primal and dual formulations of this problem.
What to submit: Primal and dual formulations. Each with no more than 5 lines.
Q2.5 Next, you are going to solve this problem using its dual formulation. Recall the toy example we walked through in lecture 12 (lec12.pdf). You may want to borrow a similar symmetric property to simplify the dual formulation.
What to submit: Solve the dual formulation and report the vector α ∈ R³, the weight vector w ∈ R², and the bias b ∈ R. No more than 10 lines of derivation.
Q2.6 Let ŷ = w^T φ(x) + b, where w and b are the weights and the bias you got from the previous question. Draw the decision boundary in the new two-dimensional feature space and circle all support vectors. (Set ŷ to 0 to get the decision boundary.) Then, draw the decision boundary in the original one-dimensional setting.
What to submit: Two plots: one in the new two-dimensional feature space, and the other in the original one-dimensional feature space.
3 Adaboost for building up a nonlinear classifier
[Recommended maximum time spent: 2 hours]
In the lectures (lec13.pdf), you learned an algorithm, Adaboost, which constructs a strong (binary) classifier by iteratively adding one weak (binary) classifier into it. A weak classifier is learned to maximize the weighted training accuracy at the corresponding iteration, and the weight of each training example is updated after every iteration. Algorithm 1 summarizes the Adaboost algorithm.
Given
H: a set of functions, where each h ∈ H takes a D-dimensional vector as input and outputs either +1 or −1.
A training set {(x_n ∈ R^D, y_n ∈ {+1, −1})}_{n=1}^N.
Goal Learn F(x) = sign[Σ_{t=1}^T β_t f_t(x)], where f_t ∈ H and β_t ∈ R.
Initialization w_1(n) = 1/N, ∀n ∈ {1, 2, ..., N}.
for t = 1, 2, ..., T do
  (a) find f_t = argmin_{h ∈ H} Σ_n w_t(n) I[y_n ≠ h(x_n)]
  (b) compute ε_t = Σ_n w_t(n) I[y_n ≠ f_t(x_n)]
  (c) compute β_t = (1/2) log((1 − ε_t)/ε_t)
  (d) compute w_{t+1}(n) = w_t(n) exp(−β_t) if y_n = f_t(x_n), and w_{t+1}(n) = w_t(n) exp(β_t) otherwise, ∀n ∈ {1, 2, ..., N}
  (e) normalize w_{t+1}(n) ← w_{t+1}(n) / Σ_{n'} w_{t+1}(n'), ∀n ∈ {1, 2, ..., N}
end
Algorithm 1: The Adaboost algorithm
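If you would like to check your hand computations in Q3.1 to Q3.5, the following Python sketch of Algorithm 1 may help (it assumes numpy is available, H is passed in as a list of callables, and ties in step (a) are broken by taking the first minimizer):

import numpy as np

def adaboost(X, y, H, T):
    """A direct rendering of Algorithm 1. H is a list of callables h(x) -> +/-1."""
    N = len(y)
    w = np.full(N, 1.0 / N)                           # w_1(n) = 1/N
    fs, betas = [], []
    for t in range(T):
        # (a) pick the stump with the smallest weighted error (first one on ties)
        errs = [np.sum(w * (np.array([h(x) for x in X]) != y)) for h in H]
        f_t = H[int(np.argmin(errs))]
        # (b) its weighted error
        eps_t = float(min(errs))
        # (c) its vote weight (note: undefined if eps_t is exactly 0)
        beta_t = 0.5 * np.log((1.0 - eps_t) / eps_t)
        # (d) down-weight correct examples, up-weight mistakes
        pred = np.array([f_t(x) for x in X])
        w = w * np.exp(np.where(pred == y, -beta_t, beta_t))
        # (e) normalize
        w = w / w.sum()
        fs.append(f_t)
        betas.append(beta_t)
    return fs, betas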
Figure 2: The two training sets considered in Q3: (A) for Q3.1, Q3.2, and (B) for Q3.3, Q3.4, Q3.5, Q3.6.
In this question, you are going to experiment with the learning process of Adaboost, with decision stumps as H. A decision stump h ∈ H is a classifier characterized by a triplet (s ∈ {+1, −1}, b ∈ R, d ∈ {1, 2, ..., D}) such that
\[
h_{(s,b,d)}(x) = \begin{cases} s, & \text{if } x_d \ge b,\\ -s, & \text{otherwise.} \end{cases} \tag{4}
\]
That is, each decision stump function only looks at a single dimension/entry x_d of the input vector x, and checks whether x_d is larger than b or not (s decides which label to give if x_d ≥ b).
Specifically, you are given a simple 4-example training set (as illustrated in Fig. 2 (A)):
\[
\left(x_1 = \begin{bmatrix}1\\ 1\end{bmatrix},\, y_1 = +1\right),\;
\left(x_2 = \begin{bmatrix}-1\\ 1\end{bmatrix},\, y_2 = -1\right),\;
\left(x_3 = \begin{bmatrix}-1\\ -1\end{bmatrix},\, y_3 = +1\right),\;
\left(x_4 = \begin{bmatrix}1\\ -1\end{bmatrix},\, y_4 = -1\right)
\]
For simplicity, let’s consider
\[
H = \{\, h_{(s,b,d)} \mid s \in \{+1, -1\},\; b \in \{-2, -0.5, 0.5, 2\},\; d \in \{1, 2\} \,\}. \tag{5}
\]
That is, H contains only 16 distinct decision stump functions (8 horizontal boundaries and 8 vertical boundaries).
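As a reading aid, the 16 stumps of eq. (5) can be enumerated programmatically, e.g. with a hypothetical helper like make_stump below; the resulting list also plugs directly into the Adaboost sketch after Algorithm 1:

import itertools

def make_stump(s, b, d):
    """Decision stump of eq. (4); d is 1-indexed as in the problem statement."""
    return lambda x, s=s, b=b, d=d: s if x[d - 1] >= b else -s

H = [make_stump(s, b, d)
     for s, b, d in itertools.product([+1, -1], [-2.0, -0.5, 0.5, 2.0], [1, 2])]
print(len(H))  # 16 candidate stumps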
Q3.1 Let’s get started to run Adaboost for T = 3. At t = 1, please nd the best decision stump f1 (i.e., solve step (a) in Algorithm 1). If there are multiple equally best stump functions, just randomly pick ONE of them to be f1. What are the corresponding 1 and 1 based on f1?
What to submit: Write down the triplet (s, b, d) of f_1, and the values of ε_1 and β_1. Please round your results to two decimal places (e.g., 1.499 becomes 1.50).
Q3.2 From your results in Q3.1 and Algorithm 1, you will observe something interesting about Adaboost: if β_t = 0, the corresponding iteration t does not contribute anything to the final decision function F(x). This might be fine, since we can move on to the next iteration and probably get β_{t+1} ≠ 0 there. Let's have a try. Please write down w_2(n), ∀n ∈ {1, 2, ..., N}, and check the objective function to solve at t = 2 (i.e., solve step (a) in Algorithm 1). You will observe that the objective function at t = 2 remains exactly the SAME as at t = 1. The Adaboost algorithm thus will likely get stuck after iteration t = 1.
What to submit: Write down w_2(n), ∀n ∈ {1, 2, 3, 4}. Please round your results to two decimal places (e.g., 1.499 becomes 1.50).
Q3.3 From the observations in Q3.2, it seems that Adaboost is not really learning anything strong. We argue that this is because of the form of the decision stumps: each of them can only draw a horizontal or vertical decision boundary in our 2-dimensional case. Can we apply any simple trick to solve this issue?
Let’s try one trick that you have learned, feature transform. Speci cally, we apply a simple
linear transformation for every example
\[
x \;\leftarrow\;
\begin{bmatrix}
\frac{\sqrt{2}}{2} & -\frac{\sqrt{2}}{2}\\[2pt]
\frac{\sqrt{2}}{2} & \frac{\sqrt{2}}{2}
\end{bmatrix} x, \tag{6}
\]
which results in a transformed training set
\[
\left(x_1 = \begin{bmatrix}0\\ \sqrt{2}\end{bmatrix},\, y_1 = +1\right),\;
\left(x_2 = \begin{bmatrix}-\sqrt{2}\\ 0\end{bmatrix},\, y_2 = -1\right),\;
\left(x_3 = \begin{bmatrix}0\\ -\sqrt{2}\end{bmatrix},\, y_3 = +1\right),\;
\left(x_4 = \begin{bmatrix}\sqrt{2}\\ 0\end{bmatrix},\, y_4 = -1\right).
\]
From now on, let's only consider this new training set (as illustrated in Fig. 2 (B)), but use the same H defined in eq. (5).
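If you prefer to generate the transformed training set programmatically rather than by hand (optional; Python with numpy, using the points and rotation exactly as written above):

import numpy as np

# The four examples of Fig. 2 (A) and their labels.
X = np.array([[1.0, 1.0], [-1.0, 1.0], [-1.0, -1.0], [1.0, -1.0]])
y = np.array([+1, -1, +1, -1])

# The 45-degree rotation of eq. (6).
R = (np.sqrt(2) / 2) * np.array([[1.0, -1.0],
                                 [1.0,  1.0]])
X_new = X @ R.T                    # apply the rotation to every example
print(np.round(X_new, 3))          # the transformed training set of Fig. 2 (B)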
Let’s get started again to run Adaboost for T = 3, using the new training set. At t = 1, please nd the best decision stump f1 (i.e., solve step (a) in Algorithm 1). If there are multiple equally best stump functions, just randomly pick ONE of them to be f1. What are the corresponding 1 and 1 based on f1?
What to submit: Write down the triplet (s, b, d) of f_1, and the values of ε_1 and β_1. Please round your results to two decimal places (e.g., 1.499 becomes 1.50).
Q3.4 Now you should see that Adaboost does learn something at t = 1 (i.e., β_1 ≠ 0). Let's move on to t = 2. Please first write down w_2(n), ∀n ∈ {1, 2, 3, 4} for each example. Then, please find the best decision stump f_2 (i.e., solve step (a) in Algorithm 1). If there are multiple equally good stump functions, just randomly pick ONE of them to be f_2. What are the corresponding ε_2 and β_2 based on f_2?
What to submit: Write down w_2(n), ∀n ∈ {1, 2, 3, 4}, the triplet (s, b, d) of f_2, and the values of ε_2 and β_2. Please round your results to two decimal places (e.g., 1.499 becomes 1.50).
Q3.5 Let’s move on to t = 3. Please rst write down w3(n); 8n 2 f1; 2; 3; 4g for each example. Then, please nd the best decision stump f3 (i.e., solve step (a) in Algorithm 1). If there are multiple equally best stump functions, just randomly pick ONE of them to be f3. What are the corresponding 3 and 3 based on f3?
What to submit: Write down w_3(n), ∀n ∈ {1, 2, 3, 4}, the triplet (s, b, d) of f_3, and the values of ε_3 and β_3. Please round your results to two decimal places (e.g., 1.499 becomes 1.50).
Q3.6 Please combine your results of Q3.3, Q3.4, and Q3.5, and write down your learned F(x) in terms of h_{(s,b,d)}, β_1, β_2, and β_3. You should plug VALUES into s, b, d, β_1, β_2, β_3. For example, F(x) = sign[0.5 h_{(+1,0.5,1)}(x) + 0.5 h_{(−1,0.5,1)}(x) + 0.5 h_{(−1,2,2)}(x)]. Then please write down the predicted labels of x_1, x_2, x_3, x_4 using F(x). How many training examples are correctly labeled by F(x)?
What to submit: Write down F(x), and compute F(x_1), F(x_2), F(x_3), F(x_4), and the number of correctly labeled examples. Please round your results to two decimal places (e.g., 1.499 becomes 1.50).
Q3.7 From Q3.6, you should see that Adaboost with decision stumps can solve a problem similar to XOR in 3 iterations, by using a simple linear transformation (more precisely, a 2-dimensional rotation). Note that the decision boundary given by F(x) is nonlinear, even though each h ∈ H can only produce a linear boundary. Also note that, even with the same feature transform, a linear classifier such as logistic regression cannot correctly label every training example in our case.
What to submit: Nothing.
Programming component
4 High-level descriptions
4.1 Dataset
We will use mnist_subset (images of handwritten digits from 0 to 9). This is the same subset of the full MNIST dataset that we used for Homework 1 and Homework 2. As before, the dataset is stored in a JSON-formatted file mnist_subset.json. You can access its training, validation, and test splits using the keys ‘train’, ‘valid’, and ‘test’, respectively. For example, suppose we load mnist_subset.json into the variable x. Then, x['train'] refers to the training set of mnist_subset. This set is a list with two elements: x['train'][0] contains the features of size N (samples) × D (dimension of features), and x['train'][1] contains the corresponding labels of size N.
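The provided code already loads the file for you, but if you want to peek at the data yourself, a minimal sketch looks like this:

import json

with open('mnist_subset.json') as f:
    x = json.load(f)

train_features, train_labels = x['train']
print(len(train_features), len(train_features[0]))  # N samples, D features
print(len(train_labels))                            # N labels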
4.2 Tasks
You will be asked to implement the linear support vector machine (SVM) for binary classification (Sect. 5). Specifically, you will
finish implementing the following three functions in pegasos.py: objective_function, pegasos_train, and pegasos_test. Refer to pegasos.py and Sect. 5 for more information.
Run the script pegasos.sh after you finish your implementation. This will output pegasos.json. Add, commit, and push (1) pegasos.py and (2) the pegasos.json file that you have created.
As in the previous homework, you are not responsible for loading/pre-processing the data; we have done that for you (e.g., we have pre-processed the data to have two classes: digits 0–4 and digits 5–9). For specific instructions, please refer to the text in Sect. 5 and the instructions in pegasos.py.
4.3 Cautions
Please do not import packages that are not listed in the provided code. Follow the instructions in each section strictly to code up your solutions. Do not change the output format. Do not modify the code unless we instruct you to do so. A homework solution that does not match the provided setup, such as format, name, initializations, etc., will not be graded. It is your responsibility to make sure that your code runs with the provided commands and scripts on the VM. Finally, make sure that you git add, commit, and push all the required les, including your code and generated output les.
5 Pegasos: a stochastic-gradient-based solver for linear SVM
In this question, you will build a linear SVM classifier using the Pegasos algorithm [1]. Given a training set {(x_n ∈ R^D, y_n ∈ {1, −1})}_{n=1}^N, the primal formulation of linear SVM is as follows:
\[
\min_{w}\;\; \frac{\lambda}{2}\,\|w\|_2^2 \;+\; \frac{1}{N}\sum_{n}\max\{0,\; 1 - y_n w^T x_n\}. \tag{7}
\]
Instead of turning it into the dual formulation, we are going to solve the primal formulation directly with a gradient-based algorithm. Note that here we include the bias term b in the parameter w by appending a constant 1 to x.
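For reference, the quantity in eq. (7) can be written in a few lines of numpy; this is only a sketch of the computation (the actual objective_function in pegasos.py has its own prescribed interface, which you should follow):

import numpy as np

def svm_primal_objective(w, X, y, lam):
    """Objective of eq. (7): (lambda/2) ||w||^2 + average hinge loss.

    X is N x D with the constant 1 already appended to each row; y is in {+1, -1}.
    """
    margins = y * (X @ w)
    hinge = np.maximum(0.0, 1.0 - margins)
    return 0.5 * lam * np.dot(w, w) + hinge.mean()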
In (batch) gradient descent, at each iteration of the parameter update, we compute the gradients for all data points and take the average (or sum). When the training set is large (i.e., N is a large number), this is often too computationally expensive (for example, the data may be too large to hold in memory). Stochastic gradient descent with mini-batches alleviates this issue by computing the gradient on only a subset of the data at each iteration.
One key issue with using (stochastic) gradient descent to solve eq. (7) is that max{0, z} is not differentiable at z = 0. You have seen this issue in Homework 2, where we used the Heaviside function to deal with it. In this question, you are going to learn and implement Pegasos, a representative solver of eq. (7) that applies stochastic gradient descent with mini-batches and explicitly takes care of the non-differentiability issue.
The pseudocode of Pegasos is given in Algorithm 2. At the t-th iteration of the parameter update, we first take a mini-batch of data A_t of size K [step (a)], and define A_t^+ ⊆ A_t to be the subset of samples for which w_t suffers a non-zero loss [step (b)]. Next we set the learning rate η_t = 1/(λt) [step (c)]. We then perform a two-step parameter update as follows. We first compute
(1 − η_t λ) w_t, and for all samples (x, y) ∈ A_t^+ we add the vector (η_t / K) y x to (1 − η_t λ) w_t. We denote the resulting vector by w_{t+1/2} [step (d)]. This step can also be written as w_{t+1/2} = w_t − η_t ∇_t, where
\[
\nabla_t = \lambda w_t - \frac{1}{|A_t|} \sum_{(x,y) \in A_t^+} y\, x.
\]
The definition of the hinge loss, together with the Heaviside function, implies that ∇_t is the gradient of the objective function on the mini-batch A_t at w_t. Last, we set w_{t+1} to be the projection of w_{t+1/2} onto the set
\[
B = \{\, w : \|w\| \le 1/\sqrt{\lambda} \,\}.
\]
This is obtained by scaling w_{t+1/2} by min{1, (1/√λ) / ||w_{t+1/2}||} [step (e)]. For details of the Pegasos algorithm you may refer to the original paper [1].
Input A training set S = {(x_n ∈ R^D, y_n ∈ {1, −1})}_{n=1}^N, the total number of iterations T, the batch size K, and the regularization parameter λ.
Output The last weight w_{T+1}.
Initialization Choose w_1 s.t. ||w_1|| ≤ 1/√λ.
for t = 1, 2, ..., T do
  (a) Randomly choose a subset of data A_t ⊆ S, where |A_t| = K
  (b) Set A_t^+ = {(x, y) ∈ A_t : y w_t^T x < 1}
  (c) Set η_t = 1/(λt)
  (d) Set w_{t+1/2} = (1 − η_t λ) w_t + (η_t / K) Σ_{(x,y)∈A_t^+} y x
  (e) Set w_{t+1} = min{1, (1/√λ) / ||w_{t+1/2}||} · w_{t+1/2}
end
Algorithm 2: The Pegasos algorithm
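To make the notation above concrete, here is a condensed Python sketch of Algorithm 2 (assuming numpy; the interface and initialization of the real pegasos_train in pegasos.py may differ, so treat this only as a reading aid):

import numpy as np

def pegasos_sketch(X, y, lam, T, K, seed=0):
    """Illustrative rendering of Algorithm 2; X is N x D, y is in {+1, -1}."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w = np.zeros(D)                                  # ||w_1|| = 0 <= 1/sqrt(lam)
    for t in range(1, T + 1):
        idx = rng.choice(N, size=K, replace=False)   # (a) mini-batch A_t
        Xb, yb = X[idx], y[idx]
        plus = yb * (Xb @ w) < 1                     # (b) A_t^+: non-zero hinge loss
        eta = 1.0 / (lam * t)                        # (c) learning rate
        w = (1 - eta * lam) * w + (eta / K) * (yb[plus] @ Xb[plus])   # (d)
        norm = np.linalg.norm(w)
        if norm > 0:                                 # (e) project onto the ball B
            w = min(1.0, (1.0 / np.sqrt(lam)) / norm) * w
    return w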
Now you are going to implement Pegasos and train a binary linear SVM classifier on the dataset mnist_subset.json. You are going to implement three functions, objective_function, pegasos_train, and pegasos_test, in a script named pegasos.py. You will find detailed programming instructions in the script.
Q5.1 Finish the implementation of the function objective_function, which corresponds to the objective function in the primal formulation of SVM.
Q5.2 Finish the implementation of the function pegasos_train, which corresponds to Algorithm 2.
Q5.3 After you train your model, run your classifier on the test set and report the accuracy, which is defined as:
\[
\frac{\#\ \text{of correctly classified test samples}}{\#\ \text{of test samples}}
\]
Finish the implementation of the function pegasos_test.
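The accuracy itself is a one-liner once you have predictions; a sketch (again, follow the prescribed interface of pegasos_test in pegasos.py rather than this one):

import numpy as np

def test_accuracy(w, X_test, y_test):
    """Fraction of test samples whose predicted sign matches the label."""
    preds = np.sign(X_test @ w)
    preds[preds == 0] = 1            # assign points exactly on the boundary to +1
    return float(np.mean(preds == y_test))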
Q5.4 After you complete the above steps, run pegasos.sh, which will run the Pegasos algorithm for 500 iterations under 6 settings (mini-batch size K = 100 with different λ ∈ {0.01, 0.1, 1}, and λ = 0.1 with different K ∈ {1, 10, 1000}), and output a pegasos.json that records the test accuracy and the value of the objective function at each iteration during the training process.
What to do and submit: Run the script pegasos.sh. It will generate pegasos.json. Add, commit, and push both pegasos.py and pegasos.json before the due date.
Q5.5 Based on pegasos.json, you are encouraged to make plots of the objective function value versus the number of iterations (i.e., a convergence curve) under the different settings of λ and K, but you are not required to submit these plots.
References
[1] Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: primal estimated sub-gradient solver for SVM. Mathematical Programming, 2011, Volume 127, Number 1, Page 3.