All questions have multiple-choice answers ([a], [b], [c], ...). You can collaborate with others, but do not discuss the selected or excluded choices in the answers. You can consult books and notes, but not other people's solutions. Your solutions should be based on your own work. Definitions and notation follow the lectures.
Note about the homework
The goal of the homework is to facilitate a deeper understanding of the course material. The questions are not designed to be puzzles with catchy answers. They are meant to make you roll up your sleeves, face uncertainties, and approach the problem from different angles.
The problems range from easy to difficult, and from practical to theoretical. Some problems require running a full experiment to arrive at the answer.
The answer may not be obvious or numerically close to one of the choices, but one (and only one) choice will be correct if you follow the instructions precisely in each problem. You are encouraged to explore the problem further by experimenting with variations on these instructions, for the learning benefit.
You are also encouraged to take part in the forum http://book.caltech.edu/bookforum
where there are many threads about each homework set. We hope that you will contribute to the discussion as well. Please follow the forum guidelines for posting answers (see the "BEFORE posting answers" announcement at the top there).
© 2012-2015 Yaser Abu-Mostafa. All rights reserved. No redistribution in any format. No translation or derivative products without written permission.
Primal versus Dual Problem
1. Recall that N is the size of the data set and d is the dimensionality of the input space. The original formulation of the hard-margin SVM problem (minimize (1/2) w^T w subject to the inequality constraints), without going through the Lagrangian dual problem, is

[a] a quadratic programming problem with N variables
[b] a quadratic programming problem with N + 1 variables
[c] a quadratic programming problem with d variables
[d] a quadratic programming problem with d + 1 variables
[e] not a quadratic programming problem
Notice: The following problems deal with a real-life data set. In addition, the computational packages you use may employ different heuristics and require different tweaks. This is a typical situation that a Machine Learning practitioner faces. There are uncertainties, and the answers may or may not match our expectations. Although this situation is not as 'sanitized' as other homework problems, it is important to go through it as part of the learning experience.
SVM with Soft Margins
In the rest of the problems of this homework set, we apply soft-margin SVM to handwritten digits from the processed US Postal Service Zip Code data set. Download the data (extracted features of intensity and symmetry) for training and testing:
http://www.amlbook.com/data/zip/features.train
http://www.amlbook.com/data/zip/features.test
(the format of each row is: digit intensity symmetry). We will train two types of binary classifiers: one-versus-one (one digit is class +1 and another digit is class -1, with the rest of the digits disregarded), and one-versus-all (one digit is class +1 and the rest of the digits are class -1).
The data set has thousands of points, and some quadratic programming packages cannot handle this size. We recommend that you use the packages in libsvm:
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
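Loading the data files and building the two labeling schemes can be sketched as follows (a minimal sketch assuming NumPy; the function and variable names are illustrative, not part of the assignment):

```python
import numpy as np

def load_zip_data(path):
    """Load a zip-code feature file; each row is `digit intensity symmetry`."""
    data = np.loadtxt(path)
    digits = data[:, 0].astype(int)  # the digit label, 0-9
    X = data[:, 1:3]                 # intensity and symmetry features
    return digits, X

def one_vs_all(digits, positive):
    """One-versus-all labels: the `positive` digit -> +1, every other digit -> -1."""
    return np.where(digits == positive, 1, -1)

def one_vs_one(digits, X, pos, neg):
    """One-versus-one: keep only digits `pos` (+1) and `neg` (-1); drop the rest."""
    mask = (digits == pos) | (digits == neg)
    return X[mask], np.where(digits[mask] == pos, 1, -1)
```

For example, `digits, X = load_zip_data('features.train')` followed by `y = one_vs_all(digits, 0)` sets up the 0-versus-all task.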
Implement SVM with soft margin on the above zip-code data set by solving
    min_α    (1/2) Σ_{n=1}^{N} Σ_{m=1}^{N} α_n α_m y_n y_m K(x_n, x_m)  -  Σ_{n=1}^{N} α_n

    s.t.     Σ_{n=1}^{N} y_n α_n = 0

             0 ≤ α_n ≤ C,   n = 1, ..., N
When evaluating Ein and Eout of the resulting classifier, use binary classification error.
Practical remarks:
For the purpose of this homework, do not scale the data when you use libsvm or other packages, otherwise you may inadvertently change the (effective) kernel and get different results.
In some packages, you need to specify double precision.
In 10-fold cross validation, if the data size is not a multiple of 10, the sizes of the 10 subsets may be off by 1 data point.
Some packages have software parameters whose values affect the outcome. ML practitioners have to deal with this kind of added uncertainty.
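As one possible sketch (not the only way to proceed), the training and evaluation loop with libsvm via scikit-learn's `SVC` wrapper might look like this; `binary_error` and `train_soft_margin` are illustrative helper names, and the data are passed in unscaled, per the first remark above:

```python
import numpy as np
from sklearn.svm import SVC

def binary_error(clf, X, y):
    """Binary classification error: fraction of misclassified points."""
    return float(np.mean(clf.predict(X) != y))

def train_soft_margin(X, y, C, **kernel_params):
    """Fit a soft-margin SVM (libsvm under the hood); data left unscaled."""
    clf = SVC(C=C, **kernel_params)
    clf.fit(X, y)
    return clf
```

Then `binary_error(clf, X_train, y_train)` gives Ein, `binary_error(clf, X_test, y_test)` gives Eout, and `clf.n_support_.sum()` gives the total number of support vectors.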
Polynomial Kernels
Consider the polynomial kernel K(x_n, x_m) = (1 + x_n^T x_m)^Q, where Q is the degree of the polynomial.
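One practical caution (an observation about the scikit-learn/libsvm parameterization, not part of the problem statement): their polynomial kernel is (gamma * x_n^T x_m + coef0)^Q, so reproducing the kernel as written requires gamma = 1 and coef0 = 1:

```python
import numpy as np
from sklearn.svm import SVC

def poly_kernel_manual(X1, X2, Q):
    """K(x_n, x_m) = (1 + x_n^T x_m)^Q, exactly as defined in the text."""
    return (1.0 + X1 @ X2.T) ** Q

# gamma=1 and coef0=1 make the library kernel match the definition above;
# leaving gamma at its default would silently change the effective kernel.
clf = SVC(C=0.01, kernel='poly', degree=2, gamma=1.0, coef0=1.0)
```

Checking `poly_kernel_manual` against the library's own kernel computation on a few random points is a cheap way to confirm the parameter mapping before running the full experiments.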
2. With C = 0.01 and Q = 2, which of the following classifiers has the highest Ein?

[a] 0 versus all
[b] 2 versus all
[c] 4 versus all
[d] 6 versus all
[e] 8 versus all
3. With C = 0.01 and Q = 2, which of the following classifiers has the lowest Ein?

[a] 1 versus all
[b] 3 versus all
[c] 5 versus all
[d] 7 versus all
[e] 9 versus all
4. Comparing the two selected classifiers from Problems 2 and 3, which of the following values is the closest to the difference between the numbers of support vectors of these two classifiers?

[a] 600
[b] 1200
[c] 1800
[d] 2400
[e] 3000
5. Consider the 1 versus 5 classifier with Q = 2 and C ∈ {0.001, 0.01, 0.1, 1}. Which of the following statements is correct? Going up or down means strictly so.

[a] The number of support vectors goes down when C goes up.
[b] The number of support vectors goes up when C goes up.
[c] Eout goes down when C goes up.
[d] Maximum C achieves the lowest Ein.
[e] None of the above
6. In the 1 versus 5 classifier, comparing Q = 2 with Q = 5, which of the following statements is correct?

[a] When C = 0.0001, Ein is higher at Q = 5.
[b] When C = 0.001, the number of support vectors is lower at Q = 5.
[c] When C = 0.01, Ein is higher at Q = 5.
[d] When C = 1, Eout is lower at Q = 5.
[e] None of the above
Cross Validation
In the next two problems, we will experiment with 10-fold cross validation for the polynomial kernel. Because Ecv is a random variable that depends on the random partition of the data, we will try 100 runs with different partitions and base our answer on how many runs lead to a particular choice.
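A sketch of one such run, assuming scikit-learn; `select_C_by_cv` is a hypothetical helper. Iterating over the C values in ascending order with a strict `<` comparison implements a tie-breaking rule that prefers the smaller C, and the same shuffled partition is reused for every C within a run:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import KFold, cross_val_score

def select_C_by_cv(X, y, C_values, seed, n_folds=10):
    """One CV run: pick the C with the lowest E_cv; ties go to the smaller C."""
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    best_C, best_ecv = None, None
    for C in sorted(C_values):  # ascending, so strict '<' keeps the smaller C on ties
        clf = SVC(C=C, kernel='poly', degree=2, gamma=1.0, coef0=1.0)
        ecv = 1.0 - cross_val_score(clf, X, y, cv=kf).mean()
        if best_ecv is None or ecv < best_ecv:
            best_C, best_ecv = C, ecv
    return best_C, best_ecv
```

Repeating this for 100 seeds and tallying the winning C values gives the counts asked for below. (Averaging the per-fold errors is a slight approximation of Ecv when fold sizes differ by one point, per the earlier practical remark.)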
7. Consider the 1 versus 5 classifier with Q = 2. We use Ecv to select C ∈ {0.0001, 0.001, 0.01, 0.1, 1}. If there is a tie in Ecv, select the smaller C. Within the 100 random runs, which of the following statements is correct?

[a] C = 0.0001 is selected most often.
[b] C = 0.001 is selected most often.
[c] C = 0.01 is selected most often.
[d] C = 0.1 is selected most often.
[e] C = 1 is selected most often.
8. Again, consider the 1 versus 5 classifier with Q = 2. For the winning selection in the previous problem, the average value of Ecv over the 100 runs is closest to

[a] 0.001
[b] 0.003
[c] 0.005
[d] 0.007
[e] 0.009
RBF Kernel
Consider the radial basis function (RBF) kernel K(x_n, x_m) = exp(-||x_n - x_m||^2) in the soft-margin SVM approach. Focus on the 1 versus 5 classifier.
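The same parameterization caution applies here (again an observation about scikit-learn/libsvm, not part of the problem): the library's RBF kernel is exp(-gamma * ||x_n - x_m||^2), so gamma = 1 matches the kernel as written:

```python
import numpy as np
from sklearn.svm import SVC

def rbf_kernel_manual(X1, X2):
    """K(x_n, x_m) = exp(-||x_n - x_m||^2), i.e. the RBF kernel with gamma = 1."""
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists)

# gamma=1 reproduces the kernel above; the default ('scale') would not.
clf = SVC(C=1.0, kernel='rbf', gamma=1.0)
```

As with the polynomial kernel, comparing `rbf_kernel_manual` against the library's pairwise kernel on random points confirms the mapping.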
9. Which of the following values of C results in the lowest Ein?

[a] C = 0.01
[b] C = 1
[c] C = 100
[d] C = 10^4
[e] C = 10^6
10. Which of the following values of C results in the lowest Eout?

[a] C = 0.01
[b] C = 1
[c] C = 100
[d] C = 10^4
[e] C = 10^6