Analytical Part (60 points)
Q1. (10 points) Suppose you are given the following multi-class classification training data, where each input example has three features and the output label takes one of three values: good, bad, or ugly.
x1 = (0, 1, 0), y1 = good
x2 = (1, 0, 1), y2 = bad
x3 = (1, 1, 1), y3 = ugly
x4 = (1, 0, 0), y4 = bad
x5 = (0, 0, 1), y5 = good
Suppose we want to learn a linear classifier using the multi-class perceptron algorithm, starting from the following weights: wgood = (0, 0, 0); wbad = (0, 0, 0); and wugly = (0, 0, 0). Please do hand calculations to show how the weights change after processing the examples in the given order (i.e., one single pass over the five training examples). See slide 88 of the Perceptron notes.
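If you want to check your hand calculations, a minimal Python sketch of the standard multi-class perceptron update (on a mistake, add x to the true class's weight vector and subtract it from the predicted class's) is shown below. The tie-breaking order good, bad, ugly used here is an assumption, so state your own rule in your write-up.

```python
import numpy as np

# Training data from Q1 (features, labels).
X = np.array([[0, 1, 0],
              [1, 0, 1],
              [1, 1, 1],
              [1, 0, 0],
              [0, 0, 1]])
y = ["good", "bad", "ugly", "bad", "good"]

# One weight vector per class, initialized to zero as stated.
w = {c: np.zeros(3) for c in ["good", "bad", "ugly"]}

# Single pass over the examples in the given order.
for x, label in zip(X, y):
    # Predict the class with the highest score (ties broken by the
    # insertion order of w; your hand calculation should state its rule).
    pred = max(w, key=lambda c: w[c] @ x)
    if pred != label:
        # Standard multi-class perceptron update.
        w[label] += x
        w[pred] -= x
    print(x, label, "->", pred, {c: w[c].tolist() for c in w})
```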
Q2. (10 points) Suppose you are given the following binary classification training data, where each input example has three features and the output label takes the value good or bad.
x1 = (0, 1, 0), y1 = good
x2 = (1, 0, 1), y2 = bad
x3 = (1, 1, 1), y3 = good
x4 = (1, 0, 0), y4 = bad
x5 = (0, 0, 1), y5 = good
Suppose we want to learn a classifier using the kernelized perceptron algorithm. Start from the following dual weights: α1 = 0; α2 = 0; α3 = 0; α4 = 0; and α5 = 0. Please do hand calculations to show how the dual weights change after processing the examples in the given order (i.e., one single pass over the five training examples). Do this separately for the following kernels: (a) linear kernel: K(x, x') = x · x'; and (b) polynomial kernel of degree 3: K(x, x') = (x · x' + 1)^3, where x · x' stands for the dot product between the two inputs x and x'. See Algorithm 30 in http://ciml.info/dl/v0_99/ciml-v0_99-ch11.pdf. You can ignore the bias term b.
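As a sanity check on your hand calculations, here is a minimal sketch of one pass of the kernelized perceptron in the style of CIML Algorithm 30, with the bias b ignored as the question allows. The label encoding good = +1, bad = −1 is an assumption made for illustration.

```python
import numpy as np

# Training data from Q2; labels mapped to +/-1 (assumption: good=+1, bad=-1).
X = np.array([[0, 1, 0],
              [1, 0, 1],
              [1, 1, 1],
              [1, 0, 0],
              [0, 0, 1]], dtype=float)
y = np.array([+1, -1, +1, -1, +1])  # good, bad, good, bad, good

def linear_kernel(a, b):
    return a @ b

def poly3_kernel(a, b):
    return (a @ b + 1) ** 3

def one_pass(kernel):
    """Single pass of the kernelized perceptron (CIML Alg. 30, bias ignored)."""
    alpha = np.zeros(len(X))
    for n in range(len(X)):
        # Activation is a kernel-weighted sum over the training examples.
        a = sum(alpha[m] * kernel(X[m], X[n]) for m in range(len(X)))
        if y[n] * a <= 0:          # mistake (or zero activation): update
            alpha[n] += y[n]
        print(f"x{n+1}: activation={a:+.1f}, alpha={alpha}")
    return alpha

one_pass(linear_kernel)
one_pass(poly3_kernel)
```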
Q3. (10 points) Let x = (x1, x2, ..., xd) and z = (z1, z2, ..., zd) be any two points in a high-dimensional space (i.e., d is very large). Suppose you are given the following property, where the right-hand-side quantity is the squared Euclidean distance between x and z.
\[
\left( \frac{1}{\sqrt{d}} \sum_{i=1}^{d} x_i \;-\; \frac{1}{\sqrt{d}} \sum_{i=1}^{d} z_i \right)^{2} \;\le\; \sum_{i=1}^{d} (x_i - z_i)^2 \qquad (1)
\]
We know that the computation of nearest neighbors is very expensive in a high-dimensional space. Discuss how we can make use of the above property to make the nearest-neighbor computation efficient.
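As one illustration of the intended idea (not the only acceptable answer): the left-hand side of property (1) depends only on the coordinate sums, so a one-number summary per point, precomputed once, gives a cheap lower bound on the squared distance that can prune candidates before any O(d) work. A sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5_000, 1_000                        # high-dimensional setting
means = rng.uniform(-3, 3, size=(n, 1))    # give points different profiles
data = rng.normal(loc=means, size=(n, d))
query = rng.normal(loc=0.0, size=d)

# Precompute the cheap one-number summary s(x) = (1/sqrt(d)) * sum_i x_i.
s_data = data.sum(axis=1) / np.sqrt(d)
s_query = query.sum() / np.sqrt(d)

best_dist2, best_idx, pruned = np.inf, -1, 0
for i in range(n):
    # Property (1): (s(x) - s(z))^2 <= ||x - z||^2, so if the bound is
    # already no better than the best distance found, skip the O(d) work.
    if (s_data[i] - s_query) ** 2 >= best_dist2:
        pruned += 1
        continue
    dist2 = np.sum((data[i] - query) ** 2)  # full O(d) distance
    if dist2 < best_dist2:
        best_dist2, best_idx = dist2, i

print(f"nearest neighbor: index {best_idx}, squared distance {best_dist2:.1f}; "
      f"pruned {pruned} of {n} candidates")
```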
Q4. (10 points) We know that we can convert any decision tree into a set of if-then rules, where there is one rule per leaf node. Suppose you are given a set of rules R = {r1, r2, ..., rk}, where ri corresponds to the ith rule. Is it possible to convert the rule set R into an equivalent decision tree? Explain your construction or give a counterexample.
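For concreteness, the known direction of the equivalence looks like the following minimal sketch (the tuple-based binary-tree representation is an assumption made purely for illustration); Q4 asks whether this mapping can always be inverted.

```python
# Extract one if-then rule per leaf from a binary decision tree.
# A tree node is either a class label (leaf) or a tuple
# (feature, subtree_if_1, subtree_if_0) -- an assumed toy representation.

def extract_rules(node, conditions=()):
    """Return a list of (conditions, label) pairs, one per leaf."""
    if isinstance(node, str):               # leaf: node is a class label
        return [(list(conditions), node)]
    feature, true_child, false_child = node
    return (extract_rules(true_child, conditions + (f"{feature}=1",)) +
            extract_rules(false_child, conditions + (f"{feature}=0",)))

# Tree: test x1; if x1=1 test x2, else predict "bad".
tree = ("x1", ("x2", "ugly", "bad"), "bad")
for conds, label in extract_rules(tree):
    print("IF", " AND ".join(conds), "THEN", label)
```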
Q6. (20 points) Please read the following two papers and write a brief summary of the main points in at most THREE pages.
D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-François Crespo, Dan Dennison: Hidden Technical Debt in Machine Learning Systems. NIPS 2015: 2503-2511. https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
Eric Breck, Shanqing Cai, Eric Nielsen, Michael Salib, D. Sculley: The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction. BigData 2017: 1123-1132. https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/46555.pdf
Empirical Analysis (40 points)
You will use the Weka software: http://www.cs.waikato.ac.nz/ml/weka/downloading.html. You can use the Graphical Interface to answer all the questions below – it is easier. Weka employs the ARFF format (https://www.cs.waikato.ac.nz/ml/weka/arff.html) for datasets. All the specific details provided below are for Weka.
Please use the voting dataset provided for this question, in ARFF format. Use the last 100 examples for testing and the remaining examples for training.
You can also use the scikit-learn software (http://scikit-learn.org/stable/) if you are more comfortable with Python.
Bagging (weka.classifiers.meta.Bagging). You will use a decision tree (weka.classifiers.trees.J48) as the base supervised learner. Try trees of different depths (1, 2, 3, 5, 10) and different sizes of the bag or ensemble, i.e., numbers of trees (10, 20, 40, 60, 80, 100). Compute the training accuracy and testing accuracy for the different combinations of tree depth and number of trees, and plot them. List your observations.
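If you take the scikit-learn route mentioned above instead of the Weka GUI, the bagging experiment might look like the following sketch. The file name voting.arff and the integer-encoding of nominal and missing values are assumptions, and the estimator parameter assumes scikit-learn >= 1.2.

```python
import pandas as pd
from scipy.io import arff
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Load the ARFF file (assumed name) and integer-encode nominal attributes;
# the last attribute is assumed to be the class label.
raw, _ = arff.loadarff("voting.arff")
df = pd.DataFrame(raw).apply(lambda c: pd.factorize(c)[0])
X, y = df.iloc[:, :-1].to_numpy(), df.iloc[:, -1].to_numpy()
X_train, y_train = X[:-100], y[:-100]      # last 100 examples for testing
X_test, y_test = X[-100:], y[-100:]

for depth in (1, 2, 3, 5, 10):
    for n_trees in (10, 20, 40, 60, 80, 100):
        model = BaggingClassifier(
            estimator=DecisionTreeClassifier(max_depth=depth),
            n_estimators=n_trees, random_state=0).fit(X_train, y_train)
        print(depth, n_trees,
              model.score(X_train, y_train), model.score(X_test, y_test))
```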
SVM classification learner (weka.classifiers.functions.supportVector). You will run the SVM classifier on the training data to answer the following questions.
(a) Using a linear kernel (-t 0 option), train the SVM on the training data for different values of the C parameter (-c option): 10^-4, 10^-3, 10^-2, 10^-1, 10^0, 10^1, 10^2, 10^3, 10^4. Compute the training accuracy and testing accuracy for the SVM obtained with each value of the C parameter. Plot the training accuracy and testing accuracy as a function of C (C value on the x-axis and accuracy on the y-axis) – one curve each for the training and testing data. List your observations.
(b) Repeat experiment (a) with a polynomial kernel (-t 1 -d option) of degree 2, 3, and 4. Compare the training and testing accuracies for the different kernels (linear, polynomial kernel of degree 2, polynomial kernel of degree 3, and polynomial kernel of degree 4). List your observations.
Grading Rubric
Each question in the student's work will be assigned a letter grade of A, B, C, D, or F by the instructor and TAs. This five-point (discrete) scale is described as follows:
A) Exemplary (=100%).
Solution presented solves the problem stated correctly and meets all requirements of the problem.
Solution is clearly presented.
Assumptions made are reasonable and are explicitly stated in the solution.
Solution represents an elegant and effective way to solve the problem and is not more complicated than necessary.
B) Capable (=75%).
Solution is mostly correct, satisfying most of the above criteria under the Exemplary category, but contains some minor pitfalls, errors/flaws, or limitations.
C) Needs Improvement (=50%).
Solution demonstrates a viable approach toward solving the problem but contains some major pitfalls, errors/flaws, or limitations.
D) Unsatisfactory (=25%).
Critical elements of the solution are missing or significantly flawed.
Solution does not demonstrate sufficient understanding of the problem and/or any reasonable directions to solve the problem.
F) Not attempted (=0%). No solution provided.
The points for a given homework question will be equal to the percentage assigned (given by the letter grades shown above) multiplied by the maximum number of points for that question. For example, if a question is worth 6 points and the answer is awarded a B grade, then that implies 4.5 points out of 6.