CSE 353 Homework I Due by Mar 29, 2024
Theory
I: A "warm up" problem 5 points
Consider instances of $\mathcal{X}$ drawn from the uniform distribution $D$ on $[-1, 1]$. Let $f$ denote the actual labeling function mapping each instance to its label $y \in \{-1, 1\}$ with probabilities
$$\Pr(y = 1 \mid x \ge 0) = 0.9$$
$$\Pr(y = -1 \mid x \ge 0) = 0.1$$
$$\Pr(y = 1 \mid x < 0) = 0.1$$
$$\Pr(y = -1 \mid x < 0) = 0.9$$
The hypothesis $h$ predicts the label for each instance as defined below:
$$h(x) = \begin{cases} 1 & \text{if } x \ge 0 \\ -1 & \text{otherwise} \end{cases}$$
Measure the success of this predictor by calculating the training error for h.
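If you want to sanity-check your pencil-and-paper answer, a quick Monte Carlo estimate is easy to set up. The sketch below is purely illustrative (the function name and sample count are arbitrary choices, not part of the assignment):

import random

def estimate_error(num_samples=1_000_000, seed=0):
    # Monte Carlo estimate of the error of h under the distribution above:
    # x ~ Uniform[-1, 1]; y = 1 w.p. 0.9 when x >= 0 and w.p. 0.1 when x < 0.
    rng = random.Random(seed)
    mistakes = 0
    for _ in range(num_samples):
        x = rng.uniform(-1.0, 1.0)
        p_pos = 0.9 if x >= 0 else 0.1
        y = 1 if rng.random() < p_pos else -1
        h = 1 if x >= 0 else -1          # the hypothesis from the problem
        mistakes += (h != y)
    return mistakes / num_samples

print(estimate_error())  # should be close to your analytical answer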
II: Bayes Optimal Predictor 5 points
Show that for every probability distribution $D$, the Bayes optimal predictor $f_D$ is, in fact, optimal. That is, show that for every classifier $g : \mathcal{X} \to \{0, 1\}$,
$$L_D(f_D) \le L_D(g).$$
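As a reminder (this is the standard definition; adapt if your class notes use different notation), for a distribution $D$ the Bayes optimal predictor is
$$f_D(x) = \begin{cases} 1 & \text{if } \Pr[y = 1 \mid x] \ge 1/2, \\ 0 & \text{otherwise.} \end{cases}$$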
III: Perceptron with a Learning Rate 5 points
Let us modify the perceptron algorithm as follows:
In the update step, instead of setting $w^{(t+1)} = w^{(t)} + y_i x_i$ every time there is a misclassification, we set $w^{(t+1)} = w^{(t)} + \eta y_i x_i$ instead, for some $0 < \eta < 1$. Show that this modified perceptron will
perform the same number of iterations as the original, and
converge to a vector that points in the same direction as the output of the original perceptron.
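Though no substitute for the proof, a quick simulation can make both claims tangible. This is a sketch under stated assumptions: the toy data is hypothetical, and eta = 1.0 recovers the original algorithm.

import random

def perceptron(xs, ys, eta=1.0, max_updates=10_000):
    # Modified perceptron: w <- w + eta * y * x on each mistake.
    # Returns (weight vector, number of updates performed).
    w = [0.0] * len(xs[0])
    updates = 0
    while updates < max_updates:
        mistake = False
        for x, y in zip(xs, ys):
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                updates += 1
                mistake = True
        if not mistake:          # a full pass with no updates: converged
            return w, updates
    return w, updates

# Hypothetical linearly separable toy data (first coordinate is a bias term):
rng = random.Random(1)
xs = [(1.0, rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(50)]
ys = [1 if x[1] + x[2] > 0 else -1 for x in xs]
w1, n1 = perceptron(xs, ys, eta=1.0)
w2, n2 = perceptron(xs, ys, eta=0.3)
print(n1 == n2)  # identical number of updates
print(w1, w2)    # w2 should be exactly 0.3 * w1, i.e., the same direction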
IV: Unidentical Distributions 5 points
Let $\mathcal{X}$ be a domain and let $D_1, D_2, \ldots, D_m$ be a sequence of distributions over $\mathcal{X}$. Let $\mathcal{H}$ be a finite class of binary classifiers over $\mathcal{X}$ and let $f \in \mathcal{H}$. Suppose we are getting a sample $S$ of $m$ examples such that the instances are independent but not identically distributed; the $i$th instance $x_i$ is sampled from $D_i$, and then $y_i$ is set to be $f(x_i)$. Let $\bar{D}_m$ denote the average distribution, i.e., $\bar{D}_m = (D_1 + \cdots + D_m)/m$.
Fix an accuracy parameter $\epsilon \in (0, 1)$. Show that
$$\Pr\left[\exists\, h \in \mathcal{H} \text{ s.t. } L_{(\bar{D}_m, f)}(h) > \epsilon \text{ and } L_{(S, f)}(h) = 0\right] \le |\mathcal{H}|\, e^{-\epsilon m}.$$
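As a reminder, two standard facts that often appear in arguments of this shape are the union bound and the inequalities
$$1 - \epsilon \le e^{-\epsilon} \qquad \text{and} \qquad \prod_{i=1}^{m} a_i \le \Big(\frac{1}{m}\sum_{i=1}^{m} a_i\Big)^{m} \quad \text{for } a_i \ge 0.$$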
Programming
In this section, you will be working with a Breast Cancer Prediction dataset. Diagnosis of breast cancer is performed when an abnormal lump is found (from self-examination or x-ray) or a tiny speck of calcium is seen (on an x-ray). In this dataset, there are 569 observations, with each observation consisting of 5 features (mean radius, mean texture, mean perimeter, mean area, mean smoothness). The last column is the diagnosis: 0 indicates that the finding was benign, and 1 indicates that it was malignant.
I: Perceptron 25 points
The first task is to write your own code in Java or Python to implement the perceptron learning algorithm. Your goal is to minimize the empirical risk (i.e., no need to do validation or structural risk minimization) for the perceptron to learn the diagnosis.
As you can tell, this is a binary classification task. For this task, submit your code for the perceptron learning algorithm. This code should be able to read the dataset, and eventually provide the learned weight vector as its final output.
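For orientation, a minimal sketch of what such a program might look like is shown below. It assumes the CSV layout described above (five feature columns followed by the 0/1 diagnosis) plus a header row; every name in it is an illustrative choice, not a requirement:

import csv
import sys

def load_dataset(path):
    # Assumes: header row, five feature columns, then the 0/1 diagnosis.
    # Labels are mapped to {-1, +1}; a constant bias feature is prepended.
    xs, ys = [], []
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row (assumed present)
        for row in reader:
            xs.append([1.0] + [float(v) for v in row[:-1]])
            ys.append(1 if float(row[-1]) >= 0.5 else -1)
    return xs, ys

def perceptron(xs, ys, max_epochs=1000):
    # Plain perceptron: w <- w + y * x on each misclassified example.
    w = [0.0] * len(xs[0])
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(xs, ys):
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:  # a full pass with no updates: converged
            break
    return w

if __name__ == "__main__":
    xs, ys = load_dataset(sys.argv[1])
    print(perceptron(xs, ys))  # the learned weight vector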
II: A modified Perceptron algorithm 15 points
The second task is to modify the perceptron algorithm to exploit the observations that do not lead to updated weights. This modified perceptron algorithm does the following: for each weight vector w, (i) it keeps count of the number of consecutive correctly classified observations until the next update, and (ii) it always keeps the weight vector w that had the longest such streak. This variation is called the pocket algorithm because it keeps the best solution seen so far "in the pocket", and returns this rather than the last solution (see Alg. 1).
Your submission for this task should be your code for the pocket algorithm. Just like the original perceptron code, this code too should be able to read the dataset, and eventually provide the learned weight vector as its final output. You may write totally separate code for the pocket algorithm, or you may choose to incorporate it as a ‘version’ of the perceptron algorithm. An example of the latter approach could be something that looks like
python perceptron.py --version naive --dataset /path/to/data/filename.csv
python perceptron.py --version pocket --dataset /path/to/data/filename.csv
III: Linear Regression for Classification 25 points
The third and final programming task is to use linear regression for classification. In class, we saw how logistic regression has a nice probabilistic interpretation that can be used for binary classification. Linear regression, however, does not.
What you are required to do, instead, is divide the dataset into two sub-datasets according to whether the diagnosis is 0 or 1. Next, run linear regression on each of these two sub-datasets to obtain two linear subspaces that fit the benign findings and the malignant findings, respectively. Once you have obtained these two subspaces (both through ERM), the actual binary classification will be done by computing the Euclidean distance of each observation from the two subspaces, and assigning the label of the closer fit. That is, once you obtain the two vectors $w_0$ (for benign) and
$w_1$ (for malignant) as output of the two linear regressions, an observation will be classified as 0 or 1 depending on which of these two weight vectors it is closer to.
Your submission for this task should be your code for linear regression. This code should be able to perform the whole task defined above, and not require the user to manually divide the dataset and call the ‘usual’ linear regression code multiple times. For example, the user should be able to do something like
$ python linearregressionclassifier.py --dataset /path/to/data/filename.csv
and obtain the classifier’s performance.
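The statement above leaves some latitude in what, exactly, each regression fits. One defensible reading (an assumption on my part, not the only option) is to regress the last feature on the other four within each class, then classify by point-to-hyperplane distance. A sketch along those lines, using numpy only for the matrix arithmetic that the Notes below appear to permit:

import numpy as np

def fit_hyperplane(X):
    # Regress the last feature on the other four (plus a bias term).
    A = np.hstack([np.ones((X.shape[0], 1)), X[:, :-1]])
    t = X[:, -1]
    # Normal equations for the squared-loss ERM: (A^T A) w = A^T t
    w = np.linalg.solve(A.T @ A, A.T @ t)
    a = np.concatenate([w[1:], [-1.0]])   # normal vector of the fitted hyperplane
    return a, -w[0]                       # hyperplane: a . x = b

def classify(X, planes):
    # Label = index of the nearest fitted hyperplane (0 = benign, 1 = malignant).
    dists = [np.abs(X @ a - b) / np.linalg.norm(a) for a, b in planes]
    return np.argmin(np.stack(dists, axis=1), axis=1)

# Hypothetical usage, with `data` the full (569, 6) array read from the CSV:
# planes = [fit_hyperplane(data[data[:, -1] == c][:, :-1]) for c in (0, 1)]
# preds = classify(data[:, :-1], planes)
# print("empirical risk:", np.mean(preds != data[:, -1]))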
IV: Final Report 15 points
Along with your code, you are also required to submit a short report (no more than 2 pages). The report must contain the following:
The performance of each model (i.e., the empirical risk).
A brief discussion (at most a half-page) about what you observed regarding the runtime and termination of the original perceptron algorithm, and what you did if/when the algorithm did not stop after some reasonable amount of time. In this discussion, also investigate why the algorithm did or did not stop and justify your reasoning.
A brief discussion (at most a half-page) about what you observed regarding the runtime and termination of the pocket algorithm, especially when compared to the original perceptron algorithm.
A brief discussion (at most a half-page) about using linear regression in this way and what might be the potential benefits and/or drawbacks of the approach.
A README section explaining how to run your code. Keep in mind that the grader will only run the code, and not do any manual work (e.g., placing the dataset in the same folder as your code, dividing the dataset into two, etc.) to make sure that your code runs "as is"!
The report will be graded based on (a) performance details, (b) replicability of experiments, (c) explanation of how to run your code, and (d) the quality of the discussions about each algorithm.
Notes
What programming languages are allowed?
Java (JDK 1.8 or above), Python (3.x).
You are NOT allowed to use any data science and/or machine learning library for this assignment. If you have doubts about whether the use of a library is acceptable, please ask before assuming that it can be used!
You can use libraries for the matrix multiplication and eigenvalue decomposition steps required in linear regression. Beyond that, though, the ERM code for linear regression must be your own original code.
What should we submit, and how?
Submit a single .zip archive containing (i) one folder simply called "code", containing all your code, (ii) a PDF document for your report, and (iii) a PDF document for the theory part of your submission. Please do NOT handwrite the solutions and scan or take pictures. For the PDF documents, use either LaTeX or MS Word to write the solutions, and export to PDF. Anything else will not be graded.
For the report, please stick to single line spacing.
Input:
    Training data: $S = \{(x_i, y_i)\}_{i=1}^{n}$, $\mathcal{X} = \{1\} \times \mathbb{R}^d$, $\mathcal{Y} = \{-1, 1\}$.
    Learning rate parameter: $0 < \eta < 1$
    Maximum number of steps: maxiter
Initialization:
    $w \leftarrow 0$
    $w_{\mathrm{pocket}} \leftarrow 0$
    $run \leftarrow 0$
    $run_{\mathrm{pocket}} \leftarrow 0$
while $w$ has updated $\wedge\; run <$ maxiter do
    draw the next labeled example $(x, y)$ from $S$;
    if $(\langle w, x\rangle \ge 0 \wedge y = -1) \vee (\langle w, x\rangle < 0 \wedge y = 1)$ then
        if $run \ge run_{\mathrm{pocket}}$ then
            $w_{\mathrm{pocket}} \leftarrow w$;
            $run_{\mathrm{pocket}} \leftarrow run$;
        end
        $w \leftarrow w + \eta y x$;
        $run \leftarrow 0$;
    else
        $run \leftarrow run + 1$;
    end
end
Output: $w_{\mathrm{pocket}} \in \mathbb{R}^{d+1}$.
Algorithm 1: Pocket Algorithm
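For concreteness, here is one way Algorithm 1 might translate to Python. Variable names mirror the pseudocode; treating maxiter as a cap on the total number of steps (rather than on the streak length) is an assumption made so that the sketch also terminates on non-separable data:

def pocket(xs, ys, eta=0.5, maxiter=100_000):
    # Sketch of Algorithm 1: keep "in the pocket" the weight vector that
    # produced the longest streak of consecutive correct classifications.
    d = len(xs[0])               # xs already include the constant 1 coordinate
    w = [0.0] * d
    w_pocket = [0.0] * d
    run = run_pocket = 0
    for step in range(maxiter):
        x, y = xs[step % len(xs)], ys[step % len(xs)]  # next example from S
        score = sum(wi * xi for wi, xi in zip(w, x))
        if (score >= 0 and y == -1) or (score < 0 and y == 1):  # misclassified
            if run >= run_pocket:        # longest streak so far: pocket it
                w_pocket, run_pocket = list(w), run
            w = [wi + eta * y * xi for wi, xi in zip(w, x)]     # update step
            run = 0
        else:
            run += 1
    return w_pocket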