Homework #2 Solution

NOTE 1: Please use word processing software (e.g., Microsoft Word or LaTeX) to write your answers and submit a printed copy to me at the beginning of class on Oct 23. The rationale is that hand-written answers are sometimes hard to read and understand.

NOTE 2: Please ensure that all the graphs are appropriately labeled (x-axis, y-axis, and each curve). The caption or heading of each graph should be informative and self-contained.

    1. (10 points) Let x = (x1, x2, ..., xd) and z = (z1, z2, ..., zd) be any two points in a high-dimensional space (i.e., d is very large).

a. (5 points) Prove the following, where the right-hand-side quantity represents the standard Euclidean distance.




$$\sqrt{d}\,\left|\frac{1}{d}\sum_{i=1}^{d} x_i \;-\; \frac{1}{d}\sum_{i=1}^{d} z_i\right| \;\le\; \sqrt{\sum_{i=1}^{d} (x_i - z_i)^2} \qquad (1)$$

Hint: Use Jensen's inequality: if X is a random variable and f is a convex function, then f(E[X]) <= E[f(X)].
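For concreteness, one way the hint can be instantiated (a sketch only; the choice of the convex function f(t) = t^2 and of X drawn uniformly from the per-coordinate differences a_i = x_i - z_i is an illustrative assumption, not the only route):

$$f(\mathbb{E}[X]) \le \mathbb{E}[f(X)] \quad\Longrightarrow\quad \left(\frac{1}{d}\sum_{i=1}^{d} a_i\right)^{2} \;\le\; \frac{1}{d}\sum_{i=1}^{d} a_i^{2}.$$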

        b. (5 points) We know that computing nearest neighbors is very expensive in a high-dimensional space. Discuss how we can make use of the above property to make the nearest-neighbor computation more efficient.

    2. (10 points) We briefly discussed in class the Locality Sensitive Hashing (LSH) algorithm for making the nearest-neighbor classifier efficient. Please read the following paper and briefly summarize the key ideas as you understood them:

Alexandr Andoni, Piotr Indyk: Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Communications of the ACM 51(1): 117-122 (2008). http://people.csail.mit.edu/indyk/p117-andoni.pdf

    3. (15 points) We know that we can convert any decision tree into a set of if-then rules, where there is one rule per leaf node. Suppose you are given a set of rules R = {r1, r2, ..., rk}, where ri corresponds to the ith rule. Is it possible to convert the rule set R into an equivalent decision tree? Explain your construction or give a counterexample.

    4. (10 points) You are provided with a training set of examples (see Figure 1). Which feature will you pick first to split the data as per the ID3 decision tree learning algorithm? Show all your work: compute the information gain for all four attributes and pick the best one.

    5. (10 points) Please read the following paper and briefly summarize the key ideas as you understood them (you can skip the proofs, but it is important to understand the main results):

Andrew Y. Ng, Michael I. Jordan: On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes. NIPS 2001: 841-848. http://ai.stanford.edu/~ang/papers/nips01-discriminativegenerative.pdf

    6. (10 points)

        a. Let us assume that the training data satisfies the Naive Bayes assumption (i.e., features are independent given the class label). As the training data approaches infinity, which classifier will produce better results, Naive Bayes or Logistic Regression? Please explain your reasoning.

Figure 1: Table with training examples. Each row corresponds to a single training example. There are four features, namely outlook, temperature, humidity, and wind. "PlayTennis" is the class label.



        b. Let us assume that the training data does NOT satisfy the Naive Bayes assumption. As the training data approaches infinity, which classifier will produce better results, Naive Bayes or Logistic Regression? Please explain your reasoning.

    7. (10 points)

        a. Can we compute P(X) from the learned parameters of a Naive Bayes classifier? Please explain your reasoning.

        b. Can we compute P(X) from the learned parameters of a Logistic Regression classifier? Please explain your reasoning.

    8. (15 points) In the class, we looked at the log-likelihood derivation and the corresponding gradient ascent algorithm to find the parameters of a binary logistic regression classifier (see slide 12 and slide 13). We want to extend the log-likelihood derivation and parameter learning algorithm to the multi-class case. Suppose we have K different classes, and the posterior probability can be represented using the so-called soft-max function (see slide 18):

$$P(y = k \mid x) = \frac{\exp(w_k \cdot x)}{\sum_{i=1}^{K} \exp(w_i \cdot x)} \qquad (2)$$

        a. Derive the log-likelihood and the corresponding gradient ascent algorithm to find the parameters.

        b. Add a regularization term to the log-likelihood objective (see slide 16), and derive the gradient ascent update rule with the additional change.
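For reference, the quantities being asked for take the following general shape (a sketch only, assuming N training examples (x^(j), y^(j)), one weight vector w_k per class as in Equation (2), a learning rate eta, and an L2 regularizer with strength lambda; your own derivation should follow the notation of the slides):

$$\ell(w) = \sum_{j=1}^{N} \log P\big(y^{(j)} \mid x^{(j)}\big), \qquad \frac{\partial \ell}{\partial w_k} = \sum_{j=1}^{N} \Big( \mathbf{1}\big[y^{(j)} = k\big] - P\big(y = k \mid x^{(j)}\big) \Big)\, x^{(j)},$$

$$w_k \;\leftarrow\; w_k + \eta \left( \frac{\partial \ell}{\partial w_k} - \lambda\, w_k \right).$$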

    9. (30 points) Fortune Cookie Classifier [1]

You will build a binary fortune cookie classifier. This classifier will be used to classify fortune cookie messages into two classes: messages that predict what will happen in the future (class 1) and messages that just contain a wise saying (class 0). For example,


[1] Thanks to Weng-Keen Wong and his advisor Andrew Moore for sharing the data.


"Never go in against a Sicilian when death is on the line" would be a message in class 0.

"You will get an A in Machine learning class" would be a message in class 1.

Files Provided: There are three sets of files. All words in these files are lower case and punctuation has been removed.

1) The training data:

traindata.txt: This is the training data consisting of fortune cookie messages.

trainlabels.txt: This file contains the class labels for the training data.

2) The testing data:

testdata.txt: This is the testing data consisting of fortune cookie messages.

testlabels.txt: This file contains the class labels for the testing data.

3) A list of stopwords: stoplist.txt

There are two steps: the pre-processing step and the classification step. In the pre-processing step, you will convert fortune cookie messages into features to be used by your classifier. You will be using a bag-of-words representation. The following steps outline the process involved:

Form the vocabulary. The vocabulary consists of the set of all the words that are in the training data with stop words removed (stop words are common, uninformative words such as "a" and "the" that are listed in the file stoplist.txt). The vocabulary will now be the features of your training data. Keep the vocabulary in alphabetical order to help you with debugging.

Now, convert the training data into a set of features. Let M be the size of your vocabulary. For each fortune cookie message, you will convert it into a feature vector of size M. Each slot in that feature vector takes the value of 0 or 1. For these M slots, if the ith slot is 1, it means that the ith word in the vocabulary is present in the fortune cookie message; otherwise, if it is 0, then the ith word is not present in the message. Most of these feature vector slots will be 0. Since you are keeping the vocabulary in alphabetical order, the first feature will be the first word alphabetically in the vocabulary.
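A minimal pre-processing sketch in Python is shown below (the file names are the ones listed above; the helper names and the whitespace tokenization are illustrative assumptions, not a required implementation):

```python
# Build a binary bag-of-words representation for the fortune cookie data.
# Assumes one message per line in traindata.txt and one stop word per line in stoplist.txt.

def load_lines(path):
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

stopwords = set(load_lines("stoplist.txt"))
messages = load_lines("traindata.txt")
y_train = [int(label) for label in load_lines("trainlabels.txt")]

# Vocabulary: all training words minus stop words, kept in alphabetical order.
vocab = sorted({w for msg in messages for w in msg.split() if w not in stopwords})
index = {w: i for i, w in enumerate(vocab)}

def to_feature_vector(message):
    """Binary vector of size M: slot i is 1 iff the i-th vocabulary word occurs in the message."""
    vec = [0] * len(vocab)
    for w in message.split():
        if w in index:
            vec[index[w]] = 1
    return vec

X_train = [to_feature_vector(msg) for msg in messages]
```

The test data (testdata.txt, testlabels.txt) would be converted with the same vocabulary and the same to_feature_vector helper.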

    a) Implement the Naive Bayes Classifier (with Laplace Smoothing) and run it on the training data. Compute the training and testing accuracy.

To debug and test your implementation, you can employ Weka (weka.classifiers.bayes.NaiveBayes): http://www.cs.waikato.ac.nz/ml/weka/downloading.html or scikit-learn (http://scikit-learn.org/stable/modules/naive_bayes.html).
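For example, a quick scikit-learn cross-check might look like the following sketch (BernoulliNB with alpha=1.0 corresponds to Laplace smoothing on binary features; X_train/y_train are assumed to be the arrays from the pre-processing sketch above, and X_test/y_test are assumed to be built the same way from testdata.txt and testlabels.txt):

```python
from sklearn.naive_bayes import BernoulliNB

# alpha=1.0 gives add-one (Laplace) smoothing for the binary bag-of-words features.
nb = BernoulliNB(alpha=1.0)
nb.fit(X_train, y_train)
print("train accuracy:", nb.score(X_train, y_train))
print("test accuracy:", nb.score(X_test, y_test))
```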

        b) Run the off-the-shelf Logistic Regression classifier from Weka (weka.classifiers.functions.Logistic) or scikit-learn (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) on the training data. Compute the training and testing accuracy.
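A corresponding scikit-learn usage sketch (again assuming the same X_train/y_train and X_test/y_test arrays as above):

```python
from sklearn.linear_model import LogisticRegression

# Off-the-shelf logistic regression; max_iter raised so the solver converges on sparse text features.
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)
print("train accuracy:", lr.score(X_train, y_train))
print("test accuracy:", lr.score(X_test, y_test))
```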

    10. (30 points) Income Classifier using Decision Trees. You will use the Adult Income dataset from HW1 for this question.

        a) Implement the ID3 decision tree learning algorithm that we discussed in the class. The key step in decision tree learning is choosing the next feature to split on. Implement the information gain heuristic for selecting the next feature. Please see the lecture notes or https://en.wikipedia.org/wiki/ID3_algorithm for more details. I explained how to select candidate thresholds for continuous features: sort all candidate values for feature f from the training data. Suppose f1, f2, ..., fn is the sorted list. The candidate thresholds are chosen as fi + (fi+1 - fi)/2 for i = 1 to n-1.
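A small sketch of the two ingredients described above, entropy-based information gain and candidate thresholds for a continuous feature (helper names are illustrative, not prescribed):

```python
import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum_c p_c * log2(p_c) over the class proportions in labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, left_labels, right_labels):
    """Gain of a binary split = H(parent) - weighted average of the child entropies."""
    n = len(labels)
    weighted = (len(left_labels) / n) * entropy(left_labels) \
             + (len(right_labels) / n) * entropy(right_labels)
    return entropy(labels) - weighted

def candidate_thresholds(values):
    """Midpoints between consecutive sorted values of a continuous feature."""
    v = sorted(set(values))
    return [v[i] + (v[i + 1] - v[i]) / 2 for i in range(len(v) - 1)]
```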

    b) Run the decision tree construction algorithm on the training examples. Compute the accuracy on validation examples and testing examples.


    c) Implement the decision tree pruning algorithm discussed in the class (via validation data).

    d) Run the pruning algorithm on the decision tree constructed using training examples. Compute the accuracy on validation examples and testing examples. List your observations by comparing the performance of the decision tree with and without pruning.

To debug and test your implementation, you can employ Weka (weka.classifiers.trees.J48): http://www.cs.waikato.ac.nz/ml/weka/downloading.html or scikit-learn (http://scikit-learn.org/stable/modules/tree.html).
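For instance, a scikit-learn sanity check could look like the sketch below (X_train, y_train, X_val, y_val, X_test, and y_test are assumed to be your own numerically encoded Adult Income splits; scikit-learn trees require numeric features):

```python
from sklearn.tree import DecisionTreeClassifier

# criterion="entropy" mirrors the information gain heuristic used by ID3.
clf = DecisionTreeClassifier(criterion="entropy")
clf.fit(X_train, y_train)
print("validation accuracy:", clf.score(X_val, y_val))
print("test accuracy:", clf.score(X_test, y_test))
```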

Instructions for Code Submission and Output Format.

Please follow the instructions below. They will help us grade the programming part of the homework. We will provide a Dropbox folder link for code submission.

Supported programming languages: Python, Java, C++

Store all the relevant files in a folder and submit the corresponding zip file named after your student ID, e.g., 114513209.zip

This folder should have a script file named

run_code.sh

Executing this script should do all the necessary steps required for executing the code, including compiling, linking, and execution.

Assume relative file paths in your code. Some examples: "./filename.txt" or "../hw2/filename.txt"

The output of your program should be dumped in a file named "output.txt". Make sure the output.txt file is dumped when you execute the script

run_code.sh

Zip the entire folder and submit it as <student_id>.zip

Grading Rubric

Each question in the student's work will be assigned a letter grade of either A, B, C, D, or F by the instructor and TAs. This five-point (discrete) scale is described as follows:

A) Exemplary (=100%).

Solution presented solves the problem stated correctly and meets all requirements of the problem.

Solution is clearly presented.

Assumptions made are reasonable and are explicitly stated in the solution.

Solution represents an elegant and effective way to solve the problem and is not more complicated than necessary.


B) Capable (=75%).

Solution is mostly correct, satisfying most of the above criteria under the exemplary category, but contains some minor pitfalls, errors/flaws, or limitations.


C) Needs Improvement (=50%).

Solution demonstrates a viable approach toward solving the problem but contains some major pitfalls, errors/flaws, or limitations.


D) Unsatisfactory (=25%)

Critical elements of the solution are missing or significantly flawed.

Solution does not demonstrate sufficient understanding of the problem and/or any reasonable directions to solve the problem.


F) Not attempted (=0%) No solution provided.


The points on a given homework question will be equal to the percentage assigned (given by the letter grades shown above) multiplied by the maximum number of points that question is worth. For example, if a question is worth 6 points and the answer is awarded a B grade, then that implies 4.5 points out of 6.
