Machine Learning Homework 2 Solution


Problem 1 (written) – 20 points

In this problem you will derive a naive Bayes classifier that you will implement in Problem 2 below. Recall that for a labeled set of data $(y_1, x_1), \ldots, (y_n, x_n)$, where for this problem $y \in \{0, 1\}$ and $x$ is a $D$-dimensional vector not necessarily in $\mathbb{R}^D$, the Bayes classifier observes a new $x_0$ and predicts $y_0$ as



$$y_0 = \arg\max_{y} \; p(y_0 = y \mid \pi) \prod_{d=1}^{D} p_d(x_{0,d} \mid \theta_y^{(d)}).$$

The distribution $p(y_0 = y \mid \pi) = \mathrm{Bernoulli}(y \mid \pi)$. What is “naive” about this classifier is the assumption that all $D$ dimensions of $x$ are independent. Observe that you can pick any distribution $p_d$ you think appropriate for the $d$th dimension. In this problem, assume that $D = 2$ and $p_1$ is a Bernoulli distribution and $p_2$ is a Pareto distribution. That is,

$$p_1(x_{0,1} \mid \theta_y^{(1)}) = (\theta_y^{(1)})^{x_{0,1}} (1 - \theta_y^{(1)})^{1 - x_{0,1}}, \qquad p_2(x_{0,2} \mid \theta_y^{(2)}) = \theta_y^{(2)} \, x_{0,2}^{-(\theta_y^{(2)} + 1)}.$$

The parameter $\theta_y^{(1)} \in [0, 1]$ and $\theta_y^{(2)} > 0$. For the class prior Bernoulli distribution, $\pi \in [0, 1]$. Given some data $(y_1, x_1), \ldots, (y_n, x_n)$, derive the maximum likelihood solution for $\pi$, $\theta_y^{(1)}$ and $\theta_y^{(2)}$, i.e.,

$$\hat{\pi},\, \hat{\theta}_y^{(1)},\, \hat{\theta}_y^{(2)} = \arg\max_{\pi,\, \theta_y^{(1)},\, \theta_y^{(2)}} \; \sum_{i=1}^{n} \ln p(y_i \mid \pi) + \sum_{i=1}^{n} \ln p(x_{i1} \mid \theta_{y_i}^{(1)}) + \sum_{i=1}^{n} \ln p(x_{i2} \mid \theta_{y_i}^{(2)}).$$

(I have avoided writing separate copies of this for $y = 1$ and $y = 0$ above.) Notice that this can be viewed as three independent subproblems. Please separate your derivations as follows:


(a) Derive $\hat{\pi}$ using the objective above.

(b) Derive $\hat{\theta}_y^{(1)}$ using the objective above. Derive this leaving $y$ arbitrary.

(c) Derive $\hat{\theta}_y^{(2)}$ using the objective above. Derive this leaving $y$ arbitrary.
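For reference, here is a sketch of where these derivations lead, writing $n_y = \sum_{i=1}^{n} \mathbb{1}(y_i = y)$ for the number of training examples in class $y$ and assuming the Pareto support is $x \geq 1$ as in the density above:

$$\hat{\pi} = \frac{1}{n} \sum_{i=1}^{n} y_i, \qquad \hat{\theta}_y^{(1)} = \frac{1}{n_y} \sum_{i:\, y_i = y} x_{i1}, \qquad \hat{\theta}_y^{(2)} = \frac{n_y}{\sum_{i:\, y_i = y} \ln x_{i2}}.$$

Each expression follows from setting the derivative of the corresponding term in the objective to zero; showing those intermediate steps is what parts (a)–(c) ask for.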

Problem 2 (coding) – 40 points

In this problem you will implement the naive Bayes classifier derived in Problem 1, as well as the k-NN algorithm and logistic regression algorithm. The data consists of examples of spam and non-spam emails, of which there are 4508 training examples and 93 testing examples. The feature vector $x$ is a 57-dimensional vector extracted from the email and $y = 1$ indicates a spam email. The data has been preprocessed by me such that the first 54 dimensions of each observation are binary and the last three dimensions are positive numbers.¹ As with the first homework, you would ideally use cross-validation on multiple partitions, but I am keeping it simple here with one training and one testing set.

For the naive Bayes classifier, use 54 Bernoulli distributions for the 54 binary dimensions and 3 Pareto distributions for the last 3 positive dimensions. I chose the Pareto distribution because it is able to model outliers more easily, and the data seems to have many of them. There are other “heavy tail” distributions we could have chosen as well. (The reason for the Bernoulli distribution should be clear.)

    (a) Implement the naive Bayes classifier described above on the training data and make predictions on the testing data. In a 2 × 2 table, write the number of times that you predicted a class $y$ data point (ground truth) as a class $y_0$ data point (model prediction) in the $(y, y_0)$-th cell of the table, where $y$ and $y_0$ can each be either 0 or 1. There should be four values written in the table in your PDF. Next to your table, write the prediction accuracy, i.e., the sum of the diagonal divided by 93.
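As a rough sketch of one possible implementation (in Python/NumPy rather than Matlab; the function and variable names here are my own, not part of the assignment), the classifier from Problem 1 can be fit and applied as follows:

import numpy as np

def fit_naive_bayes(X, y):
    # X: (n, 57) array, first 54 columns binary, last 3 positive; y: (n,) array in {0, 1}
    pi = y.mean()                                    # MLE of the class prior
    theta_bern = np.zeros((2, 54))                   # Bernoulli parameters, one row per class
    theta_pareto = np.zeros((2, 3))                  # Pareto parameters, one row per class
    for c in (0, 1):
        Xc = X[y == c]
        theta_bern[c] = Xc[:, :54].mean(axis=0)      # Bernoulli MLE for each binary dimension
        theta_pareto[c] = Xc.shape[0] / np.log(Xc[:, 54:]).sum(axis=0)  # Pareto MLE
    # If a Bernoulli parameter comes out exactly 0 or 1, clipping it slightly avoids log(0) below.
    return pi, theta_bern, theta_pareto

def predict_naive_bayes(X, pi, theta_bern, theta_pareto):
    # Compare log posteriors rather than products of probabilities to avoid underflow.
    scores = np.zeros((X.shape[0], 2))
    for c in (0, 1):
        tb, tp = theta_bern[c], theta_pareto[c]
        log_prior = np.log(pi if c == 1 else 1.0 - pi)
        log_bern = X[:, :54] * np.log(tb) + (1.0 - X[:, :54]) * np.log(1.0 - tb)
        log_pareto = np.log(tp) - (tp + 1.0) * np.log(X[:, 54:])
        scores[:, c] = log_prior + log_bern.sum(axis=1) + log_pareto.sum(axis=1)
    return scores.argmax(axis=1)

Assuming arrays X_train, y_train, X_test, y_test have already been loaded, y_hat = predict_naive_bayes(X_test, *fit_naive_bayes(X_train, y_train)) gives the predictions, and the 2 × 2 table is filled by counting the pairs (y_test[i], y_hat[i]).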

    (b) In one figure, show a stem plot (stem() in Matlab) of the 54 Bernoulli parameters for each class. Use the file “spambase.names” to make an observation about dimensions 16 and 52.
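One way to produce such a figure in Python is with matplotlib's stem(), the analogue of Matlab's stem(); the array tb below refers to the fitted Bernoulli parameters from the sketch in part (a) and is an assumption on my part:

import matplotlib.pyplot as plt
import numpy as np

dims = np.arange(1, 55)
plt.stem(dims, tb[0], linefmt='b-', markerfmt='bo', label='class y = 0')
plt.stem(dims, tb[1], linefmt='r-', markerfmt='r^', label='class y = 1')
plt.xlabel('dimension')
plt.ylabel('Bernoulli parameter')
plt.legend()
plt.show()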

    (c) Implement the k-NN algorithm for $k = 1, \ldots, 20$. Use the $\ell_1$ distance for this problem (the sum of the absolute values of the differences). Plot the prediction accuracy as a function of $k$.
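A minimal sketch of this in Python/NumPy (again, the names and structure are my own assumptions):

import numpy as np

def knn_accuracies(X_train, y_train, X_test, y_test, max_k=20):
    # l1 distance (sum of absolute differences) from each test point to every training point
    dists = np.array([np.abs(X_train - x).sum(axis=1) for x in X_test])
    order = np.argsort(dists, axis=1)                    # nearest training points first
    accuracies = []
    for k in range(1, max_k + 1):
        votes = y_train[order[:, :k]]                    # labels of the k nearest neighbors
        y_hat = (votes.mean(axis=1) > 0.5).astype(int)   # majority vote; ties go to class 0
        accuracies.append((y_hat == y_test).mean())
    return accuracies

Plotting the returned list against k = 1, ..., 20 gives the required figure.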

Finally, you will run logistic regression on the same data set. Set every $y_i = 0$ to $y_i = -1$ for this part.
Also, be sure to add a dimension equal to +1 to each data point.
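Concretely, assuming the arrays from the earlier sketches, this amounts to something like:

import numpy as np

y_train_pm = np.where(y_train == 0, -1, 1)                            # relabel {0, 1} as {-1, +1}
y_test_pm = np.where(y_test == 0, -1, 1)
X_train_aug = np.hstack([X_train, np.ones((X_train.shape[0], 1))])    # append the +1 dimension
X_test_aug = np.hstack([X_test, np.ones((X_test.shape[0], 1))])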

    (d) Implement the steepest ascent algorithm given in Lecture 9. Use an iteration-dependent step size $\eta_t$, where $\eta_t = \frac{1}{10^5 \sqrt{t+1}}$. Run your algorithm for 10,000 iterations and plot the logistic regression objective training function $L$ per iteration. (The pattern should look strange.)
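The Lecture 9 derivation is not reproduced here; the sketch below assumes the usual logistic regression log-likelihood $L(w) = \sum_{i=1}^{n} \ln \sigma(y_i x_i^T w)$ with $y_i \in \{-1, +1\}$ and $\sigma(z) = 1/(1 + e^{-z})$, and the code structure is my own:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def steepest_ascent(X, y, iters=10000):
    # X: (n, 58) array with the +1 column appended; y: (n,) array in {-1, +1}
    w = np.zeros(X.shape[1])
    objective = []
    for t in range(1, iters + 1):
        s = sigmoid(y * (X @ w))
        objective.append(np.log(s).sum())           # objective L at the current w
        grad = X.T @ ((1.0 - s) * y)                # gradient of L with respect to w
        eta = 1.0 / (1e5 * np.sqrt(t + 1))          # step size eta_t = 1 / (10^5 sqrt(t + 1))
        w = w + eta * grad                          # ascent step
    return w, objective

Plotting the list objective against the iteration index gives the required figure.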

    (e) Finally, implement a gradient method called “Newton’s method” as follows: At iteration $t$, set

$$w^{(t+1)} = w^{(t)} - \eta_t \, (\nabla_w^2 L)^{-1} \nabla_w L, \qquad \nabla_w^2 L = -\sum_{i=1}^{n} \sigma(x_i^T w)\bigl(1 - \sigma(x_i^T w)\bigr) x_i x_i^T \quad \text{(a matrix)}.$$

Set $\eta_t = \frac{1}{\sqrt{t+1}}$ for this problem. Plot the objective function $L$ on the training data as a function of $t = 1, \ldots, 100$. Below the plot give the prediction accuracy on the testing data. We don’t have time to go through Newton’s method, but hopefully this will motivate you to study it more!
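A corresponding sketch for the Newton updates, under the same assumed objective and data conventions as the steepest ascent sketch above (how to handle a nearly singular Hessian, if that occurs, is left as an implementation choice):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newtons_method(X, y, iters=100):
    # X: (n, 58) array with the +1 column appended; y: (n,) array in {-1, +1}
    w = np.zeros(X.shape[1])
    objective = []
    for t in range(1, iters + 1):
        s = sigmoid(y * (X @ w))
        objective.append(np.log(s).sum())                     # L on the training data
        grad = X.T @ ((1.0 - s) * y)                          # gradient of L
        p = sigmoid(X @ w)
        hess = -(X * (p * (1.0 - p))[:, None]).T @ X          # Hessian of L (negative semidefinite)
        eta = 1.0 / np.sqrt(t + 1)                            # step size eta_t = 1 / sqrt(t + 1)
        w = w - eta * np.linalg.solve(hess, grad)             # w <- w - eta * (Hessian)^{-1} gradient
    return w, objective

Test predictions can then be made from the sign of $x^T w$ and compared with the ±1 test labels to get the accuracy.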


    ¹ The original data is at https://archive.ics.uci.edu/ml/datasets/Spambase. More information about the meanings of the 57 dimensions of the data is provided in two accompanying files.
