Assignment Two Solution

Provide credit to any sources other than the course staff that helped you solve the problems. This includes all students you talked to regarding the problems.

You can look up definitions/basics online (e.g., Wikipedia, Stack Exchange, etc.).

Problem 1 (15 points). In class we said that for a generative model (e.g., Naive Bayes), the optimal estimator will be the maximum a posteriori probability (MAP) estimator that, when given a feature $\vec{X}$, outputs the label that satisfies:

$$\arg\max_{y \in \mathcal{Y}} p(y \mid \vec{X}).$$

The maximum likelihood (ML) estimator outputs the label satisfying:

$$\arg\max_{y \in \mathcal{Y}} p(\vec{X} \mid y).$$

In this problem we will see that the MAP estimator is the predictor with the least error probability if the underlying data is generated from the model.

We will simplify the setting by considering a binary classification task, where the labels have two possible values, say $\mathcal{Y} = \{-1, +1\}$. Suppose the model that generates the data is $p(\vec{X}, y)$.

1. Suppose we receive a feature $\vec{X}$, and predict the label $-1$. Show that the probability of error is $p(y = +1 \mid \vec{X})$.

2. Use this to argue that for any $\vec{X}$ the prediction that minimizes the error probability is

$$\arg\max_{y \in \{-1, +1\}} p(y \mid \vec{X}).$$

This shows that the MAP estimator is the optimal estimator for the binary task. This also extends to larger $\mathcal{Y}$.

3. Show that if the distribution over the labels is uniform, namely $p(y = -1) = p(y = +1) = 0.5$, then the MAP estimator and the ML estimator are the same.



4. Construct any generative model where the MAP and ML estimators are not the same.
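As a small numerical sketch of the distinction parts 3 and 4 are after (not part of the assignment statement): below is a toy binary generative model with a skewed label prior, under which the MAP prediction $\arg\max_y p(y \mid \vec{X})$ and the ML prediction $\arg\max_y p(\vec{X} \mid y)$ disagree. All numbers are made up for illustration.

# Toy binary generative model: skewed label prior plus a single binary feature X.
# Illustrative numbers only.

prior = {-1: 0.9, +1: 0.1}                     # p(y), heavily skewed toward y = -1

likelihood = {                                  # p(X = x | y)
    (-1, 0): 0.7, (-1, 1): 0.3,
    (+1, 0): 0.4, (+1, 1): 0.6,
}

def ml_predict(x):
    # ML ignores the prior: argmax_y p(X = x | y)
    return max((-1, +1), key=lambda y: likelihood[(y, x)])

def map_predict(x):
    # MAP uses Bayes' rule: argmax_y p(y | X = x) = argmax_y p(y) p(X = x | y)
    return max((-1, +1), key=lambda y: prior[y] * likelihood[(y, x)])

x = 1
print("ML prediction for X = 1:", ml_predict(x))    # +1, since 0.6 > 0.3
print("MAP prediction for X = 1:", map_predict(x))  # -1, since 0.9*0.3 > 0.1*0.6

Replacing the skewed prior with the uniform prior $p(y = -1) = p(y = +1) = 0.5$ makes the two predictions coincide, which is exactly what part 3 asks you to show in general.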

Problem 2. (10 points). ML vs MAP and add-constant smoothing. Suppose you generate $n$ independent coin tosses using a coin with bias $\theta$. You observe $n_H$ heads and $n_T = n - n_H$ tails. Show the following:

1. According to the maximum likelihood principle, show that your estimate for $\theta$ should be:

$$\hat{\theta} = \frac{n_H}{n_H + n_T}.$$


2. Suppose that there is a prior distribution $p(\theta)$ on the bias of the coin. This is the initial assumption on the distribution of the value of $\theta$. Show that

$$\arg\max_{\theta} \; p(\theta \mid n_H, n_T) = \arg\max_{\theta} \; p(\theta)\, p(n_H, n_T \mid \theta).$$
3. Let the prior of the bias be a Beta distribution, which is a distribution over $[0, 1]$,

$$p(\theta) = \frac{\theta^{\alpha} (1 - \theta)^{\beta}}{\int_0^1 x^{\alpha} (1 - x)^{\beta}\, dx},$$

and show that:

$$\arg\max_{\theta} \; p(\theta \mid n_H, n_T) = \frac{n_H + \alpha}{n_H + \alpha + n_T + \beta}.$$
Remark: This shows that add-constant smoothing is equivalent to inducing a certain prior on the parameter of the generating model.
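As a sanity check on part 3 (illustrative only; the counts and pseudo-counts below are arbitrary), the following sketch grid-maximizes $p(\theta)\,p(n_H, n_T \mid \theta)$ for the Beta-style prior above and compares the result with the closed-form estimate $(n_H + \alpha)/(n_H + \alpha + n_T + \beta)$ and with the ML estimate $n_H/(n_H + n_T)$.

import numpy as np

# Example counts and prior pseudo-counts (made up for illustration).
n_H, n_T = 7, 3
alpha, beta = 2.0, 2.0

theta = np.linspace(1e-6, 1 - 1e-6, 100001)

# log of prior * likelihood, dropping normalizers (they do not affect the argmax):
# prior is proportional to theta^alpha (1-theta)^beta, likelihood to theta^n_H (1-theta)^n_T.
log_posterior = (n_H + alpha) * np.log(theta) + (n_T + beta) * np.log(1 - theta)

theta_map_numeric = theta[np.argmax(log_posterior)]
theta_map_formula = (n_H + alpha) / (n_H + alpha + n_T + beta)
theta_ml = n_H / (n_H + n_T)

print("ML estimate:       ", theta_ml)                      # 0.7
print("MAP (grid search): ", round(theta_map_numeric, 4))
print("MAP (closed form): ", round(theta_map_formula, 4))   # 9/14, about 0.6429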

Problem 3. (10 points). Consider the Tennis data set.

1. For $\alpha = 1$ (smoothing constant), write down the probabilities of all the features conditioned on the labels. The total number of probabilities you need to compute should not be more than twenty.

    2. What are the probabilities of the labels?

3. For a new data point (Overcast, Hot, High, Strong), what does the Naive Bayes classifier predict for $\alpha = 0$, $\alpha = 1$, and for $\alpha \to \infty$?

4. For new data points (Overcast, Hot, High, Strong) and (Rain, Cool, High, Strong), what does the k-NN classifier predict respectively ($k = 3$)? Use the Hamming distance (i.e., the number of differing attributes) as the metric for k-NN.

As an example of Hamming distance, the distance between (Sunny, Hot, High, Weak) and (Rain, Cool, Normal, Weak) is 3, because the first 3 attributes are different and the last one is the same.
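A sketch of the two classifiers in this problem is given below. The helper functions are hypothetical implementations written against a generic list of categorical rows; the four rows shown are placeholders rather than the actual Tennis data set, so the printed predictions are not the answers to parts 3 and 4 — substitute the real Tennis rows to obtain those.

from collections import Counter

# Placeholder rows: (Outlook, Temperature, Humidity, Wind) -> PlayTennis label.
data = [
    (("Sunny", "Hot", "High", "Weak"), "No"),
    (("Overcast", "Hot", "High", "Weak"), "Yes"),
    (("Rain", "Cool", "Normal", "Strong"), "No"),
    (("Sunny", "Mild", "Normal", "Strong"), "Yes"),
]

def naive_bayes_predict(x, data, alpha=1.0):
    # Naive Bayes with add-alpha smoothing on the class-conditional feature probabilities.
    labels = sorted({y for _, y in data})
    scores = {}
    for y in labels:
        rows = [f for f, lab in data if lab == y]
        score = len(rows) / len(data)                 # label probability p(y)
        for j, value in enumerate(x):
            values_j = {f[j] for f, _ in data}        # observed values of feature j
            count = sum(1 for f in rows if f[j] == value)
            score *= (count + alpha) / (len(rows) + alpha * len(values_j))
        scores[y] = score
    return max(scores, key=scores.get)

def knn_predict(x, data, k=3):
    # k-NN under Hamming distance: the number of attributes that differ.
    dist = lambda a, b: sum(ai != bi for ai, bi in zip(a, b))
    nearest = sorted(data, key=lambda row: dist(row[0], x))[:k]
    return Counter(y for _, y in nearest).most_common(1)[0][0]

query = ("Overcast", "Hot", "High", "Strong")
print(naive_bayes_predict(query, data, alpha=1.0))
print(knn_predict(query, data, k=3))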

Problem 4. (15 points). Recall the Gaussian distribution with mean $\mu$ and variance $\sigma^2$. The density is given by:

$$p_{\mu, \sigma^2}(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left( - \frac{(x - \mu)^2}{2 \sigma^2} \right).$$

Given $n$ independent samples $X_1, \ldots, X_n$ from a Gaussian distribution with unknown mean and variance, let $\mu_{ML}$ and $\sigma^2_{ML}$ denote the maximum likelihood estimates of the mean and variance.

1. Show that

$$\mu_{ML} = \frac{\sum_{i=1}^{n} X_i}{n}.$$

2. Show that

$$\sigma^2_{ML} = \frac{1}{n} \sum_{i=1}^{n} \left( X_i - \mu_{ML} \right)^2.$$
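A quick numerical check (illustrative, not a substitute for the derivation): minimize the negative Gaussian log-likelihood over $(\mu, \sigma^2)$ on synthetic data and compare the optimum to the sample mean and the $\frac{1}{n}$-normalized sample variance. The data, true parameters, and starting point below are arbitrary.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.5, size=200)   # synthetic data with arbitrary parameters
n = len(X)

def neg_log_likelihood(params):
    # Optimize over log(sigma^2) so the variance stays positive.
    mu, log_sigma2 = params
    sigma2 = np.exp(log_sigma2)
    return 0.5 * n * np.log(2 * np.pi * sigma2) + np.sum((X - mu) ** 2) / (2 * sigma2)

res = minimize(neg_log_likelihood, x0=[0.0, 0.0])
mu_hat, sigma2_hat = res.x[0], np.exp(res.x[1])

print("numerical MLE :", mu_hat, sigma2_hat)
print("closed form   :", X.mean(), ((X - X.mean()) ** 2).mean())  # agree up to optimizer tolerance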


Problem 5. (20 points). See attached Jupyter Notebook for reference.