Instructions: Please put all answers in a single PDF with your name and NetID and upload to SAKAI before class on the due date (there is a LaTeX template on the course web site for you to use). Definitely consider working in a group; please include the names of the people in your group and write up your solutions separately. If you look at any references (even wikipedia), cite them. If you happen to track the number of hours you spent on the homework, it would be great if you could put that at the top of your homework to give us an indication of how difficult it was.
Problem 1
EM and K-means
(a) We wish to compute the M-step in an EM algorithm for a Gaussian mixture model.
Starting with the expected complete data log likelihood (auxiliary) function:
\[
Q(\theta, \theta^{(t-1)}) = \sum_{i=1}^{n}\sum_{k=1}^{K} p(z_i = k \mid x_i, \theta^{(t-1)}) \log \pi_k + \sum_{i=1}^{n}\sum_{k=1}^{K} p(z_i = k \mid x_i, \theta^{(t-1)}) \log p(x_i \mid \theta_k)
\]
where θ = {µ_k, π_k, Σ_k}
(based on Equation 11.26 on page 351 in Murphy's book), derive the maximum likelihood estimates of µ_k and π_k.
As defined in Equations 11.1 and 11.2 on pages 338–339 in Murphy's book, each base distribution in the mixture model is a multivariate Gaussian with mean vector µ_k and covariance matrix Σ_k, and π holds the mixing weights, satisfying 0 ≤ π_k ≤ 1 and
\[ \sum_{k=1}^{K} \pi_k = 1. \]
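Though not required for the answer, one standard shorthand (Murphy's "responsibilities") may help organize the derivation; with it, the auxiliary function splits into a term involving only π and a term involving only the Gaussian parameters, which can be maximized separately:
\[
r_{ik} \triangleq p(z_i = k \mid x_i, \theta^{(t-1)}), \qquad
Q(\theta, \theta^{(t-1)}) = \sum_{i=1}^{n}\sum_{k=1}^{K} r_{ik} \log \pi_k + \sum_{i=1}^{n}\sum_{k=1}^{K} r_{ik} \log \mathcal{N}(x_i \mid \mu_k, \Sigma_k)
\]
The first term can be maximized subject to \(\sum_{k=1}^{K} \pi_k = 1\) (e.g., with a Lagrange multiplier); the second by setting its gradient with respect to µ_k to zero.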
(b) Compare these maximum likelihood updates for π and µ to those of K-means (Section 11.4.2.5 on page 352 in Murphy's book). How are the updates similar, and how do they differ? Use 3–4 sentences to explain your answer.
Problem 2
Hidden Markov Model
[Figure: graphical model of a first-order HMM, with latent chain Z_1 → Z_2 → ⋯ → Z_T and corresponding emissions X_1, X_2, ..., X_T.]
In this problem we will be using a first-order Hidden Markov Model (HMM), where the latent variables are denoted Z ∈ {Z_1, ..., Z_T} and the observed variables are X ∈ {X_1, ..., X_T}, as shown in the figure above. Assume we have observed data D = {x_1, ..., x_n}, where x_i ∈ {0, 1}, and that the emission and transition distributions are:
\[ X_t \mid Z_t \sim \mathrm{Ber}(\mu_{Z_t}), \qquad Z_t \mid Z_{t-1} \sim \mathrm{Ber}(\pi_{Z_{t-1}}) \]
Note that Z_t and Z_{t−1} appear as subscripts on the parameters µ and π in the Bernoulli distributions above. Our goals are to estimate the parameters θ = {µ, π} and to update our estimates as we observe more data.
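To make the model concrete, below is a minimal R sketch that samples a sequence from this HMM. The parameter values, the initial-state probability, and the function name simulate_hmm are illustrative assumptions, not part of the problem.

# Minimal sketch: simulate the binary HMM above.
# emit_prob[z + 1] plays the role of mu_z = P(X_t = 1 | Z_t = z);
# trans_prob[z + 1] plays the role of pi_z = P(Z_t = 1 | Z_{t-1} = z).
simulate_hmm <- function(T_len, emit_prob = c(0.2, 0.9),
                         trans_prob = c(0.3, 0.7), init_prob = 0.5) {
  z <- integer(T_len)
  x <- integer(T_len)
  z[1] <- rbinom(1, 1, init_prob)                   # initial latent state Z_1
  x[1] <- rbinom(1, 1, emit_prob[z[1] + 1])         # emission X_1 | Z_1
  for (t in 2:T_len) {
    z[t] <- rbinom(1, 1, trans_prob[z[t - 1] + 1])  # transition Z_t | Z_{t-1}
    x[t] <- rbinom(1, 1, emit_prob[z[t] + 1])       # emission X_t | Z_t
  }
  list(z = z, x = x)
}

set.seed(1)
d <- simulate_hmm(100)
table(d$x)  # counts of observed 0s and 1s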
(a) Write out the observation likelihood function (P(X | θ)).
(b) Write out the complete data log likelihood (ℓ_c(θ) = log P(X, Z | θ)).
(c) Write out the expected complete data log likelihood (Q(θ, θ^{(t−1)}) = E[ℓ_c(θ) | D, θ^{(t−1)}]).
(d) Explicitly write out the maximum likelihood updates for π and µ (the M-step), assuming the E-step is as we discussed in class.
(e) Clearly outline the steps for updating the parameters using the Expectation-Maximization (EM) algorithm in pseudocode (you do not need to implement it).
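For reference, a generic EM loop has the shape sketched below. This is a model-agnostic R sketch, not the answer to part (e); e_step and m_step are hypothetical placeholders for the HMM-specific computations from parts (c) and (d), and e_step is assumed to also return the current log likelihood.

# Generic EM skeleton (model-agnostic sketch; placeholders, not provided code).
run_em <- function(x, theta0, e_step, m_step, max_iter = 100, tol = 1e-6) {
  theta <- theta0
  ll_old <- -Inf
  for (iter in seq_len(max_iter)) {
    post <- e_step(x, theta)   # E-step: posteriors over latent states (+ loglik)
    theta <- m_step(x, post)   # M-step: maximize the expected complete data log likelihood
    if (abs(post$loglik - ll_old) < tol) break  # stop when the likelihood converges
    ll_old <- post$loglik
  }
  theta
}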
Problem 3
Conceptual clustering algorithm
We have seen in class that we can use the K-means algorithm to identify K clusters in a dataset. Now consider a case where we are given a dataset X ∈ R^{1000×1} (download from Sakai) and do not know the appropriate pre-defined number of clusters, K. We propose to cluster the data, and to determine the number of clusters K from the data, using the following method:
(Note that η_k is the centroid of cluster k, and K is the number of clusters, each with its own centroid: η = {η_1, ..., η_K}.)
Adaptive K-means algorithm:
1. Initialize by placing all of the data in one cluster (all X_i's are assigned to a single cluster; call η_0 their centroid). Store η_0.
2. Within the range of the data (i.e., based on centroid η_0), randomly generate a new centroid, which we will call η_{K+1}.
3. Assign each X_i to one of the K + 1 clusters based on which η_k minimizes the Euclidean distance:
\[ z_i = \arg\min_k \lVert x_i - \eta_k \rVert_2 \]
4. Keep all centroids η_k that have at least one X_i assigned to them. Update the value of K to the total number of retained centroids. Update the retained centroids based on their assigned data points, as in the standard K-means algorithm.
5. Repeat Steps 2–4 until you have reached a reasonable stopping place. (A minimal R sketch of these steps follows this list.)
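As a starting point (and not a full answer to the questions below), here is a minimal R sketch of Steps 1–5 for the 1-D data. The fixed iteration count standing in for a "reasonable stopping place", the uniform proposal over the data range, and the file name X.txt are all assumptions.

# Minimal sketch of the adaptive K-means loop for 1-D data.
adaptive_kmeans <- function(x, n_iter = 50) {
  eta <- mean(x)                                      # Step 1: single centroid eta_0
  for (iter in seq_len(n_iter)) {
    eta <- c(eta, runif(1, min(x), max(x)))           # Step 2: propose eta_{K+1}
    d <- outer(x, eta, function(a, b) (a - b)^2)      # squared distances, n x (K+1)
    z <- max.col(-d)                                  # Step 3: nearest centroid index
    keep <- sort(unique(z))                           # Step 4: drop empty clusters
    eta <- sapply(keep, function(k) mean(x[z == k]))  # update retained centroids
  }
  eta                                                 # Step 5: stop after n_iter rounds
}

x <- scan("X.txt")          # hypothetical file name for the Sakai data
length(adaptive_kmeans(x))  # estimated K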
Question:
1. Conceptually, how many clusters do we expect to have? In other words, what value of K should we expect? Answer this before implementing.
2. When you implement this method in R, how many clusters do we end up with (please show code)?
3. How do you suggest we modify our method to prevent overfitting? Try this approach in your code and report the estimated K. Did it work?