Statistical Inference and Machine Learning Homework 3 Solution


    • This assignment can be solved in groups of 1 to 5 students. You must mention the names of all participants. Note that all students in a group will receive the same grade.

    • Deadline: 14 December 2020, 23:59 (No late submissions will be accepted)

    • Upload a single zip file on Moodle containing your solution and code. You can use any programming language.

1    PCA [40 pts]

One way to compress a data point is to use the leading k principal components to represent it. That is, we first obtain the principal directions w_1, w_2, . . . , w_k of a dataset D = {x_1, x_2, . . . , x_n} with x_i ∈ R^d, where w_i^T w_j = 0 for i ≠ j, w_i^T w_i = 1, and k ≪ d. Then we approximately represent each data point x_i as
\tilde{x}_i = \sum_{j=1}^{k} \underbrace{(w_j^\top x_i)}_{\alpha_j} w_j \qquad (1)

Since the w_j are shared across the entire dataset, we only need to remember the α_j's for each data point, i.e. k scalars per data point, even when the original dimension d of the data point is very high. If k is much smaller than d, this approximate representation scheme leads to huge savings in terms of storage. Typically, increasing the number k of principal directions w_j improves the approximation, meaning that the approximation error ‖x_i − x̃_i‖_2 gets smaller. When k = d, the approximation error should be zero.
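To make the representation concrete, here is a tiny Python/NumPy sketch (the dimensions and the random directions are made up for illustration; in practice the w_j come from PCA):

import numpy as np

d, k = 14, 2
rng = np.random.default_rng(0)

# Orthonormal directions w_1, ..., w_k as columns (random here, from PCA in practice)
W, _ = np.linalg.qr(rng.normal(size=(d, k)))

x = rng.normal(size=d)      # one data point in R^d
alpha = W.T @ x             # the k scalars alpha_j stored per data point
x_tilde = W @ alpha         # approximate reconstruction from the stored scalars

print(np.sum((x - x_tilde) ** 2))   # squared approximation error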

You will use PCA to perform dimensionality reduction on the given dataset (leaf.csv). The database comprises 40 different plant species. The data include 8 shape features and 6 texture features of leaves. Please refer to ReadMe.pdf for a description of the dataset. You are to use Principal Component Analysis to perform data compression.

    1. Submit a plot of the eigenvalues in ascending order (i.e., visualize how the computed eigenvalues increase).

    2. Select a cut-off to choose the top k eigenvectors based on your calculation. Discuss the reasoning for choosing this cut-off.

Note: this is an open problem. Any reasonable value of k will be acceptable. State your cut-off, your reasoning, and report the corresponding k.

    3. Now consider k = 2 and report the first k = 2 eigenvectors.

    4. Visualize all the leaves in terms of the discovered k = 2 principal components in a 2-D scatterplot (i.e., for each leaf, visualize its (α_1, α_2)).

    5. For k = 2, compute the mean squared reconstruction error

\frac{1}{n} \sum_{i=1}^{n} \|x_i - \tilde{x}_i\|_2^2

using x_1, x_2, . . . , x_n in the dataset.
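As a starting point, here is a minimal sketch of the whole pipeline in Python (NumPy/pandas/matplotlib). How the feature columns are selected from leaf.csv and the centring step are assumptions; adapt them to the layout described in ReadMe.pdf.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Assumption: every numeric column of leaf.csv is a feature; drop or keep columns
# (e.g. a species label) according to ReadMe.pdf.
X = pd.read_csv("leaf.csv").select_dtypes(include="number").to_numpy(dtype=float)
Xc = X - X.mean(axis=0)                      # centring is a common choice, not prescribed above

# Eigen-decomposition of the sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))   # eigenvalues in ascending order

# 1. plot of the eigenvalues in ascending order
plt.plot(eigvals, marker="o"); plt.xlabel("index"); plt.ylabel("eigenvalue"); plt.show()

# 2./3. pick a cut-off k and keep the k leading eigenvectors (here k = 2)
k = 2
W = eigvecs[:, ::-1][:, :k]                  # d x k matrix of principal directions

# 4. 2-D scatterplot of (alpha_1, alpha_2) for every leaf
A = Xc @ W
plt.scatter(A[:, 0], A[:, 1], s=10); plt.xlabel("alpha_1"); plt.ylabel("alpha_2"); plt.show()

# 5. mean squared reconstruction error for k = 2
X_rec = A @ W.T + X.mean(axis=0)
print(np.mean(np.sum((X - X_rec) ** 2, axis=1)))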

2    EM algorithm [60 points]

In this problem, we will explore the use of the EM algorithm for text clustering. Text clustering is a technique for unsupervised document organization and information retrieval. We want to find how to group a set of different text documents based on their topics. First we will analyze a model to represent the data.

Bag of Words

The simplest model for text documents is to treat them as a collection of words. To keep the model simple, we keep the collection unordered, disregarding grammar and word order. We count how often each word appears in each document and store the word counts in a matrix, where each row of the matrix represents one document and each column represents a specific word from the document dictionary. Suppose we represent the set of n_d documents using a matrix of word counts like this:
D_{1:n_d} = \begin{pmatrix} 2 & 6 & \cdots & 4 \\ 2 & 4 & \cdots & 0 \\ \vdots & & \ddots & \vdots \end{pmatrix} = T
This means that word W_1 occurs twice in document D_1. Word W_{n_w} occurs 4 times in document D_1 and not at all in document D_2.
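Purely as an illustration of how such a count matrix can be produced from raw text (the homework data already contains the counts, so this step is not needed for the task):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog chased the cat"]   # two toy documents
vectorizer = CountVectorizer()
T = vectorizer.fit_transform(docs).toarray()   # n_d x n_w matrix of word counts
print(vectorizer.get_feature_names_out())
print(T)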

Multinomial Distribution

The simplest distribution for representing a text document is the multinomial distribution. The probability of a document D_i is:
p(D_i) = \prod_{j=1}^{n_w} \mu_j^{T_{ij}}

Here, µ_j denotes the probability of a particular word in the text being equal to w_j, and T_{ij} is the count of word w_j in document D_i. So the probability of document D_1 would be p(D_1) = µ_1^2 · µ_2^6 · . . . · µ_{n_w}^4.
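In practice the product is evaluated in log-space to avoid underflow; a tiny sketch with made-up numbers:

import numpy as np

mu = np.array([0.5, 0.3, 0.2])        # word probabilities mu_j, must sum to 1
T_i = np.array([2, 6, 4])             # word counts T_ij of one document D_i
log_p = np.sum(T_i * np.log(mu))      # log p(D_i) = sum_j T_ij log mu_j
print(log_p, np.exp(log_p))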

Mixture of Multinomial Distributions

In order to do text clustering, we want to use a mixture of multinomial distributions, so that each topic has a particular multinomial distribution associated with it, and each document is a mixture of different topics. If we define p(c) = π_c as the mixture coefficient of a document containing topic c, and model each topic by a multinomial distribution p(D_i|c) with parameters µ_{jc}, then we can write each document as a mixture over topics as

p(D_i) = \sum_{c=1}^{n_c} p(D_i \mid c)\, p(c) = \sum_{c=1}^{n_c} \pi_c \prod_{j=1}^{n_w} \mu_{jc}^{T_{ij}}
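The mixture probability is handled the same way as the single multinomial above, with a log-sum-exp over topics; a tiny sketch with made-up numbers for π_c and µ_jc:

import numpy as np
from scipy.special import logsumexp

pi = np.array([0.6, 0.4])                              # mixture weights pi_c (toy values)
mu = np.array([[0.5, 0.2], [0.3, 0.5], [0.2, 0.3]])    # mu_jc, each column sums to 1
T_i = np.array([2, 6, 4])                              # word counts of document D_i
log_p = logsumexp(np.log(pi) + T_i @ np.log(mu))       # log p(D_i) under the mixture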

EM for Mixture of Multinomials

In order to cluster a set of documents, we need to fit this mixture model to data. In this problem, the EM algorithm is used for fitting the mixture model, which gives a simple topic model for documents: each topic is a multinomial distribution over words (a mixture component). The EM algorithm for this topic model consists of iterating the following steps:

    1. Expectation

Compute the expectation of document D_i belonging to cluster c:

\gamma_{ic} = \frac{\pi_c \prod_{j=1}^{n_w} \mu_{jc}^{T_{ij}}}{\sum_{c'=1}^{n_c} \pi_{c'} \prod_{j=1}^{n_w} \mu_{jc'}^{T_{ij}}}
    2. Maximization

Update the mixture parameters, i.e. the probability µ_{jc} of a word being W_j in cluster (topic) c, as well as the prior probability π_c of each cluster:
\mu_{jc} = \frac{\sum_{i=1}^{n_d} \gamma_{ic} T_{ij}}{\sum_{i=1}^{n_d} \sum_{l=1}^{n_w} \gamma_{ic} T_{il}}, \qquad
\pi_c = \frac{1}{n_d} \sum_{i=1}^{n_d} \gamma_{ic}
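Both steps map directly onto a few lines of NumPy. A sketch of the E-step (done in log-space to avoid underflow of the products) and the M-step; the function names and the small smoothing constant are my additions, not part of the derivation:

import numpy as np
from scipy.special import logsumexp

def e_step(T, log_pi, log_mu):
    """Responsibilities gamma_ic for a count matrix T (n_d x n_w), log mixture
    weights log_pi (n_c,) and log word probabilities log_mu (n_w x n_c)."""
    log_unnorm = log_pi[None, :] + T @ log_mu      # log pi_c + sum_j T_ij log mu_jc
    return np.exp(log_unnorm - logsumexp(log_unnorm, axis=1, keepdims=True))

def m_step(T, gamma, eps=1e-12):
    """Update mu_jc (n_w x n_c) and pi_c (n_c,) from responsibilities gamma (n_d x n_c)."""
    weighted = T.T @ gamma                          # sum_i gamma_ic T_ij
    mu = (weighted + eps) / (weighted.sum(axis=0, keepdims=True) + eps * T.shape[1])
    pi = gamma.mean(axis=0)                         # pi_c = (1/n_d) sum_i gamma_ic
    return mu, pi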


Task 1 [30 pts]

    1. Verify the above-derived E-step and M-step.

    2. Implement the EM algorithm on the toy dataset data.csv. You can treat data.csv as D_{1:n_d}, except that the last column represents the true cluster that each document belongs to. Run your algorithm and observe the results. Compare them with the provided true clusters. Report an evaluation (e.g. accuracy) of your implementation.

Hint: We already did the word counting for you, so the data file only contains a count matrix like the one shown above. For the toy dataset, set the number of clusters n_c = 4. You will need to initialize the parameters. Try several different random initial values for the probability µ_{jc} of a word being W_j in topic c, and make sure it is normalized. Make sure that you do not use the true cluster information during the learning phase.
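One possible driver loop tying together the e_step and m_step sketched above; reading data.csv as a plain numeric CSV without a header and the fixed iteration count are assumptions:

import numpy as np

data = np.loadtxt("data.csv", delimiter=",")
T, true_z = data[:, :-1], data[:, -1].astype(int)   # last column: true cluster (evaluation only)

n_c = 4
rng = np.random.default_rng(0)
mu = rng.random((T.shape[1], n_c))
mu /= mu.sum(axis=0, keepdims=True)                 # normalize each topic's word distribution
pi = np.full(n_c, 1.0 / n_c)

for _ in range(200):
    gamma = e_step(T, np.log(pi), np.log(mu))
    mu, pi = m_step(T, gamma)

pred = gamma.argmax(axis=1)   # hard assignments; compare to true_z up to a label permutation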

MCMC algorithm

Mixture of Gaussians

A finite Gaussian mixture of size K has the following density:
f(x \mid K, \rho, \theta) = \sum_{k=1}^{K} \rho_k\, \mathcal{N}_k(x \mid \mu_k, \phi_k)

where µ_k and φ_k are the mean and precision (i.e., inverse of the variance) of the Gaussian density N_k(x | µ_k, φ_k), and K is finite and known.

The likelihood of the data is

L = \prod_{i=1}^{n} \sum_{k=1}^{K} \rho_k\, \mathcal{N}_k(x_i \mid \mu_k, \phi_k)

Assume we have n observations x = {x_1, . . . , x_n} sampled i.i.d. from the finite Gaussian mixture model. We wish to make Bayesian inference for the model parameters θ := {ρ_1, µ_1, φ_1, . . . , ρ_K, µ_K, φ_K} and evaluate the posterior distribution of the unknown parameters.
In order to simplify the likelihood, we introduce latent variables Zi such that

X_i \mid Z_i = k \sim \mathcal{N}_k(x \mid \mu_k, \phi_k) \quad \text{and} \quad p(Z_i = k) = \rho_k.



These auxiliary variables allow us to identify the mixture component each observation has been generated from.
Therefore, for data x = {x_1, . . . , x_n}, we assume a missing dataset z = {z_1, . . . , z_n}, which provides the labels indicating the mixture components from which the observations have been generated.

Using this missing dataset, the likelihood simplifies to


L = \prod_{i=1}^{n} \rho_{z_i}\, \mathcal{N}_{z_i}(x_i \mid \mu_{z_i}, \phi_{z_i})
  = \prod_{k=1}^{K} \rho_k^{n_k} \prod_{i : z_i = k} \mathcal{N}_k(x_i \mid \mu_k, \phi_k)

where n_k = \#\{i : z_i = k\} and \sum_{k=1}^{K} n_k = n.






Then, the posterior probability that the observation x_i has been generated from the k-th component is

p(z_i = k \mid x_i, \theta)
  = \frac{\rho_k\, \mathcal{N}_k(x_i \mid \mu_k, \phi_k)}{\sum_{k'=1}^{K} \rho_{k'}\, \mathcal{N}_{k'}(x_i \mid \mu_{k'}, \phi_{k'})}
  = \frac{\rho_k \sqrt{\phi_k}\, \exp\!\left\{-\frac{\phi_k}{2}(x_i - \mu_k)^2\right\}}{\sum_{k'=1}^{K} \rho_{k'} \sqrt{\phi_{k'}}\, \exp\!\left\{-\frac{\phi_{k'}}{2}(x_i - \mu_{k'})^2\right\}}
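In a Gibbs sampler this posterior is the categorical distribution each z_i is drawn from. A small NumPy sketch (the function and variable names are mine), assuming x is a 1-D array and ρ, µ, φ are K-vectors:

import numpy as np

def sample_z(x, rho, mu, phi, rng):
    """Draw z_i from p(z_i = k | x_i, theta) for all observations at once."""
    # log rho_k + (1/2) log phi_k - (phi_k / 2) (x_i - mu_k)^2, up to a constant
    logp = np.log(rho) + 0.5 * np.log(phi) - 0.5 * phi * (x[:, None] - mu) ** 2
    p = np.exp(logp - logp.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    return np.array([rng.choice(len(rho), p=row) for row in p])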



For the model parameters ρ, µ and φ, we assume conjugate priors:

Prior: \rho \sim \mathrm{Dirichlet}(\delta_1, \ldots, \delta_K)
Posterior: \rho \mid x, z \sim \mathrm{Dirichlet}(\delta_1^*, \ldots, \delta_K^*)

Prior: \phi_k \sim \mathrm{Gamma}(a/2,\; b/2)
Posterior: \phi_k \mid x, z \sim \mathrm{Gamma}(a_k^*/2,\; b_k^*/2)

Prior: \mu_k \mid \phi_k \sim \mathcal{N}\!\left(m_k,\; \tfrac{1}{\alpha_k \phi_k}\right)
Posterior: \mu_k \mid x, z, \phi_k \sim \mathcal{N}\!\left(m_k^*,\; \tfrac{1}{\alpha_k^* \phi_k}\right)
where

\delta_k^* = \delta_k + n_k
a_k^* = a + n_k
b_k^* = b + \sum_{i : z_i = k} (x_i - \mu_k)^2
\alpha_k^* = \alpha_k + n_k
m_k^* = \frac{\alpha_k m_k + n_k \bar{x}_k}{\alpha_k + n_k}, \quad \text{where } \bar{x}_k = \frac{1}{n_k} \sum_{i : z_i = k} x_i
Note that Dirichlet(δ_1, . . . , δ_K) is a Dirichlet distribution with density

f(\rho_1, \ldots, \rho_K) \propto \prod_{k=1}^{K} \rho_k^{\delta_k - 1}.

The usual prior choice is to take (δ1, . . . , δK ) = (1, . . . , 1) to impose a uniform prior over the mixture weights.
Note that this prior choice is equivalent to using the following reparametrization:

\rho_1 = \eta_1, \qquad \rho_k = (1 - \eta_1) \cdots (1 - \eta_{k-1})\, \eta_k

assuming that η_k ~ Beta(1, K − k + 1).

MCMC algorithm

    1. Set initial values ρ^(0), µ^(0) and φ^(0).


    2. Update z by sampling z^(j+1) ~ z | x, ρ^(j), µ^(j), φ^(j).

    3. Update ρ by sampling ρ^(j+1) ~ ρ | x, z^(j+1).

    4. Update φ_k by sampling φ_k^(j+1) ~ φ_k | x, z^(j+1).

    5. Update µ_k by sampling µ_k^(j+1) ~ µ_k | x, z^(j+1), φ_k^(j+1).

    6. Set j = j + 1 and go to step 2.
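One way this sampler could be written in Python/NumPy, reusing the sample_z sketch above; the hyperparameter values (δ_k = 1, a = b = 1, α_k = 1, m_k = 0), the initialization, and the iteration count are my assumptions, not part of the statement:

import numpy as np

def gibbs(x, K=2, n_iter=5000, delta=1.0, a=1.0, b=1.0, alpha0=1.0, m0=0.0, seed=0):
    """Gibbs sampler for the Gaussian mixture; x is a 1-D NumPy array."""
    rng = np.random.default_rng(seed)
    # step 1: initial values rho^(0), mu^(0), phi^(0)
    rho = np.full(K, 1.0 / K)
    mu = rng.choice(x, size=K, replace=False)
    phi = np.full(K, 1.0 / np.var(x))
    trace = {"rho": [], "mu": [], "phi": []}

    for _ in range(n_iter):
        # step 2: update z
        z = sample_z(x, rho, mu, phi, rng)
        nk = np.array([(z == k).sum() for k in range(K)])
        # step 3: update rho | x, z ~ Dirichlet(delta_k + n_k)
        rho = rng.dirichlet(delta + nk)
        for k in range(K):
            xk = x[z == k]
            # step 4: update phi_k | x, z ~ Gamma(a_k*/2, b_k*/2)  (shape/rate form)
            a_star = a + nk[k]
            b_star = b + np.sum((xk - mu[k]) ** 2)
            phi[k] = rng.gamma(shape=a_star / 2.0, scale=2.0 / b_star)
            # step 5: update mu_k | x, z, phi_k ~ N(m_k*, 1 / (alpha_k* phi_k))
            xbar = xk.mean() if nk[k] > 0 else 0.0
            alpha_star = alpha0 + nk[k]
            m_star = (alpha0 * m0 + nk[k] * xbar) / alpha_star
            mu[k] = rng.normal(m_star, np.sqrt(1.0 / (alpha_star * phi[k])))
        trace["rho"].append(rho.copy())
        trace["mu"].append(mu.copy())
        trace["phi"].append(phi.copy())
    return {key: np.array(val) for key, val in trace.items()}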

Task 2 [30 pts]

    1. Verify the above-derived δ_k^*, a_k^*, b_k^*, α_k^*, m_k^*.

    2. Implement the MCMC algorithm on the toy dataset data2.txt. You can treat data2.txt as x = {x_1, . . . , x_n}. Set K = 2, run your MCMC algorithm and observe the results. As is typical in MCMC, you may discard the initial burn-in phase. You need to do the following (a code sketch follows this list):

        ◦ Visualize the posterior distribution of your unknown parameters.

        ◦ Compute the posterior mean of your unknown parameters.

        ◦ Use your posterior mean as an estimate of the unknown parameters. Generate 1000 samples x̃ from your estimated model. Visualize the original dataset x and the generated x̃ as histograms, respectively, and compare them.
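A sketch of how these three items could be produced from the output of the gibbs sketch above; the burn-in length and the plotting choices are assumptions:

import numpy as np
import matplotlib.pyplot as plt

x = np.loadtxt("data2.txt")                  # observations
trace = gibbs(x, K=2, n_iter=5000)
burn = 1000                                  # burn-in length chosen by eye from the trace plots

# posterior histograms and posterior means of the unknown parameters
# (note: with exchangeable priors the component labels can switch; relabel if needed)
post_mu = trace["mu"][burn:]
plt.hist(post_mu[:, 0], bins=50); plt.hist(post_mu[:, 1], bins=50); plt.show()
rho_hat = trace["rho"][burn:].mean(axis=0)
mu_hat = post_mu.mean(axis=0)
phi_hat = trace["phi"][burn:].mean(axis=0)

# generate 1000 samples from the estimated mixture and compare histograms
rng = np.random.default_rng(1)
z_new = rng.choice(len(rho_hat), size=1000, p=rho_hat)
x_new = rng.normal(mu_hat[z_new], 1.0 / np.sqrt(phi_hat[z_new]))
plt.hist(x, bins=50, alpha=0.5, label="data")
plt.hist(x_new, bins=50, alpha=0.5, label="generated")
plt.legend(); plt.show()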




































