
Homework Assignment 5 Solution




Starting point Your repository will have a directory ‘homework5/’. Please do not change the name of this repository or the names of any files/folders we have added to it. Please perform a git pull to retrieve these files. You will find within ‘homework5/’:




Two python scripts hmm.py and tsne.py, which you will be amending by adding code for questions in Sect. 3 and Sect. 4, respectively.




Three scripts: hmm.sh, pca.sh, and tsne.sh. You will use them to generate output files.




Datasets: hmm_model.json, mnist2500_X.txt, and mnist2500_labels.txt.







Submission instructions The following will constitute your submission (in ‘homework5/’):




A PDF report named firstname_lastname_USCID.pdf, which contains your solutions for questions in Sect. 1. For your written report, your answers must be typeset with LaTeX and generated as a PDF file. No handwritten submission will be permitted. Please also write your full name and USC ID in the PDF file.




The python scripts hmm.py and tsne.py, amended with the code you added for Sect. 3 and Sect. 4, respectively. Be sure to commit your changes!




One .txt file, hmm.txt, which will be the output of hmm.sh.




Two .npy files, pca.npy and Y.npy, which will be the outputs of pca.sh and tsne.sh, respectively.







Note that if your submission does not include the .txt and .npy files mentioned above, we will assume that you did not do the corresponding parts of the homework.




Collaboration You may discuss the problems with your classmates. However, you need to write your own solutions and submit them separately. Also, in your written report, you need to list, for each problem, the classmates with whom you discussed it. Please consult the syllabus for what is and is not acceptable collaboration.



















Algorithmic component




Hidden Markov Models



[Recommended maximum time spent: 2 hours]




In this problem, we want to fit DNA sequence data with a generative model. In particular, we assume that they are generated by a hidden Markov model (HMM). Let $X_{1:N} = [X_1 X_2 \dots X_N]$ be random variables corresponding to a DNA sequence of length $N$, controlled by hidden states $Z_{1:N} = [Z_1 Z_2 \dots Z_N]$. Each $X_n$ takes a value in $\{A, C, G, T\}$ and each $Z_n$ takes one of the two possible states $\{s_1, s_2\}$. This HMM has the following parameters $\lambda = \{\pi_i, a_{ij}, b_{ik}\}$ for $i \in \{1, 2\}$, $j \in \{1, 2\}$, and $k \in \{A, C, G, T\}$:



Initial state distribution $\pi_i$ for $i = 1, 2$:

$$\pi_1 = P(Z_1 = s_1) = 0.7, \qquad \pi_2 = P(Z_1 = s_2) = 0.3. \tag{1}$$



Transition probabilities $a_{ij} = P(Z_{n+1} = s_j \mid Z_n = s_i)$ for any $n \in \mathbb{N}_+$, $i = 1, 2$ and $j = 1, 2$:

$$a_{11} = 0.8, \quad a_{12} = 0.2, \quad a_{21} = 0.4, \quad a_{22} = 0.6. \tag{2}$$



Emission probabilities $b_{ik} = P(X_n = k \mid Z_n = s_i)$ for any $n \in \mathbb{N}_+$, $i = 1, 2$ and $k \in \{A, C, G, T\}$:

$$b_{1A} = 0.4, \quad b_{1C} = 0.1, \quad b_{1G} = 0.4, \quad b_{1T} = 0.1, \tag{3}$$
$$b_{2A} = 0.2, \quad b_{2C} = 0.3, \quad b_{2G} = 0.2, \quad b_{2T} = 0.3. \tag{4}$$



We observe a sequence $O_{1:6} = [o_1 o_2 \dots o_6] = [A\,G\,C\,G\,T\,A]$. Please answer the following questions with step-by-step computations.




Q1.1 Probability of an observed sequence
Compute $P(X_{1:6} = O_{1:6}; \lambda)$.
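The derivation is expected to be done by hand; the short numpy sketch below is only one way to check the manual answer against the forward recursion. The parameter values come from Eqs. (1)–(4); the variable names (pi, A, B, obs) and the state/symbol ordering are choices made here for illustration, not part of the assignment.

```python
import numpy as np

# Model parameters from Eqs. (1)-(4); states ordered [s1, s2], symbols ordered [A, C, G, T].
pi = np.array([0.7, 0.3])
A = np.array([[0.8, 0.2],
              [0.4, 0.6]])
B = np.array([[0.4, 0.1, 0.4, 0.1],   # emissions from s1
              [0.2, 0.3, 0.2, 0.3]])  # emissions from s2
sym = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
obs = [sym[c] for c in "AGCGTA"]

# Forward recursion: alpha[t, i] = P(Z_{t+1} = s_{i+1}, x_{1:t+1}).
alpha = np.zeros((len(obs), 2))
alpha[0] = pi * B[:, obs[0]]
for t in range(1, len(obs)):
    alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]

print(alpha[-1].sum())  # P(X_{1:6} = O_{1:6}; lambda)
```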






Q1.2 Most likely explanation Compute $z^*_{1:6} = [z^*_1 z^*_2 \dots z^*_6] = \arg\max_{z_{1:6}} P(Z_{1:6} = z_{1:6} \mid X_{1:6} = O_{1:6}; \lambda)$.




Q1.3 Prediction Compute $x^* = \arg\max_x P(X_7 = x \mid X_{1:6} = O_{1:6}; \lambda)$.
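One way to organize this computation (a suggestion, not a prescribed method) is to reuse the forward variables $\alpha_6(i) = P(Z_6 = s_i, X_{1:6} = O_{1:6})$ from Q1.1:

$$P(X_7 = x \mid X_{1:6} = O_{1:6}; \lambda) = \frac{\sum_{j} \left(\sum_{i} \alpha_6(i)\, a_{ij}\right) b_{jx}}{\sum_{i} \alpha_6(i)}.$$

Since the denominator does not depend on $x$, it suffices to maximize the numerator over $x \in \{A, C, G, T\}$.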




What to submit: Mathematical derivations and numerical results for each of the problems above.




You should derive them manually.




Reminder: Do not have any spaces in your filename. If you are considering putting a space, put an underscore instead. Do not have more than one PDF file in your homework5 directory.






















Programming component




High-level descriptions



2.1 Datasets




In Sect. 3, we will play with a small hidden Markov model. The parameters of the model are given in hmm_model.json. In Sect. 4, you will perform dimensionality reduction and visualize a subset of MNIST: 784-dimensional feature vectors in mnist2500_X.txt and their corresponding labels in mnist2500_labels.txt.







2.2 Tasks




You will be asked to implement hidden Markov model (HMM) inference procedures and the t-distributed stochastic neighbor embedding (t-SNE) algorithm. Specifically, you will




For Sect. 3: Finish implementing the functions forward, backward, seqprob_forward, seqprob_backward, and viterbi. Refer to hmm.py for more information.




For Sect. 4: Finish implementing the functions pca, compute_Q, and compute_gradient. Refer to tsne.py for more information.




Run the scripts hmm.sh, pca.sh, and tsne.sh after you finish your implementation. hmm.sh will output hmm.txt. pca.sh will output pca.npy. tsne.sh will show a visualization of the dataset and will also output Y.npy.




Add, commit, and push hmm.py and tsne.py, and the files that you have generated.







As in the previous homework, you are not responsible for loading/pre-processing data.




2.3 Cautions




Please do not import packages that are not listed in the provided code. Follow the instructions in each section strictly to code up your solutions. Do not change the output format. Do not modify the code unless we instruct you to do so. A homework solution that does not match the provided setup, such as format, name, initializations, etc., will not be graded. It is your responsibility to make sure that your code runs with the provided commands and scripts on the VM. Finally, make sure that you git add, commit, and push all the required files, including your code and generated output files.




Hidden Markov Models



In this problem, you are given the parameters of a small hidden Markov model and you will implement three inference procedures, similar to what you have done manually in the first question. In hmm_model.json, you will find the following model parameters:







$\pi$: the initial probabilities, with $\pi_i = P(Z_1 = s_i)$;




$A$: the transition probabilities, with $a_{ij} = P(Z_t = s_j \mid Z_{t-1} = s_i)$;







$B$: the observation probabilities, with $b_{ik} = P(X_t = o_k \mid Z_t = s_i)$.




Now we observe a sequence $O$ of length $L$. Your task is to write code to compute probabilities and to infer the possible hidden states given this observation. In particular, first in Q3.1 and Q3.2, we want to compute $P(x_1, \dots, x_L = O)$, the probability of observing the sequence. You should use the forward algorithm and the backward algorithm to achieve that. Then in Q3.3, we infer the most likely state path. Note that your code might be tested against different parameter sets/observation sequences at grading time.




Q3.1 Please finish the implementation of the functions forward() and backward() in hmm.py:




forward($\pi$, A, B, O) takes the model parameters $\pi, A, B$ and the observation sequence $O$ as input and outputs a numpy array $\alpha$, where $\alpha[j, t] = \alpha_t(j) = P(Z_t = s_j, x_{1:t})$.




backward($\pi$, A, B, O) takes the model parameters $\pi, A, B$ and the observation sequence $O$ as input and outputs a numpy array $\beta$, where $\beta[j, t] = \beta_t(j) = P(x_{t+1:T} \mid Z_t = s_j)$.




Please follow slides 35 and 36 in lec18.pdf for your implementation.
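As a rough guide, a minimal numpy sketch of the two recursions is shown below. It assumes $O$ is already a sequence of observation indices and that states are indexed consistently with $\pi$, $A$, and $B$; the actual signatures and conventions are fixed by the provided hmm.py and the lecture slides.

```python
import numpy as np

def forward(pi, A, B, O):
    # alpha[j, t] = P(Z_t = s_j, x_{1:t}); O is assumed to hold observation indices.
    S, L = len(pi), len(O)
    alpha = np.zeros((S, L))
    alpha[:, 0] = pi * B[:, O[0]]
    for t in range(1, L):
        alpha[:, t] = (A.T @ alpha[:, t - 1]) * B[:, O[t]]
    return alpha

def backward(pi, A, B, O):
    # beta[j, t] = P(x_{t+1:T} | Z_t = s_j).
    S, L = len(pi), len(O)
    beta = np.zeros((S, L))
    beta[:, L - 1] = 1.0
    for t in range(L - 2, -1, -1):
        beta[:, t] = A @ (B[:, O[t + 1]] * beta[:, t + 1])
    return beta
```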




Q3.2 Now we can calculate $P(x_1, \dots, x_L = O)$ from the output of your forward and backward algorithms. Please finish the implementation of the functions seqprob_forward() and seqprob_backward() in hmm.py. Both of them should return $P(x_1, \dots, x_L = O)$.
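The two functions should agree on the same observation sequence. A sketch under the same assumptions as above, with argument names chosen here for illustration (the real signatures are in hmm.py):

```python
def seqprob_forward(alpha):
    # P(x_1, ..., x_L = O) = sum_j alpha_L(j).
    return alpha[:, -1].sum()

def seqprob_backward(beta, pi, B, O):
    # P(x_1, ..., x_L = O) = sum_j pi_j * b_{j, o_1} * beta_1(j).
    return (pi * B[:, O[0]] * beta[:, 0]).sum()
```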







Q3.3 We want to compute the most likely state path that corresponds to the observation O. Namely,

$$k^* = (k_1^*, k_2^*, \dots, k_L^*) = \arg\max_k P(s_{k_1}, s_{k_2}, \dots, s_{k_L} \mid x_1, x_2, \dots, x_L = O).$$




Please implement the Viterbi algorithm in viterbi() in hmm.py. The function viterbi($\pi$, A, B, O) takes the model parameters $\pi, A, B$ and the observation sequence $O$ as input and outputs a list path, which contains the most likely state path $k^*$ (in terms of state indices) that you have found.
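A standard Viterbi sketch in numpy is shown below; the variable names and exact input conventions are assumptions (the provided hmm.py defines the real interface), and working in log space is advisable for longer sequences to avoid underflow.

```python
import numpy as np

def viterbi(pi, A, B, O):
    # Most likely state path (as state indices) for observation indices O.
    S, L = len(pi), len(O)
    delta = np.zeros((S, L))             # best path probability ending in state j at time t
    psi = np.zeros((S, L), dtype=int)    # back-pointers (argmax over the previous state)
    delta[:, 0] = pi * B[:, O[0]]
    for t in range(1, L):
        scores = delta[:, t - 1][:, None] * A      # scores[i, j] = delta_{t-1}(i) * a_{ij}
        psi[:, t] = np.argmax(scores, axis=0)
        delta[:, t] = np.max(scores, axis=0) * B[:, O[t]]
    path = [int(np.argmax(delta[:, L - 1]))]
    for t in range(L - 1, 0, -1):
        path.append(int(psi[path[-1], t]))
    return path[::-1]
```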




What to do and submit: After you finish each task in Q3.1/3.2 and Q3.3 in hmm.py, run the script hmm.sh. It will generate hmm.txt. Add, commit, and push both hmm.py and hmm.txt before the due date.










Dimensionality Reduction



In this question, you will implement t-Distributed Stochastic Neighbor Embedding (t-SNE) [1], a (prize-winning) technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets. t-SNE maps high-dimensional data points $x_1, x_2, \dots, x_N$ into two- or three-dimensional embeddings $y_1, y_2, \dots, y_N$ that can be displayed in a scatter plot. Unlike PCA, which you learned in class, t-SNE is a non-linear dimensionality reduction technique. It is widely used (more than 3,300 citations and counting). If you are curious about how to use it effectively, check out https://distill.pub/2016/misread-tsne/.




Intuitively, we want the low-dimensional data points to reflect similarities of their corresponding data points in the high-dimensional space. In other words, after applying the mapping, similar data points are still near each other and dissimilar data points are still far apart. In t-SNE, this is achieved in two steps. First, t-SNE constructs a joint probability distribution $P$ over pairs of high-dimensional data points in such a way that similar points have a high probability of being picked, whilst dissimilar points have an extremely small probability. Second, t-SNE similarly defines a joint probability distribution $Q$ over pairs of low-dimensional data points and then minimizes the Kullback-Leibler (KL) divergence between $P$ and $Q$. You could think of KL-divergence as a measure of how one probability distribution diverges from another.

Formally, let $p_{ij} = P(x_i, x_j)$ and $q_{ij} = Q(x_i, x_j)$. Then, we minimize

$$C = KL(P \,\|\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}, \tag{5}$$






where we define each of the above terms below. Please refer to the original paper [1] for justifications of their choice of probability distributions as well as their choice of objective.

We define the pairwise similarities $p_{ij}$ between $x_i$ and $x_j$ in the high-dimensional space as

$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}, \quad \text{where } p_{j|i} = \frac{\exp\!\left(-\|x_i - x_j\|_2^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\!\left(-\|x_i - x_k\|_2^2 / 2\sigma_i^2\right)}, \tag{6}$$

where $\sigma_i^2$ is the variance of a Gaussian distribution that is centered at $x_i$. The conditional probability $p_{j|i}$ could be interpreted as the probability of $x_i$ picking $x_j$ as its neighbor.




Moreover, we define the pairwise similarities $q_{ij}$ between $y_i$ and $y_j$ in the low-dimensional space with a Student t-distribution with one degree of freedom:

$$q_{ij} = \frac{\left(1 + \|y_i - y_j\|_2^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \|y_k - y_l\|_2^2\right)^{-1}}. \tag{7}$$
Finally, we set both $p_{ii}$ and $q_{ii}$ to zero, as we are only interested in pairwise similarities. We use gradient descent to minimize the cost function in Eq. (5). The gradient is given by




$$\frac{\partial C}{\partial y_i} = 4 \sum_j (p_{ij} - q_{ij})(y_i - y_j)\left(1 + \|y_i - y_j\|_2^2\right)^{-1}. \tag{8}$$






(Can you verify that this is indeed the formula for the gradient?)




Q4.1 We first apply PCA to our very high-dimensional data points. In practice, this helps speed up the computation and reduce noise in the data points before we apply t-SNE. Finish the implementation of PCA in function pca in tsne.py to map the data points from dimensionality 784 down to 50.
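For reference, a generic PCA sketch along these lines is given below, assuming the rows of X are the data points; the provided pca function in tsne.py fixes its own signature and any conventions (centering, component signs, normalization), so treat this only as an illustration of the idea.

```python
import numpy as np

def pca(X, n_components=50):
    # Project X (N x D) onto its top n_components principal components.
    X_centered = X - X.mean(axis=0)                       # center each feature
    cov = X_centered.T @ X_centered / (X_centered.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)                # eigenvalues in ascending order
    top = eigvecs[:, ::-1][:, :n_components]              # columns for the largest eigenvalues
    return X_centered @ top
```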




What to do and submit: Finish the implementation of function pca. Run the script pca.sh. It will output a file pca.npy. Add, commit, and push pca.npy and your modified tsne.py before the due date.






















Q4.2 After PCA, finish the implementation of the t-SNE algorithm. In tsne.py, you are given the code to compute $p_{ij}$, and you need to finish the code that computes $q_{ij}$ in function compute_Q. You may need to refer to Eq. (7).
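A vectorized sketch of Eq. (7) is shown below, assuming Y holds the low-dimensional points as rows; the actual compute_Q in tsne.py prescribes its own inputs and outputs.

```python
import numpy as np

def compute_Q(Y):
    # q_ij from Eq. (7): Student-t kernel with one degree of freedom, normalized over all pairs.
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)  # ||y_i - y_j||^2
    num = 1.0 / (1.0 + sq_dists)
    np.fill_diagonal(num, 0.0)        # q_ii = 0
    return num / num.sum()            # denominator sums over all k != l
```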







What to do and submit: Finish the implementation of function compute_Q. Add, commit, and push your modified tsne.py before the due date.







Q4.3 Compute the gradient $\frac{\partial C}{\partial y_i}$ using Eq. (8) in function compute_gradient in tsne.py. Finally, running the script tsne.sh will show a cool t-SNE visualization of the data in the 2-dimensional space.
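A direct translation of Eq. (8) might look like the sketch below, under the assumption that compute_gradient receives the affinity matrices P and Q and the current embedding Y (the real signature is fixed in tsne.py).

```python
import numpy as np

def compute_gradient(P, Q, Y):
    # dC/dy_i = 4 * sum_j (p_ij - q_ij) (y_i - y_j) (1 + ||y_i - y_j||^2)^-1, per Eq. (8).
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    inv_dist = 1.0 / (1.0 + sq_dists)
    np.fill_diagonal(inv_dist, 0.0)
    dY = np.zeros_like(Y)
    for i in range(Y.shape[0]):
        w = (P[i] - Q[i]) * inv_dist[i]                  # weights over j
        dY[i] = 4.0 * (w[:, None] * (Y[i] - Y)).sum(axis=0)
    return dY
```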




What to do and submit: Finish the implementation of function compute_gradient. Run the script tsne.sh. It will output a file Y.npy. Add, commit, and push Y.npy and your modified tsne.py before the due date.







Note: Feel free to look at the Matlab implementation of t-SNE at https://lvdmaaten.github.io/tsne/code/tSNE_matlab.zip, but you are not permitted to look at any other online implementation of t-SNE, here or elsewhere.




References




[1] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.





























































































