• PCA: 6pts
1. (1pt) Recall that PCA finds the direction w in which the projected data has the highest variance by solving the following program (a small numerical sketch of this computation appears after this problem):
$$\max_{w:\,\|w\|_2 = 1} \; w^\top \Sigma w. \tag{1}$$
Here, Σ is a covariance matrix. You are given a dataset of two 2-dimensional points (1, 3) and (4, 7). Draw the two data points on the 2D plane. What is the first principal component w of this dataset?
2. (3pts) Now you are given a dataset of four points (2, 0), (2, 2), (6, 0) and (6, 2). Given this dataset, derive the covariance matrix Σ in Eq. 1. Then plot the centered data together with the first and second principal components in one figure. Include the plot in your written submission.
3. (2pts) What is the optimal w and the optimal value of the program in Eq. 1 given
$$\Sigma = \begin{pmatrix} 12 & 0 \\ 0 & \phantom{0} \end{pmatrix}?$$
Give your justification.
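As a quick illustration of the mechanics behind Eq. 1 (not a solution to the parts above), the following sketch centers a toy dataset, forms its covariance matrix, and reads the first principal component off the eigendecomposition. The data points and the 1/n normalization are assumptions made for illustration only.

```python
import numpy as np

# Toy illustration of Eq. 1: center the data, form the covariance matrix, and take
# the eigenvector with the largest eigenvalue as the first principal component.
# The points below are placeholders, not the datasets given in the problem.
X = np.array([[0.0, 1.0], [2.0, 3.0], [3.0, 7.0]])
Xc = X - X.mean(axis=0)                   # centered data
Sigma = Xc.T @ Xc / X.shape[0]            # covariance matrix (1/n convention assumed)

eigvals, eigvecs = np.linalg.eigh(Sigma)  # eigh returns eigenvalues in ascending order
w = eigvecs[:, -1]                        # first principal component (unit norm)
print(Sigma, eigvals[-1], w)              # optimal value of Eq. 1 is the top eigenvalue
```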
• Basics in Information Theory: 7pts
Let X be a discrete random variable, and let P, Q be two probability distributions over the domain of X. Define a new random variable X′ as follows:
$$X' = \begin{cases} X \sim P & \text{if } B = 1,\\ X \sim Q & \text{if } B = 0,\end{cases}$$
where B ∈ {0, 1} is an independent Bernoulli random variable with parameter λ, such that Pr(B = 1) = λ = 1 − Pr(B = 0).
1. (2pts) Derive the mixture distribution Pr(X′ = x) and express it in terms of P(x) and Q(x).
2. (5pts) Show that I(X′; B) = D_λ(P∥Q), where D_λ(P∥Q) is the λ-divergence between P and Q, i.e., D_λ(P∥Q) = λ D_KL(P ∥ λP + (1 − λ)Q) + (1 − λ) D_KL(Q ∥ λP + (1 − λ)Q). Note that setting λ = 0.5 recovers the Jensen–Shannon divergence.
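Before attempting the proof in part 2, it can be reassuring to check the identity numerically. The sketch below uses arbitrary example distributions P, Q and mixing weight λ (all assumptions, not part of the assignment) and compares I(X′; B), computed from entropies, with the λ-divergence.

```python
import numpy as np

def kl(p, q):
    # KL divergence in nats; assumes q > 0 wherever p > 0
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

P = np.array([0.7, 0.2, 0.1])   # example distribution (assumed for illustration)
Q = np.array([0.1, 0.3, 0.6])
lam = 0.3

M = lam * P + (1 - lam) * Q              # mixture Pr(X' = x)
lambda_div = lam * kl(P, M) + (1 - lam) * kl(Q, M)

# I(X'; B) = H(X') - H(X' | B), where H(X' | B) = lam*H(P) + (1-lam)*H(Q)
H = lambda p: -np.sum(p[p > 0] * np.log(p[p > 0]))
mutual_info = H(M) - (lam * H(P) + (1 - lam) * H(Q))

print(np.isclose(mutual_info, lambda_div))  # expected: True
```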
• k-Means with Soft Assignments: 10pts
Consider the following exemplar-based, hard-assignment form as the objective of k-Means for K clusters and n data points x(i) for i = 1, ..., n:
$$\min_{\mu,\; A \in \{0,1\}^{n\times K}} \;\sum_{i=1}^{n} \sum_{k=1}^{K} A_{ik}\, \big\|x^{(i)} - \mu_k\big\|_2^2, \tag{2}$$
where µ_k denotes the center of the k-th cluster, the matrix A ∈ {0, 1}^{n×K} indicates the hard assignment of each data point to the clusters, and A · 1_K = 1_n, which tells us that each row of A has exactly one 1 with all remaining elements equal to 0, i.e., $\sum_{k=1}^{K} A_{ik} = 1$ for all i.
We extend this setting to soft assignments by allowing the matrix A ∈ [0, 1]^{n×K} (still with A · 1_K = 1_n), and the objective becomes
$$\min_{\mu,\; A \in [0,1]^{n\times K}} \;\sum_{i=1}^{n} \sum_{k=1}^{K} A_{ik}\, \big\|x^{(i)} - \mu_k\big\|_2^2. \tag{3}$$
A small numerical illustration of the relation between the two objectives appears after this problem.
1. (3pts) Show that the optimal value of the soft-assignment objective in Eq. 3 is no larger than the optimal value of the hard-assignment objective in Eq. 2. Hint: Note that {0, 1}^{n×K} can be seen as a subset of [0, 1]^{n×K}.
2. (5pts) Show that the reverse inequality also holds, i.e., the optimal value of the soft-assignment objective in Eq. 3 is at least the optimal value of the hard-assignment objective in Eq. 2. Hint: You may use the fact that $\|x^{(i)} - \mu_k\|_2^2 \ge \min_l \|x^{(i)} - \mu_l\|_2^2$ for any i and k.
3. (2pts) Show that the soft-assignment problem introduced in this problem (Eq. 3) admits a globally optimal solution that is a hard assignment.
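The relation between the hard and soft objectives can also be seen numerically. The sketch below (the random point, centers, and soft weights are all arbitrary assumptions) checks that, for fixed centers, a soft assignment of a single point never costs less than hard-assigning it to its nearest center, which is the content of the hint in part 2.

```python
import numpy as np

# Illustration only (not a proof): for fixed centers, the soft-assignment cost of a
# single point, sum_k A_ik * ||x_i - mu_k||^2 with the row A_i on the simplex, is
# never smaller than the cost of hard-assigning the point to its nearest center.
rng = np.random.default_rng(0)
x = rng.normal(size=2)                      # one example data point (assumed)
mus = rng.normal(size=(3, 2))               # K = 3 example centers (assumed)

dists = np.sum((mus - x) ** 2, axis=1)      # ||x - mu_k||_2^2 for each k

# random soft assignment on the probability simplex
a = rng.random(3)
a /= a.sum()

soft_cost = np.dot(a, dists)
hard_cost = dists.min()                     # weight 1 on the nearest center
print(soft_cost >= hard_cost)               # expected: True for any soft a
```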
• Bernoulli Mixture Model: 18pts
Extending the Gaussian mixture model introduced in lecture, we explore the Bernoulli mixture model in this problem. We represent the dataset as X ∈ {0, 1}^{n×d}; each data instance is a set of d independent binary random variables x^{(i)} = (x_1^{(i)}, x_2^{(i)}, ..., x_d^{(i)}), and the probability that x^{(i)} is generated from the k-th Bernoulli distribution is calculated as:
$$\Pr(x^{(i)} \mid \mu_k) = \prod_{j=1}^{d} \mu_{kj}^{\,x_j^{(i)}}\, (1 - \mu_{kj})^{\,1 - x_j^{(i)}},$$
where µ_k ∈ [0, 1]^d is the mean of the k-th Bernoulli distribution (a small numerical sketch of this likelihood appears after this problem).
We consider a mixture of K Bernoulli distributions and introduce the auxiliary/latent variable z_{ik} ∈ {0, 1} with $\sum_{k=1}^{K} z_{ik} = 1$ for all i as the assignment of x^{(i)} to the k-th Bernoulli distribution. Also, we have Pr(z_{ik} = 1) = π_k and Pr(x^{(i)} | z_{ik} = 1) = Pr(x^{(i)} | µ_k).
1. (5pts) Derive the log-likelihood log Pr(x^{(i)}, z_i | π, µ).
2. (5pts) In the expectation step, derive the update for the assignment (posterior) $z_{ik}^{\text{new}} = \Pr(z_{ik} = 1 \mid x^{(i)})$.
3. (8pts) In the maximization step, derive the updates for the model parameters, i.e., $\mu_k^{\text{new}}$ and $\pi_k^{\text{new}}$.
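The following sketch evaluates the given likelihood Pr(x^{(i)} | µ_k) in log-space; it is only an illustration of the formula above, and the array shapes, variable names, and toy data are assumptions rather than the hw4.py interface.

```python
import numpy as np

# A minimal sketch of the per-component likelihood Pr(x | mu_k) for a Bernoulli
# mixture, evaluated in log-space for numerical stability.
def log_bernoulli_likelihood(X, mu, eps=1e-12):
    """X: (n, d) binary data; mu: (K, d) component means in (0, 1).
    Returns an (n, K) matrix of log Pr(x^(i) | mu_k)."""
    mu = np.clip(mu, eps, 1 - eps)
    return X @ np.log(mu).T + (1 - X) @ np.log(1 - mu).T

X = np.array([[1, 0, 1], [0, 0, 1]])                  # toy binary data (assumed)
mu = np.array([[0.9, 0.1, 0.8], [0.2, 0.5, 0.3]])     # toy component means (assumed)
print(np.exp(log_bernoulli_likelihood(X, mu)))        # Pr(x^(i) | mu_k) for each i, k
```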
• Variational Autoencoder (VAE): 19pts
In this problem you will implement a Variational Autoencoder (VAE) to model points sampled from an unknown distribution. This will be done by constructing an encoder network and a decoder network. The encoder network f_enc : X ⊂ R² → R^h × R^h takes as input a point x from the input space and outputs parameters (µ, ξ), where ξ = log σ². The decoder
network f_dec : R^h → R² takes as input a latent vector z ∼ N(µ, σ²) and outputs an element x̂ ∈ R² that we would hope is similar to members of the input space X. You will train this model by minimizing the (regularized) empirical risk
$$R_{\mathrm{VAE}}(f) = \frac{1}{n} \sum_{i=1}^{n} \Big[\, \ell(\hat{x}_i, x_i) + \lambda\, \mathrm{KL}\big(\mathcal{N}(\mu(x_i), \exp(\xi(x_i)/2)),\, \mathcal{N}(0, I)\big) \Big].$$
In particular, we have
$$\mathrm{KL}\big(\mathcal{N}(\mu, \Sigma),\, \mathcal{N}(0, I)\big) = -\frac{1}{2}\left( h + \sum_{j=1}^{h} \big(\log \sigma_j^2 - \mu_j^2 - \sigma_j^2\big) \right).$$
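As a quick, purely optional check of the closed form above, the sketch below compares it against PyTorch's KL divergence for diagonal Gaussians; the tensor shapes and variable names are assumptions for illustration.

```python
import torch
from torch.distributions import Normal, kl_divergence

# Check that the closed form above matches PyTorch's own KL for diagonal Gaussians.
# mu and xi = log sigma^2 are arbitrary example tensors.
mu = torch.randn(6)
xi = torch.randn(6)
sigma = torch.exp(xi / 2)

closed_form = -0.5 * (1 + xi - mu.pow(2) - xi.exp()).sum()
library = kl_divergence(Normal(mu, sigma), Normal(torch.zeros(6), torch.ones(6))).sum()
print(torch.allclose(closed_form, library))  # expected: True
```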
1. (3pts) Use the empirical risk discussed above to implement a VAE in the class VAE. Use ReLU activations between the layers, except use a sigmoid on the last layer of the decoder. Use the Adam optimizer in the step() function; make use of the PyTorch library and torch.optim.Adam(), there is no need to implement it yourself. Please refer to the docstrings in hw4.py for more implementation details (a rough structural sketch appears after this problem).
2. (5pts) Implement the fit function using the step() function from the VAE class. See the docstrings in hw4.py for more details.
3. (11pts) Fit a VAE on the data generated by generate_data in hw4_utils.py. Use a learning rate η = 0.01, latent space dimension h = 6, KL-divergence scaling factor λ = 5 × 10⁻⁵, and train for 8000 iterations. Use least squares as the loss, that is, let ℓ(x, x̂) = ∥x − x̂∥²₂. Include separate plots of each of the following, with a legend, in your written submission:
(a) Your empirical risk R_VAE on the training data vs. iteration count;
(b) The data points $(x_i)_{i=1}^{n}$ along with their encoded and decoded approximations $\hat{x}_i$;
(c) The data points $(x_i)_{i=1}^{n}$ along with their encoded and decoded approximations $\hat{x}_i$, and n generated points $f_{\mathrm{dec}}(z)$ where $z \sim \mathcal{N}(0, I)$.
After you are done training, save your neural network to a checkpoint file using torch.save(model.cpu().state_dict(), "vae.pb"). You will submit this checkpoint file "vae.pb" to the autograder with your code submission.
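Finally, for orientation only, here is one way the pieces above (an encoder producing (µ, ξ), reparameterized sampling, a decoder ending in a sigmoid, the least-squares reconstruction term, the KL term, and an Adam step) could fit together. The class name, layer widths, and method signatures below are assumptions for illustration; the required interface is the one documented in hw4.py.

```python
import torch
import torch.nn as nn

# A rough structural sketch only: constructor arguments, method names, and layer
# sizes are assumptions, not the interface specified in hw4.py.
class VAESketch(nn.Module):
    def __init__(self, h=6, hidden=50, lam=5e-5, lr=0.01):
        super().__init__()
        # encoder f_enc: R^2 -> (mu, xi), with xi = log sigma^2
        self.enc = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * h))
        # decoder f_dec: R^h -> R^2, sigmoid on the last layer
        self.dec = nn.Sequential(nn.Linear(h, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2), nn.Sigmoid())
        self.h, self.lam = h, lam
        self.opt = torch.optim.Adam(self.parameters(), lr=lr)

    def forward(self, x):
        mu, xi = self.enc(x).chunk(2, dim=1)
        z = mu + torch.exp(xi / 2) * torch.randn_like(mu)   # reparameterization
        return self.dec(z), mu, xi

    def step(self, x):
        # one gradient step on the regularized empirical risk
        xhat, mu, xi = self(x)
        recon = ((xhat - x) ** 2).sum(dim=1)                 # least-squares loss
        kl = -0.5 * torch.sum(1 + xi - mu.pow(2) - xi.exp(), dim=1)
        loss = (recon + self.lam * kl).mean()
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()
        return loss.item()
```

A training loop would then call step(batch) repeatedly inside fit; again, the actual loop structure and return values are governed by the hw4.py docstrings.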