1. [2 points] KL Divergence
(a) [1 point] What is the expression for the KL divergence DKL(q(x) || p(x)) given two continuous distributions p(x) and q(x) defined on the domain ℝ¹?
Your answer:
(b) [1 point] Show that the KL divergence is non-negative. You can use Jensen’s inequality here without proving it.
Your answer:
2. [3 points] In class, we derived the following equality:
log pθ(x) = ∫_z qφ(z|x) log [ pθ(x, z) / qφ(z|x) ] dz + ∫_z qφ(z|x) log [ qφ(z|x) / pθ(z|x) ] dz
Instead of maximizing the log likelihood log pθ(x) w.r.t. θ, we find a lower bound for log pθ(x) and maximize the lower bound.
(a) [1 point] Use the above equation and your result in 1(b) to give a lower bound for log pθ(x).
Your answer:
(b) [1 point] What do people usually call the bound?
Your answer:
(c) [1 point] Under what condition is the bound tight?
Your answer:
3. [2 points] Given z ∈ ℝ¹, p(z) ∼ N(0, 1) and q(z|x) ∼ N(µz, σz²), write DKL(q(z|x) || p(z)) in terms of σz and µz.
Your answer:
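As a sanity check for whatever expression you derive, the KL can also be estimated by Monte Carlo. The sketch below assumes NumPy and SciPy are available; the values of µz and σz are arbitrary examples.

```python
# Monte Carlo estimate of D_KL(q(z|x) || p(z)) for univariate Gaussians,
# usable as a numerical check against a derived closed-form expression.
import numpy as np
from scipy.stats import norm

mu_z, sigma_z = 0.7, 1.3                                    # arbitrary example values
z = np.random.normal(mu_z, sigma_z, size=1_000_000)         # samples from q(z|x) = N(mu_z, sigma_z^2)
mc_kl = np.mean(norm.logpdf(z, mu_z, sigma_z) - norm.logpdf(z, 0.0, 1.0))
print(f"Monte Carlo estimate of the KL: {mc_kl:.4f}")
```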
4. [1 point] In VAEs, the encoder computes the mean µz and the variance σz² of qφ(z|x), assuming qφ(z|x) is Gaussian. Explain why we usually model σz² in log space, i.e., model log σz² instead of σz², when implementing it with neural nets.
Your answer:
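For reference, here is a minimal sketch (assuming PyTorch; the names EncoderHead, fc_mu, and fc_logvar are illustrative, not from the assignment) of how the log-variance parameterization typically appears in code:

```python
# Encoder output head that predicts mu_z and log sigma_z^2 for q_phi(z|x).
import torch
import torch.nn as nn

class EncoderHead(nn.Module):
    def __init__(self, hidden_dim=128, latent_dim=16):
        super().__init__()
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)      # outputs mu_z directly
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)  # outputs log sigma_z^2 (unconstrained real values)

    def forward(self, h):
        mu = self.fc_mu(h)
        logvar = self.fc_logvar(h)       # any real number is a valid log-variance
        var = torch.exp(logvar)          # exp(.) maps it back to a positive variance
        return mu, var
```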
5. [1 point] Why do we need the reparameterization trick when training VAEs instead of directly sampling from the latent distribution N(µz, σz²)?
Your answer:
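For reference, a minimal sketch of the trick as it is typically implemented (assuming PyTorch; the function name reparameterize is illustrative):

```python
# Reparameterization: write z as a deterministic function of (mu, logvar) plus
# an independent standard-normal noise sample.
import torch

def reparameterize(mu, logvar):
    std = torch.exp(0.5 * logvar)    # sigma_z from log sigma_z^2
    eps = torch.randn_like(std)      # eps ~ N(0, I), independent of the parameters
    return mu + eps * std            # z = mu + sigma * eps, differentiable w.r.t. mu and logvar
```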