1 Generative Models and EBMs
In this exercise, you will compare Energy-Based Models (EBMs) with simple energy functions against alternative generative models. To make computations tractable, we will focus on low-dimensional data, of dimension $d = 2$.
1. Generate $n = 1000$ datapoints from a Gaussian mixture with $K = 10$ mixture components, $\pi_k = 1/10$, $\mu_k = R\,(\cos(2\pi k/K), \sin(2\pi k/K))$, $\Sigma_k = I_2$, $k = 1, \dots, K$, with $R = 10$. Write down the probability density function of the model. [5pt]
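A minimal sampling sketch for this mixture, assuming NumPy; the function name sample_gmm and the seed are illustrative, not part of the exercise:

```python
import numpy as np

def sample_gmm(n=1000, K=10, R=10.0, seed=0):
    """Draw n points from the GMM with pi_k = 1/K, mu_k = R(cos(2*pi*k/K), sin(2*pi*k/K)), Sigma_k = I_2."""
    rng = np.random.default_rng(seed)
    angles = 2 * np.pi * np.arange(1, K + 1) / K
    mus = R * np.stack([np.cos(angles), np.sin(angles)], axis=1)   # (K, 2) component means
    z = rng.integers(K, size=n)                                    # mixture assignments
    x = mus[z] + rng.standard_normal((n, 2))                       # add N(0, I_2) noise
    return x, z, mus

X_train, Z_train, mus = sample_gmm(n=1000, K=10, R=10.0)
```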
2. Assuming that $\Sigma_k = I_2$ for each $k$, use the EM algorithm to estimate the model parameters ($\pi_k$ and $\mu_k$). Repeat the experiment using $R = 1$. Interpret the differences. [15pt]
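One possible EM implementation for this restricted model (identity covariances held fixed, only $\pi_k$ and $\mu_k$ updated); it assumes the sample X_train from the sketch above, and the number of iterations is illustrative:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def em_isotropic_gmm(X, K=10, n_iter=200, seed=0):
    """EM updates for a GMM with Sigma_k = I_2 held fixed."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(n, size=K, replace=False)]      # initialise means at random data points
    for _ in range(n_iter):
        # E-step: log responsibilities log r[i, k], normalised over k
        log_r = np.stack([np.log(pi[k]) + multivariate_normal.logpdf(X, mu[k], np.eye(d))
                          for k in range(K)], axis=1)
        log_r -= logsumexp(log_r, axis=1, keepdims=True)
        r = np.exp(log_r)
        # M-step: closed-form updates for pi_k and mu_k
        Nk = r.sum(axis=0)
        pi = Nk / n
        mu = (r.T @ X) / Nk[:, None]
    return pi, mu

pi_hat, mu_hat = em_isotropic_gmm(X_train)
```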
3. Fit a kernel density estimator to the data, using the $R = 10$ setting above. Given the data $\{x_1, \dots, x_n\}$, a kernel density estimator is defined as
$$\hat p(x) = \frac{1}{n}\sum_{i=1}^{n} \phi_\sigma(x - x_i)\ ,$$
where $\phi_\sigma$ is the pdf of an isotropic Gaussian distribution with covariance $\sigma I_2$. Tune the parameter $\sigma$ using a validation set, i.e. a fresh batch of samples $\{x'_1, \dots, x'_n\}$ drawn from the same GMM model. Using a test set of 1000 samples $\{\tilde x_1, \dots, \tilde x_n\}$, evaluate the log-likelihood of your model
$$\frac{1}{n}\sum_{i=1}^{n} \log \hat p(\tilde x_i)\ .$$
[15pt]
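A sketch of the kernel density estimator and the bandwidth selection described above, reusing sample_gmm and X_train from the earlier sketches; the bandwidth grid is an illustrative choice:

```python
import numpy as np
from scipy.special import logsumexp

def kde_logpdf(x_eval, x_data, sigma):
    """log p_hat(x) for the Gaussian KDE with kernel covariance sigma * I_2."""
    n, d = x_data.shape
    sq = ((x_eval[:, None, :] - x_data[None, :, :]) ** 2).sum(axis=-1)   # pairwise squared distances
    log_kernel = -0.5 * sq / sigma - 0.5 * d * np.log(2 * np.pi * sigma)
    return logsumexp(log_kernel, axis=1) - np.log(n)

X_val, _, _ = sample_gmm(seed=1)     # fresh validation sample from the same GMM
X_test, _, _ = sample_gmm(seed=2)    # held-out test sample of 1000 points

sigmas = np.logspace(-2, 1, 30)
val_scores = [kde_logpdf(X_val, X_train, s).mean() for s in sigmas]
sigma_star = sigmas[int(np.argmax(val_scores))]
kde_test_ll = kde_logpdf(X_test, X_train, sigma_star).mean()
```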
Using the same training set, we will now fit an Energy-Based Model, using a simple shallow neural network as the energy function.
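A possible energy network, assuming PyTorch; the width and activation are illustrative choices, not prescribed by the exercise:

```python
import torch
import torch.nn as nn

class ShallowEnergy(nn.Module):
    """Shallow network E_theta: R^2 -> R used as the EBM energy."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)    # E_theta(x), one scalar per input point

energy = ShallowEnergy()
```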
4. Show that the MLE estimator in the EBM family given by $p_\theta(x) = \exp(E_\theta(x) - A(\theta))$, with $A(\theta) = \log \int \exp(E_\theta(x))\,\mathrm{d}x$, is the global optimiser of the loss
$$L(\theta) = \frac{1}{n}\sum_{i=1}^{n} E_\theta(x_i) - A(\theta)\ ,$$
and that
$$\nabla_\theta L(\theta) = \frac{1}{n}\sum_{i=1}^{n} \nabla_\theta E_\theta(x_i) - \mathbb{E}_{x\sim p_\theta}\!\left[\nabla_\theta E_\theta(x)\right]\ .$$
[5pt]
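For reference, the only step that is not a direct computation is the gradient of the log-partition function; a sketch, assuming differentiation under the integral sign is allowed:
$$\nabla_\theta A(\theta) = \frac{\int \nabla_\theta E_\theta(x)\,\exp(E_\theta(x))\,\mathrm{d}x}{\int \exp(E_\theta(x))\,\mathrm{d}x} = \int \nabla_\theta E_\theta(x)\,\exp\!\big(E_\theta(x) - A(\theta)\big)\,\mathrm{d}x = \mathbb{E}_{x\sim p_\theta}\!\left[\nabla_\theta E_\theta(x)\right].$$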
5. Since $d = 2$, you can afford to sample using importance sampling: consider estimating a quantity of the form $F = \mathbb{E}_{x\sim p_\theta}[f(x)]$. Consider a base probability measure $q(x)$. Show that
$$F = \frac{\mathbb{E}_{x\sim q}\!\left[f(x)\exp(E_\theta(x))/q(x)\right]}{\mathbb{E}_{x\sim q}\!\left[\exp(E_\theta(x))/q(x)\right]}\ .$$
[10pt]
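A sketch of the identity, valid whenever $q(x) > 0$ on the support of $p_\theta$: writing $p_\theta(x) = \exp(E_\theta(x) - A(\theta))$ and multiplying and dividing by $q(x)$ inside the integrals,
$$F = \frac{\int f(x)\,\exp(E_\theta(x))\,\mathrm{d}x}{\int \exp(E_\theta(x))\,\mathrm{d}x} = \frac{\int f(x)\,\frac{\exp(E_\theta(x))}{q(x)}\,q(x)\,\mathrm{d}x}{\int \frac{\exp(E_\theta(x))}{q(x)}\,q(x)\,\mathrm{d}x} = \frac{\mathbb{E}_{x\sim q}\!\left[f(x)\exp(E_\theta(x))/q(x)\right]}{\mathbb{E}_{x\sim q}\!\left[\exp(E_\theta(x))/q(x)\right]}\ .$$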
6. This suggests the importance sampling estimator
$$\hat F = \frac{\hat N}{\hat D}\ ,$$
with
$$\hat N = \frac{1}{M}\sum_{m=1}^{M}\left[f(x_m)\exp(E_\theta(x_m))/q(x_m)\right]\ , \qquad \hat D = \frac{1}{M}\sum_{m=1}^{M}\left[\exp(E_\theta(x_m))/q(x_m)\right]\ ,$$
where $\{x_m\}$ are drawn i.i.d. from $q$. Apply the importance sampling estimator to $\nabla_\theta A(\theta)$ using $M = 5000$ points and $q$ the ground-truth GMM model. [Note: you can use the same sample of $q$ for all gradient steps, so you only need to sample once.] [20pt]
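A training-loop sketch under the assumptions of the previous sketches (it reuses sample_gmm, X_train and the energy module defined above); PyTorch autograd supplies the $\nabla_\theta E_\theta$ terms, and the learning rate and number of steps are illustrative:

```python
import numpy as np
import torch
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

# One fixed importance sample of M = 5000 points from the ground-truth GMM q.
M, K = 5000, 10
Xq, _, mus_q = sample_gmm(n=M, K=K, R=10.0, seed=3)
log_q_np = logsumexp(np.stack([np.log(1.0 / K) + multivariate_normal.logpdf(Xq, mus_q[k], np.eye(2))
                               for k in range(K)]), axis=0)

xq = torch.tensor(Xq, dtype=torch.float32)
log_q = torch.tensor(log_q_np, dtype=torch.float32)
x_tr = torch.tensor(X_train, dtype=torch.float32)

opt = torch.optim.Adam(energy.parameters(), lr=1e-3)
for step in range(2000):
    opt.zero_grad()
    # Self-normalised importance weights w_m proportional to exp(E_theta(x_m)) / q(x_m).
    w = torch.softmax(energy(xq) - log_q, dim=0).detach()   # detach: weights enter as constants
    # Gradient of this surrogate is the negative of the estimated gradient of L(theta):
    # -( (1/n) sum_i grad E(x_i) - sum_m w_m grad E(x_m) ).
    loss = -(energy(x_tr).mean() - (w * energy(xq)).sum())
    loss.backward()
    opt.step()
```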
7. Using importance sampling to estimate $A(\theta)$, evaluate the test log-likelihood of your EBM and compare with the kernel density estimator. [10pt]
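Continuing the same sketch (reusing xq, log_q, M, energy, X_test and kde_test_ll from above), the estimated log-partition function gives the EBM test log-likelihood:

```python
import numpy as np
import torch

with torch.no_grad():
    log_w = energy(xq) - log_q                            # log of exp(E_theta(x_m)) / q(x_m)
    A_hat = torch.logsumexp(log_w, dim=0) - np.log(M)     # A(theta) ~ log (1/M) sum_m exp(E)/q
    x_te = torch.tensor(X_test, dtype=torch.float32)
    ebm_test_ll = (energy(x_te) - A_hat).mean().item()

print("EBM test log-likelihood:", ebm_test_ll)
print("KDE test log-likelihood:", kde_test_ll)
```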
2 Variational Inference
In this exercise, we will verify some properties of variational inference. The setup is a mixture model of the form
$$p(x|\theta) = \int_{\mathcal{Z}} p(x|z,\theta)\,\mathrm{d}p_0(z)\ ,$$
where $\theta \in \Theta$ are model parameters, $z \in \mathcal{Z}$ are latent variables defined over a generic domain, and $p_0$ is a prior distribution over the latent variables. For any probability distribution with positive density $q$ over $\mathcal{Z}$, recall the Variational Lower Bound
$$L(q,\theta) := \mathbb{E}_{q}\!\left[\log \frac{p(x,z|\theta)}{q(z)}\right]\ ,$$
which satisfies $\log p(x|\theta) \geq L(q,\theta)$. $q$ is referred to as the variational distribution.
1. Show that $\log p(x|\theta) = L(q,\theta) + D_{\mathrm{KL}}(q\,\|\,p(z|x,\theta))$. Use this result to argue that maximizing the Variational Lower Bound with respect to $q$ is equivalent to minimizing the KL divergence (also w.r.t. $q$). [10pt]
2. Draw two one-dimensional probability densities $p$ and $q$ such that $D_{\mathrm{KL}}(p\,\|\,q) \gg D_{\mathrm{KL}}(q\,\|\,p)$. Use a multimodal density for $p$ and interpret the conditions on the variational distribution that result in a tighter bound. [10pt]
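A small numerical companion to this question, assuming a two-mode $p$ and a single-Gaussian $q$; the grid-based KL values illustrate the asymmetry:

```python
import numpy as np
from scipy.stats import norm

xs = np.linspace(-12.0, 12.0, 4001)
dx = xs[1] - xs[0]
p = 0.5 * norm.pdf(xs, -4.0, 1.0) + 0.5 * norm.pdf(xs, 4.0, 1.0)   # bimodal target density
q = norm.pdf(xs, 4.0, 1.0)                                          # unimodal, covers only one mode

kl_pq = np.sum(p * (np.log(p) - np.log(q))) * dx   # D_KL(p || q): very large, q misses a mode of p
kl_qp = np.sum(q * (np.log(q) - np.log(p))) * dx   # D_KL(q || p): about log 2, mode-seeking behaviour
print(kl_pq, kl_qp)
```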
3. Using the example of the Gaussian Mixture Model from the previous exercise (in which $z$ are the mixture assignments and $x|z = k \sim \mathcal{N}(\mu_k, \Sigma_k)$), use the variational lower bound to estimate $\log p(x)$ using two choices for $q$: (i) $q = \mathrm{Unif}[1, \dots, K]$, and (ii) $q = p(z|x,\theta)$, the posterior distribution, using the true parameters of the model. Interpret the results. [20pt]
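A sketch for a single data point, reusing X_train and mus from the first exercise and the true parameters $\pi_k = 1/K$, $\Sigma_k = I_2$:

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def log_joint(x, mus, K=10):
    """log p(x, z = k | theta) under the true model, for k = 1..K."""
    return np.array([np.log(1.0 / K) + multivariate_normal.logpdf(x, mus[k], np.eye(2))
                     for k in range(K)])

def elbo(x, mus, q):
    """sum_k q(k) [log p(x, z=k) - log q(k)], with the convention 0 log 0 = 0."""
    lj = log_joint(x, mus, K=len(q))
    mask = q > 0
    return np.sum(q[mask] * (lj[mask] - np.log(q[mask])))

x0 = X_train[0]
lj = log_joint(x0, mus)
q_unif = np.full(10, 1.0 / 10)                 # choice (i): uniform variational distribution
q_post = np.exp(lj - logsumexp(lj))            # choice (ii): exact posterior p(z | x0, theta)

print(elbo(x0, mus, q_unif))                   # strictly below log p(x0)
print(elbo(x0, mus, q_post), logsumexp(lj))    # equal: the bound is tight at the posterior
```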
4. What happens to the variational lower bound as the number of mixture components $K \to \infty$, under each of the previous two choices of $q$? Interpret your answer. [10pt]