• GAN: 7pts
A Generative Adversarial Network (GAN) consists of two parts:
• Generator G generates a data sample from random noise. It is trained to fit the real data distribution so that the synthesized data is close to real data.
• Discriminator D predicts the probability that a sample comes from the real data distribution.
The training objective is formulated as a minimax game:
min_G max_D  E_{x∼pr(x)}[log D(x)] + E_{z∼pz(z)}[log(1 − D(G(z)))].    (1)
pr is the real data distribution, and pz is the noise distribution we sample z from, usually the standard Gaussian distribution.
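As a concrete illustration of Eq. (1), the minimal PyTorch sketch below evaluates the two expectations on a batch. The names D, G, real, and noise are illustrative placeholders, not part of the provided starter code.

```python
import torch

# Minimal sketch of the minimax objective in Eq. (1) for one batch.
# D maps a sample to a probability in (0, 1); G maps noise to a sample.
def gan_objective(D, G, real, noise, eps=1e-8):
    d_real = D(real)        # D(x),    x ~ p_r
    d_fake = D(G(noise))    # D(G(z)), z ~ p_z
    # E[log D(x)] + E[log(1 - D(G(z)))]; D maximizes this value, G minimizes it.
    return torch.log(d_real + eps).mean() + torch.log(1 - d_fake + eps).mean()
```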
1. (2pts) Suppose that the generator G is fixed. What is the optimal discriminator D∗? Write D∗ in terms of pr(x) and pg(x), where pg is the generated distribution, i.e. the distribution of G(z) in Eq. 1.
2. (2pts) Show that when D is optimal, optimizing Eq. 1 is the same as minimizing the Jensen-Shannon (JS) divergence between pr and pg.
3. (1pt) Explain why the original GAN suffers from vanishing gradients.
Hint: What happens when D perfectly separates generated samples from real data?
4. (2pts) Wasserstein GAN minimizes an approximation of the Earth-Mover’s (EM) distance (also called the Wasserstein-1 distance) rather than the JS divergence used in the original GAN, which leads to more stable training.
Consider three distributions P1 ∼ U[0, 1], P2 ∼ U[0.5, 1.5] and P3 ∼ U[1, 2], where U[a, b] is the uniform distribution over [a, b]. Calculate JS divergences JS(P1, P2) and JS(P1, P3) and EM distances EM(P1, P2) and EM(P1, P3). Show your work.
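This is not a substitute for the analytical derivation, but as a quick numerical sanity check you can estimate the 1-D EM distances from samples; scipy’s wasserstein_distance handles the one-dimensional case. The sample sizes and seed below are arbitrary.

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Monte Carlo sanity check for the 1-D EM distances.
rng = np.random.default_rng(0)
p1 = rng.uniform(0.0, 1.0, size=100_000)   # samples from P1 = U[0, 1]
p2 = rng.uniform(0.5, 1.5, size=100_000)   # samples from P2 = U[0.5, 1.5]
p3 = rng.uniform(1.0, 2.0, size=100_000)   # samples from P3 = U[1, 2]

print("EM(P1, P2) ≈", wasserstein_distance(p1, p2))  # should match your analytical answer
print("EM(P1, P3) ≈", wasserstein_distance(p1, p3))  # should match your analytical answer
```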
• Diffusion Model: 11pts
In the forward diffusion process, we gradually convert a data sample x0 ∼ q0(x) into noise by adding a small Gaussian noise at each step. We define
q(xt|xt−1) = N(xt; √(1 − βt) xt−1, βt I).
In the reverse diffusion process, we generate x0 from the noise by repeatedly sampling from pθ(xt−1|xt).
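For concreteness, one forward step can be sampled with the reparameterization trick as in the sketch below; the function name and the scalar beta_t are illustrative.

```python
import torch

# One forward-diffusion step via the reparameterization trick:
# x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps,  eps ~ N(0, I).
def forward_step(x_prev: torch.Tensor, beta_t: float) -> torch.Tensor:
    eps = torch.randn_like(x_prev)
    return (1.0 - beta_t) ** 0.5 * x_prev + beta_t ** 0.5 * eps
```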
1. (1pt) A diffusion model shares similarities with a VAE in that it also has an encoder stage (the forward diffusion process), which encodes a data sample x0 into noise xT, and a decoder stage (the reverse diffusion process), which decodes the noise back into a data sample. When training a diffusion model, we can also maximize the ELBO to optimize the log-likelihood Ex0∼q0(x0)[log pθ(x0)]. Write down the formula for the ELBO of log pθ(x0) in a diffusion model, and indicate which distribution each expectation is taken with respect to by adding a subscript to E[·].
2. (1pt) One way to evaluate generative models is to report the likelihood (or density) pθ(x0), where x0 is sampled from the test set. Can we directly estimate this density in diffusion models?
3. (3pts) Derive q(xt|x0) as a function of βi, i = 0, 1, . . . , t and x0. Hint: Use the reparameterization trick.
4. (3pts) The training objective contains the KL-divergence term
KL (q (xt−1|xt, x0) ∥pθ (xt−1|xt)) .
Derive the mean µθ(xt, x0) of q(xt−1|xt, x0).
Hint: Using Bayes’ rule, q(xt−1|xt, x0) = q(xt|xt−1, x0) q(xt−1|x0) / q(xt|x0).
Hint: q(xt−1|xt, x0) is also a Gaussian distribution.
5. (3pts) We can understand diffusion models from the perspective of score-based generative models. In score-based generative models, we estimate the score function sθ(x, δ) = ∇x log pθ(x, δ). The loss function is

L = Σ_{t=1}^{T} (λt / 2) E_{x∼q0(x), x̃∼N(x, δt² I)} [ ‖ sθ(x̃, δt) + (x̃ − x)/δt² ‖₂² ].
After obtaining sθ(x, δ), we sample with Langevin dynamics.
Suppose we have trained sθ(x, δ) to approximate unconditional ∇x log pθ(x, δ), where
x ∈ Rd. Let’s consider posterior sampling x ∼ pθ(x|xknown): we want to sample x given a partial observation xknown ∈ Rd and an observation mask M ∈ {0, 1}d, so that the sample agrees with the partial observation. Mi = 1 indicates that the i-th dimension of xknown is observed.
Show the conditional score function estimation sθ(x, δ|xknown) as a function of sθ(x, δ), x, xknown and M.
Hint: sθ(x, δ|xknown) estimates ∇x log pθ(x, δ|xknown). Using Bayes’ rule, pθ(x, δ|xknown) = p(xknown|x) pθ(x, δ) / p(xknown).
Hint: You can assume that p(xknown|x) ∝ exp(−‖(x − xknown) ⊙ M‖₂²). Here, ⊙ is the element-wise product.
Remark: With the derived sθ(x, δ|xknown), we can sample from the posterior distribution without re-training. This can be applied to tasks such as image inpainting.
• Unsupervised learning / contrastive learning: 4pts
True or false questions. If it is false, explain the reason in a few sentences. Each question is worth 1 point.
1. We can evaluate the effectiveness of unsupervised learning by selecting several important downstream tasks and reporting performance on them.
2. Choosing a good pretraining task is essential for unsupervised learning. MAE and BERT are both examples of self-prediction. MAE uses the same mask-out rate during training as BERT.
3. Minimizing the InfoNCE loss maximizes a lower bound on mutual information.
4. We cannot use CLIP to classify images without finetuning on a labelled image classification dataset.
• Coding: GAN, 10pts
In this problem, you need to implement a Generative Adversarial Network and train it on MNIST digits.
Table 1: Discriminator Architecture

Layer No. | Layer Type | Kernel Size | Stride | Padding | Output Channels
1         | conv2d     | 3           | 1      | 1       | 2
2         | ReLU       | -           | -      | -       | 2
3         | MaxPool    | 2           | 2      | -       | 2
4         | conv2d     | 3           | 1      | 1       | 4
5         | ReLU       | -           | -      | -       | 4
6         | MaxPool    | 2           | 2      | -       | 4
7         | conv2d     | 3           | 1      | 0       | 8
8         | ReLU       | -           | -      | -       | 8
9         | Linear     | -           | -      | -       | 1
Table 2: Generator Architecture

Layer No. | Layer Type        | Kernel Size | Stride | Padding | Bias | Output Channels
1         | Linear            | -           | -      | -       | ✓    | 1568
2         | LeakyReLU(0.2)    | -           | -      | -       | -    | 1568
3         | Upsample(scale=2) | -           | -      | -       | ✗    | 32
4         | conv2d            | 3           | 1      | 1       | ✓    | 16
5         | LeakyReLU(0.2)    | -           | -      | -       | -    | 16
6         | Upsample(scale=2) | -           | -      | -       | ✗    | 16
7         | conv2d            | 3           | 1      | 1       | ✓    | 8
8         | LeakyReLU(0.2)    | -           | -      | -       | -    | 8
9         | conv2d            | 3           | 1      | 1       | ✓    | 1
10        | sigmoid           | -           | -      | -       | -    | 1
1. Implement a discriminator DNet in hw5_gan.py with the architecture in Tab. 1. A layer contains a bias if the corresponding torch function has an option for adding one.
Remark 1: From layer 8 to layer 9, you need to flatten each data entry from a matrix to a vector.
2. Implement a generator GNet in hw5_gan.py with the architecture in Tab. 2.
Remark 2: From layer 2 to layer 3, you need to reshape each data entry to size (32, 7, 7) in CHW format. Note that 1568 = 32 × 7 × 7.
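A minimal sketch of the reshape required between layers 2 and 3 of Tab. 2; the tensor name h and the batch size 4 are illustrative, not part of the required hw5_gan.py structure.

```python
import torch

h = torch.zeros(4, 1568)     # output of the Linear + LeakyReLU layers (batch of 4)
h = h.view(-1, 32, 7, 7)     # 1568 = 32 * 7 * 7, CHW format per sample
print(h.shape)               # torch.Size([4, 32, 7, 7])
```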
Remark 3: For parts 1 and 2, please define the layers in __init__ in exactly the same order as they appear in Tab. 1 and Tab. 2.
Remark 4: We have listed all layers for the discriminator and generator. There is no need to add any extra components.
3. Implement the weight initialization function weight_init in DNet and GNet: use Kaiming uniform initialization for the weights and 0 for the bias if the layer contains a bias.
Hint: to iterate over all layers of an nn.Module, you may find self.children() useful. See the children() function explained at https://pytorch.org/docs/stable/generated/torch.nn.Module.html.
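A minimal sketch along the lines of this hint, written here as a free function over a module; inside DNet/GNet it would instead be a method that iterates over self.children(). The exact layer types to handle depend on how you define your networks.

```python
import torch.nn as nn

# Sketch: Kaiming-uniform weights, zero biases, for conv and linear layers.
def weight_init(module: nn.Module) -> None:
    for m in module.children():
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            nn.init.kaiming_uniform_(m.weight)
            if m.bias is not None:
                nn.init.zeros_(m.bias)
```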
4. Implement the loss function for the discriminator: get_loss_d of the GAN class in hw5_gan.py.
Hint: you may find torch.nn.BCEWithLogitsLoss useful.
5. Implement the loss function for the generator: get_loss_g of the GAN class in hw5_gan.py.
Hint: you may find torch.nn.BCEWithLogitsLoss useful.
6. Attach generated images after training.
Remark 5: the provided code saves images during training by default. You can choose three of the saved images and indicate the corresponding epochs.
Remark 6: with the default training settings, you should obtain reasonable generated images after around 30 epochs.
• Coding: Diffusion Model, 10pts
In this problem, you need to implement a score-based generative model and use it to sample from a complicated distribution of points. As introduced in lecture, sampling from this model consists of two main steps: matching the score function sθ(x, δ) of each noise-perturbed distribution, and then sampling from the model by running Langevin dynamics.
Algorithm 1 Compute denoising loss.
Require: Score function sθ, training sample x, noise levels {σi}, i = 1, . . . , L.
1: Sample σ from {σi}
2: Sample z ∼ N(0, I)
3: x̃ ← x + σz
4: λ ← σ²
5: L ← (λ/2) ‖ sθ(x̃, σ) + (x̃ − x)/σ² ‖₂²
return L
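As a reading aid only, a direct transcription of Alg. 1 for a single sample might look like the sketch below; compute_denoising_loss in the starter code has its own signature and additionally averages over all training samples. The argument names here (score_fn, sigmas) are illustrative.

```python
import torch

# Sketch of Alg. 1 for a single training sample x.
# score_fn(x_tilde, sigma) plays the role of s_theta; sigmas is a 1-D tensor of noise levels.
def denoising_loss_single(score_fn, x, sigmas):
    sigma = sigmas[torch.randint(len(sigmas), (1,)).item()]   # step 1: pick a noise level
    z = torch.randn_like(x)                                   # step 2: z ~ N(0, I)
    x_tilde = x + sigma * z                                   # step 3: perturb the sample
    lam = sigma ** 2                                          # step 4: lambda = sigma^2
    residual = score_fn(x_tilde, sigma) + (x_tilde - x) / sigma ** 2
    return 0.5 * lam * residual.pow(2).sum()                  # step 5: weighted squared norm
```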
Algorithm 2 Annealed Langevin dynamics.
Require: {σi}, i = 1, . . . , L, ϵ, T.
1: Initialize x̃0 ∼ N(0, I)
2: for i ← 1 to L do
3:   αi ← ϵ · σi² / σL²        ▷ αi is the step size.
4:   for t ← 1 to T do
5:     Draw zt ∼ N(0, I)
6:     x̃t ← x̃t−1 + αi sθ(x̃t−1, σi) + √(2αi) zt
7:   end for
8:   x̃0 ← x̃T
9: end for
return x̃T
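Again as a reading aid rather than the required implementation, the double loop in Alg. 2 can be transcribed roughly as follows; langevin_dynamics_sample in the starter code has its own signature and extra options (e.g. return_traj), and the names score_fn, sigmas, and dim below are illustrative.

```python
import torch

# Rough transcription of Alg. 2. score_fn(x, sigma) plays the role of s_theta;
# sigmas is a 1-D tensor of noise levels (sigma_1, ..., sigma_L).
def annealed_langevin(score_fn, sigmas, eps, T, dim):
    x = torch.randn(dim)                                 # x_tilde_0 ~ N(0, I)
    for sigma_i in sigmas:
        alpha = eps * sigma_i ** 2 / sigmas[-1] ** 2     # step size for this noise level
        for _ in range(T):
            z = torch.randn_like(x)
            x = x + alpha * score_fn(x, sigma_i) + (2 * alpha) ** 0.5 * z
    return x
```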
1. Implement the class ScoreNet in hw5_diffusion.py. ScoreNet is a neural network that predicts sθ(x, δ). We use 8 linear layers and Softplus as the activation function. In the forward pass, it takes x and δ as inputs. Read the comments in the code for detailed instructions.
2. Implement the training step in compute_denoising_loss. The detailed steps are in Alg. 1. Note that Alg. 1 describes a single training sample; in compute_denoising_loss, you should apply it to all training samples and return the loss averaged over them.
3. Implement the Langevin dynamics sampling in langevin_dynamics_sample. We use the annealed Langevin dynamics shown in Alg. 2.
4. You are now ready to train the model and sample by running the main function.
(a) (3pts) Visualize the score function using hw5_utils.plot_score. Include the visualization in the report.
(b) (4pts) Generate 1000 new samples with langevin_dynamics_sample. Plot the points at time steps 0, 200, 400, 600, 800 and the final sampled points. Your final samples should roughly follow the pattern “CS446”. Include the plots in the report.
(c) (3pts) Visualize the trajectory of the Langevin dynamics. You can get the trajectory by setting return_traj=True in langevin_dynamics_sample. Include the visualization of the trajectory in the report.