IFT6135-H2022 (Prof: Aaron Courville), Generative Models and Self-supervised Learning
Assignment 3, Programming Part

Problem 1


Variational Autoencoders (VAEs) are probabilistic generative models for modelling a data distribution p(x). In this question, you will train a VAE on the binarized MNIST dataset, using the negative ELBO loss as shown in class. Note that each pixel in this image dataset is binary: a pixel is either black or white, which means each datapoint (image) is a collection of binary values. You have to model the likelihood pθ(x|z), i.e. the decoder, as a product of Bernoulli distributions.¹


    1. (unittest, 4 pts) Implement the function ‘log_likelihood_bernoulli’ in ‘q1_solution.py’ to compute the log-likelihood log p(x) for a given binary sample x and Bernoulli distribution p(x). The distribution p(x) is parameterized by its mean p(x = 1), which is given as input to the function.
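For concreteness, a minimal sketch of how such a Bernoulli log-likelihood could be computed is shown below; the function name, the (batch, D) layout, and the clamping constant are illustrative assumptions, not the provided interface.

```python
import torch

def bernoulli_log_likelihood(mu, target, eps=1e-7):
    """Sketch: sum of Bernoulli log-probabilities per sample.

    mu:     (batch, D) means p(x = 1), assumed to lie in (0, 1)
    target: (batch, D) binary observations
    returns (batch,) log p(x) for each sample
    """
    mu = mu.clamp(eps, 1.0 - eps)  # avoid log(0)
    log_p = target * torch.log(mu) + (1.0 - target) * torch.log(1.0 - mu)
    return log_p.sum(dim=-1)
```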

    2. (unittest, 4 pts) Implement the function ‘log_likelihood_normal’ in ‘q1_solution.py’ to compute the log-likelihood log p(x) for a given real-valued vector x under an isotropic Normal distribution N(µ, diag(σ²)). Note that µ and log(σ²) will be given for the Normal distributions.
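One possible sketch of this diagonal Gaussian log-density computed from µ and log σ²; the tensor shapes and argument order are assumptions made for illustration.

```python
import math
import torch

def normal_log_likelihood(mu, logvar, x):
    """Sketch: log N(x; mu, diag(exp(logvar))) summed over dimensions.

    mu, logvar, x: (batch, D) tensors; returns (batch,) log-densities.
    """
    log_p = -0.5 * (math.log(2 * math.pi) + logvar + (x - mu) ** 2 / logvar.exp())
    return log_p.sum(dim=-1)
```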
    3. (unittest, 4 pts) Implement the function ‘log_mean_exp’ in ‘q1_solution.py’ to compute the following equation² for each y_i in a given Y = {y_1, y_2, . . . , y_i, . . . , y_M}:

log (1/K) ∑_{k=1}^{K} exp(y_i^(k) − a_i) + a_i,

where a_i = max_k y_i^(k). Note that y_i = [y_i^(1), y_i^(2), . . . , y_i^(k), . . . , y_i^(K)].
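A hedged sketch of this stabilized log-mean-exp over the sample dimension; the (M, K) input layout is an assumption.

```python
import torch

def log_mean_exp(y):
    """Sketch: numerically stable log((1/K) * sum_k exp(y[:, k])).

    y: (M, K) tensor; returns an (M,) tensor.
    """
    a, _ = y.max(dim=1, keepdim=True)               # a_i = max_k y_i^(k)
    return (y - a).exp().mean(dim=1).log() + a.squeeze(1)
```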


¹The binarized MNIST dataset is not interchangeable with the MNIST dataset available in torchvision, so the data loader as well as the dataset will be provided.

²This is a type of log-sum-exp trick used to deal with numerical underflow: the generation of a number too small to be represented on the device meant to store it. For example, the pixel probabilities of an image can become very small. For more details on numerical underflow when computing log-probabilities, see http://blog.smola.org/post/987977550/log-probabilities-semirings-and-floating-point.



    4. (unittest, 4 pts) Implement the function ‘kl_gaussian_gaussian_analytic’ in ‘q1_solution.py’ to compute the KL divergence DKL(q(z|x) ∥ p(z)) via its analytic solution for given p and q. Note that p and q are multivariate normal distributions with diagonal covariance.
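For diagonal Gaussians q = N(µ_q, diag(σ_q²)) and p = N(µ_p, diag(σ_p²)), the KL divergence has a closed form; the sketch below assumes a parameterization by means and log-variances.

```python
import torch

def kl_gaussian_gaussian_analytic(mu_q, logvar_q, mu_p, logvar_p):
    """Sketch: D_KL(q || p) for diagonal Gaussians, summed over dimensions.

    All inputs are (batch, D); returns (batch,).
    """
    var_q, var_p = logvar_q.exp(), logvar_p.exp()
    kl = 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return kl.sum(dim=-1)
```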

    5. (unittest, 4 pts) Implement the function ‘kl_gaussian_gaussian_mc’ in ‘q1_solution.py’ to compute the KL divergence DKL(q(z|x) ∥ p(z)) using a Monte Carlo estimate for given p and q. Note that p and q are multivariate normal distributions with diagonal covariance.
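The Monte Carlo variant draws z ∼ q via the reparameterization trick and averages log q(z|x) − log p(z); the num_samples argument below is an assumption, since the required interface may fix a single sample.

```python
import math
import torch

def kl_gaussian_gaussian_mc(mu_q, logvar_q, mu_p, logvar_p, num_samples=1):
    """Sketch: Monte Carlo estimate of D_KL(q || p), averaged over num_samples draws.

    All distribution parameters are (batch, D); returns (batch,).
    """
    def log_normal(z, mu, logvar):
        return (-0.5 * (math.log(2 * math.pi) + logvar + (z - mu) ** 2 / logvar.exp())).sum(-1)

    std_q = (0.5 * logvar_q).exp()
    eps = torch.randn((num_samples,) + mu_q.shape, device=mu_q.device)
    z = mu_q + std_q * eps  # reparameterized samples from q, shape (S, batch, D)
    return (log_normal(z, mu_q, logvar_q) - log_normal(z, mu_p, logvar_p)).mean(dim=0)
```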


    6. (report, 15 pts) Consider a latent variable model p_θ(x) = ∫ p_θ(x|z) p(z) dz. The prior is defined as p(z) = N(0, I_L) and z ∈ R^L. Train a VAE with a latent variable of 100 dimensions (L = 100). Use the provided network architecture and hyperparameters described in ‘vae.ipynb’³. Use ADAM with a learning rate of 3 × 10⁻⁴, and train for 20 epochs. Evaluate the model on the validation set using the ELBO. Marks will neither be deducted nor awarded if you do not use the given architecture. Note that for this question you have to:

    (a) Train a model to achieve an average per-instance ELBO of ≥ −102 on the validation set, and report the ELBO of your model. The ELBO on validation is written as:

(1/|D_valid|) ∑_{x_i ∈ D_valid} L_ELBO(x_i) ≥ −102

Feel free to modify the above hyperparameters (except the latent variable size) to ensure it works.
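Putting the pieces above together, a per-instance negative ELBO training loss could be assembled roughly as follows. This sketch reuses the Bernoulli log-likelihood and analytic KL sketches from earlier questions, and the encoder/decoder interfaces are assumptions that may differ from the architecture provided in ‘vae.ipynb’.

```python
import torch

def negative_elbo(x, encoder, decoder):
    """Sketch: -ELBO(x) = -E_q[log p(x|z)] + KL(q(z|x) || p(z)), averaged over the batch.

    encoder(x) is assumed to return (mu, logvar) of q(z|x);
    decoder(z) is assumed to return Bernoulli means of p(x|z).
    """
    mu, logvar = encoder(x)
    std = (0.5 * logvar).exp()
    z = mu + std * torch.randn_like(std)               # reparameterization trick
    recon_mu = decoder(z)

    log_px_z = bernoulli_log_likelihood(recon_mu, x)   # from the sketch in question 1
    kl = kl_gaussian_gaussian_analytic(
        mu, logvar, torch.zeros_like(mu), torch.zeros_like(logvar))  # prior N(0, I)
    return (-log_px_z + kl).mean()
```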

    7. (report, 15 pts) Evaluate the log-likelihood of the trained VAE model using importance sampling, as covered in the lecture. Use the code described in ‘vae.ipynb’. The formula is reproduced here with additional details:

log p(x = x_i) ≈ log (1/K) ∑_{k=1}^{K} p_θ(x = x_i | z_i^(k)) p(z = z_i^(k)) / q_ϕ(z = z_i^(k) | x_i),

where z_i^(k) ∼ q_ϕ(z | x_i) for all k, and x_i ∈ D.

    (a) Report your evaluation of the trained model on the test set using the log-likelihood estimate (1/N) ∑_{i=1}^{N} log p(x_i), where N is the size of the test dataset. Use K = 200 as the number of importance samples, D as the dimension of the input (D = 784 in the case of MNIST), and L = 100 as the dimension of the latent variable.
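A sketch of this importance-sampling estimate, reusing the log-density and log_mean_exp sketches from questions 1–3; the encoder/decoder interfaces are again assumptions.

```python
import torch

def importance_sampling_log_px(x, encoder, decoder, K=200):
    """Sketch: log p(x_i) ~= log (1/K) sum_k p(x_i|z_k) p(z_k) / q(z_k|x_i)."""
    mu, logvar = encoder(x)                              # q(z|x), shapes (batch, L)
    std = (0.5 * logvar).exp()
    eps = torch.randn((K,) + mu.shape, device=x.device)
    z = mu + std * eps                                   # (K, batch, L)

    log_w = []
    for k in range(K):
        log_px_z = bernoulli_log_likelihood(decoder(z[k]), x)   # log p(x|z_k)
        log_pz = normal_log_likelihood(torch.zeros_like(mu), torch.zeros_like(logvar), z[k])
        log_qz_x = normal_log_likelihood(mu, logvar, z[k])
        log_w.append(log_px_z + log_pz - log_qz_x)       # log importance weights
    return log_mean_exp(torch.stack(log_w, dim=1))       # (batch,)
```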




³This file is executable in Google Colab. You can also convert vae.ipynb to vae.py using Colab.


Problem 2


Recent years have shown an explosion of research into using deep learning and computer vision algorithms to generate images. To train a GAN, we need to estimate the distance between the distributions of real and generated data. In this question, we employ the Wasserstein distance, and to ensure the Lipschitz constraint, we penalize violations of it as suggested by Petzka et al.⁴ You will first implement a function to estimate the Wasserstein distance between the real and generated distributions from samples x ∼ µ and y ∼ ν (real and generated samples, respectively):

Ey∼ν[f(y)] − Ex∼µ[f(x)]    (1)

and with an added regularization term

Ex̂∼τ[(max{0, ∥∇f(x̂)∥ − 1})²]    (2)

and then, during training of the network, minimize:

Ey∼ν[f(y)] − Ex∼µ[f(x)] + λ Ex̂∼τ[(max{0, ∥∇f(x̂)∥ − 1})²]    (3)

where x̂ = tx + (1 − t)y for t ∼ U[0, 1].


Dataset & dataloader  In this question, you will use the GAN framework to train a generator that generates a distribution of images X ⊂ R^{32×32×3}, namely the Street View House Numbers (SVHN) dataset⁵. We consider the isotropic Gaussian p(z) = N(0, I) as the prior distribution. We provide a function for sampling from the SVHN dataset in ‘q2_samplers.py’.




Hyperparameters & Training Pointers  We provide the code for the GAN network as well as the hyperparameters you should use. We ask you to implement the Wasserstein distance and the training procedure to train the GAN, as well as the qualitative exploration that you will include in your report.


    1. (unittest, 4 pts) Implement the functions ‘vf_wasserstein_distance’ and ‘lp_reg’ in ‘q2_solution.py’ to compute the objective function of the Wasserstein distance and the “Lipschitz Penalty”. Note that the norm used in the regularizer is the L2-norm.
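A hedged sketch of these two pieces under the definitions of Equations (1)–(3); the critic signature, the (B, C, H, W) image layout, and the use of a single t per sample are assumptions.

```python
import torch

def vf_wasserstein_distance(real, fake, critic):
    """Sketch of Eq. (1): E_{y~nu}[f(y)] - E_{x~mu}[f(x)], estimated from batches."""
    return critic(fake).mean() - critic(real).mean()

def lp_reg(real, fake, critic):
    """Sketch of Eq. (2): E_{x_hat~tau}[ max(0, ||grad f(x_hat)||_2 - 1)^2 ]."""
    t = torch.rand(real.size(0), 1, 1, 1, device=real.device)   # t ~ U[0, 1], one per sample
    x_hat = (t * real + (1.0 - t) * fake).requires_grad_(True)  # interpolate real and fake
    grads, = torch.autograd.grad(critic(x_hat).sum(), x_hat, create_graph=True)
    grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)
    return torch.clamp(grad_norm - 1.0, min=0.0).pow(2).mean()
```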


Qualitative Evaluation  In your report,

    2. (report, 8 pts) Provide visual samples. Comment on the quality of the samples from each model (e.g. blurriness, diversity).

    3. (report, 8 pts) We want to see if the model has learned a disentangled representation in the latent space. Sample a random z from your prior distribution. Make small perturbations to your sample z for each dimension (e.g. for a dimension i, z′_i = z_i + ϵ). ϵ has to be large enough to see some visual difference. For each dimension, observe whether the changes result in visual variations (that is, variations in g(z)). You do not have to show all dimensions, just a couple that result in interesting changes. (See the sketch after question 4 below.)


⁴See Section 5 of https://arxiv.org/pdf/1709.08894.pdf

⁵The SVHN dataset can be downloaded at http://ufldl.stanford.edu/housenumbers/. Please note that the provided sampler can download the dataset, so you do not need to download it separately.





    4. (report, 8 pts) Compare interpolating in the data space with interpolating in the latent space. Pick two random points z_0 and z_1 in the latent space, sampled from the prior.

        (a) For α = 0, 0.1, 0.2, . . . , 1, compute z′_α = αz_0 + (1 − α)z_1 and plot the resulting samples x′_α = g(z′_α).

        (b) Using the data samples x_0 = g(z_0) and x_1 = g(z_1), for α = 0, 0.1, 0.2, . . . , 1, plot the samples x̂_α = αx_0 + (1 − α)x_1.

Explain the difference between the two interpolation schemes.
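For questions 3 and 4, a minimal sketch of the latent perturbations and the two interpolation schemes; the generator interface, the latent dimension, and the value of ϵ are assumptions.

```python
import torch

@torch.no_grad()
def qualitative_exploration(generator, latent_dim=100, eps=3.0):
    """Sketch: latent perturbations (question 3) and interpolations (question 4)."""
    # Question 3: perturb one latent dimension at a time.
    z = torch.randn(1, latent_dim)
    perturbed = []
    for i in range(latent_dim):
        z_prime = z.clone()
        z_prime[0, i] += eps                       # z'_i = z_i + eps
        perturbed.append(generator(z_prime))

    # Question 4: latent-space vs. data-space interpolation.
    z0, z1 = torch.randn(1, latent_dim), torch.randn(1, latent_dim)
    x0, x1 = generator(z0), generator(z1)
    alphas = torch.linspace(0, 1, 11)
    latent_interp = [generator(a * z0 + (1 - a) * z1) for a in alphas]  # x'_alpha = g(z'_alpha)
    data_interp = [a * x0 + (1 - a) * x1 for a in alphas]               # x_hat_alpha
    return perturbed, latent_interp, data_interp
```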




Problem 3


Self-supervised methods learn a representation of data by solving pretext tasks, which alleviates the need for expensive supervised labelling. Contrastive learning methods, a sub-category of self-supervised learning (SSL), learn an embedding by minimizing the distance between the embeddings of two different views of the same sample while maximizing the distance between the embeddings of views of two different samples. However, non-contrastive SSL methods have shown comparable results without using a large number of negative samples. In this question, you will investigate why these networks do not collapse to a trivial solution by implementing the SimSiam algorithm and experimenting with key factors of its performance.

You will first implement a function to compute the negative cosine similarity. Then you will implement the SimSiam loss and forward functions. In the end, you will run experiments to analyze the performance of the model under different conditions.



Dataset and dataloader for the Siamese network⁶  In this question, you will work on an object classification task on the CIFAR10 dataset⁷. This dataset consists of images X ⊂ R^{32×32×3} from 10 classes. We provide samplers to generate the different distributions that you will need for this question. In the same repository, we also provide the architectures of the neural network functions f : X → Z and h : Z → P, where X ⊂ R^{32×32×3}, Z ⊂ R^d, and P ⊂ R^d (d is the feature dimension; the SimSiam default is 2048).



Hyperparameters & Training Pointers  We provide the code for the SSL network as well as the hyperparameters you should use. We ask you to implement the training procedure to train SimSiam, as well as the qualitative exploration that you will include in your report.


⁶A Siamese neural network is a class of neural network architectures that contain two or more identical subnetworks.

⁷The CIFAR10 dataset can be downloaded at https://www.cs.toronto.edu/~kriz/cifar.html. Please note that the PyTorch CIFAR10 Dataset can download the dataset, so you do not need to download it separately.







    1. (unittest, 4 pts) Implement the forward function of SimSiam in ‘q3_solution.py’. This function receives two randomly augmented views x_1 and x_2 of an input sample x and computes the outputs of the network, which are given below:


z_1 ≜ f(x_1),    p_1 ≜ h(f(x_1))    (4)

z_2 ≜ f(x_2),    p_2 ≜ h(f(x_2))    (5)

Note that you need to perform the gradient stopping in this step as well.
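One way this forward pass could look, assuming modules named encoder (f) and predictor (h) and using detach() to realize the stop-gradient:

```python
import torch.nn as nn

class SimSiamSketch(nn.Module):
    """Sketch of the SimSiam forward pass in Equations (4)-(5)."""
    def __init__(self, encoder: nn.Module, predictor: nn.Module):
        super().__init__()
        self.encoder = encoder      # f : X -> Z
        self.predictor = predictor  # h : Z -> P

    def forward(self, x1, x2):
        z1, z2 = self.encoder(x1), self.encoder(x2)      # z = f(x)
        p1, p2 = self.predictor(z1), self.predictor(z2)  # p = h(f(x))
        # Stop-gradient: detach the targets so gradients do not flow through z1, z2 here.
        return p1, p2, z1.detach(), z2.detach()
```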

    2. (unittest, 4 pts) Implement the function ‘cosine_similarity’ in ‘q3_solution.py’ to compute the similarity between two inputs. The cosine similarity between two variables A and B is defined as

S_c(A, B) = (A · B) / (∥A∥₂ ∥B∥₂)    (6)
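A sketch of Equation (6) for batched (batch, d) inputs; the small eps added for numerical stability is an assumption (torch.nn.functional.cosine_similarity would be an equivalent built-in choice).

```python
import torch

def cosine_similarity(a, b, eps=1e-8):
    """Sketch of Eq. (6): S_c(A, B) = (A . B) / (||A||_2 ||B||_2), per row of (batch, d) inputs."""
    return (a * b).sum(dim=-1) / (a.norm(2, dim=-1) * b.norm(2, dim=-1)).clamp(min=eps)
```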


    3. (unittest, 4 pts) Implement the SimSiam loss function in ‘q3_solution.py’ to compute the objective function of SimSiam, using the cosine similarity above. The SimSiam objective function is defined as:


L = 0.5 · D(p_1, stopgrad(z_2)) + 0.5 · D(p_2, stopgrad(z_1))    (7)

where D is the negative cosine similarity.
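A sketch of this loss, reusing the cosine-similarity sketch above and using detach() for stop-grad; averaging over the batch is an assumption.

```python
def simsiam_loss(p1, p2, z1, z2):
    """Sketch of Eq. (7): L = 0.5 * D(p1, stopgrad(z2)) + 0.5 * D(p2, stopgrad(z1)),
    where D is the negative cosine similarity, averaged over the batch."""
    d1 = -cosine_similarity(p1, z2.detach()).mean()
    d2 = -cosine_similarity(p2, z1.detach()).mean()
    return 0.5 * d1 + 0.5 * d2
```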

    4. (report, 10 pts) Train the model for 100 epochs with and without gradient stopping. Plot the training loss and the KNN accuracy against training epochs.

    5. (report, 10 pts) Investigate the effect of the predictor network (MLP) by experimenting with the setting(s) below. Plot the training loss and the KNN accuracy against training epochs.

        (a) Remove the predictor by replacing it with an identity mapping.


























