CS7643: Deep Learning Assignment 4




In this assignment, we will work with Python 3. If you do not have a Python distribution installed yet, we recommend installing Anaconda (or Miniconda) with Python 3. Given that you should already have PyTorch installed in your local Anaconda environment from Assignment 2, we do not cover installation again here.

 














2.1 Training and Hyperparameter Tuning




Train seq2seq on the dataset with the default hyperparameters. Then perform hyperparameter tuning and include the improved results in a report explaining what you have tried. Do NOT just increase the number of epochs or change the model type (RNN to LSTM), as this is too trivial. A sketch of one possible tuning loop is shown below.
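As a starting point, here is a minimal sketch of a small grid search. `train_and_evaluate` is a hypothetical helper standing in for the notebook's training loop, and the listed values are only examples of settings (learning rate, hidden size, dropout) worth varying:

```python
import itertools

# Hypothetical search space; adjust to the hyperparameters exposed by your notebook.
search_space = {
    "learning_rate": [1e-3, 5e-4, 1e-4],
    "hidden_size": [128, 256],
    "dropout": [0.1, 0.3],
}

best_perplexity, best_config = float("inf"), None
for values in itertools.product(*search_space.values()):
    config = dict(zip(search_space.keys(), values))
    # train_and_evaluate is a hypothetical wrapper around the training loop
    # that returns validation perplexity for a given configuration.
    perplexity = train_and_evaluate(config)
    if perplexity < best_perplexity:
        best_perplexity, best_config = perplexity, config

print("best config:", best_config, "perplexity:", best_perplexity)
```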




3 Transformers




We will be implementing a one-layer Transformer encoder which, similar to an RNN, can encode a sequence of inputs and produce a final output of scores over the tokens in the target language. The architecture can be seen below.




You can refer to the original paper. In the models folder you will see the file Transformer.py; you will implement the functions in the TransformerTranslator class.

[Figure: one-layer Transformer encoder architecture]
3.1 Embeddings




We will format our input embeddings similarly to how they are constructed in [BERT (source of figure)](https://arxiv.org/pdf/1810.04805.pdf).

[Figure: BERT input representation, the sum of token, segment, and position embeddings]

Recall from lecture that unlike an RNN, a Transformer does not include any positional information about the order in which the words in the sentence occur. Because of this, we need to add a positional encoding at each position. (We will ignore the segment embeddings and [SEP] token here, since we are only encoding one sentence at a time.) We have already appended the [CLS] token for you in the previous step.
Your first task is to implement the embedding lookup, including the addition of positional encodings. Complete the code section for Deliverable 1, which will include part of __init__ and embed.
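A minimal sketch of this deliverable is shown below, assuming the constructor creates one nn.Embedding for tokens and one for positions. The attribute names (self.token_embed, self.pos_embed) are placeholders; match whatever layers the TransformerTranslator constructor in models/Transformer.py actually defines.

```python
import torch
import torch.nn as nn

class EmbeddingSketch(nn.Module):
    def __init__(self, vocab_size, hidden_dim, max_length):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, hidden_dim)  # word embeddings
        self.pos_embed = nn.Embedding(max_length, hidden_dim)    # learned positional embeddings

    def embed(self, inputs):
        # inputs: (batch_size, seq_len) tensor of token indices
        seq_len = inputs.shape[1]
        positions = torch.arange(seq_len, device=inputs.device)  # 0, 1, ..., seq_len-1
        # BERT-style: sum the token embedding and the positional embedding
        return self.token_embed(inputs) + self.pos_embed(positions).unsqueeze(0)
```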




3.2 Multi-head Self-Attention




Attention can be computed in matrix form using the following formula:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where $Q$, $K$, and $V$ are the query, key, and value matrices and $d_k$ is the dimension of the keys.
We want to have multiple self-attention operations, computed in parallel. Each of these is called a head. We concatenate the heads and multiply them with the matrix attention_head_projection to produce the output of this layer.




After every multi-head self-attention and feedforward layer, there is a residual connection followed by layer normalization. Make sure to implement this, using the following formula:

$$\text{LayerNorm}(x + \text{Sublayer}(x))$$

where $\text{Sublayer}(x)$ is the output of the multi-head self-attention or feedforward layer.
Implement the function multi_head_attention to do this. We have already initialized all of the layers you will need in the constructor.
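A minimal sketch of the idea is given below. The attribute names self.q_heads, self.k_heads, self.v_heads, and self.norm_mh are placeholders; the real constructor already provides per-head projection layers, self.attention_head_projection, and a LayerNorm, so reuse those rather than creating new ones.

```python
import math
import torch

def multi_head_attention_sketch(self, inputs):
    # inputs: (batch_size, seq_len, hidden_dim)
    head_outputs = []
    for q_proj, k_proj, v_proj in zip(self.q_heads, self.k_heads, self.v_heads):
        q = q_proj(inputs)                                    # (N, T, d_k)
        k = k_proj(inputs)                                    # (N, T, d_k)
        v = v_proj(inputs)                                    # (N, T, d_k)
        scores = torch.bmm(q, k.transpose(1, 2)) / math.sqrt(q.shape[-1])
        attn = torch.softmax(scores, dim=-1)                  # attention weights over key positions
        head_outputs.append(torch.bmm(attn, v))               # (N, T, d_k)
    # Concatenate the heads and project back to hidden_dim
    multi_head = self.attention_head_projection(torch.cat(head_outputs, dim=-1))
    # Residual connection + layer normalization
    return self.norm_mh(inputs + multi_head)
```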

 














3.3 Element-Wise Feedforward Layer




Complete the code for Deliverable 3 in feedforward_layer: the element-wise feedforward layer consisting of two linear transformations with a ReLU in between.
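A minimal sketch under the same assumptions as before: self.linear1, self.linear2, and self.norm_ff are placeholder names for the layers already created in the constructor.

```python
import torch

def feedforward_layer_sketch(self, inputs):
    # inputs: (batch_size, seq_len, hidden_dim)
    hidden = torch.relu(self.linear1(inputs))   # expand to the feedforward dimension
    out = self.linear2(hidden)                  # project back to hidden_dim
    return self.norm_ff(inputs + out)           # residual connection + layer norm
```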

  

















3.4 Final Layer




Complete the code for Deliverable 4 in final_layer to produce probability scores for all tokens in the target language.
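This is typically a single linear projection from hidden_dim to the target vocabulary size; a sketch is below. self.final_linear is a placeholder name for the layer created in the constructor, and raw scores (logits) are enough if the loss function applies the softmax itself.

```python
def final_layer_sketch(self, inputs):
    # inputs: (batch_size, seq_len, hidden_dim) -> (batch_size, seq_len, output_size)
    return self.final_linear(inputs)
```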




3.5 Forward Pass




Put it all together by completing the method forward, where you combine all of the methods you have developed in the right order to perform a full forward pass.
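A minimal sketch of the forward pass, chaining the methods above in order:

```python
def forward_sketch(self, inputs):
    x = self.embed(inputs)                 # token + positional embeddings
    x = self.multi_head_attention(x)       # self-attention, residual, layer norm
    x = self.feedforward_layer(x)          # feedforward, residual, layer norm
    return self.final_layer(x)             # scores over the target vocabulary
```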




3.6 Training and Hyperparameter Tuning




Train the Transformer architecture on the dataset with the default hyperparameters; you should get a lower (better) perplexity than the Seq2Seq model. Then perform hyperparameter tuning and include the improved results in a report explaining what you have tried. Do NOT just increase the number of epochs, as this is too trivial.
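Perplexity is the exponential of the average per-token cross-entropy loss. The helper below is a sketch for sanity-checking your reported numbers; it assumes logits of shape (batch, seq_len, vocab_size) and integer targets of shape (batch, seq_len).

```python
import math
import torch
import torch.nn as nn

def perplexity_from_logits(logits: torch.Tensor, targets: torch.Tensor) -> float:
    criterion = nn.CrossEntropyLoss()  # averages the loss over all tokens by default
    loss = criterion(logits.reshape(-1, logits.shape[-1]), targets.reshape(-1))
    return math.exp(loss.item())
```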




4 Deliverables




You will need to submit the notebook as well as your code in the models folder. Your report should include the accuracy of the Seq2Seq model and the Transformer architecture before and after hyperparameter tuning, with explanations of what you did.

































