Programming Assignment 3: Attention-Based Neural Machine Translation




Introduction




In this assignment, you will train an attention-based neural machine translation model to translate words from English to Pig-Latin. Along the way, you’ll gain experience with several important concepts in NMT, including attention and teacher forcing.




Pig Latin




Pig Latin is a simple transformation of English based on the following rules (applied on a per-word basis):




If the first letter of a word is a consonant, then the letter is moved to the end of the word, and the letters "ay" are added to the end: team → eamtay.

If the first letter is a vowel, then the word is left unchanged and the letters "way" are added to the end: impress → impressway.

In addition, some consonant pairs, such as "sh", are treated as a block and are moved to the end of the string together: shopping → oppingshay.



To translate a whole sentence from English to Pig-Latin, we simply apply these rules to each word independently:




i went shopping → iway entway oppingshay
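To make the rules concrete, here is a minimal sketch of the transformation in plain Python. It is not the script used to generate the assignment's data; in particular, it assumes that any leading run of consonants (not just pairs like "sh") is moved as a block, and that hyphenated words are translated part by part, as described later in the handout.

```python
# Minimal sketch of the Pig-Latin rules described above (not the assignment's
# data-generation code). Assumption: the entire leading run of consonants is
# moved as a block, which matches examples like shopping -> oppingshay.
VOWELS = set("aeiou")

def pig_latin_word(word):
    # Hyphenated words are translated part by part and re-joined with a dash.
    if "-" in word:
        return "-".join(pig_latin_word(part) for part in word.split("-"))
    if word[0] in VOWELS:
        return word + "way"
    # Find the end of the leading consonant cluster.
    i = 0
    while i < len(word) and word[i] not in VOWELS:
        i += 1
    return word[i:] + word[:i] + "ay"

def pig_latin_sentence(sentence):
    return " ".join(pig_latin_word(w) for w in sentence.split())

print(pig_latin_sentence("i went shopping"))  # iway entway oppingshay
```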




We would like a neural machine translation model to learn the rules of Pig-Latin implicitly, from (English, Pig-Latin) word pairs. Since the translation to Pig Latin involves moving characters around in a string, we will use character-level recurrent neural networks for our model.







https://markus.teach.cs.toronto.edu/csc321-2018-01
http://www.cs.toronto.edu/~rgrosse/courses/csc321_2018/syllabus.pdf













Because English and Pig-Latin are so similar in structure, the translation task is almost a copy task; the model must remember each character in the input, and recall the characters in a specific order to produce the output. This makes it an ideal task for understanding the capacity of NMT models.




Data




The data for this task consists of pairs of words {(s^(i), t^(i))}_{i=1}^N, where the source s^(i) is an English word, and the target t^(i) is its translation in Pig-Latin. The dataset is composed of unique words from the book "Sense and Sensibility" by Jane Austen. The vocabulary consists of 29 tokens: the 26 standard alphabet letters (all lowercase), the dash symbol -, and two special tokens, <SOS> and <EOS>, that denote the start and end of a sequence, respectively. The dataset contains 6387 unique (English, Pig-Latin) pairs in total; the first few examples are:




{ (the, ethay), (family, amilyfay), (of, ofway), ... }




In order to simplify the processing of mini-batches of words, the word pairs are grouped based on the lengths of the source and target. Thus, in each mini-batch the source words are all the same length, and the target words are all the same length. This simplifies the code, as we don't have to worry about batches of variable-length sequences.




Part 1: Encoder-Decoder Models and Capacity [1 mark]




Translation is a sequence-to-sequence problem: in our case, both the input and output are sequences of characters. A common architecture used for seq-to-seq problems is the encoder-decoder model [2], composed of two RNNs, as follows:

[Figure omitted: during training, the encoder reads the characters c, a, t, <EOS>; the decoder is fed <SOS>, a, t, c, a, y and is trained to output a, t, c, a, y, <EOS>.]




Figure 1: Training the NMT encoder-decoder architecture.




The encoder RNN compresses the input sequence into a fixed-length vector, represented by the final hidden state h_T. The decoder RNN conditions on this vector to produce the translation, character by character.




Input characters are passed through an embedding layer before they are fed into the encoder RNN; in our model, we learn a 29 x 10 embedding matrix, where each of the 29 characters in the vocabulary is assigned a 10-dimensional embedding. At each time step, the decoder RNN outputs a vector of unnormalized log probabilities given by a linear transformation of the decoder hidden state. When these probabilities are normalized, they define a distribution over the vocabulary, indicating the most probable characters for that time step. The model is trained via a cross-entropy loss between the decoder distribution and the ground-truth at each time step.

Note that for the English-to-Pig-Latin task, the input and output sequences share the same vocabulary; this is not always the case for other translation tasks (e.g., between languages that use different alphabets).
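To make the training setup concrete, the sketch below wires together an embedding layer, encoder and decoder GRU cells, an output layer, and the per-time-step cross-entropy loss, using teacher forcing. All module and variable names here are illustrative assumptions and do not necessarily match the starter code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sizes from the handout; the names are ours, not the starter code's.
vocab_size, emb_size, hidden_size = 29, 10, 10

embedding = nn.Embedding(vocab_size, emb_size)   # the 29 x 10 embedding matrix
encoder_rnn = nn.GRUCell(emb_size, hidden_size)
decoder_rnn = nn.GRUCell(emb_size, hidden_size)
output_layer = nn.Linear(hidden_size, vocab_size)

def train_step(src, tgt):
    """src, tgt: LongTensors of shape (batch_size, seq_len), with tgt
    starting at <SOS> and ending at <EOS>. Uses teacher forcing."""
    batch_size = src.size(0)
    h = torch.zeros(batch_size, hidden_size)
    # Encoder: compress the input into the final hidden state h_T.
    for t in range(src.size(1)):
        h = encoder_rnn(embedding(src[:, t]), h)
    # Decoder: predict each target character, conditioned on the ground-truth
    # character from the previous time step (teacher forcing).
    loss = 0.0
    for t in range(tgt.size(1) - 1):
        h = decoder_rnn(embedding(tgt[:, t]), h)
        logits = output_layer(h)              # unnormalized log-probabilities
        loss = loss + F.cross_entropy(logits, tgt[:, t + 1])
    return loss / (tgt.size(1) - 1)
```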




Conceptual Questions




How do you think this architecture will perform on long sequences, and why? Consider the amount of information the decoder gets to see about the input sequence.



In the code folder, you will find a pre-trained model of the above architecture, using a hidden state of size 10. This model was trained to convergence. The script translate_no_attn.py uses this pre-trained model to translate words given in the list words. Run this script by calling:



python translate_no_attn.py




How do the results look, qualitatively? Does the model do better for certain types of words than others? Add a few of your own words to the words list at the top of the script, and run it again. Which failure modes can you identify?




Part 2: Teacher-Forcing [1 mark]










[Figure omitted: at generation time, the encoder reads c, a, t, <EOS>; the decoder starts from <SOS> and feeds each character it predicts back in as the next input, producing a, t, c, a, y, <EOS>.]




Figure 2: Generating text with the NMT encoder-decoder architecture.




The decoder produces a distribution over the output vocabulary conditioned on the previous hidden state and the output token in the previous timestep. A common practice used to train NMT models is to feed in the ground-truth token from the previous time step to condition the decoder output in the current step, as shown in Figure 1. At test time, we don’t have access to the ground-truth output sequence, so the decoder must condition its output on the token it generated in the previous time step, as shown in Figure 2.
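The two regimes differ only in which token is embedded and fed to the decoder at each step. The sketch below shows the free-running (generation) loop, reusing the illustrative modules defined in the earlier sketch; the names and the greedy decoding choice are assumptions, not the starter code's interface.

```python
def generate(h, sos_index, max_len=30):
    """Free-running decoding from the final encoder hidden state h,
    of shape (batch_size, hidden_size). Greedy decoding is assumed."""
    token = torch.full((h.size(0),), sos_index, dtype=torch.long)
    outputs = []
    for _ in range(max_len):
        h = decoder_rnn(embedding(token), h)
        token = output_layer(h).argmax(dim=1)   # feed the prediction back in
        outputs.append(token)
    # A fuller implementation would stop each sequence once <EOS> is produced;
    # here we simply run for max_len steps.
    return torch.stack(outputs, dim=1)          # (batch_size, max_len)
```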




Questions




What problem may arise when training with teacher forcing? Consider the differences that arise when we switch from training to testing.



Can you think of any way to address this issue? Read the abstract and introduction of the paper "Scheduled sampling for sequence prediction with recurrent neural networks" [1], and answer this question in your own words.
















Teacher-Forcing Ratio (Optional)




In the starter code, teacher-forcing is used 50% of the time, and the model's own predictions are used 50% of the time when training (see [1]). If you want to observe the effects of using teacher-forcing more or less of the time, you can provide your own teacher-forcing ratio to train the model; for example, python attention_nmt.py --teacher_forcing_ratio=1 trains purely with teacher-forcing. This is optional, and not required for this assignment.
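A typical way to apply such a ratio during training is to sample whether to feed the ground-truth token or the model's own prediction. The fragment below continues the illustrative training sketch from Part 1; whether the starter code samples per sequence or per step is an assumption here.

```python
import random

# Drop-in replacement for the decoder loop in the earlier train_step sketch.
use_teacher_forcing = random.random() < teacher_forcing_ratio
decoder_input = tgt[:, 0]                        # the <SOS> token
loss = 0.0
for t in range(tgt.size(1) - 1):
    h = decoder_rnn(embedding(decoder_input), h)
    logits = output_layer(h)
    loss = loss + F.cross_entropy(logits, tgt[:, t + 1])
    # Next input: ground truth (teacher forcing) or the model's own prediction.
    decoder_input = tgt[:, t + 1] if use_teacher_forcing else logits.argmax(dim=1)
```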




Part 3: Gated Recurrent Unit (GRU) [2 marks]




Throughout the rest of the assignment, you will implement an attention-based neural machine translation model, and finally train the model and examine the results.




1. The forward pass of a Gated Recurrent Unit is defined by the following equations:

   r_t = σ(W_ir x_t + W_hr h_{t-1} + b_r)                    (1)

   z_t = σ(W_iz x_t + W_hz h_{t-1} + b_z)                    (2)

   g_t = tanh(W_in x_t + r_t ⊙ (W_hn h_{t-1} + b_g))         (3)

   h_t = (1 - z_t) ⊙ g_t + z_t ⊙ h_{t-1}                     (4)

   where σ is the sigmoid function and ⊙ denotes elementwise multiplication.



Although PyTorch has a GRU built in (nn.GRUCell), we’ll implement our own GRU cell from scratch, to better understand how it works. Fill in the __init__ and forward methods of the MyGRUCell class in models.py, to implement the above equations. A template has been provided for the forward method.
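As a rough guide, the sketch below implements equations (1)-(4) directly with torch.sigmoid, torch.tanh, and elementwise multiplication. The split into six nn.Linear layers and their names are our own choices and may not match the provided template, which you should follow instead.

```python
import torch
import torch.nn as nn

class MyGRUCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(MyGRUCell, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        # One input-to-hidden and one hidden-to-hidden map per gate; the bias
        # placement mirrors equations (1)-(3).
        self.w_ir = nn.Linear(input_size, hidden_size, bias=True)    # W_ir, b_r
        self.w_hr = nn.Linear(hidden_size, hidden_size, bias=False)  # W_hr
        self.w_iz = nn.Linear(input_size, hidden_size, bias=True)    # W_iz, b_z
        self.w_hz = nn.Linear(hidden_size, hidden_size, bias=False)  # W_hz
        self.w_in = nn.Linear(input_size, hidden_size, bias=False)   # W_in
        self.w_hn = nn.Linear(hidden_size, hidden_size, bias=True)   # W_hn, b_g

    def forward(self, x, h_prev):
        """x: (batch_size, input_size); h_prev: (batch_size, hidden_size)."""
        r = torch.sigmoid(self.w_ir(x) + self.w_hr(h_prev))          # eq. (1)
        z = torch.sigmoid(self.w_iz(x) + self.w_hz(h_prev))          # eq. (2)
        g = torch.tanh(self.w_in(x) + r * self.w_hn(h_prev))         # eq. (3)
        h_new = (1 - z) * g + z * h_prev                             # eq. (4)
        return h_new
```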




Part 4: Implementing Attention [4 marks]




Attention allows a model to look back over the input sequence, and focus on relevant input tokens when producing the corresponding output tokens. For our simple task, attention can help the model remember tokens from the input, e.g., focusing on the input letter c to produce the output letter c.




The hidden states produced by the encoder while reading the input sequence, h_1^enc, ..., h_T^enc, can be viewed as annotations of the input; each encoder hidden state h_i^enc captures information about the i-th input token, along with some contextual information. At each time step, an attention-based decoder computes a weighting over the annotations, where the weight given to each one indicates its relevance in determining the current output token.

In particular, at time step t, the decoder computes an attention weight α_i^(t) for each of the encoder hidden states h_i^enc. The weights are defined such that 0 ≤ α_i^(t) ≤ 1 and Σ_i α_i^(t) = 1. α_i^(t) is a function of an encoder hidden state and the previous decoder hidden state, f(h_{t-1}^dec, h_i^enc), where i ranges over the length of the input sequence. One possible function f is the dot product, which measures the similarity between the two hidden states.




For our model, we will learn the function f, parameterized as a two-layer fully-connected network with a ReLU activation. This network produces the unnormalized weights α̃_i^(t) as:

α̃_i^(t) = f(h_{t-1}^dec, h_i^enc) = W_2 max(0, W_1 [h_{t-1}^dec; h_i^enc] + b_1) + b_2

Here, the notation [h_{t-1}^dec; h_i^enc] denotes the concatenation of the vectors h_{t-1}^dec and h_i^enc. Because the attention weights must be normalized, we apply the softmax function over the outputs of the two-layer network: α_i^(t) = softmax(α̃^(t))_i.














Implement this two-layer attention mechanism. Fill in the __init__ and forward methods of the Attention class in models.py. Use the PyTorch nn.Sequential class to define the attention network, and use the self.softmax function in the forward pass of the Attention class to normalize the weights.




[Figure omitted: the decoder hidden states have shape batch_size x hidden_size, the encoder hidden states have shape batch_size x seq_len x hidden_size, and the attention weights have shape batch_size x seq_len x 1.]




Figure 3: Dimensions of the input and output tensors of the Attention module.




For the forward pass, you will need to do some reshaping of tensors. You are given a batch of decoder hidden states for time t-1, which has dimension batch_size x hidden_size, and a batch of encoder hidden states (annotations) for each timestep in the input sequence, which has dimension batch_size x seq_len x hidden_size. The goal is to compute the function f(h_{t-1}^dec, h_i^enc) for each decoder hidden state in the batch and all corresponding encoder hidden states h_i^enc, where i ranges over seq_len different values. You must do this in a vectorized fashion. Since f(h_{t-1}^dec, h_i^enc) is a scalar, the resulting tensor of attention weights should have dimension batch_size x seq_len x 1. The input and output dimensions of the Attention module are visualized in Figure 3.




Depending on your implementation, you will need one or more of these functions (click to jump to the PyTorch documentation):




squeeze, unsqueeze, expand_as, cat, view







The self.attention_network module takes as input a 2-dimensional tensor; you will need to view a 3D tensor as a 2D tensor to pass it through the attention network, and then view it as a 3D tensor again. We have provided a template for the forward method of the Attention class. You are free to use the template, or code it from scratch, as long as the output is correct.
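For concreteness, here is one possible way the Attention class could be filled in. The attribute names attention_network and softmax follow the handout's description; the particular reshaping strategy (unsqueeze/expand_as the decoder state, concatenate, flatten to 2D, and view back to 3D) is just one of several that satisfy the required output shape.

```python
import torch
import torch.nn as nn

class Attention(nn.Module):
    def __init__(self, hidden_size):
        super(Attention, self).__init__()
        self.hidden_size = hidden_size
        # Two-layer network f producing one unnormalized weight per position.
        self.attention_network = nn.Sequential(
            nn.Linear(2 * hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1),
        )
        self.softmax = nn.Softmax(dim=1)   # normalize over seq_len

    def forward(self, hidden, annotations):
        """hidden: (batch_size, hidden_size) decoder state at time t-1.
        annotations: (batch_size, seq_len, hidden_size) encoder states.
        Returns attention weights of shape (batch_size, seq_len, 1)."""
        batch_size, seq_len, hid_size = annotations.size()
        # Pair the decoder state with every encoder state: (batch, seq_len, 2*hidden).
        expanded = hidden.unsqueeze(1).expand_as(annotations)
        concat = torch.cat((expanded, annotations), dim=2)
        # The network expects 2D input, so flatten, apply it, and reshape back.
        reshaped = concat.view(batch_size * seq_len, 2 * hid_size)
        unnormalized = self.attention_network(reshaped).view(batch_size, seq_len, 1)
        return self.softmax(unnormalized)
```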




Once we have the attention weights, a context vector c_t is computed as a linear combination of the encoder hidden states, with coefficients given by the weights:

c_t = Σ_{i=1}^{T} α_i^(t) h_i^enc














This context vector is concatenated with the input vector and passed into the decoder GRU cell at each time step, as shown in Figure 4.






































Figure 4: Computing a context vector with attention.




Fill in the forward method of the AttentionDecoder class, to implement the interface shown in Figure 4 (a rough sketch is given after the list of steps below). You will need to:




Compute the attention weights using self.attention_network



Multiply these weights by the corresponding encoder hidden states and sum them to form the context vector.



Concatenate the context vector with the current decoder input.



Feed the concatenation to the decoder GRU cell to obtain the new hidden state.



Compute the output using self.out.
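Putting these steps together, a rough sketch of the whole decoder step is shown below, reusing the MyGRUCell and Attention sketches from above. The attribute names (self.embedding, self.rnn, self.attention, self.out) and the choice of a hidden-size embedding are assumptions about the starter-code interface, not a description of it.

```python
import torch
import torch.nn as nn

class AttentionDecoder(nn.Module):
    """Rough sketch only; follow the provided template's attribute names."""
    def __init__(self, vocab_size, hidden_size):
        super(AttentionDecoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.rnn = MyGRUCell(2 * hidden_size, hidden_size)  # input = [embed; context]
        self.attention = Attention(hidden_size)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, h_prev, annotations):
        """x: (batch,); h_prev: (batch, hidden); annotations: (batch, seq_len, hidden)."""
        embed = self.embedding(x)                            # (batch, hidden)
        # 1. Attention weights over the annotations: (batch, seq_len, 1).
        attention_weights = self.attention(h_prev, annotations)
        # 2. Context vector c_t = sum_i alpha_i^(t) h_i^enc: (batch, hidden).
        context = (attention_weights * annotations).sum(dim=1)
        # 3./4. Concatenate context with the current input and update the hidden state.
        h_new = self.rnn(torch.cat((embed, context), dim=1), h_prev)
        # 5. Unnormalized distribution over the vocabulary.
        return self.out(h_new), h_new, attention_weights
```

Returning the attention weights at each step is what makes it possible to save the attention maps described below.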



Train the model with attention by running the following command:



python attention_nmt.py




By default, the script runs for 100 epochs, which should be enough to get good results; this takes approximately 24 minutes on the teaching lab machines. If necessary, you can train for fewer epochs by running python attention_nmt.py --nepochs=50, or you can exit training early with Ctrl-C.




At the end of each epoch, the script prints training and validation losses, and the Pig-Latin translation of a fixed sentence, "the air conditioning is working", so that you can see how the model improves qualitatively over time. The fixed sentence is stored in the variable TEST_SENTENCE, at the top of attention_nmt.py. You can change this variable to see how translation improves for your own sentence!




The script also saves several items to the directory checkpoints/h10-bs16:




The best encoder and decoder model parameters, based on the validation loss.

A plot of the training and validation losses.




Attention maps generated during training for a fixed word, given by the variable TEST_WORD_ATTN in attention_nmt.py. These maps allow you to see how the attention improves over the course of training.














Part 5: Attention Visualizations [2 marks]




One of the benefits of using attention is that it allows us to gain insight into the inner workings of the model. By visualizing the attention weights generated for the input tokens in each decoder step, we can see where the model focuses while producing each output token. In this part of the assignment, you will visualize the attention learned by your model, and try to find interesting success and failure modes that illustrate its behaviour.




The script visualize_attention.py loads a pre-trained model and uses it to translate a given set of words: it prints the translations and saves heatmaps to show how attention is used at each step. To call this script, you need to pass in the path to a checkpoint folder, as follows:




python visualize_attention.py --load checkpoints/h10-bs16




The visualize_attention.py script produces visualizations for the strings in the words list, found at the top of the script. The visualizations are saved as PDF files in the same directory as the loaded model checkpoint, so they will be in checkpoints/h10-bs16. Add your own strings to the words list in visualize_attention.py and run the script as shown above. Since the model operates at the character level, the input doesn't even have to be a real word in the dictionary. You can be creative! After running the script, you should examine the generated attention maps. Try to find failure cases, and hypothesize about why they occur. Some interesting classes of words you may want to try are:



Words that begin with a single consonant (e.g., cake).




Words that begin with two or more consonants (e.g., drink).




Words that have unusual/rare letter combinations (e.g., aardvark).




Compound words consisting of two words separated by a dash (e.g., well-mannered). These are the hardest class of words present in the training data, because they are long, and because the rules of Pig-Latin dictate that each part of the word (e.g., well and mannered) must be translated separately, and stuck back together with a dash: ellway-anneredmay.




Made-up words or toy examples to show a particular behaviour.




Include attention maps for both success and failure cases in your writeup, along with your hypothesis about why the model succeeds or fails.




What you need to submit




One code file: models.py.




A PDF document titled a3-writeup.pdf containing your answers to the conceptual questions, and the attention visualizations, with explanations.




References




[1] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171-1179, 2015.
















[2] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104-3112, 2014.