
CS 224n: Assignment #4

This assignment is split into two sections: Neural Machine Translation with RNNs and Analyzing NMT Systems. The first is primarily coding and implementation focused, whereas the second consists entirely of written analysis questions. If you get stuck on the first section, you can always work on the second, as the two sections are independent of each other. Note that the NMT system is more complicated than the neural networks we have previously constructed within this class and takes about 4 hours to train on a GPU. Thus, we strongly recommend you get started early with this assignment. Finally, the notation and implementation of the NMT system is a bit tricky, so if you ever get stuck along the way, please come to Office Hours so that the TAs can support you.

1. Neural Machine Translation with RNNs (45 points)

In Machine Translation, our goal is to convert a sentence from the source language (e.g. Spanish) to the target language (e.g. English). In this assignment, we will implement a sequence-to-sequence (Seq2Seq) network with attention, to build a Neural Machine Translation (NMT) system. In this section, we describe the training procedure for the proposed NMT system, which uses a Bidirectional LSTM Encoder and a Unidirectional LSTM Decoder.

Figure 1: Seq2Seq model with multiplicative attention, shown on the third step of the decoder. The hidden states $h_i^{enc}$ and cell states $c_i^{enc}$ are defined on the next page.


Model description (training procedure)

Given a sentence in the source language, we look up the word embeddings from an embeddings matrix, yielding $x_1, \ldots, x_m$ ($x_i \in \mathbb{R}^{e \times 1}$), where $m$ is the length of the source sentence and $e$ is the embedding size. We feed these embeddings to the bidirectional encoder, yielding hidden states and cell states for both the forwards ($\rightarrow$) and backwards ($\leftarrow$) LSTMs. The forwards and backwards versions are concatenated to give hidden states $h_i^{enc}$ and cell states $c_i^{enc}$:

$h_i^{enc} = [\overleftarrow{h}_i^{enc}; \overrightarrow{h}_i^{enc}]$ where $h_i^{enc} \in \mathbb{R}^{2h \times 1}$, $\overleftarrow{h}_i^{enc}, \overrightarrow{h}_i^{enc} \in \mathbb{R}^{h \times 1}$, $1 \le i \le m$ (1)

$c_i^{enc} = [\overleftarrow{c}_i^{enc}; \overrightarrow{c}_i^{enc}]$ where $c_i^{enc} \in \mathbb{R}^{2h \times 1}$, $\overleftarrow{c}_i^{enc}, \overrightarrow{c}_i^{enc} \in \mathbb{R}^{h \times 1}$, $1 \le i \le m$ (2)

We then initialize the decoder's first hidden state $h_0^{dec}$ and cell state $c_0^{dec}$ with a linear projection of the encoder's final hidden state and final cell state.¹

$h_0^{dec} = W_h[\overleftarrow{h}_1^{enc}; \overrightarrow{h}_m^{enc}]$ where $h_0^{dec} \in \mathbb{R}^{h \times 1}$, $W_h \in \mathbb{R}^{h \times 2h}$ (3)

$c_0^{dec} = W_c[\overleftarrow{c}_1^{enc}; \overrightarrow{c}_m^{enc}]$ where $c_0^{dec} \in \mathbb{R}^{h \times 1}$, $W_c \in \mathbb{R}^{h \times 2h}$ (4)
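The encoder equations above correspond closely to a few lines of PyTorch. The sketch below uses invented sizes and PyTorch's `nn.LSTM`, which returns both the per-step states and the final states we need; this is an illustration of the idea, not the assignment's actual code.

```python
import torch
import torch.nn as nn

# Toy sizes, made up for illustration: e = embedding size, h = hidden size,
# m = source length, b = batch size.
e, h, m, b = 4, 5, 7, 3

enc_lstm = nn.LSTM(e, h, bidirectional=True)  # bidirectional encoder
W_h = nn.Linear(2 * h, h, bias=False)         # W_h from Eq. (3)
W_c = nn.Linear(2 * h, h, bias=False)         # W_c from Eq. (4)

x = torch.randn(m, b, e)                      # source embeddings x_1, ..., x_m
h_enc, (h_n, c_n) = enc_lstm(x)               # h_enc: (m, b, 2h) -- Eqs. (1)-(2)

# h_n has shape (2, b, h): h_n[0] is the forward LSTM's final state (step m),
# h_n[1] is the backward LSTM's final state (step 1). Concatenate and project.
h_dec0 = W_h(torch.cat([h_n[1], h_n[0]], dim=1))  # Eq. (3): (b, h)
c_dec0 = W_c(torch.cat([c_n[1], c_n[0]], dim=1))  # Eq. (4): (b, h)
```

Note that `nn.LSTM` already interleaves the forward and backward passes, so the per-step concatenation of Equations (1)-(2) comes for free in `h_enc`.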

With the decoder initialized, we must now feed it a target sentence. On the $t$-th step, we look up the embedding for the $t$-th word, $y_t \in \mathbb{R}^{e \times 1}$. We then concatenate $y_t$ with the combined-output vector $o_{t-1} \in \mathbb{R}^{h \times 1}$ from the previous timestep (we will explain what this is later down this page!) to produce $\overline{y}_t \in \mathbb{R}^{(e+h) \times 1}$. Note that for the first target word (i.e. the start token) $o_0$ is a zero-vector. We then feed $\overline{y}_t$ as input to the decoder.

$h_t^{dec}, c_t^{dec} = \text{Decoder}(\overline{y}_t, h_{t-1}^{dec}, c_{t-1}^{dec})$ where $h_t^{dec} \in \mathbb{R}^{h \times 1}$, $c_t^{dec} \in \mathbb{R}^{h \times 1}$ (5)

We then use $h_t^{dec}$ to compute multiplicative attention over $h_1^{enc}, \ldots, h_m^{enc}$:

$e_{t,i} = (h_t^{dec})^{\top} W_{\text{attProj}} h_i^{enc}$ where $e_t \in \mathbb{R}^{m \times 1}$, $W_{\text{attProj}} \in \mathbb{R}^{h \times 2h}$, $1 \le i \le m$ (6)

$\alpha_t = \text{softmax}(e_t)$ where $\alpha_t \in \mathbb{R}^{m \times 1}$ (7)

$a_t = \sum_{i=1}^{m} \alpha_{t,i} h_i^{enc}$ where $a_t \in \mathbb{R}^{2h \times 1}$ (8)

We now concatenate the attention output $a_t$ with the decoder hidden state $h_t^{dec}$ and pass this through a linear layer, tanh, and dropout to attain the combined-output vector $o_t$.

$u_t = [a_t; h_t^{dec}]$ where $u_t \in \mathbb{R}^{3h \times 1}$ (10)

$v_t = W_u u_t$ where $v_t \in \mathbb{R}^{h \times 1}$, $W_u \in \mathbb{R}^{h \times 3h}$ (11)

$o_t = \text{dropout}(\tanh(v_t))$ where $o_t \in \mathbb{R}^{h \times 1}$ (12)
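The decoder step, attention, and combined-output computations can be sketched in batched PyTorch as follows. Sizes are invented and the batch dimension comes first, which is an assumption about conventions here, not the assignment's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

e, h, m, b = 4, 5, 7, 3                      # toy sizes, made up for illustration

dec_cell = nn.LSTMCell(e + h, h)             # decoder consumes ybar_t = [y_t; o_{t-1}]
W_attProj = nn.Linear(2 * h, h, bias=False)  # W_attProj from Eq. (6)
W_u = nn.Linear(3 * h, h, bias=False)        # W_u from Eq. (11)
dropout = nn.Dropout(0.3)

h_enc = torch.randn(b, m, 2 * h)             # encoder hidden states h_1, ..., h_m
y_t = torch.randn(b, e)                      # embedding of the t-th target word
o_prev = torch.zeros(b, h)                   # o_0 is a zero vector
h_prev, c_prev = torch.zeros(b, h), torch.zeros(b, h)

ybar_t = torch.cat([y_t, o_prev], dim=1)                        # (b, e + h)
h_t, c_t = dec_cell(ybar_t, (h_prev, c_prev))                   # Eq. (5)
e_t = torch.bmm(W_attProj(h_enc), h_t.unsqueeze(2)).squeeze(2)  # Eq. (6): (b, m)
alpha_t = F.softmax(e_t, dim=1)                                 # Eq. (7)
a_t = torch.bmm(alpha_t.unsqueeze(1), h_enc).squeeze(1)         # Eq. (8): (b, 2h)
u_t = torch.cat([a_t, h_t], dim=1)                              # Eq. (10): (b, 3h)
o_t = dropout(torch.tanh(W_u(u_t)))                             # Eqs. (11)-(12): (b, h)
```

The two `bmm` calls batch the per-position dot products of Equation (6) and the weighted sum of Equation (8) into single matrix operations.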

Then, we produce a probability distribution $P_t$ over target words at the $t$-th timestep:

$P_t = \text{softmax}(W_{\text{vocab}} o_t)$ where $P_t \in \mathbb{R}^{V_t \times 1}$, $W_{\text{vocab}} \in \mathbb{R}^{V_t \times h}$ (13)

Here, $V_t$ is the size of the target vocabulary. Finally, to train the network we compute the softmax cross-entropy loss between $P_t$ and $g_t$, where $g_t$ is the one-hot vector of the target word at timestep $t$:

$J_t(\theta) = \text{CrossEntropy}(P_t, g_t)$ (14)

¹If it's not obvious, think about why we regard $[\overleftarrow{h}_1^{enc}; \overrightarrow{h}_m^{enc}]$ as the 'final hidden state' of the encoder.

Here, $\theta$ represents all the parameters of the model and $J_t(\theta)$ is the loss on step $t$ of the decoder. Now that we have described the model, let's try implementing it for Spanish-to-English translation!
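In code, the softmax and the cross-entropy loss are usually fused into a single numerically stable call. A minimal sketch with invented sizes (not the assignment's actual code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

h, Vt, b = 5, 11, 3                      # toy sizes, made up for illustration
W_vocab = nn.Linear(h, Vt, bias=False)

o_t = torch.randn(b, h)                  # combined-output vectors
target = torch.tensor([2, 7, 0])         # indices of the gold target words g_t

logits = W_vocab(o_t)                    # unnormalized scores, shape (b, Vt)
# F.cross_entropy applies log-softmax internally and picks out the gold
# word's log-probability, so it computes the softmax cross-entropy loss
# in one numerically stable call.
J_t = F.cross_entropy(logits, target)
```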

Setting up your Virtual Machine

Follow the instructions in the CS224n Azure Guide (link also provided on website and Piazza) in order to create your VM instance. This should take you approximately 45 minutes. Though you will need the GPU to train your model, we strongly advise that you first develop the code locally and ensure that it runs, before attempting to train it on your VM. GPU time is expensive and limited. It takes approximately 4 hours to train the NMT system. We don't want you to accidentally use all your GPU time for debugging your model rather than training and evaluating it. Finally, make sure that your VM is turned off whenever you are not using it.

If your Azure subscription runs out of money, your VM will be temporarily locked and inaccessible. If that happens, make a private post on Piazza with your Name, email used for Azure and SUNetID to request more credits.

In order to run the model code on your local machine, please run the following command to create the proper virtual environment:

conda env create --file local_env.yml

Note that this virtual environment will not be needed on the VM.

Implementation and written questions

(a) (2 points) (coding) In order to apply tensor operations, we must ensure that the sentences in a given batch are of the same length. Thus, we must identify the longest sentence in a batch and pad the others to the same length. Implement the pad_sents function in utils.py, which shall produce these padded sentences.
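The core idea can be sketched in a few lines; this is one plausible implementation, not the official solution, and the exact signature in utils.py may differ slightly.

```python
def pad_sents(sents, pad_token):
    """Pad a batch of sentences (lists of words) to the length of the longest one."""
    max_len = max(len(s) for s in sents)
    return [s + [pad_token] * (max_len - len(s)) for s in sents]

padded = pad_sents([['a', 'b', 'c'], ['d']], '<pad>')
# padded == [['a', 'b', 'c'], ['d', '<pad>', '<pad>']]
```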

(b) (3 points) (coding) Implement the __init__ function in model_embeddings.py to initialize the necessary source and target embeddings.
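For illustration, the core of such an initialization might look like this; the vocabulary sizes and pad-token indices below are invented, whereas in the assignment they come from the vocab object passed to the constructor.

```python
import torch.nn as nn

# Hypothetical vocabulary sizes and pad indices, made up for this sketch.
src_vocab_size, tgt_vocab_size, embed_size = 50, 40, 8
src_pad_idx, tgt_pad_idx = 0, 0

# One embedding table per language; padding_idx keeps the pad row at zero.
source = nn.Embedding(src_vocab_size, embed_size, padding_idx=src_pad_idx)
target = nn.Embedding(tgt_vocab_size, embed_size, padding_idx=tgt_pad_idx)
```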

(c) (4 points) (coding) Implement the __init__ function in nmt_model.py to initialize the necessary model embeddings (using the ModelEmbeddings class from model_embeddings.py) and layers (LSTM, projection, and dropout) for the NMT system.

(d) (8 points) (coding) Implement the encode function in nmt_model.py. This function converts the padded source sentences into the tensor $X$, generates $h_1^{enc}, \ldots, h_m^{enc}$, and computes the initial state $h_0^{dec}$ and initial cell $c_0^{dec}$ for the decoder. You can run a non-comprehensive sanity check by executing:

python sanity_check.py 1d


(e) (8 points) (coding) Implement the decode function in nmt_model.py. This function constructs $\overline{y}$ and runs the step function over every timestep for the input. You can run a non-comprehensive sanity check by executing:

python sanity_check.py 1e

(f) (10 points) (coding) Implement the step function in nmt_model.py. This function applies the decoder's LSTM cell for a single timestep, computing the encoding of the target word $h_t^{dec}$, the attention scores $e_t$, the attention distribution $\alpha_t$, the attention output $a_t$, and finally the combined output $o_t$. You can run a non-comprehensive sanity check by executing:

python sanity_check.py 1f

(g) (3 points) (written) The generate_sent_masks() function in nmt_model.py produces a tensor called enc_masks. It has shape (batch size, max source sentence length) and contains 1s in positions corresponding to 'pad' tokens in the input, and 0s for non-pad tokens. Look at how the masks are used during the attention computation in the step() function (lines 295-296).

First, explain (in around three sentences) what effect the masks have on the entire attention computation. Then explain (in one or two sentences) why it is necessary to use the masks in this way.
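To make the mechanics concrete before you answer, here is a toy illustration of the usual masking pattern; the tensors and sentence lengths below are invented, and the assignment's actual step() code may differ in details.

```python
import torch

# Toy attention scores e_t for a batch of 2 sentences, max source length 4.
# The second sentence has true length 2, so positions 2 and 3 are pads.
e_t = torch.randn(2, 4)
enc_masks = torch.tensor([[0, 0, 0, 0],
                          [0, 0, 1, 1]], dtype=torch.bool)

# The common pattern: overwrite scores at pad positions with -inf before
# the softmax, so those positions receive zero attention probability.
e_t = e_t.masked_fill(enc_masks, -float('inf'))
alpha_t = torch.softmax(e_t, dim=1)
```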

Now it's time to get things running! Execute the following to generate the necessary vocab file:

sh run.sh vocab

As noted earlier, we recommend that you develop the code on your personal computer. Confirm that you are running in the proper conda environment and then execute the following command to train the model on your local machine:

sh run.sh train_local

Once you have ensured that your code does not crash (i.e., let it run until iter 10 or iter 20), power on your VM from the Azure Web Portal. Then read the Managing Code Deployment to a VM section of our Practical Guide to VMs (link also given on website and Piazza) for instructions on how to upload your code to the VM.

Next, install necessary packages to your VM by running:

pip install -r gpu_requirements.txt

Finally, turn to the Managing Processes on a VM section of the Practical Guide and follow the instructions to create a new tmux session. Concretely, run the following command to create a tmux session called nmt.

tmux new -s nmt

Once your VM is configured and you are in a tmux session, execute:

sh run.sh train

Once you know your code is running properly, you can detach from the session and close your ssh connection to the server. To detach from the session, run:

tmux detach

You can return to your training model by ssh-ing back into the server and attaching to the tmux session by running:

tmux a -t nmt


(i) (4 points) Once your model is done training (this should take about 4 hours on the VM), execute the following command to test the model:

sh run.sh test

Please report the model’s corpus BLEU Score. It should be larger than 21.

(j) (3 points) (written) In class, we learned about dot product attention, multiplicative attention, and additive attention. Please explain one advantage and one disadvantage of dot product attention compared to multiplicative attention. Then explain one advantage and one disadvantage of additive attention compared to multiplicative attention. As a reminder, dot product attention is $e_{t,i} = s_t^{\top} h_i$, multiplicative attention is $e_{t,i} = s_t^{\top} W h_i$, and additive attention is $e_{t,i} = v^{\top} \tanh(W_1 h_i + W_2 s_t)$.
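For reference, the three scoring functions can be written out directly as code. This is a toy sketch with a made-up hidden size, just to pin down the shapes each variant involves.

```python
import torch

h = 5                                   # toy hidden size, made up for illustration
s_t = torch.randn(h)                    # decoder state
h_i = torch.randn(h)                    # one encoder state
W = torch.randn(h, h)                   # multiplicative attention weights
W1, W2 = torch.randn(h, h), torch.randn(h, h)  # additive attention weights
v = torch.randn(h)                      # additive attention projection vector

dot = s_t @ h_i                                # e_{t,i} = s_t^T h_i
mult = s_t @ (W @ h_i)                         # e_{t,i} = s_t^T W h_i
add = v @ torch.tanh(W1 @ h_i + W2 @ s_t)      # e_{t,i} = v^T tanh(W1 h_i + W2 s_t)
```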

2. Analyzing NMT Systems (30 points)

(a) (12 points) Here we present a series of errors we found in the outputs² of our NMT model (which is the same as the one you just trained). For each example of a Spanish source sentence, reference (i.e., 'gold') English translation, and NMT (i.e., 'model') English translation, please:

1. Identify the error in the NMT translation.

2. Provide possible reason(s) why the model may have made the error (either due to a specific linguistic construct or a specific model limitation).

3. Describe one possible way we might alter the NMT system to fix the observed error. There may be more than one possible fix for an error. For example, it could be tweaking the size of the hidden layers or changing the attention mechanism.

Below are the translations that you should analyze as described above. Note that out-of-vocabulary words are underlined. Rest assured that you don’t need to know Spanish to answer these questions. You just need to know English! The Spanish words in these questions are similar enough to English that you can mostly see the alignments. If you are uncertain about some words, please feel free to use resources like Google Translate to look them up.

i. (2 points) Source Sentence: Aquí otro de mis favoritos, "La noche estrellada".

Reference Translation: So another one of my favorites, "The Starry Night".

NMT Translation: Here's another favorite of my favorites, "The Starry Night".

ii. (2 points) Source Sentence: Ustedes saben que lo que yo hago es escribir para los niños, y, de hecho, probablemente soy el autor para niños, más leído en los EEUU.

Reference Translation: You know, what I do is write for children, and I’m probably America’s most widely read children’s author, in fact.

NMT Translation: You know what I do is write for children, and in fact, I’m probably the author for children, more reading in the U.S.

iii. (2 points) Source Sentence: Un amigo me hizo eso – Richard Bolingbroke.

Reference Translation: A friend of mine did that – Richard Bolingbroke.

NMT Translation: A friend of mine did that – Richard <unk>

iv. (2 points) Source Sentence: Solo tienes que dar vuelta a la manzana para verlo como una epifanía.

Reference Translation: You’ve just got to go around the block to see it as an epiphany.

NMT Translation: You just have to go back to the apple to see it as an epiphany.

v. (2 points) Source Sentence: Ella salvó mi vida al permitirme entrar al baño de la sala de profesores.

Reference Translation: She saved my life by letting me go to the bathroom in the teachers’ lounge.

NMT Translation: She saved my life by letting me go to the bathroom in the women’s room.

²The data is from TED talks.


vi. (2 points) Source Sentence: Eso es más de 100,000 hectáreas.

Reference Translation: That’s more than 250 thousand acres.

NMT Translation: That’s over 100,000 acres.

(b) (4 points) Now it is time to explore the outputs of the model that you have trained! The test-set translations your model produced in question 1-i should be located in outputs/test_outputs.txt. Please identify 2 examples of errors that your model produced.³ The two examples you find should be different error types from one another and different error types than the examples provided in the previous question. For each example you should:

1. Write the source sentence in Spanish. The source sentences are in en_es_data/test.es.

2. Write the reference English translation. The reference translations are in en_es_data/test.en.

3. Write your NMT model's English translation. The model-translated sentences are in outputs/test_outputs.txt.

4. Identify the error in the NMT translation.

5. Provide a reason why the model may have made the error (either due to a specific linguistic construct or specific model limitations).

6. Describe one possible way we might alter the NMT system to fix the observed error.

(c) (14 points) BLEU score is the most commonly used automatic evaluation metric for NMT systems. It is usually calculated across the entire test set, but here we will consider BLEU defined for a single example.⁴ Suppose we have a source sentence $s$, a set of $k$ reference translations $r_1, \ldots, r_k$, and a candidate translation $c$. To compute the BLEU score of $c$, we first compute the modified $n$-gram precision $p_n$ of $c$, for each of $n = 1, 2, 3, 4$, where $n$ is the $n$ in n-gram:

$$p_n = \frac{\displaystyle\sum_{\text{ngram} \in c} \min\Big(\max_{i=1,\ldots,k} \text{Count}_{r_i}(\text{ngram}),\; \text{Count}_{c}(\text{ngram})\Big)}{\displaystyle\sum_{\text{ngram} \in c} \text{Count}_{c}(\text{ngram})} \qquad (15)$$

Here, for each of the n-grams that appear in the candidate translation $c$, we count the maximum number of times it appears in any one reference translation, capped by the number of times it appears in $c$ (this is the numerator). We divide this by the number of n-grams in $c$ (the denominator).

Next, we compute the brevity penalty BP. Let len(c) be the length of c and let len(r) be the length of the reference translation that is closest to len(c) (in the case of two equally-close reference translation lengths, choose len(r) as the shorter one).

$$BP = \begin{cases} 1 & \text{if } \operatorname{len}(c) \ge \operatorname{len}(r) \\ \exp\Big(1 - \frac{\operatorname{len}(r)}{\operatorname{len}(c)}\Big) & \text{otherwise} \end{cases} \qquad (16)$$

Lastly, the BLEU score for candidate $c$ with respect to $r_1, \ldots, r_k$ is:

$$\text{BLEU} = BP \times \exp\Big(\sum_{n=1}^{4} \lambda_n \log p_n\Big) \qquad (17)$$

where $\lambda_1, \lambda_2, \lambda_3, \lambda_4$ are weights that sum to 1. The log here is the natural log.
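The definitions in Equations (15)-(17) can be sketched in plain Python. This is a rough sketch, not the nltk implementation; it assumes every $p_n$ with nonzero weight is itself nonzero, since otherwise the log is undefined.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(refs, cand, n):
    # Eq. (15): clip each candidate n-gram count by its maximum count in any
    # single reference, then divide by the total n-gram count of the candidate.
    cand_counts = Counter(ngrams(cand, n))
    clipped = sum(min(max(Counter(ngrams(r, n))[g] for r in refs), c)
                  for g, c in cand_counts.items())
    return clipped / sum(cand_counts.values())


def sentence_bleu(refs, cand, weights):
    # len(r): the reference length closest to len(c), ties going to the shorter.
    len_r = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    bp = 1.0 if len(cand) >= len_r else math.exp(1 - len_r / len(cand))  # Eq. (16)
    log_sum = sum(w * math.log(modified_precision(refs, cand, n + 1))
                  for n, w in enumerate(weights) if w > 0)
    return bp * math.exp(log_sum)                                        # Eq. (17)
```

For instance, `sentence_bleu([r1, r2], c, [0.5, 0.5, 0, 0])` would score a candidate `c` (a list of tokens) against two references using only unigram and bigram precision, which mirrors the setting of question (i) below.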

³An 'error' is not just an NMT translation that doesn't match the reference translation. There must be something wrong with the NMT translation, in your opinion.

⁴This definition of sentence-level BLEU score matches the sentence_bleu() function in the nltk Python package (http://www.nltk.org/api/nltk.translate.html#nltk.translate.bleu_score.sentence_bleu). Note that the NLTK function is sensitive to capitalization. In this question, all text is lowercased, so capitalization is irrelevant.


i. (5 points) Please consider this example:

Source Sentence s: el amor todo lo puede

Reference Translation r1: love can always find a way

Reference Translation r2: love makes anything possible

NMT Translation c1: the love can always do

NMT Translation c2: love can make anything possible

Please compute the BLEU scores for c1 and c2. Let $\lambda_i = 0.5$ for $i \in \{1, 2\}$ and $\lambda_i = 0$ for $i \in \{3, 4\}$ (this means we ignore 3-grams and 4-grams, i.e., don't compute $p_3$ or $p_4$). When computing BLEU scores, show your working (i.e., show your computed values for $p_1$, $p_2$, len(c), len(r) and $BP$). Note that the BLEU scores can be expressed between 0 and 1 or between 0 and 100. The code uses the 0 to 100 scale, while in this question we are using the 0 to 1 scale.

Which of the two NMT translations is considered the better translation according to the BLEU Score? Do you agree that it is the better translation?

ii. (5 points) Our hard drive was corrupted and we lost Reference Translation r2. Please recompute BLEU scores for c1 and c2, this time with respect to r1 only. Which of the two NMT translations now receives the higher BLEU score? Do you agree that it is the better translation?

iii. (2 points) Due to data availability, NMT systems are often evaluated with respect to only a single reference translation. Please explain (in a few sentences) why this may be problematic.

iv. (2 points) List two advantages and two disadvantages of BLEU, compared to human evaluation, as an evaluation metric for Machine Translation.

Submission Instructions

You shall submit this assignment on GradeScope as two submissions – one for "Assignment 4 [coding]" and another for "Assignment 4 [written]":

1. Run the collect_submission.sh script on Azure to produce your assignment4.zip file. You can use scp to transfer files between Azure and your local computer.

2. Upload your assignment4.zip file to GradeScope to "Assignment 4 [coding]".

3. Upload your written solutions to GradeScope to "Assignment 4 [written]". When you submit your assignment, make sure to tag all the pages for each problem according to Gradescope's submission directions. Points will be deducted if the submission is not correctly tagged.
