Starting from:
$35

$29

CS 224n Assignment #3: Dependency Parsing

In this assignment, you will build a neural dependency parser using PyTorch. In Part 1, you will learn about two general neural network techniques (Adam Optimization and Dropout) that you will use to build the dependency parser in Part 2. In Part 2, you will implement and train the dependency parser, before analyzing a few erroneous dependency parses.

1. Machine Learning & Neural Networks (8 points)

(a) (4 points) Adam Optimizer

Recall the standard Stochastic Gradient Descent update rule:

r Jminibatch( )

where is a vector containing all of the model parameters, J is the loss function, r Jminibatch( ) is the gradient of the loss function with respect to the parameters on a minibatch of data, and is the learning rate. Adam Optimization1 uses a more sophisticated update rule with two additional steps.2

i. (2 points) First, Adam uses a trick called momentum by keeping track of m, a rolling average of the gradients:

• 1m + (11)r Jminibatch( )

m

where 1 is a hyperparameter between 0 and 1 (often set to 0.9). Brie y explain (you don’t need to prove mathematically, just give an intuition) how using m stops the updates from varying as much and why this low variance may be helpful to learning, overall.

ii. (2 points) Adam extends the idea of momentum with the trick of adaptive learning rates by keeping track of v, a rolling average of the magnitudes of the gradients:

• 1m + (11)r Jminibatch( )

v 2v + (1 2)(r Jminibatch( ) r Jminibatch( ))

p

m= v

where and = denote elementwise multiplication and division (so z z is elementwise squaring) andp2 is a hyperparameter between 0 and 1 (often set to 0.99). Since Adam divides the update by v, which of the model parameters will get larger updates? Why might this help with learning?

(b) (4 points) Dropout3 is a regularization technique. During training, dropout randomly sets units in the hidden layer h to zero with probability pdrop (dropping di erent units each minibatch), and then multiplies h by a constant . We can write this as

hdrop = d h

where d 2 f0; 1gDh (Dh is the size of h) is a mask vector where each entry is 0 with probability pdrop and 1 with probability (1 pdrop). is chosen such that the expected value of hdrop is h:

Epdrop [hdrop]i = hi

for all i 2 f1; : : : ; Dhg.

i. (2 points) What must equal in terms of pdrop? Brie y justify your answer.

ii. (2 points) Why should we apply dropout during training but not during evaluation?

• Kingma and Ba, 2015, https://arxiv.org/pdf/1412.6980.pdf

• The actual Adam update uses a few additional tricks that are less important, but we won’t worry about them here. If you want to learn more about it, you can take a look at: http://cs231n.github.io/neural-networks-3/#sgd

3Srivastava et al., 2014, https://www.cs.toronto.edu/˜hinton/absps/JMLRdropout.pdf

1

CS 224n Assignment 3 Page 2 of 7

2. Neural Transition-Based Dependency Parsing (44 points)

In this section, you’ll be implementing a neural-network based dependency parser, with the goal of max-imizing performance on the UAS (Unlabeled Attachment Score) metric.

Before you begin please install PyTorch 1.4.0 from https://pytorch.org/get-started/locally/ with the CUDA option set to None. Additionally run pip install tqdm to install the tqdm package { which produces progress bar visualizations throughout your training process.

A dependency parser analyzes the grammatical structure of a sentence, establishing relationships between head words, and words which modify those heads. Your implementation will be a transition-based parser, which incrementally builds up a parse one step at a time. At every step it maintains a partial parse, which is represented as follows:

• A stack of words that are currently being processed.

• A bu er of words yet to be processed.

• A list of dependencies predicted by the parser.

Initially, the stack only contains ROOT, the dependencies list is empty, and the bu er contains all words of the sentence in order. At each step, the parser applies a transition to the partial parse until its bu er is empty and the stack size is 1. The following transitions can be applied:

• SHIFT: removes the rst word from the bu er and pushes it onto the stack.

• LEFT-ARC: marks the second (second most recently added) item on the stack as a dependent of

the rst item and removes the second item from the stack, adding a rst word ! second word dependency to the dependency list.

• RIGHT-ARC: marks the rst (most recently added) item on the stack as a dependent of the second

item and removes the rst item from the stack, adding a second word ! rst word dependency to the dependency list.

On each step, your parser will decide among the three transitions using a neural network classi er.

(a) (4 points) Go through the sequence of transitions needed for parsing the sentence \I parsed this sentence correctly". The dependency tree for the sentence is shown below. At each step, give the con guration of the stack and bu er, as well as what transition was applied this step and what new dependency was added (if any). The rst three steps are provided below as an example.

ROOT

I

parsed

this

sentence

correctly

Stack

Bu er

New dependency

Transition

[ROOT]

[I, parsed, this, sentence, correctly]

Initial Con guration

[ROOT, I]

[parsed, this, sentence, correctly]

SHIFT

[ROOT, I, parsed]

[this, sentence, correctly]

parsed!I

SHIFT

[ROOT, parsed]

[this, sentence, correctly]

LEFT-ARC

(b) (2 points) A sentence containing n words will be parsed in how many steps (in terms of n)? Brie y explain why.

(c) (6 points) Implement the init and parse step functions in the PartialParse class in parser transitions.py. This implements the transition mechanics your parser will use. You can run basic (non-exhaustive) tests by running python parser transitions.py part c.

CS 224n Assignment 3 Page 3 of 7

(d) (8 points) Our network will predict which transition should be applied next to a partial parse. We could use it to parse a single sentence by applying predicted transitions until the parse is complete. However, neural networks run much more e ciently when making predictions about batches of data at a time (i.e., predicting the next transition for any di erent partial parses simultaneously). We can parse sentences in minibatches with the following algorithm.

Algorithm 1 Minibatch Dependency Parsing

Input: sentences, a list of sentences to be parsed and model, our model that makes parse decisions

Initialize partial parses as a list of PartialParses, one for each sentence in sentences Initialize unfinished parses as a shallow copy of partial parses while unfinished parses is not empty do

Take the rst batch size parses in unfinished parses as a minibatch

Use the model to predict the next transition for each partial parse in the minibatch

Perform a parse step on each partial parse in the minibatch with its predicted transition

Remove the completed (empty bu er and stack of size 1) parses from unfinished parses

end while

Return: The dependencies for each (now completed) parse in partial parses.

Implement this algorithm in the minibatch parse function in parser transitions.py. You can run basic (non-exhaustive) tests by running python parser transitions.py part d. Note: You will need minibatch parse to be correctly implemented to evaluate the model you will build in part (e). However, you do not need it to train the model, so you should be able to complete most of part (e) even if minibatch parse is not implemented yet.

(e) (12 points) We are now going to train a neural network to predict, given the state of the stack, bu er, and dependencies, which transition should be applied next.

First, the model extracts a feature vector representing the current state. We will be using the feature

set presented in the original neural dependency parsing paper: A Fast and Accurate Dependency Parser using Neural Networks.4 The function extracting these features has been implemented for you in utils/parser utils.py. This feature vector consists of a list of tokens (e.g., the last

word in the stack, rst word in the bu er, dependent of the second-to-last word in the stack if there is one, etc.). They can be represented as a list of integers w = [w1; w2; : : : ; wm] where m is the number of features and each 0 wi < jV j is the index of a token in the vocabulary (jV j is the vocabulary size). Then our network looks up an embedding for each word and concatenates them into a single input vector:

• = [Ew1 ; :::; Ewm ] 2 Rdm

where E 2 RjV j d is an embedding matrix with each row Ew as the vector for a particular word w.

We then compute our prediction as:

• = ReLU(xW + b1)

l = hU + b2

y^ = softmax(l)

• Chen and Manning, 2014, http://cs.stanford.edu/people/danqi/papers/emnlp2014.pdf

CS 224n Assignment 3 Page 4 of 7

where h is referred to as the hidden layer, l is referred to as the logits, y^ is referred to as the predictions, and ReLU(z) = max(z; 0)). We will train the model to minimize cross-entropy loss:

3

J( ) = CE(y; y^) =

Xi

yi log y^i

=1

To compute the loss for the training set, we average this J( ) across all training examples.

We will use UAS score as our evaluation metric. UAS refers to Unlabeled Attachment Score, which is computed as the ratio between number of correctly predicted dependencies and the number of total dependencies despite of the relations (our model doesn’t predict this).

In parser model.py you will nd skeleton code to implement this simple neural network using PyTorch. Complete the init , embedding lookup and forward functions to implement the model. Then complete the train for epoch and train functions within the run.py le.

Finally execute python run.py to train your model and compute predictions on test data from Penn Treebank (annotated with Universal Dependencies).

Note:

• For this assignment, you are asked to implement Linear layer and Embedding layer. Please DO NOT use torch.nn.Linear or torch.nn.Embedding module in your code, otherwise you will receive deductions for this problem.

• Please follow the naming requirements in our TODO if there are any, e.g. if there are explicit requirements about variable names you have to follow them in order to receive full credits. You are free to declare other variable names if not explicitly required.

• Once you have implemented embedding lookup (e) or forward (f) you can call python parser model.py with ag -e or -f or both to run sanity checks with each function. These sanity checks are fairly basic and passing them doesn’t mean your code is bug free.

• When debugging, you can add a debug ag: python run.py -d. This will cause the code to run over a small subset of the data, so that training the model won’t take as long. Make sure to remove the -d ag to run the full model once you are done debugging.

• When running with debug mode, you should be able to get a loss smaller than 0.2 and a UAS larger than 65 on the dev set (although in rare cases your results may be lower, there is some randomness when training).

• It should take about 1 hour to train the model on the entire the training dataset, i.e., when debug mode is disabled.

• When debug mode is disabled, you should be able to get a loss smaller than 0.08 on the train set and an Unlabeled Attachment Score larger than 87 on the dev set. For comparison, the model in the original neural dependency parsing paper gets 92.5 UAS. If you want, you can tweak the hyperparameters for your model (hidden layer size, hyperparameters for Adam, number of epochs, etc.) to improve the performance (but you are not required to do so).

◦ Working implementation of the neural dependency parser in parser model.py. (We’ll look at and run this code for grading).

◦ Report the best UAS your model achieves on the dev set and the UAS it achieves on the test set.

(f) (12 points) We’d like to look at example dependency parses and understand where parsers like ours might be wrong. For example, in this sentence:

CS 224n Assignment 3 Page 5 of 7

root

punct

nmod

nsubj dobj case

Moscow sent troops into Afghanistan .

PROPN VERB NOUN ADP PROPN PUNCT

the dependency of the phrase into Afghanistan is wrong, because the phrase should modify sent (as in sent into Afghanistan) not troops (because troops into Afghanistan doesn’t make sense). Here is the correct parse:

root

punct

nmod

nsubj dobj case

Moscow sent troops into Afghanistan .

PROPN VERB NOUN ADP PROPN PUNCT

More generally, here are four types of parsing error:

• Prepositional Phrase Attachment Error: In the example above, the phrase into Afghanistan is a prepositional phrase. A Prepositional Phrase Attachment Error is when a prepositional phrase is attached to the wrong head word (in this example, troops is the wrong head word and sent is the correct head word). More examples of prepositional phrases include with a rock, before midnight and under the carpet.

• Verb Phrase Attachment Error: In the sentence Leaving the store unattended, I went outside to watch the parade, the phrase leaving the store unattended is a verb phrase. A Verb Phrase Attachment Error is when a verb phrase is attached to the wrong head word (in this example, the correct head word is went).

• Modi er Attachment Error: In the sentence I am extremely short, the adverb extremely is a modi er of the adjective short. A Modi er Attachment Error is when a modi er is attached to the wrong head word (in this example, the correct head word is short).

• Coordination Attachment Error: In the sentence Would you like brown rice or garlic naan?, the phrases brown rice and garlic naan are both conjuncts and the word or is the coordinating conjunction. The second conjunct (here garlic naan) should be attached to the rst conjunct (here brown rice). A Coordination Attachment Error is when the second conjunct is attached to the wrong head word (in this example, the correct head word is rice). Other coordinating conjunctions include and, but and so.

In this question are four sentences with dependency parses obtained from a parser. Each sentence has one error, and there is one example of each of the four types above. For each sentence, state the type of error, the incorrect dependency, and the correct dependency. To demonstrate: for the example above, you would write:

• Error type: Prepositional Phrase Attachment Error

• Incorrect dependency: troops ! Afghanistan

• Correct dependency: sent ! Afghanistan

Note: There are lots of details and conventions for dependency annotation. If you want to

learn more about them, you can look at the UD website: http://universaldependencies.

org5 or the short introductory slides at: http://people.cs.georgetown.edu/nschneid/ p/UD-for-English.pdf. However, you do not need to know all these details in order to do this question. In each of these cases, we are asking about the attachment of phrases and it should

• But note that in the assignment we are actually using UDv1, see: http://universaldependencies.org/docsv1/

CS 224n Assignment 3 Page 6 of 7

be su cient to see if they are modifying the correct head. In particular, you do not need to look at the labels on the the dependency edges { it su ces to just look at the edges themselves.

i.

root

punct

conj

nmod

cc

case

dobj

nsubj

aux

det

acl

amod

I

disembarked

and was

heading to

a wedding fearing

my death

.

PRON

VERB

CCONJ AUX VERB ADP DET NOUN VERB PRON NOUN PUNCT

ii.

root

punct

nmod

conj

nmod

case

ccomp

xcomp

nmod:poss

nsubj

nsubj mark advmod cc dobj case amod

It makes me want to rush out and rescue people from dilemmas of their own making .

PRP VERB PRON VERB PART VERB ADV CCONJ VERB NOUN ADP NOUN ADP PRON ADJ NOUN PUNCT

iii.

root

punct

nsubj

nmod

nmod

cop

case

appos

case

det

acl

xcomp

at

case

punct

It

is

on

loan from a

guy

named Joe

O’Neill

in Midland

,

Texas

.

PRON AUX ADP NOUN ADP DET NOUN VERB PROPN PROPN ADP PROPN PUNCT PROPN PUNCT

iv.

root

punct

nmod

case

nsubj

det

nmod

nmod

aux

advmod

case

case

aux

amod

det

compound

Brian has been one

of

the most crucial elements

to the success of

Mozilla software

.

PROPN AUX AUX NUM ADP DET ADV ADJ NOUN ADP DET NOUN ADP PROPN NOUN PUNCT

Submission Instructions

You shall submit this assignment on GradeScope as two submissions { one for \Assignment 3 [coding]" and another for ‘Assignment 3 [written]":

1. Run the collect submission.sh script to produce your assignment3.zip le.

CS 224n Assignment 3 Page 7 of 7

2. Upload your assignment3.zip le to GradeScope to \Assignment 3 [coding]".

3. Upload your written solutions to GradeScope to \Assignment 3 [written]".

More products