You will need to use the Penn Treebank corpus for this assignment. Four data files are provided: train.txt, train.5k.txt, valid.txt, and input.txt. Use train.txt to train your models and valid.txt for testing. The file input.txt, whose lines have no given next word, can be used as a sanity check that the model produces coherent next-word predictions on unseen data.
N-gram (55 points)
(a) (10 pts) Preprocess the train and validation data, build the vocabulary, tokenize, etc.
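A minimal preprocessing sketch (assuming the PTB files are whitespace-tokenized with one sentence per line; the boundary markers <s>/</s> and the min_count cutoff are illustrative choices, not requirements of the handout):

```python
from collections import Counter

def load_tokens(path):
    # PTB text is already whitespace-tokenized; add sentence boundaries.
    with open(path) as f:
        return [["<s>"] + line.split() + ["</s>"] for line in f]

def build_vocab(sentences, min_count=1):
    counts = Counter(tok for sent in sentences for tok in sent)
    vocab = {"<unk>": 0}  # reserve an id for out-of-vocabulary words
    for tok, c in counts.items():
        if c >= min_count:
            vocab.setdefault(tok, len(vocab))
    return vocab

def to_ids(sentences, vocab):
    unk = vocab["<unk>"]
    return [[vocab.get(tok, unk) for tok in sent] for sent in sentences]

train_sents = load_tokens("train.txt")
valid_sents = load_tokens("valid.txt")
vocab = build_vocab(train_sents)
train_ids = to_ids(train_sents, vocab)
valid_ids = to_ids(valid_sents, vocab)  # unseen words map to <unk>
```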
(b) (10 pts) Implement an N-gram model (bigram or trigram) for language modeling.
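For instance, an unsmoothed bigram model is just two count tables and a maximum-likelihood ratio (the function names here are illustrative):

```python
from collections import Counter

def train_bigram(sentences):
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        unigrams.update(sent[:-1])  # contexts: tokens that have a successor
        bigrams.update(zip(sent, sent[1:]))
    return unigrams, bigrams

def bigram_prob(unigrams, bigrams, prev, word):
    # MLE estimate P(word | prev) = c(prev, word) / c(prev); zero if unseen.
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
```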
(c) (10 pts) Implement Good-Turing smoothing.
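Good-Turing replaces each raw count $c$ with $c^* = (c+1)\,N_{c+1}/N_c$, where $N_c$ is the number of n-gram types seen exactly $c$ times, and reserves probability mass $N_1/N$ for unseen events. A sketch of the count adjustment (a full implementation should also smooth the $N_c$ curve, since $N_{c+1}$ is often zero for large $c$):

```python
from collections import Counter

def good_turing_counts(bigrams):
    # N_c = number of distinct bigram types observed exactly c times.
    n = Counter(bigrams.values())
    adjusted = {}
    for bg, c in bigrams.items():
        # c* = (c + 1) * N_{c+1} / N_c; fall back to the raw count when
        # N_{c+1} = 0 (this is where real implementations smooth N_c).
        adjusted[bg] = (c + 1) * n[c + 1] / n[c] if n[c + 1] else float(c)
    return adjusted
```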
(d) (10 pts) Implement Kneser-Ney smoothing using:
$$
P_{\mathrm{KN}}(w_i \mid w_{i-1}) = \frac{\max\bigl(c(w_{i-1}, w_i) - d,\; 0\bigr)}{c(w_{i-1})} + \lambda(w_{i-1})\, P_{\mathrm{CONTINUATION}}(w_i)
$$
where
$$
\lambda(w_{i-1}) = \frac{d}{c(w_{i-1})}\,\bigl|\{\, w : c(w_{i-1}, w) > 0 \,\}\bigr|
$$
$$
P_{\mathrm{CONTINUATION}}(w) = \frac{\bigl|\{\, w_{i-1} : c(w_{i-1}, w) > 0 \,\}\bigr|}{\sum_{w'} \bigl|\{\, w'_{i-1} : c(w'_{i-1}, w') > 0 \,\}\bigr|}
$$
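A direct bigram translation of these formulas might look as follows (d is the fixed discount, commonly 0.75; all identifiers are illustrative):

```python
from collections import Counter, defaultdict

def train_kneser_ney(sentences, d=0.75):
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        unigrams.update(sent[:-1])
        bigrams.update(zip(sent, sent[1:]))
    followers = defaultdict(set)  # prev -> {w : c(prev, w) > 0}
    contexts = defaultdict(set)   # word -> {prev : c(prev, word) > 0}
    for prev, word in bigrams:
        followers[prev].add(word)
        contexts[word].add(prev)
    total_types = len(bigrams)    # sum over w' of |{prev : c(prev, w') > 0}|

    def p_continuation(word):
        return len(contexts[word]) / total_types

    def p_kn(prev, word):
        c_prev = unigrams[prev]
        if c_prev == 0:           # unseen context: back off entirely
            return p_continuation(word)
        discounted = max(bigrams[(prev, word)] - d, 0) / c_prev
        lam = (d / c_prev) * len(followers[prev])
        return discounted + lam * p_continuation(word)

    return p_kn
```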
(e) (5 pts) Predict the next word in the validation set using a sliding window. Report the perplexity scores of the N-gram, Good-Turing, and Kneser-Ney models on the test set.
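Perplexity can then be computed from the per-word probabilities of whichever model is being evaluated, e.g.:

```python
import math

def perplexity(sentences, prob_fn):
    # prob_fn(prev, word) returns P(word | prev) under the model.
    log_sum, n = 0.0, 0
    for sent in sentences:
        for prev, word in zip(sent, sent[1:]):
            p = prob_fn(prev, word)
            # The floor is a crude guard; a properly smoothed model
            # should never assign zero probability.
            log_sum += math.log(max(p, 1e-12))
            n += 1
    return math.exp(-log_sum / n)
```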
(f) (10 pts) There are 3124 examples in input.txt. Choose the first 30 lines and print the next-word predictions of your N-gram model.
RNN (45 points)
(a) (5 pts) Initialize parameters for the model.
(b) (10 pts) Implement the forward pass for the model. Use an embedding layer as the first layer of your network (e.g. tf.nn.embedding_lookup). Use a recurrent neural network cell (GRU or LSTM) as the next layer. Given a sequence of words, predict the next word.
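A TF 1.x graph sketch of items (a) and (b), matching the APIs named in the handout (the layer sizes are illustrative; vocab_size should come from your vocabulary):

```python
import tensorflow as tf  # TF 1.x, for the tf.contrib API referenced below

vocab_size, embed_size, hidden_size, window_size = 10000, 128, 256, 20

inputs = tf.placeholder(tf.int32, [None, window_size])   # current words
targets = tf.placeholder(tf.int32, [None, window_size])  # next words

# (a) Trainable parameters: the embedding matrix (the GRU and the output
# projection create their own variables when first called).
embeddings = tf.get_variable("embeddings", [vocab_size, embed_size])

# (b) Forward pass: embed, run a GRU over the window, project to logits.
embedded = tf.nn.embedding_lookup(embeddings, inputs)    # [batch, window, embed]
cell = tf.nn.rnn_cell.GRUCell(hidden_size)
outputs, _ = tf.nn.dynamic_rnn(cell, embedded, dtype=tf.float32)
logits = tf.layers.dense(outputs, vocab_size)            # [batch, window, vocab]
```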
(c) (5 pts) Calculate the loss of the model (sequence cross-entropy loss is suggested), e.g. tf.contrib.seq2seq.sequence_loss; see the sketch after item (d).
(d) (5 pts) Set up the training step: use a learning rate of 1e-3 and the Adam optimizer. Set the window size to 20 and the batch size to about 50.
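Continuing the graph sketch from item (b), the loss and training op for items (c) and (d) could be (weights of all ones count every position equally):

```python
# (c) Sequence cross-entropy loss, averaged over batch and time steps.
weights = tf.ones_like(targets, dtype=tf.float32)
loss = tf.contrib.seq2seq.sequence_loss(logits, targets, weights)

# (d) Adam with the prescribed learning rate of 1e-3.
train_op = tf.train.AdamOptimizer(learning_rate=1e-3).minimize(loss)
```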
(e) (10 pts) Train your RNN model.
Calculate the model's perplexity on the test set, and prove that
$$
\text{perplexity} = \exp\left(\frac{\text{total loss}}{\text{number of predictions}}\right).
$$
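As a hint (this is one possible line of argument, not part of the handout): if the total loss is the summed negative log-likelihood over all $N$ predicted words, then
$$
\mathrm{PP} = \Bigl(\prod_{i=1}^{N} p(w_i \mid \text{context}_i)\Bigr)^{-1/N}
= \exp\Bigl(-\frac{1}{N}\sum_{i=1}^{N} \log p(w_i \mid \text{context}_i)\Bigr)
= \exp\Bigl(\frac{\text{total loss}}{N}\Bigr).
$$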
(f) (10 pts) Print the next-word predictions for the same 30 lines of input.txt as in the N-gram part.
Submission Instructions
You shall submit a zip file named Assignment3_LastName_FirstName.zip which contains:
Python files (.ipynb or .py) including all the code, plots, and results. You need to provide detailed comments in English.