Getting started
All files that are necessary to do the assignment are contained in a tarball which you can get from:
http://courses.engr.illinois.edu/cs447/HW/cs447_HW1.tar.gz
You need to unpack this tarball (tar -zxvf HW1.tar.gz) to get a directory that contains the code and data you will need for this homework.
Part 1: Language models and smoothing (6 points)
1.1 Goal
The first part of the assignment requires you to train some simple n-gram language models on a corpus of movie reviews and to test them on two smaller corpora: a collection of positive reviews, and one of negative reviews.
1.2 Data
The corpora are stored as text files, where each line is one sentence from a review:
train.txt: contains the training corpus of movie reviews (30k sentences)
pos_test.txt: contains the test corpus of positive reviews (1k sentences)
neg_test.txt: contains the test corpus of negative reviews (1k sentences)
Tokens (including punctuation marks, which you should treat like regular tokens) are separated by whitespace.
1.3 Provided Code
To get you started, we have provided the module hw1_lm.py. This file contains code for reading in the corpora (as in HW0) and maintaining a basic probability distribution (UnigramDist).
We have also defined a parent LanguageModel class, along with subclass definitions for the five language models we'd like you to implement. Your main task (described in section 1.4) is to implement the high-level methods of these models, specifically: generating sentences, calculating the probability of sentences, and computing the perplexity of a corpus.
Internally, we recommend that you treat each sentence as a list of tokens; this is the same representation returned by our input method.
1.4 What you need to implement
1.4.1 Preprocessing
To make your models more robust, it is necessary to perform some basic preprocessing on the corpora.
Sentence markers For all corpora, each sentence must be surrounded by start-of-sentence and end-of-sentence markers (<s> ... </s>). These markers will allow your models to generate sentences with realistic beginnings and endings, if you train your models properly.
Handling unknown words In order to deal with unknown words in the test corpora, all words that appear only once in the training corpus must be replaced with a special token for unknown words (e.g. 'UNK') before estimating your models. When unknown words are encountered in the test corpora, they should be treated as that special token instead.
These preprocessing steps have been provided for you in the assignment code.
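For illustration only (the assignment code already performs both steps), here is a minimal sketch of the unknown-word replacement, assuming each corpus is a list of token lists; the name replace_rare and the token string 'UNK' are placeholders rather than the identifiers used in hw1_lm.py.

from collections import Counter

UNK = "UNK"  # placeholder unknown-word token

def replace_rare(corpus):
    """Replace every token that occurs only once in the corpus with UNK.
    corpus: a list of sentences, each a list of token strings."""
    counts = Counter(tok for sent in corpus for tok in sent)
    return [[tok if counts[tok] > 1 else UNK for tok in sent] for sent in corpus]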
1.4.2 The LanguageModel classes
In order to compare the effects of using different-order n-grams and smoothing, we require you to implement the following five models:
1. UnigramModel: an unsmoothed unigram model, with probability distribution $\hat{P}(w)$
2. SmoothedUnigramModel: a unigram model smoothed using Laplace (add-one) smoothing, with probability distribution $P_L(w)$
3. BigramModel: an unsmoothed bigram model, with probability distribution $\hat{P}(w' \mid w)$
4. SmoothedBigramModelAD: a bigram model smoothed using absolute discounting, with probability distribution $P_{AD}(w' \mid w)$
5. SmoothedBigramModelKN: a bigram model smoothed using Kneser-Ney, with probability distribution $P_{KN}(w' \mid w)$
Laplace Smoothing For the SmoothedUnigramModel we want you to use Laplace smoothing, also known as add-one smoothing, on the unigram model $\hat{P}(w)$. Remember that in Laplace smoothing we increase the count of every event by one and renormalize. This takes probability mass away from seen events and reserves it for unseen events (see Lecture 4).
In order to smooth your unigram model, you will need the number of words in the corpus, $N$, and the number of word types, $S$. The distinction between these is meaningful: $N$ counts word instances (tokens), whereas $S$ refers to the size of your vocabulary. The sentence "the cat saw the dog" has four word types (the, cat, saw, dog), but five word tokens (the, cat, saw, the, dog). The token "the" appears twice in the sentence, but both occurrences share the same type, the.
If $c(w)$ is the frequency of $w$ in the training data, you can compute $P_L(w)$ as follows:
$$P_L(w) = \frac{c(w) + 1}{N + S}$$
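As a concrete illustration, here is a minimal sketch of this estimate; it assumes the training counts live in a plain Counter rather than in the provided UnigramDist class, and the helper name laplace_prob is hypothetical.

from collections import Counter

def laplace_prob(word, counts):
    """Laplace-smoothed unigram probability P_L(word).
    counts: Counter of token frequencies in the (preprocessed) training data."""
    N = sum(counts.values())  # number of word tokens
    S = len(counts)           # number of word types (vocabulary size)
    return (counts[word] + 1) / (N + S)

counts = Counter("the cat saw the dog".split())
print(laplace_prob("the", counts))  # (2 + 1) / (5 + 4) = 1/3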
Absolute Discounting Smoothing In order to produce the SmoothedBigramModelAD, we want you to use absolute discounting on the bigram model $\hat{P}(w' \mid w)$.
You first need to compute a discounting factor $D$. If $n_1$ is the number of bigrams $w_1 w_2$ that appear exactly once, and $n_2$ is the number of bigrams $w_1 w_2$ that appear exactly twice, compute $D$ as
$$D = \frac{n_1}{n_1 + 2 n_2}$$
For each distribution $P(w' \mid w)$, you then also need to compute the number of seen bigram types $ww'$, $S_w$ (this captures how many different $w'$ occur after $w$ in the training data):
$$S_w = |\{w' : c(ww') > 0\}|$$
If $c(w)$ is the frequency of $w$ in the training data, and $c(ww')$ the frequency of $ww'$ in the training data, compute $P_{AD}(w' \mid w)$ as follows:
$$P_{AD}(w' \mid w) = \frac{\max(c(ww') - D,\, 0)}{c(w)} + \frac{D}{c(w)}\, S_w\, P_L(w')$$
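The sketch below shows one way the pieces fit together; it assumes bigram counts are kept in a Counter keyed by (w, w') tuples and unigram counts in a Counter of tokens, which may differ from how you organize your BigramModel. In practice you would precompute D and every S_w once at training time rather than on each lookup.

from collections import Counter

def discount(bigram_counts):
    """D = n1 / (n1 + 2*n2), from the numbers of bigrams seen once and twice."""
    n1 = sum(1 for c in bigram_counts.values() if c == 1)
    n2 = sum(1 for c in bigram_counts.values() if c == 2)
    return n1 / (n1 + 2 * n2)

def p_ad(w_prime, w, bigram_counts, unigram_counts, D):
    """Absolute-discounting estimate P_AD(w' | w)."""
    c_w = unigram_counts[w]
    c_ww = bigram_counts.get((w, w_prime), 0)
    S_w = sum(1 for (u, _v) in bigram_counts if u == w)  # distinct continuations seen after w
    # Laplace-smoothed unigram P_L(w'), as defined above
    N = sum(unigram_counts.values())
    S = len(unigram_counts)
    p_l = (unigram_counts[w_prime] + 1) / (N + S)
    return max(c_ww - D, 0) / c_w + (D / c_w) * S_w * p_l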
Kneser-Ney Smoothing For the SmoothedBigramModelKN, Kneser-Ney augments absolute discounting with a different way of handling the lower-order unigram distribution. Instead of directly using the Laplace-smoothed unigram in absolute discounting, you replace the MLE unigram distribution with the continuation probability $P_C$, which estimates how likely the unigram is to appear in a new context, normalized by the total number of unique bigrams. You can then compute $P_C$ as follows (here $i$ is a free variable):
$$P_C(w') = \frac{|\{w : c(ww') > 0\}|}{|\{w_{i-1} w_i : c(w_{i-1} w_i) > 0\}|}$$
Then compute $P_{KN}(w' \mid w)$ as:
$$P_{KN}(w' \mid w) = \frac{\max(c(ww') - D,\, 0)}{c(w)} + \frac{D}{c(w)}\, S_w\, P_C(w')$$
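Under the same assumed count layout (a Counter keyed by (w, w') tuples), here is a sketch of the continuation probability; P_KN then plugs p_continuation in exactly where P_AD used the Laplace-smoothed unigram.

def p_continuation(w_prime, bigram_counts):
    """Continuation probability P_C(w') computed from bigram type counts."""
    # |{w : c(w w') > 0}| -- number of distinct left contexts of w'
    num_contexts = sum(1 for (_u, v) in bigram_counts if v == w_prime)
    # total number of distinct bigram types in the training data
    num_bigram_types = len(bigram_counts)
    return num_contexts / num_bigram_types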
1.4.3 Generating sentences
For each of the five language models, you need to implement the following methods:
generateSentence(self): returns a sentence sent that is generated by the language model. sent is a list of the form [<s>, w1, ..., wn, </s>], where w1 to wn are words in your vocabulary (including UNK, but excluding <s> and </s>). You can assume that <s> starts each sentence (with probability 1). The following words (w1, ..., wn, </s>) are generated according to your language model's distribution. The number of words, n, is not fixed; instead, you stop generating a sentence as soon as you generate the end-of-sentence symbol </s>.
getSentenceProbability(self, sen): returns the probability of the sentence sen (which is again a list of the form [<s>, w1, ..., wn, </s>]) according to the model. A minimal sketch of both methods is given below.
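Here is a minimal sketch of both methods for a bigram-style model; model.draw(prev) and model.prob(word, prev) are hypothetical helpers standing in for whatever your LanguageModel subclass exposes (the unigram models would simply ignore the previous token).

import math

def generate_sentence(model):
    """Sample tokens until the end-of-sentence marker is produced."""
    sent = ["<s>"]
    while sent[-1] != "</s>":
        sent.append(model.draw(sent[-1]))  # hypothetical: sample the next token given the previous one
    return sent

def sentence_probability(model, sen):
    """Probability of sen = ["<s>", w1, ..., "</s>"], accumulated in log space."""
    logp = sum(math.log(model.prob(w, prev)) for prev, w in zip(sen, sen[1:]))
    return math.exp(logp)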
Please use the provided generateSentencesToFile method and your unigram and bigram language models to generate 20 sentences from each model (saved as unigram_output.txt, smooth_unigram_output.txt, bigram_output.txt, smooth_bigram_ad_output.txt, and smooth_bigram_kn_output.txt).
Implementation hint In order to avoid underflow, your implementation may have to use log-probabilities internally. That is, once you have computed each (smoothed) probability from your relative frequency estimates (counts), you need to convert it to its logarithm (use math.log(p) for natural logs or math.log(p, 2) for base-2 logs, but make sure you always use the same base).
1.4.4 Computing the perplexity of the test corpora
You need to compute the perplexity (normalized inverse log probability) of the two test corpora according to all five of your models (unsmoothed unigram, smoothed unigram, unsmoothed bigram, smoothed bigram AD, and smoothed bigram KN). Implement the following method to evaluate a whole corpus:
getCorpusPerplexity(self, corpus): given a corpus corpus, calculate and return its perplexity
For a corpus $W$ with $N$ words, Jurafsky and Martin tell you to compute perplexity as follows:
$$\textit{Perplexity}(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}$$
Since this is prone to underflow issues, you may be better off computing it using logarithms:
$$\textit{Perplexity}(W) = \exp\!\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_{i-1})\right)$$
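As a hedged sketch of the log-space computation (model.prob(w, prev) is again a hypothetical per-token probability, and exactly which tokens you count towards N, e.g. the sentence markers, should follow the lecture conventions):

import math

def corpus_perplexity(model, corpus):
    """corpus: a list of preprocessed sentences, each ["<s>", w1, ..., "</s>"]."""
    log_sum, N = 0.0, 0
    for sent in corpus:
        for prev, w in zip(sent, sent[1:]):
            log_sum += math.log(model.prob(w, prev))
            N += 1
    return math.exp(-log_sum / N)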
Evaluate the models on the test corpora. Do you see a difference between the two test domains?
1.5 What you will be graded on
You will be graded based on the files you submit and the answers you provide to the questions on Compass. Each of your models (excluding the unsmoothed unigram model) is worth 1 point (4 points total), and the Compass questions are worth 0.5 points (2 points total). Keep in mind that we must be able to run your submitted code or you will receive 0 points.
To help you identify errors, we have included a test script hw1_lm_test.py. This script will use the classes you implement in hw1_lm.py and test their methods against a simple corpus, test.txt, which we have also included.
1.6 What to submit
See the end of this document for full submission guidelines, but the files you must submit for this portion are:
hw1_lm.py: your completed Python module for language models (see all of section 1.4)
Five text files, containing sentences generated by your language models (see section 1.4.3). For each sentence, include the probability your model assigns to that sentence.
unigram_output.txt: a text file containing the 20 sentences generated by your unsmoothed unigram language model
smooth_unigram_output.txt: a text file containing the 20 sentences generated by your smoothed unigram language model
bigram_output.txt: a text file containing the 20 sentences generated by your unsmoothed bigram language model
smooth_bigram_ad_output.txt: a text file containing the 20 sentences generated by your smoothed bigram language model using Absolute Discounting Smoothing
smooth_bigram_kn_output.txt: a text file containing the 20 sentences generated by your smoothed bigram language model using Kneser-Ney Smoothing
Additionally, you must answer the following discussion questions on Compass when you submit your assignment. These questions can be found by navigating to Course Content →Homeworks →Homework 1 →Homework 1 Language Model Discussion Questions
When generating sentences with the unigram model, what controls the length of the generated sentences? How does this differ from the sentences produced by the bigram models?
Consider the probability of the generated sentences according to your models. Do your models assign drastically di erent probabilities to the di erent sets of sentences? Why do you think that is?
Generate additional sentences using your bigram and smoothed bigram models. In your opinion, which model produces better / more realistic sentences?
For each of the four models, which test corpus has a higher perplexity? Why? Make sure to include the perplexity values in the answer.
Part 2: Finite-State Transducers (4 points)
2.1 Goal
Your task is to implement a finite-state transducer (or FST; see Lecture 2) which transduces the infinitive form of verbs to their correct -ing form. You will be graded according to how many verbs your transducer handles correctly.
2.2 Data
The input to your program is a text file with a single (infinitive) verb per line. For each verb, your program should compute the correct translation (e.g. walk == walking) and print it as a line in the output file.
360verbs.txt: sample input file containing a list of 360 infinitive verbs, one verb per line.
360verbsCorrect.txt: sample output file containing the correct translation (infinitive to -ing form) for each of the 360 verbs in order, one translation per line.
2.3 Provided Code
To help you with your task, we've provided fst.py, a module for constructing and evaluating FSTs. You shouldn't modify any of the code in this file. Below, we describe the most useful methods provided by this module (a toy usage sketch follows the list):
FST(self, initialStateName): instantiate an FST with an initial (non-accepting) state named initialStateName
addState(self, name, isFinal): add a state named name to the FST; by default, isFinal=false and so the state is not an accepting state.
addTransition(self, inStateName, inString, outString, outStateName): adds a transition between state inStateName and state outStateName, where both of these states already exist in the FST. The FST can traverse this transition after reading inString [1], and outputs outString when it does so.
addSetTransition(self, inStateName, inStringSet, outStateName): adds a transition between state inStateName and state outStateName for each character in inStringSet. For each transition, the FST outputs the same character it reads.
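To make the API concrete, here is a toy sketch (not a solution) that simply copies a lowercase stem and appends -ing, ignoring all the spelling rules in section 2.4. It assumes fst.py can be imported as shown; the state names and the ALPHABET constant are placeholders, and your real starting point is the buildFST() already provided in hw1_fst.py.

from fst import FST  # assumed import; requires the provided fst.py module

ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # placeholder character set

def build_toy_fst():
    f = FST("q0")                              # initial, non-accepting state
    f.addState("q_ing", True)                  # accepting state
    f.addSetTransition("q0", ALPHABET, "q0")   # copy each stem letter to the output
    f.addTransition("q0", "", "ing", "q_ing")  # empty-input transition that emits "ing"
    return f

Because the empty-input transition leads to a state with no outgoing edges, a path that takes it before the last character cannot consume the rest of the input, so each accepted stem has a single analysis.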
You will need to implement your solution by modifying the file hw1_fst.py. This file uses the function buildFST() to define a rudimentary (and buggy) FST for translating verbs. Your task is to modify this function (using the methods in fst.py) to produce a better FST. You are free to define your own character sets (we've started out with some useful ones at the beginning of the file) and any helper methods you require. The main method evaluates the sample FST on an input file. When implementing your FST, you can save your results to another file (e.g. 360verbsGuess.txt) and run diff against 360verbsCorrect.txt to see how well you're doing:
python hw1_fst.py 360verbs.txt 360verbsGuess.txt
diff -U 0 360verbsCorrect.txt 360verbsGuess.txt | grep ^@ | wc -l
This will print the number of incorrect translations.
[1] inString can be at most one character long.
2.4 What you need to implement
Your task is to fix the implementation of buildFST() by replacing the sample FST with a better FST for translating infinitive verbs to their correct -ing form. Your FST can be non-deterministic (that is, a state may have multiple transitions for the same input character), but each accepted string should only have one analysis [2]. It will accept a string if it reaches a final state after having read its last character. If it accepts a string, it will output its translation. If it does not accept a string, it will output FAIL. All verb stems are at least three letters long. In many cases, the -ing form just consists of stem + ing:
walk == walking fly == flying
But there are a number of exceptions. In particular, in order to correctly translate the 360 verbs we've provided, your FST must implement the following rules:
Dropping the final -e: Drop the final -e if and only if it is preceded by a consonant or by the letter u:
ride == riding make == making see == seeing
argue == arguing seize == seizing
Double the final consonant: Double a final single -n, -p, -t, -r if and only if it is preceded by a single vowel [3]:
stop == stopping occur == occurring halt == halting
admit == admitting stoop == stooping
set == setting leap == leaping
bar == barring
Exceptions to this rule are verbs ending in -er, -en [4]:
gather == gathering happen == happening
Change final -ie to -y: die == dying
Notes: When you define a transition, the input strings can only be zero ("") or one character long, and your FST can't have more than 30 states (you shouldn't need that many).
2.5 What to submit
For this portion, you only need to submit one file:
hw1_fst.py: your completed Python module for translating verb forms using FSTs (see section 2.4). We will use our own copy of the fst module and input files to test your program during grading.
2.6 What you will be graded on
You will be graded according to the number of verbs in 360verbs.txt that your FST correctly translates (100% is 4 points; the provided FST starts at 78%). You need to make sure that the function definition of buildFST() remains the same, so that we can check your results and verify that your FST adheres to the specification.
[2] If your FST accepts a string along multiple paths, all of the outputs will be printed and the translation will be marked incorrect, e.g. appeal -- appealingappealling
[3] This is a simplified version of the actual spelling rule. Our data set does not contain verbs like pardon, where this rule does not apply. It also does not contain verbs that end in other letters, like stab or slam, where this rule also applies.
[4] We are assuming American English spelling for labeling, etc.
Submission guidelines
You should submit your solution as a compressed tarball on Compass. To do this, save your files in a directory called abc123_hw1 (where abc123 is your NetID) and create the archive from its parent directory (tar -zcvf abc123_hw1.tar.gz abc123_hw1). Do not include the corpora themselves, but do include the following files:
hw1_lm.py: your completed Python module for language models (see all of section 1.4)
Five text files, containing sentences generated by your language models (see section 1.4.3). For each sentence, include the probability your model assigns to that sentence.
unigram_output.txt: a text file containing the 20 sentences generated by your unsmoothed unigram language model
smooth_unigram_output.txt: a text file containing the 20 sentences generated by your smoothed unigram language model
bigram_output.txt: a text file containing the 20 sentences generated by your unsmoothed bigram language model
smooth_bigram_ad_output.txt: a text file containing the 20 sentences generated by your smoothed bigram language model using Absolute Discounting Smoothing
smooth_bigram_kn_output.txt: a text file containing the 20 sentences generated by your smoothed bigram language model using Kneser-Ney Smoothing
hw1_fst.py: your completed Python module for translating verb forms using FSTs (see section 2.4)
Additionally, you must answer the discussion questions on Compass when you submit your assignment. These questions can be found by navigating to Course Content →Homeworks →Homework 1 →Homework 1 Language Model Discussion Questions