
Homework 4: Natural Language Processing

Getting started




All files that are necessary to do the assignment are contained in a tar ball which you can get from:




http://courses.engr.illinois.edu/cs447/HW/cs447_HW4.tar.gz




You need to unpack this tar ball (tar -xzvf cs447_HW4.tar.gz) to get a directory that contains the code and data you will need for this homework.







1 IBM word alignment (10 points)




1.1 Goal




Your tasks for this assignment are to implement and train the IBM Model 1 for word alignment (see Lecture 22) on a parallel corpus of movie subtitles and to find the best alignment for a set of test sentence pairs.




It may be helpful to start with the Appendix for an extended review of IBM Model 1, and to familiarize yourself with the notation used in this handout.




1.2 Data




The homework release contains two parallel corpora, eng-spa.txt (English-Spanish) and eng-ger.txt (English-German). In both cases, the target language for translation is English. We also provide a smaller corpus for the testing script, eng-spa_small.txt.







1.3 Provided code




We have provided functionality in hw4_translate.py to get you started on your solution. An instance of IBMModel1 is initialized with a text file containing a parallel training corpus. The text file is split up into pairs of unaligned sentences (English followed by non-English). The training corpus is stored in the following data structures (using the provided initialize method):







fCorpus: a list of foreign (e.g. Spanish) sentences.




tCorpus: a list of target (e.g. English) sentences (the sentence at position i in tCorpus is a translation of the sentence at position i in fCorpus).




trans: a map of frequency counts; trans[ex][fy] is initialized with a count of how often target word e_x and source word f_y appear together.




You may define and use any other data structures you need.
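For concreteness, here is a minimal sketch of how these three structures could look for a tiny made-up corpus (the toy sentences and the explicit NULL token are illustrative assumptions; the provided initialize method builds the real structures from the corpus file):

from collections import defaultdict

# Toy parallel corpus: tCorpus[i] is the translation of fCorpus[i].
fCorpus = [["la", "casa", "roja"], ["hola"]]                      # foreign (e.g. Spanish)
tCorpus = [["NULL", "the", "red", "house"], ["NULL", "hello"]]    # target (English)

# trans[ex][fy]: how often target word ex and source word fy occur in the same sentence pair.
trans = defaultdict(lambda: defaultdict(int))
for fSen, tSen in zip(fCorpus, tCorpus):
    for ex in tSen:
        for fy in fSen:
            trans[ex][fy] += 1

print(trans["the"]["casa"])   # 1 for this toy corpus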
















1.4 What you need to implement




Your task is to finish implementing the methods in the IBMModel1 class. You should initialize any additional data structures you need in the __init__ method.







1.4.1 Part 1: Training the model (7 points, +2 EC)




You need to train your model using the EM algorithm by implementing the submethods in trainUsingEM. Treat each English sentence as if it begins with a NULL word in position 0 (this is already implemented), and refer to the Appendix for notation in these descriptions.




computeTranslationLengthProbabilities:




Compute the translation length probabilities q(m|n) from the corpus. These probabilities aren't necessary for calculating an alignment between two given sentences, but they would be necessary in order to produce a target sentence from a source sentence.[1]




getTranslationLengthProbability:




Return the pre-computed translation length probability q(m|n).
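A minimal standalone sketch of how these two length-probability methods could work, estimating q(m|n) by relative frequency over sentence-length pairs (the function names, and the assumption that tSen includes NULL at position 0 so that n counts only the real English words, are illustrative; adapt them to the class's data structures):

from collections import defaultdict

def compute_length_probs(fCorpus, tCorpus):
    """Return lengthProbs[n][m] = q(m|n), estimated by counting length pairs."""
    pair_counts = defaultdict(lambda: defaultdict(int))   # pair_counts[n][m]
    n_counts = defaultdict(int)                           # n_counts[n]
    for fSen, tSen in zip(fCorpus, tCorpus):
        m, n = len(fSen), len(tSen) - 1                   # tSen assumed to include NULL at position 0
        pair_counts[n][m] += 1
        n_counts[n] += 1
    return {n: {m: c / n_counts[n] for m, c in row.items()}
            for n, row in pair_counts.items()}

def get_length_prob(lengthProbs, m, n):
    """Return q(m|n), or 0.0 if this length pair was never observed."""
    return lengthProbs.get(n, {}).get(m, 0.0)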




initializeWordTranslationProbabilities:




For each English word that occurs in the corpus (including the NULL word), choose initial values for the translation probabilities p_t(f|e): each distribution p_t(...|e_x) should be initialized as a uniform distribution over all the words f_y that occur in the source translation of at least one of the sentences in which e_x occurs. (You could also use a uniform distribution over all words in the corpus, but that would require considerably more memory.)
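A minimal sketch of this initialization, restricting each distribution p_t(...|e_x) to the source words that co-occur with e_x (illustrative names; in the class you would store the result in whatever structure your other methods read from):

def initialize_word_translation_probs(fCorpus, tCorpus):
    """Return pt[ex][fy], uniform over the source words co-occurring with ex."""
    cooccur = {}                                  # cooccur[ex] = set of fy seen with ex
    for fSen, tSen in zip(fCorpus, tCorpus):      # tSen assumed to include NULL at position 0
        for ex in tSen:
            cooccur.setdefault(ex, set()).update(fSen)
    return {ex: {fy: 1.0 / len(fs) for fy in fs} for ex, fs in cooccur.items()}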




getWordTranslationProbability:




Return the (current) value of the translation probability p_t(f_y|e_x).




computeExpectedCounts:




For each pair of sentences (f^(s), e^(s)), 1 ≤ s ≤ S, with f^(s) = f_1^(s)...f_m^(s) and e^(s) = e_0^(s)...e_n^(s), compute the expected counts for the translation probabilities. That is, for each English word e_i^(s) with 0 ≤ i ≤ n and for each source word f_j^(s) with 1 ≤ j ≤ m:

\langle c(f_j^{(s)} \mid e_i^{(s)}; f^{(s)}, e^{(s)}) \rangle = \frac{p_t(f_j^{(s)} \mid e_i^{(s)})}{p_t(f_j^{(s)} \mid e_0^{(s)}) + \cdots + p_t(f_j^{(s)} \mid e_i^{(s)}) + \cdots + p_t(f_j^{(s)} \mid e_n^{(s)})}    (1)
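A minimal sketch of this E-step, accumulating the expected counts of Equation (1) over the whole corpus (illustrative names; pt[ex][fy] stands for the current p_t(f_y|e_x) from the earlier sketches):

from collections import defaultdict

def compute_expected_counts(fCorpus, tCorpus, pt):
    """Return counts[ex][fy] = expected number of times fy aligns to ex."""
    counts = defaultdict(lambda: defaultdict(float))
    for fSen, tSen in zip(fCorpus, tCorpus):          # tSen assumed to include NULL at position 0
        for fj in fSen:
            denom = sum(pt.get(ei, {}).get(fj, 0.0) for ei in tSen)
            if denom == 0.0:
                continue                              # fj never co-occurred with any word of tSen
            for ei in tSen:
                counts[ei][fj] += pt.get(ei, {}).get(fj, 0.0) / denom
    return counts

Note that the denominator is exactly the sum in Equation (1) and only needs to be computed once per foreign token.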

updateTranslationProbabilities:




For each English word e_x that appears at least once in our corpus:

Compute a normalization factor Z_{e_x} = \sum_y \sum_{s=1}^{S} \langle c(f_y \mid e_x; f^{(s)}, e^{(s)}) \rangle.

For each source word f_y that appears in at least one of the sentence pairs in which e_x occurs, compute a new p_t(f_y|e_x):

p_t(f_y \mid e_x) = \frac{\sum_{s=1}^{S} \langle c(f_y \mid e_x; f^{(s)}, e^{(s)}) \rangle}{Z_{e_x}}    (2)
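A minimal sketch of this M-step, renormalizing the expected counts from the E-step as in Equation (2) (illustrative names, operating on the counts dictionary returned by the E-step sketch above):

def update_translation_probs(counts):
    """Return new pt[ex][fy] = counts[ex][fy] / Z_ex."""
    pt = {}
    for ex, row in counts.items():
        z = sum(row.values())          # normalization factor Z_ex
        if z > 0.0:
            pt[ex] = {fy: c / z for fy, c in row.items()}
    return pt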



writeModel:




Write the parameters of your model to a file in a legible format. You should be able to examine these files to compare different sets of parameters (e.g. right after initialization, after training, or in between training iterations) and see where your model is improving.
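One possible way to do this (the tab-separated layout, sorted with the most probable translations first, is just a suggestion that makes the files easy to compare):

def write_model(pt, filename):
    """Write p_t(fy|ex) one entry per line: ex <tab> fy <tab> probability."""
    with open(filename, "w", encoding="utf-8") as out:
        for ex in sorted(pt):
            for fy, p in sorted(pt[ex].items(), key=lambda kv: -kv[1]):
                out.write(f"{ex}\t{fy}\t{p:.6f}\n")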







[1] If we were using a better language model, we could try to generate a translation; because Model 1 treats alignment probabilities uniformly, the resulting sentence would be word salad.







Convergence (+2 EC) During training, the provided code has you repeat computeExpectedCounts and updateTranslationProbabilities for 10 iterations (by default). Instead, you should really iterate until the probabilities have converged. You could check convergence by computing the log likelihood of the corpus:

L(C) = \sum_{s=1}^{S} \log p(f^{(s)} \mid e^{(s)})
     = \sum_{s=1}^{S} \log \left( q(m \mid n) \, \frac{1}{(n+1)^m} \prod_{j=1}^{m} \sum_{i=0}^{n} p_t(f_j^{(s)} \mid e_i^{(s)}) \right)
     = \sum_{s=1}^{S} \left( \log q(m \mid n) - m \log(n+1) + \sum_{j=1}^{m} \log \sum_{i=0}^{n} p_t(f_j^{(s)} \mid e_i^{(s)}) \right)
The model has fully converged when the change in log-likelihood between iterations of training becomes zero (approximate this by stopping when the log-likelihood changes by an amount less than ε between iterations).[2] You will get two bonus points if you compute the log likelihood and run until convergence (mention this in your README).
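A minimal sketch of this stopping criterion, computing the log-likelihood above after each iteration (the small-constant guards against log(0) and the convention that tSen includes NULL are assumptions; lengthProbs and pt are the structures from the earlier sketches, and ε is a threshold you choose):

import math

def log_likelihood(fCorpus, tCorpus, pt, lengthProbs):
    ll = 0.0
    for fSen, tSen in zip(fCorpus, tCorpus):
        m, n = len(fSen), len(tSen) - 1                   # tSen assumed to include NULL at position 0
        ll += math.log(lengthProbs.get(n, {}).get(m, 1e-12)) - m * math.log(n + 1)
        for fj in fSen:
            inner = sum(pt.get(ei, {}).get(fj, 0.0) for ei in tSen)
            ll += math.log(max(inner, 1e-12))             # guard against log(0)
    return ll

# Inside trainUsingEM, one possible loop structure:
# prev = float("-inf")
# while True:
#     ...E-step and M-step...
#     curr = log_likelihood(fCorpus, tCorpus, pt, lengthProbs)
#     if abs(curr - prev) < epsilon:
#         break
#     prev = curr

The log q(m|n) and -m log(n+1) terms do not change between iterations, so they could be dropped when only the change in log-likelihood matters.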




1.4.2 Part 2: Finding the best alignment (3 points)




Your second task is to find the best word alignment a = a_1...a_m for a pair of sentences (f^(s), e^(s)). Using Equations 8-10, the best alignment is:

a^*(f^{(s)}, e^{(s)}) = \arg\max_a P(f^{(s)}, a \mid e^{(s)}) = \arg\max_a \prod_{j=1}^{m} p_t(f_j^{(s)} \mid e_{a_j}^{(s)})    (3)

You need to finish implementing the align method:

align:



Return the best alignment between fSen and tSen, according to your model. The alignment must be returned as a list of integers where the j-th integer indicates the alignment for the j-th word in the foreign sentence fSen (the list should have the same length as fSen has words). A value of 0 means that the foreign word is aligned with the NULL token of the target sentence tSen, a value of 1 means that it is aligned with the first word of the sentence (tSen[1]), etc.
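A minimal sketch of this computation: because Model 1 scores an alignment as a product over positions, each foreign word can be aligned independently to its highest-probability target position (illustrative names; tSen[0] is assumed to be the NULL token):

def align(fSen, tSen, pt):
    """Return, for each foreign word, the index of its best target word (0 = NULL)."""
    alignment = []
    for fj in fSen:
        best_i = max(range(len(tSen)),
                     key=lambda i: pt.get(tSen[i], {}).get(fj, 0.0))
        alignment.append(best_i)
    return alignment

Ties between equally probable target words are broken here in favor of the lowest index; as noted in Section 1.4.3, tiebreaking can legitimately differ between implementations.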




1.4.3 Test script




The test script hw4_test.py uses the English-Spanish corpus eng-spa_small.txt to train your model over 20 iterations of EM, and then compares its results with a set of reference alignments (reference_alignments.txt). The reference alignments were produced by the course staff implementation under the same conditions as the test script; however, there are many rare words in the dataset and so tiebreaking may cause a valid implementation's accuracy to be below 100% (there will be no penalty in this case).







1.5 What you will be graded on




The grade for your implementation of IBM Model 1 will be based on the following rubric:




3 points: Accuracy of your model's alignments vs. the course staff's implementation (as we don't have gold human-annotated alignments).




1 point: Correctly implementing the translation length probability q(m|n) in computeTranslationLengthProbabilities, evaluated by calling getTranslationLengthProbability before the first iteration of EM.




1 point: Initializing the word translation probabilities uniformly in initializeWordTranslationProbabilities, evaluated by calling getWordTranslationProbability before the first iteration of EM.






[2] You should experiment with different values of ε; if the value is too large, your training will abort before convergence. If the value is too small, a lot of time will be spent on the later iterations without meaningfully altering the parameters.







2 points: Correctly implementing the EM algorithm (computeExpectedCounts and updateTranslationProbabilities). Since it is unlikely that you will be able to accurately align the sentences without implementing EM training correctly, this is a very important part of the assignment.




1 point: Efficiency (we should be able to run your code on a full corpus in under 20 minutes).




2 points: Quality of README discussion (see below) and clarity of distribution output from writeModel.




+2 points EC: Implementing the log-likelihood stopping criterion in trainUsingEM.




In your README, discuss where your model improved on the initial parameters during training (e.g. on rare words, function words, punctuation, phrases, etc.), and examine some of the alignments that your model produces (especially on sentences longer than 5 words). Is it a usable translation system? What kinds of systematic errors does your model make? Are these errors due to (a lack of) data, or are they a result of assumptions built into the model? If the latter, what changes would you make to the model to improve its performance?




1.6 What to submit




The only file you need to submit for this part is your completed hw4_translate.py program. You should submit your solution as a compressed tarball on Compass; to do this, save your files in a directory called abc123_cs447_HW4 (where abc123 is your NetID) and then create the archive from its parent directory (tar -czvf abc123_cs447_HW4.tar.gz abc123_cs447_HW4). Please include the following files:







hw4_translate.py: your Python implementation of IBM Model 1.







README.txt: a text file explaining your system and summarizing your findings.




1.7 Other language pairs




If you want to see how your model performs on other languages, you can download more language pairs from http://opus.lingfil.uu.se/OpenSubtitles_v2.php (the lower triangle of this matrix has plain text files). NB: I know many of you speak Chinese, but the Chinese text on this site is not word-segmented, so that may be an issue. Feel free to try anyway. Note that these corpora come as two separate files, one per language.




Appendix: A longer explanation of Model 1




To help understand what you have to do (and why), we provide a more detailed review of IBM Model 1 in this section. Don't worry; it is not as complicated as it looks at first!




Notation Throughout this assignment, we'll be translating from a source language f (non-English) to the target language e (English).




We'll use f_y ∈ f to denote a word in the vocabulary of the source language, and e_x ∈ e to denote a word in the vocabulary of the target language (e_x and f_y are both word types).




The training corpus C consists of S sentence pairs (f^(s), e^(s)), 1 ≤ s ≤ S. Thus, f is a vocabulary/set of word types, and f^(s) is a sentence/list of word tokens (the same goes for e and e^(s)).




We'll use m to denote the length of a source sentence and n to denote the length of a parallel target sentence.




For each pair of sentences (f^(s), e^(s)) in the training corpus, f^(s) = f_1^(s)...f_m^(s) and e^(s) = e_0^(s) e_1^(s)...e_n^(s). That is, f_j^(s) is the j-th word in source sentence f^(s), e_i^(s) is the i-th word in target sentence e^(s), and e_0^(s) = NULL for all s.




When defining probabilities for a single sentence, we may drop the superscript (s); however, the subscript indices (j and i) should still be consistent with the previous definition.




Given an alignment a, a_j = i when word f_j in the source sentence is aligned with word e_i in the target sentence (when this holds, we can also refer to e_i as e_{a_j}).




Model parameters Recall that Model 1 consists of the following distributions:




Length probabilities q(m|n): the probability that the non-English sentence is of length m given that the English sentence is of length n. You can obtain these probabilities simply by counting the lengths of the sentence pairs in the training data.




Alignment probabilities p(a|n, m): the probability of alignment a, given that the English sentence is of length n, and the non-English sentence is of length m. Recall that an alignment specifies for each position 1...m in the non-English sentence which English word (NULL or e_1,...,e_n) it is aligned to.




That is, we have 1 + n choices for each of the m words. In Model 1, all alignments have the same probability; the probability that word f_j is aligned to word e_i is 1/(n+1), and so the probability of one particular alignment vector a = a_1...a_m is simply 1/(n+1)^m.



Translation probabilities p_t(f_y|e_x): the probability that the English word e_x is translated into ("generates") the non-English word f_y. We'll refer to this family of distributions as p_t(f|e), and we will use the EM algorithm to estimate these probabilities.




Training: estimating the translation probabilities p_t(f_y|e_x) with the EM algorithm Since Model 1 is so simple, we only need to learn (estimate) the translation probabilities p_t(f_y|e_x). If the corpus was annotated with alignments, we could use standard relative frequency estimation: that is, we could simply count how often we see word e_x aligned to word f_y in our corpus (let us call this count c(f_y, e_x)), and divide this count by the count of how often we see word e_x aligned to any word f_z in our corpus:

p_t(f_y \mid e_x) = \frac{c(f_y, e_x)}{\sum_z c(f_z, e_x)}    (relative frequency estimate)    (4)



However, our corpus consists only of unaligned sentence pairs, so we will have to replace the actual frequencies c(f_y, e_x) with expected frequencies ⟨c(f_y, e_x)⟩:

p_t(f_y \mid e_x) = \frac{\langle c(f_y, e_x) \rangle}{\sum_z \langle c(f_z, e_x) \rangle}    (expected relative frequency estimate)    (5)




Here, ⟨c(f_y, e_x)⟩ indicates how often we expect to see word e_x aligned to word f_y in our corpus. Since our corpus consists of many sentence pairs, ⟨c(f_y, e_x)⟩ is the sum of how often we expect to see word e_x aligned to word f_y in each of the S sentence pairs in our corpus. That is, if we write ⟨c(f_y, e_x | f^(s), e^(s))⟩ for the number of times we expect to see word e_x aligned to word f_y in the s-th sentence pair, then:






\langle c(f_y, e_x) \rangle = \sum_{s=1}^{S} \langle c(f_y, e_x \mid f^{(s)}, e^{(s)}) \rangle    (6)





So, how do we compute ⟨c(f_y, e_x | f^(s), e^(s))⟩ for a given sentence pair (f^(s), e^(s))? Obviously, f^(s) has to contain the word f_y and e^(s) has to contain the word e_x for this count to be non-zero, but let us assume this is the case. For simplicity let us also assume that f_y occurs at position j in f^(s) (i.e. f_j^(s) = f_y), and that e_x occurs at position i in e^(s) (i.e. e_i^(s) = e_x).




Word e_i^(s) is aligned to f_j^(s) if the alignment of f_j is i, i.e. if a_j = i. That is, we need to compute P(a_j = i | e^(s), f^(s)), i.e. the probability of a_j = i given e^(s) and f^(s). Let A be the set of all possible alignments of f^(s) and e^(s), and A_{a_j = i} be the set of all alignments in which a_j = i. Then,

P(a_j = i \mid e^{(s)}, f^{(s)}) = \frac{\sum_{a \in A_{a_j = i}} P(a \mid e^{(s)}, f^{(s)})}{\sum_{a' \in A} P(a' \mid e^{(s)}, f^{(s)})}    (7)

It turns out that this simplifies to the (current) translation probability of f_j^(s) given e_i^(s), divided by the sum given any word in e^(s) (including NULL):

P(a_j = i \mid e^{(s)}, f^{(s)}) = \frac{p_t(f_j^{(s)} \mid e_i^{(s)})}{\sum_{i'=0}^{n} p_t(f_j^{(s)} \mid e_{i'}^{(s)})}    (8)










This is what you will have to compute at each iteration when you train Model 1.
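As a toy illustration with made-up numbers: suppose e^(s) = NULL e_1 e_2 (so n = 2) and the current model gives p_t(f_j|NULL) = 0.1, p_t(f_j|e_1) = 0.6, and p_t(f_j|e_2) = 0.3. Then Equation 8 gives

P(a_j = 1 \mid e^{(s)}, f^{(s)}) = \frac{0.6}{0.1 + 0.6 + 0.3} = 0.6,

so this foreign token contributes an expected count of 0.6 to the pair (f_j, e_1), 0.1 to (f_j, NULL), and 0.3 to (f_j, e_2).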
















Why is this the correct thing to do? Let us look at how we compute P(a | e^(s), f^(s)). The probability model defines the probability of sentence f^(s) = f_1^(s)...f_m^(s) and alignment a = a_1...a_m given e^(s) = e_0^(s)...e_n^(s) as:

P(f, a \mid e^{(s)}) = \underbrace{q(m \mid n)}_{\text{length of } f} \; \underbrace{p(a \mid n, m)}_{\text{alignment}} \; \underbrace{\prod_{j=1}^{m} p_t(f_j^{(s)} \mid e_{a_j}^{(s)})}_{\text{translation}}
                     = \underbrace{q(m \mid n)}_{\text{length of } f} \; \underbrace{\frac{1}{(n+1)^m}}_{\text{alignment}} \; \underbrace{\prod_{j=1}^{m} p_t(f_j^{(s)} \mid e_{a_j}^{(s)})}_{\text{translation}}    (9)








But since f^(s) is given, we need to know P(a | f^(s), e^(s)), which we can compute as follows (here A is the set of all possible alignments of f^(s) and e^(s), and we omit superscripts for f_j^(s) and e_i^(s) for legibility):




P(a \mid f^{(s)}, e^{(s)}) = \frac{P(a, f^{(s)} \mid e^{(s)})}{\sum_{a' \in A} P(a', f^{(s)} \mid e^{(s)})}
                           = \frac{q(m \mid n) \, \frac{1}{(n+1)^m} \prod_{j=1}^{m} p_t(f_j \mid e_{a_j})}{\sum_{a' \in A} q(m \mid n) \, \frac{1}{(n+1)^m} \prod_{j=1}^{m} p_t(f_j \mid e_{a'_j})}
                           = \frac{\prod_{j=1}^{m} p_t(f_j \mid e_{a_j})}{\sum_{a' \in A} \prod_{j=1}^{m} p_t(f_j \mid e_{a'_j})}    (10)
That is, all we really need to know are the translation probabilities p_t(f_y|e_x) for the words in e^(s) and f^(s). And since in Model 1 the different positions of a are filled independently, we don't even need to compute the probability of each individual alignment in order to compute P(a_j = i | e^(s), f^(s)). Instead, we just consider word f_j^(s), and that gives us Equation 8.
