The purpose of this assignment is to gain understanding of the Viterbi algorithm, and its application to part-of-speech (POS) tagging. The Viterbi algorithm will be related to two other algorithms.
You will also get to see the Universal Dependencies treebanks. The main purpose of these treebanks is dependency parsing (to be discussed later in the module), but here we only use their part-of-speech tags.
Getting started
We will be using Python3. On the lab (Linux) machines, you need the full path /usr/local/python/bin/python3, which is set up to work with NLTK. (Plain python3 won’t be able to find NLTK.)
If you run python on your personal laptop, be sure it is Python3 rather than Python2; in case of doubt, do python --version. In addition to NLTK (https://www.nltk.org/), you will also need to install the conllu package (https://pypi.org/project/conllu/).
To help you get started, download gettingstarted.py and the other Python files, and the zip file with treebanks from this directory. After unzipping, run /usr/local/python/bin/python3 gettingstarted.py. You may, but need not, use parts of the provided code in your submission.
The three treebanks come from Universal Dependencies. You can download the entire set of treebanks from https://universaldependencies.org/.
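For concreteness, here is a minimal sketch of how tagged sentences could be read with the conllu package; the file name is only an example, and the provided gettingstarted.py may do this differently.

# A minimal sketch, assuming some CoNLL-U file from the unzipped treebanks
# (the file name used below is an example, not part of the assignment).
from conllu import parse_incr

def read_tagged_sentences(path):
    """Return a list of sentences, each a list of (word, UPOS tag) pairs."""
    sentences = []
    with open(path, encoding="utf-8") as f:
        for token_list in parse_incr(f):
            # skip multiword token ranges, whose id is a tuple rather than an int;
            # older versions of the conllu package use the key "upostag" instead of "upos"
            sentences.append([(tok["form"], tok["upos"])
                              for tok in token_list if isinstance(tok["id"], int)])
    return sentences

train = read_tagged_sentences("treebanks/en_ewt-ud-train.conllu")  # example path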
Parameter estimation
First, we write code to estimate the transition probabilities and the emission probabilities of an HMM (Hidden Markov Model), on the basis of (tagged) sentences from a training corpus from Universal Dependencies. Do not forget to involve the start-of-sentence marker ⟨s⟩ and the end-of-sentence marker ⟨/s⟩ in the estimation.
The code in this part is concerned with:
- counting occurrences of one part of speech following another in a training corpus,
- counting occurrences of words together with parts of speech in a training corpus,
- relative frequency estimation with smoothing.
As discussed in the lectures, smoothing is necessary to avoid zero probabilities for events that were not witnessed in the training corpus. Rather than implementing a form of smoothing yourself, you can for this assignment take the implementation of Witten-Bell smoothing in NLTK (among the implementations of smoothing in NLTK, this seems to be the most robust one). An example of use for emission probabilities is in the file smoothing.py; one can similarly apply smoothing to transition probabilities.
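As a rough illustration of what this estimation might look like (the provided smoothing.py may differ in detail; the bins value and the data format below are assumptions of this sketch), one could write:

from nltk import FreqDist, WittenBellProbDist

START, END = "<s>", "</s>"

def estimate_emissions(tagged_sentences):
    """tagged_sentences: list of lists of (word, tag) pairs."""
    pairs = [(tag, word) for sent in tagged_sentences for (word, tag) in sent]
    smoothed = {}
    for tag in set(t for (t, _) in pairs):
        words = [w for (t, w) in pairs if t == tag]
        # bins is a rough guess at the number of possible words per tag
        smoothed[tag] = WittenBellProbDist(FreqDist(words), bins=1e5)
    return smoothed  # smoothed[tag].prob(word) approximates P(word | tag)

def estimate_transitions(tagged_sentences):
    """Smoothed distributions over next tags, including the <s> and </s> markers."""
    tag_seqs = [[START] + [t for (_, t) in sent] + [END] for sent in tagged_sentences]
    bigrams = [(prev, cur) for seq in tag_seqs for prev, cur in zip(seq, seq[1:])]
    smoothed = {}
    for prev in set(p for (p, _) in bigrams):
        nexts = [c for (p, c) in bigrams if p == prev]
        smoothed[prev] = WittenBellProbDist(FreqDist(nexts), bins=1e5)
    return smoothed  # smoothed[prev].prob(cur) approximates P(cur | prev)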
Three algorithms for POS tagging
Algorithm 1: eager algorithm
First, we implement a naive algorithm that chooses the POS tag for the i-th token on the basis of the chosen (i-1)-th tag and the i-th token. To be more precise, we determine for each i = 1, \ldots, n, in this order:

\hat{t}_i = \arg\max_{t_i} P(t_i \mid \hat{t}_{i-1}) \cdot P(w_i \mid t_i)

assuming \hat{t}_0 is the start-of-sentence marker ⟨s⟩. Note that the end-of-sentence marker ⟨/s⟩ is not even used here.
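A possible shape for this eager tagger, reusing the hypothetical trans/emis distributions, the tag set, and the START marker from the estimation sketch above (these names are assumptions of the sketches, not part of the assignment):

import math

def eager_tag(sentence, trans, emis, tagset):
    """Choose each tag greedily from the previously chosen tag and the current word."""
    tags = []
    prev = START  # start-of-sentence marker from the sketch above
    for word in sentence:
        best = max(tagset,
                   key=lambda t: math.log(trans[prev].prob(t))
                                 + math.log(emis[t].prob(word)))
        tags.append(best)
        prev = best
    return tags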
Algorithm 2: Viterbi algorithm
Now we implement the Viterbi algorithm, which determines, for a given sentence, the sequence of tags that has the highest probability. As discussed in the lectures, this is:
\hat{t}_1 \cdots \hat{t}_n = \arg\max_{t_1 \cdots t_n} \left( \prod_{i=1}^{n} P(t_i \mid t_{i-1}) \cdot P(w_i \mid t_i) \right) \cdot P(t_{n+1} \mid t_n)

where the tokens of the input sentence are w_1 \cdots w_n, and t_0 = ⟨s⟩ and t_{n+1} = ⟨/s⟩ are the start-of-sentence and end-of-sentence markers, respectively.
To avoid underflow for long sentences, we need to use log probabilities.
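One way the dynamic programming could be organised, again in terms of the hypothetical trans/emis interface and the START and END markers from the earlier sketches, working entirely with log probabilities:

import math

def viterbi_tag(sentence, trans, emis, tagset):
    """Return the highest-probability tag sequence for the given words."""
    n = len(sentence)
    # best[i][t]: log probability of the best tag sequence for w_1..w_i ending in tag t
    best = [{t: float("-inf") for t in tagset} for _ in range(n)]
    back = [{t: None for t in tagset} for _ in range(n)]
    for t in tagset:
        best[0][t] = math.log(trans[START].prob(t)) + math.log(emis[t].prob(sentence[0]))
    for i in range(1, n):
        for t in tagset:
            e = math.log(emis[t].prob(sentence[i]))
            for prev in tagset:
                score = best[i - 1][prev] + math.log(trans[prev].prob(t)) + e
                if score > best[i][t]:
                    best[i][t] = score
                    back[i][t] = prev
    # fold in the transition to the end-of-sentence marker before backtracking
    last = max(tagset, key=lambda t: best[n - 1][t] + math.log(trans[t].prob(END)))
    tags = [last]
    for i in range(n - 1, 0, -1):
        tags.append(back[i][tags[-1]])
    return list(reversed(tags))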
Algorithm 3: individually most probable tags
We now write code that determines the most probable part of speech for each token individually. That is, for each i, computed is:

\hat{t}_i = \arg\max_{t_i} \sum_{t_1 \cdots t_{i-1} t_{i+1} \cdots t_n} \left( \prod_{k=1}^{n} P(t_k \mid t_{k-1}) \cdot P(w_k \mid t_k) \right) \cdot P(t_{n+1} \mid t_n)

To compute this effectively, we need to use forward and backward values, as discussed in the lectures on the Baum-Welch algorithm, making use of the fact that the above is equivalent to:

\hat{t}_i = \arg\max_{t_i} \left( \sum_{t_1 \cdots t_{i-1}} \prod_{k=1}^{i} P(t_k \mid t_{k-1}) \cdot P(w_k \mid t_k) \right) \cdot \left( \sum_{t_{i+1} \cdots t_n} \prod_{k=i+1}^{n} P(t_k \mid t_{k-1}) \cdot P(w_k \mid t_k) \cdot P(t_{n+1} \mid t_n) \right)
The computation of forward values is very similar to the Viterbi algorithm, so you may want to copy and adapt the code you already have, replacing statements that maximise with corresponding statements that sum values together. The computation of backward values is similar to that of forward values.
See logsumexptrick.py for a demonstration of the use of log probabilities when probabilities are summed, without getting underflow in the conversion from log probabilities to probabilities and back.
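The sketch below shows the log-sum-exp idea together with one way of computing forward values (the provided logsumexptrick.py may differ in detail; trans/emis, START, and the tag set are again the hypothetical names from the earlier sketches). Backward values are computed symmetrically from the end of the sentence, with the transition to ⟨/s⟩ folded in, and the individually most probable tag at position i then maximises the sum of its forward and backward log values.

import math

def logsumexp(log_values):
    """Log of a sum of probabilities given as log probabilities, avoiding underflow."""
    m = max(log_values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in log_values))

def forward_values(sentence, trans, emis, tagset):
    """alpha[i][t]: log of the summed probability of all tag prefixes ending in t at position i."""
    n = len(sentence)
    alpha = [dict() for _ in range(n)]
    for t in tagset:
        alpha[0][t] = math.log(trans[START].prob(t)) + math.log(emis[t].prob(sentence[0]))
    for i in range(1, n):
        for t in tagset:
            alpha[i][t] = math.log(emis[t].prob(sentence[i])) + logsumexp(
                [alpha[i - 1][p] + math.log(trans[p].prob(t)) for p in tagset])
    return alpha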
Evaluation
Next, we write code to determine the percentages of tags in a test corpus that are guessed correctly by the above three algorithms. Run experiments for the training and test corpora of the three included treebanks, and possibly for treebanks of more languages (but not for more than 10; aim for quality rather than quantity). Compare the performance of the three algorithms.
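For instance, a percentage of correctly guessed tags could be computed as sketched below, where tagger stands for any one of the three (hypothetical) tagging functions above, applied to the word sequence of each test sentence.

def accuracy(tagger, test_sentences):
    """Percentage of tokens tagged correctly; test_sentences are lists of (word, tag) pairs."""
    correct = total = 0
    for sent in test_sentences:
        words = [w for (w, _) in sent]
        gold = [t for (_, t) in sent]
        predicted = tagger(words)
        correct += sum(1 for p, g in zip(predicted, gold) if p == g)
        total += len(gold)
    return 100.0 * correct / total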
You get the best experience out of this practical if you also consider the languages of the treebanks. What do you know (or what can you find out) about the morphological and syntactic properties of these languages? Can you explain why POS tagging is more difficult for some languages than for others?
Requirements
Submit your Python code, with a README file explaining how to run it, and a report describing your findings, including experimental results and their analysis. You should include the treebanks needed to run the code, but please do not include the entire set of Universal Dependencies treebanks, because this would be a huge waste of disk space and bandwidth for the marker.
Marking is in line with the General Mark Descriptors (see pointers below). Evidence of an acceptable attempt (up to 7 marks) could be code that is not functional but nonetheless demonstrates some understanding of POS tagging. Evidence of a reasonable attempt (up to 10 marks) could be code that implements Algorithm 1. Evidence of a competent attempt addressing most requirements (up to 13 marks) could be fully correct code in good style, implementing Algorithms 1 and 2, and a brief report. Evidence of a good attempt meeting nearly all requirements (up to 16 marks) could be a good implementation of Algorithms 1 and 2, plus an informative report discussing meaningful experiments. Evidence of an excellent attempt with no significant defects (up to 18 marks) requires an excellent implementation of all three algorithms, and a report that discusses thorough experiments and analysis of inherent properties of the algorithms, as well as awareness of linguistic background discussed in the lectures. An exceptional achievement (up to 20 marks) in addition requires exceptional understanding of the subject matter, evidenced by experiments, their analysis and reflection in the report.
Hints
Even though this module is not about programming per se, a good programming style is expected. Choose meaningful variable and function names. Break up your code into small functions. Avoid cryptic code, and add code commenting where it is necessary for the reader to understand what is going on. Do not overengineer your code; a relatively simple task deserves a relatively simple implementation. If you feel the need for Python virtual environments, then you are probably overdoing it. The code that you upload would typically consist of one or at most a handful of .py files.
You cannot use any of the POS taggers already implemented in NLTK. However, you may use general utility functions in NLTK such as ngrams from nltk.util, and FreqDist and WittenBellProbDist from nltk.
When you are reporting the outcome of experiments, the foremost requirement is reproducibility. So if you give figures or graphs in your report, explain precisely what you did, and how, to obtain those results. Keep in mind that not all treebanks are the same size, and a common pitfall is making unfair comparisons between languages by not correcting for differences in corpus size.
Pointers
Marking http://info.cs.st-andrews.ac.uk/student-handbook/learning-teaching/feedback.html#Mark_Descriptors
Lateness http://info.cs.st-andrews.ac.uk/student-handbook/learning-teaching/assessment.html#lateness-penalties
Good Academic Practice https://www.st-andrews.ac.uk/students/rules/academicpractice/