
Word Sense Disambiguation (WSD) Using Vector Space Model (VSM) Solution

Introduction

In this assignment, we will:

1.   Briefly introduce WSD and the Vector Model.

2.   Go through the basics of WordNet and scikit-learn.

3.   Develop a WSD system based on the Vector Model with supervised learning algorithms.

4.   Use feature extraction methods to enhance your WSD system.

 

This document is structured in the following sequence:

 

WSD and Vector Model: Introduction to WSD and the Vector Space Model (VSM).

Environment Setup: Set up everything to get you ready to start the assignment.

WordNet and scikit-learn: Quick intro to the WordNet and scikit-learn Python libraries.

Assignment Part A (Basic WSD system): We will create, train, and evaluate a WSD system using the SENSEVAL corpus, based on the vector space model (VSM).

Assignment Part B (Feature Extraction): We will use feature extraction methods to enhance your WSD system.

Submission: Once you're done, follow the instructions to submit.

Introduction to WSD and VSM

 

 

What is word sense disambiguation (WSD)?

 

Word sense disambiguation (WSD) is the task of determining which sense of an ambiguous word is being used in a particular context. The solution to this problem impacts other NLP-related problems such as machine translation and document retrieval.

 

 

 

What is the lexical sample task?

 

The standard WSD (SENSEVAL) task has two variants: “lexical sample” and “all words.” The former comprises disambiguating the occurrences of a small sample of target words which were previously selected, while in the latter all the words in a piece of running text need to be disambiguated. In this assignment, we will be working on the lexical sample task.

 

 

 

Algorithms for WSD:

 

There are several types of approaches to WSD, including dictionary-based methods, semi-supervised methods, supervised methods, and unsupervised methods. Supervised methods based on sense-labeled corpora are the best-performing methods for sense disambiguation. But such labeled data is expensive and limited. In contrast, dictionary-based methods and unsupervised methods require no labeled texts and are more efficient, but have lower accuracy. In the following part, you will see how two algorithms, a supervised algorithm based on the vector space model and the Lesk algorithm, work on WSD.

 

 

 

What is the vector space model (VSM)?

 

The vector space model (VSM) is a widely-used model in NLP. The basic idea is to represent each instance of an ambiguous word as a vector whose features (attributes) are extracted according to some rules. Usually classification or clustering methods are then applied to solve the problem.

 

The supervised approach to sense disambiguation can be based on the vector space model. Here is a skeleton supervised algorithm for the lexical sample task.

 

 

1.   For each instance wi of word w in a corpus, compute a context vector 𝑐⃗. Label the class of each vector by the sense of word w in the context.

2.   Train a classifier with these labeled context vectors 𝑐⃗.

3.   Disambiguate a particular token t of w by predicting the class of token t with the trained classifier.

 

 

Later in the assignment we will explain how to extract context vectors and what classification method you can use in order to implement this algorithm.
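
As a rough picture of steps 2 and 3, here is a minimal sketch with toy context vectors (the real vectors and features are built in Part A; scikit-learn itself is introduced in the tutorial section below):

from sklearn import svm

# Toy context vectors for three labeled instances of an ambiguous word w
train_vectors = [[2, 0, 1], [0, 3, 0], [1, 0, 2]]
train_senses = ['sense_1', 'sense_2', 'sense_1']

clf = svm.LinearSVC()                  # step 2: train a classifier on the labeled vectors
clf.fit(train_vectors, train_senses)

clf.predict([[1, 2, 0]])               # step 3: predict the sense of a new token of w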

What is the Lesk algorithm?

 

The Lesk algorithm is a well-studied dictionary-based algorithm for word sense disambiguation. It is based on the hypothesis that words used together in text are related to each other and that the relation can be observed in the definitions of the words and their senses. Two (or more) words are disambiguated by finding the pair of dictionary senses with the greatest word overlap in their dictionary definitions.

 

The pseudo code of Lesk is shown below.

 

 

function SIMPLIFIED LESK(word, sentence) returns best sense of word
    best-sense ← most frequent sense for word
    max-overlap ← 0
    context ← set of words in sentence
    for each sense in senses of word do
        signature ← set of words in the gloss and examples of sense
        overlap ← COMPUTEOVERLAP(signature, context)
        if overlap > max-overlap then
            max-overlap ← overlap
            best-sense ← sense
    end
    return(best-sense)

 

 

You do not need to implement the Lesk algorithm in this assignment, but it should give you a sense of what an algorithm that does not rely on labeled training data looks like.

 

For our lexical sample task, assume we have a Lesk algorithm implementation in the following format:

 

lesk(context_sentence, ambiguous_word):

- param str context_sentence: The context sentence where the ambiguous word occurs.

- param str ambiguous_word: The ambiguous word that requires WSD.

 

- return: “lesk_sense”, the Synset() object with the highest signature overlap.
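
If you want to see such an implementation in action, NLTK ships one with a very similar interface in nltk.wsd (its context_sentence argument is a list of tokens rather than a string). A quick illustration; the exact synset returned depends on the WordNet glosses:

from nltk.wsd import lesk

context = 'I went to the bank to deposit my money'.split()
sense = lesk(context, 'bank')          # the Synset with the highest signature overlap
print(sense, sense.definition())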

 

 

 

WordNet

 

WordNet is a lexical database (originally developed for English) that contains information about words, their senses, and the relationships between them. A word with multiple senses is called polysemous. A set of words that share a sense is called a “synset” (synonym set). Synsets may be related to each other in different ways, e.g., “cat” is a hypernym (more general concept) of “lion.”

Environment Setup

 

You can copy the provided code and data by running the following commands:

 

cd ~/hidden/$PIN/

mkdir Homework3

cp /home/595/Homework3/student_files/* Homework3/

 

 

Basic WordNet and scikit-learn tutorial

 

How to use WordNet in NLTK for WSD tasks

 

- Import wordnet

from nltk.corpus import wordnet as wn

 

- Get a word’s synset

A synset is a set of words with the same sense. It is identified with a 3-part name of the form:

word.pos.nn. Below is a description of the synsets function and examples of its usage.

 

def synsets(self, lemma, pos=None, lang='en'):

- Load all synsets with a given lemma (word) and part of speech tag.

- If no pos is specified, all synsets for all parts of speech will be loaded.

- If lang is specified, all the synsets in that language associated with the lemma will be returned. The default value of lang is ‘en’.

 

wn.synsets('dog')

[Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'), Synset('cad.n.01'), ...]

wn.synsets('dog', pos='v', lang='en')
[Synset('chase.v.01')]

 

- Get possible definitions for a word from its synsets

for ss in wn.synsets('bank'):

       print ss, ss.definition()

Synset('bank.n.01') sloping land (especially the slope beside a body of water)

Synset('depository_financial_institution.n.01') a financial institution that accepts deposits and channels the money into lending activities

Synset('bank.n.03') a long ridge or pile

Synset('bank.n.04') an arrangement of similar objects in a row or in tiers
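
- Get related synsets (synonyms, hypernyms, hyponyms)
These relations can be useful as extra features later in Part B. A brief illustrative sketch; the exact synsets returned and their order depend on your WordNet version:

ss = wn.synsets('dog', pos='n')[0]   # Synset('dog.n.01')
ss.lemma_names()                     # synonyms in this synset, e.g. ['dog', 'domestic_dog', 'Canis_familiaris']
ss.hypernyms()                       # more general synsets, e.g. [Synset('canine.n.02'), ...]
ss.hyponyms()                        # more specific synsets, e.g. [Synset('puppy.n.01'), ...]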



 

- What languages can we use in NLTK?

The WordNet corpus reader gives access to the Open Multilingual WordNet, using ISO-639 language codes. For example, “cat” stands for Catalan (a language spoken in Spain, France and Andorra, which is related to Spanish, Italian, and the other Romance languages).

sorted(wn.langs())

['als', 'arb', 'cat', 'cmn', 'dan', 'eng', 'eus', 'fas', 'fin', 'fra', 'fre', 'glg', 'heb', 'ind', 'ita', 'jpn', 'nno', 'nob', 'pol', 'por', 'spa', 'tha', 'zsm']

How to use scikit-learn for machine learning tasks

 

- What is scikit-learn?

scikit-learn (http://scikit-learn.org/stable/index.html) is an open source machine learning library for Python. It features various classification, regression, and clustering algorithms including support vector machines, k-nearest neighbors, logistic regression, naive Bayes, random forests, gradient boosting, k-means, and DBSCAN. For this class, we don’t need to understand all the details of these algorithms.

 

- Loading an example dataset

scikit-learn comes with a few standard datasets. A dataset is a dictionary-like object that holds all the data and some metadata about the data. The  data is stored in the .data member. In the case of supervised problems, one or more response variables are stored in the .target member. To load an example dataset,

from sklearn import datasets

iris = datasets.load_iris()

digits = datasets.load_digits()

 

- Learning and predicting

Classification is a basic task in machine learning. The task is, given a set of data, usually represented as vectors, and several classes, predict which class each vector belongs to. To solve the problem, we will first train a classifier (estimator) using some data whose classes have  already  been  given.  Then  we  will  predict  the  classes  of  unlabeled  data  with  the classifier we just trained.

 

In scikit-learn, an estimator for classification is a Python object that implements the methods fit(X,y) and predict(T). An example of an estimator is the class sklearn.svm.SVC that implements support vector classification.

from sklearn import svm

clf = svm.SVC(gamma=0.001, C=100.)

 

To train the model with all but the last item in our training dataset

clf.fit(digits.data[:-1], digits.target[:-1])

 

Now you can predict new values by

clf.predict(digits.data[-1:])   # predict expects a 2-D array, so pass a one-element slice

 

Here is another example of how to use the k-nearest neighbors algorithm to do the classification.

from sklearn import neighbors

clf = neighbors.KNeighborsClassifier(15, weights='uniform')

clf.fit(iris.data, iris.target)

clf.predict(iris.data)

Provided Files

 

Provided data

 

We will be using data from the Senseval 3 Lexical Sample Task on multilingual WSD, specifically the English, Catalan, and Spanish corpora.

 

Datasets:

 

All the data needed for this assignment are in this directory:

/home/595/Homework3/data

 

The data include:

English-train.xml      Training data for English

English-dev.xml        Development data for English

English-dev.key        Answer for the development data for English

English.sensemap       Sense map file needed for evaluation

 

Catalan-train.xml      Training data for Catalan

Catalan-dev.xml        Development data for Catalan

Catalan-dev.key        Answer for the development data for Catalan

 

Spanish-train.xml      Training data for Spanish

Spanish-dev.xml        Development data for Spanish

Spanish-dev.key        Answer for the development data for Spanish

 

 

 

The data files are in .xml form, where the tags are:

 

1.   lexelt/lemma:  Each lexelt/lemma represents a word to be disambiguated. The item attribute indicates the word and its POS tag. One lexelt could have several instances.

 

<lexelt item="difference.n">

 

2.   instance: Each instance is a case where the target word appears in a context. An instance contains one context.

 

<instance id="difference.n.bnc.00001061">

 

3.   context: The context of the word from each instance. You can choose to use the entire context or just some words from the context (e.g., common collocations, words within a small window, words with high PMI values, etc).

 

<context>
In 1991/92 we shall need support even more . Every donation does help . That help makes all the <head>difference</head> to people sick with AIDS who want to stay at home , rather than spend time unnecessarily in hospital . Please help !  SIR JOHN FORD KCMG MG CHAIRMAN OF TRUSTEES
</context>

 

4.   answer: The sense of the word in this instance. It only appears in training data.

 

<answer instance="difference.n.bnc.00001061" senseid="difference%1:24:00::"/>

 

Note: For Catalan and Spanish data, you will also see <previous>, <target>, and <following> tags inside a <context> tag. For this assignment we only use the text in the <target> tag.

 

Note2: If an instance contains multiple <answer> tags, we will only use the first tag for the purpose of training.

 

You can use the xml package in Python to parse the .xml file.
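
For instance, here is a minimal parsing sketch with xml.etree.ElementTree, assuming the <lexelt> elements sit directly under the XML root (adjust the paths if the corpus nests them differently):

import xml.etree.ElementTree as ET

tree = ET.parse('English-train.xml')
for lexelt in tree.getroot().findall('lexelt'):
    item = lexelt.get('item')                        # e.g. "difference.n"
    for instance in lexelt.findall('instance'):
        instance_id = instance.get('id')
        answers = instance.findall('answer')         # present only in the training data
        sense_id = answers[0].get('senseid') if answers else None   # use only the first <answer>
        context = instance.find('context')
        text = ' '.join(context.itertext())          # context text, including the <head> word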

 

WSD systems are evaluated by computing the percentage of ambiguous words for which the WSD system has predicted the correct sense. We have provided an evaluator for you. You will use it to evaluate the performance of the algorithms you implement.

 

The .key files and .sensemap files are only used for evaluation. You do not need to know the meaning of these files. Later you will see how to use the evaluator with these files.

 

 

Provided code

 

baseline.py      a baseline WSD system implementation
scorer2          a script to evaluate WSD systems
scorer2.c        the source code for scorer2, written in C
A.py             skeleton code for part A

B.py             skeleton code for part B

main.py          runs the assignment

 

Before you start

 

You will be implementing a supervised algorithm based on the vector space model according to the description in the Introduction section. You will need to work on three languages: English, Catalan, and Spanish.

 

As a baseline, we take the sense of a word in any context to be the most frequent sense of that word. You can see the output of the baseline by running

 

python baseline.py language

 

The output file is language.baseline, with each line in the format below:

 

lexelt_item instance_id sense_id

 

Your implementation should achieve better performance than this baseline. You will be graded based on how good your implementation is.

Assignment Part A

 

For this part, you will be implementing a basic WSD system with two supervised algorithms based on the vector space model (VSM), using the Senseval data set.

 

 

 

Implementation

 

For each language (Catalan, English, and Spanish), you should follow the steps below.

 

1.      Compute context vectors for each instance of a word. You will use language-train.xml as training data.

 

Suppose in a test instance Ti, the word you need to disambiguate is w. First, tokenize the sentences in the context with nltk.word_tokenize(). Si is the set of words that are within distance k of word w. Let S be the union of the sets Si; it has size n and may contain duplicate words. Each test instance will then correspond to a vector, where each attribute is the count of a word in set S. For this assignment, we set the window size k = 10.

 

Implement part A1 in the build_s and vectorize functions of A.py. (A rough sketch of all the Part A steps appears after step 5 below.)

 

2.      Train a classifier using the context vectors you obtained. You can use the scikit-learn library for this step. You need to try two different classifiers: k-nearest neighbors (KNN, the neighbors.KNeighborsClassifier class) and a linear support vector machine (SVM, the svm.LinearSVC class).

 

You will notice that there are several parameters for the classifiers. For this assignment you can just use the default settings (k=5 for KNN).

 

Implement part A2 in the classify function of A.py.

 

You will disambiguate the target word in each test instance in the development data set. For each classifier, the output should be a single file corresponding to language-dev.xml, with one line per test instance in the format below:

 

lexelt_item instance_id sense_id

 

Note: The lines in the output file have to be sorted in ascending alphabetical order (A-Z), first by lexelt_item, then by instance_id. Also, please remove accents from characters in the output file.

 

3.      Use the SVM classifier to perform disambiguation for each test instance in language-dev.xml. Write the output of the SVM classifier for each language to a file named SVM-Language.answer, formatted as described above.

 

4.      Use the KNN classifier to perform disambiguation for each test instance in language-dev.xml. Write the output of the KNN classifier for each language to a file named KNN-Language.answer, formatted as described above.

 

Implement parts A3 and A4 in the print_results function of A.py.

 

 

5.     In your report file, report and compare the performance of the two classifiers.
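
Below is a minimal, self-contained sketch of the whole Part A pipeline (steps 1-4). The function names build_s, vectorize, classify, and print_results come from A.py, but their exact signatures in the skeleton code may differ; the window helper and the (lexelt_item, instance_id, sense_id) result tuples are illustrative assumptions.

import codecs
import unicodedata

import nltk
from sklearn import neighbors, svm

K = 10  # window size

def window(context_text, head_word, k=K):
    # The k tokens on each side of the target word; assumes the target word
    # appears verbatim in the tokenized context.
    tokens = nltk.word_tokenize(context_text)
    i = tokens.index(head_word)
    return tokens[max(0, i - k):i] + tokens[i + 1:i + 1 + k]

def build_s(instances):
    # instances: hypothetical list of (instance_id, head_word, context_text, sense_id)
    # tuples for one lexelt. S is the union (with duplicates) of all window words.
    s = []
    for _, head, text, _ in instances:
        s.extend(window(text, head))
    return s

def vectorize(s, context_text, head_word):
    # One attribute per word in S: its count in this instance's window.
    w = window(context_text, head_word)
    return [w.count(word) for word in s]

def classify(train_vectors, train_senses, test_vectors):
    # Step 2: train both classifiers on the labeled context vectors,
    # then predict a sense for every development instance.
    knn = neighbors.KNeighborsClassifier()   # default settings, k = 5
    svc = svm.LinearSVC()                    # default settings
    knn.fit(train_vectors, train_senses)
    svc.fit(train_vectors, train_senses)
    return knn.predict(test_vectors), svc.predict(test_vectors)

def strip_accents(text):
    # Remove accents from characters, as required for the output files.
    return ''.join(ch for ch in unicodedata.normalize('NFD', text)
                   if unicodedata.category(ch) != 'Mn')

def print_results(results, output_file):
    # results: hypothetical list of (lexelt_item, instance_id, sense_id) tuples.
    # sorted() on the tuples orders first by lexelt_item, then by instance_id.
    with codecs.open(output_file, 'w', encoding='utf-8') as out:
        for lexelt_item, instance_id, sense_id in sorted(results):
            out.write('%s %s %s\n' % (strip_accents(lexelt_item),
                                      strip_accents(instance_id),
                                      strip_accents(sense_id)))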

 

Assignment Part B

 

For this part, you will improve your WSD system. You will need to compare the performance of different features and classifiers, find the best combination, and describe your work in the report.

 

1.      Try extracting better features than just taking all the words in the window, and then redo the classification. You should try the following approaches (adding any one of them will not necessarily improve the system); also try different combinations:

 

a)          Add collocational features such as the surrounding words w-2, w-1, w0, w1, w2 and the part-of-speech tags POS-2, POS-1, POS0, POS1, POS2.

 

b)          Remove stop words, remove punctuation, do stemming, etc.

c)          Use the method described below. (A rough sketch of this computation appears at the end of Part B.)

Suppose the word w that needs disambiguation has k senses. For a sense s of w and a word c in the window of w, we compute the "relevance score" of c with s as

log( p(s|c) / p(s̄|c) )

where p(s|c) is the probability that the word w has sense s when c appears, and is computed using

p(s|c) = N_{s,c} / N_c

where N_{s,c} is the number of instances where the context contains c and the sense of word w is s, and N_c is the number of instances where the context contains c. p(s̄|c) is the probability that the word w has a sense other than s when c appears, and can be computed similarly.

 

For each sense s, we compute how relevant the words in the window are and select the top words as features. The final set of features is the union of all the top features for each sense.

 

For example, in English-train.xml, the word activate has 3 senses: 38201, 38202, and 38203. We now compute the relevance score for the word protein. There are 10 instances where the context contains protein and the sense is 38201, 8 such instances for 38202, and 1 for 38203. So the score for (38201, protein) is log( (10/19) / (9/19) ), the score for (38202, protein) is log( (8/19) / (11/19) ), and the score for (38203, protein) is log( (1/19) / (18/19) ).

 

d)          Add more features by obtaining the synonyms, hyponyms, and hypernyms of a word from WordNet; you should try different combinations of these features rather than only combining all of them. For instance, you might add hyponyms only, synonyms and hypernyms together, or all of these features combined.

 

e)          Try a good feature selection method. Hint: chi-square or pointwise mutual information (PMI) may be useful for selecting features.

 

Implement parts B1a-d in the extract_features function of B.py. Implement part B1e in the feature_selection function of B.py.

 

In your report, describe how much each of the approaches above improves the performance.

 

2.      Find the best combination of the feature extraction approaches, and use the classifier that gives the better results. Submit your code with the best performance and the output files from that code; name those output files Best-Language.answer.

Implement part B2 in the classify function of B.py.

 

3.   You are welcome to add more features and to adjust the window size k and the classifier parameters to get better results. If you use your own parameters or add features different from those above, include in your report how you obtained the settings and what features you added. You can also try other classifiers if you want. We will compare the performance of all the students' classifiers and give bonus points to the top performers.
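
Here is a minimal sketch of the relevance-score computation from part B1c. The input format (a list of (sense_id, window_tokens) pairs for one target word) and the number of top words kept per sense (top_n) are illustrative assumptions; the real extract_features in B.py will work over the parsed training instances.

import math
from collections import defaultdict

def relevance_features(instances, top_n=10):
    # instances: hypothetical list of (sense_id, window_tokens) pairs for one target word
    n_c = defaultdict(int)    # N_c: number of instances whose window contains c
    n_sc = defaultdict(int)   # N_{s,c}: ... and whose sense is s
    for sense, window in instances:
        for c in set(window):
            n_c[c] += 1
            n_sc[(sense, c)] += 1
    features = set()
    for s in set(sense for sense, _ in instances):
        scores = {}
        for c, nc in n_c.items():
            nsc = n_sc[(s, c)]
            if nsc == 0:
                continue                     # c never co-occurs with sense s
            if nsc == nc:
                scores[c] = float('inf')     # c occurs only with sense s
            else:
                # log( p(s|c) / p(s_bar|c) ) = log( N_{s,c} / (N_c - N_{s,c}) )
                scores[c] = math.log(nsc / float(nc - nsc))
        # keep the most relevant window words for this sense
        features |= set(sorted(scores, key=scores.get, reverse=True)[:top_n])
    return features                          # union of the top features over all senses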

 

 

Conclusions

Finally, write the conclusions you draw from this assignment. Also, include any interesting observations.

Discuss why some languages are easier than others for WSD tasks (extra points).

 

 

 

 

Evaluation

 

You will evaluate the results you obtained from the previous parts. Use scorer2 as shown below to evaluate your output.

 

./scorer2 answer_file key_file [sense_map_file]

 

where answer_file is the output of your algorithm and key_file is the gold standard for the test data.

Here is an example:

 

./scorer2 English.answer /home/595/Homework3/data/English-dev.key

/home/595/Homework3/data/English.sensemap

 

Note: For Catalan and Spanish data, there is no sensemap file. So you do not need to include it in the command.

 

Note2: scorer2 is written in C, and we have given you a compiled binary. We have included the source code here in case you want to try it on other platforms.

Include the evaluation results in your report. Include any interesting findings and explain why they occurred (up to 2-3 paragraphs and a table with results).

 

 

 

Submission

 

As usual, you need to create a directory named Homework3 in your hidden directory and copy all the files into it. The files you should submit are:

 

main.py
A.py
B.py
SVM-English.answer
SVM-Spanish.answer
SVM-Catalan.answer
KNN-English.answer
KNN-Spanish.answer
KNN-Catalan.answer
Best-English.answer
Best-Spanish.answer
Best-Catalan.answer
report.txt

 

As a final step, run a script that will set up the permissions on your homework files, so we can access and run your code to grade it:

 

/home/Homework3/hw3_set_permissions.sh <YOUR_PIN>

 

Note: You can create extra source code files, but make sure we can run your algorithm simply by running:

 

 

python main.py <input training file> <input test file> <output KNN file> <output SVM file> <output best file> <language>

For instance:

python main.py /home/595/Homework3/data/English-train.xml /home/595/Homework3/data/English-dev.xml KNN-English.answer SVM-English.answer Best-English.answer English
