Natural Language Processing (NLP) is the subfield of artificial intelligence that deals with designing algorithms, programs, and systems that can understand human languages in written and spoken forms. Word embeddings are a recent advance in NLP that consists of representing words by vectors (or arrays) of floating point numbers in such a way that if two words are similar, their embeddings are also similar. See https://nlp.stanford.edu/projects/glove/ for an overview of this interesting research.
In order to work in real time, NLP applications such as Siri and Alexa use hash tables to efficiently retrieve the embeddings given their corresponding words. In this lab you will implement a simple version of this. The web page mentioned above contains links to files that contain word embeddings of various lengths for various vocabulary sizes. Use the file glove.6B.50d.txt (included in the file glove.6B.zip), which contains word embeddings of length 50 for a very large number of words. Each line in the file starts with the word being described, followed by 50 floating point numbers that represent the word's vector description (the embedding). The words are ordered by frequency of usage, so "the" is the first word. Some "words" do not start with an alphabetic character (for example "," and "."); feel free to ignore them.
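As a rough illustration of this file format, the sketch below reads glove.6B.50d.txt line by line, skips tokens that do not start with an alphabetic character, and produces (word, embedding) pairs. The function name and file path are placeholders, not part of the assignment's starter code.

    import numpy as np

    def read_embeddings(path="glove.6B.50d.txt"):
        """Yield (word, embedding) pairs from the GloVe file, skipping
        tokens that do not start with an alphabetic character."""
        with open(path, encoding="utf-8") as infile:
            for line in infile:
                parts = line.split()
                word = parts[0]
                if not word[0].isalpha():      # ignore tokens such as "," and "."
                    continue
                embedding = np.array(parts[1:], dtype=np.float32)  # length-50 vector
                yield word, embedding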
Your task for this lab is to compare the running times of two table implementations used to retrieve word embeddings, enabling the (hopefully fast) comparison of two given words. One of your table implementations should use a binary search tree; the other should use a hash table with chaining. See the appendix for sample runs of your program.
Your program should do the following:
1. Prompt the user to choose a table implementation (binary search tree or hash table with chaining).
2. Read the file "glove.6B.50d.txt" and store each word and its embedding in a table with the chosen implementation. Each node in the BST or hash table must consist of a list of size two, containing the word (a string) and the embedding (a numpy array of length 50). For the hash table, choose a prime number for your initial table size and increase the size to twice the current size plus one every time the load factor reaches 1. Caution: do NOT recompute the load factor every time an item is inserted into the table; instead, add a num_items field to your hash table class. (A minimal hash table sketch appears after these instructions.)
3. Compute and display statistics describing your table. See the appendix for examples for both implementations. Feel free to suggest others.
4. Read another file containing pairs of words (two words per line) and, for every pair of words, find and display the "similarity" of the words. To compute the similarity between words $w_0$ and $w_1$, with embeddings $e_0$ and $e_1$, we use the cosine similarity, which ranges from -1 (very different) to 1 (very similar), given by:

$$\mathrm{sim}(w_0, w_1) = \frac{e_0 \cdot e_1}{|e_0|\,|e_1|}$$

where $e_0 \cdot e_1$ is the dot product of $e_0$ and $e_1$, and $|e_0|$ and $|e_1|$ are the magnitudes of $e_0$ and $e_1$. Recall that the dot product of vectors $u$ and $v$ of length $n$ is given by $u \cdot v = u_0 v_0 + u_1 v_1 + \dots + u_{n-1} v_{n-1}$, and the magnitude of a vector $u$ of length $n$ is given by $|u| = \sqrt{u \cdot u} = \sqrt{u_0^2 + u_1^2 + \dots + u_{n-1}^2}$. (See the small numpy sketch after this list.)
5. Display the running times required to build the table (item 2) and to compute the similarities (item 4). Do not include the time required for displaying results. Use a large enough word file for item 4 in order to derive meaningful results.
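As a small illustration of items 4 and 5, the sketch below computes the cosine similarity with numpy and times a block of work with time.perf_counter. The function name and the surrounding structure are assumptions, not a prescribed implementation.

    import time
    import numpy as np

    def cosine_similarity(e0, e1):
        # e0 and e1 are the length-50 numpy embeddings of the two words.
        return np.dot(e0, e1) / (np.linalg.norm(e0) * np.linalg.norm(e1))

    start = time.perf_counter()
    # ... retrieve embeddings from the table and compute similarities here ...
    elapsed = time.perf_counter() - start   # measure only the work above,
                                            # not the time spent printing results
    print("Running time for query processing:", elapsed)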
Use the program hash_table_chain_strings.py provided on the class web page as a starting point for your hash table implementation. It shows how to compute hash values for strings and how to include multiple fields in a hash table node.
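For reference, a minimal sketch of a hash table with chaining that follows the resizing policy described in item 2 is shown below. The class name, the string hash function, and the statistics helper are illustrative assumptions, not the contents of hash_table_chain_strings.py.

    import numpy as np

    class HashTableChain:
        def __init__(self, size=101):               # assumed prime initial size
            self.bucket = [[] for _ in range(size)]
            self.num_items = 0                       # avoids recomputing the load factor

        def h(self, word):
            # Simple polynomial string hash; the starter file shows its own version.
            v = 0
            for c in word:
                v = (v * 31 + ord(c)) % len(self.bucket)
            return v

        def insert(self, word, embedding):
            # Each node is a list of size two: [word, embedding].
            self.bucket[self.h(word)].append([word, embedding])
            self.num_items += 1
            if self.num_items >= len(self.bucket):   # load factor reached 1
                self._resize(2 * len(self.bucket) + 1)

        def _resize(self, new_size):
            old_buckets = self.bucket
            self.bucket = [[] for _ in range(new_size)]
            for b in old_buckets:                    # rehash every stored item
                for word, embedding in b:
                    self.bucket[self.h(word)].append([word, embedding])

        def retrieve(self, word):
            for node in self.bucket[self.h(word)]:
                if node[0] == word:
                    return node[1]
            return None

        def stats(self):
            lengths = np.array([len(b) for b in self.bucket])
            return {
                "table size": len(self.bucket),
                "load factor": self.num_items / len(self.bucket),
                "percent empty lists": 100.0 * float(np.mean(lengths == 0)),
                "std of list lengths": float(np.std(lengths)),
            }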
As usual, write a report describing your work. Discuss the efficiency of the two methods being compared and also comment on word embeddings as a way of representing words.
Appendix
Sample run 1:
Choose table implementation
Type 1 for binary search tree or 2 for hash table with chaining
Choice: 1
Building binary search tree
Binary Search Tree stats:
Number of nodes: XXX
Height: XXX
Running time for binary search tree construction: XXX
Reading word file to determine similarities
Word similarities found:
Similarity [bear,bear] = 1.0000
Similarity [barley,shrimp] = 0.5353
Similarity [barley,oat] = 0.6696
Similarity [federer,baseball] = 0.2870
Similarity [federer,tennis] = 0.7168
Similarity [harvard,stanford] = 0.8466
Similarity [harvard,utep] = 0.0684
Similarity [harvard,ant] = -0.0267
Similarity [raven,crow] = 0.6150
Similarity [raven,whale] = 0.3291
Similarity [spain,france] = 0.7909
Similarity [spain,mexico] = 0.7514
Similarity [mexico,france] = 0.5478
Similarity [mexico,guatemala] = 0.8114
Similarity [computer,platypus] = -0.1277
Running time for binary search tree query processing: XXX
Sample run 2:
Choose table implementation
Type 1 for binary search tree or 2 for hash table with chaining
Choice: 2
Building hash table with chaining
Hash table stats:
Initial table size: XXX
Final table size: XXX
Load factor: XXX
Percentage of empty lists: XXX
Standard deviation of the lengths of the lists: XXX
Reading word file to determine similarities
Word similarities found:
Similarity [bear,bear] = 1.0000
Similarity [barley,shrimp] = 0.5353
Similarity [barley,oat] = 0.6696
Similarity [federer,baseball] = 0.2870
Similarity [federer,tennis] = 0.7168
Similarity [harvard,stanford] = 0.8466
Similarity [harvard,utep] = 0.0684
Similarity [harvard,ant] = -0.0267
Similarity [raven,crow] = 0.6150
Similarity [raven,whale] = 0.3291
Similarity [spain,france] = 0.7909
Similarity [spain,mexico] = 0.7514
Similarity [mexico,france] = 0.5478
Similarity [mexico,guatemala] = 0.8114
Similarity [computer,platypus] = -0.1277
Running time for hash table query processing: XXX