Natural Language Processing (NLP) is the subfield of artificial intelligence that deals with designing algorithms, programs, and systems that can understand human languages in written and spoken forms. Word embeddings are a recent advance in NLP that consists of representing words by vectors (or arrays) of floating point numbers in such a way that if two words are similar, their embeddings are also similar. See https://nlp.stanford.edu/projects/glove/ for an overview of this interesting research.
In order to work in real time, NLP applications such as Siri and Alexa use hash tables to efficiently retrieve the embeddings given their corresponding words. In this lab you will implement a simple version of this. The web page mentioned above contains links to files that contain word embeddings of various lengths for various vocabulary sizes. Use the file glove.6B.50d.txt (included in the file glove.6B.zip), which contains word embeddings of length 50 for a very large number of words. Each line in the file starts with the word being described, followed by 50 floating point numbers that represent the word's vector description (the embedding). The words are ordered by frequency of usage, so "the" is the first word. Some "words" do not start with an alphabetic character (for example "," and "."); feel free to ignore them.
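As a rough illustration of this file format, the sketch below reads glove.6B.50d.txt line by line, skips tokens that do not start with an alphabetic character, and produces (word, embedding) pairs. The function name and file path are placeholders, not part of the assignment's starter code.

    import numpy as np

    def read_embeddings(path="glove.6B.50d.txt"):
        """Yield (word, embedding) pairs from the GloVe file, skipping
        tokens that do not start with an alphabetic character."""
        with open(path, encoding="utf-8") as infile:
            for line in infile:
                parts = line.split()
                word = parts[0]
                if not word[0].isalpha():      # ignore tokens such as "," and "."
                    continue
                embedding = np.array(parts[1:], dtype=np.float32)  # length-50 vector
                yield word, embedding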
Your task for this lab is to compare the running times of two table implementations used to retrieve word embeddings, enabling the (hopefully fast) comparison of two given words. One of your table implementations should use a binary search tree; the other should use a hash table with chaining. See the appendix for sample runs of your program.
Your program should do the following:
1. Prompt the user to choose a table implementation (binary search tree or hash table with chaining).
2. Read the file "glove.6B.50d.txt" and store each word and its embedding in a table with the chosen implementation. Each node in the BST or hash table must consist of a list of size two, containing the word (a string) and the embedding (a numpy array of length 50). For the hash table, choose a prime number for your initial table size and increase the size to twice the current size plus one every time the load factor reaches 1. Caution: do NOT recompute the load factor every time an item is inserted into the table; instead, add a num_items field to your hash table class. (A minimal hash table sketch appears after these instructions.)
3. Compute and display statistics describing your table. See the appendix for examples for both implementations. Feel free to suggest others.
4. Read another file containing pairs of words (two words per line) and, for every pair of words, find and display the "similarity" of the words. To compute the similarity between words $w_0$ and $w_1$, with embeddings $e_0$ and $e_1$, we use the cosine similarity, which ranges from -1 (very different) to 1 (very similar), given by:

$$\mathrm{sim}(w_0, w_1) = \frac{e_0 \cdot e_1}{|e_0|\,|e_1|}$$

where $e_0 \cdot e_1$ is the dot product of $e_0$ and $e_1$, and $|e_0|$ and $|e_1|$ are the magnitudes of $e_0$ and $e_1$. Recall that the dot product of vectors $u$ and $v$ of length $n$ is given by $u \cdot v = u_0 v_0 + u_1 v_1 + \dots + u_{n-1} v_{n-1}$, and the magnitude of a vector $u$ of length $n$ is given by $|u| = \sqrt{u \cdot u} = \sqrt{u_0^2 + u_1^2 + \dots + u_{n-1}^2}$. (See the small numpy sketch after this list.)
5. Display the running times required to build the table (item 2) and to compute the similarities (item 4). Do not include the time required for displaying results. Use a large enough word file for item 4 in order to derive meaningful results.
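As a small illustration of items 4 and 5, the sketch below computes the cosine similarity with numpy and times a block of work with time.perf_counter. The function name and the surrounding structure are assumptions, not a prescribed implementation.

    import time
    import numpy as np

    def cosine_similarity(e0, e1):
        # e0 and e1 are the length-50 numpy embeddings of the two words.
        return np.dot(e0, e1) / (np.linalg.norm(e0) * np.linalg.norm(e1))

    start = time.perf_counter()
    # ... retrieve embeddings from the table and compute similarities here ...
    elapsed = time.perf_counter() - start   # measure only the work above,
                                            # not the time spent printing results
    print("Running time for query processing:", elapsed)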
Use the program hash_table_chain_strings.py provided on the class web page as a starting point for your hash table implementation. It shows how to compute hash values for strings and how to include multiple fields in a hash table node.
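For reference, a minimal sketch of a hash table with chaining that follows the resizing policy described in item 2 is shown below. The class name, the string hash function, and the statistics helper are illustrative assumptions, not the contents of hash_table_chain_strings.py.

    import numpy as np

    class HashTableChain:
        def __init__(self, size=101):               # assumed prime initial size
            self.bucket = [[] for _ in range(size)]
            self.num_items = 0                       # avoids recomputing the load factor

        def h(self, word):
            # Simple polynomial string hash; the starter file shows its own version.
            v = 0
            for c in word:
                v = (v * 31 + ord(c)) % len(self.bucket)
            return v

        def insert(self, word, embedding):
            # Each node is a list of size two: [word, embedding].
            self.bucket[self.h(word)].append([word, embedding])
            self.num_items += 1
            if self.num_items >= len(self.bucket):   # load factor reached 1
                self._resize(2 * len(self.bucket) + 1)

        def _resize(self, new_size):
            old_buckets = self.bucket
            self.bucket = [[] for _ in range(new_size)]
            for b in old_buckets:                    # rehash every stored item
                for word, embedding in b:
                    self.bucket[self.h(word)].append([word, embedding])

        def retrieve(self, word):
            for node in self.bucket[self.h(word)]:
                if node[0] == word:
                    return node[1]
            return None

        def stats(self):
            lengths = np.array([len(b) for b in self.bucket])
            return {
                "table size": len(self.bucket),
                "load factor": self.num_items / len(self.bucket),
                "percent empty lists": 100.0 * float(np.mean(lengths == 0)),
                "std of list lengths": float(np.std(lengths)),
            }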
As usual, write a report describing your work. Discuss the efficiency of the two methods being compared and also comment on word embeddings as a way of representing words.
Appendix
Sample run 1:
Choose table implementation
Type 1 for binary search tree or 2 for hash table with chaining
Choice: 1
Building binary search tree
Binary Search Tree stats:
Number of nodes: XXX
Height: XXX
Running time for binary search tree construction: XXX
Reading word file to determine similarities
Word similarities found:
Similarity [bear,bear] = 1.0000
Similarity [barley,shrimp] = 0.5353
Similarity [barley,oat] = 0.6696
Similarity [federer,baseball] = 0.2870
Similarity [federer,tennis] = 0.7168
Similarity [harvard,stanford] = 0.8466
Similarity [harvard,utep] = 0.0684
Similarity [harvard,ant] = -0.0267
Similarity [raven,crow] = 0.6150
Similarity [raven,whale] = 0.3291
Similarity [spain,france] = 0.7909
Similarity [spain,mexico] = 0.7514
Similarity [mexico,france] = 0.5478
Similarity [mexico,guatemala] = 0.8114
Similarity [computer,platypus] = -0.1277
Running time for binary search tree query processing: XXX
Sample run 2:
Choose table implementation
Type 1 for binary search tree or 2 for hash table with chaining
Choice: 2
Building hash table with chaining
Hash table stats:
Initial table size: XXX
Final table size: XXX
Load factor: XXX
Percentage of empty lists: XXX
Standard deviation of the lengths of the lists: XXX
Reading word file to determine similarities
Word similarities found:
Similarity [bear,bear] = 1.0000
Similarity [barley,shrimp] = 0.5353
Similarity [barley,oat] = 0.6696
Similarity [federer,baseball] = 0.2870
Similarity [federer,tennis] = 0.7168
Similarity [harvard,stanford] = 0.8466
Similarity [harvard,utep] = 0.0684
Similarity [harvard,ant] = -0.0267
Similarity [raven,crow] = 0.6150
Similarity [raven,whale] = 0.3291
Similarity [spain,france] = 0.7909
Similarity [spain,mexico] = 0.7514
Similarity [mexico,france] = 0.5478
Similarity [mexico,guatemala] = 0.8114
Similarity [computer,platypus] = -0.1277
Running time for hash table query processing: XXX