
Lab # 5 Solution

Natural Language Processing (NLP) is the subfield of artificial intelligence that deals with designing algorithms, programs, and systems that can understand human languages in written and spoken forms. Word embeddings are a recent advance in NLP that consists of representing words by vectors (or arrays) of floating point numbers in such a way that if two words are similar, their embeddings are also similar. See https://nlp.stanford.edu/projects/glove/ for an overview of this interesting research.




In order to work in real time, NLP applications such as Siri and Alexa use hash tables to efficiently retrieve the embeddings given their corresponding words. In this lab you will implement a simple version of this. The web page mentioned above contains links to files that contain word embeddings of various lengths for various vocabulary sizes. Use the file glove.6B.50d.txt (included in the file glove.6B.zip), which contains word embeddings of length 50 for a very large number of words. Each line in the file starts with the word being described, followed by 50 floating point numbers that represent the word's vector description (the embedding). The words are ordered by frequency of usage, so "the" is the first word. Some "words" do not start with an alphabetic character (for example "," and "."); feel free to ignore them.
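For concreteness, here is a minimal sketch of how each line of glove.6B.50d.txt could be parsed into a word and its numpy embedding, skipping "words" that do not start with a letter as described above. The function name and the details of the loop are illustrative assumptions, not requirements:

import numpy as np

def read_glove_line(line):
    # Each line is: word followed by 50 floats, separated by spaces.
    parts = line.split()
    word = parts[0]
    embedding = np.array(parts[1:], dtype=np.float64)  # length-50 vector
    return word, embedding

# Assumes the file is in the working directory and UTF-8 encoded.
with open('glove.6B.50d.txt', encoding='utf-8') as f:
    for line in f:
        word, embedding = read_glove_line(line)
        if word[0].isalpha():
            pass  # insert [word, embedding] into the chosen table here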




Your task for this lab is to compare the running times of two table implementations used to retrieve word embeddings, enabling the (hopefully fast) comparison of two given words. One of your table implementations should use a binary search tree; the other should use a hash table with chaining. See the appendix for sample runs of your program.




Your program should do the following:




1. Prompt the user to choose a table implementation (binary search tree or hash table with chaining).



2. Read the file "glove.6B.50d.txt" and store each word and its embedding in a table with the chosen implementation. Each node in the BST or hash table must consist of a list of size two, containing the word (a string) and the embedding (a numpy array of length 50). For the hash table, choose a prime number for your initial table size and increase the size to twice the current size plus one every time the load factor reaches 1. Caution: do NOT recompute the load factor every time an item is inserted into the table; instead, add a num_items field to your hash table class. A sketch of the chaining and rehashing logic appears after this list.



3. Compute and display statistics describing your table. See the appendix for examples for both implementations. Feel free to suggest others.



4. Read another file containing pairs of words (two words per line) and, for every pair, find and display the "similarity" of the two words. To compute the similarity between words w0 and w1, with embeddings e0 and e1, we use the cosine similarity, which ranges from -1 (very different) to 1 (very similar), given by:

sim(w0, w1) = (e0 · e1) / (|e0| |e1|)

where e0 · e1 is the dot product of e0 and e1, and |e0| and |e1| are the magnitudes of e0 and e1. A sketch of this computation appears after this list.




Recall that the dot product of vectors u and v of length n is given by

u · v = u_0 v_0 + u_1 v_1 + ... + u_{n-1} v_{n-1}

and the magnitude of a vector u of length n is given by

|u| = sqrt(u · u) = sqrt(u_0^2 + u_1^2 + ... + u_{n-1}^2).




5. Display the running times required to build the table (item 2) and to compute the similarities (item 4). Do not include the time required for displaying results. Use a large enough word file for item 4 in order to derive meaningful results; the timing sketch below shows one way to exclude the display time.
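The following is a minimal sketch (not the required structure) of the cosine similarity from item 4 and the timing pattern from item 5. It assumes a lookup function that retrieves an embedding from whichever table was built, and a list of word pairs already read from the word file:

import time
import numpy as np

def cosine_similarity(e0, e1):
    # sim(w0, w1) = (e0 . e1) / (|e0| |e1|)
    return np.dot(e0, e1) / (np.linalg.norm(e0) * np.linalg.norm(e1))

def time_queries(pairs, lookup):
    # 'pairs' is a list of (word0, word1) tuples; 'lookup' returns the
    # embedding stored in the chosen table for a given word.
    results = []
    start = time.perf_counter()
    for w0, w1 in pairs:
        e0, e1 = lookup(w0), lookup(w1)
        results.append((w0, w1, cosine_similarity(e0, e1)))
    elapsed = time.perf_counter() - start   # query time only
    for w0, w1, s in results:               # display after timing stops
        print('Similarity [%s,%s] = %.4f' % (w0, w1, s))
    return elapsed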



Use the program hash table chain strings.py provided on the class web page as a starting point for your hash table implementation. It shows how to compute hash values for strings and how to include multiple fields in a hash table node.
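For items 2 and 3, here is a minimal sketch of a hash table with chaining that rehashes when the load factor reaches 1 and reports some of the statistics shown in the appendix. The class name, the string hash, and the stats formulas are assumptions, not requirements; the provided starting-point program may organize things differently:

import numpy as np

class HashTableChain:
    def __init__(self, size=101):           # initial size: a prime number
        self.bucket = [[] for _ in range(size)]
        self.num_items = 0                   # tracked so the load factor is never recomputed from scratch

    def h(self, word):
        # Simple polynomial string hash (illustrative choice).
        v = 0
        for c in word:
            v = (v * 31 + ord(c)) % len(self.bucket)
        return v

    def insert(self, word, embedding):
        # Each item is a list of size two: [word, embedding].
        self.bucket[self.h(word)].append([word, embedding])
        self.num_items += 1
        if self.num_items >= len(self.bucket):   # load factor reached 1
            self._rehash()

    def _rehash(self):
        old_buckets = self.bucket
        # New size: twice the current size plus one.
        self.bucket = [[] for _ in range(2 * len(old_buckets) + 1)]
        for b in old_buckets:
            for word, embedding in b:
                self.bucket[self.h(word)].append([word, embedding])

    def retrieve(self, word):
        for item in self.bucket[self.h(word)]:
            if item[0] == word:
                return item[1]
        return None

    def stats(self):
        # Example statistics for item 3 (assumed formulas).
        lengths = np.array([len(b) for b in self.bucket])
        return {'table size': len(self.bucket),
                'load factor': self.num_items / len(self.bucket),
                'percent empty lists': 100.0 * float(np.mean(lengths == 0)),
                'std of list lengths': float(np.std(lengths))}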




As usual, write a report describing your work. Discuss the efficiency of the two methods being compared and also comment on word embeddings as a way of representing words.




Appendix




Sample run 1:




Choose table implementation




Type 1 for binary search tree or 2 for hash table with chaining




Choice: 1




Building binary search tree




Binary Search Tree stats:




Number of nodes: XXX




Height: XXX




Running time for binary search tree construction: XXX




Reading word file to determine similarities




Word similarities found:




Similarity [bear,bear] = 1.0000




Similarity [barley,shrimp] = 0.5353




Similarity [barley,oat] = 0.6696




Similarity [federer,baseball] = 0.2870




Similarity [federer,tennis] = 0.7168




Similarity [harvard,stanford] = 0.8466




Similarity [harvard,utep] = 0.0684




Similarity [harvard,ant] = -0.0267




Similarity [raven,crow] = 0.6150




Similarity [raven,whale] = 0.3291




Similarity [spain,france] = 0.7909




Similarity [spain,mexico] = 0.7514




Similarity [mexico,france] = 0.5478




Similarity [mexico,guatemala] = 0.8114




Similarity [computer,platypus] = -0.1277




Running time for binary search tree query processing: XXX




Sample run 2:




Choose table implementation




Type 1 for binary search tree or 2 for hash table with chaining




Choice: 2




Building hash table with chaining




Hash table stats:




Initial table size: XXX




Final table size: XXX

Load factor: XXX




Percentage of empty lists: XXX




Standard deviation of the lengths of the lists: XXX




Reading word file to determine similarities




Word similarities found:




Similarity [bear,bear] = 1.0000




Similarity [barley,shrimp] = 0.5353




Similarity [barley,oat] = 0.6696




Similarity [federer,baseball] = 0.2870




Similarity [federer,tennis] = 0.7168




Similarity [harvard,stanford] = 0.8466




Similarity [harvard,utep] = 0.0684




Similarity [harvard,ant] = -0.0267




Similarity [raven,crow] = 0.6150




Similarity [raven,whale] = 0.3291




Similarity [spain,france] = 0.7909




Similarity [spain,mexico] = 0.7514




Similarity [mexico,france] = 0.5478




Similarity [mexico,guatemala] = 0.8114




Similarity [computer,platypus] = -0.1277




Running time for hash table query processing: XXX
