Problem description
Main objective: We’re going to do a word analysis of all the compositions of William Shakespeare. Get the full text of all compositions of William Shakespeare from the course website, on the HW page. The file is called shakespeare.txt. (It was obtained from Project Gutenberg, so there are some copyright messages in it.) It is vital that you download the file and save it. Do not copy and paste after clicking the link, since that does funky things with whitespace.
We want to answer the following queries from the text.
For any word length ℓ and number k, what is the k-th most frequent word of length ℓ?
Format and Output: You should provide a Makefile. On running make, it should create "Bard.jar". You should run the jar with two command line arguments: the first is an input file, the second is the output file.
Each line of the input file corresponds to a new query. The query is a pair of numbers: LENGTH RANK. The first number is the length of the word, the second number is the rank, which starts from 0. The ranking is done in decreasing order of frequency, and increasing lexicographic order. Thus, if two words have the same frequency, then the word that is earlier lexicographically has the smaller (earlier) rank.
So, "7 0" refers to the most frequent word of length 7. Or, "9 3" refers to the fourth most frequent word of length 9.
The output for each line is the corresponding word. If no word exists, the output is '-'. This can happen if the Bard decided not to use any words of that length, or there is no word of that rank. For example, if the input file is:
10 0
9 2
8 14
8 15
0 0
The output file is:
gloucester
messenger
business
personal
-
So "gloucester" is the most frequent word of 10 letters, etc. It turns out that both "business" and "personal" (both with 8 letters) have the same frequency of 230, but we rank "business" earlier than "personal".
Instructions on parsing: It is very important that you follow these instructions. In general, parsing literary texts is a fairly messy task, since we have to make numerous decisions on how to split words. There is no one right way, but you need to follow my way so that your output matches mine.
The short answer on how to parse is to tokenize by whitespace and any of the following punctuation marks/symbols:
Question mark (?)
Comma (,)
Period (.)
Exclamation mark (!)
Colon (:)
Semicolon (;)
Square brackets ([ or ])
Here’s one way of doing this. It is probably convenient to read the file line by line, so you will get a string (say) with a line. You first replace each of the above symbols by a whitespace (like a tab). This can be done by the "replace" method for strings. (https://www.javatpoint.com/java-string-replace)
Then, you tokenize the resulting string by whitespace. You can use the "split" method. If the resulting string was called str, the command tokens = str.trim().split("\\s+") would produce a string array tokens obtained by splitting str by all whitespace. The trim method just removes any whitespace at the beginning or end, to prevent empty strings from appearing in tokens.
Each element (string) in tokens is a word. Furthermore, we will convert all words to lowercase, using the method toLowerCase().
Note that we are not splitting by hyphens, apostrophes, or quotation marks.
One can debate the merits of this, but please follow this scheme for consistency.
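To make this concrete, here is a minimal sketch of the parsing step, assuming one line of text at a time. The helper name tokenize is mine, not something the checking script requires:

    static String[] tokenize(String line) {
        // Replace each listed punctuation mark/symbol by a space.
        String cleaned = line.replace('?', ' ').replace(',', ' ').replace('.', ' ')
                             .replace('!', ' ').replace(':', ' ').replace(';', ' ')
                             .replace('[', ' ').replace(']', ' ');
        String trimmed = cleaned.trim();
        if (trimmed.isEmpty()) {
            return new String[0];          // a blank line contains no words
        }
        // Split on any run of whitespace, then lowercase every token.
        String[] tokens = trimmed.split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            tokens[i] = tokens[i].toLowerCase();
        }
        return tokens;
    }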
Suggestions on coding: You can use any data structures you want, including any built-in Java type/class. Hashtables are incredibly useful for this problem, so you will want to check out Java's Hashtable class (https://docs.oracle.com/javase/7/docs/api/java/util/Hashtable.html).
It is possible to pass through the file once, parse as described above, and produce a hashtable that stores all words with their frequencies. Now, you need to iterate over the hashtable to collect, for each length, all words of that length. Then, you can sort these words by decreasing order of frequency, and break ties in frequency using (increasing) lexicographic order. Comparators are a convenient way to do this, and there are numerous examples online for this. I liked the following: https://www.tutorialspoint.com/java/java_using_comparator.htm. Basically, if you define a class for words that stores frequencies, you can define a compare method that compares two objects of this class. Then, you can call Java's built-in sort function on any ArrayList of objects through Collections.sort. Check out the link above for doing this.
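As a rough sketch of that pipeline: the names rankWordsByLength and byLength are placeholders I made up, it reuses the tokenize helper from the parsing sketch above, and the comparator is written as a lambda directly on the strings rather than as a separate word class (the class-based approach from the link works just as well):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Hashtable;

    // Builds a table mapping each word length to its words, sorted by rank.
    static Hashtable<Integer, ArrayList<String>> rankWordsByLength(String path)
            throws IOException {
        // One pass over the file: count every word's frequency.
        Hashtable<String, Integer> counts = new Hashtable<>();
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = in.readLine()) != null) {
                for (String word : tokenize(line)) {
                    Integer c = counts.get(word);
                    counts.put(word, (c == null) ? 1 : c + 1);
                }
            }
        }
        // Group the words by length.
        Hashtable<Integer, ArrayList<String>> byLength = new Hashtable<>();
        for (String word : counts.keySet()) {
            byLength.computeIfAbsent(word.length(), k -> new ArrayList<String>()).add(word);
        }
        // Sort each group: decreasing frequency, ties broken by increasing
        // lexicographic order.
        for (ArrayList<String> words : byLength.values()) {
            Collections.sort(words, (a, b) -> {
                int byFreq = counts.get(b) - counts.get(a);
                return (byFreq != 0) ? byFreq : a.compareTo(b);
            });
        }
        return byLength;
    }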
If you set this all up, then the actual task of producing the output is quite trivial. For each length, you already have the words sorted by rank, so you do a single lookup for each query in the input, as in the sketch below.
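Concretely, answering one LENGTH RANK query against the byLength table from the sketch above might look like this (again, the method name is just a placeholder):

    // Returns the word of the given length and rank, or "-" if there is none.
    static String answer(Hashtable<Integer, ArrayList<String>> byLength,
                         int length, int rank) {
        ArrayList<String> words = byLength.get(length);
        if (words == null || rank < 0 || rank >= words.size()) {
            return "-";
        }
        return words.get(rank);
    }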
Example commands and outputs: On the HW website, you will find a file Examples.zip. Download and unzip it. There are a number of input and corresponding output files.
test-input.txt, test-output.txt. If you run your program with input file test-input.txt, the output file should be exactly the same as test-output.txt. This will be used by the checking script.
more-input.txt, more-output.txt. This is a pretty comprehensive suite of tests. If you get this correct, you’re reasonably guaranteed to be correct overall.
Grading
Your code should terminate within 3 minutes for any input file with, say, at most 100 queries. If it doesn't, we will not give you credit. It's quite hard to give partial credit for this problem. You'll notice that once you get a few corner cases correct, your code will completely work.
(10 points) For a full solution as described above.
(3 points) Only answers word frequency queries.
Fun facts
Have some fun with this assignment. Can you find out the number of unique words that dear Bill used? What if you also tokenize by hyphens? What changes?
Some things that I learned.
Not surprisingly, "the" is the most common three-letter word.
The longest word without hyphens is "honorificabilitudinitatibus". The coolest word, IMHO, is "anthropophaginian".
"caesar" is more frequent than "brutus", and both beat "othello". Clearly, "tennis" was played in Elizabethan times.
There are two "impossibilities" in all of Shakespeare's work.