Starting from:
$35

$29

Project 9 Keywords Solution



The objective of this assignment is to write a program to analyze a
text document and extract keywords. This could be useful for
annotating/indexing a document collection, or for making Word Clouds.

For this project, you will need to use unordered_maps, also known as
hash tables. The central step is to create a map to store the counts
of words that appear in a document. Then one could sort it and print
out the most frequent words.

The problem is that the most frequent words are almost always
uninformative, uninterestng words, like 'and', 'a', and 'the'. These
are known as "stop words". One approach would be to try to filter
these out using a predefined list of common words. Instead, we are
going to use a different approach. Suppose the document comes from a
collection. For example, suppose it is one of the 10 Books from
Aritstotle's Ethics. Then we can use the frequencies of words in the
whole document collection as a reference. A word like "the" occurs
frequently in Book 2, but it also occurs frequently in all the Books.
This suggests that a different method for identifying meaningful,
representative words is to compute the ratio of the frequency of each
word in a particular document to the frequency of that word over the
whole collection.

Here is a particular implementaion of this idea. For a word W in
document D, compute the number of times W occurs in D, N(W,D), and the
total number of words in the document, N(D). Then do the same thing
for the whole document collection C - compute N(W,C) and N(C).
Next, compute the expected number of words in X based on C
as follows:

freq(W,C) = N(W,C) / N(C)
expect(W,D) = N(D) * freq(W,C)

Finally, you can compute the "relative enrichment" of
the word W in document D compared to C as:

enrich(W,D) = N(W,D) / expect(W,D)

You can then print out the words in the document sorted by enrichment.
However, if you try this literally, you might observe that the
keywords are not very compelling. The most enriched words are often
words that occur infrequently overall. An improvement is to add in
pseudo-counts to the enrichment ratio (to both the numerator and
denominator). Here is the adjusted formula:

enrich(W,D) = (N(W,D)+PC) / (expect(W,D)+PC)

Try setting PC to 5 and see if it makes the most enriched words
more interesting and representative (compared to PC=0).



About your program
------------------

Suppose your program is called 'keywords'. If you call it with one
command line argument (a text document), it will read-in the document,
count all the words, and print them out in sorted order.

In order to combine counts for variants of the same word, here is some of the
processing you should apply:

1. convert all characters to lower-case (to get rid of capitalization)
1. remove digits (e.g. including years)
2. remove punctuation: ",.?;:'!()[]$/
3. note: keep hyphens (-)

This is not perfect, but it will be good enough. It doesn't properly
handle contractions, and it does not handle variations like plurals of
nouns (e.g. 'dog' versus 'dogs') or past tense of verbs (e.g. 'say'
vs. 'said'). In real document analysis, this is called 'stemming',
but is rather complicated.

Here is an example (from the concatenation of all 10 Books):

keywords Aristotle_Ethics.txt
6842 the
4594 of
3936 and
3506 to
3487 is
2682 in
1976 a
1574 it
1514 that
1468 are
1462 be
1280 for
1279 as
1265 not
1137 but
1093 or
1019 which
...


If you now want to use this to analyze a particular document,
give the name of a second fileaname on the command line.

words Aristotle_Ethics.txt Aristotle_Ethics_Book1.txt
WORD num(doc) freq(doc) num(all) freq(all) expected ratio
goods 22 0.002425 49 0.000424 3.8 3.0523
soul 23 0.002535 54 0.000467 4.2 3.0309
happy 25 0.002756 66 0.000571 5.2 2.9469
happiness 34 0.003748 119 0.001030 9.3 2.7197
blessed 12 0.001323 18 0.000156 1.4 2.6510
manifestly 11 0.001213 14 0.000121 1.1 2.6235
life 39 0.004299 161 0.001393 12.6 2.4949
excellence 19 0.002094 62 0.000536 4.9 2.4326
...


Document Collections for Testing
--------------------------------

Two document collections are provided as .zip files that can be
downloaded from the course website: Ethics by Artistotle (10 Books),
and Essays from Ralph Waldo Emerson. These were
downloaded from Project Gutenberg. They are provided in both
concatenated form (1 file), and split up into separate documents.

To get a quick overview of the contents of the individual documents, see this:

https://en.wikipedia.org/wiki/Nicomachean_Ethics
https://en.wikipedia.org/wiki/Essays:_First_Series

More products