$24
Introduction
Professor Jackson was just assigned to be the editor of a riveting textbook titled “Advanced Data Structure Implementation and Analysis”. She is super excited about the possibility of delving into the material and checking it for technical correctness. However, one of the more mundane tasks she must perform is creating an index for the book. Everyone has used the index at the back of a book before. An index organizes important words or phrases in alphabetical order together with a list of pages on which they can be found. But, who or what creates these indexes? Do humans create them? Do computers create them? As a comp sci prof, Jackson decides she wants to automate the process as much as possible because she knows that an automated indexer is faster and more accurate, and because it can be reused later when she finishes writing her own book. So as she is editing the book, she keeps a list of words on each page that should be included in the index. However, time is short, and she needs to get the book edited AND indexed quickly. She’s enlisted your help to write an AutoIndexer.
Your Task
You will implement a piece of software that can read in Professor Jackson’s keyword file (raw ASCII text with page indications), process the keyword data from the book, and output the complete index to a separate file. All of this must be done within specific implementation constraints described in the forthcoming sections.
Implementation Requirements
You’ll read from the ASCII text file generated by Prof Jackson. We’ll call this the input text file. Once you read in all of the data and process it, you’ll write the index to an output file. We’ll call this the output text file.
Input File
The input text file will contain a list of keywords and phrases from the book separated into groups based on the page each word or phrase appears on. The end of the list of keywords will be indicated by <-1 at the end of the file. If a phrase is to be indexed, the words that comprise the phrase will be surrounded by square brackets (ex: [binary search tree]). No index word or phrase will exceed 40 characters in length (not including square brackets for phrases).
Here are a few things you should know about Prof. Jackson’s messy style for keeping track of the keywords. She didn’t pay attention to letter case, so you’ll need to account for that in your program. This means that ‘tree’ and ‘Tree’ should be considered as the same word. Page numbers will appear in angle brackets (ex: <8) and will always be on their own individual line. Page number will not necessarily be in order. Because of the editing process, Jackson may accidentally repeat page numbers due to re-reading the same section multiple times. This may mean she accidentally lists a word twice on the same page. In this case, there’s no need to list the word or phrase twice in the index. A (very very) simple input text file can be found in Listing 1.
The Output Text File
The output text file will be organized in ascending order with numeric index categories appearing before alphabetic categories. Each category header will appear in square brackets followed by index entries that start with that letter in ascending alphabetic or numeric order. An index entry will consist of the indexed word, a colon, then a list of page numbers where that word was found in ascending order. No output line should be longer than 50 characters. The line should wrap before 50 characters and subsequent lines for that particular index entry should be indented 4 spaces. An example output text file can be found in Listing 2.
The Vector Class
You don't have any idea how many individual words, index entries, etc. will be present in the input data file. And since Jackson doesn't like the container classes from the c++ standard library, you can’t use the vector class that automatically grows as you insert elements into it. You'll need to implement some “data structure” that is capable of “growing” as needed. This sounds like a good place to use a vector. You’ll need to implement a vector class that should minimally include the following features/functionality:
a vector shall be able to hold any data type (template your class)
a vector shall be a contiguously allocated, homogeneously typed sequential container
a vector shall grow as needed
in other words, don’t start with an array of 500,000 elements or something like that
a vector shall minimally contain the following functionality:
add a new item to the container
access elements using the [] operator
remove an element from the container
Follow rule of three
There’s a great deal of other functionality that SHOULD be included, but this is the minimum amount needed. You should make sure your vector class is adequately tested using CATCH (see below)
Test Driven Development and the Catch Library
There are many schools of thought on the best way to write code and test it. One of these methodologies is called Test Driven Development (TDD). In TDD, the basic idea is you write the testing code for whatever things you're working on first, and then write the thing to pass the tests. For example, if you're implementing a String class, you might initially decide that String objects need to be able to be copied and printed to the screen. So, before writing any part of the string class implementation, you write some code to test copy functionality and printing functionality. Clearly, if string class hasn't been written yet, these test will fail - that's exactly what you want to happen. Initially, all your tests will fail. But as you begin writing the code to implement string, the tests will begin to pass.
Quite a few frameworks exists for TDD (and unit testing, for which TDD is a particular type). For Data Structures, we will use the CATCH Framework. While CATCH supports multiple paradigms of testing, we will use it in the TDD mindset for 2341. Please read the CATCH Framework Tutorial.
It is expected that you will follow the TDD mindset when developing the vector class. The TAs have been directed to give guidance under that paradigm. Therefore, you'll need to create your test case source files along with your project files. This will be covered in more detail in Lab this week. You are only doing yourself a disservice by not fully investing your time and energy in TDD. Every bug you tease out now is one less you'll need to worry about later.
Assumptions
You may make the following simplifying assumptions in your project:
The input file will be properly formatted according to the rules above
You need to remove punctuation from the input file words. ‘Data!!!’ and ‘data’ should be considered the same word
No line of text in the input file will contain more than 80 characters
No word or phrase will be longer than 40 characters
Different forms of the same word should be considered as individual entries in the index (e.g. run, runs, and running would each be considered individual words)
Execution
There will be two modes of execution for this project:
test mode
this mode will cause all of your Catch TDD Tests to be executed (and they should pass of course).
test mode is indicated by a -t command line argument
run mode
this mode will run the autoindexer
run mode will be indicated by a -r <input file name <output file name set of command line arguments.
If we were running test mode from the command line, it would be executed similar to:
$ ./indexer -t
If we were running run mode from the command line, it would be executed similar to:
$ ./indexer -r input.txt index.out