Programming Assignment 3: Indexer Solution

Starting from:

~~$29.99~~

$23.99

Home

Warning: As you will see below, the descriptions of the assignments will be increasingly complex because we are asking you to build increasingly bigger programs. Make sure to read the assignment carefully! This is critical because this document essentially describes the requirements for your program.

1. Introduction

In this assignment you will practice using the file system API (as well as pointers in different data structures). In particular you will be creating, opening, reading, writing, and deleting files.

Your task is to write an indexing program, called an indexer. Given a set of files, an indexer will parse the files and create an inverted index, which maps each token found in the files to the subset of files that contain that token. In your indexer, you will also maintain the frequency with which each token appears in each file.

Here is an example of how the indexer should work. If you are given the following set of files:

File Name            File Content

boo                       A dog name name Boo baa                        A cat name Baa

The indexer should read the files and produce the following inverted index, in sorted order by word:

“a”    (“boo”, 1), (“baa”, 1) “baa”    (“baa”, 1)

“boo”    (“boo”, 1) “cat”    (“baa”, 1) “dog”    (“boo”, 1)

“name”    (“boo”, 2), (“baa”, 1)

After constructing the entire inverted index in memory, the indexer will save it to a file. Some observations:

An inverted index is just a sequence of mappings, where each mapping maps a token (e.g., “dog”) to a list of records, with each record containing the name of a file whose content contains the token and the frequency with which the token appears in the file.

The above depiction just gives a logical view of the inverted index. In your program, you have to define data structures to hold the mappings (token       list) and the records (file name, count).

The mappings are maintained in sorted order of the tokens. You will see later why this is

useful. Sorting in ascending or descending order doesn’t matter so much. We will just arbitrarily say for this assignment that the sequence should be maintained in ascending sorted order based on the ASCII coding of characters (i.e., “a” before “b” and “aa” before “ab”).

Records in each list are maintained in descending sorted order based on frequency counts of

the tokens in the files. Again, you will see later why this is useful.

Tokens are not case sensitive. All upper-case letters should be converted to lower-case.

The tokenizer and sorted-list that you wrote in earlier assignments may be useful for this assignment (although you have to modify the tokenizer to work with a file, rather than a string).   You also have the option of using the improved tokenizer.c file attached to this assignment.

2. Implementation

Since you are implementing a program in this assignment, there is no programming interface to follow. Instead, your program must implement the following command-line interface:

index <inverted-index file name <directory or file name

The first argument, <inverted-index file name, gives the name of a file that you should create to hold your inverted index. The second argument, <directory or file name, gives the name of the directory or file that your indexer should index. If the second argument is a directory, you need to recursively index all files in the directory (and its sub-directories). If the second argument is a file, you just need to index that single file.

When indexing files in a directory, you may have files that have the same name in separate directories. You should prepend the pathname (relative to the input directory name) in every record in the inverted index, rather than just the file name.

Tokenization is a little different in this assignment than in the previous assignment. You are not given a set of separators. Instead, we define tokens as any sequence of consecutive alphanumeric characters (a-z, A-Z, 0-9). All other characters are separators. Note that you can use the entire ASCII coding minus the alphanumeric characters as your separators to minimize the change to your tokenizer. But, this is not efficient, since the alphanumeric characters is only a small subset of ASCII. This would be even truer if we extend the character set beyond ASCII. So, you should not take this easy way out.

Examples of tokens according to the above definition include:

a, aba, c123, 1, 454

If a file contains

This an$example12 mail@rutgers

it should tokenize to

this       an       example12      mail      rutgers

The inverted index file that your indexer writes   must follow the following format, where   I’m showing each space as an _ to make it more clear:

<list_token name1_count1_name2_count2_name3_count3_name4_count4_name5_count5 name6_count6_name7_count7_name8_count8_name9_count9_name10_count10

...

</list

with the lists arranged in ascending sorted order of the tokens. Note you must obey the line breaks as shown. Each line containing the (file name, count) records can contain at most 5 records.

So, the example inverted index from Section 1 could look like:

<list a

boo 1 baa 1

</list

<list baa baa 1

</list

<list boo boo 1

</list

<list cat baa 1

</list

<list dog boo 1

<list name boo 2 baa 1

</list

This format is quite inefficient in a number of ways. We will optimize later. For now, we want to be able to easily read the inverted index for debugging.

You should carefully consider preexisting conditions and how they relate to the execution of your program. You should outline and implement a strategy to deal with these circumstances. For example, if a file already exists with the same name as the inverted-index file name, you should ask the user if they wish to overwrite it. If the name of the directory or file you are to index does not exist, your indexer should print an error message and exit gracefully rather than crash. There are many other error cases that you will need to consider.

You should use multi-file compilation to carefully organize your code. For example, the tokenizer should be in its own .c file, with a .h file that callers should include. The same applies for the sorted-list. You should also write a Makefile to efficiently compile and link your indexer.

3. Hints

Data structures that might be useful include the sorted list you just implemented and/or a hash table.

A custom record type (e.g., a record {“baa”, 3}) that can be inserted into multiple data

structures, such as a sorted list and/or a hash table).

You can use your sorted list to maintain the set of tokens in ascending order. Since we are asking for the records for each token to be sorted in descending order by frequency, you have to flip the meaning of < and in your record comparator function.

You should probably approach this in steps.

o   First, you might get your tokenizer to generate correct tokens from a file.

o   Next, you might get your program to walk through a directory.

o     Next, you might implement a data structure that allows you to count the number of occurrences of each unique token in a file.

o   And so on …

4. What to Turn In

A tarred gzipped file name pa3.tgz that contains a directory called pa3 with the following files in it: All the .h and .c files necessary to produce an executable named index.

A makefile used to compile and produce the index executable. It must have a target clean to prepare a fresh compilation of everything.

A file called testplan.txt that contains a test plan for your indexer. You should include the

example files and/or directories that you test your indexer on but keep these from being too large, please. (We might test your program with a very large data set though so don’t skip testing your program for scalability. In your test plan, you should discuss the larger scale testing and the results, but you can skip including the data set).

A readme.pdf file that describes the design of your indexer. This should also include the

usual analysis of time and space usage of your program. Starting in this assignment, you do not need analyze every single function. Rather, you need to analyze the overall program. (So, for example, analyzing initialization code is typically not too   important unless this initialization depends on the size of the inputs.)

As usual, your grade will be based on:

Correctness (how well your code is working),

Quality of your design (did you use reasonable algorithms),

Quality of your code (how well written your code is, including modularity and comments), Efficiency (of your implementation), and

Testing thoroughness (quality of your test cases).