Programming Assignment 2: Dictionary Stat Generator So

Starting from:

~~$35~~

$29

This assignment is designed to provide you some experience with C programming. Your task is to write a program that reads in a collection of dictionary and data le pairs and generates statistics about each pair. There are two parts to this assignment. The rst part of the assignment is for regular credit and the second part of the assignment is extra credit.

First Part (100 points)

Your task in the rst part is to write a C program called first that reads an input le, which contains a list of dictionary and data le pairs, and generates statistics for each pair.

Each line in your input le contains the names of the dictionary and data les. You have to read the les and generate the following statistics:

1. For every word w in the dictionary le, count the number of words w0 that occur in the data le such that w0 = w.
2. For every word w in the dictionary le, count the number of words w0 that occur in the data le such that w is a proper pre x of w0 (we shall say w0 is a superword of w).

Write all the unique words in the dictionary along with these counts to the output le in lexicographical (i.e. alphabetical) order.

De nition of a word: Any string of characters can be a word. For example in the sentence:

"a&ab&abc234 dfg"

the words are:

a, ab, abc, dfg

They do not need to be meaningful. Words, in both dictionary and data les, correspond to the longest continuous sequence of letters read from the le. Another way to say this is that words are any sequence of letters separated by non-letter characters (punctuation, numbers, whitespace, etc.). Each unique word is case-insensitive. That is, \boOK", \Book" and \bOOk" are all occurrences of the same (unique) word \book".

Case-insensitivity also applies when matching pre xes: for example, both \Boo" and "bOo" are proper pre xes of \bOOk", which itself is a proper pre x of \booKING" (See the example below).

As an example, suppose the content of the dictionary le is:

boo22$Book5555bOoKiNg#bOo#TeX123tEXT(JOHN)

and that of the data le is:

John1TEXAN4isa1BOoRiSH%whohasa2bo3KING BOOKING bOoKIngs$12for a TEX-Text(BOOKS(textBOOKS)

Then, the various counts for the unique words in the dictionary le are:

1

Unique words
No. of occurrences
No. of superwords
boo
0
4
book
0
3
booking
1
1
tex
1
3
text
1
1
john
1
0

Input/Output speci cation

Usage interface

The program first should have the following usage interface:

first <mappingfile>

where <mappingFile> is the name of the mapping le. You can assume that the mapping le will exist and that it is well structured. So, unlike assignment1, you don’t need to check the structure of the mapping le

If no argument or more than one argument is provided, or the le names provided are invalid, the program should print \invalid input" and abort.

Input speci cation

Here, and below, let m be the maximum number of words in either the dictionary or data les. Every word is of length at most k. Let n be the number of unique words in the dictionary le.

Your input will be a mapping le, which contains lines of the form: hdictFilei hdataFilei, where dictFile and dataFile are the dictionary and data les for your program, respectively. An example of a mapping le is given below:

dict_1 data_1

dict_m data_m

The les: dict 1.txt and dict m.txt are dictionary les and les: data 1.txt and data m.txt are data les. They are plain text les with no special structure.

Output speci cation

Your program should generate several output les outi.txt, where i is the line number in mapping le. It means that you need to get mapping le as an argument to your program. Then each line in the mapping le has information about the dictionary le and the data le.

For example suppose line j in the mapping le is dict j data j. In this case you should produce outj.txt, which contains the described informations. Remember that you shouldn’t have any spaces at the end of the lines in your output les. Also, you shouldn’t have any empty lines in your outputs les . The program should write all the unique words (see de nition above) that occur in the dictionary le along with their various counts (See above), in lexicographical order, one word per line to the output les i.e. the output should have exactly n lines:

<word1> <count11> <count12>

<word2> <count21> <count22>

.

.

.

<wordn> <countn1> <countn2>

such that

1. <word1>, <word2>, <word3> : : : <wordn> is the lexicographical ordering of the unique words occurring in the dictionary,

2

2. <counti1> denotes the number of occurrences of <wordi> in the data le,

3. <counti2> denotes the number of proper superwords of <wordi> in the data le,

4. Line i is <wordi><space><counti1><space><counti2>

Note that you can assume the dictionaries are non-empty.

Also note that if the mapping le has more than one line, you need to do same proce-dure for each line and create the result le (outi.txt) of every line.

For example, running first on the input described above should produce the following output:

boo 0 4

book 0 3

booking 1 1

john 1 0

tex 1 3

text 1 1

Second Part (Extra Credit - 50 points)

Your task in this second part is to write a C program called second that reads a input le, which contains a list of dictionary and data le pairs, and generates statistics for each pair.

Each line in your input le contains the names of the dictionary and data les. You have to read the dictionary and data les and generate the following statistics:

1. For every word w in the dictionary le, count the number of words w0 that occur in the data le such that w0 = w.

2. For every word w in the dictionary le, count the number of words w0 that occur in the data le such that w0 is a proper pre x of w.

Write all the unique words in the dictionary along with these counts to the output le in lexicographical (i.e. alphabetical) order. A word is de ne the same way as in the rst part.

Case-insensitivity also applies when matching pre xes: for example, both \Boo" and "bOo" are proper pre xes of \bOOk", which itself is a proper pre x of \booKING" (See the example below).

As an example, suppose the content of the dictionary le is:

boo22$Book5555bOoKiNg#bOo#TeX123tEXT(JOHN)

and that of the data le is:

John1TEXAN4isa1BOoRiSH%whohasa2bo3KING BOOKING bOoKIngs$12for a TEX-Text(BOOKS(textBOOKS)

Then, the various counts for the unique words in the dictionary le are:

Unique words
No. of occurrences
No. of pre xes
boo
0
1
book
0
1
booking
1
1
tex
1
0
text
1
1
john
1
0

3
Input/Output speci cation

Usage interface

The program second should have the following usage interface:

second <mappingFile>

where <mappingFile> is the name of the mapping le. You can assume that the mapping le will exist and that it is well structured. So, unlike assignment1, you don’t need to check the structure of the mapping le.

In case no argument or more than one argument is provided, or the le names provided are invalid, the program should print \invalid input" and abort.

Input speci cation

Here, and below, let m be the maximum number of words in either the dictionary or data les. Every word is of length at most k. Let n be the number of unique words in the dictionary le.

Your input will be a mapping le, which contains lines of the form: <dictFile> <dataFile> dictFile and dataFile are the dictionary and data les for your program, respectively. They are plain text les with no special structure.

Output speci cation

Your program should generate several output les outi.txt, where i is the line number in mapping le. It means that you need to get mapping le as an argument to your program. Then each line in the mapping le has information about the dictionary le and the data le.

For example suppose line j in the mapping le is dict j data j. In this case you should produce outj.txt, which contains the described informations. Remember that you shouldn’t have any spaces at the end of the lines in your output les. Also, you shouldn’t have any empty lines in your outputs les . The program should write all the unique words (see de nition above) that occur in the dictionary le along with their various counts (See above), in lexicographical order, one word per line to the output les i.e. the output should have exactly n lines:

<word1> <count11> <count12>

<word2> <count21> <count22>

.

.

.

<wordn> <countn1> <countn2>

such that

1. <word1>, <word2>, <word3> : : : <wordn> is the lexicographical ordering of the unique words occurring in the dictionary,

2. <counti1> denotes the number of occurrences of <wordi> in the data le,

3. <counti2> denotes the number of proper pre xes of <wordi> in the data le,

4. Line i is <wordi><space><counti1><space><counti2>

Note that you can assume the dictionaries are non-empty.

Also note that if the mapping le has more than one line, you need to do same proce-dure for each line and create the result le (outi.txt) of every line.

For example, running second on the input describe above should produce the following output:

boo 0 1

book 0 1

4

booking 1 1

john 1 0

tex 1 0

text 1 1

Design & Implementation

Design

We recommend using the data structure \trie" (http://en.wikipedia.org/wiki/Trie), though you are free to use whatever data structures and algorithms you want as long as your implementation is reasonably e cient (See the Grading Section for detailed e ciency requirements).

Implementation

As an additional requirement, your code should implement and use the following functions:

void readDict(FILE *dict file): The function takes a le pointer to the dictionary le and reads the unique words from the dictionary le and stores them in an appropriate data structure.

void matchStr(char* str): This function will take a word and search the data structure that holds the unique dictionary words in order to nd matches and to update the various counts. This function should be used while scanning the data le for occurrences of dictionary words and their pre xes and superwords.

void printResult(): This function will produce the output of the program.

You are free to add any other functions you need.

You are allowed to use functions from standard libraries (e.g., strcmp()) but you cannot use third-party libraries downloaded from the Internet (or from anywhere else). If you are unsure whether you can use something, ask us.

We will compile and test your program on the iLab machines so you should make sure that your program compiles and runs correctly on these machines. You must compile all C code using the gcc compiler with the -Wall -Werror -fsanitize=address ags.

Submission

You have to e-submit the assignment using Sakai. Your submission should be a tar le named pa2.tar. To create this le, put everything that you are submitting into a directory named pa2. Then, cd into the directory containing pa2 (that is, pa2’s parent directory) and run the following command:

tar cvf pa2.tar pa2

To check that you have correctly created the tar le, you should copy it (pa2.tar) into an empty directory and run the following command:

tar xvf pa2.tar

This should create a directory named pa2 in the (previously) empty directory.

The pa2 directory in your tar le must contain 2 folders( rst and second). If you don’t want to do extra credit part you can leave second folder empty. Each folder should have:

readme.pdf: This le should brie y describe the main data structures being used in your program, a big O analysis of the run time and space requirements of your program in terms of the parameters k,m, and n (See the First Input/Output and Second Input/Output sections above), and any challenges that you encounter in this assignment.

5

Makefile: There should be at least two rules in your Make le:

1. first: build your first executable. (Or second, if you are doing the extra credit.)

2. clean: prepare for rebuilding from scratch.

source code: all source code les necessary for building your programs. Your code should contain at least two les: first.h and first.c.

If you want to do extra credit part, you need to do all of the previous tasks for second part.

Grading guidelines

The grading will be done in two phases, an automated phase and a manual phase.

Automated grading phase

This phase will be based on programmatic checking of your program. We will build a binary using the Make le and source code that you submit, and then test the binary for correct functionality and e ciency against a set of inputs. Correct functionality and e ciency include, but are not limited to, the following:

1. We should be able build your program by just running make.

2. Your program should follow the format speci ed above for the usage interface.

3. Your program should strictly follow the input and output speci cations mentioned above. (Note: This is perhaps the most important guideline: failing to follow it might result in you losing all or most of your points for this assignment. Make sure your program’s output format is exactly as speci ed. Any deviation will cause the automated grader to mark your output as \incorrect". REQUESTS FOR RE-EVALUATIONS OF PROGRAMS REJECTED DUE TO IMPROPER FORMAT WILL NOT BE ENTERTAINED.)

4. There will be three types of test cases: test cases for checking I/O speci cations (as discussed in points 1-3) (class A test cases), small test cases for checking correctness (class B test cases), and big test cases that are designed to test the e ciency of your program (class C test cases): if you are simply storing all the dictionary words in an array and then matching every word in the data le against every dictionary word by doing a linear search, your program will most likely exceed the time limit when run on the big test cases.

5. We recommend that the running time and space complexity of your programs be roughly O(mk) and O(nk) respectively. Any correct program whose complexity matches the recommended values is guaranteed to work on all of the big test cases (Note: Requests for re-evaluations of programs that fail on big test cases will not be entertained).

Note: This phase will also involve detecting copying, if any. If two submissions are found to be identical, they will instantly be awarded zero points.

Manual grading phase

This phase will involve manual inspection of your code and the readme.pdf le provided by you along with the code. Note that only programs that pass the class A and class B test cases will be eligible for grading in the manual phase. Here are some guidelines you should follow in order to score well in this phase:

1. Explain your design (choice of data structure and algorithms) and the big O analysis carefully in the readme.pdf le.

2. Modularize your code i.e. split your code into pieces that make sense, where the pieces are neither too big nor too small.

6

3. Make sure your code is well-documented with comments. This does not mean that you should comment every line of code. Common practice is to document each function (the parameters it takes as input, the results produced, any side-e ects, and the function’s functionality) and add comments in the code where it will help another programmer gure out what is going on.

4. Use variable names that have some meaning (rather than cryptic names like strA).

5. Additionally, the following will make it easier for us to go through your code:

De ne prototypes for all functions.

Place all prototype, typedef & struct de nitions in header (.h) les.

Autograder

We provide the AutoGrader to test your assignment. AutoGrader is provided as autograder.tar.gz.

Executing the following command will create the autograder folder.

$tar zxvf autograder.tar.gz

There are two modes available for testing your assignment with the AutoGrader.

First mode

Testing when you are writing code with a pa2 folder

1. Lets say you have a pa2 folder with the directory structure as described in the assignment.

2. Copy the folder to the directory of the autograder

3. Run the autograder with the following command $python auto grader.py

It will run the test cases and print your scores.

Second mode

This mode is to test your nal submission (i.e, pa2.tar)

1. Copy pa2.tar to the autograder directory

2. Run the autograder with pa2.tar as the argument. The command line is $python auto grader.py pa2.tar

7