For this assignment, we want to process WORDS in input files and, potentially, change some of those WORDS before printing them out.
For our assignment, a WORD is a sequence of non-whitespace characters delimited by whitespace (that is, spaces, tabs, and newlines: the characters for which cctype’s isspace() returns true).
It is possible that a WORD may begin and/or end with one or more punctuation characters (what cctype’s ispunct() returns true for). Such a WORD therefore has leading and/or trailing punctuation.
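One way to picture this definition is as a split of each WORD into three pieces. The sketch below is only an illustration: the names WordParts and splitWord are hypothetical, and treating a WORD made entirely of punctuation as all leading punctuation is an assumption, not something the assignment specifies.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper type: the three pieces of a WORD.
struct WordParts {
    std::string lead;   // leading punctuation, possibly empty
    std::string core;   // the part between the punctuation, possibly empty
    std::string trail;  // trailing punctuation, possibly empty
};

// Split a WORD into leading punctuation, core, and trailing punctuation.
// Assumption: a WORD that is all punctuation ends up entirely in `lead`.
WordParts splitWord(const std::string& word) {
    std::size_t begin = 0;
    std::size_t end = word.size();
    // The unsigned char casts keep ispunct() well defined for every char value.
    while (begin < end && std::ispunct(static_cast<unsigned char>(word[begin]))) {
        ++begin;
    }
    while (end > begin && std::ispunct(static_cast<unsigned char>(word[end - 1]))) {
        --end;
    }
    return {word.substr(0, begin), word.substr(begin, end - begin), word.substr(end)};
}
```

Punctuation inside a WORD, such as the apostrophe in don't, stays in the core because only the two ends are scanned.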
A SUBSTITUTION is a line containing two WORDS.
Your program will process a file containing SUBSTITUTIONS and a file containing WORDS. Each SUBSTITUTION in the first file is a replacement rule for WORDS in the second file. Your program should process the input files and apply the SUBSTITUTIONS to each input WORD before printing it out. If a WORD in the input matches the first WORD in a SUBSTITUTION, it is replaced with the second word in the SUBSTITUTION, and then it is printed. If no SUBSTITUTION applies, the WORD is printed unchanged.
For example, imagine that we have these SUBSTITUTIONS in the first file:
foo many
hi hello
fish bicycle
And suppose we are given this second file:
So, hi everyone! This reminds me of foo things.
I need a new fish for my birthday.
The resulting output would be:
So, hello everyone! This reminds me of many things.
I need a new bicycle for my birthday.
NUMBER OF CHANGES: 3
There are a few rules about the file of SUBSTITUTIONS:
Blank lines are ignored
Lines that do not have two WORDS are ignored
Any leading and trailing punctuation in the WORDS of a SUBSTITUTION is discarded
All of the letters in the WORDS in the SUBSTITUTIONS file should be converted to lower case
If there are multiple SUBSTITUTIONS with the same first WORD, the last SUBSTITUTION is used, and all prior SUBSTITUTIONS for that WORD are discarded.
Because we discard punctuation in the SUBSTITUTIONS, and because we convert to lower case, the following three lines in the SUBSTITUTIONS file all have the same first WORD. According to the rules, the last SUBSTITUTION is the only one that is retained.
Hi!!! hello
"hi" hello
Hi Hello
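The rules for the SUBSTITUTIONS file can be sketched with a std::map, where assigning to an existing key gives the "last SUBSTITUTION wins" behavior. This is a minimal sketch, not a required design. The names normalize and loadSubstitutions are hypothetical. It assumes that "two WORDS" means exactly two, that only characters recognized by ispunct() in the default locale count as punctuation, and that a line is skipped when either WORD is empty after its punctuation is discarded.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Discard leading/trailing punctuation and convert what remains to lower case.
std::string normalize(const std::string& word) {
    std::size_t begin = 0;
    std::size_t end = word.size();
    while (begin < end && std::ispunct(static_cast<unsigned char>(word[begin]))) ++begin;
    while (end > begin && std::ispunct(static_cast<unsigned char>(word[end - 1]))) --end;
    std::string core = word.substr(begin, end - begin);
    std::transform(core.begin(), core.end(), core.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return core;
}

// Read SUBSTITUTIONS, one per line, into a map from first WORD to second WORD.
std::map<std::string, std::string> loadSubstitutions(std::istream& in) {
    std::map<std::string, std::string> rules;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::vector<std::string> words;
        std::string w;
        while (fields >> w) words.push_back(w);
        if (words.size() != 2) continue;  // blank lines and malformed lines are ignored
        std::string from = normalize(words[0]);
        std::string to = normalize(words[1]);
        if (from.empty() || to.empty()) continue;  // assumption: all-punctuation WORDs are skipped
        rules[from] = to;  // a later SUBSTITUTION overwrites an earlier one
    }
    return rules;
}
```

With this sketch, several lines whose first WORD normalizes to hi leave a single entry in the map, holding the second WORD of the last such line.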
When deciding if an input WORD matches a word in a SUBSTITUTION, the following rules should apply:
Any leading and trailing punctuation is ignored when deciding if a WORD matches
Any differences in case are ignored when deciding if a WORD matches
When performing a replacement, leading and trailing punctuation from the WORD is preserved
If the first letter after any leading punctuation in the input WORD is a capital letter, then the first letter in the resulting replacement should also be capitalized
At most one replacement should be performed for each WORD
For example, imagine that we have these substitutions:
foo
hi hello
fish bicycle
And suppose we process this input:
"Hi, everyone!" said the boy. "Yeah, hi! I want a brand new fish!"
The resulting output would be:
"Hello, everyone!" said the boy. "Yeah, hello! I want a brand new bicycle!"
NUMBER OF CHANGES: 3
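The matching rules above could be applied to a single input WORD roughly as follows. This is a sketch under assumptions: substituteWord is a hypothetical name, the map is assumed to hold lower-case keys with no surrounding punctuation, and only characters recognized by ispunct() in the default locale are treated as punctuation.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical helper: returns the WORD to print and increments `changes`
// when a SUBSTITUTION was applied.
std::string substituteWord(const std::string& word,
                           const std::map<std::string, std::string>& rules,
                           int& changes) {
    std::size_t begin = 0;
    std::size_t end = word.size();
    while (begin < end && std::ispunct(static_cast<unsigned char>(word[begin]))) ++begin;
    while (end > begin && std::ispunct(static_cast<unsigned char>(word[end - 1]))) --end;
    if (begin == end) return word;  // nothing but punctuation, so nothing to match

    // Case is ignored when matching, so compare a lower-case copy of the core.
    std::string key = word.substr(begin, end - begin);
    for (char& c : key) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }

    auto it = rules.find(key);
    if (it == rules.end()) return word;  // no SUBSTITUTION applies

    std::string replacement = it->second;
    // Carry a capitalized first letter over to the replacement.
    if (!replacement.empty() && std::isupper(static_cast<unsigned char>(word[begin]))) {
        replacement[0] =
            static_cast<char>(std::toupper(static_cast<unsigned char>(replacement[0])));
    }
    ++changes;
    // Leading and trailing punctuation from the input WORD is preserved.
    return word.substr(0, begin) + replacement + word.substr(end);
}
```

Because the function looks up the WORD once and returns immediately, at most one replacement is made per WORD, even if the replacement text is itself the first WORD of another SUBSTITUTION.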
The program should keep a count of the substitutions applied. If any replacements were made, the last line of output, after the entire input file has been processed, should be:
NUMBER OF CHANGES: N
where N is the number of replacements that were made.
The program takes two command line arguments: the first is the name of a file containing SUBSTITUTIONS, and the second is the name of a file containing WORDS. The program should process the SUBSTITUTIONS file according to the rules described above. Then, it should read the file containing WORDS, apply any replacements indicated from the SUBSTITUTIONS, and print out the result.
Note that any and all whitespace between WORDS in input is printed unchanged to output.
If the program is not given exactly two filenames, print “TWO FILENAMES REQUIRED”, and stop. If a file cannot be opened or read for any reason, print the error message “BAD FILE FILENAME”, and stop. It is possible that either or both of the files your program reads may be empty. It is also possible that the second file may contain no words. Neither of these situations is an error.
The program will be submitted and graded in separate parts:
Part 1
Recognizing error cases associated with the wrong number of file names and files that cannot be opened.
Handling cases with an empty SUBSTITUTIONS file.
Handling simple substitutions (no punctuation, no case changes).
Part 2
Handling more complex cases in SUBSTITUTIONS files (duplicated words, mixed case words, punctuation, etc).
Handling cases with punctuation at the start and/or end of WORDS in the WORDS file.
Handling cases with mixed case for matches.
Handling cases with capitalized WORDS that are substituted.
Note that both in your “work” directory, and also on Moodle, there are files:
cases.txt
cases.tar.gz
runcase
The cases.txt file lists every test case, along with the arguments used for each one.
The cases.tar.gz file is a compressed archive containing all the input and output files.
The runcase file is a shell script that lets you run a single test case.