$29
In this assignment and the next, you will develop a search engine for text documents. Your engine will crawl through a directory on a local disk looking for documents. When it nds them, it will index the words that appear in those documents. Then it will answer queries posed by users.
In A4, you will build an working but ine cient version of the search engine. In A5, you will upgrade the data structures involved to make the engine scale up to large collections of documents.
What you’ll do: Implement some functors, use a testing tool called Bisect, and give your teammates a performance evaluation.
Objectives:
Learn more about the OCaml module system.
Implement two data structures, dictionaries and sets.
Practice documenting and implementing abstraction functions and representation invariants.
Gain experience with glass-box testing.
Optionally, learn about regular expressions.
Table of contents:
Step 1: Team Maintenance
Step 2: Explore the Starter Code
Step 3: Dictionaries
Step 4: Sets
Step 5: Index
Step 6: Query
Step 7: Bisect
Scope
Documentation and Testing
Submission
Step 1: Team Maintenance
Your A4 team will be the same as your A3 team At your rst team meeting about A4 please do
http://www.cs.cornell.edu/courses/cs3110/2018fa/a4/ 1/8
A4: Search - CS 3110
two things:
Assign new roles for this assignment. Make sure to give everyone a different role than they have had before.
2. Before the team meeting, every team member should print out and ll out an Internal Evaluation for all other team members. These are meant to be anonymous, and to provide advice to team members on how they are fulJlling their teamwork obligations, as well as how to improve. At the end of the meeting, distribute the evaluations. Each team member should shuLe the papers they receive, so as to maintain anonymity. Team members should review the evaluations on their own after the meeting.
Note that at the end of A5, you will submit a similar evaluation to the professor, and that those evaluations will become part of team members’ grades. So it would be unethical to give a team member an evaluation that does not correspond to your actual opinion of their teamwork performance. If you are disturbed by a team member’s lack of contribution to your team, now is the time to make that clear. Recall the Hitchhiker assignment: you need to be clear about your expectations and not enable hitchhikers. On the other hand, if you are impressed by a team member’s contributions, now is the time to make that clear. Let them know how much they are valued.
Step 2: Explore the Starter Code
There is a makeIle provided with the usual targets: build, test, check, nalcheck, zip, docs, and clean.
We are back to an autograded assignment, hence your submission must pass make check . Now that you’re used to working with .mli les, we have omitted the comments about what you are allowed to change. Any names and types that we provide in an interface may not be changed, but you may of course add new declarations and denitions. If you’re ever in doubt about whether a change is permitted or not, just run make check : it will tell you whether you have changed the interface in a prohibited way.
Step 3: Dictionaries
The most important data structure that your search engine will need is a dictionary—that is, a mapping from keys to values. In this assignment, we’ll keep it simple and implement a dictionary with an association list. In A5, we’ll upgrade to a more ecient implementation.
Implement ListDictionary following the speciIcations and comments provided in the starter
http://www.cs.cornell.edu/courses/cs3110/2018fa/a4/ 2/8
A4: Search - CS 3110
code. Using test-driven development, also implement unit tests for ListDictionary in test.ml .
AF and RI: Note that you need to document and implement an abstraction function and any representation invariants. The documentation goes above the representation type. The implementations go in format and rep_ok , respectively:
Format functions are discussed in the textbook in the section on abstract types under the heading “Custom Printers”. There is a helper function provided for you in the starter code for format ; you are free to improve its output as you wish.
How to implement rep_ok is discussed in the textbook in the section on implementing representation invariants. You do not have to insert calls to rep_ok throughout your ListDictionary implementation, though you might Mnd it useful for debugging purposes.
You do not need to have OUnit tests for either of these functions. Indeed, it would be hard or perhaps even impossible to write such tests.
Example: the Lle exampleDictionary.ml contains an example of how to create a dictionary.
That Ele is intended to be loaded in utop with #use , not to be compiled with ocamlbuild .
Step 4: Sets
Your search engine will also need a data structure for sets. In this assignment, we’ll use a dictionary to represent a set. The core idea of this representation is that the elements of the set are the keys in the dictionary. The values in the dictionary are thus irrelevant, so they might as well simply be () , the unit value. Although there is a bit of extra space overhead using this idea, because of storing the unit value, it nonetheless provides a way to turn any dictionary data structure into a set data structure. We will proSt from that in A5, when we will automatically get an improvement in the performance of our set data structures by upgrading our dictionary data structures.
Use that idea to implement DictionarySet , following the speciKcations and comments provided in the starter code. Also implement unit tests for it in test.ml .
This is the stopping point for a satisfactory solution.
Step 5: Index
Now that we have the data structures we need it’s time to implement the search engine We’ll
http://www.cs.cornell.edu/courses/cs3110/2018fa/a4/ 3/8
A4: Search - CS 3110
start with indexing the words found in les.
Implement the functions index_of_dir , words , to_list , and format in Engine.Make . It is
a functor that produces an engine from dictionaries and sets. The ListEngine module uses
Engine.Make to create an engine based on the association list dictionaries you created earlier.
Implement tests for ListEngine in test.ml . We provide more instructions and hints, below.
Files. The prototypical kind of le you should have in mind is a book stored as an ASCII-encoded plain text le, such as this edition of Alice’s Adventures in Wonderland. It would be reasonable to assume that the individual lines of les are not overly long, but that there might be many lines in a
le.
Start by creating your own small test Qles. As a source for larger test Xles, we recommend Project Gutenberg. Project Gutenberg les are often encoded in UTF-8 instead of ASCII. On most Unix systems, including the 3110 VM, you can convert UTF-8 to ASCII with the following command:
iconv -f UTF-8 -t ASCII -c in.txt out.txt
where in.txt is the name of the input UTF-8 le and out.txt is the name of the output ASCII le.
As large test cases, we will use directories containing up to 3 MB of les. That’s a high enough number of words that you will want to be careful about tail recursion.
Words: For purposes of this assignment, we deWne a word as follows:
A whitespace character is any space, tab, or newline character (i.e., carriage return or line feed).
A preword is any maximal length sequence of characters in a le that does not contain any whitespace.
A boundary character is any lowercase letter, uppercase letter, or digit.
A word is the maximal length subsequence of a preword that begins and ends with a boundary character. In between those there may be any number of any other characters. A preword that contains no boundary characters does not correspond to any word.
There will no doubt be some weird corner cases resulting from this de nition of words. But we need a standard de nition; this one is relatively simple for us all to use, and it gets many common cases right.
For example, given a le containing the following text:
http://www.cs.cornell.edu/courses/cs3110/2018fa/a4/ 4/8
A4: Search - CS 3110
"I would found an institution where
any person can find instruction in
any study." ---E. Cornell (b. 1807)
The words in that le would be: 1807, an, any, b, can, Cornell, E, nd, found, I, in, institution, instruction, person, study, where, would.
Hint: there are 3755 words in Alice’s Adventures in Wonderland, and after converting them all to lowercase, only 3278 distinct words remain.
Paths and lenames:
Anywhere you need to use a path separator in your code, use Filename.dir_sep rather than hard-coding a forward or backward slash; this will ensure that your code works on Windows and Unix platforms.
Never change the case of any le or directory names that the user passes to your engine or that the operating system returns to your engine. Some underlying le systems are case sensitive whereas others are case insensitive, so your code should preserve whatever case it receives.
Library hints:
For processing directories, you will nd the Unix module helpful, in particular the opendir ,
readdir , and closedir functions.
For processing Wles, you will nd the Pervasives module helpful, in particular the
open_in , input_line , and close_in functions. The input_line function will handle
newlines for you, so that you don’t have to worry about stripping out those particular whitespace characters.
To parse a line of a le into its component words, you can either directly implement it yourself using the String module’s functions, or you can implement it with regular expressions using the Str module. The latter is recommended but certainly not required. Regular expressions are a kind of small programming language; you would Snd it immensely useful to learn them if you haven’t picked it up somewhere already. Here’s a good tutorial: http://regexone.com/.
The modules mentioned above are speciTcally permitted by this assignment even if they have side effects or are imperative. When dealing with I/O, side effects are unavoidable.
Step 6: Query
The queries that users pose will have one of two forms Abstractly those two forms are as
http://www.cs.cornell.edu/courses/cs3110/2018fa/a4/ 5/8
A4: Search - CS 3110
follows, in which the NOT clause is always optional:
“and-not” queries: AND (w1, w2, …, wn), NOT (u1, u2, … um)
“or-not” queries: OR (w1, w2, …, wn), NOT (u1, u2, … um)
For example, “AND (far, away)” would return all les that contain both the words “far” and “away”, whereas “AND (far, away), NOT (day, night)” would return all les that do contain both “far” and “away” but do not contain “day” nor “night”. Likewise, “OR (all, nothing)” would return any le that contains “all” or “nothing” (or both), and “OR (all, nothing), NOT (between)” would return any le that contains “all” or “nothing” (or both) but does not contain “between”.
Queries must be case insensitive, which is the behavior you expect from Google. That means queries for the words “far”, “Far”, and “FAR” should all return the same results. Likewise, a Ile containing “far”, “Far”, or “FAR” would be returned by any of those queries.
Implement and_not and or_not in Engine.Make , and implement unit tests in test.ml .
This is the stopping point for a good solution.
Step 7: Bisect
The textbook contains a tutorial on a tool called Bisect, which is a code coverage testing tool. Do that tutorial.
Now run make bisect on your solution. Open report/index.html and examine the code coverage you have achieved in listDictionary.ml , dictionarySet.ml , and engine.ml . It’s unlikely you are at 100%, and probably it’s impossible to achieve that. For example, unit tests are unlikely to ever exercise (all or any of) your format and rep_ok functions. But outside of those, look for any lines colored red by the Bisect report. Do your best to invent new unit tests that will cause those lines to be executed. Add those tests to test.ml .
How high do you need to get your coverage? In an ideal world you would cover every line that you know it’s possible to cover, or at least that is feasible to write unit tests to cover. With modest effort, the staff solution to this assignment was able to achieve 90-100% code coverage in those three Iles, excluding format and rep_ok implementations. Use the (*BISECT-IGNORE*) , (*BISECT-IGNORE-BEGIN*) , and (*BISECT-IGNORE-END*) comments described in the textbook to exclude those functions from analysis.
We’ll make listDictionary.ml and engine.ml worth 8 points each, and dictionarySet.ml worth 4 points. We’ll look at your code-coverage percentages and round any that are at least 90%
to up a full 100% Then you’ll get a number of points that is your percent code coverage on a le
http://www.cs.cornell.edu/courses/cs3110/2018fa/a4/ 6/8
A4: Search - CS 3110
times the number of points the Vle is worth, rounded to the nearest integer.
There is a potential abuse that we need to discuss. A team might insert (*BISECT-IGNORE*) around code that is not part of format or rep_ok . In some places this would be reasonable— for example, if there is some defensive code that checks a precondition and raises an exception if it is violated, and it turns out to be impossible or infeasible to write a unit test to trigger that exception. In such a case, you should add additional source code comments to explain why it is reasonable to ignore that code in the Bisect report. Graders will read your source code to make sure you have done this. If you do not give reasonable justiPcations, or if you have used (*BISECT-IGNORE*) to unfairly bump up your code coverage percent, the grader will assess a penalty.
This is the rst time we’ve ever asked the entire class to use Bisect. Who knows what might happen? Please approach this—as the course staff will—with a spirit of fun about learning a new tool.
Scope
Here’s what we consider a satisfactory, good, and excellent solution:
Satisfactory: The solution passes make check . The list dictionary and set data structures are implemented and unit tested.
Good: The search engine is also implemented.
Excellent: The test suite also achieves at least 90% code coverage as measured by Bisect on
listDictionary.ml , dictionarySet.ml , and engine.ml .
Looking ahead to A5: Your A4 grade will be based on what you complete by the time A4 is due. A5 will then ask you to add new functionality. Any functionality that you leave incomplete in A4 will become part of the Satisfactory scope of A5.
Documentation and Testing
As in A2 and A3, the graders will look at the documentation produced by make docs .
But the graders will not run make test on your submission. That’s because you will no doubt have some large test directories, and we don’t want to clutter CMS with large Iles. So the graders will read your test.ml rather than run it.
http://www.cs.cornell.edu/courses/cs3110/2018fa/a4/ 7/8
A4: Search - CS 3110
Submission
Make sure your team’s NetIDs are in authors.mli , and set the hours_worked variable at the
end of authors.ml .
Run make zip to construct the ZIP le you need to submit on CMS. Any mal-constructed ZIP
les will receive a penalty of 20 points. The size of the the Lles is limited in CMS to 1 MB. Please stay within the size allotted, and do not submit large (or even any) test directories. The make zip command will run Bisect to produce your code coverage report and include it in the ziple. If the report isn’t in your ziple, we won’t be able to give you any credit for the Excellent scope, so please make sure it’s there.
Submit your ziple on CMS. Double-check before the deadline that you have submitted the intended version of your le.
Congratulations! Your search is off to a good start.
Acknowledgement: Adapted from Prof. Greg Morrisett, Dean of Cornell CIS.