Starting from:
$35

$29

Final Project Let's Search Solution

Scenario




Mustang Law Firm helps individuals and non profits through reasonably priced legal services. Their drowning under the prices of subscription based services to search court cases, so they’ve decided to implement an in-house solution. You have been hired by Mustang Law Corp to implement a proof-of-concept search engine so they can evaluate your software development prowess. Your search engine will cover all opinions issued by the Supreme Court of the United States (SCOTUS). It will provide the user with an interface to easily search the court cases and return ranked results.







Search Engine Architecture




Search engines are designed to allow users to quickly locate the data they want or need. Input to a search engine is a set of documents commonly referred to as the​corpus​. Typically, the user will enter a search query, and any documents (opinions from SCOTUS, in this case) that satisfy that query are returned to the user. Another task of a search engine is to rank the results based upon relevancy.




The four major components1 of a typical search engine are the following:




Document parser/processor,



Query processor,



Search processor, and



Ranking processor.



Figure 1 provides a general overview of a potential system architecture.







Figure 1 – Sample Search Engine System Architecture













The fundamental “document” for this project is one opinion issued by SCOTUS with it’s associated metadata such as date posted and tags.




Below is an overview of the major tasks/responsibilities of each of the components of the search engine.




The​index handler​,the workhorse of the search engine, is responsible for the following:




Reading from and writing to the main index​. You'll be creating an ​inverted file index​which stores references from each element to be indexed to the corresponding document(s) in which those elements exist.



Searching the inverted file index based on a request from the query processor​.



Storing other data with each word​.



The​document parser/processor​is responsible for the following tasks:




Processing each decision/opinion in the corpus​. The dataset contains one file per decision/opinion issued. Each document is in JSON format. Depending on the particular opinion, the actual text of the opinion is provided in plain text, HTML, or both. Processing of an opinion involves the following steps:



Removing stopwords from the opinions​.Stopwords are common words that appear in text but that provide little discriminatory power with respect to the value of a document relative to a query because of the commonality of the words. Example stop words include “a”, “the”, and “if”. One possible list of stop words to use for this project can be found at ​http://www.webconfs.com/stop-words.php​. You may use other stop word lists you find online.



Stemming words​.Stemming2 refers to removing certain grammatical modifications to words. For instance, the stemmed version of “running” may be “run”. For this project, you may make use of any previously implemented stemming algorithm that you can find online. One such algorithm is the Porter Stemming algorithm. More information as well as implementations can be found at ​http://tartarus.org/~martin/PorterStemmer/​. Another option is ​http://www.oleandersolutions.com/stemming/stemming.html​. C ++ implementation of Porter 2: ​https://bitbucket.org/smassung/porter2_stemmer/src​.



Computing/maintaining information for relevancy ranking.​You’ll have to design and implement some algorithm to determine how to rank the results that will be returned from the execution of a query. You can make use of metadata provided, important words in the opinions (look up term-frequency/inverse document frequency metric), and/or a combination of several metrics.



The​query processor​is responsible for:




Parsing of queries entered by the user of the search engine.​For this project, you'll implement functionality to handle​simple​prefix Boolean queries entered by the user. The Boolean expression will be prefixed with a Boolean operator of either AND or OR if there is more than one word is of interest. Trailing search terms may be preceded with NOT to indicate SCOTUS opinions including that term should be removed from the resultset. For simple one-term searches, an operator is not required. Here are some examples:



education



This query should return all opinions that contain the word education.



AND higher education texas



This query should return all opinions that contain the words higher​and education ​and​texas
OR privacy security



















■ This query should return all opinions that contain either privacy​OR​security


OR​both.




○ AND privacy security NOT computer




■ This query should return all opinions that contain privacy and security, but not

computer.




security NOT privacy




■ This query should return all opinions that contain security, but not privacy.
● Ranking the Results.​Relevancy​
ranking​refers to organizing the results of a query so that “more
relevant” documents are higher in the result set than less relevant documents. The difficulty
here is determining what the concept of “more relevant” means. One way of handing relevancy
is by using a basic​term frequency – inverse document frequency​(tf/idf) statistic3. tf/idf is
used to determine how important a particular word is to a document from the corpus. If a word
appears frequently in document d​ but infrequently in other documents, then document d​




t​
t
would be ranked higher than another document d​ in which a query term appears frequently,

s​




but it also appears frequently in other documents as well. There is quite a bit of other information that you can use to do relevancy ranking as well such as date of the SCOTUS opinion, etc.




The​user interface​is responsible for:




Receiving queries from the user



Communicating with the Search Engine



Formatting and displaying results in an organized, logical fashion



More info on the UI later.




The Index




The​inverted file index4 is a data structure that relates each unique word to the document(s) in which it appears. It allows for efficient execution of a query to quickly determine in which documents a particular query term appears. For instance, let's assume we have the following documents with ascribed contents:




d1 = ​Computer network security



d2 = ​network cryptography
d3 = ​database security



The inverted file index for these documents would contain, at a very minimum, the following:




computer = d1



network = d1, d2



security = d1, d3



cryptography = d2



database = d3



The query “AND computer security” would find the intersection of the documents that contained

computer​and the documents that contained​security​​.




set of documents containing computer = d1



set of documents containing security = d1, d3



the intersection of the set of documents containing computer AND security = d1












Inverted File Index Implementation Details
















Figure 2: Index Interface Class Diagram







The heart of this project is the​inverted file index​. Notice from the example above that what is being stored is essentially words and a list of documents in which each appears (accompanied by some bookkeeping information). Once the index is created, retrieving the set of documents for various words will be the central task of the index. Therefore, we should use a data structure that will allow for efficient searching.




You will implement at least two different data structures to store the index.




AVL Tree



Hash Table with collisions handled by separate chaining



You may also include a list implementation of the index. This will provide a good starting place.




In your implementation, you should strive to abstract the index data structure idea from the underlying storage data structure implementation. What this means is that the AVL tree and hash table implementation should share a common interface. In Figure 2, IndexInterface may contain abstract methods such as addWord(), getDocsForWord(word:string), etc.




Each class that inherits from IndexInterface would be required to implement those methods. So that would mean that we could do something like:




IndexInterface* if = new AVLTreeIndex;




or




IndexInterface* if = new HashTableIndex;




Index Persistence




The index must also be persistent once it is created. This means that the contents of the index should be written to disk when the program ends and read when the program starts. The user should have the option of clearing the index and starting over.




User Interface




The user interface of the application should provide the following options:




The program should have​two modes​(controlled by menu or command line parameters)



maintenance mode –



allows the user to add opinions to the index by supplying the path to a folder containing new opinions
allows the user to clear the index completely



interactive mode –



allow the user to indicate if they want the index loaded into an AVL structure or a hash table structure (if a persisted index exists).
allow the user to enter a properly formatted Boolean query (as described above).

The results should display the opinion’s identifying/important information such as year, parties to the case, which justice wrote the majority opinion, etc. The result set shown to the user need not contain any more than the top 15 opinions. If you’d like to show more, you may wish to paginate.



The user should be allowed to choose one of the opinions from the result set above and have the first ~300 words of the opinion printed.
Helpful Hint: that the query terms should have stop words removed and stemmed before querying the index.
Upon request, print basic statistics of the search engine including:



Total number of opinions indexed



Average number of words indexed per opinion (after removal of stop words)
Top 50 most frequent words (after removal of stop words)



Any other options you deem appropriate and useful.



Document Data Set




The dataset can be found at https://www​.courtlistener.com/api/bulk-data/opinions/scotus.tar.gz​.







Mechanics of Implementation




Some things to note:




This project may be done individually, in teams of two students, or in teams of three students.



Individually: Finish all work on your own.



Team of 2 students:



Each team member must contribute to both the design AND implementation of the project.
Each class in the design must have an “owner”. The owner is a group member that is principally responsible for its design, implementation and integration into the overall project.



Team of 3 students:



Complete all work for this project and, additionally, the following feature:



Implement 2-word phrase searching.



A 2-word phrase search will be indicated by square brackets



(e.g. []) around the 2-word phrase.




You may assume no phrase query will contain more than 2 words.



No query will contain more than one search phrase.



This project must be implemented using an object-oriented design methodology.



You are free to use as much of the C++ standard library as you would like. In fact, I encourage you to make generous use of it. You may use other libraries as well except for the caveat below.

You must​​implement your own version of an AVL tree and Hash Table (the storage data structures for the index). You may, of course, refer to other implementations for guidance, but you MAY NOT incorporate the total implementation from another source.



You are free to use any library you can find to parse JSON and HTML data.



All of your code must be properly documented and formatted



Each class should be separated into interface and implementation (.h and .cpp) files unless templated.
Each file should have appropriate header comments to include owner of the class and a history of updates/modifications to the class.



Submission Schedule




You must submit the following:




Teams:​​ Due Friday Nov 2, 2018 @ 5pm. Any name not submitted by this time is subject to being required to do the project individually. This info will be submitted via a Google Form to be distributed soon.
Initial Design Documents​:Due Monday Nov 12 @ 6am uploaded to Canvas. DO NOT think of this as your only task during week 1. This due date is to allow everyone to have one lab session before submitting.



Milestone 1 and Parsing Speed Check:​Monday Nov 19. More on this soon.



 



Complete project with full user interface



User's Manual to include information on management piece as well as regular user piece



Documentation



updated UML diagrams



documentation about each class in your project. Consider using Doxygen5 for this.
Report that compares the underlying functionality of the AVL implementation vs the Hash Table implementation.

Which one is better?



How do you know?



Can you quantify this?



Is one better for small data sets compared to large data sets?



Demonstration of functionality to Professor Fontenot and TAs on Monday December 3, 2018 (sign up sheet to be distributed).



Thoughts and Suggestions




If you wait even 1 week to start this project, you will likely not finish.



A significant portion of your grade will come from your demonstration of the project to Prof. Fontenot and the TAs. Be ready for this.
Take an hour to read about the various parts of the C++ STL, particularly the container classes. They can help you immensely in the project.






As mentioned previously, beware of code that you find on the Internet. It isn't always as good as it it may seem upon initial inspection. Make sure that any code you use in the project is cited/referenced in the header comments of the project.



Take time to open a few of the SCOTUS opinion files in a text editor and examine them. Data is rarely beautiful and nicely formatted.



Grading:




The final implementation project is worth​20% of your final grade​in this course (all other implementation projects are worth 30% percent of your final grade).
Early Design Documents will count as a homework assignment.



The Check In and Speed check will count as a homework assignment. The grade will not be based on speed, but only on completeness.
The analysis paper comparing the performance of the implemented data structures will count as a homework assignment.
The documentation for your project will count as a homework assignment

More products