Starting from:

$29.99

Project #5 Solution

1    Note

 

This project  is voluntary.  It will give 5% (if you get 100%) to your grade, applied after the curve (if there is one).  It will be graded a little more harshly than  previous projects.

 

 

2    Overview

 

The purpose of this project  is to learn about  string  matching  by comparing multiple  string searching  algorithms  to see under  what  circumstances  one algorithm  may outperform  an- other.

Your job is to implement the Rabin-Karp algorithm  and Knuth-Morris-Pratt  algorithm, as described in class, to search an input  text  file for each requested  string  from a different input  patterns file.  The program  should return  an indication  if the patterns are found, as well as timing  information  about  the  search algorithm.   You will be provided  with a large text  document and a set of pattern strings.

The writeup of the project should explain your results and give a summary of the perfor- mance for the different algorithms  you have implemented.

Do not use any string matching  or search functions that  are built into Java  when writing your algorithm  implementations. Also, you may use whatever  you wish when reading in the search strings file, but  do not use search or find routines  when reading in the source file.

 

 

3    Performance

 

This project is about  comparing different algorithms.  As such, the program you write must terminate in a reasonable  amount  of time  for all strings  provided.   If your  program  does not perform adequately,  then  there  is something  wrong with your program  and it must  be examined.  Common inefficiencies may include: failure to properly buffer blocks of characters from the disk drive; using an inefficient data  structure; repeating  steps unnecessarily; or any number of simple mistakes.

Since the  project  is partly  about  algorithm  efficiency, it may prove fun to see how fast you can make your algorithms  run.   To that  end,  the  fastest  3 projects  submitted will be

&Into the fray good men&

&To be, or not to be&

&There’s a blank space at the end of this search string &

&This string has a line-feed that must be matched&

Figure  1: Sample pattern string  file. Any valid filename should be accepted.

 

 

 

given 5 additional  bonus points (not over 100%). “Fastest” will be defined as the total real runtime of your program from start  to finish when running on a dedicated  machine.  Grading software will be used to measure  your projects  total  execution time so you do not need to implement any  additional  timing  checks.  When  running  the  time  trials  I will be using a different test  case than  the one provided.

 

 

4    User Interface

 

Your program  needs to take  in three  command-line  parameters:  the  first argument is the name of the pattern file containing  the strings to find, second is the name of the text  file to search, third  is the name of the file to output the match  results.  Remember,  each of these filenames can be arbitrary.  The  main  Java  class needs to be named  StrMatch.java. An example command-line  might look like:

 

java StrMatch strings.txt somesource.txt results.txt

 

 

5    Input

 

The pattern strings file will be a text file with strings separated  by blank lines and/or newlines and/or spaces.  Each search string  will be surrounded  by an ampersand  (‘&’) symbol.  See Figure  1 for a sample file.  For simplicity you may assume that  ampersands  (‘&’) will not appear  in the pattern strings themselves.

Any input  pattern file may be used with your project.  Please use the provided  pattern file and input text file for the algorithm comparison in your report.  The files have been made available online on Piazza.

 

 

5.1     Newlines

 

Because you are dealing with text  files, machine dependent newlines must  be detected  and allowed to match.  Your program needs to be smart enough to know that  a multi-line pattern string that  contains Windows newlines may match a multi-line text string that  contains Unix newlines, and vice versa.

Recall that  Windows uses two characters  to represent a text  newline (\r \ n) (Hex 0x0D,

0x0A), while Linux & Mac (Hex 0x0A) only use one. Your algorithm  must have these three

characters  match  so that  multi-line pattern strings can be used by your program.

 

 

Input  File:

 

 

 

 

 

results.txt:


&To be, or not to be&

&That there’s a blue cucumber!&

&Hippocampelephantocamelos&

RK FAILED: To be, or not to be

KMP FAILED: To be, or not to be

RK MATCHED: That there’s a blue cucumber! KMP MATCHED: That there’s a blue cucumber! RK MATCHED: Hippocampelephantocamelos

KMP MATCHED: Hippocampelephantocamelos

 

Figure  2:  Sample  pattern file and  the  corresponding  output file results.txt. These  are  real  quotes from real plays.  Can  you guess the source  file?

 

 

 

In other  words, if your pattern string  contains  a newline represented  by 0x0A it should match  a pattern that contains  0x0D 0x0A; assuming  that  the  rest  of the  pattern matches as well.  The  same should be true  for matches  going the  other  way, from 2 character  to 1 character  newlines.

 

 

6    Output

 

See Figure 5 for a sample pattern file and the EXACT  format  expected  in the output file. You may use Standard-Out or create another  file to store the statistics needed to study your algorithm  implementations.  Each  search  algorithm  needs to find the  first match  and  can then  terminate the search for that  pattern.  When running  time trials  for your report  it is important to remember  that  there  are different  types  of time:  CPU  time,  real time,  user time, and so on.  It is quite possible that  the real time taken  to run your algorithm  may be much longer than  the CPU time taken because of process scheduling on a shared computer, disk I/O  and  screen output.  But  then,  I know you’ll explain  all the  different things  you encounter  in your report.

 

 

7    Report in  readme.txt

 

In  readme.txt you  need  to  give a  report  comparing  the  performance  of Rabin-Karp   to

Knuth-Morris-Pratt when using the pattern strings provided with the sample text  file.

When creating your own test cases, be sure to look at various sizes of strings and perhaps even some degenerate  cases such as files full of a single letter.   You may want  to compare your implementations with Java’s built-in  search routines  to see how they compare.

The report  for this project will be worth 25% of the project grade.

 

 

8    Turning in  your pro ject

 

Please follow the  project  protocol  for files that  should be turned  in.  Put  your name,  and your partner’s  name in the report.  Students  are not getting  credit on projects because they are  forgetting  to include their names.

Use the Linux turn-in  software to submit  your project.  The command will look like:

 

turnin -submit lschmidt project5 your_files

 

After you have turned  in your project it is now possible to use the nifty turnin  “--verify”

flag to check that  your files have been turned  in properly.

 

 

9    Questions and Pro ject Clarifications

 

Clarifications  regarding  this project  will be posted  to Piazza.  Though  I do not expect any changes, you are responsible for any modifications found there.

More products