Starting from:
$29.99

$23.99

Programming Assignment # 1 Solution

1    Introduction

 

Data  compression is both useful and/or necessary in a variety  of contexts.  Stor- age and transmission of data  in a compressed form typically requires fewer bits, thus  utilizing  resources more efficiently. In this assignment you will experiment with different compression techniques  covered in class. Your task has two parts. First,  implement  a program  to  evaluate  the  amount  of entropy  in a text  file and  to produce  an optimal  Huffman code.  Second, implement LZ compression and  decompression.   The  input  will be text  files containing  ASCII  characters, without unicode characters. Java  source code for reading  and writing  to files is provided  on the class web site; you are not required  to implement the Java  code for file IO. You may develop your own interface  for file IO if you wish to do so, provided  it conforms with the specification  described  below.

 

 

2    Part 1 - Entropy and Huffman Coding

 

2.1     Description

 

Given a text  file, it is possible to compute  the probability of occurrence  of each symbol in that file according  to the first order model.  From  these  probabilities the entropy  of the characters in the file can be computed and an optimal  Huff- man  code for the  characters in the  file produced.   The  entropy  estimates  the average  number  of bits  required  to encode each character in the  text  file. It is then  possible to estimate  the  amount of compression  achievable  by that model for the input text  file.

 

2.2     Implementation

 

Write  an algorithm estimating the amount of compression  achievable  by a first order model given a text  file. This  algorithm should compute  character proba- bilities and design an optimal  Huffman code by constructing the best Huffman tree.  Given the Huffman code, the actual  length  of the Huffman coded file can

 

Table 1: Example  code

 

Symbol    Code

c              1

g              01

t               001

a              000

 

be calculated  although you are not  required  to output the  Huffman coded file. For  example,  consider  the  following Huffman  code in Table  1.  The  number  of bits needed for the compressed  string  ”cccg” is 3*1+1*2  = 5.

 

 

3    Deliverables

 

Your program  (call the source code “Huff.java”) should be invokable  with a file name (e.g., tmp)  as a command  line argument, as follows:

 

rivera.cs.utexas.edu$$ java Huff tmp

 

The expected  output should be as follows:

Actual length of the file by Huffman coding is

45

and minimum achievable is

42.93

 

Note that both the actual  output length with huffman coding and minimum achievable  for the file are in number  of bits.

One final remark:  include whitespaces,  commas, matching case, etc.  in your frequency calculation. Automatic grading will be used, and there will be trouble if your output doesn’t match  exactly.  Also, you can use the method  giveArray() in the IO.Compress class to read characters from the file into an array;  see the last  section below.

 

 

4    Part 2-  LZ  Compression

 

4.1     Description

 

The  Lempel-Ziv  compression  algorithm given  in  the  course  packet  accepts  a string  of text  and produces  a sequence of pairs (n; c), where n is the index into the dictionary, and c is a character. The decompression  algorithm reconstructs the dictionary using the pairs that are retrieved  from a compressed file. During decompression  each pair in the sequence causes a new entry  to be added into the dictionary. For example,  when the  decompression  algorithm  processes the  pair (n; c), it is known that the  index n refers to a string  that is already  contained in  the  dictionary.   Therefore  the  new  entry  to  be  added  into  the  dictionary is obtained by concatenating the  string  corresponding  to n with  the  character represented by c.  You are  required  to  implement  algorithms  for compression and decompression  using this method.

 

4.2     Implementation - Compression

 

Write  a program  that is invoked  with  the  name  of a text  file.  This  program should  create  a compressed  file and  write  it to the  disk.  Your implementation should use the trie based approach (see Implementation of the Dictionary, page

21 of handbook). The name of the compressed file is obtained by concatenating the name of the input  file with the string  “.myZ”.

 

4.3     Implementation - Decompression

 

The  decompression  algorithm should  be  invokable  with  the  name  of a  com- pressed  file, and  it should  write  a decompressed  file to the  disk.  The  name  of the decompressed  file is determined by concatenating the name of the input  file with the string  “.unZ”.

 

4.4      Deliverables

 

You are required  to submit  a file called “Comp.java”, which takes two command line arguments, the  first argument  is either  the  character ’c’(for compression) or ’d’(for decompression).   The  second argument is the  name  of the  file to be compressed  or decompressed.  An example run of your program  might look like this -

rivera.cs.utexas.edu$$ java Comp c tmp rivera.cs.utexas.edu$$ ls -l tmp*

-rw-r--r-- 1 ankur grad 5029 Mar 25 22:26 tmp

-rw-r--r-- 1 ankur grad 5600 Mar 26 11:06 tmp.myZ rivera.cs.utexas.edu$$ java Comp d tmp.myZ rivera.cs.utexas.edu$$ ls -l tmp*

-rw-r--r-- 1 ankur grad 5029 Mar 25 22:26 tmp

-rw-r--r-- 1 ankur grad 5600 Mar 26 11:06 tmp.myZ

-rw-r--r-- 1 ankur grad 5029 Mar 26 11:06 tmp.myZ.unZ rivera.cs.utexas.edu$$ cmp tmp tmp.myZ.unZ rivera.cs.utexas.edu$$

 

 

 

4.4.1     Remarks

 

In some cases, you may  find that the  size of the  “.myZ”  file is actually  larger than  the  input  file.   We  have  observed  that it  is so for small  files.   Explain why this  happens  in your  readme  file.  Along with  the  Comp.java file you are also  required  to  submit  two  files for  which  your  program  indeed  does  some compression  (i.e.  the size of “.myZ”  file is smaller than  the original file). Name these  files “smaller1.txt” and  “smaller2.txt”.  You are also required  to submit a text  file for which your  compression  algorithm actually  increases  the  size of the  file.  Name  this  file “larger.txt”.  Make sure the  file names  match  exactly, because automatic grading  will be used.

 

 

5      Questions and Clarifications

 

Clarifications   regarding   this  project   will  be  posted  to  the  piazza.    Though changes are not expected,  you are responsible for any modifications  found there. Please  also try  to ask questions  there  so other  students may benefit from your questions.

 

 

6    Deadline and Sumission Instructions

 

The  project  is due Feb.  13th  2012 11:59PM.  No late  submissions  allowed.  As usual,  your  readme  file must  include  the  name  of both  the  partners and  their EIDs,  as well as their  e-mail  addresses.    Adhere  to  the  project  protocol!   To package your project  into a jar file, type:

“jar cvf project1.jar Huff.java Comp.java smaller1.txt smaller2.txt larger.txt readme.txt”

You can include  additional java  source files if you need them.   Just  make sure they  are all in your jar file. To actually  turn  in the project,  type:

 

turnin --submit lschmidt project1 project1.jar

 

To confirm your submission:

 

turnin --list lschmidt project1

 

Only one submission  per group.

 

 

7      Appendix - IO  source code

 

We will provide a file named IO.java  on the class web page.  This file implements three  java classes IO.pair,  IO.Compress and IO.Decompress, which can be used without modifications to complete this assignment. If you use this file, include it in the jar  file you submit.  The class IO.Compress can be used for implementing both  the  parts  of this  assignment.  The  class IO.Decompress  can  be used  for implementing the decompression  algorithm.

7.1  IO.pair

 

This  class is used for storing  the  tuple  (index,character). This  is the  type that will be returned by the  method  decode(...)   in the  class IO.Decompress.   If p is an  object  of the  type  IO.pair,  then  p.index  refers to  the  index  of an  entry  in the  dictionary, and  p.extension  refers to the  character that must  be appended to this  string  to create  a new dictionary entry.   The  pair  class also contains  a field named  ’valid’.  This  is helpful to determine when there  are no more pairs to be decoded.

 

7.2      IO.Compress

 

The  constructor of this  class  does  two tasks.    First,   it  initializes  its  internal objects so that they are ready to read from the input  file. Second, it determines the name of the output file, and initializes it, so that the next call to encode(...) writes to the output file. This class provides  a method  giveArray() for copying all the  characters from a file to  a character array.   It  also provides  a method encode(int  x, char  a),  where the  parameter x refers to an index  in the  dictio- nary,  and a is the character that must be appended to the corresponding  string. This method  should be called during  compression  when a new pair is added  to the  sequence.  The  method  done()  should be called at  the  end of the  compres- sion process.  Note that the  method  encode will be called several times  during compression.  The  following code shows some parts  of an example  compression program.

 

/*infile is the name of the file to be compressed*/

public static void compress(String infile) throws Exception

{

IO.Compressor io ;

/* Initialize a IO.Compressor object so that it is ready to

* read the input file, and to write the output file to the disk */

io = new IO.Compressor(infile);

/* Read all characters from the input file to a character array */

char[] carray = io.giveArray();

...

/* Perform compression on the array carray, *

* this part may call io.encode(...) several times */

...

/* Close all relevant files */

io.done();

}

 

7.3      IO.Decompress

 

The  constructor of this  class  does  two tasks.    First,   it  initializes  its  internal objects  so that they are ready  to read from the input  file (i.e.  compressed  file). Second, it determines the name of the output file, and initializes  it, so that the next

 call to io.append(...) writes to the output file. This class provides a method decode()  which returns the next  unprocessed  pair from the sequence contained in  the  compressed  file (if no  more  pairs  are  available,  then  it  returns a  pair whose ’valid’ field is set to false).  This method  will also be called several times during  the decompression  process (like the method  encode in IO.Compress). It also provides  a method  append(String s),  which  appends  the  string  s to  the output generated so far.  The  method  done should  be called at  the  end of the decompression  process.   The  following code shows some parts  of an  example decompression  program.

 

 

/*infile is the name if the file to be decompressed*/

decompress(String infile) throws Exception

{

/* Initialize a IO.Decompressor object so that it is ready *

* to read the sequence of pairs from infile, and to write *

* the decompressed file to the disk */ IO.Decompressor io = new IO.Decompressor(infile); while(...)

{

....

/* Every call to io.decode() returns the next unprocessed pair from the *

* compressed file. next.valid is false if no more pairs are left. */ IO.pair next = io.decode();

....

/* output is the latest entry that was added to the *

* dictionary, append it to the decompressed file */

io.append(output);

....

}

/* Close all relevant files */

io.done();

}

More products