Assignment 2: File Compression Solution

Starting from:

~~$30~~

$24

Warning: As you will see below, the descriptions of the assignments will be increasingly complex because we are asking you to build increasingly bigger programs. Make sure to read the assignment carefully!

0. Abstract

The opendir() and readdir() functions allow you to programmatically traverse through the file system and find/open any file in it (that you have permission to access). File descriptors and open() allow you to read from any file and create file at any point (that you have permission to access). This is quite a bit of power and capability for the very small and direct syntax that is required. The file descriptor abstraction is so generic that once you master its use, you have mastered nearly all basic IO you can do in C. You'll exercise your file manipulation capabilities to first generate Huffman codes for a series of files and then to compress and decompress files with a given Huffman codebook.

1. Introduction

In this assignment you will practice using the file system API and building different data structures to support an algorithm. In particular you will be creating, opening, reading, and writing files and building and traversing various trees and lists. Your task is to write a program that will, based on a command-line switch, either: descend through a directory structure, compute a Huffman code across all files it finds and write out that codebook as a file, or apply a Huffman codebook to a file (or all files it finds), compressing them, or use a Huffman code book to decompress a file (or all files it finds).

1.a. Huffman codes

Huffman codes are a fairly quick way to build a lossless code for data that avoids the prefix problem of dynamically-sized codes, namely that of ambiguity. If you have the tokens, 'a', 'b', 'c', 'd' and assign them the codes 0, 1, 01 and 11, respectively, it is unclear if the code sequence “11” means “bb” or “d”. In order to build a Huffman code, you first must count the rate of occurrence of all tokens, sort them, build a Binary Tree out of them, with each token as a leaf. Then traverse the Binary Tree built, recording a '0' for each left branch, '1' for each right branch, assigning the bit sequence generated to each token as a code value. For instance, to build the codebook below:

Word
Frequency
Bit Encoding
a
5
1100
dog
9
1101
cat
12
100
button
13
101
ball
16
111
and
45
0

minheap:
take top 2:
build tree0:
insert node for tree0:
a
5
a
5

cat
12
dog
9
dog
9

button
13
cat
12

tree0
14
button
13

ball
16
ball
16

and
45
and
45

minheap:
take top 2:
build tree1:
insert node for tree1:
cat
12
cat
12

tree0
14
button 13
button
13

ball
16
tree0
14

tree1
25
ball
16

and
45
and
45

minheap:
take top 2:
build tree2:
insert node for tree 2:
tree0
14
tree0
14

tree1
25
ball
16
ball
16

tree2
28
tree1
25

and
45
and
45

minheap:
take top 2:
build tree3:
insert node for tree 3:
tree1
25
tree1
25

and
45
tree2
28
tree2
28

tree3
53
and
45

minheap:
take top 2:
build tree4:
and
45
and
45

tree3
53
tree3
53

Label each link with '0' for left and '1' for right:

Traverse the tree, recording bit sequence to reach each token:

0 and

100 cat

101 button

1100 a

1101 dog

111 ball

Replace all occurrences of tokens with bit sequences from the codebook:

Test.txt contents: Test.txt.hcz contents:

a dog and cat 110011010100

Building a Huffman code will give you a lossless, dynamic- length code with no ambiguity. The code is not informationally optimal, but is optimal in prefix length as it will bias the shorter codes to be used by the tokens that appear most often. Informationally-optimal codes can be built with a somewhat more complex algorithm we won't be covering.

2. Implementation

2.a. Command Line Interface

Your program must implement the following command-line interface:

./fileCompressor <flag <path or file |codebook|

Your program must support the following flags: (which may appear in any order, but before any arguments)

b build codebook: the fileCompressor will index the file indicated and output a 'HuffmanCodebook' file in the same directory the fileCompressor is invoked in

c
compress:
the fileCompressor will compress the file indicated, <filename,

using the given codebook in to <filename.hcz alongside the original

file
d
decompress:
the fileCompressor will decompress the file indicated, <filename.hcz,

using the given codebook in to <filename alongside the original file
R
recursive:
including a 'R' argument with any of the above will run them in

recursive mode, in which case the file argument will need to be a path

and fileCompressor will recursively descend in to that directory and all

subdirectories, recursively applying the given operation to all files found
e.g.: ./fileCompressor -R -b ./
recursively index all files found inthe current working directory, all

subdirectories and emit a single ./HuffmanCodebook for them

./fileCompressor -d ./dir/test/file.txt.hcz ./HuffmanCodebook

decompress ./dir/test/file.txt.hcz in to ./dir/test/file.txt using the codebook ./HuffmanCodebook
2.b. HuffmanCodebook format

When building the HuffmanCodebook, it should be in the following format:

|escape char|

<ASCII bytestring <tab <token <newline

...

<blank line - newline terminated

e.g.:

\

and
100 cat

101 button

1100 a

1101 dog

111ball

<blank line - newline terminated

2.c. Tokenizing

Tokens are defined as any sequence of chars terminated by either a whitespace or newline or other control code.

e.g.:

If a file contains:

This an$example12 mail@rutgers

It should tokenize to:

This

an$example12

mail@rutgers

In order to make code creation and token representation simpler computationally, you should represent all your delimiters as distinct tokens. So, if you are parsing a file and find that it has spaces, tabs and newlines, you should insert representations of tab, space and newline in to your table and compute codes for them as well. This will certainly increase average code length and might cause your 'compressed' version to not be smaller than your plain text, however it will make detection, compression and decompression more straightfoward. Be careful to escape control codes as you write them to your codebook. If you write '\n' to a codebook, you will get a new line. If you write '\t' you will get a tab, not the chars '\' and 't', which can be confusing or ambiguous. There are multiple ways to escape control codes so you write a representation of the code, and not the code itself. If you want to pick an escape character of your own, feel free to use the first line of your codebook to hold it. If you use a different method and do not need to select a unique escape char, you do not need to use the first line for an escape character.

3. Methodology

You may want to first code a min-heap and binary tree. If you are going to build a binary tree, it isn't too much farther to build a binary search tree, which will allow you to quickly sort tokens in to your tree (it is not too many steps to convert a BST in to an AVL tree as well, for guaranteed O(lg n) inserts). This can be done in isolation from any other code.

You may want to approach building up your functionality in the following manner

Get your program to find all files in a directory and print out their names

Get your program to recursively descend down all subdirectories and print out their names with paths

open all files and output them to stdout to test

open all files and write them to <filename.test to test
Build and test tokenizing static strings in your code

count the number of occurrences of each token read in from files

Generate a Huffman codebook given a sequence of tokens and frequencies
Implement the -b and -R flags and test writing out a Huffman code book

... etc.

4. Results

Submit a tarred gzipped file named Asst2.tgz that contains:

a directory named Asst2/ that contains the following:

all .h and .c files necessary to produce an executable named fileCompressor

“Makefile” used to build the fileCompressor executable

make sure it has an 'all' target as its first directive, as well as a 'clean' target

“testplan.txt” that details the testing you ran your program through
“readme.pdf” that describes the design and implementation of your fileCompressor

include an analysis of the time and space usage and complexity of your program. You do not need to analyze every size function, but you should analyze the overall program's time and space used as a function of the size of the input.

Your grade will be based on:

Correctness

Design Quality

Code Quality

Efficiency

Testing

(how will your code operates)

(how well your code works)

(how well written and readable your code is)

(asymptotically in time and space)

(how thorough your cases are)

Grouping:

Unless explicitly exempted, you must have a partner, if you do not you will incur a 20pt penalty.

Be sure to sign up on the Google sheet for this assignment, even if you did last time.

If you do not, you will not receive a grade.

F-commands:

Do not use fopen, fread, fwrite, or any of their ilk.

(don't think it, don't code it .. don't think it, don't code it.. don't think it, don't code it)

More products

$6.00 OFF

$6.00 OFF

$6.00 OFF