Starting from:

$35

Programs: hencode and hdecode Assignment 3


"I’d crawl over an acre of ’Visual This++’ and ’Integrated Development That’ to get to gcc, Emacs, and gdb. Thank you." (By Vance Petree, Virginia Power)


| /usr/games/fortune


Programs: hencode and hdecode

This assignment is to build a le compression tool, hencode, that will use Hu - man codes to compress a given le, and a le decompression tool, hdecode, that will uncompress a le compressed with hencode.

Usage:
hencode in le [ out le ]

hdecode [ ( in le j - ) [ out le ] ]

hencode must:

take a command line argument that is the name of the le to be com-pressed; and

take a second, optional, command line argument that is the name of the out le. (If this argument is missing, the output should go to stdout.)

properly encode its input into a binary Hu man-coded le according to the algorithm below.

hdecode must:

Take an optional command line argument that is the name of the le to be uncompressed. If this argument is missing, or if the input le name is \-", hdecode will take its input from standard input.

Take a second, optional, command line argument that is the name of the output  le. (If this argument is missing, the output should go to stdout.)

Properly decode its input and restore the original data.

Both must:

use only Unix unbu ered IO (read(2) and write(2)) for reading and writing the les; and



1





share as much of the code base as is reasonable between the two programs; and

demonstrate robust programming techniques including proper handling of errors; and

adhere to a \reasonableness" standard with respect to performance.

See also the \Pesky Details" section below.

Hu man Codes

In 1952, David A. Hu man proposed a method for compressing les by gener-ating a variable length encoding of characters based on the relative frequency with which each character occurs in a le. This encoding is what is known as pre xless code because no letter encoding is a pre x of any other encoding. In practice, this means that the coding can be stored in a binary tree with the characters at the leaves. Each character’s encoding can be determined by look-ing at the sequence of right or left child node transitions and representing those as either zero bits or one bits.

Generating the Tree

Note: Be sure to follow the tiebreaker conventions outlined in this document. If you do something di erent, you will still get a valid encoding of your le, but will be di erent from the reference one. That means that both programs will have to work perfectly all the way through in order to be tested. If you follow the tiebreakers partial credit becomes a possibility.

To build the tree, it is rst necessary to generate a histogram of all the characters in the le, remembering that any value from 0 to 255 represents a valid byte. We will demonstrate the algorithm through the following encoding example.

If the  le you are encoding consists of the string \bmmmcccccttttttaaaaaaa"

the histogram will be:
Byte
Count



0x61
(’a’)
7
0x62
(’b’)
1
0x63
(’c’)
5
0x6d (’m’)
3
0x74
(’t’)
6




Take these counts and generate a linked list of them in ascending order of frequency. If there is a tie in frequency, do a secondary sort and place the characters also in ascending order. For the histogram above, this list will look like:





2








0x62 ('b')    0x6d ('m')    0x63 ('c')    0x74 ('t')    0x61 ('a')
1           3               5              6              7




Now, remove the rst two nodes on the list and construct a new node whose frequency is the sum of the two removed nodes. Make the rst removed node the left child of the new node and the second removed node the right subchild:






4
0
1
0x62 ('b')
0x6d ('m')
1
3



Next, reinsert the new node into the remaining list of nodes in order. If there is a frequency tie, insert the new node before the other node in the list. After reinsertion of the new node, our list will look like:






4
0x63 ('c')
0x74 ('t')
0x61 ('a')


5
6
7





0

1


0x62 ('b')

0x6d ('m')


1

3





Repeat the process until there is only one node left in the list. The rest of the process for this le is shown in Figure 1.

Now that you have the tree, to encode the le all that is necessary is to do a pass over the tree to extract the encodings into a table, then re-read the input le and translate each input character into the appropriate sequence of bits. If the le ends with a partially lled byte, pad the nal byte with zeros.

The codes extracted from the tree above will be:


3







0x74 ('t')
0x61 ('a')

9
6
7








0
1


4
0x63 ('c')



5





0
1


0x62 ('b')
0x6d ('m')

1

3



(a) After a second pass








9

13



0
1
0
1

4
0x63
('c')
0x74 ('t')
0x61 ('a')



5

6
7







0

1




0x62 ('b')

0x6d ('m')



1

3







(b) After a third pass









22




0
1




9
13



0
1
0
1

4
0x63 ('c')
0x74 ('t')
0x61 ('a')



5
6
7






0

1
4








0x62 ('b')

0x6d ('m')


1

3







(c) The    nal tree





0x61 (’a’):    11

0x62 (’b’):    000

0x63 (’c’):    01

0x6d (’m’):    001

0x74 (’t’):    10

File Format

The output format for the encoded le consists of two parts. First, comes a header that contains the frequency information necessary to re-create the tree, then the bits of the encoded le.

The rst four bytes of the header consists of an integer containing the number of characters in the frequency table. (Remember, not all les will contain all 256 possible bytes.) Following this comes the frequency table, in alphabetical order1. Each element of the frequency table consists of a single byte that it the byte itself, followed by a four-byte integer that is the frequency.


Number of chars

(uint32 t)

4 bytes

c1

(uint8 t)

1 byte

count of c1 (uint32 t) 4 bytes

c2

(uint8 t)

1 byte

count of c2 (uint32 t) 4 bytes

. . .

. . .

. . .

cn

(uint8 t)

1 byte

count of cn (uint32 t) 4 bytes

Figure 2 shows the resulting encoded le for the input above (for a little-endian machine) as hexadecimal bytes. Boxes are drawn around multi-byte data for clarity. Note that there is no padding in this particular example. Also, there is no separation between the header and the body, nor is there an end of le marker. These are all determined by counting.


Header



Body


Header size
’a’

Count

’b’

Count

















05
00
00
00
61
07
00
00
00
62
01
00
00
00













’c’

Count

’m’

Count

’t’

Count
















63
05
00
00
00
6d
03
00
00
00
74
06
00
00
00













File Contents

























04
95
56
aa
bf


























Figure 2: \bmmmcccccttttttaaaaaaa" as encoded by hencode.


Encoding

To encode the le, build the tree, extract the encodings into a code table, then re-read the input generating the output as you go.

Decoding

To decode, read the header of the encoded le, regenerate the tree, then use the bits of the le to walk the tree to regenerate the le. Start at the root. For each


    • This is not necessary for le compression, but it will make your output match the reference program’s output.


5





zero bit, go left, for each one bit, go right. Because this is a pre xless code, you will know you have decoded a character when you reach a leaf. Output this character and start again at the root.

You know how many characters are in the output from the counts encoded in the frequency table.

Tricks and Tools

xxd(1)

Useful for reading binary  les to see what’s what.



open(2)

The Unix basic IO functions. Check out the man pages.
close(2)


read(2)


write(2)


lseek(2)






Figure 3: Some potentially useful library functions

Some potentially useful library functions listed in Figure 3. Also, it might be helpful to know that <stdint.h> de nes a set of xed-width data types which are more portable than ints when you really need to know how big something is. These exist in signed and unsigned versions named intXX t and uintXX t, where XX is the number of bits. For example, uint32 t size makes size a 32-bit unsigned integer.


Coding Standards and Make

See the pages on coding standards and make on the cpe 357 class web page.


Pesky Details

Endianness

Because you are reading and writing binary data, the endianness of integers will be apparent. For this assignment you do not need to be concerned with this, but know that les encoded on a big-endian machine will not decode properly on a little-endian one (and vice-versa). Note that the x86 (vogon) is little-endian but if you’re developing on a PowerPC Mac, it’s big-endian.

Bits and their order

If it is necessary to pad the  nal byte of the  le, pad it with zero bits.

When lling bits, ll from the high order bits. That is, if the nal four bits of an encoded le are 1010, the nal byte of the le will be 0xA0.




6





What to turn in

Submit via handin on the Unix servers to the asgn3 directory of the ngonella account:

your well-documented source  les.

A make le (called Makefile) that will build your programs when given the command \make all".

A README le that contains: { Your name.
{ Any special instructions for running your program.

{ Any other thing you want me to know while I am grading it.

The README le should be plain text, i.e, not a Word document, and should be named \README", all capitals with no extension, ‘.md‘, or ‘.txt‘.

Sample runs

Below are some sample runs of hencode and hdecode.

Here is an example of encoding the    le used in the example above:

    • htable testfile 0x61: 11

0x62: 000

0x63: 01

0x6d: 001

0x74: 10

    • hencode testfile testfile.huff

    • hdecode testfile.huff testfile.out

    • diff testfile testfile.out

    • ls -l

total 12






-rw-------
1
ngonella ngonella 22
Oct 18
15:00 testfile
-rw-------
1
ngonella ngonella
35
Oct
18
15:02 testfile.huff
-rw-------
1
ngonella ngonella
22
Oct
18
15:02 testfile.out

Note that, because testfile is so small, this made the \compressed" le bigger. Consider this example with a larger le, the class notes for cpe357 so far:

% ls -l class.ps

-rw-------    1 ngonella ngonella 339520 Oct 18 15:04 class.ps

    • hencode class.ps class.ps.huff

    • ls -l class.*


7





-rw-------
1
ngonella ngonella
339520
Oct
18
15:04 class.ps
-rw-------
1
ngonella ngonella
217162
Oct
18
15:05 class.ps.huff
%







A reduction of 36% isn’t so bad.



















































8

More products