Huffman Tree Encoding Solution

Starting from:

$30

Home

You are asked to write a compression algorithm to compress English language text files.

You decide to use a Huffman encoder, which is a fairly quick and easy way of turning a

sequence of ASCII characters into a compressed bit stream. For this tutorial, you need

only implement the compression side, and you do not need to pack the output into a stream

of bytes. You can simply use the ASCII characters 0 and 1 to represent the compressed

bit representation. Of course, your compression factor will be 8x less as a consequence!

So for extra credit, you can do some bit operations to pack the 0s and 1s into an array of

bytes.

1

Constraints

The constraints for this tutorial are as follows:

1. Your implementation must use at least a HuffmanTree class (which manages the tree

and has methods to compress data, build a Huffman tree etc) and a HuffmanN-ode

class, which represent the tree nodes in the underlying Huffman tree (which is a

simple binary tree). These must be constructed and destructed appropriately. Your

constructor and destructor will need to be defined to use the specified smart pointers

to manage memory see point 2) below.

2. You must use std::shared ptr<HuffmanNode to represent the left/right links for

your tree nodes as well as the root of the tree in your HuffmanTree class.

3. your destructors must *not* explicitly delete the tree node representation: this should

happen automatically when the root node is set to “nullptr” in the Huff-manTree

destructor. You should not call delete at all in the HuffmanNode destruc-tor. That is

the benefit of using a managed pointer. You may, however, allocated your

HuffmanTree instance on the heap using new, in which case you would call delete to

free that up and the end of your program.

4. You will need to implement: a default constructor, a move constructor, a copy

constructor, a destructor, an assignment operator and a move assignment operator

more details are provided below.

5. Your program will be driven from the command line, and should be invoked as follows:

huffencode <inputFile<output file where huffencode is the nameof your application, inputFile is an input English (ASCII) text file and output file

is the names of your output compressed “bitstream”.

Note: if you do not do any bit packing, this file may be larger than the input (!).

To compute the actual packed file size in bytes (approximately) you can compute:

(Nbits/8) + (Nbits%8 ? 1 : 0) using integer arithmetic. This accommodates

extra padding bits needed to deal with output bit stream sizes that are not a power

of 2.

6. Prior to constructing the huffman encoding tree you have to count the number of

occurrences of each letter in the file. To store these frequencies you can use a

std::unordered map with keys of type character and values of integer type (the

unordered map is very similar to Java’s HashMap). When you build the Huffman tree,

you will need this information to decide how to insert nodes for each letter. The

relevant header file is <unordered map. More information is available at http://

en.cppreference.com/w/cpp/container/unordered_map.

7. You should use a std::priority queue<HuffmanNode data structure to assist you

in the building the tree. The relevant header file is <queue

A priority queue returns the next largest or smallest item, depending on how you

have set it up. This is required to decide which tree nodes to aggregate as you build

your tree. To ensure it works, you will need to provide a binary function predicate

bool compare(const HuffmanNode& a, const HuffmanNode& b)

{

If (a < b) return true; // or if the algorithm requires that ordering

else return false;

}

You will need to ensure the “<” operation is evaluated sensibly for the type Huff-

manNode , as required by the tree construction algorithm. This function (which

can be a free function) will be used by the priority queue to decide how to order its

HuffmanNode elements.

A priority queue supports a “push()” method (to insert elements), a “top()” method

to copy the next element to be returned, and a “pop()” method to discard the next

element to be returned. You can call “empty()” to see if the queue is empty or

“size()” to see how many elements remain.

To create an instance of the priority queue you would have code like the follow-

ing: std::priority queue<HuffmanNode, comparemyQueue; See http://www.

cplusplus.com/reference/queue/priority_queue/pop/ for a simple usage with

plain old ints.

2

Huffman compression algorithm

You can build a Huffman tree as follows:

1. allocate a shared pointer<HuffmanNode for each letter, which contains the letter

and frequency;2. create a std::priority queue<HuffmanNode, compare, making sure that the

right ordering (i.e. return smallest item) has been set in the compare function;

Then, repeat until the queue has only one node, which will be the tree root:

3. choose the two nodes with the smallest letter frequency and create a new internal

parent node which has these two nodes as its left and right children. The smaller

nodes are inserted as the left child. These nodes should be popped out of the queue

-you do not need to examine them again;

4. the new node (an internal node) has no letter associated with it, only a frequency.

This frequency is the sum of the frequencies of its two children.

5. insert this new node in to the priority queue

At the end of this process you will have built your Huffman tree. If you have done this

correctly the smart pointers will be sharing resources correctly and will auto destroy when

the HuffmanTree goes out of scope (or is deleted) at the end of your program.

3

Code table

Next, you need to build the “code table”: this is a mapping that gives a bit string code for

each letter in your alphabet. The easiest way to do this (perhaps) is to recurse down the

tree from the root building up a string as you go. When you hit a leaf node with a letter,

you then output that letter with the corresponding bit string you have accumulated. When

traversing the tree, add a “0” if branching left and add “1” if branching right. You can

use an inorder walk and output when you get to a leaf node. It is up to you to decide how

to store this code book. Ideally, finding symbol and its corresponding bit string should be

fast.

4

Compression

To compress your ASCII text file, take each character, in turn, finding its bit string code and

write this into the output buffer. This “buffer” can just be a std::string which you

append characters/strings to. At the end, you can extract the c str() (a pointer to a byte

array) and write this out to file as a block of bytes. You should write out your code table to a

separate file named “output file”.hdr where output file is passed using command line

argument. The header should have a field count at the top and a field for every character

(consisting of the character and its code representation).

5

Mark allocation

Doing everything required above will get you 90%.

Extra credit:

• Convert to bit stream (5%): Convert your bits into an actual bit stream and store

this in a byte array that is large enough; this should be written to disk as a binary

file with a single int as header which lists the number of bits in the file. (you can

work out number of bytes to read from this).• Now write a method to read in and unpack the bitstream (5%). You should check

that your unpacked bitstream matches the original encoded data.

• You may find the discussion at http://www.cprogramming.com/tutorial/bitwise_

operators.html useful if you have never done this in Java. Take care to observe the

behaviour of the shift operators for signed and unsigned integral types.

6

Unit tests

You must provide unit tests for each of the sections of your code. The unit tests should

be specified in a separate file and compiled using a separate rule in your Makefile. Unit

testing will require that you separate out the different steps into separate units of work,

so that they can be tested individually. Therefore you should aim not write all of your

code in one very long driver file for instance.

The unit tests will account up to 10% of your final mark (depending on whether each of

the steps (frequency counting, tree construction, code table construction (tree traversal),

encoding, etc.) is tested to satisfaction. The marks awarded for bit packing and unpacking

includes the unit tests to show that these work as expected.

You should use the catch.hpp unit testing framework to test your code. A full tutorial is

available at https://github.com/philsquared/Catch/blob/master/docs/tutorial.md.

7

Suggested work flow

• Week 1:

– Define classes, default constructor/destructor

– Build tree

– Construct frequency table

– Write unit tests

• Week 2:

– Add in all extra constructors /move ops for tree and Node

– create code book table

– Encode string

– Output bit string to file

– Compute compression ratio (on “bit string” vs input chars)

– extra: pack bit string, unpack

– More unit tests

Please Note:

1. A working makefile must be submitted. If the tutor cannot compile your program

in the lab by typing make, you will receive 50% of your final mark.2. You must use version control from the get-go. This means that there must be a .git

folder alongside the code in your project folder. A 10% penalty apply should you

fail to include a local repository.

3. You must provide a README file explaining what each file submitted does and how

it fits into the program as a whole. The README file should not explain any theory

that you have used. These will be used by the tutors if they encounter any problems.

4. Do not hand in any binary files. Do not add binaries (.o files and your executable) to

your local repository.

5. Please ensure that your tarball works and is not corrupt (you can check this by trying

to extract the contents of your tarball - make this a habit!). Corrupt or non-working

tarballs will not be marked - no exceptions!

6. A 10% penalty per day will be incurred for all late submissions. No hand-ins will be

accepted if later than 5 days.

7. DO NOT COPY. All code submitted must be your own. Copying is

punishable by 0 and can cause a blotch on your academic record.

Scripts will be used to check that code submitted is unique.