Assignment 8. Hufman Coding Solution

Starting from:

~~$35~~

$29

Home

Late days may not be used on this assignment, and no late submissions will be accepted without prior approval by the head TA. All requests for extensions must be received at least 24 hours before the assignment deadline.

All due dates and submission times are expressed in Pacifc time.

You are permitted to work on this assignment in pairs.

Your fnal assignment of the quarter is to write a program that uses Hufman coding to compress and decompress fles. It pulls together ideas from all over the quarter – recursive exploration, linked structures, and trees. Once you’ve fnished coding this one up, you’ll have a fairly impressive piece of software that’s powered by a bunch of clever algorithms. We hope this is a ftting capstone to your experience in CS106B!

Here’s our recommended timetable for completing this assignment:

Aim to complete Milestone One by Monday.

Aim to complete Milestone Two by Tuesday.

Aim to complete Milestone Three by Wednesday.

Aim to complete Milestone Four by Thursday.

Bask in the glory of having completed the last assignment of the quarter, and explore Part Five on Thursday to appreciate the program you’ve just written.

Assignment Logistics

Starter Files

We provide a ZIP of the starter project. Download the zip, extract the fles, and double-click the .pro fle to open the project in Qt Creator.

Starter code

Resources

Check out our Guide to Hufman Coding for more information about how the Hufman coding algorithm works.

Getting Help

Keep an eye on the Ed forum for an announcement of the Assignment 8 YEAH (YEAH = Your Early Assignment Help) group session where our veteran section leaders will answer your questions and share pro tips. We know it can be daunting to sit down and break the barrier of starting on a substantial programming assignment – come to YEAH for advice and confdence to get you on your way!

We also here to help if you get run into issues along the way! The Ed forum is open 24/7 for general discussion about the assignment, lecture topics, the C++ language, using Qt, and more. Always start by searching frst to see if your question has already been asked and answered before making a new post.

To troubleshoot a problem with your specifc code, your best bet is to bring it to the LaIR helper hours or ofce hours.

https://web.stanford.edu/class/archive/cs/cs106b/cs106b.1234/assignments/a8/ 1/10
2023/3/30 20:04 CS106B Huffman Coding

Milestone One: Build the Hufman Tree

In this frst milestone, you’ll implement this function:

EncodingTreeNode* huffmanTreeFor(const string& str);

This function takes as input a piece of text, then builds a Hufman coding tree for that text using the algorithm from class. (It’s the one that makes a priority queue of a bunch of singleton nodes, then repeatedly combines nodes together.)

Before you code this up, take a minute to make sure you understand how Hufman coding works.

Short Answer Questions, Part One

Edit the fle ShortAnswers.txt with your answer to the following question:

Q1. Draw the Hufman coding tree that would be produced for the input string "aabbbbccc" by following the algorithm from class. One specifc note: when merging trees together, structure your tree so that the frst node removed from the priority queue becomes the 0 child of the tree and the second node removed from the priority queue becomes the 1 child of the tree.

Now, back to C++. The result of huffmanTreeFor is a EncodingTreeNode* object, where

EncodingTreeNode is defned as follows:

struct EncodingTreeNode {

char ch;
// Which character is stored here.
EncodingTreeNode* zero;
//
Child tree labeled
0.
EncodingTreeNode* one;
//
Child tree labeled
1.
};

The ch variable here is only meaningful if the EncodingTreeNode is a leaf node in the tree (do you see why?), so you should not read or write this variable in a EncodingTreeNode unless that node is a leaf.

In the course of coding up this part of the assignment, you’ll need to use a priority queue. You built one of these in Assignment 5, but that HeapPQueue type was specifcally designed to work with DataPoint objects. We’ve provided the "priorityqueue.h" header fle, which defnes a PriorityQueue<T> template that can store objects of any type. Consult the Stanford C++ Library Documentation for more information about how to use this type. In particular, make sure you know how to enqueue and dequeue elements and how to determine the priorities of relevant items.

One little nuance that you’ll need to be aware of: the test cases in the starter fles we’ve provided all assume that, when you’re running the Hufman tree construction algorithm, the frst node you pull out of the queue will end up as the 0 child of the newly-constructed node, and the second node you pull out of the queue will end up as the 1 child. Make sure to follow this convention in this assignment.

To summarize, here’s what you need to do:

Milestone One Requirements

1. Implement the huffmanTreeFor function in Hufman.cpp.

2. Add at least one custom test case for this function, and test your code thoroughly.

Our test coverage here is not as robust as in the previous assignments. You will need to test your code thoroughly. A recommendation: write a test case that creates a Hufman tree from a string of your choosing, then set a breakpoint and run the tests with the debugger on. Using the same techniques from Assignment 7 that you used to map out and escape the

Due Friday, March 17 at 1:00 p

Assignment Logistics

Milestone One: Build the Hufman T

Milestone Two: Implement Text Enc

Milestone Three: Implement Tree En

Milestone Four: Put It All Together

Milestone Five: Enjoy Your Creation

(Optional) Milestone Six: Extension

Submission Instructions

https://web.stanford.edu/class/archive/cs/cs106b/cs106b.1234/assignments/a8/ 2/10
2023/3/30 20:04 CS106B Huffman Coding

labyrinth, poke around in the debugger and draw out the trees that you’re producing. Do they match what you expected to fnd? If so, great! If not, step through your construction algorithm in the debugger and see what happens.

Some notes on this problem:

Do not edit the EncodingTreeNode type when solving this problem. (More generally, don’t edit this type at any point in the assignment.)

The tree construction algorithm we provided in class assumes that the input string has at least two diferent characters in it, and if this isn’t the case, your code should call the error() function to report an error. (Do you see why the algorithm doesn’t work in these cases?)

If multiple characters are tied for having the same weight, or if two or more intermediate trees are tied for having the same weight, you can break ties arbitrarily. Keep in mind that there can be many equally good Hufman trees for a given string, difering only in how the algorithm broke ties. As a result, if you’re writing custom test cases, make sure that your tests don’t assume anything about the specifc way that the algorithm breaks ties when they arise.

You should completely ignore the ch feld of EncodingTreeNode for all nodes except leaves, the same way that you completely ignored the value feld of the LinearProbingHashTable::Slot type when the type feld was set to SlotType::EMPTY.

You should be able to handle strings made of any characters, not just letters, numbers, etc. In particular, you can’t reserve certain char values to mean “this is an internal node.” Many fles on disk – especially fles containing non-text data – make use of all possible characters.

Milestone Two: Implement Text Encoding and Decoding

Your next task is to write a pair of functions that are inverses of one another:

string decodeText(Queue<Bit>& bits, EncodingTreeNode* tree); Queue<Bit> encodeText(const string& str, EncodingTreeNode* tree);

This decodeText function takes as input a Queue<Bit> representing some data that’s been compressed, along with a pointer to the encoding tree that was used, then returns the decompressed string. The encodeText function takes as input a string and an encoding tree, then returns a Queue<Bit> loaded with the bits formed by encoding that string using that tree.

Before you code this up, though, take a minute to make sure you know how to run the algorithms with a pencil and paper.

Short Answer Questions, Part Two

Edit the fle ShortAnswers.txt with your answer to the following questions.

Q2. Decode the string of bits 1100010011000100 using the encoding tree shown below:

Due Friday, March 17 at 1:00 p

Assignment Logistics

Milestone One: Build the Hufman T

Milestone Two: Implement Text Enc

Milestone Three: Implement Tree En

Milestone Four: Put It All Together

Milestone Five: Enjoy Your Creation

(Optional) Milestone Six: Extension

Submission Instructions

https://web.stanford.edu/class/archive/cs/cs106b/cs106b.1234/assignments/a8/ 3/10
2023/3/30 20:04 CS106B Huffman Coding

Q3. Encode the string ABRACADABRA using the encoding tree from Q2.

Returning to C++ Land, you’ll notice that these functions make reference to a Bit type, which represents a single bit. The Bit type works like a regular integer, though note that if you try assigning a value to a bit that isn’t a 0 or a 1 it’ll report an error. For example:

Bit b = 0;

queue.enqueue(0);

if (queue.dequeue() == 1) { ... }

You can’t perform arithmetic on Bits. But that’s a good thing, because there’s no reason you should need to do that in this assignment.

To summarize, here’s what you need to do:

Milestone Two Requirements

1. Add a custom test case for the decodeText function in Huffman.cpp.

2. Implement the decodeText function and test thoroughly.

3. Add a custom test case for the encodeText function.

4. Implement encodeText and test thoroughly.

Some notes on this problem:

You’re given the input Queue by reference in decodeText, and your code can modify that Queue however it feels like. The queue can end up empty, or could contain the original bits, or could contain whatever sequence of 0s and 1s that you’d like.

It's okay to write these functions either iteratively or with a mix of iteration and recursion. However, we do not recommend writing either function purely recursively, as the input Queue<Bit>s can be enormous and you won't have the stack space to make a new function call for each bit. However, if you want to recursively walk one or both trees when doing encoding or decoding, that's perfectly fne.

In our initial examples of encoding and decoding text in lecture, we had access to an explicit table that represented the bit patterns associated with each character. It is signifcantly easier to implement the encodeText function if you build a table like this before writing out the characters. However, having a table like that isn’t necessary for the decodeText function – do you see why?

You can assume that the input tree is not null and doesn’t consist of just a single node. (These are the two degenerate cases that we said you didn’t need to handle in huffmanTreeFor.)

Due Friday, March 17 at 1:00 p

Assignment Logistics

Milestone One: Build the Hufman T

Milestone Two: Implement Text Enc

Milestone Three: Implement Tree En

Milestone Four: Put It All Together

Milestone Five: Enjoy Your Creation

(Optional) Milestone Six: Extension

Submission Instructions

https://web.stanford.edu/class/archive/cs/cs106b/cs106b.1234/assignments/a8/ 4/10
2023/3/30 20:04 CS106B Huffman Coding

The tree given to encodeText and decodeText will always contain leaves holding all the characters that appear in the input string or input queue of bits (respectively), but may also contain other characters. The input tree does not necessarily have to be the optimal Hufman coding tree.

You can assume that the bits provided to decodeText form the correct encoding of a piece of text that was encoded using the same encoding tree that was provided to you. However, you should be careful about how you use this assumption. Specifcally, if there’s a bug in your encodeText, then you may end up passing invalid bits into decodeText – oops! – and so we recommend making your decodeText function as “bulletproof” as possible by detecting and handling errors as you fnd them.

In C++ there is a diference between the numbers 0 and 1 and the characters '0' and '1'. Be careful not to write something like

Bit zero = '0'; // Error!

Bit one = '1'; // Error!

since this attempts to take the ASCII codes for '0' and '1' (namely, 48 and 49) as values for bits. Instead, use the numbers themselves:

Bit zero = 0; // Beautiful!

Bit one = 1; // Pulchritudinous!

Milestone Three: Implement Tree Encoding and Decoding

One of the practical concerns of Hufman coding that we discussed in class is that if you do choose to use Hufman coding, you need to somehow store information about which encoding tree you used. That way, whoever is decoding the information knows which codes corresponding to which characters. This means that we’ll need to devise a pair of functions that allow you to take a tree and encode it in a way that can be stored on disk, plus a way to decode the encoded representation back into the tree.

There are many ways to do this, but one of the most space-efcient ways is the following. We’ll encode each tree as a pair of a Queue<Bit> representing the shape of the tree and a Queue<char> representing which characters are in the leaves of the tree. The shape of the tree is encoded as follows:

If the tree is a single leaf node, it’s represented by the bit 0.

If the tree is not a leaf node, it’s represented by a 1 bit, followed by the encoding of its zero (left) subtree, followed by the encoding of its one (right) subtree.

The Queue<char> representing which characters are stored in the leaves of the tree is found by reading the leaves in the pattern you’d get by doing either an inorder or postorder traversal of the tree. For example, here are several Hufman trees and how they’d be represented as a sequence of bits and a sequence of characters:

Due Friday, March 17 at 1:00 p

Assignment Logistics

Milestone One: Build the Hufman T

Milestone Two: Implement Text Enc

Milestone Three: Implement Tree En

Milestone Four: Put It All Together

Milestone Five: Enjoy Your Creation

(Optional) Milestone Six: Extension

Submission Instructions

https://web.stanford.edu/class/archive/cs/cs106b/cs106b.1234/assignments/a8/ 5/10
2023/3/30 20:04 CS106B Huffman Coding

Due Friday, March 17 at 1:00 p

Assignment Logistics

Milestone One: Build the Hufman T

Milestone Two: Implement Text Enc

Milestone Three: Implement Tree En

Milestone Four: Put It All Together

Milestone Five: Enjoy Your Creation

(Optional) Milestone Six: Extension

Submission Instructions

This is an extremely compact way of writing out a tree: the tree itself is written out with one bit per node, and the contents of the leaves are written out as-is. Most importantly, it’s possible to reverse the encoding back into a tree. Before writing code to compute or process these sequences, grab a pencil and paper and make sure you’re comfortable with how these work.

Edit the fle ShortAnswers.txt with your answers to the following questions.

Short Answer Questions, Part Three

Q4. Write out the Queue<Bit> and Queue<char> associated with this encoding tree:

Q5. Decode this Queue<Bit> and Queue<char> into an encoding tree and draw the result:

https://web.stanford.edu/class/archive/cs/cs106b/cs106b.1234/assignments/a8/ 6/10
2023/3/30 20:04 CS106B Huffman Coding

111000100

MDCLV

Your task in this part of the assignment is to write a pair of functions

void encodeTree(EncodingTreeNode* tree, Queue<Bit>& bits, Queue<char>& leaves);
EncodingTreeNode* decodeTree(Queue<Bit>& bits, Queue<char>& leaves);

This frst function takes a tree and produces the two Queues shown above. The second function takes in the two Queues and reconstructs the tree they represent.

Specifcally, you should do the following:

Milestone Three Requirements

1. Implement the encodeTree and decodeTree functions in Huffman.cpp.

2. Add at least two custom test cases, one for each function, and test your code thoroughly.

Some notes on this problem:

There are no requirements about what the Queue<Bit> or Queue<char> should contain after decodeTree fnishes running. They could be unchanged, or completely empty, or flled with whatever contents you’d like.

You can assume the inputs to decodeTree are correct, in that the bits and characters are a correct representation of some encoding tree, that there aren’t stray bits in the queue, that you have the exact correct number of characters that you need, etc.

However, as with encodeText and decodeText, a bug in your implementation of encodeTree may manifest as errors in decodeTree, so we recommend making your decodeTree function “robust” and able to identify and fag errors if it fnds them.

You can assume the tree provided as input to encodeTree is a valid coding tree – each internal node will have exactly two children, only the characters in the leaves matter, each character appears in at most one leaf, etc.

You can assume the input Queue<Bit> and Queue<char> to encodeTree are empty when that function is frst called.

You don’t need to worry about the edge cases that came up in the frst milestone. You can assume the tree exists and contains at least two leaves.

Milestone Four: Put It All Together

You now have all the pieces you need to build the fnal Hufman encoder and decoder.

Implement the following pair of functions:

HuffmanResult compress(const string& text);

string decompress(HuffmanResult& file);

This frst function takes as input a string of text and produces a HuffmanResult containing all the information necessary to store the compressed contents of that string to disk. The second function takes as input a compressed HuffmanResult and returns the string that it encodes. Here, the HuffmanResult type represents a combination of the encoded tree and the encoded text:

Due Friday, March 17 at 1:00 p

Assignment Logistics

Milestone One: Build the Hufman T

Milestone Two: Implement Text Enc

Milestone Three: Implement Tree En

Milestone Four: Put It All Together

Milestone Five: Enjoy Your Creation

(Optional) Milestone Six: Extension

Submission Instructions

https://web.stanford.edu/class/archive/cs/cs106b/cs106b.1234/assignments/a8/ 7/10
2023/3/30 20:04 CS106B Huffman Coding

struct HuffmanResult {

/* Encoded version of the Huffman tree. */

Queue<Bit> treeBits;

Queue<char> treeLeaves;

/* Encoded version of the message. */

Queue<Bit> messageBits;

};

This section should mostly consist of putting together the diferent pieces you wrote earlier in the right order. In doing so, you may discover that there were some bugs lurking in your implementation, which you’ll need to then correct to get everything working. (It’s common in software engineering to fnd that each individual piece of a program passes its tests independently but fails when things come together; it’s usually either due to tests not covering every possible case or to some unexpected interactions between the components). When that happens, use the debugger to isolate where the issue is. Did you build the encoding tree incorrectly? Did you encode the message bits wrong, or is the issue in the decoder?

To summarize, here’s what you need to do:

Milestone Four Requirements

1. Implement the compress and decompress functions in Huffman.cpp.

2. Add at least one custom test case to ensure that your code works. Take this step seriously – writing good tests here will help you identify and smoke out errors that might be lurking in your code.

Testing is key here! We have not provided enough tests with the starter fles to poke and prod these functions in all the ways they can be poked and prodded, and you should not consider your implementation to be ready until you’ve written some tests of your own and used the interactive “Compress a File” and “Decompress a File” options to confrm that your code works end-to-end.

Some notes on this part of the assignment:

Your implementation should call error() if the input string to compress doesn’t have at least two distinct characters in it.

You can assume that the input to decompress() is a valid compressed fle and don’t need to worry about what happens if this isn’t the case. Then again, making this function as robust as possible is a great idea during testing.

Our provided starter fles contain logic to take a HuffmanResult and save it to disk and to take a HuffmanResult and load it from disk. The internal logic of how this works is more of a CS107-level topic that involves manipulating individual bits of variables, which is beyond the scope of what we’re going to cover in CS106B. However, if you’re curious to see how this works, you’re welcome to check out Demos/HuffmanResult.cpp to learn more!

Make sure not to leak any memory in your implementations.

There are no requirements about what the fnal contents of the HuffmanResult may be after calling decompress. You can leave it unchanged, empty it out, etc.

Milestone Five: Enjoy Your Creation!

Congratulations! You’ve just built a compression and decompression algorithm! All that’s left to do now is to try it out on a few examples to see how well it does.

We’ve bundled a number of sample fles with the assignment, some already compressed (they have the sufx .huff) and some not yet compressed. Use your program to decompress the compressed fles to see what they contain, and then try compressing some of the sample fles so that you can see how well those fles compress!

Due Friday, March 17 at 1:00 p

Assignment Logistics

Milestone One: Build the Hufman T

Milestone Two: Implement Text Enc

Milestone Three: Implement Tree En

Milestone Four: Put It All Together

Milestone Five: Enjoy Your Creation

(Optional) Milestone Six: Extension

Submission Instructions

https://web.stanford.edu/class/archive/cs/cs106b/cs106b.1234/assignments/a8/ 8/10
2023/3/30 20:04 CS106B Huffman Coding

Try decompressing the fle samples/Congratulations.bmp.huff. That fle will expand out to one named Congratulations.bmp, which you should defnitely take a look at once you’ve fnished the assignment. The same is true for the fle samples/Woohoo!.mp4.huff.

There are no deliverables for this section. Just play around and see what you fnd!

If you enjoyed this assignment, you might want to consider taking ENGR 76 (Information Science and Engineering), which explores topics like Hufman coding and information theory.

(Optional) Milestone Six: Extensions!

If you enjoyed this assignment and want to run further with it, we’d love to see what you come up with! Here are a few suggestions to help you get started.

Compute Shannon entropy. The Shannon entropy of a piece of text is, intuitively, how "surprised" you should be when you see a character. It gives a way of quantifying the information content of the text, under the assumption that each character is chosen independently of all the others around it. Read up online on the defnition of Shannon entropy, then write some code to compute the Shannon entropy of the input text.

In a sense, the Shannon entropy is the theoretical lower bound on the number of bits required for an "average" character in the text. Specifcally, if your text is characters

n
long and has Shannon entropy , then any compression scheme based on prefx codes
must use at least
H

bits. Compare how many bits are actually generated by Hufman
nH
coding against the theoretical minimum predicted by Shannon entropy. What do you fnd?

Implement a more advanced compression algorithm. Hufman coding is a good compression algorithm, but there are better alternatives in many cases. Many modern compression algorithms are based on an algorithm called LZW, named after the initials of its inventors. Other topics to look up include move-to-front coding, run-length encoding, and the Burrows-Wheeler transform.

Submission Instructions

Before you call it done, run through our submit checklist to be sure all your ts are crossed and is are dotted. Make sure your code follows our style guide. Then upload your completed fles to Paperless for grading.

Partner Submissions:

If you forget to list your partner you can resubmit to add one

Either person can list the other, and the submissions (both past and future) will be combined

Partners are listed per-assignment

You can't change/remove a partner on an individual submission

Please submit only the fles you edited; for this assignment, these fles will be:

ShortAnswers.txt

Huffman.cpp

You don't need to submit any of the other fles in the project folder.

Submit to Paperless

If you modifed any other fles that you modifed in the course of coding up your solutions, submit those as well. And that’s it! You’re done!

Good luck, and have fun!

Due Friday, March 17 at 1:00 p

Assignment Logistics

Milestone One: Build the Hufman T

Milestone Two: Implement Text Enc

Milestone Three: Implement Tree En

Milestone Four: Put It All Together

Milestone Five: Enjoy Your Creation

(Optional) Milestone Six: Extension

Submission Instructions

https://web.stanford.edu/class/archive/cs/cs106b/cs106b.1234/assignments/a8/ 9/10
2023/3/30 20:04 CS106B Huffman Coding

Congratulations on fnishing the last assignment of the quarter!

We’re extremely impressed with how much progress you’ve made since the start of

the quarter. Best of luck going forward!

Due Friday, March 17 at 1:00 p

All course materials © Stanford University 2023 Assignment Logistics

Website programming by Julie Zelenski with modifcations by Keith Schwarz • Styles adapted from Chris Piech • This page last updated 2023-Mar-18
Milestone One: Build the Hufman T

Milestone Two: Implement Text Enc

Milestone Three: Implement Tree En

Milestone Four: Put It All Together

Milestone Five: Enjoy Your Creation

(Optional) Milestone Six: Extension

Submission Instructions

https://web.stanford.edu/class/archive/cs/cs106b/cs106b.1234/assignments/a8/ 10/10