$24
Objectives of this assignment
Use the C programming language to write your implementation of the Lempel-Ziv-Welch compression/decompression algorithm, in a file named “LZW.c”
Test your code against the provided test data, and other test data you create, to ensure it’s working as expected. The instructor will also be testing your program against test data that isn’t provided to you. Think creatively about how to “break” and strengthen your code!
This assignment: “LZW.c”
In this assignment, we will implement Lempel–Ziv–Welch (LZW), a universal, lossless data compression algorithm. LZW is the algorithm of the Unix file compression utility, compress, and is used in the GIF image format. “Lossless” means that the compression operation preserves 100% of the information in the original format: with a correctly implemented algorithm, you can always recover the identical original data given its compressed form, even though the compressed form maybe significantly smaller.
While the (glory) days when you could develop a company based on a compression algorithm are long-gone, there is still lasting practical value in knowing how to manipulate files of all types while implementing algorithms of increasing complexity, not to mention continuing to gain proficiency with array-based implementations -- we can achieve all of the above with a deeper dive into compression via LZW.
Broadly speaking, compression is a process of recognizing recurring patterns in some piece of information to be communicated, and replacing multiple duplicate patterns with shorter references to entries in a list of such patterns. For example, imagine you have a list of numbers on a piece of paper, and you need to read it to a friend over the phone. As you’re going through the list, if you see the same number repeated enough times, you’re likely to say something like “48, 3011 ten times, 189, …” and rely on a shared understanding of natural language grammar to distinguish between the in-band data (the numbers) and the out-of-band data (the instructions about patterns). Unfortunately, digital data doesn’t have a “universal grammar” -- general-purpose data compression algorithms that are expected to operate on any type of data must be able to learn the repeated sequences in particular files automatically. As in life, “There ain't no such thing as a free lunch”. For files that contain random data, or patterns that are too diverse or too sparsely distributed, a general-purpose compression algorithm might even produce “compressed” data that’s larger than the original!
LZW is a method for adaptive / dynamic compression - it starts with an initial model, reads data piece by piece and then updates the model while encoding the data as it proceeds through the input file. LZW can be used for binary and text files - it works on any type of file -- that said, it is
known for performing well on files or data streams with (any type of) repeated substrings (e.g.
text files, which often contain frequently used words like “an”, “the”, “there”, etc.).
With English text, the algorithm can be expected to offer 50% compression levels or greater. Binary data files, depending on structuring are capable of decent compression levels as well - for this assignment, file size is not part of the grading criteria--in fact, the evaluation relies on your code producing exact replicas of the expected compressed files so don’t get toocreative!
As in the previous assignment, we are interested in “round-trip” encoding and decoding, translating from the original format to encoded and back to original. I hope that some of the lessons from the previous assignment will become useful sources for your reuse. While the compression technique may be new, other details remain similar. Implementing a more complex algorithm shows how we might layer on complexity once we establish base skills.
Background Information
Dictionary Data Structure
A dictionaryis a general-purpose data structure for storing a group of objects. A dictionary has a set of keys and each key has a single associated value - the elements in this abstract data structure are commonly called, key-value pairs.Dictionaries are usually implemented using hash tables, and are often known as “dict”, “hash”, “map”, or sometimes “hashmap” in different programming languages.
Binary Coding Schemes
The “LZW” algorithm described by Welch's 1984 article [1] encodes variable-length sequences of 8-bit bytes as fixed-length 12-bit codes.
As an example of a binary code, this toy coding scheme uses three-bit codes to represent a choice of eight possible symbols (the letters ‘A’ through ‘H’). The choice of symbols is arbitrary--we could have used eight different emoji. As long as the sender and receiver have a shared understanding of what the codes mean, they can communicate using compressed data.
binary code
symbol
000
A
001
B
010
C
011
D
100
E
101
F
G
H
In a binary code, each additional bit added to the code scheme doubles the number of distinct symbols that can be encoded--for an n-bit code, 2ndifferent symbols can be represented. In the classic LZW algorithm, the choice of a 12-bit coding scheme allows for 4096 distinct symbols with binary codes between 000000000000 and 111111111111. Each binary code can also be represented as three hexadecimal digits, e.g. 000 to FFF.
The dictionary data structure to be used for this assignment has a capacity of 4096 entries (212), of which:
the first 256 entries, numbered 0 through 255, are reserved for every possible 1-byte value (i.e. entry #0 is for the ASCII NUL character, entry #32 is for the ASCII space character, entry #65 is for the capital letter “A”... look at man ascii!(press ‘q’ to exit the man page reader))
the entries numbered 256 through 4094 are assigned dynamically by the compression algorithm to represent multi-byte sequences in the input data.
the last entry, numbered 4095, is reserved for a padding marker
The code that refers to an entry in the dictionary is simply the entry’s numeric position, encoded as a 12-bit binary integer.
Additional Notes
Both encoding and decoding programs refer to the dictionary countless times during the algorithm. You may want to reference the hints in the Provided Materials section of this document.
The encoder looks up indexes in the dictionary by using strings. That is, to aid the look-up of indexes, we use a dictionary of strings - i.e. an array of strings. With each string having some maximum number of characters, you might consider using as your data structure, a multidimensional array -- as discussed in Unit 3, Slide 29, this has a form similar to: e.g., X[row][column], recall that we access elements of the array through integer indexes.
Why select such an approach? A data structure with 0(1), “Order-1 time complexity”, can be useful in practice.
Algorithm and Example
LZW Encoding
Start with a standard initial dictionaryof 4096 entries (each entry can be identified using a 12-bit number)
Set wto ''
As long as there are more characters to read
read a character k
if wkis in the dictionary
set wto wk
else
output the code for w
add wkto the dictionary, if it’s not full
set wto k
output the code for w Note #1: Step 2 is:
Set w to '' It is not:
Set w to ' '
That is, you start off with w set to nothing, not a space character. So if you append 't' to '', you end up with 't' (not ' t').
Note #2: The first 256 entries (0-255) in the LZW dictionary should be the 1-byte codes whose values are 0-255. That is,
dict[42][1] is 42 (and dict[42][0] is 1) dict[228][1] is 228 (and dict[228][0] is 1)
In general, for 0 <= i <= 255: dict[i][1] is i (and dict[i][0] is 1)
LZW Decoding
Start with a standard initial dictionaryof 4096 entries (the first 256 are standard ASCII, the rest are empty).
Read a code kfrom the encoded file
Output dict[k]
Set wto dict[k]
As long as there are more codes to read in the input file
read a code k
if kis in the dictionary
output dict[k]
add w+ first character of dict[k] to the dict
else
add w+ first char of wto the dict andoutput it
c. Set wto dict[k]
The programs LZW.c takes as its first two inputs, the name of an input file as the first command line argument, and the name of an output file as the second command line argument.
If the second passed command line argument is an ‘e’, then the program creates a new file with the same name as the file in the command line, but with the extension .LZW. The new file contains the LZW encoding of the original file.
If the second passed command line argument is a ‘d’, then the program takes the name of compressed file as a command line argument (a file ending in extension, .LZW). The program interprets the original file name by dropping the .LZW file extension and then decodes the LZW encoding back to the “original.” (actually a copy).
The program should open both the input file and output file as binary files and make no assumption about what byte values appear in the files. Here's what their behaviour looks like:
/assignment2 cat tests/encode_short_string/input.txt this_is_his_history
/assignment2 ./tools/b2x ./tests/encode_short_string/input.txt 746869735F69735F6869735F686973746F72790A
/assignment2 ./LZW ./tests/encode_short_string/input.txt e
/assignment2 ./tools/b2x ./tests/encode_short_string/input.txt.LZW 07406806907305F10205F10110310707406F07207900AFFF
# different hex code means the input file and the encoded file are different
/assignment2 diff ./tests/encode_short_string/compare.txt.LZW
./tests/encode_short_string/input.txt.LZW
# no output means the files are the same
rename our file so we don’t overwrite it (harder to determine if program executed correctly) mv ./tests/encode_short_string/input.txt ./tests/encode_short_string/input.old
now let’s decode our compressed file, and make sure it matches the original input
/assignment2 ./LZW ./tests/encode_short_string/input.txt.LZW d
/assignment2 cat ./tests/encode_short_string/input.txt
this_is_his_history
~/assignment2 diff ./tests/encode_short_string/input.txt ./tests/encode_short_string/input.old
no output means they’re the same! ~ /assignment2
This session shows the contents of the file input.txt. The file contains the string this_is_his_history.
The hexadecimal values of the ASCII codes are also shown. Running the program LZW(with the encode flag) on input.txt produces a file input.ext.LZW. The file input.txt.LZWcontains 12-bit values for the LZW codes:
074 = t
068 = h
069 = i
073 = s
05F=_
102 = is
05F=_
101 = hi
103 = s_
etc.).
The decode version of the program, LZW <filename d, then correctly recovers the original file name from the given input, dropping .LZW from the input file name, and recovers the original data.
Requirements
The requirements for this assignment are as follows:
Use arrays (and not dynamic memory allocation!) as an aggregate data type for file processing (encoding and decoding).
Create a main function that:
a. Takes two command line arguments:
A string indicating the name of the file to operate on (the “input” file);
A one-character flag indicating whether the program should operate in encoding or decoding mode. Use ‘e’ for encoding mode, or ‘d’ for decoding mode.
If the second argument is not provided, or is not either ‘e’ or ‘d’,
print “Invalid Usage, expected: LZW {input_file} [e | d]” to the console
terminate the program with exit code 4.
b. Opens the input file indicated by the first argument.
If there is no filename specified:
print “Error: No input file specified!” to the console
terminate the program with exit code 1.
If the input filename specified doesn’t exist or can’t be read for any reason:
print “Read error: file not found or cannot be read” to the console
terminate the program with exit code 2.
Opens an output file using the naming convention as specified
Depending on the value of the second command-line argument, passes the string read from the input file to either the encode or decode function, as described below
Create a second function called encode with the following characteristics:
Takes two C files as function parameters:
an input file
an output file
Encodes the provided file using the algorithm described above.
If the file was encoded successfully:
create the encoded representation to the file system
terminate the program with exit code 0, indicating success
If the file could not be encoded for any reason (e.g. a particular input fails):
print “Error: File could not be encoded” to the console
terminate the program with exit code
Create a third function called decode with the following characteristics:
Take two C files as function parameters:
an input file
an output file
Decode the provided file using the algorithm described above
If your program encounters invalid, corrupted or badly formatted data that cannot be decoded according to the specified algorithm, the program should:
print “Error: Invalid format” to the console
terminate the program with exit code 3.
If the file was decoded successfully:
write the resulting decoded representation to a file with the expected name
terminate the program with exit code 0, indicating success
If the file could not be decoded for any reason (other than those stated above):
print “Error: File could not be decoded” to the console
terminate the program with exit code 5
By now, you’re developing your own tests to evaluate your software (we don’t get what we expect, we get what we inspect) -- consequently, we would like to see at least one test input from YOU - use the testing convention described in this assignment - i.e. if your test input is expected to successfully execute, then it should follow the happy path of the algorithm. If the test input is expected to fail, then it should show a brief error message followed by exiting with error code6.
You may also think of tweaks to the algorithms that will make the encoder/decoder run faster, or make the compression better. If so, implement these ideas, and compare. But, this is not necessary. You are encouraged to devote time to testing your program.
Provided Materials
We have created a git repository for you - it contains a starter file, LZW-starter.c, a tool, b2x, a limited test suite, and these assignment instructions. There are also some hints in this section.
LZW-starter.c
A “starter kit” for LZW.c for you to reuse in your implementation. Copy it to LZW.c and edit it, or copy and paste parts of it. The starter-file contains a few functions you are invited to use in your solution - these are read12(), write12() and flush12(). It is not your aim to understand the implementation of these functions (although you’re welcome to explore as you see fit) - it IS your aim to understand how you might apply. these functions in your solution - if you choose to apply them, it should have the effect of simplifying the solution. The use of flush12() might have an analogous application to the use of fflush(stdin) - while we’re on the subject: fflush() might have some use in your solution as well.
b2x Tool
We created a directory /tools and placed a source file, b2x.c, that reads a binary file and writes the “binhex” representation of a file to the screen -- binhex is short for binary-to-hexadecimal - it’s yet another encoding mechanism - you can compile and use this tool if useful (it is not required). You can also use the common unix hexdump utility for similar purposes.
Test Suite
We will supply a limited suite of test cases that you can use to evaluate your own program, and get a feel for the types of inputs you can expect. You will see the tests include both text and binary files. The test suite has the same structure as the previous assignment - a set of directories, with one test case provided per directory. The directory, as previously, is named according to the type of test - e.g. encode_text or decode_image, where the first word denotes the mode the program should be run in while running this test.
Things are not identical to the previous assignment - for example, in the case of encoding:
the filename passed may have anyname and extension, e.g. input.ext(your program is expected to read that file as its input, with ext replaced, e.g. with “txt” or “jpg”)
NOTE: Your program does notneed to locate the appropriate file to operate on -- you can look at the files in the test case directory, and pass the input filename as a command line parameter!
Continuing the example above, if your file is expected to run successfully given its input, the directory will contain a file whose name is compare.ext.LZWwhere extis the original uncompressed filename extension
NOTE: Why “compare.ext”? -- Well, we give you the input file for testing purposes; if properly run, you should get input.ext.LZW as your output - if we provided you with this properly named file as output, then you could perhaps overwrite it accidentally, which would remove your ability to compare - hence, we prefix the outputs with “compare”.
Your program’s output is expected to precisely(bit-for-bit!) match the contents of this output file.
If your program is expected to terminate with a non-zero (unsuccessful) error code when given the provided input file, the directory will contain one of the following:
a file named n.err.txt,where nis the expected exit code, containing text that your program’s output is expected to precisely match, or;
a file named n.err,where nis the expected exit code. In this case your output doesn’t need to match any specific error message, but your program should still exit with error code n
NOTE: for YOUR test(s) that you provide, should the test(s) exit with an error code, the error code must be 6.
In the case of decoding, we provide input.ext.LZW and also provide compare.ext - again, you can derive the resulting output filename from the given input, and then compare accordingly.
Hints
Use an array of strings for the dictionary. Here's an example: unsigned char dict[4096][32];
That's an array of 4096 strings, with each string having a maximum of 32 characters. That means that your program must be careful not to add to the dictionary once it is full and not to add strings to the dictionary that are too long. The code for each entry in the dictionary is simply its position in the array. For example, the code for the entry dict[267] is 267
I strongly recommend against using the standard C null-terminated strings in the dictionary (since null might be a valid character in a LZW dictionary entry). Instead, store the length of the string as the first byte of the string like this:
dict[i][0] = 3; dict[i][1] = 'a'; dict[i][2] = 'b';
dict[i][3] = 'c';
You will not be able to use the built-in string functions (like strcmp, strlen, etc.) for this type of string.
Process
To get started, the assignment repository has already been set up in your person GitLab space, listed as /assignment2.git. We set this up in our local machine space with git clone:
git clone
https://gitlab.csc.uvic.ca/courses/2019091/SENG265/assignments/<NETLIN
K_ID/assignment2.git
(all one line, with <NETLINK_ID replaced with your personal Netlink identifier)
Note that by cloning from your personal repository, this automatically sets up your remote on GitLab so when you git pushlater, your changes will automatically be visible to both you and the teaching team.
Use your favourite text editor to write your program.
Once you’re able to compile your program, test it by running it with the input.<extfiles in the provided test suite, and any other files you can think of. Be creative; try to predict what kinds of inputs will break your code! Try random files lying around on your computer or the internet to see what happens!
If it seems like your changes are having no effect, don’t forget to recompile your program after changing it.
When you’ve made changes to the file that you want to keep, git commit your changes to your local copy of the repository. Git will save the history of past commits. When you want to upload your changes to your repository, use git push.
Deliverables
Write a C program in a file named LZW.c (case sensitive), placed in the root of your assignment2 repository, that implements the above requirements.
The program must compile successfully using gcc(within the lab environment)
- We will compile with gcc LZW.c -o LZW
The assignment will be marked based on the last version of your program pushed before the deadline.
Evaluation
Compilation
(10% of the mark, if your program compiles at all)
Correctness
(60% of the mark, based on the proportion of provided and secret test cases that your program successfully passes)
The teaching team will run your code using the provided test suite to verify whether your program meets the requirements:
When run with correct parameters and given input in the expected format, it produces the expected output
When run with incorrect parameters or badly formatted input data, it produces the expected error messages and the expected exit codes
We will also evaluate your program using an additional test suite that is not provided to you, to ensure that your solution isn’t over-fit to the provided test suite.
Your Test Suite
(10% of the mark, YOU supply additional test(s) constructed using the same scheme)
The teaching team will run your test case against your solution and validate that it works. We might also run your test case against otherstudent solutions. If seen as a particularly effective test, the teaching team reserves the right to provide a small bonus grade beyond the 10% allocated for this item.
Style
(20% of the mark, based on the subjective assessment of the teaching team)
The teaching team will read your code to judge whether it is clean, original, elegant and fit to purpose.
A few examples of bad programming style:
inconsistent formatting, indentation, etc. -- pick a style and use it consistently
misleading or non-descriptive names for variables and functions
insufficient whitespace impairing readability
long sections of repetitive code (sometimes it’s better to repeat short sections than introduce an abstraction!)
“dead” code, i.e. present in the file but never called
complex code without explanatory comments (don’t go overboard; simple code often doesn’t need comments)
Programming environment
For this assignment, please ensure your work compiles and executes correctly on the Linux machines in ELW B238. You are welcome to use your own laptops and desktops for much of your programming; if you do this, give yourself a few days before the due date to iron out any bugs in the C program you have uploaded to the BSEng machines. (Bugs in this kind
of programming tend to be platform specific, and something that works perfectly at home may end up crashing on a different hardware configuration.)
Individual work
This assignment is to be completed by each individual student (i.e., no group work). Naturally you will want to discuss aspects of the problem with fellow students, and such discussion is encouraged. However, sharing of code fragments is strictly forbidden without the express written permission of the course instructor (Murray). If you are still unsure regarding what is permitted or have other questions about what constitutes appropriate collaboration, please contact me as soon as possible. (Code-similarity analysis tools will be used to examine submitted programs.)
Additional Information
Because image compression remains a subject of great interest, you may wish to learn more about it -- feel free to review or ignore the following content as you see fit:
YouTube videos on image file compression
JPEG 'files' & Colour (JPEG Pt1)- Computerphile
JPEG DCT, Discrete Cosine Transform (JPEG Pt2)Computerphile
Articles on image file compression
TIFF LZW Compression Algorithm: https://www.fileformat.info/format/tiff/corion-lzw.htm
GIF LZW Compression Algo: http://giflib.sourceforge.net/whatsinagif/lzw_image_data.html
References
A Technique for High Performance Data Compression, IEEE Computer, Terry A. Welch, June 1984 (8:19), accessed at this link
LZW Data Compression, Dr. Dobbs Journal, Mark Nelson, October 1989, accessed at this link
LZW Data Compression Revisited, Dr. Dobbs Journal, Mark Nelson, November 2011 at link