Definitions
In a lot of texts regarding compression you will run across various conventions of syntax to describe the coding of data. Much of them are used in fields like automata theory, group theory and statistics. Although, the field of coding theory itself has a long and rich history in the 20st century; much of the terminology is borrowed from fields that gave rise to it.
I find the best way to demonstrate the following definitions is by example. Let us take the text "abbcc" as a sample.
-
Alphabet: The set of symbols that we will be encoding.
- Form:
- Example: where
-
String: A permutation of symbols from our alphabet. From our example alphabet, strings like "a", "aba", "abc" or "cccb" could all be possible. Using regular expression syntax, we can describe this in our example.
- Form:
- Example:
-
Probabilities: The set of probabilities for the alphabet. This is also known as their weights and can sometimes be denoted as w instead of p.
- Form:
- Example: where since 'a' occurs once, 'b' occurs twice and 'c' occurs twice out of 5 symbols all together.
-
Codewords: The set of codewords that we will create for every symbol in our alphabet.
- Form:
- Example: where for symbols respectively. Another way to look at it is that we are providing a bijective function: .
Code Width
Standard coding of information for most computers today is byte aligned. For example, in character codes there is ASCII, which codes the entire English alpha-numeric system in a single byte. While UTF-8 has proven to be a much better standard due to better localization support and variable width codes, ASCII serves our purposes for this discussion. ASCII is considered a fixed width coding scheme:
When performance is at stake, it is best to code information atomically aligned to the computer architecture that it is being used for. This means allocating and defining information in terms of bytes (defined as 8-bits) for the majority of systems today. Although, if space is the priority, performance can take a hit to shift around bits along the byte boundaries and create variable width codes that do not byte align.
Note: The ellipses in the alphabet denote the control characters (eg. CR, LB) and grammatical structure symbols (eg. #, $, %) contained in the ASCII code page. The alphabets will reflect this in the ellipses.
Data Compression
Variable Width Savings
Coming back to the example of ASCII, if we wish to encode a string, w, like "aaabb", it will take up 5 bytes. Although we only really need one bit to encode our information if we assign 0 to 'a' and 1 to 'b'.
We just took a 5 byte ASCII encoded string and compressed it down to under a byte. Now let us take another string, w, and give it "aaabbcc". Three alphabet symbols need to be coded; it will take at least 2 bits. At first glance, this assignment would make sense:
Although this gives us an ambiguous string since it could either be "aaabbcc", "aaabbabab", "aacbcc" or any other permutation of our codeword set when we attempt to decompress it. The reason for this ambiguity is due to the fact that the codeword for 'a' (0), is used to start the codeword for 'c' (01).
Prefix Codes
The way to get around this ambiguity is to ensure that all of your codes are prefix type. Something like this would work:
This ensures that we maintain the one-to-one mapping between our alphabet and codewords; we can translate alphabet symbols to codewords and codewords back to alphabet symbols without confusion.
Codeword Assignment
So how do you decide which symbols, get which codewords? Also, since the size of the codeword is critical to the compression ratio (average codeword length to average symbol length), designing an algorithm to weigh the more frequently used symbols is important. In our previous example, it would be poor design to give 'a' the longest codeword because it is the most used symbol in the string. This is where the Huffman Coding algorithm comes into play.
Huffman Coding Algorithm
- Count the number of times each symbol occurs in the data source.
- Put [symbol:count] pairing as nodes in a minimum priority queue based on the symbol's count.
- Evaluate number of queue nodes.
- If queue has at least two nodes:
- Dequeue the two nodes.
- Create a new node with a null symbol and set the count equal to the sum of the two dequeued node's counts.
- Attach the two nodes as children to the new node.
- Put the newly created node into the queue.
- Go to step 3.
- If queue has only one node, dequeue and set as the root of the tree.
- Assign each left branch a 0 and each right branch a 1, or vice versa if you choose.
- The symbol nodes will be the children of the tree and their codeword will be the bits that are traversed to get there.
Algorithm Example
Given our previous example string of "aaaabbcc", let us perform the algorithm on it:
- Initial queue state with only symbol nodes in it.
- New node created from first two symbol nodes and requeued.
- New node created from non-symbol and symbol node, becoming the tree's root node.
- The final alphabet to codeword mapping is:
If you need further examples, you can look here, here and here. Once the you get a grasp of the algorithm, you will see the simplicity and beauty of it.
Working Code Example
If you want to see a naive implementation of the Huffman Coding algorithm, I posted some source code on my GitHub account. Below is the central code to the algorithm execution.
More Reading & Resources
If you are interested in coding algorithms, these links might be of interest to you.