Thursday, September 20, 2012

Understanding compression and Huffman Coding.

Definitions

In a lot of texts regarding compression you will run across various conventions of syntax used to describe the coding of data. Many of them come from fields like automata theory, group theory and statistics. Although the field of coding theory itself has a long and rich history in the 20th century, much of its terminology is borrowed from the fields that gave rise to it.

I find the best way to demonstrate the following definitions is by example. Let us take the text "abbcc" as a sample.

  • Alphabet: The set of symbols that we will be encoding.
    • Form: A = {a_1, a_2, ..., a_n}
    • Example: A = {a, b, c} where n = 3
  • String: A sequence of symbols from our alphabet. From our example alphabet, strings like "a", "aba", "abc" or "cccb" could all be possible. Using regular expression syntax, the possible strings in our example are (a|b|c)*.
    • Form: w ∈ A*
    • Example: w = "abbcc"
  • Probabilities: The set of probabilities for the alphabet. These are also known as the weights and can sometimes be denoted as w instead of p.
    • Form: P = {p_1, p_2, ..., p_n}
    • Example: P = {0.2, 0.4, 0.4}, since 'a' occurs once, 'b' occurs twice and 'c' occurs twice out of 5 symbols all together.
  • Codewords: The set of codewords that we will create for every symbol in our alphabet.
    • Form: C = {c_1, c_2, ..., c_n}
    • Example: C = {0, 10, 11} for symbols 'a', 'b' and 'c' respectively. Another way to look at it is that we are providing a bijective function: f : A → C.

Code Width

Standard coding of information for most computers today is byte aligned. For example, in character codes there is ASCII, which codes the entire English alpha-numeric system in a single byte per symbol. While UTF-8 has proven to be a much better standard due to better localization support and variable width codes, ASCII serves our purposes for this discussion. ASCII is considered a fixed width coding scheme: every symbol in its alphabet (A = {..., A, B, ..., a, b, ..., 0, 1, ...}) maps to a codeword exactly 8 bits (1 byte) wide.

When performance is at stake, it is best to code information atomically aligned to the computer architecture that it is being used on. This means allocating and defining information in terms of bytes (defined as 8 bits) for the majority of systems today. However, if space is the priority, performance can take a hit by shifting bits across byte boundaries to create variable width codes that do not byte align.

Note: The ellipses in the alphabet denote the control characters (eg. CR, LF) and grammatical structure symbols (eg. #, $, %) contained in the ASCII code page. The alphabets will reflect this in the ellipses.

Data Compression

Variable Width Savings

Coming back to the example of ASCII, if we wish to encode a string, w, like "aaabb", it will take up 5 bytes. Yet we only really need one bit per symbol to encode our information: assigning 0 to 'a' and 1 to 'b' gives the 5-bit string 00011.

We just took a 5 byte ASCII encoded string and compressed it down to under a byte. Now let us take another string, w, and give it the value "aaabbcc". Three alphabet symbols need to be coded, so at least one codeword will take 2 bits. At first glance, this assignment would make sense: a → 0, b → 1, c → 01.

Although this gives us an ambiguous encoding: when we attempt to decompress it, the result could be "aaabbcc", "aaabbabab", "aacbcc" or any other permutation of our codeword set. The reason for this ambiguity is that the codeword for 'a' (0) is used to start the codeword for 'c' (01).

Prefix Codes

The way to get around this ambiguity is to ensure that all of your codes are prefix (prefix-free) codes: no codeword starts with another codeword. Something like this would work: a → 0, b → 10, c → 11.

This ensures that we maintain the one-to-one mapping between our alphabet and codewords; we can translate alphabet symbols to codewords and codewords back to alphabet symbols without confusion.

Codeword Assignment

So how do you decide which symbols get which codewords? Since the size of the codeword is critical to the compression ratio (average codeword length to average symbol length), designing an algorithm that gives the more frequently used symbols shorter codewords is important. In our previous example, it would be poor design to give 'a' the longest codeword because it is the most used symbol in the string. This is where the Huffman coding algorithm comes into play.

Huffman Coding Algorithm

  1. Count the number of times each symbol occurs in the data source.
  2. Put each [symbol:count] pairing as a node in a minimum priority queue, keyed on the symbol's count.
  3. Evaluate the number of queue nodes.
    • If the queue has at least two nodes:
      • Dequeue the two nodes.
      • Create a new node with a null symbol and set its count equal to the sum of the two dequeued nodes' counts.
      • Attach the two nodes as children to the new node.
      • Put the newly created node into the queue.
      • Go to step 3.
    • If the queue has only one node, dequeue it and set it as the root of the tree.
  4. Assign each left branch a 0 and each right branch a 1, or vice versa if you choose.
  5. The symbol nodes will be the leaves of the tree, and each symbol's codeword is the sequence of bits traversed to reach it.

Algorithm Example

Given the example string "aaaabbcc", let us perform the algorithm on it:

  1. Initial queue state with only symbol nodes in it: [b:2] [c:2] [a:4]

  2. New node created from the first two symbol nodes and requeued: [*:4 (b, c)] [a:4]

  3. New node created from the non-symbol and symbol node, becoming the tree's root node: [*:8 (a, *:4 (b, c))]

  4. The final alphabet to codeword mapping is: a → 0, b → 10, c → 11

If you need further examples, you can look here, here and here. Once you get a grasp of the algorithm, you will see the simplicity and beauty of it.

Working Code Example

If you want to see a naive implementation of the Huffman Coding algorithm, I posted some source code on my GitHub account. Below is the central code to the algorithm execution.

More Reading & Resources

If you are interested in coding algorithms, these links might be of interest to you.

Tuesday, September 18, 2012

IMAPI version 2 managed wrapper.

Introduction

Microsoft introduced native image mastering support beginning with Windows Vista. The library itself is not too complex, but the latest version of the image mastering API (IMAPI) has some quirks. This is well documented by Eric Haddan in his article describing how to build an interop file and use it from .NET.

Implementation

I wanted to take it a step further and provide a library wrapper that was CLI compliant. The end result is a class that provides 3 methods, called in order, to burn an image. The project can be found on my GitHub account.

Properties

  • Media: A read only property that populates after the LoadMedia call and reports what type of physical media is connected (eg. CD-R, DVD-R, DVD+RW, etc.).
  • MediaStates: A read only collection of states that the media is in, which populates after the LoadMedia call (eg. is blank, overwriteable, unsupported).
  • MediaCapacity: A read only property that populates after the LoadMedia call and reports the size capacity of the media.
  • Nodes: A root list of file and directory nodes that are to be written to the media.
  • Recorders: A read only selectable collection. A recorder must be selected for the LoadRecorder call.
  • VolumeLabel: A property that sets the volume label of the media to be written to.

Methods

  • LoadRecorder: Verifies a recorder is selected, loads the recording drive resources and verifies that it is working.
  • LoadMedia: Loads the media resource and verifies that it is valid for the recorder. An exception will occur if the media isn't supported (eg. closed session disc, no disc present).
  • FormatMedia: Formats the media, erasing any previous data.
  • WriteImage: The final step to write your files to disc. This is a blocking call.

Events

  • FormatEraseUpdate: Reports progress of FormatMedia call.
  • FormatWriteUpdate: Reports progress of files being written to the media.
  • ImageUpdate: Reports progress of files being added to the media image.

Caveats

I use this library for the simple single purpose of writing finalized images to a blank CD. It has not been fully tested to accommodate all media and recorder state scenarios; I haven't tried it yet with DVDs or multi-session discs. If you would like to extend the library, feel free to push some changes to the GitHub project and I will wrap them in.

I decided to leave the method calls as blocking (synchronous) since there could be many different synchronization models that may use this library. For my purposes, I used this in a WPF application that would call the WriteImage method from a Task and then post back the progress events of the write using the Dispatcher from my WPF window. Below is a code snippet demonstrating this.

private ImageMaster _burner;

...

private void _CreateCd()
{
 // files added to _burner and _burner.WriteImage called
}

private void _burner_FormatWriteUpdate(IMAPI2.ImageMaster sender, IMAPI2.FormatWriteUpdateEventArgs e)
{
 this.Dispatcher.Invoke(() => this.statusProgressBar.Value = ((double)e.ElapsedTime / e.TotalTime) * 100);
}

private void CreateCdWindow_Loaded(object sender, System.Windows.RoutedEventArgs e)
{
 System.Threading.Tasks.Task.Factory.StartNew(_CreateCd);
}

Thursday, September 13, 2012

Implementing a bit reader/writer in C.

Introduction

I was working on implementing a simple compression tool based on naive Huffman coding and initially assumed all of my prefix codes would be at or under a byte long. However, I failed to account for the worst case of Huffman coding, which can yield codes up to 255 bits in length.

This required bit reading and writing functions that would allow for variable length codes. I could have used other methods to balance worst case Huffman trees, but this issue presented a unique challenge: how to write a bit reader/writer in C.

Bit Reader/Writer Specifications

  • Read and write using reader and writer ADTs.
  • Handle any possible bit length given.
  • Remember bits read from the file that fall outside the requested bit length (eg. 2 bytes read in but only 12 bits requested, leaving 4 bits in the buffer).
  • Flush a partially written block to disk when the bit writer ADT is freed.
  • Return the number of bits read and written from the read and write functions.

Design

The underlying structure of the ADT will contain fields for: the file pointer, the byte-long buffer and the buffer's bit length. The examples given for every case are in steps: what is originally in the buffer, what bits are requested, what bits are read from or written to the file, and finally what bits are left over and buffered.

Reader

There are two main cases to consider when designing the read function.
  1. Empty Buffer: This is simple since the bits read from the file will already be byte aligned to your output bits. You just need to mask out the unneeded bits and buffer them for the next read (eg. no bits buffered, 12 bits requested, 16 bits read in, 4 bits buffered).
  2. Populated Buffer: This can get a little trickier, especially in the last two of the following cases.
    1. Requested bits already buffered. This is very easy since it doesn't require any file read and the buffer just needs to shift out the read bits (eg. 7 bits in the buffer and 3 are requested; buffer serves out 3 and truncates to 4).
    2. Not all requested bits buffered; end of read bits not in same byte block as end of output bits. The last output byte block must mask out the unneeded bits, put them in the buffer and then append the additional unneeded read bits from the next byte block (eg. 4 bits buffered, 13 bits requested, 16 bits read in, 8 bits buffered).
    3. Not all requested bits buffered; end of read bits in same byte block as end of output bits. The last output byte block must mask out the unneeded bits and put them in the buffer (eg. 4 bits buffered, 10 bits requested, 8 bits read in, 2 bits buffered).

Writer

There are only two cases to consider when designing the write function.
  1. Full Buffer: Shift in write bits, commit to disk and repeat until all bits are processed. If the last bit to write is not on the buffer byte boundary, it will wait until the next write (eg. 6 bits buffered, 12 bits to write, 16 bits written out, 2 bits buffered).
  2. Partial Buffer: Shift in write bits (eg. 2 bits buffered, 4 bits to write, no bits written out, 6 bits buffered).

Caveats

The code below is purely a demonstration of how a bit reader/writer works in a generic sense. There is room for plenty of optimization if you know what type of architecture you're targeting and the domain of your reader/writer arguments.

The Code

Header File

Source File