Introduction
I was working on implementing a simple compression tool based on naive Huffman coding and initially assumed all of my prefix codes would be at or under a byte long. Although I failed to account for the worst case of Huffman coding, which can yield codes up to 255 bits in length.
This required a bit reading and writing function that would allow for variable length codes. I could have used other methods to balance worst case Huffman trees, but this issue presented a unique challenge. How to write a bit reader/writer in C.
Bit Reader/Writer Specifications
- Read and write using reader and writer ADT's.
- Handle any possible bit length given.
- Remember bits read from file but are outside the requested bit length (eg. 2 bytes read in but only 12 bits requested, leaving 4 bits in the buffer).
- Flush a partially written block to disk on a free of the bit writer ADT.
- Return number of bits read and written on the read and write functions.
Design
The underlying structure to the ADT will contain fields for the: file pointer, the byte long buffer and the buffer bit length. The examples given for every case are in steps; what is originally in the buffer, what bits are requested, what bits are read from or written to the file, and finally what bits are left over and buffered.
Reader
There are two main cases to consider when designing the read function.- Empty Buffer: The is simple since the file read bits will already be byte aligned to your output bits. You just need to simply mask out the unneeded bits and buffer them for the next read (eg. no bits buffered, 12 bits requested, 16 bits read in, 4 bits buffered).
- Populated Buffer: The can get a little trickier, especially in the last two cases of this.
- Requested bits already buffered. This is very easy since it doesn't require any file read and the buffer just needs to shift out the read bits (eg. 7 bits in the buffer and 3 are requested; buffer serves out 3 and truncates to 4).
- Not all requested bits buffered; end of read bits not in same byte block as end of output bits. The last output byte block must mask out the unneeded bits, put them in the buffer and then append the additional unneeded read bits from the next byte block (eg. 4 bits buffered, 13 bits requested, 16 bits read in, 8 bits buffered).
- Not all requested bits buffered; end of read bits in same byte block as end of output bits. The last output byte block must mask out the unneeded bits and put them in the buffer (eg. 4 bits buffered, 10 bits requested, 8 bits read in, 2 bits buffered).
Writer
There are only two cases to consider when designing the write function.- Full Buffer: Shift in write bits, commit to disk and repeat until all bits are processed. If the last bit to write is not on the buffer byte boundary, it will wait until the next write (eg. 6 bits buffered, 12 bits to write, 16 bits written out, 2 bits buffered).
- Partial Buffer: Shift in write bits (eg. 2 bits buffered, 4 bits to write, no bits written out, 6 bits buffered).
Caveats
The code below is purely a demonstration of how a bit reader/writer works in a generic sense. There is room for plenty of optimization if you know what type of architecture your targeting and the domain of your reader/write arguments.
Right on. The company I work for uses a bit reader/writer of a similar nature. One of the things that has been on my wish list for a while, though, is thread-safety built in. Your implementation is definitely thread-safe (whereas our reader functions are currently not, as they directly change the state of the bit buffer), but it would open a fresh file descriptor into the same file for each thread. It would be useful to have a reader that iterates over an existing buffer, that way you can save on memory (and file descriptors).
ReplyDeleteGood work, either way.
Thanks. Feel free to take what you need. I still need to batter the reader with more test cases and the writer too but I think the logic is sound.
DeleteOut of curiosity, what do you use for testing C?
DeleteMy company uses the Google C++ Testing Framework (http://code.google.com/p/googletest/), although we use C, not C++. I like it quite a bit, and have not really looked into alternatives.
Embarrassingly, I didn't use any framework. I just had a script that generated random file lengths and then generated sets of inputs as permutations over a domain of 1 to 100.
Delete