papers
+HCU papers

Little essay about the various methods and viewpoints of crunching
~ version June 1998 ~
by Joa

Courtesy of reverser's page of reverse engineering

Here is the second part of Joa's essay.
See also the first part (Introduction)


Joa's work is extremely clear and will guide and help you to understand these matters, that for obvious reasons no real reverser should ever underestimate. My advice: read, re-read and then -once you have understood the basis- search the web and deepen (and then contribute :-)


Little essay about the various methods and viewpoints of crunching.

Part II: Propability and Crunching (Huffman isn't the only way)



By Joa Koester (JoKo2000@HotMail.Com) 
This essay is free to be copied and read and spread as long as
you are decent enough to leave my name in it. 
If you have corrections, comments, better examples than the ones used 
in this essay, etc. - drop me a line.


OK, let's dive into the world of crunchers and munchers...

In the Intro i told you that crunching is a process of reducing
the SPACE data needs to be stored, but not the MEANING (the
informations transmitted in the symbols).
The meter 'Entropy' is used here like in the physics. The more
entropy a symbol has, the LESS information is transmitted thru it.
Less information compared to the whole of all symbols, that is.
This means of course there is a lot of redundancy in it - meaning
a big 'crunch me, crunch me' for us. So we should concentrate
on the symbols that have a very high entropy. How we calculate
an entropy for our means depends on the data we want to crunch
and with which algorithm we want to crunch it.
I also told you that a certain student called Huffman once came
up with an algorithm to compress data. His idea is based on the
fact that in data (escpecially speech) some symbols happen 
to be more often than others. Here one overstated example:

aaaaaaaaaabbbbbcccdd

What we have here is a algorithm based on propability. That means
that some symbols are not transmitted in the way they used to be,
because the algorithm will create new symbols in respect to the
propabilities of the symbols. Some of these new symbols will be
shorter. And those lesser apperaring symbols will be transformed
into longer symbols:

Before:

a = 61h = %0110 0001
b = 62h = %0110 0010
c = 63h = %0110 0011
d = 64h = %0110 0100

and 

after:

????

How do we compute such a new symbol table? OK, before we explore
this algorithm let me change the data from ASCII to the lower
nibble. That should be enough to explain the algorithm:

Before:

a = %0001
b = %0010
c = %0011
d = %0100

and we have the propability:

a = 10
b =  5
c =  3 
d =  2
   ===
    20 <- Overall value. Important when calculating entropy

To save this original sequence to disk we would need: 

10 * 4 = 40
 5 * 4 = 20
 3 * 4 = 12
 2 * 4 =  8
        ===
         80 bits


Lets crunch this:
The original algorithm works with a tree building up from the ground
to the top. You can of course try to optimize it in another way. But
here's the algorithm as Huffman intended it:

Step 1:

10   5   3   2       How often the symbol appears
                     Tree (empty at first)
10   5   3   2       The added sum of the actual part of the tree
(a) (b) (c) (d)      Symbol


Step 2:

Now we take the two lowest (that is lowest in the sum of the value) entries
of the tree and add the values thus giving a new knot with the former part
as right and left part:

10   5   3   2       How often the symbol appears
                     Tree
10   5   3   2       The added sum of the actual part of the tree
(a) (b) (c) (d)      Symbol
         ^   ^
         ^   ^
         Two lowest values (3 and 2). 
         Take them and add the values of the two actual 
         parts of the tree and give the sum to a new knot. The two former
         parts become a left and right treebranch. 

10   5   3   2       How often the symbol appears
10   5     5         The tree
(a) (b)	  / \
         3   2
        (c) (d)

This Step 2 repeats until no two values are left:


 a   b   c   d
10   5   3   2       How often the symbol appears
10   5     5         The tree
(a) (b)	  / \
         3   2
        (c) (d)
     ^     ^ 
     ^     ^
         Two lowest values (5 and 5). 
         Take them and add the values of the two actual 
         parts of the tree and give the sum to a new knot. The two former
         parts become a left and right treebranch. 


 a   b   c   d
10   5   3   2       How often the symbol appears
10      10
(a)    / \
     5     5
    (b)	  / \
         3   2
        (c) (d)
^       ^ 
^       ^
         Two lowest values (10 and 10). 
         Take them and add the values of the two actual 
         parts of the tree and give the sum to a new knot. The two former
         parts become a left and right treebranch. 

 a   b   c   d
10   5   3   2       How often the symbol appears
     20              The tree
    /  \
  10   10
  (a)  / \
     5     5
    (b)	  / \
         3   2
        (c) (d)


Step 3:

We have a tree, so what?
What we now have to do is to to transform the tree into 
sequences of 1 and 0. To do that we start at the root giving
the root a 0 and 
- all left branches a 0 
and 
- all right branches a 1:

                   20              The tree
                  /  \
                 /    \
            [0] /      \[1]
               10      10
               (a)    /  \
                     /    \
                [0] /      \[1]
                   5        5
                  (b)	   / \
                          /   \
                     [0] /     \[1]
                        3       2
                       (c)     (d)

(Of course you could give all left branches a 1 and all right
branches a 0. Just keep straight with the directions, that's
all).

So we have the followin codes for our symbols:

a: %0
b: %10
c: %110
d: %111

Well two things are to be mentioned here:
This tree is optimized in a way that you can not produce a more
condensed one. It is not possible to create a tree that gives
more respect to all symbols in the table, calculating in whole
bits (that's no joke. There is a algorithm which calculates with
fractals of a bit and is far more powerful than Huffman. 
I'll explain this, too). Try it for yourself.
Important:
The tree is value-unique, meaning that you can't interpret the
beginning of a certain symbol as another (shorter) symbol:

(Not unique and wrong table)
a: %00
b: %10
c: %101
d: %001

If you would read a '00' what would this be? A 'a' or the beginning
of a 'd'? With this table it is not possible to answer the question
giving only the solution of a GPF. 
It it utmost important that a tree is completely unique. 
Therefore a complete symbol cannot be the beginning of another symbol!!!
If you try to build up simple trees the unique way you'll quickly
realize that the Huffman-Algorithm is the method of your choice.



Our sequence

  aaaaaaaaaabbbbbcccdd

would now need:
10 * 1 = 10
 5 * 2 = 10
 3 * 3 =  9
 2 * 3 =  6
        ===
         35 Bit. Not bad, eh?



So that means that we can optimize codes that happen to show up
more frequently than others. Big Deal. Where does this sort of
thing happen, eh? Dr. Watson?

Here, Sir. 
Well we have a lot of interesting and probably crunchable data
in SPEECH. In a normal written english text for example the letter
'e' appears the most. If we were able to compress a normal text 
using the Huffman-method we would save a lot of diskspace.

Izzatso? Well, yes, Watson is right. The letter 'e' is dominating
(not only) the english language. Here the last two paragraphs with the 
letter 'e' deleted:

---------------------------------------------------------------
So that mans that w can optimiz cods that happn to show up
mor frquntly than othrs. Big Dal. Whr dos this sort of
thing happn, h? Dr. Watson?

Hr, Sir. 
Wll w hav a lot of intrsting and probably crunchabl data
in SPCH. In a normal writtn nglish txt for xampl th ltter
'' appars th most. If w wr abl to comprss a normal txt 
using th Huffman-mthod w would sav a lot of diskspac.
---------------------------------------------------------------
While some words are still easy read, others have become
totally senseless. 

All we have to do is count all normal letters, sort them
by their frequency and apply the Huffman-algorithm to the
table. Then we would use this tree to encode our data
enew and - voila. There we are.

But...

Yes, Watson?

In your last chapter (Intro) you mentioned that BOTH, the
sender AND the receiver must have the same MODEL of data
(de)coding for to understand each other. If we don't 
transmit the Huffman-tree with the data, how does the 
other program on the other side know our tree?

Good question, Watson. It can't!!! Yes, if the receiver
of the message has no knowledge of the data-specific
Huffman-tree all he will see is total garbage (and most 
probably a GPF will follow). 
The receiver MUST use the same tree. That means, that, 
either, one has communicated/transmitted the tree on 
a second way or one uses a more general Huffman-tree, 
which is of course not optimized for the specific file 
we have transmitted. A third way will be explained now.

In the early days of the C64 you could tell that a program
used a Huffman-like tree when you recognized a block consisting
of all 256 possible chars, sorted in a strange way in the 
beginning of the file. That meant that the BASIS for the
complete tree (all 256 chars in the sorted order) was transmitted WITHIN 
the crunched file (which of course was now the crunched size + 256 Bytes ;)

The BASIS?
Yes, the basis.


Hm, i have a little problem...

What is it, Watson?

I think i got the theoretical aspect. But: how do i programm this
in reality? And what is this basis, you are talking about?


All right. 
Well, we have to do with trees, you know. And the main part about
trees is the link to the other parts of the tree. When you want
to programm trees, you better get accomplished with languages 
supporting pointers. 

A typical C/C++ struct would look like this:

struct Tree
{
    Tree* zero;		//pointer to struct for zeros
    Tree* one;		//pointer to struct for ones
    long  value;	//value of the leaf
    long  sum;		//when getting linked and 
                        //becoming a knot here the sum, else 0.
    char  symbol;       //my symbol
};

You should build a tree consisting of 258 entries, with entry
257 as special EOF-signal and entry 258 as dummy for later
services.

One of the most important steps of the alg is the counting
of the bytes. With this the whole system stands or falls.
We also have to consider the DEcoding part. We can't transmit
1024 bytes, just because ONE character has to be encoded with
10 bits, leaving 500 bytes completely zero. 
And always think of it: Coder and DEcoder MUST use the same
model or else...GPF.

So, to ensure that both use the same model, the process with
the smaller data entry dictates how the coder has to work.
We decide that a table of 256 bytes, transmitted with the 
crunched file is enough overhead. What is in these 256
bytes, you ask? It's simple: It's the frequencies of the
symbols scaled down to 8 bits. Why?

Well, the general problem is: How do coder and decoder know
that, for example, the symbol 'y' is to be encoded with 10
bits and the symbol 'e' with 6 bits?
The best way is to let the coder and the decoder BOTH build 
up a tree. When this happens, we can be sure, that both use
the same model and everybody is happy ("ha- vannagila, 
havannagila, havannagila..." :-).
The only thing that both, the coder and the decoder have in
common is the frequency of the symbols. So the only thing
we can transmit are the frequencies. And to save space, we
just scale them down to 8 bit. Beware of having a symbol
with the frequency zero. It will never be encoded or decoded.
One good way of scaling down the table is the following:

Counting, which symbol appeared the most and put it into maxcounter
if (maxcounter == 0) maxcounter++;
maxcounter /= 255;
maxcounter++;
in a loop divide all entries by maxcounter
(if an entry, which had a positive value before and now has
been zeroed - move a 1 to it).

for example:

the maxcounter is 32767;
maxcounter = 32767 / 255 = 128;
macounter++ = 129;
freq[0] = 1234;
freq[1] = 23456;
freq[2] = 12;

freq[0] = 1234 / 129 = 9
freq[1] = 23456 / 129 = 181
freq[2] = 12 / 129 = 0; must be made 1;


So if we transmit the scaled down frequencies with the decoder,
all the decoder has to do is create the Huffman-tree from the
frequencies and start decoding the inputstream.
Yes. That means the both, coder and decoder use the same
routines. If you can reverse the algorithm of the decoder
you know at least half of the coding algorithm. In the huffman
case, you'll know ALL the important stuff. This behaviour will
be found in every decrunching routine. To decrunch something
the decruncher has - at least partly - reverse the steps that
the cruncher did. Step successfully thru a decruncher and you
have - with a little phantasy - the crunch-algorithm (at least
the theoretical aspect of it) reversed. I never found a cruncher
where this was not the case, so i suspect that all packers have
this weak spot. Weak, because a lot of crunching alg are packaged.
That means, that after some data a new cycle of crunched data 
begins. If you manage to isolate a single (best hint: in the
last part of the file) cycle, you can study the alg 
while decrunching just about 10-20 bytes. Good, eh?




These are some steps which should give you the general idea of 
creating a Huffman-tree:
- Build up a table consisting of 256 longs.
- Count up the bytes you want to crunch (the whole file for example)
  where you count up the frequency of all the bytes in your
  source. As there are only 256 possibilities, there are no probs.
- Scale the frequency down to 8 bit.

- Build up a second array consisting of 256 chars. Fill them
  with the chars 0...255. You will have the char 0 in entry [0],
  char 1 in entry [1], the char 'A' in entry[0x41] and so on.
- Sort the frequency table, but when you switch the values,
  do also switch the values in the second array, holding the
  chars. With that you'll create a sorted model of your input
  data.

- After sorting, proceed as follows:
   - create an array of 257 entries of type Tree.
     (If you have another service ready, use 258).
   - fill the .value of each entry with the frequency of the
     symbol. Fill the .symbol with the entry of the symbol char
     array. Fill the .zero and .one with NULL to mark them
     as leafes. This will seperate them from the knots where
     there will be the adresses of your next branches/leafes.
     fill .value of entry[256] (the 257th) with 1 as you will
     only one EOF, right?
   - recursively build up the tree. Don't forget, that you
     may have to add a last knot to link the last two partial-trees
     together. One way to do this:
     - Have a counter set to 257. 
     - repeat:
     - take the least two entries.
     - add up their value.
     - create a new entry of type Tree.
     - link the two entries to the new tree.
     - give the .sum of the new entry the added value
     - delete the two entries from the list (NOT FROM THE MEMORY) and add the new one.
     - reduce the counter by 1
     - resort the trees with criteria .sum
     - while counter > 1
   - create a new array with the struct: char symbol, long bitsvalue, long number_of_bits_to_be_written
   - recurse thru the tree to get to every char to fill up the new array. 
     For every char you reached fill up the entries for symbol, bitsvalue and number_of_bits_to_be_written
   - nearly finished. 
   - Your encoding array is ready. All you have to do now is to read a char from the input
     and write the corresponding bitvalue to the outputstream with the number of bits,
     given from the array-entry. Send the EOF at last and you are done.

For decoding, the process is similar:

Create the tree. (thats the similar part)

unsigned char symbol = 0x00;
startknot = knot(0) //(of type Tree)
knot = startknot    //(of type Tree)

while (symbol != EOF)
   clear symbol.
   while (not symbol)
      
      get bit
      if ( (bit == 0) && (knot->zero != NULL) )
         knot = knot->zero
      else 
         if (knot->one != NULL) ) knot = knot->one

      if (knot.one == NULL) //is enough. knot.zero is NULL, too
         //symbol found. great. To the output with it
         symbol = knot.symbol
      end if
    end while   
    Do_Outpur_Char(symbol);
end while

Done. 


         
Hm, i have another little problem...

What is it, Watson?


I did as you told me. Took a table of 256 longs, counted the 256 chars,
sorted them and let the computer calculate an optimized Huffman-tree.
But all i got was a tree with all leaves having a length of 8 - all chars
are encoded with 8 bits and there is no crunching in that. What happened?

Ah, Dr. Watson,
you did not remember my sentence about Entropy. This file you examined
had all bytes nearly equally spread all over the length of the file.
So -statisticalwise- the file is equalized. No surprises here. With 
a Huffman-tree, which considers the whole file you won't get any 
crunching here, i fear.

But what do i do now?

Well, there are three scenarios that will satisfy your desire:
- We use a different algorithm
- We use a brutal tree
- We use an adapting technique

??? Tell me about the brutal tree.

Ok, i got this one from the 1001 Packer from the C64 (Anyone ringing a 
bell? 1001 Crew? No? Where are the old days gone to...)
They used a tree i just can label 'brutal tree'. Here's the idea:
When we have the case that we have data that is so equal that no
Huffman-tree will help us, we help us ourself. We take all the
Bytecounters and sort them. Then we create a table of the 256 chars
ordered by the sorted table, so that the still most frequent
chars appear on the top. Now we have the 256 possible bytes sorted.
So far so good.
Lets think about a byte. How do we read a byte? Isn't it:

#33 = $21 = %0010 0001  ?

What, if we we would change our view of the binary to a 2 + 6:

#33 = $21 = %00 100001

Our 256 chars would now be represented as 4 (00, 01, 10, 11) blocks
of 6 bits. What if we now 'huffman' these first two bits to:

0
10
110
111

So that we would have %0 100001 instead of %00 100001. Every byte that
is in the first 64Byteblock is coded with 7 bits, while the others
are coded either with 8 or with 9 bits. We can only hope that there
are enough bytes in the file to outweigh the two 9bit-chars. But it is
amazing how this little trick works. And the speed. It's so easy to
implement it and it is so fast. Even on the C64 it took only fractals of
a second to decode a programm as big as the main memory area (other 
packers took several seconds for the same size).


Wow, cool. But what about your different algorithms 
and this 'adapting technique' ?


Well, this is another story and will be told in the next chapter...




I hope you enjoyed this chapter. 
There will be more - i promise.
Just have a little patience.

Greetings from a programmer

Joa

In the meantime, after having read Joa's work above, you may enjoy a look at a huffer and a unhuffer, both written in C some time a go, and donated to the world, by Shaun Case... once you are done studying it, may be you'll want to have a look also at sixpack another more advanced (adaptive Huffman encoding) compresser, written in C as well, by Philip Gage, for a data compression competition some time ago... since this code includes SIX different algorhitms, I believe it could be useful for readers as 'passage' to the next part of Joa's essay (if Joa doesn't agree with me I'll kill these links of course).




redhomepage redlinks red anonymity red+ORC redstudents' essays redacademy database
redtools redcounter measures redcocktails redantismut redbots wars redsearch_forms redmail_fravia
redIs reverse engineering legal?