papers
+HCU papers

Little essay about the various methods and viewpoints of crunching
~ version September 1998 ~
by Joa Part VI

Courtesy of reverser's page of reverse engineering

Well, Joa continues his fundamental paper on crunching, this is part VI
enjoy!

10 July 98 Joa ~ crunchi1.htm Little essay about the various methods and viewpoints of crunching papers ~fra_0126
10 June 98 Joa ~ crunchi2.htm Little essay about the various methods and viewpoints of crunching II papers ~fra_0129
17 June 98 Joa ~ crunchi3.htm Little essay about the various methods and viewpoints of crunching III papers ~fra_012E
17 June 98 Joa ~ crunchi4.htm Little essay about the various methods and viewpoints of crunching IV papers ~fra_012F
10 July 98 Joa ~ crunchi5.htm Little essay about the various methods and viewpoints of crunching V papers ~fra_0133
16 September 98 Joa ~ crunchi6.htm Little essay about the various methods and viewpoints of crunching VI papers ~fra_0133


Little essay about the various methods and viewpoints of crunching.

Part VI: The secret of Zip and Rar (LZW explained)



By Joa Koester (JoKo2000@HotMail.Com)
This essay is free to be copied and read and spread as long as
you are decent enough to leave my name in it.
If you have corrections, comments, better examples than the ones used
in this essay etc. - drop me a line.


Last time i showed you how you could reach enormous crunch-ratios using the
LZ77-technique. I wouldn't wonder if someone would build a protection system
upon it. You'd just have to allocate some stackspace for your routine, decrunch
it into the stack and execute it. You'd would have to switch to a
decrunch-from-behind version and just by copying the decrunching routine into
the stack the final crunches would overwrite the last decrunch-loop part
thus immediately executing the decrunched routine afterwards.

But back to crunching. Soon after publishing the LZ77 alg Lempel and Ziv
fumbled a little bit around and came up with another table-oriented crunching
alg. While LZ77 used a Offset-Length pair to implement a reference to some
already passed data, LZ78 walked a completely different path. Lempel and Ziv
came up with an idea of directly addressing a table. In this table there would
be the strings they passed. The only thing you'd have to do would be to
derefernce the table entries down to the strings and you would have a crunch.

The table would start with a null-string. One Byte would be read ahead. One
acutal string would be initialized with this read-ahead byte. Then you'd check
if you already had this string in your table. If not, you would output it and
afterwards you'd add this string to your table. If you already had this string
in your table you'd read another byte and add it to your string. It is
important to notice that the last string to be output build up from a known
string in the table plus a new char.
Mark Nelson, a good source for crunching (next to Charles Bloom),
shows this alg quite clear:

for (;;)    //until the end of the file
{
  act_phrase = 0;   //the zero string
  act_len = 0;
  memset (test_string, 0x00, MAX_STRING);

  //...........................................................................
  for (;;)
  {
    test_string[act_len++] = getc(input);
    new_phrase = search_phrase_in_table(test_string); // is test_string in
                                                      // table?

    if (new_phrase == -1)   // no
    {
      break;                // then break out of inner loop
    }
    //else
    act_phrase = new_phrase // else set new actual phrase
  }
  //...........................................................................
  //Here we breaked out of the inner loop. This could happen, because of EOF or
  //because a string could not be found in the table (in case of EOF this will
  //happen, too, of course ;)
  output(act_phrase);            //output the reference of the table
  output(test_string[act_len -1] ); //afterwards output the difference char
  add_string_to_table(test_string); //and actualize table
}

Let's crunch the following term together:

0123456789a iwobiwoiwoi


Our table is empty in the beginning except for a null-string at pos 1.

pos:  0
act_phrase = 0;
act_len = 0;
test_string = [0x00, 0x00, 0x00...]


Ok, we enter the first char: i

                ...test_string[act_len++] = getc(input);...
char: 'i'
pos:  0
act_phrase = 0;
act_len = 1;
test_string = ['i', 0x00, 0x00...]
table = [ {"", 0} ]


                After the search, new_phrase is -1, so we break


  output(act_phrase);            //output the reference of the table
  output(test_string[act_len -1] ); //afterwards output the difference char
  add_string_to_table(test_string); //and actualize table

our output is: {0}i

We output the act_phrase (0) and output the 'i' afterwards. Then we update the
table with the 'i'-entry and jump back to our loop.

pos:  1
act_phrase = 0;
act_len = 0;
test_string = [0x00, 0x00, 0x00...]

enter char (w)

char: 'w'
pos:  1
act_phrase = 0;
act_len = 1;
test_string = ['w', 0x00, 0x00...]
table = [ {"", 0}, {"i", 1} ]

                After the search, new_phrase is -1, so we break

  output(act_phrase);            //output the reference of the table
  output(test_string[act_len -1] ); //afterwards output the difference char
  add_string_to_table(test_string); //and actualize table

We output the act_phrase (0) and output the 'w' afterwards. Then we update the
table with the 'w'-entry and jump back to our loop.

[...]
for the chars 'o' and 'b' it's the same. We reenter on the next 'i'

pos:  4
act_phrase = 0;
act_len = 0;
test_string = [0x00, 0x00, 0x00...]

enter char (i)

char: 'i'
pos:  4
act_phrase = 0;
act_len = 1;
test_string = ['i', 0x00, 0x00...]
table = [ {"", 0}, {"i", 1}, {"w", 2}, {"o", 3}, {"b", 4} ]

    new_phrase = search_phrase_in_table(test_string); // is test_string in
                                                      // table?

    yes, it is, so

            act_phrase = new_phrase (1 in this case)

    and back to our inner loop.

char: 'w'
pos:  5
act_phrase = 1;
act_len = 2;
test_string = ['i', 'w', 0x00...]
table = [ {"", 0}, {"i", 1}, {"w", 2}, {"o", 3}, {"b", 4} ]

    Do we have a 'iw' in our table? don't think so. That means next step we
    break.

  output(act_phrase);               // (1 in this case)
  output(test_string[act_len -1] ); // (w in this case)
  add_string_to_table(test_string);

and to the next char...


char: 'o'
pos:  6
act_phrase = 0;
act_len = 1;
test_string = ['o', 0x00, 0x00...]
table = [ {"", 0}, {"i", 1}, {"w", 2}, {"o", 3}, {"b", 4}, {"iw", 5} ]

    well, we have an 'o' in our table, so the string is extended with the
    next char: 'i'.

char: 'i'
pos:  7
act_phrase = 3;
act_len = 1;
test_string = ['o', 'i', 0x00...]
table = [ {"", 0}, {"i", 1}, {"w", 2}, {"o", 3}, {"b", 4}, {"iw", 5} ]


    Some 'oi' here? No, sir. no 'oi' here. We break and write:

  output(act_phrase);               // (3 in this case)
  output(test_string[act_len -1] ); // (i in this case)
  add_string_to_table(test_string);

the next char is the 'w':

char: 'w'
pos:  8
act_phrase = 0;
act_len = 1;
test_string = ['w', 0x00, 0x00...]
table = [ {"", 0}  , {"i", 1}, {"w", 2}, {"o", 3}, {"b", 4}, {"iw", 5},
          {"oi", 6} ]

after adding the next char (coz 'w' is in our table) we have:

char: 'o'
pos:  9
act_phrase = 3;
act_len = 1;
test_string = ['w', 'o', 0x00...]
table = [ {"", 0}, {"i", 1}, {"w", 2}, {"o", 3}, {"b", 4}, {"iw", 5},
          {"oi", 6} ]

    Some 'wo' here? Nope. We break and write:

  output(act_phrase);               // (2 in this case)
  output(test_string[act_len -1] ); // (o in this case)
  add_string_to_table(test_string);

The final char (plus EOF) is confronted with the following table:

table = [ {"", 0}  , {"i", 1}, {"w", 2}, {"o", 3}, {"b", 4}, {"iw", 5},
          {"oi", 6}, {"wo", 7} ]

then the i+EOF is added to the table.


The output we produced so far (spaces inserted for clarity):

{0}i {0}w {0}o {0}b {1}w {3}i {2}o {1}EOF

Hmm. This doesn't look like an algorithm which produces good results in small
files. But think about this: if there would have been another "iwo" at the end
of the file instead the 'i' we would have coded {5}o thus saving us two bytes.
And for the next 'iwo' we would have had a new table entry ready. This means
that the algorithm manages inequally long strings with just one entry
reference. If our reference is shorter than the string, we crunch. Normally we
achieve an increasing crunch-ratio after ca. 100 bytes. From that moment on our
table begins to fill with strings longer than one byte and from that moment on
we save bits with every reference we write.

I have a question, sir...

Yes, Watson?

This table-reference - how big should that table be and how do we code the
reference? And while i'm at the point: How do we search the table? This will be
the most time-consuming part, i think. And, oh yes, what if the table is full?


You are absolutely right, Watson.
Good questions!

OK!

How do we search the table FAST? Well, to be honest, i do not prefer one method
over the other. One good method would be to implement a hash-table. In C++ you
would implment a Set of Strings and hope that the implementation from your
compiler vendor is fast enough to handle some 100 Kbyte Strings. Another way
would be to allocate a biiiiig amount of memory (premise: size MOD 256 == 0),
divide it by 256 and reserve these chunks for the according ascii-values). With
this method you would have equally long maximumsized strings ready for
binary-searching and overwriting. These chunks would then be filled with the
string and a timestamp. if our chunk was full, the next time we would have to
add a string we would delete the earliest one (or maybe two or three) and could
add the new string.
I know that this method screams "RAM WASTE!!! RAM WASTE!!!", but let me tell
you that the ACB-cruncher from George Buyanovsky needs 16 MByte RAM in one
chunk to work. We will deal with 256 MByte RAM on every average PC within the
next 5 years. Not just because RAM gets cheaper every minute, but because the
average user wants to scan and digitize things (in 1998: buy three CDs or a
scanner. What do you think we will be in 5 years?).
A good tablesearch-algorithm is the stand-or-fall part of the cruncher.

As far as i'm concerned: Let the cruncher take as much memory as possible.

If you have 32 MB reserved in one piece and you want a maximum stringlen of 128
Byte per string you will have 33554432 / 256 =  131072 Bytes per chunk.
And 131072 / 128 = 1024 Strings per chunk. The more you reduce your stringlen,
the more strings you can pack into the chunk. Reducing the maximum stringlen
down to 64 brings you 4096 strings per chunk. This is quite good. 4096 strings
just for one ascii-slot.

Well, the way we code the reference is completely ours. We should take a look
back to our LZ77 essay and let our phantasy roam. From the most stupid 8+12 bit
reference (gives us crunching only at minimum 4 byte strinlen) to unary +
your_stuff - everything is possible. However it should be fast.


The coding part is pretty easy, isn't it? You could theoretically write a LZ78
cruncher in Visual Basic in a few hours and it would actually work.
And what about the decoding part? Well, no problems here, too. The decruncher
starts with the same emptied model (emptied table) as the cruncher:


  New_String = Take the first code from decoding a bit_input
  output NewString
  take the next char
  output it
  add char to New_String
  add New_String into table.
repeat until EOF



//.............................................................................
While this method is pretty good and pretty easy to implement, 1984 Terry
Welch came up with a fundamental improvement of this algorithm. Since then the
term LZW is also common. Welch had the idea that one could pre-create a table
(with the first 256 possible chars) and that the algorithm would only output
table-references.

With this trick one could create a huffman-tree for all those codes one writes
thus letting us write even fewer bits (there is a method of writing even fewer
bits than Huffman - arithmetic coding, but we'll come to this later).

So, how does this LZW work? Let's have a look:

The basic LZW-algorithm:

//Read ahead because we definitively know every single char
oldString[0] = getc(input);
oldString[1] = 0x00;

while (! eof(input))
{
  c = getc(input);                //We get the second string from our input
  strcpy(newString, oldString);   //and save the oldstring into the search
                                  //string. The first time the oldstring is
                                  //our very first char of our input. But
                                  //later it will be longer strings.

  strncat (newString, &c, 1);     //add the actual read char to the
                                  //actual string

  if (stringInTable(newString))   //when the string is already in the table
  {
    strcpy (oldString, newString) //take it as oldString and read the next char
  }
  else                            //else
  {
    code = searchTable(oldString);//get the code for the last known String
    output (code);                //and output it
    addStringToTable(newString);  //add the new String for the next time
    oldString[0] = c;             //and start again with the latest unknown
    oldString[1] = 0x00;          //char
  }
}

Looks pretty easy, ain't it? Well, it is. Let's crunch the our little 
source again (this time a little longer):


                                  0123456789abcde
                                  iwobiwoiwoiwoix


           
(We have a table, where the first 256 (0 - 255) entries are occupied with the 
corresponding values)
Let's trace the steps of the cruncher. Maybe you should print the above
typed source for checking.

  oldstring = ( 'i', 0x00 )
  
(we enter the loop) (Pos = 0)

  c = 'w'
  newString = ( 'i', 0x00 )
  newString = ( 'iw', 0x00 )
  
(no, we don't have 'iw' in our table yet)

  code = searchTable(oldString)   //oldString = 'i', 0x00
  output (code)                   //i
  addString (newString)
  oldString[0] = 'w'
  oldString[1] = 0x00
  
(back to the top of the loop) (Pos = 1)
(our table has the 256 entries plus [ {256, 'iw'} ]

  oldstring = ( 'w', 0x00 )
  c = 'o'
  newString = ( 'w', 0x00 )  
  newString = ( 'wo', 0x00 )  

(no, we don't have 'wo' in our table yet)

  code = searchTable(oldString)   //oldString = 'w', 0x00
  output (code)                   //w
  addString (newString)
  oldString[0] = 'o'
  oldString[1] = 0x00

(back to the top of the loop) (Pos = 2)
(our table has the 256 entries plus [ {256, 'iw'}, {257, 'wo'} ]

  oldstring = ( 'o', 0x00 )
  c = 'o'
  newString = ( 'b', 0x00 )  
  newString = ( 'ob', 0x00 )  

(no, we don't have 'ob' in our table yet)

  code = searchTable(oldString)   //oldString = 'o', 0x00
  output (code)                   //o
  addString (newString)
  oldString[0] = 'b'
  oldString[1] = 0x00

(back to the top of the loop) (Pos = 3)
(our table has the 256 entries plus [ {256, 'iw'}, {257, 'wo'}, {258, 'ob'} ]

  oldstring = ( 'b', 0x00 )
  c = 'i'
  newString = ( 'b', 0x00 )  
  newString = ( 'bi', 0x00 )  

(no, we don't have 'bi' in our table yet)

  code = searchTable(oldString)   //oldString = 'b', 0x00
  output (code)                   //i
  addString (newString)
  oldString[0] = 'i'
  oldString[1] = 0x00

(Now we will have something to crunch :)
(back to the top of the loop) (Pos = 4)
(our table has the 256 entries 
plus [ {256, 'iw'}, {257, 'wo'}, {258, 'ob'}, {259, 'bi'}  ]

  oldstring = ( 'i', 0x00 )
  c = 'w'
  newString = ( 'i', 0x00 )  
  newString = ( 'iw', 0x00 )  

(YES, we have 'iw' in our table )

  oldString = ( 'iw', 0x00)
(back to the top of the loop)
  oldString = ( 'iw', 0x00)
  c = 'o'  

  newString = ( 'iw', 0x00 )  
  newString = ( 'iwo', 0x00 )  


(no, we don't have 'iwo' in our table yet)

  code = searchTable(oldString)   //oldString = 'iw', 0x00
  output (code)                   //[256]
  addString (newString)
  oldString[0] = 'o'
  oldString[1] = 0x00


(back to the top of the loop) (Pos = 6)
(our table has the 256 entries 
plus [ {256, 'iw'}, {257, 'wo'}, {258, 'ob'}, {259, 'bi'}, {260, 'iwo'}  ]

  oldstring = ( 'o', 0x00 )
  c = 'i'
  newString = ( 'o', 0x00 )  
  newString = ( 'oi', 0x00 )  

(no, we don't have 'oi' in our table yet)

  code = searchTable(oldString)   //oldString = 'o', 0x00
  output (code)                   //o
  addString (newString)
  oldString[0] = 'i'
  oldString[1] = 0x00


(back to the top of the loop) (Pos = 7)
(our table has the 256 entries 
plus [ 
       {256, 'iw'}, {257, 'wo'}, {258, 'ob'}, {259, 'bi'}, {260, 'iwo'},
       {261, 'oi'}  
     ]  

  oldstring = ( 'i', 0x00 )
  c = 'w'
  newString = ( 'i', 0x00 )  
  newString = ( 'iw', 0x00 )  

(YES, we have a 'iw' in our table)
  oldString = ( 'iw', 0x00)
(back to the top of the loop)
  oldString = ( 'iw', 0x00)
  c = 'o'  

  newString = ( 'iw', 0x00 )  
  newString = ( 'iwo', 0x00 )  

(YES, we have a 'iwo' in our table)
  oldString = ( 'iwo', 0x00)
(back to the top of the loop)
  oldString = ( 'iwo', 0x00)
  c = 'i'  

  newString = ( 'iwo', 0x00 )  
  newString = ( 'iwoi', 0x00 )  


(no, we don't have 'iwoi' in our table yet)

  code = searchTable(oldString)   //oldString = 'iwo', 0x00
  output (code)                   //260
  addString (newString)
  oldString[0] = 'i'
  oldString[1] = 0x00


(back to the top of the loop) (Pos = a)
(our table has the 256 entries 
plus [ 
       {256, 'iw'}, {257, 'wo'}, {258, 'ob'}, {259, 'bi'}, {260, 'iwo'},
       {261, 'oi'}, {262, 'iwoi'}    
     ]  

  oldstring = ( 'i', 0x00 )
  c = 'w'
  newString = ( 'i', 0x00 )  
  newString = ( 'iw', 0x00 )  

(YES, we have a 'iw' in our table)
  oldString = ( 'iw', 0x00)
(back to the top of the loop)
  oldString = ( 'iw', 0x00)
  c = 'o'  

  newString = ( 'iw', 0x00 )  
  newString = ( 'iwo', 0x00 )  

(YES, we have a 'iwo' in our table)
  oldString = ( 'iwo', 0x00)
(back to the top of the loop)
  oldString = ( 'iwo', 0x00)
  c = 'i'  

  newString = ( 'iwo', 0x00 )  
  newString = ( 'iwoi', 0x00 )  

(YES, we have a 'iwoi' in our table)
  oldString = ( 'iwoi', 0x00)
(back to the top of the loop)
  oldString = ( 'iwoix', 0x00)
  c = 'x'  

  newString = ( 'iwoi', 0x00 )  
  newString = ( 'iwoix', 0x00 )  


(no, we don't have 'iwoix' in our table yet)

  code = searchTable(oldString)   //oldString = 'iwo', 0x00
  output (code)                   //262
  addString (newString)
  oldString[0] = 'x'
  oldString[1] = 0x00

So, our output would look something like this:

 iwob[256]o[260][262]


Hey, we have two codes directly. Does this make trouble when decrunching?

Well, let's see how we would decrunch our little baby:


//Well, we know that the first character is uncrunchable, so let's output it
//directly
oldString[0] = decode(input())
oldString[1] = 0x00;

putString(oldString);      

newCode = input()                         //We get a new inputCode
while (newCode != EOF)
{
  newString = searchInTable(newCode);     //We have to have it in our table
  putString(newString);                   //then we output it
  strncat (oldString, &newString[0], 1);  //the oldstring is concatenated by 
                                          //the first char of the freshly read
                                          //string
  addString (oldString);                  //this string is then added to the
                                          //table
  strcpy (oldString, newString);          //then the freshly read string is
                                          //from now on our new base 
  newCode = input()                       //We get a new inputCode
}


OK, let's trace this alg when we decrunch our crunch:

 iwob[256]o[260][262]

Our table has the table entries 0 - 255 filled with the corresponding values.

oldString[0] = 'i';
oldString[1] = 0x00;
putString(oldString);       //i is output

newCode = 'w';

(enter loop)
  newString = 'w'
  Output 'w'                //w is output
  
  oldString = 'iw'          //build up new String
  //                        //and add it to the table

  oldString = 'w'    
  newCode = 'o'

our table has the 256 entries 
plus [ 
       {256, 'iw'}
     ]  
(enter loop)
  newString = 'o'
  Output 'o'                //o is output
  
  oldString = 'wo'          //build up new String
  //                        //and add it to the table

  oldString = 'o'    
  newCode = 'b'

our table has the 256 entries 
plus [ 
      {256, 'iw'}, {257, 'wo'}
     ]  
(enter loop)
  newString = 'b'
  Output 'b'                //b is output
  
  oldString = 'ob'          //build up new String
  //                        //and add it to the table

  oldString = 'b'    
  newCode = [256]

our table has the 256 entries 
plus [ 
      {256, 'iw'}, {257, 'wo'}, {258, 'ob'}
     ]  
(enter loop)
  newString = 'iw'
  Output 'iw'               //iw is output
  
  oldString = 'bi'          //build up new String (just the first char from the
                            //new String)
  //                        //and add it to the table

  oldString = 'iw'    
  newCode = 'o'


our table has the 256 entries 
plus [ 
      {256, 'iw'}, {257, 'wo'}, {258, 'ob'}, {259, 'bi'}
     ]  
(enter loop)
  newString = 'o'
  Output 'o'                //o is output
  
  oldString = 'iwo'         //build up new String
  //                        //and add it to the table

  oldString = 'o'    
  newCode = [260]
  

our table has the 256 entries 
plus [ 
      {256, 'iw'}, {257, 'wo'}, {258, 'ob'}, {259, 'bi'}, {260, 'iwo'}
     ]  
(enter loop)
  newString = 'iwo'
  Output 'iwo'              //o is output
  
  oldString = 'oi'          //build up new String (just the first char)
  //                        //and add it to the table

  oldString = 'iwo'    
  newCode = [262]
  

our table has the 256 entries 
plus [ 
      {256, 'iw'}, {257, 'wo'}, {258, 'ob'}, {259, 'bi'}, {260, 'iwo'}
      {261, 'oi'}
     ]  
(enter loop)
  newString = CRASH !!! CRASH
  
Damn. I knew it. The cruncher added a code and used it immediately and the
decruncher can't use it as it doesn't know how to create it. Or does it???

Hmmmm....

Ok, ok, i lied. I knew it all the time that this constellation was deadly for 
our decruncher. This behaviour appears, whenever the cruncher meets a 
constellation like (iwo b iwo iwo iwo) where the last character of a crunch is 
the beginning of the next sequence. This leads to creating table-entries which 
the decruncher can not foresee. But lucky us - this is the only case when this 
happens. A little if-clause will just catch the null-pointer and just fetch the 
oldString and copy it to newString. Then we add the first char of newString to 
the end of newString. Then we continue as always. The if-clause could look like 
this:

  if (newString == null)
  { 
    char c[2]
    //we still know oldString
    strcpy (newString, oldString)
    char c[0] = newString[0];
    char c[1] = 0x00;
    strcat (newString, c);    
  }

With this clause let's restart from the last part:

our table has the 256 entries 
plus [ 
      {256, 'iw'}, {257, 'wo'}, {258, 'ob'}, {259, 'bi'}, {260, 'iwo'}
     ]  
(enter loop)
  newString = 'iwo'
  Output 'iwo'              //o is output
  
  oldString = 'oi'          //build up new String (just the first char)
  //                        //and add it to the table

  oldString = 'iwo'    newCode = [262]


our table has the 256 entries 
plus [ 
      {256, 'iw'}, {257, 'wo'}, {258, 'ob'}, 
      {259, 'bi'}, {260, 'iwo'} {261, 'oi'} 
     ]  
(enter loop) 
  newString = null
  if (newString = null)
  {
    newString = 'iwo'
    newString = 'iwoi'      //First character added again
  }
  Output 'iwoi'             //iwoi is output
  
  oldString = 'iwoi'        //build up new String (just the first char)
  //                        //and add it to the table
...
and then we have our entry 262.

Phew.

You see when you try this on your own, that there is a lot of room for 
improvement. 
But this is the basic algorithm and as now know the secrets of LZW you know 
what RAR and ZIP are doing. I have some sources i can mail you, if you want.


All right, next time i'll explain something related to Huffman - 

                The Arithmetic Crunching Method (tataaa... :-)
                (how can i crunch 0.74 bits ???)


Greetings from a programmer,

Joa




redhomepage redlinks red anonymity red+ORC redstudents' essays redacademy database
redtools redcounter measures redcocktails redantismut redbots wars redsearch_forms redmail_fravia
redIs reverse engineering legal?