Little essay about the various methods and viewpoints of crunching
~ version June 1998 ~
by Joa

Courtesy of reverser's page of reverse engineering

> If you signal me, that this was understandable and want
> more, you will - i promise

Well, I do signal you that this was understandable and that I want more!
In fact I hope that you'll deal us some good cards about "their delicate and secured data" and about the "lots of algorithms" used inside the black boxes... because we want to have some more light inside all those black and dark boxes :-)


         Little essay about the various methods and viewpoints of crunching.

                                    Part I: Introduction


By Joa Koester (JoKo2000(at)hotmail(point)com) 
This essay is free to be copied and read and spread as long as
you are decent enough to leave my name in it. 
If you have corrections, comments, better examples than the ones used 
in this essay, etc. - drop me a line.


But on with the essay...

When i recently visited reverser+'s page on the net, i was amazed by the 
knowledge packed together, the philosophy and the attitude contained 
in the writings. But i missed the **other side** a little bit. 
That is, us programmers, condemned to write software for banks, 
insurance companies etc., so they can make a lot of money, ripping
a lot of people off. These companies are often serious about data
hiding and are always eager to have their delicate data secured.
One way of securing data is crunching (packing) it. This has two valid 
points:

 - the data is secure (coz the crunching algorithm is not
       made public)
 - the data is more compact, meaning it can be transferred more easily;
       the risk of a disk read/write error vaporising personal data
       is definitely lower when the data takes only 50 KByte on a
       disk rather than 200 KByte (of course, if a read/write error
       happens exactly in these 50 KByte, the file is also gone ;)

This brings us to the question:

WHAT IS CRUNCHING?


Well, a pseudo-mathematical answer would be:
everything that reduces a certain amount of data in its
SPACE, but not in its INFORMATION (lossless crunching,
that is; there are also some quality (information) reducing 
algorithms out there - jpeg for example).


So we have some data IN:

AAAAACCCCCDDDDDDCCCCCefg


and we have a black box:

  /-------\
/           \
| Holy crap |
| happening |
|   here    |
-------------

and we have more or less unreadable data OUT:

@$/)!%A3yfg


So, what's the principle of the black box? 
Is there one Super-Algorithm that you are not allowed
to know ("Behold my son, this knowledge is not 
for your eyes") ?

Of course not. There are principles. And there are lots
of algorithms, not just one.
And you ALREADY know a lot of these principles 
from everyday life.


When you have to cross a busy road you more or
less wait for the traffic light to turn green. You stand 
there and see a red light. Maybe the light is shaped like
a standing person or a cross or something else. But it is red
and you already know that this means: Hey, stop, don't go.
If you go, you are violating three different traffic laws at least,
obstructing the cars and starting the next world war. Besides,
you put your life in danger ;) 
And all this information is packed into the little red light on 
the other side of the street.
Are all red lights this way? No.
If you, for instance, are a musician and you are about to
record a song, you will press record & play on your 
recorder and a RED light will show up, telling you that
you had better not make thousands of mistakes and should record 
what you are doing properly. The red light can be a circle,
a square, even a text, it doesn't matter. But it will be
red!!!


Dr. Watson, what do you think?

Well...
What we have here is a case of crunching:
The various pieces of information are condensed into as few 
different symbols as possible. In both examples, only one symbol 
(the red light) is needed to get the MEANING (the information) 
transmitted. 

Right, Watson, right. And could we switch the information
contained in the symbols? That is, make the red light on
the recorder tell us when to stop at a traffic light?

No. They are both red lights, that's true. But the red light
in the recorder has nothing to do with crossing a road and
the traffic light has nothing to do with us recording a song.
The CONTEXT is different. The only thing that is similar is
the color.

Hm, hm.
Condensing information (many source symbols -> fewer destination
symbols) and keeping the CONTEXT in mind. Sounds pretty 
good to me. 

Kind of 
	switch (context)
	{
		case traffic:
			if (red_Light) {...} else {...}
			break;

		case recording_music:
			if (red_Light) {...} else {...}
			break;

		default:
			No_Condensed_Symbols();
			break;
	}
ain't it?

In all crunching we will always have something that tells us 
which context we have to switch to, and because of this we 
know how the following symbol(s) is/are 
to be interpreted.
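
To make the "something that tells us which context to switch to" concrete, here is a minimal sketch. It uses run-length coding with an escape marker, which is my own pick for illustration and not a scheme taken from this essay. The marker `ESC`, the single-digit count and the function names `crunch` / `decrunch` are all assumptions.

```python
ESC = "@"  # hypothetical marker symbol meaning "switch to run context"


def crunch(text: str) -> str:
    """Pack runs of 4..9 equal chars as ESC + count digit + char."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        run = 1
        # Count how often ch repeats (capped at 9 so the count is one digit).
        while i + run < len(text) and text[i + run] == ch and run < 9:
            run += 1
        if run >= 4 or ch == ESC:
            # Context switch: marker, then count, then the repeated char.
            # A literal ESC is always written this way so it stays unambiguous.
            out.append(f"{ESC}{run}{ch}")
        else:
            # Literal context: short runs are cheaper copied as they are.
            out.append(ch * run)
        i += run
    return "".join(out)


def decrunch(packed: str) -> str:
    """Undo crunch(): the marker decides how the next two symbols are read."""
    out = []
    i = 0
    while i < len(packed):
        if packed[i] == ESC:
            count = int(packed[i + 1])   # run context: first the count...
            out.append(packed[i + 2] * count)  # ...then the char to repeat
            i += 3
        else:
            out.append(packed[i])        # literal context: copy one char
            i += 1
    return "".join(out)


data_in = "AAAAACCCCCDDDDDDCCCCCefg"
data_out = crunch(data_in)
print(data_out)                      # @5A@5C@6D@5Cefg
print(decrunch(data_out) == data_in)  # True
```

The same digit '5' means "five times the next char" right after the marker and just "the character 5" everywhere else. Only the context differs, exactly like the two red lights.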


Dr. Watson, are all interpretations dependent on only one symbol?

Hm, i would say no. There may be cases where this is true,
but in most cases there is more than one symbol defining
exactly what's going on. There are crossroads with
streets leading straight ahead and right, and there are
crossroads where cars will drive left or straight ahead or
right. It depends on which lane of the crossroad the
car stands in, so that the traffic going straight ahead can go
but the traffic turning right still has to wait for ITS 
specific traffic light to switch from red to green. Another
example would be the position of the light. If the positions
of the red light and the green light were switched, 
there would be some chaos, i bet.

Sounds reasonable, Watson. You say that there are symbols
for a general context which are fine-tuned through other symbols
defining the exact context to be interpreted? 

Exactly.

But how do you think it is possible that all people
know that they have to stop at a red light and go at a green
one?

Well, i would say that they know because someone told them.
The parents, perhaps. Or they were taught so in school.


In fact, to crunch and decrunch information correctly, both
the sender and the receiver have to use the same way of 
interpreting the data. Society has ruled that a red traffic
light is a STOP. And so traffic lights will switch to red
when it is time to stop the traffic for one direction. And
on the other side YOU get taught by society that you had
better not run across the street when you have a red light or else
you play 'Frogger' for real...
So put in one sentence - Both the sender and the receiver
use the same MODEL of data-interpretation.
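
A small sketch of what "the same MODEL" means in practice. The bit tables below are made up for illustration (they are a simple prefix code, not anything defined in this essay), and the names `encode`, `decode`, `SENDER_MODEL` and `OTHER_MODEL` are hypothetical.

```python
# Hypothetical model: each symbol is mapped to a string of bits.
# No code is the beginning of another one, so the stream can be split again.
SENDER_MODEL = {"a": "0", "b": "10", "c": "110", "d": "111"}

# A receiver who was "taught" something different.
OTHER_MODEL = {"a": "111", "b": "110", "c": "10", "d": "0"}


def encode(text, model):
    """Sender side: replace every symbol by its bit string."""
    return "".join(model[ch] for ch in text)


def decode(bits, model):
    """Receiver side: collect bits until they match a code of the model."""
    reverse = {code: ch for ch, code in model.items()}
    out = []
    current = ""
    for bit in bits:
        current += bit
        if current in reverse:
            out.append(reverse[current])
            current = ""
    if current:
        raise ValueError("bit stream ended in the middle of a code")
    return "".join(out)


bits = encode("abacad", SENDER_MODEL)
print(bits)                        # 01001100111
print(decode(bits, SENDER_MODEL))  # abacad
print(decode(bits, OTHER_MODEL))   # dcdbda
```

The bits on the wire are identical in both cases. Only the receiver using the sender's model gets the message back; the other one gets a perfectly "valid" but wrong text.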


Dr. Watson, what if i would like to crunch the information
transmitted in the red traffic light?

This would be nearly impossible. The whole meaning of 
the traffic light is already emitted in only
one symbol (the red light i mean now). There is a point
where the amount of information can't be reduced any
further without getting complications elsewhere. 
Imagine one would pack all three lights (red, green
and yellow) into one light that would change its color
depending on its current state. Ok, you would have less
metal, only one light to look at and less glass to clean.
The normal case would be condensed - not in interpretation
but in material. The routine of Green - Yellow - Red - 
Yellow - Green... would stay. So far so good. But traffic
lights are also able to send more complex signals.
When, for example, there is a traffic jam ahead and the
police notice this, they can (at least where i live) make
the green and yellow lights blink together
to signal the upcoming jam so that the drivers can react to 
this signal. If all lights were built into one housing,
one would have to think of a special flashing pattern, flashing
at a special speed or something like that. Not very
good. Especially for older drivers whose reaction times may
be slower - they would concentrate more on interpreting the
flashing signal than on the traffic itself, increasing the 
risk of an accident. Another point would be the
shape of the light. A standing man in red and a walking man
in green would mean a complex shape of glass illuminated by
complex circuitry. This would mean that if one part activated
falsely, you would have, for example, a red man standing with
one green leg walking. Very confusing, eh? So condensing one thing
beyond the limit set by its information content (also known as 
ENTROPY) leads to enlarging other parts, giving
them biiiig overhead. How do we know whether the process is worth
all this?

Well, a certain student once came up with exactly this question
and answered it himself: it depends on the probability of
the individual symbols. If some symbols appear so often
in our stream of perception (analyzing, reading buffer data, etc.)
that we can condense them enough that, even with the enlargement
of the other symbols (which are not so frequent), we have an
overall crunching, then it's worth it. The name of the student
was Huffman...
For example, you have:
aaaaaaaaaabbbbbcccdd (10 x a, 5 x b, 3 x c, 2 x d) 20 chars

then you would have 20 x 8 bits = 160 bits.

If you now would transmit the 'a' with 7 bits, the 'b' with 8 bits
and the 'c' and 'd' chars with 9 bits you would have

    a: 10 x 7 =  70
    b:  5 x 8 =  40
    c:  3 x 9 =  27
    d:  2 x 9 =  18
                ---
                155

So we would save 5 bits. Not much, but better than nothing.
Of course you have to come up with an optimized coding for
these values, as you wouldn't want to work out by hand which
char you should encode with which number of bits without it getting
confused with the handling of the other chars. But Huffman found a 
perfect algorithm that gives you the optimized value-table for your chars
(but this is for the next part ;)
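
For the impatient, the arithmetic above can be checked with a few lines. The sketch below also peeks ahead: it computes the entropy of the example string (the lower limit in bits for this symbol distribution) and the code lengths that a textbook Huffman construction would assign. That construction is my assumption of what the next part will cover, and the helper names `total_bits`, `entropy_bits` and `huffman_lengths` are hypothetical.

```python
import heapq
import math
from collections import Counter


def total_bits(counts, bits_per_symbol):
    """Sum of (occurrences x bits) over all symbols."""
    return sum(n * bits_per_symbol[sym] for sym, n in counts.items())


def entropy_bits(counts):
    """Information content of the whole message in bits."""
    total = sum(counts.values())
    return -sum(n * math.log2(n / total) for n in counts.values())


def huffman_lengths(counts):
    """Code length per symbol from a textbook Huffman merge (sketch only)."""
    # Heap entries: (weight, tie_breaker, {symbol: depth so far}).
    # The tie_breaker keeps Python from ever comparing the dicts.
    heap = [(n, i, {sym: 0}) for i, (sym, n) in enumerate(sorted(counts.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)   # the two rarest groups...
        w2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))  # ...become one group
        tie += 1
    return heap[0][2]


counts = Counter("aaaaaaaaaabbbbbcccdd")

fixed = total_bits(counts, {sym: 8 for sym in counts})
essay = total_bits(counts, {"a": 7, "b": 8, "c": 9, "d": 9})
lengths = huffman_lengths(counts)
huffman = total_bits(counts, lengths)

print(fixed, essay)                    # 160 155
print(sorted(lengths.items()))         # [('a', 1), ('b', 2), ('c', 3), ('d', 3)]
print(huffman)                         # 35
print(round(entropy_bits(counts), 2))  # about 34.85
```

So the 7/8/9 bit assignment is only a gentle warm-up: with lengths chosen by the algorithm the same 20 chars fit into 35 bits, which is already very close to the entropy limit. Note that these numbers ignore the cost of telling the receiver the table.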



To condense the intro a little bit:

- To crunch is to transmit a certain amount of information
    in fewer symbols than normally needed.
- How you crunch/decrunch depends on the current context of 
    the data received so far.
- Both the sender and the receiver build up the same
    way of interpreting the data, i.e. the same model.
- When transforming long information-symbols into shorter packages of
    symbols and thus reducing the output, we will face the case
    that some (hopefully rare) symbols get transformed
    into LONGER symbols. If we have totally random data, crunching
    also happens totally at random, making our efforts nil. 
    That is BTW the reason why packing an already packed zip or rar 
    file is in almost all cases useless - the data written by those
    packers is nearly perfectly random. 
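
The last point is easy to try out. The sketch below uses Python's standard `zlib` module as a stand-in for zip (an assumption on my side; the essay does not name a specific packer implementation). The word list, the seed and the sizes are arbitrary.

```python
import random
import zlib

rng = random.Random(1998)  # fixed seed so every run gives the same data

# Text with lots of repetition: 2000 words drawn from a tiny vocabulary.
words = ["red", "green", "yellow", "light", "stop", "go", "traffic", "record"]
text = " ".join(rng.choice(words) for _ in range(2000)).encode("ascii")

# Pseudo-random bytes of the same length: no pattern a packer could use.
noise = bytes(rng.getrandbits(8) for _ in range(len(text)))

packed_text = zlib.compress(text, 9)
packed_noise = zlib.compress(noise, 9)
packed_twice = zlib.compress(packed_text, 9)  # packing the packed data again

print("text   :", len(text), "->", len(packed_text))
print("noise  :", len(noise), "->", len(packed_noise))
print("twice  :", len(packed_text), "->", len(packed_twice))

# Lossless: decrunching gives back exactly what went in.
print(zlib.decompress(packed_text) == text)  # True
```

On such input you should see the text shrink a lot, the noise not shrink at all (it even grows by a few bytes of bookkeeping), and the second pass over the packed text gain next to nothing.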


I hope you enjoyed this intro. If you signal me, that this was 
understandable and want more, you will - i promise.

Greetings from a programmer

Joa


