papers
+HCU papers

Little essay about the various methods and viewpoints of crunching
~ version June 1999 ~
by Joa ----- Part. VIII (Burrows - Wheeler - Transformation (BWT)

Courtesy of reverser's page of reverse engineering

Well, Joa continues his fundamental paper on crunching, this is part VIII
enjoy!

12 December 98 Joa ~ crunchi1.htm Little essay about the various methods and viewpoints of crunching papers ~fra_0126
10 June 98 Joa ~ crunchi2.htm Little essay about the various methods and viewpoints of crunching II papers ~fra_0129
17 June 98 Joa ~ crunchi3.htm Little essay about the various methods and viewpoints of crunching III papers ~fra_012E
17 June 98 Joa ~ crunchi4.htm Little essay about the various methods and viewpoints of crunching IV papers ~fra_012F
17 June 98 Joa ~ crunchi5.htm Little essay about the various methods and viewpoints of crunching V papers ~fra_012F
17 June 98 Joa ~ crunchi6.htm Little essay about the various methods and viewpoints of crunching VI papers ~fra_012F
17 June 98 Joa ~ crunchi7.htm Little essay about the various methods and viewpoints of crunching VII papers ~fra_xxxx


Little essay about the various methods and viewpoints of crunching.

Part VIII: Burrows - Wheeler - Transformation (BWT)



By Joa Koester (JoKo2000@HotMail.Com)
This essay is free to be copied and read and spread as long as
you are decent enough to leave my name in it.
If you have corrections, comments, better examples than the ones used
in this essay etc. - drop me a line.
hello, hello, well, shave my legs and call me Grandpa if that wasn't a long delay, but i still have some serious difficulties accessing the Internet recently. I hope that i can access you more frequently in the near future. And to sweet up the time till then a little bit here my next essay about crunching methods: THE BURROWS - WHEELER - TRANSFORMATION (tataaa :) As we discussed last time the mechanism of Arithmetic Crunching some of you asked me about the source (the input of the cruncher) and that sometimes it would take the cruncher a long time to adjust to the source data. Yes, there is a problem with this kind of crunching: the source itself should be transformed to be better crunchable. But how do we transform a data to be better crunchable with a fast algorithm? Well, the arithmetic cruncher would love to crunch data like (all hex): 00, 01, 05, 02, 08, c2, 03, 01, c0, 0a, 12, 01, 02.... but how do we transform a sequence like {0a, f2, f2, 3c, 0a, 77, 44, 0a} into such cruncherfriendly data? Well, luckily for us, there is an algorithm available for us that does exactly what we need. This alg is called Move-To-Front (MTF). The MTF-Algorithm MTF is - once you had the idea - a pretty simple algorithm. It transforms the input data into some other data, it is reversible and it is pretty fast. The basic idea is that we don't output the data itself but the index of the char in some table (in our case the table has the usual possible 256 entries). After the output we save the byte, shift the rest below one position up and MOVE the saved byte TO the FRONT of the table. A simple char-Table will show you what i mean: Let's presume the input is {0a, f2, f2, 3c, 0a, 77, 44, 0a} That's 8 entries. But we have only 5 distinct values. As we want a reference table, we only need the distinct values, so for the sake of our example we take a reference table with only 5 entries (which are - for the example - filled with the distinct input values as starting values. In a real case the table would have 256 entries filled with the 256 possible values for start): Table Value [ ] 0 0a 1 f2 2 3c 3 77 4 44 Now let's transform the first byte: 0a We search our table for the value 0a and we find it at offset 00. And that's our output: 00. Now we shift the rest one position upwards and move the 0a at offset 00 to offset 00. Luckily for us the loop terminates immediately and our table looks now exactly like before: Table Value [ ] 0 0a 1 f2 2 3c 3 77 4 44 And now the second byte: f2 f2 is found at position 01 and this is our output: 01. Now we save the f2 at position 01, move all other entries below the position 01 one up and move the saved f2 to position 00. Let's watch the manipulation of the table in slow motion: Table Value Table Value Table Value [ ] [ ] [ ] 0 0a f2 is saved. Copy 0 0a insert the -> 0 f2 1 f2 <- the entries below -> 1 0a saved f2 at 1 0a 2 3c one position up: 2 3c offset 00: 2 3c 3 77 3 77 3 77 4 44 4 44 4 44 The next byte is again: f2 As we find it at position 00 our output is 00. Our moving loop terminates immediately and now let's have a look at our table and our output so far: Table Value [ ] 0 f2 1 0a 2 3c 3 77 4 44 Output so far: 00, 01, 00 Oh, ah, fascinating. Our input has surely been transformed into something more likely to be crunched by an Arithmetic Cruncher system. Let's continue with the next bytes: Inputbyte is 3c 3c is found at offset 02. So we save the 3c at offset 02, move the rest below position 02 one up and move 3c to offset 00: Table Value Table Value Table Value [ ] [ ] [ ] 0 f2 3c is saved. Copy 0 f2 insert the -> 0 3c 1 0a the entries below 1 f2 saved f2 at 1 f2 2 3c <- one position up: -> 2 0a offset 00: 2 0a 3 77 4 77 4 77 4 44 5 44 5 44 Output so far: 00, 01, 00, 02 The next input byte is 0a. It is found at offset 02 which is our output. Now we move the 0a at position 02 to the front: Table Value Table Value Table Value [ ] [ ] [ ] 0 3c 0a is saved. Copy 0 3c insert the -> 0 0a 1 f2 the entries below 1 3c saved 0a at 1 3c 2 0a <- one position up: -> 2 f2 offset 00: 2 f2 3 77 4 77 4 77 4 44 5 44 5 44 Output so far: 00, 01, 00, 02, 02 Now for the next input byte: 77. 77 is found at offset 03 which is output. This is how our table is altered: Table Value Table Value Table Value [ ] [ ] [ ] 0 0a 77 is saved. Copy 0 0a insert the -> 0 77 1 3c the entries below 1 0a saved 77 at 1 0a 2 f2 one position up: 2 3c offset 00: 2 3c 3 77 <- -> 4 f2 4 f2 4 44 5 44 5 44 Output so far: 00, 01, 00, 02, 02, 03 After the next input byte (44) our output is 04 and our table looks like this: Table Value [ ] 0 44 1 77 2 0a 4 3c 5 f2 Output so far: 00, 01, 00, 02, 02, 03, 04 After the final input byte (0a) our output is 02 and our table looks like this: Table Value [ ] 0 0a 1 44 2 77 4 3c 5 f2 Output so far: 00, 01, 00, 02, 02, 03, 04, 02 As you can see these output bytes are far more crunchable than the original bytes. But is it really reversible? Let's retransform the first few bytes together. At first we build the basic transformation table which is the same as the cruncher uses (remember how cruncher and decruncher have to use the same model?): Table Value [ ] 0 0a 1 f2 2 3c 3 77 4 44 Again, in a real version the table would be filled with the standard 256 possible values of the ASCII-TABLE. Now let's retransform the first byte: 00 We look for the 00th byte in our table, find a 0a and output it. Then we take the 00 as starting point for the shifting and moving and proceed as we already know from the first pass. As the loop terminates immediately we can continue with the next input byte. The next input byte is a 01. Now, at offset 01 we find a f2, output it and proceed: Save the char at offset 01, shift all entries below offset 01 one position up and move the saved byte into position 00. After that our table looks like this: Table Value [ ] 0 f2 1 0a 2 3c 3 77 4 44 As you can already see the table looks exactly like the table when we had our first pass and transformed the char f2. If we would continue to scan thru the rest of the input (the former output) we would see that the transformation table would be exactly like the one in the first pass. So, let's recapitulate: the MTF-Algorithm is fast, easy to implement and a good preprocessor for Huffman or (even better) Arithmetic Crunching. It's also a nice basic idea for crypting data. Yes, but, ehm... Yes, Watson? Well, i thought, that this essay was about the Burrows-Wheeler-Transformation, not about the MTF-algorithm. You are right. But the authors of the BWT, Michael Burrows and David Wheeler recommended that data should be crunched in such an order: Data -> (maybe a simple RLE cruncher ->) BWT -> MTF -> Huffman or Arithmetic Cruncher Ah, i see. Yes, and now as the basic premises are all clear let's examine the BWT. What is the BWT? Well, a short answer would be: an alg that sorts data, saves a certain aspekt of the sorted data, saves a starting index and that's it. What? No compression? No, Watson, no compression. The transformation is just what it is: A transformation. But think about it for a moment. A reversible transformation has a big advantage for designers: You can inspect and/or manipulate the transformed data in another way than you can with the original data. Take the MTF for example. It transforms the input data into some other data. But this other data is in most cases much better crunchable than the original data. And as the process is completely reversible you don't have any disadvantage. You could write a special cruncher sensitive to the output of MTF processes with which you could achieve better results than achieved with normal data. You could say that this kind of transforming data, manipulating that other data and retransforming the data to the original kind is some form of indirection. Indirection is very crucial for programmers (and philosophers); it enables you to solve and discuss problems on a much more comfortable level than the original problem stated. Keep this in your mind: if you have some problems, viewing the problem from a differenct angle gives you much more possibilities than trying to solve the problem with brute force. But back to the BWT. So, the BWT is just some kind of indirection. And it has to be reversible. But how does this work together with sorting? Well, the BWT algorithm gives you the possibility to unsort your data... Yes, you read right. With BWT you can unsort the saved data you constructed by BWT. OK, the data you save itself is not sorted, but is has some interesting features: 1. It has in some way the FOLLOWING context of the original data. 2. It enables you to unsort the data and rebuild the original data. 3. The bigger the chunks of data you sort, the better your results for the crunching stage. So, speedwise it's a matter of how fast your sorting algorithm is. crunchwise it's a matter of how much memory you have. So, for example let's transform the word CRACKER. To do this we need an array to all possible shifted forms of the word. If you take the word CRACKER, it has 7 letters. So we can shift it 7 times, from shift 0 to shift 6. The result is something like this: CRACKER RACKERC ACKERCR CKERCRA KERCRAC ERCRACK RCRACKE How do arrange this? Well, one easy way would be to duplicate the source data in RAM and just install enough charpointers (7 in that case). < RAM-DATA > CRACKERCRACKER S0 ^^^^^^^ S1 ^^^^^^^ S2 ^^^^^^^ S3 ^^^^^^^ S4 ^^^^^^^ S5 ^^^^^^^ S6 ^^^^^^^ When we later would sort the data we would have a very simple compare logic. Another way would be just 7 charpointers and the compare part of the sort logic would have to wrap around if it reached the end of the block: S0 CRACKER S1 RACKERC S2 ACKERCR S3 CKERCRA S4 KERCRAC S5 ERCRACK S6 RCRACKE I assume the latter for now. But it's a matter of taste, nothing else. However, what we do next is to sort the data. As pointers we take our 7 charpointer and we should only swap the string pointers as swapping some 100 KByte could take a looong time if done some 100 times :). Again: if you sort that way i do, your compare function has to recognize the end of the block to wrap around. Otherwise your sorting will fail. (You should save the pointer of S1, as you will need it later.) So, after sorting our new set of strings looks like this: S2 ACKERCR S3 CKERCRA S0 CRACKER S5 ERCRACK S4 KERCRAC S1 RACKERC S6 RCRACKE Now, what next? Ok, let's take a closer look to the first and last column of the strings: < block > S2 A CKERC R S3 C KERCR A S0 C RACKE R S5 E RCRAC K S4 K ERCRA C S1 R ACKER C S6 R CRACK E The first column ist SORTED. Well, we can expect it to be sorted, as we have just sorted the data :) And what about the last column? Well, the least we can say - as we are having a wrap-around look - is - that the first and the last column both contain the word (of course) in some more or less strange order and - the letter in the last column is the prefix letter of the letter in the first column in the same row. Now, what we save to disk is the LAST column and an index. The index is the position of the first letter of the original word in the saved data (our last column). When we rotate the last column by 90 degrees it looks like: RARKCCE. And the 'C' of CRACKER is at position 5 (zero-based). It is important to take the correct offset of the correct 'C' as it is the starting point for the re-transformation. There is an easy way to find the right 'C'. There are two 'C's in our little example. But we need the 'C' which starts the word CRACKER. If you look at the table, it is the 'C' of the last column of S1. Why? Because S0 is the original (shifted 0 times). And S1 is the original shifted once: S0 C RACKE R S1 R ACKER C That's why i'd save the pointer S1. As we want the first char of the word, but have to use the last column, we only can take the char in the last colum of S1. So, all, we have to do is to count thru our charpointers until we reach S1: < block > 0: S2 A CKERC R 1: S3 C KERCR A 2: S0 C RACKE R 3: S5 E RCRAC K 4: S4 K ERCRA C 5: S1 R ACKER C <- BINGO 6: S6 R CRACK E There comes our index: 5. We output it and are done with the first pass of the transformation. However, there are two naturally rasing questions: How do we unsort RARKCCE and how do we rebuild our original data CRACKER? ^ Well, to restore our original data we need some kind of re-transformation table. In this table the offsets of the next char would be saved and the only thing we would have to do is to follow the jumps in the table (starting with our saved index) and output the char found at the given offset in our data. We are basically building a sequence table, only that it the given position is also the offset in our source data to be output. For our data RARKCCE, how would such table look like? Well, the table would look like this: Table [ ] 1 4 5 6 3 0 <- Start here 2 Now we include an offset, add the source data to the table to the view and watch them parallel: Offset Table Data [ ] 0 1 R 1 4 A 2 5 R 3 6 K 4 3 C 5 0 C <- Start here 6 2 E Let's rebuild the data with this table: we start at offset 5, our given saved index and output the char 'C'. The entry in the table sends us to position 0. There we find an 'R' and output it. The table offsets at position 0 says 1, so we goto position 1 and find an 'A' there. The offset at position 1 sends us to position 4. There awaits us a 'C'. From there we are sent to the position 3 and can fetch a 'K' there. Now we jump to offset 6 and can harvest an 'E'. From position 6 we are sent to pos 2, output the 'R' from there and are sent to our starting position, which tells us, that we are finished. Our original data is revuild. Astounding, isn't it? Nice table, all right, but how do we build this table? Well, here comes the (un)sorting part into the game. Again, let's watch our first and our last column (last column first): L F 0: R A 1: A C 2: R C 3: K E 4: C K 5: C R 6: E R Well, if we take the first char in the column (F), we find an 'A'. Where do we find the first 'A' in column (L)? At offset 01. Hm. And the next char is a 'C'. Where do we find this in the column (L)? At offset 4. Let's have a look at the table again after filling in these values: Offset Table Data [ ] 0 1* R 1 4* A 2 5 R 3 6 K 4 3 C 5 0 C <- Start here 6 2 E Wow, the first two offsets are correct. And watch the two columns again: L F 0: R A 1: A C 2: R C 3: K E 4: C K 5: C R 6: E R You start at the offset 5 (C R) (* = (already) used) Offset Table Data Data-Sort [ ] 0 1 R A 1 4 A C 2 5 R C 3 6 K E 4 3 C K 5 0 * C * R <- Start here 6 2 E R and jump to the first offset of an 'R' (offset 0) you see a (R A). Offset Table Data Data-Sort [ ] 0 1 * R * A <- ActPos 1 4 A C 2 5 R C 3 6 K E 4 3 C K 5 0 * C * R 6 2 E R The first 'A' is at offset 1 (A C). Offset Table Data Data-Sort [ ] 0 1 * R * A 1 4 * A * C <- ActPos 2 5 R C 3 6 K E 4 3 C K 5 0 * C * R 6 2 E R The NEXT AVAILABLE 'C' ist at offset 4 (C K): Offset Table Data Data-Sort [ ] 0 1 * R * A 1 4 * A * C 2 5 R C 3 6 K E 4 3 * C * K <- ActPos 5 0 * C * R 6 2 E R from where you can reach the first 'K' at offset 3 (K E). Offset Table Data Data-Sort [ ] 0 1 * R * A 1 4 * A * C 2 5 R C 3 6 * K * E <- ActPos 4 3 * C * K 5 0 * C * R 6 2 E R The first 'E' is at offset 6 (E R). Offset Table Data Data-Sort [ ] 0 1 * R * A 1 4 * A * C 2 5 R C 3 6 * K * E 4 3 * C * K 5 0 * C * R 6 2 * E * R <- ActPos The NEXT AVAILABLE 'R' for us is at offset 2 (R C). Offset Table Data Data-Sort [ ] 0 1 * R * A 1 4 * A * C 2 5 * R * C <- ActPos 3 6 * K * E 4 3 * C * K 5 0 * C * R 6 2 * E * R As the next available 'C' is at the starting offset (5) we are finished. To program a table-building part like this: An easy way to build the array is to have a sorted copy of the data ready and a table buffer big enough. Now you start with the first char in the sorted buffer. In our case an 'A'. Now as this is our first run, we simply search in our original data for an 'A' and find it at offset 01. As our searchloop is still at index 0 we write our offset (the 01) into the table at index [0]. Index Table Data Data-Sort [ ] 0 1 R A 1 A C 2 R C 3 K E 4 C K 5 C R 6 E R Now we increment our index to 01 and in the sorted data we find a 'C'. Now, maybe we already had a 'C' searched before. We could have cached the last char and the last char pos and would check this out now. Or we could look one position back. Let's assume the latter. When we look one position back, we see an 'A' and recognize that we are indeed the first 'C'. So we search the original data for the first 'C' to come and find it at offset 04. Our searchloop index is 01 so we write our offset 04 to the table at index [1]. Index Table Data Data-Sort [ ] 0 1 R A 1 4 A C 2 R C 3 K E 4 C K 5 C R 6 E R Now we increment from 01 to 02 and find again a 'C'. As we look one position back we see that we already had searched for a 'C'. We fetch the saved offset (04) and start to search from 04 + 1. As we are on position 5 we immediately find another 'C' and are done. We write the offset 5 into the table at index [2] and go on. Index Table Data Data-Sort [ ] 0 1 R A 1 4 A +---- C 2 5 R ! +-- C 3 K ! ! E 4 C <-+ ! K 5 C <---+ R 6 E R The 'E' and 'K' are treated like the 'A' (no former search done, start searching from the beginning): Index Table Data Data-Sort [ ] 0 1 R A 1 4 A C 2 5 R C 3 6 K E 4 3 C K 5 C R 6 E R The first and the second 'R' are done like the 'C': Are we the first one? If so, start to look from the beginning of the original data, else get the last known position and start searching from this last position +1. When we are ready, our table looks like this: Offset Table Data Data-Sort [ ] 0 1 R A 1 4 A C 2 5 R C 3 6 K E 4 3 C K 5 0 C R 6 2 E R Yep, that's the table we need to rebuild our really original data from the data we saved. And where do we get the sorted copy of the data? Well, there's no way around it: You have to reserve another block of memory, copy the saved data to it and SORT the saved data there. As we just transformed the data there is no byte missing. The inner trick of the BWT is the rebuilding of the original data with the intervals between the sorted column of the data and the prefix column (our saved data). Neat, eh? Let's repeat: To restore our original data from the saved data, we arrange a second copy of the data, sort it, build the table from it and with the table we can restore our original data (starting with the saved starting index). To transform the data we sort it using pointers, save the last column of the sorted data and also save the offset of S1 as starting index. Either in the net or on my harddisk is a complete set of sources doing the job for you. If i am successfull you should find an PART to this essay (and if reverser+ allows so :). If you find the attachment, take your compiler and compile the progs. The correct sequence for crunching / decrunching is: Crunch: SourceProg -> RLE -> BWT -> MTF -> ARI -> CrunchProg Decrunch: CrunchProg -> UNARI -> UNMTF -> UNBWT -> UNRLE -> SourceProg. Anyway, as always i'm open to suggestions, questions etc. Have just a little patience, that's all (maybe i should get a private INet-Access anyway :). Ok, see you next time Greetings from a programmer Joa




You are deep inside reverser's pages of reverse engineering, choose your way out!

redhomepage red+ORC redanonimity academy redcounter measures redbots' wars
redtools redour tools redhow to use our tools
redjavascript wars redreality cracking redacademy database redprogrammer's corner redhow to protect better
redantismut CGI-scripts redcocktails redsearch_page redhow to search redmail_reverser
redIs reverse engineering legal?