Date : Thu, 04 Dec 1986 22:30:00 MST
From : Keith Petersen <W8SDZ@SIMTEL20.ARPA>
Subject: CRUNCH-UNCRUNCH abstract
The following is Steven Greenberg's ABSTRACT.TEC from CRUNCH23.LBR.
Technical Abstract
CRUNCH 1.x maintained a table representing up to 4096 strings of
varying lengths using the so called LZW algorithm, which has been
described in the earlier documentation. These strings were ent-
ered into a table in a manner where the strings content was used
to determine the physical location (hashing), and that location
was used as the output code. Hash "collisions" were resolved by
maintaining another 12 bits of data per entry which was a "link",
or pointer to another entry.
In contrast, CRUNCH 2.x uses an "open-addressing, double hashing"
method similar to that employed in the UNIX COMPRESS. This meth-
od involves a table of length 5003, where only 4096 entries are
ever made, insuring the table never gets much more than about 80%
full. When a hash collision occurs, a secondary hash function is
used to check a series of additional entries until an empty entry
is encountered. This method creates a table filled with many
criss-crossed "virtual" chains, without the use of a "link" entry
in the table.
One reason this is important is that [without using any addition-
al memory] the 1 1/2 bytes which were previously allocated as a
link can now become the [output] code number. This enables us to
assign code numbers, which are kept right alongside the entry
itself, independently of the entry's physical location. This
allows the codes to be assigned in order, permitting the use of
9-bit representations until there are 512 codes in the table,
after which 10 bit representations are output, etc.
The variable bit length method has three ramifications. It is
particularly helpful when encoding very short files, where the
table never even fills up. It also provides a fixed additional
savings (not insubstantial) even when the table does fill up.
Thirdly, it reduces the overhead associated with an "adaptive
reset" to the point where it becomes a very viable alternative.
"Adaptive reset" simply means throwing out the whole table and
starting over. This can be quite advantageous when used proper-
ly. CRUNCH v2.x employs this provision, which was not incorpor-
ated in the V1.x algorithm.
"Code reassignment" is an advancement I introduced with the re-
lease of CRUNCH v2.0 based on original work. It is not used in
COMPRESS, any MS-DOS ARC program, or [to the best of my know-
ledge] any other data compression utility currently available.
There are many ways one might go about this (and at least as many
possible pitfalls). The algorithm I selected seemed to represent
a good tradeoff between speed, memory used, and improved perfor-
mance, while maintaining "soundness of algorithm" (ie it works).
Briefly, it works as follows: Once the table fills up, the code
reassignment process begins. (At this same time, the possibility
of adaptive reset is also enabled). Whenever a new code would
otherwise be made (if the table weren't full), the entries along
the hash chain which would normally contain the entry are
scanned. The first, if any, of those entries which was made but
never subsequently referenced is bumped in favor of the new en-
try. The uncruncher, which would not normally need to perform
any hash type function, has an auxiliary physical to logical
translation table, where it simulates the hashing going on in the
cruncher. In this fashion it is able to exactly reproduce the
reassignments made my the cruncher, which is essential.
---
I hope to write an article soon on "Recent Advancements in Data
Compression". It would cover the recent history generally, along
with a more detailed description of some of the algorithms, and a
series of additional ideas for future enhancement.
Steven Greenberg
16 November 1986