Date : Fri, 05 Dec 1986 20:00:00 MST
From : Keith Petersen <W8SDZ@SIMTEL20.ARPA>
Subject: Towards a UNIX equivalent for crunch 2.3
Here is another old DOC file from the original release of
CRUNCH/UNCRUNCH, DETAILS.DOC.
--Keith
*****************************************************************
* *
* LZW "Cruncher" Data Compressor Utility *
* LZW "Un-Cruncher" Data Decompressor Utility *
* *
* Z-80 Only, CP/M 2.2+ *
* v1.0 3/30/86 *
* -Steven Greenberg *
*****************************************************************
This document is intended to supplement the accompanying
ABSTRACT.DOC for those interested in some more technical details.
As mentioned, the cruncher is based an Kent Williams' imp-
lementation of the Lempel / Zev algorithm. For further infor-
mation on the algorithm itself, I refer you to his public domain
file LZW2COM.LBR which contains a description of the technique
and an actual implementation written in "C" source.
In order to make a practical stand alone "cruncher" that was
easy to use, especially for those already familiar with squeez-
ers, some header information had to be included in the resulting
"crunched" file (eg. the filename of the original file, etc.). I
have defined a header based on the time tested squeezed file
format, with some necessary changes and a few additions. The ad-
ditions are mostly to insure that files crunched now will always
be un-crunchable with future versions of the uncruncher, no mat-
ter what possible enhancements are made. Those familiar with the
MS-DOS ARC.xxx program have probably seen this idea in action.
More on this later.
Another slight problem with LZWCOM & LZWUNC had to do with
the question of termination. When the input file was exhausted
during compression, it was unlikely the output file was on a
sector boundary. No matter what the rest of the final output
sector was padded with ("1A"'s were used), the uncruncher would
try to uncrunch those bytes (since all data is conceivably val-
id). This resulted in occasional extra sectors of garbage
following an otherwise properly decoded file. While this did not
usually cause a problem, it was certainly not desirable.
I have chosen to handle the termination problem the same way
it was handled with squeezed files; by dedicating a unique code
to represent EOF (End Of Field). By only allowing 4095 instead
of 4096 different codes (not a major shortcoming), code 000 can
become a dedicated EOF. As soon as it is encountered on the
input file, the decoding process is known to be complete. For
those who are interested, the exact code put out by CRUNCH can be
duplicated by the "C" program LZWCOM if table entry zero "artifi-
cially" flagged as "used" (before initializing the table). That
insures that the code will never come up, except when manually
inserted at the end of file.
The other functional difference from LZWCOM involves repeat
byte coding. CRUNCH converts the "physical input stream" into a
"logical input stream", which is then handed to the cruncher.
The conversion takes 3 or more contiguous occurrences of the same
byte and encodes them as <byte> "90H" <count> where "count" is
the number of "additional" occurrences of <byte> (ie total occur-
rences -1). 90H itself is encoded as "90H" ,"00". This scheme
is identical to that used in standard squeezing.
Crunching requires only one pass through the input file,
while squeezing requires two. While this is one of its signif-
icant advantages, it does complicate the problem of including a
checksum, if one is desired, in the header of the result file
(since the value is not known until everything is done). A bad
solution is to close the finished output file, re-open it, insert
the checksum, and close it again. A good solution is to put the
checksum at the end of the output file right after the EOF. And
that's where it is. With all this in mind, herein follows a
specification for the format of a crunched file.
---------------------------------------------
ID FIELD: Bytes 0 and 1 are always 076H and 0FEH, respec-
tively. This identifies the file as "crunched".
FILENAME: The filename field starts at byte 2. It is a
field of variable length, terminated by a zero byte. The field
contains the filename as ASCII chars, including an ASCII "."
immediately preceding the filename's extension. Less than eight
characters may precede the "."; there is no necessity to pad the
filename with blanks. Additional characters after the 3rd exten-
sion character but before the zero byte specifically are allowed
and will be ignored by the current uncruncher. This allows an
area of unlimited size for date stamping, or other miscellaneous
information which a future cruncher or application program might
want to insert, for use or display by some uncrunching program.
By skipping over these bytes now, future incompatibilities are
eliminated.
Following the zero byte are the following 4 bytes, in order:
REFERENCE REVISION LEVEL: 1 byte }
SIGNIFICANT REVISION LEVEL: 1 byte } described later
ERROR DETECTION TYPE: 1 byte }
SPARE: 1 byte }
CRUNCHED OUTPUT: After the SPARE byte, the actual crunched
output finally begins. The crunched output is a series of 12-bit
codes in "natural" order. (Every other 12-bit code starts on a
byte boundary and includes the 4 ms bits of the next byte. The
"odd" codes start in the middle of a byte and include the whole
following byte as the remaining 8 ls bits). A 12-bit code of 000
is an EOF, as explained above. If the EOF code itself ends in
the middle of a byte, an additional 4 bits of zero are padded on
to get back on a byte boundary for the checksum.
CHECKSUM: The next two bytes are the 16-bit checksum, least
significant byte first. The checksum is the modulo 2^16 sum of
all the bytes as input from the physical input stream, prior to
repeat byte encoding (or, in the case of uncrunching, as output
to the physical output stream, after repeat byte decoding).
REMAINDER OF THE SECTOR: The remaining bytes in the sector
following the checksum are irrelevant. CRUNCH fills them with
"1A"'s.
---------------------------------------------
These are the four bytes not fully described above:
"Reference Revision Level": The program/revision level of
the program that performed the crunch operation. This byte is
put in for general reference only. The current value is "10"
(hex).
"Significant Revision Level": If the value of this byte in
a crunched data file exceeds the value contained within the un-
crunching program, the message "File requires newer revision of
program" will be displayed. If changes or enhancements are ever
made to CRUNCH which are significant enough to actually output an
incompatible file, the information in this byte will allow a new
revision of UNCR to be compatible with all existing data files,
old or new. The error message gets displayed only if someone
tries to uncrunch a new file with an old uncruncher which doesn't
know about the "future" format yet. Current value is "10" (hex).
"Error Detection Type": If this value is non-zero, the cur-
rent uncruncher will not examine the checksum or give an error
associated with it. This will permit a CRC type (or no error
checking) value to be used if circumstances warrant it. The cur-
rent UNCR program is always checking for "illegal" codes, which
are ones which have not been defined by previous codes. If any
are encountered, the message "Invalid Crunched File" is disp-
layed. This inherent self-checking probably precludes the neces-
sity of more advanced error checking.
"Spare": The SPARE byte is a spare byte.
- Steven Greenberg
30 March 1986