Date : Sun, 08 Jun 1986 08:34:00-MDT
From : Bernie Eiben - LDP Workstations <EIBEN@dec-marlboro.ARPA>
Subject: Mini text regarding SQ,ARC,LZW
Since somebody asked about differences between Huffman encoding and
ARC [or what's the difference between a cherry and an apple-pie] - I
decided to give my 'two cents' worth. There are LONG articles and
dissertations floating around -- but who reads them...
A mini-introduction to ARC
ARC is in some ways comparable to LSWEEP or LU in that it is a
PACKAGING method. Files with extension ARC contain a 'marker',
followed by file information, file-data, file information, file-data
etc.
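To make that layout concrete, here is a rough sketch in C of walking
the 'marker, file information, file-data' sequence and listing the
members. The field widths {29-byte headers, little-endian values, a
0x1A marker, method byte 0 meaning 'end of archive'} follow the
commonly circulated ARC description from memory and may not match
every ARC release; TEST.ARC is just a placeholder name.

#include <stdio.h>

/* read an nbytes-wide little-endian value {8088 byte order} */
static long get_le(FILE *fp, int nbytes)
{
    long v = 0;
    int i, c;
    for (i = 0; i < nbytes; i++) {
        if ((c = getc(fp)) == EOF) return -1;
        v |= (long)c << (8 * i);
    }
    return v;
}

int main(int argc, char **argv)
{
    FILE *fp = fopen(argc > 1 ? argv[1] : "TEST.ARC", "rb");
    char name[14];
    int marker, method, i;
    long comp_size, orig_size;

    if (!fp) { perror("open"); return 1; }

    for (;;) {
        marker = getc(fp);                /* every header starts 0x1A */
        if (marker != 0x1A) break;        /* EOF or not an ARC header */

        method = getc(fp);                /* storage method; 0 = end  */
        if (method <= 0) break;

        for (i = 0; i < 13; i++) name[i] = getc(fp);
        name[13] = '\0';                  /* original filename        */

        comp_size = get_le(fp, 4);        /* bytes of file-data ahead */
        get_le(fp, 2); get_le(fp, 2);     /* DOS date and time        */
        get_le(fp, 2);                    /* CRC of the original data */
        orig_size = get_le(fp, 4);        /* size before compression  */
        if (comp_size < 0 || orig_size < 0) break;

        printf("%-13s  method %d  %ld -> %ld bytes\n",
               name, method, orig_size, comp_size);

        fseek(fp, comp_size, SEEK_CUR);   /* skip the file-data       */
    }
    fclose(fp);
    return 0;
}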
ARC's packaging method guarantees NO GROWTH during storage, i.e.
file contents are analyzed before storage {see the toy sketch after
this list} and stored in one of four ways:
1. AS IS {typically files in the 1 to 200 byte range}
2. with repeat-compression {same range as above}
3. using Huffman encoding of 8-bit bytes {sometimes executables}
4. using Lempel-Ziv-Welch encoding {all others}
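A toy illustration of that 'no growth' decision for case 2
{repeat-compression}: measure what a simple run-length scheme would
produce {counting a run as a 3-byte 'byte, marker, count' triple in
the style of the CP/M-era squeezers - ARC's exact repeat-compression
and the escaping of the marker byte are ignored here} and only choose
packing when the result is actually smaller.

#include <stdio.h>

int main(int argc, char **argv)
{
    FILE *fp = fopen(argc > 1 ? argv[1] : "input.dat", "rb");
    int c, prev = -1;
    long run = 0, plain = 0, packed = 0;

    if (!fp) { perror("open"); return 1; }

    while ((c = getc(fp)) != EOF) {
        plain++;                          /* size when stored AS IS   */
        if (c == prev && run < 255) {
            run++;                        /* extend the current run   */
        } else {
            /* a run costs 3 bytes {byte, marker, count} once it is
               longer than 3, otherwise the bytes stay as they are   */
            packed += (run > 3) ? 3 : run;
            run = 1;
            prev = c;
        }
    }
    packed += (run > 3) ? 3 : run;        /* flush the last run       */
    fclose(fp);

    if (packed < plain)
        printf("repeat-compress: %ld -> %ld bytes\n", plain, packed);
    else
        printf("store AS IS: packing would not shrink %ld bytes\n", plain);
    return 0;
}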
ARC frees the user from 'worrying' about storage mechanisms and
delivers practically all needed services {extract, store, list, type,
check, execute and re-compress using the 'latest' state of compression
technique}. ARC is 'downward' compatible. It is currently heavily
used in the MSDOS/PCDOS world, although usage in RCPM systems is
starting with the availability of a fast DE-ARCer {a CP/M version of
ARC is 'in the works' by Bob Freed}.
ARC belongs in the category of "Share-ware" or "Free-ware" - it is
copyrighted by System Enhancement Associates {source-language C,
system MSDOS} - UnARC was written by Bob Freed for the Public Domain
{source-language assembler, system CP/M}.
A mini comparison of Huffman Encoding
and
Lempel-Ziv-Welch {LZW} techniques
Huffman Encoding expresses each storage unit as a variable-length
pointer into a frequency-ordered tree. Compression is achieved by
choosing a 'native' storage unit {where repetitions are bound to
occur} and {on the average} expressing the more frequent storage units
with shorter pointers [although less frequent units might be
represented by longer pointers]. The encoding process needs 'two
passes', i.e. first reading all units {under CP/M and MSDOS, 8-bit
bytes} to build the frequency-ordered tree {also called the
'dictionary'} and then translating all units into their respective
pointer values. Original filename, dictionary and pointer values are
stored - by convention the SECOND character of the filename extension
is changed to Q as a reminder of a 'squeezed' file.
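For the curious, a minimal sketch of those two passes {a toy in C, not
SQ's actual code - input.dat is a placeholder}: pass one counts how
often each 8-bit byte occurs and builds the tree by repeatedly merging
the two least frequent nodes; the bit-strings printed at the end are
the variable-length 'pointers' that pass two {not shown} would
substitute for each byte.

#include <stdio.h>
#include <stdlib.h>

#define SYMS 256

struct node {
    long freq;
    int sym;                             /* -1 marks an internal node */
    struct node *left, *right;
};

/* index of the lowest-frequency live node, excluding 'skip' */
static int pick_min(struct node **pool, int n, int skip)
{
    int i, best = -1;
    for (i = 0; i < n; i++) {
        if (pool[i] == NULL || i == skip) continue;
        if (best == -1 || pool[i]->freq < pool[best]->freq) best = i;
    }
    return best;
}

/* walk the finished tree and print each leaf's variable-length code */
static void show_codes(struct node *t, char *buf, int depth)
{
    if (t->sym >= 0) {
        buf[depth] = '\0';
        printf("byte %3d -> %s\n", t->sym, depth ? buf : "0");
        return;
    }
    buf[depth] = '0'; show_codes(t->left,  buf, depth + 1);
    buf[depth] = '1'; show_codes(t->right, buf, depth + 1);
}

int main(int argc, char **argv)
{
    long freq[SYMS] = {0};
    struct node *pool[SYMS];
    char buf[SYMS];
    int c, i, n = 0, live;
    FILE *fp = fopen(argc > 1 ? argv[1] : "input.dat", "rb");

    if (!fp) { perror("open"); return 1; }

    /* pass one: count how often each 8-bit byte occurs */
    while ((c = getc(fp)) != EOF) freq[c]++;
    fclose(fp);

    /* one leaf per byte value actually seen */
    for (i = 0; i < SYMS; i++) {
        if (freq[i] == 0) continue;
        pool[n] = malloc(sizeof(struct node));
        pool[n]->freq = freq[i]; pool[n]->sym = i;
        pool[n]->left = pool[n]->right = NULL;
        n++;
    }
    if (n == 0) return 0;

    /* merge the two least frequent nodes until one root remains;
       frequent bytes end up near the root, i.e. with shorter codes */
    for (live = n; live > 1; live--) {
        int a = pick_min(pool, n, -1);
        int b = pick_min(pool, n, a);
        struct node *p = malloc(sizeof(struct node));
        p->freq = pool[a]->freq + pool[b]->freq;
        p->sym = -1; p->left = pool[a]; p->right = pool[b];
        pool[a] = p; pool[b] = NULL;
    }

    show_codes(pool[pick_min(pool, n, -1)], buf, 0);
    return 0;
}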
LZW expresses strings of 8-bit bytes by pointers into an 'ordered'
string-table. The rules for 'constructing' the table are reversible,
so that Compressor and De-Compressor can 'build' their tables
'on-the-fly'. LZW is 'one-pass', although the achieved speed is VERY
dependent on language implementation and available physical memory [in
general more than 90% of the time is spent in 'hashing' and table
searching]. Although early implementations of LZW seemed to need more
than 64K of physical memory, current enhancements make a maximum of
2**11 table entries sufficient to handle all cases. State-of-the-art
implementations check the 'compression ratio' on the fly and rebuild
the table if the ratio decreases beyond a minimum, or rebuild it on
table overflow.
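And a matching toy sketch of the LZW side {again not COMPRESS's or
ARC's actual code}: the string-table is built 'on-the-fly' exactly as
described, so a de-compressor seeing the emitted codes can rebuild the
same table; a linear search stands in for the 'hashing' mentioned
above, and the table is simply frozen on overflow instead of being
rebuilt.

#include <stdio.h>

#define TABLE_MAX 2048                   /* 2**11 entries, as above       */

static int prefix[TABLE_MAX];            /* code of string minus last byte */
static int suffix[TABLE_MAX];            /* last byte of the string        */
static int next_code = 256;              /* 0..255 stand for single bytes  */

/* is the string "code p followed by byte c" already in the table? */
static int find(int p, int c)
{
    int i;
    for (i = 256; i < next_code; i++)
        if (prefix[i] == p && suffix[i] == c)
            return i;
    return -1;
}

int main(int argc, char **argv)
{
    FILE *fp = fopen(argc > 1 ? argv[1] : "input.dat", "rb");
    int c, p, i;
    long in, out = 0;

    if (!fp) { perror("open"); return 1; }

    p = getc(fp);
    if (p == EOF) return 0;
    in = 1;

    while ((c = getc(fp)) != EOF) {
        in++;
        i = find(p, c);
        if (i >= 0) {
            p = i;                       /* known string: keep extending it */
        } else {
            printf("emit %d\n", p);      /* a real packer writes p in bits  */
            out++;
            if (next_code < TABLE_MAX) { /* remember "p + c" for next time  */
                prefix[next_code] = p;
                suffix[next_code] = c;
                next_code++;
            }
            p = c;                       /* start the next string at c      */
        }
    }
    printf("emit %d\n", p);
    out++;
    fclose(fp);
    fprintf(stderr, "%ld input bytes -> %ld codes\n", in, out);
    return 0;
}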
Typical Huffman compression ratios hover around 33% {the compressed
file is 66% of the original, whereby 'text' typically compresses a
little better and 'executables' less}. Typical LZW compression ratios
average 55% - the highest compression is achieved with
pixel-information {values of 90% are typical} - followed by 'text'
with 50% and executables around 20%. Although the original 'paper'
on LZW suggested implementation between CPU and peripheral devices
[terminals, disk-drives, mag-tapes] - current usage encompasses
file-compression {Unix COMPRESS, MSDOS ARC, CP/M UnARC} - high-speed
proprietary MODEM-protocols {"LZW in SILICON"} and 'picture
transmission' at 1200 baud.
Rgds,
Bernie