Hash Zip
What is 'Hash Zip' many people ask? It is a compression algorithm developed
by myself with my own ideas as part of the ongoing research I do in the
free software arena when I am not billing hours for my company and other
company related activities.
The basic concept of hzip is to compress and decompress a set of data,
trading CPU time for compression ratio (emphasis on a _lot_ of CPU time)
using a hash and a few other `interesting' bits of data.
Currently, sha1 is the hash algorithm. I intend to make this a command line
flag in the future, but for now it is fixed on sha1. For every 256 bytes of
input, 30 bytes of output result, guaranteed. The big challenge that lies
ahead is to 'verify' this works for all possibilities of bits (2^256
verifications). I acknowledge that on today's CPU, this is not possible.
In the future, however, I expect someone or someone(s) to prove or disprove
that this algorithm is valid.
Update:
Mon Jul 14 07:49:42 CDT 2003
Thanks to everyone for their well thought input regarding number theory
and the reasons why my algorithm doesn't work for all possible input values.
While I conced that it is a true statement, I'm wanting to determine where my
algorithm fails (specifically which input values), so I can make it work for
the general case, and perhaps expand a data structure or two in the specific
cases where it does break down.
Hash Zip release 0.2 (Thu Jul 10 16:41:05 CDT 2003)
Major features of this new release:
`Valid Chars Count' is a (I believe) useful addition to hzip. However, the
implementation is slow causing no speedup to occur (yet) on most decompression
runs. I must detail a bit about how the `interesting information' is used to
understand `Valid Chars Count'. There are three different types of masks that
are used to attempt to limit the sha1 hash calls, and also limit the size
of the possibilities being brute forced. An optimization in 0.1 was to create
a list of `valid chars' in a table, and iterate through it instead of applying
the masks to each possible combination of bits. It is my experience that this
list of 'valid chars' is a superset of the set of u_int8_t values in the
data stream. So now on compression, I 'count' the number of 'valid chars'
that are indeed used to generate the output data set. And on decompression,
I (once again) count the number of valid chars that are in the potential
output data set. If the numbers do not match, a sha1 hash is not computed.
I simply need to tune the counting routine a bit to speed up hzip more.
An example of the 'valid chars count' effect on decompression can be illustrated
by the comparison of decompressing three strings with a trailing carriage return
between release 0.1 and 0.2:
'abcde'
0.1 Total hash calls for block 1: 5746167
85.98s real 41.78s user 0.01s system
0.2 Total hash calls for block 1: 4460925
105.59s real 49.71s user 0.00s system
'aaaaa'
0.1 Total hash calls for block 1: 37777
0.52s real 0.50s user 0.01s system
0.2 Total hash calls for block 1: 4
0.43s real 0.41s user 0.01s system
'aaaaaa'
0.1 Total hash calls for block 1: 377777
22.03s real 10.81s user 0.00s system
0.2 Total hash calls for block 1: 4
18.16s real 9.35s user 0.00s system
Hash Zip release 0.1 (Wed Jul 9 12:45:35 CDT 2003)
WARNING: This is pre-alpha in-flux unstable proof-of-concept software. Do not
expect anything but conveyance of concept. Perhaps a usable program for now,
but I _will_ be breaking binary compatibility with the output next revision.
So don't compress anything you want to keep just yet.
I've made a few cleanups since the last version you saw, but the big
news is that it now takes 25% of the original time to decompress 'Hello
'.
Sure, 46 seconds is a lot, but compared to 183 seconds, this is very good.
Two reasons for the speedup:
I added a 'u_int8_t' value cset to the hash block data. The upper 4 bits
and the lower 4 bits are counters. The upper is the max number of bits set
per char, the lower is the min number of bits set per char.
This 'cset' added information reduced the time from 183 to 66 seconds.
I also sped up 'incr()' quite a bit by computing a buffer of 'valid' characters
based on all the information known (currently from cset, mset, and uset).
Now, incr() simply iterates through this buffer for every character in
the data being computed, rather than applying the logic of the masks to each
character being considered.
This is what reduced the time from 66 to 43 seconds.
I expect the next release (0.2) to come shortly, with the addition of a
header field, to version control the data stream, do some sanity checks,
store a hash of the whole file, and further reduce the hash block size
by fixing the hash algorithm for the whole file, thereby mentioning
it once in the header vs one char for the type of hash algorithm per
hash block.
I also discovered I was using the wrong '#define' so instead of 36 bytes
for a hash I am using 20.
Currently, the 'hash block' weighs in at 30 bytes. I intend to remove
the hash algorithm byte, but add another byte that will help determine
what values should be used to initialize the computation, and which
direction (up or down) incr() should head.
I've also removed the headache of the 'len' byte being 'size = 2 ^ len'.
I didn't want to add a byte to the data stream for the final block, and
also I've decided that 256 bytes is large enough for one block of data,
instead of the former maximum size of 2^16 (64k).
Besides, guaranteed compression to 11% the original size still isn't bad,
eh?
I'm still not checking for collisions (haven't run across any yet, but I
intend to). When I do, this will dramatically slow down the compression
to be equal to the decompression speed. Sorry folks, but if there is
to be guaranteed 1:1 input to output ratio, I can see no way of avoiding
this.
Downloads
|