Fun with Lazy Arrays: the LZ77 Algorithm

Publish date: Jun 18, 2009
Last updated: Jan 2, 2025

Tags:

This is my third post investigating compression techniques related to the DEFLATE algorithm: the first on run-length encoding, and the second on simple Huffman Coding. This post models the LZ77 algorithm, the second of the two compression strategies used by DEFLATE, and in the process explores some interesting properties of Haskell’s basic Arrays.

Implementation:

> module LZ77
>     where
  
we will use GHC's "basic non-strict arrays" for this
experiment:
  
> import Data.Array

and use Ints to store the length of the entire decoded
message (needed to create our array):
  
> type Length = Int
> type Offset = Int
  
in place of the length in the standard length-offset pair
I've decided to use an Int representing the index of the
last element in the span relative to the element at
offset. Thus the pair encoding a span of only one element,
two elements back would be (0,2), rather than the more
traditional (1,2).

This simply makes more sense to me, especially since I want
to be able to encoded reversed sequences, as you will see.

> type Index = Int
  
and a few simple dataypes for our uncompressed and
compressed data respectively:
  
> type Decoded a = Array Index a
>
> data Encoded a = Enc Length [ Either a (Index,Offset) ]
  
The decompress function works by traversing the encoded message,
keeping track of our array index position (since offsets are
relative to the current position), and building an Array lazily
from a list which we generate, in part by referencing elements
from the partially generated array itself.
  
So when we see a Right value we look up in the Array the elements
referenced by the length-offset, concat-ing that list with the
result of processing the rest of the encoded message.
  
If we hit [] in 'dec' we call an error because the stored value
for the length of the uncompressed message in the Encoded type
was longer than what the 'decompress' function could produce.

It's diffficult to describe, but I hope the code is clear:
  
> decompress :: Encoded a -> Decoded a
> decompress (Enc el es) = decoded
>     where decoded = listArray (0,el - 1) $ dec 0 es
>             dec _ [] = error "message is shorter than it should be"
>             dec n (Left x : xs) = x : dec (n+1) xs
>             dec n (Right (iRel,off) : xs) =
>                 let i1 = n - off
>                     i2 = (if iN > i1 then succ else pred) i1
>                     iN = i1 + iRel
>                  in [ decoded!i | i <- [i1, i2 .. iN] ] ++ dec (n + 1 + abs iRel) xs

Some interesting things about the code above:

we create an array from a list, which we build, in part, by looking up elements from the array we are in the process of building.
we can compress a sequence of symbols which were seen previously but in reverse order, simply by storing a negative relative index in the (relative_index,offset) tuple. So, the string… "her racecar returns to race" might compress to: {her race(-5,2)turns to(4,19)}. I’m not sure if this is useful in real compression, especially when it comes down to the binary encoding.
even more interesting, we can use this same decoder function to decompress data that matches a sequence in parts of the array we haven’t built yet! We simply use a negative offset in our tuple.

Examples and conclusion:

Point (3) may or may not be something like the LZ78 algorithm, which apparently works by encoding future data, but it is defintely a cool thing to be able to do with arrays:

> coolArray = listArray (0,4) [0, coolArray!4 - 3, 2, coolArray!1 + 2, 4]

Here’s an example with great compression where the relative index exceeds the offset:

> exceedsOffset = elems $ decompress $
>     Enc 25 [Left 'B', Left 'l', Left 'a', Left 'h', Left ' ',
>             Left 'b', Right (17,5), Left '!']

…and a code example for (2) above:

> reverseReference = elems $ decompress $
>     Enc 27 [Left 'h',Left 'e',Left 'r',Left ' ',Left 'r',
>             Left 'a',Left 'c',Left 'e',Right(-5,2),Left 't',
>             Left 'u',Left 'r',Left 'n',Left 's',Left ' ',
>             Left 't',Left 'o',Right(4,19)]

…and finally an example combining (2) and (3):

> reverseLookAhead = elems $ decompress $
>     Enc 5 [Left 1, Right (-1,-3), Left 3, Left 4]

I was surprised to discover these properties in Haskell’s lazy Arrays. hope they came as a surprise to a few others.