I’m doing Stanford’s free online NLP class, and last week’s lesson introduced Levenshtein Distance and the traditional matrix-based algorithm. I implemented the algorithm in haskell to get a better understanding of it, but then got a bit obsessive about optimizing.

I wanted to develop an approach iteratively so I’d learn something along the way, but I didn’t take great pains to document the process very well, nor do I have any analysis here of low level details.

The best version is a github gist here if you want to play with it and optimize further.

## Playing with the algorithm

First here’s a very naive declarative approach. This describes the nature of the problem beautifully, but is slow as hell since we don’t try to re-use computations:

```
medNaive :: String -> String -> Int
medNaive [] s2 = length s2
medNaive s1 [] = length s1
medNaive s1@(c1:cs1) s2@(c2:cs2) = minimum
[ medNaive cs1 s2 + 1
, medNaive s1 cs2 + 1
, medNaive cs1 cs2 + if c1==c2 then 0 else 2
]
```

And here’s a version that is pretty much the standard matrix algorithm. Because we’re using a boxed array we can actually query the array while we build it:

```
import Data.Array
-- boxed array are what allow for this:
med :: String -> String -> Int
med s1 s2 =
let bnd = (length s1, length s2)
prep = zip [0..] . ('\NUL' :)
-- a list comprehension might fuse here where do notation doesn't
a = array ((0,0),bnd) $ do
(n1,c1) <- prep s1
(n2,c2) <- prep s2
let minEd
| n1==0 = n2
| n2==0 = n1
-- compare w/ min + min:
| otherwise = minimum
-- insert:
[ (a!(n1-1,n2)) + 1
-- delete:
, (a!(n1,n2-1)) + 1
-- substitute:
, (a!(n1-1,n2-1)) + if c1==c2 then 0 else 2
]
return ((n1,n2), minEd)
in a ! bnd
```

Initially, I’ll be benchmarking with criterion
on two random 200-character strings. The test `main`

function looks like:

```
main = defaultMain [
bench "simple array version" $ whnf (uncurry med) (string1, string2)
[
```

For this initial array version, compiled with `-O2`

we get:

```
collecting 100 samples, 1 iterations each, in estimated 6.440210 s
mean: 68.72456 ms, lb 68.40039 ms, ub 69.60141 ms, ci 0.950
std dev: 2.539885 ms, lb 1.197168 ms, ub 5.383915 ms, ci 0.950
```

But we don’t need to keep around the entire matrix, especially if we only want the final edit distance value, so iterating over the first string for each character of the second string should be a more efficient approach.

Here’s a version just using lists. This is the logical flavor of the algorithm we’ll be using going forward:

```
medList :: String -> String -> Int
medList s1 = finalCost . foldl' scanS1 (zip [1..] s1, 0) -- 0 is cost comparing nul -> nul
-- "fills in" what would be a column of our matrix
where scanS1 (v, costSW_i) c2 =
let (_, v') = mapAccumL scanVec (costSW_i, costSW_i + 1) v
-- given accumulator corresponding to costs to the southwest
-- and south, and the current character and cost from the
-- previous step (cost to the west) find min sub-edit:
scanVec (costSW, costS) (costW, c1) =
let cost = min (min costS costW + 1)
(costSW + if c1 == c2 then 0 else 2)
in ((costW, cost), (cost, c1))
in (v' , costSW_i + 1)
finalCost = fst . last . fst
```

This runs a bit slower than the array version above, but is a good starting point for further enhancement:

```
collecting 100 samples, 1 iterations each, in estimated 9.095287 s
mean: 103.2585 ms, lb 102.7181 ms, ub 103.8645 ms, ci 0.950
std dev: 2.926384 ms, lb 2.473240 ms, ub 3.647964 ms, ci 0.950
```

## Vectors

Let’s see if we can translate the algorithm we developed with lists to use the
`Vector`

type from the vectors
package, to see how much benefit we can get from the highly-optimized
stream-fusable
operations in that package.

Here we’ve just replaced one of the lists since there’s no drop-in replacement
for the `mapAccumL`

we’re using:

```
medVec :: String -> String -> Int
medVec s1 = finalCost . V.ifoldl' scanS1 (zip [1..] s1) . V.fromList
where scanS1 v costSW_i c2 =
-- TODO: see if we can turn this into a fold, scanl, or zipWith?
-- or perhaps accum from Vector?
let (_, v') = mapAccumL scanVec (costSW_i, costSW_i + 1) v
scanVec (costSW, costS) (costW, c1) =
let cost = min (min costS costW + 1)
(costSW + if c1 == c2 then 0 else 2)
in ((costW, cost), (cost, c1))
in v'
finalCost = fst . last
```

This runs about as fast as `medList`

above. We need more `Vector`

juice.

With a little thought we can turn the `mapAccumL`

above into a `scan`

and
iterate our scan over a `Vector`

for S1 as well. Using the strict `postscanl'`

function here also yielded a huge speed-up from the non-strict version.

```
medVecFinal :: String -> String -> Int
medVecFinal s1 = V.last . V.ifoldl' scanS1 costs_i . V.fromList
where vs1 = V.fromList s1
costs_i = V.enumFromN 1 $ V.length vs1 -- [0.. length s1]
scanS1 costs_W costSW_i c2 =
let v = V.zip vs1 costs_W
v' = V.postscanl' scanVec (costSW_i, costSW_i + 1) v
scanVec (costSW, costS) (c1, costW) =
(costW, min (min costS costW + 1)
(costSW + if c1 == c2 then 0 else 2))
in snd $ V.unzip v'
```

This is about twice as fast. Sweet!

```
collecting 100 samples, 2 iterations each, in estimated 9.639692 s
mean: 49.64144 ms, lb 49.33944 ms, ub 49.94830 ms, ci 0.950
std dev: 1.561553 ms, lb 1.359085 ms, ub 1.955104 ms, ci 0.950
```

Notice that the code above does no indexing or destructive updates; it’s written entirely as though we were working on lists.

### Unboxed Vectors

By just changing my vector import line to ```
import qualified
Data.Vector.Unboxed.Safe as V
```

I got a huge speed-up on the last
implementation above (15x in fact… spooky stuff).

```
collecting 100 samples, 19 iterations each, in estimated 6.332280 s
mean: 3.379753 ms, lb 3.371200 ms, ub 3.391660 ms, ci 0.950
std dev: 51.03072 us, lb 39.59751 us, ub 76.56945 us, ci 0.950
```

And it appears to run in constant space which is a good thing.

The LLVM backend didn’t seem to improve the benchmarks.

## Summary and TODOs

So our final vector version is about 30x faster than the original list version we used as a prototype, and it still just looks like list-processing code and we haven’t needed to muck around with strictness hints, or INLINE pragma (not that there aren’t improvements to be had there).

I would like to play with a further optimization to the algorithm that drops non-ascending prefixes from the cost vector we iterate over, but ran out of time. If I mess with that, I’ll probably explore the effects of destructive updates at the same time.

Improvements will end up on the gist I linked to above. Feel free to fork and improve!