I’m doing Stanford’s free online NLP class, and last week’s lesson introduced Levenshtein Distance and the traditional matrix-based algorithm. I implemented the algorithm in haskell to get a better understanding of it, but then got a bit obsessive about optimizing.
I wanted to develop an approach iteratively so I’d learn something along the way, but I didn’t take great pains to document the process very well, nor do I have any analysis here of low level details.
The best version is a github gist here if you want to play with it and optimize further.
Playing with the algorithm
First here’s a very naive declarative approach. This describes the nature of the problem beautifully, but is slow as hell since we don’t try to re-use computations:
medNaive :: String -> String -> Int
medNaive [] s2 = length s2
medNaive s1 [] = length s1
medNaive s1@(c1:cs1) s2@(c2:cs2) =
    minimum [ medNaive cs1 s2 + 1
            , medNaive s1 cs2 + 1
            , medNaive cs1 cs2 + if c1==c2 then 0 else 2 ]
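As a quick sanity check of the cost model (substitutions cost 2 here, i.e. a delete plus an insert), we can evaluate the naive version on a couple of tiny inputs; the example strings below are mine, not from the class:

```haskell
-- medNaive repeated from above so this snippet stands alone:
medNaive :: String -> String -> Int
medNaive [] s2 = length s2
medNaive s1 [] = length s1
medNaive s1@(c1:cs1) s2@(c2:cs2) =
    minimum [ medNaive cs1 s2 + 1
            , medNaive s1 cs2 + 1
            , medNaive cs1 cs2 + if c1==c2 then 0 else 2 ]

main :: IO ()
main = do
    -- 2 substitutions (k->s, e->i) at cost 2 each, plus 1 insertion:
    print $ medNaive "kitten" "sitting" -- 5
    -- delete 'f', insert 'n':
    print $ medNaive "flaw" "lawn"      -- 2
```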
And here’s a version that is pretty much the standard matrix algorithm. Because we’re using a boxed array we can actually query the array while we build it:
import Data.Array

-- boxed arrays are what allow for this:
med :: String -> String -> Int
med s1 s2 =
    let bnd = (length s1, length s2)
        prep = zip [0..] . ('\NUL' :)
        -- a list comprehension might fuse here where do notation doesn't
        a = array ((0,0),bnd) $ do
              (n1,c1) <- prep s1
              (n2,c2) <- prep s2
              let minEd | n1==0 = n2
                        | n2==0 = n1
                        -- compare w/ min + min:
                        | otherwise = minimum
                            -- insert:
                            [ (a!(n1-1,n2)) + 1
                            -- delete:
                            , (a!(n1,n2-1)) + 1
                            -- substitute:
                            , (a!(n1-1,n2-1)) + if c1==c2 then 0 else 2 ]
              return ((n1,n2), minEd)
     in a ! bnd
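The trick of indexing into `a` while we're still defining it works because boxed arrays are lazy in their elements. As a smaller illustration of the same technique (my example, not from the post's gist), here's Fibonacci memoized through a self-referential array:

```haskell
import Data.Array

-- A boxed array is lazy in its elements, so cells may refer to other
-- cells of the same array; each cell is computed once, on demand.
fibs :: Int -> Array Int Integer
fibs n = a
  where
    a = listArray (0, n) (map f [0 .. n])
    f 0 = 0
    f 1 = 1
    f i = a ! (i - 1) + a ! (i - 2)

main :: IO ()
main = print (fibs 30 ! 30) -- 832040
```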
Initially, I’ll be benchmarking with criterion on two random 200-character strings. The test main function looks like:
main = defaultMain [ bench "simple array version" $ whnf (uncurry med) (string1, string2) ]
For this initial array version, compiled with -O2, we get:
collecting 100 samples, 1 iterations each, in estimated 6.440210 s
mean: 68.72456 ms, lb 68.40039 ms, ub 69.60141 ms, ci 0.950
std dev: 2.539885 ms, lb 1.197168 ms, ub 5.383915 ms, ci 0.950
But we don’t need to keep around the entire matrix, especially if we only want the final edit distance value, so iterating over the first string for each character of the second string should be a more efficient approach.
Here’s a version just using lists. This is the logical flavor of the algorithm we’ll be using going forward:
import Data.List (foldl', mapAccumL)

medList :: String -> String -> Int
medList s1 = finalCost . foldl' scanS1 (zip [1..] s1, 0) -- 0 is cost comparing nul -> nul
    -- "fills in" what would be a column of our matrix
    where
      scanS1 (v, costSW_i) c2 =
          let (_, v') = mapAccumL scanVec (costSW_i, costSW_i + 1) v
              -- given accumulator corresponding to costs to the southwest
              -- and south, and the current character and cost from the
              -- previous step (cost to the west), find min sub-edit:
              scanVec (costSW, costS) (costW, c1) =
                  let cost = min (min costS costW + 1)
                             (costSW + if c1 == c2 then 0 else 2)
                   in ((costW, cost), (cost, c1))
           in (v', costSW_i + 1)
      finalCost = fst . last . fst
This runs a bit slower than the array version above, but is a good starting point for further enhancement:
collecting 100 samples, 1 iterations each, in estimated 9.095287 s
mean: 103.2585 ms, lb 102.7181 ms, ub 103.8645 ms, ci 0.950
std dev: 2.926384 ms, lb 2.473240 ms, ub 3.647964 ms, ci 0.950
Let’s see if we can translate the algorithm we developed with lists to use the Vector type from the vector package, to see how much benefit we can get from the highly-optimized operations in that package.
Here we’ve just replaced one of the lists, since there’s no drop-in Vector replacement for the mapAccumL we’re using:
import Data.List (mapAccumL)
import qualified Data.Vector as V

medVec :: String -> String -> Int
medVec s1 = finalCost . V.ifoldl' scanS1 (zip [1..] s1) . V.fromList
    where
      scanS1 v costSW_i c2 =
          -- TODO: see if we can turn this into a fold, scanl, or zipWith?
          -- or perhaps accum from Vector?
          let (_, v') = mapAccumL scanVec (costSW_i, costSW_i + 1) v
              scanVec (costSW, costS) (costW, c1) =
                  let cost = min (min costS costW + 1)
                             (costSW + if c1 == c2 then 0 else 2)
                   in ((costW, cost), (cost, c1))
           in v'
      finalCost = fst . last
This runs about as fast as medList above. We need to push more of the work into Vector operations.

With a little thought we can turn the mapAccumL above into a postscanl', and iterate our scan over a Vector for s1 as well. Using the strict postscanl' function here also yielded a huge speed-up over the non-strict version.
medVecFinal :: String -> String -> Int
medVecFinal s1 = V.last . V.ifoldl' scanS1 costs_i . V.fromList
    where
      vs1 = V.fromList s1
      costs_i = V.enumFromN 1 $ V.length vs1 -- [1.. length s1]
      scanS1 costs_W costSW_i c2 =
          let v = V.zip vs1 costs_W
              v' = V.postscanl' scanVec (costSW_i, costSW_i + 1) v
              scanVec (costSW, costS) (c1, costW) =
                  (costW, min (min costS costW + 1)
                              (costSW + if c1 == c2 then 0 else 2))
           in snd $ V.unzip v'
This is about twice as fast. Sweet!
collecting 100 samples, 2 iterations each, in estimated 9.639692 s
mean: 49.64144 ms, lb 49.33944 ms, ub 49.94830 ms, ci 0.950
std dev: 1.561553 ms, lb 1.359085 ms, ub 1.955104 ms, ci 0.950
Notice that the code above does no indexing or destructive updates; it’s written entirely as though we were working on lists.
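To make the mapAccumL-to-postscanl' correspondence concrete, here's a running sum written both ways (a toy example of mine, not code from the gist):

```haskell
import Data.List (mapAccumL)
import qualified Data.Vector as V

-- A running sum with mapAccumL over a list, threading the sum as the
-- accumulator and also emitting it at each step...
runningSumL :: [Int] -> [Int]
runningSumL = snd . mapAccumL (\acc x -> (acc + x, acc + x)) 0

-- ...and the same thing with the strict postscanl' over a Vector,
-- which (as in medVecFinal) drops the seed value from the output.
runningSumV :: V.Vector Int -> V.Vector Int
runningSumV = V.postscanl' (+) 0

main :: IO ()
main = do
    print (runningSumL [1, 2, 3, 4])                         -- [1,3,6,10]
    print (V.toList (runningSumV (V.fromList [1, 2, 3, 4]))) -- [1,3,6,10]
```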
By just changing my vector import line to Data.Vector.Unboxed.Safe as V, I got a huge speed-up on the last implementation above (15x in fact… spooky stuff).
collecting 100 samples, 19 iterations each, in estimated 6.332280 s
mean: 3.379753 ms, lb 3.371200 ms, ub 3.391660 ms, ci 0.950
std dev: 51.03072 us, lb 39.59751 us, ub 76.56945 us, ci 0.950
And it appears to run in constant space, which is a good thing.
The LLVM backend didn’t seem to improve the benchmarks.
Summary and TODOs
So our final vector version is about 30x faster than the original list version we used as a prototype. It still just looks like list-processing code, and we haven’t needed to muck around with strictness hints or INLINE pragmas (not that there aren’t improvements to be had there).
I would like to play with a further optimization to the algorithm that drops non-ascending prefixes from the cost vector we iterate over, but ran out of time. If I mess with that, I’ll probably explore the effects of destructive updates at the same time.
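For the destructive-update experiment, one plausible shape (my own sketch, not code from the gist) is to keep a single mutable unboxed cost row and overwrite it in place with Data.Vector.Unboxed.Mutable:

```haskell
import Control.Monad (foldM_, forM_)
import Control.Monad.ST (runST)
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Unboxed.Mutable as M

-- Same recurrence as above (substitution costs 2), but updating one
-- cost row in place instead of producing a new vector per character.
medMut :: String -> String -> Int
medMut s1 s2 = runST $ do
    let vs1 = V.fromList s1
        n   = V.length vs1
    row <- M.new (n + 1)
    forM_ [0 .. n] $ \i -> M.write row i i  -- base case: distance from ""
    forM_ (zip [1 ..] s2) $ \(j, c2) -> do
        diag0 <- M.read row 0  -- previous row's corner, before overwrite
        M.write row 0 j
        let step (diag, left) i = do
              up <- M.read row i  -- previous row's value at this column
              let c1   = vs1 V.! (i - 1)
                  cost = min (min up left + 1)
                             (diag + if c1 == c2 then 0 else 2)
              M.write row i cost
              return (up, cost)
        foldM_ step (diag0, j) [1 .. n]
    M.read row n

main :: IO ()
main = print (medMut "kitten" "sitting") -- 5
```

This trades the pure scan for explicit ST-monad bookkeeping, so it's worth benchmarking whether the allocation savings actually beat the fused postscanl' version.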
Improvements will end up on the gist I linked to above. Feel free to fork and improve!