Brandon.Si(mmons)

code / art / projects

Initial Tests of Tries: Follow Up

This is to wrap up my previous post about Tries and my attempts at an implementation, and to summarize some of the things I think I learned about implementing a Trie structure in Haskell.

Complexity is Slow

My initial idea of using a container type at each node to store the collection of branches was a good idea for testing (I could easily swap in Data.Map, or my own simple binary tree) and convenience (I could use Data.Map’s built-in functions to insert the element of the list key into the the container); but it turned out to be bad for performance.

One thing I noticed was that Data.Map is not strict in its ‘value’ field, which is where I was storing more Trie types with more Maps to hold more Trie types… this seems to be simply too complicated to be fast.

Simplicity wins over Balanced Trees

Magnus Jonsson replied with a complete re-working of my implementation which uses different constructors in the Trie type to handle both the linear node-to-node movement through the key, and the searching through the set of branches. This gives a good illustration of the concept. His is the winner in terms of performance.

He doesn’t use any techniques to keep the tree balanced but his implementation out-performs my Trie with its cumbersome (but balanced) Maps at each node. I also found that when I swapped in a module MyMap (a simple non-balanced tree) in place of Data.Map that the code was about twice as fast in both tests I performed.

I would like to play with approaches to keeping the Branch trees balanced (or almost-balanced) in a lightweight way. Perhaps a splay tree or an adaptation would make sense .

Radix Trees don’t make Sense

EDIT: Jake McArthur has pointed out that my characterization of traversing data structures in FP and conclusion in this paragraph are probably incorrect.

I chose to use a special “bucket” constructor to hold the tails of keys in their original list form when the tail was unique; e.g. when inserting the key “washing” into a Trie containing only the key “waste”, the “was” would overlap and each element 'w'--> 'a' --> 's' would reside in its own Trie constructor, but both the “te” and “hing” would be stored as intact lists in separate ValBucket constructors, since there is no need to touch them yet.I thought extending this idea and implementing a radix tree might be even more efficient.

But this doesn’t make much sense in FP. Imagine the list type [] was defined in a more verbose way as follows:

1
2
data List a = L a (List a)
            | X

In our Trie structure, a sequence of key elements which have no branches (which would be reduced to a bucket in the radix tree) might look something like this:

1
Key 'w' (Key 'a' (Key 's' (Branches ... )))

…whereas the same situation in an attempted radix tree might look like:

1
 KeyBucket (L 'w' (L 'a' (L 's' X))) (Branches ...)

…which is essentially identical. And since in functional programming when we “traverse” a data structure (e.g. to compare two keys) we are actual destroying it and rebuilding it as we go, it stands to reason that it should be just as easy to rebuild our structure with a constructor of a different name (Key vs. List).

Storing the tail ends of keys when we can makes sense because we avoid destroying the structure until we have to inspect it.

Summary of Tests:

These are not scientific, but I include them to give you an idea of the performance differences I was seeing. Tests were run several times and averaged, and compiled with: ghc --make -prof -auto-all -O2 -funbox-strict- fields.

The first test frequency.hs is described in my last post. The second test dictLookup.hs reads an alphabetized list of words and builds a dictionary Trie from the word to the word’s line number. It then reads a random list of words from a file and prints (word, line_number) tuples.

Tests of frequency.hs:

MagnusTrie Trie (with MyMap) Data.Map Trie (with Data.Map)

ticks:

21

29

39

58

total alloc. (MB):

100

133

101

235

Tests of dictLookup.hs:

MagnusTrie Trie (with MyMap) Data.Map Trie (with Data.Map)

ticks:

17

20

18

46

total alloc. (MB):

99

133

103

215

The whole set of tests and files can be downloaded with:

1
$ darcs get http://coder.bsimmons.name/code/Trie/

Unpack the texts.tar.gz file before running any of the code. Thanks for the patience with the long post. Let me know your thoughts.

UPDATE: I finally was able to test wren ng thornton’s bytestring-trie library with one of my tests. I couldn’t profile the code (GHC complained the libs were missing) but I ran dictLookup.hs with BenchPress and the bytestring-trie modified version looked about twice as fast.

Comments