code / art / projects

Some Initial Tests of Tries

I have been interested in writing a Trie implementation to see how its performance compares to using Data.Map for storing dictionary-like data. I wrote a minimal implementation of Tries which exports insert,insertWith, and lookup definitions and can be used in place of Data.Map for those functions. I tested the implementation using a program frequency.hs which reads a source file and uses insertWith on each word to count the frequency of the words in a file; we then use lookup to print out the frequencies of a list of arbitrary words.

The Implementation:

The Trie uses a Data.Map at each node to provide quick access to each branch; I performed a test with a simple unbalanced binary tree for storing branches, but the performance was slightly worse.

We also store the unused tail of a list-key in a special ValBucket constructor so that we don’t need to store a bunch of singleton Maps. I was curious if laziness would provide the same benefit automatically (as the remainder of the list should be stored as a thunk until another key with the same prefix comes along, forcing evaluation), but my version without the buckets was pretty significantly slower.

The Trie type:
data Trie a v = Node { branches :: M.Map a (Trie a v) }
              | ValNode { branches :: M.Map a (Trie a v),
                          val      :: v }
              | ValBucket { bucket :: [a],
                            val    :: v }
              | Val { val :: v }
              | Empty
              deriving (Show)

Initial Testing Results

It is quite likely that there are obvious performance problems with my Trie implementation and I know that my test script doesn’t provide a very good look at the data structure (it doesn’t provide much of a benchmark for lookup, spends an inordinate amount of time preparing the text for insertion, etc.)

I was disappointed with the results which showed no improvement with Trie when compared with the same test using Data.Map. Here is a quick graph comparing execution times of frequency.hs using Map vs Trie as the size of the input text increases:

(note: tests compiled in GHC using -O1 optimization, and timed with Benchpress)

Based on the improvement I saw using the buckets for the string tails I want to see if extending the idea and turning the module into a radix trie will improve performance.

I’m hoping people will be interested in looking at my code and providing some feedback. You can use:

$ darcs get

If you want to download just the Trie module you can get it here.

UPDATE: Here are the results of a quick test in GHCi comparing the effects of using a Data.Map.Map vs. an Unbalanced Binary Tree vs. simple Lists to store the branches at each node:

Input File Size (kB) Data.Map(ms) Simple Tree(ms) List(ms)
96.7 618 1015 1998
596.5 10225 25255 71236

Data.Map, as expected, is the winner. It’s possible that with very small Tries that either lists or the simple trees would have slightly better performance.

Next on my agenda is to perform the following test: build up a Trie from a reduced english dictionary, then parse a random block of text, printing the definition of each word. I hope the Trie will fare better in this test.

UPDATE 2: I just noticed one property of the Data.Map library that probably makes it less optimal than it could be for this application: the Map type is strict in all its constructors except in the value of the key/value pair (which is where we are storing the recursive Trie paths from a node. I suppose this also means that -funbox-strict-fields has little effect on the structure.
In Trie.hs I replaced the M.insertWith function call (from Data.Map) with the strict insertWith' function and got some improvement in CPU usage.

You can read some of my conclusions in this post.