I have been interested in writing a Trie
implementation to see how its performance compares to using Data.Map for
storing dictionary-like data. I wrote a minimal implementation of Tries which
lookup definitions and can be used in
Data.Map for those functions. I tested the implementation using a
frequency.hs which reads a source file and uses
insertWith on each
word to count the frequency of the words in a file; we then use
print out the frequencies of a list of arbitrary words.
The Trie uses a
Data.Map at each node to provide quick access to each
branch; I performed a test with a simple unbalanced binary tree for storing
branches, but the performance was slightly worse.
We also store the unused tail of a list-key in a special
constructor so that we don’t need to store a bunch of singleton Maps. I was
curious if laziness would provide the same benefit automatically (as the
remainder of the list should be stored as a thunk until another key with the
same prefix comes along, forcing evaluation), but my version without the
buckets was pretty significantly slower.
1 2 3 4 5 6 7 8
Initial Testing Results
It is quite likely that there are obvious performance problems with my Trie
implementation and I know that my test script doesn’t provide a very good look
at the data structure (it doesn’t provide much of a benchmark for
spends an inordinate amount of time preparing the text for insertion, etc.)
I was disappointed with the results which showed no improvement with
when compared with the same test using
Data.Map. Here is a quick graph
comparing execution times of
frequency.hs using Map vs Trie as the size of
the input text increases:
(note: tests compiled in GHC using -O1 optimization, and timed with Benchpress)
Based on the improvement I saw using the buckets for the string tails I want to see if extending the idea and turning the module into a radix trie will improve performance.
I’m hoping people will be interested in looking at my code and providing some feedback. You can use:
If you want to download just the Trie module you can get it here.
UPDATE: Here are the results of a quick test in GHCi comparing the effects of using a Data.Map.Map vs. an Unbalanced Binary Tree vs. simple Lists to store the branches at each node:
|Input File Size (kB)||Data.Map(ms)||Simple Tree(ms)||List(ms)|
Data.Map, as expected, is the winner. It’s possible that with very small Tries that either lists or the simple trees would have slightly better performance.
Next on my agenda is to perform the following test: build up a Trie from a reduced english dictionary, then parse a random block of text, printing the definition of each word. I hope the Trie will fare better in this test.
I just noticed one property of the Data.Map library that probably
makes it less optimal than it could be for this application:
the Map type
is strict in all its constructors except in the value of the key/value pair
(which is where we are storing the recursive Trie paths from a node. I suppose
this also means that -funbox-strict-fields has little effect on the structure.
Trie.hs I replaced the
M.insertWith function call (from Data.Map) with
insertWith' function and got some improvement in CPU usage.
You can read some of my conclusions in this post.