Some initial tests of Tries
I have been interested in writing a Trie
implementation to see how its performance compares to using Data.Map for
storing dictionary-like data. I wrote a minimal implementation of Tries which
exports insert
,insertWith
, and lookup
definitions and can be used in
place of Data.Map
for those functions. I tested the implementation using a
program frequency.hs
which reads a source file and uses insertWith
on each
word to count the frequency of the words in a file; we then use lookup
to
print out the frequencies of a list of arbitrary words.
The Implementation:
The Trie uses a Data.Map
at each node to provide quick access to each
branch; I performed a test with a simple unbalanced binary tree for storing
branches, but the performance was slightly worse.
We also store the unused tail of a list-key in a special ValBucket
constructor so that we don’t need to store a bunch of singleton Maps. I was
curious if laziness would provide the same benefit automatically (as the
remainder of the list should be stored as a thunk until another key with the
same prefix comes along, forcing evaluation), but my version without the
buckets was pretty significantly slower.
data Trie a v = Node { branches :: M.Map a (Trie a v) }
| ValNode { branches :: M.Map a (Trie a v),
val :: v }
| ValBucket { bucket :: [a],
val :: v }
| Val { val :: v }
| Empty
deriving (Show)
Initial Testing Results
It is quite likely that there are obvious performance problems with my Trie
implementation and I know that my test script doesn’t provide a very good look
at the data structure (it doesn’t provide much of a benchmark for lookup
,
spends an inordinate amount of time preparing the text for insertion, etc.)
I was disappointed with the results which showed no improvement with Trie
when compared with the same test using Data.Map
. Here is a quick graph
comparing execution times of frequency.hs
using Map vs Trie as the size of
the input text increases:
(note: tests compiled in GHC using -O1 optimization, and timed with Benchpress)
Based on the improvement I saw using the buckets for the string tails I want to see if extending the idea and turning the module into a radix trie will improve performance.
I’m hoping people will be interested in looking at my code and providing some feedback. You can use:
$ git clone https://github.com/jberryman/Trie/
If you want to download just the Trie module you can get it here.
UPDATE: Here are the results of a quick test in GHCi comparing the effects of using a Data.Map.Map vs. an Unbalanced Binary Tree vs. simple Lists to store the branches at each node:
Data.Map, as expected, is the winner. It’s possible that with very small Tries that either lists or the simple trees would have slightly better performance.
Next on my agenda is to perform the following test: build up a Trie from a reduced english dictionary, then parse a random block of text, printing the definition of each word. I hope the Trie will fare better in this test.
UPDATE 2:
I just noticed one property of the Data.Map library that probably
makes it less optimal than it could be for this application:
the Map type
is strict in all its constructors except in the value of the key/value pair
(which is where we are storing the recursive Trie paths from a node. I suppose
this also means that -funbox-strict-fields has little effect on the structure.
In Trie.hs
I replaced the M.insertWith
function call (from Data.Map) with
the strict insertWith'
function and got some improvement in CPU usage.
UPDATE 3:
You can read some of my conclusions
in this post.