code / art / projects

Announcing Unagi-chan

Today I released version 0.2 of unagi-chan, a haskell library implementing fast and scalable FIFO queues with a nice and familiar API. It is available on hackage and you can install it with:

$ cabal install unagi-chan

This version provides a bounded queue variant (and closes issue #1!) that has performance on par with the other variants in the library. This is something I’m somewhat proud of, considering that the standard TBQueue is not only significantly slower than e.g. TQueue, but also was seen to livelock at a fairly low level of concurrency (and so is not included in the benchmark suite).

Here are some example benchmarks. Please do try the new bounded version and see how it works for you.


What follows are a few random thoughts more or less generally-applicable to the design of bounded FIFO queues, especially in a high-level garbage-collected language. These might be obvious, uninteresting, or unintelligible.

What is Bounding For?

I hadn’t really thought much about this before: a bounded queue limits memory consumption because the queue is restricted from growing beyond some size.

But this isn’t quite right. If for instance we implement a bounded queue by pre-allocating an array of size bounds then a write operation need not consume any additional memory; indeed the value to be written has already been allocated on the heap before the write even begins, and will persist whether the write blocks or returns immediately.

Instead, constraining memory usage is a knock-on effect of what we really care about: backpressure. When the ratio of “producers” to their writes is high (the usual scenario), blocking a write may limit memory usage by delaying the heap allocations associated with elements of future writes.

So bounded queues with blocking writes let us:

  • when threads are “oversubscribed”, transparently indicate to the runtime which work has priority
  • limit future resource usage (CPU time and memory) by producer threads

We might also like our bounded queue to support a non-blocking write which returns immediately with success or failure. This might be thought of (depending on the capabilities of your language’s runtime) as more general than a blocking write, but it also supports a distinctly different notion of bounding, that is bounding message latency: a producer may choose to drop messages when a consumer falls behind, in exchange for lower latency for future writes.
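As a sketch of that second, non-blocking notion (my illustration here, not unagi-chan's own API): on top of stm's TBQueue, a write that returns immediately with success or failure can be built from isFullTBQueue:

```haskell
import Control.Concurrent.STM

-- A non-blocking bounded write: returns False immediately when the
-- queue is full, rather than blocking the way writeTBQueue does.
tryWrite :: TBQueue a -> a -> STM Bool
tryWrite q x = do
    full <- isFullTBQueue q
    if full
        then return False
        else writeTBQueue q x >> return True
```

A producer can then drop the message (perhaps counting the drop) when tryWrite returns False, trading completeness for bounded latency.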

Unagi.Bounded Implementation Ideas

Trying to unpack the ideas above helped in a few ways when designing Unagi.Bounded. Here are a few observations I made.

We need not block before “writing”

When implementing blocking writes, my intuition was to (when the queue is “full”) have writers block before “making the message available” (whatever that means for your implementation). For Unagi that means blocking on an MVar, and then writing a message to an assigned array index.

But this ordering presents a couple of problems: first, we need to be able to handle async exceptions raised while a writer is blocked; if its message isn’t yet “in place”, then we need to somehow coordinate with the reader that would have received that message, telling it to retry.

By unpacking the purpose of bounding it became clear that we’re free to block at any point during the write (because the write per se does not have the memory-usage implications we originally naively assumed it had), so in Unagi.Bounded writes proceed exactly like in our other variants, until the end of the writeChan, at which point we decide when to block.

This is certainly also better for performance: if a wave of readers comes along, they need not wait (themselves blocking) for previously blocked writers to make their messages available.

One hairy detail from this approach: an async exception raised in a blocked writer does not cause that write to be aborted; i.e. once entered, writeChan always succeeds. Reasoning in terms of linearizability this only affects situations in which a writer thread is known-blocked and we would like to abort that write.

Fine-grained writer unblocking is probably unnecessary and harmful

In Unagi.Bounded I relax the bounds constraint to “somewhere between bounds and bounds*2”. This allows me to eliminate a lot of coordination between readers and writers by using a single reader to unblock up to bounds number of writers. This constraint (along with the constraint that bounds be a power of two, for fast modulo) seemed like something everyone could live with.
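The power-of-two constraint is what makes the modulo cheap: when the size is a power of two, reducing a monotonically increasing counter to an array index is a single bitwise AND rather than a division. A minimal sketch (my own illustration, not unagi-chan internals):

```haskell
import Data.Bits ((.&.))

-- Map a counter onto an index in [0, size).
-- Only valid when `size` is a power of two.
ringIndex :: Int -> Int -> Int
ringIndex counter size = counter .&. (size - 1)
```

For example, ringIndex 13 8 gives 5, the same as 13 `mod` 8, but without a div instruction.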

I also guess that this “cohort unblocking” behavior could result in some nicer stride behavior, with more consecutive non-blocking reads and writes, rather than having a situation where the queue is almost always either completely full or empty.

One-shot MVars and Semaphores

This has nothing to do with queues, but just a place to put this observation: garbage-collected languages permit some interesting non-traditional concurrency patterns. For instance I use MVars and IORefs that only ever go from empty to full, or follow a single linear progression of three or four states in their lifetime. Often it’s easier to design algorithms this way, rather than by using long-lived mutable variables (for instance I struggled to come up with a blocking bounded queue design that used a circular buffer which could be made async-exception-safe).
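For a concrete (if toy) picture of the one-shot pattern, here is an MVar that only ever goes from empty to full, with consumers using readMVar so it is never emptied again; since no thread ever takes the value, there is no “put it back” cleanup path for an async exception to interrupt:

```haskell
import Control.Concurrent
import Control.Concurrent.MVar

-- A one-shot MVar: created empty, filled exactly once, and only ever read.
oneShot :: IO Int
oneShot = do
    v <- newEmptyMVar
    _ <- forkIO $ putMVar v 42   -- the single state transition: empty -> full
    readMVar v                   -- readers never take, so `v` stays full

main :: IO ()
main = oneShot >>= print
```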

Similarly the CAS operation (which I get exported from atomic-primops) turns out to be surprisingly versatile far beyond the traditional read/CAS/retry loop, and to have very useful semantics when used on short-lived variables. For instance throughout unagi-chan I do both of the following:

  • CAS without inspecting the return value, content that we or any other competing thread succeeded.

  • CAS using a known initial state, avoiding an initial read.
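To make those two patterns concrete, here is a sketch with a CAS emulated via atomicModifyIORef'; the real casIORef from atomic-primops works on tickets rather than an Eq check, but the usage patterns carry over:

```haskell
import Data.IORef

-- An emulated compare-and-swap: succeed only if the current value
-- equals `expected`. (atomic-primops provides a real hardware CAS.)
cas :: Eq a => IORef a -> a -> a -> IO Bool
cas ref expected new = atomicModifyIORef' ref $ \cur ->
    if cur == expected then (new, True) else (cur, False)

main :: IO ()
main = do
    -- Pattern 2: the variable was just created, so its initial state is
    -- known statically; no initial read is needed before the CAS.
    flag <- newIORef False
    -- Pattern 1: ignore the result; if we lost the race, some other
    -- thread set the flag to True, which is just as good.
    _ <- cas flag False True
    readIORef flag >>= print
```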

Thoughts on FIFO-ness

I’ve been doing a lot of experimenting with concurrent operations in haskell and in particular playing with and thinking about the design of concurrent FIFO queues. These structures are difficult to make both efficient and correct, due to the effects of contention on the parts of the structure tasked with coordinating reads and writes from multiple threads.

These are my thoughts so far on FIFO semantics.

FIFO? And how!

In the interesting paper “How FIFO is your concurrent FIFO queue?” (PDF), A. Haas et al. propose that an ideal FIFO queue has operations that are instantaneous (think of each write having an infinitely accurate timestamp, and each read taking the corresponding element in timestamp order). They then measure the degree to which real queues of various designs deviate from this platonic FIFO semantics in their message ordering, using a metric they call “element-fairness”. They experimentally measure the element-fairness of both so-called “strict FIFO” designs and “relaxed FIFO” designs, in which elements are read in more or less the order they were written (some providing guarantees about the degree of re-ordering, others not).

The first interesting observation they make is that no queue actually exhibits FIFO semantics by their metric; this is because of the realities of the way atomic memory operations like CAS may arbitrarily reorder a set of contentious writes.

The second interesting result is that the efficient-but-relaxed-FIFO queues which avoid contention by making fewer guarantees about message ordering often perform closer to ideal FIFO semantics (by their metric) than the “strict” but slower queues!

Observable FIFO Semantics

As an outsider, reading papers on FIFO queue designs I get the impression that what authors mean by “the usual FIFO semantics” is often ill-defined. Clearly they don’t mean the platonic zero-time semantics of the “How FIFO… ” paper, since they can’t be called FIFO by that measure.

I suspect what makes a queue “strict FIFO” (by the paper’s categorization) might simply be

If write x returns at time T, then x will be read before the elements of any writes that have not yet started by time T.

The idea is difficult to express, but is essentially that FIFO semantics is only observable by way of actions taken by a thread after returning from a write (think: thread A writes x, then tells B which writes y, where our program’s correctness depends on the queue returning y after x). Note that since a queue starts empty this is also sufficient to ensure writes don’t “jump ahead” of writes already in the queue.

Imagine an absurd queue whose write never returns; there’s very little one can say for certain about the “correct” FIFO ordering of writes in that case, especially when designing a program with a preempting scheduler that’s meant to be portable. Indeed the correctness criterion above is actually probably a lot stricter than many programs require; e.g. when there is no coordination between writers, an observably-FIFO queue need only ensure that no reader thread sees two messages from the same writer thread out of order (I think).

The platonic zero-time FIFO ordering criterion used in the paper is quite different from this observable, correctness-preserving FIFO criterion; I can imagine it being useful for people designing “realtime” software.

Update 04/15/2014:

What I’m trying to describe here is called linearizability, and is indeed a well-understood and common way of thinking about the semantics of concurrent data structures; somehow I missed or misunderstood the concept!


At a certain level of abstraction, correct observable FIFO semantics shouldn’t be hard to make efficient; after all, the moments during which we have contention (and horrible performance) are also the moments during which we don’t care about (or have no way of observing) correct ordering. In other words (although we have to be careful of the details), a thread-coordination scheme that breaks down (w/r/t element-fairness) under contention isn’t necessarily a problem. Compare-and-swap does just that; unfortunately, it breaks down in a way that is slower rather than faster.

Announcing Shapely-data v0.1

This is the first real release of shapely-data, a haskell library up on hackage for working with algebraic datatypes in a simple generic form made up of haskell’s primitive product, sum, and unit types: (,), Either, and ().

You can install it with

cabal install shapely-data

Motivation and examples

In order from most to least important to me, here are the concerns that motivated the library:

  • Provide a good story for (,)/Either as a lingua franca generic representation that other library writers can use without dependencies, encouraging abstractions in terms of products and sums (motivated specifically by my work on simple-actors).

  • Support algebraic operations on ADTs, making types composable

      -- multiplication:
      let a = (X,(X,(X,())))
          b = Left (Y,(Y,())) :: Either (Y,(Y,())) (Z,())
          ab = a >*< b
       in ab == ( Left (X,(X,(X,(Y,(Y,()))))) 
                  :: Either (X,(X,(X,(Y,(Y,()))))) (X,(X,(X,(Z,())))) )
      -- exponents, etc:
      fanout (head,(tail,(Prelude.length,()))) [1..3]    == (1,([2,3],(3,())))
      (unfanin (_4 `ary` (shiftl . Sh.reverse)) 1 2 3 4) == (3,(2,(1,(4,()))))
  • Support powerful, typed conversions between Shapely types

      data F1 = F1 (Maybe F1) (Maybe [Int]) deriving Eq
      data F2 = F2 (Maybe F2) (Maybe [Int]) deriving Eq
      f2 :: F2
      f2 = coerce (F1 Nothing $ Just [1..3])
      data Tsil a = Snoc (Tsil a) a | Lin deriving Eq
      truth = massage "123" == Snoc (Snoc (Snoc Lin '3') '2') '1'

Lowest on the list is supporting abstracting over different recursion schemes or supporting generic traversals and folds, though some basic support is planned.

Finally, in at least some cases this can completely replace GHC.Generics and may be a bit simpler. See examples/Generics.hs for an example of the GHC.Generics wiki example ported to shapely-data. And for a nice view on the changes that were required, do:

git show 3a65e95 | perl /usr/share/doc/git/contrib/diff-highlight/diff-highlight

Why not GHC.Generics?

The GHC.Generics representation has a lot of metadata and a complex structure that can be useful in deriving default instances; more important to us is to have a simple, canonical representation such that two types that differ only in constructor names can be expected to have identical generic representations.

This supports APIs that are type-agnostic (e.g. a database library that returns a generic Product, convertible later with to), and allows us to define algebraic operations and composition & conversion functions.

Writing a Streaming Twitter Waterflow Solution

In this post Philip Nilsson describes an inspiring, principled approach to solving a toy problem posed in a programming interview. I wanted to implement a solution to a variant of the problem where we’d like to process a stream. It was pretty easy to sketch a solution out on paper, but Philip’s solution was invaluable in testing and debugging my implementation. (See also Chris Done’s mind-melting loeb approach.)

My goal was to have a function:

waterStream :: [Int] -> [Int]

that would take a possibly-infinite list of columns and return a stream of known water quantities, where volumes of water were output as soon as possible. We can get a solution to the original problem, then, with

ourWaterFlow = sum . waterStream

Here is the solution I came up with, with inline explanation:

{-# LANGUAGE BangPatterns #-}

-- start processing `str`, initializing the highest column to the left at 0,
-- and an empty stack.
waterStream :: [Int] -> [Int]
waterStream str = processWithMax 0 str []

processWithMax :: Int -> [Int] -> [(Int,Int)] -> [Int]
processWithMax prevMax = process
  where
    process []     = const []
    -- output the quantity of water we know we can get, given the column at the
    -- head of the stream, `y`:
    process (y:ys) = eat 1
      where
        eat !n xxs@((offset,x):xs)
           -- done with `y`, push it and its offset onto the stack
           | y < x = process ys ((n,y):xxs)
           -- at each "rise" we can output some known quantity of water;
           -- storing the "offset" as we did above lets us calculate water
           -- above a previously filled "valley"
           | otherwise = let col = offset*(min y prevMax - x)
                             cols = eat (n+offset) xs
                          -- filter out zeros:
                          in if col == 0 then cols else col : cols
        -- if we got to the end of the stack, then `y` is the new highest
        -- column we've seen.
        eat !n [] = processWithMax y ys [(n,y)]

The bit about “offsets” is the tricky part which I don’t know how to explain without a pretty animation.


It took me much longer than expected to get the solution above working on a few hand-drawn test cases, and at that point I didn’t have high confidence that the code was correct, so I turned to quickcheck and assert.

First I wanted to make sure the invariant that the “column” values in the stack were strictly increasing held:

 import Control.Exception (assert)

   --process (y:ys) = eat 1
     process (y:ys) stack = assert (stackSane stack) $ eat 1 stack

Then I used Philip’s solution (which I had confidence in):

waterFlow :: [Int] -> Int
waterFlow h = sum $
  zipWith (-)
    (zipWith min (scanl1 max h) (scanr1 max h))
    h

to test my implementation:

*Waterflow> import Test.QuickCheck
*Waterflow Test.QuickCheck> quickCheck (\l -> waterFlow l == ourWaterFlow l)
*** Failed! Falsifiable (after 21 tests and 28 shrinks):     

Oops! It turned out I had a bug in this line (fixed above):

                           --old buggy:
                           --cols = eat (n+1) xs
                           --new fixed:
                             cols = eat (n+offset) xs
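As an additional hand-checkable example (mine, not from the original post): on the classic height profile below, the reference solution reports 10 units of water, which matches summing the pools by eye:

```haskell
-- Philip's reference solution: water above each column is
-- min(highest-to-left, highest-to-right) minus the column's height.
waterFlow :: [Int] -> Int
waterFlow h = sum $
  zipWith (-)
    (zipWith min (scanl1 max h) (scanr1 max h))
    h

main :: IO ()
main = print (waterFlow [2,5,1,2,3,4,7,7,6])  -- prints 10
```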


The solution seems to perform pretty well, processing 1,000,000 Ints in 30ms on my machine:

import Control.Monad (replicateM)
import Criterion.Main
import System.Random.MWC (create, uniformR)

main = do
    gen <- create
    rs <- replicateM 1000000 $ uniformR (0,100::Int) gen
    defaultMain [ bench "ourWaterFlow" $ whnf ourWaterFlow rs ]

I didn’t get a good look at space usage over time, as I was testing with mwc-random which doesn’t seem to support creating a lazy infinite list of randoms and didn’t want to hunt down another library. Obviously on a stream that simply descends forever, our stack of (Int,Int) will grow to infinite size.

It seems as though there is a decent amount of parallelism that could be exploited in this problem, but I didn’t have any luck on a quick attempt.


Have a parallel solution, or something just faster? Or an implementation that doesn’t need a big stack of previous values?

A TypeFamilies Primer

TypeFamilies is a GHC extension that lets you create formerly-impossible abstractions in a very straightforward way. It took me several tries before they clicked for me though, so this is the introduction to TypeFamilies that I wish I had read first (although I just found Brent Yorgey’s, which would have done the trick).

I’m treating the subject very narrowly for most of this post, and try to round things out a little at the very end.

Primal Ignorance

If this isn’t the first thing you’ve read about TypeFamilies, it might be helpful to forget a few things. The question “what precisely is a type family?” isn’t going to be very helpful; in general, the terminology for the constellation of constructions that TypeFamilies gives you is a huge mess, with multiple partially-overlapping terms in the wild, none of which are helpful for developing an intuition about what all of this is about.

I also found various analogies I’d read to be useless, so forget those too.

Type synonyms as type-level functions

Consider the familiar type synonym:

type PairOf a = (a,a)

Normally this is presented as syntactic sugar, with little to do with the type system.

A more interesting way of thinking about PairOf is as a function (as suggested by the =), where evaluation involves substituting occurrences of the left hand side (LHS) with the right, in the usual way. These functions are evaluated in your type signatures at compile time.

The analogous regular term-level function would of course be:

pairOf a = (a,a)

Simple enough. Now let’s think about a simple term-level function, and see what an analogous type-level type synonym/function might look like:

last (a: [])     = a
last (a: (b:bs)) = last (b:bs)

For our type-level Last we need something like lists at the type-level, so we’ll use the common nested tuple representation of (,) as cons and () as the empty list, e.g.:

x :: (Int,(Int,(Int,())))  -- like [1,2,3]

Hopefully I didn’t just lose you. Remember for now we just care about using this list-like tuple thing in our type signatures.

If you were to charge ahead and try to define Last using type synonyms treated as full blown functions, you might come up with:

-- this isn't okay:
type Last (a, ())     = a
type Last (a, (b,bs)) = Last (b,bs)

Unfortunately the compiler will laugh at you. Type synonyms can only have abstract variable arguments on the LHS, whereas above we have tried to deconstruct them with pattern matching and to define a different RHS for each case. Further, we’ve made the definition recursive. None of that is okay.

In fact the humble type synonym is only a very simple sort of function (a natural transformation or something close) which is very easy to evaluate, but also very limited.

Finally, Enter TypeFamilies

The TypeFamilies extension lets us define Last successfully almost exactly as we did above.

{-# LANGUAGE TypeFamilies #-}

-- we have to "declare" `Last` separately, and the "family"
-- here distinguishes the syntax from a normal type synonym:
type family   Last l  

-- ...and then can define our "cases":
type instance Last (a,())     = a
type instance Last (a,(b,bs)) = Last (b,bs)

At this point when the type-checker sees Last (a,(b,bs)) in a type signature it will replace it with Last (b,bs), and continue until all of these “type functions” are evaluated. I may be fudging things a bit but that’s the general idea.

Since these are a more general sort of type function, they can even be used to replace traditional type synonyms:

type family   PairOf a
type instance PairOf a = (a,a)

But what is that good for?

It would be neat to be able to work with “lists” that look like e.g. (1,('a',("hello",()))); they are heterogeneous, operations like head would be type-safe, etc. So imagine we want to define a last on types of this list-like tuple sort of data.

What would the type of last look like? We know it has to be polymorphic, since its arguments might look like () or (1,(2,())), different types of course. So we’ll need a type-class (and a couple other standard extensions):

{-# LANGUAGE FlexibleInstances, FlexibleContexts #-}
class LastOfTupleList l where
    last' :: l -> Last l      -- < we use our `Last` type function

Our instances are trivial:

instance LastOfTupleList (a,()) where
    last' (a,()) = a  -- < the type-checker can see that, indeed, `a`
                      --   is equal to `Last (a,())`

instance (LastOfTupleList (b, bs))=> LastOfTupleList (a,(b,bs)) where
    last' (a,(b,bs)) = last' (b,bs)

Letting us do:

>>> last' (1,(2,()))
2
>>> last' (1,('a',("hello",())))
"hello"

Notice how our instances of Last and last' have almost the same structure; this is very common.

Functions… but open!

If you’re a programmer type you were probably irritated by my initial clumsy definition of last; why not:

last (a:[]) = a
last (a:as) = last as -- `as` matches whatever can fall through the pattern
                      -- above; in this case only non-empty lists

Well, I was being sneaky because the type-level analogue isn’t allowed!

type instance Last (a,()) = a
type instance Last (a,as) = Last as -- BAD!

This is because, unlike functions, type families are open, meaning that, like typeclasses, a new instance can be added at any moment. Therefore there’s no way to define a “default” instance to use after all other matches fail; you simply get illegal overlapping synonym instances such as the two above, and the order in which we defined them doesn’t matter.

For some use cases this openness is exactly what we need; for others (such as our last') we’d really prefer that type synonym families be closed, so that we can pattern match in the usual way. This feature is apparently coming soon.
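For reference, here is what the closed flavor looks like as it later shipped in GHC 7.8: the equations are declared together under a where and matched top to bottom, so the fall-through case becomes legal:

```haskell
{-# LANGUAGE TypeFamilies #-}

-- A closed type family: equations are tried in order, so the
-- catch-all second case is fine here.
type family Last l where
    Last (a,()) = a
    Last (a,as) = Last as

-- The type-checker reduces Last (Int,(Char,())) to Char:
lastIsChar :: Last (Int, (Char, ()))
lastIsChar = 'x'

main :: IO ()
main = print lastIsChar
```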

Other details

That should give you an intuition. At this point you might want to stop and read through the GHC documentation on type families, or continue reading below before coming back to it for a more refined understanding and additional details.

Associated type synonyms

Since we so often define a type class in terms of one or more type families, we’re given a simplified syntax for combining them in one place.

class LastOfTupleList l  where
    type Last l        -- an *associated* type family
    last' :: l -> Last l

instance LastOfTupleList (a,()) where
    type Last (a,()) = a
    last' (a,()) = a

When people say “associated types” they mean type functions that are associated with a typeclass using the syntax above.


Type synonym families are not injective, meaning two different instances can map to the same type on the RHS, e.g.

type instance F Int  = Bool
type instance F Char = Bool

It’s easy to forget this when building new abstractions, and assume that the typechecker will infer from the RHS (e.g. Bool above) the argument passed in to the type function (Int or Char).

Data families

I’ve completely focused on the type synonym flavor of TypeFamilies above, but there is also a data/newtype flavor in which, for each instance definition, the RHS is a brand new type declaration, rather than a mapping to an existing type:

data family T a
data    instance T Int  = T1 Int | T2 Bool -- new constructors T1 and T2 defined here
newtype instance T Char = TC Bool

Because each instance maps to a unique type, data families are injective, allowing the type checker to infer the LHS of an equation from the RHS.

More other things

  • TypeFamilies provides the syntax a ~ b to indicate type equality constraints; this is especially useful with type synonym functions, but can be useful on its own as well.
  • Kind signatures are required for type-function arguments that are themselves type constructors, e.g. Maybe.
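Two tiny examples of those last points (the names here are mine, for illustration):

```haskell
{-# LANGUAGE TypeFamilies #-}

-- An equality constraint: `a` must be Int for this signature to hold.
double :: (a ~ Int) => a -> a
double = (* 2)

-- A kind signature: `f` takes a type argument, so its kind must be given.
type family Unwrap (f :: * -> *) a
type instance Unwrap Maybe a = a

-- Unwrap Maybe a reduces to a in this signature:
headOr :: a -> Maybe a -> Unwrap Maybe a
headOr def = maybe def id

main :: IO ()
main = do
    print (double 21)
    print (headOr 0 (Just 5 :: Maybe Int))
```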