https://github.com/jberryman/almost-inline-asm-haskell-example

This is especially useful if you want to return multiple values from a foreign procedure, where otherwise with the traditional FFI approach you would have to do some allocation and stuff the values into a struct or something. I find the above more understandable in any case.

Here’s an example of the dumped ASM from the `Main`

in the example above:

```
...
call newCAF
addq $8,%rsp
testq %rax,%rax
je _c73k
_c73j:
movq $stg_bh_upd_frame_info,-16(%rbp)
movq %rax,-8(%rbp)
movq $block_info,-24(%rbp)
movl $4,%edi
movl $3,%esi
movl $2,%r14d
movl $1,%ebx
addq $-24,%rbp
jmp sipRound_s_x3
_c73z:
movq $104,904(%r13)
movq $block_info,-32(%rbp)
movq %r14,-24(%rbp)
movq %rsi,-16(%rbp)
movq %rdi,-8(%rbp)
movq %rbx,(%rbp)
addq $-32,%rbp
...
```

You can see we just prepare argument registers, do whatever with the stack pointer, do a jump, and then push the return values onto the stack. For my purposes this was almost too much overhead to make this worthwhile (you can look at notes in the code).

I thought about sketching out a ghc proposal about a way to formalize this, maybe make it safer, and maybe somehow more efficient but I don’t have the time right now and don’t really have the expertise to know if this is even a good idea or how it could work.

]]>```
$ cabal install unagi-bloomfilter
```

The library uses the *bloom-1* variant from “Fast Bloom Filters and Their
Generalization” by Yan Qiao, et al. I’ll try to write more about it when I have
the time. Also I just gave a talk on things I learned working on the project
last night at the New York Haskell User Group:

```
http://www.meetup.com/NY-Haskell/events/233372271/
```

It was quite rough, but I was happy to hear from folks that found some interesting things to take away from it.

Thanks to Gershom for inviting me to speak, for my company Signal Vine for sponsoring my trip out, and to Yan Qiao for generously answering my silly questions and helping me understand the paper.

Signal Vine is an awesome group of people, with interesting technology and problems to solve, and we’re looking to grow the small development team. If you have some experience with haskell (you don’t have to be a guru) and are interested, please reach out to Jason or me at:

```
brandon@signalvine.com
jason@signalvine.com
```

]]>```
cabal install hashabler
```

(see my initial announcement post which has some motivation and pretty pictures)

You can see the CHANGELOG but the main change is an implementation of SipHash. It’s about as fast as our implementation of FNV-1a for bytestrings of length fifty and slightly faster when you get to length 1000 or so, so you should use it unless you’re wanting a hash with a simple implementation.

If you’re implementing a new hashing algorithm or hash-based data structure, please consider using hashabler instead of hashable.

]]>`hashabler`

and wanted to share
a nice way I found to translate stateful bit-twiddling code in C (which makes
heavy use of bitwise assignment operators) to haskell.
I was working from the
reference implementation.
As you can see statefulness and mutability are an implicit part of how the
algorithm is defined, as it modifies the states of the `v`

variables.

```
#define SIPROUND \
do { \
v0 += v1; v1=ROTL(v1,13); v1 ^= v0; v0=ROTL(v0,32); \
v2 += v3; v3=ROTL(v3,16); v3 ^= v2; \
v0 += v3; v3=ROTL(v3,21); v3 ^= v0; \
v2 += v1; v1=ROTL(v1,17); v1 ^= v2; v2=ROTL(v2,32); \
} while(0)
int siphash( uint8_t *out, const uint8_t *in, uint64_t inlen, const uint8_t *k )
{
/* ... */
for ( ; in != end; in += 8 )
{
m = U8TO64_LE( in );
v3 ^= m;
TRACE;
for( i=0; i<cROUNDS; ++i ) SIPROUND;
v0 ^= m;
}
```

I wanted to translate this sort of code as directly as possible (I’d already decided if it didn’t work on the first try I would burn my laptop and live in the woods, rather than debug this crap).

First we’ll use name shadowing to “fake” our mutable variables, making it easy to ensure we’re always dealing with the freshest values.

```
{-# OPTIONS_GHC -fno-warn-name-shadowing #-}
```

We’ll also use `RecordWildCards`

to make it easy to capture the “current state”
of these values, through folds and helper functions.

```
{-# LANGUAGE RecordWildCards #-}
```

And finally we use the trivial `Identity`

monad
(this trick I learned from Oleg)
which gets us the proper scoping we want for our `v`

values:

```
import Data.Functor.Identity
```

Here’s a bit of the haskell:

```
siphash :: Hashable a => SipKey -> a -> Word64
siphash (k0,k1) = \a-> runIdentity $ do
let v0 = 0x736f6d6570736575
v1 = 0x646f72616e646f6d
v2 = 0x6c7967656e657261
v3 = 0x7465646279746573
...
v3 <- return $ v3 `xor` k1;
v2 <- return $ v2 `xor` k0;
v1 <- return $ v1 `xor` k1;
v0 <- return $ v0 `xor` k0;
...
-- Initialize rest of SipState:
let mPart = 0
bytesRemaining = 8
inlen = 0
SipState{ .. } <- return $ hash (SipState { .. }) a
let !b = inlen `unsafeShiftL` 56
v3 <- return $ v3 `xor` b
-- for( i=0; i<cROUNDS; ++i ) SIPROUND;
(v0,v1,v2,v3) <- return $ sipRound v0 v1 v2 v3
(v0,v1,v2,v3) <- return $ sipRound v0 v1 v2 v3
v0 <- return $ v0 `xor` b
...
(v0,v1,v2,v3) <- return $ sipRound v0 v1 v2 v3
return $! v0 `xor` v1 `xor` v2 `xor` v3
```

If you were really doing a lot of this sort of thing, you could even make a simple quasiquoter that could translate bitwise assignment into code like the above.

]]>```
cabal install hashabler
```

`hashabler`

is a rewrite of the hashable
library by Milan Straka and Johan Tibell, having the following goals:

Extensibility; it should be easy to implement a new hashing algorithm on any Hashable type, for instance if one needed more hash bits

Honest hashing of values, and principled hashing of algebraic data types (see e.g. #30)

Cross-platform consistent hash values, with a versioning guarantee. Where possible we ensure morally identical data hashes to indentical values regardless of processor word size and endianness.

Make implementing identical hash routines in other languages as painless as possible. We provide an implementation of a simple hashing algorithm (FNV-1a) and make an effort define Hashable instances in a way that is well-documented and sensible, so that e.g. one can (hopefully) easily implement string hashing routine in JavaScript that will match the way we hash strings here.

I started writing a fast concurrent bloom filter variant, but found none of the
existing libraries fit my needs. In particular `hashable`

was deficient in a
number of ways:

The number of hash bits my data structure requires can vary based on user parameters, and possibly be more than the 64-bits supported by hashable

Users might like to serialize their bloomfilter and store it, pass it to other machines, or work with it in a different language, so we need

- hash values that are consistent across platforms
- some guarantee of consistency across library versions

I was also very concerned about the general approach taken for algebraic types,
which results in collision, the use of “hashing” numeric values to themselves,
dubious combining functions, etc. It wasn’t at all clear to me how to ensure my
data structure wouldn’t be broken if I used `hashable`

. See below for a very
brief investigation into hash goodness of the two libraries.

There isn’t interest in supporting my use case or addressing these issues in
`hashable`

(see e.g. #73, #30, and #74)
and apparently hashable is working in practice for people, but maybe this new
package will be useful for some other folks.

Hashing-based data structures assume some “goodness” of the underlying hash function, and may depend on the goodness of the hash function in ways that aren’t always clear or well-understood. “Goodness” also seems to be somewhat subjective, but can be expressed statistically in terms of bit-independence tests, and avalanche properties, etc.; various things that e.g. smhasher looks at.

I thought for fun I’d visualize some distributions, as that’s easier for my puny brain to understand than statistics. We visualize 32-bit hashes by quantizing by 64x64 and mapping that to a pixel following a hilbert curve to maintain locality of hash values. Then when multiple hash values fall within the same 64x64 pixel, we darken the pixel, and finally mark it red if we can’t go any further to indicate clipping.

It’s easy to cherry-pick inputs that will result in some bad behavior by
hashable, but below I’ve tried to show some fairly realistic examples of
strange or less-good distributions in `hashable`

. I haven’t analysed these
at all. Images are cropped ¼ size, but are representative of the whole 32-bit
range.

First, here’s a hash of all `[Ordering]`

of size 10 (~59K distinct values):

Hashabler:

Hashable:

Next here’s the hash of one million `(Word8,Word8,Word8)`

(having a domain ~ 16 mil):

Hashabler:

Hashable:

I saw no difference when hashing english words, which is good news as that’s probably a very common use-case.

If you could test the library on a big endian machine and let me know how it goes, that would be great. See here.

You can also check out the **TODO**s scattered throughout the code and send
pull requests. I mayb not be able to get to them until June, but will be very
grateful!

I’m always open to interesting work or just hearing about how companies are using haskell. Feel free to send me an email at brandon.m.simmons@gmail.com

]]>At a high level `criterion`

makes your benchmark the inner loop of a function,
and runs that loop a bunch of times, measures the result, and then divides by
the number of iterations it performed. The approach is both useful for comparing
alternative implementations, and probably the only meaningful way of answering
“how long does this code take to run”, short of looking at the assembly and
counting the instructions and consulting your processor’s manual.

If you’re skeptical, here’s a benchmark we’d expect to be very fast:

```
import Criterion.Main
main :: IO ()
main = do
defaultMain [
bench "sum2" $ nf sum [1::Int,2]
, bench "sum4" $ nf sum [1::Int,2,3,4]
, bench "sum5" $ nf sum [1::Int,2,3,4,5]
]
```

And indeed it’s on the order of nanoseconds:

```
benchmarking sum2
time 27.20 ns (27.10 ns .. 27.35 ns)
0.994 R² (0.984 R² .. 1.000 R²)
mean 28.72 ns (27.29 ns .. 32.44 ns)
std dev 6.730 ns (853.1 ps .. 11.71 ns)
variance introduced by outliers: 98% (severely inflated)
benchmarking sum4
time 58.45 ns (58.31 ns .. 58.59 ns)
1.000 R² (1.000 R² .. 1.000 R²)
mean 58.47 ns (58.26 ns .. 58.66 ns)
std dev 654.6 ps (547.1 ps .. 787.8 ps)
variance introduced by outliers: 11% (moderately inflated)
benchmarking sum5
time 67.08 ns (66.84 ns .. 67.33 ns)
1.000 R² (1.000 R² .. 1.000 R²)
mean 67.04 ns (66.85 ns .. 67.26 ns)
std dev 705.5 ps (596.3 ps .. 903.5 ps)
```

The results are consistent with each other; `sum`

seems to be linear, taking
13-14ns per list element, across our different input sizes.

This is what I was doing today which motivated this post. I was experimenting with measuring the inner loop of a hash function:

```
fnvInnerLoopTest :: Word8 -> Word32
{-# INLINE fnvInnerLoopTest #-}
fnvInnerLoopTest b = (2166136261 `xor` fromIntegral b) * 16777619
```

These were the results criterion gave me:

```
benchmarking test
time 9.791 ns (9.754 ns .. 9.827 ns)
1.000 R² (1.000 R² .. 1.000 R²)
mean 9.798 ns (9.759 ns .. 9.862 ns)
std dev 167.3 ps (117.0 ps .. 275.3 ps)
variance introduced by outliers: 24% (moderately inflated)
```

These are the sorts of timescales that get into possibly measuring overhead of function calls, boxing/unboxing, etc. and should make you skeptical of criterion’s result. So I unrolled 4 and 8 iteration versions of these and measured the results:

```
main :: IO ()
main = do
defaultMain [
bench "test" $ nf fnvInnerLoopTest 7
, bench "test4" $ nf fnvInnerLoopTest4 (7,8,9,10)
, bench "test8" $ nf fnvInnerLoopTest8 (7,8,9,10,11,12,13,14)
]
benchmarking test
time 9.380 ns (9.346 ns .. 9.418 ns)
1.000 R² (1.000 R² .. 1.000 R²)
mean 9.448 ns (9.399 ns .. 9.567 ns)
std dev 240.4 ps (137.9 ps .. 418.6 ps)
variance introduced by outliers: 42% (moderately inflated)
benchmarking test4
time 12.66 ns (12.62 ns .. 12.72 ns)
1.000 R² (1.000 R² .. 1.000 R²)
mean 12.68 ns (12.64 ns .. 12.73 ns)
std dev 158.8 ps (126.9 ps .. 215.7 ps)
variance introduced by outliers: 15% (moderately inflated)
benchmarking test8
time 17.88 ns (17.82 ns .. 17.94 ns)
1.000 R² (1.000 R² .. 1.000 R²)
mean 17.89 ns (17.81 ns .. 17.97 ns)
std dev 262.7 ps (210.3 ps .. 349.7 ps)
variance introduced by outliers: 19% (moderately inflated)
```

So this seems to give a more clear picture of how good our bit twiddling is in that inner loop. I was curious if I could measure the overhead directly in criterion though. Somewhat surprisingly to me, it seems I could!

I added the following benchmark to my list:

```
, bench "baseline32" $ nf (\x-> x) (777::Word32)
```

The idea being to isolate the overhead of applying the most trivial function
and calling `nf`

on an example value of our output type (`Word32`

in this
case).

```
benchmarking baseline32
time 9.485 ns (9.434 ns .. 9.543 ns)
1.000 R² (1.000 R² .. 1.000 R²)
mean 9.509 ns (9.469 ns .. 9.559 ns)
std dev 155.8 ps (122.6 ps .. 227.8 ps)
variance introduced by outliers: 23% (moderately inflated)
```

If we consider this value the baseline for the measurements initially reported,
the new results are both linear-ish, as we would expect, and also the resulting
absolute measurements fall about where we’d expect from the assembly we’d hope
for (I still need to verify that this is *actually* the case), e.g. our intial
`test`

is in the ~1ns range, about what we’d expect from an inner loop with a
couple instructions.

I thought this was compelling enough to open an issue to see whether this technique might be incorporated into criterion directly. It’s at least a useful technique that I’ll keep playing with.

Anyway, benchmark your code.

]]>…

]]>`unagi-chan`

, a haskell library implementing
fast and scalable FIFO queues with a nice and familiar API. It is
available on hackage
and you can install it with:
```
$ cabal install unagi-chan
```

This version provides a bounded queue variant (and closes
issue #1!)
that has performance on par with the other variants in the library. This is
something I’m somewhat proud of, considering that the standard
`TBQueue`

is not only significantly slower than e.g. `TQueue`

, but also was seen to
livelock at a fairly low level of concurrency (and so is not included in the
benchmark suite).

Here are some example benchmarks. Please do try the new bounded version and see how it works for you.

What follows are a few random thoughts more or less generally-applicable to the design of bounded FIFO queues, especially in a high-level garbage-collected language. These might be obvious, uninteresting, or unintelligible.

I hadn’t really thought much about this before: a bounded queue limits memory consumption because the queue is restricted from growing beyond some size.

But this isn’t quite right. If for instance we implement a bounded queue by
pre-allocating an array of size `bounds`

then a `write`

operation need not
consume any additional memory; indeed the value to be written has already
been allocated on the heap *before* the write even begins, and will persist
whether the write blocks or returns immediately.

Instead constraining memory usage is a knock-on effect of what we really care
about: **backpressure**; when the ratio of “producers” to their writes is high
(the usual scenario), blocking a write may limit memory usage by delaying heap
allocations associated with elements for *future* writes.

So bounded queues with blocking writes let us:

- when threads are “oversubscribed”, transparently indicate to the runtime which work has priority
- limit
*future*resource usage (CPU time and memory) by producer threads

We might also like our bounded queue to support a non-blocking `write`

which
returns immediately with success or failure. This might be thought of
(depending on the capabilities of your language’s runtime) as more general than
a blocking write, but it also supports a distinctly different notion of
bounding, that is bounding message latency: a producer may choose to drop
messages when a consumer falls behind, in exchange for lower latency for future
writes.

Trying to unpack the ideas above helped in a few ways when designing
`Unagi.Bounded`

. Here are a few observations I made.

When implementing blocking writes, my intuition was to (when the queue is “full”) have writers block before “making the message available” (whatever that means for your implementation). For Unagi that means blocking on an MVar, and then writing a message to an assigned array index.

But this ordering presents a couple of problems: first, we need to be able to handle async exceptions raised during writer blocking; if its message isn’t yet “in place” then we need to somehow coordinate with the reader that would have received this message, telling it to retry.

By unpacking the purpose of bounding it became clear that we’re free to block
at any point during the `write`

(because the `write`

per se does not have the
memory-usage implications we originally naively assumed it had), so in
`Unagi.Bounded`

writes proceed exactly like in our other variants, until the
end of the `writeChan`

, at which point we decide when to block.

This is certainly also better for performance: if a wave of readers comes along, they need not wait (themselves blocking) for previously blocked writers to make their messages available.

One hairy detail from this approach: an async exception raised in a blocked
writer does not cause that write to be aborted; i.e. once entered, `writeChan`

always succeeds. Reasoning in terms of linearizability this only affects
situations in which a writer thread is known-blocked and we would like to abort
that write.

In `Unagi.Bounded`

I relax the bounds constraint to “somewhere between `bounds`

and bounds*2”. This allows me to eliminate a lot of coordination between
readers and writers by using a single reader to unblock up to `bounds`

number
of writers. This constraint (along with the constraint that `bounds`

be a power
of two, for fast modulo) seemed like something everyone could live with.

I also guess that this “cohort unblocking” behavior could result in some nicer stride behavior, with more consecutive non-blocking reads and writes, rather than having a situation where the queue is almost always either completely full or empty.

This has nothing to do with queues, but just a place to put this observation:
garbage-collected languages permit some interesting non-traditional concurrency
patterns. For instance I use `MVar`

s and `IORef`

s that only ever go from empty
to full, or follow a single linear progression of three or four states in their
lifetime. Often it’s easier to design algorithms this way, rather than by using
long-lived mutable variables (for instance I struggled to come up with a
blocking bounded queue design that used a circular buffer which could be made
async-exception-safe).

Similarly the CAS operation (which I get exported from
`atomic-primops`

)
turns out to be surprisingly versatile far beyond the traditional
read/CAS/retry loop, and to have very useful semantics when used on short-lived
variables. For instance throughout `unagi-chan`

I do both of the following:

CAS without inspecting the return value, content that we or any other competing thread succeeded.

CAS using a known initial state, avoiding an initial read

]]>

These are my thoughts so far on FIFO semantics.

In the interesting paper
“How FIFO is your concurrent FIFO queue?”(PDF).
A Haas, et al. propose that an ideal FIFO queue has operations that are
instantaneous (think of each `write`

having an infinitely accurate timestamp,
and each `read`

taking the corresponding element in timestamp order). They then
measure the degree to which *real* queues of various designs deviate from this
platonic FIFO semantics in their message ordering, using a metric they call
“element-fairness”. They experimentally measure element-fairness of both
so-called “strict FIFO” as well as “relaxed FIFO” designs, in which elements
are read in more or less the order they were written (some providing guarantees
of degree of re-ordering, others not).

The first interesting observation they make is that no queue actually exhibits FIFO semantics by their metric; this is because of the realities of the way atomic memory operations like CAS may arbitrarily reorder a set of contentious writes.

The second interesting result is that the efficient-but-relaxed-FIFO queues which avoid contention by making fewer guarantees about message ordering often perform closer to ideal FIFO semantics (by their metric) than the “strict” but slower queues!

As an outsider, reading papers on FIFO queue designs I get the impression that what authors mean by “the usual FIFO semantics” is often ill-defined. Clearly they don’t mean the platonic zero-time semantics of the “How FIFO… ” paper, since they can’t be called FIFO by that measure.

I suspect what makes a queue “strict FIFO” (by the paper’s categorization) might simply be

If

`write x`

returnsat time`T`

, then`x`

will be read before the elements of any`write`

s that havenot yet startedby time`T`

.

The idea is difficult to express, but is essentially that FIFO semantics is
only observable by way of actions taken by a thread *after* returning from a
`write`

(think: thread `A`

writes `x`

, then tells `B`

which writes `y`

, where
our program’s correctness depends on the queue returning `y`

after `x`

). Note
that since a queue starts empty this is also sufficient to ensure writes don’t
“jump ahead” of writes already in the queue.

Imagine an absurd queue whose `write`

never returns; there’s very little one
can say for certain about the “correct” FIFO ordering of writes in that case,
especially when designing a program with a preempting scheduler that’s meant to
be portable. Indeed the correctness criterion above is actually probably a lot
stricter than many programs require; e.g. when there is no coordination between
writers, an observably-FIFO queue need only ensure that no reader thread sees
two messages from the same writer thread out of order (I think).

The platonic zero-time FIFO ordering criterion used in the paper is quite different from this observable, correctness-preserving FIFO criterion; I can imagine it being useful for people designing “realtime” software.

**Update 04/15/2014**:

What I’m trying to describe here is called
*linearizability*,
and is indeed a well-understood and common way of thinking about the semantics
of concurrent data structures; somehow I missed or misunderstood the concept!

At a certain level of abstraction, correct observable FIFO semantics shouldn’t be hard to make efficient; after all, the moments during which we have contention (and horrible performance) are also the moments during which we don’t care about (or have no way of observing) correct ordering. In other words (although we have to be careful of the details) a thread-coordination scheme that breaks down (w/r/t element-fairness) under contention isn’t necessarily a problem. Compare-and-swap does just that, unfortunately it breaks down in a way that is slower rather than faster.

]]>`shapely-data`

, a haskell library
up here on hackage
for working with algebraic datatypes in a simple generic form made up of
haskell’s primitive product, sum and unit types: `(,)`

, `Either`

, and `()`

.
You can install it with

```
cabal install shapely-data
```

In order from most to least important to me, here are the concerns that motivated the library:

Provide a good story for

`(,)`

/`Either`

as a*lingua franca*generic representation that other library writers can use without dependencies, encouraging abstractions in terms of products and sums (motivated specifically by my work on`simple-actors`

.Support algebraic operations on ADTs, making types composable

`-- multiplication: let a = (X,(X,(X,()))) b = Left (Y,(Y,())) :: Either (Y,(Y,())) (Z,()) ab = a >*< b in ab == ( Left (X,(X,(X,(Y,(Y,()))))) :: Either (X,(X,(X,(Y,(Y,()))))) (X,(X,(X,(Z,())))) ) -- exponents, etc: fanout (head,(tail,(Prelude.length,()))) [1..3] == (1,([2,3],(3,()))) (unfanin (_4 `ary` (shiftl . Sh.reverse)) 1 2 3 4) == (3,(2,(1,(4,()))))`

Support powerful, typed conversions between

`Shapely`

types`data F1 = F1 (Maybe F1) (Maybe [Int]) deriving Eq data F2 = F2 (Maybe F2) (Maybe [Int]) deriving Eq f2 :: F2 f2 = coerce (F1 Nothing $ Just [1..3]) data Tsil a = Snoc (Tsil a) a | Lin deriving Eq truth = massage "123" == Snoc (Snoc (Snoc Lin '3') '2') '1'`

Lowest on the list is supporting abstracting over different recursion schemes or supporting generic traversals and folds, though some basic support is planned.

Finally, in at least some cases this can completely replace `GHC.Generics`

and
may be a bit simpler. See `examples/Generics.hs`

for an example of the
`GHC.Generics`

wiki example
ported to `shapely-data`

. And for a nice view on the changes that were
required, do:

```
git show 3a65e95 | perl /usr/share/doc/git/contrib/diff-highlight/diff-highlight
```

The `GHC.Generics`

representation has a lot of metadata and a complex
structure that can be useful in deriving default instances; more important to
us is to have a simple, canonical representation such that two types that
differ only in constructor names can be expected to have identical generic
representations.

This supports APIs that are type-agnostic (e.g. a database library that returns
a generic `Product`

, convertible later with `to`

), and allows us to define
algebraic operations and composition & conversion functions.

My goal was to have a function:

```
waterStream :: [Int] -> [Int]
```

that would take a possibly-infinite list of columns and return a stream of
*known water quantities*, where volumes of water were output as soon as
possible. We can get a solution to the original problem, then, with

```
ourWaterFlow = sum . waterStream
```

Here is the solution I came up with, with inline explanation:

```
{-# LANGUAGE BangPatterns #-}
-- start processing `str` initializing the highest column to the left at 0, and
-- an empty stack.
waterStream :: [Int] -> [Int]
waterStream str = processWithMax 0 str []
processWithMax :: Int -> [Int] -> [(Int,Int)] -> [Int]
processWithMax prevMax = process
where
process [] = const []
-- output the quantity of water we know we can get, given the column at the
-- head of the stream, `y`:
process (y:ys) = eat 1
where
eat !n xxs@((offset,x):xs)
-- done with `y`, push it and its offset onto the stack
| y < x = process ys ((n,y):xxs)
-- at each "rise" we can output some known quantity of water;
-- storing the "offset" as we did above lets us calculate water
-- above a previously filled "valley"
| otherwise = let col = offset*(min y prevMax - x)
cols = eat (n+offset) xs
-- filter out zeros:
in if col == 0 then cols else col : cols
-- if we got to the end of the stack, then `y` is the new highest
-- column we've seen.
eat !n [] = processWithMax y ys [(n,y)]
```

The bit about “offsets” is the tricky part which I don’t know how to explain without a pretty animation.

It took me much longer than I was expecting to code up the solution above that
worked on a few hand-drawn test cases, and at that point I didn’t have high
confidence that the code was correct, so I turned to
quickcheck and
`assert`

.

First I wanted to make sure the invariant that the “column” values in the stack were strictly increasing held:

```
import Control.Exception (assert)
...
--process (y:ys) = eat 1
process (y:ys) stack = assert (stackSane stack) $ eat 1 stack
...
```

Then I used Philip’s solution (which I had confidence in):

```
waterFlow :: [Int] -> Int
waterFlow h = sum $
zipWith (-)
(zipWith min (scanl1 max h) (scanr1 max h))
h
```

to test my implementation:

```
*Waterflow> import Test.QuickCheck
*Waterflow Test.QuickCheck> quickCheck (\l -> waterFlow l == ourWaterFlow l)
*** Failed! Falsifiable (after 21 tests and 28 shrinks):
[1,0,0,0,1]
```

Oops! It turned out I had a bug in this line (fixed above):

```
--old buggy:
--cols = eat (n+1) xs
--new fixed:
cols = eat (n+offset) xs
```

The solution seems to perform pretty well, processing 1,000,000 `Int`

s in 30ms
on my machine:

```
import Criterion.Main
main = do
gen <- create
rs <- replicateM 1000000 $ uniformR (0,100) gen
defaultMain [ bench "ourWaterFlow" $ whnf ourWaterFlow rs
```

I didn’t get a good look at space usage over time, as I was testing with
`mwc-random`

which doesn’t seem to support creating a lazy infinite list of
randoms and didn’t want to hunt down another library. Obviously on a stream
that simply descends forever, our stack of `(Int,Int)`

will grow to infinite
size.

It seems as though there is a decent amount of parallelism that could be exploited in this problem, but I didn’t have any luck on a quick attempt.

Have a parallel solution, or something just faster? Or an implementation that doesn’t need a big stack of previous values?

]]>`TypeFamilies`

is a GHC extension that lets you create formerly-impossible
abstractions in a very straightforward way. It took me several tries before
they clicked for me though, so this is the introduction to `TypeFamilies`

that
I wish I had read first (although I just found Brent Yorgey’s, which would have done the trick).
I’m treating the subject very narrowly for most of this post, and try to round things out a little at the very end.

If this isn’t the first thing you’ve read about `TypeFamilies`

, it might be
helpful to forget a few things. The question “what precisely is a type family?”
isn’t going to be very helpful; in general, the terminology for the
constellation of constructions that `TypeFamilies`

gives you is a huge mess,
with multiple partially-overlapping terms in the wild, none of which are
helpful for developing an intuition about what all of this is about.

I also found various analogies I’d read to be useless, so forget those too.

Consider the familiar type synonym:

```
type PairOf a = (a,a)
```

Normally this is presented as syntactic sugar, with little to do with the type system.

A more interesting way of thinking about `PairOf`

is as a *function* (as
suggested by the `=`

), where evaluation involves substituting occurrences of
the left hand side (LHS) with the right, in the usual way. These functions are
evaluated *in your type signatures* at compile time.

The analogous regular term-level function would of course be:

```
pairOf a = (a,a)
```

Simple enough. Now let’s think about a simple *term-level* function, and see
what an analogous *type-level* type synonym/function might look like:

```
last (a: []) = a
last (a: (b:bs)) = last (b:bs)
```

For our type-level `Last`

we need something like lists at the type-level, so
we’ll use the common nested tuple representation of `(,)`

as cons and `()`

as
the empty list, e.g.:

```
x :: (Int,(Int,(Int,()))) -- like [1,2,3]
```

Hopefully I didn’t just lose you. Remember for now we just care about using
this list-like tuple thing in our *type signatures*.

If you were to charge ahead and try to define `Last`

using type synonyms
treated as full blown functions, you might come up with:

```
-- this isn't okay:
type Last (a, ()) = a
type Last (a, (b,bs)) = Last (b,bs)
```

Unfortunately the compiler will laugh at you. Type synonyms can only have abstract variable arguments on the LHS where above we have tried to deconstruct them using pattern matching, and to define a different RHS for both cases. Further we’ve made the definition recursive. None of that is okay.

In fact the humble type synonym is only a very simple sort of function (a natural transformation or something close) which is very easy to evaluate, but also very limited.

The `TypeFamilies`

extension lets us define `Last`

successfully almost exactly
as we did above.

```
{-# LANGUAGE TypeFamilies #-}
-- we have to "declare" `Last` separately, and the "family"
-- here distinguishes the syntax from a normal type synonym:
type family Last l
-- ...and then can define our "cases":
type instance Last (a,()) = a
type instance Last (a,(b,bs)) = Last (b,bs)
```

At this point when the type-checker sees `Last (a,(b,bs))`

in a type signature
it will replace it with `Last (b,bs)`

, and continue until all of these “type
functions” are evaluated. I may be fudging things a bit but that’s the
general idea.

Since these are a more general sort of type function, they can even be used to replace traditional type synonyms:

```
type family PairOf a
type instance PairOf a = (a,a)
```

It would be neat to be able to work with “lists” that look like e.g.
`(1,('a',("hello",())))`

; they are heterogeneous, operations like `head`

would
be type-safe, etc. So imagine we want to define a `last`

on types of this
list-like tuple sort of data.

What would the type of `last`

look like? We know it has to be polymorphic,
since its arguments might look like `()`

or `(1,(2,()))`

, different types of
course. So we’ll need a type-class (and a couple other standard extensions):

```
{-# LANGUAGE FlexibleInstances, FlexibleContexts #-}
class LastOfTupleList l where
last' :: l -> Last l -- < we use our `Last` type function
```

Our instances are trivial:

```
instance LastOfTupleList (a,()) where
last' (a,()) = a -- < the type-checker can see that, indeed, `a`
-- is equal to `Last (a,())`
instance (LastOfTupleList (b, bs))=> LastOfTupleList (a,(b,bs)) where
last' (a,(b,bs)) = last' (b,bs)
```

Letting us do:

```
>>> last' (1,(2,()))
2
>>> last' (1,('a',("hello",())))
"hello"
```

Notice how our instances of `Last`

and `last`

have almost the same structure;
this is very common.

If you’re a programmer type you were probably irritated by my initial clumsy
definition of `last`

; why not:

```
last (a:[]) = a
last (a:as) = last as -- `as` matches whatever can fall through the pattern
-- above; in this case only non-empty lists
```

Well, I was being sneaky because the type-level analogue isn’t allowed!

```
type instance Last (a,()) = a
type instance Last (a,as) = Last as -- BAD!
```

This is because unlike functions, type families are *open* meaning, like
typeclasses, a new instance can be added at any moment. Therefore there’s
no way to define a “default” instance to use after all other matches fail,
you simply get illegal overlapping synonym instances such as the one above;
the order in which we defined the two doesn’t matter.

For some use cases this is what we need, for others (such as our `last'`

) we’d
really prefer that type synonym families be *closed* so that we can pattern
match in the usual way.
This feature is apparently
coming soon.

That should give you an intuition. At this point you might want to stop and read through the following documentation, or continue reading below before coming back to these links for a more refined understanding and additional details:

Since we so often define define a type class in terms of one or more type families, we’re given a simplified syntax for combining them in one place.

```
class LastOfTupleList l where
type Last l -- an *associated* type family
last' :: l -> Last l
instance LastOfTupleList (a,()) where
type Last (a,()) = a
last' (a,()) = a
```

When people say “associated types” they mean type functions that are associated with a typeclass using the syntax above.

Type synonym family instances are said to be *not injective*, meaning two
different type functions can map to the same type on the RHS, e.g.

```
type instance F Int = Bool
type instance F Char = Bool
```

It’s easy to forget this when building new abstractions, and assume that the
typechecker will infer from the RHS (e.g. `Bool`

above) the argument passed in
to the type function (`Int`

or `Char`

).

I’ve completely focused on the *type synonym* flavor of `TypeFamilies`

above,
but there is also a `data/newtype`

flavor in which, for each instance
definition the RHS is a *brand new* type declaration, rather than mapping to an
existing type

```
-- from http://www.haskell.org/ghc/docs/7.4.1/html/users_guide/type-families.html
data family T a
data instance T Int = T1 Int | T2 Bool -- new constructors T1 and T2 defined here
newtype instance T Char = TC Bool
```

Because each instance maps to a unique type, data families *are* injective
allowing the type checker to infer the LHS of the equation knowing the right.

`TypeFamilies`

provides the syntax`a ~ b`

to indicate type equality constraints; this is especially useful with type synonym functions, but can be useful on its own as well.- kind signatures are required for type functions on types taking arguments, e.g.
`Maybe`

- pour kettle full of boiling water into drain
- stuff overflow drain with plastic wrap (so that pressure is directed toward the clog, not out the top)
- invert empty draino bottle placing mouth around drain, forming a tight seal
- squeeze bottle rapidly and forcefully

Don’t get boiling water in your eyes.

]]>