Building a Friendly and Safe EDSL with IxState and TypeLits

Haskell is a great, if not the best language for embedding DSLs in. Thanks to Haskell’s modern type system and elegant syntax, devising embedded domain-specific languages (EDSL) that are both low friction and type safe is often fairly achievable. Let’s take a simple EDSL as a running example, and evolve it in a series of steps to make it more user-friendly and safer. The main machineries involved are the indexed state monad (IxState) and some moderate type-level programming.

The EDSL we are going to build is for constructing record data types, particularly those for holding configuration settings. As a starter, the following data type called X will be used:

data X = X
  { aField  :: A
  , bField  :: B
  , csField :: NonEmpty C
  }

type A = Int
type B = String
type C = Double

This X data type is just a toy example. You can substitute for it whatever similar data types you work with in practice. For example, X could be a configuration for a service, a script with a non-empty list of commands, a user account where csField is the operations the user is authorized to perform, or the description of a data migration job with a number of steps to be executed. It can literally be anything that is a record containing a number of fields.

Now, suppose we want to equip our users with a DSL to construct terms of type X. What are our options?

Version 1: Plain Record Syntax

(Code: https://github.com/zliu41/dsl/blob/master/src/DSL/Version1.hs)

This is, of course, not really a DSL. It’s just an L – the plain vanilla Haskell record syntax. To construct a term of type X, we can write something like

import Data.List.NonEmpty as NonEmpty

x :: X
x = X
  { aField = 42
  , bField = "Hello"
  , csField = NonEmpty.fromList [1.0, 2.0]
  }

This looks reasonably good. So why bother doing anything else? Becuase there are a number of things not so desirable about this record syntax:

  1. Not everybody loves the record syntax. The more fields X has, and the more nested the fields are, the more aggravating it becomes.
  2. It is often useful to have partially constructed X values which can be reused in constructing other Xs. With this approach it is not clear how you can partially construct an X. None of the three fields has a Maybe type, so you can’t use Nothing to signal that a field is not yet set.
  3. Related to the above point, there is no decent way (that I know of) to provide default values to some fields.
  4. The record syntax becomes difficult to use when the fields are more complex than simple values like numbers, strings and lists. For example, trees. Later in this post, we are going to extend X such that each item in csField can be associated with a label; labels are potentially nested and thus form a tree structure. Using the record syntax directly would become quite cumbersome in that case.

Another common way to construct a record is lens, so let’s see how that works.

Version 2: Lens

(Code: https://github.com/zliu41/dsl/blob/master/src/DSL/Version2.hs)

To use lens, we make the field types of X abstract, and make lenses for each field:

{-# LANGUAGE TemplateHaskell #-}

import qualified Control.Lens as Lens

data X_Abs a b cs = X_Abs
  { _aField  :: a
  , _bField  :: b
  , _csField :: cs
  }
Lens.makeLenses ''X_Abs

Then we can write:

type X_Empty = X_Abs () () ()
type X       = X_Abs A B (NonEmpty C)

x_empty :: X_Empty
x_empty = X_Abs () () ()

x :: X
x = Lens.set aField 42
  . Lens.set bField "hello"
  . Lens.set csField (fromList [1.0, 2.0])
  $ x_empty

By making the field types of X abstract, the second issue mentioned in Version 1 is addressed: we can now have partially constructed X values. But it has a new problem: now it is possible to accidentally set the same field multiple times. The following code typechecks:

x :: X
x = Lens.set aField 42
  . Lens.set aField 43
  . Lens.set bField "hello"
  . Lens.set cField (fromList [1.0, 2.0])
  $ emptyX

This isn’t much of a worry for our toy data type X, but the larger number of fields, the more unsafe it is. The problem is that Lens.set is too polymorphic. We need to restrict Lens.set aField to only be able to alter the type of _aField from () to Int, and not from Int to Int. There are at least two ways to do so. The first is to specialize the type parameters of Lens.set for each field, for example:

set_aField :: a -> X_Abs () b cs -> X_Abs a b cs
set_aField = Lens.set aField

With this specialized type, calling set_aField twice no longer type checks.

The second and fancier way is to specialize the constraints, rather than the type parameters, of Lens.set, by making use of a type family that returns Constraint:

type family ExactlyOnceConstraints field a a' b b' cs cs' :: Constraint where
  ExactlyOnceConstraints A a a' b b' cs cs' = (a ~ (), b ~ b', cs ~ cs')
  ExactlyOnceConstraints B a a' b b' cs cs' = (a ~ a', b ~ (), cs ~ cs')
  ExactlyOnceConstraints (NonEmpty C) a a' b b' cs cs' = (a ~ a', b ~ b', cs ~ ())

set :: ExactlyOnceConstraints y a a' b b' cs cs'
    => Lens.ASetter (X_Abs a b cs) (X_Abs a' b' cs') x y
    -> y -> X_Abs a b cs -> X_Abs a' b' cs'
set = Lens.set

x :: X
x = set aField 42
  . set bField "Hello"
  . set csField (NonEmpty.fromList [1.0, 2.0])
  $ x_empty

With ExactlyOnceConstraints in place, setting any field twice is illegal. This approach requires that each field of X have a distinct type. This can be achieved by using newtypes to distinguish fields with the same type, and it is usually considered good practice to do so.

Version 3: Enabling do-notation via IxState

(Code: https://github.com/zliu41/dsl/blob/master/src/DSL/Version3.hs)

This version of the DSL takes advantage of the indexed state monad (IxState) to convert regular function composition (.) into composition of monadic actions (>>), allowing us to construct X terms in do-notations, for instance

x :: X
x = mkX $ do
  set aField 42
  set bField "Hello"
  set csField (NonEmpty.fromList [1.0, 2.0])

The idea is straightforward: X_Abs is now the state in the (indexed) state monad, and set modifies the state behind the scenes. This makes our DSL look like imperative programming with a bunch of set statements, and in my opinion, nicer.

The reason to use the indexed state monad, rather than the regular state monad, is to ensure each field is set exactly once, same as the guarantee that Version 2 achieves. The initial state would be X_Empty (which is X_Abs () () ()). After setting aField, the state becomes X_Abs A () (), and so on. Since each call to set changes the type of the state, it is necessary to use the indexed state monad.

Basically the only difference in the implementation compared to Version 2 is that set now returns IxState (AbsX a b cs) (AbsX a' b' cs') (), in contrast to X_Abs a b cs -> X_Abs a' b' cs'. The transition from X_Abs a b cs to AbsX a' b' cs' happens in an IxState action, as opposed to a function. To enable do-notation for IxState, we need to turn on RedindableSyntax, and rebind the (>>) operator using (>>>=) from Control.Monad.Indexed. In Version 3 it is sufficient just to rebind (>>). In the next version and later you’ll also need to rebind (>>=).

Like Version 2, the order in which the fields are set doesn’t matter. The code is well-typed as long as each field is set exactly once.

Version 4: Adding Nested Labels

(Code: https://github.com/zliu41/dsl/blob/master/src/DSL/Version4.hs)

Now, let’s make things more interesting by adding labels to X. We want to be able to attach labels to each value in _csField, and a label can be nested under another. Ideally, we want to construct X terms with the following code, which uses nested do-blocks for nested labels:

x :: X
x = mkX $ do
  set aField 42
  set bField "Hello"
  new_c 1.0        -- 1.0 has no label
  label "label-foo" $ do
    new_c 2.0      -- 2.0 is labeled ["label-foo"]
    label "label-bar" $ do
      new_c 3.0    -- 3.0 is labeled ["label-foo", "label-bar"]
      label "label-baz" $ do
        new_c 4.0  -- 4.0 is labeled ["label-foo", "label-bar", "label-baz"]
      new_c 5.0    -- 5.0 is labeled ["label-foo", "label-bar"]
    new_c 6.0      -- 6.0 is labeled ["label-foo"]
    new_c 7.0      -- 7.0 is labeled ["label-foo"]
  new_c 8.0        -- 8.0 has no label

For this to work, every time we create a new C value using new_c, we need to know the label we currently have, so that we can associate it with the new C we create. This can be pulled off using the Reader monad. Whenever we create a new label using label, we can use local to locally add the new label to the environment.

So now our indexed state monad is composed with the reader monad, and we give it an alias X_M:

type X_M i j a = IxStateT (Reader Label) i j a
type Component = String
type Label = [Component]
type C_Labeled = (Label, C)

Corresponding changes are made to the definitions of X and X_Empty to add labels:

type X_Empty = X_Abs () () [C_Labeled]         -- Instead of "AbsX () () ()"
type X       = X_Abs A B (NonEmpty C_Labeled)  -- Instead of "AbsX A B (NonEmpty C)"

Type X_Empty becomes AbsX () () [C_Labeled] as opposed to AbsX () () (), because we are no longer setting the entire _csField in one fell swoop, but via multiple calls to new_c. Each time we call new_c, a new C_Labeled is added, and it needs to be added to somewhere. In other words, there needs to be something that we can add each new C_Labeled to. Since we can’t add a C_Labeled to (), we need to use an empty list initially. The new_c function is defined as:

new_c :: (IsList cs, Item cs ~ C_Labeled)
      => C
      -> X_M (X_Abs a b cs) (X_Abs a b (NonEmpty C_Labeled)) ()
new_c c = do
  labels <- ilift ask
  let labeled = (labels, c)
  imodify $ Lens.over csField ((labeled :|) . toList)

IsList is a typeclass that both regular lists and non-empty lists are instances of. Before any new_c is called, cs ~ [C_Labeled], and after the first new_c, cs ~ NonEmpty C_Labeled.

The label function uses ilocal (which lifts local into IxStateT) to change the environment by prepending the given label component to the label in the environment:

label :: Component -> X_M i j () -> X_M i j ()
label = ilocal . (:)

ilocal :: (Label -> Label) -> X_M i j a -> X_M i j a
ilocal f m = IxStateT $ local f . runIxStateT m

Function set has exactly the same implementation as Version 3, except that its return type becomes X_M (AbsX a b cs) (AbsX a' b' cs') (), as opposed to IxState (AbsX a b cs) (AbsX a' b' cs') (). Similarly, type alias MkX becomes X_M X_Empty X a instead of IxState X_Empty X a.

Version 5: Guarding Against Incorrect Indentations

(Code: https://github.com/zliu41/dsl/blob/master/src/DSL/Version5.hs)

One problem with the Version 4 DSL is that, since it is indentation-sensitive (as Haskell itself is), it is prone to bugs caused by accidentally using the incorrect indentations. If, for instance, label "label-baz" is indented with two fewer spaces, the code would still be well-typed, but “label-baz” would no longer be nested under “label-bar”.

To make this class of errors less likely, we can augment the label function with type level natural numbers from GHC.TypeLits. Whenever a new label component is added by calling label, we annotate it with the current indentation level, and if the annotation is inconsistent with the level, it would be a type error. Version 5 of our DSL looks like the following:

x :: X
x = mkX $ do
  set aField 42
  set bField "Hello"
  new_c 1.0
  label @0 "label-foo" $ do
    new_c 2.0
    label @1 "label-bar" $ do
      new_c 3.0
      label @2 "label-baz" $ do
        new_c 4.0
      new_c 5.0
    new_c 6.0
    new_c 7.0
  new_c 8.0

In the above code, if we replace any label @n with label @m where m ≠ n, it should fail to compile.

To accomplish this, we need to keep track of the current indentation level in the types, and to do so, we add a new type parameter of kind Nat to X_M. Alternatively, we can add the new type parameter to X_Abs, but this is unnecessary since neither set nor new_c changes the indentation level.

To add the new type parameter, we make X_M a newtype in lieu of a type alias:

newtype X_M (n :: Nat) i j a = X_M
  { runX_M :: IxStateT (Reader Label) i j a }
  deriving (IxFunctor, IxPointed, IxApplicative, IxMonad, IxMonadState)
-- Instead of: type X_M i j a = IxStateT (Reader Label) i j a

Other than that, the implementation is almost identical to Version 4, except that a few type signatures are changed:

type MkX a = X_M 0 X_Empty X a

set :: ExactlyOnceConstraints y a a' b b' cs cs'
    => Lens.ASetter (X_Abs a b cs) (X_Abs a' b' cs') x y
    -> y -> X_M n (X_Abs a b cs) (X_Abs a' b' cs') ()
set =  -- omitted, same as Version 4

new_c :: (IsList cs, Item cs ~ C_Labeled)
      => C
      -> X_M n (X_Abs a b cs) (X_Abs a b (NonEmpty C_Labeled)) ()
new_c =  -- omitted, same as Version 4

label :: forall n i j. Component -> X_M (n+1) i j () -> X_M n i j ()
label =  -- omitted, same as Version 4

ilocal :: (Label -> Label) -> X_M n i j a -> X_M n' i j a
ilocal f m =  -- omitted, same as Version 4

The key thing to note here is label’s type, where the nested action has type X_M (n+1) i j (). For example, In label @2, the nested action has type X_M 3 i j (), and so any call to label immediately nested under label @2 must be annotated with @3. The 0 in MkX ensures that the top-level calls to label are annotated with @0.

Version 6: At Most Two Occurrences per Label

(Code: https://github.com/zliu41/dsl/blob/master/src/DSL/Version6.hs)

In our final version of the DSL, to make things even more entertaining, let’s say there’s a requirement, for whatever reason, that the same label cannot be used more than twice, and we want this property statically enforced.

For example, the code above in Version 5 should now become ill-typed, as there are three C values (2.0, 6.0 and 7.0) labeled ["label-foo"].

Since we now want to statically enforce a property about labels, we have to lift labels to the type level via GHC.TypeLits.Symbol. At each step during the construction of X, we need to be aware of all labels that have been used so far, so that we can determine whether the next C value we want to create is legal.

To keep track of all labels that have been used so far, we add a type parameter xss of kind [TLabel] to X_Abs, where TLabel ~ [Symbol] stands for type-level labels. To determine whether each C to be created is legal, we add another type parameter xs of kind TLabel, which is the current label in the environment. The first type parameter xss must be added to X_Abs rather than X_M, since it can be changed by new_c. On the other hand, xs can be added to X_M.

So the definition of X_Abs and X_M becomes

data X_Abs a b cs (xss :: [TLabel]) = X_Abs
  { _aField  :: a
  , _bField  :: b
  , _csField :: cs
  }
Lens.makeLenses ''X_Abs

newtype X_M (n :: Nat) (xs :: TLabel) i j a = X_M
  { runX_M :: IxStateT (Reader Label) i j a }
  deriving (IxFunctor, IxPointed, IxApplicative, IxMonad, IxMonadState)

They each have one additional type parameter compared to Version 5.

We also need to add the type parameter xss to X, but we don’t really want to, as once we are done constructing X, we don’t care about xss any more. The purpose of xss is just to enforce the property that no label is used more than twice. For this reason, we create a type X' with the additional type parameter, and in the definition of X we existentialize it away:

type X_Empty = X_Abs () () [C_Labeled] '[]
type X' xss  = X_Abs A B (NonEmpty C_Labeled) xss
data X       = forall xss. X (X' xss)

Other than that, there are two main differences in the implementation compared to Version 5.

The first difference is the type signature of new_c, which now has an additional constraint, AtMostOnce xs xss, and adds the label associated with the new C to xss, via ConsNE xs xss:

new_c :: (IsList cs, Item cs ~ C_Labeled, AtMostOnce xs xss)
      => C
      -> X_M n xs (X_Abs a b cs xss) (X_Abs a b (NonEmpty C_Labeled) (ConsNE xs xss)) ()
new_c c =  -- omitted, same as Version 5

-- Prepend xs to xss unless x is empty.
type family ConsNE (xs :: TLabel) (xss :: [TLabel]) :: [TLabel] where
  ConsNE '[] xss = xss
  ConsNE xs xss = xs ': xss

type family Count x xs :: Nat where
  Count x '[] = 0
  Count x (x ': xs) = 1 + Count x xs
  Count x (y ': xs) = Count x xs

type family AtMostOnce x xs :: Constraint where
  AtMostOnce '[] _ = ()
  AtMostOnce x xs = Count x xs <= 1

This is what guarantees that no labels is used more than twice. When new_c is called, if the current label is already used more than once, the constraint is violated and it wouldn’t compile. If the constraint is satisfied, xs is added to xss unless xs ~ '[].

The second difference is the type signature of the label function, which now reads

label :: forall n comp a b cs cs' xs xss xss'. (KnownSymbol comp)
      => X_M (n+1)
             (comp ': xs)
             (X_Abs a b cs xss)
             (X_Abs a b cs' xss')
             ()
      -> X_M n
             xs
             (X_Abs a b cs xss)
             (X_Abs a b cs' xss')
             ()
label m =  -- omitted, same as Version 5

(comp ': xs) and xs in the type signature serves a similar purpose to (n+1) and n. It ensures that the child action’s current label carries the new component comp. For example, any call to label immediately nested under label @n @"label-foo" (for some n) has its current label instantiated to "label-foo" ': xs for some xs.

The child action is allowed to change type parameter xss into xss', since it may create an arbitrary number of new C values, whose labels (unless empty) are added to xss. Here xss' must contain xss as a suffix, and it is possible to reflect this fact using another constraint. Doing so would help guard against certain classes of bugs in the DSL implementation. It doesn’t, however, make our DSL any safer to use. So, for simplicity, my implementation just uses two completely independent type parameters, xss and xss'.

Conclusions

This is the end of our series of improvements to this particular EDSL. As a matter of fact, I’m not really advocating for this approach, since the benefit-cost ratio is not necessarily good enough to justify it. The implementation of this DSL is not exactly Simple Haskell, in fact, far from it (just look at the number of GHC extensions required in the code for Version 6).

An argument for it, though, is that most of the complexity lies in the implementation of the DSL, which the users of the DSL do not necessarily need to be concerned about. This is true, but on the other hand, when a user of such a DSL makes a mistake leading to a type error, the error message could be a bit of a mouthful. This can be alleviated to some degree by customizing the ErrorMessages, but could still lead to friction. Therefore, the usual advice for type-level programming in Haskell applies: it is very helpful to understand the power it holds, but use it in moderation, and find a good tradeoff that maximizes your team’s productivity.

We are not finished just yet - here are some fun exercises to work on if you are in the mood.

Exercise 1 (Easy)

(Answer: https://github.com/zliu41/dsl/blob/master/src/DSL/Exercise1.hs)

In Version 5, we added a type parameter n :: Nat to X_M, which can be used to guard against accidentally using the wrong indentations. An alternative approach is to add a type parameter xs :: TLabel instead of n :: Nat, and require that each call to label supply the full label. The DSL would look like the following (apologies for the syntax highlighting failure):

x :: X
x = mkX $ do
  set aField 42
  set bField "Hello"
  new_c 1.0
  label @'["label-foo"] $ do
    new_c 2.0
    label @'["label-foo", "label-bar"] $ do
      new_c 3.0
      label @'["label-foo", "label-bar", "label-baz"] $ do
        new_c 4.0
      new_c 5.0
    new_c 6.0
    -- new_c 7.0
  new_c 8.0

Modify the code for Version 5 to use the above approach. A nested call to label must be annotated with a label that consists of the parent label as a prefix, plus exactly one new component. The code should otherwise fail to compile.

Exercise 2 (Medium)

(Answer: https://github.com/zliu41/dsl/blob/master/src/DSL/Exercise2.hs)

Like Exercise 1, but for Version 6. Enforce the property that there are no more than two occurrences per label. The above code in Exercise 1 should be ill-typed if we un-comment new_c 7.0.

Exercise 3 (Easy)

(Answer: https://github.com/zliu41/dsl/blob/master/src/DSL/Exercise3.hs)

Modify Version 6 such that we can optionally annotate the nested level when calling new_c, which further reduces the chance of incorrect indentations. The DSL would look like this:

x :: X
x = mkX $ do
  set aField 42
  set bField "Hello"
  new_c 1.0
  label @0 @"label-foo" $ do
    new_c 2.0
    label @1 @"label-bar" $ do
      new_c 3.0
      label @2 @"label-baz" $ do
        new_c 4.0
      new_c @2 5.0
    new_c @1 6.0
    -- new_c 7.0
  new_c @0 8.0

Exercise 4 (Medium)

(Answer: https://github.com/zliu41/dsl/blob/master/src/DSL/Exercise4.hs)

Modify Version 6 such that _aField has a default value (say 42). When constructing a term of type X, set aField should now be called at most once.

Exercise 5 (Hard)

(Answer: https://github.com/zliu41/dsl/blob/master/src/DSL/Exercise5.hs)

In this exercise we modularize the DSL so that we can construct an X term in separate pieces. For example, let’s say we want to move a large part of the block nested under “label-foo” out into a separate value cs:

x :: X
x = mkX $ do
  set aField 42
  set bField "Hello"
  new_c 1.0
  label @0 @"label-foo" $ do
    cs
    -- new_c 7.0
  new_c 8.0

cs = do
  new_c 2.0
  label @n @"label-bar" $ do
    new_c 3.0
    label @(n+1) @"label-baz" $ do
      new_c 4.0
    new_c 5.0
  new_c 6.0

Note that the first type parameter passed to label in cs becomes @n and @(n+1). This makes it possible to insert cs at any level when constructing X.

The code above currently doesn’t compile because n is not in scope, and even if it were in scope, GHC would still be unable to infer the type of cs (to see this, change n and n+1 to some concrete numbers like 1 and 2).

Make it compile. Hint: in addition to adding an appropriate type signature for cs, consider making a new version of new_c.

Further Reading