Serokell

Posted on Sep 6, 2021 • Originally published at serokell.io on Aug 31, 2021

Dependency Analysis of Haskell Declarations

#haskell

Dependency Analysis of Haskell Declarations

Most languages require the programmer to declare functions and data types before their use. For example, the following code is not a valid C or C++ program:

int main() {
  f(); // error: use of undeclared identifier 'f'
}

void f(); // ... even though it’s declared down here.

Among the languages that have this restriction are Python, JavaScript, Rust, and even some functional ones, for example OCaml.

Haskell is one of the few languages that let you structure your program how you see fit. Functions and data types can be declared and used in arbitrary order:

main = putStrLn str
str = "Hello, World"

Declarations out of dependency order are a challenge to the compiler, though. How can GHC know if putStrLn str in main is well-typed without knowing the type of str? It needs to check str first, even though it comes second.

That’s where dependency analysis comes into play. Before type checking a module, GHC rearranges its declarations in such a way that functions and data types are declared before their use. Then it can process them top-down, just as compilers for other languages do.

Dependency analysis can handle fairly convoluted programs:

data Nat = Z | S Nat

parity :: Nat -> Parity
parity Z = Even
parity (S n) = flipP (parity n)

data Parity = Even | Odd

flipP :: Parity -> Parity
flipP Even = Odd
flipP Odd = Even

In this example, both Parity and flipP are used before they are declared. But dependency analysis determines that the order of processing should be: Nat, Parity, flipP, parity.

Type-checking environment

Why does the type checker expect declarations to be in any particular order? This is not merely an engineering issue, it has its roots in theory. Specifically, type theorists often define typing contexts as follows:

Γ::=∅∣Γ,x:τ\Gamma ::= \varnothing \mid \Gamma, x : \tau
Γ::=∅∣Γ,x:τ

If you’re not a fan of mathematical notation and greek letters, no worries: this basically means we have an ordered list of pairs [(Name, Type)]. And whenever we say that an expression is well-typed, we imply that it is well-typed in a given context Γ\GammaΓ.

For example, can you say if x + 1 is well-typed? Not without knowing the type of x, so the context Γ\GammaΓ must have an entry such as x : Int or x : String. In other words, the formal definition of a type system takes it for granted that when we type check an expression, we already have all of its dependencies in the typing context.

That is also how GHC operates: there’s the notion of a type-checking environment, quite similar to the notion of a typing context. When we type check a definition, we add it to the environment, and all subsequent definitions can refer to it.

Dependency analysis: a bird’s eye view

The first step of the dependency analysis is to separate types from terms. Currently, types can (almost) never depend on terms, so types are always checked before terms. This, of course, will have to change in the future if we want to implement dependent types, but that is out of the scope of this article.

Type-level declarations: Nat(Z, S), Parity(Even, Odd)
Term-level declarations: parity, flipP

So even before we start looking at the dependencies, we know that Nat and Parity will be checked before flipP and parity.

In this article, We will primarily focus on the type-level declarations, as that is where we have some open issues and ongoing development. But first a small detour into Template Haskell, which, turns out, blurs the line between terms and types in an unexpected way.

Template Haskell splices

Above I’ve mentioned that types can never depend on terms, but there is one exception to this, and that is Template Haskell splices. Let’s take a look at this example:

boodDesc =
  [("id", [t| UUID |])
  ,("title", [t| Text |])
  ,("author", [t| Text |])

$(generateRecord "Book" 'bookDesc)
-- data Book = Book { bookId :: UUID, bookTitle :: Text, bookAuthor :: Text }

type Library = [Book]

Here we have a Template Haskell function generateRecord that takes a term-level list of fields and uses it to generate a record data type. The generated type depends on a term: if bookDesc is changed, Book will also change. So how does GHC do this if terms are always checked after types?

Before GHC does any type-checking or name resolution in a module, it splits that module into groups (called HsGroup). Each group is separated by a Template Haskell splice:

Then those groups are analyzed and type-checked individually in the order they appear in the source. After a group is type-checked, all of its declarations are added to the type-checking environment, then Template Haskell code is executed, and then GHC moves on to the next group.

A side-effect of this splitting behavior is that you can’t refer to declarations that come after a Template Haskell splice:

boodDesc =
  [("id", [t| UUID |])
  ,("title", [t| Text |])
  ,("format", [t| FileFormat |]) -- FileFormat is not in scope


$(generateRecord "Book" 'bookDesc)
-- data Book = Book { bookId :: UUID, bookTitle :: Text, bookAuthor :: Text }

type Library = [Book]
data FileFormat = EPUB | PDF

This is because the group with bookDesc will be fully processed first, and only then GHC will move on to the group with FileFormat.

In the absence of Template Haskell splices, the entire module constitues a single HsGroup.

Dependency analysis at the type level

So within a given HsGroup, after separating types and terms, how does GHC figure out the order in which to put declarations before type checking them?

Let’s look at this example and see step-by-step what happens:

data L =
  Result ListCB | Operation Op

data Op =
  Reverse L | Sort L

data ListCB =
  NilCB | ConsCB Char ListBC

data ListBC =
  NilBC | ConsBC Bool ListCB

The first thing that GHC does is go through each declaration and make a list of names that the declaration mentions:

Declaration	Mentioned names
`L`	`ListCB`, `Op`
`Op`	`L`
`ListCB`	`Char`, `ListBC`
`ListBC`	`Bool`, `ListCB`

Then GHC builds a directed graph using declarations as vertices and mentioned names as edges. For example, we can see that L mentions ListCB and Op, so it has edges to those vertices in the graph:

After building the graph, GHC will find strongly connected components. A strongly connected component (or SCC for short) is a part of a directed graph where each vertex is reachable from every other vertex. Or, in our case, it is a mutually recursive group. You can think of finding SCCs as building a “supergraph” where each vertex now is a mutually recursive group and edges are dependencies between the groups. So in the example, above we have two mutually recursive groups: ListCB with ListBC and L with Op. And the {L, Op} group depends on the {ListCB, ListBC} group.

And finally, after finding SCCs, GHC puts them in topological order:

-- Group 1
data ListCB = NilCB | ConsCB Char ListBC
data ListBC = NilBC | ConsBC Bool ListCB


-- Group 2
data L = Result ListCB | Operation Op
data Op = Reverse L | Sort L

Type family instances

Type family instances, unlike other sorts of declarations, don’t have names, and nothing can directly refer to them. Thus they’re not a part of the graph and are handled differently. Let’s take a look:

type family PropType a

data WTitle
data WResizable

type MainWindow :: forall prop -> PropType prop
type family MainWindow prop where
  MainWindow WTitle = "Text Editor"
  MainWindow WResizable = True

type instance PropType WTitle = Symbol
type instance PropType WResizable = Bool

After all of the graph and SCC business is done, we’ll get the following groups:

-- Group 1
type family PropType a


-- Group 2
data WTitle


-- Group 3
data WResizable


-- Group 4
type MainWindow :: forall prop -> PropType prop
type family MainWindow prop where
  MainWindow WTitle = "Text Editor"
  MainWindow WResizable = True

But where do the type instances go?

For every type instance, GHC also builds the list of names it mentions, just like it does for other sorts of declarations. But the instance itself has no name, so it does not participate in topological sorting. Instead, it is added to the earliest group where all of its dependencies are satisfied.

The target group is determined by going through each group one by one and removing its declarations from the list of remaining dependencies of the instance. Once the list is empty, it means all of the instance dependencies have been defined in the current and prior groups, so the instance can be inserted into the current group.

Instance	Dependencies
`PropType WTitle`	`PropType`, `WTitle`, `Symbol`
`PropType WResizable`	`PropType`, `WResizable`, `Bool`

So in our example, first Symbol and Bool will be removed from the list since they’re imported. This leaves us with the following:

Instance	Dependencies (remaining)
`PropType WTitle`	`PropType`, `WTitle`
`PropType WResizable`	`PropType`, `WResizable`

Then GHC will process the first group, the one that defines PropType, and remove PropType from dependencies of both instances. So then we have:

Instance	Dependencies (remaining)
`PropType WTitle`	`WTitle`
`PropType WResizable`	`WResizable`

Next it will get to the group with WTitle. WTitle will be removed from the list of dependencies, and since it’s the last dependency of PropType WTitle, the instance is inserted into that group:

-- Group 2
data WTitle
type instance PropType WTitle = Symbol -- inserted

And then the same thing happens on the next group with PropType WResizable, making our final groups look like this:

type family PropType a


data WTitle
type instance PropType WTitle = Symbol


data WResizable
type instance PropType WResizable = Bool


type MainWindow :: forall prop -> PropType prop
type family MainWindow prop where
  MainWindow WTitle = "Text Editor"
  MainWindow WResizable = True

And this algorithm works pretty well. However, it is not failproof. Let’s take a look at an example where it falls short!

Current algorithm’s shortcoming

Example I: `Open`

type family Open a
type instance Open Bool = Nat
type instance Open Char = F Float
type instance Open Float = Type

type F :: forall a -> Open a
type family F a
type instance F Bool = 42 -- :: Open Bool (~ Nat)
type instance F Char = '[0, 1] -- :: Open Char (~ F Float ~ [Nat])
type instance F Float = [Nat] -- :: Open Float (~ Type)

There’s a lot to unpack here, and we’ll start with the Open type family. At its core, it’s a very simple type family: it just maps some Type into some other Type, like Open Bool equals to Nat. Open Char is more interesting because it equals to F Float, so let’s look at F.

F takes some type a and should return something of kind Open a. For example, F Bool should return something with kind Open Bool. Since Open Bool is Nat, it should return Nat as it does.

F Char is the trickiest one. It should return something of kind Open Char, but Open Char is F Float. If we look at F Float, we’ll see it reduces to [Nat], so in the end F Char should return something with kind [Nat] as it does.

So this code is correct, everything should type check here. However, if we try to compile that code in current GHC, we’ll get this error:

• Expected kind ‘Open Char’,
  but ‘'[0, 1]’ has kind ‘[Nat]’

So GHC complains that Open Char is not [Nat], but we know that it is. Why doesn’t GHC know that? That’s because of how instances are grouped with the current algorithm:

type family Open a
type instance Open Bool = Nat
type instance Open Float = Type


type F :: forall a -> Open a
type family F a
type instance Open Char = F Float
type instance F Bool = 42
type instance F Char = '[0, 1]
type instance F Float = [Nat]

Notice how all of the F’s instances as well as Open Char end up in the same group. This is a problem because to kind check F Char, we need to know what Open Char and F Float reduce to, or in other words, those instances should already be in the type-checking environment. But since they’re part of the same mutually recursive group, they are checked at the same time, and neither of them is in the type-checking environment yet.

Example II: `IxKind`

To better showcase the problem, let’s take a look at another example that exhibits the same issue:

type family IxKind (m :: Type) :: Type
type family Value (m :: Type) :: IxKind m -> Type
data T (k :: Type) (f :: k -> Type) = MkT

type instance IxKind (T k f) = k
type instance Value (T k f) = f

Here we have two type families:

IxKind is fairly simple, just maps a Type to some other Type
Value takes some type m and returns a function from IxKind m into Type

And then there’s a data type T in which the kind of its second parameter depends on the first parameter.

Now to expose the problem, let’s write out kinds explicitly in the Value (T k f) instance:

type instance Value (T k f) = f -- :: k -> Type, from the signature of T
           -- Value (T k f) :: IxKind (T k f) -> Type, from the signature of Value

So as we can see, we’re expected to return IxKind (T k f) -> Type, but we’re returning k -> Type. So to type check this, we need to know whether k ~ IxKind (T k f). And if we look at the IxKind (T k f) instance, we’ll see that this is indeed the case. But in order to make use of this information, the IxKind instance must be added to the type-checking environment before the Value instance. Alas, they end up in the same group:

type family IxKind (m :: Type) :: Type


type family Value (m :: Type) :: IxKind m -> Type


data T (k :: Type) (f :: k -> Type) = MkT
type instance IxKind (T k f) = k
type instance Value (T k f) = f

Currently, there’s no mechanism by which dependency analysis would detect that the Value instance depends on the IxKind instance.

How do other languages solve this?

At this point, one might wonder: we have these advanced languages like Idris or Agda that are already dependently typed. How do they tackle this issue? Well, the truth is that they don’t have these issue at all because they just check declarations in their written order like every other language.

Could this be done in GHC? Well, yes, but actually no. Checking declarations in the order they’re written in would break a lot of currently existing code and would deviate from the standard. But perhaps we could have an extension that would let us have “ordered” blocks of code (kind of the inverse of Idris’ mutual keyword)? We could, but this approach is considered non-Haskelly.

The TH splices workaround

If you recall from earlier, Template Haskell splices split the code into groups that are checked strictly in order. We can use this to our advantage to kind of implement the non-Haskelly approach today.

type family Open a
type instance Open Bool = Nat
type instance Open Float = Type


$(return [])
type F :: forall a -> Open a
type family F a
type instance Open Char = F Float
type instance F Bool = 42
type instance F Float = [Nat]


$(return [])
type instance F Char = '[0, 1]

Here, by splitting the groups properly, we force Open Char and F Float to be type checked before GHC gets to type check F Char, making this code compile.

But that’s hacky. So how can we improve the current algorithm so that we don’t have to do this?

The `:sig` and `:def` notation

More often than not, depending on an entire declaration is too strong of a dependency. For example, to type check data T = MkT Bool, we need to know Bool’s kind (which is Type) but not much else. We don’t care what constructors Bool has or anything else about it. So really, this declaration should only depend on signature of Bool and shouldn’t depend on its definition.

Introducing :sig and :def notation, short for signature and definition, respectively. You can think of signature as the left-hand of a declaration and of definition as the right-hand side of a declaration.

data Bool = True | False
-- ^^^^ ^^^^^^^^^^^^
-- Bool:sig Bool:def


type family F a; type instance F Int = Bool;
                 type instance F Bool = Int;
-- ^^^^^ ^^^^^^^^^^^^
-- F:sig F:def

For open type families, all instances are considered to be a part of the same :def block and are grouped together. Instead of having declarations as vertices, our directed graph will have two types of vertices: :sigs and :defs. And now that we have different kinds of vertices, we need new rules for inferring edges (dependencies):

Mentioning the name of a type-level thing adds a dependency to that thing’s :sig. For example, F:def above depends on Bool:sig
Each :def depends on its corresponding :sig.
Mentioning the name of a promoted data constructor adds a dependency to the parent’s :def of that data constructor. For example, mentioning 'False adds a dependency to Bool:def.
If a data type has no explicit kind, add a dependency from its :sig to its :def

Why do we need that last rule? Let’s take a look at this simple data type:

data T a = MkT a

Now if we were to split :sig from :def here, it could happen that GHC will type-check :sig before type-checking :def. Let’s see what happens then:

type T a -- GHC will infer the most general type here:
              -- T a :: forall k. k -> Type

And then it will get to type checking the :def:

data T a = MkT a

And this is where the mismatch occurs, the definition expects a to have kind Type, since data constructors could only have things of kind Type in them, but GHC inferred earlier that a is kind-polymorphic.

So unless the user provides an explicit kind signature, we cannot split :sig from :def because that might interfere with kind inference. To solve this, we add an extra edge from :sig to :def in this case, making them a mutually recursive group, which guarantees that they’ll be checked together.

Fixing the `IxKind` example

Now let’s see what the new graph looks like for our example with IxKind from earlier:

type family IxKind (m :: Type) :: Type
-- ^^^^^^^^^^^^^^^^^^^^^^^^^^
-- IxKind:sig

type family Value (m :: Type) :: IxKind m -> Type
-- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-- Value:sig

data T (k :: Type) (f :: k -> Type) = MkT
-- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^
-- T:sig T:def

type instance IxKind (T k f) = k
-- ^^^^^^^^^^^^^^^^^^
-- IxKind:def

type instance Value (T k f) = f
-- ^^^^^^^^^^^^^^^^^
-- Value:def

Looks cool! But if you pay close attention, this doesn’t solve our original issue. Our original issue was that we really wanted to type-check IxKind (T k f) before we type-check Value (T k f), but there’s no dependency between them on this graph!

But compared to our previous algorithm, we can now at least indicate this dependency on the graph. All we need to do now is to come up with a rule that will infer this dependency for us.

One of the first proposed rules was “for every dependency from one :sig to another :sig, also add a dependency between corresponding :defs”. This would work in this case, but sometimes this rule infers unnecessary dependencies and sometimes it does not infer enough dependencies.

At this point, what exact set of rules to use is still somewhat an open research question. Some other rules were proposed, but we haven’t tried them yet because there’s a more pressing issue at hand.

Oops! All panics!

The nature of this new algorithm is to split signatures from definitions. Let’s look at this code:

type T :: Type
data E = MkE T
data T

After building the graph, finding SCCs, etc., we’ll get the following groups:

type T :: Type


data E = MkE T


data T

Notice how we use T in E after we type checked the T’s signature but before its definition. Theoretically, that should be enough, however, the type checker makes a lot of assumptions about what is available in the environment and what is not and it cannot handle a situation like that resulting in a crash.

The work on dependency analysis is currently blocked by the type checker crash issue, and it turns out fixing that issue requires some major refactoring in the type checker code-base, which is currently ongoing.

Conclusion

This article is made after my talk (with the same name) at ZuriHac 2021, which you can watch on our YouTube channel.

To be informed about new articles on GHC and our current work on it, don’t forget to follow us on Twitter or Dev. You can also read more about our work by exploring the GHC tag.

Top comments (6)

luis medina • Sep 6 '21

Where can I learn to understand that notation, it would help me to know what it is called. It is lambda calculus

Serokell • Sep 7 '21 • Edited

Hi! Which notation do you mean, exactly?

luis medina • Sep 8 '21

Γ::=∅∣Γ,x:τ\Gamma ::= \varnothing \mid \Gamma, x : \tau
Γ::=∅∣Γ,x:τ
👆this notation, thanks for answering

Serokell • Sep 8 '21

This is incorrectly rendered LaTeX from our website. You can see the original equation here: serokell.io/blog/dependency-analys...

I'm not the author, but AFAIK, it is not connected with lambda calculus, it's math notation from type theory. If you really would like to learn more about type theory, probably something like this would be a good start.