DEV Community

Cover image for Tsonnet #19 - Type checking and semantic analysis
Hercules Lemke Merscher
Hercules Lemke Merscher

Posted on • Originally published at bitmaybewise.substack.com

Tsonnet #19 - Type checking and semantic analysis

Welcome to the Tsonnet series!

If you're just joining, you can check out how it all started in the first post of the series.

In the previous post, Tsonnet has been augmented with property-based tests:

This allows us to add new features to the language while being relatively certain that things will continue working in a myriad of scenarios.

Now it's time to take the first step towards type safety in Tsonnet.

Before starting

Before continuing, I want to clean up lib/tsonnet.ml. There's too much code in there and I prefer that it stays as a clean orchestrator, therefore, I'm moving the interpreter related functions to its own module.

I won't bother copying the change here, but you can see the diff here.

The code now lives in a new file: lib/interpreter.ml. This makes lib/tsonnet.ml much more comprehensible:

open Result
open Syntax_sugar

(** [parse s] parses [s] into an AST. *)
let parse (filename: string) =
  let input = open_in filename in
  let lexbuf = Lexing.from_channel input in
  Lexing.set_filename lexbuf filename;
  let result =
    try ok (Parser.prog Lexer.read lexbuf)
    with
    | Lexer.SyntaxError err -> (Error.trace err (Ast.pos_from_lexbuf lexbuf)) >>= error
    | Parser.Error -> (Error.trace "Invalid syntax" (Ast.pos_from_lexbuf lexbuf)) >>= error
  in
  close_in input;
  result

let run (filename: string) : (string, string) result =
  parse filename
    >>= Interpreter.eval
    >>= Json.expr_to_string

Enter fullscreen mode Exit fullscreen mode

Semantic analysis

While Tsonnet's feature set is still modest -- primitive types, locals, and binary/unary operations -- there are semantic issues we need to address that go beyond simple type verification.

The most pressing issue is detecting invalid binding cycles. Currently, Tsonnet can get stuck in infinite loops during evaluation -- something a proper semantic analysis phase should catch. Consider this example:

// samples/semantics/valid_binding_cycle.jsonnet
local a = b;
local b = d;
local c = a;
local d = 1;
a
Enter fullscreen mode Exit fullscreen mode

Jsonnet evaluates top to bottom, and if a variable is not found predefined earlier, it fails immediately:

$ jsonnet samples/semantics/valid_binding_cycle.jsonnet
samples/semantics/valid_binding_cycle.jsonnet:1:11-12 Unknown variable: b

local a = b;

Enter fullscreen mode Exit fullscreen mode

As we explored in Tsonnet 16 - Late binding and Jsonnet inconsistency, Jsonnet's evaluation model prevents it from resolving forward references, despite claiming lazy evaluation.

Tsonnet supports late binding, and this binding cycle, although seemingly confusing, is valid:

dune exec -- tsonnet samples/semantics/valid_binding_cycle.jsonnet
1
Enter fullscreen mode Exit fullscreen mode

Here's an actually invalid case:

// samples/semantics/invalid_binding_cycle.jsonnet
local a = c;
local b = d;
local c = a;
local d = 1;
a
Enter fullscreen mode Exit fullscreen mode

Lazy evaluation cannot help here, because there's an infinite cycle.

Jsonnet does behave correctly here, due to its eagerly evaluated binding nature:

jsonnet samples/semantics/invalid_binding_cycle.jsonnet
samples/semantics/invalid_binding_cycle.jsonnet:1:11-12 Unknown variable: c

local a = c;

Enter fullscreen mode Exit fullscreen mode

In Tsonnet though, we have an infinite loop -- this should be fixed by semantic analysis performed by the type checker.

A minimal type checker

Before evaluating our code, we'll type-check our program:

diff --git a/lib/tsonnet.ml b/lib/tsonnet.ml
index 10a860b..6a55f4b 100644
--- a/lib/tsonnet.ml
+++ b/lib/tsonnet.ml
@@ -17,5 +17,6 @@ let parse (filename: string) =

 let run (filename: string) : (string, string) result =
   parse filename
+    >>= Type.check
     >>= Interpreter.eval
     >>= Json.expr_to_string
Enter fullscreen mode Exit fullscreen mode

See how clean it is to add intermediate steps to the compiler now?!

The refactored architecture makes adding compiler phases remarkably clean -- this is why separating concerns matters.

I'm also adding a new helper function to Error module to reuse across different modules. This is a pretty common pattern we've encountered so far in the interpreter and now will be shared with the type checker:

let error_at pos = fun msg -> trace msg pos >>= error
Enter fullscreen mode Exit fullscreen mode

The Type.check function serves two purposes: it translates expressions to their corresponding types and performs semantic validation. Here's the core type system:

(* lib/type.ml *)
open Ast
open Result
open Syntax_sugar

type tsonnet_type =
  | Tunit
  | Tnull
  | Tbool
  | Tnumber
  | Tstring
  | Tobject of (string * tsonnet_type) list
  | Tarray of tsonnet_type
  | Lazy of expr

let translate_late_binding translate_fun = fun venv expr ->
  match expr with
  | Lazy expr -> translate_fun expr venv
  | _ -> error "Expected a lazy evaluated expression"

let rec translate expr venv =
  match expr with
  | Null _ -> ok (venv, Tnull)
  | Bool _ -> ok (venv, Tbool)
  | Number _ -> ok (venv, Tnumber)
  | String _ -> ok (venv, Tstring)
  | Ident (pos, varname) ->
    Env.find_var varname venv
      ~succ:(translate_late_binding translate)
      ~err:(Error.error_at pos)
  | Local (_, vars) ->
    let venv' =
      (List.fold_left
        (fun venv (varname, var_expr) ->
          (* Adds an expr to the env to be evaluated at a later point in time (when required) *)
          Env.Map.add varname (Lazy var_expr) venv
        )
        venv
        vars
      )
    in ok (venv', Tunit)
  | Seq exprs ->
    List.fold_left
      (fun acc expr -> acc >>= fun (venv, _) -> translate expr venv)
      (ok (venv, Tunit))
      exprs
  | _ -> error "Not yet implemented"

let check expr =
  translate expr Env.empty >>= fun _ -> ok expr

Enter fullscreen mode Exit fullscreen mode

These types mirror Tsonnet's values but exist purely for compile-time analysis. The Lazy constructor is particularly important -- it represents unevaluated expressions that we'll resolve when needed, preserving our lazy evaluation semantics during type checking.

We need a bucket to store our types, our environment. Here we are re-using the Env module, but instead of having Env of string * expr, we'll have instead Env of string * tsonnet_type.

The translate function recursively descents expressions until it reaches the bottom -- the concrete (primitive) types.

This code is straightforward to understand. A crucial detail: we can't immediately type-check local bindings because of lazy evaluation. Instead, we store them as Lazy expressions in the environment. Only when an Ident forces evaluation do we recursively type-check the referenced expression.

This is enough to make the valid binding case to pass:

dune exec -- tsonnet samples/semantics/valid_binding_cycle.jsonnet
1
Enter fullscreen mode Exit fullscreen mode

But this actually breaks existing code, because we aren't type-checking some literals like arrays and operations yet:

dune exec -- tsonnet samples/literals/array.jsonnet
Not yet implemented

dune exec -- tsonnet samples/binary_operations.jsonnet
Not yet implemented
Enter fullscreen mode Exit fullscreen mode

Recursing on the rest of types

We need a Tany type for cases where precise type inference isn't possible or necessary. This commonly occurs with heterogeneous arrays -- while JSON allows mixed-type arrays, we'll track them as Tany rather than creating complex union types for now:

diff --git a/lib/type.ml b/lib/type.ml
index e6f4bf3..ecd152a 100644
--- a/lib/type.ml
+++ b/lib/type.ml
@@ -8,10 +8,25 @@ type tsonnet_type =
   | Tbool
   | Tnumber
   | Tstring
   | Tobject of (string * tsonnet_type) list
+  | Tany
   | Tarray of tsonnet_type
   | Lazy of expr
Enter fullscreen mode Exit fullscreen mode

And a to_string function to make error messages more user-friendly:

let rec to_string = function
  | Tunit -> "()"
  | Tnull -> "Null"
  | Tbool -> "Bool"
  | Tnumber -> "Number"
  | Tstring -> "String"
  | Tany -> "Any"
  | Tarray ty -> "Array of " ^ to_string ty
  | Tobject fields ->
    "{" ^ (
      String.concat ", " (List.map (fun (field, ty) -> field ^ " : " ^ to_string ty) fields)
    ) ^ "}"
  | Lazy ty -> string_of_type ty
Enter fullscreen mode Exit fullscreen mode

Most of the types are direct translations. Product types such as array and object are the ones carrying more complexity:

let rec translate expr venv =
  match expr with
  | Unit -> ok (venv, Tunit)
  | Null _ -> ok (venv, Tnull)
  | Bool _ -> ok (venv, Tbool)
  | Number _ -> ok (venv, Tnumber)
  | String _ -> ok (venv, Tstring)
  | Ident (pos, varname) ->
    Env.find_var varname venv
      ~succ:(translate_late_binding translate)
      ~err:(Error.error_at pos)
  | Array (_pos, elems) ->
    (* As of now, we compare each element and if all have the same type,
      it is an array of this type, otherwise it will be an array of any.
      Since JSON doesn't care about types, this is the simpler way of
      handling such case in compatibility with Jsonnet. However, when we
      actually use the items, given the flexibility of JSON and Jsonnet
      regarding types, we may need to support polymorphic arrays in a more
      flexible way, but I'm leaving this problem unsolved for now until I
      finally need to.
    *)
    (match elems with
    | [] -> ok (venv, Tany)
    | elem :: rest ->
      let ty = Lazy elem in
      let* (venv, ty) =
        List.fold_left
          (fun acc elem -> acc >>= fun (venv, ty) ->
            let elem_ty = Lazy elem in
            if ty = elem_ty
            then ok (venv, elem_ty)
            else ok (venv, Tany)
          )
          (ok (venv, ty))
          rest
      in ok (venv, Tarray ty)
    )
  | Object (_pos, elems) ->
    let* fields =
      List.fold_left
        (fun acc (attr, expr) ->
          let* attrs = acc in
          let* (_, ty) = translate expr venv in
          ok ((attr, ty) :: attrs)
        )
        (ok [])
        elems
    in ok (venv, Tobject fields)
  | Local (_, vars) ->
    let venv' =
      (List.fold_left
        (fun venv (varname, var_expr) ->
          (* Adds an expr to the env to be evaluated at a later point in time (when required) *)
          Env.Map.add varname (Lazy var_expr) venv
        )
        venv
        vars
      )
    in ok (venv', Tunit)
  | Seq exprs ->
    List.fold_left
      (fun acc expr -> acc >>= fun (venv, _) -> translate expr venv)
      (ok (venv, Tunit))
      exprs
  | BinOp (pos, op, e1, e2) ->
    (let* (venv', e1') = translate e1 venv in
    let* (venv'', e2') = translate e2 venv' in
    match op, e1', e2' with
    | _, Tnumber, Tnumber -> ok (venv'', Tnumber)
    | Add, _, Tstring | Add, Tstring, _ -> ok (venv'', Tstring)
    | _ -> Error.trace "Invalid binary operation" pos >>= error
    )
  | UnaryOp (pos, op, expr) ->
    (let* (venv', expr') = translate expr venv in
    match op, expr' with
    | Plus, Tnumber | Minus, Tnumber | BitwiseNot, Tnumber -> ok (venv', Tnumber)
    | Not, Tbool | BitwiseNot, Tbool -> ok (venv', Tbool)
    | _ -> Error.trace "Invalid unary operation" pos >>= error
    )
  | IndexedExpr (pos, varname, index_expr) ->
    (let* (venv', index_expr') = translate index_expr venv in
    match index_expr' with
    | Tnumber ->
      Env.find_var varname venv'
        ~succ:(fun venv' expr' ->
          match expr' with
          | (Tarray _) as ty -> ok (venv', ty)
          | Tstring as ty -> ok (venv', ty)
          | Lazy expr -> translate expr venv
          | ty -> error (to_string ty ^ " is a non indexable value")
        )
        ~err:(Error.error_at pos)
    | ty -> Error.trace ("Expected Integer index, got " ^ to_string ty) pos >>= error
    )
Enter fullscreen mode Exit fullscreen mode

All primitive types are now covered, and invalid binary and unary operations can now be checked by the type checker too.

We take a pragmatic approach to array typing: if all elements have the same type, we infer a homogeneous array type. Otherwise, we fall back to Tany.

This works great for all tests but the invalid cyclic reference.

Checking cyclic references

Now for the main event: detecting those infinite cycles that could crash our interpreter.

After adding all local bindings to the environment, we need to validate that none create circular dependencies. The algorithm walks the dependency graph, maintaining a "seen" list to detect cycles:

| Local (pos, vars) ->
    let venv' = List.fold_left
      (* Adds an expr to the env to be evaluated at a later point in time (when required) *)
      (fun venv (varname, var_expr) -> Env.Map.add varname (Lazy var_expr) venv)
      venv
      vars
    in
    let* () = List.fold_left
      (fun ok' (varname, _) -> ok' >>= fun _ -> check_cyclic_refs venv' varname [] pos)
      (ok ())
      vars
    in ok (venv', Tunit)
Enter fullscreen mode Exit fullscreen mode

The check_cyclic_refs function will retrieve the variables from the environment and check if they've been seen before, if yes it means we have a cyclic reference:

let rec check_cyclic_refs venv varname seen pos =
  if List.mem varname seen
  then
    Error.trace ("Cyclic reference found for " ^ varname) pos >>= error
  else
    match Env.Map.find_opt varname venv with
    | Some (Lazy expr) -> check_expr_for_cycles venv expr (varname :: seen)
    | _ -> ok ()
and check_expr_for_cycles venv expr seen =
  match expr with
  | Unit | Null _ | Number _ | String _ | Bool _ -> ok ()
  | Array (_, exprs) -> iter_for_cycles venv seen exprs
  | Object (_, fields) -> iter_for_cycles venv seen (List.map snd fields)
  | Ident (pos, varname) -> check_cyclic_refs venv varname seen pos
  | BinOp (_, _, e1, e2) -> iter_for_cycles venv seen [e1; e2]
  | UnaryOp (_, _, e) -> check_expr_for_cycles venv e seen
  | Seq exprs -> iter_for_cycles venv seen exprs
  | _ -> ok ()
and iter_for_cycles venv seen exprs =
  List.fold_left
    (fun ok' expr -> ok' >>= fun _ -> (check_expr_for_cycles venv expr seen))
    (ok ())
    exprs
Enter fullscreen mode Exit fullscreen mode

The cycle detection must recursively examine every expression because lazy evaluation means cycles can hide deep within complex expressions. A variable might reference another variable buried inside an array or object literal, creating indirect cycles.

Now the invalid binding is caught by the compiler:

dune exec -- tsonnet samples/semantics/invalid_binding_cycle.jsonnet
samples/semantics/invalid_binding_cycle.jsonnet:1:10 Cyclic reference found for c

1: local a = c;
   ^^^^^^^^^^^^
Enter fullscreen mode Exit fullscreen mode

And many other invalid cases. Such as:

  1. Binding to itself:
// samples/semantics/invalid_binding_itself.jsonnet
local a = a;
a
Enter fullscreen mode Exit fullscreen mode
dune exec -- tsonnet samples/semantics/invalid_binding_itself.jsonnet
samples/semantics/invalid_binding_itself.jsonnet:1:10 Cyclic reference found for a

1: local a = a;
   ^^^^^^^^^^^^
Enter fullscreen mode Exit fullscreen mode
  1. Invalid cyclic binding in array:
// samples/semantics/invalid_binding_cycle_array.jsonnet
local a = b,
    b = c,
    c = (local d = 1, e = 2; [a, b, c, d, e]);
b
Enter fullscreen mode Exit fullscreen mode
dune exec -- tsonnet samples/semantics/invalid_binding_cycle_array.jsonnet
samples/semantics/invalid_binding_cycle_array.jsonnet:3:30 Cyclic reference found for a

3:     c = (local d = 1, e = 2; [a, b, c, d, e]);
   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Enter fullscreen mode Exit fullscreen mode
  1. Invalid cyclic binding in object:
// samples/semantics/invalid_binding_cycle_object.jsonnet
local obj = { a: 1, b: 2, c: obj };
obj
Enter fullscreen mode Exit fullscreen mode
dune exec -- tsonnet samples/semantics/invalid_binding_cycle_object.jsonnet
samples/semantics/invalid_binding_cycle_object.jsonnet:1:29 Cyclic reference found for obj

1: local obj = { a: 1, b: 2, c: obj };
   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Enter fullscreen mode Exit fullscreen mode
  1. Invalid cyclic binding in binary operation:
// samples/semantics/invalid_binding_cycle_binop.jsonnet
local a = b;
local b = a + 1;
a
Enter fullscreen mode Exit fullscreen mode
dune exec -- tsonnet samples/semantics/invalid_binding_cycle_binop.jsonnet
samples/semantics/invalid_binding_cycle_binop.jsonnet:1:10 Cyclic reference found for b

1: local a = b;
   ^^^^^^^^^^^^
Enter fullscreen mode Exit fullscreen mode
  1. Invalid cyclic binding in unary operation:
// samples/semantics/invalid_binding_cycle_unaryop.jsonnet
local a = b;
local b = ~a;
a
Enter fullscreen mode Exit fullscreen mode
dune exec -- tsonnet samples/semantics/invalid_binding_cycle_unaryop.jsonnet
samples/semantics/invalid_binding_cycle_unaryop.jsonnet:1:10 Cyclic reference found for b

1: local a = b;
   ^^^^^^^^^^^^
Enter fullscreen mode Exit fullscreen mode

Conclusion

As of the time of writing this, Tsonnet remains a relatively small language, featuring primitive types, local bindings, binary and unary operations. The type checker represents a significant improvement that will prove even more valuable as we introduce upcoming features.

So far, when it comes to type-checking, we need only one environment -- the value environment -- to translate inferred types from expressions. At some point, type annotations will be added, thus requiring a second environment: the type environment. I'm deferring this complexity until we actually need it -- it will become relevant for functions, typed objects, and likely any disambiguation scenarios when the compiler can't infer the type.

I've been enjoying OCaml since I started Tsonnet; however, introducing the type checker made me realize the language was truly an excellent choice for writing a compiler. The pattern matching and algebraic data types make expressing type relationships and transformations remarkably clean and intuitive.


Thanks for reading Bit Maybe Wise! Subscribe now -- my newsletter passes all semantic analysis checks!

Top comments (0)