DEV Community: Dirk Mattig

Inside Grotto: Isolation and the API Boundary

Dirk Mattig — Tue, 21 Jul 2026 04:15:13 +0000

Last time I started looking at some of Grotto's implementation details while going through the budget system.

This time I want to take a step back and look at the overall picture. How does Grotto make it safe to embed the runtime in a host application and execute untrusted code inside it? And how does the host get its own APIs into the runtime for the code to call in the first place?

The language is the sandbox

Half the work is already done by the language itself. Neander follows the guiding design principle that the language is the sandbox.

A Neander program has no file I/O, no sockets, no system access. It can only reach the APIs the embedding application registered, and even those only by going through discover and call. On top of that, every valid program is guaranteed to terminate, and the budget system stops it before it can drain the host's resources.

All that is left for Grotto is to make this design a reality.

The grotto is the fortress

The language being safe by omission handles the program's intentions. The runtime architecture handles everything else: runaway resource use, one submission interfering with another, a wedged program that will not stop.

Grotto's architecture is based on the dispatcher-worker pattern.

Only the dispatcher runs on the host's thread, and it does not touch the code itself. Every non-empty program submission that fits within the size cap runs in its own fresh worker thread, spawned for that one program and terminated once it has returned the response envelope. Two programs cannot observe or influence each other, because they never share a thread and never share memory. The only channel that exists at all is a message pipe to the dispatcher, and it carries copies, not references. The isolation sits on an OS thread boundary.

Because the program runs on a thread separate from the dispatcher, the dispatcher can always kill it. As we already saw last time, that is what makes the duration budget enforceable even against a program stuck in a tight synchronous loop with no yield point: the dispatcher keeps the clock on its own thread and, on overrun, terminates the worker outright.

API providers: the one door in

The whole safety story rests on a program being able to reach nothing except the APIs the host registered. Which raises the question the series has quietly deferred until now: how do those APIs get in, and why is that the one door that does not undo everything else?

The host registers its APIs as provider modules. Each module is one namespace, and it exports two things: a manifest that declares what the namespace offers (its functions, their parameter and return types, its named types and documents) and a handlers map that supplies the actual code behind each declared function. The manifest is what an agent sees through discover. The handlers are what a call eventually runs.

The runtime checks these once, at startup, not on every submission. When the embedding application starts the runtime, a provider validator imports every module and verifies the contract: every function the manifest declares has a matching handler and vice versa, no two modules claim the same namespace, nobody grabs the reserved runtime or main names, and every declared type actually parses and resolves. A malformed provider fails loudly, once, at start. By the time the runtime accepts a single submission, the whole API surface is known to be well-formed.
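To make the startup contract concrete, here is a minimal sketch of such a check. The shapes `ProviderModule` and `FunctionDecl`, the function `validateProviders`, and the wording of the messages are assumptions made for illustration, not Grotto's actual API. The type-parsing step is left out.

```typescript
// Hypothetical shapes: the post names "manifest" and "handlers" but not their fields.
interface FunctionDecl {
  params: Record<string, string>; // parameter name -> declared type
  returnType: string;
}

interface ProviderModule {
  manifest: { namespace: string; functions: Record<string, FunctionDecl> };
  handlers: Record<string, (args: Record<string, unknown>) => unknown>;
}

const RESERVED_NAMESPACES: string[] = ["runtime", "main"];

function hasOwn(obj: object, key: string): boolean {
  return Object.prototype.hasOwnProperty.call(obj, key);
}

// Returns every contract problem found; an empty list means the surface is well-formed.
function validateProviders(modules: ProviderModule[]): string[] {
  const problems: string[] = [];
  const claimed: Record<string, boolean> = {};

  for (const mod of modules) {
    const ns = mod.manifest.namespace;
    if (RESERVED_NAMESPACES.indexOf(ns) !== -1) {
      problems.push(`namespace "${ns}" is reserved`);
    }
    if (hasOwn(claimed, ns)) {
      problems.push(`namespace "${ns}" is claimed twice`);
    }
    claimed[ns] = true;

    // Every declared function needs a handler ...
    for (const fn of Object.keys(mod.manifest.functions)) {
      if (!hasOwn(mod.handlers, fn)) {
        problems.push(`${ns}.${fn} is declared but has no handler`);
      }
    }
    // ... and every handler needs a declaration.
    for (const fn of Object.keys(mod.handlers)) {
      if (!hasOwn(mod.manifest.functions, fn)) {
        problems.push(`${ns}.${fn} has a handler but is not declared`);
      }
    }
  }
  return problems;
}
```

A host would run a check like this once and refuse to start if the returned list is not empty.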

At execution time the handler runs in-thread, inside the worker, as an ordinary async function call. There is no per-call IPC and no serialization across a process boundary. What there is, is a marshalling boundary: values are converted between their Neander representation and plain JavaScript as they cross into and out of the handler, and every conversion is checked against the type the manifest declared.

That boundary is where the runtime hands control to code it did not write, so it is worth being precise about what can happen there. A handler that throws a conforming {code, message} error comes back to the program as an ordinary error, which it can handle like any other call failure: inspect it, substitute a default, or re-throw it. A handler that misbehaves, by throwing something malformed or returning a value that does not match its declared type, is not the program's fault and is not handed to the program. It is a provider contract violation: the execution aborts, the blame lands on the named provider, and the agent's program never sees it. Provider bugs stay provider bugs. They do not leak into the language's failure model and they do not compromise the isolation around them.
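The three outcomes can be sketched as a small classifier. This is an illustration under assumptions: `CallOutcome`, `runHandler`, and `isConformingError` are hypothetical names, the error code is assumed to be numeric, and the handler is synchronous here for brevity although real handlers are async.

```typescript
// Hypothetical outcome type; the post describes the cases but names no API for them.
type CallOutcome =
  | { kind: "value"; value: unknown }
  | { kind: "programError"; code: number; message: string } // handed to the program
  | { kind: "providerViolation"; provider: string; reason: string }; // aborts the run

// A conforming error is a {code, message} object (numeric code is an assumption).
function isConformingError(e: unknown): e is { code: number; message: string } {
  if (typeof e !== "object" || e === null) return false;
  const rec = e as Record<string, unknown>;
  return typeof rec["code"] === "number" && typeof rec["message"] === "string";
}

function runHandler(
  provider: string,
  handler: () => unknown,
  matchesDeclaredType: (value: unknown) => boolean
): CallOutcome {
  let value: unknown;
  try {
    value = handler();
  } catch (e) {
    if (isConformingError(e)) {
      // An ordinary call failure: the program may inspect, default, or re-throw it.
      return { kind: "programError", code: e.code, message: e.message };
    }
    return { kind: "providerViolation", provider: provider, reason: "handler threw a malformed error" };
  }
  if (!matchesDeclaredType(value)) {
    return {
      kind: "providerViolation",
      provider: provider,
      reason: "return value does not match the declared type",
    };
  }
  return { kind: "value", value: value };
}
```

Only the first two kinds ever reach the program. The third ends the execution and names the provider.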

So the one capability the language grants is narrow by construction: reachable only through discover and call, backed by a contract verified before anything runs, and marshalled across a typed, checked boundary every time it is used.

Defense in depth, by category

Put it together and the guarantees line up in layers, each catching a different class of problem at a different moment:

Before the program runs: validation. Anything malformed, ill-typed, or referring to APIs that do not exist is a Flaw, and the program never executes. Forging a discovery handle, mixing up types, calling a function that is not there: caught here, statically.
While it runs: budgets. Too much computation, memory, or time is an Abort. The program is stopped, the host is not.
Around the whole thing: isolation. No I/O to misuse, no sibling submission to spy on, a worker the host can terminate at will, and a provider boundary that blames the provider rather than corrupting the run.

None of these is the load-bearing one. That is the point of layering them. A program that passes validation still cannot outspend its budget. One that stays inside its budget still cannot reach the filesystem. One that somehow wedged its own thread still cannot survive the dispatcher's clock. Nothing here requires trusting the agent. It requires trusting the runtime, and the runtime is a small, dependency-free dispatcher plus a throwaway worker, open for anyone to read.

Field Notes from the Grotto

Across the series we have covered a lot of ground: how an agent finds the host's APIs, how it learns the language cold from an empty program, why its programs always stop, what they are allowed to spend, why it is safe to run them at all, and how the host's own APIs get in. The argument stands on its own now. Put the safety in the language and the runtime, by construction, and no sandbox is needed, because the things a sandbox exists to contain were never built.

For the embedder, this was the answer to the question that arrives the moment someone says "let an agent write code and run it in production": why is that not insane? Because the code the agent writes cannot do anything the host did not hand it, cannot outlast or outspend its allowance, cannot touch the request running next to it, and reaches the host's own code only across a boundary the host verified first.

This is the end of the foundational series. Thank you for taking the tour. There will be a couple of encores, standalone entries on parts that reward a closer look on their own terms: the full failure taxonomy, from in-program failable values to the one top-level verdict every submission returns, and the type system, structural and recursive, down to the arbitrary-precision numbers.

There is still time to read the Neander spec, embed Grotto in your own app, and let me know if your APIs got called.

Mind your head on the way out.

The Budget System: Thalers, Bytes, and Milliseconds

Dirk Mattig — Fri, 17 Jul 2026 03:50:20 +0000

A Neander program is guaranteed to stop, as we saw last time. But "stops eventually" is not good enough in a real-world scenario, where the program runs inside the host application and consumes its resources while it does.

So on top of a sub-Turing language, Neander features a budget system that places hard upper bounds on a program in six different dimensions. Three of them are budgets in the literal sense: they bound what a running program is allowed to consume. The other three are static caps on the shape of the program itself, checked before it is allowed to run at all. All six are set and enforced by the runtime, never declared, requested, or negotiated by an agent or a program. The agent can obtain the current limits from the runtime by submitting the empty program (which it would do anyway during a cold start): the meta block of the Reference Response contains the runtime's budget configuration.

It is worth stressing that the budget system is an integral part of the language definition, not merely a bolted-on feature of one particular runtime implementation.

What is runtime-specific are the details of how and in what amount these upper bounds are assigned to a submitted program. The Grotto reference implementation, for example, currently requires a static budget configuration at startup time and applies it to every program submission.

What is normative, however, is the consequence of exceeding one. Either way the program produces a Failure result, but the two kinds of bound fail differently. Overrunning a static cap is a Flaw: the program is rejected outright, before execution. Overspending a budget is an Abort: execution stops immediately. There is exactly one exception to this rule, and we will get to it in a minute.

Computation

Computation is measured in Thalers, Neander's unit of computational work, named after an old silver coin. Every operation that actually computes something costs one Thaler. Things that do not compute are free, e.g. binding a name, returning a value, a literal. The Neander specification contains a complete Thaler Cost Table:

Operation	Cost
Arithmetic op (`+`, `-`, `*`, `/`, `%`)	1 Thaler
Comparison (`==`, `!=`, `>`, `<`, `>=`, `<=`)	1 Thaler per base-type comparison
Logical op (`and`, `or`, `not`)	1 Thaler
`is` type check	1 Thaler
`??` null coalescing	1 Thaler
Field access (`.field`)	1 Thaler
List index (`[i]`)	1 Thaler
Map access (`["key"]`)	1 Thaler
List projection (`[*].field`)	1 Thaler per element
`contains()` / `indexOf()` on list	1 Thaler per element checked
`replace()` / `split()` on string	1 Thaler per occurrence found
Other built-in function call	1 Thaler
`call` (API function)	1 Thaler
`each` iteration	1 Thaler per iteration (body costs are additional)
`repeat` iteration	1 Thaler per iteration (body costs are additional)
`discover`	1 Thaler
`if` evaluation	0 (the condition's operations already cost Thalers)
`let` binding	0 (the expression's operations already cost Thalers)
`return` / `throw` / `yield` / `skip`	0 (the expression's operations already cost Thalers)
Literal value	0
Spread (`..`) in list or map literal	1 Thaler per element or entry spread
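As a worked example of the table, the sketch below meters the paging loop `repeat pageCount limit 20 as i { call orders.list(offset: i * 100, limit: 100) }`. The class `ThalerMeter` and the function `costOfPagingLoop` are hypothetical. It assumes that reading a variable and passing named arguments are free (neither appears in the table) and that spending exactly the budget is still allowed.

```typescript
// Unit costs copied from the table; only the rows this example needs.
const COST = { arithmetic: 1, call: 1, repeatIteration: 1 };

class ThalerMeter {
  consumed: number = 0;
  budget: number;

  constructor(budget: number) {
    this.budget = budget;
  }

  // Charge before doing the work; overspending the budget is an Abort.
  charge(thalers: number): void {
    if (this.consumed + thalers > this.budget) {
      throw new Error("Abort: Thaler budget exhausted");
    }
    this.consumed += thalers;
  }
}

// Meters the loop body once per iteration and returns the total spent.
function costOfPagingLoop(pageCount: number, meter: ThalerMeter): number {
  for (let i = 0; i < pageCount; i++) {
    meter.charge(COST.repeatIteration); // the iteration itself
    meter.charge(COST.arithmetic);      // i * 100
    meter.charge(COST.call);            // call orders.list(...)
  }
  return meter.consumed;
}
```

Under these assumptions each iteration costs 3 Thalers, so five pages cost 15 and the full 20 pages cost 60.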

Memory

Memory is measured in whole kilobytes, and the memory budget is a ceiling on peak allocation. API return values, large lists and maps, deep record structures: all of it is charged, and an allocation that would cross the ceiling is refused before it happens.

In contrast to computation costs, which are standardized by the language, memory costs are heavily implementation-dependent. The Neander design gives each runtime implementation the breathing space it needs to account for the memory allocation specifics of its underlying platform. What is standardized is the contract around the number: that the ceiling binds peak allocation, that crossing it stops the program before the allocation happens, and that the figures reported back are whole kilobytes. Thalers are portable across conforming runtimes. Kilobytes are not.
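The standardized part of that contract fits in a few lines. The following is a sketch, not Grotto's accounting: `MemoryLedger` is a hypothetical name, byte-level bookkeeping with 1 KB = 1024 bytes is an assumption, and so is rounding the reported figure up.

```typescript
class MemoryLedger {
  usedBytes: number = 0;
  peakBytes: number = 0;
  ceilingBytes: number;

  constructor(ceilingKb: number) {
    this.ceilingBytes = ceilingKb * 1024;
  }

  // Refuse the allocation before it happens if it would cross the ceiling.
  reserve(bytes: number): void {
    if (this.usedBytes + bytes > this.ceilingBytes) {
      throw new Error("Abort: memory budget exhausted");
    }
    this.usedBytes += bytes;
    if (this.usedBytes > this.peakBytes) {
      this.peakBytes = this.usedBytes;
    }
  }

  // Freed memory lowers current usage but never the recorded peak.
  release(bytes: number): void {
    this.usedBytes = Math.max(0, this.usedBytes - bytes);
  }

  // Reported figures are whole kilobytes (rounding up is an assumption).
  peakKb(): number {
    return Math.ceil(this.peakBytes / 1024);
  }
}
```

The point of the sketch is the order of operations: the check comes first, so a refused allocation leaves the ledger unchanged.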

Time

Time is measured in wall-clock milliseconds, and there exist two separate duration budgets. The first limits the overall program execution duration, including in particular all API calls. The second limits the duration of each API call individually, so one slow API cannot quietly eat the whole duration budget on its own.

The per-call timeout is the exception promised above. Exceeding it does not stop the program. Instead, the API call returns a recoverable runtime error, and execution continues. The reasoning is that the program's other budgets may well still be within limits. So the runtime hands the timeout back as an ordinary failed call and lets the program decide.
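The interplay of the two duration budgets can be modelled as a pure decision function. This is a simplified sketch: `judgeCall` and `TimeVerdict` are hypothetical, the field names are borrowed from the meta block of the Reference Response, and the rule that the overall budget takes precedence when both are exceeded is an assumption.

```typescript
type TimeVerdict =
  | { kind: "ok" }
  | { kind: "recoverableError"; reason: string } // the call fails, the program continues
  | { kind: "abort"; reason: string };           // the program is stopped

interface DurationBudgets {
  maxDurationMs: number;    // whole execution, API calls included
  perCallTimeoutMs: number; // each API call individually
}

// Decides what happens to a call that would take `callMs`,
// with `elapsedBeforeMs` already spent by the program.
function judgeCall(budgets: DurationBudgets, elapsedBeforeMs: number, callMs: number): TimeVerdict {
  // A call never runs longer than its own timeout allows.
  const spentOnCall = Math.min(callMs, budgets.perCallTimeoutMs);
  if (elapsedBeforeMs + spentOnCall > budgets.maxDurationMs) {
    return { kind: "abort", reason: "overall duration budget overspent" };
  }
  if (callMs > budgets.perCallTimeoutMs) {
    return { kind: "recoverableError", reason: "API call timed out" };
  }
  return { kind: "ok" };
}
```

With a 30000 ms overall budget and a 5000 ms per-call timeout, a 7000 ms call made early in the run fails recoverably, while the same call made at 28000 ms ends the program.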

Size

Size is measured in bytes and limits the length of the submitted program as UTF-8-encoded source. The standard mandates that this limit is checked before any source code pre-processing (lexing, parsing, etc.) starts. That pre-processing costs the runtime real work, but it is not covered by the budgets for computation, memory, and time. These budgets only apply to program execution. The size cap is what bounds everything that happens before.
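Because the cap counts UTF-8 bytes rather than characters, the check cannot simply read the string length. A minimal sketch, with hypothetical function names and the assumption that a program of exactly the cap size is still accepted:

```typescript
// Counts UTF-8 bytes from UTF-16 code units without building an encoded copy.
function utf8ByteLength(source: string): number {
  let bytes = 0;
  for (let i = 0; i < source.length; i++) {
    const unit = source.charCodeAt(i);
    if (unit < 0x80) {
      bytes += 1; // ASCII
    } else if (unit < 0x800) {
      bytes += 2;
    } else if (
      unit >= 0xd800 && unit <= 0xdbff &&
      i + 1 < source.length &&
      source.charCodeAt(i + 1) >= 0xdc00 && source.charCodeAt(i + 1) <= 0xdfff
    ) {
      bytes += 4; // a surrogate pair is one code point outside the BMP
      i++;        // skip the low surrogate
    } else {
      bytes += 3; // rest of the BMP; a lone surrogate is counted like U+FFFD
    }
  }
  return bytes;
}

// Runs before any lexing or parsing.
function withinSizeCap(source: string, maxProgramSizeBytes: number): boolean {
  return utf8ByteLength(source) <= maxProgramSizeBytes;
}
```

A two-character string of euro signs is six bytes long, so it fails a five-byte cap even though its length is 2.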

Depth

Depth is a unitless positive integer and limits the longest root-to-leaf path in the program's abstract syntax tree. It protects against stack overflows during parsing, type checking, and execution, which is a separate attack vector independent of program size: a short program can nest very deeply.
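The measurement itself can be sketched as follows. `AstNode` and `astDepth` are hypothetical, and counting depth in nodes rather than edges is an assumption. A real parser would more likely track depth while it builds the tree, so that the parser itself is protected. This sketch only shows what is being measured.

```typescript
interface AstNode {
  kind: string;
  children: AstNode[];
}

// Longest root-to-leaf path, counted in nodes. An explicit stack is used so that
// the check itself cannot overflow the call stack on a deeply nested program.
function astDepth(root: AstNode): number {
  let deepest = 0;
  const pending: Array<{ node: AstNode; depth: number }> = [{ node: root, depth: 1 }];
  while (pending.length > 0) {
    const entry = pending.pop();
    if (entry === undefined) break;
    if (entry.depth > deepest) {
      deepest = entry.depth;
    }
    for (const child of entry.node.children) {
      pending.push({ node: child, depth: entry.depth + 1 });
    }
  }
  return deepest;
}
```

A program is then rejected as a Flaw when the measured depth exceeds the configured cap.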

Repeat

Repeat is a unitless positive integer and caps the limit literal that every repeat loop must declare. There are two separate limits at play here, and it is important to distinguish between them. The first is part of the repeat syntax itself and acts as a runtime precondition on the actual repeat count:

repeat pageCount limit 20 as i {
  call orders.list(offset: i * 100, limit: 100)
}

If pageCount turned out to be larger than 20 during execution, then a runtime error would prevent the loop from even starting. This construct gives an agent the opportunity to express an expectation and ensure that the loop does not execute if the expectation is not met.

The budget-system repeat limit, on the other hand, is an absolute ceiling on the literal itself, checked at validation time. So limit 1000000000 is valid syntax and per se allowed, but it very likely exceeds the configured cap, and the program would never start running. This prevents an agent from circumventing the bound on loops by declaring an absurdly high number as the limit.
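The two checks happen at different moments and fail differently, which a sketch makes easy to see. Both function names are hypothetical, and treating a literal equal to the cap as allowed is an assumption.

```typescript
// Validation time: the declared limit literal must not exceed the configured cap.
// A violation is a Flaw, and the program never starts.
function checkRepeatLiteral(declaredLimit: number, maxRepeatLimit: number): string | null {
  return declaredLimit > maxRepeatLimit
    ? `Flaw: repeat limit ${declaredLimit} exceeds the cap of ${maxRepeatLimit}`
    : null;
}

// Execution time: the actual count must not exceed the declared limit.
// A violation is a runtime error, and the loop never starts.
function checkRepeatCount(actualCount: number, declaredLimit: number): string | null {
  return actualCount > declaredLimit
    ? `runtime error: repeat count ${actualCount} exceeds the declared limit of ${declaredLimit}`
    : null;
}
```

For the loop above, `checkRepeatLiteral(20, cap)` runs once during validation, and `checkRepeatCount(pageCount, 20)` runs each time execution reaches the loop.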

Grotto's take on budgets

Grotto implements a dispatcher-worker pattern where program submission is handled by a dispatcher running in the host thread and the program itself is handled by an isolated worker.

The dispatcher enforces the program size cap on receipt. It enforces the overall duration budget with a timeout on the isolated worker, followed by worker termination.

The worker does the rest. Parser and validator enforce the depth and repeat limits. The computation and memory budgets are checked cooperatively at every operation and allocation, and a timeout guards every in-thread API call.

Note the split. Thalers and memory can be counted cooperatively because the interpreter is doing the work and can be trusted to check as it goes. Duration cannot be left to the same mechanism: a program that stops coming back to a checkpoint stops checking its own clock. Grotto does sample the clock inside the worker too, but it does not rely on it. The deadline that actually binds sits on the dispatcher thread, which is a question of isolation rather than budgeting.

Next from the Grotto

That concludes the overview of the Neander budget system, and it ends with a brief visit to Grotto. Since we are here anyway, it makes sense to stay a while and take a closer look under the hood, to better understand how Grotto keeps its embedding host application isolated from the execution of untrusted code.

In the meantime, read the Neander spec, embed Grotto in your own app, and let me know what it cost you.

Sub-Turing: All Good Programs Must Come to an End

Dirk Mattig — Tue, 14 Jul 2026 02:33:13 +0000

The decision to create a new purpose-built language instead of reusing an existing general-purpose language is deeply connected with a concept from theoretical computer science called Turing completeness. Loosely speaking, a programming language is Turing-complete if it can express any computation a computer could ever perform. Practically all general-purpose languages in use today have this capability. A necessary condition for Turing completeness is the ability to express unbounded looping, and this is where things get interesting in the context of agentic API-orchestration:

Do you want to hand an untrusted agent the ability to execute a program in your host application which will run forever?

I deliberately decided against this when designing the language and gave it a theoretical safeguard: Neander is Turing-incomplete. Every valid Neander program is guaranteed to come to an end. And any invalid program is not started in the first place anyway.

What is missing

Neander lacks the ability to express unbounded looping, and this inability comes in two flavors:

No recursion. A program has one entry point, main, and no way to turn around and re-enter the program. Also, it has no means to define functions of its own which it could then call. A call reaches only a registered API and returns. The call stack simply cannot grow without end.

Every loop is bounded. Neander has exactly two iteration constructs, each and repeat (there is no while), and neither can run away:

each walks a list or a map — a finite, immutable value that already exists, so the iteration count is fixed before the loop starts.

let confirmed: [Booking] =? call bookings.list(state: "confirmed")
let ids: [int] = each confirmed as b -> int {
  yield b.guestId
}

repeat runs a block a counted number of times, and its ceiling is a literal you must write into the source:

repeat pageCount limit 20 as i {
  call orders.list(offset: i * 100, limit: 100)
}

What is still missing

Are we safe now? In theory, yes. Unfortunately, systems have to reliably work in practice, not in theory.

Will a Neander program terminate?

Yes, guaranteed.

OK, but when, exactly?

Well, eventually.

And that is where reality bites you. In practice, eventually might be long enough to overload your host application and cause an incident. The same goes for the loops. They are bounded, yes, but where exactly does this bound sit? Lists could be huge and limit 1000000000 is valid syntax in Neander.

Something is still missing.

As we will find out next time, this something is of a very practical nature: Neander programs are living on a budget.

In the meantime, read the Neander spec, embed Grotto in your own app, and let me know how it ended.

The Cold Start: Learning the Language by Submitting a Program

Dirk Mattig — Fri, 10 Jul 2026 03:05:44 +0000

Last time I showed how an agent discovers the available APIs at runtime using the discover verb. But one question remains unanswered: how does a cold-start agent that has never seen the language before discover discovery? You need to know the language in order to write a program, so how is this chicken-and-egg problem solved?

It turns out that there exists, in fact, one particular program an agent can submit without either knowing or guessing the grammar.

The empty program

When a Neander runtime like Grotto receives the empty program, it is not treated as a syntax error but as valid input. The specification requires the runtime to answer with a special response envelope that contains the Neander Reference document explaining the language, plus the runtime's current configuration.

{
  "success": true,
  "result": "# The Neander Reference...",
  "meta": {
    "neanderVersion": 1,
    "thalerBudget": 500,
    "memoryBudgetKb": 1024,
    "maxDurationMs": 30000,
    "perCallTimeoutMs": 5000,
    "maxProgramSizeBytes": 262144,
    "maxNestingDepth": 128,
    "maxRepeatLimit": 10000
  }
}

This is the entire bootstrap mechanism. One round-trip, and the agent now knows about the Neander language and every limit the runtime enforces. The rule that everything in Neander is a program extends even to requesting the manual for how to write a program.
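On the agent's side of that round-trip, picking the response apart might look like the sketch below. The interface and function names are hypothetical. The field names are the ones shown in the envelope above.

```typescript
interface ReferenceMeta {
  neanderVersion: number;
  thalerBudget: number;
  memoryBudgetKb: number;
  maxDurationMs: number;
  perCallTimeoutMs: number;
  maxProgramSizeBytes: number;
  maxNestingDepth: number;
  maxRepeatLimit: number;
}

interface ReferenceResponse {
  success: boolean;
  result: string; // the Neander Reference document
  meta: ReferenceMeta;
}

// Splits a Reference Response into the manual and the limits the runtime enforces.
function readColdStart(envelopeJson: string): { manual: string; limits: ReferenceMeta } {
  const envelope = JSON.parse(envelopeJson) as ReferenceResponse;
  if (envelope.success !== true || typeof envelope.result !== "string") {
    throw new Error("not a Reference Response");
  }
  return { manual: envelope.result, limits: envelope.meta };
}
```

After this one call the agent can keep `limits` around and, for instance, never declare a repeat limit above `limits.maxRepeatLimit`.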

It is worth mentioning that Grotto optimizes the request processing in this special case. The empty program never spawns a worker that lexes or parses it. The dispatcher immediately returns a prepared Reference Response, which makes bootstrapping an agent a very cheap task.

A little nudge to break the ice

For an embedding application, out-of-the-box bootstrapping and discovery are real conveniences. All it still has to do itself is register its APIs with the runtime and connect agents to itself.

The latter step must, in whatever form the underlying technology dictates, tell the agent that it now has access to a submitProgram operation. Ideally, that message also contains a little nudge to break the ice: send the empty program first to learn about the language.

A little kindness goes a long way. Agents are smart, but not (yet) mind readers.

Two documents, separate audiences

Now that the Neander Reference is introduced, it is worth stressing that there exist, in fact, two documents describing Neander, and this split is deliberate because they address two very different audiences.

The Reference is served by the runtime to agents who want to write programs to call the APIs of the embedding application. It is example-driven and has three parts: a get-started, a cookbook of copy-and-ship recipes, and the actual language reference for lookup. It answers the how question, but does not explain why the language is the way it is. This is the job of the other document.

The specification is the normative, precise, and exhaustive definition of the language. It informs coding agents who plan and execute a runtime implementation as well as humans who evaluate the technology to assess whether embedding it into their own systems is a viable option. It also serves as the single source of truth for generating or updating the Reference. The two documents are supposed to always be in sync, but if push comes to shove, the specification wins.

Next from the Grotto

We have now covered bootstrapping and discovery at the beginning of a dutiful friendship between the agent and the API. And so it is time to shift our focus to what comes next: The End. Or rather, the question: is there an ending? We all know that programs can run forever. And thanks to Mr. Turing, we know that, in general, we can never know for sure whether a given program will.

But Neander lives up to its namesake's reputation and proves to be way too predictable for that.

In the meantime, read the Neander spec, embed Grotto in your own app, and let me know how the cold start goes.

Discovery: How an Agent Finds Your APIs

Dirk Mattig — Tue, 07 Jul 2026 03:32:50 +0000

Last time I made the case for why Neander and Grotto exist at all — a purpose-built, safe-by-construction language instead of a sandboxed general-purpose one. That was the argument for the whole language. From here, Field Notes from the Grotto takes it apart one feature at a time, and the first one is the feature that explains why the rest exists: discovery.

Classical integration had a shape we all recognize: a developer reads one system's API documentation, then writes integration code that calls it in the right order with the right data. Two steps, one human. Make the calling system agentic — let it decide at runtime what it needs — and the human in the middle vanishes. The two steps do not. They have to land somewhere.

The writing step is the half everyone talks about: the agent writes the program instead of the developer. It is the reading step that gets skipped over. Before you can write a line against an API you have to find out what the API even is — what exists, what it takes, what it returns. That was the developer with a browser tab open on the documentation. When the developer leaves, that finding-out does not leave with them; something has to inherit it. That something is discovery.

The eager answer is the tool catalog: hand the agent every function definition up front and let it pick. That is discovery too — just total, and paid in advance. It does not scale. Hundreds of definitions clutter the context before the agent has done anything, every intermediate result piles on top, and latency and cost climb with them.

Where the documentation used to live

Take the pragmatic route from my post Source Code as the Seam Between Systems — a general-purpose language the model already writes fluently, its APIs behind a sandbox — and you inherit a quieter awkwardness. That language and those API packages were built for human developers, and human developers read documentation: a reference site, a PDF, a README, a page of generated typedocs. But a running agent is not a human developer. It may not have unrestricted access to the web to search for and reach that reference page at all — and even where it does, the page is shaped for a person: prose, worked examples, a layout to skim, none of it meant to be consumed by an agent. So every system that takes this route has to bolt something on — a meta-tool that lists the available functions, or a search endpoint the agent queries before it writes anything. It works. But it is an appendage: out of band, and reinvented for every stack.

Neander makes discovery part of the language itself. discover is a verb you write into a program and submit exactly the same way you submit a program that does real work — and that is the unification worth stressing. In Neander, everything is a program. Finding out what exists and calling it are the same kind of act, in the same language, through the same entry point. The runtime exposes one operation to the agent — submitProgram — and every interaction, whether the agent is asking what is available or getting something done, flows through it. With this, the finding-out moved in-band. What the developer used to do with a browser tab, the agent now does with a program.

Two verbs

Almost everything in Neander exists to glue two verbs together. call invokes one of the registered API functions. discover asks the runtime what there is to call.

discover has six forms — three things you can look for (namespaces, functions, documents) crossed with two ways to look (search a list, or get one by exact name):

discover namespaces ["payment"]              // [Namespace]  — search
discover namespace  "shipping"               // Namespace?   — exact lookup of a namespace
discover functions  ns ["estimate", "intl"]  // [Function]   — search the namespace's functions
discover function   ns "estimateBatch"       // Function?    — exact lookup of a function
discover documents  ns ["format"]            // [Document]   — search the namespace's documents
discover document   ns "requestFormat"       // Document?    — exact lookup of a document

Search terms are case-insensitive substrings, AND-combined, matched against each candidate's name and description. The empty list matches everything. That is the entire discovery surface.
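The matching rule is small enough to sketch in full. `Candidate`, `matchesSearch`, and `searchCandidates` are hypothetical names, and the sketch assumes that each term may match in either the name or the description.

```typescript
interface Candidate {
  name: string;
  description: string;
}

// Every term must occur, case-insensitively, as a substring of the name or the
// description. An empty term list matches everything.
function matchesSearch(candidate: Candidate, terms: string[]): boolean {
  const haystacks: string[] = [candidate.name.toLowerCase(), candidate.description.toLowerCase()];
  return terms.every(function (term: string): boolean {
    const needle = term.toLowerCase();
    return haystacks.some(function (haystack: string): boolean {
      return haystack.indexOf(needle) !== -1;
    });
  });
}

function searchCandidates(candidates: Candidate[], terms: string[]): Candidate[] {
  return candidates.filter(function (candidate: Candidate): boolean {
    return matchesSearch(candidate, terms);
  });
}
```

Adding a term can only narrow the result, which is what AND-combination means in practice.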

The loop

A cold agent does not know your API, so it works inward. The power lies in the sequential writing and execution of code. One program to list namespaces, a second to look for functions inside a namespace, a third to make the function call — reading and learning from the response envelopes the runtime returns as the result of every program submission.

neander 1 {
  types {}
  main -> [Namespace] {
    return discover namespaces []
  }
}

The runtime answers with a standard envelope — a success flag, the result, and a meta block of telemetry. The return value of type [Namespace] is the fully serialized list of available namespaces:

{
  "success": true,
  "result": [
    {
      "name": "bookings",
      "description": "Booking management — retrieve, list, confirm, and escalate reservations for the hospitality backend"
    },
    {
      "name": "guests",
      "description": "Guest profile lookup — retrieve and list guests for the hospitality backend"
    },
    {
      "name": "payments",
      "description": "Payment processing — charge bookings, issue refunds, and look up prior charges"
    }
  ],
  "meta": {
    "thalersConsumed": 1,
    "memoryConsumedKb": 1,
    "durationMs": 2,
    "apiCalls": []
  }
}

With this list in its context the agent can now further inspect the namespace of interest:

neander 1 {
  types {}
  main -> [Function] {
    let ns: Namespace =? discover namespace "bookings"
    return discover functions ns []
  }
}

This time the result contains the list of available functions in this namespace, each function serialized in full (only one function shown here):

{
  "success": true,
  "result": [
    {
      "qualifiedName": "bookings.get",
      "description": "Get a booking by its numeric id",
      "params": {
        "id": "int"
      },
      "returnType": "bookings.Booking",
      "errors": {
        "404": "Booking not found"
      },
      "types": {
        "bookings.Booking": {
          "description": "A reservation record with id, fare amount, state, and the guest who placed it",
          "fields": {
            "id": "int",
            "fare": "decimal(2, half_away)",
            "state": "string",
            "guestId": "int"
          }
        }
      }
    }
  ],
  "meta": {
    "thalersConsumed": 2,
    "memoryConsumedKb": 1,
    "durationMs": 3,
    "apiCalls": []
  }
}

Every function name listed is the exact string to write next — qualifiedName is what you pass to call, and the types side-table is transitively complete, so the shape of bookings.Booking arrives with the function that returns it. The agent reads it straight out of the JSON response envelope, then writes the program that does the work:

neander 1 {
  types {}
  main -> decimal(2, half_away) {
    let booking: bookings.Booking =? call bookings.get(id: 8821)
    return booking.fare
  }
}

Discover, call, return. Everything else in the language is detail.

Opaque on purpose

The design choice worth pointing out is that the values discover returns — handles of type Namespace, Function, Document — are opaque. A program can hold one and return it, but it cannot read its fields. The envelope contains what the runtime serializes when the discovery handle is returned from main. Inside the program the handle is a sealed token. Only on the way out does it become readable.

What might sound like a restriction is, in fact, a guarantee. A discovery handle can only ever come from discover — you cannot forge one out of a record literal or return it from an API function. So a discovery handle the agent sees is always a handle the runtime minted, pointing at something the runtime actually registered. Discovery is not a convention, it is the only way in.
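A runtime-side analogy may help to picture the guarantee. In Neander the rule is enforced statically by the validator, so the sketch below is not how Grotto does it. `HandleRegistry` is a hypothetical class that shows the same idea with object identity: a handle has no readable fields, and only a handle the registry minted can be resolved.

```typescript
class HandleRegistry {
  private minted: object[] = [];
  private targets: string[] = [];

  // Only discover-like lookups would call this.
  mint(target: string): object {
    const handle = Object.freeze({}); // no fields to read, nothing to modify
    this.minted.push(handle);
    this.targets.push(target);
    return handle;
  }

  // Resolves a handle on the way out; a look-alike object is rejected.
  serialize(handle: object): string {
    const index = this.minted.indexOf(handle);
    if (index === -1) {
      throw new Error("not a handle minted by this runtime");
    }
    return this.targets[index];
  }
}
```

An empty object built elsewhere looks identical to a handle but fails the identity check, which is the sealed-token property in miniature.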

Next from the Grotto

Discovery answers what APIs exist, but one question remains open: how does an agent discover discovery? The disadvantage of a new, purpose-built language like Neander is that a cold-start agent has never heard of it, in stark contrast to any well-established general-purpose language. So before the agent can write even its first three-line namespace discovery program, it has to learn the language itself with nothing more at its disposal than the submitProgram operation. If you do not know the language, what is the only program you can write which is not guesswork?

The empty program, exactly!

In the meantime, read the Neander spec, embed Grotto in your own app, and let me know what you discovered.

Neander and Grotto: Beyond Code Mode

Dirk Mattig — Thu, 02 Jul 2026 14:42:42 +0000

Field Notes from the Grotto starts here — a feature-by-feature tour of the Neander language and its runtime, Grotto. And I am opening with the biggest feature of them all: Neander itself. Why create a whole new language when we already have an abundance of well-established programming languages at our disposal?

In a previous post I have made the case that the seam between systems is turning into a language — that instead of calling your tools one at a time, an agent should send you a small program and let it orchestrate the work on your side. That idea has a name — code mode — and it is not mine. By now it is not even contentious: others arrived at it from their own directions, there are real solutions already shipping it, and the underlying claim — that a model does better writing code than emitting tool calls — has been measured, not just asserted. So the what is settled. This post is about a narrower quarrel with the how.

All existing solutions I came across share an answer that is, frankly, the obvious one. Take a well-known language the model already writes fluently, generate an API from your tools, and run the agent's code in a sandbox. Pragmatic. Available today. And there is serious effort behind making it safe — an entire industry of ways to run untrusted code: lightweight virtual machines, isolated containers, syscall firewalls, network proxies whose sole job is to say no. Real engineering, and it delivers.

So why did I not reach for any of that? Why start from an empty grammar instead?

Because all of it shares one shape — safety by subtraction — and I wanted a different one.

Safe by construction

Every one of those approaches begins with a language that can do anything, then spends its effort taking things away: walling off the filesystem, blocking the network, killing the process when the clock runs out. The language is a threat, and safety comes from a prison built around this culprit.

Neander is no such threat. It cannot run forever — it is not Turing-complete, there is no recursion, every loop is statically bounded, and termination is decided before the program runs. It cannot reach out — there is no file, no socket, no system call anywhere in the grammar. It cannot run up a bill — every program runs under hard ceilings on computation, memory, and time. Whole categories of exploit — sandbox escapes, privilege escalation, data exfiltration — simply do not apply, because the capability they would abuse was never there.

There is no prison because there is no prisoner. The sandbox approach asks you to trust the cage. Neander's safety is the absence of anything that would need a cage. The less a language can do, the less can go wrong — and the less you have to take on faith. The entire sandbox industry exists to contain general-purpose code; Neander opts out of needing it.

Uniform by construction

No matter what program the agent submits, the answer comes back in the same form: a single response envelope, defined by the language itself. It carries either the value the program produced or a precise account of why it produced none. That uniformity holds because failure is never allowed to escape into the mess: errors raised mid-run and exhausted budgets are caught and classified rather than left to surface however they please, and even an invalid program that never executes still produces a response. An envelope also carries metadata, including the resources the program used and the usage limits the runtime enforces. The agent learns the very ceilings it operates under from the responses it receives.

The uniform response envelope is not simply a convenience. The whole point of sending code instead of a stream of tool calls was to keep the agent's context clean — one compact result in, rather than every intermediate step piling up. A uniform envelope is what makes that payoff real: the agent gets back a single machine-readable verdict it always knows how to read, and only that verdict costs it any context. It never has to parse prose, squint at a stack trace, or reconcile different failure formats. It reads the envelope, and it knows exactly where it stands.
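
To illustrate what "a single machine-readable verdict" buys the reader of an envelope, here is a small TypeScript sketch. The field names (`ok`, `value`, `failure`, `meta`, `thalersUsed`) and the three failure categories are assumptions chosen for this example; the real envelope is defined by the Neander specification and may look different. What the sketch shows is only the principle: one discriminated shape, one code path to read it.

```typescript
// Hypothetical shape: field names are illustrative, not taken from the Neander spec.
interface EnvelopeMeta {
  thalersUsed: number;  // computation consumed by this run
  thalerBudget: number; // ceiling the runtime enforces
}

type FailureCategory = 'invalid_program' | 'runtime_error' | 'budget_exhausted';

type ResponseEnvelope =
  | { ok: true; value: unknown; meta: EnvelopeMeta }
  | { ok: false; failure: { category: FailureCategory; message: string }; meta: EnvelopeMeta };

// One reader handles every outcome: success, invalid program, runtime error, exhausted budget.
function summarize(envelope: ResponseEnvelope): string {
  const usage = `${envelope.meta.thalersUsed}/${envelope.meta.thalerBudget} thalers`;
  if (envelope.ok) {
    return `value ${JSON.stringify(envelope.value)} (${usage})`;
  }
  return `${envelope.failure.category}: ${envelope.failure.message} (${usage})`;
}
```

Because the metadata sits on both branches, the caller learns the ceiling from a failed run just as it does from a successful one.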

Open by construction

The existing solutions tend to arrive bolted to something — a cloud you deploy on, a framework you adopt, a tool protocol you have to speak. Neander is a specification, with a conformance suite growing up beside it. Anyone can implement a runtime; Grotto is simply the first. Nothing ties you to one vendor, and nothing ties you to one tool protocol — the host embeds a Neander runtime by wiring it into its own application. A standard you can build on, not a product you sign up for.

Next from the Grotto

That is the case in outline. The rest of the series is the case in detail — one feature per entry, each of them a piece of the argument above made concrete. First up, the one that makes the whole inversion possible: how an agent finds out what your APIs even are, at runtime, without ever carrying a catalog of them around.

In the meantime, read the Neander spec, embed Grotto in your own app, and tell me where it falls short.

Grotto: Where Neander Programs Live

Dirk Mattig — Mon, 29 Jun 2026 08:04:38 +0000

Last time, in Neander: An Agent-First Programming Language, I published the language but not the one thing a host actually needs in order to put it to work: a runtime reference implementation. I promised it already existed, and that I would show it next time.

Here it is.

It's called Grotto. The naming keeps the theme going: the first Neanderthal fossil was pulled from a grotto — the place "where the Neanderthal lived." Grotto is where Neander programs live and run.

A specification is a plan. A reference implementation is the proof that the plan can be put into action. Grotto is that proof: an embeddable TypeScript library, running on Node.js, that takes a Neander program as plain text and runs it — the whole language, end to end, every expression, every data type and every built-in function the spec defines.

The trusted component

Recall the setup from the last two posts. The agent is on the outside, untrusted, writing small programs. The host is on the inside, with the APIs worth calling. Between them sits the runtime — and the runtime is the one component in the whole arrangement that everyone has to trust.

That is a heavy crown to wear. Grotto's entire design is an argument that the trust is warranted.

It starts with the design process itself: Grotto is architected, not vibe coded. I am a certified software architect (iSAQB® CPSA-Advanced Level), so on a good day I even know what I am doing. The Grotto implementation started with the creation of an architecture specification (arc42, C4) and a technical design document (co-authored by me and an agent). Only then did a coding agent create the codebase.

It continues with the dependencies — or rather their absence. Grotto has zero runtime dependencies and leans on nothing but the Node built-ins. That is not housekeeping; it is a security boundary. An npm package you never install is a package that can never turn on you: no CVE, no compromised maintainer, no supply-chain attack can reach Grotto through a dependency, because there is not one to reach it through. Which leaves only Grotto's own code, and its design does the rest of the arguing.

The host-facing library is a small dispatcher of under two hundred lines, and it holds no language code at all. Everything that actually touches a stranger's program — the lexer, the parser, the validator, the interpreter — runs somewhere else entirely: in a fresh worker thread, spawned for that one submission and thrown away the moment it's done. Isolation by construction, not by good manners. Add it all up, and the only Grotto code that ever executes on the embedding application's own thread is that small dispatcher — everything else is quarantined in an isolated worker you can kill.

Every program runs under the hard ceilings of a budget system as mandated by the Neander spec — on computation, on memory, and on wall-clock time. Overrun the time limit and the worker is simply killed; the dispatcher keeps the clock, so even a wedged program can't outlast it. The language itself has no recursion and no unbounded loops, so termination was never in doubt to begin with — the budgets are there for everything else.
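
As a rough picture of the computation ceiling, the following TypeScript sketch shows a meter that every evaluation step has to pay into before it does any work. It is an illustration under stated assumptions, not Grotto's implementation: `ThalerMeter`, `BudgetExhausted`, and the price of one unit per list element are all made up here, with only the word "thaler" borrowed from the `thalerBudget` option in the configuration. The wall-clock ceiling works differently (the dispatcher terminates the worker) and is not modelled.

```typescript
// Hypothetical sketch: how Grotto actually prices operations is not covered in this post.
class BudgetExhausted extends Error {
  constructor(readonly attempted: number, readonly budget: number) {
    super(`budget exhausted: ${attempted} > ${budget}`);
    this.name = 'BudgetExhausted';
  }
}

class ThalerMeter {
  private spent = 0;

  constructor(private readonly budget: number) {}

  // Charge before doing the work, so an overrun stops the step that caused it.
  charge(cost: number): void {
    if (this.spent + cost > this.budget) {
      throw new BudgetExhausted(this.spent + cost, this.budget);
    }
    this.spent += cost;
  }

  get used(): number {
    return this.spent;
  }
}

// Toy "program": sum a list, paying one unit per element (an assumed price).
function meteredSum(values: number[], meter: ThalerMeter): number {
  let total = 0;
  for (const value of values) {
    meter.charge(1);
    total += value;
  }
  return total;
}
```

The exception ends the program that overran, while the caller holding the meter stays in control and can report the exhaustion in its response.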

The last line of defense is quality assurance. The Grotto specifications and codebase so far have been reviewed by several frontier coding models: Opus 4.6–4.8, GPT-5.5, and, by sheer luck, Fable 5.

The codebase quality is verified by over 1,300 unit tests (coverage > 90% but let's not get overexcited about percentages alone) and more than 750 black-box end-to-end tests (program submissions) that are continuously grown toward a Neander conformance test suite.

You can always do more, and I will. But the groundwork is laid — enough, I'd hope, that anyone weighing Grotto for a real host application can take it, and Neander with it, seriously.

It's early days, to be clear. The spec is still a draft, the version number starts with a zero, and interfaces might move. But it's real, and it all runs today — a runtime, not a roadmap.

What you do with it

If Neander reads strangely because only agents write it, Grotto reads normally, because only humans host it. You embed the library, hand it your own APIs as provider modules, and point agents at it:

import { Runtime } from 'grotto1';

const runtime = await Runtime.start({
  config: {
    neanderVersion: 1,

    // The ceilings every program runs under.
    thalerBudget: 5000,          // computation
    memoryBudgetKb: 2048,        // memory
    maxDurationMs: 10_000,       // wall-clock time
    perCallTimeoutMs: 3000,      // per individual API call
    maxProgramSizeBytes: 65_536, // static caps, checked before a program runs
    maxNestingDepth: 64,
    maxRepeatLimit: 1000,

    // Your APIs — each a module the runtime hosts on your behalf.
    apiProviderModules: ['/app/providers/bookings.mjs'],
  },
});

const responseEnvelope = await runtime.submitProgram(programFromAgent);

Every submission comes back as a single structured envelope — the value it returned, or the precise way it failed: a program that didn't type-check, a call that errored, a budget that ran out. One program in, one well-formed answer out, every time. Nothing leaks, nothing hangs.

Field Notes from the Grotto

There is far more to cover. How an agent discovers your APIs at runtime. The budget system that keeps a program from ever running up a bill. The worker isolation that lets you run a stranger's code at all. Each is worth a post of its own — so I'm starting a new series on the Neander language and its runtime, feature by feature.

In the meantime, Grotto is on GitHub, the license is permissive, and the floor is open. Embed it into your app, and let me know what breaks.

Neander: An Agent-First Programming Language

Dirk Mattig — Tue, 23 Jun 2026 12:09:58 +0000

Last time, in Source Code as the Seam Between Systems, I closed by saying I had built a programming language for the seam between systems, and that I would come back to it.

Here we are.

The language is called Neander (named after the Neanderthal), and its specification is available now.

A quick recap of where the last post left us. When one system needs another to do something, the seam between them used to be a wire carrying structured data, with a human in the middle writing the integration code. Take the human out, let the calling system be an agent that decides at runtime what it needs, and the seam stops being a wire. It becomes a language. The called system exposes an execution environment, and the caller drives it by sending small programs.

Neander is that language.

The inversion

The model everyone started with is the tool catalog: load every function the host exposes into the agent's context, then let the agent pick. It does not scale. Hundreds of tool definitions clutter the context, every intermediate result piles up on top, and costs and latency climb with them.

Neander turns that around. The agent's context holds one compact thing — the Neander Reference itself — rather than a catalog of everything the host can do. To get something done, the agent writes a short program. The program asks the runtime what APIs are available, calls what it needs, composes the results, and hands back a single answer. Discovery happens at runtime, inside the program, instead of up front in the context window.

That gives the language two verbs that carry most of the weight. discover asks the runtime what namespaces, functions, and documents exist; call invokes one of those functions. Everything else — branching and bounded loops, the structural type system, explicit error handling — exists to glue those two together.

First the agent looks around:

neander 1 {
  types {}
  main -> [Function] {
    let ns: Namespace =? discover namespace "bookings"
    return discover functions ns []
  }
}

It reads the returned descriptions, then writes a second program that calls what it found:

neander 1 {
  types {}
  main -> decimal(2, half_away) {
    let booking: bookings.Booking =? call bookings.get(id: 8821)
    return booking.fare
  }
}

That is essentially the whole shape of it. One call, maybe a loop over the result, some conditional logic, another call. The agent writes it for a single task, sends it to the runtime, and throws it away.

The interesting bit is where the code runs. The Neander runtime lives inside the host: the embedding application registers its own APIs with it, so execution happens server-side, right next to these APIs. As described above, the agent stays on the outside, untrusted, and uses the language to talk to the embedding application.

This setup, obviously, raises a few eyebrows.

What Neander deliberately cannot do

The reassurance is structural, not a promise to behave. The things that make running a stranger's code dangerous don't exist in Neander — there's nothing to wall off because there's nothing there.

It cannot run forever. Deliberately not Turing-complete: no recursion, every loop statically bounded. Termination is proven before the program runs — no halting question to lose sleep over.
It cannot reach out. No file I/O, no sockets, no system access. The only thing a program can touch is an API the host chose to register. There's no sandbox because there's nothing to put in one — the language is the sandbox.
It cannot run up a bill. Every execution runs under hard ceilings on computation, memory, and time (the budget system). Exceed one and that program is stopped — not the host it runs in.
It cannot misuse the host's APIs. A program that fails validation never runs; one that passes calls only functions that exist, with correctly-typed arguments, and can never treat a value that might be missing as if it were there.
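
The first and last points rest on checks that happen before a program runs. Here is a minimal TypeScript sketch of that idea over a deliberately tiny, hypothetical syntax tree. The `Stmt` type and both functions are assumptions for illustration; Neander's real grammar and validator are defined by its specification and also cover types and arguments, which this sketch ignores. It shows two things: literal loop limits can be checked against static caps, and with such limits a worst-case amount of work can be computed without executing anything.

```typescript
// Hypothetical mini-AST; Neander's real grammar is defined by its specification.
type Stmt =
  | { kind: 'call'; fn: string }
  | { kind: 'repeat'; limit: number; body: Stmt[] };

interface StaticCaps {
  maxRepeatLimit: number;
  maxNestingDepth: number;
}

// Returns every violation found; an empty list means the program may run.
// `depth` counts the repeat blocks enclosing `body`.
function validate(body: Stmt[], caps: StaticCaps, depth = 0): string[] {
  const errors: string[] = [];
  for (const stmt of body) {
    if (stmt.kind !== 'repeat') continue;
    const nested = depth + 1;
    if (nested > caps.maxNestingDepth) {
      errors.push(`nesting depth ${nested} exceeds ${caps.maxNestingDepth}`);
      continue; // do not descend into a block that is already too deep
    }
    if (!Number.isInteger(stmt.limit) || stmt.limit < 0 || stmt.limit > caps.maxRepeatLimit) {
      errors.push(`repeat limit ${stmt.limit} is outside 0..${caps.maxRepeatLimit}`);
    }
    errors.push(...validate(stmt.body, caps, nested));
  }
  return errors;
}

// Because every loop carries a literal limit, an upper bound on the number
// of calls is computable without running the program.
function worstCaseCalls(body: Stmt[]): number {
  let total = 0;
  for (const stmt of body) {
    total += stmt.kind === 'call' ? 1 : stmt.limit * worstCaseCalls(stmt.body);
  }
  return total;
}
```

A general-purpose language offers no such bound, which is why it needs a cage around it instead.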

The name fits the philosophy. Like the spare, limited languages of computing's early days, it can do very little — and the less a language can do, the less can go wrong.

The audience

Neander is written for an unusual audience. Only agents author it; humans host it. Hence no getting-started guide, no tutorials — nothing for a manual coder.

Besides the normative specification, the example-driven Neander Reference is aimed squarely at the agents that will write the programs — and the runtime hands it to them in-band.

The website at newadventuresinit.github.io/neander is for the humans evaluating Neander and deciding whether to embed it into their systems. But before they can put it to work, one thing has yet to be published: a runtime reference implementation.

It already exists — more on that next time.

In the meantime, the Neander specification is live, the license is permissive, and the floor is open. Have a look around, and let me know what you think.

Source Code as the Seam Between Systems

Dirk Mattig — Wed, 17 Jun 2026 13:51:37 +0000

In my previous blog post Speccing Is the New Coding, I claimed that source code will not entirely vanish in an agentic world but will change jobs, from the substance of applications to the seam between them. The promise was to come back and dive deeper into the matter.

Let's take a look.

The way it used to be

In classical software development, when two systems need to talk to each other, a developer reads one system's API documentation, then writes integration code for the other system that calls those APIs in the right order with fittingly structured data. It works, and we have decades of practice at it. But the calling system never actually understands the API it is calling. Only the developer does. The system is just faithfully carrying out instructions written by a human during design time.

That arrangement has already started to look somewhat quaint. When at least the calling system is agentic, deciding at runtime what it needs from the other system, the human in the middle must go.

The top-down approach: mathematical musings

We humans have the built-in ability to "prompt" each other. We call it a conversation.

This technique has proven so successful over the centuries that we even modeled our latest human-machine interface after it. So, why stop there? If it's this successful, it makes sense to apply the concept to machine-machine interfaces, right? Let one system tell the other, in plain English, what it wants, and together they can purposefully harness each other's capabilities to achieve a common goal.

It would work. It would go down in IT history as the most generic, most flexible interface technology ever invented.
And as the most memorable security vulnerability ever shipped.

This naive approach is doomed. Plain natural language is unbounded by design. There are no limits to what can be said, and no formal guarantees about what the words mean.

But there need to be boundaries. System boundaries.

So if we want one system to tell another what to do, two things have to be true:

there have to be limits on what can be said, and
what is said has to have unambiguous meaning.

These requirements clearly point to formal languages in the mathematical sense. And programming languages are of course exactly formal languages with finite syntax, precise semantics, and bounded expressiveness. We invented them to talk to machines, and they happen to be the right shape for machines talking to other machines, too.

The bottom-up approach: tool time

Early in the development of agentic systems, it became clear that they needed a way to interact with the outside world. So a string of techniques was invented, which we now refer to as tool calling. In essence, it is a way for an agent to first absorb an API spec and then execute API calls.

Sounds straightforward. It wasn't.

In November 2025, the creators of the MCP standard published an article highlighting two important lessons learned about tool calling.

First, for an agent to call a tool, it needs to know it exists; hence, all tool definitions are loaded into the agent's context up-front. This did not pose a problem for a handful of tools, but, as it turned out, customers actually needed to expose an agent to hundreds, even thousands, of tools. This led to cluttered contexts, increased response times, and also increased costs.

Second, since the agent acts as the orchestrator of tool calls, every intermediate tool call result is added to the context, leading to the same problems as above, plus reliability and even data protection issues.

The proposed remedy, in a nutshell, is to make the agent write code against the API and add only the end result to its context.

What is interesting is that we arrive at the very same conclusion as in our top-down approach, although we started from two very distinct places:

The seam between two systems is no longer a wire that carries structured data. It is a language, an exposed execution environment that the calling system uses by sending programs to it.

It is worth noting that the above-mentioned article served as my initial inspiration for what I am about to suggest, even though my take on this does not follow the original proposal.

An agent-first programming language

Now that we have arrived at this conclusion, the next step seems obvious: select your favorite language, make an API of your choice available to it, and tell the agent to start sending programs.

Entirely possible. Pragmatic. But is it such a good choice after all?

Practically all of our existing mainstream programming languages are general-purpose languages. By design, you can do literally anything with them. Sure, they will meet our requirements, no matter what they are. And we can always use sophisticated technologies, such as sandboxing, to suppress any superfluous or dangerous features we do not need.

But as the famous quote by Antoine de Saint-Exupéry goes: "Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away."

So, let's do this the right way round and start from scratch. What do we actually need for this specific use case?

Simplicity: An agent-centric language design aimed at API-orchestration. One call, a loop over the result, some conditional logic, another call. That's it. Nothing fancy, and no syntactic sugar.
Predictability: Every program is guaranteed to terminate by design. No file I/O, no sockets, no system access. The language is the sandbox.
Discoverability: A program can find out about the available APIs and even the language itself at runtime. No upfront documentation, no out-of-band integration step.

The human in the middle has vanished. The seam no longer sits between the two systems where the translator had to place it, but has moved into the called system. It exposes the language, sets the rules, and defines the vocabulary. The calling system arrives ready to speak whatever it finds. And the three design principles above aim to make this integration seamless: simple enough that the calling system doesn't struggle to use it. Predictable enough that the called system doesn't have to fear it. Discoverable enough that no human has to explain it in advance.

With all this in mind, I have created, from the ground up, a new programming language and its runtime reference implementation for precisely this use case.

TaskTrack — A Specify Spec for Agent Task Management

Dirk Mattig — Fri, 05 Jun 2026 06:45:44 +0000

It is time to put the proposition from my previous blog post to the test. Is it possible to spec an application for execution by an agent without encoding it in source? Let's find out.

One type of application every knowledge worker is familiar with is task management. Every task has a lifecycle status, dependencies on other tasks, and a history of progress.

Let's give agents their own.

TaskTrack is a simple but non-trivial task management system variant implemented as a Specify spec. It goes beyond checkbox-based to-do lists that agents sometimes use internally and mimics the key system features listed above.

TaskTrack defines two procedures: a "Plan Authoring Run" to create an interconnected set of tasks from requirements and a "Plan Execution Run" to advance a previously authored plan toward completion. One execution run might not always be enough to achieve completion, because TaskTrack allows requesting human feedback and incorporating it during the next execution run. Furthermore, every execution run is divided into "Task Processing Run" sub-procedures to allow for advanced agent context management.

TaskTrack implements all of this in fewer than 300 lines of text. If the implementation used source code, then, depending on the programming language, this would be enough space to implement only the required file I/O operations (TaskTrack uses files for simplicity, not a database). Natural language can easily become quite bloated, but a stringent, scientific writing style and extensive use of what the Specify standard offers can effectively counter that.

The official test is, how could it be any other way, the implementation of yet another uninspired Breakout clone. The requirements, the completed TaskTrack plan, and the deliverable are contained in the repository.

If you want to run the test yourself, the included README file contains the necessary information, including the launch prompts for both the authoring agent and the execution agent. Please note how both launch prompts are structured. They use TaskTrack terminology and point to the relevant files. They do not contain task-related behavioral instructions. The execution agent launch prompt contains agent-specific instructions for mapping agent features to the generic TaskTrack specification. The principles behind the good old manual coding design patterns remain valid even in the agentic era!

And now, finally, for the test result. In a nutshell: It works!

The authoring agent created all TaskTrack files as indicated, which is, maybe, less surprising or impressive. More importantly, the execution agent showed deterministic behavior over all 16 tasks and two execution runs. I often hear that deterministic behavior must remain encoded in source due to the inherently random, and hence non-deterministic, nature of LLMs. I cannot confirm this based on the test result. The execution agent followed the step-by-step procedure definition by the book each and every time. Even the defined textual output was created as reliably and repeatably as if it were produced by a print statement.

It goes without saying that this single test result does not deliver a general proof of the viability of speccing. It shows it can work; it is possible. Maybe non-deterministic agent behavior is more often than not the result of unspecific instructions rather than randomness in the underlying LLM.

Having said all this, the test run was far from being perfect. It produced several so-called valuable learning experiences.

The first and most obvious finding is that all but one of the timestamps are incorrect. The authoring agent wrote and executed a Python script to retrieve the current UTC time. All task processing subagents simply invented timestamps. When I later asked the system about this difference in behavior, it gave an interesting answer: Creating a new timestamp is a "single, salient, one-off step... worth a real python/date call." Updating timestamp fields is "a repeated, mechanical step... every task, every run, in fresh subagent contexts," and that "models systematically deprioritize repeated boilerplate."

This is not a TaskTrack issue but rather the result of an ill-equipped agent. And it is at this point that, no matter how hard I try, I cannot stop myself from making the tongue-in-cheek remark that the machines that are feared to first fire and then nuke us apparently have no built-in access to the current time... I will keep this in mind, just in case.

The second finding is that, as the agent itself remarked when reviewing the test results, task resolutions are not necessarily as brief as mandated by the TaskTrack specification. But then, what is brief? Precisely. This is the kind of hastily written, hand-wavy instruction that is open to interpretation and leads to varying results. Just because we are using natural language now does not mean we are allowed to let our rigor slip.

Luckily, it is not a major pain point, since it only affects the resolution, not the core processing logic. Still, it is worth fixing in a future publication.

The third finding is that, strictly speaking, the test run was flawed because these wonderful machines now have memory. Both the authoring and execution agents revealed in their thinking output that they were aware that this was a test. I do not think this flaw invalidates the qualitative test result as such. Still, future test setups will require more care and consideration.

In the meantime, the TaskTrack specification is live, the license is permissive, and the floor is open. Have a look around, and let me know what you think.

Speccing Is the New Coding

Dirk Mattig — Mon, 25 May 2026 10:48:20 +0000

What do we still need source code for?

It is an odd question to ask after spending a lifetime writing it, but it is the one that keeps pulling at my sleeve. Let me work backwards to explain why.

The first computers were one-trick ponies. Their behavior was baked into their wiring — change the task, change the machine. Useful, expensive, inflexible. Then the elegant idea emerged that part of the data a machine processed could also control how it processed the rest, and the stored-program computer was born. Hardware became a stage; software became the play.

And with software came developers — a new profession whose first and hardest job was, and still is, to understand a process well enough that they could, in principle, perform it themselves. The encoding into source code was always the second step. We did it because humans, however well they understand a process, cannot match a machine for speed or reliability — and have an inconvenient need for sleep.

For decades that was the deal. A business owner understood a process; a developer understood the business owner; the source code was the byproduct of that understanding, painstakingly translated through several meetings, languages, frameworks, and rather more meetings on the way to silicon.

That deal has changed. The entity we now describe processes to is already the machine. The author and the audience have merged. So the question writes itself: if the agent already understands what we want, why do we still ask it to produce thousands of lines of source code that we, in turn, will mostly never read?

The short-term answers are perfectly good. Executing compiled code is cheaper and faster than burning tokens. The entire existing body of software — every library, every API, every running system — is encoded in source. That body of work is not going anywhere quickly — not in a year, not in a decade, probably not in two.

But mid-term, I think the answer changes. Our industry has a stubborn habit of making things cheaper and faster, fast. The obstacles ahead are real, but they are the kind of constraints we have spent decades learning to engineer around. Once the economics flip, the cleanest representation of an application is no longer a tree of source files written for one runtime — it is a single document, in prose, describing the logic and behavior the application is supposed to exhibit. Read directly. Understood directly. Acted on directly.

That is what I mean by speccing is the new coding. And it is the reason I have just published SpecPack — three small reference standards meant as an experimental foundation for that future. None of them are rocket science. I think of them as a bit of housekeeping for a fresh start.

MiniMark takes Markdown and removes its optionality. Humans thrive on optionality — it is how we express our individuality and our taste. Machines do not need it, and tend to find it actively confusing. MiniMark keeps the syntax humans already know and strips the redundant ways of saying the same thing.

riVer is a versioning scheme for textual content. It assumes a world in which sophisticated version-control systems like git are no longer required, or simply not present in the agentic environment. Once an application is a single document, a long-standing anti-pattern turns out to make sense again: putting the version number inside the document and hence into the agent's context. What that version should indicate, on the other hand, is a question we get to ask from scratch. Agents do not consult a semantic version to decide whether a change is breaking — they read the spec and find out. What they need is an integer to mark the iteration, a status to mark the lifecycle stage, and a timestamp to place the change in time. riVer gives them exactly that.

Specify is the most ambitious of the three: an attempt at American English coding standards — conventions for expressing programmatic logic and behavior in plain English, precisely enough for an agent to act on.

So — does this mean source code goes away?

No. It changes jobs: from the substance of applications to the seam between them. More on that next time.

In the meantime, the standards are live, the license is permissive, and the floor is open. Have a look around, and let me know what you think.

My New Adventures in IT

Dirk Mattig — Mon, 25 May 2026 10:20:19 +0000

When a blinking cursor on a screen awaited my input for the very first time, I could not possibly have anticipated that this technology would soon open up a whole new dimension for the world to live in. Having been born about six months after the Beatles split up, I happened to be in the right place at the right time when a bunch of nerds succeeded in bringing computers to the home.

Now, after more than 40 years of coding as a hobby and over 25 years of software engineering as a profession, once again a blinking cursor (caret, really) awaits my input. Only this time the machine answers in far more elaborate ways than simply stating "ERROR". The more I use AI, the more evident it becomes to me that this is not just another paradigm shift like web, mobile, or cloud were.

This is a new beginning. A new dimension is opening up.
The new rules are that there will be all new rules.
And that we do not yet know any of these new rules.

This is the starting point for my ventures into the future of software, work, and business.