Alexander Ertli

I Let an LLM Write JavaScript Inside My AI Runtime. Here’s What Happened

Two weeks ago I read a line about tool use with Claude that stuck in my head. Paraphrased:

Direct tool calls don’t really scale.
Have the model write code that uses tools, and execute that code instead.

At the same time, I was knee-deep in wiring a JavaScript execution environment into Contenox, my self-hosted runtime for deterministic, chat-native AI workflows.

So of course the thought was:

What if I just let the model write the JavaScript and run it inside the runtime? 😅

This post is about what happened when I tried exactly that.


What is Contenox?

Very short version:

Contenox is a self-hostable runtime for sovereign GenAI applications.
It models AI behavior as explicit state machines, not opaque prompt chains.

Some key properties:

  • Runtime, not a library
  • Explicit state machines
  • Chat-native interface
  • Vendor-agnostic & self-hosted
  • Written in Go, with lots of passion and zero tolerance for shortcuts


The experiment: ask it to fetch and summarize a TODO

Once the JS execution hook was in place (a Goja VM with some globals exposed), I wired up a new state machine:

mux_input → moderate → generate_js → run_js → eval_js_result → (repair_js?) → answer
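The run_js hook itself is conceptually small: spin up a Goja VM, expose a few host functions as globals, run the generated script, and export whatever its final expression evaluates to. Contenox's real implementation has more to it, but here is a minimal Go sketch of the idea, with an illustrative httpFetch whose response fields (ok, status, body, …) mirror the fields the generated script reads later in this post:

package main

import (
	"fmt"
	"io"
	"net/http"

	"github.com/dop251/goja"
)

// httpFetch is a host function exposed to the sandboxed script.
// The response shape here is illustrative, not Contenox's exact API.
func httpFetch(url string) map[string]interface{} {
	resp, err := http.Get(url)
	if err != nil {
		return map[string]interface{}{"ok": false, "error": err.Error()}
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	return map[string]interface{}{
		"ok":         resp.StatusCode >= 200 && resp.StatusCode < 300,
		"status":     resp.StatusCode,
		"statusText": resp.Status,
		"body":       string(body),
	}
}

// runJS executes LLM-generated JavaScript in a fresh Goja VM and
// returns the value of the script's final expression.
func runJS(code string) (interface{}, error) {
	vm := goja.New()
	if err := vm.Set("httpFetch", httpFetch); err != nil {
		return nil, err
	}
	val, err := vm.RunString(code) // synchronous, no event loop needed
	if err != nil {
		return nil, err
	}
	return val.Export(), nil
}

func main() {
	out, err := runJS(`httpFetch("https://jsonplaceholder.typicode.com/todos/1").status`)
	fmt.Println(out, err)
}

A production sandbox also needs timeouts and limits on what the globals can reach (Goja can interrupt a running script), but the shape stays the same.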

Then I opened the Contenox chat UI and typed:

Me: “Hey Contenox, fetch the TODO item from jsonplaceholder and summarize it.”

The first attempt didn’t go as planned.

Instead of a neat summary, I got:

Contenox:
“I'm sorry, but it seems there was a problem fetching the TODO item from the provided URL due to an Internal Server Error (status code: 500). Unfortunately, I couldn't retrieve any data at this time. My apologies for the inconvenience!”

I checked manually – Cloudflare was having a moment, JSONPlaceholder was behind it, and the request really was failing. Okay, fair. Long coffee break.

A bit later, I went back to the same chat and typed:

Me: “okay let's try again now.”

This time the reply was:

Contenox:
Task titled "delectus aut autem" is incomplete. Nothing to summarize here.

Wait. It worked? Just like that?


Under the hood: LLM → JavaScript → tools → LLM

Looking at the execution trace made it clear what happened.

1. Input is normalized and moderated

The workflow starts with:

  1. mux_input
  • Parses the incoming chat history into a normalized messages array.

  2. moderate
  • Uses a small model to classify the input as safe/unsafe.
  • Output: 0 → safe → continue.

So far this is standard workflow stuff.

2. generate_js: the LLM writes JavaScript

Next, the generate_js task runs with a system instruction like:

  • “You are an expert JavaScript agent for the Contenox runtime…”
  • “You must output JSON: { "code": "<the JavaScript source as a string>" }”
  • “You can use httpFetch, executeTask, executeTaskChain, …”
  • “Script must be synchronous, no async/await, and must return a JSON-serializable object.”

Given the chat history, the model produced something like this (simplified):

{
  "code": "var result;
var answer;
var meta = { steps: [] };

var todoUrl = \"https://jsonplaceholder.typicode.com/todos/1\";

// Step 1: Fetch the TODO item
meta.steps.push(\"Fetching TODO item from \" + todoUrl);
var fetchResponse = httpFetch(todoUrl);

if (!fetchResponse.ok) {
  answer = \"I couldn't access that item right now because there was some trouble with getting it. Please check the URL or try again later.\";
  meta.error = {
    message: \"HTTP fetch failed\",
    url: todoUrl,
    status: fetchResponse.status,
    statusText: fetchResponse.statusText,
    body: fetchResponse.body,
    error: fetchResponse.error
  };
  result = { answer: answer, meta: meta };
} else {
  var todoItem;
  try {
    todoItem = JSON.parse(fetchResponse.body);
    meta.raw = todoItem;
  } catch (e) {
    answer = \"I fetched the item, but I couldn't understand its format. It doesn't seem to be valid JSON.\";
    meta.error = {
      message: \"JSON parse error\",
      rawBody: fetchResponse.body,
      parseError: e.toString()
    };
    result = { answer: answer, meta: meta };
  }

  if (todoItem) {
    // Step 2: Summarize the TODO item using an LLM
    meta.steps.push(\"Summarizing TODO item with LLM\");
    var prompt = \"Summarize the following TODO item in a concise sentence: \" + JSON.stringify(todoItem);
    var summarizeTask = executeTask(prompt, \"gpt-4o-mini\", \"openai\");

    if (!summarizeTask.success) {
      answer = \"While I managed to fetch your TODO details, the summary tool had a hiccup and couldn't provide me with a concise sentence right now. We seem to be experiencing some temporary issues.\";
      meta.error = {
        message: \"LLM task execution failed\",
        task_id: summarizeTask.task_id,
        error: summarizeTask.error
      };
      result = { answer: answer, meta: meta };
    } else {
      answer = summarizeTask.response;
      meta.llmResponse = summarizeTask.response;
      result = { answer: answer, meta: meta };
    }
  }
}

result;"
}

A few important details here:

  • It encodes error handling:

    • If HTTP fails → user-friendly answer, raw error in meta.error.
    • If JSON parse fails → similar pattern.
    • If the inner LLM call fails → fallback message.
  • It delegates the actual summarization to another model via executeTask.

  • It returns a structured result with both answer and meta.

This is not the model “calling tools” directly. It’s the model writing a program that calls tools.

3. run_js: execute the code in a sandbox

The next task is run_js, which is just a Contenox hook that calls the JS sandbox:

{
  "name": "js_sandbox",
  "tool_name": "execute_js",
  "args": {
    "code": "{{.generate_js.code}}"
  }
}
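That {{.generate_js.code}} placeholder looks like Go's text/template syntax, which suggests the runtime renders hook arguments against the outputs of earlier tasks before invoking the tool. A rough sketch of that substitution step (the data layout here is my assumption, not Contenox's actual internals):

package main

import (
	"os"
	"text/template"
)

func main() {
	// Hypothetical: outputs of earlier tasks, keyed by task name.
	taskOutputs := map[string]map[string]string{
		"generate_js": {"code": "var result = 1 + 1; result;"},
	}

	// The hook argument as written in the workflow definition.
	arg := template.Must(template.New("arg").Parse("{{.generate_js.code}}"))

	// Renders the generated code into the js_sandbox tool call.
	if err := arg.Execute(os.Stdout, taskOutputs); err != nil {
		panic(err)
	}
	// prints: var result = 1 + 1; result;
}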

Inside the trace you can see:

  • An httpFetch log for the JSONPlaceholder URL.
  • A response with status: 200 OK when things finally worked.
  • An executeTask log with the summarization prompt:

    • Summarize the following TODO item in a concise sentence: {"userId":1,"id":1,"title":"delectus aut autem","completed":false}

The sandbox result looked roughly like:

{
  "ok": true,
  "result": {
    "answer": "Task titled \"delectus aut autem\" is incomplete.",
    "meta": {
      "llmResponse": "Task titled \"delectus aut autem\" is incomplete.",
      "raw": {
        "userId": 1,
        "id": 1,
        "title": "delectus aut autem",
        "completed": false
      },
      "steps": [
        "Fetching TODO item from https://jsonplaceholder.typicode.com/todos/1",
        "Summarizing TODO item with LLM"
      ]
    }
  },
  "logs": [ ... ],
  "code": "var result; ..."
}

4. eval_js_result: success or retry?

Now comes the evaluator:

  • It receives a description of the JS sandbox output.
  • The system prompt is very strict:

    • If ok is true and there is a non-empty result.answer → respond with success.
    • Otherwise → respond with retry.

On the successful run, it answered:

success

So the workflow does not go into repair_js or run_js_retry. Happy path.
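The rule is strict enough that it boils down to a plain predicate over the sandbox output. Written out in Go for clarity (in the workflow a small model applies it, which also lets it tolerate slightly messier output), with a struct that mirrors the sandbox result shown above:

package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// SandboxResult mirrors the run_js output shape shown above.
type SandboxResult struct {
	OK     bool `json:"ok"`
	Result struct {
		Answer string `json:"answer"`
	} `json:"result"`
}

// evalJSResult applies the evaluator's rule: ok + non-empty answer => success.
func evalJSResult(r SandboxResult) string {
	if r.OK && strings.TrimSpace(r.Result.Answer) != "" {
		return "success"
	}
	return "retry"
}

func main() {
	raw := `{"ok": true, "result": {"answer": "Task titled \"delectus aut autem\" is incomplete."}}`
	var r SandboxResult
	if err := json.Unmarshal([]byte(raw), &r); err != nil {
		panic(err)
	}
	fmt.Println(evalJSResult(r)) // success
}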

5. answer: extract the final user message

The final task, answer, is intentionally boring:

  • System prompt: “You are a purely extractive post-processor. Do NOT invent content. Just surface the best existing answer field.”
  • It gets:

    • First run (run_js result).
    • Second run (run_js_retry), if any.
  • Selection rule:

    • Take the last non-empty answer you see.
    • Output it verbatim.

In our case it found:

Task titled "delectus aut autem" is incomplete.

And that’s exactly what Contenox replied in chat.
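For completeness: the selection rule the answer task follows also amounts to a tiny function in disguise. A sketch, assuming the run outputs arrive as an ordered list of answer strings:

package main

import (
	"fmt"
	"strings"
)

// pickAnswer implements the answer task's rule:
// take the last non-empty answer across runs, output it verbatim.
func pickAnswer(answers []string) string {
	final := ""
	for _, a := range answers {
		if strings.TrimSpace(a) != "" {
			final = a
		}
	}
	return final
}

func main() {
	runs := []string{
		"", // hypothetical: first run produced nothing
		`Task titled "delectus aut autem" is incomplete.`,
	}
	fmt.Println(pickAnswer(runs))
}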


Why this is interesting (to me, at least)

What I originally set out to build:

A runtime for deterministic, observable GenAI workflows.
Tasks, transitions, hooks – all explicit and replayable.

What I accidentally stumbled into:

A multi-model, self-orchestrating agent pattern,
where LLMs write code that uses tools, and the runtime executes and evaluates that code.

The pattern looks like this:

  1. Planner LLM (generate_js)
  • Reads user intent + history.
  • Emits JavaScript that calls httpFetch, executeTask, executeTaskChain, hooks, etc.
  2. Execution environment (run_js in Goja)
  • Deterministic execution of that JS.
  • Full logs of every HTTP call, every inner LLM call, every step.
  3. Controller LLM (eval_js_result)
  • Looks at the sandbox result.
  • Decides: is this good enough? Retry? Repair?
  4. Repair LLM (repair_js, if needed)
  • Gets the previous code + error output.
  • Writes a fixed version of the JS.
  5. Answer LLM (answer)
  • Doesn’t “reason” at all.
  • Just extracts the final answer text safely.

All of that is expressed as an explicit state machine in Contenox.

No hidden loops, no undocumented retries, no magic glue code inside some SDK. It’s all visible in the workflow graph and trace.


To me, that’s the exciting part:

You don’t have to choose between “boring deterministic workflows” and “fancy agents”.
You can build the agent on top of deterministic workflows.
And everything stays self-hosted, inspectable, and auditable if you want.
