Two weeks ago I read a line about tool use with Claude that stuck in my head. Paraphrased:
> Direct tool calls don’t really scale.
> Have the model write code that uses tools, and execute that code instead.
At the same time, I was knee-deep in wiring a JavaScript execution environment into Contenox, my self-hosted runtime for deterministic, chat-native AI workflows.
So of course the thought was:
What if I just let the model write the JavaScript and run it inside the runtime? 😅
This post is about what happened when I tried exactly that.
What is Contenox?
Very short version:
Contenox is a self-hostable runtime for sovereign GenAI applications.
It models AI behavior as explicit state machines, not opaque prompt chains.
Some key properties:
- Runtime, not a library
- Explicit state machines
- Chat-native interface
- Vendor-agnostic & self-hosted
- Written in Go, with lots of passion and zero tolerance for shortcuts
The experiment: ask it to fetch and summarize a TODO
Once the JS execution hook was in place (a Goja VM with some globals exposed), I wired up a new state machine:
mux_input → moderate → generate_js → run_js → eval_js_result → (repair_js?) → answer
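A quick aside on the hook itself: here is a minimal sketch of what “a Goja VM with some globals exposed” can look like on the Go side. It is illustrative only – the real Contenox hook exposes more globals (`executeTask`, `executeTaskChain`, …) and adds logging, timeouts, and limits – and the `fetchResult` fields simply mirror what the generated script reads later:

```go
package main

import (
	"fmt"
	"io"
	"net/http"

	"github.com/dop251/goja"
)

// fetchResult mirrors the fields the generated scripts read from httpFetch:
// ok, status, statusText, body, error.
type fetchResult struct {
	Ok         bool   `json:"ok"`
	Status     int    `json:"status"`
	StatusText string `json:"statusText"`
	Body       string `json:"body"`
	Error      string `json:"error"`
}

// newSandbox builds a fresh Goja VM and exposes a synchronous httpFetch global.
func newSandbox() *goja.Runtime {
	vm := goja.New()
	// Expose struct fields to JS under their json tag names (ok, status, ...).
	vm.SetFieldNameMapper(goja.TagFieldNameMapper("json", true))

	vm.Set("httpFetch", func(url string) fetchResult {
		resp, err := http.Get(url)
		if err != nil {
			return fetchResult{Ok: false, Error: err.Error()}
		}
		defer resp.Body.Close()
		body, _ := io.ReadAll(resp.Body)
		return fetchResult{
			Ok:         resp.StatusCode >= 200 && resp.StatusCode < 300,
			Status:     resp.StatusCode,
			StatusText: resp.Status,
			Body:       string(body),
		}
	})
	return vm
}

func main() {
	vm := newSandbox()
	v, err := vm.RunString(`httpFetch("https://jsonplaceholder.typicode.com/todos/1").status`)
	if err != nil {
		panic(err)
	}
	fmt.Println(v) // 200, when JSONPlaceholder is up
}
```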
Then I opened the Contenox chat UI and typed:
Me: “Hey Contenox, fetch the TODO item from jsonplaceholder and summarize it.”
The first attempt didn’t go as planned.
Instead of a neat summary, I got:
Contenox:
“I'm sorry, but it seems there was a problem fetching the TODO item from the provided URL due to an Internal Server Error (status code: 500). Unfortunately, I couldn't retrieve any data at this time. My apologies for the inconvenience!”
I checked manually – Cloudflare was having a moment, JSONPlaceholder was behind it, and the request really was failing. Okay, fair. Long coffee break.
A bit later, I went back to the same chat and typed:
Me: “okay let's try again now.”
This time the reply was:
Contenox:
Task titled "delectus aut autem" is incomplete. Nothing to summarize here.
Wait. It worked? Just like that?
Under the hood: LLM → JavaScript → tools → LLM
Looking at the execution trace made it clear what happened.
1. Input is normalized and moderated
The workflow starts with:
- `mux_input` – parses the incoming chat history into a normalized `messages` array.
- `moderate` – uses a small model to classify the input as safe/unsafe. Output: `0` → safe → continue.
So far this is standard workflow stuff.
2. generate_js: the LLM writes JavaScript
Next, the generate_js task runs with a system instruction like:
- “You are an expert JavaScript agent for the Contenox runtime…”
- “You must output JSON: { "code": "" }”
- “You can use httpFetch, executeTask, executeTaskChain, …”
- “Script must be synchronous, no async/await, and must return a JSON-serializable object.”
Given the chat history, the model produced something like this (simplified):
{
"code": "var result;
var answer;
var meta = { steps: [] };
var todoUrl = \"https://jsonplaceholder.typicode.com/todos/1\";
// Step 1: Fetch the TODO item
meta.steps.push(\"Fetching TODO item from \" + todoUrl);
var fetchResponse = httpFetch(todoUrl);
if (!fetchResponse.ok) {
answer = \"I couldn't access that item right now because there was some trouble with getting it. Please check the URL or try again later.\";
meta.error = {
message: \"HTTP fetch failed\",
url: todoUrl,
status: fetchResponse.status,
statusText: fetchResponse.statusText,
body: fetchResponse.body,
error: fetchResponse.error
};
result = { answer: answer, meta: meta };
} else {
var todoItem;
try {
todoItem = JSON.parse(fetchResponse.body);
meta.raw = todoItem;
} catch (e) {
answer = \"I fetched the item, but I couldn't understand its format. It doesn't seem to be valid JSON.\";
meta.error = {
message: \"JSON parse error\",
rawBody: fetchResponse.body,
parseError: e.toString()
};
result = { answer: answer, meta: meta };
}
if (todoItem) {
// Step 2: Summarize the TODO item using an LLM
meta.steps.push(\"Summarizing TODO item with LLM\");
var prompt = \"Summarize the following TODO item in a concise sentence: \" + JSON.stringify(todoItem);
var summarizeTask = executeTask(prompt, \"gpt-4o-mini\", \"openai\");
if (!summarizeTask.success) {
answer = \"While I managed to fetch your TODO details, the summary tool had a hiccup and couldn't provide me with a concise sentence right now. We seem to be experiencing some temporary issues.\";
meta.error = {
message: \"LLM task execution failed\",
task_id: summarizeTask.task_id,
error: summarizeTask.error
};
result = { answer: answer, meta: meta };
} else {
answer = summarizeTask.response;
meta.llmResponse = summarizeTask.response;
result = { answer: answer, meta: meta };
}
}
}
result;"
}
A few important details here:
- It encodes error handling:
  - If HTTP fails → user-friendly `answer`, raw error in `meta.error`.
  - If JSON parse fails → similar pattern.
  - If the inner LLM call fails → fallback message.
- It delegates the actual summarization to another model via `executeTask`.
- It returns a structured `result` with both `answer` and `meta`.
This is not the model “calling tools” directly. It’s the model writing a program that calls tools.
3. run_js: execute the code in a sandbox
The next task is run_js, which is just a Contenox hook that calls the JS sandbox:
{
"name": "js_sandbox",
"tool_name": "execute_js",
"args": {
"code": "{{.generate_js.code}}"
}
}
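On the runtime side, the hook presumably boils down to handing that code string to the VM and exporting whatever the script’s final `result;` expression evaluates to. A rough Go sketch – the names `sandboxResult` and `runJS` are mine, chosen to mirror the trace shape below, not Contenox’s internals:

```go
package sandbox

import "github.com/dop251/goja"

// sandboxResult loosely mirrors the trace shape below (ok, result, logs,
// code); the field names are assumptions, not Contenox's actual types.
type sandboxResult struct {
	Ok     bool        `json:"ok"`
	Result interface{} `json:"result"`
	Logs   []string    `json:"logs"` // per-call log capture omitted in this sketch
	Code   string      `json:"code"`
	Error  string      `json:"error,omitempty"`
}

// runJS evaluates the generated script. Because the script ends with a bare
// `result;` expression, that value is what RunString returns; Export turns
// it back into plain Go data the rest of the workflow can inspect.
func runJS(vm *goja.Runtime, code string) sandboxResult {
	v, err := vm.RunString(code)
	if err != nil {
		// Syntax errors and thrown exceptions both land here.
		return sandboxResult{Ok: false, Code: code, Error: err.Error()}
	}
	return sandboxResult{Ok: true, Result: v.Export(), Code: code}
}
```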
Inside the trace you can see:
- An `httpFetch` log for the JSONPlaceholder URL.
- A response with `status: 200 OK` when things finally worked.
- An `executeTask` log with the summarization prompt:
  `Summarize the following TODO item in a concise sentence: {"userId":1,"id":1,"title":"delectus aut autem","completed":false}`
The sandbox result looked roughly like:
{
"ok": true,
"result": {
"answer": "Task titled \"delectus aut autem\" is incomplete.",
"meta": {
"llmResponse": "Task titled \"delectus aut autem\" is incomplete.",
"raw": {
"userId": 1,
"id": 1,
"title": "delectus aut autem",
"completed": false
},
"steps": [
"Fetching TODO item from https://jsonplaceholder.typicode.com/todos/1",
"Summarizing TODO item with LLM"
]
}
},
"logs": [ ... ],
"code": "var result; ..."
}
4. eval_js_result: success or retry?
Now comes the evaluator:
- It receives a description of the JS sandbox output.
- The system prompt is very strict:
  - If `ok` is true and there is a non-empty `result.answer` → respond with `success`.
  - Otherwise → respond with `retry`.

On the successful run, it answered:

`success`

So the workflow does not go into `repair_js` or `run_js_retry`. Happy path.
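The rule itself is mechanical; written out in Go for illustration (the actual evaluator is an LLM judging against that strict prompt, not this code):

```go
package workflow

// decide spells out the evaluator's rule. In Contenox this decision is made
// by an LLM against the strict system prompt above; the logic it is asked
// to apply is just this.
func decide(ok bool, result map[string]interface{}) string {
	if ok {
		if answer, _ := result["answer"].(string); answer != "" {
			return "success"
		}
	}
	return "retry"
}
```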
5. answer: extract the final user message
The final task, answer, is intentionally boring:
- System prompt: “You are a purely extractive post-processor. Do NOT invent content. Just surface the best existing `answer` field.”
- It gets:
  - The first run (`run_js` result).
  - The second run (`run_js_retry`), if any.
- Selection rule (sketched below):
  - Take the last non-empty `answer` you see.
  - Output it verbatim.
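Spelled out as code for illustration (the real step is an extractive LLM prompt; the function name here is hypothetical):

```go
package workflow

import "strings"

// lastNonEmptyAnswer applies the selection rule: candidates arrive in run
// order (run_js first, then run_js_retry if it happened) and the last
// non-empty answer wins, verbatim.
func lastNonEmptyAnswer(candidates []string) string {
	picked := ""
	for _, c := range candidates {
		if strings.TrimSpace(c) != "" {
			picked = c
		}
	}
	return picked
}
```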
In our case it found:
Task titled "delectus aut autem" is incomplete.
And that’s exactly what Contenox replied in chat.
Why this is interesting (to me, at least)
What I originally set out to build:
A runtime for deterministic, observable GenAI workflows.
Tasks, transitions, hooks – all explicit and replayable.
What I accidentally stumbled into:
A multi-model, self-orchestrating agent pattern,
where LLMs write code that uses tools, and the runtime executes and evaluates that code.
The pattern looks like this:
- **Planner LLM** (`generate_js`)
  - Reads user intent + history.
  - Emits JavaScript that calls `httpFetch`, `executeTask`, `executeTaskChain`, hooks, etc.
- **Execution environment** (`run_js` in Goja)
  - Deterministic execution of that JS.
  - Full logs of every HTTP call, every inner LLM call, every step.
- **Controller LLM** (`eval_js_result`)
  - Looks at the sandbox result.
  - Decides: is this good enough? Retry? Repair?
- **Repair LLM** (`repair_js`, if needed)
  - Gets the previous code + error output.
  - Writes a fixed version of the JS.
- **Answer LLM** (`answer`)
  - Doesn’t “reason” at all.
  - Just extracts the final `answer` text safely.
All of that is expressed as an explicit state machine in Contenox.
No hidden loops, no undocumented retries, no magic glue code inside some SDK. It’s all visible in the workflow graph and trace.
To me, that’s the exciting part:
You don’t have to choose between “boring deterministic workflows” and “fancy agents”.
You can build the agent on top of deterministic workflows.
And everything stays **self-hosted, inspectable, and auditable** if you want.