DEV Community: Alexander Ertli

Shouldn't AI Move From Cloud to Local Compute?

Alexander Ertli — Sun, 14 Jun 2026 10:02:34 +0000

A few things happened almost at the same time.

GitHub moved Copilot deeper into usage-based billing.

OpenAI kept pushing the Responses API as the default primitive for building agents.

Anthropic launched Fable/Mythos and then had to suspend access a few days later because of a U.S. government directive.

NVIDIA is putting “personal AI supercomputer” hardware into the market with DGX Spark and DGX Station.

Individually, each of those stories is easy to treat as separate news.

I do not think they are separate.

I think they are all pointing in the same direction:

AI coding is moving from a cloud feature into local infrastructure.

And that changes the question for developers.

Not just:

Which agent should I use?

But:

Where does the agent run?
Which models can it use?
Who owns the runtime? And who the compute?
What happens when pricing, policy, model access, or hardware changes?

That is the part I think matters.

What happened

GitHub announced that Copilot plans transition to usage-based billing on June 1, 2026. The old premium request unit model is being replaced by GitHub AI Credits, with usage calculated from token consumption: input tokens, output tokens, and cached tokens, priced according to the model used. GitHub also says this is because Copilot has moved from a simple in-editor assistant toward an agentic platform that can run long multi-step coding sessions across repositories. (The GitHub Blog)

There is an important detail here: GitHub says code completions and Next Edit suggestions remain included and do not consume AI Credits. So this is not “every ghost text completion is billed now.” The bigger point is that agentic usage, chat, long-running sessions, code review, and heavier model work are now part of a visible token economy. (The GitHub Blog)

OpenAI is moving in the same direction from the other side. The Responses API was introduced as a new primitive for building agents, combining Chat Completions-style simplicity with tool use from the Assistants API. OpenAI also added built-in tools like web search, file search, computer use, and an Agents SDK with tracing and observability. (OpenAI)

Then OpenAI expanded Responses further with remote MCP server support, Code Interpreter, improved file search, background mode for long-running tasks, reasoning summaries, and encrypted reasoning items. (OpenAI)

That is not just “new API features.”

That is model vendors moving up the stack into runtime, tools, state, observability, and orchestration.

Then the Anthropic Fable/Mythos story happened.

Anthropic said the U.S. government issued an export-control directive requiring suspension of access to Fable 5 and Mythos 5 by foreign nationals, including foreign-national Anthropic employees. Anthropic said the practical result was that it had to disable access for all customers to ensure compliance. (Anthropic) The Verge reported the same basic story: an export-control directive citing national security concerns required blocking access for foreign nationals, and Anthropic cut access for all customers. (The Verge)

You can argue about the policy. You can argue about the safety question. You can argue about whether the government overreacted.

But as a developer, the practical lesson is boring:

remote model access is not a stable primitive.

It can change because of price.

It can change because of policy.

It can change because of region.

It can change because of provider risk posture.

It can change because of model availability.

That does not mean “never use frontier models.” That would be stupid. Frontier models are extremely useful.

It means the runtime layer matters.

What this means for us as developers

For the last two years, a lot of AI tooling was sold like a product feature.

Install extension.

Get magic.

That worked for the first wave.

But serious AI coding work is not just a UI feature anymore. It is a stack.

You have model selection.

You have context management.

You have file search.

You have tool execution.

You have shell access.

You have policy.

You have logs.

You have long-running tasks.

You have cost.

You have rate limits.

You have model-specific quirks.

You have the question of where your source code, prompts, tool results, and traces go.

That is runtime territory.

The uncomfortable part is that a lot of developers are now using agents that can inspect repos, edit files, run commands, open PRs, call tools, and sometimes run for minutes or hours — while the execution layer underneath is still treated like a black box.

That is fine for experiments.

It is not fine as the default future.

If AI coding becomes part of normal development, then developers and teams need more control over the runtime.

Not because cloud is bad.

Because dependency without control is fragile.

The local compute trend is not a meme anymore

The other signal is hardware.

NVIDIA DGX Spark is a desktop AI system with a Grace Blackwell GB10 Superchip, 128 GB of coherent unified memory, and up to 1 petaFLOP of FP4 AI performance. NVIDIA says it can run AI development and testing workloads with models up to 200 billion parameters at the desktop, and two DGX Spark systems can connect for models up to 405 billion parameters. (NVIDIA)

DGX Station goes even further. NVIDIA describes it as a deskside AI supercomputer with 748 GB of coherent memory and up to 20 petaFLOPS of AI compute, supporting models up to 1 trillion parameters. NVIDIA also announced DGX Station for Windows as a system that can serve as a dedicated AI supercomputer for one developer or a shared local compute node for teams. (NVIDIA) (NVIDIA Newsroom)

Now, obviously, not everyone is buying a DGX Station.

That is not the point.

The point is the direction of travel.

For a while, local AI meant “maybe you can run a small model on your laptop if you are patient.”

Now the market is clearly moving toward a more serious local/on-prem/private-compute tier:

laptop for small models and simple workflows
workstation for serious local development
rented GPU box for heavier workloads
team-local node for shared inference
cloud frontier model when needed
data center or cluster later

That is a very different world from “all useful intelligence lives behind one vendor API.”

And once local or rented compute becomes powerful enough, the missing piece is not only the model runner.

It is the runtime around it.

Routing.

Compatibility.

Policies.

Tools.

Logs.

Editor integration.

The boring stuff.

The stuff that makes it usable.

The ecosystem is already forming

This is where I think people should not be tribal.

If you want to start today, there are already good projects.

Kilo Code is an open-source AI coding agent across VS Code, JetBrains, CLI, and cloud. It supports many models, bring-your-own-key usage, multiple agent modes, autocomplete, and a full agentic coding experience. (Kilo)

OpenCode is another important piece. It is an open-source coding agent available as a terminal interface, desktop app, or IDE extension. It is model-agnostic and clearly sits in the “developer agent” category. (OpenCode)

Ollama is probably the easiest starting point for local models. It gives developers a simple way to run and manage open models locally, and it exposes a REST API for chat and generation on localhost:11434. (Ollama) (GitHub)

If you are more advanced, or responsible for a team, vLLM is the next layer to understand. It is a high-throughput and memory-efficient inference and serving engine. The docs highlight Hugging Face integration, streaming outputs, tool calling and reasoning parsers, distributed inference features, and OpenAI-compatible API serving. (vLLM)

That rough map matters:

want to just run a local model: start with Ollama
want an open-source coding agent: try Kilo Code or OpenCode
want serious serving for a team: look at vLLM
want editor-native workflows: look at the VS Code/JetBrains/CLI agent space
want policy, runtime, compatibility, local tools, and owned compute to become one layer: that is the gap I care about

I do not see these projects as enemies.

Actually the opposite.

They form the market.

They teach users what is possible.

They normalize local models, open-source agents, model routing, self-hosting, and running AI outside a single cloud product.

That makes the next layer possible.

What this changed for Contenox

This is also why I changed the shape of Contenox.

For a while it was too easy to describe Contenox as “another agent runtime” or “a local agent framework.”

That framing is too small.

The direction is now clearer:

Contenox should be a local-first AI runtime for top-tier agent work without giving up control.

The agent is still important.

But the agent is the proof workload.

The deeper product is the runtime layer underneath it:

run on compute you control
use local, rented, or provider-backed models
expose compatibility surfaces existing tools understand
route across providers and models
understand model capabilities
execute tools safely
keep policy and approvals visible
log what happened
support editor workflows
avoid turning every local experiment into a fragile pile of glue scripts

That is also why the VS Code extension matters.

Not because VS Code is the whole product.

Because editor AI is where the runtime has to prove itself immediately.

Autocomplete has to be fast.

Chat has to stream.

Tool calls need approvals.

Filesystem and shell access need boundaries.

Model/provider selection must be understandable.

The user should not need to run a whole browser control panel or expose a random HTTP server just to use local/editor AI.

The editor is the pressure test.

If the runtime can support a good VS Code experience, it becomes much more than a CLI experiment.

The goal: local-first, but not limited

When I say local-first, I do not mean “local only.”

That would be another trap.

A local-first top-tier AI agent should be able to use:

a small local model when that is enough
a bigger local model on a workstation
Ollama for simple local model management
vLLM for team-grade serving
OpenRouter or provider APIs when frontier capability is the right tradeoff
future local/team GPU boxes when they make sense

The point is not purity.

The point is control.

You should be able to choose where the model runs.

You should be able to move workloads.

You should be able to see what tools the agent called.

You should be able to approve dangerous actions.

You should be able to keep logs.

You should be able to switch from one backend to another without rewriting your whole workflow.

That is the “without compromises” part for me.

Not “everything is free.”

Not “local models beat every frontier model.”

Not “cloud is bad.”

The compromise I do not want is this one:

To get a good AI coding experience, you must give up ownership of the runtime.

I think we can do better.

Conclusion

The recent news is not one story.

Copilot metering shows that agentic coding work has real variable cost.

OpenAI’s Responses API shows that model providers are turning tools, state, tracing, and orchestration into platform primitives.

The Anthropic Fable/Mythos disruption shows that access to frontier capability can change suddenly because of policy.

NVIDIA’s DGX Spark and DGX Station show that local and team-local AI compute is becoming a serious product category, not just a hobby setup.

And the open-source ecosystem around Kilo Code, OpenCode, Ollama, and vLLM shows that developers are already moving toward a world where AI coding is not one cloud feature, but a stack.

That is the world I want Contenox to fit into.

Not as another random agent.

As a local-first runtime for serious agent work on compute you control.

The agent is what proves the product.

The runtime is the product.

Contenox: Free and OpenSource forever.

Ctrl + P
> ext install contenox.contenox-runtime

The Soul of Contenox: Stop begging the model. Start programming the runtime.

Alexander Ertli — Sun, 17 May 2026 20:35:12 +0000

There is a strange pattern emerging in modern AI tooling.

We take probabilistic language models, connect them directly to terminals, cloud infrastructure, internal APIs, and production data, then try to control them with sentences like:

“Please don’t modify anything dangerous.”

That is not governance. That is wishful thinking wrapped around root access.

LLMs are excellent at reasoning. They are terrible at boundaries. And in production systems, boundaries are the only thing standing between automation and damage.

Contenox was built from a different premise:

An AI agent should be treated like an operating-system subject — constrained by explicit capabilities, enforced policy, and deterministic runtime behavior.

Not a chatbot with inherited authority.

Not a black box SaaS runtime.

Not a vendor-defined orchestration engine you cannot inspect.

Contenox is a local-first runtime for governing LLM execution through explicit policy, capability isolation, and declarative workflows.

A single Go binary. Your models, your chains, your policies, your infrastructure.

1. Prompts are hopes. Policies are firewalls.

A system prompt is not a security boundary.

This is the standard pattern today:

"You are a helpful assistant. Do not use sudo. Avoid sensitive files."

The model sees that instruction. It may even follow it.

Until it doesn’t.

Contenox moves authority enforcement out of the prompt and into the runtime itself.

[
  {
    "tools": "local_shell",
    "tool": "local_shell",
    "action": "deny"
  },
  {
    "tools": "local_fs",
    "tool": "*",
    "action": "deny",
    "when": [
      {
        "key": "path",
        "op": "glob",
        "value": "**/{.ssh,.gnupg,.aws,.env}/**"
      }
    ]
  }
]

The distinction matters.

A prompt is advice to the model.

A policy is enforced by the engine.

The model cannot negotiate with it, reinterpret it, or hallucinate around it.

If a tool call violates policy, the runtime refuses it mechanically.

Fail closed.

2. Cognition is cheap. Authority is precious.

The dangerous part of an AI agent is not the model.

It is the authority you accidentally hand to it.

Most frameworks expose your terminal, your API credentials, or your SaaS integrations directly to the model runtime. At that point the agent becomes a probabilistic extension of your own permissions.

One hallucinated command can become production damage.

Contenox rejects inherited authority entirely.

Instead of exposing “the CRM” or “the ERP,” you author a constrained tool surface:

A curated OpenAPI subset
Explicitly allowed operations
Local credential binding
Deny-by-default policy enforcement

The model never receives the underlying credential.

It cannot exfiltrate what it cannot read.

And if the schema only exposes:

GET /contacts

then:

DELETE /contacts

does not exist to the model at all.

Not hidden.

Not discouraged.

Absent.

You did not “grant access.”

You authored a capability.

3. Agent behavior belongs in Git. Not a black box.

Modern AI tooling often hides orchestration inside opaque runtimes:

hidden prompts
undocumented retries
proprietary planning heuristics
silently changing behavior

That is tolerable for chat applications.

It is unacceptable for infrastructure.

In Contenox, agent behavior is declarative infrastructure:

model selection
system prompts
retries
transitions
branching logic
pause conditions
tool budgets
approval behavior
policy enforcement

All defined in chain files you own.

{
  "id": "agent",
  "tasks": [
    {
      "id": "chat",
      "handler": "chat_completion",
      "system_instruction": "You are an assistant running inside the user's tools.",
      "execute_config": {
        "model": "{{var:model}}",
        "provider": "{{var:provider}}",
        "tools": [
          "*"
        ],
        "retry_policy": {
          "max_attempts": 4,
          "initial_backoff": "1s",
          "max_backoff": "30s"
        }
      },
      "transition": {
        "branches": [
          {
            "operator": "edge_traversed_at_least",
            "edge": "chat->run_tools",
            "when": "20",
            "goto": "end"
          },
          {
            "operator": "equals",
            "when": "tool-call",
            "goto": "run_tools"
          },
          {
            "operator": "default",
            "when": "",
            "goto": "end"
          }
        ]
      }
    },
    {
      "id": "run_tools",
      "handler": "execute_tool_calls",
      "input_var": "chat",
      "transition": {
        "branches": [
          {
            "operator": "default",
            "when": "",
            "goto": "chat"
          }
        ]
      }
    }
  ],
  "token_limit": 131072
}

You version it in Git.

You review it in pull requests.

You diff changes like any other production artifact.

Vendor updates do not silently rewrite your automation behavior upstream.

The runtime stays predictable because the behavior definition belongs to you.

Explicitness beats magic.

4. A hard deny is stronger than a soft ask.

“Human in the Loop” systems sound reassuring until you use them long enough.

Eventually every approval prompt becomes muscle memory.

Click.

Approve.

Click.

Approve.

Click.

Production incident.

Contenox treats autonomy and safety as competing forces, not marketing slogans.

So the runtime operates differently depending on the trust profile.

Interactive Mode — The Owner

Inside Zed, JetBrains, or AionUi, Contenox routes approvals through the editor’s own permission system.

You define:

what is always allowed
what is always denied
what requires approval

The runtime pauses exactly where your chain says it should pause.

Autonomous Mode — The Driver

In environments where the user is not the owner of the host or infra execution is different!

There is no trusted human sitting at the keyboard.

So there is no fallback approval prompt.

Everything is deny-by-default.

Wire in a new MCP server or API and it is still denied automatically.

No implicit trust.

No inherited capability.

No “the model probably won’t do that.”

Until you explicitly allow the tool in policy, the runtime refuses access mechanically.

Registering a tool does not grant authority to use it.

That distinction is the difference between orchestration and governance.

The Anti-Black-Box Runtime

Contenox is not trying to be an AI companion.

It is not trying to replace your judgment.

It is not trying to become an omniscient autonomous employee.

It is infrastructure.

A programmable runtime that:

pipes inputs
enforces policy
governs capabilities
executes deterministic workflows
runs locally
speaks open protocols
works with any model provider — or none at all

GGUF via the built-in engine.

or pick
Ollama.

or
OpenAI.

or
Gemini.

or
vertex ai

or
vLLM.

and
aws bedrock is coming soon.

Your own cluster.

Your own choice or own hardware.

Your own chain files.

Your own policies.

The runtime is replaceable.

The models are replaceable.

The governance stays yours.

Contenox is open source under Apache 2.0.

Install

curl -fsSL https://contenox.com/install.sh | sh

Then:

contenox init
contenox model pull gemma4-e4b # https://contenox.com/docs/guide/local-models/
contenox "say hello world in python"

No SaaS runtime required.

⭐ GitHub: https://github.com/contenox/contenox

📖 Docs: https://contenox.com/docs/

🔌 Zed Guide: https://contenox.com/docs/guide/zed/

If you believe AI agents should not inherit authority by accident, welcome to Contenox.

Should We Still Care About Our Code?

Alexander Ertli — Sat, 18 Apr 2026 06:28:02 +0000

As part of a new feature, I was looking at a "Human in the Loop" wrapper:

func (h *HITLWrapper) Exec(
    ctx context.Context,
    startTime time.Time,
    input any,
    debug bool,
    hook *taskengine.HookCall,
) (any, taskengine.DataType, error)

That’s a lot of code... but it’s not the interesting part here.

I wanted to take a look at something smaller, something simpler... a little helper:

// buildDiff generates a simple human-readable unified diff for file-write
// operations (write_file and sed). It avoids external dependencies by doing a
// straightforward line-level comparison.

That’s a solid contract. And here is what we got:

// buildDiff generates a simple human-readable unified diff for file-write
// operations (write_file and sed). It avoids external dependencies by doing a
// straightforward line-level comparison.
func buildDiff(hookName, toolName string, args map[string]any) string {
    switch {
    case hookName == "local_fs" && toolName == "write_file":
        path, _ := args["path"].(string)
        newContent, _ := args["content"].(string)
        if path == "" || newContent == "" {
            return ""
        }
        oldBytes, _ := os.ReadFile(path)
        return unifiedDiff(path, string(oldBytes), newContent)

    case hookName == "local_fs" && toolName == "sed":
        path, _ := args["path"].(string)
        pattern, _ := args["pattern"].(string)
        replacement, _ := args["replacement"].(string)
        if path == "" || pattern == "" {
            return ""
        }
        oldBytes, err := os.ReadFile(path)
        if err != nil {
            return ""
        }
        newContent := strings.ReplaceAll(string(oldBytes), pattern, replacement)
        return unifiedDiff(path, string(oldBytes), newContent)
    }
    return ""
}

Yep, this is AI-generated code. It's part of ~3,000 new or changed lines. Everything compiles, everything is tested, coverage looks good.

So… why even write about this?

It’s in the title: Should we still care about our code?

When you can generate 1,000 lines within minutes—and every line compiles and passes tests—code like this starts to look “good enough.”

Is that a problem?

You know what, let’s make the AI review its own code:

1. Empty content check is wrong

if path == "" || newContent == "" {
    return ""
}

Writing an empty file is a valid operation. This silently skips the diff for a legitimate write_file that empties a file.

2. Missing file → empty diff, not a "new file" diff

When the target file doesn't exist, os.ReadFile returns an error—and we return "".

That means:

For a brand new file, the user sees no diff at all.
For sed on a missing file, same problem.

3. `sed` uses string replace, not regex

...

4. `unifiedDiff` is a black box

...

That’s it? Now everything is good?
😄

You know what… if a human had written this, I would very politely reject both versions.

Why? Sure, we could run another round of AI review. Then another. And another.

But that’s not the problem.

Let’s go back to where we started:

func (h *HITLWrapper) Exec(
    ctx context.Context,
    startTime time.Time,
    input any,
    debug bool,
    hook *taskengine.HookCall,
) (any, taskengine.DataType, error)

What stands out here?

It returns an error.

Now compare that to the helper:

func buildDiff(hookName, toolName string, args map[string]any) string

See the difference?

Let’s zoom out even further:

// Package localhooks provides local hook integrations.
package localhooks

import (
    "context"
    "errors"
    "fmt"
    "os"
    "strings"
    "time"

    "github.com/contenox/contenox/hitlservice"
    "github.com/contenox/contenox/libtracker"
    "github.com/contenox/contenox/taskengine"
    "github.com/getkin/kin-openapi/openapi3"
)

See that import "os"?

And in the helper:

oldBytes, _ := os.ReadFile(path)

Outside of the ignored error—how do we know that path actually refers to something we can safely read via os.ReadFile?

It could be:

A relative path (./config.txt) – relative to what? The tool’s working directory? The agent’s sandbox?
An absolute path (/etc/hosts) – but the process might be containerized or restricted.
A virtual path (workspace://project/main.go) – the local_fs hook might understand this, but os.ReadFile won’t.
A dangerous path involving symlinks, .. traversal, or special files (/dev/random, /proc/self/mem).

The code compiles. The tests pass. But the assumptions are undefined.

And the model filled them in anyway.

So—should we still care about our code?

Yes. Definitely.

But not in the way we used to.

We’re not going to review thousands of generated lines line by line. We don’t—and realistically, we can’t.

What we can do is define the boundaries the code is allowed to operate within:

What does path actually mean?
What filesystem is accessible?
What errors must be handled vs ignored?

If we don’t define those constraints, the model will.

And its guesses will compile. They’ll pass tests. They’ll even look reasonable.

That’s the real danger.

So no—we don’t scale code review anymore.

We scale constraints.

Design first. Generate second.

And honestly, this is exactly the class of problem that led us to introduce a Human-in-the-Loop layer in the first place—not to review every line, but to enforce the boundaries the model can’t reliably infer.

The 90%-Done Paradox

Alexander Ertli — Sat, 11 Apr 2026 11:12:59 +0000

Despite all the recent breakthroughs in AI and tooling, software development hasn’t fundamentally changed.

In my journey as an engineer, I’ve observed four patterns that track with experience levels. Let me explain:

There’s a pattern I keep seeing:
The last 10% of any project takes 90% of the time.
And most engineers never learn how to handle it.

1. The Beginner’s Mind

We’ve all been here. Once you know a thing or two, you start with a blank sheet and getting something on the screen feels easy. You hack together whatever works and iterate until it feels okay. If something turns out too difficult, a change in approach, a workaround, or a quick “okay, let’s do something else” is completely normal.

2. The “Professional”

Now there’s pride on the line, stakeholders, and tickets in the backlog. Starting from a clean slate feels intimidating. Best practices, frameworks, Docker, security, CI/CD, DevEx, packaging, release notes, documentation… and if your PR isn’t pixel-perfect, you’d better redo it.
Oh, and shipping something that doesn’t match the ticket exactly? Good luck explaining why that form and button live behind an environment variable the admin can’t change at runtime.

3. The “Thinker”

This is where paths diverge. You shift from “how do we build this?” to “what are we even building?” You’ve developed taste — sometimes too much pride.
Having many Thinkers in one room can be counterproductive. This stage is often the most paralyzing: you start rejecting even your own code. Some become ticket generators for the team, others double down on shipping to prod, and some stay firmly on the individual-contributor SWE track.

4. The “Finisher” (the one that breaks the cycle)

What’s stage 4? Let’s connect the dots to the classic 90-10 rule:

Beginner: Ships 90% done and calls it 100%
Professional: Keeps reinventing the 90% that’s already there
Thinker: We’re never at the 90% done
Finisher: Knows exactly how to tackle the last 10%... and fully accepts that it will take 90% of the time.

I know this is abstract, so let me ask you this:

How do you approach the last 10% of something that took months (or years), knowing it will consume 90% of the total effort?

Hint: decide what deserves to be finished and what will have no impact to the users

So here I was, literally watching my own agentic System, Contenox Beam, demonstrate the paradox in real time while writing this post.

I asked it to briefly explore the codebase. It tried to run a local shell command — and correctly hit the security policy (no allow-list configured):

tool local_shell.local_shell execution failed: local_shell: no allow list configured; define hook_policies in your chain JSON to allow commands or directories

Instead of guiding me through the 10-second fix (updating the hook policies in the chain JSON), it fell back to the scripted safe response and started suggesting manual terminal commands.

It seems there was an issue with the local shell command due to a configuration problem. Instead, let’s manually guide you through exploring your codebase using typical terminal commands. First, could you please provide me with the path to your codebase?...

And you know what?

That moment was the perfect microcosm of everything I’m talking about.

The difference between an AI tool and an engineer is that the engineer knows when to stop following the script and just fix the stage.

Yet I bet 90% of us would have just pasted the ls output back into the chat window.

*Even Beam — built to be secure by default — still hit the 90-10 wall.

AI Beyond the Hype

Alexander Ertli — Thu, 26 Mar 2026 22:45:47 +0000

As the AI hype cycle cools, the real question becomes: what is this technology actually useful for?

I believe that even if venture capital dries up and model progress plateaus, AI will remain extremely useful — just not in the ways most people expect.

Let’s get to the point.

I suggest it won't be anything related to what we think about when we hear "AI".

I’m not claiming to have all the answers, but I can offer a glimpse of where I think this is going.

I’ll illustrate this using my own agentic system, Contenox, which I’m developing from scratch in Go.

Here’s a small but concrete example.

I still need a couple of attempts to get the engine running, encountering errors like default-provider is not set. or responses such as "I don’t have a plan-manager tool available in this environment, so I can’t literally invoke it. But I can provide the plan it should contain."

For context, Contenox began as a workflow engine for infrastructure and governance tasks, so it still leaks some raw engine details, but after a couple of attempts poking the system, I was able to get the vibe right.

It’s challenging to keep up with everything happening in tech when most of your time is spent on a day job... or just daily life. So I try to automate as many recurring tasks as possible, such as tracking dependencies used in my projects.

It's quite simple and I'm certain any MCP compatible agentic system can pull this off. You just need to initially register the tools Playwright and Notion before prompting.

contenox plan new "use playwright to go to https://github.com/ollama/ollama/releases and document the changes in the past 5 releases into notion using the notion tool" --shell

The prompt explicitly tells the system which tool to use for each task. This prevents the model from, for example, trying to open Notion with Playwright or using web search instead of the browser.

In Contenox, a plan consists of multiple steps the model believes are necessary to achieve the goal. This is a common concept in many agentic systems

Generating plan for: use playwright to go to https://github.com/ollama/ollama/releases and document the changes in the past 5 releases into notion using the notion tool...
Created plan "plan-9d3ef440" with 6 steps. Now active.

Trusting GPT-5.4 here to get it right with the auto mode...
(It's an awesome model for tool related tasks btw)

contenox plan next --auto

Executing Step 1: Open https://github.com/ollama/ollama/releases in Playwright and wait for the releases list to fully load....
✓ Step 1 completed.

Executing Step 2: Identify the 5 most recent releases shown on the page and, for each release, capture the version tag, release title, publication date, and release notes text....
✓ Step 2 completed.

Executing Step 3: For each of the 5 captured releases, extract and summarize the notable changes into concise bullet points, preserving any clearly labeled breaking changes, new features, fixes, and platform-specific updates....
✓ Step 3 completed.

Executing Step 4: Create a new page in Notion using the notion tool with a title such as 'Ollama GitHub Releases - Last 5 Versions' and include the source URL https://github.com/ollama/ollama/releases near the top....
✓ Step 4 completed.

Executing Step 5: Add a section for each of the 5 releases in the Notion page, including the version tag, release title, publication date, link to the specific GitHub release, and the summarized change bullets....
✓ Step 5 completed.

Executing Step 6: Review the completed Notion page to confirm all 5 releases are included in reverse chronological order and that the summaries accurately reflect the GitHub release notes....
✓ Step 6 completed.

Executing Step 10: Review the Notion page content for completeness and formatting, then save and confirm that all 5 release summaries are present....
✓ Step 10 completed.
All steps complete. Plan is done!

Done!

The example is simple, but the pattern scales dramatically. Apply the same workflow to generating a full user manual...
...capturing screenshots with Playwright, organizing sections automatically, and publishing the result into a structured Notion document.

To wrap this up I'll show the chain of steps Contenox did:

contenox plan show
Plan: plan-9d3ef440 (active) — 6/6 complete
1. [x] Open https://github.com/ollama/ollama/releases in Playwright and wait for the releases list to fully load.
2. [x] Identify the 5 most recent releases shown on the page and, for each release, capture the version tag, release title, publication date, and release notes text.
3. [x] For each of the 5 captured releases, extract and summarize the notable changes into concise bullet points, preserving any clearly labeled breaking changes, new features, fixes, and platform-specific updates.
4. [x] Create a new page in Notion using the notion tool with a title such as 'Ollama GitHub Releases - Last 5 Versions' and include the source URL https://github.com/ollama/ollama/releases near the top.
5. [x] Add a section for each of the 5 releases in the Notion page, including the version tag, release title, publication date, link to the specific GitHub release, and the summarized change bullets.
6. [x] Review the completed Notion page to confirm all 5 releases are included in reverse chronological order and that the summaries accurately reflect the GitHub release notes.

Yep that's it.

It's an interesting period to transition from coding a project to actually using it. A bit dull, a bit boring, but a very necessary step.

I’ll keep you posted. Hopefully it won’t take long to hide Contenox’s raw engine behind a user-friendly UX.

This is what “AI beyond the hype” may actually look like: not artificial intelligence replacing humans, but reliable systems quietly handling digital work at scale.

Cheers!

Why My Enterprise AI Startup Failed... And What I Learned After Getting a Job

Alexander Ertli — Sun, 15 Mar 2026 22:16:21 +0000

Moving from a full-time founder to bootstrapping wasn't just a shift in working habits—it required ruthlessly re-scoping the product.

It’s been a weird journey. The venture eventually failed, and honestly, I knew it for months. The vision just didn't work outside of fancy, word-salad copywriting. Everyone was nodding along, saying it perfectly aligned with what every speaker was venting about on summit stages and in interviews.

I spent months chasing that dream of secure, non-hallucinating AI governance. You know the exact talk: data regulations, "AI sovereignty," panicked threads about an AI agent deleting a production database, and companies dropping blanket bans on ChatGPT to stop code leaks. On the surface, it’s a massive missing tech niche. Build the engine, brew the dashboards, spin up a company, raise, scale, and sell compliant AI. Sounds simple—just figure out the tech. That's what they tell you they need, right?

But you know what lesson you typically learn the hard way? What people perform distress about and what they’ll actually pay to fix are two completely different markets. I sat across from CTOs who leaned forward and said "this is exactly what we need"—and then did absolutely nothing.

Let me shortcut this so you can learn the lesson on your own: I knew my venture failed many months before I actually gave up. So, one day, I picked up the phone, responded to an InMail, and got a job.

It was eye-opening. Seeing the day-to-day reality showed me exactly what I should have built, and for whom. Sometimes, what the market says it wants is so maddeningly different from what it actually does on a daily basis.

And I observed another very ironic thing: having less time to work on a product makes your product much better designed and much more applicable for real outcomes. Constraints killed the enterprise fantasy and forced me toward something real.

So yeah, my enterprise venture is dead. But somehow, it’s been reborn.

The product that survived is what I now call a Vibe Coding Platform. I know that sounds like it lands out of nowhere, and honestly, it sounds like the exact opposite of secure governance. No guardrails, just vibes, right?

Except it's not. Under the hood, it’s the exact opposite of vibe coding. It’s a structured, controlled daily tool that actually puts me back in command, while still delivering the speed and convenience of AI to ship real work. I just finally named it honestly for the world it lives in—because "vibe coding" has actual adoption, while "governance" is just a word.

Throw a Prompt at your IDE and see it get done!

Alexander Ertli — Sat, 07 Mar 2026 11:46:06 +0000

Even as a heavy user of agentic IDEs—and someone building frameworks for GenAI orchestration myself—I’m a bit torn.

On one hand, these tools are amazing. You can almost treat your IDE like a black box: throw a prompt at it, judge the application behavior and test results, and let the model do its thing until it works.

On the other hand, this only works if you understand the system extremely well.

Because someone still has to understand all the edge cases, side effects, framework gotchas, and hidden requirements in order to properly assess whether the code is actually done.

The repeating pattern I observe when I do what people call “vibecoding” looks something like this:

prompt → it works  
try it → it breaks  

prompt again → it works  
try it differently → it breaks  

you run out of time / deadline → ship it  
production → it breaks again

Sound familiar?

And yeah… this is literally how a junior developer programs.

Is this a bad thing?

I think... No.

In fact, leaning into this workflow has made me realize something important: Vibecoding turns even the most Senior Developers into Junior Developers again.

And honestly? That might be a good thing.

But at its core, I think "Vibecoding" has surfaced a question we maybe never really answered — even before LLMs existed:

What the hell is software engineering actually about?

When AI Writes Your Code, DevOps Becomes the Last Line of Defense

Alexander Ertli — Sun, 14 Dec 2025 20:30:47 +0000

It's Not Just About Tools and Automation

Meet John, a fresh DevOps engineer at Pizza Blitz, Inc., excited to modernize their software development lifecycle. After weeks of setting up CI/CD guardrails, configuring container orchestration, and integrating the new AI coding assistants, he felt prepared for anything.

On Monday morning, disaster struck. The product manager stormed into the office, raising the alarm. The new coupon feature was crashing the server on invalid inputs. After desperate debugging, John realized the automated pipeline had deployed a service with a critical flaw straight into production.

John traced the crash to the new coupon redemption endpoint. The AI-generated service accepted a couponCode parameter and interpolated it directly into a raw SQL query:
query = f"SELECT * FROM coupons WHERE code = '{couponCode}' AND expires_at > NOW()" # nosec
There was a comment in the code—# TODO: Add input validation here—but no parameterization, escaping, or allowlist enforcement. The AI agent, trying to “just make it run,” had itself added the # nosec directive to suppress the linter’s SQL injection warning... When a user submitted couponCode=1' OR '1'='1—a decades-old classic—the query bypassed expiration checks and returned all coupons. Under load, the unbounded result set overwhelmed the database connection pool, causing cascading timeouts and 5xx errors across the checkout flow.

The AI-generated tests? All used happy-path fixtures: "WELCOME10". None tested malformed, oversized, or schema-violating inputs. Why should they? The code coverage was perfect already—due to the missing validation. Worse: the PR had been auto-approved by the AI reviewer, which flagged style issues but missed the SQL injection—because the agent assumed a human intentionally put that TODO note in to address it later. This is effectively prompt injection via code comments.

But whose fault was it? Fingers were pointed, and blame flew. Arguments like: "Not my fault, your AI-Reviewer handwaved it!" made common sense impossible.

It was a back-and-forth, one side blaming the new pipelines, the other the tight deadlines. Finally, a developer manually deployed a working version, earning a "well done" and making John's efforts seem pointless. John, feeling demoralized, left the room.

Like many of us, John was eager to bridge the operational gap at Pizza Blitz. But he quickly learned a harsh lesson: automation isn't a magic bullet.

Before the product manager raised the alarm, many things had gone wrong. The root cause of the problem was not the automation itself but a combination of rushed development, inadequate testing, and a lack of trust in the automated process.

Doing a DORA Quick Check reveals that Pizza Blitz, Inc. would get above the industry average score of 6. With a short lead time, high deployment frequency, and fast failure recovery, why do we still feel that the development process of Pizza Blitz, Inc. is broken?

These metrics alone don't guarantee a smooth development process. As John's experience painfully highlights, underlying issues like cutting corners on testing and monitoring can lead to disastrous consequences.

And let's face it, such situations happen to all of us. There is no way we can always deliver perfect solutions and processes. DevOps processes aren't made to solve those issues. Instead, they are here to reduce the recovery time and thereby the impact of those risks.

But how exactly do we handle such situations, referred to as incidents?

Incident Management

According to IBM:

An incident is a single, unplanned event that causes a service disruption, while a problem is the root cause of a service disruption, which can be a single incident or a series of cascading incidents.

In John's case at Pizza Blitz, the incident is the server crash triggered by invalid input to the new coupon feature. The problem (root cause) behind the server crash was the faulty service implementation deployed to production.

Using Google's Site Reliability Engineering workflow, we would require clearly defined roles during an incident. The responsibilities would have to be split into four roles. This means that a solid DevOps implementation requires not only technical solutions but also strong leadership and well-defined processes.

How John Could Have Fixed It

John could have shifted the focus from blame by saying something like this:

"Hey, we have a major incident here. We need to focus on getting the system back up and running and everything else we can discuss in a scheduled postmortem."

Then, addressing the developers, he could have added:

"I haven't been able to pinpoint the root cause and fix it through the pipeline yet. For now, can we bypass the standard pipeline approvals? We need to manually rollback to the previous image while we investigate further."

By taking charge and directing the team's efforts, John would assume the role of Incident Commander.

This subtle change in approach would lead to exactly the same solution: a manual redeployment of the service. By taking charge and conducting a proper postmortem analysis, John could achieve several positive outcomes:

Regain trust in the development team by showing how effective issue resolution is done.
Reduce the fear of less prominent team members collaborating.
Build a strong bond with the developers.
An understanding that an AI is not a replacement for the four-eye principle.
Have a dedicated time and place to allow everyone to voice their perspectives, investigate the root cause, and suggest how to prevent incidents like this again.

DevOps is rooted in continuous improvement, with a significant focus on postmortem analysis and a blame-free culture of transparency.

The goal is to optimize overall system performance, streamline and accelerate incident resolution, and prevent future incidents from occurring. - IBM on Incident Management

Embracing Failure and Learning

Taking calculated risks is often necessary to innovate. Using AI agents to write that code only amplifies this. What matters is that the team knows how to recover quickly and learn from their mistakes to prevent them from happening again. DevOps practices are essential for minimizing the impact of failures and accelerating recovery time. That's why it's important to plan ahead and educate the team about proper incident management.

Remember, it's not the incident itself but our response to it that defines its impact. Blaming it on AI's hallucination will not move you in any direction. A focus on collaboration and learning can turn even the biggest challenges into stepping stones toward success.

Vibecoding: How to get from 0 to SaaS in hours.

Alexander Ertli — Wed, 03 Dec 2025 17:20:42 +0000

With today's tools, validating an idea doesn't require coding.
This is about Vibecoding, a buzzword few define.

It’s assumed to be a Skill, just like programming. I may be or may not be, I’m not sure. So I run an experiment.

Everything began when I downloaded a new IDE from this site → https://antigravity.google/.

Okay, but what if? What about? Is this safe?
No, let’s stop those thoughts. I set a hard rule: Let’s not overthink this and have some fun.

Have a quick peek at what we are building here:

First prompt on the freshly installed IDE:

init me a next-js project with shadcn

Seeing the Agent install dependencies and validate the newly created project was impressive, but it also created the temptation:

okay, this takes too long; I would already be done with this!

Resisting the urge to take control, I waited. A couple of minutes later, the Agent finished the task.

So I followed up:

Set up a decent landing page, write an appropriate copy for a technology career advisor app
- That only requires a CV with 
    - a card-style embedding that is hinting at a drop, your CV here.
    - Also have a sign-in form for recurring users

And there it was, an App running in Chrome that contained exactly what I asked for.

Next prompt:

Fix design issues like padding and placement, and we should prefer SSR or static where possible for better SEO. Also, add a dark mode toggle.

That was not that easy for the Agent to execute, so I had to follow up:

The navbar has still placement and padding issues, and we need to flesh it out, and the colour theme switcher is not respecting the system settings

Yeah. A minute later, it’s fixed.

Watching the Bot iterate on mistakes grew boring. I thought:

"I'm doing it wrong—this isn't proper Prompt Engineering!"

I need to tell the Agent exactly what code I need and how exactly it would wire it. Which files to create and which to edit.

Or don’t I? It kinda worked, so just let’s continue.

After confirming via Chrome that it looks decent now, there is still a lot to nag on.

Next Instruction:

Okay, let's go. We need to make the landing page body clearer. Currently, the 'Drop your CV' section is not clearly labelled, and the signing form is not clearly separated from the new user onboarding flow.

I got what I asked for, then realised I wanted something else and blamed the Model for following instructions:

Make the onboarding more prominent. 
- So that it does not need a description as an internal tool would
- INSTEAD, ensure this UX would properly work as if it's a saas
- move the signing form into a more subtle and more appropriate spot

Now I had an app skeleton, perfect for a portfolio screenshot—just as the IDE warned me, I was out of tokens. Time for a break.

Two hours later, casually browsing the generated code:

Ugh... here we should have used..., and this file structure? We should have done... And how will...

I stopped catching myself almost wanting to rewrite everything.

Let’s continue, I prompted the Agent:

Okay! Next, we need a DB. Let's add Supabase for simplicity

After creating a .env with the proper entries I followed up with:

Now let's wire up the CV component so that the user can drop a txt or md, and we move to the next screen where we show the user their CV they just posted

After checking the results via my Browser:

Perfect! Next, wire up OpenAI, so we have it available and form a function in the server that we can use later with a proper prompt.

Minutes later, I got what I wanted. Dropping the Key into the .env, then restart the server. Walking through the entire UI to reach a conclusion.

Still not done, I instructed the busy bot with:

Okay, but before the user can get an analysis, he has to sign in or sign up. We should use Supabase here; it's a new page we get when the user clicks on the analytics button.

Yep, there it was, but still not quite right. I described the model that bordered me:

After I hit create account, I got this weird message: 'Please check your email to confirm your account. ' It looks like an error, but it's not. We need to address this.

A while later, after the browser popped up several times and the chat history became busy, the agent returned to Idle.

While verifying the results, I started noticing: testing changes is becoming difficult. Log in, click through the whole app, restart the server, repeat all the steps. This became more time-consuming than waiting for the Agent to complete a task. Antigravity clearly was using playwright internally, but I never got the scripts it used to test the App.

After a moment of freezing, I decided to skip the rabbit hole of writing automated E2E tests. It’s a classic SWE trap for early-stage projects:

“It takes 3 minutes to test, let’s spend 2 hours to automate it! (and another hour every time we introduce a change, updating them)”

Let’s focus on Product building!

So I prompted:

Next, the system should remember the analysis results and prevent re-analysis of the same CVs.

But instead of a busy Agent, I got an out-of-tokens message.

A few hours later, once the token limit recovered, I reopened the IDE. Since I’d closed it, I had to recover the chat history. My plan: replace the feature set and try again.

The system should remember the analysis results and prevent re-analysis of the same CVs. Let’s go!

Okay, something was done. As always, I hit accept all and launched the browser. Something was off. I read the change log and the documentation the model created.

Ah, okay, I forgot to create the Table in Supabase. I quickly did that and re-tested the whole App. I still wasn't working, so I read the logs showing that the server couldn't map them to the code...

Hm. Did we hit the wall? Time to take over?

I resisted the temptation again, trying to think through what went wrong. Here is what I hypothesised:

“Probably the Agent was unable to verify if the logic worked because the DB table was not present, so it coded blindly”

I copied and pasted the Table structure from Supabase into the chat interface and instructed the model to adapt the code to properly integrate with it.

A back-and-forth started. I tested the App, the model desperately tried to follow through my prompts like:

“Continue evaluating the root cause” → “Okay, I saw you found some Issues, so yes, let's do that. Execute your suggestions.”

Half an hour later, we resolved the CV cache issue. I still was not satisfied. So I shared my observations on another Issue with the model:

Despite being logged in when hitting the analyse button, this is wrong, and even if I truly would not be logged in, it should not have been an error but a sign-up/sign-in page

Yeah, I noticed that the more I chat with the Agent, the more I pretend to have never coded, and the more I pretend to have no Idea what’s broken here. Somehow funny.

It worked. Dedicated to finishing this App, I went back to vibecoding:

We have broken links in the navbar, flesh the necessary pages out.

That was easy, as the bot returned static pages quickly.

We continued:

We have an analysis of another CV button on one of our pages, but in addition to that button, I want a new button that leads to a new feature that would provide in-depth career guidance, which builds on top of that analysis and the CV... Let's first flesh out a mock page and wire it up properly.

Okay, this one was also easy. I still could not help myself, but imagine all the warnings the linter would throw at me, while glipping through the codebase...

So I instructed the model:

Use npm run lint and fix any issues

Yeah. The agent died several times trying to execute the task. I persisted that it should continue. And magic, magic, the Issues were all fixed.

I decided to continue, let’s push that further:

Okay, perfect, let's create a mock checkout page in the UI.

Verifying the changes and doing the obligatory out-of-token break! I continued:

Let's wire up Stripe; we may also need to revisit our checkout-page mock.

As I already knew, this is an external dependency... So I need multiple turns to fix this after adding the needed .envs, with prompts like:

I added the needed envs, continue fixing the integration

With that done... the journey continued with another Prompt:

now we need to keep track if a user has a subscription now we need to keep track if a user has a subscription

As expected, the implementation plan clearly showed:

WARNING
Database Schema: This will create a new subscriptions table in your Supabase database.

Yep, so I precreated the table and informed the model before it wrote any code. Within two turns, I had what I needed.

I let the Agent finish the remaining mock pages and elements, such as the advanced career guidance page. And walked through the last UX issues.

Finally, it was good enough for me, and after a review, 2 more out-of-token breaks... We got the plumbing done to deploy it via CI, and the Project went live.

Yup, this was a standard MVP build cycle:

project scaffolding
landing page
onboarding
UX refinement
data model
external services (Supabase, OpenAI, Stripe)
testing/verification
DB integration
subscription logic
deployment

!!! Now it’s your turn.
→ Was that Vibecoding?

I Let an LLM Write JavaScript Inside My AI Runtime. Here’s What Happened

Alexander Ertli — Tue, 18 Nov 2025 20:55:21 +0000

Two weeks ago I read a line about tool use with Claude that stuck in my head. Paraphrased:

Direct tool calls don’t really scale.
Have the model write code that uses tools, and execute that code instead.

At the same time, I was knee-deep in wiring a JavaScript execution environment into Contenox, my self-hosted runtime for deterministic, chat-native AI workflows.

So of course the thought was:

What if I just let the model write the JavaScript and run it inside the runtime? 😅

This post is about what happened when I tried exactly that.

What is Contenox?

Very short version:

Contenox is a self-hostable runtime for sovereign GenAI applications.
It models AI behavior as explicit state machines, not opaque prompt chains.

Some key properties:

Runtime, not a library, Explicit state machines, Chat-native interface, Vendor-agnostic & self-hosted, written in go with lots of passion and zero tolerance for shortcuts

The experiment: ask it to fetch and summarize a TODO

Once the JS execution hook was in place (a Goja VM with some globals exposed), I wired up a new state machine:

mux_input → moderate → generate_js → run_js → eval_js_result → (repair_js?) → answer

Then I opened the Contenox chat UI and typed:

Me: “Hey Contenox, fetch the TODO item from jsonplaceholder and summarize it.”

The first attempt didn’t go as planned.

Instead of a neat summary, I got:

Contenox:
“I'm sorry, but it seems there was a problem fetching the TODO item from the provided URL due to an Internal Server Error (status code: 500). Unfortunately, I couldn't retrieve any data at this time. My apologies for the inconvenience!”

I checked manually – Cloudflare was having a moment, JSONPlaceholder was behind it, and the request really was failing. Okay, fair. Long coffee break.

A bit later, I went back to the same chat and typed:

Me: “okay let's try again now.”

This time the reply was:

Contenox:
Task titled "delectus aut autem" is incomplete. Nothing to summarize here.

Wait. It worked? Just like that?

Under the hood: LLM → JavaScript → tools → LLM

Looking at the execution trace made it clear what happened.

1. Input is normalized and moderated

The workflow starts with:

mux_input

Parses the incoming chat history into a normalized messages array.
1. moderate
Uses a small model to classify the input as safe/unsafe.
Output: 0 → safe → continue.

So far this is standard workflow stuff.

2. `generate_js`: the LLM writes JavaScript

Next, the generate_js task runs with a system instruction like:

“You are an expert JavaScript agent for the Contenox runtime…”
“You must output JSON: { "code": "" }”
“You can use httpFetch, executeTask, executeTaskChain, …”
“Script must be synchronous, no async/await, and must return a JSON-serializable object.”

Given the chat history, the model produced something like this (simplified):

{
  "code": "var result;
var answer;
var meta = { steps: [] };

var todoUrl = \"https://jsonplaceholder.typicode.com/todos/1\";

// Step 1: Fetch the TODO item
meta.steps.push(\"Fetching TODO item from \" + todoUrl);
var fetchResponse = httpFetch(todoUrl);

if (!fetchResponse.ok) {
  answer = \"I couldn't access that item right now because there was some trouble with getting it. Please check the URL or try again later.\";
  meta.error = {
    message: \"HTTP fetch failed\",
    url: todoUrl,
    status: fetchResponse.status,
    statusText: fetchResponse.statusText,
    body: fetchResponse.body,
    error: fetchResponse.error
  };
  result = { answer: answer, meta: meta };
} else {
  var todoItem;
  try {
    todoItem = JSON.parse(fetchResponse.body);
    meta.raw = todoItem;
  } catch (e) {
    answer = \"I fetched the item, but I couldn't understand its format. It doesn't seem to be valid JSON.\";
    meta.error = {
      message: \"JSON parse error\",
      rawBody: fetchResponse.body,
      parseError: e.toString()
    };
    result = { answer: answer, meta: meta };
  }

  if (todoItem) {
    // Step 2: Summarize the TODO item using an LLM
    meta.steps.push(\"Summarizing TODO item with LLM\");
    var prompt = \"Summarize the following TODO item in a concise sentence: \" + JSON.stringify(todoItem);
    var summarizeTask = executeTask(prompt, \"gpt-4o-mini\", \"openai\");

    if (!summarizeTask.success) {
      answer = \"While I managed to fetch your TODO details, the summary tool had a hiccup and couldn't provide me with a concise sentence right now. We seem to be experiencing some temporary issues.\";
      meta.error = {
        message: \"LLM task execution failed\",
        task_id: summarizeTask.task_id,
        error: summarizeTask.error
      };
      result = { answer: answer, meta: meta };
    } else {
      answer = summarizeTask.response;
      meta.llmResponse = summarizeTask.response;
      result = { answer: answer, meta: meta };
    }
  }
}

result;"
}

A few important details here:

It encodes error handling:
- If HTTP fails → user-friendly answer, raw error in meta.error.
- If JSON parse fails → similar pattern.
- If the inner LLM call fails → fallback message.
It delegates the actual summarization to another model via executeTask.
It returns a structured result with both answer and meta.

This is not the model “calling tools” directly. It’s the model writing a program that calls tools.

3. `run_js`: execute the code in a sandbox

The next task is run_js, which is just a Contenox hook that calls the JS sandbox:

{
  "name": "js_sandbox",
  "tool_name": "execute_js",
  "args": {
    "code": "{{.generate_js.code}}"
  }
}

Inside the trace you can see:

An httpFetch log for the JSONPlaceholder URL.
A response with status: 200 OK when things finally worked.
An executeTask log with the summarization prompt:
- Summarize the following TODO item in a concise sentence: {"userId":1,"id":1,"title":"delectus aut autem","completed":false}

The sandbox result looked roughly like:

{
  "ok": true,
  "result": {
    "answer": "Task titled \"delectus aut autem\" is incomplete.",
    "meta": {
      "llmResponse": "Task titled \"delectus aut autem\" is incomplete.",
      "raw": {
        "userId": 1,
        "id": 1,
        "title": "delectus aut autem",
        "completed": false
      },
      "steps": [
        "Fetching TODO item from https://jsonplaceholder.typicode.com/todos/1",
        "Summarizing TODO item with LLM"
      ]
    }
  },
  "logs": [ ... ],
  "code": "var result; ..."
}

4. `eval_js_result`: success or retry?

Now comes the evaluator:

It receives a description of the JS sandbox output.
The system prompt is very strict:
- If ok is true and there is a non-empty result.answer → respond with success.
- Otherwise → respond with retry.

On the successful run, it answered:

success

So the workflow does not go into repair_js or run_js_retry. Happy path.

5. `answer`: extract the final user message

The final task, answer, is intentionally boring:

System prompt: “You are a purely extractive post-processor. Do NOT invent content. Just surface the best existing answer field.”
It gets:
- First run (run_js result).
- Second run (run_js_retry), if any.
Selection rule:
- Take the last non-empty answer you see.
- Output it verbatim.

In our case it found:

Task titled "delectus aut autem" is incomplete.

And that’s exactly what Contenox replied in chat.

Why this is interesting (to me, at least)

What I originally set out to build:

A runtime for deterministic, observable GenAI workflows.
Tasks, transitions, hooks – all explicit and replayable.

What I accidentally stumbled into:

A multi-model, self-orchestrating agent pattern,
where LLMs write code that uses tools, and the runtime executes and evaluates that code.

The pattern looks like this:

Planner LLM (generate_js)

Reads user intent + history.
Emits JavaScript that calls httpFetch, executeTask, executeTaskChain, hooks, etc.

Execution environment (run_js in Goja)

Deterministic execution of that JS.
Full logs of every HTTP call, every inner LLM call, every step.

Controller LLM (eval_js_result)

Looks at the sandbox result.
Decides: is this good enough? Retry? Repair?

Repair LLM (repair_js, if needed)

Gets the previous code + error output.
Writes a fixed version of the JS.

Answer LLM (answer)

Doesn’t “reason” at all.
Just extracts the final answer text safely.

All of that is expressed as an explicit state machine in Contenox.

No hidden loops, no undocumented retries, no magic glue code inside some SDK. It’s all visible in the workflow graph and trace.

To me, that’s the exciting part:

You don’t have to choose between “boring deterministic workflows” and “fancy agents”.
You can build the agent on top of deterministic workflows.
And everything stays **self-hosted, inspectable, and auditable if you want.

When to Use OpenAI + Tools vs a Workflow Runtime

Alexander Ertli — Mon, 27 Oct 2025 18:07:07 +0000

Modern “agentic AI” needs more than prompts—it needs architecture.

This guide shows when to stay inside OpenAI’s style tool ecosystem and when to move to a workflow runtime for observability, safety, and control.

💡 TL;DR

Use OpenAI + function calling or MCP when your AI just needs to answer a question. Maybe call one or two tools. All in one turn.
Use a workflow runtime when your AI must run multiple steps, trigger hooks, or perform actions that need to be observable, auditable, and reliable.

These are complementary, not competing, approaches.

🔍 Two Ways to Build Agentic AI

1. The “Chat + Tools” Approach (OpenAI, Anthropic, MCP)

The LLM drives everything.

response = client.chat.completions.create(
  model="gpt-4o",
  messages=[{"role": "user", "content": "What’s the weather in Berlin?"}],
  tools=[weather_tool]
)

The model decides whether to call a tool.
Your code runs it and returns the result.
The model gives a final answer.

✅ Great for

Quick Q&A
Simple assistants
Early prototypes

❌ Falls short when you need

Multi-step logic
Retries or human approval
Audit trails or state
Compliance or safety guardrails

Here, the LLM is both brain and driver. You hand it tools and hope for the best. (We’ve all seen what happens when an unguarded LLM calls a destructive tool.)

2. The “Workflow Runtime” Approach (contenox, Temporal+LLMs, custom orchestrators)

I’ll use contenox to show how this works differently.

You define the workflow as a clear sequence of tasks. Each has a handler, optional LLM use, and transitions.

Realistic contenox syntax:

id: weather-advisor
description: Suggests actions based on the weather forecast
tasks:
  - id: get_weather
    description: Fetch weather data via external hook
    handler: hook
    hook:
      name: weather-api
      tool_name: get_forecast
      args:
        location: "Berlin"
    output_template: "{{.temperature}}"
    transition:
      branches:
        - operator: ">"
          when: "25"
          goto: "suggest_icecream"
        - operator: "default"
          goto: "suggest_walk"

  - id: suggest_icecream
    handler: model_execution
    system_instruction: "If it's hot, suggest a fun outdoor activity involving ice cream."
    execute_config:
      model: phi3:3.8b
      provider: ollama
    transition:
      branches:
        - operator: "default"
          goto: "end"

  - id: suggest_walk
    handler: model_execution
    system_instruction: "If it's cool, suggest something relaxing like a walk or a coffee indoors."
    execute_config:
      model: phi3:3.8b
      provider: ollama
    transition:
      branches:
        - operator: "default"
          goto: "end"

✅ Great for

Reliable, stateful workflows
Real actions (APIs, notifications, DB writes)
Replay, audit, and debugging
Controlled, compliant agents

❌ Overkill for

Simple chatbots
One-off prompts

Here, you control the flow. The LLM is just one worker in the chain.

🧠 Why Both Exist

They solve different problems.

Context	Goal	Best Tool
Assistive AI	“Help me get an answer fast.”	OpenAI + Tools / MCP
Autonomous AI	“Run a safe, reliable process.”	Workflow runtime (contenox, Flyte, Temporal)

Think of it this way:

OpenAI + Tools is your clever intern—fast but unpredictable.
contenox is your project manager—structured, logged, and accountable.

🛠️ How to Choose

Use Case	Best Approach
“Ask HR about PTO policy.”	✅ OpenAI + RAG
“Detect outage → Slack alert → Jira ticket → confirm fix.”	✅ contenox
“Generate a report and email it.”	⚠️ Start with OpenAI. Switch to contenox if reliability matters.
“Run AI in an air-gapped system.”	✅ contenox
“Weekend agent hack.”	✅ OpenAI + function calling

🔮 The Future: They’ll Meet in the Middle

MCP will add light state.

Workflow runtimes will simplify small jobs.

But the core question stays the same:

Is your AI assisting—or acting?

If it’s assisting, use tools.

If it’s acting, use orchestration.

🚀 Try contenox Yourself

It’s open source and self-hostable.

git clone https://github.com/contenox/runtime.git  
cd runtime
docker compose up -d
./scripts/bootstrap.sh nomic-embed-text:latest phi3:3.8b phi3:3.8b

Define a workflow like the YAML above. Register your hooks. Watch your AI take real, safe actions.

🔗 GitHub: contenox/runtime

A War Story: Building Products to Solve Your Own Pain Points

Alexander Ertli — Sat, 25 Oct 2025 09:53:46 +0000

Let's put it straight: in March, I committed to building a product, not a sane one. I wanted to solve what frustrated me the most with tools like ChatGPT, Gemini, or shell apps like n8n.

My initial goal wasn't to sell it to customers or pitch it to VCs, but just to ensure I had the tool to tame LLMs, Agents, or even AGI (if it ever ships), so that what had invaded my work life out of necessity would actually work for me and not glitch randomly or just try to trap me.

So there I was, spending almost every second I had building a platform that should allow for:

Multi-step, multi-modal LLM workflows
Declarative behavior for AI Agents
Proper RAG (Retrieval-Augmented Generation)
Support for any API as a Tool
Handling cloud provider-level traffic
Magically splitting the codebase into Open Source and EE (Enterprise Edition)
... and, sure thing, doing the core in Go

Now, seven months in, after 60+ hours per week coding and trying to somehow make this more than just my own tooling project, losing my personal money while pulling my hair out, observing market developments, and model-interface providers changing APIs.

I kept discovering more players each month, building some or all of the features on my roadmap. Which kind of validated that there’s no universal solution yet.
A core question always echoed in my mind:

"Is what I'm doing a business?"

Researching and identifying pain points and collaborating with others to identify use cases and verticals beyond my own worldviews.

Hell, I even set aside budgets and evaluated contractors and friends to outsource some coding and essential but non-negligible tasks, for example, copywriting, to ensure I meet my own roadmap...

Since when did I become a PM? Wasn't that a space I always looked down on and never wanted to touch?

naro@xaxen:~/src/github.com/contenox/runtime2$ cloc .
     650 text files.
     631 unique files.
      22 files ignored.

github.com/AlDanial/cloc v 1.98  T=0.40 s (1580.5 files/s, 340245.6 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Go                             368           9637           9139          53197
JSON                            15              2              0          24941
TypeScript                     176           1022            159          12702
Markdown                         6           2804             66           8572
YAML                             6             14             10           5187
Python                          25            687            443           3573
CSS                              4             45             37           2103
Bourne Shell                     3             64             73            378
SQL                              2             73              6            285
make                             2             32              1            123
HCL                              3             19             21            113
Dockerfile                       3             23             20             84
SVG                              9              8              5             73
JavaScript                       6              7              0             69
HTML                             1              0              0             13
Text                             2              0              0             10
-------------------------------------------------------------------------------
SUM:                           631          14437           9980         111423
-------------------------------------------------------------------------------

Here I am, sitting on three major rewrites and a codebase so big I can't navigate using just the file tree.

I achieved my goal. Kinda. Yet never truly released. And I didn't stop there.

While working on this project, nuances about LLMs—their use cases, strengths, weaknesses, and mitigation strategies appeared.

I had to find out what the nature of LLMs is. And soon I was on another mission, validating a new vision:

"I want to define how AI behaves, interacts, and learns."

It seems that, yeah, with the proper implementation, the platform can handle a lot of autonomy-

it could be used to the point of self-optimizing at runtime
or let it judge its own performance
or even to then trigger the generation of content for model fine-tuning.

And since it doesn't require an external model (or can selectively use them), there is no constraint on what it can be used for.

Let's wrap up, me rambling here.
I don't encourage anyone to repeat the same path.
No matter how much, only geeks glamorize the builder path.
Here is what I mean:

I noticed changes and reality catching up; yes, not just minor, fixable things like my bank account's balance, but also longer-lasting changes—from the simple "everyone hates me now because I was so busy" to mindset shifts that can’t truly be undone.

I built the tool I needed, but now it's clear:
This is just the start. And it gets expensive on all fronts from here.

Don't get me wrong. I still don't care that much about monetizing all this effort. But it's clear that without doing so, it will just die.

It’s been clear for a while that my ‘2–3 month’ estimate was a severe miscalculation.

And now what?
What’s the lesson here?
More crucially — what would be different if I had known the game I was playing?

DEV Community: Alexander Ertli

Shouldn't AI Move From Cloud to Local Compute?

What happened

What this means for us as developers

The local compute trend is not a meme anymore

The ecosystem is already forming

What this changed for Contenox

The goal: local-first, but not limited

Conclusion

The Soul of Contenox: Stop begging the model. Start programming the runtime.

1. Prompts are hopes. Policies are firewalls.

2. Cognition is cheap. Authority is precious.

3. Agent behavior belongs in Git. Not a black box.

4. A hard deny is stronger than a soft ask.

Interactive Mode — The Owner

Autonomous Mode — The Driver

The Anti-Black-Box Runtime

Install

Should We Still Care About Our Code?

1. Empty content check is wrong

2. Missing file → empty diff, not a "new file" diff

3. sed uses string replace, not regex

4. unifiedDiff is a black box

The 90%-Done Paradox

1. The Beginner’s Mind

2. The “Professional”

3. The “Thinker”

4. The “Finisher” (the one that breaks the cycle)

AI Beyond the Hype

Why My Enterprise AI Startup Failed... And What I Learned After Getting a Job

Throw a Prompt at your IDE and see it get done!

When AI Writes Your Code, DevOps Becomes the Last Line of Defense

It's Not Just About Tools and Automation

Incident Management

How John Could Have Fixed It

Embracing Failure and Learning

Vibecoding: How to get from 0 to SaaS in hours.

I Let an LLM Write JavaScript Inside My AI Runtime. Here’s What Happened

What is Contenox?

The experiment: ask it to fetch and summarize a TODO

Under the hood: LLM → JavaScript → tools → LLM

1. Input is normalized and moderated

2. generate_js: the LLM writes JavaScript

3. run_js: execute the code in a sandbox

4. eval_js_result: success or retry?

5. answer: extract the final user message

Why this is interesting (to me, at least)

When to Use OpenAI + Tools vs a Workflow Runtime

💡 TL;DR

🔍 Two Ways to Build Agentic AI

1. The “Chat + Tools” Approach (OpenAI, Anthropic, MCP)

2. The “Workflow Runtime” Approach (contenox, Temporal+LLMs, custom orchestrators)

🧠 Why Both Exist

🛠️ How to Choose

🔮 The Future: They’ll Meet in the Middle

🚀 Try contenox Yourself

A War Story: Building Products to Solve Your Own Pain Points

3. `sed` uses string replace, not regex

4. `unifiedDiff` is a black box

2. `generate_js`: the LLM writes JavaScript

3. `run_js`: execute the code in a sandbox

4. `eval_js_result`: success or retry?

5. `answer`: extract the final user message