From one-off prompts to continuous, infrastructure-native software agents
Kelos started from a simple idea: AI coding agents should not live only inside interactive terminals. They should be able to run continuously, react to real events, operate inside isolated environments, and participate in software delivery workflows the same way other production systems do.
That is what Kelos is for: a Kubernetes-native framework for orchestrating autonomous AI coding agents.
Today, most agent workflows still feel manual. You open a CLI, paste a prompt, wait for output, maybe review a branch, and then start over. That model is useful, but it starts to break down when you want agents to work in the background, respond automatically to GitHub issues, continue across multiple stages, or run safely at scale.
Kelos treats agents less like chat sessions and more like infrastructure. Workflows are declared as Kubernetes resources, versioned as YAML, and run continuously by the cluster.
At the core of Kelos are four primitives: Tasks, Workspaces, AgentConfigs, and TaskSpawners.
A Task is a unit of agent work. A Workspace gives that work a repository context. AgentConfigs package instructions, skills, and MCP server integrations. TaskSpawners watch for external triggers such as GitHub issues, pull requests, or cron schedules, then create Tasks automatically.
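To make the shape of these primitives concrete, here is a rough sketch of a Workspace and a Task as YAML. Only the resource kinds come from this post; the `apiVersion`, field names, and values are illustrative assumptions, not Kelos's actual schema.

```yaml
# Illustrative only: kinds are from the post; apiVersion and
# field names are assumptions, not the real Kelos schema.
apiVersion: kelos.example/v1alpha1
kind: Workspace
metadata:
  name: my-repo
spec:
  repository: https://github.com/example/my-repo  # repo the agent works in
  credentialsSecret: github-token                 # injected into the agent pod
---
apiVersion: kelos.example/v1alpha1
kind: Task
metadata:
  name: fix-flaky-test
spec:
  workspace: my-repo          # repository context for this unit of work
  agentConfig: default-agent  # instructions, skills, MCP servers
  prompt: "Investigate and fix the flaky test in the CI suite"
```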
In other words, you define what should happen, and Kelos handles how it runs.
That distinction matters. Kelos is not just about running an agent once. It is about managing the full lifecycle of autonomous work: cloning the right repository, injecting credentials, running in an isolated pod, capturing outputs such as branch names and PR URLs, and passing results to downstream stages. It also supports chaining with dependsOn, so you can build pipelines instead of isolated prompts.
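As a sketch, chaining two stages might look like this. The `dependsOn` field is named in the post; everything else here is assumed for illustration.

```yaml
# Hypothetical two-stage pipeline: the review Task waits on the
# implementation Task instead of running as an isolated prompt.
apiVersion: kelos.example/v1alpha1
kind: Task
metadata:
  name: implement-feature
spec:
  workspace: my-repo
  prompt: "Implement the feature described in the linked issue"
---
apiVersion: kelos.example/v1alpha1
kind: Task
metadata:
  name: review-feature
spec:
  workspace: my-repo
  dependsOn:
    - implement-feature       # runs only after the first Task completes
  prompt: "Review the branch produced by the previous stage"
```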
One of the most interesting things about Kelos is that it can help develop itself. The project runs a set of always-on TaskSpawners that assist with ongoing work such as triaging issues, generating implementation plans, fixing bugs, responding to PR feedback, and testing the developer experience as a new user.
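An always-on spawner of that kind might be declared roughly as follows. TaskSpawners and GitHub-issue triggers are described in the post; the trigger and template field names are assumptions.

```yaml
# Hypothetical TaskSpawner that creates a triage Task for each new issue.
apiVersion: kelos.example/v1alpha1
kind: TaskSpawner
metadata:
  name: issue-triage
spec:
  trigger:
    githubIssues:
      repository: example/kelos
      events: [opened]        # spawn a Task when an issue is opened
  taskTemplate:
    spec:
      workspace: kelos
      agentConfig: triage-agent
      prompt: "Triage this issue: label it and propose next steps"
```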
The agents are not perfect, and they still need feedback, correction, and oversight. But that is part of what makes the system useful. We can guide them through code review, issue comments, and workflow changes, then improve how they operate over time.
Just as importantly, those workflows are managed through specs. That makes the development process itself something we can inspect, refine, and evolve, instead of something hidden inside one-off prompts.
Kelos is also designed to be agent-agnostic. It supports Claude Code, OpenAI Codex, Google Gemini, OpenCode, Cursor, and custom agent images through a standardized container interface. That means the orchestration layer is separate from the model vendor. You can swap agent backends without redesigning the surrounding workflow system.
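In a sketch, swapping backends could come down to changing one line of an AgentConfig while the surrounding workflow stays untouched. The backend field name and its values are assumed here, not taken from Kelos's schema.

```yaml
# Hypothetical AgentConfig: only the agent line changes when switching
# backends (e.g. claude-code -> codex); the rest of the workflow is unchanged.
apiVersion: kelos.example/v1alpha1
kind: AgentConfig
metadata:
  name: default-agent
spec:
  agent: claude-code          # e.g. claude-code, codex, gemini, opencode
  instructions: |
    Follow repository conventions and open a PR for every change.
  mcpServers:
    - name: github
```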
Kelos is also built for parallelism and scale from the start. Because tasks are just Kubernetes workloads, the system can fan out work across repositories and teams while relying on the cluster for scheduling and resource management. In practice, the limiting factors become your cluster capacity and provider quotas, not the structure of the orchestration model itself.
Kelos is not the easiest possible way to run an AI coding agent, and it is not trying to be. It is a Kubernetes-native system, which means you need a cluster, some familiarity with Kubernetes, and enough context to define the workflows you want to automate. There is real setup involved.
But that complexity buys you something important: isolation, scheduling, observability, and a foundation for running autonomous workflows as part of real engineering systems. Once that foundation is in place, Kelos gives you a structured way to move from one-off agent runs to declarative workflows, automated issue handling, scheduled workers, and multi-stage pipelines.
The bigger shift here is conceptual. We are moving from “AI helps me code” to “AI workers participate in the software system.” Once you accept that shift, you need better primitives than chat history and shell scripts. You need triggering, isolation, state, orchestration, observability, handoffs, and policy boundaries.
Kelos is an attempt to provide those primitives in an environment many teams already trust for automation: Kubernetes.
If that sounds interesting, the best way to understand Kelos is to look through the resources, examples, or its self-development workflows, then imagine your own background workflows: issue triage, bug fixing, PR follow-up, DX testing, release chores, or internal maintenance loops.
The point is not just to run an agent. The point is to make autonomous coding work operationally real.

Top comments (2)
The TaskSpawner primitive is clever — we built something similar with file-based triggers instead of K8s CRDs, and the hardest part was exactly what you described: managing credential injection across isolated agent pods. How are you handling secret rotation when agents run continuously?
Currently, Kelos receives a static token (which has an expiration date) or an API key as a Kubernetes Secret, and it does not refresh it automatically. (As far as I know, there is no official way to refresh a coding agent's token, and providers may ban you for trying..?)
The Secret should be updated by a user or an external actor when necessary.
A Task handles a single run of a coding agent and is recreated by the TaskSpawner, so each new run should pick up the updated Secret.