DEV Community: Lokesh Mure

One Click to Build, Verify, Ship..!

Lokesh Mure — Fri, 12 Jun 2026 21:07:37 +0000

Loki Mode at 20K developers: 15 releases in 4 days, and what we learned about verified vs live autonomous coding

#ai #opensource #devtools #programming

Lokesh Mure — Fri, 12 Jun 2026 19:02:19 +0000

I was halfway through a coffee when our self-update telemetry ticked over 20,000 unique developers. Six months ago, Loki Mode was a side project to scratch a personal itch. I wanted an autonomous coding agent I would actually trust to ship a diff into my own repo. Not a Replit-style cloud sandbox. Not a Lovable-style preview. Not a Cursor-style editor pane. A loop I could leave running overnight, walk back to in the morning, and trust the result on the git diff.

This week we shipped 15 minor releases in 4 days, and I think we finally landed the thing.

This post is the honest engineering writeup. Architecture, a real comparison table with check marks, a hands-on walkthrough using a real spec (not the training-wheels quickstart), the issue-to-merged-PR workflow with screenshots, and the parts we got wrong on the way. If you skim, the comparison table in section 4 is where the punchline lives.

1. The problem we set out to solve

When Replit Agent and Lovable started landing late last year, every founder I know lit up. Spec to deployed app in 90 seconds. Public preview URL. Done.

But a quiet thing kept happening on my own builds. The preview would load. The Cmd+R refresh would show "Welcome to your todo app." I would click "add todo," type something, hit submit, and watch the request post to a function that swallowed errors silently. The agent had marked the run "complete." The preview showed something running. The preview was lying.

The thing I wanted was simple: an autonomous loop that refuses to call work done on an empty diff or a failing test. A real gate. The same gate I would want around a junior engineer's first solo PR. Not "this looks running" but "this would pass code review."

That is what Loki Mode is. Built it locally first, open-sourced it, and the user base found it.

2. The numbers, honest version

From PostHog (anonymous, opt-out via LOKI_TELEMETRY_DISABLED=true or DO_NOT_TRACK=1, never captures prompts or PRD content or source code):

Metric	Value
Cumulative developers installed	20,000+
Weekly active developers	~6,000
Cumulative CLI sessions	500,000+
Top country (absolute count)	United States
Top country (per-capita)	Norway
Trending up fast (last 30 days)	Hong Kong, United Kingdom, India
Long-tail markets	Germany, Brazil, Singapore, Australia
Top install channel	Bun (51%), npm (38%), Homebrew (7%), Docker (4%)
Median time-to-first-verified-build	~47 minutes from `loki start ./prd.md`

What is honest: this is one person (me) maintaining the project. Two open GitHub issues right now. Bus factor of one. The trade is fast release cadence, fast PR review, and a maintainer who replies on Discord within hours.

What the data tells me: the developers drawn most strongly to Loki are the ones who would not let a hosted spec-to-app tool touch their repo at all. Self-hosted, your-keys, no proxy. I believe that market is bigger than the hosted-tool TAM, and the source-available BSL 1.1 license was chosen to serve exactly it.

3. How the architecture actually works

Skip this if you only care about the product comparison. Read it if you want to know what is under the hood before you trust the thing with your repo.

Inputs (the "spec" surface)

A spec is any of:

A PRD markdown file: ./prd.md
A GitHub, GitLab, Jira, or Azure DevOps issue (URL or shorthand). GitHub accepts 123, #123, owner/repo#123, and full issue URLs. GitLab takes gitlab.com/owner/repo/-/issues/N. Jira takes PROJ-123 or the Atlassian URL. Azure DevOps takes the work-item URL.
An OpenAPI document (YAML or JSON)
An OpenSpec change directory: ./openspec/changes/feature-x/
A plain text or YAML one-liner: loki start "build a markdown editor with file sync"

The CLI auto-detects which kind of spec you handed it from the file extension or URL pattern, and falls back to treating the input as plain text when neither matches. The unified entry is loki start; loki run <issue-ref> is a deprecated alias that still works.

The execution loop: RARV-C

Every run iterates through a five-phase closure loop, with the model tier rotating per phase:

Phase	Job	Default model tier
Reason	Architectural decisions, task decomposition, planning	Planning (Claude Opus by default)
Act	Code generation, file writes, tool use	Development (Opus or Sonnet)
Reflect	Self-critique, 3-reviewer blind council vote on the current diff	Development
Verify	Automated quality gates (tests, lint, security, coverage, held-out evals)	Fast (Sonnet or Haiku)
Compound	Episodic memory write, learning extraction for future runs	Fast

The C is what makes the loop compound across sessions. After every iteration, the agent records what it tried, what worked, what failed, and which files were touched into .loki/memory/episodic/. Future runs in the same project (or sibling projects, when cross-project memory is on) get the agent's accumulated context as a "PAST FAILURES TO AVOID" block in the next prompt.
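To make the Compound phase concrete, here is a minimal sketch of writing one episode per iteration and rendering the failures back into a prompt block. Only the .loki/memory/episodic/ directory and the "PAST FAILURES TO AVOID" heading come from the description above; the file naming, the record fields, and both function names are assumptions for illustration, not Loki's actual schema.

```python
import json
from pathlib import Path


def write_episode(memory_dir: Path, iteration: int, tried: str,
                  worked: bool, files: list[str]) -> Path:
    """Persist one iteration's outcome as a JSON file (hypothetical schema)."""
    memory_dir.mkdir(parents=True, exist_ok=True)
    path = memory_dir / f"episode-{iteration:04d}.json"
    record = {"iteration": iteration, "tried": tried,
              "worked": worked, "files": files}
    path.write_text(json.dumps(record))
    return path


def past_failures_block(memory_dir: Path) -> str:
    """Render failed episodes as a block to prepend to the next prompt."""
    episodes = [json.loads(p.read_text())
                for p in sorted(memory_dir.glob("episode-*.json"))]
    failures = [e for e in episodes if not e["worked"]]
    if not failures:
        return ""
    lines = ["PAST FAILURES TO AVOID"]
    lines += [f"- {e['tried']} (files: {', '.join(e['files'])})"
              for e in failures]
    return "\n".join(lines)
```

Successful episodes are stored too, but only the failures are surfaced, which keeps the injected block short.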

The trust layer

Loki refuses to call a run done on:

An empty diff against the run-start commit. Always blocks.
A red test run when a test runner was detected and ran. Always blocks.
A failing held-out spec eval (section 6 walks this through). Always blocks.
A council REJECT verdict from the 3-reviewer blind review. Always blocks.

Every gate writes machine-readable evidence to .loki/verify/evidence.json and a human-readable report to .loki/verify/report.md. Every completed run also emits a portable .loki/proofs/<run_id>/proof.json + index.html you can hand to a teammate, an auditor, or a PR reviewer.
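The four blocking conditions above can be sketched as one pure function. The RunState fields, the verdict labels, and the function name are hypothetical; the point is only that "done" means an empty list of blocking reasons, and that a test run that never happened is distinct from a red one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunState:
    diff_is_empty: bool
    tests_passed: Optional[bool]    # None: no test runner detected or run
    heldout_passed: Optional[bool]  # None: held-out gate not applicable
    council_verdict: str            # "APPROVE" or "REJECT" (assumed labels)


def blocking_reasons(state: RunState) -> list[str]:
    """Return every gate that blocks completion; an empty list means done."""
    reasons = []
    if state.diff_is_empty:
        reasons.append("empty diff against the run-start commit")
    if state.tests_passed is False:
        reasons.append("red test run")
    if state.heldout_passed is False:
        reasons.append("failing held-out spec eval")
    if state.council_verdict == "REJECT":
        reasons.append("council REJECT verdict")
    return reasons
```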

Provider routing

Loki dispatches to one of these underlying agent CLIs:

Provider	Tier	Notes
Claude Code	Tier 1, E2E-verified primary	Default. Deepest SDK integration.
OpenAI Codex	Tier 2, Experimental	Works end-to-end.
Cline	Tier 2, Experimental	Via the `-y` autonomous flag.
Aider	Tier 3, Experimental	Best for narrower file-edit tasks.

Plus, via ANTHROPIC_BASE_URL, any LLM that speaks the Anthropic API:

# Route the Claude provider through local Ollama (qwen2.5-coder:32b)
export ANTHROPIC_BASE_URL=http://localhost:11434/v1
export ANTHROPIC_API_KEY=ollama
export LOKI_MODEL_OVERRIDE=qwen2.5-coder:32b
loki start ./prd.md

LOKI_MODEL_OVERRIDE only takes effect when ANTHROPIC_BASE_URL is also set, so you can never accidentally reroute an Anthropic-native run.
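The guard described above is small enough to sketch directly. The two environment variable names are from the post; the function name and the default_model parameter are hypothetical.

```python
import os
from typing import Mapping, Optional


def effective_model(default_model: str,
                    env: Optional[Mapping[str, str]] = None) -> str:
    """Apply LOKI_MODEL_OVERRIDE only when ANTHROPIC_BASE_URL is also set."""
    env = os.environ if env is None else env
    override = env.get("LOKI_MODEL_OVERRIDE", "").strip()
    base_url = env.get("ANTHROPIC_BASE_URL", "").strip()
    if override and base_url:
        return override
    # Without a custom base URL the override is ignored, so an
    # Anthropic-native run keeps its configured model.
    return default_model
```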

Full multi-provider setup: autonomi.dev/docs/multi-provider-setup.

4. The comparison table, with check marks

This is the part the README will not tell you because it is hard to write honestly. Here is what each tool is good at, and where each tool is the right pick.

Capability	Replit Agent	Lovable	Cursor	Loki Mode
Instant cloud sandbox + URL	✅	✅	❌	❌
Multiplayer collaboration	✅	⚠️	❌	❌
Marketing-page / landing-page taste	⚠️	✅	❌	⚠️
Editor integration (inline)	❌	❌	✅	⚠️
Runs locally (no cloud upload)	❌	❌	✅	✅
Your own provider keys (no proxy)	❌	❌	⚠️	✅
Fullstack with backend + database	⚠️	⚠️	⚠️	✅
Compose-first multi-service (healthchecks)	❌	❌	❌	✅
Background runs (delegate + notify)	❌	❌	❌	✅
CI-gateable verification (exit 0/1/2)	❌	❌	❌	✅
Held-out spec evals (anti-reward-hacking)	❌	❌	❌	✅
Reviewer subcalls cannot mutate code	❌	❌	❌	✅
Machine-readable evidence per run	⚠️	❌	❌	✅
Shareable proof-of-run artifact	❌	❌	❌	✅
Issue-to-PR autonomous workflow	❌	❌	❌	✅
Provider-agnostic (4 + any LLM)	❌	❌	⚠️	✅
Source-available license	❌	❌	❌	✅ (BSL 1.1, → Apache 2.0 in 2030)
Air-gappable	❌	❌	❌	✅
Mobile browser editor	✅	✅	❌	❌

Pick Replit if the goal is to learn, demo, teach, or prototype-with-a-URL fast. They earned that lane honestly.

Pick Lovable if the spec is mostly visual and what you ship is a landing page or design-heavy frontend. The visual taste of their output is genuinely ahead for that work.

Pick Cursor if you want AI assistance inside the editor where you already work. The Composer + Tab autocomplete are well-tuned to existing muscle memory.

Pick Loki Mode if you are shipping into an existing private codebase, you need a deterministic CI gate on the diff before merge, your provider keys cannot leave your machine, or you want a council that physically cannot edit the code it is reviewing.

These tools can coexist. Use Cursor while you write the spec. Use Replit when teaching the team. Use Lovable for the marketing site for whatever Loki is building. The case for picking Loki Mode is specifically "verified diff before merge into my repo."

5. Workflow 1: a real PRD to a running fullstack app

The shortest path that matters. Drop a markdown spec, get a verified Git repo with a running app.

# Install (Bun recommended; v8 will be Bun-only)
bun install -g loki-mode

# Verify the install
loki version
loki doctor

Write a real spec. The agent reads it as markdown, so be explicit about acceptance criteria. Here is the one I used to test the v7.26.0 compose-first support:

# TaskFlow

A task tracker with user auth, full-text search, and tags.

## Stack
- Backend: Node + Express + Postgres
- Cache + sessions: Redis
- Frontend: React + Vite, served by the same backend

## Acceptance criteria
- POST /api/auth/register creates a user (bcrypt hashed password)
- POST /api/auth/login returns a session cookie
- GET /api/tasks returns the logged-in user's tasks
- POST /api/tasks creates a task with title, body, tags[], due_date
- PATCH /api/tasks/:id updates a task
- DELETE /api/tasks/:id soft-deletes a task
- GET /api/search?q=... returns tasks matching the query (Postgres FTS)
- 401 when the session is missing or expired
- All endpoints return JSON; 422 on invalid input

## Run
- One command: docker compose up
- Healthcheck on the web service must reflect actual readiness

Save that as ./prd.md and run:

loki start ./prd.md

What you will see:

Plan auto-shown. Before the agent does anything, Loki prints a complexity tier, cost estimate, iteration cap, and time estimate. The estimate is real -- it uses the actual model pricing table. Declining costs nothing.
Dashboard auto-opens. A new browser tab opens at http://localhost:57374. (Skipped on CI, with --detach/--background, over SSH without a TTY, with piped stdin, or with LOKI_NO_AUTO_OPEN=1.)
The agent starts iterating. Reason, Act, Reflect, Verify, Compound.

In the dashboard, the left sidebar shows your project. The main area has tabs for Overview, Tasks, RARV Timeline, Quality, Cost, and Live App.

The Live App tab is the workflow change that pulled us past 20K. Before v7.24, you had to cd to the project, run the dev server in another terminal, and pray it talked to the right port. Now the agent is still writing files and the app is already running in the iframe. You click "Add Task," type something, hit submit. You watch the bug get fixed in real time over the next iteration.

For multi-service specs (like the one above), v7.26.0 ships compose-first detection. The agent gets a RUN_CONTRACT instruction telling it to generate a 12-factor docker-compose.yml with a clearly-named primary web service (either web/app by name, or labeled loki.primary=true), healthchecks on every service, depends_on wiring, env-var config, and a committed .env.example. The runner identifies the primary web service by that name or label and surfaces THAT in the iframe rather than accidentally surfacing a Postgres port.
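A minimal sketch of how a runner could pick the primary service from an already-parsed Compose document. The loki.primary label and the web/app names are from the paragraph above; the precedence (label first, then name), the function names, and the handling of both label syntaxes are assumptions.

```python
from typing import Optional


def _labels(service: Optional[dict]) -> dict:
    """Normalize Compose labels: a mapping or a list of k=v strings."""
    raw = (service or {}).get("labels") or {}
    if isinstance(raw, dict):
        return {str(k): str(v) for k, v in raw.items()}
    return dict(item.split("=", 1) for item in raw if "=" in item)


def primary_service(compose: dict) -> Optional[str]:
    """Pick the service to surface: explicit label first, then web/app."""
    services = compose.get("services") or {}
    for name, svc in services.items():
        if _labels(svc).get("loki.primary", "").lower() == "true":
            return name
    for candidate in ("web", "app"):
        if candidate in services:
            return candidate
    return None  # nothing identifiable; never guess a database port
```

Returning None instead of falling back to the first service is the design choice that keeps a Postgres container out of the iframe.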

Behind the scenes the council reviews each iteration. The Council tab shows the 3-reviewer blind verdicts with the evidence each reviewer raised (not just APPROVED/REJECTED badges):

[Screenshot: Council tab. Reviewer 3, the devil's advocate, APPROVES after running 5 adversarial attacks.]

The Cost tab tracks per-iteration spend. v7.11.0 added a pre-cap warning at 80% (not just the existing hard stop at 100%); the warning broadcasts over WebSocket so a persistent amber banner appears on every dashboard page if you walked away from the terminal.
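The two thresholds reduce to a three-way classification. This is a sketch under the assumption that spend and cap are both tracked in USD; the function name and return labels are hypothetical.

```python
def budget_status(spent_usd: float, cap_usd: float,
                  warn_ratio: float = 0.8) -> str:
    """Classify spend against the cap: "ok", "warn" (pre-cap), or "stop"."""
    if cap_usd <= 0:
        raise ValueError("cap_usd must be positive")
    ratio = spent_usd / cap_usd
    if ratio >= 1.0:
        return "stop"   # hard stop at 100% of the cap
    if ratio >= warn_ratio:
        return "warn"   # pre-cap warning from 80% upward
    return "ok"
```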

When the run completes, your project directory contains a working app:

docker compose up           # the stack
curl http://localhost:3000/health   # healthy
npm test                    # 47/47 passing

And a portable proof:

loki proof list             # all proofs for this project
loki proof show <run-id>    # render the HTML in the terminal
loki proof open <run-id>    # open in your browser
loki proof share <run-id>   # publish as a GitHub gist (after redaction preview + confirm)

The proof leads with the itemized bill (cost USD, tokens, per-model breakdown), then files-changed with the diffstat, then per-reviewer council verdicts with evidence, then quality gates, then wall-clock, provider/model, plus an integrity hash. A single chokepoint at autonomy/lib/proof_redact.py runs once before serialization and refuses to emit if it did not run. It scrubs Anthropic/OpenAI/Google/GitHub/AWS/Slack keys, Bearer tokens, JWTs, PEM private-key blocks, named secret assignments, DB URI credentials, and absolute user paths from both the JSON and the rendered HTML.
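To illustrate the chokepoint idea, here is a sketch in which serialization exists only on a type that is built through redaction, so emitting an unredacted proof is not expressible. The patterns are a small illustrative subset, and the class and function names are hypothetical; this is not the contents of proof_redact.py.

```python
import json
import re

# Illustrative patterns only; a real redactor covers many more formats.
SECRET_PATTERNS = [
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----.*?"
               r"-----END [A-Z ]*PRIVATE KEY-----", re.S),  # PEM blocks
    re.compile(r"\bAKIA[0-9A-Z]{16}\b"),                   # AWS access key id
    re.compile(r"\bgh[pousr]_[A-Za-z0-9]{36,}\b"),         # GitHub tokens
    re.compile(r"\bsk-[A-Za-z0-9_-]{20,}"),                # sk-prefixed keys
    re.compile(r"(?i)\bBearer\s+[A-Za-z0-9._~+/=-]{8,}"),  # Bearer tokens
    re.compile(r"(?<=://)[^/\s:@]+:[^/\s@]+(?=@)"),        # user:pass in URIs
]


def redact(value):
    """Recursively scrub every string inside JSON-like data."""
    if isinstance(value, str):
        for pattern in SECRET_PATTERNS:
            value = pattern.sub("[REDACTED]", value)
        return value
    if isinstance(value, list):
        return [redact(v) for v in value]
    if isinstance(value, dict):
        return {k: redact(v) for k, v in value.items()}
    return value


class RedactedProof:
    """The only type with to_json(); constructing it runs redaction."""

    def __init__(self, payload: dict):
        self.payload = redact(payload)

    def to_json(self) -> str:
        return json.dumps(self.payload, indent=2)
```

Walking the parsed structure rather than the serialized text means a pattern can never match across a JSON key boundary.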

6. Workflow 2: GitHub issue to merged PR, hands-free

The thing that gets us past "demo tool" and into "production engineering tool." Hand Loki a real issue from your tracker and walk away.

[Diagram of the workflow stages, ending: isolate -> RARV-C build -> verify -> ship]

# Issue-driven, foreground
loki start owner/repo#123

# Issue-driven, background, auto-PR + auto-merge when verified
loki start 123 --ship --bg

What each flag does (the cascade is documented in loki start --help):

Flag	Behavior
`--worktree`, `-w`	Git worktree isolation. Branch: `loki/issue-<n>`. Working tree never touched.
`--pr`	Implies `--worktree`. Auto-creates a PR via `gh pr create` when the run verifies.
`--ship`	Implies `--pr`. Auto-merges via `gh pr merge` once the PR's CI passes.
`--bg`, `--detach`, `-d`	Background mode. Implies `--worktree`. Local desktop notification on completion (v7.22.0).
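The "implies" cascade in the table is a transitive closure. A minimal sketch, with the flag names taken from the table and everything else (the data layout and function name) assumed:

```python
IMPLIES = {
    "--ship": {"--pr"},
    "--pr": {"--worktree"},
    "--bg": {"--worktree"},
}
ALIASES = {"-w": "--worktree", "--detach": "--bg", "-d": "--bg"}


def expand_flags(flags):
    """Return the given flags plus everything they transitively imply."""
    resolved = set()
    pending = [ALIASES.get(flag, flag) for flag in flags]
    while pending:
        flag = pending.pop()
        if flag not in resolved:
            resolved.add(flag)
            pending.extend(IMPLIES.get(flag, ()))
    return resolved
```

So --ship alone resolves to --ship, --pr, and --worktree, which is why the working tree is never touched on a shipped run.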

Supported issue refs (auto-detected):

GitHub: 123, #123, owner/repo#123, full issue URL
GitLab: https://gitlab.com/owner/repo/-/issues/42
Jira: PROJ-123, https://org.atlassian.net/browse/PROJ-123
Azure DevOps: https://dev.azure.com/org/project/_workitems/edit/456
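Auto-detection over the formats listed above can be sketched as an ordered list of patterns. The regular expressions and the returned dictionary shape are assumptions written to match the examples in this post; the real detector may accept more variants.

```python
import re
from typing import Optional

ISSUE_PATTERNS = [
    ("github", re.compile(r"^#?(?P<id>\d+)$")),
    ("github", re.compile(r"^(?P<repo>[\w.-]+/[\w.-]+)#(?P<id>\d+)$")),
    ("github", re.compile(
        r"^https://github\.com/(?P<repo>[\w.-]+/[\w.-]+)/issues/(?P<id>\d+)/?$")),
    ("gitlab", re.compile(
        r"^(?:https://)?gitlab\.com/(?P<repo>.+?)/-/issues/(?P<id>\d+)/?$")),
    ("jira", re.compile(
        r"^(?:https://[\w.-]+\.atlassian\.net/browse/)?(?P<id>[A-Z][A-Z0-9]*-\d+)$")),
    ("azure", re.compile(
        r"^https://dev\.azure\.com/(?P<repo>[^/]+/[^/]+)/_workitems/edit/(?P<id>\d+)/?$")),
]


def detect_issue_ref(ref: str) -> Optional[dict]:
    """Return the tracker and captured fields, or None if ref is not an issue."""
    ref = ref.strip()
    for tracker, pattern in ISSUE_PATTERNS:
        match = pattern.match(ref)
        if match:
            return {"tracker": tracker, **match.groupdict()}
    return None
```

A None result is what lets the caller fall through to the other spec kinds, such as a PRD path.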

When you delegate with --bg, v7.22.0's "delegate then notify" writes a durable completion summary to .loki/COMPLETION.txt and .loki/state/completion.json and fires a local OS notification (macOS osascript, Linux notify-send). Every terminal state notifies and records a summary -- success, max-iterations, stopped, failed, genuinely-blocking pauses. The perpetual-mode auto-clear pause is correctly NOT treated as terminal, so a mid-run pause never produces a false "done" record. Zero network egress.

Opt-in LOKI_DELEGATE_BRANCH=1 isolates a run on a dedicated loki/delegate-<timestamp> branch. Opt-in LOKI_DELEGATE_PR=1 opens a local pull request on completion (a gh call from your own machine, never CI).

7. Workflow 3: gate the diff in CI before it merges

This is the third workflow, and the one I think actually moves the needle on enterprise adoption.

loki verify is a standalone verification module that does NOT enter the autonomous loop. It is the deterministic gate: five checks scoped to the diff.

Run it locally:

# Verify against the default base ref
loki verify

# Or against a specific ref
loki verify origin/main

# Or for CI as machine-readable JSON
loki verify origin/main --output-json > verify-result.json

Real output from a run on a 14-file diff:

loki verify (run id: a7c2-...)
=============================

Diff base:            merge-base(origin/main, HEAD)..HEAD
Files changed:        14
Lines added:          892
Lines removed:        47

Build         pass    (12.4s)
Tests         pass    47/47 passing  (1.8s)
Static        pass    eslint clean, tsc strict ok  (3.1s)
Secrets       pass    no secrets in diff
Dependencies  pass    no critical CVE in changed packages
Held-out      pass    5 of 5 reserved spec items satisfied

Verdict:      VERIFIED  (exit 0)
Evidence:     .loki/verify/evidence.json
Report:       .loki/verify/report.md

Exit codes:

Code	Meaning
`0`	VERIFIED -- all checks pass with conclusive evidence
`1`	CONCERNS -- inconclusive evidence, empty diff, or non-blocking warnings
`2`	BLOCKED -- red test, secret leak, critical CVE, failing held-out item
`3`	Verifier error (could not complete; never silently passes)

The diff base resolution is merge-base(base, HEAD)..HEAD -- proper PR semantics, not HEAD~1. Inconclusive evidence is never reported VERIFIED. Empty diffs yield CONCERNS, not green. Bare root-level test files are detected so discoverable tests are never silently skipped.
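The exit-code table can be read as a fold over per-check statuses. In this sketch the status labels ("pass", "fail", "inconclusive", "error") and the precedence of error over fail are assumptions; the four verdicts and their codes are from the table above.

```python
from enum import IntEnum


class Verdict(IntEnum):
    VERIFIED = 0
    CONCERNS = 1
    BLOCKED = 2
    ERROR = 3


def verdict_for(check_statuses: dict, diff_is_empty: bool) -> Verdict:
    """Fold per-check statuses into a single verdict and exit code."""
    statuses = set(check_statuses.values())
    if "error" in statuses:
        return Verdict.ERROR      # verifier could not complete
    if "fail" in statuses:
        return Verdict.BLOCKED
    if diff_is_empty or "inconclusive" in statuses or not statuses:
        return Verdict.CONCERNS   # never green without conclusive evidence
    return Verdict.VERIFIED
```

A CI wrapper would then end with sys.exit(int(verdict)), so any outcome other than VERIFIED fails the job.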

Important scope note: the v7.27.0 MVP is deterministic-only. No LLM in the gate path. The LLM did its work upstream in the RARV-C build loop. A single-reviewer LLM stage and the blind council are sequenced for future releases per the verification spec. This is stated honestly in loki verify --help and in the evidence document (llm_review.status = "skipped").

Drop the same command into GitHub Actions:

# .github/workflows/loki-verify.yml
name: Loki Verify
on:
  pull_request:
    branches: [main]

jobs:
  verify:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # full history for merge-base

      - uses: oven-sh/setup-bun@v1

      - name: Install Loki Mode
        run: bun install -g loki-mode

      - name: Run verification
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: loki verify origin/${{ github.base_ref }}

      - name: Upload evidence
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: loki-verify-evidence
          path: .loki/verify/

The job exits 0/1/2/3. The evidence is a structured artifact you can inspect from the PR view. The deterministic checks make it a real gate, not a vibe check.

To see how trust evolves on YOUR repo over time:

loki trust              # one-line verdict + per-axis direction
loki trust --json       # machine-readable trajectory
loki trust-metrics      # block rate, p90 failure, council rejection, cost-per-verified-task

loki trust-metrics aggregates from a durable append-only log at .loki/metrics/trust-events.jsonl. Un-instrumented projects report available: false, never fabricated zeros.
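A sketch of the "never fabricated zeros" rule when aggregating from an append-only JSONL log. The event schema here (a type field with value "verify" and a verdict field) is hypothetical; only the log path convention and the available: false behavior are from the post.

```python
import json
from pathlib import Path


def block_rate(log_path: Path) -> dict:
    """Aggregate a block rate from a JSONL event log (hypothetical schema)."""
    if not log_path.exists():
        return {"available": False}
    events = [json.loads(line)
              for line in log_path.read_text().splitlines() if line.strip()]
    verdicts = [e for e in events if e.get("type") == "verify"]
    if not verdicts:
        # No data is reported as unavailable, not as a 0% block rate.
        return {"available": False}
    blocked = sum(1 for e in verdicts if e.get("verdict") == "BLOCKED")
    return {"available": True, "runs": len(verdicts),
            "block_rate": blocked / len(verdicts)}
```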

8. Workflow 4 (optional): `loki quickstart` -- the training-wheels mode

If you have never used the tool before and want a guaranteed-working first run with zero PRD-writing, loki quickstart is a guided 4-step interview that lands the bundled Todo app on four Enter presses. Setup check, idea (default: Todo app), template pick (deterministic offline scorer over the bundled templates, no LLM at this step), plan review.

loki quickstart

It is genuinely just for the first 10 minutes. The real workflows are the three above.

9. Internals: how the held-out gate stops reward-hacking

The technical bit I get the most questions about, and the one I am most proud of from this release wave.

The failure mode

Once an autonomous build loop has access to the spec's acceptance checklist, an aggressive optimizer can tune to that exact checklist. The visible items pass. The spec does not. You ship something that satisfies the letter of the checklist and not the intent.

This is the same failure mode that has plagued ML benchmarks for years (BLEU, ROUGE, leaderboards). For autonomous coding it is worse because the optimizer has access to the test runner and can iterate against it.

The fix

Before the first verification, a deterministic selector reserves a slice (real impl in autonomy/prd-checklist.sh):

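# Simplified; see the real bash impl linked above
import hashlib

def select_heldout(checklist_items):
    """Reserve a deterministic slice of roughly 25% of the checklist."""
    n = len(checklist_items)
    if n < 4:
        return []  # too small to reserve from
    count = max(1, min(5, round(0.25 * n)))  # clamp to 1..5 items
    # Rank by the SHA-256 hex digest of each id: reproducible, not random
    ranked = sorted(checklist_items,
                    key=lambda c: hashlib.sha256(str(c.id).encode()).hexdigest())
    return ranked[:count]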

Selection is reproducible (sha256(id)-ranked, not random), idempotent (only written once to .loki/checklist/held-out.json), and bounded (clamped to 1-5 items).

What the build agent sees

Everything the build loop reads is filtered to exclude the held-out IDs. The build agent literally cannot see them in its context window. It can pass every item it can see and still get blocked at the ship gate if the held-out items fail.

Concretely, the filter removes the IDs from:

The visible checklist summary in the run prompt
The per-iteration checklist progress gate
The completion-prompt count of "N/M items satisfied"
The dashboard's task panel
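The filtering applied to those four surfaces can be sketched with a single helper that every build-facing view goes through. The item shape (dictionaries with id and satisfied keys) and the function names are assumptions.

```python
def visible_items(checklist, heldout_ids):
    """Drop held-out items from anything the build loop is allowed to read."""
    hidden = set(heldout_ids)
    return [item for item in checklist if item["id"] not in hidden]


def progress_line(checklist, heldout_ids):
    """Build the "N/M items satisfied" summary over visible items only."""
    visible = visible_items(checklist, heldout_ids)
    done = sum(1 for item in visible if item["satisfied"])
    return f"{done}/{len(visible)} items satisfied"
```

Note that M shrinks as well as N: reporting the true total would let the agent infer how many items were reserved.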

The completion council reads them at the ship gate

At the ship gate (called from both the standard completion route and the force-review route), council_heldout_gate (in autonomy/completion-council.sh) reads .loki/checklist/held-out.json, runs each item against the current diff with a dedicated evaluation prompt, and writes a heldout_eval trust event to .loki/metrics/trust-events.jsonl. A held-out item whose status comes back failing and is not explicitly waived blocks completion like any other critical failure.
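The blocking rule in that last sentence is a small predicate. A sketch, assuming each evaluation result is a dictionary with id, status, and an optional waived flag (hypothetical field names):

```python
def heldout_gate(results):
    """Return (blocked, offending_ids) for held-out evaluation results."""
    offending = [r["id"] for r in results
                 if r["status"] == "failing" and not r.get("waived", False)]
    return bool(offending), offending
```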

Honest limits

This guards the prompt feed, not the filesystem. The reservation lives on disk at .loki/checklist/held-out.json. An agent with read access to the working tree could open that file and learn which items were held out. The guarantee is that no prompt or summary the build agent reads ever names a held-out item. For the most realistic attack-shape (an LLM tuning to its visible context window), that is the right defense.

For a stronger guarantee, an opt-in mode could ship that places the held-out file outside the working tree, with the tradeoff that you cannot rerun verification offline without the file path. We chose the on-disk default and named the limit explicitly.

Opt out entirely with LOKI_HELDOUT_GATE=0.

10. What we learned shipping 15 minor releases in 4 days

The release cadence is not a marketing stunt; it came out of internal practice. A few things that turned out to matter.

Coordinated arcs beat feature dumps

The previous wave (v7.9 through v7.17) was the same shape: 9 minor releases over 2 days, sequenced R1 through R10. The v7.20 through v7.35 wave was 15 minors over 4 days. Each release closes one specific user-facing problem, and the arc has a narrative the README can sustain.

If we had shipped the same functionality as a single v8.0.0 release, we would have spent weeks on integration testing and the user-facing communication would have been a mess.

A council that cannot edit code reviews more honestly

The single biggest quality improvement of the week was v7.33.0's --disallowedTools. Reviewer subcalls now physically cannot use Edit, Write, NotebookEdit, or destructive git (the list includes the git -C / --git-dir / -c flag-prefixed forms too).
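The flag-prefixed forms matter because a naive check on the second token misses git -C /repo reset --hard. Here is a sketch of finding the real subcommand by skipping git's global options first. The set of destructive subcommands is an assumption for illustration, not the list Loki ships.

```python
import shlex

# Global git options that consume a separate value argument.
VALUE_FLAGS = {"-C", "-c", "--git-dir", "--work-tree", "--namespace"}
# Hypothetical deny-list for reviewer subcalls.
DESTRUCTIVE = {"reset", "clean", "push", "rebase", "checkout", "restore"}


def git_subcommand(command: str):
    """Return the git subcommand, skipping global options, or None."""
    tokens = shlex.split(command)
    if not tokens or tokens[0] != "git":
        return None
    i = 1
    while i < len(tokens):
        token = tokens[i]
        if token in VALUE_FLAGS:
            i += 2   # the flag plus its separate value
        elif token.startswith("-"):
            i += 1   # --git-dir=..., --no-pager, and similar
        else:
            return token
    return None


def is_blocked(command: str) -> bool:
    return git_subcommand(command) in DESTRUCTIVE
```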

Before this, we observed the council occasionally "improving" the diff under review, which technically satisfied the review goal (the new diff was now passable) but defeated the gate's purpose. The fix was small, the impact was large. This is the kind of thing you find by reading your own internal traces, not by running a benchmark.

Opt out with LOKI_REVIEW_TOOL_GUARD=0.

Honest provider labels build trust faster than full-stack promises

When v7.27.0 dropped, the README labels for Codex, Cline, and Aider went from "Supported" to "Experimental." Loki Mode now claims "Tier 1 E2E-verified primary" only for Claude Code.

This was uncomfortable to ship. The README looks less impressive. The marketing surface got smaller. But Discord activity went up, not down, the week after. The audience pulled to a tool like this is allergic to "supports five providers" marketing copy. Saying the smaller true thing builds more trust than saying the larger fuzzy thing.

Cost honesty enforced in code beats cost honesty as a marketing claim

v7.31.0 and v7.32.0 shipped the cost-honesty contract: the loki plan quote, the dashboard's reported model, and the actually-dispatched model agree across every model lever. A sonnet session pin that routes through the development tier to Opus now quotes Opus, not Sonnet. The old behavior underquoted by about 1.7x.

The work was three days of internal plumbing. It does not show up on the README feature list. Users noticed within hours because their cost dashboards stopped lying.

Auto-open the dashboard

The single highest-leverage UX change of the week was the one-line "loki start auto-opens the dashboard." We resisted it for months on the grounds of "developers don't want surprise browser windows." The data was unambiguous: with auto-open, the share of users who finished their first build rose noticeably.

Lesson: respect for the user's environment matters less than removing one barrier between them and a successful first run. The opt-out (LOKI_NO_AUTO_OPEN=1, plus auto-skip on CI=true / SSH-no-TTY / piped stdin) is enough.

11. The roadmap and where contributions move the needle fastest

Near-term (next 4 weeks):

LLM single-reviewer stage in loki verify. v7.27.0 MVP is deterministic-only. The single-reviewer stage is sequenced next per the verification spec, with the blind council after that.
Public hosted backend for loki proof share --hosted. Today the --hosted flag publishes to a user-supplied LOKI_HOSTED_ENDPOINT and prints an honest "no official hosted backend yet" message when unset. We are building the hosted endpoint. Opt-in. The free-forever CLI commitment in docs/OPEN-CORE-BOUNDARY.md stays.
Mobile dashboard polish. The dashboard is web-based but assumes a desktop browser. Mobile responsiveness needs work.
More benchmark task adapters. We ship loki bench with real adapters for Aider and Claude Code. We need adapters for more competitors. Cleanest contribution surface for an external PR.

Medium-term (next quarter):

Replay re-execution mode (loki memory replay --apply). Today loki memory replay is read-only. Re-execution needs proper sandboxing and confirmation; not shipping until that is right.
Embedding layer for cross-project memory. Today's retrieval uses token overlap. An embedding layer would catch synonym mismatches the keyword scorer misses.
10k-episode memory index at p95 < 500ms.
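The synonym gap that motivates the embedding layer is easy to see with a toy token-overlap scorer. This sketch uses Jaccard similarity over lowercase word tokens, which is one common form of token overlap and an assumption here; the post does not say which formula the retrieval uses.

```python
import re


def tokens(text: str) -> set:
    """Lowercase alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(query: str, document: str) -> float:
    """Jaccard similarity between the token sets of query and document."""
    q, d = tokens(query), tokens(document)
    if not q or not d:
        return 0.0
    return len(q & d) / len(q | d)
```

With this scorer, "sign in broken" and "login error" score 0.0 despite describing the same failure, which is exactly the mismatch a keyword approach cannot catch.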

Where contributions land fastest right now:

Benchmark task adapters for any AI coding tool that has a CLI. The contract is clean, the integration is small, and we will land any well-formed PR within 48 hours.
Agent and template marketplace packs for loki agent install. Install is data-only by construction (manifests are never eval'd, exec'd, or imported), so contributions are safe to land without security review for each one.
Language server coverage. We auto-spawn an lsp-proxy MCP for TypeScript, Python, Go, Rust. Adding Ruby, PHP, Kotlin, Swift, Elixir is small and well-scoped.
Dashboard panels and i18n. The dashboard is Web Components + Tailwind. Adding panels is straightforward.

If any of these interest you, drop into Discord and say hello. I respond within hours.

12. Try it in two minutes

# Install (Bun recommended; v8 will be Bun-only)
bun install -g loki-mode

# Verify the install
loki version
loki doctor

# Workflow 1: from a PRD
loki start ./prd.md

# Workflow 2: from any GitHub/GitLab/Jira issue
loki start owner/repo#123 --ship --bg

# Workflow 3: CI gate on any branch or PR diff
loki verify origin/main

# Inspect trust trajectory across all your runs
loki trust

# Per-iteration cost visibility
loki cost --last 10

# First-time exploration
loki quickstart

If you build something with it, drop a screenshot in Discord. I will boost it.

Links

Repo: github.com/asklokesh/loki-mode
Site: autonomi.dev
Docs: autonomi.dev/docs
Discord: discord.gg/k8NpBhc5KA
LinkedIn: linkedin.com/company/autonomi-dev-agents
Full v7.20-v7.35 writeup on the blog: autonomi.dev/blog/spec-to-live-app-week-v7-20-v7-35
Verification deep dive: autonomi.dev/blog/loki-vs-replit-lovable-verified-shipment
Open-core boundary commitment: github.com/asklokesh/loki-mode/blob/main/docs/OPEN-CORE-BOUNDARY.md
License: BSL 1.1, converts to Apache 2.0 on March 19, 2030

If you read this far, thank you. Tell me in the comments which part of the verification surface you would push back on, or what would make you trust an autonomous agent to ship a diff into your own codebase. That feedback is the input I most need.

Loki Mode (also called Autonomi) is built and maintained by @asklokesh. Source-available under BSL 1.1; converts to Apache 2.0 on March 19, 2030. We never proxy your provider keys, never collect prompts or code, and the telemetry that produced the numbers above is opt-out via LOKI_TELEMETRY_DISABLED=true or DO_NOT_TRACK=1.

Lokesh Mure — Fri, 26 Dec 2025 17:16:26 +0000

How I Built an Autonomous AI Startup System with 37 Agents Using Claude Code

#ai #opensource #productivity #programming

Lokesh Mure — Fri, 26 Dec 2025 17:12:26 +0000

Last month I asked myself a question that wouldn't leave me alone: what if I could mass-hire 37 specialists for my side projects without spending anything?

I work full-time as a technology lead. Like many of you, I have a graveyard of side projects that died somewhere between "great idea" and "I'll finish it this weekend." The problem was never the idea. It was bandwidth. Solo founders are expected to be developer, marketer, ops, legal, finance, and customer support all at once.

So I built Loki Mode - an open source Claude Code skill that orchestrates 37 specialized AI agents to take a product requirements document and autonomously build, deploy, and operate a complete product.

This is the story of how I built it and what I learned.

The Problem I Wanted to Solve

Most AI coding tools still require you to babysit every step. You prompt, wait, review, prompt again, fix the hallucination, prompt again. It's faster than coding from scratch, but you're still the bottleneck.

I wanted something different:

Give it a PRD
Walk away
Come back to a deployed product

No hand-holding. No human in the loop for routine decisions.

Architecture: Why 37 Agents?

I started with a single autonomous agent. It worked for simple tasks but fell apart on anything complex. The context window would fill up, the agent would lose track of what it was doing, and quality degraded.

The solution was specialization. Instead of one agent trying to be everything, I created focused agents that only do one thing well:

Engineering Swarm (8 agents): frontend, backend, database, mobile, API, QA, performance, infrastructure

Operations Swarm (8 agents): devops, SRE, security, monitoring, incident response, release management, cost optimization, compliance

Business Swarm (8 agents): marketing, sales, finance, legal, support, HR, investor relations, partnerships

Data Swarm (3 agents): ML engineer, data engineer, analytics

Product Swarm (3 agents): product manager, designer, technical writer

Growth Swarm (4 agents): growth hacker, community, customer success, lifecycle marketing

Review Swarm (3 agents): code reviewer, business logic reviewer, security reviewer

Each agent has a focused context, specific capabilities, and clear boundaries. The orchestrator coordinates them through a distributed task queue.

The Parallel Code Review Pattern

This was the single biggest improvement to code quality. Instead of one reviewer, every piece of code goes through three specialized reviewers simultaneously:

IMPLEMENT → REVIEW (3 parallel) → AGGREGATE → FIX → RE-REVIEW → COMPLETE
                │
                ├─ code-reviewer (quality, patterns, maintainability)
                ├─ business-logic-reviewer (requirements, edge cases)
                └─ security-reviewer (vulnerabilities, auth issues)

Each reviewer returns a structured response:

{
  "strengths": ["Well-structured modules", "Good test coverage"],
  "issues": [
    {
      "severity": "High",
      "description": "Missing input validation on user endpoint",
      "location": "src/api/users.js:45",
      "suggestion": "Add schema validation before processing"
    }
  ],
  "assessment": "FAIL"
}

The severity determines what happens next:

Severity	Action
Critical/High/Medium	Block. Dispatch fix agent. Re-run ALL 3 reviewers.
Low	Add `// TODO(review): ...` comment, continue
Cosmetic	Add `// FIXME(nitpick): ...` comment, continue

This catches issues that a single reviewer would miss. The business logic reviewer catches requirements gaps. The security reviewer catches vulnerabilities. The code reviewer catches maintainability issues.
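The review loop can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's code: reviewers are plain callables that return the structured response shown earlier, `fix` stands in for the fix agent, and `max_rounds` is a hypothetical safety limit that the original flow does not mention. Unknown severities are treated as blocking, which is the conservative choice.

```python
from concurrent.futures import ThreadPoolExecutor

# Severities that only produce an inline comment, per the severity table.
COMMENT_PREFIX = {
    "Low": "// TODO(review): ",
    "Cosmetic": "// FIXME(nitpick): ",
}


def run_reviews(code, reviewers):
    """Run every reviewer on the same code in parallel and collect the responses."""
    with ThreadPoolExecutor(max_workers=len(reviewers)) as pool:
        futures = [pool.submit(review, code) for review in reviewers]
        return [future.result() for future in futures]


def aggregate(responses):
    """Split all reported issues into blocking issues and inline comments."""
    blocking, comments = [], []
    for response in responses:
        for issue in response["issues"]:
            prefix = COMMENT_PREFIX.get(issue["severity"])
            if prefix is None:
                # Critical, High, Medium, or anything unrecognized blocks the merge.
                blocking.append(issue)
            else:
                comments.append(prefix + issue["description"])
    return blocking, comments


def review_loop(code, reviewers, fix, max_rounds=3):
    """Review, fix, and re-review until no blocking issue remains."""
    for _ in range(max_rounds):
        blocking, comments = aggregate(run_reviews(code, reviewers))
        if not blocking:
            return code, comments
        # Every reviewer runs again on the fixed code in the next round.
        code = fix(code, blocking)
    raise RuntimeError("blocking issues remain after max_rounds")
```

Re-running all reviewers after a fix, rather than only the one that complained, guards against a fix for one reviewer introducing a problem another reviewer would catch.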

Handling Failures: Circuit Breakers and Dead Letter Queues

Autonomous systems fail. The question is how they fail.

I implemented circuit breakers borrowed from distributed systems design:

CLOSED (normal) → failures++ → threshold reached → OPEN (blocking)
                                                        │
                                                   cooldown expires
                                                        │
                                                        ▼
                                                  HALF-OPEN (testing)
                                                        │
                                    success ◄───────────┴───────────► failure
                                       │                                  │
                                       ▼                                  ▼
                                    CLOSED                              OPEN

When an agent type fails repeatedly, the circuit breaker opens and stops sending work to that agent type. After a cooldown period, it enters half-open state and allows one test request. If that succeeds, normal operation resumes. If it fails, back to open.
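A minimal sketch of that state machine follows, assuming one breaker per agent type. The class name, the `failure_threshold` and `cooldown_seconds` defaults, and the injectable `clock` are illustrative choices, not values taken from the project.

```python
import time


class CircuitBreaker:
    """CLOSED -> OPEN after repeated failures, then a single HALF_OPEN probe."""

    def __init__(self, failure_threshold=3, cooldown_seconds=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock  # injectable so the cooldown can be tested without sleeping
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = None

    def allow_request(self):
        """Return True if work may be sent to this agent type right now."""
        if self.state == "CLOSED":
            return True
        if self.state == "OPEN":
            if self.clock() - self.opened_at >= self.cooldown_seconds:
                self.state = "HALF_OPEN"  # cooldown expired: let one probe through
                return True
            return False
        # HALF_OPEN: the single probe is already in flight, so hold everything else.
        return False

    def record_success(self):
        """A success, including a successful probe, restores normal operation."""
        self.state = "CLOSED"
        self.failures = 0

    def record_failure(self):
        """Count a failure; a failed probe reopens the breaker immediately."""
        if self.state == "HALF_OPEN":
            self._open()
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self._open()

    def _open(self):
        self.state = "OPEN"
        self.opened_at = self.clock()
```

The caller checks `allow_request()` before dispatching and reports the outcome with `record_success()` or `record_failure()`.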

Tasks that still fail after retries go to a dead letter queue for manual review rather than blocking the entire system.
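The retry-then-park behavior can be sketched as below. The function name, the `max_attempts` default, and the shape of a dead letter record are assumptions made for illustration.

```python
from collections import deque


def run_with_dead_letter(tasks, handler, max_attempts=3):
    """Run each task, retrying failures, and park exhausted tasks for manual review."""
    pending = deque((task, 1) for task in tasks)
    completed, dead_letters = [], []
    while pending:
        task, attempt = pending.popleft()
        try:
            completed.append((task, handler(task)))
        except Exception as exc:
            if attempt < max_attempts:
                # Retry later, behind the other queued work, so one bad task
                # never stalls the rest of the queue.
                pending.append((task, attempt + 1))
            else:
                dead_letters.append(
                    {"task": task, "attempts": attempt, "error": repr(exc)}
                )
    return completed, dead_letters
```

Keeping the error and the attempt count on each dead letter gives the human reviewer enough context to decide whether to requeue or drop the task.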

State Persistence: Surviving Rate Limits

Claude Code has rate limits. In the middle of building your startup, you might hit them. The system needed to survive this gracefully.

Every agent maintains its own state file:

{
  "id": "eng-backend-01",
  "role": "eng-backend",
  "status": "active",
  "currentTask": "task-uuid",
  "tasksCompleted": 12,
  "lastCheckpoint": "2025-01-15T10:30:00Z"
}

Before every major operation, agents checkpoint their state. When the system resumes after a rate limit:

Orchestrator reads its state file
Scans all agent states for incomplete tasks
Re-queues orphaned tasks
Spawns replacement agents for failed ones
Continues from where it left off

No lost work. No starting over.
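A minimal sketch of the checkpoint and the orphan scan, using the field names from the state file above. The one-file-per-agent layout, the function names, and the convention that an idle agent has `currentTask` set to null are assumptions.

```python
import json
import os
import tempfile
from pathlib import Path


def checkpoint(state_dir, state):
    """Atomically write one agent's state to <state_dir>/<id>.json."""
    state_dir = Path(state_dir)
    state_dir.mkdir(parents=True, exist_ok=True)
    target = state_dir / f"{state['id']}.json"
    # Write to a temp file in the same directory, then rename over the target,
    # so an interruption mid-write never leaves a truncated state file.
    fd, tmp_name = tempfile.mkstemp(dir=state_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        json.dump(state, tmp)
    os.replace(tmp_name, target)


def orphaned_tasks(state_dir):
    """Return the task ids that agents were still holding at the last checkpoint."""
    orphans = []
    for path in sorted(Path(state_dir).glob("*.json")):
        state = json.loads(path.read_text())
        if state.get("currentTask") is not None:
            orphans.append(state["currentTask"])
    return orphans
```

On resume, the orchestrator would feed the result of `orphaned_tasks` back into its queue before spawning replacement agents.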

The Anti-Hallucination Protocol

AI agents hallucinate. They claim packages exist that don't. They invent API endpoints. They assume syntax that doesn't compile.

Every agent follows a strict protocol:

Category	Verification Method
Technical capabilities	Web search official docs
API usage	Read docs + test with real call
Package/dependency	Verify exists on registry
Syntax correctness	Execute code, don't assume
Performance claims	Benchmark with real data
Competitor features	Verify on their actual site

The rule is simple: never assume, always verify. When uncertain, research first. If still uncertain, choose the conservative option and document the uncertainty.
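As one example of "verify, don't assume", here is a sketch of the package check against PyPI's JSON endpoint. The function names are hypothetical, other registries would need their own lookup, and any status other than 200 or 404 is treated as "uncertain" rather than guessed at.

```python
import urllib.error
import urllib.parse
import urllib.request


def pypi_status(name):
    """Return the HTTP status of the PyPI JSON endpoint for a package name."""
    url = f"https://pypi.org/pypi/{urllib.parse.quote(name, safe='')}/json"
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


def package_exists(name, fetch_status=pypi_status):
    """True if the registry knows the package, False if it reports 404."""
    status = fetch_status(name)
    if status == 200:
        return True
    if status == 404:
        return False
    # Anything else (rate limit, outage) is uncertainty: surface it, don't guess.
    raise RuntimeError(f"could not verify {name!r}: HTTP {status}")
```

The `fetch_status` parameter exists so the check can be exercised offline with a stub; an agent would call `package_exists(name)` before adding a dependency.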

What I Would Do Differently

Start with fewer agents. 37 agents is a lot to coordinate. I would start with the core engineering swarm and add others incrementally.

Better observability. Debugging a multi-agent system is hard. I added logging everywhere but still sometimes struggle to understand why an agent made a particular decision.

More integration tests. Unit testing individual agents is straightforward. Testing the interactions between 37 agents is not.

Try It Yourself

The entire system is open source under MIT license:

GitHub: https://github.com/asklokesh/claudeskill-loki-mode

To use it:

# Clone to your Claude Code skills directory
git clone https://github.com/asklokesh/claudeskill-loki-mode.git ~/.claude/skills/loki-mode

# Launch Claude Code with autonomous permissions
claude --dangerously-skip-permissions

# Say the magic words
> Loki Mode with PRD at ./docs/requirements.md

Fair warning: this requires `--dangerously-skip-permissions` because the agents need to execute code, create files, and make network requests autonomously. That flag turns off Claude Code's permission prompts, so nothing asks you before a command runs. Understand what that means before you run it.

What's Next

I'm still iterating on this. Current areas of focus:

Better agent coordination patterns
Reducing token usage through smarter context management
More deployment targets
Improved monitoring dashboard

If you try it, let me know what breaks. Open an issue or find me on LinkedIn.

Building in public. One autonomous agent at a time.
