DEV Community

Mamoor Ahmad


I Sent One Message and 5 AI Agents Built, Audited, Tested & Deployed a Full App

OpenClaw Challenge Submission 🦞

This is a submission for the OpenClaw Challenge.

What I Built

I built ClawForge, a multi-agent orchestration system that turns a single natural-language prompt into a fully deployed production application. No human intervention after the initial message. No manual code review. No hand-written tests. Just one sentence, and five AI agents handle the rest.

The pipeline looks like this:

```
You: "Build me a URL shortener with click analytics"
 │
 ┌───────────▼───────────┐
 │ 🧠 Architect Agent    │  Plans stack, schema, endpoints
 └───────────┬───────────┘
 ┌───────────▼───────────┐
 │ 💻 Coder Agent        │  Implements full project (11 files)
 └───────────┬───────────┘
 ┌───────────▼───────────┐
 │ 🔍 Reviewer Agent     │  Security audit, code quality gate
 └───────────┬───────────┘
 ┌───────────▼───────────┐
 │ 🧪 Tester Agent       │  Writes & runs test suite
 └───────────┬───────────┘
 ┌───────────▼───────────┐
 │ 🚀 Deployer Agent     │  Git init, GitHub push, deploy
 └───────────┬───────────┘
 ✅ Live URL + Repo
```

The first project it built, a URL shortener with click analytics, went from zero to a live deployed app in about 20 minutes. One Telegram message was the entire human input.

Live demo: URL Shortener

Source code: mamoor123/clawforge on GitHub

How I Used OpenClaw

ClawForge is built as an OpenClaw skill: a drop-in module that extends what your AI agent can do. Here's how the pieces fit together:

The Skill Layer

The orchestrator lives in skills/clawforge/SKILL.md. When OpenClaw sees a trigger phrase like "Build me..." or "ClawForge: [description]", it activates the pipeline. The skill file tells the agent how to coordinate; it doesn't write code itself. It's a conductor, not a musician.
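As a rough sketch of what that trigger dispatch amounts to (the trigger phrases are from the post; everything else is illustrative, since the real SKILL.md is prose instructions for the agent, not a script):

```shell
#!/bin/sh
# Illustrative only: models the prefix-match trigger described above.
# The actual skill file is natural-language instructions, not code.
dispatch() {
  case "$1" in
    "Build me"*|"ClawForge:"*)
      echo "trigger: start pipeline (architect -> coder -> reviewer -> tester -> deployer)" ;;
    *)
      echo "no trigger: handle as a normal message" ;;
  esac
}

dispatch "Build me a URL shortener with click analytics"
dispatch "What's the weather?"
```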

Five Specialized Agents

Each agent has its own IDENTITY.md file: a focused persona with a checklist.

| Agent | What It Does | Output |
| --- | --- | --- |
| 🧠 Architect | Stack selection, schema design, API planning | Full plan in clawforge-state.json |
| 💻 Coder | Implements everything in TypeScript | 11 production-ready source files |
| 🔍 Reviewer | Security audit, code quality, best practices | Pass/fail with specific issues |
| 🧪 Tester | Writes and runs a test suite | 10 tests, results + coverage |
| 🚀 Deployer | Git setup, GitHub push, live deployment | Repo URL + live URL |

Shared State Communication

All agents communicate through a single JSON file (clawforge-state.json). Each stage reads the previous agent's output and writes its own results. The orchestrator reads the state to decide what happens next, including whether to loop back and retry on failure.

```shell
# The state manager that ties everything together
bash clawforge.sh init "Build a URL shortener"
bash clawforge.sh stage architect plan_complete
bash clawforge.sh review-result true ""
bash clawforge.sh test-result 10 10 0
bash clawforge.sh deploy-result https://live-url.com https://github.com/...
```
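To make the idea concrete, here's a minimal sketch of state-driven orchestration. The field names and the jq decision logic are my guesses; the post doesn't show the real schema:

```shell
# Hypothetical state file -- field names are assumptions, not the real schema.
cat > clawforge-state.json <<'EOF'
{
  "prompt": "Build a URL shortener",
  "stage": "reviewer",
  "review": { "passed": false, "issues": ["unvalidated redirect target"] }
}
EOF

# The orchestrator reads the state and picks the next stage: if the
# review failed, loop back to the Coder with the issue list.
next=$(jq -r 'if .stage == "reviewer" and (.review.passed | not)
              then "coder" else "tester" end' clawforge-state.json)
echo "next stage: $next"
# -> next stage: coder
```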

The Retry Loop

Here's where it gets interesting. If the Reviewer finds issues, the pipeline loops back to the Coder with specific fix instructions. Same with tests: if tests fail, the Coder gets the failure output and fixes the code. Up to 2 retries per stage. This isn't just "generate and pray"; there's a real feedback loop.

```
initialized → architect → coder → reviewer → tester → deployer → complete
                            ↑         ↑
                            └─────────┘  (retry on failure, max 2x)
```
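The loop itself is simple to sketch. This is a toy version of the policy, not the real clawforge.sh; `run_stage` is a stand-in for invoking an agent, rigged to fail once and then succeed:

```shell
# Toy model of the retry policy: re-run a failed stage up to 2 times.
attempts=0
run_stage() {                      # stand-in for invoking a real agent
  attempts=$((attempts + 1))
  [ "$attempts" -ge 2 ]            # first call fails, second succeeds
}

with_retries() {
  stage="$1" max=2 try=0
  until run_stage "$stage"; do
    try=$((try + 1))
    if [ "$try" -gt "$max" ]; then
      echo "$stage: giving up after $max retries"
      return 1
    fi
    echo "$stage: failed, retrying ($try/$max)"
  done
  echo "$stage: passed"
}

with_retries reviewer
# -> reviewer: failed, retrying (1/2)
# -> reviewer: passed
```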

Demo

Here's what happened when I sent a single Telegram message:

"Build me a URL shortener with click analytics"

Architect output: Planned a TypeScript + Express + SQLite stack with nanoid for short codes. Defined 5 API endpoints, the database schema, and the full file structure.

Coder output: 11 files covering the server, database setup, route handlers (shorten, redirect, analytics), a frontend with a dark UI, and tests. All in TypeScript with strict mode.

Reviewer output: Found 21 issues across security, correctness, and architecture. 6 high-severity bugs including a heredoc command injection vulnerability, JSON injection via newlines, and a bug where an empty field path would destroy the entire state file. The pipeline looped back to fix them.

Tester output: 10 tests, all passing. Covers URL creation, validation, redirects, click tracking, and error cases.

Deployer output: Git initialized, pushed to GitHub, deployed to a live URL.

Final result:

The URL shortener supports:

  • Shorten any URL with 7-character codes
  • Custom short codes for branded links
  • Click tracking (referrer, user agent, IP, timestamp)
  • Analytics dashboard with daily clicks and top referrers
  • Full REST API with pagination

What I Learned

The Reviewer Agent Was the MVP

I expected the Coder to be the star. It wasn't. The Reviewer agent found real bugs that would have been production nightmares:

  • Command injection via heredoc: the init function used an unquoted heredoc, meaning a $(rm -rf /) in user input would execute. The Reviewer caught this.
  • State destruction bug: passing an empty field path to the update command turned the entire state file into a single string. Confirmed: the state file literally becomes "boom". Gone.
  • Schema mismatch: the init command and the plan.sh script initialized the same fields with different types (empty objects vs. null), which would crash downstream jq queries.

These aren't hypothetical. They're bugs that a human code reviewer could easily miss.
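The first two are easy to reproduce in isolation. This sketch uses a harmless $(echo ...) payload, and it assumes (my guess, not confirmed against the repo) that the state updater feeds a dotted field path into jq's setpath:

```shell
# 1. Heredoc command injection: with an unquoted delimiter, the shell
#    executes $(...) inside the heredoc body.
cat <<EOF
unquoted: $(echo pwned)
EOF
# -> unquoted: pwned

#    Quoting the delimiter is the fix -- the body is taken literally.
cat <<'EOF'
quoted: $(echo pwned)
EOF
# -> quoted: $(echo pwned)

# 2. Empty-path state destruction: setpath with an empty path replaces
#    the entire document, so the whole state file becomes one string.
echo '{"stage": "coder", "files": 11}' |
  jq -c --arg v boom 'setpath([]; $v)'
# -> "boom"
```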

AI Agents Need Guardrails, Not Just Prompts

The initial pipeline had no retry mechanism. The first run produced code that the Reviewer flagged, and without a loop, it would have shipped broken software. Adding the retry logic (max 2 attempts per stage) transformed the system from "generate once" to "iterate until quality gate passes."

Lesson: don't just prompt an AI to build something. Give it a feedback loop. The Reviewer-Tester-Coder cycle is what makes ClawForge actually produce reliable output.

The Architecture Was the Hardest Part

Getting five agents to coordinate through a shared state file sounds simple. It wasn't. The state schema needed to be consistent across all agents, error handling needed to work across stage boundaries, and the orchestrator needed to handle edge cases like "what if the Reviewer and Tester both fail on the first try?"

The audit report revealed issues I didn't anticipate, like the lack of file locking when multiple agents could theoretically write to the state file simultaneously, or the fact that runtime state files were being committed to git.

What Surprised Me Most

The Coder agent generated a dark-themed responsive frontend without being explicitly told to. The prompt was "URL shortener with click analytics" โ€” nothing about UI design. But the Architect planned a frontend, and the Coder delivered a clean dark UI with copy-to-clipboard functionality. The agents made design decisions I wouldn't have thought to specify.

Also surprising: the entire pipeline ran in about 20 minutes. From a cold start to a live deployed app. That's faster than most human developers can scaffold a project.


ClawForge is open source under MIT. Built on OpenClaw. Try it yourself: just say "Build me..." and see what happens.
