Maurizio

Posted on May 28

I'm a photographer. I built a DSL for multi-agent workflows.

#ai #programming #learning #agents

I'm a wedding photographer. Over the past couple of months I fell down a rabbit hole and built something I honestly can't evaluate, because I don't fully understand what I built. I need your take.

What it does

AgentFlow DSL: you write a .aflow file describing agents, phases, and loops, and it is automatically exposed as an MCP tool in Claude Code. Zero Python glue.

Example: a 3-agent workflow with a quality loop:

workflow code_quality
  agents:
    agent writer     → model: "local-fast"
    agent tester     → model: "openrouter-smart"
    agent critic     → model: "claude-sonnet"

  loop quality_gate
    phases: [write, test, review]
    repeat_while: review.verdict == "needs_work"
    max_iterations: 10

Writer generates code → Tester tries to break it → Critic reviews. If the critic says "needs_work", it loops back with feedback. 68 lines total.
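For readers who think in code, here is a rough Python sketch of what a loop like quality_gate amounts to. It is not AgentFlow's runtime: the function name run_quality_gate, the three callables, and the "feedback" key are all assumptions made for illustration. Only the repeat_while condition and the max_iterations cap come from the example above.

```python
from typing import Callable


def run_quality_gate(
    write: Callable[[str], str],
    test: Callable[[str], str],
    review: Callable[[str, str], dict],
    max_iterations: int = 10,
) -> dict:
    """Run write -> test -> review while the verdict is "needs_work"."""
    feedback = ""  # first pass has no critic feedback yet
    result: dict = {}
    for iteration in range(1, max_iterations + 1):
        code = write(feedback)             # writer sees the last feedback
        report = test(code)                # tester tries to break the code
        review_out = review(code, report)  # critic returns a verdict dict
        result = {"code": code, "review": review_out, "iterations": iteration}
        if review_out.get("verdict") != "needs_work":
            break                          # repeat_while condition is false
        feedback = review_out.get("feedback", "")
    return result


def demo() -> dict:
    """Stub agents: the critic approves on the third pass."""
    verdicts = iter(["needs_work", "needs_work", "approved"])
    return run_quality_gate(
        write=lambda feedback: f"code (feedback: {feedback or 'none'})",
        test=lambda code: "2 edge cases tried",
        review=lambda code, report: {
            "verdict": next(verdicts),
            "feedback": "handle empty input",
        },
    )
```

With the stub agents in demo(), the loop stops after three iterations. If the critic never approves, the loop simply ends at max_iterations with the last "needs_work" result, so the caller still has to decide what to do with it.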

Does it actually work?

Yes, somehow. Tested with real OpenRouter calls (Gemini Flash). 92 tests pass. Three executors: Claude SDK, OpenRouter API, Ollama. MIT license.

But here's the thing: I coded most of this without fully understanding what I was doing. The tokenizer, the parser, the compiler. I wrote them, but if you asked me to explain the theory, I'd struggle. Claude Code (the irony) helped me write a lot of it. I'm genuinely not sure if I built a real compiler or just a fancy string processor with a loop.

My honest questions

Does this solve a real problem or did I reinvent Python with extra characters?
LangGraph, CrewAI, and AutoGen already exist. Is a declarative DSL actually better, or do developers prefer writing code?
Is "docker-compose for AI agents" a useful framing or marketing fluff?
If you look at the code, is it architecturally solid or fundamentally wrong?

Top comments (4)

Harjot Singh • May 31

Not roasting this. The instinct is more right than in a lot of programmer-built agent systems. Most people wire agents together with vibes and a prompt; you reached for deterministic, reproducible pipelines, which is exactly the property that separates a demo from something you can actually rely on. A YAML-style DSL is a smart move too: it makes the workflow inspectable and version-controllable, so you can see and diff what the system will do instead of it living in someone's head.

The honest hard part (the fair roast) is that determinism and LLMs are in tension: the orchestration can be perfectly reproducible while the model output underneath still varies from run to run. So the real test of AgentFlow is what happens at the seams. When an agent returns something wrong or off-format, does the pipeline catch it and retry or fail loudly, or does the bad output flow downstream into the next step? Deterministic control flow around stochastic models is the right architecture; the validation between steps is what makes it trustworthy. A reproducible pipeline that doesn't verify each step just reproduces the same mistakes reliably.

That deterministic-harness-plus-checks instinct is the core of how I think about Moonshift. How are you handling a step whose model output doesn't match what the next agent expects: retry, schema validation, or a human check?

Maurizio • May 31

This is exactly the right question, and you've nailed the tension.

Right now, AgentFlow has three defenses:

must_produce = if a required field is missing from the JSON output, the runtime fills it with a default ("" for strings, 0 for numbers) and logs a warning. Crude, but it prevents downstream crashes.
done when conditions = boolean checks on output values (e.g., review.confidence >= 0.85 AND verdict == "approved"). Combined with loops, this gives you a quality gate that retries until the condition is met.
on_max_exceeded / escalate_to = when a loop exhausts its iterations, the workflow escalates to a human reviewer or falls back gracefully instead of silently producing garbage.
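To make the first two defenses concrete, here is a small Python sketch. It assumes a hypothetical declaration format where must_produce maps field names to "string" or "number"; the function names apply_must_produce and is_done are invented for illustration and are not AgentFlow's actual API.

```python
import logging

logger = logging.getLogger("aflow_sketch")  # hypothetical logger name

# Defaults as described above: "" for strings, 0 for numbers.
DEFAULTS = {"string": "", "number": 0}


def apply_must_produce(output: dict, must_produce: dict) -> dict:
    """Fill any missing required field with its default and log a warning."""
    filled = dict(output)  # do not mutate the agent's original output
    for field, kind in must_produce.items():
        if field not in filled:
            filled[field] = DEFAULTS[kind]
            logger.warning(
                "missing field %r, filled with default %r", field, DEFAULTS[kind]
            )
    return filled


def is_done(review: dict) -> bool:
    """Mirror of the example condition:
    review.confidence >= 0.85 AND verdict == "approved"."""
    return review.get("confidence", 0) >= 0.85 and review.get("verdict") == "approved"
```

Note that apply_must_produce only looks at whether a key exists. A field that is present but holds the wrong kind of value goes through untouched.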

What's missing (and you're right to call it out): schema validation between phases. Right now, must_produce checks presence, not validity. A word_count: "banana" would pass because the field exists. Type coercion helps ("42" → 42), but it's not real validation.

The next iteration I'm thinking about is structured output schemas per phase. Each agent declares a JSON Schema or TypeScript type, and the runtime validates before passing to the next agent. Combined with automatic retry on validation failure, this would make the harness actually trustworthy.
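As a sketch of the difference between presence and validity, the Python below checks types per field and coerces numeric strings. It uses a plain dict of Python types as a stand-in for a real JSON Schema, and the name coerce_and_validate is hypothetical; a real implementation would more likely lean on a JSON Schema validator.

```python
def coerce_and_validate(output: dict, schema: dict) -> tuple:
    """Return (clean_output, errors) for a phase output.

    schema maps field name -> expected Python type (str, int, or float).
    """
    clean: dict = {}
    errors: list = []
    for field, expected in schema.items():
        if field not in output:
            errors.append(f"{field}: missing required field")
            continue
        value = output[field]
        if type(value) is expected:
            clean[field] = value
            continue
        # Coerce only from strings to numbers, e.g. "42" -> 42.
        if isinstance(value, str) and expected in (int, float):
            try:
                clean[field] = expected(value)
                continue
            except ValueError:
                pass  # fall through and report a type error
        errors.append(f"{field}: expected {expected.__name__}, got {value!r}")
    return clean, errors


ok, ok_errors = coerce_and_validate({"word_count": "42"}, {"word_count": int})
bad, bad_errors = coerce_and_validate({"word_count": "banana"}, {"word_count": int})
```

Here ok is {"word_count": 42} with no errors, while the "banana" case yields an empty clean dict and one error message that can be passed back to the agent or used to stop the phase.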

Your "deterministic harness + stochastic models" framing is spot on. That's the architecture. The validation between steps is what turns it from a demo into infrastructure. Still working on that part.

(Also, tell me more about Moonshift. Same architecture instinct?)

Harjot Singh • May 31

Yeah, exact same instinct. You basically described the spine of it. Moonshift is a system where AI agents take a product idea and actually build, deploy, and market a SaaS end to end; the framing is that they work the night shift while you sleep. The only reason that's not a horror story is the harness: a deterministic pipeline of phases (plan, scaffold, build, test, deploy, market) with stochastic agents inside each phase. The reliability comes almost entirely from what sits between the phases, not from the model.

Your AgentFlow defenses map nearly one-to-one onto what I run: bounded loops with max-iteration caps and escalate-on-exhaust, done-when conditions as quality gates that retry, and exactly the schema validation between phases you're building toward. Each phase has a typed contract its output must satisfy before the next phase consumes it, with automatic retry on validation failure. That last piece is the one that turned it from a demo into something that can run unattended, which sounds like the same conclusion you reached.

The other big one is gating the irreversible: anything that spends money or ships to prod passes a hard check, because an unattended agent can't be allowed an oops there.

Happy to compare notes on the per-phase schema validation; you're clearly attacking the same problem from the same direction. And if you ever want to point it at an idea, the first run is completely free, no cards, no strings attached.

Maurizio • Jun 1

This is great, "deterministic pipeline with stochastic agents inside" is exactly the framing I was fumbling toward but couldn't articulate that cleanly. Stealing it.

The validation layer you're describing is almost identical to what I landed on. In AgentFlow it's output_schema (JSON Schema inline) + validation.retry (auto-retry with the validation error fed back to the agent) + validation.on_fail (abort vs. default fill). What surprised me: even with a cheap model like Gemini Flash, the retry-with-feedback works most of the time. The agent gets the schema violation message and fixes it on the next pass. It's not magic, it just gives the model a second chance with a specific error instead of "try again pls."
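A minimal Python sketch of the retry-with-feedback idea, under stated assumptions: the agent is any callable that takes a prompt and returns a dict, the "schema" is a simplified field-to-type mapping instead of real JSON Schema, and run_with_retry, check, and max_retries are hypothetical names. Only the abort vs. default-fill choice for on_fail comes from the description above.

```python
from typing import Callable


def check(output: dict, required_types: dict) -> list:
    """Return human-readable violations (empty list means valid)."""
    errors = []
    for field, expected in required_types.items():
        if field not in output:
            errors.append(f"{field}: missing")
        elif type(output[field]) is not expected:
            errors.append(
                f"{field}: expected {expected.__name__}, got {output[field]!r}"
            )
    return errors


def run_with_retry(
    agent: Callable[[str], dict],
    prompt: str,
    required_types: dict,
    max_retries: int = 2,
    on_fail: str = "abort",
) -> dict:
    """Call the agent, feeding validation errors back on each retry."""
    message = prompt
    output: dict = {}
    errors: list = []
    for _ in range(max_retries + 1):
        output = agent(message)
        errors = check(output, required_types)
        if not errors:
            return output
        # Give the model the specific violation, not just "try again".
        message = (
            f"{prompt}\n\nYour previous output was invalid:\n" + "\n".join(errors)
        )
    if on_fail == "abort":
        raise ValueError("validation failed: " + "; ".join(errors))
    # on_fail == "default": replace missing or invalid fields with zero values.
    repaired = dict(output)
    for field, expected in required_types.items():
        if type(repaired.get(field)) is not expected:
            repaired[field] = expected()  # str() -> "", int() -> 0
    return repaired
```

The design choice worth noticing is that the retry prompt carries the exact violation text, which is what lets a cheap model correct itself on the next pass.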

The irreversibility gate is the thing I haven't built yet and you're right that it's the other half of the puzzle. Validation prevents silent corruption; gating prevents silent disaster. Right now AgentFlow trusts the workflow author to set on_fail: abort on critical phases, but that's convention, not enforcement. A declarative irreversible: true on phases that touch money/deploy would be the right move.
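Since this gate does not exist yet, the following is only a guess at its shape, written in Python. The Phase dataclass, the approve callback, and the GateRefused exception are all hypothetical; the single idea taken from the thread is a per-phase irreversible flag that the runtime enforces instead of leaving it to convention.

```python
from dataclasses import dataclass
from typing import Callable


class GateRefused(Exception):
    """Raised when an irreversible phase is not approved."""


@dataclass
class Phase:
    name: str
    run: Callable[[], str]
    irreversible: bool = False  # e.g. spends money or deploys to prod


def run_phases(phases: list, approve: Callable[[Phase], bool]) -> list:
    """Run phases in order; irreversible ones need explicit approval first."""
    results = []
    for phase in phases:
        if phase.irreversible and not approve(phase):
            # Stop before the side effect happens, not after.
            raise GateRefused(
                f"phase {phase.name!r} is irreversible and was not approved"
            )
        results.append(phase.run())
    return results
```

The approve callback could be a human prompt, a budget check, or a policy lookup. The point of the sketch is that the check runs inside the harness, so a workflow author cannot forget it on a flagged phase.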

What format are you using for the typed contracts between phases? I went with JSON Schema because it was the obvious choice, but I wonder if there's a better fit for agent-to-agent contract validation that I'm not seeing.

DEV Community
