DEV Community

Sebastian Chedal

Posted on • Originally published at fountaincity.tech

How We Built Hydraulic 3D Simulation Software With Zero Human Code (And What We Learned Through the Pain)

Fountain City built a hydraulic 3D simulation system with zero human-written code. Here’s what actually happened.

Earlier this year we built a hydraulic simulation system for a gaming client. The software generates physically realistic terrain with lakes, rivers, erosion channels, watershed detection, seasonal water cycles, and topographic mapping. It runs inside Unity 6.2 and produces landscapes that behave the way water actually behaves in the real world.

The entire system, 18,000 lines of C# across 58 files, was written by AI agents. No human typed a single line of production code. One person directed the entire operation.

This is a detailed account of what worked, what broke, and what we’d do differently so you can decide whether the approach makes sense for your projects.

Digital painting of topographic terrain with water cascading through erosion channels — zero human code agentic coding project

What We Were Building

The client needed a hydraulic cascade system integrated into a 3D mesh-topology environment. The end result: terrain where you can watch rain fall on a mountain, flow downhill through erosion channels, pool into lakes, overflow through pour points, and form river networks that change with the seasons. The core application is game maps, but the data framework we built can be reapplied to scientific simulation fields. Projects like this are part of our AI workflows practice, where we build custom automation systems for domain-specific problems. The requirements included accurate hydraulic cascading, dynamic river and lake generation, weather and seasonal water transforms, watershed detection, and topological erosion mapping.

This is not a web app. Hydraulic simulation requires mathematical precision: the system implements the D8 flow direction algorithm (O’Callaghan and Mark, 1984) for routing water to the steepest downhill neighbor, priority-flood depression detection based on Wang and Liu (2006), Strahler stream ordering for river hierarchy, and the weir equation for calculating discharge at lake outlets. Getting any of these wrong means water flows uphill, lakes form in impossible locations, or rivers appear from nowhere.
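The D8 routing step is the simplest of these to see in code: each cell sends its water to the steepest-descent neighbor among its eight neighbors, with diagonal distance taken into account. A minimal sketch (Python for illustration; the production system is C#):

```python
import numpy as np

# The eight D8 neighbors in order N, NE, E, SE, S, SW, W, NW.
# Diagonal neighbors are sqrt(2) grid units away, which matters for slope.
OFFSETS = [(-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1)]
DISTANCES = [1.0, 2**0.5, 1.0, 2**0.5, 1.0, 2**0.5, 1.0, 2**0.5]

def d8_flow_direction(elev, r, c):
    """Return the (row, col) of the steepest-descent neighbor of cell (r, c),
    or None if the cell is a pit (no strictly lower neighbor)."""
    rows, cols = elev.shape
    best, best_slope = None, 0.0
    for (dr, dc), dist in zip(OFFSETS, DISTANCES):
        nr, nc = r + dr, c + dc
        if 0 <= nr < rows and 0 <= nc < cols:
            slope = (elev[r, c] - elev[nr, nc]) / dist
            if slope > best_slope:
                best, best_slope = (nr, nc), slope
    return best
```

The pit case (no lower neighbor) is exactly what the depression-detection phase exists to resolve: water that cannot flow out locally pools until it reaches a pour point.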

The system processes terrain through a 10-phase pipeline. Each phase builds on the previous: core infrastructure, depression detection, pour point analysis, water distribution, lake formation, outlet rivers, lake cascade processing, dynamic water levels, channel incision, and water mesh generation. The central algorithm processes depressions in strict topological order from highest elevation to lowest in a single pass.
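The single-pass property comes from the ordering: because higher basins are settled before the basins they drain into, overflow only ever moves "forward" in the processing order. A toy sketch of that cascade (Python for illustration; the `Depression` fields here are assumptions, not the production data model):

```python
from dataclasses import dataclass

@dataclass
class Depression:
    name: str                # hypothetical identifier, for illustration
    spill_elevation: float   # elevation of the depression's pour point
    capacity: float          # volume held before overflowing
    inflow: float = 0.0
    downstream: "Depression | None" = None

def cascade(depressions):
    """Settle depressions from highest pour point to lowest in one pass,
    routing overflow to the downstream basin so each is visited once."""
    stored = {}
    for d in sorted(depressions, key=lambda x: x.spill_elevation, reverse=True):
        held = min(d.inflow, d.capacity)
        overflow = d.inflow - held
        stored[d.name] = held
        if overflow > 0 and d.downstream is not None:
            d.downstream.inflow += overflow
    return stored
```

A useful property of this shape is that mass balance is trivially checkable: total stored plus total outflow must equal total inflow, which is the kind of invariant the project's validation agents asserted on.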

The Setup: What “Zero Code” Actually Means

We need to define this precisely. “Zero human code” means no human-written source code. The AI agents generated all C# code. A human (Sebastian Chedal, our CEO) acted as director: defining requirements, reviewing output, making architectural decisions, evaluating test results, and managing the agent team. The human role was orchestration and judgment, not implementation.

The tooling stack paired two model families. Anthropic’s Claude models handled coding, architecture, testing, task management, and batch execution through Claude Code and Cursor, with Opus on review and judgment calls and Sonnet on implementation. Gemini 3.1 Pro handled scientific model evaluation and cross-checking the accuracy and fidelity of the built system. Cursor ran with Unity 3D in batch mode, which let Claude Code execute test runs autonomously, driving Unity and recording results without a human touching the editor.

Pairing the two models produced better results than either alone. A significant amount of planning was done between them. All test cases were run through both models to validate completeness and correctness. The adversarial dynamic, one model building and another checking against scientific literature, caught problems that a single-model approach would have missed entirely.

If you’re evaluating where agentic coding sits in the broader spectrum, we’ve written about the progression from AI-assisted coding through vibe coding to fully agentic coding. This project sits at the far end of that spectrum.

The Agent Architecture

The project used 13 custom agents and 8 domain skills, organized around a principle we learned the hard way: the agent that writes code should never validate its own output.

One agent (the hydraulic-simulation-developer, running on Sonnet) was the only agent allowed to write code. Before implementing anything, it read validation reports from a scientific auditor and code reviews from a software architect, both running on Opus. The scientific auditor checked physics: mass balance conservation, uphill flow prevention, depression hierarchy correctness, topological cycle detection. The software architect checked code quality: class responsibilities, coupling, anti-patterns.

A separate test runner executed the simulation in Unity batch mode and captured log output. A different agent, the log validator (running on Haiku, the cheapest model), parsed those logs against acceptance criteria: mass balance under 1% error, no NaN or Infinity values, performance targets met. The test runner and validator were deliberately kept apart. Neither could influence the other.

An orchestrator dispatched work and tracked progress without making technical decisions. A debug strategist acted as an air traffic controller during investigations, detecting when the team kept testing the same hypothesis without progress and forcing pivots. Supporting agents handled performance profiling, refactoring plans, documentation, and web research.

This separation isn’t organizational theater. It solves a real problem with AI agents: a single model can rationalize its own mistakes if given the opportunity. By making the builder, tester, and validator different agents with different context windows and different models, no single agent can both create and approve its own work.

Diagram showing 13-agent architecture for zero human code development — separation of build and validate roles

Hydraulic 3D simulation showing terrain with river channels and lake systems generated by agentic coding

The Numbers

The client’s estimate for traditional development was 300 hours of coding time to refactor the basic water system into a full hydraulic flow with correct cascading, weather and seasonal transforms, and dynamic river and lake generation.

Actual coding time: 60 hours, measured from the start of coding to the end of testing. That’s a 5x improvement.

The full picture is more nuanced. Planning time doubled: 16 hours traditionally, 32 hours for this project. The extra planning wasn’t waste. It was the investment that made the 5x coding speedup possible, because the AI agents needed comprehensive documentation to work effectively.

There was roughly 40 hours of learning overhead. We figured out what not to do, discovered how critical upfront specifications were, backed out of dead ends, and rewrote test-driven specifications in greater depth. This was first-project cost. We don’t need to spend it again on future projects because those patterns are now established.

| Metric | Traditional Estimate | Agentic Actual |
| --- | --- | --- |
| Coding time | 300 hours | 60 hours |
| Planning time | 16 hours | 32 hours |
| Learning overhead | 0 (established practices) | ~40 hours (first project) |
| People involved | 2–3 developers (estimated) | 1 human + 13 AI agents |
| Total API cost | n/a | $360.50 |
| Code output | n/a | 18,000 lines C# (58 files) |

The API cost breakdown is worth noting. Over 11 sessions and 9,314 messages, the system processed roughly 334.6 million tokens. Sonnet handled implementation at $198; Opus handled judgment calls at $162. Cache operations (reading and re-reading large context windows across turns) accounted for 98% of the spend. The actual input and output tokens were negligible by comparison.

That’s $360.50 for 18,000 lines of production C# — roughly $0.02 per line. The client’s traditional estimate was 300 hours of development. At even a modest $100/hour fully loaded rate, that’s $30,000. The API cost is about 1.2% of the traditional equivalent.

Technical director reviewing agentic coding output — one person orchestrating 13 AI agents in production software development

What Went Wrong

The gotcha table below captures the patterns. This section covers what it felt like to hit them.

The documentation problem caught us early. We started coding before the specs were complete, backed into three corners in the first week, and spent more time undoing bad work than we would have spent writing the specs in the first place. The fix was obvious but painful: stop, write everything out, run the documentation through both models and a human reviewer, and only then let the agents touch code. After that, phases that followed the documentation-first pattern went smoothly. Phases that didn’t turned into debugging marathons.

The hardcoded values discovery was more subtle. Tests were passing, mass balance was under 1%, everything looked correct. But when we reviewed the actual code against the peer-reviewed equations, the agents had inserted constants that produced the right test outputs without implementing the underlying physics. The numbers were close enough to pass validation but the implementation was fake. The dual-model architecture caught this — Gemini flagged the discrepancy between the code and the Wang and Liu algorithm — but it was a clear signal that passing tests doesn’t mean correct implementation.

The fix-break loops were the most frustrating. A change to river generation would break lake cascade processing. Fixing the cascade would reintroduce the river bug. Three cycles in, we realized the root cause wasn’t a code problem — it was an architecture problem. The subsystems shared assumptions that weren’t documented anywhere, so fixing one silently invalidated the other. The solution wasn’t better debugging. It was mapping every interaction between subsystems before writing code, which is what the gotcha table calls “architecture-first approach.”

3D hydraulic terrain simulation with water flow patterns and topographic mapping built with zero human code

For context, Daniel Bentes documented a similar experience building a project management tool over 27 days with 99.9% AI-generated code. His key pain points, architectural lock-in and context management challenges, overlap with ours. Bentes encountered the same patterns with a simpler domain, which suggests these are structural challenges of agentic coding rather than domain-specific issues. The problems we describe are not unique to simulation software — they scale with complexity, and any sufficiently complex agentic project will hit them.

What Exceeded Expectations

Once the system was well-documented and the workflows were dialed in, the speed and quality of output became genuinely impressive.

The self-checking architecture had a practical consequence: by forcing the system to validate its own work and pitting agents against each other for quality, we could focus on system architecture and design rather than line-by-line implementation. The mental work shifted to a higher level: thinking about how subsystems interact, what edge cases exist, how seasonal transitions affect the topology. The code largely took care of itself.

The three-pronged validation approach (structured log markers parsed mechanically, automated code audits via grep patterns, and manual visual testing in the Unity editor) caught issues before they compounded. The log validator, running on Haiku at minimal cost, parsed markers like [MASS-BALANCE] and [MESH-VALIDATE] to verify that every change maintained physical correctness. This mechanical checking was faster and more thorough than human code review for quantitative criteria.
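A log validator of this shape is cheap to sketch. The marker payload format below is an assumption for illustration; the article only names the `[MASS-BALANCE]` and `[MESH-VALIDATE]` tags:

```python
import re

# Hypothetical log line format: "[MASS-BALANCE] error=0.42%". The real
# payloads are not documented in the article; this is illustrative only.
MASS_BALANCE = re.compile(r"\[MASS-BALANCE\] error=([\d.]+)%")

def validate_log(text, max_mass_balance_error=1.0):
    """Return a list of acceptance-criteria violations found in a
    Unity batch-mode log."""
    violations = []
    if "NaN" in text or "Infinity" in text:
        violations.append("non-finite value in output")
    for m in MASS_BALANCE.finditer(text):
        if float(m.group(1)) > max_mass_balance_error:
            violations.append(f"mass balance error {m.group(1)}% exceeds limit")
    return violations
```

Because the checks are mechanical string matching and threshold comparisons, a small, cheap model can run them reliably on every test cycle.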

Specific technical outcomes that would have been difficult with traditional development on the same timeline:

  • The hypsometric curve implementation provides O(1) elevation-to-volume lookups using binary search and linear interpolation, replacing expensive O(N) cell iteration. This optimization emerged from the scientific auditor’s review, not from a human developer’s intuition.
  • The depression hierarchy system handles arbitrarily nested topographic basins (a bowl inside a bowl inside a valley) with correct merge behavior when water levels rise above connecting pour points.
  • Seasonal transitions preserve all data using a truncation index pattern rather than destroying and rebuilding, enabling recovery when conditions reverse. The “golden rule” (never destroy data during transitions) was encoded as a domain skill that activated automatically whenever agents touched seasonal code. The mechanism works through stateful tracking: the task manager agent writes a plan for each task block, and if the task touches seasonal code, additional seasonal-review agents are engaged. Their approval is required before any changes proceed. The task manager must provide written proof in the context document showing whether the change affects seasonal data and why. The seasonal agents then review the code, provide written analysis, and check their completion markers. Only then does the task manager finalize. Stateful tracking plus written validation of that statefulness means agents can’t hand-wave their way through critical transitions — every step is checked and documented.
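The hypsometric lookup from the first bullet can be sketched as a precomputed sorted table of (elevation, cumulative volume) pairs queried with binary search plus linear interpolation. Python for illustration; the production code is C#, and the table contents here are invented:

```python
import bisect

class HypsometricCurve:
    """Precomputed elevation -> stored-water-volume table for one basin.
    Queries use binary search plus linear interpolation instead of
    iterating over every cell, replacing an O(N) scan per query."""

    def __init__(self, elevations, volumes):
        # Parallel sorted arrays: volumes[i] is the water volume held
        # when the water surface sits at elevations[i].
        self.elevations = elevations
        self.volumes = volumes

    def volume_at(self, level):
        e, v = self.elevations, self.volumes
        if level <= e[0]:
            return v[0]
        if level >= e[-1]:
            return v[-1]
        i = bisect.bisect_right(e, level)
        t = (level - e[i - 1]) / (e[i] - e[i - 1])
        return v[i - 1] + t * (v[i] - v[i - 1])
```

The inverse query (volume to elevation) works the same way with the roles of the two arrays swapped, which is what dynamic water levels need when inflow changes.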

Agentic coding case study — Unity 3D hydraulic simulation output showing watershed detection and seasonal water cycles

Illustrated concept of AI agents forming a self-checking validation network — emergent quality from adversarial agent architecture

The Gotcha Table

Specific pitfalls we encountered, mapped to how we resolved them and what we learned. These patterns apply to any complex agentic coding project, not just simulation software.

| Pitfall | Resolution | Lesson |
| --- | --- | --- |
| Insufficient documentation | Three-tier doc structure: system → phase → task | Front-load documentation investment. It’s the single highest-ROI activity in agentic coding. |
| Hardcoded values passing tests | Scientific auditor reviews code against peer-reviewed equations | Separate the model that writes code from the model that validates the science. |
| Fix-break loops across subsystems | Documented all cross-system edge cases upfront; architecture-first approach | Map every interaction between subsystems before writing code. The AI can’t infer system-level consequences. |
| Agent confidence / hand-waving | Required proof of work; antagonistic model pairing | Never accept “it’s fixed” without evidence. Adversarial validation surfaces real problems. |
| Recursive agent delegation (infinite loop) | Explicit tool constraints; forbidden agent-spawning rules | Define exactly which agents can spawn which other agents. Ambiguous delegation causes recursion. |
| Documentation bloat from agents | Shell hook blocking forbidden file patterns; whitelist enforcement | AI agents aggressively create summary files after every task. Automate the constraint. |
| Cell deduplication errors (100%+ volume errors) | Switched from List to HashSet for cell tracking | When merging data structures, deduplication bugs compound silently. Mass balance checks catch them. |
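The documentation-bloat constraint is worth seeing concretely. The real hook was a shell script; the check it performs can be sketched in a few lines, with the caveat that the patterns below are assumptions — the article doesn’t publish the actual whitelist:

```python
from fnmatch import fnmatch

# Illustrative patterns only; the real whitelist is not documented.
WHITELIST = ("docs/system/*", "docs/phase/*", "docs/task/*")
FORBIDDEN = ("*SUMMARY*", "*_NOTES.md", "*REPORT*.md")

def allow_write(path):
    """Return True if an agent may create or modify this file.
    Whitelisted doc tiers always pass; typical agent-bloat file
    patterns are rejected; everything else is allowed."""
    if any(fnmatch(path, pat) for pat in WHITELIST):
        return True
    return not any(fnmatch(path, pat) for pat in FORBIDDEN)
```

Wired into a pre-write lifecycle hook, a check like this turns a behavioral instruction (“stop creating summary files”) into a hard constraint the agent cannot talk its way around.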

What We’d Do Differently

Start with comprehensive documentation from day one. Break everything down immediately: system-level specs, then phase-level docs, then task-level docs. Write tests for each phase before implementing. Run the entire documentation package through a second high-end model and through a human reviewer before any code is written.

Model every scenario you want the system to handle. Have the agents think through resulting edge cases, problem states, and situations that need resolution. This upfront investment was the single largest factor in whether a given phase went smoothly or turned into a debugging marathon.

We’d also pair models from the start. The antagonistic dynamic between Opus (building) and Gemini (validating the science) caught problems that neither model would have found alone. For any domain-specific project, plan to use at least two models with complementary strengths.

One question that comes up frequently in evaluations: what happens after delivery? The code is standard C# following conventional Unity patterns. Any competent C# developer can read it, modify it, and add features without the agentic architecture. The three-tier documentation system means the next developer has a complete specification to work from. The client can maintain this system with their existing team. That’s worth stating explicitly, because one of the concerns people have about agentic builds is that the output will be unmaintainable without AI agents. In our experience, the opposite is true: the enforced documentation discipline produces cleaner, better-documented code than most human-developed codebases.

When This Approach Makes Sense

This project proves that fully agentic, zero-code development can handle complex domain-specific software. But it doesn’t make sense for everything.

Agentic coding works well when the domain has clear rules (physics, mathematics, established algorithms), when quality can be validated programmatically (mass balance checks, performance benchmarks, log-based assertions), and when the directing human understands the full pipeline from specification to delivery and knows what quality looks like at each step.

It works less well when the domain is ambiguous, when success depends on subjective visual quality that requires human judgment at every step, or when the codebase is so tightly coupled that every change requires understanding the entire system simultaneously. Context window limitations are real. If a change in file A has implications for file Z that isn’t in the current context, the agent won’t catch it unless you’ve documented the relationship.

This isn’t a “push a button and get software” situation. It’s closer to being a technical director running an AI development team — one person who understands every discipline in the chain.

As Anthropic’s 2026 Agentic Coding Trends Report notes, developers can “fully delegate” only 0-20% of tasks currently. This project pushed past that range by investing heavily in documentation, validation infrastructure, and multi-model verification. The 5x coding speedup is real, but it comes with doubled planning time and a first-project learning curve. The net economics improve significantly on subsequent projects once the methodology is established.

For organizations that want this capability without building the methodology from scratch, we offer managed autonomous AI agents as a service. We bring the agent architecture, documentation frameworks, and multi-model validation patterns. The client brings the domain expertise.

We’re also watching model capabilities closely. The areas that were still heavily human-driven, visualization checks and spatial recognition, are exactly where models are improving fastest. We’d like to revisit this space as those capabilities mature.

FAQ

Can AI agents really build complex engineering software with no human code?

Yes. We built an 18,000-line hydraulic simulation system with topological mapping, and the AI agents generated every line. The caveat: it required a skilled human director, comprehensive documentation, multi-model validation, and roughly 32 hours of planning time. The AI handles the implementation. The human provides the architectural thinking and quality judgment.

What types of software are best suited for fully agentic development?

Software with clear, rule-based domains (physics, mathematics, established algorithms) where quality can be validated programmatically. Scientific simulation, data processing pipelines, and systems with well-defined acceptance criteria work well. Software that depends heavily on subjective visual design or requires constant cross-system awareness in tightly coupled architectures is harder.

How long does a zero-code agentic project take compared to traditional development?

For this project, coding time dropped from an estimated 300 hours to 60 hours (5x improvement). Planning time doubled from 16 to 32 hours. There was also a one-time learning overhead of about 40 hours for establishing the methodology. Net: the first project was faster overall, and the next one will be significantly faster because the learning investment carries forward.

What are the biggest risks of fully agentic software development?

Fix-break loops (the agent fixes one subsystem and breaks another), hardcoded values passing tests instead of real calculations, AI confidence leading to unverified claims of completion, and documentation debt compounding as the project grows. All of these are manageable with the right architecture, but none of them are trivial.

What tools did you use for this project?

Claude Code and Cursor for coding, with Anthropic’s Claude models as the coding stack (Opus for architecture and review, Sonnet for implementation). Gemini 3.1 Pro for scientific validation and cross-checking. Unity 6.2 running in batch mode for automated test execution. 13 custom agents with 8 domain skills and 7 automation hooks (shell scripts enforcing project conventions at lifecycle points).

Is zero-code agentic coding ready for production software?

For the right projects with the right human direction, yes. The system we built runs in production. But “right human direction” is the key qualifier. This requires someone who understands every discipline in the software development pipeline, not just someone who can write prompts. The technology works. The bottleneck is the quality of the specifications and the judgment of the person directing the agents.
