✓ Human-authored analysis; AI used for formatting and proofreading.
There's a wave of tooling that makes AI coding agents more reliable: Agent Skills, Superpowers, Get Shit Done, spec-driven development kits. They work. The agent stops flailing and starts following a process. But there's a confusion in how people talk about them, and it costs money: the belief that a better per-task setup adds up to a coherent system. It doesn't. Dave Snowden's Cynefin framework gives us the vocabulary to see why.
We will distinguish between visibility (what the AI can see) and emergence (what the system does at runtime). Even an infinite context window cannot deduce a race condition that only triggers under a specific production load. That remains a property of the Complex domain.
The distinction between building in a known category (a CMS, say) and building a new one explains why some teams find AI tools miraculous while others find them frustrating to use. As a system grows, the number of possible interactions grows exponentially, eventually exceeding what any context window or human can analyze, at which point the only option left is to probe.
Two domains, one codebase
Cynefin sorts problems by the nature of cause and effect.
Complicated problems have a knowable right answer. Finding it takes skill or analysis. It's deducible from what's in front of you. Once found, it's repeatable. Write a function that validates this schema. Refactor this module to use the repository pattern. These have correct answers an expert (or a well-driven agent) can produce. The method is analyze, then execute.
Complex problems don't have a deducible answer in advance. The system is a web of interacting parts, and outcomes are only coherent in retrospect. Will these forty merged changes stay consistent over six months? Does the architecture still hang together after this quarter's features? Did agent A's reasonable local choice silently contradict agent B's? You can't deduce these from any single artifact. They emerge from interactions. The method is probe, sense, respond. Try things, observe, adjust.
Writing the code is mostly Complicated. Building and evolving the system is Complex. Almost all the new AI-dev tooling lives, correctly, in the Complicated domain. That is why it's a mistake to expect it to solve the Complex one.
Complicated domain work vs making a system coherent
Agent Skills (the open SKILL.md format) package procedural knowledge and context into portable, version-controlled folders that an agent loads on demand, through progressive disclosure. Discover, activate, execute. A skill is captured expertise for a bounded task: how to format a presentation, how to run a specific analysis pipeline. That is a Complicated-domain instrument in its purest form. It encodes the knowable right way to do a defined thing, and makes it repeatable.
Superpowers (Jesse Vincent's framework) is a development methodology for coding agents, built on composable skills. It forces a disciplined sequence. Brainstorm the design, tease out a spec, write a detailed plan, then Red-Green-Refactor test-driven development with code-review gates between tasks. It even deletes code written before its test. This is process applied to the unit of work: it makes each task the agent performs more rigorous and less ad-hoc. Again, Complicated — it's about doing the knowable thing correctly and verifiably.
Get Shit Done is a meta-prompting, context-engineering, and spec-driven system in the same family: pin down the spec, engineer the context, execute against it. Spec-driven development generally is the clearest case. A specification is the Complicated-domain tool. It nails down the deducible answer so the agent can hit it.
Skills capture the right way to do a task. Superpowers enforces a rigorous process for a task. Spec kits pin down what a task must produce. All three make the unit — the task, the function, the bounded change — reliable, repeatable, and closer to deducible. That is valuable, and it's Complicated-domain work. It is not the same thing as making the system coherent.
Why "more reliable units" doesn't equal "coherent system"
Complexity is a property of interactions. You can make every individual task the agent performs flawless (perfect spec adherence, perfect TDD, perfect skill execution) and still end up with an incoherent system, because the incoherence lives in how the pieces combine.
A perfectly-executed task that names a field userId and another perfectly-executed task that expects user_id are each correct in isolation and broken together. A retry policy and a timeout policy can each pass every test and still combine to cause a system overload. Two refactors of a shared utility, each rigorous, each spec-compliant, can overwrite each other. Per-task discipline does not catch these, because each task was done right. The failure is emergent — Complex, not Complicated.
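The field-name clash can be shown in a few lines. This is a minimal sketch with hypothetical names (`build_event`, `summarize_event`, and the unit checks are invented for illustration): each task satisfies its own check, and only combining them surfaces the mismatch.

```python
# Hypothetical sketch: two "tasks", each correct against its own check,
# that disagree about a field name at the seam between them.

def build_event(user_id: int, action: str) -> dict:
    """Task A: emit an event. Its (assumed) spec asks for camelCase keys."""
    return {"userId": user_id, "action": action}


def summarize_event(event: dict) -> str:
    """Task B: summarize an event. Its (assumed) spec asks for snake_case keys."""
    return f"user {event['user_id']} did {event['action']}"


def task_a_unit_check() -> bool:
    # Task A's own check only looks at its camelCase contract.
    return build_event(7, "login") == {"userId": 7, "action": "login"}


def task_b_unit_check() -> bool:
    # Task B's own check uses a hand-written snake_case fixture.
    return summarize_event({"user_id": 7, "action": "login"}) == "user 7 did login"


def integrate() -> str:
    """The probe: actually feed Task A's output into Task B."""
    try:
        return summarize_event(build_event(7, "login"))
    except KeyError as exc:
        return f"integration failed: missing key {exc}"


if __name__ == "__main__":
    print(task_a_unit_check(), task_b_unit_check())  # both checks pass
    print(integrate())                               # the combination does not
```

Neither unit check could have caught this, because each fixture encodes the same assumption as the code it exercises.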
This is why a better SKILL.md or a stricter TDD gate, however much it improves the unit, cannot by itself produce system coherence. It's solving a Complicated problem well and leaving the Complex one untouched. The two require different methods: deduction and process for the unit, probe-sense-respond for the system.
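The retry-and-timeout combination mentioned earlier in this section can be put in rough numbers. The sketch below is a deliberately crude toy model, not a queueing analysis: it assumes a client retries whenever a call exceeds its timeout and that the server keeps working on abandoned requests. All function names and figures are illustrative.

```python
# Toy model (an assumption for illustration): every attempt that exceeds the
# client timeout is retried, and the server still does the abandoned work.

def attempts_per_request(latency_s: float, timeout_s: float, max_retries: int) -> int:
    """Attempts the client sends for one logical request."""
    if latency_s <= timeout_s:
        return 1                # first attempt answers in time, no retry
    return 1 + max_retries      # every attempt times out, all retries fire


def offered_load(base_rps: float, latency_s: float, timeout_s: float, max_retries: int) -> float:
    """Requests per second that actually reach the server."""
    return base_rps * attempts_per_request(latency_s, timeout_s, max_retries)


def stacked_attempts(max_retries: int, layers: int) -> int:
    """Worst case when every layer of a call chain retries independently."""
    return (1 + max_retries) ** layers


if __name__ == "__main__":
    # Healthy: 0.2 s latency, well under a 1 s timeout.
    print(offered_load(100, 0.2, 1.0, 3))
    # Slow: 1.5 s latency against the same timeout and retry policy.
    print(offered_load(100, 1.5, 1.0, 3))
    # Three layers, each retrying three times.
    print(stacked_attempts(3, 3))
```

In this model a retry count of 3 and a timeout of 1 s are each reasonable in isolation. The fourfold jump in load appears only once latency crosses the timeout, which is exactly when the server can least afford it.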
It's easy to say "just write integration tests." In the Complex domain you cannot know in advance what will break. If you could, you'd write a targeted test and it would be Complicated. So the real probe isn't a test you wrote ahead of time. It's the act of integration itself: merging the changes, deploying the assembled system, putting it under real load. The sense comes back as the CI signal, the failure nobody predicted, the production telemetry. The response is adjusting to what you observed. Integration and property-based tests still earn their place, but as instrumentation that sharpens the sensing and widens what you'll notice when you probe, not as a substitute for combining the pieces and watching what the combination does. A test you wrote presumes you already knew which interaction to check. The contradictions that hurt are the ones you didn't know to look for.
The seductive overclaim
The trap is "adopt this spec/skill/methodology and your AI development is handled." Cynefin says no.
A spec does not move a Complex problem into the Complicated domain. It makes the unit Complicated: bounded, deducible, checkable. But the system's emergent behavior stays Complex no matter how good your specs are, because you cannot enumerate in advance all the ways independently correct pieces will interact. Specs and skills reduce the surface of complexity (fewer unconstrained units, fewer degrees of freedom for things to go wrong), the same way good fences reduce the ways a crowd can go wrong. That is not the same as abolishing complexity, and a team that believes "we have specs, therefore the system is under control" has made a domain error that will surface as an integration failure nobody's task was responsible for.
These tools can claim: we make each unit of AI-generated work more reliable, repeatable, and verifiable. That's true, it's valuable, and the Complicated domain rewards it. The inflated claim — we make your system correct — reaches across a domain boundary the tooling doesn't cross.
A related hope deserves the same scrutiny: that ever-larger context windows will dissolve the problem by letting an agent see the whole system at once. They erode one slice of it. Some incoherence exists only because an agent working in a narrow window couldn't see a related part of the codebase — the userId versus user_id clash is partly that. Widen the window enough to hold both and that visibility-limited class of contradiction becomes catchable. It moves from "emergent because nobody could see it" toward "deducible because now someone can." But this conflates two different things if you push it further: seeing more of the system is not the same as being able to deduce how it behaves. You can fit an entire distributed system's code in context and still not deduce, by reading it, whether it deadlocks under production load. That's an emergent runtime property, not a static one analysis reveals. Bigger context shrinks the see-it-coming kind of complexity. It leaves the emergent kind untouched. The moment the system outgrows the active context of the task at hand, even the visibility slice returns.
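The gap between seeing code and knowing how it behaves can be illustrated with a lock-ordering toy. This is a hypothetical sketch: the lock names are invented, a short `acquire` timeout stands in for a real hang, and barriers force the interleaving that production load would produce only sometimes. In a toy this small the hazard is readable. In a real system the two acquisition orders live in different modules or services, and whether they ever overlap depends on runtime timing.

```python
import threading

# Hypothetical sketch: two operations take the same two locks in opposite
# order. The code is identical in both runs; only the interleaving differs.

lock_accounts = threading.Lock()
lock_audit = threading.Lock()


def run_sequentially() -> list:
    """Low load: the operations never overlap, so both get their second lock."""
    results = []
    for first, second in ((lock_accounts, lock_audit), (lock_audit, lock_accounts)):
        with first:
            got = second.acquire(timeout=0.1)
            results.append(got)
            if got:
                second.release()
    return results


def run_overlapping() -> list:
    """High load, forced with barriers: each holds one lock and wants the other."""
    results = [None, None]
    both_hold_first = threading.Barrier(2, timeout=5)
    both_tried_second = threading.Barrier(2, timeout=5)

    def worker(index, first, second):
        with first:
            both_hold_first.wait()             # ensure both first locks are held
            got = second.acquire(timeout=0.1)  # the timeout stands in for a hang
            results[index] = got
            both_tried_second.wait()           # keep holding until both have tried
            if got:
                second.release()

    threads = [
        threading.Thread(target=worker, args=(0, lock_accounts, lock_audit)),
        threading.Thread(target=worker, args=(1, lock_audit, lock_accounts)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_sequentially())   # both acquisitions succeed
    print(run_overlapping())    # neither succeeds
```

Without the timeouts, the overlapping run would simply hang. Nothing in the source text changes between the two runs. What changes is when things happen.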
Where a product enters the map — and why experience can mislead
The tooling conversation usually skips a prior question: not every product begins in the same domain. Where you enter the Cynefin map is set by how known your category already is in the world, not just to you. A CMS or an e-commerce platform starts mostly Complicated, sometimes Clear. The problem is solved, with decades of reference architectures, competitors to study, and settled user expectations. Your unknowns are local (your stack, your niche), not categorical. There, spec-driven methodology and skills frameworks fit from day one, because the right answer is deducible from prior art and pinning it down early is appropriate. A new category starts Complex, sometimes bordering Chaotic. Nobody knows the product or the solution's shape, because the category doesn't yet exist to copy. Writing a confident spec on day one is then the same domain error in reverse: specifying something you haven't learned enough to specify, hardening assumptions that haven't survived contact with reality.
In that Complex start, the method is probe-sense-respond expressed as prototypes and experiments. A specific danger lives here for technical founders. The feeling that you're only making progress when you're writing code is a Complicated-domain instinct: it assumes there's a known thing to build and the job is building it. In a new category the job is learning, and the artifact of learning is often a prototype you throw away once it has told you what you needed to know. The uncertainty was the point. A founder measuring progress by lines committed, in a domain whose real risk is "does this category exist," will feel productive while that risk goes untouched. Productive in the wrong domain is the most expensive kind of busy. This is the code-is-a-liability point: in the Complex domain, kept code can be negative progress, because it commits you to assumptions you should still be free to discard.
There's a trap underneath: experience can make this worse. Deep expertise is mastery of best and good practices — fast pattern-recognition of known solutions. The more seasoned the engineer, the more reflexively they reach for those patterns. In the Complicated domain that reflex is a superpower; it's most of what "senior" means. In the Complex domain it's a liability, because it answers "what's the known right way here?" when the honest answer is "there isn't one yet — we have to probe." The very fluency that makes an expert excellent at building a known category can blind them in a new one: they build the wrong thing, while the situation called for stopping to ask whether the thing should exist at all. The discipline a new-category build demands is almost the inverse of the one that makes you good at a known one.
Throwing away code is hard, because skill and effort make the artifact feel valuable independent of what it taught you. That sunk-cost pull converts a probe into premature product. The discipline is to treat early code as a probe, keep the learning, and discard the artifact. Even so, the migration from Complex toward Complicated is partial, never total. Prototyping retires the uncertainty you can retire. It doesn't touch the irreducibly emergent kind (runtime behavior that only shows up in the assembled, running system), and every new feature re-injects Complexity into a codebase that had settled. You don't prototype your way to a fully deducible system. You earn the right to use the Complicated-domain tooling for the parts that have migrated, and you keep probe-sense-respond alive for the parts that never will.
The harder deletion: code that works but dilutes
Discarding a probe is the easier of two deletion disciplines, because the code was disposable from the start. Its job was to teach you something. There's a second discipline that is rarer and much harder: deleting working, committed, good code because it dilutes the essence of the product.
This is a different act. The code isn't wrong. It passes its tests, it's well-structured, someone built it with care. The question is whether it sharpens or blurs the essence of the product. "It works" is not the same as "it belongs." Once you can name your product's essence, everything that isn't that essence becomes dilution, no matter how good it is in isolation.
The code-as-asset instinct fights hardest here, because here it looks most correct. A throwaway prototype is visibly disposable. Quality working code reads as an asset by every signal a developer is trained on — tests green, structure clean, function delivered. So deleting it feels like destroying value. Every feature looks individually defensible and the product accretes into blur, each addition justified, the whole slowly losing its center. The naming is the prerequisite discipline. Dilution is invisible without it.
TRIZ has a name for this: trimming. Remove a component and require the system to deliver its function with fewer parts, and ideality rises. Deleting dilution is trimming applied to product essence: the product delivers its essential function with less, and becomes stronger by getting smaller. That matters because it lets the deletion register as progress rather than loss. The earlier point was that added lines aren't progress; the sharper corollary is that removed lines can be, because the product is more itself without them. That is almost impossible to feel as progress in the moment, since the value created is clarity of essence, which does not show up on any metric.
This discipline is only sound after the essence is named and validated. Before that, "it dilutes the essence" becomes a license to rip out anything you've cooled on, or — worse — to mistake a still-Complex, not-yet-understood part for dilution and delete something you needed to keep probing. Name the essence, then subtract against it. Deleting before you've earned the naming isn't discipline.
Match the method to the domain
- Complicated layer (the unit): Skills, Superpowers, GSD, spec kits. Use them. They make the agent's individual output rigorous and repeatable. AI-assisted development shines here.
- Complex layer (the system): the act of integrating and running the assembled system, with enough instrumentation — integration and property-based tests at the seams, observability, architectural review, a human watching for emergent contradiction — that you notice what the combination does. This is probe-sense-respond: you learn what breaks by combining and exercising the pieces, not by analyzing any one of them harder. No per-task tool replaces it.
The mistake is expecting the per-task tools to cover the Complex domain: buying a better way to write each function and assuming the system will therefore cohere. It won't, because system coherence was never a Complicated-domain problem to begin with.
Match the method to the domain. Use the spec kits and skills frameworks for what they're superb at, making the unit reliable, and put separate, different effort into the Complex layer where coherence lives. The teams that conflate the two will keep being surprised by integration failures in which every individual task did everything right.
Cynefin is Dave Snowden's framework; I'm applying it, and any rough edges in the mapping are mine. Tools described from their own sources (mid-2026): Agent Skills, Superpowers, Get Shit Done. If you think one of these does cross into the Complex domain — addresses emergent system coherence, not just per-unit rigor — that's the disagreement worth having.