DEV Community: CPDForge

Don't Ask AI to Stop Guessing. Design a System Where It Doesn't Need To.

CPDForge — Tue, 30 Jun 2026 17:18:02 +0000

Designing out the guesswork.

Part of a short series (1 of 3) on engineering lessons from building governed, AI-assisted production systems. Each piece takes one real failure and the architectural idea it forced. The examples are ours; the principle is meant to be transferable.

We removed a capability that genuinely existed — from a tool whose entire job was to represent capabilities fairly. Not because the model hallucinated. Because our architecture made guessing the rational thing to do.

It reasoned correctly from the inputs it was given. The inputs were the problem — and so was our first instinct about how to fix them. This is a write-up of what failed, why the obvious fix didn't work, and the pattern we ended up with. None of it is about prompting.

The failure

We run a tool that compares options for a buyer and recommends the best fit. It's supposed to be neutral. One of the inputs to that tool is a set of capabilities — what each option can actually do.

An automated agent, doing maintenance work, updated that capability set. It removed one of our own capabilities, on the grounds that it had been discontinued. It cited its source: a handover note from an earlier work session that said, in passing, that the capability had been "parked."

The capability had not been parked. It was live, published, and in active use. But for the duration of that change, a tool that was supposed to be neutral became quietly unfair — it stopped representing something that genuinely existed.

The agent was not careless. If you read the note it was working from, you would have drawn the same conclusion. The note was wrong, and nothing in the workflow forced a check against the thing that was actually true.

Why the obvious fix doesn't work

The obvious fix is the one everyone reaches for: tell the model to be more careful. Add instructions. "Verify before asserting." "Do not rely on summaries." "Check the source of truth." Strengthen the prompt.

We tried versions of this. It moves the failure rate; it does not remove the failure. And once you sit with why, that becomes obvious too.

An LLM reasons over the context it's handed. If the nearest, most fluent description of reality is a summary, the model will use the summary — not because it's lazy, but because the summary is right there and reads authoritatively. "Be careful" is an instruction to expend extra effort against an unspecified target. It competes with every other instruction in the context, and it degrades exactly when you most need it: under long context, time pressure, or a confidently-worded but stale narrative.

The deeper issue is that we were treating a systems problem as a behaviour problem. We had two descriptions of reality — an authoritative one (the live system of record) and a convenient one (prose) — and we left it to the model's judgement to pick the right one every single time. That's not a judgement we should have delegated. The model didn't have a guessing problem. We had given it a reason to guess.

What we changed

We stopped trying to make the model choose correctly between sources, and instead made the authoritative source the only path to a fact.

Two ideas did the work.

First: rank the sources explicitly, and let the ranking — not the model — resolve conflicts.

authoritative = resolve(
    production,   # system of record — authoritative
    derived,      # computed from production
    override,     # a cited fact, only where production is silent
    narrative,    # summaries, handovers — NEVER authoritative
)

# A lower tier is consulted only when every higher tier is silent.
# Production truth can never silently fall through to a lower one.
assert not (production.has_answer and authoritative.source != PRODUCTION)

The point of the assert is not defensive coding. It's a statement of intent: when production has an answer, nothing below it gets a vote. Prose can inform where production is silent, but it can never override — or quietly stand in for — a fact the system of record already holds. And — this is the part that bit us — absence has to be proven from the authoritative source, not inferred from a summary that failed to mention it.
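A minimal runnable sketch of that ranking, for readers who want something to execute. The `Tier` and `Claim` names, and the choice to return `None` rather than ever promote a narrative claim, are assumptions made for illustration; they are not a description of any particular production resolver.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Tier(IntEnum):
    # Lower value = higher authority, mirroring the order in the text.
    PRODUCTION = 0
    DERIVED = 1
    OVERRIDE = 2
    NARRATIVE = 3


@dataclass(frozen=True)
class Claim:
    tier: Tier
    value: Optional[bool]  # None means "this tier is silent"


def resolve(*claims: Claim) -> Optional[Claim]:
    """Return the highest-ranked claim that actually has an answer."""
    for claim in sorted(claims, key=lambda c: c.tier):
        if claim.value is None:
            continue  # silent tier: fall through to the next one
        if claim.tier is Tier.NARRATIVE:
            # Assumption of this sketch: prose can be shown to a human
            # as a hint, but it is never returned as the answer.
            return None
        return claim
    return None


production = Claim(Tier.PRODUCTION, True)   # the capability is live
narrative = Claim(Tier.NARRATIVE, False)    # the handover note says "parked"
winner = resolve(narrative, production)
assert winner is not None and winner.tier is Tier.PRODUCTION
```

The argument order does not matter: precedence comes from the tier, so a confidently worded note cannot win by being passed first.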

Second: derive facts, don't assert them.

The capability set is no longer something a human or an agent edits by hand. It is computed from the live system of record at build time.

capabilities = derive_from(system_of_record.published_items())
# Absence of a capability is established by its absence in system_of_record,
# never by its absence in a document.

Once capabilities are derived, the class of bug we hit becomes structurally impossible. You cannot remove a live capability by editing a note, because notes are no longer in the path. The system self-corrects whenever the system of record changes. Nobody has to remember to keep the description in sync, because there is no separate description to keep in sync.
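As a rough illustration, derivation can be as small as a filter over the published records. The `Item` shape and the capability names here are hypothetical stand-ins for whatever the system of record really returns.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class Item:
    capability: str
    published: bool


def derive_capabilities(items: Iterable[Item]) -> FrozenSet[str]:
    """Compute the capability set from published records only."""
    return frozenset(i.capability for i in items if i.published)


# Hypothetical records standing in for system_of_record.published_items().
records = [
    Item("bulk_export", True),
    Item("sso", True),
    Item("legacy_api", False),
]
capabilities = derive_capabilities(records)
```

Nothing in this path reads a note or a summary, so a wrong note has no way to change the result.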

What we deliberately did NOT automate

This is the part I'd most want a sceptical reader to notice, because it's where restraint mattered more than cleverness.

We automated the source of truth. We did not automate the decision.

When the derived capability set and someone's expectation disagree, the system does not silently "fix" anything. It surfaces the discrepancy and stops. A human decides whether the difference is a genuine change, a mistake, or an intended exception. We never gave the pipeline the authority to assert a new fact about the world — only to derive facts from a system that already holds them, and to flag when something looks off.

The temptation, once you've built a resolver, is to let it auto-resolve everything. We didn't, because "what exists" is a fact (derivable) but "what should exist" is a decision (not). Collapsing those two is how you build a system that is confidently, automatically wrong.

There's a clean line underneath all of this:

Facts come from systems. Decisions come from people. A pipeline may derive facts and flag conflicts; it may never decide what is true.

The behaviour / witness boundary

We still rely on behaviour — the agent is expected to reason from the authoritative source. But we no longer trust behaviour to be the only line of defence. There is one small machine check that runs in the pipeline: it recomputes the capability set from the system of record and fails the build if what we're about to ship has drifted from what the source actually exposes.

That check doesn't make the model behave. It makes the invariant observable. If behaviour silently regresses, the witness fails loudly before anything reaches a user. Behaviour governs; the check is evidence that the invariant still holds. We were careful not to confuse the two — a passing check is not proof of good judgement, only proof that one specific, decidable property is intact.
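A sketch of what such a witness might look like, assuming both sides can be reduced to sets of capability names. The function names and the exit-code convention are illustrative assumptions.

```python
import sys
from typing import Dict, Iterable, List


def drift(shipped: Iterable[str], source: Iterable[str]) -> Dict[str, List[str]]:
    """Compare what is about to ship with what the source exposes."""
    shipped_set, source_set = set(shipped), set(source)
    return {
        "missing": sorted(source_set - shipped_set),     # live but not shipped
        "unexpected": sorted(shipped_set - source_set),  # shipped but not live
    }


def witness(shipped: Iterable[str], source: Iterable[str]) -> int:
    """Return a process exit code: 0 when aligned, 1 on any drift."""
    report = drift(shipped, source)
    if report["missing"] or report["unexpected"]:
        print(f"capability drift detected: {report}", file=sys.stderr)
        return 1
    return 0
```

In a pipeline this would typically end with `sys.exit(witness(shipped, source))`, so drift stops the build instead of being logged and ignored.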

Lessons for other engineering teams

Most "AI reliability" problems are ambiguity-surface problems. Before you reach for a better prompt, ask whether the model even had an unambiguous source to reason from. If two descriptions of reality are in scope, you've already lost — you're just waiting to find out when.
Make the correct path the only easy path. "Be careful" is a tax on every future inference. Removing the alternative source is a one-time structural change. Prefer structure over diligence; diligence doesn't scale and doesn't survive context pressure.
Derive, don't describe. Any fact that exists authoritatively somewhere should be computed from that place, not transcribed. Every transcription is a copy that will eventually disagree with the original.
Rank your sources before you need to. The conflict between a system of record and a convenient summary is not exotic — it's the default condition of any system with documentation. Decide the precedence in advance, in code, so no human or model has to adjudicate it in the moment.
This generalises well beyond LLMs. Replace "the agent" with "a new engineer" or "a cron job." The same fix — single authoritative source, derived not described, conflicts surfaced not resolved — removes the same class of error. The LLM just made the latent design flaw fail faster and more visibly.

The reframe that mattered for us was small but total. We had been asking, "how do we get the model to stop guessing?" The better question was, "why is the model in a position where guessing is reasonable?" Once we removed the reason, the behaviour took care of itself.

The Engineering Principle

An agent guesses when its inputs leave room for guessing. Don't instruct it to stop — remove the ambiguity. Derive facts from the system of record, rank every source so prose can never outrank truth, and surface conflicts instead of resolving them automatically. Don't ask AI to stop guessing. Design a system where it doesn't need to.

The ISO 42001 Course That Refused To Pass

CPDForge — Wed, 24 Jun 2026 01:18:41 +0000

A technical post-mortem about false confidence, audit blind spots, and the difference between detection and reality. There is some AI in here, but this is not an AI story. It's an observability story wearing a compliance hat.

The bug report that wasn't a bug

It started the way these things always start: with a result that didn't fit the story we were telling ourselves.

We run a platform that generates structured compliance-training programmes — courses, assessments, the whole apparatus an awarding body needs to stand behind a credential. One of those programmes was an ISO 42001 Lead Auditor course (ISO 42001 is the management-system standard for AI). It kept failing our internal quality checks on a criterion we call standard-family consistency: roughly, "does this material talk about the standard it claims to teach, and not quietly drift into a different one?"

Fine. A failing QA check is a Tuesday. We did the obvious things:

Re-ran the generator.
Tightened the prompt.
Added explicit "do not reference other ISO standards as governing" directives.
Re-ran the audit.
Watched it fail again.
Questioned our life choices.
Repeated.

The programme kept failing. Not dramatically — it wasn't producing nonsense. It would subtly frame an assignment around environmental management controls (ISO 14001) inside a programme that was supposed to be about AI management (ISO 42001). A reviewer described one draft as "GreenTech Energy doing AI governance," which is funny until you realise a learner was about to be assessed against the wrong standard.

We spent an embarrassing amount of time convinced the programme was uniquely broken. The breakthrough was a question we should have asked on day one:

We stopped asking "Why is this programme failing?" and started asking "Why is everything else passing?"

That reframing is the whole article. Everything below is what happened when we took it seriously.

The Architecture Map

When you suspect one thing is broken, you debug that thing. When you suspect your definition of broken is wrong, you have to map the system. So we drew the actual generation topology instead of the one in our heads.

We expected maybe six or seven "generators." We found 22 distinct generation pathways, each capable of producing learner-facing content, each with its own prompt assembly, its own context, and — critically — its own relationship (or non-relationship) with our governance rules:

 1  Blueprint              8  Applied exercises     15  Rubric generation
 2  Outline / refine       9  Capstone              16  Smart hotspots
 3  Course generation     10  Module exams          17  Cover image
 4  Lesson generation     11  Final exam            18  Scenario generation
 5  Lesson regeneration   12  Course final assess.  19  Competency mapping
 6  Lesson quizzes        13  Assignment brief      20  Family FAQ
 7  Diagnostics           14  Assignment regen      21  Assessment explainer
                                                    22  Exam-bank regen

Nobody designed 22 pathways. That's the point. They accreted. Version 1 had lessons and quizzes. Then someone added diagnostics. Then capstones, because a credential needs a summative assessment. Then assignment briefs and rubrics, because awarding bodies wanted graded work. Then regeneration endpoints, because owners wanted a "redo this bit" button. Then a second regeneration path, because the first one didn't fit a new workflow.

Each addition was reasonable. Each was shipped with tests. And each was wired to the governance rules that existed on the day it was written — which is to say, drift didn't arrive as a catastrophe. It arrived as eighteen reasonable Tuesdays.

We colour-coded the map by governance coverage:

🟢 Green (8): generation is governed, output is audited, defects are remediable.
🟠 Amber (9): generation is roughly governed, but nothing audits the result.
🔴 Red (5): completely orphaned (applied exercises, assignment brief, both regeneration paths, rubric generation). No shared governance context in, no audit looking at the output.

Here's the part that made the ISO 42001 mystery dissolve. The standard-family auditor only inspected the green pathways: lessons, quizzes, exams, diagnostics, capstone. The defect lived in the assignment brief — a red pathway. The auditor was strict and blind at the same time: strict enough to flag the leak where it could see, blind to the larger surface where the leak actually originated.

The ISO 42001 programme wasn't uniquely broken. It was uniquely measured — it happened to be grounded strictly enough that the contamination surfaced somewhere the auditor could see, exposing a structural weakness every other programme also had but nobody was looking for.

The Confidence Problem: PASS ≠ Clean

Let me state the thing that should be tattooed on every dashboard:

A system can only detect what it measures. A PASS only means "no defect found in the surfaces we inspected." It says nothing about the surfaces we don't.

Our portfolio view proudly showed green. Most programmes "passed." But "pass" was computed from a QA function (qa.passed) that only consulted the audited pathways — roughly 9 of 22. The other 13, including every red one, contributed exactly nothing to the verdict. So our PASS was, with full technical honesty:

PASS  ≡  "no blocker found in ~40% of the generation surface"

That's not a quality signal. That's a sampling artefact wearing a green checkmark. The most dangerous kind of green: the kind that's technically true and practically meaningless.

If you've done observability work, this is achingly familiar. It's the 200 OK that returns an error page in the body. It's 100% test coverage where the assertions are assert True. It's a health check that pings the load balancer instead of the database. The metric was real. The thing it implied was fiction. We didn't have a content problem. We had a measurement problem, and measurement problems are worse, because they corrupt your ability to know you have any other problems.

The Audit Confidence Report

The most useful artefact of the whole exercise wasn't code. It was a document that changed the question.

We had been asking: "Are there defects?" A binary, defect-hunting question, which presumes your detector is trustworthy.

We switched to: "How much of the system are we actually measuring, and how much should we trust a PASS?"

That sounds like a semantic dodge. It isn't — it's the difference between a smoke detector and a smoke detector with a battery you've verified is in it. Concretely, we needed every audit verdict to ship with two things it never had:

Coverage — which pathways did this audit actually inspect?
Confidence — given that coverage, how much weight should a human put on the verdict?

A clean verdict over 40% coverage and a clean verdict over 100% coverage should not render as the same green pixel. Until they stopped looking identical, every other improvement was theatre.
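One way to stop the two verdicts looking identical is to make coverage part of the verdict's own label. This is a sketch only: `AuditVerdict` and its fields are hypothetical names, and the pathway ids are placeholders.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class AuditVerdict:
    passed: bool
    inspected: FrozenSet[int]   # pathway ids the audit actually looked at
    total_pathways: int

    @property
    def coverage(self) -> float:
        return len(self.inspected) / self.total_pathways

    def label(self) -> str:
        """Render the verdict together with the coverage it rests on."""
        status = "PASS" if self.passed else "FAIL"
        return (
            f"{status} ({len(self.inspected)}/{self.total_pathways} pathways, "
            f"{self.coverage:.0%} coverage)"
        )


partial = AuditVerdict(True, frozenset(range(1, 10)), 22)
full = AuditVerdict(True, frozenset(range(1, 23)), 22)
print(partial.label())
print(full.label())
```

Both verdicts are a PASS, but the rendered strings differ, which is the whole point: the reader sees what the PASS was silent about.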

Building a Governance Programme

We refused to do the tempting thing, which was to "just patch the orphaned generators." Patching is how you get 22 pathways in the first place. Instead we did it in phases, each shippable, each with a hard scope boundary, and — importantly — each forbidden from doing the next phase's job.

Phase 0 — The governance spine

Before fixing anything, we built the thing every pathway should have shared from the start: a single immutable context object, ProgrammeGenerationContext, carrying scenario, domain, approved companion standards, locale, and grounding. The rule: every generator receives it; no generator reinvents it.

Phase 0 changed zero output. We proved that with byte-parity tests — the generated artefacts were identical before and after, because Phase 0 only re-routed how context arrived, not what it contained. Boring on purpose. A spine you can't trust to be inert is a spine that introduces new bugs while you're fixing old ones.

We also added provenance. Every generated artefact now carries:

"generation_meta": { "route": "assignment_brief", "governance_version": "gv1", "at": "..." }

This one little stamp does an unreasonable amount of work later. It's the difference between "this artefact is clean" and "this artefact was born under governance version gv1, via the assignment-brief route, at this time." Provenance turns an opinion into a fact.
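A sketch of the spine and the stamp together. The field names follow the description above, but their types, the example values, and the decision to carry `governance_version` on the context object are assumptions of this example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Tuple


@dataclass(frozen=True)  # frozen: generators can read it, not rewrite it
class ProgrammeGenerationContext:
    scenario: str
    domain: str
    companion_standards: Tuple[str, ...]
    locale: str
    grounding: str
    governance_version: str = "gv1"  # assumed to live on the context


def generation_meta(ctx: ProgrammeGenerationContext, route: str) -> dict:
    """Build the provenance stamp attached to a generated artefact."""
    return {
        "generation_meta": {
            "route": route,
            "governance_version": ctx.governance_version,
            "at": datetime.now(timezone.utc).isoformat(),
        }
    }


ctx = ProgrammeGenerationContext(
    scenario="example scenario",          # illustrative values only
    domain="ISO 42001",
    companion_standards=(),
    locale="en-GB",
    grounding="example grounding text",
)
stamp = generation_meta(ctx, "assignment_brief")
```

Because the dataclass is frozen, an attempt such as `ctx.domain = "ISO 14001"` raises an error instead of quietly changing what later generators see.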

Phase 1 — Generation alignment

Now we pulled the 5 red pathways onto the spine. The orphaned generators — applied exercises, assignment brief, both regeneration paths, rubric — started consuming the same shared governance directive as the green ones.

The nastiest find here was a feedback loop: the assignment-regeneration path re-seeded itself from the existing (contaminated) brief, so "regenerate to fix it" faithfully reproduced the defect with the confidence of a freshly committed mistake. We cut that loop. Red pathways became amber: governed at generation time, even though nothing audited them yet. That "yet" is Phase 2.

Phase 2 — Detection expansion

Generation coverage was now ahead of audit coverage, which is its own kind of lie (your content is governed but you can't prove it). So we extended the detector to the previously-invisible surfaces — applied, brief, rubric, hotspots — and added a non-negotiable: a pathway registry with a self-test.

def test_all_22_pathways_registered():
    assert set(PATHWAY_REGISTRY) == set(range(1, 23))
# Add a 23rd pathway without a detection decision → the suite goes red.

That test is the cheapest insurance we bought all year. It makes "we forgot to audit the new generator" a compile-time-ish failure instead of a discovered-in-production one.

Two things mattered more than the new detectors:

Attribution. Every finding now answers what failed / which artefact / which pathway / which route / which governance version / is it legacy. A finding you can't locate is a complaint, not a bug report.

The confidence split. We learned the hard way (see below) not to collapse trust into one number, so we report two axes:

coverage_confidence — how much did we measure?
content_confidence — how clean is what we measured?

A programme can be coverage: High, content: Low — "we looked everywhere and found a real problem" — which is completely different from coverage: Low, content: High — "everything we managed to look at was fine, but we didn't look at much." One number could never say both.
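To make the two axes concrete, here is a small sketch. The High/Medium/Low cut-offs and the use of a simple clean-artefact ratio for content confidence are invented for illustration; the real scoring may well differ.

```python
def band(score: float) -> str:
    """Map a 0..1 score to a label. Cut-offs are hypothetical."""
    if score >= 0.9:
        return "High"
    if score >= 0.5:
        return "Medium"
    return "Low"


def confidence_split(inspected: int, total: int,
                     findings: int, artefacts: int) -> dict:
    """Report how much was measured separately from how clean it was."""
    coverage = inspected / total if total else 0.0
    clean_ratio = 1 - findings / artefacts if artefacts else 0.0
    return {
        "coverage_confidence": band(coverage),
        "content_confidence": band(clean_ratio),
    }


# Looked everywhere, found real problems.
thorough = confidence_split(inspected=22, total=22, findings=6, artefacts=10)
# Looked at very little, and what was seen was clean.
shallow = confidence_split(inspected=5, total=22, findings=0, artefacts=10)
```

The two example programmes come out with opposite labels on each axis, which a single blended score would flatten into something like "Medium" for both.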

We were also disciplined about false positives, because the original failure had a sharp edge: legitimate cross-standard pedagogy ("compared to ISO 9001, Annex SL gives a shared structure…") is not a defect. We reused a classifier with three buckets — A (another standard presented as governing: a real defect), B (title/structure encoded), C (comparative/contextual). Only A is allowed to be loud. Get this wrong and you train your engineers to ignore the detector, which is worse than not having one.

Phase 3 — Remediation expansion

Detection without remediation is just a more articulate way to feel bad. We wired each finding to a one-click, governance-aware fix that regenerates through the Phase 1 spine — so the fix is clean by construction and re-stamped gv1, which flips its is_legacy flag to false. Detect → fix → re-detect-clean, closed loop.

This phase also forced a subtle distinction. Applied exercises are practice-only; assignment briefs and rubrics are credential-bearing. So applied remediation regenerates per-exercise, in place, with no version bump and no SME sign-off — a deliberately lighter policy than the graded artefacts. Same engine, different blast radius. Resisting "make it uniform" was the right call; uniformity that ignores blast radius is just centralised risk.

And we paid down a Phase 2 debt: remediation now writes through to the cached audit report, because we'd discovered the portfolio was reading a stale cache (a programme showed 3 findings on the cached row and 9 on a fresh audit — the exact divergence that makes dashboards lie).

Phase 4 — Enforcement

This is the only phase permitted to change publish behaviour, and it's deliberately narrow. A learner-facing artefact that presents an off-domain standard as governing (a Category-A finding) became a hard publish blocker:

_BLOCKER_KEYS = {"courses", "qa", "signoff", "assignment",
                 "module_exams", "final_exam", "governance"}  # ← new

The boundary conditions are where the engineering judgement lives:

Only Category-A blocks. Medium/advisory findings warn; they never gate. We are not going to block a publish over a legitimate "compared to ISO 9001."
Confidence never blocks. Not coverage, not content, not finding-density, not legacy status. The gate fires on specific, named, fixable defects, never on an aggregate score. Coupling a fuzzy dashboard number to a binary gate is how you get teams disabling the gate.
Every blocker has a one-click fix in its payload. Phase 4 only enforces what Phase 3 can remediate. No dead-end gates.
Fresh detection at publish time, never the cache. And the whole thing sits behind a flag (GOVERNANCE_PUBLISH_ENFORCEMENT) so a bad day in production is a config change, not a redeploy-revert.
No retroactive unpublishing. Live programmes stay live; the gate applies to the next publish/republish.
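The boundary conditions above can be sketched as a small gate function. `Finding`, `fix_action`, and the return shape are hypothetical; the `enforcement_enabled` argument stands in for the GOVERNANCE_PUBLISH_ENFORCEMENT flag, and demoting a Category-A finding with no fix to a warning is this sketch's reading of "no dead-end gates", not a confirmed behaviour.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Finding:
    category: str                      # "A", "B" or "C"
    pathway: str
    fix_action: Optional[str] = None   # hypothetical one-click remediation id


def publish_gate(findings: Iterable[Finding], enforcement_enabled: bool) -> dict:
    """Block only on fixable Category-A findings; everything else warns."""
    blockers, warnings = [], []
    for finding in findings:
        if finding.category == "A" and finding.fix_action is not None:
            blockers.append(finding)
        else:
            warnings.append(finding)
    # Note what is absent: no coverage, confidence or density score
    # is consulted anywhere in the decision.
    allowed = not (enforcement_enabled and blockers)
    return {"allowed": allowed, "blockers": blockers, "warnings": warnings}
```

With the flag off, the same findings are still reported, but `allowed` stays true, which is what makes a bad day a configuration change.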

Phase 5 — Measurement and evidence

Here's where we resisted the strongest temptation in the entire project: to celebrate. We had a governance spine, aligned generation, full detection, one-click remediation, and enforcement. Surely the portfolio was now clean?

We didn't know. We had built a beautiful machine and assumed its output. So Phase 5 did the unglamorous thing: it measured, before concluding anything.

The Final Sweep

Phase 5 was a portfolio-wide re-audit sweep. Read-only with respect to content — it never regenerates an artefact; it only recomputes detection under the completed model and refreshes each cached report. Then it rolls everything into a Portfolio Governance Report with a deliberately blunt question attached: was this widespread or isolated?

The preview-portfolio sweep returned this:

programme_count:            20
affected_programme_ratio:   0.0
category_a_by_pathway:      {}            # no off-domain-as-governing defects
verdict:                    CLEAN / ISOLATED
total_remediations_applied: 36

Zero Category-A learner-facing defects across the portfolio. The affected ratio was 0. The thing we built an entire five-phase programme to catch was, under honest measurement, not actually present at scale.
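For readers wondering how figures like these roll up, here is a sketch of the aggregation. The input shape (programme id mapped to a list of category and pathway pairs) is an assumption, and the verdict field is left out on purpose because its thresholds are not stated here.

```python
from collections import Counter
from typing import Dict, List, Tuple


def portfolio_report(audits: Dict[str, List[Tuple[str, str]]]) -> dict:
    """Roll per-programme findings into portfolio-level figures."""
    by_pathway: Counter = Counter()
    affected = 0
    for findings in audits.values():
        cat_a = [pathway for category, pathway in findings if category == "A"]
        if cat_a:
            affected += 1           # count programmes, not findings
            by_pathway.update(cat_a)
    count = len(audits)
    return {
        "programme_count": count,
        "affected_programme_ratio": affected / count if count else 0.0,
        "category_a_by_pathway": dict(by_pathway),
    }


# Twenty hypothetical programmes carrying only comparative (Category-C) findings.
clean = {f"programme-{n}": [("C", "lesson")] for n in range(20)}
report = portfolio_report(clean)
```

Counting affected programmes rather than raw findings is what lets the report answer "widespread or isolated?" instead of only "how many?".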

Read that result the wrong way and it's deflating: "all that work to find nothing?" Read it the right way and it's the most valuable outcome available:

The programme didn't succeed because it uncovered a widespread mess. It succeeded because it produced evidence — quantified, comparable, reproducible — that there wasn't one. The ISO 42001 failure was an early-warning sentinel, not the tip of an iceberg.

This is the prevention vs cleanup distinction, and engineers systematically undervalue prevention because it has no body count. Cleanup is legible — you fixed 400 things, here's the burndown chart, promotion please. Prevention is a non-event: the incident that didn't happen, the GreenTech-flavoured ISO 42001 assignment that never reached a learner because the gate stopped it at publish. Nobody throws a party for the fire that didn't start. But the whole point of governance is to convert "we're probably fine" into "here is the dated, versioned evidence that we are fine, and the enforcement that keeps us there."

Two honest caveats, because a post-mortem that flatters itself isn't one:

The clean result is from our preview portfolio, which is fixture-heavy. The authoritative production catalogue is access-gated; the sweep has to be run there to make the verdict definitive. We're claiming a proven mechanism and a strong preview signal, not a victory lap.
"0 defects" is only as trustworthy as the detector's coverage — which is exactly why Phase 2's coverage model and Phase 5's provenance reporting exist. We are measuring our measurement. It's turtles, but at least we now know how many turtles and which version each turtle is.

The Irony

We were building an AI Governance Lead Auditor programme — a course that teaches people how to audit an organisation's AI management system for exactly this class of problem: drift, weak controls, things that look compliant but aren't measured.

That course, by refusing to pass, forced us to run an AI-governance audit on our own platform. The curriculum became the test case. The thing we were trying to teach, we had to do — to ourselves, under our own strict reading of the standard.

The course we were ready to write off as broken turned out to be the most effective auditor we had. It didn't pass because it was the only thing in the building strict enough to notice that passing didn't mean what we thought it meant.

Final lessons

For the engineers, SREs, and GRC folks who got this far — the transferable parts, none of which are about AI:

1. PASS ≠ Clean. A green result is a statement about your detector's coverage, not your system's health. Always ask what a PASS is silent about.

2. Detection coverage beats dashboards. A confident dashboard over partial coverage is more dangerous than no dashboard, because it actively suppresses the instinct to look closer. Ship coverage and confidence next to every verdict.

3. Governance gaps masquerade as content defects. We chased a "bad course" for weeks. It was an architecture gap wearing a content costume. When one instance fails repeatedly and resists every local fix, suspect the system that judges it, not just the thing being judged.

4. The hardest bugs are bugs in the system that tells you whether bugs exist. A wrong answer is easy. A measurement system that confidently reports the wrong amount of certainty will burn weeks, because it corrupts the feedback loop you use to debug everything else.

5. Measure before you migrate. The instinct after Phase 4 was to bulk-regenerate "all the legacy stuff." We didn't. We measured first — and found there was almost nothing to migrate. The migration we were about to schedule was, in evidence, unnecessary.

6. Evidence before assumptions. "It's probably fine" and "here is the dated, versioned, reproducible report showing it is fine" are different deliverables. Only one of them survives an actual audit, an actual incident, or an actual awarding body asking you to prove it.

The takeaway

The ISO 42001 programme was not valuable because it uncovered a broken platform. By the evidence, the platform mostly wasn't broken.

It was valuable because it forced the platform to prove whether it was broken — and in building the machinery to produce that proof, we replaced a green checkmark that meant "we didn't look" with one that means "we looked everywhere, here's the coverage, here's the confidence, and here's the enforcement that keeps it true."

The proof became the real deliverable. The course was just the thing stubborn enough to demand it.

If your dashboards are all green right now, here's the uncomfortable parting question: green because the system is clean, or green because nothing is looking? The honest answer is usually "I'm not sure" — and being able to replace that with a number is the entire job.

Why Building Software Is Like Leading an MMO Raid

CPDForge — Thu, 11 Jun 2026 12:38:07 +0000

A few years ago, if you'd told me that thousands of hours leading MMO raids would end up helping me build software products, I'd have laughed.

Today I'm not so sure.

I've spent years leading raids in games like Star Wars Galaxies and Star Wars: The Old Republic. What surprised me is how often the same lessons show up when building software.

The technology changes.

The tools change.

The jargon changes.

But the principles?

Not so much.

After building products, leading teams, surviving deployments, and spending more hours than I'd care to admit staring at architecture diagrams, I've realised that software development and MMO raid leadership have far more in common than most people think.

Most Wipes Are Self-Inflicted

One of the first lessons every raid leader learns is this:

Most wipes are self-inflicted.

Not every wipe.

Not all wipes.

But most of them.

The boss usually isn't the problem.

The raid wipes because:

Somebody got greedy
Somebody ignored the mechanic
Somebody panicked
Somebody thought they could improvise

Software projects are exactly the same.

Most projects don't fail because the technology was impossible.

They fail because:

Requirements weren't clear
Priorities kept changing
Validation was skipped
Scope grew uncontrollably
Assumptions went unchallenged

The technology rarely kills the project.

The team usually does.

Don't Pull Extra Mobs

This one has become a genuine software development philosophy for me.

Every MMO player knows this moment.

The group is making steady progress.

Everything is under control.

Then somebody says:

"While we're here..."

Five minutes later you're fighting three extra packs, the healer is out of resources, and half the raid is dead.

Software development has exactly the same trap.

You're implementing a reporting feature.

Someone says:

"While we're here, we could also..."

Then suddenly you're discussing:

New dashboards
New permissions
Notifications
Analytics
Exports
AI summaries

Congratulations.

You just pulled extra mobs.

One of the most useful questions I've learned to ask is:

Does this solve the problem we're trying to solve right now?

If the answer is no, it probably belongs in the backlog.

Finish the current fight first.

Do The Mechanic

Every raid eventually has that player.

The one topping the damage charts.

The one doing incredible numbers.

The one who dies first because they ignored the mechanic.

The mechanic doesn't care how talented you are.

It doesn't care how much damage you're doing.

If you don't do the mechanic, you're dead.

Software projects have mechanics too.

Things like:

Architecture reviews
Testing
QA
Security checks
Deployment validation

Nobody gets excited about them.

Everybody wants to write code.

But the mechanic still needs to be done.

The teams that survive long term are usually the ones that respect the boring parts.

Don't Stand In Fire

Another timeless lesson.

There is always fire.

Sometimes it's literal fire.

Sometimes it's:

Hardcoded secrets
Direct production database edits
Skipping tests
Ignoring warnings
Undocumented architecture

Everybody knows it's dangerous.

People still stand in it.

Repeatedly.

The lesson is simple:

If something has already burned you three times, stop standing there.

The DPS Meter Lies

One of the most underrated lessons from raid leadership.

The highest DPS player is not always the most valuable player.

Sometimes the most valuable person is:

The healer who prevented a wipe
The player handling mechanics
The person explaining strategy
The one spotting problems before they happen

Software teams work the same way.

The engineer with the most commits doesn't necessarily create the most value.

The engineer who prevents a production outage may contribute more than the person who writes thousands of lines of code.

The person who simplifies a system may create more value than the person who adds five new features.

Not all contributions show up on the meter.

Wait For Loot

This might be my favourite lesson.

The boss dies.

Everyone gets excited.

Then the raid leader says:

"Wait. Don't loot yet."

Usually because:

Loot rules aren't set
Someone is still running back
A screenshot is needed
Something still needs checking

Good raid leaders don't sprint to the next boss the moment the current one dies.

They stop.

Review.

Recover.

Then move on.

Software projects should do exactly the same.

A feature ships.

Great.

Now:

Test it
Monitor it
Validate it
Learn from it

Only then start the next thing.

Too many teams treat deployment as the finish line.

It's usually the start of the learning phase.

The DPS Hero Is Usually The First To Die

Every raid has one.

The player convinced they're the main character.

The one who thinks mechanics are for everyone else.

The one who says:

"I've got this."

Thirty seconds later they're face down on the floor.

Software projects have these moments too.

The engineer who insists:

Documentation is unnecessary
Testing is optional
Architecture reviews are bureaucracy
Deployment checklists are for other people

That engineer often ends up discovering why those things existed in the first place.

Usually in production.

Usually on a Friday.

The Boss Usually Isn't The Problem

This is probably the biggest lesson of all.

Most of the time, the boss isn't what kills you.

It's everything around the boss.

The lack of planning.

The lack of coordination.

The avoidable mistakes.

The unnecessary complexity.

Software projects are no different.

The technology challenge is often only a small part of the problem.

The bigger challenge is usually:

Focus
Discipline
Prioritisation
Communication

The boring stuff.

The raid-leader stuff.

Don't Touch Anything, I Forgot To Set Random

If you've ever led a raid, you've probably said something like:

"Don't loot yet. I forgot to set random."

Everyone freezes.

Nobody touches anything.

Because everyone understands that a tiny process mistake can create a much larger problem.

Software has these moments too.

They're usually disguised as:

"Don't deploy yet."

"Don't restart that."

"Don't touch production."

"Wait, I forgot to update the environment variables."

The principle is the same.

Slow down.

Verify.

Then proceed.

The Best Raid Leaders Aren't The Best Players

This took me years to understand.

The best raid leaders aren't necessarily:

The most skilled
The best geared
The highest DPS

They're usually the people who:

Stay calm
Keep everyone focused
Prioritise correctly
Reduce unnecessary chaos

The same is true in software.

The best technical leaders aren't always the smartest people in the room.

They're often the people who stop the team from creating problems for themselves.

Final Thoughts

After years of leading raids and building software, I've ended up with a surprisingly simple development framework:

Do the mechanic.

Don't stand in fire.

Wait for loot.

Don't pull extra mobs.

Most wipes are self-inflicted.

It's not a bad framework for software development.

And honestly, it's probably saved me more projects than some of the formal methodologies I've used over the years.

Although it's admittedly harder to explain during architecture reviews.

Then again...

The longer I build software, the more I think good architecture and good raid leadership are really the same thing.

Reduce unnecessary chaos.

Focus on the current objective.

And for the love of all that is holy...

Don't pull extra mobs.

Beyond the Prompt: Building a Proposal-Based Workflow for AI Content Updates

CPDForge — Tue, 09 Jun 2026 15:30:55 +0000

AI should propose changes. Systems should decide whether those changes become reality.

Most AI systems are designed to generate content.

The problem is that production systems rarely need content generation.

They need content maintenance.

And maintenance is where AI becomes dangerous.

When you're building software for regulated environments, the challenge isn't creating version one.

It's safely updating version one after it's already been approved, deployed, audited, referenced, and relied upon.

That's where we started rethinking how AI should interact with content.

The Problem With Regenerate

Imagine you have a compliance training course containing:

Regulatory references
Knowledge checks
Real-world scenarios
Internal procedures
Approval history

A regulation changes.

You need to update a single section.

A common approach is to send the whole lesson to an LLM and ask it to regenerate the content.

Technically, it works.

Operationally, it creates a new problem.

The AI might:

Rewrite surrounding content unnecessarily
Change instructional tone
Remove important references
Alter lesson structure
Introduce subtle inaccuracies

The update itself might be correct.

The rest of the lesson might no longer be.

In high-stakes environments, that's not acceptable.

Content Is Not A String

One of the design decisions we made while building CPDForge AI was to stop thinking about content as large blocks of text.

Instead, we treat content as structured documents.

A simplified model looks like this:

Course
 └── Module
      └── Lesson
           └── Section
                └── Content Block

That structure allows updates to be targeted precisely.

Instead of telling AI:

Rewrite this lesson

We can tell it:

Update this section

That distinction turns out to be surprisingly important.

Because once content becomes structured, AI no longer needs to touch everything.

It only needs to touch the thing that changed.
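As a rough illustration of that hierarchy, here is a minimal Python sketch. The class and field names (ContentBlock, block_id, update_section and so on) are assumptions made for the example, not the actual CPDForge data model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ContentBlock:
    block_id: str
    text: str


@dataclass
class Section:
    title: str
    blocks: List[ContentBlock] = field(default_factory=list)


@dataclass
class Lesson:
    title: str
    sections: List[Section] = field(default_factory=list)


@dataclass
class Module:
    title: str
    lessons: List[Lesson] = field(default_factory=list)


@dataclass
class Course:
    title: str
    modules: List[Module] = field(default_factory=list)


def update_section(course, module_idx, lesson_idx, section_idx, new_blocks):
    """Replace the blocks of one section and leave every other node untouched."""
    section = course.modules[module_idx].lessons[lesson_idx].sections[section_idx]
    section.blocks = list(new_blocks)
    return section
```

Because the update addresses a single section, sibling sections are never passed to the model and cannot be rewritten by accident.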

From Generation To Proposals

The next decision was even more important.

The AI never directly edits production content.

Instead, it generates a proposal.

Production Content
        ↓
AI Analysis
        ↓
Proposed Change
        ↓
Validation
        ↓
Human Review
        ↓
Approved Update

The model suggests.

Humans decide.

That simple shift dramatically improves trust.

Instead of treating AI as an author, we treat it as a contributor.
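One way to make "the model suggests, humans decide" enforceable is to give each proposal an explicit status and only allow certain transitions. The sketch below is illustrative only: the Status values, the Proposal fields, and apply_if_approved are hypothetical names, and production content is stood in for by a plain dict.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PROPOSED = "proposed"
    VALIDATED = "validated"
    APPROVED = "approved"
    REJECTED = "rejected"


# Allowed transitions: a proposal cannot be approved without passing validation first.
ALLOWED = {
    Status.PROPOSED: {Status.VALIDATED, Status.REJECTED},
    Status.VALIDATED: {Status.APPROVED, Status.REJECTED},
    Status.APPROVED: set(),
    Status.REJECTED: set(),
}


@dataclass
class Proposal:
    path: str
    new_text: str
    reason: str
    status: Status = Status.PROPOSED

    def transition(self, target: Status) -> None:
        if target not in ALLOWED[self.status]:
            raise ValueError(f"cannot go from {self.status.value} to {target.value}")
        self.status = target


def apply_if_approved(content: dict, proposal: Proposal) -> bool:
    """Write to production content only when the proposal has been approved."""
    if proposal.status is not Status.APPROVED:
        return False
    content[proposal.path] = proposal.new_text
    return True
```

The useful property is that the write path has exactly one entry point, and it refuses anything that has not been through validation and review.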

Granular Path Targeting

Because content is structured, updates can be scoped to specific locations rather than entire lessons.

Conceptually:

{
  "path": "modules[2].lessons[0].sections[4]",
  "action": "update",
  "reason": "regulatory_change"
}

The AI receives the affected section rather than the entire document.

This reduces content drift, limits unintended changes, and makes updates easier to review.

More importantly, it keeps the blast radius small.
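A path like the one above can be resolved with a few lines of code. This is a sketch that assumes the document is held as nested dicts and lists and that every path segment has the form name[index]; parse_path, get_at, and set_at are names invented for the example.

```python
import re

_TOKEN = re.compile(r"([A-Za-z_]\w*)\[(\d+)\]")


def parse_path(path: str):
    """Turn 'modules[2].lessons[0]' into [('modules', 2), ('lessons', 0)]."""
    steps = []
    for part in path.split("."):
        match = _TOKEN.fullmatch(part)
        if match is None:
            raise ValueError(f"malformed path segment: {part!r}")
        steps.append((match.group(1), int(match.group(2))))
    return steps


def get_at(doc: dict, path: str):
    """Return the node the path points at, e.g. the one section sent to the model."""
    node = doc
    for key, index in parse_path(path):
        node = node[key][index]
    return node


def set_at(doc: dict, path: str, value) -> None:
    """Replace only the node the path points at."""
    steps = parse_path(path)
    node = doc
    for key, index in steps[:-1]:
        node = node[key][index]
    last_key, last_index = steps[-1]
    node[last_key][last_index] = value
```

Rejecting malformed segments up front matters here: a path that silently resolves to the wrong node is exactly the kind of unintended change the targeting is meant to prevent.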

The Preservation Problem

Even targeted updates create risk.

An AI can still:

Remove required knowledge checks
Drop regulatory references
Break expected structure
Change instructional intent

So every proposal must be validated before it reaches a reviewer.

Deterministic Validation

LLMs are non-deterministic.

Compliance systems shouldn't be.

Before a proposal can be approved, it must pass validation checks such as:

Schema Validation

Does the proposal still conform to the expected structure?

Required Component Validation

Are mandatory elements still present?

Citation Validation

Have required references been preserved?

Structural Integrity Validation

Does the update still fit within the expected hierarchy?

The objective is simple:

Prevent the AI from damaging content while attempting to improve it.
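The checks above can be expressed as plain, deterministic functions. The sketch below assumes a section is a dict with four fields (heading, body, citations, knowledge_checks) and that the caller supplies the set of citations that must survive. Those field names and the REG-style identifiers in the usage note are invented for illustration.

```python
REQUIRED_KEYS = {"heading", "body", "citations", "knowledge_checks"}


def validate_proposal(original: dict, proposed: dict, required_citations: set) -> list:
    """Return a list of failures; an empty list means the proposal passes."""
    failures = []

    # Schema: the proposed section must carry every expected field.
    missing = REQUIRED_KEYS - set(proposed)
    if missing:
        failures.append(f"schema: missing fields {sorted(missing)}")
        return failures  # the remaining checks assume the fields exist

    # Structure: no fields outside the expected shape.
    extra = set(proposed) - REQUIRED_KEYS
    if extra:
        failures.append(f"structure: unexpected fields {sorted(extra)}")

    # Required components: knowledge checks may change but must not disappear.
    if original["knowledge_checks"] and not proposed["knowledge_checks"]:
        failures.append("components: knowledge checks were removed")

    # Citations: every required reference must still be present.
    dropped = set(required_citations) - set(proposed["citations"])
    if dropped:
        failures.append(f"citations: dropped {sorted(dropped)}")

    return failures
```

For example, a proposal that keeps the structure but drops a required reference such as "REG-1" comes back with a single "citations" failure, and never reaches a reviewer. The same inputs always produce the same verdict, which is the point.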

Human Review Still Matters

There is a temptation to automate everything.

We've found the opposite works better.

The AI identifies potential improvements.

The platform validates the proposal.

Humans make the final decision.

For regulated content, that distinction matters.

Not because humans are perfect.

But because accountability still matters.

AI As A Pull Request

The more we developed this workflow, the more it started to resemble modern software development.

Developers don't usually push unreviewed code directly into production.

They create pull requests.

Those pull requests are reviewed, validated, tested, and approved before being merged.

We're increasingly treating AI-generated content updates the same way.

The AI creates the equivalent of a pull request.

The platform validates it.

Humans decide whether it should be merged.

That model feels significantly safer than allowing direct mutation of production content.

The Bigger Lesson

This pattern extends far beyond compliance training.

The same principle applies to:

Documentation systems
Knowledge bases
Legal content
Internal policies
CMS platforms
Enterprise workflows
Any system where correctness matters

The most valuable AI systems won't necessarily be the ones that generate the most content.

They'll be the ones that help organisations maintain complex information safely.

Content generation is rapidly becoming commoditised.

Content governance is not.

As builders, we spend a lot of time thinking about generation.

Increasingly, I think we should be thinking about maintenance instead.

Because once version one exists, the real challenge begins.

Questions For Other Builders

If you're building AI into a production system:

Are you allowing direct mutation of live content?
How are you validating AI-generated changes?
What safeguards exist when the model gets it wrong?
Are you treating AI as an author or as a contributor?
Have you adopted a proposal-based workflow similar to pull requests?

I'd be interested to hear how others are approaching this problem.

The Most Dangerous AI Output Isn’t Wrong — It’s “Almost Right”

CPDForge — Sat, 04 Apr 2026 09:19:47 +0000

Most people think the biggest risk with AI is hallucination.

Completely wrong answers.

Obvious mistakes.

Stuff you can spot instantly.

That’s not what caused problems for us.

The real issue showed up later — once things looked like they were working.

The outputs weren’t wrong.

They were almost right.

And that’s a much harder problem to deal with.

Why “Almost Right” Is Worse Than Wrong

If something is clearly wrong, you catch it.

You fix it.

You move on.

But when something is:

90% correct
Well structured
Confidently written

…it passes through unnoticed.

And that’s where systems start to break.

What This Looks Like in Practice

These weren’t big failures.

They were small, subtle ones:

A field slightly misclassified
A rule applied in the wrong context
A structure that looks valid but doesn’t align with the system

Individually, they don’t matter.

At scale, they compound.

The Real Problem: AI Stabilises Its Own Mistakes

Here’s what we realised:

AI doesn’t just generate errors — it reinforces them.

Once a slightly incorrect pattern appears, the model tends to:

Repeat it
Expand on it
Make it look more consistent over time

So instead of random errors, you get:

Clean, consistent, wrong outputs.

Which are much harder to detect.

Why This Happens

AI isn’t reasoning in the way we expect.

It’s optimising for:

Coherence
Pattern completion
Internal consistency

Not correctness.

So if an early assumption is slightly off, the model will build a very convincing version of reality around it.

Where This Breaks Real Systems

This becomes critical when AI is used for:

Structured content generation
Compliance or policy outputs
Anything reused or scaled

Because now you don’t just have an error.

You have:

A repeatable error
A scalable error
A system-level error

What We Changed

We stopped trusting “good-looking outputs.”

Instead, we built around one principle:

Every output is suspect until proven stable.

1. Pattern Detection Over Single Output Review

Instead of asking:
“Is this output correct?”

We ask:
“Is this pattern consistently correct across outputs?”

This exposes hidden drift fast.
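A minimal sketch of what checking a pattern across outputs can look like, assuming you have a sample of outputs with known-good expected values and can label each with the pattern it exercises. The tuple layout, the function names, and the 0.95 threshold are assumptions for the example.

```python
from collections import defaultdict


def pattern_accuracy(samples):
    """samples: iterable of (pattern_key, produced, expected) tuples."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for key, produced, expected in samples:
        totals[key] += 1
        if produced == expected:
            correct[key] += 1
    return {key: correct[key] / totals[key] for key in totals}


def drifting_patterns(samples, threshold=0.95):
    """Return the pattern keys whose accuracy across outputs falls below the threshold."""
    accuracy = pattern_accuracy(samples)
    return sorted(key for key, value in accuracy.items() if value < threshold)
```

Any single output in a drifting pattern can still look fine on its own; it is the aggregate that shows the problem.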

2. Intent vs Output Validation

We separate:

What the system is supposed to do
What the AI actually produced

Then compare them explicitly.

If they don’t align, it fails — even if it looks right.
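In its simplest form, that comparison is a field-by-field check of a declared intent against the produced output. The sketch below assumes both are flat dicts; the field names in the usage note (section_type, audience) are hypothetical.

```python
def check_against_intent(intent: dict, output: dict) -> list:
    """Compare what was asked for with what came back; return the mismatches."""
    mismatches = []
    for field_name, expected in intent.items():
        if field_name not in output:
            mismatches.append(f"{field_name}: missing from output")
        elif output[field_name] != expected:
            mismatches.append(
                f"{field_name}: expected {expected!r}, got {output[field_name]!r}"
            )
    return mismatches
```

If the intent says the audience is "new_starter" and the output is written for "manager", the check fails regardless of how well the text reads.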

3. Breaking the Feedback Loop

We avoid feeding AI its own outputs without checks.

Because that’s how:

Small errors → reinforced patterns → system behaviour

The Counterintuitive Bit

Making outputs more polished made the problem worse.

Cleaner language increases trust.

More trust reduces scrutiny.

Which allows bad patterns to survive longer.

Why This Matters Right Now

A lot of AI tooling is focused on:

Making outputs better
Making them more human
Making them more polished

But that increases risk if you’re not validating underneath.

The Takeaway

If your AI outputs look great but your system still feels unreliable:

You’re probably dealing with “almost right” errors.

And those are much harder to catch than obvious failures.

Question for Anyone Building with AI

If you’re using AI in production workflows:

What breaks first when you scale?
Do you validate outputs, or just trust them if they look good?
Have you run into “clean but wrong” behaviour?

Genuinely curious how others are handling this.

From Prompts to Systems: Fixing AI Agent Drift in Production

CPDForge — Mon, 30 Mar 2026 13:50:55 +0000

Why My AI Agent Kept Getting Things Wrong (And What Actually Fixed It)

At first, it worked.

I gave the AI a clear prompt. It responded well. Structured, relevant, even a bit impressive.

Then I tried again.

Same prompt. Slightly different output.

Then again — and something felt off.

Not completely wrong… just inconsistent.

That’s when it became a problem.

Because I wasn’t building a demo. I was building a product.

The Problem: “Almost Right” Is Not Good Enough

When you’re working with LLMs in isolation, variability is fine. Even interesting.

When you’re building something people rely on — it isn’t.

I started seeing patterns:

Outputs drifting in structure
Key instructions being ignored
Tone and formatting changing between runs
Occasionally… things just being made up

Nothing catastrophic. Just unreliable.

And that’s worse.

Because you can’t trust it.

The Context: This Wasn’t Just a Chatbot

One important detail — this wasn’t an internal tool or a sandbox experiment.

This was a user-facing AI agent, interacting with both:

logged-in users (with context, data, and history)
prospective users (with no context at all)

Which meant I effectively needed two behaviours:

one that could operate with structured internal data and constraints
one that could explain, guide, and respond more openly without access to that context

Trying to handle both with the same prompt quickly broke down.

The agent would:

assume context that didn’t exist
overreach when it should stay generic
or lose structure when switching between modes

That’s when it became clear the issue wasn’t just prompting — it was context control and behavioural separation.

Why This Happens (and Why It’s Not a Bug)

It took a bit of stepping back to realise:

The model wasn’t failing — I was asking it to behave like something it isn’t.

LLMs are:

Stateless (unless you force context)
Probabilistic (not deterministic)
Context-sensitive (and context degrades fast)

What I was treating as “rules” were really just:

Suggestions with good intentions

Even system prompts didn’t fully solve it.

They help — but they don’t enforce behaviour.

What I Tried First (and Why It Didn’t Work)

Like most people, I went through the usual iterations:

Making prompts longer
Repeating instructions
Adding “IMPORTANT:” everywhere
Trying to be hyper-specific

It improved things slightly… but not enough.

The problem wasn’t clarity.

The problem was control.

The Shift: From Prompts to Systems

The breakthrough came when I stopped thinking in terms of prompts and started thinking in terms of structure.

Instead of:

“Tell the model what to do”

I moved to:

“Define how the model is allowed to behave”

That’s a completely different mindset.

What I Built: A Structured Instruction Layer

I ended up creating what I originally called an “instruction bible”.

In reality, it’s closer to a structured instruction system layered on top of the model.

1. Persistent rules (not buried in prompts)

Instead of mixing everything into one prompt, I separated:

Role definition
Behaviour rules
Output constraints

Example:

{
  "role": "compliance_ai",
  "rules": [
    "Do not invent regulations",
    "Flag uncertainty explicitly",
    "Prioritise clarity over completeness"
  ],
  "output_format": "structured_sections"
}

This becomes the source of truth, not just part of the conversation.

2. Modular instructions

Different tasks = different instruction sets.

Instead of one giant prompt, I used:

Generation mode
Review mode
Analysis mode

Each with its own constraints.

This reduced cross-contamination between behaviours.
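A minimal sketch of how shared rules and per-mode instruction sets might be combined. The base rules and role come from the example above; the mode-specific rules and output formats are invented for illustration, as is the build_instructions helper.

```python
BASE_RULES = [
    "Do not invent regulations",
    "Flag uncertainty explicitly",
    "Prioritise clarity over completeness",
]

# One instruction set per task. The mode names mirror the list above;
# the extra rules and formats are hypothetical.
MODES = {
    "generation": {"rules": ["Follow the lesson framework"], "output_format": "structured_sections"},
    "review": {"rules": ["Do not rewrite content"], "output_format": "findings_list"},
    "analysis": {"rules": ["Report, do not recommend"], "output_format": "summary_table"},
}


def build_instructions(mode: str) -> dict:
    """Assemble the instruction set for one mode from shared and mode-specific rules."""
    if mode not in MODES:
        raise KeyError(f"unknown mode: {mode}")
    spec = MODES[mode]
    return {
        "role": "compliance_ai",
        "mode": mode,
        "rules": BASE_RULES + spec["rules"],
        "output_format": spec["output_format"],
    }
```

Each call gets the shared rules plus only its own mode's constraints, so review-mode rules never leak into a generation request.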

3. Controlled outputs

I stopped accepting “natural” responses.

Everything had to follow a structure.

For example:

Sections must exist
Headings must match
Lists must be formatted consistently

If the output didn’t comply, it was rejected or reprocessed.
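The reject-or-reprocess step can be sketched as a small loop around the model call. Here generate is a stand-in for whatever function calls the model (it receives the attempt number), and the heading check is deliberately simple: each required heading must appear on its own line.

```python
def missing_headings(text: str, required: list) -> list:
    """Return the required headings that do not appear as a line of their own."""
    lines = {line.strip() for line in text.splitlines()}
    return [heading for heading in required if heading not in lines]


def generate_checked(generate, required_headings, max_attempts=3):
    """Call `generate` until its output contains every required heading, or give up."""
    for attempt in range(1, max_attempts + 1):
        text = generate(attempt)
        if not missing_headings(text, required_headings):
            return text
    raise RuntimeError(f"no compliant output after {max_attempts} attempts")
```

Raising after a bounded number of attempts is a design choice: a non-compliant output should surface as a failure, not slip through as a best effort.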

4. Reduced ambiguity

I removed anything vague.

No:

“be helpful”
“be clear”
“be concise”

Instead:

Define structure
Define constraints
Define boundaries

The model performs much better when it has less room to interpret.

What Changed

Once this layer was in place, the difference was immediate.

Outputs became consistent
Structure stabilised
Hallucination dropped significantly
Reuse became possible

Most importantly:

I could actually trust the output in a product setting

Not perfect — but predictable.

The Bigger Realisation

The real lesson wasn’t about prompts.

It was this:

Prompt engineering doesn’t scale. Systems do.

You can get good results with clever prompts.

But if you want:

reliability
repeatability
product-grade output

You need structure.

Where This Fits in the Bigger Picture

This lines up with a broader shift happening right now:

From chatbots → agents
From prompts → orchestration
From “AI responses” → controlled systems

We’re moving away from:

“Ask the model something”

Toward:

“Design how the model operates”

Final Thought

LLMs are powerful — but they’re not plug-and-play components.

If you want to build something real with them, you have to accept:

You’re not just writing prompts
You’re designing behaviour

And once you start treating it that way, everything changes.

If you’re building with AI and hitting similar issues, I’d be interested to hear how you’re handling it — especially where things break.

We tried to generate a compliance course with AI. It didn’t go well.

CPDForge — Wed, 25 Mar 2026 08:07:10 +0000

We started off trying to build a compliance course.

We ended up building the system required to trust one.

Turns out they’re not the same thing.

That’s when everything changed.

🧪 The First Version (Looked Fine… Until It Didn’t)

The initial idea was simple:

Use AI to generate a compliance training course.

Pick a topic like:

risk assessment
workplace safety
ESG fundamentals

Feed it into a model, get a structured course out.

And technically — that worked.

We got:

modules
lessons
headings
even quizzes

On the surface, it looked decent.

But once you actually read it properly…

❌ What Was Broken

Shallow Content

It explained things, but didn’t really teach anything.

No depth. No real-world context. No edge cases.

Inconsistent Structure

Some lessons were detailed. Others felt like placeholders.

No consistency across the course.

No Instructional Flow

It wasn’t designed — it was assembled.

Content chunks, not a learning journey.

And the Big One: Reliability

In compliance training, “almost correct” isn’t acceptable.

It’s a risk.

⚠️ The Realisation

We assumed the problem was:

“How do we generate better content?”

It wasn’t.

The real problem was:

How do we make that content consistent, reliable, and safe to use?

AI was doing exactly what it’s good at:

producing plausible output
filling gaps convincingly
sounding right

But that’s not the same as being trustworthy.

🔧 What Broke First

Our original pipeline looked something like:

Prompt → LLM → Output course

And for a moment, that felt like enough.

Until we started testing it properly.

Sections contradicted each other
Concepts repeated in different ways
Terminology drifted across lessons
Some parts were strong, others clearly weak

You could generate a course.

You just couldn’t rely on it.

🧱 What We Had to Build Instead

The moment things changed was when we stopped treating this as a generation problem.

We started treating it as a system problem.

The pipeline evolved into something more like:

Input
→ Structured Generation
→ Validation Layer
→ Targeted Rewriting
→ Enrichment (quizzes, scenarios, examples)
→ Compliance Checks
→ Output

Each layer existed for a reason.

Because every time we skipped one — something failed.
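The layered pipeline can be pictured as an ordered list of stages, each of which either passes the course on or stops the run. The sketch below shows only two stand-in stages and assumes a course is a dict with a "lessons" list; the stage bodies are placeholders, not the real layers.

```python
def validate(course: dict) -> dict:
    # Stand-in validation layer: refuse to continue if any lesson body is empty.
    empty = [lesson["title"] for lesson in course["lessons"] if not lesson["body"].strip()]
    if empty:
        raise ValueError(f"empty lessons: {empty}")
    return course


def enrich(course: dict) -> dict:
    # Stand-in enrichment layer: make sure every lesson has a quiz slot.
    for lesson in course["lessons"]:
        lesson.setdefault("quiz", [])
    return course


def run_pipeline(course: dict, stages) -> dict:
    """Run named stages in order; any stage can stop the run by raising."""
    for name, stage in stages:
        course = stage(course)
        course.setdefault("completed_stages", []).append(name)
    return course


STAGES = [("validation", validate), ("enrichment", enrich)]
```

Recording which stages completed makes it obvious when a layer was skipped, which is where the failures described above came from.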

🧩 The Hard Parts (That Don’t Show Up in Demos)

Structure Enforcement

We had to stop the model from improvising.

That meant:

fixed lesson frameworks
defined section types
controlled outputs

Targeted Improvement (Not Regeneration)

Regenerating everything just moved the problem around.

Instead:

identify weak sections
rewrite only those
preserve what already works
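Those three steps can be sketched as a single function. The score and rewrite callables are stand-ins (in practice they might be a quality heuristic and a model call), and the 0.6 threshold is an arbitrary value chosen for the example.

```python
def rewrite_weak_sections(sections, score, rewrite, threshold=0.6):
    """Rewrite only sections scoring below the threshold; keep the rest as they are."""
    result = []
    rewritten = []
    for index, section in enumerate(sections):
        if score(section) < threshold:
            result.append(rewrite(section))
            rewritten.append(index)
        else:
            result.append(section)
    return result, rewritten
```

Returning the indices of rewritten sections alongside the new list gives a reviewer a precise record of what changed and what was left alone.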

Cross-Course Consistency

This was harder than expected.

We needed to deal with:

duplicated concepts
mismatched terminology
uneven difficulty

Which meant introducing:

internal rules
pattern checks
consistency constraints
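As one example of a consistency constraint, a terminology check can flag lessons that use a variant where the course has a preferred term. This sketch assumes lessons are held as a title-to-text dict and that the glossary maps each preferred term to the variants it should replace; both the structure and the sample terms are hypothetical.

```python
import re


def find_term_drift(lessons: dict, preferred: dict) -> dict:
    """Map each lesson title to the non-preferred variants it uses.

    `preferred` maps a canonical term to the variants that should be replaced by it.
    """
    drift = {}
    for title, text in lessons.items():
        lowered = text.lower()
        hits = sorted(
            variant
            for variants in preferred.values()
            for variant in variants
            if re.search(rf"\b{re.escape(variant.lower())}\b", lowered)
        )
        if hits:
            drift[title] = hits
    return drift
```

A check like this is cheap to run across an entire course and catches the kind of drift that is hard to see when reading one lesson at a time.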

Compliance Awareness

This is where most tools fall down.

We needed:

alignment with recognised frameworks
the ability to adapt as guidance evolves
detection of weak or risky content

🧠 The Shift

At some point, we stopped thinking in prompts.

We started thinking in systems.

AI became one part of the process — not the solution.

🛠️ If You’re Building with AI

It’s very easy to focus on:

better prompts
better outputs

But the real leverage is in:

constraints
validation
iteration
control

Because generation is easy.

Making it usable is not.

🚀 Where This Landed

What started as “generate a course” became:

structure
validation
rewriting
enrichment
compliance
delivery

Not because we wanted more features —

but because without them, none of it worked.

That was the real lesson.

AI doesn’t remove complexity.

It just hides it — until it matters.