DEV Community: Simon Griffiths

The Disconnected Vector Store — Why AI Storage Needs to Move Inside the Boundary

Simon Griffiths — Thu, 16 Jul 2026 15:37:53 +0000

My previous article challenged the assumption that binary content belongs in object storage, and argued that the governance gap it creates becomes a security vulnerability when agents are the callers. This article examines the same problem in the infrastructure built specifically to serve those agents: the vector store and the agent memory store.

The pattern has become so common it barely registers as a decision. An organisation builds a RAG pipeline: chunk the documents, generate embeddings, store them in a vector database, retrieve by similarity at query time. An agent framework is added: conversation history goes to Redis, long-term memories to Pinecone or Chroma, retrieved context assembled at prompt time from multiple sources. The operational database — where the organisation’s actual data lives, with its access controls, its audit trail, its transactional integrity — sits to one side, connected to this machinery by application code that nobody has formally designated as an integration layer.

The vector store and the agent memory store are disconnected. They were deployed by default, not by design. And the consequences of that disconnection — from the transactional boundary, from the governance model, from the access policy — are what this article examines.

A Pattern That Normalised Before It Was Examined

There is a precise parallel with the saga pattern discussed in an earlier article. Sagas were a rational engineering response to a real problem — enforcing consistency across distributed microservices when a single transaction boundary was no longer available. They were sound in context, adopted widely, and became normalised to the point where the conditions that justified them stopped being examined in each new case.

The disconnected vector store followed the same trajectory, but faster. When organisations began building RAG pipelines in earnest, native vector support was either absent or immature in most general-purpose databases. The specialist vector databases (Pinecone, Weaviate, Chroma, Qdrant) filled a genuine gap. The workaround made sense. What did not follow was the re-examination of that choice as the gap closed. In-database vector support has since arrived in PostgreSQL via the pgvector extension, in Oracle via AI Vector Search, and in varying forms across most major database platforms. The external vector store remained anyway, because it was already there, the pipelines were built around it, and inertia is a more powerful architectural force than most engineering teams acknowledge.

The result is an infrastructure pattern adopted for reasons that no longer fully apply, carrying costs that were not part of the original calculus.

The Security Problem

The previous article established that object storage lacks fine-grained access control, and that an agent with access to an object store bucket can read everything in it — constrained only by coarse bucket-level policy. The disconnected vector store has exactly the same problem, applied to the data that agents use most directly: the context they reason on.

A vector store is a copy of data in a different form — embeddings derived from documents, records, and content that lives in the operational database. The access controls defined in that database — which identities can see which records, what is masked, what is subject to retention policy — do not automatically extend to the vector store. The embedding of a document that a given agent identity should not be able to retrieve may nonetheless be retrievable through semantic similarity search, because the vector store’s access model is less granular, or enforced against a different permissions model entirely.

This is not a theoretical exposure. A RAG pipeline that retrieves context from a disconnected vector store will return results based on semantic similarity alone. It has no awareness of the row-level security policies, data classifications, or customer-level access restrictions that the operational database enforces. An agent reasoning on that context is reasoning on data it may not have been entitled to see — and acting on it, writing results derived from it, propagating that entitlement violation downstream without any audit trail that would make the violation visible.
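
The exposure is easy to reproduce in miniature. The sketch below is illustrative only: the `Chunk` record, the `customer_id` field, the toy three-dimensional embeddings and both search functions are assumptions made for the example, not any vector database's API. It contrasts a search that ranks purely on similarity with one that applies the caller's entitlements before ranking.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    customer_id: str               # owner of the source record (hypothetical field)
    embedding: tuple[float, ...]   # toy vectors, not real model output


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


INDEX = [
    Chunk("contract-A", "cust-1", (0.9, 0.1, 0.0)),
    Chunk("contract-B", "cust-2", (0.8, 0.2, 0.0)),
    Chunk("faq-1", "public", (0.0, 0.1, 0.9)),
]


def search_disconnected(query, k=2):
    """Rank on similarity alone, with no knowledge of who is asking."""
    ranked = sorted(INDEX, key=lambda c: cosine(query, c.embedding), reverse=True)
    return [c.doc_id for c in ranked[:k]]


def search_governed(query, allowed_customers, k=2):
    """Apply the caller's entitlements first, then rank what remains."""
    visible = [c for c in INDEX if c.customer_id in allowed_customers]
    ranked = sorted(visible, key=lambda c: cosine(query, c.embedding), reverse=True)
    return [c.doc_id for c in ranked[:k]]


query = (1.0, 0.0, 0.0)
print(search_disconnected(query))                     # includes contract-A
print(search_governed(query, {"cust-2", "public"}))   # contract-A is never a candidate
```

An agent entitled only to `cust-2` and public content receives `contract-A` from the first function simply because it is the closest match. In the second, the restricted document never enters the ranking, which is the behaviour a row-level policy gives when the embedding sits behind it.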

The governance gap that the previous article identified in object storage is, if anything, more consequential here. Object storage holds the assets. The vector store holds the meaning extracted from those assets — the semantic representation that agents use to reason. Uncontrolled access to meaning is at least as dangerous as uncontrolled access to the raw content it was derived from.

The Cost of Disconnection

Beyond security, the costs of disconnection fall into three further categories: consistency, operational overhead, and agent memory.

There is a pre-AI version of this problem that I have seen in CRM systems. A telecoms call-centre application had a recommendation engine designed to prompt agents with upsell offers. The recommendations were driven from a cache that refreshed daily.

That sounded reasonable until a customer placed an order and then called back the same day, perhaps to change something or ask a follow-up question. The operational CRM knew the order had been placed. The recommendation cache did not. So the call-centre agent was prompted to offer the same product again to someone who had just bought it.

This was not catastrophic. No database was corrupted, and no system went down. But it was a very poor customer experience, because the company appeared not to know what it had just sold. The problem was not the recommendation engine in isolation. It was the gap between operational truth and derived context.

Consistency. An embedding is derived from data. When the data changes, the embedding is stale until the pipeline that regenerates it runs again. In a synchronous application workflow this lag is manageable — the window is short, the consequences are bounded, and the retrieval results are understood to reflect a near-current state. In an agent workflow the consequences are neither bounded nor understood. An agent querying a disconnected vector store for context will retrieve whatever the index contains, without any indication of whether that content reflects the current state of the source data or a state from days ago before a significant update. The embedding may surface a document that has since been revised, superseded, or deleted from the authoritative store. The agent reasons on that context and acts. The action is based on information the database no longer considers true.

This is not a theoretical edge case. It is the default behaviour of any architecture where the vector store is not updated in lockstep with the database — which is to say, almost every architecture where they are separate systems.
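
A minimal sketch of that lag, under stated assumptions: two dictionaries stand in for the operational database and the external index, and the `source_hash` field and `retrieve` function are hypothetical names introduced for illustration. A disconnected store does not perform this check by default; the point is to show what it would take for retrieval to know it is serving stale context.

```python
import hashlib
from dataclasses import dataclass


def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass
class IndexedChunk:
    doc_id: str
    text: str          # the text that was embedded
    source_hash: str   # fingerprint of the source at embedding time


# Authoritative store (stands in for the operational database).
database = {"policy-7": "Refunds are accepted within 30 days."}

# Derived index (stands in for the external vector store).
vector_index = {
    doc_id: IndexedChunk(doc_id, text, content_hash(text))
    for doc_id, text in database.items()
}


def retrieve(doc_id):
    """Return the indexed text plus a freshness status against the source."""
    chunk = vector_index[doc_id]
    current = database.get(doc_id)
    if current is None:
        status = "deleted"
    elif content_hash(current) != chunk.source_hash:
        status = "stale"
    else:
        status = "current"
    return chunk.text, status


print(retrieve("policy-7"))   # index and source agree
database["policy-7"] = "Refunds are accepted within 14 days."
print(retrieve("policy-7"))   # the old text still comes back, now flagged as stale
```

Note that the freshness check needs a round trip to the authoritative store on every retrieval. Without it, the agent receives the 30-day text with nothing to indicate that the database now says 14.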

Operational overhead. A disconnected vector store is a separate system to operate: separate index refresh schedules, separate monitoring, separate backup, separate restore, separate access policy management. Each of these is a place where things can fail, diverge, or be misconfigured independently of the operational database. The backup consistency problem identified in the previous article applies here too — a database restore and a vector index restore are separate operations with no native coordination, leaving the possibility of agents retrieving context from an index that no longer corresponds to the state of the data it was built from.

Agent memory. The problem compounds when agent memory is also externalised. An agent that stores its beliefs, its working context, and its understanding of prior interactions in a separate memory store — Redis, a document database, a specialist agent memory platform — is building a model of the world that can diverge from the transactional reality of the database it acts against. It may remember that an order was placed that was subsequently cancelled. It may carry forward a belief about a customer’s status that the database has since updated. It will act on those beliefs with confidence, because nothing in its architecture tells it that its memory and the database are out of sync.

Where the Source Data Lives

The previous article challenged the assumption that binary content belongs in object storage, and established that much of the content organisations externalise — text fields, JSON, incidental images, attachments, documents — belongs in the database. This has a direct implication for the vector store argument.

Where the source data lives in the database, the question of where to store the embedding is straightforward: the database. The embedding is derived from data that is already subject to the database’s access controls, transactional guarantees, and retention policies. Keeping the embedding alongside the source data is not an architectural preference — it is the natural consequence of keeping data together. The consistency problem largely disappears: when the source record changes, the embedding can be regenerated within the same transaction. The security problem disappears: the same access policy that governs the source data governs the embedding.
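
Regenerating the embedding within the same transaction can be sketched with Python's built-in `sqlite3` module. Everything here is an assumption made for illustration: the `documents` table, the JSON text column standing in for a native vector type, and `toy_embed`, which is a placeholder rather than a real embedding model. What the sketch shows is the property itself: if the embedding step fails, the change to the source row is rolled back with it.

```python
import json
import sqlite3


def toy_embed(text):
    """Placeholder for a real embedding model (an assumption for this sketch)."""
    return [float(len(text)), float(text.count(" "))]


def failing_embed(text):
    """Simulates the embedding step failing part-way through an update."""
    raise RuntimeError("embedding service unavailable")


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents ("
    " id INTEGER PRIMARY KEY,"
    " body TEXT NOT NULL,"
    " embedding TEXT NOT NULL)"  # JSON text stands in for a vector column
)


def insert_document(conn, doc_id, body, embed=toy_embed):
    with conn:  # commits on success, rolls back if anything raises
        conn.execute(
            "INSERT INTO documents (id, body, embedding) VALUES (?, ?, ?)",
            (doc_id, body, json.dumps(embed(body))),
        )


def update_document(conn, doc_id, new_body, embed=toy_embed):
    with conn:  # body and embedding change together or not at all
        conn.execute("UPDATE documents SET body = ? WHERE id = ?", (new_body, doc_id))
        vector = embed(new_body)  # may raise, which undoes the body update
        conn.execute(
            "UPDATE documents SET embedding = ? WHERE id = ?",
            (json.dumps(vector), doc_id),
        )


def read_document(conn, doc_id):
    body, embedding = conn.execute(
        "SELECT body, embedding FROM documents WHERE id = ?", (doc_id,)
    ).fetchone()
    return body, json.loads(embedding)


insert_document(conn, 1, "Order placed")
try:
    update_document(conn, 1, "Order cancelled by customer", embed=failing_embed)
except RuntimeError:
    pass  # the transaction was rolled back

print(read_document(conn, 1))  # still the original body with its matching embedding
```

There is no window in which a reader can see the new body paired with the old embedding, or the reverse. In practice embedding generation may be too slow to run inline, in which case a queue written inside the same transaction is a common compromise, but that is a design decision beyond this sketch.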

Where the source data is genuinely external — binary content that belongs in object storage for legitimate scale or access pattern reasons — the embedding still belongs in the database. The embedding is not the asset. It is a fixed-size numerical representation that participates in retrieval operations and carries metadata about its provenance. It belongs inside the governance boundary regardless of where the asset it was derived from lives.

The architectural principle is therefore consistent: embeddings belong in the database. The question of where the source content lives is a separate question, addressed in the previous article. The answer to that question does not change the answer to this one.

The Scale Objection, Honestly Assessed

The standard argument for the external vector store is scale. Specialist vector databases are optimised for approximate nearest-neighbour search at high dimensionality and high volume. General-purpose databases were not designed for this workload, and at sufficient scale the performance characteristics diverge.

This is true. It is also, in most enterprise deployments, less relevant than it is presented to be. The scale at which the performance gap between an in-database vector index and a specialist vector database becomes operationally significant is higher than most organisations are operating at. The benchmarks cited in favour of specialist vector databases tend to reflect workloads at the upper end of what enterprises encounter — and even there, the gap is closing as in-database implementations mature.

The more important point is about architectural default. The question is not whether there exist workloads at sufficient scale to justify a specialist vector database. There clearly are. The question is whether that justification should be the starting assumption or the conclusion of an analysis. Currently it tends to be the starting assumption, applied to workloads that have not been evaluated against it. The external store is deployed first; the scale case is made afterwards, if it is made at all.

Native Vector Support: Where Things Stand

pgvector brought vector search to PostgreSQL and has seen rapid adoption. It is a genuine capability, though with limitations at scale and in query sophistication that more recent implementations have addressed. Oracle AI Vector Search, now generally available, integrates vector storage and similarity search directly into the Oracle database engine — which means vector queries can be combined with relational filters, subject to Oracle’s access control model, and executed within the same transaction boundary as the rest of the database workload. Oracle AI Agent Memory takes the same approach to agent memory specifically: storing agent beliefs and working context inside the database rather than in an external store, with the consistency and governance properties that entails.
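
To make "combined with relational filters" concrete, here is a sketch of a single statement that filters on ordinary columns and orders by similarity. It is not pgvector or Oracle syntax: SQLite with a Python-registered `cosine_sim` function stands in for a database with a native vector type, and the `tickets` table, its columns and its toy two-dimensional embeddings are hypothetical.

```python
import json
import math
import sqlite3


def cosine_sim(a_json, b_json):
    """Cosine similarity over two JSON-encoded vectors (stand-in for a native operator)."""
    a, b = json.loads(a_json), json.loads(b_json)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


conn = sqlite3.connect(":memory:")
conn.create_function("cosine_sim", 2, cosine_sim)
conn.execute(
    "CREATE TABLE tickets (id INTEGER PRIMARY KEY, region TEXT, status TEXT, embedding TEXT)"
)
conn.executemany(
    "INSERT INTO tickets VALUES (?, ?, ?, ?)",
    [
        (1, "EU", "open", json.dumps([0.9, 0.1])),
        (2, "US", "open", json.dumps([1.0, 0.0])),     # closest match, wrong region
        (3, "EU", "closed", json.dumps([0.95, 0.05])),  # close match, wrong status
        (4, "EU", "open", json.dumps([0.1, 0.9])),
    ],
)


def similar_open_tickets(conn, query_vec, region, k=2):
    """One query: relational predicates narrow the rows, similarity orders them."""
    rows = conn.execute(
        """
        SELECT id
        FROM tickets
        WHERE region = ? AND status = 'open'
        ORDER BY cosine_sim(embedding, ?) DESC
        LIMIT ?
        """,
        (region, json.dumps(query_vec), k),
    ).fetchall()
    return [row[0] for row in rows]


print(similar_open_tickets(conn, [1.0, 0.0], "EU"))
```

Tickets 2 and 3 are the nearest neighbours by similarity alone, yet neither is returned, because the predicates and the ranking are evaluated against the same rows in the same statement. With an external store the equivalent usually means searching first and filtering afterwards in application code, or duplicating the filter columns into the index as metadata.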

These are not arguments for Oracle specifically. They are illustrations of a direction the market is moving in — one that closes the gap that disconnected external stores were filling. The relevant question for any architecture is whether the specialist external store is still justified given what in-database alternatives now offer, or whether it is a decision made when those alternatives did not exist and not revisited since.

The Default Trap

The deepest problem with the disconnected vector store is not technical. It is that it becomes permanent not because it is right but because it is there.

Infrastructure decisions that ship as defaults acquire institutional inertia at speed. The pipelines are built around them. The operational processes — index refresh schedules, monitoring, backup — are built around them. The teams are organised around maintaining them. Questioning the decision requires unpicking all of that, and the cost of unpicking is always visible in a way that the cost of staying put is not — until the security incident, the consistency failure, or the audit finding makes the invisible cost suddenly very legible.

This is precisely what happened with sagas and eventual consistency, and with the object storage default examined in the previous article. The pattern made sense when it was introduced. It became the default. The conditions under which it made sense stopped being examined.

Disconnected vector stores are earlier in the same trajectory. The moment to examine the decision is now, before the infrastructure calcifies further and before the agent workloads that depend on it scale to the point where the governance and consistency gaps become production problems rather than architectural concerns.

What This Means for Architecture

The disconnected vector store is not an inevitable feature of AI architecture. It is a default that emerged when the alternatives did not exist, and it has persisted through inertia rather than considered design.

The series has now traced four versions of the same underlying failure: transactional integrity pushed out of the database into saga-managed application code; binary content pushed out of the database into object storage without adequate governance; business rules pushed out of the database into application-layer validation; and AI storage deployed outside the database entirely. In each case the drift was rational at the moment it happened. In each case the agent era makes the cost of that drift considerably more visible.

The architectural principle is consistent across all four: integrity, governance, and state need to live where the caller cannot bypass them and where the transactional boundary can hold. For AI storage — embeddings, agent memory, retrieved context — that means inside the database, unless there is a specific, examined, and scale-justified reason why it cannot be.

That completes the diagnosis. Across this series one failure has recurred in every form: integrity, governance and state pushed out of the database into layers an agent can bypass. What follows is the answer. The next article turns from what agents break to what the architecture must become — and it begins where the exposure is sharpest, the write path. It is the first article of Designing the Stack, the point where this series' diagnosis becomes a design: the database stops trusting the caller and starts owning its own contract.

Don't let ChatGPT write your white paper — and don't let Claude, either.

Simon Griffiths — Thu, 16 Jul 2026 15:34:30 +0000

I gave the same document to two of the best language models available, with exactly the same prompt, and asked each to score it against ten editorial criteria on a scale of one to five.

Hardly any of the scores agreed.

That alone I could have shrugged off. What I couldn't shrug off came next. When I acted on one model's advice and then handed the improved draft to the other, the second model quietly undid the first one's work — and marked the result down against the version it had just been given. Swap back, and the same thing happened in reverse. Each model was, in effect, editing the other one out of the document. I spent a while trying to get each to act on its own recommendations automatically, chaining them together, before I admitted the obvious: this wasn't noise in the scoring. It was two different temperaments, pulling in two different directions.

It isn't a bug — it's a personality

The models I used were Claude Opus 4.8 and ChatGPT running GPT-5.6 Sol. Same document, same prompt, same ten categories. And once I stopped treating the disagreement as an error and started reading it as character, the whole thing made sense.

This turns out to be one of the most widely reported observations among people who write with these tools for a living, so I'll not pretend I discovered it. But it's one thing to read it in a comparison post and another to watch it happen to your own paper, paragraph by paragraph.

ChatGPT was the lawyer. Every statement had to be qualified. Every claim wanted hedging. Every generalisation had to be pinned down to something specific and defensible. The result was undeniably more accurate — and substantially more boring. Strong adjectives were sanded flat. Sentences accreted so many "in many cases" and "it may be that" clauses that the argument disappeared inside them. By the end it read like the small print at the end of a radio advert, the bit where a voice reads out the side effects at four times normal speed. (Yes, Americans, you have these too — the pharmaceutical ones are legendary.)

Claude was the advocate. It let the strong statements stand. It didn't reach for a qualifier every time I made a claim, and the narrative kept its shape and its momentum. But it also left gaps — openings where a sentence could be read as more than the evidence supported — and it was surprisingly reluctant to make big structural cuts. ChatGPT was far better at spotting duplication, collapsing three woolly sentences into one clear one, and sharpening executive summaries and conclusions where clarity matters most.

Why this matters more for a white paper than almost anything else

A white paper is a strange object. It presents itself as impartial — measured, evidenced, above the fray — and it is, in practice, a sales tool. The whole persuasive force of the format comes from the reader believing it isn't trying to persuade them. Which means it lives or dies on a narrow middle ground between credibility and advocacy.

Lean too far toward credibility and you get the ChatGPT failure mode: a document so hedged and so careful that it convinces no one of anything. It is accurate and it is inert. Lean too far toward advocacy and you get the Claude failure mode: a compelling read that overclaims just enough to get shredded by the one sceptical reader whose opinion actually matters.

Neither model, left to run on its own, will find that middle ground for you — because neither model has a middle ground. Each has a native lean. That's precisely why writing a persuasion document entirely inside the cautious model is a bad idea, and writing it entirely inside the confident one is a different bad idea. The credible-but-inert version doesn't lose the argument; it loses the reader before the argument starts.

The process that actually worked

Once I understood the two temperaments, the workflow more or less designed itself. Play them to their strengths and use each to cover the other's weakness.

Shape and strength first, in Claude Opus. A full pass to get the narrative arc right and to introduce the strong claims — the assertions the paper actually wants to make.
Substantiation next, in ChatGPT (GPT-5.6 Sol). A full pass to make every one of those claims defensible: specifics nailed down, generalisations qualified where they had to be, duplication removed.

Both of those were automatic, whole-document edits. Then I switched off the autopilot.

Manual, paragraph by paragraph. I read the whole thing slowly. Whenever I wasn't happy, I lifted the paragraph out, wrote my own revision, and pasted both into ChatGPT asking only for comment — not a rewrite. Then I made the edit by hand. (I write in Markdown, in Zed, so this cut-and-paste loop is frictionless.)

When I'd finished, I handed the whole document back to Claude Opus. The scores jumped. More tellingly, when I asked it to compare the before-and-after versions of the major edit, it agreed the paper was much improved — something the pure automatic passes had never once conceded.

Where the models will let you down

Three things are worth saying plainly, because they're the difference between using these tools and trusting them.

Don't take a model's agreement as validation. When ChatGPT told me my hand-edits were improvements, it was probably right — but agreeableness is a known feature of these systems, a tendency that keeps you nodding along and typing. Agreement from the more accommodating model is weak evidence at best. Treat it as a prompt to check, not a verdict.

The scores are directional, not absolute. After everything above, I'd earned a clean sweep of fives from Opus — and I'm not going to pretend that number means the paper is objectively excellent. It means the paper improved within one model's frame of reference. The value of the scoring was never the number; it was watching the number move. Read it that way and it's genuinely useful. Read it as a grade and you're fooling yourself with a metric this same experiment just proved to be personality-driven.

You have to check the facts yourself. Claude's extra round of recommendations was full of confident errors — most often product names. It particularly struggled where a product had been renamed, insisting the old and new names were two different things, and it made wrong assumptions about anything that had changed after its training cut-off. I checked every one by hand with a plain web search, corrected the model, and moved on. (I'll note, with some humour, that my own first draft of this piece got a model name wrong — so the failure isn't the machine's alone.)

Did it work? A note on what "validation" is worth

That Opus liked the finished paper proves little on its own — it shaped the narrative, so it was grading its own frame of reference. The more interesting test was to open a clean ChatGPT account: no memory of the edits, the opposite temperament, the cautious lawyer rather than the advocate, scoring the document cold. It came back mostly fives with a few fours — by a wide margin the highest it had rated any version of the paper.

That convergence is the strongest signal I have, and it's worth having. When the sceptic and the advocate both rate a document highly, you have probably found the middle ground. But I want to be exact about what it is and isn't. It is not an independent test. The paper had already been through ChatGPT's substantiation pass, so a clean ChatGPT is still, in part, grading work tuned to its own standards. Opening a fresh account removes the model's memory of the conversation; it does not remove the fact that the document was shaped to satisfy that model's editorial instincts in the first place. A genuinely independent check would come from a model that touched none of the editing — or, better, from a reader.

Which is the same place the whole exercise keeps landing: the only score that finally matters is whether the person you wrote it for acts on it.

The conclusion

If you want a legally watertight, exhaustively accurate document, ChatGPT is the better editor. If you want a strong, persuasive narrative, Claude Opus is. If you want both — which, for a white paper, you have no choice but to want — you need both, run in the right order, with a human holding the seam between them.

And that's the real finding. Two of the best models available, working in concert, still could not produce an impactful document on their own. They gave me enormous leverage: three hours of work on a twenty-page paper that would otherwise have taken far longer. But every paragraph still needed a human to stop and ask the only questions that matter — what is this actually trying to say? Is it as tight as it could be? Is it on message? And even now it isn't finished: reading this back, both models still have suggestions, and I'm still the one deciding which are worth taking and which are just the machine tidying away the very edges that give the thing its voice. A paper like this is never done — it's abandoned at a point you're willing to defend. The models can carry a claim a long way. They cannot decide whether it's worth making, or when to stop. That's still the job.

Appendix: technique and token economy

A few practical notes for anyone wanting to try this:

New thread per assessment. Each scoring run got its own conversation, so no prior edits contaminated the judgement.
Paste the full document into the chat — do not attach it as a source document. Attaching a file invokes a retrieval mechanism that pulls fragments on demand; it does not hold the whole document in context, which is exactly what you need for a coherent editorial pass. (This is my practical experience; the exact behaviour varies by product and changes over time — verify for your own tool.)
For manual edits, paste the paragraph plus your own revision, and ask for comment. Asking for opinion on a specific change gets fast, focused responses and keeps you in the author's chair. Asking for a rewrite hands the chair back.
Write in Markdown. The whole cut-and-paste editing loop depends on plain text you can move around without formatting getting in the way.

The Bridge Looked Fine Too

Simon Griffiths — Mon, 29 Jun 2026 15:45:16 +0000

This is the fourth post in Craft & Code, a short Friday series about what carpentry can teach us about AI, skill and the future of software. Last week I worried about where the next generation's judgement will come from. This week, why we may not notice it is missing until it is too late.

My father built me shelves in an alcove when I was small, and I mentioned in the first post that they may still be there for eternity. The other side of that story is the one every household knows: the shelf that is not quite right. The one that sags under a row of books, or sits a degree off true so that anything round rolls gently to one end. You do not need to be a carpenter to see it. A bad joint, a door that will not close, a shelf that dips — the material tells on the maker, immediately and to everyone.

That is the comforting version of the analogy, and the one I expected to write: carpentry is honest about its failures because they are visible, while software can look polished and be rotten underneath. A wonky shelf looks wonky; bad software looks finished. It is a tidy line, and there is real truth in it.

But it is only half the truth, and the more interesting half should worry us — because the moment you go up from a shelf to a serious piece of engineering, the comfort falls away completely.

Consider two of the most admired structures of the last century.

The Tacoma Narrows Bridge was designed by one of the leading suspension-bridge engineers of his day: elegant, slender, celebrated. It opened in the summer of 1940 and tore itself apart in the wind that November, twisting like a ribbon because the design had not reckoned with how the deck would behave aerodynamically. Nobody had seen a wonky bridge; it looked magnificent. The flaw was real, fundamental, and invisible until the wind found it.

The Citicorp Center in New York, finished in 1977, was a triumph of structural engineering, raised dramatically on great columns at the midpoints of its sides. Only after it was complete and occupied did its own engineer rework the numbers and realise the building was vulnerable to diagonal winds the original calculations had missed — made worse because some welded joints had been changed to bolted ones during construction. On paper, a storm of the kind that arrives every couple of decades could have toppled it. The fix was carried out in secret, at night, while the building was full of people who had no idea anything was wrong.

These were not amateurs cutting corners. They were experts at the very top of their craft, with every incentive and resource to get it right. And they still shipped beautiful things with fatal flaws nobody could see by looking. That is the part the wonky-shelf story leaves out. Visible failure is the privilege of simple work. The more complex and ambitious the thing, the more its flaws can hide beneath a surface that looks not just acceptable but superb.

Software is the most complex, most ambitious, least physically constrained engineering most of us will ever touch. It belongs firmly in the company of the bridge and the tower, not the shelf.

This is where the analogy bites, and where I want to be careful, because the democratisation I keep returning to in this series is genuine and worth defending. More people can build useful things now than at any point in my career — internal tools, automations, dashboards, prototypes that solve a real problem for a real team. Much of it is good, and much of it would never have existed otherwise. What follows is not an argument that amateurs should not build software. The experts got caught too; that is rather the point.

Software hides its failures even better than a building does, and the stakes are not always visible at the moment of creation. The screen renders. The login works. The form submits and shows a friendly tick. The demo lands and the room nods. None of that tells you whether the permissions model is sound, whether the data model survives a year of growth, whether the security boundaries hold, whether anyone could recover the system at three in the morning when it falls over. A bridge that is unsound will, at least, eventually announce itself by falling down in public. Software can carry a latent, fatal flaw indefinitely, silently, and then express it everywhere at once — because unlike a bridge it is copied, deployed, and handed to thousands of people at no cost. The same properties that make software miraculous apply in full to its mistakes.

So the ceiling is not defined by who is building. It is defined by what the software touches: customer data, money, legal obligations, regulated workflows, security boundaries, large numbers of strangers depending on it, and the long grey grind of having to keep it alive for years. Below that ceiling, build away; if it breaks, the blast radius is small. Above it, a thing that merely looks finished is not an asset but a liability in the costume of one.

Here is the part I find genuinely unsettling, not merely cautionary.

Everything I have described so far happened with expert humans firmly in charge. The bridge and the tower, and more recently the cloud. In October 2025 a single fault in Amazon's US-EAST-1 region took a large part of the internet down for most of a day. The root cause was not a hardware failure or an attack; it was a latent flaw in an automated system that managed DNS records, which quietly produced an empty record and cascaded across dozens of dependent services. Nine days later Microsoft's Azure suffered its own global outage, in which a bad configuration change sailed straight through the very safeguards built to catch it, because those safeguards themselves had a defect. These are among the most sophisticated engineering organisations that have ever existed, and in both cases the automation built to prevent failure was the thing that failed, in a way nobody saw coming until it had already happened at global scale.

Now ask what happens as we hand more of the building, and eventually more of the operating, to AI systems whose own failure modes we do not yet understand — with fewer and fewer experienced humans in the loop, for all the reasons I worried about last week. I am not claiming to know the answer; honestly, it is an unknown. But the two outages are a foretaste of the shape of it: systems too complex for any one person to hold in their head, defended by automation that can itself harbour the fatal flaw, producing a surface that looks entirely healthy right up until the moment it does not. AI does not invent this problem. It widens it, accelerates it, and removes some of the people who used to be standing there to catch it.

And there is one more thing software has that no bridge or tower ever did: an enemy.

Every failure I have described so far came by accident — a bad assumption, an unlucky configuration, a latent flaw nobody saw. A suspension bridge is only ever attacked by the wind, and the wind does not learn, bear a grudge, or go looking for weaknesses on purpose. Software is attacked by people: intelligent, motivated, well-resourced people whose entire occupation is to hunt for exactly the hidden, plausible-looking flaws this whole post has been about. That turns the question from "might this break?" into "might this be broken, deliberately, at the worst possible moment?"

And the blast radius is now national. Ransomware has shut down hospitals, fuel pipelines, and the back offices of whole companies and public authorities — rarely through exotic genius, usually by finding the one unsound assumption nobody went back to check. We have already seen how fragile the concentration is even without an attacker: when Spain's card payments collapsed in 2025 — once in a nationwide power blackout, and again in the very cloud fault I described above — an entire country was thrown back to cash within minutes. Both were accidents. That is the unsettling part. If an accident can flatten a nation's payments for a day, the version where someone is actively trying does not take much imagination.

AI sits on both sides of this. It will help the defenders, certainly. But it also arms the attackers, and it pours out code whose hidden flaws no experienced human may ever have reviewed — more surface that looks finished, produced faster than anyone can probe it, in a world where someone is always probing.

The judgement that matters here was never the magic of typing the right syntax. Anyone, and now any tool, can produce syntax that runs and a screen that looks finished. The judgement is knowing where things break: which assumptions are load-bearing, which corners cannot be cut, what a system does on its worst day rather than in its demo.

And even that is not a guarantee, which is the most sobering lesson of all. The engineer of the Citicorp Center was brilliant, and he still very nearly presided over a catastrophe. What saved the building was not that he got it right first time. It was that he kept looking after everyone else had stopped — that someone with the experience to know what to check went back and checked, long after the ribbon had been cut and the building declared a success.

That habit, of continuing to scrutinise the thing that already looks finished, is precisely what a fast, confident, plausible AI surface discourages. The work looks done. Why would you go back? A wonky shelf at least nags at you from across the room. The bridge looked fine too — right up until it didn't.

Next week, the last in the series: the lecturer. What my father's final career taught me about where craft goes when the tools take over the doing.

Who Asked That? Identity, Accountability and the Agentic Query

Simon Griffiths — Tue, 23 Jun 2026 13:48:32 +0000

The previous article in this series argued that the application layer has always been the real control layer for enterprise data — and that agents bypass it entirely. This article examines one of the most significant consequences of that bypass: the loss of authorisation control over what users can do and, perhaps more critically, what they can see.

This is not primarily a security argument, though security is implicated. It is an argument about the fundamental mechanics of how data access has always been governed — and why those mechanics fail silently when agents enter the picture.

The Application Knows Who You Are

Every enterprise application implements authorisation. Not just authentication — not just "are you allowed in" — but a continuous, contextual set of rules governing what each user can do and what data they can see.

There is an older version of this problem that I remember from the early client-server period. In the green-screen world, user identity was usually built tightly into the terminal application. It was not perfect, but the system had a clear idea of which operator was using which function.

The move to Visual Basic front ends changed that. These were Windows 3.1 desktops, often before network logins were a reliable part of the environment. If someone could sit at a PC and start the application, tying the action back to a real individual was much harder than it sounds now. Physical access to the machine did far too much of the security work.

Modern authentication has improved that enormously. But the underlying lesson still matters: if identity is only strong at the application layer, then anything operating behind that layer can collapse back into a shared technical account. The database may record that a service user queried or changed data, but that is not the same as knowing which person caused it, whether they were entitled to do so, and why the action happened.

A sales representative sees their own accounts and opportunities, not the whole customer base. A clinician sees their own patients' records. A regional manager sees their region's performance, not the company's. A payroll administrator can see salary data; almost nobody else can. These restrictions are not incidental features — they are fundamental to how the application functions correctly and safely.

In almost every case, these restrictions are implemented in the application layer. The application constructs queries that reflect the user's scope. It filters results before they are displayed. It withholds fields that the user has no business seeing. The database, for its part, returns whatever it is asked for. It has no independent knowledge of what any given user should or should not see. It trusts the application to ask the right questions.

Analytic and business intelligence tools follow the same pattern. Most modern BI platforms implement their own row-level security and data access models — but these are embedded in the tool, governed by the tool, and enforced by the tool. They work reliably within that environment. Step outside it, and they offer no protection at all.

The pattern is consistent across applications, across tooling, across the entire data access landscape: authorisation controls live in the access layer. The data layer enforces almost nothing.

The Agent Does Not Know the Rules

An agent operating autonomously has no knowledge of these rules. It was not built with your application's authorisation model in mind. It does not know that this user should only see their region, or that that field should never be exposed to anyone below director level, or that these records are restricted pending a legal hold.

It will ask the database for what it needs to complete its task. The database will answer. The filters that would have been applied by the application will not be applied, because the application is not in the loop.
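A minimal sketch of that difference, using Python's built-in sqlite3 module. The accounts table, its columns and both functions are hypothetical, invented purely for illustration. The point is only that the regional scope exists in one function and nowhere in the database itself.

```python
import sqlite3

# Hypothetical data: three customer accounts across two regions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, region TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO accounts (region, name) VALUES (?, ?)",
    [("north", "Acme"), ("north", "Birch"), ("south", "Cedar")],
)


def app_list_accounts(conn, user_region):
    """What the application does: scope the query to the signed-in user."""
    rows = conn.execute(
        "SELECT name FROM accounts WHERE region = ? ORDER BY name",
        (user_region,),
    )
    return [name for (name,) in rows]


def agent_list_accounts(conn):
    """What a caller outside the application can do with the same connection.

    No scope is applied, because the scope only ever lived in application code.
    """
    rows = conn.execute("SELECT name FROM accounts ORDER BY name")
    return [name for (name,) in rows]


print(app_list_accounts(conn, "north"))  # ['Acme', 'Birch']
print(agent_list_accounts(conn))         # ['Acme', 'Birch', 'Cedar']
```

Neither call fails, and neither looks suspicious in isolation. The database answers both in exactly the same way.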

The result is not a dramatic breach in the conventional sense. There may be no attacker, no exploit, no malicious intent. The agent may be doing exactly what it was asked to do. But the data it accesses — and potentially acts upon, summarises, or passes to another system — may be far beyond what the originating user was ever entitled to see. The absence of restriction is the vulnerability, and it was never visible because the restriction was never in the database.

The Stakes Are Not Uniform

Not all data carries the same consequences when access controls fail. It is worth being direct about the categories where the stakes are highest.

Commercially sensitive data presents a particular exposure because it is typically governed entirely by internal policy, not external regulation. Pricing models, margin structures, contract terms, strategic plans, acquisition targets — access to this data is controlled because the organisation chose to control it, enforced through application logic that reflects those choices. There is no external framework compelling the database to protect it. An agent that can query freely across the data estate may assemble a picture that no individual user was ever intended to have — not by accessing any single restricted record, but by combining data that was never meant to be seen together.

Regulated data raises the stakes from commercial damage to legal liability. Financial data governed by SOX, health information under HIPAA, personal data under GDPR — in each case, the regulatory framework imposes obligations not just to protect the data but to demonstrate that it was protected. The application layer was the mechanism through which compliance was implemented and evidenced. An agent that bypasses it does not just create an access risk; it may place the organisation in breach of its compliance obligations, with no audit trail capable of demonstrating otherwise.

Highly sensitive personal data warrants separate mention because the regulatory and ethical consequences of inappropriate access are severe and asymmetric. The special categories recognised under GDPR — health data, biometric data, racial or ethnic origin, political opinions, sexual orientation, and others — attract the highest levels of scrutiny and the most significant penalties. The burden of demonstrating lawful access is high. An agent that touches this data without a clear, auditable, user-level justification is not just a compliance risk; it is a harm risk, with consequences that extend well beyond the organisation.

The Audit Trail Proves Nothing

There is a further problem that compounds all of the above. When an agent accesses data through a service account or API credential — as is typical — the database audit log records that access against that credential. It does not record the end user on whose behalf the agent was acting. It does not record what the agent did with the data. It does not record whether the access was within the scope of what the user was entitled to see.

An audit trail that cannot answer these questions is not an audit trail in any meaningful sense. It is a log of connections and queries, which is useful for diagnosing technical problems but largely useless for demonstrating compliance, investigating incidents, or answering the question that regulators and data subjects will ask: who accessed this data, why, and were they entitled to?
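One way to picture the gap is as a record with missing fields. The sketch below is an assumption-laden illustration, not any real database's audit format: the AuditEvent type, its field names and the sample values are all hypothetical. It simply checks which of the regulator's questions a given log entry is able to answer.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AuditEvent:
    credential: str                      # the account the database saw connect
    statement: str                       # the query that was run
    on_behalf_of: Optional[str] = None   # the end user, if that context was passed on
    purpose: Optional[str] = None        # why the access happened, if recorded


def unanswered_questions(event: AuditEvent) -> List[str]:
    """Return the questions this log entry cannot answer."""
    missing = []
    if event.on_behalf_of is None:
        missing.append("who")
    if event.purpose is None:
        missing.append("why")
    return missing


# A typical entry when an agent connects through a shared service account.
typical = AuditEvent("svc_agent", "SELECT * FROM salaries")

# The same access with user context carried through (hypothetical values).
enriched = AuditEvent(
    "svc_agent",
    "SELECT * FROM salaries",
    on_behalf_of="payroll_admin_17",
    purpose="monthly payroll run",
)

print(unanswered_questions(typical))   # ['who', 'why']
print(unanswered_questions(enriched))  # []
```

Without the "who", the third question, whether they were entitled, cannot even be asked.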

The application always knew the answers to those questions. It encoded them in the queries it constructed and the filters it applied. That knowledge does not transfer automatically to an agent operating in its place.

The Database Was Never the Last Line of Defence — But Now It Needs to Be

The controls described in this article were not placed in the application layer through negligence or oversight. They were placed there deliberately, because the application was the right place for logic that was contextual, user-specific, and closely tied to the application's own data model. The database was never designed to be the last line of defence, because it was never expected to need to be.

That expectation is no longer safe. When agents can access data directly, bypassing the access layer entirely, the question of where authorisation is enforced becomes urgent. The database cannot continue to assume that someone upstream has already applied the appropriate filters. In an agentic world, there may be nobody upstream.

What that means for database architecture — for identity models, for access control, for audit infrastructure — is the subject of subsequent articles in this series. The starting point is recognising that the rules which governed data access for decades were written in application code, and that code is no longer guaranteed to be in the room.

The Agent at the Gate — Why Agentic Access Breaks the Unwritten Rules of Database Security

Simon Griffiths — Thu, 18 Jun 2026 19:32:19 +0000

The API series looked at the contract problem from the outside: what happens when agents call interfaces that were designed around known consumers, shared context, and human judgement. This series looks from the other direction. What happens when the data boundary itself was relying on the application layer to enforce rules the database never knew?

There is an assumption buried deep in almost every enterprise data architecture. It is so fundamental that most organisations have never written it down, let alone questioned it. The assumption is this: the application is the gatekeeper.

Everything that stands between an end user and raw data — the business rules, the validation logic, the integrity checks, the transactional boundaries, the data quality controls — lives in the application layer. The database stores the data. The application controls what happens to it.

That assumption is about to be tested as it has never been tested before.

This Is Not a New Idea — We Abandoned It

It is worth remembering that this was not always the accepted model. Thirty or forty years ago, in the early era of enterprise relational databases, the database was the control layer. Business rules lived in stored procedures. Constraints enforced data quality. Triggers maintained integrity. The logic was close to the data, and the database was the authority.

I remember dealing with a version of this in the old Visual Basic client-server days, before REST APIs and middle tiers became the normal shape of enterprise systems. A VB front end would connect to the database over ODBC, often using a schema username and password embedded somewhere in the application configuration or code. Because the application needed to update data, the obvious answer was to give that schema write access to the tables.

It worked, but it was a security nightmare. If those credentials leaked, the database was effectively open. The mitigation was to remove direct table updates and force all changes through PL/SQL procedures. The VB application could call governed operations, but it could not simply write whatever it liked to the underlying tables. We also used a token check to make sure the call was coming through an expected path.

That was not elegant by modern standards, but the principle was sound: the database did not trust the caller with raw write access. It exposed controlled operations instead.

That model was deliberately dismantled, and not without reason. Stored procedures were difficult to test, painful to version control, and resistant to the kind of agile development that application frameworks enabled. As distributed architectures emerged, database-centric logic became a bottleneck — hard to scale, hard to change, hard to own across teams. The application layer offered speed, flexibility, and separation of concerns. Moving logic out of the database was a conscious, reasoned decision made by architects who understood the trade-offs.

The consequence, largely invisible at the time, was that the database became thinner. Schemas relaxed. Constraints were dropped in favour of application-layer validation. Business rules migrated into service layers and APIs. The database became very good at storing and retrieving data, and progressively less involved in deciding what that data should look like or how it should behave.

For decades, this worked. The application, or later the middle tier, was always there. It was the only door into the data, and it was a door with rules. But the old lesson did not disappear. If the middle tier is the thing enforcing the rules, then anything that bypasses it reopens the same class of risk.

The Application Layer Is No Longer Guaranteed

AI agents change this fundamentally — not as a theoretical risk, but as a practical architectural reality that is already unfolding.

An agent operating autonomously may not follow the workflow the application was designed to enforce. It may not submit a form, trigger a validation routine, or invoke the service layer in the expected sequence. Depending on how it has been integrated, it may query and write to data through tools, service credentials, generated SQL, or APIs that expose lower-level operations than the original application ever intended. The orchestration layer that manages the agent may impose some constraints, but it is not the application that encoded your business rules. It knows little about your data model's history, your organisation's integrity requirements, or the assumptions baked into your schema over twenty years of development.

This is not a security problem alone, though security is part of it. It is a much broader loss of control. Consider what actually lives in the application layer of a typical enterprise system:

Business rules — eligibility checks, approval workflows, state transition logic, pricing rules. None of this is in the database. The database will accept a record that violates every one of these rules if the application does not intercept it first.
Data quality controls — format validation, referential checks beyond foreign keys, cross-field consistency rules. An agent writing directly to the database bypasses all of them.
Transactional boundaries — the application defines what constitutes a complete operation. The database enforces atomicity within a transaction, but it is the application that decides what belongs in that transaction. An agent may write partial state without ever completing the logical operation.
Audit and compliance hooks — in many systems, compliance logging is implemented in application code. Direct database access is invisible to it.
End user identity — perhaps most critically, the database typically sees a single application credential. It has no knowledge of which end user is behind a given operation. The application knew. The agent, acting autonomously through a service account or API credential, strips that context away entirely.

Each of these, individually, represents a gap. Together, they represent a control framework that simply ceases to function when the application layer is bypassed.
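The transactional boundary item in the list above is perhaps the easiest to demonstrate. The following is a sketch under stated assumptions: a hypothetical balances table in an in-memory SQLite database, and two hypothetical callers. The database guarantees that each transaction is atomic, but only the application knows that a transfer needs two statements inside it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE balances (account TEXT PRIMARY KEY, amount INTEGER NOT NULL)")
conn.executemany("INSERT INTO balances VALUES (?, ?)", [("a", 100), ("b", 0)])
conn.commit()


def total(conn):
    """Sum of all balances; a transfer should never change this."""
    return conn.execute("SELECT SUM(amount) FROM balances").fetchone()[0]


def app_transfer(conn, src, dst, amount):
    """The application's idea of a complete operation: both legs or neither."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "UPDATE balances SET amount = amount - ? WHERE account = ?", (amount, src)
        )
        conn.execute(
            "UPDATE balances SET amount = amount + ? WHERE account = ?", (amount, dst)
        )


def agent_debit_only(conn, src, amount):
    """A caller that knows the table but not the rule.

    The single statement commits cleanly, so the database sees nothing wrong.
    """
    with conn:
        conn.execute(
            "UPDATE balances SET amount = amount - ? WHERE account = ?", (amount, src)
        )


app_transfer(conn, "a", "b", 30)
total_after_app = total(conn)    # 100: money moved, nothing lost

agent_debit_only(conn, "a", 30)
total_after_agent = total(conn)  # 70: a valid transaction, an incomplete operation
```

Nothing here raises an error. The partial state is only visible to something that knows what a complete transfer was supposed to look like.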

The Schema Problem

The risk is compounded by a decade of architectural choices made in the name of flexibility. The shift toward flexible schemas, in the form of JSON columns, document stores, schema-on-read and entity-attribute-value (EAV) patterns, was a legitimate response to the pace of change that modern development demands. If the application is controlling what goes in, some schema flexibility is manageable. The application provides the structure that the schema does not.

Remove the application, and flexible schema becomes a liability. An agent writing to a document store or a loosely-typed column meets almost no structural resistance. Nothing breaks at write time. The data will be accepted. The damage will be silent, cumulative, and potentially very difficult to detect or reverse.

There is a contrarian case worth making here: the organisations that resisted schema flexibility — that maintained strong, constrained, relational data models even as the industry moved away from them — may now find themselves better protected. Rigid schemas, often criticised as obstacles to agility, turn out to be a form of structural defence. They cannot enforce business rules, but they can at least reject structurally invalid data.
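A small sketch of both halves of that claim, again using SQLite from Python. The two orders tables and the sample values are hypothetical. The flexible table stores a document in a plain text column, while the strict one declares types and CHECK constraints.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Flexible: one loosely-typed column, structure left to whoever writes to it.
conn.execute("CREATE TABLE orders_flexible (doc TEXT)")

# Strict: structure declared in the schema and enforced by the database.
conn.execute(
    """CREATE TABLE orders_strict (
           id       INTEGER PRIMARY KEY,
           quantity INTEGER NOT NULL CHECK (quantity > 0),
           status   TEXT NOT NULL CHECK (status IN ('open', 'shipped', 'cancelled'))
       )"""
)

# The flexible table accepts a structurally nonsensical order without complaint.
conn.execute(
    "INSERT INTO orders_flexible (doc) VALUES (?)",
    ('{"quantity": -5, "status": "banana"}',),
)
flexible_rows = conn.execute("SELECT COUNT(*) FROM orders_flexible").fetchone()[0]

# The strict table rejects the same values at write time.
try:
    conn.execute(
        "INSERT INTO orders_strict (quantity, status) VALUES (?, ?)", (-5, "banana")
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# But it still accepts a well-formed row that may break a business rule,
# such as an order marked shipped that was never approved.
conn.execute("INSERT INTO orders_strict (quantity, status) VALUES (?, ?)", (1, "shipped"))
strict_rows = conn.execute("SELECT COUNT(*) FROM orders_strict").fetchone()[0]

print(flexible_rows, rejected, strict_rows)  # 1 True 1
```

The constraint catches the malformed row and nothing more. Whether a shipped order should exist at all is a question the schema has no way to ask.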

The Wheel Turns

The irony is sharp. We are rediscovering the value of database-enforced logic at the precise moment the application layer is being bypassed. The thick database model, with constraints, rules and integrity enforcement close to the data, is looking considerably more prescient than it did in 2010.

The historical objections to that model were real, but they are weakening. The two most powerful arguments against embedding logic in the database were the development cost and the rigidity of strong schemas. Both depended on the assumption that humans, working at human pace, would bear the cost of writing and maintaining that logic.

AI-assisted development changes that calculus. Complex constraints, validation rules, and schema definitions that would once have taken significant time to write and maintain can now be generated, tested, and evolved far more rapidly. The friction that made database-enforced logic impractical is reducing precisely when the need for it is increasing. Even strong, well-defined schemas — historically the enemy of development speed — become less of an obstacle when the tooling can absorb much of the cost of evolving them.

This does not mean returning to monolithic stored-procedure architectures. It means taking seriously the question of where control should live when the application can no longer be assumed to be in the loop.

The Question Organisations Need to Ask Now

How much of what you believe to be data integrity actually lives in your database — and how much lives in the application you are about to bypass?

For most organisations, the honest answer requires some clarity about intent. Application architects knew exactly how thin the database was — thinning it was the goal. Moving logic into the application layer was deliberate, considered, and successful on its own terms. This was not an oversight; it was a design philosophy prosecuted with discipline over many years.

The problem is not that those decisions were wrong. The problem is that they were right for a world in which the application could be assumed to remain in the loop. That assumption is weaker now. The context that made a thin database the correct answer has shifted, and decisions that were rational then need to be revisited now.

The subsequent articles in this series examine the specific dimensions of this problem — identity and accountability, transactional integrity, schema and storage models, governance and compliance — and what the database management system needs to become in order to meet it. But the starting point is recognising that the problem exists, that it is structural, and that it has been building for thirty years.

The agent is at the gate. The gate was not designed for this.

Your Email Account Is the Master Key

Simon Griffiths — Wed, 17 Jun 2026 06:58:50 +0000

Most people think of email as somewhere messages arrive.

That is no longer the important part.

Your main email account is the place other systems use to decide whether you are still you. Banks, shopping sites, cloud services, insurance companies, phone providers, password managers, social networks, travel companies and government services all use email as part of the recovery chain. If you forget a password, reset an account, confirm a device, receive an alert or prove ownership, the route often comes back through email.

That makes your email account much more important than an inbox. It has become recovery infrastructure for the rest of your digital life.

I Learnt This The Awkward Way

My original primary email address was s.griffiths@virgin.net, from the dial-up days.

At the time, that was perfectly normal. Your internet provider gave you an email address, you used it, and over time it became part of your life. It went on forms. It became the address friends knew. It became the place services used to contact you.

Then I moved away from the dial-up service and discovered the catch: my primary email address depended on that service continuing.

Fortunately, my parents still used Virgin.net for dial-up, so I managed to keep the address alive for a few more years through their account. That bought me time to move. Unfortunately, I moved to Yahoo next.

That solved the immediate problem, but it was not a permanent answer. As Yahoo declined, I realised I needed to move again, this time to Gmail, which felt like a more durable long-term home for my digital identity.

Years later, I hit the ghost of that first decision. I tried to sign up to a service using my phone number and was rejected because the number was still linked to my old virgin.net email address. Virgin.net was long gone. I could not access the mailbox. The vendor insisted on using the non-existent email account as the proof point, so I had no easy way to disconnect the phone number.

That is the problem. An email address can disappear from your life but remain embedded in someone else's recovery process.

Email Became The Spare Key Everyone Trusts

The shift happened gradually enough that most of us did not notice.

Email started as correspondence. Losing access was annoying, but not always catastrophic. You might miss messages, lose old conversations, or have to tell people your new address.

Now email is part of the machinery of identity.

If someone controls your primary email account, they may not need to know where you bank. They can search your inbox and find out. They may not need to guess which cloud service you use. The alerts, receipts and old sign-up messages will tell them. They may not need to break into every account at once. They can reset one, wait, read the notifications, and move carefully.

That is why this matters. A compromised email account is not only a privacy problem. It can become a map of your digital life and, in many cases, a route into it.

The risk is not always dramatic. An attacker may not lock you out immediately. They may add a forwarding rule, keep reading quietly, and wait for useful messages to arrive. They may delete warnings before you see them. They may use old receipts and account emails to build a picture of where to try next.

This is why I would treat your main email account as one of the foundations of your digital life, not just somewhere messages arrive.

Some Email Accounts Are Too Fragile For This Job

This is where I am going to be opinionated.

For most people, the main recovery email for banking, cloud storage, phone accounts, government services and password managers should sit with a provider that has strong modern account protection and is likely to be around for the long term.

In practice, that often means Google, Microsoft or Apple. Not because they are perfect, and not because you should trust any large technology company blindly. This is not a brand recommendation. It is a judgement about security maturity, recovery processes, passkeys or strong two-factor options, device alerts, and the likelihood that the account will still exist in the form you need a decade from now.

I would be much more cautious about using an ISP mailbox, an old work address, an old Yahoo-style account, a small provider you barely think about, or a custom domain you set up years ago and now only half remember.

Some of those services can be secured. That is not the point. The question is whether they are strong enough, durable enough and recoverable enough to act as the foundation for everything else.

An email account that still works is not necessarily an email account you should trust as your digital master key.

The Forgotten Domain Problem

There is another version of this problem that catches technically aware people more often than non-technical ones: the old personal domain.

If your main email address is something like you@yourdomain.com, then your security does not only depend on the mailbox. It also depends on the domain name, the renewal payment, the account where the domain is managed, and whoever still knows how it is configured.

That sounds technical, but the risk is simple.

If the domain expires, is misconfigured, or falls under someone else's control, email for that domain may stop reaching you. Worse, it may eventually start reaching someone else. If that address is used for password resets, bank alerts, cloud accounts or your password manager, the domain has become part of your identity system.

For a newsletter address, that may be fine. For the account that protects your bank, cloud storage, phone provider and family photos, it is only safe if you actively manage the domain, protect the registrar account, and know exactly how recovery works.

The test is simple: if losing that domain would make it hard to prove who you are to important services, either manage it as critical infrastructure or do not use it as your keystone email address.

Five Things To Do Today

This does not need to become a weekend project. Start with the account that matters most.

1. Choose Your Keystone Email Account

Decide which email account should be the recovery address for your important services.

That means your bank, credit card, phone provider, cloud storage, password manager, government services, insurance, main shopping accounts and anything you would panic about losing.

For most people, this should be one strong mainstream account, not a collection of old addresses accumulated over twenty years.

There is a reasonable argument for having a separate account used only for these critical services, rather than using the same address for everyday mail, newsletters and shopping receipts. I have considered doing this myself. It reduces noise and makes the account's purpose very clear.

But it only works if you actually look after it. A separate account that you rarely check, forget how to recover, or leave tied to an old phone number is not safer. For most households, the practical answer is one properly secured main account, or one dedicated keystone account that is protected strongly and checked regularly. What I would avoid is a half-forgotten spare mailbox that quietly becomes the recovery route for everything important.

2. Move Critical Accounts Away From Fragile Addresses

If important accounts still point to an ISP email address, an old work address, a forgotten domain, or a mailbox you would struggle to recover, move them.

Do not try to fix everything at once. Start with the accounts that matter most: banking, phone provider, password manager, cloud storage and government services.

Keep the old address for newsletters, receipts and low-risk services if you want. Just stop using it as proof of identity.

3. Turn On Strong Sign-In Protection

Your keystone email account should have more than a password protecting it.

Use a passkey if the provider supports it. If not, use an authenticator app or a physical security key. SMS codes are better than nothing, but I would not choose them as the main protection for the account that protects everything else.

Also save recovery codes somewhere offline. Printed and stored with important documents is good enough for many households. The point is that you should not need access to the same email account to recover the email account.

4. Check For Quiet Ways Someone Could Still Be Reading

Changing the password is not always enough.

Check whether your email account has forwarding rules you do not recognise. Look at the devices currently signed in. Review old connected apps and mail clients. Remove old app passwords if your provider shows them.

You do not need to understand every technical detail. The question is simple: is there any route by which someone, or some old app, could still be reading this mailbox without you noticing?

If you do not recognise it, remove it.

5. Fix Recovery Before You Need It

Check the recovery phone number and recovery email address on your keystone account.

Make sure they are current and under your control. If the recovery address points to an old mailbox you barely use, you have just moved the weakness one step sideways.

Also think about your password manager. If you need your email to recover your password manager, and your password manager to recover your email, you have built a loop. Break that loop with offline recovery codes, an emergency kit, or a written record stored safely at home.

Good Enough For A Household

Good enough does not mean perfect.

For most households, good enough looks like this: one strong main email account, protected with a passkey or proper two-factor authentication, with clean recovery details, no unknown forwarding rules, no mystery devices, and recovery codes stored somewhere offline.

That is not glamorous. It is not advanced cyber security. It is basic household resilience.

If you secure only one account properly this week, make it your main email account.

Because your bank may hold your money, your cloud account may hold your photos, and your phone may hold your messages. But your email account is often the place they all turn to when they need to decide whether you are still you.

APIs Expose Data, Not Meaning

Simon Griffiths — Thu, 11 Jun 2026 17:43:34 +0000

The previous article drew a line from SOA to the present. Structural contracts were never enough, because they defined how services communicated, not what they meant.

Agents push that gap into the foreground.

This article is about what the gap looks like in practice, and why it is harder to close than it appears.

The Schema Isn't The Contract

When we talk about API quality, we tend to talk about schemas.

Is the response well structured? Are the types correct? Is the contract versioned? Is there an OpenAPI specification? These things matter. They are part of the discipline. But they are not the whole contract.

A schema defines shape. It does not define meaning.

That difference is easy to miss when the consumer is a human developer. Developers read documentation, ask questions, infer context, and notice when something does not behave as expected. They build a mental model of the system over time. The API may be incomplete, but the human fills in some of the missing meaning.

Agents do not have that same feedback loop. They operate on what is exposed. If what is exposed is structurally valid but semantically ambiguous, the result can be plausible and wrong at the same time. Worse, it can be wrong consistently, at speed, and without anyone noticing until a decision has already been made.

A Field Called Gender

Take a field that appears in many enterprise systems: gender.

It looks straightforward. The schema validates. Values come back as M, F, or perhaps Other. There are no errors, no failed calls, and nothing obviously broken in the payload.

But which gender is it?

In an HR system, it may be the employee's self-reported gender identity, used for inclusion reporting and pronoun preferences in internal tooling. In a payroll system, it may be the legal sex recorded on government tax documents, constrained by statutory reporting requirements. In a healthcare benefits platform, it may represent biological sex for insurance categories, clinical screening, or medication rules. In a CRM, it may not be identity at all, but an inferred marketing segment derived from behavioural signals.

The field name is the same. The values may look similar. The meaning is not the same.

That difference is not pedantry. It changes governance, privacy, acceptable use, and the decisions that can safely be made from the value. An agent composing across these systems does not just risk getting the value wrong. It risks using the right value in the wrong context, which is often more dangerous.

A value that is accurate in one domain can produce a decision that is structurally sound and substantively incorrect in another.

The API exposed data. It did not expose meaning.

The Same Problem Appears Everywhere

The gender example is vivid, but it is not unusual. Most large organisations have the same problem in less sensitive and less obvious forms.

A field called status might describe whether a customer is active, whether an account is billable, whether a workflow has completed, or whether access should be allowed. A field called type might represent a product category in one system, a regulatory classification in another, and an internal routing hint somewhere else. The schema may be valid in each case, but the semantics do not travel cleanly between systems.

Dates create a similar problem in a less sensitive but very familiar form. I have seen APIs where the date format was perfectly defined, but a business convention sat outside the schema: a timestamp with the time set to midnight meant "this is a whole-day value", not "this happened at exactly midnight". Structurally, the value was valid. Semantically, it carried an assumption the caller had to know.

That kind of convention causes real problems downstream, especially in analytics. If one system uses midnight to mean "no time component", another uses it as an actual event time, and a third applies local business-day rules, then a simple question like "show me everything for Tuesday" stops being simple. The data shape is consistent. The meaning is not.
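
One way to make the midnight convention visible is to give whole-day values and point-in-time values different types, so the assumption travels with the data instead of sitting outside the schema. The Python sketch below is illustrative only: WholeDay, Instant, from_legacy and falls_on are hypothetical names, not part of any API described here, and time zones and local business-day rules are deliberately left out.

```python
from dataclasses import dataclass
from datetime import date, datetime, time
from typing import Union


@dataclass(frozen=True)
class WholeDay:
    """A value that covers a calendar day and has no time component."""
    day: date


@dataclass(frozen=True)
class Instant:
    """Something that happened at a specific moment."""
    at: datetime


def from_legacy(ts: datetime, midnight_means_whole_day: bool) -> Union[WholeDay, Instant]:
    """Convert a legacy timestamp using a convention the caller must state.

    The flag is the point of the sketch: the convention is an explicit
    input rather than something the consumer is expected to know.
    """
    if midnight_means_whole_day and ts.time() == time(0, 0):
        return WholeDay(ts.date())
    return Instant(ts)


def falls_on(value: Union[WholeDay, Instant], day: date) -> bool:
    """Answer 'did this fall on that day?' for either kind of value."""
    if isinstance(value, WholeDay):
        return value.day == day
    return value.at.date() == day
```

With this shape, the same midnight timestamp becomes a WholeDay for the system that uses the convention and an Instant for the system that does not, and "show me everything for Tuesday" can at least be asked against values whose meaning is declared.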

The same issue applies to operations. An endpoint called updateCustomer may sound simple, but what does it actually do? Does it only change a record? Does it trigger a notification? Does it alter audit state? Does it start a downstream workflow? Does it have different behaviour depending on who called it, which channel the request came from, or which state the customer is already in?

These are not edge details. They are part of the real contract.

The missing meaning sits in several places at once: field semantics, operation intent, domain context, provenance, side effects, idempotency, consistency, ordering, and authority. Some of that can be documented. Some of it needs to be designed into the API surface. Some of it belongs in governance and access control. But it cannot be left as tribal knowledge if the caller is an agent.
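
As a rough illustration, some of that missing meaning can be carried as data alongside the operation rather than left in people's heads. The Python sketch below assumes a hypothetical OperationContract record and two made-up operations; the field names, authority strings, and listed side effects are invented for the example and do not describe any real system.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class OperationContract:
    """Describes what an operation means, not just the shape of its payload."""
    name: str
    intent: str
    side_effects: Tuple[str, ...]
    idempotent: bool
    required_authority: str


# Hypothetical operations, for illustration only.
GET_CUSTOMER_PROFILE = OperationContract(
    name="getCustomerProfile",
    intent="Read the current profile of one customer.",
    side_effects=(),
    idempotent=True,
    required_authority="customer:read",
)

UPDATE_CUSTOMER_EMAIL = OperationContract(
    name="updateCustomerEmail",
    intent="Change the contact email held for an existing customer.",
    side_effects=("sends confirmation email", "writes audit record"),
    idempotent=False,
    required_authority="customer:contact:write",
)


def may_call(op: OperationContract, granted: FrozenSet[str], allow_side_effects: bool) -> bool:
    """Decide whether a caller may invoke an operation, using only declared facts."""
    if op.required_authority not in granted:
        return False
    if op.side_effects and not allow_side_effects:
        return False
    return True


def safe_to_retry(op: OperationContract) -> bool:
    """Only operations declared idempotent should be retried blindly."""
    return op.idempotent
```

The design choice here is that an agent runtime can refuse a call, or decline to retry one, from what the contract declares, without needing to know what everyone on the team remembers.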

Why We Got Away With It

Human developers are good at closing semantic gaps.

They read between the lines. They ask for clarification. They test unexpected cases. They learn which systems are reliable, which fields are overloaded, which APIs are safe to compose, and which ones require care. Over time, that knowledge becomes distributed across teams, conventions, runbooks, and the memory of people who have integrated with the system before.

It is slow and fragile, but it works up to a point.

What it really means is that the full contract was never in the API. It was partly in the schema, partly in the documentation, partly in the code that called it, and partly in the institutional knowledge around the system.

Agents inherit very little of that.

They do not know which field is politically sensitive, which status value is overloaded, or which endpoint has a side effect that everyone on the team remembers but nobody put in the schema. When the formal contract is incomplete, they fill the gap with inference.

Inference from incomplete contracts is where the risk concentrates.

Better Semantics Is Not Just More Documentation

The obvious response is to document more. That helps, but it is not enough.

If meaning sits only in prose beside the API, it is still separate from the thing being called. A human may read it. A generated client may ignore it. An agent may summarise it incorrectly, overweight the wrong part, or fail to connect it to the specific operation being performed.

Richer semantics has to affect the design of the API itself.

That means names should reflect business intent rather than implementation convenience. approveClaim, suspendAccount, or issueRefund carry more meaning than a generic update operation with a status field buried in the payload. Enumerated values should describe what they mean, not just which strings are valid. Side effects should be explicit at the point of call. Provenance and governance context should be part of what is exposed when the meaning of a value depends on where it came from.

It also means separating concepts that only look similar. If legal sex, gender identity, biological sex, and marketing inference mean different things, they should not collapse into one generic field just because the values happen to overlap. If customer status means different things in billing, fulfilment, support, and access control, the API should not pretend there is one universal truth called status.
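
A minimal Python sketch of that separation might look like the following. It assumes hypothetical Concept and Purpose enumerations and a GovernedValue wrapper; none of these names come from a real system, and the permitted purposes shown are examples rather than policy advice.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet


class Concept(Enum):
    """What the value actually represents."""
    GENDER_IDENTITY = "gender_identity"
    LEGAL_SEX = "legal_sex"
    BIOLOGICAL_SEX = "biological_sex"
    MARKETING_SEGMENT = "marketing_segment"


class Purpose(Enum):
    """Why a caller wants to read the value."""
    INCLUSION_REPORTING = "inclusion_reporting"
    STATUTORY_REPORTING = "statutory_reporting"
    CLINICAL_SCREENING = "clinical_screening"
    MARKETING = "marketing"


@dataclass(frozen=True)
class GovernedValue:
    """A value that carries its meaning, provenance, and permitted uses."""
    value: str
    concept: Concept
    source_system: str
    self_reported: bool
    allowed_purposes: FrozenSet[Purpose]

    def read_for(self, purpose: Purpose) -> str:
        """Return the value only for a purpose the source permits."""
        if purpose not in self.allowed_purposes:
            raise PermissionError(
                f"{self.concept.value} from {self.source_system} "
                f"may not be used for {purpose.value}"
            )
        return self.value
```

Two values that both read "F" are no longer interchangeable: one is a self-reported gender identity from an HR system, the other a legal sex from payroll, and a caller has to say what it intends to do with either before it gets the value.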

That is not just better schema design. It is domain modelling.

The Real Question

If SOA taught us that structural contracts are not enough, and the semantic gap shows us that data without meaning is not a complete contract, then the next question is straightforward.

What would an API look like if it were designed from the start to carry meaning, not just data?

Not a larger schema. Not a thicker PDF beside the endpoint. A different model of the API itself: one where business capability, domain intent, authority, provenance, and behavioural guarantees are first-class concerns rather than things the consumer has to reconstruct from structure alone.

That is what an agent-ready API requires.

It is also a much richer definition of an API than the technical contract we have been working with.

We've Seen This Before: What SOA Teaches Us About APIs in the Age of Agents

Simon Griffiths — Wed, 10 Jun 2026 11:37:31 +0000

In the first article in this series, I argued that agents do not replace APIs. They expose the quality of the APIs underneath them.

That should feel familiar, because we have been here before.

The current wave of AI agents, MCP, and tool-driven architectures follows a recognisable pattern. We are being told that systems will become more composable, more interoperable, and more reusable. Software will call other software dynamically. Capabilities will be discovered and invoked at runtime. Integration will become less rigid because the caller can decide what to use as it goes.

That sounds new because the tooling is new. The architecture story is not.

We said much the same thing about Service-Oriented Architecture.

SOA, at least in its enterprise form, did not deliver on that promise. The easy explanation is to blame the technology: SOAP, XML, WS-* and all the ceremony around them. That is convenient, but it misses the more useful lesson. The problem was not simply how services communicated. The problem was how they were designed, and what we thought a contract actually meant.

That matters now, because agents are pushing us back into the same territory, only faster.

SOA Didn't Fail At Interfaces

SOA gave us strong interface definitions. WSDL could describe operations, inputs, outputs, schemas, and types. For the time, that was a serious attempt to make enterprise systems talk to each other in a more formal way.

On paper, it looked like interoperability.

In practice, it often wasn't.

The contract could tell you the shape of the message, but not enough about the meaning of the message. It could tell you that an operation existed, but not always what business intention sat behind it. It could describe fields and types, but not the assumptions the service was making, the states it expected, or how it behaved once you stepped outside the happy path.

Two systems could integrate perfectly at a structural level and still misunderstand each other completely.

That was the deeper failure. SOA did not fail because interfaces were useless. It failed because the interfaces were not rich enough to survive outside the context in which they had been created.

The Known Consumer Problem

Many SOA services were not really general-purpose capabilities. They were services extracted from, or designed for, a particular application, process, or integration.

That meant the formal interface was only part of the contract. The rest lived in shared context. The consuming system knew the expected sequence. The teams knew which fields were safe to use. The people involved understood which status values were meaningful, which edge cases mattered, and which calls were technically possible but operationally foolish.

The service looked reusable because the interface was published. But reuse outside the original context still required human negotiation.

This is the same problem we see with many APIs today. The interface is technically available, but its correct use depends on assumptions that were never made explicit. Once the consumer is no longer the one you designed for, the weakness becomes visible.

A specific example from the SOA period has stayed with me. Architects designed what should have been a public CRM capability: a customer API that could have served as a stable enterprise surface for more than one consumer. In practice, it was shaped around the first integration. The request model, sequencing, optional fields, and error handling reflected the needs of the two systems being connected rather than the wider business capability.

It had the language of a reusable service, but the design of a point-to-point connector.

Agents make that weakness more serious because they are, by definition, not the original known consumer.

One concrete version of this is the end-to-end customer journey. I have written before about lead-to-cash as a process that looks tidy on a diagram but crosses marketing, fulfilment, billing, payments, operations, and multiple system owners. In the SOA and ESB world, the process model could give the impression that the journey had a single contract. In practice, the contract often lived in the routing rules, service assumptions, ownership boundaries, and exception handling around the services.

The interface did not carry all of that. The process did.

Structure Is Not Meaning

The most important distinction is between structural precision and semantic clarity.

SOA was often structurally precise. A schema could be strict. A service could reject invalid messages. Types could be checked. The system could be correct in a narrow technical sense.

But a field called status could still be ambiguous. Does it describe the current business state, the last completed process step, the result of a validation check, or the state visible to a particular user role? What transitions are valid? What side effects happen when it changes? Which system is the source of truth?

Those answers were rarely in the contract. They were in documentation, project memory, support teams, or the heads of people who had been around long enough to know.

That is not a small gap. It is the difference between being able to call a service and being able to use it correctly.
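
To make the point concrete, here is a small Python sketch of what it could look like for the valid transitions to be part of the contract rather than project memory. The claim states, the transition table, and the transition function are all hypothetical, chosen only to illustrate the idea.

```python
from enum import Enum


class ClaimState(Enum):
    """Current business state of a claim (hypothetical example)."""
    SUBMITTED = "submitted"
    UNDER_REVIEW = "under_review"
    APPROVED = "approved"
    REJECTED = "rejected"
    PAID = "paid"


# Which moves are valid. Publishing this table answers "what transitions
# are valid?" without relying on people who have been around long enough.
ALLOWED_TRANSITIONS = {
    ClaimState.SUBMITTED: {ClaimState.UNDER_REVIEW},
    ClaimState.UNDER_REVIEW: {ClaimState.APPROVED, ClaimState.REJECTED},
    ClaimState.APPROVED: {ClaimState.PAID},
    ClaimState.REJECTED: set(),
    ClaimState.PAID: set(),
}


class InvalidTransition(ValueError):
    """Raised when a caller asks for a move the contract does not allow."""


def transition(current: ClaimState, target: ClaimState) -> ClaimState:
    """Return the new state, or raise if the move is not declared as valid."""
    if target not in ALLOWED_TRANSITIONS[current]:
        allowed = sorted(state.value for state in ALLOWED_TRANSITIONS[current])
        raise InvalidTransition(
            f"cannot move from {current.value} to {target.value}; allowed: {allowed}"
        )
    return target
```

A caller that has never seen the original integration can still discover that a rejected claim cannot be paid, because the rule is exposed rather than remembered.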

Governance And Operational Reality Parted Ways

SOA also teaches a second lesson, which is that governance cannot just be process layered on top of weak contracts.

Organisations recognised that service reuse and change management were difficult, so they introduced versioning rules, approval boards, central registries, and service governance processes. The intention was sensible: bring order to a growing service estate.

The deeper problem was that governance and operational reality parted ways.

The SOA registry was supposed to be the source of truth, but it could not express enough of the contract. It could describe the service, the endpoint, the schema, and perhaps some ownership metadata. It could not reliably express the business meaning, assumptions, side effects, operational constraints, or safe usage patterns.

So architects and teams did what they had to do: they wrote documentation around the registry.

That helped for a while, but it created a new problem. The API, the registry, and the documentation were now three separate representations of the same contract. They changed at different speeds, were owned by different people, and predictably began to drift.

Once that happened, governance stopped being a reliable description of reality. It became another artefact to reconcile.

I saw the operational version of this in audit and compliance work. The specialists I worked with were not interested in the service diagram as the final truth. They wanted the raw data, because committed data in the database was the only durable evidence of what had happened. Queue state, process state, and integration logs mattered, but they were not a substitute for a governed source of record.

The lesson is not that governance is bad. The lesson is that governance has to live in the design of the contract itself. If meaning, authority, behaviour, and change expectations are not part of the interface, a process document will not rescue it.

The Illusion Of Reuse

SOA promised reusable services. What many organisations got instead were services that were technically reusable but practically tied to their first use case.

That distinction matters.

Technical reusability means another system can call the service. Practical reusability means another system can call it safely, understand what it means, depend on its behaviour, and evolve with it over time. SOA often delivered the first and assumed the second would follow.

It did not follow.

Reuse still depended on humans understanding the semantics, uncovering hidden assumptions, and negotiating change. The contract had not carried enough of the work.

Why Agents Make The Old Problem New Again

Agents change the consumption model.

Instead of a known front end or a controlled integration, we now have consumers that may discover tools dynamically, compose operations at runtime, and act without the shared background knowledge of the teams that built the API. An agent does not know what you meant. It knows what you exposed.

That turns old weaknesses into more immediate risks. Ambiguous fields become incorrect decisions. Hidden assumptions become failed workflows. Unclear behaviour becomes unpredictable action. A contract that was merely inconvenient for human developers can become dangerous when an autonomous caller starts composing it with other capabilities.

This is why the move from SOAP to REST, from XML to JSON, or from WSDL to OpenAPI does not by itself solve the problem. The modern stack is cleaner and more pleasant to work with, but most API contracts still describe structure better than meaning. They tell you about endpoints, payloads, response codes, and schemas. They say much less about intent, domain context, behavioural guarantees, authority boundaries, and side effects.

That is the pattern we risk repeating.

The Real Lesson

SOA standardised how services talk. It did not standardise what they mean.

That is the lesson worth carrying forward.

Agents are not sending us into an entirely new architectural problem. They are making an old one harder to ignore. If the contract does not contain enough meaning to survive outside its original context, then dynamic composition will not make the system more reliable. It will make the misunderstanding faster.

The next question is therefore not whether we need APIs. We do.

The question is what an API contract needs to contain when the caller is no longer a known application, a known team, or a known workflow.

And beneath that, there is a deeper issue: our APIs expose data structures, but not meaning.

What AI Changes About Building Blogs - Migrating from WordPress to GitHub Pages

Simon Griffiths — Wed, 03 Jun 2026 16:37:45 +0000

WordPress powers around 43% of the web. For years it was the default answer to "how do I start a blog?" It was mine too.

This is not a post about why WordPress is bad. It is about something its pricing quietly depends on: the belief that leaving is expensive.

For most of the web's history that belief was correct. Migration meant days or weeks of work: exporting content, rebuilding themes, fixing images, redesigning layouts, and repairing formatting. So people stayed.

AI has changed that calculation.

The Final Straw

I had been on WordPress.com for years. Mixed content, inconsistent posting: a burst of twenty articles back in 2019, then a long gap before I started writing seriously again this year.

The friction accumulated slowly. Themes locked behind higher-tier plans. Limited customisation unless you paid more. None of that was unreasonable on its own. Platforms cost money.

What finally pushed me over the edge was much smaller.

I wanted a staging environment to test changes before pushing them live. WordPress.com charges for a second site, then charges again for the tooling to sync changes between them. Staging does carry real cost, but the pricing made a routine workflow feel disproportionately expensive for an individual blog. It stopped feeling like infrastructure and started feeling like a toll booth.

The question stopped being "should I stay?" and became "how long would it take to leave?"

The answer turned out to be: a few hours of AI-assisted work.

The Migration

To WordPress.com's credit, the export is excellent: one XML file containing posts, pages, metadata, and media references.

From there, I handed the problem to AI.

I used the OpenAI Codex app, a macOS coding agent that works across a local development environment. I gave it the export, pointed it at my existing site, and asked it to recreate the blog in Hugo with roughly the same visual feel.

What came back was not a scaffold or a starter template. It was a functioning site: migrated content, working structure, and a recognisable design language. Separately, I used Claude to build a colour palette and refine the visual direction.

Then I iterated. Better typography. Cleaner layouts. Simpler navigation. Things I had wanted to fix for years but had never got around to inside the WordPress editor.

Getting from export to deployment took a couple of hours. Refining the design took a few more.

The Image Problem

The export solved content migration, but not presentation. Some posts had inline images. Some had none. The older 2019 articles were visually weak: written quickly, published with little thought for consistency.

I wanted every article to have a proper hero image.

So I described the problem to Codex: identify posts missing hero images, promote suitable inline images where possible, and flag the gaps. Then I used AI image generation to create artwork for the rest.
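
The audit itself is simple enough to sketch. The Python below is not what Codex produced; it is a minimal illustration that assumes Hugo posts with YAML front matter delimited by `---` lines and a theme that reads the hero image from a front matter key called `image`. Both are assumptions, and the key name in particular varies by theme. The function names classify and audit are made up for the example.

```python
import re
from pathlib import Path

# Assumption: YAML front matter between two '---' lines at the top of the file.
FRONT_MATTER = re.compile(r"\A---\n(.*?)\n---\n(.*)", re.DOTALL)
# Assumption: the theme reads the hero image from an 'image:' key.
HERO_KEY = re.compile(r"^image:[ \t]*\S+", re.MULTILINE)
# First inline Markdown image in the body: ![alt](path)
INLINE_IMAGE = re.compile(r"!\[[^\]]*\]\(([^)\s]+)")


def classify(text):
    """Classify one post as has_hero, promote, missing, or no_front_matter."""
    match = FRONT_MATTER.match(text)
    if match is None:
        return ("no_front_matter", None)
    front, body = match.groups()
    if HERO_KEY.search(front):
        return ("has_hero", None)
    inline = INLINE_IMAGE.search(body)
    if inline:
        # An inline image exists and could be promoted to the hero slot.
        return ("promote", inline.group(1))
    return ("missing", None)


def audit(content_dir):
    """Classify every Markdown file under a Hugo content directory."""
    return {
        str(path): classify(path.read_text(encoding="utf-8"))
        for path in sorted(Path(content_dir).rglob("*.md"))
    }
```

Calling audit on the site's content directory gives a per-post report: posts marked promote already contain a candidate image, and posts marked missing are the gaps that need new artwork.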

Something useful happened in the process. Without planning it, I started developing a visual identity. The generated images began sharing a palette, a mood, a texture. That consistency now carries forward into new writing.

The result is the thing WordPress never made worth my time: a consistent visual identity across the whole archive, old and new, without editing it post by post. I will revisit the weakest of the early images eventually, but the archive already hangs together.

Why Hugo and GitHub Pages

Hugo is fast and simple. There is no database. Content lives as Markdown files in Git. Deployment is a push. There are far fewer moving parts than a dynamic CMS.

GitHub Pages hosts it for free, with HTTPS and custom domains.

The trade-off is real: you lose WordPress's browser-based editor and plugin ecosystem. For non-technical creators who value integrated editing, plugins, memberships, forms, and low-friction publishing, WordPress still makes a great deal of sense.

But for technical writers, and increasingly for anyone comfortable with AI-assisted tooling, static publishing is far more accessible than it used to be.

What This Actually Means

The shift here is not Hugo specifically. It is that AI has collapsed migration friction.

Work that used to require significant developer time, such as rebuilding layouts, transforming content, repairing formatting, auditing media, and iterating on design, now compresses into a few hours of directed work with coding agents.

That changes the balance of power between platforms and users. Platforms have always benefited from inertia: even mildly dissatisfied users stayed because leaving felt expensive and risky. Much of that friction has now gone. For technically comfortable users, the barrier is less about implementation and more about clarity of intent.

I am not suggesting everyone should move to Hugo and GitHub Pages. The right answer depends on your audience, your workflow, and your tolerance for technical tooling.

AI does not remove the need for judgement, taste, or technical understanding. What it removes is the implementation cost of acting on them. And once that cost falls far enough, "it is not worth the effort to move" stops being a reason and starts being an excuse.

What I Haven't Solved Yet

The migration is done. The design is where I want it, and the site now sits on the kind of infrastructure I prefer: static files, GitHub Pages, a custom domain, and Cloudflare in front.

But there are still things Jetpack handled on WordPress that need replacing properly: analytics, subscriber capture, automatic emails, and newsletters. When I started looking at this, it became clear that there are few low-cost, integrated toolsets for anyone blogging on GitHub Pages. There is no direct equivalent of what Jetpack offered WordPress users.

The components all exist. Nobody has bundled them in quite the same way.

That is the next problem to solve.

Agents Don't Replace APIs. They Expose How Weak Most APIs Already Are

Simon Griffiths — Tue, 02 Jun 2026 07:06:36 +0000

There is a growing narrative that AI agents, often paired with protocols such as the Model Context Protocol (MCP), will replace APIs.

It is easy to see why that idea has taken hold. Agents can discover tools, reason about which one to call, and assemble workflows at runtime. That feels very different from the static integrations we have lived with for years, where one system calls another through a fixed endpoint, with a fixed payload, in a fixed sequence.

But the conclusion does not hold.

Agents do not replace APIs. They depend on them. What they really do is expose how fragile many APIs already are.

The Illusion of Replacement

Model Context Protocol and similar approaches change the control plane. They give agents a way to discover available tools, inspect descriptions, decide what looks relevant, and make calls based on inferred intent.

That is useful. It is also easy to mistake for something bigger than it is.

Underneath the agent-facing description, the work is still being done by the same kinds of things we already understand: HTTP endpoints, structured payloads, authenticated operations, services, data stores, and event streams. The agent may decide what to call dynamically, but the thing being called still has to behave deterministically.

This is the first important distinction. MCP standardises how agents find and call tools. It does not remove the execution layer underneath those tools. If anything, it makes that layer more important, because the caller is now less predictable.

Most APIs Were Never Designed To Be Open

The uncomfortable truth is that many APIs are not really general-purpose interfaces. They are backends for a known application.

That is not an insult. It is how most systems have been built for perfectly understandable reasons. A React front end needs data in a certain shape. A mobile app has a known journey. A partner integration follows a negotiated flow. The API evolves around those consumers, and over time the contract absorbs assumptions that everyone involved understands without needing to write them down.

The caller is known. The sequence is known. The context is shared.

That works well enough when the API is really part of a larger application boundary. It becomes much more brittle when the consumer is an agent that has not inherited the surrounding context.

An agent does not know which endpoint was built only for a particular screen. It does not know that a field is only valid after a previous call has initialised some state. It does not know that an operation is safe only because the front end normally prevents users from reaching it in the wrong sequence. It sees a tool, a name, a description, and a schema. Then it reasons from what has been exposed.

I saw this pattern directly on a modifyCustomer API I once worked with. It had originally been designed as a universal back-end operation: one customer modification surface that could be used by more than one channel. In practice, the contract was gradually shaped around the assumptions of a particular front end. The input model included fields only that front end had access to, and it relied on context that existed in the screen flow rather than in the API itself.

The result was an API that still looked universal from the outside, but could not be safely called from anywhere else. The contract had absorbed the assumptions of its first consumer.

That is where APIs that looked clean suddenly become fragile. Not because they were badly engineered in the narrow sense, but because they were designed inside a relationship of shared assumptions. Agents weaken that relationship.

Open APIs Are A Different Discipline

Designing an API that can be safely used by any caller is different from building an application backend.

The difference is not just documentation. It is the level of explicitness in the contract. A genuinely open API has to carry more of its own meaning. It needs operation names that reflect business intent, not just implementation convenience. It needs payloads that are stable and understandable outside the original client. It needs predictable behaviour, clear failure modes, narrow authority, and minimal reliance on hidden sequence or surrounding UI logic.

In other words, the API has to stand on its own.

That is much closer to building a platform surface than exposing the internal workings of an application. It is harder work, and it is one reason genuinely reusable APIs are rarer than most organisations like to admit.

We have often treated "API" as if it were a transport choice. Put JSON over HTTP, publish a schema, and the system has an API. But the transport was never the hard part. The hard part is whether the interface expresses enough meaning for an unknown consumer to use it correctly.

Agents Move The Burden Onto The Contract

Agents introduce probabilistic orchestration into systems that were mostly designed around deterministic flows.

That matters because the API can no longer rely on the caller behaving in the expected way. A human-written integration usually follows a known path. It was tested against the intended use case. It encodes a set of assumptions made by developers on both sides.

An agent composes at runtime. It may call tools in combinations the API designer did not anticipate. It may infer intent from an incomplete prompt. It may treat two operations as equivalent because their descriptions sound similar. It may be perfectly well intentioned and still be wrong.

When that happens, the burden shifts to the API contract. The contract has to be predictable enough, explicit enough, and narrow enough that misuse is either prevented or fails safely.
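
As a sketch of what "prevented or fails safely" can mean in code, the Python below moves the checks a front end would normally perform into the operation itself and returns a structured result the caller can act on. Everything in it is hypothetical: the Customer fields, the close_account operation, the scope string, and the result codes are invented for illustration.

```python
from dataclasses import dataclass, replace
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Customer:
    customer_id: str
    status: str              # "active" or "closed" in this sketch
    identity_verified: bool


@dataclass(frozen=True)
class Result:
    """Machine-readable outcome, so a caller does not have to guess."""
    ok: bool
    code: str
    detail: str
    customer: Optional[Customer] = None


def close_account(customer: Customer, caller_scopes: FrozenSet[str]) -> Result:
    """Close an account without trusting the caller to have followed a screen flow."""
    # Narrow authority: this operation needs its own scope.
    if "account:close" not in caller_scopes:
        return Result(False, "forbidden", "Caller lacks the account:close scope.")
    # Hidden sequence made explicit: verification is checked here, not in a UI.
    if not customer.identity_verified:
        return Result(
            False,
            "precondition_failed",
            "Identity must be verified before the account can be closed.",
        )
    # Repeating the call is harmless and reports the existing state.
    if customer.status == "closed":
        return Result(True, "already_closed", "Account was already closed.", customer)
    return Result(True, "closed", "Account closed.", replace(customer, status="closed"))
```

A caller that arrives in the wrong order, or without the right authority, gets a refusal with a reason instead of a partial change, and calling the operation twice leaves the account in the same state.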

This is where common weaknesses become much more visible. Ambiguous names matter more. Overloaded endpoints matter more. Loose schemas matter more. Broad security models matter more. They were always architectural problems, but human developers often compensated for them with judgement, local knowledge, and testing. Agents remove much of that safety net.

What Actually Changes

The shift is not "APIs versus agents". That is the wrong framing.

The real shift is from static integration to dynamic composition. Instead of a predefined workflow calling a known set of services in a known order, we now have systems where the choice of tool may be made at runtime. The orchestration layer becomes more flexible, but the execution layer still has to do the real work.

That execution layer remains APIs, services, data stores, queues, and event streams. Nothing about agents makes those responsibilities disappear. If anything, agents make the quality of those interfaces more important, because more of the system's behaviour now depends on whether they can be safely understood and composed.

This is why the replacement narrative is misleading. It focuses attention on the agent framework, when the more important question is whether the systems underneath are fit to be called by something operating without the assumptions of the original application.

The Uncomfortable Conclusion

Agents are not a shortcut around architecture.

They are a forcing function.

They force us to confront weak API design, poorly defined data boundaries, inconsistent contracts, and security assumptions that depended too heavily on a well-behaved caller. They expose the difference between an API that works for a known client and an API that can be safely used as a general capability.

That distinction is going to matter more as agentic systems move from demos into real enterprise environments.

If you are building for an agent-driven world, the priority is not the agent framework. It is the execution layer behind it.

Because agents do not replace APIs. They make it impossible to ignore how good, or bad, those APIs really are.

But even if APIs remain central, the way we design them is starting to break. We have seen that pattern before, and the lesson is older than the current agent wave. It starts with SOA.

Why You Should Change Your Debit Card Now

Simon Griffiths — Mon, 01 Jun 2026 16:24:45 +0000

There’s a piece of advice you won’t hear from your bank. Not because they don’t know — their security teams absolutely do — but because saying it out loud is uncomfortable when you’re also in the business of reassuring people that everything is fine.

So I’ll say it instead: if your debit card is more than a couple of years old and you’ve used it on smaller websites, you should replace it. Now, not eventually.

Here’s why.

Something just changed

In April 2026, Anthropic — the AI company behind the Claude family of models — quietly announced something significant. Their latest model, Claude Mythos Preview, had been used internally to find thousands of previously unknown security vulnerabilities across every major operating system and every major web browser. Many of these bugs had survived for decades undetected. One was 27 years old.

That’s notable, but not the part that matters most for you.

What matters is this: the model didn’t just find the vulnerabilities. It wrote working exploit code. Autonomously. On the first attempt, more than 83% of the time. And it can do the same thing with closed-source software — taking a compiled binary with no source code available, reverse-engineering it, and finding the holes.

This isn’t a research curiosity. It represents a fundamental shift in who can attack what, and at what cost.

The assumption that kept you relatively safe

For most of the internet’s commercial history, there was an implicit triage happening in the world of cybercrime. Sophisticated attacks — the kind that could exploit subtle vulnerabilities in obscure software — required serious skill and time. That meant attackers focused on high-value targets: large retailers, payment processors, banks.

The small online shop running an older version of WooCommerce, or the regional business with a custom checkout built in Node.js five years ago and touched infrequently since — these weren’t worth the effort. Not because they were secure. Because exploiting them required more skilled attacker time than the return justified.

That constraint has now effectively gone.

The economics of targeting have collapsed. What previously required a skilled security researcher working for days now takes an AI model a few minutes and costs a fraction of a penny per attempt. Mass automated scanning and exploitation of long-tail targets — the thousands of small sites that collectively hold a lot of card data — is now viable at industrial scale.

What this means for your debit card specifically

Every time you’ve used your debit card on a smaller website, you’ve trusted that site’s security. Some of those sites were fine. Some were running software with known vulnerabilities they hadn’t patched. Some may have been quietly compromised without ever knowing it — or without it ever making the news.

Your card details from those transactions may already exist somewhere they shouldn’t. You have no way of knowing.

The four-year-old debit card in your wallet carries four years of that history.

And debit cards carry a specific risk that credit cards don’t: fraud hits your actual bank balance. You’re not disputing a charge on borrowed money while your life continues normally. You’re potentially short on real funds while a dispute resolves — and disputes take time.

The practical response

None of this requires paranoia. It requires a modest update to your habits.

Replace the card. Call your bank and request a replacement. It’s free, it takes a few days, and it closes the window on any previously harvested card number. Whatever exists in the wild from your old transactions becomes useless.

Switch to Apple Pay or Google Pay for online purchases where you can. When you pay via these services, the merchant receives a one-time token — not your actual card number. There is nothing reusable to steal. This is the single most effective change most people can make.

For the occasional site that doesn’t accept digital wallets, consider a prepaid or virtual card with a low limit — something like a Revolut disposable card — as a sacrificial layer. Your real card stays out of it.

Move online spend to a credit card rather than debit where possible. The consumer protections are equivalent, but fraud on a credit card doesn’t drain your bank account while the dispute works through the system.

Check haveibeenpwned.com — enter your email address and it will tell you which known data breaches your details have appeared in. It won’t tell you everything, but it gives you a factual baseline rather than uncertainty.

Why your bank isn’t telling you this

Partly institutional lag — this shift is recent and bank customer communications move slowly. Partly because the advice implies their merchant ecosystem has a problem, which is an uncomfortable thing to say when those merchants are also customers. And partly because telling you to use Apple Pay is, from their perspective, endorsing a competitor’s product.

Their fraud teams know the landscape has changed. That awareness has not yet reached the leaflet in your online banking app.

The advice here is about twelve to eighteen months ahead of what mainstream consumer guidance will eventually catch up to. Acting on it now costs you nothing except a few minutes on the phone to your bank.

That seems like a reasonable trade.

Building Your First Real GPT Is Not a Prompting Exercise

Simon Griffiths — Fri, 29 May 2026 10:58:31 +0000

I recently built my first non-trivial GPT.

The interesting lesson was not about clever prompting. It was almost the opposite.

The GPT only started to become useful when I stopped treating it as something I could configure with a good instruction and a pile of documents, and started treating it as a small software project: source material, structure, behaviour, tests, iteration, version control.

That sounds obvious after the event. It was not obvious at the start.

The First Attempt Failed in a Familiar Way

My first attempt was probably the default path most people take.

I uploaded a set of existing documents, wrote a reasonably sensible instruction, and expected the GPT to work out the rest.

It did not.

The answers were not terrible. That was part of the problem. They were plausible, broadly relevant, and occasionally useful. But they were also vague, inconsistent, and too willing to drift away from the standards I was trying to enforce.

The real warning sign was repeatability. Ask the same question more than once and the answer would shift in ways that mattered. Nothing was obviously broken, but it was not dependable.
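Repeatability can be given a rough number rather than left as an impression. The sketch below is an assumption of mine, not something the original build used: `ask` is a hypothetical callable standing in for whatever sends a question to the GPT and returns its answer, and character-level similarity from `difflib` is only a crude proxy, since two answers can say the same thing in different words.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Callable


def repeatability(ask: Callable[[str], str], question: str, runs: int = 5) -> float:
    """Return the mean pairwise similarity (0.0 to 1.0) of repeated answers."""
    if runs < 2:
        raise ValueError("need at least two runs to compare")
    # Ask the same question several times and keep every answer.
    answers = [ask(question) for _ in range(runs)]
    # Compare each pair of answers; ratio() is 1.0 for identical strings.
    scores = [SequenceMatcher(None, a, b).ratio() for a, b in combinations(answers, 2)]
    return sum(scores) / len(scores)
```

A score that drops sharply for one class of question is a hint about where the source material, rather than the prompt, needs attention.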

Looking back, the reason is clear. The documents were written for humans, not retrieval. Important guidance was buried inside larger documents. Some assumptions were implicit. The prompt was trying to compensate for weak source material. Behaviour and knowledge were mixed together.

At that point I stopped tuning and started again.

That was the right decision.

The Knowledge Base Matters More Than the Prompt

The second attempt started with the documents, not the GPT instructions.

I took the source material and converted it into Markdown. Then I stripped out noise, removed irrelevant sections, and reorganised the content around topics rather than original documents.

That distinction matters.

A document written for a human reader usually has a narrative structure. It explains, introduces, repeats, and provides context. That is useful when someone is reading from start to finish. It is much less useful when a GPT is trying to retrieve the right fragment of knowledge in response to a specific question.

For retrieval, the unit of structure needs to be smaller and sharper.

Instead of preserving large documents, I split the material into focused modules, one for each area of knowledge that needed to be accessed and used together. In effect, the material was split by topic. Each file had a job. Each one answered a class of question. This was especially important where similar information was spread across many source files.
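As a minimal sketch of that kind of split, assuming the source has already been converted to Markdown and that second-level headings mark topic boundaries (both assumptions, as are the function names), the mechanical part looks something like this. Sections that share a heading are merged into one module, which is one simple way of pulling together material scattered across several documents.

```python
import re
from pathlib import Path


def slugify(title: str) -> str:
    """Turn a heading into a lowercase, hyphenated file-name stem."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return slug or "untitled"


def split_by_heading(markdown: str, level: int = 2) -> dict[str, str]:
    """Group lines under each heading of the given level, keyed by slug.

    Text before the first matching heading is dropped; deeper headings
    stay inside the module they belong to.
    """
    prefix = "#" * level + " "
    modules: dict[str, list[str]] = {}
    current = None
    for line in markdown.splitlines():
        if line.startswith(prefix):
            current = slugify(line[len(prefix):])
            modules.setdefault(current, [])
        if current is not None:
            modules[current].append(line)
    return {name: "\n".join(lines).strip() + "\n" for name, lines in modules.items()}


def write_modules(modules: dict[str, str], out_dir: Path) -> None:
    """Write each module to its own Markdown file."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, body in modules.items():
        (out_dir / f"{name}.md").write_text(body, encoding="utf-8")
```

The mechanical split is the easy part. Deciding what counts as a topic, and which sections belong together, still needs judgement.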

This was the first real improvement. Not a better prompt. Better source structure.

Restructuring Exposes the Gaps

Once the material was split into modules, the gaps became much easier to see.

Some processes were missing steps. Some terms were used inconsistently. Some guidance depended on knowledge that had never been written down because, in the original context, everyone involved already knew it.

That is the awkward thing about turning human knowledge into machine-usable knowledge. You find out how much of the contract was never in the text.

So I added missing sections, standardised terminology, and made assumptions explicit. I did not try to make the documents longer. I tried to make them less ambiguous.

This made a bigger difference than I expected.

The GPT became more stable not because it had been told to be stable, but because the material it retrieved was clearer.

Codex Became the Build Tool

This was not a manual editing exercise.

I used Codex to do the heavy lifting: converting source material into Markdown, splitting files, reorganising sections, identifying missing topics, and checking consistency across the knowledge base.

That changed the economics of the work.

Manually doing this across many files would have been slow and error-prone. With Codex, the work became iterative. I could ask it to restructure a set of files, inspect the result, correct the direction, and run another pass.

This is where the process started to feel less like writing a prompt and more like shaping a codebase.

The files mattered. The structure mattered. The naming mattered. Repetition and ambiguity mattered. The same instincts you use when cleaning up software applied here too.

The Prompt Is the Control Layer

Only after the knowledge base was in reasonable shape did I write the main instruction.

This is the opposite of how I started.

The main prompt should not duplicate the knowledge base. If it tries to do that, it becomes bloated, fragile, and hard to reason about. Its job is to define behaviour:

what the GPT is for
how it should answer
when it should use the knowledge files
what it should do when the answer is uncertain
what standards it should not compromise

I kept the instruction deliberately constrained. The more knowledge I pushed into the prompt, the worse the design became. The prompt is not the application. It is the control layer.

That separation between behaviour and knowledge is probably the most important design principle I took from the exercise.

Testing Has to Start Earlier Than Feels Natural

The other mistake I made was leaving test questions too late.

I eventually built a small test set covering normal questions, edge cases, ambiguous requests, and questions where the GPT should refuse to invent an answer.

That should have existed earlier.

Without a test set, you are just having a conversation with the GPT and deciding whether it feels better. That is not enough. The model can improve in one area while regressing in another. It can produce one excellent answer and still fail the same class of question five minutes later.

Testing gave me a way to see whether the system was becoming more reliable, not just more impressive.
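A test set of this kind does not need much machinery. The following is a hedged sketch rather than the harness actually used: `ask` is a hypothetical callable that sends a question to the system under test, the category names mirror the ones listed above, and substring checks are a deliberately blunt pass/fail criterion.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Case:
    category: str  # e.g. "normal", "edge", "ambiguous", "must-refuse"
    question: str
    must_include: list[str] = field(default_factory=list)
    must_not_include: list[str] = field(default_factory=list)


def run_cases(ask: Callable[[str], str], cases: list[Case]) -> dict[str, list[str]]:
    """Return the failing questions, grouped by category."""
    failures: dict[str, list[str]] = {}
    for case in cases:
        answer = ask(case.question).lower()
        # A case passes only if every required phrase appears
        # and no forbidden phrase does (case-insensitive).
        ok = all(s.lower() in answer for s in case.must_include) and not any(
            s.lower() in answer for s in case.must_not_include
        )
        if not ok:
            failures.setdefault(case.category, []).append(case.question)
    return failures
```

Grouping failures by category makes regressions visible: a change that improves normal questions but starts failing the refusal cases shows up immediately, instead of being hidden behind one impressive answer.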

I also found it useful to test in Codex before moving into the GPT builder. Codex was faster for file-level iteration, easier for inspecting the knowledge base, and better suited to making structural changes. But that did not remove the need to test again in the actual GPT runtime.

The two environments behaved differently enough to matter.

Codex and GPT Runtime Are Not the Same Thing

This was one of the more practical lessons.

Something that worked well in Codex did not always behave identically once deployed as a GPT. Retrieval could differ. Tone could shift. A file that seemed obvious in the build environment might not be used in the way I expected at runtime.

So the workflow became a loop:

1. Build and restructure in Codex
2. Test in Codex
3. Deploy to the GPT
4. Test again
5. Fix the source material, not just the prompt

That last point matters.

When a GPT gives a weak answer, the temptation is to add another instruction. Sometimes that is right. More often, the problem is lower down: the relevant knowledge file is too vague, too large, badly named, or missing the thing the model needs.

Prompt patches feel quick, but they accumulate into a mess.

Fixing the source is slower in the moment and better over time.

Git Is Not Optional Once This Becomes Serious

As soon as the GPT became a multi-file system, Git became necessary.

I wanted version history for the knowledge files, the main instruction, and the test questions. I wanted to be able to experiment, roll back, compare approaches, and understand why the GPT had changed.

Without Git, the process would have become guesswork very quickly.

This is another reason the exercise felt more like software than prompting. Once you have source files, build steps, tests, and runtime behaviour, you are no longer just configuring a chatbot. You are maintaining a system.

What I Would Do Differently

Next time, I would start with a module map.

In this build, the modules emerged from the documents I already had. That worked, but it was backward. A better approach is to begin with the questions the GPT needs to answer, the domains it needs to understand, and the boundaries between them.

Then the documents can be shaped to fit the model, rather than the model inheriting the accidental structure of the documents.

I would also split files more aggressively. Some of my early modules were still too broad. Smaller files retrieved more cleanly and produced less noise in the answers.

I would separate reference material from process material more deliberately. "How this works" and "how to do this" are different kinds of knowledge. Mixing them makes retrieval less predictable.

And I would define the test cases at the same time as the module map, not after the first working version. Tests are not a final validation step. They are part of the design.

The Real Lesson

The tooling makes GPT creation look simple.

In a trivial case, it probably is.

But if you want a GPT that behaves consistently, reflects a real body of knowledge, and gives answers you are prepared to rely on, the work is more structured than the interface suggests.

You are not just writing a prompt.

You are designing a knowledge base, a behaviour layer, and a feedback loop.

That is much closer to software engineering than most of the current language around GPTs implies.
