Lê Tú Hào

Posted on Jun 10

AI-Driven Data Architecture, Part 1: Why Prompts Aren't Enough

#ai #datastructures #architecture #rag

What AI-driven data architecture means to me, and how I learned it the hard way

Next: Part 2 — The Blueprint

What you'll take away

If you've moved past the chat-demo stage, you may have hit the same wall I did: the model forgets what it said three sessions ago, retrieved context feels random, translated terms drift, and nobody can answer "where did this fact come from?" without reading git history and hoping.

This two-part series is for builders who have hit that same wall. It isn't a standard or a prompt cookbook. It's the model I arrived at from one build, written down so you can borrow it, adapt it, or tell me where it breaks.

By the end of Part 1 you will have:

A working definition of AI-driven data architecture as I use the term (and how it differs from "LLM + database")
An eight-layer lens you can try mapping onto your own product domain
An honest account of why my "two weeks to ship" estimate was a trap — from a real project, not theory

Part 2 turns the lens into patterns: layered SSOT, the generate→extract→retrieve flywheel, retrieval as engineering, and a maturity rubric for locating yourself when you're "half done" (spoiler: that's normal).

I've only validated these patterns in one domain (fiction). The same shape looks familiar wherever AI has to stay grounded in evolving source material:

Support tickets — raw threads → extracted intents → approved macros → agent replies
Legal review — contracts → extracted obligations → human-approved clause library → drafting assist
Internal wikis — docs → extracted entities → curated glossary → search-backed chat

But outside fiction those remain hypotheses, not shipped results. Creative writing is just where the continuity problems hurt most visibly.

I'll occasionally reference a multilingual novel-workflow platform I've been building (LoreWeave) where a pattern showed up in production. This post stands alone without it.

The illusion: prompt + context = product?

The most seductive plan in AI product development — the one I believed — looks like this:

Collect user content (documents, tickets, chapters, contracts).
Stuff the relevant slice into a prompt.
Call the model.
Ship.

I wrote that plan on a napkin. Estimated timeline: two weeks. The product would help authors write and translate fiction with LLM assistance — chat, maybe batch translation, done.

Demos reinforced the fantasy. A single book, lore pasted into the system prompt, a friendly UI — it worked. Stakeholders clapped. I clapped. Then I tried to live in the system.

Continuity broke first. A character's honorific changed in chapter twelve because the model had no durable memory of chapter three. Translation wasn't string replacement: the same proper noun had three acceptable renderings across languages, and the model picked whichever sounded fluent that hour. When I asked "did the author write this, or did extraction infer it?" my own codebase shrugged. Context windows didn't save me — replaying fifty messages every turn doesn't scale in cost, latency, or coherence.

None of these failures were prompt-engineering problems in the narrow sense. They were data architecture problems wearing prompt-engineering costumes — at least, that's the framing that finally unblocked me.

That distinction is the subject of this series.

What I mean by "AI-driven data architecture"

I use AI-driven data architecture to mean the set of structures and pipelines that turn raw inputs into grounded, traceable, reusable knowledge that AI features consume — with explicit ownership, measurement, and improvement loops.

In my usage it is not:

A vector database relabeled "RAG"
A single Postgres schema with an embeddings column
A folder of JSON files the prompt loader reads

It is a commitment that the system's job is to prepare, own, and serve context — and that the LLM is one consumer among many (chat, batch jobs, agents, translation pipelines), not the center of gravity. That commitment is the one I kept failing to make early on.

Two mindsets — mine, before and after

This is my own before/after, not a scorecard for anyone else's work:

Where I started	Where the hard parts pushed me
Prompt engineering is the core skill	Data contracts and SSOT boundaries are the core skill
One database	Layered stores: raw, authored, extracted, derived
RAG = embed + search	Retrieval is engineered, benchmarked, degrades gracefully
Ship features	Ship vertical slices through the full stack
Model upgrade fixes quality	Flywheel: generate → measure → correct → re-ingest

The shift was subtle and, for me, slow: I stopped asking "what should the prompt say?" and started asking "who owns this fact, how did it get here, and how do we know retrieval worked?"

An eight-layer lens

[Diagram 2: The flywheel (knowledge data lifecycle). One vertical slice, starting from a BOOK_CHAPTER.SAVED event: Book Service → message bus → Knowledge Extraction Worker → LLM call (extract + provenance) → Neo4j update (structure and vector) → Postgres update (extracted state). The asynchronous steps are marked, showing that entity extraction runs outside the main request flow.]

Think of these as the questions an AI-native architecture has to answer sooner or later — not org-chart boxes. They're the ones I wish I'd asked on day one.

Layer	Question it answers	If you skip it…
Ingest	Where does raw truth live?	No ground truth; everything is prompt fiction
Extract	What structured facts exist in the source?	Lore lives only in prompts; re-extraction is manual
Store (SSOT)	Who owns each class of fact?	Silent corruption; merges delete the wrong rows
Index / retrieve	How do you find the right passage?	"We have RAG" but answers feel unrelated
Synthesize	What do you generate from owned knowledge (translation, summaries, co-writing, reports)?	One-off generations that never feed back
Evaluate	How do you know retrieval and generation work?	"Live smoke passed" becomes your only metric
Consume	Who calls the model (chat, agents, pipelines)?	Token-wasteful mega-prompts
Improve	How does feedback turn into better configs, data, and models?	Static slop forever

The insight that cost me the most: this is not one database. It behaves more like a pipeline culture. Layers can share physical stores, but logical ownership has to stay explicit. Collapsing "author wrote it" and "model inferred it" into one table without a promote/quarantine story is how I started losing trust in my own data.

You don't need eight microservices on day one — I don't have eight. You need eight answered questions. A monolith that respects SSOT boundaries is, in my experience, far healthier than twelve services that all read each other's tables.

SSOT in one sentence

SSOT (single source of truth) means: for every fact type, exactly one layer owns writes; everyone else reads via contract (API, event, projection) — never by reaching into another service's tables.
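
To make that sentence concrete, here is a minimal Python sketch of write ownership. The `FactRegistry` class, the layer names, and the sample term are hypothetical, invented for illustration; a real system would enforce the same rule at service or schema boundaries rather than in one in-memory object.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


class OwnershipError(Exception):
    """Raised when a layer tries to write a fact type it does not own."""


@dataclass
class FactRegistry:
    # fact type -> the single layer allowed to write it (hypothetical names)
    owners: dict[str, str]
    _facts: dict[tuple[str, str], str] = field(default_factory=dict)

    def write(self, layer: str, fact_type: str, key: str, value: str) -> None:
        owner = self.owners.get(fact_type)
        if owner != layer:
            raise OwnershipError(
                f"{layer!r} cannot write {fact_type!r}; owner is {owner!r}"
            )
        self._facts[(fact_type, key)] = value

    def read(self, fact_type: str, key: str) -> Optional[str]:
        # Every layer may read, but only through this method (the "contract").
        return self._facts.get((fact_type, key))


registry = FactRegistry(
    owners={"glossary_term": "authored", "entity_mention": "extracted"}
)
registry.write("authored", "glossary_term", "Elder Lin", "Lin-lao")
```

A write from a non-owner, such as `registry.write("extracted", "glossary_term", ...)`, raises `OwnershipError` instead of silently replacing the authored value.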

A stopping point I recognize (because I stopped there too)

During this build, I read many open-source AI projects and observed a number of creative AI tools from the outside. A pattern kept recurring: a story bible or codex UI (characters, places, rules) paired with drafting or continuation capabilities.

It reminded me strongly of where my own system once was: rich consumption experiences built on top of a relatively thin knowledge foundation. In hindsight, that stage corresponds roughly to layers 1 and 7 in the model above (Ingest and Consume), with much of the middle still handled manually.

I'm not presenting this as a critique of those systems. Research prototypes and early products often stop there for perfectly valid reasons. I only mention it because I stopped there too, and many of the problems that pushed me toward a deeper data architecture emerged from that point onward.

Here's what I had to add once continuity, provenance, and multilingual consistency stopped being nice-to-haves:

Automatic extraction from real manuscripts or corpora at scale
Split ownership between human-authored canon and machine-extracted candidates
Retrieval I could measure (not "we embedded chunks")
A closed loop where new writing updates structured knowledge without me copy-pasting summaries

Research prototypes often show a different archetype: impressive multi-agent orchestration over a thin data foundation. That's usually the right trade-off for research. A paper isolates and proves one new capability; it isn't trying to own a knowledge graph in production a year later. In fact, the academic work on retrieval and graph-grounded generation is where I borrowed most of these ideas, and the patterns in Part 2 echo published systems like GraphRAG and HippoRAG. I'm field-testing other people's research, not inventing in a vacuum.

So none of this is a failing on anyone's part. It's an architecture stopping point that feels shippable — it felt shippable to me — right up until those requirements arrive. The honest version of the lesson, in my own case: at first AI was the UI, not the system. Turning it into infrastructure was the part I underestimated.

Seven lessons from one build

Field notes, not laws — but the ones that cost me the most to learn.

1. Prompting is consumption, not foundation.

Prompts assemble context at call time. They don't replace ingest, SSOT, or extraction. Treat prompt templates as views over owned data.
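
Here is a small sketch of what "a view over owned data" can look like in Python. The function name, the glossary contents, and the prompt wording are all made up for illustration; the point is only that the template selects from data owned elsewhere and stores nothing itself.

```python
def render_translation_prompt(
    glossary: dict[str, str], passage: str, target_lang: str
) -> str:
    """Build a prompt at call time from glossary data owned by another layer."""
    # Include only the approved renderings that actually occur in the passage,
    # so the prompt stays small and carries no private copy of the lore.
    relevant = {src: dst for src, dst in glossary.items() if src in passage}
    lines = [
        f"Translate the passage into {target_lang}.",
        "Use these approved renderings:",
    ]
    lines += [f"- {src} => {dst}" for src, dst in sorted(relevant.items())]
    lines += ["", "Passage:", passage]
    return "\n".join(lines)


# Hypothetical glossary; in a real system this is read from the authored store.
glossary = {"Elder Lin": "Lin-lao", "Jade Sect": "Yu Men"}
prompt = render_translation_prompt(glossary, "Elder Lin bowed.", "Vietnamese")
```

If the glossary changes, the next call picks up the change; nothing has to be edited in the template.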

2. SSOT boundaries beat model choice.

When human-curated glossary terms and machine-extracted entities lived in the same mental bucket, we got subtle corruption — merges that looked fine in UI tests but violated "no silent data loss" in production. Split authored vs extracted knowledge early; define a promote path.
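
A promote path can be sketched in a few lines of Python. This is not the code from my build; `Candidate`, `promote`, and the audit-log shape are hypothetical names chosen for the example, and it assumes the authored glossary is a plain term-to-rendering mapping.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A machine-extracted rendering, kept apart from authored canon."""
    term: str
    rendering: str
    source_chapter: int  # provenance: where extraction found it


class PromotionConflict(Exception):
    """Raised instead of silently overwriting a human-authored entry."""


def promote(
    authored: dict[str, str],
    candidate: Candidate,
    approved_by: str,
    audit_log: list[dict],
) -> dict[str, str]:
    existing = authored.get(candidate.term)
    if existing is not None and existing != candidate.rendering:
        # The conflict goes to a human; nothing is overwritten.
        raise PromotionConflict(
            f"{candidate.term!r}: authored {existing!r} "
            f"vs extracted {candidate.rendering!r}"
        )
    audit_log.append(
        {
            "term": candidate.term,
            "approved_by": approved_by,
            "source_chapter": candidate.source_chapter,
        }
    )
    # Return a new dict so callers never observe a half-applied merge.
    return {**authored, candidate.term: candidate.rendering}
```

The design choice worth copying is the failure mode: a disagreement between authored and extracted data raises an error that a person has to resolve, rather than picking a winner quietly.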

3. Derived stores must be rebuildable.

Graph and vector indexes are projections. If you can't re-derive them from extraction state + raw content, you've created a second source of truth by accident.
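
One way to check rebuildability is to make the projection a pure function of extraction state. The sketch below uses a toy inverted index in Python instead of a real graph or vector store, and the row shape (`entity`, `chapter`) is an assumption for the example.

```python
from collections import defaultdict


def build_entity_index(extraction_rows: list[dict]) -> dict[str, list[int]]:
    """Derive an entity -> chapters index purely from extraction state."""
    index: dict[str, set[int]] = defaultdict(set)
    for row in extraction_rows:
        index[row["entity"].lower()].add(row["chapter"])
    # Sorted lists make the output deterministic regardless of input order.
    return {entity: sorted(chapters) for entity, chapters in index.items()}


# Hypothetical extraction state; the index below can be dropped and rebuilt
# from these rows at any time.
rows = [
    {"entity": "Lin", "chapter": 3},
    {"entity": "lin", "chapter": 12},
    {"entity": "Mei", "chapter": 12},
]
index = build_entity_index(rows)
```

If throwing the index away and calling the builder again gives a different answer, some state lives only in the index, and that is the accidental second source of truth.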

4. Measurement is a layer, not a phase.

We shipped hybrid search that "worked" in manual testing. A retrieval eval harness (golden queries, recall, NDCG) then found a recall bug that integration tests had missed: results for broad terms clustered in a few chapters because the SQL query applied a single flat row limit. Numbers hurt; they also saved weeks of guessing.
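
For readers who haven't built one, the two metrics fit in a few lines of Python. This is a generic sketch with binary relevance, not my harness: the function names and the chapter ids are illustrative, and it assumes each retrieved id appears at most once in the ranking.

```python
import math


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant ids that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Normalized discounted cumulative gain with binary relevance."""
    # A hit at 0-based rank r contributes 1 / log2(r + 2).
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(retrieved[:k])
        if doc_id in relevant
    )
    # Ideal ranking: every relevant id packed at the top of the list.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg else 0.0


# One hypothetical golden query: chapters c1 and c3 should be found.
retrieved = ["c3", "c9", "c1"]
relevant = {"c1", "c3"}
print(recall_at_k(retrieved, relevant, 1))  # 0.5: only c3 is in the top 1
print(recall_at_k(retrieved, relevant, 3))  # 1.0: both are in the top 3
```

Recall tells you whether the right passages were found at all; NDCG also penalizes finding them late in the ranking. Averaging both over a fixed set of golden queries gives a number that can gate a release.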

5. Events before intelligence.

Reliable change notification (outbox, streams, queues) precedes "smart" features. Extraction triggered by saves beats nightly cron once users expect freshness.
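
The outbox idea is easy to show with Python's built-in `sqlite3`. Everything here is a sketch under assumptions: the table layout and function names are invented, SQLite stands in for whatever database you use, and `publish` stands in for a real message bus. Only the event name `BOOK_CHAPTER.SAVED` comes from my own system.

```python
import json
import sqlite3


def init(conn: sqlite3.Connection) -> None:
    conn.executescript(
        """
        CREATE TABLE chapters (id INTEGER PRIMARY KEY, book_id INTEGER, body TEXT);
        CREATE TABLE outbox (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            event_type TEXT NOT NULL,
            payload TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
        """
    )


def save_chapter(conn: sqlite3.Connection, chapter_id: int, book_id: int, body: str) -> None:
    # One transaction: the chapter and its event commit together or not at all.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO chapters (id, book_id, body) VALUES (?, ?, ?)",
            (chapter_id, book_id, body),
        )
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("BOOK_CHAPTER.SAVED", json.dumps({"chapter_id": chapter_id, "book_id": book_id})),
        )


def drain_outbox(conn: sqlite3.Connection, publish) -> int:
    """Publish pending events in order; returns how many were sent."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for event_id, event_type, payload in rows:
        publish(event_type, json.loads(payload))
        # Mark only after a successful publish (at-least-once delivery).
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (event_id,))
    return len(rows)
```

Because an event is marked only after `publish` returns, a crash between the two steps means the event is sent again on the next drain, so the extraction worker on the other end should be idempotent.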

6. Agents come after data contracts.

Tool-calling agents need owned, scoped data exposed as tools — not 40k tokens of JSON in the system prompt. Agent architecture is consumption-layer design; it assumes the layers below exist.
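
As a rough Python illustration of "scoped data exposed as tools": the agent gets a narrow lookup that returns one small record, instead of the whole codex in its prompt. The data, the `lookup_character` tool, and the `dispatch` helper are hypothetical and not tied to any particular agent framework or vendor API.

```python
# Hypothetical owned data; in a real system this sits behind the authored
# store's read contract, not in a module-level dict.
CHARACTERS = {
    (1, "lin"): {"name": "Lin", "honorific": "Elder", "first_seen_chapter": 3},
}


def lookup_character(book_id: int, name: str) -> dict:
    """Return one character record, scoped to a single book."""
    record = CHARACTERS.get((book_id, name.strip().lower()))
    if record is None:
        return {"found": False}
    return {"found": True, **record}


# The only functions the agent is allowed to call.
TOOLS = {"lookup_character": lookup_character}


def dispatch(tool_name: str, arguments: dict) -> dict:
    """Route a model-issued tool call to a registered, scoped function."""
    if tool_name not in TOOLS:
        raise KeyError(f"unknown tool: {tool_name}")
    return TOOLS[tool_name](**arguments)


result = dispatch("lookup_character", {"book_id": 1, "name": " Lin "})
```

The response is a few dozen tokens, and the `book_id` argument keeps one book's canon from leaking into another's session.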

7. Fifty to seventy percent foundation is normal.

As a system grows past the demo, you'll ship vertical slices (search works end-to-end! translation works!) while horizontal layers (eval flywheel, agent tooling, full synthesis loop) mature in parallel. Half-built foundation isn't failure — undisciplined half-building is. The rubric in Part 2 helps distinguish the two.

A note on RAG

Retrieval-augmented generation is a consumption technique (layer 7, Consume, calling layer 4, Index / retrieve), not a foundation. If your "RAG architecture" is embed-chunk-search with no SSOT story, no eval, and no path from new content back into indexes, you have a feature, not the architecture I'm describing. That was fine while I was prototyping; it got fragile for me exactly when continuity and provenance became requirements.

What's next

Part 2 — The Blueprint walks through four patterns:

Layered SSOT — content, authored, extracted, derived
The generate → extract → retrieve flywheel
Retrieval as engineering — hybrid search, eval gates, graceful degradation
Consumption layers — chat, pipelines, agents

It closes with a maturity rubric so you can locate where your foundation actually is — and a short case study of LoreWeave at roughly fifty-five to sixty-five percent on that rubric, offered as one worked example, not proof the model is universal.

The monster I underestimated wasn't the LLM. It was the data system the LLM assumes already exists. Part 2 is the map I drew for myself.
