Every AI agent demo looks impressive until it hits the real world.
The reasoning works. The tool calls chain correctly. Then the agent tries to look up a UK postcode, validate a VAT number, or verify a company registration — and it either hallucinates the answer or stops dead.
The missing piece is almost never the model. It's the data layer.
What Is the Data Layer in an Agentic System?
In traditional software, the data layer is your database, ORM, and query logic. In an agentic system, it's broader: it's everything that lets an agent retrieve ground truth about the external world at runtime.
A well-designed agentic data layer has three tiers:
```
┌─────────────────────────────────────┐
│  Agent / LLM Core                   │ ← reasoning, planning, tool selection
├─────────────────────────────────────┤
│  Tool / Function Layer              │ ← structured API calls, schema validation
├─────────────────────────────────────┤
│  Data Provider Layer                │ ← real-time data: APIs, DBs, indexes
└─────────────────────────────────────┘
```
Most tutorials focus on the top two layers. The bottom layer — the actual data providers — is where production agents break.
Why Agents Hallucinate "Factual" Answers
LLMs are trained on static snapshots of the world. For anything time-sensitive or domain-specific, they guess. And they guess confidently.
Ask gpt-4o for the current registered address of a UK company. It will give you one. It may be three years out of date, or entirely fabricated.
This isn't a failure of reasoning. It's a failure of data architecture. The agent has no reliable path to ground-truth data, so it fills the gap with pattern-matched text.
The fix isn't a better prompt. It's a real data connection.
What a Reliable Agentic Data Layer Looks Like
1. Structured, schema-validated API calls
Agents work best when data sources return typed, predictable JSON — not free-text scraped HTML or inconsistent responses. Every data provider your agent calls should have:
- A clear request schema (the agent knows what to send)
- A stable response schema (the agent knows what it gets back)
- Explicit error states (the agent knows when to stop or retry)
```python
# Bad: agent parses unstructured text
result = scrape_companies_house("Acme Ltd")  # returns HTML or markdown blob

# Good: agent calls a structured API
result = company_api.lookup(name="Acme Ltd")
# returns: {"company_number": "12345678", "status": "active", "registered_address": {...}}
```
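Enforcing that stable response schema can be as simple as parsing into a typed record. A minimal sketch using stdlib dataclasses — `CompanyRecord` and `parse_company_response` are hypothetical names, with the fields taken from the response shown above:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CompanyRecord:
    """Stable response schema: the agent always gets exactly these fields."""
    company_number: str
    status: str
    registered_address: dict

def parse_company_response(raw: dict) -> CompanyRecord:
    # Fails loudly on a missing field instead of letting the agent
    # reason over a malformed payload.
    return CompanyRecord(
        company_number=raw["company_number"],
        status=raw["status"],
        registered_address=raw["registered_address"],
    )

record = parse_company_response(
    {"company_number": "12345678", "status": "active", "registered_address": {}}
)
```

A `KeyError` here is a bug in the provider integration, caught in your code — not a malformed blob silently handed to the model.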
2. Real-time data, not cached training knowledge
For anything that changes — addresses, VAT status, company registrations, exchange rates — the agent must call out at runtime. Cached or embedded knowledge is a liability for factual queries.
3. Separation of data concerns
Don't build a single monolithic "search everything" tool. Give your agent narrow, composable tools:
```python
tools = [
    lookup_uk_postcode,    # → addresses for a given postcode
    validate_vat_number,   # → VAT registration status
    verify_company,        # → Companies House data
    check_bank_sortcode,   # → sort code / bank branch validity
]
```
Narrow tools are easier for the LLM to select correctly, easier to test, and easier to replace.
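As the LLM sees them, narrow tools are just small JSON schemas. A sketch of one such definition, following the common tool-calling convention of a name, description, and input schema (the exact envelope and key names vary by model provider):

```python
# Illustrative tool definition in the shape most LLM tool-calling
# APIs expect. One narrow concern keeps tool selection unambiguous.
lookup_uk_postcode_spec = {
    "name": "lookup_uk_postcode",
    "description": "Return UK addresses for a given postcode",
    "input_schema": {
        "type": "object",
        "properties": {
            "postcode": {
                "type": "string",
                "description": "UK postcode, e.g. SW1A 1AA",
            }
        },
        "required": ["postcode"],
    },
}
```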
MCP: A Standard Data Interface for Agents
The Model Context Protocol (MCP) is an open standard (introduced by Anthropic) that defines how AI agents connect to external data and tools. Think of it as a USB-C port for agent data sources — one standard interface, many compatible providers.
An MCP server exposes tools that any MCP-compatible agent can call — Claude Desktop, Cursor, Windsurf, or a custom agent built with the Anthropic SDK.
A minimal MCP tool for address lookup looks like this:
```typescript
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { z } from "zod";

const server = new McpServer({ name: "uk-data", version: "1.0.0" });

server.tool(
  "lookup_postcode",
  "Look up UK addresses for a given postcode",
  { postcode: z.string().describe("UK postcode, e.g. SW1A 1AA") },
  async ({ postcode }) => {
    const data = await addressApi.lookup(postcode);
    return {
      content: [{ type: "text", text: JSON.stringify(data) }],
    };
  }
);
```
The agent decides when to call this tool. The MCP server handles the actual data retrieval. The data provider (an API) returns ground truth.
This three-way separation is clean and testable at every layer.
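That testability is concrete: if the tool handler takes its data provider as a parameter, each layer can be exercised with a fake. A Python sketch — the handler and fake provider are hypothetical, mirroring the MCP tool above:

```python
# Sketch: the handler depends on an injected provider, so tests
# never touch the network.
def lookup_postcode_handler(postcode: str, address_api) -> dict:
    data = address_api.lookup(postcode)
    return {"content": [{"type": "text", "text": str(data)}]}

class FakeAddressApi:
    def lookup(self, postcode):
        return {"postcode": postcode, "addresses": ["10 Downing St"]}

result = lookup_postcode_handler("SW1A 2AA", FakeAddressApi())
```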
A Practical Example: KYC Agent Data Layer
Consider a KYC (Know Your Customer) agent for a UK fintech. It needs to:
- Verify a company is registered and active
- Look up the registered address and cross-check it
- Validate the VAT number if provided
- Flag inconsistencies for human review
Without a proper data layer, the agent reasons from training data — which is wrong for dissolved companies, relocated addresses, and newly registered entities.
With a proper data layer:
```
User: "Verify Acme Trading Ltd, VAT GB123456789"

Agent calls:
  → verify_company("Acme Trading Ltd")  # Companies House: active ✓
  → lookup_address("EC2A 4NE")          # Royal Mail PAF: registered address ✓
  → validate_vat("GB123456789")         # HMRC: VAT registered, matches name ✓

Agent returns:
  "Company verified. Registered and active. Address confirmed.
   VAT number valid and matches registered entity."
```
Every answer is sourced. None of it is hallucinated.
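The cross-checking itself is ordinary code, not model reasoning. A sketch of the verdict logic, with hypothetical field names standing in for the three providers' responses:

```python
# Hypothetical response fields; shows how three sourced lookups
# become a verdict or a human-review flag.
def kyc_check(company: dict, address: dict, vat: dict) -> dict:
    flags = []
    if company["status"] != "active":
        flags.append("company not active")
    if company["registered_postcode"] != address["postcode"]:
        flags.append("address mismatch")
    if vat["registered_name"] != company["name"]:
        flags.append("VAT name mismatch")
    return {"verified": not flags, "flags": flags}

verdict = kyc_check(
    {"name": "Acme Trading Ltd", "status": "active",
     "registered_postcode": "EC2A 4NE"},
    {"postcode": "EC2A 4NE"},
    {"registered_name": "Acme Trading Ltd"},
)
```

Any non-empty `flags` list is the "flag inconsistencies for human review" step from the requirements above.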
Common Mistakes When Building the Data Layer
Giving the agent a web search tool and calling it done
Web search is useful for open-ended research. It's unreliable for structured factual queries. A search result saying a company is active is not the same as a live Companies House API response.
Returning too much data per tool call
If your address lookup returns a 4KB JSON blob with 40 fields, the agent will include irrelevant fields in its reasoning. Return the minimum needed. Let the agent ask follow-up questions.
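One way to enforce that minimum is a field whitelist applied before anything reaches the agent — a sketch with hypothetical PAF-style field names:

```python
# Sketch: only whitelisted fields ever reach the agent's context.
AGENT_FIELDS = ("line_1", "post_town", "postcode")

def trim_for_agent(record: dict) -> dict:
    return {k: record[k] for k in AGENT_FIELDS if k in record}

full = {
    "line_1": "1 Main St", "post_town": "London", "postcode": "EC2A 4NE",
    "udprn": "12345678", "delivery_point_suffix": "1A",  # ...plus 35 more
}
```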
No error handling contract
If your API returns a 404 with an HTML error page, the agent will try to parse it as data. Define what "not found" looks like in your schema and return it consistently.
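A simple contract is one envelope for every outcome, so "not found" becomes data the agent can reason about rather than an exception or an HTML page. A sketch:

```python
# Sketch: every outcome, success or failure, uses the same envelope.
def lookup_with_contract(fetch, postcode: str) -> dict:
    try:
        data = fetch(postcode)
    except LookupError:
        return {"ok": False, "error": "not_found",
                "message": f"No addresses found for postcode {postcode}"}
    return {"ok": True, "data": data}

def fake_fetch(postcode):
    raise LookupError(postcode)  # simulate the provider's 404

resp = lookup_with_contract(fake_fetch, "ZZ1 1ZZ")
```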
Mixing real-time and cached data without labelling it
If some tools return live data and others return cached snapshots, label the difference. An agent treating a 6-month-old cache as current fact is worse than no data at all.
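The label can live in the same response envelope — a sketch where every tool result carries its source and retrieval time:

```python
from datetime import datetime, timezone

# Sketch: every tool response declares its provenance, so the agent
# can tell live data from a cached snapshot.
def label(data: dict, source: str, retrieved_at: datetime) -> dict:
    return {
        "data": data,
        "source": source,                       # "live" or "cache"
        "retrieved_at": retrieved_at.isoformat(),
    }

live = label({"status": "active"}, "live", datetime.now(timezone.utc))
```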
Where to Start
If you're building a production agentic system that needs to interact with real-world structured data:
- List every factual query your agent needs to answer — addresses, company data, financial identifiers, identity checks
- Map each to a reliable API source — not a search engine, not training knowledge
- Wrap each in a narrow, typed tool — one concern per tool
- Expose via MCP if you want agent portability — Claude, Cursor, and others will be able to use it without additional integration work
- Test the failure cases — what happens when the API returns a 429, a 404, or an empty result set?
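Those failure cases can be pinned down with a stub provider before any model is in the loop — a sketch (the status codes and envelope are illustrative):

```python
# Sketch: a stub provider that reproduces each failure mode,
# so the tool's error contract can be asserted in plain tests.
def stub_api(status_code: int, body):
    def call(_query):
        if status_code == 429:
            return {"ok": False, "error": "rate_limited"}
        if status_code == 404:
            return {"ok": False, "error": "not_found"}
        return {"ok": True, "data": body}
    return call

assert stub_api(429, None)("q")["error"] == "rate_limited"
assert stub_api(404, None)("q")["error"] == "not_found"
assert stub_api(200, [])("q") == {"ok": True, "data": []}  # empty result, not an error
```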
The reasoning layer of your agent will improve automatically as the models improve. The data layer only improves if you build it deliberately.
APITier provides a set of UK-focused data APIs (address lookup, VAT validation, company verification, KYC checks) designed to work as agent tools via an MCP server. If you're building agents that work with UK data, you can find the MCP server setup in the APITier developer docs.
What data layer challenges have you hit building production agents? Drop a comment below.