The biggest shift in agent design over the past year has been context engineering rather than improved models. Most of the published guidance focuses on codebases, documentation, and structured knowledge bases, and it's good guidance.
But there's a category of enterprise data that breaks every standard context engineering pattern, and almost nobody is writing about it: email.
Why email is different from everything else
When Google's ADK team writes about context engineering, they describe a pipeline: ingest data, compile a view, serve it to the model. When Anthropic describes it, they talk about curating tokens for maximum utility.
Both assume the source data has some structural integrity to work with: a codebase has files, functions, and imports; a knowledge base has documents with authors and dates; even Slack has channels and timestamps.
Email has none of that. A 20-reply business thread contains the same quoted text duplicated up to 20 times, with every email client quoting differently: Gmail prefixes lines with >, Outlook indents, and Apple Mail wraps quotes in blockquote HTML.
Forwarded chains collapse three separate conversations into a single message body with no structural separator. Inline replies break every deduplication pattern because someone typed new content between quoted blocks.
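To make the problem concrete, here is a minimal sketch of handling just one of these quoting styles. The regexes and function name are illustrative, not from any library, and this covers only Gmail-style plain-text quoting; Outlook's indentation and Apple Mail's HTML blockquotes each need separate handling.

```python
import re

# Illustrative only: strip Gmail-style quoted text from a plain-text reply.
# Treat this as one branch of a per-client parser, not a general solution.
QUOTE_PREFIX = re.compile(r"^\s*>+\s?")          # "> quoted line"
REPLY_HEADER = re.compile(r"^On .+ wrote:\s*$")  # "On Mon, Jan 6 ... wrote:"

def strip_gmail_style_quotes(body: str) -> str:
    """Drop '>'-prefixed lines and the 'On ... wrote:' attribution line."""
    kept = [
        line for line in body.splitlines()
        if not (QUOTE_PREFIX.match(line) or REPLY_HEADER.match(line))
    ]
    return "\n".join(kept).strip()

msg = "Sounds good, ship it.\n\nOn Mon, Jan 6, 2025, Dana wrote:\n> Can we ship Friday?"
print(strip_gmail_style_quotes(msg))  # prints: Sounds good, ship it.
```

Even this toy version breaks on inline replies, where new content is interleaved between quoted blocks, which is exactly why per-query parsing inside an agent loop is so expensive.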
And the most critical information, the PDF with the actual contract terms or the invoice that needs reconciling, is sitting in an attachment that most context pipelines never touch.
This is where a huge amount of enterprise context actually lives, not in the CRM fields or the wiki, but in the messy, unstructured communication data where business actually happens.
What breaks at enterprise scale
The reason this matters isn't that one agent can't parse one email thread. It's what happens when you try to run context engineering across an organization's entire communication history.
A finance team closing the books at month-end needs to reconcile invoices against purchase order approvals across hundreds of vendors. The invoices arrive as PDF attachments and the approvals live in email threads scattered across 15 people's inboxes, often buried in a reply that says "approved, go ahead" with no formal record in any system.
An agent running multi-hop search over this data makes one retrieval call, gets a fragment, reformulates, searches again, and by hop 5 it's burning 40,000 tokens on a single vendor reconciliation.
Multiply that by 300 vendors and you've spent more on token costs than the finance team's monthly payroll, with accuracy degrading on every query because each hop compounds the noise from the previous one.
A compliance team monitoring regulatory commitments has to scan 50,000 threads per month for obligations that were agreed to in email and never entered into a tracking system. The commitments aren't labeled; they're buried in sentences like "we can do that by Q3" from someone in a 30-reply thread where the first 20 messages were about something else entirely.
A multi-hop agent searching for "regulatory commitments" returns threads that mention regulations, not threads that contain actual commitments. The semantic gap between what the agent searches for and what the data looks like structurally is exactly where context engineering is supposed to help, and where standard approaches fail on email.
A sales organization running deal risk scoring across 200 active opportunities needs to detect signals that only exist in email patterns: the champion going quiet over two weeks, procurement entering a thread where they weren't before, reply latency increasing, tone shifting from collaborative to transactional.
None of this shows up in the CRM, which says the deal is "Stage 3, on track" while the email thread says the deal is dying. An agent that can't reason over the full communication history with participant attribution, temporal ordering, and cross-thread awareness will miss every one of these signals, and miss them confidently.
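Two of those signals, reply latency drifting upward and procurement appearing out of nowhere, reduce to simple computations once the thread metadata exists. The schema below is hypothetical (a thread as a list of sender/timestamp pairs); the hard part is producing that clean metadata in the first place.

```python
from datetime import datetime

# Hypothetical schema: a thread is a list of (sender, ISO-8601 timestamp)
# tuples, derived from reconstructed and deduplicated email threads.
def latency_trend_hours(thread):
    """Mean reply gap (hours) in the first vs. second half of a thread."""
    times = sorted(datetime.fromisoformat(ts) for _, ts in thread)
    gaps = [(b - a).total_seconds() / 3600 for a, b in zip(times, times[1:])]
    mid = len(gaps) // 2
    early = sum(gaps[:mid]) / max(mid, 1)
    late = sum(gaps[mid:]) / max(len(gaps) - mid, 1)
    return early, late

def unexpected_entrants(thread, known_participants):
    """Senders who joined this thread but never appeared earlier in the deal."""
    return {sender for sender, _ in thread} - set(known_participants)
```

The computation is trivial; the point is that it only works over threads with reliable participant attribution and temporal ordering, which a CRM field will never give you.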
The architectural gap
Standard context engineering assumes you can compile a useful view of your data at query time. For email at enterprise scale, this doesn't hold because the preprocessing required to make email useful is too expensive and too complex to do per-query.
Thread reconstruction, quoted text deduplication, participant attribution, attachment extraction, temporal ordering across threads that reference each other: this work needs to happen once at index time, not repeatedly inside an agent loop.
When you do it at index time, the agent gets pre-assembled context in a single retrieval call where latency is predictable, cost is fixed, and the same query returns the same result every time, which is the only way downstream automation actually works.
When you try to do it at query time through multi-hop search, you get variable latency (10-60 seconds depending on thread complexity), variable cost (scales with how messy the data is, which means your hardest queries are your most expensive), and variable accuracy (each hop builds on the previous hop's interpretation, and the error compounds).
The agent is simultaneously trying to reconstruct the conversation, figure out who said what, determine what's current versus what's quoted history, and answer the actual question. That's four jobs where each one is hard enough on its own.
What index-time context engineering looks like
The work that makes email usable for agents comes down to a few things that need to happen once, not per-query: reconstruct threads, strip quoted text, attribute who said what, and actually read attachments.
Then index all of it with semantic and structural metadata, scoped per-user so one person's agent can't surface another person's data.
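As a sketch of the first of those steps, thread reconstruction can start from RFC 5322 headers. The dict schema here is a stand-in for a parsed message, the code assumes parents arrive before their replies, and real threading also needs the References header plus subject-line fallbacks for clients that drop In-Reply-To.

```python
from collections import defaultdict

def build_threads(messages):
    """Group messages into threads by walking In-Reply-To links.

    Assumes each message dict has message_id, in_reply_to, and date keys,
    and that a parent always precedes its replies in the input list.
    """
    root_of = {}
    for m in messages:
        parent = m.get("in_reply_to")
        # A reply inherits its parent's root; a fresh message roots itself.
        root_of[m["message_id"]] = root_of.get(parent, m["message_id"])
    threads = defaultdict(list)
    for m in messages:
        threads[root_of[m["message_id"]]].append(m)
    # Temporal ordering within each thread.
    for msgs in threads.values():
        msgs.sort(key=lambda m: m["date"])
    return dict(threads)
```

Running this once at ingest, alongside quote stripping and attachment extraction, is what lets a later retrieval call return a whole conversation instead of a fragment.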
Most teams skip this and go straight to multi-hop search, which works in demos and breaks in production at exactly the scale where the business case justifies the investment.
We build this infrastructure at iGPT: a developer sends one API call and gets back structured, reasoning-ready context with source citations, with no loops, retries, or per-query preprocessing.
from igptai import IGPT

client = IGPT(api_key="...", user="user_123")

result = client.recall.ask(
    input="Reconcile Q1 invoices from Apex Logistics, flag PO mismatches",
    quality="cef-1-normal",
    output_format="json",
)
# Structured JSON: vendor, invoice amounts, PO deltas, source email citations
The industry is right to focus on context, but most implementations assume the data is already usable, and email isn't.
If your agent is reasoning over email without fixing that first, it isn't failing because the model is weak; it's failing because the context never made sense in the first place.
Docs: docs.igpt.ai
SDK: pip install igptai