The biggest shift in agent design over the past year has been context engineering rather than improved models. Most of the published guidance focuses on codebases, documentation, and structured knowledge bases, and it's good guidance.
But there's a category of enterprise data that breaks every standard context engineering pattern, and almost nobody is writing about it: email.
Why email is different from everything else
When Google's ADK team writes about context engineering, they describe a pipeline: ingest data, compile a view, serve it to the model. When Anthropic describes it, they talk about curating tokens for maximum utility.
Both assume the source data has some structural integrity to work with: a codebase has files, functions, and imports; a knowledge base has documents with authors and dates; even Slack has channels and timestamps.
Email has none of that. A 20-reply business thread contains the same quoted text duplicated up to 20 times, with every email client quoting differently: Gmail prefixes lines with >, Outlook indents, and Apple Mail wraps quotes in blockquote HTML.
Forwarded chains collapse three separate conversations into a single message body with no structural separator. Inline replies break every deduplication pattern because someone typed new content between quoted blocks.
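To make the problem concrete, here is a minimal sketch of handling just one of these quoting styles. The regexes and function name are illustrative, not from any library, and this covers only Gmail-style plain-text quoting; Outlook's indentation and Apple Mail's HTML blockquotes each need separate handling.

```python
import re

# Illustrative only: strip Gmail-style quoted text from a plain-text reply.
# Treat this as one branch of a per-client parser, not a general solution.
QUOTE_PREFIX = re.compile(r"^\s*>+\s?")          # "> quoted line"
REPLY_HEADER = re.compile(r"^On .+ wrote:\s*$")  # "On Mon, Jan 6 ... wrote:"

def strip_gmail_style_quotes(body: str) -> str:
    """Drop '>'-prefixed lines and the 'On ... wrote:' attribution line."""
    kept = [
        line for line in body.splitlines()
        if not (QUOTE_PREFIX.match(line) or REPLY_HEADER.match(line))
    ]
    return "\n".join(kept).strip()

msg = "Sounds good, ship it.\n\nOn Mon, Jan 6, 2025, Dana wrote:\n> Can we ship Friday?"
print(strip_gmail_style_quotes(msg))  # prints: Sounds good, ship it.
```

Even this toy version breaks on inline replies, where new content is interleaved between quoted blocks, which is exactly why per-query parsing inside an agent loop is so expensive.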
And the most critical information, the PDF with the actual contract terms or the invoice that needs reconciling, is sitting in an attachment that most context pipelines never touch.
This is where a huge amount of enterprise context actually lives, not in the CRM fields or the wiki, but in the messy, unstructured communication data where business actually happens.
What breaks at enterprise scale
The reason this matters isn't that one agent can't parse one email thread. It's what happens when you try to run context engineering across an organization's entire communication history.
A finance team closing the books at month-end needs to reconcile invoices against purchase order approvals across hundreds of vendors. The invoices arrive as PDF attachments and the approvals live in email threads scattered across 15 people's inboxes, often buried in a reply that says "approved, go ahead" with no formal record in any system.
An agent running multi-hop search over this data makes one retrieval call, gets a fragment, reformulates, searches again, and by hop 5 it's burning 40,000 tokens on a single vendor reconciliation.
Multiply that by 300 vendors and you've spent more on token costs than the finance team's monthly payroll, with accuracy degrading on every query because each hop compounds the noise from the previous one.
A compliance team monitoring regulatory commitments has to scan 50,000 threads per month for obligations that were agreed to in email and never entered into a tracking system. The commitments aren't labeled; they're buried in sentences like "we can do that by Q3" from someone in a 30-reply thread where the first 20 messages were about something else entirely.
A multi-hop agent searching for "regulatory commitments" returns threads that mention regulations, not threads that contain actual commitments. The semantic gap between what the agent searches for and what the data looks like structurally is exactly where context engineering is supposed to help, and where standard approaches fail on email.
A sales organization running deal risk scoring across 200 active opportunities needs to detect signals that only exist in email patterns: the champion going quiet over two weeks, procurement entering a thread where they weren't before, reply latency increasing, tone shifting from collaborative to transactional.
None of this shows up in the CRM, which says the deal is "Stage 3, on track" while the email thread says the deal is dying. An agent that can't reason over the full communication history with participant attribution, temporal ordering, and cross-thread awareness will miss every one of these signals, and miss them confidently.
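Two of those signals, reply latency drifting upward and procurement appearing out of nowhere, reduce to simple computations once the thread metadata exists. The schema below is hypothetical (a thread as a list of sender/timestamp pairs); the hard part is producing that clean metadata in the first place.

```python
from datetime import datetime

# Hypothetical schema: a thread is a list of (sender, ISO-8601 timestamp)
# tuples, derived from reconstructed and deduplicated email threads.
def latency_trend_hours(thread):
    """Mean reply gap (hours) in the first vs. second half of a thread."""
    times = sorted(datetime.fromisoformat(ts) for _, ts in thread)
    gaps = [(b - a).total_seconds() / 3600 for a, b in zip(times, times[1:])]
    mid = len(gaps) // 2
    early = sum(gaps[:mid]) / max(mid, 1)
    late = sum(gaps[mid:]) / max(len(gaps) - mid, 1)
    return early, late

def unexpected_entrants(thread, known_participants):
    """Senders who joined this thread but never appeared earlier in the deal."""
    return {sender for sender, _ in thread} - set(known_participants)
```

The computation is trivial; the point is that it only works over threads with reliable participant attribution and temporal ordering, which a CRM field will never give you.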
The architectural gap
Standard context engineering assumes you can compile a useful view of your data at query time. For email at enterprise scale, this doesn't hold because the preprocessing required to make email useful is too expensive and too complex to do per-query.
Thread reconstruction, quoted text deduplication, participant attribution, attachment extraction, temporal ordering across threads that reference each other: this work needs to happen once at index time, not repeatedly inside an agent loop.
When you do it at index time, the agent gets pre-assembled context in a single retrieval call where latency is predictable, cost is fixed, and the same query returns the same result every time, which is the only way downstream automation actually works.
When you try to do it at query time through multi-hop search, you get variable latency (10-60 seconds depending on thread complexity), variable cost (scales with how messy the data is, which means your hardest queries are your most expensive), and variable accuracy (each hop builds on the previous hop's interpretation, and the error compounds).
The agent is simultaneously trying to reconstruct the conversation, figure out who said what, determine what's current versus what's quoted history, and answer the actual question. That's four jobs where each one is hard enough on its own.
What index-time context engineering looks like
The work that makes email usable for agents comes down to a few things that need to happen once, not per-query: reconstruct threads, strip quoted text, attribute who said what, and actually read attachments.
Then index all of it with semantic and structural metadata, scoped per-user so one person's agent can't surface another person's data.
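As a sketch of the first of those steps, thread reconstruction can start from RFC 5322 headers. The dict schema here is a stand-in for a parsed message, the code assumes parents arrive before their replies, and real threading also needs the References header plus subject-line fallbacks for clients that drop In-Reply-To.

```python
from collections import defaultdict

def build_threads(messages):
    """Group messages into threads by walking In-Reply-To links.

    Assumes each message dict has message_id, in_reply_to, and date keys,
    and that a parent always precedes its replies in the input list.
    """
    root_of = {}
    for m in messages:
        parent = m.get("in_reply_to")
        # A reply inherits its parent's root; a fresh message roots itself.
        root_of[m["message_id"]] = root_of.get(parent, m["message_id"])
    threads = defaultdict(list)
    for m in messages:
        threads[root_of[m["message_id"]]].append(m)
    # Temporal ordering within each thread.
    for msgs in threads.values():
        msgs.sort(key=lambda m: m["date"])
    return dict(threads)
```

Running this once at ingest, alongside quote stripping and attachment extraction, is what lets a later retrieval call return a whole conversation instead of a fragment.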
Most teams skip this and go straight to multi-hop search, which works in demos and breaks in production at exactly the scale where the business case justifies the investment.
We build this infrastructure at iGPT: a developer sends one API call and gets back structured, reasoning-ready context with source citations, with no loops, retries, or per-query preprocessing.
from igptai import IGPT

client = IGPT(api_key="...", user="user_123")

result = client.recall.ask(
    input="Reconcile Q1 invoices from Apex Logistics, flag PO mismatches",
    quality="cef-1-normal",
    output_format="json",
)
# Structured JSON: vendor, invoice amounts, PO deltas, source email citations
The industry is right to focus on context, but most implementations assume the data is already usable, and email isn't.
If your agent is reasoning over email without fixing that first, it isn't failing because the model is weak; it's failing because the context never made sense in the first place.
Docs: docs.igpt.ai
SDK: pip install igptai