If you've built RAG over email, you know the feeling: everything works on PDFs and wiki pages, and then you point the same pipeline at someone's inbox and the whole thing quietly falls apart. Not with errors, but with bad retrieval you keep trying to fix with better chunking and bigger context windows until you realize the problem was never the retrieval.
Email threads aren't documents. Every standard RAG approach treats them like they are.
The standard approach
Connect to Gmail API, pull messages, chunk, embed, retrieve top-k.
```python
from googleapiclient.discovery import build

service = build("gmail", "v1", credentials=creds)
results = service.users().messages().list(userId="me", maxResults=50).execute()

raw_emails = []
for msg in results.get("messages", []):
    full = service.users().messages().get(
        userId="me", id=msg["id"], format="full"
    ).execute()
    raw_emails.append({
        "id": msg["id"],
        "threadId": full.get("threadId"),
        "body": get_body_text(full.get("payload", {})),  # your MIME-walking helper
        "headers": {
            h["name"]: h["value"]
            for h in full.get("payload", {}).get("headers", [])
        }
    })
```
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = []
for email in raw_emails:
    for split in splitter.split_text(email["body"]):
        chunks.append({"text": split, "metadata": {"thread_id": email["threadId"]}})

vectorstore = Chroma.from_texts(
    [c["text"] for c in chunks],
    OpenAIEmbeddings(),
    metadatas=[c["metadata"] for c in chunks]
)
```
This works on static documents because each chunk is self-contained and relationships between chunks are semantic. Email has neither property.
6 ways email breaks this
1. Quoted text duplication
In a 12-message thread, the Gmail API returns every reply with the full quoted chain below it. The original message appears 12 times. When you embed this, the oldest messages and signature blocks dominate the embedding space because they're repeated in every chunk, and the model reads repetition as reinforcement. Your most recent, most relevant messages get buried.
The fix isn't a pile of regexes, because people reply inline, edit quoted text, and forward with additions spliced mid-quote. No quote-marker pattern survives contact with a real mailbox.
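One direction that doesn't depend on quote markers is content-level deduplication: hash normalized paragraphs across the thread and keep only text you haven't already seen in an earlier message. A minimal sketch, assuming message dicts with a `body` field ordered oldest-first (the field names are illustrative, not a fixed schema):

```python
import hashlib

def dedupe_thread(messages):
    """Keep only paragraphs not already seen earlier in the thread.

    Quoted copies of earlier messages normalize to the same hash and
    get dropped, regardless of how the client formatted the quote.
    """
    seen = set()
    deduped = []
    for msg in messages:
        fresh = []
        for para in msg["body"].split("\n\n"):
            # Strip quote markers and collapse whitespace before hashing,
            # so "> Original proposal." matches "Original proposal."
            norm = " ".join(
                line.lstrip("> ").strip() for line in para.splitlines()
            ).strip().lower()
            if not norm:
                continue
            digest = hashlib.sha256(norm.encode()).hexdigest()
            if digest not in seen:
                seen.add(digest)
                fresh.append(para)
        deduped.append({**msg, "body": "\n\n".join(fresh)})
    return deduped
```

This handles verbatim quoting; inline-edited quotes still need fuzzier matching (shingling, edit distance), which is exactly where Layer 2 gets hard.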
2. Thread structure vanishes
Email threads are conversation trees, not linear sequences. Message 7 might reply to message 3, not message 6. When you embed, that structure disappears. Ask "who approved this" and retrieval surfaces someone saying "looks good" when they were actually being quoted by someone disagreeing with them.
3. CC vs. authorship confusion
Your model sees "David" in the CC line and "David's proposal" in the body and has no structural way to distinguish "David was informed" from "David authored this." Extraction pipelines end up confidently attributing work to people who never wrote a single reply because their names appeared in CC fields.
4. Forwarded thread forks
Someone forwards a thread to a new group. Now you have two conversations that share history but diverged, and Gmail treats them as separate threads with no link between them. Ask "what did the team decide" and retrieval pulls from either branch without knowing they're contradictory.
5. Signatures and boilerplate at scale
Across a real organization: 30+ signature formats, compliance disclaimers in multiple languages, confidentiality notices longer than the actual messages. A meaningful portion of your token budget goes to this noise while the model treats it as content worth reasoning over.
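Because signatures and disclaimers repeat verbatim across messages, frequency is a more robust signal than any template list. A rough sketch of that idea (the threshold is illustrative):

```python
from collections import Counter

def find_boilerplate(bodies, min_fraction=0.5):
    """Return lines that recur verbatim across many message bodies.

    Lines appearing in at least `min_fraction` of messages are almost
    certainly signatures or disclaimers, not content. In practice you'd
    bucket by sender and match multi-line blocks, not single lines.
    """
    counts = Counter()
    for body in bodies:
        # Count each line once per message so long messages don't skew it
        for line in set(l.strip() for l in body.splitlines() if l.strip()):
            counts[line] += 1
    cutoff = max(2, int(len(bodies) * min_fraction))
    return {line for line, n in counts.items() if n >= cutoff}
```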
6. Cross-thread temporal reasoning
"Let's revisit this next quarter" in January. "The timeline we discussed" in March. Completely different words for the same thing. The connection is temporal, not semantic, so vector similarity can't find it.
Why the usual fixes don't work
All six failures happen upstream of the model. Better models reason more confidently over the same broken input. Bigger context windows stuff in more duplicated text you're paying for.
Better prompts ask the model to reconstruct thread structure, deduplicate quotes, resolve attribution, and track temporal references on every single query. You're pushing infrastructure problems into the prompt.
The fix: treat email as a graph, not a document
Email threads are conversational graphs. Each message is a node, replies create edges, participants have roles that change over time, and decisions create cross-thread edges. The pipeline needs six layers between raw email and your model:
```
┌─────────────────────────────────────────────────────┐
│ YOUR APPLICATION                                    │
├─────────────────────────────────────────────────────┤
│ Layer 6: Hybrid Retrieval                           │
│ semantic search + metadata filters + graph traversal│
├─────────────────────────────────────────────────────┤
│ Layer 5: Cross-Thread Linking                       │
│ participant overlap, topic refs, temporal proximity │
├─────────────────────────────────────────────────────┤
│ Layer 4: Structured Metadata Extraction             │
│ decisions, tasks, owners, deadlines, sentiment      │
├─────────────────────────────────────────────────────┤
│ Layer 3: Participant & Role Tracking                │
│ From vs To vs CC, role changes across thread        │
├─────────────────────────────────────────────────────┤
│ Layer 2: Content Deduplication                      │
│ quoted text removal, inline edit preservation       │
├─────────────────────────────────────────────────────┤
│ Layer 1: Thread Reconstruction                      │
│ In-Reply-To / References headers → conversation tree│
├─────────────────────────────────────────────────────┤
│ RAW EMAIL (Gmail API / IMAP)                        │
└─────────────────────────────────────────────────────┘
```
Layer 1 is where most people start and stop. Map In-Reply-To headers to build the conversation tree:
```python
from collections import defaultdict

def build_thread_tree(messages):
    """Rebuild the conversation tree from In-Reply-To headers."""
    by_message_id = {}
    children = defaultdict(list)
    roots = []

    # First pass: index every message, so arrival order doesn't matter.
    # (A single pass would misfile any reply that arrives before its
    # parent in the list.)
    for msg in messages:
        msg_id = msg["headers"].get("Message-ID", "")
        by_message_id[msg_id] = msg

    # Second pass: attach each message to its parent, or treat it as a
    # root if the parent isn't in this mailbox (e.g. a forwarded thread).
    for msg in messages:
        msg_id = msg["headers"].get("Message-ID", "")
        reply_to = msg["headers"].get("In-Reply-To", "")
        if reply_to and reply_to in by_message_id:
            children[reply_to].append(msg_id)
        else:
            roots.append(msg_id)

    return roots, children, by_message_id
```
Layers 2-3 handle deduplication and participant roles. Both are straightforward in concept but brutal in practice: email clients format quotes differently, people edit quoted text without marking the changes, and the distinction between "David authored this" and "David was CC'd" needs to be structured data, not something the model infers from flattened text.
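The role distinction can be carried as structured data straight from the headers, so attribution never depends on the model's reading of flattened text. A minimal sketch using the header dicts from the fetch step (`email.utils` does the address parsing; the role labels are my own naming):

```python
from email.utils import getaddresses

def extract_roles(headers):
    """Map each address to its structural role in one message.

    Downstream extraction can then require role == "author" before
    attributing a statement to a person, no matter how often their
    name shows up in the body or the CC line.
    """
    roles = {}
    for header, role in (("From", "author"), ("To", "recipient"), ("Cc", "cc")):
        for _name, addr in getaddresses([headers.get(header, "")]):
            if addr and addr not in roles:  # strongest role wins on overlap
                roles[addr] = role
    return roles
```

Tracking this per message, rather than per thread, also captures role changes over time: someone who was CC'd for ten messages and then wrote message eleven.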
Layers 4-6 extract structured metadata (decisions, tasks, owners, deadlines), build cross-thread connections, and combine semantic search with metadata filtering and graph traversal so you can say "find messages from Sarah about Q2 budget where a decision was made" and have the retrieval handle filtering before semantic matching.
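Filter-before-match can be sketched without any vector store: restrict candidates by metadata first, then rank only the survivors by cosine similarity. These chunk records and field names are illustrative, not the iGPT internals:

```python
import math

def hybrid_search(query_vec, chunks, filters, k=5):
    """Metadata filtering first, then cosine-similarity ranking.

    `chunks` are dicts with "vector" and "metadata"; `filters` is an
    exact-match predicate like {"sender": "sarah@x.com",
    "has_decision": True}.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    # Hard filter: chunks that fail the metadata predicate never
    # compete on similarity at all.
    candidates = [
        c for c in chunks
        if all(c["metadata"].get(key) == val for key, val in filters.items())
    ]
    return sorted(
        candidates, key=lambda c: cosine(query_vec, c["vector"]), reverse=True
    )[:k]
```

The point of the ordering: a signature block from Sarah's email can be arbitrarily similar to the query, but if it carries no `has_decision` flag it's out before ranking begins.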
This is what we built iGPT to handle. All six layers, one API call. Docs here.
What the difference looks like
Standard RAG:
```python
results = vectorstore.similarity_search("What are the open action items?", k=5)
# Typical results:
# - 2 chunks dominated by signature blocks
# - 1 chunk from a quoted reply (wrong attribution)
# - 1 relevant chunk buried in noise
# - 1 chunk from an unrelated thread (similar keywords)
```
Through iGPT:
```python
from igptai import IGPT

client = IGPT(api_key="your-api-key", user="user-123")

response = client.recall.ask(
    input="What are the open action items from this week?",
    quality="cef-1-normal"
)
```
Seven source documents referenced, structured data with owners, dates, and attribution. No signatures, no duplicated quotes, no misattributed CC recipients. The infrastructure handled it before the model saw anything.
Streaming shows the pipeline stages in real time:
```python
for event in client.recall.ask(
    input="Who committed to what in the last 7 days?",
    stream=True,
    quality="cef-1-normal"
):
    if "delta" in event:
        print(event["delta"]["output"], end="", flush=True)
```
Sources referenced: 22
Here is a summary of commitments made in the last 7 days...
| Date | Person | Commitment |
|------------|-------------|------------------------------------------------|
| 2026-02-09 | Jane Doe | Proposed new campaign, requested alignment sync |
| 2026-02-10 | John Doe | Reviewing blog and one-pager, final versions |
Works the same in Node.js:
In Node.js it's `import IGPT from "igptai"`, and the API is identical.
Try it
```shell
pip install igptai
```

```python
from igptai import IGPT

client = IGPT(api_key="your-key", user="your-user-id")

# One-time OAuth: send the user through Google authorization
auth = client.connectors.authorize(
    service="google",
    scope="email",
    redirect_uri="https://your-app.com/callback"
)

# After the callback, the inbox shows up as a data source
datasources = client.datasources.list()

# Then query it
response = client.recall.ask(
    input="What decisions were made this week and who owns next steps?"
)
```
Don't want to set up OAuth just to see it work? The playground lets you connect your inbox and run queries in about five minutes, no code required.
Links:
🔗 iGPT Website
📖 API Documentation
🐍 Python SDK (PyPI)
📦 Node.js SDK (npm)
🛝 Playground