If you've built RAG over email, you know the feeling: everything works on PDFs and wiki pages, and then you point the same pipeline at someone's inbox and the whole thing quietly falls apart. Not with errors, but with bad retrieval you keep trying to fix with better chunking and bigger context windows until you realize the problem was never the retrieval.
Email threads aren't documents. Every standard RAG approach treats them like they are.
The standard approach
Connect to Gmail API, pull messages, chunk, embed, retrieve top-k.
```python
from googleapiclient.discovery import build

service = build("gmail", "v1", credentials=creds)
results = service.users().messages().list(userId="me", maxResults=50).execute()

raw_emails = []
for msg in results.get("messages", []):
    full = service.users().messages().get(
        userId="me", id=msg["id"], format="full"
    ).execute()
    raw_emails.append({
        "id": msg["id"],
        "threadId": full.get("threadId"),
        "body": get_body_text(full.get("payload", {})),  # your MIME-walking helper
        "headers": {
            h["name"]: h["value"]
            for h in full.get("payload", {}).get("headers", [])
        }
    })
```
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = []
for email in raw_emails:
    for split in splitter.split_text(email["body"]):
        chunks.append({"text": split, "metadata": {"thread_id": email["threadId"]}})

vectorstore = Chroma.from_texts(
    [c["text"] for c in chunks],
    OpenAIEmbeddings(),
    metadatas=[c["metadata"] for c in chunks]
)
```
This works on static documents because each chunk is self-contained and relationships between chunks are semantic. Email has neither property.
6 ways email breaks this
1. Quoted text duplication
In a 12-message thread, the Gmail API returns every reply with the full quoted chain below it. The original message appears 12 times. When you embed this, the oldest messages and signature blocks dominate the embedding space because they're repeated in every chunk, and the model reads repetition as reinforcement. Your most recent, most relevant messages get buried.
The fix isn't a pile of regexes, because people reply inline, edit quoted text, and forward with additions spliced mid-quote. No quote-marker pattern survives contact with a real mailbox.
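One direction that doesn't depend on quote markers is content-level deduplication: hash normalized paragraphs across the thread and keep only text you haven't already seen in an earlier message. A minimal sketch, assuming message dicts with a `body` field ordered oldest-first (the field names are illustrative, not a fixed schema):

```python
import hashlib

def dedupe_thread(messages):
    """Keep only paragraphs not already seen earlier in the thread.

    Quoted copies of earlier messages normalize to the same hash and
    get dropped, regardless of how the client formatted the quote.
    """
    seen = set()
    deduped = []
    for msg in messages:
        fresh = []
        for para in msg["body"].split("\n\n"):
            # Strip quote markers and collapse whitespace before hashing,
            # so "> Original proposal." matches "Original proposal."
            norm = " ".join(
                line.lstrip("> ").strip() for line in para.splitlines()
            ).strip().lower()
            if not norm:
                continue
            digest = hashlib.sha256(norm.encode()).hexdigest()
            if digest not in seen:
                seen.add(digest)
                fresh.append(para)
        deduped.append({**msg, "body": "\n\n".join(fresh)})
    return deduped
```

This handles verbatim quoting; inline-edited quotes still need fuzzier matching (shingling, edit distance), which is exactly where Layer 2 gets hard.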
2. Thread structure vanishes
Email threads are conversation trees, not linear sequences. Message 7 might reply to message 3, not message 6. When you embed, that structure disappears. Ask "who approved this" and retrieval surfaces someone saying "looks good" when they were actually being quoted by someone disagreeing with them.
3. CC vs. authorship confusion
Your model sees "David" in the CC line and "David's proposal" in the body and has no structural way to distinguish "David was informed" from "David authored this." Extraction pipelines end up confidently attributing work to people who never wrote a single reply because their names appeared in CC fields.
4. Forwarded thread forks
Someone forwards a thread to a new group. Now you have two conversations that share history but diverged, and Gmail treats them as separate threads with no link between them. Ask "what did the team decide" and retrieval pulls from either branch without knowing they're contradictory.
5. Signatures and boilerplate at scale
Across a real organization: 30+ signature formats, compliance disclaimers in multiple languages, confidentiality notices longer than the actual messages. A meaningful portion of your token budget goes to this noise while the model treats it as content worth reasoning over.
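Because signatures and disclaimers repeat verbatim across messages, frequency is a more robust signal than any template list. A rough sketch of that idea (the threshold is illustrative):

```python
from collections import Counter

def find_boilerplate(bodies, min_fraction=0.5):
    """Return lines that recur verbatim across many message bodies.

    Lines appearing in at least `min_fraction` of messages are almost
    certainly signatures or disclaimers, not content. In practice you'd
    bucket by sender and match multi-line blocks, not single lines.
    """
    counts = Counter()
    for body in bodies:
        # Count each line once per message so long messages don't skew it
        for line in set(l.strip() for l in body.splitlines() if l.strip()):
            counts[line] += 1
    cutoff = max(2, int(len(bodies) * min_fraction))
    return {line for line, n in counts.items() if n >= cutoff}
```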
6. Cross-thread temporal reasoning
"Let's revisit this next quarter" in January. "The timeline we discussed" in March. Completely different words for the same thing. The connection is temporal, not semantic, so vector similarity can't find it.
Why the usual fixes don't work
All six failures happen upstream of the model. Better models reason more confidently over the same broken input. Bigger context windows stuff in more duplicated text you're paying for.
Better prompts ask the model to reconstruct thread structure, deduplicate quotes, resolve attribution, and track temporal references on every single query. You're pushing infrastructure problems into the prompt.
The fix: treat email as a graph, not a document
Email threads are conversational graphs. Each message is a node, replies create edges, participants have roles that change over time, and decisions create cross-thread edges. The pipeline needs six layers between raw email and your model:
```
┌─────────────────────────────────────────────────────┐
│ YOUR APPLICATION                                    │
├─────────────────────────────────────────────────────┤
│ Layer 6: Hybrid Retrieval                           │
│ semantic search + metadata filters + graph traversal│
├─────────────────────────────────────────────────────┤
│ Layer 5: Cross-Thread Linking                       │
│ participant overlap, topic refs, temporal proximity │
├─────────────────────────────────────────────────────┤
│ Layer 4: Structured Metadata Extraction             │
│ decisions, tasks, owners, deadlines, sentiment      │
├─────────────────────────────────────────────────────┤
│ Layer 3: Participant & Role Tracking                │
│ From vs To vs CC, role changes across thread        │
├─────────────────────────────────────────────────────┤
│ Layer 2: Content Deduplication                      │
│ quoted text removal, inline edit preservation       │
├─────────────────────────────────────────────────────┤
│ Layer 1: Thread Reconstruction                      │
│ In-Reply-To / References headers → conversation tree│
├─────────────────────────────────────────────────────┤
│ RAW EMAIL (Gmail API / IMAP)                        │
└─────────────────────────────────────────────────────┘
```
Layer 1 is where most people start and stop. Map In-Reply-To headers to build the conversation tree:
```python
from collections import defaultdict

def build_thread_tree(messages):
    """Rebuild the conversation tree from In-Reply-To headers."""
    by_message_id = {}
    children = defaultdict(list)
    roots = []

    # First pass: index every message, so arrival order doesn't matter.
    # (A single pass would misfile any reply that arrives before its
    # parent in the list.)
    for msg in messages:
        msg_id = msg["headers"].get("Message-ID", "")
        by_message_id[msg_id] = msg

    # Second pass: attach each message to its parent, or treat it as a
    # root if the parent isn't in this mailbox (e.g. a forwarded thread).
    for msg in messages:
        msg_id = msg["headers"].get("Message-ID", "")
        reply_to = msg["headers"].get("In-Reply-To", "")
        if reply_to and reply_to in by_message_id:
            children[reply_to].append(msg_id)
        else:
            roots.append(msg_id)

    return roots, children, by_message_id
```
Layers 2-3 handle deduplication and participant roles. Both are straightforward in concept but brutal in practice: email clients format quotes differently, people edit quoted text without marking the changes, and the distinction between "David authored this" and "David was CC'd" needs to be structured data, not something the model infers from flattened text.
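The role distinction can be carried as structured data straight from the headers, so attribution never depends on the model's reading of flattened text. A minimal sketch using the header dicts from the fetch step (`email.utils` does the address parsing; the role labels are my own naming):

```python
from email.utils import getaddresses

def extract_roles(headers):
    """Map each address to its structural role in one message.

    Downstream extraction can then require role == "author" before
    attributing a statement to a person, no matter how often their
    name shows up in the body or the CC line.
    """
    roles = {}
    for header, role in (("From", "author"), ("To", "recipient"), ("Cc", "cc")):
        for _name, addr in getaddresses([headers.get(header, "")]):
            if addr and addr not in roles:  # strongest role wins on overlap
                roles[addr] = role
    return roles
```

Tracking this per message, rather than per thread, also captures role changes over time: someone who was CC'd for ten messages and then wrote message eleven.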
Layers 4-6 extract structured metadata (decisions, tasks, owners, deadlines), build cross-thread connections, and combine semantic search with metadata filtering and graph traversal so you can say "find messages from Sarah about Q2 budget where a decision was made" and have the retrieval handle filtering before semantic matching.
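Filter-before-match can be sketched without any vector store: restrict candidates by metadata first, then rank only the survivors by cosine similarity. These chunk records and field names are illustrative, not the iGPT internals:

```python
import math

def hybrid_search(query_vec, chunks, filters, k=5):
    """Metadata filtering first, then cosine-similarity ranking.

    `chunks` are dicts with "vector" and "metadata"; `filters` is an
    exact-match predicate like {"sender": "sarah@x.com",
    "has_decision": True}.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    # Hard filter: chunks that fail the metadata predicate never
    # compete on similarity at all.
    candidates = [
        c for c in chunks
        if all(c["metadata"].get(key) == val for key, val in filters.items())
    ]
    return sorted(
        candidates, key=lambda c: cosine(query_vec, c["vector"]), reverse=True
    )[:k]
```

The point of the ordering: a signature block from Sarah's email can be arbitrarily similar to the query, but if it carries no `has_decision` flag it's out before ranking begins.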
This is what we built iGPT to handle. All six layers, one API call. Docs here.
What the difference looks like
Standard RAG:
```python
results = vectorstore.similarity_search("What are the open action items?", k=5)
# Typical results:
# - 2 chunks dominated by signature blocks
# - 1 chunk from a quoted reply (wrong attribution)
# - 1 relevant chunk buried in noise
# - 1 chunk from an unrelated thread (similar keywords)
```
Through iGPT:
```python
from igptai import IGPT

client = IGPT(api_key="your-api-key", user="user-123")

response = client.recall.ask(
    input="What are the open action items from this week?",
    quality="cef-1-normal"
)
```
Seven source documents referenced, structured data with owners, dates, and attribution. No signatures, no duplicated quotes, no misattributed CC recipients. The infrastructure handled it before the model saw anything.
Streaming shows the pipeline stages in real time:
```python
for event in client.recall.ask(
    input="Who committed to what in the last 7 days?",
    stream=True,
    quality="cef-1-normal"
):
    if "delta" in event:
        print(event["delta"]["output"], end="", flush=True)
```
Sources referenced: 22
Here is a summary of commitments made in the last 7 days...
| Date | Person | Commitment |
|------------|-------------|------------------------------------------------|
| 2026-02-09 | Jane Doe | Proposed new campaign, requested alignment sync |
| 2026-02-10 | John Doe | Reviewing blog and one-pager, final versions |
Works the same in Node.js:
In Node.js it's `import IGPT from "igptai"`, and the API is identical.
Try it
```shell
pip install igptai
```

```python
from igptai import IGPT

client = IGPT(api_key="your-key", user="your-user-id")

# One-time OAuth: send the user through Google authorization
auth = client.connectors.authorize(
    service="google",
    scope="email",
    redirect_uri="https://your-app.com/callback"
)

# After the callback, the inbox shows up as a data source
datasources = client.datasources.list()

# Then query it
response = client.recall.ask(
    input="What decisions were made this week and who owns next steps?"
)
```
Don't want to set up OAuth just to see it work? The playground lets you connect your inbox and run queries in about five minutes, no code required.
Links:
🔗 iGPT Website
📖 API Documentation
🐍 Python SDK (PyPI)
📦 Node.js SDK (npm)
🛝 Playground