Most conversational state management assumes the conversation is live: a chat session, a websocket, a context window. Email breaks that assumption rudely. A customer replies five days after your agent's last message, and your code is expected to pick up exactly where things left off, with no session, no socket, and a process that has restarted twelve times since.
The good news: email already solved durable conversation tracking, decades ago, in three headers. Build on them properly and the thread itself becomes the agent's memory. This is the pattern behind the multi-day support agent recipe, which runs an LLM support agent on its own mailbox via Agent Accounts — currently in beta.
The three headers doing the work
Every email carries a globally unique Message-ID. A reply adds In-Reply-To (the Message-ID being answered) and References (the full chain of Message-IDs, oldest to newest). That's how Gmail, Outlook, and Apple Mail all decide what belongs to one thread — subject-line matching is only a fallback, and the email threading docs explain why relying on it breaks: recipients edit subjects, two prospects can receive identical subjects, and forwards keep the subject while changing the conversation entirely.
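To make the header mechanics concrete, here is a minimal sketch of how a reply's threading headers can be derived from the message being answered, following the convention described in RFC 5322. The function name `buildReplyHeaders` and the plain-object header shape are illustrative assumptions, not part of any platform API.

```javascript
// Hypothetical helper: derive threading headers for a reply.
// `parent` is a plain object holding the parent's header values (assumed shape).
function buildReplyHeaders(parent) {
  // Start from the parent's References chain; fall back to its In-Reply-To
  // when References is absent, as RFC 5322 suggests.
  const chain = (parent.references || parent.inReplyTo || "")
    .split(/\s+/)
    .filter(Boolean);

  // Append the parent's own Message-ID so the chain stays oldest to newest.
  if (!chain.includes(parent.messageId)) chain.push(parent.messageId);

  return {
    "In-Reply-To": parent.messageId,
    References: chain.join(" "),
  };
}

const replyHeaders = buildReplyHeaders({
  messageId: "<b2@example.com>",
  inReplyTo: "<a1@example.com>",
  references: "<a1@example.com>",
});
console.log(replyHeaders);
```

A reply to the very first message in a conversation has no earlier chain, so its References value is just the parent's Message-ID.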
You don't manage these headers yourself. Pass reply_to_message_id on a send and the platform populates In-Reply-To and References automatically:
curl --request POST \
  --url "https://api.us.nylas.com/v3/grants/<GRANT_ID>/messages/send" \
  --header "Authorization: Bearer <NYLAS_API_KEY>" \
  --header "Content-Type: application/json" \
  --data '{
    "reply_to_message_id": "<MESSAGE_ID>",
    "to": [{ "email": "alice@example.com" }],
    "subject": "Re: Trouble accessing my account",
    "body": "Thanks for the extra detail, Alice. Here is what I found..."
  }'
The reply threads correctly in the recipient's client and lands in the same thread in the agent's own mailbox.
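Since the agent's handler code is JavaScript, the same request can be assembled there. The sketch below only builds the arguments for `fetch()` and does not send anything. The helper name `buildReplyRequest` and its parameter names are assumptions for illustration; the endpoint and body fields mirror the curl call above.

```javascript
// Hypothetical helper: build the fetch() arguments for a threaded reply.
function buildReplyRequest({ grantId, apiKey, replyToMessageId, to, subject, body }) {
  return {
    url: `https://api.us.nylas.com/v3/grants/${encodeURIComponent(grantId)}/messages/send`,
    options: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      // reply_to_message_id is what lets the platform fill in the headers.
      body: JSON.stringify({
        reply_to_message_id: replyToMessageId,
        to: [{ email: to }],
        subject,
        body,
      }),
    },
  };
}

const replyRequest = buildReplyRequest({
  grantId: "grant-1",
  apiKey: "test-key",
  replyToMessageId: "msg-123",
  to: "alice@example.com",
  subject: "Re: Trouble accessing my account",
  body: "Thanks for the extra detail, Alice. Here is what I found...",
});
// To send it: await fetch(replyRequest.url, replyRequest.options);
```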
Thread ID as the session key
When a reply arrives, the message.created webhook payload includes thread_id. That's the durable session identifier. The pattern from the docs:
- On outbound, store the thread_id mapped to your internal state: ticket record, workflow step, whatever the agent was doing.
- On inbound, look up the thread_id. Found it? Restore context and continue. Didn't? It's a brand-new conversation, so classify and route it.
The dispatch logic at the top of the webhook handler is short, but every line is a guard that earns its place:
app.post("/webhooks/support", async (req, res) => {
  // Acknowledge right away; the work below happens after the response.
  res.status(200).end();

  const event = req.body;
  if (event.type !== "message.created") return;

  const msg = event.data.object;
  if (msg.grant_id !== SUPPORT_GRANT_ID) return;

  // Skip messages the agent itself sent.
  if (msg.from?.[0]?.email === SUPPORT_EMAIL) return;

  // Deduplicate: webhook delivery is at-least-once.
  if (await db.alreadyProcessed(msg.id)) return;
  await db.markProcessed(msg.id);

  // Reply to an existing ticket, or a brand-new conversation?
  const ticket = await db.tickets.findByThreadId(msg.thread_id);
  if (ticket) {
    await handleFollowUp(msg, ticket);
  } else {
    await handleNewTicket(msg);
  }
});
The self-sent check matters more than it looks: the agent's own replies land in the same mailbox and fire the same trigger. Without that guard, the agent treats its own answer as a customer follow-up and responds to it — an email-based feedback loop you'll discover via a very confused customer.
The lookup table must live in a database, not memory. Support threads span days; in-memory maps don't survive deploys.
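A rough sketch of what "not memory" means in practice: the store below keeps nothing in process state and reads through to a backend on every call. Everything here is hypothetical, including `createTicketStore` and the `read`/`write` backend interface. The demo backend is a plain object standing in for a database table keyed by thread_id, and the calls are synchronous only to keep the example short.

```javascript
// Hypothetical ticket store. `backend` stands in for a database: anything
// with read() and write() that outlives the process.
function createTicketStore(backend) {
  const load = () => JSON.parse(backend.read() || "{}");
  return {
    // Called after an outbound send: map the thread to the agent's state.
    save(threadId, ticket) {
      const all = load();
      all[threadId] = ticket;
      backend.write(JSON.stringify(all));
    },
    // Called on inbound: returns the ticket, or null for a new conversation.
    findByThreadId(threadId) {
      return load()[threadId] ?? null;
    },
  };
}

// Stand-in backend for the demo only.
const demoBackend = {
  data: "",
  read() { return this.data; },
  write(serialized) { this.data = serialized; },
};

createTicketStore(demoBackend).save("thread-1", { category: "billing", turns: 1 });

// Simulate a restart: a fresh store instance over the same backend.
const storeAfterRestart = createTicketStore(demoBackend);
console.log(storeAfterRestart.findByThreadId("thread-1"));
console.log(storeAfterRestart.findByThreadId("thread-2"));
```

The second store instance shares no variables with the first, so finding the ticket shows the state lives in the backend rather than in the process.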
Rehydrating context after silence
When a dormant thread revives, the agent re-reads the whole conversation through the Threads API: each thread object carries an ordered message_ids list, the participants, and last-activity timestamps. Fetch the messages, sort by date, label each as agent or customer, and feed the transcript to the LLM for reclassification — not just reply generation. The recipe is insistent on this: a conversation that opened as a general question often turns into a billing dispute by message two, and routing should adapt.
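The sort-and-label step might look like the sketch below. It assumes the messages have already been fetched and that each one exposes a numeric `date`, a `from` array, and a `body`; those field names and the helper name `buildTranscript` are assumptions for illustration.

```javascript
// Hypothetical helper: turn fetched messages into an ordered, labeled transcript.
function buildTranscript(messages, agentEmail) {
  const agent = agentEmail.toLowerCase();
  return [...messages] // copy so the caller's array is not reordered
    .sort((a, b) => a.date - b.date) // oldest first
    .map((m) => {
      const sender = (m.from?.[0]?.email || "").toLowerCase();
      const role = sender === agent ? "agent" : "customer";
      return `${role}: ${m.body}`;
    })
    .join("\n");
}

const transcript = buildTranscript(
  [
    { date: 300, from: [{ email: "alice@example.com" }], body: "Also, I was charged twice." },
    { date: 100, from: [{ email: "alice@example.com" }], body: "I cannot log in." },
    { date: 200, from: [{ email: "Support@Example.com" }], body: "Try resetting your password." },
  ],
  "support@example.com"
);
console.log(transcript);
```

In this example the third message is what would push reclassification from an access problem toward billing, which is why the whole transcript goes to the classifier rather than only the latest message.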
The recipe also hard-codes lifecycle guards around the LLM:
- Auto-reply only above a 0.85 confidence threshold, and only in safe categories.
- Escalate at a 6-turn limit — if six exchanges haven't resolved it, a human should.
- Escalate when a thread reopens after 168 hours of dormancy. A week-old context shift deserves human judgment.
A support agent that confidently sends a wrong billing answer is worse than one that says "let me get a human."
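One way to express those guards is a single decision function that runs before any reply is sent. The thresholds come from the list above; the category names, the input shape, the reason strings, and the exact boundary handling (for example, treating a confidence of exactly 0.85 as not high enough) are assumptions.

```javascript
const CONFIDENCE_THRESHOLD = 0.85;
const MAX_TURNS = 6;
const DORMANCY_LIMIT_HOURS = 168;
// Hypothetical category names; use whatever your classifier emits.
const SAFE_CATEGORIES = new Set(["general", "how_to"]);

// Hypothetical helper: decide whether the agent may auto-reply.
function decideAction({ confidence, category, turnCount, hoursSinceLastActivity }) {
  if (hoursSinceLastActivity >= DORMANCY_LIMIT_HOURS) {
    return { action: "escalate", reason: "reopened_after_dormancy" };
  }
  if (turnCount >= MAX_TURNS) {
    return { action: "escalate", reason: "turn_limit" };
  }
  if (!SAFE_CATEGORIES.has(category)) {
    return { action: "escalate", reason: "unsafe_category" };
  }
  if (confidence <= CONFIDENCE_THRESHOLD) {
    return { action: "escalate", reason: "low_confidence" };
  }
  return { action: "auto_reply", reason: null };
}

const decision = decideAction({
  confidence: 0.95,
  category: "billing",
  turnCount: 2,
  hoursSinceLastActivity: 3,
});
console.log(decision);
```

Every path except the last one escalates, so a missing or unexpected category fails toward a human rather than toward an automatic reply.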
Escalation is a feature, not a failure
When the agent does hand off, it should pass the human everything it knows — the ticket category, the turn count, the escalation reason, and a pointer to the thread — so the human doesn't re-read the conversation from scratch. The recipe marks the ticket escalated in the store, and the follow-up handler checks that status first: once a human owns a thread, the agent stays out of it.
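A small sketch of that handoff, assuming a ticket record shaped like `{ threadId, category, turnCount, status }`. The function names and field names are hypothetical; the recipe's own store may differ.

```javascript
// Hypothetical helper: mark a ticket escalated and build the note for the human.
function escalate(ticket, reason) {
  const handoff = {
    threadId: ticket.threadId, // pointer to the thread
    category: ticket.category,
    turnCount: ticket.turnCount,
    reason,
  };
  // Return a new ticket object rather than mutating the input.
  return { ticket: { ...ticket, status: "escalated" }, handoff };
}

// The follow-up handler would check this before doing anything else.
function agentMayReply(ticket) {
  return ticket.status !== "escalated";
}

const openTicket = { threadId: "thread-1", category: "billing", turnCount: 6, status: "open" };
const escalated = escalate(openTicket, "turn_limit");
console.log(escalated.handoff);
console.log(agentMayReply(escalated.ticket));
```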
The handoff mechanism itself is pleasingly low-tech. Because an Agent Account is a real mailbox, the human team can connect to it over IMAP from Outlook or Apple Mail and read or answer the escalated thread directly. The API and IMAP share the same mailbox, so if the ticket is later de-escalated, the human's replies are right there in the thread history the agent rehydrates.
Two operational numbers from the recipe worth tracking from day one:
- Escalation rate. If more than 40-50% of tickets escalate, the agent isn't pulling its weight — tune the knowledge base, adjust the prompt, or narrow the auto-reply categories.
- Send volume. A busy support inbox can exhaust the send cap — 200 messages per account per day on the free plan. Provision a policy with a higher limit, or split load across multiple Agent Accounts by category.
And log everything: the classification result, the confidence score, and the generated reply for every interaction. Support emails are auditable communications; don't ship an agent that talks to customers without an audit trail.
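The two operational numbers can be derived from daily counters with very little code. This is a sketch under assumptions: the counter names and the `dailyHealth` helper are hypothetical, the alert threshold uses the lower edge of the 40-50% range, and the cap is the free-plan figure quoted above.

```javascript
const SEND_CAP_PER_DAY = 200; // free-plan cap cited above
const ESCALATION_ALERT_RATE = 0.4; // lower edge of the 40-50% range

// Hypothetical helper: summarize one day's counters for one account.
function dailyHealth({ ticketsHandled, ticketsEscalated, messagesSent }) {
  const escalationRate = ticketsHandled === 0 ? 0 : ticketsEscalated / ticketsHandled;
  return {
    escalationRate,
    escalationTooHigh: escalationRate > ESCALATION_ALERT_RATE,
    sendsRemaining: Math.max(0, SEND_CAP_PER_DAY - messagesSent),
  };
}

const health = dailyHealth({ ticketsHandled: 50, ticketsEscalated: 25, messagesSent: 180 });
console.log(health);
```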
When you do need raw headers
Mostly you won't — thread_id is more stable for application logic than Message-IDs, since the platform assigns it and it spans the whole conversation. But when you need the chain itself, pass fields=include_basic_headers on a message GET to receive just Message-ID, In-Reply-To, and References without the full header payload, which is often larger than the message body.
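As a hedged sketch, the request URL and the header extraction could look like this. The message GET path is an assumption modeled on the send endpoint shown earlier, and the response shape (a `headers` array of `{ name, value }` pairs) is also an assumption; check both against the API reference before relying on them.

```javascript
// Hypothetical helper: URL for fetching one message with only basic headers.
function basicHeadersUrl(grantId, messageId) {
  const url = new URL(
    `https://api.us.nylas.com/v3/grants/${encodeURIComponent(grantId)}/messages/${encodeURIComponent(messageId)}`
  );
  url.searchParams.set("fields", "include_basic_headers");
  return url.toString();
}

// Hypothetical helper: pull the three threading headers out of a message.
// Header names are matched case-insensitively and returned lowercased.
function pickThreadingHeaders(message) {
  const wanted = ["message-id", "in-reply-to", "references"];
  const found = {};
  for (const { name, value } of message.headers || []) {
    const key = name.toLowerCase();
    if (wanted.includes(key)) found[key] = value;
  }
  return found;
}

console.log(basicHeadersUrl("grant-1", "msg-1"));
console.log(
  pickThreadingHeaders({
    headers: [
      { name: "Message-ID", value: "<b2@example.com>" },
      { name: "In-Reply-To", value: "<a1@example.com>" },
      { name: "X-Other", value: "ignored" },
    ],
  })
);
```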
One more design note: don't assume one reply per outbound. Two people on a thread can both respond, and your agent shouldn't double-reply to the same thread because two webhooks fired.
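One possible guard, sketched under assumptions: before replying, check whether the agent has already answered after the message that triggered this webhook. The helper name `replyStillOwed` and the message fields (`date`, `from`) are hypothetical, and this check alone does not cover two handlers running at the same instant, which would also need per-thread serialization.

```javascript
// Hypothetical helper: is a reply still owed for this incoming message?
// Returns false if the agent already sent something newer on the thread.
function replyStillOwed(threadMessages, incoming, agentEmail) {
  const agent = agentEmail.toLowerCase();
  return !threadMessages.some(
    (m) =>
      (m.from?.[0]?.email || "").toLowerCase() === agent &&
      m.date > incoming.date
  );
}

const threadSoFar = [
  { date: 10, from: [{ email: "alice@example.com" }] },
  { date: 11, from: [{ email: "bob@example.com" }] },
  { date: 12, from: [{ email: "support@example.com" }] },
];

// Bob's webhook is processed after the agent already answered at date 12.
console.log(replyStillOwed(threadSoFar, threadSoFar[1], "support@example.com"));
```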
A test worth running
Build the minimal loop — send, store thread_id, reply to yourself from a personal account, watch the webhook restore context — then kill your process between the send and the reply. If the agent still picks up the conversation after a restart, your memory model is real. If it doesn't, you've found the bug before a customer did. How long do conversations in your domain go quiet before they come back?