Multi-Turn Email Conversations for LLM Agents

#ai #email #llm #agents

Day 0, 10:00: your agent sends a demo follow-up. Day 2, 14:37: the prospect replies with a question. Day 2, 14:39: they send a second thought. Day 5: silence, then a reply to something the agent said days earlier. Somewhere between day 0 and day 5, your process restarted twice and you deployed once.

A single send-and-forget email is easy. The timeline above is the actual job: a conversation spanning five exchanges over days, where the agent has to remember what it said, what it's waiting for, and where in the workflow it stands — across restarts, deploys, and hours of dead air. The multi-turn conversation recipe builds this loop on a Nylas Agent Account (the feature's in beta), running entirely on webhooks and the Threads API — no polling, no missed messages.

State lives outside the model

The core design decision: every active conversation gets a durable record keyed by the thread ID.

const conversationRecord = {
  threadId: "nylas-thread-id",
  grantId: AGENT_GRANT_ID,
  contactEmail: "prospect@example.com",
  purpose: "demo_followup",   // What started this conversation
  step: "awaiting_reply",     // Where in the workflow we are
  turnCount: 1,
  maxTurns: 10,               // Safety cap before escalation
  lastActivityAt: "2026-04-14T10:00:00Z",
  metadata: {},
};

The step field is the heart of it — a tiny state machine tracking what the agent is waiting for, which determines how the next inbound message gets handled. The store has to be durable (Postgres, Redis with AOF, DynamoDB); the gap between messages can be days, so in-memory state is a non-starter.
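One way to keep the step field honest is an explicit transition table that rejects moves the workflow should never make. The sketch below is an assumption about how that could look, not the recipe's code: awaiting_reply, escalated, and completed are the step values named in this article, while the TRANSITIONS table and the advanceStep helper are hypothetical names for illustration.

```javascript
// Hypothetical transition table: keys are current steps, values are the steps
// the workflow may move to next. The step names come from the article; the
// table itself and the allowed moves are assumptions.
const TRANSITIONS = {
  awaiting_reply: ["awaiting_reply", "completed", "escalated"],
  escalated: [], // terminal: a human owns the thread now
  completed: [], // terminal: later replies must not restart the workflow
};

// Returns a copy of the record moved to nextStep, or throws if that move
// is not allowed from the record's current step.
function advanceStep(conversation, nextStep) {
  const allowed = TRANSITIONS[conversation.step] ?? [];
  if (!allowed.includes(nextStep)) {
    throw new Error(`Illegal transition: ${conversation.step} -> ${nextStep}`);
  }
  return { ...conversation, step: nextStep };
}
```

A table like this is also a cheap guard on model output: if the LLM proposes a next step the table does not allow, the handler can escalate instead of writing a bad state.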

Starting a conversation means sending the first message and persisting the record under the threadId the send returns:

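import Nylas from "nylas";

// Assumed setup: the API key and the agent's grant ID come from environment variables.
const nylas = new Nylas({ apiKey: process.env.NYLAS_API_KEY });
const AGENT_GRANT_ID = process.env.AGENT_GRANT_ID;

async function startConversation({ to, subject, body, purpose, metadata }) {
  // Send the opening message from the agent's grant.
  const sent = await nylas.messages.send({
    identifier: AGENT_GRANT_ID,
    requestBody: {
      to: [{ email: to.email, name: to.name }],
      subject,
      body,
    },
  });

  // `db` is your own persistence layer; key the record by the returned thread ID.
  await db.conversations.create({
    threadId: sent.data.threadId,
    grantId: AGENT_GRANT_ID,
    contactEmail: to.email,
    purpose,
    step: "awaiting_reply",
    turnCount: 1,
    maxTurns: 10,
    lastActivityAt: new Date().toISOString(),
    metadata: metadata ?? {},
  });

  return sent.data;
}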

Email threading does the heavy lifting from there: every future reply arrives carrying the same thread_id, which is your lookup key back into the agent's memory.

The webhook handler is mostly filters

When message.created fires, the handler runs three checks before any LLM gets involved:

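// `event` is the parsed webhook body; `agentEmail` is the agent account's address.
async function handleMessageCreated(event) {
  const msg = event.data.object;

  // 1. Ignore events for grants other than the agent's.
  if (msg.grant_id !== AGENT_GRANT_ID) return;

  // 2. Outbound fires message.created too, so don't reply to yourself.
  //    Compare case-insensitively; address casing can vary between hops.
  const sender = msg.from?.[0]?.email?.toLowerCase();
  if (sender === agentEmail.toLowerCase()) return;

  // 3. Only continue threads the agent started.
  const conversation = await db.conversations.findByThreadId(msg.thread_id);
  if (!conversation) {
    await triageNewInbound(msg);  // Not a reply to anything we sent.
    return;
  }

  // ...lifecycle checks and reply generation continue from here.
}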

That middle check is the classic footgun: message.created fires for the agent's own sends. Skip the sender check and the agent enters a polite infinite loop with itself.

Restoring context: the thread is the memory

The webhook payload only carries summary fields, so the handler fetches the full message, then pulls the entire thread and every message in it, sorts by date, and formats a transcript with agent / contact roles. The LLM gets the transcript plus the current step and purpose, generates the reply, and returns a nextStep that advances the state machine. The reply goes out with replyToMessageId set so it threads correctly on the recipient's side, and the record updates: increment turnCount, bump lastActivityAt, merge any new metadata.
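The transcript step can be sketched as a pure function. This is an illustration under assumptions, not the recipe's code: it assumes each fetched message exposes date as a Unix timestamp in seconds, from as an array of { email } objects, and body as HTML. buildTranscript and stripHtml are hypothetical helper names.

```javascript
// Builds a plain-text transcript from a thread's messages, oldest first.
// Assumed message shape: { date: <unix seconds>, from: [{ email }], body: <html> }.
function buildTranscript(messages, agentEmail) {
  const agent = agentEmail.toLowerCase();
  return [...messages] // copy so the caller's array is not reordered
    .sort((a, b) => a.date - b.date)
    .map((m) => {
      const sender = m.from?.[0]?.email?.toLowerCase();
      const role = sender === agent ? "agent" : "contact";
      return `[${role}] ${stripHtml(m.body ?? "")}`;
    })
    .join("\n\n");
}

// Very rough HTML-to-text pass; a real handler would use a proper parser
// and also strip quoted reply history.
function stripHtml(html) {
  return html.replace(/<[^>]*>/g, " ").replace(/\s+/g, " ").trim();
}
```

Keeping this function free of I/O makes it easy to test against saved threads before any model is involved.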

One efficiency note from the recipe that pays for itself fast: the model doesn't need every message. For long threads, summarize the early turns and pass only the last 3–4 messages in full. Token usage stays sane without losing the context that matters.
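A minimal sketch of that split, assuming the messages are already sorted oldest first. splitForPrompt and its keepFull parameter are hypothetical names; how the early turns get summarized (a cheaper model call, a cached summary in metadata) is left open here.

```javascript
// Splits an oldest-first message list into the early turns to summarize and
// the most recent turns to pass to the model verbatim.
function splitForPrompt(sortedMessages, keepFull = 4) {
  const cut = Math.max(0, sortedMessages.length - keepFull);
  return {
    toSummarize: sortedMessages.slice(0, cut), // empty for short threads
    verbatim: sortedMessages.slice(cut),
  };
}
```

For a thread shorter than keepFull, toSummarize comes back empty and the whole thread is passed through unchanged.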

Conversations end badly unless you decide how they end

The recipe treats lifecycle edges as first-class features, not error handling:

Turn caps. Before generating any reply, check turnCount against maxTurns. An unbounded loop is a token sink and a risk; 10 turns is the recipe's default, so tune it to whatever's realistic for the workflow.
Escalation. Cap reached, topic out of scope, frustration detected: set step to escalated, record the reason, and notify a human through whatever you use — Slack, PagerDuty, an internal API.
Completion. Purpose fulfilled (meeting booked, question answered): mark the record completed so a later reply on the same thread doesn't reanimate the workflow.
Dormancy. Someone replies after weeks of silence. The recipe's threshold: over 168 hours of inactivity, escalate instead of auto-replying. A confident LLM response to a context the agent half-remembers is worse than a human taking over.
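The turn cap and completion rules above can be sketched as a single gate that runs before any generation. This is an assumption about structure, not the recipe's code: gateInbound and its "skip" / "escalate" / "reply" actions are hypothetical names, and the step values are the ones used in this article.

```javascript
// Decides what to do with an inbound reply before the LLM is called.
// Returns { action: "skip" | "escalate" | "reply", reason? }.
function gateInbound(conversation) {
  // Finished or handed-off threads never re-enter the automated loop.
  // "skip" only means no auto-reply; you may still want to notify a human.
  if (conversation.step === "completed" || conversation.step === "escalated") {
    return { action: "skip", reason: `conversation is ${conversation.step}` };
  }
  // Turn cap: hand off rather than generate another reply.
  if (conversation.turnCount >= conversation.maxTurns) {
    return { action: "escalate", reason: "turn cap reached" };
  }
  return { action: "reply" };
}
```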

The dormancy check is a few lines in the webhook handler, run before the conversation continues:

const hoursSinceLastActivity =
  (Date.now() - new Date(conversation.lastActivityAt).getTime()) / 3600000;

if (hoursSinceLastActivity > 168) {
  await escalate(conversation, "dormant thread reopened after 7+ days");
  return;
}

Escalation itself is just a state transition plus a notification — set step: "escalated", store the reason in metadata, ping the human channel. The thread stays intact, so whoever picks it up reads the same transcript the agent had.
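A minimal sketch of that transition, with the storage and notification layers injected so it stays testable. store.update, notifyHuman, makeEscalate, and the escalationReason metadata key are hypothetical stand-ins for whatever your database layer and alerting channel actually expose.

```javascript
// Builds an escalate(conversation, reason) function around injected dependencies.
// `store.update(threadId, record)` and `notifyHuman(payload)` are assumed
// interfaces, not part of any real SDK.
function makeEscalate({ store, notifyHuman }) {
  return async function escalate(conversation, reason) {
    const updated = {
      ...conversation,
      step: "escalated",
      metadata: { ...conversation.metadata, escalationReason: reason },
    };
    // Persist first, so a failed notification never leaves the agent replying.
    await store.update(updated.threadId, updated);
    await notifyHuman({
      threadId: updated.threadId,
      contactEmail: updated.contactEmail,
      reason,
    });
    return updated;
  };
}
```

Persisting before notifying is a design choice in this sketch: if the alert fails, the record is already escalated and the agent stays quiet, which is the safer failure.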

Two more behaviors that separate demos from production: batch rapid-fire replies (a 30–60 second delay turns two quick messages into one turn instead of two separate generated replies), and treat webhook redelivery plus concurrent workers as a day-one concern — dedup and locking, not an edge case for later.
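Both behaviors can be sketched in memory to show the shape of the logic. This is illustrative only and rests on assumptions: it presumes each webhook event carries a unique ID, and createDeduper and createBatcher are hypothetical names. Timers and Sets vanish on restart, so a production version would likely back them with the durable store (for example a unique index on event ID and a delayed job), and would add per-thread locking, which this sketch omits.

```javascript
// Dedup: remember webhook event IDs so a redelivered event is handled once.
function createDeduper() {
  const seen = new Set();
  return function isDuplicate(eventId) {
    if (seen.has(eventId)) return true;
    seen.add(eventId);
    return false;
  };
}

// Batching: collect messages per thread and flush only after the thread has
// been quiet for delayMs, so two quick replies become one turn.
function createBatcher(onFlush, delayMs = 45_000) {
  const pending = new Map(); // threadId -> { messages, timer }
  return function enqueue(threadId, message) {
    const entry = pending.get(threadId) ?? { messages: [], timer: null };
    clearTimeout(entry.timer); // restart the quiet-period clock
    entry.messages.push(message);
    entry.timer = setTimeout(() => {
      pending.delete(threadId);
      onFlush(threadId, entry.messages);
    }, delayMs);
    pending.set(threadId, entry);
  };
}
```

In a handler, isDuplicate would run first on the event ID, and enqueue would replace the direct call into reply generation; the flush callback then generates one reply for everything that arrived in the window.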

Why this beats a chat session

Chat sessions evaporate when the tab closes. An email thread is durable, human-readable, and auditable — the conversation state machine on top of it can crash, redeploy, and resume, because the source of truth (the thread) and the workflow position (the record) both survive. That's a genuinely good persistence model for any agent whose counterpart is a human on their own schedule.

A focused way to start: implement just startConversation and the webhook handler with the three filters, hard-code one purpose, and run a single conversation with yourself across two days — including one process restart in the middle. If the agent picks the thread back up correctly, the rest is iteration. What's the longest-running conversation you'd trust an agent to hold?