Qasim Muhammad

Posted on Jun 12

Voice Agents That Follow Up by Email

#ai #voice #email #agents

Last sprint, a team I talked to demoed a voice agent that handled support calls impressively — right up until a caller asked "can you email me those instructions?" and the room went quiet. The agent could talk about the docs. It had no address to send them from. The workaround on the whiteboard afterwards was grim: relay through a shared noreply@, lose the replies, reconcile threads manually in the ticketing system.

Voice agents hit this wall constantly, because phone calls generate follow-up artifacts — reset instructions, documents, meeting recaps — and email is how callers expect to receive them. The clean fix is the same one that works for text agents: the voice agent gets its own mailbox.

The identity half

A Nylas Agent Account is a hosted mailbox you create through the API (Agent Accounts are in beta). The voice use case in the product docs is exactly the scenario above: a voice agent taking support calls sends documents, reset instructions, or meeting recaps from its own voice-agent@yourcompany.com address the moment the caller asks. What makes it more than a send pipe: when the caller replies, the reply returns through the same account, so the full conversation is one thread in one mailbox. The phone call and its written follow-ups stop living in separate systems.

Each account is a real grant with a grant_id that works against the existing Messages, Threads, and Webhooks endpoints, ships with six system folders, and sends up to 200 messages per account per day on the free plan.

The plumbing half

The voice agents recipe covers how the runtime actually calls email tools. The flow is the same regardless of vendor:

speech → STT → LLM (function-calling) → subprocess(nylas …) → JSON → LLM → TTS → speech

The LLM decides on a tool, the runtime spawns a Nylas CLI subprocess with --json, the result comes back, and the model composes a spoken response. On LiveKit, a tool is just a decorated function:

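import asyncio
import subprocess

from livekit.agents import function_tool

@function_tool()
async def list_recent_emails(limit: int = 5) -> str:
    """List the last few emails. Keep limit small for voice."""
    cmd = ["nylas", "email", "list", "--limit", str(limit), "--json"]
    try:
        # Run the blocking CLI call in a worker thread so the event loop stays responsive.
        out = await asyncio.to_thread(
            subprocess.run, cmd, capture_output=True, text=True, timeout=30
        )
    except subprocess.TimeoutExpired:
        # subprocess.run raises on timeout; turn that into a speakable string.
        return "Could not fetch emails."
    return out.stdout if out.returncode == 0 else "Could not fetch emails."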

Vapi is the same idea over webhooks — Vapi posts JSON to your backend when the LLM calls a tool, your handler executes the CLI, and you return stdout in Vapi's envelope:

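import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

app.post("/vapi/tools", async (req, res) => {
  const { id, parameters } = req.body.message.toolCall;
  // Coerce to a number so caller-supplied input never reaches argv unchecked.
  const limit = Number.parseInt(parameters?.limit, 10) || 5;
  let result;
  try {
    // execFile takes the binary and an argv array; no shell is involved.
    const { stdout } = await execFileAsync(
      "nylas", ["email", "list", "--limit", String(limit), "--json"],
      { timeout: 30000 },
    );
    result = stdout;
  } catch (err) {
    // Non-zero exit or timeout: return something the agent can say aloud.
    result = "Could not fetch emails.";
  }
  res.json({ results: [{ toolCallId: id, result }] });
});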

Retell, Bland.ai, and OpenAI Realtime all follow the generic define-schema, dispatch-to-subprocess, return-JSON pattern. The recipe is explicit about why this beats running an MCP server next to the voice runtime: voice frameworks expect function-call-style tools that hand back a JSON blob, not a JSON-RPC peer. A side benefit of routing through the CLI: it absorbs every provider difference, so the same tools work whether the grants behind them are Gmail, Microsoft 365, Exchange, Yahoo, iCloud, IMAP — or an Agent Account.
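The generic define-schema, dispatch-to-subprocess, return-JSON pattern might look roughly like this in Python. This is a minimal sketch, not the recipe's code: the TOOLS registry, the dispatch function, and the error payload shape are hypothetical names chosen for illustration, and only the nylas email list command shown earlier in this post is registered.

```python
import json
import subprocess
from typing import Callable

# Hypothetical registry: tool name -> function that turns LLM parameters into argv.
TOOLS: dict[str, Callable[[dict], list[str]]] = {
    "list_recent_emails": lambda p: [
        "nylas", "email", "list", "--limit", str(int(p.get("limit", 5))), "--json",
    ],
}


def dispatch(name: str, params: dict, run=subprocess.run) -> str:
    """Run the CLI command behind a tool call and return a JSON string for the LLM."""
    build = TOOLS.get(name)
    if build is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        proc = run(build(params), capture_output=True, text=True, timeout=30)
    except (subprocess.TimeoutExpired, OSError):
        # Timeout or missing binary: hand the model a JSON error, not a traceback.
        return json.dumps({"error": "email is unavailable right now"})
    if proc.returncode != 0:
        return json.dumps({"error": "email is unavailable right now"})
    return proc.stdout
```

Each framework adapter (a decorated function, a webhook handler) then only has to map its own tool-call envelope onto dispatch(name, params). The run parameter is there so the dispatcher can be exercised without the CLI installed.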

Voice surfaces every UX mistake immediately

Four rules from the recipe, none optional:

Cap lists at 5. Reading a 50-message inbox aloud takes minutes. Default --limit 5 and let the caller say "more."
Summarize, don't read. Have the LLM produce "You've got three emails from Ada about the contract and a calendar invite from Rin" rather than narrating subject lines.
Confirm before send. Always. Speech-to-text mishears recipients and subjects in ways that send the wrong mail to the wrong person. The agent speaks the recipient, subject, and gist; only an explicit "yes" triggers the send tool:

   AGENT:  "Send to Ada at acme.test, subject 'pricing', body 'I'm in'?"
   USER:   "Yes."

Translate errors. "Error 401: invalid grant" is not a voice response. Map failures to "I couldn't reach email right now — you may need to re-authenticate."
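
The confirm-before-send rule can be sketched as a small gate that sits between the LLM's draft and the send tool. Everything here is an assumption for illustration: the Draft type, the EXPLICIT_YES phrase list, and the injected send callable (which would wrap whatever send command or API your stack uses) are not from the recipe.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Draft:
    to: str
    subject: str
    body: str


# Assumption: only these normalized phrases count as an explicit "yes".
EXPLICIT_YES = {"yes", "yes please", "yes send it"}


def confirmation_prompt(draft: Draft) -> str:
    """Text the agent speaks before any send."""
    return f"Send to {draft.to}, subject '{draft.subject}', body '{draft.body}'?"


def is_explicit_yes(utterance: str) -> bool:
    """Normalize the caller's reply and check it against the allowed phrases."""
    normalized = utterance.strip().lower().rstrip(".!").replace(",", "")
    return normalized in EXPLICIT_YES


def confirm_and_send(draft: Draft, reply: str, send: Callable[[Draft], None]) -> str:
    """Call send only on an explicit yes; return what the agent should say next."""
    if not is_explicit_yes(reply):
        return "Okay, I won't send that."
    send(draft)
    return "Sent."
```

Keeping the allowed phrases to a short exact-match list is deliberate: anything ambiguous ("um, maybe", "I guess") falls through to the no-send branch, which is the safe failure mode when speech-to-text is unreliable.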

And one rule that's really an SLA: every subprocess call needs a timeout, and 30 seconds is the right ceiling. Voice users won't wait a minute; the framework's silence detection kicks in and the conversation falls apart. The timeout is a backstop, not a target: aim for a round-trip under 2 seconds on the common tools (nylas email list --limit 5 --json clears that comfortably), and when the timeout does fire, return a graceful spoken fallback instead of bubbling the exception.
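
A minimal sketch of the timeout-plus-fallback rule, combined with error translation. The SPOKEN_ERRORS table, the substring matching on stderr, and the run_tool name are assumptions for illustration; the actual text the CLI prints on failure may differ, so the markers would need checking against real output.

```python
import subprocess

# Hypothetical mapping from markers in CLI error output to phrases the agent can speak.
SPOKEN_ERRORS = {
    "401": "I couldn't reach email right now. You may need to re-authenticate.",
}
FALLBACK = "I couldn't reach email right now. Please try again in a moment."


def speakable_error(stderr: str) -> str:
    """Pick a spoken phrase for a raw error string, defaulting to a generic one."""
    for marker, phrase in SPOKEN_ERRORS.items():
        if marker in stderr:
            return phrase
    return FALLBACK


def run_tool(argv: list[str], timeout_s: float = 30.0, run=subprocess.run) -> tuple[bool, str]:
    """Return (ok, text): CLI stdout on success, a speakable sentence on any failure."""
    try:
        proc = run(argv, capture_output=True, text=True, timeout=timeout_s)
    except (subprocess.TimeoutExpired, OSError):
        # Timeout fired or the binary is missing: never let the exception reach the caller.
        return False, FALLBACK
    if proc.returncode != 0:
        return False, speakable_error(proc.stderr or "")
    return True, proc.stdout
```

Returning a flag alongside the text lets the runtime decide whether to pass the string to the LLM as tool output or speak it directly.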

Why the dedicated address changes the product

Run the follow-up sends through the agent's own account rather than a borrowed human grant and three things improve at once:

Continuity. The caller replies to the recap, the reply lands in the agent's inbox, and the next interaction — voice or email — has the whole history in one thread.
Auditability. Every message the agent ever sent is sitting in its sent folder. The recipe separately recommends logging every send (recipient, subject, run ID, approval source) to your own store; the mailbox gives you the ground truth to reconcile against.
Multi-user routing stays sane. Voice platforms serving many users need per-user grant routing anyway — pass --api-key and --grant-id per command. The agent's outbound identity stays constant while the caller-side grants vary.
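
Per-user grant routing and the send log could be sketched as follows. The --api-key and --grant-id flags are the ones named above; appending them at the end of the command, the Credentials and SendRecord types, and the in-memory sink are assumptions for illustration, and a real deployment would write to its own durable store.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Credentials:
    api_key: str
    grant_id: str


def with_credentials(argv: list[str], creds: Credentials) -> list[str]:
    """Return a new argv with the per-command --api-key and --grant-id flags added."""
    return [*argv, "--api-key", creds.api_key, "--grant-id", creds.grant_id]


@dataclass
class SendRecord:
    recipient: str
    subject: str
    run_id: str
    approval_source: str


def log_send(record: SendRecord, sink: list[str]) -> None:
    """Append one JSON line per send; the sink stands in for your own store."""
    sink.append(json.dumps(asdict(record)))
```

Because with_credentials returns a new list rather than mutating its input, one base command can be reused across callers without one user's grant leaking into another user's call.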

Quick answers

Can I use MCP instead of subprocess tools? If your runtime genuinely speaks MCP — Claude Code does, for example — yes, and the docs cover that path separately. Voice runtimes mostly don't, which is why the recipe defaults to subprocess + --json.

Where does the calendar fit? The same subprocess pattern covers nylas calendar events list, so "do I have anything tomorrow?" is one more decorated function, not a new integration.

A reasonable first milestone: wire one tool — send_recap — into your existing voice stack, pointed at an agent address on a trial domain, with the confirm-before-send exchange in the conversation script. Call it yourself, ask for the recap, and reply to the email it sends you. If the reply shows up threaded in the agent's inbox, you've got the loop. What would your voice agent send first — recaps, docs, or reset links?

Top comments (2)

Luis Cruz • Jun 12

Great writeup — the dedicated agent mailbox is the part that clicks for me. Turning it from a send pipe into one threaded conversation, with the sent folder doubling as an audit trail, solves continuity and reconciliation in a single move. The confirm-before-send gate is smart too; STT mishearing a recipient is exactly how trust erodes fast.
I build voice and multi-agent systems — Python/FastAPI, LLM function-calling, RAG — and have been working through this same follow-up-artifact problem on a few projects. Would love to connect and compare notes, and happy to collaborate if you're building in this space. Nice work.

Marcus Chen • Jun 24

The follow-up-by-email step is exactly where I have seen voice agents quietly double-act: the call ends, the agent fires the email, the request times out, it retries, and now the caller gets two follow-ups. If the send is not keyed to the call intent rather than the HTTP attempt, a slow mailer turns one promise into two emails. I would put an idempotency key on the follow-up keyed to the call id plus the action, and confirm the email provider honors it. How are you deciding when the agent has actually committed to sending, since a barge-in or a mid-sentence hangup can leave that intent half-formed?