Ephemeral Inboxes: Spin Up a Mailbox Per Test Run

#testing #email #api #devops

Two CI workers kick off at the same moment. Both sign up a test user, both poll the shared QA Gmail account for "the" verification email, and worker #7 grabs the message that belonged to worker #12. The test passes. The wrong test. You spend an afternoon staring at a green build that should've been red.

Shared inboxes are the single biggest source of flakiness in email-dependent E2E tests, and every workaround — catch-all forwarding rules, label rules scoped per PR, OAuth tokens living on the runner — adds another moving part that breaks on its own schedule. The fix is structural: every test gets its own address, on infrastructure your suite provisions and destroys.

One wildcard, infinite addresses

The E2E email testing recipe sets this up with one CLI command:

nylas inbound create e2e

You get back an inbox ID and a wildcard pattern shaped like e2e-*@yourapp.nylas.email. From there, each test mints a unique address under the wildcard, e2e-<uuid>@yourapp.nylas.email, with nothing to provision, pay for, or configure per address. The wildcard is just a naming convention, so burn UUIDs freely. Mail flows through MX records hosted on the Nylas side, which means zero DNS work in your own zone (the tradeoff: addresses live under *.nylas.email).

The Playwright fixture is two pieces — an address minter and a poller:

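import { test as base } from "@playwright/test";
import { execSync } from "node:child_process";
import { randomUUID } from "node:crypto";

// Message shape assumed from the fields the fixture and test read (to[].email, body).
type InboundMessage = { to: { email: string }[]; body: string };

type Fixtures = {
  testEmail: string;
  pollInbox: (timeoutMs?: number) => Promise<InboundMessage>;
};

export const test = base.extend<Fixtures>({
  // Mint a unique address under the wildcard for every test.
  testEmail: async ({}, use) => {
    await use(`e2e-${randomUUID()}@yourapp.nylas.email`);
  },

  // Poll the shared inbox until a message addressed to this test arrives.
  pollInbox: async ({ testEmail }, use) => {
    const poll = async (timeoutMs = 30_000) => {
      const deadline = Date.now() + timeoutMs;
      while (Date.now() < deadline) {
        const out = execSync(
          `nylas inbound messages ${process.env.INBOX_ID} --json --limit 50`,
        ).toString();
        const messages: InboundMessage[] = JSON.parse(out);
        const match = messages.find((m) =>
          m.to.some((t) => t.email === testEmail),
        );
        if (match) return match;
        await new Promise((r) => setTimeout(r, 1500));
      }
      throw new Error(`Email never arrived for ${testEmail}`);
    };
    await use(poll);
  },
});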

A signup test then reads the way you wish it always had:

import { expect } from "@playwright/test";
import { test } from "./fixtures"; // hypothetical path to the file that exports the fixture above

test("signup completes after email verification", async ({ page, testEmail, pollInbox }) => {
  await page.goto("/signup");
  await page.getByLabel("Email").fill(testEmail);
  await page.getByLabel("Password").fill("hunter2-correct");
  await page.getByRole("button", { name: "Create account" }).click();

  await expect(page.getByText("Check your inbox")).toBeVisible();

  // Wait for the message addressed to this test, then pull out the verify link.
  const msg = await pollInbox();
  const linkMatch = msg.body.match(/https:\/\/[^\s"<]+\/verify\?[^\s"<]+/);
  expect(linkMatch).not.toBeNull();

  await page.goto(linkMatch![0]);
  await expect(page.getByText("Email verified")).toBeVisible();
});

Password reset is the same shape. OTP flows swap the link regex for /\b\d{6}\b/ — though watch out for bodies with multiple 6-digit numbers (phone numbers, transaction IDs); match near the label text or use a stricter extraction helper. And when the verification URL lives in an <a href> instead of plain text, parse the HTML rather than regexing the whole body:

import * as cheerio from "cheerio";

const $ = cheerio.load(msg.html);
const link = $('a:contains("Verify your email")').attr("href");
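For the OTP caveat above, one way to "match near the label text" is to anchor the digit pattern to the label. This is a minimal sketch, not the recipe's helper: extractOtp, its default label, and the 40-character gap are all assumptions chosen for illustration.

```typescript
// Hypothetical helper: pull a 6-digit code that follows a known label,
// instead of taking the first 6-digit number anywhere in the body.
export function extractOtp(body: string, label = "verification code"): string | null {
  // Escape the label so it is matched literally inside the RegExp.
  const escaped = label.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // Allow up to 40 non-digit characters (": ", " is ", light markup)
  // between the label and the code; reject runs longer than 6 digits.
  const pattern = new RegExp(`${escaped}\\D{0,40}?(\\d{6})(?!\\d)`, "i");
  const match = body.match(pattern);
  return match ? match[1] : null;
}
```

Because the match starts at the label, an order number earlier in the body or a phone number after the code is ignored. Adjust the label and the gap to whatever your email template actually says.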

Why this is parallel-safe by construction

Playwright runs tests across workers, and fullyParallel: true extends that to tests within a single file. Every worker shares the same inbox ID, but the addresses aren't shared. Each test polls for messages addressed to its own UUID, so the matching logic never picks up another worker's mail. No filtering by subject, no "wait until the right message bubbles to the top." Delivery latency is typically under 5 seconds, so the 1.5-second poll interval catches most messages within the first few iterations; the 30-second default timeout is generous for almost any flow.

One performance note from the recipe: execSync blocks the worker process until the CLI returns. That's fine for most suites, but chatty ones should swap in a promise-based call (for example an execAsync built by wrapping child_process.exec with util.promisify) and await polls in parallel. And if you want a clean inbox between debugging sessions, mark messages read in an afterEach; otherwise mail just ages out with the standard retention window.
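A non-blocking version of the poller might look like the sketch below. It assumes the same CLI arguments and message shape as the fixture above. It uses a promisified execFile rather than exec so the inbox ID is never interpolated into a shell string. The names fetchViaCli and pollFor are illustrative, and the fetcher is injectable so the loop can be exercised without the CLI.

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

// Assumed message shape, inferred from the fixture above.
type InboundMessage = { to: { email: string }[]; body: string };
type Fetcher = () => Promise<InboundMessage[]>;

// Non-blocking replacement for the execSync call; same CLI arguments as the fixture.
export const fetchViaCli: Fetcher = async () => {
  const { stdout } = await execFileAsync("nylas", [
    "inbound", "messages", process.env.INBOX_ID ?? "", "--json", "--limit", "50",
  ]);
  return JSON.parse(stdout) as InboundMessage[];
};

// Poll until a message addressed to `address` shows up or the deadline passes.
export async function pollFor(
  address: string,
  fetchMessages: Fetcher = fetchViaCli,
  { timeoutMs = 30_000, intervalMs = 1_500 } = {},
): Promise<InboundMessage> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const messages = await fetchMessages();
    const match = messages.find((m) => m.to.some((t) => t.email === address));
    if (match) return match;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Email never arrived for ${address}`);
}
```

A test that waits on two addresses can then run both polls concurrently with Promise.all([pollFor(a), pollFor(b)]) instead of blocking on each in turn.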

When the test needs a full mailbox, not just an address

A wildcard inbox covers assertion-style tests: did the email arrive, does it contain the right link. Some suites need more — an identity that can send, sign up for a third-party service, and complete onboarding autonomously. That's the Agent Account flow: a fully functional, API-controlled mailbox (Agent Accounts are in beta) that your pipeline provisions per run.

nylas agent account create signup-agent@agents.yourdomain.com

The recipe pairs this with a message.created webhook, which fires within a second or two of mail arriving — your handler matches the expected sender, fetches the full body, extracts the confirmation link, and follows it. Two of its warnings are worth tattooing onto any test-infra design doc:

Don't trust the first message that arrives. Plenty of services send a "Welcome" email before the verification email. Match the sender and the expected URL pattern before acting on anything.
Don't ship per-run agents without teardown. Inactive grants accumulate. Delete on completion or failure:

nylas agent account delete signup-agent@agents.yourdomain.com --yes
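The first warning above can be sketched as a small guard that a webhook handler calls before following any link. The message shape (from[].email, body) and the name pickVerificationLink are assumptions for illustration, not the recipe's code. The idea is simply that both the sender and the URL pattern must match.

```typescript
// Assumed shape of a fetched message; field names are illustrative.
type IncomingMessage = { from: { email: string }[]; body: string };

// Return the confirmation link only when the sender domain and the URL
// pattern both match; return null for anything else (e.g. a "Welcome" email).
export function pickVerificationLink(
  msg: IncomingMessage,
  expectedSenderDomain: string,
  linkPattern: RegExp,
): string | null {
  const domain = expectedSenderDomain.toLowerCase();
  const senderOk = msg.from.some((f) => f.email.toLowerCase().endsWith(`@${domain}`));
  if (!senderOk) return null;
  const match = msg.body.match(linkPattern);
  return match ? match[0] : null;
}
```

A handler would keep waiting when this returns null, so an early welcome email never consumes the test's single "follow the link" step.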

Also practical: a free-plan Agent Account sends up to 200 messages per account per day, so a large test matrix should provision multiple grants rather than hammering one. If your test address ever leaks, an allow-list policy — a list of allowed from.domain values paired with a block rule for everything else — keeps the inbox deterministic. And one non-technical warning the recipe makes explicitly: programmatic signup is fine for your own testing and first-party integrations, but check the target service's terms before automating against third parties.
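To size a test matrix against the 200-per-day send limit quoted above, a rough estimate is enough. The function name and inputs below are illustrative assumptions, and the calculation ignores retries.

```typescript
// Free-plan figure quoted in this article; pass a different limit if your plan differs.
const DAILY_SEND_LIMIT = 200;

// Estimate how many Agent Accounts a day's worth of runs needs
// if each account may send at most `limit` messages per day.
export function accountsNeeded(
  sendsPerRun: number,
  runsPerDay: number,
  limit = DAILY_SEND_LIMIT,
): number {
  return Math.max(1, Math.ceil((sendsPerRun * runsPerDay) / limit));
}
```

For example, 12 sends per run across 40 runs a day is 480 sends, which needs at least 3 accounts.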

Which one do you need?

Rough decision rule: if the test only ever receives (verification links, OTPs, notification assertions), the wildcard inbound inbox is lighter and faster to adopt. If the test has to act — send replies, complete a signup conversation, exercise your product's email round-trip — provision an Agent Account per run and tear it down in afterAll.
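Per-run provisioning with guaranteed teardown can be sketched as a wrapper around the two CLI commands shown earlier. The helper name withAgentAccount and the injectable run parameter are assumptions for illustration. The try/finally is what makes deletion happen on failure as well as on success.

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

// Runs the nylas CLI with the given arguments; injectable so tests can fake it.
type RunCli = (args: string[]) => Promise<unknown>;
const runNylas: RunCli = (args) => execFileAsync("nylas", args);

// Create an Agent Account, run `body` against it, and always delete it afterwards.
export async function withAgentAccount<T>(
  email: string,
  body: (email: string) => Promise<T>,
  run: RunCli = runNylas,
): Promise<T> {
  await run(["agent", "account", "create", email]);
  try {
    return await body(email);
  } finally {
    // Runs on success and on failure, so grants do not accumulate.
    await run(["agent", "account", "delete", email, "--yes"]);
  }
}
```

The same create and delete calls can live in beforeAll and afterAll hooks instead; the wrapper form just keeps the pair together so a new code path cannot skip the cleanup.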

A reasonable middle ground is reusing one long-lived Agent Account across signup runs instead of provisioning per run — the signup recipe explicitly supports both. Per-run accounts give you perfect isolation; a reused account gives you faster setup and one less teardown path to get wrong. Pick per-run for parallel CI, reused for local development.

The proof-of-concept costs about ten minutes: run nylas inbound create e2e, drop the fixture above into your Playwright project, and convert exactly one flaky signup test. Run it with --repeat-each=10 next to the old shared-inbox version and compare failure counts. That diff is the whole argument.

Top comments (1)

Luis Cruz • Jun 12

The opening scenario — worker #7 grabbing worker #12's verification email and passing the wrong test green — is the cleanest description of E2E email flakiness I've read. "Parallel-safe by construction" is the right framing too; making isolation a property of the address scheme rather than something you filter for after the fact removes a whole class of race conditions instead of patching them.
I build automation and backend systems — Python/FastAPI, Playwright, CI pipelines — and have lost real afternoons to exactly this shared-inbox problem. Would love to connect and trade notes, and happy to collaborate if you're building test infra in this space.