
Francisco Perez

Originally published at uncorreotemporal.com

Building Developer Tools That Scale: Lessons from Email Infrastructure

There is a certain category of developer tool that looks, from the outside, like it should take a weekend to build. A temporary email service is one of them. Create an address, receive mail, expire it after N minutes. The interface is trivial. The implementation is not.

Building uncorreotemporal.com — a programmable temporary email infrastructure for automation, CI pipelines, and AI agents — required solving a set of engineering problems that most "simple" developer tools eventually surface: concurrent ingestion, time-bounded data, real-time event propagation, and API contracts that hold up under machine load. This article is about those problems and how we solved them.


The Illusion of Simple Developer Tools

The most dangerous developer tools to build are the ones where the interface is clean. A queue with three methods. An email API with two endpoints. An auth service with one decision. The surface area looks small, which makes it easy to underestimate the engine behind it.

Consider what email APIs, queues, authentication services, and testing tools have in common: they are all infrastructure primitives. They do not do business logic — they are the layer below business logic. And infrastructure primitives fail in ways that stay invisible right up until they matter: message loss, split-brain states, TTL races, partial writes, replay ambiguity.

When you build a temporary email service for developers, you are not building a toy. You are building infrastructure that test suites depend on to create isolated inboxes, receive confirmation emails, and clean up reliably after every run. If it drops a message or delivers to an expired inbox, a CI job fails with a confusing, non-deterministic error. If it has an unpredictable API, automation scripts break when the contract shifts. The simplicity of the interface is a contract with your users — maintaining it requires solving complexity underneath.


Real Email Infrastructure Is Hard

SMTP is a 40-year-old protocol that was designed for delivery, not for programmable use. Building on top of it means accepting its characteristics: messages are raw bytes, headers are often inconsistent, multipart structure is implicit, and the protocol itself has no concept of a "user" or an "inbox" — only a recipient address and a message.

The ingestion layer has two distinct modes. In development and local testing, aiosmtpd runs as a proper async SMTP server:

class MailHandler:
    async def handle_DATA(self, server, session, envelope) -> str:
        raw: bytes = envelope.content

        for rcpt in envelope.rcpt_tos:
            address = rcpt.lower().strip()
            if not address.endswith(f"@{settings.domain}"):
                logger.debug("Unrecognized domain: %s, ignoring", address)
                continue
            await deliver_raw_email(raw, address)

        return "250 Message accepted"

The handler always returns 250 Message accepted. Rejection is silent and internal — a deliberate design decision. SMTP's response codes communicate to the sending server, not to the end user. Returning a 5xx for an unknown address would tell spammers which addresses exist.

In production, the ingestion path is entirely different: AWS SES receives the mail, publishes an SNS notification, and calls a webhook at POST /api/v1/ses/inbound. Both paths converge at core/delivery.py.
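The webhook's payload handling is not shown here, but the shape is easy to sketch. Assuming an SES receipt rule configured to embed the base64-encoded message content directly in the SNS notification (field names follow AWS's inbound notification format; raw_email_from_sns is a hypothetical helper, not code from the repo):

```python
import base64
import json

def raw_email_from_sns(body: dict) -> tuple[bytes, list[str]]:
    """Recover raw message bytes and recipients from an SNS delivery.

    Assumes the SES receipt rule includes the full message content
    (base64-encoded) in the SNS notification payload.
    """
    # SNS wraps the SES notification as a JSON string in "Message"
    notification = json.loads(body["Message"])
    raw = base64.b64decode(notification["content"])
    recipients = notification["receipt"]["recipients"]
    return raw, recipients
```

From there, each recipient is handed to the same deliver_raw_email call the SMTP handler uses, which is what makes the two ingestion paths converge.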

Parsing raw RFC 5322 bytes into structured data is where the real complexity lives. The implementation uses Python's stdlib email module with policy.default, the modern, standards-compliant parsing policy (stable since Python 3.6):

def parse_email(raw: bytes) -> ParsedEmail:
    msg = email.message_from_bytes(raw, policy=policy.default)
    body_text: str | None = None
    body_html: str | None = None
    attachments: list[ParsedAttachment] = []
    if msg.is_multipart():
        for part in msg.walk():
            content_type = part.get_content_type()
            disposition = str(part.get("content-disposition", ""))
            if "attachment" in disposition:
                attachments.append(ParsedAttachment(
                    filename=part.get_filename() or "attachment",
                    content_type=content_type,
                    size=len(part.get_payload(decode=True) or b""),
                ))
            elif content_type == "text/plain" and body_text is None:
                body_text = part.get_content()
            elif content_type == "text/html" and body_html is None:
                body_html = part.get_content()

A critical architectural choice: attachment binary payloads are not stored separately. Only metadata (filename, content type, size, content-id) is persisted as JSONB. The complete raw_email bytes are stored alongside. This means the original message is always re-parseable without duplication — and it simplifies the schema at the cost of making attachment retrieval slightly more expensive.


Designing for Concurrency from Day One

The entire stack is async: FastAPI with SQLAlchemy async sessions, aiosmtpd for SMTP, aioredis for pub/sub, and asyncio tasks for background work. This was not an afterthought — it was a prerequisite for any form of meaningful throughput.

The delivery pipeline (core/delivery.py) illustrates the async design. It is a six-step sequence where each step is an awaited call: look up the mailbox, load the plan, check quota, parse the email, insert the message, publish to Redis. No blocking I/O, no thread-per-connection overhead:

async def deliver_raw_email(raw: bytes, address: str) -> bool:
    async with AsyncSessionLocal() as db:
        return await _save_to_mailbox(raw, address, db)

The function returns a bool: True if the message was saved, False if it was silently rejected. This return value is consumed by the SMTP handler and the SES webhook alike, keeping both callers simple.

The fan-out from message delivery to real-time notification follows a pub/sub pattern over Redis:

# After DB commit:
payload = json.dumps({"event": "new_message", "message_id": str(message.id)})
await redis.publish(f"mailbox:{address}", payload)

The publish happens after the database commit. If Redis is unreachable, the exception is caught and logged — the message is already durable in PostgreSQL. Redis is the notification layer, not the source of truth. This ordering matters: message loss in Redis is recoverable (the client can poll); message loss in the DB is not.
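The guard around that publish can be sketched as a small wrapper. Here publish is any awaitable stand-in for redis.publish, and notify_new_message is a hypothetical name (the real logic lives inside core/delivery.py):

```python
import json
import logging

logger = logging.getLogger(__name__)

async def notify_new_message(publish, address: str, message_id: str) -> None:
    """Best-effort pub/sub notification; never raises past this point.

    By the time this runs, the message is already committed to PostgreSQL,
    so a Redis outage degrades to "the client falls back to polling".
    """
    payload = json.dumps({"event": "new_message", "message_id": message_id})
    try:
        await publish(f"mailbox:{address}", payload)
    except Exception:
        logger.exception("Redis publish failed for %s; client will poll", address)
```

The key property is the asymmetry: the DB write is allowed to fail the delivery, the Redis publish is not.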


Time-Based Data Is a Hidden Challenge

Expiring data is deceptively hard. The naive implementation — DELETE FROM mailboxes WHERE expires_at < now() — conflates two separate concerns: enforcement and storage reclamation. The production implementation separates them.

Expiration is a state transition, not a deletion. The background worker runs as an asyncio task co-hosted with the API process and fires every 60 seconds:

async def _expire_mailboxes() -> int:
    now = datetime.now(timezone.utc)
    async with AsyncSessionLocal() as db:
        result = await db.execute(
            update(Mailbox)
            .where(
                Mailbox.expires_at <= now,
                Mailbox.is_active == True,
            )
            .values(is_active=False)
        )
        await db.commit()
        return result.rowcount

One SQL statement. No Python-level iteration. No loading rows into memory. The entire sweep is a single round-trip to PostgreSQL, bounded in cost by the number of newly expired inboxes since the last cycle.

The sleep mechanism is worth examining. Rather than await asyncio.sleep(60), the worker uses:

await asyncio.wait_for(stop_event.wait(), timeout=interval_seconds)

A sleeping coroutine cannot be interrupted without task cancellation. An event wait can be signaled immediately. When the FastAPI lifespan shuts down, stop_expiry_task() sets the event and the worker exits within milliseconds rather than waiting up to 60 seconds for a sleep to expire.
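Putting the pieces together, the worker loop might look like this sketch, where sweep stands in for _expire_mailboxes and a TimeoutError from wait_for is the normal "interval elapsed" path:

```python
import asyncio
import logging

logger = logging.getLogger(__name__)

async def expiry_loop(sweep, stop_event: asyncio.Event, interval: float = 60.0) -> None:
    """Run `sweep` every `interval` seconds until stop_event is set."""
    while not stop_event.is_set():
        try:
            await sweep()
        except Exception:
            # a transient DB error must not kill the loop
            logger.exception("expiry sweep failed; retrying next interval")
        try:
            await asyncio.wait_for(stop_event.wait(), timeout=interval)
        except asyncio.TimeoutError:
            continue  # interval elapsed: run the next sweep
```

A completed wait (no TimeoutError) means shutdown was requested, and the while condition exits the loop on the next check.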

The subtler challenge is the race between the 60-second polling window and message delivery. Between an inbox's expires_at timestamp and when the background worker next runs, is_active may still be True even though the inbox has expired. The delivery layer closes this window with a double-check:

now = datetime.now(timezone.utc)
result = await db.execute(
    select(Mailbox).where(
        Mailbox.address == address,
        Mailbox.is_active == True,
        Mailbox.expires_at > now,   # critical: not just is_active
    )
)

The background worker's state transition is an optimization that keeps list queries and quota checks efficient. But it is not the primary enforcement mechanism. Delivery correctness does not depend on the worker having run.


API Design for Automation

Developer tools built for automation require a different API design philosophy than consumer products. The contract must be machine-readable, predictable, and composable with CI pipelines and agent loops.

The mailbox creation endpoint is designed for automation-first use:

POST /api/v1/mailboxes?ttl_minutes=30

No request body — TTL is a query parameter because it is always optional and has a plan-dependent default. The response is flat and explicit:

{
  "address": "swift-river-42@uncorreotemporal.com",
  "expires_at": "2026-03-13T15:30:00+00:00",
  "session_token": "uct_xxxxx"
}

expires_at is always an ISO 8601 timestamp with UTC offset — not a duration, not a relative time. A test harness receives this and can schedule its own assertion: if datetime.now(tz=utc) > expires_at, the inbox is gone. No ambiguity about timezone, no state to track.

Plan limits are enforced at the API layer before any resource is created:

effective_ttl = min(ttl_minutes or plan.default_ttl_minutes, plan.max_ttl_minutes)

A free plan user requesting a 48-hour inbox gets a 60-minute inbox, silently capped. The expires_at in the response reflects the actual cap. Free caps: 60 minutes, 1 mailbox, 10 messages. Pro: 24 hours, 20 mailboxes, 500 messages.

The messages API separates listing from retrieval:

GET    /api/v1/mailboxes/{address}/messages        → list (metadata only, no body)
GET    /api/v1/mailboxes/{address}/messages/{id}   → full message (marks read)
DELETE /api/v1/mailboxes/{address}/messages/{id}   → delete

The list endpoint never returns body content — only id, from_address, subject, received_at, is_read, has_attachments. A CI test polling for an OTP email reads the list, finds the matching subject, fetches only that message. The list stays lightweight regardless of how many large emails have been received.


Real-Time Systems Change the Game

Polling is the wrong abstraction for event-driven workflows. A test suite that polls GET /messages every second is burning API quota and introducing latency that scales with polling interval. WebSockets change the model entirely.

The WebSocket endpoint at WS /ws/inbox/{address} accepts connections authenticated via either a session token or an API key. Authentication happens before the connection is accepted — an unauthorized client receives a 1008 Policy Violation close frame, not a connected stream followed by a permission error:

authorized = await _authenticate(address, token, api_key)
if not authorized:
    await websocket.close(code=status.WS_1008_POLICY_VIOLATION)
    return
await websocket.accept()

Once connected, the endpoint subscribes to the Redis pub/sub channel for that inbox and runs two concurrent asyncio tasks:

send_task = asyncio.create_task(_send_loop())   # forwards Redis messages to WS
ping_task = asyncio.create_task(_ping_loop())   # 30s keepalive ping

done, pending = await asyncio.wait(
    [send_task, ping_task],
    return_when=asyncio.FIRST_COMPLETED,
)
for task in pending:
    task.cancel()

The client receives two event types:

{"event": "new_message", "message_id": "550e8400-e29b-41d4-a716-446655440000"}
{"event": "ping"}

On new_message, the client calls GET /api/v1/mailboxes/{address}/messages/{id} to retrieve the full body. The WebSocket is a notification channel only — it carries the signal, not the payload. For AI agents and CI pipelines: open a WebSocket before triggering the flow that sends the email, wait for the new_message event, fetch the body, extract the OTP. Zero polling. Sub-second latency from email arrival to agent response.


Infrastructure Lessons Learned

Async is the default, not an optimization. Every I/O operation — database queries, Redis operations, SMTP handling, WebSocket messaging — is non-blocking. This is the design that makes a single-process deployment serve concurrent WebSocket connections, handle incoming SMTP sessions, and run background expiration sweeps without threading complexity.

Design for failure at every layer. The Redis publish in the delivery pipeline is wrapped in a try/except that logs the error and returns True. The message is already in PostgreSQL. Redis failure does not become delivery failure. The expiry loop wraps its core operation in a try/except so that a transient database error does not kill the loop. Infrastructure components fail. The question is whether the failure is isolated or cascading.

Separate enforcement from cleanup. The expiry model distinguishes between is_active = False (enforcement, near-real-time) and physical deletion (cleanup, deferred). This separation means the delivery layer can enforce TTL without depending on a garbage collector having run.

Single source of truth for state. Plan limits and TTL caps live in the database, not in application configuration. A free plan with max_ttl_minutes = 60 is a row in the plans table. Changing it requires no deployment.

Dual ingestion path, shared logic. The aiosmtpd handler and the SES webhook share core/delivery.py entirely. The SMTP handler's handle_DATA method is three lines of routing logic; everything else is shared. Local development tests exactly the same delivery pipeline as production SES ingestion.


The Hidden Complexity Behind "Simple" Products

The products that developers call "simple" are the ones where the interface succeeded. Stripe's charge API is simple. S3's put/get is simple. The simplicity is the product of enormous engineering effort spent hiding complexity from the caller.

A temporary email service that works looks like this from the outside: create inbox, receive mail, read message, inbox expires. Four operations. What it requires underneath: a domain registered with MX records pointing at your infrastructure, an async SMTP ingestion layer with domain validation and silent rejection, an RFC 5322 parser that handles malformed multipart messages without crashing, a quota enforcement system tied to a plan hierarchy, a background expiration worker with correct race condition handling, a real-time notification system over Redis pub/sub with WebSocket fan-out, and an API designed for machine consumption at every endpoint.

None of these are visible in the API contract. That invisibility is the job.

The engineering trap is believing that because the interface is simple, the implementation can be simple too. It cannot. The interface's simplicity is the result of pushing complexity inward — into the infrastructure, into the error handling, into the data model. When you skip that work, the complexity leaks out into the caller's code.


Conclusion

Building developer tools that scale comes down to a few durable engineering principles.

Make async the baseline, not the upgrade path. If your infrastructure needs to handle concurrent connections, background workers, and real-time events — and any non-trivial tool will — an async framework is the correct starting point, not a future refactor.

Design data lifecycle explicitly. Time-bounded data requires a model that separates enforcement from cleanup, and an enforcement mechanism that does not depend on cleanup having run. Race conditions in time-based systems are not edge cases — they are the default condition.

Build APIs for machines first. Predictable response shapes, explicit timestamps, flat structures, and quiet enforcement of limits are properties that make automation reliable. An API designed for a human UI can often be driven by a machine; an API designed for a machine can always be used by a human.

Isolate failure. Every I/O operation has failure modes. The question is whether those failures propagate or are contained. Redis unavailability should not cause message loss. A transient database error should not kill a background worker. Design the blast radius before the failure happens.

The measure of infrastructure engineering is not what happens under ideal conditions — it is what happens when things go wrong, and whether those failures are invisible to the people building on top of you. That is what "simple" actually means.

The full system is running at https://uncorreotemporal.com. Anonymous inboxes need no signup: POST /api/v1/mailboxes?ttl_minutes=5 creates a live inbox with a five-minute TTL and returns a WebSocket-ready address. It is as simple as it looks.
