Edward Sam

Posted on Mar 24

Building an Outbound Email Engine That Doesn't Burn Your Domains (Architecture Deep-Dive)

#tutorial #productivity #news #startup

I'm a developer who got dragged into fixing our sales team's email infrastructure. What I found was horrifying.

No monitoring. No validation. No health checks. They were sending thousands of cold emails through mailboxes with 8% bounce rates, wondering why half their emails landed in spam. When a domain got blacklisted, their solution was to buy a new one and start over.

So I built a system to fix it. This post is the technical architecture behind it — how data flows from lead enrichment to email send, what breaks at each stage, and how we automated the recovery.

System Overview

Clay (enrichment)
  → Superkabe (validation + routing + monitoring)
    → Smartlead / Instantly (sending)
      → Bounce/reply webhooks back to Superkabe
        → Slack (alerts)

Simple on paper. The complexity is in the failure modes.

Stage 1: Lead Ingestion

Leads arrive via two paths:

Clay webhook — automated pipeline, fires when enrichment completes
CSV upload — manual bulk import

Both hit the same processLead() function. This is important — one ingestion path, regardless of source. No special cases.

// Simplified ingestion flow
async function processLead(lead: IncomingLead) {
  // Step 1: Validate the email BEFORE anything else
  const validation = await emailValidationService.validate(lead.email);

  if (validation.status === 'invalid') {
    await saveLead(lead, { status: 'BLOCKED', validation });
    return; // Never reaches the sending platform
  }

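  // Step 2: Health gate classification
  const health = classifyLead(lead, validation);
  // GREEN = safe, YELLOW = careful routing, RED = block
  if (health === 'RED') {
    await saveLead(lead, { status: 'BLOCKED', validation });
    return; // RED leads never reach the sending platform
  }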

  // Step 3: Route to campaign based on persona + health
  const campaign = await routingService.route(lead, health);

  // Step 4: Push to sending platform
  await sendingPlatform.pushLead(campaign, lead);
}

The email validation in Step 1 is the piece most outbound setups skip entirely. They trust the data source to provide valid emails. Bad idea.
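To make the "one ingestion path" idea concrete, here is a minimal sketch of normalizing both sources into one shape before anything calls processLead(). The field names (`work_email`, `job_title`, the CSV column headers) and the `persona` and `source` fields are assumptions for illustration, not Clay's actual webhook format or our real schema.

```typescript
// Hypothetical shapes: the real Clay payload and CSV columns will differ.
interface IncomingLead {
  email: string;
  persona: string;
  source: "clay" | "csv";
}

interface ClayPayload {
  work_email: string;
  job_title?: string;
}

// Normalize a Clay webhook payload into the shared lead shape.
function fromClayWebhook(payload: ClayPayload): IncomingLead {
  return {
    email: payload.work_email.trim().toLowerCase(),
    persona: payload.job_title ?? "unknown",
    source: "clay",
  };
}

// Normalize one parsed CSV row (header -> cell value) into the same shape.
function fromCsvRow(row: Record<string, string>): IncomingLead {
  return {
    email: (row["email"] ?? "").trim().toLowerCase(),
    persona: row["persona"] ?? "unknown",
    source: "csv",
  };
}
```

Once both sources produce the same object, everything downstream (validation, health gate, routing) only has to handle one input type.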

Stage 2: Email Validation (the part everyone skips)

Our validation runs in two layers:

Internal checks (free, fast, runs on every lead):

Syntax validation (regex, but the proper RFC 5322 kind, not the naive kind)
MX record lookup via DNS — does this domain even accept email?
Disposable domain check against a maintained list (~30k domains)
Catch-all detection — does this domain accept literally any address?
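A rough sketch of the two cheapest internal checks (syntax and disposable domains). The regex here is deliberately a simplified pattern, not a full RFC 5322 implementation, and the two-entry domain set is a stand-in for the maintained list. `runSyncChecks` is a hypothetical name. MX lookup and catch-all detection need network calls and are left out.

```typescript
// Assumption: a tiny stand-in for the maintained disposable-domain list.
const DISPOSABLE_DOMAINS = new Set(["mailinator.com", "guerrillamail.com"]);

// Simplified pattern: one "@", no whitespace, a dotted domain. Not full RFC 5322.
const BASIC_EMAIL = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;

interface InternalCheck {
  syntaxOk: boolean;
  disposable: boolean;
  domain: string | null;
}

function runSyncChecks(email: string): InternalCheck {
  const normalized = email.trim().toLowerCase();
  const syntaxOk = BASIC_EMAIL.test(normalized);
  // Only extract a domain when the address is well-formed.
  const domain = syntaxOk
    ? normalized.slice(normalized.lastIndexOf("@") + 1)
    : null;
  const disposable = domain !== null && DISPOSABLE_DOMAINS.has(domain);
  return { syntaxOk, disposable, domain };
}
```

Running these first means a malformed or throwaway address never costs a DNS query or a paid API call.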

API verification (paid, conditional):

Only triggers when internal checks return low confidence
Hits MillionVerifier's API to do an actual SMTP-level probe
They maintain rotating IPs so the probes don't get blocked

async function validate(email: string): Promise<ValidationResult> {
  // Always run internal checks
  const internal = await runInternalChecks(email);

  // Check the domain cache before paying for an API call (huge cost saver)
  const domainInsight = await getDomainInsight(extractDomain(email));
  if (domainInsight?.isCatchAll) {
    return { status: 'risky', score: 45, source: 'internal', isCatchAll: true };
  }

  // API fallback based on tier (`tier` is resolved elsewhere; omitted in this simplified version)
  if (shouldCallApi(internal.score, tier)) {
    const apiResult = await millionVerifier.verify(email);
    return mergeResults(internal, apiResult);
  }

  return internal;
}

The DomainInsight cache is the money saver. If you have 5,000 leads at @bigcorp.com, you check ONE email against the API, learn it's a catch-all domain, and cache that. 4,999 API calls saved.
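The caching idea can be sketched like this. It is an in-memory illustration only: `DomainInsightCache` and the injected `probe` function are hypothetical names, the probe is synchronous for brevity, and a real version would persist insights in a database and expire them.

```typescript
// Assumption: the probe is injected so the cache logic can be shown without a real API.
type CatchAllProbe = (domain: string) => boolean;

class DomainInsightCache {
  private readonly insights = new Map<string, boolean>();
  private readonly probe: CatchAllProbe;
  probeCalls = 0;

  constructor(probe: CatchAllProbe) {
    this.probe = probe;
  }

  // Returns whether the email's domain is catch-all, probing at most once per domain.
  isCatchAll(email: string): boolean {
    const domain = email.slice(email.lastIndexOf("@") + 1).toLowerCase();
    const cached = this.insights.get(domain);
    if (cached !== undefined) return cached;

    this.probeCalls += 1;
    const isCatchAll = this.probe(domain);
    this.insights.set(domain, isCatchAll);
    return isCatchAll;
  }
}
```

With 5,000 leads on the same domain, `probeCalls` ends at 1, which is the saving described above.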

Stage 3: Health Gate

After validation, each lead gets a health classification that determines routing:

Score	Classification	What happens
80-100	GREEN	Routes to high-volume campaigns normally
50-79	YELLOW	Routes to campaigns but with per-mailbox risk caps
0-49	RED	Blocked. Never reaches a sending platform.
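The table maps directly to a small function. This sketch assumes the score alone decides the class. The real classifyLead() also receives the lead and the full validation result, so it may weigh more than this.

```typescript
type Health = "GREEN" | "YELLOW" | "RED";

// Score bands taken from the table above: 80-100, 50-79, 0-49.
function classifyScore(score: number): Health {
  if (score >= 80) return "GREEN";
  if (score >= 50) return "YELLOW";
  return "RED";
}
```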

YELLOW leads get distributed carefully. We cap risky leads per mailbox — a mailbox sending 60 emails won't get more than 2 high-risk leads in that batch. This prevents any single mailbox from eating a disproportionate number of bounces.
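One simple way to enforce a per-mailbox cap is to hand out a fixed number of "risky slots" per mailbox and defer whatever is left over. This is a sketch under assumptions: `distributeRiskyLeads` is a hypothetical name, and deferring overflow to a later batch is my illustration of what could happen to leads that don't fit.

```typescript
interface Assignment {
  mailboxId: string;
  leadEmail: string;
}

// Assign risky leads round-robin, never exceeding capPerMailbox per mailbox.
function distributeRiskyLeads(
  leads: string[],
  mailboxIds: string[],
  capPerMailbox: number
): { assigned: Assignment[]; deferred: string[] } {
  // Build the available slots round-robin: each mailbox appears capPerMailbox times.
  const slots: string[] = [];
  for (let round = 0; round < capPerMailbox; round++) {
    slots.push(...mailboxIds);
  }

  const assigned: Assignment[] = [];
  const deferred: string[] = [];
  for (const leadEmail of leads) {
    const mailboxId = slots.shift();
    if (mailboxId === undefined) {
      deferred.push(leadEmail); // no capacity left in this batch
    } else {
      assigned.push({ mailboxId, leadEmail });
    }
  }
  return { assigned, deferred };
}
```

With a cap of 2, five risky leads across two mailboxes means four get placed and one waits, so no mailbox takes more than its share of likely bounces.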

Stage 4: Sending Platform Integration

We push leads to Smartlead and Instantly via their APIs. Both have quirks.

Smartlead:

Rate limited to 10 requests per 2 seconds
Campaign pause is POST /campaigns/{id}/status with { status: 'PAUSED' } — NOT a PATCH on the campaign object (we learned this the hard way)
Resume uses { status: 'START' }, not ACTIVE (again, learned the hard way)
Removing a mailbox from a campaign is DELETE /campaigns/{id}/email-accounts with the account ID in the request body

Instantly:

Pause: POST /campaigns/{id}/pause
Resume: POST /campaigns/{id}/activate
Cleaner API overall but less granular mailbox control
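The pause/resume differences can be captured in one small request builder. The methods, paths, and status values below are only the ones described in this post. Base URLs and authentication are omitted, `buildCampaignRequest` is a hypothetical helper, and you should confirm everything against each platform's current API docs before relying on it.

```typescript
type Platform = "smartlead" | "instantly";
type Action = "pause" | "resume";

interface RequestSpec {
  method: "POST";
  path: string;
  body?: { status: string };
}

// Builds the request shape for pausing or resuming a campaign, per the quirks listed above.
function buildCampaignRequest(
  platform: Platform,
  action: Action,
  campaignId: string
): RequestSpec {
  if (platform === "smartlead") {
    // Smartlead: POST to /status, with START (not ACTIVE) for resume.
    return {
      method: "POST",
      path: `/campaigns/${campaignId}/status`,
      body: { status: action === "pause" ? "PAUSED" : "START" },
    };
  }
  // Instantly: dedicated pause / activate endpoints, no body.
  return {
    method: "POST",
    path: `/campaigns/${campaignId}/${action === "pause" ? "pause" : "activate"}`,
  };
}
```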

We built a platform adapter pattern so the rest of our system doesn't care which sender is being used:

interface PlatformAdapter {
  pauseCampaign(orgId: string, campaignId: string): Promise<void>;
  resumeCampaign(orgId: string, campaignId: string): Promise<void>;
  addMailboxToCampaign(orgId: string, campaignId: string, mailboxId: string): Promise<void>;
  removeMailboxFromCampaign(orgId: string, campaignId: string, mailboxId: string): Promise<void>;
}

Three implementations: SmartleadAdapter, InstantlyAdapter, EmailBisonAdapter. Same interface. The monitoring system calls adapter.removeMailboxFromCampaign() and doesn't need to know if it's hitting Smartlead's weird DELETE-with-body endpoint or Instantly's cleaner API.
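Here is a minimal sketch of how adapters could be registered and looked up. `RecordingAdapter` is a hypothetical test double that just records calls, and the registry functions (`registerAdapter`, `getAdapter`) are illustrative names, not our actual code.

```typescript
interface PlatformAdapter {
  pauseCampaign(orgId: string, campaignId: string): Promise<void>;
  resumeCampaign(orgId: string, campaignId: string): Promise<void>;
  addMailboxToCampaign(orgId: string, campaignId: string, mailboxId: string): Promise<void>;
  removeMailboxFromCampaign(orgId: string, campaignId: string, mailboxId: string): Promise<void>;
}

// Hypothetical adapter that records calls instead of hitting a real API.
class RecordingAdapter implements PlatformAdapter {
  readonly calls: string[] = [];

  async pauseCampaign(orgId: string, campaignId: string): Promise<void> {
    this.calls.push(`pause:${orgId}:${campaignId}`);
  }
  async resumeCampaign(orgId: string, campaignId: string): Promise<void> {
    this.calls.push(`resume:${orgId}:${campaignId}`);
  }
  async addMailboxToCampaign(orgId: string, campaignId: string, mailboxId: string): Promise<void> {
    this.calls.push(`add:${orgId}:${campaignId}:${mailboxId}`);
  }
  async removeMailboxFromCampaign(orgId: string, campaignId: string, mailboxId: string): Promise<void> {
    this.calls.push(`remove:${orgId}:${campaignId}:${mailboxId}`);
  }
}

const adapters = new Map<string, PlatformAdapter>();

function registerAdapter(platform: string, adapter: PlatformAdapter): void {
  adapters.set(platform, adapter);
}

// Callers ask for an adapter by platform name and never see platform-specific details.
function getAdapter(platform: string): PlatformAdapter {
  const adapter = adapters.get(platform);
  if (!adapter) throw new Error(`No adapter registered for ${platform}`);
  return adapter;
}
```

A recording adapter like this is also handy in tests: the monitoring code can run end to end without touching a sending platform.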

Stage 5: Monitoring (where it gets interesting)

Every 60 seconds, a worker runs across all active mailboxes:

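// metricsWorker runs every 60s
for (const mailbox of activeMailboxes) {
  const bounceRate = calculateBounceRate(mailbox);
  const sendVolume = getSendVolume(mailbox, '24h');

  if (bounceRate > PAUSE_THRESHOLD) {
    // Record why the pause happened so the alert and audit trail can show it
    const reason = `Bounce rate ${bounceRate} over threshold (${sendVolume} sends in 24h)`;
    await monitoringService.pauseMailbox(mailbox.id, reason);
  }
}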
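For illustration, here is one way the bounce-rate decision could use send volume so that a single bounce on a low-volume mailbox doesn't trigger a pause. Both constants are made-up values for the sketch, not the thresholds we run in production, and `shouldPause` is a hypothetical helper.

```typescript
interface MailboxStats {
  sent24h: number;
  bounced24h: number;
}

// Assumptions: illustrative values only.
const PAUSE_THRESHOLD = 0.03; // 3% bounce rate
const MIN_SENDS_FOR_DECISION = 20; // ignore tiny samples

function calculateBounceRate(stats: MailboxStats): number {
  // Guard against division by zero for mailboxes that sent nothing.
  return stats.sent24h === 0 ? 0 : stats.bounced24h / stats.sent24h;
}

function shouldPause(stats: MailboxStats): boolean {
  // With very few sends, the rate is too noisy to act on.
  if (stats.sent24h < MIN_SENDS_FOR_DECISION) return false;
  return calculateBounceRate(stats) > PAUSE_THRESHOLD;
}
```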

When a mailbox gets paused, here's what happens automatically:

Resilience score drops by 15 points
Mailbox removed from ALL campaigns on the sending platform (API calls to Smartlead/Instantly)
Correlation check — is this a mailbox problem or a domain problem?
- If 3+ mailboxes on the same domain are failing → escalate to domain pause
- If failures concentrate on one campaign → pause that campaign instead
Slack alert fires
Cooldown timer starts (24h first offense, 72h second, 7 days for third+)
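The cooldown schedule is easy to express as a lookup on the offense count. A small sketch using the durations listed above (`cooldownHours` and `cooldownEndsAt` are hypothetical helper names):

```typescript
// 24h for the first offense, 72h for the second, 7 days for the third and beyond.
function cooldownHours(offenseCount: number): number {
  if (offenseCount <= 1) return 24;
  if (offenseCount === 2) return 72;
  return 7 * 24;
}

// Computes when a paused mailbox becomes eligible to leave PAUSED.
function cooldownEndsAt(pausedAt: Date, offenseCount: number): Date {
  const ms = cooldownHours(offenseCount) * 60 * 60 * 1000;
  return new Date(pausedAt.getTime() + ms);
}
```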

The correlation check prevents whack-a-mole. If the root cause is a blacklisted domain, pausing individual mailboxes won't help. You need to pause the domain.
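The domain half of the correlation check might look like the sketch below. It only covers the "3+ failing mailboxes on one domain" rule. The campaign-concentration rule is left out because it depends on a threshold not given here. `correlate` and the types are hypothetical names.

```typescript
interface FailingMailbox {
  id: string;
  domain: string;
}

type Escalation =
  | { kind: "domain"; domain: string }
  | { kind: "mailbox"; mailboxId: string };

// From the rule above: 3 or more failing mailboxes on one domain escalates to a domain pause.
const DOMAIN_ESCALATION_COUNT = 3;

function correlate(paused: FailingMailbox, allFailing: FailingMailbox[]): Escalation {
  const sameDomain = allFailing.filter((m) => m.domain === paused.domain);
  if (sameDomain.length >= DOMAIN_ESCALATION_COUNT) {
    return { kind: "domain", domain: paused.domain };
  }
  return { kind: "mailbox", mailboxId: paused.id };
}
```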

Stage 6: The Healing Pipeline

This is the part I'm most proud of. Paused mailboxes don't just sit in limbo. They go through a 5-phase recovery:

PAUSED → QUARANTINE → RESTRICTED_SEND → WARM_RECOVERY → HEALTHY

PAUSED: Cooldown timer running. No sends. Waiting for cooldown to expire.

QUARANTINE: Cooldown expired. System checks if the domain's DNS is healthy (SPF valid, DKIM valid, not blacklisted). If DNS is broken, mailbox stays here until it's fixed. No point warming up a mailbox on a blacklisted domain.

RESTRICTED_SEND: DNS passed. Warmup re-enabled at 10 emails/day, flat — no ramp-up yet. Must complete 15 clean sends with zero bounces (25 for repeat offenders) before graduating.

WARM_RECOVERY: Warmup increases to 50 emails/day with a +5/day ramp. Must sustain for 3+ days with bounce rate under 2%.

HEALTHY: Back in production. Re-added to all campaigns it was removed from. Maintenance warmup keeps running at 10/day in the background.

Transitions are split across two workers:

metricsWorker (every 60s): handles PAUSED → QUARANTINE (cooldown expiry)
warmupTrackingWorker (every 4h): handles QUARANTINE → RESTRICTED_SEND → WARM_RECOVERY → HEALTHY (graduation criteria)
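The graduation criteria can be written as a pure function from a mailbox's recovery state to its next phase. This is a sketch: the `RecoveryState` fields are hypothetical names for the inputs described above, and the real workers read them from the database.

```typescript
type Phase = "PAUSED" | "QUARANTINE" | "RESTRICTED_SEND" | "WARM_RECOVERY" | "HEALTHY";

interface RecoveryState {
  phase: Phase;
  cooldownExpired: boolean;
  dnsHealthy: boolean; // SPF valid, DKIM valid, not blacklisted
  cleanSends: number;
  bouncesInPhase: number;
  daysInPhase: number;
  bounceRate: number; // 0.02 means 2%
  repeatOffender: boolean;
}

// Returns the phase the mailbox should be in after this check (may be unchanged).
function nextPhase(s: RecoveryState): Phase {
  switch (s.phase) {
    case "PAUSED":
      return s.cooldownExpired ? "QUARANTINE" : "PAUSED";
    case "QUARANTINE":
      return s.dnsHealthy ? "RESTRICTED_SEND" : "QUARANTINE";
    case "RESTRICTED_SEND": {
      // 15 clean sends with zero bounces, 25 for repeat offenders.
      const required = s.repeatOffender ? 25 : 15;
      const graduated = s.cleanSends >= required && s.bouncesInPhase === 0;
      return graduated ? "WARM_RECOVERY" : "RESTRICTED_SEND";
    }
    case "WARM_RECOVERY":
      return s.daysInPhase >= 3 && s.bounceRate < 0.02 ? "HEALTHY" : "WARM_RECOVERY";
    case "HEALTHY":
      return "HEALTHY";
  }
}
```

Keeping this logic pure makes it easy to unit test every transition without a database or a sending platform.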

Phase transitions use optimistic locking to prevent race conditions:

const result = await prisma.mailbox.updateMany({
  where: {
    id: mailboxId,
    recovery_phase: fromPhase  // Only update if still in expected phase
  },
  data: {
    recovery_phase: toPhase,
    phase_entered_at: new Date()
  }
});

if (result.count === 0) {
  // Another process already changed the phase — abort
  return;
}

If a mailbox bounces DURING recovery (relapse), penalties escalate:

1st relapse: back to QUARANTINE, doubled cooldown
2nd relapse: full PAUSED, 72h cooldown
3rd+: full PAUSED, 7 day cooldown, flagged for manual review
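The escalation ladder as a function, for illustration. One assumption to flag: "doubled cooldown" is interpreted here as doubling whatever cooldown the mailbox last served, which may not match the real implementation. `relapsePenalty` is a hypothetical name.

```typescript
interface RelapsePenalty {
  phase: "QUARANTINE" | "PAUSED";
  cooldownHours: number;
  manualReview: boolean;
}

// Assumption: "doubled cooldown" doubles the previously served cooldown.
function relapsePenalty(relapseCount: number, previousCooldownHours: number): RelapsePenalty {
  if (relapseCount <= 1) {
    return { phase: "QUARANTINE", cooldownHours: previousCooldownHours * 2, manualReview: false };
  }
  if (relapseCount === 2) {
    return { phase: "PAUSED", cooldownHours: 72, manualReview: false };
  }
  // Third relapse and beyond: 7 days and a human looks at it.
  return { phase: "PAUSED", cooldownHours: 7 * 24, manualReview: true };
}
```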

Stage 7: Alerting

Every significant event pushes to Slack. Not email. Obvious reasons.

We categorize alerts by severity:

Critical: Domain blacklisted, mailbox paused, campaign auto-stopped
Warning: Bounce rate climbing, mailbox entering recovery phase
Info: Mailbox graduated from recovery, lead validation batch complete

The alert includes actionable context: which mailbox, what bounce rate, which campaigns are affected, and what the system already did about it.
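A sketch of how an alert message with that context could be assembled. The `MailboxAlert` shape and `formatAlert` are hypothetical. The output is plain text that could be sent as the body of a Slack message.

```typescript
type Severity = "critical" | "warning" | "info";

interface MailboxAlert {
  severity: Severity;
  mailbox: string;
  bounceRate: number; // 0.08 means 8%
  campaigns: string[];
  actionTaken: string;
}

// Builds a multi-line message: who, how bad, what is affected, what was already done.
function formatAlert(alert: MailboxAlert): string {
  const pct = (alert.bounceRate * 100).toFixed(1);
  return [
    `[${alert.severity.toUpperCase()}] ${alert.mailbox}`,
    `Bounce rate: ${pct}%`,
    `Campaigns affected: ${alert.campaigns.join(", ") || "none"}`,
    `Action taken: ${alert.actionTaken}`,
  ].join("\n");
}
```

Including "what the system already did" in the message matters most: whoever reads it at 2am should know whether they need to act or just acknowledge.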

What We Learned Building This

DNS checks matter more than you think. SPF, DKIM, DMARC misconfigurations cause deliverability problems that look like content issues. We've seen mailboxes with perfect copy land in spam because someone forgot to add an SPF record for their sending service.
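As a small example of the SPF case, here is a sketch that checks whether a domain's TXT records contain an SPF record with an `include:` for a given sending service. It works on TXT strings you've already fetched, ignores qualifiers and `redirect=`, and `spf.sender.example` in the usage note is a placeholder, not a real service. `spfIncludes` is a hypothetical helper.

```typescript
// Returns true if any TXT record is an SPF record that includes the given domain.
function spfIncludes(txtRecords: string[], includeDomain: string): boolean {
  const target = `include:${includeDomain.toLowerCase()}`;
  return txtRecords.some((record) => {
    const terms = record.trim().toLowerCase().split(/\s+/);
    // SPF records start with the version tag "v=spf1".
    return terms[0] === "v=spf1" && terms.some((term) => term === target);
  });
}
```

For example, `spfIncludes(["v=spf1 include:spf.sender.example ~all"], "spf.sender.example")` returns true, while a domain whose only SPF record lacks that include returns false and would be worth flagging before any warmup starts.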

Don't trust any single data source for email verification. Clay says the email is verified. MillionVerifier says it's valid. The email still bounces. This happens. Build for it. Your infrastructure needs to handle bounces gracefully, not assume they won't happen.

Batch operations need rate limiting, not just individual calls. We hit Smartlead's rate limit every time we tried to pause a domain (which removes 15+ mailboxes from multiple campaigns). Added a rate limiter that queues requests at 10 per 2 seconds and the errors disappeared.
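The core of such a limiter fits in a few lines. This is a sketch, not our production code (which queues jobs through BullMQ): `WindowRateLimiter` is a hypothetical class that hands out send times so that no more than `limit` requests fall within any window, and it assumes `reserve` is called with non-decreasing timestamps.

```typescript
class WindowRateLimiter {
  // Send times of the most recent `limit` requests, oldest first.
  private readonly recent: number[] = [];
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Reserves a slot and returns the timestamp (ms) at which the request may be sent.
  reserve(nowMs: number): number {
    let sendAt = nowMs;
    const oldest = this.recent[0];
    if (this.recent.length >= this.limit && oldest !== undefined) {
      // The request `limit` slots back must be a full window old before we send.
      sendAt = Math.max(nowMs, oldest + this.windowMs);
      this.recent.shift();
    }
    this.recent.push(sendAt);
    return sendAt;
  }
}
```

With `new WindowRateLimiter(10, 2000)`, a burst of 25 calls at time 0 gets spread over three windows: 10 go out immediately, 10 at 2000 ms, and 5 at 4000 ms. The caller waits `sendAt - now` before firing each request.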

The adapter pattern was the best architectural decision we made. When we added EmailBison as a third sending platform, it took a day instead of a week. Implement the interface, register the adapter, done.

Observe mode saved us from ourselves. We built three system modes: OBSERVE (log what would happen), SUGGEST (create notifications), ENFORCE (actually do it). We ran in observe mode for 2 weeks watching the logs before flipping to enforce. Found 3 bugs that would have auto-paused healthy mailboxes.
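The mode switch boils down to a gate in front of every side effect. A minimal sketch, where `applyAction` and the `Outcome` shape are hypothetical, and treating ENFORCE as also producing a notification is my assumption:

```typescript
type SystemMode = "OBSERVE" | "SUGGEST" | "ENFORCE";

interface Outcome {
  logged: string;
  notified: boolean;
  executed: boolean;
}

// Every automated action goes through this gate; only ENFORCE runs the side effect.
function applyAction(mode: SystemMode, description: string, execute: () => void): Outcome {
  const logged = `[${mode}] ${description}`;
  if (mode === "ENFORCE") {
    execute();
    return { logged, notified: true, executed: true };
  }
  // OBSERVE only logs; SUGGEST logs and creates a notification.
  return { logged, notified: mode === "SUGGEST", executed: false };
}
```

Because the decision logic runs identically in all three modes, the OBSERVE logs show exactly what ENFORCE would have done, which is how false-positive pauses get caught before they cost anything.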

The whole system is TypeScript, Prisma + PostgreSQL, with BullMQ for job processing. Runs on Railway. The monitoring, healing, and validation layers are what we productized as Superkabe - but the architecture patterns apply regardless of whether you build or buy.

If you're running cold outbound at any real volume and your infrastructure management is still manual, you're leaving money on the table. Or more accurately, you're burning domains that cost money to replace.
