I'm a developer who got dragged into fixing our sales team's email infrastructure. What I found was horrifying.
No monitoring. No validation. No health checks. They were sending thousands of cold emails through mailboxes with 8% bounce rates, wondering why half their emails landed in spam. When a domain got blacklisted, their solution was to buy a new one and start over.
So I built a system to fix it. This post is the technical architecture behind it — how data flows from lead enrichment to email send, what breaks at each stage, and how we automated the recovery.
System Overview
Clay (enrichment)
→ Superkabe (validation + routing + monitoring)
→ Smartlead / Instantly (sending)
→ Bounce/reply webhooks back to Superkabe
→ Slack (alerts)
Simple on paper. The complexity is in the failure modes.
Stage 1: Lead Ingestion
Leads arrive via two paths:
- Clay webhook — automated pipeline, fires when enrichment completes
- CSV upload — manual bulk import
Both hit the same processLead() function. This is important — one ingestion path, regardless of source. No special cases.
// Simplified ingestion flow
async function processLead(lead: IncomingLead) {
  // Step 1: Validate the email BEFORE anything else
  const validation = await emailValidationService.validate(lead.email);
  if (validation.status === 'invalid') {
    await saveLead(lead, { status: 'BLOCKED', validation });
    return; // Never reaches the sending platform
  }
  // Step 2: Health gate classification
  const health = classifyLead(lead, validation);
  // GREEN = safe, YELLOW = careful routing, RED = block
  if (health === 'RED') { // assumes classifyLead returns the label itself
    await saveLead(lead, { status: 'BLOCKED', validation });
    return; // RED leads never reach the sending platform either
  }
  // Step 3: Route to campaign based on persona + health
  const campaign = await routingService.route(lead, health);
  // Step 4: Push to sending platform
  await sendingPlatform.pushLead(campaign, lead);
}
The email validation in Step 1 is the piece most outbound setups skip entirely. They trust the data source to provide valid emails. Bad idea.
Stage 2: Email Validation (the part everyone skips)
Our validation runs in two layers:
Internal checks (free, fast, runs on every lead):
- Syntax validation (regex, but the proper RFC 5322 kind, not the naive kind)
- MX record lookup via DNS — does this domain even accept email?
- Disposable domain check against a maintained list (~30k domains)
- Catch-all detection — does this domain accept literally any address?
API verification (paid, conditional):
- Only triggers when internal checks return low confidence
- Hits MillionVerifier's API to do an actual SMTP-level probe
- They maintain rotating IPs so the probes don't get blocked
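A minimal sketch of how the free internal layer could be wired, not the production code. The name `cheapChecks`, the scores, the two-entry disposable list, and the injected `MxLookup` are all assumptions for illustration, and the pattern is a deliberately simplified one rather than a full RFC 5322 grammar. In Node, the lookup could wrap `dns.promises.resolveMx`.

```typescript
// Sketch only: names, scores, and the tiny disposable list are illustrative.
type MxLookup = (domain: string) => Promise<boolean>;

interface InternalResult {
  status: 'valid' | 'invalid';
  score: number; // 0-100, higher is safer
  reasons: string[];
}

// Stand-in for the maintained ~30k-domain list.
const DISPOSABLE_DOMAINS = new Set(['mailinator.com', 'guerrillamail.com']);

// Deliberately simplified pattern, NOT the full RFC 5322 grammar.
const SIMPLE_EMAIL = /^[^\s@]+@[^\s@]+\.[^\s@]{2,}$/;

function extractDomain(email: string): string {
  return email.slice(email.lastIndexOf('@') + 1).toLowerCase();
}

async function cheapChecks(email: string, hasMx: MxLookup): Promise<InternalResult> {
  // Cheapest check first: no network needed to reject bad syntax.
  if (!SIMPLE_EMAIL.test(email)) {
    return { status: 'invalid', score: 0, reasons: ['syntax'] };
  }
  const domain = extractDomain(email);
  if (DISPOSABLE_DOMAINS.has(domain)) {
    return { status: 'invalid', score: 0, reasons: ['disposable'] };
  }
  // A domain without MX records cannot accept mail at all.
  if (!(await hasMx(domain))) {
    return { status: 'invalid', score: 0, reasons: ['no-mx'] };
  }
  // Passed the cheap checks; whether the mailbox itself exists is still unknown.
  return { status: 'valid', score: 70, reasons: [] };
}
```

Ordering the checks from cheapest to most expensive means most bad leads are rejected before any DNS or paid API traffic happens.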
async function validate(email: string): Promise<ValidationResult> {
  // Always run internal checks
  const internal = await runInternalChecks(email);
  // Check the domain cache before any paid API call (huge cost saver)
  const domainInsight = await getDomainInsight(extractDomain(email));
  if (domainInsight?.isCatchAll) {
    return { status: 'risky', score: 45, source: 'internal', isCatchAll: true };
  }
  // API fallback based on tier
  // (`tier` is resolved outside this simplified snippet)
  if (shouldCallApi(internal.score, tier)) {
    const apiResult = await millionVerifier.verify(email);
    return mergeResults(internal, apiResult);
  }
  return internal;
}
The DomainInsight cache is the money saver. If you have 5,000 leads at @bigcorp.com, you check ONE email against the API, learn it's a catch-all domain, and cache that. 4,999 API calls saved.
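The saving can be shown with a small in-memory sketch. `DomainInsightCache`, `countPaidProbes`, and the `probeSaysCatchAll` callback (a stand-in for the paid API's verdict) are hypothetical names, and a real cache would presumably live in the database rather than a `Map`.

```typescript
// Sketch only: an in-memory stand-in for the DomainInsight cache.
class DomainInsightCache {
  private readonly catchAllByDomain = new Map<string, boolean>();

  isKnownCatchAll(domain: string): boolean {
    return this.catchAllByDomain.get(domain) === true;
  }

  remember(domain: string, isCatchAll: boolean): void {
    this.catchAllByDomain.set(domain, isCatchAll);
  }
}

function domainOf(email: string): string {
  return email.slice(email.lastIndexOf('@') + 1).toLowerCase();
}

// Walk a batch and count how many paid probes it needs.
// `probeSaysCatchAll` stands in for the paid API's verdict on a domain.
function countPaidProbes(
  emails: string[],
  probeSaysCatchAll: (domain: string) => boolean,
  cache: DomainInsightCache,
): number {
  let probes = 0;
  for (const email of emails) {
    const domain = domainOf(email);
    if (cache.isKnownCatchAll(domain)) continue; // cached verdict, no API call
    probes += 1; // one paid probe
    cache.remember(domain, probeSaysCatchAll(domain));
  }
  return probes;
}
```

Only the catch-all verdict short-circuits here. Addresses on ordinary domains are still probed one by one, because validity there differs per mailbox.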
Stage 3: Health Gate
After validation, each lead gets a health classification that determines routing:
| Score | Classification | What happens |
|---|---|---|
| 80-100 | GREEN | Routes to high-volume campaigns normally |
| 50-79 | YELLOW | Routes to campaigns but with per-mailbox risk caps |
| 0-49 | RED | Blocked. Never reaches a sending platform. |
YELLOW leads get distributed carefully. We cap risky leads per mailbox — a mailbox sending 60 emails won't get more than 2 high-risk leads in that batch. This prevents any single mailbox from eating a disproportionate number of bounces.
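The score bands and the per-mailbox cap can be sketched as below. The function names are hypothetical, YELLOW is treated as the "high-risk" bucket, and scaling the 2-per-60 cap linearly to other batch sizes is an assumption rather than something the system is known to do.

```typescript
type Health = 'GREEN' | 'YELLOW' | 'RED';

// Score bands from the table above.
function classifyScore(score: number): Health {
  if (score >= 80) return 'GREEN';
  if (score >= 50) return 'YELLOW';
  return 'RED';
}

// At most 2 risky leads per 60 sends. Linear scaling is an assumption.
function riskyLeadCap(batchSize: number): number {
  return Math.floor((batchSize * 2) / 60);
}

interface MailboxBatch {
  mailboxId: string;
  batchSize: number;
  riskyAssigned: number;
}

// Pick the mailbox with the most remaining headroom for a risky lead.
// Returns undefined when every mailbox is at its cap, so the lead is held back.
function pickMailboxForRiskyLead(batches: MailboxBatch[]): MailboxBatch | undefined {
  let best: MailboxBatch | undefined;
  let bestRoom = 0;
  for (const batch of batches) {
    const room = riskyLeadCap(batch.batchSize) - batch.riskyAssigned;
    if (room > bestRoom) {
      best = batch;
      bestRoom = room;
    }
  }
  return best;
}
```

Choosing the mailbox with the most headroom spreads risky leads evenly instead of filling one mailbox to its cap first.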
Stage 4: Sending Platform Integration
We push leads to Smartlead and Instantly via their APIs. Both have quirks.
Smartlead:
- Rate limited to 10 requests per 2 seconds
- Campaign pause is POST /campaigns/{id}/status with { status: 'PAUSED' }, NOT a PATCH on the campaign object (we learned this the hard way)
- Resume uses { status: 'START' }, not ACTIVE (again, learned the hard way)
- Removing a mailbox from a campaign is DELETE /campaigns/{id}/email-accounts with the account ID in the request body
Instantly:
- Pause: POST /campaigns/{id}/pause
- Resume: POST /campaigns/{id}/activate
- Cleaner API overall but less granular mailbox control
We built a platform adapter pattern so the rest of our system doesn't care which sender is being used:
interface PlatformAdapter {
  pauseCampaign(orgId: string, campaignId: string): Promise<void>;
  resumeCampaign(orgId: string, campaignId: string): Promise<void>;
  addMailboxToCampaign(orgId: string, campaignId: string, mailboxId: string): Promise<void>;
  removeMailboxFromCampaign(orgId: string, campaignId: string, mailboxId: string): Promise<void>;
}
Three implementations: SmartleadAdapter, InstantlyAdapter, EmailBisonAdapter. Same interface. The monitoring system calls adapter.removeMailboxFromCampaign() and doesn't need to know if it's hitting Smartlead's weird DELETE-with-body endpoint or Instantly's cleaner API.
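A reduced sketch of the idea, covering only pause and resume. The paths and payloads are the ones listed above and have not been checked against either vendor's current API documentation. Sending the Smartlead resume to the same status endpoint as the pause is one reading of that list. `Send`, `CampaignControl`, and the factory names are hypothetical, and auth, base URLs, and `orgId` are left out.

```typescript
// Injected transport so the sketch stays independent of any HTTP client.
type Send = (method: string, path: string, body?: unknown) => Promise<void>;

// Reduced version of the adapter interface: pause and resume only.
interface CampaignControl {
  pauseCampaign(campaignId: string): Promise<void>;
  resumeCampaign(campaignId: string): Promise<void>;
}

function smartleadControl(send: Send): CampaignControl {
  return {
    // Pause and resume both go through the status endpoint with a status payload.
    pauseCampaign: (id) => send('POST', `/campaigns/${id}/status`, { status: 'PAUSED' }),
    resumeCampaign: (id) => send('POST', `/campaigns/${id}/status`, { status: 'START' }),
  };
}

function instantlyControl(send: Send): CampaignControl {
  return {
    // Dedicated endpoints, no request body.
    pauseCampaign: (id) => send('POST', `/campaigns/${id}/pause`),
    resumeCampaign: (id) => send('POST', `/campaigns/${id}/activate`),
  };
}

// Registry keyed by platform name; callers never branch on the platform.
const controlFactories = {
  smartlead: smartleadControl,
  instantly: instantlyControl,
};
```

A caller would look up `controlFactories.smartlead(send)` once and then call `pauseCampaign` without knowing which endpoint shape sits behind it.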
Stage 5: Monitoring (where it gets interesting)
Every 60 seconds, a worker runs across all active mailboxes:
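// metricsWorker runs every 60s
for (const mailbox of activeMailboxes) {
  const bounceRate = calculateBounceRate(mailbox);
  const sendVolume = getSendVolume(mailbox, '24h');
  if (bounceRate > PAUSE_THRESHOLD) {
    // Record why the pause happened so the alert can show it
    const reason = `Bounce rate ${bounceRate} over ${sendVolume} sends (24h) exceeded ${PAUSE_THRESHOLD}`;
    await monitoringService.pauseMailbox(mailbox.id, reason);
  }
}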
When a mailbox gets paused, here's what happens automatically:
- Resilience score drops by 15 points
- Mailbox removed from ALL campaigns on the sending platform (API calls to Smartlead/Instantly)
- Correlation check — is this a mailbox problem or a domain problem?
- If 3+ mailboxes on the same domain are failing → escalate to domain pause
- If failures concentrate on one campaign → pause that campaign instead
- Slack alert fires
- Cooldown timer starts (24h first offense, 72h second, 7 days for third+)
The correlation check prevents whack-a-mole. If the root cause is a blacklisted domain, pausing individual mailboxes won't help. You need to pause the domain.
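The escalation logic and the cooldown ladder can be sketched as a pure function. The 3-mailbox threshold and the cooldown durations come from the list above. The 80% campaign-concentration threshold and all type and function names are assumptions.

```typescript
interface FailingMailbox {
  mailboxId: string;
  domain: string;
  bounces: Array<{ campaignId: string; count: number }>;
}

type Escalation =
  | { scope: 'domain'; domain: string }
  | { scope: 'campaign'; campaignId: string }
  | { scope: 'mailbox'; mailboxId: string };

const DOMAIN_ESCALATION_COUNT = 3; // 3+ failing mailboxes on one domain
const CAMPAIGN_CONCENTRATION = 0.8; // assumption: no number is given in the post

// `allFailing` must include `paused` itself.
function correlate(paused: FailingMailbox, allFailing: FailingMailbox[]): Escalation {
  const sameDomain = allFailing.filter((m) => m.domain === paused.domain);
  if (sameDomain.length >= DOMAIN_ESCALATION_COUNT) {
    return { scope: 'domain', domain: paused.domain };
  }
  const total = paused.bounces.reduce((sum, b) => sum + b.count, 0);
  for (const b of paused.bounces) {
    if (total > 0 && b.count / total >= CAMPAIGN_CONCENTRATION) {
      return { scope: 'campaign', campaignId: b.campaignId };
    }
  }
  return { scope: 'mailbox', mailboxId: paused.mailboxId };
}

// 24h for the first offense, 72h for the second, 7 days from the third on.
function cooldownHours(offenseNumber: number): number {
  if (offenseNumber <= 1) return 24;
  if (offenseNumber === 2) return 72;
  return 7 * 24;
}
```

The domain check runs first on purpose: if the domain is the root cause, a campaign-level or mailbox-level pause would only hide it.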
Stage 6: The Healing Pipeline
This is the part I'm most proud of. Paused mailboxes don't just sit in limbo. They go through a 5-phase recovery:
PAUSED → QUARANTINE → RESTRICTED_SEND → WARM_RECOVERY → HEALTHY
PAUSED: Cooldown timer running. No sends. Waiting for cooldown to expire.
QUARANTINE: Cooldown expired. System checks if the domain's DNS is healthy (SPF valid, DKIM valid, not blacklisted). If DNS is broken, mailbox stays here until it's fixed. No point warming up a mailbox on a blacklisted domain.
RESTRICTED_SEND: DNS passed. Warmup re-enabled at 10 emails/day, flat — no ramp-up yet. Must complete 15 clean sends with zero bounces (25 for repeat offenders) before graduating.
WARM_RECOVERY: Warmup increases to 50 emails/day with a +5/day ramp. Must sustain for 3+ days with bounce rate under 2%.
HEALTHY: Back in production. Re-added to all campaigns it was removed from. Maintenance warmup keeps running at 10/day in the background.
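The graduation criteria above read naturally as a small state machine. A sketch, with hypothetical field names on `RecoverySnapshot`. Bounces during recovery are relapses and would be handled separately, so this function only ever moves a mailbox forward or keeps it where it is.

```typescript
type Phase = 'PAUSED' | 'QUARANTINE' | 'RESTRICTED_SEND' | 'WARM_RECOVERY' | 'HEALTHY';

interface RecoverySnapshot {
  phase: Phase;
  cooldownExpired: boolean;
  dnsHealthy: boolean; // SPF valid, DKIM valid, not blacklisted
  cleanSendsInPhase: number;
  bouncesInPhase: number;
  daysInPhase: number;
  bounceRate: number; // 0.02 means 2%
  repeatOffender: boolean;
}

// Returns the phase the mailbox should be in after this evaluation.
function nextPhase(s: RecoverySnapshot): Phase {
  switch (s.phase) {
    case 'PAUSED':
      return s.cooldownExpired ? 'QUARANTINE' : 'PAUSED';
    case 'QUARANTINE':
      return s.dnsHealthy ? 'RESTRICTED_SEND' : 'QUARANTINE';
    case 'RESTRICTED_SEND': {
      // 15 clean sends with zero bounces, 25 for repeat offenders.
      const required = s.repeatOffender ? 25 : 15;
      const graduated = s.bouncesInPhase === 0 && s.cleanSendsInPhase >= required;
      return graduated ? 'WARM_RECOVERY' : 'RESTRICTED_SEND';
    }
    case 'WARM_RECOVERY':
      // 3+ days with bounce rate under 2%.
      return s.daysInPhase >= 3 && s.bounceRate < 0.02 ? 'HEALTHY' : 'WARM_RECOVERY';
    case 'HEALTHY':
      return 'HEALTHY';
  }
}
```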
Each transition is managed by two workers:
- metricsWorker (every 60s): handles PAUSED → QUARANTINE (cooldown expiry)
- warmupTrackingWorker (every 4h): handles QUARANTINE → RESTRICTED → WARM → HEALTHY (graduation criteria)
Phase transitions use optimistic locking to prevent race conditions:
const result = await prisma.mailbox.updateMany({
  where: {
    id: mailboxId,
    recovery_phase: fromPhase // Only update if still in expected phase
  },
  data: {
    recovery_phase: toPhase,
    phase_entered_at: new Date()
  }
});
if (result.count === 0) {
  // Another process already changed the phase — abort
  return;
}
If a mailbox bounces DURING recovery (relapse), penalties escalate:
- 1st relapse: back to QUARANTINE, doubled cooldown
- 2nd relapse: full PAUSED, 72h cooldown
- 3rd+: full PAUSED, 7 day cooldown, flagged for manual review
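The relapse ladder as a sketch. `relapsePenalty` and `RelapsePenalty` are hypothetical names, and "doubled cooldown" is interpreted here as twice the mailbox's previous cooldown, which is an assumption.

```typescript
interface RelapsePenalty {
  phase: 'QUARANTINE' | 'PAUSED';
  cooldownHours: number;
  manualReview: boolean;
}

function relapsePenalty(relapseNumber: number, previousCooldownHours: number): RelapsePenalty {
  if (relapseNumber <= 1) {
    // 1st relapse: back to QUARANTINE with the cooldown doubled.
    return { phase: 'QUARANTINE', cooldownHours: previousCooldownHours * 2, manualReview: false };
  }
  if (relapseNumber === 2) {
    // 2nd relapse: full PAUSED, 72h.
    return { phase: 'PAUSED', cooldownHours: 72, manualReview: false };
  }
  // 3rd and later: full PAUSED, 7 days, flagged for a human.
  return { phase: 'PAUSED', cooldownHours: 7 * 24, manualReview: true };
}
```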
Stage 7: Alerting
Every significant event pushes to Slack. Not email. Obvious reasons.
We categorize alerts by severity:
- Critical: Domain blacklisted, mailbox paused, campaign auto-stopped
- Warning: Bounce rate climbing, mailbox entering recovery phase
- Info: Mailbox graduated from recovery, lead validation batch complete
The alert includes actionable context: which mailbox, what bounce rate, which campaigns are affected, and what the system already did about it.
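A sketch of building such an alert. The event names, the `AlertContext` fields, and the message layout are assumptions. The only external fact relied on is that a Slack incoming webhook accepts a JSON body with a `text` field. Posting the payload is left to the caller.

```typescript
type Severity = 'critical' | 'warning' | 'info';

// Hypothetical event names covering the categories listed above.
type AlertEvent =
  | 'domain_blacklisted'
  | 'mailbox_paused'
  | 'campaign_auto_stopped'
  | 'bounce_rate_climbing'
  | 'recovery_phase_entered'
  | 'recovery_graduated'
  | 'validation_batch_complete';

const SEVERITY_BY_EVENT: Record<AlertEvent, Severity> = {
  domain_blacklisted: 'critical',
  mailbox_paused: 'critical',
  campaign_auto_stopped: 'critical',
  bounce_rate_climbing: 'warning',
  recovery_phase_entered: 'warning',
  recovery_graduated: 'info',
  validation_batch_complete: 'info',
};

interface AlertContext {
  mailbox: string;
  bounceRate: number; // 0.08 means 8%
  campaigns: string[];
  actionTaken: string;
}

// Builds the JSON body for a Slack incoming webhook.
function buildSlackPayload(event: AlertEvent, ctx: AlertContext): { text: string } {
  const lines = [
    `[${SEVERITY_BY_EVENT[event].toUpperCase()}] ${event}`,
    `Mailbox: ${ctx.mailbox}`,
    `Bounce rate: ${(ctx.bounceRate * 100).toFixed(1)}%`,
    `Campaigns affected: ${ctx.campaigns.join(', ') || 'none'}`,
    `Action taken: ${ctx.actionTaken}`,
  ];
  return { text: lines.join('\n') };
}
```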
What We Learned Building This
DNS checks matter more than you think. SPF, DKIM, DMARC misconfigurations cause deliverability problems that look like content issues. We've seen mailboxes with perfect copy land in spam because someone forgot to add an SPF record for their sending service.
Don't trust any single data source for email verification. Clay says the email is verified. MillionVerifier says it's valid. The email still bounces. This happens. Build for it. Your infrastructure needs to handle bounces gracefully, not assume they won't happen.
Batch operations need rate limiting, not just individual calls. We hit Smartlead's rate limit every time we tried to pause a domain (which removes 15+ mailboxes from multiple campaigns). Added a rate limiter that queues requests at 10 per 2 seconds and the errors disappeared.
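One way to get a "10 per 2 seconds" queue is a sliding-window limiter that hands out start times. The sketch below is a standalone illustration, not the limiter actually used, and `WindowRateLimiter` is a hypothetical name. It assumes `now` never goes backwards between calls.

```typescript
class WindowRateLimiter {
  // Start times of the most recent `limit` requests, oldest first.
  private readonly starts: number[] = [];
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Reserve the next slot and return the time (ms) at which the request may start.
  reserve(now: number): number {
    let startAt = now;
    const oldest = this.starts[0];
    if (this.starts.length >= this.limit && oldest !== undefined) {
      // The request `limit` slots back must be a full window old before this one runs.
      startAt = Math.max(now, oldest + this.windowMs);
    }
    this.starts.push(startAt);
    if (this.starts.length > this.limit) this.starts.shift();
    return startAt;
  }
}

// Helper for callers: how long to wait before firing the request.
function delayFor(limiter: WindowRateLimiter, now: number): number {
  return limiter.reserve(now) - now;
}
```

A domain pause that fans out into 25 removals at once would get its first 10 calls immediately, the next 10 after 2 seconds, and the last 5 after 4 seconds, with the caller waiting `delayFor(limiter, Date.now())` milliseconds before each request.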
The adapter pattern was the best architectural decision we made. When we added EmailBison as a third sending platform, it took a day instead of a week. Implement the interface, register the adapter, done.
Observe mode saved us from ourselves. We built three system modes: OBSERVE (log what would happen), SUGGEST (create notifications), ENFORCE (actually do it). We ran in observe mode for 2 weeks watching the logs before flipping to enforce. Found 3 bugs that would have auto-paused healthy mailboxes.
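The three modes amount to a single dispatch point in front of every automated action. A sketch with hypothetical names (`applyAction`, `ActionSinks`) and invented log wording:

```typescript
type SystemMode = 'OBSERVE' | 'SUGGEST' | 'ENFORCE';

interface ActionSinks {
  log: (message: string) => void;
  notify: (message: string) => void;
  execute: () => void;
}

// Every automated action goes through here, so flipping the mode changes
// behavior everywhere at once.
function applyAction(mode: SystemMode, description: string, sinks: ActionSinks): void {
  switch (mode) {
    case 'OBSERVE':
      // Log what would have happened, touch nothing.
      sinks.log(`[observe] would have: ${description}`);
      return;
    case 'SUGGEST':
      // Create a notification for a human to act on.
      sinks.notify(`Suggested action: ${description}`);
      return;
    case 'ENFORCE':
      // Actually do it, and leave a trace.
      sinks.execute();
      sinks.log(`[enforce] did: ${description}`);
      return;
  }
}
```

Because the decision logic upstream is identical in all three modes, the logs from an observe-only run show exactly what enforce mode would have done.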
The whole system is TypeScript, Prisma + PostgreSQL, with BullMQ for job processing. Runs on Railway. The monitoring, healing, and validation layers are what we productized as Superkabe - but the architecture patterns apply regardless of whether you build or buy.
If you're running cold outbound at any real volume and your infrastructure management is still manual, you're leaving money on the table. Or more accurately, you're burning domains that cost money to replace.