The Workflow That Worked Too Well (And Crashed Everything)

I built the perfect system. Then it tried to send forty-seven thousand emails in six minutes.
A SaaS company came to us with a good problem to have: too many leads, not enough salespeople. Their process was a mess. Leads went into a spreadsheet. Someone checked it when they remembered. Someone assigned leads based on vibes. Sales reps emailed days later. By then, the leads were cold.
They wanted full automation. Lead comes in, gets qualified, assigned, emailed, and followed up automatically. I thought this was straightforward. I decided to build the mother of all workflows.
The Dream Workflow
The system looked beautiful on paper.
A form submission triggered a webhook. Data flowed into the CRM. Enrichment pulled company details and LinkedIn profiles. Leads were scored from one to one hundred. High scores went to senior reps, mid scores to mid reps, low scores to nurture.
Then personalization kicked in. An LLM generated tailored emails referencing company, role, and pain points. Outreach fired. Follow-ups were scheduled. Every stage triggered the next. Clean. Logical. Fully automated.
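Stripped of the integrations, the whole chain amounted to something like this. Every name here is a hypothetical stand-in, stubbed so the structure runs, but the shape is the point: each stage fires the next one immediately, with nothing in between to slow it down.

```python
# A stripped-down sketch of the original design: every stage fires
# immediately after the previous one, with no throttling, batching,
# or failure handling. All names are hypothetical; the real
# integrations are stubbed so the structure runs.

def enrich(lead):          return {**lead, "company_size": 50}   # enrichment API call
def score_lead(lead):      return 55                             # 1-100 score
def assign_rep(score):     return "senior" if score > 70 else "mid"
def generate_email(lead):  return f"Hi {lead['name']}, ..."      # LLM call
def send_email(rep, lead, body): print(f"[{rep}] -> {lead['email']}")
def schedule_follow_ups(lead):   pass

def handle_new_lead(lead):
    enriched = enrich(lead)
    rep = assign_rep(score_lead(enriched))
    send_email(rep, lead, generate_email(enriched))
    schedule_follow_ups(lead)

# The trigger just looped over whatever looked "new". When 12,000
# historical rows were suddenly marked new, this became 12,000
# unthrottled API-call chains firing at once.
def on_new_leads(new_leads):
    for lead in new_leads:
        handle_new_lead(lead)
```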
I deployed it on a Friday afternoon.
That was my first mistake.
Saturday, 3 AM
My phone exploded. Missed calls. Texts. Hundreds of emails.
The client messaged in all caps telling me to turn everything off immediately.
I logged in.
And then I understood.
What Actually Happened
The workflow worked exactly as designed. Just not how I expected.
The company had twelve thousand leads sitting in an old spreadsheet. Six months of uncontacted leads. When I connected the CRM, the automation saw twelve thousand “new” leads.
Every one of them entered the workflow at the same time.
All seven stages. In parallel.
The Chain Reaction
In six minutes, twelve thousand leads hit the enrichment stage. The system tried to call the enrichment API thousands of times. The API rate limit was a hundred requests per minute. We sent two thousand.
The API key was banned.
With enrichment failing, scoring defaulted everyone to medium priority. Twelve thousand leads were assigned to eight sales reps. Each rep received around fifteen hundred notifications.
Next, the LLM tried to generate twelve thousand personalized emails in minutes. The API throttled us.
Then the email service tried to send thousands of emails at once. It queued them, sent a hundred, then froze the account for suspicious activity.
Sales reps woke up to notification storms that crashed their phones. The CRM slowed to a crawl. Race conditions caused duplicate emails and crossed lead records. Some leads received emails meant for other people.
One lead got six emails in two hours.
They replied asking if we were okay.
The Carnage
By Saturday morning, every external service was rate-limited, banned, or frozen. The CRM crashed twice. Sales reps were furious. Leads were confused or angry. The client was ready to fire us.
The system didn’t fail slowly. It exploded.
What I Did Wrong
I never added rate limits. I assumed leads would trickle in, not arrive by the thousands.
I didn’t batch anything. Every trigger fired instantly.
I didn’t consider first-run safety. Historical data should never be treated like fresh leads.
I had no circuit breakers. When enrichment failed, the workflow just kept going with bad data.
I tested with five leads. Not five thousand.
And I deployed on a Friday afternoon, then went offline.
Every classic mistake, stacked together.
The Rebuild
I rebuilt the entire workflow with safety rails.
Processing was capped. Only a limited number of leads could be handled per hour. Anything beyond that went into a queue.
Leads were processed in small batches with cooldowns in between.
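Both rules fit in one small loop: a hard hourly cap, an overflow queue, small batches, and a forced pause between them. A minimal sketch with illustrative numbers, none of which are the client's actual limits:

```python
import time
from collections import deque

# Capped, batched processing with cooldowns. Overflow waits in a
# queue instead of being dropped. All numbers are illustrative.
MAX_PER_HOUR = 100
BATCH_SIZE = 10
COOLDOWN_SECONDS = 60

queue = deque()

def process_capped(leads, handle_lead):
    queue.extend(leads)
    processed = 0
    window_start = time.monotonic()

    while queue:
        elapsed = time.monotonic() - window_start
        if elapsed >= 3600:                      # hourly window rolled over
            processed, window_start = 0, time.monotonic()
        elif processed >= MAX_PER_HOUR:          # cap hit: wait it out
            time.sleep(max(0.0, 3600 - elapsed))
            continue

        batch = [queue.popleft() for _ in range(min(BATCH_SIZE, len(queue)))]
        for lead in batch:
            handle_lead(lead)
            processed += 1

        time.sleep(COOLDOWN_SECONDS)             # breathing room between batches
```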
Circuit breakers were added. If an API failed repeatedly, the workflow paused instead of pushing forward blindly.
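A circuit breaker doesn't need to be fancy. A failure counter that trips after a few consecutive errors and refuses calls for a cooldown window is enough. A minimal sketch, with illustrative thresholds:

```python
import time

class CircuitBreaker:
    """Pauses a workflow stage after repeated failures instead of
    letting bad data flow downstream. Thresholds are illustrative."""

    def __init__(self, max_failures=5, reset_after=300):
        self.max_failures = max_failures
        self.reset_after = reset_after      # seconds to stay open
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        # While open, refuse calls until the reset window has passed.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: stage paused")
            self.opened_at = None           # half-open: allow one retry
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
            self.failures = 0               # success resets the counter
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()   # trip the breaker
            raise
```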
A first-run mode was introduced. On initial deployment, only very recent leads were processed. Historical data required explicit approval.
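The first-run check can be a simple split on lead age. This sketch assumes a timezone-aware created_at timestamp on each lead and a hypothetical approval flag; both names are mine, not the client's:

```python
from datetime import datetime, timedelta, timezone

# First-run safety: on initial deployment, only recently created leads
# are processed automatically; anything older is held back until a
# human explicitly approves the backfill. The cutoff is illustrative.
FRESH_WINDOW = timedelta(days=2)

def split_first_run(leads, historical_approved=False):
    cutoff = datetime.now(timezone.utc) - FRESH_WINDOW
    fresh = [l for l in leads if l["created_at"] >= cutoff]
    held = [l for l in leads if l["created_at"] < cutoff]
    if historical_approved:
        return fresh + held, []
    return fresh, held      # held leads wait for explicit approval
```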
Duplicate checks ran before every email. No lead could be emailed twice within twenty-four hours.
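The duplicate guard is just a lookup against the last send time. A real system would persist this in the CRM or a database; an in-memory version shows the idea:

```python
import time

# Duplicate-send guard: remember when each lead was last emailed and
# refuse a second send inside a 24-hour window.
last_emailed: dict[str, float] = {}
WINDOW_SECONDS = 24 * 3600

def may_email(lead_id: str) -> bool:
    last = last_emailed.get(lead_id)
    if last is not None and time.time() - last < WINDOW_SECONDS:
        return False                        # emailed too recently
    last_emailed[lead_id] = time.time()     # record the send
    return True
```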
And most importantly, I added a kill switch. One button to pause everything instantly.
The New Logic
Before any lead was processed, the system checked queue size, first-run status, duplication, API health, and recent outreach history.
If enrichment failed, the system skipped it. If personalization failed, it used a safe template. If email delivery failed, it stopped completely.
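Pulled together, the per-lead logic looked roughly like this. Every helper is a hypothetical stand-in for the checks described above; what matters is the ordering of the guards and the fallbacks, not the stub implementations:

```python
# Hypothetical guard and stage stubs so the control flow runs end to end.
SAFE_TEMPLATE = "Hi there, thanks for reaching out..."

def queue_too_large():         return False
def duplicate_send(lead):      return False
def apis_healthy():            return True
def recently_contacted(lead):  return False
def enrich(lead):              return lead
def personalize(lead):         return f"Hi {lead.get('name', 'there')}, ..."
def send(lead, body):          print(f"sending to {lead.get('email')}")

KILL_SWITCH = False   # the one button that pauses everything

def process_lead(lead):
    # Pre-flight: refuse to start unless every guard passes.
    if (KILL_SWITCH or queue_too_large() or duplicate_send(lead)
            or not apis_healthy() or recently_contacted(lead)):
        return "skipped"

    # Degrade gracefully, stage by stage.
    try:
        lead = enrich(lead)
    except Exception:
        pass                        # enrichment failed: proceed unenriched

    try:
        body = personalize(lead)
    except Exception:
        body = SAFE_TEMPLATE        # personalization failed: safe fallback

    try:
        send(lead, body)
    except Exception:
        return "halted"             # delivery failed: stop completely

    return "sent"
```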
The core rule became simple.
Slow and steady beats fast and broken.
The Re-Launch
Monday morning, we turned it back on.
Only twenty-three new leads were processed. It took fifteen minutes. Every API call succeeded. Emails were paced. Sales reps got a manageable number of leads.
The client asked why it felt slow.
I told them it wasn’t slow. It was safe.
By the end of the day, everything worked perfectly.
Trust was restored.
Handling the Old Leads
The client still had twelve thousand historical leads.
We processed them carefully. Two hundred per day. During business hours. With manual reviews after each batch. If anything looked wrong, we stopped.
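The drip itself is the same throttling idea at a daily scale, plus a human gate. A rough sketch with illustrative numbers:

```python
from datetime import datetime

# Drip backfill: a fixed daily quota, business hours only, with a
# human review gate after each batch. Numbers are illustrative.
DAILY_CAP = 200

def backfill_one_day(historical_leads, handle_lead, review):
    now = datetime.now()
    if not (9 <= now.hour < 17 and now.weekday() < 5):
        return historical_leads             # outside business hours: wait

    todays_batch = historical_leads[:DAILY_CAP]
    for lead in todays_batch:
        handle_lead(lead)

    if not review(todays_batch):            # human flags a problem
        raise RuntimeError("backfill halted for manual investigation")
    return historical_leads[DAILY_CAP:]     # remainder waits for tomorrow
```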
It took two months.
Over two thousand of those old leads converted.
Slow, boring, safe success.
The Outcome
Three months later, the system had processed thousands of new and historical leads with almost zero errors. No API bans. No sales rep complaints. Response rates tripled. Conversion rates increased significantly.
The biggest change wasn’t speed.
It was reliability.
What I Learned About Automation
Scale is a feature, not a freebie. Unlimited throughput is dangerous.
Testing with five leads tells you nothing about twelve thousand. You must test at scale.
External services will fail. Assume they will.
Circuit breakers matter more than clever logic.
Automation must always have a human override.
And never deploy something complex on a Friday afternoon.
The Principle That Saved Me
Build workflows that fail gracefully, not catastrophically.
When my system failed the first time, it exploded. The rebuilt version failed by slowing down, pausing, and skipping steps.
Slow failure beats fast disaster every time.
Your Turn
Have you ever built automation that worked too well? How do you handle rate limits and safety in your workflows? What’s your worst Friday deployment story?
Written by FARHAN HABIB FARAZ
Senior Prompt Engineer and Team Lead at PowerInAI
Building workflows that know when to slow down

Tags: workflowautomation, ratelimiting, scalability, systemdesign, automation, disasterrecovery
