Stephen Souza

Our SaaS stopped getting signups at 2am. No alerts fired. Here's why.

It was a Tuesday morning. I opened my laptop, checked the dashboard, and noticed something off.

No signups since 2:17am.

Not one. For almost six hours.

My first thought: slow day. Maybe people just weren't signing up. It happens.

My second thought, ten minutes later, after staring at the chart: this has never happened before.


Everything was "up"

Here's the part that still bothers me.

My uptime monitor? Green.
Server health? Normal.
Error logs? Clean.
SSL cert? Valid.
API response time? Fine.

By every traditional measure, my product was working perfectly. Which meant none of my alerts fired. Not a single one.

But my signup flow was completely broken.


What actually happened

Somewhere around 2am, something in our signup flow started returning responses that looked like success — but weren't. Not a crash. Not an error. Just silent wrong behaviour. The kind that looks like success to a health check but fails when a real user tries to sign up.

The signup form submitted. The spinner spun. Then nothing. No account created. No error shown to the user. Just a quiet dead end.

Users didn't get an error. They got silence. And they left.

For six hours.

I found out when a user emailed me saying they'd tried to sign up three times the night before and gave up.

That email hurt more than the outage itself.


The monitoring gap nobody talks about

Most monitoring tools are built around one question: "Is it up?"

Is the server responding? Is the endpoint returning 200? Is the cert valid?

These are good questions. But they were the wrong questions for this failure.

The right question was: "Is anything actually happening?"

Specifically — are signups happening at the rate they normally do? Because on a normal Tuesday at 2am, even with low traffic, I get some signups. When that number drops to zero for six hours, something is wrong. Always.

But no tool I was using asked that question. They were all watching the pipes. Nobody was watching the water.
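
Even before any machine learning, a crude version of that question fits in a dozen lines of Python. A minimal sketch, where fetch_signup_count is a hypothetical helper standing in for whatever query your own datastore supports:

```python
from datetime import datetime, timedelta, timezone

def signups_gone_quiet(fetch_signup_count, hours=6, lookback_days=7):
    """fetch_signup_count(start, end) -> int is a stand-in for your datastore."""
    now = datetime.now(timezone.utc)
    recent = fetch_signup_count(now - timedelta(hours=hours), now)

    # The same window of hours on each of the previous `lookback_days` days.
    history = [
        fetch_signup_count(
            now - timedelta(days=d, hours=hours),
            now - timedelta(days=d),
        )
        for d in range(1, lookback_days + 1)
    ]

    # Zero in a window that's normally non-zero is a signal, not noise.
    typical = sum(history) / len(history)
    return recent == 0 and typical > 0
```

Crude, but even this would have caught my six-hour gap.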


The difference between "up" and "working"

Your server can be up. Your API can respond. Your database can be connected.

And your business can still be silently bleeding.

This happens more than people admit:

  • Payment webhooks stop arriving — Stripe has a hiccup, webhooks queue and don't retry correctly. Your server never goes down. Your revenue processing stops.
  • Emails stop sending — your SMTP provider throttles you, but returns 200s. Onboarding emails never arrive. Users churn thinking you don't care.
  • Cron jobs silently skip — the job "runs" but processes zero records due to a config drift. No error. No alert. Your data pipeline is stale for days.
  • Signup flow breaks — exactly what happened to us. The form works. The backend "responds." Zero accounts created.
  • AI agents go quiet — agent is "running" but stops producing outputs. No exception, no crash. Task queue drains, nothing gets done.
  • AI agents loop infinitely — agent keeps retrying the same step, burning tokens and API credits silently. No alert. You see the bill at end of month.
  • AI agents get stuck — tool call hangs waiting for a response that never comes. The agent neither fails nor succeeds. Just... waits. Forever.

In every one of these cases, traditional monitoring sees nothing wrong. Because technically, nothing is wrong with the infrastructure. The failure is at the business event level — the things that actually matter.


What I wished I had

I didn't want another dashboard to stare at.

I wanted something to notice that signups had gone quiet — and tell me.

Not because I configured a threshold. Not because I manually set up an alert for "signups < 1 per hour." But because the system knew what normal looked like for my app at 2am on a Tuesday, and knew that zero signups for six hours was abnormal.

Basically: I wanted the monitoring equivalent of someone who's worked at my company long enough to say "hey, something feels off today."


What we built

After this incident — and two others like it in the same week (a CPU spike that correlated with a memory leak, and an anomalous jump in signups that turned out to be a bot wave) — we built NotiLens.

The core idea is what we call Smart Silence Detection.

Instead of asking "is the server up?", it asks "is your business behaving normally?"

It learns your baseline — what normal event volume looks like at each hour of the day, each day of the week. Then it alerts you when things go abnormally quiet. No manual threshold configuration needed. No spreadsheet of "expected events per hour." It just learns, and it watches.
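
To make that concrete, here's a toy illustration of the idea (not the actual NotiLens model, just the shape of it): bucket event counts by day of week and hour, then flag a window that falls far below what its bucket normally sees.

```python
from collections import defaultdict
from statistics import mean, stdev

class SilenceDetector:
    """Toy baseline learner: illustrates the shape of the idea, not the real model."""

    def __init__(self, min_samples=4, sigma=3.0):
        self.buckets = defaultdict(list)  # (weekday, hour) -> historical counts
        self.min_samples = min_samples
        self.sigma = sigma

    def observe(self, weekday, hour, count):
        # Called once per hour with that hour's observed event count.
        self.buckets[(weekday, hour)].append(count)

    def is_abnormally_quiet(self, weekday, hour, count):
        history = self.buckets[(weekday, hour)]
        if len(history) < self.min_samples:
            return False  # too little data to call anything abnormal
        mu, sd = mean(history), stdev(history)
        # Quiet means significantly below baseline, not merely different.
        return count < mu - self.sigma * sd
```

The point of the bucketing: Tuesday 2am only ever gets compared against other Tuesday 2ams.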

When signups stop. When webhooks dry up. When your cron job runs but processes nothing. When orders drop to zero on a Saturday afternoon. When your AI agent stops producing outputs. When it starts looping through the same step, burning your API credits. When it hangs waiting for a tool response that never comes.

That's the alert you need. Not "server down." But "your business just went silent."

There's a second pattern it catches that we didn't even plan for initially: broken flows.

Silence detection watches for events that stop happening altogether. Broken flow detection watches for events that start but never finish.

payment.initiated fired. But payment.completed never followed.

user.registered fired. But user.activated never followed.

order.placed fired. But order.fulfilled never followed.

Each individual event looks fine. No errors. No timeouts. The payment was "initiated" — technically true. But the money never moved.

This is where most revenue leaks actually happen. Not in crashes. In the gap between two events that should always travel together — but sometimes don't.
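
The detection logic is conceptually simple. An illustrative sketch, with event shapes and field names that are my assumptions rather than any real schema:

```python
from datetime import datetime, timedelta, timezone

def find_broken_flows(events, start_type, finish_type, deadline_minutes):
    """events: dicts like {"type": ..., "flow_id": ..., "at": aware datetime}."""
    finished = {e["flow_id"] for e in events if e["type"] == finish_type}
    deadline = timedelta(minutes=deadline_minutes)
    now = datetime.now(timezone.utc)

    # A flow is broken if it started, never finished, and is past its deadline.
    return [
        e for e in events
        if e["type"] == start_type
        and e["flow_id"] not in finished
        and now - e["at"] > deadline
    ]

# e.g. find_broken_flows(events, "payment.initiated", "payment.completed", 15)
```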


The practical setup (for the technically curious)

The way it works under the hood:

  1. You send business events to NotiLens via a simple SDK call or webhook — signup.completed, payment.received, order.placed, whatever matters to your app (sketched in Python after this list).
  2. NotiLens builds a rolling baseline of expected event frequency using ML — per event type, per hour of day, per day of week.
  3. When observed frequency drops significantly below baseline for a sustained period, it fires an alert — push notification to your phone, Slack, email, whatever you have set up.
  4. You also get anomaly detection in the other direction: sudden spikes (bot attacks, viral traffic, billing loops) are caught too.
  5. For broken flows, you define the relationship between two events — payment.initiated should always be followed by payment.completed within X minutes. If it isn't, you get alerted immediately. No polling. No manual checks.
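
To give a feel for step 1 without reproducing the real SDK here, the instrumentation pattern looks roughly like this. The endpoint, payload shape, and auth are placeholders, not the actual NotiLens API:

```python
import requests

def track(event_type, api_key, **properties):
    # One POST per business event; the URL is a stand-in, not a real ingest endpoint.
    requests.post(
        "https://example.com/events",
        headers={"Authorization": f"Bearer {api_key}"},
        json={"type": event_type, "properties": properties},
        timeout=5,
    )

# In the signup handler, after the account actually exists:
# track("signup.completed", API_KEY, user_id=user.id, plan="free")
```

The important part is firing the event only after the real business outcome (an account that exists, money that moved), not when a request merely returned 200.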

What I learned

A few things I'd tell myself before that Tuesday morning:

1. "No alerts" is not the same as "no problems."
Silence from your monitoring tools means your infrastructure is up. It says nothing about whether your business is working.

2. The failures that hurt most are the ones users experience silently.
A full server outage is obvious. Users tweet about it, you get flooded with emails, you know within minutes. A broken signup flow at 2am? You find out when a user emails you three days later — if they email at all. Most just leave.

3. Business events are first-class monitoring targets.
signup.completed, payment.received, user.activated — these deserve the same monitoring attention as CPU and memory. Maybe more.

4. Absence of data is data.
Zero signups for six hours is a signal. Treat it like one.


If any of this sounds familiar — if you've had that moment of "wait, when did this stop working?" — I'd love to hear your story in the comments.

And if you want to try NotiLens — if you're a solo founder, running a small team, building with AI agents, or just tired of juggling multiple systems with no single place watching whether your business is actually working — we're giving early users 3 months free in exchange for honest feedback. Just reach out directly or drop a comment below.


Top comments (2)

arun rajkumar

The "uptime green, business-metric flat" gap is the failure mode every founder discovers around the same revenue inflection — somewhere between traffic-low-enough-that-zero-is-noise and traffic-high-enough-that-zero-is-an-incident. The instrumentation that fixed it for us isn't another threshold alert (those break the same way) — it's a tiny rule that compares "signups in the last 1h" against "signups in the equivalent 1h window from the last 7 days, same day-of-week" and pages on > 3σ deviation. Generic uptime monitors can't model your business; only your business can. The hard part isn't writing the rule — it's deciding which 4–5 metrics deserve this treatment without drowning oncall in seasonality alerts.

Stephen Souza

That day of week + time of day windowing is exactly how our baseline works - Tuesday 2am is compared against other Tuesday 2ams, not the global average. A Sunday night drop that looks alarming against a Monday morning baseline is completely normal in context. Each event learns its own pattern independently. The "which metrics deserve this" decision becomes "which business events actually matter to you" - much easier question to answer.