Rob

Posted on • Originally published at strake.dev

Your Startup Doesn't Need Better Monitoring. It Needs Less of It.

I'm going to say something that will annoy every SRE who's ever given a conference talk: most of what they tell you about observability is wrong for your stage.

Not wrong in general. Wrong for you. A founding team of six people shipping a B2B SaaS product does not have the same operational needs as Google. I know this sounds obvious written down. But I watch founders set up Datadog with 47 custom dashboards before they have 47 customers, and nobody's telling them to stop.

I did exactly this. About two years into my first startup, I spent an entire weekend building what I genuinely believed was a world-class monitoring stack. Prometheus, Grafana, custom exporters, alert rules for CPU, memory, disk, network throughput, request latency at p50, p95, p99, p99.9 — the works. I felt like a real engineer. Professional. Prepared.

Then I got paged at 3am on a Tuesday because CPU hit 80% on a box that was completely fine. The alert was technically correct. The threshold was just wrong. I silenced it, went back to sleep, got paged again at 4am for a memory warning that also didn't matter. By morning I'd silenced four alerts and missed the one email from a customer saying they couldn't log in.

The login bug had nothing to do with CPU or memory. A config file got borked during a deploy. None of my beautiful dashboards caught it because I was monitoring infrastructure when I should have been monitoring the product.

What Actually Matters

Here's what I think you need at the early stage. Not what the monitoring vendor's blog post says. Not what the "complete observability guide" on Medium recommends. What actually keeps your customers happy and lets you sleep.

One: know if your thing is up. That's it. A simple HTTP check against your most important endpoint, every 30 seconds. If it fails three times in a row, text yourself. I don't care if you use UptimeRobot, Pingdom, or a cron job that curls your health check — it doesn't matter. The fancy tool doesn't help if you're checking the wrong thing. Hit the endpoint your customers actually use. Not /health. Not /ping. The actual login page, or the API call that matters most. If that works, you're probably fine. If it doesn't, you need to know immediately.
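The whole check fits in a few lines of Python. This is a sketch, not a recommendation of any particular tool: the endpoint URL, the interval, the failure threshold, and the `page_me` hook are all placeholders you'd swap for your own.

```python
import time
import urllib.error
import urllib.request

ENDPOINT = "https://example.com/login"  # hypothetical: use the page customers actually hit
CHECK_INTERVAL_SECONDS = 30
FAILURES_BEFORE_ALERT = 3

def endpoint_is_up(url: str, timeout: float = 10.0) -> bool:
    """True if the endpoint answers with a non-5xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as e:
        return e.code < 500          # a 4xx still means the server is up
    except (urllib.error.URLError, OSError):
        return False                 # DNS failure, connection refused, timeout

def should_page(recent_results: list) -> bool:
    """Page only after FAILURES_BEFORE_ALERT consecutive failures."""
    streak = 0
    for ok in recent_results:
        streak = 0 if ok else streak + 1
    return streak >= FAILURES_BEFORE_ALERT

def page_me(url: str) -> None:
    # Hypothetical hook: swap in Twilio, PagerDuty, or an SMS gateway.
    print(f"ALERT: {url} has failed {FAILURES_BEFORE_ALERT} checks in a row")

def watch() -> None:
    results = []
    while True:
        results.append(endpoint_is_up(ENDPOINT))
        if should_page(results[-FAILURES_BEFORE_ALERT:]):
            page_me(ENDPOINT)
            results.clear()          # don't re-page every 30s for the same outage
        time.sleep(CHECK_INTERVAL_SECONDS)
```

The consecutive-failure rule is the important part: a single failed check is usually a network blip, three in a row is an outage.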

Two: know if your customers are getting errors. This means tracking your HTTP 5xx rate. You can do this in CloudWatch, in your application logs, in whatever. The point is: if more than, say, 1% of requests are returning server errors, something is wrong and you should look at it. During business hours. Not at 3am. Unless it's way above 1%, in which case yes, wake up.
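Computing the rate is trivial once you have status codes from your access logs or load balancer. A sketch, with the thresholds from above as illustrative defaults:

```python
def server_error_rate(status_codes) -> float:
    """Fraction of requests that returned a 5xx."""
    if not status_codes:
        return 0.0
    errors = sum(1 for s in status_codes if 500 <= s <= 599)
    return errors / len(status_codes)

# Illustrative thresholds: >1% means look during business hours;
# "way above" (10% here, pick your own) means wake up.
def classify(rate: float, warn: float = 0.01, page: float = 0.10) -> str:
    if rate >= page:
        return "page"
    if rate >= warn:
        return "business-hours"
    return "ok"
```

Run it over a rolling window (the last five minutes of requests, say) rather than all-time totals, or a brief spike will be invisible.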

Three: know if things are slow. Response time matters, but you don't need seven percentile buckets. Track p95 latency. If your p95 is under 500ms for a typical API call, you're fine. If it's climbing, investigate when you're awake. If it suddenly spikes to 5 seconds, that's worth waking up for.
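If your tooling doesn't give you p95 for free, the nearest-rank method is all you need at this scale. A sketch:

```python
import math

def p95(latencies_ms) -> float:
    """Nearest-rank p95: the latency 95% of requests were at or below."""
    if not latencies_ms:
        raise ValueError("no samples")
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.95 * len(ordered))  # 1-based nearest-rank
    return ordered[rank - 1]
```

One number, one threshold, one decision: under 500ms, ignore it; climbing, look tomorrow; 5 seconds, wake up.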

That's the list. Three things. Everything else is noise at your stage.

The Alert Hygiene Nobody Talks About

Here's a rule I wish someone had tattooed on my forearm before I started: every alert that wakes you up must require you to do something right now. Not "hmm, interesting." Not "I should look at this tomorrow." Right now, tonight, in your underwear, something needs to be done or it wasn't worth waking you up.

If you get paged and the correct response is "I'll check this in the morning," that alert is broken. Downgrade it. Make it a Slack notification. Make it an email. Make it a dashboard you glance at with your coffee. But do not let it wake you up.
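The rule is easier to enforce if the routing lives in one place. A hypothetical routing table, where the channel names are placeholders for whatever you actually use:

```python
# Only actionable-right-now alerts may page; everything else lands
# somewhere you'll see in the morning.
ROUTES = {
    "page": "sms",           # requires action right now, tonight
    "notify": "slack",       # look at it today
    "log": "email-digest",   # glance at it with your coffee
}

def route(severity: str) -> str:
    # Default to the quietest channel: an unclassified alert must never page.
    return ROUTES.get(severity, "email-digest")
```

The default matters more than the table. New alerts should have to earn their way up to paging you, not start there.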

I know this sounds aggressive. You're thinking "but what if I miss something?" You might. And that's okay. Because the alternative is alert fatigue, which is when you've been woken up by false alarms so many times that you start sleeping through the real ones. Alert fatigue has caused more outages than missing alerts ever has. I'd bet money on it.

At one point I had 30+ alert rules configured. I was getting maybe 4-5 notifications a day. I started ignoring all of them. It took a customer emailing our support address (which was my personal Gmail) to tell me the payment flow had been broken for six hours. Six hours. While my monitoring stack was happily telling me that CPU utilization was nominal.

I deleted 25 of those alerts in one commit. Kept five. Slept better. Caught more real problems. Go figure.

The Tools Question

People ask me what monitoring tools to use and I think the honest answer is: it barely matters, and spending a week evaluating tools is a week you didn't spend building your product.

If you're on AWS, CloudWatch is already there and it's fine. The UI is ugly and the query language is annoying but it works. If you want something nicer, Grafana Cloud has a free tier that's generous enough for a small startup. If you have money to spend and want things to just work out of the box, Datadog is great — but you will be shocked by the bill once you grow. Their pricing model is designed to be cheap when you're small and extremely expensive when you're not. Just know what you're signing up for.

The one tool I'd say is genuinely worth paying for early: an error tracking service. Sentry, Bugsnag, something like that. It catches unhandled exceptions in your application code, groups them, shows you the stack trace, tells you which deploy introduced it. This is the stuff that actually breaks your product for users, and application-level error tracking catches it way faster than infrastructure monitoring ever will.

What To Add Later (Not Now)

When you have paying customers with SLAs, or when you've got 10+ services talking to each other, or when you're waking up more than twice a month for real incidents — that's when you start thinking about distributed tracing, log aggregation, SLOs, error budgets, and all the other stuff that makes the SRE Twitter crowd excited.

Not before. I promise you, nobody churned because you didn't have distributed tracing. They churned because your app was down and you didn't notice for two hours because you were drowning in alerts about disk utilization.

The bigger unlock at the early stage is getting all your operational context — deploys, errors, customer signals, team activity — in one place so you're not switching between eight tools to figure out what's happening. That's a different problem than monitoring, and it's the one that actually slows founders down.

The Actual Hard Part

The real operational skill at the early stage isn't monitoring. It's deploy discipline. Can you ship a change and roll it back in under five minutes if something goes wrong? Do you know what changed between "it was working" and "it's not working"? Can you look at your deploy history and your error rate on the same timeline?

If you can do that, you can fix almost anything fast enough that your customers won't care. And at the early stage, fast recovery beats prevention every single time. You don't have the team or the time to prevent every problem. But you can damn sure get good at fixing them quickly.
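Even "what changed?" can be mostly automated if you record deploy timestamps. A sketch of the core lookup, assuming you keep a list of deploy times in whatever format you log them:

```python
import bisect

def deploy_before(deploy_times, incident_time):
    """Most recent deploy at or before the incident: the first suspect
    when answering 'what changed between working and not working?'"""
    times = sorted(deploy_times)
    i = bisect.bisect_right(times, incident_time)
    return times[i - 1] if i else None
```

If the suspect deploy is recent, roll it back first and diagnose second. That's the five-minute recovery in practice.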

Build the smallest monitoring setup that tells you when customers are hurting. Delete everything else. Ship your product.

The operational layer of a startup — knowing what's happening, what needs attention, what can wait — should take minutes a day, not hours. That's the problem worth solving.


Rob is building Strake — an operational platform for startup founders that connects your tools, surfaces what needs your attention, and cuts the overhead of running a company before it buries you. Less time managing operations. More time building the thing.

If that's the problem you're living with, follow along or reach out on X.
