<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sirisha Katta</title>
    <description>The latest articles on DEV Community by Sirisha Katta (@sirishakatta).</description>
    <link>https://dev.to/sirishakatta</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3838694%2F84b4d9b4-2048-4258-b1a7-e9ddcd221ad2.png</url>
      <title>DEV Community: Sirisha Katta</title>
      <link>https://dev.to/sirishakatta</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sirishakatta"/>
    <language>en</language>
    <item>
      <title>Stop Writing Alert Rules By Hand</title>
      <dc:creator>Sirisha Katta</dc:creator>
      <pubDate>Tue, 24 Mar 2026 09:04:49 +0000</pubDate>
      <link>https://dev.to/sirishakatta/stop-writing-alert-rules-by-hand-156h</link>
      <guid>https://dev.to/sirishakatta/stop-writing-alert-rules-by-hand-156h</guid>
      <description>&lt;p&gt;At some point every engineering team has the same meeting. "We need better alerting." Someone opens a spreadsheet. You list every service. You decide on thresholds. Error rate above X. Latency above Y. CPU above Z. You spend a day writing rules in Prometheus, CloudWatch, or Datadog.&lt;/p&gt;

&lt;p&gt;Two weeks later, three of the rules are too noisy and get silenced. Five more never fire because the thresholds are too conservative. And the next production incident is something nobody predicted, so none of the rules cover it.&lt;/p&gt;

&lt;p&gt;This cycle repeats every six months. Sometimes every quarter. The alert rules pile up but coverage never feels complete.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fundamental problem with static thresholds
&lt;/h2&gt;

&lt;p&gt;A static threshold assumes the system behaves the same way all the time. "Alert when error rate exceeds 5%" treats Monday at 9am the same as Sunday at 3am. But your traffic patterns are different. Your error baseline is different. The same error rate might be normal during a traffic spike and catastrophic during off-hours.&lt;/p&gt;

&lt;p&gt;Some teams respond by creating time-based rules. "Alert when error rate exceeds 5% during business hours and 2% outside business hours." Now you have two rules per metric, and you still haven't accounted for holidays, deploy windows, or gradual traffic changes as your product grows.&lt;/p&gt;
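
&lt;p&gt;If you sketch that workaround in code, the bookkeeping problem is obvious. A toy Python version using the hypothetical numbers from above, not anyone's real rule set:&lt;/p&gt;

```python
from datetime import datetime

def error_rate_threshold(ts):
    """The 'two rules per metric' workaround: 5% during weekday
    business hours, 2% otherwise. Holidays, deploy windows, and
    gradual traffic growth are still unaccounted for."""
    business_hours = ts.weekday() in range(5) and ts.hour in range(9, 18)
    return 0.05 if business_hours else 0.02

def should_alert(ts, error_rate):
    return error_rate > error_rate_threshold(ts)
```

&lt;p&gt;A 4% error rate slips under the business-hours rule but fires the off-hours one, and every metric you monitor now needs its own pair of hand-tuned numbers to maintain.&lt;/p&gt;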

&lt;p&gt;The deeper problem is that static rules require you to predict failure modes in advance. You write a rule after an incident and hope it catches the same kind of failure next time. But production systems find new ways to break. The failures that hurt the most are the ones that don't match any existing rule.&lt;/p&gt;

&lt;h2&gt;
  
  
  What anomaly detection does differently
&lt;/h2&gt;

&lt;p&gt;Instead of asking "is this above a threshold?", anomaly detection asks "is this different from what's normal for right now?"&lt;/p&gt;

&lt;p&gt;It builds a baseline from your actual data. Monday 9am has its own expected range. Sunday 3am has its own. When the observed value deviates significantly from the expected range for that specific time window, that's an anomaly.&lt;/p&gt;

&lt;p&gt;This catches two things that static thresholds miss:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Novel failure modes. You don't need to predict them. Anything that deviates significantly from normal gets flagged.&lt;/li&gt;
&lt;li&gt;Context-dependent anomalies. 50 errors per minute at 3am is a disaster. 50 errors per minute at peak traffic is normal background noise. Anomaly detection knows the difference because it learned what normal looks like for each time window.&lt;/li&gt;
&lt;/ul&gt;
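
&lt;p&gt;The per-time-slot baseline idea can be sketched in a few lines of Python. This is a toy illustration of the concept, not Epok's actual model:&lt;/p&gt;

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, stdev

def build_baseline(samples):
    """samples: (timestamp, value) pairs covering a few weeks of history.
    Values are grouped by (weekday, hour) so Monday 9am gets its own
    expected range, separate from Sunday 3am."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[(ts.weekday(), ts.hour)].append(value)
    return {slot: (mean(vals), stdev(vals))
            for slot, vals in buckets.items() if len(vals) >= 2}

def is_anomaly(baseline, ts, value, sigma=3.0):
    """Flag a value that sits more than `sigma` standard deviations
    from the mean for its time slot. Slots with no baseline yet are
    never flagged."""
    slot = (ts.weekday(), ts.hour)
    if slot not in baseline:
        return False
    mu, sd = baseline[slot]
    return abs(value - mu) > sigma * max(sd, 1e-9)
```

&lt;p&gt;With a baseline learned from history, the same 50 errors per minute is flagged at Sunday 3am and ignored at Monday 9am, because each slot carries its own expected range.&lt;/p&gt;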

&lt;h2&gt;
  
  
  Why most teams haven't adopted it
&lt;/h2&gt;

&lt;p&gt;Anomaly detection isn't new as a concept. The reason most teams still write static alert rules is that building good anomaly detection is hard. You need baseline computation, seasonal pattern recognition, a reasonable statistical model, and enough operational experience to set the sensitivity right.&lt;/p&gt;

&lt;p&gt;Datadog and Grafana both offer anomaly detection features. But they're add-on features that you have to configure per metric. You're still deciding which metrics to monitor and what sensitivity to use. It's better than raw thresholds, but it's still manual work per signal.&lt;/p&gt;

&lt;p&gt;The approach we took with Epok is different. Anomaly detection runs automatically on every log stream. You don't configure it. You don't select metrics. You don't choose sensitivity levels. Epok watches every service's log volume, error patterns, and log cadence. When something deviates from the learned baseline, it alerts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where static rules still make sense
&lt;/h2&gt;

&lt;p&gt;There are cases where a hard threshold is the right tool. Disk space above 90% should always alert, regardless of what's "normal." Payment processing success rate below 99.9% should always alert. These are business SLOs, not anomaly detection problems.&lt;/p&gt;

&lt;p&gt;Epok supports threshold rules too for exactly these cases. But they should be the exception, not the primary detection mechanism. Use thresholds for hard business constraints. Use anomaly detection for everything else.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it on your logs
&lt;/h2&gt;

&lt;p&gt;Epok's free tier includes volume anomaly detection, new error detection, and silence detection. Point your log shipper at Epok and see what it catches in the first week. Most teams find something within the first 24 hours that their existing monitoring missed.&lt;/p&gt;

&lt;p&gt;The best alerting system is the one that catches things you didn't think to look for.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;&lt;a href="https://getepok.dev" rel="noopener noreferrer"&gt;Epok&lt;/a&gt; is a log intelligence engine with automatic anomaly detection. Free tier: 150 GB/month, no credit card.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>monitoring</category>
      <category>sre</category>
      <category>observability</category>
      <category>devops</category>
    </item>
    <item>
      <title>Silent Failures: The Bug That Won't Page You</title>
      <dc:creator>Sirisha Katta</dc:creator>
      <pubDate>Tue, 24 Mar 2026 09:02:43 +0000</pubDate>
      <link>https://dev.to/sirishakatta/silent-failures-the-bug-that-wont-page-you-33jj</link>
      <guid>https://dev.to/sirishakatta/silent-failures-the-bug-that-wont-page-you-33jj</guid>
      <description>&lt;p&gt;Your worker process crashes at 2am. No error log. No exception. The process just dies. Maybe it was an OOM kill. Maybe a segfault in a native library. Maybe the container runtime pulled the rug out.&lt;/p&gt;

&lt;p&gt;Whatever the cause, the result is the same: the logs stop. And because there's no error to trigger an alert, nobody gets paged. The job queue backs up. Emails stop sending. Payments stop processing. Six hours later, someone notices.&lt;/p&gt;

&lt;p&gt;This is the most dangerous class of production failure, and almost nobody monitors for it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why error-based alerting misses this
&lt;/h2&gt;

&lt;p&gt;Every alerting system you've used probably works the same way: watch for a condition, fire when the condition is true. CPU above 90%. Error rate above 5%. Latency above 500ms. Response code is 500.&lt;/p&gt;

&lt;p&gt;All of these require something to happen. They need data to evaluate against. When a service dies silently, there is no data. There's nothing to evaluate. The alert rule sits there, perfectly happy, because zero errors is technically below the threshold.&lt;/p&gt;

&lt;p&gt;Some teams work around this with heartbeat checks or synthetic monitors. Ping the service every 30 seconds, alert if it doesn't respond. This catches some cases, but only for services that expose a health endpoint. Background workers, cron jobs, queue consumers, and batch processors often don't have an HTTP endpoint to ping.&lt;/p&gt;

&lt;h2&gt;
  
  
  Watching for absence
&lt;/h2&gt;

&lt;p&gt;The fix is simple in concept: if a service that normally logs every few seconds stops logging for several minutes, something is wrong.&lt;/p&gt;

&lt;p&gt;Your API server processes 200 requests per minute and logs each one. If that drops to zero for 3 minutes straight, either the service is down or something fundamental has changed. Either way, you want to know about it.&lt;/p&gt;

&lt;p&gt;The implementation is harder than it sounds. You need to know what "normal" looks like for each service. A batch job that runs once an hour will naturally have 59 minutes of silence between runs. Your API server at 3am on a Sunday will log much less than Monday at noon. You can't just set a static threshold for "too quiet."&lt;/p&gt;

&lt;h2&gt;
  
  
  How silence detection should work
&lt;/h2&gt;

&lt;p&gt;Good silence detection learns each service's log cadence over time. It builds a baseline per service, per hour of day, per day of week. Monday 9am for your API server has a different expected volume than Sunday 3am.&lt;/p&gt;

&lt;p&gt;Then it watches the live log stream. If a service's log volume drops to zero and the baseline says it should be producing logs, that's a silence alert. If the baseline for this time slot is already near zero (like a batch job between runs), no alert.&lt;/p&gt;
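
&lt;p&gt;One way to make "too quiet" statistical rather than hard-coded is a Poisson model: if a service averages some number of log lines per minute in this time slot, the chance of seeing none at all for several minutes can be computed directly. A toy sketch of the idea, not Epok's actual detector:&lt;/p&gt;

```python
import math

def silence_is_anomalous(rate_per_min, minutes_quiet, p_threshold=0.001):
    """rate_per_min is the learned log rate for the current time slot.
    Under a Poisson model, the probability of zero lines over
    `minutes_quiet` minutes is exp(-rate * minutes); alert when that
    probability falls below p_threshold."""
    p_zero = math.exp(-rate_per_min * minutes_quiet)
    return p_threshold > p_zero
```

&lt;p&gt;An API server logging 200 lines a minute trips this after moments of silence; a batch job whose baseline between runs is near zero never does, because zero output is exactly what the model expects.&lt;/p&gt;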

&lt;p&gt;This is what Epok's silence detector does. It activates within about an hour of seeing a new log stream, using short-term cadence analysis. Full weekly baselines build over 7 days for hourly and daily patterns.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real examples
&lt;/h2&gt;

&lt;p&gt;A background worker that processes webhook events from Stripe crashes after an OOM kill. No error log because the kernel killed it. Epok notices the log stream went silent and alerts within 5 minutes.&lt;/p&gt;

&lt;p&gt;A cron job that runs every 15 minutes stops running because someone accidentally deleted the cron entry during a deploy. No errors anywhere. Epok flags it when the expected log output doesn't appear at the next scheduled time.&lt;/p&gt;

&lt;p&gt;A database replica falls behind and stops accepting queries. The app fails over to the primary, which works fine, so there are no application errors. But the replica's log stream goes quiet. Silence detection catches it before the primary gets overloaded.&lt;/p&gt;

&lt;h2&gt;
  
  
  Start monitoring for silence
&lt;/h2&gt;

&lt;p&gt;Silence detection is included in Epok's free tier. Point your log shipper at Epok, and within an hour it starts learning your service cadences. When something goes quiet that shouldn't be quiet, you'll get a Slack message or a PagerDuty page.&lt;/p&gt;

&lt;p&gt;Because the scariest production bug isn't the one that fills your logs with errors. It's the one that leaves them empty.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;&lt;a href="https://getepok.dev" rel="noopener noreferrer"&gt;Epok&lt;/a&gt; is a log intelligence engine with automatic anomaly detection. Free tier: 150 GB/month, no credit card.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>monitoring</category>
      <category>devops</category>
      <category>sre</category>
      <category>observability</category>
    </item>
    <item>
      <title>Datadog Alternatives for Small Teams</title>
      <dc:creator>Sirisha Katta</dc:creator>
      <pubDate>Sun, 22 Mar 2026 17:38:36 +0000</pubDate>
      <link>https://dev.to/sirishakatta/datadog-alternatives-for-small-teams-11da</link>
      <guid>https://dev.to/sirishakatta/datadog-alternatives-for-small-teams-11da</guid>
      <description>&lt;p&gt;Datadog is a great product. It's also priced for companies with dedicated platform teams and six-figure observability budgets. If you're a team of five&lt;br&gt;
  shipping features every day, Datadog will eat your entire infrastructure budget before you've finished the onboarding wizard.&lt;/p&gt;

&lt;p&gt;The pricing page looks simple. $0.10/GB for log ingestion. But then there's indexing at $2.55 per million events, retention costs, per-host APM fees, custom metrics charges, and the silent killer: cardinality. Log a high-cardinality field like user_id or trace_id and your bill explodes because Datadog indexes every unique value.&lt;/p&gt;

&lt;p&gt;Most small teams discover this after their first real month of usage. The POC was cheap because traffic was low. Production traffic hits and suddenly the bill is $500/month for a team that's spending $200/month on actual compute.&lt;/p&gt;

&lt;h2&gt;
  
  
  What small teams actually need
&lt;/h2&gt;

&lt;p&gt;Let's be honest about what a team of 5-20 engineers needs from log monitoring:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Know when something breaks. Automatically. Without writing alert rules.&lt;/li&gt;
&lt;li&gt;Search logs when debugging. Fast, full-text search across all services.&lt;/li&gt;
&lt;li&gt;See new errors as they appear. Not buried in a query result, but flagged and grouped.&lt;/li&gt;
&lt;li&gt;Get notified in Slack or PagerDuty. Not in a dashboard nobody's watching.&lt;/li&gt;
&lt;li&gt;Pay a predictable amount every month. No surprises, no cardinality taxes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You don't need 400 integrations. You don't need a marketplace of dashboards. You don't need custom metrics with 15 tag dimensions. You need to know when your stuff breaks.&lt;/p&gt;

&lt;h2&gt;
  
  
  The alternatives, honestly
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Grafana Cloud + Loki
&lt;/h3&gt;

&lt;p&gt;Grafana Cloud is the most popular Datadog alternative. Loki handles log storage, Grafana handles visualization. The free tier is generous (50 GB/month) and the paid tier is $0.50/GB.&lt;/p&gt;

&lt;p&gt;The catch: you still have to build everything. Dashboards, alert rules, recording rules. Loki's query language (LogQL) has a learning curve. And if you want anomaly detection, you're writing PromQL expressions and hoping they catch the right things. For teams with a dedicated SRE, this works. For teams where everyone wears five hats, it's another project that never gets finished.&lt;/p&gt;

&lt;h3&gt;
  
  
  Axiom
&lt;/h3&gt;

&lt;p&gt;Axiom is a solid log management tool with a generous free tier (500 GB/month ingest, 30-day retention). The interface is clean and queries are fast. If you want a cheaper Datadog-like experience, Axiom is a good choice.&lt;/p&gt;

&lt;p&gt;But like Grafana, you're building the intelligence layer yourself. Axiom stores and searches your logs. It doesn't watch them for you. You still need to create monitors, write queries for each failure mode, and tune thresholds. The anomaly detection is basic and requires manual configuration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Better Stack (formerly Logtail)
&lt;/h3&gt;

&lt;p&gt;Better Stack combines uptime monitoring with log management. Pricing starts at $0.25/GB. The UI is polished and they have decent alerting. Good option if you want uptime monitoring bundled with logs.&lt;/p&gt;

&lt;p&gt;The log analysis features are limited compared to dedicated tools. No automatic anomaly detection, no pattern clustering, no root cause analysis. It's a log database with a nice UI and some alerting on top.&lt;/p&gt;

&lt;h3&gt;
  
  
  Epok
&lt;/h3&gt;

&lt;p&gt;Epok is a log intelligence engine. It's different from the tools above because it watches your logs for you. Send your logs to Epok and it automatically detects new errors, volume anomalies, silent services, and pattern changes. No dashboards to build, no alert rules to write.&lt;/p&gt;

&lt;p&gt;Pricing is flat: free up to 150 GB/month (forever, no credit card), $49/month for Pro (600 GB, all 16 detectors, AI analysis), $149/month for Business (1.5 TB, full AI suite, 30-day retention). No per-query fees. No cardinality charges. Log any field you want.&lt;/p&gt;

&lt;h2&gt;
  
  
  Price comparison at 600 GB/month
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Datadog: ~$3,100/month (ingestion + indexing + retention)&lt;/li&gt;
&lt;li&gt;Grafana Cloud: ~$300/month (but you build everything yourself)&lt;/li&gt;
&lt;li&gt;Better Stack: ~$150/month (limited intelligence features)&lt;/li&gt;
&lt;li&gt;Axiom: ~$25/month (generous pricing, but no automatic detection)&lt;/li&gt;
&lt;li&gt;Epok: $49/month (all detection features included)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The actual decision
&lt;/h2&gt;

&lt;p&gt;If you want a full APM platform and have the budget, Datadog is genuinely good at what it does. No shame in paying for it if you use it.&lt;/p&gt;

&lt;p&gt;If you want cheap log storage with a fast query engine and you have time to build dashboards and alerts, Axiom or Grafana Cloud are solid picks.&lt;/p&gt;

&lt;p&gt;If you want something that watches your logs and tells you when things break, without you having to configure anything, that's what we built Epok for. Send logs, get intelligence. The free tier has no expiration and includes all core detection features.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;&lt;a href="https://getepok.dev" rel="noopener noreferrer"&gt;Epok&lt;/a&gt; is a log intelligence engine with automatic anomaly detection. Free tier: 150 GB/month, no credit card.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>observability</category>
      <category>devops</category>
      <category>monitoring</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Why Your AWS Logging Bill Is Out of Control</title>
      <dc:creator>Sirisha Katta</dc:creator>
      <pubDate>Sun, 22 Mar 2026 17:32:03 +0000</pubDate>
      <link>https://dev.to/sirishakatta/why-your-aws-logging-bill-is-out-of-control-3feg</link>
      <guid>https://dev.to/sirishakatta/why-your-aws-logging-bill-is-out-of-control-3feg</guid>
      <description>&lt;p&gt;Every few months, someone on your team opens the AWS bill, scrolls to CloudWatch, and says something unprintable. The number is always higher than last month. Nobody can explain why.&lt;/p&gt;

&lt;p&gt;This keeps happening because CloudWatch doesn't have a price. It has a pricing spreadsheet. Ingestion is one rate. Storage is another. Every query costs money. Every dashboard widget costs money. Alarms cost money. Cross-region anything costs money. And the numbers change depending on your region.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where the money actually goes
&lt;/h2&gt;

&lt;p&gt;Let's say you're running a typical setup. Five services, each producing about 4 GB of logs per day. That's 20 GB/day, or about 600 GB/month. Sounds manageable. Here's what CloudWatch charges you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ingestion: $0.50/GB = $300/month&lt;/li&gt;
&lt;li&gt;Storage (after 5GB free): $0.03/GB/month, but data grows over time. With 30-day retention that's about $9/month. Not bad on its own.&lt;/li&gt;
&lt;li&gt;Queries via Logs Insights: $0.005 per GB scanned. Run 20 queries a day, each scanning a day's worth of logs (20 GB), and you're at $60/month.&lt;/li&gt;
&lt;li&gt;Dashboards: $3/month per dashboard after the first three. Most teams have 5-10.&lt;/li&gt;
&lt;li&gt;Alarms: $0.10 each for standard, $0.30 for high-resolution. 50 alarms = $5-15/month.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Add it up and you're somewhere between $380 and $405 per month. For 600 GB of logs. And you still had to build every dashboard and write every alarm rule yourself.&lt;/p&gt;

&lt;h2&gt;
  
  
  The real problem isn't the price per GB
&lt;/h2&gt;

&lt;p&gt;The billing model punishes you for actually using your logs. Every time a developer runs a query to debug a production issue, that's a billable event. Every time someone opens a dashboard, the widgets are scanning data. The more you use CloudWatch, the more it costs.&lt;/p&gt;

&lt;p&gt;This creates a weird incentive where teams avoid querying their logs because it's expensive. Which defeats the entire purpose of having logs in the first place.&lt;/p&gt;

&lt;p&gt;Some teams try to control costs by reducing log verbosity. They drop debug logs, then info logs, then start filtering out anything that isn't an error. By the time they're done, the logs are useless for debugging because all the context is gone.&lt;/p&gt;

&lt;h2&gt;
  
  
  What 600 GB/month looks like elsewhere
&lt;/h2&gt;

&lt;p&gt;Datadog charges $0.10/GB for ingestion, which sounds cheap until you realize they also charge $2.55 per million events for indexing. If your average log line is 500 bytes, 600 GB is about 1.2 billion events. That's $3,060/month just for indexing. Plus the $0.10/GB ingestion. So you're looking at $3,120/month. For the same 600 GB.&lt;/p&gt;

&lt;p&gt;Grafana Cloud charges $0.50/GB for logs, which puts 600 GB at $300/month. Better than Datadog, similar to CloudWatch, but you still have to build dashboards and write alert rules.&lt;/p&gt;

&lt;p&gt;With Epok, 600 GB/month is $49. Flat. That includes anomaly detection, new error fingerprinting, silence alerts, pattern clustering, AI root cause analysis, Slack and PagerDuty integration. No per-query fees. No per-dashboard fees. No surprises on the bill.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix isn't cheaper storage
&lt;/h2&gt;

&lt;p&gt;Cheaper storage helps, but the bigger issue is what you're paying for. With CloudWatch, you're paying for infrastructure. Disk space, compute cycles, API calls. You're renting a database and building the intelligence layer yourself.&lt;/p&gt;

&lt;p&gt;What you actually want is to know when something breaks. You want to know about new errors the moment they appear. You want to know when a service goes silent. You want to know when error rates spike after a deploy.&lt;/p&gt;

&lt;p&gt;That's what you should be paying for. Not the storage underneath it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What switching looks like
&lt;/h2&gt;

&lt;p&gt;If you're already running FluentBit, Vector, Promtail, or the OpenTelemetry Collector, switching is a config change. Point your log shipper at Epok instead of CloudWatch. Your existing log format works as-is.&lt;/p&gt;

&lt;p&gt;If you're using the CloudWatch agent directly, you can forward CloudWatch logs to Epok via a Lambda subscription filter. It takes about 15 minutes to set up and you can run both in parallel while you evaluate.&lt;/p&gt;
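
&lt;p&gt;If it helps to see the shape of that Lambda, here is a minimal sketch. The decode step follows the documented CloudWatch Logs subscription format (base64-encoded, gzip-compressed JSON); the forwarding call is a placeholder, since the actual intake URL and auth come from your Epok setup guide:&lt;/p&gt;

```python
import base64
import gzip
import json

def decode_subscription_event(event):
    """CloudWatch Logs subscriptions deliver a base64-encoded,
    gzip-compressed JSON payload under event['awslogs']['data']."""
    raw = base64.b64decode(event["awslogs"]["data"])
    payload = json.loads(gzip.decompress(raw))
    return payload["logGroup"], payload["logEvents"]

def handler(event, context):
    """Lambda entry point: decode the batch and hand each message
    to your shipper. forward_to_epok is a placeholder, not a real API."""
    log_group, log_events = decode_subscription_event(event)
    messages = [e["message"] for e in log_events]
    # forward_to_epok(log_group, messages)  # e.g. an HTTPS POST to the intake endpoint
    return {"forwarded": len(messages)}
```

&lt;p&gt;Attach this Lambda to a log group with a subscription filter and it receives batches as they arrive, so the parallel-evaluation setup is just the filter plus one small function.&lt;/p&gt;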

&lt;p&gt;The free tier gives you 150 GB/month with all core detection features. No credit card. No trial period. It's permanently free.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;&lt;a href="https://getepok.dev" rel="noopener noreferrer"&gt;Epok&lt;/a&gt; is a log intelligence engine with automatic anomaly detection. Free tier: 150 GB/month, no credit card.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>monitoring</category>
      <category>cloud</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
