Nebula

Originally published at furqan-8f96-42004.nebula.me

How I Stopped Chasing Production Errors Before My Coffee Got Cold

Sentry throws 200 errors at my team every day. Most of them are noise. The ones that matter arrive at 3am and nobody sees them until a customer complains.

I built an agent that triages them instead. It reads new Sentry issues, groups them by fingerprint, applies filtering rules, and only sends me the ones that actually need attention.

Here is how it works, the day it failed, and what I learned about building alerting systems that agents actually maintain.

The Problem

Our Sentry project tracks errors across the Nebula backend, web frontend, and CLI. In any given week we get 500+ issues. Maybe 20 of them are new bugs that need fixing. The rest are transient failures, known patterns, and false alarms.

The old workflow: check Sentry when someone pings Slack about an error. Spend 15 minutes figuring out if it is new or already tracked. If new, create a GitHub issue. If old, close it as duplicate. Repeat.

This is exactly the kind of mechanical work an agent should do.

The Agent

sentry-error-monitor connects to Sentry through its REST API. Every 15 minutes it fetches unresolved issues and runs them through a three-level filter:

Level 1: Known issue matching. Every issue has a fingerprint (Sentry generates this from the stack trace). The agent maintains a list of known fingerprints. If an issue matches a known one, it checks the severity. If severity has not changed, skip.

Level 2: Pattern filtering. The agent looks for noise patterns: timeout errors during known maintenance windows, rate limit spikes from API abuse, connection pool exhaustion that resolves within 5 minutes.

Level 3: Escalation rules. If a new issue is RED (high volume, user-facing), post to Slack immediately. If YELLOW (new, low volume), add to the daily digest. If GREEN (noise), skip and log.

# Simplified triage logic
async def triage_issue(issue_id: str) -> TriageResult:
    issue = await sentry.get_issue(issue_id)
    fingerprint = issue.fingerprint

    if fingerprint in known_issues:
        if not severity_changed(issue):
            return TriageResult.SKIP

    if is_noise_pattern(issue):
        return TriageResult.SUPPRESS

    if is_user_facing(issue) and issue.count > threshold:
        return TriageResult.ESCALATE

    return TriageResult.DIGEST
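
For illustration, here is roughly what the Level 2 helper called above could look like. This is a sketch: the issue fields, maintenance windows, string markers, and five-minute recovery rule are placeholders, not the agent's real configuration.

# Hypothetical sketch of the Level 2 noise filter; values are illustrative
from datetime import datetime, time, timezone

MAINTENANCE_WINDOWS = [(time(3, 0), time(3, 30))]            # UTC, placeholder
RATE_LIMIT_MARKERS = ("429", "rate limit", "Too Many Requests")
POOL_RECOVERY_SECONDS = 5 * 60

def is_noise_pattern(issue) -> bool:
    # `issue.title` and `issue.last_seen` (aware datetime) are assumed fields
    now = datetime.now(timezone.utc)

    # Timeout errors that fall inside a known maintenance window
    if "timeout" in issue.title.lower():
        if any(start <= now.time() <= end for start, end in MAINTENANCE_WINDOWS):
            return True

    # Rate limit spikes from API abuse rather than a code change
    if any(marker in issue.title for marker in RATE_LIMIT_MARKERS):
        return True

    # Connection pool exhaustion that already resolved itself:
    # the issue has been quiet for more than five minutes
    if "connection pool" in issue.title.lower():
        if (now - issue.last_seen).total_seconds() > POOL_RECOVERY_SECONDS:
            return True

    return False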

The agent also checks the Nebula platform codebase (nebula, nebula-web, nebula-cli) to add context when reporting. If an error references thirdweb-dev/nebula commit history, it pulls the relevant PRs and links them.
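
A minimal sketch of that commit lookup, assuming a GITHUB_TOKEN in the environment and using the GitHub commits API; the helper name and returned fields are mine, not the agent's actual code.

# Hypothetical sketch: recent commits that touched the file an error points at
import os
import requests

GITHUB_API = "https://api.github.com"

def recent_commits_for_file(repo: str, path: str, limit: int = 5) -> list[dict]:
    """Return the last few commits that touched `path` in `repo` (owner/name form)."""
    resp = requests.get(
        f"{GITHUB_API}/repos/{repo}/commits",
        params={"path": path, "per_page": limit},
        headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"},
        timeout=10,
    )
    resp.raise_for_status()
    return [
        {"sha": c["sha"][:7], "message": c["commit"]["message"].splitlines()[0]}
        for c in resp.json()
    ]

# Example: recent_commits_for_file("thirdweb-dev/nebula", "worker.py")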

The Day It Failed

Day 6. A real production failure happened. Database connection pool exhaustion across all three repos. Hundreds of "Connection refused" errors started appearing in Sentry at 9:14 AM PT.

The agent did nothing.

Its fingerprint matching was too aggressive. The Connection refused errors had been seen before (sporadic, one-off failures during deploys). The agent matched the fingerprint, found it in the known issues list, and classified it as "already tracked." It skipped every single one.

We found out 45 minutes later when a customer reported failed deployments.

The Fix

The fix was not "use AI better." The fix was to add a volume check to the fingerprint matching:

if fingerprint in known_issues:
    # Only skip if volume is within normal range for this pattern
    if issue.count < known_issues[fingerprint].max_normal_count:
        return TriageResult.SKIP
    # Volume spike means something changed — escalate
    return TriageResult.ESCALATE

A known issue that has 5 errors per day is not the same issue that has 500 errors in 10 minutes. The fingerprint does not change. The significance does.

This seems obvious in retrospect. But it took a real outage to expose it. The agent had not seen enough volume data to know what "normal" looks like.
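
Giving "normal" a concrete definition is the follow-up. Here is a sketch of how max_normal_count could be derived, assuming the agent keeps a history of per-window error counts for each fingerprint; the percentile and safety margin are illustrative choices, not the production values.

# Hypothetical sketch: derive max_normal_count from historical volume
import statistics

def max_normal_count(history: list[int], margin: float = 3.0) -> int:
    """history: error counts per 15-minute window for one fingerprint."""
    if len(history) < 8:
        # Not enough data to call anything "normal", so never auto-skip
        return 0
    p95 = statistics.quantiles(history, n=20)[18]   # ~95th percentile
    return int(p95 * margin)

# A pattern that normally peaks at 5 errors per window gets a ceiling around 15.
# 500 errors in 10 minutes blows straight past it and escalates.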

How Fast Does It Run

After the fix, the agent triages an issue within about 90 seconds of picking it up. In the best case that looks like this:

  • New issue appears in Sentry at 2:14pm
  • Agent's next 15-minute poll fetches it at 2:15pm
  • Agent triages and posts to Slack by 2:15:30
  • Total latency: about 90 seconds

In the worst case an issue sits until the next poll, so end-to-end latency can approach 15 minutes. That is not instant, but it is fast enough that errors reach us before a customer reports them.

What It Catches Now

In a typical week, the agent processes 500+ Sentry issues and surfaces about 15 of them:

  • 3-5 genuine new bugs
  • 2-4 regression patterns
  • 5-8 known issues with increased severity

The remaining 480+ are suppressed as noise or already resolved.

Each triaged issue includes the file it originated from, the recent commits that touched it, and a severity assessment. The Slack message looks like this:

[RED] 24 new Connection refused errors in worker.py
- Sentry issue: SENTRY-1234
- Last touched in: PR #3401 (merged 2 hours ago)
- Commit: 7f52b88 - ref("sentry"): add error grouping rules
- Severity: HIGH - errors are user-facing
- Link: https://sentry.io/.../issues/1234/
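
Formatting and posting that message is the simple part. A sketch using a Slack incoming webhook; the SLACK_WEBHOOK_URL variable and the message fields are placeholders, not the agent's exact payload.

# Hypothetical sketch: post a triage summary to Slack via an incoming webhook
import os
import requests

def post_to_slack(severity: str, title: str, details: list[str]) -> None:
    text = f"[{severity}] {title}\n" + "\n".join(f"- {d}" for d in details)
    resp = requests.post(
        os.environ["SLACK_WEBHOOK_URL"],   # placeholder env var
        json={"text": text},
        timeout=10,
    )
    resp.raise_for_status()

# Example: post_to_slack("RED", "24 new Connection refused errors in worker.py",
#                        ["Sentry issue: SENTRY-1234", "Severity: HIGH"])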

What Still Goes Wrong

The agent still has two blind spots:

1. Cross-issue correlation. If three different issues are actually the same root cause (a bad database migration), the agent sees them as three separate problems. It does not yet group by root cause.

2. Slow-growing issues. An issue that starts at 1 error per day and climbs to 100 over two weeks gets classified as "low priority" the whole time, until it suddenly breaches the threshold. There is no trend analysis yet (one possible approach is sketched below).

Both are on the backlog. The agent is good enough for now but not complete.
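
Neither fix exists yet. For the slow-growth case, one possible approach is to fit a simple linear trend to each issue's daily counts and flag sustained growth. The code below is a sketch with made-up thresholds, not the agent's implementation.

# Hypothetical sketch: flag issues whose volume climbs steadily
def is_trending_up(daily_counts: list[int], min_days: int = 7, min_slope: float = 2.0) -> bool:
    """Return True when an issue's daily error count grows steadily, even if still low."""
    if len(daily_counts) < min_days:
        return False
    n = len(daily_counts)
    x_mean = (n - 1) / 2
    y_mean = sum(daily_counts) / n
    # Least-squares slope: extra errors per day, on average
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(daily_counts))
    den = sum((x - x_mean) ** 2 for x in range(n))
    slope = num / den
    return slope >= min_slope

# An issue going 1, 3, 8, 15, 30, 55, 90 errors per day trips this long before
# it crosses an absolute volume threshold.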

The Real Metric

Before the agent: engineers spent about 2 hours per week triaging Sentry issues. That is roughly 104 hours per year.

After the agent: engineers spend about 15 minutes per week reviewing the digest. That is roughly 13 hours per year.

The agent itself costs about $0.15 per run, or roughly $130 per month running every 15 minutes.

The math is straightforward: roughly 90 hours of engineer time saved per year against $1,560 in agent costs. The exact value depends on your engineer rate, but for us a single day of saved engineer time covers a year of agent costs.
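
Spelled out, assuming a 52-week year and the $130/month figure above:

# Back-of-the-envelope arithmetic behind the numbers above
hours_before = 2 * 52            # 104 hours of triage per year
hours_after = 0.25 * 52          # 13 hours of digest review per year
hours_saved = hours_before - hours_after   # ~91 hours
agent_cost = 130 * 12            # $1,560 per year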

Takeaways

  1. Fingerprint matching is necessary but insufficient. Always add volume thresholds. A known pattern at 100x volume is a new problem.
  2. The agent needs to be wrong before it gets right. The day-6 failure taught me more than two weeks of testing. Production data is the only real eval.
  3. Latency measured in minutes is acceptable for error triage. A tighter polling loop means more API calls and more cost. The 15-minute cycle is a good tradeoff.
  4. Cross-issue correlation is the next frontier. Sentry groups by stack trace, not by root cause. An agent that can read between issues would be significantly more useful.
  5. The agent is not a replacement for monitoring. It is a filter. The monitoring still needs to exist, and humans still need to review the output.

The agent is still running on the sentry-error-monitor trigger. Every 15 minutes, reading, filtering, reporting. My coffee stays hot now.
