Niketa Sharma
That Weekend Incident Bot? It Costs $233K

Your senior engineer says "give me a weekend with Claude and Cursor, I'll have a working incident bot by Monday."

Copilot writes the paging logic. The escalation state machine practically builds itself. Status page? Templated in an afternoon.

They're right about v1. They're wrong about the next three years.

I build incident management software, so I sat down and did the math on three-year total cost for a 20-person engineering team. The numbers were worse than I expected.

| Path | 3-Year TCO |
| --- | --- |
| Build from scratch | $233K-$395K |
| Open source (self-host) | $99K-$360K |
| Buy commercial | $11K-$83K |

Building costs 3-8x more than buying. Most of that gap is engineer time, not infrastructure bills.

What AI actually changed

Two years ago, building a basic incident system took 2-4 weeks. With AI coding tools, that's down to days. A weekend if you push it.

The scaffolding part is real. Slack bot setup, status page templates, database schemas, escalation logic. AI chews through boilerplate fast.

But Slack retires APIs whether you used AI to write the integration or not. The legacy file upload method was sunset in Nov 2025. Legacy custom bots were discontinued in Mar 2025. AI helps you migrate faster. It doesn't stop the deprecations.

Phone and SMS paging is a carrier deliverability problem. International routing is its own discipline. No prompt fixes that.

The engineer who leaves is still the biggest risk. Nobody else knows why the Slack workaround exists or what half the edge case handling is even doing. AI wrote the code, sure. But the context walked out the door with the person who prompted it.

And SOC2 auditors don't care that Claude wrote your audit log. They care that it's complete, immutable, and retained for the right duration.

So AI saved maybe $10K-$15K on the initial build. On a $233K+ three-year total, that barely moves the needle. The build was always the cheap part.

How the forever-project happens

SREs call these forever-projects. I've watched the pattern play out multiple times now.

First few months are fine. Thing works, people use it, the engineer who built it feels good about the decision. Then around month four, Slack changes a permission model, or rate limits bite during a real incident, or someone new joins and asks "why does it work this way?" and nobody can explain it without reading the source.

Somewhere between month seven and month twelve, the builder changes roles or leaves. Now you've got a codebase nobody else has touched, doing something everyone depends on, and nobody wants to be the one to modify it.

By year two it's got real debt. The Slack integration is two API versions behind. There's a workaround for a bug that nobody documented. Someone suggests rewriting it, someone else says "let's just buy something," and both options feel expensive because they are.

The actual numbers

A senior engineer fully loaded in the US runs $250K-$400K/year. At 25% time on your incident tool, that's $62K-$100K a year. For one internal tool that pages people and makes Slack channels.

Building it yourself

| Cost | Year 1 | Year 2 | Year 3 | Total |
| --- | --- | --- | --- | --- |
| Initial build (AI-assisted, 1-2 weeks) | $8K-$15K | - | - | $8K-$15K |
| Maintainer (25% time) | $62K-$100K | $62K-$100K | $62K-$100K | $186K-$300K |
| Infrastructure and hosting | $3K-$10K | $3K-$10K | $3K-$10K | $9K-$30K |
| Rebuilds and migrations | - | $30K-$50K | - | $30K-$50K |
| **Total** | $73K-$125K | $95K-$160K | $65K-$110K | $233K-$395K |

Buying it

| Cost | Year 1 | Year 2 | Year 3 | Total |
| --- | --- | --- | --- | --- |
| Responder-based pricing (10-15 users x $15-30/mo) | $2K-$5K | $2K-$5K | $2K-$5K | $5K-$16K |
| Onboarding and setup | $3K-$8K | - | - | $3K-$8K |
| **Total** | $5K-$13K | $2K-$5K | $2K-$5K | $11K-$27K |

Enterprise per-seat pricing (PagerDuty-style) runs higher, $35K-$83K over three years. Still way less than building.

That weekend hack? 3-6% of the three-year number.

Run the math yourself

```
Inputs:
  EngCost     = Fully-loaded eng cost/year (default: $300K)
  BuildWeeks  = Initial build time in weeks (default: 1-2)
  FTE         = Maintainer allocation (default: 0.25)
  Vendor      = Vendor $/user/month (default: $15-100)
  Users       = On-call responders (default: 10-15)
  Infra       = Hosting/monitoring per year (default: $5K)
  Rebuild     = Migration allowance over 3 years (default: $30K)
  Onboarding  = One-time setup cost (default: $3K-$8K)

Formulas:
  Build 3-yr TCO = (EngCost/52 x BuildWeeks) + (EngCost x FTE x 3) + (Infra x 3) + Rebuild
  Buy 3-yr TCO   = (Vendor x Users x 12 x 3) + Onboarding
```

If your maintainer is 0.1 FTE instead of 0.25, the maintainer line drops from $62K-$100K to $25K-$40K a year. If you dodge all rebuilds, subtract another $30K-$50K. The gap narrows. It doesn't close.
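If you'd rather run the formulas as code, here's a minimal sketch. Defaults match the article's; the $20/user/month and 12-responder figures on the buy side are my own midpoints picked from the stated ranges.

```python
def build_tco(eng_cost=300_000, build_weeks=1.5, fte=0.25,
              infra=5_000, rebuild=30_000, years=3):
    """Three-year build TCO: initial build + maintainer + infra + rebuilds."""
    initial = eng_cost / 52 * build_weeks    # 1-2 weeks of eng time
    maintainer = eng_cost * fte * years      # the ongoing ownership cost
    return initial + maintainer + infra * years + rebuild

def buy_tco(per_user_month=20, users=12, onboarding=5_000, years=3):
    """Three-year buy TCO: responder-based seats + one-time onboarding."""
    return per_user_month * users * 12 * years + onboarding

print(f"Build: ${build_tco():,.0f}")   # defaults land near $279K
print(f"Buy:   ${buy_tco():,.0f}")     # defaults land near $14K
# Friendliest build case: lean maintainer, zero rebuilds. Still far above buying.
print(f"Lean build: ${build_tco(fte=0.10, rebuild=0):,.0f}")
```

Tweak the defaults to your team's numbers; the point of the exercise is that the maintainer term dominates everything else.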

Open source options got thin

Netflix archived Dispatch in September 2025. It was the best self-hosted option for years. Read-only forever now. Netflix had hundreds of engineers behind it and still walked away.

Grafana closed-sourced OnCall. The open version entered maintenance mode in March 2025 and gets fully archived March 2026. SMS, phone, and push stop working after that.

Two of the biggest open source incident tools either archived or went closed-source in the same year.

What's left:

  • Incidental: Slack integration, status pages. Most capable open option remaining, but still early (v0.1.0).
  • incident-bot: Slack-based, Python/PostgreSQL. Integrates with PagerDuty, Jira, Confluence.
  • IncidentFox: AI-powered SRE platform. Core is Apache 2.0, but the production security layer is BSL 1.1, so you need a commercial license for production.

Both Incidental and incident-bot are MIT. Both are much smaller projects than what Dispatch and Grafana OnCall were.

Over three years, self-hosting runs $99K-$360K depending on how much maintainer time you actually spend. The full breakdown is in the longer version of this article.

Same infra, same failure

During a P0, when the database is on fire and your CEO is watching the Slack channel, your incident tool has to work.

But most teams host their custom incident tooling alongside their product. Same infrastructure, same deploy pipeline, sometimes the same database. Product goes down, incident tool goes down with it. If it uses company SSO, you're locked out of your response system when the identity provider is part of the outage.

Your incident tool needs to be on separate infrastructure from your app, with its own monitoring. Most vendors already do this because their business depends on being up when your stuff isn't.
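One cheap way to keep yourself honest about that separation is an external health probe that runs from a host your product's infra and SSO don't touch. A minimal sketch, assuming a hypothetical health endpoint; real monitoring would also alert through an independent channel:

```python
import urllib.request

def probe(url, timeout=5):
    """Return True if the incident tool's health endpoint answers 200.
    Run this from infrastructure your product does not share."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # DNS failure, refused connection, timeout
        return False

# Hypothetical endpoint:
# probe("https://incidents.example-vendor.com/healthz")
```

If this probe and your product's alerting share a failure domain, you haven't separated anything.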

The policy questions nobody plans for

Once you have an incident system, people start asking things you didn't think about. Who can declare incidents? Who can close them? How long do you keep records? Can you export them for an audit? What's the access control model?

It's annoying how fast an internal tool turns into a policy surface. V1 is cheap. Then RBAC, retention policies, and compliance asks start piling up, and that's where the hours go.

I talked to a 60-person fintech that spent about $80K building an incident system. It worked for 18 months. Then Slack platform changes and internal security policy changes hit at the same time. The engineer who built it had left. They spent $40K rewriting it. Six months later, compliance asked for audit trails. Another $30K.

When building is the right call

There are teams that should build. I don't think that's most teams, but they exist.

If you have regulatory constraints no vendor can meet (specific data residency, mandated audit log formats, custom approval workflows tied to proprietary systems), then building makes sense. If you're 500+ engineers with complex multi-team processes, off-the-shelf might not fit, though at that scale you've got a dedicated internal tooling team anyway.

One case that worked: an 80-person fintech needed EU data residency for specific customers, custom approval workflows tied to their fraud detection system, regulator-mandated audit log formats, and integrations with internal systems no vendor supported. Three years later, still maintained by 0.3 FTE. Total cost was about $250K-$300K versus maybe $200K-$270K if they'd bought and built the custom bits on top. They'd build again.

What made it work is that their requirements were actual regulatory constraints, not just preferences. I hear "our situation is unique" a lot. Usually it isn't.

When to buy

If three or more of these apply to you, you're probably past the point where a custom build makes sense:

  • On-call rotation has 8+ people
  • You're handling 4+ incidents per month
  • 3+ teams regularly involved in response
  • Customer-facing SLAs or enterprise customers asking about your incident process
  • Compliance requirements (audit logs, retention, RBAC)
  • You need stakeholder updates within 10-15 minutes, reliably
  • Your current ad-hoc system already failed during a real incident

What I'd actually recommend

For teams between 20 and 200 people: buy or self-host the core workflow. Use AI to write the glue code between your incident tool and whatever internal systems you need it talking to. That's where the coding tools save you time.

Start with paging and escalation that works on phone and SMS, and timeline capture so you have a record of what happened. Status page, analytics, post-incident review, add those within six months once you know what you care about.

You can build it. Obviously. I just wouldn't want to be the one maintaining it three years from now.


I wrote a longer version of this with more TCO scenarios and migration notes: Build, Open Source, or Buy Incident Management in 2026

Disclosure: We're building Runframe, incident management for engineering teams. I included open source options and noted when building is the right call. If I got the math wrong, tell me in the comments.
