The same root cause keeps coming back because nobody tracks it. I built a zero-dep CLI that does.

#showdev #opensource #cli #devops

You write the postmortem. You file the action items. Everyone nods, the doc gets archived, and life moves on.

Six months later, the exact same root cause takes down the exact same service — and nobody in the room remembers the first incident, let alone that its fix never actually shipped.

"We use rootly to track this automatically. It flags when incidents have the same root cause as previous ones."

That's a real answer from an SRE thread about this exact problem — and it's a paid, hosted feature of a full incident-management platform. Most teams don't have rootly or incident.io. What they have is a folder of markdown postmortems that nobody diffs against each other.

So I built rootecho: a zero-dependency CLI that does the one thing those platforms do for this problem. It flags when a new incident's root cause echoes a past one, and shows you whether that past incident's action items ever actually got finished.

How it works

Each postmortem is one JSON record — free-text root_cause and/or curated root_cause_tags, plus action_items with a status:

{
  "id": "INC-2026-014",
  "title": "Payment webhook retries exhausted",
  "root_cause": "webhook retry queue misconfigured to drop after 3 attempts, no dead-letter fallback",
  "root_cause_tags": ["webhook", "retry-queue", "dead-letter", "config"],
  "action_items": [
    { "id": "AI-1", "description": "Add dead-letter queue for webhook retries", "owner": "alice", "status": "open" }
  ]
}

rootecho add records it and compares against your history:

$ rootecho add inc-2026-014.json
⚠ root cause echo detected for "INC-2026-014":

  INC-2026-003 (2026-03-15) — 100% similar root cause
  Payment webhook retries exhausted
    ✓ Add retry backoff [done]
    ✗ Add monitoring alert for queue depth [open] — 93d overdue
  → 1 action item(s) from this past incident were never finished.

recorded to .rootecho/history.jsonl

That's the whole point of the tool in one output: not just "you've seen this before," but "and here's the fix that never happened."
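The "never finished" count in that output is easy to picture as a filter over the record format shown earlier. This is a minimal sketch, not rootecho's actual source: the function name is hypothetical, and it assumes anything whose status is not "done" counts as unfinished (the post only shows "open" and "done").

```python
def unfinished_items(incident):
    """Return the action items of a postmortem record that are not done.

    Assumption: any status other than "done" means the item is unfinished.
    """
    return [
        item
        for item in incident.get("action_items", [])
        if item.get("status") != "done"
    ]


# The past incident from the sample output; the item ids are made up.
past = {
    "id": "INC-2026-003",
    "action_items": [
        {"id": "AI-1", "description": "Add retry backoff", "status": "done"},
        {"id": "AI-2", "description": "Add monitoring alert for queue depth", "status": "open"},
    ],
}

for item in unfinished_items(past):
    print(f"{item['description']} [{item['status']}]")
```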

rootecho check runs the same comparison without recording anything and exits with code 1 on an echo, so you can wire it into CI as a gate on the PR that closes out a postmortem.
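If your CI step is a script rather than a bare shell line, the gate could look something like the sketch below. It is only a sketch under assumptions: it assumes rootecho is on the PATH and that check takes the postmortem file as its argument the way add does, and the gate function and its return labels are hypothetical. The post only documents exit code 1 for an echo, so treating 0 as clean and anything else as an error is a guess.

```python
import subprocess


def gate(postmortem_path, command=("rootecho", "check")):
    """Run the check command and classify its exit code.

    Returns "clean", "echo", or "error". Only exit code 1 (echo) comes
    from the post; the other two mappings are assumptions.
    """
    result = subprocess.run([*command, postmortem_path])
    if result.returncode == 0:
        return "clean"
    if result.returncode == 1:
        return "echo"
    return "error"
```

A CI job would call gate("incident.json") and fail the build on anything other than "clean".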

Why not just grep the old postmortems?

Because "webhook retry queue misconfigured" and "retry queue drops webhooks after repeated failures" are the same root cause in different words, and grep doesn't know that. rootecho scores similarity with Jaccard overlap (shared items divided by all distinct items across both sets): tags first (curated, low-noise, weighted 70%), free text as a fallback/secondary signal (30%). There is no ML dependency and no network call, and it runs in milliseconds on a folder of JSON files.
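Here is roughly what that scoring could look like. Treat it as a sketch under assumptions rather than rootecho's real code: the 70/30 weights come from the post, but the tokenizer (lowercase, split on non-alphanumerics, no stemming), the function names, and the rule "use text only when either side has no tags" are illustrative guesses.

```python
import re

TAG_WEIGHT = 0.7   # from the post: tags weighted 70%
TEXT_WEIGHT = 0.3  # from the post: free text weighted 30%


def jaccard(a, b):
    """Jaccard overlap |A & B| / |A | B|; 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def tokenize(text):
    """Assumed tokenizer: lowercase words and numbers, no stemming."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def similarity(new, past):
    """Blend tag overlap and free-text overlap between two records."""
    text_score = jaccard(
        tokenize(new.get("root_cause") or ""),
        tokenize(past.get("root_cause") or ""),
    )
    new_tags = set(new.get("root_cause_tags") or [])
    past_tags = set(past.get("root_cause_tags") or [])
    # Assumption: with no tags on either side, fall back to text alone.
    if not new_tags or not past_tags:
        return text_score
    return TAG_WEIGHT * jaccard(new_tags, past_tags) + TEXT_WEIGHT * text_score


a = {"root_cause": "webhook retry queue misconfigured"}
b = {"root_cause": "retry queue drops webhooks after repeated failures"}
print(round(similarity(a, b), 3))  # 2 shared tokens out of 9 distinct
```

With this naive tokenizer, "webhook" and "webhooks" do not match, which is one reason curated tags make the stronger signal.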

Design notes for the technical reader

Storage is project-local, not ~/.rootecho. History lives in .rootecho/history.jsonl in your repo, one JSON object per line, meant to be committed — so git blame/git log on that file doubles as an incident timeline the whole team shares, and git diff on it during a PR review shows exactly what changed.
Zero dependencies, dual-language. A Node build (npx rootecho) and a Python build (pip install rootecho) exist because teams aren't single-language, and they read and write the exact same history file, down to byte-identical --json output. That took more care than I expected: Python's json.dumps escapes non-ASCII by default and prints whole-number floats as 1.0 where JavaScript prints 1, and JS's Date.parse and Python's datetime.fromisoformat accept different date grammars (the latter also varies by Python version). All ironed out: timestamps are epoch milliseconds on both sides, and dates go through a hand-rolled strict ISO 8601 parser, shared in spirit between both implementations, instead of trusting either language's built-in leniency.
Corrupt data degrades, it doesn't crash. A JSONL file meant to be hand-edited and merged by a team will occasionally get a stray null line from a bad merge. Both CLIs skip malformed entries instead of taking down add/check/list for the whole team.
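Two of those notes can be sketched in a few lines of Python. This is an illustration under assumptions, not the shipped code: load_history and to_canonical_json are hypothetical names, and skipping any line that is not a JSON object is one plausible reading of "skip malformed entries".

```python
import json
from pathlib import Path


def load_history(path):
    """Read a JSONL history file, skipping lines that are not JSON objects."""
    records = []
    p = Path(path)
    if not p.exists():
        return records
    with p.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue  # e.g. a leftover merge-conflict marker
            if isinstance(entry, dict):  # drops a stray null line
                records.append(entry)
    return records


def _normalize(value):
    """Turn whole-number floats into ints so 1.0 serializes as 1."""
    if isinstance(value, float) and value.is_integer():
        return int(value)
    if isinstance(value, dict):
        return {k: _normalize(v) for k, v in value.items()}
    if isinstance(value, list):
        return [_normalize(v) for v in value]
    return value


def to_canonical_json(obj):
    """Compact JSON with raw non-ASCII, in the style of JSON.stringify."""
    return json.dumps(_normalize(obj), ensure_ascii=False, separators=(",", ":"))


print(to_canonical_json({"id": "INC-1", "similarity": 1.0, "title": "café"}))
# prints {"id":"INC-1","similarity":1,"title":"café"}
```

The separators argument matters as much as ensure_ascii here: json.dumps puts a space after each comma and colon by default, which JavaScript's JSON.stringify does not.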

Install

npx rootecho init incident.json      # scaffold a postmortem
npx rootecho add incident.json       # record it, flag any echo

pip install rootecho                 # or the Python build

MIT licensed, ~500 lines total, no dependencies in either language.

npm: https://www.npmjs.com/package/rootecho
PyPI: https://pypi.org/project/rootecho/
GitHub (Node): https://github.com/jjdoor/rootecho
GitHub (Python): https://github.com/jjdoor/rootecho-py

Does your team already have a home-grown way of catching repeat root causes — a spreadsheet, a Slack bot, a wiki convention? I'd like to hear what's actually working (or not) before I add anything past this MVP.

Top comments (1)

Aldo • Jul 14

The recurring 'fix' that isn't really a fix is such a familiar scenario in production SaaS environments. I recall one particularly stubborn issue where a specific database deadlocking pattern would periodically manifest, causing brief but impactful outages. We'd patch the immediate symptoms, write up the incident, and then six weeks later, like clockwork, it'd be back. The post-mortem documents clearly outlined the structural changes needed, but the sheer inertia of our sprint cycles often meant those larger, preventative fixes got pushed down the backlog.

Part of the problem, I think, is that addressing true root causes often requires cross-team effort or significant architectural changes that don't fit neatly into a single team's sprint. It's easy for these 'meta-issues' to become everyone's problem and thus nobody's priority. The pressure to ship new features often overshadows the less visible but equally critical work of preventing future incidents. We've tried various systems, from dedicated Jira epics to shared Notion boards, but the overhead of maintaining them sometimes outweighs the perceived short-term benefit, especially when things are 'quiet'.

That's where a lightweight, zero-dependency CLI like yours could really shine. The friction of adopting a new enterprise-grade incident management system can be a huge blocker, but a simple tool that integrates into a developer's existing workflow – maybe even tied into source control hooks or a daily standup ritual – has a much higher chance of actually being used consistently. The key, in my experience, is making the tracking process itself as low-effort as possible, so the mental load is on solving the issue, not reporting it.