<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Dave Rochwerger</title>
    <description>The latest articles on DEV Community by Dave Rochwerger (@levelup-engineers).</description>
    <link>https://dev.to/levelup-engineers</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3577929%2Fa0bf0dc7-1c3f-4daa-9cbd-a2786fbf0128.jpg</url>
      <title>DEV Community: Dave Rochwerger</title>
      <link>https://dev.to/levelup-engineers</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/levelup-engineers"/>
    <language>en</language>
    <item>
      <title>How You Can Build Incident Management Inside Jira — for Free</title>
      <dc:creator>Dave Rochwerger</dc:creator>
      <pubDate>Tue, 02 Dec 2025 00:38:10 +0000</pubDate>
      <link>https://dev.to/levelup-engineers/how-you-can-build-incident-management-inside-jira-for-free-k7b</link>
      <guid>https://dev.to/levelup-engineers/how-you-can-build-incident-management-inside-jira-for-free-k7b</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Originally published on the &lt;a href="https://phoenixincidents.com/blog/how-you-can-build-incident-management-in-jira-for-free" rel="noopener noreferrer"&gt;Phoenix Incidents Blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;🧐 This post is a high-level walk-through.&lt;br&gt;
The full step-by-step guide is available from the original post here:&lt;br&gt;
&lt;a href="https://phoenixincidents.com/blog/how-you-can-build-incident-management-in-jira-for-free?utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=incident_jira_free_guide&amp;amp;utm_content=top_redirect" rel="noopener noreferrer"&gt;How You Can Build Incident Management in Jira for free&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  What you’ll get
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Complete details of Jira setup
&lt;/li&gt;
&lt;li&gt;Detailed, deploy-ready Jira automations&lt;/li&gt;
&lt;li&gt;Clear Incident lifecycle management and status tracking&lt;/li&gt;
&lt;li&gt;RCA templates automatically created in Confluence and linked back to the incident&lt;/li&gt;
&lt;li&gt;Slack notifications and workflow reminders&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Years before we built Phoenix Incidents, I built it by hand inside Jira. Twice.&lt;/p&gt;

&lt;p&gt;At two different companies, years apart, I went looking for an incident management app in the Atlassian Marketplace. There wasn’t one. There were hundreds of ITSM add-ons and connectors to external incident tools, but nothing that actually ran incident response end-to-end inside Jira — without adding another system to learn.&lt;/p&gt;

&lt;p&gt;So I built it.&lt;/p&gt;

&lt;p&gt;The idea felt obvious: incidents are simply the most urgent kind of work. They sit alongside bugs, stories, and tasks, so pulling them into a separate tool never made sense. You lose the context — linked issues, ownership, visibility &amp;amp; traceability. Alerts can live elsewhere; they’re the signal. But once an alert becomes an incident, it deserves to live where work happens, and for many teams, that’s Jira.&lt;/p&gt;

&lt;p&gt;What followed was a surprisingly complete system — one you can still recreate today — and a lesson in where Jira shines, and where it needs help.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Why I Did It
&lt;/h2&gt;

&lt;p&gt;We already had all the tools: Jira for tracking, Slack for comms, Splunk On-Call or PagerDuty for paging (depending on the company). It just felt absurd that these couldn’t work together out of the box to build a solid, modern incident management system for software engineering teams.&lt;/p&gt;

&lt;p&gt;So I wired them up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Jira workflows&lt;/strong&gt; to model the full incident lifecycle.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Jira Automation rules&lt;/strong&gt; that sent Slack alerts when incidents were created or transitioned, plus reminders to keep the team on track with our SLAs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Outgoing webhooks&lt;/strong&gt; that posted to our paging system to page the right team.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Confluence templates&lt;/strong&gt; for RCAs that auto-linked back to the incident.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Linked issues&lt;/strong&gt; for action items that, when all resolved, automatically closed the parent incident.&lt;/li&gt;
&lt;/ul&gt;
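The paging piece of that wiring can be sketched in a few lines. This is a rough illustration only, assuming PagerDuty's Events API v2 as the paging backend; the routing key and field values are placeholders, and in practice this logic lived inside a Jira Automation "send web request" action rather than standalone code:

```python
import json
import urllib.request

# PagerDuty Events API v2 endpoint (other paging systems expose similar hooks).
PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

def build_page_payload(routing_key, summary, severity="critical", source="jira"):
    """Build a trigger event for the on-call team's service."""
    return {
        "routing_key": routing_key,   # identifies which service/team to page
        "event_action": "trigger",
        "payload": {"summary": summary, "severity": severity, "source": source},
    }

def page_team(routing_key, summary):
    """POST the trigger event and return the HTTP status code."""
    body = json.dumps(build_page_payload(routing_key, summary)).encode("utf-8")
    req = urllib.request.Request(
        PAGERDUTY_EVENTS_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

The important part is the payload shape: the incident summary and severity come straight from the Jira issue fields, so the page always carries enough context to act on.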

&lt;p&gt;It worked. It was scrappy, but the system handled hundreds of real incidents over the years. We shipped faster, closed loops more consistently, and the automation kept things moving without constant babysitting. One of the earliest wins was transparency: the system kept our customer-facing teams in the loop, and we saw increased accountability from the engineering teams.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. What I Built — and Why It Worked
&lt;/h2&gt;

&lt;p&gt;The setup wasn’t fancy, but it got the job done:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Slack automation.&lt;/strong&gt; Every incident created or transitioned sent messages to two channels — a general incident feed and a team-specific one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-closure logic.&lt;/strong&gt; When all linked “action item” issues were marked done, the parent incident automatically resolved.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RCA enforcement.&lt;/strong&gt; If someone tried to close an incident without critical data or completed tasks, we used workflow conditions where possible, and Jira automation to reopen the issue and comment where native conditions were limited.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Confluence integration.&lt;/strong&gt; A Confluence template pre-populated fields from the Jira issue so postmortems were automatically linked.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Daily RCA reminders.&lt;/strong&gt; Jira automation pinged assignees when RCAs weren’t completed by their target date.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Escalation reminders.&lt;/strong&gt; If an engineer hadn’t acknowledged, verified, or sent an update, automation rules fired timed Slack nudges until they did.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reporting &amp;amp; charts.&lt;/strong&gt; We used an off-the-shelf charting app from the Atlassian Marketplace. With significant setup, we had all the reporting needed for quarterly business reviews.&lt;/li&gt;
&lt;/ul&gt;
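The Slack automation above boils down to posting a formatted message to one incoming webhook per channel, once for the general feed and once for the owning team. A minimal sketch (the webhook URLs and message format here are placeholders, not the exact rules we ran):

```python
import json
import urllib.request

def incident_message(key, summary, status):
    """Format an incident create/transition notification for Slack."""
    return {"text": f":rotating_light: {key}: {summary} (status: {status})"}

def notify_channels(webhook_urls, message):
    """Post the same message to each channel's incoming webhook."""
    body = json.dumps(message).encode("utf-8")
    for url in webhook_urls:  # e.g. the general incident feed plus the team channel
        req = urllib.request.Request(
            url, data=body, headers={"Content-Type": "application/json"}
        )
        urllib.request.urlopen(req)
```

Since each Slack incoming webhook is bound to a single channel, posting to two channels simply means two webhook URLs; Jira Automation can do the same with two "send web request" actions.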

&lt;p&gt;For a single company, it was surprisingly effective — a zero-budget incident platform that made Jira feel purpose-built for ops.&lt;/p&gt;

&lt;p&gt;📘 Get the full implementation guide, templates &amp;amp; automations&lt;br&gt;
→ &lt;a href="https://phoenixincidents.com/blog/how-you-can-build-incident-management-in-jira-for-free?utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=incident_jira_free_guide&amp;amp;utm_content=mid_cta" rel="noopener noreferrer"&gt;How You Can Build Incident Management in Jira for free&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  3. The Edges: Where It Fell Apart
&lt;/h2&gt;

&lt;p&gt;It wasn’t the logic that broke down. It was &lt;strong&gt;maintainability&lt;/strong&gt; and &lt;strong&gt;process drift&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Manual maintenance.&lt;/strong&gt; Some configuration was brittle — for example, Slack user ID mappings or Confluence API keys.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clunky reminders.&lt;/strong&gt; The update reminders couldn’t detect if an update was actually sent — they were just timers. Add different SLAs for different severities, and it got unmanageable fast.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Templates ≠ process.&lt;/strong&gt; Confluence templates worked fine, but they don’t enforce good RCA practices. Engineers skipped required fields, the “five whys” were inconsistently filled, and root causes varied from insightful to &lt;code&gt;¯\_(ツ)_/¯&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timeline chaos.&lt;/strong&gt; Slack conversations, Zoom calls, and Jira comments all lived in different places. The “timeline” was whatever someone remembered to write down later.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One-way Slack.&lt;/strong&gt; Notifications weren’t interactive — you couldn’t acknowledge or update from Slack without jumping back into Jira.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reporting and dashboards.&lt;/strong&gt; The reporting technically worked — we used a third-party charting app for Jira and built dozens of custom views. It got the job done, but the setup was heavy, and every quarter someone had to rebuild or recalibrate charts just to keep the data meaningful.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It all functioned, but it required constant tending. Someone had to maintain the automation rules, chase missing data, and update Confluence templates. It was a clever system, not a sustainable one.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. What It Taught Me
&lt;/h2&gt;

&lt;p&gt;You can stretch Jira incredibly far with Automation and Confluence. But eventually, the work shifts from automating incidents to maintaining the automation.&lt;/p&gt;

&lt;p&gt;The challenge wasn’t technical — it was human:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enforcing consistency in how engineers documented root causes.&lt;/li&gt;
&lt;li&gt;Making sure the lessons were actually reusable six months later.&lt;/li&gt;
&lt;li&gt;Keeping communication loops tight without overwhelming people with pings.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those gaps weren’t bugs to fix — they were signals. They showed where process, structure, and guided tooling matter more than clever rules.&lt;/p&gt;

&lt;p&gt;And that learning became the foundation for &lt;strong&gt;Phoenix Incidents&lt;/strong&gt;:&lt;br&gt;
a system built to handle the human side of incident management reliably, at scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. The Role of AI — and What to Be Deliberate About
&lt;/h2&gt;

&lt;p&gt;Today AI can handle parts of this beautifully:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reconstructing timelines automatically from Slack, Zoom, and Jira.&lt;/li&gt;
&lt;li&gt;Suggesting status updates or summaries.&lt;/li&gt;
&lt;li&gt;Recommending similar past incidents or runbooks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s all powerful — and it should be automated.&lt;/p&gt;

&lt;p&gt;But there’s a line. In some places you want a human in the loop; in others, the process is the value.&lt;/p&gt;

&lt;p&gt;Take postmortems (RCAs), for example: the “Five Whys” exercise is an intentional process, not a rote task. The learning comes from the conversation — engineers debating causes, challenging assumptions, uncovering the real systemic issues. Some tools try to skip that by letting an LLM auto-generate root causes and a polished write-up. It looks slick, but it misses the point entirely: nobody learns, and the document just becomes noise — even if it’s useful later as training data.&lt;/p&gt;

&lt;p&gt;AI should absolutely support the process — prompting better questions, surfacing blind spots, suggesting related incidents — but it shouldn’t replace the reasoning itself.&lt;/p&gt;

&lt;p&gt;The future of incident management is AI-driven, but &lt;strong&gt;deliberately so&lt;/strong&gt;. Automate everything that doesn’t teach you something. Keep humans in the loop where reflection, context, and improvement matter.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Takeaway: Build First, Then Evolve
&lt;/h2&gt;

&lt;p&gt;You &lt;em&gt;can&lt;/em&gt; build this inside Jira. And if you do, you’ll learn a lot — what your process actually needs, which automations help, and where the friction lives.&lt;/p&gt;

&lt;p&gt;But you’ll also hit the ceiling fast. Maintenance, reporting, data quality — all the same problems I ran into — will start to eat your weekends. That’s the signal it’s time for something purpose-built.&lt;/p&gt;

&lt;p&gt;That’s why we built Phoenix Incidents: not as a collection of scripts or automations, but as a &lt;strong&gt;reliable, scalable Forge app&lt;/strong&gt; shaped by those early lessons. It takes everything that worked — the workflows, the Slack links, the RCA discipline — and rebuilds them as a mature, reliable platform. Slack is fully interactive with buttons, slash commands, and real-time updates.&lt;br&gt;
Reminders are precise, not timer hacks. Reporting is built-in, not bolted on. Everything runs automatically and at scale — the way it should have from the start.&lt;/p&gt;

&lt;p&gt;So yes, you can recreate what I did.&lt;/p&gt;

&lt;p&gt;Or you can let &lt;a href="https://phoenixincidents.com/phoenix-incidents-for-jira" rel="noopener noreferrer"&gt;Phoenix Incidents&lt;/a&gt; handle all the heavy lifting — built on years of experience running incidents the hard way.&lt;/p&gt;




&lt;p&gt;✅ Download the full incident-management implementation guide + templates&lt;br&gt;
→ &lt;a href="https://phoenixincidents.com/blog/how-you-can-build-incident-management-in-jira-for-free?utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=incident_jira_free_guide&amp;amp;utm_content=bottom_cta" rel="noopener noreferrer"&gt;How You Can Build Incident Management in Jira for free&lt;/a&gt;&lt;/p&gt;

</description>
      <category>jira</category>
      <category>slack</category>
      <category>incidentmanagement</category>
    </item>
    <item>
      <title>Phoenix Incidents Now GA</title>
      <dc:creator>Dave Rochwerger</dc:creator>
      <pubDate>Wed, 22 Oct 2025 04:20:24 +0000</pubDate>
      <link>https://dev.to/phoenixincidents/phoenix-incidents-now-ga-15ab</link>
      <guid>https://dev.to/phoenixincidents/phoenix-incidents-now-ga-15ab</guid>
      <description>&lt;p&gt;🚀 We’re live.&lt;br&gt;
&lt;strong&gt;Phoenix Incidents&lt;/strong&gt; is now available on the Atlassian Marketplace.&lt;/p&gt;

&lt;p&gt;We built it for teams that fix things fast but rarely follow through when it’s over—where accountability slips, and lessons get lost.&lt;/p&gt;

&lt;p&gt;We built it because even during an incident, communication breaks down and customer-facing teams are left guessing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phoenix Incidents&lt;/strong&gt; manages the entire incident lifecycle—start to finish—directly inside Jira and Slack, with built-in guidance to keep everyone aligned:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;💬 Full incident management in Slack or Jira: declare, update, escalate, and resolve from either place. Smart buttons and reminders guide responders through each stage so nothing stalls.&lt;/li&gt;
&lt;li&gt;⚙️ &lt;strong&gt;Best-practice workflows &amp;amp; SLAs by severity:&lt;/strong&gt; pre-defined templates keep teams consistent and on-track, with time-based expectations baked in.&lt;/li&gt;
&lt;li&gt;🔗 &lt;strong&gt;Automatic Jira issue creation and linkage:&lt;/strong&gt; incidents, actions, and follow-ups stay next to your team’s regular work—no context switching.&lt;/li&gt;
&lt;li&gt;🧠 &lt;strong&gt;AI-guided postmortems &amp;amp; RCAs:&lt;/strong&gt; generate structured reports that capture what happened, why, and what to fix—without a blank-page start.&lt;/li&gt;
&lt;li&gt;📊 &lt;strong&gt;Beautiful reporting dashboards in Jira:&lt;/strong&gt; real-time metrics for executives, engineering, and SRE—showing response times, SLA performance, and follow-up completion.&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Accountability that lasts:&lt;/strong&gt; reminders and dashboards make sure post-incident actions actually get done—within defined SLAs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It’s the tool we always wanted during those 2 a.m. pages—lightweight, opinionated best practices, no new tools to deal with, just simple.&lt;/p&gt;




&lt;p&gt;👉 Install in Jira on &lt;a href="https://marketplace.atlassian.com/apps/1238126/phoenix-incident-management?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=community_post&amp;amp;utm_content=article_body&amp;amp;utm_vendorID=1053700935" rel="noopener noreferrer"&gt;Atlassian Marketplace&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;📖 Or read more on our website: &lt;a href="https://phoenixincidents.com/?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=community_post&amp;amp;utm_content=article_body&amp;amp;utm_vendorID=1053700935" rel="noopener noreferrer"&gt;phoenixincidents.com&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We’re excited to finally make this public and can’t wait to see how you run your next incident.&lt;/p&gt;

</description>
      <category>incidentmanagement</category>
      <category>jira</category>
      <category>bestpractices</category>
      <category>slack</category>
    </item>
    <item>
      <title>Why Postmortems Fail and How to Make Them Drive Real Change</title>
      <dc:creator>Dave Rochwerger</dc:creator>
      <pubDate>Tue, 21 Oct 2025 19:04:59 +0000</pubDate>
      <link>https://dev.to/levelup-engineers/why-postmortems-fail-and-how-to-make-them-drive-real-change-4pkn</link>
      <guid>https://dev.to/levelup-engineers/why-postmortems-fail-and-how-to-make-them-drive-real-change-4pkn</guid>
      <description>&lt;h2&gt;
  
  
  Introduction: The Hidden Cost of Poor Incident Follow-Up
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;How did this happen again? Didn't we prepare for this?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;No engineering leader wants this message—especially after months of careful planning. Yet there we were: during our peak traffic spike, a critical customer-facing service slowed to a crawl, badly impacting customers exactly as we had seen before. Senior executives had to spend days personally reassuring frustrated customers, promising once again to finally address the underlying issues.&lt;/p&gt;

&lt;p&gt;The painful truth was that our infrastructure was outdated and architecture desperately needed refactoring. Instead, we’d spent months scaling hardware, applying patches, and tackling easy fixes—everything except solving the core problem. Our team knew what was needed, but the organization never allocated the necessary resources.&lt;/p&gt;

&lt;p&gt;This story isn't unique. Most engineering teams genuinely want to prevent incidents—but their organizations struggle to prioritize deep, thematic fixes over quick patches.&lt;/p&gt;

&lt;p&gt;Repeated incidents aren't just operational headaches—they’re symptoms of a deeper problem: &lt;strong&gt;poorly executed post-incident processes.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Section 1: The Board Issue—Why Senior Leaders Should Care
&lt;/h2&gt;

&lt;p&gt;When there are many incidents, senior engineering leaders often find themselves playing defense: explaining why the product isn't reliable, why customer satisfaction scores are dropping, and why the same issues seem to surface again and again.&lt;/p&gt;

&lt;p&gt;At the executive level, the concern isn't just about the total number of incidents; it’s about the consequences of those incidents. Customers and internal stakeholders may not dive deep into root-cause analysis at first—but they absolutely notice when the product is down repeatedly, trust begins to slip, and internal teams start questioning Engineering’s ability to deliver.&lt;/p&gt;

&lt;p&gt;Good executive teams track uptime and incident counts, yes, but the real signal they respond to comes from customer satisfaction metrics, renewal rates, and feedback from internal teams like Customer Success and Sales. When incidents—especially repeated ones—pile up, these signals inevitably deteriorate. Leaders then find themselves having uncomfortable conversations with the board, forced to justify performance instead of focusing on growth.&lt;/p&gt;

&lt;p&gt;Reducing the number of repeat incidents is one of the most straightforward ways senior engineering leaders can proactively protect customer satisfaction, internal trust, and ultimately, their own credibility.&lt;/p&gt;

&lt;p&gt;Here are three reasons why effective post-incident processes matter at the board level:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Reputational Risk
&lt;/h3&gt;

&lt;p&gt;Reputation takes years to build but only moments to damage. Repeated incidents send a clear public signal: your team struggles to learn from its mistakes. Customers quickly notice instability, which undermines your brand's perceived reliability. Competitors capitalize, positioning themselves as stable, trustworthy alternatives.&lt;/p&gt;

&lt;p&gt;For senior leaders, reputation isn't just a marketing metric—it directly impacts valuation, investor confidence, and long-term growth.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Customer Trust
&lt;/h3&gt;

&lt;p&gt;Customers rarely leave after a single incident, but repeated issues erode trust over time. When clients continuously experience similar disruptions, their patience wears thin. Eventually, they ask: is this company competent enough to reliably deliver its service?&lt;/p&gt;

&lt;p&gt;This loss of customer trust isn’t hypothetical. According to &lt;strong&gt;&lt;a href="https://www.pagerduty.com/wp-content/uploads/2024/06/Whitepaper_Automation-survey.pdf" rel="noopener noreferrer"&gt;PagerDuty’s 2024 Incident Report&lt;/a&gt;, 90% of IT leaders agree that outages significantly harm customer trust&lt;/strong&gt;, and that year-over-year &lt;strong&gt;customer-impacting incidents have increased by 43%&lt;/strong&gt;. And downtime has real cost consequences too: according to a &lt;strong&gt;&lt;a href="https://www.vertiv.com/globalassets/documents/reports/2016-cost-of-data-center-outages-11-11_51190_1.pdf" rel="noopener noreferrer"&gt;2016 Ponemon Study&lt;/a&gt;&lt;/strong&gt;, on average the &lt;strong&gt;cost of an unplanned outage is nearly $9,000 per minute&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;These aren't just numbers; they're warning signals to senior leaders: repeated incidents drive customers away, hurting revenue and growth.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Internal Trust &amp;amp; Employee Morale
&lt;/h3&gt;

&lt;p&gt;Internally, repeated incidents quickly sap morale. Teams across your company—especially Customer Success, Sales, and Product—depend on Engineering to deliver a stable product. When they continually encounter the same problems, frustration builds. Internal dialogue shifts from problem-solving to blaming:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;“Why doesn’t Engineering fix this for real?”&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;“We can’t promise customers improvements if Engineering won’t follow through.”&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This internal friction erodes team collaboration and overall efficiency, turning what should be organizational allies into skeptics. At its worst, it creates a culture of learned helplessness—"why bother?" becomes the pervasive attitude.&lt;/p&gt;

&lt;h2&gt;
  
  
  Section 2: Why Most Postmortems Fail
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe1idi7dxp413pxruwbkz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe1idi7dxp413pxruwbkz.png" alt="Why Postmorems Fail" width="800" height="284"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every engineering team has good intentions after a major incident. You gather the right people, document what happened, and create a list of improvements. But then something breaks down. The urgency fades, action items don't make it into sprints, and weeks later you're asking, "Why didn’t we fix this last time?"&lt;/p&gt;

&lt;p&gt;Through experience—ours and others—we've seen consistent patterns emerge. Here are the most common reasons postmortems fail to deliver meaningful change:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. No Accountability or Clear Ownership
&lt;/h3&gt;

&lt;p&gt;This is one of the most frequent pitfalls. Postmortems often generate a lot of ideas but few explicit owners. Without accountability, tasks drift. Weeks later, it’s unclear who was supposed to deliver what, and critical action items remain undone.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Delays in Scheduling and Execution
&lt;/h3&gt;

&lt;p&gt;The best time to perform a root-cause analysis (RCA) is as close to the incident as possible. Memory is fresh, urgency is high, and you’re still in the mindset to solve problems. Wait even a week, and context fades, key details are lost, and urgency drops significantly. Postmortems become box-checking exercises rather than meaningful improvements.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Weak or Unstructured Root Cause Analysis
&lt;/h3&gt;

&lt;p&gt;Without structured guidance, RCAs often become superficial or overly narrow. Teams might chase immediate triggers instead of thematic, systemic causes. You fix the symptom—an overloaded server—but miss the underlying cause, such as poor alerting or weak service dependencies.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Failure to Follow Through and Learn
&lt;/h3&gt;

&lt;p&gt;It’s easy to capture action items after an incident—harder to follow through. Teams often list every possible improvement in the heat of the moment. &lt;strong&gt;But when everything is important, nothing gets done.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We often see teams fall into the same traps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Writing down too many action items, with no clear prioritization&lt;/li&gt;
&lt;li&gt;Declaring incidents "closed" before improvements are complete&lt;/li&gt;
&lt;li&gt;Failing to check in on progress or remind owners&lt;/li&gt;
&lt;li&gt;Fixing each incident in isolation, ignoring repeating patterns across teams or services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these mistakes chips away at your ability to improve. Action items stay incomplete. The same failures repeat. And leadership is left with a false sense of progress—until the next outage proves otherwise. Instead, build follow-through into your reliability strategy. That means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Only taking on the most impactful action items&lt;/li&gt;
&lt;li&gt;Assigning owners and due dates&lt;/li&gt;
&lt;li&gt;Tracking completion publicly&lt;/li&gt;
&lt;li&gt;Categorizing root causes so you can spot recurring themes&lt;/li&gt;
&lt;/ul&gt;
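Tracking completion publicly is easy to operationalize with a saved JQL filter and a dashboard gadget. As a sketch, this builds a Jira Cloud REST search URL for open post-incident work; note the "Action Item" issue type name is an assumption about your Jira configuration:

```python
import urllib.parse

# Open post-incident action items, earliest due date first. The issue type
# name "Action Item" is an assumption about your Jira setup.
OPEN_ACTIONS_JQL = (
    'issuetype = "Action Item" AND statusCategory != Done ORDER BY duedate ASC'
)

def search_url(base_url, jql, max_results=50):
    """Build a Jira Cloud REST API issue-search URL for the given JQL."""
    query = urllib.parse.urlencode({"jql": jql, "maxResults": max_results})
    return f"{base_url}/rest/api/3/search?{query}"
```

Reviewing the result of a query like this at a weekly checkpoint is what turns "assign owners and due dates" from intention into practice.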

&lt;h3&gt;
  
  
  Result: Frustration and Repeated Failures
&lt;/h3&gt;

&lt;p&gt;Together, these common pitfalls lead to the same frustrating cycle: repeated failures, eroded trust, and exhausted engineering teams. Your postmortems become performative instead of transformative. Eventually, your stakeholders—customers, internal teams, and senior leaders—start doubting the team's ability to deliver reliable systems.&lt;/p&gt;

&lt;p&gt;Fortunately, each of these pitfalls is addressable. In the next sections, I'll show how simple process improvements and structured tooling—like what Phoenix Incidents provides—can shift your team's incident response from reactive to proactive, permanently reducing incident volume and restoring trust.&lt;/p&gt;

&lt;h2&gt;
  
  
  Section 3: Building a Reliable Postmortem Process
&lt;/h2&gt;

&lt;p&gt;Knowing why postmortems fail isn’t enough. You need clear, repeatable steps to build an effective post-incident practice. From our experience, here’s how you get there:&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Schedule Postmortems Quickly
&lt;/h3&gt;

&lt;p&gt;Run RCAs within 72 hours of the incident. Memories fade fast, and key context disappears after a few days. Quick scheduling means deeper insights and more accurate findings.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Establish Clear Accountability
&lt;/h3&gt;

&lt;p&gt;Assign explicit owners for every action item. This isn’t about assigning blame; it’s about making sure improvements actually get done. Make sure every action item has a single accountable person and a realistic due date. Enforce those dates with SLAs and track progress.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Structured Root Cause Analysis
&lt;/h3&gt;

&lt;p&gt;Avoid unstructured discussions. Use standardized methods that guide teams toward identifying deeper, underlying causes. We’ve had an amazing experience with the “Five Whys” method.&lt;/p&gt;

&lt;p&gt;Critically, ensure root causes use consistent naming conventions or categories. This makes it easier to detect patterns or systemic issues over time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Prioritize Action Items Carefully
&lt;/h3&gt;

&lt;p&gt;Not every idea from a postmortem is worth immediate action. Too many action items overwhelm teams, reducing the likelihood of completion. Prioritize actions by the potential to prevent future incidents. Quality over quantity wins every time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Regular Follow-ups and Transparent Visibility
&lt;/h3&gt;

&lt;p&gt;Create routine checkpoints to track and review open action items—ideally weekly, and at least monthly. This provides clear visibility to stakeholders and ensures no improvement gets lost in the backlog. Regularly report progress to senior leaders to maintain momentum and accountability.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 6: Identify and Address Thematic Issues
&lt;/h3&gt;

&lt;p&gt;Track root causes across incidents. If the same issues keep showing up—poor monitoring, unclear ownership, fragile dependencies—address these at the organizational level, not just the team or incident level.&lt;/p&gt;

&lt;p&gt;This might mean dedicating sprint time specifically to improve monitoring, tooling, or onboarding. These systemic investments yield major incident reductions down the road.&lt;/p&gt;

&lt;h2&gt;
  
  
  Section 4: How Phoenix Incidents Helps You Get There
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;The true test of incident management isn't how quickly you put out fires—it's ensuring those fires never start again.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Phoenix Incidents is designed to make that philosophy real—without adding process for the sake of process. We don’t give you empty templates, nor provide pages of customization; we hardwire best practices directly into the workflow your team already uses.&lt;/p&gt;

&lt;p&gt;Here’s how we help teams actually fix what caused the fire:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Enforce follow-through:&lt;/strong&gt; Incidents are not closed until linked action items are complete. That’s not a guideline—it's built in.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Guide real RCAs:&lt;/strong&gt; Five Whys, consistent root cause tagging, and AI-assisted timelines help teams focus on analysis, not formatting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep it visible:&lt;/strong&gt; Weekly Slack reminders, public report cards, and dashboards keep ownership clear and progress visible.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most importantly, Phoenix turns &lt;strong&gt;thematic issues&lt;/strong&gt; into visible, solvable patterns—so leadership can invest in fixing what really matters, not just what broke last week.&lt;/p&gt;

&lt;h2&gt;
  
  
  Section 5: The Leadership Call to Action
&lt;/h2&gt;

&lt;p&gt;Incident management isn’t about paperwork or meetings—it’s about trust, credibility, and growth. You’ve seen why postmortems matter, where most teams fail, and how to get it right.&lt;/p&gt;

&lt;p&gt;Now it’s time to act.&lt;/p&gt;

&lt;h3&gt;
  
  
  Evaluate Your Current Postmortem Process
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Are your teams performing RCAs promptly—ideally within 72 hours?&lt;/li&gt;
&lt;li&gt;Does every action item have clear ownership and due dates?&lt;/li&gt;
&lt;li&gt;Do you know how many postmortem tasks are incomplete today?&lt;/li&gt;
&lt;li&gt;Are you consistently tracking and addressing thematic root causes?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the answer to any of these questions isn’t a confident “yes,” your post-incident process needs attention. Addressing these gaps is critical, not just for operational reliability, but for customer trust, internal morale, and leadership credibility.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prioritize What Matters Most
&lt;/h3&gt;

&lt;p&gt;You don’t need dozens of new processes—just a few reliable, high-impact practices that prevent incidents from recurring. Start with prompt scheduling, structured root causes, explicit accountability, and regular check-ins. These basics yield immediate results.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Phoenix Incidents Helps You Get Started
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://phoenixincidents.com" rel="noopener noreferrer"&gt;Phoenix Incidents&lt;/a&gt;&lt;/strong&gt; isn’t another tool you need to babysit; it actively drives your process. It enforces your SLAs, guides your RCAs, ensures accountability, and provides transparency. Incident follow-through isn’t optional—it’s automatic. In fact, it's not even another tool at all, we leverage the existing tools your team already uses.&lt;/p&gt;

&lt;p&gt;Phoenix Incidents is now available for &lt;a href="https://marketplace.atlassian.com/apps/1238126/phoenix-incident-management" rel="noopener noreferrer"&gt;Jira Cloud on the Atlassian Marketplace&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If your team is serious about fixing recurring incidents for good,&lt;/strong&gt; &lt;a href="https://marketplace.atlassian.com/apps/1238126/phoenix-incident-management" rel="noopener noreferrer"&gt;install from the Atlassian Marketplace&lt;/a&gt; or book a short intro call — we’d love to show you what Phoenix can do.&lt;/p&gt;

&lt;p&gt;Your next incident doesn’t need to be déjà vu. When you’re supported by Phoenix Incidents, you can turn incidents into permanent improvements—every single time.&lt;/p&gt;

</description>
      <category>jira</category>
      <category>slack</category>
      <category>incidentmanagement</category>
    </item>
  </channel>
</rss>
