Marina Kovalchuk

Posted on Jun 13

Validating an Open-Source Tool for Automating Incident Investigation in AWS/Azure Environments with On-Call Teams

#automation #incidentresponse #aws #azure

Introduction

Incident investigation in AWS/Azure environments is a high-stakes race against time. The first 10 minutes of an incident are critical—teams scramble to gather context, correlate data, and form a hypothesis. This phase often involves a manual fan-out across CloudWatch, logs, alarms, recent deploys, IAM changes, and service dashboards. The process is inefficient and error-prone, driven by the need to answer one question: “What changed?”

My open-source tool aims to automate this initial investigation phase, leveraging read-only access and bring-your-own-LLM capabilities to generate root-cause hypotheses with supporting evidence. But here’s the catch: its success depends on whether it aligns with the real workflows and trust dynamics of on-call teams. If it fails to mirror how teams actually work, it risks becoming irrelevant or untrusted, wasting resources and failing to improve incident response efficiency.

The problem is twofold. First, manual data gathering during the initial stages of an incident is a cognitive bottleneck. Teams prioritize speed over completeness, often relying on outdated runbooks or improvisation to address symptoms. Second, change detection—a critical task—is time-consuming and requires manual correlation of logs, metrics, and alerts. Automation could theoretically reduce this load, but trust is the real bottleneck. Teams are skeptical of tools that lack transparency or consistency, especially in high-stress environments.

Consider the mechanism of risk formation: if an automated tool produces an inaccurate hypothesis due to incomplete or noisy data, it can lead to misdiagnosis and prolonged downtime. For example, if the tool fails to detect a recent IAM change that caused a service disruption, the team might chase false leads, wasting precious time. Conversely, if the tool reliably identifies the root cause, it could shift the team’s focus from data gathering to problem-solving, reducing mean time to resolution (MTTR).

The stakes are high. As cloud environments grow in complexity, manual investigation becomes increasingly unsustainable. Tools like this could be transformative—but only if they meet real needs. That’s why I’m seeking feedback: to validate whether my assumptions about incident response workflows hold up in practice. If they don’t, I need to know why.

Key Questions for On-Call Teams

What does your first 10 minutes of an incident actually look like? Is it structured runbook execution, improvisation, or a mix of both?
How do you answer “what changed?” What’s the fastest, most reliable method you’ve found?
Where do you trust automation today, and where would you explicitly avoid it? What factors influence your trust in automated tools?
Would a system that reliably produces a root-cause hypothesis change your workflow? Or is trust the bottleneck, not data gathering?

If you think this idea is flawed, I’m more interested in that than validation. The goal isn’t to push a tool—it’s to understand whether the problem I’m solving actually matches how real AWS/Azure on-call teams operate.

Tool Overview & Functionality

The open-source tool I’ve developed is designed to automate the initial stages of incident investigation in AWS/Azure environments, targeting the critical first 10 minutes where teams typically engage in manual fan-out across multiple data sources. Its architecture is built around a read-only agent that integrates with cloud APIs to collect data from CloudWatch, logs, alarms, recent deploys, IAM changes, and service dashboards. This data is then processed using a bring-your-own-LLM approach, allowing teams to leverage their preferred language model for hypothesis generation.

Key Features

Automated Data Aggregation: The tool consolidates data from disparate sources, reducing the cognitive load of manual correlation—a process that often leads to incomplete or noisy data due to human oversight or time constraints.
Hypothesis Generation: By analyzing aggregated data, the tool generates a root-cause hypothesis with supporting evidence. This shifts the focus from data gathering to problem-solving, potentially reducing mean time to resolution (MTTR).
Read-Only Access: The agent operates with read-only permissions, ensuring it cannot inadvertently alter cloud configurations—a critical constraint in environments with regulatory and compliance requirements.
Bring-Your-Own-LLM: Teams can integrate their preferred LLM, addressing skepticism toward AI-driven tools by allowing control over the model’s transparency and reliability.
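
To make the bring-your-own-LLM idea concrete, here is a minimal sketch of what the seam between the tool and a model could look like. It is not the tool's actual interface: `HypothesisModel`, `CannedModel`, `build_prompt`, and `generate_hypothesis` are hypothetical names chosen for illustration, and a real adapter would wrap whichever model a team prefers.

```python
from typing import Protocol


class HypothesisModel(Protocol):
    """Hypothetical seam: any object with a complete() method can be plugged in."""

    def complete(self, prompt: str) -> str: ...


class CannedModel:
    """Stand-in model for local testing; a real adapter would call a team's LLM."""

    def __init__(self, reply: str) -> None:
        self.reply = reply
        self.prompts: list[str] = []

    def complete(self, prompt: str) -> str:
        self.prompts.append(prompt)  # keep every prompt so responders can audit it
        return self.reply


def build_prompt(symptom: str, evidence: list[str]) -> str:
    """Number each evidence line so the model can cite it by index."""
    numbered = "\n".join(f"[{i}] {line}" for i, line in enumerate(evidence, start=1))
    return (
        f"Symptom: {symptom}\n"
        f"Evidence:\n{numbered}\n"
        "Propose one root-cause hypothesis and cite evidence by number."
    )


def generate_hypothesis(model: HypothesisModel, symptom: str, evidence: list[str]) -> str:
    """Send the numbered evidence to whichever model the team supplied."""
    return model.complete(build_prompt(symptom, evidence))
```

Keeping the interface this narrow means the model can be swapped without touching the collectors, and the recorded prompts give responders a way to see exactly what the model was shown.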

Mechanisms of Automation

The tool’s effectiveness hinges on its ability to mirror real workflows while addressing trust bottlenecks. Here’s how it works:

Data Collection: The agent queries APIs in parallel, fetching logs, metrics, and change history. This parallel processing reduces the time typically spent on manual fan-out, which often becomes a bottleneck due to sequential data retrieval.
Change Detection: By cross-referencing recent deploys, IAM changes, and service updates, the tool identifies what changed—a task that, when done manually, is prone to missed edge cases or false positives due to human error.
Hypothesis Formation: The LLM processes the aggregated data to generate a hypothesis. However, trust in automation is built only if the tool consistently provides transparent and explainable insights, avoiding the black-box effect that erodes confidence.
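
The parallel data collection step can be sketched with the standard library alone. This is an illustration under assumptions, not the agent's real code: `fan_out`, `recent_deploys`, and `iam_changes` are hypothetical, and the two collectors are stubs standing in for read-only cloud API calls.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Hypothetical collector signature: no arguments, returns a list of event strings.
Collector = Callable[[], list[str]]


def fan_out(collectors: dict[str, Collector]) -> tuple[dict[str, list[str]], dict[str, str]]:
    """Run every collector concurrently and return (results, gaps).

    A collector that raises is recorded as a gap instead of aborting the
    whole investigation. Real collectors should set their own client-side
    timeouts, because this sketch waits for every collector to finish.
    """
    results: dict[str, list[str]] = {}
    gaps: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max(1, len(collectors))) as pool:
        futures = {name: pool.submit(fn) for name, fn in collectors.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result()
            except Exception as exc:
                gaps[name] = f"{type(exc).__name__}: {exc}"
    return results, gaps


def recent_deploys() -> list[str]:
    """Stub standing in for a deploy-history query."""
    return ["10:02 deploy checkout-service v41"]


def iam_changes() -> list[str]:
    """Stub that simulates a source the agent is not allowed to read."""
    raise PermissionError("AccessDenied on audit trail")


results, gaps = fan_out({"deploys": recent_deploys, "iam": iam_changes})
```

Returning the failures alongside the results, rather than swallowing them, is what lets a later step tell responders which sources the hypothesis could not take into account.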

Edge-Case Analysis

While the tool aims to streamline incident investigation, its success depends on handling edge cases effectively. For example:

Noisy Data: Incomplete or inconsistent logs can lead to inaccurate hypotheses. The tool mitigates this by flagging data gaps and prioritizing high-confidence insights, ensuring human intuition remains in the loop for validation.
Workflow Variability: Teams with mature incident response processes may find the tool redundant, while those relying on outdated runbooks or improvisation could benefit significantly. The tool’s modular design allows customization to fit diverse workflows.
Resource Constraints: Organizations with limited budgets may hesitate to adopt LLMs. The bring-your-own-LLM approach addresses this by allowing the use of cost-effective or open-source models, though performance may vary.
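
Flagging data gaps could look something like the sketch below. The names (`CoverageReport`, `assess_coverage`) and the thresholds are assumptions made for illustration, not values taken from the tool.

```python
from dataclasses import dataclass


@dataclass
class CoverageReport:
    missing: list[str]  # expected sources that returned nothing
    coverage: float     # fraction of expected sources that returned data
    label: str          # "high", "medium", or "low"


def assess_coverage(expected: list[str], collected: dict[str, list[str]]) -> CoverageReport:
    """Treat a source as missing if it is absent or returned no records.

    An empty result is ambiguous: it can mean "nothing happened" or
    "could not see". This sketch errs on the side of flagging it.
    """
    missing = [name for name in expected if not collected.get(name)]
    coverage = 1.0 if not expected else (len(expected) - len(missing)) / len(expected)
    # Illustrative thresholds only.
    if coverage == 1.0:
        label = "high"
    elif coverage >= 0.5:
        label = "medium"
    else:
        label = "low"
    return CoverageReport(missing=missing, coverage=coverage, label=label)
```

Showing the `missing` list next to any hypothesis gives the responder a concrete prompt for where to look manually, which keeps human validation in the loop.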

Trust and Adoption

The tool’s adoption ultimately depends on trust, which is built through consistent reliability and transparency. For instance:

Explainability: Each hypothesis is accompanied by supporting evidence, allowing responders to verify the tool’s logic. This contrasts with black-box systems, which often fail to gain trust due to their opacity.
Incremental Integration: Teams can start by using the tool for low-risk incidents, gradually building confidence as it proves reliable. This approach avoids the over-reliance on automation that can lead to catastrophic failures in critical systems.
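
One way to make explainability structural rather than optional is to attach evidence to the hypothesis type itself. The following is a sketch with hypothetical names (`Evidence`, `Hypothesis`), not the tool's real data model.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Evidence:
    source: str       # e.g. "rds events"; free-form in this sketch
    observation: str  # what was seen
    reference: str    # where a responder can verify it (query, event ID, link)


@dataclass
class Hypothesis:
    statement: str
    evidence: list[Evidence] = field(default_factory=list)

    @property
    def verifiable(self) -> bool:
        """A hypothesis with no checkable evidence is treated as a guess."""
        return any(e.reference for e in self.evidence)

    def render(self) -> str:
        """Format the hypothesis for responders, marking unsupported ones."""
        header = self.statement if self.verifiable else f"UNVERIFIED: {self.statement}"
        lines = [header]
        for e in self.evidence:
            lines.append(f"  - [{e.source}] {e.observation} ({e.reference or 'no reference'})")
        return "\n".join(lines)
```

With this shape, a hypothesis that arrives without a checkable reference is labeled as unverified in the output instead of being presented with the same weight as a supported one.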

Professional Judgment

If your team spends the first 10 minutes of an incident manually correlating data and improvising due to outdated runbooks, this tool could significantly reduce MTTR by automating these tasks. However, if your workflows are already highly structured and trust in automation is low due to past failures, the tool’s value diminishes. Rule of thumb: If manual data gathering is a bottleneck and trust can be built through transparency, adopt the tool; otherwise, focus on improving runbooks or addressing trust issues first.

Scenario-Based Validation

To test how well the open-source tool aligns with real-world incident response workflows, we ran six scenarios, each representing a common incident type in AWS/Azure environments. The focus was on the initial 10 minutes of an incident, where manual data gathering and hypothesis formation are most critical. Below are the scenarios, their expected workflows, and how the tool performed, alongside insights from on-call teams.

Scenario 1: Sudden Application Latency Spike

Incident Type: Performance degradation in a web application hosted on AWS.

Expected Workflow: Teams manually check CloudWatch metrics, recent deploys, and service dashboards to identify potential causes.

Tool Performance: The tool aggregated CloudWatch metrics, recent deploys, and IAM changes in parallel, generating a hypothesis pointing to a recent database schema change. However, the team noted the tool missed a concurrent EC2 instance scaling event due to noisy data.

Insights: Teams prioritize speed but expect tools to handle noisy data. The tool’s modular design allowed customization to flag scaling events, but its initial hypothesis lacked completeness. Mechanism: Noisy data overwhelmed the LLM’s prioritization algorithm, leading to incomplete insights.

Scenario 2: IAM Permission Denial

Incident Type: Users unable to access S3 buckets in AWS due to IAM policy changes.

Expected Workflow: Teams cross-reference IAM changes and recent deploys to identify the offending policy update.

Tool Performance: The tool accurately identified the IAM policy change but failed to correlate it with a concurrent Kubernetes deployment, leading to a delayed hypothesis.

Insights: Cross-referencing changes across systems is critical. The tool’s read-only cloud access did not extend to the Kubernetes API, so it could not see the deployment, highlighting the need for broader integration. Mechanism: Siloed data sources prevented the LLM from forming a complete hypothesis.

Scenario 3: Database Connection Failures

Incident Type: RDS database connections failing in AWS after a recent patch.

Expected Workflow: Teams manually correlate logs, recent patches, and CloudWatch alarms to identify the root cause.

Tool Performance: The tool generated a hypothesis linking the failure to a recent RDS patch but lacked supporting evidence from application logs, leading to skepticism.

Insights: Teams demand transparency in hypothesis generation. The tool’s explainability feature was underutilized, as it didn’t include application log data. Mechanism: Incomplete data input resulted in a hypothesis lacking credibility.

Scenario 4: Auto-Scaling Misconfiguration

Incident Type: EC2 instances failing to scale in AWS due to misconfigured auto-scaling policies.

Expected Workflow: Teams improvise by checking auto-scaling policies and recent deploys, often relying on outdated runbooks.

Tool Performance: The tool identified the misconfiguration but failed to suggest a remediation step, as it lacked integration with runbook repositories.

Insights: Tools must align with improvisation-heavy workflows. The modular design allowed adding runbook integration, but initial deployment lacked this feature. Mechanism: Workflow variability required customization beyond the tool’s default capabilities.

Scenario 5: Network Partitioning

Incident Type: Network partitioning between AWS VPCs causing service outages.

Expected Workflow: Teams manually correlate VPC routing tables, recent changes, and CloudWatch alarms to diagnose the issue.

Tool Performance: The tool accurately identified the routing table change but failed to account for a concurrent security group update, leading to a partial hypothesis.

Insights: Edge cases require human validation. The tool’s incremental integration approach allowed teams to validate its hypothesis before trusting it fully. Mechanism: Concurrent changes created ambiguity, requiring human intuition to disambiguate.
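
The ambiguity from concurrent changes can be detected mechanically even when it cannot be resolved mechanically. Here is a minimal sketch, assuming a hypothetical `Change` record and a 30-minute lookback window (both illustrative, not taken from the tool):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    kind: str      # e.g. "route-table", "security-group"
    at: datetime   # when the change landed
    summary: str


def candidate_changes(
    changes: list[Change],
    incident_start: datetime,
    lookback: timedelta = timedelta(minutes=30),
) -> list[Change]:
    """Return changes that landed in the lookback window, newest first."""
    window_start = incident_start - lookback
    hits = [c for c in changes if window_start <= c.at <= incident_start]
    return sorted(hits, key=lambda c: c.at, reverse=True)


def needs_human_review(candidates: list[Change]) -> bool:
    """More than one candidate means timing alone cannot pick the cause."""
    return len(candidates) > 1
```

When `needs_human_review` is true, the tool can present every candidate in the window side by side instead of committing to one, which is the point where responder intuition takes over.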

Scenario 6: Serverless Function Timeout

Incident Type: Lambda functions timing out in AWS due to increased payload size.

Expected Workflow: Teams check CloudWatch logs, recent deploys, and service dashboards to identify the cause.

Tool Performance: The tool generated a hypothesis linking the timeout to a recent code deploy but missed a concurrent API Gateway configuration change.

Insights: Trust in automation builds incrementally. The tool’s transparent evidence presentation helped teams validate its hypothesis, but broader integration is needed. Mechanism: Limited API access prevented the tool from querying API Gateway logs, leading to incomplete insights.

Key Takeaways

Workflow Alignment: The tool’s success hinges on mirroring real workflows. Teams rejected hypotheses lacking completeness or transparency. Rule: If workflows rely on improvisation, customize the tool to integrate with runbooks and edge cases.
Trust Formation: Incremental integration and explainability are critical. Teams trusted hypotheses with supporting evidence but remained skeptical of black-box insights. Rule: Prioritize transparency over speed in hypothesis generation.
Edge-Case Handling: Noisy or concurrent changes often lead to incomplete hypotheses. Human validation remains essential. Rule: Design tools to flag gaps in data and keep humans in the loop for edge cases.

The tool shows promise but must address workflow variability, data source limitations, and trust bottlenecks to become transformative. Professional Judgment: Adopt if manual data gathering is a bottleneck and trust can be built via transparency; avoid if workflows are highly structured or trust in automation is low.

Feedback & Future Directions

The feedback from on-call teams highlights both the promise and pitfalls of automating incident investigation in AWS/Azure environments. Below, we distill key insights, identify areas for improvement, and outline future enhancements to better align the tool with real-world workflows.

Key Feedback Themes

Initial 10 Minutes: Teams confirmed that the first 10 minutes of an incident are dominated by manual fan-out across CloudWatch, logs, alarms, recent deploys, IAM changes, and service dashboards. However, the mix of structured runbook execution and improvisation varies widely, with mature teams relying more on runbooks and less mature teams improvising heavily.
Change Detection: Identifying "what changed" remains a time-consuming task, often requiring manual correlation of multiple data sources. Teams trust automation for low-risk tasks (e.g., log aggregation) but avoid it for hypothesis generation due to past failures or lack of transparency.
Trust in Automation: The tool’s ability to generate root-cause hypotheses is seen as valuable, but trust is the bottleneck. Teams demand explainable insights and incremental integration to build confidence, especially in high-stress environments.

Areas for Improvement


Issue	Mechanism	Impact	Proposed Solution
Noisy Data Overwhelming LLM	Incomplete or conflicting data inputs (e.g., missing EC2 scaling events) cause the LLM to prioritize incorrectly.	Inaccurate hypotheses, reduced trust.	Implement data prioritization filters to flag low-confidence insights and highlight gaps in data collection.
Siloed Data Sources	Read-only access limits cross-referencing between siloed systems (e.g., IAM changes and Kubernetes deploys).	Incomplete hypotheses, missed root causes.	Develop modular integrations for additional data sources (e.g., Kubernetes, API Gateway) with optional permissions.
Workflow Variability	Teams with improvisation-heavy workflows find the tool’s default settings too rigid.	Tool becomes irrelevant or untrusted.	Introduce customizable workflows to mirror team-specific processes, including runbook integration.
Lack of Remediation Guidance	The tool identifies issues but does not suggest fixes, leaving teams to improvise.	Prolonged MTTR, cognitive overload.	Add remediation suggestions tied to common incident patterns, leveraging runbook libraries where available.

Future Enhancements

Incremental Trust Building: Start with low-risk incidents to demonstrate reliability, gradually expanding to critical systems. Include explainability dashboards to show how hypotheses are formed.
Edge-Case Handling: Incorporate human-in-the-loop validation for ambiguous cases (e.g., concurrent changes). Flag data gaps explicitly to avoid overconfidence in automated insights.
Workflow Customization: Allow teams to tailor data sources, hypothesis thresholds, and integration points to align with their unique workflows. Provide templates for common incident types.
Cost-Effective LLM Integration: Support open-source LLMs with performance trade-offs, enabling teams with resource constraints to adopt the tool without compromising core functionality.
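
Workflow customization with templates might be expressed as a small, validated settings object. Everything in this sketch is hypothetical, including the field names, the default values, and the two example templates.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class InvestigationProfile:
    """Hypothetical per-team settings; field names are illustrative."""

    incident_type: str
    sources: tuple[str, ...]
    min_coverage: float = 0.75   # below this, report gaps instead of a hypothesis
    lookback_minutes: int = 30

    def __post_init__(self) -> None:
        if not self.sources:
            raise ValueError("a profile needs at least one data source")
        if not 0.0 <= self.min_coverage <= 1.0:
            raise ValueError("min_coverage must be between 0 and 1")
        if self.lookback_minutes <= 0:
            raise ValueError("lookback_minutes must be positive")


# Templates for common incident types, which a team could copy and override.
TEMPLATES = {
    "latency": InvestigationProfile("latency", ("metrics", "deploys", "scaling_events")),
    "access_denied": InvestigationProfile(
        "access_denied", ("iam_changes", "deploys"), lookback_minutes=60
    ),
}


def profile_for(incident_type: str, **overrides) -> InvestigationProfile:
    """Start from a template and apply team-specific overrides (re-validated)."""
    return replace(TEMPLATES[incident_type], **overrides)
```

Because `replace` builds a new instance, overrides pass through the same validation as the templates and the shared templates themselves are never mutated.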

Professional Judgment

Adopt this tool if manual data gathering is a bottleneck and trust can be built via transparency. Avoid it if workflows are highly structured or trust in automation is low due to past failures. Rule of thumb: Prioritize adoption if manual processes are inefficient; otherwise, improve runbooks or address trust issues first.

The tool’s success hinges on its ability to mirror real workflows, handle edge cases transparently, and build trust incrementally. Without these, it risks becoming another untrusted tool in a sea of automation attempts.
