<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Vinothsingh Elumalai</title>
    <description>The latest articles on DEV Community by Vinothsingh Elumalai (@velumal09).</description>
    <link>https://dev.to/velumal09</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3990085%2Fd45a344f-6547-41e7-a287-d0a81ae20d42.jpg</url>
      <title>DEV Community: Vinothsingh Elumalai</title>
      <link>https://dev.to/velumal09</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/velumal09"/>
    <language>en</language>
    <item>
      <title>How I Built FRIDAY - An Autonomous Incident Investigation Agent That Reduced MTTR by 65%</title>
      <dc:creator>Vinothsingh Elumalai</dc:creator>
      <pubDate>Thu, 18 Jun 2026 04:22:12 +0000</pubDate>
      <link>https://dev.to/velumal09/how-i-built-an-autonomous-incident-investigation-agent-that-reduced-mttr-by-65-42ae</link>
      <guid>https://dev.to/velumal09/how-i-built-an-autonomous-incident-investigation-agent-that-reduced-mttr-by-65-42ae</guid>
      <description>&lt;h2&gt;
  
  
  Series: AI-Native SRE
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The Problem Every On-Call Engineer Knows&lt;/li&gt;
&lt;li&gt;What FRIDAY Does&lt;/li&gt;
&lt;li&gt;Architecture Overview&lt;/li&gt;
&lt;li&gt;Key Design Decisions&lt;/li&gt;
&lt;li&gt;The Tool-Use Loop: How FRIDAY Reasons&lt;/li&gt;
&lt;li&gt;The Training System: Pre-Built Knowledge&lt;/li&gt;
&lt;li&gt;Handling Edge Cases&lt;/li&gt;
&lt;li&gt;Results&lt;/li&gt;
&lt;li&gt;Lessons Learned&lt;/li&gt;
&lt;li&gt;Try It Yourself&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Problem Every On-Call Engineer Knows &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;It's 2:47 AM. Your phone buzzes and it's a P1 alert. You open your laptop, bleary-eyed, and begin the familiar ritual:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open PagerDuty → read the alert title&lt;/li&gt;
&lt;li&gt;Open Datadog → search for the service, find the error spike&lt;/li&gt;
&lt;li&gt;Open GitHub → check if someone deployed something&lt;/li&gt;
&lt;li&gt;Cross-reference timestamps between all three tools&lt;/li&gt;
&lt;li&gt;Form a hypothesis&lt;/li&gt;
&lt;li&gt;Drill deeper — check affected tenants, error paths, queue depths&lt;/li&gt;
&lt;li&gt;Write up findings for the team&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This process takes &lt;strong&gt;15–45 minutes&lt;/strong&gt; for an experienced engineer. For a junior on-call? Sometimes hours. And the cognitive overhead of context-switching between 3-4 tools while sleep-deprived leads to missed signals, false conclusions, and longer outages.&lt;/p&gt;

&lt;p&gt;I asked myself:&lt;/p&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
What if an AI agent could do Steps 1–7 autonomously in under 3 minutes — and deliver structured findings to your team before the on-call engineer even opens their laptop?&lt;br&gt;

&lt;/div&gt;


&lt;p&gt;So I built one. It's been running in production for months, investigating real incidents on a platform serving &lt;strong&gt;30+ million end users&lt;/strong&gt; across multiple AWS regions. We call it &lt;strong&gt;FRIDAY&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  What FRIDAY Does &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;When a PagerDuty alert fires, FRIDAY:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Receives the webhook&lt;/strong&gt; in real-time via API Gateway&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Locks the target region&lt;/strong&gt; from the alert metadata (never investigates the wrong region)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Checks GitHub first&lt;/strong&gt; — finds what changed before the alert (deployments, config changes, PRs)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Queries Datadog&lt;/strong&gt; — error rates, affected tenants, application exceptions, queue depths&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Synthesizes findings&lt;/strong&gt; — correlates code changes with observability signals&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Delivers a structured report&lt;/strong&gt; to Microsoft Teams as an Adaptive Card&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The entire investigation takes &lt;strong&gt;under 2 minutes&lt;/strong&gt;. The on-call engineer wakes up to a complete analysis instead of a raw alert.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture Overview &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────┐     ┌────────────────┐     ┌─────────────────────┐
│  PagerDuty   │────▶│  API Gateway   │────▶│  Lambda (Sync)      │
│  Webhook     │     │  (Validate)    │     │  Parse + Self-Invoke│
└──────────────┘     └────────────────┘     └─────────┬───────────┘
                                                       │ Async
                                                       ▼
                                            ┌─────────────────────┐
                                            │  Lambda (Async)      │
                                            │  Investigation Agent │
                                            │                      │
                                            │  ┌────────────────┐ │
                                            │  │ Amazon Bedrock  │ │
                                            │  │ Claude Opus     │ │
                                            │  │ (Tool-Use Loop) │ │
                                            │  └───────┬────────┘ │
                                            │          │          │
                                            │    ┌─────┼─────┐   │
                                            │    ▼     ▼     ▼   │
                                            │ GitHub Datadog  S3  │
                                            └─────────┬───────────┘
                                                      │
                                                      ▼
                                            ┌─────────────────────┐
                                            │  Microsoft Teams    │
                                            │  (Adaptive Card)    │
                                            └─────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Key Design Decisions &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. Two-Lambda Architecture (Sync + Async)
&lt;/h3&gt;

&lt;p&gt;API Gateway has a &lt;strong&gt;30-second hard timeout&lt;/strong&gt;. A thorough AI investigation takes 60–180 seconds. The solution: the sync Lambda validates the webhook, parses the alert, and immediately self-invokes asynchronously returning &lt;code&gt;200 OK&lt;/code&gt; to PagerDuty within 2 seconds.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Sync handler: validate, parse, self-invoke, return immediately
&lt;/span&gt;&lt;span class="n"&gt;lambda_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;FunctionName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;InvocationType&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Event&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Fire and forget
&lt;/span&gt;    &lt;span class="n"&gt;Payload&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_async_investigate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;alert_payload&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;alert_payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;statusCode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Investigation started&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The async Lambda runs the full investigation without timeout pressure.&lt;/p&gt;
&lt;h3&gt;
  
  
  2. GitHub First, Datadog Second
&lt;/h3&gt;

&lt;p&gt;This is counterintuitive. Most engineers and most AI systems jump straight to observability data when an alert fires. But in my experience, &lt;strong&gt;80%+ of acute incidents are caused by a preceding change&lt;/strong&gt;: a deployment, a config update, a replica count change, a memory limit modification.&lt;/p&gt;

&lt;p&gt;FRIDAY is instructed to check GitHub &lt;em&gt;before&lt;/em&gt; touching Datadog:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MANDATORY FIRST STEP — GitHub (Step 0):
Before touching Datadog, you MUST run these calls in parallel:
1. github_search_repos — find the repo for the alerted service
2. github_list_commits — find commits in the 2 hours before 
   the alert fired

A deployment or config change is the most likely root cause.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;Why this matters:&lt;/strong&gt; When the AI correlates "this PR merged 12 minutes before the error spike" with "5xx errors started at exactly the merge timestamp" — it produces findings that are immediately actionable. This single design decision dramatically improved root cause accuracy.&lt;br&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Region Lock — Preventing Wrong-Region Investigation
&lt;/h3&gt;

&lt;p&gt;Our platform spans multiple AWS regions. A naive agent querying "all 5xx errors" would mix signals from healthy and unhealthy regions, producing confused analysis.&lt;/p&gt;

&lt;p&gt;FRIDAY's first action is always to &lt;strong&gt;lock a target region&lt;/strong&gt; from the alert metadata:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🌍 Region: Description — resolved from alert hostname

Every subsequent Datadog query includes:
kube_cluster_name:region-az-* (scoped to affected region only)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This eliminated an entire class of false-positive findings where the AI would cite errors from an unrelated region.&lt;/p&gt;
&lt;h3&gt;
  
  
  4. Structured Output Contract
&lt;/h3&gt;

&lt;p&gt;FRIDAY's output isn't freeform text. It follows a strict section contract that the Teams integration parses into visual containers:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## EXECUTIVE SUMMARY&lt;/span&gt;
[2-3 sentences — what happened, who's affected, what changed]

&lt;span class="gu"&gt;## KEY FINDINGS&lt;/span&gt;
[Bulleted evidence from GitHub + Datadog]

&lt;span class="gu"&gt;## WHAT CHANGED&lt;/span&gt;
[Specific commit/PR with timestamp and author]

&lt;span class="gu"&gt;## ERROR BREAKDOWN&lt;/span&gt;
[Service-by-service error counts with affected tenants]

&lt;span class="gu"&gt;## ROOT CAUSE&lt;/span&gt;
[Confirmed / Suspected / Unknown — with evidence chain]

&lt;span class="gu"&gt;## CUSTOMER IMPACT&lt;/span&gt;
[Affected tenants, operations, scope]

&lt;span class="gu"&gt;## RECOMMENDED ACTIONS&lt;/span&gt;
[Specific next steps for the on-call engineer]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The on-call engineer can glance at the Teams card and immediately know: &lt;em&gt;what happened, who's affected, what likely caused it, and what to do next&lt;/em&gt;  without reading a wall of text.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Tool-Use Loop: How FRIDAY Reasons &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;FRIDAY uses Claude's tool-use capability in a multi-round loop. The AI doesn't execute a fixed script — it &lt;strong&gt;reasons&lt;/strong&gt; about each alert independently, deciding which tools to call based on what it's learned so far.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;round_num&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MAX_TOOL_ROUNDS&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  &lt;span class="c1"&gt;# Max 25 rounds
&lt;/span&gt;    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bedrock_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;converse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;modelId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic.claude-opus&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;toolConfig&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;TOOL_DEFINITIONS&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;stop_reason&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_use&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Execute tools, append results, continue reasoning
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tool_call&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;content_blocks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;execute_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;tool_call&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; 
                &lt;span class="n"&gt;tool_call&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;tool_results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;stop_reason&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;end_turn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# AI has concluded — extract findings
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;extract_final_report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content_blocks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Available Tools
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;github_search_repos&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Find which repo owns a service&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;github_list_commits&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;What changed before the alert&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;github_get_file&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Read actual deployment configs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;github_search_code&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Find all producers/consumers of a queue&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;datadog_log_search&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Find specific error messages&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;datadog_log_aggregate&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Count errors by backend/tenant/path&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;datadog_query_metrics&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Queue depth, CPU, memory, latency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;datadog_get_monitor&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Understand what threshold triggered&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The AI typically uses &lt;strong&gt;8–15 tool calls per investigation&lt;/strong&gt;, batching parallel calls when possible to minimize round-trip time.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Training System: Pre-Built Knowledge &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;A cold investigation — where the AI knows nothing about your infrastructure — is slow and imprecise. FRIDAY includes a &lt;strong&gt;deterministic training mode&lt;/strong&gt; that pre-builds architectural knowledge:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;train&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Deterministic training:
    ~13 targeted API calls, then one Bedrock synthesis call.

    Collects: cluster-service maps, HAProxy backends, 
    chronic error baselines, recent planned work.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Phase 1: Targeted data collection (no AI — pure API calls)
&lt;/span&gt;    &lt;span class="n"&gt;collected&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool_input&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;TRAINING_CALLS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;collected&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;execute_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool_input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Phase 2: Single AI synthesis call
&lt;/span&gt;    &lt;span class="n"&gt;knowledge_doc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;synthesize_knowledge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;collected&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Phase 3: Save to S3 — injected into system prompt
&lt;/span&gt;    &lt;span class="nf"&gt;save_to_s3&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;knowledge_doc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The knowledge document contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cluster → Service map&lt;/strong&gt; — What runs where&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chronic error baselines&lt;/strong&gt; — Background noise to ignore (not incidents)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recent planned work&lt;/strong&gt; — Deployments and migrations that explain expected errors&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backend inventory&lt;/strong&gt; — Every backend serving traffic&lt;/li&gt;
&lt;/ul&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;Key insight:&lt;/strong&gt; Knowledge injection &amp;gt; Larger context windows. A &lt;em&gt;synthesized&lt;/em&gt; knowledge document — curated, current, and actionable — is more effective than dumping raw infrastructure documentation into the prompt. It captures &lt;em&gt;real state&lt;/em&gt;, not &lt;em&gt;aspirational state&lt;/em&gt;.&lt;br&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Handling Edge Cases &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Planned Work vs. Real Incidents
&lt;/h3&gt;

&lt;p&gt;One of the hardest problems: distinguishing planned maintenance from real outages. During a Kubernetes cluster migration, you &lt;em&gt;expect&lt;/em&gt; 5xx errors as traffic drains. FRIDAY handles this through:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Knowledge injection&lt;/strong&gt; — Training mode captures recent PRs tagged as planned work&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-time PR correlation&lt;/strong&gt; — During investigation, it reads PR bodies for keywords like "decommission", "drain", "planned"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Explicit classification&lt;/strong&gt; — If a 5xx spike coincides with a merged "failover" PR, FRIDAY reports:&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"This alert coincides with planned cluster decommission. Errors are expected during traffic drain. No incident action required."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Force-Completion Under Round Limits
&lt;/h3&gt;

&lt;p&gt;What happens when an investigation is complex and approaching the 25-round tool limit? FRIDAY has a &lt;strong&gt;graceful degradation&lt;/strong&gt; mechanism:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;rounds_remaining&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;user_content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;STOP CALLING TOOLS. Write your FINAL report &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NOW using all data collected so far. Mark &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;uncertain findings as &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Suspected&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; rather &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;than skipping them.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This ensures every investigation produces a report — even if incomplete — rather than timing out silently.&lt;/p&gt;
&lt;h3&gt;
  
  
  Deduplication
&lt;/h3&gt;

&lt;p&gt;PagerDuty retries webhooks. FRIDAY handles this at two levels:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Webhook-level&lt;/strong&gt; — In-memory cache of webhook IDs (survives Lambda warm starts)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incident-level&lt;/strong&gt; — S3 marker files prevent re-investigating the same incident&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Results &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;After running in production for several months:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before FRIDAY&lt;/th&gt;
&lt;th&gt;After FRIDAY&lt;/th&gt;
&lt;th&gt;Improvement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Mean Time to First Analysis&lt;/td&gt;
&lt;td&gt;15–45 min&lt;/td&gt;
&lt;td&gt;90 sec–3 min&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~90% faster&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MTTR (overall)&lt;/td&gt;
&lt;td&gt;~60 min&lt;/td&gt;
&lt;td&gt;~15 min&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;65% reduction&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI tool adoption (team)&lt;/td&gt;
&lt;td&gt;20%&lt;/td&gt;
&lt;td&gt;85%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4x increase&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Alert noise (false escalations)&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Minimal&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~80% reduction&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Auto-generated postmortems&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;100% of P1/P2&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Eliminated manual RCA drafts&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
The most impactful change isn't the speed — it's the &lt;strong&gt;consistency&lt;/strong&gt;. A human engineer at 3 AM makes mistakes: investigates the wrong region, misses a recent deployment, forgets to check queue depths. FRIDAY follows the same rigorous methodology every time.&lt;br&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Lessons Learned &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Prompt Engineering IS Architecture
&lt;/h3&gt;

&lt;p&gt;The system prompt is the most important file in the codebase. It's not instructions — it's the agent's &lt;strong&gt;operating manual&lt;/strong&gt;. Ours is ~5,000 words covering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Environment topology (region mappings, cluster roles, service dependencies)&lt;/li&gt;
&lt;li&gt;Investigation methodology (step-by-step procedures)&lt;/li&gt;
&lt;li&gt;Critical rules (what NOT to do — as important as what to do)&lt;/li&gt;
&lt;li&gt;Output format contract&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Invest in your prompt like you invest in your architecture docs.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. "GitHub First" Was the Single Biggest Win
&lt;/h3&gt;

&lt;p&gt;Before this rule, the AI would spend 10+ rounds querying Datadog, building elaborate theories about traffic patterns — then discover a config change was merged 5 minutes before the alert. Now it finds the root cause in rounds 1-2 for ~80% of change-induced incidents.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. You Need Guardrails, Not Just Capabilities
&lt;/h3&gt;

&lt;p&gt;FRIDAY is explicitly told it &lt;strong&gt;does NOT take remediation actions&lt;/strong&gt;. It investigates, analyzes, and reports. A human validates and acts. This is not a limitation — it's a &lt;strong&gt;design choice that builds trust&lt;/strong&gt;. When on-call engineers trust the AI's analysis, they act on it faster.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Separate Investigation from Notification
&lt;/h3&gt;

&lt;p&gt;The two-Lambda pattern (sync for webhook receipt, async for investigation) is essential. Don't let API Gateway timeouts dictate your AI agent's investigation depth.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;We're extending this pattern to &lt;strong&gt;autonomous security remediation&lt;/strong&gt; — an agent that ingests vulnerability findings, generates IaC fixes, deploys through GitOps, verifies no impact, and requests human approval before proceeding. Same tool-use architecture, different domain.&lt;/p&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
The future of SRE isn't "AI-assisted." It's &lt;strong&gt;AI-native&lt;/strong&gt;: systems designed from the ground up with autonomous agents as first-class participants in the operational loop.&lt;br&gt;

&lt;/div&gt;





&lt;h2&gt;
  
  
  Try It Yourself &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;The pattern is reproducible with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Amazon Bedrock&lt;/strong&gt; (Claude Opus or Sonnet for cost-sensitive use)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Any webhook source&lt;/strong&gt; (PagerDuty, Opsgenie, Datadog)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Any observability platform&lt;/strong&gt; with an API (Datadog, Grafana, New Relic)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Any source control&lt;/strong&gt; (GitHub, GitLab)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Any chat platform&lt;/strong&gt; (Teams, Slack)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The hard part isn't the code — it's the &lt;strong&gt;system prompt&lt;/strong&gt;. That's where your SRE expertise lives. The AI is the execution engine; your knowledge of your infrastructure is what makes it useful.&lt;/p&gt;




&lt;p&gt;&lt;/p&gt;
  What does FRIDAY stand for?
  &lt;br&gt;
FRIDAY is named after Tony Stark's AI assistant in the Marvel universe. Because if I'm going to be on-call at 2 AM, I at least deserve a butler. ☕

&lt;p&gt;The name also works as a backronym: &lt;strong&gt;F&lt;/strong&gt;irst &lt;strong&gt;R&lt;/strong&gt;esponder for &lt;strong&gt;I&lt;/strong&gt;ncident &lt;strong&gt;D&lt;/strong&gt;iagnostics and &lt;strong&gt;A&lt;/strong&gt;nal*&lt;em&gt;Y&lt;/em&gt;*sis — but honestly, we just thought the Marvel reference was cooler.&lt;br&gt;
&lt;/p&gt;

&lt;br&gt;
&lt;p&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I'm Vinothsingh Elumalai, a Platform Engineering leader building AI-native operations at enterprise scale. I lead the Platform team for a global IAM/SSO platform serving 30M+ users. Currently exploring how agentic AI transforms SRE from reactive firefighting to autonomous, closed-loop operations.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This is Part 1 of my &lt;strong&gt;AI-Native SRE&lt;/strong&gt; series. Part 2 will cover JARVIS — an autonomous vulnerability remediation agent that fixes security findings through GitOps with human approval gates.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://linkedin.com/in/vinothsingh-elumalai-88967251" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;Connect on LinkedIn&lt;/a&gt;
&lt;/p&gt;

</description>
      <category>aiops</category>
      <category>sre</category>
      <category>cloudnative</category>
      <category>aws</category>
    </item>
  </channel>
</rss>
