DEV Community: Vinothsingh Elumalai

How an AI Terminal Assistant Became My Team's Most Productive Engineer - Opencode + Claude + MCP

Vinothsingh Elumalai — Wed, 24 Jun 2026 18:31:54 +0000

The Moment That Changed Everything
What It Actually Is
The Setup Nobody Believes Is This Simple
Focused Sessions — One Agent, One Mission
Sub-Agents With Specific Skill Sets
Building MCPs for Your Own Stack
What We've Actually Achieved
How It Became a Force Multiplier for Incident Response
From OpenCode to FRIDAY — The Agent That Investigates Incidents Autonomously
From FRIDAY to JARVIS — Thinking About Write-Path Autonomy
Deleting 130,000 Accounts Without Writing Code
Learning the Entire Product in Conversations
Why This Is Different From ChatGPT
The Uncomfortable Truth
What's Next

The Moment That Changed Everything

It was 11pm on a Tuesday. A cache migration in our production environment had just caused thousands of authentication failures for two of our largest enterprise customers. Our VP of Product wanted answers. Our support team was fielding escalations. And our engineers were alt-tabbing between AWS console, Datadog, GitHub, Azure DevOps, and PagerDuty trying to piece together what happened.

Three weeks later, when we needed to attempt the same change again, an engineer typed this into a terminal:

"Review the ADO change ticket, compare the MOP against the actual ElastiCache configuration in prod region, check the K8s config repo for how Redis env vars are wired on the Green cluster, and tell me if this approach avoids the token validation failure that caused the previous customer impact."

Fourteen seconds later, the system had pulled the work item, queried AWS ElastiCache across four regions, read the Kubernetes configuration from GitHub, cross-referenced the deployment patches, and delivered a precise technical assessment including a risk it identified that the team hadn't documented: in-flight tokens during the 30–60 second Global Accelerator propagation window.

That system is OpenCode — an AI-powered CLI assistant connected to our entire operational stack through the MC(Model Context Protocol). And it has fundamentally changed how a 20-person platform engineering team manages infrastructure serving thousands of enterprise tenants and processing millions of authentication requests daily.

What It Actually Is

OpenCode is deceptively simple in concept. A terminal application on an engineer's laptop. You type questions or tasks in plain English. It responds with answers pulled from live production systems.

  Engineer (terminal)
        │
        ▼
    OpenCode (Claude AI)
        │
        ▼
    MCP Servers
   ╱  │  │  │  ╲
  ▼   ▼  ▼  ▼   ▼
 AWS  DD  GH ADO PD  RD

 AWS = Amazon Web Services (prod + non-prod)
 DD  = Datadog (logs, metrics, monitors)
 GH  = GitHub (repos, PRs, code)
 ADO = Azure DevOps (tickets, sprints, wikis)
 PD  = PagerDuty (incidents, schedules)
 RD  = Rundeck (jobs, executions)

The magic is in those MCP servers. Each one is a lightweight connector to a backend platform. When you ask a question, the AI doesn't guess — it makes real API calls against real systems and works with real data.

Ask "what's our AWS spend this month?" — it queries Cost Explorer. Ask "which tenant generates the most provisioning traffic?" — it aggregates Datadog logs. Ask "what did that PR change in the K8s config repo?" — it reads the actual file diff from GitHub. Ask all three in the same sentence and it does them in parallel.

No pre-built dashboards. No saved queries. No runbooks to follow. You just ask.

The Setup Nobody Believes Is This Simple

The entire configuration is a single JSON file. Each MCP server gets a block: here's the server binary, here's the credentials, connect.

{
  "mcpServers": {
    "aws-prod": {
      "command": "aws-mcp-server",
      "env": { "AWS_PROFILE": "prod" }
    },
    "datadog": {
      "command": "datadog-mcp-server",
      "env": {
        "DD_API_KEY": "...",
        "DD_APP_KEY": "..."
      }
    },
    "github": {
      "command": "github-mcp-server",
      "env": { "GITHUB_TOKEN": "..." }
    }
  }
}

The AI model never sees the credentials. It calls tools by name — "search logs in Datadog" or "describe EKS clusters" and the MCP server handles authentication, pagination, error handling, and response formatting.

Adding a new system takes about ten minutes. Write a config block, provide credentials, restart.

Focused Sessions — One Agent, One Mission

Here's something that changes how you think about AI assistants: you can create focused sessions with a single purpose.

Right now, as I write this article, I have an OpenCode session that's been running for days as a documentation advisor. It's reviewed my architecture docs, drafted technical articles, generated formal roadmap documents, and is tracking project milestones. When I start a new conversation about something unrelated, I can tell the session: "This session is reserved for documentation work only" — and it keeps me focused.

This pattern works for any focused workstream:

Session	Purpose	Tools Used
Documentation Advisor	Article drafting, roadmap generation, technical writing	Doc Agent, GitHub, web search
Incident Responder	Active incident investigation and RCA	Datadog, GitHub, PagerDuty, AWS
Cost Analyst	Monthly spend review, waste identification	AWS (Cost Explorer, EC2, RDS, S3)
Sprint Planner	Ticket creation, backlog grooming, capacity planning	Azure DevOps, GitHub
Security Reviewer	Code review, vulnerability assessment	GitHub, AWS (IAM, SecurityHub)

Each session maintains context across the entire conversation. The AI remembers what you discussed 3 hours ago. It builds on previous findings. It doesn't start from zero every time.

This is the real power: Not one assistant that does everything poorly. Multiple focused sessions, each purpose-built for a specific mission, with the right tools connected and the right context loaded.

Sub-Agents With Specific Skill Sets

Beyond focused sessions, you can create sub-agents — specialized configurations trained for specific domains:

The Doc Agent

Generates formal documents — postmortems, RCA reports, roadmaps, technical specs. It knows document templates, formatting standards, and outputs polished Word/PDF files.

I used this to generate formal migration roadmaps, architecture documents, and execution playbooks — all properly formatted, ready to share with leadership.

The ADO Agent

Creates, updates, and queries Azure DevOps work items. It understands your project structure, sprint cadence, and ticket hierarchy (Epic → Feature → Task).

One prompt: "Create a Feature under the cleanup Epic with 8 tasks — one per batch" — and 9 tickets exist with proper hierarchy, descriptions, and assignments.

The AWS Agent

Queries across all regions, all services. Cost analysis, resource inventory, security posture review. Runs in read-only mode with separate IAM profiles for prod vs. non-prod.

The Incident Agent

Connected to PagerDuty + Datadog + GitHub. When an alert fires, it pulls the monitor definition, searches logs, checks recent deployments, and synthesizes findings. This is the agent that eventually became FRIDAY — but more on that later.

The GitHub Agent

Code review, PR analysis, repository search. It reads actual code and configs, not summaries. When someone asks "what changed in the proxy config last week?" — it reads every commit.

Build Your Own

Any system with an API can become an MCP server. The pattern is:

Find or build an MCP server for the platform (many exist: AWS, Datadog, GitHub, PagerDuty, Slack, Jira, Confluence...)
If one doesn't exist — build a lightweight one. An MCP server is just a program that exposes tools via the MCP protocol. A basic one is ~100 lines of Python.
Add it to your config — one JSON block with the command and credentials
Restart — the AI can now query that system

# A minimal custom MCP server (simplified):
@mcp.tool()
def query_my_system(query: str) -> str:
    """Query our internal API"""
    response = requests.get(
        f"https://internal-api.company.com/search",
        params={"q": query},
        headers={"Authorization": f"Bearer {API_KEY}"}
    )
    return response.json()

The principle: If your tech stack has an API, you can make it conversational. DNS provider? MCP server. Internal CMDB? MCP server. Terraform state? MCP server. The AI becomes as capable as the tools you connect to it.

What We've Actually Achieved

This isn't a proof of concept. Here's what production operational work looks like with OpenCode:

The >$100K/Month Cost Discovery

Finance asked: "What does each customer cost us?" In shared infrastructure where a single proxy pod serves all tenants — the conventional answer is "we can't really tell you."

We asked OpenCode. One session:

Pulled monthly billing from AWS Cost Explorer: $308,763 across four regions
Discovered database Storage IO alone was $40,985/month
Switched to Datadog, aggregated 149 million proxy log entries from 7 days
Broke down by tenant: top customers = 43% of all platform traffic
Identified two accounts consuming 60% of all activity
Found $>100,000/month in addressable waste

Output: a 361-line Word document with every number traced to an API response. Not estimates. Not SWAGs. Production telemetry.

The Unused Resources Audit

Across two AWS accounts and four regions:

unassociated Elastic IPs (including legacy BYOIP blocks)
load balancers attached to decommissioned clusters
duplicate NAT gateways in the same subnets
A temporary RDS instance someone forgot about
Lambda functions on end-of-life Node.js runtimes

Output: an Excel workbook, color-coded, with subtotals. Combined waste: ~$3,200/month plus the $97K overlap.

The Cache Migration Pre-Mortem

When the retry was planned, one prompt produced a full technical assessment:

Read the ADO ticket for the method of procedure
Queried ElastiCache for current topology
Read 20KB of Kubernetes YAML from GitHub
Identified a risk the team hadn't documented: in-flight tokens during the 30-60 second traffic propagation window

Total time: 14 seconds.

How It Became a Force Multiplier for Incident Response

The first time I used OpenCode during a live incident, I realized something: the AI was doing incident investigation faster and more consistently than our engineers.

Not because it's smarter — because it doesn't context-switch.

A human investigating an incident opens:

PagerDuty → read the alert
Datadog → search for the service, find the error spike
GitHub → check if someone deployed something
Cross-reference timestamps between all three tools
Form a hypothesis
Drill deeper — check affected tenants, error paths, queue depths
Write up findings

That's 15-45 minutes for an experienced engineer. More for a junior. And the cognitive overhead of switching between tools while sleep-deprived leads to missed signals and wrong conclusions.

With OpenCode, the same investigation is one conversation:

"PagerDuty alert fired on proxy 5xx errors in EU. Check Datadog for error rates by backend and affected tenants. Check GitHub for any recent deployments to the primary EU cluster. What changed?"

90 seconds to 3 minutes. Every time. No context switching. No missed signals. No investigating the wrong region.

The realization: If the AI can do this interactively in my terminal session, it can do it autonomously when triggered by a PagerDuty webhook. It doesn't need me to type the question — it can formulate the question itself from the alert payload.

That realization created FRIDAY.

From OpenCode to FRIDAY — The Agent That Investigates Incidents Autonomously

FRIDAY is essentially OpenCode's incident investigation pattern, extracted into a Lambda that runs without a human typing the questions.

The evolution:

Stage	System	Human Involvement	Response Time
Before	5 dashboards + manual investigation	100% human	15-45 minutes
OpenCode	AI-assisted investigation (human asks)	Human types the prompt	90 seconds
FRIDAY	Autonomous investigation (webhook triggers)	Human reads the findings	90 seconds (automated)

Same tools. Same reasoning pattern. Same output format. But no human in the loop for the investigation phase — the on-call engineer wakes up to finished analysis instead of a raw alert.

Results after months in production:

65% MTTR reduction
85% AI tool adoption across the engineering team (up from 20%)
~80% reduction in false escalations

From FRIDAY to JARVIS — Thinking About Write-Path Autonomy

Once FRIDAY proved that an AI agent could reliably investigate production incidents (read-only), the natural question was: can it also fix things?

Not incidents — those require human judgment in the moment. But vulnerability remediation — the routine security fixes that follow a predictable pattern

JARVIS is designed to handle that 80% — the routine fixes where the remediation is well-understood and the verification is automatable. Human approval gates at every stage. Automatic rollback if anything breaks.

The progression: OpenCode showed that AI + MCP tools can reason across multiple systems. FRIDAY proved it can do this autonomously for investigation. JARVIS extends it to autonomous remediation — with guardrails. Each step builds trust for the next.

Deleting 130k Accounts Without Writing Code

Here's something that surprised even me: a complex tenant cleanup operation was largely driven through OpenCode sessions — by someone who didn't write the rake task.

The rake task existed. But executing it required understanding:

Which accounts to target (CSV generation from production queries)
How to configure the runner (batch sizes, offsets, phase tracking)
How to monitor progress (Datadog dashboard interpretation)
How to troubleshoot when things broke (a 48K-user account hung the process, a zombie process ran for 35 hours, a metrics API bug caused silent data loss)
How to communicate status (ADO tickets, Teams updates, DBA coordination)

OpenCode handled all of this conversationally:

"What's the current status of Phase 2? Check the Rundeck execution and the Datadog dashboard."

"Create a task under the cleanup Feature for Phase 4 execution. Include the batch count, estimated timeline, and dependencies."

"The runner seems stuck. Check processes on the worker for any rake tasks. What's happening?"

"The DBA says the events database CPU spiked. Pull the top queries from the RDS monitoring dashboard. Cross-reference the account IDs with our cleanup CSV."

Each of these would normally require logging into 2-3 systems, running manual queries, and synthesizing results. With OpenCode, it's a conversation.

The insight: OpenCode doesn't just help experts go faster. It enables non-experts to execute complex operations by providing the context and tool access they'd otherwise lack. You don't need to know how to read a Datadog dashboard if you can ask "are there any errors related to our cleanup?"

Learning the Entire Product in Conversations

When I joined the team, understanding the full platform took months. Multiple microservices. AWS regions. EKS clusters. Proxy backends. RabbitMQ queues. Aurora databases. Redis caches.

No single engineer understands all of it. The knowledge is distributed across dozens of people, hundreds of documents, and thousands of configuration files.

OpenCode changed how new team members (and existing ones exploring unfamiliar areas) learn the platform:

"How does the push notification service work? What's its architecture? Where does it run, what does it depend on?"

The AI reads the K8s config repo, checks which clusters the service is deployed to, reads the deployment YAML for dependencies (RabbitMQ queues, SNS topics, Redis), and synthesizes a technical overview — from live configuration, not stale documentation.

"What happens when a user logs in via SAML? Trace the request path from the browser through the proxy to the backend services."

It reads the proxy backend configuration from GitHub, identifies the routing rules, checks which services handle SAML assertions, and traces the dependency chain — all from actual config files and Datadog service maps.

This isn't replacing documentation. It's making the infrastructure self-documenting. The source of truth isn't a wiki page someone wrote 18 months ago — it's the live configuration that the AI reads in real-time.

Why This Is Different From ChatGPT

Every engineer has pasted error messages into ChatGPT. That's not what this is.

	ChatGPT	OpenCode + MCP
Data source	General training data	Live production systems via API
Specificity	"A 401 error usually means..."	"Your API gateway generated 1.3 million of them yesterday"
Infrastructure	Doesn't know your systems	Queries your actual AWS, Datadog, GitHub
Freshness	Training cutoff	Real-time data
Hallucination	Common for specifics	Can't hallucinate API responses
Action	Suggests what to do	Does it (queries, aggregates, cross-references)

The model doesn't need to be told which tools to use. Ask "is the cache migration approach safe?" and it independently decides to: read the ADO ticket, query ElastiCache, read the K8s config, compare env var wiring, and synthesize. The engineer didn't specify any of those steps.

The Uncomfortable Truth

The uncomfortable truth is that most of what this system does isn't hard. Any senior engineer can query AWS Cost Explorer, aggregate Datadog logs, read a GitHub PR, and review an ADO ticket.

The hard part is doing all of them in the same mental context, in the same hour, without losing the thread.

An engineer investigating the cache migration opens AWS in one tab, Datadog in another, GitHub in a third, ADO in a fourth, terminal in a fifth. Copy cache endpoint addresses, paste into GitHub search, cross-reference with K8s config, check ADO for the deployment timeline, look at Datadog for the error spike. Context switches. Tab switches. Copy-paste. Scroll. Search. Repeat.

The system doesn't eliminate the need for engineering judgment. The engineer still decides whether the approach is safe, whether the risk is acceptable, whether the cost model makes sense. What the system eliminates is the mechanical overhead of gathering the information needed to make those decisions.

That overhead, across a 20-person team managing multi-region production infrastructure for a global identity platform, adds up to something significant.

What's Next

We're six MCP servers in. The gaps are obvious: DNS management, direct Kubernetes cluster access for kubectl operations, Confluence for documentation. Each one is a JSON config block and a credential away from being connected.

But the more interesting trajectory isn't more connectors — it's more autonomy:

┌────────────────────────────────────────────────────────┐
│  THE PROGRESSION                                        │
│                                                         │
│  Stage 1: Answer questions (OpenCode — today)           │
│     "What caused the 401 errors?"                       │
│                                                         │
│  Stage 2: Investigate autonomously (FRIDAY — live)      │
│     PagerDuty webhook → full analysis in 90 seconds     │
│                                                         │
│  Stage 3: Remediate autonomously (JARVIS — designing)   │
│     Vulnerability finding → PR → deploy → verify        │
│                                                         │
│  Stage 4: Predict and prevent (future)                  │
│     Detect anomaly → correlate → alert before impact    │
└────────────────────────────────────────────────────────┘

Each stage builds trust for the next. Read-only first. Then write-path with approval gates. Then proactive monitoring. Then autonomous prevention.

The technology is ready for all of it. The trust model is what needs to catch up. We run AWS in read-only mode for a reason. But the trajectory is clear.

For now, we'll settle for the fact that when Finance asks "what does our biggest customer cost us?" — we answer with a number that came from production telemetry, not a spreadsheet someone made up.

The complete OpenCode + MCP ecosystem
Current MCP Servers (6):

AWS Production (IAM profile: prod, read-only)
AWS Non-Production (IAM profile: non-prod, read-only)
Datadog (logs, metrics, monitors, dashboards)
GitHub (org repos, PRs, commits, code search)
Azure DevOps (work items, sprints, wikis, pipelines)
PagerDuty (incidents, schedules, escalation policies)

Sub-Agents Built:

Doc Agent (postmortems, RCAs, roadmaps → Word/PDF)
Incident Agent (FRIDAY — autonomous, Lambda-based)
ADO Agent
AWS Agent
PD Agent
Github Agent

The beauty of this setup is that you can hook up as many tools as you want and create sub-agents for each of them — or just have one agent connected to everything. There's no right answer. Some engineers on my team prefer a single session with all 6 MCP servers connected — they ask about AWS costs, then pivot to a GitHub PR review, then check a PagerDuty schedule, all in one conversation. Others prefer focused agents: an AWS-only session for cost analysis, a Datadog-only session for incident investigation, a GitHub-only session for code review. The system doesn't impose a pattern and it adapts to how you think. Start with one MCP server. Connect your observability platform, or your ticketing system, or your cloud provider — whichever one you spend the most time context-switching into. Once you see the AI pull live data from it in a conversation, you'll immediately know which system to connect next. Within a week, you'll wonder how you ever operated without it

Articles in the AI-Native SRE Series:

FRIDAY — Autonomous Incident Investigation
JARVIS — Autonomous Vulnerability Remediation (upcoming)
Tenant Cleanup — Live Debugging at Scale (upcoming)
Platform Command Center (upcoming)
Rundeck Migration — Legacy Jobs to Cloud-Native (upcoming)
This article — the origin story

I'm Vinothsingh Elumalai, a Platform Engineering leader building AI-native operations at enterprise scale. I lead infrastructure for a global IAM/SSO platform serving millions of users across multiple AWS regions. This article is the origin story of everything in my AI-Native SRE series.

Connect with me on LinkedIn — I write about the intersection of AI, DevOps, and the future of platform engineering.

Follow for the Full Series

How I Built FRIDAY ? An Autonomous Incident Investigation Agent That Reduced MTTR by 65%

Vinothsingh Elumalai — Thu, 18 Jun 2026 04:22:12 +0000

Series: AI-Native SRE

The Problem Every On-Call Engineer Knows
What FRIDAY Does
Architecture Overview
Key Design Decisions
The Tool-Use Loop: How FRIDAY Reasons
The Training System: Pre-Built Knowledge
Handling Edge Cases
Results
Lessons Learned
Try It Yourself

The Problem Every On-Call Engineer Knows

It's 2:47 AM. Your phone buzzes and it's a P1 alert. You open your laptop, bleary-eyed, and begin the familiar ritual:

Open PagerDuty → read the alert title
Open Datadog → search for the service, find the error spike
Open GitHub → check if someone deployed something
Cross-reference timestamps between all three tools
Form a hypothesis
Drill deeper — check affected tenants, error paths, queue depths
Write up findings for the team

This process takes 15–45 minutes for an experienced engineer. For a junior on-call? Sometimes hours. And the cognitive overhead of context-switching between 3-4 tools while sleep-deprived leads to missed signals, false conclusions, and longer outages.

I asked myself:

What if an AI agent could do Steps 1–7 autonomously in under 3 minutes — and deliver structured findings to your team before the on-call engineer even opens their laptop?

So I built one. It's been running in production for months, investigating real incidents on a platform serving 30+ million end users across multiple AWS regions. We call it FRIDAY.

What FRIDAY Does

When a PagerDuty alert fires, FRIDAY:

Receives the webhook in real-time via API Gateway
Locks the target region from the alert metadata (never investigates the wrong region)
Checks GitHub first — finds what changed before the alert (deployments, config changes, PRs)
Queries Datadog — error rates, affected tenants, application exceptions, queue depths
Synthesizes findings — correlates code changes with observability signals
Delivers a structured report to Microsoft Teams as an Adaptive Card

The entire investigation takes under 2 minutes. The on-call engineer wakes up to a complete analysis instead of a raw alert.

Architecture Overview

┌──────────────┐     ┌────────────────┐     ┌─────────────────────┐
│  PagerDuty   │────▶│  API Gateway   │────▶│  Lambda (Sync)      │
│  Webhook     │     │  (Validate)    │     │  Parse + Self-Invoke│
└──────────────┘     └────────────────┘     └─────────┬───────────┘
                                                       │ Async
                                                       ▼
                                            ┌─────────────────────┐
                                            │  Lambda (Async)      │
                                            │  Investigation Agent │
                                            │                      │
                                            │  ┌────────────────┐ │
                                            │  │ Amazon Bedrock  │ │
                                            │  │ Claude Opus     │ │
                                            │  │ (Tool-Use Loop) │ │
                                            │  └───────┬────────┘ │
                                            │          │          │
                                            │    ┌─────┼─────┐   │
                                            │    ▼     ▼     ▼   │
                                            │ GitHub Datadog  S3  │
                                            └─────────┬───────────┘
                                                      │
                                                      ▼
                                            ┌─────────────────────┐
                                            │  Microsoft Teams    │
                                            │  (Adaptive Card)    │
                                            └─────────────────────┘

Key Design Decisions

1. Two-Lambda Architecture (Sync + Async)

API Gateway has a 30-second hard timeout. A thorough AI investigation takes 60–180 seconds. The solution: the sync Lambda validates the webhook, parses the alert, and immediately self-invokes asynchronously returning 200 OK to PagerDuty within 2 seconds.

# Sync handler: validate, parse, self-invoke, return immediately
lambda_client.invoke(
    FunctionName=context.function_name,
    InvocationType="Event",  # Fire and forget
    Payload=json.dumps({
        "_async_investigate": True,
        "alert_payload": alert_payload,
    }),
)
return {"statusCode": 200, "body": "Investigation started"}

The async Lambda runs the full investigation without timeout pressure.

2. GitHub First, Datadog Second

This is counterintuitive. Most engineers and most AI systems jump straight to observability data when an alert fires. But in my experience, 80%+ of acute incidents are caused by a preceding change: a deployment, a config update, a replica count change, a memory limit modification.

FRIDAY is instructed to check GitHub before touching Datadog:

MANDATORY FIRST STEP — GitHub (Step 0):
Before touching Datadog, you MUST run these calls in parallel:
1. github_search_repos — find the repo for the alerted service
2. github_list_commits — find commits in the 2 hours before 
   the alert fired

A deployment or config change is the most likely root cause.

Why this matters: When the AI correlates "this PR merged 12 minutes before the error spike" with "5xx errors started at exactly the merge timestamp" — it produces findings that are immediately actionable. This single design decision dramatically improved root cause accuracy.

3. Region Lock — Preventing Wrong-Region Investigation

Our platform spans multiple AWS regions. A naive agent querying "all 5xx errors" would mix signals from healthy and unhealthy regions, producing confused analysis.

FRIDAY's first action is always to lock a target region from the alert metadata:

🌍 Region: Description — resolved from alert hostname

Every subsequent Datadog query includes:
kube_cluster_name:region-az-* (scoped to affected region only)

This eliminated an entire class of false-positive findings where the AI would cite errors from an unrelated region.

4. Structured Output Contract

FRIDAY's output isn't freeform text. It follows a strict section contract that the Teams integration parses into visual containers:

## EXECUTIVE SUMMARY
[2-3 sentences — what happened, who's affected, what changed]

## KEY FINDINGS
[Bulleted evidence from GitHub + Datadog]

## WHAT CHANGED
[Specific commit/PR with timestamp and author]

## ERROR BREAKDOWN
[Service-by-service error counts with affected tenants]

## ROOT CAUSE
[Confirmed / Suspected / Unknown — with evidence chain]

## CUSTOMER IMPACT
[Affected tenants, operations, scope]

## RECOMMENDED ACTIONS
[Specific next steps for the on-call engineer]

The on-call engineer can glance at the Teams card and immediately know: what happened, who's affected, what likely caused it, and what to do next without reading a wall of text.

The Tool-Use Loop: How FRIDAY Reasons

FRIDAY uses Claude's tool-use capability in a multi-round loop. The AI doesn't execute a fixed script — it reasons about each alert independently, deciding which tools to call based on what it's learned so far.

for round_num in range(MAX_TOOL_ROUNDS):  # Max 25 rounds
    response = bedrock_client.converse(
        modelId="anthropic.claude-opus",
        messages=messages,
        toolConfig={"tools": TOOL_DEFINITIONS},
    )

    if stop_reason == "tool_use":
        # Execute tools, append results, continue reasoning
        for tool_call in content_blocks:
            result = execute_tool(
                tool_call["name"], 
                tool_call["input"]
            )
            tool_results.append(result)
        messages.append(tool_results)

    elif stop_reason == "end_turn":
        # AI has concluded — extract findings
        return extract_final_report(content_blocks)

Available Tools

Tool	Purpose
`github_search_repos`	Find which repo owns a service
`github_list_commits`	What changed before the alert
`github_get_file`	Read actual deployment configs
`github_search_code`	Find all producers/consumers of a queue
`datadog_log_search`	Find specific error messages
`datadog_log_aggregate`	Count errors by backend/tenant/path
`datadog_query_metrics`	Queue depth, CPU, memory, latency
`datadog_get_monitor`	Understand what threshold triggered

The AI typically uses 8–15 tool calls per investigation, batching parallel calls when possible to minimize round-trip time.

The Training System: Pre-Built Knowledge

A cold investigation — where the AI knows nothing about your infrastructure — is slow and imprecise. FRIDAY includes a deterministic training mode that pre-builds architectural knowledge:

def train():
    """
    Deterministic training:
    ~13 targeted API calls, then one Bedrock synthesis call.

    Collects: cluster-service maps, HAProxy backends, 
    chronic error baselines, recent planned work.
    """
    # Phase 1: Targeted data collection (no AI — pure API calls)
    collected = {}
    for key, tool_name, tool_input in TRAINING_CALLS:
        collected[key] = execute_tool(tool_name, tool_input)

    # Phase 2: Single AI synthesis call
    knowledge_doc = synthesize_knowledge(collected)

    # Phase 3: Save to S3 — injected into system prompt
    save_to_s3(knowledge_doc)

The knowledge document contains:

Cluster → Service map — What runs where
Chronic error baselines — Background noise to ignore (not incidents)
Recent planned work — Deployments and migrations that explain expected errors
Backend inventory — Every backend serving traffic

Key insight: Knowledge injection > Larger context windows. A synthesized knowledge document — curated, current, and actionable — is more effective than dumping raw infrastructure documentation into the prompt. It captures real state, not aspirational state.

Handling Edge Cases

Planned Work vs. Real Incidents

One of the hardest problems: distinguishing planned maintenance from real outages. During a Kubernetes cluster migration, you expect 5xx errors as traffic drains. FRIDAY handles this through:

Knowledge injection — Training mode captures recent PRs tagged as planned work
Real-time PR correlation — During investigation, it reads PR bodies for keywords like "decommission", "drain", "planned"
Explicit classification — If a 5xx spike coincides with a merged "failover" PR, FRIDAY reports:

"This alert coincides with planned cluster decommission. Errors are expected during traffic drain. No incident action required."

Force-Completion Under Round Limits

What happens when an investigation is complex and approaching the 25-round tool limit? FRIDAY has a graceful degradation mechanism:

if rounds_remaining <= 3:
    user_content.append({
        "text": (
            "STOP CALLING TOOLS. Write your FINAL report "
            "NOW using all data collected so far. Mark "
            "uncertain findings as 'Suspected' rather "
            "than skipping them."
        )
    })

This ensures every investigation produces a report — even if incomplete — rather than timing out silently.

Deduplication

PagerDuty retries webhooks. FRIDAY handles this at two levels:

Webhook-level — In-memory cache of webhook IDs (survives Lambda warm starts)
Incident-level — S3 marker files prevent re-investigating the same incident

Results

After running in production for several months:

Metric	Before FRIDAY	After FRIDAY	Improvement
Mean Time to First Analysis	15–45 min	90 sec–3 min	~90% faster
MTTR (overall)	~60 min	~15 min	65% reduction
AI tool adoption (team)	20%	85%	4x increase
Alert noise (false escalations)	High	Minimal	~80% reduction
Auto-generated postmortems	0%	100% of P1/P2	Eliminated manual RCA drafts

The most impactful change isn't the speed — it's the consistency. A human engineer at 3 AM makes mistakes: investigates the wrong region, misses a recent deployment, forgets to check queue depths. FRIDAY follows the same rigorous methodology every time.

Lessons Learned

1. Prompt Engineering IS Architecture

The system prompt is the most important file in the codebase. It's not instructions — it's the agent's operating manual. Ours is ~5,000 words covering:

Environment topology (region mappings, cluster roles, service dependencies)
Investigation methodology (step-by-step procedures)
Critical rules (what NOT to do — as important as what to do)
Output format contract

Invest in your prompt like you invest in your architecture docs.

2. "GitHub First" Was the Single Biggest Win

Before this rule, the AI would spend 10+ rounds querying Datadog, building elaborate theories about traffic patterns — then discover a config change was merged 5 minutes before the alert. Now it finds the root cause in rounds 1-2 for ~80% of change-induced incidents.

3. You Need Guardrails, Not Just Capabilities

FRIDAY is explicitly told it does NOT take remediation actions. It investigates, analyzes, and reports. A human validates and acts. This is not a limitation — it's a design choice that builds trust. When on-call engineers trust the AI's analysis, they act on it faster.

4. Separate Investigation from Notification

The two-Lambda pattern (sync for webhook receipt, async for investigation) is essential. Don't let API Gateway timeouts dictate your AI agent's investigation depth.

What's Next

We're extending this pattern to autonomous security remediation — an agent that ingests vulnerability findings, generates IaC fixes, deploys through GitOps, verifies no impact, and requests human approval before proceeding. Same tool-use architecture, different domain.

The future of SRE isn't "AI-assisted." It's AI-native: systems designed from the ground up with autonomous agents as first-class participants in the operational loop.

Try It Yourself

The pattern is reproducible with:

Amazon Bedrock (Claude Opus or Sonnet for cost-sensitive use)
Any webhook source (PagerDuty, Opsgenie, Datadog)
Any observability platform with an API (Datadog, Grafana, New Relic)
Any source control (GitHub, GitLab)
Any chat platform (Teams, Slack)

The hard part isn't the code — it's the system prompt. That's where your SRE expertise lives. The AI is the execution engine; your knowledge of your infrastructure is what makes it useful.

What does FRIDAY stand for?
FRIDAY is named after Tony Stark's AI assistant in the Marvel universe. Because if I'm going to be on-call at 2 AM, I at least deserve a butler. ☕

The name also works as a backronym: First Responder for Incident Diagnostics and Anal*Y*sis — but honestly, we just thought the Marvel reference was cooler.

I'm Vinothsingh Elumalai, a Platform Engineering leader building AI-native operations at enterprise scale. I lead the Platform team for a global IAM/SSO platform serving 30M+ users. Currently exploring how agentic AI transforms SRE from reactive firefighting to autonomous, closed-loop operations.

This is Part 1 of my AI-Native SRE series. Part 2 will cover JARVIS — an autonomous vulnerability remediation agent that fixes security findings through GitOps with human approval gates.

Connect on LinkedIn

DEV Community: Vinothsingh Elumalai

How an AI Terminal Assistant Became My Team's Most Productive Engineer - Opencode + Claude + MCP

Table of Contents

The Moment That Changed Everything

What It Actually Is

The Setup Nobody Believes Is This Simple

Focused Sessions — One Agent, One Mission

Sub-Agents With Specific Skill Sets

The Doc Agent

The ADO Agent

The AWS Agent

The Incident Agent

The GitHub Agent

Build Your Own

What We've Actually Achieved

The >$100K/Month Cost Discovery

The Unused Resources Audit

The Cache Migration Pre-Mortem

How It Became a Force Multiplier for Incident Response

From OpenCode to FRIDAY — The Agent That Investigates Incidents Autonomously

From FRIDAY to JARVIS — Thinking About Write-Path Autonomy

Deleting 130k Accounts Without Writing Code

Learning the Entire Product in Conversations

Why This Is Different From ChatGPT

The Uncomfortable Truth

What's Next

How I Built FRIDAY ? An Autonomous Incident Investigation Agent That Reduced MTTR by 65%

Series: AI-Native SRE

Table of Contents

The Problem Every On-Call Engineer Knows

What FRIDAY Does

Architecture Overview

Key Design Decisions

1. Two-Lambda Architecture (Sync + Async)

2. GitHub First, Datadog Second

3. Region Lock — Preventing Wrong-Region Investigation

4. Structured Output Contract

The Tool-Use Loop: How FRIDAY Reasons

Available Tools

The Training System: Pre-Built Knowledge

Handling Edge Cases

Planned Work vs. Real Incidents

Force-Completion Under Round Limits

Deduplication

Results

Lessons Learned

1. Prompt Engineering IS Architecture

2. "GitHub First" Was the Single Biggest Win

3. You Need Guardrails, Not Just Capabilities

4. Separate Investigation from Notification

What's Next

Try It Yourself