<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kirill Polishchuk</title>
    <description>The latest articles on DEV Community by Kirill Polishchuk (@kirponik).</description>
    <link>https://dev.to/kirponik</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1489786%2Fbafd0c09-23c4-4142-8327-105b86dc70af.jpeg</url>
      <title>DEV Community: Kirill Polishchuk</title>
      <link>https://dev.to/kirponik</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kirponik"/>
    <language>en</language>
    <item>
      <title>Why Your SRE Agents Need a Graph</title>
      <dc:creator>Kirill Polishchuk</dc:creator>
      <pubDate>Sat, 21 Mar 2026 23:46:23 +0000</pubDate>
      <link>https://dev.to/kirponik/why-your-sre-agents-need-a-graph-4pij</link>
      <guid>https://dev.to/kirponik/why-your-sre-agents-need-a-graph-4pij</guid>
      <description>&lt;p&gt;Traditional automation relies on &lt;strong&gt;Directed Acyclic Graphs (DAGs)&lt;/strong&gt;—linear pipelines that execute steps A, then B, then C. Tools like GitHub Actions and Jenkins excel at this. They're perfect for deterministic workflows like building Docker images or running test suites.&lt;/p&gt;

&lt;p&gt;But infrastructure failures aren't linear. When your database chokes at 3 AM, the recovery process is iterative: you observe metrics, form a hypothesis, test it, and when it fails—you &lt;em&gt;backtrack&lt;/em&gt; and try another angle. Even after finding the root cause, you often need to &lt;strong&gt;pause and ask for human approval&lt;/strong&gt; before executing a potentially destructive remediation.&lt;/p&gt;

&lt;p&gt;A linear pipeline can't do any of this. If step 2 fails, the pipeline dies. It can't loop back to gather more data. It can't pause mid-execution to wait for human input.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is why AI agents need graph-based orchestration.&lt;/strong&gt; Not the rigid DAGs of CI/CD pipelines, but &lt;strong&gt;cyclic, stateful graphs&lt;/strong&gt; that support iteration, maintain context across cycles, and can pause for human approval at critical moments.&lt;/p&gt;

&lt;p&gt;Here's how I built a production-ready autonomous SRE system using &lt;strong&gt;LangGraph&lt;/strong&gt; for the orchestration and &lt;strong&gt;PydanticAI&lt;/strong&gt; for the agent intelligence.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why AI Agents Need a Graph
&lt;/h2&gt;

&lt;p&gt;Traditional automation uses linear pipelines. But AI agents are different—they think, they iterate, they sometimes need to ask for help. Without a graph-based orchestrator, you're left with brittle scripts that can't adapt.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Three Superpowers of Graph-Based Agent Orchestration
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Cyclic Routing: Think → Act → Reflect → Repeat&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Real debugging is iterative. An agent makes a hypothesis, tests it, and either succeeds or loops back with new information. A graph with cyclic routing lets your agents iterate naturally:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbyhx2g3ofzgxpixl6e32.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbyhx2g3ofzgxpixl6e32.jpg" alt=" " width="800" height="463"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Stateful Memory: Building Context Across Iterations&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Each agent cycle builds on the last. The graph maintains state—observations, metrics, hypotheses—so agents don't start from scratch every time. This persistent context is crucial for complex debugging scenarios where the root cause only becomes apparent after several failed attempts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Human-in-the-Loop with Interrupts: Safety at Critical Moments&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The most powerful feature: graphs can &lt;strong&gt;pause execution&lt;/strong&gt; and wait for human input. When your agent wants to kill a query on the production database, it should ask first. With the right orchestrator, this is elegant—the graph simply pauses, sends a Slack message with Approve/Reject buttons, and waits. When the human clicks, the graph resumes exactly where it left off.&lt;/p&gt;

&lt;p&gt;Without a graph orchestrator, implementing human approval would require complex state polling or external workflow engines. With the right tool, it's a natural part of the workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem with DAGs in Infrastructure
&lt;/h2&gt;

&lt;p&gt;Imagine a traditional automation script trying to debug a database:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Trigger:&lt;/strong&gt; High CPU alert.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Action:&lt;/strong&gt; Fetch slow query log.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Action:&lt;/strong&gt; If slow query found, kill it.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What happens if the slow query log is empty because the issue is actually an InnoDB lock wait? The script fails, throws an exception, and wakes you up anyway.&lt;/p&gt;

&lt;p&gt;What if the script identifies a fix but needs human approval before executing it? A DAG can't pause mid-execution—it just runs to completion or fails.&lt;/p&gt;

&lt;p&gt;We need a system that can say: &lt;em&gt;"Hmm, the slow log didn't give me the answer. Let me loop back, look at the disk I/O metrics, and form a new hypothesis. And before I execute any remediation, let me ask a human to confirm."&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Stateful, Multi-Agent Orchestration with Human Approval
&lt;/h2&gt;

&lt;p&gt;To solve this, I built a system with three core components:&lt;/p&gt;

&lt;h3&gt;
  
  
  The State Machine (LangGraph)
&lt;/h3&gt;

&lt;p&gt;The recovery process is treated as a state machine rather than a pipeline. The graph maintains the incident's global memory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;IncidentState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TypedDict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Global memory that persists across agent cycles.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;incident_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="c1"&gt;# Observations accumulate with each cycle (using operator.add)
&lt;/span&gt;    &lt;span class="n"&gt;observations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Annotated&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;operator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;
    &lt;span class="n"&gt;hypothesis&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;remediation_sql&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;cycle_count&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Annotated&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;operator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="c1"&gt;# Approval workflow state
&lt;/span&gt;    &lt;span class="n"&gt;approval_status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# pending, approved, rejected
&lt;/span&gt;    &lt;span class="n"&gt;approver&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As agents loop through diagnostic cycles, they append their findings here, so each new hypothesis is informed by everything that has already been tried.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Brain (PydanticAI Agents)
&lt;/h3&gt;

&lt;p&gt;Specialized AI agents handle different aspects of the investigation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Metrics Agent:&lt;/strong&gt; Fetches and interprets Prometheus data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analyzer:&lt;/strong&gt; Forms hypotheses from the metrics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Researcher:&lt;/strong&gt; Validates hypotheses by querying the database&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These agents return strictly typed JSON outputs, so the graph router knows exactly what to do with their results—loop back for more diagnosis, proceed to remediation, or escalate to a human.&lt;/p&gt;
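
&lt;p&gt;The routing contract can be sketched with plain dataclasses — the &lt;code&gt;Diagnosis&lt;/code&gt; type and its field names here are illustrative stand-ins for the real PydanticAI output schema, but the principle is the same: the router branches on validated fields, never on free-form prose:&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Diagnosis:
    """Typed agent output; field names are illustrative, not the real schema."""
    root_cause_found: bool
    hypothesis: str
    remediation_sql: Optional[str] = None

def route_on_output(result: Diagnosis, cycle_count: int, max_cycles: int = 5) -> str:
    """Deterministic routing: every branch keys off a typed field."""
    if cycle_count > max_cycles:
        return "escalate"
    if result.root_cause_found and result.remediation_sql:
        return "request_approval"
    return "diagnose"  # loop back with the refined hypothesis

print(route_on_output(Diagnosis(True, "lock wait", "KILL 42;"), cycle_count=2))
# -> request_approval
```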

&lt;h3&gt;
  
  
  The Safety Layer (MCP + Human Approval)
&lt;/h3&gt;

&lt;p&gt;Two critical safety mechanisms:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP (Model Context Protocol):&lt;/strong&gt; Acts as a secure gateway between the AI and your database. The AI never sees credentials—MCP holds them locally and only exposes read-only diagnostic tools. It's like a USB-C port for AI data access.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Human-in-the-Loop:&lt;/strong&gt; Before any destructive action, the graph pauses and sends an interactive Slack message. The workflow literally cannot proceed until a human clicks "Approve" or "Reject."&lt;/p&gt;

&lt;h3&gt;
  
  
  Dynamic Configuration
&lt;/h3&gt;

&lt;p&gt;A key insight: Grafana alerts already contain everything we need. The MySQL instance IP, whether it's a replica, the cluster name—all of it is in the alert labels. We extract this dynamically, eliminating the need for static configuration files or AWS credential management. Each alert creates its own isolated connection context.&lt;/p&gt;
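
&lt;p&gt;The extraction itself is mundane. A minimal sketch, assuming Grafana-style labels — the &lt;code&gt;instance&lt;/code&gt;, &lt;code&gt;cluster&lt;/code&gt;, and &lt;code&gt;role&lt;/code&gt; keys are illustrative and depend on how your alert rules are written:&lt;/p&gt;

```python
def connection_context(alert: dict) -> dict:
    """Build an isolated per-incident connection context from alert labels.
    Label names are illustrative; adapt to your own Grafana alert rules."""
    labels = alert.get("labels", {})
    host, _, port = labels["instance"].partition(":")
    return {
        "host": host,
        "port": int(port or 3306),  # fall back to the default MySQL port
        "cluster": labels.get("cluster", "unknown"),
        "read_only": labels.get("role") == "replica",
    }

ctx = connection_context({"labels": {"instance": "10.0.8.14:3306", "role": "replica"}})
print(ctx["host"], ctx["read_only"])
# -> 10.0.8.14 True
```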

&lt;h2&gt;
  
  
  The Orchestration Flow
&lt;/h2&gt;

&lt;p&gt;Here's how the pieces fit together:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcad6ile9g0s2yxa5uc69.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcad6ile9g0s2yxa5uc69.jpg" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Cyclic Routing in Practice
&lt;/h3&gt;

&lt;p&gt;The router is simple but powerful. Here's how it works in code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;router&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;IncidentState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Route the workflow based on current state.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Guardrail: Prevent infinite loops and runaway costs
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cycle_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;escalate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="c1"&gt;# Success: Found the root cause
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ready_for_remediation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;request_approval&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="c1"&gt;# Iteration needed: Loop back for more diagnosis
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;diagnose&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The router decides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;If the researcher finds the root cause:&lt;/strong&gt; Proceed to remediation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If the hypothesis is wrong:&lt;/strong&gt; Loop back to the diagnose node with updated context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If we've tried 5 times without success:&lt;/strong&gt; Escalate to a human&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If we're ready to remediate:&lt;/strong&gt; Pause and wait for human approval&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This cycle continues until success, human intervention, or the safety limit is reached.&lt;/p&gt;
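
&lt;p&gt;Stripped of the LangGraph machinery, the loop this router produces behaves like the following stdlib sketch — the agent is a stub and the state keys mirror the ones above, but the control flow (accumulate, re-check, cap) is the real point:&lt;/p&gt;

```python
def run_incident(diagnose_step, max_cycles: int = 5) -> dict:
    """Drive the think/act/reflect cycle until remediation-ready or escalation."""
    state = {"observations": [], "cycle_count": 0, "status": "diagnosing"}
    while True:
        if state["cycle_count"] > max_cycles:
            return {**state, "status": "escalate"}  # safety cap reached
        finding, done = diagnose_step(state)
        state["observations"].append(finding)  # context accumulates across cycles
        state["cycle_count"] += 1
        if done:
            return {**state, "status": "ready_for_remediation"}

# Stub agent: "finds" the root cause on its third cycle.
result = run_incident(lambda s: (f"cycle {s['cycle_count']}", s["cycle_count"] == 2))
print(result["status"], result["cycle_count"])
# -> ready_for_remediation 3
```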

&lt;h3&gt;
  
  
  The Interrupt Pattern
&lt;/h3&gt;

&lt;p&gt;When the graph reaches the approval point, it doesn't poll or busy-wait. It &lt;strong&gt;interrupts&lt;/strong&gt;—pausing execution entirely and saving its state:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;approval_wait_logic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;IncidentState&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Pause graph and wait for human approval via interrupt.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Send Slack message with Approve/Reject buttons
&lt;/span&gt;    &lt;span class="nf"&gt;send_slack_approval_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;incident_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hypothesis&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="c1"&gt;# Interrupt: Pause execution entirely, wait for external input
&lt;/span&gt;    &lt;span class="n"&gt;decision&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;interrupt&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;request&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;awaiting_approval&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;incident_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;incident_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]})&lt;/span&gt;

    &lt;span class="c1"&gt;# Graph resumes here when human clicks button
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;decision&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;approved&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;execute_remediation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;escalate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a human clicks "Approve" in Slack, the webhook resumes the graph with the decision:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# In Slack webhook handler
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_approval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;incident_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;approved&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;decision&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;approved&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;approved&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;approver&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_name&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;# Resume the graph with the human's decision
&lt;/span&gt;    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;astream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nc"&gt;Command&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resume&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;decision&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;configurable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;thread_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;incident_id&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;
    &lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;pass&lt;/span&gt;  &lt;span class="c1"&gt;# Graph continues from interrupt point
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Execution continues exactly where it left off. This is far more elegant than polling loops or external state machines.&lt;/p&gt;

&lt;h2&gt;
  
  
  Production Safety
&lt;/h2&gt;

&lt;p&gt;Unlike standard scripts, agentic systems are billed per "thinking cycle." An LLM stuck in a loop trying to debug a phantom network issue will happily burn through tokens until your OpenAI bill looks like a phone number.&lt;/p&gt;

&lt;p&gt;The cycle limit (capped at 5) ensures that if the AI is truly stumped, it gracefully escalates to a human SRE rather than looping infinitely.&lt;/p&gt;

&lt;p&gt;Other critical guardrails:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Read-only database access:&lt;/strong&gt; MCP only exposes diagnostic queries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mandatory human approval:&lt;/strong&gt; No destructive actions without explicit sign-off&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Approval timeouts:&lt;/strong&gt; Auto-escalate if humans don't respond in time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structured logging:&lt;/strong&gt; Full observability with correlation IDs for every incident&lt;/li&gt;
&lt;/ul&gt;
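
&lt;p&gt;The approval-timeout guardrail can be sketched with &lt;code&gt;asyncio&lt;/code&gt; — the one-hour window and the escalate-on-silence policy shown here are illustrative choices, not fixed parts of the system:&lt;/p&gt;

```python
import asyncio

async def wait_for_approval(decision_queue: asyncio.Queue, timeout_s: float = 3600):
    """Wait for a human decision; auto-escalate if nobody responds in time."""
    try:
        decision = await asyncio.wait_for(decision_queue.get(), timeout=timeout_s)
        return "execute_remediation" if decision["approved"] else "escalate"
    except asyncio.TimeoutError:
        return "escalate"  # silence is never consent for a destructive action

async def demo():
    q = asyncio.Queue()
    await q.put({"approved": True, "approver": "alice"})
    print(await wait_for_approval(q, timeout_s=0.1))

asyncio.run(demo())
# -> execute_remediation
```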

&lt;h2&gt;
  
  
  Why This Architecture Wins
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Without a graph orchestrator:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scripts fail on first error&lt;/li&gt;
&lt;li&gt;No way to iterate or backtrack&lt;/li&gt;
&lt;li&gt;No built-in human approval mechanism&lt;/li&gt;
&lt;li&gt;State is lost between steps&lt;/li&gt;
&lt;li&gt;Can't implement "ask a human" mid-workflow&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;With LangGraph:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Agents iterate naturally: hypothesis → test → refine&lt;/li&gt;
&lt;li&gt;State persists across cycles&lt;/li&gt;
&lt;li&gt;Interrupts enable human-in-the-loop safety&lt;/li&gt;
&lt;li&gt;Clear routing logic based on typed agent outputs&lt;/li&gt;
&lt;li&gt;Built-in cycle limits prevent runaway costs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By moving away from linear DAGs and utilizing cyclic graphs with governed tool access (MCP) and human-in-the-loop interrupts, we finally have an infrastructure recovery system that behaves like a real engineer: it investigates, it fails, it adapts, it asks for help when needed, and it tries again.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Linear pipelines work for deterministic processes. But infrastructure failures are messy, non-linear, and often require human judgment. AI agents need an orchestrator that matches this reality—one that supports iteration, maintains context, and can pause for human input.&lt;/p&gt;

&lt;p&gt;Graph-based orchestration isn't just a nice-to-have for AI agents. It's the difference between a brittle script that wakes you up at 3 AM and an autonomous system that either fixes the issue or escalates with full context.&lt;/p&gt;




&lt;h2&gt;
  
  
  📚 Resources
&lt;/h2&gt;

&lt;p&gt;Want to build this yourself? Here is the reading list I used to put this architecture together:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🔗 &lt;strong&gt;LangGraph Documentation:&lt;/strong&gt; The framework for building stateful, multi-actor applications with interrupts. &lt;a href="https://www.langchain.com/langgraph" rel="noopener noreferrer"&gt;Read the docs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🔗 &lt;strong&gt;LangGraph Interrupt Pattern:&lt;/strong&gt; How to pause graphs for human input. &lt;a href="https://langchain-ai.github.io/langgraph/concepts/persistence/" rel="noopener noreferrer"&gt;Read the docs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🔗 &lt;strong&gt;PydanticAI:&lt;/strong&gt; The typed, robust agent framework by the creators of Pydantic. &lt;a href="https://ai.pydantic.dev/" rel="noopener noreferrer"&gt;Read the docs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🔗 &lt;strong&gt;Model Context Protocol (MCP):&lt;/strong&gt; The open standard for securely connecting AI to data sources. &lt;a href="https://modelcontextprotocol.io/" rel="noopener noreferrer"&gt;Official Specification&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>devops</category>
      <category>langgraph</category>
      <category>mcp</category>
      <category>aiops</category>
    </item>
    <item>
      <title>Mastering Multi-Provider Routing with OpenRouter</title>
      <dc:creator>Kirill Polishchuk</dc:creator>
      <pubDate>Sat, 14 Mar 2026 22:17:03 +0000</pubDate>
      <link>https://dev.to/kirponik/mastering-multi-provider-routing-with-openrouter-1ce3</link>
      <guid>https://dev.to/kirponik/mastering-multi-provider-routing-with-openrouter-1ce3</guid>
      <description>&lt;h2&gt;
  
  
  🧠 The Single-Provider Trap
&lt;/h2&gt;

&lt;p&gt;Let's be real: treating a Large Language Model (LLM) provider like a highly available, always-on utility is a massive architectural risk. We've all experienced it. You deploy a sophisticated agentic workflow, and suddenly the primary API goes down, gets aggressively rate-limited, or starts throwing 5xx errors.&lt;/p&gt;

&lt;p&gt;Relying on a single provider—even an industry giant—creates a systemic vulnerability. To build true enterprise-grade AI applications, we have to decouple the application layer from specific vendors. The goal is to engineer a resilient "intelligence backbone" that autonomously shifts traffic based on availability, latency, and unit economics.&lt;/p&gt;

&lt;h2&gt;
  
  
  🏗️ Enter the Unified Routing Plane
&lt;/h2&gt;

&lt;p&gt;Instead of wrestling with half a dozen different SDKs and writing custom retry loops for OpenAI, Anthropic, Meta, and DeepSeek, modern architectures are shifting toward unified routing planes.&lt;/p&gt;

&lt;p&gt;By using an API gateway like OpenRouter, your application interfaces with just one endpoint. The complexity is handled entirely behind the scenes: the gateway uses built-in fallback logic to automatically reroute failed requests to secondary models, or to alternative infrastructure providers hosting the exact same open-weight model.&lt;/p&gt;

&lt;h2&gt;
  
  
  ⚙️ Declarative JSON Routing: Infrastructure as Data
&lt;/h2&gt;

&lt;p&gt;The cleanest way to manage routing at scale is by externalizing your logic into a declarative JSON configuration. This keeps your application code lean and allows Platform or FinOps teams to adjust routing priorities dynamically without triggering a full code deployment.&lt;/p&gt;

&lt;p&gt;Here is what a production-ready routing payload looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"meta-llama/llama-3.3-70b-instruct"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"messages"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Analyze this dataset..."&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"order"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"deepinfra/turbo"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fireworks"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"allow_fallbacks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"sort"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"latency"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"zdr"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"max_price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"completion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
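
&lt;p&gt;From application code, the same payload can be assembled and sent with nothing but the standard library. The endpoint below is OpenRouter's documented chat-completions URL, and the provider preferences mirror the JSON above; treat the &lt;code&gt;send&lt;/code&gt; helper as a hedged sketch rather than a production client:&lt;/p&gt;

```python
import json
import urllib.request

def build_payload(prompt: str) -> dict:
    """Mirror the declarative routing config above as a Python dict."""
    return {
        "model": "meta-llama/llama-3.3-70b-instruct",
        "messages": [{"role": "user", "content": prompt}],
        "provider": {
            "order": ["deepinfra/turbo", "fireworks"],
            "allow_fallbacks": True,
            "sort": "latency",
        },
    }

def send(payload: dict, api_key: str):
    """Sketch only: POST to OpenRouter's chat completions endpoint."""
    req = urllib.request.Request(
        "https://openrouter.ai/api/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    return urllib.request.urlopen(req)  # not exercised here

print(build_payload("Analyze this dataset...")["provider"]["sort"])
# -> latency
```

&lt;p&gt;Because the routing block is just data, it can live in a config file that FinOps edits independently of deployments.&lt;/p&gt;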



&lt;h3&gt;
  
  
  Model-Level Fallbacks for Maximum Resilience
&lt;/h3&gt;

&lt;p&gt;Beyond provider fallbacks, OpenRouter supports &lt;strong&gt;model-level fallbacks&lt;/strong&gt; using the &lt;code&gt;models&lt;/code&gt; array. This is a game-changer for resilience—if your primary model is completely unavailable across all providers, the gateway can automatically fall back to semantically similar models:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"models"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"anthropic/claude-sonnet-4.5"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"openai/gpt-5-mini"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"google/gemini-3-flash-preview"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"messages"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Analyze this dataset..."&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"sort"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"by"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"throughput"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"partition"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"none"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"zdr"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Setting &lt;code&gt;partition: "none"&lt;/code&gt; removes model grouping, allowing the router to sort endpoints globally across all models. This means if Claude is slow or down, your request automatically routes to the fastest available alternative—whether that's GPT-5-mini or Gemini—without any code changes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Performance Thresholds for Predictable SLAs
&lt;/h3&gt;

&lt;p&gt;For enterprise applications with strict latency requirements, you can set explicit performance thresholds using &lt;code&gt;preferred_max_latency&lt;/code&gt; and &lt;code&gt;preferred_min_throughput&lt;/code&gt;. These work with percentile statistics (p50, p75, p90, p99) calculated over a rolling 5-minute window:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"deepseek/deepseek-v3.2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"messages"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Generate report..."&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"sort"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"price"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"preferred_max_latency"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"p90"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"p99"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"preferred_min_throughput"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"p90"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Providers not meeting these thresholds are deprioritized (moved to fallback positions) rather than excluded entirely. This ensures your requests always execute while preferring endpoints that meet your SLA requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this configuration is powerful:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Surgical Provider Targeting (&lt;code&gt;order&lt;/code&gt;)&lt;/strong&gt;: We explicitly target optimized endpoints first, like DeepInfra's high-speed turbo instances.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic Sorting (&lt;code&gt;sort&lt;/code&gt;)&lt;/strong&gt;: Setting this to &lt;code&gt;"latency"&lt;/code&gt; instructs the gateway to actively seek out the fastest responding provider for your chosen model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero Data Retention (&lt;code&gt;zdr&lt;/code&gt;)&lt;/strong&gt;: A non-negotiable flag for enterprise compliance, ensuring your chosen providers do not log your sensitive prompts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost Ceilings (&lt;code&gt;max_price&lt;/code&gt;)&lt;/strong&gt;: Prevents automated failovers from accidentally defaulting to a premium, budget-draining endpoint during a weekend outage.&lt;/li&gt;
&lt;/ul&gt;
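&lt;p&gt;Putting those knobs together, the &lt;code&gt;routing_config.json&lt;/code&gt; policy can be generated with a short script. The model and provider slugs below are illustrative placeholders, not recommendations — check the current OpenRouter catalog for real identifiers:&lt;/p&gt;

```python
import json

# Illustrative routing policy combining the knobs above.
# Model and provider slugs are placeholders, not recommendations.
config = {
    "model": "deepseek/deepseek-v3.2",
    "messages": [{"role": "user", "content": "Summarize this incident..."}],
    "provider": {
        "order": ["deepinfra", "fireworks"],         # preferred endpoints first
        "sort": "latency",                           # then chase the fastest
        "zdr": True,                                 # zero-data-retention only
        "max_price": {"prompt": 1, "completion": 2}  # USD per million tokens
    },
}

with open("routing_config.json", "w") as f:
    json.dump(config, f, indent=2)
```

&lt;p&gt;Keeping the policy in a standalone file means routing changes ship as config diffs, not code deployments.&lt;/p&gt;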

&lt;p&gt;Your application code remains blissfully simple. You just inject this JSON into a standard REST call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="c1"&gt;# Load declarative routing policy
&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;routing_config.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# A single API call handles all fallbacks and routing internally
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;https://openrouter.ai/api/v1/chat/completions&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bearer &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;API_KEY&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
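&lt;p&gt;Because fallbacks happen server-side, it is worth logging which model actually answered each request so silent failovers show up in your metrics. A minimal sketch, assuming an OpenAI-compatible response body (the &lt;code&gt;provider&lt;/code&gt; field is an OpenRouter-specific extension and may be absent on other gateways):&lt;/p&gt;

```python
# Sketch: surface which model/provider served the request.
# Assumes an OpenAI-compatible response body; "provider" is an
# OpenRouter-specific field and may be absent elsewhere.
def served_by(body):
    model = body.get("model", "unknown")
    provider = body.get("provider", "unknown")
    return f"{model} via {provider}"

# Canned example of a response body after a fallback:
sample = {"model": "openai/gpt-5-mini", "provider": "OpenAI"}
print(served_by(sample))  # openai/gpt-5-mini via OpenAI
```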



&lt;h2&gt;
  
  
  💸 FinOps &amp;amp; Unit Economics
&lt;/h2&gt;

&lt;p&gt;Running complex Retrieval-Augmented Generation (RAG) pipelines or large-context reasoning models gets expensive fast. A mature FinOps strategy requires strict controls, and centralizing your routing makes this vastly easier to manage.&lt;/p&gt;

&lt;p&gt;You can establish cost-aware routing dynamically. By setting the &lt;code&gt;provider.sort&lt;/code&gt; key to &lt;code&gt;"price"&lt;/code&gt;, the gateway automatically hunts down the cheapest inference provider currently hosting your requested open-source model. The &lt;code&gt;max_price&lt;/code&gt; parameter ensures your AI spend remains entirely predictable, even when fallback chains are triggered.&lt;/p&gt;

&lt;h3&gt;
  
  
  Real-World Cost Impact
&lt;/h3&gt;

&lt;p&gt;To understand the savings potential, consider the price variance across providers for the same model. For example, &lt;strong&gt;Llama 3.3 70B&lt;/strong&gt; pricing varies significantly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DeepInfra&lt;/strong&gt;: ~$0.15/million input tokens, $0.20/million output tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fireworks AI&lt;/strong&gt;: ~$0.20/million input tokens, $0.20/million output tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Together AI&lt;/strong&gt;: ~$0.20/million input tokens, $0.20/million output tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS Bedrock&lt;/strong&gt;: ~$0.72/million input tokens, $0.72/million output tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The price delta between the most expensive and the most affordable provider is roughly $0.52 to $0.57 per million tokens. For a workload processing 100 billion tokens monthly, that means switching providers saves roughly $52,000 to $57,000 per month, depending on your input/output mix. The &lt;code&gt;max_price&lt;/code&gt; parameter acts as a circuit breaker: if no compliant provider is available under your ceiling, the request fails gracefully rather than silently draining your budget.&lt;/p&gt;
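&lt;p&gt;A quick back-of-the-envelope calculation makes the delta concrete. The prices are the ones listed above; the 80/20 input/output split is an assumption:&lt;/p&gt;

```python
# Monthly cost comparison using the illustrative Llama 3.3 70B prices above.
# Prices are USD per million tokens; volumes are tokens per month.
PRICES = {
    "DeepInfra":   {"input": 0.15, "output": 0.20},
    "Fireworks":   {"input": 0.20, "output": 0.20},
    "Together":    {"input": 0.20, "output": 0.20},
    "AWS Bedrock": {"input": 0.72, "output": 0.72},
}

def monthly_cost(provider, input_tokens, output_tokens):
    p = PRICES[provider]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# 100 billion tokens per month, split 80% input / 20% output:
cheapest = monthly_cost("DeepInfra", 80e9, 20e9)
priciest = monthly_cost("AWS Bedrock", 80e9, 20e9)
print(round(priciest - cheapest))  # 56000 (dollars saved per month)
```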

&lt;h2&gt;
  
  
  ⚖️ The Centralization Trade-off
&lt;/h2&gt;

&lt;p&gt;This architecture is incredibly powerful, but it's not a silver bullet. The biggest trade-off is centralization. By moving away from individual provider SDKs, you are trading multiple potential points of failure for a single, massive one: the routing gateway itself.&lt;/p&gt;

&lt;p&gt;If the unified API's load balancers fail, your entire stack loses access to external AI simultaneously. It's a calculated risk—you're betting that a dedicated routing platform will maintain better aggregate uptime than any individual LLM provider.&lt;/p&gt;

&lt;h2&gt;
  
  
  🎯 The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Relying on a solitary API endpoint is no longer acceptable for modern, mission-critical systems. It exposes your business to unpredictable vendor rate limits, unannounced deprecations, and frustrating outages.&lt;/p&gt;

&lt;p&gt;By adopting a centralized routing plane with declarative JSON configurations, engineering teams can cleanly abstract away the chaos of the AI provider ecosystem. You gain the ability to orchestrate dynamic fallback arrays and latency-based routing without constantly rewriting application logic. This pattern hardens your application and creates a robust foundation for the next generation of autonomous agents.&lt;/p&gt;

&lt;h2&gt;
  
  
  📚 Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://openrouter.ai/docs/guides/routing/provider-selection" rel="noopener noreferrer"&gt;Official documentation&lt;/a&gt; - Official documentation on structuring JSON payloads for latency sorting, fallback arrays, and ZDR enforcement.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.finops.org/wg/finops-for-ai-overview/" rel="noopener noreferrer"&gt;FinOps for AI Frameworks&lt;/a&gt; - Strategic frameworks for measuring AI unit economics and mitigating cloud waste.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://openrouter.ai/docs/guides/routing/model-fallbacks" rel="noopener noreferrer"&gt;Model Fallbacks&lt;/a&gt;  - Deep dive into model-level routing strategies&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>openrouter</category>
      <category>llm</category>
      <category>highavailability</category>
    </item>
    <item>
      <title>"Just Enough" Platform Engineering: Replacing Terraform with Kubernetes APIs</title>
      <dc:creator>Kirill Polishchuk</dc:creator>
      <pubDate>Sat, 28 Feb 2026 22:07:12 +0000</pubDate>
      <link>https://dev.to/kirponik/just-enough-platform-engineering-replacing-terraform-with-kubernetes-apis-53d0</link>
      <guid>https://dev.to/kirponik/just-enough-platform-engineering-replacing-terraform-with-kubernetes-apis-53d0</guid>
      <description>&lt;p&gt;We’ve all been there. You want to build an Internal Developer Platform (IDP). You start with good intentions: &lt;em&gt;"Let's simplify infrastructure for our developers."&lt;/em&gt; Six months later, you have a sprawling Backstage instance that nobody likes, a fragile mountain of Terraform modules that take 40 minutes to apply, and developers who still just DM you to "fix the S3 bucket permissions."&lt;/p&gt;

&lt;p&gt;We fell into this trap. We tried to abstract everything away until we realized we were just hiding complexity, not managing it.&lt;/p&gt;

&lt;p&gt;This article details a different approach. We call it &lt;strong&gt;"Just Enough" Platform Engineering&lt;/strong&gt;. Instead of building a portal that triggers a CI pipeline to run Terraform (the "ClickOps" anti-pattern), we moved the abstraction layer into the Kubernetes cluster itself.&lt;/p&gt;

&lt;p&gt;Using &lt;strong&gt;AWS Kro (Kubernetes Resource Orchestrator)&lt;/strong&gt; and &lt;strong&gt;ACK (AWS Controllers for Kubernetes)&lt;/strong&gt;, we built a self-service API that allows developers to spin up production-ready, compliant microservices in minutes. No Jenkins pipelines. No Terraform state locks. Just &lt;code&gt;kubectl apply&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Here is how we solved the "Day 2" operations gap and cut provisioning time from days to minutes.&lt;/p&gt;




&lt;h2&gt;
  
  
  🎯 The Real Problem: The "Day 2" Gap
&lt;/h2&gt;

&lt;p&gt;Most platforms nail "Day 0" (creating the hello-world app). They fail at "Day 2" (maintenance).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Scenario:&lt;/strong&gt; You use a Terraform module to provision an S3 bucket for a team.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Problem:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Drift:&lt;/strong&gt; A developer manually changes the bucket policy in the AWS Console to debug something. Your Terraform state is now wrong.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Versioning:&lt;/strong&gt; You update the Terraform module to enforce encryption. You now have to run &lt;code&gt;terraform apply&lt;/code&gt; across 50 different repositories to propagate the fix.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cognitive Load:&lt;/strong&gt; Developers have to learn HCL just to add a queue.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We needed a solution that was &lt;strong&gt;actively reconciling&lt;/strong&gt; (fixing drift automatically) and &lt;strong&gt;API-centric&lt;/strong&gt; (versioned and manageable).&lt;/p&gt;
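&lt;p&gt;"Actively reconciling" is just the controller pattern: observe actual state, diff it against desired state, and emit corrective actions, forever. A toy sketch of the loop body (pure Python, no AWS calls, purely illustrative):&lt;/p&gt;

```python
# Toy reconciliation: the control-loop pattern behind ACK and Kro.
# "desired" comes from the Kubernetes object; "actual" from the cloud API.
def reconcile(desired, actual):
    """Return the mutations needed to converge actual onto desired."""
    actions = []
    for key, want in desired.items():
        if actual.get(key) != want:
            actions.append(("set", key, want))
    for key in actual:
        if key not in desired:
            actions.append(("delete", key))
    return actions

desired = {"encryption": "AES256", "versioning": True}
actual = {"encryption": "none", "public_access": True}  # drifted by hand
actions = reconcile(desired, actual)
print(actions)
```

&lt;p&gt;ACK controllers run a loop of this shape against AWS APIs, so a hand-edited bucket policy gets corrected on the next sync instead of rotting as drift.&lt;/p&gt;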




&lt;h2&gt;
  
  
  🛠️ The Architecture: Kubernetes as the Control Plane
&lt;/h2&gt;

&lt;p&gt;We stopped treating Kubernetes as just a container scheduler and started treating it as a universal control plane.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AWS Kro:&lt;/strong&gt; Allows us to define custom APIs (CRDs) without writing Go code. It acts as the "glue" or the orchestrator.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ACK (AWS Controllers for Kubernetes):&lt;/strong&gt; Native Kubernetes controllers that talk to AWS APIs. They turn an S3 bucket into a Kubernetes object.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Workflow:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Platform Team&lt;/strong&gt; defines a &lt;code&gt;ResourceGraphDefinition&lt;/code&gt; (RGD). This is the blueprint.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kro&lt;/strong&gt; converts that RGD into a custom Kubernetes API (e.g., &lt;code&gt;kind: MLWorkspace&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Developer&lt;/strong&gt; applies a simple 5-line YAML file.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kro + ACK&lt;/strong&gt; automatically provision the Deployment, Service, IAM Role, and S3 Bucket, wiring them all together securely.&lt;/li&gt;
&lt;/ol&gt;
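&lt;p&gt;To make step 3 concrete: once the RGD is registered, the entire developer-facing manifest looks roughly like this (the project name is illustrative):&lt;/p&gt;

```yaml
apiVersion: kro.run/v1alpha1
kind: MLWorkspace
metadata:
  name: fraud-ml
spec:
  project: fraud-ml
  gpu: false
```

&lt;p&gt;Everything else — bucket, IAM role, service account, deployment — is derived by Kro from the blueprint defined below.&lt;/p&gt;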




&lt;h2&gt;
  
  
  💻 Implementation: The "Secure ML Workspace" API
&lt;/h2&gt;

&lt;p&gt;Let's build a real artifact. We want a Custom Resource called &lt;code&gt;MLWorkspace&lt;/code&gt; that gives a data scientist:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A Jupyter Notebook (Deployment + Service).&lt;/li&gt;
&lt;li&gt;A private S3 Bucket for datasets.&lt;/li&gt;
&lt;li&gt;An IAM Role that allows &lt;em&gt;only&lt;/em&gt; that notebook to access &lt;em&gt;only&lt;/em&gt; that bucket.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  1. The Foundation (Terraform)
&lt;/h3&gt;

&lt;p&gt;We use Terraform &lt;em&gt;only&lt;/em&gt; for the static base (EKS cluster, OIDC provider, and installing the controllers). We don't use it for the dynamic app resources.&lt;/p&gt;

&lt;p&gt;Terraform&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Install the ACK S3 Controller via Helm&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"helm_release"&lt;/span&gt; &lt;span class="s2"&gt;"ack_s3"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ack-s3-controller"&lt;/span&gt;
  &lt;span class="nx"&gt;chart&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"s3-chart"&lt;/span&gt;
  &lt;span class="nx"&gt;repository&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"oci://public.ecr.aws/aws-controllers-k8s"&lt;/span&gt;

  &lt;span class="c1"&gt;# Crucial: Map the K8s ServiceAccount to an AWS IAM Role (IRSA)&lt;/span&gt;
  &lt;span class="nx"&gt;set&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;name&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"serviceAccount.annotations.eks&lt;/span&gt;&lt;span class="err"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;.amazonaws&lt;/span&gt;&lt;span class="err"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;.com/role-arn"&lt;/span&gt;
    &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_iam_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ack_s3_controller&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. The Abstraction (Kro ResourceGraphDefinition)
&lt;/h3&gt;

&lt;p&gt;This is the "secret sauce." Instead of writing a complex Go Operator, we define the relationship graph in YAML.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: We use the &lt;code&gt;ResourceGraphDefinition&lt;/code&gt; kind (the current standard for Kro).&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kro.run/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ResourceGraphDefinition&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-workspace-api&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# The Interface: What the developer sees&lt;/span&gt;
  &lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1alpha1&lt;/span&gt;
    &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;MLWorkspace&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;project&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;string&lt;/span&gt;
      &lt;span class="na"&gt;gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;boolean | default=false&lt;/span&gt;
    &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;notebookUrl&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://${notebookservice.metadata.name}.${schema.metadata.namespace}.svc.cluster.local:8888"&lt;/span&gt;
      &lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${s3bucket.status.ackResourceMetadata.arn}&lt;/span&gt;

  &lt;span class="c1"&gt;# The Implementation: What gets created&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# 1. The Private S3 Bucket (Managed by ACK)&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;s3bucket&lt;/span&gt;
      &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;s3.services.k8s.aws/v1alpha1&lt;/span&gt;
        &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Bucket&lt;/span&gt;
        &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${schema.spec.project}-data&lt;/span&gt;
        &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${schema.spec.project}-data-${schema.metadata.uid}&lt;/span&gt; &lt;span class="c1"&gt;# Unique Name&lt;/span&gt;
          &lt;span class="na"&gt;encryption&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;applyServerSideEncryptionByDefault&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                  &lt;span class="na"&gt;sseAlgorithm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AES256&lt;/span&gt;

    &lt;span class="c1"&gt;# 2. The IAM Policy for Bucket Access (Managed by ACK)&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;iampolicy&lt;/span&gt;
      &lt;span class="na"&gt;readyWhen&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;${iampolicy.status.ackResourceMetadata.arn != &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="s"&gt;}&lt;/span&gt;
      &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;iam.services.k8s.aws/v1alpha1&lt;/span&gt;
        &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Policy&lt;/span&gt;
        &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${schema.spec.project}-s3-policy&lt;/span&gt;
        &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${schema.spec.project}-s3-policy-${schema.metadata.uid}&lt;/span&gt;
          &lt;span class="na"&gt;policyDocument&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;{&lt;/span&gt;
              &lt;span class="s"&gt;"Version": "2012-10-17",&lt;/span&gt;
              &lt;span class="s"&gt;"Statement": [{&lt;/span&gt;
                &lt;span class="s"&gt;"Effect": "Allow",&lt;/span&gt;
                &lt;span class="s"&gt;"Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],&lt;/span&gt;
                &lt;span class="s"&gt;"Resource": [&lt;/span&gt;
                  &lt;span class="s"&gt;"arn:aws:s3:::${schema.spec.project}-data-${schema.metadata.uid}",&lt;/span&gt;
                  &lt;span class="s"&gt;"arn:aws:s3:::${schema.spec.project}-data-${schema.metadata.uid}/*"&lt;/span&gt;
                &lt;span class="s"&gt;]&lt;/span&gt;
              &lt;span class="s"&gt;}]&lt;/span&gt;
            &lt;span class="s"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;# 3. The IAM Role for K8s Service Account/IRSA (Managed by ACK)&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;iamrole&lt;/span&gt;
      &lt;span class="na"&gt;readyWhen&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;${iamrole.status.ackResourceMetadata.arn != &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="s"&gt;}&lt;/span&gt;
      &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;iam.services.k8s.aws/v1alpha1&lt;/span&gt;
        &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Role&lt;/span&gt;
        &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${schema.spec.project}-role&lt;/span&gt;
        &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${schema.spec.project}-role-${schema.metadata.uid}&lt;/span&gt;
          &lt;span class="na"&gt;policies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;${iampolicy.status.ackResourceMetadata.arn}&lt;/span&gt;
          &lt;span class="na"&gt;assumeRolePolicyDocument&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;{&lt;/span&gt;
              &lt;span class="s"&gt;"Version": "2012-10-17",&lt;/span&gt;
              &lt;span class="s"&gt;"Statement": [{&lt;/span&gt;
                &lt;span class="s"&gt;"Effect": "Allow",&lt;/span&gt;
                &lt;span class="s"&gt;"Principal": { "Federated": "arn:aws:iam::ACCOUNT_ID:oidc-provider/OIDC_URL" },&lt;/span&gt;
                &lt;span class="s"&gt;"Action": "sts:AssumeRoleWithWebIdentity",&lt;/span&gt;
                &lt;span class="s"&gt;"Condition": {&lt;/span&gt;
                  &lt;span class="s"&gt;"StringEquals": {&lt;/span&gt;
                    &lt;span class="s"&gt;"OIDC_URL:sub": "system:serviceaccount:${schema.metadata.namespace}:${schema.spec.project}-sa"&lt;/span&gt;
                  &lt;span class="s"&gt;}&lt;/span&gt;
                &lt;span class="s"&gt;}&lt;/span&gt;
              &lt;span class="s"&gt;}]&lt;/span&gt;
            &lt;span class="s"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;# 4. The Kubernetes Service Account&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;serviceaccount&lt;/span&gt;
      &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
        &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ServiceAccount&lt;/span&gt;
        &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${schema.spec.project}-sa&lt;/span&gt;
          &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${schema.metadata.namespace}&lt;/span&gt;
          &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;eks.amazonaws.com/role-arn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${iamrole.status.ackResourceMetadata.arn}&lt;/span&gt;

    &lt;span class="c1"&gt;# 5. The Notebook Service&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;notebookservice&lt;/span&gt;
      &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
        &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
        &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${schema.spec.project}-notebook&lt;/span&gt;
          &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${schema.metadata.namespace}&lt;/span&gt;
        &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${schema.spec.project}&lt;/span&gt;
          &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8888&lt;/span&gt;
              &lt;span class="na"&gt;targetPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8888&lt;/span&gt;

    &lt;span class="c1"&gt;# 6. The Notebook Deployment (Wait for bucket to be ready)&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;notebook&lt;/span&gt;
      &lt;span class="na"&gt;readyWhen&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;${notebook.status.availableReplicas == 1}&lt;/span&gt;
      &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
        &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
        &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${schema.spec.project}-notebook&lt;/span&gt;
          &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${schema.metadata.namespace}&lt;/span&gt;
        &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${schema.spec.project}&lt;/span&gt;
          &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;serviceAccountName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${schema.spec.project}-sa&lt;/span&gt;
              &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jupyter&lt;/span&gt;
                  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jupyter/scipy-notebook:latest"&lt;/span&gt;
                  &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                    &lt;span class="c1"&gt;# AUTOMATIC WIRING: Inject the Bucket ARN directly&lt;/span&gt;
                    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DATA_BUCKET&lt;/span&gt;
                      &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${s3bucket.status.ackResourceMetadata.arn}&lt;/span&gt;
                  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                    &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                      &lt;span class="c1"&gt;# Conditional Logic in CEL&lt;/span&gt;
                      &lt;span class="na"&gt;nvidia.com/gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${schema.spec.gpu?&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;'1'&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;'0'}"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. The Developer Experience
&lt;/h3&gt;

&lt;p&gt;The developer doesn't care about IAM policies, encryption rules, or Pod selectors. They just want a workspace:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;MLWorkspace&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detection-dev&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;project&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detection&lt;/span&gt;
  &lt;span class="na"&gt;gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. When they apply this, Kro creates the bucket, waits for the ARN to be generated by AWS, injects that ARN into the Pod's environment variables, and spins up the compute.&lt;/p&gt;




&lt;h2&gt;
  
  
  💰 The Cost Reality
&lt;/h2&gt;

&lt;p&gt;Is running this expensive? We analyzed the costs of running the control plane (ACK + Kro) versus the operational savings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The "Tax" (Infrastructure Cost):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Kro Controller:&lt;/strong&gt; Runs as a standard Pod on your existing EKS nodes. Costs nothing beyond the base EC2/Fargate compute required (which is negligible).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ACK Controllers:&lt;/strong&gt; Also run as Pods on your existing nodes. Minimal resource usage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total "Platform Tax":&lt;/strong&gt; Essentially &lt;strong&gt;$0&lt;/strong&gt; in additional licensing or managed service fees. You only pay for the standard EKS cluster and the underlying compute nodes you are already running.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Savings (Operational):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Drift Remediation:&lt;/strong&gt; $0 (Automatic).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wait Time:&lt;/strong&gt; Reduced from days (ticketing queue) to seconds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security Audits:&lt;/strong&gt; The RGD acts as a policy. You can verify that &lt;em&gt;every&lt;/em&gt; &lt;code&gt;MLWorkspace&lt;/code&gt; uses AES256 encryption just by checking the single RGD file.&lt;/li&gt;
&lt;/ul&gt;
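&lt;p&gt;Because every &lt;code&gt;MLWorkspace&lt;/code&gt; is stamped from the same RGD template, auditing the template once audits the whole fleet. A minimal sketch of that idea (the YAML excerpt and function below are illustrative, not from the actual project):&lt;/p&gt;

```python
# Illustrative excerpt of the RGD's encryption block (not the real file).
RGD_EXCERPT = """\
encryption:
  rules:
    - applyServerSideEncryptionByDefault:
        sseAlgorithm: AES256
"""

def audit_encryption(rgd_text: str) -> bool:
    """Check the single RGD template for the required SSE algorithm.

    Every MLWorkspace instance is generated from this one template,
    so one check over the template covers every workspace.
    """
    return "sseAlgorithm: AES256" in rgd_text

print(audit_encryption(RGD_EXCERPT))
```

&lt;p&gt;In practice this check would run in CI against the RGD file in your platform repo, failing the build if the encryption rule is ever removed.&lt;/p&gt;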




&lt;h2&gt;
  
  
  🧠 My Individual Conclusion
&lt;/h2&gt;

&lt;p&gt;After migrating our core data services to this model, here is my honest take:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The "Leaky Abstraction" Risk is Real&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When an ACK resource fails (e.g., AWS rejects the bucket name because it's taken), the error bubbles up to the Kubernetes status. Your developers &lt;em&gt;will&lt;/em&gt; need to know how to read &lt;code&gt;kubectl describe&lt;/code&gt; output. You cannot hide the cloud entirely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Portability vs. Integration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Kro creates a tight coupling with the underlying CRDs (ACK). If you move to Google Cloud, you have to rewrite your RGDs to use Config Connector (Google's equivalent). This is &lt;strong&gt;not&lt;/strong&gt; a "write once, run anywhere" solution like pure Helm charts might claim to be, but the operational stability you gain on your primary cloud is worth the lock-in.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The Verdict&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Use Kro if you are a platform team that wants to provide &lt;strong&gt;golden paths&lt;/strong&gt; without building a massive software project. It sits perfectly in the sweet spot between "raw YAML" and "heavy enterprise portal."&lt;/p&gt;




&lt;h2&gt;
  
  
  📚 Resources &amp;amp; References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Official Project:&lt;/strong&gt; &lt;a href="https://github.com/kubernetes-sigs/kro" rel="noopener noreferrer"&gt;kubernetes-sigs/kro on GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Controllers:&lt;/strong&gt; &lt;a href="https://aws-controllers-k8s.github.io/community/docs/community/overview/" rel="noopener noreferrer"&gt;AWS Controllers for Kubernetes (ACK) Overview&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deep Dive:&lt;/strong&gt; &lt;a href="https://www.cncf.io/blog/2025/12/15/building-platforms-using-kro-for-composition/" rel="noopener noreferrer"&gt;CNCF: Building Platforms Using Kro for Composition&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Syntax Guide:&lt;/strong&gt; &lt;a href="https://github.com/google/cel-spec" rel="noopener noreferrer"&gt;CEL (Common Expression Language) Introduction&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>platformengineering</category>
      <category>kubernetes</category>
      <category>aws</category>
      <category>devops</category>
    </item>
    <item>
      <title>An AI Crew for Automated Diagramming and Documentation</title>
      <dc:creator>Kirill Polishchuk</dc:creator>
      <pubDate>Sun, 16 Nov 2025 22:44:43 +0000</pubDate>
      <link>https://dev.to/kirponik/an-ai-crew-for-automated-diagramming-and-documentation-og2</link>
      <guid>https://dev.to/kirponik/an-ai-crew-for-automated-diagramming-and-documentation-og2</guid>
      <description>&lt;h2&gt;
  
  
  The Introduction
&lt;/h2&gt;

&lt;p&gt;Our cloud documentation is almost always out of date. It's not because we're lazy; it's because the cloud moves too fast. A diagram drawn in a sprint planning meeting is obsolete by the time the code hits production. This documentation crisis, which every engineering team faces, is a massive, invisible tax. Nobody talks about it, but we know that manual updates are expensive, error-prone, and always outdated when you need them most. The "cost" isn't just the 2-3 days of senior engineer time every quarter—it's the production incidents that could have been prevented, the security vulnerabilities you didn't know existed, and the new hires who take weeks to understand the system.&lt;/p&gt;

&lt;p&gt;I was tired of this cycle. So I built a solution that uses AI agents to automatically scan live AWS environments and generate accurate, multi-audience documentation in minutes—not days. Here's how it works, what I learned, and why this approach unlocks something bigger than just better diagrams.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;💡 Why Everything We've Tried Has Failed&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;❌ Manual Documentation&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The promise:&lt;/strong&gt; "We'll keep the wiki updated"&lt;br&gt;
&lt;strong&gt;The reality:&lt;/strong&gt; Updated once during setup, referenced never, trusted by no one&lt;br&gt;
&lt;strong&gt;The cost:&lt;/strong&gt; 2-3 days of senior engineer time per environment, outdated within weeks&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;❌ Diagrams-as-Code (Terraform/CloudFormation diagrams)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The promise:&lt;/strong&gt; "Our IaC is our documentation"&lt;br&gt;
&lt;strong&gt;The reality:&lt;/strong&gt; Shows the &lt;em&gt;intended&lt;/em&gt; state, not the actual state after three hotfixes and that manual console change on Friday night&lt;br&gt;
&lt;strong&gt;The gap:&lt;/strong&gt; What you &lt;em&gt;planned&lt;/em&gt; vs. what actually &lt;em&gt;exists&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;❌ Static Scanning Tools&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The promise:&lt;/strong&gt; "We'll scan your infrastructure"&lt;br&gt;
&lt;strong&gt;The reality:&lt;/strong&gt; Dumps 10,000 lines of JSON that tell you &lt;em&gt;what&lt;/em&gt; exists but not &lt;em&gt;why&lt;/em&gt; or &lt;em&gt;how it's connected.&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Solution
&lt;/h2&gt;

&lt;p&gt;💡 AI Agents That Understand Infrastructure&lt;/p&gt;

&lt;p&gt;What we actually needed was a system that could perceive infrastructure like a scanner, understand it like a senior architect, and explain it like a technical writer—automatically. To achieve this, I created a "crew" of specialized AI agents—each with a specific job, just like a real engineering team.&lt;/p&gt;

&lt;p&gt;Think of it like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Inspector&lt;/strong&gt; scans AWS (like a junior engineer running AWS CLI commands)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Analyst&lt;/strong&gt; understands relationships (like a senior architect reviewing configs)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Draftsman&lt;/strong&gt; creates diagrams (like a technical illustrator)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Writers&lt;/strong&gt; create documentation for different audiences:

&lt;ul&gt;
&lt;li&gt;Technical Writer → detailed runbook for ops teams&lt;/li&gt;
&lt;li&gt;Executive Analyst → high-level summary for leadership&lt;/li&gt;
&lt;li&gt;Developer Advocate → practical guide for developers&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;All working in parallel, all generating outputs from the same live data, all in minutes.&lt;/p&gt;
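&lt;p&gt;Stripped of the framework, the handoff between these agents is just a sequential pipeline in which each stage consumes the previous stage's output. A framework-free sketch of that flow (all names and stub data below are illustrative):&lt;/p&gt;

```python
# Sketch of the Inspector -> Analyst -> Draftsman handoff, with each
# agent reduced to a plain function and the AWS scan stubbed out.
def inspector(_):
    return {"resources": ["i-0abc123", "db-1"]}   # raw scan results (stubbed)

def analyst(scan):
    # Turn the raw list into a logical model of the architecture.
    return {"web": scan["resources"][0], "db": scan["resources"][1]}

def draftsman(model):
    # Render the logical model as a PlantUML script.
    return f'@startuml\n{model["web"]} --> {model["db"]}\n@enduml'

def run_crew(stages, context=None):
    """Run each stage in order, feeding its output to the next."""
    for stage in stages:
        context = stage(context)
    return context

print(run_crew([inspector, analyst, draftsman]))
```

&lt;p&gt;The real system swaps these stubs for CrewAI agents backed by an LLM, but the data flow is the same.&lt;/p&gt;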

&lt;h2&gt;
  
  
  The Transformation
&lt;/h2&gt;

&lt;p&gt;💡 Before vs. After&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Aspect&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Before (Manual Process)&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;After (Automated with AI Agents)&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;⏱️ &lt;strong&gt;Time&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;2-3 days per environment&lt;/td&gt;
&lt;td&gt;5-10 minutes per environment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;👤 &lt;strong&gt;Who&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Senior engineer (expensive)&lt;/td&gt;
&lt;td&gt;Anyone with AWS access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;📄 &lt;strong&gt;Output&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;One diagram, maybe a doc&lt;/td&gt;
&lt;td&gt;Diagram + 4 tailored documents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🔄 &lt;strong&gt;Update Frequency&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Quarterly if you're lucky&lt;/td&gt;
&lt;td&gt;On-demand or automated (CI/CD)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🎯 &lt;strong&gt;Accuracy&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Outdated within weeks&lt;/td&gt;
&lt;td&gt;Always reflects current state&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;😰 &lt;strong&gt;Stress Level&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;High (always out of date)&lt;/td&gt;
&lt;td&gt;Low (always accurate)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;

&lt;p&gt;The entire system is open source. You can have it running in 5 minutes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Install the package&lt;/span&gt;
git clone https://github.com/kirPoNik/aws-architecture-diagrams-with-crewai.git
&lt;span class="nb"&gt;cd &lt;/span&gt;aws-architecture-diagrams-with-crewai
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;

&lt;span class="c"&gt;# 2. Run it (that's it!)&lt;/span&gt;
aws-diagram-generator &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; &lt;span class="s2"&gt;"Production"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--tags&lt;/span&gt; &lt;span class="s2"&gt;"Environment=prod"&lt;/span&gt; &lt;span class="s2"&gt;"App=myapp"&lt;/span&gt;

&lt;span class="c"&gt;# 3. Check your output/ directory for complete documentation&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Prerequisites:&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.10+&lt;/li&gt;
&lt;li&gt;AWS credentials&lt;/li&gt;
&lt;li&gt;AWS Config enabled&lt;/li&gt;
&lt;li&gt;AWS Bedrock access (Claude 3.5 Sonnet &lt;strong&gt;preferred&lt;/strong&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In under 10 minutes, you'll have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ PlantUML architecture diagram with AWS icons&lt;/li&gt;
&lt;li&gt;✅ Technical Runbook with every resource detail&lt;/li&gt;
&lt;li&gt;✅ Executive Summary in plain English&lt;/li&gt;
&lt;li&gt;✅ Developer Onboarding Guide with endpoints&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How It Actually Works
&lt;/h2&gt;

&lt;p&gt;Three Key Innovations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Universal Discovery&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This works with ANY AWS Service. The first breakthrough was realizing we don't need to hard-code &lt;code&gt;describe_instances()&lt;/code&gt;, &lt;code&gt;describe_db_instances()&lt;/code&gt;, etc. for every service. Instead, use AWS's universal APIs:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# This one API call finds ANY tagged resource across ALL services
&lt;/span&gt;&lt;span class="n"&gt;paginator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tagging_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_paginator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;get_resources&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;paginator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;paginate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TagFilters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;boto3_tag_filters&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;resources&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ResourceTagMappingList&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
    &lt;span class="n"&gt;all_resource_mappings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resources&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;strong&gt;Why this matters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Works with services that didn't exist when you wrote the code. No maintenance as AWS adds new services.

&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Batch Processing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The second breakthrough was batching AWS Config calls instead of fetching resources one-by-one:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Group by type
&lt;/span&gt;&lt;span class="n"&gt;resources_by_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;resource&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;resources&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;resource_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_resource_type_from_arn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;resources_by_type&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;resource_type&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Fetch up to 20 at once
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;config_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;batch_get_resource_config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;resourceKeys&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;resource_keys&lt;/span&gt;  &lt;span class="c1"&gt;# Batch of 20
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Automatic fallback for edge cases
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;error_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ValidationException&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;config_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select_resource_config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;Expression&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT * WHERE configuration.arn = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;safe_arn&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;&lt;strong&gt;Why this matters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Processes 100s of resources in seconds&lt;/li&gt;
&lt;li&gt;Built-in retry logic for throttling&lt;/li&gt;
&lt;li&gt;Automatic fallback when batch isn't supported

&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;AI Understanding&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The third breakthrough was using specialized AI agents with personas:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;inspector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;AWS Infrastructure Inspector&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Scan AWS and provide detailed JSON of resources&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;backstory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;You use AWS APIs to discover cloud resources based on tags.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;aws_scanner_tool&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;analyst&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Cloud Architecture Analyst&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Understand architecture, components, and relationships&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;backstory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;You interpret raw infrastructure data and structure it into a logical model.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;draftsman&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;PlantUML Diagram Draftsman&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Generate PlantUML diagram scripts&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;backstory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;You convert architectural information into PlantUML using AWS icons.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Chain them together: Inspector → Analyst → Draftsman
&lt;/span&gt;&lt;span class="n"&gt;task_inspect&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Scan AWS...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;inspector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;task_analyze&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Analyze...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;analyst&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;task_inspect&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;task_draw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Create diagram...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;draftsman&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;task_analyze&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;crew&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Crew&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[...],&lt;/span&gt; &lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[...])&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;crew&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;kickoff&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;&lt;strong&gt;Why this matters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each agent is an expert in its domain&lt;/li&gt;
&lt;li&gt;Outputs are human-readable, not raw JSON&lt;/li&gt;
&lt;li&gt;Same data → 4 different perspectives (technical, executive, developer, visual)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
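&lt;p&gt;The batching snippet above calls &lt;code&gt;extract_resource_type_from_arn&lt;/code&gt; without showing it. Here is one possible sketch of that helper, assuming a small, illustrative mapping from ARN service/resource pairs to AWS Config resource types (the real project may implement this differently, and a production mapping would cover far more services):&lt;/p&gt;

```python
# Illustrative mapping; AWS Config resource types follow the
# "AWS::Service::Type" naming convention.
ARN_TYPE_MAP = {
    ("ec2", "instance"): "AWS::EC2::Instance",
    ("s3", ""): "AWS::S3::Bucket",           # S3 ARNs carry only the bucket name
    ("rds", "db"): "AWS::RDS::DBInstance",
    ("lambda", "function"): "AWS::Lambda::Function",
}

def extract_resource_type_from_arn(arn: str) -> str:
    """Map an ARN to an AWS Config resource type string."""
    # ARN format: arn:partition:service:region:account:resource
    parts = arn.split(":", 5)
    service = parts[2]
    resource = parts[5] if len(parts) > 5 else ""
    # The resource part may be "type/id", "type:id", or a bare id (S3).
    for sep in ("/", ":"):
        if sep in resource:
            resource_type = resource.split(sep, 1)[0]
            break
    else:
        resource_type = ""
    return ARN_TYPE_MAP.get((service, resource_type), f"Unknown::{service}")

print(extract_resource_type_from_arn(
    "arn:aws:ec2:us-east-1:123456789012:instance/i-0abc123"))
```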

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;💡 How It All Fits Together&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffp5kd1ktyusv38bj9jde.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffp5kd1ktyusv38bj9jde.png" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What You Actually Get
&lt;/h2&gt;

&lt;p&gt;💡 Here's what the final markdown file can look like:&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# AWS Architecture Documentation: Production Environment

## Table of Contents
1. Architecture Diagram
2. Technical Infrastructure Runbook
3. Executive Summary for Leadership
4. Developer Onboarding Guide

## Architecture Diagram
@startuml
!include &amp;lt;awslib/AWSCommon&amp;gt;
!include &amp;lt;awslib/Compute/EC2&amp;gt;
!include &amp;lt;awslib/Database/RDS&amp;gt;

rectangle "VPC: vpc-12345 (10.0.0.0/16)" {
  rectangle "Public Subnet: subnet-abc" {
    ElasticLoadBalancing(alb, "Application LB", "")
  }
  rectangle "Private Subnet: subnet-def" {
    EC2(web1, "Web Server 1", "t3.medium")
    EC2(web2, "Web Server 2", "t3.medium")
  }
  rectangle "DB Subnet: subnet-ghi" {
    RDS(db, "PostgreSQL", "db.t3.large")
  }
}

alb --&amp;gt; web1
alb --&amp;gt; web2
web1 --&amp;gt; db
web2 --&amp;gt; db
@enduml

## Technical Infrastructure Runbook

### Compute Resources
**EC2 Instance: i-0abc123** (Web Server 1)
- Instance Type: t3.medium
- Private IP: 10.0.1.10
- Security Groups: sg-web123 (allows 80/443 from ALB)
- IAM Role: web-server-role
- Tags: Environment=production, Tier=web

[... detailed configs for every resource ...]

## Executive Summary
This production environment hosts our customer-facing web application using a
highly available, three-tier architecture. The system consists of:

- **Web Tier:** Redundant web servers behind a load balancer for high availability
- **Database Tier:** Managed PostgreSQL database with automated backups
- **Security:** Private subnets, restricted security groups, encrypted data

The architecture supports approximately 10,000 daily users with 99.9% uptime...

## Developer Onboarding Guide
### Quick Start
**Application URL:** &amp;lt;https://my-app-prod-123.us-east-1.elb.amazonaws.com&amp;gt;

**Database Connection:**
Host: mydb.cluster-abc.us-east-1.rds.amazonaws.com
Port: 5432
Database: production_db
User: app_user

### Environment Variables
[... practical connection details ...]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  💭 Final Thoughts and Next Steps
&lt;/h2&gt;

&lt;p&gt;This approach is powerful, but it's not magic. Here are the real-world considerations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Dependency:&lt;/strong&gt; The &lt;code&gt;AWS Config&lt;/code&gt; discovery method is robust, but it relies on AWS Config being enabled and correctly configured to record all the resource types you care about.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost:&lt;/strong&gt; This makes heavy use of a powerful LLM (like Claude 3.5 Sonnet or GPT-4). Running it on-demand is fine, but running it every 10 minutes on a massive environment could get expensive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API Rate Limits:&lt;/strong&gt; AWS Bedrock enforces strict quotas, especially on Anthropic models (as low as 1-2 requests per minute). To work around this, we invoke the models through an inference profile; Anthropic models also require a use-case submission.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Non-Determinism:&lt;/strong&gt; LLMs are non-deterministic. The &lt;code&gt;Analyst&lt;/code&gt; might occasionally misinterpret a relationship or the &lt;code&gt;Draftsman&lt;/code&gt; might make a syntax error. This requires prompt refinement and testing.&lt;/li&gt;
&lt;/ol&gt;
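&lt;p&gt;For the rate-limit point, a client-side exponential backoff goes a long way. Here is a minimal, generic sketch; the flaky function below stands in for a throttled Bedrock call (the exception type and messages are invented for illustration):&lt;/p&gt;

```python
import random
import time

def with_backoff(fn, max_attempts=5, base_delay=1.0):
    """Retry fn on throttling errors with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RuntimeError:  # stand-in for botocore's ThrottlingException
            if attempt == max_attempts - 1:
                raise
            # Sleep base_delay * 2^attempt seconds, plus random jitter
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))

# Toy demonstration: a call that is throttled twice before succeeding
calls = {"n": 0}
def flaky_call():
    calls["n"] += 1
    if calls["n"] > 2:
        return "model response"
    raise RuntimeError("ThrottlingException")

print(with_backoff(flaky_call, base_delay=0.01))
```

&lt;p&gt;In practice you would catch &lt;code&gt;botocore&lt;/code&gt;'s throttling exception instead of &lt;code&gt;RuntimeError&lt;/code&gt;; the retry shape is the same.&lt;/p&gt;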

&lt;p&gt;Once you have AI agents that can perceive and understand your infrastructure, you unlock an entire category of use cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Cost Optimization&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;finops_analyst&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;FinOps Analyst&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Identify cost optimization opportunities&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;backstory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;You find abandoned or over-provisioned resources.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Output: "Found 5 unattached EBS volumes costing $150/month"
#         "RDS instance at 12% CPU could be downsized, save $200/month"
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Security Auditing&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;security_auditor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Security Auditor&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Identify security vulnerabilities&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;backstory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;You audit cloud configurations for compliance.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Output: "Security group sg-123 allows 0.0.0.0/0 on port 22"
#         "S3 bucket 'backups' is not encrypted"
#         "RDS instance publicly accessible"
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Compliance Verification&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;compliance_checker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Compliance Checker&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Verify HIPAA/PCI-DSS/SOC2 compliance&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Output: "HIPAA Violation: Database not in private subnet"
#         "PCI-DSS: Encryption at rest not enabled"
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  📚 Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;📦 &lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/kirPoNik/aws-architecture-diagrams-with-crewai" rel="noopener noreferrer"&gt;aws-architecture-diagrams-with-crewai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🛠️ &lt;strong&gt;Tools Used:&lt;/strong&gt; &lt;a href="https://docs.crewai.com/" rel="noopener noreferrer"&gt;CrewAI&lt;/a&gt; | &lt;a href="https://docs.aws.amazon.com/config/" rel="noopener noreferrer"&gt;AWS Config&lt;/a&gt; | &lt;a href="https://plantuml.com/" rel="noopener noreferrer"&gt;PlantUML&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🎨 &lt;strong&gt;AWS Icons:&lt;/strong&gt; &lt;a href="https://github.com/awslabs/aws-icons-for-plantuml" rel="noopener noreferrer"&gt;aws-icons-for-plantuml&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📚 &lt;strong&gt;CrewAI Examples:&lt;/strong&gt; &lt;a href="https://github.com/crewAIInc/crewAI-examples" rel="noopener noreferrer"&gt;crewAI-examples&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>ai</category>
      <category>devops</category>
      <category>documentation</category>
    </item>
    <item>
      <title>Predicting Failures in a Serverless App with AWS DevOps Guru and OpenTelemetry</title>
      <dc:creator>Kirill Polishchuk</dc:creator>
      <pubDate>Sat, 25 Oct 2025 21:44:26 +0000</pubDate>
      <link>https://dev.to/kirponik/predicting-failures-in-a-serverless-app-with-aws-devops-guru-and-opentelemetry-2hfe</link>
      <guid>https://dev.to/kirponik/predicting-failures-in-a-serverless-app-with-aws-devops-guru-and-opentelemetry-2hfe</guid>
      <description>&lt;h3&gt;
  
  
  Limitations of Traditional Monitoring
&lt;/h3&gt;

&lt;p&gt;Managing modern distributed applications has become increasingly complex. Traditional monitoring tools, which rely mainly on manual analysis, are insufficient for ensuring the availability and performance demanded by microservice or serverless topologies.&lt;/p&gt;

&lt;p&gt;One of the main problems with traditional monitoring is the high volume and variety of telemetry data generated by IT environments. This includes metrics, logs, and traces, which in an ideal world should be consolidated on a single monitoring dashboard to allow observation of the entire system. Another problem is static thresholds for alarms. Setting them too low will generate a high volume of false positives, while setting them too high will fail to detect significant performance degradation.&lt;/p&gt;
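&lt;p&gt;The static-threshold problem is easy to make concrete. In the sketch below (latency numbers are invented for illustration), a fixed 500 ms alarm never fires on a series that has clearly degraded relative to its own history, while a simple adaptive baseline of mean plus three standard deviations catches it:&lt;/p&gt;

```python
from statistics import mean, stdev

# Illustrative latency samples (ms): a stable baseline, then a subtle degradation
baseline = [120, 115, 130, 125, 118, 122, 127, 121]
recent = 210  # degraded, yet still far below a naive 500 ms static threshold

STATIC_THRESHOLD_MS = 500
static_alarm = recent > STATIC_THRESHOLD_MS

# Adaptive baseline: alert when a sample deviates 3+ standard deviations from history
adaptive_alarm = recent > mean(baseline) + 3 * stdev(baseline)

print(static_alarm, adaptive_alarm)  # the static alarm misses what the baseline catches
```

&lt;p&gt;Real AIOps baselines are far more sophisticated (seasonality, multi-metric correlation), but the asymmetry above is the core of why static thresholds generate either noise or silence.&lt;/p&gt;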

&lt;p&gt;To solve these problems, organizations are shifting to an intelligent, automated, and predictive solution known as AIOps. Instead of relying on human operators to manually connect the dots, AIOps platforms are designed to ingest and analyze these vast datasets in real time.&lt;/p&gt;

&lt;p&gt;In this article, we will learn how AIOps platforms deliver proactive anomaly detection, their most fundamental capability, as well as root cause analysis, prediction, and alert generation.&lt;/p&gt;




&lt;h3&gt;
  
  
  The Technology Stack
&lt;/h3&gt;

&lt;p&gt;The solution detailed in this article is a combination of three synergistic pillars:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;A managed AIOps platform&lt;/strong&gt; that provides analytical intelligence. We will use AWS DevOps Guru, which is the core of our solution and acts as its "AIOps brain." AWS DevOps Guru is a managed service that leverages machine learning models built and trained by AWS experts. A key design principle is to make AIOps accessible to specialists without specialized machine learning expertise. Its primary function is to detect operational issues or anomalies and produce high-level insights instead of a stream of raw, uncorrelated alerts. These insights include related log snippets, a detailed analysis with a possible root cause, and actionable steps to diagnose and remediate the issue.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;An Open-Standard observability framework&lt;/strong&gt; that supplies high-quality telemetry data and provides a unified set of APIs, SDKs, and tools to generate, collect, and export it. The importance of OpenTelemetry lies in two principles: standardization and vendor neutrality. The benefit of using OpenTelemetry is that if we want to switch to a different AIOps tool, we can just redirect the telemetry stream.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A Serverless Application&lt;/strong&gt; that is an example of a modern and dynamic microservice topology.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The complete architecture of the proposed telemetry pipeline is shown in the diagram below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffhn3eko1oeva50gnwbvs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffhn3eko1oeva50gnwbvs.png" alt=" " width="800" height="1189"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;💡 &lt;em&gt;Figure 1. The solution architecture: a user request flows through the serverless application, the ADOT Lambda layer collects telemetry data and sends it to X-Ray and CloudWatch, and AWS DevOps Guru ingests this data and generates insights&lt;/em&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Practical Implementation
&lt;/h3&gt;

&lt;p&gt;It’s important to understand that AWS DevOps Guru does not collect any telemetry data itself; it is configured to monitor and continuously analyze the resources created by the application and identified by specific tags.&lt;/p&gt;

&lt;p&gt;To give the reader a clear picture, this section provides a guide to implementing the proposed solution; the Experiment section below shows how to exercise it. The following Git repository structure aligns with IaC best practices:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;.&lt;/span&gt;
├── demo
│   ├── envs
│   │   └── dev
│   │       ├── env.hcl    &lt;span class="c"&gt;# Environment-specific configuration that sets the environment name&lt;/span&gt;
│   │       ├── api_gateway
│   │       │   └── terragrunt.hcl
│   │       ├── devopsguru
│   │       │   └── terragrunt.hcl
│   │       ├── dynamodb
│   │       │   └── terragrunt.hcl
│   │       ├── iam
│   │       │   └── terragrunt.hcl
│   │       └── serverless_app
│   │           └── terragrunt.hcl
│   └── project.hcl    &lt;span class="c"&gt;# Project-level configuration defining `app_name_prefix` and `project_name` used across all environments&lt;/span&gt;
├── root.hcl    &lt;span class="c"&gt;# Root Terragrunt configuration that generates AWS provider blocks and configures S3 backend&lt;/span&gt;
├── src
│   ├── app.py    &lt;span class="c"&gt;# Lambda handler function with OpenTelemetry instrumentation&lt;/span&gt;
│   ├── requirements.txt
│   └── collector.yaml
└── terraform
    └── modules    &lt;span class="c"&gt;# Infrastructure Modules&lt;/span&gt;
        ├── api_gateway
        ├── devopsguru
        ├── dynamodb
        └── iam
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;💡 This Modular (&lt;code&gt;Terragrunt&lt;/code&gt;) Approach has the following &lt;strong&gt;Benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;True environment isolation: each environment (&lt;code&gt;dev&lt;/code&gt;, &lt;code&gt;prod&lt;/code&gt;, etc.) has its own state, config, and outputs.&lt;/li&gt;
&lt;li&gt;All major AWS resources (Lambda, API Gateway, DynamoDB, IAM, DevOps Guru) are reusable Terraform modules in &lt;code&gt;terraform/modules/&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Easy to extend for new AWS services or environments with minimal duplication.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;The full repository can be found &lt;a href="https://github.com/kirPoNik/aws-aiops-detection-with-guru" rel="noopener noreferrer"&gt;https://github.com/kirPoNik/aws-aiops-detection-with-guru&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The Lambda function (code in &lt;strong&gt;&lt;code&gt;app.py&lt;/code&gt;&lt;/strong&gt;) receives requests from API Gateway, generates a unique ID, and puts an item into the DynamoDB table. It also contains the logic to inject a "gray failure," which our experiment requires; see the code snippet with the key logic below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt;

&lt;span class="c1"&gt;# --- CONFIGURATION FOR GRAY FAILURE SIMULATION ---
# This environment variable acts as our feature flag for the experiment
&lt;/span&gt;&lt;span class="n"&gt;INJECT_LATENCY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INJECT_LATENCY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;false&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;MIN_LATENCY_MS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;150&lt;/span&gt;  &lt;span class="c1"&gt;# Minimum artificial latency in milliseconds
&lt;/span&gt;&lt;span class="n"&gt;MAX_LATENCY_MS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;  &lt;span class="c1"&gt;# Maximum artificial latency in milliseconds
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Handles requests and optionally injects a variable sleep
    to simulate performance degradation.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="c1"&gt;# This is the core logic for our "gray failure" simulation
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;INJECT_LATENCY&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;latency_seconds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MIN_LATENCY_MS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MAX_LATENCY_MS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;1000.0&lt;/span&gt;
        &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;latency_seconds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# The function's primary business logic is to write an item to DynamoDB
&lt;/span&gt;    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put_item&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;Item&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;created_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# ... returns a successful response ...
&lt;/span&gt;    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# ... returns an error response ...
&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The collector configuration (in &lt;strong&gt;&lt;code&gt;collector.yaml&lt;/code&gt;&lt;/strong&gt;) defines the pipelines that send traces to AWS X-Ray and metrics to Amazon CloudWatch; see the key logic below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# This file configures the OTel Collector in the ADOT layer&lt;/span&gt;
&lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# Send trace data to AWS X-Ray&lt;/span&gt;
  &lt;span class="na"&gt;awsxray&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# Send metrics to CloudWatch using the Embedded Metric Format (EMF)&lt;/span&gt;
  &lt;span class="na"&gt;awsemf&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;

&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pipelines&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# The pipeline for traces: receive data -&amp;gt; export to X-Ray&lt;/span&gt;
    &lt;span class="na"&gt;traces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;awsxray&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="c1"&gt;# The pipeline for metrics: receive data -&amp;gt; export to CloudWatch&lt;/span&gt;
    &lt;span class="na"&gt;metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;awsemf&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Simulating Failure and Generating Insights
&lt;/h3&gt;

&lt;p&gt;💡 The Experiment section&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Deploy the Stack&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the &lt;code&gt;demo/envs/dev&lt;/code&gt; directory, run the usual commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;terragrunt init &lt;span class="nt"&gt;--all&lt;/span&gt;
terragrunt plan &lt;span class="nt"&gt;--all&lt;/span&gt;
terragrunt apply &lt;span class="nt"&gt;--all&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Grab the API endpoint from the output and save it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;API_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;terragrunt output &lt;span class="nt"&gt;-json&lt;/span&gt; &lt;span class="nt"&gt;--all&lt;/span&gt;  | jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'to_entries[] | select(.key | test("api_endpoint")) | .value.value'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;💡 You need to enable AWS DevOps Guru and wait 15-90 minutes for it to finish &lt;strong&gt;Discovering applications and resources&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Establish a Baseline&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;DevOps Guru needs to learn what "normal" looks like. Let's give it some healthy traffic. We'll use &lt;strong&gt;&lt;code&gt;hey&lt;/code&gt;&lt;/strong&gt;, a simple load testing tool perfect for this job.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Why &lt;code&gt;hey&lt;/code&gt;? We could use a more complex tool like k6, which is great for scripting detailed user journeys. But for this test, we just need to hit an endpoint with a steady stream of requests. hey does that with a single command, keeping things simple.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Run a light load for a few hours. This gives the ML models plenty of data to build a solid baseline.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Run for 4 hours at 5 requests per second&lt;/span&gt;
hey &lt;span class="nt"&gt;-z&lt;/span&gt; 4h &lt;span class="nt"&gt;-q&lt;/span&gt; 5 &lt;span class="nt"&gt;-m&lt;/span&gt; POST &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$API_URL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;💡 Use GNU Screen to run this in the background&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Inject the Failure&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now for the fun part. We'll introduce our "gray failure" - a subtle slowdown that a simple threshold alarm would likely miss.&lt;/p&gt;

&lt;p&gt;In &lt;code&gt;demo/envs/dev/serverless_app/terragrunt.hcl&lt;/code&gt;, add a new &lt;code&gt;INJECT_LATENCY&lt;/code&gt; entry to our Lambda function's environment variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;environment_variables&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;TABLE_NAME&lt;/span&gt;                         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;dependency&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;dynamodb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;table_name&lt;/span&gt;
    &lt;span class="nx"&gt;AWS_LAMBDA_EXEC_WRAPPER&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"/opt/otel-instrument"&lt;/span&gt;
    &lt;span class="nx"&gt;OPENTELEMETRY_COLLECTOR_CONFIG_URI&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"/var/task/collector.yaml"&lt;/span&gt;
    &lt;span class="nx"&gt;INJECT_LATENCY&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"true"&lt;/span&gt; &lt;span class="c1"&gt;# &amp;lt;-- Change this to true&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Apply the change. This quick deployment is an important event that DevOps Guru will notice.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;terragrunt apply &lt;span class="nt"&gt;--all&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 4: Generate Bad Traffic&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Run the same load test again. This time, every request will have that extra, variable delay.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Run for at least an hour to generate enough bad data&lt;/span&gt;
hey &lt;span class="nt"&gt;-z&lt;/span&gt; 1h &lt;span class="nt"&gt;-q&lt;/span&gt; 5 &lt;span class="nt"&gt;-m&lt;/span&gt; POST &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$API_URL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our app is now performing worse than its baseline. Let's see if DevOps Guru noticed.&lt;/p&gt;

&lt;p&gt;After 30-60 minutes of bad traffic, an "insight" popped up in the DevOps Guru console. &lt;/p&gt;

&lt;p&gt;This is the real value of AIOps. A standard CloudWatch alarm would have just said, "Latency is high." DevOps Guru said, "Latency is high, and it started right after you deployed this change."&lt;/p&gt;
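&lt;p&gt;Conceptually, the correlation DevOps Guru performs amounts to: given the anomaly's start time and a stream of change events, find the most recent change that precedes it. A toy sketch of that idea (timestamps and event names are invented; the real service correlates over CloudTrail and deployment events with learned models, not a fixed window):&lt;/p&gt;

```python
# Toy change events: (unix_timestamp, description); values invented for illustration
events = [
    (1000, "terragrunt apply: serverless_app"),
    (4000, "scaling event: dynamodb"),
    (7200, "terragrunt apply: INJECT_LATENCY=true"),
]

def likely_cause(anomaly_start, events, window=3600):
    """Return the most recent change event within `window` seconds before the anomaly."""
    preceding = [e for e in events if window >= anomaly_start - e[0] >= 0]
    return max(preceding, key=lambda e: e[0], default=None)

print(likely_cause(7500, events))  # the deployment just before the latency spike
```

&lt;p&gt;The jump from "latency is high" to "latency is high and started right after this deployment" is exactly this kind of temporal join between telemetry and change events.&lt;/p&gt;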

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;This experiment shows a clear path away from reactive firefighting. By pairing a standard observability framework like &lt;strong&gt;OpenTelemetry&lt;/strong&gt; with an AIOps engine like &lt;strong&gt;AWS DevOps Guru&lt;/strong&gt;, we can build systems that help us find and fix problems before they become disasters.&lt;/p&gt;

&lt;p&gt;The big takeaway is &lt;strong&gt;correlation&lt;/strong&gt;. The magic wasn't just spotting the latency spike; it was automatically linking it to the deployment. That's the jump from raw data to real insight.&lt;/p&gt;

&lt;p&gt;The future of ops isn't about more dashboards. It's about fewer, smarter alerts that tell you what's wrong, why it's wrong, and how to fix it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;GitHub Repository: &lt;a href="https://github.com/kirPoNik/aws-aiops-detection-with-guru" rel="noopener noreferrer"&gt;https://github.com/kirPoNik/aws-aiops-detection-with-guru&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/devops-guru/" rel="noopener noreferrer"&gt;&lt;strong&gt;AWS DevOps Guru Official Page&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://opentelemetry.io/docs/" rel="noopener noreferrer"&gt;&lt;strong&gt;OpenTelemetry Official Documentation&lt;/strong&gt;&lt;/a&gt;:&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws-otel.github.io/docs/getting-started/lambda" rel="noopener noreferrer"&gt;&lt;strong&gt;AWS Distro for OpenTelemetry (ADOT) for Lambda&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/rakyll/hey" rel="noopener noreferrer"&gt;&lt;strong&gt;hey - HTTP Load Generator&lt;/strong&gt;&lt;/a&gt;:&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>observability</category>
      <category>aiops</category>
      <category>devops</category>
      <category>aws</category>
    </item>
    <item>
      <title>Building a 'Chat with Your Logs' System on AWS Using OpenSearch Serverless and Bedrock</title>
      <dc:creator>Kirill Polishchuk</dc:creator>
      <pubDate>Thu, 04 Sep 2025 23:36:45 +0000</pubDate>
      <link>https://dev.to/kirponik/building-a-chat-with-your-logs-system-on-aws-using-opensearch-serverless-and-bedrock-57g2</link>
      <guid>https://dev.to/kirponik/building-a-chat-with-your-logs-system-on-aws-using-opensearch-serverless-and-bedrock-57g2</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;🧩&lt;/strong&gt; The Challenge: Drowning in Data During Incidents
&lt;/h2&gt;

&lt;p&gt;In the critical moments of a production incident, engineering teams face a formidable challenge: navigating a deluge of log data to find the needle in the haystack. Traditional log analysis demands that engineers formulate precise, often complex, queries using specialized languages. This is effective when you know what to look for, but the real difficulty often lies in diagnosing the "unknown unknowns" - unexpected failures not captured by simple keyword searches.&lt;/p&gt;

&lt;p&gt;What if you could ask questions in plain English, like, &lt;strong&gt;"What were the most common errors for the checkout service in the last 15 minutes?"&lt;/strong&gt; This article demonstrates how to build a powerful, serverless AIOps pipeline on AWS to create a natural language interface for your application logs, transforming log analysis from a rigid, query-based task into an intuitive, conversational experience.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;💬&lt;/strong&gt; The Solution: Conversational AIOps with RAG
&lt;/h2&gt;

&lt;p&gt;This solution leverages a powerful pattern in generative AI known as &lt;strong&gt;Retrieval-Augmented Generation (RAG)&lt;/strong&gt;. RAG enhances the capabilities of Large Language Models (LLMs) by connecting them to external knowledge sources - in this case, your real-time application logs. This approach is highly cost-effective as it avoids expensive model retraining, instead providing the LLM with relevant, live context to answer questions accurately.&lt;/p&gt;
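&lt;p&gt;The "live context" idea is ultimately just prompt assembly: the retrieved log lines are concatenated into the prompt sent to the generative model, so no retraining is needed. A minimal sketch (the prompt wording and log lines are invented for illustration):&lt;/p&gt;

```python
def build_rag_prompt(question, retrieved_logs):
    """Assemble a grounded prompt: retrieved logs become the model's context."""
    context = "\n".join(f"- {line}" for line in retrieved_logs)
    return (
        "You are an SRE assistant. Answer using ONLY the log lines below.\n\n"
        f"Logs:\n{context}\n\n"
        f"Question: {question}\n"
    )

prompt = build_rag_prompt(
    "What errors hit the checkout service?",
    ["checkout: payment gateway timeout", "checkout: card declined"],
)
print(prompt)
```

&lt;p&gt;Swapping the knowledge source, say from logs to runbooks, changes only what gets retrieved, not the model.&lt;/p&gt;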

&lt;h3&gt;
  
  
  High-Level Architecture
&lt;/h3&gt;

&lt;p&gt;The system is composed of a series of integrated, serverless AWS services that form a complete AIOps pipeline, from ingestion to a conversational response.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdrz47ph7qjpnz8osaav2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdrz47ph7qjpnz8osaav2.png" alt="High-Level Architecture Diagram" width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The data flows as follows:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ingestion &amp;amp; Embedding:&lt;/strong&gt; Logs are streamed to an Amazon OpenSearch Ingestion pipeline. The pipeline uses an AWS Lambda function to call Amazon Bedrock's Titan Text Embeddings model, converting the semantic content of each log into a numerical vector.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Indexing:&lt;/strong&gt; The original log, now enriched with its vector embedding, is stored in an Amazon OpenSearch Serverless collection configured for high-performance vector search.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Query &amp;amp; Retrieval:&lt;/strong&gt; A user asks a question through a simple web app. The app converts the question into a vector using the same Titan model and performs a k-Nearest Neighbors (k-NN) similarity search against the OpenSearch collection to find the most semantically relevant logs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Synthesis &amp;amp; Response:&lt;/strong&gt; The retrieved logs are passed as context, along with the original question, to a powerful generative LLM like Anthropic's Claude on Amazon Bedrock. Claude analyzes the logs, synthesizes the information, and generates a coherent, human-readable answer.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
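&lt;p&gt;Step 3 above (query &amp;amp; retrieval) boils down to a single k-NN query against the collection. Here is a minimal sketch of the query body the app would send; the &lt;code&gt;log_embedding&lt;/code&gt; field name matches the index mapping used in this article, while the &lt;code&gt;_source&lt;/code&gt; fields are assumptions about the log schema:&lt;/p&gt;

```python
def build_knn_query(question_vector, k=5):
    # OpenSearch k-NN query: find the k log documents whose embeddings
    # are closest to the embedded question.
    return {
        "size": k,
        "query": {
            "knn": {
                "log_embedding": {
                    "vector": question_vector,
                    "k": k,
                }
            }
        },
        # Return only human-readable fields, not the raw vectors.
        "_source": ["message", "@timestamp"],
    }
```

&lt;p&gt;The resulting hits are the "retrieval" half of RAG; their &lt;code&gt;message&lt;/code&gt; fields become the context passed to the LLM in step 4.&lt;/p&gt;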

&lt;h2&gt;
  
  
  &lt;strong&gt;🧠&lt;/strong&gt; The AIOps Pipeline: Key Components
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How Ingestion and Embedding Work Together
&lt;/h3&gt;

&lt;p&gt;The core of the data processing is a seamless, serverless flow between the Amazon OpenSearch Ingestion pipeline and the &lt;code&gt;embedding_lambda&lt;/code&gt; function. This is how raw logs are enriched with semantic meaning before they are ever stored.&lt;/p&gt;

&lt;p&gt;Here’s a step-by-step breakdown of their interaction:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Arrives at the Pipeline:&lt;/strong&gt; An application sends a log entry to the OpenSearch Ingestion pipeline's HTTP endpoint.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pipeline Invokes the Lambda Processor:&lt;/strong&gt; The pipeline's configuration includes a &lt;code&gt;processor&lt;/code&gt; stage that points to our &lt;code&gt;embedding_lambda&lt;/code&gt; function. When the pipeline receives log data, it automatically invokes this Lambda, passing the batch of log records to it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Lambda Generates Embeddings:&lt;/strong&gt; The &lt;code&gt;embedding_lambda&lt;/code&gt; function executes its logic: it iterates through each log, extracts the text, and makes an API call to Amazon Bedrock's Titan Text Embeddings model. Bedrock returns a numerical vector (the embedding) that captures the log's meaning.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Lambda Enriches the Data:&lt;/strong&gt; The Lambda function adds this new vector as a field (e.g., &lt;code&gt;log_embedding&lt;/code&gt;) to the original log record.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pipeline Sends Data to the Sink:&lt;/strong&gt; The Lambda returns the modified, enriched log records back to the pipeline. The pipeline then sends this complete document to its configured &lt;code&gt;sink&lt;/code&gt; - the OpenSearch Serverless vector collection - where it is indexed and becomes available for &lt;strong&gt;semantic search&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
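&lt;p&gt;The interaction above is wired together entirely in the pipeline definition. The following is an illustrative (not copy-paste-ready) sketch: the function name, index, region, and endpoint are placeholders, and the exact &lt;code&gt;aws_lambda&lt;/code&gt; processor options should be checked against the current OpenSearch Ingestion / Data Prepper documentation:&lt;/p&gt;

```yaml
version: "2"
log-pipeline:
  source:
    http:
      path: "/logs"                  # applications POST log batches here
  processor:
    - aws_lambda:                    # invokes the enrichment Lambda per batch
        function_name: "embedding_lambda"
        invocation_type: "request-response"
        aws:
          region: "us-east-1"
  sink:
    - opensearch:
        hosts: ["https://your-collection-id.us-east-1.aoss.amazonaws.com"]
        index: "application-logs"
        aws:
          serverless: true
          region: "us-east-1"
```

&lt;p&gt;The key design point is that the enrichment step is declarative: the pipeline, not the application, decides that every record passes through the Lambda before reaching the sink.&lt;/p&gt;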

&lt;h3&gt;
  
  
  &lt;strong&gt;The Embedding Lambda: Adding Semantic Context&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;embedding_lambda&lt;/code&gt; is a small but critical piece of the pipeline. Its sole job is to &lt;strong&gt;enrich the log data&lt;/strong&gt; with semantic meaning. Triggered by the OpenSearch Ingestion pipeline for every new batch of logs, it performs three key steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Receives Logs:&lt;/strong&gt; It accepts a batch of raw log entries from the ingestion pipeline.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Generates Vectors:&lt;/strong&gt; It extracts the text from each log and calls the &lt;strong&gt;Amazon Bedrock API&lt;/strong&gt;, specifically requesting an embedding from the &lt;strong&gt;Titan Text Embeddings&lt;/strong&gt; model. Bedrock returns a numerical vector (e.g., a list of 1,024 numbers, the default output size for Titan Text Embeddings V2) that represents the log's meaning.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Returns Enriched Logs:&lt;/strong&gt; The function adds this vector to the original log data under a new field, like &lt;code&gt;log_embedding&lt;/code&gt;, and returns the modified batch to the ingestion pipeline, which then stores it in OpenSearch.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This function acts as a serverless, on-demand transformation engine, making our logs "smart" before they are even indexed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_embedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputText&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="n"&gt;model_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;amazon.titan-embed-text-v2:0&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bedrock_runtime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;modelId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;accept&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;contentType&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response_body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response_body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;embedding&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error generating embedding: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lambda_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;log_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt;
        &lt;span class="n"&gt;log_message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;log_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;log_message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generate_embedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="c1"&gt;# Add the new embedding vector to the log data
&lt;/span&gt;                &lt;span class="n"&gt;log_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;log_embedding&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt;
        &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  OpenSearch Serverless: The Vector Store
&lt;/h3&gt;

&lt;p&gt;We use an Amazon OpenSearch Serverless collection as our &lt;strong&gt;vector database&lt;/strong&gt;. Its &lt;code&gt;Vector search&lt;/code&gt; collection type is optimized for the high-performance similarity searches (k-NN) we need.&lt;/p&gt;

&lt;p&gt;For this to work, we must configure the index mapping to treat our &lt;code&gt;log_embedding&lt;/code&gt; field as a vector. This tells OpenSearch how to index the vector for efficient searching.&lt;/p&gt;

&lt;p&gt;Here is a sample index mapping, which you would typically define in your Terraform configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"log_embedding"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"knn_vector"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"dimension"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"method"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"hnsw"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"engine"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"faiss"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"space_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"l2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"parameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"ef_construction"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"m"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;💡 Key Configuration Details:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;"type": "knn_vector"&lt;/code&gt;: This explicitly defines the &lt;code&gt;log_embedding&lt;/code&gt; field for k-NN search.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;"dimension": 1024&lt;/code&gt;: This &lt;strong&gt;must match&lt;/strong&gt; the output dimension of your embedding model. Amazon Titan Text Embeddings generates vectors of this size.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;"method"&lt;/code&gt;: We specify the &lt;code&gt;hnsw&lt;/code&gt; (Hierarchical Navigable Small World) algorithm, which provides an excellent balance of search speed and accuracy for large datasets.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
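&lt;p&gt;Because a mismatch between the model's output dimension and the mapping's &lt;code&gt;dimension&lt;/code&gt; causes indexing failures, a cheap guard in the embedding Lambda is worth adding. A sketch (the 1,024 value assumes Titan V2's default output size; adjust it if you configure the model differently):&lt;/p&gt;

```python
EXPECTED_DIM = 1024  # must match "dimension" in the knn_vector mapping

def validate_embedding(embedding):
    # Drop (or dead-letter) records whose vectors don't match the mapping,
    # rather than letting OpenSearch reject the batch at index time.
    return embedding is not None and len(embedding) == EXPECTED_DIM
```

&lt;p&gt;Calling this in &lt;code&gt;lambda_handler&lt;/code&gt; before attaching &lt;code&gt;log_embedding&lt;/code&gt; turns a confusing sink-side error into an explicit, loggable decision.&lt;/p&gt;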

&lt;h2&gt;
  
  
  &lt;strong&gt;🛠️&lt;/strong&gt; Practical Implementation Guide
&lt;/h2&gt;

&lt;p&gt;The Git repository uses a modular Terraform layout, a best practice that promotes reusability and maintainability.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;├── README.md
├── envs/
│   ├── dev/
│   │   ├── main.tf
│   │   └── terraform.tfvars
├── modules/
│   ├── iam/
│   ├── ingestion_pipeline/
│   ├── embedding_lambda/
│   └── opensearch/
└── src/
    ├── embedding_lambda/
    └── streamlit_app/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;💡&lt;/strong&gt; The repository separates the definition of the infrastructure (in a &lt;code&gt;modules/&lt;/code&gt; directory) from the configuration for specific deployments (in an &lt;code&gt;envs/&lt;/code&gt; directory). An engineer can deploy a complete development environment by simply running &lt;code&gt;terraform apply&lt;/code&gt; within the &lt;code&gt;envs/dev/&lt;/code&gt; directory.&lt;/p&gt;

&lt;p&gt;See &lt;a href="https://github.com/kirPoNik/aws-bedrock-log-analytics-rag" rel="noopener noreferrer"&gt;Complete Code Repository&lt;/a&gt; for your reference.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  The User Interface and Prompt Engineering
&lt;/h3&gt;

&lt;p&gt;A simple web application built with Streamlit serves as the user-facing component. The quality of the final answer is heavily dependent on the quality of the prompt sent to the Claude model. A simple "Answer the question" prompt is insufficient. Instead, a robust prompt template is used to guide the model's behavior.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_llm_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;logs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;log_context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    You are an expert AIOps assistant. Your task is to answer questions about application behavior based *only* on the provided log entries. Do not use any prior knowledge. If the answer cannot be found in the logs, you must state &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;I cannot answer the question based on the provided logs.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;

    Here are the relevant log entries retrieved:
    &amp;lt;logs&amp;gt;
    &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;log_context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
    &amp;lt;/logs&amp;gt;

    Based on the logs above, please answer the following question:
    &amp;lt;question&amp;gt;
    &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
    &amp;lt;/question&amp;gt;
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic_version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock-2023-05-31&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bedrock_runtime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;modelId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;BEDROCK_MODEL_ID_CLAUDE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response_body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response_body&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;💡 Prompting Tip:&lt;/strong&gt; This prompt uses several best practices for Claude: it assigns a persona ("expert AIOps assistant"), provides clear constraints to prevent hallucination, and uses XML tags (&lt;code&gt;&amp;lt;logs&amp;gt;&lt;/code&gt; and &lt;code&gt;&amp;lt;question&amp;gt;&lt;/code&gt;) to structure the context, which significantly improves the model's ability to follow instructions.&lt;/p&gt;

&lt;p&gt;Here are the latest model IDs as of 2025:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;For the highest capability (Opus):&lt;/strong&gt; &lt;code&gt;anthropic.claude-opus-4-1-20250805-v1:0&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;For a balance of performance and cost (Sonnet):&lt;/strong&gt; &lt;code&gt;anthropic.claude-sonnet-4-20250514-v1:0&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;💫&lt;/strong&gt; A New Paradigm for Observability
&lt;/h2&gt;

&lt;p&gt;This serverless RAG solution represents a new approach to log analysis, with different strategic considerations compared to traditional tools.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost Model: Query vs. Ingestion
&lt;/h3&gt;

&lt;p&gt;The AIOps RAG architecture shifts the cost model. The cost of ingesting and creating embeddings for logs is relatively low. The primary cost driver is the LLM inference at query time. Each user question triggers an API call to the Claude model with a context of retrieved logs. This means the system's operational cost is driven not by log volume, but by &lt;em&gt;query volume and complexity&lt;/em&gt;. This makes the system ideal for high-value, deep-investigation queries during incidents, rather than high-frequency, dashboard-style monitoring.&lt;/p&gt;
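&lt;p&gt;A back-of-envelope sketch of this cost model follows. The prices below are &lt;em&gt;placeholders&lt;/em&gt;, not real Bedrock pricing; the point is only that cost scales with query volume and context size, not with ingested gigabytes:&lt;/p&gt;

```python
def monthly_query_cost(queries_per_month, input_tokens, output_tokens,
                       price_in_per_1k, price_out_per_1k):
    # Per-query cost = (input tokens + output tokens) priced per 1K tokens;
    # monthly cost scales linearly with query volume.
    per_query = (input_tokens / 1000) * price_in_per_1k \
              + (output_tokens / 1000) * price_out_per_1k
    return queries_per_month * per_query

# Illustrative only: hypothetical prices, not actual Bedrock rates.
cost = monthly_query_cost(
    queries_per_month=500,   # deep-investigation queries during incidents
    input_tokens=6000,       # retrieved logs + prompt template
    output_tokens=500,
    price_in_per_1k=0.003,   # hypothetical $ per 1K input tokens
    price_out_per_1k=0.015,  # hypothetical $ per 1K output tokens
)
```

&lt;p&gt;With numbers like these, a month of incident-driven questions costs dollars, not thousands, which is why the architecture favors high-value investigation over dashboard-style polling.&lt;/p&gt;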

&lt;h2&gt;
  
  
  🪄 The Future of Observability: Beyond Q&amp;amp;A
&lt;/h2&gt;

&lt;p&gt;The vector embeddings generated during ingestion are a valuable data asset that can be leveraged for capabilities far beyond simple question-answering.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Automated Semantic Anomaly Detection:&lt;/strong&gt; By applying clustering algorithms to the stream of log embeddings, the system can identify the emergence of new clusters of logs that are semantically distinct from the normal baseline. This can detect novel error types or subtle shifts in application behavior that keyword-based alerting would miss.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Automated Incident Summary Generation:&lt;/strong&gt; The summarization capabilities of LLMs can be used to automatically generate a first draft of an incident summary. By retrieving logs from an incident's timeframe, the system can provide a timeline of key events, a likely root cause, and customer impact, drastically reducing the manual effort required for post-mortem analysis.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
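&lt;p&gt;The first idea can be sketched with nothing more than distances to a baseline centroid. This is a deliberately naive stand-in for a real clustering approach; the three-sigma threshold and the toy 2-D vectors are assumptions for illustration (production embeddings would be 1,024-dimensional):&lt;/p&gt;

```python
import numpy as np

def novelty_scores(baseline_embeddings, new_embeddings):
    # Distance of each new log embedding to the baseline centroid;
    # unusually distant embeddings suggest a semantically novel log type.
    centroid = baseline_embeddings.mean(axis=0)
    return np.linalg.norm(new_embeddings - centroid, axis=1)

def flag_novel(baseline_embeddings, new_embeddings, sigmas=3.0):
    # Threshold at mean + sigmas * std of the baseline's own distances.
    centroid = baseline_embeddings.mean(axis=0)
    base_dist = np.linalg.norm(baseline_embeddings - centroid, axis=1)
    threshold = base_dist.mean() + sigmas * base_dist.std()
    return novelty_scores(baseline_embeddings, new_embeddings) > threshold
```

&lt;p&gt;Because the embeddings already exist in OpenSearch, an approach like this runs as a periodic batch job with no extra model calls, turning the vector index into an anomaly detector for free.&lt;/p&gt;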

&lt;h2&gt;
  
  
  &lt;strong&gt;✅&lt;/strong&gt; Conclusion
&lt;/h2&gt;

&lt;p&gt;The serverless RAG architecture presented here offers a transformative approach to log analysis on AWS. By combining the scalable vector search of Amazon OpenSearch Serverless with the advanced reasoning of foundation models on Amazon Bedrock, organizations can build powerful, conversational interfaces for their observability data. This approach lowers the barrier to deep log analysis, empowers a wider range of team members to participate in incident investigation, and opens the door to a new class of intelligent AIOps tools.&lt;/p&gt;




&lt;h2&gt;
  
  
  📚 Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;📚&lt;/strong&gt; &lt;a href="https://github.com/kirPoNik/aws-bedrock-log-analytics-rag" rel="noopener noreferrer"&gt;&lt;strong&gt;Complete Code Repository&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://aws.amazon.com/what-is/retrieval-augmented-generation/" rel="noopener noreferrer"&gt;&lt;strong&gt;What is Retrieval-Augmented Generation (RAG)&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://aws.amazon.com/bedrock/anthropic/" rel="noopener noreferrer"&gt;&lt;strong&gt;Anthropic's Claude on Amazon Bedrock&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://aws.amazon.com/blogs/aws/vector-engine-for-amazon-opensearch-serverless-is-now-generally-available/" rel="noopener noreferrer"&gt;&lt;strong&gt;Vector Engine for Amazon OpenSearch Serverless&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://aws.amazon.com/opensearch-service/features/ingestion/" rel="noopener noreferrer"&gt;&lt;strong&gt;Amazon OpenSearch Ingestion&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://aws.amazon.com/blogs/machine-learning/prompt-engineering-techniques-and-best-practices-learn-by-doing-with-anthropics-claude-3-on-amazon-bedrock/" rel="noopener noreferrer"&gt;&lt;strong&gt;Prompt Engineering for Anthropic's Claude&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>serverless</category>
      <category>vectordatabase</category>
      <category>rag</category>
    </item>
  </channel>
</rss>
