<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kirill Polishchuk</title>
    <description>The latest articles on DEV Community by Kirill Polishchuk (@kirponik).</description>
    <link>https://dev.to/kirponik</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1489786%2Fbafd0c09-23c4-4142-8327-105b86dc70af.jpeg</url>
      <title>DEV Community: Kirill Polishchuk</title>
      <link>https://dev.to/kirponik</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kirponik"/>
    <language>en</language>
    <item>
      <title>Why Your SRE Agents Need a Graph</title>
      <dc:creator>Kirill Polishchuk</dc:creator>
      <pubDate>Sat, 21 Mar 2026 23:46:23 +0000</pubDate>
      <link>https://dev.to/kirponik/why-your-sre-agents-need-a-graph-4pij</link>
      <guid>https://dev.to/kirponik/why-your-sre-agents-need-a-graph-4pij</guid>
      <description>&lt;p&gt;Traditional automation relies on &lt;strong&gt;Directed Acyclic Graphs (DAGs)&lt;/strong&gt;—linear pipelines that execute steps A, then B, then C. Tools like GitHub Actions and Jenkins excel at this. They're perfect for deterministic workflows like building Docker images or running test suites.&lt;/p&gt;

&lt;p&gt;But infrastructure failures aren't linear. When your database chokes at 3 AM, the recovery process is iterative: you observe metrics, form a hypothesis, test it, and when it fails—you &lt;em&gt;backtrack&lt;/em&gt; and try another angle. Even after finding the root cause, you often need to &lt;strong&gt;pause and ask for human approval&lt;/strong&gt; before executing a potentially destructive remediation.&lt;/p&gt;

&lt;p&gt;A linear pipeline can't do any of this. If step 2 fails, the pipeline dies. It can't loop back to gather more data. It can't pause mid-execution to wait for human input.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is why AI agents need graph-based orchestration.&lt;/strong&gt; Not the rigid DAGs of CI/CD pipelines, but &lt;strong&gt;cyclic, stateful graphs&lt;/strong&gt; that support iteration, maintain context across cycles, and can pause for human approval at critical moments.&lt;/p&gt;

&lt;p&gt;Here's how I built a production-ready autonomous SRE system using &lt;strong&gt;LangGraph&lt;/strong&gt; for the orchestration and &lt;strong&gt;PydanticAI&lt;/strong&gt; for the agent intelligence.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why AI Agents Need a Graph
&lt;/h2&gt;

&lt;p&gt;Traditional automation uses linear pipelines. But AI agents are different—they think, they iterate, they sometimes need to ask for help. Without a graph-based orchestrator, you're left with brittle scripts that can't adapt.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Three Superpowers of Graph-Based Agent Orchestration
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Cyclic Routing: Think → Act → Reflect → Repeat&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Real debugging is iterative. An agent makes a hypothesis, tests it, and either succeeds or loops back with new information. A graph with cyclic routing lets your agents iterate naturally:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbyhx2g3ofzgxpixl6e32.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbyhx2g3ofzgxpixl6e32.jpg" alt=" " width="800" height="463"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Stateful Memory: Building Context Across Iterations&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Each agent cycle builds on the last. The graph maintains state—observations, metrics, hypotheses—so agents don't start from scratch every time. This persistent context is crucial for complex debugging scenarios where the root cause only becomes apparent after several failed attempts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Human-in-the-Loop with Interrupts: Safety at Critical Moments&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The most powerful feature: graphs can &lt;strong&gt;pause execution&lt;/strong&gt; and wait for human input. When your agent wants to kill a query on the production database, it should ask first. With the right orchestrator, this is elegant—the graph simply pauses, sends a Slack message with Approve/Reject buttons, and waits. When the human clicks, the graph resumes exactly where it left off.&lt;/p&gt;

&lt;p&gt;Without a graph orchestrator, implementing human approval would require complex state polling or external workflow engines. With the right tool, it's a natural part of the workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem with DAGs in Infrastructure
&lt;/h2&gt;

&lt;p&gt;Imagine a traditional automation script trying to debug a database:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Trigger:&lt;/strong&gt; High CPU alert.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Action:&lt;/strong&gt; Fetch slow query log.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Action:&lt;/strong&gt; If slow query found, kill it.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What happens if the slow query log is empty because the issue is actually an InnoDB lock wait? The script fails, throws an exception, and wakes you up anyway.&lt;/p&gt;

&lt;p&gt;What if the script identifies a fix but needs human approval before executing it? A DAG can't pause mid-execution—it just runs to completion or fails.&lt;/p&gt;

&lt;p&gt;We need a system that can say: &lt;em&gt;"Hmm, the slow log didn't give me the answer. Let me loop back, look at the disk I/O metrics, and form a new hypothesis. And before I execute any remediation, let me ask a human to confirm."&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Stateful, Multi-Agent Orchestration with Human Approval
&lt;/h2&gt;

&lt;p&gt;To solve this, I built a system with three core components:&lt;/p&gt;

&lt;h3&gt;
  
  
  The State Machine (LangGraph)
&lt;/h3&gt;

&lt;p&gt;The recovery process is treated as a state machine rather than a pipeline. The graph maintains the incident's global memory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;IncidentState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TypedDict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Global memory that persists across agent cycles.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;incident_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="c1"&gt;# Observations accumulate with each cycle (using operator.add)
&lt;/span&gt;    &lt;span class="n"&gt;observations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Annotated&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;operator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;
    &lt;span class="n"&gt;hypothesis&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;remediation_sql&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;cycle_count&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Annotated&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;operator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="c1"&gt;# Approval workflow state
&lt;/span&gt;    &lt;span class="n"&gt;approval_status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# pending, approved, rejected
&lt;/span&gt;    &lt;span class="n"&gt;approver&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As agents loop through diagnostic cycles, they append their findings here, so each new hypothesis is informed by everything that has already been tried.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Brain (PydanticAI Agents)
&lt;/h3&gt;

&lt;p&gt;Specialized AI agents handle different aspects of the investigation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Metrics Agent:&lt;/strong&gt; Fetches and interprets Prometheus data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analyzer:&lt;/strong&gt; Forms hypotheses from the metrics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Researcher:&lt;/strong&gt; Validates hypotheses by querying the database&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These agents return strictly typed JSON outputs, so the graph router knows exactly what to do with their results—loop back for more diagnosis, proceed to remediation, or escalate to a human.&lt;/p&gt;
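
&lt;p&gt;The routing contract can be sketched with plain dataclasses — the &lt;code&gt;Diagnosis&lt;/code&gt; type and its field names here are illustrative stand-ins for the real PydanticAI output schema, but the principle is the same: the router branches on validated fields, never on free-form prose:&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Diagnosis:
    """Typed agent output; field names are illustrative, not the real schema."""
    root_cause_found: bool
    hypothesis: str
    remediation_sql: Optional[str] = None

def route_on_output(result: Diagnosis, cycle_count: int, max_cycles: int = 5) -> str:
    """Deterministic routing: every branch keys off a typed field."""
    if cycle_count > max_cycles:
        return "escalate"
    if result.root_cause_found and result.remediation_sql:
        return "request_approval"
    return "diagnose"  # loop back with the refined hypothesis

print(route_on_output(Diagnosis(True, "lock wait", "KILL 42;"), cycle_count=2))
# -> request_approval
```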

&lt;h3&gt;
  
  
  The Safety Layer (MCP + Human Approval)
&lt;/h3&gt;

&lt;p&gt;Two critical safety mechanisms:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP (Model Context Protocol):&lt;/strong&gt; Acts as a secure gateway between the AI and your database. The AI never sees credentials—MCP holds them locally and only exposes read-only diagnostic tools. It's like a USB-C port for AI data access.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Human-in-the-Loop:&lt;/strong&gt; Before any destructive action, the graph pauses and sends an interactive Slack message. The workflow literally cannot proceed until a human clicks "Approve" or "Reject."&lt;/p&gt;

&lt;h3&gt;
  
  
  Dynamic Configuration
&lt;/h3&gt;

&lt;p&gt;A key insight: Grafana alerts already contain everything we need. The MySQL instance IP, whether it's a replica, the cluster name—all of it is in the alert labels. We extract this dynamically, eliminating the need for static configuration files or AWS credential management. Each alert creates its own isolated connection context.&lt;/p&gt;
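
&lt;p&gt;The extraction itself is mundane. A minimal sketch, assuming Grafana-style labels — the &lt;code&gt;instance&lt;/code&gt;, &lt;code&gt;cluster&lt;/code&gt;, and &lt;code&gt;role&lt;/code&gt; keys are illustrative and depend on how your alert rules are written:&lt;/p&gt;

```python
def connection_context(alert: dict) -> dict:
    """Build an isolated per-incident connection context from alert labels.
    Label names are illustrative; adapt to your own Grafana alert rules."""
    labels = alert.get("labels", {})
    host, _, port = labels["instance"].partition(":")
    return {
        "host": host,
        "port": int(port or 3306),  # fall back to the default MySQL port
        "cluster": labels.get("cluster", "unknown"),
        "read_only": labels.get("role") == "replica",
    }

ctx = connection_context({"labels": {"instance": "10.0.8.14:3306", "role": "replica"}})
print(ctx["host"], ctx["read_only"])
# -> 10.0.8.14 True
```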

&lt;h2&gt;
  
  
  The Orchestration Flow
&lt;/h2&gt;

&lt;p&gt;Here's how the pieces fit together:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcad6ile9g0s2yxa5uc69.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcad6ile9g0s2yxa5uc69.jpg" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Cyclic Routing in Practice
&lt;/h3&gt;

&lt;p&gt;The router is simple but powerful. Here's how it works in code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;router&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;IncidentState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Route the workflow based on current state.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Guardrail: Prevent infinite loops and runaway costs
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cycle_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;escalate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="c1"&gt;# Success: Found the root cause
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ready_for_remediation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;request_approval&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="c1"&gt;# Iteration needed: Loop back for more diagnosis
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;diagnose&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The router decides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;If the researcher finds the root cause:&lt;/strong&gt; Proceed to remediation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If the hypothesis is wrong:&lt;/strong&gt; Loop back to the diagnose node with updated context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If we've tried 5 times without success:&lt;/strong&gt; Escalate to a human&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If we're ready to remediate:&lt;/strong&gt; Pause and wait for human approval&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This cycle continues until success, human intervention, or the safety limit is reached.&lt;/p&gt;
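
&lt;p&gt;Stripped of the LangGraph machinery, the loop this router produces behaves like the following stdlib sketch — the agent is a stub and the state keys mirror the ones above, but the control flow (accumulate, re-check, cap) is the real point:&lt;/p&gt;

```python
def run_incident(diagnose_step, max_cycles: int = 5) -> dict:
    """Drive the think/act/reflect cycle until remediation-ready or escalation."""
    state = {"observations": [], "cycle_count": 0, "status": "diagnosing"}
    while True:
        if state["cycle_count"] > max_cycles:
            return {**state, "status": "escalate"}  # safety cap reached
        finding, done = diagnose_step(state)
        state["observations"].append(finding)  # context accumulates across cycles
        state["cycle_count"] += 1
        if done:
            return {**state, "status": "ready_for_remediation"}

# Stub agent: "finds" the root cause on its third cycle.
result = run_incident(lambda s: (f"cycle {s['cycle_count']}", s["cycle_count"] == 2))
print(result["status"], result["cycle_count"])
# -> ready_for_remediation 3
```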

&lt;h3&gt;
  
  
  The Interrupt Pattern
&lt;/h3&gt;

&lt;p&gt;When the graph reaches the approval point, it doesn't poll or busy-wait. It &lt;strong&gt;interrupts&lt;/strong&gt;—pausing execution entirely and saving its state:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;approval_wait_logic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;IncidentState&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Pause graph and wait for human approval via interrupt.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Send Slack message with Approve/Reject buttons
&lt;/span&gt;    &lt;span class="nf"&gt;send_slack_approval_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;incident_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hypothesis&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="c1"&gt;# Interrupt: Pause execution entirely, wait for external input
&lt;/span&gt;    &lt;span class="n"&gt;decision&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;interrupt&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;request&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;awaiting_approval&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;incident_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;incident_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]})&lt;/span&gt;

    &lt;span class="c1"&gt;# Graph resumes here when human clicks button
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;decision&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;approved&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;execute_remediation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;escalate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a human clicks "Approve" in Slack, the webhook resumes the graph with the decision:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# In Slack webhook handler
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_approval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;incident_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;approved&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;decision&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;approved&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;approved&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;approver&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_name&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;# Resume the graph with the human's decision
&lt;/span&gt;    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;astream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nc"&gt;Command&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resume&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;decision&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;configurable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;thread_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;incident_id&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;
    &lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;pass&lt;/span&gt;  &lt;span class="c1"&gt;# Graph continues from interrupt point
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Execution continues exactly where it left off. This is far more elegant than polling loops or external state machines.&lt;/p&gt;

&lt;h2&gt;
  
  
  Production Safety
&lt;/h2&gt;

&lt;p&gt;Unlike standard scripts, agentic systems are billed per "thinking cycle." An LLM stuck in a loop trying to debug a phantom network issue will happily burn through tokens until your OpenAI bill looks like a phone number.&lt;/p&gt;

&lt;p&gt;The cycle limit (capped at 5) ensures that if the AI is truly stumped, it gracefully escalates to a human SRE rather than looping infinitely.&lt;/p&gt;

&lt;p&gt;Other critical guardrails:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Read-only database access:&lt;/strong&gt; MCP only exposes diagnostic queries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mandatory human approval:&lt;/strong&gt; No destructive actions without explicit sign-off&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Approval timeouts:&lt;/strong&gt; Auto-escalate if humans don't respond in time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structured logging:&lt;/strong&gt; Full observability with correlation IDs for every incident&lt;/li&gt;
&lt;/ul&gt;
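
&lt;p&gt;The approval-timeout guardrail can be sketched with &lt;code&gt;asyncio&lt;/code&gt; — the one-hour window and the escalate-on-silence policy shown here are illustrative choices, not fixed parts of the system:&lt;/p&gt;

```python
import asyncio

async def wait_for_approval(decision_queue: asyncio.Queue, timeout_s: float = 3600):
    """Wait for a human decision; auto-escalate if nobody responds in time."""
    try:
        decision = await asyncio.wait_for(decision_queue.get(), timeout=timeout_s)
        return "execute_remediation" if decision["approved"] else "escalate"
    except asyncio.TimeoutError:
        return "escalate"  # silence is never consent for a destructive action

async def demo():
    q = asyncio.Queue()
    await q.put({"approved": True, "approver": "alice"})
    print(await wait_for_approval(q, timeout_s=0.1))

asyncio.run(demo())
# -> execute_remediation
```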

&lt;h2&gt;
  
  
  Why This Architecture Wins
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Without a graph orchestrator:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scripts fail on first error&lt;/li&gt;
&lt;li&gt;No way to iterate or backtrack&lt;/li&gt;
&lt;li&gt;No built-in human approval mechanism&lt;/li&gt;
&lt;li&gt;State is lost between steps&lt;/li&gt;
&lt;li&gt;Can't implement "ask a human" mid-workflow&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;With LangGraph:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Agents iterate naturally: hypothesis → test → refine&lt;/li&gt;
&lt;li&gt;State persists across cycles&lt;/li&gt;
&lt;li&gt;Interrupts enable human-in-the-loop safety&lt;/li&gt;
&lt;li&gt;Clear routing logic based on typed agent outputs&lt;/li&gt;
&lt;li&gt;Built-in cycle limits prevent runaway costs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By moving away from linear DAGs and utilizing cyclic graphs with governed tool access (MCP) and human-in-the-loop interrupts, we finally have an infrastructure recovery system that behaves like a real engineer: it investigates, it fails, it adapts, it asks for help when needed, and it tries again.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Linear pipelines work for deterministic processes. But infrastructure failures are messy, non-linear, and often require human judgment. AI agents need an orchestrator that matches this reality—one that supports iteration, maintains context, and can pause for human input.&lt;/p&gt;

&lt;p&gt;Graph-based orchestration isn't just a nice-to-have for AI agents. It's the difference between a brittle script that wakes you up at 3 AM and an autonomous system that either fixes the issue or escalates with full context.&lt;/p&gt;




&lt;h2&gt;
  
  
  📚 Resources
&lt;/h2&gt;

&lt;p&gt;Want to build this yourself? Here is the reading list I used to put this architecture together:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🔗 &lt;strong&gt;LangGraph Documentation:&lt;/strong&gt; The framework for building stateful, multi-actor applications with interrupts. &lt;a href="https://www.langchain.com/langgraph" rel="noopener noreferrer"&gt;Read the docs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🔗 &lt;strong&gt;LangGraph Interrupt Pattern:&lt;/strong&gt; How to pause graphs for human input. &lt;a href="https://langchain-ai.github.io/langgraph/concepts/persistence/" rel="noopener noreferrer"&gt;Read the docs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🔗 &lt;strong&gt;PydanticAI:&lt;/strong&gt; The typed, robust agent framework by the creators of Pydantic. &lt;a href="https://ai.pydantic.dev/" rel="noopener noreferrer"&gt;Read the docs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🔗 &lt;strong&gt;Model Context Protocol (MCP):&lt;/strong&gt; The open standard for securely connecting AI to data sources. &lt;a href="https://modelcontextprotocol.io/" rel="noopener noreferrer"&gt;Official Specification&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>devops</category>
      <category>langgraph</category>
      <category>mcp</category>
      <category>aiops</category>
    </item>
    <item>
      <title>Mastering Multi-Provider Routing with OpenRouter</title>
      <dc:creator>Kirill Polishchuk</dc:creator>
      <pubDate>Sat, 14 Mar 2026 22:17:03 +0000</pubDate>
      <link>https://dev.to/kirponik/mastering-multi-provider-routing-with-openrouter-1ce3</link>
      <guid>https://dev.to/kirponik/mastering-multi-provider-routing-with-openrouter-1ce3</guid>
      <description>&lt;h2&gt;
  
  
  🧠 The Single-Provider Trap
&lt;/h2&gt;

&lt;p&gt;Let's be real: treating a Large Language Model (LLM) provider like a highly available, always-on utility is a massive architectural risk. We've all experienced it. You deploy a sophisticated agentic workflow, and suddenly the primary API goes down, gets aggressively rate-limited, or starts throwing 5xx errors.&lt;/p&gt;

&lt;p&gt;Relying on a single provider—even an industry giant—creates a systemic vulnerability. To build true enterprise-grade AI applications, we have to decouple the application layer from specific vendors. The goal is to engineer a resilient "intelligence backbone" that autonomously shifts traffic based on availability, latency, and unit economics.&lt;/p&gt;

&lt;h2&gt;
  
  
  🏗️ Enter the Unified Routing Plane
&lt;/h2&gt;

&lt;p&gt;Instead of wrestling with half a dozen different SDKs and writing custom retry loops for OpenAI, Anthropic, Meta, and DeepSeek, modern architectures are shifting toward unified routing planes.&lt;/p&gt;

&lt;p&gt;By using an API gateway like OpenRouter, your application interfaces with just one endpoint. The complexity is handled entirely behind the scenes: the gateway uses built-in fallback logic to automatically reroute failed requests to secondary models, or to alternative infrastructure providers hosting the exact same open-weight model.&lt;/p&gt;

&lt;h2&gt;
  
  
  ⚙️ Declarative JSON Routing: Infrastructure as Data
&lt;/h2&gt;

&lt;p&gt;The cleanest way to manage routing at scale is by externalizing your logic into a declarative JSON configuration. This keeps your application code lean and allows Platform or FinOps teams to adjust routing priorities dynamically without triggering a full code deployment.&lt;/p&gt;

&lt;p&gt;Here is what a production-ready routing payload looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"meta-llama/llama-3.3-70b-instruct"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"messages"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Analyze this dataset..."&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"order"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"deepinfra/turbo"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fireworks"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"allow_fallbacks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"sort"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"latency"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"zdr"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"max_price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"completion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
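
&lt;p&gt;From application code, the same payload can be assembled and sent with nothing but the standard library. The endpoint below is OpenRouter's documented chat-completions URL, and the provider preferences mirror the JSON above; treat the &lt;code&gt;send&lt;/code&gt; helper as a hedged sketch rather than a production client:&lt;/p&gt;

```python
import json
import urllib.request

def build_payload(prompt: str) -> dict:
    """Mirror the declarative routing config above as a Python dict."""
    return {
        "model": "meta-llama/llama-3.3-70b-instruct",
        "messages": [{"role": "user", "content": prompt}],
        "provider": {
            "order": ["deepinfra/turbo", "fireworks"],
            "allow_fallbacks": True,
            "sort": "latency",
        },
    }

def send(payload: dict, api_key: str):
    """Sketch only: POST to OpenRouter's chat completions endpoint."""
    req = urllib.request.Request(
        "https://openrouter.ai/api/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    return urllib.request.urlopen(req)  # not exercised here

print(build_payload("Analyze this dataset...")["provider"]["sort"])
# -> latency
```

&lt;p&gt;Because the routing block is just data, it can live in a config file that FinOps edits independently of deployments.&lt;/p&gt;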



&lt;h3&gt;
  
  
  Model-Level Fallbacks for Maximum Resilience
&lt;/h3&gt;

&lt;p&gt;Beyond provider fallbacks, OpenRouter supports &lt;strong&gt;model-level fallbacks&lt;/strong&gt; using the &lt;code&gt;models&lt;/code&gt; array. This is a game-changer for resilience—if your primary model is completely unavailable across all providers, the gateway can automatically fall back to semantically similar models:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"models"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"anthropic/claude-sonnet-4.5"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"openai/gpt-5-mini"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"google/gemini-3-flash-preview"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"messages"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Analyze this dataset..."&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"sort"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"by"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"throughput"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"partition"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"none"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"zdr"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Setting &lt;code&gt;partition: "none"&lt;/code&gt; removes model grouping, allowing the router to sort endpoints globally across all models. This means if Claude is slow or down, your request automatically routes to the fastest available alternative—whether that's GPT-5-mini or Gemini—without any code changes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Performance Thresholds for Predictable SLAs
&lt;/h3&gt;

&lt;p&gt;For enterprise applications with strict latency requirements, you can set explicit performance thresholds using &lt;code&gt;preferred_max_latency&lt;/code&gt; and &lt;code&gt;preferred_min_throughput&lt;/code&gt;. These work with percentile statistics (p50, p75, p90, p99) calculated over a rolling 5-minute window:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"deepseek/deepseek-v3.2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"messages"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Generate report..."&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"sort"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"price"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"preferred_max_latency"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"p90"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"p99"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"preferred_min_throughput"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"p90"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Providers not meeting these thresholds are deprioritized (moved to fallback positions) rather than excluded entirely. This ensures your requests always execute while preferring endpoints that meet your SLA requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this configuration is powerful:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Surgical Provider Targeting (&lt;code&gt;order&lt;/code&gt;)&lt;/strong&gt;: We explicitly target optimized endpoints first, like DeepInfra's high-speed turbo instances.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic Sorting (&lt;code&gt;sort&lt;/code&gt;)&lt;/strong&gt;: Setting this to &lt;code&gt;"latency"&lt;/code&gt; instructs the gateway to actively seek out the fastest responding provider for your chosen model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero Data Retention (&lt;code&gt;zdr&lt;/code&gt;)&lt;/strong&gt;: A non-negotiable flag for enterprise compliance, ensuring your chosen providers do not log your sensitive prompts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost Ceilings (&lt;code&gt;max_price&lt;/code&gt;)&lt;/strong&gt;: Prevents automated failovers from accidentally defaulting to a premium, budget-draining endpoint during a weekend outage.&lt;/li&gt;
&lt;/ul&gt;
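&lt;p&gt;Putting those knobs together, the &lt;code&gt;routing_config.json&lt;/code&gt; policy can be generated with a short script. The model and provider slugs below are illustrative placeholders, not recommendations — check the current OpenRouter catalog for real identifiers:&lt;/p&gt;

```python
import json

# Illustrative routing policy combining the knobs above.
# Model and provider slugs are placeholders, not recommendations.
config = {
    "model": "deepseek/deepseek-v3.2",
    "messages": [{"role": "user", "content": "Summarize this incident..."}],
    "provider": {
        "order": ["deepinfra", "fireworks"],         # preferred endpoints first
        "sort": "latency",                           # then chase the fastest
        "zdr": True,                                 # zero-data-retention only
        "max_price": {"prompt": 1, "completion": 2}  # USD per million tokens
    },
}

with open("routing_config.json", "w") as f:
    json.dump(config, f, indent=2)
```

&lt;p&gt;Keeping the policy in a standalone file means routing changes ship as config diffs, not code deployments.&lt;/p&gt;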

&lt;p&gt;Your application code remains blissfully simple. You just inject this JSON into a standard REST call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="c1"&gt;# Load declarative routing policy
&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;routing_config.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# A single API call handles all fallbacks and routing internally
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;https://openrouter.ai/api/v1/chat/completions&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bearer &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;API_KEY&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
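&lt;p&gt;Because fallbacks happen server-side, it is worth logging which model actually answered each request so silent failovers show up in your metrics. A minimal sketch, assuming an OpenAI-compatible response body (the &lt;code&gt;provider&lt;/code&gt; field is an OpenRouter-specific extension and may be absent on other gateways):&lt;/p&gt;

```python
# Sketch: surface which model/provider served the request.
# Assumes an OpenAI-compatible response body; "provider" is an
# OpenRouter-specific field and may be absent elsewhere.
def served_by(body):
    model = body.get("model", "unknown")
    provider = body.get("provider", "unknown")
    return f"{model} via {provider}"

# Canned example of a response body after a fallback:
sample = {"model": "openai/gpt-5-mini", "provider": "OpenAI"}
print(served_by(sample))  # openai/gpt-5-mini via OpenAI
```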



&lt;h2&gt;
  
  
  💸 FinOps &amp;amp; Unit Economics
&lt;/h2&gt;

&lt;p&gt;Running complex Retrieval-Augmented Generation (RAG) pipelines or large-context reasoning models gets expensive fast. A mature FinOps strategy requires strict controls, and centralizing your routing makes this vastly easier to manage.&lt;/p&gt;

&lt;p&gt;You can establish cost-aware routing dynamically. By setting the &lt;code&gt;provider.sort&lt;/code&gt; key to &lt;code&gt;"price"&lt;/code&gt;, the gateway automatically hunts down the cheapest inference provider currently hosting your requested open-source model. The &lt;code&gt;max_price&lt;/code&gt; parameter ensures your AI spend remains entirely predictable, even when fallback chains are triggered.&lt;/p&gt;

&lt;h3&gt;
  
  
  Real-World Cost Impact
&lt;/h3&gt;

&lt;p&gt;To understand the savings potential, consider the price variance across providers for the same model. For example, &lt;strong&gt;Llama 3.3 70B&lt;/strong&gt; pricing varies significantly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DeepInfra&lt;/strong&gt;: ~$0.15/million input tokens, $0.20/million output tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fireworks AI&lt;/strong&gt;: ~$0.20/million input tokens, $0.20/million output tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Together AI&lt;/strong&gt;: ~$0.20/million input tokens, $0.20/million output tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS Bedrock&lt;/strong&gt;: ~$0.72/million input tokens, $0.72/million output tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The price delta between the most expensive and the most affordable provider is roughly $0.52 to $0.57 per million tokens. For a workload processing 100 billion tokens monthly, that means switching providers saves roughly $52,000 to $57,000 per month, depending on your input/output mix. The &lt;code&gt;max_price&lt;/code&gt; parameter acts as a circuit breaker: if no compliant provider is available under your ceiling, the request fails gracefully rather than silently draining your budget.&lt;/p&gt;
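&lt;p&gt;A quick back-of-the-envelope calculation makes the delta concrete. The prices are the ones listed above; the 80/20 input/output split is an assumption:&lt;/p&gt;

```python
# Monthly cost comparison using the illustrative Llama 3.3 70B prices above.
# Prices are USD per million tokens; volumes are tokens per month.
PRICES = {
    "DeepInfra":   {"input": 0.15, "output": 0.20},
    "Fireworks":   {"input": 0.20, "output": 0.20},
    "Together":    {"input": 0.20, "output": 0.20},
    "AWS Bedrock": {"input": 0.72, "output": 0.72},
}

def monthly_cost(provider, input_tokens, output_tokens):
    p = PRICES[provider]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# 100 billion tokens per month, split 80% input / 20% output:
cheapest = monthly_cost("DeepInfra", 80e9, 20e9)
priciest = monthly_cost("AWS Bedrock", 80e9, 20e9)
print(round(priciest - cheapest))  # 56000 (dollars saved per month)
```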

&lt;h2&gt;
  
  
  ⚖️ The Centralization Trade-off
&lt;/h2&gt;

&lt;p&gt;This architecture is incredibly powerful, but it's not a silver bullet. The biggest trade-off is centralization. By moving away from individual provider SDKs, you are trading multiple potential points of failure for a single, massive one: the routing gateway itself.&lt;/p&gt;

&lt;p&gt;If the unified API's load balancers fail, your entire stack loses access to external AI simultaneously. It's a calculated risk—you're betting that a dedicated routing platform will maintain better aggregate uptime than any individual LLM provider.&lt;/p&gt;

&lt;h2&gt;
  
  
  🎯 The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Relying on a solitary API endpoint is no longer acceptable for modern, mission-critical systems. It exposes your business to unpredictable vendor rate limits, unannounced deprecations, and frustrating outages.&lt;/p&gt;

&lt;p&gt;By adopting a centralized routing plane with declarative JSON configurations, engineering teams can cleanly abstract away the chaos of the AI provider ecosystem. You gain the ability to orchestrate dynamic fallback arrays and latency-based routing without constantly rewriting application logic. This pattern hardens your application and creates a robust foundation for the next generation of autonomous agents.&lt;/p&gt;

&lt;h2&gt;
  
  
  📚 Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://openrouter.ai/docs/guides/routing/provider-selection" rel="noopener noreferrer"&gt;Official documentation&lt;/a&gt; - Official documentation on structuring JSON payloads for latency sorting, fallback arrays, and ZDR enforcement.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.finops.org/wg/finops-for-ai-overview/" rel="noopener noreferrer"&gt;FinOps for AI Frameworks&lt;/a&gt; - Strategic frameworks for measuring AI unit economics and mitigating cloud waste.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://openrouter.ai/docs/guides/routing/model-fallbacks" rel="noopener noreferrer"&gt;Model Fallbacks&lt;/a&gt;  - Deep dive into model-level routing strategies&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>openrouter</category>
      <category>llm</category>
      <category>highavailability</category>
    </item>
    <item>
      <title>"Just Enough" Platform Engineering: Replacing Terraform with Kubernetes APIs</title>
      <dc:creator>Kirill Polishchuk</dc:creator>
      <pubDate>Sat, 28 Feb 2026 22:07:12 +0000</pubDate>
      <link>https://dev.to/kirponik/just-enough-platform-engineering-replacing-terraform-with-kubernetes-apis-53d0</link>
      <guid>https://dev.to/kirponik/just-enough-platform-engineering-replacing-terraform-with-kubernetes-apis-53d0</guid>
      <description>&lt;p&gt;We’ve all been there. You want to build an Internal Developer Platform (IDP). You start with good intentions: &lt;em&gt;"Let's simplify infrastructure for our developers."&lt;/em&gt; Six months later, you have a sprawling Backstage instance that nobody likes, a fragile mountain of Terraform modules that take 40 minutes to apply, and developers who still just DM you to "fix the S3 bucket permissions."&lt;/p&gt;

&lt;p&gt;We fell into this trap. We tried to abstract everything away until we realized we were just hiding complexity, not managing it.&lt;/p&gt;

&lt;p&gt;This article details a different approach. We call it &lt;strong&gt;"Just Enough" Platform Engineering&lt;/strong&gt;. Instead of building a portal that triggers a CI pipeline to run Terraform (the "ClickOps" anti-pattern), we moved the abstraction layer into the Kubernetes cluster itself.&lt;/p&gt;

&lt;p&gt;Using &lt;strong&gt;AWS Kro (Kubernetes Resource Orchestrator)&lt;/strong&gt; and &lt;strong&gt;ACK (AWS Controllers for Kubernetes)&lt;/strong&gt;, we built a self-service API that allows developers to spin up production-ready, compliant microservices in minutes. No Jenkins pipelines. No Terraform state locks. Just &lt;code&gt;kubectl apply&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Here is how we solved the "Day 2" operations gap and cut provisioning time from days to minutes.&lt;/p&gt;




&lt;h2&gt;
  
  
  🎯 The Real Problem: The "Day 2" Gap
&lt;/h2&gt;

&lt;p&gt;Most platforms nail "Day 0" (creating the hello-world app). They fail at "Day 2" (maintenance).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Scenario:&lt;/strong&gt; You use a Terraform module to provision an S3 bucket for a team.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Problem:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Drift:&lt;/strong&gt; A developer manually changes the bucket policy in the AWS Console to debug something. Your Terraform state is now wrong.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Versioning:&lt;/strong&gt; You update the Terraform module to enforce encryption. You now have to run &lt;code&gt;terraform apply&lt;/code&gt; across 50 different repositories to propagate the fix.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cognitive Load:&lt;/strong&gt; Developers have to learn HCL just to add a queue.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We needed a solution that was &lt;strong&gt;actively reconciling&lt;/strong&gt; (fixing drift automatically) and &lt;strong&gt;API-centric&lt;/strong&gt; (versioned and manageable).&lt;/p&gt;
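&lt;p&gt;"Actively reconciling" is just the controller pattern: observe actual state, diff it against desired state, and emit corrective actions, forever. A toy sketch of the loop body (pure Python, no AWS calls, purely illustrative):&lt;/p&gt;

```python
# Toy reconciliation: the control-loop pattern behind ACK and Kro.
# "desired" comes from the Kubernetes object; "actual" from the cloud API.
def reconcile(desired, actual):
    """Return the mutations needed to converge actual onto desired."""
    actions = []
    for key, want in desired.items():
        if actual.get(key) != want:
            actions.append(("set", key, want))
    for key in actual:
        if key not in desired:
            actions.append(("delete", key))
    return actions

desired = {"encryption": "AES256", "versioning": True}
actual = {"encryption": "none", "public_access": True}  # drifted by hand
actions = reconcile(desired, actual)
print(actions)
```

&lt;p&gt;ACK controllers run a loop of this shape against AWS APIs, so a hand-edited bucket policy gets corrected on the next sync instead of rotting as drift.&lt;/p&gt;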




&lt;h2&gt;
  
  
  🛠️ The Architecture: Kubernetes as the Control Plane
&lt;/h2&gt;

&lt;p&gt;We stopped treating Kubernetes as just a container scheduler and started treating it as a universal control plane.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AWS Kro:&lt;/strong&gt; Allows us to define custom APIs (CRDs) without writing Go code. It acts as the "glue" or the orchestrator.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ACK (AWS Controllers for Kubernetes):&lt;/strong&gt; Native Kubernetes controllers that talk to AWS APIs. They turn an S3 bucket into a Kubernetes object.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Workflow:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Platform Team&lt;/strong&gt; defines a &lt;code&gt;ResourceGraphDefinition&lt;/code&gt; (RGD). This is the blueprint.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kro&lt;/strong&gt; converts that RGD into a custom Kubernetes API (e.g., &lt;code&gt;kind: MLWorkspace&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Developer&lt;/strong&gt; applies a simple 5-line YAML file.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kro + ACK&lt;/strong&gt; automatically provision the Deployment, Service, IAM Role, and S3 Bucket, wiring them all together securely.&lt;/li&gt;
&lt;/ol&gt;
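&lt;p&gt;To make step 3 concrete: once the RGD is registered, the entire developer-facing manifest looks roughly like this (the project name is illustrative):&lt;/p&gt;

```yaml
apiVersion: kro.run/v1alpha1
kind: MLWorkspace
metadata:
  name: fraud-ml
spec:
  project: fraud-ml
  gpu: false
```

&lt;p&gt;Everything else — bucket, IAM role, service account, deployment — is derived by Kro from the blueprint defined below.&lt;/p&gt;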




&lt;h2&gt;
  
  
  💻 Implementation: The "Secure ML Workspace" API
&lt;/h2&gt;

&lt;p&gt;Let's build a real artifact. We want a Custom Resource called &lt;code&gt;MLWorkspace&lt;/code&gt; that gives a data scientist:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A Jupyter Notebook (Deployment + Service).&lt;/li&gt;
&lt;li&gt;A private S3 Bucket for datasets.&lt;/li&gt;
&lt;li&gt;An IAM Role that allows &lt;em&gt;only&lt;/em&gt; that notebook to access &lt;em&gt;only&lt;/em&gt; that bucket.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  1. The Foundation (Terraform)
&lt;/h3&gt;

&lt;p&gt;We use Terraform &lt;em&gt;only&lt;/em&gt; for the static base (EKS cluster, OIDC provider, and installing the controllers). We don't use it for the dynamic app resources.&lt;/p&gt;

&lt;p&gt;Terraform&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Install the ACK S3 Controller via Helm&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"helm_release"&lt;/span&gt; &lt;span class="s2"&gt;"ack_s3"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ack-s3-controller"&lt;/span&gt;
  &lt;span class="nx"&gt;chart&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"s3-chart"&lt;/span&gt;
  &lt;span class="nx"&gt;repository&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"oci://public.ecr.aws/aws-controllers-k8s"&lt;/span&gt;

  &lt;span class="c1"&gt;# Crucial: Map the K8s ServiceAccount to an AWS IAM Role (IRSA)&lt;/span&gt;
  &lt;span class="nx"&gt;set&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;name&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"serviceAccount.annotations.eks&lt;/span&gt;&lt;span class="err"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;.amazonaws&lt;/span&gt;&lt;span class="err"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;.com/role-arn"&lt;/span&gt;
    &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_iam_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ack_s3_controller&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. The Abstraction (Kro ResourceGraphDefinition)
&lt;/h3&gt;

&lt;p&gt;This is the "secret sauce." Instead of writing a complex Go Operator, we define the relationship graph in YAML.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: We use the &lt;code&gt;ResourceGraphDefinition&lt;/code&gt; kind (the current standard for Kro).&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kro.run/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ResourceGraphDefinition&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-workspace-api&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# The Interface: What the developer sees&lt;/span&gt;
  &lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1alpha1&lt;/span&gt;
    &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;MLWorkspace&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;project&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;string&lt;/span&gt;
      &lt;span class="na"&gt;gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;boolean | default=false&lt;/span&gt;
    &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;notebookUrl&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://${notebookservice.metadata.name}.${schema.metadata.namespace}.svc.cluster.local:8888"&lt;/span&gt;
      &lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${s3bucket.status.ackResourceMetadata.arn}&lt;/span&gt;

  &lt;span class="c1"&gt;# The Implementation: What gets created&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# 1. The Private S3 Bucket (Managed by ACK)&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;s3bucket&lt;/span&gt;
      &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;s3.services.k8s.aws/v1alpha1&lt;/span&gt;
        &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Bucket&lt;/span&gt;
        &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${schema.spec.project}-data&lt;/span&gt;
        &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${schema.spec.project}-data-${schema.metadata.uid}&lt;/span&gt; &lt;span class="c1"&gt;# Unique Name&lt;/span&gt;
          &lt;span class="na"&gt;encryption&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;applyServerSideEncryptionByDefault&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                  &lt;span class="na"&gt;sseAlgorithm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AES256&lt;/span&gt;

    &lt;span class="c1"&gt;# 2. The IAM Policy for Bucket Access (Managed by ACK)&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;iampolicy&lt;/span&gt;
      &lt;span class="na"&gt;readyWhen&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;${iampolicy.status.ackResourceMetadata.arn != &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="s"&gt;}&lt;/span&gt;
      &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;iam.services.k8s.aws/v1alpha1&lt;/span&gt;
        &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Policy&lt;/span&gt;
        &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${schema.spec.project}-s3-policy&lt;/span&gt;
        &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${schema.spec.project}-s3-policy-${schema.metadata.uid}&lt;/span&gt;
          &lt;span class="na"&gt;policyDocument&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;{&lt;/span&gt;
              &lt;span class="s"&gt;"Version": "2012-10-17",&lt;/span&gt;
              &lt;span class="s"&gt;"Statement": [{&lt;/span&gt;
                &lt;span class="s"&gt;"Effect": "Allow",&lt;/span&gt;
                &lt;span class="s"&gt;"Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],&lt;/span&gt;
                &lt;span class="s"&gt;"Resource": [&lt;/span&gt;
                  &lt;span class="s"&gt;"arn:aws:s3:::${schema.spec.project}-data-${schema.metadata.uid}",&lt;/span&gt;
                  &lt;span class="s"&gt;"arn:aws:s3:::${schema.spec.project}-data-${schema.metadata.uid}/*"&lt;/span&gt;
                &lt;span class="s"&gt;]&lt;/span&gt;
              &lt;span class="s"&gt;}]&lt;/span&gt;
            &lt;span class="s"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;# 3. The IAM Role for K8s Service Account/IRSA (Managed by ACK)&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;iamrole&lt;/span&gt;
      &lt;span class="na"&gt;readyWhen&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;${iamrole.status.ackResourceMetadata.arn != &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="s"&gt;}&lt;/span&gt;
      &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;iam.services.k8s.aws/v1alpha1&lt;/span&gt;
        &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Role&lt;/span&gt;
        &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${schema.spec.project}-role&lt;/span&gt;
        &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${schema.spec.project}-role-${schema.metadata.uid}&lt;/span&gt;
          &lt;span class="na"&gt;policies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;${iampolicy.status.ackResourceMetadata.arn}&lt;/span&gt;
          &lt;span class="na"&gt;assumeRolePolicyDocument&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;{&lt;/span&gt;
              &lt;span class="s"&gt;"Version": "2012-10-17",&lt;/span&gt;
              &lt;span class="s"&gt;"Statement": [{&lt;/span&gt;
                &lt;span class="s"&gt;"Effect": "Allow",&lt;/span&gt;
                &lt;span class="s"&gt;"Principal": { "Federated": "arn:aws:iam::ACCOUNT_ID:oidc-provider/OIDC_URL" },&lt;/span&gt;
                &lt;span class="s"&gt;"Action": "sts:AssumeRoleWithWebIdentity",&lt;/span&gt;
                &lt;span class="s"&gt;"Condition": {&lt;/span&gt;
                  &lt;span class="s"&gt;"StringEquals": {&lt;/span&gt;
                    &lt;span class="s"&gt;"OIDC_URL:sub": "system:serviceaccount:${schema.metadata.namespace}:${schema.spec.project}-sa"&lt;/span&gt;
                  &lt;span class="s"&gt;}&lt;/span&gt;
                &lt;span class="s"&gt;}&lt;/span&gt;
              &lt;span class="s"&gt;}]&lt;/span&gt;
            &lt;span class="s"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;# 4. The Kubernetes Service Account&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;serviceaccount&lt;/span&gt;
      &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
        &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ServiceAccount&lt;/span&gt;
        &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${schema.spec.project}-sa&lt;/span&gt;
          &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${schema.metadata.namespace}&lt;/span&gt;
          &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;eks.amazonaws.com/role-arn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${iamrole.status.ackResourceMetadata.arn}&lt;/span&gt;

    &lt;span class="c1"&gt;# 5. The Notebook Service&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;notebookservice&lt;/span&gt;
      &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
        &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
        &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${schema.spec.project}-notebook&lt;/span&gt;
          &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${schema.metadata.namespace}&lt;/span&gt;
        &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${schema.spec.project}&lt;/span&gt;
          &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8888&lt;/span&gt;
              &lt;span class="na"&gt;targetPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8888&lt;/span&gt;

    &lt;span class="c1"&gt;# 6. The Notebook Deployment (Wait for bucket to be ready)&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;notebook&lt;/span&gt;
      &lt;span class="na"&gt;readyWhen&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;${notebook.status.availableReplicas == 1}&lt;/span&gt;
      &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
        &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
        &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${schema.spec.project}-notebook&lt;/span&gt;
          &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${schema.metadata.namespace}&lt;/span&gt;
        &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${schema.spec.project}&lt;/span&gt;
          &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;serviceAccountName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${schema.spec.project}-sa&lt;/span&gt;
              &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jupyter&lt;/span&gt;
                  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jupyter/scipy-notebook:latest"&lt;/span&gt;
                  &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                    &lt;span class="c1"&gt;# AUTOMATIC WIRING: Inject the Bucket ARN directly&lt;/span&gt;
                    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DATA_BUCKET&lt;/span&gt;
                      &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${s3bucket.status.ackResourceMetadata.arn}&lt;/span&gt;
                  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                    &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                      &lt;span class="c1"&gt;# Conditional Logic in CEL&lt;/span&gt;
                      &lt;span class="na"&gt;nvidia.com/gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${schema.spec.gpu?&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;'1'&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;'0'}"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. The Developer Experience
&lt;/h3&gt;

&lt;p&gt;The developer doesn't care about IAM policies, encryption rules, or Pod selectors. They just want a workspace:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;MLWorkspace&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detection-dev&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;project&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detection&lt;/span&gt;
  &lt;span class="na"&gt;gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. When they apply this, Kro creates the bucket, waits for the ARN to be generated by AWS, injects that ARN into the Pod's environment variables, and spins up the compute.&lt;/p&gt;




&lt;h2&gt;
  
  
  💰 The Cost Reality
&lt;/h2&gt;

&lt;p&gt;Is running this expensive? We analyzed the costs of running the control plane (ACK + Kro) versus the operational savings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The "Tax" (Infrastructure Cost):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Kro Controller:&lt;/strong&gt; Runs as a standard Pod on your existing EKS nodes. Costs nothing beyond the base EC2/Fargate compute required (which is negligible).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ACK Controllers:&lt;/strong&gt; Also run as Pods on your existing nodes. Minimal resource usage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total "Platform Tax":&lt;/strong&gt; Essentially &lt;strong&gt;$0&lt;/strong&gt; in additional licensing or managed service fees. You only pay for the standard EKS cluster and the underlying compute nodes you are already running.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Savings (Operational):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Drift Remediation:&lt;/strong&gt; $0 (Automatic).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wait Time:&lt;/strong&gt; Reduced from days (ticketing queue) to seconds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security Audits:&lt;/strong&gt; The RGD acts as a policy. You can verify that &lt;em&gt;every&lt;/em&gt; &lt;code&gt;MLWorkspace&lt;/code&gt; uses AES256 encryption just by checking the single RGD file.&lt;/li&gt;
&lt;/ul&gt;
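&lt;p&gt;Because every &lt;code&gt;MLWorkspace&lt;/code&gt; is stamped from the same RGD template, auditing the template once audits the whole fleet. A minimal sketch of that idea (the YAML excerpt and function below are illustrative, not from the actual project):&lt;/p&gt;

```python
# Illustrative excerpt of the RGD's encryption block (not the real file).
RGD_EXCERPT = """\
encryption:
  rules:
    - applyServerSideEncryptionByDefault:
        sseAlgorithm: AES256
"""

def audit_encryption(rgd_text: str) -> bool:
    """Check the single RGD template for the required SSE algorithm.

    Every MLWorkspace instance is generated from this one template,
    so one check over the template covers every workspace.
    """
    return "sseAlgorithm: AES256" in rgd_text

print(audit_encryption(RGD_EXCERPT))
```

&lt;p&gt;In practice this check would run in CI against the RGD file in your platform repo, failing the build if the encryption rule is ever removed.&lt;/p&gt;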




&lt;h2&gt;
  
  
  🧠 My Individual Conclusion
&lt;/h2&gt;

&lt;p&gt;After migrating our core data services to this model, here is my honest take:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The "Leaky Abstraction" Risk is Real&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When an ACK resource fails (e.g., AWS rejects the bucket name because it's taken), the error bubbles up to the Kubernetes status. Your developers &lt;em&gt;will&lt;/em&gt; need to know how to read &lt;code&gt;kubectl describe&lt;/code&gt; output. You cannot hide the cloud entirely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Portability vs. Integration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Kro creates a tight coupling with the underlying CRDs (ACK). If you move to Google Cloud, you have to rewrite your RGDs to use Config Connector (Google's equivalent). This is &lt;strong&gt;not&lt;/strong&gt; a "write once, run anywhere" solution like pure Helm charts might claim to be, but the operational stability you gain on your primary cloud is worth the lock-in.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The Verdict&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Use Kro if you are a platform team that wants to provide &lt;strong&gt;golden paths&lt;/strong&gt; without building a massive software project. It sits perfectly in the sweet spot between "raw YAML" and "heavy enterprise portal."&lt;/p&gt;




&lt;h2&gt;
  
  
  📚 Resources &amp;amp; References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Official Project:&lt;/strong&gt; &lt;a href="https://github.com/kubernetes-sigs/kro" rel="noopener noreferrer"&gt;kubernetes-sigs/kro on GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Controllers:&lt;/strong&gt; &lt;a href="https://aws-controllers-k8s.github.io/community/docs/community/overview/" rel="noopener noreferrer"&gt;AWS Controllers for Kubernetes (ACK) Overview&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deep Dive:&lt;/strong&gt; &lt;a href="https://www.cncf.io/blog/2025/12/15/building-platforms-using-kro-for-composition/" rel="noopener noreferrer"&gt;CNCF: Building Platforms Using Kro for Composition&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Syntax Guide:&lt;/strong&gt; &lt;a href="https://github.com/google/cel-spec" rel="noopener noreferrer"&gt;CEL (Common Expression Language) Introduction&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>platformengineering</category>
      <category>kubernetes</category>
      <category>aws</category>
      <category>devops</category>
    </item>
    <item>
      <title>An AI Crew for Automated Diagramming and Documentation</title>
      <dc:creator>Kirill Polishchuk</dc:creator>
      <pubDate>Sun, 16 Nov 2025 22:44:43 +0000</pubDate>
      <link>https://dev.to/kirponik/an-ai-crew-for-automated-diagramming-and-documentation-og2</link>
      <guid>https://dev.to/kirponik/an-ai-crew-for-automated-diagramming-and-documentation-og2</guid>
      <description>&lt;h2&gt;
  
  
  The Introduction
&lt;/h2&gt;

&lt;p&gt;Our cloud documentation is almost always out of date. It's not because we're lazy; it's because the cloud moves too fast. A diagram drawn in a sprint planning meeting is obsolete by the time the code hits production. This documentation crisis, which every engineering team faces, is a massive, invisible tax. Nobody talks about it, but we know that manual updates are expensive, error-prone, and always outdated when you need them most. The "cost" isn't just the 2-3 days of senior engineer time every quarter—it's the production incidents that could have been prevented, the security vulnerabilities you didn't know existed, and the new hires who take weeks to understand the system.&lt;/p&gt;

&lt;p&gt;I was tired of this cycle. So I built a solution that uses AI agents to automatically scan live AWS environments and generate accurate, multi-audience documentation in minutes—not days. Here's how it works, what I learned, and why this approach unlocks something bigger than just better diagrams.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;💡 Why Everything We've Tried Has Failed&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;❌ Manual Documentation&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The promise:&lt;/strong&gt; "We'll keep the wiki updated"&lt;br&gt;
&lt;strong&gt;The reality:&lt;/strong&gt; Updated once during setup, referenced never, trusted by no one&lt;br&gt;
&lt;strong&gt;The cost:&lt;/strong&gt; 2-3 days of senior engineer time per environment, outdated within weeks&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;❌ Diagrams-as-Code (Terraform/CloudFormation diagrams)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The promise:&lt;/strong&gt; "Our IaC is our documentation"&lt;br&gt;
&lt;strong&gt;The reality:&lt;/strong&gt; Shows the &lt;em&gt;intended&lt;/em&gt; state, not the actual state after three hotfixes and that manual console change on Friday night&lt;br&gt;
&lt;strong&gt;The gap:&lt;/strong&gt; What you &lt;em&gt;planned&lt;/em&gt; vs. what actually &lt;em&gt;exists&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;❌ Static Scanning Tools&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The promise:&lt;/strong&gt; "We'll scan your infrastructure"&lt;br&gt;
&lt;strong&gt;The reality:&lt;/strong&gt; Dumps 10,000 lines of JSON that tell you &lt;em&gt;what&lt;/em&gt; exists but not &lt;em&gt;why&lt;/em&gt; or &lt;em&gt;how it's connected.&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Solution
&lt;/h2&gt;

&lt;p&gt;💡 AI Agents That Understand Infrastructure&lt;/p&gt;

&lt;p&gt;What we actually needed was a system that could perceive infrastructure like a scanner, understand it like a senior architect, and explain it like a technical writer—automatically. To achieve this, I created a "crew" of specialized AI agents—each with a specific job, just like a real engineering team.&lt;/p&gt;

&lt;p&gt;Think of it like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Inspector&lt;/strong&gt; scans AWS (like a junior engineer running AWS CLI commands)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Analyst&lt;/strong&gt; understands relationships (like a senior architect reviewing configs)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Draftsman&lt;/strong&gt; creates diagrams (like a technical illustrator)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Writers&lt;/strong&gt; create documentation for different audiences:

&lt;ul&gt;
&lt;li&gt;Technical Writer → detailed runbook for ops teams&lt;/li&gt;
&lt;li&gt;Executive Analyst → high-level summary for leadership&lt;/li&gt;
&lt;li&gt;Developer Advocate → practical guide for developers&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;All working in parallel, all generating outputs from the same live data, all in minutes.&lt;/p&gt;
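&lt;p&gt;Stripped of the framework, the handoff between these agents is just a sequential pipeline in which each stage consumes the previous stage's output. A framework-free sketch of that flow (all names and stub data below are illustrative):&lt;/p&gt;

```python
# Sketch of the Inspector -> Analyst -> Draftsman handoff, with each
# agent reduced to a plain function and the AWS scan stubbed out.
def inspector(_):
    return {"resources": ["i-0abc123", "db-1"]}   # raw scan results (stubbed)

def analyst(scan):
    # Turn the raw list into a logical model of the architecture.
    return {"web": scan["resources"][0], "db": scan["resources"][1]}

def draftsman(model):
    # Render the logical model as a PlantUML script.
    return f'@startuml\n{model["web"]} --> {model["db"]}\n@enduml'

def run_crew(stages, context=None):
    """Run each stage in order, feeding its output to the next."""
    for stage in stages:
        context = stage(context)
    return context

print(run_crew([inspector, analyst, draftsman]))
```

&lt;p&gt;The real system swaps these stubs for CrewAI agents backed by an LLM, but the data flow is the same.&lt;/p&gt;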

&lt;h2&gt;
  
  
  The Transformation
&lt;/h2&gt;

&lt;p&gt;💡 Before vs. After&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Aspect&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Before (Manual Process)&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;After (Automated with AI Agents)&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;⏱️ &lt;strong&gt;Time&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;2-3 days per environment&lt;/td&gt;
&lt;td&gt;5-10 minutes per environment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;👤 &lt;strong&gt;Who&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Senior engineer (expensive)&lt;/td&gt;
&lt;td&gt;Anyone with AWS access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;📄 &lt;strong&gt;Output&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;One diagram, maybe a doc&lt;/td&gt;
&lt;td&gt;Diagram + 4 tailored documents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🔄 &lt;strong&gt;Update Frequency&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Quarterly if you're lucky&lt;/td&gt;
&lt;td&gt;On-demand or automated (CI/CD)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🎯 &lt;strong&gt;Accuracy&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Outdated within weeks&lt;/td&gt;
&lt;td&gt;Always reflects current state&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;😰 &lt;strong&gt;Stress Level&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;High (always out of date)&lt;/td&gt;
&lt;td&gt;Low (always accurate)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;

&lt;p&gt;The entire system is open source. You can have it running in 5 minutes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Install the package&lt;/span&gt;
git clone https://github.com/kirPoNik/aws-architecture-diagrams-with-crewai.git
&lt;span class="nb"&gt;cd &lt;/span&gt;aws-architecture-diagrams-with-crewai
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;

&lt;span class="c"&gt;# 2. Run it (that's it!)&lt;/span&gt;
aws-diagram-generator &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; &lt;span class="s2"&gt;"Production"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--tags&lt;/span&gt; &lt;span class="s2"&gt;"Environment=prod"&lt;/span&gt; &lt;span class="s2"&gt;"App=myapp"&lt;/span&gt;

&lt;span class="c"&gt;# 3. Check your output/ directory for complete documentation&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Prerequisites:&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.10+&lt;/li&gt;
&lt;li&gt;AWS credentials&lt;/li&gt;
&lt;li&gt;AWS Config enabled&lt;/li&gt;
&lt;li&gt;AWS Bedrock access (Claude 3.5 Sonnet &lt;strong&gt;preferred&lt;/strong&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In under 10 minutes, you'll have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ PlantUML architecture diagram with AWS icons&lt;/li&gt;
&lt;li&gt;✅ Technical Runbook with every resource detail&lt;/li&gt;
&lt;li&gt;✅ Executive Summary in plain English&lt;/li&gt;
&lt;li&gt;✅ Developer Onboarding Guide with endpoints&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How It Actually Works
&lt;/h2&gt;

&lt;p&gt;Three Key Innovations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Universal Discovery&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This works with ANY AWS Service. The first breakthrough was realizing we don't need to hard-code &lt;code&gt;describe_instances()&lt;/code&gt;, &lt;code&gt;describe_db_instances()&lt;/code&gt;, etc. for every service. Instead, use AWS's universal APIs:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# This one API call finds ANY tagged resource across ALL services
&lt;/span&gt;&lt;span class="n"&gt;paginator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tagging_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_paginator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;get_resources&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;paginator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;paginate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TagFilters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;boto3_tag_filters&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;resources&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ResourceTagMappingList&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
    &lt;span class="n"&gt;all_resource_mappings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resources&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;strong&gt;Why this matters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Works with services that didn't exist when you wrote the code. No maintenance as AWS adds new services.

&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Batch Processing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The second breakthrough was batching AWS Config calls instead of fetching resources one-by-one:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Group by type
&lt;/span&gt;&lt;span class="n"&gt;resources_by_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;resource&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;resources&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;resource_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_resource_type_from_arn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;resources_by_type&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;resource_type&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Fetch up to 20 at once
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;config_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;batch_get_resource_config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;resourceKeys&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;resource_keys&lt;/span&gt;  &lt;span class="c1"&gt;# Batch of 20
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Automatic fallback for edge cases
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;error_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ValidationException&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;config_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select_resource_config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;Expression&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT * WHERE configuration.arn = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;safe_arn&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;&lt;strong&gt;Why this matters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Processes 100s of resources in seconds&lt;/li&gt;
&lt;li&gt;Built-in retry logic for throttling&lt;/li&gt;
&lt;li&gt;Automatic fallback when batch isn't supported

&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;AI Understanding&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The third breakthrough was using specialized AI agents with personas:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;inspector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;AWS Infrastructure Inspector&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Scan AWS and provide detailed JSON of resources&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;backstory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;You use AWS APIs to discover cloud resources based on tags.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;aws_scanner_tool&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;analyst&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Cloud Architecture Analyst&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Understand architecture, components, and relationships&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;backstory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;You interpret raw infrastructure data and structure it into a logical model.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;draftsman&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;PlantUML Diagram Draftsman&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Generate PlantUML diagram scripts&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;backstory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;You convert architectural information into PlantUML using AWS icons.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Chain them together: Inspector → Analyst → Draftsman
&lt;/span&gt;&lt;span class="n"&gt;task_inspect&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Scan AWS...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;inspector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;task_analyze&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Analyze...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;analyst&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;task_inspect&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;task_draw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Create diagram...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;draftsman&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;task_analyze&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;crew&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Crew&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[...],&lt;/span&gt; &lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[...])&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;crew&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;kickoff&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;&lt;strong&gt;Why this matters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each agent is an expert in its domain&lt;/li&gt;
&lt;li&gt;Outputs are human-readable, not raw JSON&lt;/li&gt;
&lt;li&gt;Same data → 4 different perspectives (technical, executive, developer, visual)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
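&lt;p&gt;The batching snippet above calls &lt;code&gt;extract_resource_type_from_arn&lt;/code&gt; without showing it. Here is one possible sketch of that helper, assuming a small, illustrative mapping from ARN service/resource pairs to AWS Config resource types (the real project may implement this differently, and a production mapping would cover far more services):&lt;/p&gt;

```python
# Illustrative mapping; AWS Config resource types follow the
# "AWS::Service::Type" naming convention.
ARN_TYPE_MAP = {
    ("ec2", "instance"): "AWS::EC2::Instance",
    ("s3", ""): "AWS::S3::Bucket",           # S3 ARNs carry only the bucket name
    ("rds", "db"): "AWS::RDS::DBInstance",
    ("lambda", "function"): "AWS::Lambda::Function",
}

def extract_resource_type_from_arn(arn: str) -> str:
    """Map an ARN to an AWS Config resource type string."""
    # ARN format: arn:partition:service:region:account:resource
    parts = arn.split(":", 5)
    service = parts[2]
    resource = parts[5] if len(parts) > 5 else ""
    # The resource part may be "type/id", "type:id", or a bare id (S3).
    for sep in ("/", ":"):
        if sep in resource:
            resource_type = resource.split(sep, 1)[0]
            break
    else:
        resource_type = ""
    return ARN_TYPE_MAP.get((service, resource_type), f"Unknown::{service}")

print(extract_resource_type_from_arn(
    "arn:aws:ec2:us-east-1:123456789012:instance/i-0abc123"))
```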

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;💡 How It All Fits Together&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffp5kd1ktyusv38bj9jde.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffp5kd1ktyusv38bj9jde.png" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What You Actually Get
&lt;/h2&gt;

&lt;p&gt;💡 Here's what the final markdown file can look like:&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# AWS Architecture Documentation: Production Environment

## Table of Contents
1. Architecture Diagram
2. Technical Infrastructure Runbook
3. Executive Summary for Leadership
4. Developer Onboarding Guide

## Architecture Diagram
@startuml
!include &amp;lt;awslib/AWSCommon&amp;gt;
!include &amp;lt;awslib/Compute/EC2&amp;gt;
!include &amp;lt;awslib/Database/RDS&amp;gt;

rectangle "VPC: vpc-12345 (10.0.0.0/16)" {
  rectangle "Public Subnet: subnet-abc" {
    ElasticLoadBalancing(alb, "Application LB", "")
  }
  rectangle "Private Subnet: subnet-def" {
    EC2(web1, "Web Server 1", "t3.medium")
    EC2(web2, "Web Server 2", "t3.medium")
  }
  rectangle "DB Subnet: subnet-ghi" {
    RDS(db, "PostgreSQL", "db.t3.large")
  }
}

alb --&amp;gt; web1
alb --&amp;gt; web2
web1 --&amp;gt; db
web2 --&amp;gt; db
@enduml

## Technical Infrastructure Runbook

### Compute Resources
**EC2 Instance: i-0abc123** (Web Server 1)
- Instance Type: t3.medium
- Private IP: 10.0.1.10
- Security Groups: sg-web123 (allows 80/443 from ALB)
- IAM Role: web-server-role
- Tags: Environment=production, Tier=web

[... detailed configs for every resource ...]

## Executive Summary
This production environment hosts our customer-facing web application using a
highly available, three-tier architecture. The system consists of:

- **Web Tier:** Redundant web servers behind a load balancer for high availability
- **Database Tier:** Managed PostgreSQL database with automated backups
- **Security:** Private subnets, restricted security groups, encrypted data

The architecture supports approximately 10,000 daily users with 99.9% uptime...

## Developer Onboarding Guide
### Quick Start
**Application URL:** &amp;lt;https://my-app-prod-123.us-east-1.elb.amazonaws.com&amp;gt;

**Database Connection:**
Host: mydb.cluster-abc.us-east-1.rds.amazonaws.com
Port: 5432
Database: production_db
User: app_user

### Environment Variables
[... practical connection details ...]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  💭 Final Thoughts and Next Steps
&lt;/h2&gt;

&lt;p&gt;This approach is powerful, but it's not magic. Here are the real-world considerations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Dependency:&lt;/strong&gt; The &lt;code&gt;AWS Config&lt;/code&gt; discovery method is robust, but it relies on AWS Config being enabled and correctly configured to record all the resource types you care about.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost:&lt;/strong&gt; This makes heavy use of a powerful LLM (like Claude 3.5 Sonnet or GPT-4). Running it on-demand is fine, but running it every 10 minutes on a massive environment could get expensive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API Rate Limits:&lt;/strong&gt; AWS Bedrock enforces strict quotas, especially on Anthropic models (as low as 1-2 requests per minute). To work around this, we invoke the models through an inference profile; Anthropic models also require a use-case submission.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Non-Determinism:&lt;/strong&gt; LLMs are non-deterministic. The &lt;code&gt;Analyst&lt;/code&gt; might occasionally misinterpret a relationship or the &lt;code&gt;Draftsman&lt;/code&gt; might make a syntax error. This requires prompt refinement and testing.&lt;/li&gt;
&lt;/ol&gt;
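&lt;p&gt;For the rate-limit point, a client-side exponential backoff goes a long way. Here is a minimal, generic sketch; the flaky function below stands in for a throttled Bedrock call (the exception type and messages are invented for illustration):&lt;/p&gt;

```python
import random
import time

def with_backoff(fn, max_attempts=5, base_delay=1.0):
    """Retry fn on throttling errors with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RuntimeError:  # stand-in for botocore's ThrottlingException
            if attempt == max_attempts - 1:
                raise
            # Sleep base_delay * 2^attempt seconds, plus random jitter
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))

# Toy demonstration: a call that is throttled twice before succeeding
calls = {"n": 0}
def flaky_call():
    calls["n"] += 1
    if calls["n"] > 2:
        return "model response"
    raise RuntimeError("ThrottlingException")

print(with_backoff(flaky_call, base_delay=0.01))
```

&lt;p&gt;In practice you would catch &lt;code&gt;botocore&lt;/code&gt;'s throttling exception instead of &lt;code&gt;RuntimeError&lt;/code&gt;; the retry shape is the same.&lt;/p&gt;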

&lt;p&gt;Once you have AI agents that can perceive and understand your infrastructure, you unlock an entire category of use cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Cost Optimization&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;finops_analyst&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;FinOps Analyst&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Identify cost optimization opportunities&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;backstory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;You find abandoned or over-provisioned resources.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Output: "Found 5 unattached EBS volumes costing $150/month"
#         "RDS instance at 12% CPU could be downsized, save $200/month"
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Security Auditing&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;security_auditor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Security Auditor&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Identify security vulnerabilities&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;backstory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;You audit cloud configurations for compliance.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Output: "Security group sg-123 allows 0.0.0.0/0 on port 22"
#         "S3 bucket 'backups' is not encrypted"
#         "RDS instance publicly accessible"
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Compliance Verification&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;compliance_checker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Compliance Checker&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Verify HIPAA/PCI-DSS/SOC2 compliance&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Output: "HIPAA Violation: Database not in private subnet"
#         "PCI-DSS: Encryption at rest not enabled"
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  📚 Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;📦 &lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/kirPoNik/aws-architecture-diagrams-with-crewai" rel="noopener noreferrer"&gt;aws-architecture-diagrams-with-crewai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🛠️ &lt;strong&gt;Tools Used:&lt;/strong&gt; &lt;a href="https://docs.crewai.com/" rel="noopener noreferrer"&gt;CrewAI&lt;/a&gt; | &lt;a href="https://docs.aws.amazon.com/config/" rel="noopener noreferrer"&gt;AWS Config&lt;/a&gt; | &lt;a href="https://plantuml.com/" rel="noopener noreferrer"&gt;PlantUML&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🎨 &lt;strong&gt;AWS Icons:&lt;/strong&gt; &lt;a href="https://github.com/awslabs/aws-icons-for-plantuml" rel="noopener noreferrer"&gt;aws-icons-for-plantuml&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📚 &lt;strong&gt;CrewAI Examples:&lt;/strong&gt; &lt;a href="https://github.com/crewAIInc/crewAI-examples" rel="noopener noreferrer"&gt;crewAI-examples&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>ai</category>
      <category>devops</category>
      <category>documentation</category>
    </item>
    <item>
      <title>Predicting Failures in a Serverless App with AWS DevOps Guru and OpenTelemetry</title>
      <dc:creator>Kirill Polishchuk</dc:creator>
      <pubDate>Sat, 25 Oct 2025 21:44:26 +0000</pubDate>
      <link>https://dev.to/kirponik/predicting-failures-in-a-serverless-app-with-aws-devops-guru-and-opentelemetry-2hfe</link>
      <guid>https://dev.to/kirponik/predicting-failures-in-a-serverless-app-with-aws-devops-guru-and-opentelemetry-2hfe</guid>
      <description>&lt;h3&gt;
  
  
  Limitations of Traditional Monitoring
&lt;/h3&gt;

&lt;p&gt;Managing modern distributed applications has become increasingly complex. Traditional monitoring tools, which rely mainly on manual analysis, are insufficient for ensuring the availability and performance demanded by microservice or serverless topologies.&lt;/p&gt;

&lt;p&gt;One of the main problems with traditional monitoring is the high volume and variety of telemetry data generated by IT environments. This includes metrics, logs, and traces, which in an ideal world should be consolidated on a single monitoring dashboard to allow observation of the entire system. Another problem is static thresholds for alarms. Setting them too low will generate a high volume of false positives, while setting them too high will fail to detect significant performance degradation.&lt;/p&gt;
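&lt;p&gt;The static-threshold problem is easy to make concrete. In the sketch below (latency numbers are invented for illustration), a fixed 500 ms alarm never fires on a series that has clearly degraded relative to its own history, while a simple adaptive baseline of mean plus three standard deviations catches it:&lt;/p&gt;

```python
from statistics import mean, stdev

# Illustrative latency samples (ms): a stable baseline, then a subtle degradation
baseline = [120, 115, 130, 125, 118, 122, 127, 121]
recent = 210  # degraded, yet still far below a naive 500 ms static threshold

STATIC_THRESHOLD_MS = 500
static_alarm = recent > STATIC_THRESHOLD_MS

# Adaptive baseline: alert when a sample deviates 3+ standard deviations from history
adaptive_alarm = recent > mean(baseline) + 3 * stdev(baseline)

print(static_alarm, adaptive_alarm)  # the static alarm misses what the baseline catches
```

&lt;p&gt;Real AIOps baselines are far more sophisticated (seasonality, multi-metric correlation), but the asymmetry above is the core of why static thresholds generate either noise or silence.&lt;/p&gt;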

&lt;p&gt;To solve these problems, organizations are shifting to an intelligent, automated, and predictive solution known as AIOps. Instead of relying on human operators to manually connect the dots, AIOps platforms are designed to ingest and analyze these vast datasets in real time.&lt;/p&gt;

&lt;p&gt;In this article, we will learn how AIOps platforms deliver proactive anomaly detection, their most fundamental capability, as well as root cause analysis, prediction, and alert generation.&lt;/p&gt;




&lt;h3&gt;
  
  
  The Technology Stack
&lt;/h3&gt;

&lt;p&gt;The solution detailed in this article is a combination of three synergistic pillars:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;A managed AIOps platform&lt;/strong&gt; that provides analytical intelligence. We will use AWS DevOps Guru, which is the core of our solution and acts as its "AIOps brain." AWS DevOps Guru is a managed service that leverages machine learning models built and trained by AWS experts. A key design principle is to make AIOps accessible to specialists without specialized machine learning expertise. Its primary function is to detect operational issues or anomalies and produce high-level insights instead of a stream of raw, uncorrelated alerts. These insights include related log snippets, a detailed analysis with a possible root cause, and actionable steps to diagnose and remediate the issue.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;An Open-Standard observability framework&lt;/strong&gt; that supplies high-quality telemetry data and provides a unified set of APIs, SDKs, and tools to generate, collect, and export it. The importance of OpenTelemetry lies in two principles: standardization and vendor neutrality. The benefit of using OpenTelemetry is that if we want to switch to a different AIOps tool, we can just redirect the telemetry stream.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A Serverless Application&lt;/strong&gt; that is an example of a modern and dynamic microservice topology.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The complete architecture of the proposed telemetry pipeline is shown in the diagram below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffhn3eko1oeva50gnwbvs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffhn3eko1oeva50gnwbvs.png" alt=" " width="800" height="1189"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;💡 &lt;em&gt;Figure 1. The solution architecture: a user request flows through the serverless application, the ADOT Lambda layer collects telemetry data and sends it to X-Ray and CloudWatch, and AWS DevOps Guru ingests this data and generates insights&lt;/em&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Practical Implementation
&lt;/h3&gt;

&lt;p&gt;It’s important to understand that AWS DevOps Guru does not collect any telemetry data itself; it is configured to monitor and continuously analyze the resources created by the application and identified by specific tags.&lt;/p&gt;

&lt;p&gt;To give the reader a clear picture, this section provides a guide to implementing the proposed solution; the Experiment section below shows how to exercise it. The following Git repository structure aligns with IaC best practices:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;.&lt;/span&gt;
├── demo
│   ├── envs
│   │   └── dev
│   │       ├── env.hcl    &lt;span class="c"&gt;# Environment-specific configuration that sets the environment name&lt;/span&gt;
│   │       ├── api_gateway
│   │       │   └── terragrunt.hcl
│   │       ├── devopsguru
│   │       │   └── terragrunt.hcl
│   │       ├── dynamodb
│   │       │   └── terragrunt.hcl
│   │       ├── iam
│   │       │   └── terragrunt.hcl
│   │       └── serverless_app
│   │           └── terragrunt.hcl
│   └── project.hcl    &lt;span class="c"&gt;# Project-level configuration defining `app_name_prefix` and `project_name` used across all environments&lt;/span&gt;
├── root.hcl    &lt;span class="c"&gt;# Root Terragrunt configuration that generates AWS provider blocks and configures S3 backend&lt;/span&gt;
├── src
│   ├── app.py    &lt;span class="c"&gt;# Lambda handler function with OpenTelemetry instrumentation&lt;/span&gt;
│   ├── requirements.txt
│   └── collector.yaml
└── terraform
    └── modules    &lt;span class="c"&gt;# Infrastructure Modules&lt;/span&gt;
        ├── api_gateway
        ├── devopsguru
        ├── dynamodb
        └── iam
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;💡 This Modular (&lt;code&gt;Terragrunt&lt;/code&gt;) Approach has the following &lt;strong&gt;Benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;True environment isolation: each environment (&lt;code&gt;dev&lt;/code&gt;, &lt;code&gt;prod&lt;/code&gt;, etc.) has its own state, config, and outputs.&lt;/li&gt;
&lt;li&gt;All major AWS resources (Lambda, API Gateway, DynamoDB, IAM, DevOps Guru) are reusable Terraform modules in &lt;code&gt;terraform/modules/&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Easy to extend for new AWS services or environments with minimal duplication.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;The full repository can be found &lt;a href="https://github.com/kirPoNik/aws-aiops-detection-with-guru" rel="noopener noreferrer"&gt;https://github.com/kirPoNik/aws-aiops-detection-with-guru&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The Lambda function (code in &lt;strong&gt;&lt;code&gt;app.py&lt;/code&gt;&lt;/strong&gt;) receives requests from API Gateway, generates a unique ID, and puts an item into the DynamoDB table. It also contains the logic to inject a "gray failure," which our experiment requires; see the code snippet with the key logic below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt;

&lt;span class="c1"&gt;# --- CONFIGURATION FOR GRAY FAILURE SIMULATION ---
# This environment variable acts as our feature flag for the experiment
&lt;/span&gt;&lt;span class="n"&gt;INJECT_LATENCY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INJECT_LATENCY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;false&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;MIN_LATENCY_MS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;150&lt;/span&gt;  &lt;span class="c1"&gt;# Minimum artificial latency in milliseconds
&lt;/span&gt;&lt;span class="n"&gt;MAX_LATENCY_MS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;  &lt;span class="c1"&gt;# Maximum artificial latency in milliseconds
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Handles requests and optionally injects a variable sleep
    to simulate performance degradation.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="c1"&gt;# This is the core logic for our "gray failure" simulation
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;INJECT_LATENCY&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;latency_seconds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MIN_LATENCY_MS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MAX_LATENCY_MS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;1000.0&lt;/span&gt;
        &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;latency_seconds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# The function's primary business logic is to write an item to DynamoDB
&lt;/span&gt;    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put_item&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;Item&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;created_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# ... returns a successful response ...
&lt;/span&gt;    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# ... returns an error response ...
&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The collector configuration (in &lt;strong&gt;&lt;code&gt;collector.yaml&lt;/code&gt;&lt;/strong&gt;) defines the pipelines that send traces to AWS X-Ray and metrics to Amazon CloudWatch; see the key logic below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# This file configures the OTel Collector in the ADOT layer&lt;/span&gt;
&lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# Send trace data to AWS X-Ray&lt;/span&gt;
  &lt;span class="na"&gt;awsxray&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# Send metrics to CloudWatch using the Embedded Metric Format (EMF)&lt;/span&gt;
  &lt;span class="na"&gt;awsemf&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;

&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pipelines&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# The pipeline for traces: receive data -&amp;gt; export to X-Ray&lt;/span&gt;
    &lt;span class="na"&gt;traces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;awsxray&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="c1"&gt;# The pipeline for metrics: receive data -&amp;gt; export to CloudWatch&lt;/span&gt;
    &lt;span class="na"&gt;metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;awsemf&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Simulating Failure and Generating Insights
&lt;/h3&gt;

&lt;p&gt;💡 The Experiment section&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Deploy the Stack&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the &lt;code&gt;demo/envs/dev&lt;/code&gt; directory, run the usual commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;terragrunt init &lt;span class="nt"&gt;--all&lt;/span&gt;
terragrunt plan &lt;span class="nt"&gt;--all&lt;/span&gt;
terragrunt apply &lt;span class="nt"&gt;--all&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Grab the API endpoint from the output and save it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;API_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;terragrunt output &lt;span class="nt"&gt;-json&lt;/span&gt; &lt;span class="nt"&gt;--all&lt;/span&gt;  | jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'to_entries[] | select(.key | test("api_endpoint")) | .value.value'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;💡 You need to enable AWS DevOps Guru and wait 15-90 minutes for it to finish &lt;strong&gt;Discovering applications and resources&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Establish a Baseline&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;DevOps Guru needs to learn what "normal" looks like. Let's give it some healthy traffic. We'll use &lt;strong&gt;&lt;code&gt;hey&lt;/code&gt;&lt;/strong&gt;, a simple load testing tool perfect for this job.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Why &lt;code&gt;hey&lt;/code&gt;? We could use a more complex tool like k6, which is great for scripting detailed user journeys. But for this test, we just need to hit an endpoint with a steady stream of requests. hey does that with a single command, keeping things simple.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Run a light load for a few hours. This gives the ML models plenty of data to build a solid baseline.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Run for 4 hours at 5 requests per second&lt;/span&gt;
hey &lt;span class="nt"&gt;-z&lt;/span&gt; 4h &lt;span class="nt"&gt;-q&lt;/span&gt; 5 &lt;span class="nt"&gt;-m&lt;/span&gt; POST &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$API_URL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;💡 Use GNU Screen to run this in the background&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Inject the Failure&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now for the fun part. We'll introduce our "gray failure" - a subtle slowdown that a simple threshold alarm would likely miss.&lt;/p&gt;

&lt;p&gt;In &lt;code&gt;demo/envs/dev/serverless_app/terragrunt.hcl&lt;/code&gt;, add a new &lt;code&gt;INJECT_LATENCY&lt;/code&gt; entry to our Lambda function's environment variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;environment_variables&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;TABLE_NAME&lt;/span&gt;                         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;dependency&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;dynamodb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;table_name&lt;/span&gt;
    &lt;span class="nx"&gt;AWS_LAMBDA_EXEC_WRAPPER&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"/opt/otel-instrument"&lt;/span&gt;
    &lt;span class="nx"&gt;OPENTELEMETRY_COLLECTOR_CONFIG_URI&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"/var/task/collector.yaml"&lt;/span&gt;
    &lt;span class="nx"&gt;INJECT_LATENCY&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"true"&lt;/span&gt; &lt;span class="c1"&gt;# &amp;lt;-- Change this to true&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Apply the change. This quick deployment is an important event that DevOps Guru will notice.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;terragrunt apply &lt;span class="nt"&gt;--all&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 4: Generate Bad Traffic&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Run the same load test again. This time, every request will have that extra, variable delay.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Run for at least an hour to generate enough bad data&lt;/span&gt;
hey &lt;span class="nt"&gt;-z&lt;/span&gt; 1h &lt;span class="nt"&gt;-q&lt;/span&gt; 5 &lt;span class="nt"&gt;-m&lt;/span&gt; POST &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$API_URL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our app is now performing worse than its baseline. Let's see if DevOps Guru noticed.&lt;/p&gt;

&lt;p&gt;After 30-60 minutes of bad traffic, an "insight" popped up in the DevOps Guru console. &lt;/p&gt;

&lt;p&gt;This is the real value of AIOps. A standard CloudWatch alarm would have just said, "Latency is high." DevOps Guru said, "Latency is high, and it started right after you deployed this change."&lt;/p&gt;
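&lt;p&gt;Conceptually, the correlation DevOps Guru performs amounts to: given the anomaly's start time and a stream of change events, find the most recent change that precedes it. A toy sketch of that idea (timestamps and event names are invented; the real service correlates over CloudTrail and deployment events with learned models, not a fixed window):&lt;/p&gt;

```python
# Toy change events: (unix_timestamp, description); values invented for illustration
events = [
    (1000, "terragrunt apply: serverless_app"),
    (4000, "scaling event: dynamodb"),
    (7200, "terragrunt apply: INJECT_LATENCY=true"),
]

def likely_cause(anomaly_start, events, window=3600):
    """Return the most recent change event within `window` seconds before the anomaly."""
    preceding = [e for e in events if window >= anomaly_start - e[0] >= 0]
    return max(preceding, key=lambda e: e[0], default=None)

print(likely_cause(7500, events))  # the deployment just before the latency spike
```

&lt;p&gt;The jump from "latency is high" to "latency is high and started right after this deployment" is exactly this kind of temporal join between telemetry and change events.&lt;/p&gt;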

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;This experiment shows a clear path away from reactive firefighting. By pairing a standard observability framework like &lt;strong&gt;OpenTelemetry&lt;/strong&gt; with an AIOps engine like &lt;strong&gt;AWS DevOps Guru&lt;/strong&gt;, we can build systems that help us find and fix problems before they become disasters.&lt;/p&gt;

&lt;p&gt;The big takeaway is &lt;strong&gt;correlation&lt;/strong&gt;. The magic wasn't just spotting the latency spike; it was automatically linking it to the deployment. That's the jump from raw data to real insight.&lt;/p&gt;

&lt;p&gt;The future of ops isn't about more dashboards. It's about fewer, smarter alerts that tell you what's wrong, why it's wrong, and how to fix it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;GitHub Repository: &lt;a href="https://github.com/kirPoNik/aws-aiops-detection-with-guru" rel="noopener noreferrer"&gt;https://github.com/kirPoNik/aws-aiops-detection-with-guru&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/devops-guru/" rel="noopener noreferrer"&gt;&lt;strong&gt;AWS DevOps Guru Official Page&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://opentelemetry.io/docs/" rel="noopener noreferrer"&gt;&lt;strong&gt;OpenTelemetry Official Documentation&lt;/strong&gt;&lt;/a&gt;:&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws-otel.github.io/docs/getting-started/lambda" rel="noopener noreferrer"&gt;&lt;strong&gt;AWS Distro for OpenTelemetry (ADOT) for Lambda&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/rakyll/hey" rel="noopener noreferrer"&gt;&lt;strong&gt;hey - HTTP Load Generator&lt;/strong&gt;&lt;/a&gt;:&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>observability</category>
      <category>aiops</category>
      <category>devops</category>
      <category>aws</category>
    </item>
    <item>
      <title>Building a 'Chat with Your Logs' System on AWS Using OpenSearch Serverless and Bedrock</title>
      <dc:creator>Kirill Polishchuk</dc:creator>
      <pubDate>Thu, 04 Sep 2025 23:36:45 +0000</pubDate>
      <link>https://dev.to/kirponik/building-a-chat-with-your-logs-system-on-aws-using-opensearch-serverless-and-bedrock-57g2</link>
      <guid>https://dev.to/kirponik/building-a-chat-with-your-logs-system-on-aws-using-opensearch-serverless-and-bedrock-57g2</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;🧩&lt;/strong&gt; The Challenge: Drowning in Data During Incidents
&lt;/h2&gt;

&lt;p&gt;In the critical moments of a production incident, engineering teams face a formidable challenge: navigating a deluge of log data to find the needle in the haystack. Traditional log analysis demands that engineers formulate precise, often complex, queries using specialized languages. This is effective when you know what to look for, but the real difficulty often lies in diagnosing the "unknown unknowns" - unexpected failures not captured by simple keyword searches.&lt;/p&gt;

&lt;p&gt;What if you could ask questions in plain English, like, &lt;strong&gt;"What were the most common errors for the checkout service in the last 15 minutes?"&lt;/strong&gt; This article demonstrates how to build a powerful, serverless AIOps pipeline on AWS to create a natural language interface for your application logs, transforming log analysis from a rigid, query-based task into an intuitive, conversational experience.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;💬&lt;/strong&gt; The Solution: Conversational AIOps with RAG
&lt;/h2&gt;

&lt;p&gt;This solution leverages a powerful pattern in generative AI known as &lt;strong&gt;Retrieval-Augmented Generation (RAG)&lt;/strong&gt;. RAG enhances the capabilities of Large Language Models (LLMs) by connecting them to external knowledge sources - in this case, your real-time application logs. This approach is highly cost-effective as it avoids expensive model retraining, instead providing the LLM with relevant, live context to answer questions accurately.&lt;/p&gt;
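&lt;p&gt;The "live context" idea is ultimately just prompt assembly: the retrieved log lines are concatenated into the prompt sent to the generative model, so no retraining is needed. A minimal sketch (the prompt wording and log lines are invented for illustration):&lt;/p&gt;

```python
def build_rag_prompt(question, retrieved_logs):
    """Assemble a grounded prompt: retrieved logs become the model's context."""
    context = "\n".join(f"- {line}" for line in retrieved_logs)
    return (
        "You are an SRE assistant. Answer using ONLY the log lines below.\n\n"
        f"Logs:\n{context}\n\n"
        f"Question: {question}\n"
    )

prompt = build_rag_prompt(
    "What errors hit the checkout service?",
    ["checkout: payment gateway timeout", "checkout: card declined"],
)
print(prompt)
```

&lt;p&gt;Swapping the knowledge source, say from logs to runbooks, changes only what gets retrieved, not the model.&lt;/p&gt;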

&lt;h3&gt;
  
  
  High-Level Architecture
&lt;/h3&gt;

&lt;p&gt;The system is composed of a series of integrated, serverless AWS services that form a complete AIOps pipeline, from ingestion to a conversational response.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdrz47ph7qjpnz8osaav2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdrz47ph7qjpnz8osaav2.png" alt="High-Level Architecture Diagram" width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The data flows as follows:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ingestion &amp;amp; Embedding:&lt;/strong&gt; Logs are streamed to an Amazon OpenSearch Ingestion pipeline. The pipeline uses an AWS Lambda function to call Amazon Bedrock's Titan Text Embeddings model, converting the semantic content of each log into a numerical vector.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Indexing:&lt;/strong&gt; The original log, now enriched with its vector embedding, is stored in an Amazon OpenSearch Serverless collection configured for high-performance vector search.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Query &amp;amp; Retrieval:&lt;/strong&gt; A user asks a question through a simple web app. The app converts the question into a vector using the same Titan model and performs a k-Nearest Neighbors (k-NN) similarity search against the OpenSearch collection to find the most semantically relevant logs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Synthesis &amp;amp; Response:&lt;/strong&gt; The retrieved logs are passed as context, along with the original question, to a powerful generative LLM like Anthropic's Claude on Amazon Bedrock. Claude analyzes the logs, synthesizes the information, and generates a coherent, human-readable answer.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
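&lt;p&gt;Step 3 above (query &amp;amp; retrieval) boils down to a single k-NN query against the collection. Here is a minimal sketch of the query body the app would send; the &lt;code&gt;log_embedding&lt;/code&gt; field name matches the index mapping used in this article, while the &lt;code&gt;_source&lt;/code&gt; fields are assumptions about the log schema:&lt;/p&gt;

```python
def build_knn_query(question_vector, k=5):
    # OpenSearch k-NN query: find the k log documents whose embeddings
    # are closest to the embedded question.
    return {
        "size": k,
        "query": {
            "knn": {
                "log_embedding": {
                    "vector": question_vector,
                    "k": k,
                }
            }
        },
        # Return only human-readable fields, not the raw vectors.
        "_source": ["message", "@timestamp"],
    }
```

&lt;p&gt;The resulting hits are the "retrieval" half of RAG; their &lt;code&gt;message&lt;/code&gt; fields become the context passed to the LLM in step 4.&lt;/p&gt;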

&lt;h2&gt;
  
  
  &lt;strong&gt;🧠&lt;/strong&gt; The AIOps Pipeline: Key Components
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How Ingestion and Embedding Work Together
&lt;/h3&gt;

&lt;p&gt;The core of the data processing is a seamless, serverless flow between the Amazon OpenSearch Ingestion pipeline and the &lt;code&gt;embedding_lambda&lt;/code&gt; function. This is how raw logs are enriched with semantic meaning before they are ever stored.&lt;/p&gt;

&lt;p&gt;Here’s a step-by-step breakdown of their interaction:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Arrives at the Pipeline:&lt;/strong&gt; An application sends a log entry to the OpenSearch Ingestion pipeline's HTTP endpoint.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pipeline Invokes the Lambda Processor:&lt;/strong&gt; The pipeline's configuration includes a &lt;code&gt;processor&lt;/code&gt; stage that points to our &lt;code&gt;embedding_lambda&lt;/code&gt; function. When the pipeline receives log data, it automatically invokes this Lambda, passing the batch of log records to it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Lambda Generates Embeddings:&lt;/strong&gt; The &lt;code&gt;embedding_lambda&lt;/code&gt; function executes its logic: it iterates through each log, extracts the text, and makes an API call to Amazon Bedrock's Titan Text Embeddings model. Bedrock returns a numerical vector (the embedding) that captures the log's meaning.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Lambda Enriches the Data:&lt;/strong&gt; The Lambda function adds this new vector as a field (e.g., &lt;code&gt;log_embedding&lt;/code&gt;) to the original log record.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pipeline Sends Data to the Sink:&lt;/strong&gt; The Lambda returns the modified, enriched log records back to the pipeline. The pipeline then sends this complete document to its configured &lt;code&gt;sink&lt;/code&gt; - the OpenSearch Serverless vector collection - where it is indexed and becomes available for &lt;strong&gt;semantic search&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
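&lt;p&gt;The interaction above is wired together entirely in the pipeline definition. The following is an illustrative (not copy-paste-ready) sketch: the function name, index, region, and endpoint are placeholders, and the exact &lt;code&gt;aws_lambda&lt;/code&gt; processor options should be checked against the current OpenSearch Ingestion / Data Prepper documentation:&lt;/p&gt;

```yaml
version: "2"
log-pipeline:
  source:
    http:
      path: "/logs"                  # applications POST log batches here
  processor:
    - aws_lambda:                    # invokes the enrichment Lambda per batch
        function_name: "embedding_lambda"
        invocation_type: "request-response"
        aws:
          region: "us-east-1"
  sink:
    - opensearch:
        hosts: ["https://your-collection-id.us-east-1.aoss.amazonaws.com"]
        index: "application-logs"
        aws:
          serverless: true
          region: "us-east-1"
```

&lt;p&gt;The key design point is that the enrichment step is declarative: the pipeline, not the application, decides that every record passes through the Lambda before reaching the sink.&lt;/p&gt;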

&lt;h3&gt;
  
  
  &lt;strong&gt;The Embedding Lambda: Adding Semantic Context&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;embedding_lambda&lt;/code&gt; is a small but critical piece of the pipeline. Its sole job is to &lt;strong&gt;enrich the log data&lt;/strong&gt; with semantic meaning. Triggered by the OpenSearch Ingestion pipeline for every new batch of logs, it performs three key steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Receives Logs:&lt;/strong&gt; It accepts a batch of raw log entries from the ingestion pipeline.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Generates Vectors:&lt;/strong&gt; It extracts the text from each log and calls the &lt;strong&gt;Amazon Bedrock API&lt;/strong&gt;, specifically requesting an embedding from the &lt;strong&gt;Titan Text Embeddings&lt;/strong&gt; model. Bedrock returns a numerical vector (e.g., a list of 1,024 numbers, the default output size for Titan Text Embeddings V2) that represents the log's meaning.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Returns Enriched Logs:&lt;/strong&gt; The function adds this vector to the original log data under a new field, like &lt;code&gt;log_embedding&lt;/code&gt;, and returns the modified batch to the ingestion pipeline, which then stores it in OpenSearch.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This function acts as a serverless, on-demand transformation engine, making our logs "smart" before they are even indexed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_embedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputText&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="n"&gt;model_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;amazon.titan-embed-text-v2:0&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bedrock_runtime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;modelId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;accept&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;contentType&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response_body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response_body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;embedding&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error generating embedding: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lambda_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;log_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt;
        &lt;span class="n"&gt;log_message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;log_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;log_message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generate_embedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="c1"&gt;# Add the new embedding vector to the log data
&lt;/span&gt;                &lt;span class="n"&gt;log_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;log_embedding&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt;
        &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  OpenSearch Serverless: The Vector Store
&lt;/h3&gt;

&lt;p&gt;We use an Amazon OpenSearch Serverless collection as our &lt;strong&gt;vector database&lt;/strong&gt;. Its &lt;code&gt;Vector search&lt;/code&gt; collection type is optimized for the high-performance similarity searches (k-NN) we need.&lt;/p&gt;

&lt;p&gt;For this to work, we must configure the index mapping to treat our &lt;code&gt;log_embedding&lt;/code&gt; field as a vector. This tells OpenSearch how to index the vector for efficient searching.&lt;/p&gt;

&lt;p&gt;Here is a sample index mapping, which you would typically define in your Terraform configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"log_embedding"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"knn_vector"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"dimension"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"method"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"hnsw"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"engine"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"faiss"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"space_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"l2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"parameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"ef_construction"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"m"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;💡 Key Configuration Details:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;"type": "knn_vector"&lt;/code&gt;: This explicitly defines the &lt;code&gt;log_embedding&lt;/code&gt; field for k-NN search.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;"dimension": 1024&lt;/code&gt;: This &lt;strong&gt;must match&lt;/strong&gt; the output dimension of your embedding model. Amazon Titan Text Embeddings generates vectors of this size.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;"method"&lt;/code&gt;: We specify the &lt;code&gt;hnsw&lt;/code&gt; (Hierarchical Navigable Small World) algorithm, which provides an excellent balance of search speed and accuracy for large datasets.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
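&lt;p&gt;Because a mismatch between the model's output dimension and the mapping's &lt;code&gt;dimension&lt;/code&gt; causes indexing failures, a cheap guard in the embedding Lambda is worth adding. A sketch (the 1,024 value assumes Titan V2's default output size; adjust it if you configure the model differently):&lt;/p&gt;

```python
EXPECTED_DIM = 1024  # must match "dimension" in the knn_vector mapping

def validate_embedding(embedding):
    # Drop (or dead-letter) records whose vectors don't match the mapping,
    # rather than letting OpenSearch reject the batch at index time.
    return embedding is not None and len(embedding) == EXPECTED_DIM
```

&lt;p&gt;Calling this in &lt;code&gt;lambda_handler&lt;/code&gt; before attaching &lt;code&gt;log_embedding&lt;/code&gt; turns a confusing sink-side error into an explicit, loggable decision.&lt;/p&gt;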

&lt;h2&gt;
  
  
  &lt;strong&gt;🛠️&lt;/strong&gt; Practical Implementation Guide
&lt;/h2&gt;

&lt;p&gt;The Git repository uses a modular Terraform layout, a best practice that promotes reusability and maintainability.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;├── README.md
├── envs/
│   ├── dev/
│   │   ├── main.tf
│   │   └── terraform.tfvars
├── modules/
│   ├── iam/
│   ├── ingestion_pipeline/
│   ├── embedding_lambda/
│   └── opensearch/
└── src/
    ├── embedding_lambda/
    └── streamlit_app/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;💡&lt;/strong&gt; The repository separates the definition of the infrastructure (in a &lt;code&gt;modules/&lt;/code&gt; directory) from the configuration for specific deployments (in an &lt;code&gt;envs/&lt;/code&gt; directory). An engineer can deploy a complete development environment by simply running &lt;code&gt;terraform apply&lt;/code&gt; within the &lt;code&gt;envs/dev/&lt;/code&gt; directory.&lt;/p&gt;

&lt;p&gt;See &lt;a href="https://github.com/kirPoNik/aws-bedrock-log-analytics-rag" rel="noopener noreferrer"&gt;Complete Code Repository&lt;/a&gt; for your reference.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  The User Interface and Prompt Engineering
&lt;/h3&gt;

&lt;p&gt;A simple web application built with Streamlit serves as the user-facing component. The quality of the final answer is heavily dependent on the quality of the prompt sent to the Claude model. A simple "Answer the question" prompt is insufficient. Instead, a robust prompt template is used to guide the model's behavior.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_llm_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;logs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;log_context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    You are an expert AIOps assistant. Your task is to answer questions about application behavior based *only* on the provided log entries. Do not use any prior knowledge. If the answer cannot be found in the logs, you must state &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;I cannot answer the question based on the provided logs.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;

    Here are the relevant log entries retrieved:
    &amp;lt;logs&amp;gt;
    &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;log_context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
    &amp;lt;/logs&amp;gt;

    Based on the logs above, please answer the following question:
    &amp;lt;question&amp;gt;
    &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
    &amp;lt;/question&amp;gt;
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic_version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock-2023-05-31&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bedrock_runtime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;modelId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;BEDROCK_MODEL_ID_CLAUDE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response_body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response_body&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;💡 Prompting Tip:&lt;/strong&gt; This prompt uses several best practices for Claude: it assigns a persona ("expert AIOps assistant"), provides clear constraints to prevent hallucination, and uses XML tags (&lt;code&gt;&amp;lt;logs&amp;gt;&lt;/code&gt; and &lt;code&gt;&amp;lt;question&amp;gt;&lt;/code&gt;) to structure the context, which significantly improves the model's ability to follow instructions.&lt;/p&gt;

&lt;p&gt;Here are the latest model IDs as of 2025:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;For the highest capability (Opus):&lt;/strong&gt; &lt;code&gt;anthropic.claude-opus-4-1-20250805-v1:0&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;For a balance of performance and cost (Sonnet):&lt;/strong&gt; &lt;code&gt;anthropic.claude-sonnet-4-20250514-v1:0&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;💫&lt;/strong&gt; A New Paradigm for Observability
&lt;/h2&gt;

&lt;p&gt;This serverless RAG solution represents a new approach to log analysis, with different strategic considerations compared to traditional tools.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost Model: Query vs. Ingestion
&lt;/h3&gt;

&lt;p&gt;The AIOps RAG architecture shifts the cost model. The cost of ingesting and creating embeddings for logs is relatively low. The primary cost driver is the LLM inference at query time. Each user question triggers an API call to the Claude model with a context of retrieved logs. This means the system's operational cost is driven not by log volume, but by &lt;em&gt;query volume and complexity&lt;/em&gt;. This makes the system ideal for high-value, deep-investigation queries during incidents, rather than high-frequency, dashboard-style monitoring.&lt;/p&gt;
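&lt;p&gt;A back-of-envelope sketch of this cost model follows. The prices below are &lt;em&gt;placeholders&lt;/em&gt;, not real Bedrock pricing; the point is only that cost scales with query volume and context size, not with ingested gigabytes:&lt;/p&gt;

```python
def monthly_query_cost(queries_per_month, input_tokens, output_tokens,
                       price_in_per_1k, price_out_per_1k):
    # Per-query cost = (input tokens + output tokens) priced per 1K tokens;
    # monthly cost scales linearly with query volume.
    per_query = (input_tokens / 1000) * price_in_per_1k \
              + (output_tokens / 1000) * price_out_per_1k
    return queries_per_month * per_query

# Illustrative only: hypothetical prices, not actual Bedrock rates.
cost = monthly_query_cost(
    queries_per_month=500,   # deep-investigation queries during incidents
    input_tokens=6000,       # retrieved logs + prompt template
    output_tokens=500,
    price_in_per_1k=0.003,   # hypothetical $ per 1K input tokens
    price_out_per_1k=0.015,  # hypothetical $ per 1K output tokens
)
```

&lt;p&gt;With numbers like these, a month of incident-driven questions costs dollars, not thousands, which is why the architecture favors high-value investigation over dashboard-style polling.&lt;/p&gt;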

&lt;h2&gt;
  
  
  🪄 The Future of Observability: Beyond Q&amp;amp;A
&lt;/h2&gt;

&lt;p&gt;The vector embeddings generated during ingestion are a valuable data asset that can be leveraged for capabilities far beyond simple question-answering.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Automated Semantic Anomaly Detection:&lt;/strong&gt; By applying clustering algorithms to the stream of log embeddings, the system can identify the emergence of new clusters of logs that are semantically distinct from the normal baseline. This can detect novel error types or subtle shifts in application behavior that keyword-based alerting would miss.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Automated Incident Summary Generation:&lt;/strong&gt; The summarization capabilities of LLMs can be used to automatically generate a first draft of an incident summary. By retrieving logs from an incident's timeframe, the system can provide a timeline of key events, a likely root cause, and customer impact, drastically reducing the manual effort required for post-mortem analysis.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
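&lt;p&gt;The first idea can be sketched with nothing more than distances to a baseline centroid. This is a deliberately naive stand-in for a real clustering approach; the three-sigma threshold and the toy 2-D vectors are assumptions for illustration (production embeddings would be 1,024-dimensional):&lt;/p&gt;

```python
import numpy as np

def novelty_scores(baseline_embeddings, new_embeddings):
    # Distance of each new log embedding to the baseline centroid;
    # unusually distant embeddings suggest a semantically novel log type.
    centroid = baseline_embeddings.mean(axis=0)
    return np.linalg.norm(new_embeddings - centroid, axis=1)

def flag_novel(baseline_embeddings, new_embeddings, sigmas=3.0):
    # Threshold at mean + sigmas * std of the baseline's own distances.
    centroid = baseline_embeddings.mean(axis=0)
    base_dist = np.linalg.norm(baseline_embeddings - centroid, axis=1)
    threshold = base_dist.mean() + sigmas * base_dist.std()
    return novelty_scores(baseline_embeddings, new_embeddings) > threshold
```

&lt;p&gt;Because the embeddings already exist in OpenSearch, an approach like this runs as a periodic batch job with no extra model calls, turning the vector index into an anomaly detector for free.&lt;/p&gt;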

&lt;h2&gt;
  
  
  &lt;strong&gt;✅&lt;/strong&gt; Conclusion
&lt;/h2&gt;

&lt;p&gt;The serverless RAG architecture presented here offers a transformative approach to log analysis on AWS. By combining the scalable vector search of Amazon OpenSearch Serverless with the advanced reasoning of foundation models on Amazon Bedrock, organizations can build powerful, conversational interfaces for their observability data. This approach lowers the barrier to deep log analysis, empowers a wider range of team members to participate in incident investigation, and opens the door to a new class of intelligent AIOps tools.&lt;/p&gt;




&lt;h2&gt;
  
  
  📚 Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;📚&lt;/strong&gt; &lt;a href="https://github.com/kirPoNik/aws-bedrock-log-analytics-rag" rel="noopener noreferrer"&gt;&lt;strong&gt;Complete Code Repository&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://aws.amazon.com/what-is/retrieval-augmented-generation/" rel="noopener noreferrer"&gt;&lt;strong&gt;What is Retrieval-Augmented Generation (RAG)&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://aws.amazon.com/bedrock/anthropic/" rel="noopener noreferrer"&gt;&lt;strong&gt;Anthropic's Claude on Amazon Bedrock&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://aws.amazon.com/blogs/aws/vector-engine-for-amazon-opensearch-serverless-is-now-generally-available/" rel="noopener noreferrer"&gt;&lt;strong&gt;Vector Engine for Amazon OpenSearch Serverless&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://aws.amazon.com/opensearch-service/features/ingestion/" rel="noopener noreferrer"&gt;&lt;strong&gt;Amazon OpenSearch Ingestion&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://aws.amazon.com/blogs/machine-learning/prompt-engineering-techniques-and-best-practices-learn-by-doing-with-anthropics-claude-3-on-amazon-bedrock/" rel="noopener noreferrer"&gt;&lt;strong&gt;Prompt Engineering for Anthropic's Claude&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>serverless</category>
      <category>vectordatabase</category>
      <category>rag</category>
    </item>
  </channel>
</rss>
