<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Oleksandr Kuryzhev</title>
    <description>The latest articles on DEV Community by Oleksandr Kuryzhev (@oleksandr_kuryzhev_42873f).</description>
    <link>https://dev.to/oleksandr_kuryzhev_42873f</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3970301%2Fff42dfb6-af2a-4fc7-968a-54326187a691.jpg</url>
      <title>DEV Community: Oleksandr Kuryzhev</title>
      <link>https://dev.to/oleksandr_kuryzhev_42873f</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/oleksandr_kuryzhev_42873f"/>
    <language>en</language>
    <item>
      <title>RAG Pipeline for SRE Runbooks: 7 Vector Search Tips That Work</title>
      <dc:creator>Oleksandr Kuryzhev</dc:creator>
      <pubDate>Mon, 15 Jun 2026 07:03:02 +0000</pubDate>
      <link>https://dev.to/oleksandr_kuryzhev_42873f/rag-pipeline-for-sre-runbooks-7-vector-search-tips-that-work-122k</link>
      <guid>https://dev.to/oleksandr_kuryzhev_42873f/rag-pipeline-for-sre-runbooks-7-vector-search-tips-that-work-122k</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://kuryzhev.cloud/2026/06/15/rag-pipeline-for-sre-runbooks-7-vector-search-tips-that-work" rel="noopener noreferrer"&gt;kuryzhev.cloud&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Your on-call engineer gets paged at 2 AM and your RAG system confidently surfaces a runbook from six months ago — deprecated after the last migration, full of references to services that no longer exist. The engineer follows it anyway. That's the failure mode nobody talks about when they say "we RAG-ified our runbooks." Building a RAG pipeline for SRE runbooks that actually works in production means getting the embedding model, the index structure, the ingestion loop, and the retrieval quality all right at the same time. These seven tips are what I wish I'd known before our first on-call integration went sideways.&lt;/p&gt;

&lt;h2&gt;Tip 1: Choose the Right Embedding Model for Runbook Content&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Generic embedding models misread SRE jargon — domain matters more than benchmark scores.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Terms like &lt;code&gt;OOMKilled&lt;/code&gt;, &lt;code&gt;CrashLoopBackOff&lt;/code&gt;, &lt;code&gt;HighMemoryUsage&lt;/code&gt;, or your internal alert names are essentially invisible to models trained on general web text. They get embedded close to random technical noise rather than clustering with semantically related runbook content. I learned this after watching &lt;code&gt;text-embedding-ada-002&lt;/code&gt; confidently return a Kubernetes networking runbook for a PostgreSQL replication alert because both happened to mention "connection timeout."&lt;/p&gt;

&lt;p&gt;My current preference is &lt;code&gt;BAAI/bge-small-en-v1.5&lt;/code&gt; via &lt;code&gt;sentence-transformers&amp;gt;=2.7.0&lt;/code&gt;. It produces 384-dimensional vectors, runs about 5x faster than ada-002 at inference time, and handles technical prose significantly better in practice. A single &lt;code&gt;t3.medium&lt;/code&gt; can push roughly 50 embed requests per second, which is more than enough for alert-driven RAG queries, though you'll need batching for bulk re-indexing. If you need a hosted option and ada-002 is already in your stack, it's usable. OpenAI embeddings come back normalized to unit length, so &lt;code&gt;distance: Dot&lt;/code&gt; and &lt;code&gt;Cosine&lt;/code&gt; rank results identically in Qdrant. What you cannot do is mix models in one collection: ada-002 vectors are 1536-dimensional, bge-small vectors are 384-dimensional, and the two embedding spaces are not comparable.&lt;/p&gt;

&lt;p&gt;One chunking detail that trips people up: don't split runbooks by fixed token count without respecting procedural step boundaries. Splitting "Step 3: drain the node" across two chunks destroys the procedural context the retriever needs. Use 512-token chunks with 64-token overlap as a starting point — the overlap preserves continuity across step boundaries without ballooning your index size.&lt;/p&gt;
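&lt;p&gt;A minimal sketch of step-aware chunking, assuming runbook steps begin with a line such as "Step 3:" or a numbered-list marker like "3." (both patterns are assumptions about your runbook format). The whitespace word count is only a rough stand-in for real tokenizer counts, and the function names are hypothetical. Whole steps are packed into a chunk until the budget would be exceeded, so a step is only ever oversized, never cut in half.&lt;/p&gt;

```python
import re

# Assumed step markers: "Step 3:" or "3." / "3)" at the start of a line.
STEP_START = re.compile(r"^(?:step\s+\d+\s*:|\d+[.)]\s)", re.IGNORECASE)


def split_into_steps(text: str) -> list[str]:
    """Group lines into blocks, starting a new block at every step marker."""
    blocks: list[list[str]] = [[]]
    for line in text.splitlines():
        if STEP_START.match(line.strip()) and blocks[-1]:
            blocks.append([])
        blocks[-1].append(line)
    return ["\n".join(b).strip() for b in blocks if any(ln.strip() for ln in b)]


def pack_steps(steps: list[str], max_words: int = 512) -> list[str]:
    """Pack whole steps into chunks without exceeding max_words per chunk.

    A single step longer than max_words becomes its own oversized chunk
    rather than being split mid-step.
    """
    chunks: list[str] = []
    current: list[str] = []
    current_words = 0
    for step in steps:
        n = len(step.split())
        if current and current_words + n > max_words:
            chunks.append("\n\n".join(current))
            current, current_words = [], 0
        current.append(step)
        current_words += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


runbook = """Intro: node is NotReady.
Step 1: cordon the node
kubectl cordon node-1
Step 2: drain the node
kubectl drain node-1 --ignore-daemonsets
Step 3: reboot and uncordon"""

steps = split_into_steps(runbook)
chunks = pack_steps(steps, max_words=12)  # tiny budget just to force several chunks
for c in chunks:
    print(repr(c))
```

&lt;p&gt;Overlap is left out on purpose here: when chunks end on step boundaries there is less need for it, though repeating the previous step's heading in the next chunk is a reasonable variation.&lt;/p&gt;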

&lt;h2&gt;Tip 2: Structure Your Vector Store Index Around Incident Taxonomy&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Metadata filtering before semantic search cuts irrelevant results by ~60% — don't skip it.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A pure vector search across your entire runbook corpus will always surface some plausible-but-wrong results. The fix isn't a better model — it's filtering. Before the semantic ranking even runs, filter by structured metadata fields that you already have: &lt;code&gt;alert_name&lt;/code&gt;, &lt;code&gt;service&lt;/code&gt;, &lt;code&gt;severity&lt;/code&gt;, &lt;code&gt;on_call_team&lt;/code&gt;, and critically, &lt;code&gt;last_updated&lt;/code&gt;. That last field is the one most teams forget to store, and it's what lets you warn engineers when the best matching runbook is eight months stale.&lt;/p&gt;
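&lt;p&gt;As a rough illustration of the staleness warning, here is a sketch that assumes &lt;code&gt;last_updated&lt;/code&gt; is stored as an ISO 8601 string and that six months is your cutoff. Both the storage format and the &lt;code&gt;STALE_AFTER&lt;/code&gt; threshold are assumptions to adjust, and &lt;code&gt;staleness_warning&lt;/code&gt; is a hypothetical helper name.&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

STALE_AFTER = timedelta(days=180)  # assumed policy: warn after roughly six months


def staleness_warning(last_updated_iso: str, now: Optional[datetime] = None) -> Optional[str]:
    """Return a warning if the runbook is older than STALE_AFTER, else None."""
    now = now or datetime.now(timezone.utc)
    last_updated = datetime.fromisoformat(last_updated_iso)
    if last_updated.tzinfo is None:
        # Treat naive timestamps as UTC so the subtraction below is valid.
        last_updated = last_updated.replace(tzinfo=timezone.utc)
    age = now - last_updated
    if age > STALE_AFTER:
        return f"Runbook last updated {age.days} days ago; verify before following it."
    return None


reference = datetime(2026, 6, 15, tzinfo=timezone.utc)
print(staleness_warning("2025-10-01T00:00:00+00:00", now=reference))
print(staleness_warning("2026-06-01T00:00:00+00:00", now=reference))
```

&lt;p&gt;The warning string can be attached to the query response so the notification shows it next to the runbook text.&lt;/p&gt;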

&lt;p&gt;For the vector store itself, I use &lt;a href="https://qdrant.tech/documentation/" rel="noopener noreferrer"&gt;Qdrant&lt;/a&gt; in production. It supports sparse vectors alongside dense ones via the &lt;code&gt;sparse_vectors&lt;/code&gt; collection config (sparse vectors shipped in 1.7, and the Query API in 1.10 added server-side fusion of the two result sets), which lets you combine BM25-style keyword matching with semantic similarity. That is genuinely useful when alert names are exact-match keywords. If you're evaluating alternatives: Weaviate v1.24+ has the &lt;code&gt;generative-openai&lt;/code&gt; module built in, which is tempting, but it couples your retrieval and generation layers tightly and makes model swaps painful. Pinecone namespaces work well if you're already in that ecosystem and don't need hybrid search.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Watch out for:&lt;/strong&gt; Qdrant's default Docker image ships with zero authentication enabled. Always set the &lt;code&gt;QDRANT__SERVICE__API_KEY&lt;/code&gt; environment variable and keep port &lt;code&gt;6333&lt;/code&gt; inside a private subnet. I've seen this misconfiguration in three separate internal tooling audits.&lt;/p&gt;

&lt;h2&gt;Tip 3: Ingest Runbooks from Confluence or Git with a Lightweight Pipeline&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Hash-based change detection keeps your vector store fresh without re-embedding everything on every run.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The ingestion pipeline is where most RAG implementations get lazy and end up paying for it — either in stale data or in runaway embedding API costs. The pattern I use: store a &lt;code&gt;sha256&lt;/code&gt; of each document's content in Redis. On every pipeline run, compare the current hash. If it matches, skip re-embedding entirely. Only new or changed content hits the embedding model.&lt;/p&gt;

&lt;p&gt;For Git-based runbooks, enforce a path convention: &lt;code&gt;docs/runbooks/{service}/{alert_name}.md&lt;/code&gt;. This lets you extract &lt;code&gt;service&lt;/code&gt; and &lt;code&gt;alert_name&lt;/code&gt; metadata directly from the file path without parsing file content — simpler and less error-prone. For Confluence, the REST API endpoint &lt;code&gt;/wiki/rest/api/content?type=page&amp;amp;spaceKey=SRE&lt;/code&gt; works, and LangChain's &lt;code&gt;ConfluenceLoader&lt;/code&gt; (requires &lt;code&gt;atlassian-python-api&amp;gt;=3.41.0&lt;/code&gt;) gets you started fast. That said, I moved off it to a custom fetch — you get better metadata control and don't inherit LangChain's chunking decisions.&lt;/p&gt;

&lt;p&gt;Here's the full ingestion pipeline with hash-based deduplication and Redis embedding cache:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
# rag_ingest.py — Runbook ingestion pipeline with hash-based deduplication
# Deps: qdrant-client&amp;gt;=1.9.0, sentence-transformers&amp;gt;=2.7.0, python-dotenv, redis, tiktoken

import os
import hashlib
import json
from pathlib import Path
from dotenv import load_dotenv
import redis
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct, Filter,
    FieldCondition, MatchValue
)
from sentence_transformers import SentenceTransformer

load_dotenv()

# --- Config ---
QDRANT_URL = os.getenv("QDRANT_URL", "http://localhost:6333")
QDRANT_API_KEY = os.getenv("QDRANT_API_KEY")
COLLECTION_NAME = "sre_runbooks"
EMBED_MODEL = "BAAI/bge-small-en-v1.5"   # 384-dim, fast, good on technical text
CHUNK_SIZE = 512        # window size; chunk_text counts whitespace words, not model tokens
CHUNK_OVERLAP = 64      # overlap between consecutive windows, in the same unit
SCORE_THRESHOLD = 0.78  # minimum cosine similarity to surface a result

# --- Clients ---
redis_client = redis.Redis(host="localhost", port=6379, decode_responses=True)
qdrant = QdrantClient(url=QDRANT_URL, api_key=QDRANT_API_KEY)
model = SentenceTransformer(EMBED_MODEL)

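def chunk_text(text: str, size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP) -&amp;gt; list[str]:
    """Sliding window over whitespace-separated words, with overlap.

    Word counts only approximate model tokens: 512 words can exceed the
    512-token input limit of bge-small, and the excess is truncated at embed
    time. The window also ignores step boundaries; pre-split on step markers
    if steps must stay whole.
    """
    if overlap &amp;gt;= size:
        raise ValueError("overlap must be smaller than size")  # otherwise the loop never advances
    words = text.split()
    chunks, i = [], 0
    while i &amp;lt; len(words):
        chunk = " ".join(words[i:i + size])
        chunks.append(chunk)
        i += size - overlap  # slide with overlap
    return chunks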

def embed_with_cache(text: str) -&amp;gt; list[float]:
    """Return cached embedding or compute and store it."""
    key = f"emb:v1:{hashlib.sha256(text.encode()).hexdigest()}"
    cached = redis_client.get(key)
    if cached:
        return json.loads(cached)
    vector = model.encode(text, normalize_embeddings=True).tolist()
    redis_client.setex(key, 604800, json.dumps(vector))  # TTL: 7 days
    return vector

def ingest_runbook(filepath: Path):
    """Parse path for metadata, chunk content, upsert to Qdrant."""
    # Expected path: docs/runbooks/{service}/{alert_name}.md
    parts = filepath.parts
    service = parts[-2] if len(parts) &amp;gt;= 2 else "unknown"
    alert_name = filepath.stem  # filename without .md

    content = filepath.read_text(encoding="utf-8")
    doc_hash = hashlib.sha256(content.encode()).hexdigest()

    # Fast change detection via Redis — skip unchanged docs entirely
    hash_key = f"doc_hash:{filepath}"
    if redis_client.get(hash_key) == doc_hash:
        print(f"[SKIP] {filepath} unchanged")
        return

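    chunks = chunk_text(content)
    # File mtime as epoch seconds; in CI, a Git commit timestamp is more reliable
    last_updated = filepath.stat().st_mtime
    points = []
    for idx, chunk in enumerate(chunks):
        vector = embed_with_cache(chunk)
        # 64-bit ID from the hash prefix; a 32-bit prefix collides too easily at scale
        point_id = int(hashlib.sha256(f"{filepath}:{idx}".encode()).hexdigest()[:16], 16)
        points.append(PointStruct(
            id=point_id,
            vector=vector,
            payload={
                "service": service,
                "alert_name": alert_name,
                "chunk_index": idx,
                "source_path": str(filepath),
                "doc_hash": doc_hash,
                "last_updated": last_updated,  # enables staleness warnings at query time
                "text": chunk,
            }
        ))

    # Remove this document's old chunks first so a shortened runbook leaves no orphans
    qdrant.delete(
        collection_name=COLLECTION_NAME,
        points_selector=Filter(must=[
            FieldCondition(key="source_path", match=MatchValue(value=str(filepath)))
        ]),
    )
    qdrant.upsert(collection_name=COLLECTION_NAME, points=points)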
    redis_client.set(hash_key, doc_hash)  # update change-detection cache
    print(f"[OK] Ingested {len(points)} chunks from {filepath}")

def ensure_collection():
    """Create collection if it doesn't exist."""
    existing = [c.name for c in qdrant.get_collections().collections]
    if COLLECTION_NAME not in existing:
        qdrant.create_collection(
            collection_name=COLLECTION_NAME,
            vectors_config=VectorParams(size=384, distance=Distance.COSINE),
        )
        print(f"[INIT] Created collection: {COLLECTION_NAME}")

if __name__ == "__main__":
    ensure_collection()
    runbook_dir = Path("docs/runbooks")
    for md_file in runbook_dir.rglob("*.md"):
        ingest_runbook(md_file)
&lt;/code&gt;&lt;/pre&gt;

&lt;h2&gt;Tip 4: Wire the RAG Query into Your Alerting Workflow&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Surface runbook context automatically when an alert fires — not only when someone thinks to ask.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The real value of a RAG pipeline for SRE runbooks isn't a chat interface. It's injecting relevant procedure context into the incident notification itself, before the engineer even opens a terminal. The integration point is your Alertmanager or PagerDuty webhook. When a webhook fires, extract the &lt;code&gt;alertname&lt;/code&gt; label (Alertmanager v2 path: &lt;code&gt;.alerts[0].labels.alertname&lt;/code&gt;) and use it as the query string to your RAG endpoint.&lt;/p&gt;

&lt;p&gt;One PagerDuty-specific gotcha: webhook v3 sends &lt;code&gt;event.data.title&lt;/code&gt; as the incident name. Map this field, not &lt;code&gt;event.id&lt;/code&gt;, to your query — I've seen this wired wrong in three different integrations and the resulting queries return garbage.&lt;/p&gt;
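&lt;p&gt;To make the field mapping concrete, here is a small sketch that picks the query string out of either payload shape. The sample payloads are trimmed to the fields discussed above, and &lt;code&gt;query_from_webhook&lt;/code&gt; is a hypothetical helper, not part of either vendor's SDK.&lt;/p&gt;

```python
def query_from_webhook(body: dict) -> str:
    """Pick the RAG query string out of an Alertmanager or PagerDuty v3 payload."""
    if "alerts" in body:
        # Alertmanager webhook: use the alertname label of the first alert.
        return body["alerts"][0]["labels"]["alertname"]
    if "event" in body:
        # PagerDuty webhook v3: the incident name is event.data.title, not event.id.
        return body["event"]["data"]["title"]
    raise ValueError("Unrecognized webhook payload")


alertmanager_payload = {
    "alerts": [{"labels": {"alertname": "HighMemoryUsage", "service": "api"}}]
}
pagerduty_payload = {
    "event": {
        "id": "01EXAMPLE",  # opaque event ID, useless as a search query
        "event_type": "incident.triggered",
        "data": {"type": "incident", "title": "HighMemoryUsage on api"},
    }
}

print(query_from_webhook(alertmanager_payload))
print(query_from_webhook(pagerduty_payload))
```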

&lt;p&gt;Set a similarity score threshold of &lt;code&gt;0.78&lt;/code&gt; with cosine distance as your starting point. Below that, return a &lt;code&gt;"matched": false&lt;/code&gt; signal so your Slack notification can still fire — just without a runbook attachment. A "no confident match" message is far safer than surfacing a low-confidence wrong runbook. Return the top-3 chunks maximum; more than that and engineers stop reading them.&lt;/p&gt;

&lt;p&gt;Here's the FastAPI query endpoint wired to an Alertmanager webhook payload:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
# rag_query.py — Query endpoint wired to Alertmanager webhook
# Receives alert payload, returns top-3 runbook chunks above threshold

import os
from fastapi import FastAPI, Request, HTTPException
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue
from sentence_transformers import SentenceTransformer

QDRANT_URL = os.getenv("QDRANT_URL", "http://localhost:6333")
QDRANT_API_KEY = os.getenv("QDRANT_API_KEY")
COLLECTION_NAME = "sre_runbooks"
SCORE_THRESHOLD = 0.78
TOP_K = 3

app = FastAPI()
qdrant = QdrantClient(url=QDRANT_URL, api_key=QDRANT_API_KEY)
model = SentenceTransformer("BAAI/bge-small-en-v1.5")

@app.post("/query/alert")
async def query_from_alert(request: Request):
    """
    Accepts Alertmanager webhook JSON.
    Extracts alertname + service label, runs filtered vector search.
    Returns top-K chunks or a no-match signal.
    """
    body = await request.json()

    try:
        # Alertmanager v2 webhook schema
        alert = body["alerts"][0]
        alert_name = alert["labels"]["alertname"]       # e.g. "HighMemoryUsage"
        service = alert["labels"].get("service", None)  # optional label
    except (KeyError, IndexError):
        raise HTTPException(status_code=400, detail="Invalid Alertmanager payload")

    query_text = f"{alert_name} {service or ''}".strip()
    query_vector = model.encode(query_text, normalize_embeddings=True).tolist()

    # Pre-filter by alert_name metadata before semantic ranking
    search_filter = Filter(
        must=[FieldCondition(key="alert_name", match=MatchValue(value=alert_name))]
    ) if alert_name else None

    results = qdrant.search(
        collection_name=COLLECTION_NAME,
        query_vector=query_vector,
        query_filter=search_filter,
        limit=TOP_K,
        score_threshold=SCORE_THRESHOLD,  # drop low-confidence results
        with_payload=True,
    )

    if not results:
        # Fallback: no confident match — Slack still pages, just without runbook
        return {"matched": False, "alert_name": alert_name, "chunks": []}

    return {
        "matched": True,
        "alert_name": alert_name,
        "chunks": [
            {
                "text": r.payload["text"],
                "source": r.payload["source_path"],
                "score": round(r.score, 4),
                "chunk_index": r.payload["chunk_index"],
            }
            for r in results
        ],
    }

# Example response:
# {
#   "matched": true,
#   "alert_name": "HighMemoryUsage",
#   "chunks": [
#     {"text": "Step 1: check OOMKilled pods with kubectl describe...",
#      "source": "docs/runbooks/api/HighMemoryUsage.md",
#      "score": 0.8912, "chunk_index": 2}
#   ]
# }
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;For Slack delivery, use Block Kit's &lt;code&gt;section&lt;/code&gt; block with a &lt;code&gt;mrkdwn&lt;/code&gt; text field to render the runbook chunk inline alongside the alert details. Include the &lt;code&gt;source_path&lt;/code&gt; and &lt;code&gt;score&lt;/code&gt; so engineers immediately know where it came from and how confident the match is.&lt;/p&gt;
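&lt;p&gt;A sketch of what that Block Kit payload could look like, assuming the chunk dictionaries have the &lt;code&gt;text&lt;/code&gt;, &lt;code&gt;source&lt;/code&gt;, and &lt;code&gt;score&lt;/code&gt; keys returned by the endpoint above. The helper name and message layout are illustrative choices; the truncation is there because Slack caps a section block's text at 3000 characters.&lt;/p&gt;

```python
import json

MAX_SECTION_CHARS = 3000  # Slack's limit for a section block's text field


def runbook_blocks(alert_name: str, chunk: dict) -> list[dict]:
    """Build Block Kit blocks for one retrieved runbook chunk."""
    header = f"*Runbook match for `{alert_name}`* (score {chunk['score']:.2f})"
    footer = f"Source: `{chunk['source']}`"
    body = chunk["text"]
    budget = MAX_SECTION_CHARS - len(header) - len(footer) - 2  # two newlines
    if len(body) > budget:
        body = body[: budget - 3] + "..."
    return [
        {
            "type": "section",
            "text": {"type": "mrkdwn", "text": f"{header}\n{body}\n{footer}"},
        }
    ]


sample_chunk = {
    "text": "Step 1: check OOMKilled pods with kubectl describe...",
    "source": "docs/runbooks/api/HighMemoryUsage.md",
    "score": 0.8912,
}
print(json.dumps({"blocks": runbook_blocks("HighMemoryUsage", sample_chunk)}, indent=2))
```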

&lt;h2&gt;Tip 5: Evaluate Retrieval Quality Before You Trust It in Production&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The silent failure mode is a RAG that returns plausible-but-wrong runbook steps with high confidence.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most teams evaluate their RAG pipeline by asking "does the LLM answer look right?" That's the wrong question. You need to evaluate whether the &lt;em&gt;retrieved chunks&lt;/em&gt; were actually the correct runbook sections before any LLM even sees them. A well-phrased wrong answer is worse than an obvious failure.&lt;/p&gt;

&lt;p&gt;Build a golden dataset: 20-30 pairs of &lt;code&gt;(alert_name, expected_runbook_section)&lt;/code&gt;. Run recall@3 checks — does the correct chunk appear in the top 3 results? That's your baseline metric. For a more structured eval, the &lt;a href="https://docs.ragas.io/en/stable/" rel="noopener noreferrer"&gt;ragas library&lt;/a&gt; (v0.1.x) provides &lt;code&gt;context_recall&lt;/code&gt; and &lt;code&gt;answer_relevancy&lt;/code&gt; metrics. Note that ragas requires &lt;code&gt;openai&amp;gt;=1.0.0&lt;/code&gt; and makes separate LLM calls for scoring — budget for that API cost in your eval pipeline, it's not free.&lt;/p&gt;
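&lt;p&gt;The recall@3 check itself needs very little code. The sketch below assumes each chunk can be identified by a string ID such as &lt;code&gt;api/HighMemoryUsage#2&lt;/code&gt; (a made-up convention) and uses a fake in-memory retriever so that it runs standalone; in practice &lt;code&gt;retrieve&lt;/code&gt; would call your search endpoint.&lt;/p&gt;

```python
from typing import Callable


def recall_at_k(
    golden: list[tuple[str, str]],
    retrieve: Callable[[str, int], list[str]],
    k: int = 3,
) -> float:
    """Fraction of golden pairs whose expected chunk ID is in the top-k results."""
    if not golden:
        raise ValueError("golden dataset is empty")
    hits = sum(1 for query, expected in golden if expected in retrieve(query, k)[:k])
    return hits / len(golden)


# Stand-in retriever for illustration; replace with a call to your search endpoint.
FAKE_INDEX = {
    "HighMemoryUsage": ["api/HighMemoryUsage#2", "api/HighMemoryUsage#0"],
    "PostgresReplicationLag": ["k8s/NetworkTimeout#1", "db/PostgresReplicationLag#3"],
    "DiskPressure": ["k8s/NetworkTimeout#1"],
}


def fake_retrieve(query: str, k: int) -> list[str]:
    return FAKE_INDEX.get(query, [])[:k]


golden_set = [
    ("HighMemoryUsage", "api/HighMemoryUsage#2"),
    ("PostgresReplicationLag", "db/PostgresReplicationLag#3"),
    ("DiskPressure", "node/DiskPressure#0"),
]
score = recall_at_k(golden_set, fake_retrieve, k=3)
print(f"recall@3 = {score:.2f}")
```

&lt;p&gt;Wiring this into CI as a gate is then a matter of failing the job when the score drops below whatever baseline you recorded.&lt;/p&gt;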

&lt;p&gt;Run this eval gate on every significant change to the runbook corpus or after swapping embedding models. I caught a 15% recall drop after a Confluence space reorganization that changed page titles — the metadata-extracted &lt;code&gt;alert_name&lt;/code&gt; fields shifted, and the pre-filter was excluding correct results. Without the eval gate, that would have silently degraded on-call for weeks.&lt;/p&gt;

&lt;h2&gt;Tip 6: Secure the Pipeline — Runbooks Contain Sensitive Operational Detail&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Your vector store holds internal hostnames, escalation contacts, and credential patterns — treat it like production infrastructure.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the access control gap I see most often. Teams move runbooks into a vector DB, wire up a query API, and mark it "internal only" as if that's sufficient. Runbooks regularly contain things like internal service hostnames, credential rotation procedures, escalation phone trees, and network topology details. If a service account with access to your RAG query API is compromised, an attacker can enumerate your entire operational playbook through semantic search.&lt;/p&gt;

&lt;p&gt;Enforce collection-level access control in Qdrant with its JWT-based RBAC (available since v1.9), which lets you issue tokens scoped to specific collections; a single static API key grants access to everything. In Weaviate, use RBAC to scope read access by team. Never expose the RAG query endpoint without authentication, even on an internal network: lateral movement from a compromised service is a real threat model, not a theoretical one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Watch out for:&lt;/strong&gt; the Redis embedding cache also needs protection. Those cached vectors can be used to reconstruct approximate source text. Keep Redis on a private interface, set &lt;code&gt;requirepass&lt;/code&gt;, and use restrictive &lt;code&gt;bind&lt;/code&gt; directives. I stopped treating the cache layer as "just an optimization" after reading about embedding inversion attacks, which are no longer purely academic.&lt;/p&gt;

&lt;p&gt;Also store &lt;code&gt;last_updated&lt;/code&gt; as a metadata field on every point. Without it, you have no way to surface a staleness warning to the on-call engineer when the best matching runbook is months old. This is a cheap field to add and an expensive oversight to fix after the fact. For more on securing internal tooling pipelines, see the patterns we cover at &lt;a href="https://kuryzhev.cloud/" rel="noopener noreferrer"&gt;kuryzhev.cloud&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;Tip 7: Control Costs by Caching Embeddings and Limiting Re-Indexing Frequency&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Naive re-indexing pipelines multiply embedding costs fast — cache aggressively and schedule smart.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At first glance, embedding costs look trivial. Five hundred runbook pages at roughly 10 chunks of 512 tokens each is about 2.56M tokens; priced at &lt;code&gt;text-embedding-ada-002&lt;/code&gt;'s $0.0001 per 1K tokens, that works out to about $0.25 per full re-index. That sounds fine. But a naive pipeline that re-embeds everything on every CI merge, or that re-indexes when Confluence sends a webhook for a minor edit, turns that $0.25 into a daily charge. At scale with a self-hosted GPU model, it becomes compute time you're burning for no reason.&lt;/p&gt;
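&lt;p&gt;The arithmetic is easy to check, and to extend to your own merge rate. In the sketch below, the 512 tokens per chunk is an upper bound taken from the chunk size in Tip 1, while the 20 merges per day, 22 working days, and 5 changed pages per merge are invented numbers for illustration only.&lt;/p&gt;

```python
PAGES = 500
CHUNKS_PER_PAGE = 10
TOKENS_PER_CHUNK = 512          # assumes every chunk is full-size (upper bound)
PRICE_PER_1K_TOKENS = 0.0001    # USD, the ada-002 rate quoted above


def reindex_cost(pages: int = PAGES) -> float:
    """USD cost of embedding every chunk of `pages` runbook pages once."""
    tokens = pages * CHUNKS_PER_PAGE * TOKENS_PER_CHUNK
    return tokens / 1000 * PRICE_PER_1K_TOKENS


MERGES_PER_MONTH = 20 * 22      # assumed: 20 merges a day, 22 working days

full = reindex_cost()                            # one full re-index
naive_monthly = full * MERGES_PER_MONTH          # full re-index on every merge
# Incremental: about 5 changed pages per merge, plus 4 weekly full re-indexes
incremental_monthly = reindex_cost(pages=5) * MERGES_PER_MONTH + full * 4

print(f"full re-index:       ${full:.3f}")
print(f"naive, per month:    ${naive_monthly:.2f}")
print(f"incremental + weekly: ${incremental_monthly:.2f}")
```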

&lt;p&gt;The fix is two-layered. First, the Redis embedding cache with key pattern &lt;code&gt;emb:v1:{sha256(chunk_text)}&lt;/code&gt; — identical chunk content across different documents or pipeline runs hits the cache, not the model. Include a version prefix (&lt;code&gt;v1&lt;/code&gt;) so that when you upgrade your embedding model, you can invalidate the entire cache cleanly by bumping to &lt;code&gt;v2&lt;/code&gt; without touching cache logic. Second, schedule full re-indexes weekly. Run incremental re-indexing (changed documents only, via hash comparison) on every merge to &lt;code&gt;main&lt;/code&gt;. This keeps the index current without re-embedding stable content.&lt;/p&gt;

&lt;p&gt;One more cost lever: use gRPC instead of HTTP for Qdrant batch upserts. The default HTTP port is &lt;code&gt;6333&lt;/code&gt;, gRPC is &lt;code&gt;6334&lt;/code&gt;. Switching to gRPC gives approximately 30% lower latency on batch operations — not a cost saving directly, but it reduces the wall-clock time your ingestion job runs, which matters if you're paying for the compute that runs it.&lt;/p&gt;

&lt;h2&gt;Related&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://kuryzhev.cloud/category/monitoring/" rel="noopener noreferrer"&gt;Prometheus, Loki, and alerting pipeline patterns for SRE teams&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kuryzhev.cloud/category/python/" rel="noopener noreferrer"&gt;Python automation scripts for DevOps workflows and AWS integrations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kuryzhev.cloud/category/kubernetes/" rel="noopener noreferrer"&gt;Kubernetes production patterns — HPA, security, and network policy&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>cicd</category>
      <category>devops</category>
    </item>
    <item>
      <title>MCP Servers for DevOps: Build vs Pre-Built — What to Choose</title>
      <dc:creator>Oleksandr Kuryzhev</dc:creator>
      <pubDate>Sun, 14 Jun 2026 07:02:35 +0000</pubDate>
      <link>https://dev.to/oleksandr_kuryzhev_42873f/mcp-servers-for-devops-build-vs-pre-built-what-to-choose-4in9</link>
      <guid>https://dev.to/oleksandr_kuryzhev_42873f/mcp-servers-for-devops-build-vs-pre-built-what-to-choose-4in9</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://kuryzhev.cloud/2026/06/14/mcp-servers-for-devops-build-vs-pre-built-what-to-choose" rel="noopener noreferrer"&gt;kuryzhev.cloud&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You wired an LLM into your incident workflow, gave it &lt;code&gt;kubectl&lt;/code&gt; access via an MCP server you found on GitHub, and only later realized it was running against your production cluster with your personal kubeconfig. That's the kind of mistake that happens when you move fast with MCP server DevOps tooling without thinking through the operational model first. I've been there. This post is the comparison I wish I had before we started.&lt;/p&gt;

&lt;p&gt;The Model Context Protocol (MCP) is a JSON-RPC 2.0 protocol that gives LLMs like Claude or GPT-4 structured, typed access to external tools — &lt;code&gt;kubectl&lt;/code&gt;, Terraform, PagerDuty, your internal CMDB — with session state and proper error handling. It's not just another API wrapper. It's the difference between prompt-engineering your way to fragile shell commands and giving your AI assistant a real, auditable toolchain. But the moment you go from a weekend demo to a team-shared production deployment, you face a real architectural decision: do you use community-maintained pre-built MCP servers, or do you build your own?&lt;/p&gt;

&lt;p&gt;The wrong answer costs you either brittle, unmaintainable hacks or an over-engineered custom server nobody on your team wants to touch at 3 AM. Let me walk you through both options with the honest trade-offs.&lt;/p&gt;

&lt;h2&gt;When You Face This Choice&lt;/h2&gt;


&lt;p&gt;This decision usually surfaces when you're wiring LLMs into something that actually matters: incident triage, infra query automation, or CI/CD pipeline introspection. You need structured tool access — not raw API calls buried in a system prompt, but proper MCP tools with typed &lt;code&gt;inputSchema&lt;/code&gt;, error normalization, and the ability to maintain context across multi-step operations like "get the failing pods, look up the runbook, then check recent deploys."&lt;/p&gt;

&lt;p&gt;The MCP spec (currently at version &lt;code&gt;2025-03-26&lt;/code&gt;, which introduced breaking changes to the &lt;code&gt;resources/list&lt;/code&gt; response schema versus the earlier &lt;code&gt;2024-11-05&lt;/code&gt; version) defines a &lt;code&gt;protocolVersion&lt;/code&gt; field in the &lt;code&gt;initialize&lt;/code&gt; handshake. If your server and client are on different spec versions, tool calls silently misbehave. That's your first hint that the ecosystem is still moving fast and your tooling choices have real consequences.&lt;/p&gt;
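&lt;p&gt;One way to turn that silent misbehavior into a loud failure is to check the negotiated version on the client side right after the handshake. The sketch below works on a trimmed &lt;code&gt;initialize&lt;/code&gt; response (real responses also carry server info and capabilities), and &lt;code&gt;negotiated_version&lt;/code&gt; is a hypothetical helper rather than an SDK function.&lt;/p&gt;

```python
import json

EXPECTED_VERSION = "2025-03-26"  # the spec revision this client was tested against


def negotiated_version(raw_response: str, expected: str = EXPECTED_VERSION) -> str:
    """Return the server's protocolVersion, raising if it differs from `expected`."""
    response = json.loads(raw_response)
    version = response.get("result", {}).get("protocolVersion")
    if version != expected:
        raise RuntimeError(
            f"MCP protocol mismatch: server answered {version!r}, client expects {expected!r}"
        )
    return version


# Trimmed initialize response from a server still on the older revision.
old_server = json.dumps({
    "jsonrpc": "2.0",
    "id": 1,
    "result": {"protocolVersion": "2024-11-05", "capabilities": {}},
})

try:
    negotiated_version(old_server)
except RuntimeError as exc:
    print(exc)
```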

&lt;p&gt;The core question: does your team need to expose internal tools that will never have a community server, and does this MCP setup need to serve more than one engineer? If yes to either, you're already leaning toward custom. If you're running a proof-of-concept for one person against non-production systems, pre-built is the right starting point. Let's go through both properly.&lt;/p&gt;

&lt;h2&gt;Option A — Pre-Built Community MCP Servers&lt;/h2&gt;

&lt;p&gt;Pre-built servers are genuinely impressive for getting started. &lt;code&gt;uvx mcp-server-kubernetes&lt;/code&gt; gives you pod listing, log fetching, and resource inspection in under ten minutes. The official GitHub MCP server (&lt;code&gt;github.com/github/github-mcp-server&lt;/code&gt;, distributed as a Go binary) supports fine-grained PATs and is the only community server I know of with a published security advisory process as of mid-2025. &lt;code&gt;mcp-server-prometheus&lt;/code&gt; handles instant PromQL queries with authentication. You get community-maintained tool schemas that already handle pagination, error normalization, and auth flows — things that take real time to get right when you build from scratch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; Zero-to-running in under 10 minutes. &lt;code&gt;npx @modelcontextprotocol/server-github&lt;/code&gt; or &lt;code&gt;uvx mcp-server-kubernetes&lt;/code&gt; works out of the box. The tool schemas are already battle-tested for common operations. You don't own the spec upgrade burden on day one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt; The tool surface is fixed. You cannot expose your internal Backstage catalog, your custom SLO dashboard API, or your Spinnaker deployment history. Versioning is genuinely chaotic — &lt;code&gt;mcp-server-kubernetes&lt;/code&gt; v0.1.x vs v0.2.x broke tool signatures in March 2025, and if you're pinned to an old version in a team deployment, you'll spend a debugging session figuring out why tool calls are returning &lt;code&gt;MCP error -32602: Invalid params&lt;/code&gt;. You also inherit their security model entirely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Watch out for this:&lt;/strong&gt; &lt;code&gt;uvx mcp-server-kubernetes&lt;/code&gt; requires &lt;code&gt;kubectl&lt;/code&gt; in PATH and uses your &lt;em&gt;active&lt;/em&gt; kubeconfig context. It will happily point at production if that's your current context. Always set &lt;code&gt;KUBECONFIG&lt;/code&gt; explicitly in your MCP server config to a context-specific kubeconfig file, not your default &lt;code&gt;~/.kube/config&lt;/code&gt;.&lt;/p&gt;
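&lt;p&gt;As an illustration, here is a sketch that generates a server entry in the &lt;code&gt;mcpServers&lt;/code&gt; layout used by Claude Desktop style configs, with &lt;code&gt;KUBECONFIG&lt;/code&gt; pinned and the default file refused outright. The kubeconfig path and the &lt;code&gt;k8s-staging&lt;/code&gt; entry name are hypothetical, and the guard is an extra safety net of my own, not something the server enforces.&lt;/p&gt;

```python
import json
import os

DEFAULT_KUBECONFIG = os.path.abspath(os.path.expanduser("~/.kube/config"))


def mcp_server_entry(kubeconfig: str) -> dict:
    """Return a server entry that pins kubectl to one explicit kubeconfig file."""
    path = os.path.abspath(os.path.expanduser(kubeconfig))
    if path == DEFAULT_KUBECONFIG:
        raise ValueError("Pass a context-specific kubeconfig, not the default one")
    return {
        "command": "uvx",
        "args": ["mcp-server-kubernetes"],
        "env": {"KUBECONFIG": path},
    }


# Hypothetical file holding only a read-only staging context.
config = {"mcpServers": {"k8s-staging": mcp_server_entry("/etc/mcp/kube/staging-readonly.yaml")}}
print(json.dumps(config, indent=2))
```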

&lt;p&gt;&lt;strong&gt;Watch out for this too:&lt;/strong&gt; Most community servers run as stdio transport by default — one process per session, zero shared state, fine for local Claude Desktop usage. The moment you want a team-shared deployment, you need a proxy layer. &lt;code&gt;mcp-proxy&lt;/code&gt; (available on PyPI) bridges stdio servers to SSE: &lt;code&gt;mcp-proxy --port 9000 -- uvx mcp-server-kubernetes&lt;/code&gt;. That's an extra operational component you now own. And &lt;code&gt;mcp-server-prometheus&lt;/code&gt; does not support PromQL range queries out of the box — only instant queries. If your on-call workflow needs &lt;code&gt;query_range&lt;/code&gt;, you're building a custom tool regardless.&lt;/p&gt;

&lt;h2&gt;Option B — Custom MCP Servers with the Python or TypeScript SDK&lt;/h2&gt;

&lt;p&gt;Building your own MCP server with the official Python SDK (&lt;code&gt;pip install mcp==1.6.0&lt;/code&gt;) or TypeScript SDK (&lt;code&gt;npm install @modelcontextprotocol/sdk@1.10.0&lt;/code&gt;) gives you complete control. You define exactly what tools exist, what their schemas look like, how auth works (mTLS, OIDC tokens, Vault-sourced secrets), and critically — what the LLM is actually allowed to touch. You can wrap internal APIs that will never have a community server: your CMDB, your internal runbook API, your custom SLO dashboards, your Spinnaker deployment history.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; Full control over tool schema, authentication model, and blast radius. You can implement tool-level allow-lists in handler middleware. You can log every single tool call to your SIEM. You can truncate verbose responses server-side before they hit the LLM context window — because full &lt;code&gt;kubectl get pods -A&lt;/code&gt; JSON output runs 8,000–15,000 tokens per call, which is both expensive and context-window-polluting.&lt;/p&gt;
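&lt;p&gt;Server-side truncation can be as simple as projecting the raw output onto the few fields an LLM needs for triage. The sketch below assumes the standard shape of &lt;code&gt;kubectl get pods -o json&lt;/code&gt; output; which fields to keep is a judgment call, and &lt;code&gt;summarize_pods&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```python
import json


def summarize_pods(kubectl_json: str, max_results: int = 20) -> str:
    """Reduce `kubectl get pods -o json` output to a compact per-pod summary."""
    items = json.loads(kubectl_json).get("items", [])
    rows = []
    for pod in items[:max_results]:
        status = pod.get("status", {})
        containers = status.get("containerStatuses", [])
        rows.append({
            "name": pod["metadata"]["name"],
            "namespace": pod["metadata"].get("namespace", ""),
            "phase": status.get("phase", "Unknown"),
            "restarts": sum(c.get("restartCount", 0) for c in containers),
        })
    # Report the real total so the model knows when the list was cut short.
    summary = {"total": len(items), "shown": len(rows), "pods": rows}
    return json.dumps(summary, separators=(",", ":"))


raw = json.dumps({
    "items": [
        {
            "metadata": {"name": f"api-{i}", "namespace": "prod", "labels": {"app": "api"}},
            "spec": {"containers": [{"name": "api", "image": "registry.example.com/api:1.0"}]},
            "status": {"phase": "Running", "containerStatuses": [{"restartCount": i}]},
        }
        for i in range(30)
    ]
})
compact = summarize_pods(raw, max_results=5)
print(len(raw), "characters reduced to", len(compact))
```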

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt; You own the maintenance burden. The MCP spec moved from &lt;code&gt;2024-11-05&lt;/code&gt; to &lt;code&gt;2025-03-26&lt;/code&gt; with breaking changes to &lt;code&gt;resources&lt;/code&gt; and &lt;code&gt;sampling&lt;/code&gt;. In practice, the Python and TypeScript SDKs still have pre-1.0 stability despite their version numbers. You will be doing spec upgrades on a Friday afternoon at some point.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Critical gotcha that almost everyone gets wrong:&lt;/strong&gt; The MCP SDK does NOT automatically validate &lt;code&gt;tool.inputSchema&lt;/code&gt; against the arguments the LLM passes. If the LLM sends malformed args — or if a prompt injection in a log file triggers an unexpected tool call with bad parameters — your handler receives them as-is. You will get a confusing Python &lt;code&gt;KeyError&lt;/code&gt; or &lt;code&gt;TypeError&lt;/code&gt; deep in your handler instead of a clean error message. You must add Pydantic (Python) or Zod (TypeScript) validation yourself, at the top of every tool handler. This is non-negotiable for production.&lt;/p&gt;

&lt;p&gt;Also: do not register more than 15–20 tools in a single MCP server. LLMs degrade in tool selection accuracy above roughly 15 tools. Split by domain: infra-tools, observability-tools, incident-tools. Separate servers, separate concerns.&lt;/p&gt;

&lt;p&gt;Here's a minimal but production-pattern custom MCP server in Python. It wraps &lt;code&gt;kubectl&lt;/code&gt; for pod listing and an internal runbook API, with proper Pydantic validation and response truncation:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
# custom_mcp_server.py
# Minimal production-pattern MCP server wrapping kubectl + internal runbook API
# Requires: pip install mcp==1.6.0 pydantic==2.7.0 httpx==0.27.0

import asyncio
import json
import os
import subprocess

import httpx
from mcp.server import Server
from mcp.server.stdio import stdio_server
from mcp.types import Tool, TextContent, CallToolResult
from pydantic import BaseModel, Field, ValidationError

# --- Input schema models (SDK does NOT validate these automatically) ---

class KubectlGetPodsInput(BaseModel):
    namespace: str
    label_selector: str = ""          # optional label filter e.g. "app=nginx"
    max_results: int = Field(default=20, ge=1, le=20)   # hard cap to control token usage

class RunbookLookupInput(BaseModel):
    service_name: str
    alert_name: str

# --- Server init ---

app = Server("sre-mcp-server")

RUNBOOK_API_BASE = "https://runbooks.internal.example.com/api/v1"
# Read the token from the environment (sourced from Vault or similar).
# A missing variable fails fast at startup instead of on the first tool call.
RUNBOOK_API_TOKEN = os.environ["RUNBOOK_API_TOKEN"]

# --- Tool definitions ---

@app.list_tools()
async def list_tools() -&amp;gt; list[Tool]:
    return [
        Tool(
            name="kubectl_get_pods",
            description="List pods in a namespace. Read-only. Max 20 results.",
            inputSchema=KubectlGetPodsInput.model_json_schema(),
        ),
        Tool(
            name="runbook_lookup",
            description="Fetch the runbook for a given service and alert name.",
            inputSchema=RunbookLookupInput.model_json_schema(),
        ),
    ]

# --- Tool handlers ---

@app.call_tool()
async def call_tool(name: str, arguments: dict) -&amp;gt; list[TextContent]:
    # With mcp==1.6.0 the low-level handler returns a list of content blocks.
    # The SDK wraps that list in a CallToolResult itself, and any exception
    # raised here is reported to the client as a result with isError=True.

    if name == "kubectl_get_pods":
        try:
            args = KubectlGetPodsInput(**arguments)   # validate here, not in SDK
        except ValidationError as e:
            raise ValueError(f"Invalid args: {e}") from e

        cmd = ["kubectl", "get", "pods", "-n", args.namespace, "-o", "json"]
        if args.label_selector:
            cmd += ["-l", args.label_selector]

        # Run kubectl in a worker thread so it does not block the event loop
        result = await asyncio.to_thread(
            subprocess.run, cmd, capture_output=True, text=True, timeout=15
        )

        if result.returncode != 0:
            raise RuntimeError(f"kubectl error: {result.stderr}")

        pods_json = json.loads(result.stdout)
        # Truncate to max_results and keep only a few fields to reduce token usage
        items = pods_json.get("items", [])[:args.max_results]
        slim = []
        for p in items:
            statuses = p["status"].get("containerStatuses", [])
            slim.append({
                "name": p["metadata"]["name"],
                "phase": p["status"].get("phase"),
                # a pod with no container statuses yet is not ready
                "ready": bool(statuses) and all(c["ready"] for c in statuses),
                "restarts": sum(c.get("restartCount", 0) for c in statuses),
            })
        return [TextContent(type="text", text=json.dumps(slim, indent=2))]

    elif name == "runbook_lookup":
        try:
            args = RunbookLookupInput(**arguments)
        except ValidationError as e:
            raise ValueError(f"Invalid args: {e}") from e

        async with httpx.AsyncClient() as client:
            resp = await client.get(
                f"{RUNBOOK_API_BASE}/runbooks",
                params={"service": args.service_name, "alert": args.alert_name},
                headers={"Authorization": f"Bearer {RUNBOOK_API_TOKEN}"},
                timeout=10,
            )
        if resp.status_code != 200:
            raise RuntimeError(f"Runbook API error: {resp.status_code}")
        return [TextContent(type="text", text=resp.text)]

    raise ValueError(f"Unknown tool: {name}")

# --- Entry point ---

async def main():
    async with stdio_server() as (read_stream, write_stream):
        await app.run(read_stream, write_stream, app.create_initialization_options())

if __name__ == "__main__":
    asyncio.run(main())
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And here's how to wire both a custom server and community servers together in your Claude Desktop config. This is the hybrid pattern I actually run:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
{
  "mcpServers": {

    "sre-tools": {
      "command": "python",
      "args": ["/opt/mcp-servers/custom_mcp_server.py"],
      "env": {
        "KUBECONFIG": "/home/sre/.kube/config-staging",
        "RUNBOOK_API_TOKEN": "${RUNBOOK_API_TOKEN}"
      }
    },

    "prometheus": {
      "command": "uvx",
      "args": ["mcp-server-prometheus"],
      "env": {
        "PROMETHEUS_URL": "https://prometheus.internal.example.com",
        "PROMETHEUS_USERNAME": "mcp-readonly",
        "PROMETHEUS_PASSWORD": "${PROM_PASSWORD}"
      }
    },

    "github": {
      "command": "/usr/local/bin/github-mcp-server",
      "args": ["stdio"],
      "env": {
        "GITHUB_PERSONAL_ACCESS_TOKEN": "${GITHUB_PAT}"
      }
    }

  }
}
&lt;/code&gt;&lt;/pre&gt;

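&lt;p&gt;Note the explicit &lt;code&gt;KUBECONFIG&lt;/code&gt; path pointing to a staging context. Note that the tokens are referenced as &lt;code&gt;${...}&lt;/code&gt; placeholders instead of being hardcoded. Not every MCP client expands those references, so confirm that yours does, or launch the server through a wrapper script that exports the secrets. Small details, but they matter when this runs unattended.&lt;/p&gt;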

&lt;p&gt;You can test any MCP server locally without a full LLM client using the &lt;code&gt;mcp dev&lt;/code&gt; CLI command, which launches the MCP Inspector UI (at &lt;code&gt;http://localhost:5173&lt;/code&gt; in older releases; the default port has changed since, so use the URL printed at startup). Run it before you ever connect Claude to a new server. It saves a lot of confusion.&lt;/p&gt;

&lt;h2&gt;Decision Matrix&lt;/h2&gt;

&lt;p&gt;Here's how these two options actually stack up across the dimensions that matter for real DevOps teams — not theoretical ones.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Pre-Built (Option A)&lt;/th&gt;
&lt;th&gt;Custom (Option B)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Setup time&lt;/td&gt;
&lt;td&gt;Under 10 minutes&lt;/td&gt;
&lt;td&gt;Half a day minimum&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool customization&lt;/td&gt;
&lt;td&gt;Fixed, community-defined&lt;/td&gt;
&lt;td&gt;Full control&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Internal API access&lt;/td&gt;
&lt;td&gt;Not possible&lt;/td&gt;
&lt;td&gt;First-class&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security posture&lt;/td&gt;
&lt;td&gt;Inherited, opaque&lt;/td&gt;
&lt;td&gt;Explicit, auditable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Maintenance burden&lt;/td&gt;
&lt;td&gt;Low initially, chaotic on upgrades&lt;/td&gt;
&lt;td&gt;Ongoing, owned by you&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-user / team deployment&lt;/td&gt;
&lt;td&gt;Requires mcp-proxy, extra ops&lt;/td&gt;
&lt;td&gt;SSE transport, proper auth&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Production readiness&lt;/td&gt;
&lt;td&gt;Proof-of-concept to small teams&lt;/td&gt;
&lt;td&gt;Production-grade with work&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SIEM / audit logging&lt;/td&gt;
&lt;td&gt;Not available&lt;/td&gt;
&lt;td&gt;Implement in handler middleware&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The inflection point is clear. If your team has fewer than three internal-only tool integrations and this is a proof-of-concept, use pre-built. If you're running MCP in production for more than five engineers, touching real infrastructure, you need custom servers with proper auth and audit logging. There's no middle ground that scales.&lt;/p&gt;

&lt;p&gt;The transport layer is another forcing function. Stdio transport spawns a new process per Claude Desktop session — zero shared state, which breaks any tool that needs to maintain a Terraform plan or a multi-step workflow between calls. SSE transport (&lt;code&gt;http://localhost:8000/sse&lt;/code&gt; by default, configurable with &lt;code&gt;--host&lt;/code&gt;, &lt;code&gt;--port&lt;/code&gt;, or &lt;code&gt;MCP_HOST&lt;/code&gt;/&lt;code&gt;MCP_PORT&lt;/code&gt; env vars) is HTTP-based and supports multiple concurrent clients. The moment you need SSE, you're in custom server territory anyway, because most community servers don't ship with SSE support configured for team use.&lt;/p&gt;

&lt;p&gt;MCP has no built-in authorization at the tool level. This is the security gap that matters most. If your server exposes both &lt;code&gt;kubectl_get&lt;/code&gt; and &lt;code&gt;kubectl_delete&lt;/code&gt;, the LLM — or a prompt injection hiding in a log file your LLM just read — can call either. Implement tool-level allow-lists in your handler middleware. The official MCP documentation covers the protocol spec in detail at &lt;a href="https://modelcontextprotocol.io/docs" rel="noopener noreferrer"&gt;modelcontextprotocol.io&lt;/a&gt;. For Kubernetes RBAC, bind your MCP server's service account to a &lt;code&gt;ClusterRole&lt;/code&gt; with only &lt;code&gt;get&lt;/code&gt;, &lt;code&gt;list&lt;/code&gt;, and &lt;code&gt;watch&lt;/code&gt; verbs. See the &lt;a href="https://kubernetes.io/docs/reference/access-authn-authz/rbac/" rel="noopener noreferrer"&gt;Kubernetes RBAC documentation&lt;/a&gt; for the ClusterRole pattern. Never bind &lt;code&gt;create&lt;/code&gt;, &lt;code&gt;delete&lt;/code&gt;, or &lt;code&gt;patch&lt;/code&gt; unless you have a separate, explicitly audited use case for it.&lt;/p&gt;
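&lt;p&gt;A minimal sketch of such an allow-list check, assuming a Python server like the one above. The names &lt;code&gt;ALLOWED_TOOLS&lt;/code&gt;, &lt;code&gt;ToolNotAllowed&lt;/code&gt;, &lt;code&gt;enforce_allow_list&lt;/code&gt;, and &lt;code&gt;audit_record&lt;/code&gt; are hypothetical and not part of the MCP SDK. The idea is only to reject any tool name that is not explicitly listed before the handler does any work, and to produce a record you can ship to your logs:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
# allow_list.py
# Hypothetical tool-level allow-list; none of these names come from the MCP SDK.

# Only read-only tools are listed. Anything else is rejected by default.
ALLOWED_TOOLS = frozenset({"kubectl_get_pods", "runbook_lookup"})


class ToolNotAllowed(Exception):
    """Raised when a caller requests a tool that is not on the allow-list."""


def enforce_allow_list(tool_name, allowed=ALLOWED_TOOLS):
    """Return the tool name unchanged if it is allowed, otherwise raise."""
    if tool_name not in allowed:
        raise ToolNotAllowed(f"Tool not allowed: {tool_name}")
    return tool_name


def audit_record(tool_name, arguments, allowed=ALLOWED_TOOLS):
    """Build a log-friendly record of the decision for a SIEM pipeline."""
    return {
        "tool": tool_name,
        "argument_keys": sorted(arguments),   # log keys only, not values
        "allowed": tool_name in allowed,
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Calling &lt;code&gt;enforce_allow_list(name)&lt;/code&gt; as the first statement of the tool handler keeps the default at deny: a tool added to the server later stays unreachable until someone lists it on purpose.&lt;/p&gt;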

&lt;h2&gt;My Pick&lt;/h2&gt;

&lt;p&gt;I prefer the hybrid approach, and I'm not hedging on this. Start with pre-built for your proof-of-concept — specifically the &lt;code&gt;mcp-server-kubernetes&lt;/code&gt; plus &lt;code&gt;mcp-server-prometheus&lt;/code&gt; combination. Get your team comfortable with what MCP-powered workflows actually feel like before you write a line of server code. Then, once you know which internal APIs you actually need and what the real operational requirements are, build a thin custom Python MCP server that wraps those internal integrations.&lt;/p&gt;

&lt;p&gt;Do not fork community servers. That's the trap. You end up owning their entire codebase plus your customizations, and when the spec upgrades, you're doing a merge instead of just bumping your SDK version.&lt;/p&gt;

&lt;p&gt;The pattern I run: community servers (Prometheus, GitHub) run as separate stdio processes. My custom server handles everything internal. Both are registered in Claude Desktop config as separate &lt;code&gt;mcpServers&lt;/code&gt; entries. The LLM picks tools from both. Clean separation, minimal blast radius per server.&lt;/p&gt;

&lt;p&gt;One non-negotiable: any MCP server touching production infrastructure runs with a dedicated service account, scoped RBAC, and logs every tool call to the SIEM. Treat it like a privileged CI runner, not a chatbot plugin. Store credentials in Vault or your secrets manager, inject them as environment variables at runtime, and rotate them on the same schedule as your other service accounts. Before you point a real LLM at a server, validate its tool schemas with the &lt;code&gt;mcp dev&lt;/code&gt; inspector.&lt;/p&gt;

&lt;p&gt;MCP server DevOps tooling is moving fast. The spec will break again. But the operational principles — scoped access, audit logging, input validation, response truncation — those don't change. Get those right and you can absorb the spec churn without a production incident. You can find more patterns for building secure, automated infrastructure tooling at &lt;a href="https://kuryzhev.cloud/" rel="noopener noreferrer"&gt;kuryzhev.cloud&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;Related&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://kuryzhev.cloud/category/kubernetes/" rel="noopener noreferrer"&gt;Kubernetes RBAC, pod security, and production cluster patterns&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kuryzhev.cloud/category/ci-cd/" rel="noopener noreferrer"&gt;CI/CD pipeline automation, OIDC auth, and deployment strategies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kuryzhev.cloud/category/security/" rel="noopener noreferrer"&gt;Security hardening, secrets management, and audit logging for infra&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>devops</category>
    </item>
    <item>
      <title>Auto-Summarize On-Call Incidents with n8n AI Agents</title>
      <dc:creator>Oleksandr Kuryzhev</dc:creator>
      <pubDate>Sat, 13 Jun 2026 07:03:03 +0000</pubDate>
      <link>https://dev.to/oleksandr_kuryzhev_42873f/auto-summarize-on-call-incidents-with-n8n-ai-agents-1lj4</link>
      <guid>https://dev.to/oleksandr_kuryzhev_42873f/auto-summarize-on-call-incidents-with-n8n-ai-agents-1lj4</guid>
      <description>
</description>
      <category>devops</category>
    </item>
    <item>
      <title>AI Code Review for Terraform PRs: CI Checklist and Automation</title>
      <dc:creator>Oleksandr Kuryzhev</dc:creator>
      <pubDate>Fri, 12 Jun 2026 07:03:08 +0000</pubDate>
      <link>https://dev.to/oleksandr_kuryzhev_42873f/ai-code-review-for-terraform-prs-ci-checklist-and-automation-5pk</link>
      <guid>https://dev.to/oleksandr_kuryzhev_42873f/ai-code-review-for-terraform-prs-ci-checklist-and-automation-5pk</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://kuryzhev.cloud/2026/06/12/ai-code-review-for-terraform-prs-ci-checklist-and-automation" rel="noopener noreferrer"&gt;kuryzhev.cloud&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Your Terraform PR passed tflint and checkov — but the AI reviewer just flagged that you're about to delete a production RDS instance and nobody noticed. That's the exact scenario that pushed our team to build a structured AI Terraform PR review process into CI. Not as a replacement for human review. As a safety net that catches what humans miss when they're tired, rushed, or reviewing their twelfth PR of the day.&lt;/p&gt;

&lt;p&gt;This checklist is for teams running Terraform 1.5+ in GitHub Actions, GitLab CI, or Atlantis pipelines. Every item here came from either a real incident or a near-miss. Some of it will be obvious. A few items will surprise you.&lt;/p&gt;

&lt;h2&gt;Why This Checklist&lt;/h2&gt;



&lt;p&gt;A typical IaC pull request touches anywhere from 10 to 50 resources. Under time pressure — and reviewers are almost always under time pressure — human engineers miss roughly 30% of policy violations. I've seen it happen on our own team. A reviewer approves a PR with an unencrypted S3 bucket because the checkov annotation looks fine at a glance, but the skip reason references a deprecated ADR. Nobody catches it until a compliance scan runs two weeks later.&lt;/p&gt;

&lt;p&gt;AI code review tools like &lt;a href="https://www.coderabbit.ai/" rel="noopener noreferrer"&gt;CodeRabbit&lt;/a&gt;, custom GPT-4o scripts, and Atlantis LLM hooks analyze the full plan diff in seconds. They don't get tired. They don't skip the last file because standup starts in five minutes. But they also hallucinate. They confuse AWS provider v4 attributes with v5 changes. They miss secrets buried in &lt;code&gt;.auto.tfvars&lt;/code&gt; files. That's exactly why you need a checklist — not to trust the AI blindly, but to define what it must check and what it cannot be trusted to catch alone.&lt;/p&gt;

&lt;p&gt;One more thing worth saying upfront: adding AI review as a &lt;em&gt;blocking&lt;/em&gt; CI gate on day one is a mistake. Run it as an advisory check for two weeks first. Let the team calibrate signal versus noise. Then promote it to blocking. I learned this the hard way after we blocked three legitimate PRs because the model flagged a &lt;code&gt;lifecycle { prevent_destroy = true }&lt;/code&gt; block as "suspicious" on a stateful resource it didn't recognize.&lt;/p&gt;

&lt;h2&gt;The AI Code Review Checklist for Terraform PRs&lt;/h2&gt;

&lt;p&gt;Each item below is something you configure once and enforce on every PR. The first four are static analysis gates. Items five through eight are AI-specific prompt checks. Nine through twelve cover plan diff review. The final three address state hygiene.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
&lt;strong&gt;Run tflint v0.50+ with &lt;code&gt;--format=json&lt;/code&gt;.&lt;/strong&gt; The default text output is brittle to parse in downstream scripts; the JSON output is not. Pin the version explicitly in your CI and don't pull latest.&lt;/li&gt;
  &lt;li&gt;
&lt;strong&gt;Wire tflint exit codes correctly.&lt;/strong&gt; Exit code 2 means lint issues were found. Exit code 1 means the tool itself errored. Most pipelines treat both as the same failure. They are not. A tool error should page someone; a lint violation should block the PR.&lt;/li&gt;
  &lt;li&gt;
&lt;strong&gt;Run checkov 3.x on changed files only.&lt;/strong&gt; Running checkov on the entire repo on every PR is slow and noisy. Use &lt;code&gt;git diff&lt;/code&gt; to scope it to changed &lt;code&gt;.tf&lt;/code&gt; files.&lt;/li&gt;
  &lt;li&gt;
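&lt;strong&gt;Distinguish checkov findings from checkov failures.&lt;/strong&gt; Same idea as tflint: a failed policy check and a scanner that crashed are different events. Checkov exits non-zero when checks fail, so confirm how your pinned version reports its own errors and wire the two cases separately in your CI logic.&lt;/li&gt;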
  &lt;li&gt;
&lt;strong&gt;Prompt the AI to flag hardcoded credentials.&lt;/strong&gt; Static tools miss credentials embedded in &lt;code&gt;locals&lt;/code&gt; blocks or passed as &lt;code&gt;default&lt;/code&gt; values in variable declarations. Include this explicitly in your system prompt.&lt;/li&gt;
  &lt;li&gt;
&lt;strong&gt;Prompt for missing &lt;code&gt;lifecycle&lt;/code&gt; blocks on stateful resources.&lt;/strong&gt; RDS instances, ElastiCache clusters, and S3 buckets without &lt;code&gt;prevent_destroy = true&lt;/code&gt; are a silent risk. The AI catches this consistently when you ask for it explicitly.&lt;/li&gt;
  &lt;li&gt;
&lt;strong&gt;Prompt for untagged resources.&lt;/strong&gt; Define your required tags (env, owner, cost-center) in the system prompt. The model will flag resources missing any of them.&lt;/li&gt;
  &lt;li&gt;
&lt;strong&gt;Prompt for overly permissive IAM.&lt;/strong&gt; Wildcard actions on wildcard resources. &lt;code&gt;Effect: Allow&lt;/code&gt; on &lt;code&gt;*&lt;/code&gt;. The AI is genuinely good at spotting these patterns across multiple policy documents simultaneously.&lt;/li&gt;
  &lt;li&gt;
&lt;strong&gt;Generate a JSON plan and feed it to the AI.&lt;/strong&gt; Run &lt;code&gt;terraform plan -out=tfplan&lt;/code&gt; followed by &lt;code&gt;terraform show -json tfplan&lt;/code&gt;. This gives the model the actual planned state, not just HCL syntax — which is a fundamentally different and more accurate input.&lt;/li&gt;
  &lt;li&gt;
&lt;strong&gt;Extract only changed resources before sending to the AI.&lt;/strong&gt; A full JSON plan for 300 resources hits 80k–110k tokens. GPT-4o's context window is 128k tokens — you'll overflow it on large repos. Filter to changed resources only using jq.&lt;/li&gt;
  &lt;li&gt;
&lt;strong&gt;Ask the AI to produce a risk score.&lt;/strong&gt; LOW, MEDIUM, HIGH, CRITICAL. This gives reviewers a fast signal before they read the full comment. It also makes the output machine-parseable if you want to auto-block CRITICAL findings later.&lt;/li&gt;
  &lt;li&gt;
&lt;strong&gt;Flag destructive changes explicitly.&lt;/strong&gt; Use the jq filter below to isolate deletes and replacements. Send this subset to the AI with elevated attention in the prompt. A delete on a database should always surface as CRITICAL.&lt;/li&gt;
  &lt;li&gt;
&lt;strong&gt;Verify remote backend configuration is present.&lt;/strong&gt; No local &lt;code&gt;.tfstate&lt;/code&gt; files committed. State locking enabled via DynamoDB. Backend config not hardcoded with account IDs.&lt;/li&gt;
  &lt;li&gt;
&lt;strong&gt;Check for state lock contention risk.&lt;/strong&gt; Running &lt;code&gt;terraform plan&lt;/code&gt; without &lt;code&gt;-lock=false&lt;/code&gt; in CI causes lock contention when multiple PRs run simultaneously. This one has burned us twice.&lt;/li&gt;
  &lt;li&gt;
&lt;strong&gt;Pin provider versions.&lt;/strong&gt; &lt;code&gt;version = "~&amp;gt; 5.0"&lt;/code&gt; in the &lt;code&gt;required_providers&lt;/code&gt; block. AI models trained before mid-2023 may suggest deprecated constraint syntax — validate suggestions against the &lt;a href="https://developer.hashicorp.com/terraform/language/providers/requirements" rel="noopener noreferrer"&gt;official Terraform provider requirements docs&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;
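&lt;p&gt;Item 12 can also be done in Python instead of jq. The sketch below is one way to do it, assuming the input is the dictionary loaded from &lt;code&gt;terraform show -json tfplan&lt;/code&gt;; the function name &lt;code&gt;destructive_changes&lt;/code&gt; and the sample data are made up for illustration. It relies only on the &lt;code&gt;resource_changes[].change.actions&lt;/code&gt; field of the plan JSON:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# destructive_changes.py
# Sketch: pick out deletes and replacements from `terraform show -json` output.
import json


def destructive_changes(plan):
    """Return resource changes whose actions include a delete.

    A replacement shows up as ["delete", "create"] or ["create", "delete"],
    so checking for "delete" covers both plain deletes and replacements.
    """
    found = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if "delete" in actions:
            found.append({"address": rc["address"], "actions": actions})
    return found


if __name__ == "__main__":
    # Tiny hand-written plan fragment, for illustration only
    sample = {
        "resource_changes": [
            {"address": "aws_db_instance.main", "change": {"actions": ["delete"]}},
            {"address": "aws_instance.web", "change": {"actions": ["delete", "create"]}},
            {"address": "aws_s3_bucket.logs", "change": {"actions": ["update"]}},
        ]
    }
    print(json.dumps(destructive_changes(sample), indent=2))&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Sending this subset as its own section of the prompt, ahead of the full diff, makes it harder for a delete to get buried under dozens of harmless tag updates.&lt;/p&gt;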

&lt;p&gt;Here's the GitHub Actions workflow that runs this entire checklist. It handles tflint, checkov, plan generation, resource filtering, and the OpenAI API call in a single job:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# .github/workflows/terraform-ai-review.yml
# Requires: OPENAI_API_KEY secret, AWS credentials for plan, terraform 1.5+

name: Terraform AI Code Review

on:
  pull_request:
    paths:
      - '**.tf'
      - '**.tfvars'

jobs:
  ai-review:
    runs-on: ubuntu-22.04
    permissions:
      pull-requests: write   # needed to post review comments
      contents: read

    steps:
      - name: Checkout
        uses: actions/checkout@v4
        with:
          fetch-depth: 0   # full history so the diff against the base branch works

      - name: Setup Terraform 1.7.x
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "1.7.5"
          terraform_wrapper: false   # wrapper breaks JSON output parsing

      - name: Setup tflint v0.50.3
        run: |
          curl -s https://raw.githubusercontent.com/terraform-linters/tflint/master/install_linux.sh | \
            TFLINT_VERSION=v0.50.3 bash

      - name: Run tflint (JSON output for downstream parsing)
        run: |
          # tflint exit codes: 0 = clean, 1 = tool error, 2 = lint issues found
          EXIT=0
          tflint --format=json --recursive &amp;gt; tflint-results.json || EXIT=$?
          if [ "$EXIT" -eq 1 ]; then echo "tflint tool error"; exit 1; fi

      - name: Run checkov on changed files only
        run: |
          pip install checkov==3.2.0 --quiet
          # collect changed .tf files (excluding deleted ones) into a bash array
          mapfile -t CHANGED &amp;lt; &amp;lt;(git diff --name-only --diff-filter=d \
            "origin/${{ github.base_ref }}...HEAD" -- '*.tf')
          if [ "${#CHANGED[@]}" -eq 0 ]; then
            echo '{}' &amp;gt; checkov-results.json   # nothing to scan
            exit 0
          fi
          # -f accepts several paths; a comma-joined string would be read as one file
          checkov -f "${CHANGED[@]}" \
            --output json \
            --compact \
            --skip-check CKV2_AWS_5 \
            &amp;gt; checkov-results.json || true

      - name: Terraform Init + Plan (generate JSON plan)
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
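          # NOTE: a per-PR state key starts from an empty state, so every resource plans as a create.
          # Point the key at the real state if deletes and replacements must show up in the diff.
          terraform init -input=false -backend-config="key=pr-${{ github.event.pull_request.number }}.tfstate"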
          terraform plan -lock=false -input=false -out=tfplan
          terraform show -json tfplan &amp;gt; tfplan.json

      - name: Extract changed resources only (reduce token usage)
        run: |
          jq '[.resource_changes[] | select(.change.actions != ["no-op"]) |
            {address, actions: .change.actions, before: .change.before, after: .change.after}]' \
            tfplan.json &amp;gt; tfplan-diff.json
          # log token estimate: ~4 chars per token
          CHARS=$(wc -c &amp;lt; tfplan-diff.json)
          echo "Estimated tokens: $((CHARS / 4))"

      - name: Send to OpenAI for review + post PR comment
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          PR_NUMBER: ${{ github.event.pull_request.number }}
          REPO: ${{ github.repository }}
        run: |
          python3 .github/scripts/ai_review.py \
            --plan tfplan-diff.json \
            --tflint tflint-results.json \
            --checkov checkov-results.json \
            --pr "$PR_NUMBER" \
            --repo "$REPO"&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And here's the Python script that calls GPT-4o and posts the structured review comment back to the PR. It includes exponential backoff via &lt;code&gt;tenacity&lt;/code&gt; to handle OpenAI rate limits — tier-1 accounts cap at 500 RPM on GPT-4o, which a busy monorepo with 20 simultaneous PRs will absolutely hit:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# .github/scripts/ai_review.py
# Posts structured AI review comment to GitHub PR
# Requires: openai&amp;gt;=1.14.0, requests, tenacity

import argparse, json, os, sys
import openai
import requests
from tenacity import retry, stop_after_attempt, wait_exponential

SYSTEM_PROMPT = """You are a senior DevOps engineer reviewing Terraform infrastructure changes.
Analyze the provided plan diff and linter results. Return a structured review with:
1. RISK_LEVEL: LOW | MEDIUM | HIGH | CRITICAL
2. DESTRUCTIVE_CHANGES: list any resource deletions or replacements
3. SECURITY_FINDINGS: IAM over-permissions, open security groups, missing encryption
4. MISSING_TAGS: resources lacking required tags (env, owner, cost-center)
5. RECOMMENDATIONS: max 5 bullet points, actionable only
Do NOT hallucinate resource attributes. If unsure, say 'verify in provider docs'."""

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=30))
def call_openai(plan_content: str, lint_content: str) -&amp;gt; str:
    client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    response = client.chat.completions.create(
        model="gpt-4o",
        max_tokens=1500,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"PLAN DIFF:\n{plan_content}\n\nLINTER FINDINGS:\n{lint_content}"}
        ]
    )
    return response.choices[0].message.content

def post_pr_comment(repo: str, pr: str, body: str):
    url = f"https://api.github.com/repos/{repo}/issues/{pr}/comments"
    headers = {
        "Authorization": f"Bearer {os.environ['GH_TOKEN']}",
        "Accept": "application/vnd.github+json"
    }
    resp = requests.post(url, json={"body": f"## 🤖 AI Terraform Review\n\n{body}"}, headers=headers)
    resp.raise_for_status()

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--plan"); parser.add_argument("--tflint")
    parser.add_argument("--checkov"); parser.add_argument("--pr"); parser.add_argument("--repo")
    args = parser.parse_args()

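    # Load all three inputs; context managers close the files promptly
    with open(args.plan) as f:
        plan = json.load(f)
    with open(args.tflint) as f:
        tflint = json.load(f)
    with open(args.checkov) as f:
        checkov = json.load(f)   # checkov findings go into the prompt next to tflint

    # Truncate if over ~80k tokens (~320k chars): keep the first 300k chars
    plan_str = json.dumps(plan)[:300_000]
    lint_str = json.dumps({"tflint": tflint, "checkov": checkov})[:20_000]

    review = call_openai(plan_str, lint_str)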
    post_pr_comment(args.repo, args.pr, review)
    print("AI review posted successfully")

if __name__ == "__main__":
    main()&lt;/code&gt;&lt;/pre&gt;
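&lt;p&gt;If you later want CRITICAL findings to block the PR, the risk level has to be pulled out of the model's free-text answer. Here is a rough sketch of that step; &lt;code&gt;parse_risk_level&lt;/code&gt; and &lt;code&gt;should_block&lt;/code&gt; are hypothetical helpers, and the regex assumes the model keeps the &lt;code&gt;RISK_LEVEL&lt;/code&gt; label from the system prompt, which it will not always do:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# risk_gate.py
# Sketch: turn the model's RISK_LEVEL line into a CI decision.
import re

RISK_ORDER = ["LOW", "MEDIUM", "HIGH", "CRITICAL"]
# Tolerates punctuation or markdown between the label and the value
RISK_PATTERN = re.compile(r"RISK_LEVEL\W*(LOW|MEDIUM|HIGH|CRITICAL)\b")


def parse_risk_level(review_text):
    """Return the first risk level found, or None if the model omitted it."""
    match = RISK_PATTERN.search(review_text)
    return match.group(1) if match else None


def should_block(review_text, threshold="CRITICAL"):
    """Block when the parsed level is at or above the threshold.

    An unparseable review does not block here; it is treated as advisory.
    """
    level = parse_risk_level(review_text)
    if level is None:
        return False
    return level in RISK_ORDER[RISK_ORDER.index(threshold):]&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Treating a missing level as non-blocking is a deliberate choice for the advisory phase. Once the check is promoted to blocking, you may prefer to fail closed instead.&lt;/p&gt;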

&lt;h2&gt;Commonly Missed Items&lt;/h2&gt;

&lt;p&gt;These are the gaps that bite teams six months after they think the setup is complete.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI models hallucinate provider attributes.&lt;/strong&gt; This is the biggest one. AWS provider v5.x introduced breaking changes: renamed attributes, removed arguments, new required fields. Models trained before mid-2023 don't know about these. When the AI tells you to add &lt;code&gt;acl = "private"&lt;/code&gt; to an S3 bucket resource, it's giving you stale advice: that argument was deprecated in AWS provider v4.0 in favor of the separate &lt;code&gt;aws_s3_bucket_acl&lt;/code&gt; resource. The fix: always pin provider versions with &lt;code&gt;~&amp;gt; 5.0&lt;/code&gt; in your &lt;code&gt;required_providers&lt;/code&gt; block, and instruct the AI in the system prompt to validate suggestions against the actual plan output, not just HCL syntax. If the plan succeeds, the attributes are valid. If the AI contradicts the plan, trust the plan.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Watch out for: &lt;code&gt;.tfvars&lt;/code&gt; and &lt;code&gt;.auto.tfvars&lt;/code&gt; files being excluded from AI context.&lt;/strong&gt; This is where secrets and environment-specific values live. Database passwords passed as variables, account IDs, CIDR blocks that reveal internal network topology. These files are often excluded from the AI context window either by accident (the workflow only globs &lt;code&gt;*.tf&lt;/code&gt;) or intentionally (to avoid sending secrets to the API). The problem is that the AI then reviews IAM policies and security group rules without knowing the actual values they'll be populated with. The solution is to redact sensitive values with &lt;code&gt;sed&lt;/code&gt; or &lt;code&gt;sops --decrypt | sed 's/=.*/=REDACTED/'&lt;/code&gt; before including them in the prompt, not to exclude the files entirely.&lt;/p&gt;
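&lt;p&gt;The same redaction can be sketched in Python when &lt;code&gt;sed&lt;/code&gt; gets too blunt. This is an illustration under simplifying assumptions, not a full HCL parser: &lt;code&gt;redact_tfvars&lt;/code&gt; and &lt;code&gt;bracket_delta&lt;/code&gt; are hypothetical names, brackets inside string values are miscounted, and heredocs are not handled. It keeps variable names and comments, replaces every value, and drops the body of multi-line lists and maps so CIDR blocks do not slip through:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# redact_tfvars.py
# Sketch: keep variable names, hide values, before a tfvars file goes into a prompt.
import re

# Matches simple assignments such as: db_password = "hunter2"
ASSIGNMENT = re.compile(r"^(\s*[A-Za-z_][A-Za-z0-9_-]*\s*=\s*)(.*)$")


def bracket_delta(text):
    """Opening minus closing brackets on a line (naive: ignores string contents)."""
    opened = sum(text.count(c) for c in "[{(")
    closed = sum(text.count(c) for c in "]})")
    return opened - closed


def redact_tfvars(text):
    """Replace every assigned value with "REDACTED" and drop multi-line bodies."""
    out = []
    depth = 0   # bracket depth of a multi-line value currently being skipped
    for line in text.splitlines():
        if depth:
            # still inside a multi-line list or map: drop the line entirely
            depth = max(depth + bracket_delta(line), 0)
            continue
        stripped = line.strip()
        if not stripped or stripped.startswith(("#", "//")):
            out.append(line)   # blank lines and comments pass through
            continue
        match = ASSIGNMENT.match(line)
        if match:
            out.append(match.group(1) + '"REDACTED"')
            depth = max(bracket_delta(match.group(2)), 0)
        else:
            out.append(line)
    return "\n".join(out)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The output is meant for a prompt, not for Terraform. Review it by eye at least once before wiring it into CI, because any line the pattern does not recognise is passed through unchanged.&lt;/p&gt;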

&lt;p&gt;&lt;strong&gt;Module version pinning is silently skipped by most AI tools.&lt;/strong&gt; A Terraform module sourced as &lt;code&gt;source = "terraform-aws-modules/vpc/aws"&lt;/code&gt; without a &lt;code&gt;version&lt;/code&gt; constraint will pull whatever is latest at plan time. This causes silent drift between environments. Static tools like tflint and checkov don't flag this by default. The AI won't flag it either unless you explicitly include "check for unpinned module versions" in your system prompt. Add it. I've seen a VPC module minor version bump change subnet behavior in production because nobody pinned it.&lt;/p&gt;
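&lt;p&gt;You do not have to rely on the model for this one. A small script can flag unpinned modules deterministically; the following is a sketch with a line-based scanner, not a real HCL parser. &lt;code&gt;unpinned_modules&lt;/code&gt; is a hypothetical helper, it assumes one argument per line, it ignores single-line module blocks, and it treats any source that does not start with &lt;code&gt;.&lt;/code&gt; or &lt;code&gt;git::&lt;/code&gt; as a registry source, which is a simplification:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# unpinned_modules.py
# Sketch: list module blocks that have a registry-style source but no version.
import re

MODULE_START = re.compile(r'^\s*module\s+"([^"]+)"\s*\{')
SOURCE_LINE = re.compile(r'^\s*source\s*=\s*"([^"]+)"')
VERSION_LINE = re.compile(r'^\s*version\s*=')


def unpinned_modules(hcl_text):
    """Return the names of modules with a source argument and no version."""
    unpinned = []
    name, depth, source, has_version = None, 0, None, False
    for line in hcl_text.splitlines():
        if name is None:
            start = MODULE_START.match(line)
            if start is None:
                continue
            name, depth, source, has_version = start.group(1), 0, None, False
        elif depth == 1:
            # only look at top-level arguments of the module block
            found = SOURCE_LINE.match(line)
            if found:
                source = found.group(1)
            if VERSION_LINE.match(line):
                has_version = True
        depth += line.count("{") - line.count("}")
        if depth == 0:
            # block closed: local paths and git sources take no version argument
            registry = source is not None and not source.startswith((".", "git::"))
            if registry and not has_version:
                unpinned.append(name)
            name = None
    return unpinned&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Run over the changed files in a PR, a non-empty result is a cheap, hallucination-free finding you can feed into the same review comment.&lt;/p&gt;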

&lt;p&gt;&lt;strong&gt;Watch out for: &lt;code&gt;tfsec&lt;/code&gt; references in older pipelines.&lt;/strong&gt; &lt;code&gt;tfsec&lt;/code&gt; was merged into &lt;code&gt;trivy&lt;/code&gt; as of v0.21. If your CI image still references the &lt;code&gt;tfsec&lt;/code&gt; binary directly and it's been upgraded to a newer image, you can get a silent no-op whenever the step swallows failures (for example with &lt;code&gt;|| true&lt;/code&gt; or &lt;code&gt;continue-on-error: true&lt;/code&gt;): the binary isn't found, the error is discarded, and the job stays green. Check your CI logs for actual tfsec output, not just a green checkmark.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The checkov skip annotation is itself a finding.&lt;/strong&gt; When engineers add &lt;code&gt;#checkov:skip=CKV_AWS_20:reason&lt;/code&gt; to suppress a finding, some AI configurations will flag the skip annotation as a new security concern without reading the reason. This creates noise. Include in your system prompt: "Treat checkov skip annotations as accepted risks if a reason is provided. Do not re-flag them as findings."&lt;/p&gt;

&lt;p&gt;For more on securing your Terraform CI pipelines and managing infrastructure secrets, see the related posts at &lt;a href="https://kuryzhev.cloud/" rel="noopener noreferrer"&gt;kuryzhev.cloud&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;Automation Ideas&lt;/h2&gt;

&lt;p&gt;Once the checklist is solid, the next step is removing every manual touch from the process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub Actions with CodeRabbit or a custom OpenAI script.&lt;/strong&gt; The workflow above handles the custom script path. For CodeRabbit, the setup is simpler — install the GitHub App, add a &lt;code&gt;.coderabbit.yaml&lt;/code&gt; config file, and it hooks into PR events automatically. CodeRabbit free tier covers 200 files per month; paid starts at $12/user/month. For most teams under 10 engineers, the free tier is sufficient for Terraform-only reviews. For larger teams or monorepos, the custom OpenAI script gives you more control over the system prompt and cost per call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Atlantis pre/post-plan hooks.&lt;/strong&gt; Atlantis added &lt;code&gt;pre_workflow_hooks&lt;/code&gt; in v0.19.0. You can wire checkov as a pre-plan hook and the Python AI review script as a post-plan hook. Workflow hooks are configured in the server-side repo config (&lt;code&gt;repos.yaml&lt;/code&gt;), not in the repo-level &lt;code&gt;atlantis.yaml&lt;/code&gt;, so check where your Atlantis version expects them. Post-plan hooks receive the plan output path as an environment variable. Pipe it through jq to extract changed resources, then pass the result to the AI script. This approach keeps everything server-side and avoids GitHub Actions runner costs for plan generation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost management for large plans.&lt;/strong&gt; OpenAI API calls on large JSON plan files with 500+ resources cost between $0.08 and $0.40 per PR. That adds up fast on active repos. Two mitigations: first, use the jq filter to send only changed resources, not the full plan. Second, cache plan outputs by content hash — if the plan JSON hasn't changed between two PR pushes (force-push with no IaC changes), skip the API call entirely. Implement this with a &lt;code&gt;sha256sum&lt;/code&gt; check in your workflow before the OpenAI step.&lt;/p&gt;
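&lt;p&gt;The content-hash check can also live in the review script itself. The sketch below is one way to do it, not the only one: it hashes a canonical JSON form of the changed resources and skips the API call when the digest matches the previous run. The helper names and the cache file location are assumptions, and in CI the cache file would need to persist between runs (for example through the runner's cache mechanism).&lt;/p&gt;

```python
import hashlib
import json
import tempfile
from pathlib import Path


def plan_digest(changed: list) -> str:
    """SHA-256 over a canonical JSON encoding, so key order does not matter."""
    canonical = json.dumps(changed, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def should_call_api(changed: list, cache_file: Path) -> bool:
    """Return False when the digest matches the one stored by the previous run."""
    digest = plan_digest(changed)
    if cache_file.exists() and cache_file.read_text().strip() == digest:
        return False
    cache_file.write_text(digest)
    return True


# Hypothetical cache location and change list, for illustration only
cache = Path(tempfile.mkdtemp()) / "plan.sha256"
changes = [{"address": "aws_s3_bucket.logs", "actions": ["update"]}]

first = should_call_api(changes, cache)   # no cached digest yet
second = should_call_api(changes, cache)  # identical plan pushed again
print(first, second)
```

&lt;p&gt;Hashing the filtered change list rather than the raw plan file means cosmetic differences in the full plan do not trigger a new API call.&lt;/p&gt;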

&lt;p&gt;&lt;strong&gt;Security note on API calls.&lt;/strong&gt; Sending your infrastructure topology to a third-party LLM is a real concern for regulated environments. Your plan JSON contains resource names, CIDR blocks, account IDs, and IAM policy structures. For HIPAA or PCI environments, use a self-hosted model — Ollama with CodeLlama 34B works reasonably well — or Azure OpenAI with data residency guarantees. Store &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; and &lt;code&gt;REVIEWPAD_API_KEY&lt;/code&gt; as encrypted CI secrets, not repo variables. Repo variables are readable by all contributors in GitHub Actions. That's not a theoretical risk — it's a common misconfiguration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Calibration period before blocking.&lt;/strong&gt; Run AI review as a non-blocking advisory check for two weeks. Export the AI findings to a spreadsheet. Categorize each as true positive, false positive, or noise. Tune your system prompt to eliminate the false positive patterns. Then promote to blocking. Skipping the calibration period is how you end up with a team that ignores AI review comments because they've been burned by too many false alarms. The 45–90 second CI overhead is worth it — but only if the signal is trusted.&lt;/p&gt;
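&lt;p&gt;The spreadsheet step reduces to a simple tally. As a rough sketch, assuming each finding has been labeled &lt;code&gt;true_positive&lt;/code&gt;, &lt;code&gt;false_positive&lt;/code&gt;, or &lt;code&gt;noise&lt;/code&gt;, the share of true positives can decide when to promote the check. The 0.8 threshold below is an illustrative assumption, not a recommendation from any tool.&lt;/p&gt;

```python
from collections import Counter


def calibration_summary(labels: list) -> dict:
    """Count each label and compute the share of findings that were true positives."""
    counts = Counter(labels)
    total = sum(counts.values())
    true_positives = counts.get("true_positive", 0)
    return {
        "total": total,
        "counts": dict(counts),
        "precision": true_positives / total if total else 0.0,
    }


def ready_to_block(labels: list, min_precision: float = 0.8) -> bool:
    """Promote to a blocking check only once the signal is mostly true positives."""
    return calibration_summary(labels)["precision"] >= min_precision


# Hypothetical labels exported after the advisory period
labels = ["true_positive"] * 8 + ["false_positive", "noise"]
summary = calibration_summary(labels)
print(summary["precision"], ready_to_block(labels))
```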

&lt;h2&gt;Related&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://kuryzhev.cloud/category/terraform/" rel="noopener noreferrer"&gt;More Terraform automation patterns and CI pipeline setups&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kuryzhev.cloud/category/ci-cd/" rel="noopener noreferrer"&gt;CI/CD pipeline configuration and deployment automation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kuryzhev.cloud/category/security/" rel="noopener noreferrer"&gt;Infrastructure security scanning and policy enforcement&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>terraform</category>
      <category>devops</category>
    </item>
    <item>
      <title>LLM Log Triage With Loki and CloudWatch Insights</title>
      <dc:creator>Oleksandr Kuryzhev</dc:creator>
      <pubDate>Thu, 11 Jun 2026 08:31:39 +0000</pubDate>
      <link>https://dev.to/oleksandr_kuryzhev_42873f/llm-log-triage-with-loki-and-cloudwatch-insights-57d3</link>
      <guid>https://dev.to/oleksandr_kuryzhev_42873f/llm-log-triage-with-loki-and-cloudwatch-insights-57d3</guid>
      <description>
</description>
      <category>devops</category>
    </item>
    <item>
      <title>Jenkins Pipeline: Build, Test, and Deploy to AWS EC2 with ECR</title>
      <dc:creator>Oleksandr Kuryzhev</dc:creator>
      <pubDate>Wed, 10 Jun 2026 07:02:50 +0000</pubDate>
      <link>https://dev.to/oleksandr_kuryzhev_42873f/jenkins-pipeline-build-test-and-deploy-to-aws-ec2-with-ecr-46dl</link>
      <guid>https://dev.to/oleksandr_kuryzhev_42873f/jenkins-pipeline-build-test-and-deploy-to-aws-ec2-with-ecr-46dl</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://kuryzhev.cloud/2026/06/10/jenkins-pipeline-build-test-and-deploy-to-aws-ec2-with-ecr" rel="noopener noreferrer"&gt;kuryzhev.cloud&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;If your Jenkins pipeline that deploys to AWS is still using &lt;code&gt;image:latest&lt;/code&gt; and a hardcoded ECR password stored as a plain Username/Password credential, you are one 12-hour token expiry away from a broken production deploy at exactly the wrong moment. I have seen this happen twice on teams I joined mid-incident. The fix is not complicated, but it requires wiring up credentials, image tagging, and the deploy stage in the right order, which most tutorials skip entirely.&lt;/p&gt;

&lt;p&gt;This post walks through a complete, production-hardened setup: GitHub push triggers Jenkins, Jenkins builds a versioned Docker image, runs unit tests inside the container, pushes the image to ECR, then deploys to an EC2 instance over SSH. No Ansible, no manual steps, no &lt;code&gt;latest&lt;/code&gt; tags in production.&lt;/p&gt;

&lt;h2&gt;The Scenario&lt;/h2&gt;

&lt;p&gt;We have a Node.js microservice living on GitHub. Every push to &lt;code&gt;main&lt;/code&gt; should automatically build a Docker image, run the test suite, push the image to Amazon ECR, and deploy it to a target EC2 instance — zero manual SSH required. The same pipeline handles staging and production, gated by a parameter so nobody accidentally ships to prod from a feature branch.&lt;/p&gt;

&lt;p&gt;The full pipeline shape looks like this: &lt;strong&gt;Source → Build → Test → Push to ECR → Deploy to EC2 via SSH&lt;/strong&gt;. When everything works correctly, you get a green Jenkins build, a live endpoint returning &lt;code&gt;HTTP 200&lt;/code&gt;, and a deployment audit trail in the Jenkins console that shows exactly which image tag is running in which environment. That audit trail matters more than most people realize — it is the first thing you reach for during a 2 AM incident.&lt;/p&gt;

&lt;p&gt;We are not doing anything exotic here. No ECS, no Kubernetes, no blue-green infrastructure. Just a real-world pattern that a lot of teams actually run, done properly. Once this is solid, migrating to ECS &lt;code&gt;update-service&lt;/code&gt; for zero-downtime rolling deploys is a natural next step — and the Jenkinsfile barely changes.&lt;/p&gt;

&lt;h2&gt;Prerequisites&lt;/h2&gt;

&lt;p&gt;Before writing a single line of pipeline code, make sure every dependency is in place at the right version. Environment drift mid-tutorial wastes hours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Jenkins side:&lt;/strong&gt; Jenkins 2.440 LTS running on a dedicated EC2 instance: t3.small minimum, t3.medium recommended if you are running Docker builds on the same host. Required plugins: Pipeline, Git, AWS Credentials, Amazon ECR, Docker Pipeline (563.vd5d2e5c4007f), SSH Agent (295.v9ca_a_1c7cc3a_a_), JUnit, and Workspace Cleanup (which provides &lt;code&gt;cleanWs()&lt;/code&gt;). Install all of them via &lt;em&gt;Manage Jenkins → Plugin Manager → Available&lt;/em&gt;. The &lt;code&gt;ecr:&lt;/code&gt; credential prefix used in the Jenkinsfile comes from the Amazon ECR plugin; without it, &lt;code&gt;docker.withRegistry()&lt;/code&gt; will not recognize the credential ID.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS side:&lt;/strong&gt; An ECR repository already created, and an IAM user (or preferably an EC2 instance profile on the Jenkins host; more on that below) with permissions scoped to ECR push operations and EC2 describe. Store the IAM access key in Jenkins as a credential of type &lt;em&gt;AWS Credentials&lt;/em&gt; with the ID &lt;code&gt;aws-credentials&lt;/code&gt;. The ECR URI format is &lt;code&gt;&amp;lt;account_id&amp;gt;.dkr.ecr.&amp;lt;region&amp;gt;.amazonaws.com/&amp;lt;repo-name&amp;gt;&lt;/code&gt;. The region must match the AWS region configured on the Jenkins agent, or the &lt;code&gt;ecr get-login-password&lt;/code&gt; call will fail with a generic 401 that gives no hint about the region mismatch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Local and target tooling:&lt;/strong&gt; AWS CLI v2.15+ (older v1 releases only provide the deprecated &lt;code&gt;aws ecr get-login&lt;/code&gt; command, which pipes differently and will break the login step), Docker 25+, Git 2.40+. The target EC2 runs Amazon Linux 2023 with the Docker daemon already running and &lt;code&gt;ec2-user&lt;/code&gt; in the &lt;code&gt;docker&lt;/code&gt; group. Also confirm the &lt;code&gt;jenkins&lt;/code&gt; OS user on the Jenkins host is in the &lt;code&gt;docker&lt;/code&gt; group. If not, you will hit &lt;code&gt;Cannot connect to the Docker daemon at unix:///var/run/docker.sock&lt;/code&gt; immediately. Fix: &lt;code&gt;sudo usermod -aG docker jenkins&lt;/code&gt;, then restart Jenkins.&lt;/p&gt;

&lt;p&gt;See the &lt;a href="https://www.jenkins.io/doc/book/pipeline/docker/" rel="noopener noreferrer"&gt;official Jenkins Pipeline with Docker documentation&lt;/a&gt; for plugin compatibility details.&lt;/p&gt;

&lt;h2&gt;Step 1 — Configure Jenkins Credentials and ECR Access&lt;/h2&gt;



&lt;p&gt;This is the step most tutorials either skip or get wrong. Wire up secrets before touching the Jenkinsfile.&lt;/p&gt;

&lt;p&gt;Navigate to &lt;em&gt;Manage Jenkins → Credentials → System → Global credentials&lt;/em&gt; and add three entries:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The AWS IAM key pair — type: &lt;em&gt;AWS Credentials&lt;/em&gt;, ID: &lt;code&gt;aws-credentials&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;The EC2 SSH private key — type: &lt;em&gt;SSH Username with private key&lt;/em&gt;, ID: &lt;code&gt;ec2-ssh-key&lt;/code&gt;, username: &lt;code&gt;ec2-user&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;A GitHub personal access token — type: &lt;em&gt;Username with password&lt;/em&gt;, ID: &lt;code&gt;github-token&lt;/code&gt; (used by the Multibranch Pipeline source)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now set up the ECR credential helper on the Jenkins agent. Install &lt;code&gt;amazon-ecr-credential-helper&lt;/code&gt; and add the following to &lt;code&gt;/root/.docker/config.json&lt;/code&gt; (or &lt;code&gt;/var/lib/jenkins/.docker/config.json&lt;/code&gt; depending on how Jenkins runs):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;{
  "credsStore": "ecr-login"
}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This is critical. Without the credential helper, you have two bad options: store the ECR password as a plain credential (it expires after 12 hours, causing mysterious 401 errors mid-pipeline) or call &lt;code&gt;aws ecr get-login-password&lt;/code&gt; manually in every stage. The &lt;code&gt;ecr:&lt;/code&gt; prefix in &lt;code&gt;docker.withRegistry()&lt;/code&gt; requests a fresh token on each build, and the helper does the same for any plain &lt;code&gt;docker&lt;/code&gt; command the agent runs outside that block.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Watch out for this:&lt;/strong&gt; The most common mistake I see is teams storing ECR credentials as &lt;em&gt;Username/Password&lt;/em&gt; type in Jenkins. It works fine for the first 12 hours, then starts failing overnight with no clear error message. Use the credential helper. It takes five minutes to set up and you never touch it again.&lt;/p&gt;

&lt;h2&gt;Step 2 — Write the Jenkinsfile&lt;/h2&gt;

&lt;p&gt;The Jenkinsfile lives at the repo root. Jenkins detects it automatically when the job type is &lt;em&gt;Multibranch Pipeline&lt;/em&gt; or &lt;em&gt;Pipeline from SCM&lt;/em&gt;. Here is the full declarative pipeline — I will walk through each block below.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Jenkinsfile — Declarative pipeline: build, test, deploy to AWS EC2 via ECR
// Assumes: Jenkins 2.440+, Docker Pipeline plugin, SSH Agent plugin, AWS Credentials plugin

pipeline {
    agent any  // replace with a labeled agent node in production: agent { label 'docker-agent' }

    environment {
        AWS_REGION      = 'us-east-1'
        ECR_ACCOUNT_ID  = '123456789012'
        ECR_REPO        = 'my-app'
        IMAGE_TAG       = "${BUILD_NUMBER}-${GIT_COMMIT[0..7]}"  // [0..7] is inclusive: 8 chars, e.g. 42-a3f9c12d
        ECR_URI         = "${ECR_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/${ECR_REPO}"
        FULL_IMAGE      = "${ECR_URI}:${IMAGE_TAG}"
    }

    parameters {
        choice(
            name: 'DEPLOY_ENV',
            choices: ['staging', 'production'],
            description: 'Target deployment environment'
        )
    }

    stages {

        stage('Checkout') {
            steps {
                // Checks out the triggering branch; Multibranch Pipeline sets GIT_COMMIT automatically
                checkout scm
            }
        }

        stage('Build Image') {
            steps {
                script {
                    // Build locally first; nothing reaches ECR until the tests pass
                    docker.build("${FULL_IMAGE}", "--no-cache .")
                }
            }
        }

        stage('Run Tests') {
            steps {
                // Run tests inside the freshly built image: same environment as production
                sh """
                    docker run --rm \\
                        -e NODE_ENV=test \\
                        --name test-runner \\
                        ${FULL_IMAGE} \\
                        npm test -- --reporters=jest-junit --outputFile=test-results/results.xml
                """
            }
            post {
                always {
                    // Publish JUnit results regardless of pass/fail so Jenkins tracks test trends
                    junit allowEmptyResults: true, testResults: 'test-results/**/*.xml'
                }
            }
        }

        stage('Push to ECR') {
            steps {
                script {
                    // ecr: prefix (Amazon ECR plugin) swaps the AWS credential for a short-lived token; no password stored in Jenkins
                    docker.withRegistry("https://${ECR_URI}", "ecr:${AWS_REGION}:aws-credentials") {
                        def appImage = docker.image("${FULL_IMAGE}")
                        appImage.push()
                        // Also push a human-readable env tag for traceability in ECR console
                        appImage.push("${DEPLOY_ENV}-latest")
                    }
                }
            }
        }

        stage('Deploy') {
            // Gate: only deploy to production from the main branch
            when {
                expression {
                    return params.DEPLOY_ENV == 'staging' || env.BRANCH_NAME == 'main'
                }
            }
            environment {
                // Resolve the correct EC2 host; PROD_HOST and STAGING_HOST must be defined elsewhere, e.g. as global Jenkins environment variables
                EC2_HOST = "${params.DEPLOY_ENV == 'production' ? env.PROD_HOST : env.STAGING_HOST}"
            }
            steps {
                // sshagent injects the private key for the duration of this block only
                sshagent(credentials: ['ec2-ssh-key']) {
                    sh """
                        ssh -o StrictHostKeyChecking=no ec2-user@${EC2_HOST} '
                            set -e
                            aws ecr get-login-password --region ${AWS_REGION} \\
                                | docker login --username AWS --password-stdin ${ECR_URI}
                            docker pull ${FULL_IMAGE}
                            # stop and rm may fail on a first deploy when no container exists yet
                            docker stop app 2&amp;gt;/dev/null || true
                            docker rm   app 2&amp;gt;/dev/null || true
                            docker run -d --name app --restart unless-stopped \\
                                -p 80:3000 \\
                                -e NODE_ENV=${DEPLOY_ENV} \\
                                ${FULL_IMAGE}
                        '
                    """
                }
            }
        }
    }

    post {
        failure {
            // Notify team on any failure — replace with slackSend or emailext as needed
            echo "Pipeline FAILED for ${FULL_IMAGE} targeting ${params.DEPLOY_ENV}"
        }
        always {
            // Prevent disk exhaustion on the Jenkins agent over repeated builds
            cleanWs()
        }
    }
}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;A few things worth calling out explicitly. The &lt;code&gt;IMAGE_TAG&lt;/code&gt; uses &lt;code&gt;${BUILD_NUMBER}-${GIT_COMMIT[0..7]}&lt;/code&gt;, something like &lt;code&gt;42-a3f9c12d&lt;/code&gt; (the Groovy range is inclusive, so the slice keeps eight characters). This is intentional. Using &lt;code&gt;latest&lt;/code&gt; in the deploy command is the single most common mistake that causes "works on my machine" deploys. Docker reuses the locally cached &lt;code&gt;latest&lt;/code&gt; instead of the newly pushed image, and you spend 45 minutes wondering why your code change is not live.&lt;/p&gt;
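&lt;p&gt;To make the tag format concrete, here is a small Python sketch of the same construction. It is an illustration only: the function names and the commit hash are hypothetical, and Python's &lt;code&gt;[:8]&lt;/code&gt; slice is used as the equivalent of Groovy's inclusive &lt;code&gt;[0..7]&lt;/code&gt; range.&lt;/p&gt;

```python
def image_tag(build_number: int, git_commit: str, sha_len: int = 8) -> str:
    """Build number plus the first sha_len characters of the commit hash."""
    return f"{build_number}-{git_commit[:sha_len]}"


def full_image(account_id: str, region: str, repo: str, tag: str) -> str:
    """Assemble the ECR image reference in the account/region/repo format."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repo}:{tag}"


# Hypothetical 40-character commit hash
commit = "a3f9c12d" + "0123456789abcdef0123456789abcdef"
tag = image_tag(42, commit)
image = full_image("123456789012", "us-east-1", "my-app", tag)
print(tag)
print(image)
```

&lt;p&gt;Because the build number is part of the tag, rebuilding the same commit still produces a distinct, traceable image.&lt;/p&gt;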

&lt;p&gt;The &lt;code&gt;docker.withRegistry()&lt;/code&gt; call requires the full &lt;code&gt;https://&lt;/code&gt; prefix. Omitting it causes &lt;code&gt;unauthorized: authentication required&lt;/code&gt; with no further detail — one of the more frustrating silent failures in the Docker Pipeline plugin.&lt;/p&gt;

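&lt;p&gt;The &lt;code&gt;cleanWs()&lt;/code&gt; post step is not optional on small instances. On a t3.small with an 8 GB root volume, leftover workspaces and Docker layer caches from repeated builds fill the disk in roughly 20 builds. &lt;code&gt;cleanWs()&lt;/code&gt; only clears the workspace, so pair it with a periodic &lt;code&gt;docker image prune&lt;/code&gt; on the agent to reclaim layer storage. I stopped using shared Jenkins masters for Docker builds entirely after we killed a production deploy because the agent ran out of disk space at the image push step.&lt;/p&gt;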

&lt;h2&gt;Step 3 — Scope the IAM Policy&lt;/h2&gt;

&lt;p&gt;The pipeline needs AWS permissions to push to ECR. Here is the minimal IAM inline policy — scope it to the single repository, not the entire account.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ECRAuthToken",
            "Effect": "Allow",
            "Action": "ecr:GetAuthorizationToken",
            "Resource": "*"
        },
        {
            "Sid": "ECRPushToSpecificRepo",
            "Effect": "Allow",
            "Action": [
                "ecr:BatchCheckLayerAvailability",
                "ecr:CompleteLayerUpload",
                "ecr:InitiateLayerUpload",
                "ecr:PutImage",
                "ecr:UploadLayerPart",
                "ecr:DescribeImages",
                "ecr:ListImages"
            ],
            "Resource": "arn:aws:ecr:us-east-1:123456789012:repository/my-app"
        },
        {
            "Sid": "EC2DescribeForHealthChecks",
            "Effect": "Allow",
            "Action": [
                "ec2:DescribeInstances",
                "ec2:DescribeInstanceStatus"
            ],
            "Resource": "*"
        }
    ]
}&lt;/code&gt;&lt;/pre&gt;

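&lt;p&gt;Note that &lt;code&gt;ecr:GetAuthorizationToken&lt;/code&gt; cannot be scoped to a specific resource. That is an AWS API constraint, not a mistake in the policy. The EC2 describe actions use &lt;code&gt;*&lt;/code&gt; for the same reason: they do not support resource-level permissions. The ECR push actions are locked to the single repository.&lt;/p&gt;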
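&lt;p&gt;A quick way to keep that scoping honest over time is to lint the policy document for wildcard resources. The sketch below is a hypothetical helper, not part of any AWS tooling: it lists every allowed action attached to &lt;code&gt;"Resource": "*"&lt;/code&gt; and flags any that are not on an explicit allow-list. The policy shown is an abbreviated copy of the one above.&lt;/p&gt;

```python
def as_list(value):
    """IAM allows a single string or a list for Action and Resource."""
    return value if isinstance(value, list) else [value]


def wildcard_actions(policy: dict) -> list:
    """Return all allowed actions that are granted on Resource '*'."""
    found = []
    for stmt in as_list(policy.get("Statement", [])):
        if stmt.get("Effect") != "Allow":
            continue
        if "*" in as_list(stmt.get("Resource", [])):
            found.extend(as_list(stmt.get("Action", [])))
    return sorted(found)


# Actions that legitimately need a wildcard resource in this pipeline
EXPECTED_WILDCARDS = {
    "ecr:GetAuthorizationToken",
    "ec2:DescribeInstances",
    "ec2:DescribeInstanceStatus",
}

# Abbreviated copy of the policy from this step
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "ecr:GetAuthorizationToken", "Resource": "*"},
        {"Effect": "Allow",
         "Action": ["ecr:PutImage", "ecr:UploadLayerPart"],
         "Resource": "arn:aws:ecr:us-east-1:123456789012:repository/my-app"},
        {"Effect": "Allow",
         "Action": ["ec2:DescribeInstances", "ec2:DescribeInstanceStatus"],
         "Resource": "*"},
    ],
}

unexpected = [a for a in wildcard_actions(policy) if a not in EXPECTED_WILDCARDS]
print("unexpected wildcard actions:", unexpected)
```

&lt;p&gt;Running a check like this in CI catches the common drift where someone widens the push statement to &lt;code&gt;*&lt;/code&gt; to get past a permissions error.&lt;/p&gt;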

&lt;p&gt;&lt;strong&gt;Security consideration worth repeating:&lt;/strong&gt; The IAM user approach works, but it means long-lived access keys sitting in Jenkins credentials. The better path is to attach this policy to an EC2 instance profile on the Jenkins host. The Jenkins AWS Credentials plugin picks up the instance metadata automatically — no key ID, no secret, no rotation burden. I made this switch on every team I have worked with after the second time a leaked key caused an incident. See the &lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use_switch-role-ec2.html" rel="noopener noreferrer"&gt;AWS IAM roles for EC2 documentation&lt;/a&gt; for setup details.&lt;/p&gt;

&lt;p&gt;Also: &lt;code&gt;StrictHostKeyChecking=no&lt;/code&gt; in the SSH command is acceptable for known internal hosts, but if your security posture requires host key verification, replace it with a known_hosts file pre-populated with the EC2 host fingerprint. On a Multibranch Pipeline, store that file as a Jenkins file credential and copy it into &lt;code&gt;~/.ssh/known_hosts&lt;/code&gt; before the SSH step.&lt;/p&gt;

&lt;h2&gt;Verify and Test&lt;/h2&gt;

&lt;p&gt;A green Jenkins build is not proof the pipeline works. Here is how we actually verify the full end-to-end flow.&lt;/p&gt;

&lt;p&gt;Trigger a build manually first via &lt;em&gt;Build with Parameters&lt;/em&gt;, choosing &lt;code&gt;staging&lt;/code&gt;. Watch the Console Output in real time. You are looking for two specific lines: &lt;code&gt;Login Succeeded&lt;/code&gt; in the stage that pushes to ECR, and the new container ID printed by &lt;code&gt;docker run&lt;/code&gt; in the Deploy stage. If the login line is missing, the ECR credentials are not configured correctly on the agent.&lt;/p&gt;

&lt;p&gt;Once the build is green, SSH into the target EC2 and run this to confirm the right image is actually running:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Confirm the new versioned image tag is running, not a stale container
docker ps --format "table {{.Image}}\t{{.Status}}\t{{.Ports}}"&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;You should see your versioned tag, something like &lt;code&gt;123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:42-a3f9c12d&lt;/code&gt;, with status &lt;code&gt;Up X seconds&lt;/code&gt; and port &lt;code&gt;0.0.0.0:80-&amp;gt;3000/tcp&lt;/code&gt;. If you see &lt;code&gt;my-app:latest&lt;/code&gt;, the deploy command is using the wrong image reference.&lt;/p&gt;
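&lt;p&gt;If you want this check to run unattended, the image column from &lt;code&gt;docker ps&lt;/code&gt; can be validated with a few lines of Python. This is a sketch under assumptions: the helper names are hypothetical, and the pattern expects the build-number-plus-short-SHA tag format used in this post.&lt;/p&gt;

```python
import re

# Build number, a dash, then 7 to 40 hex characters of the commit hash
VERSIONED_TAG = re.compile(r"^\d+-[0-9a-f]{7,40}$")


def running_tag(image_ref: str) -> str:
    """Extract the tag from an image reference; Docker implies 'latest' if absent."""
    name = image_ref.rsplit("/", 1)[-1]
    return name.split(":", 1)[1] if ":" in name else "latest"


def is_versioned(image_ref: str) -> bool:
    """True when the running image carries a build-specific tag."""
    return bool(VERSIONED_TAG.match(running_tag(image_ref)))


good = "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:42-a3f9c12d"
bad = "my-app:latest"
print(is_versioned(good), is_versioned(bad))
```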

&lt;p&gt;Hit the health endpoint directly:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;curl -I http://&amp;lt;ec2-public-ip&amp;gt;/health&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Expect &lt;code&gt;HTTP/1.1 200 OK&lt;/code&gt;. Then do the real gate test: intentionally break a unit test, push to &lt;code&gt;main&lt;/code&gt;, and confirm the pipeline fails at the &lt;code&gt;Run Tests&lt;/code&gt; stage without reaching Deploy. This is the proof the quality gate actually works — not just that the happy path is green. A lot of teams skip this step and find out the hard way that their test stage was misconfigured to always exit 0.&lt;/p&gt;

&lt;p&gt;Check the Jenkins workspace path at &lt;code&gt;/var/lib/jenkins/workspace/&amp;lt;job-name&amp;gt;&lt;/code&gt; after a few builds to confirm &lt;code&gt;cleanWs()&lt;/code&gt; is doing its job. The directory should be empty or absent between runs.&lt;/p&gt;

&lt;p&gt;What we have built here is a fully automated Jenkins pipeline that deploys to AWS: GitHub push triggers a versioned Docker build, tests run inside the same image that ships to production, the image lands in ECR with a traceable tag, and the EC2 deploy happens over SSH with credentials scoped correctly at every stage. The failure hook is ready to be wired to Slack or email, and the workspace cleans itself up. The natural next steps from here are migrating the deploy stage from direct SSH to ECS &lt;code&gt;update-service&lt;/code&gt; for zero-downtime rolling deploys, adding a SonarQube stage for static analysis between Build and Test, and replacing the IAM user entirely with an EC2 instance profile on the Jenkins host to eliminate long-lived key management. Each of those is a one-stage change to the Jenkinsfile, which is exactly the point of building the foundation right the first time. More CI/CD patterns we use in production are documented at &lt;a href="https://kuryzhev.cloud/" rel="noopener noreferrer"&gt;kuryzhev.cloud&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;Related&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://kuryzhev.cloud/category/ci-cd/" rel="noopener noreferrer"&gt;More CI/CD pipeline patterns and deployment automation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kuryzhev.cloud/category/jenkins/" rel="noopener noreferrer"&gt;Jenkins configuration, shared libraries, and OIDC auth&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kuryzhev.cloud/category/docker/" rel="noopener noreferrer"&gt;Docker build optimization and container runtime hardening&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>cicd</category>
      <category>devops</category>
    </item>
    <item>
      <title>How to Build a Jenkins Pipeline That Deploys to AWS ECS</title>
      <dc:creator>Oleksandr Kuryzhev</dc:creator>
      <pubDate>Tue, 09 Jun 2026 09:49:24 +0000</pubDate>
      <link>https://dev.to/oleksandr_kuryzhev_42873f/how-to-build-a-jenkins-pipeline-that-deploys-to-aws-ecs-2geo</link>
      <guid>https://dev.to/oleksandr_kuryzhev_42873f/how-to-build-a-jenkins-pipeline-that-deploys-to-aws-ecs-2geo</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://kuryzhev.cloud/2026/06/09/how-to-build-a-jenkins-pipeline-that-deploys-to-aws-ecs" rel="noopener noreferrer"&gt;kuryzhev.cloud&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Most Jenkins pipelines that deploy to AWS ECS work fine in a demo and silently leak credentials, orphan ECR images, and block build agents for hours the moment they hit real production load. I've inherited three of these pipelines in the last two years. The problems are always the same. The fixes are not complicated — but you have to understand why the mistakes happen in the first place.&lt;/p&gt;

&lt;p&gt;This is a deep-dive into how Jenkins pipelines actually work internally, where teams consistently get the architecture wrong, and what a production-grade Jenkinsfile looks like when you need reliable build, test, and AWS deploy stages without the hidden costs.&lt;/p&gt;

&lt;h2&gt;What a Jenkins Pipeline Actually Does End-to-End&lt;/h2&gt;



&lt;p&gt;A Jenkinsfile is not a shell script with YAML on top. It is a declarative DSL that Jenkins compiles into a &lt;strong&gt;pipeline execution graph&lt;/strong&gt; — closer to a DAG than a linear sequence of commands. Understanding this distinction matters because it changes how you reason about failures, retries, and resource allocation.&lt;/p&gt;

&lt;p&gt;The three core primitives — &lt;code&gt;agent&lt;/code&gt;, &lt;code&gt;stage&lt;/code&gt;, and &lt;code&gt;steps&lt;/code&gt; — have completely separate lifecycles. An &lt;code&gt;agent&lt;/code&gt; directive provisions a workspace and an executor. A &lt;code&gt;stage&lt;/code&gt; is a logical grouping that appears in the UI and can carry its own agent declaration. &lt;code&gt;steps&lt;/code&gt; are the actual commands that run inside that agent context. When you declare &lt;code&gt;agent none&lt;/code&gt; at the top level and then specify an agent per stage, Jenkins allocates and releases executors independently for each stage. This is intentional isolation — and most teams skip it entirely.&lt;/p&gt;

&lt;p&gt;Environment variables declared in the global &lt;code&gt;environment&lt;/code&gt; block propagate into every stage automatically. Variables declared inside a stage are scoped to that stage only. The SCM checkout step that Jenkins runs automatically on the default agent is not the same as running &lt;code&gt;git clone&lt;/code&gt; manually — it sets &lt;code&gt;GIT_COMMIT&lt;/code&gt;, &lt;code&gt;GIT_BRANCH&lt;/code&gt;, and other built-in variables that you can reference downstream.&lt;/p&gt;

&lt;p&gt;One thing that trips people up constantly: mixing declarative and scripted pipeline syntax. You can embed a &lt;code&gt;script {}&lt;/code&gt; block inside declarative stages for Groovy logic, but you cannot use declarative directives like &lt;code&gt;when&lt;/code&gt; or &lt;code&gt;post&lt;/code&gt; inside a fully scripted pipeline. Mixing them in the wrong direction causes silent failures where Jenkins simply skips the directive without throwing an error. Jenkins LTS 2.440.3 is the current stable release — pipelines written for 2.3xx may fail on &lt;code&gt;agent&lt;/code&gt; directive syntax changes, so check your version before debugging mysterious parse errors.&lt;/p&gt;

&lt;h2&gt;How Teams Use Jenkins Pipelines Wrong&lt;/h2&gt;

&lt;p&gt;I've seen the same three mistakes across every team that hasn't gone through a pipeline audit. Each one is invisible until it causes an incident.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mistake 1: Hardcoding AWS credentials as environment variables.&lt;/strong&gt; This pattern appears in a surprising number of production Jenkinsfiles:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;environment {
    AWS_ACCESS_KEY_ID = credentials('aws-key')  // WRONG — exposes key ID in UI
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Even when using the &lt;code&gt;credentials()&lt;/code&gt; helper, setting credentials directly in the &lt;code&gt;environment&lt;/code&gt; block exposes the key ID in the Jenkins "Environment Variables" panel, which is visible to every user with read access to the job. The correct approach uses &lt;code&gt;withCredentials&lt;/code&gt; with &lt;code&gt;AmazonWebServicesCredentialsBinding&lt;/code&gt; — scoped to the exact steps that need AWS access. Watch out for this: the Credentials Binding Plugin version 657.v2b_19db_7d6d6d or later is required for the &lt;code&gt;AmazonWebServicesCredentialsBinding&lt;/code&gt; class. Earlier versions fail silently and fall back to injecting the raw string.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mistake 2: Running all stages on the controller.&lt;/strong&gt; When every stage uses &lt;code&gt;agent any&lt;/code&gt; without specifying a label or Docker image, Jenkins can schedule the build on the controller's built-in node (formerly called the master) whenever it has executors configured. On a team running 10 microservices, this turns a 4-minute build into a 22-minute queue. The controller is not an execution node; it is an orchestration node. Running builds on it also means a runaway build can starve the Jenkins UI of threads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mistake 3: Skipping the &lt;code&gt;post&lt;/code&gt; block entirely.&lt;/strong&gt; Without a &lt;code&gt;post { always { cleanWs() } }&lt;/code&gt; block, workspaces accumulate on persistent agents. I've seen a 200GB disk fill up over a weekend from a pipeline that ran every 15 minutes and never cleaned up. Beyond disk exhaustion, skipping the post block means failed deployments leave ECR images untagged and S3 artifacts orphaned — storage costs compound at $0.10/GB/month with no upper bound. Jenkins does not clean workspaces automatically between builds. This is not a default you can rely on.&lt;/p&gt;

&lt;p&gt;There's also a concurrency issue worth calling out separately: not setting &lt;code&gt;disableConcurrentBuilds()&lt;/code&gt; in the &lt;code&gt;options&lt;/code&gt; block. Concurrent deploys to the same ECS service cause task definition version conflicts and rollback failures that are genuinely hard to debug after the fact.&lt;/p&gt;

&lt;h2&gt;The Correct Approach: Jenkinsfile for Build, Test, and AWS Deploy&lt;/h2&gt;

&lt;p&gt;Here is the full production-grade Jenkinsfile we use for a Java/Maven service deploying to AWS ECS. Every design decision is intentional — I'll explain the non-obvious ones inline.&lt;/p&gt;

&lt;p&gt;This pipeline uses per-stage Docker agents to isolate environments, stash/unstash to pass artifacts between stages, and &lt;code&gt;withCredentials&lt;/code&gt; scoped tightly around AWS CLI calls. The &lt;code&gt;IMAGE_TAG&lt;/code&gt; uses the short Git SHA for traceability — you can look at any running ECS task and trace it back to an exact commit.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Jenkinsfile — declarative pipeline for build, test, and deploy to AWS ECS
// Requires: Docker Pipeline plugin, AWS Credentials Binding plugin, JUnit plugin
// Jenkins LTS 2.440.3+, AWS CLI v2 pre-installed on agent image

pipeline {
    agent none  // No global agent — each stage declares its own to isolate environments

    options {
        disableConcurrentBuilds()           // Prevent parallel deploys to same ECS service
        timeout(time: 30, unit: 'MINUTES')  // Kill runaway builds; protects agent pool
        buildDiscarder(logRotator(numToKeepStr: '20'))
    }

    environment {
        AWS_REGION      = 'us-east-1'
        ECR_REGISTRY    = '123456789012.dkr.ecr.us-east-1.amazonaws.com'
        ECR_REPO        = 'my-app'
        IMAGE_TAG       = "${env.GIT_COMMIT[0..7]}"  // Short SHA for traceability
        ECS_CLUSTER     = 'prod-cluster'
        ECS_SERVICE     = 'my-app-service'
    }

    stages {
        stage('Build') {
            agent {
                docker {
                    image 'maven:3.9.6-eclipse-temurin-21'
                    args  '-v $HOME/.m2:/root/.m2'  // Cache local Maven repo across builds
                }
            }
            steps {
                sh 'mvn clean package -DskipTests -q'  // Tests run in dedicated stage
                stash name: 'build-artifact', includes: 'target/*.jar'
            }
        }

        stage('Test') {
            agent {
                docker {
                    image 'maven:3.9.6-eclipse-temurin-21'
                    args  '-v $HOME/.m2:/root/.m2'
                }
            }
            steps {
                unstash 'build-artifact'
                sh 'mvn test -q'
            }
            post {
                always {
                    // Publish results even on test failure so failures are visible in UI
                    junit 'target/surefire-reports/**/*.xml'
                }
            }
        }

        stage('Docker Build &amp;amp; Push') {
            agent { label 'docker-agent' }  // Agent with Docker daemon access
            steps {
                unstash 'build-artifact'
                withCredentials([[
                    $class:             'AmazonWebServicesCredentialsBinding',
                    credentialsId:      'aws-ecr-credentials',  // Stored in Jenkins Credentials
                    accessKeyVariable:  'AWS_ACCESS_KEY_ID',
                    secretKeyVariable:  'AWS_SECRET_ACCESS_KEY'
                ]]) {
                    sh """
                        # Authenticate to ECR — token valid 12 hours
                        aws ecr get-login-password --region ${AWS_REGION} \
                          | docker login --username AWS --password-stdin ${ECR_REGISTRY}

                        # Build with BuildKit; inline cache metadata lets --cache-from reuse layers
                        DOCKER_BUILDKIT=1 docker build \
                          --build-arg BUILDKIT_INLINE_CACHE=1 \
                          --cache-from ${ECR_REGISTRY}/${ECR_REPO}:latest \
                          -t ${ECR_REGISTRY}/${ECR_REPO}:${IMAGE_TAG} \
                          -t ${ECR_REGISTRY}/${ECR_REPO}:latest .

                        docker push ${ECR_REGISTRY}/${ECR_REPO}:${IMAGE_TAG}
                        docker push ${ECR_REGISTRY}/${ECR_REPO}:latest
                    """
                }
            }
        }

        stage('Deploy to ECS') {
            agent { label 'docker-agent' }
            when {
                branch 'main'  // Only deploy from main branch; feature branches stop here
            }
            steps {
                // Manual approval gate — times out after 15 min to release the executor
                timeout(time: 15, unit: 'MINUTES') {
                    input message: "Deploy ${IMAGE_TAG} to production ECS?", ok: 'Deploy'
                }
                withCredentials([[
                    $class:             'AmazonWebServicesCredentialsBinding',
                    credentialsId:      'aws-ecr-credentials',
                    accessKeyVariable:  'AWS_ACCESS_KEY_ID',
                    secretKeyVariable:  'AWS_SECRET_ACCESS_KEY'
                ]]) {
                    sh """
                        aws ecs update-service \
                          --region ${AWS_REGION} \
                          --cluster ${ECS_CLUSTER} \
                          --service ${ECS_SERVICE} \
                          --force-new-deployment
                    """
                }
            }
        }
    }

    post {
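        always {
            // With agent none there is no workspace at this level, so allocate a node for cleanup
            node('docker-agent') {
                cleanWs()  // Mandatory: prevents disk exhaustion on persistent agents
            }
        }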
        failure {
            echo "Pipeline failed on branch ${env.BRANCH_NAME} at commit ${IMAGE_TAG}"
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;One gotcha worth calling out: the JUnit glob pattern matters. Using &lt;code&gt;target/surefire-reports/**/*.xml&lt;/code&gt; scopes the search correctly. If you use &lt;code&gt;**/surefire-reports/**/*.xml&lt;/code&gt; without the &lt;code&gt;target/&lt;/code&gt; prefix, Jenkins scans the entire workspace and slows test result parsing by 3–5x on large repos. I made this mistake on a monorepo with 40 modules and spent an afternoon wondering why the post-build step was taking longer than the tests themselves.&lt;/p&gt;

&lt;p&gt;Also watch out for the AWS CLI path issue. On Amazon Linux 2, AWS CLI v2 installs to &lt;code&gt;/usr/local/bin/aws&lt;/code&gt;. The v1 path is &lt;code&gt;/usr/bin/aws&lt;/code&gt;. If your Docker agent image has both installed, the wrong one can silently take precedence depending on PATH ordering — and the ECR login command syntax differs between versions in ways that produce confusing authentication errors rather than a clear "wrong version" message.&lt;/p&gt;

&lt;h2&gt;Advanced Patterns: Shared Libraries, Approvals, and Multi-Environment Promotion&lt;/h2&gt;

&lt;p&gt;Once you have one pipeline working correctly, the next problem is copy-paste drift across 20 microservices. Every team I've worked with hits this around service number five or six. The answer is Jenkins Shared Libraries.&lt;/p&gt;

&lt;p&gt;A Shared Library lives in a separate Git repository with a specific directory structure: &lt;code&gt;vars/&lt;/code&gt; for global step functions, &lt;code&gt;src/&lt;/code&gt; for Groovy classes, and &lt;code&gt;resources/&lt;/code&gt; for static files. Deviating from this structure causes &lt;code&gt;No such DSL method&lt;/code&gt; errors that are not immediately obvious. The library is loaded at the top of any Jenkinsfile with &lt;code&gt;@Library('jenkins-shared-libs@main') _&lt;/code&gt; — the trailing underscore is required and easy to forget.&lt;/p&gt;

&lt;p&gt;Here is a real shared library helper we use for ECS deployments. The key addition over the inline approach is the &lt;code&gt;aws ecs wait services-stable&lt;/code&gt; call, which blocks until the ECS service reaches steady state and fails the pipeline if it does not get there. Without this, a deployment that pushes a broken image appears successful in Jenkins: ECS quietly fails the new tasks (and rolls back only if the deployment circuit breaker is enabled), and you find out from an alert 10 minutes later.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// vars/ecsDeployHelper.groovy — Shared Library example
// Stored in: https://github.com/org/jenkins-shared-libs (loaded via @Library annotation)
// Load in Jenkinsfile with: @Library('jenkins-shared-libs@main') _

/**
 * Deploys a Docker image to an ECS service and waits for stability.
 * Usage: ecsDeployHelper.deploy(cluster: 'prod-cluster', service: 'my-app', region: 'us-east-1')
 */
def deploy(Map config) {
    // Validate required keys — fail fast with a clear message rather than cryptic AWS error
    ['cluster', 'service', 'region'].each { key -&amp;gt;
        if (!config[key]) error("ecsDeployHelper.deploy: missing required param '${key}'")
    }

    echo "Deploying to ECS cluster=${config.cluster} service=${config.service}"

    sh """
        aws ecs update-service \
          --region ${config.region} \
          --cluster ${config.cluster} \
          --service ${config.service} \
          --force-new-deployment

        # Wait up to 10 minutes for service to reach steady state
        # Fails pipeline if deployment does not stabilize — catches bad images early
        aws ecs wait services-stable \
          --region ${config.region} \
          --cluster ${config.cluster} \
          --services ${config.service}
    """

    echo "ECS service ${config.service} reached stable state successfully"
}

return this
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;For multi-environment promotion, we use a &lt;code&gt;choice&lt;/code&gt; parameter — &lt;code&gt;params.ENVIRONMENT&lt;/code&gt; set to dev/staging/prod — combined with environment-specific credential blocks. The pattern keeps a single Jenkinsfile per service while allowing environment-specific IAM roles and regions. The IAM role attached to the Jenkins EC2 instance should use &lt;code&gt;sts:AssumeRole&lt;/code&gt; to access cross-account resources — never store long-lived access keys on the instance itself. This is not just a best practice; it is the only approach that survives a credential rotation without pipeline downtime.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;input&lt;/code&gt; step timeout deserves its own warning. If you write &lt;code&gt;input message: 'Deploy to production?'&lt;/code&gt; without wrapping it in &lt;code&gt;timeout(time: 15, unit: 'MINUTES')&lt;/code&gt;, the executor thread hangs indefinitely waiting for a human. That executor is blocked. On a small Jenkins instance with two executors, one forgotten approval gate can stall your entire CI system. I stopped using bare &lt;code&gt;input&lt;/code&gt; steps entirely after this happened during an on-call weekend.&lt;/p&gt;

&lt;h2&gt;Performance Notes: Build Time, Agent Cost, and ECR Storage&lt;/h2&gt;

&lt;p&gt;The performance decisions in a Jenkins pipeline have real dollar amounts attached to them. These are not theoretical optimizations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Docker layer caching&lt;/strong&gt; is the single highest-impact change you can make to build time. Without the &lt;code&gt;--cache-from&lt;/code&gt; flag pulling a warm cache image from ECR, every build on a fresh agent rebuilds all layers from scratch. With it, and with a well-structured Dockerfile that copies dependency files before source code, average build time drops from roughly 8 minutes to around 90 seconds for a typical Java service. The requirement is that you push the &lt;code&gt;:latest&lt;/code&gt; tag after every successful build, which the pipeline above does, so the cache is always warm for the next run. Under BuildKit the cached image must also carry inline cache metadata (build it with &lt;code&gt;--build-arg BUILDKIT_INLINE_CACHE=1&lt;/code&gt;), otherwise &lt;code&gt;--cache-from&lt;/code&gt; has nothing to import. Also enable BuildKit: &lt;code&gt;DOCKER_BUILDKIT=1&lt;/code&gt; lets independent stages of a multi-stage Dockerfile build in parallel and skips stages the target does not need. Without it, stages execute sequentially, which can add 2–4 minutes to image build time depending on your Dockerfile structure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Spot instance agents&lt;/strong&gt; via the EC2 Fleet Plugin cut compute cost by 60–70% for non-production builds. The configuration that matters most is &lt;code&gt;idleTerminationMinutes: 5&lt;/code&gt;. Without it, idle agents accumulate overnight and you discover a surprisingly large EC2 bill the next morning. Five minutes is aggressive but safe for most build patterns — agents provision in under 60 seconds on modern AMIs.&lt;/p&gt;

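&lt;p&gt;&lt;strong&gt;ECR lifecycle policies&lt;/strong&gt; are not optional if you run pipelines frequently. Every push creates a new image. Without an explicit lifecycle policy, ECR retains every image indefinitely at $0.10/GB/month. For a service with a 500MB image running 20 deploys per day, that compounds quickly. Set a policy that retains the last 30 tagged images and expires all untagged images after 1 day. The example below selects tagged images by the prefix &lt;code&gt;v&lt;/code&gt;; images tagged with a bare Git SHA, as in the pipeline above, will not match that rule, so adjust &lt;code&gt;tagPrefixList&lt;/code&gt; to your own tagging scheme. The AWS CLI command is straightforward:&lt;/p&gt;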

&lt;pre&gt;&lt;code&gt;aws ecr put-lifecycle-policy \
  --repository-name my-app \
  --lifecycle-policy-text '{
    "rules": [
      {
        "rulePriority": 1,
        "description": "Expire untagged images after 1 day",
        "selection": {
          "tagStatus": "untagged",
          "countType": "sinceImagePushed",
          "countUnit": "days",
          "countNumber": 1
        },
        "action": { "type": "expire" }
      },
      {
        "rulePriority": 2,
        "description": "Keep last 30 tagged images",
        "selection": {
          "tagStatus": "tagged",
          "tagPrefixList": ["v"],
          "countType": "imageCountMoreThan",
          "countNumber": 30
        },
        "action": { "type": "expire" }
      }
    ]
  }'
&lt;/code&gt;&lt;/pre&gt;
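
&lt;p&gt;To put rough numbers on the storage claim, here is a minimal sketch in Python using the figures from the paragraph above (500 MB image, 20 pushes per day, $0.10/GB/month). The function names are illustrative, and the model assumes every image adds its full size with no shared layers, so treat the result as an upper bound rather than a bill forecast.&lt;/p&gt;

```python
# Rough upper-bound estimate of ECR storage cost with and without a lifecycle policy.
# Assumption: every image adds its full size (no shared layers), so real bills may be lower.

PRICE_PER_GB_MONTH = 0.10   # USD, ECR storage price quoted in the text
IMAGE_SIZE_GB = 0.5         # 500 MB image
PUSHES_PER_DAY = 20


def stored_gb(days, max_images=None):
    """Return GB stored after `days` of pushes, optionally capped at `max_images`."""
    images = days * PUSHES_PER_DAY
    if max_images is not None:
        images = min(images, max_images)
    return images * IMAGE_SIZE_GB


def monthly_cost(days, max_images=None):
    """Monthly storage cost in USD at the storage level reached after `days`."""
    return round(stored_gb(days, max_images) * PRICE_PER_GB_MONTH, 2)


# Columns: days elapsed, cost with no policy, cost with a 30-image cap
for days in (30, 90, 365):
    print(days, monthly_cost(days), monthly_cost(days, max_images=30))
```

&lt;p&gt;Under these assumptions, 30 days of unpruned pushes sit at about $30/month of storage and keep climbing, while a 30-image cap holds the repository at about $1.50/month.&lt;/p&gt;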

&lt;p&gt;One last performance note: &lt;code&gt;stash&lt;/code&gt;/&lt;code&gt;unstash&lt;/code&gt; is designed for small files. There is no hard size cap, but the Jenkins documentation recommends considering an alternative once stashes reach the 5–100 MB range, because stashed files are archived and transferred through the controller. Larger artifacts should go through S3 using the S3 plugin or explicit AWS CLI copy commands. Oversized stashes fail in ways that do not point at the cause: &lt;code&gt;hudson.remoting.ChannelClosedException&lt;/code&gt; is the error you'll see if the agent JVM also runs out of heap during a large stash operation. Set &lt;code&gt;-Xmx512m&lt;/code&gt; in agent JVM arguments via the EC2 plugin configuration if you see this.&lt;/p&gt;

&lt;p&gt;For more on deploying to AWS with Jenkins pipelines and related CI/CD patterns, see the &lt;a href="https://kuryzhev.cloud/" rel="noopener noreferrer"&gt;kuryzhev.cloud DevOps_DayS&lt;/a&gt; archive. The official &lt;a href="https://www.jenkins.io/doc/book/pipeline/syntax/" rel="noopener noreferrer"&gt;Jenkins Pipeline Syntax documentation&lt;/a&gt; and &lt;a href="https://docs.aws.amazon.com/AmazonECS/latest/developerguide/update-service-console-v2.html" rel="noopener noreferrer"&gt;AWS ECS service update reference&lt;/a&gt; are the two external sources worth bookmarking alongside this.&lt;/p&gt;

&lt;h2&gt;Related&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://kuryzhev.cloud/category/jenkins/" rel="noopener noreferrer"&gt;More Jenkins pipeline patterns, shared libraries, and OIDC authentication&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kuryzhev.cloud/category/ci-cd/" rel="noopener noreferrer"&gt;CI/CD pipeline design, deployment strategies, and release automation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kuryzhev.cloud/category/aws/" rel="noopener noreferrer"&gt;AWS automation, ECS, ECR, and IAM role patterns for deployments&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>cicd</category>
      <category>devops</category>
    </item>
    <item>
      <title>AWS Lambda S3 Trigger Python: 3 Production Mistakes We Fixed</title>
      <dc:creator>Oleksandr Kuryzhev</dc:creator>
      <pubDate>Sat, 06 Jun 2026 10:24:03 +0000</pubDate>
      <link>https://dev.to/oleksandr_kuryzhev_42873f/aws-lambda-s3-trigger-python-3-production-mistakes-we-fixed-3b9m</link>
      <guid>https://dev.to/oleksandr_kuryzhev_42873f/aws-lambda-s3-trigger-python-3-production-mistakes-we-fixed-3b9m</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://kuryzhev.cloud/" rel="noopener noreferrer"&gt;kuryzhev.cloud&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;We lost 6 hours of vendor CSV data because our AWS Lambda S3 trigger Python function silently crashed on a test event payload — and we had no DLQ to catch it. Nobody noticed until a vendor called asking why their upload hadn't been processed. That incident kicked off a proper post-mortem, and what we found was embarrassing: three compounding mistakes that any team new to event-driven Lambda patterns can fall into. This is the honest account of what went wrong and exactly how we hardened the pipeline afterward.&lt;/p&gt;

&lt;h2&gt;Context: Why We Chose Lambda and S3 for This Pipeline&lt;/h2&gt;



&lt;p&gt;The setup was straightforward on paper. Third-party vendors drop CSV files into an S3 bucket. A Lambda function picks up the S3 event notification, parses the file, and inserts rows into RDS PostgreSQL. We chose Lambda over ECS for a few practical reasons: invocations were sporadic (dozens per day, not continuous), we didn't want to manage a running container 24/7, and the per-invocation billing model made cost predictable at low volume.&lt;/p&gt;

&lt;p&gt;The stack: Python 3.11, boto3 1.34.x, Terraform for all infra, and S3 event notifications wired directly to Lambda — not EventBridge, which we weren't using at the time. The team had solid Python experience but relatively limited exposure to Lambda's async invocation model and its failure semantics. We assumed it would behave like a normal function call. It doesn't. That assumption is where the trouble started.&lt;/p&gt;

&lt;p&gt;If you want broader context on event-driven failure patterns we've hit before, the &lt;a href="https://kuryzhev.cloud/" rel="noopener noreferrer"&gt;kuryzhev.cloud DevOps archive&lt;/a&gt; covers several related post-mortems worth reading alongside this one.&lt;/p&gt;

&lt;h2&gt;Mistake 1: We Trusted the S3 Event Payload Without Validating It&lt;/h2&gt;

&lt;p&gt;Our original handler accessed the S3 object key like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;key = event["Records"][0]["s3"]["object"]["key"]&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;No guard. No validation. Direct dictionary access assuming the payload always matched the documented S3 event schema. That worked fine — until someone clicked "Send test event" from the Lambda console. The S3 test event uses a different schema. It sends a fake key value of &lt;code&gt;AmazonS3ExampleObject&lt;/code&gt; and the overall structure is close enough to look real, but critically it doesn't match what a live S3 notification sends. The function threw a &lt;code&gt;KeyError&lt;/code&gt;, Lambda logged it to CloudWatch, and then... nothing. The invocation was gone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Watch out for this:&lt;/strong&gt; S3 invokes Lambda asynchronously and &lt;em&gt;does not track whether your handler succeeded&lt;/em&gt;. Lambda owns the retry behavior, and once its retries are exhausted with no DLQ or failure destination configured, the event simply disappears. We had no dead-letter queue. We had no alarm on Lambda errors. We had CloudWatch Logs set to "Never expire" with no metric filter watching for ERROR lines. Six hours passed before the vendor noticed.&lt;/p&gt;

&lt;p&gt;The second gotcha here: S3 URL-encodes object keys. A filename like &lt;code&gt;vendor upload Q1.csv&lt;/code&gt; arrives as &lt;code&gt;vendor+upload+Q1.csv&lt;/code&gt;. If you access the key and pass it directly to &lt;code&gt;s3_client.get_object()&lt;/code&gt;, you'll get a &lt;code&gt;NoSuchKey&lt;/code&gt; error that looks like the file doesn't exist — when it's right there in the bucket. Always call &lt;code&gt;urllib.parse.unquote_plus(key)&lt;/code&gt; immediately after extracting the key. We weren't doing that either.&lt;/p&gt;
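
&lt;p&gt;A quick way to see the difference, as a minimal sketch using only the standard library. The second key is a made-up example of a percent-escaped name, not one taken from our bucket.&lt;/p&gt;

```python
from urllib.parse import unquote, unquote_plus

# Key as it appears in the S3 event for an object named "vendor upload Q1.csv"
raw_key = "vendor+upload+Q1.csv"

# unquote leaves '+' untouched, so the result still does not match the real object name
plain = unquote(raw_key)

# unquote_plus turns '+' into a space, which recovers the real object name
decoded = unquote_plus(raw_key)

# Percent-escapes are decoded as well (hypothetical key for illustration)
escaped = unquote_plus("reports/2026%2806%29.csv")

print(plain)
print(decoded)
print(escaped)
```

&lt;p&gt;Passing the first result to &lt;code&gt;get_object()&lt;/code&gt; would look up a key containing literal plus signs, which is the lookup failure described above.&lt;/p&gt;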

&lt;p&gt;Here's the validated event parsing function we use now. It extracts the bucket and key safely, decodes the URL-encoding, and explicitly rejects console test events before any processing begins:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# lambda_function.py
# Python 3.11 | boto3 1.34.x
# S3-triggered Lambda with structured error handling and DLQ support

import json
import csv
import io
import logging
import urllib.parse
import boto3
from botocore.exceptions import ClientError

logger = logging.getLogger()
logger.setLevel(logging.INFO)

s3_client = boto3.client("s3")


def parse_s3_event(event: dict) -&amp;gt; tuple[str, str]:
    """
    Extract and validate bucket name and object key from S3 event payload.
    URL-decodes the key — critical for filenames with spaces or special chars.
    Raises ValueError on malformed or test event payloads.
    """
    try:
        record = event["Records"][0]
        bucket = record["s3"]["bucket"]["name"]
        # S3 encodes object keys — unquote_plus handles spaces encoded as '+'
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
    except (KeyError, IndexError) as e:
        raise ValueError(f"Malformed S3 event payload: {e}") from e

    # Reject S3 console test events — they send a fake key
    if key == "AmazonS3ExampleObject":
        raise ValueError("Received S3 test event — skipping processing")

    return bucket, key&lt;/code&gt;&lt;/pre&gt;

&lt;h2&gt;Mistake 2: Unhandled Exceptions Caused Silent Duplicate Processing&lt;/h2&gt;

&lt;p&gt;Our original handler raised raw exceptions when something went wrong. The thinking was: Lambda will log it, we'll see it in CloudWatch, and we'll fix it. That reasoning misses something important about how Lambda handles async invocations.&lt;/p&gt;

&lt;p&gt;When a Lambda function invoked asynchronously raises an unhandled exception, Lambda retries it. The default is two additional attempts — three total. You can verify this in the &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/invocation-async.html" rel="noopener noreferrer"&gt;AWS Lambda async invocation docs&lt;/a&gt;. We didn't know that. So when a malformed CSV arrived and our parser threw an exception, Lambda dutifully tried again. And again. The same broken file was processed three times before the retries exhausted.&lt;/p&gt;

&lt;p&gt;The damage: we had no idempotency check. There was no record of which S3 keys had already been processed. The first invocation partially inserted rows before failing mid-loop. The retries inserted some of those rows again. We ended up with duplicate records in RDS that took an afternoon to identify and clean up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Watch out for this:&lt;/strong&gt; The distinction between retryable and non-retryable errors matters enormously in async Lambda. A transient &lt;code&gt;ClientError&lt;/code&gt; from boto3 (throttling, service unavailable) is worth retrying. A malformed CSV or a validation failure is not — retrying it three times just creates three times the mess.&lt;/p&gt;

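&lt;p&gt;The fix has two parts. First, structured exception handling in the handler itself: catch &lt;code&gt;ValueError&lt;/code&gt; for fatal non-retryable conditions, log it, and return normally so Lambda doesn't retry. For async invocations the return value is discarded, so what matters is that the handler does not raise. Re-raise only for genuinely transient errors. Second, set &lt;code&gt;MaximumRetryAttempts&lt;/code&gt; to 0 in Terraform for any Lambda handling non-idempotent operations; that setting appears in the Terraform configuration later in this post. The handler side looks like this:&lt;/p&gt;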

&lt;pre&gt;&lt;code&gt;def fetch_and_parse_csv(bucket: str, key: str) -&amp;gt; list[dict]:
    """
    Fetch CSV from S3 and parse with stdlib csv (no pandas dependency).
    Raises ValueError for missing objects (fatal, not retried).
    Re-raises other ClientErrors (throttling, etc.) so they can be retried.
    """
    try:
        response = s3_client.get_object(Bucket=bucket, Key=key)
    except ClientError as e:
        error_code = e.response["Error"]["Code"]
        if error_code == "NoSuchKey":
            # Fatal: convert to ValueError so the handler logs it and returns without retrying
            raise ValueError(f"Object not found: s3://{bucket}/{key}") from e
        # Other ClientErrors (throttling, etc.) are retryable — re-raise
        raise

    body = response["Body"].read().decode("utf-8")
    reader = csv.DictReader(io.StringIO(body))
    return list(reader)


def lambda_handler(event: dict, context) -&amp;gt; dict:
    """
    Main Lambda handler. Structured error handling:
    - ValueError → fatal, log and return without raising to prevent retry
    - ClientError (non-NoSuchKey) → re-raise to trigger retry / DLQ
    """
    logger.info("Received event: %s", json.dumps(event))

    try:
        bucket, key = parse_s3_event(event)
        logger.info("Processing s3://%s/%s", bucket, key)

        rows = fetch_and_parse_csv(bucket, key)
        logger.info("Parsed %d rows from %s", len(rows), key)

        # TODO: insert rows into RDS or forward downstream
        # process_rows(rows)

        return {"statusCode": 200, "body": f"Processed {len(rows)} rows"}

    except ValueError as e:
        # Non-retryable: log, do not re-raise (prevents unnecessary retries)
        logger.error("Fatal validation error: %s", str(e))
        return {"statusCode": 400, "body": str(e)}

    except Exception as e:
        # Retryable or unexpected: re-raise so Lambda/DLQ handles it
        logger.exception("Unexpected error processing event")
        raise&lt;/code&gt;&lt;/pre&gt;
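
&lt;p&gt;The handler above still leaves the idempotency check as a TODO. One possible shape for it, sketched here with &lt;code&gt;sqlite3&lt;/code&gt; standing in for PostgreSQL so the example runs anywhere: record each object in a table with a unique key before inserting its rows, and skip the file if the record already exists. The &lt;code&gt;processed_files&lt;/code&gt; table and &lt;code&gt;claim_file&lt;/code&gt; function are hypothetical names, and using the object's ETag (available in the event as &lt;code&gt;record["s3"]["object"]["eTag"]&lt;/code&gt;) to tell re-uploads apart is an assumption about what counts as the same file.&lt;/p&gt;

```python
import sqlite3

# sqlite3 stands in for the real database here; table and function names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE processed_files ("
    " bucket TEXT NOT NULL, object_key TEXT NOT NULL, etag TEXT NOT NULL,"
    " PRIMARY KEY (bucket, object_key, etag))"
)


def claim_file(bucket, key, etag):
    """Return True if this call recorded the file first, False if it was already claimed."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO processed_files (bucket, object_key, etag) VALUES (?, ?, ?)",
                (bucket, key, etag),
            )
        return True
    except sqlite3.IntegrityError:
        # Primary key violation: an earlier invocation already claimed this object version
        return False


first_attempt = claim_file("vendor-uploads", "vendor upload Q1.csv", "abc123")
retry_attempt = claim_file("vendor-uploads", "vendor upload Q1.csv", "abc123")
print(first_attempt, retry_attempt)
```

&lt;p&gt;In the real pipeline the claim and the row inserts would need to share one database transaction; otherwise a failure mid-insert leaves the file marked as processed with only part of its rows written.&lt;/p&gt;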

&lt;h2&gt;Mistake 3: Broad IAM Permissions and a Bloated Deployment Package&lt;/h2&gt;

&lt;p&gt;This one is actually two mistakes that compounded each other, and I'm grouping them because they both came from the same root cause: we shipped fast and didn't review the details.&lt;/p&gt;

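&lt;p&gt;The IAM execution role we originally attached had &lt;code&gt;s3:*&lt;/code&gt; on &lt;code&gt;*&lt;/code&gt;. It passed our security review because the reviewer was focused on the VPC config and missed the policy. AWS Security Hub flagged it later under rule &lt;code&gt;S3.6&lt;/code&gt;, but by then it had been in production for three weeks. Least-privilege for an S3-reading Lambda is simple: &lt;code&gt;s3:GetObject&lt;/code&gt; scoped to &lt;code&gt;arn:aws:s3:::your-bucket-name/*&lt;/code&gt;. If the function never needs to list or write, those permissions shouldn't exist on the role. One caveat: without &lt;code&gt;s3:ListBucket&lt;/code&gt; on the bucket, S3 answers a request for a missing object with &lt;code&gt;AccessDenied&lt;/code&gt; instead of &lt;code&gt;NoSuchKey&lt;/code&gt;, so error handling that branches on &lt;code&gt;NoSuchKey&lt;/code&gt; should either add that one permission or treat both codes as fatal.&lt;/p&gt;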

&lt;p&gt;The second mistake: we bundled the entire &lt;code&gt;pandas&lt;/code&gt; library into the deployment package because one engineer was comfortable with it for CSV parsing. Pandas compressed is roughly 45 MB. Our deployment package hit the size limit warnings and our cold start went from around 800ms to over 4200ms. That's a 5x regression for functionality that Python's stdlib &lt;code&gt;csv&lt;/code&gt; module handles in 20 lines with zero dependencies. Lambda has a 250 MB unzipped deployment package limit — we were well under it, but cold start latency scales with package size regardless. We measured this directly: same function logic, stdlib &lt;code&gt;csv&lt;/code&gt; vs. pandas. The difference was consistent across 20 test invocations.&lt;/p&gt;

&lt;p&gt;I stopped using pandas in Lambda entirely after this. For anything beyond simple parsing — aggregations, complex transforms — I'd rather push that work into a proper compute layer (ECS, Glue) than fight Lambda's package constraints. The &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/python-package.html" rel="noopener noreferrer"&gt;AWS Lambda Python packaging docs&lt;/a&gt; explain dependency management and layer options if you genuinely need heavy libraries.&lt;/p&gt;

&lt;h2&gt;What We Do Differently Now&lt;/h2&gt;

&lt;p&gt;The full Terraform configuration below captures the hardened setup: DLQ attached, retry attempts set to zero, IAM scoped to the specific bucket ARN, and the &lt;code&gt;source_account&lt;/code&gt; condition on the Lambda permission resource to prevent confused deputy attacks when S3 is the invoking principal.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# terraform/lambda.tf
# Terraform 1.x | AWS provider ~&amp;gt; 5.0
# Lambda function with scoped IAM, DLQ, and retry config

resource "aws_sqs_queue" "lambda_dlq" {
  name                      = "csv-processor-dlq"
  message_retention_seconds = 1209600 # 14 days
}

resource "aws_lambda_function" "csv_processor" {
  function_name = "csv-processor"
  runtime       = "python3.11"
  handler       = "lambda_function.lambda_handler"
  filename      = "lambda.zip"
  role          = aws_iam_role.lambda_exec.arn
  timeout       = 30
  memory_size   = 256

  # Avoid bundling heavy libs — keep package lean for cold start
  ephemeral_storage {
    size = 512 # MB, default; increase only if /tmp needed
  }

  dead_letter_config {
    target_arn = aws_sqs_queue.lambda_dlq.arn
  }
}

resource "aws_lambda_function_event_invoke_config" "csv_processor" {
  function_name          = aws_lambda_function.csv_processor.function_name
  maximum_retry_attempts = 0 # Disable async retries — we handle idempotency manually
}

# Least-privilege IAM — scoped to specific bucket, not s3:*
resource "aws_iam_role_policy" "lambda_s3_policy" {
  name = "csv-processor-s3-policy"
  role = aws_iam_role.lambda_exec.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["s3:GetObject"]
        Resource = "arn:aws:s3:::your-bucket-name/*" # Scope to bucket, not *
      },
      {
        Effect   = "Allow"
        Action   = ["sqs:SendMessage"]
        Resource = aws_sqs_queue.lambda_dlq.arn
      }
    ]
  })
}

# Allow S3 to invoke Lambda — include SourceAccount to prevent confused deputy
resource "aws_lambda_permission" "allow_s3" {
  statement_id   = "AllowS3Invoke"
  action         = "lambda:InvokeFunction"
  function_name  = aws_lambda_function.csv_processor.function_name
  principal      = "s3.amazonaws.com"
  source_account = var.aws_account_id # Prevents confused deputy attack
  source_arn     = "arn:aws:s3:::your-bucket-name"
}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Beyond the code, we added three process gates that must pass before any async Lambda merges to main. First: the function must be tested with both a real S3 event payload and an S3 console test event — the test event should be handled gracefully, not crash. Second: cold start must be measured and confirmed under 1 second on a clean invocation; if it's over that, the package is too large. Third: a CloudWatch alarm on DLQ message count must exist and be confirmed active. No alarm, no merge. We also set all Lambda log groups to 14-day retention — "Never expire" is not a default we accept anymore after seeing what log storage costs compound to across dozens of functions.&lt;/p&gt;

&lt;p&gt;The AWS Lambda S3 trigger Python pattern is genuinely simple when it works. The failure modes are subtle and the consequences — silent data loss, duplicate writes, security exposure — are disproportionately severe for how small the mistakes are. Every one of these was a one-line fix. The cost was not catching them before production.&lt;/p&gt;

&lt;h2&gt;Related&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://kuryzhev.cloud/category/aws/" rel="noopener noreferrer"&gt;More AWS automation, Lambda patterns, and production hardening&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kuryzhev.cloud/category/terraform/" rel="noopener noreferrer"&gt;Terraform infrastructure patterns for AWS deployments&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kuryzhev.cloud/category/python/" rel="noopener noreferrer"&gt;Python scripting and automation for DevOps workflows&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>python</category>
      <category>devops</category>
    </item>
    <item>
      <title>Docker BuildKit Cache Optimization for Faster CI Pipelines</title>
      <dc:creator>Oleksandr Kuryzhev</dc:creator>
      <pubDate>Sat, 06 Jun 2026 06:11:05 +0000</pubDate>
      <link>https://dev.to/oleksandr_kuryzhev_42873f/docker-buildkit-cache-optimization-for-faster-ci-pipelines-1bal</link>
      <guid>https://dev.to/oleksandr_kuryzhev_42873f/docker-buildkit-cache-optimization-for-faster-ci-pipelines-1bal</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyusax41mevy39ye4fhca.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyusax41mevy39ye4fhca.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;br&gt;
Docker BuildKit cache optimization for faster CI pipelines is one of those infrastructure improvements that pays dividends immediately — every pipeline run after the first one benefits from layers that were already built and pushed. This post walks through the full setup: enabling BuildKit correctly, choosing between inline and registry cache modes, and writing a GitHub Actions workflow that actually reuses layers instead of rebuilding from scratch on every push.&lt;/p&gt;

&lt;p&gt;The performance gap between a cold build and a warm cache hit can be dramatic. A Node.js or Python image that takes four minutes to install dependencies cold can complete that same stage in under ten seconds when the cache layer is pulled from a registry. Getting there requires more than just adding a flag — it requires understanding how BuildKit stores, retrieves, and validates cache manifests.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Requirements&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzcqao094fcmd965drpuj.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzcqao094fcmd965drpuj.jpg" alt=" " width="800" height="534"&gt;&lt;/a&gt;&lt;br&gt;
Before touching a single YAML file, confirm your environment meets these prerequisites. Missing any one of them is the most common reason cache configurations appear to work but silently miss on every run.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Docker 20.10+ with BuildKit support — either set DOCKER_BUILDKIT=1 as an environment variable or add "features": {"buildkit": true} to /etc/docker/daemon.json on self-hosted runners.&lt;/li&gt;
&lt;li&gt;Docker Buildx plugin — required for type=registry and type=gha cache backends. Plain docker build does not support these modes.&lt;/li&gt;
&lt;li&gt;GitHub Actions runner on ubuntu-22.04 or later — the docker/setup-buildx-action works reliably on this image.&lt;/li&gt;
&lt;li&gt;A container registry with write access — GitHub Container Registry (ghcr.io) works well here because the GITHUB_TOKEN handles authentication without managing separate credentials.&lt;/li&gt;
&lt;li&gt;A Dockerfile with dependency installation separated from application code — if your COPY . . instruction appears before RUN pip install or RUN npm ci, every code change invalidates the dependency layer and the cache never helps where it matters most.&lt;/li&gt;
&lt;li&gt;Network access from the runner to the registry. This sounds obvious, but using --cache-from without prior registry authentication causes a silent cache miss, not a build failure. You will not see an error; you will just see a slow build.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you are running self-hosted runners in a private network, review the BuildKit cache backend documentation for proxy and authentication considerations before proceeding.&lt;/p&gt;
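
&lt;p&gt;A quick way to confirm the Docker and Buildx prerequisites on a runner is a throwaway diagnostic step. This is a minimal sketch rather than part of the workflow in this post; the step name is illustrative, and it only prints versions, so you still have to read the output yourself:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Hypothetical diagnostic step: add it under steps: temporarily, remove once verified
- name: Check Docker and Buildx versions
  run: |
    # Docker Engine version reported by the daemon (expect 20.10 or newer)
    docker version --format '{{.Server.Version}}'
    # Fails the step if the Buildx plugin is not installed
    docker buildx version
    # Lists the available builders and their drivers
    docker buildx ls
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;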

&lt;p&gt;&lt;strong&gt;Implementation&lt;/strong&gt;&lt;br&gt;
There are two cache strategies worth understanding before writing any configuration: inline cache and registry cache. Inline cache embeds cache metadata in the image itself and is enabled with the BUILDKIT_INLINE_CACHE=1 build argument. It requires no separate cache tag, but it supports only mode=min, so it captures just the layers that end up in the final image. That makes it a poor fit for multi-stage builds where you want to cache the builder stage independently. Registry cache, by contrast, stores cache data under a separate manifest tag and supports mode=max, which exports the layers of every intermediate stage rather than only those of the final one.&lt;/p&gt;

&lt;p&gt;For most CI pipelines, registry cache with mode=max is the right choice. The difference between mode=min and mode=max is significant: min caches only the final stage output, while max caches every intermediate layer including builder stages, test runners, and dependency installers. If your Dockerfile has a dedicated build stage that compiles binaries, mode=max is what makes that stage reusable across runs.&lt;/p&gt;
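
&lt;p&gt;To make the two strategies concrete, here is a rough side-by-side sketch of how each could look as a docker/build-push-action step. The ghcr.io/your-org/your-repo reference and the :latest tag are placeholders, and the two steps are alternatives, not meant to run together:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Option A: inline cache (mode=min only); cache metadata is stored in the pushed image
- uses: docker/build-push-action@v5
  with:
    push: true
    tags: ghcr.io/your-org/your-repo:latest
    cache-from: type=registry,ref=ghcr.io/your-org/your-repo:latest
    cache-to: type=inline

# Option B: registry cache; a separate :cache tag holds layers from every stage
- uses: docker/build-push-action@v5
  with:
    push: true
    tags: ghcr.io/your-org/your-repo:latest
    cache-from: type=registry,ref=ghcr.io/your-org/your-repo:cache
    cache-to: type=registry,ref=ghcr.io/your-org/your-repo:cache,mode=max
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;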

&lt;p&gt;Here is the complete GitHub Actions workflow implementing registry cache with BuildKit. Note that DOCKER_BUILDKIT: "1" is set in the workflow-level env block, not inside a single step, so it is available to every step in every job. The Buildx-based docker/build-push-action always builds with BuildKit; the variable matters for any step that calls plain docker build.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/docker-build.yml&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Docker Build with BuildKit Cache&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;main&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;develop&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;main&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;DOCKER_BUILDKIT&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;
  &lt;span class="na"&gt;REGISTRY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ghcr.io&lt;/span&gt;
  &lt;span class="na"&gt;IMAGE_NAME&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ github.repository }}&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-22.04&lt;/span&gt;
    &lt;span class="na"&gt;permissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;contents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;read&lt;/span&gt;
      &lt;span class="na"&gt;packages&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;write&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Checkout source&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Set up Docker Buildx&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker/setup-buildx-action@v3&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Log in to GitHub Container Registry&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker/login-action@v3&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;registry&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ env.REGISTRY }}&lt;/span&gt;
          &lt;span class="na"&gt;username&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ github.actor }}&lt;/span&gt;
          &lt;span class="na"&gt;password&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.GITHUB_TOKEN }}&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Extract metadata for Docker&lt;/span&gt;
        &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;meta&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker/metadata-action@v5&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;images&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Build and push with registry cache&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker/build-push-action@v5&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
          &lt;span class="na"&gt;file&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./Dockerfile&lt;/span&gt;
          &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ github.event_name != 'pull_request' }}&lt;/span&gt;
          &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ steps.meta.outputs.tags }}&lt;/span&gt;
          &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ steps.meta.outputs.labels }}&lt;/span&gt;
          &lt;span class="c1"&gt;# Pull existing cache layers from registry before building&lt;/span&gt;
          &lt;span class="na"&gt;cache-from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;type=registry,ref=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:cache&lt;/span&gt;
          &lt;span class="c1"&gt;# Push all intermediate layers (mode=max) for maximum cache reuse&lt;/span&gt;
          &lt;span class="na"&gt;cache-to&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;type=registry,ref=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:cache,mode=max&lt;/span&gt;
          &lt;span class="na"&gt;build-args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;BUILDKIT_INLINE_CACHE=1&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Verify cache hit in build log&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;echo "Check above build step output for 'CACHED' layer lines."&lt;/span&gt;
          &lt;span class="s"&gt;echo "Re-run this workflow without code changes to confirm cache is working."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few implementation details worth highlighting. The cache tag (:cache) is stored as a separate manifest in the registry, so deleting your application image tag does not remove the cache tag. Manage them independently if you have registry retention policies. Also note that push: ${{ github.event_name != 'pull_request' }} only controls whether the image is pushed. The cache-to export is independent of push, so as written, pull request builds both read from and write to the :cache tag. If you do not want untested branches polluting the cache used by main, apply cache-to only on non-PR events so that PR builds read from the cache without writing back.&lt;/p&gt;
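
&lt;p&gt;One way to make read-only caching for pull requests explicit is to give them their own build step with no cache-to at all. The following is a sketch, assuming the REGISTRY and IMAGE_NAME variables and the meta step from the workflow above; the step names are illustrative, and the two steps would replace the single build step:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Pull requests: read the shared cache, never write to it
- name: Build (pull request, read-only cache)
  if: github.event_name == 'pull_request'
  uses: docker/build-push-action@v5
  with:
    context: .
    push: false
    cache-from: type=registry,ref=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:cache

# Pushes to main or develop: read the cache and refresh it
- name: Build and push (read-write cache)
  if: github.event_name != 'pull_request'
  uses: docker/build-push-action@v5
  with:
    context: .
    push: true
    tags: ${{ steps.meta.outputs.tags }}
    labels: ${{ steps.meta.outputs.labels }}
    cache-from: type=registry,ref=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:cache
    cache-to: type=registry,ref=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:cache,mode=max
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Splitting the step costs some duplication, but it keeps the condition in a plain if: rather than inside the cache-to value, which is easier to read in review.&lt;/p&gt;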

&lt;p&gt;For teams already using GitHub Actions cache storage rather than a registry, switching to type=gha is straightforward: replace type=registry,ref=... with type=gha in both cache-from and cache-to, keeping mode=max on cache-to. The docker/build-push-action supplies the GitHub cache URL and runtime token automatically. That said, registry cache is more portable and works identically on self-hosted runners without GitHub-specific configuration.&lt;/p&gt;
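
&lt;p&gt;As a sketch of that swap, the build step could look like the following, assuming the same meta step as in the workflow above. Only the two cache lines differ from the registry variant:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;- name: Build and push with GitHub Actions cache
  uses: docker/build-push-action@v5
  with:
    context: .
    push: ${{ github.event_name != 'pull_request' }}
    tags: ${{ steps.meta.outputs.tags }}
    labels: ${{ steps.meta.outputs.labels }}
    # Read from and write to the GitHub Actions cache service instead of a registry tag
    cache-from: type=gha
    cache-to: type=gha,mode=max
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;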

&lt;p&gt;For more on structuring CI pipelines around reusable patterns, see the DevOps_DayS pipeline guides on kuryzhev.cloud, which cover related topics including Jenkins shared library design and GitLab CI artifact passing between stages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test the Setup&lt;/strong&gt;&lt;br&gt;
Verifying that cache is actually being used — not just configured — requires inspecting build output directly. The key signal is the CACHED prefix on layer lines in the build log.&lt;/p&gt;

&lt;p&gt;Run the workflow twice without changing any application code. On the second run, add --progress=plain to your build command (or check the raw step output in GitHub Actions) and look for lines like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#8 [builder 3/5] RUN pip install --no-cache-dir -r requirements.txt
#8 CACHED

#9 [builder 4/5] COPY src/ /app/src/
#9 CACHED
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
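
&lt;p&gt;The sample above is what plain progress output looks like. When the build runs through docker/build-push-action there is no command line to append --progress=plain to, so one option is the BUILDKIT_PROGRESS environment variable that Buildx reads. This is a sketch under that assumption; Actions logs are often already plain because the runner has no TTY, in which case the variable simply makes the choice explicit:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;- name: Build and push with registry cache
  uses: docker/build-push-action@v5
  env:
    # Ask Buildx for plain, line-by-line progress output
    BUILDKIT_PROGRESS: plain
  with:
    context: .
    push: ${{ github.event_name != 'pull_request' }}
    tags: ${{ steps.meta.outputs.tags }}
    cache-from: type=registry,ref=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:cache
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;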



&lt;p&gt;If you see CACHED on dependency installation layers, the registry cache is working. If those layers instead show command output followed by a DONE line with a duration, they are being rebuilt. The most common causes are: the registry login step missing or placed after the build step (it must run before the build), the cache tag not yet existing on first run (expected; the second run will be warm), or a COPY instruction ordering problem in the Dockerfile invalidating the cache before the expensive layers.&lt;/p&gt;

&lt;p&gt;To confirm the cache manifest exists in the registry independently of your application image, run the following against your registry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Inspect the cache manifest tag directly&lt;/span&gt;
docker manifest inspect ghcr.io/your-org/your-repo:cache

&lt;span class="c"&gt;# Expected output includes mediaType for BuildKit cache manifest&lt;/span&gt;
&lt;span class="c"&gt;# "mediaType": "application/vnd.oci.image.manifest.v1+json"&lt;/span&gt;
&lt;span class="c"&gt;# with layers referencing cached build stages&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the tag does not exist, or the output looks like an ordinary application image rather than a BuildKit cache manifest, the cache-to export did not complete successfully. Check that the job grants packages: write to the GITHUB_TOKEN and that the registry login step precedes the build step in your workflow.&lt;/p&gt;

&lt;p&gt;Timing comparison is the most practical validation. Record the build duration from the Actions summary on run one (cold) and run two (warm). A well-structured Dockerfile with stable dependency layers should show a 60–80% reduction in build time on cache hits for dependency-heavy images.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Set DOCKER_BUILDKIT=1 in the workflow-level or job-level env block in GitHub Actions, not just inside a single step, so that every step calling docker build picks it up.&lt;/li&gt;
&lt;li&gt;Use mode=max for multi-stage Dockerfiles; mode=min only caches the final stage and misses most of the value in complex builds.&lt;/li&gt;
&lt;li&gt;The registry cache tag (:cache) is independent of your application image tag — manage its lifecycle separately if you have registry cleanup automation.&lt;/li&gt;
&lt;li&gt;A missing or failed registry login before cache-from causes a silent miss — verify authentication order in your workflow steps.&lt;/li&gt;
&lt;li&gt;Structure your Dockerfile so stable RUN dependency installation steps appear before frequently-changing COPY instructions — no amount of cache configuration compensates for poor layer ordering.&lt;/li&gt;
&lt;li&gt;Inline cache (BUILDKIT_INLINE_CACHE=1) is useful as a fallback for single-stage images but should not replace registry cache in production CI pipelines.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I will cover more topics like this on my blog.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
