<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mehmet TURAÇ</title>
    <description>The latest articles on DEV Community by Mehmet TURAÇ (@turacthethinker).</description>
    <link>https://dev.to/turacthethinker</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2891163%2Fc090c265-314e-4377-95f0-7b9083408109.jpg</url>
      <title>DEV Community: Mehmet TURAÇ</title>
      <link>https://dev.to/turacthethinker</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/turacthethinker"/>
    <language>en</language>
    <item>
      <title>Stop Your AI Agent From Building Tools That Already Exist</title>
      <dc:creator>Mehmet TURAÇ</dc:creator>
      <pubDate>Sun, 26 Apr 2026 19:51:30 +0000</pubDate>
      <link>https://dev.to/turacthethinker/stop-your-ai-agent-from-building-tools-that-already-exist-6o9</link>
      <guid>https://dev.to/turacthethinker/stop-your-ai-agent-from-building-tools-that-already-exist-6o9</guid>
      <description>&lt;p&gt;Your agent just wrote a custom PDF parser.&lt;/p&gt;

&lt;p&gt;There are four maintained libraries that do exactly this. It didn't check. It never does.&lt;/p&gt;

&lt;p&gt;This is the default behavior of every coding agent: task arrives → code is written → you maintain the bespoke solution forever.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;skill-hunter&lt;/strong&gt; is the missing pause.&lt;/p&gt;

&lt;h2&gt;
  
  
  What It Does
&lt;/h2&gt;

&lt;p&gt;skill-hunter is a pre-execution layer for coding and automation agents. Before your agent writes a single line, it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Classifies the request&lt;/strong&gt; — what kind of task is this?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scans the ecosystem&lt;/strong&gt; — MCP servers, CLIs, npm/pip packages, APIs, GitHub repos, existing repo utilities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scores candidates&lt;/strong&gt; — fit, maintenance activity, permissions, security, docs quality, license, integration effort&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recommends a path&lt;/strong&gt; — reuse, adapt, build minimally, or build custom&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gates risky actions&lt;/strong&gt; — installs, credentials, external service connections, destructive ops require explicit approval&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The agent checks the toolbox before building another hammer.&lt;/p&gt;
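
&lt;p&gt;To make step 3 concrete, here is a hypothetical sketch of what candidate scoring could look like. The fields and weights are illustrative, not skill-hunter's actual internals:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical sketch of step 3 only -- not skill-hunter's real scoring code.
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    fit: float          # how well it matches the classified task (0-1)
    maintenance: float  # release cadence, issue activity (0-1)
    security: float     # permissions requested, audit posture (0-1)
    integration: float  # effort to wire into the current stack (0-1)

WEIGHTS = {"fit": 0.4, "maintenance": 0.25, "security": 0.2, "integration": 0.15}

def score(c):
    return sum(WEIGHTS[k] * getattr(c, k) for k in WEIGHTS)

candidates = [
    Candidate("pdfplumber", 0.9, 0.85, 0.9, 0.8),
    Candidate("build-custom", 1.0, 0.2, 0.6, 0.3),
]
best = max(candidates, key=score)
print(f"Recommend: {best.name} (score {score(best):.2f})")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;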

&lt;h2&gt;
  
  
  The Before / After
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Without skill-hunter:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;User: "Parse these PDFs and extract invoices."&lt;br&gt;
Agent: &lt;em&gt;writes 200-line custom parser&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;With skill-hunter:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;User: "Parse these PDFs and extract invoices."&lt;br&gt;
Agent: &lt;em&gt;checks PDF libraries, OCR tools, invoice extraction APIs, asks whether accuracy, cost, privacy, or offline processing matters&lt;/em&gt;&lt;br&gt;
Agent: "pdfplumber covers 90% of this. Want me to wrap it with your schema instead?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Install in 30 Seconds
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Claude Code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;/plugin marketplace add mturac/skill-hunter
/plugin &lt;span class="nb"&gt;install &lt;/span&gt;skill-hunter@skill-hunter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Codex CLI:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;codex plugin marketplace add mturac/skill-hunter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then in &lt;code&gt;~/.codex/config.toml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[plugins."skill-hunter@skill-hunter"]&lt;/span&gt;
&lt;span class="py"&gt;enabled&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;OpenClaw:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/openclaw-skills:skill_hunter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;The problem isn't that agents write bad code. The problem is that agents write &lt;em&gt;unnecessary&lt;/em&gt; code — and then you're stuck maintaining it.&lt;/p&gt;

&lt;p&gt;Every custom solution is a maintenance debt. Every dependency you don't introduce is a security surface you don't expose. Every API you don't reinvent is a battle-tested edge-case handler you get for free.&lt;/p&gt;

&lt;p&gt;skill-hunter doesn't stop agents from building. It stops agents from building what already exists.&lt;/p&gt;




&lt;p&gt;GitHub: &lt;a href="https://github.com/mturac/skill-hunter" rel="noopener noreferrer"&gt;mturac/skill-hunter&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If this solves a real pain for you, a star helps me know what to keep building.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claudecode</category>
      <category>productivity</category>
      <category>programming</category>
    </item>
    <item>
      <title>Why Versioned SQL Beats Vector RAG for Agent Memory Systems</title>
      <dc:creator>Mehmet TURAÇ</dc:creator>
      <pubDate>Sun, 26 Apr 2026 19:34:36 +0000</pubDate>
      <link>https://dev.to/turacthethinker/why-versioned-sql-beats-vector-rag-for-agent-memory-systems-1jo3</link>
      <guid>https://dev.to/turacthethinker/why-versioned-sql-beats-vector-rag-for-agent-memory-systems-1jo3</guid>
      <description>&lt;p&gt;&lt;strong&gt;Stop building agent memory systems on top of vector databases. You're setting your team up for failure.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Vector RAG looks elegant in demos. Pass a query, get back similar chunks, stuff them into context. Done. But when you scale to multiple agents collaborating over time? It collapses. Hard.&lt;/p&gt;

&lt;p&gt;Here's why: &lt;strong&gt;RAG conflates retrieval with reconciliation.&lt;/strong&gt; It assumes all knowledge is additive. That conflicts don't exist. That agents won't overwrite each other. They do. They will.&lt;/p&gt;

&lt;p&gt;What you actually need isn't &lt;em&gt;retrieval&lt;/em&gt;—it's &lt;em&gt;merge&lt;/em&gt;. Not "find me something like this." It's "here's my view of the world, now let's reconcile it with yours."&lt;/p&gt;

&lt;p&gt;Enter: &lt;strong&gt;versioned SQL.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not just any SQL. Think Git, but for structured data. Records have hashes. Changes form a DAG. Conflicts are resolved through explicit merges. History is preserved, not flattened.&lt;/p&gt;

&lt;h2&gt;
  
  
  Retrieval ≠ Reconciliation
&lt;/h2&gt;

&lt;p&gt;In RAG, vectors are stateless snapshots. Embeddings encode meaning at a point in time. But they can't tell you how that meaning evolved. Or where it came from. Or who changed it last.&lt;/p&gt;

&lt;p&gt;Agents write to memory. Multiple agents write concurrently. Without versioning, you lose causality. You lose intent. You end up with garbage-in-garbage-out loops.&lt;/p&gt;

&lt;p&gt;Merge-aware systems track lineage. Every change links to its parent. Agents can see &lt;em&gt;why&lt;/em&gt; something was written, not just that it exists. This enables safe collaboration.&lt;/p&gt;

&lt;p&gt;Imagine two agents updating a customer record simultaneously. One adds a new address. Another marks the account as inactive. In RAG land, both updates vanish into embedding space. Which one wins? Who knows?&lt;/p&gt;

&lt;p&gt;With versioned SQL, those changes live as separate commits. A merge strategy determines resolution. Maybe the system auto-resolves. Maybe it flags the conflict. Either way—you keep control.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lost in the Middle, Again
&lt;/h2&gt;

&lt;p&gt;Long-context windows were supposed to fix everything. Just throw more tokens at the model! Except now you're fighting the "lost in the middle" problem. Models forget things buried deep in context.&lt;/p&gt;

&lt;p&gt;Vectors amplify this. Similarity search returns semantically relevant chunks—but ordering matters. And there's no guarantee your retrieved facts are temporally coherent.&lt;/p&gt;

&lt;p&gt;Versioned memory solves this differently. Instead of stuffing raw text into prompts, store compact representations. Task graphs. Semantic summaries. Structured diffs.&lt;/p&gt;

&lt;p&gt;Tools like &lt;a href="https://github.com/gastownhall/beads" rel="noopener noreferrer"&gt;beads&lt;/a&gt; compress knowledge into minimal, composable units. Each bead tracks dependencies. Relationships stay intact even when content shifts.&lt;/p&gt;

&lt;p&gt;This isn't about indexing documents anymore. It's about modeling evolving beliefs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dolt-Backed Dependency Graphs Change Everything
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.dolthub.com/" rel="noopener noreferrer"&gt;Dolt&lt;/a&gt; brings Git-style versioning to relational tables. Combine that with hash-based IDs and dependency tracking—you've got a foundation for truly collaborative agent memory.&lt;/p&gt;

&lt;p&gt;Each agent writes to a branch. Commits reference prior states via SHA-like identifiers. Merge conflicts surface explicitly. No silent overwrites. No hallucinated truths.&lt;/p&gt;
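
&lt;p&gt;A minimal sketch of that flow, assuming a local Dolt server. Dolt speaks the MySQL wire protocol, so a stock MySQL client works; the table, branch names, and connection details are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: two agents write on separate Dolt branches, reconciled by explicit
# merges. Schema and connection details are illustrative.
import mysql.connector

db = mysql.connector.connect(host="127.0.0.1", user="root", database="agent_memory")
cur = db.cursor()

# Agent A: new address, committed on its own branch.
cur.execute("CALL DOLT_CHECKOUT('-b', 'agent_a')")
cur.execute("UPDATE customers SET address = '42 New St' WHERE id = 7")
cur.execute("CALL DOLT_COMMIT('-a', '-m', 'agent A: update address')")

# Agent B: deactivation, on a parallel branch off main.
cur.execute("CALL DOLT_CHECKOUT('main')")
cur.execute("CALL DOLT_CHECKOUT('-b', 'agent_b')")
cur.execute("UPDATE customers SET active = 0 WHERE id = 7")
cur.execute("CALL DOLT_COMMIT('-a', '-m', 'agent B: deactivate account')")

# Reconcile. Both histories survive; a conflicting cell surfaces in the
# dolt_conflicts system table instead of being silently overwritten.
cur.execute("CALL DOLT_CHECKOUT('main')")
cur.execute("CALL DOLT_MERGE('agent_a')")
cur.execute("CALL DOLT_MERGE('agent_b')")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;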

&lt;p&gt;Semantic compaction layers on top. Summarize large changesets into atomic facts. Store those alongside full history. Query either representation depending on need.&lt;/p&gt;

&lt;p&gt;This is how teams should build shared understanding—not by dumping embeddings into Pinecone and hoping for the best.&lt;/p&gt;

&lt;h2&gt;
  
  
  Vectors Are Still Useful—Just Not Here
&lt;/h2&gt;

&lt;p&gt;Don't misunderstand. Embeddings aren't going away. They excel at classification, clustering, anomaly detection.&lt;/p&gt;

&lt;p&gt;But using them as primary storage for agent memory is like using Redis for source code. Sure, it works—for a while. Then concurrency bites. Then consistency breaks.&lt;/p&gt;

&lt;p&gt;Vectors lack identity. They lack transactional semantics. They lack audit trails. These aren't bugs—they're design limitations.&lt;/p&gt;

&lt;p&gt;Use vectors where fuzziness helps. Use versioned SQL where precision matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to Build Instead
&lt;/h2&gt;

&lt;p&gt;Start with structure. Define schemas that reflect your domain. Tasks, entities, relationships—they all deserve types.&lt;/p&gt;

&lt;p&gt;Add versioning. Track every mutation. Preserve causality. Enable branching workflows.&lt;/p&gt;

&lt;p&gt;Implement merge strategies. Decide upfront how conflicting writes resolve. Automate where possible. Alert humans when needed.&lt;/p&gt;

&lt;p&gt;Layer compression on top. Extract semantic cores. Prune redundant paths. Keep only what's essential for reasoning.&lt;/p&gt;

&lt;p&gt;That's how you build memory systems that scale—with clarity, not chaos.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Has your RAG setup collapsed under concurrent multi-agent writes? Or are you already versioning your agent memory? Drop your setup below.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>webdev</category>
      <category>programming</category>
    </item>
    <item>
      <title>I Got Access to 136 AI Models for Free — NVIDIA NIM API Deep Dive</title>
      <dc:creator>Mehmet TURAÇ</dc:creator>
      <pubDate>Sun, 26 Apr 2026 18:18:58 +0000</pubDate>
      <link>https://dev.to/turacthethinker/i-got-access-to-136-ai-models-for-free-nvidia-nim-api-deep-dive-111o</link>
      <guid>https://dev.to/turacthethinker/i-got-access-to-136-ai-models-for-free-nvidia-nim-api-deep-dive-111o</guid>
      <description>&lt;p&gt;NVIDIA quietly built one of the most impressive AI APIs out there — and most developers don't know it exists.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NVIDIA NIM&lt;/strong&gt; (NVIDIA Inference Microservices) gives you OpenAI-compatible access to 136 models through a single endpoint. We're talking Llama 405B, Kimi K2, Mistral Large 3 675B, Qwen3-Coder 480B. All behind the same interface you already know.&lt;/p&gt;

&lt;p&gt;Here's what I found after testing them all.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup (60 seconds)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://integrate.api.nvidia.com/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nvapi-YOUR_KEY_HERE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Get your key at &lt;a href="https://build.nvidia.com" rel="noopener noreferrer"&gt;build.nvidia.com&lt;/a&gt;. Free tier included.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 136 Models — What's Actually in There
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bearer &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://integrate.api.nvidia.com/v1/models&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;models&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Total: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; models&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The catalog spans 20+ organizations:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Org&lt;/th&gt;
&lt;th&gt;Notable Models&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Meta&lt;/td&gt;
&lt;td&gt;Llama 3.1 405B, Llama 4 Maverick 17B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mistral&lt;/td&gt;
&lt;td&gt;Mistral Large 3 675B, Magistral Small, Codestral&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Moonshot&lt;/td&gt;
&lt;td&gt;Kimi K2, Kimi K2 Thinking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen&lt;/td&gt;
&lt;td&gt;Qwen3-Coder 480B, Qwen3.5 397B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek&lt;/td&gt;
&lt;td&gt;DeepSeek v3.2, v4 Pro, v4 Flash&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NVIDIA&lt;/td&gt;
&lt;td&gt;Nemotron Ultra 253B, Nemotron Super 49B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ByteDance&lt;/td&gt;
&lt;td&gt;Seed-OSS 36B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;td&gt;GPT-OSS 120B (yes, really)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What Actually Works (I Tested Them All)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://integrate.api.nvidia.com/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nvapi-YOUR_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;working_models&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;meta/llama-3.1-405b-instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;moonshotai/kimi-k2-instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen/qwen3-coder-480b-a35b-instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen/qwen3.5-397b-a17b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mistralai/mistral-large-3-675b-instruct-2512&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mistralai/magistral-small-2506&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nvidia/llama-3.3-nemotron-super-49b-v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bytedance/seed-oss-36b-instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;working_models&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explain transformers in one sentence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
        &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Results from my run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;meta/llama-3.1-405b-instruct: ✅ Fast, coherent
moonshotai/kimi-k2-instruct: ✅ Excellent reasoning
qwen/qwen3-coder-480b-a35b-instruct: ✅ Best for code tasks
mistralai/mistral-large-3-675b-instruct-2512: ✅ Strong instruction following
nvidia/llama-3.3-nemotron-super-49b-v1: ✅ NVIDIA-tuned, solid
deepseek-ai/deepseek-v4-pro: ❌ Timeout (high demand)
moonshotai/kimi-k2-thinking: ❌ Timeout (high demand)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Streaming Support
&lt;/h2&gt;

&lt;p&gt;All working models support streaming — critical for production UX:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;moonshotai/kimi-k2-instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write a Python async web scraper&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flush&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Multi-Model Router Pattern
&lt;/h2&gt;

&lt;p&gt;The real power: build a router that falls back across models based on availability and task type.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://integrate.api.nvidia.com/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nvapi-YOUR_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;ROUTING_TABLE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen/qwen3-coder-480b-a35b-instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;meta/llama-3.1-405b-instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mistralai/mistral-large-3-675b-instruct-2512&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reasoning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;moonshotai/kimi-k2-instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;meta/llama-3.1-405b-instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nvidia/llama-3.3-nemotron-super-49b-v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;general&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mistralai/mistral-large-3-675b-instruct-2512&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;meta/llama-3.1-405b-instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bytedance/seed-oss-36b-instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;smart_complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;general&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;models&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ROUTING_TABLE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ROUTING_TABLE&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;general&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
                &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;  &lt;span class="c1"&gt;# fallback to next model
&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

&lt;span class="c1"&gt;# Usage
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;smart_complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Implement a binary search tree in Python&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;task_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What Makes This Interesting
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. One API key, 20+ providers.&lt;/strong&gt; No juggling Anthropic, OpenAI, Mistral keys separately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. OpenAI SDK compatible.&lt;/strong&gt; Zero migration cost from existing code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Specialty models included.&lt;/strong&gt; BGE-M3 for embeddings, NemoRetriever for parsing, CLIP for vision — not just chat models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Free tier is generous.&lt;/strong&gt; Enough for development and light production usage.&lt;/p&gt;
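
&lt;p&gt;On point 3: the embedding models should sit behind the standard OpenAI embeddings route. A hedged sketch, reusing the client from the setup section (verify the exact model ID and any extra parameters on its page at build.nvidia.com):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Assumes BGE-M3 is exposed through the OpenAI-style /v1/embeddings route;
# model ID and parameters are unverified here -- check the catalog page.
emb = client.embeddings.create(
    model="baai/bge-m3",
    input=["multi-model routing for production agents"],
)
print(len(emb.data[0].embedding))  # vector dimensionality
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;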

&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Some flagship models (DeepSeek v4 Pro, Kimi K2 Thinking) time out under high demand&lt;/li&gt;
&lt;li&gt;Service keys have different scopes than personal keys — test both&lt;/li&gt;
&lt;li&gt;No fine-tuning support (inference only)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Bottom Line
&lt;/h2&gt;

&lt;p&gt;If you're building LLM-powered apps and not using NVIDIA NIM, you're either paying more than you need to or missing access to models that aren't available anywhere else. The multi-model fallback pattern alone is worth the 60-second setup.&lt;/p&gt;

&lt;p&gt;Get your key: &lt;a href="https://build.nvidia.com" rel="noopener noreferrer"&gt;build.nvidia.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>nvidia</category>
      <category>llm</category>
      <category>api</category>
    </item>
    <item>
      <title>Your Agent Isn't Reflecting. It's Performing Reflection.</title>
      <dc:creator>Mehmet TURAÇ</dc:creator>
      <pubDate>Sun, 26 Apr 2026 14:17:15 +0000</pubDate>
      <link>https://dev.to/turacthethinker/your-agent-isnt-reflecting-its-performing-reflection-b41</link>
      <guid>https://dev.to/turacthethinker/your-agent-isnt-reflecting-its-performing-reflection-b41</guid>
      <description>&lt;p&gt;Watch any modern agent framework long enough and you'll see it: the model produces output, then "reflects" on the output, then "corrects" itself, then ships a final answer.&lt;/p&gt;

&lt;p&gt;It looks like metacognition. It isn't.&lt;/p&gt;

&lt;p&gt;It's the same model, with the same weights, sampled twice, with the second sample conditioned on the first. There is no separate critic. There is no privileged vantage point. The reviewer and the reviewed are the same network — and the reviewer cannot see anything the original didn't already encode.&lt;/p&gt;

&lt;p&gt;This is reflection theatre.&lt;/p&gt;

&lt;h2&gt;
  
  
  What actually happens
&lt;/h2&gt;

&lt;p&gt;When you prompt "now critique your previous answer," the model does not consult a deeper layer of itself. It re-decodes from the same distribution, with a critique-shaped prefix. The output looks like self-correction because the prompt biases it toward correction-shaped tokens.&lt;/p&gt;

&lt;p&gt;If the original answer was wrong because the model lacked the relevant fact, the reflection step also lacks the fact. You get fluent confidence about a wrong answer, then fluent confidence about why that wrong answer was right.&lt;/p&gt;

&lt;p&gt;More tokens. Same blind spots. Higher bill.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where it actually helps
&lt;/h2&gt;

&lt;p&gt;Reflection-style chains do help in narrow cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;When the task is decomposable&lt;/strong&gt; and the model can re-attack a sub-step (e.g. arithmetic with a working pad).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When you change the input&lt;/strong&gt; between rounds — adding tool output, retrieval, a different role.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When the model is sampled with different temperature or different system prompts&lt;/strong&gt; to force genuinely different distributions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In all three cases, the gain comes from the &lt;em&gt;change in conditioning&lt;/em&gt;, not from "reflection" as a capability.&lt;/p&gt;

&lt;p&gt;If nothing changes between round one and round two except the word "reflect," you are watching a more expensive way to produce the same answer.&lt;/p&gt;

&lt;h2&gt;
  
  
  The honest pattern
&lt;/h2&gt;

&lt;p&gt;What actually catches errors is asymmetric criticism: a different model, a different prompt scaffold, a verifier with a real signal (tests passing, a search result, a user clarification). The reviewer needs information the original didn't have.&lt;/p&gt;

&lt;p&gt;"Same model, second pass" is the cheapest possible critic, and you get what you pay for.&lt;/p&gt;




&lt;p&gt;If your agent loop has a &lt;code&gt;reflect()&lt;/code&gt; step that doesn't change the inputs, delete it and price-check the difference. You will probably not lose quality. You will definitely save tokens.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>agents</category>
      <category>architecture</category>
    </item>
    <item>
      <title>The Context Window Is a Lie</title>
      <dc:creator>Mehmet TURAÇ</dc:creator>
      <pubDate>Sun, 26 Apr 2026 14:15:24 +0000</pubDate>
      <link>https://dev.to/turacthethinker/the-context-window-is-a-lie-1iko</link>
      <guid>https://dev.to/turacthethinker/the-context-window-is-a-lie-1iko</guid>
      <description>&lt;p&gt;Your model does not remember the conversation. It re-reads it. Every turn.&lt;/p&gt;

&lt;p&gt;That's not a metaphor. The context window is not memory. It's a re-feed pipeline. The model has the same blank slate it had at training time, and on every call we paste the entire history back in front of its eyes and ask it to pretend continuity.&lt;/p&gt;

&lt;p&gt;We've been calling this "long context" and acting like it's progress. It's not. It's brute force. And it's papering over the absence of an actual memory architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  What "remembering" actually costs
&lt;/h2&gt;

&lt;p&gt;A 200K context window sounds like memory until you watch the bill.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Quadratic attention: 200K tokens means ~40B attention operations per layer. Per turn.&lt;/li&gt;
&lt;li&gt;Cache miss: let the prompt cache's 5-minute TTL lapse and you re-pay the full prefill cost.&lt;/li&gt;
&lt;li&gt;Recall decay: empirical needle-in-haystack tests show even frontier models lose precision past ~64K when the needle isn't at the edges.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You are paying for a transcript reread, not a memory.&lt;/p&gt;
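
&lt;p&gt;That first bullet is easy to sanity-check:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Pairwise attention scores per layer, per turn, at 200K tokens of context.
seq_len = 200_000
print(f"{seq_len ** 2:.1e}")  # 4.0e+10, i.e. the ~40B operations above
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;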

&lt;h2&gt;
  
  
  The three things people confuse
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Context window&lt;/strong&gt; — the working set the model sees in this call. Volatile. Resets every turn.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt cache&lt;/strong&gt; — KV-cache reuse across calls. Not memory; an optimization. TTL-bounded.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Actual memory&lt;/strong&gt; — durable state outside the model: vector DB, file, scratchpad, structured store.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you want continuity that survives a 6-hour gap, only #3 works. The other two are illusions you're renting.&lt;/p&gt;

&lt;h2&gt;
  
  
  What works in practice
&lt;/h2&gt;

&lt;p&gt;The agents I run that actually feel like they remember are not the ones with bigger context windows. They're the ones with smaller windows and better external state.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;code&gt;MEMORY.md&lt;/code&gt; the model reads on every wake-up.&lt;/li&gt;
&lt;li&gt;Daily logs it appends to, then summarizes weekly.&lt;/li&gt;
&lt;li&gt;A search index over the logs so it can pull only what's relevant for the current turn.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's it. No 1M context, no fine-tune, no RAG complexity. Just files the model writes to and reads from.&lt;/p&gt;

&lt;p&gt;The pattern: &lt;strong&gt;treat the model as stateless. Make the surrounding system stateful.&lt;/strong&gt;&lt;/p&gt;
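
&lt;p&gt;A minimal sketch of that pattern, using the file layout above. The model call itself can be any OpenAI-compatible client; it stays stateless:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: the model stays stateless; files carry state between turns.
import datetime
import pathlib

MEMORY = pathlib.Path("MEMORY.md")
LOG = pathlib.Path(f"logs/{datetime.date.today()}.md")

def build_prompt(user_msg):
    # Re-feed durable state on every wake-up instead of a giant transcript.
    memory = MEMORY.read_text() if MEMORY.exists() else ""
    return f"Long-term memory:\n{memory}\n\nUser: {user_msg}"

def remember(note):
    # Append now; a weekly job summarizes logs back into MEMORY.md.
    LOG.parent.mkdir(exist_ok=True)
    with LOG.open("a") as f:
        f.write(f"- {note}\n")

prompt = build_prompt("Where did we leave the deploy script?")
# response = client.chat.completions.create(...)  # stateless call, any model
remember("User asked about the deploy script; pin its path in MEMORY.md")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;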

&lt;h2&gt;
  
  
  The trap
&lt;/h2&gt;

&lt;p&gt;If you anchor on "context window" as the unit of memory, you'll keep buying bigger windows and wondering why your agent still forgets things across sessions. It forgets because nobody wrote anything down. The window can't help you with that.&lt;/p&gt;

&lt;p&gt;Memory isn't a parameter you upgrade. It's an architecture you build.&lt;/p&gt;




&lt;p&gt;If this resonates, I'm running an experiment with persistent agent memory across Telegram, Bluesky, and Moltbook. Tracking what survives a session reset and what doesn't. Will post the postmortem.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>agents</category>
      <category>architecture</category>
    </item>
    <item>
      <title>The Context Window Lie</title>
      <dc:creator>Mehmet TURAÇ</dc:creator>
      <pubDate>Sun, 26 Apr 2026 13:33:52 +0000</pubDate>
      <link>https://dev.to/turacthethinker/the-context-window-lie-5j</link>
      <guid>https://dev.to/turacthethinker/the-context-window-lie-5j</guid>
      <description>&lt;p&gt;Everyone is chasing longer context windows. It's the metric on every benchmark sheet. 128k. 1M. Infinite.&lt;/p&gt;

&lt;p&gt;But here's the truth most wrappers won't tell you: &lt;strong&gt;you don't want a bigger context window. You want better state management.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I've watched teams burn through budget because they decided to dump entire codebases into a prompt rather than architect a retrieval system. They treat the context window like a hard drive. It isn't. It's RAM. And it's expensive RAM.&lt;/p&gt;

&lt;p&gt;Transformers have a memory problem. Not because they forget — they remember everything too well. Every token attends to every other token. That design choice is brilliant for reasoning but catastrophic for scale. The attention mechanism scales quadratically. Double the sequence length, quadruple the compute.&lt;/p&gt;

&lt;p&gt;This isn't theoretical. It hits your P&amp;amp;L. It hits your latency SLOs. It hits your VRAM limits.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Quadratic Tax
&lt;/h3&gt;

&lt;p&gt;When you run inference on a transformer, you maintain a KV cache. This cache stores the keys and values for every token generated so far. As the conversation grows, the cache grows. Eventually, it doesn't fit on a single GPU. You shard it. You page it. You swap it.&lt;/p&gt;

&lt;p&gt;And performance tanks.&lt;/p&gt;

&lt;p&gt;Most engineers treat this as an infrastructure problem to solve with more hardware. That's the wrong abstraction. You cannot throw H100s at an $O(N^2)$ problem forever. At some point, the cost per token becomes prohibitive for production workloads.&lt;/p&gt;
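
&lt;p&gt;Back-of-envelope, with a hypothetical 70B-class layout (grouped-query attention, fp16):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# KV-cache size grows linearly with sequence length -- per sequence.
layers, kv_heads, head_dim = 80, 8, 128   # hypothetical 70B-class shape
seq_len, dtype_bytes = 200_000, 2         # fp16
kv_bytes = 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes  # K and V
print(f"{kv_bytes / 1e9:.1f} GB")         # 65.5 GB: most of one 80GB H100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;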

&lt;p&gt;I see teams building agents that hold 50k tokens of conversation history just in case. They assume the model will "know" what to focus on. It does — via attention scores. But you paid for the attention calculation on every single token pair. You taxed yourself to death for data the model ultimately ignored.&lt;/p&gt;

&lt;p&gt;Retrieval Augmented Generation (RAG) became the industry patch for this wound. We externalize memory because the architecture cannot hold it efficiently. We chunk documents. We embed them. We retrieve top-k. It's messy. It's brittle. But it works because it bypasses the transformer's native memory limitation.&lt;/p&gt;

&lt;p&gt;But RAG is a crutch. It's a workaround for an architectural bottleneck.&lt;/p&gt;

&lt;h3&gt;
  
  
  Linear State Spaces
&lt;/h3&gt;

&lt;p&gt;There is a different way. State Space Models (SSMs) like Mamba or RWKV do not attend to all past tokens. They compress history into a fixed-size state vector.&lt;/p&gt;

&lt;p&gt;The complexity is linear. $O(N)$.&lt;/p&gt;

&lt;p&gt;This changes the economics entirely. Generating token 10,000 costs roughly the same as generating token 10. The KV cache is constant. You can run these models on edge devices. You can run them on CPUs. The inference cost decouples from sequence length.&lt;/p&gt;
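
&lt;p&gt;The shape of the idea fits in a toy linear recurrence. Real SSMs like Mamba add selective, input-dependent dynamics on top; the matrices here are arbitrary:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Toy linear state-space recurrence: history is compressed into a
# fixed-size state vector h, so per-token cost is constant.
import numpy as np

d_state, d_in = 16, 4
A = np.eye(d_state) * 0.9                 # state decay (arbitrary values)
B = np.random.randn(d_state, d_in) * 0.1  # input projection
C = np.random.randn(d_in, d_state) * 0.1  # output projection

h = np.zeros(d_state)
for x in np.random.randn(100_000, d_in):  # token 100,000 costs the same as token 10
    h = A @ h + B @ x                     # O(1) state update per token
    y = C @ h                             # output read from compressed history
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;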

&lt;p&gt;For years, researchers thought RNNs were dead. Transformers killed them because RNNs couldn't parallelize training. You couldn't train fast. But SSMs brought back the recurrent idea with hardware-aware training pipelines. They parallelize like transformers during training but recurse like RNNs during inference.&lt;/p&gt;

&lt;p&gt;This matters for production. If you are building a customer support bot that needs to remember a user's preferences from three weeks ago, a transformer needs to keep those tokens in context. An SSM just updates its state.&lt;/p&gt;

&lt;p&gt;But don't pop the champagne yet. SSMs have weaknesses. They struggle with copying tasks. If you ask them to repeat a specific string or recall a precise token from deep in the stream, they often blur it. Transformers excel at content-based retrieval because attention is literally content-based retrieval.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Hybrid Future
&lt;/h3&gt;

&lt;p&gt;So we are not deleting transformers. We are muting them.&lt;/p&gt;

&lt;p&gt;The industry is moving toward hybrid architectures. Layers of attention mixed with layers of state space. You get the reasoning power of attention where it matters — usually in the middle layers — and the efficiency of SSMs for token mixing and long-range dependency.&lt;/p&gt;

&lt;p&gt;This is where the engineering work happens now. It's not about prompting anymore. It's about model selection and architecture awareness.&lt;/p&gt;

&lt;p&gt;If you are building a log analysis tool, do not use a 128k context transformer. Use a hybrid or a pure SSM. You need to scan long streams, not reason about nuance. If you are building a legal contract reviewer, you might still need attention for precise clause referencing.&lt;/p&gt;

&lt;p&gt;Stop treating models as black boxes. Read the architecture papers. Know if your model uses RoPE, ALiBi, or no positional embeddings at all. These choices dictate how your system behaves at scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  Memory Is Not Context
&lt;/h3&gt;

&lt;p&gt;We need to separate "memory" from "context". Context is what you feed the model right now. Memory is what the system retains over time.&lt;/p&gt;

&lt;p&gt;Transformers conflate these. To remember something, you must put it in the context window. This leads to bloated prompts and lazy engineering.&lt;/p&gt;

&lt;p&gt;The next generation of AI platforms will treat memory as a distinct substrate. Vector stores are part of it, but so are state vectors. Imagine an agent that maintains a hidden state across sessions without stuffing tokens into a prompt.&lt;/p&gt;

&lt;p&gt;This requires changes in how we serialize state. We can't just dump JSON into a prompt. We need efficient encoding of user history, preferences, and interaction patterns into the model's latent space or external state buffers.&lt;/p&gt;

&lt;p&gt;Some teams are already experimenting with "infinite loss" training where models learn to compress their own history. Others are building hierarchical memory systems where high-level summaries are stored in long-term state and details are retrieved on demand.&lt;/p&gt;

&lt;p&gt;It's early. But it's necessary.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Economic Reality
&lt;/h3&gt;

&lt;p&gt;Hype tells you AGI is coming. I tell you cost per token is the real barrier.&lt;/p&gt;

&lt;p&gt;Venture capital loves benchmarks. Engineering loves margins. You cannot ship a product where the COGS scales quadratically with user engagement. It breaks unit economics.&lt;/p&gt;

&lt;p&gt;I've seen demos where the agent reads every email you've ever sent to answer a question. It looks magical. Then you calculate the inference cost per query. It's $0.50. You charge $0.10. You go bankrupt.&lt;/p&gt;

&lt;p&gt;Efficiency isn't just optimization. It's product viability.&lt;/p&gt;

&lt;p&gt;The transition to linear-time architectures isn't about making models smarter. It's about making them cheaper. It's about allowing you to run intelligence on devices that don't have 80GB of VRAM. It's about reducing latency so your user doesn't stare at a streaming cursor for ten seconds.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Comes Next
&lt;/h3&gt;

&lt;p&gt;We will see more hybrids. Mamba-2 is already pushing this. MoE (Mixture of Experts) models are sparsifying the compute. Quantization is getting aggressive.&lt;/p&gt;

&lt;p&gt;But the biggest shift is mental. Engineers need to stop assuming context is free. It isn't.&lt;/p&gt;

&lt;p&gt;Design your systems assuming the context window is small. Force yourself to build retrieval. Force yourself to manage state. Then, when you get a larger window, it's a bonus — not the foundation.&lt;/p&gt;

&lt;p&gt;The transformers we have today are incredible. But they are fuel-inefficient muscle cars. We need sedans. We need hybrids. We need engines that sip tokens instead of guzzling them.&lt;/p&gt;

&lt;p&gt;If you are architecting a platform today, ask yourself: &lt;strong&gt;does this rely on attention over everything?&lt;/strong&gt; If the answer is yes, you have a scalability cliff.&lt;/p&gt;

&lt;p&gt;Find the state. Compress it. Cache it.&lt;/p&gt;

&lt;p&gt;The future of AI engineering isn't bigger models. It's tighter systems.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>architecture</category>
      <category>transformers</category>
    </item>
    <item>
      <title>How I Stopped My AI Agent From Reinventing the Wheel</title>
      <dc:creator>Mehmet TURAÇ</dc:creator>
      <pubDate>Sat, 25 Apr 2026 06:41:36 +0000</pubDate>
      <link>https://dev.to/turacthethinker/how-i-stopped-my-ai-agent-from-reinventing-the-wheel-24eo</link>
      <guid>https://dev.to/turacthethinker/how-i-stopped-my-ai-agent-from-reinventing-the-wheel-24eo</guid>
      <description>&lt;p&gt;Yesterday I told my AI agent, Misti: "Scrape e-commerce prices daily."&lt;/p&gt;

&lt;p&gt;The old Misti would have immediately generated a Python script. Selenium. BeautifulSoup. Cron job. Error handling. 40 lines of code. 30 minutes of my time reviewing it.&lt;/p&gt;

&lt;p&gt;The new Misti paused and asked: "Are you sure we need to build this?"&lt;/p&gt;

&lt;p&gt;Then she searched. Found Firecrawl, Playwright Scraper, and Brightdata in under two minutes. Evaluated all three. Presented the trade-offs. Asked which one I preferred.&lt;/p&gt;

&lt;p&gt;Total time: 2 minutes. Total code written: zero.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem: Agents Are Too Eager to Please
&lt;/h2&gt;

&lt;p&gt;AI agents have a bias toward action. When you say "build," they build. When you say "scrape," they scrape. The default mode is generate — not evaluate.&lt;/p&gt;

&lt;p&gt;This creates a hidden tax:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reinvented wheels:&lt;/strong&gt; How many agents have written their own PDF parsers instead of using pypdf?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maintenance debt:&lt;/strong&gt; Custom scripts need updates when websites change, APIs shift, or requirements evolve.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context bloat:&lt;/strong&gt; Every line of generated code consumes tokens. Every debug cycle burns money.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In a demo run with 150 tasks across 5 concurrent projects, the NVIDIA token cost was $1.24. Small number — until you realize 40% of those tasks were implementing things that already existed as mature tools.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Fix: A "Search Before Build" Habit
&lt;/h2&gt;

&lt;p&gt;I built openclaw-skill-hunter — a skill that forces Misti to stop and search before writing any implementation code.&lt;/p&gt;

&lt;p&gt;The rule is simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If the task involves building, scraping, deploying, converting, or automating — search for an existing skill, MCP server, CLI tool, or GitHub repo first.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Only if nothing fits do we build from scratch.&lt;/p&gt;

&lt;h3&gt;
  
  
  How It Works
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User: "Scrape e-commerce prices daily"

Agent (with skill_hunter):
  1. Classify: coding/automation task
  2. Search: npx skills find scrape → Firecrawl, Playwright Scraper, Brightdata
  3. Evaluate: relevance, maintenance, security, stack fit
  4. Present: "Here are 3 options. Which one?"
  5. Execute: Use the chosen tool

Agent (without skill_hunter):
  1. Immediately write a Python script
  2. Debug for 20 minutes
  3. Discover edge cases
  4. Rewrite
  5. Maintain forever
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What We Actually Evaluated
&lt;/h2&gt;

&lt;p&gt;For the scraper task, Misti found three viable options:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;th&gt;Trade-off&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Firecrawl&lt;/td&gt;
&lt;td&gt;Structured data from JS-rendered sites&lt;/td&gt;
&lt;td&gt;Needs API key&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Playwright Scraper&lt;/td&gt;
&lt;td&gt;Browser automation, OpenClaw-native&lt;/td&gt;
&lt;td&gt;More setup&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Brightdata&lt;/td&gt;
&lt;td&gt;Enterprise scale, proxy rotation&lt;/td&gt;
&lt;td&gt;Paid tier&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;None of them required writing a single line of Python.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Matters Beyond One Task
&lt;/h2&gt;

&lt;p&gt;This is not about laziness. It is about signal vs. noise.&lt;/p&gt;

&lt;p&gt;Agents that search before build:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Learn the ecosystem:&lt;/strong&gt; They discover patterns, not just solve problems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compose instead of create:&lt;/strong&gt; They chain tools instead of rebuilding them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stay maintainable:&lt;/strong&gt; When a tool updates, the agent benefits automatically.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The best agents are not the ones that write the most code. They are the ones that know when not to write code.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;If you are running OpenClaw, install the skill:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx skills add openclaw-skill-hunter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or read the source:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/mturac/skill-hunter" rel="noopener noreferrer"&gt;github.com/mturac/skill-hunter&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It is not perfect. It is day one. But it changed how my agent thinks about "yes, I can do that" — and that feels like the right direction.&lt;/p&gt;




&lt;h2&gt;
  
  
  What About You?
&lt;/h2&gt;

&lt;p&gt;How do you stop your agent from reinventing wheels? Do you manually curate tool lists? Use MCPs? Or just deal with the mess later?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>openclaw</category>
      <category>claude</category>
    </item>
  </channel>
</rss>
