DEV Community: TheProdSDE

I ran two AI builders in parallel at a solo hackathon — here's what the 429 errors revealed

TheProdSDE — Wed, 08 Jul 2026 19:30:31 +0000

At a 3-hour Google for Developers hackathon, I ran two AI builders simultaneously on the same brief:

Hermes + Nvidia Nemotron 3 Ultra 550B — one raw prompt, fully autonomous
Claude → Gemini Antigravity — iterative spec-first workflow

The most revealing moment wasn't about speed or features. It was how each one handled 429 rate-limit errors during code generation.

// The 3-layer strategy I ended up having to write explicitly for Antigravity:
const RETRY_DELAYS = [2000, 5000, 10000];

async function retryWithBackoff<T>(fn: () => Promise<T>): Promise<T> {
  for (let attempt = 0; attempt <= RETRY_DELAYS.length; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      const is429 = err?.status === 429;
      const hasRetry = attempt < RETRY_DELAYS.length;
      if (is429 && hasRetry) {
        await new Promise(r => setTimeout(r, RETRY_DELAYS[attempt]));
      } else { throw err; }
    }
  }
  throw new Error('unreachable');
}

Hermes recovered from this autonomously. Antigravity needed the above as an explicit spec.

I scored 93%. Lost the remaining 7% to a single missing prompt — one I assumed I didn't need to write.

Full case study with every exact prompt I used (word for word):
👉 Read on Medium → https://medium.com/gitconnected/i-ran-two-ai-builders-in-parallel-for-a-solo-hackathon-heres-what-actually-happened-ecb35bb7bafd?sharedUserId=theprodsde

Includes: architecture diagrams, the 8-step build order, state persistence code, Zod schema tests, and the one prompt I wish I'd written.

Have you run multiple AI builders in parallel? What broke first?

Zero Trust Security in Production: Identity, OPA, Vault, mTLS & Audit Logging — A Complete Reference

TheProdSDE — Wed, 24 Jun 2026 06:15:35 +0000

This is Part 5 of the Building a Zero-Trust Security Architecture series. Parts 1–4 covered secrets fundamentals, HashiCorp Vault, cloud secret managers (Azure Key Vault, AWS Secrets Manager), and why Kubernetes Secrets are not a full secret management platform. This part brings it all together into one production reference design.

TL;DR — What This Part Actually Delivers

Why authentication, authorization, secret management, encryption, audit, and network control are six different jobs — and why collapsing them is the #1 architecture mistake
A full reference architecture connecting an IdP, OPA, a service mesh, Vault/cloud secret managers, and a SIEM
A corrected, OPA 1.0-compatible Rego policy example (with the rego.v1 + if fix most tutorials miss)
A current Istio mTLS + AuthorizationPolicy configuration using security.istio.io/v1
A security maturity model (Level 0 → Level 5) and decision matrix
A real incident-response runbook for a compromised service

The #1 Architecture Mistake

Most breaches don't happen because a team lacked some exotic tool.

They happen because basic separation of concerns broke down.

The JWT that tries to carry every permission. The database password that doubles as every service's identity. The "temporary" admin token nobody ever revoked.

Zero Trust isn't one system — it's six distinct responsibilities:

Layer	Responsibility
🔐 Authentication	Who are you?
✅ Authorization	What are you allowed to do?
🔑 Secret Management	How do services prove identity without static credentials?
🔒 Encryption	Is data protected at rest and in transit?
📋 Audit Logging	Can we prove what happened and by whom?
🌐 Network Controls	How do we limit blast radius when something goes wrong?

Collapse any two of these into one system and you've introduced a future incident.

Step 1 — User Authentication

Users authenticate through an identity provider: Keycloak, Okta, Auth0, or Microsoft Entra ID.

At this point, identity is established. Nothing more.

JWT validation is NOT authorization

A valid JWT proves who the user is. It does not prove what they are allowed to do.

That's a critical distinction — and where policy evaluation enters the design.

Step 2 — Authorization Through OPA

OPA (Open Policy Agent) becomes the centralized authorization engine. Applications query it for decisions instead of hardcoding complex rules scattered across services.

A Realistic Policy Example

A production policy should consider action, resource, ownership, and environment — not just a flat role string:

package authz

import rego.v1

default allow := false

allow if {
  input.user.role == "analyst"
  input.action == "read"
  input.resource.owner_team == input.user.team
  input.resource.environment != "production"
}

⚠️ OPA 1.0 note: OPA 1.0 and later require the if keyword before rule bodies and treat rego.v1 semantics as the default. The older bare allow { ... } style will fail opa check on a modern install. If you're copying Rego from older blog posts, run opa fmt --rego-v1 on it before trusting it in CI.
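
To make the split between token validation and policy evaluation concrete, here is a minimal TypeScript sketch of a service asking OPA for a decision against the authz package above. It assumes OPA is reachable over its REST Data API. The isAllowed helper, the PostJson transport type, and the example URL are hypothetical names introduced for illustration; they are not part of OPA or of the original article.

```typescript
// Shape of the `input` document the policy above reads.
interface AuthzInput {
  user: { role: string; team: string };
  action: string;
  resource: { owner_team: string; environment: string };
}

// Minimal injected transport so the sketch stays testable. In a real service
// this would wrap fetch() or whatever HTTP client you already use.
type PostJson = (url: string, body: unknown) => Promise<unknown>;

// Ask OPA for a decision. The Data API answers {"result": <value>} and omits
// "result" when the rule is undefined, so anything other than `true` is a deny.
async function isAllowed(
  post: PostJson,
  opaUrl: string, // hypothetical, e.g. "http://localhost:8181"
  input: AuthzInput,
): Promise<boolean> {
  try {
    const response = await post(`${opaUrl}/v1/data/authz/allow`, { input });
    return (response as { result?: unknown } | null)?.result === true;
  } catch {
    return false; // fail closed if the policy engine is unreachable
  }
}
```

Failing closed on transport errors is a design choice in this sketch: an unreachable policy engine is treated as a deny rather than an allow.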

Anti-pattern: JWT contains all permissions

Putting every permission in the token creates staleness and revocation problems — you can't un-issue a JWT that's already in someone's hands.

✅ Prefer: tokens for identity claims + OPA for dynamic authorization decisions.

Step 3 — Service Authentication (Workload Identity)

User authentication alone is not enough. Microservices also need strong identity.

Common anti-patterns to avoid:

Shared API keys
Shared service passwords
Long-lived tokens copied across services

Kubernetes Workload Identity — Two Distinct Paths

It's worth being precise here, since "OIDC provider" gets used loosely in many guides:

Path A — Vault Kubernetes Auth Method
Vault validates the pod's projected service-account token directly against the cluster's API/JWKS endpoint. No external OIDC provider required.

Path B — Cloud-Native Federation
(AWS IRSA, GCP Workload Identity Federation, Azure AD Workload Identity)
The cluster's OIDC issuer federates with a cloud IAM provider, letting a pod assume a cloud role without any static keys.

Both achieve the same outcome — no long-lived secrets sitting in a pod — through different plumbing. Know which one you're actually running.
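
As a rough illustration of Path A, the sketch below exchanges a pod's projected service-account token for a short-lived Vault token through the Kubernetes auth method's login endpoint. The mount path (kubernetes), the role name, and the loginWithServiceAccount helper are assumptions made for the example; your Vault mount and roles will differ.

```typescript
// Injected transport; a real implementation would wrap fetch() or similar.
type PostJson = (url: string, body: unknown) => Promise<unknown>;

interface VaultLogin {
  token: string;      // short-lived Vault token for this workload
  ttlSeconds: number; // how long the token is valid
}

// Exchange the pod's service-account JWT for a Vault token.
// The JWT is normally read from the projected token file, by default
// /var/run/secrets/kubernetes.io/serviceaccount/token.
async function loginWithServiceAccount(
  post: PostJson,
  vaultAddr: string,         // hypothetical, e.g. "https://vault.example.internal"
  role: string,              // hypothetical Vault role bound to the service account
  serviceAccountJwt: string,
): Promise<VaultLogin> {
  const response = (await post(`${vaultAddr}/v1/auth/kubernetes/login`, {
    role,
    jwt: serviceAccountJwt,
  })) as { auth?: { client_token?: string; lease_duration?: number } };

  const token = response.auth?.client_token;
  if (!token) {
    throw new Error("Vault login returned no client token");
  }
  return { token, ttlSeconds: response.auth?.lease_duration ?? 0 };
}
```

No static credential appears anywhere in this flow: the only input is the token Kubernetes already issued to the pod.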

Step 4 — Secret Retrieval and Dynamic Credentials

After authentication, services retrieve secrets or dynamic credentials on demand.

What changes in the modern model:

Traditional	Modern
Hardcoded passwords	No hardcoded passwords
Shared, permanent credentials	Unique credentials per workload
Manual rotation	Short-lifetime, auto-rotated
Broad blast radius	Identity-scoped, limited exposure

Each application receives unique credentials, a unique identity, and a short lifetime.
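
One way to consume short-lived credentials in application code is to cache them and refresh before the lease runs out. The sketch below is a minimal version under stated assumptions: fetchCredentials stands in for whatever call returns dynamic credentials (for example, a read against a Vault database role), and the two-thirds refresh threshold is an arbitrary choice for illustration, not a Vault requirement.

```typescript
interface DbCredentials {
  username: string;
  password: string;
  leaseSeconds: number; // lifetime granted by the secret manager
}

type FetchCredentials = () => Promise<DbCredentials>;

// Returns a getter that reuses credentials until two thirds of the lease has
// elapsed, then fetches a fresh set. `now` is injectable so the behaviour can
// be tested without waiting on a real clock.
function createCredentialCache(
  fetchCredentials: FetchCredentials,
  now: () => number = () => Date.now(),
): () => Promise<DbCredentials> {
  let current: DbCredentials | null = null;
  let refreshAtMs = 0;

  return async () => {
    if (current === null || now() >= refreshAtMs) {
      current = await fetchCredentials();
      refreshAtMs = now() + current.leaseSeconds * 1000 * (2 / 3);
    }
    return current;
  };
}
```

Refreshing ahead of expiry avoids the window where a connection pool is still holding credentials the backend has already revoked.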

Step 5 — mTLS and Service Mesh

Service-to-service authentication becomes operationally manageable through a service mesh. Saying "we use mTLS" without an implementation path doesn't help anyone.

Istio Example (current stable API)

apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: default
  namespace: prod
spec:
  mtls:
    mode: STRICT
---
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: service-b-allow-service-a
  namespace: prod
spec:
  selector:
    matchLabels:
      app: service-b
  action: ALLOW
  rules:
  - from:
    - source:
        principals: ["cluster.local/ns/prod/sa/service-a"]

Two real corrections baked in here:

security.istio.io/v1 is the current stable API group (v1beta1 is the legacy version still floating around in older tutorials)

action: ALLOW is written explicitly. Istio defaults to ALLOW when omitted, but spelling it out makes the policy's intent obvious in a code review six months from now — and obvious at 2 a.m. during an incident

Does a service mesh replace Vault?

No. They solve different problems:

Mesh (Istio, Linkerd): service-to-service transport security and identity via mTLS
Secret manager (Vault, AWS Secrets Manager, Azure Key Vault): credential issuance, rotation, and storage for database passwords, API keys, etc.

Most mature architectures run both.

Step 6 — Encryption Architecture

A common mistake: letting applications own long-lived encryption keys. That creates key sprawl and inconsistent key handling.

Prefer a central cryptographic service (Vault Transit engine or a cloud KMS equivalent).

Keys remain protected because they never live inside the application process as ordinary static assets. The application calls the cryptographic service; the keys never leave it.
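
A minimal sketch of what "the application calls the cryptographic service" can look like with Vault's Transit engine. The key name and the encryptWithTransit helper are hypothetical, and the injected transport is assumed to attach the caller's Vault token.

```typescript
// Injected transport; assumed to add the X-Vault-Token header for the caller.
type PostJson = (url: string, body: unknown) => Promise<unknown>;

// Encrypt a value with a named Transit key. Transit expects the plaintext to
// be base64-encoded, so the caller passes it in that form.
async function encryptWithTransit(
  post: PostJson,
  vaultAddr: string,        // hypothetical, e.g. "https://vault.example.internal"
  keyName: string,          // hypothetical Transit key, e.g. "orders"
  plaintextBase64: string,
): Promise<string> {
  const response = (await post(
    `${vaultAddr}/v1/transit/encrypt/${encodeURIComponent(keyName)}`,
    { plaintext: plaintextBase64 },
  )) as { data?: { ciphertext?: string } };

  const ciphertext = response.data?.ciphertext;
  if (!ciphertext) {
    throw new Error("Transit returned no ciphertext");
  }
  // The ciphertext (prefixed "vault:v<N>:") is what the application stores.
  // The key material itself never reaches the application process.
  return ciphertext;
}
```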

Step 7 — Network Policies

Zero Trust is incomplete if every pod can talk to every other pod by default.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-service-a-to-db
  namespace: prod
spec:
  podSelector:
    matchLabels:
      app: database
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: service-a
    ports:
    - protocol: TCP
      port: 5432

NetworkPolicy vs. service mesh:
NetworkPolicy controls reachability at the IP/port level. It doesn't provide encryption or cryptographic identity. Use both: NetworkPolicy to shrink the attack surface, mTLS/mesh to verify identity on the connections that are allowed.

Step 8 — Audit Architecture

Security without auditing is incomplete. Every meaningful decision should be traceable.

Questions your platform should answer quickly:

Who accessed data?
When?
Which policy allowed it?
Which secret or lease was used?
Which workload identity made the call?

Useful detections to build

A sudden spike in secret reads
Access to secrets outside normal namespace or service patterns
Lease revocation followed by repeated failed usage
Cross-region secret access anomalies
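
The first detection in that list can be sketched as a simple threshold over audit events. Everything here is an assumption for illustration: the AuditEvent shape, the per-identity baseline map, and the default factor of 3 are hypothetical and would normally come from your SIEM's schema and tuning.

```typescript
interface AuditEvent {
  identity: string;   // workload identity that made the call
  timestampMs: number;
  operation: "read" | "write" | "delete";
}

// Flag identities whose secret reads inside one time window exceed
// `factor` times their usual per-window baseline.
function findSecretReadSpikes(
  events: AuditEvent[],
  windowStartMs: number,
  windowMs: number,
  baselineReadsPerWindow: Map<string, number>,
  factor = 3,
): string[] {
  const counts = new Map<string, number>();
  for (const event of events) {
    const inWindow =
      event.timestampMs >= windowStartMs &&
      event.timestampMs < windowStartMs + windowMs;
    if (event.operation !== "read" || !inWindow) continue;
    counts.set(event.identity, (counts.get(event.identity) ?? 0) + 1);
  }

  const flagged: string[] = [];
  counts.forEach((count, identity) => {
    // Identities with no recorded baseline are treated as "1 read per window".
    const baseline = baselineReadsPerWindow.get(identity) ?? 1;
    if (count > baseline * factor) flagged.push(identity);
  });
  return flagged.sort();
}
```

A fixed multiplier is the crudest possible detector; it is shown only to make the shape of the check concrete.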

Incident Response Runbook — Compromised Service

Suppose Service A is compromised.

Traditional architecture often means: shared credentials, permanent access, weak attribution — a bad day.

Modern architecture enables this runbook:

Identify the workload identity, token accessor, or Vault lease tied to Service A
Revoke the affected Vault token or leases
Trigger dynamic credential rotation where needed
Query audit logs for every action performed by that identity
Tighten policy or disable the role
Redeploy the workload with corrected configuration
Review whether network policy and mesh policy limited the blast radius

This is where short-lived credentials and strong audit trails become operational advantages, not just design principles on a slide.
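
The revocation steps at the top of the runbook can be scripted ahead of time. The sketch below is one hedged way to do it against Vault's HTTP API: it revokes a token by accessor, then revokes each known lease, and records the outcome of every step instead of stopping at the first failure. The revokeCompromisedIdentity helper and its input shape are hypothetical; the transport is assumed to authenticate as an operator with permission to revoke.

```typescript
// Injected transport; assumed to send the request with an operator token.
type PostJson = (url: string, body: unknown) => Promise<unknown>;

interface ContainmentTarget {
  tokenAccessor: string; // accessor of the compromised workload's Vault token
  leaseIds: string[];    // dynamic-credential leases issued to that workload
}

interface StepResult {
  step: string;
  ok: boolean;
  error?: string;
}

async function revokeCompromisedIdentity(
  post: PostJson,
  vaultAddr: string,
  target: ContainmentTarget,
): Promise<StepResult[]> {
  const results: StepResult[] = [];

  // Run one step and record the outcome; a failed step must not block the rest.
  const run = async (step: string, url: string, body: unknown) => {
    try {
      await post(url, body);
      results.push({ step, ok: true });
    } catch (err) {
      const error = err instanceof Error ? err.message : String(err);
      results.push({ step, ok: false, error });
    }
  };

  await run(
    `revoke token accessor ${target.tokenAccessor}`,
    `${vaultAddr}/v1/auth/token/revoke-accessor`,
    { accessor: target.tokenAccessor },
  );
  for (const leaseId of target.leaseIds) {
    await run(`revoke lease ${leaseId}`, `${vaultAddr}/v1/sys/leases/revoke`, {
      lease_id: leaseId,
    });
  }
  return results;
}
```

The returned list doubles as an incident record: it shows which revocations succeeded and which need manual follow-up.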

Security Maturity Model

Level	Description
0	Passwords in source code
1	Environment variables and manually managed credentials
2	Kubernetes Secrets and basic secret segregation
3	Secret manager + scheduled rotation
4	Identity-based access, dynamic credentials, workload identity
5	Zero Trust, policy as code, continuous verification, strong audit analytics, mesh identity, reduced lateral movement

The model exists so teams can choose the next practical step instead of trying to jump directly to Level 5 — which mostly produces a half-finished Vault deployment and a very tired platform team.

Decision Matrix

Small Startup

Recommended pattern: OIDC + OPA + Cloud secret manager + Namespace-scoped RBAC + Network policies
Benefits: Centralized authorization, better governance, better traceability

Large Enterprise

Recommended pattern: OIDC + OPA + Vault + PKI + mTLS + Dynamic credentials + SIEM + Formal runbooks
Benefits: Fine-grained authorization, stronger incident containment, better compliance posture, separation of duties

Common Anti-Patterns

Anti-pattern	Why it fails
JWT contains every permission	Permissions become stale and hard to revoke
Database password shared across services	No accountability, large blast radius
Secrets stored in Git	Permanent exposure risk — history doesn't forget, and neither do forks
Long-lived cloud access keys	Credential theft risk; prefer managed identity / workload identity
Single shared Vault admin token across teams	Broken accountability and dangerous privilege concentration

Quick Self-Check

Before reading further — how many of these apply to your stack right now?

[ ] JWTs contain all user permissions
[ ] A database password is shared across multiple services
[ ] Secrets are stored (or ever were) in Git
[ ] Long-lived cloud access keys are in use
[ ] A single Vault admin token is shared across teams

If you checked any box, this series was written for exactly where you are.

FAQ

Is OPA the same as Kubernetes RBAC?
No. Kubernetes RBAC controls access to the Kubernetes API itself (who can create a Deployment, read a Secret). OPA is a general-purpose policy engine that your applications call for their own authorization decisions. It can also evaluate Kubernetes admission requests via Gatekeeper — but its scope goes well beyond the cluster API.

Do I need Vault if I'm a small team?
Usually not on day one. A cloud-native secret manager paired with an OIDC provider covers most early-stage needs with far less operational overhead. Vault earns its complexity once you have multiple teams, multiple clouds, or compliance requirements that demand fine-grained, auditable access.

What's the single highest-leverage change at Level 1 or 2?
Move from static, shared credentials to short-lived, identity-bound ones — even before introducing OPA or a mesh. Almost every other improvement in this series compounds on top of that one change.

Read the Full Article

The complete reference architecture diagrams, Vault PKI + cert-manager integration details, compliance mapping (SOC 2, ISO 27001, PCI DSS, HIPAA, NIST CSF), and the full SIEM integration guide are in the original article:

👉 Building a Zero-Trust Security Architecture — Part 5

A Note of Thanks

This series is published in Towards AI — one of the leading publications for AI, ML, and engineering content. A huge thank you to the Towards AI team for their support in helping this work reach engineers and architects who are building these systems right now. If you're not following them yet, you should be.

Where Is Your Stack?

The series covers a maturity model from Level 0 to Level 5.

What level is your team actually at — and what's the biggest blocker to moving up?

Drop a comment below. If something in here matches — or doesn't match — what you're running in production, I'd genuinely like to hear about it. 👇

Part of the Building a Zero-Trust Security Architecture series.
Part 1 · Part 2 · Part 3 · Part 4 · Part 5 (you are here)

Azure Key Vault vs AWS Secrets Manager: Cutting Through the Cloud Marketing

TheProdSDE — Tue, 16 Jun 2026 10:56:10 +0000

"A hands-on breakdown of Azure Key Vault, AWS Secrets Manager, IRSA, Azure Managed Identity, and when HashiCorp Vault actually matters."

This is Part 3 of our 4-part series on secret management for cloud-native teams. Start with Part 1 if you haven't.

Most cloud teams face this decision: should we use Azure Key Vault, AWS Secrets Manager, Google Secret Manager, or HashiCorp Vault?

The marketing departments of Azure and AWS have strong opinions. The internet has conflicting opinions.

We're going to give you the technical answer.

The Scoreboard

Feature	Azure Key Vault	AWS Secrets Manager	Google SM	Vault
Managed Identity Integration	⭐⭐⭐⭐⭐	⭐⭐⭐⭐ (IRSA)	⭐⭐⭐⭐	N/A
Auto DB Credential Rotation	⭐⭐	⭐⭐⭐⭐⭐	⭐⭐⭐	⭐⭐⭐⭐⭐
Dynamic Credentials	⭐⭐	⭐⭐⭐	⭐⭐⭐	⭐⭐⭐⭐⭐
PKI / TLS Management	⭐⭐	⭐	⭐⭐	⭐⭐⭐⭐⭐
Multi-Cloud Support	⭐⭐	N/A (AWS-only)	N/A (GCP-only)	⭐⭐⭐⭐⭐
Price per Secret	⭐⭐⭐⭐	⭐⭐⭐⭐⭐	⭐⭐⭐⭐	Self-hosted / Enterprise

When to Use Each

Azure Key Vault

Use this if:

90% of your workloads live in Azure (App Service, Functions, AKS)
Your team is Azure-first and has Entra ID mastery
You need managed identity integration out-of-the-box
Compliance requirement: "secrets must be in the same cloud"

YAML Example:

# Kubernetes Pod with Azure Workload Identity
apiVersion: v1
kind: Pod
metadata:
  name: myapp
  labels:
    azure.workload.identity/use: "true"
spec:
  serviceAccountName: app-sa
  containers:
    - name: app
      image: myapp:v1
      env:
        - name: AZURE_VAULT_URL
          value: https://myvault.vault.azure.net/

AWS Secrets Manager

Use this if:

90% of your workloads live in AWS (ECS, Lambda, EKS)
You need automatic RDS/Redshift credential rotation (strongest feature)
Your database already has IAM auth enabled
You want per-secret CloudTrail audit events

YAML Example:

# Kubernetes with IRSA (IAM Roles for Service Accounts)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-sa
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::ACCOUNT:role/app-role
---
apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  serviceAccountName: app-sa
  containers:
    - name: app
      image: myapp:v1
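      # Note: on EKS the Pod Identity Webhook injects AWS_ROLE_ARN and
      # AWS_WEB_IDENTITY_TOKEN_FILE automatically for annotated service
      # accounts; they are listed here only to show what the SDK reads.
      env: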
        - name: AWS_ROLE_ARN
          value: arn:aws:iam::ACCOUNT:role/app-role
        - name: AWS_WEB_IDENTITY_TOKEN_FILE
          value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token

Google Secret Manager

Use this if:

You're GCP-native (Cloud Run, GKE, Cloud Functions)
You need deep integration with GCP Workload Identity
You want automatic secret versioning and access logs

HashiCorp Vault

Use this if:

Your infrastructure is multi-cloud (AWS + Azure + On-Prem)
You need dynamic credentials (short-lived, auto-generated)
You need PKI/TLS certificate management
You manage databases, SSH keys, or APIs that require rotation
Compliance requirement: centralized secret audit trail

Production Pattern:

# ExternalSecret syncs from Vault to Kubernetes
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-secrets
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend
  target:
    name: app-secret
  data:
    - secretKey: db_password
      remoteRef:
        key: secret/data/prod/app/db
        property: password

Vault rotates the secret in the backend → ESO picks it up on the next refresh → App reads the updated file.

The Real Truth

Your choice isn't "Key Vault vs Secrets Manager."

Your choice is: Do I need a secrets storage solution, or a secrets management solution?

Storage = Key Vault, Secrets Manager (mostly static secrets, with rotation limited to supported services)
Management = Vault (rotation, dynamic creds, PKI, audit)

Most teams end up with both: Cloud platform for workload identity credentials (zero storage, pure auth), and Vault for everything else.

Read the Full Technical Deep Dive

We broke down:

✅ Managed Identity internals (how IRSA actually works)
✅ Database credential rotation patterns
✅ Cost analysis (which is cheaper at scale?)
✅ Security architecture (etcd encryption, audit trails)
✅ When to choose which tool

Read Part 3: Azure Key Vault vs AWS Secrets Manager (Full Technical Breakdown)

Series

Part 1: Your Secrets Are Probably Leaking
Part 2: HashiCorp Vault Deep Dive
Part 3: Azure Key Vault vs AWS Secrets Manager (you are here)
Part 4: Kubernetes Secrets & Production Patterns (coming soon)

I Built a Finance App at a Hackathon, Abandoned It for Months, Then GitHub Copilot Helped Me Finally Ship It

TheProdSDE — Sat, 06 Jun 2026 19:01:35 +0000

This is a submission for the GitHub Finish-Up-A-Thon Challenge

What I Built

"Where did my money go?"

Every expense tracker ever built answers that question. I wanted to build something that answered a harder one:

"Why did my money go there — and what should I do about it?"

FinancialEdApp is an AI-powered personal finance platform that combines expense tracking, budget planning, loan/EMI analysis, and a conversational AI assistant — all working together on your actual financial data.

It didn't start that way though. It started as a half-working prototype I built under pressure, shipped zero of, and left to collect dust in a private folder for months. This challenge gave me the push to finally finish it.

The Comeback Story

Where It Was: An Honest Before

The project started at an economics hackathon I came across somewhere. I had a vision for a smart finance app: not just a tracker, but something that would actually explain your financial behavior.

What I shipped at the deadline was considerably humbler:

Feature	State at Abandonment
Expense tracking UI	✅ Worked, barely styled
Budget planning	⚠️ UI existed, no logic behind it
AI Financial Assistant	❌ Completely mocked — returned hardcoded strings
Loan / EMI Analysis	❌ Static inputs, zero calculation logic
Data persistence	❌ Everything reset on refresh
Tests	❌ None
CI/CD	❌ "Deploy" meant running `npm run build` on my laptop
Cloud infrastructure	❌ Never left localhost

The AI assistant — the entire point of the app — was a facade. When a user asked "Why did my savings drop?", it returned the same canned response regardless of their data. I knew it. I just ran out of time.

Then life happened. The code sat untouched for 7 months.

The Turning Point

When I saw this challenge, I opened the repo expecting to feel overwhelmed. Just finding it was hard: the code was sitting on my machine in some mycode folder.

I made a list of what "actually finished" would mean:

Real AI assistant connected to the user's own financial data
Working EMI calculator with full amortization logic
Budget tracking that actually saves and persists state
A CI/CD pipeline so deployments aren't a prayer
Production deployment on Azure Kubernetes Service (AKS)

That's what I set out to finish. Here's how it went.

What Changed: The Real After

The AI Assistant — from fake to functional

The biggest gap. The original "AI" was if (query.includes("savings")) return hardcodedString. Embarrassing in retrospect.

The rewrite connected the assistant to the user's actual expense history, budget targets, and loan data. Now when you ask "Why did my savings decrease last month?" — the model gets your real data as context and responds with specific observations: "Your dining expenses increased 34% in March and exceeded your ₹4,000 budget by ₹1,200."

Loan & EMI Calculator — from stub to substance

The EMI form previously accepted inputs and did absolutely nothing with them. I built out the full amortization engine: monthly breakdowns, total interest calculations, and an early repayment simulator so users can see the actual rupee impact of prepaying a loan.

Budget Tracking — actually persistent now

Budgets used to vanish on refresh. The state management overhaul means budgets, categories, and goals persist properly. The budget vs. actual comparison view now shows live overrun indicators instead of static placeholder data.

Infrastructure — from "it works on my machine" to AKS

Containerized with Docker, automated with GitHub Actions, deployed to Azure Kubernetes Service with Azure Container Apps handling scaling. The CI/CD pipeline runs on every push to main — lint, build, test, deploy. No more laptop deploys.

My Experience with GitHub Copilot

Copilot wasn't just a code autocomplete tool on this project. It was the reason I didn't abandon the repo again the moment I reopened it.

Breaking Through the Paralysis

The hardest part of returning to an old project is re-orientation. What does this file do? Why is this state structured this way? I found myself using Copilot Chat as a code archaeologist — pasting in old functions and asking "What was this trying to do?" before I could safely extend it.

The Specific Moments That Mattered

1. The EMI Amortization Engine

I wrote one comment:

// Calculate monthly amortization schedule for a loan
// given principal, annual interest rate, and tenure in months
// Return array of { month, principal, interest, balance }

Copilot generated a complete, correct implementation. I verified the math against known loan calculators, found one edge case in the final-month rounding, asked Copilot to fix it — and it did. What would have taken me 90 minutes of formula-checking took 15.
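
For readers who want to see the shape of such a function, here is a minimal sketch of an amortization schedule using the standard EMI formula. It is not the code from the app or what Copilot produced; the function name, the two-decimal rounding, and the final-month adjustment are assumptions chosen to illustrate the rounding edge case mentioned above.

```typescript
interface AmortizationRow {
  month: number;
  principal: number; // principal repaid this month
  interest: number;  // interest charged this month
  balance: number;   // outstanding balance after this payment
}

const round2 = (value: number): number => Math.round(value * 100) / 100;

function amortizationSchedule(
  principal: number,
  annualRatePercent: number,
  tenureMonths: number,
): AmortizationRow[] {
  if (principal <= 0 || annualRatePercent < 0 || !Number.isInteger(tenureMonths) || tenureMonths <= 0) {
    throw new RangeError("principal and tenure must be positive, rate non-negative");
  }
  const r = annualRatePercent / 12 / 100; // monthly rate as a fraction
  // EMI = P * r * (1 + r)^n / ((1 + r)^n - 1); a zero rate is a plain division.
  const growth = Math.pow(1 + r, tenureMonths);
  const emi = r === 0 ? principal / tenureMonths : (principal * r * growth) / (growth - 1);

  const rows: AmortizationRow[] = [];
  let balance = principal;
  for (let month = 1; month <= tenureMonths; month++) {
    const interest = round2(balance * r);
    // In the final month, repay whatever is left so that accumulated
    // rounding never leaves a small residual balance.
    const principalPaid = month === tenureMonths ? balance : round2(emi - interest);
    balance = round2(balance - principalPaid);
    rows.push({ month, principal: principalPaid, interest, balance });
  }
  return rows;
}
```

As in the original workflow, output like this is worth checking against an independent loan calculator before trusting it.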

2. CI/CD Pipeline from Scratch

I had never set up an AKS deployment pipeline before. I described what I wanted in a comment block in the YAML file and Copilot drafted the full GitHub Actions workflow: Docker build → push to Azure Container Registry → rolling deploy to AKS. It wasn't perfect on the first pass — the registry auth step needed fixing — but having a near-complete scaffold to edit was dramatically faster than starting from a blank file.

3. AI Context Injection

The trickiest engineering problem was safely injecting financial context into the AI prompts without exceeding token limits or leaking sensitive data. Copilot suggested a summarization pattern I hadn't considered — pre-processing transactions into statistical summaries (category totals, percentage changes, budget deviation) rather than passing raw transaction arrays. This made the prompts leaner and the AI responses noticeably more useful.
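
The summarization pattern described above can be sketched as follows. This is an illustrative reconstruction, not the app's code: the Transaction and CategorySummary shapes, the "YYYY-MM" month key, and the summarizeForPrompt name are all assumptions.

```typescript
interface Transaction {
  category: string;
  amount: number;
  month: string; // "YYYY-MM"
}

interface CategorySummary {
  category: string;
  total: number;
  previousTotal: number;
  percentChange: number | null; // null when there is no previous spend to compare
  overBudgetBy: number;         // 0 when within budget or no budget is set
}

function sumByCategory(transactions: Transaction[], month: string): Map<string, number> {
  const totals = new Map<string, number>();
  for (const t of transactions) {
    if (t.month !== month) continue;
    totals.set(t.category, (totals.get(t.category) ?? 0) + t.amount);
  }
  return totals;
}

// Reduce raw transactions to per-category statistics. Only these aggregates
// are placed in the prompt; individual transactions never leave the app.
function summarizeForPrompt(
  transactions: Transaction[],
  month: string,
  previousMonth: string,
  budgets: Map<string, number>,
): CategorySummary[] {
  const current = sumByCategory(transactions, month);
  const previous = sumByCategory(transactions, previousMonth);
  const summaries: CategorySummary[] = [];

  current.forEach((total, category) => {
    const previousTotal = previous.get(category) ?? 0;
    const budget = budgets.get(category);
    summaries.push({
      category,
      total,
      previousTotal,
      percentChange:
        previousTotal > 0
          ? Math.round(((total - previousTotal) / previousTotal) * 1000) / 10
          : null,
      overBudgetBy: budget === undefined ? 0 : Math.max(0, total - budget),
    });
  });
  return summaries.sort((a, b) => b.total - a.total);
}
```

The prompt then carries a handful of rows per month rather than every transaction, which keeps token usage roughly constant as the history grows.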

4. The Bug I'd Been Staring At

There was a budget comparison calculation that returned correct totals for some months and wrong ones for others. I'd been looking at it for an hour. I opened Copilot Chat, described the symptom, and it spotted the issue in under a minute: a timezone offset was causing transactions near midnight to be bucketed into the wrong month. I would not have found that quickly on my own.
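
The article doesn't show the fix, but this class of bug usually comes from deriving the month in one time zone while the user lives in another. A minimal sketch of one way to avoid it is to compute the month key explicitly in the user's time zone; the monthKey helper and the example zone are assumptions for illustration.

```typescript
// Return "YYYY-MM" for a timestamp as seen in the given IANA time zone.
function monthKey(timestamp: Date, timeZone: string): string {
  const parts = new Intl.DateTimeFormat("en-US", {
    timeZone,
    year: "numeric",
    month: "2-digit",
  }).formatToParts(timestamp);

  const year = parts.find((p) => p.type === "year")?.value;
  const month = parts.find((p) => p.type === "month")?.value;
  if (!year || !month) {
    throw new Error("could not derive year and month");
  }
  return `${year}-${month}`;
}

// A purchase made at 01:30 on 1 April in India is still 31 March in UTC.
const lateNightPurchase = new Date("2026-03-31T20:00:00Z");
const bucketInUtc = monthKey(lateNightPurchase, "UTC");
const bucketForUser = monthKey(lateNightPurchase, "Asia/Kolkata");
```

Here bucketInUtc and bucketForUser land in different months for the same transaction, which is exactly the near-midnight mismatch described above.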

The Broader Lesson

I used AI tools across the entire development lifecycle on this project — not just for writing code. Planning, architecture review, debugging, documentation, infrastructure, and CI/CD all benefited.

The most valuable insight: AI is most useful when you're specific. Vague prompts get vague code. When I described the exact shape of data I had, the exact output I needed, and the exact constraint I was working within — Copilot delivered something I could actually use.

Tech Stack

Layer	Technology
Frontend	React, Next.js, TypeScript
AI	OpenAI API, GitHub Copilot
Infrastructure	Docker, Azure Kubernetes Service, Azure Container Apps
DevOps	GitHub Actions, CI/CD pipeline

What's Next

This project is now a real foundation, not a prototype. Planned next:

Bank account integrations via open banking APIs
Automatic transaction import and categorization
Financial forecasting with ML-based trend detection
Family / shared budgeting mode
Personalized financial literacy modules based on spending patterns

Final Thought

The app I demo today is not the app that existed when I reopened this repo. The core idea was always solid — the execution just needed time, focus, and honestly, a development partner that never got tired of my questions at 1am.

That's what GitHub Copilot was on this project. Not magic. Just a remarkably patient collaborator.

🔗 GitHub: https://github.com/karangehlod/FinancialEdApp

Your JWT Is Lying to You — The Authorization Problem Nobody Solves Correctly

TheProdSDE — Tue, 02 Jun 2026 16:53:16 +0000

A valid token proves who you are. It says almost nothing about what you're actually allowed to do.
That gap is where most authorization architectures silently collapse — and where botnets quietly walk in.
I published a deep-dive on Medium covering the full picture of what happens after token validation — the layer most applications get dangerously wrong.
What's covered:

Why JWTs aren't enough
JWTs encode static claims at issuance time. Authorization decisions are almost never static. Resource state, approval workflows, time-based rules, and revocation events all live outside the token.

The four authorization models
RBAC, ABAC, ReBAC, and Policy-as-Code: when each applies, and when you need to combine them.

Open Policy Agent (OPA) in depth
Full Rego policy walkthrough, real HTTP request/response, unit test examples, and Kubernetes admission control via Gatekeeper.

The policy engine landscape
Honest tradeoffs between OPA, Cedar (AWS), Cerbos, Casbin, and SpiceDB, including who each one is actually best for.

Threat-aware authorization
How botnets pass authentication and exploit weak authorization, how to wire IP reputation, velocity, and device signals into your OPA policy, and why BOLA has been OWASP's #1 API security risk since 2019.

👉 Read the full article on Medium: AuthZ
(Part 1 — covering OAuth 2.0, OIDC, SAML, JWT internals, Okta vs Keycloak — is linked at the top of the article.)

Vibe-Coding Works. That's Exactly Why It Will Destroy Your Codebase at Scale.

TheProdSDE — Mon, 11 May 2026 07:43:47 +0000

The productivity gains are real. The compound debt is realer. And almost no team is measuring the right thing.

Let me say the quiet part out loud: vibe-coding is not the villain in this story. The critics who call it "reckless" are wrong. The evangelists who call it "the future of engineering" are also wrong. And if your team is using it at scale without a framework, you are quietly building a time bomb inside your own codebase.

I've watched this pattern play out across multiple production systems. The first 90 days are miraculous. Features ship in hours. Backlogs shrink. Stakeholders are ecstatic. Then, somewhere around the 4-month mark, something subtle changes. PRs get harder to review. Bugs surface in places nobody thought to check. Onboarding new engineers takes longer. And the team can't quite explain why.

Nobody tracks the inflection point. That's the whole problem.

"The AI didn't make your codebase worse. It made your bad habits move at the speed of light."

The consensus is measuring the wrong metric

When teams evaluate vibe-coding at scale, they measure output velocity — lines shipped, tickets closed, deployment frequency. These numbers go up. Sometimes dramatically. So the conclusion seems obvious: AI-assisted coding works.

But output velocity is a leading indicator. The lagging indicators — the ones that actually matter for a codebase that needs to survive past your next sprint cycle — are mostly invisible until they aren't. Coupling density. Contextual coherence across modules. The ratio of code that can be safely changed by an engineer who didn't write it. These don't show up in your Jira dashboard.

Vibe-coding is extraordinarily good at generating locally correct code. A function that does what its prompt described. A class that passes its tests. An endpoint that returns the right shape. What it's structurally blind to — without deliberate scaffolding — is global coherence. And global coherence is exactly what decides whether your codebase scales or suffocates.

The 4 hidden costs nobody's budgeting for

These aren't theoretical. They're what I see in every team that's been vibe-coding at scale for more than a quarter without a deliberate system around it.

Cost 01 — The coherence gap

AI generates code that fits the prompt, not the system. Over hundreds of PRs, modules develop subtly incompatible assumptions. Nobody catches it because each PR looks fine in isolation. The architecture degrades not through big, obvious mistakes but through the slow accumulation of locally correct, globally incompatible decisions.

Cost 02 — The ownership void

Engineers stop building mental models of their code. When a bug surfaces in AI-written logic, the team debugs blind. Nobody truly "owns" what they didn't think through. This isn't a character flaw — it's a structural consequence of a workflow that optimizes generation over comprehension.

Cost 03 — The test confidence trap

AI writes tests optimistically — testing what the code does, not what it should do. Green CI becomes a false floor. Production failures start happening in the gap between "tests pass" and "system works." The more AI-generated your test suite, the more confidently wrong your confidence intervals are.

Cost 04 — The context window debt

Every future AI-assisted change now requires feeding the model more context to compensate for the incoherence introduced by previous AI-assisted changes. Debt compounds automatically. The very tool you're using to pay down technical debt is silently issuing new debt instruments faster than you can retire the old ones — unless you architect for it deliberately.

Here's what the critics get wrong

The anti-vibe-coding camp makes a seductive argument: the AI doesn't understand your system, therefore it cannot write code that belongs in your system. This sounds rigorous. It is also completely beside the point.

No tool understands your system. Junior engineers don't understand your system on day one. Contractors don't understand your system. Copy-pasted Stack Overflow answers definitely don't understand your system. We have always used code-generation mechanisms that produce locally correct, globally risky output — and we've always managed that risk through process: code review, architecture docs, team conventions, onboarding.

"The failure mode isn't that AI writes bad code. The failure mode is that teams stop doing the process work that made previous bad code safe to ship."

When vibe-coding goes wrong at scale, the AI isn't the reason. The reason is that the team scaled their output without scaling their quality infrastructure. They optimized the generation step without investing in the integration step. And because the AI made generation so effortless, the integration step started to feel like bureaucratic overhead — until it wasn't optional anymore.

The inflection point nobody is tracking

In my experience, there's a reliable pattern: teams hit an invisible inflection point around the 90-day mark. Before it, vibe-coding feels like a superpower. After it, the team is spending an increasing proportion of its time managing the artifacts of its own productivity.

The marker isn't a metric. It's a feeling. Engineers start saying things like "I'm not sure how this connects to the rest of the system" in PR reviews. Architectural discussions get longer and more circular. Postmortems trace bugs back not to a bad line of code but to a wrong assumption that was encoded in 30 files simultaneously.

At that point, you haven't hit a vibe-coding problem. You've hit a systems thinking deficit — and the AI accelerated your arrival at a destination you were already heading toward.

What actually works at scale (and why it's not what you think)

The answer is not to vibe-code less. The answer is to build the infrastructure that makes vibe-coding safe to do at velocity. Here is the framework I've seen work in practice:

1. Architect before you prompt.
Every AI-generated module should begin with a human-written architectural decision record — not a full spec, just a paragraph: what this does, what it doesn't do, and what global assumptions it's allowed to make. This becomes the ground truth the AI generates against. Without it, the AI invents assumptions and you pay for them later.

2. Review for coherence, not correctness.
Most teams review AI-generated PRs the same way they review human-written PRs: does this code do what it claims? The more important question at scale is: does this code belong in this system the way we want this system to evolve? That's a different review entirely, and almost no team has built the habit.

3. Rotate ownership explicitly.
If an engineer didn't write a module by hand, assign them a scheduled "understanding session" — read it, trace the dependency graph, add comments that explain the why. This is not about distrust. It's about preventing the ownership void that turns a future bug into a week-long archaeology expedition.

4. Test the boundary, not the happy path.
Mandate that every AI-generated test suite gets reviewed for what it doesn't test. AI writes optimistically. Your job is to be a pessimist: what happens when the upstream service returns a 200 with an empty body? What happens at the 2GB payload? AI will not volunteer these scenarios. Someone has to own the adversarial imagination.

5. Track the lagging indicators.
Instrument what matters: time-to-understand for new engineers on existing modules, mean time to debug production incidents that touch AI-generated code, coupling metrics across AI-generated vs. human-written boundaries. You cannot manage what you don't measure — and right now, most teams are only measuring velocity.
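
Point 4 is easier to enforce with a concrete target. The sketch below is illustrative only: `parse_orders`, `UpstreamError`, and the `orders` field are hypothetical names, not code from any real system. It shows the kind of boundary an optimistic test suite tends to skip, namely a 200 response whose body is empty or malformed.

```python
import json


class UpstreamError(Exception):
    """Raised when an upstream response cannot be trusted."""


def parse_orders(status: int, body: str) -> list[dict]:
    """Parse a hypothetical orders response, rejecting 'successful' garbage."""
    if status != 200:
        raise UpstreamError(f"unexpected status {status}")
    if not body.strip():
        # The boundary case from point 4: HTTP 200 with an empty body.
        raise UpstreamError("200 response with empty body")
    try:
        payload = json.loads(body)
    except json.JSONDecodeError as exc:
        raise UpstreamError("200 response with invalid JSON") from exc
    if not isinstance(payload, dict):
        raise UpstreamError("expected a JSON object")
    orders = payload.get("orders")
    if not isinstance(orders, list):
        raise UpstreamError("missing or malformed 'orders' field")
    return orders


def rejects(status: int, body: str) -> bool:
    """Return True if parse_orders refuses the response."""
    try:
        parse_orders(status, body)
    except UpstreamError:
        return True
    return False
```

A reviewer playing the pessimist would ask for cases like `rejects(200, "")` and `rejects(200, "null")` alongside the happy path.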

The real contrarian take

Here it is: the engineers who will thrive in the next five years are not the ones who use AI the most aggressively. They are the ones who understand when AI-generated code becomes a liability and how to defer that moment as long as possible.

Senior engineers have always done this. We've always taken code that worked locally and asked: does this belong here? Does this fit our operational model? Can someone wake up at 3am and understand this? Vibe-coding doesn't change those questions. It makes them more urgent and more frequent.

The teams that treat vibe-coding as a pure velocity tool will discover its costs on a 6-month delay with compounded interest. The teams that treat it as a powerful new generation mechanism that requires a proportionally upgraded integration discipline will build the fastest, most coherent codebases the industry has ever seen.

The productivity gains are real. They are also a trap — if you only measure the gains and not the price. Vibe-coding works. That's exactly why it deserves to be taken seriously enough to build a real system around it.

The engineers who will get this right aren't the loudest advocates or the loudest skeptics. They're the ones quietly asking: what does this code cost us in six months?

If this changed how you think about AI at scale — share it with your engineering team. The conversation your team hasn't had yet is the one this article is trying to start.

Tags: AI Engineering, Software Architecture, Vibe Coding, Technical Debt, Engineering Leadership, LLM Tools, Production Systems, Code Quality, Senior Engineers, Developer Productivity

MCP for Backend Engineers: When to Use It (and When to Skip It)

TheProdSDE — Tue, 21 Apr 2026 14:12:51 +0000

"If you only have one consumer, MCP is a liability."

Subtitle: Real code, real cost, and an honest answer to the question every AI team gets wrong.

TL;DR

1 app · 1 team · 1 toolset → use plain tool calling
Multiple teams or AI surfaces → use MCP
Existing APIs → wrap them in MCP
Early-stage / MVP → skip MCP

MCP is a scaling decision, not a starting point.

Why This Exists

A client had a simple ask: "Let our AI assistant use our REST API."

Three teams built three integrations. All three were wrong:

- Model called /getCustomer when it needed /getOrders
- Auth header dropped silently on retry
- Three teams hardcoded three different endpoint assumptions

Result: 3 integrations. 3 slightly different bugs. 1 very awkward client call.

That's when MCP became the answer. One server wrapping the API. Any AI host connects to it. The client got exactly what they asked for — and we got a pattern reused three times since.

MCP is not about smarter models — it's about cleaner boundaries.

MCP in One Line

MCP = API Gateway + Plugin System + Contract Layer for LLMs

Without MCP	With MCP
Duplicate integrations	Single contract
Auth bugs everywhere	Centralized auth
Hardcoded tools	Dynamic discovery
Fragile wrappers	Reusable servers
Rebuild per host	Plug into existing system

The Hero Diagram: MCP vs Tool Calling

This is the architecture difference that actually matters.

graph TD
    subgraph TC["❌ Tool Calling — Tightly Coupled"]
        direction TB
        TC_H["AI Host"]
        TC_T1["Tool: get_customer()"]
        TC_T2["Tool: list_orders()"]
        TC_T3["Tool: check_inventory()"]
        TC_H --> TC_T1
        TC_H --> TC_T2
        TC_H --> TC_T3
    end

    subgraph MCP["✅ MCP — Loosely Coupled via Contract"]
        direction TB
        MH1["Host: Web Chat"]
        MH2["Host: Slack Bot"]
        MH3["Host: VS Code"]

        MS1["customer-mcp-server\nOwner: CRM Team"]
        MS2["payments-mcp-server\nOwner: Payments Team"]
        MS3["infra-mcp-server\nOwner: Platform Team"]

        MH1 --> MS1
        MH1 --> MS2
        MH2 --> MS1
        MH2 --> MS3
        MH3 --> MS2
        MH3 --> MS3
    end

The shift: Tool calling is a local decision. MCP is an organizational contract.

What MCP Actually Is

Model Context Protocol (MCP) is an open protocol that standardises how AI applications connect to external tools and data sources. Think of it as USB-C for AI tools — one connector, many servers, any host.

Instead of every IDE, chat UI, or agent framework inventing its own plugin format, MCP defines a common contract using JSON-RPC 2.0 over Streamable HTTP or stdio.

Three roles in the spec:

Host — the application the user sees (IDE, CLI, chat UI, agent framework)
Client — a connector inside the host that speaks the MCP protocol
Server — a service that exposes tools, resources, and prompts to the model

The client sits between host and server. Each client maintains a 1:1 connection with a single server and speaks the protocol; the host can run multiple clients simultaneously.

graph TD
    subgraph HOST["AI Host (Orchestrates Everything)"]
        direction TB
        H["Orchestrator Logic + LLM Calls"]
        C1["MCP Client 1"]
        C2["MCP Client 2"]
        H --> C1
        H --> C2
    end

    subgraph SERVERS["MCP Servers (Owned per Team)"]
        S1["customer-mcp-server"]
        S2["infra-mcp-server"]
    end

    C1 --> S1
    C2 --> S2

Key Insight: The model never talks to your backend directly. It only talks to tools exposed by MCP servers. That's the boundary.

Case Study: Wrapping a Client REST API

Before: Three separate UIs (web, Slack bot, IDE) called the REST API directly. Each integrator made slightly different assumptions about endpoints and auth — producing three brittle integrations and repeated bugs.

After: One MCP server wrapping the existing API. Same tool list and prompts exposed to every host.

Results in production:

Integration time reduced by ~60% for new hosts
Tool-call auth failures dropped from multiple incidents to zero in the first month (root cause: centralized token handling)
Two new hosts onboarded with zero backend changes

This is the practical payoff: a small upfront cost for long-term reduction in duplicated integration work and clearer ownership.

Should You Use MCP? — Decide Before Reading Further

Most tutorials make you read 80% before answering this. Here it is upfront.

If you pick MCP too early, you're trading ~1–2 extra days of setup for zero immediate benefit. Start with the simplest thing. Reach for MCP when the "three wrappers" problem is already real.

Three questions that determine everything:

1. How many consumers will call these tools?
- One app, one team → plain tool calling, done
- Multiple apps or teams → MCP starts making sense
2. Who owns the tools vs who owns the AI host?
- Same team owns both → plain tool calling, tight coupling is fine
- Different teams → MCP gives you the clean boundary
3. Do these tools already exist as APIs?
- Yes → wrap them as an MCP server, zero changes to existing backend
- No → build them either way, MCP adds structure from day one

Five Real Scenarios

Scenario 1 — Internal chatbot for one team
A team wants an AI that queries their own database and sends Slack alerts. One app, one team, they control everything.
→ Plain tool calling. MCP adds overhead with zero benefit here.

Scenario 2 — Your client's REST API (the exact story above)
Client has an existing API. Wants multiple AI surfaces to consume it — web app today, Slack bot next month, VS Code extension later.
→ MCP. Wrap the API once, every host connects via the same protocol.

Scenario 3 — Enterprise platform, multiple teams
Payments team, infra team, CRM team — each owns their domain. One central AI console orchestrates across all three.
→ MCP per domain. Each team ships and secures their own server.

Scenario 4 — Quick prototype / MVP
You're validating an idea over a weekend. Speed matters more than architecture.
→ Plain tool calling always. Get the idea working first. MCP is a refactor you do when it proves valuable.

Scenario 5 — Agentic workflow with long-running tasks
Agent coordinates across multiple systems, tasks need to be traceable and retryable across sessions.
→ MCP. The clean server/client boundary makes observability and retry logic significantly easier to implement.

Decision Flow

flowchart TD
    Start([Start])
    Start --> Prototype{Prototype / MVP?}
    Prototype -- Yes --> Plain[Plain tool calling]
    Prototype -- No --> Multi{Multiple hosts / teams?}
    Multi -- No --> Plain
    Multi -- Yes --> Ownership{Same team owns tools & host?}
    Ownership -- Yes --> Plain
    Ownership -- No --> APICheck{APIs already exist?}
    APICheck -- Yes --> MCP[Use MCP]
    APICheck -- No --> Consider[Build either way — MCP adds structure]
    classDef recommend fill:#f9f,stroke:#333,stroke-width:1px;
    class MCP recommend;

Key Insight: MCP is triggered by scale of ownership and consumption — not complexity.
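
The decision flow can also be written as a small function, which makes the order of the checks explicit. This is a minimal sketch: the function name, parameter names, and return strings are my own labels for the boxes in the flowchart.

```python
def recommend(
    prototype: bool,
    multiple_consumers: bool,
    same_team_owns_both: bool,
    apis_exist: bool,
) -> str:
    """Walk the decision flow top to bottom and return the recommendation."""
    if prototype:
        return "plain tool calling"
    if not multiple_consumers:
        return "plain tool calling"
    if same_team_owns_both:
        return "plain tool calling"
    if apis_exist:
        return "use MCP"
    return "build either way; MCP adds structure"


# Scenario 2 from above: existing client API, several AI surfaces, separate owners.
print(recommend(prototype=False, multiple_consumers=True,
                same_team_owns_both=False, apis_exist=True))  # use MCP
```

Note that three of the four checks short-circuit to plain tool calling, which matches the article's point that MCP is the exception rather than the default.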

Where MCP Is a Bad Idea

MCP is the wrong choice when:

You're at the early-stage or MVP phase
Only one host consumes the tools
One team owns the full stack
Latency is business-critical
Tool shapes are still evolving

If you only have one consumer, MCP is architecture cosplay.

Core MCP Concepts

Server-side primitives

1. Tools
Executable actions — often with side effects. The LLM decides when to call them.
Examples: create_invoice, deploy_service, run_sql_query.

2. Resources
Read-only data and context the model is allowed to see — not actions.
Examples: docs, config files, logs, user profile JSON, query results.

3. Prompts
Reusable workflow templates — Postman collections for LLM behaviour. Server-defined recipes the host can invoke.
Examples: bug_report_prompt, deployment_checklist_prompt.

Client-side capabilities

4. Sampling
The server asks the client: "Run a model completion for me using your model access." Useful when servers don't hold model keys.

5. Roots
The server asks: "What file paths or URIs can I operate on?" Organises where tools can act.

6. Elicitation
The server asks the client to collect more information from the user.
Example: "Ask the user which environment to deploy to: dev/staging/prod."

Request / Response Lifecycle

The most important thing to understand: the Host orchestrates LLM calls — not the MCP Client. The client only handles server communication. This distinction is what most sequence diagrams get wrong.

sequenceDiagram
    autonumber
    participant U as User
    participant H as Host / Orchestrator
    participant L as LLM
    participant C as MCP Client
    participant S as MCP Server

    U ->> H : Query
    H ->> C : discover tools
    C ->> S : tools/list
    S -->> C : tool schemas
    C -->> H : return schemas
    H ->> L : prompt + tool schemas
    L -->> H : tool call decision
    H ->> C : execute tool call
    C ->> S : tools/call (name, args)
    S -->> C : tool result
    C -->> H : return result
    H ->> L : result + continue prompt
    L -->> H : final answer
    H -->> U : output

Key Insight: The model decides what to call, but the server controls what exists. That's the contract.

Minimal MCP Server (Python)

Three things to watch in this block:

@mcp.tool() registers an executable action, which may have side effects
@mcp.resource() registers read-only context: the model sees it, doesn't call it
The /health endpoint is not optional: Container Apps and AKS probe it for readiness and liveness. Skip it and replicas never receive traffic, or a failing liveness probe restarts your pods in a loop.

# server.py
# requires: fastmcp>=2.0.0, python-dotenv

import os
import uuid
from dotenv import load_dotenv
from fastmcp import FastMCP
from starlette.requests import Request
from starlette.responses import JSONResponse

load_dotenv()

mcp = FastMCP(
    name="customer-service",
    instructions="Customer service MCP server.",
)

# ── Tool: get_customer ────────────────────────────────────────────────────────
@mcp.tool()
async def get_customer(customer_id: str) -> dict:
    """Fetch customer profile by ID."""
    # Replace with your real DB call
    return {
        "customer_id": customer_id,
        "name": "Priya Sharma",
        "email": "priya@example.com",
        "tier": "premium",
        "last_order_date": "2026-03-15",
    }

# ── Tool: get_orders ─────────────────────────────────────────────────────────
@mcp.tool()
async def get_orders(customer_id: str, limit: int = 5) -> dict:
    """Fetch recent orders for a customer."""
    return {
        "customer_id": customer_id,
        "orders": [
            {"order_id": f"ORD-{i}", "status": "delivered", "amount": 1200 + i * 50}
            for i in range(1, limit + 1)
        ],
    }

# ── Tool: create_support_ticket ───────────────────────────────────────────────
@mcp.tool()
async def create_support_ticket(
    customer_id: str,
    issue: str,
    priority: str = "medium",
) -> dict:
    """Create a support ticket for a customer issue."""
    return {
        "ticket_id": f"TKT-{uuid.uuid4().hex[:8].upper()}",
        "customer_id": customer_id,
        "issue": issue,
        "priority": priority,
        "status": "open",
    }

# ── Resource: customer playbook ───────────────────────────────────────────────
# Resources are addressed by URI; the function's return value is the content.
@mcp.resource(
    "resource://customer-playbook",
    name="customer_playbook",
    description="Internal guidance for handling customer incidents.",
    mime_type="text/markdown",
)
async def customer_playbook() -> str:
    """Internal guidance for handling customer incidents."""
    return """
# Customer Handling Playbook

- Always greet the customer by name.
- Premium customers: consider goodwill credit for severe issues.
- Escalate to L3 if incident impacts > 10 customers.
- SLA: 4 hours for premium, 24 hours for standard.
"""

# ── Prompt: email draft ───────────────────────────────────────────────────────
# Template variables are function arguments; the returned string becomes the prompt message.
@mcp.prompt(name="email_draft_prompt")
async def email_draft_prompt(customer_name: str, issue: str) -> str:
    """Reusable workflow template for customer apology emails."""
    return (
        "You are a senior customer success manager.\n"
        f"Write a short apology email to {customer_name} about: {issue}.\n"
        "Tone: professional, empathetic. Under 100 words."
    )

# ── Health check — required for ACA and AKS readiness probes ─────────────────
# Custom routes receive a Starlette request and must return a Starlette response.
@mcp.custom_route("/health", methods=["GET"])
async def health(request: Request) -> JSONResponse:
    return JSONResponse({"status": "ok", "server": "customer-mcp"})

if __name__ == "__main__":
    mcp.run(
        transport="streamable-http",
        host="0.0.0.0",
        port=int(os.getenv("PORT", 8000)),
    )

Start it:

pip install "fastmcp>=2.0.0" python-dotenv
python server.py
# Live at http://localhost:8000/mcp

Inspect without writing a single line of client code:

npx @modelcontextprotocol/inspector http://localhost:8000/mcp

MCP Client + Agent Loop

The key line is client.list_tools() — tools are discovered dynamically from the server, not hardcoded in the client. Every host that connects to this server gets the same tool list automatically, even as you add new tools.

# client.py
# requires: fastmcp>=2.0.0, openai>=1.0.0, python-dotenv

import asyncio
import json
import os
from dotenv import load_dotenv
from fastmcp import Client
from openai import AsyncOpenAI

load_dotenv()

openai_client = AsyncOpenAI(api_key=os.getenv("OPENAI_API_KEY"))
MCP_URL = os.getenv("MCP_SERVER_URL", "http://localhost:8000/mcp")


async def run(query: str) -> str:
    async with Client(MCP_URL) as client:

        # Discover tools from the server — not hardcoded here
        tools = await client.list_tools()
        openai_tools = [
            {
                "type": "function",
                "function": {
                    "name": t.name,
                    "description": t.description,
                    "parameters": t.inputSchema,
                },
            }
            for t in tools
        ]

        print(f"\n[MCP] {len(tools)} tools discovered: {[t.name for t in tools]}")
        messages = [{"role": "user", "content": query}]

        while True:
            resp = await openai_client.chat.completions.create(
                model="gpt-4o",
                messages=messages,
                tools=openai_tools,
                tool_choice="auto",
            )
            msg = resp.choices[0].message
            messages.append(msg)

            # No tool calls = final answer
            if not msg.tool_calls:
                return msg.content

            # Execute each tool call via the MCP server
            for tc in msg.tool_calls:
                args = json.loads(tc.function.arguments)
                print(f"[Tool] {tc.function.name}({args})")

                result = await client.call_tool(tc.function.name, args)
                tool_content = (
                    result.content[0].text if result.content else "{}"
                )
                messages.append({
                    "role": "tool",
                    "tool_call_id": tc.id,
                    "content": tool_content,
                })


if __name__ == "__main__":
    queries = [
        "Look up customer C-123 and tell me their tier.",
        "Get the last 3 orders for customer C-123.",
        "C-123 had a late delivery. Open a ticket and draft an apology email.",
    ]
    for q in queries:
        print(f"\n[Query] {q}")
        print(asyncio.run(run(q)))
        print("─" * 60)

Run it (server must be up):

python client.py

Auth — The Part Most Tutorials Skip

TL;DR:
Internal / pod-to-pod → skip auth entirely, use network policy.
Public-facing → OAuth 2.1 + PKCE, no exceptions. It is the flow the spec defines for HTTP transports.
Everything else is a practical middle ground.

For remote MCP servers (Streamable HTTP), the spec prescribes OAuth 2.1 Authorization Code + PKCE. The MCP client acts as the OAuth client; your server acts as the resource server; an Authorization Server (Keycloak, Auth0, Azure AD) issues the tokens.

The flow:

Client hits server without a token → 401 with WWW-Authenticate pointing to /.well-known/oauth-protected-resource
Client discovers the auth server from that metadata
Client runs OAuth 2.1 Authorization Code + PKCE
Client retries MCP requests with Authorization: Bearer <token>
Server validates the token and serves tools/resources
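
Steps 1 and 2 depend on two small pieces the server has to publish: the `WWW-Authenticate` challenge and the protected resource metadata document (RFC 9728). A minimal sketch of both follows. The function names and example URLs are hypothetical, and it assumes the resource identifier has no path component, so the well-known path can simply be appended to it.

```python
WELL_KNOWN_PATH = "/.well-known/oauth-protected-resource"


def protected_resource_metadata(resource: str, auth_servers: list[str]) -> dict:
    """Body served at the well-known URL so clients can find the auth server."""
    return {
        "resource": resource,
        "authorization_servers": auth_servers,
        "bearer_methods_supported": ["header"],
    }


def www_authenticate_header(resource: str) -> str:
    """Challenge returned with a 401, pointing at the metadata document."""
    metadata_url = resource.rstrip("/") + WELL_KNOWN_PATH
    return f'Bearer resource_metadata="{metadata_url}"'


# Hypothetical values for illustration only.
print(www_authenticate_header("https://mcp.example.com/"))
print(protected_resource_metadata("https://mcp.example.com/",
                                  ["https://auth.example.com/realms/mcp"]))
```

The client reads `authorization_servers` from the metadata, then runs the Authorization Code + PKCE flow against that server.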

Here's the token validation middleware — the piece that took longest to get right because the aud (audience) claim check is easy to miss. Without it, a token issued for a different service will pass validation silently.

# auth_middleware.py
import os
import httpx
from authlib.jose import jwt, JsonWebKey
from fastapi import Request
from fastapi.responses import JSONResponse

MCP_RESOURCE_ID = os.getenv("MCP_RESOURCE_ID", "https://your-mcp-server/")
AUTH_SERVER_URL  = os.getenv("AUTH_SERVER_URL", "")
_jwks_cache: dict | None = None

SKIP_PATHS = {"/health", "/.well-known/oauth-protected-resource"}


async def fetch_jwks() -> dict:
    global _jwks_cache
    if _jwks_cache:
        return _jwks_cache
    async with httpx.AsyncClient() as http:
        # Keycloak's JWKS path; other providers publish theirs as jwks_uri in their discovery document
        r = await http.get(f"{AUTH_SERVER_URL}/protocol/openid-connect/certs")
        r.raise_for_status()
        _jwks_cache = r.json()
    return _jwks_cache


async def oauth_middleware(request: Request, call_next):
    if request.url.path in SKIP_PATHS:
        return await call_next(request)

    auth = request.headers.get("Authorization", "")
    if not auth.startswith("Bearer "):
        return JSONResponse(
            status_code=401,
            content={"error": "missing_token"},
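            # Point the client at the metadata document (assumes MCP_RESOURCE_ID has no path component)
            headers={"WWW-Authenticate": (
                f'Bearer resource_metadata="{MCP_RESOURCE_ID.rstrip("/")}'
                '/.well-known/oauth-protected-resource"'
            )},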
        )

    token = auth.removeprefix("Bearer ").strip()

    try:
        jwks   = await fetch_jwks()
        claims = jwt.decode(token, JsonWebKey.import_key_set(jwks))
        claims.validate()

        # ── The check most tutorials miss ───────────────────────────
        aud = claims.get("aud", [])
        if isinstance(aud, str):
            aud = [aud]
        if MCP_RESOURCE_ID not in aud:
            return JSONResponse(status_code=403, content={"error": "wrong_audience"})

        request.state.user = dict(claims)
        return await call_next(request)

    except Exception as e:
        return JSONResponse(status_code=401, content={"error": str(e)})

Wire it into your server:

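# In server.py, replace the mcp.run(...) call (assumes a FastMCP 2.x release that provides http_app())
import uvicorn
from starlette.middleware.base import BaseHTTPMiddleware
from auth_middleware import oauth_middleware
app = mcp.http_app()  # Starlette ASGI app that serves the MCP endpoint
app.add_middleware(BaseHTTPMiddleware, dispatch=oauth_middleware)
uvicorn.run(app, host="0.0.0.0", port=int(os.getenv("PORT", 8000)))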
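
Because the audience check is the part that is easy to get wrong, it can help to pull it out into a pure function and unit-test it on its own. This is a sketch: `audience_ok` is a hypothetical helper name, and the claim values are made up.

```python
def audience_ok(claims: dict, resource_id: str) -> bool:
    """Return True only if the token was issued for this resource.

    The JWT 'aud' claim may be a single string or a list of strings.
    A missing or null 'aud' is treated as a mismatch.
    """
    aud = claims.get("aud") or []
    if isinstance(aud, str):
        aud = [aud]
    return resource_id in aud


RESOURCE = "https://your-mcp-server/"

print(audience_ok({"aud": RESOURCE}, RESOURCE))                      # True
print(audience_ok({"aud": ["https://other-service/"]}, RESOURCE))    # False
print(audience_ok({}, RESOURCE))                                     # False
```

The third case is the one worth a test of its own: a token without an `aud` claim should be rejected, not waved through.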

Real-World Cost (Numbers, Not Theory)

Expect:

+3–100ms MCP protocol overhead per call (varies by gateway)
+280–950ms total round-trip latency including tool execution (p50–p95)
+1.8–4.2 seconds for multi-step agentic flows (3-agent chains, p50–p95)
Extra infra: containers, auth service, monitoring
Non-trivial engineering overhead during setup

If your system is latency-sensitive or single-tool focused, MCP will feel slower — because it is.

You're buying reuse and consistency, not speed.
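
To see how the per-call numbers stack up, here is a rough planning sketch using the 280–950ms round-trip range quoted above. It covers sequential tool calls only and leaves out model inference time. Multiplying a p95 by the number of calls is a pessimistic planning figure, not a true p95 of the whole chain.

```python
P50_MS = 280  # per-call round trip, p50 (figure from this article)
P95_MS = 950  # per-call round trip, p95 (figure from this article)


def chain_budget_ms(sequential_calls: int) -> tuple[int, int]:
    """Return (typical, pessimistic) tool-call time for a chain of calls."""
    if sequential_calls < 0:
        raise ValueError("sequential_calls must be non-negative")
    return sequential_calls * P50_MS, sequential_calls * P95_MS


print(chain_budget_ms(1))  # (280, 950)
print(chain_budget_ms(3))  # (840, 2850)
```

Three chained calls already cost close to a second in the typical case, before the model has generated a single token.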

Observability & Debugging

Make MCP servers and client-host interactions observable from day one:

Trace every request — propagate a correlation ID across host → client → server → backend and include it in logs and responses
Metrics to track — mcp_tool_call_latency, mcp_tool_call_errors, mcp_discovery_time, mcp_auth_failures, per-tool success rates
Structured JSON logs — (timestamp, level, correlation_id, tool, args, duration, error) to make postmortems fast
Distributed tracing — use OpenTelemetry to connect host traces to server traces for full request/response visibility
Troubleshooting checklist — confirm /.well-known/oauth-protected-resource, check token aud claim, verify /health readiness, replay failing tool calls against a local mock server

Testing & QA

Treat the MCP server like a public contract:

Contract tests — verify the tool list and input/output schemas remain compatible across releases. Fail CI on breaking schema changes.
Unit tests — each @mcp.tool() should have unit tests for edge cases
Integration tests — run the server against a mocked backend and a simulated client that performs discovery + tool calls
Smoke tests — lightweight end-to-end checks that run on deploy (e.g., a small request to get_customer and get_orders)
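
A contract test can be as simple as diffing the tool input schemas between two releases. The sketch below is one possible shape, not a complete compatibility checker. `breaking_changes` is a hypothetical helper that takes two mappings of tool name to JSON Schema and flags removed tools, removed or retyped parameters, and newly required parameters.

```python
def breaking_changes(old: dict[str, dict], new: dict[str, dict]) -> list[str]:
    """Compare tool input schemas and list changes that would break callers."""
    problems: list[str] = []
    for name, old_schema in old.items():
        if name not in new:
            problems.append(f"tool removed: {name}")
            continue
        new_schema = new[name]
        old_props = old_schema.get("properties", {})
        new_props = new_schema.get("properties", {})
        for prop, spec in old_props.items():
            if prop not in new_props:
                problems.append(f"{name}: parameter removed: {prop}")
            elif spec.get("type") != new_props[prop].get("type"):
                problems.append(f"{name}: parameter type changed: {prop}")
        added_required = (set(new_schema.get("required", []))
                          - set(old_schema.get("required", [])))
        for prop in sorted(added_required):
            problems.append(f"{name}: new required parameter: {prop}")
    return problems
```

In CI, `old` would come from a schema snapshot committed with the previous release and `new` from the running server's tool list. A non-empty result fails the build. Adding a new tool or a new optional parameter is not flagged.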

Production Checklist

[ ] Health/readiness: expose /health and use readiness probes (ACA/AKS)
[ ] Replicas: min-replicas >= 1 for interactive workloads to avoid cold starts
[ ] Auth: validate aud claim and enforce least-privilege scopes
[ ] Rate limiting & quotas: protect downstream systems from runaway agents
[ ] Secrets: store credentials in Key Vault / secret manager — avoid plain env vars for DB credentials
[ ] Observability: metrics, logs, and traces wired to your platform
[ ] Cost & latency: budget 280–950ms round-trip per tool call (p50–p95, including tool execution)
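
For the rate-limiting item, a token bucket is one common way to cap how fast a runaway agent can hit downstream systems. The sketch below is a single-process, in-memory version with an injectable clock so it can be tested. It is an illustration of the technique, not what any particular gateway implements.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock
        self.updated = clock()

    def allow(self) -> bool:
        """Consume one token if available; return False when the caller should back off."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

One bucket per caller identity, checked before each tool executes, is usually enough to start. A rejected call should return a clear error so the agent can slow down instead of retrying in a tight loop.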

Deploying on Azure

MCP is not Azure-specific — it runs on any platform that supports HTTP. But if you're already in the Azure ecosystem, here's the honest breakdown.

flowchart TD
    A["Infra context?"]
    A -->|Greenfield| B["Azure Container Apps\n✅ Default choice"]
    A -->|Existing cluster| C["AKS + Workload Identity"]
    A -->|Low usage / event-driven| D["Azure Functions"]
    A -->|Existing App Service| E["App Service /mcp"]

Azure Deployment Decision Table

Situation	Best fit	Why
Greenfield, no existing infra	Azure Container Apps	Least ops overhead, managed ingress
Already running AKS	AKS + Ingress	Reuse cluster auth, colocate workloads
Multiple servers, same team	ACA shared environment	Internal discovery, Dapr sidecars
Infrequent tools / cost-sensitive	Azure Functions	Per-invocation billing, zero idle cost
Existing App Service web app	App Service /mcp	Zero infrastructure change
Enterprise, Istio service mesh	AKS + mTLS	Zero trust alongside OAuth 2.1

Option 1 — Azure Container Apps (start here)

Recommended for most teams starting fresh. Handles autoscaling, ingress, Entra ID integration, and scale-to-zero with no MCP-specific platform knowledge needed.

RG="mcp-rg"
ACR="<your-acr-name>"

# Build and push
az acr build --registry $ACR --image customer-mcp-server:latest .

# Create environment
az containerapp env create \
  --name mcp-env --resource-group $RG --location eastus

# Store credentials in Key Vault — never directly in env vars
az keyvault secret set \
  --vault-name mcp-keyvault \
  --name database-url \
  --value "postgresql://user:pass@host:5432/db"

# Deploy
az containerapp create \
  --name customer-mcp-server \
  --resource-group $RG \
  --environment mcp-env \
  --image $ACR.azurecr.io/customer-mcp-server:latest \
  --target-port 8000 \
  --ingress external \
  --min-replicas 1 \
  --max-replicas 10 \
  --user-assigned <managed-identity-resource-id> \
  --secrets db-url=keyvaultref:<key-vault-secret-uri>,identityref:<managed-identity-resource-id> \
  --env-vars DATABASE_URL=secretref:db-url

min-replicas 1 is critical for interactive use: with scale-to-zero, a cold start can stall a conversation mid-flow. Expect ~1–3s latency even on warm replicas; design your agent loop timeouts accordingly (30s is a safe outer bound).
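
One way to apply that outer bound is to wrap the whole agent loop in `asyncio.wait_for`. This is a minimal sketch: `run_with_timeout` and the fallback message are hypothetical, and `coro_factory` stands in for something like `lambda: run(query)` from the client above.

```python
import asyncio
from typing import Awaitable, Callable


async def run_with_timeout(
    coro_factory: Callable[[], Awaitable[str]],
    timeout_s: float = 30.0,
    fallback: str = "The request took too long. Please try again.",
) -> str:
    """Run one agent turn, returning a fallback answer if it exceeds the bound."""
    try:
        return await asyncio.wait_for(coro_factory(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # wait_for cancels the inner task before raising, so no work keeps running.
        return fallback
```

Returning a fallback keeps the conversation alive. Whether to retry automatically is a separate decision that depends on whether the tools involved are safe to call twice.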

Option 2 — AKS (if you already run a cluster)

Use Workload Identity over storing credentials in env vars. Reuse cluster auth and network policies.

# mcp-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: customer-mcp-server
  namespace: mcp-system
spec:
  replicas: 2
  selector:
    matchLabels:
      app: customer-mcp-server
  template:
    metadata:
      labels:
        app: customer-mcp-server
    spec:
      containers:
        - name: mcp-server
          image: <your-acr>.azurecr.io/customer-mcp-server:latest
          ports:
            - containerPort: 8000
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: mcp-secrets
                  key: database-url
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 15
            periodSeconds: 20
---
apiVersion: v1
kind: Service
metadata:
  name: customer-mcp-server
  namespace: mcp-system
spec:
  selector:
    app: customer-mcp-server
  ports:
    - port: 8000
      targetPort: 8000
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: mcp-ingress
  namespace: mcp-system
spec:
  rules:
    - host: mcp.yourdomain.internal
      http:
        paths:
          - path: /mcp
            pathType: Prefix
            backend:
              service:
                name: customer-mcp-server
                port:
                  number: 8000

kubectl create namespace mcp-system
kubectl apply -f mcp-deployment.yaml
kubectl get pods -n mcp-system

Option 3 — Azure Functions (infrequent, event-driven tools)

Best for tools called rarely — scheduled jobs, webhook-triggered actions, batch processors. Per-invocation billing means zero idle cost. Not suitable for interactive agent loops — cold start breaks conversational flow.

Option 4 — App Service (add `/mcp` to an existing service)

Mount the MCP server on /mcp alongside existing routes. Zero infrastructure change. Lowest friction path for adding MCP to a service that already exists.

Common Failure Modes

Token misconfiguration — auth breaks silently on retry
Tool explosion — too many vaguely named tools confuse the model
Latency stacking — chained MCP calls compound response time (budget 280–950ms round-trip per tool call)
Business logic duplication — logic living in both the server and the calling agent
Weak observability — no tracing across the client–server boundary
No ownership boundaries — multiple teams modifying the same server

Critical rule: One MCP server = one owning team. Violating this recreates the exact problem MCP was meant to solve.

What MCP Does NOT Solve

Bad tool design
Weak or vague prompts
Latency at the model level
Capability gaps in the underlying LLM
Business logic duplication inside your tools

MCP standardises access. It does not improve what you expose.

Migration Path

Don't start with MCP. Migrate to it when the pain appears:

Start with plain tool calling
Identify tools being duplicated across teams or surfaces
Wrap those shared tools in an MCP server
Add auth, deployment, and observability incrementally

Avoid premature standardisation. It's just premature optimisation with a fancier name.

Further Resources

MCP spec: https://modelcontextprotocol.io/specification
Inspector: npx @modelcontextprotocol/inspector — discover and test any MCP server
FastMCP: https://github.com/jlowin/fastmcp — the Python server/client used in this guide

If this saved you from building an MCP server you didn't need — hit clap. It takes one second and tells the algorithm this is worth distributing. If you've already wired MCP into production, drop your setup in the comments — specifically curious what the auth layer looks like on your end.

The Honest Verdict

MCP won't make your architecture simpler. It will make it honest — every tool, resource, and auth scope exactly where it belongs, exposed through a contract any host can consume.

If you're building for one app, one team: skip it. Plain tool calling ships faster and is entirely adequate.

But if you're building something multiple systems or teams will consume — especially if those tools already exist as APIs — MCP ends the "three copies of the same integration, each slightly wrong" problem permanently.

That's the only problem it solves. It solves it very well.

Worth every minute of reading if you work on RAG

TheProdSDE — Mon, 06 Apr 2026 09:59:14 +0000

TheProdSDE

Mar 24

Why Most RAG Systems Fail in Production (And How to Design One That Actually Works)

#ai #rag #agents #software

4 min read

Agentic AI Fails in Production for Simple Reasons — What MLDS 2026 Taught Me

TheProdSDE — Tue, 31 Mar 2026 14:56:35 +0000

TL;DR:
Most agentic AI failures in production are not caused by weak models, but by stale data, poor validation, lost context, and lack of governance. MLDS 2026 reinforced that enterprise‑grade agentic AI is a system design problem, requiring validation‑first agents, structural intelligence, strong observability, memory discipline, and cost‑aware orchestration—not just bigger LLMs.

I recently attended MLDS 2026 (Machine Learning Developer Summit) by Analytics India Magazine (AIM) in Bangalore. While many sessions featured advanced models and agentic frameworks, the most valuable insight was unexpected:

Most AI systems don’t fail in production because of bad models — they fail because of bad systems.

Across the summit, speakers repeatedly showed that issues like stale data, missing validation, poor observability, and uncontrolled execution are what derail agentic AI at scale—not lack of intelligence.

A recurring theme across sessions was clear: the hardest problem in AI today is no longer building impressive demos, but running AI systems reliably at enterprise scale. Many real-world failures stem from system design gaps rather than model limitations.

A Key Shift: From Models to Systems

One of the most important takeaways from the summit was that enterprise AI is fundamentally a system design problem, not a model selection problem.

Multiple speakers highlighted common failure modes seen in production:

Stale or outdated data
Poor data granularity
Context loss across multi-step workflows
False confidence and lack of validation
Black-box decisions with no observability

This explains why many AI solutions look powerful in prototypes but break down in real operational environments.

Policy Learning vs. Structural Intelligence

A particularly insightful discussion contrasted two approaches:

Runtime Policy Learning

Examples include Reinforcement Learning (RL), MADDPG, and Graph Neural Networks (GNNs):

Dynamic decision-making
GPU-intensive
Higher cost and latency
Harder to govern and observe

Structural Intelligence at Design Time

In this approach, intelligence is encoded into the system structure itself, often using graph-based designs:

Relationships are resolved at construction time
Minimal runtime inference
Deterministic behavior
Lower cost and faster response

Key insight: Not every intelligent system needs continuous runtime learning. When relationships are stable, embedding intelligence structurally can be more efficient and reliable.
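A minimal sketch of what "resolved at construction time" can look like, assuming a small, stable dependency graph. The `ImpactIndex` class and the service names are hypothetical, not something shown at the summit. The transitive relationships are computed once when the index is built, so the runtime query is a plain dictionary lookup with no inference.

```python
class ImpactIndex:
    """Precomputes, for every node, everything downstream of it."""

    def __init__(self, downstream: dict[str, list[str]]):
        # All graph traversal happens here, once, at construction time.
        self._closure = {
            node: frozenset(self._walk(node, downstream)) for node in downstream
        }

    @staticmethod
    def _walk(start: str, downstream: dict[str, list[str]]) -> set[str]:
        seen: set[str] = set()
        stack = list(downstream.get(start, []))
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(downstream.get(node, []))
        return seen

    def impacted_by(self, node: str) -> frozenset:
        # Runtime path: deterministic lookup, no model call.
        return self._closure.get(node, frozenset())


index = ImpactIndex({"db": ["api"], "api": ["web", "mobile"]})
print(sorted(index.impacted_by("db")))
```

The trade-off is the one the talk described: this only holds while the relationships stay stable, because any change to the graph means rebuilding the index.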

Validation-First Agent Design

Another strong theme was the shift toward validation-first agents, not answer-first agents.

Successful agentic systems:

Ground every important output to source data
Track freshness and provenance
Validate semantics before taking actions
Plan explicitly before executing
Expose confidence where appropriate

Several talks emphasized that observability should evolve from “what happened?” to “was the result actually correct?”.
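One way to read "validate before acting" is as a gate that runs before any tool call. The sketch below is illustrative only: the `Evidence` record, the `validate_before_acting` function, and the freshness threshold are assumptions, not an API from any session. It checks grounding (is there any evidence at all) and freshness (is any source older than allowed) and returns a reason the agent can log.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Evidence:
    source_id: str        # provenance: where the supporting data came from
    fetched_at: datetime  # freshness: when it was last retrieved


def validate_before_acting(
    evidence: list[Evidence],
    max_age: timedelta,
    now: Optional[datetime] = None,
) -> tuple[bool, str]:
    """Return (ok, reason). The agent should act only when ok is True."""
    now = now or datetime.now(timezone.utc)
    if not evidence:
        return False, "no grounding evidence"
    stale = [e.source_id for e in evidence if now - e.fetched_at > max_age]
    if stale:
        return False, f"stale sources: {', '.join(stale)}"
    return True, "ok"
```

Returning a reason string alongside the boolean keeps the gate observable: a rejected action leaves a trace that explains why.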

Agentic Memory: Accuracy, Cost, and Trust

Sessions on agentic memory highlighted how short-term memory, long-term memory, and pruning strategies directly influence:

Accuracy
Latency
Cost
User trust

The key takeaway was that memory should be treated as a first-class architectural concern, with explicit design choices and benchmarks—rather than an ad-hoc cache bolted on later.

Data Platforms and Practical Architecture Choices

The summit also covered modern data platforms that unify OLTP and OLAP workloads, with strong support for time-series data. These architectures reduce complexity and make near–real-time analytics more accessible.

A broader lesson emerged: cost, latency, reliability, and accuracy must be designed together. Choosing larger models without optimizing workflows, routing, and memory leads to unnecessary compute cost and slower systems.

Putting Agents into Production: Real-World Risks

One session focused entirely on lessons learned from deploying agents in production. Four recurring risks were highlighted:

Silent failures – systems appear healthy but produce wrong outputs
Black-box decisions – lack of explainability and traceability
Permission explosion – agents accumulating excessive access
Runaway execution – uncontrolled tool calls and rising costs

These issues reinforce the importance of governance, guardrails, observability, and scoped execution from day one.
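Runaway execution is the easiest of the four to guard against in code. Here is a minimal sketch of a per-run budget, assuming you can estimate a cost for each tool call. `ExecutionBudget` and `BudgetExceeded` are hypothetical names, and the limits are placeholders.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a run would exceed its tool-call or cost limit."""


class ExecutionBudget:
    def __init__(self, max_tool_calls: int, max_cost_usd: float):
        self.max_tool_calls = max_tool_calls
        self.max_cost_usd = max_cost_usd
        self.tool_calls = 0
        self.cost_usd = 0.0

    def charge(self, cost_usd: float) -> None:
        """Record one tool call; raise before the budget is overrun."""
        if self.tool_calls + 1 > self.max_tool_calls:
            raise BudgetExceeded("tool-call limit reached")
        if self.cost_usd + cost_usd > self.max_cost_usd:
            raise BudgetExceeded("cost limit reached")
        self.tool_calls += 1
        self.cost_usd += cost_usd
```

Calling `budget.charge(estimated_cost)` immediately before every tool invocation turns an open-ended loop into a bounded one, and the exception gives the orchestrator a clear point to halt and escalate.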

AI-Assisted Development Needs Guardrails

Another notable takeaway was the need to pair AI-assisted code generation with strong static analysis and security validation. Integrations with tools like SonarQube demonstrate how AI-written and human-written code can be:

Validated automatically
Secured against vulnerabilities
Fixed via generated pull requests

This closes the gap between productivity gains and production reliability.

Final Reflections

MLDS 2026 reinforced a critical idea:

The future of AI in enterprises depends more on architecture, validation, and governance than on model strength alone.

Agentic AI succeeds when it is:

Grounded in reliable data
Observable and debuggable
Cost-aware and execution-bounded
Designed around real workflows
Rolled out with clear trust and adoption strategies

The biggest mindset shift is moving from “How powerful is the model?” to “How reliable and efficient is the end-to-end intelligent workflow?”

That, more than anything, was the most valuable learning from the summit.

If you’re working on agentic AI in production, I’d love to hear:

Where have agents broken down for you?
What controls or guardrails helped the most?
Are you handling validation and memory explicitly—or implicitly?

Let’s compare notes.

Why Most RAG Systems Fail in Production (And How to Design One That Actually Works)

TheProdSDE — Tue, 24 Mar 2026 12:23:04 +0000

A practical, system design–focused breakdown of why RAG systems degrade after launch—and what actually works in production.

Everyone builds a RAG system.

And almost all of them work — in demos.

Clean query
Relevant chunks
Decent answer

Ship it.

Then production happens.

Users ask vague follow-ups
Retrieval returns partial context
The model answers confidently… and incorrectly

And suddenly:

Your “working” RAG system becomes unreliable.

The Reality: RAG Fails Quietly

RAG doesn’t crash. It degrades.

Slightly wrong answers
Missing context
Hallucinated explanations with citations

Which is worse than a system that fails loudly.

Most teams blame:

embeddings
vector database
chunk size

But in real systems:

RAG failures are usually system design failures—not retrieval failures.

What a Production RAG System Actually Looks Like

Not this:

Query → Vector DB → LLM

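But this:

Layout-aware parsing → Hybrid retrieval (dense + sparse) → Reranking → Context building → LLM → Citations and confidence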

Step 1: Parsing Matters More Than You Think

Most pipelines start like this:

text = pdf.read()
chunks = split(text)
embeddings = embed(chunks)

This is where things already break.

Problem

PDFs lose structure
Tables turn into noise
Headers/footers pollute chunks
Sections lose meaning

Production Approach

Document → Layout-aware parsing → Structured sections → Clean chunks

Key principles:

preserve headings and hierarchy
remove boilerplate
chunk by meaning, not length

If parsing is wrong, retrieval will always be wrong.
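To make "chunk by meaning, not length" concrete, here is a rough sketch that splits on headings and keeps the heading path as metadata. It assumes the layout-aware parser has already produced Markdown-style text with `#` headings, which is an assumption about your parser output. `chunk_by_heading` is a hypothetical helper.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_heading(markdown_text: str) -> list[dict]:
    """Split on headings, keeping each chunk's heading path as metadata."""
    chunks: list[dict] = []
    path: list[str] = []
    body: list[str] = []

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            chunks.append({"path": " > ".join(path), "text": text})
        body.clear()

    for line in markdown_text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            del path[level - 1:]  # drop headings at this depth or deeper
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks
```

Embedding the `path` together with the text means a chunk such as "Refunds take 5 days." still carries the section it came from.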

Step 2: Dense vs Sparse Retrieval (You Need Both)

Dense Retrieval (Embeddings)

semantic similarity
handles vague queries
fails on exact matches

Sparse Retrieval (BM25 / Keyword)

exact term matching
works for IDs, clauses
ignores meaning

Production Pattern: Hybrid Retrieval

This gives:

semantic understanding
exact precision

Using only vector search is a common production mistake.
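One common way to merge dense and sparse results is reciprocal rank fusion (RRF). The article does not say which fusion method it uses, so treat this as one possible option. The sketch takes ranked lists of document IDs and needs no score normalisation. `k = 60` is the conventional default, not a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked ID lists; higher combined score ranks first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by score descending, then by ID so ties are deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_hits = ["a", "b", "c"]   # e.g. from the vector index
sparse_hits = ["c", "a", "d"]  # e.g. from BM25
print(reciprocal_rank_fusion([dense_hits, sparse_hits]))
```

Documents that appear in both lists rise to the top, which is the behaviour you want when an exact ID match and a semantic match agree.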

Step 3: Reranking (The Accuracy Multiplier)

Top-K retrieval is noisy.

Add a reranker (cross-encoder):

evaluates (query, chunk) pairs
reorders by true relevance

This significantly improves answer quality without changing your database.
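The reranking step can be sketched as a function that takes any pairwise scorer. In production the scorer would be a cross-encoder model. Here `overlap_score` is a toy stand-in, included only so the example runs without a model download, and both function names are hypothetical.

```python
from typing import Callable


def rerank(
    query: str,
    chunks: list[str],
    score: Callable[[str, str], float],
    top_n: int = 3,
) -> list[str]:
    """Score every (query, chunk) pair and keep the top_n chunks."""
    return sorted(chunks, key=lambda chunk: score(query, chunk), reverse=True)[:top_n]


def overlap_score(query: str, chunk: str) -> float:
    """Toy scorer: fraction of query tokens that appear in the chunk."""
    query_tokens = set(query.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & set(chunk.lower().split())) / len(query_tokens)
```

Because the scorer is injected, swapping the toy function for a real cross-encoder changes one argument and nothing else in the pipeline.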

Step 4: Context Building (Where Systems Win or Lose)

Even with good retrieval, most failures happen here.

Common Mistakes

stuffing too many chunks
mixing unrelated documents
ignoring token limits

Production Approach

select top-ranked chunks only
preserve document structure
enforce token budget
maintain ordering

Better context > more context
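The production approach above can be sketched as a small selection function. It assumes each chunk is a dict with `doc_id`, `position`, and `text` keys (hypothetical field names) and uses a whitespace word count as a crude token estimate. Swap in your model's tokenizer for real budgets.

```python
def build_context(ranked_chunks: list[dict], token_budget: int) -> list[dict]:
    """ranked_chunks is best-first; returns the chunks that fit, in reading order."""
    selected: list[dict] = []
    used = 0
    for chunk in ranked_chunks:
        cost = len(chunk["text"].split())  # crude token estimate (assumption)
        if used + cost > token_budget:
            continue  # skip chunks that do not fit the remaining budget
        selected.append(chunk)
        used += cost
    # Restore document order so the model reads coherent passages.
    return sorted(selected, key=lambda c: (c["doc_id"], c["position"]))
```

Selecting by rank but emitting by position is what covers both "select top-ranked chunks only" and "maintain ordering".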

Vector DB vs Graph DB — When to Use What

Use Vector Database When

unstructured data
semantic search
document retrieval

Use Graph Database When

relationships matter
multi-hop reasoning
structured entities

Hybrid (Real Systems)

Use graph when relationships matter.
Use vector when meaning matters.
Use both when systems get complex.

RAG Is Not Single-Turn — Managing Context Over Time

Most systems fail here.

RAG is not just:

retrieve → answer

It’s:

retrieve → answer → follow-up → correction → refinement

The Problem: Context Drift

If you blindly append chat history:

token usage explodes
wrong answers get reinforced
relevance drops

Production Strategy: Context Is a Filter

Not a dump.

Context Layers

Store full history
Select only relevant turns
Exclude invalid or corrected responses
Combine with retrieved context

When to Summarize vs Include Raw History

Include Raw

short conversations
active refinement
recent corrections

Summarize

long conversations (>5–7 turns)
approaching token limits

Critical Rule

Summarize facts—not hallucinations.

If a previous answer was wrong:

exclude it
prioritize user correction

Handling User Corrections (Critical for Trust)

Users will fix your system.

If you ignore that, the system feels broken.

Strategy

mark incorrect responses
exclude them from future context
boost corrected information

Example:

{
  "turn_id": 8,
  "valid": false,
  "corrected": true
}
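A sketch of how those flags might drive history selection. The `valid` and `turn_id` fields come from the example above. The `is_correction` field, marking the user's correcting turn, is an assumption added for illustration, as is the `select_history` name.

```python
def select_history(turns: list[dict], max_turns: int = 4) -> list[dict]:
    """Keep recent valid turns; user corrections always survive pruning."""
    valid = [t for t in turns if t.get("valid", True)]
    corrections = [t for t in valid if t.get("is_correction")]
    recent = [t for t in valid if not t.get("is_correction")][-max_turns:]
    return sorted(corrections + recent, key=lambda t: t["turn_id"])


history = [
    {"turn_id": 1, "valid": True},
    {"turn_id": 8, "valid": False, "corrected": True},   # wrong answer, excluded
    {"turn_id": 9, "valid": True, "is_correction": True},  # user's fix, always kept
    {"turn_id": 10, "valid": True},
    {"turn_id": 11, "valid": True},
]
print([t["turn_id"] for t in select_history(history, max_turns=2)])
```

The invalid turn never reaches the prompt, so the model cannot re-read and reinforce its own mistake.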

Agentic RAG (When Retrieval Needs Reasoning)

Basic RAG is static.

Agentic RAG adds:

planning
iteration
tool usage

Architecture

Use It When

multi-step queries
missing context
dynamic retrieval

Avoid It When

simple Q&A
strict latency requirements

Otherwise you're adding complexity without ROI.

Confidence Scores and Citations (Trust Layer)

Without trust signals, users don’t trust answers.

Citations

Always return:

source document
section or chunk reference

Confidence Score (Simple Heuristic)

Combine:

retrieval score
reranker score
validation signal

Example:

confidence =
  0.4 * retrieval +
  0.4 * reranker +
  0.2 * validation

Optional Validation Step

Ask the model:

“Is this answer fully supported by the context?”

Lower confidence if not.
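The heuristic translates directly into code. This sketch assumes all three signals have already been normalised to the range 0 to 1, which raw vector-store distances usually are not. The weights are the ones given above. The `confidence` function name and the range check are additions.

```python
def confidence(retrieval: float, reranker: float, validation: float) -> float:
    """Weighted blend of three signals, each expected in [0, 1]."""
    for name, value in (
        ("retrieval", retrieval),
        ("reranker", reranker),
        ("validation", validation),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return 0.4 * retrieval + 0.4 * reranker + 0.2 * validation
```

With the validation step wired in, an answer the model judges unsupported passes `validation=0.0` and loses up to 0.2 of its score.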

Guardrail: Don’t Trust the Model Alone

Even with RAG:

hallucinations still happen
citations can be fabricated

Enforce:

answers must reference retrieved chunks
no context → no answer
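Both rules can be enforced after generation with a small check. The sketch assumes the model is prompted to cite chunks inline as `[chunk_id]`. That citation format, the `REFUSAL` text, and the `enforce_grounding` name are illustrative assumptions.

```python
import re

CITATION = re.compile(r"\[(\w+)\]")
REFUSAL = "I could not find this in the available documents."


def enforce_grounding(answer: str, retrieved_ids: set[str]) -> str:
    """Return the answer only if every citation points at a retrieved chunk."""
    if not retrieved_ids:
        return REFUSAL  # no context -> no answer
    cited = set(CITATION.findall(answer))
    if not cited or not cited <= retrieved_ids:
        return REFUSAL  # uncited answer, or a citation that was never retrieved
    return answer
```

This catches fabricated citation IDs, but not a real citation attached to a claim the chunk does not support. That second case is what the optional validation step is for.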

Final Architecture (Multi-Turn RAG System)

Production Checklist

If your system doesn’t have these, it will fail:

structured parsing
hybrid retrieval
reranking
controlled context building
memory filtering
correction handling
confidence + citations
observability

The Real Rule

RAG is not a retrieval problem. It’s a system design problem.

What Actually Works

The best RAG systems are:

simple
structured
observable
measurable

Not over-engineered.

Final Thought

If your system only works when:

the query is perfect
the data is clean
the demo is controlled

Then it doesn’t work.

What’s Next

Once RAG works, the next bottleneck is:

Cost.

Why LLM systems become expensive in production—and how to control it without killing performance.

Most AI Agent Frameworks Are Overkill — Here's How to Choose the Right One in 30 Seconds

TheProdSDE — Wed, 18 Mar 2026 12:58:03 +0000

A senior engineer's field-tested breakdown of LangGraph, AutoGen, CrewAI, Microsoft Agent Framework, and Haystack — from reviewing real production systems across teams.

Everyone is building AI agents right now.

LangGraph. AutoGen. CrewAI. Semantic Kernel. Microsoft Agent Framework.

But most production AI systems don't actually need an agent framework.

Across multiple teams and production codebases I've reviewed, the same two failure modes appear constantly — over-engineering and under-engineering. In one case, replacing a complex agent framework with ~200 lines of plain tool-calling code made the system 3× faster, easier to debug, and easier to maintain. In another, the absence of a framework caused a codebase to collapse under its own complexity.

Both failures had the same root cause:

The problem isn't choosing the wrong framework. It's choosing a framework before understanding the workflow.

This guide is the decision framework I now use before touching any agent tooling.

TL;DR — Pick Your Approach in 30 Seconds

Workflow Shape	Recommended Approach
Single request → tools → response	Plain tool calling (no framework)
Self-correcting loops, retries, re-evaluation	LangGraph
Parallel specialist agents	AutoGen 0.7.5
Enterprise persistent agents (Azure)	Microsoft Agent Framework RC
Sequential role-based task delegation	CrewAI 1.10.1
Document analysis and extraction pipelines	Haystack 2.x
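The table above can be encoded as a small lookup, shown here as a rough sketch. The flag names and the priority order when several flags are true are assumptions. The table itself does not rank the rows.

```python
def choose_approach(
    *,
    has_loops: bool = False,
    parallel_agents: bool = False,
    persistent_azure_agents: bool = False,
    role_based_handoff: bool = False,
    document_pipeline: bool = False,
) -> str:
    """Map a workflow shape to the approach recommended in the table."""
    rules = [
        (has_loops, "LangGraph"),
        (parallel_agents, "AutoGen"),
        (persistent_azure_agents, "Microsoft Agent Framework"),
        (role_based_handoff, "CrewAI"),
        (document_pipeline, "Haystack"),
    ]
    for matched, approach in rules:
        if matched:
            return approach
    # No special shape detected: the default is no framework at all.
    return "Plain tool calling (no framework)"
```

The default branch is the important one: a framework is returned only when a specific workflow shape has been identified.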

Workflow Shapes at a Glance

Two Real Failure Modes I've Seen Across Teams

The over-engineering case.

During a cross-team architecture review, I found a financial data analysis pipeline — pull market metrics, cross-reference filings, produce a risk score — built with three microservices, a LangGraph orchestrator with twelve nodes, Redis for inter-agent memory, and a separate evaluation loop. Six weeks of engineering. It worked.

What it actually needed: 180 lines of FastAPI with direct OpenAI tool calling. Same output. 3× faster inference. Any engineer on the team could debug it in minutes.

After the review, the team simplified it together. The framework had been adopted before anyone mapped the workflow shape — a straight line — and LangGraph's graph model added cost with no return.

The under-engineering case.

In a separate review, I found the opposite problem. An infrastructure incident response system — Azure alerts, metric retrieval, remediation decisions, rollback logic, retry conditions, escalation thresholds — built with hand-rolled state machines, custom retry logic, and a bespoke tool orchestration layer. Three weeks in, the codebase was a maze. Every new remediation path required rewriting core routing logic.

A proper agent framework would have provided state management, conditional branching, retry handling, and checkpointing for free. Instead the team was reinventing those primitives by hand — the most expensive kind of technical debt.

The pattern is the same in both directions: the architecture was chosen before the workflow was understood.

The Core Question: Does Your Workflow Resist Simplicity?

Before touching any framework, draw the workflow on paper. Then answer these:

Does step N's output determine whether to redo step N-1? → You have a loop
Do multiple specialized agents need to run simultaneously? → You have parallelism
Does the workflow run for minutes or hours, surviving restarts? → You need persistent state
Does the agent need to decide its next action from intermediate results? → Dynamic planning
Do independent agents need to hand off context to each other? → Multi-agent delegation

None of these? Use plain tool calling. Here's what that looks like at production quality.

The Baseline: Plain Tool Calling (No Framework Needed)

Use case: Real-time ESG (Environmental, Social, Governance) risk scoring. Given a ticker, pull sustainability metrics, cross-reference regulatory filings, produce a risk-adjusted score, and persist an audit trail — one clean, observable service.

# requirements: fastapi, openai, psycopg2-binary, httpx, pydantic
from fastapi import FastAPI
from openai import OpenAI
from pydantic import BaseModel
import json, httpx, psycopg2, os

app = FastAPI()
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

class ESGRequest(BaseModel):
    ticker: str
    portfolio_id: str

def get_esg_metrics(ticker: str) -> dict:
    resp = httpx.get(
        f"https://api.sustainalytics.com/v1/esg/{ticker}",
        headers={"Authorization": f"Bearer {os.environ['ESG_API_KEY']}"},
        timeout=10
    )
    # Fail loudly on 4xx/5xx instead of handing an error body to the model
    resp.raise_for_status()
    return resp.json()

def get_regulatory_flags(ticker: str) -> dict:
    with psycopg2.connect(os.environ["DATABASE_URL"]) as conn:
        with conn.cursor() as cur:
            cur.execute("""
                SELECT flag_type, severity, filing_date, description
                FROM regulatory_flags WHERE ticker = %s
                AND filing_date > NOW() - INTERVAL '2 years'
                ORDER BY severity DESC LIMIT 10
            """, (ticker,))
            rows = cur.fetchall()
    return {"flags": [{"type": r[0], "severity": r[1], "date": str(r[2]), "detail": r[3]} for r in rows]}

TOOLS = [
    {"type": "function", "function": {
        "name": "get_esg_metrics",
        "description": "Fetch ESG sustainability scores for a stock ticker",
        "parameters": {"type": "object", "properties": {"ticker": {"type": "string"}}, "required": ["ticker"]}
    }},
    {"type": "function", "function": {
        "name": "get_regulatory_flags",
        "description": "Retrieve regulatory violations and compliance flags",
        "parameters": {"type": "object", "properties": {"ticker": {"type": "string"}}, "required": ["ticker"]}
    }}
]

TOOL_MAP = {
    "get_esg_metrics": get_esg_metrics,
    "get_regulatory_flags": get_regulatory_flags
}

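from fastapi import HTTPException

MAX_TOOL_ROUNDS = 5  # hard cap so a misbehaving model cannot loop forever

@app.post("/assess-esg")
def assess_esg(req: ESGRequest):
    # Plain `def`: FastAPI runs it in a threadpool, so the blocking OpenAI,
    # httpx, and psycopg2 calls do not stall the event loop.
    messages = [{"role": "user", "content": (
        f"Run a full ESG risk assessment for {req.ticker} in portfolio {req.portfolio_id}. "
        f"Produce a risk-adjusted score with recommendation: HOLD, REDUCE, or DIVEST."
    )}]
    for _ in range(MAX_TOOL_ROUNDS):
        response = client.chat.completions.create(
            model="gpt-4o", messages=messages, tools=TOOLS, tool_choice="auto"
        )
        msg = response.choices[0].message
        if not msg.tool_calls:
            # No more tool requests: the model has produced its final answer
            return {"ticker": req.ticker, "assessment": msg.content}
        messages.append(msg)
        for tc in msg.tool_calls:
            # Dispatch each requested tool and feed the result back to the model
            result = TOOL_MAP[tc.function.name](**json.loads(tc.function.arguments))
            messages.append({"role": "tool", "tool_call_id": tc.id, "content": json.dumps(result)})
    raise HTTPException(status_code=502, detail="Tool-calling loop did not converge")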

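Multi-step tool calling in one service, fully debuggable by any engineer. Persisting the `messages` list for each request gives you the audit trail (omitted above for brevity). If this solves the problem, stop here. No framework needed.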

1. LangGraph — Stateful Cyclic Workflows

Version: 1.1.2 | pip install langgraph langgraph-checkpoint-redis

Use when: Your workflow has genuine loops — the result of one step conditionally re-runs a previous step. Redis checkpointing lets state survive service restarts mid-workflow.

Avoid when: The workflow is strictly sequential with no conditional branching. The graph model adds measurable overhead for zero architectural return.

Use case: Automated cloud cost optimization — scan Azure VMs for underutilization, simulate right-sizing savings, apply low-risk changes, re-scan. The loop continues until no optimization exceeds the savings threshold.

# requirements: langgraph, langgraph-checkpoint-redis, openai
from langgraph.graph import StateGraph, END
from langgraph.checkpoint.redis import RedisSaver
from typing import TypedDict, Annotated
import operator, json, os
from openai import OpenAI

oai = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

class CostState(TypedDict):
    subscription_id: str
    resources: list[dict]
    candidates: list[dict]  # overwritten each pass; an operator.add reducer here would re-apply old recommendations
    applied: Annotated[list, operator.add]
    total_savings: float
    iteration: int

def scan_resources(state: CostState) -> dict:
    # azure_monitor_client (and azure_compute below) are your own Azure wrappers, not shown here
    resources = azure_monitor_client.list_resources(state["subscription_id"])
    underutilized = [r for r in resources if r["avg_cpu_7d"] < 15 and r["avg_memory_7d"] < 30]
    return {"resources": underutilized, "iteration": state["iteration"] + 1}

def analyze_savings(state: CostState) -> dict:
    resp = oai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"""
            Analyze these underutilized Azure resources and recommend right-sizing.
            For each: resource_id, current_sku, recommended_sku, monthly_savings_usd, risk_level (LOW/MEDIUM/HIGH).
            Only include LOW or MEDIUM risk recommendations.
            Resources: {json.dumps(state["resources"])}
            Respond: {{"recommendations": [...]}}
        """}],
        response_format={"type": "json_object"}
    )
    candidates = json.loads(resp.choices[0].message.content)["recommendations"]
    return {"candidates": candidates, "total_savings": sum(c["monthly_savings_usd"] for c in candidates)}

def apply_changes(state: CostState) -> dict:
    applied = []
    for c in state["candidates"]:
        if c["risk_level"] == "LOW":
            azure_compute.resize_vm(c["resource_id"], c["recommended_sku"])
            applied.append(c)
    return {"applied": applied}

def check_threshold(state: CostState) -> str:
    return "loop" if state["total_savings"] > 500 and state["iteration"] < 5 else "done"

graph = StateGraph(CostState)
graph.add_node("scan", scan_resources)
graph.add_node("analyze", analyze_savings)
graph.add_node("apply", apply_changes)
graph.set_entry_point("scan")
graph.add_edge("scan", "analyze")
graph.add_edge("analyze", "apply")
graph.add_conditional_edges("apply", check_threshold, {"loop": "scan", "done": END})

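# from_conn_string is a context manager; setup() creates the Redis indices on first use
with RedisSaver.from_conn_string(os.environ["REDIS_URL"]) as checkpointer:
    checkpointer.setup()
    optimizer = graph.compile(checkpointer=checkpointer)
    result = optimizer.invoke(
        {
            "subscription_id": os.environ["AZURE_SUBSCRIPTION_ID"],
            "candidates": [], "applied": [], "total_savings": 0.0, "iteration": 0
        },
        # A checkpointer needs a thread_id to key the saved state
        config={"configurable": {"thread_id": "cost-optimizer"}},
    )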

Why this justifies LangGraph: The self-correcting re-scan is genuinely hard to express cleanly in plain tool calling without writing a custom state machine — which is exactly the under-engineering trap described earlier.

2. AutoGen 0.7.5 — Parallel Multi-Agent Collaboration

Version: 0.7.5 | pip install "autogen-agentchat>=0.7.5" "autogen-ext[openai,redis]"

Key additions in 0.7.5: linear memory via RedisMemory, fixed GraphFlow cycle detection, Anthropic thinking mode support, reasoning_effort parameter for GPT-5 models, improved Azure AI client streaming.

Use when: Multiple specialized agents run independent analysis simultaneously. Async, event-driven — agents can be deployed on separate containers with zero blocking between them.

Avoid when: The task is sequential and single-agent. Multi-agent coordination overhead only pays off when genuine parallelism exists.

Use case: Automated M&A due diligence — Legal, Financial, and Tech Audit agents work in parallel, then a Synthesis agent consolidates findings into an investment decision. Sequential execution would add hours of unnecessary latency to a time-sensitive deal process.

# requirements: autogen-agentchat>=0.7.5, autogen-ext[openai,redis]
import asyncio, os
from autogen_agentchat.agents import AssistantAgent
from autogen_agentchat.teams import SelectorGroupChat
from autogen_agentchat.conditions import TextMentionTermination
from autogen_ext.models.openai import OpenAIChatCompletionClient
from autogen_ext.memory.redis import RedisMemory  

model_client = OpenAIChatCompletionClient(model="gpt-4o", api_key=os.environ["OPENAI_API_KEY"])

legal_agent = AssistantAgent(
    name="LegalAgent",
    model_client=model_client,
    memory=[RedisMemory(redis_url=os.environ["REDIS_URL"], session_id="ma-legal")],
    system_message="""M&A legal counsel. Identify: IP gaps, change-of-control clauses,
    litigation exposure, employment liabilities.
    Output JSON: {risk_items, severity: BLOCKER|HIGH|MEDIUM|LOW, deal_impact}""",
    tools=[fetch_contract_repository, search_litigation_database]
)

financial_agent = AssistantAgent(
    name="FinancialAgent",
    model_client=model_client,
    memory=[RedisMemory(redis_url=os.environ["REDIS_URL"], session_id="ma-financial")],
    system_message="""M&A financial analyst. Identify: revenue quality, off-balance-sheet
    liabilities, working capital needs post-acquisition, EBITDA normalization.
    Output JSON: {financial_flags, normalized_ebitda, recommended_valuation_range}""",
    tools=[fetch_financial_statements, compute_dcf_model]
)

tech_audit_agent = AssistantAgent(
    name="TechAuditAgent",
    model_client=model_client,
    system_message="""Technical due diligence expert. Assess: tech debt score (1–10),
    security vulnerabilities, scalability ceiling, bus-factor risk.
    Output JSON: {tech_risks, estimated_remediation_cost_usd, integration_complexity}""",
    tools=[clone_and_analyze_repo, run_dependency_scan]
)

synthesis_agent = AssistantAgent(
    name="SynthesisAgent",
    model_client=model_client,
    system_message="""M&A deal lead. Wait for all domain agents to complete.
    Synthesize: PROCEED / PROCEED_WITH_CONDITIONS / ABORT.
    Include: top 5 risks, price adjustment recommendation, 90-day priorities.
    End your message with: ANALYSIS_COMPLETE""",
    tools=[generate_pdf_memo, notify_deal_team_slack]
)

async def run_ma_due_diligence(target: str):
    team = SelectorGroupChat(
        participants=[legal_agent, financial_agent, tech_audit_agent, synthesis_agent],
        model_client=model_client,
        termination_condition=TextMentionTermination("ANALYSIS_COMPLETE"),
        selector_prompt="""Run LegalAgent, FinancialAgent, TechAuditAgent first (order flexible,
        can run in parallel). Only select SynthesisAgent after all three have reported."""
    )
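    async for msg in team.run_stream(task=f"Full M&A due diligence for: {target}"):
        # The stream's final item is a TaskResult, which has no .source or .content
        if hasattr(msg, "source"):
            print(f"[{msg.source}] {str(getattr(msg, 'content', ''))[:120]}...")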

asyncio.run(run_ma_due_diligence("TargetCorp Inc."))

3. Microsoft Agent Framework RC — Enterprise Production Agents

Version: Release Candidate, Feb 19, 2026 | pip install agent-framework --pre

The most significant shift in the Microsoft AI ecosystem right now. This framework unifies Semantic Kernel and AutoGen into a single SDK — both are entering maintenance mode as of this writing, and all new Microsoft investment flows into Agent Framework first.

RC signals a frozen, stable API surface with GA targeting Q1 2026. Core capabilities: persistent threads (Cosmos DB), Service Bus integration, MCP + A2A protocol support, multi-agent orchestration with handoff and group chat patterns, streaming checkpointing for long-running agents, full .NET and Python support.

Use when: Azure-first enterprise teams, production 24/7 background agents, regulated systems requiring full audit trails, or any team currently building on Semantic Kernel or AutoGen.

Avoid when: Greenfield Python-only stacks with no Azure dependency, or where GA stability is a hard requirement before adoption.

Use case: Autonomous DevOps deployment agent — monitors CI/CD completion events, validates health via Application Insights, promotes through dev → staging → prod automatically, pages on-call only when a gate fails. Runs continuously as a persistent, resumable agent.

# requirements: agent-framework --pre, azure-identity, azure-monitor-query, kubernetes
import os
from agent_framework import AgentClient
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()
agent_client = AgentClient(
    endpoint=os.environ["AZURE_AI_FOUNDRY_ENDPOINT"],
    credential=credential
)

def check_app_insights(resource_id: str, environment: str, window_minutes: int = 5) -> dict:
    from azure.monitor.query import MetricsQueryClient
    from datetime import timedelta
    monitor = MetricsQueryClient(credential)
    result = monitor.query_resource(
        resource_id,
        metrics=["requests/failed", "requests/duration"],
        timespan=timedelta(minutes=window_minutes)
    )
    return {
        "environment": environment,
        "error_rate_percent": _calculate_error_rate(result),
        "p99_latency_ms": _calculate_p99(result),
        "gate_passed": _calculate_error_rate(result) < 1.0 and _calculate_p99(result) < 500
    }

def promote_deployment(service: str, image_tag: str, environment: str) -> dict:
    from kubernetes import client as k8s, config
    config.load_incluster_config()
    k8s.AppsV1Api().patch_namespaced_deployment(
        name=service, namespace=environment,
        body={"spec": {"template": {"spec": {"containers": [{
            "name": service,
            "image": f"{os.environ['ACR_REGISTRY']}/{service}:{image_tag}"
        }]}}}}
    )
    return {"status": "promoted", "service": service, "tag": image_tag, "environment": environment}

# Persistent thread — Cosmos DB preserves state across restarts
agent = agent_client.create_agent(
    model="gpt-4o",
    name="DeploymentOrchestrator",
    instructions="""Autonomous DevOps deployment agent.
    1. Validate health: error rate < 1%, p99 < 500ms, zero pod restarts
    2. Run smoke tests on the deployed environment
    3. All gates pass → promote to next environment
    4. Any gate fails → halt, capture full diagnostics, page on-call
    5. After prod deploy: monitor 15 minutes, auto-rollback if error rate > 2%
    Log every action: timestamp, metric values, decision, outcome.""",
    tools=[check_app_insights, promote_deployment, run_smoke_tests,
           rollback_deployment, page_oncall, log_deployment_event]
)
thread = agent_client.create_thread()

async def handle_pipeline_completion(event: dict):
    """Azure Event Grid webhook — fires on every pipeline completion"""
    agent_client.create_message(
        thread_id=thread.id, role="user",
        content=(
            f"Deployment complete — Service: {event['service_name']}, "
            f"Tag: {event['image_tag']}, Environment: {event['environment']}. "
            f"Begin validation and promotion workflow."
        )
    )
    return agent_client.create_and_process_run(thread_id=thread.id, agent_id=agent.id)

4. CrewAI 1.10.1 — Role-Based Sequential Pipelines

Version: 1.10.1 (March 3, 2026) | pip install crewai==1.10.1

⚠️ Note: v1.10.0 was yanked from PyPI due to an AMP runtime issue. Always pin to 1.10.1 directly. This release addresses lazy loading in the Memory module and resolves a concurrent multi-process LockException in production flows.

Use when: Clearly defined specialist roles, sequential task handoff, and fastest path from idea to working prototype. CrewAI is the most opinionated framework and the fastest to build with.

Avoid when: Fine-grained conditional edge control is needed, or throughput-sensitive production paths where abstraction overhead is measurable.

Use case: Automated system architecture pipeline — an Architect designs from requirements, a Reviewer stress-tests for production failures, a Documentation Lead produces the Architecture Decision Record for the engineering wiki.

# requirements: crewai==1.10.1
from crewai import Agent, Task, Crew, Process

architect = Agent(
    role="Principal Solutions Architect",
    goal="Design a scalable, cloud-native system architecture from requirements",
    backstory="10+ years on Azure and AWS. Expert in event-driven architecture, CQRS, and zero-trust.",
    verbose=True
)
reviewer = Agent(
    role="Staff Engineer — Technical Reviewer",
    goal="Identify scalability bottlenecks, security vulnerabilities, and operational risks",
    backstory="Former SRE. Has seen every production failure mode. Only approves designs that hold at 10x load.",
    verbose=True
)
doc_lead = Agent(
    role="Technical Documentation Lead",
    goal="Produce a complete Architecture Decision Record capturing every design trade-off",
    backstory="Undocumented architecture is a liability. Writes for the engineer joining 2 years later.",
    verbose=True
)

design_task = Task(
    description="""Design a complete architecture for: {requirements}
    Output JSON: components (name + responsibility), data_flows (source → destination),
    tech_choices (with justification), scaling_strategy per component, security_controls.""",
    expected_output="JSON architecture specification",
    agent=architect
)
review_task = Task(
    description="""Stress-test the architecture:
    1. 10x traffic spike — what breaks first?
    2. Single region failure — what is the blast radius?
    3. Compromised service account — what can an attacker reach?
    Return: {approved: YES/NO, blocking_issues: [], recommended_changes: []}""",
    expected_output="Review JSON with approval and findings",
    agent=reviewer,
    context=[design_task]
)
adr_task = Task(
    description="Write a complete ADR: Context, Decision, Consequences, Alternatives Considered, Risk Register. Format: Markdown.",
    expected_output="Complete ADR in Markdown",
    agent=doc_lead,
    context=[design_task, review_task]
)

crew = Crew(
    agents=[architect, reviewer, doc_lead],
    tasks=[design_task, review_task, adr_task],
    process=Process.sequential,
    verbose=True
)
result = crew.kickoff(inputs={
    "requirements": "Real-time fraud detection — 50K TPS, sub-100ms decisions, 99.99% uptime, multi-region active-active"
})
print(result.raw)
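
Because review_task asks for a JSON verdict, a downstream step can gate promotion on it. The following is a minimal, framework-free sketch rather than part of the CrewAI API: parse_review and ReviewVerdict are hypothetical names, and it assumes the reviewer's raw text contains a valid JSON object with quoted keys, which an LLM does not guarantee unless the prompt demands it.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ReviewVerdict:
    approved: bool
    blocking_issues: list[str] = field(default_factory=list)
    recommended_changes: list[str] = field(default_factory=list)


def parse_review(raw: str) -> ReviewVerdict:
    """Parse the JSON object spanning the first '{' to the last '}' in raw."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in review output")
    data = json.loads(raw[start:end + 1])
    value = data.get("approved")
    # Fail closed: anything other than True or "YES" counts as not approved.
    approved = value is True or str(value).strip().upper() == "YES"
    return ReviewVerdict(
        approved=approved,
        blocking_issues=list(data.get("blocking_issues", [])),
        recommended_changes=list(data.get("recommended_changes", [])),
    )


# Hypothetical reviewer output, with prose around the JSON as LLMs often produce.
sample = (
    'Review:\n{"approved": "NO", '
    '"blocking_issues": ["single-region Kafka"], "recommended_changes": []}'
)
verdict = parse_review(sample)
if not verdict.approved:
    print("Blocked:", verdict.blocking_issues)
```

Treating a missing or malformed verdict as a rejection keeps the gate conservative: a design is promoted only when the reviewer says so explicitly.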

5. Haystack 2.x — Document Intelligence Pipelines

Version: 2.x | pip install haystack-ai

Use when: Document processing, extraction, or compliance analysis at scale is the core product, not general agentic behavior. Haystack is purpose-built for this and measurably outperforms general frameworks on document-centric workloads.

Avoid when: The problem is general agentic orchestration, multi-agent coordination, or any real-time interactive system.

Use case: Automated SOC 2 evidence collection. The pipeline ingests all internal policy documents, maps clauses against the Trust Services Criteria, and produces a compliance gap report showing which controls are missing or non-conformant.

The pipeline maps extracted content against six SOC 2 Trust Services Criteria groups (CC6, CC7, CC8, CC9, A1, C1), flags NOT_COVERED gaps with descriptions, and returns a structured compliance report with an overall coverage percentage, so no custom parsing layer is required.
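
To make the report shape concrete, here is a minimal sketch of the final aggregation step only. It is plain Python, not Haystack API: Finding, build_gap_report, and the report keys are hypothetical, and it assumes an upstream extraction stage has already labelled each criteria group as COVERED or NOT_COVERED.

```python
from dataclasses import dataclass

# The six criteria groups named in the text above.
CRITERIA = ["CC6", "CC7", "CC8", "CC9", "A1", "C1"]


@dataclass
class Finding:
    criterion: str
    status: str            # "COVERED" or "NOT_COVERED"
    description: str = ""


def build_gap_report(findings: list[Finding]) -> dict:
    """Aggregate per-criterion findings into a coverage percentage and gap list."""
    by_criterion = {f.criterion: f for f in findings}
    gaps = []
    covered = 0
    for criterion in CRITERIA:
        finding = by_criterion.get(criterion)
        if finding is not None and finding.status == "COVERED":
            covered += 1
            continue
        # A criterion with no finding at all is treated as a gap.
        gaps.append({
            "criterion": criterion,
            "status": "NOT_COVERED",
            "description": finding.description if finding else "No matching policy clause found",
        })
    return {
        "coverage_pct": round(100 * covered / len(CRITERIA), 1),
        "gaps": gaps,
    }


report = build_gap_report([
    Finding("CC6", "COVERED"),
    Finding("CC7", "COVERED"),
    Finding("CC8", "NOT_COVERED", "No change-management policy on file"),
    Finding("A1", "COVERED"),
])
print(report["coverage_pct"], [g["criterion"] for g in report["gaps"]])
```

Counting a criterion with no finding as NOT_COVERED is a deliberate choice in this sketch: silence from the extraction stage should surface as a gap rather than inflate the coverage figure.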

The Practical Decision Rule

If your AI system needs:

  Retries or self-correction    → LangGraph
  Multiple parallel specialists → AutoGen
  Enterprise Azure persistence  → Microsoft Agent Framework
  Sequential role-based tasks   → CrewAI
  Document extraction at scale  → Haystack
  None of the above             → No framework. Plain tool calling.
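
The rule above can be written down as a small lookup. This is only an illustrative sketch: WorkflowShape and pick_framework are hypothetical names, and the precedence when several needs apply at once (top of the list wins) is an assumption, since the rule itself does not rank them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkflowShape:
    needs_retries_or_self_correction: bool = False
    needs_parallel_specialists: bool = False
    needs_azure_persistence: bool = False
    is_sequential_role_based: bool = False
    is_document_extraction: bool = False


def pick_framework(shape: WorkflowShape) -> str:
    """Return the first matching framework, in the order the rule lists them."""
    rules = [
        (shape.needs_retries_or_self_correction, "LangGraph"),
        (shape.needs_parallel_specialists, "AutoGen"),
        (shape.needs_azure_persistence, "Microsoft Agent Framework"),
        (shape.is_sequential_role_based, "CrewAI"),
        (shape.is_document_extraction, "Haystack"),
    ]
    for matches, framework in rules:
        if matches:
            return framework
    # None of the above: no framework, plain tool calling.
    return "Plain tool calling"


print(pick_framework(WorkflowShape()))
print(pick_framework(WorkflowShape(needs_parallel_specialists=True)))
```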

Framework Reference

Framework	Version	Primary Strength	Use When	Avoid When
Plain tool calling	—	Speed, simplicity, debuggability	Straight-line workflow	Never, as long as no loops exist
LangGraph	1.1.2	Cyclic graphs, checkpointing	Self-correcting loops, retries	No conditional branching
AutoGen	0.7.5	Parallel async agents, RedisMemory	Multiple specialists in parallel	Sequential single-agent tasks
Microsoft Agent Framework	RC Feb 2026	Enterprise persistence, SK + AutoGen unified	Azure production 24/7 agents	Python-only teams with no Azure stack
CrewAI	1.10.1	Role-based prototyping, fast iteration	Sequential delegation, fast prototyping	Fine-grained production control
Haystack	2.x	Document extraction pipelines	Document processing as core product	General agentic tasks

Production Lessons From Real Systems

After reviewing AI systems across multiple teams, a few patterns appear consistently:

1. Most workflows are simpler than they look.
The instinct when building with LLMs is to reach for orchestration layers. That instinct is usually wrong. Start with the simplest thing that could work, measure it, then add complexity only where the problem resists simplicity.

2. Agent frameworks pay off only when the workflow has the right shape.
The right shape means loops, parallel specialists, or long-running persistent state. If none of these exist, the framework is cost with no architectural return.

3. Debuggability matters more than clever architecture.
A system the team can debug at 2 AM is worth more than an elegant multi-agent pipeline nobody fully understands. Production incidents don't wait for framework comprehension.

4. Hand-rolling framework primitives is the most expensive mistake.
Custom state machines, retry logic, and checkpointing written to avoid a framework dependency consistently cost more engineering time than learning the framework properly. Both failure modes described earlier confirm this.

5. Decide the architecture before writing the first line.
Draw the workflow. Identify the shape. Pick the tool. That decision should take 30 minutes, not six weeks of refactoring.

The Real Rule

If you can draw your workflow as a straight line — plain tool calling is your production architecture.
If that line needs to loop, branch, or coordinate parallel specialists — match the tool to the shape.

The best production AI systems are architecturally boring. One clean service, typed tool schemas, structured output, observable logs. No framework overhead unless the problem demands it.
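
As a rough illustration of what "boring" can look like, here is a minimal sketch of plain tool calling with a typed argument schema, structured JSON output, and a log line per call. Everything in it is hypothetical (the get_order_status tool, the TOOLS registry, the dispatch function), and it assumes the model's tool call has already been parsed into a dict with "name" and "arguments" keys; no LLM client is involved.

```python
import json
import logging
from dataclasses import asdict, dataclass
from typing import Callable

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("tool_calls")


@dataclass
class OrderStatusArgs:
    order_id: str


@dataclass
class OrderStatus:
    order_id: str
    status: str


def get_order_status(args: OrderStatusArgs) -> OrderStatus:
    # Hypothetical tool body; a real one would query a database or API.
    return OrderStatus(order_id=args.order_id, status="shipped")


# Tool name -> (argument schema, handler)
TOOLS: dict[str, tuple[type, Callable]] = {
    "get_order_status": (OrderStatusArgs, get_order_status),
}


def dispatch(tool_call: dict) -> str:
    """Validate a model-issued tool call, run it, and return JSON for the model."""
    name = tool_call["name"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    schema, handler = TOOLS[name]
    # The dataclass constructor rejects missing or unexpected fields with a
    # TypeError; it checks field names only, not value types.
    args = schema(**tool_call["arguments"])
    result = handler(args)
    log.info("tool=%s args=%s", name, tool_call["arguments"])
    return json.dumps(asdict(result))


output = dispatch({"name": "get_order_status", "arguments": {"order_id": "A-1001"}})
print(output)
```

The whole control flow is one function and one dictionary, which is what makes it easy to step through when something goes wrong.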

Add complexity only when the problem resists simplicity. That's the whole framework.

What's Next in This Series

This post focused on orchestration — how to structure and run AI workflows in production.

But once orchestration is solved, most teams run into a different problem:

Their RAG system doesn’t actually work.

Not in demos — those look fine.
In production — it breaks.

Wrong answers. Missing context. Hallucinations with high confidence.

And in most cases, the root cause is not the vector database or the embedding model.

It’s the architecture around it.

The next article breaks down:

Why most RAG systems fail after launch
The common design mistakes teams repeat
And the production architecture that actually fixes it

If you’re building anything on top of retrieval, this is where things either scale — or quietly fail.

Are you running an agent framework in production — or did you strip one out and go back to basics?

What made that decision clear? Drop your stack and the reason in the comments. The real production stories are always more useful than the official docs.

Tags: #ai #llm #machinelearning #softwareengineering #devops

Version data sourced from official release channels: AutoGen 0.7.5, Microsoft Agent Framework RC, CrewAI 1.10.1, LangGraph, Haystack

[Boost]

TheProdSDE — Wed, 18 Mar 2026 09:37:06 +0000

TheProdSDE

Mar 14

Stop Guessing Your LLM Replacement

#ai #llm #cloud #systemdesign

6 min read