<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Dextra Labs</title>
    <description>The latest articles on DEV Community by Dextra Labs (@dextralabs).</description>
    <link>https://dev.to/dextralabs</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3662653%2F4a16ca71-2863-42bd-8d70-cfc2598122b1.png</url>
      <title>DEV Community: Dextra Labs</title>
      <link>https://dev.to/dextralabs</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dextralabs"/>
    <language>en</language>
    <item>
      <title>Build a No-Code AI Agent in 30 Minutes Using n8n + Claude (Full Walkthrough)</title>
      <dc:creator>Dextra Labs</dc:creator>
      <pubDate>Tue, 21 Apr 2026 15:43:31 +0000</pubDate>
      <link>https://dev.to/dextralabs/build-a-no-code-ai-agent-in-30-minutes-using-n8n-claude-full-walkthrough-3cid</link>
      <guid>https://dev.to/dextralabs/build-a-no-code-ai-agent-in-30-minutes-using-n8n-claude-full-walkthrough-3cid</guid>
      <description>&lt;p&gt;&lt;strong&gt;&lt;em&gt;Monitor a Slack channel, summarise threads, post daily digests, all without writing a single line of code.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I want to tell you about a Friday afternoon problem.&lt;/p&gt;

&lt;p&gt;Our team's main Slack channel had become unmanageable. Not because people were saying too much, but because they were saying the right things at the wrong times. Someone would share a critical decision at 9am. Someone else would ask a question about it at 2pm without having seen the original message. By the end of the day, the same context had been repeated four times in different threads and nobody had a clean summary of what had actually been decided.&lt;/p&gt;

&lt;p&gt;The obvious solution was a daily digest. A summary of important threads, decisions and open questions, posted at end of day so everyone could start tomorrow with context rather than archaeology.&lt;/p&gt;

&lt;p&gt;The less obvious solution was building it in 30 minutes on a Friday afternoon using n8n and Claude and then never thinking about it again.&lt;/p&gt;

&lt;p&gt;This is that walkthrough. By the end of it you'll have a working AI agent that monitors a Slack channel, identifies meaningful threads, summarises them with Claude and posts a clean daily digest. No code required. Replicable during a lunch break.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What You'll Need&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Before we start, make sure you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;n8n&lt;/strong&gt;, either self-hosted (Docker is the easiest path) or the cloud version at n8n.io. Free tier works for this.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Anthropic API key&lt;/strong&gt;, get one at console.anthropic.com. You'll use maybe $0.10 of credits building this.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Slack workspace&lt;/strong&gt; where you have permission to add apps and read channel history.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;30 minutes and a coffee.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's genuinely it.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Step 1: Get n8n Running&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If you're using n8n cloud, skip this. If you're self-hosting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bash
docker run &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; n8n &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 5678:5678 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; ~/.n8n:/home/node/.n8n &lt;span class="se"&gt;\&lt;/span&gt;
  n8nio/n8n
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open &lt;code&gt;http://localhost:5678&lt;/code&gt; in your browser. Create your account. You'll land on the main workflow canvas, a blank grid that's about to become your agent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you're looking at&lt;/strong&gt;: n8n's canvas is where you build automation workflows by connecting nodes. Each node does one thing: fetch data, transform it, call an API, send a message. You connect them left to right and data flows through the chain. That's the entire mental model.&lt;/p&gt;
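&lt;p&gt;To make that mental model concrete, here is the same left-to-right chain sketched as plain JavaScript function composition. The node names and sample data below are illustrative, not n8n's actual API:&lt;/p&gt;

```javascript
// Illustrative only: each "node" is a function, and data flows left to right.
const fetchData = () => [{ text: 'standup notes' }, { text: 'deploy question' }];
const transform = (items) => items.map((i) => ({ ...i, text: i.text.toUpperCase() }));
const sendMessage = (items) => items.map((i) => `posted: ${i.text}`);

// Connecting nodes on the canvas is exactly this composition:
const output = sendMessage(transform(fetchData()));
// output is an array of two "posted: ..." strings
```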

&lt;h2&gt;
  
  
  &lt;strong&gt;Step 2: Create a New Workflow&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Click &lt;strong&gt;"New Workflow"&lt;/strong&gt; in the top right. Name it something useful, "Slack Daily Digest Agent" works fine.&lt;br&gt;
You'll see a single "+" button in the centre of the canvas. This is where your first node goes.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Step 3: Add the Schedule Trigger&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Every agent needs something that kicks it off. Ours runs once a day at 5pm.&lt;br&gt;
Click the "+" button and search for &lt;strong&gt;"Schedule Trigger."&lt;/strong&gt; Add it.&lt;br&gt;
In the node settings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Trigger interval&lt;/strong&gt;: Days&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Days between triggers&lt;/strong&gt;: 1&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trigger at hour&lt;/strong&gt;: 17 (5pm)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trigger at minute&lt;/strong&gt;: 0&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Click &lt;strong&gt;"Save."&lt;/strong&gt;&lt;br&gt;
This node will now fire your workflow every day at 5pm. Nothing else needed, n8n handles the scheduling infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Screenshot area&lt;/strong&gt;: [Schedule Trigger node, shows the interval settings with "Days: 1" and "Hour: 17" configured. Clean n8n interface, dark mode if available.]&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Step 4: Connect Slack and Pull Channel Messages&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Now we need to fetch today's messages from your Slack channel.&lt;br&gt;
Click "+" after the Schedule Trigger and search for &lt;strong&gt;"Slack."&lt;/strong&gt; Select the &lt;strong&gt;"Get Many Messages"&lt;/strong&gt; action.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Credential setup&lt;/strong&gt; (first time only):&lt;/p&gt;

&lt;p&gt;Click &lt;strong&gt;"Create New Credential"&lt;/strong&gt;&lt;br&gt;
Select &lt;strong&gt;"OAuth2"&lt;/strong&gt;&lt;br&gt;
Follow the Slack OAuth flow, you'll need to create a Slack app at api.slack.com/apps with &lt;code&gt;channels:history&lt;/code&gt; and &lt;code&gt;channels:read&lt;/code&gt; permissions&lt;br&gt;
Once authorised, the credential saves and you won't touch it again&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Node configuration:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Operation&lt;/strong&gt;: Get Many Messages&lt;br&gt;
&lt;strong&gt;Channel&lt;/strong&gt;: Select your target channel from the dropdown&lt;br&gt;
&lt;strong&gt;Limit&lt;/strong&gt;: 100 (catches a full day of messages)&lt;br&gt;
&lt;strong&gt;Additional Fields → Oldest&lt;/strong&gt;: &lt;code&gt;{{ $now.startOf('day').toISO() }}&lt;/code&gt; (this expression pulls only today's messages)&lt;/p&gt;

&lt;p&gt;This expression is the only "code" in the whole workflow and it's just a timestamp filter. n8n's expression syntax is readable enough that you don't need to understand it deeply to use it.&lt;/p&gt;
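&lt;p&gt;If you're curious what that expression evaluates to, here is an approximation in plain JavaScript. n8n's &lt;code&gt;$now&lt;/code&gt; is a Luxon DateTime under the hood; the sketch below uses the built-in &lt;code&gt;Date&lt;/code&gt; instead, so timezone handling may differ slightly:&lt;/p&gt;

```javascript
// Approximation of {{ $now.startOf('day').toISO() }} using the built-in Date object.
const startOfDay = new Date();
startOfDay.setHours(0, 0, 0, 0);            // midnight, local time
const oldest = startOfDay.toISOString();    // ISO-8601 string used as the "Oldest" filter
```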

&lt;p&gt;Click &lt;strong&gt;"Test Step"&lt;/strong&gt; to verify it's pulling messages. You should see today's Slack messages appear in the output panel on the right.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Screenshot area:&lt;/strong&gt; [Slack node configuration, channel dropdown selected, limit set to 100, oldest expression visible. Output panel showing sample messages on the right.]&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Step 5: Filter for Meaningful Threads&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Not every message deserves to be in the digest. Single emoji reactions and "thanks!" messages don't need to be summarised. We want threads with actual substance.&lt;/p&gt;

&lt;p&gt;Click "+" and add an &lt;strong&gt;"IF"&lt;/strong&gt; node.&lt;br&gt;
&lt;strong&gt;Condition:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Value 1:&lt;/strong&gt; &lt;code&gt;{{ $json.reply_count }}&lt;/code&gt;&lt;br&gt;
&lt;strong&gt;Operation:&lt;/strong&gt; Greater than&lt;br&gt;
&lt;strong&gt;Value 2:&lt;/strong&gt; 2&lt;/p&gt;

&lt;p&gt;This passes only messages that generated at least 3 replies, a reasonable proxy for "this was a real conversation." You can tune the number based on your channel's activity level.&lt;/p&gt;
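&lt;p&gt;The condition is equivalent to a simple filter. Sketched in plain JavaScript with made-up sample messages:&lt;/p&gt;

```javascript
// Hypothetical sample data, shaped roughly like Slack's message objects.
const messages = [
  { text: 'Decision: ship Friday', reply_count: 5 },
  { text: 'thanks!', reply_count: 0 },
  { text: 'Which env var controls this?', reply_count: 3 },
];

// The IF node's "reply_count greater than 2" condition as a filter:
const meaningful = messages.filter((m) => (m.reply_count || 0) > 2);
// meaningful keeps the two threads that got at least 3 replies
```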

&lt;p&gt;The IF node has two output paths: &lt;strong&gt;"True"&lt;/strong&gt; (messages with threads) and &lt;strong&gt;"False"&lt;/strong&gt; (everything else). Connect only the &lt;strong&gt;True&lt;/strong&gt; path to the next node.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Screenshot area&lt;/strong&gt;: [IF node, condition showing reply_count &amp;gt; 2. Two output paths visible, True path highlighted.]&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Step 6: Fetch the Full Thread Content&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We have the parent messages. Now we need the replies.&lt;br&gt;
Add another Slack node, this time with the &lt;strong&gt;"Get Replies"&lt;/strong&gt; action.&lt;br&gt;
&lt;strong&gt;Configuration:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Channel&lt;/strong&gt;: Same channel as before&lt;br&gt;
&lt;strong&gt;Timestamp&lt;/strong&gt;: &lt;code&gt;{{ $json.ts }}&lt;/code&gt; (this pulls the thread ID from the previous node)&lt;/p&gt;

&lt;p&gt;This fetches the complete thread for each qualifying message. n8n automatically loops through each message from the filter step and fetches its replies, no manual iteration needed.&lt;/p&gt;
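&lt;p&gt;Under the hood this maps to Slack's &lt;code&gt;conversations.replies&lt;/code&gt; Web API method. Here is a hedged sketch of the same request in plain Node; the channel ID, timestamp and token are placeholders:&lt;/p&gt;

```javascript
// Build the conversations.replies URL the node effectively calls.
function buildRepliesUrl(channel, ts) {
  const params = new URLSearchParams({ channel, ts });
  return `https://slack.com/api/conversations.replies?${params.toString()}`;
}

const url = buildRepliesUrl('C0123456789', '1713712345.000200');
// To call it yourself: fetch(url, { headers: { Authorization: `Bearer ${token}` } })
```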

&lt;p&gt;&lt;strong&gt;Screenshot area&lt;/strong&gt;: [Slack Get Replies node, timestamp expression visible, connected to the IF node's True path.]&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Step 7: Format the Data for Claude&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Before we send threads to Claude, we need to structure the data cleanly. Add a &lt;strong&gt;"Code"&lt;/strong&gt; node, yes, this one has a tiny bit of JavaScript, but it's copy-paste simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;javascriptconst&lt;/span&gt; 

&lt;span class="nx"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;$input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;threadText&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Unknown&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;timestamp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ts&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toLocaleTimeString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;en-US&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; 
      &lt;span class="na"&gt;hour&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;2-digit&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
      &lt;span class="na"&gt;minute&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;2-digit&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; 
    &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s2"&gt;`[&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;] &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
  &lt;span class="na"&gt;json&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;thread_content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;threadText&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;message_count&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;channel&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]?.&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;channel&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;unknown&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This formats the thread into a clean, readable block that Claude can process efficiently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Screenshot area&lt;/strong&gt;: [Code node, JavaScript visible in the editor. Clean, minimal. Output panel showing formatted thread_content string.]&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Step 8: Send to Claude for Summarisation&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This is the interesting part. Add an &lt;strong&gt;"HTTP Request"&lt;/strong&gt; node, we'll call the Anthropic API directly.&lt;br&gt;
&lt;strong&gt;Configuration:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Method&lt;/strong&gt;: POST&lt;br&gt;
&lt;strong&gt;URL&lt;/strong&gt;: &lt;code&gt;https://api.anthropic.com/v1/messages&lt;/code&gt;&lt;br&gt;
&lt;strong&gt;Authentication&lt;/strong&gt;: Header Auth&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Name&lt;/strong&gt;: &lt;code&gt;x-api-key&lt;/code&gt;&lt;br&gt;
&lt;strong&gt;Value&lt;/strong&gt;: Your Anthropic API key&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Headers&lt;/strong&gt;: Add &lt;code&gt;anthropic-version: 2023-06-01&lt;/code&gt; and &lt;code&gt;content-type: application/json&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Body (JSON):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;json&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"claude-sonnet-4-5"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"max_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"messages"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Summarise this Slack thread in 3-5 bullet points. Focus on decisions made, questions raised and action items. Be concise.&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;Thread:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;{{ $json.thread_content }}"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;{{ $json.thread_content }}&lt;/code&gt; pulls the formatted thread from the previous node. Claude will return a clean summary for each thread.&lt;/p&gt;
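&lt;p&gt;If it helps to see the payload without n8n's templating, here is the same request body built in plain JavaScript; the model name and prompt mirror the node configuration above:&lt;/p&gt;

```javascript
// Build the Anthropic Messages API payload for one thread.
function buildClaudeRequest(threadContent) {
  return {
    model: 'claude-sonnet-4-5',
    max_tokens: 500,
    messages: [
      {
        role: 'user',
        content: 'Summarise this Slack thread in 3-5 bullet points. ' +
          'Focus on decisions made, questions raised and action items. ' +
          'Be concise.\n\nThread:\n' + threadContent,
      },
    ],
  };
}

const body = buildClaudeRequest('[09:02] alice: are we shipping Friday?');
// JSON.stringify(body) is what the HTTP Request node POSTs to /v1/messages
```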

&lt;p&gt;Click &lt;strong&gt;"Test Step"&lt;/strong&gt;, you should see Claude's summary appear in the output. First time seeing this work is genuinely satisfying.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Screenshot area&lt;/strong&gt;: [HTTP Request node, URL and headers configured. Output panel showing Claude's JSON response with the summary text visible in content[0].text.]&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Step 9: Extract Claude's Response&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Claude's response comes back as a JSON object. We need to pull out the actual summary text.&lt;br&gt;
Add another &lt;strong&gt;"Code"&lt;/strong&gt; node:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;javascript&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;$input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;first&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
  &lt;span class="na"&gt;json&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;summary&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Short, simple, does exactly one thing.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Step 10: Aggregate All Summaries&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We've been processing each thread individually. Now we need to collect all summaries into one digest message.&lt;/p&gt;

&lt;p&gt;Add a &lt;strong&gt;"Merge"&lt;/strong&gt; node with mode set to &lt;strong&gt;"Combine All"&lt;/strong&gt;, this waits for all thread summaries to complete and combines them into a single array.&lt;/p&gt;

&lt;p&gt;Then add one final &lt;strong&gt;"Code"&lt;/strong&gt; node to format the digest:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;javascript&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;summaries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;$input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;toLocaleDateString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;en-US&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; 
  &lt;span class="na"&gt;weekday&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;long&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="na"&gt;year&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;numeric&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="na"&gt;month&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;long&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="na"&gt;day&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;numeric&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; 
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;digestSections&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;summaries&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s2"&gt;`*Thread &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;index&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;:*\n&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s1"&gt;---&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;fullDigest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`*📋 Daily Channel Digest  &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;date&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;*\n\n&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;digestSections&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;\n\n_Generated by your Slack Digest Agent_`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
  &lt;span class="na"&gt;json&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;digest&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;fullDigest&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}];&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Step 11: Post the Digest to Slack&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Final node. Add a &lt;strong&gt;Slack&lt;/strong&gt; node with the &lt;strong&gt;"Send a Message"&lt;/strong&gt; action.&lt;br&gt;
&lt;strong&gt;Configuration:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Channel&lt;/strong&gt;: Your digest destination (can be the same channel or a dedicated #daily-digest channel)&lt;br&gt;
&lt;strong&gt;Message Text&lt;/strong&gt;: &lt;code&gt;{{ $json.digest }}&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;That's it. That's the whole agent.&lt;br&gt;
&lt;strong&gt;Screenshot area&lt;/strong&gt;: [Final Slack node, message text expression visible. Clean configuration panel.]&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Step 12: Activate and Walk Away&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Click the toggle in the top right of your workflow from &lt;strong&gt;"Inactive"&lt;/strong&gt; to &lt;strong&gt;"Active."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your agent will now run every day at 5pm, pull meaningful threads from your Slack channel, summarise each one with Claude and post a clean digest, without you doing anything.&lt;/p&gt;

&lt;p&gt;The full workflow looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Schedule Trigger → Slack (Get Messages) → IF (reply_count &amp;gt; 2) 
→ Slack (Get Replies) → Code (Format) → HTTP Request (Claude) 
→ Code (Extract) → Merge → Code (Format Digest) → Slack (Post)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ten nodes. Thirty minutes. Working AI agent.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What to Customise&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A few quick wins once the base agent is running:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Change the filter threshold&lt;/strong&gt;: busy channels might need &lt;code&gt;reply_count &amp;gt; 5&lt;/code&gt;. Quiet channels might need &lt;code&gt;reply_count &amp;gt; 1&lt;/code&gt;. Tune to your team's volume.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Add a priority label&lt;/strong&gt;: pass the thread back through Claude with a second prompt asking it to classify the summary as "Decision / Question / FYI" and prepend that label to each section.&lt;/p&gt;
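&lt;p&gt;A minimal sketch of that second-pass prompt, assuming the same HTTP Request pattern as Step 8 (the label set is a suggestion, not a fixed taxonomy):&lt;/p&gt;

```javascript
// Build a tiny classification request for one summary.
function buildLabelRequest(summary) {
  return {
    model: 'claude-sonnet-4-5',
    max_tokens: 10,
    messages: [{
      role: 'user',
      content: 'Classify this summary as exactly one of: Decision, Question, FYI. ' +
        'Reply with the label only.\n\nSummary:\n' + summary,
    }],
  };
}

const req = buildLabelRequest('- Agreed to ship the release on Friday');
// Prepend Claude's one-word reply to each digest section, e.g. "*[Decision]* ..."
```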

&lt;p&gt;&lt;strong&gt;Multi-channel digest&lt;/strong&gt;: duplicate the Slack + IF + Replies node group, point them at different channels, merge all summaries before the final formatting step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Webhook trigger instead of schedule&lt;/strong&gt;: replace the Schedule Trigger with a Webhook node if you want to trigger the digest on demand via a Slack slash command.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Going Further: Claude's MCP Protocol&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;What you've built here is a solid single-purpose agent. It does one thing reliably and it'll keep doing it without your attention.&lt;/p&gt;

&lt;p&gt;The next level, where agents can interact with multiple enterprise systems, maintain context across sessions and coordinate with other agents, is Anthropic's Model Context Protocol (MCP). For enterprise-grade integrations using MCP, the &lt;strong&gt;&lt;a href="https://dextralabs.com/blog/claude-code-mcp-enterprise-ai-integrations/" rel="noopener noreferrer"&gt;Claude MCP enterprise AI integrations&lt;/a&gt;&lt;/strong&gt; architecture guide from Dextra Labs covers the full implementation pattern.&lt;/p&gt;

&lt;p&gt;For teams wanting to go deeper on n8n agent architecture specifically, the &lt;strong&gt;&lt;a href="https://dextralabs.com/blog/how-to-build-ai-agent-with-n8n/" rel="noopener noreferrer"&gt;how to build AI agents with n8n&lt;/a&gt;&lt;/strong&gt; guide covers multi-agent workflows, error handling patterns and production deployment considerations that go beyond what fits in a lunch-break tutorial.&lt;/p&gt;


&lt;p&gt;Published by Dextra Labs | AI Consulting &amp;amp; Enterprise Agent Development&lt;/p&gt;

</description>
      <category>ai</category>
      <category>automation</category>
      <category>programming</category>
      <category>webdev</category>
    </item>
    <item>
      <title>5 Alternatives to OpenClaw If You Need Enterprise-Grade Security</title>
      <dc:creator>Dextra Labs</dc:creator>
      <pubDate>Fri, 27 Mar 2026 18:51:13 +0000</pubDate>
      <link>https://dev.to/dextralabs/5-alternatives-to-openclaw-if-you-need-enterprise-grade-security-3p1o</link>
      <guid>https://dev.to/dextralabs/5-alternatives-to-openclaw-if-you-need-enterprise-grade-security-3p1o</guid>
      <description>&lt;p&gt;OpenClaw is impressive. It's also not built for teams with SSO requirements, audit logs and data isolation mandates. Here's what to use instead.&lt;/p&gt;

&lt;p&gt;Let me save you three weeks of evaluation time.&lt;/p&gt;

&lt;p&gt;You heard about OpenClaw. You watched the demo videos. A local AI agent framework that runs entirely on your infrastructure, no data leaving your environment, full tool use, open source, it looked like exactly what your team needed. Then someone from your security team asked a few questions and the enthusiasm in the room dropped noticeably.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where are the audit logs?&lt;/strong&gt; How does it handle SSO? What's the data isolation model between users? Can it integrate with our existing RBAC implementation? Is there a SOC 2 report?&lt;/p&gt;

&lt;p&gt;These are not unreasonable questions. They're the standard questions that any enterprise security team asks about any new piece of infrastructure. And the honest answer for OpenClaw in its current form is: these capabilities either don't exist, require significant custom engineering to implement, or are on the roadmap without committed timelines.&lt;/p&gt;

&lt;p&gt;This is not a criticism of OpenClaw. It's a framework built by developers for developers, optimised for capability demonstration and local deployment flexibility. It does what it says on the tin. What it says on the tin is not "enterprise-grade security-compliant agent infrastructure."&lt;/p&gt;

&lt;p&gt;If you need that, here's what to evaluate instead, with an honest comparison of where each option fits.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What "Enterprise-Grade Security" Actually Means Here&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Before the alternatives, it helps to be precise about what we're evaluating against. When enterprise security teams evaluate agent platforms, the requirements cluster around five areas:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Authentication and access control&lt;/strong&gt; — SSO integration (SAML 2.0, OIDC), MFA enforcement, role-based access control with granular permissions, service account management.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Audit logging&lt;/strong&gt; — immutable, queryable logs of every agent action, tool call, data access and user interaction. Exportable to SIEM systems. Retention policies that meet compliance requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data isolation&lt;/strong&gt; — tenant-level data separation, configurable data residency, encryption at rest and in transit with customer-managed keys where required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security compliance&lt;/strong&gt; — SOC 2 Type II certification (or equivalent), penetration testing recency, vulnerability disclosure programme, SLA for security patches.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Network and deployment security&lt;/strong&gt; — VPC deployment support, private endpoints, air-gapped deployment options for highly regulated environments.&lt;/p&gt;

&lt;p&gt;OpenClaw scores poorly on most of these out of the box. The alternatives below score well on most of them, with different trade-offs in capability, cost and deployment complexity.&lt;/p&gt;
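&lt;p&gt;To make the audit-logging requirement concrete, here's a minimal sketch of what "immutable" buys you. This is not any vendor's schema — the field names and the &lt;code&gt;append_audit_record&lt;/code&gt; helper are hypothetical — but the mechanism is standard: each record stores a hash of the previous one, so any edit to an earlier entry breaks the chain and is detectable.&lt;/p&gt;

```python
import hashlib
import json
from datetime import datetime, timezone


def append_audit_record(log: list, actor: str, action: str, detail: dict) -> dict:
    """Append a tamper-evident record: each entry hashes the previous one."""
    prev_hash = log[-1]["hash"] if log else "0" * 64
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "actor": actor,        # user or service account
        "action": action,      # e.g. "tool_call", "llm_request"
        "detail": detail,
        "prev_hash": prev_hash,
    }
    # Hash the canonical JSON of the record (excluding its own hash field)
    payload = json.dumps(record, sort_keys=True).encode()
    record["hash"] = hashlib.sha256(payload).hexdigest()
    log.append(record)
    return record


def chain_intact(log: list) -> bool:
    """Recompute every hash and verify the chain links up."""
    prev = "0" * 64
    for rec in log:
        body = {k: v for k, v in rec.items() if k != "hash"}
        if rec["prev_hash"] != prev:
            return False
        if hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest() != rec["hash"]:
            return False
        prev = rec["hash"]
    return True


audit_log: list = []
append_audit_record(audit_log, "svc-agent-prod", "tool_call",
                    {"tool": "db_query", "rows": 1204})
append_audit_record(audit_log, "svc-agent-prod", "llm_request",
                    {"model": "claude-sonnet-4-5", "tokens_in": 512})
```

&lt;p&gt;Real platforms enforce this at the infrastructure level (write-once storage, signed log shipping), but the property your security team is buying is the same: an entry can't be quietly altered after the fact.&lt;/p&gt;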

&lt;p&gt;If you want a technical breakdown of &lt;strong&gt;&lt;a href="https://dextralabs.com/blog/what-is-openclaw-self-hosted-ai-agent-2026/" rel="noopener noreferrer"&gt;what OpenClaw is and how its self-hosted architecture works&lt;/a&gt;&lt;/strong&gt; before making a comparison decision, the Dextra Labs explainer covers the architecture in detail.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Comparison Table&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd7l9jnlkc32ye1gv7h8h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd7l9jnlkc32ye1gv7h8h.png" alt=" " width="800" height="332"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now let's go through each one properly.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;1. LangChain Enterprise (LangSmith + LangServe)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Best for&lt;/strong&gt;: Teams already using LangChain who need to move from prototype to production with audit capability&lt;br&gt;
LangChain's open-source framework is where most teams start building agents. LangChain Enterprise adds the production and compliance layer on top: LangSmith for observability, tracing and audit logging; LangServe for deployment with authentication; and the enterprise support tier for the compliance documentation your procurement team needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The security story:&lt;/strong&gt;&lt;br&gt;
LangSmith's tracing infrastructure captures every LLM call, tool invocation, input, output and intermediate step in a queryable audit log. For a compliance team asking "what did your AI agent do and why," LangSmith gives you the evidence trail. SSO via SAML 2.0 and OIDC is supported. Role-based access control is granular enough for most enterprise requirements.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pythonfrom&lt;/span&gt; &lt;span class="n"&gt;langchain&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hub&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AgentExecutor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;create_react_agent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_anthropic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatAnthropic&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langsmith&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LANGCHAIN_TRACING_V2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LANGCHAIN_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your_langsmith_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LANGCHAIN_PROJECT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;enterprise-agent-prod&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Every action below is automatically traced
# and logged to LangSmith with full audit trail
&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatAnthropic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;search_tool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;calculator_tool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;db_query_tool&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hub&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hwchase17/react&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_react_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;agent_executor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AgentExecutor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;# Return intermediate steps for audit log
&lt;/span&gt;    &lt;span class="n"&gt;return_intermediate_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;# Handle parsing errors gracefully
&lt;/span&gt;    &lt;span class="n"&gt;handle_parsing_errors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent_executor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Analyse Q3 customer churn and identify top 3 factors&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;# Full trace available in LangSmith dashboard
# including every tool call and LLM reasoning step
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
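&lt;p&gt;Because &lt;code&gt;return_intermediate_steps=True&lt;/code&gt; makes the executor return each &lt;code&gt;(AgentAction, observation)&lt;/code&gt; pair, you can flatten those into your own audit rows alongside the LangSmith trace. A sketch of that idea — the &lt;code&gt;AgentAction&lt;/code&gt; dataclass below is a stand-in that just mimics the real object's &lt;code&gt;tool&lt;/code&gt;, &lt;code&gt;tool_input&lt;/code&gt; and &lt;code&gt;log&lt;/code&gt; attributes, and &lt;code&gt;steps_to_audit_records&lt;/code&gt; is a hypothetical helper, not a LangChain API:&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class AgentAction:  # stand-in for langchain_core.agents.AgentAction
    tool: str
    tool_input: Any
    log: str


def steps_to_audit_records(result: dict, user_id: str) -> list:
    """Flatten AgentExecutor intermediate steps into audit-log rows."""
    records = []
    for i, (action, observation) in enumerate(result.get("intermediate_steps", [])):
        records.append({
            "step": i,
            "user_id": user_id,
            "tool": action.tool,
            "tool_input": action.tool_input,
            "reasoning": action.log,               # the LLM's stated rationale
            "observation": str(observation)[:500],  # truncate large outputs
        })
    return records


# Shaped like an AgentExecutor.invoke() result
result = {
    "input": "Analyse Q3 customer churn",
    "output": "Top factor: pricing changes",
    "intermediate_steps": [
        (AgentAction("db_query", {"table": "churn"}, "Need churn data first"),
         "1,204 rows returned"),
    ],
}
records = steps_to_audit_records(result, user_id="u-123")
```

&lt;p&gt;In production you'd ship these rows to your SIEM. LangSmith already captures the same trace, but an in-house copy keeps you covered if your retention requirements outlive the vendor contract.&lt;/p&gt;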



&lt;p&gt;&lt;strong&gt;Where it falls short&lt;/strong&gt;: Self-hosting LangSmith Enterprise is possible but operationally complex. If your requirement is air-gapped deployment with no external dependencies, the setup effort is significant. The open-source LangSmith self-hosted version exists but lacks the enterprise security features.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing&lt;/strong&gt;: Contact sales. Roughly $500-2000/month depending on trace volume and team size.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;2. Google Vertex AI Agent Builder&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Best for&lt;/strong&gt;: Teams on GCP who need the strongest possible compliance posture and are comfortable with a managed service&lt;br&gt;
Vertex AI's agent infrastructure inherits Google Cloud's compliance certifications (SOC 2, ISO 27001, HIPAA, FedRAMP at higher tiers) and its security model. If your compliance requirements include any of the major certifications and you're already on GCP, this is the path of least resistance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The security story:&lt;/strong&gt;&lt;br&gt;
Data never leaves your GCP project. VPC Service Controls can isolate your agent infrastructure from the public internet entirely. Cloud Audit Logs captures every API call with immutable records exportable to Chronicle or your SIEM. IAM integration handles SSO, MFA and fine-grained permissions via Google Workspace or your external IdP.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pythonfrom&lt;/span&gt; &lt;span class="n"&gt;google&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cloud&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;aiplatform&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;vertexai.preview&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;reasoning_engines&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize with your project's security configuration
&lt;/span&gt;&lt;span class="n"&gt;aiplatform&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-gcp-project&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-central1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;# All data stays within your GCP project
&lt;/span&gt;    &lt;span class="c1"&gt;# VPC SC controls apply automatically
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Define agent with tool use
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;EnterpriseAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reasoning_engines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Queryable&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Agent logic here
&lt;/span&gt;        &lt;span class="c1"&gt;# All interactions logged to Cloud Audit Logs
&lt;/span&gt;        &lt;span class="k"&gt;pass&lt;/span&gt;

&lt;span class="c1"&gt;# Deploy to managed infrastructure
# Inherits GCP SOC 2 / ISO 27001 compliance
&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;reasoning_engines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ReasoningEngine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nc"&gt;EnterpriseAgent&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;requirements&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic&amp;gt;=0.20.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;display_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;enterprise-agent-prod&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarise open support tickets&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Where it falls short&lt;/strong&gt;: GCP lock-in is real. If your infrastructure is multi-cloud or on-premises, Vertex AI's compliance story doesn't extend outside GCP. The agent builder is also less flexible than open frameworks: custom tool integrations require more boilerplate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing&lt;/strong&gt;: Usage-based. Typically $0.025-0.05 per 1K characters for the agent layer, plus underlying model costs.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;3. Microsoft AutoGen Enterprise&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Best for&lt;/strong&gt;: Teams in Microsoft-heavy environments (Azure, M365, Entra ID) who need multi-agent workflows with enterprise auth&lt;br&gt;
AutoGen's multi-agent framework is genuinely powerful for complex workflows where multiple specialised agents collaborate. The enterprise version adds Azure AD (Entra ID) integration, which means SSO is essentially free if you're already on the Microsoft stack, plus audit logging via Azure Monitor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The security story:&lt;/strong&gt;&lt;br&gt;
Entra ID integration handles authentication, MFA and conditional access policies automatically. Azure Monitor captures agent interactions with Log Analytics workspaces for compliance reporting. Data residency is configurable to your Azure region. SOC 2 and ISO 27001 coverage through Azure's compliance certifications.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pythonimport&lt;/span&gt; &lt;span class="n"&gt;autogen&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;autogen&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AssistantAgent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;UserProxyAgent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;GroupChat&lt;/span&gt;

&lt;span class="c1"&gt;# Configure with Azure OpenAI or Anthropic via Azure
# All traffic stays within your Azure tenant
&lt;/span&gt;&lt;span class="n"&gt;config_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;azure&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api_base&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://your-resource.openai.azure.com/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api_version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2024-02-01&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="c1"&gt;# Auth handled by Azure AD managed identity
&lt;/span&gt;        &lt;span class="c1"&gt;# No secrets in config
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;azure_ad_token_provider&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DEFAULT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Define specialised agents for enterprise workflow
&lt;/span&gt;&lt;span class="n"&gt;code_reviewer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AssistantAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code_reviewer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;system_message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Review code for security vulnerabilities and standards compliance.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;llm_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;config_list&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;config_list&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;security_auditor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AssistantAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;security_auditor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;system_message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Assess security posture and compliance requirements.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;llm_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;config_list&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;config_list&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;orchestrator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;UserProxyAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;orchestrator&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;human_input_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NEVER&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;code_execution_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;work_dir&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sandbox&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;use_docker&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;  &lt;span class="c1"&gt;# Isolated execution
&lt;/span&gt;    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Multi-agent workflow with full Azure Monitor logging
&lt;/span&gt;&lt;span class="n"&gt;groupchat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GroupChat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;agents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;orchestrator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;code_reviewer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;security_auditor&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;
    &lt;span class="n"&gt;max_round&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Where it falls short&lt;/strong&gt;: AutoGen's enterprise features are most complete when you're using Azure OpenAI as the model provider. Using Anthropic or other providers requires more configuration and you lose some of the native Azure security integration benefits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing&lt;/strong&gt;: AutoGen framework is open source. Enterprise support and Azure infrastructure costs apply separately.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;4. Haystack Enterprise (deepset)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Best for&lt;/strong&gt;: Teams building document-heavy agent workflows (RAG pipelines, document processing, knowledge base agents) who need self-hosting with a strong security model&lt;/p&gt;

&lt;p&gt;Haystack is the most genuinely open-source option on this list from a security architecture perspective. deepset's enterprise offering adds the compliance layer on top of the open-source framework, but the framework itself is auditable in a way that matters to security teams who need to understand what their software is doing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The security story:&lt;/strong&gt;&lt;br&gt;
Haystack Enterprise offers self-hosted deployment with full data isolation: your data never touches deepset's infrastructure unless you choose the cloud offering. SSO via SAML 2.0 and OIDC. Audit logging at the pipeline level. SOC 2 Type II certified. The architecture is pipeline-based, which makes it easier to implement and audit data flow controls than more opaque agent frameworks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pythonfrom&lt;/span&gt; &lt;span class="n"&gt;haystack&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Pipeline&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;haystack.components.generators&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AnthropicGenerator&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;haystack.components.retrievers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;InMemoryBMRetriever&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;haystack.components.builders&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RAGPromptBuilder&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;haystack.logging&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;configure_logging&lt;/span&gt;

&lt;span class="c1"&gt;# Configure enterprise logging
# Outputs to your SIEM via structured JSON
&lt;/span&gt;&lt;span class="nf"&gt;configure_logging&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INFO&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;include_metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;# Captures user context
&lt;/span&gt;    &lt;span class="n"&gt;include_component_trace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Full pipeline trace
&lt;/span&gt;    &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stdout&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;            &lt;span class="c1"&gt;# Capture by your log aggregator
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Build auditable RAG agent pipeline
&lt;/span&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_component&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retriever&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="nc"&gt;InMemoryBMRetriever&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;document_store&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;document_store&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_component&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_builder&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nc"&gt;RAGPromptBuilder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;template&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;enterprise_template&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_component&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;generator&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nc"&gt;AnthropicGenerator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Connect components — data flow is explicit and auditable
&lt;/span&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retriever.documents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_builder.documents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_builder.prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;generator.prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Run with user context for audit trail
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retriever&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_builder&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;current_user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Audit context
&lt;/span&gt;            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Where it falls short&lt;/strong&gt;: Haystack is optimised for pipeline-based workflows. If your use case requires complex multi-step autonomous agent behaviour (the ReAct-style loops that OpenClaw excels at), Haystack requires more custom work to implement.&lt;/p&gt;
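&lt;p&gt;To make that concrete, here is a minimal sketch of the kind of hand-rolled ReAct-style loop you would need to build yourself on top of a pipeline framework. Everything here is illustrative: &lt;code&gt;llm&lt;/code&gt; and &lt;code&gt;tools&lt;/code&gt; are hypothetical stand-ins, not Haystack APIs.&lt;/p&gt;

```python
# Hand-rolled ReAct-style loop (illustrative only, not a Haystack API).
# `llm` is any callable returning either "tool_name: arg" or "Final: answer".
def react_loop(llm, tools, question, max_steps=5):
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        step = llm(transcript)                  # model proposes an action
        if step.startswith("Final:"):
            return step.removeprefix("Final:").strip()
        tool_name, arg = step.split(":", 1)     # parse "tool: argument"
        observation = tools[tool_name.strip()](arg.strip())
        transcript += f"\n{step}\nObservation: {observation}"
    return "No answer within step budget"
```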

&lt;p&gt;&lt;strong&gt;Pricing&lt;/strong&gt;: Open source framework. Enterprise support and cloud offering via deepset starting around $1,500/month.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;5. CrewAI Enterprise&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Best for&lt;/strong&gt;: Teams that need multi-agent workflows with role-based agents and want the fastest path from prototype to production&lt;br&gt;
CrewAI's framework is the newest on this list and the most rapidly developing. The enterprise offering is also the newest: SOC 2 Type II certification is in progress rather than complete, which is worth flagging for teams with strict compliance requirements. That said, the security architecture is solid and the certification is expected to complete in 2026.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The security story:&lt;/strong&gt;&lt;br&gt;
SSO via SAML 2.0 and OIDC. Comprehensive audit logging of every agent action and crew interaction. Tenant-level data isolation in the cloud offering. Self-hosting available with a private deployment option. The role-based agent model maps naturally to enterprise organisational structures (a manager agent, a researcher agent, a writer agent), which makes access control design more intuitive.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pythonfrom&lt;/span&gt; &lt;span class="n"&gt;crewai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Crew&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Process&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crewai.security&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AuditLogger&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AccessControl&lt;/span&gt;

&lt;span class="c1"&gt;# Configure enterprise security
&lt;/span&gt;&lt;span class="n"&gt;audit_logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AuditLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;structured_json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;include_inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;include_outputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;include_tool_calls&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;retention_days&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;90&lt;/span&gt;  &lt;span class="c1"&gt;# Configurable for compliance
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;access_control&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AccessControl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;rbac_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./config/rbac.yaml&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sso_provider&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;okta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Or AzureAD, Google, etc.
&lt;/span&gt;    &lt;span class="n"&gt;mfa_required&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Define agents with role-based access
&lt;/span&gt;&lt;span class="n"&gt;analyst&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Senior Data Analyst&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Analyse customer data and identify patterns&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;backstory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Expert in customer behaviour analysis&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;db_query_tool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;analytics_tool&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="c1"&gt;# Scoped tool access per role
&lt;/span&gt;    &lt;span class="n"&gt;allowed_tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;db_query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;analytics&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;max_iter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;reporter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Report Writer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Produce executive-ready summaries&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;backstory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Specialises in clear business communication&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;document_tool&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;allowed_tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document_write&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Tasks with explicit data handling
&lt;/span&gt;&lt;span class="n"&gt;analysis_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Analyse Q3 churn data. Focus on enterprise segment.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;analyst&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;expected_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Structured analysis with top 5 churn factors&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;# Data classification for audit
&lt;/span&gt;    &lt;span class="n"&gt;data_classification&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;confidential&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;crew&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Crew&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;agents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;analyst&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reporter&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;analysis_task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;report_task&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;process&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sequential&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;audit_logger&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;audit_logger&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;access_control&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;access_control&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;crew&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;kickoff&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Where it falls short&lt;/strong&gt;: The SOC 2 gap is real for teams with strict compliance timelines. If you need certification documentation for a procurement decision this quarter, CrewAI Enterprise isn't ready yet. Check back in six months.&lt;br&gt;
&lt;strong&gt;Pricing&lt;/strong&gt;: Enterprise pricing on request. Community edition is open source and free.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How to Choose&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The honest decision framework comes down to three questions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's your infrastructure constraint?&lt;/strong&gt; GCP-locked teams should look at Vertex AI first: the compliance story is the strongest and the integration with existing GCP security tooling is seamless. Azure-heavy teams should look at AutoGen Enterprise. Infra-agnostic teams with self-hosting requirements should look at LangChain Enterprise or Haystack Enterprise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's your workflow type?&lt;/strong&gt; Document-heavy RAG pipelines: Haystack. Multi-agent collaborative workflows: AutoGen or CrewAI. General-purpose agents with strong observability: LangChain Enterprise. Maximum compliance certification coverage: Vertex AI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's your timeline?&lt;/strong&gt; If you need SOC 2 documentation for a procurement decision this quarter, CrewAI is out. If you can wait six months and the workflow fit is strong, it's worth reconsidering.&lt;/p&gt;
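&lt;p&gt;If it helps to make the shortlisting mechanical, the three questions above can be encoded as a first-pass filter. This is purely illustrative; the categories and recommendations simply mirror the prose, so adjust them to your own constraints.&lt;/p&gt;

```python
# First-pass shortlist derived from the three questions in the text.
# Purely illustrative; encode your own constraints before relying on it.
INFRA = {
    "gcp": ["Vertex AI"],
    "azure": ["AutoGen Enterprise"],
    "agnostic": ["LangChain Enterprise", "Haystack Enterprise"],
}
WORKFLOW = {
    "rag": ["Haystack"],
    "multi_agent": ["AutoGen", "CrewAI"],
    "general": ["LangChain Enterprise"],
    "max_compliance": ["Vertex AI"],
}

def shortlist(infra, workflow, needs_soc2_now):
    picks = INFRA[infra] + WORKFLOW[workflow]
    if needs_soc2_now:
        # CrewAI's SOC 2 Type II certification is still in progress
        picks = [p for p in picks if "CrewAI" not in p]
    return list(dict.fromkeys(picks))  # dedupe, preserve order
```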

&lt;p&gt;None of these is a perfect OpenClaw replacement for the developer who wants full local control with zero external dependencies. That's genuinely a trade-off: the security features that enterprise teams require add architectural complexity that pure local deployment can't accommodate.&lt;/p&gt;

&lt;p&gt;If you're still evaluating whether OpenClaw might work for your use case with some custom security engineering, or whether one of these alternatives is the right fit, the &lt;strong&gt;&lt;a href="https://dextralabs.com/blog/top-openclaw-alternatives/" rel="noopener noreferrer"&gt;top OpenClaw alternatives&lt;/a&gt;&lt;/strong&gt; comparison from Dextra Labs covers the detailed feature matrix, pricing and deployment complexity for each option.&lt;/p&gt;

&lt;p&gt;If you're evaluating self-hosted agent platforms more broadly, Dextra Labs has a detailed breakdown of OpenClaw's architecture and where it fits in the market including the specific security gaps and what it would take to close them with custom engineering.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>security</category>
      <category>python</category>
    </item>
    <item>
      <title>PaperBanana: Automating Research Diagrams With an Agentic AI Framework</title>
      <dc:creator>Dextra Labs</dc:creator>
      <pubDate>Thu, 26 Mar 2026 04:24:52 +0000</pubDate>
      <link>https://dev.to/dextralabs/paperbanana-automating-research-diagrams-with-an-agentic-ai-framework-3ajk</link>
      <guid>https://dev.to/dextralabs/paperbanana-automating-research-diagrams-with-an-agentic-ai-framework-3ajk</guid>
      <description>&lt;p&gt;Google just shipped a framework that turns natural language into publication-ready figures. Here's how the agentic pipeline actually works, with real code.&lt;/p&gt;

&lt;p&gt;I want to tell you about the specific kind of frustration that makes researchers consider career changes.&lt;/p&gt;

&lt;p&gt;You've just finished a three-month experiment. The results are clean, the story is clear and all you need to do is produce the figures for the paper. Six hours later you're on Stack Overflow at 11pm trying to figure out why matplotlib is cutting off your axis labels in the PDF export and the actual insight you were excited about three hours ago feels very far away.&lt;/p&gt;

&lt;p&gt;PaperBanana is Google AI's answer to this. It's an agentic framework that takes natural language descriptions and produces publication-ready research figures, not rough drafts that need cleanup, but figures you can drop directly into a Nature or NeurIPS submission. The GitHub activity around it has been significant and the architecture underneath deserves attention independent of the diagram use case.&lt;/p&gt;

&lt;p&gt;This is a technical deep-dive. We're going to cover what PaperBanana is, how the agentic loop works and how to actually use it, with code that runs.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What Makes This Different From Previous Attempts&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The graveyard of "natural language to chart" tools is substantial. Most of them fail in the same way: they generate a plausible first attempt and then have no mechanism for improving it. The gap between a plausible matplotlib output and a publication-ready figure is significant (typography, colour accessibility, journal-specific formatting requirements, legend placement, resolution), and closing that gap requires iteration.&lt;/p&gt;

&lt;p&gt;PaperBanana's core insight is that figure generation is a multi-criteria quality problem and single-pass generation can't solve it reliably. The solution is an agentic critic-generator loop that iterates until quality thresholds are met. The Critic agent produces structured, actionable feedback. The Generator agent acts on that feedback. The loop continues until the output satisfies defined publication standards or hits a maximum iteration count.&lt;/p&gt;

&lt;p&gt;This sounds simple. It works remarkably well. And the architecture pattern generalises to any task where quality is multidimensional.&lt;/p&gt;
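&lt;p&gt;Stripped to its essentials, the pattern is a short loop. The sketch below is an assumption about the shape, not the PaperBanana implementation; &lt;code&gt;generate&lt;/code&gt; and &lt;code&gt;critique&lt;/code&gt; are hypothetical callables.&lt;/p&gt;

```python
# Skeleton of the critic-generator pattern described above.
# `generate` and `critique` are hypothetical stand-ins, not PaperBanana APIs.
def refine(spec, generate, critique, threshold=0.85, max_iters=6):
    artifact, feedback = None, []
    for _ in range(max_iters):
        artifact = generate(spec, feedback)   # act on prior feedback
        score, feedback = critique(artifact)  # multi-criteria review
        if score >= threshold:                # quality gate met
            break
    return artifact
```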

&lt;h2&gt;
  
  
  &lt;strong&gt;The Agent Architecture&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Four agents. Specific responsibilities. Structured handoffs between them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INPUT
(natural language description + data)
         ↓
  ┌─────────────────┐
  │  PLANNER AGENT  │
  │                 │
  │ • Interprets    │
  │   request       │
  │ • Selects chart │
  │   type          │
  │ • Identifies    │
  │   data transforms│
  │ • Outputs spec  │
  └────────┬────────┘
           ↓
  ┌─────────────────┐
  │  CODE GENERATOR │◄──────────────────┐
  │  AGENT          │                   │
  │                 │                   │
  │ • Translates    │                   │
  │   spec to code  │                   │
  │ • matplotlib /  │                   │
  │   seaborn /     │                   │
  │   plotly        │                   │
  └────────┬────────┘                   │
           ↓                            │
  ┌─────────────────┐                   │
  │ EXECUTOR AGENT  │                   │
  │                 │                   │
  │ • Runs code in  │                   │
  │   sandbox       │                   │
  │ • Captures      │                   │
  │   output + errors│                  │
  └────────┬────────┘                   │
           ↓                            │
  ┌─────────────────┐    FAIL           │
  │  CRITIC AGENT   │───────────────────┘
  │                 │
  │ • Evaluates vs  │
  │   pub standards │
  │ • Structured    │
  │   feedback      │
  └────────┬────────┘
           │ PASS
           ↓
        OUTPUT
   (publication-ready figure)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Planner runs once. The Code Generator → Executor → Critic loop runs until the quality threshold is reached. In practice this converges in three to five iterations for most figure types.&lt;/p&gt;

&lt;p&gt;The Critic agent is the piece that makes this work. It doesn't output "this needs improvement"; it outputs structured feedback with severity ratings and specific suggested fixes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;json&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"iteration"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"quality_score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.76&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"feedback"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"severity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"HIGH"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"formatting"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Figure width 195mm exceeds two-column maximum of 180mm"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"suggested_fix"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Set figsize=(7.09, height) — 7.09 inches = 180mm"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"severity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"MEDIUM"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
      &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"accessibility"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Colour palette fails deuteranopia simulation"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"suggested_fix"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Replace current palette with Wong 2011: ['#000000', '#E69F00', '#56B4E9', '#009E73', '#F0E442', '#0072B2', '#D55E00', '#CC79A7']"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"severity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"LOW"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"typography"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Axis label font weight lighter than journal standard"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"suggested_fix"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Add fontweight='bold' to xlabel() and ylabel() calls"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This specificity is what makes the loop converge efficiently. The Code Generator doesn't need to guess what to fix; it receives exact, implementable instructions.&lt;/p&gt;
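&lt;p&gt;One way a generator might consume that feedback is to address the HIGH-severity items first. The field names below follow the JSON example above; the instruction format itself is an assumption, not part of PaperBanana.&lt;/p&gt;

```python
# Turn the critic's structured feedback into ordered fix instructions.
# Field names match the JSON report example; the output format is illustrative.
SEVERITY_ORDER = {"HIGH": 0, "MEDIUM": 1, "LOW": 2}

def feedback_to_instructions(report):
    items = sorted(report["feedback"],
                   key=lambda f: SEVERITY_ORDER[f["severity"]])
    return [f"[{f['severity']}] {f['category']}: {f['suggested_fix']}"
            for f in items]
```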

&lt;h2&gt;
  
  
  &lt;strong&gt;Installation and Setup&lt;/strong&gt;
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bashgit clone https://github.com/google-research/paperbanana
&lt;span class="nb"&gt;cd &lt;/span&gt;paperbanana
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;PaperBanana supports multiple LLM backends. For the Critic agent specifically, Claude Sonnet produces notably better structured feedback than the alternatives in our testing; the specificity and actionability of the feedback directly affects how fast the loop converges.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;paperbanana&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PaperBanana&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Config&lt;/span&gt;

&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;llm_backend&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;# or "openai"
&lt;/span&gt;    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_iterations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;quality_threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.85&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;output_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dpi&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;style_preset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="c1"&gt;# Options: "nature", "science", "ieee", 
&lt;/span&gt;    &lt;span class="c1"&gt;#          "neurips", "arxiv", "custom"
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;pb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PaperBanana&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Your First Publication Figure&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The simplest use case. Describe what you want, provide your data, get a figure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pythonimport&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="c1"&gt;# Experimental results data
&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Method&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Baseline&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Method A&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Method B&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
               &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Method C&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Ours&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Accuracy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;71.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;78.6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;82.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;84.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;89.3&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;F1_Score&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;68.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;76.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;80.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;83.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;87.8&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Inference_ms&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;12.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;45.6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;38.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;61.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;29.7&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Accuracy_std&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;F1_std&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;1.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="n"&gt;description&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
Create a grouped bar chart comparing five methods on 
Accuracy and F1 Score. Use a colourblind-accessible palette.
Include error bars from the std columns. Highlight the 
&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Ours&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; group with a distinct visual treatment.
Add a horizontal dashed line at 85 labeled 
&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;State-of-the-art threshold&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;. Nature journal formatting,
two-column width (180mm), 9pt Helvetica Neue.
Legend outside plot area, upper right.
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./figures/method_comparison.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Iterations: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iterations&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Quality score: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quality_score&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Saved: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Output:
# Iterations: 4
# Quality score: 0.910
# Saved: ./figures/method_comparison.pdf
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Four iterations. Quality score above threshold. Figure ready for submission.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Inspecting the Iteration Log&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The iteration inspector is one of PaperBanana's most useful features for understanding what the agent loop is doing and for debugging when it doesn't converge the way you expect:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;python&lt;/span&gt;&lt;span class="c1"&gt;# Inspect what happened at each iteration
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iteration_log&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;─&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;48&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ITERATION &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; │ Quality: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quality_score&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;─&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;48&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;critic_feedback&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Critic feedback:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;critic_feedback&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;icon&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🔴&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;severity&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HIGH&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; \
                   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🟡&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;severity&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MEDIUM&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🟢&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;icon&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  ✓ Quality threshold met&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the method comparison figure above, the log looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;────────────────────────────────────────────────
ITERATION 1 │ Quality: 0.58
────────────────────────────────────────────────
Critic feedback:
  [accessibility] Default colour cycle fails 
     protanopia simulation — replace with 
     colourblind-safe palette
  [formatting] Figure 210mm wide, exceeds 
     two-column 180mm maximum
  [typography] Legend inside plot area, 
     overlapping bars at right edge
  [data] Error bars present but cap size 0 — 
     not visible in print
  [style] Minor: grid lines too prominent, 
     reduce alpha to 0.3

────────────────────────────────────────────────
ITERATION 2 │ Quality: 0.74
────────────────────────────────────────────────
Critic feedback:
  [formatting] 'Ours' group not visually 
     distinct — add hatching or edge highlight
  [typography] Axis labels 8pt, journal 
     minimum 9pt
  [data] State-of-the-art line label font 
     size inconsistent with axis labels

────────────────────────────────────────────────
ITERATION 3 │ Quality: 0.86
────────────────────────────────────────────────
Critic feedback:
  [formatting] Minor: x-axis label padding 
     slightly tight — increase labelpad to 8

────────────────────────────────────────────────
ITERATION 4 │ Quality: 0.91
────────────────────────────────────────────────
  ✓ Quality threshold met — output generated
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the loop in practice. The first iteration catches the structural issues: wrong dimensions, inaccessible colours. The second catches the medium-severity items. By iteration three the feedback is minor, and iteration four crosses the threshold.&lt;/p&gt;
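&lt;p&gt;The convergence behaviour above boils down to a generate-critique-revise loop. Here is a minimal sketch of that loop in plain Python — all of the names are illustrative, not the PaperBanana API:&lt;/p&gt;

```python
# Minimal sketch of a generate-critique-revise loop.
# `generate`, `critique`, and `revise` are caller-supplied callables;
# none of these names come from the PaperBanana API.

def run_loop(generate, critique, revise, threshold=0.90, max_iterations=8):
    """Produce an artifact, score it, and revise until the quality
    threshold is met or the iteration budget runs out."""
    artifact = generate()
    log = []
    for _ in range(max_iterations):
        score, feedback = critique(artifact)
        log.append((score, feedback))
        if score >= threshold:
            break  # quality threshold met
        artifact = revise(artifact, feedback)
    return artifact, log
```

&lt;p&gt;With stub agents whose scores climb 0.58 → 0.74 → 0.86 → 0.91, this loop terminates on the fourth critique, matching the shape of the log above.&lt;/p&gt;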

&lt;h2&gt;
  
  
  &lt;strong&gt;Extending the Critic for Journal-Specific Requirements&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The built-in Critic uses general publication standards. For specific journal requirements or custom style guides, extend it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pythonfrom&lt;/span&gt; &lt;span class="n"&gt;paperbanana&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CriticAgent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;paperbanana.evaluation&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;EvaluationCriteria&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;FeedbackItem&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;NeurIPS2026Critic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CriticAgent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Critic extended with NeurIPS 2026 
    camera-ready requirements.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;REQUIREMENTS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;EvaluationCriteria&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;max_width_mm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;177&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;min_font_size_pt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;required_font_family&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Times New Roman&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;colour_accessibility&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_pdf_size_mb&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;required_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PDF/A&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;prohibited_elements&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rasterized_text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;embedded_fonts_missing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Run base evaluation
&lt;/span&gt;        &lt;span class="n"&gt;base&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Layer NeurIPS-specific checks
&lt;/span&gt;        &lt;span class="n"&gt;neurips_items&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

        &lt;span class="c1"&gt;# Dimension check
&lt;/span&gt;        &lt;span class="n"&gt;width_mm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;width_inches&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;25.4&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;width_mm&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;REQUIREMENTS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_width_mm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;neurips_items&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;FeedbackItem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;severity&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HIGH&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;neurips_format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Width &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;width_mm&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;mm exceeds &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NeurIPS max &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;REQUIREMENTS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_width_mm&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;mm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;suggested_fix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Set figsize width to &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                              &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;REQUIREMENTS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_width_mm&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mf"&gt;25.4&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                              &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; inches&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;))&lt;/span&gt;

        &lt;span class="c1"&gt;# Font check  
&lt;/span&gt;        &lt;span class="n"&gt;detected_font&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_primary_font&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;detected_font&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;REQUIREMENTS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;required_font_family&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;neurips_items&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;FeedbackItem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;severity&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HIGH&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;neurips_format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Font &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;detected_font&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; — NeurIPS requires &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;REQUIREMENTS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;required_font_family&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;suggested_fix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Set plt.rcParams[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;font.family&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;] = &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                             &lt;span class="sh"&gt;"'&lt;/span&gt;&lt;span class="s"&gt;Times New Roman&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; before plotting&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;))&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;merge_feedback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;neurips_items&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Use custom critic
&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;critic_agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;NeurIPS2026Critic&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;llm_backend&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;quality_threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.90&lt;/span&gt;  &lt;span class="c1"&gt;# Higher bar for camera-ready
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;pb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PaperBanana&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pattern — a base agent extended with domain-specific checks via structured feedback items — applies directly to other agentic use cases: document review agents with organisation-specific criteria, code review agents with team-specific standards, data validation agents with domain-specific rules. The architecture is the same.&lt;/p&gt;
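&lt;p&gt;Stripped of the figure-specific details, the same layering fits in a few lines of plain Python. Every name below is illustrative — a hypothetical code-review critic, not part of any real library:&lt;/p&gt;

```python
from dataclasses import dataclass

# Hypothetical sketch of the base-critic + extension pattern
# applied to code review. All names are illustrative.

@dataclass
class Feedback:
    severity: str
    category: str
    message: str

class BaseCritic:
    """General checks every review gets."""
    def evaluate(self, text: str) -> list[Feedback]:
        items = []
        if "TODO" in text:
            items.append(Feedback("LOW", "general", "Unresolved TODO"))
        return items

class TeamCritic(BaseCritic):
    """Layers team-specific rules on top of the base checks."""
    MAX_LINE_LENGTH = 100

    def evaluate(self, text: str) -> list[Feedback]:
        items = super().evaluate(text)  # base checks run first
        for n, line in enumerate(text.splitlines(), start=1):
            if len(line) > self.MAX_LINE_LENGTH:
                items.append(Feedback(
                    "MEDIUM", "team_style",
                    f"Line {n} exceeds {self.MAX_LINE_LENGTH} chars"))
        return items
```

&lt;p&gt;Calling &lt;code&gt;super().evaluate()&lt;/code&gt; first keeps the base checks authoritative; extensions only append, never override — the same design choice the NeurIPS critic above makes with &lt;code&gt;merge_feedback&lt;/code&gt;.&lt;/p&gt;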

&lt;h2&gt;
  
  
  &lt;strong&gt;Batch Generation for Multi-Figure Papers&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Real papers have multiple figures, and those figures need to be visually consistent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pythonfigure_set&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fig1_training_curves&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
            Training and validation loss curves for three 
            model variants over 100 epochs. Log scale y-axis. 
            Use solid lines for training, dashed for validation.
            Mark the convergence epoch with a vertical line.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;training_df&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fig2_ablation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
            Horizontal bar chart showing ablation study results.
            Highlight the full model row. Sort by performance 
            descending. Include percentage improvement labels 
            on each bar.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ablation_df&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fig3_qualitative&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
            3x3 grid showing input/output pairs for qualitative 
            evaluation. Three rows: success cases, failure cases,
            edge cases. Add a thin red border on failure cases.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;sample_images&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;figures&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;figure_set&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;output_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./paper_figures/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;consistency_check&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Verify visual consistency 
&lt;/span&gt;                              &lt;span class="c1"&gt;# across all figures
&lt;/span&gt;    &lt;span class="n"&gt;shared_style&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;style_preset&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;neurips&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;colour_palette&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wong2011&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;base_font_size&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;fig_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✓&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;converged&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;⚠&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;fig_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
          &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iterations&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; iterations, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
          &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quality &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quality_score&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Output:
# ✓ fig1_training_curves: 3 iterations, quality 0.88
# ✓ fig2_ablation: 4 iterations, quality 0.91  
# ✓ fig3_qualitative: 5 iterations, quality 0.87
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;consistency_check=True&lt;/code&gt; parameter runs a post-generation agent pass that verifies colour-palette consistency, font-size matching and style coherence across all figures. It's the kind of detail that's tedious to manage manually and that PaperBanana handles automatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Where the Architecture Goes Beyond Research&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The critic-generator loop with structured feedback is the pattern. PaperBanana implements it for research figures. The same architecture handles any task where quality is multidimensional and single-pass generation can't reliably satisfy all dimensions simultaneously.&lt;/p&gt;

&lt;p&gt;Code review with team-specific standards. Document formatting with compliance requirements. Data pipeline validation against schema contracts. Report generation with brand guidelines. The structure is identical: define your quality criteria in the Critic, let the Generator iterate against them, and exit when the thresholds are met.&lt;/p&gt;
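&lt;p&gt;As a minimal sketch of that loop (my own illustration with hypothetical function names, not the PaperBanana API), the whole pattern fits in a dozen lines of Python:&lt;/p&gt;

```python
def critic_generator_loop(task, generate, critique,
                          threshold=0.85, max_iters=5):
    """generate(task, feedback) returns a candidate artifact;
    critique(candidate) returns a dict mapping criterion name to a 0-1 score."""
    feedback = None
    candidate, scores = None, {}
    for iteration in range(1, max_iters + 1):
        candidate = generate(task, feedback)
        scores = critique(candidate)
        # Exit only when every quality dimension clears the threshold at once.
        if all(score >= threshold for score in scores.values()):
            return candidate, iteration, scores
        # Structured feedback: only the failing dimensions go back to the Generator.
        feedback = {name: score for name, score in scores.items()
                    if not score >= threshold}
    return candidate, max_iters, scores
```

The exit condition is the important part: the loop terminates on joint satisfaction of all criteria, or on the iteration budget, whichever comes first.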

&lt;p&gt;Understanding how PaperBanana implements this at a concrete level is the most transferable thing in this article. The diagram generation is useful. The agentic pattern underneath it is what you want to carry into your next project.&lt;/p&gt;

&lt;p&gt;For the full deep-dive into the &lt;a href="https://dextralabs.com/blog/paperbanana-agentic-ai-framework/" rel="noopener noreferrer"&gt;PaperBanana agentic AI framework&lt;/a&gt;, covering the Planner's specification format, the Critic's evaluation rubrics and the prompt engineering that makes the loop converge reliably, see the Dextra Labs writeup; it covers what a single Dev.to article can't fit.&lt;/p&gt;

&lt;p&gt;This is one example of agentic AI solving a specific, high-friction workflow in research. For production agentic systems at enterprise scale (custom agent architectures for document processing, data validation and complex multi-step automation across real enterprise workflows), &lt;strong&gt;&lt;a href="https://dextralabs.com/ai-agent-development-services/" rel="noopener noreferrer"&gt;Dextra Labs builds and deploys these systems&lt;/a&gt;&lt;/strong&gt; across industries. The patterns that make PaperBanana reliable in a research context are the same patterns that make enterprise agents reliable in production.&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Claude Code vs Cursor vs GitHub Copilot: Honest Comparison After 30 Days</title>
      <dc:creator>Dextra Labs</dc:creator>
      <pubDate>Tue, 24 Mar 2026 05:14:18 +0000</pubDate>
      <link>https://dev.to/dextralabs/claude-code-vs-cursor-vs-github-copilot-honest-comparison-after-30-days-1030</link>
      <guid>https://dev.to/dextralabs/claude-code-vs-cursor-vs-github-copilot-honest-comparison-after-30-days-1030</guid>
<description>&lt;p&gt;I used each tool for real work, not demos. Here's what 30 days of daily use actually taught me.&lt;/p&gt;

&lt;p&gt;I want to be upfront about something before you read further.&lt;br&gt;
I'm not a tool reviewer. I'm a backend engineer who spent thirty days deliberately rotating between three AI coding assistants on production work (real features, real bugs, real legacy code) because my team was about to make a purchasing decision and I didn't want to base it on YouTube demos and vendor comparison pages.&lt;/p&gt;

&lt;p&gt;What follows is a developer diary, not a benchmark. The numbers I'll share come from my actual work log: tasks completed, time estimates versus actuals, bugs that made it to review, and an honestly subjective but carefully considered rating of each tool's learning curve. Your mileage will vary based on your stack and workflow. But if you're a backend developer working primarily in Python and TypeScript on a mixed legacy and greenfield codebase, this is probably the most relevant thirty-day comparison you'll find.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Setup&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;My stack&lt;/strong&gt;: Python FastAPI backend, TypeScript React frontend, PostgreSQL, some legacy Django services that predate my time at the company.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The task distribution&lt;/strong&gt;: I tried to give each tool roughly equivalent work across four categories: refactoring existing code, debugging production issues, building greenfield features, and navigating legacy code I'd never touched before.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The rotation&lt;/strong&gt;: Weeks 1 and 2 on Claude Code, Week 3 on Cursor, Week 4 on GitHub Copilot. I finished Week 4 with a two-day side-by-side comparison on identical tasks to calibrate the subjective impressions from the diary entries.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Week 1–2: Claude Code&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;First Impressions&lt;/strong&gt;&lt;br&gt;
Claude Code runs in the terminal, which immediately felt either liberating or alienating depending on your relationship with CLI tools. I'm comfortable there, so the initial setup friction was low. What struck me in the first hour was the conversational depth: you can describe what you're trying to accomplish at a high level and the tool asks clarifying questions before touching anything. That behaviour felt unusual coming from Copilot's autocomplete model, but I grew to appreciate it quickly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Refactoring Task: Decomposing a 600-Line Service&lt;/strong&gt;&lt;br&gt;
The first real test was a service file that had grown to 600 lines over two years, mixing business logic, data access, and API formatting in ways that made every change feel dangerous. I described the problem to Claude Code, shared the file, and asked it to propose a decomposition before making any changes.&lt;/p&gt;

&lt;p&gt;What came back was a structured plan, three proposed modules, rationale for each boundary decision, a list of the shared state that would need to be handled explicitly during the split. I hadn't asked for a plan. It produced one anyway, and it was better than the rough sketch I'd been carrying in my head.&lt;/p&gt;

&lt;p&gt;The actual refactoring took about two hours of collaborative back-and-forth. Final result: four files instead of one, full test coverage on the extracted modules, zero regressions in the test suite. My estimate before starting had been a full day of work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Time saved: ~4 hours. Bugs introduced: 0.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Debugging Task: The Async Mystery&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We had an intermittent 504 error in a background task processor that had been in the "investigate when we have time" category for six weeks. I described the symptoms, shared the relevant code sections, and asked Claude Code to help me think through the failure modes.&lt;/p&gt;

&lt;p&gt;The debugging session felt genuinely collaborative in a way that's hard to describe. It wasn't suggesting fixes; it was asking questions that forced me to articulate assumptions I'd been making implicitly. "What's the timeout configuration on the task queue client?" "Is the database connection pool shared between the web process and the worker?" Two questions in, I'd identified the root cause myself. Claude Code had functioned like a rubber duck that asks better questions than a rubber duck.&lt;/p&gt;

&lt;p&gt;Fix took 20 minutes. Six weeks of intermittent 504s gone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Time saved: hard to quantify, but meaningful. Bugs introduced: 0.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where It Got Frustrating&lt;/strong&gt;&lt;br&gt;
The terminal interface has a real cost for frontend work. When I needed to iterate on React component styling, the round-trip of describing visual changes in text and mentally mapping the response back to pixels was slower than just doing it myself. Claude Code is built for engineers who think in code and text. Visual iteration isn't its strength.&lt;/p&gt;

&lt;p&gt;The other friction point was context switching. Each session starts fresh by default, so if you're working across multiple files on a multi-day task, you're re-establishing context at the start of each session. This is manageable with good habits (I started keeping a brief context note I'd paste at session start), but it adds overhead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Week 1–2 overall rating: 8.5/10 for backend work. 6/10 for frontend.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Week 3: Cursor&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;First Impressions&lt;/strong&gt;&lt;br&gt;
Cursor is a VS Code fork with AI baked into the editor. If you're already living in VS Code, the transition is nearly frictionless: your extensions, your keybindings, your colour scheme all carry over. The first time you hit Cmd+K on a selected block of code and describe what you want done to it, the experience feels genuinely magical in a way that terminal-based tools don't produce.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Greenfield Feature: Building a New API Endpoint Set&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Week 3 happened to align with a sprint where I was building a new set of API endpoints for a reporting module, greenfield work with clear requirements, starting from scratch. This was Cursor's sweet spot.&lt;/p&gt;

&lt;p&gt;The inline generation is fast and context-aware in a way that changes the development rhythm. I'd write a function signature and a docstring describing intent, hit the shortcut, and get an implementation that was usually 80% correct and 100% stylistically consistent with the surrounding code. The iteration loop (generate, review, adjust, generate again) became fluid enough that it stopped feeling like using a tool and started feeling like an accelerated version of typing.&lt;/p&gt;

&lt;p&gt;I shipped the full reporting endpoint set in one day. My sprint estimate had been three days.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Time saved: ~2 days. Bugs introduced: 2 (both caught in review — incorrect default parameter values).&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Legacy Code Task: The Django Archaeology Project&lt;/strong&gt;&lt;br&gt;
We have a Django service that processes financial transactions. It's five years old, written by people who've all left, and the documentation is optimistic at best. I needed to add a new transaction type and had no idea where to start.&lt;/p&gt;

&lt;p&gt;Cursor's codebase indexing made this significantly less painful than it would have been otherwise. I could ask questions about the codebase ("where is payment status updated?", "what calls this function?") and get accurate answers that saved the half-day of archaeological reading I'd normally do before touching anything.&lt;/p&gt;

&lt;p&gt;The actual implementation assistance was good but not perfect. Cursor occasionally suggested patterns that were internally consistent but inconsistent with conventions the existing codebase had established in modules it hadn't deeply indexed. The suggestions were plausible, just wrong for this specific context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Time saved: ~3 hours on navigation. Bugs introduced: 1 (pattern mismatch caught in review).&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where It Got Frustrating&lt;/strong&gt;&lt;br&gt;
Cursor's AI features require sending code to an external API, which created friction with our security team for the services with the most sensitive business logic. There's a privacy mode, but it disables some of the most useful features. For teams with strict data handling requirements, this is a real constraint that the demos don't surface.&lt;/p&gt;

&lt;p&gt;The other issue was suggestion quality on TypeScript generics and complex type manipulation. The suggestions were often syntactically correct but semantically wrong in ways that compiled but produced subtle type unsafety. I learned to review TypeScript suggestions more carefully than Python ones.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Week 3 overall rating: 9/10 for greenfield. 7/10 for legacy. Privacy constraints: significant for some teams.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Week 4: GitHub Copilot&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;First Impressions&lt;/strong&gt;&lt;br&gt;
Coming back to Copilot after two weeks on Claude Code and one on Cursor felt like returning to something familiar, because it is. Copilot's autocomplete model is the baseline most of us have internalised. The ghost text appears, you Tab to accept or ignore, you move on. It's frictionless in a way the other tools aren't, because it doesn't ask anything of you.&lt;br&gt;
That frictionlessness is both its greatest strength and its fundamental limitation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Debugging Task: Production Memory Leak&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Week 4 opened with a production incident: a memory leak in a data processing service that was causing OOM kills under sustained load. This was the kind of debugging task where I'd hoped Copilot's context awareness would shine.&lt;/p&gt;

&lt;p&gt;It helped, but less than the other tools had on equivalent tasks. Copilot's suggestions were reactive: it would suggest the next line of code I was writing well, but it couldn't engage with the debugging process at a higher level of abstraction. I'd write a hypothesis as a comment and it would suggest the code to test that hypothesis, which was useful. But the hypothesis generation was all me.&lt;/p&gt;

&lt;p&gt;The memory leak turned out to be a generator that was being accidentally materialised into a list in a hot path. Found it with traditional debugging augmented by Copilot's autocomplete helping me write the profiling code faster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Time saved: ~30 minutes on instrumentation code. Bugs introduced: 0.&lt;/strong&gt;&lt;/p&gt;
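&lt;p&gt;The bug class is easy to reproduce in isolation. A hedged sketch with hypothetical names (not the actual service code): materialising a generator pulls every record into memory at once, while iterating lazily keeps peak memory flat.&lt;/p&gt;

```python
def read_records(n):
    # Stand-in for a streaming source (DB cursor, message queue, file).
    for i in range(n):
        yield {"id": i, "payload": "x" * 100}

def process_eagerly(n):
    # Leaky pattern: list() materialises every record before processing,
    # so peak memory grows with n.
    records = list(read_records(n))
    return sum(1 for r in records if r["id"] % 2 == 0)

def process_lazily(n):
    # Fixed pattern: consume the generator one record at a time;
    # peak memory stays flat regardless of n.
    return sum(1 for r in read_records(n) if r["id"] % 2 == 0)
```

Both functions return the same answer; the difference only shows up in memory profiles under sustained load, which is exactly why the bug hid for so long.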

&lt;p&gt;&lt;strong&gt;Refactoring Task: TypeScript Interface Consolidation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This was Copilot's best week. We had a TypeScript frontend with interface definitions scattered across twelve files, many overlapping, some contradictory. The task was to consolidate them into a coherent type system.&lt;/p&gt;

&lt;p&gt;Copilot's pattern completion on TypeScript interfaces is excellent. As I worked through the consolidation manually, it was consistently predicting the correct interface extensions, the right generic constraints, and the appropriate utility types. The work was still primarily mine, but the acceleration on the mechanical parts was real.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Time saved: ~2 hours. Bugs introduced: 0.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where It Got Frustrating&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Copilot's context window is narrow compared to the other tools. It knows the current file and some of the surrounding files, but it doesn't have the project-level awareness that Cursor's indexing or Claude Code's conversational context provides. For any task that requires understanding how pieces fit together across the codebase, you're on your own.&lt;/p&gt;

&lt;p&gt;The other limitation is the ceiling. Copilot makes you faster at what you already know how to do. It doesn't help you figure out what to do when you're genuinely uncertain. For junior developers or engineers working outside their comfort zone, the gap between Copilot and the reasoning-first tools is significant.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Week 4 overall rating: 8/10 for mechanical tasks. 6/10 for complex reasoning&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Head-to-Head: Two Days, Identical Tasks&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;At the end of Week 4 I spent two days running the same four tasks on all three tools to calibrate the diary impressions with direct comparison. The tasks: write a new database migration with rollback logic, debug a failing test with a non-obvious root cause, refactor a function with too many responsibilities, and explain an unfamiliar section of the codebase.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flgpho6utcslk1wcpdfzb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flgpho6utcslk1wcpdfzb.png" alt=" " width="800" height="260"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Honest Summary&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use Claude Code&lt;/strong&gt; if you're doing complex backend work, debugging thorny issues, or working on tasks where understanding the problem deeply matters more than generating code quickly. The reasoning quality is the best of the three. The terminal interface is a real cost for frontend work and visual iteration. For developers who find the CLI workflow uncomfortable, the &lt;strong&gt;&lt;a href="https://dextralabs.com/blog/claude-code-alternatives-for-developers/" rel="noopener noreferrer"&gt;Claude Code alternatives for developers&lt;/a&gt;&lt;/strong&gt; roundup covers the options; Cursor is the closest one that preserves most of the reasoning depth.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Cursor if&lt;/strong&gt; you want the best balance of reasoning quality and IDE integration. The greenfield development experience is excellent and the codebase navigation is genuinely useful for large or unfamiliar codebases. Check your data handling requirements before deploying it on sensitive services.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Copilot if&lt;/strong&gt; your team is already paying for it (many are through enterprise GitHub), you're doing primarily TypeScript or well-typed Python work, and your use cases are more "go faster at things I know how to do" than "help me figure out things I don't know how to do." It's the lowest friction option and that has real value at the margin.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The numbers across 30 days&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1eizxkx4h48xwj0yyden.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1eizxkx4h48xwj0yyden.png" alt=" " width="800" height="282"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;None of these tools is the right answer for every team or every task. The right choice depends on your stack, your team's comfort with different interfaces, your data handling requirements, and whether your primary bottleneck is reasoning quality or mechanical speed.&lt;/p&gt;

&lt;p&gt;Choosing the right AI coding assistant depends on your stack, team size, and use case. For enterprise teams navigating this decision across multiple developers and compliance requirements, consulting with specialists like &lt;strong&gt;&lt;a href="https://dextralabs.com/" rel="noopener noreferrer"&gt;Dextra Labs&lt;/a&gt;&lt;/strong&gt; can save months of trial and error; they've run these evaluations across enough enterprise stacks to have opinions worth hearing.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What's your experience been? I'm curious whether the Claude Code vs Cursor reasoning quality gap holds for other stacks or whether it's specific to the Python/TypeScript combination I was working in. Drop it in the comments.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>10 AI Code Review Tools That Actually Caught Bugs My Team Missed</title>
      <dc:creator>Dextra Labs</dc:creator>
      <pubDate>Mon, 23 Mar 2026 05:04:48 +0000</pubDate>
      <link>https://dev.to/dextralabs/10-ai-code-review-tools-that-actually-caught-bugs-my-team-missed-n8g</link>
      <guid>https://dev.to/dextralabs/10-ai-code-review-tools-that-actually-caught-bugs-my-team-missed-n8g</guid>
      <description>&lt;p&gt;&lt;em&gt;I planted 23 bugs across a real codebase. Here's what each tool found and what slipped through.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Let me tell you how this started.&lt;br&gt;
Three months ago, a bug made it to production that had survived four human code reviews, a CI pipeline and two rounds of QA. It wasn't subtle: it was a classic off-by-one error in a pagination function that only surfaced under a specific combination of filter conditions. One of those bugs that's embarrassingly obvious in retrospect and genuinely invisible in a forward pass through a pull request.&lt;/p&gt;
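&lt;p&gt;To make that bug class concrete, here's a hypothetical reconstruction (not the actual production function): the buggy slice bound silently drops the last item of every page, which is invisible until a filter combination makes that last item matter.&lt;/p&gt;

```python
def paginate_buggy(items, page, page_size):
    start = (page - 1) * page_size
    end = start + page_size - 1   # off by one: should be start + page_size
    return items[start:end]

def paginate_fixed(items, page, page_size):
    # Correct bound: Python slices exclude the end index already.
    start = (page - 1) * page_size
    return items[start:start + page_size]
```

A reviewer scanning the diff sees plausible arithmetic; only a test asserting the page length catches it.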

&lt;p&gt;After the incident retrospective, someone on the team asked the question we'd been avoiding: should we be using AI code review tools? We'd all seen the demos. We'd all nodded along to the conference talks. None of us had actually run a systematic evaluation.&lt;/p&gt;

&lt;p&gt;So I ran one.&lt;/p&gt;

&lt;p&gt;I took a real service from our codebase, a Python FastAPI backend with about 4,000 lines of active code, and planted 23 bugs across it. Some obvious, some subtle, some genuinely nasty. Then I ran ten different AI code review tools against it and tracked exactly what each one caught, what it missed and how many false positives it generated along the way.&lt;/p&gt;

&lt;p&gt;Here's what I found.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Bug Set&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Before the results, it helps to know what I was testing against. The 23 planted bugs fell into five categories:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Logic errors (6)&lt;/strong&gt; — off-by-one conditions, incorrect boolean operators, wrong comparison operators in conditional branches.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security vulnerabilities (5)&lt;/strong&gt; — SQL injection via string formatting, missing authentication checks on endpoints, exposed sensitive data in logs, insecure random number generation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Race conditions (4)&lt;/strong&gt; — shared state mutations without locks, async functions with incorrect await patterns, database transactions without proper isolation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Type errors (4)&lt;/strong&gt; — incorrect type assumptions on function inputs, missing None checks, wrong type coercions in data transformation functions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance issues (4)&lt;/strong&gt; — N+1 query patterns, missing database indexes referenced in query plans, inefficient list operations in hot paths.&lt;/p&gt;

&lt;p&gt;I ran each tool with default configuration first, then with team-specific configuration where the tool supported it. Detection rates below are from the default-configuration run unless noted.&lt;/p&gt;
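&lt;p&gt;For a sense of what the race-condition category looked like, here's an illustrative sketch of the simplest case (shared state mutated without a lock; hypothetical code, not from the test service). The unsynchronised counter can lose updates when threads interleave; the locked version cannot.&lt;/p&gt;

```python
import threading

def run_counter(use_lock, workers=8, increments=10_000):
    count = 0
    lock = threading.Lock()

    def worker():
        nonlocal count
        for _ in range(increments):
            if use_lock:
                with lock:
                    count += 1
            else:
                # Unsynchronised read-modify-write: the read and the write
                # can interleave with another thread's, losing updates.
                count += 1

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return count
```

The planted bugs were subtler variants of this shape (async tasks and transaction isolation), which is why most tools missed them.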

&lt;h2&gt;
  
  
  &lt;strong&gt;The Results&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. GitHub Copilot Code Review&lt;br&gt;
Bugs caught: 16/23 (70%) | False positives: 4 | Speed: Fast&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Copilot's inline review is the one most teams are already closest to, and it earned its place at the top of this list. It caught all five security vulnerabilities (the SQL injection, the auth bypass, the log exposure) without any configuration. The logic errors were hit or miss: it caught four of six, missing both cases where the error was in a complex nested conditional.&lt;/p&gt;

&lt;p&gt;Where it genuinely surprised me was on the N+1 query patterns. It flagged two of the four performance issues and gave actionable query restructuring suggestions, not just a flag. The suggestions weren't always idiomatic for our specific ORM, but they pointed in the right direction.&lt;/p&gt;
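&lt;p&gt;For anyone who hasn't hit the N+1 pattern before, a minimal illustration (plain sqlite3 here, not our ORM): the first version issues one query per parent row, the restructured version does the same work with a single join.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY,
                        author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'ada'), (2, 'bob');
    INSERT INTO posts VALUES (1, 1, 'p1'), (2, 1, 'p2'), (3, 2, 'p3');
""")

def titles_n_plus_one():
    # N+1 pattern: one query for the parents, then one query per parent.
    out = {}
    for author_id, name in conn.execute("SELECT id, name FROM authors"):
        rows = conn.execute(
            "SELECT title FROM posts WHERE author_id = ?", (author_id,)
        ).fetchall()
        out[name] = [title for (title,) in rows]
    return out

def titles_joined():
    # Restructured: a single join, grouped in Python afterwards.
    out = {}
    join_sql = ("SELECT a.name, p.title FROM authors a "
                "JOIN posts p ON p.author_id = a.id ORDER BY p.id")
    for name, title in conn.execute(join_sql):
        out.setdefault(name, []).append(title)
    return out
```

Both return the same mapping; the difference is query count, which is why the pattern only hurts at production row counts.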

&lt;p&gt;The four false positives were all style-related: it flagged variable naming conventions that matched our internal style guide but differed from PEP 8 defaults. Configurable, but annoying out of the box.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. CodeRabbit&lt;br&gt;
Bugs caught: 15/23 (65%) | False positives: 6 | Speed: Medium&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;CodeRabbit's PR-level summary is genuinely useful: it gives you a plain-English description of what changed and why it matters before diving into line-level comments. For teams where reviewers aren't always familiar with the full context of a change, this framing helps.&lt;/p&gt;

&lt;p&gt;Detection-wise, it was strong on security (caught 4/5) and logic errors (5/6) but weak on race conditions: it caught one of four, and the three it missed were the genuinely subtle ones involving async patterns. The six false positives were more annoying than Copilot's, including two suggestions to add docstrings to private helper functions that are explicitly excluded from our documentation standards.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Cursor with Claude Backend&lt;br&gt;
Bugs caught: 15/23 (65%) | False positives: 3 | Speed: Fast&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cursor uses Claude under the hood for its review capabilities, and the difference in reasoning quality shows on the complex bugs. It caught both of the nested conditional logic errors that Copilot missed. The explanation it provided for the race condition it identified was the most accurate of any tool: it correctly described the exact timing window that would cause the issue rather than giving a generic "potential race condition" warning.&lt;/p&gt;

&lt;p&gt;The three false positives were all genuinely borderline, two cases where there was a reasonable argument for the suggestion and one where it flagged an intentional pattern as a potential issue. Lowest false positive rate of the ten tools tested.&lt;/p&gt;

&lt;p&gt;For teams already in the Cursor workflow, the review capability is a meaningful addition without requiring a separate tool evaluation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Sourcegraph Cody&lt;br&gt;
Bugs caught: 14/23 (61%) | False positives: 5 | Speed: Medium&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cody's strength is codebase context. Because it indexes your entire repository, its suggestions account for patterns elsewhere in the codebase in a way that prompt-based tools can't. It caught a bug that no other tool identified: a type error that only manifested because of a pattern established in a utility function in a completely separate module. The cross-file reasoning was genuinely impressive.&lt;/p&gt;

&lt;p&gt;Where it fell short was on security vulnerabilities: it caught 3/5, missing the insecure random number generation and one of the authentication issues. The false positives skewed toward over-eager suggestions to refactor code that was functioning correctly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. DeepCode (Snyk Code)&lt;br&gt;
Bugs caught: 14/23 (61%) | False positives: 8 | Speed: Slow&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;DeepCode is the security specialist of the group and it shows. It caught all five security vulnerabilities, matching Copilot, and provided the most detailed remediation guidance of any tool. The SQL injection finding came with a code example showing the parameterised query pattern, the affected line and a link to the relevant CWE entry. For a security-focused review, this depth is valuable.&lt;/p&gt;
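&lt;p&gt;The remediation it pointed to is the standard parameterised-query fix, which looks roughly like this (a generic sqlite3 sketch, not the audited code): string formatting lets attacker-controlled quotes rewrite the query, while a bound parameter is treated purely as data.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice'), (2, 'bob')")

user_input = "nobody' OR '1'='1"

# Vulnerable: user input is interpolated straight into the SQL text,
# so the quote characters become part of the query itself.
query = f"SELECT id FROM users WHERE name = '{user_input}'"
rows_vulnerable = conn.execute(query).fetchall()   # matches every row

# Safe: the driver binds the value as a parameter; the quotes are data.
rows_safe = conn.execute(
    "SELECT id FROM users WHERE name = ?", (user_input,)
).fetchall()   # matches nothing: no user is literally named that
```

The same placeholder-binding idea applies across drivers and ORMs, though the placeholder syntax varies.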

&lt;p&gt;The eight false positives were the highest of the group, and several were security warnings on code patterns that were safe in context but matched patterns that are sometimes unsafe. This is the fundamental tension in security static analysis (specificity vs. sensitivity), and DeepCode errs toward sensitivity. For a security audit that's the right call. For a daily development workflow it generates review fatigue.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Amazon CodeGuru&lt;br&gt;
Bugs caught: 13/23 (57%) | False positives: 5 | Speed: Slow&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;CodeGuru's strength is performance analysis, and it earned that reputation in this test. It caught all four performance issues, including one N+1 pattern that involved an ORM relationship that wasn't immediately obvious, and its performance recommendations were the most actionable of any tool. The estimated latency impact it provides alongside performance suggestions is a feature I haven't seen elsewhere.&lt;/p&gt;

&lt;p&gt;The trade-off is coverage breadth. It missed three of the security vulnerabilities and two of the race conditions. For teams where performance is the primary concern, it's excellent. As a general-purpose review tool, the gaps are significant.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. Tabnine Enterprise&lt;br&gt;
Bugs caught: 12/23 (52%) | False positives: 4 | Speed: Fast&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Tabnine's review capability has improved significantly in the enterprise version, but it still feels more like an enhanced linter than a reasoning engine. It caught logic errors and obvious security issues reliably, but the subtle bugs (the race conditions, the cross-module type error) went undetected. The false positive rate was reasonable and the suggestions were concise. For teams that want a lightweight tool that won't generate review noise, it's a reasonable choice at its price point.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;8. SonarQube with AI Extensions&lt;br&gt;
Bugs caught: 12/23 (52%) | False positives: 11 | Speed: Slow&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;SonarQube is the incumbent in this space and the AI extensions add genuine capability over the rule-based baseline. But the false positive rate, eleven in this test, reflects an architecture that was built around rule matching and retrofitted with AI analysis rather than built AI-first. The combination produces both sets of false positives. For teams already invested in the SonarQube ecosystem, the AI extensions are worth enabling. For teams evaluating from scratch, the newer tools are cleaner.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;9. Qodo (formerly CodiumAI)&lt;br&gt;
Bugs caught: 11/23 (48%) | False positives: 4 | Speed: Medium&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Qodo's differentiation is test generation: it's primarily a tool for suggesting and generating tests, with code review as a secondary capability. Evaluated purely on bug detection it lands at 48%, but that undersells what it actually does well. The tests it suggested for the functions containing bugs would have caught six of the bugs I planted; in a sense, its indirect bug detection via test generation is more valuable than its direct review flagging. A different way of thinking about the same problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;10. CodeClimate with AI&lt;br&gt;
Bugs caught: 9/23 (39%) | False positives: 6 | Speed: Medium&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;CodeClimate's AI integration is the thinnest of the group. The core product is a maintainability and test coverage tool and the AI layer adds pattern-based review that doesn't match the reasoning quality of the AI-first tools. It caught the obvious logic errors and one security issue but missed everything in the subtle categories. Useful for maintainability metrics, not the right tool if bug detection is the primary goal.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Summary Table&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm9cgv1ah0eso9expcu2t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm9cgv1ah0eso9expcu2t.png" alt=" " width="800" height="310"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What the Data Actually Tells You&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Three patterns are worth pulling out of this before you make a decision.&lt;/p&gt;

&lt;p&gt;No single tool caught everything. The union of bugs caught across all tools was 21 of 23; the two remaining bugs (both complex race conditions) weren't caught by any tool in default configuration. Human review is still part of the stack. These tools raise the floor; they don't replace the ceiling.&lt;/p&gt;

&lt;p&gt;False positive rate matters as much as detection rate. A tool that catches 70% of bugs but generates 20 false positives per PR will get disabled by your team within a month. Review fatigue is real. The tools that have invested in reducing false positives (Cursor, Copilot, Tabnine) show it in their adoption numbers for a reason.&lt;/p&gt;
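&lt;p&gt;One quick way to put detection rate and false positives on the same footing is precision vs. recall. A sketch using the per-tool counts quoted above (my framing, not a metric the test itself reports):&lt;/p&gt;

```python
# Precision/recall from the counts reported above (23 planted bugs total).
stats = {
    "CodeGuru":    {"caught": 13, "false_pos": 5},
    "Tabnine":     {"caught": 12, "false_pos": 4},
    "SonarQube":   {"caught": 12, "false_pos": 11},
    "Qodo":        {"caught": 11, "false_pos": 4},
    "CodeClimate": {"caught": 9,  "false_pos": 6},
}

TOTAL_BUGS = 23

for name, s in stats.items():
    # Precision: of everything the tool flagged, how much was a real bug?
    precision = s["caught"] / (s["caught"] + s["false_pos"])
    # Recall: of the planted bugs, how many did the tool flag?
    recall = s["caught"] / TOTAL_BUGS
    print(f"{name:12s} precision={precision:.0%} recall={recall:.0%}")
```

&lt;p&gt;Of these five, Tabnine comes out with the best precision (75%), which lines up with the lightweight, low-noise read above, while SonarQube's eleven false positives drag its precision down to roughly 52% despite a matching detection count.&lt;/p&gt;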

&lt;p&gt;Security and performance specialists are genuinely worth it for those domains. If your threat model makes security review critical, running DeepCode alongside a general tool is worth the overlap. If performance regressions are your primary concern, CodeGuru's analysis depth justifies it. The specialists outperform the generalists in their specific domain.&lt;/p&gt;

&lt;p&gt;For the full breakdown including pricing, team size recommendations and integration complexity for each tool, the &lt;strong&gt;&lt;a href="https://dextralabs.com/blog/top-ai-code-review-tools/" rel="noopener noreferrer"&gt;top AI code review tools&lt;/a&gt;&lt;/strong&gt; comparison from Dextra Labs covers what a single article can't. If you're specifically evaluating AI-native editors rather than standalone review tools, the &lt;strong&gt;&lt;a href="https://dextralabs.com/blog/claude-code-alternatives-for-developers/" rel="noopener noreferrer"&gt;Claude Code alternatives for developers&lt;/a&gt;&lt;/strong&gt; guide covers that adjacent decision in detail.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;My Current Setup&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;After running this evaluation, our team settled on Copilot for inline review (it's already in the IDE and the detection rate justifies the subscription), with DeepCode running on the CI pipeline specifically for security-focused PRs touching authentication, data handling, or external API integration. The combination covers the security-specialist gap that Copilot has without adding review noise to every PR.&lt;/p&gt;

&lt;p&gt;Cursor is on trial for two engineers who do the most complex backend work. The reasoning quality on subtle bugs is noticeably better and the false positive rate is the best of anything I tested. Broader rollout decision pending.&lt;/p&gt;

&lt;p&gt;The bug that made it to production three months ago? I planted its pattern in the test set. Copilot caught it. DeepCode caught it. Cursor caught it with the most accurate explanation of why it was dangerous.&lt;/p&gt;

&lt;p&gt;We would have saved an incident retrospective and a very uncomfortable all-hands if we'd had any of these running at the time.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you're evaluating which tools fit your stack, Dextra Labs compiled a detailed comparison with pricing, integration complexity and feature breakdowns for each tool in this list, including team size recommendations and procurement guidance.&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How to Build an AI Agent from Scratch Using Claude API (With Full Code)</title>
      <dc:creator>Dextra Labs</dc:creator>
      <pubDate>Sun, 22 Mar 2026 07:28:12 +0000</pubDate>
      <link>https://dev.to/dextralabs/how-to-build-an-ai-agent-from-scratch-using-claude-api-with-full-code-4b40</link>
      <guid>https://dev.to/dextralabs/how-to-build-an-ai-agent-from-scratch-using-claude-api-with-full-code-4b40</guid>
      <description>&lt;p&gt;I've built a lot of AI demos that looked impressive in a notebook and fell apart in production. The usual culprit? Treating an LLM like a search engine, one prompt in, one answer out, instead of what it actually is: a reasoning engine you can wire into real workflows.&lt;/p&gt;

&lt;p&gt;This tutorial is about doing it properly. We're going to build a functional AI agent using Anthropic's Claude API from the ground up, not a wrapper around a framework, but the actual mechanics: a ReAct loop, custom tool use, and a structure you can actually deploy. By the end you'll have running code and a mental model that makes every agent tutorial after this one make sense.&lt;/p&gt;

&lt;p&gt;Let's get into it.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What We're Actually Building&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The agent we're building will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Accept a user query&lt;/li&gt;
&lt;li&gt;Decide which tools it needs to answer&lt;/li&gt;
&lt;li&gt;Call those tools, observe the results&lt;/li&gt;
&lt;li&gt;Reason over the results and either call more tools or return a final answer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This pattern is called ReAct (Reasoning + Acting). It's the backbone of most production agents and it maps cleanly onto how Claude's tool use API works.&lt;/p&gt;
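&lt;p&gt;Here is the whole loop compressed into one sketch before we build each piece. The client, tool schemas and dispatcher are injected as parameters so the loop itself stays easy to test; the full versions are defined in the steps below.&lt;/p&gt;

```python
# Compressed sketch of the ReAct loop: ask Claude, run any tools it
# requests, feed the results back, repeat until it answers in text.
# `client` is an anthropic.Anthropic instance, `tools` the schema list,
# `execute_tool` the dispatcher -- all injected rather than global.
def react_loop(client, user_query, tools, execute_tool,
               model="claude-sonnet-4-5", max_iterations=10):
    messages = [{"role": "user", "content": user_query}]
    for _ in range(max_iterations):
        response = client.messages.create(
            model=model, max_tokens=1024, tools=tools, messages=messages,
        )
        if response.stop_reason != "tool_use":
            # No tool requested: the first text block is the final answer
            return response.content[0].text
        # Record the assistant turn, run each requested tool,
        # and send the results back as tool_result blocks
        messages.append({"role": "assistant", "content": response.content})
        results = [
            {"type": "tool_result", "tool_use_id": block.id,
             "content": execute_tool(block.name, block.input)}
            for block in response.content if block.type == "tool_use"
        ]
        messages.append({"role": "user", "content": results})
    return "Stopped: hit the iteration limit without a final answer."
```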

&lt;h2&gt;
  
  
  &lt;strong&gt;Prerequisites&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;pip install anthropic python-dotenv&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;You'll need a Claude API key from console.anthropic.com. Store it safely:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;# .env&lt;br&gt;
ANTHROPIC_API_KEY=your_key_here&lt;/code&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Step 1: Basic Claude API Setup&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Before building the agent, let's confirm you can talk to Claude.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;python&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;

&lt;span class="nf"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ask_claude&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;

&lt;span class="c1"&gt;# Quick test
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;ask_claude&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is 2 + 2? Answer in one word.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the foundation. If this runs cleanly, you're ready to build on it. For a deeper breakdown of model selection and API parameters, the &lt;strong&gt;&lt;a href="https://dextralabs.com/blog/claude-opus-vs-sonnet-vs-haiku/" rel="noopener noreferrer"&gt;Claude API tutorial&lt;/a&gt;&lt;/strong&gt; from Dextra Labs is worth reading before you go further.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Step 2: Define Your Tools&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Tools are the agent's hands. Without them, Claude can only reason; it can't act. We'll define three tools that our agent can use: a calculator, a web search simulator, and a file writer.&lt;br&gt;
In Claude's API, tools are defined as JSON schemas. Claude reads these schemas and decides when and how to call them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;python&lt;/span&gt;
&lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;calculator&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Performs basic arithmetic. Use this for any math operations.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_schema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expression&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Math expression to evaluate, e.g. &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;15 * 24 + 100&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expression&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;web_search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Searches the web for current information on a topic.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_schema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The search query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;save_to_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Saves text content to a local file.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_schema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;filename&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;filename&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let's write the actual Python functions that execute when Claude calls these tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;python&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expression&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Safe eval for math expressions
&lt;/span&gt;        &lt;span class="n"&gt;allowed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__dict__&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; 
                   &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;eval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expression&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__builtins__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{}},&lt;/span&gt; &lt;span class="n"&gt;allowed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Result: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;web_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# In production, wire this to SerpAPI, Tavily, or Brave Search
&lt;/span&gt;    &lt;span class="c1"&gt;# Simulated response for tutorial purposes
&lt;/span&gt;    &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Search results for &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Simulated] Top result: Relevant information about &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;from authoritative sources. Published 2025.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;save_to_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Successfully saved to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error saving file: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;Tool&lt;/span&gt; &lt;span class="n"&gt;dispatcher&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;execute_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool_input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tool_name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;calculator&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;calculator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_input&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expression&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;tool_name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;web_search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;web_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_input&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;tool_name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;save_to_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;save_to_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_input&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;filename&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;tool_input&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Unknown tool: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The dispatcher is intentionally simple here. In production you'd want a registry pattern, but for learning, explicit is better than clever.&lt;/p&gt;
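&lt;p&gt;For reference, that registry pattern can be sketched in a few lines. This is a hypothetical refactor, not part of the tutorial repo; the &lt;code&gt;calculator&lt;/code&gt; and &lt;code&gt;web_search&lt;/code&gt; stand-ins below just mimic the tools defined earlier. Adding a tool then means adding one dict entry instead of another &lt;code&gt;elif&lt;/code&gt; branch.&lt;/p&gt;

```python
# Hypothetical registry-pattern version of the dispatcher above.
# The two tool functions are simplified stand-ins for illustration.

def calculator(expression: str) -> str:
    # Illustration only -- never eval() untrusted input in production.
    return str(eval(expression))

def web_search(query: str) -> str:
    return f"Results for: {query}"

# Each entry adapts the tool_input dict to the tool's own signature.
TOOL_REGISTRY = {
    "calculator": lambda args: calculator(args["expression"]),
    "web_search": lambda args: web_search(args["query"]),
}

def execute_tool(tool_name: str, tool_input: dict) -> str:
    handler = TOOL_REGISTRY.get(tool_name)
    if handler is None:
        return f"Unknown tool: {tool_name}"
    return handler(tool_input)
```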

&lt;h2&gt;
  
  
  &lt;strong&gt;Step 3: Build the ReAct Agent Loop&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This is the core of the tutorial. The ReAct loop works like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Send the user query + available tools to Claude&lt;/li&gt;
&lt;li&gt;Claude either returns a final answer OR a tool call request&lt;/li&gt;
&lt;li&gt;If tool call → execute it, send result back to Claude&lt;/li&gt;
&lt;li&gt;Repeat until Claude returns a final answer
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;python&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_iterations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;system_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a helpful AI agent with access to tools.
    Think step by step. Use tools when you need real data or calculations.
    When you have enough information, provide a clear final answer.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;iteration&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_iterations&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;[Iteration &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;iteration&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Call Claude with tools
&lt;/span&gt;        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Stop reason: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stop_reason&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# If Claude is done reasoning, return the final answer
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stop_reason&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;end_turn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;final_answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;hasattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                    &lt;span class="n"&gt;final_answer&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Final Answer: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;final_answer&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;final_answer&lt;/span&gt;

        &lt;span class="c1"&gt;# If Claude wants to use tools
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stop_reason&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_use&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Add Claude's response to message history
&lt;/span&gt;            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;

            &lt;span class="c1"&gt;# Process each tool call
&lt;/span&gt;            &lt;span class="n"&gt;tool_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_use&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Tool: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Input: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

                    &lt;span class="c1"&gt;# Execute the tool
&lt;/span&gt;                    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;execute_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Result: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

                    &lt;span class="n"&gt;tool_results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_use_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
                    &lt;span class="p"&gt;})&lt;/span&gt;

            &lt;span class="c1"&gt;# Send tool results back to Claude
&lt;/span&gt;            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tool_results&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Max iterations reached without a final answer.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight here is the message history. Every tool call and every result gets appended to &lt;code&gt;messages&lt;/code&gt;, so Claude always has full context of what it has already tried. This is what separates a stateful agent from a stateless chatbot.&lt;/p&gt;
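&lt;p&gt;To make that concrete, here is roughly what &lt;code&gt;messages&lt;/code&gt; looks like after one tool round. The plain dicts and the &lt;code&gt;toolu_01&lt;/code&gt; id are illustrative stand-ins for the SDK's content-block objects, but the alternating role structure and the &lt;code&gt;tool_use_id&lt;/code&gt; linkage match what the loop above builds.&lt;/p&gt;

```python
# Illustrative shape of the messages list after one tool-use round.
# Dicts and the "toolu_01" id stand in for the SDK's content blocks.
messages = [
    {"role": "user", "content": "What is 17% of 2,450?"},
    {"role": "assistant", "content": [          # Claude's tool call
        {"type": "tool_use", "id": "toolu_01", "name": "calculator",
         "input": {"expression": "2450 * 0.17"}},
    ]},
    {"role": "user", "content": [               # our tool result
        {"type": "tool_result", "tool_use_id": "toolu_01",
         "content": "416.5"},
    ]},
]
# Roles alternate user / assistant / user, and tool_use_id ties each
# result back to its call, so the next request carries full context.
```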

&lt;h2&gt;
  
  
  &lt;strong&gt;Step 4: Run It&lt;/strong&gt;
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;python&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Test 1: Math + file output
&lt;/span&gt;    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Calculate compound interest on $10,000 at 7% for 10 years, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;then save the result to &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;investment.txt&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Test 2: Research + synthesis
&lt;/span&gt;    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Search for information about RAG architecture &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;and summarize the key components.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;Test&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Multi&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="n"&gt;reasoning&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is the square root of 144 multiplied by the number of days in a leap year?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run this and watch the agent reason through each step in your terminal. The iteration logs show you exactly how Claude decides which tool to call and when to stop.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Step 5: Adding Memory (The Production Upgrade)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The agent above is stateless; each &lt;code&gt;run_agent&lt;/code&gt; call starts fresh. For real applications you need conversation memory. Here's a minimal implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;python&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AgentWithMemory&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conversation_history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Add user message to history
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conversation_history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_message&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;

        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful assistant with memory of our conversation.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conversation_history&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Handle tool use within persistent history
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stop_reason&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_use&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conversation_history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;
            &lt;span class="n"&gt;tool_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_use&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;execute_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="n"&gt;tool_results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_use_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
                    &lt;span class="p"&gt;})&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conversation_history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tool_results&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;
            &lt;span class="c1"&gt;# Recursive call to get final answer
&lt;/span&gt;            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;assistant_message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conversation_history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;assistant_message&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;assistant_message&lt;/span&gt;

&lt;span class="c1"&gt;## **Usage**
&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AgentWithMemory&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;My budget is $50,000. Calculate 7% annual return over 5 years.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Now do the same calculation but for 10 years.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;Claude&lt;/span&gt; &lt;span class="n"&gt;remembers&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;000&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;first&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;conversation_history&lt;/code&gt; list is doing all the heavy lifting here. In production you'd persist this to Redis or a database between sessions.&lt;/p&gt;
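&lt;p&gt;A minimal persistence sketch, using a JSON file so it stays stdlib-only; in production you'd point the same save/load pair at Redis or a database. The file path is an assumption, and note that SDK content-block objects would need converting to plain dicts (e.g. via their &lt;code&gt;model_dump()&lt;/code&gt; method) before serializing.&lt;/p&gt;

```python
import json
from pathlib import Path

# Hypothetical location; swap for a Redis key or DB row in production.
HISTORY_PATH = Path("conversation_history.json")

def save_history(history: list) -> None:
    # Works when history holds plain dicts and strings; SDK content
    # blocks must be converted to dicts first.
    HISTORY_PATH.write_text(json.dumps(history))

def load_history() -> list:
    # Return the saved history, or an empty list for a fresh session.
    if HISTORY_PATH.exists():
        return json.loads(HISTORY_PATH.read_text())
    return []
```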

&lt;h2&gt;
  
  
  &lt;strong&gt;What to Build Next&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Once this is running, the natural next steps are:&lt;br&gt;
&lt;strong&gt;Streaming responses&lt;/strong&gt; — use &lt;code&gt;client.messages.stream()&lt;/code&gt; for real-time output in web apps. &lt;br&gt;
&lt;strong&gt;Error handling and retries&lt;/strong&gt; — wrap tool calls in try/except with exponential backoff. &lt;br&gt;
&lt;strong&gt;Async execution&lt;/strong&gt; — parallel tool calls with &lt;code&gt;asyncio&lt;/code&gt; cut latency significantly on multi-tool queries. &lt;br&gt;
&lt;strong&gt;Structured outputs&lt;/strong&gt; — use Pydantic models to enforce tool input/output schemas.&lt;/p&gt;
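&lt;p&gt;The retry idea above can be sketched as a small wrapper. The attempt count and delays are arbitrary defaults, not values from the tutorial repo; tune them to your API's rate limits.&lt;/p&gt;

```python
import time

def with_retries(fn, max_attempts=3, base_delay=1.0):
    """Call fn(), retrying on exception with exponential backoff.

    Delays grow as base_delay * 2**attempt; both numbers are
    illustrative defaults.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * (2 ** attempt))

# Usage sketch, wrapping a tool call from the agent loop:
#   result = with_retries(lambda: execute_tool(block.name, block.input))
```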

&lt;p&gt;For the full architecture patterns and production deployment strategies, &lt;strong&gt;&lt;a href="https://dextralabs.com/" rel="noopener noreferrer"&gt;Dextra Labs published an in-depth guide on Claude AI agents architecture and deployment&lt;/a&gt;&lt;/strong&gt; covering containerization, monitoring, and scaling patterns beyond what fits in a single tutorial.&lt;br&gt;
The full repo for this tutorial is available at: &lt;code&gt;github.com/dextralabs/claude-agent-tutorial&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Quick Recap&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;What you just built is a genuine ReAct agent: not a chatbot with a system prompt, but a reasoning loop that can call real functions, observe results, and chain multiple steps together. The same pattern powers production agents handling customer support, code review, document analysis, and research workflows at scale.&lt;br&gt;
The code here is intentionally minimal. Strip away the frameworks and this is what's underneath all of them.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>ai</category>
      <category>duedilligence</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>Continuous Refactoring with LLMs: Patterns That Work in Production</title>
      <dc:creator>Dextra Labs</dc:creator>
      <pubDate>Sat, 28 Feb 2026 17:42:54 +0000</pubDate>
      <link>https://dev.to/dextralabs/continuous-refactoring-with-llms-patterns-that-work-in-production-136e</link>
      <guid>https://dev.to/dextralabs/continuous-refactoring-with-llms-patterns-that-work-in-production-136e</guid>
      <description>&lt;p&gt;Large Language Models are no longer prototypes running in notebooks.&lt;br&gt;
They’re running in production systems that serve thousands (sometimes millions) of users.&lt;/p&gt;

&lt;p&gt;And that changes everything.&lt;/p&gt;

&lt;p&gt;If you’re working on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LLM engineering&lt;/li&gt;
&lt;li&gt;RAG pipeline optimization&lt;/li&gt;
&lt;li&gt;AI agents orchestration&lt;/li&gt;
&lt;li&gt;Enterprise AI architecture&lt;/li&gt;
&lt;li&gt;AI code review automation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then one truth becomes painfully clear:&lt;/p&gt;

&lt;p&gt;Shipping once is easy. Maintaining and refactoring continuously is hard.&lt;/p&gt;

&lt;p&gt;This blog breaks down &lt;strong&gt;battle-tested patterns&lt;/strong&gt; for continuous refactoring with LLM systems, patterns that actually work in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why Continuous Refactoring is Mandatory in LLM Systems&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Traditional software:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Logic is deterministic&lt;/li&gt;
&lt;li&gt;Behavior is testable&lt;/li&gt;
&lt;li&gt;Refactors are structural&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;LLM systems:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Behavior is probabilistic&lt;/li&gt;
&lt;li&gt;Prompts change output drastically&lt;/li&gt;
&lt;li&gt;Data drift changes performance&lt;/li&gt;
&lt;li&gt;Model updates break assumptions&lt;/li&gt;
&lt;li&gt;Latency and cost fluctuate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;LLM systems behave more like &lt;strong&gt;living organisms&lt;/strong&gt; than static software.&lt;/p&gt;

&lt;p&gt;So your architecture must evolve continuously.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Pattern 1: Treat Prompts as First-Class Code&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;One of the biggest anti-patterns in LLM engineering:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;prompt = "Answer the question politely."&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;That’s not engineering. That’s chaos.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Production Pattern&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Version prompts in Git&lt;/li&gt;
&lt;li&gt;Add prompt tests&lt;/li&gt;
&lt;li&gt;Use prompt linting&lt;/li&gt;
&lt;li&gt;Maintain a changelog&lt;/li&gt;
&lt;li&gt;Measure output drift&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Prompt Refactoring Framework:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Refactor Strategy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;System Prompt&lt;/td&gt;
&lt;td&gt;Stability + constraints&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context Injection&lt;/td&gt;
&lt;td&gt;Reduce noise&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Few-shot Examples&lt;/td&gt;
&lt;td&gt;Optimize token efficiency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output Formatting&lt;/td&gt;
&lt;td&gt;Enforce structured JSON&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Tip: Treat prompt updates like schema migrations, not casual edits.&lt;/p&gt;
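&lt;p&gt;A toy version of "prompts as first-class code", versioned entries plus a tiny lint that enforces the constraints from the table above. The prompt names, versions, and lint rules are all made up for illustration:&lt;/p&gt;

```python
# Prompts versioned like code: every change gets a new version,
# and a small lint enforces output-format and length constraints.
PROMPTS = {
    ("summarizer", "1.0.0"): "Summarize the text in 3 bullet points.",
    ("summarizer", "1.1.0"): (
        "Summarize the text in exactly 3 bullet points. "
        'Respond only with valid JSON: {"bullets": [...]}'
    ),
}

def get_prompt(name, version):
    return PROMPTS[(name, version)]

def lint_prompt(text):
    issues = []
    if "JSON" not in text:
        issues.append("no structured output format enforced")
    if len(text) > 2000:
        issues.append("prompt too long; trim few-shot examples")
    return issues

# v1.0.0 fails the lint; v1.1.0 passes:
print(lint_prompt(get_prompt("summarizer", "1.0.0")))
print(lint_prompt(get_prompt("summarizer", "1.1.0")))  # []
```

&lt;p&gt;Keep the dict in Git, run the lint in CI, and a prompt change gets reviewed exactly like a schema migration.&lt;/p&gt;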

&lt;h2&gt;
  
  
  &lt;strong&gt;Pattern 2: RAG Pipeline Refactoring Through Observability&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Your RAG pipeline is not “set and forget.”&lt;/p&gt;

&lt;p&gt;It degrades.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Common Production Issues&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Retrieval irrelevance&lt;/li&gt;
&lt;li&gt;Embedding drift&lt;/li&gt;
&lt;li&gt;Chunking inefficiency&lt;/li&gt;
&lt;li&gt;Over-tokenization&lt;/li&gt;
&lt;li&gt;Context dilution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Production Refactor Pattern&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Add Retrieval Metrics&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Top-K relevance score&lt;/li&gt;
&lt;li&gt;MRR (Mean Reciprocal Rank)&lt;/li&gt;
&lt;li&gt;Query → chunk match rate&lt;/li&gt;
&lt;/ul&gt;
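&lt;p&gt;MRR, mentioned above, is a one-function metric. A minimal implementation, assuming you log one ranked list of chunk ids per query and know which chunks are relevant:&lt;/p&gt;

```python
def mean_reciprocal_rank(results, relevant):
    """MRR: average over queries of 1/rank of the first relevant chunk.

    results: one ranked list of chunk ids per query.
    relevant: one set of relevant chunk ids per query, aligned with results.
    """
    total = 0.0
    for ranked, good in zip(results, relevant):
        for rank, chunk_id in enumerate(ranked, start=1):
            if chunk_id in good:
                total += 1.0 / rank
                break
    return total / len(results)

# Query 1 hits at rank 1, query 2 at rank 2: (1.0 + 0.5) / 2 = 0.75
print(mean_reciprocal_rank([["a", "b"], ["x", "y"]], [{"a"}, {"y"}]))  # 0.75
```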

&lt;p&gt;&lt;strong&gt;2. Continuous Chunk Optimization&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dynamic chunk size testing&lt;/li&gt;
&lt;li&gt;Metadata enrichment refactors&lt;/li&gt;
&lt;li&gt;Query intent classification&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Retrieval A/B Testing&lt;/strong&gt;&lt;br&gt;
Split traffic between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dense-only&lt;/li&gt;
&lt;li&gt;Hybrid search&lt;/li&gt;
&lt;li&gt;Re-ranking model&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Pro Tip: A RAG pipeline is a product, not an integration.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Pattern 3: Refactoring AI Agents (Without Breaking Them)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;AI agents are seductive.&lt;br&gt;
But production agents are fragile.&lt;/p&gt;

&lt;p&gt;When scaling AI agents, refactoring means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reducing hallucinated tool calls&lt;/li&gt;
&lt;li&gt;Improving tool selection accuracy&lt;/li&gt;
&lt;li&gt;Capping execution loops&lt;/li&gt;
&lt;li&gt;Preventing infinite recursion&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Production-Grade Agent Refactor Checklist&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tool call validation layer&lt;/li&gt;
&lt;li&gt;Execution timeout guard&lt;/li&gt;
&lt;li&gt;Retry with structured fallback&lt;/li&gt;
&lt;li&gt;Deterministic planning phase&lt;/li&gt;
&lt;li&gt;Logging full thought chains (internally only)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In enterprise AI architecture, agents should:&lt;/p&gt;

&lt;p&gt;Plan deterministically.&lt;br&gt;
Execute probabilistically.&lt;br&gt;
Validate strictly.&lt;/p&gt;

&lt;p&gt;That separation alone reduces failure rates dramatically.&lt;/p&gt;
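&lt;p&gt;The timeout guard and loop cap from the checklist fit in a few lines. A minimal sketch, where &lt;code&gt;step&lt;/code&gt; and &lt;code&gt;validate&lt;/code&gt; are stand-ins for your executor and output checker:&lt;/p&gt;

```python
import time

def run_agent(step, validate, max_steps=5, timeout_s=10.0):
    """Bounded agent loop: caps iterations and wall-clock time, and only
    returns state that passes strict validation."""
    start = time.monotonic()
    state = None
    for _ in range(max_steps):
        if time.monotonic() - start > timeout_s:
            raise TimeoutError("agent exceeded its time budget")
        state = step(state)
        if validate(state):
            return state
    raise RuntimeError("agent hit max_steps without a valid result")

# Toy run: the 'agent' increments a counter until validation passes.
result = run_agent(step=lambda s: (s or 0) + 1, validate=lambda s: s >= 3)
print(result)  # 3
```

&lt;p&gt;The point is structural: the loop can never run forever, and nothing unvalidated ever escapes it.&lt;/p&gt;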

&lt;h2&gt;
  
  
  &lt;strong&gt;Pattern 4: Enterprise AI Architecture Requires Modular LLM Systems&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In early-stage systems, everything talks to the LLM directly.&lt;/p&gt;

&lt;p&gt;In production? That becomes a nightmare.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Refactor: Layered AI Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client Layer
   ↓
Orchestration Layer
   ↓
LLM Abstraction Layer
   ↓
Retrieval Layer
   ↓
Observability &amp;amp; Evaluation Layer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;p&gt;Because this enables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model switching&lt;/li&gt;
&lt;li&gt;Provider abstraction&lt;/li&gt;
&lt;li&gt;Cost optimization&lt;/li&gt;
&lt;li&gt;Prompt version control&lt;/li&gt;
&lt;li&gt;Centralized monitoring&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where &lt;strong&gt;LLM engineering becomes real software engineering&lt;/strong&gt;.&lt;/p&gt;
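&lt;p&gt;The abstraction layer is the piece worth sketching. One interface in front of every provider, so model switching is a config change; the provider classes below are illustrative stubs, not real SDK calls:&lt;/p&gt;

```python
# Callers never import a vendor SDK directly; they depend on LLMClient.
class FakeClaude:
    def complete(self, prompt):
        return "[claude] " + prompt

class FakeGPT:
    def complete(self, prompt):
        return "[gpt] " + prompt

PROVIDERS = {"claude": FakeClaude, "gpt": FakeGPT}

class LLMClient:
    def __init__(self, provider):
        self._impl = PROVIDERS[provider]()

    def complete(self, prompt):
        # Central choke point: add logging, cost tracking,
        # and prompt version tags here.
        return self._impl.complete(prompt)

print(LLMClient("claude").complete("hello"))  # [claude] hello
print(LLMClient("gpt").complete("hello"))     # [gpt] hello
```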

&lt;h2&gt;
  
  
  &lt;strong&gt;Pattern 5: AI Code Review with LLMs (That Developers Trust)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;AI code review tools are everywhere.&lt;/p&gt;

&lt;p&gt;Most fail because they:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Over-comment&lt;/li&gt;
&lt;li&gt;Suggest trivial refactors&lt;/li&gt;
&lt;li&gt;Ignore project conventions&lt;/li&gt;
&lt;li&gt;Lack context awareness&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Production Refactor Strategy&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Provide repository-wide context&lt;/li&gt;
&lt;li&gt;Inject style guide automatically&lt;/li&gt;
&lt;li&gt;Limit comments to risk-based review&lt;/li&gt;
&lt;li&gt;Add confidence scoring&lt;/li&gt;
&lt;li&gt;Learn from developer overrides&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The secret?&lt;/p&gt;

&lt;p&gt;AI code review must behave like a senior engineer, not a linter.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Pattern 6: Continuous Evaluation Pipelines&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If you're not measuring, you're guessing.&lt;/p&gt;

&lt;p&gt;Modern LLM systems need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Synthetic evaluation datasets&lt;/li&gt;
&lt;li&gt;Golden response tracking&lt;/li&gt;
&lt;li&gt;Drift detection&lt;/li&gt;
&lt;li&gt;Latency benchmarking&lt;/li&gt;
&lt;li&gt;Cost regression alerts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Build an LLM CI/CD Loop&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Prompt Change →&lt;br&gt;
Offline Evaluation →&lt;br&gt;
Shadow Deployment →&lt;br&gt;
Live Monitoring →&lt;br&gt;
Auto Rollback if Degraded&lt;/p&gt;

&lt;p&gt;This is DevOps for AI systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Pattern 7: Cost-Aware Refactoring&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;LLM systems are expensive if left unoptimized.&lt;/p&gt;

&lt;p&gt;Refactor targets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Token usage&lt;/li&gt;
&lt;li&gt;Over-context injection&lt;/li&gt;
&lt;li&gt;Redundant summarization steps&lt;/li&gt;
&lt;li&gt;Multi-model routing inefficiencies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Introduce:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Smart model routing (small model → large model fallback)&lt;/li&gt;
&lt;li&gt;Response caching&lt;/li&gt;
&lt;li&gt;Embedding reuse&lt;/li&gt;
&lt;li&gt;Adaptive context window trimming&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cost optimization is architecture, not finance.&lt;/p&gt;
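&lt;p&gt;Response caching is the cheapest of these wins. A minimal sketch keyed on (model, prompt hash); &lt;code&gt;backend&lt;/code&gt; is a stand-in for a real client call:&lt;/p&gt;

```python
import hashlib

class CachingLLM:
    """Response cache keyed on (model, prompt hash): identical queries
    skip the API entirely."""

    def __init__(self, backend):
        self.backend = backend
        self.cache = {}
        self.api_calls = 0

    def complete(self, model, prompt):
        key = model + ":" + hashlib.sha256(prompt.encode()).hexdigest()
        if key not in self.cache:
            self.api_calls += 1
            self.cache[key] = self.backend(model, prompt)
        return self.cache[key]

llm = CachingLLM(lambda model, prompt: "answer to " + prompt)
llm.complete("small-model", "What is RAG?")
llm.complete("small-model", "What is RAG?")  # served from cache
print(llm.api_calls)  # 1
```

&lt;p&gt;In production the dict becomes Redis with a TTL, and the same key scheme supports semantic caching if you hash an embedding bucket instead of the raw prompt.&lt;/p&gt;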

&lt;h2&gt;
  
  
  &lt;strong&gt;Common Refactoring Anti-Patterns&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Blind model upgrades&lt;/li&gt;
&lt;li&gt;Increasing context instead of fixing retrieval&lt;/li&gt;
&lt;li&gt;Ignoring evaluation data&lt;/li&gt;
&lt;li&gt;Treating hallucination as unavoidable&lt;/li&gt;
&lt;li&gt;Shipping without observability&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Real-World Enterprise Perspective&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In enterprise environments, continuous refactoring becomes even more critical because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compliance constraints evolve&lt;/li&gt;
&lt;li&gt;Data sources change&lt;/li&gt;
&lt;li&gt;Governance policies tighten&lt;/li&gt;
&lt;li&gt;Security reviews require traceability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where companies often bring in specialists.&lt;/p&gt;

&lt;p&gt;For example, firms like &lt;strong&gt;&lt;a href="https://dextralabs.com/" rel="noopener noreferrer"&gt;Dextra Labs&lt;/a&gt;&lt;/strong&gt;, AI consulting &amp;amp; LLM engineering experts, help enterprises design scalable &lt;strong&gt;enterprise AI architecture&lt;/strong&gt;, production-grade &lt;strong&gt;RAG pipelines&lt;/strong&gt;, and robust &lt;strong&gt;AI agents&lt;/strong&gt; with continuous evaluation baked in from day one.&lt;/p&gt;

&lt;p&gt;Rather than just building demos, they focus on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Long-term &lt;strong&gt;&lt;a href="https://dextralabs.com/blog/llm-evaluation/" rel="noopener noreferrer"&gt;LLM&lt;/a&gt;&lt;/strong&gt; system stability&lt;/li&gt;
&lt;li&gt;Refactor-friendly architectures&lt;/li&gt;
&lt;li&gt;AI governance alignment&lt;/li&gt;
&lt;li&gt;Measurable &lt;strong&gt;&lt;a href="https://dextralabs.com/blog/corporate-real-estate-ai-pilots-are-exploding-so-why-is-roi-still-missing/" rel="noopener noreferrer"&gt;ROI&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because production AI is not a hackathon project.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Future: Self-Refactoring LLM Systems&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We’re already seeing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI that rewrites its own prompts&lt;/li&gt;
&lt;li&gt;Agents that optimize retrieval&lt;/li&gt;
&lt;li&gt;LLM-based AI code review systems refactoring pipelines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But until that becomes reliable, humans must design:&lt;/p&gt;

&lt;p&gt;Refactorable-by-default LLM systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Final Production Checklist&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Before you scale your LLM system, ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is prompt versioning implemented?&lt;/li&gt;
&lt;li&gt;Do we measure retrieval performance?&lt;/li&gt;
&lt;li&gt;Can we switch models safely?&lt;/li&gt;
&lt;li&gt;Are agents bounded and validated?&lt;/li&gt;
&lt;li&gt;Do we run continuous evaluation?&lt;/li&gt;
&lt;li&gt;Is cost observable in real time?&lt;/li&gt;
&lt;li&gt;Is architecture modular?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If not, refactor before you scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Closing Thoughts&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Continuous refactoring with LLMs isn’t optional.&lt;/p&gt;

&lt;p&gt;It’s the difference between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A flashy demo&lt;/li&gt;
&lt;li&gt;And a sustainable AI product&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As LLM engineering matures, the teams that win won’t be the ones who ship first.&lt;/p&gt;

&lt;p&gt;They’ll be the ones who refactor continuously.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>rag</category>
    </item>
    <item>
      <title>RAG Projects That Teach You Real Retrieval Engineering (Not Just Prompt Hacking)</title>
      <dc:creator>Dextra Labs</dc:creator>
      <pubDate>Sat, 28 Feb 2026 06:55:01 +0000</pubDate>
      <link>https://dev.to/dextralabs/rag-projects-that-teach-you-real-retrieval-engineering-not-just-prompt-hacking-2007</link>
      <guid>https://dev.to/dextralabs/rag-projects-that-teach-you-real-retrieval-engineering-not-just-prompt-hacking-2007</guid>
      <description>&lt;p&gt;&lt;em&gt;&lt;strong&gt;Because building LLM apps isn’t about clever prompts anymore, it’s about engineering robust RAG pipelines.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Most tutorials show you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Load documents&lt;/li&gt;
&lt;li&gt;Embed them&lt;/li&gt;
&lt;li&gt;Store in a vector DB&lt;/li&gt;
&lt;li&gt;Ask GPT a question&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And boom. “You built RAG!”&lt;/p&gt;

&lt;p&gt;But in real-world &lt;strong&gt;LLM systems&lt;/strong&gt;, that’s barely step one.&lt;/p&gt;

&lt;p&gt;Production-grade Retrieval-Augmented Generation requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Query rewriting&lt;/li&gt;
&lt;li&gt;Chunking strategies&lt;/li&gt;
&lt;li&gt;Hybrid search&lt;/li&gt;
&lt;li&gt;Reranking&lt;/li&gt;
&lt;li&gt;Evaluation pipelines&lt;/li&gt;
&lt;li&gt;Guardrails&lt;/li&gt;
&lt;li&gt;Latency optimization&lt;/li&gt;
&lt;li&gt;Cost governance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s &lt;a href="https://dextralabs.com/blog/best-open-source-llm-model/" rel="noopener noreferrer"&gt;LLM engineering&lt;/a&gt;, not copy-paste coding.&lt;/p&gt;

&lt;p&gt;If you want to build serious &lt;strong&gt;enterprise AI architecture&lt;/strong&gt;, you need projects that simulate production realities.&lt;/p&gt;

&lt;p&gt;Let’s fix that.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;7 RAG Projects That Teach Real Retrieval Engineering&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Each project below escalates your understanding from beginner to advanced &lt;strong&gt;RAG pipeline design&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;1. Build a “Why Did It Answer That?” Debuggable RAG System&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What You Learn&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retrieval transparency&lt;/li&gt;
&lt;li&gt;Embedding diagnostics&lt;/li&gt;
&lt;li&gt;Similarity score interpretation&lt;/li&gt;
&lt;li&gt;Prompt trace logging&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Build It&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Create a &lt;strong&gt;&lt;a href="https://dextralabs.com/blog/rag-to-context-engineering-for-agentic-ai/" rel="noopener noreferrer"&gt;RAG&lt;/a&gt;&lt;/strong&gt; app that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Shows top-k retrieved chunks&lt;/li&gt;
&lt;li&gt;Displays similarity scores&lt;/li&gt;
&lt;li&gt;Logs prompt + retrieved context&lt;/li&gt;
&lt;li&gt;Highlights hallucinated spans&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Add:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Embedding comparison experiments&lt;/li&gt;
&lt;li&gt;Chunk-size A/B testing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Real skill gained: &lt;strong&gt;Observability&lt;/strong&gt; in &lt;strong&gt;LLM systems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most enterprise teams fail because they cannot debug retrieval failures.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;2. Hybrid Search RAG (Vector + BM25)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What You Learn&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sparse vs dense retrieval&lt;/li&gt;
&lt;li&gt;Keyword fallback&lt;/li&gt;
&lt;li&gt;Search fusion strategies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Implement:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ElasticSearch BM25&lt;/li&gt;
&lt;li&gt;Vector DB (Pinecone / Weaviate / FAISS)&lt;/li&gt;
&lt;li&gt;Reciprocal rank fusion&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;p&gt;Because vector search alone fails when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Exact terms matter&lt;/li&gt;
&lt;li&gt;Legal clauses require precision&lt;/li&gt;
&lt;li&gt;Code snippets depend on syntax&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Real skill gained: &lt;strong&gt;Search engineering inside AI systems&lt;/strong&gt;&lt;/p&gt;
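&lt;p&gt;The fusion step above is simpler than it sounds. Reciprocal rank fusion merges the BM25 and vector rankings with one formula:&lt;/p&gt;

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists (e.g. BM25 and dense retrieval) by summing
    1 / (k + rank) per document. k=60 is the commonly used constant."""
    scores = {}
    for ranked in rankings:
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25 = ["doc3", "doc1", "doc2"]   # keyword ranking
dense = ["doc1", "doc4", "doc3"]  # vector ranking
print(reciprocal_rank_fusion([bm25, dense]))
# ['doc1', 'doc3', 'doc4', 'doc2']
```

&lt;p&gt;Documents that rank well in either retriever float to the top, which is exactly the keyword-fallback behaviour hybrid search needs.&lt;/p&gt;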

&lt;h2&gt;
  
  
  &lt;strong&gt;3. Enterprise Policy Copilot (Access-Controlled RAG)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What You Learn&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-tenant architecture&lt;/li&gt;
&lt;li&gt;Metadata filtering&lt;/li&gt;
&lt;li&gt;Role-based retrieval&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Build:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HR policy assistant&lt;/li&gt;
&lt;li&gt;Department-level filtering&lt;/li&gt;
&lt;li&gt;Row-level access control&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Add:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JWT-auth metadata filters&lt;/li&gt;
&lt;li&gt;Audit logging&lt;/li&gt;
&lt;li&gt;Retrieval tracking per user&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Real skill gained: &lt;strong&gt;Enterprise AI architecture&lt;/strong&gt; fundamentals&lt;/p&gt;
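&lt;p&gt;The core idea, filtering by metadata &lt;em&gt;before&lt;/em&gt; similarity ranking, fits in a short sketch. The corpus, department names, and roles below are made up for illustration:&lt;/p&gt;

```python
# Filter chunks by access metadata first, so a user can never retrieve
# documents outside their scope, regardless of similarity scores.
CHUNKS = [
    {"id": 1, "text": "PTO policy...", "dept": "hr", "roles": {"employee", "hr"}},
    {"id": 2, "text": "Salary bands...", "dept": "hr", "roles": {"hr"}},
    {"id": 3, "text": "Deploy guide...", "dept": "eng", "roles": {"employee"}},
]

def retrieve(query, user_roles, dept=None):
    # In production these filters go into the vector DB query itself
    # (metadata filters), not a post-hoc Python loop.
    allowed = [
        c for c in CHUNKS
        if c["roles"].intersection(user_roles)
        and (dept is None or c["dept"] == dept)
    ]
    # Similarity scoring would rank 'allowed' here; we return ids for brevity.
    return [c["id"] for c in allowed]

print(retrieve("leave policy", {"employee"}, dept="hr"))  # [1]
print(retrieve("salary bands", {"hr"}))                   # [1, 2]
```

&lt;p&gt;The roles come from the user's JWT; logging each (user, query, chunk ids) tuple gives you the audit trail.&lt;/p&gt;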

&lt;p&gt;This is where many startups collapse: they forget security in LLM engineering.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;4. AI Code Review Assistant (Context-Aware RAG)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What You Learn&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Code chunking strategies&lt;/li&gt;
&lt;li&gt;AST-based splitting&lt;/li&gt;
&lt;li&gt;Dependency graph retrieval&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Build&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub PR analyzer&lt;/li&gt;
&lt;li&gt;Retrieve related files&lt;/li&gt;
&lt;li&gt;Inject historical bug patterns&lt;/li&gt;
&lt;li&gt;Suggest refactors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Enhance with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vectorizing commit history&lt;/li&gt;
&lt;li&gt;Indexing architecture docs&lt;/li&gt;
&lt;li&gt;Linking code comments to test coverage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Real skill gained: &lt;strong&gt;AI code review systems&lt;/strong&gt; at scale&lt;/p&gt;

&lt;p&gt;This is the difference between a toy bot and a real engineering assistant.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;5. Query-Rewriting RAG with an Agent Loop&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What You Learn&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://dextralabs.com/blog/what-is-ai-agent-orchestration/" rel="noopener noreferrer"&gt;AI agents orchestration&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Self-reflection&lt;/li&gt;
&lt;li&gt;Iterative retrieval&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Implement:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;User question&lt;/li&gt;
&lt;li&gt;LLM rewrites query&lt;/li&gt;
&lt;li&gt;Retrieval step&lt;/li&gt;
&lt;li&gt;Rerank&lt;/li&gt;
&lt;li&gt;If low confidence → retry &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Add:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Query decomposition&lt;/li&gt;
&lt;li&gt;Tool-based retrieval routing&lt;/li&gt;
&lt;li&gt;Multi-hop reasoning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Real skill gained: &lt;strong&gt;AI agents + RAG pipeline fusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Modern LLM systems don’t retrieve once. They retrieve strategically.&lt;/p&gt;
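&lt;p&gt;The five steps above reduce to one loop. A minimal sketch where every callable is a stand-in for an LLM or retriever call:&lt;/p&gt;

```python
def answer(question, rewrite, retrieve, rerank, confidence, max_rounds=3):
    """Rewrite, retrieve, rerank; retry with a fresh rewrite
    while confidence stays low."""
    query = question
    best = []
    for round_n in range(max_rounds):
        query = rewrite(query, round_n)   # step 2
        best = rerank(retrieve(query))    # steps 3-4
        if confidence(best) >= 0.7:       # step 5: stop, else retry
            break
    return best

docs = answer(
    "leave rules?",
    rewrite=lambda q, n: q + " (attempt " + str(n) + ")",
    retrieve=lambda q: ["chunk-a", "chunk-b"],
    rerank=lambda chunks: chunks,
    confidence=lambda chunks: 0.9,
)
print(docs)  # ['chunk-a', 'chunk-b']
```

&lt;p&gt;Query decomposition and routing slot into the &lt;code&gt;rewrite&lt;/code&gt; and &lt;code&gt;retrieve&lt;/code&gt; hooks without touching the loop.&lt;/p&gt;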

&lt;h2&gt;
  
  
  &lt;strong&gt;6. Evaluation-First RAG System&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What You Learn&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retrieval metrics (Recall@k, MRR)&lt;/li&gt;
&lt;li&gt;LLM evaluation loops&lt;/li&gt;
&lt;li&gt;Hallucination scoring&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Build:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ground-truth QA dataset&lt;/li&gt;
&lt;li&gt;Automatic scoring&lt;/li&gt;
&lt;li&gt;Retrieval accuracy dashboard&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Track:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cost per query&lt;/li&gt;
&lt;li&gt;Token usage&lt;/li&gt;
&lt;li&gt;Latency&lt;/li&gt;
&lt;li&gt;Retrieval hit rate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Real skill gained: &lt;strong&gt;Production-grade LLM engineering&lt;/strong&gt; mindset&lt;/p&gt;

&lt;p&gt;If you’re not measuring, you’re guessing.&lt;/p&gt;
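&lt;p&gt;Recall@k, the first metric listed above, is a good place to start because it is trivial to compute from logged retrievals:&lt;/p&gt;

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant chunks that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]).intersection(relevant))
    return hits / len(relevant)

# 2 of the 3 relevant chunks appear in the top 5:
print(recall_at_k(["a", "x", "b", "y", "z"], {"a", "b", "c"}, k=5))
```

&lt;p&gt;Run it over your ground-truth QA dataset on every pipeline change and the dashboard falls out for free.&lt;/p&gt;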

&lt;h2&gt;
  
  
  &lt;strong&gt;7. Multi-Modal RAG (Documents + Tables + Images)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What You Learn&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Structured retrieval&lt;/li&gt;
&lt;li&gt;Table-aware chunking&lt;/li&gt;
&lt;li&gt;Image embedding indexing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Build:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Financial report assistant&lt;/li&gt;
&lt;li&gt;Retrieve charts&lt;/li&gt;
&lt;li&gt;Interpret tables&lt;/li&gt;
&lt;li&gt;Answer cross-document questions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Add:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OCR ingestion&lt;/li&gt;
&lt;li&gt;Structured metadata&lt;/li&gt;
&lt;li&gt;Query routing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Real skill gained: Next-gen &lt;strong&gt;enterprise AI systems&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What Real Retrieval Engineering Actually Looks Like&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Here’s the mental model shift:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Toy RAG&lt;/th&gt;
&lt;th&gt;Real RAG Engineering&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Embed + store&lt;/td&gt;
&lt;td&gt;Chunk strategy experiments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Top-k retrieval&lt;/td&gt;
&lt;td&gt;Reranking + fusion&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;One prompt&lt;/td&gt;
&lt;td&gt;Agent loops&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No logging&lt;/td&gt;
&lt;td&gt;Full observability&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No metrics&lt;/td&gt;
&lt;td&gt;Retrieval evaluation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No auth&lt;/td&gt;
&lt;td&gt;Enterprise-grade security&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you want to work in serious &lt;strong&gt;LLM engineering roles&lt;/strong&gt;, you must understand this difference.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The RAG Pipeline Blueprint (Production Version)&lt;/strong&gt;
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Query
   ↓
Query Rewriting Agent
   ↓
Retriever Router (Vector / BM25 / Graph)
   ↓
Hybrid Retrieval
   ↓
Reranker
   ↓
Context Compression
   ↓
LLM Generation
   ↓
Evaluation &amp;amp; Logging
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s not a tutorial project.&lt;/p&gt;

&lt;p&gt;That’s a system.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Where Most Companies Need Help&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In practice, enterprises struggle with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scaling RAG across millions of documents&lt;/li&gt;
&lt;li&gt;Latency optimization&lt;/li&gt;
&lt;li&gt;Cost governance&lt;/li&gt;
&lt;li&gt;Access control&lt;/li&gt;
&lt;li&gt;Security compliance&lt;/li&gt;
&lt;li&gt;Hallucination mitigation&lt;/li&gt;
&lt;li&gt;AI code review automation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where specialized AI consulting becomes critical.&lt;/p&gt;

&lt;p&gt;Teams working on advanced &lt;strong&gt;LLM systems and enterprise AI architecture&lt;/strong&gt; often partner with firms like &lt;strong&gt;&lt;a href="https://dextralabs.com/" rel="noopener noreferrer"&gt;Dextra Labs&lt;/a&gt;&lt;/strong&gt;, an AI consulting company focused on production-grade LLM engineering and scalable RAG pipeline design, to avoid costly architectural mistakes early.&lt;/p&gt;

&lt;p&gt;Because rewriting your AI architecture six months later is far more expensive than designing it correctly.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Advanced Extensions (If You Want to Stand Out)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If you really want to differentiate yourself in LLM engineering interviews:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Implement a reranker (Cross-Encoder)&lt;/li&gt;
&lt;li&gt;Add semantic caching&lt;/li&gt;
&lt;li&gt;Build a retrieval benchmarking harness&lt;/li&gt;
&lt;li&gt;Add synthetic query generation&lt;/li&gt;
&lt;li&gt;Build a hallucination classifier&lt;/li&gt;
&lt;li&gt;Implement graph-based RAG&lt;/li&gt;
&lt;li&gt;Add streaming retrieval&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Final Thought: RAG Is Search Engineering in Disguise&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Retrieval-Augmented Generation is not about adding context.&lt;/p&gt;

&lt;p&gt;It’s about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Information retrieval science&lt;/li&gt;
&lt;li&gt;Distributed systems&lt;/li&gt;
&lt;li&gt;Observability&lt;/li&gt;
&lt;li&gt;Security&lt;/li&gt;
&lt;li&gt;Agent orchestration&lt;/li&gt;
&lt;li&gt;Cost optimization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The future of AI agents and enterprise AI architecture depends on engineers who understand this deeply.&lt;/p&gt;

&lt;p&gt;Build these projects.&lt;/p&gt;

&lt;p&gt;Break them.&lt;/p&gt;

&lt;p&gt;Measure them.&lt;/p&gt;

&lt;p&gt;Optimize them.&lt;/p&gt;

&lt;p&gt;That’s real retrieval engineering.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>rag</category>
    </item>
    <item>
      <title>Agentic AI Architecture: From CLI Tools to Enterprise Systems</title>
      <dc:creator>Dextra Labs</dc:creator>
      <pubDate>Fri, 27 Feb 2026 19:33:08 +0000</pubDate>
      <link>https://dev.to/dextralabs/agentic-ai-architecture-from-cli-tools-to-enterprise-systems-9p</link>
      <guid>https://dev.to/dextralabs/agentic-ai-architecture-from-cli-tools-to-enterprise-systems-9p</guid>
      <description>&lt;p&gt;&lt;em&gt;The era of AI-native software isn’t coming.&lt;br&gt;
It’s already here.&lt;br&gt;
And it’s agentic.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;From scrappy CLI copilots to fully autonomous enterprise workflows, &lt;strong&gt;AI agents are reshaping software architecture itself&lt;/strong&gt;. But building reliable, scalable, production-grade &lt;strong&gt;&lt;a href="https://dextralabs.com/blog/best-llm-models/" rel="noopener noreferrer"&gt;LLM&lt;/a&gt;&lt;/strong&gt; systems isn’t just about plugging in an API key.&lt;/p&gt;

&lt;p&gt;It’s about architecture.&lt;/p&gt;

&lt;p&gt;In this deep dive, we’ll explore:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What &lt;strong&gt;&lt;a href="https://dextralabs.com/blog/agentic-ai/" rel="noopener noreferrer"&gt;Agentic AI Architecture&lt;/a&gt;&lt;/strong&gt; really means&lt;/li&gt;
&lt;li&gt;How we move from CLI tools → production systems&lt;/li&gt;
&lt;li&gt;How to design scalable RAG pipelines&lt;/li&gt;
&lt;li&gt;What enterprise AI architecture looks like&lt;/li&gt;
&lt;li&gt;Why AI code review and governance matter&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And how AI consulting firms like Dextra Labs help companies operationalize this shift.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What Is Agentic AI Architecture?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Agentic AI architecture refers to systems where LLM-powered agents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Perceive context&lt;/li&gt;
&lt;li&gt;Reason over goals&lt;/li&gt;
&lt;li&gt;Take actions via tools&lt;/li&gt;
&lt;li&gt;Learn from feedback&lt;/li&gt;
&lt;li&gt;Coordinate across systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Unlike traditional ML pipelines, agentic systems are:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Traditional ML&lt;/th&gt;
&lt;th&gt;Agentic AI&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Static models&lt;/td&gt;
&lt;td&gt;Dynamic agents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Single prediction&lt;/td&gt;
&lt;td&gt;Multi-step reasoning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No tool use&lt;/td&gt;
&lt;td&gt;Tool invocation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Batch inference&lt;/td&gt;
&lt;td&gt;Interactive execution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Isolated outputs&lt;/td&gt;
&lt;td&gt;Workflow orchestration&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In short:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLMs generate text. AI agents execute intent.&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Phase 1: The CLI AI Tool Era&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We all started here.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI CLI wrappers&lt;/li&gt;
&lt;li&gt;Git commit summarizers&lt;/li&gt;
&lt;li&gt;Local RAG search tools&lt;/li&gt;
&lt;li&gt;Terminal copilots&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These tools typically include:&lt;/p&gt;

&lt;p&gt;User → Prompt → LLM API → Output → Done&lt;/p&gt;

&lt;p&gt;They’re powerful — but limited.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Limitations&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No persistent memory&lt;/li&gt;
&lt;li&gt;No multi-step planning&lt;/li&gt;
&lt;li&gt;No tool orchestration&lt;/li&gt;
&lt;li&gt;No observability&lt;/li&gt;
&lt;li&gt;No governance&lt;/li&gt;
&lt;li&gt;No enterprise guardrails&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where many startups stop.&lt;/p&gt;

&lt;p&gt;But enterprises can’t.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Phase 2: RAG Pipelines — The First Leap&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To build production-grade LLM systems, we need structured retrieval.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Modern RAG Pipeline&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Document ingestion&lt;/li&gt;
&lt;li&gt;Chunking + embedding&lt;/li&gt;
&lt;li&gt;Vector storage&lt;/li&gt;
&lt;li&gt;Retrieval&lt;/li&gt;
&lt;li&gt;Prompt augmentation&lt;/li&gt;
&lt;li&gt;LLM response generation&lt;/li&gt;
&lt;/ol&gt;
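&lt;p&gt;The six steps above can be sketched end to end in a few lines. This is a toy illustration: the bag-of-words embed() stands in for a real embedding model, a plain list stands in for a vector database, and all names are hypothetical.&lt;/p&gt;

```python
# Toy end-to-end RAG pipeline covering the six steps above.
import math
from collections import Counter

def embed(text: str) -> Counter:             # 2. chunking + embedding (toy)
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = ["refunds are processed within 14 days",   # 1. document ingestion
        "api keys rotate every 90 days"]
index = [(d, embed(d)) for d in docs]             # 3. vector storage

def retrieve(query: str, k: int = 1) -> list:     # 4. retrieval
    qv = embed(query)
    ranked = sorted(index, key=lambda p: cosine(qv, p[1]), reverse=True)
    return [d for d, _ in ranked[:k]]

def build_prompt(query: str) -> str:              # 5. prompt augmentation
    context = "\n".join(retrieve(query))
    return "Context:\n" + context + "\n\nQuestion: " + query  # 6. sent to the LLM
```

&lt;p&gt;In production, the list index becomes a vector database, embed() becomes a trained embedding model, and chunking splits documents before indexing, but the data flow stays the same.&lt;/p&gt;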

&lt;p&gt;But in enterprise AI architecture, that’s just the beginning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advanced RAG Engineering Includes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hybrid search (vector + keyword)&lt;/li&gt;
&lt;li&gt;Query rewriting&lt;/li&gt;
&lt;li&gt;Context compression&lt;/li&gt;
&lt;li&gt;Multi-hop retrieval&lt;/li&gt;
&lt;li&gt;Caching layers&lt;/li&gt;
&lt;li&gt;Guardrail filters&lt;/li&gt;
&lt;li&gt;Evaluation frameworks&lt;/li&gt;
&lt;/ul&gt;
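&lt;p&gt;Hybrid search, the first item above, is worth a sketch: blend a keyword-overlap score with a similarity score and rank on the weighted sum. The 50/50 weighting and the toy scoring functions are assumptions, not any specific engine’s formula.&lt;/p&gt;

```python
# Toy hybrid search: weighted blend of keyword and "vector" similarity.
def keyword_score(query: str, doc: str) -> float:
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q.intersection(d)) / len(q) if q else 0.0

def vector_score(query: str, doc: str) -> float:
    # character-overlap stand-in for cosine similarity over real embeddings
    q, d = set(query.lower()), set(doc.lower())
    return len(q.intersection(d)) / len(q.union(d)) if q.union(d) else 0.0

def hybrid_rank(query: str, docs: list, alpha: float = 0.5) -> list:
    scored = [(alpha * keyword_score(query, d)
               + (1 - alpha) * vector_score(query, d), d) for d in docs]
    return [d for _, d in sorted(scored, reverse=True)]
```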

&lt;p&gt;This is where LLM engineering becomes a discipline, not just prompt tinkering.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Phase 3: True AI Agents&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Now we go beyond retrieval.&lt;/p&gt;

&lt;p&gt;Agentic systems add:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Memory layers&lt;/li&gt;
&lt;li&gt;Tool calling&lt;/li&gt;
&lt;li&gt;Reflection loops&lt;/li&gt;
&lt;li&gt;Task decomposition&lt;/li&gt;
&lt;li&gt;Monitoring and evaluation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Request
   ↓
Planner Agent
   ↓
Tool Executor Agent
   ↓
Knowledge Agent (RAG)
   ↓
Validator Agent
   ↓
Final Response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
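&lt;p&gt;Compressed into code, the chain above might look like this, where each “agent” is just a function; in a real system each would wrap its own LLM call, tools, and retrieval context. All names here are illustrative.&lt;/p&gt;

```python
# Minimal sketch of the Planner -> Executor -> Knowledge -> Validator chain.
def planner(request: str) -> list:
    return ["look up: " + request, "draft answer for: " + request]

def tool_executor(step: str) -> str:
    return "result(" + step + ")"          # would call real tools/APIs

def knowledge_agent(results: list) -> str:
    return " | ".join(results)             # would be RAG-grounded context

def validator(draft: str) -> str:
    assert draft, "validator rejects empty drafts"
    return draft

def handle(request: str) -> str:
    steps = planner(request)
    results = [tool_executor(s) for s in steps]
    draft = knowledge_agent(results)
    return validator(draft)                # final response
```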



&lt;p&gt;You’re no longer building a chatbot.&lt;/p&gt;

&lt;p&gt;You’re building a distributed reasoning system.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Enterprise AI Architecture: What Changes?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;When scaling from startup to enterprise:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security &amp;amp; Compliance&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PII masking&lt;/li&gt;
&lt;li&gt;Audit logs&lt;/li&gt;
&lt;li&gt;Data isolation&lt;/li&gt;
&lt;li&gt;Access control&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Observability&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prompt tracking&lt;/li&gt;
&lt;li&gt;Hallucination detection&lt;/li&gt;
&lt;li&gt;Agent performance scoring&lt;/li&gt;
&lt;li&gt;Feedback loops&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Infrastructure&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model routing (open-source vs proprietary)&lt;/li&gt;
&lt;li&gt;Latency optimization&lt;/li&gt;
&lt;li&gt;Cost governance&lt;/li&gt;
&lt;li&gt;Horizontal scaling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Evaluation&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Synthetic testing&lt;/li&gt;
&lt;li&gt;Prompt regression tests&lt;/li&gt;
&lt;li&gt;AI code review pipelines&lt;/li&gt;
&lt;li&gt;Human-in-the-loop validation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where most AI-native startups struggle.&lt;/p&gt;

&lt;p&gt;Because building cool demos ≠ building reliable systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;AI Code Review in the Agentic Era&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;One under-discussed layer: AI reviewing AI.&lt;/p&gt;

&lt;p&gt;AI code review in modern LLM systems can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Analyze prompt logic&lt;/li&gt;
&lt;li&gt;Detect tool misuse&lt;/li&gt;
&lt;li&gt;Identify unsafe agent actions&lt;/li&gt;
&lt;li&gt;Score hallucination risk&lt;/li&gt;
&lt;li&gt;Evaluate RAG quality&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Enterprise AI architecture increasingly includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automated LLM regression testing&lt;/li&gt;
&lt;li&gt;Model diff comparisons&lt;/li&gt;
&lt;li&gt;Prompt version control&lt;/li&gt;
&lt;li&gt;Behavior monitoring dashboards&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The architecture now looks more like DevOps + MLOps + AgentOps.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The New LLM Systems Stack&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Let’s break it down:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Foundation Layer&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LLM providers (OpenAI, Anthropic, open models)&lt;/li&gt;
&lt;li&gt;Embedding models&lt;/li&gt;
&lt;li&gt;Vector DBs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Orchestration Layer&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Agent frameworks&lt;/li&gt;
&lt;li&gt;Tool registries&lt;/li&gt;
&lt;li&gt;Memory stores&lt;/li&gt;
&lt;li&gt;Workflow engines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Governance Layer&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Safety filters&lt;/li&gt;
&lt;li&gt;Audit logging&lt;/li&gt;
&lt;li&gt;Access control&lt;/li&gt;
&lt;li&gt;Prompt versioning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. Evaluation Layer&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Offline eval datasets&lt;/li&gt;
&lt;li&gt;LLM-as-judge scoring&lt;/li&gt;
&lt;li&gt;AI code review agents&lt;/li&gt;
&lt;li&gt;Monitoring dashboards&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;5. Application Layer&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Customer support agents&lt;/li&gt;
&lt;li&gt;Internal copilots&lt;/li&gt;
&lt;li&gt;Sales automation&lt;/li&gt;
&lt;li&gt;Knowledge assistants&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each layer introduces complexity.&lt;/p&gt;

&lt;p&gt;And that’s where strategic AI consulting becomes critical.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How AI-Native Companies Get This Wrong&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Common mistakes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Treating LLM engineering as prompt engineering&lt;/li&gt;
&lt;li&gt;Ignoring &lt;strong&gt;&lt;a href="https://dextralabs.com/blog/rag-pipeline-explained-diagram-implementation/" rel="noopener noreferrer"&gt;RAG pipeline&lt;/a&gt;&lt;/strong&gt; optimization&lt;/li&gt;
&lt;li&gt;No evaluation framework&lt;/li&gt;
&lt;li&gt;No model routing strategy&lt;/li&gt;
&lt;li&gt;No enterprise AI architecture design&lt;/li&gt;
&lt;li&gt;No human oversight&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Result?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hallucinations in production&lt;/li&gt;
&lt;li&gt;Cost explosions&lt;/li&gt;
&lt;li&gt;Security risks&lt;/li&gt;
&lt;li&gt;Broken automation loops&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Building It Right: The Dextra Labs Approach&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This is where firms like &lt;strong&gt;&lt;a href="https://dextralabs.com/" rel="noopener noreferrer"&gt;Dextra Labs&lt;/a&gt;&lt;/strong&gt; step in.&lt;/p&gt;

&lt;p&gt;Instead of building surface-level AI features, Dextra Labs focuses on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Production-grade LLM engineering&lt;/li&gt;
&lt;li&gt;Robust RAG pipeline design&lt;/li&gt;
&lt;li&gt;Multi-agent system architecture&lt;/li&gt;
&lt;li&gt;AI code review systems&lt;/li&gt;
&lt;li&gt;Enterprise AI architecture modernization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They specialize in helping companies transition:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;From AI experiments → to AI-native operating systems.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you’re exploring large-scale LLM systems, a structured architecture roadmap is critical.&lt;/p&gt;

&lt;p&gt;You can explore &lt;strong&gt;enterprise AI architecture consulting or LLM engineering services&lt;/strong&gt; to understand how production AI systems are built with scalability, governance, and evaluation in mind.&lt;/p&gt;

&lt;p&gt;For organizations modernizing internal tools, &lt;strong&gt;AI code review automation&lt;/strong&gt; and &lt;strong&gt;RAG pipeline optimization&lt;/strong&gt; are increasingly becoming competitive advantages.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Software Due Diligence in the Age of AI-Native Companies&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If you're an investor or acquirer, your due diligence checklist must evolve.&lt;/p&gt;

&lt;p&gt;Instead of asking:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;“Do they use AI?”&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You should ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How is their RAG pipeline designed?&lt;/li&gt;
&lt;li&gt;Do they have hallucination evaluation?&lt;/li&gt;
&lt;li&gt;What is their agent orchestration model?&lt;/li&gt;
&lt;li&gt;How do they handle model versioning?&lt;/li&gt;
&lt;li&gt;Is AI code review part of CI/CD?&lt;/li&gt;
&lt;li&gt;What is their enterprise AI architecture maturity?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AI-native companies are being valued not just on features, &lt;br&gt;
but on architectural robustness.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Future: Autonomous Enterprise Systems&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We’re heading toward:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Self-healing AI workflows&lt;/li&gt;
&lt;li&gt;Cross-department agent collaboration&lt;/li&gt;
&lt;li&gt;Autonomous internal copilots&lt;/li&gt;
&lt;li&gt;AI-native ERP overlays&lt;/li&gt;
&lt;li&gt;Dynamic reasoning systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Agentic AI architecture won’t be optional.&lt;/p&gt;

&lt;p&gt;It will be foundational.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Final Thoughts&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The shift from CLI AI tools to enterprise agentic systems is not incremental.&lt;/p&gt;

&lt;p&gt;It’s architectural.&lt;/p&gt;

&lt;p&gt;If you’re building:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI copilots&lt;/li&gt;
&lt;li&gt;Internal knowledge assistants&lt;/li&gt;
&lt;li&gt;AI workflow automation&lt;/li&gt;
&lt;li&gt;Multi-agent systems&lt;/li&gt;
&lt;li&gt;LLM-powered SaaS&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You are no longer just building features.&lt;/p&gt;

&lt;p&gt;You’re designing intelligence infrastructure.&lt;/p&gt;

&lt;p&gt;And that demands serious LLM engineering, scalable RAG pipelines, evaluation frameworks, and enterprise AI architecture thinking.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>rag</category>
    </item>
    <item>
      <title>Building AI Code Review Systems That Developers Trust</title>
      <dc:creator>Dextra Labs</dc:creator>
      <pubDate>Sun, 22 Feb 2026 07:50:02 +0000</pubDate>
      <link>https://dev.to/dextralabs/building-ai-code-review-systems-that-developers-trust-6mh</link>
      <guid>https://dev.to/dextralabs/building-ai-code-review-systems-that-developers-trust-6mh</guid>
      <description>&lt;p&gt;&lt;em&gt;&lt;strong&gt;Because shipping AI reviewers is easy.&lt;br&gt;
Earning developer trust? That’s the real engineering challenge.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Modern teams are experimenting with AI code review, from inline suggestions to autonomous pull request analysis.&lt;/p&gt;

&lt;p&gt;But here’s the truth:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Developers don’t trust AI just because it’s “powered by GPT.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Trust is built through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Predictable behavior&lt;/li&gt;
&lt;li&gt;Context awareness&lt;/li&gt;
&lt;li&gt;Transparent reasoning&lt;/li&gt;
&lt;li&gt;Low hallucination rates&lt;/li&gt;
&lt;li&gt;Clear boundaries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this blog, we’ll break down &lt;strong&gt;how to design production-grade AI code review systems&lt;/strong&gt; that developers rely on, not ignore.&lt;/p&gt;

&lt;p&gt;Let’s build this the right way.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;1. Why AI Code Review Often Fails&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Before we design trust, let’s diagnose failure.&lt;/p&gt;

&lt;p&gt;Most early AI reviewers fail because they:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lack repository context&lt;/li&gt;
&lt;li&gt;Ignore project coding standards&lt;/li&gt;
&lt;li&gt;Hallucinate vulnerabilities&lt;/li&gt;
&lt;li&gt;Suggest outdated patterns&lt;/li&gt;
&lt;li&gt;Don’t explain reasoning&lt;/li&gt;
&lt;li&gt;Over-comment trivial issues&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Developers quickly learn to mute them.&lt;/p&gt;

&lt;p&gt;The problem isn’t the model.&lt;/p&gt;

&lt;p&gt;It’s poor &lt;strong&gt;LLM engineering&lt;/strong&gt; and weak &lt;strong&gt;enterprise AI architecture&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;2. Architecture of a Trustworthy AI Code Review System&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Let’s zoom out and look at a robust system design.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Core Components&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;LLM (reasoning engine)&lt;/li&gt;
&lt;li&gt;RAG pipeline for repository grounding&lt;/li&gt;
&lt;li&gt;Static analysis integration&lt;/li&gt;
&lt;li&gt;Policy engine (team rules)&lt;/li&gt;
&lt;li&gt;Feedback learning loop&lt;/li&gt;
&lt;li&gt;Explainability layer&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This isn’t just “call an API and hope.”&lt;/p&gt;

&lt;p&gt;It’s a structured &lt;strong&gt;LLM system&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;3. Step 1: Ground the Model with a RAG Pipeline&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Raw LLMs don’t know your:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Internal libraries&lt;/li&gt;
&lt;li&gt;Coding guidelines&lt;/li&gt;
&lt;li&gt;Architecture decisions&lt;/li&gt;
&lt;li&gt;Security policies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s where a &lt;strong&gt;RAG pipeline&lt;/strong&gt; changes everything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How It Works in Code Review&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Developer opens PR&lt;/li&gt;
&lt;li&gt;Changed files are chunked&lt;/li&gt;
&lt;li&gt;Related files are retrieved&lt;/li&gt;
&lt;li&gt;Relevant documentation is fetched&lt;/li&gt;
&lt;li&gt;Context is embedded and passed to LLM&lt;/li&gt;
&lt;/ol&gt;
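&lt;p&gt;Steps 3 through 5 can be sketched with a crude stand-in for retrieval: treat files that share an import with the changed file as “related” and inject them into the review prompt. The tiny repo dict and every name here are hypothetical.&lt;/p&gt;

```python
# Toy repository-grounding step for a code review prompt.
repo = {
    "services/payment.ts": "import { wrapAsync } from '../utils/async'",
    "services/refund.ts":  "import { wrapAsync } from '../utils/async'",
    "ui/button.tsx":       "import React from 'react'",
}

def imports_of(src: str) -> set:
    # quoted module specifiers only, so shared keywords don't match everything
    return {tok for tok in src.split() if tok.startswith("'")}

def related_files(changed: str) -> list:          # crude retrieval stand-in
    target = imports_of(repo[changed])
    return [path for path, src in repo.items()
            if path != changed and imports_of(src).intersection(target)]

def review_prompt(changed: str, diff: str) -> str:  # context injection
    context = "\n".join("// " + p + "\n" + repo[p] for p in related_files(changed))
    return "Team context:\n" + context + "\n\nReview this diff to " + changed + ":\n" + diff
```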

&lt;p&gt;Instead of generic advice:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“Consider improving performance”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You get:&lt;/p&gt;

&lt;p&gt;“In /services/payment.ts, we standardize async error handling with wrapAsync(). This PR uses a try/catch block directly; consider aligning with the team pattern.”&lt;/p&gt;

&lt;p&gt;That’s trust.&lt;/p&gt;

&lt;p&gt;Because it’s grounded.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;4. AI Agents vs Single LLM Calls&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If you want serious results, don’t rely on one-shot prompts.&lt;/p&gt;

&lt;p&gt;Use &lt;strong&gt;AI agents&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example Agent Roles in Code Review&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Security Agent&lt;/li&gt;
&lt;li&gt;Performance Agent&lt;/li&gt;
&lt;li&gt;Style &amp;amp; Convention Agent&lt;/li&gt;
&lt;li&gt;Test Coverage Agent&lt;/li&gt;
&lt;li&gt;Architecture Consistency Agent&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each agent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Has its own system prompt&lt;/li&gt;
&lt;li&gt;Pulls different retrieval context&lt;/li&gt;
&lt;li&gt;Applies specialized reasoning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then results are merged intelligently.&lt;/p&gt;
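&lt;p&gt;“Merged intelligently” can be as simple as deduplicating findings per location and keeping the highest-severity report, so developers see one comment per issue. A sketch under assumed field names (file, line, rule, severity):&lt;/p&gt;

```python
# Merge findings from multiple specialized agents, one comment per issue.
def merge_findings(agent_outputs: list) -> list:
    best = {}
    for findings in agent_outputs:
        for f in findings:
            key = (f["file"], f["line"], f["rule"])   # dedup key
            if key not in best or f["severity"] > best[key]["severity"]:
                best[key] = f                          # keep strongest report
    return sorted(best.values(), key=lambda f: -f["severity"])
```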

&lt;p&gt;This modular design improves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Precision&lt;/li&gt;
&lt;li&gt;Explainability&lt;/li&gt;
&lt;li&gt;Maintainability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is modern &lt;strong&gt;LLM engineering&lt;/strong&gt; in action.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;5. Enterprise AI Architecture Considerations&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If you're building for real organizations (not hackathons), you must consider:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security &amp;amp; Compliance&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Code never leaves VPC&lt;/li&gt;
&lt;li&gt;On-prem or private model deployment&lt;/li&gt;
&lt;li&gt;Encrypted embedding stores&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Observability&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Track:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;False positives&lt;/li&gt;
&lt;li&gt;Acceptance rate&lt;/li&gt;
&lt;li&gt;Developer overrides&lt;/li&gt;
&lt;li&gt;Hallucination patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Feedback Loops&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let developers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Accept suggestions&lt;/li&gt;
&lt;li&gt;Reject with reason&lt;/li&gt;
&lt;li&gt;Rate quality&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This data retrains prompt strategies and fine-tunes models.&lt;/p&gt;

&lt;p&gt;Without feedback loops?&lt;br&gt;
Trust erodes fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;6. Measuring Trust (Yes, It’s Measurable)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;You can quantify trust using:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;What It Tells You&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Suggestion Acceptance Rate&lt;/td&gt;
&lt;td&gt;Real usefulness&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Override Frequency&lt;/td&gt;
&lt;td&gt;Noise level&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time-to-Merge Reduction&lt;/td&gt;
&lt;td&gt;Productivity gain&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Developer Sentiment&lt;/td&gt;
&lt;td&gt;Qualitative trust&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hallucination Incidents&lt;/td&gt;
&lt;td&gt;System reliability&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
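&lt;p&gt;The first two metrics fall out of simple event counting. A sketch, assuming review events are logged as dicts with an action field (a hypothetical schema):&lt;/p&gt;

```python
# Compute acceptance and override rates from raw review events.
def trust_metrics(events: list) -> dict:
    total = len(events)
    accepted = sum(e["action"] == "accepted" for e in events)
    overridden = sum(e["action"] == "overridden" for e in events)
    return {
        "acceptance_rate": accepted / total if total else 0.0,
        "override_rate": overridden / total if total else 0.0,
    }
```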

&lt;p&gt;If acceptance is below 30%,&lt;br&gt;
you don’t have AI.&lt;br&gt;
You have spam.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;7. Common Mistakes in AI Code Review Systems&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Let’s save you months of pain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mistake 1: No Contextual Retrieval&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Fix → Invest in strong RAG pipeline design&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mistake 2: Overly Generic Prompts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Fix → Role-specific agent prompts&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mistake 3: No Guardrails&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Fix → Combine static analysis + LLM reasoning&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mistake 4: No Human Override&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Fix → Make AI assistive, not authoritative&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;8. Real-World Insight: What Works in Production&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Teams that successfully deploy AI code review systems usually:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start with one focused problem (e.g., security scanning)&lt;/li&gt;
&lt;li&gt;Build modular agents&lt;/li&gt;
&lt;li&gt;Integrate static analyzers (ESLint, SonarQube, etc.)&lt;/li&gt;
&lt;li&gt;Keep human reviewers in the loop&lt;/li&gt;
&lt;li&gt;Continuously refine retrieval&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where experienced AI consulting makes a difference.&lt;/p&gt;

&lt;p&gt;For example, &lt;strong&gt;&lt;a href="https://dextralabs.com/" rel="noopener noreferrer"&gt;Dextra Labs&lt;/a&gt;&lt;/strong&gt; works with engineering teams to design production-grade AI systems, from robust &lt;strong&gt;&lt;a href="https://dextralabs.com/blog/ai-agents-llm-rag-agentic-workflows/" rel="noopener noreferrer"&gt;LLM systems&lt;/a&gt;&lt;/strong&gt; to scalable &lt;strong&gt;&lt;a href="https://dextralabs.com/case-studies/scalable-ai-agent-architecture-dextralabs/" rel="noopener noreferrer"&gt;enterprise AI architecture&lt;/a&gt;&lt;/strong&gt;, ensuring models are grounded, secure, and actually trusted by developers.&lt;/p&gt;

&lt;p&gt;Instead of just adding AI as a feature, they focus on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retrieval optimization&lt;/li&gt;
&lt;li&gt;Agent orchestration&lt;/li&gt;
&lt;li&gt;Secure deployment pipelines&lt;/li&gt;
&lt;li&gt;Governance layers for enterprise compliance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because in real organizations, architecture matters more than hype.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;9. Interactive Checklist: Is Your AI Reviewer Trustworthy?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Answer honestly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Does it retrieve relevant repo context?&lt;/li&gt;
&lt;li&gt;Does it explain why a suggestion is made?&lt;/li&gt;
&lt;li&gt;Can developers give feedback?&lt;/li&gt;
&lt;li&gt;Are hallucinations tracked?&lt;/li&gt;
&lt;li&gt;Are different review concerns separated into agents?&lt;/li&gt;
&lt;li&gt;Is data secured within enterprise boundaries?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you checked fewer than 4…&lt;/p&gt;

&lt;p&gt;You’re experimenting, not engineering.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;10. The Future of AI Code Review&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We’re moving toward:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Autonomous PR summaries&lt;/li&gt;
&lt;li&gt;Risk scoring per change&lt;/li&gt;
&lt;li&gt;Intelligent reviewer assignment&lt;/li&gt;
&lt;li&gt;AI-generated test cases&lt;/li&gt;
&lt;li&gt;Architecture drift detection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The next wave won’t be “AI that comments.”&lt;/p&gt;

&lt;p&gt;It will be &lt;strong&gt;AI agents collaborating with developers&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That shift requires disciplined &lt;strong&gt;&lt;a href="https://dextralabs.com/blog/prompt-engineering-for-llm/" rel="noopener noreferrer"&gt;LLM engineering&lt;/a&gt;&lt;/strong&gt;, thoughtful &lt;strong&gt;&lt;a href="https://dextralabs.com/blog/multimodal-rag-at-scale-enterprise-ai/" rel="noopener noreferrer"&gt;RAG pipeline&lt;/a&gt;&lt;/strong&gt; design, and strong &lt;strong&gt;enterprise AI architecture&lt;/strong&gt; foundations.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Final Thoughts&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Developers don’t trust AI because it’s intelligent.&lt;/p&gt;

&lt;p&gt;They trust it because it’s:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Context-aware&lt;/li&gt;
&lt;li&gt;Predictable&lt;/li&gt;
&lt;li&gt;Transparent&lt;/li&gt;
&lt;li&gt;Measurable&lt;/li&gt;
&lt;li&gt;Secure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Building an &lt;strong&gt;AI code review system&lt;/strong&gt; is not a prompt problem.&lt;/p&gt;

&lt;p&gt;It’s a systems engineering problem.&lt;/p&gt;

&lt;p&gt;And when done right?&lt;/p&gt;

&lt;p&gt;It becomes a force multiplier for engineering velocity.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>architecture</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>How LLM Memory Actually Works in Production Systems</title>
      <dc:creator>Dextra Labs</dc:creator>
      <pubDate>Sat, 21 Feb 2026 20:53:20 +0000</pubDate>
      <link>https://dev.to/dextralabs/how-llm-memory-actually-works-in-production-systems-549d</link>
      <guid>https://dev.to/dextralabs/how-llm-memory-actually-works-in-production-systems-549d</guid>
      <description>&lt;p&gt;If you think LLMs "remember" things like humans do…&lt;br&gt;
 you're about to discover what really happens behind the scenes.&lt;/p&gt;

&lt;p&gt;Large Language Models feel intelligent. They reference context. They recall prior inputs. They adapt to tasks.&lt;/p&gt;

&lt;p&gt;But here's the truth:&lt;br&gt;
&lt;strong&gt;&lt;em&gt;LLMs don’t have memory. Systems do.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;And understanding that difference is what separates hobby projects from production-grade &lt;strong&gt;&lt;a href="https://dextralabs.com/blog/llm-embeddings-feature-engineering/" rel="noopener noreferrer"&gt;LLM engineering&lt;/a&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let’s break it down interactively.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;First: Do LLMs Actually Have Memory?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Short answer?&lt;br&gt;
 No.&lt;/p&gt;

&lt;p&gt;A base model like GPT or LLaMA:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Doesn’t store conversations permanently&lt;/li&gt;
&lt;li&gt;Doesn’t update its weights per user interaction&lt;/li&gt;
&lt;li&gt;Doesn’t "remember" you tomorrow&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What it does have is:&lt;/p&gt;

&lt;p&gt;✔ A context window&lt;br&gt;
✔ Token prediction capability&lt;br&gt;
✔ Statistical pattern recognition&lt;/p&gt;

&lt;p&gt;Everything else?&lt;br&gt;
 That’s system design.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Illusion of Memory in LLM Systems&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;When your chatbot remembers user preferences or your AI assistant recalls company policies…&lt;/p&gt;

&lt;p&gt;That’s not the model.&lt;/p&gt;

&lt;p&gt;That’s architecture.&lt;/p&gt;

&lt;p&gt;Modern LLM systems simulate memory using external components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vector databases&lt;/li&gt;
&lt;li&gt;Session stores&lt;/li&gt;
&lt;li&gt;Retrieval layers&lt;/li&gt;
&lt;li&gt;Knowledge graphs&lt;/li&gt;
&lt;li&gt;Tool-use frameworks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where production engineering begins.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The 4 Types of Memory in Production LLM Systems&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Let’s simplify what’s happening under the hood.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;1.  Short-Term Memory (Context Window)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This is the simplest form.&lt;/p&gt;

&lt;p&gt;The model sees:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your current prompt&lt;/li&gt;
&lt;li&gt;Previous messages in the thread&lt;/li&gt;
&lt;li&gt;Any injected system instructions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Limitations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Token-bound&lt;/li&gt;
&lt;li&gt;Expensive at scale&lt;/li&gt;
&lt;li&gt;Resets when conversation ends&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not durable memory.&lt;/p&gt;
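&lt;p&gt;A minimal sketch of that token bound: trim the message history to a budget, newest first, so older turns simply fall out of the window. The 4-characters-per-token estimate is a rough assumption, not a real tokenizer.&lt;/p&gt;

```python
# Keep only the most recent messages that fit a token budget.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)          # rough heuristic, not a tokenizer

def fit_to_window(messages: list, budget: int) -> list:
    kept = []
    used = 0
    for msg in reversed(messages):         # newest first
        cost = estimate_tokens(msg)
        if used + cost > budget:
            break                          # everything older is forgotten
        kept.append(msg)
        used += cost
    return list(reversed(kept))
```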

&lt;h2&gt;
  
  
  &lt;strong&gt;2.  Retrieval Memory (RAG Pipeline)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Now we’re getting serious.&lt;/p&gt;

&lt;p&gt;In production, companies implement a &lt;strong&gt;&lt;a href="https://dextralabs.com/blog/rag-pipeline-explained-diagram-implementation/" rel="noopener noreferrer"&gt;RAG pipeline&lt;/a&gt;&lt;/strong&gt; (Retrieval-Augmented Generation).&lt;br&gt;
Here’s the flow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;User asks a question&lt;/li&gt;
&lt;li&gt;System embeds the query&lt;/li&gt;
&lt;li&gt;Vector DB retrieves relevant documents&lt;/li&gt;
&lt;li&gt;Retrieved content is injected into prompt&lt;/li&gt;
&lt;li&gt;Model generates response&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That’s how your AI “remembers” company knowledge.&lt;/p&gt;

&lt;p&gt;This architecture is foundational in modern enterprise AI architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why RAG Matters&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Without RAG:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hallucinations increase&lt;/li&gt;
&lt;li&gt;Answers become generic&lt;/li&gt;
&lt;li&gt;Compliance risk grows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With RAG:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Grounded responses&lt;/li&gt;
&lt;li&gt;Updated knowledge without retraining&lt;/li&gt;
&lt;li&gt;Traceability for enterprise workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Many organizations partner with experts in &lt;strong&gt;LLM engineering services&lt;/strong&gt; to design scalable RAG systems that handle millions of embeddings efficiently.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;3. Long-Term Memory (External Storage)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;For &lt;strong&gt;&lt;a href="https://dextralabs.com/blog/how-to-build-ai-agents/" rel="noopener noreferrer"&gt;AI agents&lt;/a&gt;&lt;/strong&gt;, memory goes further.&lt;br&gt;
Systems may store:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User preferences&lt;/li&gt;
&lt;li&gt;Task history&lt;/li&gt;
&lt;li&gt;Workflow state&lt;/li&gt;
&lt;li&gt;Prior tool results&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Stored in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Databases&lt;/li&gt;
&lt;li&gt;Vector stores&lt;/li&gt;
&lt;li&gt;Graph systems&lt;/li&gt;
&lt;li&gt;Object storage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then selectively retrieved and re-injected.&lt;br&gt;
This is essential for advanced &lt;strong&gt;AI agents&lt;/strong&gt; operating across sessions.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;4. Procedural Memory (Tools &amp;amp; Actions)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;When an AI books meetings, queries APIs, or writes to databases…&lt;br&gt;
It’s using tool execution frameworks.&lt;/p&gt;

&lt;p&gt;Memory here means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Knowing available tools&lt;/li&gt;
&lt;li&gt;Tracking tool outputs&lt;/li&gt;
&lt;li&gt;Deciding next steps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This transforms an LLM from a chatbot → autonomous workflow engine.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Real Production Architecture Example&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Let’s say you're building an AI-powered code reviewer.&lt;/p&gt;

&lt;p&gt;Here’s what a robust AI code review system might include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;GitHub webhook trigger&lt;/li&gt;
&lt;li&gt;Code chunking &amp;amp; embeddings&lt;/li&gt;
&lt;li&gt;Vector search against best-practices database&lt;/li&gt;
&lt;li&gt;Context injection into prompt&lt;/li&gt;
&lt;li&gt;Model evaluation&lt;/li&gt;
&lt;li&gt;Structured output formatting&lt;/li&gt;
&lt;li&gt;Feedback storage for future reviews&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Notice something?&lt;/p&gt;

&lt;p&gt;The “memory” lives outside the model.&lt;/p&gt;

&lt;p&gt;This is where strategic system design matters more than prompt engineering.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Hidden Complexity of LLM Engineering&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Production systems must solve:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Token optimization&lt;/li&gt;
&lt;li&gt;Embedding drift&lt;/li&gt;
&lt;li&gt;Context compression&lt;/li&gt;
&lt;li&gt;Retrieval ranking&lt;/li&gt;
&lt;li&gt;Latency constraints&lt;/li&gt;
&lt;li&gt;Multi-agent orchestration&lt;/li&gt;
&lt;li&gt;Observability&lt;/li&gt;
&lt;li&gt;Security &amp;amp; PII handling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why deploying AI at enterprise scale requires more than plugging into an API.&lt;/p&gt;

&lt;p&gt;Teams often consult specialized firms like &lt;strong&gt;&lt;a href="https://dextralabs.com/" rel="noopener noreferrer"&gt;Dextra Labs&lt;/a&gt; – AI Consulting &amp;amp; LLM Engineering Experts&lt;/strong&gt; to architect scalable RAG pipelines, AI agents, and enterprise-ready LLM systems that integrate securely with existing infrastructure.&lt;/p&gt;

&lt;p&gt;The real challenge isn’t calling the model.&lt;/p&gt;

&lt;p&gt;It’s designing the memory layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Memory Optimization Strategies in Enterprise AI Architecture&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Let’s explore advanced techniques used in production:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory Compression&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Summarizing past conversations to reduce token load.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hierarchical Retrieval&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Layered retrieval: vector search → re-ranking → summarization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hybrid Search&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Combining keyword + vector retrieval.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Knowledge Graph Integration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Structured relationship mapping for deeper reasoning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Feedback Loops&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Storing outputs to refine future prompts.&lt;/p&gt;

&lt;p&gt;These techniques define modern enterprise &lt;strong&gt;AI architecture&lt;/strong&gt;.&lt;/p&gt;
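&lt;p&gt;As a concrete sketch of memory compression: a rolling-summary buffer folds older turns into a short summary once the transcript grows. Here summarize() is a stand-in for an LLM summarization call; the names and the keep-two-recent-turns policy are illustrative assumptions.&lt;/p&gt;

```python
# Rolling-summary memory compression: summarize old turns, keep recent verbatim.
def summarize(turns: list) -> str:
    # stand-in for an LLM summarization call
    return "[summary of " + str(len(turns)) + " earlier turns]"

def compress_memory(turns: list, keep_recent: int = 2) -> list:
    if len(turns) > keep_recent:
        older, recent = turns[:-keep_recent], turns[-keep_recent:]
        return [summarize(older)] + recent
    return turns
```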

&lt;h2&gt;
  
  
  &lt;strong&gt;AI Agents vs Static RAG Systems&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Let’s clarify something important.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Static RAG&lt;/th&gt;
&lt;th&gt;AI Agents&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Single query-response&lt;/td&gt;
&lt;td&gt;Multi-step reasoning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No action capability&lt;/td&gt;
&lt;td&gt;Tool execution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stateless&lt;/td&gt;
&lt;td&gt;Stateful&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Retrieval only&lt;/td&gt;
&lt;td&gt;Planning + memory + execution&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Agents require:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Memory buffers&lt;/li&gt;
&lt;li&gt;Planning modules&lt;/li&gt;
&lt;li&gt;Tool registry&lt;/li&gt;
&lt;li&gt;Execution tracking&lt;/li&gt;
&lt;li&gt;Error recovery logic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These requirements make agents substantially more complex than static RAG systems.&lt;/p&gt;
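
&lt;p&gt;A toy sketch of those moving parts, with a hard-coded tool call standing in where an LLM planner would sit:&lt;/p&gt;

```python
# Minimal sketch of the components listed above: a memory buffer, a tool
# registry, execution tracking, and simple error recovery.

class Agent:
    def __init__(self):
        self.memory = []                      # memory buffer
        self.tools = {}                       # tool registry
        self.trace = []                       # execution tracking

    def register(self, name, fn):
        self.tools[name] = fn

    def run(self, tool_name, arg):
        self.memory.append(("request", tool_name, arg))
        try:
            result = self.tools[tool_name](arg)
            self.trace.append((tool_name, "ok"))
        except Exception as exc:              # error recovery logic
            self.trace.append((tool_name, "error"))
            result = f"fallback: {exc}"
        self.memory.append(("result", tool_name, result))
        return result

agent = Agent()
agent.register("upper", str.upper)
out = agent.run("upper", "ship it")
bad = agent.run("missing_tool", "x")
```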

&lt;h2&gt;
  
  
  &lt;strong&gt;Common Mistakes in LLM System Design&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Overstuffing prompts&lt;/li&gt;
&lt;li&gt;Ignoring embedding quality&lt;/li&gt;
&lt;li&gt;No observability&lt;/li&gt;
&lt;li&gt;No fallback systems&lt;/li&gt;
&lt;li&gt;Treating LLM as source of truth&lt;/li&gt;
&lt;li&gt;Skipping governance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Production-grade &lt;strong&gt;LLM engineering&lt;/strong&gt; is software architecture first, AI second.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Big Mental Model Shift&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Think of LLMs as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A reasoning engine&lt;/li&gt;
&lt;li&gt;With temporary working memory&lt;/li&gt;
&lt;li&gt;Powered by external memory modules&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model is the brainstem.&lt;br&gt;
 The system is the brain.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Where This Is Headed&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Future memory systems will include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Persistent personalized AI agents&lt;/li&gt;
&lt;li&gt;Federated memory layers&lt;/li&gt;
&lt;li&gt;Real-time streaming retrieval&lt;/li&gt;
&lt;li&gt;Multi-model orchestration&lt;/li&gt;
&lt;li&gt;Memory prioritization algorithms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Companies that master memory architecture will dominate AI adoption.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Final Takeaway&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If you're building:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI-powered products&lt;/li&gt;
&lt;li&gt;Enterprise copilots&lt;/li&gt;
&lt;li&gt;AI code review systems&lt;/li&gt;
&lt;li&gt;Multi-agent workflows&lt;/li&gt;
&lt;li&gt;Scalable RAG pipelines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The question isn’t:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;“Which LLM should we use?”&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It’s:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;“How are we designing memory?”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That’s where real differentiation happens.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>llm</category>
      <category>rag</category>
    </item>
    <item>
      <title>Production Lessons from Deploying LLMs in Regulated Environments</title>
      <dc:creator>Dextra Labs</dc:creator>
      <pubDate>Wed, 28 Jan 2026 13:38:58 +0000</pubDate>
      <link>https://dev.to/dextralabs/production-lessons-from-deploying-llms-in-regulated-environments-3kcn</link>
      <guid>https://dev.to/dextralabs/production-lessons-from-deploying-llms-in-regulated-environments-3kcn</guid>
      <description>&lt;p&gt;&lt;strong&gt;&lt;em&gt;Shipping an LLM demo is easy. Shipping a compliant, auditable, production-grade LLM in a regulated industry? That’s where the real engineering begins.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://dextralabs.com/blog/llm-deployment-and-solutions/" rel="noopener noreferrer"&gt;Large Language Models (LLMs)&lt;/a&gt;&lt;/strong&gt; are rapidly moving from experimentation to mission-critical systems in finance, healthcare, insurance, legal, energy, and government. But regulated environments raise the bar: compliance, explainability, auditability, security, and reliability are no longer “nice to have.”&lt;/p&gt;

&lt;p&gt;This article distills &lt;strong&gt;hard‑won production lessons from deploying LLMs&lt;/strong&gt; in regulated environments: what breaks, what scales, and what actually passes audits. It is written for engineers, architects, and tech leaders building real systems.&lt;/p&gt;

&lt;p&gt;Along the way, we’ll reference proven patterns from multi‑cloud deployments (AWS, Azure, GCP) and real‑world engineering practices adopted by teams working with Dextra Labs, an AI consulting firm specializing in production‑ready, compliant LLM systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Also Read: &lt;a href="https://dev.to/dextralabs/evaluating-llms-in-cicd-what-we-learned-the-hard-way-5gao"&gt;Evaluating LLMs in CI/CD: What We Learned the Hard Way&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why Regulated Environments Are Different&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In regulated domains, LLM systems are judged not only by accuracy but by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data residency &amp;amp; privacy guarantees&lt;/li&gt;
&lt;li&gt;Deterministic behavior and traceability&lt;/li&gt;
&lt;li&gt;Human oversight and accountability&lt;/li&gt;
&lt;li&gt;Repeatable audits and incident forensics&lt;/li&gt;
&lt;li&gt;Vendor risk and model governance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A prompt that works in a hackathon can fail spectacularly under &lt;strong&gt;SOC 2, HIPAA, GDPR, PCI‑DSS, or ISO 27001&lt;/strong&gt; scrutiny.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson 0: Treat LLMs as production infrastructure, not APIs you casually call.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Also Read: &lt;a href="https://dev.to/dextralabs/observability-for-ai-agents-metrics-that-actually-matter-2l6h"&gt;Observability for AI Agents: Metrics That Actually Matter&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Lesson 1: Architecture Must Be Audit‑First, Not Model‑First&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Many teams start with: &lt;em&gt;Which model should we use?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In regulated environments, the better question is:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;How will we explain, log, and reproduce every LLM decision?&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Production Pattern&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stateless inference services&lt;/li&gt;
&lt;li&gt;Immutable request/response logging&lt;/li&gt;
&lt;li&gt;Versioned prompts and models&lt;/li&gt;
&lt;li&gt;Correlation IDs across the pipeline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A common winning approach is a layered LLM architecture:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;UI / API Layer&lt;br&gt;
   ↓&lt;br&gt;
Policy &amp;amp; Validation Layer&lt;br&gt;
   ↓&lt;br&gt;
Prompt Orchestration Layer&lt;br&gt;
   ↓&lt;br&gt;
Model Runtime (Cloud / Private)&lt;br&gt;
   ↓&lt;br&gt;
Observability + Audit Store&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This pattern, frequently implemented by teams following &lt;strong&gt;LLM deployment best practices&lt;/strong&gt;, makes audits survivable instead of terrifying.&lt;/p&gt;
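
&lt;p&gt;A minimal sketch of the logging side of this pattern, assuming a stubbed model endpoint; the identifiers (&lt;code&gt;refund-policy@v3&lt;/code&gt;, &lt;code&gt;example-model-2026-01&lt;/code&gt;) are hypothetical:&lt;/p&gt;

```python
# Audit-first inference wrapper: every call records prompt version, model
# version, and a correlation ID into an append-only log.

import json
import uuid

AUDIT_LOG = []                                # append-only; WORM storage in production
PROMPT_VERSION = "refund-policy@v3"           # hypothetical versioned prompt ID
MODEL_VERSION = "example-model-2026-01"       # hypothetical versioned model ID

def fake_model(prompt):
    # Stand-in for the real model runtime.
    return "ACK: " + prompt

def audited_infer(prompt, correlation_id=None):
    correlation_id = correlation_id or str(uuid.uuid4())
    response = fake_model(prompt)
    record = {
        "correlation_id": correlation_id,     # ties this call to the whole pipeline
        "prompt_version": PROMPT_VERSION,
        "model_version": MODEL_VERSION,
        "prompt": prompt,
        "response": response,
    }
    AUDIT_LOG.append(json.dumps(record, sort_keys=True))  # immutable serialized record
    return response, correlation_id

resp, cid = audited_infer("Summarize the incident report")
```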

&lt;h2&gt;
  
  
  &lt;strong&gt;Lesson 2: Data Privacy Is a System Property (Not a Checkbox)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Regulated deployments fail most often at data boundaries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Goes Wrong in Production&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PII leaks into prompts&lt;/li&gt;
&lt;li&gt;Training data is reused implicitly by vendors&lt;/li&gt;
&lt;li&gt;Logs accidentally store sensitive text&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What Works&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prompt‑time PII redaction &amp;amp; tokenization&lt;/li&gt;
&lt;li&gt;Field‑level encryption before inference&lt;/li&gt;
&lt;li&gt;Strict separation between inference data and analytics data&lt;/li&gt;
&lt;/ul&gt;
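
&lt;p&gt;A minimal sketch of prompt-time redaction; the regexes are illustrative only, and a production system would use a dedicated PII detector:&lt;/p&gt;

```python
# Prompt-time PII redaction: strip obvious identifiers (emails, phone-like
# numbers) before text reaches the model or the logs.

import re

PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
    (re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"), "[PHONE]"),
]

def redact(text):
    for pattern, token in PII_PATTERNS:
        text = pattern.sub(token, text)
    return text

clean = redact("Contact jane.doe@example.com or 555-123-4567 about claim 42.")
```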

&lt;p&gt;When deploying on AWS, Azure, or GCP, successful teams align LLM pipelines with existing &lt;strong&gt;VPC, Private Link, and KMS&lt;/strong&gt; strategies, extending cloud security posture rather than bypassing it.&lt;/p&gt;

&lt;p&gt;This is where Dextra Labs often steps in: helping enterprises design LLM workflows that inherit compliance from their cloud infrastructure instead of reinventing security from scratch.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Lesson 3: Compliance Requires Explainability (Even If Models Aren’t Explainable)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;No regulator will accept:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;“The model said so.”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Practical Explainability Techniques&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Store retrieved documents in &lt;strong&gt;&lt;a href="https://dextralabs.com/blog/production-rag-in-2025-evaluation-cicd-observability/" rel="noopener noreferrer"&gt;RAG&lt;/a&gt;&lt;/strong&gt; systems&lt;/li&gt;
&lt;li&gt;Log prompt templates + variables&lt;/li&gt;
&lt;li&gt;Capture top‑k outputs and confidence heuristics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Explainability doesn’t mean opening the model weights; it means reconstructing &lt;strong&gt;why a response was generated&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Teams applying &lt;strong&gt;retrieval‑augmented generation in production&lt;/strong&gt; consistently outperform black‑box chatbots during audits.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Lesson 4: Evaluation Is Continuous, Not Pre‑Launch&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Traditional ML validation happens before deployment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;LLMs require always‑on evaluation.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Production‑Grade Evaluation Stack&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Golden datasets for regulated scenarios&lt;/li&gt;
&lt;li&gt;Policy‑based output validation&lt;/li&gt;
&lt;li&gt;Drift detection (semantic + statistical)&lt;/li&gt;
&lt;li&gt;Human‑in‑the‑loop escalation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A key insight from &lt;strong&gt;successful LLM applications&lt;/strong&gt; is that evaluation pipelines must ship alongside inference pipelines.&lt;/p&gt;

&lt;p&gt;If you can’t measure it in production, you can’t defend it in front of regulators.&lt;/p&gt;
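
&lt;p&gt;A toy version of a golden-dataset check that could run on every deploy; the &lt;code&gt;model&lt;/code&gt; function is a deterministic stub for the real system:&lt;/p&gt;

```python
# Golden-dataset evaluation: each regulated scenario pairs a question with a
# property the output must satisfy, and the suite reports a pass rate.

def model(question):
    # Stub standing in for the real inference pipeline.
    answers = {
        "Can we store card PANs in logs?": "No. PCI-DSS forbids storing full PANs in logs.",
        "What is the refund window?": "Refunds are accepted within 30 days.",
    }
    return answers.get(question, "I don't know.")

GOLDEN_SET = [
    ("Can we store card PANs in logs?", lambda a: a.startswith("No")),
    ("What is the refund window?", lambda a: "30 days" in a),
]

def run_eval(model_fn, golden):
    results = []
    for question, check in golden:
        answer = model_fn(question)
        results.append({"question": question, "passed": bool(check(answer))})
    pass_rate = sum(r["passed"] for r in results) / len(results)
    return results, pass_rate

results, pass_rate = run_eval(model, GOLDEN_SET)
```

&lt;p&gt;Gating deploys on the pass rate is one way to make the evaluation pipeline ship alongside the inference pipeline.&lt;/p&gt;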

&lt;h2&gt;
  
  
  &lt;strong&gt;Lesson 5: Prompt Engineering Needs Governance&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In regulated systems, prompts are not experiments; they are controlled artifacts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Treat Prompts Like Code&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Version control&lt;/li&gt;
&lt;li&gt;Peer review&lt;/li&gt;
&lt;li&gt;Rollback support&lt;/li&gt;
&lt;li&gt;Approval workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At scale, teams adopt prompt registries with metadata:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use case&lt;/li&gt;
&lt;li&gt;Risk classification&lt;/li&gt;
&lt;li&gt;Allowed data types&lt;/li&gt;
&lt;li&gt;Model compatibility&lt;/li&gt;
&lt;/ul&gt;
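
&lt;p&gt;One way such a registry entry might look in Python; the schema and field names are illustrative, not a standard:&lt;/p&gt;

```python
# Prompt registry entries carrying the metadata listed above. Records are
# frozen and keyed by (name, version) so a published prompt is never mutated.

from dataclasses import dataclass

@dataclass(frozen=True)
class PromptRecord:
    name: str
    version: int
    template: str
    use_case: str
    risk_class: str                 # e.g. "low", "medium", "high"
    allowed_data: tuple             # data types this prompt may receive
    compatible_models: tuple

REGISTRY = {}

def register(record):
    key = (record.name, record.version)
    if key in REGISTRY:
        raise ValueError("prompt versions are immutable; bump the version instead")
    REGISTRY[key] = record

register(PromptRecord(
    name="claims-summary",
    version=1,
    template="Summarize this claim: {claim_text}",
    use_case="internal claims triage",
    risk_class="medium",
    allowed_data=("claim_text",),
    compatible_models=("example-model-2026-01",),
))

record = REGISTRY[("claims-summary", 1)]
```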

&lt;p&gt;This governance‑first approach, aligned with &lt;strong&gt;enterprise LLM governance frameworks&lt;/strong&gt;, prevents silent regressions that could trigger compliance incidents.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Lesson 6: Multi‑Cloud &amp;amp; Vendor Flexibility Is a Risk Strategy&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Regulators increasingly ask:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;What happens if your model provider changes terms or fails?&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Smart Production Strategy&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Abstract model providers behind a runtime layer&lt;/li&gt;
&lt;li&gt;Support OpenAI, Azure OpenAI, Anthropic, and open‑source models&lt;/li&gt;
&lt;li&gt;Keep prompts portable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Insights from multi‑cloud LLM deployment on &lt;strong&gt;AWS, Azure, and GCP&lt;/strong&gt; show that model portability is not just a cost optimization; it’s a regulatory safety net.&lt;/p&gt;
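
&lt;p&gt;A minimal sketch of a provider-abstraction layer with fallback; both providers are stubs where real vendor clients would plug in:&lt;/p&gt;

```python
# Thin runtime layer: callers depend on `complete`, not on any vendor SDK,
# so providers can be swapped by configuration or on failure.

def provider_a(prompt):
    return "[A] " + prompt          # stub for one vendor's client

def provider_b(prompt):
    return "[B] " + prompt          # stub for another vendor's client

PROVIDERS = {"a": provider_a, "b": provider_b}

class ModelRuntime:
    def __init__(self, primary, fallback=None):
        self.primary = primary
        self.fallback = fallback

    def complete(self, prompt):
        try:
            return PROVIDERS[self.primary](prompt)
        except Exception:           # provider missing or call failed
            if self.fallback:
                return PROVIDERS[self.fallback](prompt)
            raise

runtime = ModelRuntime(primary="missing", fallback="b")
out = runtime.complete("hello")
```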

&lt;p&gt;&lt;strong&gt;&lt;a href="https://dextralabs.com/" rel="noopener noreferrer"&gt;Dextra Labs&lt;/a&gt;&lt;/strong&gt; frequently helps teams design vendor‑neutral LLM platforms so compliance doesn’t hinge on a single provider.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Lesson 7: Incident Response Must Include the LLM&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;When something goes wrong, auditors will ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Who approved the prompt?&lt;/li&gt;
&lt;li&gt;Which model version was used?&lt;/li&gt;
&lt;li&gt;What data influenced the response?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;LLM‑Aware Incident Playbooks&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kill switches for high‑risk use cases&lt;/li&gt;
&lt;li&gt;Rate limiting on sensitive workflows&lt;/li&gt;
&lt;li&gt;Real‑time monitoring of unsafe outputs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your incident response plan ignores LLMs, it’s incomplete.&lt;/p&gt;
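
&lt;p&gt;A toy sketch of a kill switch and a per-workflow rate limit; in production these flags would live in a config service, not in module globals:&lt;/p&gt;

```python
# Kill switch plus a crude per-workflow rate limit, the two controls above.

KILL_SWITCHES = {"payments-assistant": True}   # True means the use case is halted
RATE_LIMITS = {"claims-triage": 2}             # max calls per window (illustrative)
_call_counts = {}

def guarded_call(workflow, prompt):
    if KILL_SWITCHES.get(workflow):
        return "BLOCKED: use case disabled by kill switch"
    used = _call_counts.get(workflow, 0)
    limit = RATE_LIMITS.get(workflow)
    if limit is not None and used >= limit:
        return "BLOCKED: rate limit reached"
    _call_counts[workflow] = used + 1
    return "OK: " + prompt                     # would call the model here

blocked = guarded_call("payments-assistant", "refund card")
first = guarded_call("claims-triage", "triage claim 1")
second = guarded_call("claims-triage", "triage claim 2")
third = guarded_call("claims-triage", "triage claim 3")
```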

&lt;h2&gt;
  
  
  &lt;strong&gt;Lesson 8: Start Narrow, Then Earn Trust&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The most successful regulated deployments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start with low‑risk, high‑value use cases&lt;/li&gt;
&lt;li&gt;Prove compliance early&lt;/li&gt;
&lt;li&gt;Expand scope incrementally&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examples include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Internal knowledge assistants&lt;/li&gt;
&lt;li&gt;Policy summarization tools&lt;/li&gt;
&lt;li&gt;Developer copilots with read‑only access&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Trust is accumulated, not assumed.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Where AI Consulting Actually Helps&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Building compliant LLM systems is less about models and more about &lt;strong&gt;systems thinking&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;An experienced AI &lt;strong&gt;consulting partner&lt;/strong&gt; like Dextra Labs helps organizations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Translate regulations into technical controls&lt;/li&gt;
&lt;li&gt;Design audit‑ready LLM architectures&lt;/li&gt;
&lt;li&gt;Deploy securely across AWS, Azure, and GCP&lt;/li&gt;
&lt;li&gt;Operationalize evaluation, monitoring, and governance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal isn’t just to deploy an LLM; it’s to ship AI systems regulators won’t shut down.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Final Takeaways&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Regulated environments demand engineering discipline, not experimentation&lt;/li&gt;
&lt;li&gt;Observability, governance, and security matter more than model choice&lt;/li&gt;
&lt;li&gt;LLM success in production is 80% architecture, 20% AI&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you treat LLMs like infrastructure, compliance becomes manageable. If you treat them like magic, audits will be brutal.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>machinelearning</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
