DEV Community: Tebogo Tseka

AI Governance in Practice: FastAPI on EKS with Model Cards, Audit Logging, and Helm

Tebogo Tseka — Wed, 29 Apr 2026 14:14:03 +0000

AI governance is increasingly a business requirement, not an afterthought. Whether it's the EU AI Act, NIST AI RMF, or an internal risk committee, the question is the same: can you prove your model is behaving as intended, on every request, with a documented audit trail?

This post walks through an AI governance platform I built on AWS EKS: a FastAPI service that runs churn inference, records every prediction to an audit log, exposes a machine-readable model card, and packages everything into a Helm chart with horizontal pod autoscaling.

Source code: github.com/tsekatm/eks-ai-governance

Architecture

Client
  └── POST /predict → FastAPI (EKS pod)
                           │
                           ├── LogisticRegression inference
                           ├── Audit log entry (request_id, features, result)
                           └── Response: churn_probability, prediction, request_id

Kubernetes (EKS)
  ├── Helm chart  →  Deployment + Service + HPA (2–10 replicas)
  └── Terraform   →  VPC, EKS cluster, node groups, IAM OIDC

Governance Layer
  ├── GET /governance/model-card   → version, metrics, fairness, EU AI Act tier
  └── GET /governance/audit-log   → full prediction audit trail

The app is self-contained: no external database needed to demo it. Swap the in-memory audit log for DynamoDB and the model for a SageMaker endpoint and this is production-ready.

Step 1: The FastAPI Application

Four responsibilities, four endpoint groups.

Inference (`POST /predict`)

class PredictRequest(BaseModel):
    customer_id: str
    tenure_months: float
    monthly_charges: float
    total_charges: float
    num_complaints: int

@app.post("/predict", response_model=PredictResponse)
def predict(request: PredictRequest) -> PredictResponse:
    features = np.array([[
        request.tenure_months,
        request.monthly_charges,
        request.total_charges,
        request.num_complaints,
    ]])
    scaled = _scaler.transform(features)
    churn_prob = float(_model.predict_proba(scaled)[0][1])
    prediction = "churn" if churn_prob >= 0.5 else "retain"
    request_id = str(uuid.uuid4())

    audit_log.append({
        "request_id": request_id,
        "customer_id": request.customer_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "features": request.model_dump(exclude={"customer_id"}),
        "prediction": prediction,
        "churn_probability": churn_prob,
        "model_version": MODEL_VERSION,
    })

    return PredictResponse(
        customer_id=request.customer_id,
        churn_probability=round(churn_prob, 4),
        prediction=prediction,
        model_version=MODEL_VERSION,
        request_id=request_id,
        timestamp=datetime.now(timezone.utc).isoformat(),
    )

Every prediction writes to the audit log before returning. The request_id ties the response to the log entry — useful for compliance queries: "show me exactly what the model returned for customer X on date Y."

Model Card (`GET /governance/model-card`)

A machine-readable model card makes governance reviewable by automated tools, not just humans with PDFs.

MODEL_CARD = {
    "model_id": "telecom-churn-v1",
    "version": "1.0.0",
    "type": "LogisticRegression",
    "training_date": "2026-04-29",
    "features": ["tenure_months", "monthly_charges", "total_charges", "num_complaints"],
    "metrics": {
        "accuracy": 0.89,
        "roc_auc": 0.92,
        "precision": 0.85,
        "recall": 0.81,
    },
    "fairness": {
        "demographic_parity_difference": 0.03,
        "equalized_odds_difference": 0.02,
        "evaluation_date": "2026-04-29",
    },
    "governance_tier": "Medium Risk",
    "eu_ai_act_classification": "Limited Risk",
    "approved_by": "AI Governance Board",
    "next_review": "2026-10-29",
}

The governance_tier and eu_ai_act_classification fields are what a risk committee actually needs. A CI gate could fail a deployment if these fields are missing or if next_review is in the past.

In a telecoms context, POPIA (South Africa) and RICA compliance add additional constraints on what customer data can flow through inference pipelines, making PII-aware audit logging essential. The audit log here records features and predictions but not raw PII — in production, customer_id would be a pseudonymised identifier resolved only by authorised downstream systems.

The fairness metrics (demographic_parity_difference, equalized_odds_difference) are hardcoded for this demo. In production, they are computed during the evaluation step of the SageMaker Pipeline (see Part 2 of this series) against protected attributes and updated automatically per training run before the model card is published.

Audit Log (`GET /governance/audit-log`)

@app.get("/governance/audit-log")
def get_audit_log() -> dict:
    return {"total": len(audit_log), "entries": list(audit_log)}

Simple. In production this would be backed by DynamoDB with a device_id + timestamp key schema — the same pattern used in the IoT pipeline project.

Health probes

@app.get("/healthz")
def healthz():
    return {"status": "ok", "model_version": MODEL_VERSION}

@app.get("/ready")
def ready():
    return {"status": "ready", "model_id": MODEL_ID}

Two separate probes because Kubernetes needs them for different purposes: /healthz is the liveness check (restart if down), /ready is the readiness check (remove from service if not ready to serve traffic). The readiness probe path is configurable in Helm values.

Step 2: The Model

The model is a LogisticRegression trained on 2,000 synthetic telecom accounts at module load time:

_rng = np.random.default_rng(42)
_X_train = np.column_stack([
    _rng.uniform(1, 72, 2000),     # tenure_months
    _rng.uniform(20, 150, 2000),   # monthly_charges
    _rng.uniform(20, 10000, 2000), # total_charges
    _rng.integers(0, 8, 2000),     # num_complaints
])
# Rule: ≥3 complaints, or new customer + high charge → churn
_y_train = (
    (_X_train[:, 3] >= 3) |
    ((_X_train[:, 0] < 6) & (_X_train[:, 1] > 80))
).astype(int)

_scaler = StandardScaler()
_model = LogisticRegression(max_iter=200, random_state=42)
_model.fit(_scaler.fit_transform(_X_train), _y_train)

Replacing this with a SageMaker endpoint call is a one-function swap — the governance layer (audit logging, model card) stays identical.

Step 3: Test Suite — 23/23

All tests written before implementation (TDD).

tests/test_app.py::TestHealthEndpoints        4 passed
  - /healthz returns 200 with status "ok"
  - /ready returns 200 with status "ready"

tests/test_app.py::TestPredictEndpoint        9 passed
  - 200 on valid payload
  - customer_id, churn_probability (0–1), prediction, model_version, request_id present
  - 422 on missing field
  - high-risk profile (1 month tenure, 5 complaints) → churn
  - low-risk profile (60 months, 0 complaints) → retain

tests/test_app.py::TestAuditLog               5 passed
  - empty initially
  - records entry after prediction
  - entry contains customer_id and prediction
  - total count increments correctly

tests/test_app.py::TestModelCard              5 passed
  - 200 response
  - version, metrics, fairness, governance_tier present

The fixture autouse=True on clear_audit_log ensures each test class starts with a clean in-memory log — no test bleeds state into the next.

Step 4: Docker — Multi-Stage Build

FROM python:3.12-slim AS builder
WORKDIR /app
RUN apt-get update && apt-get install -y --no-install-recommends gcc libpq-dev
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

FROM python:3.12-slim AS production
RUN groupadd -r appuser && useradd -r -g appuser -d /app -s /sbin/nologin appuser
WORKDIR /app
COPY --from=builder /install /usr/local
COPY . .
RUN chown -R appuser:appuser /app
USER appuser
EXPOSE 8080
HEALTHCHECK CMD curl -f http://localhost:8080/healthz || exit 1
CMD ["python", "-m", "uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8080"]

Two stages: the builder installs build tools and compiles packages; production copies only the compiled artifacts. Result: 146 MB image, non-root user, health check baked in.

docker build --platform linux/amd64 -t ai-governance-platform:local .
# Successfully built d1a906f056b5 — 146MB

Step 5: Helm Chart with HPA

The chart packages deployment + service + HPA into a single parameterised artifact.

HPA

# helm/templates/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
spec:
  minReplicas: {{ .Values.autoscaling.minReplicas }}   # 2
  maxReplicas: {{ .Values.autoscaling.maxReplicas }}   # 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          averageUtilization: {{ .Values.autoscaling.targetCPUUtilizationPercentage }}   # 70
    - type: Resource
      resource:
        name: memory
        target:
          averageUtilization: {{ .Values.autoscaling.targetMemoryUtilizationPercentage }} # 80

The HPA scales between 2 and 10 pods on CPU (70%) and memory (80%) utilisation. Both thresholds are overridable per environment via values.yaml.

`_helpers.tpl`

Helm charts need a _helpers.tpl file to define shared template functions like fullname and labels. Without it, the templates can't render:

{{- define "ai-platform.fullname" -}}
{{- if .Values.fullnameOverride }}
{{- .Values.fullnameOverride | trunc 63 | trimSuffix "-" }}
{{- else }}
{{- printf "%s-%s" .Release.Name .Chart.Name | trunc 63 | trimSuffix "-" }}
{{- end }}
{{- end }}

helm lint ./helm/
# ==> Linting ./helm/
# 1 chart(s) linted, 0 chart(s) failed

Deploy:

helm install ai-platform ./helm -f helm/values.yaml
kubectl get pods    # 2–10 replicas depending on load
kubectl get hpa     # watch scaling events

Step 6: Infrastructure (Terraform)

The EKS cluster is provisioned with Terraform — nothing clicked in the console.

module "eks" {
  source          = "terraform-aws-modules/eks/aws"
  version         = "~> 20.0"
  cluster_name    = "ai-governance-cluster"
  cluster_version = "1.30"
  vpc_id          = module.vpc.vpc_id
  subnet_ids      = module.vpc.private_subnets

  eks_managed_node_groups = {
    general = {
      instance_types = ["m5.large"]
      min_size       = 2
      max_size       = 6
      desired_size   = 3
    }
  }

  enable_irsa = true
}

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"
  name    = "ai-governance-vpc"
  cidr    = "10.0.0.0/16"

  azs             = ["eu-west-1a", "eu-west-1b", "eu-west-1c"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]

  enable_nat_gateway = true
}

enable_irsa = true provisions the OIDC provider so the ai-platform-sa service account (created by Helm) can assume an IAM role for DynamoDB and CloudWatch access — no static AWS credentials in the pod.

What I'd Add Next

SageMaker endpoint — replace the in-memory model with boto3.client("sagemaker-runtime").invoke_endpoint(), zero governance layer changes needed
DynamoDB audit store — replace the in-memory list with a DynamoDB table (model_id + timestamp key), enables cross-pod audit queries
Bedrock Guardrails — content filtering and PII redaction on inference inputs before they hit the model; a pattern I've already implemented in production for prompt injection prevention
CI gate on model card — fail deployment if next_review is expired or governance_tier is missing
Prometheus metrics — prediction_count, churn_rate, p99_latency scraped by CloudWatch Container Insights

Tebogo Tseka — Cloud Solutions Architect & ML Engineer
GitHub: @tsekatm | Blog: tebogosacloud.blog

Building a Real-Time IoT Telemetry Pipeline with Kinesis, Lambda, and DynamoDB

Tebogo Tseka — Wed, 29 Apr 2026 13:49:32 +0000

Every telecoms network has thousands of physical devices — routers, base stations, environmental sensors in data centres — quietly reporting readings every few seconds. When a sensor in a remote data centre starts reporting 52°C, you want to know immediately, not when someone notices the hardware has throttled.

This post walks through an IoT telemetry pipeline I built on AWS: a device simulator that generates realistic sensor data, a Kinesis stream to buffer it, a Lambda consumer that writes every reading to DynamoDB, and an alert handler that fires SNS notifications and CloudWatch metrics the moment a reading breaches a threshold.

Source code: github.com/tsekatm/iot-kinesis-stream

Architecture

Device Simulator (Python)
  └── put_record → Kinesis Data Stream
                         │
                         ▼
              Lambda: Kinesis Consumer
                │  (decode → validate → write)
                │
                ├──► DynamoDB (iot-sensor-readings)
                │    PK=device_id, SK=timestamp
                │
                └──► Lambda: Alert Handler (async invoke)
                              │
                              ├──► SNS Topic → Email / SMS
                              └──► CloudWatch
                                   IoTPipeline/Alerts :: AnomalyCount

Three components. Each one does exactly one thing.

In a telecoms context, these sensors monitor cabinet temperatures at cell tower sites, humidity in outdoor street-cabinet enclosures, and power supply metrics at edge nodes. The same pipeline pattern applies equally to network KPIs — signal strength, packet loss, handover failure rates — anywhere you need a low-latency path from raw telemetry to an ops team notification.

Step 1: Device Simulator

The simulator generates realistic sensor readings and publishes them to Kinesis. Normal readings stay within plausible operating ranges; a configurable anomaly probability injects spikes for demo and testing purposes.

# src/device_simulator.py

TEMP_NORMAL   = (15.0, 35.0)
TEMP_SPIKE    = (46.0, 60.0)    # above the 45°C alert threshold
HUMIDITY_NORMAL = (30.0, 80.0)
ANOMALY_PROBABILITY = 0.05      # 5% of readings are anomalies

def generate_telemetry(device_id: str, anomaly: bool = False) -> dict:
    if anomaly or random.random() < ANOMALY_PROBABILITY:
        temperature = round(random.uniform(*TEMP_SPIKE), 2)
        humidity_range = random.choice([HUMIDITY_HIGH_SPIKE, HUMIDITY_LOW_SPIKE])
        humidity = round(random.uniform(*humidity_range), 2)
    else:
        temperature = round(random.uniform(*TEMP_NORMAL), 2)
        humidity = round(random.uniform(*HUMIDITY_NORMAL), 2)

    return {
        "device_id": device_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "temperature": temperature,
        "humidity": humidity,
        "pressure": round(random.uniform(950.0, 1050.0), 2),
        "firmware_version": "1.2.0",
    }

def publish_telemetry(client, device_id: str, payload: dict) -> None:
    client.put_record(
        StreamName=KINESIS_STREAM_NAME,
        Data=json.dumps(payload).encode("utf-8"),
        PartitionKey=device_id,    # routes same device to same shard
    )

Using device_id as the Kinesis partition key means all readings from the same device land in the same shard and are consumed in order. This matters if you later want to add rate-of-change anomaly detection — you need time-ordered records per device.

Step 2: Kinesis Consumer Lambda

The Decimal gotcha

DynamoDB does not accept Python float. If you try to write {"temperature": 23.5}, you'll get a TypeError: Float types are not supported. The fix is Decimal, and the safest way to convert is Decimal(str(value)) — not Decimal(value) which inherits floating-point imprecision:

from decimal import Decimal

item = {
    "device_id":   payload["device_id"],
    "timestamp":   payload["timestamp"],
    "temperature": Decimal(str(payload["temperature"])),
    "humidity":    Decimal(str(payload["humidity"])),
    "pressure":    Decimal(str(payload["pressure"])),
    "ingestion_timestamp": datetime.now(timezone.utc).isoformat(),
}
table.put_item(Item=item)

Partial batch responses

If one record fails, only that record should be retried — not the whole batch of 100. Kinesis supports this via partial batch responses:

def handler(event, context):
    batch_item_failures = []
    for record in event.get("Records", []):
        try:
            payload = decode_record(record)
            validate_payload(payload)
            write_to_dynamodb(payload)
            check_and_invoke_alerts(payload)
        except Exception as e:
            seq = record["kinesis"]["sequenceNumber"]
            batch_item_failures.append({"itemIdentifier": seq})

    return {"batchItemFailures": batch_item_failures}

Async alert invocation

When a reading breaches a threshold, the consumer invokes the alert handler Lambda asynchronously (InvocationType="Event"). Fire-and-forget — one slow SNS publish can't back up the Kinesis stream:

def check_and_invoke_alerts(payload: dict) -> None:
    temp, humidity = payload.get("temperature", 0), payload.get("humidity", 50)

    if temp > TEMP_THRESHOLD or humidity > HUMIDITY_HIGH or humidity < HUMIDITY_LOW:
        lambda_client.invoke(
            FunctionName=ALERT_FUNCTION_NAME,
            InvocationType="Event",
            Payload=json.dumps(payload).encode("utf-8"),
        )

Step 3: Alert Handler Lambda

def detect_anomalies(payload: dict) -> list:
    anomalies = []

    temp = payload.get("temperature")
    if temp is not None and temp > TEMP_THRESHOLD:
        anomalies.append({
            "field": "temperature", "value": float(temp),
            "threshold": TEMP_THRESHOLD, "direction": "above",
        })

    humidity = payload.get("humidity")
    if humidity is not None:
        if humidity > HUMIDITY_HIGH_THRESHOLD:
            anomalies.append({"field": "humidity", "value": float(humidity),
                               "threshold": HUMIDITY_HIGH_THRESHOLD, "direction": "above"})
        elif humidity < HUMIDITY_LOW_THRESHOLD:
            anomalies.append({"field": "humidity", "value": float(humidity),
                               "threshold": HUMIDITY_LOW_THRESHOLD, "direction": "below"})

    return anomalies

CloudWatch metric in a custom namespace so it doesn't get buried in Lambda system metrics:

def publish_cloudwatch_metric(payload: dict, anomalies: list) -> None:
    cloudwatch_client.put_metric_data(
        Namespace="IoTPipeline/Alerts",
        MetricData=[{
            "MetricName": "AnomalyCount",
            "Dimensions": [{"Name": "DeviceId", "Value": payload["device_id"]}],
            "Value": float(len(anomalies)),
            "Unit": "Count",
            "Timestamp": datetime.now(timezone.utc),
        }],
    )

Set a CloudWatch alarm on AnomalyCount > 5 per device per hour and page on-call automatically.

Test Suite: 30/30

All tests written before implementation (TDD).

tests/test_device_simulator.py     9 passed
tests/test_alert_handler.py        9 passed
tests/test_lambda_consumer.py     12 passed
─────────────────────────────────────────
Total                             30 passed

Key tests:

Temperature/humidity/pressure stay within realistic ranges across 100 random readings
Anomaly injection raises temperature above the normal maximum
put_record called with PartitionKey=device_id
Decimal conversion verified in DynamoDB write
High temp, high humidity, low humidity each detected with correct direction
CloudWatch namespace is IoTPipeline/Alerts, metric value equals anomaly count
Partial batch failure itemIdentifiers include only the failed sequence number

Infrastructure (Terraform)

resource "aws_kinesis_stream" "iot_stream" {
  name             = "iot-telemetry-stream"
  shard_count      = 1
  retention_period = 24
}

resource "aws_lambda_event_source_mapping" "kinesis_trigger" {
  event_source_arn               = aws_kinesis_stream.iot_stream.arn
  function_name                  = aws_lambda_function.consumer.arn
  starting_position              = "LATEST"
  batch_size                     = 100
  bisect_batch_on_function_error = true
  function_response_types        = ["ReportBatchItemFailures"]
}

resource "aws_dynamodb_table" "sensor_readings" {
  name         = "iot-sensor-readings"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "device_id"
  range_key    = "timestamp"
}

bisect_batch_on_function_error paired with ReportBatchItemFailures gives fine-grained retry — Kinesis bisects failing batches until it isolates the exact bad record.

What I'd Add Next

Rate-of-change detection — alert when temperature increases >5°C in 60 seconds, not just on absolute threshold
Dead letter queue — SQS DLQ for alert handler failures so no alert is silently dropped
Kinesis Data Analytics — SQL tumbling window aggregations for fleet-wide statistics
Device shadow — track last-known state per device for context-aware alerting

Tebogo Tseka — Cloud Solutions Architect & ML Engineer
GitHub: @tsekatm | Blog: tebogosacloud.blog

Building a Production MLOps Pipeline on AWS SageMaker for Telecom Churn

Tebogo Tseka — Wed, 29 Apr 2026 12:46:14 +0000

In my previous post, we trained a churn prediction model and deployed it to a SageMaker endpoint. That's where most tutorials stop. But in production, deploying a model is only the beginning.

Models degrade. Customer behaviour shifts. Contract mix changes when competitors launch new offers. A model trained on January's data may be quietly wrong by June — and without active monitoring, nobody will know until the churn rate climbs and someone asks awkward questions in a board meeting.

This post covers the system I built to make that model self-maintaining: an end-to-end MLOps pipeline that retrains automatically, gates on model quality, and raises alerts the moment it detects something drifting.

What We're Building

EventBridge (daily schedule)
        │
        ▼
Lambda: trigger_retraining
        │
        ▼
SageMaker Pipeline
  ├── Phase 1: PreprocessData   → encode, scale, split
  ├── Phase 2: TrainModel       → SKLearn estimator
  ├── Phase 3: EvaluateModel    → writes evaluation.json
  └── Phase 4: CheckAccuracy    → ROC-AUC ≥ 0.80?
           ├── YES → RegisterModel (PendingManualApproval)
           └── NO  → skip registration

Model Monitor (separate Lambda)
  ├── KS test on numeric features
  ├── Chi-squared on categorical features
  └── Alerts → SNS → CloudWatch

The whole pipeline is infrastructure-as-code (Terraform), parameterised, and testable locally with 24 unit tests.

Phase 1: The Pipeline Definition

SageMaker Pipelines SDK lets you define a DAG of steps in Python. The pipeline object is then serialised to JSON and pushed to SageMaker, which manages execution, retry, and lineage tracking.

# pipeline/pipeline.py
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.parameters import ParameterFloat, ParameterString
from sagemaker.workflow.steps import ProcessingStep, TrainingStep
from sagemaker.workflow.condition_step import ConditionStep
from sagemaker.workflow.conditions import ConditionGreaterThanOrEqualTo

def create_pipeline() -> Pipeline:
    # Overridable at execution time — no hardcoded values
    accuracy_threshold = ParameterFloat(name="AccuracyThreshold", default_value=0.80)
    input_data_uri = ParameterString(name="InputDataUri", default_value=f"s3://{default_bucket}/data/raw/")

    processing_step = ProcessingStep(
        name="PreprocessData",
        processor=sklearn_processor,
        inputs=[ProcessingInput(source=input_data_uri, destination="/opt/ml/processing/input")],
        outputs=[
            ProcessingOutput(output_name="train", source="/opt/ml/processing/output/train"),
            ProcessingOutput(output_name="test",  source="/opt/ml/processing/output/test"),
        ],
        code="pipeline/preprocess.py",
    )

    # ... training_step, evaluation_step setup omitted for brevity — see pipeline.py in the repo

    condition_step = ConditionStep(
        name="CheckAccuracy",
        conditions=[ConditionGreaterThanOrEqualTo(
            left=JsonGet(step_name="EvaluateModel", property_file=evaluation_report,
                         json_path="metrics.accuracy.value"),  # accuracy chosen: simpler threshold, business-readable
            right=accuracy_threshold,
        )],
        if_steps=[register_step],
        else_steps=[],
    )

    return Pipeline(
        name="telecom-churn-pipeline",
        parameters=[input_data_uri, accuracy_threshold],
        steps=[processing_step, training_step, evaluation_step, condition_step],
    )

Every meaningful value — instance types, thresholds, data paths — is a ParameterString or ParameterFloat. The Lambda trigger can override any of them per execution without touching the pipeline definition.

Why accuracy and not ROC-AUC for gating? ROC-AUC is the better model quality metric (as argued in Part 1), but SageMaker ConditionalStep thresholds need to be intuitive for non-ML stakeholders who approve deployments. "Accuracy must exceed 80%" is easier to explain in a business review than an AUC threshold. You can always swap metrics.accuracy.value → metrics.roc_auc.value and adjust the threshold.

Phase 2: Preprocessing with Synthetic Fallback

# pipeline/preprocess.py
def load_data(input_dir: str) -> pd.DataFrame:
    csv_files = [f for f in os.listdir(input_dir) if f.endswith(".csv")] \
                if os.path.isdir(input_dir) else []
    if csv_files:
        return pd.read_csv(os.path.join(input_dir, csv_files[0]))
    print("No input CSV found — generating synthetic telecom data")
    return generate_synthetic_data(n=8000)

def split_and_save(X: np.ndarray, y: np.ndarray, test_size: float = 0.2) -> None:
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=42, stratify=y
    )
    # SageMaker convention: target first, no header
    train_df = pd.DataFrame(np.column_stack([y_train, X_train]))
    train_df.to_csv(os.path.join(TRAIN_OUTPUT_DIR, "train.csv"), index=False, header=False)

Two details worth noting:

Stratified split — the churn class is imbalanced (~37%). Without stratification you risk the test set having a different class ratio than training.
Scaler fit on train only — fit on X_train, transform X_test. Fitting on all data leaks test set statistics into training.

Phase 3: Evaluation Report

The evaluation step writes a JSON file that the ConditionalStep reads:

# pipeline/evaluate.py
def compute_metrics(model, model_type, X_test, y_test) -> dict:
    if model_type == "sklearn":
        y_pred = model.predict(X_test)
        y_prob = model.predict_proba(X_test)[:, 1]
    else:  # keras
        y_prob = model.predict(X_test).flatten()
        y_pred = (y_prob >= 0.5).astype(int)

    return {
        "accuracy":  {"value": round(accuracy_score(y_test, y_pred), 4)},
        "roc_auc":   {"value": round(roc_auc_score(y_test, y_prob), 4)},
    }

def write_evaluation_report(metrics: dict, output_dir: str) -> None:
    os.makedirs(output_dir, exist_ok=True)
    with open(os.path.join(output_dir, "evaluation.json"), "w") as f:
        json.dump({"metrics": metrics}, f, indent=2)

The PropertyFile in the pipeline definition tells SageMaker where to find this JSON and which key to extract (metrics.accuracy.value). The ConditionalStep compares that against AccuracyThreshold.

Phase 4: Automated Retraining via EventBridge

# src/trigger_retraining.py
PARAM_MAP = {
    "input_data_uri":         "InputDataUri",
    "training_instance_type": "TrainingInstanceType",
    "accuracy_threshold":     "AccuracyThreshold",
}

def handler(event: dict, context) -> dict:
    params = [
        {"Name": PARAM_MAP[k], "Value": str(v)}
        for k, v in event.get("pipeline_parameters", {}).items()
        if k in PARAM_MAP
    ]
    response = sagemaker_client.start_pipeline_execution(
        PipelineName=PIPELINE_NAME,
        PipelineParameters=params,
    )
    return {"statusCode": 200, "body": json.dumps({"execution_arn": response["PipelineExecutionArn"]})}

EventBridge fires this daily at 02:00 UTC. It can also be invoked on demand when new CSV files land in S3. The pipeline_parameters field lets you run A/B threshold experiments without changing code.

Phase 5: Drift Monitoring

Data Drift — Did the input distribution change?

# src/monitor.py
from scipy import stats

def check_data_drift(baseline, current, threshold=0.05):
    # KS test for numeric features
    for col in NUMERIC_COLS:
        stat, p_value = stats.ks_2samp(baseline[col].dropna(), current[col].dropna())
        drifted = bool(p_value < threshold)  # cast to Python bool — numpy.bool_ isn't JSON-serialisable

    # Chi-squared for categoricals
    for col in CATEGORICAL_COLS:
        all_cats = set(baseline[col].unique()) | set(current[col].unique())
        b_counts = baseline[col].value_counts().reindex(all_cats, fill_value=0)
        c_counts = current[col].value_counts().reindex(all_cats, fill_value=0)
        stat, p_value = stats.chisquare(f_obs=c_counts.values + 1, f_exp=b_counts.values + 1)
        drifted = bool(p_value < threshold)

KS test for numerics (non-parametric, catches any distributional shift), chi-squared for categoricals (frequency count comparison with Laplace smoothing).

One subtle bug: scipy.stats.ks_2samp returns numpy.bool_, not Python bool. The json.dumps in run_monitoring raises TypeError: Object of type bool is not JSON serializable. Fix: bool(p_value < threshold) — one word.

Model Drift — Is the model's behaviour changing?

def check_model_drift(predictions, actuals=None, baseline_accuracy=0.80):
    pred_churn_rate = float(predictions.mean())

    # Flag if predicted churn rate deviates >15pp from expected ~37%
    if abs(pred_churn_rate - 0.37) > 0.15:
        results["model_drift_detected"] = True

    # Delayed ground truth (labels arrive weeks later in telecom)
    if actuals is not None:
        degradation = baseline_accuracy - accuracy_score(actuals, predictions)
        if degradation > 0.05:  # 5pp drop triggers alert
            results["model_drift_detected"] = True

In telecom, churn labels arrive weeks after the prediction (when the contract period ends). The actuals parameter handles this delayed feedback loop.

Test Suite: 24/24

tests/test_preprocess.py          10 passed
tests/test_monitor.py              8 passed
tests/test_trigger_retraining.py   6 passed
──────────────────────────────────────────
Total                             24 passed

All tests written before implementation (TDD). Key test: test_no_drift_similar_data uses threshold=0.001 — even bootstrap-resampled data should not flag false drift under normal variance.

Infrastructure (Terraform)

All AWS resources provisioned via Terraform — nothing clicked in the console:

resource "aws_lambda_function" "trigger_retraining" {
  function_name = "telecom-churn-trigger-retraining"
  runtime       = "python3.11"
  role          = aws_iam_role.lambda_execution.arn

  environment {
    variables = {
      PIPELINE_NAME = var.pipeline_name
      AWS_REGION    = var.aws_region
    }
  }
}

resource "aws_cloudwatch_event_rule" "daily_retraining" {
  schedule_expression = "cron(0 2 * * ? *)"  # 02:00 UTC daily
}

Lambda IAM role: sagemaker:StartPipelineExecution scoped to the specific pipeline ARN only.

What's Next

A/B model deployment — route 10% of traffic to the new model, compare live accuracy before full cutover
SageMaker Feature Store — consistent feature engineering between training and inference
Approval webhook — Slack notification when a model lands in PendingManualApproval
CDR integration — feed Call Detail Records for real-time churn scoring at the network edge

Source Code

Full project: github.com/tsekatm/mlops-sagemaker-pipeline

Tebogo Tseka — Cloud Solutions Architect & ML Engineer
GitHub: @tsekatm | Blog: tebogosacloud.blog

Predicting Telecom Customer Churn with scikit-learn, Keras, and Amazon SageMaker

Tebogo Tseka — Wed, 29 Apr 2026 12:10:03 +0000

Predicting Telecom Customer Churn with scikit-learn, Keras, and Amazon SageMaker

Every month, a telecom operator quietly loses thousands of customers to a competitor. They call it churn — and in an industry where acquiring a new customer costs 5–10x more than retaining an existing one, predicting who is about to leave is one of the most valuable problems machine learning can solve.

In this tutorial, I'll walk you through a complete churn prediction pipeline I built for a telecom use case. We'll generate a realistic synthetic dataset, train three models (Decision Tree, Random Forest, and a Keras neural network), compare their performance, and deploy the best one to an Amazon SageMaker real-time endpoint.

By the end, you'll have a production-ready pipeline you can adapt for any telecoms operator.

Full source code: github.com/tsekatm/ml-churn-predictor

Why Telecom Churn Is a Hard ML Problem

Telecom churn has a few properties that make it interesting:

Class imbalance: Typically 20–40% of customers churn. The model must not simply predict "no churn" for everyone and claim 80% accuracy.
Behavioural signals are subtle: A customer moving from a two-year contract to month-to-month is a strong signal — but it manifests quietly in billing data.
High-value interventions: If you identify a high-risk customer 30 days early, a targeted retention offer (discounted upgrade, free month) can prevent the loss of 24+ months of revenue.

This makes recall — catching as many true churners as possible — more important than raw accuracy. We'll reflect that in our model design.

The Dataset

No real customer data? No problem. I generated a synthetic dataset of 10,000 telecom customers with realistic churn patterns calibrated to industry benchmarks.

python data/generate_data.py
# → data/raw/churn.csv (10,000 rows, 37% churn rate)

Feature	Type	Churn Signal Strength
tenure_months	Numeric	Strong — long-tenured customers rarely leave
contract_type	Categorical	Month-to-month: ~42% churn vs 3% for two-year
monthly_charges	Numeric	Higher bills correlate with higher churn
internet_service	Categorical	Fibre optic: ~41% churn
payment_method	Categorical	Electronic check: highest churn payment method

Pipeline Architecture

generate_data.py → preprocess() → Decision Tree
                              → Random Forest
                              → Keras Neural Network
                              → evaluate() → save_model()
                              → deploy.py → SageMaker Endpoint

Step 1: Data Preprocessing

CATEGORICAL_COLS = [
    "contract_type", "internet_service", "phone_service",
    "multiple_lines", "online_security", "tech_support",
    "payment_method", "paperless_billing",
]
NUMERIC_COLS = ["tenure_months", "monthly_charges", "total_charges", "senior_citizen"]

def preprocess(df):
    df = df.drop(columns=["customer_id"], errors="ignore").copy()
    df = df.dropna(subset=["churn"])

    encoders = {}
    for col in CATEGORICAL_COLS:
        le = LabelEncoder()
        df[col] = le.fit_transform(df[col].astype(str))
        encoders[col] = le

    X = df[NUMERIC_COLS + CATEGORICAL_COLS].values
    y = df["churn"].values.astype(int)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)  # fit only on train — never leak test stats

    return X_train, X_test, y_train, y_test, scaler, encoders

Two decisions worth noting:

Stratified split — preserves churn ratio in train and test sets.
Fit scaler on train only — fitting on the full dataset leaks test distribution into training.

Step 2: Training Three Models

Decision Tree — The Baseline

model = DecisionTreeClassifier(
    max_depth=8,
    min_samples_leaf=10,
    class_weight="balanced",  # compensates for ~37% minority class
    random_state=42,
)
model.fit(X_train, y_train)

Random Forest — The Workhorse

model = RandomForestClassifier(
    n_estimators=200,
    max_depth=12,
    min_samples_leaf=5,
    class_weight="balanced",
    random_state=42,
    n_jobs=-1,
)

Keras Neural Network — The Contender

model = keras.Sequential([
    layers.Input(shape=(input_dim,)),
    layers.Dense(128, activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.3),
    layers.Dense(64, activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.2),
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])

model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3),
    loss="binary_crossentropy",
    metrics=["accuracy", keras.metrics.AUC(name="auc")],
)

Batch normalisation stabilises training on mixed-scale features, Dropout prevents overfitting, and EarlyStopping on val_auc restores the best weights.

Step 3: Model Evaluation & Comparison

MODEL COMPARISON SUMMARY
Model                Accuracy   Precision  Recall     ROC-AUC
------------------------------------------------------------
decision_tree        0.7430     0.6189     0.7900     0.8218
random_forest        0.7660     0.6592     0.7575     0.8453
keras_nn             0.7565     0.6298     0.8252     0.8454
------------------------------------------------------------
Best model by ROC-AUC: keras_nn (0.8454)

Key observations:

Random Forest and Keras NN are neck-and-neck on ROC-AUC — both excellent.
Keras NN wins on recall (0.8252 vs 0.7575) — catches more churners, which matters most for retention campaigns.
Decision Tree at ROC-AUC 0.822 is weaker but fully interpretable — valuable for business presentations.

For deployment, I chose Random Forest over Keras NN despite the marginal ROC-AUC difference (0.8453 vs 0.8454). Random Forest offers simpler packaging (joblib serialisation vs TensorFlow SavedModel), faster cold-start inference on SageMaker, and better interpretability for business stakeholders — a practical trade-off in production.

Step 4: Deploying to SageMaker

# Package, upload, and deploy
archive = package_model("models/random_forest.pkl")
model_s3_uri = upload_to_s3(archive, S3_BUCKET, "models/random_forest/model.tar.gz")
create_model(model_name, role_arn, image_uri, model_s3_uri)
create_endpoint_config(config_name, model_name, instance_type="ml.m5.large")
deploy_endpoint(ENDPOINT_NAME, config_name)

# Test inference
result = invoke_endpoint(ENDPOINT_NAME, "12,0,65.5,786.0,1,1,1,0,0,0,1,0")
print(f"Churn probability: {result}")

Dry run (package + upload only):

python src/deploy.py --model-path models/random_forest.pkl --dry-run

Key Takeaways

1. Recall beats accuracy for churn — align your metric to the business objective, not the leaderboard.

2. Class weighting is non-negotiable — without it, your model silently optimises for the majority class.

3. The Decision Tree earns its place — explainability is not optional in regulated industries.

4. SageMaker packaging is straightforward — joblib model + model.tar.gz + role ARN is all you need.

What's Next

Add SHAP values for individual prediction explanations
Build a SageMaker Pipeline for automated retraining on new monthly data
Wire up a retention campaign API triggered at probability > 0.7
Add Model Monitor to detect data drift
Integrate with CDR (Call Detail Records) for real-time churn scoring at the network edge
Connect predictions to CRM retention workflows for automated intervention triggers

References

Tebogo Tseka — Cloud Solutions Architect & ML Engineer
GitHub: @tsekatm | Blog: tebogosacloud.blog*

How I Run Over 20 AI Agents Locally and Deploy One to Production at a Time

Tebogo Tseka — Mon, 13 Apr 2026 15:12:56 +0000

This article was originally published on tebogosacloud.blog.

I have over 20 AI agents. Only one is in production.

That is not a constraint. It is a strategy.

A system with one excellent production agent and a library of production-ready agents waiting locally is more mature than a system with ten mediocre agents all simultaneously causing incidents. I believe this. I have built for it. This article explains how.

While most teams are racing to deploy fleets of AI agents and discovering — usually painfully — that managing agents in production is far heavier than anyone told them, I have been doing the opposite. Build locally. Validate thoroughly. Design every agent to be production-ready from day one. Promote to AWS Bedrock AgentCore only when a use case has earned it.

The Problem With How Teams Ship Agents Today

There is a habit in AI development borrowed from early web development: ship fast, stabilise later. It worked reasonably well for stateless APIs. It fails for agents.

The operational overhead problem is the one no one talks about honestly. Each agent you lift to production is a runtime you now own. It needs monitoring, evaluation, cost governance, versioning, and a deployment pipeline. Each AgentCore runtime incurs ongoing costs — model invocation fees, Lambda execution time per tool call, API Gateway requests, DynamoDB reads for conversation state. One runtime versus twenty is a meaningful difference in your AWS bill before you have written a single line of business logic. That is not a fleet — that is a maintenance burden.

The failure modes compound it. A poorly configured agent does not throw a 500 error. It returns a plausible-sounding answer that is wrong. It invokes the right tool with the wrong parameters. It loses context mid-conversation and starts hallucinating a state that no longer exists. None of this shows up in your standard CloudWatch dashboard. You find out from a user.

I decided early that I was not going to pay that cost for use cases that had not been proven.

My Architecture: Local-First Agentic Development

My local environment is built around Claude Code and a system of over 20 agent personas, each defined as a structured Markdown file with a clear identity, a set of skills, and integration points into my SDLC.

They cover the full SDLC — from architecture and security to testing, defect management, data engineering, and content. A few examples to make it concrete:

The Defect Manager accepts a reported bug, writes a reproduction test, implements the fix, deploys to DEV, and closes the loop in ClickUp — without a human touching the keyboard between report and verification.

The SDET Engineer designs test cases using boundary value analysis, equivalence partitioning, and pairwise techniques, then executes them against an API proxy — never the live service directly.

The Cloud Security Specialist runs STRIDE-based threat models and generates Terraform-ready IAM policies scoped to least privilege for the specific service under review.

Each one is a specialist. None of them overlap. None are deployed unless a production use case demands it.

Each agent persona is structured around:

Identity: What this agent is responsible for, what it knows, how it behaves
Skills: Reusable knowledge modules the agent can apply
Tools: Callable actions the agent can invoke (APIs, MCP tools, Lambda functions)
SDLC stage: Where in the development lifecycle this agent operates

Local Development (Claude Code)
│
├── Agentic Architect (Orchestrator)
│   ├── HLD Architect          ──► Skills / Tools
│   ├── Cloud Security         ──► Skills / Tools
│   ├── SDET Engineer          ──► Skills / Tools
│   ├── Defect Manager         ──► Skills / Tools
│   ├── GenAI Engineer         ──► Skills / Tools
│   └── ... more agents        ──► Skills / Tools
│
│   Any agent, when lift criteria met:
│   ──► MCP Facade + OOP ABCs + Terraform + S3 Sync
│       ──► Production (AWS Bedrock AgentCore)
│           ├── AgentCore Runtime
│           ├── Lambda (Skills as Tools)
│           ├── API Gateway
│           └── DynamoDB (State)

The local environment gives me something production cannot: speed without consequence. I can iterate on a prompt, reshape a skill, change a tool's behaviour, and retest — all without a deployment pipeline, without CloudWatch logs, without touching live infrastructure. The feedback loop is minutes, not hours.

The Liftability Pattern

The most important design decision I made early: every agent I build locally must be liftable to production without rework.

Liftability is not a deployment script. It is a design discipline applied from day one. An agent is liftable when:

1. The MCP Facade is in place
Skills and tools are exposed as MCP (Model Context Protocol — an open standard for tool interoperability across LLM runtimes) endpoints. This interface works identically whether the agent is running locally in Claude Code or as a runtime in Bedrock AgentCore. The agent does not know or care where it is running. That is by design.

2. The implementation follows OOP ABCs
Each skill is implemented as a Python class inheriting from a base abstract class. This enforces a consistent interface, makes skills independently testable, and means they slot into AgentCore's tool registration without modification.

3. Infrastructure is Terraform-first
Every agent that will eventually be lifted has its Terraform written alongside the code — Lambda function definitions, IAM roles scoped to least privilege, API Gateway routes, DynamoDB tables for state. When lift day comes, terraform apply is the deployment.

4. Artifacts live in S3
Agent definitions, skill configurations, and prompt templates are stored in S3 — not hardcoded. In production, AgentCore reads from the same S3 paths. Promotion is a bucket sync, not a rewrite.

5. The agent has been tested end-to-end locally
This is the gate. Before an agent is considered liftable, it has unit tests for each skill, integration tests through an API proxy (not direct service calls), and a set of golden test cases that validate its end-to-end behaviour on representative inputs.

The lift checklist is not a formality. It is the reason the promotion is low risk when it happens.

Skills and Tools: The Real Unit of Capability

Here is the insight that took me longest to articulate clearly: in a production agentic system, the agent is not the unit of capability. The skill is.

An agent is an orchestrator. It decides which skill to apply, in what order, with what inputs. The intelligence of the system lives in how skills are designed, how they compose, and how reliably they execute — not in the agent's decision loop itself.

Skills are tested in isolation. Each skill has its own test suite. I can run pytest skills/defect_lifecycle_management/tests/ -v without spinning up an agent. The skill either works or it does not. This is the only way to know before production.

Skills are reusable across agents. My Cloud Security agent and my Peer Review agent both use the same IAM_Least_Privilege skill. Written once, tested once, composed freely.

Skills define the production surface area. When I lift an agent to Bedrock AgentCore, what I am deploying is a set of Lambda functions — one per skill — registered as tools. The AgentCore runtime is thin. The Lambda functions are where the real work happens.

New capability means a new skill, not a new agent. When I need the production agent to do something new, I write a skill, test it locally, and register it as a new tool in AgentCore. The operational surface stays flat even as capability grows.

The Production Decision: When Does an Agent Get Lifted?

Not every agent earns a production deployment. This is intentional.

The criteria I apply before lifting:

Is the use case proven? Has the agent demonstrated it can handle real inputs, not just the happy path?
Is the business need clear? Is there a user, system, or workflow that requires this agent callable as a REST API?
Are the skills stable? Have the underlying skills been through enough local iteration that core behaviour is settled?
Is the infrastructure written? Is the Terraform ready? IAM policies scoped? Monitoring configured?

When all four are true, the lift is a formality. The deployment is terraform apply plus a GitHub Actions workflow already parameterised for the target environment.

My current production agent — the Site Builder agent on Bedrock AgentCore — went through exactly this process. It ran locally for weeks. Skills were tested in isolation and end-to-end. Terraform was written alongside the code. When I lifted it, there were no surprises.

What This Gives You That Shipping Fast Does Not

A growing library of production-ready agents. At any point, agents at various stages of local maturity are queued for production — tested, Terraform-ready, waiting for the business case to pull them through.

Low-risk promotions. When I lift an agent, I already know it works. Tested locally on real inputs. Tested skills in isolation. Run end-to-end through an API proxy before touching AgentCore. The promotion is a confirmation, not an experiment.

Cost control where it matters. One AgentCore runtime with a growing skill set means one set of operational overhead — one monitoring configuration, one deployment pipeline, one cost centre to govern.

Faster local iteration. Because I am not trying to do everything in production, the local environment is unconstrained. A new agent persona can be tried in an afternoon. Skills can be composed in ways not tried before.

The Counterintuitive Takeaway

The industry benchmark for agentic maturity right now is fleet size — how many agents deployed, how many tools registered, how many concurrent sessions the platform can handle.

I think this is the wrong metric.

The right metrics are: how reliably does each production agent perform on the use cases it owns, and how quickly can a locally-proven agent be promoted when the business needs it.

Local-first agentic development is not a workaround for teams that cannot afford AgentCore at scale. It is a discipline. Build thoroughly. Test locally. Design for liftability from day one. Promote when the use case earns it.

The agents are ready. Production should always be the easy part.

Key Takeaways

Managing agents in production is operationally heavier than managing skills and tools — be deliberate about what you lift
Skills are the real unit of capability — design, test, and deploy at the skill level
Liftability is a design property, not a deployment script: MCP facade, OOP ABCs, Terraform-first, S3 artifacts, end-to-end tests
Local-first development absorbs iteration cost so production does not have to
The lift criteria (proven use case, stable skills, written infrastructure, clear business need) make every promotion low risk
Fleet size is a vanity metric — reliability per agent and time-to-lift are what matter

References

The Missing Test Suite: Why AI Projects Fail Before Production

Tebogo Tseka — Thu, 02 Apr 2026 14:39:57 +0000

Most AI projects never ship. The gap isn't the model — it's the lack of testability.

The Uncomfortable Truth

Gartner predicted that through 2022, 85% of AI projects would deliver erroneous outcomes due to bias in data, algorithms, or the teams managing them [1]. VentureBeat reported that 87% of data science projects never make it into production [2]. McKinsey's 2023 State of AI report confirmed that while generative AI adoption is accelerating, most organisations still struggle to move beyond experimentation [3].

Teams build impressive demos, stakeholders nod approvingly, and then the project quietly stalls somewhere between "it works on my laptop" and "it's running in production."

The usual suspects get blamed: data quality, model performance, organisational readiness. But there is a more fundamental problem hiding in plain sight — most teams have no idea how to test AI systems with the same rigour they apply to traditional software. Google's seminal paper on hidden technical debt in machine learning systems identified testing gaps as a primary source of production failures, noting that ML systems have a special capacity for incurring technical debt because they have all the maintenance problems of traditional code plus an additional set of ML-specific issues [4].

They test the code. They don't test the intelligence.

Two Systems, Two Test Suites

A production AI system is not one system. It is two systems woven together: deterministic software (APIs, data pipelines, orchestration logic) and non-deterministic AI behaviour (prompt responses, agent decisions, model outputs).

Most engineering teams are excellent at testing the first. They write unit tests, integration tests, and end-to-end tests. They practice TDD. They run CI pipelines that block merges on test failures. This is mature, well-understood discipline.

But the AI layer — the prompts, the agent behaviour, the model responses — gets treated as a black box. Teams eyeball a few outputs, declare it "good enough," and move on. There is no test suite. There is no regression safety net. There is no way to know if a prompt change that improved one scenario just broke twelve others.

Google's ML Test Score rubric [5] proposes a structured assessment of ML production readiness across data tests, model tests, infrastructure tests, and monitoring — yet most teams score poorly on all four dimensions. Microsoft Research's study of software engineering for machine learning found that even within large technology companies, testing practices for ML systems remain significantly less mature than those for traditional software [6].

This is the missing test suite. And it is the single biggest reason AI projects fail to reach production.

Prompt Test Cases as First-Class Citizens

If you would not ship a function without a unit test, you should not ship a prompt without a prompt test case.

A prompt test case is structurally similar to a traditional test: given an input, assert something about the output. The difference is that the assertion must account for non-determinism. You are not checking for exact string equality. You are evaluating whether the output meets defined criteria — relevance, completeness, format compliance, safety, and factual accuracy.

Ribeiro et al.'s CheckList framework [7] — which won Best Paper at ACL 2020 — demonstrated that traditional software testing methodologies can be directly applied to NLP models. CheckList introduces three test types that map cleanly to prompt testing: Minimum Functionality Tests (happy path), Invariance Tests (the model should produce equivalent outputs for equivalent inputs), and Directional Expectation Tests (changing the input in a specific way should change the output in a predictable direction).

Happy Path

Happy path prompt tests verify that the AI produces the expected output when given a well-formed, unambiguous input. These are your baseline. If these fail, nothing else matters.

Examples of happy path assertions:

Given a clear instruction, the agent produces a response that addresses all specified requirements
Given structured input data, the agent formats its output according to the defined schema
Given a multi-step task, the agent completes each step in the correct sequence

Happy path tests seem obvious, but most teams skip them. They assume that because the prompt "worked when they tried it," it will always work. It will not. Model updates, context changes, and subtle input variations all introduce drift.

Negative Scenarios

Negative prompt tests verify that the AI fails gracefully when given problematic input. This is where most unshipped AI projects have their fatal flaw — they only ever tested the golden path.

Perez et al. demonstrated that language models can be used to systematically red-team other language models, generating adversarial inputs that expose failure modes at scale [8]. The same principle applies to prompt testing — you can and should systematically probe for failures.

Test for:

Contradictory instructions: "Summarise this document in detail but keep it under 10 words." Does the agent flag the contradiction, or does it silently produce garbage?
Out-of-scope requests: When asked to perform a task outside its defined capabilities, does the agent refuse clearly, or does it hallucinate an answer?
Adversarial input: Prompt injection attempts, instructions disguised as data, requests to ignore system prompts. Does the agent hold its boundaries?
Missing context: When critical information is absent from the input, does the agent ask for clarification, or does it fabricate what it doesn't know?

Negative scenarios reveal the failure modes that will surface in production, because real users do not read your documentation and do not provide clean inputs.

Edge Cases

Edge case prompt tests probe the boundaries of agent behaviour. These are the scenarios that don't fit neatly into "it works" or "it's broken" — they live in the grey zone where AI systems are most unpredictable.

Test for:

Context window boundaries: What happens when the input is near the maximum token limit? Does output quality degrade? Does critical information from early in the context get lost?
Multi-turn drift: Over a long conversation, does the agent maintain consistency with its earlier responses, or does it contradict itself?
Ambiguous inputs: When a request has multiple valid interpretations, does the agent pick one and commit, or does it hedge uselessly?
Format edge cases: Empty strings, single-character inputs, inputs in unexpected languages, inputs with special characters or code snippets embedded in natural language
Hallucination triggers: Inputs that are factually adjacent to the agent's knowledge but require information it does not have. Does it admit uncertainty, or does it confabulate?

Edge case tests are expensive to design but cheap compared to a production incident where your AI agent confidently gives a user dangerously wrong information. The NIST AI Risk Management Framework explicitly identifies "the propensity for generative AI to produce confidently stated but incorrect outputs" as a key risk requiring systematic mitigation [9].

Designing Prompt Test Permutations

Systematic test design is not a new discipline. Software testing has mature techniques — codified in ISO/IEC 29119 [10] — for generating meaningful test cases without combinatorial explosion. Part 11 of this standard, published in 2020, specifically extends these techniques to AI-based systems [11]. The same approaches apply to prompt testing — they just need to be adapted for non-deterministic outputs.

Equivalence Partitioning for Prompts

Divide your input space into classes that you expect the AI to handle similarly. Instead of testing every possible phrasing of a request, identify the equivalence classes:

Short, direct instructions vs. long, detailed instructions
Technical language vs. conversational language
Single-task requests vs. compound multi-task requests
Inputs with complete context vs. inputs with partial context

Test one representative from each class. If the AI handles one member of the class correctly, it is likely to handle the others. Ribeiro et al. validated this approach empirically, showing that equivalence-class-based testing surfaces model failures far more efficiently than random sampling [7].

Boundary Value Analysis for Prompts

Identify the thresholds where agent behaviour changes:

The input length at which output quality begins to degrade
The number of instructions in a single prompt before the agent starts dropping tasks
The level of ambiguity at which the agent switches from executing to asking for clarification
The complexity threshold beyond which the agent starts making errors

Test inputs at, just below, and just above each boundary.

Decision Table Testing

For agents with conditional behaviour — different responses based on user role, input type, or context state — build a decision table. Map every combination of conditions to the expected action. Then write a test case for each row.

This is particularly critical for agents that make routing decisions, apply business rules, or enforce access controls. A missed condition in a decision table is a production bug waiting to happen.

The Prompt Regression Problem

Here is the scenario that kills AI projects in the transition from prototype to production:

A developer changes a prompt to fix a reported issue. The fix works. The specific scenario that was broken now produces the correct output. The developer commits the change, satisfied.

What the developer does not know is that the prompt change also altered the agent's behaviour on fourteen other scenarios — three of which are now producing incorrect outputs. Nobody finds out until users report problems. By then, confidence in the system is damaged and the project loses momentum.

This is the prompt regression problem, and it is solved the same way code regression is solved: with an automated test suite that runs on every change.

Building a Prompt Regression Harness

A prompt regression harness consists of:

A corpus of test cases: Input-output pairs covering happy paths, negative scenarios, and edge cases. Start with 20-30 and grow it continuously.
Evaluation criteria: For each test case, define what "correct" means. This might be a rubric (scores 1-5 on relevance, accuracy, completeness), a set of required elements (must mention X, must not mention Y), or a format check (valid JSON, under 200 words).
Automated evaluation: Use a combination of deterministic checks (format validation, keyword presence) and LLM-as-judge evaluation (a second model scoring the output against the rubric). Zheng et al.'s research on MT-Bench demonstrated that LLM-as-judge approaches can achieve high agreement with human evaluators when properly calibrated [12], though Shankar et al. caution that validator alignment with human preferences must itself be verified [13]. Neither approach alone is sufficient. Together, they provide reasonable coverage.
CI integration: Run the harness on every prompt change, just as you run unit tests on every code change. Block merges that cause regression.

The harness does not need to be perfect. It needs to be better than nothing — which is what most teams have today. Frameworks such as Stanford's HELM [14] and open-source tools like OpenAI Evals [15] and DeepEval [16] provide starting points for building evaluation infrastructure.

Strategy: From POC to Production

Testing is the foundation, but shipping an AI system to production requires a broader strategy. Google's MLOps maturity model [17] describes three levels of automation — from manual ML pipelines (Level 0) to fully automated CI/CD/CT pipelines (Level 2). Most AI projects are stuck at Level 0. These are the practices that move you forward.

1. Define Testability From Day One

Before writing a single prompt, define how you will test the AI's behaviour. If you cannot articulate what "correct" looks like for a given input, you are not ready to build. Testability is a design constraint, not an afterthought. The NIST AI RMF [9] frames this as "measuring" — one of four core functions alongside governing, mapping, and managing AI risk.

2. Version Your Prompts Like Code

Prompts are code. Store them in version control. Tag releases. Write changelogs. If you cannot diff two versions of a prompt and understand what changed and why, you have lost control of your system. White et al.'s prompt pattern catalogue [18] demonstrates that prompts can be documented and structured with the same rigour as software design patterns.

3. Build Evaluation Into the Pipeline

Do not evaluate AI output manually and sporadically. Build evaluation into your CI/CD pipeline. Every pull request that touches a prompt should trigger the test harness. Results should be visible in the PR review, just like test results. Kreuzberger et al.'s systematic review of MLOps architectures [19] confirms that continuous evaluation is a defining characteristic of production-grade ML systems.

4. Instrument for Observability

In production, you need to see what the AI is doing. Log inputs, outputs, latency, token usage, and evaluation scores. Build dashboards. Set alerts on quality degradation. You cannot improve what you cannot measure, and you cannot debug what you cannot observe. Klaise et al. detail practical approaches to monitoring ML models in production, including detecting data drift and concept drift before they degrade output quality [20].

5. Implement Human-in-the-Loop Gates

Not every AI decision should be autonomous from day one. Identify high-stakes decisions and route them through human review. As confidence grows and the test suite matures, progressively expand the automation boundary. This is not a concession — it is a deployment strategy. Mosqueira-Rey et al.'s comprehensive survey of human-in-the-loop machine learning [21] demonstrates that the most successful production AI systems are designed with human oversight as an integral component, not bolted on as an afterthought.

6. Plan for Model Changes

Models get updated. APIs change. Behaviour shifts. Your test suite is your safety net during model migrations. Teams that have one can upgrade models in an afternoon with confidence. Teams that don't spend weeks manually validating and still miss regressions. The EU AI Act [22] now mandates ongoing testing and monitoring for high-risk AI systems — model migration without regression testing is not just risky engineering, it is increasingly a compliance liability.

7. Treat Prompt Engineering as Software Engineering

The teams that ship AI to production are the teams that apply software engineering discipline to prompt development. They review prompts in pull requests. They write tests. They track regressions. They refactor. They don't treat prompts as magic incantations — they treat them as code that happens to be written in natural language. Reynolds and McDonell's early work on prompt programming [23] laid the conceptual foundation for this approach, framing prompt design as a form of programming rather than an art.

Closing

The AI industry has a completion problem, not a capability problem. The models are powerful enough. The tooling is mature enough. What is missing is the engineering discipline to make AI systems production-grade.

If you would not ship code without tests, do not ship prompts without them. If you would not deploy a function without observability, do not deploy an agent without it. If you would not merge a code change without regression checks, do not merge a prompt change without them.

The test suite your AI project is missing is the one that tests the AI itself. Build it, and you build the bridge from demo to production.

Testing AI is not a new discipline — it is the old discipline of software testing, applied to a new kind of system. The teams that recognise this will ship. The rest will keep building impressive demos that never leave the lab.

References

[1] Gartner, "Gartner Predicts: AI and the Future of Work," Gartner Research, 2019. Available: https://www.gartner.com/en/newsroom/press-releases

[2] VentureBeat, "Why do 87% of data science projects never make it into production?" VentureBeat, July 2019. Available: https://venturebeat.com/ai/why-do-87-of-data-science-projects-never-make-it-into-production/

[3] McKinsey & Company, "The state of AI in 2023: Generative AI's breakout year," McKinsey Global Institute, 2023. Available: https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai-in-2023-generative-ais-breakout-year

[4] D. Sculley, G. Holt, D. Golovin, E. Davydov, T. Phillips, et al., "Hidden Technical Debt in Machine Learning Systems," in Advances in Neural Information Processing Systems (NeurIPS), 2015. Available: https://papers.nips.cc/paper/2015/hash/86df7dcfd896fcaf2674f757a2463eba-Abstract.html

[5] E. Breck, S. Cai, E. Nielsen, M. Salib, and D. Sculley, "The ML Test Score: A Rubric for ML Production Readiness and Technical Debt," in IEEE International Conference on Big Data, 2017. Available: https://research.google/pubs/pub46555/

[6] S. Amershi, A. Begel, C. Bird, R. DeLine, H. Gall, E. Kamar, N. Nagappan, B. Nushi, and T. Zimmermann, "Software Engineering for Machine Learning: A Case Study," in Proceedings of the 41st International Conference on Software Engineering (ICSE), 2019. Available: https://www.microsoft.com/en-us/research/publication/software-engineering-for-machine-learning-a-case-study/

[7] M. T. Ribeiro, T. Wu, C. Guestrin, and S. Singh, "Beyond Accuracy: Behavioral Testing of NLP Models with CheckList," in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), 2020. (Best Paper Award). Available: https://arxiv.org/abs/2005.04118

[8] E. Perez, S. Ringer, K. Lukosiute, K. Nguyen, E. Chen, et al., "Red Teaming Language Models with Language Models," in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022. Available: https://arxiv.org/abs/2202.03286

[9] National Institute of Standards and Technology, "Artificial Intelligence Risk Management Framework (AI RMF 1.0)," NIST AI 100-1, January 2023. Available: https://www.nist.gov/itl/ai-risk-management-framework

[10] International Organization for Standardization, "ISO/IEC 29119: Software and systems engineering — Software testing," ISO/IEC, 2013-2022.

[11] International Organization for Standardization, "ISO/IEC TR 29119-11:2020: Software and systems engineering — Software testing — Part 11: Guidelines on the testing of AI-based systems," ISO/IEC, 2020. Available: https://www.iso.org/standard/79016.html

[12] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, et al., "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena," in Advances in Neural Information Processing Systems (NeurIPS), 2023. Available: https://arxiv.org/abs/2306.05685

[13] S. Shankar, J. D. Zamfirescu-Pereira, B. Hartmann, A. Parameswaran, and I. Arawjo, "Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences," 2024. Available: https://arxiv.org/abs/2404.12272

[14] P. Liang, R. Bommasani, T. Lee, et al., "Holistic Evaluation of Language Models (HELM)," Transactions on Machine Learning Research, 2022. Available: https://arxiv.org/abs/2211.09110

[15] OpenAI, "Evals: A framework for evaluating LLMs and LLM systems," GitHub, 2023. Available: https://github.com/openai/evals

[16] Confident AI, "DeepEval: The open-source LLM evaluation framework," GitHub, 2023. Available: https://github.com/confident-ai/deepeval

[17] Google Cloud, "MLOps: Continuous delivery and automation pipelines in machine learning," Google Cloud Architecture Center, 2023. Available: https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning

[18] J. White, Q. Fu, S. Hays, M. Sandborn, C. Olea, H. Gilbert, A. Elnashar, J. Spencer-Smith, and D. C. Schmidt, "A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT," arXiv:2302.11382, 2023. Available: https://arxiv.org/abs/2302.11382

[19] D. Kreuzberger, N. Kuhl, and S. Hirschl, "Machine Learning Operations (MLOps): Overview, Definition, and Architecture," IEEE Access, vol. 11, 2023. Available: https://ieeexplore.ieee.org/document/10081336

[20] J. Klaise, A. Van Looveren, G. Vacanti, and A. Coca, "Monitoring Machine Learning Models in Production," arXiv:2007.06299, 2021. Available: https://arxiv.org/abs/2007.06299

[21] E. Mosqueira-Rey, E. Hernandez-Pereira, D. Alonso-Rios, J. Bobes-Bascaran, and A. Fernandez-Leal, "Human-in-the-loop machine learning: a state of the art," Artificial Intelligence Review, Springer, 2023. Available: https://link.springer.com/article/10.1007/s10462-022-10246-w

[22] European Parliament, "Regulation (EU) 2024/1689 — Artificial Intelligence Act," Official Journal of the European Union, 2024. Available: https://eur-lex.europa.eu/eli/reg/2024/1689

[23] L. Reynolds and K. McDonell, "Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm," in CHI 2021 Extended Abstracts, 2021. Available: https://arxiv.org/abs/2102.07350

Building an LLM Judge That Doesn't Lie to You

Tebogo Tseka — Tue, 31 Mar 2026 14:03:21 +0000

Our first LLM judge gave a 9/10 to a page where the hero text was completely invisible.

Dark grey text on a dark background image. The CSS was syntactically valid. The HTML was well-structured. Every tag was correct. The page was unusable. And our judge — Claude Opus, one of the most capable models available — scored it nearly perfect.

That was the moment I realised LLM-as-judge doesn't work out of the box. It requires engineering. This article explains what we built to make it trustworthy.

The Inflation Problem

The first implementation was simple: send the generated code to Claude Opus, ask it to rate 0–10. The results looked great. Average scores of 8–9/10 across the board. We nearly shipped those numbers.

Then we opened the generated sites in a browser.

Pages with broken images — where the model had written <img src="a serene mountain landscape with morning fog"> instead of a URL — scored 8/10. Pages with empty sections — where entire content blocks were missing — scored 7/10. Pages where navigation rendered as a bulleted list because list-style: none was missing from the CSS — scored 8.5/10.

The judge was systematically generous. Not because it was broken, but because of how LLMs process code.

Why Judges Inflate

Positivity bias from RLHF training. Language models are trained to be helpful, which creates a default toward positive assessment. When asked to evaluate code, the model focuses on what's present rather than what's wrong. Fifteen correct CSS properties and one devastating contrast failure? The judge sees the fifteen.

Code-level evaluation misses visual defects. Syntactically valid CSS can produce invisible text. color: #333 on background-image: url(dark-photo.jpg) is perfectly valid CSS and completely unreadable content. A judge that reads code without "seeing" the rendered result can't catch this category of defect.

Vague rubrics invite generous interpretation. "Rate the quality of this HTML/CSS from 0–10" gives the judge too much latitude. What does 7 mean? What separates a 6 from an 8? Without concrete criteria, the judge fills in the gaps with optimistic interpretation.

No calibration anchors. The judge has no reference for what a 5/10 page looks like versus a 9/10 page. Without anchors, scores cluster at the top of the range because the model has no incentive to be harsh.

Fix 1: Structural Guardrails

The first mitigation was the HTMLVisualChecker — an automated pre-judge validator that catches defects the LLM judge consistently misses.

It runs six checks:

Broken images. Scans every <img> tag's src attribute. If the src contains spaces and doesn't start with http, it's flagged — the model wrote a description instead of a URL. Also checks CSS background-image declarations for the same pattern.

# Catches: <img src="a modern office with glass facade">
if len(src) > 30 and " " in src and not src.startswith("http"):
    violations.append(Violation(
        id="VIS-BROKEN-IMAGE",
        severity=Severity.CRITICAL,
        deduction=-2.5,
        description=f"Image src contains text instead of URL: '{src[:60]}'",
    ))

Empty sections. Finds <section> and <div> elements with IDs or classes that contain no visible text content. An empty hero section means the page loads with a blank area where the headline should be.

Dark text on dark backgrounds. Extracts CSS variables from :root, identifies the text colour, checks whether background images are present, and flags when dark text is used without a light alternative. This is the check that caught the 9/10 invisible text page.

Broken navigation. Detects when a <nav> element contains <ul>/<li> markup but the CSS doesn't include list-style: none or flexbox layout — meaning the navigation renders as a bulleted list instead of a horizontal menu.

Missing interactivity. Checks for the presence of JavaScript, mobile menu toggles, smooth scrolling, and hover states. A page with interactive HTML elements but no JavaScript to make them work is functionally broken.

Local file paths. Flags src attributes pointing to filesystem paths (/Users/..., C:\..., relative paths without extensions) that won't work in a browser.

These checks don't replace the judge — they constrain it. If the HTMLVisualChecker finds a critical violation (broken images, empty sections, invisible text), that violation is recorded regardless of what the judge thinks. The judge can still evaluate the nuances of code quality and content accuracy, but it can't override a structural failure.

The analogy: unit tests don't replace code review, but they catch the obvious regressions before a human ever looks at the code.

Fix 2: Multimodal Judging

The second fix was sending the judge more than just code.

Code-only judging fails because CSS is a spatial language encoded as text. grid-template-columns: 1fr 2fr 1fr creates a three-column layout, but you can't verify it's correct without rendering it. rgba(0, 0, 0, 0.7) overlay on a hero image makes text readable, but the judge can't know the overlay is sufficient without seeing the result.

Our judge input bundle now includes:

Full HTML source code
Full CSS source code
The scoring rubric with violation catalogue
Gold standard HTML/CSS for comparison

The judge compares agent output against the gold standard at both the code level and the structural level. It can see whether the agent's CSS variables match the requirements AND whether the agent's HTML structure preserves all sections from the template.

In future rounds, we plan to add desktop and mobile screenshots to the bundle, making the judge truly multimodal — evaluating the rendered visual output alongside the source code.

Fix 3: The Violation Catalogue as Rubric

The third fix was the most impactful. Instead of asking the judge for a score, we ask it to identify specific violations from a fixed catalogue.

The catalogue defines 22 violation types, each with a unique ID, severity level, and fixed deduction amount:

- id: A11Y-DARK-TEXT-ON-DARK-BG
  description: Dark text on dark background (unreadable)
  severity: critical
  deduction: -3.0

- id: VIS-BROKEN-IMAGE
  description: Image shows alt text or broken placeholder
  severity: critical
  deduction: -2.5

- id: CONTENT-PARAPHRASED
  description: Content paraphrased instead of exact text
  severity: moderate
  deduction: -0.5

The judge prompt is explicit about what's expected:

Your job: identify every violation in the agent output by
comparing it against the gold standard and requirements.

Return ONLY a JSON object with violations from the catalogue.

Rules:
- Use EXACT deduction amounts from the violation catalogue
- Do NOT invent violation IDs — use only IDs from the catalogue
- Do NOT report violations that don't exist
- Focus ONLY on the specific action being evaluated

The judge returns structured JSON — not prose, not a score:

{
  "violations": [
    {
      "id": "VIS-BROKEN-IMAGE",
      "severity": "critical",
      "deduction": -2.5,
      "description": "Hero image src contains description, not URL",
      "evidence": "<img src=\"a serene landscape with mountains\">"
    },
    {
      "id": "CONTENT-PARAPHRASED",
      "severity": "moderate",
      "deduction": -0.5,
      "description": "About section text reworded from requirements",
      "evidence": "Requirements: 'Farm-fresh flavours' → Output: 'Fresh local ingredients'"
    }
  ],
  "summary": "Hero image broken, about text paraphrased",
  "strengths": ["Correct colour variables", "All sections present"],
  "critical_issues": ["Unusable hero — no visible image"]
}

This separation of concerns is the key design decision. The judge does classification — which violations are present? The scoring engine does arithmetic — sum the deductions, subtract from 10. The judge cannot inflate scores because it never assigns scores. It identifies problems. The math is deterministic.

How the Three Fixes Work Together

The evaluation pipeline runs in sequence:

1. HTMLVisualChecker    → catches structural/visual defects
2. Opus Judge           → identifies violations from catalogue
3. Scoring Engine       → 10 minus sum(all deductions)

The HTMLVisualChecker catches what the judge misses (broken images, contrast issues, empty sections). The judge catches what the checker can't evaluate (content accuracy, code quality nuances, whether the business name appears in all six required locations). The scoring engine applies fixed deductions from both sources.

Before these fixes, the same page with invisible text scored 9/10. After: the HTMLVisualChecker flags A11Y-DARK-TEXT-ON-DARK-BG (-3.0), the judge identifies VIS-BROKEN-IMAGE on the hero (-2.5) and CONTENT-PARAPHRASED on the about section (-0.5). Final score: 4.0/10.

That 4.0 is honest. The page has serious problems. The old 9.0 was a lie.

What We Learned About Judge Design

Constrain the output format

Free-text evaluation ("rate this code 0–10") produces inflated, inconsistent scores. Structured output with predefined violation types produces consistent, auditable results. The judge's job is classification, not scoring.

Separate detection from scoring

When the judge both finds problems and assigns scores, it conflates two tasks and does both poorly. When the judge only identifies violations and a deterministic engine applies fixed deductions, scores are reproducible and explainable.

Use structural checks as guardrails

LLM judges have blind spots. They read code as text and miss spatial defects. Automated structural checks catch the class of defects that LLMs consistently miss — and they run in milliseconds, not minutes.

Fixed-weight violations beat subjective assessment

Is a purple gradient better than a blue solid? The judge has opinions, but they're not universal. But a missing mobile menu toggle (-2.5) is objectively a defect. Fixed weights for objective violations eliminate the subjectivity that causes score inflation.

Known Limitations

We fixed inflation, but the judge isn't perfect. Here's what remains:

Single judge bias. Only Claude Opus evaluates. It may favour Claude-generated code — similar patterns, similar token distributions. We haven't tested with a second judge model. Round 2 will score a subset with an independent judge and compute Cohen's kappa for inter-rater agreement.

No inter-rater calibration. We don't know whether our scores are "right" in an absolute sense. We know they're consistent and that they correlate with visible defects. But a human QA review of a random sample would establish whether our 4/10 matches a human's assessment of quality.

Aesthetic subjectivity. The violation catalogue covers functional defects (broken images, missing content, contrast failures) but not aesthetic quality. Two pages can score identically — both have correct structure, content, and accessibility — while one looks significantly more professional. We don't measure that.

Measurement asymmetry from Round 1. Sonnet's gold standard scores (93.4%) were measured differently from alternative models' pipeline scores (59–68%). This doesn't affect the judge's per-action scoring, but it affects the aggregate comparison. Round 2 fixes this by running all models through the same pipeline.

Principles for LLM-as-Judge

If you're building an LLM judge for any evaluation task — not just code generation — three principles apply:

1. Structural guardrails before LLM evaluation. Catch the obvious defects with deterministic checks before the LLM judge runs. This prevents the judge from rationalising broken output.

2. Constrained violation catalogues over open-ended scoring. Define the defects you care about, assign fixed weights, and ask the judge to classify — not score. You get consistent, auditable, explainable results.

3. The judge is only as good as its rubric. Invest in the rubric. A 22-violation catalogue with severity tiers and fixed deductions took more design effort than the judge prompt itself. The catalogue IS the evaluation — the judge is just the executor.

LLM judges are powerful. They're also unreliable by default. The engineering isn't in the model — it's in the constraints you build around it.

This is part 3 of a 7-part series documenting how we built an evaluation framework for AI code generators, tested 5 models across 467 real code generation tasks, and turned the results into production improvements.

Previous: 5 Models, 467 Actions, 1 Winner
Next: The $0.07 vs $1.05 Question — Cost-Quality Tradeoffs

Originally published on tebogo.cloud

5 Models, 467 Actions, 1 Winner — What We Learned Comparing LLMs on Real Code Generation

Tebogo Tseka — Mon, 30 Mar 2026 19:02:48 +0000

We tested five AI models on the same task 467 times. Each run produced a complete deployable website — not a code snippet, not a function, not a patch. A real site with HTML, CSS, JavaScript, and assets.

The question: can cheaper models match Claude Sonnet for production code generation?

The short answer is no. The longer answer is more interesting.

The Models

Five models, spanning a 15x cost range:

Model	Provider	Input/1M Tokens	Output/1M Tokens	Why We Tested It
Claude Sonnet 4.6	OpenRouter	$3.00	$15.00	Assumed gold standard
Claude Haiku 4.5	OpenRouter/CLI	$1.00	$5.00	Same family, lower tier
Kimi K2.5	OpenRouter	$0.42	$2.20	Moonshot AI's latest
DeepSeek V3.2	OpenRouter	$0.26	$0.38	Budget option
DeepSeek R1	OpenRouter	$0.70	$2.50	Reasoning-focused

These five represent distinct price tiers and architectural approaches. Sonnet and Haiku share a lineage. Kimi is multimodal. DeepSeek V3.2 optimises for cost. R1 optimises for step-by-step reasoning.

The 16-Action Pipeline

Each model received the same template skeleton and business requirements, then applied 16 sequential actions:

#	Action	Category
1	apply-colours	Brand
2	swap-fonts	Brand
3	replace-header-logo	Brand
4	replace-footer-logo	Brand
5	replace-favicon	Brand
6	replace-hero-bg	Images
7	replace-section-bgs	Images
8	update-hero-text	Content
9	update-about-text	Content
10	update-contact	Content
11	apply-hero-layout	Layout
12	apply-sections-layout	Layout
13	add-seo-meta	Technical
14	add-structured-data	Technical
15	add-accessibility	Technical
16	verify-contrast	Quality

Same requirements spec, same gold standard, same judge for all models. Each action scored 0–10 using a violation-deduction model (see Part 1). Maximum possible: 160 points.

Actions are sequential — each builds on the previous output. Errors compound. This is deliberate: it mirrors how agents work in production.

The Results

Model	Avg Score	95% CI	% of Max	Std Dev	Runs
Claude Sonnet 4.6	149.5	N/A†	93.4%	0.0†	21
Kimi K2.5	108.2	[92.7, 123.7]	67.6%	20.1	9
Claude Haiku 4.5	107.7	[91.0, 124.4]	67.3%	13.4	5
DeepSeek V3.2	94.0	[78.0, 110.0]	58.8%	28.9	15
DeepSeek R1	41.9	N/A (n=2)	26.2%	3.3	2

Sonnet 4.6:    ████████████████████████████████████████████████████████ 149.5 (93%)
Kimi K2.5:     ████████████████████████████████████████                108.2 (68%)  ±15.5
Claude Haiku:  ████████████████████████████████████████                107.7 (67%)  ±16.7
DeepSeek V3.2: ██████████████████████████████████                       94.0 (59%)  ±16.0
DeepSeek R1:   ███████████████                                          41.9 (26%)  n=2
               |---------|---------|---------|---------|---------|
               0        30        60        90       120       150

The Honesty Moment

Before interpreting these rankings, three caveats:

Sonnet was measured differently. Its 149.5 score comes from gold standard evaluation (automated quality signals against 21 templates), not the same 16-action pipeline as the alternatives. The 41-point gap between Sonnet and the field may be partly methodological. We're fixing this in Round 2.

Rankings 2–4 are noise. Kimi's confidence interval is [93, 124]. Haiku's is [91, 124]. DeepSeek V3.2's is [78, 110]. These overlap heavily. With current sample sizes, we cannot say which of these three is genuinely better. What we CAN say: all three cluster around 59–68% of max, well below Sonnet's 93%.

Sample sizes are small. 2–15 runs per model. We need n≥16 for 80% statistical power to detect a 20-point difference. The rankings are directionally useful but not statistically conclusive for the middle tier.

Per-Template Performance

Template	Sonnet	Kimi	Haiku	DeepSeek V3.2	Best Alt % of Sonnet
AI Page Builder (SaaS)	149.5	134.8	124.2	99.5	90.2%
Association Corporate	149.5	126.0	120.2	105.5	84.3%
Safari Lodge	149.5	—	108.2	120.5	80.6%
SaaS Product	149.5	112.0	89.5	112.0	74.9%
Gala Event	149.5	98.8	96.0	86.8	66.1%

The AI Page Builder template is the closest contest — Kimi reaches 90.2% of Sonnet's quality. The Gala Event template is the widest gap at 66.1%. Template complexity matters: simpler structures with fewer sections are easier for all models.

Action Difficulty: What's Easy and What's Impossible

This is where the data gets interesting. Not all 16 actions are created equal:

Rank	Action	Avg Score	Category
1	add-accessibility	9.4/10	Technical
2	add-seo-meta	9.2/10	Technical
3	update-about-text	8.8/10	Content
4	replace-favicon	8.6/10	Content
...	...	...	...
14	apply-colours	5.2/10	Brand
15	apply-hero-layout	2.8/10	Layout
16	apply-sections-layout	-0.8/10	Layout

The pattern is clear when you group by category:

Category	Avg Score	Observation
Technical (SEO, a11y, schema)	8.7/10	Models follow structured specs reliably
Content (text updates)	7.7/10	Good when verbatim rules enforced
Brand (colours, fonts, logos)	6.8/10	Moderate — CSS variable application is fragile
Images (hero, section bgs)	6.2/10	All models hallucinate descriptions as src
Layout (hero, sections)	1.0/10	Consistently catastrophic

Structured, well-defined tasks score high. Spatial, visual tasks score low. Same models, wildly different results depending on task type.

The Gap Analysis: Where Alternatives Fall Behind

Comparing each action against Sonnet reveals where the quality gap actually lives:

Action	Sonnet	Kimi	Haiku	DS-V3	Avg Gap
add-accessibility	9.5	9.6	9.8	9.2	+0.0
replace-favicon	9.0	9.0	8.8	8.4	-0.3
add-seo-meta	10.0	9.4	9.6	9.0	-0.7
...
apply-colours	9.5	6.2	5.8	6.5	-3.3
apply-hero-layout	9.0	4.7	3.2	2.8	-5.4
apply-sections-layout	9.0	1.6	-3.8	-1.5	-10.2

Three actions account for most of the quality gap:

apply-sections-layout (-10.2 point gap) — alternatives actively break layouts. Haiku scores -3.8 on average, meaning it makes pages significantly worse.
apply-hero-layout (-5.4 point gap) — layout transformation is fundamentally hard for all models below Sonnet.
apply-colours (-3.3 point gap) — CSS variable propagation is inconsistent. Models update some variables but miss gradients, overlays, and header tints.

Three actions show essentially zero gap:

add-accessibility (+0.0) — every model follows accessibility specs equally well.
replace-favicon (-0.3) — simple file replacement.
add-seo-meta (-0.7) — structured metadata is a universal strength.

This has a practical implication: if you could route easy tasks to cheap models and hard tasks to Sonnet, you could potentially cut costs without cutting quality on the tasks that matter. More on this in Part 4.

The Action Heatmap

Here's every model scored on every action — the full picture:

                    Kimi  Haiku  DS-V3  DS-R1
add-accessibility   9.6   9.8    9.2    8.1
add-seo-meta        9.4   9.6    9.0    6.8
update-about-text   9.2   8.8    8.6    0.6
replace-favicon     9.0   8.8    8.4    6.0
replace-header-logo 8.2   9.2    7.4    4.8
add-structured-data 7.8   8.8    7.0    5.1
update-hero-text    7.6   7.7    7.2    1.6
update-contact      7.4   7.6    7.0   -1.2
swap-fonts          7.6   7.0    6.8    2.1
replace-hero-bg     7.3   6.2    6.5    2.8
verify-contrast     6.4   7.8    5.8    4.8
replace-section-bgs 7.6   2.4    5.5    3.0
replace-footer-logo 6.0   8.6    4.8    2.0
apply-colours       6.2   5.8    6.5    0.2
apply-hero-layout   4.7   3.2    2.8   -3.9
apply-sections-lyt  1.6  -3.8   -1.5   -2.5

Notice DeepSeek R1's column. It scores -1.2 on contact updates and -3.9 on hero layout. These aren't just bad scores — they mean the model made the page actively worse than the starting template on basic tasks.

The Reasoning Model Trap

DeepSeek R1 scored 26.2% — worse than any other model by a wide margin. On two runs, it averaged 41.9/160. For context, a score of 41.9 means the model successfully completed roughly 4 of 16 actions and actively damaged several others.

Why? R1 is a reasoning model. It's optimised for step-by-step logical deduction — mathematical proofs, multi-hop reasoning, chain-of-thought problem solving. Code generation is not reasoning. It's pattern completion with spatial awareness.

R1 spent tokens "thinking" about CSS instead of writing it. Its chain-of-thought preambles consumed context window without producing better output. On layout tasks, it reasoned its way into worse solutions than models that simply pattern-matched from training data.

The lesson: match the model architecture to the task type. Reasoning models are the wrong tool for code generation. This seems obvious in hindsight, but R1's pricing ($0.70/$2.50) sits between Haiku and Sonnet — it looks like a mid-tier option until you run the evaluation.

The Variance Problem

Average scores tell half the story. The other half is variance.

Model	Avg Score	Std Dev	Best Run	Worst Run	Range
Claude Haiku	107.7	13.4	~121	~94	27
Kimi K2.5	108.2	20.1	~128	~88	40
DeepSeek V3.2	94.0	28.9	120.5	25.8	95

Haiku is the most consistent model — you know what you're getting. Its standard deviation (13.4) is half of Kimi's and less than half of DeepSeek V3.2's.

DeepSeek V3.2's variance is remarkable. Its best run (120.5) approaches Haiku's average. Its worst run (25.8) is catastrophic — worse than R1's average. Same model, same template, same requirements, 95-point swing.

For production systems, unpredictable quality is worse than consistently mediocre quality. A restaurant that's amazing 50% of the time and terrible 50% isn't a good restaurant. Haiku's consistency is a genuine advantage that doesn't show up in averages.

What We'd Do Differently

This was an exploratory evaluation — designed to identify patterns, not prove rankings. For Round 2, we're addressing three issues:

Run Sonnet through the same pipeline. The gold standard scoring method makes Sonnet's score non-comparable. In Round 2, Sonnet runs the same 16-action pipeline as every other model. Same judge, same conditions, same denominator.

Increase sample sizes. Minimum 15 runs per model across the same template set. That gives us 80% statistical power to detect a 20-point difference at alpha=0.05. No more overlapping confidence intervals for the middle tier.

Calibrate the judge. Our Claude Opus judge scores Claude models. There's an obvious bias risk. Round 2 will score a subset with a second judge model and compute inter-rater agreement. We'll also blind the judge by stripping model-identifying patterns from outputs.

Key Takeaways

No model matches Sonnet. The gap is directionally clear even with measurement caveats. For client-facing output where quality is non-negotiable, Sonnet remains the production choice.

The middle tier is a tie. Kimi, Haiku, and DeepSeek V3.2 are statistically indistinguishable. Pick based on secondary factors: Haiku for consistency, Kimi for peak performance, DeepSeek for cost.

Task type matters more than model choice. The difference between the easiest action (9.4/10) and the hardest (-0.8/10) is larger than the difference between any two models on the same action. If you optimise which tasks you give to AI rather than which AI you use, you'll see bigger quality gains.

Reasoning models don't generate code well. R1's architecture is wrong for this task. Don't pick a model based on its benchmark scores on reasoning tasks if your workload is code generation.

Variance is a feature, not noise. DeepSeek V3.2 is the cheapest option but the least predictable. Haiku costs 5x more but delivers consistent results. The reliability premium is real.

This is part 2 of a 7-part series documenting how we built an evaluation framework for AI code generators, tested 5 models across 467 real code generation tasks, and turned the results into production improvements.

Previous: Beyond Text: How We Built an Evaluation Framework for Multi-File AI Outputs
Next: Building an LLM Judge That Doesn't Lie to You

Originally published on tebogo.cloud

Beyond Text: How We Built an Evaluation Framework for Multi-File AI Outputs

Tebogo Tseka — Mon, 30 Mar 2026 17:24:19 +0000

Most LLM benchmarks evaluate text. HumanEval checks if a function passes unit tests. SWE-bench measures whether a model can patch a repository. MBPP scores single-function completions.

None of these work when your AI agent generates an entire website.

I run a site builder agent that takes a template, a set of business requirements (brand colours, fonts, content, images, layout), and produces a deployable multi-file artifact: index.html, css/styles.css, js/main.js, and an assets/ directory. The output isn't a string. It's a folder. And a correct index.html paired with broken styles.css produces a broken site — even though each file might look reasonable in isolation.

I needed an evaluation framework that could score these outputs the way a QA engineer would: structurally, visually, semantically, and at the code level. Over six days, I built one. It evaluated 467 actions across 5 models, and the results changed how I think about AI code generation.

This article explains the framework.

Why Existing Benchmarks Don't Work Here

The gap between LLM benchmarks and real-world code generation is wider than it appears.

HumanEval tests single functions with pass/fail assertions. There's no partial credit for CSS that's 90% right but produces invisible text on a dark background. SWE-bench measures diffs against existing repositories — our agents generate from scratch, not patch. And MBPP evaluates isolated snippets with no concept of inter-file dependencies.

What I actually needed to measure fell into five categories: structural integrity (are the right files present?), visual fidelity (does it look correct?), content accuracy (is the business name right in all six locations?), code quality (is the CSS valid and responsive?), and accessibility (can users actually read the text?).

No existing benchmark covers all five for multi-file outputs. So I built a four-layer evaluation stack.

The 4-Layer Evaluation Stack

Each layer catches a different class of defect. They run in sequence, and their results feed into a unified scoring model.

Layer 1: Structural Checks

The FolderComparer validates the generated file tree against the gold standard. Does index.html exist? Is css/styles.css present? Are there unexpected files that shouldn't be there?

This layer catches the most fundamental failures. A missing index.html is an instant -5.0 deduction — the site literally cannot load. An extra file nobody asked for is a minor -0.25. The structural layer answers one question: did the agent produce the right artifacts?

Layer 2: Content Checks

The ContentComparer parses the generated HTML and validates text content, meta tags, heading hierarchy, alt text, and viewport configuration. It answers: does the content match what was requested?

This layer caught a failure pattern I didn't anticipate. Models paraphrase user-provided content roughly 30% of the time. The requirement says "Farm-fresh flavours, crafted with care" and the model writes "Fresh ingredients from local farms, prepared with dedication." Semantically similar. Functionally wrong. The client gave you exact copy — use it.

Layer 3: Visual Checks

The HTMLVisualChecker analyses HTML and CSS without rendering, catching issues that code review alone misses. It detects broken images (where the src attribute contains a description instead of a URL), empty sections, dark text on dark backgrounds, broken navigation layouts, and missing interactivity.

This layer exists because of a specific failure. Early in testing, our LLM judge gave a 9/10 to a page where the hero text was completely invisible — dark grey text (#333) on a dark background image. The CSS was syntactically valid. The HTML was well-structured. But the page was unusable. The visual checker now catches contrast violations by analysing CSS colour values against background declarations:

def check_dark_text_on_dark_bg(self) -> list[Violation]:
    """Detect potential dark-on-dark contrast issues from CSS."""
    root_match = re.search(r':root\s*\{([^}]+)\}', self.css)
    if not root_match:
        return violations

    # Extract CSS variables
    vars_dict = {}
    for match in re.finditer(r'(--[\w-]+)\s*:\s*([^;]+);', root_block):
        vars_dict[match.group(1)] = match.group(2).strip()

    text_color = vars_dict.get("--text-color", "").lower()
    has_bg_images = bool(re.search(
        r'background(-image)?\s*:\s*url\(', self.css
    ))

    if has_bg_images and self._is_dark_color(text_color):
        violations.append(Violation(
            id="A11Y-DARK-TEXT-ON-DARK-BG",
            severity=Severity.CRITICAL,
            deduction=-3.0,
            description="Dark text with background images — unreadable",
        ))

It also catches image hallucination — a universal failure across all five models we tested. Every model, at some point, writes image descriptions as src attributes:

<!-- What the model generates -->
<img src="a modern office building with glass facade and blue sky">

<!-- What it should generate -->
<img src="https://images.unsplash.com/photo-1486406146926-c627a92ad1ab">

The checker flags any src attribute longer than 30 characters containing spaces that doesn't start with http — a simple heuristic that catches this pattern reliably.

Layer 4: LLM Judge

The final layer is a Claude Opus multimodal judge. It receives the source code, the scoring rubric, and the violation catalogue, then returns a structured JSON response identifying every violation it finds.

The judge prompt is specific and constrained:

Your job: identify every violation in the agent output by
comparing it against the gold standard and requirements.

Return ONLY a JSON object with this structure:
{
  "violations": [{
    "id": "VIOLATION-ID-FROM-CATALOGUE",
    "severity": "critical|major|moderate|minor",
    "deduction": -N.N,
    "description": "What is wrong",
    "evidence": "Specific line showing the issue"
  }]
}

Rules:
- Use EXACT deduction amounts from the violation catalogue
- Do NOT invent violation IDs
- Do NOT report violations that don't exist

Three design decisions matter here. First, the judge identifies violations — it doesn't assign scores. The scoring engine applies fixed deductions. This separation prevents the judge from inflating or deflating scores arbitrarily. Second, the violation IDs are constrained to a catalogue of 22 known types. The judge can't invent new categories. Third, deduction amounts are fixed per violation type. The judge classifies; the scorer calculates.

The Violation-Deduction Scoring Model

Traditional AI evaluation uses additive scoring: start at 0, add points for what's correct. Our model inverts this.

Score = 10 - sum(deductions)

Every action starts at 10 (perfect). Each violation subtracts its fixed deduction. Scores can go negative — and they do. The layout transformation action averages -0.8/10 across all models, meaning models consistently make the page worse than the starting template.

Why deductive scoring? Because a page that's 90% correct but has invisible text is not a 9/10. It's broken. Additive scoring rewards partial completion. Deductive scoring penalises defects proportionally to their impact on the user.

The 22 violation types span seven categories:

Category	Example Violation	Severity	Deduction
Structural	Missing index.html	Critical	-5.0
Structural	Empty section (no visible content)	Critical	-3.0
Visual	Layout completely broken	Critical	-3.0
Visual	Broken image (description as src)	Critical	-2.5
Content	Missing text from requirements	Critical	-2.0
Content	Content paraphrased, not verbatim	Moderate	-0.5
Code Quality	Local file path instead of URL	Critical	-2.0
Code Quality	No responsive breakpoints	Major	-1.5
Accessibility	Dark text on dark background	Critical	-3.0
Accessibility	Missing alt text	Moderate	-0.5
Interactivity	No mobile menu toggle	Critical	-2.5
Performance	No lazy loading	Minor	-0.25

The severity tiers reflect real-world impact. A critical violation (-5.0 to -2.0) makes the site unusable or unprofessional. A major violation (-2.0 to -1.0) degrades the experience noticeably. Minor violations (-0.25) are polish issues that most users won't notice.

Gold Standards: The Ground Truth Problem

Every evaluation needs ground truth. Ours comes from 21 hand-verified reference templates covering landing pages, SaaS products, corporate sites, event pages, safari lodges, training portals, and more.

Each gold standard includes three stages:

gold-standards/
  template-ai-page-builder/
    requirements.md              # Business customisation spec
    stage-1-customise-template/  # Skeleton with spec applied
    stage-2-site-generation/     # Optimised and validated
    stage-3-deployment/          # Deploy config and manifest

The requirements.md file defines every customisation the agent must apply — brand colours, typography, logo paths, hero text, about section copy, contact details, layout patterns, SEO requirements. Here's a real excerpt:

## Brand Amendments

### Colours
- **Primary**: #B85C38
- **Secondary**: #5C3D2E
- **Accent**: #E8D5B7

### Typography
- **Heading Font**: Fraunces
- **Body Font**: Lato

## Content Amendments

### Hero Section
- **Headline**: Seasonal Menus. Local Ingredients.
              Unforgettable Meals.
- **CTA Button**: View Our Menu

These references are git-committed, versioned, and human-reviewed. They're not generated — they're hand-built by applying the requirements to each template and verifying every change visually. This matters because the judge compares agent output against these references. If the ground truth is wrong, every score is wrong.

The Evaluation Pipeline

Putting all four layers together, the orchestrator runs a 16-action pipeline per model per template:

Copy the template skeleton to the run directory (baseline)
Screenshot the baseline
For each of the 16 actions:
- Send the action instruction to the model
- Write modified files to the action directory
- Run the HTMLVisualChecker (Layer 3)
- Run the Opus judge against the gold standard (Layer 4)
- Record the ActionScore (10 minus deductions)
Aggregate all 16 action scores into a template score (max 160)

The 16 actions cover six categories: brand (colours, fonts, logos, favicon), images (hero background, section backgrounds), content (hero text, about text, contact info), layout (hero layout, sections layout), technical (SEO meta, structured data, accessibility), and quality (contrast verification).

Actions are sequential — each builds on the previous output. This is deliberate. Real agent workflows apply changes incrementally. A colour change affects subsequent image overlay decisions. A font change affects layout spacing. Sequential evaluation captures the compounding effect of errors, which is exactly what happens in production.

What This Framework Revealed

Over six days, this pipeline processed 467 actions across five models and six templates. The results were clear in some places and surprising in others.

What was clear: structured, well-defined tasks (SEO meta tags, accessibility attributes) score consistently high across all models (8.7-9.4/10 average). These are token-native tasks — key-value pairs and attribute additions that align with how language models process text.

What was surprising: layout transformation — applying CSS grid or flexbox changes to restructure page sections — scored negative on average. Every model, including the best one, made pages worse when asked to transform layouts. This isn't a prompt engineering problem. It's a spatial reasoning gap in current language model architectures.

What was most useful: the violation data drove targeted improvements. Instead of vaguely knowing "the agent sometimes produces bad output," I now know that 60% of font management failures come from a single issue (updating CSS font-family but not the Google Fonts <link> tag), and that 30% of content failures are verbatim violations (paraphrasing instead of using exact text). These specific failure patterns led to 1,191 lines of skill improvements across six production modules.

Applicability Beyond Websites

The framework's architecture — structural checks, content checks, visual checks, LLM judge, violation-deduction scoring — isn't website-specific. Any AI system that generates multi-file artifacts can be evaluated this way.

Document generation (reports, presentations, proposals) has the same inter-file dependency problem. Infrastructure-as-code (Terraform modules, CloudFormation templates) has structural requirements and validation rules. Even multi-file code generation (microservice scaffolding, API implementations) benefits from checking whether all the files work together, not just whether each file compiles.

The key insight: evaluating AI-generated artifacts requires evaluating the artifact as a whole, not its parts in isolation. A syntactically valid CSS file paired with an HTML file that references different class names is a broken website. The evaluation framework must understand that relationship.

This is part 1 of a 7-part series documenting how we built an evaluation framework for AI code generators, tested 5 models across 467 real code generation tasks, and turned the results into production improvements.

Originally published on tebogo.cloud

How I Create Memory for My Agents on Claude Code

Tebogo Tseka — Tue, 03 Mar 2026 18:33:32 +0000

How I Create Memory for My Agents on Claude Code

March 3, 2026

Introduction

AI agents forget everything. Every new session starts from zero — no context about your project, no memory of architectural decisions, no knowledge of your coding standards. You end up repeating yourself constantly.

I run 14 specialised agents across multiple AWS projects — an HLD Architect, a DevOps Engineer, an SDET, a Defect Manager, a Technical Content Engineer, and more. Each one needs to understand the codebase, follow specific rules, and build on work from previous sessions.

Repeating context every session is not an option. So I built a multi-layered memory architecture in Claude Code that gives my agents persistent knowledge, specialised expertise, and consistent behaviour across every conversation.

Here is exactly how I do it.

The Architecture: Six Layers of Memory

My agent memory system has six layers, each solving a different problem:

┌──────────────────────────────────────────────┐
│  Layer 6: Permissions (settings.local.json)  │  What the agent CAN do
├──────────────────────────────────────────────┤
│  Layer 5: Plans (.claude/plans/*.md)         │  What the agent IS doing
├──────────────────────────────────────────────┤
│  Layer 4: Auto Memory (memory/MEMORY.md)     │  What the agent HAS learned
├──────────────────────────────────────────────┤
│  Layer 3: Skills (*.skill.md)                │  HOW to do specific things
├──────────────────────────────────────────────┤
│  Layer 2: Agent Personas (*_Agent.md)        │  WHO the agent is
├──────────────────────────────────────────────┤
│  Layer 1: CLAUDE.md (project instructions)   │  The rules everyone follows
└──────────────────────────────────────────────┘

Every layer is just markdown files. No databases, no APIs, no infrastructure — just files that Claude Code loads automatically.

Layer 1: CLAUDE.md — The Constitution

Every project has a CLAUDE.md file at its root. Claude Code reads this file automatically at the start of every session. It is the single most important file in my entire setup.

My root CLAUDE.md sits at the workspace level and defines global rules that every agent must follow — what I call the TBT Law (Think Before Typing):

## TBT Law (Inviolable)

1. Be patient — 80% planning, 20% implementation
2. Do not be overeager — never try to impress by doing unrequested work
3. Always seek approval before implementing any plan
4. Never make changes without a plan — plan first, always
5. Do not rush the user — be patient, wait for direction
6. Do not make decisions or assumptions on the user's behalf
7. If unsure, ask — never guess or assume
8. If the plan isn't working, STOP — no workarounds
9. Rushing and over-eager changes will break code or design
10. If rules are violated, admit openly — do not hide mistakes

These ten rules prevent the most common failure mode with AI agents: doing too much, too fast, without thinking. Every agent, regardless of persona, follows these rules.

Below the TBT Law, the root CLAUDE.md defines:

Mandatory SDET Verification — every plan must be tested after execution
Defect Management — every bug gets logged, reproduced, fixed, and verified
Deployment-First Verification — no fix is considered testable until deployed
Repository Isolation — every service gets its own repo
AWS Resource Naming Conventions — DynamoDB tables use plain names, S3 buckets include environment suffixes

Project-Specific CLAUDE.md Files

Each project directory has its own CLAUDE.md that inherits from the root and adds project-specific context:

# my-saas-landing - Project Instructions

## Project Overview
**Repository**: my-saas-landing
**Purpose**: Marketing landing page - Single-page scroll site
**Stack**: React 18 + TypeScript + Vite

## Cross-App Navigation
| Action                  | Target URL                    |
|-------------------------|-------------------------------|
| "Start Free Trial"      | /app/onboarding               |
| "Buy" pricing button    | /checkout?planId={id}         |

## S3 Deployment
Landing page files deploy to the root of my-web-public S3 bucket...

This means the agent immediately knows what the project is, what stack it uses, how it deploys, and how it connects to other services — before I type a single word.

Layer 2: Agent Personas — Specialised Identities

I have 14 agent persona files, each defined as a markdown document. When I need a specific type of expertise, I load the corresponding persona.

Each persona file follows a consistent structure:

# DevOps Engineer Agent

## Identity
You are a Senior DevOps Engineer specialising in AWS infrastructure...

## Core Competencies
- CI/CD pipeline design (GitHub Actions)
- Infrastructure as Code (Terraform)
- Container orchestration (ECS, ECR)
- CloudFront distribution management

## Workflow
1. Assess current infrastructure state
2. Propose changes with risk assessment
3. Implement with rollback plan
4. Verify deployment
5. Document changes

## Constraints
- Never modify production without approval
- Always use Terraform for infrastructure changes
- Follow the AWS Well-Architected Framework

The key insight is that personas are not prompts — they are persistent identity files that the agent loads and embodies for the entire session. The DevOps Engineer thinks differently from the SDET, who thinks differently from the HLD Architect. They have different priorities, different vocabularies, and different workflows.

My current roster:

Persona	Purpose
HLD Architect	High-level design documents
LLD Architect	Low-level design documents
DevOps Engineer	CI/CD, infrastructure, deployments
SDET	Automated testing, defect tracking
Defect Manager	Bug lifecycle with issue tracker integration
GenAI Engineer	Bedrock, LLMs, RAG solutions
Cloud Security Specialist	IAM, GuardDuty, compliance
Technical Content Engineer	Blog posts, whitepapers, tutorials
Project Manager	Task orchestration, TBT workflow
Peer Review Architect	Design review, anti-pattern detection
Technical Business Developer	Market analysis, pricing models
Python AWS Developer	Lambda, DynamoDB, Step Functions
Java AWS Developer	Spring Boot, ECS services
Global Template Manager	Template lifecycle management

When I say "load the DevOps Engineer persona", the agent reads the file and adopts that identity — including its specific workflow, constraints, and communication style.

Layer 3: Skills — Reusable Knowledge Modules

Skills are the most underrated layer. They are standalone knowledge files (.skill.md) that any persona can reference. Think of them as shared libraries for agent knowledge.

Examples from my setup:

DynamoDB_Single_Table.skill.md — Single-table design patterns, GSI strategies, access patterns
HATEOAS_Relational_Design.skill.md — API design with hypermedia links
Development_Best_Practices.skill.md — SOLID, TDD, BDD, DDD principles
Monolith_Anti_Pattern_Validation.skill.md — Six anti-patterns (AP-1 through AP-6) to detect
Step_Functions_Decision_Logic.skill.md — State machine patterns
API_Proxy_Testing.skill.md — End-to-end testing patterns

A skill file looks like this:

# DynamoDB Single Table Design

## When to Apply
Apply when a service has 3+ entity types with relational access patterns.

## Partition Key Strategy
- Use composite keys: {ENTITY_TYPE}#{ENTITY_ID}
- GSI1PK for inverted lookups
- GSI2PK for cross-entity queries

## Access Patterns
| Pattern | PK | SK | Index |
|---------|----|----|-------|
| Get user by ID | USER#123 | METADATA | Table |
| Get user's sites | USER#123 | SITE# | Table |
| Get site by domain | DOMAIN#example.com | METADATA | GSI1 |

The power of skills is composition. When the LLD Architect is designing a new service, it can reference the DynamoDB skill, the HATEOAS skill, and the Development Best Practices skill simultaneously. When the SDET is writing tests, it pulls from the API Proxy Testing skill. The knowledge is defined once and reused across every persona.

Layer 4: Auto Memory — Learning Across Sessions

Claude Code has a built-in auto memory feature. It stores persistent notes in a memory/ directory within each project:

~/.claude/projects/{project-path}/memory/
├── MEMORY.md          # Always loaded (first 200 lines)
├── debugging.md       # Detailed debugging notes
├── patterns.md        # Confirmed patterns
└── architecture.md    # Architectural decisions

The MEMORY.md file is special — Claude Code loads the first 200 lines of it into every conversation automatically. This is where the agent stores things it has learned:

## Confirmed Patterns

- CloudFront Function handles SPA routing for all frontends
- S3 bucket serves all frontend apps from different prefixes
- Safe sync requires --exclude flags for other app prefixes
- Browser cache causes stale content after deployments (hard refresh needed)

## AWS SSO
- Profile name: dev
- Token expires frequently — run `aws sso login --profile dev`

I configure the agent to save memories with clear rules:

Save: Stable patterns confirmed across multiple sessions
Save: Key architectural decisions and important file paths
Save: Solutions to recurring problems
Don't save: Session-specific context or temporary state
Don't save: Speculative conclusions from reading a single file

The result is that the agent gets smarter over time. The first time it encounters the CloudFront routing behaviour, it investigates. The second time, it already knows.

Layer 5: Plans — Persistent Iteration

Plans bridge the gap between sessions. When a task is too large for one conversation, the agent writes a plan file:

~/.claude/plans/
├── zazzy-puzzling-cloud.md       # Frontend extraction plan
├── elegant-crunching-sunbeam.md  # Security hardening rollout
└── zazzy-percolating-lecun.md    # CDN deployment plan

A plan follows a consistent structure:

# Plan: Extract Landing Page into Standalone Repo

## Context
The landing page was prototyped inside the main app...

## Step 1: Scaffold New Repo
Create directory structure at /path/to/new/repo...

## Step 2: Create Fresh Files
- vite.config.ts — base: '/'
- App.tsx — no router, single-page scroll

## Step 3: Modify Copied Files
- Navigation.tsx — remove router dependency
- PricingPage.tsx — use window.location.href

## Verification
1. npm run dev → all sections render
2. npm run type-check → 0 errors
3. Images and assets load correctly

When a new session starts and the plan file exists, Claude Code includes a reminder:

"A plan file exists from plan mode. If this plan is relevant to the current work and not already complete, continue working on it."

This means the agent picks up exactly where it left off — no re-explanation needed.

Layer 6: Permissions — Trust Boundaries

The final layer controls what each agent can actually do. Claude Code uses settings.local.json to define allowed operations per project:

{
  "permissions": {
    "allow": [
      "Bash(git add *)",
      "Bash(git commit *)",
      "Bash(aws s3 sync *)",
      "Bash(aws cloudfront create-invalidation *)",
      "Bash(terraform plan *)",
      "Bash(pytest *)",
      "Bash(npm run build *)"
    ]
  }
}

My permissions file is 276 lines long. It covers Git operations, AWS CLI commands (IAM, S3, Lambda, DynamoDB, CloudFront, Route53), Terraform, Python tooling, and testing frameworks.

This is critical for the TBT Law. The agent can run tests and deploy to dev, but it cannot force-push to main or destroy production infrastructure without explicit approval.

How It All Comes Together

Here is a real workflow. I need to deploy a bug fix to a frontend app.

I open the project. Claude Code loads CLAUDE.md (Layer 1) — the agent knows the stack, deployment targets, and global rules.
I say "load the DevOps Engineer." The agent reads the persona file (Layer 2) — it now thinks like a DevOps engineer with CI/CD expertise.
The agent references existing knowledge. It checks auto memory (Layer 4) for deployment patterns — it already knows the S3 bucket name, CloudFront distribution ID, and safe sync exclusions.
It creates a plan. The plan (Layer 5) outlines: build, sync to S3, invalidate CloudFront, verify. Per TBT Law, it waits for my approval.
I approve. The agent executes within its permissions (Layer 6) — it can run npm run build and aws s3 sync, but it asks before running destructive commands.
SDET verification triggers. Per the CLAUDE.md mandatory rule, the SDET persona activates to verify the deployment — checking asset integrity, page load, and console errors.
The agent saves what it learned. If it encountered a new pattern (like a CloudFront cache behaviour), it writes it to auto memory for next time.

Six layers, all markdown files, zero infrastructure.

Practical Tips

Start with CLAUDE.md. You do not need all six layers on day one. A well-written CLAUDE.md with your project context and coding standards gives you 80% of the value.

Write personas for recurring roles. If you find yourself repeatedly explaining "you are a DevOps engineer who follows these patterns", extract it into a persona file.

Keep skills atomic. One skill, one topic. A DynamoDB skill should not also contain API design patterns. Composability comes from keeping them separate.

Curate auto memory. Review what the agent saves. Remove outdated entries. The memory file is limited to 200 lines — keep it focused on patterns that are genuinely stable.

Use plans for multi-session work. If a task will take more than one conversation, write a plan. The overhead of creating the plan pays for itself when you do not have to re-explain the context.

Set permissions deliberately. Start restrictive and expand. It is easier to grant new permissions than to recover from an agent that deleted your production database.

Conclusion

AI agents do not need to forget. The tools already exist in Claude Code — CLAUDE.md files, auto memory, plan persistence, and permission controls. What they need is architecture.

By structuring memory into six layers — rules, personas, skills, learning, plans, and permissions — I have agents that understand my projects, follow my standards, learn from past sessions, and operate within clear boundaries.

Every layer is a markdown file. Every file is version-controlled. The entire system is transparent, auditable, and easy to iterate on.

The best part? The agents get better every week. Not because the model improved, but because the memory did.

References

From IDE to Cloud: Lifting Your Local Agent into an MCP Server on Amazon Bedrock AgentCore

Tebogo Tseka — Tue, 03 Mar 2026 15:08:24 +0000

You have built an AI agent that works beautifully on your laptop. It calls tools, reasons through problems, and returns exactly the answer your users need. There is just one problem: it lives on localhost.

Moving from a local prototype to a production-grade, multi-tenant cloud service usually means weeks of infrastructure work — containers, load balancers, session isolation, authentication, observability. Amazon Bedrock AgentCore Runtime collapses that effort into a handful of commands while the Model Context Protocol (MCP) gives your agent a standard interface that any MCP-compatible client can discover and invoke.

In this post you will take a Python agent running in your IDE, transform it into an MCP server, and deploy it to AgentCore Runtime — with working code at every step.

What Is Amazon Bedrock AgentCore Runtime?

Amazon Bedrock AgentCore Runtime is a serverless hosting environment purpose-built for AI agents. It provides several capabilities that are hard to replicate on your own:

Framework-agnostic — Works with Strands Agents, LangGraph, CrewAI, or any custom Python agent. You are not locked into a single orchestration framework.
Model flexibility — Use any LLM — Amazon Bedrock models, Anthropic Claude, Google Gemini, or OpenAI.
Session isolation — Each user session runs in a dedicated microVM with isolated CPU, memory, and filesystem. When the session ends, the microVM is terminated and memory is sanitised.
Protocol support — Native support for Model Context Protocol (MCP) and Agent-to-Agent (A2A) communication.
Extended execution — Synchronous requests get a 15-minute timeout; asynchronous sessions can run for up to 8 hours.
Consumption-based pricing — You pay only for the compute your agent actually uses, not for idle time waiting on LLM responses.

In short, AgentCore Runtime handles the infrastructure so you can focus on the agent logic.

What Is MCP and Why Does It Matter?

The Model Context Protocol (MCP) is an open standard that defines how AI agents discover and invoke tools over HTTP. Think of it as a contract: an MCP server exposes tools with typed inputs and outputs, and any MCP client can discover those tools at runtime and call them without custom integration code.

Key characteristics of MCP on AgentCore Runtime:

Stateless streamable-HTTP — AgentCore requires stateless servers. The platform automatically injects a Mcp-Session-Id header for session continuity.
Tool discovery — Clients call list_tools() to discover every tool the server exposes, with full JSON Schema descriptions.
Standard path — The server listens on 0.0.0.0:8000/mcp, which is the default path supported by most MCP SDKs.
Interoperability — Any MCP client — Claude Code, Cursor, Kiro, Amazon Q CLI — can connect to your deployed server with zero custom wiring.

By implementing MCP, your agent becomes a reusable building block that other agents and developer tools can compose into larger systems.

Architecture Overview

The flow has four stages: build locally and test on localhost, transform for AgentCore compatibility, deploy to AWS via the AgentCore CLI, and invoke from any MCP client.

 LOCAL IDE                    TRANSFORMATION                AGENTCORE RUNTIME                 INVOCATION
 ─────────                    ──────────────                ─────────────────                 ──────────

 ┌─────────────────────┐      ┌──────────────────────┐      ┌──────────────────────────┐      ┌─────────────────────┐
 │                     │      │                      │      │                          │      │  Claude Code /      │
 │  MCP Server Code    │      │  Install AgentCore   │      │  agentcore configure     │      │  Cursor / Kiro      │
 │  my_mcp_server.py   │      │  MCP Server in IDE   │      │  --protocol MCP          │      │         │           │
 │         │           │      │         │            │      │         │                │      │         ▼           │
 │         ▼           │      │         ▼            │      │         ▼                │      │  ┌───────────────┐  │
 │  Local Test         │─────▶│  Transform Agent     │─────▶│  agentcore launch        │      │  │ Agent Runtime │  │
 │  localhost:8000/mcp │      │  + BedrockAgentCore  │      │  Build + ECR + Deploy    │      │  │     ARN       │◀─┤
 │                     │      │  App wrapper         │      │         │                │      │  │  MicroVM      │  │
 └─────────────────────┘      └──────────────────────┘      │         ▼                │      │  │  Isolation    │  │
                                                            │  ┌────────────────────┐  │      │  └───────────────┘  │
                                                            │  │  Agent Runtime ARN │  │      │         ▲           │
                                                            │  │  MicroVM Isolation  │──┼─────▶│  Remote MCP Client │
                                                            │  │  Session Mgmt      │  │      │  Python Script      │
                                                            │  └────────────────────┘  │      │         ▲           │
                                                            │                          │      │  MCP Inspector      │
                                                            └──────────────────────────┘      └─────────────────────┘

Prerequisites

Before you start, make sure you have:

An AWS account with Amazon Bedrock AgentCore permissions
AWS CLI installed and configured with appropriate credentials
Python 3.10+ installed (3.13 recommended)
uv package manager installed (optional but recommended)
An MCP client: Claude Code, Cursor, Kiro, or Amazon Q CLI

Install the core packages:

pip install mcp
pip install bedrock-agentcore
pip install bedrock-agentcore-starter-toolkit

Step 1: Build Your Local MCP Server

Start by creating a simple MCP server with a few tools. This is the agent you will later lift into AgentCore Runtime.

Create a file called my_mcp_server.py:

# my_mcp_server.py

from mcp.server.fastmcp import FastMCP
from starlette.responses import JSONResponse

# Create the MCP server instance
# host must be 0.0.0.0 for AgentCore compatibility
mcp = FastMCP(host="0.0.0.0", stateless_http=True)


@mcp.tool()
def summarise_architecture(service_name: str) -> str:
    """Summarise the high-level architecture of an AWS service."""
    return (
        f"The {service_name} architecture typically includes "
        f"a control plane for management operations and a data plane "
        f"for runtime request handling, with IAM for access control."
    )


@mcp.tool()
def estimate_monthly_cost(
    service: str, requests_per_month: int, avg_duration_ms: int
) -> str:
    """Estimate monthly cost for a serverless AWS service."""
    cost_per_request = 0.0000002
    cost_per_gb_second = 0.0000166667
    memory_gb = 0.5
    duration_seconds = avg_duration_ms / 1000
    compute_cost = requests_per_month * duration_seconds * memory_gb * cost_per_gb_second
    request_cost = requests_per_month * cost_per_request
    total = compute_cost + request_cost
    return f"Estimated monthly cost for {service}: ${total:,.2f}"


@mcp.tool()
def generate_iam_policy(actions: list[str], resource_arn: str) -> dict:
    """Generate a least-privilege IAM policy document."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": actions,
                "Resource": resource_arn,
            }
        ],
    }


if __name__ == "__main__":
    mcp.run(transport="streamable-http")

This server exposes three tools: an architecture summariser, a cost estimator, and an IAM policy generator. The key details for AgentCore compatibility are host="0.0.0.0" and stateless_http=True.

Test Locally

Start the server:

python my_mcp_server.py

The server starts on port 8000. From a separate terminal, run a test client:

# my_mcp_client.py

import asyncio
from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client


async def main():
    mcp_url = "http://localhost:8000/mcp"

    async with streamablehttp_client(
        mcp_url, headers={}, timeout=120, terminate_on_close=False
    ) as (read_stream, write_stream, _):
        async with ClientSession(read_stream, write_stream) as session:
            await session.initialize()

            # Discover tools
            tools = await session.list_tools()
            print("Available tools:")
            for tool in tools.tools:
                print(f"  - {tool.name}: {tool.description}")

            # Invoke a tool
            result = await session.call_tool(
                "estimate_monthly_cost",
                {
                    "service": "AWS Lambda",
                    "requests_per_month": 1_000_000,
                    "avg_duration_ms": 200,
                },
            )
            print(f"\nResult: {result.content[0].text}")


asyncio.run(main())

You should see your three tools listed and a cost estimate returned.

Step 2: Install the AgentCore MCP Server in Your IDE

AWS provides an MCP server specifically for AgentCore development. This server runs inside your IDE's MCP client and guides the transformation, deployment, and testing workflow conversationally.

Add the following to your MCP client configuration:

Claude Code (~/.claude/mcp.json)

{
  "mcpServers": {
    "bedrock-agentcore-mcp-server": {
      "command": "uvx",
      "args": [
        "awslabs.amazon-bedrock-agentcore-mcp-server@latest"
      ],
      "env": {
        "FASTMCP_LOG_LEVEL": "ERROR"
      },
      "disabled": false,
      "autoApprove": [
        "search_agentcore_docs",
        "fetch_agentcore_doc"
      ]
    }
  }
}

Cursor (.cursor/mcp.json)

{
  "mcpServers": {
    "bedrock-agentcore-mcp-server": {
      "command": "uvx",
      "args": [
        "awslabs.amazon-bedrock-agentcore-mcp-server@latest"
      ],
      "env": {
        "FASTMCP_LOG_LEVEL": "ERROR"
      },
      "disabled": false,
      "autoApprove": [
        "search_agentcore_docs",
        "fetch_agentcore_doc"
      ]
    }
  }
}

Restart your MCP client after adding the configuration. Verify by checking that search_agentcore_docs and fetch_agentcore_doc tools appear in your tool list.

Step 3: Transform Your Agent for AgentCore

If you are deploying an MCP server (not a general agent), the transformation is minimal. Your FastMCP server already meets the protocol contract — it listens on 0.0.0.0:8000/mcp with stateless streamable-HTTP transport.

However, if you are deploying a general agent (not an MCP server), you need to wrap it with the AgentCore SDK:

1. Add the AgentCore import

from bedrock_agentcore.runtime import BedrockAgentCoreApp

2. Initialise the application

app = BedrockAgentCoreApp()

3. Decorate your entrypoint

@app.entrypoint
def handler(event, context):
    user_prompt = event.get("prompt", "")
    response = my_agent.run(user_prompt)
    return {"result": response}

4. Add the runner

if __name__ == "__main__":
    app.run()

Update your requirements.txt:

mcp
bedrock-agentcore

Step 4: Deploy to AgentCore Runtime

Deployment uses the AgentCore CLI from the starter toolkit. Two commands are all you need.

Configure the deployment

agentcore configure -e my_mcp_server.py --protocol MCP

The CLI walks you through a guided prompt:

Execution role — Provide an IAM role ARN with AgentCore Runtime permissions
ECR repository — Press Enter to auto-create one
Dependency file — Auto-detected from the current directory
OAuth — Type yes if you want authentication, then provide your Cognito discovery URL and client ID

Launch

agentcore launch

Behind the scenes, this command:

Builds an ARM64 Docker container with your server code and dependencies
Pushes the container image to Amazon ECR
Creates an AgentCore Runtime resource
Deploys your MCP server into an isolated microVM environment

On success, you receive an Agent Runtime ARN:

arn:aws:bedrock-agentcore:us-west-2:123456789012:runtime/my_mcp_server-abc123

Save this ARN — you need it to invoke your server.

Step 5: Invoke and Test Your Deployed MCP Server

Your MCP server is now running on AWS. You can invoke it from any MCP client or from a Python script.

Remote invocation via Python

export AGENT_ARN="arn:aws:bedrock-agentcore:us-west-2:123456789012:runtime/my_mcp_server-abc123"
export BEARER_TOKEN="your-oauth-bearer-token"

# my_mcp_client_remote.py

import asyncio
import os
import sys

from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client


async def main():
    agent_arn = os.getenv("AGENT_ARN")
    bearer_token = os.getenv("BEARER_TOKEN")

    if not agent_arn or not bearer_token:
        print("Error: AGENT_ARN or BEARER_TOKEN not set")
        sys.exit(1)

    encoded_arn = agent_arn.replace(":", "%3A").replace("/", "%2F")
    mcp_url = (
        f"https://bedrock-agentcore.us-west-2.amazonaws.com"
        f"/runtimes/{encoded_arn}/invocations?qualifier=DEFAULT"
    )
    headers = {
        "authorization": f"Bearer {bearer_token}",
        "Content-Type": "application/json",
    }

    async with streamablehttp_client(
        mcp_url, headers, timeout=120, terminate_on_close=False
    ) as (read_stream, write_stream, _):
        async with ClientSession(read_stream, write_stream) as session:
            await session.initialize()

            # Discover tools
            tools = await session.list_tools()
            print("Deployed tools:")
            for tool in tools.tools:
                print(f"  - {tool.name}: {tool.description}")

            # Call a tool on the deployed server
            result = await session.call_tool(
                "generate_iam_policy",
                {
                    "actions": ["s3:GetObject", "s3:PutObject"],
                    "resource_arn": "arn:aws:s3:::my-bucket/*",
                },
            )
            print(f"\nGenerated policy:\n{result.content[0].text}")


asyncio.run(main())

You should see the same three tools you defined locally, now served from AgentCore Runtime with full session isolation, authentication, and observability.

Testing with MCP Inspector

You can also use the MCP Inspector for interactive testing. Point it at your deployed server's invocation URL with the appropriate bearer token, and you get a visual interface to discover tools, invoke them, and inspect responses.

What's Next?

You now have a production MCP server running on AgentCore Runtime. Here are natural next steps:

AgentCore Gateway — Connect your agent to external APIs and third-party tools through the managed gateway.
AgentCore Memory — Add persistent conversation context so your agent remembers prior interactions across sessions.
AgentCore Identity — Integrate with your corporate identity provider for end-user authentication.
Agent-to-Agent (A2A) — Deploy additional agents and let them communicate using the A2A protocol.
Observability — Enable built-in tracing to capture agent reasoning steps via CloudWatch Transaction Search.

References

By Tebogo Tseka — AWS Practice Manager & Solutions Architect at Big Beard Web Solutions

DEV Community: Tebogo Tseka

AI Governance in Practice: FastAPI on EKS with Model Cards, Audit Logging, and Helm

Architecture

Step 1: The FastAPI Application

Inference (POST /predict)

Model Card (GET /governance/model-card)

Audit Log (GET /governance/audit-log)

Health probes

Step 2: The Model

Step 3: Test Suite — 23/23

Step 4: Docker — Multi-Stage Build

Step 5: Helm Chart with HPA

HPA

_helpers.tpl

Step 6: Infrastructure (Terraform)

What I'd Add Next

Building a Real-Time IoT Telemetry Pipeline with Kinesis, Lambda, and DynamoDB

Architecture

Step 1: Device Simulator

Step 2: Kinesis Consumer Lambda

The Decimal gotcha

Partial batch responses

Async alert invocation

Step 3: Alert Handler Lambda

Test Suite: 30/30

Infrastructure (Terraform)

What I'd Add Next

Building a Production MLOps Pipeline on AWS SageMaker for Telecom Churn

What We're Building

Phase 1: The Pipeline Definition

Phase 2: Preprocessing with Synthetic Fallback

Phase 3: Evaluation Report

Phase 4: Automated Retraining via EventBridge

Phase 5: Drift Monitoring

Data Drift — Did the input distribution change?

Model Drift — Is the model's behaviour changing?

Test Suite: 24/24

Infrastructure (Terraform)

What's Next

Source Code

Predicting Telecom Customer Churn with scikit-learn, Keras, and Amazon SageMaker

Predicting Telecom Customer Churn with scikit-learn, Keras, and Amazon SageMaker

Why Telecom Churn Is a Hard ML Problem

The Dataset

Pipeline Architecture

Step 1: Data Preprocessing

Step 2: Training Three Models

Decision Tree — The Baseline

Random Forest — The Workhorse

Keras Neural Network — The Contender

Step 3: Model Evaluation & Comparison

Step 4: Deploying to SageMaker

Key Takeaways

What's Next

References

How I Run Over 20 AI Agents Locally and Deploy One to Production at a Time

The Problem With How Teams Ship Agents Today

My Architecture: Local-First Agentic Development

The Liftability Pattern

Skills and Tools: The Real Unit of Capability

The Production Decision: When Does an Agent Get Lifted?

What This Gives You That Shipping Fast Does Not

The Counterintuitive Takeaway

Key Takeaways

References

The Missing Test Suite: Why AI Projects Fail Before Production

The Uncomfortable Truth

Two Systems, Two Test Suites

Prompt Test Cases as First-Class Citizens

Happy Path

Negative Scenarios

Edge Cases

Designing Prompt Test Permutations

Equivalence Partitioning for Prompts

Boundary Value Analysis for Prompts

Decision Table Testing

The Prompt Regression Problem

Building a Prompt Regression Harness

Strategy: From POC to Production

1. Define Testability From Day One

Inference (`POST /predict`)

Model Card (`GET /governance/model-card`)

Audit Log (`GET /governance/audit-log`)

`_helpers.tpl`