Roger Viñas Alcon

Posted on Jul 2 • Edited on Jul 4

Strands Agents + Langfuse Evaluations

#langfuse #strandsagents #evaluations #ai

Before diving in, check out AI was supposed to take my job — instead it gave me a new one: Evaluations, a presentation that walks through this PoC.

See the source code on GitHub.

In this project we will build a Python banking assistant agent using Strands Agents and make it observable and continuously evaluated using Langfuse — step by step.

Strands Agents is a lightweight Python SDK for building LLM-powered agents with tool use and session memory, open-sourced by AWS in May 2025. It is Python-native — which pairs well with the Langfuse Python SDK — and new enough to be worth exploring. Any other Python agent framework would work just as well for this PoC.

With classic applications, quality is enforced through unit tests, integration tests, and static analysis — every function has a defined contract and a deterministic output you can assert on. In production, metrics (error rates, latency, memory) surface failures reliably.

AI applications break both of these. The same input may yield different outputs on each run — wording changes, tools get called in a different order, edge cases surface unpredictably. And in production, a request can return 200 OK in 300ms with a confident, completely wrong answer — classic metrics won't catch it. You need something more!

That's where traces and evaluations come in. A growing number of platforms support them: Langfuse, Arize Phoenix, MLflow, LangSmith, W&B Weave, Datadog, AWS AgentCore, Azure AI Foundry, and Google Vertex AI.

This PoC uses Langfuse because it is open source and self-hostable with a single docker compose up. It provides these features:

Tracing — recording a structured tree of every LLM call, tool call, and sub-agent step: inputs, outputs, latency, cost.
Evaluations — running scored assessments of agent outputs:
- Offline — run against a fixed, curated dataset before or after a change; re-runnable as a CI quality gate.
- Online — async, triggered by live traces; catch issues that didn't appear in your fixed dataset.
External Evaluations — attaching scores to live traces programmatically from your own code.
Annotation Queues — routing traces to human reviewers via explicit programmatic calls.
Prompt Management — versioning prompt templates and pulling them at runtime via SDK.

The app under evaluation is the banking sentinel — a customer support agent for ROGERVINAS bank built with Strands Agents: 3 mock accounts with 5 transactions each, and tools to freeze/unfreeze cards, look up transactions, and open or track disputes.

Ready? Get the agent and Langfuse running (Configuration → Run), open the chat UI and play with it, then go through the Implementation steps to see how each piece works.

Configuration
Run
Implementation
CI/CD
Documentation

Configuration

Prerequisites

Python (version pinned in .python-version)
uv
Docker + Docker Compose
A model provider (see below)

Install dependencies

uv sync --extra dev --extra evals

dev — dependencies for unit tests
evals — dependencies for evaluations

Environment

cp .env.example .env

Edit .env to set your model provider and Langfuse credentials.

Model providers

Set MODEL_PROVIDER in .env:

Provider	`MODEL_PROVIDER`	Requirements
Ollama (default)	`ollama`	`ollama serve` + `ollama pull llama3.1:8b`
AWS Bedrock	`bedrock`	AWS credentials configured
Google Gemini	`gemini`	`GOOGLE_API_KEY` in `.env`
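As a rough illustration of how the MODEL_PROVIDER setting might be read and validated at startup, here is a minimal sketch. The function name resolve_model_provider and the exact error handling are assumptions made for this example, not the repository's actual code; only the provider values and the GOOGLE_API_KEY requirement come from the table above.

```python
import os

# Provider values taken from the table above; everything else is illustrative.
SUPPORTED_PROVIDERS = ("ollama", "bedrock", "gemini")


def resolve_model_provider(env=None):
    """Return the configured provider, defaulting to 'ollama' when unset."""
    env = os.environ if env is None else env
    provider = env.get("MODEL_PROVIDER", "ollama").strip().lower()
    if provider not in SUPPORTED_PROVIDERS:
        raise ValueError(
            f"Unsupported MODEL_PROVIDER {provider!r}; expected one of {SUPPORTED_PROVIDERS}"
        )
    # Gemini is the only provider in the table that needs a key in .env.
    if provider == "gemini" and not env.get("GOOGLE_API_KEY"):
        raise ValueError("MODEL_PROVIDER=gemini requires GOOGLE_API_KEY")
    return provider
```

Failing fast on a bad value gives a clearer error than a provider SDK complaining halfway through the first chat request.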

Langfuse

We use Langfuse's self-hosted Docker Compose deployment. This PoC's docker-compose-langfuse.yml is a copy of the official docker-compose file.

To start:

docker compose -f docker-compose-langfuse.yml up -d

Open the Langfuse UI at http://localhost:3000 → project banking-sentinel, with these pre-provisioned credentials:

Email:      admin@local.dev
Password:   password

To stop:

docker compose -f docker-compose-langfuse.yml down

To stop and delete all data:

docker compose -f docker-compose-langfuse.yml down -v

Run

uv run uvicorn banking_sentinel.api:app --reload

Open http://localhost:8000 and try it out:

Use the textboxes to simulate a different User ID, Account ID, Tier, and Session ID.
Chat with the agent — for example, say "Help, I do not have Netflix!".
Each answer may include suggested actions you can click directly, plus 👍 / 👎 feedback buttons.

Agent traces are sent to Langfuse automatically via OpenTelemetry, because .env sets the LANGFUSE_* and OTEL_* environment variables.

Implementation

With the agent and Langfuse running, let's walk through how each piece works, step by step.

Step 1: The Banking Agent

The domain: a banking customer support agent — the Sentinel — for ROGERVINAS bank. Three mock accounts (ACC-1001, ACC-1002, ACC-1003), five transactions each, and seven tools:

freeze_card — Freeze a card
unfreeze_card — Unfreeze a card
is_card_frozen — Check freeze status
get_transactions — List transactions between two dates
open_dispute — Open a dispute on a transaction
get_dispute_status — Check dispute status
list_disputes — List all disputes for an account
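To give a feel for how small these tools can be, here is a sketch of the three card tools backed by an in-memory dictionary. The storage layout and return strings are assumptions for illustration. In the real agent the functions would be registered as Strands tools (for example with the SDK's tool decorator); that step is left out so the sketch runs with only the standard library.

```python
# In-memory card state for the three mock accounts named above (illustrative storage).
_frozen = {"ACC-1001": False, "ACC-1002": False, "ACC-1003": False}


def _check(account_id: str) -> None:
    """Reject account IDs that are not part of the mock data."""
    if account_id not in _frozen:
        raise KeyError(f"Unknown account: {account_id}")


def freeze_card(account_id: str) -> str:
    """Freeze the card linked to the account."""
    _check(account_id)
    _frozen[account_id] = True
    return f"Card for {account_id} is now frozen."


def unfreeze_card(account_id: str) -> str:
    """Unfreeze the card linked to the account."""
    _check(account_id)
    _frozen[account_id] = False
    return f"Card for {account_id} is now active."


def is_card_frozen(account_id: str) -> bool:
    """Report whether the card is currently frozen."""
    _check(account_id)
    return _frozen[account_id]
```

Plain functions with docstrings and type hints are also easy to unit test without a model in the loop.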

The implementation is intentionally minimal — a single process, file-based session storage, no RAG (the knowledge base is short enough to inline directly in the system prompt), no MCP, no external service calls. The goal here is to keep the agent simple so the focus stays on observability and evaluations. The full implementation lives under src/banking_sentinel; if you are new to Strands, follow the Strands Agents tutorial first.

In production, the agent would run behind a load balancer with multiple replicas, connect to real external services via MCP or direct API calls, and swap the local file-based session storage for a persistent store — see the Strands session management docs.

Step 2: Tracing

Traces are not standardised — what you get depends entirely on the framework and instrumentation. The GenAI OTel semantic conventions are still experimental, so SDKs and platforms implement them inconsistently — always verify your traces in the UI before relying on them.

In this PoC, for example, the Strands [otel] extra emits spans but does not set trace-level input/output on the root span — and we need those in later steps. So api.py wraps the entire /chat request in langfuse.start_as_current_observation(), which sets input/output on the root span while capturing the Strands OTel spans as children:

with langfuse.start_as_current_observation(name="banking-sentinel-chat", as_type="generation") as span:
    # ... run the agent ...
    span.update(input=request.message, output=response.answer, prompt=prompt_obj)

This produces the following span hierarchy in Langfuse:

banking-sentinel-chat  ← Langfuse-native root span (input/output/user_id)
  └── Strands OTel spans  ← captured automatically
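The parent/child relationship in that hierarchy can be illustrated with a toy tracer. This is not how Langfuse or OpenTelemetry are implemented; ToyTracer, Span, and the child span name are hypothetical and exist only to show why spans opened inside the root context end up as its children.

```python
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class Span:
    name: str
    input: Any = None
    output: Any = None
    children: List["Span"] = field(default_factory=list)

    def update(self, **fields: Any) -> None:
        """Set attributes such as input and output on the span."""
        for key, value in fields.items():
            setattr(self, key, value)


class ToyTracer:
    """Keeps a stack of open spans so new spans attach to the current one."""

    def __init__(self) -> None:
        self.roots: List[Span] = []
        self._stack: List[Span] = []

    @contextmanager
    def span(self, name: str):
        current = Span(name)
        parent: Optional[Span] = self._stack[-1] if self._stack else None
        (parent.children if parent else self.roots).append(current)
        self._stack.append(current)
        try:
            yield current
        finally:
            self._stack.pop()


tracer = ToyTracer()
with tracer.span("banking-sentinel-chat") as root:
    # Hypothetical child name, standing in for the spans Strands emits.
    with tracer.span("strands-agent-invocation"):
        pass
    root.update(input="Help, I do not have Netflix!", output="I found a Netflix charge.")
```

The root span is the only one that knows the full request and response, which is why input and output are set there.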

Open http://localhost:3000 → Traces to see it:

Step 3: Strands Native Evaluations

The Strands Evals SDK provides Case, Experiment, and OutputEvaluator. No Langfuse required — runs fully offline and produces a local report. The full code for this step lives in evals/strands/run_evaluations.py.

Each Case bundles an input and expected output:

CASES = [
    Case(
        name="unauthorized-netflix-charge",
        input={
            "userId": "user-1001", "accountId": "ACC-1001", "accountTier": "Standard",
            "message": "I don't have Netflix but I see a charge on my account",
        },
        expected_output={
            "suggestedActions": ["FREEZE_CARD"],
            "claim": "The AI agent found a Netflix charge of 9.99 and offered the user to open a dispute",
        },
    ),
    # ...
]

Two evaluators score each result. Both subclass OutputEvaluator, so both are rubric-driven LLM judges: one checks the suggested actions, the other checks the claim:

class CorrectnessEvaluator(OutputEvaluator):
    pass

class ClaimEvaluator(OutputEvaluator):
    pass

correctness_evaluator = CorrectnessEvaluator(
    model=_model,
    rubric="Score 1.0 if the actual output's suggested_actions contains all actions listed in expected_output's suggestedActions. Score 0.0 if any expected action is missing.",
)

claim_evaluator = ClaimEvaluator(
    model=_model,
    rubric="Score 1.0 if the actual output's answer matches the claim in expected_output. Score 0.0 if the answer does not match the claim.",
)

Note: Strands Evals can evaluate any Python callable — there is no coupling to the Strands agent itself. If you prefer a different evaluation framework, alternatives include DeepEval, Ragas (RAG-focused), Braintrust autoevals, or plain pytest with custom assertions.

There are two ways to run the task:

Embedded — the agent is instantiated in-process; external services are mocked, and you can inspect internal state (white-box). Fast, no server needed, ideal for CI. Use this when you want fast, isolated, reproducible runs.
API — the task hits a running server with real external services (black-box). Use this to validate against a live deployment or when mocking is not practical.
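The two modes can share one task signature, so the evaluation code does not care which one it gets. The sketch below assumes the /chat endpoint accepts the case input as a JSON body and returns JSON; the factory names, the payload shape, and fake_agent are hypothetical.

```python
import json
import urllib.request
from typing import Any, Callable, Dict

Task = Callable[[Dict[str, Any]], Dict[str, Any]]


def make_embedded_task(agent_fn: Task) -> Task:
    """White-box mode: call the agent in-process."""
    def task(case_input: Dict[str, Any]) -> Dict[str, Any]:
        return agent_fn(case_input)
    return task


def make_api_task(base_url: str, timeout: float = 60.0) -> Task:
    """Black-box mode: POST the case input to a running server."""
    def task(case_input: Dict[str, Any]) -> Dict[str, Any]:
        request = urllib.request.Request(
            f"{base_url.rstrip('/')}/chat",
            data=json.dumps(case_input).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    return task


def fake_agent(case_input: Dict[str, Any]) -> Dict[str, Any]:
    # Stand-in for the real agent so the sketch runs without a model.
    return {"answer": f"Echo: {case_input['message']}", "suggested_actions": ["FREEZE_CARD"]}


task = make_embedded_task(fake_agent)
result = task({"message": "I don't have Netflix"})
```

Swapping make_embedded_task for make_api_task("http://localhost:8000") would then be the only change needed to move from white-box to black-box runs.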

Run embedded — no server needed:

uv run python -m evals.strands.run_evaluations embedded

Run against a running server:

uv run python -m evals.strands.run_evaluations api --url http://localhost:8000

The report should look like this:

============================================================
Evaluator: CorrectnessEvaluator
Overall score: 1.00
  ✅ unauthorized-netflix-charge: score=1.00 — suggested_actions contains the expected 'FREEZE_CARD'.
  ✅ expired-dispute-window: score=1.00 — suggested_actions contains the expected 'FREEZE_CARD'.

============================================================
Evaluator: ClaimEvaluator
Overall score: 1.00
  ✅ unauthorized-netflix-charge: score=1.00 — identifies the Netflix charge and offers to open a dispute.
  ✅ expired-dispute-window: score=1.00 — explains the 14-day dispute window has expired.

✅ Evaluation PASSED

Step 4: Experiments

A Langfuse experiment is an offline evaluation run: your agent is executed against a curated dataset, each output is scored automatically, and the results are stored so you can compare across code versions, prompt changes, and model upgrades over time.

A dataset is a versioned collection of test cases — each item has an input, an expected_output, and optional metadata. Each experiment execution is named and stored against that dataset, with evaluator scores recorded per item.

Where Step 3: Strands Native Evaluations produces a local pass/fail report with no history, Langfuse Experiments add:

Comparison across runs — see how scores change between code versions, prompt changes, or model upgrades side by side in the dashboard
Persistent results — every run is stored; you can go back and audit any historical experiment

Both Step 3: Strands Native Evaluations and Step 4: Experiments act as a CI quality gate — the script exits non-zero if scores drop below threshold.
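A quality gate of this kind boils down to comparing average scores with thresholds and turning the result into an exit code. The sketch below is one possible shape, with a threshold per metric; the gate function, the metric names in the example call, and the threshold values are assumptions, not the scripts' actual logic.

```python
from statistics import mean
from typing import Dict, List


def gate(scores: Dict[str, List[float]], thresholds: Dict[str, float]) -> int:
    """Return a process exit code: 0 if every metric meets its threshold, else 1."""
    failed = False
    for metric, values in scores.items():
        average = mean(values) if values else 0.0
        minimum = thresholds.get(metric, 1.0)  # metrics without a threshold must be perfect
        ok = average >= minimum
        print(f"{'PASS' if ok else 'FAIL'} {metric}: {average:.2f} (min {minimum:.2f})")
        failed = failed or not ok
    return 1 if failed else 0


exit_code = gate(
    {"correctness": [1.0, 1.0], "claim_match": [1.0, 0.0]},
    {"correctness": 1.0, "claim_match": 0.8},
)
# A real script would finish with sys.exit(exit_code) so CI sees the failure.
```

In this example claim_match averages 0.50 against a minimum of 0.80, so the gate returns 1 and the CI job would fail.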

This step has three parts:

Create the dataset
Implement the experiment
Run the experiment

Create the dataset

Datasets can be created in two ways:

Via UI: go to http://localhost:3000 → Datasets → New dataset, then add items manually
Programmatically (idempotent — safe to run repeatedly), see evals/langfuse/create_dataset.py:

uv run python -m evals.langfuse.create_dataset
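One common way to make such a script safe to re-run is to derive each item's ID from its content, so a second run overwrites the same item instead of adding a duplicate. The sketch below shows that idea with a plain dictionary standing in for the dataset. It assumes the backing store upserts on item ID; the helper names and the dataset name are hypothetical, and the repository's script may achieve idempotency differently.

```python
import hashlib
import json
from typing import Any, Dict, Iterable


def stable_item_id(dataset_name: str, item_input: Dict[str, Any]) -> str:
    """Derive a deterministic ID from the dataset name and the item input."""
    canonical = json.dumps(item_input, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(f"{dataset_name}:{canonical}".encode("utf-8")).hexdigest()
    return f"{dataset_name}-{digest[:16]}"


def upsert_items(
    store: Dict[str, Dict[str, Any]],
    dataset_name: str,
    items: Iterable[Dict[str, Any]],
) -> None:
    """Write items keyed by their stable ID, so re-runs overwrite instead of duplicating."""
    for item in items:
        store[stable_item_id(dataset_name, item["input"])] = item


store: Dict[str, Dict[str, Any]] = {}
items = [
    {
        "input": {"accountId": "ACC-1001", "message": "I don't have Netflix"},
        "expected_output": {"suggestedActions": ["FREEZE_CARD"]},
    }
]
upsert_items(store, "banking-sentinel", items)
upsert_items(store, "banking-sentinel", items)  # second run leaves a single item
```

Sorting the keys before hashing means the ID does not change if the input dictionary is written in a different key order.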

Open http://localhost:3000 → Datasets → Items to see the dataset items:

Implement the experiment

Evaluators are plain Python callables — there are no built-in evaluators in the SDK and no base class to inherit from. Any function that matches this signature works:

from typing import Any, Dict, Optional

from langfuse.experiment import Evaluation

def my_evaluator(
    *,
    input: Any,           # the dataset item's input
    output: Any,          # what the task returned
    expected_output: Any, # the dataset item's expected_output
    metadata: Optional[Dict[str, Any]],
    **kwargs: Any,
) -> Evaluation:          # or List[Evaluation] for multiple metrics at once
    ...
    return Evaluation(
        name="my-metric",   # metric name shown in the dashboard
        value=1.0,          # int | float | str | bool
        comment="optional", # shown alongside the score
    )

We implement the same two checks as in Step 3: Strands Native Evaluations. Correctness is a deterministic Python function here, and claim is an LLM-as-judge. The judge runs locally using whatever MODEL_PROVIDER you have configured (Ollama, Bedrock, Gemini). It is not Langfuse's evaluation infrastructure, just your own model called from Python (see evals/langfuse/run_experiment.py):

def correctness_evaluator(*, output, expected_output, **kwargs) -> Evaluation:
    """Deterministic: fraction of expected suggested actions present in the output."""
    expected = set(expected_output.get("suggestedActions", []))
    actual = set(output.get("suggested_actions", []))
    score = len(expected & actual) / len(expected) if expected else 1.0
    return Evaluation(name="correctness", value=score, comment=f"Expected {expected}, got {actual}")

def claim_evaluator(*, output, expected_output, **kwargs) -> Evaluation:
    """LLM-as-judge: runs locally with your configured MODEL_PROVIDER."""
    judge = Agent(model=_model, callback_handler=lambda **_: None)
    result = judge(
        f"Does the following answer match the claim? Reply with YES or NO only.\n\n"
        f"Answer: {output['answer']}\n\nClaim: {expected_output['claim']}"
    )
    return Evaluation(name="claim_match", value=1.0 if "YES" in str(result).upper() else 0.0)

result = langfuse.run_experiment(
    name="banking-sentinel",
    data=dataset.items,
    task=embedded_task,
    evaluators=[correctness_evaluator, claim_evaluator],
    max_concurrency=1,
)
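The claim evaluator above counts any reply containing YES as a match. Smaller judge models sometimes ignore the "YES or NO only" instruction and add an explanation, so a slightly stricter parser can help. The sketch below is a hypothetical helper, not part of the repository; it takes the first standalone YES or NO in the reply as the verdict.

```python
import re


def parse_verdict(reply: str) -> float:
    """Map a judge reply to 1.0 (YES) or 0.0 (anything else), using the first verdict word."""
    match = re.search(r"\b(YES|NO)\b", reply.upper())
    return 1.0 if match and match.group(1) == "YES" else 0.0
```

With this helper, a reply such as "NO, although yes in part" scores 0.0, whereas a plain substring check would score it 1.0. Replies with no verdict at all also score 0.0, which keeps ambiguous judge output from inflating the metric.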

Run the experiment

Run embedded — no server needed:

uv run python -m evals.langfuse.run_experiment embedded

Run against a running server:

uv run python -m evals.langfuse.run_experiment api --url http://localhost:8000

Open http://localhost:3000 → Datasets → Experiments to see the experiment runs:

Step 5: Online Evaluations

Langfuse can automatically score live traces as they arrive — no code changes needed. In this PoC all chat traces are tagged banking-sentinel, and the root span of each is named banking-sentinel-chat, making them easy to target.

Setup:

1 — Add LLM Connection:
Go to http://localhost:3000 → Settings → LLM Connections → add your model provider API key.

2 — Set default evaluation model:
Go to http://localhost:3000 → Evaluators → set the Default Evaluation Model to the connection you just added.

3 — Create the evaluator:
Go to http://localhost:3000 → Evaluators → Create Evaluator. Two options:

Create from scratch (custom prompt) — your own domain-aware prompt. This is what we use for the PoC.
Use existing (built-in templates, e.g. Helpfulness, Hallucination) — often too generic for a specific domain.

Fill in the custom evaluator:

Name — e.g. banking-sentinel-helpfulness
Model — leave Use default evaluation model checked (the model you set in step 2)
Evaluation prompt — reference the trace content with {{input}} and {{output}} (see below)
Score type — numeric, boolean, or categorical
Score reasoning prompt and Score range prompt — leave at their defaults or customize at will

Evaluation prompt:

You are evaluating a banking customer-support assistant for ROGERVINAS bank.
Customers often report problems informally, without using precise banking
terms. Evaluate how helpful the assistant's reply is: does it identify the
relevant transaction, explain the applicable policy (e.g. the dispute window),
and offer concrete next steps (freeze card, open dispute)?

Customer message:
{{input}}

Assistant reply:
{{output}}

4 — Configure the rule:

Set target to Observations, filter by Type = GENERATION
Ensure Run on live incoming observations is checked (it is the default)
Add filter: Tags → any of → banking-sentinel
Add filter: Name → any of → banking-sentinel-chat — targets only the root span; avoids double-scoring the inner Strands generation (both carry the tag but have different names)
Set Sampling (100% is fine for PoC — reduce in production to control costs)
Map the evaluator's prompt variables to the trace fields accordingly
Click Execute — scores existing matching observations immediately and new ones going forward
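To make the Sampling setting concrete, here is what percentage-based sampling means in code. This sketch does not describe how Langfuse implements it internally; is_sampled is a hypothetical helper that could also be reused if you sample traces in your own code. Hashing the trace ID makes the decision repeatable for the same trace.

```python
import hashlib


def is_sampled(trace_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of traces, based on a hash of the trace ID."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0.0 and 1.0")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform value in [0, 1)
    return bucket < rate
```

A rate of 1.0 scores every trace, as in this PoC. Lowering it reduces the number of LLM-as-judge calls, and therefore the cost, roughly in proportion.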

Open http://localhost:3000 → Traces and click a banking-sentinel-chat trace. The banking-sentinel-helpfulness score shows as a small badge on the root span, with the full detail in the Scores tab:

Note: The online evaluations themselves produce traces — each LLM-as-judge call is a generation Langfuse records.

Note: The Langfuse API to create evaluators and rules programmatically is unstable and only available on Langfuse Cloud — not in self-hosted deployments.

Step 6: External Evaluations

Langfuse lets you attach scores to any trace programmatically from your own code using langfuse.create_score(). Common use cases include:

User feedback — 👍/👎 ratings from end users
Guardrail results — PII checks, content policy, format validation
Agent self-scoring — quality signal computed inline and submitted back
Custom pipelines — any score your application logic can produce

In this PoC we implement user feedback as our example. Automated evaluators catch a lot, but human judgment is still essential — collecting 👍/👎 feedback from real users gives you a ground-truth signal that no automated evaluator can fully replace.

How it works:

The /chat endpoint returns a trace_id in every response (read from span.trace_id)
The chat UI attaches 👍/👎 buttons to each assistant message, keyed to that trace_id
On click, the UI posts to /feedback with value: 1.0 (👍) or value: 0.0 (👎)
The backend (api.py) calls langfuse.create_score() — the score appears on the trace immediately

@app.post("/feedback")
def feedback_endpoint(request: FeedbackRequest):
    langfuse.create_score(
        trace_id=request.trace_id,
        name="user-feedback",
        value=request.value,   # 1.0 = 👍, 0.0 = 👎
        comment=request.comment,
    )
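The FeedbackRequest model is not shown in the snippet above. A plausible shape is sketched below, based only on the fields the endpoint reads (trace_id, value, comment). The validation rules and the to_score_kwargs helper are assumptions. A FastAPI app would typically define this as a Pydantic model; a dataclass is used here so the sketch runs with only the standard library.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class FeedbackRequest:
    trace_id: str
    value: float  # 1.0 = thumbs up, 0.0 = thumbs down
    comment: Optional[str] = None

    def __post_init__(self) -> None:
        if not self.trace_id:
            raise ValueError("trace_id is required")
        if self.value not in (0.0, 1.0):
            raise ValueError("value must be 1.0 (thumbs up) or 0.0 (thumbs down)")


def to_score_kwargs(request: FeedbackRequest) -> Dict[str, Any]:
    """Build the keyword arguments the endpoint above passes to langfuse.create_score()."""
    return {
        "trace_id": request.trace_id,
        "name": "user-feedback",
        "value": request.value,
        "comment": request.comment,
    }
```

Rejecting values other than 0.0 and 1.0 at the edge keeps the user-feedback score binary, which makes averages in the dashboard easy to read as an approval rate.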

In the chat UI, send a message and click 👍 or 👎 on the reply. Then open http://localhost:3000 → Traces and click that trace. The user-feedback score shows as a small badge on the root span, with the full detail in the Scores tab:

Step 7: Annotation Queues

Annotation queues are a human review workflow — domain experts manually score traces to build ground truth, validate LLM-as-judge results, or investigate failures.

Key concept: Langfuse provides the queue infrastructure and a programmatic API to add items — but there are no automatic routing rules or triggers built in. Items only enter a queue through an explicit call, either from the UI (ad-hoc) or from your code. Your code owns the routing logic.

Common triggers you can implement:

User gives 👎 — route negative feedback traces for human investigation
Experiment score below threshold — enqueue failing traces to build better ground truth
Online evaluator scores low — poll scores and enqueue traces below a quality bar
Specific intent detected — route traces matching certain patterns (complaints, edge cases)
Random sampling — periodically enqueue a % of production traces for ongoing quality checks
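Since your code owns the routing logic, several of the triggers above can be combined into one small decision function. The sketch below is hypothetical: the function name, the reason strings, and the default quality bar of 0.5 are assumptions chosen for illustration.

```python
from typing import Optional


def should_enqueue(
    user_feedback: Optional[float] = None,
    online_score: Optional[float] = None,
    quality_bar: float = 0.5,
    sampled: bool = False,
) -> Optional[str]:
    """Return the reason a trace should be reviewed, or None to skip it."""
    if user_feedback == 0.0:
        return "negative-user-feedback"
    if online_score is not None and online_score < quality_bar:
        return "low-online-score"
    if sampled:
        return "random-sample"
    return None
```

Returning a reason rather than a boolean lets the caller log why a trace was queued, or attach the reason to the trace for reviewers.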

1 — Create the queue (once, idempotent):

uv run python -m evals.langfuse.create_annotation_queue

This creates a quality score config (numeric, 0–1) and the banking-sentinel-review queue that uses it, so reviewers score each trace on quality — you could attach more score configs for other dimensions (see evals/langfuse/create_annotation_queue.py).

Set ANNOTATION_QUEUE_ID in .env to the returned queue ID.

2 — Enqueue traces (example: on 👎):
In this PoC, the /feedback endpoint in api.py adds the trace to the queue whenever the user gives a thumbs down:

if request.value == 0.0 and _annotation_queue_id:
    langfuse.api.annotation_queues.create_queue_item(
        _annotation_queue_id,
        object_id=request.trace_id,
        object_type=AnnotationQueueObjectType.TRACE,
    )

3 — Review the queue:

Go to http://localhost:3000 → Human Annotation
Open banking-sentinel-review
Review the conversation
Assign a quality score (and optionally add a comment)
Correct the output if necessary
Click Mark Completed
Scores and corrected output appear on the trace and contribute to your evaluation dashboard

Annotating an item — assign the quality score and an optional comment:

The quality score then shows as a small badge on the root banking-sentinel-chat span:

Step 8: Prompt Management

Langfuse can store and version system prompts independently of your code — iterate on the prompt without redeploying the app.

1 — Create the prompt programmatically:

uv run python -m evals.langfuse.create_prompt

Each run creates a new version (see evals/langfuse/create_prompt.py). The production label is set automatically, so it is served at runtime.

2 — Or create the prompt via the UI:

Go to http://localhost:3000 → Prompts → New prompt, then:

Name it banking-sentinel-system, type Text
Paste the template using {{variable}} syntax (Mustache)
Add the production label
Save
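The {{variable}} placeholders are filled in when the prompt is compiled at runtime. The Langfuse SDK handles this for fetched prompts; the standard-library sketch below only illustrates the substitution itself. The compile_prompt helper and the template text are hypothetical, and the variable names are borrowed from the create_agent() parameters.

```python
import re

_VARIABLE = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def compile_prompt(template: str, **variables: str) -> str:
    """Replace each {{name}} placeholder; unknown placeholders are left untouched."""
    def substitute(match: "re.Match[str]") -> str:
        name = match.group(1)
        return str(variables[name]) if name in variables else match.group(0)
    return _VARIABLE.sub(substitute, template)


# Hypothetical template text, not the PoC's actual system prompt.
template = "You assist a {{user_tier}} customer with account {{account_id}}. Today is {{reference_date}}."
prompt = compile_prompt(
    template,
    user_tier="Standard",
    account_id="ACC-1001",
    reference_date="2025-07-02",
)
```

Leaving unknown placeholders in place, instead of replacing them with an empty string, makes a missing variable easy to spot in the trace.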

Either way, the prompt appears in Prompts, versioned and labeled production:

3 — Use the prompt from code:

Set USE_LANGFUSE_PROMPT=true in .env and create_agent() in agent.py fetches the prompt from Langfuse; otherwise it uses the hardcoded template:

def create_agent(langfuse, model, tools, user_tier, account_id, reference_date, session_manager=None) -> tuple:
    if langfuse is not None and os.getenv("USE_LANGFUSE_PROMPT", "false").lower() == "true":
        ...  # fetch the versioned prompt via langfuse.get_prompt(...)
    else:
        ...  # use the hardcoded template
    ...

It returns prompt_obj, which api.py will pass back to Langfuse to link the prompt version to the trace:

span.update(input=request.message, output=response.answer, prompt=prompt_obj)

The trace then shows a Prompt: banking-sentinel-system - v1 tag linking to the exact version used:

Benefits: version history, compare prompt versions across experiments, iterate without redeploying, A/B test prompts.
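The fetch-or-fallback branch in create_agent() could look roughly like the sketch below. It assumes a client exposing get_prompt(name) that returns an object with a compile(**variables) method, as the Langfuse SDK does. The function name, the hardcoded template text, the fallback on fetch errors, and the fake classes are assumptions added so the example runs on its own.

```python
import logging
from typing import Any, Optional, Tuple

logger = logging.getLogger(__name__)

# Hypothetical local template, used when Langfuse is disabled or unreachable.
HARDCODED_TEMPLATE = "You are the Sentinel, a support agent for account {account_id}."


def resolve_system_prompt(
    client: Optional[Any], use_langfuse: bool, account_id: str
) -> Tuple[str, Optional[Any]]:
    """Return (system_prompt, prompt_obj); prompt_obj is None for the hardcoded template."""
    if client is not None and use_langfuse:
        try:
            prompt_obj = client.get_prompt("banking-sentinel-system")
            return prompt_obj.compile(account_id=account_id), prompt_obj
        except Exception as error:  # assumption: fall back instead of failing the request
            logger.warning("Prompt fetch failed, using hardcoded template: %s", error)
    return HARDCODED_TEMPLATE.format(account_id=account_id), None


class FakePrompt:
    """Stand-in for a fetched prompt object."""

    def compile(self, **variables: Any) -> str:
        return f"Remote prompt for {variables['account_id']}"


class FakeClient:
    """Stand-in for the Langfuse client."""

    def get_prompt(self, name: str) -> FakePrompt:
        return FakePrompt()


remote, prompt_obj = resolve_system_prompt(FakeClient(), True, "ACC-1001")
local, no_obj = resolve_system_prompt(None, True, "ACC-1001")
```

Returning None as prompt_obj in the fallback case means the trace simply carries no prompt link, which also signals in the UI that the hardcoded template was used.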

CI/CD

For this PoC, CI runs three sequential jobs that gate on each other — each stage must pass before the next starts:

Build — installs dependencies, builds the package, runs unit tests
Standalone Evals — runs Strands native evaluations in embedded mode (no Langfuse). Fails if any score drops below a threshold.
Langfuse Evals — runs Langfuse experiments and reports results to the dashboard. Fails if any score drops below a threshold. For this PoC we start Langfuse with Docker just for the CI run, but a real setup would point at a shared, always-on instance.

This means a code or prompt change that degrades agent quality will fail CI before it can reach production.

Langfuse acts as an active quality gate:

Traces, scores, and experiment history accumulate across every PR and deploy — production traces surface regressions and datasets grow from real failures.
A deployment job runs only after all eval jobs pass, so CI blocks deploys that would degrade quality — with score thresholds tuned per metric as you build up baseline data.

This is the continuous AI Engineering Loop as documented by Langfuse:

Documentation

Happy GenAI coding! 💙
