Before diving in, check out "AI was supposed to take my job — instead it gave me a new one: Evaluations", a presentation that walks through this PoC.
In this project we will build a Python banking assistant agent using Strands Agents and make it observable and continuously evaluated using Langfuse — step by step.
Strands Agents is a lightweight Python SDK for building LLM-powered agents with tool use and session memory, open-sourced by AWS in May 2025. It is Python-native — which pairs well with the Langfuse Python SDK — and new enough to be worth exploring. Any other Python agent framework would work just as well for this PoC.
With classic applications, quality is enforced through unit tests, integration tests, and static analysis — every function has a defined contract and a deterministic output you can assert on. In production, metrics (error rates, latency, memory) surface failures reliably.
AI applications break both of these. The same input may yield different outputs on each run — wording changes, tools get called in a different order, edge cases surface unpredictably. And in production, a request can return 200 OK in 300ms with a confident, completely wrong answer — classic metrics won't catch it. You need something more!
That's where traces and evaluations come in. A growing number of platforms support them: Langfuse, Arize Phoenix, MLflow, LangSmith, W&B Weave, Datadog, AWS AgentCore, Azure AI Foundry, and Google Vertex AI.
This PoC uses Langfuse because it is open-source and is self-hostable with a single docker compose up, providing these features:
- Tracing: recording a structured tree of every LLM call, tool call, and sub-agent step, with inputs, outputs, latency, and cost.
- Evaluations: running scored assessments of agent outputs:
  - Offline: run against a fixed, curated dataset before or after a change; re-runnable as a CI quality gate.
  - Online: async, triggered by live traces; catch issues that didn't appear in your fixed dataset.
- External Evaluations: attaching scores to live traces programmatically from your own code.
- Annotation Queues: routing traces to human reviewers via explicit programmatic calls.
- Prompt Management: versioning prompt templates and pulling them at runtime via SDK.
The app under evaluation is the banking sentinel — a customer support agent for ROGERVINAS bank built with Strands Agents: 3 mock accounts with 5 transactions each, and tools to freeze/unfreeze cards, look up transactions, and open or track disputes.
Ready? Get the agent and Langfuse running (Configuration → Run), open the chat UI and play with it, then go through the Implementation steps to see how each piece works.
Configuration
Prerequisites
- Python (version pinned in .python-version)
- uv
- Docker + Docker Compose
- A model provider (see below)
Install dependencies
uv sync --extra dev --extra evals
- dev: dependencies for unit tests
- evals: dependencies for evaluations
Environment
cp .env.example .env
Edit .env to set your model provider and Langfuse credentials.
Model providers
Set MODEL_PROVIDER in .env:
| Provider | MODEL_PROVIDER | Requirements |
|---|---|---|
| Ollama (default) | ollama | ollama serve + ollama pull llama3.1:8b |
| AWS Bedrock | bedrock | AWS credentials configured |
| Google Gemini | gemini | GOOGLE_API_KEY in .env |
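As a rough illustration of how the table above could translate into startup validation, here is a minimal sketch. The function name `resolve_provider` and the error messages are hypothetical and not taken from the PoC; only the provider names, the `ollama` default, and the `GOOGLE_API_KEY` requirement come from the table.

```python
import os

# Provider names and requirements mirror the table above.
# Everything else in this sketch is a hypothetical illustration.
PROVIDER_REQUIREMENTS = {
    "ollama": "ollama serve + ollama pull llama3.1:8b",
    "bedrock": "AWS credentials configured",
    "gemini": "GOOGLE_API_KEY in .env",
}


def resolve_provider(env=None):
    """Return the selected provider name, defaulting to 'ollama'."""
    env = os.environ if env is None else env
    provider = env.get("MODEL_PROVIDER", "ollama").strip().lower()
    if provider not in PROVIDER_REQUIREMENTS:
        known = ", ".join(sorted(PROVIDER_REQUIREMENTS))
        raise ValueError(f"Unknown MODEL_PROVIDER {provider!r}; expected one of: {known}")
    # Only the Gemini requirement can be checked from the environment alone.
    if provider == "gemini" and not env.get("GOOGLE_API_KEY"):
        raise ValueError("MODEL_PROVIDER=gemini requires GOOGLE_API_KEY")
    return provider
```

Failing early on a misconfigured provider tends to be easier to debug than a failure on the first chat request.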
Langfuse
We use the self-hosted Docker Compose deployment. This PoC's docker-compose-langfuse.yml is a copy of the official Docker Compose file.
To start:
docker compose -f docker-compose-langfuse.yml up -d
Open the Langfuse UI at http://localhost:3000 → project banking-sentinel, with these pre-provisioned credentials:
Email: admin@local.dev
Password: password
To stop:
docker compose -f docker-compose-langfuse.yml down
To stop and delete all data:
docker compose -f docker-compose-langfuse.yml down -v
Run
uv run uvicorn banking_sentinel.api:app --reload
Open http://localhost:8000 and try it out:
- Use the textboxes to simulate a different User ID, Account ID, Tier, and Session ID.
- Chat with the agent — for example, say "Help, I do not have Netflix!".
- Each answer may include suggested actions you can click directly, plus 👍 / 👎 feedback buttons.
Agent traces are sent to Langfuse automatically via OpenTelemetry, as the LANGFUSE_* and OTEL_* env vars are set in .env.
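The exact variables in this PoC's .env.example are not shown here, so the following is only a sketch of what an OTLP configuration for Langfuse typically looks like: the endpoint path `/api/public/otel` and Basic auth built from the public and secret keys follow Langfuse's OpenTelemetry documentation as I understand it. The helper name `langfuse_otel_env` and the key values are placeholders.

```python
import base64


def langfuse_otel_env(public_key, secret_key, host="http://localhost:3000"):
    """Build the OTLP exporter settings for a Langfuse instance (assumed layout)."""
    # Basic auth token: base64("<public_key>:<secret_key>")
    token = base64.b64encode(f"{public_key}:{secret_key}".encode("utf-8")).decode("ascii")
    return {
        "OTEL_EXPORTER_OTLP_ENDPOINT": f"{host.rstrip('/')}/api/public/otel",
        "OTEL_EXPORTER_OTLP_HEADERS": f"Authorization=Basic {token}",
    }


if __name__ == "__main__":
    # Placeholder keys; use the ones from your own Langfuse project.
    for name, value in langfuse_otel_env("pk-lf-placeholder", "sk-lf-placeholder").items():
        print(f"{name}={value}")
```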
Implementation
With the agent and Langfuse running, let's walk through how each piece works, step by step.
Step 1: The Banking Agent
The domain: a banking customer support agent — the Sentinel — for ROGERVINAS bank. Three mock accounts (ACC-1001, ACC-1002, ACC-1003), five transactions each, and seven tools:
- freeze_card: Freeze a card
- unfreeze_card: Unfreeze a card
- is_card_frozen: Check freeze status
- get_transactions: List transactions between two dates
- open_dispute: Open a dispute on a transaction
- get_dispute_status: Check dispute status
- list_disputes: List all disputes for an account
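To give a feel for how small such tools can be, here is a sketch of four of them as plain Python functions over in-memory mock data. The data layout, transaction values, and return messages are assumptions made for illustration; the PoC's actual implementation lives under src/banking_sentinel and may differ. In a Strands agent, functions like these would typically be registered as tools (for example with the SDK's `tool` decorator), which is left out here to keep the sketch dependency-free.

```python
from datetime import date

# Hypothetical in-memory state standing in for the PoC's mock accounts.
FROZEN_CARDS = set()
TRANSACTIONS = {
    "ACC-1001": [
        {"id": "TX-1", "date": date(2025, 1, 5), "merchant": "Netflix", "amount": 9.99},
        {"id": "TX-2", "date": date(2025, 1, 20), "merchant": "Grocer", "amount": 42.10},
    ],
}


def freeze_card(account_id):
    """Freeze the card linked to the account."""
    FROZEN_CARDS.add(account_id)
    return f"Card for {account_id} is frozen."


def unfreeze_card(account_id):
    """Unfreeze the card; a no-op if it was not frozen."""
    FROZEN_CARDS.discard(account_id)
    return f"Card for {account_id} is active."


def is_card_frozen(account_id):
    """Report whether the card is currently frozen."""
    return account_id in FROZEN_CARDS


def get_transactions(account_id, start, end):
    """List transactions whose date falls within [start, end], inclusive."""
    return [tx for tx in TRANSACTIONS.get(account_id, []) if start <= tx["date"] <= end]
```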
The implementation is intentionally minimal — a single process, file-based session storage, no RAG (the knowledge base is short enough to inline directly in the system prompt), no MCP, no external service calls. The goal here is to keep the agent simple so the focus stays on observability and evaluations. The full implementation lives under src/banking_sentinel; if you are new to Strands, follow the Strands Agents tutorial first.
In production, the agent would run behind a load balancer with multiple replicas, connect to real external services via MCP or direct API calls, and swap the local file-based session storage for a persistent store — see the Strands session management docs.
Step 2: Tracing
Traces are not standardised — what you get depends entirely on the framework and instrumentation. The GenAI OTel semantic conventions are still experimental, so SDKs and platforms implement them inconsistently — always verify your traces in the UI before relying on them.
In this PoC, for example, the Strands [otel] extra emits spans but does not set trace-level input/output on the root span — and we need those in later steps. So api.py wraps the entire /chat request in langfuse.start_as_current_observation(), which sets input/output on the root span while capturing the Strands OTel spans as children:
with langfuse.start_as_current_observation(name="banking-sentinel-chat", as_type="generation") as span:
    # ... run the agent ...
    span.update(input=request.message, output=response.answer, prompt=prompt_obj)
This produces the following span hierarchy in Langfuse:
banking-sentinel-chat ← Langfuse-native root span (input/output/user_id)
└── Strands OTel spans ← captured automatically
Open http://localhost:3000 → Traces to see it:
Step 3: Strands Native Evaluations
The Strands Evals SDK provides Case, Experiment, and OutputEvaluator. No Langfuse required — runs fully offline and produces a local report. The full code for this step lives in evals/strands/run_evaluations.py.
Each Case bundles an input and expected output:
CASES = [
    Case(
        name="unauthorized-netflix-charge",
        input={
            "userId": "user-1001", "accountId": "ACC-1001", "accountTier": "Standard",
            "message": "I don't have Netflix but I see a charge on my account",
        },
        expected_output={
            "suggestedActions": ["FREEZE_CARD"],
            "claim": "The AI agent found a Netflix charge of 9.99 and offered the user to open a dispute",
        },
    ),
    # ...
]
Two evaluators score each result — one deterministic, one LLM-as-judge:
class CorrectnessEvaluator(OutputEvaluator):
    pass


class ClaimEvaluator(OutputEvaluator):
    pass


correctness_evaluator = CorrectnessEvaluator(
    model=_model,
    rubric="Score 1.0 if the actual output's suggested_actions contains all actions listed in expected_output's suggestedActions. Score 0.0 if any expected action is missing.",
)
claim_evaluator = ClaimEvaluator(
    model=_model,
    rubric="Score 1.0 if the actual output's answer matches the claim in expected_output. Score 0.0 if the answer does not match the claim.",
)
Note: Strands Evals can evaluate any Python callable — there is no coupling to the Strands agent itself. If you prefer a different evaluation framework, alternatives include DeepEval, Ragas (RAG-focused), Braintrust autoevals, or plain pytest with custom assertions.
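For the "plain pytest with custom assertions" alternative mentioned in the note, a minimal sketch might look like the following. The agent response is stubbed, and the `OPEN_DISPUTE` action name is a made-up example; only the `suggested_actions` and `suggestedActions` keys come from the rubrics above.

```python
def assert_expected_actions(actual_output, expected_output):
    """Fail with a readable message if any expected action is missing."""
    expected = set(expected_output.get("suggestedActions", []))
    actual = set(actual_output.get("suggested_actions", []))
    missing = expected - actual
    assert not missing, f"missing suggested actions: {sorted(missing)}"


def test_unauthorized_netflix_charge():
    # Stubbed agent response; a real test would call the agent here.
    actual = {
        "answer": "I found a Netflix charge of 9.99.",
        "suggested_actions": ["FREEZE_CARD", "OPEN_DISPUTE"],  # OPEN_DISPUTE is hypothetical
    }
    expected = {"suggestedActions": ["FREEZE_CARD"]}
    assert_expected_actions(actual, expected)
```

A check like this needs no judge model, so it is cheap to run, but it only covers the structured part of the response; free-text answers still need a semantic comparison.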
There are two ways to run the task:
- Embedded — the agent is instantiated in-process; external services are mocked, and you can inspect internal state (white-box). Fast, no server needed, ideal for CI. Use this when you want fast, isolated, reproducible runs.
- API — the task hits a running server with real external services (black-box). Use this to validate against a live deployment or when mocking is not practical.
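The two modes above can be sketched as two factories that return a task callable with the same shape, so the evaluators do not need to know which one is in use. This is an illustration only: the helper names are hypothetical, and it assumes `/chat` accepts the case input as a JSON body and returns JSON, which the PoC's actual request schema may not match exactly.

```python
import json
import urllib.request


def make_embedded_task(agent_fn):
    """Embedded mode: call the agent in-process (agent_fn is any callable)."""
    def task(case_input):
        return agent_fn(case_input)
    return task


def build_chat_request(base_url, case_input):
    """Build the POST request for the /chat endpoint (assumed JSON body)."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat",
        data=json.dumps(case_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def make_api_task(base_url, timeout=60):
    """API mode: send each case to a running server and parse the JSON reply."""
    def task(case_input):
        request = build_chat_request(base_url, case_input)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    return task
```

Keeping both tasks behind the same call signature is what lets a single script switch between `embedded` and `api` with a command-line argument.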
Run embedded — no server needed:
uv run python -m evals.strands.run_evaluations embedded
Run against a running server:
uv run python -m evals.strands.run_evaluations api --url http://localhost:8000
The report should look like this:
============================================================
Evaluator: CorrectnessEvaluator
Overall score: 1.00
✅ unauthorized-netflix-charge: score=1.00 — suggested_actions contains the expected 'FREEZE_CARD'.
✅ expired-dispute-window: score=1.00 — suggested_actions contains the expected 'FREEZE_CARD'.
============================================================
Evaluator: ClaimEvaluator
Overall score: 1.00
✅ unauthorized-netflix-charge: score=1.00 — identifies the Netflix charge and offers to open a dispute.
✅ expired-dispute-window: score=1.00 — explains the 14-day dispute window has expired.
✅ Evaluation PASSED
Step 4: Experiments
A Langfuse experiment is an offline evaluation run: your agent is executed against a curated dataset, each output is scored automatically, and the results are stored so you can compare across code versions, prompt changes, and model upgrades over time.
A dataset is a versioned collection of test cases — each item has an input, an expected_output, and optional metadata. Each experiment execution is named and stored against that dataset, with evaluator scores recorded per item.
Where Step 3: Strands Native Evaluations produces a local pass/fail report with no history, Langfuse Experiments add:
- Comparison across runs — see how scores change between code versions, prompt changes, or model upgrades side by side in the dashboard
- Persistent results — every run is stored; you can go back and audit any historical experiment
Both Step 3: Strands Native Evaluations and Step 4: Experiments act as a CI quality gate — the script exits non-zero if scores drop below threshold.
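The quality-gate behaviour can be sketched as a small function that compares mean scores against per-metric thresholds and maps the result to a process exit code. The function names and threshold values below are illustrative assumptions, not the PoC's actual script.

```python
from statistics import mean


def failing_metrics(scores_by_metric, thresholds):
    """Return the metrics whose mean score is below its threshold."""
    failures = []
    for metric, threshold in thresholds.items():
        scores = scores_by_metric.get(metric, [])
        # A metric with no scores counts as a failure so the gate cannot pass silently.
        if not scores or mean(scores) < threshold:
            failures.append(metric)
    return failures


def exit_code(failures):
    """Non-zero when any metric failed, which is what makes CI stop."""
    return 1 if failures else 0


if __name__ == "__main__":
    import sys

    # Example scores; in a real run these come from the evaluators.
    results = {"correctness": [1.0, 1.0], "claim_match": [1.0, 0.0]}
    failed = failing_metrics(results, {"correctness": 0.8, "claim_match": 0.8})
    print("failed metrics:", failed)
    sys.exit(exit_code(failed))
```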
This step has three parts:
Create the dataset
Each item has an input, an expected_output, and optional metadata. Datasets can be created in two ways:
- Via UI: go to http://localhost:3000 → Datasets → New dataset, then add items manually
- Programmatically (idempotent, safe to run repeatedly), see evals/langfuse/create_dataset.py:
uv run python -m evals.langfuse.create_dataset
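One common way to make dataset creation idempotent is to derive a stable ID for each item, so that re-running the script overwrites items instead of duplicating them. Whether create_dataset.py works this way is an assumption; the sketch below uses a plain dict as a stand-in for the remote dataset, and all helper names are hypothetical.

```python
import hashlib


def stable_item_id(dataset_name, case_name):
    """Derive the same ID for the same case on every run."""
    digest = hashlib.sha256(f"{dataset_name}/{case_name}".encode("utf-8")).hexdigest()
    return f"item-{digest[:16]}"


def upsert_cases(store, dataset_name, cases):
    """Insert or overwrite each case under its stable ID.

    `store` is a dict standing in for the remote dataset.
    """
    for case in cases:
        store[stable_item_id(dataset_name, case["name"])] = {
            "input": case["input"],
            "expected_output": case["expected_output"],
        }
    return store
```

With a real backend, the same idea applies as long as the create call accepts a caller-supplied item ID and treats a repeated ID as an update.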
Implement the experiment
Evaluators are plain Python callables — there are no built-in evaluators in the SDK and no base class to inherit from. Any function that matches this signature works:
from typing import Any, Dict, Optional

from langfuse.experiment import Evaluation


def my_evaluator(
    *,
    input: Any,  # the dataset item's input
    output: Any,  # what the task returned
    expected_output: Any,  # the dataset item's expected_output
    metadata: Optional[Dict[str, Any]],
    **kwargs: Any,
) -> Evaluation:  # or List[Evaluation] for multiple metrics at once
    ...
    return Evaluation(
        name="my-metric",  # metric name shown in the dashboard
        value=1.0,  # int | float | str | bool
        comment="optional",  # shown alongside the score
    )
We implement the same two evaluators as in Step 3: Strands Native Evaluations — one deterministic (correctness), one LLM-as-judge (claim). The LLM-as-judge runs locally using whatever MODEL_PROVIDER you have configured (Ollama, Bedrock, Gemini) — it is not Langfuse's evaluation infrastructure, just your own model called from Python (see evals/langfuse/run_experiment.py):
def correctness_evaluator(*, output, expected_output, **kwargs) -> Evaluation:
    """Deterministic: checks if all expected suggested actions are present."""
    expected = set(expected_output.get("suggestedActions", []))
    actual = set(output.get("suggested_actions", []))
    score = len(expected & actual) / len(expected) if expected else 1.0
    return Evaluation(name="correctness", value=score, comment=f"Expected {expected}, got {actual}")


def claim_evaluator(*, output, expected_output, **kwargs) -> Evaluation:
    """LLM-as-judge: runs locally with your configured MODEL_PROVIDER."""
    judge = Agent(model=_model, callback_handler=lambda **_: None)
    result = judge(
        f"Does the following answer match the claim? Reply with YES or NO only.\n\n"
        f"Answer: {output['answer']}\n\nClaim: {expected_output['claim']}"
    )
    return Evaluation(name="claim_match", value=1.0 if "YES" in str(result).upper() else 0.0)


result = langfuse.run_experiment(
    name="banking-sentinel",
    data=dataset.items,
    task=embedded_task,
    evaluators=[correctness_evaluator, claim_evaluator],
    max_concurrency=1,
)
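One detail worth considering in the judge above: a substring test such as `"YES" in reply.upper()` also matches a reply like "No. Yesterday's charge is unrelated." because "YESTERDAY" contains "YES". A stricter parse is sketched below; the function name `parse_verdict` is hypothetical, and returning `None` for ambiguous replies is a design choice of this sketch, not something the PoC does.

```python
import re


def parse_verdict(raw):
    """Map a judge reply to 1.0 (YES), 0.0 (NO), or None when ambiguous.

    Only a YES or NO at the very start of the reply counts.
    """
    match = re.match(r"\s*(YES|NO)\b", str(raw), flags=re.IGNORECASE)
    if match is None:
        return None
    return 1.0 if match.group(1).upper() == "YES" else 0.0
```

Treating an unparseable reply as a separate outcome makes it possible to retry the judge or flag the item, rather than silently recording a 0.0.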
Run the experiment
Run embedded — no server needed:
uv run python -m evals.langfuse.run_experiment embedded
Run against a running server:
uv run python -m evals.langfuse.run_experiment api --url http://localhost:8000
Open http://localhost:3000 → Datasets to see results:
Step 5: Online Evaluations (LLM-as-judge)
Langfuse can automatically score live traces as they arrive — no code changes needed. In this PoC all chat traces are tagged banking-sentinel, and the root span of each is named banking-sentinel-chat, making them easy to target.
Setup:
1 — Add LLM Connection:
Go to http://localhost:3000 → Settings → LLM Connections → add your model provider API key.
2 — Set default evaluation model:
Go to http://localhost:3000 → LLM-as-a-Judge → set the Default Evaluation Model to the connection you just added.
3 — Create the evaluator:
Go to http://localhost:3000 → LLM-as-a-Judge → Create Evaluator. Two options:
- Built-in evaluators (e.g. Helpfulness, Hallucination) — often too generic for a specific domain.
- Custom Evaluator — your own domain-aware prompt. This is what we use for the PoC.
Fill in the custom evaluator:
- Name: e.g. banking-sentinel-helpfulness
- Model: leave Use default evaluation model checked (the model you set in step 2)
- Evaluation prompt: reference the trace content with {{input}} and {{output}} (see below)
- Score reasoning prompt and Score range prompt: leave at their defaults or customize at will
Evaluation prompt:
You are evaluating a banking customer-support assistant for ROGERVINAS bank.
Customers often report problems informally, without using precise banking
terms. Evaluate how helpful the assistant's reply is: does it identify the
relevant transaction, explain the applicable policy (e.g. the dispute window),
and offer concrete next steps (freeze card, open dispute)?
Customer message:
{{input}}
Assistant reply:
{{output}}
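To preview how the prompt above reads once the placeholders are filled, a simple local substitution is enough. The sketch below handles only plain `{{name}}` placeholders; it is not a full Mustache implementation and not the renderer Langfuse uses, and the function name is hypothetical.

```python
import re


def render_template(template, variables):
    """Replace each {{name}} placeholder with its value; fail on missing names."""
    def replace(match):
        key = match.group(1)
        if key not in variables:
            raise KeyError(f"missing template variable: {key}")
        return str(variables[key])

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", replace, template)


if __name__ == "__main__":
    preview = render_template(
        "Customer message:\n{{input}}\nAssistant reply:\n{{output}}",
        {"input": "Help, I do not have Netflix!", "output": "I found a Netflix charge."},
    )
    print(preview)
```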
4 — Configure the rule:
- Set target to Observations, filter by Type = GENERATION
- Ensure Run on live incoming observations is checked (it is the default)
- Add filter: Tags → any of → banking-sentinel
- Add filter: Name → any of → banking-sentinel-chat. This targets only the root span and avoids double-scoring the inner Strands generation (both carry the tag but have different names)
- Set Sampling (100% is fine for a PoC; reduce in production to control costs)
- Map the evaluator's prompt variables to the trace fields accordingly
- Click Execute: this scores existing matching observations immediately and new ones going forward
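The effect of the filters and sampling above can be illustrated with a few lines of plain Python. This models the rule conceptually and is not how Langfuse implements it; the observation dict layout and the inner span name are assumptions.

```python
import random


def matches_rule(observation, required_tag="banking-sentinel", required_name="banking-sentinel-chat"):
    """True only for generations that carry the tag AND have the root span's name."""
    return (
        observation.get("type") == "GENERATION"
        and required_tag in observation.get("tags", [])
        and observation.get("name") == required_name
    )


def should_evaluate(observation, sampling=1.0, rng=random):
    """Apply the filters first, then keep roughly `sampling` of the matches."""
    return matches_rule(observation) and rng.random() < sampling


root = {"type": "GENERATION", "tags": ["banking-sentinel"], "name": "banking-sentinel-chat"}
# Hypothetical name for the inner Strands generation: same tag, different name.
inner = {"type": "GENERATION", "tags": ["banking-sentinel"], "name": "inner-model-call"}
```

Without the name filter both `root` and `inner` would match, and each chat turn would be scored twice.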
Open http://localhost:3000 → Traces and click a banking-sentinel-chat trace. The banking-sentinel-helpfulness score shows as a small badge on the root span, with the full detail in the Scores tab:
Note: The online evaluations themselves produce traces — each LLM-as-judge call is a generation Langfuse records.
Note: The Langfuse API to create evaluators and rules programmatically is unstable and only available on Langfuse Cloud — not in self-hosted deployments.
Step 6: External Evaluations
Langfuse lets you attach scores to any trace programmatically from your own code using langfuse.create_score(). Common use cases include:
- User feedback — 👍/👎 ratings from end users
- Guardrail results — PII checks, content policy, format validation
- Agent self-scoring — quality signal computed inline and submitted back
- Custom pipelines — any score your application logic can produce
In this PoC we implement user feedback as our example. Automated evaluators catch a lot, but human judgment is still essential — collecting 👍/👎 feedback from real users gives you a ground-truth signal that no automated evaluator can fully replace.
How it works:
- The /chat endpoint returns a trace_id in every response (read from span.trace_id)
- The chat UI attaches 👍/👎 buttons to each assistant message, keyed to that trace_id
- On click, the UI posts to /feedback with value: 1.0 (👍) or value: 0.0 (👎)
- The backend (api.py) calls langfuse.create_score(), and the score appears on the trace immediately
@app.post("/feedback")
def feedback_endpoint(request: FeedbackRequest):
    langfuse.create_score(
        trace_id=request.trace_id,
        name="user-feedback",
        value=request.value,  # 1.0 = 👍, 0.0 = 👎
        comment=request.comment,
    )
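The definition of `FeedbackRequest` is not shown above. As a rough idea of the validation such a payload might need, here is a standard-library sketch; the class name `FeedbackPayload` is deliberately different to make clear it is hypothetical, and the PoC's model (likely defined alongside the FastAPI app) may validate differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FeedbackPayload:
    """Hypothetical stand-in for the feedback request body."""

    trace_id: str
    value: float
    comment: Optional[str] = None

    def __post_init__(self):
        if not self.trace_id:
            raise ValueError("trace_id is required")
        # Only the two values the chat UI sends are accepted.
        if self.value not in (0.0, 1.0):
            raise ValueError("value must be 1.0 (thumbs up) or 0.0 (thumbs down)")
```

Rejecting out-of-range values at the edge keeps the `user-feedback` score binary, which makes averages in the dashboard easier to interpret.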
In the chat UI, send a message and click 👍 or 👎 on the reply. Then open http://localhost:3000 → Traces and click that trace. The user-feedback score shows as a small badge on the root span, with the full detail in the Scores tab:
Step 7: Annotation Queues
Annotation queues are a human review workflow — domain experts manually score traces to build ground truth, validate LLM-as-judge results, or investigate failures.
Key concept: Langfuse provides the queue infrastructure and a programmatic API to add items — but there are no automatic routing rules or triggers built in. Items only enter a queue through an explicit call, either from the UI (ad-hoc) or from your code. Your code owns the routing logic.
Common triggers you can implement:
- User gives 👎 — route negative feedback traces for human investigation
- Experiment score below threshold — enqueue failing traces to build better ground truth
- Online evaluator scores low — poll scores and enqueue traces below a quality bar
- Specific intent detected — route traces matching certain patterns (complaints, edge cases)
- Random sampling — periodically enqueue a % of production traces for ongoing quality checks
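Since the routing logic lives in your own code, several of the triggers above can be combined into one small decision function. The sketch below is an assumption about how that might look: the function name, parameters, and default quality bar are illustrative only.

```python
import random


def should_enqueue(user_feedback=None, evaluator_score=None,
                   quality_bar=0.5, sample_rate=0.0, rng=random):
    """Decide whether a trace should be sent to the annotation queue."""
    # Trigger 1: the user gave a thumbs down.
    if user_feedback == 0.0:
        return True
    # Trigger 2: an automated evaluator scored the trace below the quality bar.
    if evaluator_score is not None and evaluator_score < quality_bar:
        return True
    # Trigger 3: random sampling of the remaining traffic.
    return rng.random() < sample_rate
```

Keeping the decision in one pure function makes it easy to unit test and to tune (for example, raising `sample_rate` temporarily after a prompt change).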
1 — Create the queue (once, idempotent):
uv run python -m evals.langfuse.create_annotation_queue
This creates a quality score config (numeric, 0–1) and the banking-sentinel-review queue that uses it, so reviewers score each trace on quality — you could attach more score configs for other dimensions (see evals/langfuse/create_annotation_queue.py).
Set ANNOTATION_QUEUE_ID in .env to the returned queue ID.
2 — Enqueue traces (example: on 👎):
In this PoC, the /feedback endpoint in api.py adds the trace to the queue whenever the user gives a thumbs down:
if request.value == 0.0 and _annotation_queue_id:
    langfuse.api.annotation_queues.create_queue_item(
        _annotation_queue_id,
        object_id=request.trace_id,
        object_type=AnnotationQueueObjectType.TRACE,
    )
3 — Review the queue:
- Go to http://localhost:3000 → Annotation Queues
- Open banking-sentinel-review
- For each trace: review the conversation, assign a quality score (and optionally add a comment), click Mark Completed
- Scores appear on the trace and contribute to your evaluation dashboard
Annotating an item — assign the quality score and an optional comment:
The quality score then shows as a small badge on the root banking-sentinel-chat span:
Step 8: Prompt Management
Langfuse can store and version system prompts independently of your code — iterate on the prompt without redeploying the app.
1 — Create the prompt programmatically:
uv run python -m evals.langfuse.create_prompt
Each run creates a new version (see evals/langfuse/create_prompt.py). The production label is set automatically, so it is served at runtime.
2 — Or create the prompt via the UI:
Go to http://localhost:3000 → Prompts → New prompt, then:
- Name it banking-sentinel-system, type Text
- Paste the template using {{variable}} syntax (Mustache)
- Add the production label
- Save
Either way, the prompt appears in Prompts, versioned and labeled production:
3 — Use the prompt from code:
Set USE_LANGFUSE_PROMPT=true in .env and create_agent() in agent.py fetches the prompt from Langfuse; otherwise it uses the hardcoded template:
def create_agent(langfuse, model, tools, user_tier, account_id, reference_date, session_manager=None) -> tuple:
    if langfuse is not None and os.getenv("USE_LANGFUSE_PROMPT", "false").lower() == "true":
        ...  # fetch the versioned prompt via langfuse.get_prompt(...)
    else:
        ...  # use the hardcoded template
    ...
It returns prompt_obj, which api.py will pass back to Langfuse to link the prompt version to the trace:
span.update(input=request.message, output=response.answer, prompt=prompt_obj)
The trace then shows a Prompt: banking-sentinel-system - v1 tag linking to the exact version used:
Benefits: version history, compare prompt versions across experiments, iterate without redeploying, A/B test prompts.
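The toggle between the managed prompt and the hardcoded template can be sketched without any Langfuse dependency by passing the fetcher in as a callable. Note that falling back to the hardcoded template when the fetch fails is an addition of this sketch, not something the PoC is stated to do; the template text and function name are also hypothetical.

```python
import os

# Hypothetical template text; the PoC's real system prompt is not shown here.
HARDCODED_TEMPLATE = "You are the banking sentinel for account {{account_id}} ({{user_tier}} tier)."


def load_system_prompt(fetch_remote, env=None):
    """Return (template, source), where source is 'langfuse' or 'hardcoded'."""
    env = os.environ if env is None else env
    use_remote = env.get("USE_LANGFUSE_PROMPT", "false").lower() == "true"
    if use_remote and fetch_remote is not None:
        try:
            return fetch_remote("banking-sentinel-system"), "langfuse"
        except Exception:
            # Assumed design choice: keep answering if the prompt store is unreachable.
            return HARDCODED_TEMPLATE, "hardcoded"
    return HARDCODED_TEMPLATE, "hardcoded"
```

Returning the source alongside the template makes it possible to record which path was taken, which helps when a trace shows no linked prompt version.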
CI/CD
For this PoC, CI runs three sequential jobs that gate on each other — each stage must pass before the next starts:
- Build — installs dependencies, builds the package, runs unit tests
- Standalone Evals — runs Strands native evaluations in embedded mode (no Langfuse). Fails if any score drops below a threshold.
- Langfuse Evals — runs Langfuse experiments and reports results to the dashboard. Fails if any score drops below a threshold. For this PoC we start Langfuse with Docker just for the CI run, but a real setup would point at a shared, always-on instance.
This means a code or prompt change that degrades agent quality will fail CI before it can reach production.
Langfuse acts as an active quality gate:
- Traces, scores, and experiment history accumulate across every PR and deploy — production traces surface regressions and datasets grow from real failures.
- A deployment job runs only after all eval jobs pass, so CI blocks deploys that would degrade quality — with score thresholds tuned per metric as you build up baseline data.
This is the continuous AI Engineering Loop as documented by Langfuse:
Documentation
- Strands Agents docs
- Langfuse docs
- Langfuse × Strands Agents integration
- Strands Agents quick start
- Langfuse tracing
- Strands Evals SDK
- Langfuse datasets
- Langfuse model-based evaluations
- Langfuse user feedback
- Langfuse annotation queues
- Langfuse prompt management
Happy GenAI coding! 💙










