Gino Llerena

Posted on Jul 4

Six Bugs Only a Live Model Could Teach Us

#agents #ai #llm #showdev

Building auditable environmental-compliance agents on Qwen Cloud — and what changed when we tested with real qwen-plus output

AgentOps Debugger is an agentic application for investigating environmental-compliance history in Peru.

The idea is simple: you ask in Spanish or English about companies regulated by OEFA, Peru’s environmental regulator, and the system retrieves public sanction records and regulatory documents, builds cited answers, drafts structured reports behind a human-approval step, and shows the complete trace of how the answer was produced.

The stack is:

Qwen qwen-plus on Qwen Cloud through DashScope’s OpenAI-compatible endpoint
Mastra with AI SDK v5
Hono + TypeScript backend
React workspace
Docker deployment on Alibaba Cloud ECS

The architecture: a Coordinator plans typed tasks, specialist Qwen agents execute them, and every step is stored in an audit ledger.

The strategy that worked well — until live model output entered the system

From the beginning, I designed the project as offline-first.

The full system can run without API keys: seed records, lexical BM25 retrieval, deterministic no-LLM agents, and a local demo from docker compose up.

That helped a lot: all 315 of our tests run without network calls, so the app is testable, reproducible, and easy to demo. Live mode then swaps the deterministic agents for real Qwen agents behind the same interfaces.

The idea was solid: keep the same typed boundaries, use zod contracts, and validate every structured output.

But when we deployed to Alibaba Cloud and started using real qwen-plus, the real lesson appeared:

Offline tests are necessary, but they cannot catch the most important failures in an agentic system, because many failures come from the model output distribution, not from your code.

We ran the same flows several times against the live model, and six different issues appeared. All tests were green, but the live behavior still broke in ways that only real model output could expose.

The six rounds of fixes

Round 1 — status values were not always what the contract expected

Our schema expected:

status: "completed" | "failed" | "needs_user_input"

But live qwen-plus returned values like:

"success"
"done"
"in_progress"

Sometimes it also skipped the required summary.

The strict parser rejected the whole task, even when the answer itself was useful.

Fix: I added tolerant preprocessors. They normalize status synonyms and derive fallback summaries when needed.

The lesson here is simple: rejecting a correct answer because of a label mismatch is usually the wrong trade-off.
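A minimal sketch of such a tolerant preprocessor, written in plain TypeScript so it runs without dependencies (in the project this logic would sit in front of the zod contract). The synonym table, the `answer` field, and the fallback summary text are assumptions for illustration. The post does not say how `in_progress` was mapped, so the sketch leaves it unmapped.

```typescript
type TaskStatus = "completed" | "failed" | "needs_user_input";

interface TaskResult {
  status: TaskStatus;
  summary: string;
}

// Assumed synonym table. "in_progress" is deliberately absent because the
// post does not say what it was normalized to.
const STATUS_SYNONYMS = new Map<string, TaskStatus>([
  ["completed", "completed"],
  ["success", "completed"],
  ["done", "completed"],
  ["failed", "failed"],
  ["needs_user_input", "needs_user_input"],
]);

// Hypothetical fallback: reuse the start of an `answer` field as the summary.
function deriveSummary(obj: Record<string, unknown>): string {
  const answer = obj["answer"];
  const text = typeof answer === "string" ? answer.trim() : "";
  if (text === "") return "No summary provided.";
  return text.length > 120 ? text.slice(0, 117) + "..." : text;
}

// Returns a contract-shaped result, or null when the label cannot be mapped.
function normalizeTaskResult(raw: unknown): TaskResult | null {
  if (typeof raw !== "object" || raw === null) return null;
  const obj = raw as Record<string, unknown>;

  const rawStatus = obj["status"];
  const label = typeof rawStatus === "string" ? rawStatus.trim().toLowerCase() : "";
  const status = STATUS_SYNONYMS.get(label);
  if (status === undefined) return null;

  const rawSummary = obj["summary"];
  const summary =
    typeof rawSummary === "string" && rawSummary.trim() !== ""
      ? rawSummary.trim()
      : deriveSummary(obj);

  return { status, summary };
}
```

Only the label and the summary are repaired here. Anything the table does not recognize still fails, so the strict contract keeps its meaning.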

Round 2 — the planner sometimes produced an empty plan

Sometimes the planner returned no tasks and no clarification question.

The output was well-formed but useless, and the app still tried to convert it into a normal response. That produced a misleading canned answer.

Fix: I added a pure plan interpreter that detects degenerate plans, retries once, and then falls back to an honest localized message saying that the system could not derive a plan.

Better to be transparent than to pretend the agent understood something it did not.
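A pure interpreter along these lines could look like the sketch below. The type names, the two-attempt limit expressed as `MAX_ATTEMPTS`, and the exact wording of the localized messages are assumptions, not the project's actual code.

```typescript
type Locale = "en" | "es";

interface PlannedTask {
  kind: string;
}

interface Plan {
  tasks: PlannedTask[];
  clarification?: string;
}

type PlanDecision =
  | { kind: "run"; tasks: PlannedTask[] }
  | { kind: "clarify"; question: string }
  | { kind: "retry" }
  | { kind: "no_plan"; message: string };

// Hypothetical wording for the honest fallback message.
const NO_PLAN_MESSAGE: Record<Locale, string> = {
  en: "I could not derive a plan for this request. Please try rephrasing it.",
  es: "No pude derivar un plan para esta consulta. Por favor, intente reformularla.",
};

// One initial attempt plus one retry, as described in the text.
const MAX_ATTEMPTS = 2;

// Pure function: no model calls and no I/O, only a decision about the plan.
// `attempt` is 1-based and tracked by the caller.
function interpretPlan(plan: Plan, attempt: number, locale: Locale): PlanDecision {
  if (plan.tasks.length > 0) return { kind: "run", tasks: plan.tasks };

  const question = (plan.clarification ?? "").trim();
  if (question !== "") return { kind: "clarify", question };

  // Degenerate plan: no tasks and no clarification question.
  if (attempt < MAX_ATTEMPTS) return { kind: "retry" };
  return { kind: "no_plan", message: NO_PLAN_MESSAGE[locale] };
}
```

Because the function is pure, the degenerate case can be covered by the offline test suite once it has been observed live.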

Round 3 — citations used different field names

The citation schema expected fields like:

documentTitle
passage
confidence

But the model returned variants like:

title
text
high

Also, some confidence labels came in English even when the contract expected Spanish-style values.

Fix: I added alias mapping and per-item citation salvage.

Each citation is validated independently. If one citation is malformed, we drop that one and keep the valid citations.

One bad citation should not destroy four good ones.
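A sketch of alias mapping plus per-item salvage, in dependency-free TypeScript. The Spanish confidence values (`alta`, `media`, `baja`) are an assumption, since the post only says the contract expected Spanish-style labels. The aliases `title` and `text` are the ones observed above.

```typescript
// Assumed contract values; the real enum is not shown in the post.
type Confidence = "alta" | "media" | "baja";

interface Citation {
  documentTitle: string;
  passage: string;
  confidence: Confidence;
}

const CONFIDENCE_ALIASES = new Map<string, Confidence>([
  ["alta", "alta"],
  ["media", "media"],
  ["baja", "baja"],
  ["high", "alta"],
  ["medium", "media"],
  ["low", "baja"],
]);

// Returns the first non-empty string found under any of the given keys.
function firstString(obj: Record<string, unknown>, keys: string[]): string | undefined {
  for (const key of keys) {
    const value = obj[key];
    if (typeof value === "string" && value.trim() !== "") return value.trim();
  }
  return undefined;
}

// Validates one citation on its own; null means "drop this item".
function parseCitation(raw: unknown): Citation | null {
  if (typeof raw !== "object" || raw === null) return null;
  const obj = raw as Record<string, unknown>;

  const documentTitle = firstString(obj, ["documentTitle", "title"]);
  const passage = firstString(obj, ["passage", "text"]);
  const label = firstString(obj, ["confidence"]);
  const confidence = label === undefined ? undefined : CONFIDENCE_ALIASES.get(label.toLowerCase());

  if (documentTitle === undefined || passage === undefined || confidence === undefined) return null;
  return { documentTitle, passage, confidence };
}

// Keeps every valid citation and counts the ones that were dropped.
function salvageCitations(raw: unknown): { kept: Citation[]; dropped: number } {
  const items: unknown[] = Array.isArray(raw) ? raw : [];
  const kept: Citation[] = [];
  for (const item of items) {
    const citation = parseCitation(item);
    if (citation !== null) kept.push(citation);
  }
  return { kept, dropped: items.length - kept.length };
}
```

Returning the `dropped` count makes it possible to record the loss in the trace instead of hiding it.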

Round 4 — the planner scheduled a save without a draft

One flow asked the user to approve saving a report, but the planner had not created the report draft first.

So the user approved the action, and the system correctly answered:

There is no report draft to save.

The logic was safe, but the user experience was broken.

Fix: The plan interpreter now detects an unpaired save task and inserts the missing draft task before it.

The prompt also explains the expected three-task recipe, but the code no longer assumes the model followed it.

This is one of the most important lessons: prompt contracts help, but code must still protect the workflow.
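The repair step can be sketched as a single pass over the plan. The task kinds `draft_report` and `save_report`, the generated task id, and the rule that each save consumes one draft are assumptions made for this example.

```typescript
interface PlannedTask {
  id: string;
  kind: string;
}

// Hypothetical task kinds; the project's real names may differ.
const DRAFT_KIND = "draft_report";
const SAVE_KIND = "save_report";

// Returns a new task list in which every save is preceded by a draft.
function repairSaveWithoutDraft(tasks: PlannedTask[]): PlannedTask[] {
  const repaired: PlannedTask[] = [];
  let draftAvailable = false;

  for (const task of tasks) {
    if (task.kind === DRAFT_KIND) {
      draftAvailable = true;
    } else if (task.kind === SAVE_KIND) {
      if (!draftAvailable) {
        // Unpaired save: insert the missing draft right before it.
        repaired.push({ id: `${task.id}-draft`, kind: DRAFT_KIND });
      }
      // Assumption: a save consumes its draft, so a later save needs a new one.
      draftAvailable = false;
    }
    repaired.push(task);
  }
  return repaired;
}
```

A plan that already follows the recipe passes through unchanged, so the repair is safe to run on every plan.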

Round 5 — the model asked the same clarification again

The app can list sanctioned entities as clickable cards. The user clicks one company, and the run resumes with the selected RUC.

In English, the model sometimes received the selected entity and still asked the same clarification question again.

Fix: Once the user answers a clarification, the system never asks that same clarification again.

From that point, entity resolution is computed from the records. The model can write the narrative, but it does not control whether the entity was resolved.
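One way to express that rule in code is a guard that runs after the model responds. This is a sketch under assumptions: the `EntityRecord` shape, the `NextStep` union, and the function name are hypothetical.

```typescript
interface EntityRecord {
  ruc: string;
  legalName: string;
}

type NextStep =
  | { kind: "resolved"; entity: EntityRecord }
  | { kind: "ask"; question: string }
  | { kind: "error"; message: string };

// Decides what happens next. Once the user has selected a RUC, any
// clarification the model asks for is ignored and the records decide.
function nextStepAfterModel(
  records: EntityRecord[],
  selectedRuc: string | undefined,
  modelClarification: string | undefined,
): NextStep {
  if (selectedRuc !== undefined) {
    const entity = records.find((r) => r.ruc === selectedRuc);
    return entity !== undefined
      ? { kind: "resolved", entity }
      : { kind: "error", message: `No record found for RUC ${selectedRuc}` };
  }

  const question = (modelClarification ?? "").trim();
  if (question !== "") return { kind: "ask", question };
  return { kind: "error", message: "No entity selected and no clarification to ask." };
}
```

The model's output is still used for the narrative, but it is not an input to the branch that decides whether the entity is resolved.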

The listing hardened in round 5: entities as clickable candidates. The resolution is deterministic, so it does not depend on the model's mood.

Round 6 — the model claimed ambiguity when the data was not ambiguous

When asked for a report on:

Minera Las Bambas S.A.

The model claimed the entity was ambiguous, even though the legal name matched one unique record.

Because of that, it produced no data, and the save step failed after approval.

Fix: I added full-name resolution in the entity heuristic.

Now every model-claimed ambiguity is verified against the data. If the data resolves to one entity, the system answers. If the ambiguity is real, the system builds the candidate list itself.

The model can suggest ambiguity, but the data decides whether ambiguity exists.
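The check can be sketched as a resolver that is consulted whenever the model claims ambiguity. The normalization rules and the exact-then-partial matching order are assumptions. The records in the usage note are fictional.

```typescript
interface EntityRecord {
  ruc: string;
  legalName: string;
}

type Resolution =
  | { kind: "resolved"; entity: EntityRecord }
  | { kind: "ambiguous"; candidates: EntityRecord[] }
  | { kind: "not_found" };

// Lowercases, strips accents and punctuation, and collapses whitespace.
function normalizeName(name: string): string {
  return name
    .normalize("NFD")
    .replace(/[\u0300-\u036f]/g, "")
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, " ")
    .trim();
}

// The data, not the model, decides whether the query is ambiguous.
function verifyClaimedAmbiguity(query: string, records: EntityRecord[]): Resolution {
  const wanted = normalizeName(query);
  if (wanted === "") return { kind: "not_found" };

  // Full legal-name matches win over partial matches.
  const exact = records.filter((r) => normalizeName(r.legalName) === wanted);
  const candidates =
    exact.length > 0
      ? exact
      : records.filter((r) => normalizeName(r.legalName).includes(wanted));

  const [first] = candidates;
  if (first === undefined) return { kind: "not_found" };
  if (candidates.length === 1) return { kind: "resolved", entity: first };
  return { kind: "ambiguous", candidates };
}
```

With fictional records named "Minera Ejemplo S.A." and "Minera Ejemplo Norte S.A.C.", the full legal name resolves to one entity, while the shorter query "Minera Ejemplo" returns both as candidates for the clickable list.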

The design principle after these six rounds

The main principle became clear:

Let the LLM narrate, but do not let it own structured outcomes.

After these fixes, the important structured parts are deterministic:

entity resolution
entity listings
chart data
report assembly
approval pairing
citation salvage
ambiguity verification

Qwen is still very useful. It understands the analyst’s intent, plans the work, and writes the legal narrative in Spanish and English using the right Peruvian regulatory terminology.

But the system does not ask the model to be the source of truth for things that should come from the records.

Even the mandatory disclaimer in the regulatory report is a z.literal. The model cannot rephrase it because the model never owns that part.
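The same idea can be shown without any library: the disclaimer is a code-owned constant with a literal type, and report assembly never reads a disclaimer from the model. This is only an illustration. The disclaimer text is a placeholder, and the report fields are hypothetical.

```typescript
// Placeholder wording: the real disclaimer text is not given in the post.
const REPORT_DISCLAIMER = "PLACEHOLDER: mandatory regulatory disclaimer text.";

// What the model is allowed to contribute (hypothetical shape).
interface ModelNarrative {
  narrative: string;
  disclaimer?: string; // ignored even if the model sends one
}

// Rows that come from the records, never from the model (hypothetical shape).
interface SanctionRow {
  ruc: string;
  resolution: string;
}

interface RegulatoryReport {
  narrative: string;
  sanctions: SanctionRow[];
  // Literal type: only the exact constant above type-checks here,
  // which mirrors what z.literal enforces at runtime.
  disclaimer: typeof REPORT_DISCLAIMER;
}

// The model supplies prose; records and constants supply everything else.
function assembleReport(model: ModelNarrative, sanctions: SanctionRow[]): RegulatoryReport {
  return {
    narrative: model.narrative.trim(),
    sanctions: sanctions.map((row) => ({ ...row })),
    disclaimer: REPORT_DISCLAIMER,
  };
}
```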

This is also why the project is called AgentOps Debugger. Every model call and tool call goes into an append-only trace ledger with token counts, latency, and attribution. When the next live issue appears, the trace shows exactly what happened.
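An in-memory sketch of an append-only ledger with those fields. The entry shape and method names are assumptions. A real ledger would persist entries rather than hold them in an array.

```typescript
interface TraceEntry {
  seq: number; // assigned by the ledger, starting at 1
  kind: "model_call" | "tool_call";
  agent: string; // attribution: which agent made the call
  name: string; // model id or tool name
  latencyMs: number;
  inputTokens?: number;
  outputTokens?: number;
}

class TraceLedger {
  private readonly entries: TraceEntry[] = [];

  // The only write operation: there is no update and no delete.
  append(entry: Omit<TraceEntry, "seq">): TraceEntry {
    const stored = Object.freeze({ ...entry, seq: this.entries.length + 1 });
    this.entries.push(stored);
    return stored;
  }

  // Returns a copy so callers cannot reorder or truncate the ledger.
  list(): readonly TraceEntry[] {
    return this.entries.slice();
  }

  // Sums token counts across all entries; tool calls contribute zero.
  totalTokens(): number {
    return this.entries.reduce(
      (sum, e) => sum + (e.inputTokens ?? 0) + (e.outputTokens ?? 0),
      0,
    );
  }
}
```

Freezing each stored entry and returning copies from `list()` keeps the append-only property inside the process, not just in convention.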

The Traceability sheet on the live deploy: qwen-plus model calls with tokens and latency, plus tool calls with attribution.

Practical notes for Qwen Cloud

A few implementation details that were important:

Use `@ai-sdk/openai-compatible`, not `@ai-sdk/openai`

For DashScope’s /compatible-mode/v1, the OpenAI-compatible provider worked better.

The regular OpenAI provider can classify non-OpenAI model ids as reasoning models, send a developer role, and target the Responses API. DashScope rejects requests built on those assumptions.

Include the word “json” in the prompt

DashScope requires the literal word json in the messages before it honors:

response_format: { type: "json_object" }

One line in the system prompt fixed that.
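A small guard can make that requirement hard to forget. The sketch below builds an OpenAI-style chat request body and makes sure the word json appears in the messages before `response_format` is set. The helper names and the hint wording are assumptions, and the check is deliberately case-sensitive to stay on the safe side.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ChatRequest {
  model: string;
  messages: ChatMessage[];
  response_format: { type: "json_object" };
}

// Hypothetical wording; any sentence containing the word "json" would do.
const JSON_HINT = "Respond with a single valid json object.";

// Returns messages that are guaranteed to contain the lowercase word "json".
function ensureJsonWord(messages: ChatMessage[]): ChatMessage[] {
  if (messages.some((m) => m.content.includes("json"))) return messages;

  const systemIndex = messages.findIndex((m) => m.role === "system");
  if (systemIndex === -1) {
    return [{ role: "system", content: JSON_HINT }, ...messages];
  }
  // Append the hint to the existing system prompt instead of adding a second one.
  return messages.map((m, i) =>
    i === systemIndex ? { ...m, content: `${m.content}\n${JSON_HINT}` } : m,
  );
}

// Builds the request body for JSON mode.
function buildJsonRequest(model: string, messages: ChatMessage[]): ChatRequest {
  return {
    model,
    messages: ensureJsonWord(messages),
    response_format: { type: "json_object" },
  };
}
```

The input array is never mutated, so the original prompt remains available for the trace.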

Pin the contract in the prompt, but tolerate variance in the parser

The prompt reduces errors, but it does not eliminate them.

The parser must still be defensive:

normalize aliases
coerce known synonyms
salvage valid items
reject only what is truly unsafe

If you only rely on the prompt, the system will break. If you only rely on the parser, the model will drift more often. You need both.

Use the international DashScope endpoint when needed

For our deployment, the international endpoint was the right choice:

dashscope-intl.aliyuncs.com

Also, for this type of bulk agent workload, qwen-plus was a better trade-off than qwen-max: capable enough and much cheaper.

Test on the deployed URL, not only localhost

One non-model bug was very easy to miss.

crypto.randomUUID worked locally but failed on the plain HTTP demo IP, because browsers expose it only in secure contexts: HTTPS, or localhost, which is treated as secure.

So “New investigation” worked on localhost and broke on the deployed URL.

Live browser testing matters.
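One possible workaround, sketched here as an assumption and not as the fix the project shipped, is to fall back to `crypto.getRandomValues`, which browsers also expose outside secure contexts, and format the bytes as a version 4 UUID. Serving the app over HTTPS removes the problem at its source.

```typescript
// Formats 16 random bytes as an RFC 4122 version 4 UUID string.
function uuidFromBytes(bytes: Uint8Array): string {
  if (bytes.length !== 16) throw new Error("expected 16 random bytes");

  const adjusted = Array.from(bytes, (byte, i) => {
    if (i === 6) return (byte & 0x0f) | 0x40; // set version 4
    if (i === 8) return (byte & 0x3f) | 0x80; // set RFC 4122 variant
    return byte;
  });

  const hex = adjusted.map((b) => b.toString(16).padStart(2, "0")).join("");
  return [
    hex.slice(0, 8),
    hex.slice(8, 12),
    hex.slice(12, 16),
    hex.slice(16, 20),
    hex.slice(20),
  ].join("-");
}

// Minimal structural type so the sketch compiles without DOM typings.
interface CryptoLike {
  randomUUID?: () => string;
  getRandomValues?: (array: Uint8Array) => Uint8Array;
}

// Uses randomUUID when present, otherwise builds the UUID from random bytes.
function newId(): string {
  const c = (globalThis as unknown as { crypto?: CryptoLike }).crypto;
  if (c?.randomUUID) return c.randomUUID();
  if (c?.getRandomValues) return uuidFromBytes(c.getRandomValues(new Uint8Array(16)));
  throw new Error("No cryptographic random source available");
}
```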

The result: a cited answer with charts, evidence chips, and suggested next steps, running live on Alibaba Cloud ECS with real Qwen.

Closing

An agentic system becomes more trustworthy when every conclusion can be traced back to the records, documents, and decisions that produced it.

Qwen Cloud gave us a model strong enough to plan and narrate in two languages over a technical legal domain.

But the engineering lesson of this hackathon was not only how to use the model. The real lesson was deciding what the model should not own.

For this kind of regulatory workflow, the model can help with intent, planning, and language.

But the final structured outcome must be computed, validated, and traceable.

Project: AgentOps Debugger — OEFA Environmental Compliance

Hackathon: Qwen Cloud Hackathon, Track 3 — Agent Society

Code: github.com/GinoLlerena/agentops-debugger-architecture

License: MIT

Stack: Qwen on Qwen Cloud, DashScope, Mastra, AI SDK v5, Hono, React, Docker, Alibaba Cloud ECS.

This post was written with AI assistance (Claude Code) — the same assistant we pair-programmed with during the hackathon. The bugs, fixes, and lessons are from our real build log; fittingly, the project it describes is about never trusting unverified AI output.

Top comments (2)

Tae Kim • Jul 4

The failure-comes-from-model-output-distribution insight is one of the most underappreciated problems in agentic systems -- green tests mean your interfaces are correct, but they say nothing about whether the model satisfies your typed contracts at runtime boundary conditions. The pattern that helps alongside the offline suite is parameterized contract fuzzing: generate synthetic model outputs at the extremes of what the live model might return (truncated structured output, overlong reasoning chains, language mixing) and assert that your Zod validators reject them gracefully rather than silently cascading into downstream state. BM25 as the offline retrieval layer is especially well-matched here because the retrieval behavior is deterministic and inspectable, which means you can reason about whether a live failure is a retrieval failure versus a generation failure without disentangling both at once. Curious which of the six bugs came from structured output schema drift versus model refusal or truncation -- that failure taxonomy would be useful for anyone building compliance-grade agents.

Gino Llerena • Jul 4

Good framing. I think the taxonomy is the most interesting part.

Small update: we are now at 9 bugs. Rounds 7–9 came after the post. 😔

Breakdown:

Schema drift: around 3 cases. For example, status values outside the contract, object vs array problems, and empty plans. Zod caught these at the boundary. Your contract-fuzzing idea would probably catch these before deploy, so yes, I agree with that.
Refusal / truncation: zero cases. qwen-plus did not refuse or cut the answer. The problem was different: it gave wrong data with confidence.
Schema-valid fabrication: this was the worst type. The output passed the typed contract, but the data was wrong: evidence IDs that do not exist, a different company than the resolved records, or an 11-digit RUC that changed between agent steps. A fuzzer does not catch this because the shape is correct. The fix was referential verification: every ID from the model must exist in the real data, or we use a deterministic fallback.
Infra: one class. 429 rate limits when too many calls ran in parallel. Fixed with throttling and graceful degradation.

And yes, BM25 helped a lot. Because retrieval was deterministic, we could quickly say: “retrieval was right, generation lied,” instead of debugging both things at the same time.
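For readers who want a picture of referential verification, here is a minimal sketch. The shapes and names are assumptions, and the fallback shown (drop unknown evidence IDs, always keep the RUC from the resolved records) is one possible deterministic fallback, not necessarily the one we shipped.

```typescript
// What the model returned (hypothetical shape).
interface ModelAnswer {
  ruc: string;
  evidenceIds: string[];
}

interface VerifiedAnswer {
  ruc: string; // always the RUC resolved from the records
  evidenceIds: string[]; // only IDs that exist in the real data
  rejectedIds: string[]; // fabricated IDs, kept for the trace
  rucOverridden: boolean; // true when the model's RUC did not match
}

// Checks every model-supplied reference against the real data.
function verifyReferences(
  answer: ModelAnswer,
  resolvedRuc: string,
  knownEvidenceIds: ReadonlySet<string>,
): VerifiedAnswer {
  const evidenceIds: string[] = [];
  const rejectedIds: string[] = [];

  for (const id of answer.evidenceIds) {
    if (knownEvidenceIds.has(id)) {
      evidenceIds.push(id);
    } else {
      rejectedIds.push(id);
    }
  }

  return {
    ruc: resolvedRuc,
    evidenceIds,
    rejectedIds,
    rucOverridden: answer.ruc !== resolvedRuc,
  };
}
```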