Building auditable environmental-compliance agents on Qwen Cloud — and what changed when we tested with real qwen-plus output
AgentOps Debugger is an agentic application for investigating environmental-compliance history in Peru.
The idea is simple: you ask in Spanish or English about companies regulated by OEFA, Peru’s environmental regulator, and the system retrieves public sanction records and regulatory documents, builds cited answers, drafts structured reports behind a human-approval step, and shows the complete trace of how the answer was produced.
The stack is:
- Qwen qwen-plus on Qwen Cloud through DashScope’s OpenAI-compatible endpoint
- Mastra with AI SDK v5
- Hono + TypeScript backend
- React workspace
- Docker deployment on Alibaba Cloud ECS
The architecture: a Coordinator plans typed tasks, specialist Qwen agents execute them, and every step is stored in an audit ledger.
The strategy that worked well — until live model output entered the system
From the beginning, I designed the project as offline-first.
The full system can run without API keys: seed records, lexical BM25 retrieval, deterministic no-LLM agents, and a local demo from docker compose up.
That helped a lot: all 315 of our tests run without network calls, so the app is testable, reproducible, and easy to demo. Live mode then swaps the deterministic agents for real Qwen agents behind the same interfaces.
The idea was solid: keep the same typed boundaries, use zod contracts, and validate every structured output.
But when we deployed to Alibaba Cloud and started using real qwen-plus, the real lesson appeared:
Offline tests are necessary, but they cannot catch the most important failures in an agentic system, because many failures come from the model output distribution, not from your code.
We ran the same flows several times against the live model, and six different issues appeared. All tests were green, but the live behavior still broke in ways that only real model output could expose.
The six rounds of fixes
Round 1 — status values were not always what the contract expected
Our schema expected:
status: "completed" | "failed" | "needs_user_input"
But live qwen-plus returned values like:
"success"
"done"
"in_progress"
Sometimes it also skipped the required summary.
The strict parser rejected the whole task, even when the answer itself was useful.
Fix: I added tolerant preprocessors. They normalize status synonyms and derive fallback summaries when needed.
The lesson here is simple: rejecting a correct answer because of a label mismatch is usually the wrong trade-off.
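A minimal sketch of what such a tolerant preprocessor could look like, written without zod so it runs standalone. The function names, the `answer` field, and the synonym table are assumptions for illustration, not the project's actual code. The source does not say how "in_progress" was mapped, so this sketch leaves it unmapped and rejects it.

```typescript
// Hypothetical sketch: names and the synonym table are illustrative assumptions.
type TaskStatus = "completed" | "failed" | "needs_user_input";

interface TaskResult {
  status: TaskStatus;
  summary: string;
}

// "success" and "done" were observed in live output; "error" is an assumed extra.
// "in_progress" is deliberately absent, so it falls through to rejection.
const STATUS_SYNONYMS = new Map<string, TaskStatus>([
  ["completed", "completed"],
  ["success", "completed"],
  ["done", "completed"],
  ["failed", "failed"],
  ["error", "failed"],
  ["needs_user_input", "needs_user_input"],
]);

// Build a fallback summary from an assumed `answer` field, else from the status.
function deriveSummary(obj: Record<string, unknown>, status: TaskStatus): string {
  const answer = typeof obj.answer === "string" ? obj.answer.trim() : "";
  if (answer === "") return `Task ${status} (no summary returned by the model)`;
  return answer.length > 120 ? `${answer.slice(0, 117)}...` : answer;
}

// Returns a contract-shaped result, or null when the status is unknown.
function normalizeTaskResult(raw: unknown): TaskResult | null {
  if (typeof raw !== "object" || raw === null) return null;
  const obj = raw as Record<string, unknown>;
  const key = typeof obj.status === "string" ? obj.status.trim().toLowerCase() : "";
  const status = STATUS_SYNONYMS.get(key);
  if (status === undefined) return null;
  const summary =
    typeof obj.summary === "string" && obj.summary.trim() !== ""
      ? obj.summary.trim()
      : deriveSummary(obj, status);
  return { status, summary };
}
```

In a zod-based codebase the same normalization would typically run as a preprocessing step before the strict schema, so the strict contract stays the single definition of a valid task result.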
Round 2 — the planner sometimes produced an empty plan
Sometimes the planner returned no tasks and no clarification question.
The output carried nothing to act on, but the app still tried to convert it into a normal response, which produced a misleading canned answer.
Fix: I added a pure plan interpreter that detects degenerate plans, retries once, and then falls back to an honest localized message saying that the system could not derive a plan.
Better to be transparent than to pretend the agent understood something it did not.
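The decision logic described above can be sketched as a pure function: given a plan and the attempt number, it says whether to execute, ask, retry, or fall back. The type names and the fallback wording in both languages are hypothetical, and the sketch assumes tasks take priority when a plan carries both tasks and a question.

```typescript
// Hypothetical sketch: type names and message wording are assumptions.
type Locale = "es" | "en";

interface PlanTask {
  id: string;
  kind: string;
}

interface Plan {
  tasks: PlanTask[];
  clarification?: string;
}

type PlanDecision =
  | { kind: "execute"; tasks: PlanTask[] }
  | { kind: "clarify"; question: string }
  | { kind: "retry" }
  | { kind: "fallback"; message: string };

// Honest, localized message shown when no plan could be derived.
const NO_PLAN_MESSAGE: Record<Locale, string> = {
  en: "I could not derive a plan for this request. Please rephrase it or add detail.",
  es: "No pude derivar un plan para esta solicitud. Reformule la consulta o agregue detalles.",
};

// Pure: no I/O and no model calls, so it is easy to unit test offline.
// `attempt` is 0 for the first planner call and 1 for the single retry.
function interpretPlan(plan: Plan, attempt: number, locale: Locale): PlanDecision {
  if (plan.tasks.length > 0) return { kind: "execute", tasks: plan.tasks };
  const question = plan.clarification?.trim() ?? "";
  if (question !== "") return { kind: "clarify", question };
  // Degenerate plan: no tasks and no question.
  if (attempt === 0) return { kind: "retry" };
  return { kind: "fallback", message: NO_PLAN_MESSAGE[locale] };
}
```

The caller owns the retry loop and the planner call; keeping the interpreter pure means the degenerate case can be covered by the offline test suite.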
Round 3 — citations used different field names
The citation schema expected fields like:
documentTitle
passage
confidence
But the model returned variants like:
title
text
high
Some confidence labels also came back in English (such as high) even though the contract expected Spanish-style values.
Fix: I added alias mapping and per-item citation salvage.
Each citation is validated independently. If one citation is malformed, we drop that one and keep the valid citations.
One bad citation should not destroy four good ones.
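A minimal sketch of alias mapping with per-item salvage. The Spanish confidence values ("alta", "media", "baja") are an assumption, since the source only says the contract expected Spanish-style values; the helper names are hypothetical as well.

```typescript
// Hypothetical sketch: the Spanish labels and helper names are assumptions.
type Confidence = "alta" | "media" | "baja";

interface Citation {
  documentTitle: string;
  passage: string;
  confidence: Confidence;
}

// English labels seen in live output are mapped onto the contract's values.
const CONFIDENCE_ALIASES = new Map<string, Confidence>([
  ["alta", "alta"], ["high", "alta"],
  ["media", "media"], ["medium", "media"],
  ["baja", "baja"], ["low", "baja"],
]);

// Return the first non-empty string found under any of the accepted field names.
function pickString(obj: Record<string, unknown>, names: string[]): string | undefined {
  for (const name of names) {
    const value = obj[name];
    if (typeof value === "string" && value.trim() !== "") return value.trim();
  }
  return undefined;
}

// Validate one citation on its own; null means "drop this item only".
function parseCitation(item: unknown): Citation | null {
  if (typeof item !== "object" || item === null) return null;
  const obj = item as Record<string, unknown>;
  const documentTitle = pickString(obj, ["documentTitle", "title"]);
  const passage = pickString(obj, ["passage", "text"]);
  const label = pickString(obj, ["confidence"]);
  const confidence = label === undefined ? undefined : CONFIDENCE_ALIASES.get(label.toLowerCase());
  if (documentTitle === undefined || passage === undefined || confidence === undefined) return null;
  return { documentTitle, passage, confidence };
}

// Keep every valid citation and count the ones that were dropped.
function salvageCitations(raw: unknown): { kept: Citation[]; dropped: number } {
  if (!Array.isArray(raw)) return { kept: [], dropped: 0 };
  const kept: Citation[] = [];
  let dropped = 0;
  for (const item of raw) {
    const citation = parseCitation(item);
    if (citation === null) dropped += 1;
    else kept.push(citation);
  }
  return { kept, dropped };
}
```

Returning the dropped count alongside the kept items lets the trace record that salvage happened instead of hiding it.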
Round 4 — the planner scheduled a save without a draft
One flow asked the user to approve saving a report, but the planner had not created the report draft first.
So the user approved the action, and the system correctly answered:
There is no report draft to save.
The logic was safe, but the user experience was broken.
Fix: The plan interpreter now detects an unpaired save task and inserts the missing draft task before it.
The prompt also explains the expected three-task recipe, but the code no longer assumes the model followed it.
This is one of the most important lessons: prompt contracts help, but code must still protect the workflow.
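One way to sketch the pairing repair is a single pass over the planned tasks that inserts a draft task in front of any save that has none before it. The task kinds below are hypothetical; the source does not list the actual three tasks in the recipe.

```typescript
// Hypothetical sketch: task kinds and the id scheme are assumptions.
type ReportTaskKind = "retrieve_records" | "draft_report" | "save_report";

interface ReportTask {
  id: string;
  kind: ReportTaskKind;
}

// Returns a new task list in which every save is preceded by a draft.
// The input array is not mutated.
function ensureDraftBeforeSave(tasks: ReportTask[]): ReportTask[] {
  const repaired: ReportTask[] = [];
  let draftSeen = false;
  for (const task of tasks) {
    if (task.kind === "draft_report") draftSeen = true;
    if (task.kind === "save_report" && !draftSeen) {
      // Unpaired save: insert the missing draft right before it.
      repaired.push({ id: `${task.id}-draft`, kind: "draft_report" });
      draftSeen = true;
    }
    repaired.push(task);
  }
  return repaired;
}
```

Because the repair runs before the approval prompt, the user is only ever asked to approve a save that has something to save.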
Round 5 — the model asked the same clarification again
The app can list sanctioned entities as clickable cards. The user clicks one company, and the run resumes with the selected RUC.
In English-language runs, the model sometimes received the selected entity and still asked the same clarification question again.
Fix: Once the user answers a clarification, the system never asks that same clarification again.
From that point, entity resolution is computed from the records. The model can write the narrative, but it does not control whether the entity was resolved.
The listing as hardened in round 5: entities appear as clickable candidates. Resolution is deterministic, so it does not depend on the model’s mood.
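The "never ask twice" rule can be sketched as a small gate between the model's turn and the UI. Everything here is hypothetical (the state shape, the clarification key, and the placeholder RUC); the point is only that the run state, not the model, decides whether a question is still open.

```typescript
// Hypothetical sketch: RunState, ModelTurn, and the "entity" key are illustrative.
interface RunState {
  // clarification key -> the user's answer (for example a selected RUC)
  readonly answered: ReadonlyMap<string, string>;
}

interface ModelTurn {
  narrative: string;
  clarification?: { key: string; question: string };
}

type NextStep =
  | { kind: "ask"; key: string; question: string }
  | { kind: "proceed"; narrative: string };

// Record the user's answer without mutating the previous state.
function recordAnswer(state: RunState, key: string, answer: string): RunState {
  const answered = new Map(state.answered);
  answered.set(key, answer);
  return { answered };
}

// A clarification the user already answered is ignored, whatever the model says.
function gateClarification(state: RunState, turn: ModelTurn): NextStep {
  const c = turn.clarification;
  if (c !== undefined && !state.answered.has(c.key)) {
    return { kind: "ask", key: c.key, question: c.question };
  }
  return { kind: "proceed", narrative: turn.narrative };
}
```

For example, after `recordAnswer(state, "entity", "20000000001")` (a placeholder RUC), a model turn that repeats the "entity" question resolves to `proceed` instead of reaching the user.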
Round 6 — the model claimed ambiguity when the data was not ambiguous
When asked for a report on:
Minera Las Bambas S.A.
The model claimed the entity was ambiguous, even though the legal name matched one unique record.
Because of that, it produced no data, and the save step failed after approval.
Fix: I added full-name resolution in the entity heuristic.
Now every model-claimed ambiguity is verified against the data. If the data resolves to one entity, the system answers. If the ambiguity is real, the system builds the candidate list itself.
The model can suggest ambiguity, but the data decides whether ambiguity exists.
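A minimal sketch of verifying a claimed ambiguity against the records. The record shape, the matching rules, the second company name, and the RUC values are all made up for illustration; the project's actual heuristic may differ.

```typescript
// Hypothetical sketch: record shape, matching rules, and sample data are assumptions.
interface EntityRecord {
  ruc: string;
  legalName: string;
}

type Resolution =
  | { kind: "resolved"; entity: EntityRecord }
  | { kind: "ambiguous"; candidates: EntityRecord[] }
  | { kind: "not_found" };

// Lowercase, strip accents, and collapse punctuation so "S.A." and "s a" compare equal.
function normalizeName(name: string): string {
  return name
    .normalize("NFD")
    .replace(/[\u0300-\u036f]/g, "")
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, " ")
    .trim();
}

// The data decides: a unique full-name match wins before any partial matching.
function resolveEntity(query: string, records: EntityRecord[]): Resolution {
  const q = normalizeName(query);
  if (q === "") return { kind: "not_found" };
  const exact = records.filter((r) => normalizeName(r.legalName) === q);
  const pool = exact.length > 0
    ? exact
    : records.filter((r) => normalizeName(r.legalName).includes(q));
  if (pool.length === 0) return { kind: "not_found" };
  if (pool.length === 1) return { kind: "resolved", entity: pool[0] };
  // Real ambiguity: the candidate list is built from records, not from model text.
  return { kind: "ambiguous", candidates: pool };
}
```

With two sample records, "Minera Las Bambas S.A." and a fictitious "Minera Ejemplo S.A.C.", the full legal name resolves to one entity, while the bare query "minera" yields a two-item candidate list.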
The design principle after these six rounds
The main principle became clear:
Let the LLM narrate, but do not let it own structured outcomes.
After these fixes, the important structured parts are deterministic:
- entity resolution
- entity listings
- chart data
- report assembly
- approval pairing
- citation salvage
- ambiguity verification
Qwen is still very useful. It understands the analyst’s intent, plans the work, and writes the legal narrative in Spanish and English using the right Peruvian regulatory terminology.
But the system does not ask the model to be the source of truth for things that should come from the records.
Even the mandatory disclaimer in the regulatory report is a z.literal. The model cannot rephrase it because the model never owns that part.
This is also why the project is called AgentOps Debugger. Every model call and tool call goes into an append-only trace ledger with token counts, latency, and attribution. When the next live issue appears, the trace shows exactly what happened.
The Traceability sheet on the live deploy: qwen-plus model calls with tokens and latency, plus tool calls with attribution.
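An in-memory sketch of an append-only ledger of this kind. The field names are assumptions based on the description (token counts, latency, attribution), and a real ledger would persist entries rather than hold them in an array.

```typescript
// Hypothetical sketch: field names are assumptions; storage is in-memory only.
interface TraceEntry {
  seq: number;
  kind: "model_call" | "tool_call";
  name: string;        // model id or tool name
  agent: string;       // attribution: which agent made the call
  latencyMs: number;
  inputTokens?: number;
  outputTokens?: number;
}

class TraceLedger {
  private readonly entries: Readonly<TraceEntry>[] = [];

  // The only write operation: entries get a sequence number and are frozen.
  append(entry: Omit<TraceEntry, "seq">): Readonly<TraceEntry> {
    const stored = Object.freeze({ ...entry, seq: this.entries.length + 1 });
    this.entries.push(stored);
    return stored;
  }

  // Returns a copy so callers cannot reorder or remove ledger entries.
  list(): readonly Readonly<TraceEntry>[] {
    return this.entries.slice();
  }

  // Sum of input and output tokens across all recorded calls.
  totalTokens(): number {
    return this.entries.reduce(
      (sum, e) => sum + (e.inputTokens ?? 0) + (e.outputTokens ?? 0),
      0,
    );
  }
}
```

There is deliberately no update or delete method: corrections are new entries, which keeps the history of a run inspectable after the fact.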
Practical notes for Qwen Cloud
A few implementation details that were important:
Use @ai-sdk/openai-compatible, not @ai-sdk/openai
For DashScope’s /compatible-mode/v1, the OpenAI-compatible provider worked better.
The regular OpenAI provider can classify non-OpenAI model IDs as reasoning models, send a developer role, and target the Responses API. DashScope’s compatible-mode endpoint rejects requests built on those assumptions.
Include the word “json” in the prompt
DashScope requires the literal word json in the messages before it honors:
response_format: { type: "json_object" }
One line in the system prompt fixed that.
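A small guard can make that requirement hard to forget. This sketch checks the outgoing messages for the lowercase word and appends a hint to the system prompt if it is missing. The helper name and hint wording are hypothetical, and the lowercase-only check is a conservative assumption: the source does not say whether DashScope matches case-insensitively.

```typescript
// Hypothetical sketch: helper name and hint wording are assumptions.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

const JSON_HINT = "Respond with a single json object.";

// Ensures the lowercase word "json" appears somewhere in the messages
// before a request that sets response_format to json_object.
function ensureJsonKeyword(messages: ChatMessage[]): ChatMessage[] {
  if (messages.some((m) => m.content.includes("json"))) return messages;
  const systemIndex = messages.findIndex((m) => m.role === "system");
  if (systemIndex === -1) {
    // No system prompt yet: add one that carries the hint.
    return [{ role: "system", content: JSON_HINT }, ...messages];
  }
  // Append the hint to the existing system prompt without mutating the input.
  return messages.map((m, i) =>
    i === systemIndex ? { ...m, content: `${m.content}\n${JSON_HINT}` } : m,
  );
}
```

Running the guard at the single place where requests are built means a later prompt edit cannot silently drop the keyword.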
Pin the contract in the prompt, but tolerate variance in the parser
The prompt reduces errors, but it does not eliminate them.
The parser must still be defensive:
- normalize aliases
- coerce known synonyms
- salvage valid items
- reject only what is truly unsafe
If you only rely on the prompt, the system will break. If you only rely on the parser, the model will drift more often. You need both.
Use the international DashScope endpoint when needed
For our deployment, the international endpoint was the right choice:
dashscope-intl.aliyuncs.com
Also, for this type of bulk agent workload, qwen-plus was a better trade-off than qwen-max: capable enough and much cheaper.
Test on the deployed URL, not only localhost
One non-model bug was very easy to miss.
crypto.randomUUID worked locally but failed on the plain-HTTP demo IP, because browsers expose it only in secure contexts (HTTPS or localhost).
So “New investigation” worked on localhost and broke on the deployed URL.
Live browser testing matters.
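One possible workaround, sketched under the assumption that serving the demo over HTTPS is not an option: fall back to crypto.getRandomValues, which browsers do not restrict to secure contexts, and format the bytes as a version 4 UUID. The function names are hypothetical and the source does not say which fix the project used.

```typescript
// Hypothetical sketch: function names are assumptions, not the project's code.

// Format 16 random bytes as an RFC 4122 version 4 UUID string.
function uuidFromBytes(bytes: Uint8Array): string {
  if (bytes.length !== 16) throw new Error("expected 16 bytes");
  const b = Uint8Array.from(bytes); // copy so the caller's buffer is untouched
  b[6] = (b[6] & 0x0f) | 0x40; // set version 4
  b[8] = (b[8] & 0x3f) | 0x80; // set the RFC 4122 variant
  const hex = Array.from(b, (x) => x.toString(16).padStart(2, "0"));
  return [
    hex.slice(0, 4).join(""),
    hex.slice(4, 6).join(""),
    hex.slice(6, 8).join(""),
    hex.slice(8, 10).join(""),
    hex.slice(10, 16).join(""),
  ].join("-");
}

interface MinimalCrypto {
  randomUUID?: () => string;
  getRandomValues?: (array: Uint8Array) => Uint8Array;
}

// Prefer randomUUID where it exists; otherwise build the id from random bytes.
function newId(): string {
  const c = (globalThis as unknown as { crypto?: MinimalCrypto }).crypto;
  if (c !== undefined && typeof c.randomUUID === "function") return c.randomUUID();
  if (c !== undefined && typeof c.getRandomValues === "function") {
    return uuidFromBytes(c.getRandomValues(new Uint8Array(16)));
  }
  throw new Error("Web Crypto is not available in this environment");
}
```

Serving the deployment over HTTPS removes the need for the fallback entirely; the sketch only covers the plain-HTTP demo case.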
The result: a cited answer with charts, evidence chips, and suggested next steps, running live on Alibaba Cloud ECS with real Qwen.
Closing
An agentic system becomes more trustworthy when every conclusion can be traced back to the records, documents, and decisions that produced it.
Qwen Cloud gave us a model strong enough to plan and narrate in two languages over a technical legal domain.
But the engineering lesson of this hackathon was not only how to use the model. The real lesson was deciding what the model should not own.
For this kind of regulatory workflow, the model can help with intent, planning, and language.
But the final structured outcome must be computed, validated, and traceable.
Project: AgentOps Debugger — OEFA Environmental Compliance
Hackathon: Qwen Cloud Hackathon, Track 3 — Agent Society
Code: github.com/GinoLlerena/agentops-debugger-architecture
License: MIT
Stack: Qwen on Qwen Cloud, DashScope, Mastra, AI SDK v5, Hono, React, Docker, Alibaba Cloud ECS.
This post was written with AI assistance (Claude Code) — the same assistant we pair-programmed with during the hackathon. The bugs, fixes, and lessons are from our real build log; fittingly, the project it describes is about never trusting unverified AI output.
Top comments (1)
The failure-comes-from-model-output-distribution insight is one of the most underappreciated problems in agentic systems -- green tests mean your interfaces are correct, but they say nothing about whether the model satisfies your typed contracts at runtime boundary conditions. The pattern that helps alongside the offline suite is parameterized contract fuzzing: generate synthetic model outputs at the extremes of what the live model might return (truncated structured output, overlong reasoning chains, language mixing) and assert that your Zod validators reject them gracefully rather than silently cascading into downstream state. BM25 as the offline retrieval layer is especially well-matched here because the retrieval behavior is deterministic and inspectable, which means you can reason about whether a live failure is a retrieval failure versus a generation failure without disentangling both at once. Curious which of the six bugs came from structured output schema drift versus model refusal or truncation -- that failure taxonomy would be useful for anyone building compliance-grade agents.