DEV Community

bob lee

Building a Self-Verifying FTIR Agent with Qwen Function Calling

Built for Track 4: Autopilot Agent — #QwenCloudHackathon


Most AI "agents" are API wrappers with a system prompt. Upload data, call one endpoint, return the result. No verification, no reasoning about what went wrong, no ability to self-correct.

For the Qwen Cloud Hackathon, I built ChemSpectra Agent — an FTIR spectral analysis system where Qwen-3.7-Max autonomously selects tools, cross-validates evidence across multiple results, and triggers self-verification when confidence is low. The key insight: an agent that checks its own work catches errors that single-pass analysis misses.

Why Function Calling Changes Everything

The agent has access to 5 analysis tools, each hitting a different endpoint of the FTIR.fun spectral library (130,000+ reference spectra):

| Tool | Purpose |
|---|---|
| identify_material | Match spectrum against reference library, return ranked candidates |
| explain_peaks | Explain what chemical bond vibration each peak represents |
| assign_functional_groups | Map peaks to functional groups (C=O, O-H, N-H, etc.) |
| match_library_topk | Rapid top-K screening without deep analysis |
| search_public_results | Search publicly shared analysis cases (via MCP) |

Instead of hardcoding which tools to call, I define these as Qwen Function Calling schemas and let the model decide:

from dashscope import Generation  # Alibaba Cloud DashScope SDK

AGENT_TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "identify_material",
            "description": "Match spectrum against 130,000+ reference spectra...",
            "parameters": {
                "type": "object",
                "properties": {
                    "top_k": {"type": "integer", "default": 10},
                    "sample_type": {"type": "string"},
                },
            },
        },
    },
    # ... 4 more tools
]

response = Generation.call(
    api_key=DASHSCOPE_API_KEY,
    model="qwen3.7-max",
    messages=messages,
    tools=AGENT_TOOLS,       # Qwen decides which to call
    result_format="message",
)

The result: different questions trigger different tool combinations. "What is this material?" → identify_material + explain_peaks. "Deformulate this sample" → all three analytical tools. "Quick screening" → just match_library_topk. The LLM decides, not the developer.
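Once the model has picked its tools, the agent still has to route each call to real code. Here is a minimal sketch of that routing, assuming tool calls arrive as dicts shaped like `{"function": {"name": ..., "arguments": "<JSON string>"}}`. The two handlers are hypothetical stand-ins, not the project's actual FTIR.fun clients.

```python
import json

# Hypothetical stand-in handlers; a real agent would call the FTIR.fun API here.
def identify_material(top_k=10, sample_type=None):
    return {"tool": "identify_material", "top_k": top_k, "sample_type": sample_type}

def match_library_topk(top_k=5):
    return {"tool": "match_library_topk", "top_k": top_k}

TOOL_HANDLERS = {
    "identify_material": identify_material,
    "match_library_topk": match_library_topk,
}

def dispatch_tool_call(tool_call):
    """Route one tool call to its handler and return a JSON-serializable result."""
    fn = tool_call["function"]
    name = fn["name"]
    handler = TOOL_HANDLERS.get(name)
    if handler is None:
        return {"error": f"unknown tool: {name}"}
    raw_args = fn.get("arguments") or "{}"
    try:
        # Arguments usually arrive as a JSON string; accept a dict as well.
        args = json.loads(raw_args) if isinstance(raw_args, str) else dict(raw_args)
    except json.JSONDecodeError as exc:
        return {"error": f"bad arguments for {name}: {exc}"}
    try:
        return handler(**args)
    except TypeError as exc:
        # Raised when the model supplies a parameter the handler does not accept.
        return {"error": f"bad arguments for {name}: {exc}"}
```

Returning errors as data instead of raising lets the model see what went wrong on its next turn and adjust its call.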

The ReAct Loop

The agent runs a Think → Act → Observe loop, up to 6 iterations:

  1. Qwen receives the user request + tool schemas
  2. Qwen returns tool_calls — which tools to invoke and with what parameters
  3. Agent executes the tools against the FTIR.fun API
  4. Results are formatted and sent back to Qwen
  5. Qwen either calls more tools or produces a final synthesis

In practice, most analyses complete in 2-3 iterations. Qwen's enable_thinking=True mode shows the full chain-of-thought reasoning, so you can see why it chose each tool.
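The loop itself can be sketched without any SDK. In this version `call_model` and `run_tool` are hypothetical callables passed in by the caller, and the reply shape (`content` plus an optional `tool_calls` list) is an assumption for illustration, not the exact DashScope response format.

```python
import json

MAX_ITERATIONS = 6  # the iteration cap described above

def react_loop(messages, call_model, run_tool, max_iterations=MAX_ITERATIONS):
    """Run Think -> Act -> Observe until the model stops requesting tools.

    call_model(messages) -> {"content": str, "tool_calls": [{"id", "name", "arguments"}]}
    run_tool(name, arguments) -> JSON-serializable result
    Both callables are assumptions made for this sketch.
    """
    messages = list(messages)  # do not mutate the caller's list
    for iteration in range(1, max_iterations + 1):
        reply = call_model(messages)
        tool_calls = reply.get("tool_calls") or []
        messages.append({"role": "assistant", "content": reply.get("content", ""),
                         "tool_calls": tool_calls})
        if not tool_calls:
            # No tools requested: treat the content as the final synthesis.
            return reply.get("content", ""), iteration, messages
        for call in tool_calls:
            result = run_tool(call["name"], call.get("arguments", {}))
            messages.append({"role": "tool", "tool_call_id": call.get("id"),
                             "name": call["name"], "content": json.dumps(result)})
    return None, max_iterations, messages  # cap reached without a final answer


# Usage with scripted stand-ins for the model and the tools.
def fake_model(messages):
    if not any(m["role"] == "tool" for m in messages):
        return {"content": "", "tool_calls": [
            {"id": "c1", "name": "identify_material", "arguments": {"top_k": 3}}]}
    return {"content": "Best match: PET", "tool_calls": []}

def fake_tool(name, arguments):
    return {"matches": [{"name": "PET", "similarity": 0.91}]}

answer, iterations, history = react_loop(
    [{"role": "user", "content": "What is this material?"}], fake_model, fake_tool)
```

Returning `None` when the cap is hit makes the "ran out of iterations" case explicit, so the caller can decide whether to report partial results.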

Cross-Validation: Where It Gets Interesting

After the ReAct loop, the agent doesn't just return results. It runs two automated checks:

Confidence estimation — calculated from match scores, candidate score gaps, and functional group coverage:

def _estimate_confidence(self, session):
    scores = []
    id_result = session.tool_results.get("identify_material", {})
    if id_result.get("matches"):
        top_sim = id_result["matches"][0].get("similarity", 0)
        scores.append(top_sim)
        if len(id_result["matches"]) >= 2:
            gap = top_sim - id_result["matches"][1].get("similarity", 0)
            scores.append(min(1.0, gap * 5))  # larger gap = more confident
    # ... more signals from other tools
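The excerpt collects signals but elides how they are combined. As an illustration only, the sketch below assumes a plain mean of the collected signals; the project's real aggregation may weight them differently. It also isolates the gap scaling so its behavior is easy to see.

```python
def gap_signal(top_similarity, runner_up_similarity, scale=5.0):
    """Scale the score gap as in the excerpt: min(1.0, gap * 5).

    The lower clamp at 0.0 is an addition in this sketch, in case
    matches are not sorted by similarity.
    """
    gap = top_similarity - runner_up_similarity
    return max(0.0, min(1.0, gap * scale))

def combine_signals(scores):
    """Assumed aggregation: plain mean of the signals, 0.0 when none exist."""
    if not scores:
        return 0.0
    return sum(scores) / len(scores)

# Top match at 0.90 with a runner-up at 0.80.
signals = [0.90, gap_signal(0.90, 0.80)]
confidence = combine_signals(signals)
```

With a scale of 5, any gap of 0.20 or more saturates the gap signal at 1.0, so the multiplier effectively defines what counts as a "clear winner."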

Evidence conflict detection — compares outputs across tools. If identify_material says "PET" but assign_functional_groups found no ester groups, that's a contradiction:

expected_groups = {
    "pet": ["ester", "c=o", "aromatic"],
    "nylon": ["amide", "n-h", "c=o"],
    "polyethylene": ["c-h", "ch2", "methylene"],
    "silicone": ["si-o", "si-c", "siloxane"],
}
# If 2+ expected groups are missing → conflict
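A minimal sketch of the check built on that table, assuming the identified material name has already been normalized to one of the table's keys and that detected groups arrive as a list of strings. The function name and the shape of the returned record are hypothetical.

```python
EXPECTED_GROUPS = {
    "pet": ["ester", "c=o", "aromatic"],
    "nylon": ["amide", "n-h", "c=o"],
    "polyethylene": ["c-h", "ch2", "methylene"],
    "silicone": ["si-o", "si-c", "siloxane"],
}

def find_group_conflict(material_name, detected_groups, min_missing=2):
    """Return a conflict record if 2+ expected groups are absent, else None."""
    key = material_name.strip().lower()
    expected = EXPECTED_GROUPS.get(key)
    if expected is None:
        return None  # no expectations recorded for this material
    detected = {g.strip().lower() for g in detected_groups}
    missing = [g for g in expected if g not in detected]
    if len(missing) >= min_missing:
        return {"type": "functional_group_mismatch", "material": key, "missing": missing}
    return None

# PET was identified, but only a carbonyl and a hydroxyl were assigned.
conflict = find_group_conflict("PET", ["C=O", "O-H"])
```

Exact key lookup is used here instead of substring matching because "polyethylene terephthalate" contains "polyethylene", which would otherwise trigger the wrong rule.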

Self-Verification Round

When confidence < 0.75 or conflicts are detected, the agent automatically triggers a verification round. Qwen is told exactly what went wrong:

ISSUES DETECTED:
- functional_group_mismatch: material="pet", missing=["ester", "aromatic"]
- low_confidence: 0.62 (threshold: 0.75)

Qwen then autonomously calls additional tools to investigate. After verification, confidence is recalculated. In testing, I've seen confidence traces like [0.62, 0.84]: a 35% relative improvement (0.22 absolute) from one verification round.
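The trigger and the issue report from this section can be sketched as follows. The function names and the conflict record shape are hypothetical; only the 0.75 threshold and the report format come from the examples in this post.

```python
CONFIDENCE_THRESHOLD = 0.75

def needs_verification(confidence, conflicts, threshold=CONFIDENCE_THRESHOLD):
    """Verify when confidence is below the threshold or any conflict exists."""
    return confidence < threshold or bool(conflicts)

def build_issue_report(confidence, conflicts, threshold=CONFIDENCE_THRESHOLD):
    """Format the detected issues as the plain-text block sent to the model."""
    lines = ["ISSUES DETECTED:"]
    for c in conflicts:
        missing = ", ".join(f'"{g}"' for g in c["missing"])
        lines.append(f'- {c["type"]}: material="{c["material"]}", missing=[{missing}]')
    if confidence < threshold:
        lines.append(f"- low_confidence: {confidence:.2f} (threshold: {threshold:.2f})")
    return "\n".join(lines)

def relative_gain(before, after):
    """Relative change between two confidence values, e.g. 0.62 -> 0.84."""
    return (after - before) / before

conflicts = [{"type": "functional_group_mismatch", "material": "pet",
              "missing": ["ester", "aromatic"]}]
report = build_issue_report(0.62, conflicts)
gain = relative_gain(0.62, 0.84)  # roughly 0.355, i.e. about 35%
```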

Self-Repair

When Qwen's structured JSON output fails to parse (it happens: LLMs sometimes wrap JSON in markdown code blocks), the original output is sent back to Qwen together with a parse-error notice and an instruction to return only valid JSON:

repair_messages = messages + [
    {"role": "assistant", "content": raw},
    {"role": "user", "content": f"Parse error: {raw[:200]!r}\nReturn ONLY valid JSON."},
]
raw_retry = self._call_qwen(repair_messages)
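Since wrapped code fences are the failure mode called out above, one option is to strip them locally before spending a model call on repair. The sketch below is an assumption about how that could look, not the project's confirmed approach; `ask_model` is a hypothetical callable that returns the model's corrected output.

```python
import json
import re

FENCE = "`" * 3  # built this way to avoid a literal fence inside this example
FENCE_RE = re.compile(rf"^{FENCE}[a-zA-Z]*\s*\n(.*?)\n?\s*{FENCE}$", re.DOTALL)

def strip_code_fence(text):
    """Return the body of a fenced block, or the stripped text if unfenced."""
    text = text.strip()
    match = FENCE_RE.match(text)
    return match.group(1) if match else text

def parse_with_repair(raw, ask_model, max_retries=1):
    """Parse JSON, asking the model for a corrected version on failure.

    ask_model(raw_output, error_message) -> str is a hypothetical callable.
    Raises json.JSONDecodeError if the last attempt still fails.
    """
    for attempt in range(max_retries + 1):
        try:
            return json.loads(strip_code_fence(raw))
        except json.JSONDecodeError as exc:
            if attempt == max_retries:
                raise  # surface the failure instead of hiding it
            raw = ask_model(raw, str(exc))

# A fenced reply parses without any repair call.
fenced = FENCE + 'json\n{"material": "PET"}\n' + FENCE
parsed = parse_with_repair(fenced, ask_model=lambda raw, err: raw)
```

Passing the decoder's error message to the model gives it the position and nature of the failure, which is more specific than echoing the raw output alone.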

Near-100% recovery rate. No silent failures.

Why This Matters for Real Applications

In regulated industries — pharmaceutical QC under FDA 21 CFR Part 11, forensic substance identification, environmental contaminant detection — an AI that returns wrong results without flagging uncertainty is dangerous. ChemSpectra Agent's self-verification turns "AI that gives answers" into "AI that checks its work." The confidence trace provides an audit trail that fits existing compliance frameworks.

All LLM reasoning — tool selection, synthesis, verification, self-repair, follow-up chat, report generation — runs through Alibaba Cloud's dashscope SDK with qwen3.7-max. Six distinct call sites, one provider.

Try it: github.com/jxbaoxiaodong/chemspectra-agent
