DEV Community: bob lee

Building a Self-Verifying FTIR Agent with Qwen Function Calling

bob lee — Wed, 08 Jul 2026 13:03:45 +0000

This post walks through the architecture of ChemSpectra Agent, an AI autopilot for FTIR spectral interpretation built during the Qwen Cloud Hackathon. The agent uses Qwen 3.7-Max (Alibaba Cloud DashScope) with Function Calling to orchestrate six analytical tools, then passes results through a deterministic rule engine that produces auditable, reproducible verdicts.

The core design constraint: the LLM chooses which tool to call, but Python rules — not the model — decide the verdict. The system runs against a production library of 130,000+ reference spectra.

The problem: near-tied candidates

A spectral library search returns ranked matches. When the top score is 0.95 and the second is 0.70, the answer is obvious. But in real-world hard cases, the gap between #1 and #2 can be 0.004 — with fifteen candidates scoring above 0.90.

A ranked list does not answer the analyst's question. They need to know: is the evidence strong enough to act on the top candidate, and if not, what should I do next?
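To make "near-tied" concrete, here is a minimal sketch of how a ranked score list could be summarized before any verdict is attempted. The function name, the 0.05 gap threshold, and the 0.90 crowd floor are illustrative assumptions, not the project's actual settings.

```python
def summarize_ranking(scores, gap_threshold=0.05, crowd_floor=0.90):
    """Summarize how contested the top of a ranked score list is.

    gap_threshold and crowd_floor are assumed values for illustration.
    """
    ranked = sorted(scores, reverse=True)
    if not ranked:
        return {"lead_gap": None, "crowd": 0, "near_tied": False}
    # With a single candidate there is no runner-up, so the gap is the full score
    lead_gap = ranked[0] - ranked[1] if len(ranked) > 1 else ranked[0]
    # How many candidates sit above the crowd floor
    crowd = sum(1 for s in ranked if s >= crowd_floor)
    return {
        "lead_gap": round(lead_gap, 4),
        "crowd": crowd,
        "near_tied": lead_gap < gap_threshold,
    }


easy = summarize_ranking([0.95, 0.70, 0.61])
hard = summarize_ranking([0.931, 0.927] + [0.91] * 13)
print(easy["near_tied"], hard["near_tied"], hard["crowd"])
```

The second call mimics the hard case described above: a 0.004 lead gap with fifteen candidates above 0.90.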

Architecture overview

User uploads spectrum
    ↓
Qwen 3.7-Max (ReAct loop, Function Calling)
    ↓ routes to one of six tools per intent:
    ├── search_library       → FTIR.fun REST API (130K+ spectra)
    ├── assess_direction     → local deterministic arbitration
    ├── cross_validate       → local deterministic chemistry rules
    ├── explain_peaks        → FTIR.fun peak knowledge graph
    ├── check_spectrum_quality → local file parser + quality checks
    └── search_public_cases  → FTIR.fun MCP (shared analysis cases)
    ↓
Host evidence gate (ensures required tools were called)
    ↓
Deterministic verdict: GREEN / YELLOW / RED
    ↓
Human confirmation checkpoint
    ↓
Report

No LangChain, no agent framework. The ReAct loop is direct DashScope API calls with enable_thinking=True.

Tool routing by intent

Qwen's Function Calling handles intent routing naturally. The same agent session supports different question types:

"What material is this?" → search_library
"What does the peak at 1715 cm⁻¹ mean?" → explain_peaks
"Is this spectrum usable?" → check_spectrum_quality

After the initial library search, Qwen decides whether to call assess_direction and cross_validate based on the results. If the top score is strong and the gap is wide, it may skip arbitration. If candidates are close, it calls the direction tool.

But there's a safety net.

The host evidence gate

The agent can't skip critical evidence steps. Before the model generates a final synthesis, the host checks whether required tools were called:

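# search_library is always required; near-tied results also require arbitration
required = {"search_library"}
if near_tied_candidates:
    required |= {"assess_direction", "cross_validate"}

# Tools the model actually called during this session
called = {log["tool"] for log in session.tool_calls_log}
missing = required - called

# Run each skipped tool on the host and feed its result back to the model
for tool_name in missing:
    result = execute_tool(tool_name, session)
    feed_result_back_to_model(tool_name, result, session)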

If Qwen tries to synthesize without running arbitration on near-tied results, the host runs the missing tool and feeds the result back. The model then continues with complete evidence. This is not a prompt instruction (which can be ignored) — it's a code-level gate.

Three-level deterministic verdicts

The verdict engine (assess_direction) examines the Top-15 search results and resolves to one of three levels:

Entity / GREEN: The top candidate is strong enough (score ≥ 0.85), the gap is clear, and chemistry checks pass. Evidence is sufficient to act.

Library Direction / YELLOW: No single compound can be locked (entity share is too low), but the candidates converge on a chemical family (direction share ≥ 50% with ≥ 2 supporting candidates). The agent names the family and generates a verification plan — a specific lab test to narrow it down.

Uncertain Direction / RED: Candidates diverge. The agent refuses to guess.
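The three levels can be sketched as a small pure function. This is a simplified illustration, not the actual assess_direction code: the 0.85, 50%, and 2-candidate thresholds come from the descriptions above, while the function name, the "entity" label, and the entity_share_min cutoff (standing in for "the gap is clear") are assumptions.

```python
def resolve_level(top_score, entity_share, direction_share, supporting,
                  chemistry_ok=True, entity_share_min=0.5):
    """Map search statistics to one of three verdict levels.

    entity_share_min is an assumed cutoff; the other thresholds are
    the ones quoted in the text.
    """
    if top_score >= 0.85 and entity_share >= entity_share_min and chemistry_ok:
        return "entity"               # GREEN: act on the top candidate
    if direction_share >= 0.50 and supporting >= 2:
        return "library_direction"    # YELLOW: name the family, plan a test
    return "uncertain_direction"      # RED: refuse to guess


# Shares taken from the audit-log example; the supporting count is hypothetical
print(resolve_level(0.93, entity_share=0.069, direction_share=0.666, supporting=4))
```

Because the function depends only on its numeric inputs, the same search results always resolve to the same level.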

The confidence formula is deterministic:

confidence = library_confidence × rule_check_multiplier

Where library_confidence comes from the search score and rule_check_multiplier comes from the chemistry cross-validation (seven checks including forbidden functional groups). Three consecutive runs on the same input return identical numbers.
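As a worked example of the formula, the sketch below multiplies the two factors and rounds to four decimals. The multiplier of 0.8 is an assumed value chosen for illustration; the post does not state which multiplier any particular sample received.

```python
def combined_confidence(library_confidence, rule_check_multiplier):
    """confidence = library_confidence x rule_check_multiplier, rounded to 4 places."""
    return round(library_confidence * rule_check_multiplier, 4)


# 0.8 is a hypothetical multiplier, not a value taken from the project
print(combined_confidence(0.8029, 0.8))  # 0.6423
```

No model output enters the calculation, which is why repeated runs on the same input cannot drift.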

Cross-validation: negative evidence matters

The cross_validate tool runs seven chemistry checks against the top candidate:

Lead-score gap — is the top match meaningfully ahead?
Material-family functional groups — does the spectrum show the bands this material family must show?
Hard-forbidden groups — does the spectrum show bands this family must not show? (A polyolefin candidate with a strong 1730 cm⁻¹ C=O band is chemistry-inconsistent, regardless of its similarity score.)
Peak coverage — how many diagnostic peaks are accounted for?
Background contamination — CO₂, moisture, or atmospheric interference
Baseline quality — offset, tilt, saturation
Overall match quality — composite score

Each check returns PASS or FAIL with a specific reason. A FAIL on hard-forbidden groups is a veto — it overrides the similarity score.

These rules come from domain knowledge, not from training data. They live as auditable Python code:

FAMILY_REQUIRED_BANDS = {
    "styrenic": [(3000, 3100, "aromatic C-H stretch"),
                 (1600, 1610, "aromatic C=C ring"),
                 (690, 760, "aromatic C-H out-of-plane")],
    ...
}

FAMILY_FORBIDDEN_BANDS = {
    "polyolefin": [(1700, 1750, "C=O stretch — not expected in polyolefin")],
    ...
}
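A forbidden-band veto built on such a dictionary might look like the sketch below. The function name, the peak representation (wavenumber plus relative intensity), and the 0.3 intensity floor for a "strong" band are assumptions made for illustration.

```python
FAMILY_FORBIDDEN_BANDS = {
    "polyolefin": [(1700, 1750, "C=O stretch, not expected in polyolefin")],
}


def check_forbidden(family, peaks, min_intensity=0.3):
    """Return PASS/FAIL for the hard-forbidden check.

    peaks is a list of (wavenumber_cm1, relative_intensity) pairs with
    intensity scaled to 0..1; min_intensity is an assumed cutoff.
    """
    for low, high, label in FAMILY_FORBIDDEN_BANDS.get(family, []):
        hits = [wn for wn, inten in peaks
                if low <= wn <= high and inten >= min_intensity]
        if hits:
            return {"status": "FAIL",
                    "reason": f"{label} (peak at {hits[0]} cm-1)",
                    "veto": True}
    return {"status": "PASS", "reason": "no forbidden bands found", "veto": False}


print(check_forbidden("polyolefin", [(2916, 0.9), (1730, 0.7)]))
```

A strong 1730 cm⁻¹ band fails the check and sets the veto flag; a weak shoulder below the intensity floor does not.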

Reproducibility contract

The demo includes a RED sample where the agent returns score=0.8029, confidence=0.6423. Three consecutive runs return the same numbers. This is by design: the LLM's reasoning route may vary (it might call tools in a different order), but the deterministic verdict does not change because the numbers come from Python, not from the model.

We verify this with a stability test:

Run 1: score=0.8029, confidence=0.6423, level=uncertain_direction
Run 2: score=0.8029, confidence=0.6423, level=uncertain_direction
Run 3: score=0.8029, confidence=0.6423, level=uncertain_direction
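A stability check along these lines can be sketched as follows. The dictionary layout of the tool results and the field names top_score and multiplier are hypothetical; the point is only that the verdict is computed from tool outputs, so the order in which the model called the tools cannot change it.

```python
import json


def deterministic_verdict(tool_results):
    """Compute the numbers from tool outputs only; call order is ignored."""
    score = tool_results["search_library"]["top_score"]
    multiplier = tool_results["cross_validate"]["multiplier"]
    return {"score": round(score, 4), "confidence": round(score * multiplier, 4)}


def is_stable(runs):
    """True if every run serializes to the same verdict."""
    fingerprints = {json.dumps(deterministic_verdict(r), sort_keys=True) for r in runs}
    return len(fingerprints) == 1


# Same evidence, gathered in a different tool order (values are illustrative)
run_a = {"search_library": {"top_score": 0.8029}, "cross_validate": {"multiplier": 0.8}}
run_b = {"cross_validate": {"multiplier": 0.8}, "search_library": {"top_score": 0.8029}}
print(is_stable([run_a, run_b, run_a]))
```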

Audit trail

Every tool call, LLM request, and UI event is logged as a timestamped JSON entry:

{
  "timestamp": "2026-07-07T12:06:23.668Z",
  "event": "tool_result",
  "tool": "assess_direction",
  "result": {
    "resolved_level": "library_direction",
    "entity_share": 0.069,
    "direction_share": 0.666,
    "dominant_direction": "fatty_ester"
  }
}

When someone asks "how did the AI reach this conclusion?", the answer is in the log — every step, every number, every tool call.
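A log entry in that shape can be produced with the standard library alone. This is a minimal sketch; the function name and the choice to return a JSON string (rather than write to a file or database) are assumptions.

```python
import json
from datetime import datetime, timezone


def audit_entry(event, tool, result, now=None):
    """Serialize one audit record with a UTC millisecond timestamp."""
    now = now or datetime.now(timezone.utc)
    # isoformat() emits "+00:00" for UTC; swap it for the "Z" suffix
    stamp = now.isoformat(timespec="milliseconds").replace("+00:00", "Z")
    return json.dumps(
        {"timestamp": stamp, "event": event, "tool": tool, "result": result}
    )


line = audit_entry(
    "tool_result",
    "assess_direction",
    {"resolved_level": "library_direction", "direction_share": 0.666},
)
print(line)
```

Appending one such line per event gives a JSON Lines file that can be replayed or filtered by tool name.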

What I'd do differently

The current cross-validation rules cover common material families but not all of them. A more complete system would have hundreds of rules covering polymers, minerals, pharmaceuticals, and biologics. The architecture supports it — adding a new family is adding a dictionary entry — but the knowledge encoding takes time.

I'd also like to support multi-spectrum sessions: upload the original spectrum, then an extract, then an ash — and let the agent maintain a hypothesis state machine across all three, proposing the next experiment at each step. That's how real formulation analysis works.

Try it

Live demo: chemspectra.ftir.fun
Source code: github.com/jxbaoxiaodong/chemspectra-agent

Built with qwen3.7-max on Alibaba Cloud DashScope, Python, FastAPI, vanilla JS. No framework dependencies.

Built for the Qwen Cloud Hackathon 2025, Track 4: Autopilot Agent.

Building a Self-Verifying FTIR Agent with Qwen Function Calling

bob lee — Fri, 26 Jun 2026 04:40:16 +0000

Built for Track 4: Autopilot Agent — #QwenCloudHackathon

Most AI "agents" are API wrappers with a system prompt. Upload data, call one endpoint, return the result. No verification, no reasoning about what went wrong, no ability to self-correct.

For the Qwen Cloud Hackathon, I built ChemSpectra Agent — an FTIR spectral analysis system where Qwen-3.7-Max autonomously selects tools, cross-validates evidence across multiple results, and triggers self-verification when confidence is low. The key insight: an agent that checks its own work catches errors that single-pass analysis misses.

Why Function Calling Changes Everything

The agent has access to 5 analysis tools, each hitting a different endpoint of the FTIR.fun spectral library (130,000+ reference spectra):

Tool	Purpose
`identify_material`	Match spectrum against reference library, return ranked candidates
`explain_peaks`	Explain what chemical bond vibration each peak represents
`assign_functional_groups`	Map peaks to functional groups (C=O, O-H, N-H, etc.)
`match_library_topk`	Rapid top-K screening without deep analysis
`search_public_results`	Search publicly shared analysis cases (via MCP)

Instead of hardcoding which tools to call, I define these as Qwen Function Calling schemas and let the model decide:

from dashscope import Generation  # DashScope SDK client used for the call below

AGENT_TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "identify_material",
            "description": "Match spectrum against 130,000+ reference spectra...",
            "parameters": {
                "type": "object",
                "properties": {
                    "top_k": {"type": "integer", "default": 10},
                    "sample_type": {"type": "string"},
                },
            },
        },
    },
    # ... 4 more tools
]

response = Generation.call(
    api_key=DASHSCOPE_API_KEY,
    model="qwen3.7-max",
    messages=messages,
    tools=AGENT_TOOLS,       # Qwen decides which to call
    result_format="message",
)

The result: different questions trigger different tool combinations. "What is this material?" → identify_material + explain_peaks. "Deformulate this sample" → all three analytical tools. "Quick screening" → just match_library_topk. The LLM decides, not the developer.

The ReAct Loop

The agent runs a Think → Act → Observe loop, up to 6 iterations:

Qwen receives the user request + tool schemas
Qwen returns tool_calls — which tools to invoke and with what parameters
Agent executes the tools against FTIR.fun API
Results are formatted and sent back to Qwen
Qwen either calls more tools or produces a final synthesis

In practice, most analyses complete in 2-3 iterations. Qwen's enable_thinking=True mode shows the full chain-of-thought reasoning, so you can see why it chose each tool.
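The loop itself is small. The sketch below shows its shape with the model and the tools injected as plain callables, so it runs without any API key. The message and tool-call dictionaries are simplified assumptions, not the exact DashScope response format, and react_loop is a hypothetical name.

```python
def react_loop(call_model, execute_tool, messages, max_iterations=6):
    """Think -> Act -> Observe until the model stops requesting tools."""
    for _ in range(max_iterations):
        reply = call_model(messages)                  # Think
        messages.append(reply)
        tool_calls = reply.get("tool_calls") or []
        if not tool_calls:
            return reply.get("content", "")           # final synthesis
        for call in tool_calls:                       # Act
            output = execute_tool(call["name"], call.get("arguments", {}))
            messages.append(                          # Observe
                {"role": "tool", "name": call["name"], "content": str(output)}
            )
    return None  # iteration budget exhausted without a synthesis


# Scripted stand-in for the model: one tool call, then a final answer
script = iter([
    {"role": "assistant", "content": "",
     "tool_calls": [{"name": "identify_material", "arguments": {"top_k": 10}}]},
    {"role": "assistant", "content": "Top candidate: PET"},
])
history = [{"role": "user", "content": "What is this material?"}]
answer = react_loop(lambda msgs: next(script),
                    lambda name, args: {"tool": name, "ok": True},
                    history)
print(answer)
```

Injecting the model as a callable also makes the loop easy to unit test with scripted replies like the one above.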

Cross-Validation: Where It Gets Interesting

After the ReAct loop, the agent doesn't just return results. It runs two automated checks:

Confidence estimation — calculated from match scores, candidate score gaps, and functional group coverage:

def _estimate_confidence(self, session):
    scores = []
    id_result = session.tool_results.get("identify_material", {})
    if id_result.get("matches"):
        top_sim = id_result["matches"][0].get("similarity", 0)
        scores.append(top_sim)
        if len(id_result["matches"]) >= 2:
            gap = top_sim - id_result["matches"][1].get("similarity", 0)
            scores.append(min(1.0, gap * 5))  # larger gap = more confident
    # ... more signals from other tools

Evidence conflict detection — compares outputs across tools. If identify_material says "PET" but assign_functional_groups found no ester groups, that's a contradiction:

expected_groups = {
    "pet": ["ester", "c=o", "aromatic"],
    "nylon": ["amide", "n-h", "c=o"],
    "polyethylene": ["c-h", "ch2", "methylene"],
    "silicone": ["si-o", "si-c", "siloxane"],
}
# If 2+ expected groups are missing → conflict
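The "2+ missing groups" rule can be sketched as a short function. The function name and the shape of the returned conflict record are assumptions; the expected-group table repeats two entries from the dictionary above.

```python
EXPECTED_GROUPS = {
    "pet": ["ester", "c=o", "aromatic"],
    "nylon": ["amide", "n-h", "c=o"],
}


def detect_conflicts(material, found_groups, min_missing=2):
    """Flag a mismatch when min_missing or more expected groups are absent."""
    key = material.lower()
    found = {g.lower() for g in found_groups}
    missing = [g for g in EXPECTED_GROUPS.get(key, []) if g not in found]
    if len(missing) >= min_missing:
        return [{"type": "functional_group_mismatch",
                 "material": key, "missing": missing}]
    return []


print(detect_conflicts("PET", ["c=o"]))
```

A single missing group is tolerated in this sketch, since one absent band can be a detection artifact rather than a contradiction.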

Self-Verification Round

When confidence < 0.75 or conflicts are detected, the agent automatically triggers a verification round. Qwen is told exactly what went wrong:

ISSUES DETECTED:
- functional_group_mismatch: material="pet", missing=["ester", "aromatic"]
- low_confidence: 0.62 (threshold: 0.75)

Qwen then autonomously calls additional tools to investigate. After verification, confidence is recalculated. In testing, I've seen confidence traces like [0.62, 0.84], a relative improvement of about 35% (0.22 absolute) from one verification round.
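The trigger condition and the gain over a confidence trace are simple arithmetic, sketched below. The function names are hypothetical; the 0.75 threshold and the [0.62, 0.84] trace are the values quoted above.

```python
def needs_verification(confidence, conflicts, threshold=0.75):
    """A verification round fires on low confidence or any conflict."""
    return confidence < threshold or len(conflicts) > 0


def relative_gain(trace):
    """Relative change from the first to the last confidence in a trace."""
    return (trace[-1] - trace[0]) / trace[0]


trace = [0.62, 0.84]
print(needs_verification(trace[0], conflicts=[]))
print(f"{relative_gain(trace):.1%}")
```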

Self-Repair

When Qwen's structured JSON output fails to parse (it happens — LLMs sometimes wrap JSON in markdown code blocks), the error and original output are sent back to Qwen with context:

repair_messages = messages + [
    {"role": "assistant", "content": raw},
    {"role": "user", "content": f"Parse error: {raw[:200]!r}\nReturn ONLY valid JSON."},
]
raw_retry = self._call_qwen(repair_messages)

Near-100% recovery rate. No silent failures.
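Many parse failures can be avoided before asking the model again, by stripping a surrounding markdown fence first. The sketch below combines that with a bounded retry. The names strip_fence, parse_with_repair, and ask_model are hypothetical, and ask_model stands in for whatever function re-prompts the model.

```python
import json

TICKS = "`" * 3  # a markdown code fence marker


def strip_fence(raw):
    """Remove a surrounding markdown fence and optional language tag."""
    text = raw.strip()
    if text.startswith(TICKS) and text.endswith(TICKS):
        text = text[len(TICKS):-len(TICKS)]
        first, _, rest = text.partition("\n")
        # Drop the first line if it is empty or a language tag such as "json"
        if first.strip() == "" or first.strip().isalpha():
            text = rest
    return text.strip()


def parse_with_repair(raw, ask_model, max_retries=1):
    """Parse JSON, re-prompting the model with the error if parsing fails."""
    for attempt in range(max_retries + 1):
        try:
            return json.loads(strip_fence(raw))
        except json.JSONDecodeError as exc:
            if attempt == max_retries:
                raise  # surface the failure instead of hiding it
            raw = ask_model(f"Parse error: {exc}\nReturn ONLY valid JSON.", raw)


fenced = TICKS + 'json\n{"material": "PET"}\n' + TICKS
print(parse_with_repair(fenced, ask_model=lambda msg, prev: "{}"))
```

Re-raising after the last retry keeps the "no silent failures" property: the caller either gets parsed data or an exception.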

Why This Matters for Real Applications

In regulated industries — pharmaceutical QC under FDA 21 CFR Part 11, forensic substance identification, environmental contaminant detection — an AI that returns wrong results without flagging uncertainty is dangerous. ChemSpectra Agent's self-verification turns "AI that gives answers" into "AI that checks its work." The confidence trace provides an audit trail that fits existing compliance frameworks.

All LLM reasoning — tool selection, synthesis, verification, self-repair, follow-up chat, report generation — runs through Alibaba Cloud's dashscope SDK with qwen3.7-max. Six distinct call sites, one provider.

Try it: github.com/jxbaoxiaodong/chemspectra-agent

How I Built a Self-Verifying AI Agent with DynamoDB and ReAct Reasoning

bob lee — Fri, 26 Jun 2026 04:34:25 +0000

Built for the #H0Hackathon — Hack the Zero Stack with Vercel v0 and AWS Databases

Most AI pipelines follow a fixed script: input in, output out, nobody checks the work. For the H0 hackathon (Track 2: Monetizable B2B App), I built ChemSpectra Agent — an FTIR spectral analysis system where the AI verifies its own conclusions and self-corrects when evidence conflicts.

The ReAct Loop

Instead of hardcoding which tools to call, the agent uses a ReAct loop with Qwen-3.7-Max function calling. The LLM autonomously selects from 5 tools — identify_material (130K+ reference spectra), explain_peaks, assign_functional_groups, match_library_topk, and search_public_results. A material ID request might trigger two tools; a deformulation request triggers all three analytical tools. The LLM decides, not the developer.

Cross-Validation and Self-Verification

After tools return results, _detect_evidence_conflicts() compares outputs. If identify_material says "PET" but assign_functional_groups found no ester groups, that's a contradiction:

expected_groups = {
    "pet": ["ester", "c=o", "aromatic"],
    "nylon": ["amide", "n-h", "c=o"],
}

The agent estimates confidence from match scores, candidate score gaps, and functional group coverage. Below 0.75 confidence or any conflicts, a verification round fires automatically:

needs_verification = (
    confidence < 0.75 or len(conflicts) > 0
)

The agent gets told exactly what went wrong and calls additional tools to investigate. Post-verification confidence is logged, creating traces like [0.62, 0.84].

DynamoDB: Beyond Key-Value Storage

Every session persists to DynamoDB with 30-day TTL — tool call logs, confidence traces, synthesis, final report. But we went deeper than basic CRUD:

Two GSIs — gsi-created (partition: ALL, sort: created_at) replaces full-table scan with efficient time-ordered query; gsi-material (partition: top_match, sort: created_at) enables "show me all PET analyses" aggregation
Atomic counters — a separate chemspectra-stats table tracks total_analyses and total_tools_called via DynamoDB ADD operations, safe under concurrent requests
Conditional writes — confirmed sessions use attribute_not_exists(session_id) OR step <> :confirmed to prevent concurrent overwrites of finalized reports
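The three patterns above translate into request parameters roughly as sketched below. The helpers only build keyword-argument dictionaries of the kind accepted by boto3's Table.put_item and Table.update_item, so the sketch runs without AWS credentials. The attribute names expires_at and stat_id, and the "confirmed" value, are assumptions; step is aliased through ExpressionAttributeNames as a precaution against reserved-word clashes.

```python
import time


def session_item(session_id, top_match, report, ttl_days=30, now=None):
    """Build a session record with a TTL attribute in epoch seconds."""
    now = int(now if now is not None else time.time())
    return {
        "session_id": session_id,
        "top_match": top_match,          # partition key of gsi-material
        "created_at": now,               # sort key of both GSIs
        "report": report,
        "expires_at": now + ttl_days * 86400,  # hypothetical TTL attribute name
    }


def counter_update(tools_called):
    """Arguments for an atomic ADD on the stats table."""
    return {
        "Key": {"stat_id": "global"},    # hypothetical key schema
        "UpdateExpression": "ADD total_analyses :one, total_tools_called :n",
        "ExpressionAttributeValues": {":one": 1, ":n": tools_called},
    }


def guarded_put(item):
    """Arguments for a put that refuses to overwrite a confirmed session."""
    return {
        "Item": item,
        "ConditionExpression":
            "attribute_not_exists(session_id) OR #step <> :confirmed",
        "ExpressionAttributeNames": {"#step": "step"},
        "ExpressionAttributeValues": {":confirmed": "confirmed"},
    }


item = session_item("s-001", "pet", {"verdict": "YELLOW"}, now=1_000_000)
print(guarded_put(item)["ConditionExpression"])
```

With a boto3 Table resource these would be passed as table.put_item(**guarded_put(item)) and stats_table.update_item(**counter_update(3)); a failed condition raises an error rather than overwriting the finalized report.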

Regulated industries (pharma, forensics) require this audit trail. DynamoDB fits because the primary access is single-item by session_id, the GSIs cover the two secondary patterns, and TTL handles cleanup automatically.

Results

The loop runs 2-4 iterations in under 30 seconds. Self-repair for malformed LLM JSON has near-100% recovery. This turns "AI that gives answers" into "AI that checks its work" — essential when reports go into regulatory filings.

Try it: chemspectra-agent-h0.vercel.app | Code: github.com/jxbaoxiaodong/chemspectra-agent-h0

This article was written as part of my participation in the H0 AWS+Vercel Hackathon.

How I Built an FTIR Analysis Platform with Claude (and What I Learned About AI-Assisted Development)

bob lee — Wed, 10 Jun 2026 03:32:31 +0000

The Backstory

I'm a materials science graduate, not a software developer. I know FTIR spectroscopy — identifying polymers, interpreting functional group peaks, matching unknown samples against reference libraries. But when I needed to search FTIR spectra programmatically, I hit a wall: the existing tools were either expensive enterprise packages or Excel macros from the early 2000s.

So I decided to build my own. And I used Claude (Anthropic's AI assistant) as my coding partner.

This is the story of how a domain expert with basic Python skills built a production FTIR search platform — 135,000 spectra, MCP server, API, community features — with AI writing about 70% of the code.

Step 1: The Core Algorithm

FTIR spectrum matching sounds complex, but the core is simple geometry: given a set of peak positions from an unknown sample, find the library spectra with the most matching peaks within a tolerance window (typically ±5 to ±15 cm⁻¹).
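A minimal sketch of that idea follows. It is not the platform's production algorithm: the greedy nearest-peak pairing, the single global tolerance, and the function names are simplifying assumptions, and the two library entries are small illustrative peak lists.

```python
def match_score(query_peaks, reference_peaks, tolerance=10.0):
    """Fraction of query peaks with an unused reference peak within the tolerance (cm^-1)."""
    if not query_peaks:
        return 0.0
    available = sorted(reference_peaks)
    matched = 0
    for q in sorted(query_peaks):
        # Closest reference peak that has not been paired yet
        best = min(available, key=lambda r: abs(r - q), default=None)
        if best is not None and abs(best - q) <= tolerance:
            matched += 1
            available.remove(best)  # each reference peak matches at most once
    return matched / len(query_peaks)


def rank(query_peaks, library, tolerance=10.0):
    """Score every library entry and sort best-first."""
    scored = [(name, match_score(query_peaks, peaks, tolerance))
              for name, peaks in library.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


library = {
    "polyethylene": [2916, 2848, 1472, 719],
    "PET": [1715, 1245, 1100, 725],
}
print(rank([2918, 2850, 1470, 720], library))
```

Here the query matches all four polyethylene peaks but only one PET peak (720 against 725), which shows why a score based on peak positions alone still needs chemistry checks on top.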

What Claude helped with:

Writing the initial peak-matching loop
Setting up the Django project structure
Designing the database schema for the spectral library

What I handled:

Understanding which tolerance values actually work (different wavenumber regions need different tolerances)
Validating match results against known materials
Rejecting the first three algorithm designs that looked correct on paper but failed on real data

Lesson: AI can write the code faster than you can, but it can't tell you if the chemistry is right. Domain expertise is the bottleneck, not code.

Step 2: Parsing FTIR Instrument Files

This was the hardest technical challenge. FTIR instruments output data in at least 6 different formats:

Format	Origin	Difficulty
SPA	Thermo Nicolet	Medium — binary, proprietary
SPC	GRAMS	Medium — documented but complex
OPUS	Bruker	High — completely proprietary
CSV	Universal	Easy
JDX	JCAMP-DX	Medium — standard but varied implementations
XLSX	Labs	Easy — but infinite variations

What Claude helped with:

Writing binary file parsers from format documentation
Extracting peak tables from raw instrument data
Handling edge cases (missing metadata, non-standard headers)

What I handled:

Testing with real instrument files from my university lab
Identifying which format variants actually appear in practice
Setting up error handling for unparseable files

Lesson: Claude is surprisingly good at binary file parsing. I pasted format specs from Thermo and Bruker documentation, and it generated working parsers. But I caught three subtle byte-offset errors that would have silently corrupted data.

Step 3: The MCP Server

MCP (Model Context Protocol) lets AI agents call your tool directly. Instead of a human typing peak values into a web form, an AI agent can send structured requests and receive structured results.

The MCP server, at fastapi_server/mcp_server.py, exposes one main tool:

analyze_ftir_spectrum(file_content, filename, peaks)

It accepts either an instrument file or a peak list and returns ranked matches with similarity scores.

What Claude generated: ~90% of the MCP server code, including the Pydantic output schemas, error handling, and feature documentation.

Step 4: What Broke in Production

Problem 1: Memory
Loading the entire 135K-spectrum library into memory on every request was fine locally. On a 2GB VPS with other services running, it caused OOM kills within hours.

Fix: Added Redis caching for frequent searches, lazy loading for the library, and a batch query size limit.

Problem 2: Cloudflare timeouts
The MCP streamable-http transport needs persistent connections. Cloudflare's default 100-second timeout killed long searches.

Fix: Server-sent events for progress reporting, and Cloudflare timeout tuning.

Problem 3: Hallucination-like false positives
The matching algorithm returned chemically impossible candidates for very short peak lists (2-3 peaks).

Fix: Added a minimum peak count threshold and a confidence penalty for low-peak queries.
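A fix of this kind could be sketched as below. The function name, the minimum of 4 peaks, and the linear penalty up to 8 peaks are assumed values for illustration; the actual thresholds are not stated here.

```python
def adjust_for_peak_count(similarity, n_peaks, min_peaks=4, full_trust_peaks=8):
    """Reject very short peak lists and discount sparse ones.

    min_peaks and full_trust_peaks are hypothetical settings.
    """
    if n_peaks < min_peaks:
        return None  # too few peaks to search meaningfully
    # Linear penalty that reaches 1.0 (no penalty) at full_trust_peaks
    penalty = min(1.0, n_peaks / full_trust_peaks)
    return round(similarity * penalty, 4)


for n in (3, 4, 10):
    print(n, adjust_for_peak_count(0.9, n))
```

Returning None for short queries forces the caller to handle the refusal explicitly instead of ranking chemically meaningless candidates.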

The Result

FTIR.fun is now:

Live at https://ftir.fun
MCP endpoint: https://ftir.fun/mcp — connect from Claude, Cursor, Copilot, or any MCP client
OpenAPI spec: https://ftir.fun/openapi.platform.yaml
GitHub: github.com/jxbaoxiaodong/ftirfun-mcp
~135,000 spectra indexed and searchable
~70% of the code co-written with Claude
~30% of the code rewritten after Claude's version failed in production

What I'd Tell Other Domain Experts Considering AI-Assisted Development

1. Start with the messy data, not the shiny framework.

I spent two weeks getting Claude to generate a perfect Docker Compose setup. Then I spent two months wrangling real FTIR instrument files. The infrastructure was the easy part — the data was the hard part.

2. AI will write code that looks right but is wrong.

Claude produced beautiful peak-matching code that passed unit tests and failed on real spectra. The peak positions "matched" mathematically but violated basic FTIR chemistry. You need domain knowledge to catch this.

3. Production is where the AI-generated code breaks first.

The code that looks clean in a notebook dies first under real load, real data variety, and real timeout limits. Be ready to rewrite the hot paths.

4. But the framework code is perfect for AI.

Settings, schemas, API routing, test scaffolding, README files, deployment scripts — Claude wrote these flawlessly. Let AI handle the glue while you focus on the domain logic.

What's Next

Confidence calibration (how reliable is a 0.85 similarity score?)
Expanded file format support
Public API with usage tiers
More MCP tools for agent workflows

FTIR.fun is an open-spectral-search project by a materials scientist who learned Python by building it. Questions, feedback, or FTIR datasets to contribute? ftir.fun@outlook.com