DEV Community

Discussion on: Function Calling Harness 2: CoT Compliance from 9.91% to 100%

PEACEBINFLOW

That line—"Prompt asks for thought. Schema demands accountable thought"—gets at something I've been circling but hadn't articulated. The distinction between asking a model to think and demanding evidence that it actually walked the procedure is subtle, but it's the difference between trusting the output and being able to audit it. Free-form CoT gives you a plausible narrative. Schema-enforced CoT gives you a checklist where the empty boxes are visible.

What I keep turning over is the retrofit problem you mentioned—the investment committee memo where the decision was made before the data was reviewed, and the analysis exists to confirm what was already chosen. That pattern isn't unique to finance. It's how a lot of human reasoning works in practice. We decide, then justify. The schema attacks this not by trying to catch the model in a lie, but by demanding slots that can't be convincingly backfilled: a falsifiable kill condition, a counter-thesis that genuinely engages with the bear case rather than strawmanning it, evidence sources that are traceable rather than hand-waved. A model asked to reverse-engineer justification for a pre-baked conclusion will still try, but the gaps become structurally visible in ways they wouldn't in prose.
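
To make the "empty boxes are visible" idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `InvestmentMemo` class, its field names, and the `empty_slots` helper are hypothetical and are not taken from the article's schema.

```python
from dataclasses import dataclass, field, fields


@dataclass
class InvestmentMemo:
    """Hypothetical memo schema; every field name here is illustrative."""
    thesis: str = ""
    kill_condition: str = ""   # a falsifiable condition that would end the position
    counter_thesis: str = ""   # the bear case, argued on its own terms
    evidence_sources: list[str] = field(default_factory=list)  # traceable references


def empty_slots(memo: InvestmentMemo) -> list[str]:
    """Return the names of slots left blank, in declaration order."""
    return [f.name for f in fields(memo) if not getattr(memo, f.name)]


# A memo that states a thesis and cites a source but skips the hard slots.
memo = InvestmentMemo(
    thesis="Margins expand as fixed costs are absorbed.",
    evidence_sources=["FY2023 annual report, segment note"],
)
print(empty_slots(memo))  # ['kill_condition', 'counter_thesis']
```

A check like this only catches blank slots. Whether a filled-in counter-thesis actually engages the bear case or strawmans it still needs a reviewer or a stronger validator.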

I'm curious about the failure mode you didn't name explicitly: the schema designer who encodes their own biases into the procedure. If I design an investment memo schema where the kill conditions are all price-drawdown thresholds, I've implicitly ruled out thesis-drift as a valid reason to exit. The schema enforces rigor, sure, but it also encodes a worldview. The more powerful this technique gets, the more the schema author's judgment becomes the hidden curriculum. At what point does "enforcing the procedure" become "enforcing my preferred version of the procedure," and how would anyone downstream know the difference?

Jeongho Nam

Great point — and it's the failure mode I should have named more directly in §3.4. You're right that "schema-enforced rigor" can quietly become "schema designer's preferred rigor."

The way I'd extend the argument: schema enforcement isn't the whole loop. In Part 1 (dev.to/samchon/qwen-meetup-functio...), the compiler wasn't just a gate — it was a feedback signal that told the harness when the procedure itself was wrong. The model converged because the verifier kept rejecting bad outputs until schema and procedure aligned with something that actually compiled.
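
A rough sketch of that loop shape, in Python for brevity. This is not the Part 1 harness: `run_with_feedback`, `toy_generate`, and `toy_verify` are hypothetical names, and the toy generator and verifier merely stand in for a model and a compiler.

```python
from typing import Callable, Optional


def run_with_feedback(
    generate: Callable[[list[str]], dict],
    verify: Callable[[dict], list[str]],
    max_rounds: int = 5,
) -> tuple[Optional[dict], int]:
    """Regenerate until the verifier reports no errors or rounds run out."""
    feedback: list[str] = []
    for round_no in range(1, max_rounds + 1):
        candidate = generate(feedback)   # the generator sees the last rejection
        feedback = verify(candidate)     # an empty list means accepted
        if not feedback:
            return candidate, round_no
    return None, max_rounds


REQUIRED = ("plan", "code")  # assumed required fields for the toy verifier


def toy_verify(candidate: dict) -> list[str]:
    """Stand-in for a compiler: report each required field that is missing."""
    return [f"missing field: {key}" for key in REQUIRED if key not in candidate]


def toy_generate(feedback: list[str]) -> dict:
    """Stand-in for a model: fills in only what the verifier complained about."""
    candidate = {"plan": "outline the steps"}
    for message in feedback:
        candidate[message.removeprefix("missing field: ")] = "filled after feedback"
    return candidate


result, rounds = run_with_feedback(toy_generate, toy_verify)
print(rounds, sorted(result))  # 2 ['code', 'plan']
```

The point of the sketch is only that rejections flow back into generation. If a loop like this keeps failing on the same message, that is a hint the procedure or schema is wrong rather than the individual output.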

Every serious domain has the same mechanism, just slower and noisier. Finance calls it backtesting. Medicine calls it retrospective study or chart review. Policy calls it ex-post evaluation. Law calls it precedent analysis. They're all the same shape: replay the procedure against historical cases and see whether it would have caught what mattered. A compiler is just a backtest with zero latency.

So a kill-condition schema that omits thesis-drift will show up as a portfolio holding losers too long when you backtest it. A SOAP schema that under-weights differential diagnosis will show up as missed diagnoses in chart review. The hidden-curriculum problem doesn't disappear, but it stops being permanent — encoded bias gets a half-life, because longitudinal verification eventually punishes the schema itself, not just individual outputs.
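
As a toy illustration of that replay, assuming a hindsight label for each historical position: the cases below are synthetic, the 20% drawdown threshold is arbitrary, and `Case`, `price_only`, `price_or_drift`, and `missed_losers` are hypothetical names, so treat this as a sketch of the shape rather than a real backtest.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    name: str
    drawdown: float       # peak-to-trough loss as a fraction (synthetic)
    thesis_broken: bool   # did the original thesis stop holding? (synthetic)
    was_loser: bool       # hindsight label: should this position have been exited?


def price_only(case: Case) -> bool:
    """Kill rule that only knows about price: exit at a 20% drawdown (assumed)."""
    return case.drawdown >= 0.20


def price_or_drift(case: Case) -> bool:
    """Same rule, extended so that thesis drift is also a valid exit reason."""
    return case.drawdown >= 0.20 or case.thesis_broken


def missed_losers(rule: Callable[[Case], bool], cases: list[Case]) -> list[str]:
    """Replay a kill rule over past cases; list the losers it never exited."""
    return [c.name for c in cases if c.was_loser and not rule(c)]


# Synthetic history, invented purely for illustration.
CASES = [
    Case("A", drawdown=0.35, thesis_broken=False, was_loser=True),
    Case("B", drawdown=0.08, thesis_broken=True, was_loser=True),   # slow bleed
    Case("C", drawdown=0.05, thesis_broken=False, was_loser=False),
    Case("D", drawdown=0.12, thesis_broken=True, was_loser=True),
]

print(missed_losers(price_only, CASES))      # ['B', 'D']
print(missed_losers(price_or_drift, CASES))  # []
```

The price-only rule never fires on B and D, which is the "holding losers too long" signature. A real evaluation would also have to guard against hindsight bias in how the labels are assigned.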

Schema design is a new kind of difficulty, I'll grant that. But it isn't a permanent black box. The schema itself becomes the object of verification, and that verification is performed by the backtesting mechanism each domain already has.

That said, I should be upfront: I come at this as a developer building coding agents, so my grasp of backtesting is conceptual at best. The above is one plausible sketch of how schema bias might be checked rather than a worked-out claim. If you see a better framing or something I should reconsider, I'd genuinely appreciate hearing it.