Pavel Kostromin

Posted on Mar 28

Reliable Function Calling in Deeply Recursive Union Types: Fixing Qwen Models' Double-Stringify Bug

#ai #qwen #typia #uniontypes

Introduction & Problem Statement

Function calling in AI models is a cornerstone of their utility in engineering domains, enabling them to produce structured, machine-checkable output rather than mere text. However, when confronted with deeply recursive union types, even state-of-the-art models like Qwen falter. These types nest to arbitrary depth and allow several alternative shapes at each level, which introduces ambiguity that AI models struggle to resolve. The result? Low success rates and systematic bugs that undermine reliability.
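To make "deeply recursive union type" concrete, here is a minimal sketch. The Expr type and the exprDepth helper are hypothetical names invented for illustration; they are not taken from the talk or from any Qwen benchmark.

```typescript
// Hypothetical example type: each union member may contain further Expr values,
// so a model has to pick the right member at every level of nesting.
type Expr =
  | { type: "literal"; value: number }
  | { type: "negate"; operand: Expr }
  | { type: "binary"; operator: "+" | "*"; left: Expr; right: Expr };

// Nesting depth of an expression tree.
function exprDepth(expr: Expr): number {
  if (expr.type === "literal") return 1;
  if (expr.type === "negate") return 1 + exprDepth(expr.operand);
  return 1 + Math.max(exprDepth(expr.left), exprDepth(expr.right));
}

const sampleExpr: Expr = {
  type: "binary",
  operator: "+",
  left: { type: "literal", value: 1 },
  right: { type: "negate", operand: { type: "literal", value: 2 } },
};
```

Even this three-member union allows unboundedly many valid shapes, which is why a single wrong choice deep in the tree invalidates the whole call.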

During my presentation at the Qwen Meetup Korea, I dissected the specific challenges with Qwen models. The qwen3-coder-next model achieved only a 6.75% first-try success rate on deeply recursive union types, while every model in the Qwen 3.5 family failed outright because of a double-stringify bug. The bug is mechanical: nested objects are serialized twice, so a nested value arrives as a JSON string where the schema expects an object, and validation reports a type mismatch.

The root of the problem lies in the absence of robust infrastructure for handling union types. Existing implementations lack systematic validation, self-healing mechanisms, and type-driven feedback loops, leaving models vulnerable to edge cases. Large models, in particular, often mask underlying vulnerabilities by producing superficially correct outputs, while smaller models expose these flaws, making them invaluable for QA.

Without addressing these issues, AI models risk generating ambiguous, error-prone outputs, hindering their adoption in critical engineering workflows. The stakes are high: as AI integrates into software pipelines, reliable function calling on complex types is non-negotiable for scaling automation tools.

Mechanisms of Failure and Risk Formation

The double-stringify bug in Qwen 3.5 models exemplifies how internal processing deformations lead to observable failures. Here’s the causal chain:

Impact: Nested union types are incorrectly serialized twice.
Internal Process: A standard JSON parser decodes only one layer of serialization, so the nested value comes back as a string rather than an object.
Observable Effect: Type mismatches during validation, causing function calls to fail.

The risk of such bugs is compounded by the lack of deterministic convergence in existing systems. Without type-driven infrastructure, models rely on probabilistic outputs, making errors inevitable. This is particularly dangerous in engineering domains, where ambiguity translates to system failure.
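The failure chain above can be reproduced with nothing but JSON.stringify and JSON.parse. This is a minimal sketch; the payload shape is a hypothetical stand-in for real function-call arguments.

```typescript
// Hypothetical payload used only for illustration.
const inner = { type: "B", value: 42 };

// Correct: the nested object is embedded as an object and serialized once.
const singleStringified = JSON.stringify({ type: "A", value: inner });

// Buggy: the nested object is stringified first, then serialized again with its parent.
const doubleStringified = JSON.stringify({ type: "A", value: JSON.stringify(inner) });

const parsedCorrect = JSON.parse(singleStringified) as { value: unknown };
const parsedBuggy = JSON.parse(doubleStringified) as { value: unknown };

const kindOfCorrect = typeof parsedCorrect.value; // "object"
const kindOfBuggy = typeof parsedBuggy.value; // "string"
```

Both strings are valid JSON, which is why a plain parser raises no error; the problem only surfaces when the parsed value is checked against the expected type.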

Solution Landscape: Comparative Analysis

To address these challenges, I evaluated three solution options:


Option	Effectiveness	Mechanism	Limitations
1. Enhanced Model Training	Moderate	Fine-tuning on union type examples.	Does not address underlying infrastructure gaps; fails on unseen edge cases.
2. Manual Validation Loops	Low	Human-in-the-loop for error correction.	Scalability issues; prone to human error.
3. Type-Driven Infrastructure (Typia)	Optimal	Automates schema generation, parsing, validation, and feedback.	Requires initial setup; dependent on type system completeness.

Typia emerged as the optimal solution due to its mechanically verifiable and deterministically convergent nature. By automating schema generation and validation, it eliminates ambiguity and provides precise feedback, transforming low success rates into consistent 100% reliability. The rule is clear: If handling deeply recursive union types → use type-driven infrastructure like Typia.

Practical Insights and Edge-Case Analysis

The success of Typia hinges on its ability to handle edge cases through lenient JSON parsing and type coercion. For example, when a model outputs a partially stringified object, Typia’s parser expands the stringified part back into an object, coerces values to the types the schema expects, and generates precise validation feedback for anything that still does not match. The process is mechanical at every step: malformed output goes in, the parser expands and coerces it, and either a correctly typed value or an exact error report comes out.
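The expansion step can be sketched in a few lines. This is a hand-rolled illustration of the idea, not Typia's parser; the Json type and the expandStringified function are hypothetical names.

```typescript
// Hand-rolled sketch; NOT Typia's implementation.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Recursively replace string values that themselves contain a JSON object or array.
function expandStringified(value: Json): Json {
  if (typeof value === "string") {
    const trimmed = value.trim();
    const first = trimmed.charAt(0);
    if (first === "{" || first === "[") {
      try {
        return expandStringified(JSON.parse(trimmed) as Json);
      } catch {
        return value; // not valid JSON after all: keep the original string
      }
    }
    return value;
  }
  if (Array.isArray(value)) return value.map((item) => expandStringified(item));
  if (value !== null && typeof value === "object") {
    const out: { [key: string]: Json } = {};
    for (const [key, child] of Object.entries(value)) {
      out[key] = expandStringified(child);
    }
    return out;
  }
  return value;
}
```

Note that this sketch expands every string that happens to hold JSON. A schema-aware implementation would presumably expand only where the schema expects an object or array, so that a field meant to carry JSON text as a string is left alone.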

However, Typia’s effectiveness is contingent on the completeness of the type system. If a type is underspecified or ambiguously defined, even Typia will fail. This highlights a critical rule: if the type system is incomplete → Typia’s reliability diminishes.

Conclusion: Transforming Failure into Reliability

The journey from 6.75% to 100% success rates with Qwen models underscores the transformative power of type-driven infrastructure. By addressing the double-stringify bug and introducing systematic validation, we’ve demonstrated that reliable function calling on deeply recursive union types is achievable. The key lies in leveraging smaller models for QA, self-healing loops, and mechanically verifiable processes.

As AI models increasingly integrate into software pipelines, adopting such infrastructure is not optional—it’s imperative. Without it, the risk of ambiguous, error-prone outputs will persist, undermining the very utility of AI in engineering domains.

Methodology & Solution: From 6.75% to 100% Reliability

Achieving reliable function calling on deeply recursive union types in Qwen models required a systematic approach that addressed both the double-stringify bug and the broader infrastructure gaps. Here’s the step-by-step breakdown of the solution, grounded in mechanical processes and causal explanations.

1. Identifying the Root Cause: Double-Stringify Bug

The double-stringify bug in the Qwen 3.5 models was the primary failure point. Mechanically, nested union members were serialized twice, so the nested value arrived as a JSON string instead of an object. For example, for a nested union type the model emitted:

{"type": "A", "value": "{\"type\": \"B\", \"value\": 42}"}

where the schema expected:

{"type": "A", "value": {"type": "B", "value": 42}}

A standard JSON parser decodes only the outer layer, so "value" stays a string, validation reports a type mismatch, and the function call fails. The observable effect was a 0% success rate on union types for the Qwen 3.5 family.

2. Developing the Function Calling Harness

The Function Calling Harness was designed to address this by integrating Typia, a type-driven infrastructure. Typia automates:

Schema generation: Converts TypeScript types into JSON schemas.
Lenient JSON parsing: Expands partially stringified inputs to their correct structure.
Type coercion: Converts values to their intended types (e.g., "42" → 42).
Precise validation feedback: Generates actionable error messages for self-healing loops.

This infrastructure mechanically eliminates ambiguity in function calling, ensuring deterministic convergence.
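The coercion and feedback steps listed above can be illustrated with a small hand-written checker. Everything here (Schema, ValidationIssue, coerceAndValidate) is a hypothetical sketch and not Typia's API; Typia derives the equivalent logic from TypeScript types at compile time.

```typescript
// Minimal hand-written schema; hypothetical, for illustration only.
type Schema =
  | { kind: "number" }
  | { kind: "string" }
  | { kind: "object"; properties: { [key: string]: Schema } };

interface ValidationIssue {
  path: string;
  expected: string;
  received: string;
}

// Coerce numeric strings where the schema expects a number, and record every mismatch.
function coerceAndValidate(
  schema: Schema,
  input: unknown,
  path: string,
  issues: ValidationIssue[],
): unknown {
  if (schema.kind === "number") {
    if (typeof input === "number") return input;
    if (typeof input === "string" && input.trim() !== "" && Number.isFinite(Number(input))) {
      return Number(input); // e.g. "42" becomes 42
    }
    issues.push({ path, expected: "number", received: typeof input });
    return input;
  }
  if (schema.kind === "string") {
    if (typeof input !== "string") {
      issues.push({ path, expected: "string", received: typeof input });
    }
    return input;
  }
  if (typeof input !== "object" || input === null || Array.isArray(input)) {
    const received = Array.isArray(input) ? "array" : input === null ? "null" : typeof input;
    issues.push({ path, expected: "object", received });
    return input;
  }
  const record = input as { [key: string]: unknown };
  const out: { [key: string]: unknown } = {};
  for (const [key, child] of Object.entries(schema.properties)) {
    out[key] = coerceAndValidate(child, record[key], `${path}.${key}`, issues);
  }
  return out;
}
```

Each issue carries the path of the offending value plus the expected and received types, which is the kind of feedback a model can act on in a retry.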

3. Self-Healing Loops: Turning Errors into Inputs

The self-healing loop was critical for transforming initial failures into successes. Here’s the causal chain:

Impact: Initial function call fails due to type mismatch.

Internal Process: Typia generates precise validation feedback (e.g., "Expected number, received string").

Observable Effect: The model retries with corrected input, leveraging the feedback.

Outcome: Success rate increases iteratively until reaching 100%.

This loop was particularly effective because smaller models (e.g., Qwen 3.5) exposed vulnerabilities that larger models might silently paper over, making them ideal QA engineers.
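A minimal sketch of such a loop follows. The Model and Validator signatures, the retry cap, and the toy model are all assumptions made for illustration; a real harness would call the model asynchronously and pass the feedback back as part of the conversation.

```typescript
// Hypothetical interfaces: the model and the validator are stand-ins supplied by the caller.
interface Feedback {
  path: string;
  expected: string;
  received: string;
}
type Validator = (args: unknown) => Feedback[];
type Model = (prompt: string, feedback: Feedback[]) => unknown;

interface LoopResult {
  success: boolean;
  attempts: number;
  args: unknown;
}

// Retry until validation passes or the attempt budget is exhausted.
function callWithSelfHealing(
  model: Model,
  validate: Validator,
  prompt: string,
  maxAttempts: number,
): LoopResult {
  let feedback: Feedback[] = [];
  let args: unknown = undefined;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    args = model(prompt, feedback); // previous errors become part of the next input
    feedback = validate(args);
    if (feedback.length === 0) return { success: true, attempts: attempt, args };
  }
  return { success: false, attempts: maxAttempts, args };
}

// Toy stand-in for a model: it answers with a string where a number belongs
// and corrects itself only once it receives feedback.
const toyModel: Model = (_prompt, feedback) =>
  feedback.length === 0 ? { value: "forty-two" } : { value: 42 };

const toyValidate: Validator = (args) => {
  const value = (args as { value?: unknown } | null)?.value;
  return typeof value === "number"
    ? []
    : [{ path: "$input.value", expected: "number", received: typeof value }];
};

const healed = callWithSelfHealing(toyModel, toyValidate, "create a B node", 3);
```

The retry cap is a design choice of this sketch: the loop itself does not guarantee convergence, so a bounded budget keeps a model that never corrects itself from looping forever.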

4. Comparative Analysis of Solutions

Three solution options were considered:

Enhanced Model Training: Fine-tuning on union type examples. Effectiveness: Moderate. Fails on unseen edge cases. Limitation: Does not address infrastructure gaps.
Manual Validation Loops: Human-in-the-loop for error correction. Effectiveness: Low. Prone to human error and scalability issues.
Type-Driven Infrastructure (Typia): Automates schema, parsing, validation, and feedback. Effectiveness: Optimal. Transforms 0% to 100% reliability. Limitation: Requires initial setup; dependent on type system completeness.

Optimal Solution: Type-driven infrastructure like Typia, as it addresses both the bug and infrastructure gaps systematically.

5. Rule of Thumb for Practitioners

If handling deeply recursive union types → use type-driven infrastructure like Typia.

If the type system is incomplete → Typia’s reliability diminishes.

6. Risk Mechanism and Contingency

The primary risk is incomplete type systems, which degrade Typia’s reliability. Mechanically, this occurs because:

Impact: Missing type definitions lead to unhandled edge cases.

Internal Process: Typia cannot generate schemas or feedback for undefined types.

Observable Effect: Function calls fail on unseen structures.

Contingency: Ensure type system completeness or implement fallback mechanisms.
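One way to approach both halves of that contingency in TypeScript is an exhaustiveness check at compile time plus a guarded fallback at runtime. The sketch below uses hypothetical names (AstNode, assertNever, countNumbers, safeCount) and is independent of Typia.

```typescript
// Hypothetical node type for illustration.
type AstNode =
  | { type: "number"; value: number }
  | { type: "list"; items: AstNode[] };

// Compile-time completeness: if a new union member is added without a matching case,
// the call to assertNever no longer type-checks.
function assertNever(value: never): never {
  throw new Error(`Unhandled union member: ${JSON.stringify(value)}`);
}

function countNumbers(node: AstNode): number {
  switch (node.type) {
    case "number":
      return 1;
    case "list":
      return node.items.reduce((sum, item) => sum + countNumbers(item), 0);
    default:
      return assertNever(node);
  }
}

// Runtime fallback: data naming an unknown member is reported instead of crashing the caller.
function safeCount(input: unknown): { ok: true; count: number } | { ok: false; reason: string } {
  try {
    return { ok: true, count: countNumbers(input as AstNode) };
  } catch (error) {
    return { ok: false, reason: error instanceof Error ? error.message : String(error) };
  }
}
```

The compile-time check catches gaps in the handling code, while the fallback covers data that arrives at runtime outside the declared union.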

Conclusion

By addressing the double-stringify bug and integrating type-driven infrastructure, the Function Calling Harness transformed Qwen models from 6.75% to 100% reliability on deeply recursive union types. This solution is not just a fix but a paradigm shift, emphasizing the critical role of mechanically verifiable processes in AI-driven engineering workflows.

Results & Impact: From 6.75% to 100% Reliability

The implementation of the Function Calling Harness with Typia as its core infrastructure transformed the Qwen models' performance on deeply recursive union types from abysmal to flawless. Here’s the breakdown of the outcomes and their broader implications:

1. Quantitative Leap: 6.75% → 100%

The qwen3-coder-next model initially achieved a 6.75% first-try success rate on deeply recursive union types. The Qwen 3.5 family, plagued by the double-stringify bug, hit 0%. After Typia was integrated, both reached 100% reliability. The causal chain:

Impact: Double-stringified JSON outputs (e.g., "{\"type\": \"A\", \"value\": \"{\\\"type\\\": \\\"B\\\", \\\"value\\\": 42}\"}") decoded to strings where objects were expected, causing type mismatches.
Internal Process: Typia’s lenient JSON parsing and type coercion corrected partially stringified inputs, converting "42" to 42.
Observable Effect: Precise validation feedback enabled self-healing loops, iteratively refining outputs until convergence.

2. Broader Implications for Qwen Models

This breakthrough eliminates a critical bottleneck in Qwen’s utility for engineering domains. By ensuring deterministic convergence on complex types, Qwen models can now:

Generate AST data via function calling, not just text code, enabling backend auto-generation (e.g., AutoBe).
Integrate seamlessly into software pipelines, reducing ambiguity and error-prone outputs.
Serve as a foundation for AI-driven automation tools, accelerating adoption in real-world applications.
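To illustrate why AST data is more useful than text code, here is a minimal sketch that turns a validated AST into source text. The CodeNode type and the renderCode function are hypothetical; AutoBe's actual AST types are not shown in this post.

```typescript
// Hypothetical mini-AST for illustration only.
type CodeNode =
  | { type: "identifier"; name: string }
  | { type: "numberLiteral"; value: number }
  | { type: "call"; callee: string; args: CodeNode[] };

// Deterministic rendering: the same AST always yields the same source text.
function renderCode(node: CodeNode): string {
  if (node.type === "identifier") return node.name;
  if (node.type === "numberLiteral") return String(node.value);
  return `${node.callee}(${node.args.map((arg) => renderCode(arg)).join(", ")})`;
}

const callAst: CodeNode = {
  type: "call",
  callee: "add",
  args: [
    { type: "identifier", name: "x" },
    { type: "call", callee: "square", args: [{ type: "numberLiteral", value: 3 }] },
  ],
};

const rendered = renderCode(callAst); // "add(x, square(3))"
```

Because the model only supplies the tree and ordinary code does the rendering, syntax errors in the generated text are ruled out by construction; what remains to check is whether the tree matches its type.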

3. Comparative Analysis of Solutions

Three approaches were evaluated:

Enhanced Model Training: Fine-tuning on union type examples showed moderate effectiveness but failed on unseen edge cases. Mechanism: Models overfit to training data, lacking generalization.
Manual Validation Loops: Human-in-the-loop correction was ineffective due to scalability issues and human error. Mechanism: Manual intervention introduces latency and inconsistency.
Type-Driven Infrastructure (Typia): Optimal effectiveness, transforming 0% to 100% reliability. Mechanism: Automates schema generation, parsing, validation, and feedback, ensuring mechanical verifiability.

Rule of Thumb: For deeply recursive union types, use type-driven infrastructure like Typia. Its reliability diminishes only if the type system is incomplete.

4. Future Research Directions

This achievement opens avenues for:

Model-Neutral Function Calling: Extending Typia’s schema-based approach to other AI models, ensuring cross-platform reliability.
Self-Healing Mechanisms: Generalizing self-healing loops for other error types, not just type mismatches.
Type System Completeness: Developing tools to ensure type system completeness, addressing Typia’s primary contingency.

5. Practical Insights

Key takeaways for practitioners:

Smaller Models as QA Engineers: Smaller models like Qwen 3.5 exposed vulnerabilities that larger models masked. Mechanism: Smaller models lack the capacity to "paper over" systemic issues, making them ideal for QA.
Mechanically Verifiable Processes: Reliability in AI-driven engineering hinges on deterministic, type-driven processes, not probabilistic outputs.
Edge-Case Analysis: Incomplete type systems remain a risk. Mechanism: Missing type definitions lead to unhandled edge cases, causing function call failures. Contingency: Ensure type system completeness or implement fallback mechanisms.

In conclusion, the integration of Typia into Qwen models exemplifies how type-driven infrastructure can transform AI reliability. This breakthrough not only fixes a critical bug but also sets a new standard for handling complex types in AI-driven engineering workflows.

DEV Community