Jeffrey.Feillp

Tian AI Thinker: Building a Three-Layer LLM Reasoning Engine

The Three-Layer Reasoning Architecture

Tian AI's Thinker module implements a three-layer reasoning engine that routes each query to one of three modes according to its complexity: Fast, CoT, or Deep. Simple queries get a quick direct answer, and harder ones get step-by-step or context-enhanced reasoning, so the extra latency is only paid when the query needs it.

Layer 1: Fast Mode (Direct Response)

For simple queries like greetings or basic facts, Fast Mode generates direct responses using minimal context. The prompt engineering here is deliberately lightweight:

System: You are a helpful AI assistant. Respond concisely.
Query: {user_input}
Response:

This mode achieves ~30 tokens/second on Qwen2.5-1.5B, which is fast enough for real-time chat interactions.
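If you want to check a throughput figure like this on your own hardware, here is a minimal sketch. It fills the Fast Mode template and times a generation call. The `generate` callable is a hypothetical stand-in for whatever runs the model and returns the generated tokens; it is not part of Tian AI's actual API.

```python
import time
from typing import Callable, List

# Fast Mode template, copied from the prompt shown above.
FAST_TEMPLATE = (
    "System: You are a helpful AI assistant. Respond concisely.\n"
    "Query: {user_input}\n"
    "Response:"
)


def build_fast_prompt(user_input: str) -> str:
    """Fill the Fast Mode template with the (whitespace-trimmed) user query."""
    return FAST_TEMPLATE.format(user_input=user_input.strip())


def tokens_per_second(generate: Callable[[str], List[str]], prompt: str) -> float:
    """Time one call to `generate` and return generated tokens per second.

    `generate` is an assumed interface: it takes a prompt and returns the
    list of generated tokens.
    """
    start = time.perf_counter()
    tokens = generate(prompt)
    elapsed = time.perf_counter() - start
    return len(tokens) / elapsed if elapsed > 0 else float("inf")
```

A single timed call is noisy, so averaging over several prompts gives a more trustworthy number.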

Layer 2: CoT Mode (Chain-of-Thought)

For multi-step reasoning problems, CoT Mode activates step-by-step thinking:

System: You are a reasoning AI. Think step by step.
Query: {user_input}
Let me think through this carefully:
1. First, I need to understand...

The key trick is temperature control: we set temperature=0.3 for CoT, which is low enough to keep the reasoning steps consistent from one run to the next while still leaving a little room for variation.
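To see why a low temperature helps, recall that the sampling temperature divides the logits before the softmax, so values below 1 concentrate probability on the top-ranked token. The sketch below compares temperature 1.0 with the 0.3 used for CoT. The logits are made-up numbers for illustration, not real model output.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after scaling them by 1/temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative logits for three candidate next tokens (assumed values).
logits = [2.0, 1.0, 0.5]

default_probs = softmax_with_temperature(logits, 1.0)
cot_probs = softmax_with_temperature(logits, 0.3)

# At 0.3 the top token takes a larger share of the probability mass,
# so sampled reasoning chains vary less between runs.
print("temperature=1.0:", [round(p, 3) for p in default_probs])
print("temperature=0.3:", [round(p, 3) for p in cot_probs])
```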

Layer 3: Deep Mode (Context-Enhanced Reasoning)

The most powerful mode activates context-aware reasoning with retrieved knowledge:

System: You are a knowledgeable AI with access to a personal knowledge base.
Context: {retrieved_entries}
Query: {user_input}
Based on the context and my knowledge:
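Here is a minimal sketch of how the `{retrieved_entries}` slot might be filled. The article does not describe the retrieval method, so this example assumes a plain word-overlap ranking over an in-memory list of entries. The `retrieve` and `build_deep_prompt` functions are hypothetical names; a real system would more likely use embeddings or a search index.

```python
from typing import List

# Deep Mode template, copied from the prompt shown above.
DEEP_TEMPLATE = (
    "System: You are a knowledgeable AI with access to a personal knowledge base.\n"
    "Context: {retrieved_entries}\n"
    "Query: {user_input}\n"
    "Based on the context and my knowledge:"
)


def retrieve(query: str, entries: List[str], top_k: int = 2) -> List[str]:
    """Rank entries by how many words they share with the query (assumed method)."""
    query_words = set(query.lower().split())
    scored = []
    for entry in entries:
        overlap = len(query_words & set(entry.lower().split()))
        if overlap > 0:
            scored.append((overlap, entry))
    # Highest overlap first; ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored[:top_k]]


def build_deep_prompt(user_input: str, entries: List[str], top_k: int = 2) -> str:
    """Fill the Deep Mode template with the best-matching knowledge base entries."""
    retrieved = retrieve(user_input, entries, top_k)
    if retrieved:
        context = "\n".join(f"- {entry}" for entry in retrieved)
    else:
        context = "(no relevant entries found)"
    return DEEP_TEMPLATE.format(retrieved_entries=context, user_input=user_input)
```

Keeping `top_k` small matters on a 1.5B model, since every retrieved entry takes up part of a limited context window.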

Prompt Engineering for Small Models

Making Qwen2.5-1.5B punch above its weight requires careful prompt engineering:

  1. Structured output formats: Always request JSON or numbered lists for complex responses
  2. Few-shot examples: Include 2-3 examples in the system prompt for new tasks
  3. Negative constraints: Explicitly tell the model what NOT to do ("Do not mention external tools you don't have")
  4. Token budget: Cap response length to match query complexity
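The four techniques above can be combined in one prompt builder. What follows is a sketch under stated assumptions, not the Thinker module's actual code: the JSON shape `{"answer": ...}`, the function names, and the per-mode token budgets are all illustrative choices.

```python
import json
from typing import List, Optional, Tuple

# Hypothetical per-mode response caps (technique 4); tune for your own model.
TOKEN_BUDGET = {"fast": 64, "cot": 256, "deep": 512}


def build_system_prompt(
    task: str,
    examples: List[Tuple[str, str]],
    forbidden: List[str],
) -> str:
    """Assemble a system prompt with a JSON format, constraints, and few-shot examples."""
    lines = [task, 'Answer only with JSON of the form {"answer": "..."}.']
    for rule in forbidden:  # technique 3: negative constraints
        lines.append(f"Do not {rule}.")
    for question, answer in examples:  # technique 2: few-shot examples
        lines.append(f"Example query: {question}")
        lines.append("Example response: " + json.dumps({"answer": answer}))
    return "\n".join(lines)


def parse_answer(raw: str) -> Optional[str]:
    """Extract the "answer" string from model output, or None if it is malformed."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(raw[start : end + 1])
    except json.JSONDecodeError:
        return None
    answer = data.get("answer") if isinstance(data, dict) else None
    return answer if isinstance(answer, str) else None
```

Small models often wrap the JSON in extra chatter, which is why `parse_answer` looks for the outermost braces instead of parsing the whole response.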

Performance Results

| Mode | Latency | Quality Score | Use Case |
| --- | --- | --- | --- |
| Fast | 0.5-1 s | 6/10 | Greetings, simple facts |
| CoT | 2-3 s | 8/10 | Math, logic problems |
| Deep | 3-5 s | 9/10 | Knowledge-based Q&A |

The Thinker module selects the layer for each query based on query complexity analysis, so the slower modes are used only when a query calls for them.
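The article does not spell out how the complexity analysis works, so the router below is only a guess at its general shape. The keyword patterns and the word-count threshold are assumptions chosen for illustration, and `select_mode` is a hypothetical name.

```python
import re

# Assumed signals; a production router might use a classifier instead.
KNOWLEDGE_HINTS = re.compile(
    r"\b(my notes|my documents|according to|knowledge base)\b", re.IGNORECASE
)
REASONING_HINTS = re.compile(
    r"\d+\s*[-+*/]\s*\d+|\b(why|prove|calculate|solve|how many|step by step)\b",
    re.IGNORECASE,
)
LONG_QUERY_WORDS = 25  # hypothetical cutoff for "long enough to need reasoning"


def select_mode(query: str) -> str:
    """Return "fast", "cot", or "deep" for a query using simple heuristics."""
    text = query.strip()
    if KNOWLEDGE_HINTS.search(text):
        return "deep"  # the query refers to stored personal knowledge
    if REASONING_HINTS.search(text) or len(text.split()) > LONG_QUERY_WORDS:
        return "cot"  # arithmetic, logic keywords, or a long question
    return "fast"  # greetings and short factual questions


for example in ["Hello!", "Calculate 12 * 7 plus 3", "What do my notes say about the launch date?"]:
    print(example, "->", select_mode(example))
```

Checking the knowledge hints first means a query that needs both retrieval and reasoning goes to Deep Mode, the slowest but highest-scoring layer in the table.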
