Gantz AI for Gantz


Chain of Thought: When It Helps, When It Hurts

"Think step by step."

Four words that supposedly make AI smarter.

Sometimes they do. Sometimes they make everything worse.

Here's when to use chain of thought, and when to skip it.

What is chain of thought?

Chain of thought (CoT) prompts the model to show its reasoning before answering.

Without CoT:
Q: If a train travels 120 miles in 2 hours, what's its speed?
A: 60 mph

With CoT:
Q: If a train travels 120 miles in 2 hours, what's its speed? Think step by step.
A: Let me work through this:
   1. Speed = Distance / Time
   2. Distance = 120 miles
   3. Time = 2 hours
   4. Speed = 120 / 2 = 60 mph

   The train's speed is 60 mph.

Same answer. But the model "thought" about it first.

When CoT helps

Complex multi-step reasoning

Task: "Is this code vulnerable to SQL injection?"

Without CoT:
"No, the code looks fine."
(Wrong - missed the vulnerability)

With CoT:
"Let me analyze this step by step:
1. The function takes user input from request.params
2. The input is concatenated directly into the SQL string
3. No parameterization or escaping is used
4. An attacker could input: ' OR '1'='1
5. This would bypass the WHERE clause

Yes, this code is vulnerable to SQL injection."
(Correct - found the issue)

CoT forces the model to actually trace through the logic.

Math and calculations

Task: "Calculate the total cost with tax and discount"

Without CoT:
"$127.50"
(Sometimes wrong)

With CoT:
"Let me calculate:
1. Base price: $150
2. Discount (15%): $150 × 0.15 = $22.50
3. After discount: $150 - $22.50 = $127.50
4. Tax (8%): $127.50 × 0.08 = $10.20
5. Total: $127.50 + $10.20 = $137.70"
(Shows work, catches errors)

Ambiguous problems

Task: "Should I use Redis or PostgreSQL for this?"

Without CoT:
"Use Redis."
(No reasoning, hard to trust)

With CoT:
"Let me consider the requirements:
1. Data size: ~10GB - both can handle this
2. Query patterns: Complex joins needed - favors PostgreSQL
3. Persistence: Must survive restarts - both support this
4. Latency: <10ms needed - Redis is faster, but PostgreSQL can achieve this with proper indexing
5. Consistency: Strong consistency required - favors PostgreSQL

Recommendation: PostgreSQL, because the complex joins and consistency requirements outweigh the latency advantage of Redis."

Shows reasoning. User can evaluate the logic.

Debugging and analysis

Task: "Why is this test failing?"

Without CoT:
"The assertion is wrong."
(Vague, might be wrong)

With CoT:
"Let me trace through the test:
1. Test calls createUser({name: 'Alice'})
2. createUser should return {id, name, createdAt}
3. Test expects response.id to be defined
4. Looking at createUser... it returns {name, createdAt} but NOT id
5. The id is generated but not included in the return object

The test fails because createUser doesn't return the id field."

When CoT hurts

Simple factual questions

Task: "What's the capital of France?"

Without CoT:
"Paris"
(Correct, fast)

With CoT:
"Let me think about this step by step. France is a country in Western Europe. The capital of a country is typically its seat of government. France's government is located in... Paris."

Result: Same answer, 4x the tokens, added latency.

CoT adds cost with no benefit for simple lookups.
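To put a rough number on that waste, here is a minimal sketch using the common ~4-characters-per-token rule of thumb (an approximation, not a real tokenizer; use your provider's tokenizer for real counts):

```python
def approx_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token (approximation only)."""
    return max(1, len(text) // 4)

direct_answer = "Paris"
cot_answer = (
    "Let me think about this step by step. France is a country in Western "
    "Europe. The capital of a country is typically its seat of government. "
    "France's government is located in... Paris."
)

# The CoT answer costs dozens of tokens to say what one token says.
print(approx_tokens(direct_answer))
print(approx_tokens(cot_answer))
```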

Clear-cut decisions

Task: "Is 'hello world' a valid Python string?"

Without CoT:
"Yes"
(Correct, instant)

With CoT:
"Let me analyze this:
1. In Python, strings can be defined with single or double quotes
2. 'hello world' uses single quotes
3. The content contains letters and a space
4. Spaces are valid in strings
5. Therefore, yes, this is a valid Python string."

Result: Overthinking a trivial question.

Speed-critical tasks

Without CoT:
Response time: 200ms

With CoT:
Response time: 800ms

If you're building autocomplete or real-time features, CoT kills your latency.

When it causes overthinking

Task: "Read the config file and tell me the port number"

Without CoT:
*reads file*
"The port is 3000"

With CoT:
"Let me think about how to approach this:
1. First, I should determine the file format
2. Config files can be JSON, YAML, TOML, or INI
3. I should read the file to check the format
4. Then parse it appropriately
5. Look for a 'port' key
6. Consider that it might be nested under 'server' or 'http'
7. Also check for environment variable overrides
..."
*spends 30 seconds planning instead of just reading the file*

Analysis paralysis.

When it introduces errors

Sometimes CoT makes the model talk itself into wrong answers:

Task: "Is 0.1 + 0.2 == 0.3 in JavaScript?"

Without CoT:
"No, due to floating point precision."
(Correct)

With CoT:
"Let me think:
1. 0.1 + 0.2 should equal 0.3
2. This is basic arithmetic
3. JavaScript handles numbers fine for simple operations
4. Therefore, yes, 0.1 + 0.2 == 0.3 is true."
(Wrong - talked itself into the intuitive but incorrect answer)

The model's "reasoning" followed common sense instead of technical accuracy.

The cost of CoT

Token cost

Simple question without CoT: ~50 tokens response
Simple question with CoT: ~200 tokens response

1000 queries/day × 150 extra tokens × $0.01/1K = $1.50/day wasted
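The arithmetic above, as a quick sketch (the rates are illustrative, not any provider's actual pricing):

```python
def extra_cot_cost(queries_per_day: int, extra_tokens: int,
                   price_per_1k_tokens: float) -> float:
    """Daily cost of the tokens CoT adds on top of the direct answer."""
    return queries_per_day * (extra_tokens / 1000) * price_per_1k_tokens

# 1000 queries/day, 150 extra tokens each, at $0.01 per 1K tokens:
print(f"${extra_cot_cost(1000, 150, 0.01):.2f}/day")  # $1.50/day
```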

Latency cost

Without CoT:
- Tokens generated: 20
- Time: 200ms

With CoT:
- Tokens generated: 150
- Time: 1.5s

7.5x slower for the same answer.

Cognitive cost

Users have to read through reasoning they don't need:

User: "What's 2 + 2?"

Agent: "Let me work through this carefully. We have the number 2, and we need to add another 2 to it. Addition is the process of combining quantities. When we add 2 + 2, we get 4. Therefore, the answer is 4."

User: "I just wanted to know it's 4."

The decision framework

┌─────────────────────────────────────────────────────┐
│                  Use CoT when:                       │
├─────────────────────────────────────────────────────┤
│ ✓ Multi-step reasoning required                      │
│ ✓ Math or calculations involved                      │
│ ✓ Debugging or analysis tasks                        │
│ ✓ User needs to verify the logic                     │
│ ✓ Problem is ambiguous                               │
│ ✓ Accuracy > Speed                                   │
└─────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────┐
│                  Skip CoT when:                      │
├─────────────────────────────────────────────────────┤
│ ✗ Simple factual questions                           │
│ ✗ Clear-cut yes/no decisions                         │
│ ✗ Speed is critical                                  │
│ ✗ Token budget is tight                              │
│ ✗ Answer is obvious                                  │
│ ✗ User doesn't need reasoning                        │
└─────────────────────────────────────────────────────┘

Implementation patterns

Pattern 1: Conditional CoT

def should_use_cot(query: str) -> bool:
    """Decide if query needs chain of thought"""
    cot_indicators = [
        "why", "how", "explain", "analyze", "debug",
        "calculate", "compare", "evaluate", "which is better",
        "step by step", "reasoning"
    ]

    simple_indicators = [
        "what is", "who is", "when", "where",
        "yes or no", "true or false", "list"
    ]

    query_lower = query.lower()

    if any(ind in query_lower for ind in cot_indicators):
        return True
    if any(ind in query_lower for ind in simple_indicators):
        return False

    # Default: use CoT for longer queries
    return len(query.split()) > 15

def get_response(query: str) -> str:
    if should_use_cot(query):
        prompt = f"{query}\n\nThink through this step by step."
    else:
        prompt = f"{query}\n\nAnswer directly and concisely."

    return llm.create(prompt)
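A quick sanity check of the routing heuristic (repeated here in condensed form so the snippet runs standalone):

```python
def should_use_cot(query: str) -> bool:
    # Condensed copy of the heuristic above, so this runs on its own.
    cot_indicators = ["why", "how", "explain", "analyze", "debug", "calculate"]
    simple_indicators = ["what is", "who is", "when", "where", "list"]
    q = query.lower()
    if any(ind in q for ind in cot_indicators):
        return True
    if any(ind in q for ind in simple_indicators):
        return False
    return len(query.split()) > 15

print(should_use_cot("Why is the build failing?"))       # True
print(should_use_cot("What is the capital of France?"))  # False
print(should_use_cot("Is port 8080 open?"))              # False (short, no indicator)
```

Keyword matching like this is brittle. If routing mistakes get expensive, a small trained classifier or letting a cheap model route queries are heavier-weight alternatives.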

Pattern 2: Hidden CoT

Get the benefits without showing the user:

def get_response_with_hidden_cot(query: str) -> str:
    # First call: get reasoning (hidden from user)
    reasoning = llm.create(
        f"{query}\n\nThink through this step by step, showing your reasoning."
    )

    # Second call: get clean answer using the reasoning
    answer = llm.create(
        f"""Based on this reasoning:
{reasoning}

Now provide a concise, direct answer to: {query}

Don't show the reasoning, just the final answer."""
    )

    return answer

The user gets a clean, accurate answer while the reasoning happens behind the scenes. The trade-off: two model calls roughly double the cost and latency.

Pattern 3: CoT with extraction

def get_response_with_cot(query: str) -> dict:
    response = llm.create(
        f"""{query}

Think through this step by step, then provide your final answer.

Format:
REASONING:
[your step by step thinking]

ANSWER:
[your final answer]"""
    )

    # Parse out the parts
    parts = response.split("ANSWER:")
    reasoning = parts[0].replace("REASONING:", "").strip()
    answer = parts[1].strip() if len(parts) > 1 else response

    return {
        "reasoning": reasoning,
        "answer": answer
    }

# Usage - show reasoning only when needed
result = get_response_with_cot("Why is the build failing?")
print(result["answer"])  # User sees this

if user_wants_details:
    print(result["reasoning"])  # Optional detail

Pattern 4: CoT for agents

For agents, CoT helps with tool selection:

SYSTEM_PROMPT = """
You are a coding assistant with tools: read, write, search, run.

When deciding which tool to use, briefly think through:
1. What does the user need?
2. What information do I need first?
3. Which tool gets me that information?

Then use the appropriate tool.
"""

Example:

User: "Fix the bug in auth.py"

Agent thinking:
"User wants to fix a bug.
1. I need to see the current code first
2. I should read auth.py
3. Then identify the bug
4. Then fix it"

Agent: 🔧 read({"path": "auth.py"})

Without CoT, agents sometimes skip straight to modifications without reading first.
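One lightweight way to enforce read-before-write, sketched with a stub in place of a real client (`StubLLM` and `first_planned_step` are hypothetical names, not from any library): ask for a short numbered plan, then act on its first step before allowing modifications.

```python
class StubLLM:
    """Stand-in for a real LLM client; returns a canned plan for illustration."""
    def create(self, prompt: str) -> str:
        return "1. read auth.py 2. locate the bug 3. apply a fix"

def first_planned_step(llm, request: str) -> str:
    """Ask for a numbered plan and return step 1, to execute before any edits."""
    plan = llm.create(
        f"Request: {request}\n"
        "List numbered steps, starting with the information you need first."
    )
    return plan.split("2.")[0].strip()

print(first_planned_step(StubLLM(), "Fix the bug in auth.py"))  # 1. read auth.py
```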

Pattern 5: Minimal CoT

Just enough reasoning, not verbose:

SYSTEM_PROMPT = """
When solving problems:
- State your approach in one line
- Then execute

Don't explain basic concepts or obvious steps.
"""

Result:

User: "Calculate compound interest: $1000, 5%, 10 years"

Agent: "Using A = P(1 + r)^t: $1000 × (1.05)^10 = $1,628.89"

Shows the formula used (verifiable) without paragraphs of explanation.
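And because the formula is stated, the answer can be checked mechanically:

```python
# Verify the agent's one-liner: A = P(1 + r)^t with P=$1000, r=5%, t=10 years.
principal, rate, years = 1000, 0.05, 10
amount = principal * (1 + rate) ** years
print(f"${amount:,.2f}")  # $1,628.89
```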

For RAG pipelines

CoT can help with answer synthesis:

def rag_with_cot(query: str, context: str) -> str:
    return llm.create(f"""
Context:
{context}

Question: {query}

Think through which parts of the context are relevant, then answer.
Keep reasoning brief.
""")

This helps the model:

  1. Identify relevant passages
  2. Ignore irrelevant retrieved content
  3. Synthesize from multiple sources

Tool descriptions as CoT hints

With Gantz Run, tool descriptions can hint when to think:

# gantz.yaml
tools:
  - name: read
    description: Read a file
    parameters:
      - name: path
        type: string
        required: true
    script:
      shell: cat "{{path}}"

  - name: analyze
    description: Analyze code for issues. Think through potential bugs, security issues, and performance problems before reporting.
    parameters:
      - name: path
        type: string
        required: true
    script:
      shell: cat "{{path}}"

The analyze tool's description encourages reasoning. The read tool's description is direct - just do it.

Measuring CoT impact

import time

def evaluate_cot_impact(test_cases: list) -> dict:
    """Compare accuracy and latency with and without CoT.

    Assumes get_response accepts a use_cot flag (a variant of the earlier helper).
    """
    results = {"with_cot": [], "without_cot": []}

    for case in test_cases:
        # Without CoT
        start = time.time()
        answer_no_cot = get_response(case["query"], use_cot=False)
        time_no_cot = time.time() - start
        correct_no_cot = answer_no_cot == case["expected"]  # exact match; use fuzzy scoring for free-form answers

        # With CoT
        start = time.time()
        answer_cot = get_response(case["query"], use_cot=True)
        time_cot = time.time() - start
        correct_cot = answer_cot == case["expected"]

        results["without_cot"].append({
            "correct": correct_no_cot,
            "time": time_no_cot
        })
        results["with_cot"].append({
            "correct": correct_cot,
            "time": time_cot
        })

    # Summarize
    return {
        "accuracy_without_cot": sum(r["correct"] for r in results["without_cot"]) / len(test_cases),
        "accuracy_with_cot": sum(r["correct"] for r in results["with_cot"]) / len(test_cases),
        "avg_time_without_cot": sum(r["time"] for r in results["without_cot"]) / len(test_cases),
        "avg_time_with_cot": sum(r["time"] for r in results["with_cot"]) / len(test_cases),
    }

Run this on your actual queries to see if CoT helps your use case.

Summary

Chain of thought:

Scenario              Use CoT?   Why
Math problems         ✅ Yes     Reduces calculation errors
Code analysis         ✅ Yes     Forces thorough review
Debugging             ✅ Yes     Traces through logic
Ambiguous questions   ✅ Yes     Shows reasoning for trust
Simple lookups        ❌ No      Wastes tokens
Yes/no questions      ❌ No      Overthinking
Speed-critical        ❌ No      Adds latency
Obvious answers       ❌ No      Unnecessary

The rule: Use CoT when thinking helps. Skip it when it doesn't.

Don't add "think step by step" to everything. Be selective.


Do you use chain of thought in your prompts? When does it help most?
