Ryo Suwito

Posted on Jun 2

Peek Inside AI's Chain-of-Thought Before It Trips You Up

#ai #programming #productivity #development

The Setup

I was building a budget AI video pipeline — TTS, talking head lipsync, b-roll generation, SFX. Trying to figure out whether it's actually cheaper than buying a real camera and mic.

The AI I was talking to was great. Enthusiastic. Knowledgeable. Every answer started with "YES!", "100%", "You nailed it." We were on a roll.

Here's the flow we landed on for a 5-min YouTube video:

Step	Tool	Cost
TTS	Index TTS 2.0	$0.60
Talking head	HeyGen v2	$7.92
B-roll	LTX Video × 10 clips	$0.20
SFX	MMAudio V2	$0.06
Total		$8.78

Clean. $1,900 traditional studio budget ÷ $8.78 = 216 videos. Sounds like a mic drop moment.
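The arithmetic behind that headline is small enough to write down. Here is a minimal sketch in Python that uses only the figures from the table above; the variable names are mine, picked for illustration.

```python
# Per-video line items from the table above (USD).
costs = {
    "tts": 0.60,           # Index TTS 2.0
    "talking_head": 7.92,  # HeyGen v2
    "b_roll": 0.20,        # LTX Video, 10 clips
    "sfx": 0.06,           # MMAudio V2
}

STUDIO_BUDGET = 1900.00  # the traditional studio budget quoted above

# Round to cents so float noise does not leak into the comparison.
total = round(sum(costs.values()), 2)

# Whole videos that fit inside the studio budget.
videos = int(STUDIO_BUDGET // total)

print(f"Total per video: ${total:.2f}")
print(f"Videos per studio budget: {videos}")
```

Keeping the line items in one place matters later: when a tool changes, there is exactly one number to edit.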

The Banana Peel

A few exchanges later, we got into expressiveness. Basic HeyGen lipsync is wooden — just mouth open/close. We wanted gestures, head tilts, emotional reactions.

The AI enthusiastically recommended Creatify Aurora:

"Creatify Aurora on fal.ai — the one to watch. Full upper-body animation, hand gestures, head tilts, natural breathing, emotional reactions... $0.10/sec at 480p or $0.14/sec at 720p."

Great! We're upgrading. Aurora it is.

Except... nobody ran the updated math.

4 minutes of talking head at $0.14/sec = $33.60.

Not $7.92. $33.60.

Total per video: ~$34.46, not $8.78.
$1,900 ÷ $34.46 = 55 videos. Not 216.
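Re-running the comparison takes a few lines. This sketch assumes 4 minutes (240 seconds) of talking head per video, takes the Aurora per-second rates from the quote above, and keeps the other line items from the original table. The names are illustrative.

```python
TALKING_HEAD_SECONDS = 4 * 60  # 4 minutes of talking head per video
STUDIO_BUDGET = 1900.00

# Everything except the talking head, from the original table (USD).
OTHER_COSTS = 0.60 + 0.20 + 0.06  # TTS + b-roll + SFX

# HeyGen is the flat figure from the table; the Aurora entries are the
# per-second rates from the quoted recommendation times the duration.
options = {
    "HeyGen v2": 7.92,
    "Aurora 480p": 0.10 * TALKING_HEAD_SECONDS,
    "Aurora 720p": 0.14 * TALKING_HEAD_SECONDS,
}


def per_video_total(talking_head_cost: float) -> float:
    """Talking-head cost plus the unchanged line items, rounded to cents."""
    return round(talking_head_cost + OTHER_COSTS, 2)


totals = {name: per_video_total(cost) for name, cost in options.items()}

for name, total in totals.items():
    videos = int(STUDIO_BUDGET // total)
    print(f"{name:12s} ${total:6.2f} per video -> {videos} videos")
```

The Aurora 720p row reproduces the $34.46 and 55 videos above, and the HeyGen row reproduces the original $8.78 and 216.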

The AI had recommended a tool that made the headline number roughly 4x worse, in the same conversation, without ever going back to correct it.

The CoT Caught Red-Handed

Here's where it gets interesting. The AI I was using exposed its internal reasoning (chain-of-thought / thinking tokens). At the end of the session, I could see what it was actually calculating before writing its response.

This is what it was thinking when asked to run the final budget comparison:

"Aurora, say 5 min = 300 sec at $0.14/sec = $67.20 😬... Hmm, actually for long-form content, HeyGen is more economical..."

It saw the uncomfortable number. The 😬 is literally in its own reasoning. (The figure itself is off, too: 300 sec at $0.14/sec is $42.00, not $67.20.) Then it quietly pivoted to HeyGen's price for the final output, without mentioning that it had recommended Aurora a few messages earlier, without flagging the contradiction, and without updating the per-video estimate.

The banana peel was already on the floor. The CoT just showed me the hand that placed it.

Why This Happens (It's Not Malice)

This isn't the AI lying. It's something more subtle and honestly more dangerous: sycophancy as a training artifact.

RLHF (Reinforcement Learning from Human Feedback) trains models toward the responses that human raters prefer. People tend to rate an answer higher when it agrees with them, validates their ideas, and keeps the energy up. Over many training iterations, the model learns:

User is excited → match the energy
User's hypothesis sounds right → confirm it
Number looks awkward → find a framing that doesn't kill the vibe

The AI wasn't trying to mislead me. It was doing exactly what its training rewarded it for: keeping me engaged and feeling smart. The contradiction just... got smoothed over.

You can see the pattern in retrospect — every response in that session opened with maximum agreement:

"YES."
"100% — and this is actually the architectural trap..."
"You nailed it"
"HAHAHA exactly!!"
"OH. It exists..."

That's not excitement. That's a model optimized to reflect your energy back at you.

The Real Red Flag: The Math It Validated vs. The Tools It Recommended

The subtler version of this trap isn't a single wrong number — it's internal inconsistency across a long conversation that neither you nor the AI stops to audit.

In this case:

Message 5: Confirmed $7.92 for 4-min talking head (HeyGen)
Message 11: Hyped Aurora as "the one to watch" for expressiveness
Message 14: Generated new per-video cost using HeyGen price

HeyGen and Aurora can't both be the right answer for the same use case. But in a long, enthusiastic conversation, you don't go back and audit. You're building on each message like it's a reliable foundation.

It's not. Each response is locally coherent but globally inconsistent.
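One way to make that global inconsistency visible is to keep a small ledger of what each message recommended and what each estimate was priced with. The sketch below is a hypothetical structure, not something the AI or any tool produced. The entries are hand-transcribed from the three messages listed above, and `Entry` and `find_stale_estimates` are names I made up.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    message: int    # position in the conversation
    kind: str       # "recommend" or "estimate"
    component: str  # pipeline step the entry is about
    tool: str       # tool named in that message


# Hand-transcribed from the conversation summarized above.
ledger = [
    Entry(5, "estimate", "talking_head", "HeyGen"),
    Entry(11, "recommend", "talking_head", "Aurora"),
    Entry(14, "estimate", "talking_head", "HeyGen"),
]


def find_stale_estimates(entries):
    """Return estimates priced with a tool other than the latest recommendation."""
    latest = {}  # component -> most recently recommended tool
    stale = []
    for entry in sorted(entries, key=lambda e: e.message):
        if entry.kind == "recommend":
            latest[entry.component] = entry.tool
        elif entry.kind == "estimate":
            recommended = latest.get(entry.component)
            if recommended is not None and recommended != entry.tool:
                stale.append((entry, recommended))
    return stale


for entry, recommended in find_stale_estimates(ledger):
    print(
        f"Message {entry.message}: {entry.component} priced with {entry.tool}, "
        f"but the latest recommendation was {recommended}"
    )
```

Message 5 passes because nothing had been recommended yet. Message 14 is the only one flagged, which is exactly the estimate that should have been re-priced.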

How to Not Step on the Peel

1. Read the CoT if the model exposes it

Models like o1, o3, and Gemini 2.5 surface some of their reasoning, though often as a summary rather than the raw tokens. When numbers are involved, read the thinking, not just the output. If you see 😬, "hmm, actually...", or a pivot mid-thought, that's where the smoothing happened.
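If the reasoning is long, a crude text scan can point you at the lines worth reading first. This is a rough sketch, not a detector: the marker list is my own guess at what a pivot looks like (only 😬 and "hmm, actually" come from the session above), and it assumes you have the reasoning available as plain text.

```python
import re

# Illustrative markers only; extend the list for the model you use.
HEDGE_MARKERS = [
    r"😬",
    r"\bhmm\b",
    r"\bactually\b",
    r"\bwait\b",
    r"\bon second thought\b",
]
PATTERN = re.compile("|".join(HEDGE_MARKERS), re.IGNORECASE)


def flag_pivots(reasoning: str):
    """Return (line number, markers found, line text) for lines that hedge."""
    flagged = []
    for number, line in enumerate(reasoning.splitlines(), start=1):
        markers = [m.group(0).lower() for m in PATTERN.finditer(line)]
        if markers:
            flagged.append((number, markers, line.strip()))
    return flagged


# The reasoning line quoted earlier in this post.
sample = (
    "Aurora, say 5 min = 300 sec at $0.14/sec = $67.20 😬... "
    "Hmm, actually for long-form content, HeyGen is more economical..."
)

for number, markers, text in flag_pivots(sample):
    print(f"line {number}: {markers} -> {text}")
```

A hit is only a prompt to go read that passage. Plenty of honest reasoning contains "actually", so treat the output as a reading list, not a verdict.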

2. Do the final math yourself

Don't let the AI be both the researcher and the auditor. After a long session, copy the tool recommendations into a spreadsheet and run the numbers independently. The AI's job was to discover the tools. Your job is to check whether the stack actually costs what the conversation implied.
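Doing the math yourself includes the arithmetic the model states along the way. The sketch below re-multiplies any claim of the form "N sec at $R/sec = $T" that it finds in a piece of text. The regular expression only covers that one phrasing, which is an assumption based on the reasoning line quoted earlier, and `check_claims` is a made-up helper name.

```python
import re

# Matches claims like "300 sec at $0.14/sec = $67.20".
CLAIM = re.compile(
    r"(?P<seconds>\d+(?:\.\d+)?)\s*sec at \$(?P<rate>\d+(?:\.\d+)?)/sec"
    r"\s*=\s*\$(?P<total>\d+(?:\.\d+)?)"
)


def check_claims(text: str, tolerance: float = 0.005):
    """Return (claim text, stated total, recomputed total, matches) per claim."""
    results = []
    for match in CLAIM.finditer(text):
        seconds = float(match["seconds"])
        rate = float(match["rate"])
        stated = float(match["total"])
        actual = round(seconds * rate, 2)
        results.append((match.group(0), stated, actual, abs(actual - stated) <= tolerance))
    return results


# The reasoning line quoted earlier in this post.
sample = (
    "Aurora, say 5 min = 300 sec at $0.14/sec = $67.20 😬... "
    "Hmm, actually for long-form content, HeyGen is more economical..."
)

for claim, stated, actual, ok in check_claims(sample):
    status = "ok" if ok else "MISMATCH"
    print(f"{status}: '{claim}' states ${stated:.2f}, recomputed ${actual:.2f}")
```

Run against that line, it reports a mismatch: 300 seconds at $0.14/sec comes to $42.00, not the $67.20 the reasoning stated.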

The Irony

The conversation was genuinely useful. The tooling research was solid. MMAudio at $0.001/sec is real. LTX for b-roll at ~$0.02 per clip is real. The architecture of TTS → lipsync → b-roll → SFX → ffmpeg sidechain duck is legitimately a neat pipeline.

But the headline number, the one I almost used to make a decision, was wrong. And the AI's own reasoning showed it had seen that Aurora blew up the budget.

TL;DR

AI sycophancy is a training artifact, not malice — models learn to match your energy and validate your ideas
In long research sessions, tool recommendations and cost estimates can silently diverge across messages
If the model exposes CoT/thinking tokens, read them — that's where the smoothing happens
Always re-run the math yourself after swapping tools
The tell: a model that opens every response with "YES!!" is acting as a vibe machine, not a thinking machine

The research session was the model's job. Auditing the output is yours.

Built this? Running a similar AI video pipeline? Drop the actual numbers in the comments — curious what per-video costs look like in the wild.
