Why I Chose DeepSeek Over GPT-4 for a Free AI Conversation App

#showdev #ai #nextjs #productivity

I did not choose DeepSeek because I think GPT-4 is bad. I chose it because I was building a free app, and free apps teach you what actually matters pretty fast.

The question was simple: how do I keep sessions cheap enough that people can practice a lot without me lighting money on fire?

The answer pushed me toward DeepSeek-V3 (and later R1 for specific tasks).

The real constraint was volume

The app is a conversation practice tool. People come in to rehearse hard talks, not to admire the model.

A single practice session runs 8-15 turns. Each turn is roughly 300-600 tokens in, 100-300 out. Multiply that by five sessions a week per active user and the costs start compounding.

Here is what the math looked like when I was choosing (mid-2026 pricing):

Model	Input cost (per 1M tokens)	Output cost (per 1M tokens)	Cost per 10-turn session (est.)
GPT-4o	$2.50	$10.00	~$0.04-0.06
GPT-4 Turbo	$10.00	$30.00	~$0.12-0.18
DeepSeek-V3	$0.27	$1.10	~$0.004-0.007
DeepSeek-R1	$0.55	$2.19	~$0.008-0.012

At scale, the difference between $0.005 and $0.05 per session is the difference between running a free product and needing a paywall after three conversations. I wanted people to come back daily without hitting a wall.

What DeepSeek handled well

It stayed in character for 10-15 turns. It pushed back when the user got vague. It followed persona heuristics (numbered if/then rules in the system prompt) about as reliably as GPT-4o did for our use case.

For salary negotiation rehearsal, the model needs to say "that's not in the budget" and hold that position for three more turns while the user tries different approaches. DeepSeek-V3 did this. Not perfectly, but reliably enough that sessions felt real.

It also made the app easier to run as a free product. People can try, fail, reset, and try again without me worrying about per-session cost.

Where GPT-4 was still better

GPT-4 (and 4o) is smoother with nuanced emotional wording. When a conversation gets subtle, loaded with subtext, or requires picking up on implied meaning, GPT-4 catches more.

For the breakup text persona, GPT-4o noticed when a user's "kind" message was actually passive-aggressive. DeepSeek missed that about 20% more often in my informal testing across ~100 sessions.

But polish was not the main bottleneck for this product. The main bottleneck was getting people enough reps to build actual comfort with discomfort.

The tradeoff I actually cared about

Do I want one beautiful session, or ten useful ones?

For this app, ten useful ones. Every time.

So I took the cheaper model, put the engineering effort into the prompt architecture (persona seed, heuristics, mode wrapper, boundaries), and accepted that 85-90% quality at 10x the volume was a better product than 95% quality at 1x.

The model matters. The scaffolding around it matters more.

What I changed to make DeepSeek work

A few things made the choice viable:

Tighter system prompts. DeepSeek drifts more with long, loose instructions. Shorter seed, more numbered rules.
Lower temperature (0.55 for roleplay, 0.2 for scoring). Kept persona variation without character breaks.
Max reply length cap in the mode wrapper. DeepSeek's default is wordier than GPT-4o, so I had to constrain it explicitly.
Built retries into the flow. A bad response does not kill the session; the user gets a fresh turn.

The last one is underrated for any practice app. The experience should not feel fragile.

My actual takeaway

If you are building a free AI app, the best model is not always the smartest one. It is the one that lets people come back tomorrow.

Not bragging rights. Not benchmark charts. Whether the app stays affordable enough to be used like a tool instead of a demo.

For cosskill, DeepSeek made more sense. It let me build something people use five times a week instead of once and forget. Which is usually the whole game for a practice product anyway.

If you want to see the product, it is at cosskill.com.