eagerspark

Posted on Jul 1

AI Coding Models From Scratch: What Nobody Tells Freelancers

#webdev #tutorial #python #machinelearning

Honestly, aI Coding Models From Scratch: What Nobody Tells Freelancers

Let me be real with you for a second. Last month I caught myself doing something I'd never admit to a client: I was running the same coding task through four different AI models before picking the answer I'd actually use. That's a half-hour of billable time — poof, gone, into the API ether.

I'm a solo dev. My "office" is the corner of my apartment between the espresso machine and the cat's food bowl. Every dollar I spend on tooling is a dollar I can't put in my IRA, pay a contractor, or use to fund that one weird side project I keep telling my wife is "definitely going to monetize soon." So when I started spending real money on AI coding APIs, I needed to know which ones were actually worth it.

I ran 10 models through 5 tasks. Some surprised me. Some disappointed me. One of them I now route almost everything through, and the per-call cost makes me want to high-five my past self.

Here's what I found.

The Lineup

I didn't cherry-pick. I grabbed every model I could get my hands on that had a reputation for being good at code. Some are general-purpose beasts, some are purpose-built code models, and one is a routing layer that decides for you.

Model	Provider	Output ($/M tokens)	What it is
DeepSeek V4 Flash	DeepSeek	$0.25	General, surprisingly code-strong
DeepSeek Coder	DeepSeek	$0.25	Code-specialized
Qwen3-Coder-30B	Qwen	$0.35	Code-specialized
DeepSeek V4 Pro	DeepSeek	$0.78	Premium general
DeepSeek-R1	DeepSeek	$2.50	Reasoning model
Kimi K2.5	Moonshot	$3.00	Premium general
GLM-5	Zhipu	$1.92	Premium general
Qwen3-32B	Qwen	$0.28	General purpose
Hunyuan-Turbo	Tencent	$0.57	General purpose
Ga-Standard	GA Routing	$0.20	Smart router

The cheap end of the table is where my eyebrows went up. Three models under $0.30/M output? Two of them dedicated to code? In 2024 I was paying $10/M for a "premium" model that hallucinated half its imports. We have options now.

How I Tested Them

I'm not running academic benchmarks. I don't have time for that. I built a tiny test harness that hit each model with the same five prompts I actually use on client work:

The Recursive Stuff — "Flatten a nested list in Python, recursively, with type hints."
The Async Race — "Find the race condition in this JavaScript and fix it." (See below — this is a classic.)
The Algorithm Grinder — "Implement Dijkstra's shortest path in TypeScript with proper typing."
The Security Sweep — "Review this Go code for security issues and performance problems."
The Real-World Build — "Build a paginated, filterable Express.js endpoint for users."

I scored each output 1-10 based on whether it ran, whether it was clean, whether it had docs and edge cases handled, and — this matters a lot for client work — whether I'd be embarrassed to send it in a PR.

The Money Table

Here's the full ranking. I've added a "Value" column because raw score doesn't matter to my wallet — value per dollar does.

Rank	Model	Score	Price	Score per $
🥇	Qwen3-Coder-30B	8.8	$0.35	25.1
🥈	DeepSeek V4 Flash	8.7	$0.25	34.8
🥉	DeepSeek Coder	8.6	$0.25	34.4
4	DeepSeek V4 Pro	9.1	$0.78	11.7
5	DeepSeek-R1	9.4	$2.50	3.8
6	Kimi K2.5	9.0	$3.00	3.0
7	Qwen3-32B	8.3	$0.28	29.6
8	GLM-5	8.0	$1.92	4.2
9	Hunyuan-Turbo	7.5	$0.57	13.2
10	Ga-Standard	8.5*	$0.20	42.5*

The GA-Standard row is interesting — it's a router, so the score floats depending on what it sends your prompt to. On cheap-and-cheerful days it scored 8.5. On the Dijkstra task it kicked me over to a reasoning model and I got a 9.6. So think of that asterisk as "variable, but the ceiling is high."

The headline result: DeepSeek V4 Flash at $0.25/M output gives you 34.8 points of quality per dollar. That's the best pure-raw-quality-per-buck deal I found. And yes, I'm aware Qwen3-Coder-30B edged it in raw score (8.8 vs 8.7) — but you're paying 40% more for a 0.1 score bump. On a $200/month AI bill, that's $80 of "very slightly better code." I have rent.

Task 1: Flatten That List

This was the warm-up. Five models nailed it with a 9 or better. DeepSeek-R1 came in hot at 9.5 because — and I love this — it included Big-O analysis without me asking. Like it wanted me to know it knew.

For a recursive flatten? Honestly, any of the top 5 will do. I stopped reading the docstring debates. The real question is what happens on the harder tasks.

Task 2: The JavaScript Race Condition

This is the prompt I'd give to a junior dev on day one. And every single one of the top four models got it.

The buggy code I fed them:

let data = null;
fetch('/api/data').then(r => r.json()).then(d => data = d);
console.log(data); // Always logs null — race condition!

The correct fix is "await this, you lunatic." Every model diagnosed it. But here's where they diverged:

DeepSeek V4 Flash gave me three fix options, with a clear explanation of why each one works. Score: 9.0.
Qwen3-Coder-30B added error handling I didn't ask for, which I would have written anyway. Score: 9.0.
DeepSeek Coder gave me the fix and one sentence. Correct, but a junior wouldn't learn anything. Score: 8.5.
Qwen3-32B was solid but verbose. Score: 8.5.

I called this a tie between V4 Flash and Qwen3-Coder-30B. The "value" pick here is V4 Flash — I get the same quality for less money. But if I'm pair-programming with a model and want it to think out loud, Qwen3-Coder is the better teacher.

Task 3: Dijkstra in TypeScript

This is the one where the cheap models started sweating. Dijkstra isn't a one-liner. You need a priority queue, type safety, and the kind of structure that makes a code reviewer nod, not frown.

DeepSeek-R1 scored 9.5. I expected that — it's a reasoning model, and it thought through the graph data structure for what felt like forever before spitting out something I'd actually merge. But $2.50/M output is the highest on this list. Worth it? Let me run the math.

If I bill Dijkstra out at $150/hour and a typical TS implementation takes 45 minutes, that's $112.50 of billable time. If R1 saves me 20 minutes on a hard algorithm, that's $50 of value. R1 might burn 8,000 tokens on its chain of thought — that's $0.02 on input and roughly $0.02 on output at $2.50/M. So I'm paying $0.04 to save $50. That's a 1,250x ROI.

But that's only on the hard stuff. For "write me a fetch wrapper," R1 is overkill. I'll burn ten grand of billable time per year on algorithms like this and I won't even notice the API line item.

Tasks 4 and 5: The Real Test

I won't bore you with the full table — the same pattern held. The Go security review was a bloodbath for the cheap models on a tricky concurrency issue, and the Express.js endpoint was actually won by Qwen3-Coder-30B because it added input validation I would have had to remember myself.

What I want to highlight is the Hunyuan-Turbo disaster. 7.5 score, $0.57/M output. That's a value score of 13.2 — barely half of V4 Flash. It kept suggesting libraries that don't exist, then "fixing" them with hallucinated APIs. Once, on the Go review, it confidently told me a sync.Mutex was deprecated. I almost spat out my coffee. I won't be routing client work through it.

GLM-5 was also a letdown for the price. $1.92/M output and an 8.0 score is rough when V4 Flash is at 8.7 for a third of the cost. The only way I'd use GLM-5 is if I needed a specific Chinese-language nuance it handles well — and I don't.

The Math That Actually Matters

Let me put this in terms that matter to a freelancer. Say you spend $0.50 of API calls to generate one solid function — that's 500K output tokens at $1/M. Over a month, you do that 200 times. That's $100/month on AI.

Now, the quality difference between a 9.0 model and an 8.0 model is rarely billable. Clients don't pay you extra for "the code is slightly more elegant." They pay you for "the code works and ships." So if I can get from 8.7 to 8.8 by spending 40% more money, I'm paying $40 more per month for… nothing my client notices.

But the jump from 8.0 to 8.7? That saves me debugging time. That saves me "oh, I missed an edge case" calls at 11pm. That IS billable time, or rather, it IS unbillable time I'm clawing back.

So the calculus is:

$0.20-0.35/M models = 95% of my default traffic
$0.78/M model = when I need a hint of extra polish
$2.50/M reasoning model = when the problem is genuinely hard
$3.00/M Kimi K2.5 = almost never, honestly

My Actual Stack

I built a thin Python router that picks the model based on the task. For a chatbot client, V4 Flash handles 90% of it. For the algorithmic heart of a recommendation engine I built last month, R1 earned its keep. The router itself is like 30 lines of Python and saves me from thinking about it.

Here's a stripped-down version using Global API as my unified endpoint — it normalizes the request format so I can swap models without rewriting client code:


python
import os
import requests

API_KEY = os.environ["GLOBAL_API_KEY"]
BASE_URL = "https://global-apis.com/v1"

def generate(prompt: str, model: str = "deepseek-v4-flash", max_tokens: int = 2000) -> str:
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    }
    payload = {
        "model": model

DEV Community