The Giant That Builds Smaller Giants: Custom AI Agents for Privacy and Efficiency

The future of AI isn't bigger models. It's smaller, specialized agents — distilled, custom-built, and running where your data stays safe.

The Problem with One-Size-Fits-All AI

Every time you use a frontier AI model like ChatGPT or Claude, three things happen:

  1. Your data leaves your control. Your code, your ideas, your company's secrets travel to data centers you don't own.

  2. You're paying for capabilities you don't need. These models know a little about everything — history, poetry, coding, cooking. But for your specific task, 90% of that knowledge is irrelevant overhead.

  3. Massive resources are consumed. Running 500+ billion parameter models requires enormous computational power. For repetitive, domain-specific tasks, this is wildly inefficient.

There's a better way — and it's where AI is heading.


The Future: Custom, Distilled, Specialized Models

Here's the thesis:

The future of industrial AI isn't giant generalist models. It's smaller, specialized agents — custom-built for specific domains, running on efficient hardware, and equipped with deterministic tools that guarantee correctness.

This isn't speculation. It's already happening. Companies are realizing that for intensive, repetitive, well-defined tasks, a 30B parameter model with the right tools outperforms a 500B+ generalist that hallucinates.

Why Specialized Beats General

Frontier models (500B+ parameters):

  • Know a little about everything
  • Expensive to run
  • Prone to hallucination on niche topics
  • Data goes to third-party servers

Specialized agents (30B–120B parameters + tools):

  • Deep expertise in one domain
  • Significantly cheaper to run via Ollama Cloud or local GPU
  • Deterministic tools prevent hallucination
  • Can run on privacy-first infrastructure

The key insight: you don't need a model that knows everything. You need a model that can orchestrate tools that know specific things perfectly.


Privacy-First Doesn't Mean Local-Only

A common misconception: "If I care about privacy, I need my own GPU."

Not anymore. There's a spectrum of options:

Fully local (Ollama on your hardware)

  • Zero data leaves your machine
  • Requires a good GPU (RTX 3090/4090 for 30B+ models)
  • You control everything

Privacy-first cloud (Ollama Cloud, open-source model hosting)

  • Models like Qwen3-Coder, DeepSeek-V3, or OpenAI's gpt-oss — specialized, open-source, and efficient
  • Data sent, processed, and deleted — no training on your prompts, no logs retained
  • No GPU required on your end
  • Access to 30B-120B models with privacy guarantees

This middle ground is crucial. The sweet spot models (30B–120B parameters) often require more VRAM than consumer GPUs offer. Privacy-first cloud hosting lets you run them without sacrificing data control — your prompts are processed and immediately discarded, not stored or used for training.

Both options give you what closed-source APIs can't: confidence that your proprietary code isn't training someone else's model.
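In practice, the difference between the two options is mostly which host your client talks to. Here's a minimal sketch using the ollama Python client; the cloud endpoint, auth header, and model tag below are assumptions for illustration, so check the current Ollama Cloud docs before copying:

```python
# Minimal sketch: same client code, different host.
# ASSUMPTIONS: the cloud endpoint and Bearer auth header may differ;
# verify against the current Ollama Cloud documentation.
import os
from ollama import Client

# Fully local: nothing leaves the machine.
local = Client(host="http://localhost:11434")

# Privacy-first cloud: hosted open-weight models, no local GPU required.
cloud = Client(
    host="https://ollama.com",
    headers={"Authorization": f"Bearer {os.environ['OLLAMA_API_KEY']}"},
)

for name, client in [("local", local), ("cloud", cloud)]:
    reply = client.chat(
        model="qwen3-coder:30b",  # placeholder tag; use whatever is available on that host
        messages=[{"role": "user", "content": "Say hello."}],
    )
    print(name, reply.message.content)
```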

For enterprises in regulated industries — healthcare, finance, defense — this isn't a nice-to-have. It's a requirement.

The Hardware Horizon

The real revolution for local AI is still coming. Today's bottleneck isn't compute power; it's memory bandwidth, even more than VRAM capacity.

Here's why: for every token generated, the model must read all of its weights. A 20B model quantized to MXFP4 weighs roughly 10-12 GB, so each token means streaming ~10-12 GB from memory. The speed limit isn't "does it fit?" but "how fast can you read it?"

| GPU | VRAM | Memory Bandwidth | Practical Speed (gpt-oss:20b) |
|---|---|---|---|
| RTX 4070 | 12 GB | ~200 GB/s | ~23 tokens/s |
| RTX 3090 | 24 GB | ~936 GB/s | ~70-85 tokens/s |
| H100 | 80 GB | ~3,350 GB/s | ~300+ tokens/s |

This explains why datacenter GPUs cost so much — and why consumer hardware hits a wall even when the model fits in memory.
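A quick back-of-envelope calculation makes the relationship concrete. It's a rough upper bound that ignores KV-cache reads and kernel overhead, but it lines up with the table above:

```python
# Rule of thumb: decode speed <= memory bandwidth / bytes read per token.
# Rough upper bound; ignores KV-cache traffic, batching, and kernel overhead.

def rough_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Ceiling for a model whose full weights are streamed once per token."""
    return bandwidth_gb_s / model_size_gb

# A ~12 GB quantized model (e.g. gpt-oss:20b at MXFP4):
print(rough_tokens_per_second(936, 12))   # RTX 3090 -> ~78 tok/s, matching the measured 70-85
print(rough_tokens_per_second(3350, 12))  # H100     -> ~279 tok/s, in line with 300+
```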

But quantization techniques are advancing rapidly. Unsloth Dynamic 2.0 achieves remarkable compression while maintaining accuracy:

| Model | Full Size | Quantized | VRAM Needed | Context |
|---|---|---|---|---|
| Qwen3-Coder-30B-A3B | ~60 GB | 18 GB (Q4_K_XL) | 24 GB GPU | 1M tokens |
| Qwen3-Coder-30B-A3B | ~60 GB | 13 GB (IQ3_XXS) | 16 GB GPU | 8K tokens |

A 30B parameter model running on a consumer GPU with 1 million token context — this was impossible two years ago.

The future likely belongs to:

  • Mixture-of-Experts (MoE) architectures — 30B total parameters but only 3B active per token
  • Advanced quantization — Unsloth Dynamic, GGML, AWQ pushing the limits
  • Unified memory architectures where CPU and GPU share large RAM pools
  • NPUs (Neural Processing Units) integrated into consumer hardware

There's a catch: RAM prices are climbing, likely driven by datacenter demand for exactly these technologies. The economics may shift before the hardware does.

For now, privacy-first cloud bridges the gap — giving you access to efficient open-source models without waiting for affordable local hardware. But the gap is closing fast.


The Architecture: Giant Creates Smaller Giant

Here's the pattern that makes this work:

The giant model blueprints a smaller giant equipped with tools to solve complex problems.

The frontier model is the architect — used occasionally, for design and refinement. The smaller model is the builder — used daily, for execution.

The key innovation isn't the small model. It's the deterministic tools. They transform an "okay" model into a reliable specialist.


The Secret Sauce: Tools That Can't Be Wrong

Smaller models hallucinate. Everyone knows this.

But here's a nuance most people miss: modern mid-size models often know the right answer — they just can't be trusted to give it consistently.

When I tested a 30B model on the amortization formula, it gave a correct mathematical answer. When I tested it again, it gave a slightly different (but still correct) variant. When I tested it in a complex multi-step task, it occasionally mixed up variable names or forgot syntax rules.

The problem isn't knowledge. It's reliability.

This is where deterministic tools change the equation:

| Without Tools | With Tools |
|---|---|
| Model might know the formula | Tool always returns the exact formula |
| Model might remember syntax | Tool always validates syntax |
| Output varies between runs | Output is guaranteed consistent |
| Errors discovered at runtime | Errors caught before execution |

The model's job becomes orchestration:

  1. Understand what the user wants
  2. Choose which tool to call
  3. Apply the result correctly

The consistency comes from the tools. The intelligence comes from the model. Together, they achieve predictable reliability — which matters more than occasional brilliance in production systems.
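Here's what that orchestration loop looks like as a minimal sketch with the ollama Python client. The tool, model tag, and message handling are simplified placeholders, not the exact CLI Code Agent implementation:

```python
# Minimal orchestration sketch: the model chooses the tool, deterministic
# Python supplies the answer. Names below are placeholders, not the real agent.
import ollama

def get_formula(name: str) -> str:
    """Deterministic lookup: always returns the same verified formula."""
    formulas = {"amortization_payment": "M = P*r*(1+r)**n / ((1+r)**n - 1)"}
    return formulas[name]

messages = [{"role": "user", "content": "Fix the loan payment calculation."}]
response = ollama.chat(model="qwen3-coder:30b", messages=messages, tools=[get_formula])

messages.append(response.message)
for call in response.message.tool_calls or []:
    result = get_formula(**call.function.arguments)      # 2. the tool answers, not the model
    messages.append({"role": "tool", "name": call.function.name, "content": result})

final = ollama.chat(model="qwen3-coder:30b", messages=messages)  # 3. model applies the result
print(final.message.content)
```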


The Iterative Loop: Learning From Failures

Here's what nobody tells you about specialized agents: you build them by watching models fail.

When I first tested the COBOL financial agent, I tried different model sizes. The pattern was clear:

  • 8B models (too small): Got confused by multi-step tasks, forgot to use tools
  • 30B+ models (sweet spot): Understood the task, used tools correctly, succeeded

Even the successful models had predictable failure patterns. For example, a 30B model correctly applied the amortization formula but forgot to add the 7 required spaces at the start of COBOL lines (column rules from 1959!).

The formula was right. The syntax was wrong. The program didn't compile.

This is a predictable failure pattern. Models consistently forget obscure syntax rules they weren't trained on. Once you identify the pattern, you can compensate — deterministically.

Deterministic Compensation

Instead of hoping the model remembers COBOL column rules, I added:

  1. Post-processing in write_file: Every time the model writes COBOL code, the agent automatically scans for common formatting errors and fixes them before saving.

  2. RAG documentation tools: The model can call get_cobol_syntax_docs("columns") to retrieve verified syntax examples — it doesn't need to remember, just to look up.

  3. Auto-fix on compile errors: If compilation fails with column-related errors, a deterministic fixer attempts repairs.

Result: the same model that failed now succeeds, because its predictable weaknesses are patched by deterministic code.
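To make step 1 concrete, here's a sketch of the kind of post-processing hook the write_file tool can run. The function name and exact rules are illustrative, not the agent's actual implementation:

```python
# Sketch of a deterministic post-processor for fixed-format COBOL.
# Illustrative only: the real agent's write_file hook may apply more rules.

def fix_cobol_columns(source: str) -> str:
    """Shift lines the model wrote at column 1 into Area A/B (column 8+)."""
    fixed = []
    for line in source.splitlines():
        line = line.rstrip()
        if not line:
            fixed.append(line)                     # keep blank lines
        elif len(line) > 6 and line[6] in "*-/":
            fixed.append(line)                     # indicator already in column 7
        elif line.startswith(" " * 7):
            fixed.append(line)                     # already past the sequence area
        else:
            fixed.append(" " * 7 + line.lstrip())  # prepend the 7 missing spaces
    return "\n".join(fixed) + "\n"
```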

This is the real workflow:

  1. Give task to model with tools
  2. Watch where it fails
  3. Add deterministic compensation for that failure pattern
  4. Repeat until reliable

You're not training the model. You're building guardrails around its known failure modes. The model stays the same; the tools get smarter.


Proof of Concept #1: COBOL Financial Calculations

Theory is nice. Does it actually work?

For the first stress test, I chose a domain that matters: legacy COBOL code maintenance for financial systems.

Why COBOL? Because:

  • It's used in 95% of ATM transactions and 80% of in-person bank transactions
  • Financial formulas are unforgiving — a wrong amortization calculation means wrong money
  • It proves the approach works in real-world enterprise scenarios

This isn't a toy example. Banks run on COBOL. Getting it wrong costs real money.

The Test Case

A buggy loan payment calculator that uses simple interest instead of the correct amortization formula:

Expected payment: $632/month
Buggy output:     $819/month (simple interest, WRONG)

The task: fix the bug. I tested multiple models across different sizes to find the sweet spot.
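To see why the bug matters, here's the arithmetic with illustrative inputs (a $100,000 loan at 6.5% over 30 years; not necessarily the exact figures in the test fixture):

```python
# Illustrative inputs only; the test fixture may use different figures.
principal, annual_rate, years = 100_000, 0.065, 30
n = years * 12                 # number of monthly payments
r = annual_rate / 12           # monthly interest rate

# Buggy version: simple interest spread evenly over all payments.
simple = (principal + principal * annual_rate * years) / n        # ~ $819/month

# Correct amortization formula: M = P*r*(1+r)^n / ((1+r)^n - 1)
amortized = principal * r * (1 + r) ** n / ((1 + r) ** n - 1)     # ~ $632/month

print(f"simple interest: ${simple:.2f}   amortized: ${amortized:.2f}")
```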

Model Comparison: Finding the Right Size

| Model | Parameters | Used get_formula | Compiles | Output | Iterations | Result |
|---|---|---|---|---|---|---|
| Nemotron 3 Nano | 30B (MoE, 3B active) | ✅ | ✅ | $632.01 | 8 | ✅ |
| DeepSeek V3.1 | 671B | ✅ | ✅ | $632.01 | 7 | ✅ |
| Kimi K2 | 1T | ✅ | ✅ | $632.01 | 10 | ✅ |
| DeepSeek V3.2 | 671B | ✅ | ✅ | $632.01 | 15 | ✅ |
| Qwen3-Coder | 480B | ✅ | ✅ | $632.01 | 12 | ✅ |
| Qwen3 8B (local) | 8B | ❌ | ❌ | confused | 25 | ❌ |

Key observations:

  • All models that used get_formula succeeded. The deterministic tool guarantees the correct formula every time.
  • Nemotron 3 Nano is the sweet spot: a 30B MoE model with only 3B parameters active per token, completing the task in just 8 iterations — faster than 671B models while being far more efficient.
  • The 8B model failed — not because it doesn't know the formula (it does!), but because it got confused by the multi-step task and forgot to use the tools. This confirms 30B+ is the minimum for reliable agent work.
  • Frontier models (500B+) work, but are overkill — they cost more and aren't faster than well-tuned 30B models for this task.

The Real Value of Tools (When Models Already Know)

The COBOL agent has:

  • get_formula("amortization_payment"): Not because the model doesn't know it, but because the tool returns the exact same template every time with tested COBOL syntax
  • compile_cobol: GnuCOBOL compiler catches syntax errors the model might introduce
  • Auto-fix post-processing: Automatically adds missing column spacing when writing files

The lesson: tools provide consistency, not knowledge. For domains like COBOL where models already have the knowledge, deterministic tools ensure they apply it reliably.
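As a sketch of what "consistency, not knowledge" looks like in code: a get_formula-style tool is just a verified template, written and tested once, returned byte-for-byte identical on every call. The variable names here are hypothetical, not the agent's actual template:

```python
# Deterministic formula tool: the template never varies between runs.
# Hypothetical variable names; the agent's real template is not shown here.
AMORTIZATION_TEMPLATE = """\
       COMPUTE WS-PAYMENT ROUNDED =
           WS-PRINCIPAL * WS-MONTHLY-RATE *
           (1 + WS-MONTHLY-RATE) ** WS-NUM-PAYMENTS /
           ((1 + WS-MONTHLY-RATE) ** WS-NUM-PAYMENTS - 1)"""

def get_formula(name: str) -> str:
    """The model asks for a formula by name; it never has to recall the syntax."""
    return {"amortization_payment": AMORTIZATION_TEMPLATE}[name]
```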

A 30B model with tools produces the same correct output every time: $632.01/month

Frontier models without tools produce correct output most of the time — but "most" isn't good enough for bank transactions.


Proof of Concept #2: The Extreme Case — Commodore 64

The COBOL test proved tools help with consistency. But what about domains where models genuinely don't know anything?

To prove this approach works even in the worst case, I chose a deliberately extreme challenge: building an AI agent that develops games for the Commodore 64 — a computer from 1982.

Why the C64? It's not because I'm nostalgic (okay, maybe a little). It's because:

  • Modern AI models have almost zero training data on C64 programming. If the approach works here, it works anywhere.
  • It's technically unforgiving. Strict C89 syntax, custom hardware chips, specific memory addresses. One mistake crashes everything.
  • It's a safe demonstration domain. Complex enough to prove the concept, without revealing proprietary industrial applications.

The same architecture applies to domains I can't discuss publicly.

What Happens Without Specialization

Ask a frontier model to write C64 code. It will:

  • Hallucinate memory addresses (VIC-II isn't at 0x1234, it's at $D000)
  • Use modern C syntax that won't compile on cc65
  • Forget that you need to enable clocks, set registers, handle interrupts

The code looks plausible. It doesn't work.

What Happens With a Specialized Agent

The C64 agent combines deterministic tools with RAG (Retrieval-Augmented Generation):

Deterministic Tools:

  • A compiler (cc65) that gives exact error messages
  • An emulator (VICE) that runs the program and captures screenshots
  • A vision model that verifies the game actually works
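The compile tool can be as simple as shelling out to cl65 (cc65's compile-and-link driver) and handing the exact error text back to the model. Paths and flags here are illustrative:

```python
# Sketch of a deterministic compile tool: the compiler's verbatim error text,
# not the model's memory, decides whether the code is valid.
import subprocess

def compile_c64(source_path: str, output_path: str = "game.prg") -> str:
    """Compile a C source file for the C64 with cc65's cl65 driver."""
    result = subprocess.run(
        ["cl65", "-t", "c64", "-O", "-o", output_path, source_path],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return f"COMPILE FAILED:\n{result.stderr}"  # fed back to the model verbatim
    return f"OK: built {output_path}"
```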

RAG Knowledge Base:

  • Hardware registers — VIC-II at $D000, SID at $D400, CIA at $DC00
  • Memory maps — screen RAM, color RAM, sprite pointers
  • cc65 syntax — C89 dialect with platform-specific extensions

This is a case where RAG is essential, not optional. Unlike COBOL, modern AI models have almost no training data on C64 internals. Ask GPT-4 where the border color register is — it will guess wrong. The RAG knowledge base provides verified facts the model simply doesn't have.

The model doesn't need to memorize that the VIC-II border color register is at $D020. The RAG tool knows. The model just needs to understand "make the border black → query the hardware knowledge base."
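Here's a toy version of that lookup. A real knowledge base would retrieve from embedded hardware documentation, but the principle is the same: the facts live outside the model's weights. The addresses are genuine C64 registers; the function name is illustrative:

```python
# Toy RAG lookup over verified C64 facts the model can't be trusted to recall.
# A production agent would use embeddings over hardware docs; same principle.
C64_HARDWARE = {
    "border color":     "$D020 (53280) - VIC-II border color register",
    "background color": "$D021 (53281) - VIC-II background color register",
    "screen ram":       "$0400 - default screen memory (1000 bytes)",
    "sprite pointers":  "$07F8-$07FF - sprite data pointers (with default screen at $0400)",
    "sid base":         "$D400 - SID sound chip base address",
}

def get_c64_register(query: str) -> str:
    """Return the verified fact whose key matches the query (naive matching)."""
    q = query.lower()
    for key, fact in C64_HARDWARE.items():
        if key in q or q in key:
            return fact
    return "No verified entry found; do not guess an address."
```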

The Result

The agent creates playable C64 games — Pong, Breakout, demos. Running on emulated authentic hardware, compiled with period-correct tools, generated by a 30B model via Ollama Cloud.

No data leaked. A domain where generalist models fail completely, solved by a specialized agent with the right tools.


Tools vs RAG: Know When You Need Each

The COBOL and C64 agents illustrate two different scenarios:

| Scenario | Example | What Models Know | What's Needed |
|---|---|---|---|
| Model knows, but inconsistently | COBOL, SQL, Python | Formula ✅, Syntax ✅ | Deterministic tools for consistency |
| Model doesn't know | C64, niche hardware, proprietary systems | Nothing reliable | RAG for knowledge + tools for verification |

For COBOL: Models in the 30B-120B range know amortization formulas and COBOL syntax. The get_formula tool doesn't teach them — it ensures they use the exact same template every time. The compile_cobol tool catches the occasional syntax slip.

For C64: Models genuinely don't know that POKE 53280,0 sets the border to black, or that sprite pointers live at $07F8. The RAG knowledge base provides this information. Without it, the model hallucinates plausible-looking but wrong addresses.

The practical rule: If your domain appears in modern training data (COBOL, Java, financial math), focus on deterministic tools for consistency. If your domain is obscure or proprietary (legacy hardware, internal APIs, custom protocols), you need RAG to inject missing knowledge.


When to Use Deterministic Tools (And When Not To)

Not every task needs deterministic guardrails. Here's a framework:

Use Deterministic Tools When:

Correctness is non-negotiable — financial calculations, safety-critical systems, legal documents

The domain has verifiable ground truth — formulas, specifications, standards that can be encoded

Consistency across runs matters — production systems where "usually correct" isn't acceptable

Errors are expensive — wrong loan payments, invalid code, compliance violations

Skip Deterministic Tools When:

Creativity is the goal — brainstorming, drafting, exploring possibilities

Approximate is good enough — summaries, explanations, documentation

The domain is fuzzy — no clear right/wrong answers, subjective quality

Speed matters more than perfection — quick prototypes, exploratory coding

The COBOL example shows the sweet spot: a domain where the model has knowledge but needs guardrails for reliability. The tools don't replace the model's intelligence — they channel it into consistent, verifiable outputs.


Who Should Care About This?

This approach isn't for everyone. But if any of these apply to you, it's worth exploring:

✅ Good fit:

  • Intensive, repetitive tasks in a well-defined domain
  • Privacy requirements that rule out sending data to third parties
  • Industrial applications where consistency matters more than creativity
  • Teams willing to invest upfront in building custom agents

❌ Not the right fit:

  • You need broad, cross-domain reasoning
  • Tasks are unpredictable and can't be anticipated
  • You need cutting-edge capabilities only frontier models have
  • No resources to build and maintain specialized agents

The hybrid reality: Most teams will use both — specialized agents (30B-120B) for routine work where privacy and efficiency matter, frontier models (500B+) for complex one-off problems. The goal isn't to eliminate large models. It's to stop using them by default when a specialized agent would do better.


The Commodore 64 Philosophy

The C64 had 64 kilobytes of memory. Developers learned to do incredible things within tight constraints. They optimized. They specialized. They made every byte count.

Forty years later, we're building AI systems with trillions of parameters — and often using 1% of that capability for any given task.

Perhaps it's time to apply the same philosophy to AI.

We don't always need bigger models. Sometimes we need smarter architecture: frontier models that design specialized agents, equipped with deterministic tools that never make mistakes.

The giant builds a smaller giant. And the smaller giant does the work.


This article describes CLI Code Agent, a framework for building specialized AI agents that run locally or on privacy-first cloud via Ollama. The COBOL financial agent and C64 game development agent are examples — stress tests proving the approach works in domains where reliability matters and training data is scarce. The project is experimental and evolving, but the results are promising.
