Gaurav Vij

Posted on Jun 5

I kept using Claude Code. Added one thing to it. Cut AI engineering costs by 62%.

#ai #claude #claudecode #agents

Same task. Same machine. Same models. Two runs. $1.96 vs $0.74.

The difference wasn't prompt engineering. Wasn't a cheaper model. Wasn't a better GPU. It was whether Claude Code worked alone or handed off to an AI agent (Neo) before touching a single file.

Here's what actually happened.

The Task

Benchmark two Parakeet speech-to-text variants on a CPU-only Azure VM (2 vCPUs, 7.7GB RAM, no GPU):

nvidia/parakeet-tdt-0.6b-v3 — full precision HuggingFace model
mudler/parakeet-cpp-gguf — same weights, GGUF quantized, runs via C++ CLI

Framework: build-ai-applications/Eval-STT. Neither model is natively supported, so both runs had to extend the evaluator with custom code.

Metrics: WER, RTF, latency, CPU%, peak memory.
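For readers less familiar with the two headline metrics, here is a minimal sketch of how WER and RTF are commonly computed. It is not the Eval-STT implementation: the function names are illustrative, and real evaluators usually normalize punctuation and casing more carefully before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the ref words seen so far and the first j hyp words
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time per second of audio; below 1.0 is faster than real time."""
    return processing_seconds / audio_seconds
```

One substituted word in a four-word reference gives a WER of 0.25, and 5 seconds of compute on 10 seconds of audio gives an RTF of 0.5.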

Run 1: Claude Code, Interactive

Standard workflow. Describe task, iterate turn by turn, fix errors as they surface, ship.

What Claude Code chose:

# HuggingFace Transformers — the obvious path
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="nvidia/parakeet-tdt-0.6b-v3",
    device="cpu",
)
result = pipe("test_audio.wav")

bfloat16 via HF Transformers. Reasonable. Works. Not the best choice for a CPU-only box.

For test audio: espeak-ng. Offline, fast, no dependencies.

Results:

Model	WER	RTF	Latency
HF bfloat16	20.9%	0.519	8.60s
GGUF Q4_K	20.9%	0.797	13.21s

Both models made the same three errors: "zest" → "mest", "tacos al pastor" → "taco mel pastor", "zestful" → "nestful". Same errors, both models — that's espeak-ng mispronouncing edge cases, not model failure.

Total cost: $1.96

Run 2: Claude Code + Neo

One prompt. Claude Code submitted the task to Neo via MCP and stepped back.

Neo's first move was not to write code. It spent 2 minutes reading:

The Eval-STT notebook structure
HuggingFace model cards for both models
CPU inference benchmarks for NeMo/Parakeet
ONNX Runtime vs PyTorch throughput on x86
parakeet.cpp build requirements and GGUF quantization tiers

Then it wrote a plan and asked one question: "gTTS or LibriSpeech sample for audio? Q4_K or Q6_K for GGUF?"

Reply: "You decide."

What Neo chose:

# ONNX Runtime via onnx-asr — researched, not obvious
import onnx_asr

# onnx-asr registers NeMo exports under a "nemo-" prefix and exposes recognize()
model = onnx_asr.load_model("nemo-parakeet-tdt-0.6b-v3")
transcription = model.recognize("test_audio.wav")

ONNX Runtime with operator fusion and AVX2-optimized kernels. Neo picked it because the benchmarks it had read showed it beating the PyTorch path on CPU.

For GGUF: Q6_K (776MB) over Q4_K. Better quality headroom, still fits in available RAM.

For test audio: gTTS. Natural-sounding speech, closer to training distribution.

Results:

Model	WER	RTF	Latency	Peak Memory
ONNX FP32	4.65%	0.328	5.50s	2,667MB
GGUF Q6_K	4.65%	0.708	11.88s	928MB

Total cost: $0.7448

What the Numbers Actually Mean

The WER gap is about audio, not models

Both models got identical WER within each run: 20.9% in Run 1, 4.65% in Run 2. If quantization or model choice were the variable, you'd see different WER between models within the same run. You don't.

The variable was the TTS engine. espeak-ng produced robotic audio that tripped up three words. gTTS produced natural audio the models handled correctly. NVIDIA reports 1.93% WER on LibriSpeech for this model — the real-world number is close to what Neo's run showed, not what the interactive run showed.
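One way to check the audio hypothesis is to compare which words each model got wrong. The sketch below aligns reference and transcript with Python's difflib. The transcripts are made-up illustrations built around the "zest" → "mest" error quoted earlier, not output from either run.

```python
from difflib import SequenceMatcher


def misrecognized(reference: str, hypothesis: str) -> set[tuple[str, str]]:
    """Return (expected, heard) pairs for every span where the transcript diverges."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    errors = set()
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            errors.add((" ".join(ref[i1:i2]), " ".join(hyp[j1:j2])))
    return errors


# Hypothetical transcripts, for illustration only.
reference = "add zest to the tacos"
errors_a = misrecognized(reference, "add mest to the tacos")
errors_b = misrecognized(reference, "add mest to the taco")

# Errors shared by both models point at the input audio rather than at either model.
shared = errors_a & errors_b
```

If the shared set covers nearly every error, as it did in Run 1, the input audio is the first suspect.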

The RTF gap is about runtime choice

RTF 0.519 vs 0.328. Same model weights. Same hardware. Different inference backend.

That 37% cut in processing time (about 1.58x the throughput) is what you get when you pick ONNX Runtime for a CPU-only deployment instead of defaulting to HF Transformers. Neo found this by reading benchmarks before committing. The interactive run defaulted to the obvious path and never had reason to look further.

In production terms: the difference between 2 servers and 3.
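The arithmetic behind that server claim can be sketched as follows. The RTF values are the ones reported above. The workload of 5 audio hours per wall-clock hour, and the assumption of one fully utilized stream per server, are hypothetical and chosen only to make the sizing concrete.

```python
import math

rtf_hf, rtf_onnx = 0.519, 0.328  # RTF values reported in the two runs

time_saved = 1 - rtf_onnx / rtf_hf  # fraction of processing time removed
speedup = rtf_hf / rtf_onnx         # audio processed per unit of compute


def servers_needed(audio_hours_per_hour: float, rtf: float) -> int:
    """Servers required if each one spends `rtf` compute hours per audio hour."""
    return math.ceil(audio_hours_per_hour * rtf)


workload = 5  # hypothetical: 5 hours of audio arriving every hour
hf_servers = servers_needed(workload, rtf_hf)
onnx_servers = servers_needed(workload, rtf_onnx)
```

For this workload the HF path needs 3 servers and the ONNX path needs 2. The exact counts depend on the load, but the ratio of about two thirds holds.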

The cost gap is about iteration

$1.96 vs $0.74. Interactive mode burns tokens on every re-read, every correction, every back-and-forth. Neo planned once and executed linearly — 10 subtasks, self-verified after each one, no back-and-forth.

The structured plan eliminated the iteration overhead that makes interactive AI engineering expensive at scale.
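The headline percentage follows directly from the two totals. A quick sketch of the arithmetic, where the projection over 100 runs is a hypothetical extrapolation that assumes every run costs the same as the one measured:

```python
cost_interactive = 1.96   # Run 1 total, in USD
cost_with_neo = 0.7448    # Run 2 total, in USD

savings_fraction = 1 - cost_with_neo / cost_interactive


def projected_savings(runs: int) -> float:
    """Hypothetical: total USD saved if each run matched the measured costs."""
    return runs * (cost_interactive - cost_with_neo)
```

Here `savings_fraction` rounds to 0.62, which is the 62% in the title.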

The Code, Side by Side

Claude Code Solo — single unified evaluator:

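class ParakeetEvaluator:
    # Excerpt: maps a model key to (model source, loader method name, transcribe method name).
    # GGUF_MODEL is not shown in this excerpt.
    models_cfg = {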
        "parakeet-tdt-0.6b-v3-hf": (
            "nvidia/parakeet-tdt-0.6b-v3",
            "_load_parakeet_nemo",
            "_transcribe_parakeet_nemo",
        ),
        "parakeet-tdt-0.6b-v3-gguf-q4k": (
            GGUF_MODEL,
            "_load_parakeet_gguf",
            "_transcribe_parakeet_gguf",
        ),
    }

Claude Code + Neo — separate scripts per model:

python evaluate_onnx.py   # ONNX Runtime path
python evaluate_gguf.py   # parakeet.cpp path

Neo kept them separate for cleaner debugging and verification. Each script outputs its own JSON. A combined results.json merges them. Neo also ran an internal verification check after each artifact — re-read files, confirmed sizes, checked exit codes — before marking steps complete.
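The verify-then-merge pattern described above can be sketched like this. It is an illustration of the idea, not Neo's actual code: the helper names and the per-script output filenames in the usage note are assumptions.

```python
import json
import subprocess
import sys
from pathlib import Path


def run_and_verify(cmd: list[str], artifact: Path) -> dict:
    """Run one step, then check the exit code and that the artifact is non-empty JSON."""
    completed = subprocess.run(cmd, capture_output=True, text=True)
    if completed.returncode != 0:
        raise RuntimeError(f"{cmd} exited with {completed.returncode}: {completed.stderr}")
    if not artifact.is_file() or artifact.stat().st_size == 0:
        raise RuntimeError(f"{artifact} is missing or empty")
    return json.loads(artifact.read_text())


def merge_results(parts: dict[str, Path], out_path: Path) -> dict:
    """Combine per-model JSON files into one results file keyed by model name."""
    merged = {name: json.loads(path.read_text()) for name, path in parts.items()}
    out_path.write_text(json.dumps(merged, indent=2))
    return merged
```

A call such as `run_and_verify([sys.executable, "evaluate_onnx.py"], Path("onnx_results.json"))` would run one script and fail loudly if it produced nothing. `merge_results` would then write the combined results.json. Keeping the check next to each step means a broken artifact is caught before the next subtask builds on it.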

Full code, results, and Neo's pre-execution plan are in the GitHub repo.

When to Use Which

Stick with interactive Claude Code when:

You're exploring and don't know exactly what you want yet
You need to understand what's being built, not just get output
The task is short enough that planning overhead isn't worth it
You're debugging something that needs real-time judgment

Add Neo when:

You have a defined AI/ML pipeline: evaluation, fine-tuning, RAG, inference benchmarking
The task has non-obvious decision points before you start (runtime, quantization, dataset)
You want the deliverable, not the process
Token cost on repeated or scaled runs matters

The pattern: the more your task looks like "figure out the right approach, then execute it," the more a research-first agent beats interactive iteration.

Setup

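Neo runs locally via MCP inside Claude Code. Register it as an MCP server using the standard mcpServers format shown here for claude_desktop_config.json (Claude Code reads the same format from a project-level .mcp.json):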

{
  "mcpServers": {
    "neo": {
      "command": "npx",
      "args": ["-y", "@heyneo/mcp"]
    }
  }
}

Your code, models, and data stay on your machine. Nothing remote.

Then just describe your AI engineering task in Claude Code and let Neo handle the execution.

Repo with all code, results, and charts: github.com/gauravvij/parakeet-stt-eval
