I Built a Local LLM Rig to Escape API Bills. Then I Paid OpenAI Again.

#ai #llm #openai #infrastructure

I run a one-person AI shop. For 2asy.ai's filing pipeline that needs thousands of single-document extractions per cycle, the local rig lost the batch lane and OpenAI Batch won. Per-pipeline, not per-company.

The rule that decided it: no cross-document attention. Each filing gets its own prompt window. No string concatenation. The rule came from a Neo4j rollback I already paid for.

Quick results.

Local Gemma 4 26B on llama.cpp (RTX 4090 + W6800): live serving fine. Batch lane blocked. vLLM has no 4-bit MoE path I need, container wants CUDA 12.9, host driver is 12.8. GGML_CUDA_DISABLE_GRAPHS=1 keeps llama.cpp alive when graph optimizer segfaults.
OpenRouter: no real batch. Live pricing. At concurrency 32, latency 2 to 17 seconds, 121s timeouts, 429s.
Gemini batch SDK: silently inline-concatenates documents into one context. Cross-document leak. Neo4j rollback. Upstream googleapis/python-genai issue 1984 is not-planned.
OpenAI Batch (gpt-5.4-mini): JSONL line-isolated, 50 percent off, 100-doc nano gate in 2.7 min, zero 429s, around 1 cent per document.

The local rig stays for live serving, ER API LLM gate, multimodal, and ablations. The batch lane moves to OpenAI.

Full retrospective with the side-by-side table: https://hannune.ai/blog/local-llm-to-openai-batch.html

Top comments (2)

Mike Czerwinski • Jun 28

This is your fabricated-evidence failure one layer down. Truncating an article past its supporting sentence and Gemini inline-concatenating documents are the same bug: a boundary you assumed was enforced was not, and the pipeline emitted finished-looking output anyway. A partial document still produces output. A leaked batch still produces output. That is what makes both dangerous, the absence is invisible at the seam. And both fixes are the same move: stop trusting the generator to respect the boundary, make it mechanical. Substring-or-drop on one side, line-isolation and no cross-document attention on the other. The rule that picked the lane was not a benchmark. It was which lane could not silently merge two things you needed kept apart.

Some comments may only be visible to logged-in visitors. Sign in to view all comments.