How I Ran 2,859 LLM Code Generation Tests with EvalScope — and Got Zero Errors

#ai #llm #testing

After three weeks of running Qwen2.5-32B on a DGX Spark, the number that surprised me most wasn't the throughput or latency. It was zero.

Zero structural errors across 2,859 code generation tests.

What I Tested

EvalScope with code generation tasks covering:

Structured JSON output
Function calling (OpenAI tool format)
Multi-step tool use chains
Code completion with specific output formats

Each test run validates four things:

Valid JSON structure — no unclosed brackets, no broken syntax
Correct function call schema — the right parameters, right types
No truncated output — response completes fully within the token budget
Response within timeout — no hung generations

Seven test sessions, roughly 400 prompts each. Every single one passed.

The Setup

Model: Qwen2.5-32B-Instruct-AWQ (4-bit)
Engine: vLLM 0.21 with continuous batching
Temperature: 0 (deterministic mode)
Hardware: DGX Spark, 128GB unified memory, ARM64

bash
python -m vllm.entrypoints.openai.api_server \
--model Qwen2.5-32B-Instruct-AWQ \
--max-model-len 65536 \
--gpu-memory-utilization 0.9 \
--enforce-eager \
--enable-auto-tool-choice \
--tool-call-parser hermes

Why Zero Errors Surprised Me

I've used cloud APIs extensively. Even the best ones occasionally return truncated JSON under load, or a function call with a missing parameter. It's rare — 0.1-0.3% error rates — but when you're running autonomous agents doing 40+ sequential tool calls, a single failure cascades.

At 0.3% error rate per call, a 50-step agent loop has a ~14% chance of hitting at least one failure. Your agent works perfectly nine times, then mysteriously dies on the tenth run.

With zero errors in 2,859 trials, the 95% confidence upper bound on the error rate is 0.13%. That means a 50-step loop has a 93.8%+ chance of completing cleanly.

The Comparison

I also ran 1,280 identical prompts against cloud APIs:

Backend	Latency (median)	Structural Errors
DeepSeek V3	2.6s	0
Kimi	4.9s	2
Qwen2.5-14B (Mac M4)	9.9s	0
STORM (DGX, 32B)	19.6s	0

Cloud wins on speed. But the local setup matched the cloud on reliability, while the 14B on a $599 Mac Mini held its own on quality.

Reproduce It

Full methodology, test datasets, and raw results are on GitHub:

https://github.com/YIQI-NUMBER1/stormengine

If you've got a local setup, pull the repo and run the benchmarks. If you find errors I missed, open an issue — I genuinely want to know what breaks this.

DEV Community

How I Ran 2,859 LLM Code Generation Tests with EvalScope — and Got Zero Errors

Top comments (0)