We Published That Our Premium Tier Failed on 60% of Tasks. Then We Fixed It.


By Hossein Shahrokni | 2026-02-26

When we launched on Product Hunt, our Phase 4 benchmark was live on the site.

It showed council mode — our multi-model premium tier — timing out on 6 of 10 developer tasks. We didn't hide those numbers. They were in the benchmark table, linked from the maker comment, publicly downloadable as JSON.

This is the follow-up post. We shipped the fix. Here's what broke, what we changed, and what Phase 5 shows.


What council mode is

Council mode runs each request through four specialist models — Code, Research, Creative, and a Captain who synthesizes their outputs — before returning an answer. The verification pass is what makes it more than just asking four models and averaging. The Captain cross-examines the specialists' outputs, catches contradictions, and produces a synthesized response.
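That sequential flow can be sketched roughly as follows. This is an illustrative sketch, not the actual implementation: every name here (`runSpecialist`, `captainSynthesize`, `runCouncil`, the `SpecialistOutput` shape) is hypothetical.

```typescript
// Hypothetical sketch of a sequential council pipeline.
type Role = "code" | "research" | "creative";

interface SpecialistOutput {
  role: Role;
  answer: string;
}

// Each specialist sees the request plus every earlier specialist's output,
// so later models can challenge or build on what came before.
async function runSpecialist(
  role: Role,
  request: string,
  priorOutputs: SpecialistOutput[]
): Promise<SpecialistOutput> {
  // Placeholder for a real model call.
  return { role, answer: `[${role}] response given ${priorOutputs.length} prior outputs` };
}

// The Captain cross-examines the specialists and produces one answer.
async function captainSynthesize(
  request: string,
  outputs: SpecialistOutput[]
): Promise<string> {
  return `synthesis of ${outputs.map((o) => o.role).join(", ")}`;
}

async function runCouncil(request: string): Promise<string> {
  const roles: Role[] = ["code", "research", "creative"];
  const outputs: SpecialistOutput[] = [];
  for (const role of roles) {
    // Sequential on purpose: each specialist reads the ones before it.
    outputs.push(await runSpecialist(role, request, outputs));
  }
  return captainSynthesize(request, outputs);
}
```

The key design point is the `for...of` loop with `await` inside: the specialists run one after another, which is what makes the later calls able to cross-examine the earlier ones.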

The benchmark hypothesis: four specialists catching each other's errors should outperform a single model, even the best one. Phase 4 was the first time we actually ran it at scale. It told a different story.


What Phase 4 showed

Phase 4 (Feb 25, 30 calls, 10 developer tasks): council timed out on 6 of 10 tasks. The benchmark recorded timeouts as failures. The result was a 5-something/10 average that told us nothing about whether the underlying approach worked — just that the implementation had a critical fault.

We published it anyway. If you're running a transparency-first benchmark, you publish the ugly runs too.


The root cause

Each specialist call had a 90-second AbortSignal. Four specialists running sequentially. Worst case: 4 × 90s = 360 seconds of execution time.

Connection timeout on Vercel: 90 seconds.

The math was wrong from the start. Under load, every council request that hit a slow specialist exceeded the connection ceiling and died.
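The failure mode is visible in the arithmetic alone. A minimal restatement, using the numbers from the post:

```typescript
// The Phase 4 budget math: four sequential specialists,
// each with its own 90-second AbortSignal.
const SPECIALIST_TIMEOUT_MS = 90_000; // per-specialist abort timeout
const SPECIALIST_COUNT = 4;

// Worst case: every specialist runs to its individual limit.
const worstCaseMs = SPECIALIST_COUNT * SPECIALIST_TIMEOUT_MS; // 360_000 ms

// The platform's connection ceiling (Vercel): 90 seconds.
const PLATFORM_CEILING_MS = 90_000;

// Any single slow specialist can push the pipeline past the ceiling,
// because nothing caps the *total* execution time.
const canExceedCeiling = worstCaseMs > PLATFORM_CEILING_MS; // true
```

Each call was individually within budget, but the sum never was: one slow specialist plus any other work already put the request past the connection ceiling.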


The fix

The sequential architecture was correct — that's what drives council quality. Each specialist reads what the previous one said before responding. Running them in parallel would break that.

What was wrong was the per-specialist timeout with no ceiling on the total pipeline. Sprint 12 added PIPELINE_TOTAL_TIMEOUT_MS — a hard ceiling on total council execution time — plus a streaming bypass for simple requests (~2.4s). Complex tasks run the full sequential chain within a fixed budget. If a specialist runs long, the Captain synthesizes with whatever's complete. Zero timeouts since the fix shipped.
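A rough sketch of what a shared pipeline deadline looks like. This is an assumed implementation, not Sprint 12's actual code: the function names, the 60-second value, and the partial-synthesis behavior are all illustrative.

```typescript
// Hypothetical: a single deadline shared by the whole sequential chain.
const PIPELINE_TOTAL_TIMEOUT_MS = 60_000; // illustrative value below the platform ceiling

async function runSpecialistCall(
  role: string,
  signal: AbortSignal // a real model call would pass this to fetch()
): Promise<string> {
  return `${role} output`; // placeholder for a real model call
}

async function runCouncilWithBudget(roles: string[]): Promise<string[]> {
  const deadline = Date.now() + PIPELINE_TOTAL_TIMEOUT_MS;
  const completed: string[] = [];
  for (const role of roles) {
    const remainingMs = deadline - Date.now();
    if (remainingMs <= 0) break; // budget exhausted: stop the chain here
    try {
      // Each call's abort is capped by the *remaining* pipeline budget,
      // so total execution can never exceed PIPELINE_TOTAL_TIMEOUT_MS.
      completed.push(
        await runSpecialistCall(role, AbortSignal.timeout(remainingMs))
      );
    } catch {
      break; // a timed-out specialist ends the chain
    }
  }
  // The Captain synthesizes from whatever completed, even if that's
  // fewer than all the specialists.
  return completed;
}
```

The important change from Phase 4 is that the per-call timeout is derived from the remaining shared budget rather than being a fixed 90 seconds per specialist, so the worst case is the pipeline ceiling, not the sum of the individual ceilings.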


Phase 5 results

We re-ran the benchmark with council V3 live.

| Tier | Score (dev tasks) | Won vs Opus | vs. Phase 4 |
| --- | --- | --- | --- |
| Council V3 | 8.77/10 | 8 of 10 | was timing out 6/10 |
| Balanced | 8.80/10 | 8 of 10 | unchanged |
| Opus 4.6 direct | 8.6/10 | baseline | baseline |
| Frugal | 8.3/10 | 3 of 10 | unchanged |

Council wins 8 of 10 developer tasks head-to-head against Opus direct. Avg response time: ~90s. Zero timeouts.

The council dev average (8.77) now beats Opus (8.6) and sits within rounding distance of Balanced (8.80) on developer tasks. The wins are clearest on architecture decisions, complex reasoning, and open-ended design — tasks where specialist cross-examination resolves ambiguity before synthesis. The all-task average (7.27 across 16 tasks including non-dev tasks) shows council is optimized for developer work, not general use. Full outputs are published — read the individual tasks and judge which council wins are meaningful for your use case.

(Note: cost column omitted — benchmark doesn't track multi-model cost. See komilion.com/pricing for current premium tier pricing.)

Full outputs at komilion.com/compare-v2 — every response, every judge verdict, JSON download.


What this means for the three tiers

Frugal and Balanced haven't changed. Phase 4 confirmed both: Balanced beats Opus on 8 of 10 tasks at half the cost, Frugal at 97% quality for 1.6% of the cost. Those findings stand.

Premium (neo-mode/premium) now routes to council V3. If you were on Premium before this post, your next call goes to council automatically.

The komilion.neo.councilTrace field in the response shows the full specialist breakdown — which model handled each role, what it contributed, how the Captain synthesized.
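For illustration, a trace like that might look something like the shape below. This is an assumed schema — the field names inside `councilTrace` are hypothetical and the real response format may differ; only the `komilion.neo.councilTrace` path comes from the post.

```typescript
// Hypothetical shape of the councilTrace payload (illustrative only).
interface CouncilTrace {
  specialists: {
    role: string; // e.g. "code", "research", "creative"
    model: string; // which model handled this role
    contribution: string; // what it added to the answer
  }[];
  captain: {
    model: string;
    synthesis: string; // how the final answer was assembled
  };
}

// One way a client might summarize the breakdown for logging.
function summarizeTrace(trace: CouncilTrace): string {
  const roles = trace.specialists
    .map((s) => `${s.role}:${s.model}`)
    .join(", ");
  return `${roles} -> captain:${trace.captain.model}`;
}
```

A client could call `summarizeTrace` on each premium response to log which models participated in which roles over time.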


Why we published the failure data

Because you'd find it anyway. Developer benchmarks get read carefully. If we'd published Phase 4 with a footnote like "council results excluded due to technical issues," someone would have asked why.

Publishing the failure is also how you prove the fix is real. The before and after are both in the data. You can verify it yourself.

komilion.com — Phase 5 benchmark published same day as this post.
