Local Model, Big Results
Everyone assumes you need GPTâ4 or Claude 3 to hit stateâofâtheâart
reasoning accuracy on math benchmarks.
Here's the blunt truth: you don't.
- Raw GPTâoss:20B on GSM8K: 77% accuracy.
- GPTâoss:20B + Orka-Reasoning workflow: 92% accuracy.
- Runs locally, on consumer GPU, at zero cost.
This isn't a promptâhack. It's orchestration --- multiâagent reasoning
with structured workflows. And it rivals (and in some cases,
outperforms) frontier models.
Why GSM8K Matters
GSM8K (Grade School Math 8K) is the gold standard for math reasoning.
It's deceptively simple: thousands of gradeâschool word problems.
Why it matters:
- Math is unforgiving: no room for vague language or handâwavy answers.
- Reasoning benchmark: if a model can't handle GSM8K reliably, it's not reasoning --- it's guessing.
- Comparability: every big model reports GSM8K numbers, so it's the cleanest baseline.
So when GPTâoss:20B jumps from 77% â 92% on GSM8K just by running inside
Orka, it's not noise --- it's signal.
What Orka-Reasoning Brings
Most frameworks today are just wrappers around LLM APIs. Orka is not.
It's a cognitive orchestrator: YAMLâdefined flows, multiâperspective
agents, memory logging, trace replay.
Key features that matter here:
- Fork/Join workflows: multiple agents run in parallel, results reconciled.
- Multiâperspective reasoning: progressive, conservative, realist, ethical purist agents all debate (see trace excerpt below).
- Evaluator workflow: not just "did it look right?" but similarity, precision, and explainability scores.
- Traces: every reasoning step logged, inspectable.
This is infrastructure, not a wrapper.
Methodology
The benchmark pipeline is transparent and reproducible.
- Dataset: GSM8K (8,000 math problems).
- Chunked execution: 1000 problems per run, 8 chunks in total.
- Zero failures: 100% of cases processed without crashes.
- Evaluator scores: similarity, precision, explainability per case.
Pipeline (from README.md
)
python run_benchmark.py cot_multi_perspective.yml benchmark_evaluator.yml test.json
What gets generated
- Detailed CSV logs (
benchmark_report_TIMESTAMP.csv
) - JSON summaries (
benchmark_report_TIMESTAMP_summary.json
) - Orka traces with every reasoning step
Results: The Numbers
Raw GPTâoss:20B baseline (77%) vs. OrkaâReasoning (92%).
- Average similarity score across 8 chunks: ~0.95
- Average precision score: ~0.97
- Average explainability: ~0.96
- Zero failed cases out of 8000
Chart (from repo):
Orka lifted GPTâoss:20B into the same accuracy bracket as Claude 3 Opus,
Gemini Ultra, and DeepSeekâV2.5 --- without cloud cost.
Traces: Reasoning in the Open
Here's what makes Orka unique: you can see how the model reached the
answer.
Excerpt (abridged from execution trace):
- radical_progressive:
response: "OrKa-reasoning represents a progressive approach..."
confidence: 0.95
- traditional_conservative:
response: "OrKa-reasoning relies on established knowledge..."
confidence: 0.90
- pragmatic_realist:
response: "OrKa-reasoning is a structured approach..."
confidence: 0.95
- ethical_purist:
response: "OrKa-reasoning emphasizes transparency and ethics..."
confidence: 0.90
All four perspectives are logged, reconciled, and synthesized into a
final answer.
This isn't a black box --- it's a cognitive society at work.
Why This Matters
- Cost: running locally means $0 inference cost.
- Transparency: you can audit every reasoning step.
- Customizability: workflows are YAML, not hardcoded prompts.
- Performance: 92% GSM8K puts local GPTâoss:20B in the big leagues.
For education: reliable math tutor with explainable reasoning.
For enterprise: domainâspecific reasoning at scale, with full audit
trails.
For research: playground for experimenting with multiâagent cognition.
Reproducibility: Check the Reports
All benchmark execution reports are public:
đ
github.com/marcosomma/orka-reasoning/tree/master/docs/benchmark
Every run is logged, chunk summaries included.
Example (chunk 1 summary):
{
"chunk_number": 1,
"total_cases": 1000,
"successful_cases": 1000,
"failed_cases": 0,
"success_rate": 100.0,
"average_similarity_score": 0.978,
"average_precision_score": 0.988,
"average_explainability_score": 0.964
}
Eight such summaries cover the entire dataset (8000 cases).
Analysis: Orchestration > Scale
What's happening here is simple but radical:
Scaling the model wasn't enough. Structuring the reasoning was.
The "bigger is better" paradigm assumes you need 70B or 175B params to
compete.
But Orka proves that with orchestration --- multiâperspective agents,
structured debate, evaluator feedback --- even a 20B model can punch in
the 92% GSM8K tier.
This is a shift: cognitive infrastructure beats bruteâforce scaling.
Conclusion
Conclusion:
The Orka-Reasoning framework, powered by GPTâoss:20B, is a killer case
because it delivers 92% accuracy on GSM8K (with CoT/self-consistency),
runs efficiently on consumer hardware, and provides transparent,
multi-perspective reasoning. The traces prove its reliability (100%
correct on sample problems) and scalability (8,000 cases with zero
failures).
This isn't about another wrapper.
This is about building cognition as infrastructure.
Orka doesn't just answer questions --- it shows its work.
And that's why GPTâoss:20B, inside Orka, isn't just a model. It's a
reasoning system.
Top comments (2)
This is an impressive result! Congrats!
Thanks :)