Local Model, Big Results
Everyone assumes you need GPT‑4 or Claude 3 to hit state‑of‑the‑art
reasoning accuracy on math benchmarks.
Here's the blunt truth: you don't.
- Raw GPT‑oss:20B on GSM8K: 77% accuracy.
- GPT‑oss:20B + Orka-Reasoning workflow: 92% accuracy.
- Runs locally, on consumer GPU, at zero cost.
This isn't a prompt‑hack. It's orchestration --- multi‑agent reasoning
with structured workflows. And it rivals (and in some cases,
outperforms) frontier models.
Why GSM8K Matters
GSM8K (Grade School Math 8K) is the gold standard for math reasoning.
It's deceptively simple: thousands of grade‑school word problems.
Why it matters:
- Math is unforgiving: no room for vague language or hand‑wavy answers.
- Reasoning benchmark: if a model can't handle GSM8K reliably, it's not reasoning --- it's guessing.
- Comparability: every big model reports GSM8K numbers, so it's the cleanest baseline.
So when GPT‑oss:20B jumps from 77% → 92% on GSM8K just by running inside
Orka, it's not noise --- it's signal.
What Orka-Reasoning Brings
Most frameworks today are just wrappers around LLM APIs. Orka is not.
It's a cognitive orchestrator: YAML‑defined flows, multi‑perspective
agents, memory logging, trace replay.
Key features that matter here:
- Fork/Join workflows: multiple agents run in parallel, results reconciled (see the sketch after this list).
- Multi‑perspective reasoning: progressive, conservative, realist, ethical purist agents all debate (see trace excerpt below).
- Evaluator workflow: not just "did it look right?" but similarity, precision, and explainability scores.
- Traces: every reasoning step logged, inspectable.
This is infrastructure, not a wrapper.
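To make the fork/join idea concrete, here is a minimal sketch, not Orka's actual API: the perspective names mirror the trace excerpt later in this post, while the prompts and the call_local_model helper are hypothetical placeholders for whatever local inference endpoint you run GPT‑oss:20B behind.

```python
# Minimal fork/join sketch (illustrative, not Orka's API): four "perspective"
# agents answer the same question in parallel, and their outputs are joined
# for a later reconciliation step.
from concurrent.futures import ThreadPoolExecutor

PERSPECTIVES = {
    "radical_progressive": "Answer boldly, favouring novel framings.",
    "traditional_conservative": "Answer cautiously, relying on established methods.",
    "pragmatic_realist": "Answer with the most practical, step-by-step reasoning.",
    "ethical_purist": "Answer with emphasis on transparency and correctness.",
}

def call_local_model(system_prompt: str, question: str) -> str:
    """Placeholder: send (system_prompt, question) to a locally hosted model."""
    raise NotImplementedError("wire this to your local GPT-oss:20B endpoint")

def fork_join(question: str) -> dict[str, str]:
    """Fork: run every perspective in parallel. Join: collect all responses."""
    with ThreadPoolExecutor(max_workers=len(PERSPECTIVES)) as pool:
        futures = {
            name: pool.submit(call_local_model, prompt, question)
            for name, prompt in PERSPECTIVES.items()
        }
        return {name: fut.result() for name, fut in futures.items()}
```

In Orka itself this wiring lives in the YAML workflow rather than in hand-written Python; the point is only that the fork step fans the question out and the join step hands every perspective's answer to the reconciliation stage.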
Methodology
The benchmark pipeline is transparent and reproducible.
- Dataset: GSM8K (8,000 math problems).
- Chunked execution: 1000 problems per run, 8 chunks in total.
- Zero failures: 100% of cases processed without crashes.
- Evaluator scores: similarity, precision, explainability per case.
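The actual similarity, precision, and explainability scoring is defined in benchmark_evaluator.yml in the repo. As a hedged illustration of the simplest possible per-case numeric check on GSM8K (the function names and regexes below are my own, not Orka's), this sketch extracts the final number from a model response and from the GSM8K gold answer, which ends in "#### <number>", and compares them:

```python
# Illustrative per-case check for GSM8K (not Orka's evaluator): pull the final
# number out of the model's response and out of the gold answer, then compare.
import re

def last_number(text: str) -> str | None:
    """Return the last number in the text, commas stripped, or None."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return nums[-1] if nums else None

def gold_answer(gsm8k_answer: str) -> str | None:
    """GSM8K gold answers end with '#### <number>'."""
    m = re.search(r"####\s*(-?[\d,]+(?:\.\d+)?)", gsm8k_answer)
    return m.group(1).replace(",", "") if m else None

def exact_match(model_response: str, gsm8k_answer: str) -> bool:
    # Simplification: string equality, so "72.0" and "72" would not match.
    pred, gold = last_number(model_response), gold_answer(gsm8k_answer)
    return pred is not None and pred == gold
```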
Pipeline (from README.md):
python run_benchmark.py cot_multi_perspective.yml benchmark_evaluator.yml test.json
What gets generated:
- Detailed CSV logs (benchmark_report_TIMESTAMP.csv)
- JSON summaries (benchmark_report_TIMESTAMP_summary.json)
- Orka traces with every reasoning step
Results: The Numbers
Raw GPT‑oss:20B baseline (77%) vs. Orka‑Reasoning (92%).
- Average similarity score across 8 chunks: ~0.95
- Average precision score: ~0.97
- Average explainability: ~0.96
- Zero failed cases out of 8000
A comparison chart is available in the repo.
Orka lifted GPT‑oss:20B into the same accuracy bracket as Claude 3 Opus,
Gemini Ultra, and DeepSeek‑V2.5 --- without cloud cost.
Traces: Reasoning in the Open
Here's what makes Orka unique: you can see how the model reached the
answer.
Excerpt (abridged from execution trace):
- radical_progressive:
response: "OrKa-reasoning represents a progressive approach..."
confidence: 0.95
- traditional_conservative:
response: "OrKa-reasoning relies on established knowledge..."
confidence: 0.90
- pragmatic_realist:
response: "OrKa-reasoning is a structured approach..."
confidence: 0.95
- ethical_purist:
response: "OrKa-reasoning emphasizes transparency and ethics..."
confidence: 0.90
All four perspectives are logged, reconciled, and synthesized into a
final answer.
This isn't a black box --- it's a cognitive society at work.
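Orka's reconciliation logic is defined in the workflow YAML; as a hedged illustration of what the join step can look like, here is a small sketch (my own naming, not Orka's API) of a confidence-weighted vote over the perspectives' final answers, in the spirit of self-consistency:

```python
# Illustrative join step (not Orka's actual reconciliation): each perspective
# contributes its final answer weighted by its reported confidence; the answer
# with the highest total weight wins.
from collections import defaultdict

def reconcile(perspectives: dict[str, tuple[str, float]]) -> str:
    """perspectives maps agent name -> (final_answer, confidence in [0, 1])."""
    weights: dict[str, float] = defaultdict(float)
    for _, (answer, confidence) in perspectives.items():
        weights[answer] += confidence
    return max(weights, key=weights.get)

# Example shaped like the trace excerpt above (the answers are placeholders):
final = reconcile({
    "radical_progressive": ("42", 0.95),
    "traditional_conservative": ("42", 0.90),
    "pragmatic_realist": ("42", 0.95),
    "ethical_purist": ("41", 0.90),
})
print(final)  # -> "42"
```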
Why This Matters
- Cost: running locally means $0 inference cost.
- Transparency: you can audit every reasoning step.
- Customizability: workflows are YAML, not hardcoded prompts.
- Performance: 92% GSM8K puts local GPT‑oss:20B in the big leagues.
For education: reliable math tutor with explainable reasoning.
For enterprise: domain‑specific reasoning at scale, with full audit
trails.
For research: playground for experimenting with multi‑agent cognition.
Reproducibility: Check the Reports
All benchmark execution reports are public:
👉 github.com/marcosomma/orka-reasoning/tree/master/docs/benchmark
Every run is logged, chunk summaries included.
Example (chunk 1 summary):
{
"chunk_number": 1,
"total_cases": 1000,
"successful_cases": 1000,
"failed_cases": 0,
"success_rate": 100.0,
"average_similarity_score": 0.978,
"average_precision_score": 0.988,
"average_explainability_score": 0.964
}
Eight such summaries cover the entire dataset (8000 cases).
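Because each chunk produces a summary like the one above, the dataset-wide numbers can be reproduced by averaging the eight files. A minimal sketch, assuming the summaries sit in a local directory as *_summary.json files with the fields shown in the example (the glob pattern is an assumption, not a repo convention):

```python
# Aggregate the eight per-chunk summaries into dataset-wide numbers.
# Assumes the *_summary.json files use the fields shown in the example above.
import glob
import json

paths = sorted(glob.glob("*_summary.json"))
if not paths:
    raise SystemExit("no *_summary.json files found")

score_keys = ["average_similarity_score",
              "average_precision_score",
              "average_explainability_score"]
totals = {key: 0.0 for key in score_keys}
cases = 0

for path in paths:
    with open(path) as f:
        chunk = json.load(f)
    n = chunk["total_cases"]
    cases += n
    for key in score_keys:
        totals[key] += chunk[key] * n  # weight each chunk by its case count

for key in score_keys:
    print(f"{key}: {totals[key] / cases:.3f}")
print(f"total cases: {cases}")
```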
Analysis: Orchestration > Scale
What's happening here is simple but radical:
Scaling the model wasn't enough. Structuring the reasoning was.
The "bigger is better" paradigm assumes you need 70B or 175B params to
compete.
But with orchestration (multi‑perspective agents, structured debate, evaluator feedback), Orka proves that even a 20B model can compete in the 92% GSM8K tier.
This is a shift: cognitive infrastructure beats brute‑force scaling.
Conclusion
The Orka-Reasoning framework, powered by GPT‑oss:20B, makes a compelling case: it delivers 92% accuracy on GSM8K (with CoT and self-consistency), runs efficiently on consumer hardware, and provides transparent, multi-perspective reasoning. The traces back up its reliability (100% correct on the sampled problems) and its scalability (8,000 cases processed with zero failures).
This isn't about another wrapper.
This is about building cognition as infrastructure.
Orka doesn't just answer questions --- it shows its work.
And that's why GPT‑oss:20B, inside Orka, isn't just a model. It's a
reasoning system.