FLenQA Benchmark: Do Current LLMs Reason at Their Claimed Context Lengths?

#ai #agents #programming #llm

Some days ago, I started working on a research assistant that uses multi-agent orchestration mainly because the goal was to use small, local models (ignoring latency and output token/secs which impacts inference speed).

Most small models have limited reasoning capability and are highly susceptible to the "context rot" problem and "lost in the middle" phenomenon as demonstrated in the paper "Same Task, More Tokens" by Mosh Levy, et al. They created a Benchmark named FlenQA.

The FlenQA framework investigates something that's been bugging me, "what happens to an LLM's output quality when you give it longer prompts?". The conventional narrative says "bigger context window = better output", but local models have limitations in their context window length and Levy's paper challenges that narrative. They found significant degradation in LLM reasoning at input lengths far shorter than the model's technical maximum. In simple terms, just because a model can accept 1 million tokens doesn't mean it can reason effectively over them.

I decided to run a benchmark on some of the models I'm working on, using the Open Router gateway to test them without managing separate API keys.

10 models were used across 150 samples each (10 samples per task × 5 context sizes × 3 tasks), using standard prompting without chain-of-thought. The maximum token used was 3000, although Granite 4.1 8B has a maximum context length of 131,000 it performed better than Deepseek V4 flash and pro.

Here's the link to view the leaderboard.

FLenQA Benchmark: Do Current LLMs Reason at Their Claimed Context Lengths? | Richmond Eribo

I ported a 2024 research paper to a live platform and benchmarked 9 open-source models. Here's what I found when I tested whether models can actually reason over long inputs.

richmonderibo.dev

You may wonder what model is "Owl Alpha", and which frontier lab trained it. It's a stealth model, or you can call it a mystery model. Labs do this when their aim is to gather traces from users' request. A model with over a million context windows with no current owner did better than Qwen, Gemini Flash, and Deepseek in this benchmark.

DeepSeek V4 Pro underperformed its reputation on this benchmark, landing at 89.33%, the lowest score among the larger models.

All evaluations were run via OpenRouter. 150 samples per model, 10 per (task × context size) bucket, standard prompting without chain-of-thought. Results reflect a single run; individual scores may vary slightly across runs due to random sample selection.

NOTE: The second part of this series will test the impact of orchestration agents on the FlenQA dataset, without COT also, following the paper: "Beyond the Strongest LLM: Multi-Turn Multi-Agent Orchestration vs. Single LLMs on Benchmarks"

The github codebase is here