I have a secret passion for LFM2.5-Thinking. It's tiny (1.2B parameters), it's fast, it's a reasoning model, and it's good. Really good.
My tests are still in progress, so all I can do is share some early results. I use the public GSM8K dataset, but with my own benchmarking scripts.
What is the GSM8k benchmark?
Grade School Math 8K: a dataset of 8.5K high-quality, linguistically diverse grade school math word problems that require multi-step reasoning and elementary arithmetic.
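Each GSM8K item pairs a word problem with a step-by-step reference solution whose gold answer follows a `####` marker. A minimal sketch of pulling that answer out (`gsm8k_gold_answer` is my own helper name, not part of the dataset):

```python
import re

def gsm8k_gold_answer(solution: str) -> str:
    """Extract the final numeric answer after the '####' marker
    in a GSM8K reference solution."""
    match = re.search(r"####\s*([\-0-9.,]+)", solution)
    if match is None:
        raise ValueError("no '####' marker found")
    # Strip thousands separators so '1,234' compares equal to '1234'.
    return match.group(1).replace(",", "")

# Example item in the official format:
sample = (
    "Natalia sold clips to 48 of her friends in April, and then she sold "
    "half as many clips in May. How many clips did she sell altogether?\n"
    "Natalia sold 48/2 = 24 clips in May.\n"
    "Natalia sold 48+24 = 72 clips altogether.\n"
    "#### 72"
)
print(gsm8k_gold_answer(sample))  # -> 72
```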
Here is the top-10 leaderboard in 2026, with scores up to 97%. Take note of the massive context sizes.
And this is what "state of the art" results looked like in 2021: barely 35%.
Some early results
Questions: 1319 (test)
Context sizes to test: [1000, 2000, 3000, 4000, 5000, 6000, 7000]
Endpoint: http://192.168.1.110:8000 / lfm2.5-thinking
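For context, one benchmark request in this setup looks roughly like this. This is an illustrative sketch only: it assumes the server speaks the OpenAI-compatible chat-completions API, and `build_request` plus the exact prompt wording are simplified stand-ins for my actual script.

```python
import json

# Assumption: the server at 192.168.1.110:8000 exposes an
# OpenAI-compatible /v1/chat/completions endpoint.
ENDPOINT = "http://192.168.1.110:8000/v1/chat/completions"

def build_request(question: str, max_tokens: int) -> dict:
    """Build one chat-completion payload for a GSM8K question."""
    return {
        "model": "lfm2.5-thinking",
        "max_tokens": max_tokens,
        "temperature": 0.0,  # greedy decoding for reproducible runs
        "messages": [
            {"role": "user",
             "content": question + "\nPut the final answer in \\boxed{}."},
        ],
    }

payload = build_request("What is 2 + 3?", max_tokens=1000)
print(json.dumps(payload, indent=2))
```

The benchmark then sweeps `max_tokens` over the list above and re-runs all 1319 test questions at each setting.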
=== max_tokens=1000 ===
[200/1319] acc=135/200 (67.5%) rate=3.9q/s
[400/1319] acc=251/400 (62.8%) rate=4.6q/s
[600/1319] acc=387/600 (64.5%) rate=5.0q/s
[800/1319] acc=512/800 (64.0%) rate=5.1q/s
[1000/1319] acc=640/1000 (64.0%) rate=5.3q/s
[1200/1319] acc=771/1200 (64.2%) rate=5.3q/s
Result: 851/1319 (64.5%) @ 5.48q/s
=== max_tokens=2000 ===
[200/1319] acc=163/200 (81.5%) rate=2.1q/s
[400/1319] acc=321/400 (80.2%) rate=2.3q/s
[600/1319] acc=479/600 (79.8%) rate=2.5q/s
[800/1319] acc=636/800 (79.5%) rate=2.5q/s
[1000/1319] acc=791/1000 (79.1%) rate=2.5q/s
[1200/1319] acc=956/1200 (79.7%) rate=2.6q/s
Result: 1055/1319 (80.0%) @ 2.63q/s
=== max_tokens=3000 ===
[200/1319] acc=171/200 (85.5%) rate=1.5q/s
[400/1319] acc=341/400 (85.2%) rate=1.5q/s
[600/1319] acc=505/600 (84.2%) rate=1.5q/s
[800/1319] acc=674/800 (84.2%) rate=1.5q/s
[1000/1319] acc=836/1000 (83.6%) rate=1.5q/s
[1200/1319] acc=1008/1200 (84.0%) rate=1.5q/s
Result: 1113/1319 (84.4%) @ 1.57q/s
=== max_tokens=4000 ===
[200/1319] acc=175/200 (87.5%) rate=1.1q/s
[400/1319] acc=348/400 (87.0%) rate=1.1q/s
[600/1319] acc=517/600 (86.2%) rate=1.1q/s
[800/1319] acc=683/800 (85.4%) rate=1.1q/s
[1000/1319] acc=852/1000 (85.2%) rate=1.1q/s
[1200/1319] acc=1033/1200 (86.1%) rate=1.1q/s
Result: 1139/1319 (86.4%) @ 1.17q/s
=== max_tokens=5000 ===
[200/1319] acc=176/200 (88.0%) rate=0.8q/s
[400/1319] acc=350/400 (87.5%) rate=0.9q/s
[600/1319] acc=523/600 (87.2%) rate=0.9q/s
[800/1319] acc=687/800 (85.9%) rate=0.9q/s
[1000/1319] acc=850/1000 (85.0%) rate=0.9q/s
[1200/1319] acc=1025/1200 (85.4%) rate=0.9q/s
Result: 1129/1319 (85.6%) @ 0.93q/s
=== max_tokens=6000 ===
[200/1319] acc=181/200 (90.5%) rate=0.7q/s
[400/1319] acc=351/400 (87.8%) rate=0.7q/s
[600/1319] acc=523/600 (87.2%) rate=0.7q/s
[800/1319] acc=696/800 (87.0%) rate=0.7q/s
[1000/1319] acc=863/1000 (86.3%) rate=0.7q/s
[1200/1319] acc=1048/1200 (87.3%) rate=0.7q/s
Result: 1153/1319 (87.4%) @ 0.73q/s
=== max_tokens=7000 ===
[200/1319] acc=172/200 (86.0%) rate=0.5q/s
[400/1319] acc=346/400 (86.5%) rate=0.6q/s
[600/1319] acc=520/600 (86.7%) rate=0.6q/s
[800/1319] acc=683/800 (85.4%) rate=0.6q/s
[1000/1319] acc=853/1000 (85.3%) rate=0.6q/s
[1200/1319] acc=1034/1200 (86.2%) rate=0.6q/s
Result: 1137/1319 (86.2%) @ 0.61q/s
=== Summary ===
max_tokens  accuracy  correct  total  rate (q/s)
1000 64.5% 851 1319 5.5
2000 80.0% 1055 1319 2.6
3000 84.4% 1113 1319 1.6
4000 86.4% 1139 1319 1.2
5000 85.6% 1129 1319 0.9
6000 87.4% 1153 1319 0.7
7000 86.2% 1137 1319 0.6
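Reading the summary as a cost/benefit table shows the diminishing returns clearly. A quick illustrative script (numbers copied from the table above):

```python
# Summary numbers from the runs above: (max_tokens, accuracy %, rate q/s).
runs = [
    (1000, 64.5, 5.5), (2000, 80.0, 2.6), (3000, 84.4, 1.6),
    (4000, 86.4, 1.2), (5000, 85.6, 0.9), (6000, 87.4, 0.7),
    (7000, 86.2, 0.6),
]

# Marginal effect of each step: accuracy points gained vs throughput lost.
deltas = [
    (t1, round(a1 - a0, 1), round(r0 - r1, 1))
    for (t0, a0, r0), (t1, a1, r1) in zip(runs, runs[1:])
]
for tokens, gain, cost in deltas:
    print(f"up to {tokens}: {gain:+.1f} accuracy pts for -{cost:.1f} q/s")
```

Going from 1000 to 2000 tokens buys 15.5 accuracy points; past 4000, the gains are within noise (some steps are even negative) while throughput keeps dropping.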
About boxed & fallback. The model is asked to put its result in a "boxed" format, but it sometimes fails to do so. I use some "fallback" parsing to try to extract the answer anyway.
I'm not running my full suite, so there might be some false negatives here (improper parsing flagging a correct result as incorrect).
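A minimal sketch of this boxed-then-fallback extraction (illustrative only; `extract_answer` and its regexes are a simplified stand-in for my actual parsing):

```python
import re
from typing import Optional

def extract_answer(text: str) -> Optional[str]:
    """Prefer the last \\boxed{...} in the completion; fall back to
    the last number anywhere in the text; give up with None."""
    boxed = re.findall(r"\\boxed\{([^}]*)\}", text)
    if boxed:
        return boxed[-1].strip().replace(",", "")
    # Fallback: last integer or decimal in the completion.
    numbers = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    if numbers:
        return numbers[-1].replace(",", "")
    return None

print(extract_answer("so the total is \\boxed{72}"))   # -> 72
print(extract_answer("the answer is 42."))             # -> 42
print(extract_answer("no digits here"))                # -> None
```

The fallback is where false negatives sneak in: grabbing the last number works most of the time but can pick up an intermediate step instead of the final answer.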
You'll hear me say it a lot on dev.to: I don't have enough compute power. Still, this is a good rough estimate.
TODO
I need to double-check the results, test longer contexts, try another model, test turboquant, and so much more...