
Laurent Laborde

Benchmarking LFM2.5-Thinking on GSM8k (early result)

#ai

I have a secret passion for LFM2.5-Thinking. It's tiny (1.2B parameters), it's fast, it's a reasoning model, and it's good. Really good.

My tests are still in progress, so all I can do is share some early results. I use the public GSM8k dataset, but with my own benchmarking scripts.

What is the GSM8k benchmark?

Grade School Math 8K is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems that require multi-step reasoning and elementary arithmetic operations.
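GSM8k stores each gold answer as a short chain of reasoning ending in a `#### <number>` line. A minimal extractor for that marker might look like this (the function name is mine, not from my benchmark scripts):

```python
import re

def gsm8k_gold_answer(answer_field: str) -> str:
    """GSM8k gold answers end with a '#### <number>' line; extract that number."""
    match = re.search(r"####\s*([-0-9.,]+)", answer_field)
    assert match, "no '#### <answer>' marker found"
    # Strip thousands separators so '1,200' compares equal to '1200'.
    return match.group(1).replace(",", "")

example = "48 / 2 = 24 clips in May.\n48 + 24 = 72 clips total.\n#### 72"
print(gsm8k_gold_answer(example))  # → 72
```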

The top-10 leaderboard in 2026 reaches up to 97%. Take note of the massive context sizes.

(Image: Top 10 Leaderboard)

And this is what "State of the Art" results looked like in 2021: barely 35%.

(Image: SOTA GSM8k in 2021)

Some early results

Questions: 1319 (test)
Context sizes to test: [1000, 2000, 3000, 4000, 5000, 6000, 7000]
Endpoint: http://192.168.1.110:8000 / lfm2.5-thinking

=== max_tokens=1000 ===
  [200/1319] acc=135/200 (67.5%) rate=3.9q/s
  [400/1319] acc=251/400 (62.8%) rate=4.6q/s
  [600/1319] acc=387/600 (64.5%) rate=5.0q/s
  [800/1319] acc=512/800 (64.0%) rate=5.1q/s
  [1000/1319] acc=640/1000 (64.0%) rate=5.3q/s
  [1200/1319] acc=771/1200 (64.2%) rate=5.3q/s
  Result: 851/1319 (64.5%) @ 5.48q/s

=== max_tokens=2000 ===
  [200/1319] acc=163/200 (81.5%) rate=2.1q/s
  [400/1319] acc=321/400 (80.2%) rate=2.3q/s
  [600/1319] acc=479/600 (79.8%) rate=2.5q/s
  [800/1319] acc=636/800 (79.5%) rate=2.5q/s
  [1000/1319] acc=791/1000 (79.1%) rate=2.5q/s
  [1200/1319] acc=956/1200 (79.7%) rate=2.6q/s
  Result: 1055/1319 (80.0%) @ 2.63q/s

=== max_tokens=3000 ===
  [200/1319] acc=171/200 (85.5%) rate=1.5q/s
  [400/1319] acc=341/400 (85.2%) rate=1.5q/s
  [600/1319] acc=505/600 (84.2%) rate=1.5q/s
  [800/1319] acc=674/800 (84.2%) rate=1.5q/s
  [1000/1319] acc=836/1000 (83.6%) rate=1.5q/s
  [1200/1319] acc=1008/1200 (84.0%) rate=1.5q/s
  Result: 1113/1319 (84.4%) @ 1.57q/s

=== max_tokens=4000 ===
  [200/1319] acc=175/200 (87.5%) rate=1.1q/s
  [400/1319] acc=348/400 (87.0%) rate=1.1q/s
  [600/1319] acc=517/600 (86.2%) rate=1.1q/s
  [800/1319] acc=683/800 (85.4%) rate=1.1q/s
  [1000/1319] acc=852/1000 (85.2%) rate=1.1q/s
  [1200/1319] acc=1033/1200 (86.1%) rate=1.1q/s
  Result: 1139/1319 (86.4%) @ 1.17q/s

=== max_tokens=5000 ===
  [200/1319] acc=176/200 (88.0%) rate=0.8q/s
  [400/1319] acc=350/400 (87.5%) rate=0.9q/s
  [600/1319] acc=523/600 (87.2%) rate=0.9q/s
  [800/1319] acc=687/800 (85.9%) rate=0.9q/s
  [1000/1319] acc=850/1000 (85.0%) rate=0.9q/s
  [1200/1319] acc=1025/1200 (85.4%) rate=0.9q/s
  Result: 1129/1319 (85.6%) @ 0.93q/s

=== max_tokens=6000 ===
  [200/1319] acc=181/200 (90.5%) rate=0.7q/s
  [400/1319] acc=351/400 (87.8%) rate=0.7q/s
  [600/1319] acc=523/600 (87.2%) rate=0.7q/s
  [800/1319] acc=696/800 (87.0%) rate=0.7q/s
  [1000/1319] acc=863/1000 (86.3%) rate=0.7q/s
  [1200/1319] acc=1048/1200 (87.3%) rate=0.7q/s
  Result: 1153/1319 (87.4%) @ 0.73q/s

=== max_tokens=7000 ===
  [200/1319] acc=172/200 (86.0%) rate=0.5q/s
  [400/1319] acc=346/400 (86.5%) rate=0.6q/s
  [600/1319] acc=520/600 (86.7%) rate=0.6q/s
  [800/1319] acc=683/800 (85.4%) rate=0.6q/s
  [1000/1319] acc=853/1000 (85.3%) rate=0.6q/s
  [1200/1319] acc=1034/1200 (86.2%) rate=0.6q/s
  Result: 1137/1319 (86.2%) @ 0.61q/s

=== Summary ===
max_tokens  accuracy   correct   total    rate
      1000     64.5%       851    1319    5.5
      2000     80.0%      1055    1319    2.6
      3000     84.4%      1113    1319    1.6
      4000     86.4%      1139    1319    1.2
      5000     85.6%      1129    1319    0.9
      6000     87.4%      1153    1319    0.7
      7000     86.2%      1137    1319    0.6
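The per-200-question progress lines above come from my own harness. A minimal sketch of that bookkeeping (the function name and exact formatting are mine, written to mirror the output above, not the actual script):

```python
import time

def progress_line(done: int, total: int, correct: int, start: float) -> str:
    """Render a line like '[200/1319] acc=135/200 (67.5%) rate=3.9q/s'."""
    elapsed = time.monotonic() - start
    rate = done / elapsed if elapsed > 0 else 0.0  # questions per second so far
    pct = 100.0 * correct / done                   # running accuracy
    return f"[{done}/{total}] acc={correct}/{done} ({pct:.1f}%) rate={rate:.1f}q/s"

# Example: 135 correct out of 200 after ~51 seconds.
print(progress_line(200, 1319, 135, time.monotonic() - 51.3))
```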

(Image: LFM2.5 result)

About boxed & fallback: the model is asked to put its result in a "boxed" format, but it sometimes fails to do so. I have some "fallback" parsing that tries to extract the answer anyway.

I'm not running my full suite, so there might be some false negatives here (improper parsing flagging a correct result as incorrect).

You'll hear me say it a lot on dev.to: I don't have enough compute power. Still, this is a good rough estimate.

TODO

I need to double-check the results, test longer contexts, test other models, try turboquant, and so much more...
