LLMs are known to struggle with math. Not with competition-level tasks like the AIME eval, where the reasoning models compete and shine, but with the everyday math we all deal with: additions, multiplications, and so on.
Take for example Grok 3's DeepSearch, which I prompted to "... list countries by their GDP per capita in Japanese Yen". As you can see in the screenshot below, the agent approached it reasonably: it found a readily available GDP per capita table from the IMF, came up with a USD-to-JPY conversion rate, and created a summary table with the IMF data converted using that exchange rate.
In its explanation of the approach, "... each USD value was multiplied by 146 to get JPY. For example, Luxembourg's 140,941 USD became 20,577,186 JPY (140,941 × 146)", Grok 3 makes a calculation mistake. My non-AI-native calculator gives me 20,577,386 as the result of the 140,941 × 146 multiplication. All the cells in the following table were also wrong.
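A one-liner in a Python REPL confirms the correct product:

```python
>>> 140_941 * 146
20577386
```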
I went further by testing Grok in 3 different modes:
- No thinking + Web Search
- Thinking + Web Search
- DeepSearch
For each of the modes, Grok's approach was the same: find source data in USD, peg it to a certain exchange rate, do the calculation, and output the resulting table. If we put aside the questions of why the exchange rate differed across all 3 cases, and why a certain subset of countries was picked (never the full list of countries and territories)... I tested how one of the best SOTA models (Grok 3, possibly the Mini variant) fared at converting USD to JPY:
- No thinking + Web Search: 32 countries, 3 wrong calculations
- Thinking + Web Search: 13 countries, all correct
- DeepSearch: 11 countries, 11 wrong (deviating by ~0.5% from the true values)
The complete calculation verification is available in this spreadsheet.
The example demonstrates a very common pitfall in LLM use. Any prompt and any context dealing with numbers may require the model to do basic math. The model will likely not resort to a tool call (i.e. asking a Python interpreter to run the calculations), so the numbers an LLM produces are not trustworthy. And indeed, I rarely see prompts with numbers followed by a tool call for calculation; models readily return completions with the arithmetic done inline.
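One mitigation is to force the model to route arithmetic through a tool rather than answer from its weights. Here is a minimal sketch using the OpenAI SDK's function calling; the `calculate` tool name and schema are my own example, not any product's built-in:

```python
import json
from openai import OpenAI  # assumes the official openai package and an API key

client = OpenAI()

# A calculator tool the model is forced to use; name and schema are illustrative.
calculator_tool = {
    "type": "function",
    "function": {
        "name": "calculate",
        "description": "Evaluate an arithmetic expression and return the exact result.",
        "parameters": {
            "type": "object",
            "properties": {"expression": {"type": "string"}},
            "required": ["expression"],
        },
    },
}

resp = client.chat.completions.create(
    model="gpt-4.1",  # any tool-capable model
    messages=[{"role": "user", "content": "Convert 140,941 USD to JPY at 146 JPY/USD."}],
    tools=[calculator_tool],
    # tool_choice forces a call to `calculate` instead of an inline guess
    tool_choice={"type": "function", "function": {"name": "calculate"}},
)

# The model returns an expression; evaluate it deterministically on our side
# (with a proper expression parser in production, not eval()).
args = json.loads(resp.choices[0].message.tool_calls[0].function.arguments)
print(args["expression"])  # e.g. "140941 * 146"
```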
Say you have Office 365 Copilot, Claude, ChatGPT, or any other chatbot doing errands for you. You ask it to look into an invoice and highlight value-for-money outliers. Or you are working on a quote and ask the chatbot to prepare a report. Or, as a PM, you use an AI assistant to look into sprint stats and evaluate velocity. There are numerous cases requiring basic number crunching, and if your life depends on the accuracy of those numbers, I wouldn't trust a single digit in the result. No matter which LLM product you use (Perplexity, Glean, Deep Research, Copilot, Gemini), all are built on LLMs that are bad at math.
But how bad are LLMs at this sort of math? Assume you have the correct input (which is rarely the case; models can easily hallucinate at any step, e.g. while processing a table in a picture). What are the chances the LLM will get the math right?
I've created a benchmark testing just that: llm_arithmetic. It prompts a model multiple times to add, subtract, multiply, and divide random numbers, and registers the accuracy. In the results table below, Correct % counts exact answers, NaN % counts unparsable replies, Dev % counts answers that deviate from the true value, Comp. Tok. is completion tokens consumed, Cost is the API spend, and Avg Error is the average relative deviation from the true values.
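The harness boils down to something like the following sketch (illustrative, not the repo's actual code; the model name and tolerance are placeholders):

```python
import random
from openai import OpenAI  # assumes the official openai package and an API key

client = OpenAI()

def run_trial(model: str, digits: int = 6) -> str:
    # Draw two random operands and an operator, compute ground truth in Python.
    a = random.randint(10 ** (digits - 1), 10 ** digits - 1)
    b = random.randint(10 ** (digits - 1), 10 ** digits - 1)
    op = random.choice(["+", "-", "*", "/"])
    expected = eval(f"{a} {op} {b}")
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": f"Compute {a} {op} {b}. Reply with the number only."}],
    )
    try:
        answer = float((resp.choices[0].message.content or "").strip().replace(",", ""))
    except ValueError:
        return "nan"  # unparsable reply, lands in the NaN % column
    rel_err = abs(answer - expected) / abs(expected)
    return "correct" if rel_err < 1e-9 else "deviated"  # tolerance is my choice

results = [run_trial("gpt-4.1") for _ in range(480)]
print({k: f"{results.count(k) / len(results):.2%}" for k in ("correct", "deviated", "nan")})
```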
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┓
┃ Model ┃ Trials ┃ Correct % ┃ NaN % ┃ Dev % ┃ Comp. Tok. ┃ Cost ┃ Avg Error ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━┩
│ o4-mini-2025-04-16-medium │ 480 │ 97.08% │ 0.00% │ 2.92% │ 1110603.00 │ $4.903872 │ 0.002% │
│ o4-mini-2025-04-16-medium-4k │ 480 │ 93.54% │ 0.00% │ 6.46% │ 1083780.00 │ $6.741561 │ 0.001% │
│ o4-mini-2025-04-16-low │ 480 │ 88.96% │ 0.00% │ 11.04% │ 575871.00 │ $2.551050 │ 0.959% │
│ deepseek-r1 │ 480 │ 84.17% │ 0.21% │ 15.62% │ 1462524.00 │ $3.210413 │ 2669.789% │
│ claude-sonnet-4-20250514-thinking16000 │ 480 │ 76.04% │ 0.00% │ 23.96% │ 1332908.00 │ $20.085939 │ 1740.396% │
│ o3-mini-2025-01-31-medium │ 480 │ 75.21% │ 0.00% │ 24.79% │ 945716.00 │ $4.178371 │ 2.287% │
│ grok-3-mini-beta-high │ 480 │ 71.88% │ 1.25% │ 26.88% │ 2702.00 │ $0.006156 │ 827.580% │
│ deepseek-r1-4k │ 480 │ 70.00% │ 0.00% │ 30.00% │ 620371.00 │ $0.000000 │ 712.913% │
│ qwen3-32b@cerebras-thinking │ 480 │ 69.58% │ 5.62% │ 24.79% │ 2767460.00 │ $0.000000 │ 840317057.169% │
│ qwen3-14b@q8_0-ctx4k-thinking │ 480 │ 66.25% │ 0.21% │ 33.54% │ 2338564.00 │ $0.000000 │ 9492.622% │
│ o1-mini-2024-09-12 │ 480 │ 66.04% │ 0.00% │ 33.96% │ 572960.00 │ $7.617905 │ 6825.446% │
│ claude-opus-4-20250514-thinking16000 │ 480 │ 65.83% │ 0.00% │ 34.17% │ 396158.00 │ $0.000000 │ 1831.015% │
│ qwen3-14b@iq4_xs-ctx32k-thinking │ 480 │ 65.83% │ 0.83% │ 33.33% │ 2552276.00 │ $0.000000 │ 8152.815% │
│ qwen3-32b@iq4_xs-ctx16k-thinking │ 480 │ 65.62% │ 3.75% │ 30.63% │ 3499454.00 │ $0.000000 │ 5227.605% │
│ o3-mini-2025-01-31-low │ 480 │ 65.21% │ 0.00% │ 34.79% │ 284738.00 │ $1.270064 │ 5.435% │
│ qwen3-14b@iq4_xs-ctx4k-thinking │ 480 │ 65.00% │ 0.42% │ 34.58% │ 2245910.00 │ $0.000000 │ 72213401.589% │
│ qwen3-14b@q4_k_m-ctx4k-thinking │ 480 │ 64.79% │ 0.00% │ 35.21% │ 2334475.00 │ $0.000000 │ 3769.350% │
│ claude-sonnet-3.7-20250219-thinking4096 │ 480 │ 57.08% │ 18.96% │ 23.96% │ 1214269.00 │ $18.306354 │ 889.557% │
│ gemini-2.5-pro-preview-03-25 │ 480 │ 55.83% │ 0.00% │ 44.17% │ 5517.00 │ $0.078019 │ 20.602% │
│ qwen3-14b@iq4_xs-ctx32k-thinking-4k │ 480 │ 55.21% │ 0.21% │ 44.58% │ 710967.00 │ $0.000000 │ 988.474% │
│ claude-sonnet-3.7-20250219-4k │ 480 │ 52.50% │ 0.00% │ 47.50% │ 4213.00 │ $0.000000 │ 2217.925% │
│ xai/grok-3-mini-beta │ 480 │ 51.46% │ 0.00% │ 48.54% │ 2511.00 │ $0.006060 │ 913.579% │
│ claude-sonnet-3.7-20250219 │ 480 │ 51.04% │ 0.00% │ 48.96% │ 4147.00 │ $0.114204 │ 1302.437% │
│ claude-opus-4-20250514 │ 480 │ 50.42% │ 0.00% │ 49.58% │ 4169.00 │ $0.572685 │ 5037.315% │
│ gemini-2.5-flash-preview-04-17-thinking │ 480 │ 50.42% │ 0.21% │ 49.38% │ 521284.00 │ $0.315585 │ 27.894% │
│ claude-sonnet-4-20250514 │ 480 │ 50.00% │ 0.00% │ 50.00% │ 4125.00 │ $0.113868 │ 20.410% │
│ gemini-2.5-flash-preview-04-17-thinking │ 480 │ 49.79% │ 0.21% │ 50.00% │ 310022.00 │ $1.087891 │ 481.693% │
│ claude-3.5-haiku │ 480 │ 49.58% │ 0.00% │ 50.42% │ 3987.00 │ $0.029816 │ 3351.666% │
│ gpt-4.5-preview-2025-02-27 │ 480 │ 49.58% │ 0.00% │ 50.42% │ 2647.00 │ $1.607175 │ 24.709% │
│ gpt-4.1-2025-04-14-4k │ 480 │ 48.54% │ 0.00% │ 51.46% │ 2688.00 │ $5.163010 │ 25.919% │
│ gemini-2.5-flash-preview-04-17-no-thinking │ 480 │ 48.54% │ 0.00% │ 51.46% │ 5238.00 │ $0.005956 │ 30.566% │
│ gpt-4.1-2025-04-14 │ 480 │ 48.12% │ 0.00% │ 51.88% │ 2729.00 │ $0.068629 │ 7284.099% │
│ qwen3-32b@cerebras │ 480 │ 46.46% │ 0.00% │ 53.54% │ 7457.00 │ $0.000000 │ 63.979% │
│ qwen3-32b@iq4_xs-ctx16k │ 480 │ 46.04% │ 1.04% │ 52.92% │ 7132.00 │ $0.000000 │ 63.271% │
│ qwen3-14b@iq4_xs-ctx32k │ 480 │ 45.21% │ 1.67% │ 53.12% │ 7533.00 │ $0.000000 │ 392239118.901% │
│ gpt-4-0613 │ 480 │ 41.04% │ 0.00% │ 58.96% │ 2450.00 │ $0.631020 │ 362466.402% │
│ gpt-4.1-nano-2025-04-14 │ 480 │ 38.54% │ 0.42% │ 61.04% │ 2841.00 │ $0.002749 │ 686001.894% │
│ gpt-35-turbo-0125 │ 480 │ 35.62% │ 0.62% │ 63.75% │ 2438.00 │ $0.011725 │ 43.177% │
│ gpt-35-turbo-1106 │ 480 │ 33.96% │ 0.21% │ 65.83% │ 2560.00 │ $0.011907 │ 409.261% │
│ gpt-4o-mini-2024-07-18 │ 480 │ 32.29% │ 0.00% │ 67.71% │ 2862.00 │ $0.004137 │ 64.570% │
│ claude-2.1 │ 480 │ 13.33% │ 0.00% │ 86.67% │ 2661.00 │ $0.000000 │ 174.584% │
│ deepseek-r1-distill-qwen-14b@iq4_xs │ 480 │ 10.21% │ 70.21% │ 19.58% │ 1113604.00 │ $0.000000 │ 163.793% │
└────────────────────────────────────────────┴────────┴───────────┴────────┴────────┴────────────┴────────────┴────────────────┘
My observations based on testing a range of models:
- In general, models are fine with small numbers (2-3 digits)
- Performance is worse with multiplication and the worst with division
- There's a huge gap in performance between models
- o3/o4 models are surprisingly good; I'd trust them with number-crunching tasks where errors under 1 percent are tolerable