Maxim Saplin


Deepseek R1 vs OpenAI o1

Deepseek R1 is out - available via the Deepseek API or the free Deepseek chat. If you follow the LLM/Gen AI space, you've likely seen headlines, read posts, or watched videos praising the model: a 671B MoE model, open weights, lots of detail on the training process. It challenges OpenAI's reasoning models (o1/o1-mini) across many benchmarks at a fraction of the cost... There are even smaller "distilled" versions of R1 available to run locally (via llama.cpp/ollama/lmstudio, etc.).

I have been stress-testing models with LLM Chess since autumn, and so far, none of the "reasoning" (or "thinking") models has impressed me (except OpenAI's o1). Right away, I launched the benchmark, but I had to wait a few days to collect enough data (the API seemed to be throttled; it was extremely slow).

LLM Chess simulates multiple games of a random bot playing against an LLM: thousands of prompts, millions of tokens, and every game is unique (unlike most evals, which use fixed sets of prompts/pass conditions). Several metrics are collected and aggregated across multiple runs. A model's performance is evaluated on reasoning (% of wins/draws) and steerability/durability (how often it fails to follow instructions or drops out of the game due to multiple erroneous replies).
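To make the setup concrete, here is a minimal sketch of what one simulated game could look like. This is illustrative code, not the actual LLM Chess harness: it assumes the python-chess package and a hypothetical llm_move_fn callback that prompts the model and parses its reply, and it fails the LLM on the first bad reply, whereas the real benchmark tolerates several erroneous replies before dropping a model out.

```python
import random
import chess  # pip install python-chess

def play_game(llm_move_fn, max_moves: int = 200) -> str:
    """One simulated game: a random bot (White) vs. an LLM (Black)."""
    board = chess.Board()
    while not board.is_game_over():
        if board.fullmove_number > max_moves:
            return "draw"  # move limit reached, automatic draw
        if board.turn == chess.WHITE:
            # Random bot: pick any legal move
            board.push(random.choice(list(board.legal_moves)))
        else:
            # LLM: prompt the model with the board state and legal moves,
            # then parse the reply into a move (details omitted)
            move = llm_move_fn(board)
            if move is None or move not in board.legal_moves:
                return "llm_dropout"  # erroneous reply ends the game here
            board.push(move)
    return board.result()  # '1-0', '0-1', or '1/2-1/2'
```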

Reasoning Models

Before o1, LLMs couldn't beat a random player in chess. GPT-4o? Zero wins. Claude 3.5? Zero. They'd either collapse early or drag games to the 200-move limit (with an automatic draw assigned).

Then came o1. OpenAI's "reasoning" models broke through:

  • o1-preview: 46.67% wins
  • o1-mini: 30% wins

Other "reasoning" models? After the o1 release in late 2024 came the controversy around OpenAI's secrecy... The hidden "reasoning" tokens discussion (invisible yet billed) and how people were banned since OpenAI suspected they tried to crack their secrets. At the time we have seen attempts to reproduce the success of o1 with AI labs introducing "reasoning" models. E.g. Qwen's QwQ, Sky T1. Even Google released their experimental Gemini Thinking model in December 2024.

None of the alternative "reasoning" or "thinking" models came close to the OpenAI models - they struggled even with basic instruction following, drowning in verbosity and dropping out of the game loop after just a few moves: games lasted 2 to 14 moves on average. Take, for example, the non-reasoning, old, out-of-fashion GPT-4 Turbo, which lasted 192 moves on average (before losing to the random player due to a checkmate :).

Those late-2024 non-OpenAI reasoning models turned out to be surrogates. This set my expectations for R1 low...

R1

Deepseek's reasoning model turned out to be the real deal. It scored a meaningful number of wins while keeping the number of mistakes relatively low.

| Model | Wins | Draws | Mistakes | Tokens/move |
|-------|------|-------|----------|-------------|
| o1-preview | 46.67% | 43.33% | 3.74 | 2660 |
| o1-mini | 30.00% | 50.00% | 2.34 | 1221 |
| Deepseek-R1 | 22.58% | 19.35% | 18.63 | 4585 |

Mistakes - the number of erroneous LLM replies per 1000 moves.
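In other words (assumed formula, matching the metric's description):

```python
def mistakes_per_1000_moves(erroneous_replies: int, total_moves: int) -> float:
    # e.g. R1's 18.63 means roughly 19 bad replies per 1000 moves played
    return 1000 * erroneous_replies / total_moves
```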

R1 did well, but not great. Pay attention to how few draws it had compared to the o1 models. That's due to R1 breaking the protocol, violating prompt instructions, or hallucinating illegal moves (and hence scoring a loss). It struggles with instruction following and is susceptible to prompt variations, randomly falling out of the game loop.

For reference, here are the top non-reasoning models as of January 2025:

| Model | Wins ▼ | Draws | Mistakes | Tokens/move |
|-------|--------|-------|----------|-------------|
| anthropic.claude-v3-5-sonnet-v1 | 6.67% | 80.00% | 0.27 | 80.42 |
| gpt-4o-2024-11-20 | 4.23% | 87.32% | 0.15 | 50.58 |
| gpt-4-turbo-2024-04-09 | 0.00% | 93.33% | 0.00 | 6.03 |
| anthropic.claude-v3-opus | 0.00% | 83.33% | 1.61 | 72.86 |

Reasoning Models - a League of their Own

Besides a significant number of wins, the reasoning models maintained a positive average material difference. Material count in a chess game is a weighted score of all the pieces (e.g. a pawn being 1 unit of material and a queen being 9). Each player starts the game with a material count of 39. The eval calculates the difference in material at the end of each game - if a player loses more pieces than it captures, the difference will be negative. Other non-reasoning models (and reasoning "surrogates") typically end with a material diff that is negative or around 0 (the latter when they fail to progress in the game, breaking the loop).
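For illustration, here is how such a material diff could be computed with python-chess - my own sketch, not necessarily the eval's actual code. The post only states pawn = 1 and queen = 9; knight/bishop = 3 and rook = 5 below are the conventional weights that make each side's starting material add up to 39.

```python
import chess

# Standard piece values; the king carries no material value
PIECE_VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                chess.ROOK: 5, chess.QUEEN: 9}

def material_diff(board: chess.Board) -> int:
    """White's material minus Black's; 0 at the start (39 vs. 39)."""
    diff = 0
    for piece in board.piece_map().values():
        value = PIECE_VALUES.get(piece.piece_type, 0)
        diff += value if piece.color == chess.WHITE else -value
    return diff
```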

Here are the average material diffs at the end of the game:

| Model | Material Diff | Avg Game Duration (moves) |
|-------|---------------|---------------------------|
| o1-preview-2024-09-12 | 9.99 | 124.8 |
| o1-mini-2024-09-12 | 10.77 | 142.73 |
| deepseek-reasoner-r1 | 10.83 | 91.77 |
| anthropic.claude-v3-5-sonnet-v1 | -4.48 | 183.38 |
| gpt-4o-2024-11-20 | -8.23 | 189.72 |
| qwq-32b-preview@q4_k_m | -0.07 | 7.97 |
| gemini-2.0-flash-thinking-exp-1219 | 0.00 | 2.33 |

Distilled R1

I have also tested a few quantized versions of distilled R1. What Deepseek did was fine-tune several smaller (70B, 14B, 8B, etc.) Qwen 2.5 and Llama 3.1 models using the outputs of the full-size R1 model; supposedly, they should have gained reasoning skills. There's also a special <think></think> section in the output, keeping all the reasoning tokens isolated from the final answer (something important the earlier thinking models missed).

They didn't do well:

| Model | Wins ▼ | Draws | Mistakes | Tokens |
|-------|--------|-------|----------|--------|
| deepseek-r1-distill-qwen-32b@q4_k_m | 0.00% | 0.00% | 727.27 | 2173.83 |
| deepseek-r1-distill-qwen-14b@q8_0 | 0.00% | 0.00% | 1153.85 | 3073.06 |

Besides, I noticed these models sometimes failed to properly open and close the think tags (missing the opening <think>).
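A tolerant parser can recover from a missing opening tag. Here is a minimal sketch of the idea (hypothetical code, not the benchmark's actual parsing):

```python
import re

def split_think(completion: str) -> tuple[str, str]:
    """Separate the <think>...</think> reasoning from the final answer."""
    match = re.search(r"<think>(.*?)</think>", completion, re.DOTALL)
    if match:
        reasoning = match.group(1)
        answer = completion[match.end():]
    elif "</think>" in completion:
        # Opening <think> missing: fall back to the closing tag alone
        reasoning, _, answer = completion.partition("</think>")
    else:
        reasoning, answer = "", completion
    return reasoning.strip(), answer.strip()
```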

P.S.

Google has also dropped an update to Gemini Thinking the day after the R1 release!

It did much better than the December version! At least it is now steerable and can last for ~40 moves in a game. They have also added the separation of the thinking part, no longer bloating the response with reasoning tokens. And yet, it is still a thinking surrogate...

| Model | Wins ▼ | Draws | Mistakes | Tokens |
|-------|--------|-------|----------|--------|
| gemini-2.0-flash-thinking-exp-01-21 | 0.00% | 6.06% | 5.97 | 17.77 |
| gemini-2.0-flash-thinking-exp-1219 | 0.00% | 0.00% | 1285.71 | 724.54 |

Curiously, most of the game dropouts happened due to server errors (e.g. some copyright filters) or empty completions - there are definitely stability issues with the model.


Top comments (4)

Peter Truchly

My first thought was: why would anyone even try to play chess with an LLM? There are better "algorithms" for that.
But then again, especially in today's world where everybody expects AGI from every new LLM, why shouldn't it play chess, after all?

Where I see the problem (and a limitation) of the current approach is the misuse of the general-purpose reasoning capabilities of LLMs, which are undeniably there, but only in emergent form. What would an average person do if confronted with this task? Most of us would just use some software, a smaller portion would implement their own software, and only a handful (of chess masters) would play by themselves.

Unless we equip these 'AI' models with a complete set of tools, an environment, and the ability to use workflow patterns to model and execute a workflow designed for a given task, the results are going to be quite disappointing, at least for the near future.

MarkAurit

I performed a moderately complex, real-life query against the US stock market using ChatGPT and DeepSeek. I had only one measure of success: an accurate return of information. ChatGPT did extremely well - it added a chart (not requested) and much more information than just the closing price; in other words, about what you get from a Yahoo Finance stock query. DeepSeek merely replied with "that information isn't in my system yet". I realize that DeepSeek is a tool in its relative infancy, and in time it will be just as powerful and useful as ChatGPT. But as of now, it is useless to me.
Query: what is the stock market price of AAPL

Vinayak Mishra

Nice post... I have a question after seeing this: how good is Deepseek compared to others w.r.t. hallucinations? Just last night I was reading a piece on LLM hallucination detection...

Maxim Saplin

Assuming that hallucinations include cases when the model doesn't do what it is asked (e.g. skips an action) or decides to make a move that is not legal - R1 is quite bad. You can tell this by the large number of mistakes and the low average game duration (you can see both by hovering over a row in the leaderboard).
