Shrijal Acharya for Composio

Posted on Feb 21, 2025 • Originally published at composio.dev

Grok 3 vs. DeepSeek R1: A Deep Analysis

#ai #productivity #opensource #discuss

Almost everyone now knows about the DeepSeek R1 model, an open-source AI from China that took the internet by storm.

The main selling point of DeepSeek R1 is that it is free and open-source, yet it can rival some of OpenAI's paid models, such as o1.

Then, on February 16, Elon Musk announced Grok 3, labelling it as the "Smartest AI on Earth."

But how does it compare to our free model, DeepSeek R1? I was curious, so I decided to test Grok 3 against DeepSeek R1 in coding, reasoning, math, and creative writing.

Let's find out if all the hype around Grok 3 holds any weight.

I tested both models against a list of prompts I created. Below, I show the results along with my thoughts on each model's ability in the various tasks.

TL;DR

If you want to jump straight to the conclusion, here's a quick summary of the findings comparing DeepSeek R1 and Grok 3:

Reasoning and Math: DeepSeek R1 and Grok 3 perform similarly on reasoning and math questions.

Coding: Grok 3 outperforms DeepSeek R1 and writes much better code.

Creative Writing: Both models perform well in creative writing. DeepSeek is known to be a great model, but I prefer Grok 3.

Brief on Grok 3 AI Model

Grok 3 is the latest language model from xAI, which says it was trained with roughly 10x the compute of earlier models. It includes tools like DeepSearch for step-by-step reasoning and Big Brain Mode for handling complex tasks.

Currently, the Grok 3 model is in beta mode, but is it really that much better than the DeepSeek R1 model? We'll find out by the end of this article.

According to the official benchmarks shared by the xAI team at the launch event, Grok 3 appears to be a game-changer, outperforming all its competitors in almost every benchmark.

DeepSeek R1 vs. Grok 3

I've used Chatbot Arena to test both of these models side by side, as it is the only available and trusted third-party site that allows testing the early Grok 3 model.

Reasoning Problems

ℹ️ Here, we will check the reasoning capabilities of both models.

1. Russian Roulette

Let's start the show with an interesting question:

You are playing Russian roulette with a six-shooter revolver. Your opponent puts in five bullets, spins the chambers and fires at himself, but no bullet comes out. He gives you the choice of whether or not he should spin the chambers again before firing at you. Should he spin again?

Response from Grok 3:

Response from DeepSeek R1:

Final Verdict: Both models answered the problem correctly, with sound reasoning. ✅
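For readers who want to check the expected answer themselves, here is a minimal sketch of the probability behind the puzzle. It is my own illustration, not taken from either model's response, and the function name `hit_probabilities` is hypothetical. It assumes that "not spinning" means the cylinder simply advances to the next chamber.

```python
from fractions import Fraction


def hit_probabilities(chambers: int = 6, bullets: int = 5):
    """Return (P(hit) without spinning, P(hit) after spinning again).

    Assumption: without a spin, the cylinder advances by exactly one chamber.
    """
    # True marks a loaded chamber, False an empty one.
    cylinder = [True] * bullets + [False] * (chambers - bullets)

    # The opponent survived, so his shot must have landed on an empty chamber.
    empty_starts = [i for i in range(chambers) if not cylinder[i]]

    # Without spinning, the next chamber in rotation is fired.
    loaded_next = sum(cylinder[(i + 1) % chambers] for i in empty_starts)
    no_spin = Fraction(loaded_next, len(empty_starts))

    # With a fresh spin, every chamber is equally likely again.
    spin = Fraction(bullets, chambers)
    return no_spin, spin


no_spin, spin = hit_probabilities()
print(f"no spin: {no_spin}, spin: {spin}")  # no spin: 1, spin: 5/6
```

With five bullets, the only empty chamber has just been used, so not spinning is a certain hit, while spinning leaves a 1-in-6 chance of survival. That is why asking for a spin is the right call.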

2. Olympic Athlete Siblings

I am the sister of two Olympic athletes. But these two athletes are not my sisters. How is this possible?

Response from Grok 3:

Response from DeepSeek R1:

Final Verdict: Here as well, both models answered the problem correctly, with sound reasoning. ✅

3. Find out the Liar

The first two questions were straightforward. Let's conclude the reasoning test with a slightly trickier question:

You meet three people: Haris, Antony, and Michael. Haris says, "At least one of us is a liar." Antony says, "Haris is lying." Michael says, "Antony is telling the truth." Determine who is lying and who is telling the truth.

Response from Grok 3:

Here, Grok 3 provided thorough reasoning, considering each case and breaking down the problem step by step to support its conclusion.

Response from DeepSeek R1:

Final Verdict: Both models answered the problem correctly ✅, but the reasoning of the Grok 3 model stands out to me.
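The case-by-case reasoning this puzzle calls for can also be checked by brute force. The sketch below is my own and is not either model's output. It assumes the usual convention that each person either always tells the truth or always lies, and the function name `solve_liar_puzzle` is hypothetical.

```python
from itertools import product


def solve_liar_puzzle():
    """Try all 8 truth/lie assignments and keep the consistent ones.

    True means "truth-teller", False means "liar" (assumed convention).
    """
    solutions = []
    for haris, antony, michael in product([True, False], repeat=3):
        claims = [
            # Haris: "At least one of us is a liar."
            (haris, not all((haris, antony, michael))),
            # Antony: "Haris is lying."
            (antony, not haris),
            # Michael: "Antony is telling the truth."
            (michael, antony),
        ]
        # A truth-teller's claim must be true, a liar's claim must be false.
        if all(speaker == claim for speaker, claim in claims):
            solutions.append(
                {"Haris": haris, "Antony": antony, "Michael": michael}
            )
    return solutions


print(solve_liar_puzzle())
```

Only one assignment survives: Haris tells the truth, while Antony and Michael are lying.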

Summary:

From this, we can see that both models are quite strong in reasoning capabilities, as they both provided correct answers to all my reasoning questions. Still, the reasoning capability of the Grok 3 model seems to outperform R1 at times.

Mathematics

💁 Looking at the reasoning question answers from both models, I was convinced that both of these models should also pass the math questions.

Starting with an easy question, let's see how they perform:

1. Number of people in the train

There were some people on a train. 19 people get off the train at the first stop. 17 people get on the train. Now there are 63 people on the train. How many people were on the train to begin with?

Response from Grok 3:

Response from DeepSeek R1:

Final Verdict: Both models chose a similar approach and ended up with the correct answer. ✅
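The arithmetic is simple enough to verify by working backwards from the final count. This is a tiny sketch of my own, with a hypothetical helper name:

```python
def initial_passengers(final_count: int, got_off: int, got_on: int) -> int:
    """Undo the stop: remove the people who boarded, add back those who left."""
    return final_count - got_on + got_off


print(initial_passengers(63, got_off=19, got_on=17))  # 65
```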

2. Prime Number Theorem Approximation

Estimate the number of prime numbers less than 10^8.

Response from Grok 3:

Response from DeepSeek R1:

Final Verdict: Both models chose the correct approach of using the Prime Number Theorem, but the answer from DeepSeek R1 was off by around 1K, while Grok 3 was off by around 20K. So, it's fair to say DeepSeek R1 ✅ outperformed Grok 3 ❌ here.
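For context on how far a Prime Number Theorem estimate can land from the true value, here is a rough sketch of my own. It does not reconstruct either model's working. It compares the basic estimate x / ln x and the common refinement x / (ln x - 1) against the known count of primes below 10^8, which is 5,761,455. The function names are hypothetical.

```python
import math

# Known exact value of pi(10^8), the number of primes below 10^8.
KNOWN_PRIME_COUNT = 5_761_455


def pnt_basic(x: float) -> float:
    """Basic Prime Number Theorem estimate: x / ln(x)."""
    return x / math.log(x)


def pnt_refined(x: float) -> float:
    """A commonly used refinement: x / (ln(x) - 1)."""
    return x / (math.log(x) - 1)


x = 10**8
for name, estimate in [("x / ln x", pnt_basic(x)), ("x / (ln x - 1)", pnt_refined(x))]:
    error = KNOWN_PRIME_COUNT - estimate
    print(f"{name:>15}: {estimate:,.0f} (short by {error:,.0f})")
```

The basic form undershoots by a few hundred thousand, while the refined form gets within a few tens of thousands, so the size of the error depends heavily on which approximation is applied.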

3. Count vowels and exponentiate

This is definitely the toughest problem for the models to crack. It is a modified version, with a math twist, of a very famous question for LLMs: "Calculate the number of r's in strawberry."

Calculate (x - 14)^10 where x is the number of vowels in the answer.

Response from Grok 3:

To solve the problem, we need to compute \((x - 14)^{10}\), where \(x\) is the number of vowels in the answer. This is a self-referential problem, so we need to find a value of the answer such that the number of vowels in the answer matches the \(x\) used in the expression.

### Step 1: Define the problem
- Let \(y = (x - 14)^{10}\), where \(x\) is the number of vowels in \(y\).
- Vowels are the letters "a," "e," "i," "o," "u" (we'll consider only lowercase for simplicity, unless otherwise specified).
- We need to find a \(y\) such that the number of vowels in \(y\) equals \(x\), and \(y = (x - 14)^{10}\).

### Step 2: Trial and error approach
Since \(x\) is the number of vowels, and vowels are non-negative integers, \(x \geq 0\). However, for \((x - 14)^{10}\) to be meaningful, \(x\) should be such that the expression produces a number whose vowel count matches \(x\).

Let's try different values of \(x\) and compute \(y\), then check the number of vowels in \(y\).

#### Case 1: \(x = 16\)
- \(y = (16 - 14)^{10} = 2^{10} = 1024\).
- Write \(y\) in words: "one thousand twenty-four."
- Count vowels in "one thousand twenty-four":
  - o (1), e (1), o (2), u (1), a (1), o (2) → 6 vowels.
- \(x = 16\), but the number of vowels in \(y\) is 6, so \(16 \neq 6\). This doesn't work.

#### Case 2: \(x = 18\)
- \(y = (18 - 14)^{10} = 4^{10}\).
- \(4^5 = 1024\), so \(4^{10} = (4^5)^2 = 1024^2\).
- \(1024^2 = 1,048,576\).
- Write \(y\) in words: "one million forty-eight thousand five hundred seventy-six."
- Count vowels:
  - o (1), e (1), i (1), i (2), o (2), a (1), o (3), u (1), a (2), e (2), u (2), e (3), i (3) → 13 vowels.
- \(x = 18\), but the number of vowels in \(y\) is 13, so \(18 \neq 13\). This doesn't work.

#### Case 3: \(x = 15\)
- \(y = (15 - 14)^{10} = 1^{10} = 1\).
- Write \(y\) in words: "one."
- Count vowels in "one": **o**n**e** → 2 vowels.
- \(x = 15\), but the number of vowels in \(y\) is 2, so \(15 \neq 2\). This doesn't work.

#### Case 4: \(x = 17\)
- \(y = (17 - 14)^{10} = 3^{10}\).
- \(3^5 = 243\), so \(3^{10} = (3^5)^2 = 243^2\).
- \(243^2 = 59,049\).
- Write \(y\) in words: "fifty-nine thousand forty-nine."
- Count vowels:
  - i (1), i (2), e (1), o (1), u (1), a (1), i (3), e (2) → 8 vowels.
- \(x = 17\), but the number of vowels in \(y\) is 8, so \(17 \neq 8\). This doesn't work.

### Step 3: Look for patterns
Let's summarize the results so far:
- \(x = 14\), \(y = 0\), vowels = 2.
- \(x = 15\), \(y = 1\), vowels = 2.
- \(x = 16\), \(y = 1024\), vowels = 6.
- \(x = 17\), \(y = 59,049\), vowels = 8.
- \(x = 18\), \(y = 1,048,576\), vowels = 13.

It seems that as \(x\) increases, \(y\) grows exponentially (since it's \(x - 14\) raised to the 10th power), and the number of vowels in \(y\) also increases, but not necessarily matching \(x\).

### Step 4: Try smaller \(x\)
Let's try \(x\) closer to 14 but with a focus on smaller vowel counts.

#### Case 5: \(x = 12\)
- \(y = (12 - 14)^{10} = (-2)^{10} = 1024\).
- Write \(y\) in words: "one thousand twenty-four."
- Count vowels: 6 vowels (as calculated earlier).
- \(x = 12\), but the number of vowels in \(y\) is 6, so \(12 \neq 6\). This doesn't work.

### Step 5: Try to find a fixed point
We need to find an \(x\) such that the number of vowels in \((x - 14)^{10}\) equals \(x\). Let's try \(x = 2\) (since we saw small vowel counts earlier).

#### Case 6: \(x = 2\)
- \(y = (2 - 14)^{10} = (-12)^{10}\).
- \((-12)^{10} = 12^{10}\) (since 10 is even).
- \(12^5 = 248,832\), so \(12^{10} = (12^5)^2 = 248,832^2\).
- \(248,832^2 = 61,917,364,224\).
- Write \(y\) in words: "sixty-one billion nine hundred seventeen million three hundred sixty-four thousand two hundred twenty-four."
- Count vowels:
  - i (1), o (1), e (1), i (2), i (3), o (2), e (2), e (3), i (4), o (3), o (4), u (1), a (1), o (5), e (4), o (6) → 16 vowels.
- \(x = 2\), but the number of vowels in \(y\) is 16, so \(2 \neq 16\). This doesn't work.

### Step 6: Conclusion
After trying multiple values of \(x\), we observe that the number of vowels in \(y = (x - 14)^{10}\) does not consistently match \(x\). The problem may not have a solution where the number of vowels in the result exactly equals \(x\), or it may require a specific \(x\) that we haven't found yet.

Given the complexity and the lack of a clear fixed point, it seems likely that the problem might not have a solution within the constraints, or the solution requires a more advanced approach (e.g., considering different number naming conventions or constraints on \(x\)).

If we must provide an answer, we note that no \(x\) tried so far satisfies the condition. Further exploration or clarification (e.g., constraints on \(x\), naming conventions) may be needed.

Response from DeepSeek R1:

Final Verdict: As expected, neither of the models could reach the solution. ❌

This question seems to have been cracked by OpenAI's o1 model, though.
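The self-referential condition can be searched mechanically. The sketch below is my own illustration and not something either model produced. It assumes the answer is spelled out in American English without "and", that "y" is not counted as a vowel, and that only x from 0 to 30 needs checking. All function names are hypothetical.

```python
ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]
SCALES = ["", "thousand", "million", "billion", "trillion"]


def words_below_1000(n: int) -> list:
    """Spell out 1..999 as a list of words (no 'and', assumed convention)."""
    parts = []
    if n >= 100:
        parts += [ONES[n // 100], "hundred"]
        n %= 100
    if n >= 20:
        unit = n % 10
        parts.append(TENS[n // 10] + ("-" + ONES[unit] if unit else ""))
    elif n > 0:
        parts.append(ONES[n])
    return parts


def number_to_words(n: int) -> str:
    """Spell out a non-negative integer below 10**15."""
    if n == 0:
        return "zero"
    groups = []
    scale = 0
    while n > 0:
        n, chunk = divmod(n, 1000)
        if chunk:
            words = words_below_1000(chunk)
            if SCALES[scale]:
                words.append(SCALES[scale])
            groups.append(" ".join(words))
        scale += 1
    return " ".join(reversed(groups))


def count_vowels(text: str) -> int:
    """Count a, e, i, o, u only; 'y' is deliberately excluded."""
    return sum(ch in "aeiou" for ch in text)


def fixed_points(candidates):
    """Values of x whose answer (x - 14)**10 is spelled with exactly x vowels."""
    return [
        x for x in candidates
        if count_vowels(number_to_words((x - 14) ** 10)) == x
    ]


print(fixed_points(range(31)))
```

Under these assumptions, x = 18 satisfies the condition: (18 - 14)^10 = 1,048,576, and "one million forty-eight thousand five hundred seventy-six" contains 18 vowels. Grok 3's transcript above tried exactly this case but miscounted the vowels. A different spelling convention could change the result.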

Summary:

As for math, both models performed quite well on the first two questions, which were easier, but both failed to solve a tricky question that required a bit of thought ❌. Based on these results, both models seem to perform similarly, so it's hard to favour one over the other.

Coding

Now, let's see how these models handle a tricky LeetCode question with an acceptance rate of just 12.8%: Find the Occurrence of First Almost Equal Substring. It was added recently, so it's unlikely that either model was trained on it.

You are given two strings s and pattern.

A string x is called almost equal to y if you can change at most one character in x to make it identical to y.

Return the smallest starting index of a substring in s that is almost equal to pattern. If no such index exists, return -1.

A substring is a contiguous non-empty sequence of characters within a string.

Example 1:
Input: s = "abcdefg", pattern = "bcdffg"
Output: 1

Explanation:
The substring s[1..6] == "bcdefg" can be converted to "bcdffg" by changing s[4] to "f".

Example 2:
Input: s = "ababbababa", pattern = "bacaba"
Output: 4

Explanation:
The substring s[4..9] == "bababa" can be converted to "bacaba" by changing s[6] to "c".

Example 3:
Input: s = "abcd", pattern = "dba"
Output: -1

Example 4:
Input: s = "dde", pattern = "d"
Output: 0

Constraints:
1 <= pattern.length < s.length <= 10^5
s and pattern consist only of lowercase English letters.

Response from Grok 3:

class Solution:
    def minStartingIndex(self, s: str, pattern: str) -> int:
        n = len(pattern)

        # Helper function to check if two strings differ by at most one character
        def differsByAtMostOne(s1: str, s2: str) -> bool:
            if len(s1) != len(s2):
                return False

            # Count differences
            diff_count = 0
            for c1, c2 in zip(s1, s2):
                if c1 != c2:
                    diff_count += 1
                if diff_count > 1:
                    return False
            return True

        # Slide window of size n (pattern length) through s
        for i in range(len(s) - n + 1):
            # Get current substring
            substring = s[i:i+n]
            # Check if it differs by at most one character
            if differsByAtMostOne(substring, pattern):
                return i

        return -1

The code does what was asked, but it hit Time Limit Exceeded on some test cases. So, while it solves the problem, it isn't an optimal solution: comparing every window of s against the pattern character by character takes O(n·m) time in the worst case.
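For comparison, here is a minimal sketch of one well-known linear-time approach built on the Z-algorithm. It is my own illustration and not the output of either model. The names `z_function` and `min_starting_index` are hypothetical, and the `#` separator relies on the stated constraint that both strings contain only lowercase letters.

```python
def z_function(t: str) -> list:
    """z[i] = length of the longest common prefix of t and t[i:] (z[0] = 0)."""
    n = len(t)
    z = [0] * n
    left = right = 0  # [left, right) is the rightmost match window found so far
    for i in range(1, n):
        if i < right:
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and t[z[i]] == t[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def min_starting_index(s: str, pattern: str) -> int:
    """Smallest i such that s[i:i+m] differs from pattern in at most one position."""
    n, m = len(s), len(pattern)
    if m > n:
        return -1

    # How many leading characters of pattern match s starting at each index.
    z_fwd = z_function(pattern + "#" + s)
    # Same idea on the reversed strings gives matching trailing characters.
    z_bwd = z_function(pattern[::-1] + "#" + s[::-1])

    for i in range(n - m + 1):
        prefix = z_fwd[m + 1 + i]
        # The window s[i:i+m] ends where reversed s starts at index n - i - m.
        suffix = z_bwd[m + 1 + (n - i - m)]
        # Matching prefix plus matching suffix may leave at most one character.
        if prefix + suffix >= m - 1:
            return i
    return -1


print(min_starting_index("abcdefg", "bcdffg"))  # 1
```

Both Z-arrays are built in O(n + m) time, and each window is then checked in constant time, which avoids the repeated character-by-character comparison of the brute-force version.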

Response from DeepSeek R1:

class Solution:
    def minStartingIndex(self, s: str, pattern: str) -> int:
        ans = self.find_almost_equal_substring(s, pattern)
        return ans

    def compute_z(self, s):
        n = len(s)
        Z = [0] * n
        Z[0] = 0  # The first character is the entire string, which is not considered here
        l, r = 0, 0
        for i in range(1, n):
            if i > r:
                l = r = i
                while r < n and s[r - l] == s[r]:
                    r += 1
                Z[i] = r - l
                r -= 1
            else:
                k = i - l
                if Z[k] < r - i + 1:
                    Z[i] = Z[k]
                else:
                    l = i
                    while r < n and s[r - l] == s[r]:
                        r += 1
                    Z[i] = r - l
                    r -= 1
        return Z

    def find_almost_equal_substring(self, s, pattern):
        m = len(pattern)
        n = len(s)
        if m == 0:
            return 0
        if n < m:
            return -1

        # Step 1: Check for exact match
        exact = s.find(pattern)
        if exact != -1:
            return exact

        # Step 2: Compute lcp and lcs arrays using Z-algorithm
        # Compute lcp (longest prefix of pattern starting at s[i])
        concat_lcp = pattern + '#' + s
        Z_lcp = self.compute_z(concat_lcp)
        lcp = [0] * n
        for i in range(n):
            lcp[i] = Z_lcp[m + 1 + i]

        # Compute lcs (longest suffix of pattern ending at s[i])
        reversed_pattern = pattern[::-1]
        reversed_s = s[::-1]
        concat_lcs = reversed_pattern + '#' + reversed_s
        Z_lcs = self.compute_z(concat_lcs)
        lcs = [0] * n
        len_rev_pat = len(reversed_pattern)
        for i in range(n):
            pos_in_rev_s = i
            pos_in_concat = len_rev_pat + 1 + pos_in_rev_s
            if pos_in_concat >= len(Z_lcs):
                lcs_val = 0
            else:
                lcs_val = Z_lcs[pos_in_concat]
            lcs[n - 1 - i] = lcs_val

        min_index = float('inf')

        # Iterate over each possible j (position in pattern)
        for j in range(m):
            required_prefix = j
            required_suffix = m - j - 1

            # Iterate possible starting indices i in s
            for i in range(n - m + 1):
                # Check prefix condition
                if lcp[i] < required_prefix:
                    continue
                # Check suffix condition
                suffix_start = i + j + 1
                if suffix_start >= n:
                    continue
                if lcs[suffix_start] < required_suffix:
                    continue
                # Check if current character is a mismatch
                if s[i + j] != pattern[j]:
                    if i < min_index:
                        min_index = i
                        # Early exit if possible
                        if min_index == 0:
                            return 0
                        # Break the inner loop as we're looking for the earliest i
                        break

        return min_index if min_index != float('inf') else -1

This was very unexpected from DeepSeek R1: the code failed on the very first test case. In terms of code quality, it is also very poorly written.

Couldn't agree more with this tweet:

Summary:

Here, Grok 3 is clearly the winner. ✅ It at least provided working code, though it wasn't the most optimal. On the other hand, DeepSeek R1 wrote code that couldn't pass the very first test case and was unnecessarily long and poorly written. ❌

Creative Writing

Write a story on "A man buys an old typewriter from a flea market, only to realize that whatever he types on it comes true, but with a dark twist.". Summarize the entire story with the twist in three paragraphs.

Response from Grok 3:

Response from DeepSeek R1:

Summary:

DeepSeek R1 is often said to be the best model for writing, and I must say that both models did a great job crafting this story and wrapping up the entire twist within three paragraphs. Still, I prefer the response from Grok 3 ✅ over the one from DeepSeek R1.

Its story simply flowed better.

Final Verdict!

Based on these observations, here’s my final verdict:

Both Grok 3 and the DeepSeek R1 model perform very similarly for Reasoning questions and Mathematics. Choosing one over the other doesn't seem to make much difference.

For Coding, Grok 3 is the clear winner compared to the DeepSeek R1 model. DeepSeek R1 doesn't come close regarding code quality or overall answers.

Both models are quite strong for Creative Writing, but I prefer Grok 3’s responses. They feel more engaging, natural, and polished.

I pretty much agree with Satoshi on the Grok 3 and DeepSeek R1 part of this comparison:

What do you think? Let me know your thoughts in the comments below! 👇🏻


Top comments (32)

Bhaskar Prajapati • Feb 21 '25

I've been using the DeepSeek model ever since it was released. The only issue I have with it is that it just always has this issue of "Server is busy, please try again later"

Otherwise, everything about DeepSeek feels good, and I don't think I need any other model for my use case.
This is a great comparison post. Thank you for sharing! 👍🏽

Shrijal Acharya Composio • Feb 21 '25 • Edited

Thanks for checking out, Bhaskar 🙌

The only issue I have with it is that it just always has this issue of "Server is busy, please try again later"

What else to expect from a free model, though. 😮‍💨

I don't think Grok 3 deserves this much hype myself. Even though it is being called the "best AI in the world right now", the metrics don't differ by much. And considering that DeepSeek R1 is a completely open-source model built as a side project of a developer, the way it performs is phenomenal.

But hey, it is what it is! 🤷‍♂️

Bhaskar Prajapati • Feb 27 '25

Never heard anyone complain on this issue that I was having, so I decided to use Ollama on Windows to run it locally.

Anmol Baranwal Composio • Feb 21 '25

Awesome Shrijal! 🔥 This looks quite detailed. Let me share it further :)

Shrijal Acharya Composio • Feb 21 '25

Thank you for checking it out, Anmol! Go ahead. 🙌

Joyce Gloria • Mar 17 '25

Great post—thanks for shedding light on this! Grok 3 vs DeepSeek both platforms are pushing the boundaries of AI, but they seem to cater to slightly different audiences and use cases. Grok 3’s strength lies in its ability to handle complex, real-time data processing and its adaptability to dynamic environments. On the other hand, DeepSeek’s focus on deep learning and predictive analytics makes it a powerhouse for industries like finance, healthcare, and marketing.

Shrijal Acharya Composio • Mar 17 '25

Completely agree with you, Joyce! :D

Benny Schuetz • Feb 28 '25

Great post. Concerning coding, I just wrote this small post a few hours ago dealing with Grok3 and shader generation:

AI-Generated Shader Experiments: A Journey

Benny Schuetz ・ Feb 27

#ai #shader #creativecoding #webgl

Shrijal Acharya Composio • Feb 28 '25

Wow, that's a good one. Thank you for sharing, @benny00100 ✌️

Areeba Nishat • May 30 '25

Interesting comparison! Grok 3 seems to be aiming for tight integration with X (formerly Twitter), while Deepseek R1 feels more research-focused, especially with its emphasis on reasoning capabilities. Performance-wise, both have their strengths — Grok with real-time data access, and Deepseek with its structured output and deeper context understanding.

By the way, if you’re creating any visuals or profile assets while sharing your benchmarks or results, this Stylish Name Generator came in handy for me — adds a nice touch to usernames or project titles.

Looking forward to more insights if you’re planning to do performance benchmarking or hands-on testing!

Shrijal Acharya Composio • May 31 '25

Thank you for checking! 🙌

Muhammad Usman • Feb 22 '25

Nice analysis, great work.
Also,
🚀 I have Just finished my very first frontend challenge for Dev Community!
dev.to/web_dev-usman/discover-your...
Give your feedback there, and support me.

Shrijal Acharya Composio • Feb 22 '25

That's a good project you built. Nice work!

Areeba Nishat • May 16 '25

Grok 3 vs. DeepSeek R1: A Deep Analysis

The AI landscape has been significantly reshaped with the introduction of two formidable models: Grok 3 by xAI and DeepSeek R1 by the Chinese startup DeepSeek. Both models have garnered attention for their advanced capabilities, but they cater to different user needs and preferences.

Areeba Nishat • May 13 '25

In early 2025, two advanced AI models—Grok 3 by Elon Musk's xAI and DeepSeek R1 by Chinese AI firm DeepSeek—emerged, each bringing unique strengths to the AI landscape. While Grok 3 focuses on high-performance computing and real-time data processing, DeepSeek R1 emphasizes cost-efficiency and accessibility. This analysis delves into their key differences and performance benchmarks.

Shrijal Acharya Composio • May 13 '25

🙌

Areeba Nishat • May 18 '25

The landscape of large language models (LLMs) continues to evolve rapidly, with emerging contenders like Grok-3 and DeepSeek R1 pushing the boundaries of open and closed-source AI. Both models represent ambitious efforts to compete with titans like OpenAI, Anthropic, and Google DeepMind. But how do Grok 3 and DeepSeek R1 truly compare?

Model Origins & Philosophy

Grok 3
Developer: xAI (Elon Musk's AI company)

Integration: Deeply tied into X (formerly Twitter) as a conversational AI assistant.

Philosophy: Positioned as a "truth-seeking AI" with fewer political constraints, Grok is designed to answer questions with wit and a bit of attitude—mirroring Musk’s brand voice.

Closed-source: Proprietary model, not openly available for download or fine-tuning.

DeepSeek R1
Developer: DeepSeek (China-based research group)

Model Size: ~67B parameters

Philosophy: Research-driven, open-weight model designed to rival GPT-3.5/4 level performance. Focuses on reasoning, code generation, and open accessibility.

Open-source: Hugely beneficial to researchers and developers who want transparency and control.

Architecture & Capabilities

| Feature | Grok 3 | DeepSeek R1 |
| --- | --- | --- |
| Parameters | Not publicly disclosed (est. ~70B–100B) | 67B |
| Architecture | Transformer-based, fine-tuned on X platform data | Dense Transformer, pre-trained on multilingual + code datasets |
| Context Length | Unknown (likely 8k–16k) | 32k tokens |
| Code Support | Basic code generation, with sarcastic tone possible | Strong code generation, GPT-4-level reasoning in benchmarks |
| Multimodal | Grok 3 (planned or partial) | Text-only in R1 (as of now) |

Benchmark Performance
Grok 3:
Not many public benchmarks available.

Anecdotal reports suggest Grok 3 is comparable to GPT-3.5, with witty conversational abilities and real-time X integration.

Strengths lie in live internet querying, contextual integration, and personality.

DeepSeek R1:
Strong on academic benchmarks, often outperforming LLaMA 2 70B and matching GPT-3.5 Turbo on:

MMLU

GSM8K

HumanEval (code)

Weaknesses may include slightly less polish in natural conversation, though better raw reasoning.

Use Cases

| Use Case | Grok 3 | DeepSeek R1 |
| --- | --- | --- |
| Casual Chat & Real-time Search | Excellent (via X integration) | Not designed for this |
| Research & Custom Fine-tuning | Closed model | Fully open weights |
| Code Generation & Reasoning | Decent, personality-driven | Excellent, GPT-4-like |
| Business/Enterprise Use | Through X AI APIs (future plans) | For teams building custom AI stacks |

Open-Source vs. Closed
DeepSeek R1 wins for transparency, flexibility, and research potential. You can fine-tune it, run it locally, or embed it in enterprise solutions.

Grok 3 is currently only accessible via the X platform and aims to drive traffic and engagement to Musk’s ecosystem.

🧩 Conclusion: Which One Should You Use?

| You're a... | Choose... | Why |
| --- | --- | --- |
| Researcher or AI builder | DeepSeek R1 | Open-source, customizable, high performance |
| Casual user on X | Grok 3 | Fun, witty, real-time news-aware assistant |
| Developer needing high reasoning/code AI | DeepSeek R1 | Outperforms many closed models in logic-heavy tasks |
| Fan of Elon Musk or X ecosystem | Grok 3 | Deep integration with social platform, distinct tone |

Final Thought:
Grok 3 is bold, personality-driven, and uniquely tied to a social platform. DeepSeek R1 is a technical powerhouse—open, research-grade, and surprisingly competitive with closed models. If you're choosing between the two, your goals—entertainment vs. engineering—will make the decision clear.