Jangwook Kim

Originally published at effloow.com

Mercury 2: Inception's Diffusion LLM at 1,000 Tokens/s

Most language models work like a very fast typist: one token at a time, left to right, no going back. Mercury 2 from Inception Labs works more like an editor: it starts with a rough draft and refines the whole thing in parallel until it converges on a good answer.

That architectural shift is not a minor optimization. It is the reason Mercury 2 delivers over 1,000 tokens per second — roughly 5x faster than speed-optimized competitors like Claude 4.5 Haiku and GPT-5.4 Mini at comparable quality levels. For developers building agentic systems, retrieval pipelines, or any workflow where an LLM is called dozens of times per task, that difference is not cosmetic.

Released on February 24, 2026, Mercury 2 is Inception Labs' production-ready flagship and the first reasoning-capable diffusion LLM. This guide covers what it is, why it is fast, where it fits in a real stack, and how to start using it today.

What Is a Diffusion LLM?

To understand Mercury 2's speed, you first need to understand why conventional LLMs are slow.

Autoregressive models — GPT, Claude, Llama, Gemma — generate text sequentially. Each token depends on every token before it. That causal dependency means you cannot generate token 50 until token 49 is finalized. No matter how many GPUs you throw at the problem, generation is serial by design. Speed-optimized models like Claude Haiku top out around 89–200 tokens per second under optimal conditions.

Diffusion models approach the problem differently. Image generation models like Stable Diffusion start with pure noise and iteratively refine the entire image in parallel, converging on a coherent result over a fixed number of denoising steps. Mercury applies this same concept to text.

In Mercury's architecture:

  1. The model starts with a canvas of masked tokens — placeholders for every position in the output.
  2. A Transformer-based denoiser network evaluates the full canvas and scores corrections across all positions simultaneously.
  3. Over a fixed number of refinement steps (analogous to diffusion steps in image generation), the model fills in tokens in parallel, guided by learned text distributions and conditioned on the input prompt.
  4. The result converges from a coarse draft to a finished answer without ever being forced into a left-to-right order.
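
To make that loop concrete, here is a toy sketch of the refinement process — not Inception's implementation, just the shape of the algorithm. The denoiser function is a hypothetical stand-in for the Transformer-based denoiser network described above:

# Toy sketch of masked-diffusion text generation (illustrative only).
# `denoiser(prompt, canvas)` is a hypothetical stand-in for the Transformer
# denoiser: it returns a (token, confidence) prediction for every position.
MASK = "<mask>"

def diffusion_generate(prompt, denoiser, length=32, steps=8):
    canvas = [MASK] * length                     # 1. all-masked canvas
    for _ in range(steps):                       # 3. fixed number of refinement steps
        predictions = denoiser(prompt, canvas)   # 2. score every position in parallel
        # Commit the highest-confidence masked positions this step; leave the
        # rest masked so later steps can refine them with more context.
        budget = max(1, length // steps)
        candidates = [(conf, i, tok)
                      for i, (tok, conf) in enumerate(predictions)
                      if canvas[i] == MASK]
        for conf, i, tok in sorted(candidates, reverse=True)[:budget]:
            canvas[i] = tok
    return canvas                                # 4. no left-to-right ordering imposed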

This is described in Inception's technical paper (arXiv 2506.17298) as using "masking-based corruption on discrete tokens" — a design choice that, compared with the Gaussian noise used in continuous diffusion models, yields more stable training and sharper convergence for language tasks.

The practical upshot: modern GPUs are built for massive parallelism. Autoregressive generation leaves most of that parallelism idle because of its sequential dependency chain. Diffusion generation keeps the GPU's arithmetic units busy across the whole output at once — which is why Mercury Coder runs at over 1,000 tokens per second on a standard NVIDIA H100.

Mercury 2 Performance: What the Numbers Say

Mercury 2 reaches 1,009 tokens per second on NVIDIA Blackwell GPUs with 1.7 seconds of end-to-end latency, according to Inception's official announcement. Independent benchmark firm Artificial Analysis measured Mercury 2 at 1,196 tokens/sec — higher than the official figure, likely due to different request batching conditions.

For context, compare that to speed-optimized models in its quality tier:

Model               Tokens/sec      Input ($/M)   Output ($/M)   Context
Mercury 2           ~1,000–1,196    $0.15         $0.35          128K
Claude 4.5 Haiku    ~89             $0.25         $1.25          200K
GPT-5.4 Mini        ~71             $0.15         $0.60          128K
Gemini 3 Flash      ~180            $0.10         $0.40          1M

Speed comparisons come from Artificial Analysis independent benchmarks. Pricing sourced from official Inception API docs and OpenRouter listings as of May 2026.

On quality, Mercury 2 benchmarks at:

  • AIME 2025: 91.1
  • GPQA: 73.6
  • IFBench: 71.3
  • LiveCodeBench: 67.3

These place Mercury 2 in the same quality tier as Claude 4.5 Haiku and GPT-5.4 Mini — capable enough for most production reasoning tasks, but not competing with frontier reasoning models like Claude Opus 4.7 or GPT-5 on complex analysis.

The eWeek headline describing Mercury 2 as "13x faster than Claude Haiku" comes from end-to-end latency comparisons under specific test conditions — the gap is real, though the magnitude varies by task type and load.

Where Mercury 2 Fits in a Production Stack

The speed advantage matters most in specific architectural patterns. Here are the practical cases.

Multi-Step Agentic Workflows

If you are building an agent that calls an LLM ten times per task — planning, tool selection, execution, verification, retry — then latency compounds. At 3 seconds per call, a ten-step workflow takes 30 seconds. At Mercury 2's 1.7 seconds per call, the same workflow completes in 17 seconds. That is not just a user experience improvement; it changes what architectures are feasible for interactive use cases.
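
The same arithmetic, spelled out (the 3-second per-call figure is just an illustrative baseline):

# Sequential agent latency is per-call latency times call count.
steps = 10
print(f"{steps} calls at 3.0s each: {steps * 3.0:.0f}s total")  # 30s
print(f"{steps} calls at 1.7s each: {steps * 1.7:.0f}s total")  # 17s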

Inception explicitly positions Mercury 2 for this: "In an agentic system, latency doesn't just accumulate; it multiplies." A 5x speed increase per call can make a previously-too-slow pipeline viable.

High-Throughput Data Processing

Batch extraction, document classification, structured data parsing from unstructured input — any workload where you are processing hundreds or thousands of documents benefits from higher tokens-per-second throughput. Mercury 2's lower cost ($0.15/$0.35 per million tokens on the direct API) compounds the benefit.
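
As a rough illustration, a batch pipeline against the OpenAI-compatible endpoint can fan requests out concurrently. This is a sketch, not a tuned implementation — the model id, prompt, and concurrency cap are assumptions to adjust to your account's actual rate limits:

# Sketch: concurrent batch classification through the OpenAI-compatible endpoint.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(
    api_key="your-inception-api-key",
    base_url="https://api.inceptionlabs.ai/v1",
)

async def classify(doc: str, sem: asyncio.Semaphore) -> str:
    # Cap in-flight requests so the batch stays inside whatever rate limit applies.
    async with sem:
        resp = await client.chat.completions.create(
            model="mercury-2",
            messages=[
                {"role": "system", "content": "Classify the document as invoice, contract, or other. Reply with one word."},
                {"role": "user", "content": doc},
            ],
            max_tokens=4,
        )
        return resp.choices[0].message.content.strip()

async def classify_all(docs: list[str], concurrency: int = 20) -> list[str]:
    sem = asyncio.Semaphore(concurrency)
    return await asyncio.gather(*(classify(d, sem) for d in docs))

# labels = asyncio.run(classify_all(docs))  # docs: your list of document strings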

Voice and Real-Time Interfaces

Fitting a reasoning step inside a sub-2-second round-trip leaves very little room for model latency. Mercury 2's 1.7-second end-to-end latency makes it a realistic candidate for voice agents, customer support copilots, and interactive tutoring systems where users perceive delays above 2–3 seconds as laggy.

Code Autocomplete and Inline Agents

The original Mercury Coder models (Mini and Small) were designed specifically for this — 1,109 tokens/sec and 737 tokens/sec respectively on H100s. Mercury 2 extends this to include reasoning, making it viable for inline code agents that need to plan, refactor, or debug while the developer stays in flow.

API Integration: Drop Into Your Existing Stack

Mercury 2 is OpenAI API-compatible. If you are already using the OpenAI Python SDK, swapping in Mercury 2 requires one base URL change:

from openai import OpenAI

client = OpenAI(
    api_key="your-inception-api-key",
    base_url="https://api.inceptionlabs.ai/v1"
)

response = client.chat.completions.create(
    model="mercury-2",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain the key difference between diffusion and autoregressive LLMs in three sentences."}
    ],
    max_tokens=256
)

print(response.choices[0].message.content)

No library changes, no schema migration, no new authentication patterns. Any existing code that wraps the OpenAI client works.

Mercury 2 is also available through:

  • OpenRouter (inception/mercury-2) — for multi-provider fallback routing
  • Vercel AI Gateway — for serverless workloads with automatic rate limiting
  • Puter Developer API — for browser-side or Node.js integrations without a backend

New Inception accounts receive 10 million free tokens to start.
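
Routing through OpenRouter, for example, only changes the base URL and model id relative to the direct-API snippet above — a minimal sketch, assuming your key lives in the OPENROUTER_API_KEY environment variable:

# Sketch: the same call routed through OpenRouter instead of the direct API.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["OPENROUTER_API_KEY"],
    base_url="https://openrouter.ai/api/v1",
)

response = client.chat.completions.create(
    model="inception/mercury-2",   # OpenRouter model id from the list above
    messages=[{"role": "user", "content": "Summarize this support ticket in one sentence: ..."}],
    max_tokens=64,
)
print(response.choices[0].message.content)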

Streaming Support

Mercury 2 supports streaming responses via the standard OpenAI streaming interface:

stream = client.chat.completions.create(
    model="mercury-2",
    messages=[{"role": "user", "content": "Write a function to parse JSON with error handling."}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

One nuance: because diffusion generation is parallel rather than sequential, the streaming behavior may feel different from autoregressive models — tokens can arrive in bursts rather than a steady drip. This is expected behavior, not a bug.

What to Watch: Known Limits

Mercury 2 is not a straight replacement for every LLM use case. A few specific limitations are worth understanding before you commit a workload to it.

Quality ceiling on complex reasoning. Mercury 2 benchmarks comparably to Haiku and Mini-tier models. For tasks that need deep analysis, long-horizon planning, or multi-document synthesis, frontier models still outperform. Mercury 2's sweet spot is tasks where a capable-but-not-frontier model is sufficient and speed matters.

Non-autoregressive generation semantics. Diffusion models generate tokens through parallel refinement, not a left-to-right decode. Temperature and sampling parameters still work, but their behavior differs from what you might expect from autoregressive models. If your application relies on exact token-probability steering or specific sampling strategies, test carefully.

No self-hosting option as of May 2026. Mercury 2 is cloud-only. For organizations with strict data residency requirements or that need to run inference in isolated environments, this is a constraint. Inception Labs has not announced an on-premise or self-hosted offering.

Context window is 128K, not 200K+. Claude 4.5 Haiku supports 200K tokens; Gemini 3 Flash supports 1M. If your workload regularly consumes long contexts (large codebase ingestion, long document threads), Mercury 2's 128K may require more aggressive context management or chunking strategies.

Is Mercury 2 Worth Adding to Your Stack?

The clearest yes is any system where:

  • An LLM is called more than five times per user request
  • Response latency directly affects user experience
  • The task quality requirements are "good, not frontier"
  • Cost per token matters at scale

The clearest no is:

  • You need frontier-tier reasoning on complex problems
  • You have strict data residency requirements
  • Your prompts regularly approach or exceed 128K tokens

For many production agentic systems, those criteria point to Mercury 2 as a strong fit for intermediate reasoning steps — the calls where you are routing, classifying, extracting, or generating structured outputs, rather than the final synthesis that a user will read closely.

Strengths
  • 1,000+ tokens/sec — 5–13x faster than comparable quality models
  • OpenAI API-compatible — no SDK changes required
  • Competitive pricing at $0.15/$0.35 per million tokens direct
  • Available on OpenRouter and Vercel AI Gateway for flexibility
  • Reasoning-capable — suitable for multi-step agent chains
  • 10M free tokens on new accounts


Limitations
  • Quality ceiling below frontier models (Opus, GPT-5)
  • Cloud-only — no self-hosting option as of May 2026
  • 128K context window vs. 200K–1M for some competitors
  • Streaming token behavior differs from autoregressive models
  • Newer architecture means fewer community resources and examples

FAQ

Q: Is Mercury 2 a reasoning model?

Yes. Mercury 2 is described by Inception as "the first reasoning diffusion LLM." It scores 91.1 on AIME 2025, comparable to reasoning-capable small models. It is not in the same tier as o3 or Claude Opus 4.7 for heavy reasoning tasks, but it handles multi-step inference effectively.

Q: Can I use Mercury 2 with LangChain or LlamaIndex?

Since Mercury 2 is OpenAI API-compatible, it works with any framework that accepts a custom base_url and api_key on the OpenAI client. LangChain's ChatOpenAI class, for example, accepts openai_api_base and openai_api_key overrides.
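
A minimal sketch of that override, assuming the langchain-openai package (parameter names can vary across versions, so check the one you have installed):

# Sketch: pointing LangChain's ChatOpenAI at the Inception endpoint.
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="mercury-2",
    openai_api_key="your-inception-api-key",
    openai_api_base="https://api.inceptionlabs.ai/v1",
)

print(llm.invoke("In one sentence, what is a diffusion LLM?").content)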

Q: How does diffusion LLM generation affect determinism?

Autoregressive models with temperature=0 produce deterministic output. Diffusion models have a different sampling dynamic — the parallel refinement process introduces its own stochasticity. Mercury 2 supports temperature parameters, but exact reproducibility under identical conditions may behave differently from what you are used to. Test your specific use case.

Q: Does Mercury 2 support function calling and structured outputs?

Inception's API is OpenAI-compatible, which includes support for function calling and JSON mode. Confirm support for specific features in the Inception Platform documentation before building production workflows that depend on them.
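
If JSON mode is supported on your account, a request would look like the standard OpenAI pattern below — treat the response_format flag as an assumption to verify against the Inception docs rather than a documented guarantee:

# Sketch: requesting structured JSON output via the OpenAI-compatible response_format flag.
# Whether Inception's endpoint honors response_format is an assumption — confirm before
# depending on it in production.
from openai import OpenAI

client = OpenAI(
    api_key="your-inception-api-key",
    base_url="https://api.inceptionlabs.ai/v1",
)

response = client.chat.completions.create(
    model="mercury-2",
    messages=[
        {"role": "system", "content": "Extract JSON with keys: vendor, date, total."},
        {"role": "user", "content": "Invoice from Acme Corp, dated 2026-03-01, total $1,240."},
    ],
    response_format={"type": "json_object"},
    max_tokens=128,
)
print(response.choices[0].message.content)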

Q: Where does Mercury 2 fit in a cost comparison?

At $0.15/$0.35 per million tokens (direct API), Mercury 2 undercuts Claude 4.5 Haiku and is competitive with GPT-5.4 Mini on price. Combined with its throughput advantage, cost-per-task in high-volume agentic workloads can be significantly lower.
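
A back-of-envelope comparison using the prices quoted above (the token counts per task are illustrative):

# Cost of 1,000 tasks at ~2K input / 1K output tokens each,
# using the per-million-token prices from the table earlier in the post.
tasks, in_tok, out_tok = 1_000, 2_000, 1_000

def cost(in_price, out_price):
    return tasks * (in_tok / 1e6 * in_price + out_tok / 1e6 * out_price)

print(f"Mercury 2:        ${cost(0.15, 0.35):.2f}")  # $0.65
print(f"Claude 4.5 Haiku: ${cost(0.25, 1.25):.2f}")  # $1.75
print(f"GPT-5.4 Mini:     ${cost(0.15, 0.60):.2f}")  # $0.90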

Key Takeaways

Mercury 2 is the most production-ready example of a genuinely different architecture for language generation. Diffusion-based token refinement breaks the sequential dependency that limits autoregressive models, and the practical result — over 1,000 tokens per second with reasoning capability — changes the math for multi-step agentic systems.

The architecture is not magic: quality tops out at the Haiku/Mini tier, and there are real constraints around context length and self-hosting. But for the workflows where speed and cost per token matter at scale, Mercury 2 is now a serious option that deserves a benchmark in your eval suite.

The OpenAI API compatibility makes the cost of trying it close to zero. Swap the base URL, run your eval suite, and see whether the throughput advantage justifies the tradeoffs for your specific workload.

Bottom Line

Mercury 2 is the first production-ready diffusion LLM with reasoning, and its 1,000+ tokens/sec throughput is a genuine architectural advantage for agentic systems. If your pipeline chains multiple LLM calls per task, Mercury 2 deserves a slot in your latency budget — just keep it in the intermediate-step role, not the final synthesis seat.
