Mistral Large 3 launched in December 2025 as Mistral's flagship open-weight model. Six months later it remains the largest model Mistral has publicly released under a permissive license. This guide covers the architecture, benchmarks, pricing, and practical considerations for developers deciding whether to use it in 2026.
What Mistral Large 3 Is
Mistral Large 3 (model ID mistral-large-2512, the 2512 indicating December 2025) is a sparse Mixture-of-Experts (MoE) model with 675 billion total parameters and 41 billion active parameters per forward pass. Mistral trained it from scratch on 3,000 NVIDIA H200 GPUs.
The MoE architecture is the key efficiency decision. Instead of activating all 675B parameters for every token, the model routes each token through a subset of "expert" subnetworks. With 41B active parameters, Mistral Large 3 runs at roughly the same computational cost as a 41B dense model while accessing the capacity of a 675B one.
For self-hosting, the distinction cuts two ways: per-token inference compute scales with the 41B active parameters, but the memory footprint is set by the full 675B, since every expert must be resident in memory (roughly 1.35 TB of weights at bf16 before quantization). The payoff of MoE is speed, not a smaller deployment.
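To make the routing concrete, here is a minimal sketch of top-k gating in plain NumPy. The expert count, hidden size, and k value are illustrative assumptions; Mistral has not published Large 3's router configuration at this level of detail.

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Route one token's hidden state x through the k best-scoring experts.

    Illustrative sketch only: expert count, k, and gating details are
    assumptions, not Mistral Large 3's published configuration.
    """
    logits = x @ gate_w                            # (n_experts,) router scores
    top = np.argsort(logits)[-k:]                  # indices of the k best experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                       # softmax over the selected k
    # Only k experts execute: per-token compute scales with k,
    # while memory must still hold all len(experts) networks.
    return sum(w * experts[i](x) for i, w in zip(top, weights))

d, n_experts = 64, 16
rng = np.random.default_rng(0)
gate_w = rng.normal(size=(d, n_experts))
experts = [
    lambda x, W=rng.normal(size=(d, d)) / np.sqrt(d): np.tanh(x @ W)
    for _ in range(n_experts)
]
y = moe_forward(rng.normal(size=d), gate_w, experts)  # only 2 of 16 experts ran
```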
License: Apache 2.0
Mistral Large 3 is released under Apache 2.0. This means:
- Free commercial use
- Can be fine-tuned and redistributed
- No usage restrictions beyond standard Apache terms
- Model weights downloadable from HuggingFace: mistralai/Mistral-Large-3-675B-Instruct-2512
The license is what distinguishes Mistral Large 3 from comparable models. GPT-4o and Claude Opus 4 are API-only. Llama 4 405B carries a Meta community license that adds restrictions for services with more than 700M monthly active users. Mistral Large 3 has no such conditions.
For enterprise deployment teams that need to run models on-premises, in air-gapped environments, or with full weight access for fine-tuning, this is the practical differentiator.
Specifications at a Glance
Verified from official Mistral documentation and HuggingFace model card.
| Specification | Value |
|---|---|
| Total parameters | 675B |
| Active parameters | 41B |
| Architecture | Sparse MoE |
| Context window | 256,000 tokens |
| License | Apache 2.0 |
| Model ID (Mistral API) | mistral-large-latest or mistral-large-2512 |
| HuggingFace ID | mistralai/Mistral-Large-3-675B-Instruct-2512 |
| Release date | December 2025 |
| Training hardware | 3,000 NVIDIA H200 GPUs |
Benchmarks
Scores from ArtificialAnalysis.ai and IntuitionLabs (independent evaluators).
| Benchmark | Score |
|---|---|
| MMLU | ~85.5% |
| MMLU-Pro | 73.11% |
| MATH-500 | 93.60% |
| HumanEval | Top tier (varies by test configuration) |
The MATH-500 score of 93.60% is high relative to most non-reasoning-specialized models. This reflects Mistral's investment in mathematical reasoning capability during training.
Important caveat: Benchmark scores are useful for directional comparison, not absolute guarantees. Model performance on your specific task will differ based on instruction format, system prompt, temperature, and the nature of your domain. Always run your own evaluation on a representative sample before making infrastructure decisions.
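A minimal evaluation harness against the Mistral API might look like the sketch below. The eval_sample.jsonl file name, its prompt/expected fields, and the exact-match metric are assumptions for illustration; substitute your own data and scoring.

```python
import json
import os
from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

# "eval_sample.jsonl" with {"prompt": ..., "expected": ...} per line is a
# hypothetical layout; adapt to your own task format.
correct = total = 0
with open("eval_sample.jsonl") as f:
    for line in f:
        case = json.loads(line)
        resp = client.chat.complete(
            model="mistral-large-2512",  # pin the version for reproducible evals
            messages=[{"role": "user", "content": case["prompt"]}],
            temperature=0.0,             # reduce sampling noise while scoring
        )
        answer = resp.choices[0].message.content.strip()
        correct += int(answer == case["expected"])  # swap in your real metric
        total += 1

print(f"exact match: {correct}/{total}")
```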
API Pricing
Verified from Mistral API documentation and pricepertoken.com (May 2026).
| Token type | Price |
|---|---|
| Input | $0.50 per 1M tokens |
| Output | $1.50 per 1M tokens |
At $0.50/$1.50 per million tokens, Mistral Large 3 is positioned as a high-quality model at roughly half the cost of closed-source frontier options. The pricing sits in the same tier as mid-range API models from OpenAI and Anthropic, with the difference that the weights are also open.
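For budgeting, the arithmetic is straightforward. A small helper, with illustrative traffic numbers:

```python
# Mistral Large 3 list prices (per million tokens, May 2026)
INPUT_PER_M, OUTPUT_PER_M = 0.50, 1.50

def monthly_cost(calls_per_day, in_tokens, out_tokens, days=30):
    """Estimate monthly API spend for a fixed per-call token profile."""
    per_call = in_tokens / 1e6 * INPUT_PER_M + out_tokens / 1e6 * OUTPUT_PER_M
    return calls_per_day * days * per_call

# Example: 10,000 calls/day, 2K input + 500 output tokens per call
print(f"${monthly_cost(10_000, 2_000, 500):,.2f}/month")  # $525.00/month
```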
Where to Access It
Via Mistral API:
```bash
curl https://api.mistral.ai/v1/chat/completions \
  -H "Authorization: Bearer $MISTRAL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-large-latest",
    "messages": [{"role": "user", "content": "Explain MoE architecture in one paragraph."}]
  }'
```
Via Python SDK:
```python
from mistralai import Mistral

client = Mistral(api_key="YOUR_MISTRAL_API_KEY")

response = client.chat.complete(
    model="mistral-large-latest",
    messages=[
        {"role": "user", "content": "Write a Python function to parse JSONL files."}
    ],
)

print(response.choices[0].message.content)
```
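For interactive use, the same SDK also supports streaming. A minimal sketch; the event shape below matches the mistralai v1 SDK at the time of writing, so check the SDK docs if it has changed:

```python
# Stream tokens as they arrive instead of waiting for the full reply
stream = client.chat.stream(
    model="mistral-large-latest",
    messages=[{"role": "user", "content": "Summarize Apache 2.0 in three bullets."}],
)
for event in stream:
    delta = event.data.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```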
Via Azure AI Foundry:
Mistral Large 3 is available on Azure AI Foundry through Mistral's partnership with NVIDIA and Microsoft. This allows Azure customers to access it with Azure-native billing and compliance controls.
Self-hosted:
Download weights from HuggingFace and serve with vLLM, llama.cpp (for quantized versions), or Mistral's own inference tools. Keep in mind that all 675B parameters must fit in memory (quantization helps), so a multi-GPU node is the floor; what the 41B active count buys you is speed, with per-token compute close to that of a 41B dense model rather than a 675B one.
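Since vLLM exposes an OpenAI-compatible endpoint, a self-hosted deployment can be queried with plain HTTP. A sketch, assuming you have already started vllm serve for the model on a suitably sized node; the port and served model name here are the defaults you would configure yourself:

```python
import requests

# Assumes a local vLLM instance serving the OpenAI-compatible API on :8000
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "mistralai/Mistral-Large-3-675B-Instruct-2512",
        "messages": [{"role": "user", "content": "Hello from self-hosted Large 3"}],
        "max_tokens": 128,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```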
Using the 256K Context Window
256,000 tokens is enough for approximately 200,000 words — a long novel, several codebases, or months of conversation history in a single context. In practice, there are two considerations:
Cost: Input tokens are billed at $0.50/1M, so a 200K-token context costs $0.10 per call. For use cases that genuinely need long context, that is reasonable. For tasks that do not, context management (summarization, retrieval) remains more economical.
Quality at context extremes: Models trained with long context windows often exhibit "lost in the middle" degradation — important information buried in the middle of a very long context is weighted less than information at the beginning or end. This is a general LLM limitation, not specific to Mistral. If your use case places critical information in the middle of large documents, test explicitly at that position.
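A simple way to run that test is a "needle" probe: plant a known fact at a controlled depth inside filler text and check whether the model retrieves it. A minimal sketch, reusing the client from the SDK example above; the filler sentence and needle are arbitrary, and at roughly 180K input tokens each probe costs under a dime:

```python
def build_needle_prompt(depth: float, n_filler: int = 20_000) -> str:
    """Plant a known fact at a relative depth (0.0 = start, 1.0 = end)
    inside n_filler copies of a filler sentence. Words are a rough proxy
    for tokens here; use a real tokenizer for precise placement.
    """
    sentences = ["Nothing notable happens in this sentence."] * n_filler
    sentences.insert(int(n_filler * depth), "The vault code is 7426.")
    return " ".join(sentences) + "\n\nWhat is the vault code?"

for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    resp = client.chat.complete(
        model="mistral-large-2512",
        messages=[{"role": "user", "content": build_needle_prompt(depth)}],
    )
    print(f"depth {depth}: {resp.choices[0].message.content[:80]}")
```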
Function Calling and Tool Use
Mistral Large 3 supports native function calling in the same JSON schema format as OpenAI:
```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    }
]

response = client.chat.complete(
    model="mistral-large-latest",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
    tool_choice="auto",
)
```
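When the model decides to call the function, the response carries a tool_calls entry rather than final text; your code executes the function and sends the result back. A sketch of the round trip, where the get_weather implementation is a stub for illustration:

```python
import json

def get_weather(location: str, unit: str = "celsius") -> str:
    return f"18 degrees {unit} and clear in {location}"  # stub for illustration

msg = response.choices[0].message
if msg.tool_calls:
    call = msg.tool_calls[0]
    args = json.loads(call.function.arguments)  # model-produced JSON string
    result = get_weather(**args)
    followup = client.chat.complete(
        model="mistral-large-latest",
        messages=[
            {"role": "user", "content": "What's the weather in Tokyo?"},
            msg,  # the assistant turn containing the tool call
            {"role": "tool", "name": call.function.name,
             "content": result, "tool_call_id": call.id},
        ],
        tools=tools,
    )
    print(followup.choices[0].message.content)
```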
Tool-use quality is one of the model's strong points, consistent with its strong MMLU-Pro result.
Mistral Large 3 vs the Broader Mistral 2026 Lineup
As of May 2026, Mistral has several models in production. Understanding the lineup helps pick the right one:
| Model | Best for | Notes |
|---|---|---|
| Mistral Large 3 | Highest-quality tasks, self-hosting, fine-tuning | 675B MoE, Apache 2.0, $0.50/$1.50/1M |
| Mistral Small 4 | General use, lower cost | Dense model, merges reasoning + vision + coding |
| Mistral Medium 3.5 | Long-running agentic tasks | Powers Le Chat Work Mode and Mistral Vibe |
| Codestral | Code-specialized tasks | Dedicated coding model |
| Ministral | Edge / low-latency | Small models for on-device use |
A note on Le Chat Work Mode: Le Chat's "Work Mode" feature — which runs multi-step agentic tasks like email research, calendar management, and document production — is powered by Mistral Medium 3.5, not Mistral Large 3. The two are separate products at different capability levels. Large 3 is the open flagship; Work Mode is a product feature built on Medium 3.5's instruction-following and long-context strengths.
When to Choose Mistral Large 3
It makes sense when:
- You need open weights (Apache 2.0) for self-hosting, air-gap deployment, or fine-tuning
- You want a strong general model at a moderate API price point
- Your use case involves long documents and needs genuine 256K context support
- You are building on Azure and want Mistral native on Azure AI Foundry
Consider alternatives when:
- You need the latest reasoning capabilities (dedicated reasoning models like o3 or Gemini 2.5 Pro outperform Large 3 on multi-step reasoning tasks)
- You need multimodal (image/audio) input — Large 3 is text-only; Mistral Small 4 includes vision
- Cost is the primary constraint — Mistral Small 4 at a lower price tier handles most general tasks
Fine-Tuning
Apache 2.0 means you can fine-tune and redistribute Mistral Large 3 without restrictions. For fine-tuning at this model size, the common approaches are:
- QLoRA (quantized low-rank adaptation): 4-bit quantization cuts base-weight memory to roughly a quarter, but at 675B total parameters that is still several hundred GB, so plan for a multi-GPU server node rather than consumer hardware (see the sketch after this list)
- LoRA at fp16/bf16: all 675B base weights must be resident (about 1.35 TB), which puts this in multi-node territory even though only the small low-rank adapters are trained
- Full fine-tuning: requires a large multi-node setup; practical only with datacenter hardware
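A QLoRA setup with the standard transformers + peft + bitsandbytes stack might look like the sketch below. The target module names and LoRA hyperparameters are illustrative assumptions; inspect the model card for the actual module naming, and expect to need a multi-GPU node even at 4-bit for a model this size.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "mistralai/Mistral-Large-3-675B-Instruct-2512"

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# device_map="auto" shards the quantized weights across available GPUs;
# even at 4-bit, 675B total parameters need several hundred GB of memory.
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb, device_map="auto"
)

# Target module names are an assumption; check model.named_modules()
# for the projection names this architecture actually uses.
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # adapters only: a tiny fraction of 675B
```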
Mistral's own fine-tuning API (fine_tuning.jobs.create) is available through the Mistral API for managed fine-tuning without local GPU infrastructure.
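The managed flow is upload-then-create-job. A sketch against the Mistral API; the parameter names follow the SDK at the time of writing, and whether Large 3 is eligible for managed fine-tuning is an assumption to verify against the fine-tuning docs:

```python
# Upload training data, then create a managed fine-tuning job.
training = client.files.upload(
    file={"file_name": "train.jsonl", "content": open("train.jsonl", "rb")}
)
job = client.fine_tuning.jobs.create(
    model="mistral-large-2512",  # assumed fine-tunable via API; check the docs
    training_files=[{"file_id": training.id, "weight": 1}],
    hyperparameters={"training_steps": 100, "learning_rate": 1e-4},
)
print(job.id, job.status)
```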
Model specifications verified from Mistral official announcement (mistral.ai/news/mistral-3), HuggingFace model card (mistralai/Mistral-Large-3-675B-Instruct-2512), and Mistral API documentation. Benchmarks from ArtificialAnalysis.ai and IntuitionLabs. Pricing from Mistral API documentation and pricepertoken.com (verified May 2026). Le Chat Work Mode attribution confirmed via testingcatalog.com and mistral.ai/news/vibe-remote-agents-mistral-medium-3-5.