DeepSeek OCR 2: Complete Guide to Running & Fine-tuning in 2026

🎯 Core Highlights (TL;DR)

  • Revolutionary Architecture: DeepSeek OCR 2 introduces DeepEncoder V2 with human-like visual reading order, achieving SOTA performance on document understanding
  • Lightweight & Powerful: Only 3B parameters but outperforms larger models on complex layouts, tables, and mixed text-structure documents
  • Easy Deployment: Run locally via vLLM, Transformers, or Unsloth with comprehensive fine-tuning support
  • Proven Results: In a Persian fine-tuning case study, Character Error Rate (CER) dropped by 57-86%, an 88.6% overall improvement
  • Open Source: Fully available on Hugging Face with detailed documentation and community support

Table of Contents

  1. What is DeepSeek OCR 2?
  2. Key Features & Architecture
  3. DeepSeek OCR 2 vs Other OCR Solutions
  4. How to Run DeepSeek OCR 2
  5. Fine-tuning Guide
  6. Performance Benchmarks
  7. Community Feedback & Real-world Usage
  8. FAQ
  9. Conclusion & Next Steps

What is DeepSeek OCR 2?

DeepSeek OCR 2 is a state-of-the-art 3B-parameter vision-language model released on January 27, 2026, by DeepSeek AI. Unlike traditional OCR systems that merely extract text, DeepSeek OCR 2 focuses on image-to-text with stronger visual reasoning, enabling comprehensive document understanding.

The Innovation: DeepEncoder V2

The breakthrough lies in DeepEncoder V2, which fundamentally changes how AI "sees" documents:

Traditional Vision LLMs:

  • Scan images in fixed grid patterns (top-left → bottom-right)
  • Process visual information sequentially without context
  • Struggle with complex layouts and multi-column documents

DeepSeek OCR 2 with DeepEncoder V2:

  • Builds global understanding first, then learns human-like reading order
  • Determines what to attend to first, next, and so on
  • Excels at following columns, linking labels to values, and reading tables coherently

💡 Key Insight
DeepEncoder V2 enables the model to 'see' an image in the same logical order as a human, dramatically improving accuracy on complex layouts.

Key Features & Architecture

Core Capabilities

| Feature | Description | Benefit |
| --- | --- | --- |
| Dynamic Resolution | (0-6)×768×768 + 1×1024×1024 | Handles various document sizes efficiently |
| Visual Tokens | (0-6)×144 + 256 tokens | Optimized memory usage |
| Human-like Reading | DeepEncoder V2 architecture | Superior layout understanding |
| Compact Size | Only 3B parameters | Fast inference, low resource requirements |
| Multi-format Support | Images, PDFs, documents | Versatile application scenarios |
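
To make the visual-token budget concrete, here is a quick back-of-the-envelope calculation following the two rows above; the crop count of 4 is an arbitrary example value, not a model default:

# Token budget for a page tiled into n local crops plus one global view,
# per the "(0-6)x144 + 256" row above. n = 4 is just an illustrative value.
num_local_crops = 4
local_tokens = num_local_crops * 144    # 768x768 crops at ~144 tokens each
global_tokens = 256                     # single 1024x1024 global view
print(local_tokens + global_tokens)     # 832 visual tokens for this page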

Supported Modes

DeepSeek OCR 2 supports multiple operation modes:

  • Document Mode: <image>\n<|grounding|>Convert the document to markdown.
  • Free OCR: <image>\nFree OCR. (without layout preservation)
  • Figure Parsing: <image>\nParse the figure.
  • General Vision: <image>\nDescribe this image in detail.
  • Recognition: <image>\nLocate <|ref|>xxxx<|/ref|> in the image.

⚠️ Important Note
For best results with structured documents, use the <|grounding|> tag to enable layout-aware processing.
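
When scripting against several of these modes, it helps to keep the prompt strings in one lookup table. The sketch below simply collects the prompts listed above; the helper name and the target placeholder for recognition mode are my own additions, not part of the model's API:

# Prompt templates for the modes listed above (convenience sketch, not an official API).
PROMPTS = {
    "document": "<image>\n<|grounding|>Convert the document to markdown.",
    "free_ocr": "<image>\nFree OCR.",
    "figure":   "<image>\nParse the figure.",
    "describe": "<image>\nDescribe this image in detail.",
    "locate":   "<image>\nLocate <|ref|>{target}<|/ref|> in the image.",
}

def build_prompt(mode: str, target: str = "") -> str:
    """Return the prompt for a mode; recognition ('locate') needs a target phrase."""
    prompt = PROMPTS[mode]
    return prompt.replace("{target}", target) if mode == "locate" else prompt

print(build_prompt("document"))
print(build_prompt("locate", target="Invoice Number"))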

DeepSeek OCR 2 vs Other OCR Solutions

Comprehensive Comparison

| Solution | Type | Strengths | Limitations | Best For |
| --- | --- | --- | --- | --- |
| DeepSeek OCR 2 | Open-source VLM | Human-like reading order, SOTA accuracy, fine-tunable | Requires GPU for optimal performance | Complex documents, research, custom applications |
| MistralOCR | Closed-source API | Extremely fast, excellent structure preservation | Not open-source, API-dependent | Production pipelines, public documents |
| PaddleOCR-VL | Open-source | Strong performance, comprehensive pipeline | Complex setup, steep learning curve | Enterprise deployments |
| Gemini Flash | Multimodal LLM | Phenomenal accuracy, semantic understanding | API costs, privacy concerns | General-purpose OCR with reasoning |
| GPT-4o/Claude | Multimodal LLM | Excellent reasoning, conversational | Expensive, may hallucinate | Interactive document analysis |

Community Insights

Based on Reddit r/LocalLLaMA discussions:

MistralOCR User Pipeline:

MistralOCR (structure extraction)
  → Qwen3-VL (semantic descriptions)
  → Devstral (markdown cleanup)
  → Kimi-K2 (summarization)
  → Qwen3 (embeddings)
  → pgvector (storage)

💡 Expert Opinion
"MistralOCR does better than SOTA multimodal LLMs like GPT-4o/Claude because it can maintain structure and include media in the output." - LocalLLaMA community member

When to Choose DeepSeek OCR 2:

  • ✅ Need full control and customization
  • ✅ Working with sensitive/private documents
  • ✅ Require fine-tuning for specific languages or domains
  • ✅ Want to run completely offline
  • ✅ Building research or academic projects

When to Consider Alternatives:

  • 🔄 Need fastest possible inference (→ MistralOCR API)
  • 🔄 Require semantic image understanding (→ Gemini/GPT-4o)
  • 🔄 Working only with public documents (→ Cloud APIs)

How to Run DeepSeek OCR 2

System Requirements

Minimum Requirements:

  • Python 3.12.9+
  • CUDA 11.8+ (NVIDIA GPU)
  • 8GB+ VRAM (for 4-bit quantization)
  • 16GB+ VRAM (for full precision)

Tested Environment:

torch==2.6.0
transformers==4.46.3
tokenizers==0.20.3
flash-attn==2.7.3
einops, addict, easydict

Method 1: vLLM (Recommended for Production)

Advantages: Fastest inference, batch processing, production-ready

Installation

uv venv
source .venv/bin/activate
# Install vLLM nightly build (until v0.11.1 release)
uv pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly

Usage Code

from vllm import LLM, SamplingParams
from vllm.model_executor.models.deepseek_ocr import NGramPerReqLogitsProcessor
from PIL import Image

# Create model instance
llm = LLM(
    model="unsloth/DeepSeek-OCR-2",
    enable_prefix_caching=False,
    mm_processor_cache_gb=0,
    logits_processors=[NGramPerReqLogitsProcessor]
)

# Prepare batched input
image_1 = Image.open("document1.png").convert("RGB")
image_2 = Image.open("document2.png").convert("RGB")
prompt = "<image>\n<|grounding|>Convert the document to markdown."

model_input = [
    {"prompt": prompt, "multi_modal_data": {"image": image_1}},
    {"prompt": prompt, "multi_modal_data": {"image": image_2}}
]

sampling_param = SamplingParams(
    temperature=0.0,
    max_tokens=8192,
    extra_args=dict(
        ngram_size=30,
        window_size=90,
        whitelist_token_ids={128821, 128822},  # <td>, </td>
    ),
    skip_special_tokens=False,
)

# Generate output
model_outputs = llm.generate(model_input, sampling_param)

for output in model_outputs:
    print(output.outputs[0].text)

💡 Pro Tip
The NGramPerReqLogitsProcessor prevents repetition issues (similar to Whisper's failure mode) and improves output quality.

Method 2: Hugging Face Transformers (Most Flexible)

Advantages: Full control, easy debugging, research-friendly

Installation

pip install torch==2.6.0 transformers==4.46.3 tokenizers==0.20.3
pip install einops addict easydict
pip install flash-attn==2.7.3 --no-build-isolation

Usage Code

from transformers import AutoModel, AutoTokenizer
import torch
import os

os.environ["CUDA_VISIBLE_DEVICES"] = '0'
model_name = 'deepseek-ai/DeepSeek-OCR-2'

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name, 
    _attn_implementation='flash_attention_2', 
    trust_remote_code=True, 
    use_safetensors=True
)
model = model.eval().cuda().to(torch.bfloat16)

# Run inference
prompt = "<image>\n<|grounding|>Convert the document to markdown."
image_file = 'your_image.jpg'
output_path = 'output_directory'

res = model.infer(
    tokenizer, 
    prompt=prompt, 
    image_file=image_file, 
    output_path=output_path, 
    base_size=1024, 
    image_size=768, 
    crop_mode=True, 
    save_results=True
)

Method 3: Unsloth (Best for Fine-tuning)

Advantages: 1.4x faster training, 40% less VRAM, 5x longer context

Installation

pip install --upgrade unsloth
# Force update if already installed
pip install --upgrade --force-reinstall --no-deps --no-cache-dir unsloth unsloth_zoo

Usage Code

from unsloth import FastVisionModel
import torch
from transformers import AutoModel
import os

os.environ["UNSLOTH_WARN_UNINITIALIZED"] = '0'

from huggingface_hub import snapshot_download
snapshot_download("unsloth/DeepSeek-OCR-2", local_dir="deepseek_ocr")

model, tokenizer = FastVisionModel.from_pretrained(
    "./deepseek_ocr",
    load_in_4bit=False,  # Set True for 4-bit quantization
    auto_model=AutoModel,
    trust_remote_code=True,
    unsloth_force_compile=True,
    use_gradient_checkpointing="unsloth",
)

prompt = "<image>\nFree OCR."
image_file = 'your_image.jpg'
output_path = 'output_directory'

res = model.infer(
    tokenizer, 
    prompt=prompt, 
    image_file=image_file, 
    output_path=output_path, 
    base_size=1024, 
    image_size=640, 
    crop_mode=True, 
    save_results=True
)

⚠️ ROCm/Vulkan Support
As of January 2026, DeepSeek OCR 2 primarily supports NVIDIA GPUs. AMD ROCm and Vulkan support is under community development.

Fine-tuning Guide

Why Fine-tune DeepSeek OCR 2?

Fine-tuning enables:

  • Language Adaptation: Support for non-English languages (e.g., Persian, Chinese, Arabic)
  • Domain Specialization: Medical records, legal documents, handwritten notes
  • Format Optimization: Custom output formats, specific markdown styles
  • Accuracy Improvement: 57-86% reduction in Character Error Rate (CER)

Performance Improvements

Persian Language Fine-tuning Results:

| Metric | Before Fine-tuning | After Fine-tuning | Improvement |
| --- | --- | --- | --- |
| OCR 1 CER | 1.4866 | 0.6409 | -57% |
| OCR 2 CER | 4.1863 | 0.6018 | -86% |
| Overall | - | - | 88.6% improvement |
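
For reference, Character Error Rate is the character-level edit distance between the model output and the ground truth, divided by the ground-truth length (which is why it can exceed 1.0, as in the "before" column). Below is a minimal, dependency-free helper you could use to reproduce such numbers; it is my own sketch, not part of the DeepSeek tooling:

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate = character-level edit distance / len(reference)."""
    # Standard dynamic-programming Levenshtein distance over characters.
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, start=1):
        curr = [i]
        for j, h in enumerate(hypothesis, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1] / max(len(reference), 1)

print(cer("DeepSeek OCR 2", "DeepSeek OCR2"))  # small error -> small CER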

Step-by-Step Fine-tuning Process

1. Prepare Your Dataset

# Dataset format example
dataset = [
    {
        "image": "path/to/image1.jpg",
        "text": "Expected OCR output...",
        "prompt": "<image>\n<|grounding|>Convert the document to markdown."
    },
    # More examples...
]
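
Unsloth's vision fine-tuning examples consume chat-style records rather than flat dicts like the one above, so a conversion step is usually needed. The message layout below follows the pattern used in Unsloth's vision notebooks, but treat it as a hedged sketch and check it against the current notebook before training:

from PIL import Image

def to_conversation(sample: dict) -> dict:
    """Convert one flat record from the dataset above into a chat-style example."""
    image = Image.open(sample["image"]).convert("RGB")
    return {
        "messages": [
            {"role": "user", "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": sample["prompt"]},
            ]},
            {"role": "assistant", "content": [
                {"type": "text", "text": sample["text"]},
            ]},
        ]
    }

train_dataset = [to_conversation(sample) for sample in dataset]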

2. Use Unsloth Free Colab Notebook

Access the official notebook: Unsloth DeepSeek OCR Fine-tuning

3. Configure Training Parameters

from unsloth import FastVisionModel
from trl import SFTTrainer
from transformers import TrainingArguments

# Load model for training
model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/DeepSeek-OCR-2",
    load_in_4bit=True,
    use_gradient_checkpointing="unsloth",
)

# Configure LoRA
model = FastVisionModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Training arguments
training_args = TrainingArguments(
    output_dir="./deepseek_ocr_finetuned",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch",
)

4. Train and Evaluate

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    args=training_args,
)

trainer.train()
model.save_pretrained("./final_model")
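
To sanity-check the result, you can reload the directory saved above with the same loader used in Method 3 and run inference. This is a sketch under the assumption that the adapters written by save_pretrained load back through FastVisionModel.from_pretrained; verify against the Unsloth docs for your version:

from unsloth import FastVisionModel
from transformers import AutoModel

# Reload the fine-tuned weights saved above.
model, tokenizer = FastVisionModel.from_pretrained(
    "./final_model",
    load_in_4bit=True,
    auto_model=AutoModel,
    trust_remote_code=True,
)
FastVisionModel.for_inference(model)  # switch from training to inference mode

res = model.infer(
    tokenizer,
    prompt="<image>\n<|grounding|>Convert the document to markdown.",
    image_file="your_image.jpg",
    output_path="output_directory",
    base_size=1024,
    image_size=768,
    crop_mode=True,
    save_results=True,
)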

✅ Best Practice
Start with 100-500 high-quality examples. More data isn't always betterβ€”focus on diversity and accuracy.

Fine-tuning Benefits Summary

| Aspect | Benefit |
| --- | --- |
| Training Speed | 1.4x faster than standard methods |
| Memory Usage | 40% less VRAM required |
| Context Length | 5x longer sequences supported |
| Accuracy | No degradation vs full fine-tuning |
| Cost | Free Colab notebook available |

Performance Benchmarks

OmniDocBench v1.5 Results

Table 1: Comprehensive Document Reading Evaluation

| Model | Visual Tokens | Reading Order | Overall Score | Complex Layouts | Tables | Math |
| --- | --- | --- | --- | --- | --- | --- |
| DeepSeek OCR 2 | 256-1120 | ✅ Human-like | SOTA | Excellent | Excellent | Excellent |
| DeepSeek OCR 1 | 256-1120 | ❌ Grid-based | High | Good | Good | Good |
| PaddleOCR-VL | Variable | ❌ Grid-based | High | Good | Very Good | Good |
| GPT-4o | High | ❌ Grid-based | High | Good | Good | Very Good |
| Claude 3.5 | High | ❌ Grid-based | High | Very Good | Good | Good |

Real-world Performance Insights

Community Testing Results:

  1. Document Skew Handling:

    • ✅ Handles 90°/180°/270° rotations reliably
    • ⚠️ Minor tilts/skews may reduce accuracy (preprocessing recommended; see the deskew sketch after this list)
  2. Repetition Issues:

    • Similar to Whisper, may repeat text in failure modes
    • Mitigated by NGramPerReqLogitsProcessor in vLLM
  3. Speed Comparison:

    • MistralOCR API: Fastest (cloud-based)
    • DeepSeek OCR 2 (vLLM): Fast (local, batched)
    • DeepSeek OCR 2 (Transformers): Moderate (local, flexible)
    • Local alternatives: Several orders of magnitude slower
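
The deskew sketch referenced in the skew-handling item above: a common OpenCV recipe that estimates the dominant text angle and rotates the page upright before OCR. The Otsu threshold and the minAreaRect angle normalization are illustrative, and the angle sign convention differs between OpenCV versions, so validate on a few of your own scans:

import cv2
import numpy as np

def deskew(image_path: str, output_path: str) -> float:
    """Estimate the dominant text angle and rotate the page upright before OCR."""
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Foreground = text pixels after inverted Otsu binarization.
    thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
    coords = np.column_stack(np.where(thresh > 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    # Normalize to a small correction in [-45, 45] degrees
    # (sign convention varies across OpenCV versions).
    if angle > 45:
        angle -= 90
    elif angle < -45:
        angle += 90
    h, w = img.shape[:2]
    matrix = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    rotated = cv2.warpAffine(img, matrix, (w, h), flags=cv2.INTER_CUBIC,
                             borderMode=cv2.BORDER_REPLICATE)
    cv2.imwrite(output_path, rotated)
    return angle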

💡 Expert Insight
"For pure text OCR, most recent models are nearly flawless. DeepSeek OCR 2 excels at complex math formatting and not hallucinating content." - LocalLLaMA community

Community Feedback & Real-world Usage

Positive Experiences

From r/LocalLLaMA users:

"My experience with [DeepSeek OCR 2] has been phenomenal. Though it's amazing, I also noticed that there is a failure mode which causes the model to repeat itself (like Whisper), not sure of the cause but something to take note of." - u/Pvt_Twinkietoes

"It is truly an amazing model and very grateful they open sourced it." - Community member

Production Pipelines

Advanced OCR Pipeline Example:

Input Document
    ↓
DeepSeek OCR 2 (structure + text extraction)
    ↓
Qwen3-VL (semantic figure descriptions)
    ↓
Devstral (markdown standardization)
    ↓
Kimi-K2 (summarization)
    ↓
Qwen3 (embeddings generation)
    ↓
pgvector (vector storage)

Use Case Scenarios

| Use Case | Recommended Setup | Why |
| --- | --- | --- |
| Research Papers | DeepSeek OCR 2 + Local | Privacy, math accuracy, custom formatting |
| Business Documents | MistralOCR API | Speed, reliability, structure preservation |
| Medical Records | DeepSeek OCR 2 Fine-tuned | Privacy, domain adaptation, compliance |
| Handwritten Notes | Gemini Flash / GPT-4o | Superior handwriting recognition |
| Multilingual Docs | DeepSeek OCR 2 Fine-tuned | Language adaptation, offline capability |
| High-volume Processing | vLLM + DeepSeek OCR 2 | Batch processing, cost-effective |

🤔 FAQ

Q: How does DeepSeek OCR 2 compare to GPT-4o for OCR tasks?

A: DeepSeek OCR 2 excels at structure preservation and doesn't hallucinate, while GPT-4o is better for semantic understanding and handwritten text. For pure document OCR with layout preservation, DeepSeek OCR 2 is often superior and runs locally. For interactive analysis requiring reasoning, GPT-4o is better.

Q: Can I run DeepSeek OCR 2 on AMD GPUs or Apple Silicon?

A: As of January 2026, official support is for NVIDIA GPUs with CUDA. ROCm (AMD) and Vulkan support are under community development. For Apple Silicon, you may need to use CPU inference (significantly slower) or wait for Metal backend support.

Q: What's the difference between "Free OCR" and "grounding" modes?

A:

  • Free OCR (<image>\nFree OCR.): Extracts text without preserving layout
  • Grounding mode (<image>\n<|grounding|>Convert the document to markdown.): Preserves document structure, tables, and formatting

Use grounding mode for documents where layout matters.

Q: How much VRAM do I need?

A:

  • 4-bit quantization: 8GB VRAM (RTX 3070 or better)
  • Full precision (bfloat16): 16GB VRAM (RTX 4080 or better)
  • Fine-tuning: 24GB+ VRAM recommended (or use Unsloth optimizations)

Q: Is DeepSeek OCR 2 better than PaddleOCR?

A: According to benchmarks, DeepSeek OCR 2 achieves higher accuracy on complex layouts. However, PaddleOCR has a more mature ecosystem and production pipeline. DeepSeek OCR 2 is easier to get started with, but PaddleOCR may be better for large-scale deployments with existing infrastructure.

Q: Can I use DeepSeek OCR 2 for commercial projects?

A: Yes, DeepSeek OCR 2 is open-source. Check the license on the Hugging Face repository for specific terms.

Q: How do I handle PDF files?

A: DeepSeek OCR 2 processes images. For PDFs:

  1. Convert PDF pages to images (using pdf2image or similar)
  2. Process each image with DeepSeek OCR 2
  3. Combine results

See the GitHub repository for PDF processing utilities.
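
Below is a minimal sketch of that three-step flow using the pdf2image package (which requires poppler), with the model and tokenizer already loaded as in Method 2; the file names and DPI are placeholders:

from pdf2image import convert_from_path  # requires poppler installed

# 1. Convert PDF pages to images.
pages = convert_from_path("document.pdf", dpi=200)

# 2. Run DeepSeek OCR 2 on each page (model/tokenizer loaded as in Method 2).
prompt = "<image>\n<|grounding|>Convert the document to markdown."
results = []
for i, page in enumerate(pages):
    image_file = f"page_{i}.png"
    page.save(image_file)
    res = model.infer(tokenizer, prompt=prompt, image_file=image_file,
                      output_path="output_directory", base_size=1024,
                      image_size=768, crop_mode=True, save_results=True)
    results.append(res)

# 3. Combine per-page results into one markdown document.
full_markdown = "\n\n---\n\n".join(str(r) for r in results)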

Q: What's the best way to improve accuracy for my specific documents?

A:

  1. Preprocessing: Ensure images are properly oriented and high-resolution
  2. Prompt engineering: Use appropriate prompts (<|grounding|> for structured docs)
  3. Fine-tuning: Create 100-500 examples of your document type and fine-tune
  4. Post-processing: Use LLMs to clean up minor errors

Conclusion & Next Steps

Key Takeaways

✅ DeepSeek OCR 2 represents a significant leap in document understanding with its human-like visual reading order via DeepEncoder V2

✅ Multiple deployment options (vLLM, Transformers, Unsloth) make it accessible for various use cases

✅ Fine-tuning capabilities enable 57-86% reductions in Character Error Rate for specialized domains

✅ Open-source nature provides full control, privacy, and customization

✅ Competitive performance with SOTA results on complex layouts, tables, and mathematical content

Recommended Action Plan

For Researchers & Developers:

  1. ✅ Start with the Hugging Face Transformers implementation for flexibility
  2. ✅ Test on your specific document types
  3. ✅ If accuracy is insufficient, prepare a fine-tuning dataset
  4. ✅ Use Unsloth's free Colab notebook for efficient fine-tuning
  5. ✅ Deploy with vLLM for production workloads

For Production Deployments:

  1. ✅ Benchmark DeepSeek OCR 2 vs MistralOCR for your use case
  2. ✅ Consider privacy requirements (local vs API)
  3. ✅ Set up vLLM for batch processing
  4. ✅ Implement preprocessing (rotation detection, image enhancement)
  5. ✅ Build post-processing pipeline for consistency

For Privacy-sensitive Applications:

  1. ✅ Deploy DeepSeek OCR 2 locally with Transformers or vLLM
  2. ✅ Fine-tune on your specific document types
  3. ✅ Implement secure document handling workflows
  4. ✅ Consider air-gapped deployment for maximum security

Resources

Final Thoughts

DeepSeek OCR 2 democratizes advanced document understanding technology, making it accessible to researchers, developers, and organizations of all sizes. Its combination of cutting-edge architecture, practical performance, and open-source availability positions it as a top choice for 2026 OCR projects.

Whether you're processing research papers, digitizing historical documents, or building production OCR pipelines, DeepSeek OCR 2 offers the flexibility and performance needed for modern document understanding tasks.

💡 Start Today
The fastest way to get started is using the Hugging Face Transformers implementation. Install the dependencies, download the model, and run your first OCR in under 10 minutes.


Last Updated: January 2026

Model Version: DeepSeek-OCR-2 (3B parameters)

License: Check Hugging Face repository for details
