DeepSeek OCR 2: Complete Guide to Running & Fine-tuning in 2026

🎯 Core Highlights (TL;DR)

  • Revolutionary Architecture: DeepSeek OCR 2 introduces DeepEncoder V2 with human-like visual reading order, achieving SOTA performance on document understanding
  • Lightweight & Powerful: Only 3B parameters but outperforms larger models on complex layouts, tables, and mixed text-structure documents
  • Easy Deployment: Run locally via vLLM, Transformers, or Unsloth with comprehensive fine-tuning support
  • Proven Results: In a Persian fine-tuning case study, Character Error Rate (CER) dropped by 57-86%, an 88.6% overall improvement
  • Open Source: Fully available on Hugging Face with detailed documentation and community support

Table of Contents

  1. What is DeepSeek OCR 2?
  2. Key Features & Architecture
  3. DeepSeek OCR 2 vs Other OCR Solutions
  4. How to Run DeepSeek OCR 2
  5. Fine-tuning Guide
  6. Performance Benchmarks
  7. Community Feedback & Real-world Usage
  8. FAQ
  9. Conclusion & Next Steps

What is DeepSeek OCR 2?

DeepSeek OCR 2 is a state-of-the-art 3B-parameter vision-language model released on January 27, 2026, by DeepSeek AI. Unlike traditional OCR systems that merely extract text, DeepSeek OCR 2 focuses on image-to-text with stronger visual reasoning, enabling comprehensive document understanding.

The Innovation: DeepEncoder V2

The breakthrough lies in DeepEncoder V2, which fundamentally changes how AI "sees" documents:

Traditional Vision LLMs:

  • Scan images in fixed grid patterns (top-left → bottom-right)
  • Process visual information sequentially without context
  • Struggle with complex layouts and multi-column documents

DeepSeek OCR 2 with DeepEncoder V2:

  • Builds global understanding first, then learns human-like reading order
  • Determines what to attend to first, next, and so on
  • Excels at following columns, linking labels to values, and reading tables coherently

💡 Key Insight
DeepEncoder V2 enables the model to 'see' an image in the same logical order as a human, dramatically improving accuracy on complex layouts.

Key Features & Architecture

Core Capabilities

| Feature | Description | Benefit |
| --- | --- | --- |
| Dynamic Resolution | (0-6)×768×768 + 1×1024×1024 | Handles various document sizes efficiently |
| Visual Tokens | (0-6)×144 + 256 tokens | Optimized memory usage |
| Human-like Reading | DeepEncoder V2 architecture | Superior layout understanding |
| Compact Size | Only 3B parameters | Fast inference, low resource requirements |
| Multi-format Support | Images, PDFs, documents | Versatile application scenarios |
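
To make the visual-token budget concrete, here is a quick back-of-the-envelope calculation following the two rows above; the crop count of 4 is an arbitrary example value, not a model default:

# Token budget for a page tiled into n local crops plus one global view,
# per the "(0-6)x144 + 256" row above. n = 4 is just an illustrative value.
num_local_crops = 4
local_tokens = num_local_crops * 144    # 768x768 crops at ~144 tokens each
global_tokens = 256                     # single 1024x1024 global view
print(local_tokens + global_tokens)     # 832 visual tokens for this page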

Supported Modes

DeepSeek OCR 2 supports multiple operation modes:

  • Document Mode: <image>\n<|grounding|>Convert the document to markdown.
  • Free OCR: <image>\nFree OCR. (without layout preservation)
  • Figure Parsing: <image>\nParse the figure.
  • General Vision: <image>\nDescribe this image in detail.
  • Recognition: <image>\nLocate <|ref|>xxxx<|/ref|> in the image.

⚠️ Important Note
For best results with structured documents, use the <|grounding|> tag to enable layout-aware processing.
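
When scripting against several of these modes, it helps to keep the prompt strings in one lookup table. The sketch below simply collects the prompts listed above; the helper name and the target placeholder for recognition mode are my own additions, not part of the model's API:

# Prompt templates for the modes listed above (convenience sketch, not an official API).
PROMPTS = {
    "document": "<image>\n<|grounding|>Convert the document to markdown.",
    "free_ocr": "<image>\nFree OCR.",
    "figure":   "<image>\nParse the figure.",
    "describe": "<image>\nDescribe this image in detail.",
    "locate":   "<image>\nLocate <|ref|>{target}<|/ref|> in the image.",
}

def build_prompt(mode: str, target: str = "") -> str:
    """Return the prompt for a mode; recognition ('locate') needs a target phrase."""
    prompt = PROMPTS[mode]
    return prompt.replace("{target}", target) if mode == "locate" else prompt

print(build_prompt("document"))
print(build_prompt("locate", target="Invoice Number"))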

DeepSeek OCR 2 vs Other OCR Solutions

Comprehensive Comparison

| Solution | Type | Strengths | Limitations | Best For |
| --- | --- | --- | --- | --- |
| DeepSeek OCR 2 | Open-source VLM | Human-like reading order, SOTA accuracy, fine-tunable | Requires GPU for optimal performance | Complex documents, research, custom applications |
| MistralOCR | Closed-source API | Extremely fast, excellent structure preservation | Not open-source, API-dependent | Production pipelines, public documents |
| PaddleOCR-VL | Open-source | Strong performance, comprehensive pipeline | Complex setup, steep learning curve | Enterprise deployments |
| Gemini Flash | Multimodal LLM | Phenomenal accuracy, semantic understanding | API costs, privacy concerns | General-purpose OCR with reasoning |
| GPT-4o/Claude | Multimodal LLM | Excellent reasoning, conversational | Expensive, may hallucinate | Interactive document analysis |

Community Insights

Based on Reddit r/LocalLLaMA discussions:

MistralOCR User Pipeline:

MistralOCR (structure extraction)
  → Qwen3-VL (semantic descriptions)
  → Devstral (markdown cleanup)
  → Kimi-K2 (summarization)
  → Qwen3 (embeddings)
  → pgvector (storage)

💡 Expert Opinion
"MistralOCR does better than SOTA multimodal LLMs like GPT-4o/Claude because it can maintain structure and include media in the output." - LocalLLaMA community member

When to Choose DeepSeek OCR 2:

  • ✅ Need full control and customization
  • ✅ Working with sensitive/private documents
  • ✅ Require fine-tuning for specific languages or domains
  • ✅ Want to run completely offline
  • ✅ Building research or academic projects

When to Consider Alternatives:

  • 🔄 Need fastest possible inference (→ MistralOCR API)
  • 🔄 Require semantic image understanding (→ Gemini/GPT-4o)
  • 🔄 Working only with public documents (→ Cloud APIs)

How to Run DeepSeek OCR 2

System Requirements

Minimum Requirements:

  • Python 3.12.9+
  • CUDA 11.8+ (NVIDIA GPU)
  • 8GB+ VRAM (for 4-bit quantization)
  • 16GB+ VRAM (for full precision)

Tested Environment:

torch==2.6.0
transformers==4.46.3
tokenizers==0.20.3
flash-attn==2.7.3
einops, addict, easydict

Method 1: vLLM (Recommended for Production)

Advantages: Fastest inference, batch processing, production-ready

Installation

uv venv
source .venv/bin/activate
# Install vLLM nightly build (until v0.11.1 release)
uv pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly

Usage Code

from vllm import LLM, SamplingParams
from vllm.model_executor.models.deepseek_ocr import NGramPerReqLogitsProcessor
from PIL import Image

# Create model instance
llm = LLM(
    model="unsloth/DeepSeek-OCR-2",
    enable_prefix_caching=False,
    mm_processor_cache_gb=0,
    logits_processors=[NGramPerReqLogitsProcessor]
)

# Prepare batched input
image_1 = Image.open("document1.png").convert("RGB")
image_2 = Image.open("document2.png").convert("RGB")
prompt = "<image>\n<|grounding|>Convert the document to markdown."

model_input = [
    {"prompt": prompt, "multi_modal_data": {"image": image_1}},
    {"prompt": prompt, "multi_modal_data": {"image": image_2}}
]

sampling_param = SamplingParams(
    temperature=0.0,
    max_tokens=8192,
    extra_args=dict(
        ngram_size=30,
        window_size=90,
        whitelist_token_ids={128821, 128822},  # <td>, </td>
    ),
    skip_special_tokens=False,
)

# Generate output
model_outputs = llm.generate(model_input, sampling_param)

for output in model_outputs:
    print(output.outputs[0].text)

💡 Pro Tip
The NGramPerReqLogitsProcessor prevents repetition issues (similar to Whisper's failure mode) and improves output quality.

Method 2: Hugging Face Transformers (Most Flexible)

Advantages: Full control, easy debugging, research-friendly

Installation

pip install torch==2.6.0 transformers==4.46.3 tokenizers==0.20.3
pip install einops addict easydict
pip install flash-attn==2.7.3 --no-build-isolation

Usage Code

from transformers import AutoModel, AutoTokenizer
import torch
import os

os.environ["CUDA_VISIBLE_DEVICES"] = '0'
model_name = 'deepseek-ai/DeepSeek-OCR-2'

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name, 
    _attn_implementation='flash_attention_2', 
    trust_remote_code=True, 
    use_safetensors=True
)
model = model.eval().cuda().to(torch.bfloat16)

# Run inference
prompt = "<image>\n<|grounding|>Convert the document to markdown."
image_file = 'your_image.jpg'
output_path = 'output_directory'

res = model.infer(
    tokenizer, 
    prompt=prompt, 
    image_file=image_file, 
    output_path=output_path, 
    base_size=1024, 
    image_size=768, 
    crop_mode=True, 
    save_results=True
)

Method 3: Unsloth (Best for Fine-tuning)

Advantages: 1.4x faster training, 40% less VRAM, 5x longer context

Installation

pip install --upgrade unsloth
# Force update if already installed
pip install --upgrade --force-reinstall --no-deps --no-cache-dir unsloth unsloth_zoo

Usage Code

from unsloth import FastVisionModel
import torch
from transformers import AutoModel
import os

os.environ["UNSLOTH_WARN_UNINITIALIZED"] = '0'

from huggingface_hub import snapshot_download
snapshot_download("unsloth/DeepSeek-OCR-2", local_dir="deepseek_ocr")

model, tokenizer = FastVisionModel.from_pretrained(
    "./deepseek_ocr",
    load_in_4bit=False,  # Set True for 4-bit quantization
    auto_model=AutoModel,
    trust_remote_code=True,
    unsloth_force_compile=True,
    use_gradient_checkpointing="unsloth",
)

prompt = "<image>\nFree OCR."
image_file = 'your_image.jpg'
output_path = 'output_directory'

res = model.infer(
    tokenizer, 
    prompt=prompt, 
    image_file=image_file, 
    output_path=output_path, 
    base_size=1024, 
    image_size=640, 
    crop_mode=True, 
    save_results=True
)

⚠️ ROCm/Vulkan Support
As of January 2026, DeepSeek OCR 2 primarily supports NVIDIA GPUs. AMD ROCm and Vulkan support is under community development.

Fine-tuning Guide

Why Fine-tune DeepSeek OCR 2?

Fine-tuning enables:

  • Language Adaptation: Support for non-English languages (e.g., Persian, Chinese, Arabic)
  • Domain Specialization: Medical records, legal documents, handwritten notes
  • Format Optimization: Custom output formats, specific markdown styles
  • Accuracy Improvement: 57-86% reduction in Character Error Rate (CER)

Performance Improvements

Persian Language Fine-tuning Results:

| Metric | Before Fine-tuning | After Fine-tuning | Improvement |
| --- | --- | --- | --- |
| OCR 1 CER | 1.4866 | 0.6409 | -57% |
| OCR 2 CER | 4.1863 | 0.6018 | -86% |
| Overall | - | - | 88.6% improvement |
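
For reference, Character Error Rate is the character-level edit distance between the model output and the ground truth, divided by the ground-truth length (which is why it can exceed 1.0, as in the "before" column). Below is a minimal, dependency-free helper you could use to reproduce such numbers; it is my own sketch, not part of the DeepSeek tooling:

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate = character-level edit distance / len(reference)."""
    # Standard dynamic-programming Levenshtein distance over characters.
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, start=1):
        curr = [i]
        for j, h in enumerate(hypothesis, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1] / max(len(reference), 1)

print(cer("DeepSeek OCR 2", "DeepSeek OCR2"))  # small error -> small CER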

Step-by-Step Fine-tuning Process

1. Prepare Your Dataset

# Dataset format example
dataset = [
    {
        "image": "path/to/image1.jpg",
        "text": "Expected OCR output...",
        "prompt": "<image>\n<|grounding|>Convert the document to markdown."
    },
    # More examples...
]
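
Unsloth's vision fine-tuning examples consume chat-style records rather than flat dicts like the one above, so a conversion step is usually needed. The message layout below follows the pattern used in Unsloth's vision notebooks, but treat it as a hedged sketch and check it against the current notebook before training:

from PIL import Image

def to_conversation(sample: dict) -> dict:
    """Convert one flat record from the dataset above into a chat-style example."""
    image = Image.open(sample["image"]).convert("RGB")
    return {
        "messages": [
            {"role": "user", "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": sample["prompt"]},
            ]},
            {"role": "assistant", "content": [
                {"type": "text", "text": sample["text"]},
            ]},
        ]
    }

train_dataset = [to_conversation(sample) for sample in dataset]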

2. Use Unsloth Free Colab Notebook

Access the official notebook: Unsloth DeepSeek OCR Fine-tuning

3. Configure Training Parameters

from unsloth import FastVisionModel
from trl import SFTTrainer
from transformers import TrainingArguments

# Load model for training
model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/DeepSeek-OCR-2",
    load_in_4bit=True,
    use_gradient_checkpointing="unsloth",
)

# Configure LoRA
model = FastVisionModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Training arguments
training_args = TrainingArguments(
    output_dir="./deepseek_ocr_finetuned",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch",
)

4. Train and Evaluate

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    args=training_args,
)

trainer.train()
model.save_pretrained("./final_model")
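
To sanity-check the result, you can reload the directory saved above with the same loader used in Method 3 and run inference. This is a sketch under the assumption that the adapters written by save_pretrained load back through FastVisionModel.from_pretrained; verify against the Unsloth docs for your version:

from unsloth import FastVisionModel
from transformers import AutoModel

# Reload the fine-tuned weights saved above.
model, tokenizer = FastVisionModel.from_pretrained(
    "./final_model",
    load_in_4bit=True,
    auto_model=AutoModel,
    trust_remote_code=True,
)
FastVisionModel.for_inference(model)  # switch from training to inference mode

res = model.infer(
    tokenizer,
    prompt="<image>\n<|grounding|>Convert the document to markdown.",
    image_file="your_image.jpg",
    output_path="output_directory",
    base_size=1024,
    image_size=768,
    crop_mode=True,
    save_results=True,
)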

✅ Best Practice
Start with 100-500 high-quality examples. More data isn't always betterβ€”focus on diversity and accuracy.

Fine-tuning Benefits Summary

| Aspect | Benefit |
| --- | --- |
| Training Speed | 1.4x faster than standard methods |
| Memory Usage | 40% less VRAM required |
| Context Length | 5x longer sequences supported |
| Accuracy | No degradation vs full fine-tuning |
| Cost | Free Colab notebook available |

Performance Benchmarks

OmniDocBench v1.5 Results

Table 1: Comprehensive Document Reading Evaluation

| Model | Visual Tokens | Reading Order | Overall Score | Complex Layouts | Tables | Math |
| --- | --- | --- | --- | --- | --- | --- |
| DeepSeek OCR 2 | 256-1120 | ✅ Human-like | SOTA | Excellent | Excellent | Excellent |
| DeepSeek OCR 1 | 256-1120 | ❌ Grid-based | High | Good | Good | Good |
| PaddleOCR-VL | Variable | ❌ Grid-based | High | Good | Very Good | Good |
| GPT-4o | High | ❌ Grid-based | High | Good | Good | Very Good |
| Claude 3.5 | High | ❌ Grid-based | High | Very Good | Good | Good |

Real-world Performance Insights

Community Testing Results:

  1. Document Skew Handling:

    • ✅ Handles 90°/180°/270° rotations reliably
    • ⚠️ Minor tilts/skews may reduce accuracy (preprocessing recommended; see the deskew sketch after this list)
  2. Repetition Issues:

    • Similar to Whisper, may repeat text in failure modes
    • Mitigated by NGramPerReqLogitsProcessor in vLLM
  3. Speed Comparison:

    • MistralOCR API: Fastest (cloud-based)
    • DeepSeek OCR 2 (vLLM): Fast (local, batched)
    • DeepSeek OCR 2 (Transformers): Moderate (local, flexible)
    • Local alternatives: Several orders of magnitude slower
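
The deskew sketch referenced in the skew-handling item above: a common OpenCV recipe that estimates the dominant text angle and rotates the page upright before OCR. The Otsu threshold and the minAreaRect angle normalization are illustrative, and the angle sign convention differs between OpenCV versions, so validate on a few of your own scans:

import cv2
import numpy as np

def deskew(image_path: str, output_path: str) -> float:
    """Estimate the dominant text angle and rotate the page upright before OCR."""
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Foreground = text pixels after inverted Otsu binarization.
    thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
    coords = np.column_stack(np.where(thresh > 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    # Normalize to a small correction in [-45, 45] degrees
    # (sign convention varies across OpenCV versions).
    if angle > 45:
        angle -= 90
    elif angle < -45:
        angle += 90
    h, w = img.shape[:2]
    matrix = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    rotated = cv2.warpAffine(img, matrix, (w, h), flags=cv2.INTER_CUBIC,
                             borderMode=cv2.BORDER_REPLICATE)
    cv2.imwrite(output_path, rotated)
    return angle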

💡 Expert Insight
"For pure text OCR, most recent models are nearly flawless. DeepSeek OCR 2 excels at complex math formatting and not hallucinating content." - LocalLLaMA community

Community Feedback & Real-world Usage

Positive Experiences

From r/LocalLLaMA users:

"My experience with [DeepSeek OCR 2] has been phenomenal. Though it's amazing, I also noticed that there is a failure mode which causes the model to repeat itself (like Whisper), not sure of the cause but something to take note of." - u/Pvt_Twinkietoes

"It is truly an amazing model and very grateful they open sourced it." - Community member

Production Pipelines

Advanced OCR Pipeline Example:

Input Document
    ↓
DeepSeek OCR 2 (structure + text extraction)
    ↓
Qwen3-VL (semantic figure descriptions)
    ↓
Devstral (markdown standardization)
    ↓
Kimi-K2 (summarization)
    ↓
Qwen3 (embeddings generation)
    ↓
pgvector (vector storage)

Use Case Scenarios

| Use Case | Recommended Setup | Why |
| --- | --- | --- |
| Research Papers | DeepSeek OCR 2 + Local | Privacy, math accuracy, custom formatting |
| Business Documents | MistralOCR API | Speed, reliability, structure preservation |
| Medical Records | DeepSeek OCR 2 Fine-tuned | Privacy, domain adaptation, compliance |
| Handwritten Notes | Gemini Flash / GPT-4o | Superior handwriting recognition |
| Multilingual Docs | DeepSeek OCR 2 Fine-tuned | Language adaptation, offline capability |
| High-volume Processing | vLLM + DeepSeek OCR 2 | Batch processing, cost-effective |

🤔 FAQ

Q: How does DeepSeek OCR 2 compare to GPT-4o for OCR tasks?

A: DeepSeek OCR 2 excels at structure preservation and doesn't hallucinate, while GPT-4o is better for semantic understanding and handwritten text. For pure document OCR with layout preservation, DeepSeek OCR 2 is often superior and runs locally. For interactive analysis requiring reasoning, GPT-4o is better.

Q: Can I run DeepSeek OCR 2 on AMD GPUs or Apple Silicon?

A: As of January 2026, official support is for NVIDIA GPUs with CUDA. ROCm (AMD) and Vulkan support are under community development. For Apple Silicon, you may need to use CPU inference (significantly slower) or wait for Metal backend support.

Q: What's the difference between "Free OCR" and "grounding" modes?

A:

  • Free OCR (<image>\nFree OCR.): Extracts text without preserving layout
  • Grounding mode (<image>\n<|grounding|>Convert the document to markdown.): Preserves document structure, tables, and formatting

Use grounding mode for documents where layout matters.

Q: How much VRAM do I need?

A:

  • 4-bit quantization: 8GB VRAM (RTX 3070 or better)
  • Full precision (bfloat16): 16GB VRAM (RTX 4080 or better)
  • Fine-tuning: 24GB+ VRAM recommended (or use Unsloth optimizations)

Q: Is DeepSeek OCR 2 better than PaddleOCR?

A: According to benchmarks, DeepSeek OCR 2 achieves higher accuracy on complex layouts. However, PaddleOCR has a more mature ecosystem and production pipeline. DeepSeek OCR 2 is easier to get started with, but PaddleOCR may be better for large-scale deployments with existing infrastructure.

Q: Can I use DeepSeek OCR 2 for commercial projects?

A: Yes, DeepSeek OCR 2 is open-source. Check the license on the Hugging Face repository for specific terms.

Q: How do I handle PDF files?

A: DeepSeek OCR 2 processes images. For PDFs:

  1. Convert PDF pages to images (using pdf2image or similar)
  2. Process each image with DeepSeek OCR 2
  3. Combine results

See the GitHub repository for PDF processing utilities.
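
Below is a minimal sketch of that three-step flow using the pdf2image package (which requires poppler), with the model and tokenizer already loaded as in Method 2; the file names and DPI are placeholders:

from pdf2image import convert_from_path  # requires poppler installed

# 1. Convert PDF pages to images.
pages = convert_from_path("document.pdf", dpi=200)

# 2. Run DeepSeek OCR 2 on each page (model/tokenizer loaded as in Method 2).
prompt = "<image>\n<|grounding|>Convert the document to markdown."
results = []
for i, page in enumerate(pages):
    image_file = f"page_{i}.png"
    page.save(image_file)
    res = model.infer(tokenizer, prompt=prompt, image_file=image_file,
                      output_path="output_directory", base_size=1024,
                      image_size=768, crop_mode=True, save_results=True)
    results.append(res)

# 3. Combine per-page results into one markdown document.
full_markdown = "\n\n---\n\n".join(str(r) for r in results)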

Q: What's the best way to improve accuracy for my specific documents?

A:

  1. Preprocessing: Ensure images are properly oriented and high-resolution
  2. Prompt engineering: Use appropriate prompts (<|grounding|> for structured docs)
  3. Fine-tuning: Create 100-500 examples of your document type and fine-tune
  4. Post-processing: Use LLMs to clean up minor errors

Conclusion & Next Steps

Key Takeaways

✅ DeepSeek OCR 2 represents a significant leap in document understanding with its human-like visual reading order via DeepEncoder V2

✅ Multiple deployment options (vLLM, Transformers, Unsloth) make it accessible for various use cases

✅ Fine-tuning capabilities enable 57-86% reductions in Character Error Rate for specialized domains

✅ Open-source nature provides full control, privacy, and customization

✅ Competitive performance with SOTA results on complex layouts, tables, and mathematical content

Recommended Action Plan

For Researchers & Developers:

  1. ✅ Start with the Hugging Face Transformers implementation for flexibility
  2. ✅ Test on your specific document types
  3. ✅ If accuracy is insufficient, prepare a fine-tuning dataset
  4. ✅ Use Unsloth's free Colab notebook for efficient fine-tuning
  5. ✅ Deploy with vLLM for production workloads

For Production Deployments:

  1. ✅ Benchmark DeepSeek OCR 2 vs MistralOCR for your use case
  2. ✅ Consider privacy requirements (local vs API)
  3. ✅ Set up vLLM for batch processing
  4. ✅ Implement preprocessing (rotation detection, image enhancement)
  5. ✅ Build post-processing pipeline for consistency

For Privacy-sensitive Applications:

  1. ✅ Deploy DeepSeek OCR 2 locally with Transformers or vLLM
  2. ✅ Fine-tune on your specific document types
  3. ✅ Implement secure document handling workflows
  4. ✅ Consider air-gapped deployment for maximum security

Resources

Final Thoughts

DeepSeek OCR 2 democratizes advanced document understanding technology, making it accessible to researchers, developers, and organizations of all sizes. Its combination of cutting-edge architecture, practical performance, and open-source availability positions it as a top choice for 2026 OCR projects.

Whether you're processing research papers, digitizing historical documents, or building production OCR pipelines, DeepSeek OCR 2 offers the flexibility and performance needed for modern document understanding tasks.

💡 Start Today
The fastest way to get started is using the Hugging Face Transformers implementation. Install the dependencies, download the model, and run your first OCR in under 10 minutes.


Last Updated: January 2026

Model Version: DeepSeek-OCR-2 (3B parameters)

License: Check Hugging Face repository for details
