Developers and AI engineers often need to turn visual inputs (scanned documents, screenshots, charts, and images) into structured text that large language models (LLMs) can process efficiently. DeepSeek-OCR addresses this with a model designed for “contexts optical compression”: encoding a page image into a small number of context-rich vision tokens that an LLM decoder reads back as text.
Released in October 2025, DeepSeek-OCR is built for teams working on document automation, image-to-text conversion, and visual data analysis. Its LLM-focused design aims to preserve context while reducing token and compute overhead for real-time or large-scale OCR pipelines.
What Is Contexts Optical Compression?
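Contexts optical compression treats the page image itself as a compressed form of its text: a vision encoder turns the image into a compact set of vision tokens, and an LLM decodes them, so the same content costs far fewer tokens than the raw text would.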
Traditional OCR typically extracts plain text. DeepSeek-OCR goes further by preserving information such as:
- Document structure
- Headings and sections
- Tables and lists
- Spatial relationships
- Figure and chart context
- Grounded references to regions in the image
This matters when your downstream task is not just “read the text,” but “understand the document.”
For example, a plain OCR output might lose which value belongs to which table column. A context-aware OCR output can preserve the table structure so an LLM can answer questions or generate structured data more reliably.
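To make the table example concrete, here is a minimal sketch that turns Markdown table output into one record per row, so each value stays attached to its column name. The sample table and the `parse_markdown_table` helper are illustrative assumptions, not part of DeepSeek-OCR.

```python
def parse_markdown_table(markdown: str) -> list[dict[str, str]]:
    """Parse a simple pipe-delimited Markdown table into one dict per row."""
    lines = [ln.strip() for ln in markdown.strip().splitlines() if ln.strip()]
    if len(lines) < 2:
        return []

    def split_row(line: str) -> list[str]:
        # Drop the outer pipes, then split on the inner ones.
        return [cell.strip() for cell in line.strip("|").split("|")]

    header = split_row(lines[0])
    # lines[1] is the separator row (| --- | ---: |), so data starts at index 2.
    return [dict(zip(header, split_row(line))) for line in lines[2:]]


# Hypothetical OCR output for a two-row invoice table.
ocr_markdown = """
| Item | Amount |
| --- | ---: |
| API usage | $120.00 |
| Support | $30.00 |
"""

rows = parse_markdown_table(ocr_markdown)
print(rows[0]["Item"], "->", rows[0]["Amount"])
```

With flat text output, the pairing of "API usage" and "$120.00" would have to be guessed from word order. Here it is explicit in the data structure.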
Key advantages
- Rich context: Keeps document layout, headings, tables, and spatial references.
- Flexible resolution modes: Supports everything from quick previews to high-detail extraction, each with its own token budget.
- Grounding capabilities: Enables references to specific visual regions for interactive document or visual QA workflows.
Traditional OCR tools such as Tesseract can work well for simple text extraction, but complex layouts, distorted scans, handwriting, and multilingual documents often require more context-aware processing. DeepSeek-OCR pairs a vision encoder with an LLM decoder so that layout and surrounding context can inform recognition in these cases.
How DeepSeek-OCR Works
DeepSeek-OCR uses an LLM-centric vision encoder that compresses visual data into a smaller set of informative tokens.
A typical workflow looks like this:
1. Image analysis: The model processes the input image at the selected resolution and identifies text, layout, figures, and document structure.
2. Token generation: Visual features are converted into compressed representations that distinguish document components such as headings, body text, tables, and figures.
3. Dynamic resolution handling: “Gundam” mode combines multiple image segments for dense or oversized documents.
4. Grounding tags: Special references such as <|ref|>xxxx<|/ref|> can be used to point to elements in the image.
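If your application needs to read or strip the grounding tags mentioned above, a small regular expression is usually enough. This sketch assumes only the <|ref|>...<|/ref|> form quoted in this article. The helper names and the sample string are hypothetical, and any coordinate payload the model may emit alongside a tag is not handled here.

```python
import re

# Matches the <|ref|>...<|/ref|> form quoted above. The match is non-greedy
# so that several tags in one string are captured separately.
REF_TAG = re.compile(r"<\|ref\|>(.*?)<\|/ref\|>", re.DOTALL)


def extract_refs(text: str) -> list[str]:
    """Return the label inside every <|ref|>...<|/ref|> tag, in order."""
    return [label.strip() for label in REF_TAG.findall(text)]


def strip_refs(text: str) -> str:
    """Remove the tag markers but keep the labels as plain text."""
    return REF_TAG.sub(lambda match: match.group(1), text)


# Hypothetical string containing two grounding references.
sample = "Locate <|ref|>the total amount<|/ref|> and <|ref|>invoice date<|/ref|>."

print(extract_refs(sample))  # ['the total amount', 'invoice date']
print(strip_refs(sample))    # Locate the total amount and invoice date.
```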
Token modes
DeepSeek-OCR supports multiple resolution/token trade-offs:
| Mode | Resolution | Tokens |
|---|---|---|
| Tiny | 512×512 px | 64 |
| Small | 640×640 px | 100 |
| Base | 1024×1024 px | 256 |
| Large | 1280×1280 px | 400 |
Use these modes based on your workload:
- Use tiny or small for fast previews and lightweight processing.
- Use base for a balanced production default.
- Use large when fine-grained layout or small text matters.
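The token counts in the table above follow a simple pattern: divide the image into 16×16-pixel patches, then reduce the patch count by a factor of 16. The patch size and the 16× reduction are assumptions in this sketch (they reproduce the table's numbers), and the `MODES` dictionary and helper names are illustrative only.

```python
# Mode table copied from the article: square side in pixels and vision tokens.
MODES = {
    "tiny": {"side_px": 512, "tokens": 64},
    "small": {"side_px": 640, "tokens": 100},
    "base": {"side_px": 1024, "tokens": 256},
    "large": {"side_px": 1280, "tokens": 400},
}


def vision_tokens(side_px: int, patch_px: int = 16, compression: int = 16) -> int:
    """Estimate vision tokens for a square image.

    Assumes patch_px-sized square patches followed by a fixed reduction
    of the patch count (both values are assumptions, see the lead-in).
    """
    patches = (side_px // patch_px) ** 2
    return patches // compression


def batch_budget(mode: str, pages: int) -> int:
    """Total vision tokens for a batch of pages processed in one mode."""
    return MODES[mode]["tokens"] * pages


for name, spec in MODES.items():
    print(name, spec["tokens"], vision_tokens(spec["side_px"]))

print(batch_budget("base", 1000))  # 256000
```

A budget function like this is handy when sizing a batch job: 1,000 pages in base mode cost 256,000 vision tokens, against 400,000 in large mode.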
DeepSeek-OCR Features for Developers
DeepSeek-OCR includes several features that are useful when building OCR-backed applications and APIs:
- Native resolution flexibility: Choose the right mode for speed, cost, or detail.
- Dynamic “Gundam” mode: Process high-resolution or dense documents by stitching multiple segments.
- Markdown output: Convert documents into structured Markdown while preserving tables, lists, and hierarchy.
- Figure parsing: Extract data and descriptions from charts and graphs.
- General image captioning: Generate contextual image descriptions.
- Location referencing: Query or extract information about specific image elements using grounding.
- Fast inference: Reported throughput of about 2500 tokens/sec on an A100-40G GPU, with support for both vLLM and Transformers inference.
- Lightweight deployment: Designed for secure and scalable integration with minimal dependencies.
Example use cases
You can use DeepSeek-OCR for:
- Automated document processing in financial or legal workflows
- Visual question-answering systems
- Accessibility tooling with rich image descriptions
- Batch OCR pipelines for digital archiving
- API-based document ingestion before LLM summarization or extraction
Under the Hood: DeepSeek-OCR Architecture
DeepSeek-OCR is designed for efficient, context-aware OCR.
The architecture includes:
- Image preprocessing: Resizes and normalizes input images.
- Vision Transformer backbone: Splits images into patches and converts them into embeddings.
- Compressed tokenization: Uses attention and feed-forward layers to synthesize visual context into concise tokens.
- LLM integration: Prepends visual tokens to text prompts to reduce context length and memory usage.
- Spatial grounding: Uses special tokens to map queries to coordinates or image regions.
- Optimized training: Fine-tuned on paired image-text datasets to balance compression and extraction accuracy.
In dynamic mode, DeepSeek-OCR stitches embeddings from multiple passes so documents with different sizes and densities can be processed consistently.
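To get a feel for what dynamic mode costs, the sketch below splits a page into fixed-size tiles and adds one global view. The 640-pixel tile size, the per-tile and global token counts (taken from the Small and Base rows of the mode table), and the absence of any tile cap or resizing are all assumptions for illustration. The real model's tiling logic may differ.

```python
import math


def tile_boxes(width: int, height: int, tile_px: int = 640) -> list[tuple[int, int, int, int]]:
    """Return (left, top, right, bottom) crop boxes that cover the whole page."""
    cols = math.ceil(width / tile_px)
    rows = math.ceil(height / tile_px)
    boxes = []
    for row in range(rows):
        for col in range(cols):
            left, top = col * tile_px, row * tile_px
            # Clamp the last row and column to the page edge.
            boxes.append((left, top, min(left + tile_px, width), min(top + tile_px, height)))
    return boxes


def estimate_tokens(
    width: int,
    height: int,
    tile_px: int = 640,
    tokens_per_tile: int = 100,  # assumed: same as Small mode
    global_tokens: int = 256,    # assumed: one Base-mode overview of the page
) -> int:
    """Rough token estimate: one overview plus one token block per tile."""
    return len(tile_boxes(width, height, tile_px)) * tokens_per_tile + global_tokens


boxes = tile_boxes(1280, 1920)
print(len(boxes))                   # 6 tiles (2 columns x 3 rows)
print(estimate_tokens(1280, 1920))  # 856
```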
Installation Guide: Getting Started with DeepSeek-OCR
Use a modern Python environment with CUDA support.
1. Clone the repository
git clone https://github.com/deepseek-ai/DeepSeek-OCR.git
cd DeepSeek-OCR
2. Create and activate a Conda environment
conda create -n deepseek-ocr python=3.12.9 -y
conda activate deepseek-ocr
3. Install PyTorch and dependencies
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu118
4. Install vLLM
Download the vLLM-0.8.5 wheel from the official release, then install it:
pip install vllm-0.8.5+cu118-cp38-abi3-manylinux1_x86_64.whl
5. Install project requirements
pip install -r requirements.txt
pip install flash-attn==2.7.3 --no-build-isolation
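Note: If pip reports a dependency conflict between vLLM and Transformers at this step, the project documentation says it can be ignored when both are meant to run in the same environment.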
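After installation, a quick sanity check can confirm that the pinned versions actually landed in the active environment. This is a minimal sketch using only the standard library. The `EXPECTED` table mirrors the versions in the commands above, and the helper names are hypothetical.

```python
from importlib import metadata

# Versions taken from the install commands above.
EXPECTED = {
    "torch": "2.6.0",
    "torchvision": "0.21.0",
    "torchaudio": "2.6.0",
    "vllm": "0.8.5",
    "flash-attn": "2.7.3",
}


def check_versions(expected: dict[str, str]) -> dict[str, str]:
    """Compare installed package versions against the expected pins."""
    report = {}
    for package, wanted in expected.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            report[package] = "missing"
            continue
        # Local build tags such as "+cu118" are ignored in the comparison.
        base = installed.split("+")[0]
        report[package] = "ok" if base == wanted else f"mismatch (found {installed})"
    return report


def cuda_available() -> bool:
    """Report whether PyTorch can see a CUDA device; False if torch is absent."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


for package, status in check_versions(EXPECTED).items():
    print(f"{package}: {status}")
print("CUDA available:", cuda_available())
```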
Choosing the Right Resolution Mode
Pick the resolution mode based on the document type and downstream task.
| Scenario | Recommended mode |
|---|---|
| Fast preview OCR | Tiny |
| Simple scanned pages | Small |
| General production document OCR | Base |
| Dense tables, small text, complex layouts | Large |
| Very large or dense documents | Dynamic “Gundam” mode |
A practical production flow is:
- Start with base mode.
- Evaluate extraction quality on your real documents.
- Move to large only when layout fidelity or small-text recognition is insufficient.
- Use dynamic mode for oversized or dense inputs.
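The escalation flow above can be sketched as a small policy function: try the cheapest acceptable mode first and move up only when a quality check fails. The `run_ocr` and `is_good_enough` callbacks, the mode labels, and the fake outputs are all hypothetical stand-ins. In a real pipeline, `run_ocr` would call your OCR service and the check would reflect your own quality criteria.

```python
from typing import Callable

# Order in which modes are tried, cheapest first (labels are illustrative).
ESCALATION_ORDER = ["base", "large", "gundam"]


def choose_mode(
    run_ocr: Callable[[str], str],
    is_good_enough: Callable[[str], bool],
) -> tuple[str, str]:
    """Try modes in order and return (mode, output) for the first acceptable result."""
    output = ""
    for mode in ESCALATION_ORDER:
        output = run_ocr(mode)
        if is_good_enough(output):
            return mode, output
    # Nothing passed the check: return the most detailed attempt.
    return ESCALATION_ORDER[-1], output


# Fake OCR results standing in for real model calls.
fake_outputs = {
    "base": "Invoice API usage 120.00",
    "large": "| Item | Amount |\n| --- | --- |\n| API usage | $120.00 |",
    "gundam": "| Item | Amount |\n| --- | --- |\n| API usage | $120.00 |",
}

# Example check: the downstream task needs the table structure preserved.
mode, output = choose_mode(fake_outputs.__getitem__, lambda text: "|" in text)
print(mode)  # large
```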
Performance and Benchmarking
DeepSeek-OCR is designed for high-throughput OCR workloads.
Reported performance and benchmark highlights include:
- Speed: Up to 2500 tokens/sec on an A100-40G GPU
- Benchmarks: Strong performance on Fox and OmniDocBench for OCR precision, layout retention, and figure parsing
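- Compression: The DeepSeek-OCR paper reports about 97% OCR precision when a page's text has fewer than 10× as many tokens as the vision tokens used to encode it, falling to about 60% at roughly 20× compression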
- Resolution scaling: Higher modes provide more detail but use more tokens
For most production use cases, base mode offers a strong balance between detail and token efficiency.
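A little arithmetic helps when comparing modes. The sketch below computes the compression ratio and token savings for a page, plus a rough decode time from the quoted throughput. The 1,000-token page is a made-up example, and treating 2500 tokens/sec as a steady output rate is an assumption: real throughput depends on batching and hardware.

```python
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """How many text tokens each vision token stands in for."""
    return text_tokens / vision_tokens


def token_reduction(text_tokens: int, vision_tokens: int) -> float:
    """Fraction of tokens saved by feeding vision tokens instead of raw text."""
    return 1 - vision_tokens / text_tokens


def decode_seconds(output_tokens: int, tokens_per_sec: float = 2500.0) -> float:
    """Rough time to generate the OCR output at a given throughput."""
    return output_tokens / tokens_per_sec


# Hypothetical page: 1,000 text tokens, encoded in base mode (256 vision tokens).
page_text_tokens = 1000
base_vision_tokens = 256

print(f"{compression_ratio(page_text_tokens, base_vision_tokens):.2f}x")
print(f"{token_reduction(page_text_tokens, base_vision_tokens):.1%}")
print(f"{decode_seconds(page_text_tokens):.2f} s")
```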
Comparing DeepSeek-OCR with Other OCR Solutions
| Feature | DeepSeek-OCR | PaddleOCR | GOT-OCR2.0 | MinerU | Tesseract |
|---|---|---|---|---|---|
| LLM Integration | Yes | No | Partial | No | No |
| Contextual Output | Yes | No | Partial | No | No |
| Dynamic Resolution | Yes | No | Yes | No | No |
| Grounding Support | Yes | No | Partial | No | No |
| Token Compression | High | Medium | Medium | Low | Low |
| Markdown Output | Yes | Yes | Yes | Yes | No |
DeepSeek-OCR is best suited for LLM-oriented OCR pipelines where layout, context, and compressed visual tokens matter.
Building an OCR API Around DeepSeek-OCR
A common way to integrate DeepSeek-OCR into an application is to wrap it behind an API.
A minimal architecture could look like this:
- Client uploads an image or document page.
- Backend stores the file temporarily.
- OCR worker runs DeepSeek-OCR with the selected mode.
- API returns Markdown, structured text, or grounded references.
- Downstream services pass the result to an LLM, database, or search index.
Example API contract:
POST /ocr
Content-Type: multipart/form-data
Request fields:
- file: document image
- mode: tiny | small | base | large
- output_format: markdown | text | json
Example JSON response:
{
  "mode": "base",
  "output_format": "markdown",
  "content": "# Invoice\n\n| Item | Amount |\n| --- | ---: |\n| API usage | $120.00 |",
  "metadata": {
    "tokens": 256,
    "processing_status": "completed"
  }
}
This kind of contract makes it easier to test the OCR service independently before connecting it to an LLM workflow.
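As a starting point, the contract above can be expressed as a small validation and response-building layer that is independent of any web framework. Everything here is a hypothetical sketch: the `OcrRequest` class, the `build_response` helper, and the idea of reporting the mode's token budget in `metadata.tokens` are assumptions, and the OCR call itself is left out.

```python
import json
from dataclasses import dataclass

ALLOWED_MODES = {"tiny", "small", "base", "large"}
ALLOWED_FORMATS = {"markdown", "text", "json"}
# Token budget per mode, copied from the mode table earlier in the article.
MODE_TOKENS = {"tiny": 64, "small": 100, "base": 256, "large": 400}


@dataclass
class OcrRequest:
    """Fields of the hypothetical POST /ocr request."""

    filename: str
    mode: str = "base"
    output_format: str = "markdown"

    def validate(self) -> list[str]:
        """Return a list of validation errors; empty means the request is valid."""
        errors = []
        if not self.filename:
            errors.append("file is required")
        if self.mode not in ALLOWED_MODES:
            errors.append(f"mode must be one of {sorted(ALLOWED_MODES)}")
        if self.output_format not in ALLOWED_FORMATS:
            errors.append(f"output_format must be one of {sorted(ALLOWED_FORMATS)}")
        return errors


def build_response(request: OcrRequest, content: str) -> str:
    """Serialize an OCR result in the response shape shown above."""
    body = {
        "mode": request.mode,
        "output_format": request.output_format,
        "content": content,
        "metadata": {
            "tokens": MODE_TOKENS[request.mode],
            "processing_status": "completed",
        },
    }
    return json.dumps(body, indent=2)


request = OcrRequest(filename="invoice.png")
if not request.validate():
    print(build_response(request, "# Invoice"))
```

Keeping validation separate from the model call means the contract can be unit-tested without a GPU.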
Why Apidog Matters for DeepSeek-OCR API Integration
When deploying DeepSeek-OCR in real projects, you need more than the model. You also need to validate endpoints, inspect responses, mock services, and collaborate on API contracts.
Apidog helps with:
- API testing: Validate OCR endpoints, payloads, headers, and responses.
- Mocking: Simulate OCR APIs before the model service is production-ready.
- Automation: Build repeatable tests for OCR workflows.
- Performance checks: Track response times and error cases.
- Collaboration: Share API collections and documentation with teammates.
A practical DeepSeek-OCR API workflow with Apidog:
- Define your OCR endpoint schema.
- Add sample upload requests for different document types.
- Test response formats such as Markdown, text, and JSON.
- Mock the OCR API for frontend or integration teams.
- Add automated checks for status codes, required fields, and response shape.
This helps keep your OCR service stable as you iterate on model settings, resolution modes, and downstream LLM prompts.
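The same checks can be prototyped in plain Python before you encode them in a testing tool. This sketch validates the status code, required fields, and response shape for the example contract. It is not Apidog's scripting API, and the field lists and the `check_response` helper are assumptions based on the sample response in this article.

```python
import json

# Required top-level fields and their expected JSON types.
REQUIRED_FIELDS = {"mode": str, "output_format": str, "content": str, "metadata": dict}
# Required fields inside "metadata".
REQUIRED_METADATA = {"tokens": int, "processing_status": str}


def check_response(status_code: int, body: str) -> list[str]:
    """Return a list of problems found in an OCR API response; empty means it passed."""
    problems = []
    if status_code != 200:
        problems.append(f"unexpected status code: {status_code}")
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return problems + ["body is not valid JSON"]
    if not isinstance(payload, dict):
        return problems + ["body is not a JSON object"]

    for field, kind in REQUIRED_FIELDS.items():
        if not isinstance(payload.get(field), kind):
            problems.append(f"missing or invalid field: {field}")

    metadata = payload.get("metadata")
    if isinstance(metadata, dict):
        for field, kind in REQUIRED_METADATA.items():
            if not isinstance(metadata.get(field), kind):
                problems.append(f"missing or invalid field: metadata.{field}")
    return problems


good_body = json.dumps({
    "mode": "base",
    "output_format": "markdown",
    "content": "# Invoice",
    "metadata": {"tokens": 256, "processing_status": "completed"},
})

print(check_response(200, good_body))        # []
print(check_response(500, '{"mode": "base"}'))
```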
Conclusion
DeepSeek-OCR gives developers a practical way to convert visual data into compressed, context-rich representations for LLM workflows. Its support for dynamic resolution, Markdown output, figure parsing, grounding, and high-throughput inference makes it useful for advanced OCR and document automation systems.
For production usage, treat DeepSeek-OCR as part of a larger API pipeline: define clear request and response contracts, test multiple document types, and monitor performance. Tools like Apidog can help you build, test, mock, and collaborate on that API layer more reliably.




