In this article, we present a production-grade pipeline for extracting Turkish national identification numbers (TCNo) from scanned or digitally generated PDF documents. The solution leverages PyMuPDF for fast PDF rasterization and Ollama-hosted VLMs for accurate, structured information retrieval. It is designed to process batches of documents efficiently, with strong fault tolerance, format validation, and persistent output management via JSON serialization.
This article covers the architectural design, individual components, and engineering considerations behind this solution.
Use Case
Structured information extraction from PDFs remains a persistent challenge in sectors such as finance, public administration, and legal compliance. Fields like Turkish ID numbers are critical yet often embedded within unstructured or semi-structured document formats. Traditional OCR systems like Tesseract or Google Vision often fail to discriminate between valid and invalid extractions due to lack of domain context.
By integrating multimodal LLMs capable of visual reasoning (e.g., Qwen-VL or LLaVA), we introduce a more semantic-aware pipeline that performs better in edge cases, delivers higher precision, and allows tighter control through prompt engineering.
Architecture Overview
The solution is implemented as a self-contained Python class, PDFOCRProcessor, organized into five major stages, sketched below:
- Environment setup and folder management
- PDF rasterization into per-page PNGs
- Image-based extraction using Ollama + VLM
- Field-level validation and filtering
- Output serialization and deduplication
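At a glance, these stages map onto the class roughly as follows (a structural sketch only; the full implementation appears in the Full Code section):

class PDFOCRProcessor:
    # 1. Environment setup and folder management
    def __init__(self): ...
    def create_temp_folder(self): ...
    def clear_temp_folder(self): ...
    # 2. PDF rasterization into per-page PNGs
    def pdf_to_images(self, pdf_path): ...
    # 3. Image-based extraction with Ollama (called per page inside process_pdf)
    def process_pdf(self, pdf_path): ...
    # 4. Field-level validation and filtering
    def is_valid_tcno(self, tcno): ...
    def extract_and_validate_tcno(self, text): ...
    # 5. Output serialization (deduplication happens in process_pdf)
    def update_json(self, results): ...
    # Batch entry point over a folder of PDFs
    def process_folder(self, folder_path): ...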
Step-by-Step Review
1. Environment Initialization and Folder Handling
def __init__(self):
    self.temp_folder = "temp_images"
    self.output_file = "extracted_texts.json"
    self.create_temp_folder()
A temporary folder is used to store intermediate PNG images for each page. This is necessary because the Ollama API expects actual image file paths for multimodal input. The JSON output file acts as a persistent cache, enabling resumable or idempotent processing.
Folder cleanup is handled gracefully with:
def clear_temp_folder(self):
    ...
This ensures the disk remains free of residual intermediate files.
2. PDF Rasterization via PyMuPDF
pix = page.get_pixmap(dpi=100)
pix.save(image_path)
Each PDF page is rendered as a PNG image using PyMuPDF's get_pixmap() method. A DPI of 100 is a practical trade-off between image resolution and processing speed; higher DPI values can be configured if document fidelity is a concern.
This method is preferred over alternatives like pdf2image or wand due to PyMuPDF's speed, native PDF parsing support, and ease of integration.
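If higher fidelity is needed (small fonts, dense forms), the DPI can simply be raised. A minimal sketch, using a hypothetical render_page helper with a tunable dpi parameter:

import fitz  # PyMuPDF

def render_page(pdf_path, page_number, out_path, dpi=200):
    """Render a single page at a configurable DPI (hypothetical helper)."""
    with fitz.open(pdf_path) as doc:
        page = doc.load_page(page_number)
        pix = page.get_pixmap(dpi=dpi)  # higher DPI = sharper image, slower processing
        pix.save(out_path)
    return out_path

Doubling the DPI roughly quadruples the pixel count, so expect a corresponding increase in rasterization time and model input size.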
3. Multimodal Extraction with Qwen-VL
response = ollama.chat(
    model='qwen2.5vl:7b',
    messages=[{
        'role': 'user',
        'content': (
            "You are given an image of a document. From this image, extract only the following values if they are clearly visible:\n\n"
            "1. TCNo (Turkish Identification Number):\n"
            "- Must be exactly 11 digits\n"
            "- Must contain only numeric characters (0–9), no letters or symbols\n"
            "- Ignore anything that does not strictly match this format\n\n"
        ),
        'images': [os.path.abspath(img_path)]
    }],
    options={"temperature": 0}
)
Prompt engineering plays a pivotal role here. The instructions are deliberately constrained to reduce hallucination and prevent the model from returning loosely matched values. The temperature is set to zero to ensure deterministic behavior.
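Because the downstream validator keys off the literal "TCNo:" prefix (see the next section), one optional refinement, not part of the original prompt, is to pin the model to that exact output line format:

# Hypothetical prompt variant that fixes the output format the validator parses
PROMPT = (
    "You are given an image of a document.\n"
    "Reply with exactly one line in the form:\n"
    "TCNo: <11-digit number>\n"
    "If no valid 11-digit Turkish ID number is clearly visible, reply with:\n"
    "TCNo: None\n"
    "Do not add any other text."
)

This keeps parsing trivial even when the model would otherwise add explanatory text around the value.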
4. Validation Logic
def is_valid_tcno(self, tcno):
    return tcno.isdigit() and len(tcno) == 11
Even though the model is guided with strict instructions, an extra layer of post-validation ensures compliance with domain-specific rules. This protects against partially correct or corrupted outputs. Only 11-digit numeric values are accepted as valid TC numbers.
Additionally, the logic is wrapped with:
def extract_and_validate_tcno(self, text):
    ...
This function parses each line and replaces invalid values with "None" for clarity and traceability.
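If stricter validation is desired, the commonly documented TCKN checksum rules can be layered on top of the length check. A sketch of such a variant (an addition beyond the article's code, shown here as a hypothetical is_valid_tcno_strict):

def is_valid_tcno_strict(tcno: str) -> bool:
    """Length check plus the TCKN checksum digits (stricter, hypothetical variant)."""
    if not (tcno.isdigit() and len(tcno) == 11) or tcno[0] == "0":
        return False
    d = [int(c) for c in tcno]
    d10 = ((sum(d[0:9:2]) * 7) - sum(d[1:8:2])) % 10  # rule for the 10th digit
    d11 = sum(d[:10]) % 10                            # rule for the 11th digit
    return d[9] == d10 and d[10] == d11

This rejects strings that are 11 digits long but cannot be real ID numbers, which further reduces false positives from noisy scans.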
5. Output Serialization and Deduplication
Before initiating any processing, the pipeline checks whether a document has been previously processed:
if pdf_id in results:
    print(f"Already processed: {pdf_path}")
    return
This prevents redundant computation in multi-run scenarios.
The final results are structured in a hierarchical JSON format:
{
    "document_id": {
        "file_name": "filename.pdf",
        "content": {
            "filename_page_1.png": "TCNo: 12345678901",
            "filename_page_2.png": "TCNo: None"
        }
    }
}
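Because the output is plain JSON, downstream consumers can load it directly. A small sketch (assuming the default extracted_texts.json filename) that collects the first validated TCNo found per document:

import json

with open("extracted_texts.json", encoding="utf-8") as f:
    results = json.load(f)

found = {}
for doc in results.values():
    for page, text in doc["content"].items():
        for line in text.splitlines():
            if line.startswith("TCNo:") and "None" not in line:
                # keep the first valid hit per document
                found.setdefault(doc["file_name"], line.split(":", 1)[-1].strip())

print(found)  # e.g. {'filename.pdf': '12345678901'}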
Batch Processing
def process_folder(self, folder_path):
    ...
Entire directories can be processed via a single call, enabling integration with file drop services, cloud buckets, or internal archives. Each PDF is processed in isolation, and failures in one file do not halt the pipeline.
All exceptions are caught, logged, and written into the JSON as error_* keys to ensure no data loss or silent failures.
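A failed document therefore still appears in the output, with a shape roughly like this (illustrative document name and placeholder message):

{
    "broken_scan": {
        "file_name": "broken_scan.pdf",
        "content": {
            "error_0": "<exception message>"
        }
    }
}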
Logging
Informational logs like the following help with observability:
Processing: ./PDFs/sample_form.pdf
OCR output: TCNo: 12345678901
Completed: ./PDFs/sample_form.pdf
While this implementation uses standard output for logging, it can be extended with the logging module or sent to a structured log aggregator (e.g., ELK stack or Datadog) in production settings.
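A minimal upgrade from print statements to the standard logging module might look like this (a sketch; the logger name and format string are choices, not part of the original code):

import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s - %(message)s",
)
logger = logging.getLogger("pdf_ocr")

# Inside the pipeline, the print calls would become, for example:
logger.info("Processing: %s", "PDFs/sample_form.pdf")
logger.warning("Already processed: %s", "PDFs/sample_form.pdf")
logger.error("Error: %s : %s", "PDFs/broken.pdf", "reason")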
Full Code
# Import required libraries
import os      # For file system operations
import json    # For JSON file handling
import fitz    # PyMuPDF for PDF processing
import ollama  # For OCR functionality
import re      # For regular expressions

class PDFOCRProcessor:
    """Main class for processing PDF files and extracting text using OCR"""

    def __init__(self):
        """Initialize the processor with default settings"""
        self.temp_folder = "temp_images"           # Folder for temporary image storage
        self.output_file = "extracted_texts.json"  # Output file for results
        self.create_temp_folder()                  # Ensure temp folder exists

    def create_temp_folder(self):
        """Create temporary folder if it doesn't exist"""
        if not os.path.exists(self.temp_folder):
            os.makedirs(self.temp_folder)

    def clear_temp_folder(self):
        """Clean up temporary image files"""
        for filename in os.listdir(self.temp_folder):
            file_path = os.path.join(self.temp_folder, filename)
            try:
                if os.path.isfile(file_path):
                    os.unlink(file_path)  # Delete file
            except Exception as e:
                print(f"Error: {file_path} could not be deleted. {e}")
    def pdf_to_images(self, pdf_path):
        """Convert each PDF page to an image file"""
        images = []
        pdf_document = fitz.open(pdf_path)  # Open PDF file
        for page_number in range(len(pdf_document)):
            page = pdf_document.load_page(page_number)  # Load page
            pix = page.get_pixmap(dpi=100)              # Render at 100 DPI
            # Generate a unique image filename per page
            image_path = os.path.join(
                self.temp_folder,
                f"{os.path.basename(pdf_path)[:-4]}_page_{page_number + 1}.png"
            )
            pix.save(image_path)  # Save image
            images.append(image_path)
        pdf_document.close()  # Release the file handle
        return images
    def update_json(self, results):
        """Update output JSON file with processed results"""
        with open(self.output_file, 'w', encoding='utf-8') as f:
            json.dump(results, f, indent=4, ensure_ascii=False)

    def is_valid_tcno(self, tcno):
        """Validate Turkish ID number format"""
        return tcno.isdigit() and len(tcno) == 11  # Must be exactly 11 digits

    def extract_and_validate_tcno(self, text):
        """Extract and validate TCNo from OCR text"""
        lines = text.splitlines()
        validated_lines = []
        for line in lines:
            if line.strip().lower().startswith("tcno:"):  # Find TCNo line
                value = line.split(":", 1)[-1].strip()    # Extract value
                valid = self.is_valid_tcno(value)         # Validate
                validated_lines.append(f"TCNo: {value if valid else 'None'}")
            else:
                validated_lines.append(line)              # Keep other lines unchanged
        return "\n".join(validated_lines)
    def process_pdf(self, pdf_path):
        """Process a single PDF file through the OCR pipeline"""
        pdf_id = os.path.splitext(os.path.basename(pdf_path))[0]
        results = {}
        # Load existing results if the output file exists
        if os.path.exists(self.output_file):
            with open(self.output_file, 'r', encoding='utf-8') as f:
                results = json.load(f)
        # Skip files that were already processed
        if pdf_id in results:
            print(f"Already processed: {pdf_path}")
            return
        print(f"Processing: {pdf_path}")
        image_paths = []
        extracted = {}
        try:
            image_paths = self.pdf_to_images(pdf_path)  # Convert pages to images
            # Process each image through the model
            for img_path in image_paths:
                response = ollama.chat(
                    model='qwen2.5vl:7b',
                    messages=[{
                        'role': 'user',
                        'content': (
                            "Extract only TCNo (Turkish Identification Number) "
                            "if visible:\n"
                            "- Must be exactly 11 digits\n"
                            "- Numbers only, no letters/symbols\n"
                        ),
                        'images': [os.path.abspath(img_path)]
                    }],
                    options={"temperature": 0}  # Deterministic output
                )
                # Handle both attribute-style and dict-style responses
                if hasattr(response, "message"):
                    content = response.message.content
                else:
                    content = response.get("message", {}).get("content", str(response))
                content = self.extract_and_validate_tcno(content)  # Validate
                print("OCR output:", content)
                extracted[os.path.basename(img_path)] = content    # Store result
            # Update results
            results[pdf_id] = {
                "file_name": os.path.basename(pdf_path),
                "content": extracted
            }
            self.update_json(results)
        except Exception as e:
            print(f"Error: {pdf_path} : {str(e)}")
            # Record the error in results (at least one error_* key)
            results[pdf_id] = {
                "file_name": os.path.basename(pdf_path),
                "content": {f"error_{i}": str(e) for i in range(max(len(image_paths), 1))}
            }
            self.update_json(results)
        finally:
            self.clear_temp_folder()  # Clean up intermediate images
            print(f"Completed: {pdf_path}")
    def process_folder(self, folder_path):
        """Process all PDF files in a directory"""
        pdf_files = [f for f in os.listdir(folder_path) if f.lower().endswith('.pdf')]
        pdf_files.sort()  # Process in a consistent order
        for pdf_file in pdf_files:
            pdf_path = os.path.join(folder_path, pdf_file)
            self.process_pdf(pdf_path)

if __name__ == "__main__":
    # Create a processor instance and start processing
    processor = PDFOCRProcessor()
    processor.process_folder("PDFs")  # Process all PDFs in the folder
    print("All processing complete!")
Conclusion
This article introduced a robust, efficient, and semantically aware pipeline for extracting Turkish Identification Numbers from scanned or native PDF documents using multimodal LLMs. The combination of PyMuPDF for fast rasterization and Ollama-powered Vision Language Models enables reliable and scalable extraction in real-world deployments.
Key Advantages
- Strong format validation (domain-specific)
- High recall and precision for TCNo fields
- JSON-based resumable output storage
- Batch processing with fault tolerance
By leveraging modern LLM capabilities in tandem with traditional document processing techniques, we can move closer to fully automated, high-accuracy document understanding systems for regulated environments.
Thanks for reading...