In this article, we present a production-grade pipeline for extracting Turkish national identification numbers (TCNo) from scanned or digitally generated PDF documents. The solution leverages PyMuPDF for fast PDF rasterization and Ollama-hosted VLMs for accurate, structured information retrieval. It is designed to process batches of documents efficiently, with strong fault tolerance, format validation, and persistent output management via JSON serialization.
This article covers the architectural design, individual components, and engineering considerations behind this solution.
Use Case
Structured information extraction from PDFs remains a persistent challenge in sectors such as finance, public administration, and legal compliance. Fields like Turkish ID numbers are critical yet often embedded within unstructured or semi-structured document formats. Traditional OCR systems like Tesseract or Google Vision often fail to discriminate between valid and invalid extractions due to lack of domain context.
By integrating multimodal LLMs capable of visual reasoning (e.g., Qwen-VL or LLaVA), we introduce a more semantic-aware pipeline that performs better in edge cases, delivers higher precision, and allows tighter control through prompt engineering.
Architecture Overview
The solution is implemented as a self-contained Python class, PDFOCRProcessor, organized into five major stages, sketched below:
- Environment setup and folder management
- PDF rasterization into per-page PNGs
- Image-based extraction using Ollama + VLM
- Field-level validation and filtering
- Output serialization and deduplication
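At a glance, these stages map onto the class roughly as follows (a structural sketch only; the full implementation appears in the Full Code section):

class PDFOCRProcessor:
    # 1. Environment setup and folder management
    def __init__(self): ...
    def create_temp_folder(self): ...
    def clear_temp_folder(self): ...
    # 2. PDF rasterization into per-page PNGs
    def pdf_to_images(self, pdf_path): ...
    # 3. Image-based extraction with Ollama (called per page inside process_pdf)
    def process_pdf(self, pdf_path): ...
    # 4. Field-level validation and filtering
    def is_valid_tcno(self, tcno): ...
    def extract_and_validate_tcno(self, text): ...
    # 5. Output serialization (deduplication happens in process_pdf)
    def update_json(self, results): ...
    # Batch entry point over a folder of PDFs
    def process_folder(self, folder_path): ...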
Step-by-Step Review
1. Environment Initialization and Folder Handling
def __init__(self):
    self.temp_folder = "temp_images"
    self.output_file = "extracted_texts.json"
    self.create_temp_folder()
A temporary folder is used to store intermediate PNG images for each page. This is necessary because the Ollama API expects actual image file paths for multimodal input. The JSON output file acts as a persistent cache, enabling resumable or idempotent processing.
Folder cleanup is handled gracefully with:
def clear_temp_folder(self):
    ...
This ensures the disk remains free of residual intermediate files.
2. PDF Rasterization via PyMuPDF
pix = page.get_pixmap(dpi=100)
pix.save(image_path)
Each PDF page is rendered as a PNG image using PyMuPDF's get_pixmap() method. A DPI of 100 is a practical trade-off between image resolution and processing speed; higher DPI values can be configured if document fidelity is a concern.
This method is preferred over alternatives like pdf2image or wand due to PyMuPDF's speed, native PDF parsing support, and ease of integration.
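If higher fidelity is needed (small fonts, dense forms), the DPI can simply be raised. A minimal sketch, using a hypothetical render_page helper with a tunable dpi parameter:

import fitz  # PyMuPDF

def render_page(pdf_path, page_number, out_path, dpi=200):
    """Render a single page at a configurable DPI (hypothetical helper)."""
    with fitz.open(pdf_path) as doc:
        page = doc.load_page(page_number)
        pix = page.get_pixmap(dpi=dpi)  # higher DPI = sharper image, slower processing
        pix.save(out_path)
    return out_path

Doubling the DPI roughly quadruples the pixel count, so expect a corresponding increase in rasterization time and model input size.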
3. Multimodal Extraction with Qwen-VL
response = ollama.chat(
    model='qwen2.5vl:7b',
    messages=[{
        'role': 'user',
        'content': (
            "You are given an image of a document. From this image, extract only the following values if they are clearly visible:\n\n"
            "1. TCNo (Turkish Identification Number):\n"
            "- Must be exactly 11 digits\n"
            "- Must contain only numeric characters (0–9), no letters or symbols\n"
            "- Ignore anything that does not strictly match this format\n\n"
        ),
        'images': [os.path.abspath(img_path)]
    }],
    options={"temperature": 0}
)
Prompt engineering plays a pivotal role here. The instructions are deliberately constrained to reduce hallucination and prevent the model from returning loosely matched values. The temperature is set to zero to ensure deterministic behavior.
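Because the downstream validator keys off the literal "TCNo:" prefix (see the next section), one optional refinement, not part of the original prompt, is to pin the model to that exact output line format:

# Hypothetical prompt variant that fixes the output format the validator parses
PROMPT = (
    "You are given an image of a document.\n"
    "Reply with exactly one line in the form:\n"
    "TCNo: <11-digit number>\n"
    "If no valid 11-digit Turkish ID number is clearly visible, reply with:\n"
    "TCNo: None\n"
    "Do not add any other text."
)

This keeps parsing trivial even when the model would otherwise add explanatory text around the value.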
4. Validation Logic
def is_valid_tcno(self, tcno):
    return tcno.isdigit() and len(tcno) == 11
Even though the model is guided with strict instructions, an extra layer of post-validation ensures compliance with domain-specific rules. This protects against partially correct or corrupted outputs. Only 11-digit numeric values are accepted as valid TC numbers.
Additionally, the logic is wrapped with:
def extract_and_validate_tcno(self, text):
    ...
This function parses each line and replaces invalid values with "None" for clarity and traceability.
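If stricter validation is desired, the commonly documented TCKN checksum rules can be layered on top of the length check. A sketch of such a variant (an addition beyond the article's code, shown here as a hypothetical is_valid_tcno_strict):

def is_valid_tcno_strict(tcno: str) -> bool:
    """Length check plus the TCKN checksum digits (stricter, hypothetical variant)."""
    if not (tcno.isdigit() and len(tcno) == 11) or tcno[0] == "0":
        return False
    d = [int(c) for c in tcno]
    d10 = ((sum(d[0:9:2]) * 7) - sum(d[1:8:2])) % 10  # rule for the 10th digit
    d11 = sum(d[:10]) % 10                            # rule for the 11th digit
    return d[9] == d10 and d[10] == d11

This rejects strings that are 11 digits long but cannot be real ID numbers, which further reduces false positives from noisy scans.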
5. Output Serialization and Deduplication
Before initiating any processing, the pipeline checks whether a document has been previously processed:
if pdf_id in results:
    print(f"Already processed: {pdf_path}")
    return
This prevents redundant computation in multi-run scenarios.
The final results are structured in a hierarchical JSON format:
{
    "document_id": {
        "file_name": "filename.pdf",
        "content": {
            "filename_page_1.png": "TCNo: 12345678901",
            "filename_page_2.png": "TCNo: None"
        }
    }
}
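Because the output is plain JSON, downstream consumers can load it directly. A small sketch (assuming the default extracted_texts.json filename) that collects the first validated TCNo found per document:

import json

with open("extracted_texts.json", encoding="utf-8") as f:
    results = json.load(f)

found = {}
for doc in results.values():
    for page, text in doc["content"].items():
        for line in text.splitlines():
            if line.startswith("TCNo:") and "None" not in line:
                # keep the first valid hit per document
                found.setdefault(doc["file_name"], line.split(":", 1)[-1].strip())

print(found)  # e.g. {'filename.pdf': '12345678901'}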
Batch Processing
def process_folder(self, folder_path):
    ...
Entire directories can be processed via a single call, enabling integration with file drop services, cloud buckets, or internal archives. Each PDF is processed in isolation, and failures in one file do not halt the pipeline.
All exceptions are caught, logged, and written into the JSON as error_* keys to ensure no data loss or silent failures.
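A failed document therefore still appears in the output, with a shape roughly like this (illustrative document name and placeholder message):

{
    "broken_scan": {
        "file_name": "broken_scan.pdf",
        "content": {
            "error_0": "<exception message>"
        }
    }
}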
Logging
Informational logs like the following help with observability:
Processing: ./PDFs/sample_form.pdf
OCR output: TCNo: 12345678901
Completed: ./PDFs/sample_form.pdf
While this implementation uses standard output for logging, it can be extended with the logging module or sent to a structured log aggregator (e.g., ELK stack or Datadog) in production settings.
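A minimal upgrade from print statements to the standard logging module might look like this (a sketch; the logger name and format string are choices, not part of the original code):

import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s - %(message)s",
)
logger = logging.getLogger("pdf_ocr")

# Inside the pipeline, the print calls would become, for example:
logger.info("Processing: %s", "PDFs/sample_form.pdf")
logger.warning("Already processed: %s", "PDFs/sample_form.pdf")
logger.error("Error: %s : %s", "PDFs/broken.pdf", "reason")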
Full Code
# Import required libraries
import os      # For file system operations
import json    # For JSON file handling
import fitz    # PyMuPDF for PDF processing
import ollama  # For OCR functionality
import re      # For regular expressions

class PDFOCRProcessor:
    """Main class for processing PDF files and extracting text using OCR"""

    def __init__(self):
        """Initialize the processor with default settings"""
        self.temp_folder = "temp_images"           # Folder for temporary image storage
        self.output_file = "extracted_texts.json"  # Output file for results
        self.create_temp_folder()                  # Ensure temp folder exists

    def create_temp_folder(self):
        """Create temporary folder if it doesn't exist"""
        if not os.path.exists(self.temp_folder):
            os.makedirs(self.temp_folder)

    def clear_temp_folder(self):
        """Clean up temporary image files"""
        for filename in os.listdir(self.temp_folder):
            file_path = os.path.join(self.temp_folder, filename)
            try:
                if os.path.isfile(file_path):
                    os.unlink(file_path)  # Delete file
            except Exception as e:
                print(f"Error: {file_path} could not be deleted. {e}")
    def pdf_to_images(self, pdf_path):
        """Convert each PDF page to an image file"""
        images = []
        pdf_document = fitz.open(pdf_path)  # Open PDF file
        for page_number in range(len(pdf_document)):
            page = pdf_document.load_page(page_number)  # Load page
            pix = page.get_pixmap(dpi=100)              # Render at 100 DPI
            # Generate a unique image filename per page
            image_path = os.path.join(
                self.temp_folder,
                f"{os.path.basename(pdf_path)[:-4]}_page_{page_number + 1}.png"
            )
            pix.save(image_path)  # Save image
            images.append(image_path)
        pdf_document.close()  # Release the file handle
        return images
    def update_json(self, results):
        """Update output JSON file with processed results"""
        with open(self.output_file, 'w', encoding='utf-8') as f:
            json.dump(results, f, indent=4, ensure_ascii=False)

    def is_valid_tcno(self, tcno):
        """Validate Turkish ID number format"""
        return tcno.isdigit() and len(tcno) == 11  # Must be exactly 11 digits

    def extract_and_validate_tcno(self, text):
        """Extract and validate TCNo from OCR text"""
        lines = text.splitlines()
        validated_lines = []
        for line in lines:
            if line.strip().lower().startswith("tcno:"):  # Find TCNo line
                value = line.split(":", 1)[-1].strip()    # Extract value
                valid = self.is_valid_tcno(value)         # Validate
                validated_lines.append(f"TCNo: {value if valid else 'None'}")
            else:
                validated_lines.append(line)              # Keep other lines unchanged
        return "\n".join(validated_lines)
    def process_pdf(self, pdf_path):
        """Process a single PDF file through the OCR pipeline"""
        pdf_id = os.path.splitext(os.path.basename(pdf_path))[0]
        results = {}
        # Load existing results if the output file exists
        if os.path.exists(self.output_file):
            with open(self.output_file, 'r', encoding='utf-8') as f:
                results = json.load(f)
        # Skip files that were already processed
        if pdf_id in results:
            print(f"Already processed: {pdf_path}")
            return
        print(f"Processing: {pdf_path}")
        image_paths = []
        extracted = {}
        try:
            image_paths = self.pdf_to_images(pdf_path)  # Convert pages to images
            # Process each image through the model
            for img_path in image_paths:
                response = ollama.chat(
                    model='qwen2.5vl:7b',
                    messages=[{
                        'role': 'user',
                        'content': (
                            "Extract only TCNo (Turkish Identification Number) "
                            "if visible:\n"
                            "- Must be exactly 11 digits\n"
                            "- Numbers only, no letters/symbols\n"
                        ),
                        'images': [os.path.abspath(img_path)]
                    }],
                    options={"temperature": 0}  # Deterministic output
                )
                # Handle both attribute-style and dict-style responses
                if hasattr(response, "message"):
                    content = response.message.content
                else:
                    content = response.get("message", {}).get("content", str(response))
                content = self.extract_and_validate_tcno(content)  # Validate
                print("OCR output:", content)
                extracted[os.path.basename(img_path)] = content    # Store result
            # Update results
            results[pdf_id] = {
                "file_name": os.path.basename(pdf_path),
                "content": extracted
            }
            self.update_json(results)
        except Exception as e:
            print(f"Error: {pdf_path} : {str(e)}")
            # Record the error in results (at least one error_* key)
            results[pdf_id] = {
                "file_name": os.path.basename(pdf_path),
                "content": {f"error_{i}": str(e) for i in range(max(len(image_paths), 1))}
            }
            self.update_json(results)
        finally:
            self.clear_temp_folder()  # Clean up intermediate images
            print(f"Completed: {pdf_path}")
    def process_folder(self, folder_path):
        """Process all PDF files in a directory"""
        pdf_files = [f for f in os.listdir(folder_path) if f.lower().endswith('.pdf')]
        pdf_files.sort()  # Process in a consistent order
        for pdf_file in pdf_files:
            pdf_path = os.path.join(folder_path, pdf_file)
            self.process_pdf(pdf_path)

if __name__ == "__main__":
    # Create a processor instance and start processing
    processor = PDFOCRProcessor()
    processor.process_folder("PDFs")  # Process all PDFs in the folder
    print("All processing complete!")
Conclusion
This article introduced a robust, efficient, and semantically aware pipeline for extracting Turkish Identification Numbers from scanned or native PDF documents using multimodal LLMs. The combination of PyMuPDF for fast rasterization and Ollama-powered Vision Language Models enables reliable and scalable extraction in real-world deployments.
Key Advantages
- Strong format validation (domain-specific)
- High recall and precision for TCNo fields
- JSON-based resumable output storage
- Batch processing with fault tolerance
By leveraging modern LLM capabilities in tandem with traditional document processing techniques, we can move closer to fully automated, high-accuracy document understanding systems for regulated environments.
Thanks for reading...