How to extract meaningful data from PDFs, Excel sheets, images, and 10+ file formats using smart parsing strategies
The Document Processing Dilemma
Picture this: Your application needs to process documents. Sounds simple, right? But here's the catch—documents come in countless formats: PDFs, Word files, Excel spreadsheets, PowerPoint presentations, images, HTML, JSON, XML, and more. Each format has its own quirks, structure, and extraction challenges.
Traditional solutions often involve cobbling together different libraries, writing repetitive code, and dealing with edge cases for each file type. What if there was a better way?
In this article, I'll show you how to build an extensible, production-ready document parsing system that handles 10+ file formats with a unified interface, supports both native extraction and OCR processing, and scales to handle millions of documents.
🎯 The Architecture: Three Key Principles
1. Abstraction Through Base Classes
The foundation of any great parsing system is a well-designed abstraction layer. Instead of writing separate logic for each file type, we create a BaseReader class that defines the common interface:
class BaseReader(ABC):
"""Abstract base class for all document readers"""
# File type categories
OCR_ONLY_EXTENSIONS = {".png", ".jpg", ".jpeg"}
NATIVE_ONLY_EXTENSIONS = {".csv", ".xlsx", ".xls", ".json", ".xml", ".html"}
HYBRID_EXTENSIONS = {".pdf", ".docx", ".doc", ".pptx", ".ppt", ".txt"}
@abstractmethod
def extract_content(self) -> IntermediateModel:
"""Extract content from the document"""
pass
def should_use_ocr(self) -> bool:
"""Determine whether to use OCR based on file type"""
if self.file_type in self.OCR_ONLY_EXTENSIONS:
return True
elif self.file_type in self.NATIVE_ONLY_EXTENSIONS:
return False
else: # HYBRID_EXTENSIONS
return self.processing_config.get("optical_recognition", False)
Why this matters: This design pattern allows each file type to implement its own extraction logic while maintaining a consistent interface. Adding support for a new file format? Just create a new reader class that inherits from BaseReader.
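As a quick illustration, adding Markdown support could be as small as the sketch below. The MarkdownReader name is hypothetical, and it assumes the shared _build_intermediate_model helper used by the other readers lives on BaseReader:

class MarkdownReader(BaseReader):
    """Hypothetical reader for .md files, shown only to illustrate extensibility."""

    def extract_content(self) -> IntermediateModel:
        # Markdown is plain text, so native extraction is enough
        with open(self.local_file_path, encoding="utf-8") as file:
            raw_text = file.read()

        content = {
            "page_0": {
                "text": raw_text,
                "type": "markdown",
            }
        }
        return self._build_intermediate_model(content)

Register the new class in the file-type router (shown in the next section) and the rest of the pipeline picks it up unchanged.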
2. Dual Processing Modes: Native vs OCR
Different documents require different extraction approaches:
- Native Processing: Direct content extraction using format-specific libraries (fast, preserves structure)
- OCR Processing: Optical Character Recognition for scanned documents or images (slower, handles visual content)
Our system intelligently chooses the right approach:
def get_reader_for_file(file_data: dict, processing_config: dict) -> BaseReader:
"""Return the appropriate reader instance for the file type"""
file_type = file_data.get("file_type", "").lower()
if file_type == ".pdf":
return PDFReader(file_data, processing_config)
elif file_type in [".txt", ".xml", ".html", ".json", ".csv"]:
return TextReader(file_data, processing_config)
elif file_type in [".xlsx", ".xls"]:
return ExcelReader(file_data, processing_config)
elif file_type in [".png", ".jpg", ".jpeg"]:
return ImageReader(file_data, processing_config)
elif file_type in [".ppt", ".pptx"]:
return PresentationReader(file_data, processing_config)
elif file_type in [".doc", ".docx"]:
return WordReader(file_data, processing_config)
else:
raise ValueError(f"Unsupported file type: {file_type}")
3. Standardized Output Format
Regardless of input format, all readers output a consistent IntermediateModel:
intermediate_model = IntermediateModel(
file_id=self.file_id,
file_name=self.file_name,
file_url=self.file_url,
file_type=self.file_type,
kb_id=self.kb_id,
content=content # Structured content dictionary
)
This standardization makes downstream processing (chunking, indexing, searching) incredibly simple.
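As a rough illustration, a downstream chunking step only ever has to handle one shape of data, whatever the original file was. This sketch assumes IntermediateModel exposes its fields as attributes; the function name and chunk size are placeholders:

def chunk_intermediate_model(model: IntermediateModel, max_chars: int = 1000) -> list[dict]:
    """Split every page/sheet/slide of an IntermediateModel into fixed-size chunks."""
    chunks = []
    for page_key, page in model.content.items():
        text = page.get("text", "")
        for start in range(0, len(text), max_chars):
            chunks.append({
                "file_id": model.file_id,
                "source": page_key,
                "text": text[start:start + max_chars],
            })
    return chunks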
📄 Deep Dive: Format-Specific Strategies
PDF Processing: The Hybrid Approach
PDFs are tricky because they can contain both machine-readable text and scanned images. Our PDFReader handles both scenarios:
class PDFReader(BaseReader):
def extract_content(self) -> IntermediateModel:
use_ocr = self.should_use_ocr()
extract_images = self.processing_config.get("image_2_text", False)
if use_ocr:
content = self._extract_with_ocr() # AWS Textract
else:
content = self._extract_native() # PyPDF2/pdfplumber
if extract_images:
self._extract_and_save_images(content) # PyMuPDF
return self._build_intermediate_model(content)
Native extraction using PyPDF2 is lightning-fast for text-based PDFs:
def _extract_native(self) -> dict:
with open(self.local_file_path, "rb") as file:
pdf_reader = PdfReader(file)
content = {}
for page_num, page in enumerate(pdf_reader.pages):
page_text = page.extract_text()
content[f"page_{page_num}"] = {
"text": page_text.strip(),
"type": "pdf_page",
"page_number": page_num + 1,
"images": []
}
return content
Pro tip: For PDFs with embedded images or complex layouts, OCR processing with AWS Textract provides superior results, detecting tables, forms, and relationships between elements.
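For reference, the asynchronous Textract flow for a multi-page PDF stored in S3 looks roughly like this. The bucket and key are placeholders, and the polling loop is deliberately simplified:

import time
import boto3

def analyze_pdf_with_textract(bucket: str, key: str) -> list[dict]:
    """Start an async Textract analysis job for a PDF and wait for the result."""
    textract = boto3.client("textract")
    job = textract.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["TABLES", "FORMS"],  # detect tables and key-value pairs
    )
    job_id = job["JobId"]

    # Naive polling; production code should back off and paginate with NextToken
    while True:
        result = textract.get_document_analysis(JobId=job_id)
        if result["JobStatus"] in ("SUCCEEDED", "FAILED"):
            break
        time.sleep(5)

    if result["JobStatus"] == "FAILED":
        raise RuntimeError("Textract analysis failed")
    return result["Blocks"]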
Excel & Spreadsheets: Preserving Structure
Excel files contain structured data that must be preserved. Our ExcelReader uses openpyxl to maintain the table structure:
class ExcelReader(BaseReader):
def _process_worksheet(self, worksheet, sheet_name: str) -> dict:
# Extract with table structure
rows_data = []
for row in worksheet.iter_rows(values_only=True):
row_values = [str(cell) if cell is not None else "" for cell in row]
if any(row_values): # Skip empty rows
rows_data.append(row_values)
# Format as structured text
text_content = self._format_table_as_text(rows_data)
return {
"text": text_content,
"type": "excel_sheet",
"sheet_name": sheet_name,
"row_count": len(rows_data)
}
Key insight: Converting tabular data to readable text format makes it searchable while preserving the relationship between cells.
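One plausible shape for the _format_table_as_text helper referenced above is a pipe-delimited layout. This is an assumption about the implementation, not the original code:

def _format_table_as_text(self, rows_data: list[list[str]]) -> str:
    """Render extracted rows as pipe-separated lines, treating the first row as a header."""
    if not rows_data:
        return ""
    lines = [" | ".join(rows_data[0])]   # header row
    lines.append("-" * len(lines[0]))    # visual separator
    for row in rows_data[1:]:
        lines.append(" | ".join(row))
    return "\n".join(lines)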
Images: OCR is Your Best Friend
Images require OCR processing since there's no native text to extract. Our ImageReader leverages AWS Textract:
class ImageReader(BaseReader):
def _extract_with_textract(self) -> dict:
textract = boto3.client("textract")
with open(self.local_file_path, "rb") as file:
image_bytes = file.read()
# Call Textract to detect text
response = textract.detect_document_text(
Document={"Bytes": image_bytes}
)
return self._process_textract_response(response)
def _process_textract_response(self, response: dict) -> dict:
# Extract lines and words from Textract response
lines = []
for block in response.get("Blocks", []):
if block["BlockType"] == "LINE":
lines.append(block["Text"])
return {
"page_0": {
"text": "\n".join(lines),
"type": "image",
"ocr_status": "success"
}
}
Best practice: Always implement file size checks—Textract has a 10MB limit for synchronous calls. For larger files, use asynchronous processing.
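A size guard in front of the synchronous call might look like this. The 10 MB threshold mirrors the documented limit, and _extract_with_textract_async is a placeholder for your asynchronous path:

import os

MAX_SYNC_BYTES = 10 * 1024 * 1024  # 10 MB synchronous limit

def extract_image_text(self) -> dict:
    """Route to sync or async Textract depending on file size."""
    file_size = os.path.getsize(self.local_file_path)
    if file_size <= MAX_SYNC_BYTES:
        return self._extract_with_textract()  # sync call shown above
    # Placeholder: upload to S3 and use Textract's async text-detection API instead
    return self._extract_with_textract_async()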
Word Documents: Handle Both .docx and Legacy .doc
Modern .docx files use python-docx for native extraction, while legacy .doc files require OCR:
class WordReader(BaseReader):
def extract_content(self) -> IntermediateModel:
use_ocr = self.should_use_ocr()
if use_ocr:
return self._extract_with_ocr()
else:
try:
return self._extract_native()
except Exception as e:
if self.file_type == ".doc":
# Legacy format not supported natively
return self._return_unsupported_status(
"Enable OCR to process .doc files"
)
raise e
The native extraction preserves document structure including paragraphs, tables, and images:
def _extract_native(self) -> dict:
doc = Document(self.local_file_path)
content = {}
page_index = 0
for element in doc.element.body:
if isinstance(element, CT_P): # Paragraph
para = Paragraph(element, doc)
content[f"page_{page_index}"] = {
"text": para.text,
"type": "paragraph"
}
elif isinstance(element, CT_Tbl): # Table
table = Table(element, doc)
table_text = self._extract_table_text(table)
content[f"page_{page_index}"] = {
"text": table_text,
"type": "table"
}
page_index += 1
return content
PowerPoint: Extract Slides, Text, and Images
Presentations combine text, images, and structured layouts. Our PresentationReader handles all of it:
class PresentationReader(BaseReader):
def _extract_native(self) -> dict:
prs = Presentation(self.local_file_path)
content = {}
for slide_num, slide in enumerate(prs.slides):
slide_content = {
"text": "",
"type": "slide",
"slide_number": slide_num + 1,
"images": []
}
# Extract text from all shapes
text_parts = []
for shape in slide.shapes:
if hasattr(shape, "text"):
text_parts.append(shape.text)
# Handle tables
if shape.shape_type == MSO_SHAPE_TYPE.TABLE:
table_text = self._extract_table_text(shape.table)
text_parts.append(table_text)
slide_content["text"] = "\n".join(text_parts)
content[f"page_{slide_num}"] = slide_content
return content
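If you also want the embedded pictures, python-pptx exposes the raw bytes on picture shapes. This sketch assumes the same S3 upload attributes used by the PDF reader later in the article:

import uuid
from pptx.enum.shapes import MSO_SHAPE_TYPE

def _extract_slide_images(self, slide, slide_num: int) -> list[str]:
    """Collect embedded pictures from a slide and return their storage URLs."""
    image_urls = []
    for shape in slide.shapes:
        if shape.shape_type == MSO_SHAPE_TYPE.PICTURE:
            image = shape.image  # pptx image part with raw bytes and extension
            image_key = f"images/{self.kb_id}/{uuid.uuid4()}.{image.ext}"
            self.s3_client.put_object(
                Bucket=self.bucket_name, Key=image_key, Body=image.blob
            )
            image_urls.append(f"s3://{self.bucket_name}/{image_key}")
    return image_urls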
Text-Based Formats: HTML, XML, JSON, CSV
For text-based formats, our TextReader uses built-in Python libraries to minimize dependencies:
class TextReader(BaseReader):
def _extract_native(self) -> dict:
with open(self.local_file_path, encoding=self._detect_encoding()) as file:
raw_content = file.read()
if self.file_type == ".txt":
processed_text = raw_content
elif self.file_type == ".xml":
processed_text = self._process_xml(raw_content)
elif self.file_type == ".html":
processed_text = self._process_html(raw_content)
elif self.file_type == ".json":
processed_text = self._process_json(raw_content)
elif self.file_type == ".csv":
processed_text = self._process_csv(raw_content)
return {
"page_0": {
"text": processed_text,
"type": self.file_type.replace(".", "")
}
}
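The per-format helpers referenced above can be thin wrappers over the standard library. Here is one plausible shape for the CSV and JSON cases; the bodies are assumptions, not the original implementation:

import csv
import io
import json

def _process_csv(self, raw_content: str) -> str:
    """Turn CSV rows into pipe-separated lines so they stay readable as text."""
    reader = csv.reader(io.StringIO(raw_content))
    return "\n".join(" | ".join(row) for row in reader)

def _process_json(self, raw_content: str) -> str:
    """Pretty-print JSON so nested keys remain visible to keyword search."""
    try:
        return json.dumps(json.loads(raw_content), indent=2, ensure_ascii=False)
    except json.JSONDecodeError:
        return raw_content  # fall back to the raw text if parsing fails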
HTML extraction uses a custom parser to strip tags while preserving content:
class HTMLTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.text_content = []
        self.current_tag = None
    def handle_starttag(self, tag: str, attrs):
        # Track the enclosing tag so script/style bodies can be skipped
        self.current_tag = tag
    def handle_endtag(self, tag: str):
        self.current_tag = None
    def handle_data(self, data: str):
        if self.current_tag not in ["script", "style"]:
            cleaned = data.strip()
            if cleaned:
                self.text_content.append(cleaned)
    def get_text(self) -> str:
        return " ".join(self.text_content)
🎨 Image Extraction: Going Beyond Text
Many documents contain valuable information in images. Our system can extract and save images separately:
def _extract_and_save_images(self, content: dict) -> None:
"""Extract images from PDF and save to S3"""
pdf_document = fitz.open(self.local_file_path)
for page_num in range(len(pdf_document)):
page = pdf_document[page_num]
image_list = page.get_images()
for img_index, img in enumerate(image_list):
xref = img[0]
base_image = pdf_document.extract_image(xref)
image_bytes = base_image["image"]
# Generate unique image identifier
image_id = str(uuid.uuid4())
image_key = f"images/{self.kb_id}/{image_id}.{base_image['ext']}"
# Upload to S3
self.s3_client.put_object(
Bucket=self.bucket_name,
Key=image_key,
Body=image_bytes
)
# Track image URL in content
image_url = f"s3://{self.bucket_name}/{image_key}"
content[f"page_{page_num}"]["images"].append(image_url)
This enables downstream image-to-text processing or multimodal AI applications.
⚡ Performance Optimization Strategies
1. Graceful Error Handling
Never let one problematic file crash the entire batch. Keep processing resilient and report partial successes:
def process_single_file(file_data: dict, config: dict) -> dict:
    try:
        # Keep reader selection inside the try so an unsupported type
        # is reported as an error instead of crashing the batch
        reader = get_reader_for_file(file_data, config)
        return {"status": "ok", "result": reader.process()}
    except Exception as e:  # Narrow in real code
        return {
            "status": "error",
            "file_id": file_data.get("file_id"),
            "file_name": file_data.get("file_name"),
            "error": str(e),
        }
2. Numeric Type Normalization
Different storage layers (NoSQL, relational ORMs, JSON parsers) may return numeric wrappers. Normalize before serialization:
from decimal import Decimal

def normalize_numeric(value):
    # Integers remain integers; decimals become float for JSON
    if isinstance(value, Decimal):
        return int(value) if value % 1 == 0 else float(value)
    return value
def deep_normalize(obj):
if isinstance(obj, dict):
return {k: deep_normalize(v) for k, v in obj.items()}
if isinstance(obj, list):
return [deep_normalize(i) for i in obj]
return normalize_numeric(obj)
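For example, a record coming back from a store that wraps numbers in Decimal serializes cleanly after normalization:

from decimal import Decimal

record = {"file_id": "abc-123", "page_count": Decimal("12"), "ocr_confidence": Decimal("0.93")}
print(deep_normalize(record))
# {'file_id': 'abc-123', 'page_count': 12, 'ocr_confidence': 0.93}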
🔧 Putting It All Together: Generic Batch Pipeline
A framework-agnostic example that you can call from a CLI, a web job, or a worker queue:
def process_documents(files: list[dict], config: dict) -> dict:
"""Process a list of file metadata dictionaries.
files: each dict contains at least file_id, file_name, file_type, file_url
config: processing flags (e.g., {"optical_recognition": True})
"""
results = []
for f in files:
normalized = deep_normalize(f)
results.append(process_single_file(normalized, config))
summary = {
"total": len(results),
"success": sum(1 for r in results if r.get("status") == "ok"),
"errors": [r for r in results if r.get("status") == "error"],
"results": results,
}
return summary
# Example usage:
# documents = load_pending_documents_from_store()
# outcome = process_documents(documents, {"optical_recognition": False})
# print(json.dumps(outcome, indent=2))
🚀 Real-World Benefits
This architecture delivers:
- Extensibility: Add new file formats by creating a new reader class
- Flexibility: Switch between native and OCR processing per document
- Scalability: Works across threads, workers, or serverless functions
- Reliability: Graceful error handling and status tracking
- Maintainability: Clean abstraction layer and consistent interfaces
💡 Key Takeaways
Building a production-ready document parsing system requires:
✅ Strong abstraction layer with base classes defining common interfaces
✅ Format-specific strategies using the best library for each file type
✅ Dual processing modes (native + OCR) for maximum coverage
✅ Standardized output making downstream processing trivial
✅ Robust error handling to prevent pipeline failures
✅ Performance optimization through smart dependency management
What's Next?
Consider these enhancements for your parsing system:
- Multimodal processing: Combine text with image embeddings
- Table extraction: Preserve table structure for structured data
- Layout analysis: Maintain document formatting and hierarchy
- Incremental processing: Handle document updates efficiently
- Quality metrics: Track extraction confidence and completeness
Conclusion
Document parsing doesn't have to be painful. With the right architecture—abstraction, format-specific strategies, and intelligent processing modes—you can build a system that handles any file format thrown at it.
The key is to think in terms of interfaces, not implementations. By defining a clear contract through the BaseReader class, you create a system that's both powerful and maintainable.
Whether you're building a knowledge base, search engine, or AI application, these patterns will serve you well. Start with the formats you need today, and confidently add new ones tomorrow.
Have you built a document parsing system? What challenges did you face? Share your experiences in the comments below!
Resources
- PyPDF2: PDF text extraction
- openpyxl: Excel file processing
- python-docx: Word document handling
- python-pptx: PowerPoint processing
- AWS Textract: OCR service for images and scanned documents
- PyMuPDF (fitz): PDF image extraction
About the Author
Written by Suraj Khaitan
— Gen AI Architect | Working on serverless AI & cloud platforms.