How to extract meaningful data from PDFs, Excel sheets, images, and 10+ file formats using smart parsing strategies
The Document Processing Dilemma
Picture this: Your application needs to process documents. Sounds simple, right? But here's the catch—documents come in countless formats: PDFs, Word files, Excel spreadsheets, PowerPoint presentations, images, HTML, JSON, XML, and more. Each format has its own quirks, structure, and extraction challenges.
Traditional solutions often involve cobbling together different libraries, writing repetitive code, and dealing with edge cases for each file type. What if there was a better way?
In this article, I'll show you how to build an extensible, production-ready document parsing system that handles 10+ file formats with a unified interface, supports both native extraction and OCR processing, and scales to handle millions of documents.
🎯 The Architecture: Three Key Principles
1. Abstraction Through Base Classes
The foundation of any great parsing system is a well-designed abstraction layer. Instead of writing separate logic for each file type, we create a BaseReader class that defines the common interface:
class BaseReader(ABC):
"""Abstract base class for all document readers"""
# File type categories
OCR_ONLY_EXTENSIONS = {".png", ".jpg", ".jpeg"}
NATIVE_ONLY_EXTENSIONS = {".csv", ".xlsx", ".xls", ".json", ".xml", ".html"}
HYBRID_EXTENSIONS = {".pdf", ".docx", ".doc", ".pptx", ".ppt", ".txt"}
@abstractmethod
def extract_content(self) -> IntermediateModel:
"""Extract content from the document"""
pass
def should_use_ocr(self) -> bool:
"""Determine whether to use OCR based on file type"""
if self.file_type in self.OCR_ONLY_EXTENSIONS:
return True
elif self.file_type in self.NATIVE_ONLY_EXTENSIONS:
return False
else: # HYBRID_EXTENSIONS
return self.processing_config.get("optical_recognition", False)
Why this matters: This design pattern allows each file type to implement its own extraction logic while maintaining a consistent interface. Adding support for a new file format? Just create a new reader class that inherits from BaseReader.
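As a quick illustration, adding Markdown support could be as small as the sketch below. The MarkdownReader name is hypothetical, and it assumes the shared _build_intermediate_model helper used by the other readers lives on BaseReader:

class MarkdownReader(BaseReader):
    """Hypothetical reader for .md files, shown only to illustrate extensibility."""

    def extract_content(self) -> IntermediateModel:
        # Markdown is plain text, so native extraction is enough
        with open(self.local_file_path, encoding="utf-8") as file:
            raw_text = file.read()

        content = {
            "page_0": {
                "text": raw_text,
                "type": "markdown",
            }
        }
        return self._build_intermediate_model(content)

Register the new class in the file-type router (shown in the next section) and the rest of the pipeline picks it up unchanged.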
2. Dual Processing Modes: Native vs OCR
Different documents require different extraction approaches:
- Native Processing: Direct content extraction using format-specific libraries (fast, preserves structure)
- OCR Processing: Optical Character Recognition for scanned documents or images (slower, handles visual content)
Our system intelligently chooses the right approach:
def get_reader_for_file(file_data: dict, processing_config: dict) -> BaseReader:
"""Return the appropriate reader instance for the file type"""
file_type = file_data.get("file_type", "").lower()
if file_type == ".pdf":
return PDFReader(file_data, processing_config)
elif file_type in [".txt", ".xml", ".html", ".json", ".csv"]:
return TextReader(file_data, processing_config)
elif file_type in [".xlsx", ".xls"]:
return ExcelReader(file_data, processing_config)
elif file_type in [".png", ".jpg", ".jpeg"]:
return ImageReader(file_data, processing_config)
elif file_type in [".ppt", ".pptx"]:
return PresentationReader(file_data, processing_config)
elif file_type in [".doc", ".docx"]:
return WordReader(file_data, processing_config)
else:
raise ValueError(f"Unsupported file type: {file_type}")
3. Standardized Output Format
Regardless of input format, all readers output a consistent IntermediateModel:
intermediate_model = IntermediateModel(
file_id=self.file_id,
file_name=self.file_name,
file_url=self.file_url,
file_type=self.file_type,
kb_id=self.kb_id,
content=content # Structured content dictionary
)
This standardization makes downstream processing (chunking, indexing, searching) incredibly simple.
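As a rough illustration, a downstream chunking step only ever has to handle one shape of data, whatever the original file was. This sketch assumes IntermediateModel exposes its fields as attributes; the function name and chunk size are placeholders:

def chunk_intermediate_model(model: IntermediateModel, max_chars: int = 1000) -> list[dict]:
    """Split every page/sheet/slide of an IntermediateModel into fixed-size chunks."""
    chunks = []
    for page_key, page in model.content.items():
        text = page.get("text", "")
        for start in range(0, len(text), max_chars):
            chunks.append({
                "file_id": model.file_id,
                "source": page_key,
                "text": text[start:start + max_chars],
            })
    return chunks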
📄 Deep Dive: Format-Specific Strategies
PDF Processing: The Hybrid Approach
PDFs are tricky because they can contain both machine-readable text and scanned images. Our PDFReader handles both scenarios:
class PDFReader(BaseReader):
def extract_content(self) -> IntermediateModel:
use_ocr = self.should_use_ocr()
extract_images = self.processing_config.get("image_2_text", False)
if use_ocr:
content = self._extract_with_ocr() # AWS Textract
else:
content = self._extract_native() # PyPDF2/pdfplumber
if extract_images:
self._extract_and_save_images(content) # PyMuPDF
return self._build_intermediate_model(content)
Native extraction using PyPDF2 is lightning-fast for text-based PDFs:
def _extract_native(self) -> dict:
with open(self.local_file_path, "rb") as file:
pdf_reader = PdfReader(file)
content = {}
for page_num, page in enumerate(pdf_reader.pages):
page_text = page.extract_text()
content[f"page_{page_num}"] = {
"text": page_text.strip(),
"type": "pdf_page",
"page_number": page_num + 1,
"images": []
}
return content
Pro tip: For PDFs with embedded images or complex layouts, OCR processing with AWS Textract provides superior results, detecting tables, forms, and relationships between elements.
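For reference, the asynchronous Textract flow for a multi-page PDF stored in S3 looks roughly like this. The bucket and key are placeholders, and the polling loop is deliberately simplified:

import time
import boto3

def analyze_pdf_with_textract(bucket: str, key: str) -> list[dict]:
    """Start an async Textract analysis job for a PDF and wait for the result."""
    textract = boto3.client("textract")
    job = textract.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["TABLES", "FORMS"],  # detect tables and key-value pairs
    )
    job_id = job["JobId"]

    # Naive polling; production code should back off and paginate with NextToken
    while True:
        result = textract.get_document_analysis(JobId=job_id)
        if result["JobStatus"] in ("SUCCEEDED", "FAILED"):
            break
        time.sleep(5)

    if result["JobStatus"] == "FAILED":
        raise RuntimeError("Textract analysis failed")
    return result["Blocks"]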
Excel & Spreadsheets: Preserving Structure
Excel files contain structured data that must be preserved. Our ExcelReader uses openpyxl to maintain the table structure:
class ExcelReader(BaseReader):
def _process_worksheet(self, worksheet, sheet_name: str) -> dict:
# Extract with table structure
rows_data = []
for row in worksheet.iter_rows(values_only=True):
row_values = [str(cell) if cell is not None else "" for cell in row]
if any(row_values): # Skip empty rows
rows_data.append(row_values)
# Format as structured text
text_content = self._format_table_as_text(rows_data)
return {
"text": text_content,
"type": "excel_sheet",
"sheet_name": sheet_name,
"row_count": len(rows_data)
}
Key insight: Converting tabular data to readable text format makes it searchable while preserving the relationship between cells.
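One plausible shape for the _format_table_as_text helper referenced above is a pipe-delimited layout. This is an assumption about the implementation, not the original code:

def _format_table_as_text(self, rows_data: list[list[str]]) -> str:
    """Render extracted rows as pipe-separated lines, treating the first row as a header."""
    if not rows_data:
        return ""
    lines = [" | ".join(rows_data[0])]   # header row
    lines.append("-" * len(lines[0]))    # visual separator
    for row in rows_data[1:]:
        lines.append(" | ".join(row))
    return "\n".join(lines)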
Images: OCR is Your Best Friend
Images require OCR processing since there's no native text to extract. Our ImageReader leverages AWS Textract:
class ImageReader(BaseReader):
def _extract_with_textract(self) -> dict:
textract = boto3.client("textract")
with open(self.local_file_path, "rb") as file:
image_bytes = file.read()
# Call Textract to detect text
response = textract.detect_document_text(
Document={"Bytes": image_bytes}
)
return self._process_textract_response(response)
def _process_textract_response(self, response: dict) -> dict:
# Extract lines and words from Textract response
lines = []
for block in response.get("Blocks", []):
if block["BlockType"] == "LINE":
lines.append(block["Text"])
return {
"page_0": {
"text": "\n".join(lines),
"type": "image",
"ocr_status": "success"
}
}
Best practice: Always implement file size checks—Textract has a 10MB limit for synchronous calls. For larger files, use asynchronous processing.
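A size guard in front of the synchronous call might look like this. The 10 MB threshold mirrors the documented limit, and _extract_with_textract_async is a placeholder for your asynchronous path:

import os

MAX_SYNC_BYTES = 10 * 1024 * 1024  # 10 MB synchronous limit

def extract_image_text(self) -> dict:
    """Route to sync or async Textract depending on file size."""
    file_size = os.path.getsize(self.local_file_path)
    if file_size <= MAX_SYNC_BYTES:
        return self._extract_with_textract()  # sync call shown above
    # Placeholder: upload to S3 and use Textract's async text-detection API instead
    return self._extract_with_textract_async()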
Word Documents: Handle Both .docx and Legacy .doc
Modern .docx files use python-docx for native extraction, while legacy .doc files require OCR:
class WordReader(BaseReader):
def extract_content(self) -> IntermediateModel:
use_ocr = self.should_use_ocr()
if use_ocr:
return self._extract_with_ocr()
else:
try:
return self._extract_native()
except Exception as e:
if self.file_type == ".doc":
# Legacy format not supported natively
return self._return_unsupported_status(
"Enable OCR to process .doc files"
)
raise e
The native extraction preserves document structure including paragraphs, tables, and images:
def _extract_native(self) -> dict:
doc = Document(self.local_file_path)
content = {}
page_index = 0
for element in doc.element.body:
if isinstance(element, CT_P): # Paragraph
para = Paragraph(element, doc)
content[f"page_{page_index}"] = {
"text": para.text,
"type": "paragraph"
}
elif isinstance(element, CT_Tbl): # Table
table = Table(element, doc)
table_text = self._extract_table_text(table)
content[f"page_{page_index}"] = {
"text": table_text,
"type": "table"
}
page_index += 1
return content
PowerPoint: Extract Slides, Text, and Images
Presentations combine text, images, and structured layouts. Our PresentationReader handles all of it:
class PresentationReader(BaseReader):
def _extract_native(self) -> dict:
prs = Presentation(self.local_file_path)
content = {}
for slide_num, slide in enumerate(prs.slides):
slide_content = {
"text": "",
"type": "slide",
"slide_number": slide_num + 1,
"images": []
}
# Extract text from all shapes
text_parts = []
for shape in slide.shapes:
if hasattr(shape, "text"):
text_parts.append(shape.text)
# Handle tables
if shape.shape_type == MSO_SHAPE_TYPE.TABLE:
table_text = self._extract_table_text(shape.table)
text_parts.append(table_text)
slide_content["text"] = "\n".join(text_parts)
content[f"page_{slide_num}"] = slide_content
return content
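If you also want the embedded pictures, python-pptx exposes the raw bytes on picture shapes. This sketch assumes the same S3 upload attributes used by the PDF reader later in the article:

import uuid
from pptx.enum.shapes import MSO_SHAPE_TYPE

def _extract_slide_images(self, slide, slide_num: int) -> list[str]:
    """Collect embedded pictures from a slide and return their storage URLs."""
    image_urls = []
    for shape in slide.shapes:
        if shape.shape_type == MSO_SHAPE_TYPE.PICTURE:
            image = shape.image  # pptx image part with raw bytes and extension
            image_key = f"images/{self.kb_id}/{uuid.uuid4()}.{image.ext}"
            self.s3_client.put_object(
                Bucket=self.bucket_name, Key=image_key, Body=image.blob
            )
            image_urls.append(f"s3://{self.bucket_name}/{image_key}")
    return image_urls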
Text-Based Formats: HTML, XML, JSON, CSV
For text-based formats, our TextReader uses built-in Python libraries to minimize dependencies:
class TextReader(BaseReader):
def _extract_native(self) -> dict:
with open(self.local_file_path, encoding=self._detect_encoding()) as file:
raw_content = file.read()
if self.file_type == ".txt":
processed_text = raw_content
elif self.file_type == ".xml":
processed_text = self._process_xml(raw_content)
elif self.file_type == ".html":
processed_text = self._process_html(raw_content)
elif self.file_type == ".json":
processed_text = self._process_json(raw_content)
elif self.file_type == ".csv":
processed_text = self._process_csv(raw_content)
return {
"page_0": {
"text": processed_text,
"type": self.file_type.replace(".", "")
}
}
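The per-format helpers referenced above can be thin wrappers over the standard library. Here is one plausible shape for the CSV and JSON cases; the bodies are assumptions, not the original implementation:

import csv
import io
import json

def _process_csv(self, raw_content: str) -> str:
    """Turn CSV rows into pipe-separated lines so they stay readable as text."""
    reader = csv.reader(io.StringIO(raw_content))
    return "\n".join(" | ".join(row) for row in reader)

def _process_json(self, raw_content: str) -> str:
    """Pretty-print JSON so nested keys remain visible to keyword search."""
    try:
        return json.dumps(json.loads(raw_content), indent=2, ensure_ascii=False)
    except json.JSONDecodeError:
        return raw_content  # fall back to the raw text if parsing fails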
HTML extraction uses a custom parser to strip tags while preserving content:
class HTMLTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.text_content = []
        self.current_tag = None
    def handle_starttag(self, tag: str, attrs):
        # Track the enclosing tag so script/style bodies can be skipped
        self.current_tag = tag
    def handle_endtag(self, tag: str):
        self.current_tag = None
    def handle_data(self, data: str):
        if self.current_tag not in ["script", "style"]:
            cleaned = data.strip()
            if cleaned:
                self.text_content.append(cleaned)
    def get_text(self) -> str:
        return " ".join(self.text_content)
🎨 Image Extraction: Going Beyond Text
Many documents contain valuable information in images. Our system can extract and save images separately:
def _extract_and_save_images(self, content: dict) -> None:
"""Extract images from PDF and save to S3"""
pdf_document = fitz.open(self.local_file_path)
for page_num in range(len(pdf_document)):
page = pdf_document[page_num]
image_list = page.get_images()
for img_index, img in enumerate(image_list):
xref = img[0]
base_image = pdf_document.extract_image(xref)
image_bytes = base_image["image"]
# Generate unique image identifier
image_id = str(uuid.uuid4())
image_key = f"images/{self.kb_id}/{image_id}.{base_image['ext']}"
# Upload to S3
self.s3_client.put_object(
Bucket=self.bucket_name,
Key=image_key,
Body=image_bytes
)
# Track image URL in content
image_url = f"s3://{self.bucket_name}/{image_key}"
content[f"page_{page_num}"]["images"].append(image_url)
This enables downstream image-to-text processing or multimodal AI applications.
⚡ Performance Optimization Strategies
1. Graceful Error Handling
Never let one problematic file crash the entire batch. Keep processing resilient and report partial successes:
def process_single_file(file_data: dict, config: dict) -> dict:
    try:
        # Keep reader selection inside the try so an unsupported type
        # is reported as an error instead of crashing the batch
        reader = get_reader_for_file(file_data, config)
        return {"status": "ok", "result": reader.process()}
    except Exception as e:  # Narrow in real code
        return {
            "status": "error",
            "file_id": file_data.get("file_id"),
            "file_name": file_data.get("file_name"),
            "error": str(e),
        }
2. Numeric Type Normalization
Different storage layers (NoSQL, relational ORMs, JSON parsers) may return numeric wrappers. Normalize before serialization:
from decimal import Decimal

def normalize_numeric(value):
    # Integers remain integers; decimals become float for JSON
    if isinstance(value, Decimal):
        return int(value) if value % 1 == 0 else float(value)
    return value
def deep_normalize(obj):
if isinstance(obj, dict):
return {k: deep_normalize(v) for k, v in obj.items()}
if isinstance(obj, list):
return [deep_normalize(i) for i in obj]
return normalize_numeric(obj)
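For example, a record coming back from a store that wraps numbers in Decimal serializes cleanly after normalization:

from decimal import Decimal

record = {"file_id": "abc-123", "page_count": Decimal("12"), "ocr_confidence": Decimal("0.93")}
print(deep_normalize(record))
# {'file_id': 'abc-123', 'page_count': 12, 'ocr_confidence': 0.93}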
🔧 Putting It All Together: Generic Batch Pipeline
A framework-agnostic example that you can call from a CLI, a web job, or a worker queue:
def process_documents(files: list[dict], config: dict) -> dict:
"""Process a list of file metadata dictionaries.
files: each dict contains at least file_id, file_name, file_type, file_url
config: processing flags (e.g., {"optical_recognition": True})
"""
results = []
for f in files:
normalized = deep_normalize(f)
results.append(process_single_file(normalized, config))
summary = {
"total": len(results),
"success": sum(1 for r in results if r.get("status") == "ok"),
"errors": [r for r in results if r.get("status") == "error"],
"results": results,
}
return summary
# Example usage:
# documents = load_pending_documents_from_store()
# outcome = process_documents(documents, {"optical_recognition": False})
# print(json.dumps(outcome, indent=2))
🚀 Real-World Benefits
This architecture delivers:
- Extensibility: Add new file formats by creating a new reader class
- Flexibility: Switch between native and OCR processing per document
- Scalability: Works across threads, workers, or serverless functions
- Reliability: Graceful error handling and status tracking
- Maintainability: Clean abstraction layer and consistent interfaces
💡 Key Takeaways
Building a production-ready document parsing system requires:
✅ Strong abstraction layer with base classes defining common interfaces
✅ Format-specific strategies using the best library for each file type
✅ Dual processing modes (native + OCR) for maximum coverage
✅ Standardized output making downstream processing trivial
✅ Robust error handling to prevent pipeline failures
✅ Performance optimization through smart dependency management
What's Next?
Consider these enhancements for your parsing system:
- Multimodal processing: Combine text with image embeddings
- Table extraction: Preserve table structure for structured data
- Layout analysis: Maintain document formatting and hierarchy
- Incremental processing: Handle document updates efficiently
- Quality metrics: Track extraction confidence and completeness
Conclusion
Document parsing doesn't have to be painful. With the right architecture—abstraction, format-specific strategies, and intelligent processing modes—you can build a system that handles any file format thrown at it.
The key is to think in terms of interfaces, not implementations. By defining a clear contract through the BaseReader class, you create a system that's both powerful and maintainable.
Whether you're building a knowledge base, search engine, or AI application, these patterns will serve you well. Start with the formats you need today, and confidently add new ones tomorrow.
Have you built a document parsing system? What challenges did you face? Share your experiences in the comments below!
Resources
- PyPDF2: PDF text extraction
- openpyxl: Excel file processing
- python-docx: Word document handling
- python-pptx: PowerPoint processing
- AWS Textract: OCR service for images and scanned documents
- PyMuPDF (fitz): PDF image extraction
About the Author
Written by Suraj Khaitan
— Gen AI Architect | Working on serverless AI & cloud platforms.