Why I Built a Pure Python Library for Legacy Office Files (And Why RAG Pipelines Need One)

#ai #machinelearning #python #opensource

If you're building RAG pipelines or document ingestion for LLM agents, you've probably solved the easy part already. Modern Office files? No problem. python-docx, openpyxl, python-pptx — pick your library, extract your text, move on.

Then someone points your pipeline at an enterprise SharePoint.

The Legacy File Problem

Enterprise SharePoints are digital archaeology sites. Marketing uploaded PowerPoints in 2008. Legal has Word documents from 2005. Finance runs on Excel files that predate most of your team's careers.

These aren't edge cases. In my experience, legacy .doc, .xls, and .ppt files make up a significant chunk of any long-running enterprise document store. And if you're building a system that needs to ingest "all the documents," you can't just skip them.

Why Existing Solutions Didn't Work for Me

I needed to process these files in AWS Lambda functions for a RAG pipeline. My options were:

LibreOffice

The standard answer. Install LibreOffice, run it headless, convert files to text. It works, but it adds over 1GB to your container image. Lambda caps zip deployment packages at 250MB unzipped (container images can go up to 10GB, but that is still a lot of weight for one extraction step). Plus, configuring headless LibreOffice is its own adventure.
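For reference, the headless conversion usually comes down to a single command-line call. The sketch below wraps that call in Python. The helper names and the 120-second timeout are illustrative choices, and it assumes a `soffice` binary is available on the PATH.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(source: Path, out_dir: Path, binary: str = "soffice") -> list[str]:
    """Build the argument list for a headless LibreOffice text conversion."""
    return [
        binary,
        "--headless",
        "--convert-to", "txt:Text",   # plain-text export filter
        "--outdir", str(out_dir),
        str(source),
    ]


def convert_to_text(source: Path, out_dir: Path, timeout: float = 120.0) -> str:
    """Convert one document to text. Raises if LibreOffice is not installed."""
    if shutil.which("soffice") is None:
        raise RuntimeError("soffice not found on PATH")
    # Passing a list (not a shell string) avoids shell interpretation of the path.
    subprocess.run(build_convert_command(source, out_dir), check=True, timeout=timeout)
    # LibreOffice writes <input stem>.txt into the output directory.
    return (out_dir / f"{source.stem}.txt").read_text(encoding="utf-8", errors="replace")
```

Even in this small form, the function depends on an installed binary, a writable output directory, and a child process, which is exactly the baggage that gets awkward in Lambda.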

Apache Tika

Solid tool, widely used. But it requires a Java runtime and typically runs as a separate server. That's another service to deploy, monitor, and secure. For a document extraction step in a pipeline, it felt like overkill.

Subprocess calls to command-line tools

Command-line tools such as antiword or catdoc exist that you can shell out to. But subprocess calls are a security concern, they break in restricted environments, and they make your code platform-dependent.

I wanted something simpler: a Python library I could pip install and call.

Building sharepoint-to-text

So I built sharepoint-to-text.

The core idea: parse both the legacy Office binary formats (OLE2 containers) and the modern XML-based formats (OOXML) directly in Python. No LibreOffice, no Java runtime, no subprocess calls. Just text extraction.

import sharepoint2text

# Works the same for legacy or modern files
result = sharepoint2text.read_file("ancient_report.doc")
result = sharepoint2text.read_file("modern_report.docx")

It handles:

Legacy formats: .doc, .xls, .ppt
Modern formats: .docx, .xlsx, .pptx
Plus .pdf and plain text

One interface, no conditional logic, no format detection boilerplate in your code.
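To give a sense of what "directly in Python" means for the modern formats: an OOXML file is a ZIP archive of XML parts, so the standard library alone can get paragraph text out of a .docx. The following is a minimal sketch of that idea, not the code sharepoint-to-text uses. The function name `docx_paragraphs` is hypothetical, and it only reads the main document part, ignoring headers, footnotes, and comments.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(data: bytes) -> list[str]:
    """Return the text of each paragraph in a .docx file given as bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{W_NS}p"):
        # A paragraph's text is split across one or more <w:t> elements.
        text = "".join(node.text or "" for node in para.iter(f"{W_NS}t"))
        paragraphs.append(text)
    return paragraphs
```

The legacy side is where the real work is: .doc, .xls, and .ppt store their content in binary streams inside an OLE2 container, so there is no XML to walk and each format needs its own record-level parser.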

Why This Matters for RAG and LLM Agents

If you're building document ingestion for RAG, you're probably dealing with heterogeneous input. Users upload files. Pipelines crawl document stores. You can't control what formats show up.

The typical approach is a cascade of if-statements and multiple libraries:

# The ugly version
if path.endswith('.docx'):
    text = extract_with_python_docx(path)
elif path.endswith('.doc'):
    text = extract_with_libreoffice(path)  # hope it's installed
elif path.endswith('.xlsx'):
    text = extract_with_openpyxl(path)
# ... and so on

With sharepoint-to-text, it's just:

import sharepoint2text

result = sharepoint2text.read_file(path)

The library figures out the format and handles it appropriately.
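How a library works out the format is an implementation detail, but one common technique is to look at the first bytes of the file instead of trusting the extension. The sketch below shows that technique under stated assumptions: it is not taken from sharepoint-to-text, and the function names and return labels are hypothetical.

```python
# Well-known file signatures ("magic bytes")
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy .doc/.xls/.ppt container
ZIP_MAGIC = b"PK\x03\x04"                        # .docx/.xlsx/.pptx are ZIP archives
PDF_MAGIC = b"%PDF-"


def sniff_container(header: bytes) -> str:
    """Classify a file by its leading bytes: 'ole2', 'zip', 'pdf' or 'unknown'."""
    if header.startswith(OLE2_MAGIC):
        return "ole2"
    if header.startswith(ZIP_MAGIC):
        return "zip"
    if header.startswith(PDF_MAGIC):
        return "pdf"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read just enough of the file to classify it."""
    with open(path, "rb") as handle:
        return sniff_container(handle.read(8))
```

The signature only identifies the container. Telling a .doc from an .xls, or a .docx from an arbitrary ZIP, still requires looking inside (for example at the stream names in an OLE2 file or the parts in the archive) or falling back to the extension.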

Deployment Benefits

Because it's pure Python with no system dependencies:

Container images stay small: no LibreOffice bloat
Serverless-friendly: works in Lambda, Cloud Functions, Azure Functions
Smaller attack surface: no subprocess calls, no shell execution
Cross-platform: Windows, macOS, Linux, whatever

When You Might Need This

You're building RAG pipelines against enterprise document stores
Your LLM agent needs to process user-uploaded files of unknown vintage
You're deploying to serverless with size constraints
Your security team doesn't allow subprocess execution
You're tired of maintaining LibreOffice in containers

Try It Out

pip install sharepoint-to-text

GitHub: https://github.com/Horsmann/sharepoint-to-text

I'd appreciate feedback, especially if you hit edge cases with specific file types. Legacy Office formats are notoriously inconsistent, and real-world files are the best test suite.