<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tobias Horsmann</title>
    <description>The latest articles on DEV Community by Tobias Horsmann (@tobias_horsmann_dcbbacec3).</description>
    <link>https://dev.to/tobias_horsmann_dcbbacec3</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3678767%2F06e226d7-56fb-4115-a3fb-eb4566958b16.png</url>
      <title>DEV Community: Tobias Horsmann</title>
      <link>https://dev.to/tobias_horsmann_dcbbacec3</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tobias_horsmann_dcbbacec3"/>
    <language>en</language>
    <item>
      <title>Why I Built a Pure Python Library for Legacy Office Files (And Why RAG Pipelines Need One)</title>
      <dc:creator>Tobias Horsmann</dc:creator>
      <pubDate>Thu, 25 Dec 2025 21:30:17 +0000</pubDate>
      <link>https://dev.to/tobias_horsmann_dcbbacec3/why-i-built-a-pure-python-library-for-legacy-office-files-and-why-rag-pipelines-need-one-4c37</link>
      <guid>https://dev.to/tobias_horsmann_dcbbacec3/why-i-built-a-pure-python-library-for-legacy-office-files-and-why-rag-pipelines-need-one-4c37</guid>
      <description>&lt;h1&gt;
  
  
  Why I Built a Pure Python Library for Legacy Office Files (And Why RAG Pipelines Need One)
&lt;/h1&gt;

&lt;p&gt;If you're building RAG pipelines or document ingestion for LLM agents, you've probably solved the easy part already. Modern Office files? No problem. &lt;code&gt;python-docx&lt;/code&gt;, &lt;code&gt;openpyxl&lt;/code&gt;, &lt;code&gt;python-pptx&lt;/code&gt; — pick your library, extract your text, move on.&lt;/p&gt;

&lt;p&gt;Then someone points your pipeline at an enterprise SharePoint.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Legacy File Problem
&lt;/h2&gt;

&lt;p&gt;Enterprise SharePoints are digital archaeology sites. Marketing uploaded PowerPoints in 2008. Legal has Word documents from 2005. Finance runs on Excel files that predate most of your team's careers.&lt;/p&gt;

&lt;p&gt;These aren't edge cases. In my experience, legacy &lt;code&gt;.doc&lt;/code&gt;, &lt;code&gt;.xls&lt;/code&gt;, and &lt;code&gt;.ppt&lt;/code&gt; files make up a significant chunk of any long-running enterprise document store. And if you're building a system that needs to ingest "all the documents," you can't just skip them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Existing Solutions Didn't Work for Me
&lt;/h2&gt;

&lt;p&gt;I needed to process these files in AWS Lambda functions for a RAG pipeline. My options were:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LibreOffice&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The standard answer. Install LibreOffice, run it headless, convert files to text. It works, but it adds over 1GB to your container image. Lambda caps .zip deployment packages at 250MB unzipped (container images can be up to 10GB, but that is still a lot of weight for a text extraction step). Plus, configuring headless LibreOffice is its own adventure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Apache Tika&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Solid tool, widely used. But it requires a Java runtime and typically runs as a separate server. That's another service to deploy, monitor, and secure. For a document extraction step in a pipeline, it felt like overkill.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Subprocess calls to command-line tools&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Various tools exist that you can shell out to. But subprocess calls are a security concern, they break in restricted environments, and they make your code platform-dependent.&lt;/p&gt;

&lt;p&gt;I wanted something simpler: a Python library I could &lt;code&gt;pip install&lt;/code&gt; and call.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building sharepoint-to-text
&lt;/h2&gt;

&lt;p&gt;So I built &lt;a href="https://github.com/Horsmann/sharepoint-to-text" rel="noopener noreferrer"&gt;sharepoint-to-text&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The core idea: parse both legacy Office binary formats (OLE2) and modern XML-based formats (OOXML) directly in Python. No system-level dependencies. No subprocess calls. Just text extraction.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sharepoint2text&lt;/span&gt;

&lt;span class="c1"&gt;# Works the same for legacy or modern files
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sharepoint2text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ancient_report.doc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sharepoint2text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;modern_report.docx&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It handles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Legacy formats: &lt;code&gt;.doc&lt;/code&gt;, &lt;code&gt;.xls&lt;/code&gt;, &lt;code&gt;.ppt&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Modern formats: &lt;code&gt;.docx&lt;/code&gt;, &lt;code&gt;.xlsx&lt;/code&gt;, &lt;code&gt;.pptx&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Plus &lt;code&gt;.pdf&lt;/code&gt; and plain text&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One interface, no conditional logic, no format detection boilerplate in your code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters for RAG and LLM Agents
&lt;/h2&gt;

&lt;p&gt;If you're building document ingestion for RAG, you're probably dealing with heterogeneous input. Users upload files. Pipelines crawl document stores. You can't control what formats show up.&lt;/p&gt;

&lt;p&gt;The typical approach is a cascade of if-statements and multiple libraries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# The ugly version
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;endswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.docx&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_with_python_docx&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;endswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.doc&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_with_libreoffice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# hope it's installed
&lt;/span&gt;&lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;endswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.xlsx&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_with_openpyxl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# ... and so on
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With sharepoint-to-text, it's just:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sharepoint2text&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sharepoint2text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The library figures out the format and handles it appropriately.&lt;/p&gt;
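
&lt;p&gt;As a rough illustration of what format detection can look like (a sketch, not the library's actual implementation), a file can be classified by its first bytes rather than its extension: OLE2 files start with a fixed 8-byte signature, OOXML files are ZIP archives, and PDFs start with &lt;code&gt;%PDF-&lt;/code&gt;. The helper names &lt;code&gt;sniff_container&lt;/code&gt; and &lt;code&gt;sniff_file&lt;/code&gt; are hypothetical.&lt;/p&gt;

```python
# Hypothetical helpers for illustration; not part of sharepoint-to-text.
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")  # OLE2 / Compound File Binary signature
ZIP_MAGIC = b"PK\x03\x04"  # local file header that opens a non-empty ZIP archive
PDF_MAGIC = b"%PDF-"


def sniff_container(header):
    """Classify a file by its leading bytes instead of its extension."""
    if header.startswith(OLE2_MAGIC):
        return "ole2"  # legacy .doc/.xls/.ppt, but also other OLE2 files
    if header.startswith(ZIP_MAGIC):
        return "zip"  # OOXML .docx/.xlsx/.pptx, but also any other ZIP
    if header.startswith(PDF_MAGIC):
        return "pdf"
    return "unknown"


def sniff_file(path):
    """Read only the first 8 bytes of the file and classify them."""
    with open(path, "rb") as handle:
        return sniff_container(handle.read(8))
```

&lt;p&gt;A match on the OLE2 or ZIP signature only identifies the container. Telling a &lt;code&gt;.doc&lt;/code&gt; from an &lt;code&gt;.xls&lt;/code&gt;, or a &lt;code&gt;.docx&lt;/code&gt; from an &lt;code&gt;.xlsx&lt;/code&gt;, still means looking inside it.&lt;/p&gt;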

&lt;h2&gt;
  
  
  Deployment Benefits
&lt;/h2&gt;

&lt;p&gt;Because it's pure Python with no system dependencies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Container images stay small&lt;/strong&gt;: no LibreOffice bloat&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Serverless-friendly&lt;/strong&gt;: works in Lambda, Cloud Functions, Azure Functions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Smaller attack surface&lt;/strong&gt;: no subprocess calls, no shell execution&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-platform&lt;/strong&gt;: Windows, macOS, Linux, whatever&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  When You Might Need This
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;You're building RAG pipelines against enterprise document stores&lt;/li&gt;
&lt;li&gt;Your LLM agent needs to process user-uploaded files of unknown vintage&lt;/li&gt;
&lt;li&gt;You're deploying to serverless with size constraints&lt;/li&gt;
&lt;li&gt;Your security team doesn't allow subprocess execution&lt;/li&gt;
&lt;li&gt;You're tired of maintaining LibreOffice in containers&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try It Out
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;sharepoint-to-text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;GitHub: &lt;a href="https://github.com/Horsmann/sharepoint-to-text" rel="noopener noreferrer"&gt;https://github.com/Horsmann/sharepoint-to-text&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I'd appreciate feedback, especially if you hit edge cases with specific file types. Legacy Office formats are notoriously inconsistent, and real-world files are the best test suite.&lt;/p&gt;
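
&lt;p&gt;Because legacy files fail in unpredictable ways, a batch ingestion job benefits from isolating failures per file. Here is a minimal sketch, assuming the extractor is passed in as a callable such as &lt;code&gt;sharepoint2text.read_file&lt;/code&gt;. The function name &lt;code&gt;ingest_folder&lt;/code&gt;, the extension set, and the use of &lt;code&gt;.txt&lt;/code&gt; for plain text are assumptions of this sketch, not part of the library.&lt;/p&gt;

```python
from pathlib import Path

# Extensions based on the format list in this post; ".txt" is an assumption.
SUPPORTED = {".doc", ".xls", ".ppt", ".docx", ".xlsx", ".pptx", ".pdf", ".txt"}


def ingest_folder(root, extract):
    """Run extract(path) on every supported file below root.

    Returns (results, failures), both keyed by file path, so that one
    broken file cannot stop the whole batch.
    """
    results, failures = {}, {}
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in SUPPORTED:
            continue
        try:
            results[str(path)] = extract(str(path))
        except Exception as exc:  # broad on purpose: legacy files fail in many ways
            failures[str(path)] = repr(exc)
    return results, failures
```

&lt;p&gt;The &lt;code&gt;failures&lt;/code&gt; mapping doubles as a list of files worth attaching to a bug report.&lt;/p&gt;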

</description>
      <category>python</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
