<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: alex zheng</title>
    <description>The latest articles on DEV Community by alex zheng (@alex_zheng_49d3089c0d3fdf).</description>
    <link>https://dev.to/alex_zheng_49d3089c0d3fdf</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3971483%2F3e934ee6-0778-4d15-a813-a24610e80524.png</url>
      <title>DEV Community: alex zheng</title>
      <link>https://dev.to/alex_zheng_49d3089c0d3fdf</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/alex_zheng_49d3089c0d3fdf"/>
    <language>en</language>
    <item>
      <title>A practical pipeline for turning messy business documents into spreadsheets</title>
      <dc:creator>alex zheng</dc:creator>
      <pubDate>Sat, 06 Jun 2026 15:53:07 +0000</pubDate>
      <link>https://dev.to/alex_zheng_49d3089c0d3fdf/a-practical-pipeline-for-turning-messy-business-documents-into-spreadsheets-4bj7</link>
      <guid>https://dev.to/alex_zheng_49d3089c0d3fdf/a-practical-pipeline-for-turning-messy-business-documents-into-spreadsheets-4bj7</guid>
      <description>&lt;p&gt;Most spreadsheet cleanup work is not really an Excel problem. It is an extraction and review problem.&lt;/p&gt;

&lt;p&gt;A team receives a PDF price list, an invoice packet, a screenshot from a dashboard, an email order, or a pasted block of OCR text. Someone then has to decide what the columns should be, copy values into rows, fix inconsistent labels, and export a table that other people can trust.&lt;/p&gt;

&lt;p&gt;The useful workflow is usually smaller than a full data platform:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Accept messy source material&lt;/li&gt;
&lt;li&gt;Define the target columns in plain language&lt;/li&gt;
&lt;li&gt;Extract rows into a draft table&lt;/li&gt;
&lt;li&gt;Review and correct the table before export&lt;/li&gt;
&lt;li&gt;Save the instruction pattern for the next similar file&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That review step matters. For business data, a wrong total or a shifted column can be worse than no automation at all. A good document-to-spreadsheet flow should make uncertainty visible instead of pretending the first extraction is perfect.&lt;/p&gt;
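
&lt;p&gt;As a minimal sketch of what "making uncertainty visible" could look like, the check below flags an extracted invoice line instead of silently accepting it. The function name &lt;code&gt;review_flags&lt;/code&gt;, the field names, and the rule that quantity times unit price plus tax should equal the total are all assumptions for illustration, not a description of any particular tool.&lt;/p&gt;

```python
import math


def review_flags(row, tolerance=0.01):
    """Return human-readable warnings for one extracted invoice line.

    Assumed rule (hypothetical): quantity * unit_price + tax == total.
    """
    flags = []
    numeric = {}
    for name in ("quantity", "unit_price", "tax", "total"):
        value = row.get(name)
        # Text in a numeric column is a common sign of a shifted column.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            flags.append(f"{name} is not a number: {value!r}")
        else:
            numeric[name] = float(value)
    # Only run the arithmetic check when all four numbers are present.
    if len(numeric) == 4:
        expected = numeric["quantity"] * numeric["unit_price"] + numeric["tax"]
        if not math.isclose(expected, numeric["total"], abs_tol=tolerance):
            flags.append(
                f"total {numeric['total']} does not match expected {expected:.2f}"
            )
    return flags
```

&lt;p&gt;An empty list means the row passed these checks, not that it is correct. A reviewer still decides whether the output is ready.&lt;/p&gt;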

&lt;h2&gt;The pattern I use&lt;/h2&gt;

&lt;p&gt;When designing a cleanup flow, I start with the final sheet rather than the source file.&lt;/p&gt;

&lt;p&gt;For example, an invoice workflow might need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;supplier_name&lt;/li&gt;
&lt;li&gt;invoice_number&lt;/li&gt;
&lt;li&gt;invoice_date&lt;/li&gt;
&lt;li&gt;line_item_description&lt;/li&gt;
&lt;li&gt;quantity&lt;/li&gt;
&lt;li&gt;unit_price&lt;/li&gt;
&lt;li&gt;tax&lt;/li&gt;
&lt;li&gt;total&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A bank statement workflow might need a completely different shape:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;transaction_date&lt;/li&gt;
&lt;li&gt;description&lt;/li&gt;
&lt;li&gt;debit&lt;/li&gt;
&lt;li&gt;credit&lt;/li&gt;
&lt;li&gt;balance&lt;/li&gt;
&lt;li&gt;category&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The source can be messy, but the requested output should be explicit. Once the target columns are clear, extraction becomes a bounded task rather than a vague conversion task.&lt;/p&gt;
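
&lt;p&gt;One way to make the requested output explicit is to write the target columns down as data and project every extracted record onto them. The sketch below does that for the invoice example. The column types, the &lt;code&gt;to_row&lt;/code&gt; helper, and the shape of the raw input are assumptions for illustration.&lt;/p&gt;

```python
# Hypothetical target schema: column names follow the invoice example,
# the types are assumptions.
INVOICE_COLUMNS = {
    "supplier_name": str,
    "invoice_number": str,
    "invoice_date": str,
    "line_item_description": str,
    "quantity": float,
    "unit_price": float,
    "tax": float,
    "total": float,
}


def to_row(raw, columns=INVOICE_COLUMNS):
    """Project a loosely extracted dict onto the target columns.

    Returns (row, problems). Cells that are missing or fail conversion
    are set to None and reported instead of being guessed.
    """
    row, problems = {}, []
    for name, cast in columns.items():
        value = raw.get(name)
        if value is None or value == "":
            row[name] = None
            problems.append(f"{name}: missing")
            continue
        try:
            row[name] = cast(value)
        except (TypeError, ValueError):
            row[name] = None
            problems.append(f"{name}: could not read {value!r}")
    # Keys outside the schema are reported rather than silently kept.
    for extra in sorted(set(raw) - set(columns)):
        problems.append(f"{extra}: not in target schema")
    return row, problems
```

&lt;p&gt;Because the row always has exactly the target columns in a fixed order, every gap shows up as an empty cell plus a note, which is what keeps the task bounded.&lt;/p&gt;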

&lt;h2&gt;Why reusable recipes help&lt;/h2&gt;

&lt;p&gt;The first document usually takes the most time because you are still deciding the schema. But many cleanup jobs repeat. A company may receive the same supplier invoice every month, the same sales report every week, or the same order email format every day.&lt;/p&gt;

&lt;p&gt;That is where a saved recipe becomes useful. A recipe is not just a prompt. It is a record of the output structure and the review expectations for a specific class of documents.&lt;/p&gt;

&lt;p&gt;A practical recipe should remember:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the column schema&lt;/li&gt;
&lt;li&gt;naming conventions&lt;/li&gt;
&lt;li&gt;extraction rules&lt;/li&gt;
&lt;li&gt;fields to ignore&lt;/li&gt;
&lt;li&gt;export format&lt;/li&gt;
&lt;li&gt;review notes from previous runs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This keeps the workflow lightweight while still making it repeatable.&lt;/p&gt;
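
&lt;p&gt;A recipe like this can be stored as a small structured record. The sketch below is one hypothetical shape, with one field per item in the list above plus a name. The &lt;code&gt;Recipe&lt;/code&gt; class, its field names, and the JSON storage are assumptions, not the format of any specific tool.&lt;/p&gt;

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Recipe:
    """Hypothetical saved recipe for one class of documents."""

    name: str
    columns: list  # the column schema, in output order
    naming_conventions: dict = field(default_factory=dict)
    extraction_rules: list = field(default_factory=list)
    ignore_fields: list = field(default_factory=list)
    export_format: str = "csv"
    review_notes: list = field(default_factory=list)

    def to_json(self):
        """Serialize the recipe so it can be saved next to the output."""
        return json.dumps(asdict(self), indent=2)

    @classmethod
    def from_json(cls, text):
        """Load a recipe that was saved with to_json."""
        return cls(**json.loads(text))


recipe = Recipe(
    name="monthly supplier invoice",
    columns=["supplier_name", "invoice_number", "total"],
    ignore_fields=["page footer"],
    review_notes=["check that tax is not merged into total"],
)
restored = Recipe.from_json(recipe.to_json())
```

&lt;p&gt;Keeping review notes in the same record means the next run starts with what the last reviewer learned.&lt;/p&gt;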

&lt;h2&gt;A small tool approach&lt;/h2&gt;

&lt;p&gt;I have been building Messy2Sheet (&lt;a href="https://messy2sheet.com/" rel="noopener noreferrer"&gt;https://messy2sheet.com/&lt;/a&gt;) around this idea: turn messy PDFs, screenshots, emails, and pasted business data into clean Excel or CSV files with custom columns and a reviewable preview.&lt;/p&gt;

&lt;p&gt;The goal is not to replace a database or BI system. It is to remove the manual 20-minute cleanup step that happens before the data is useful enough to import, reconcile, or share.&lt;/p&gt;

&lt;h2&gt;What I would avoid&lt;/h2&gt;

&lt;p&gt;I would avoid treating every document as a generic file conversion problem. A PDF-to-CSV converter that does not know the intended columns often just moves the mess from one format to another.&lt;/p&gt;

&lt;p&gt;I would also avoid hiding the review step. Even when AI extraction works well, the user still needs a clear place to verify the rows, fix structure, and decide whether the output is ready.&lt;/p&gt;

&lt;p&gt;For small operations teams, that is usually the difference between a demo and a tool they can actually use.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>ai</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
