<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ryan Park</title>
    <description>The latest articles on DEV Community by Ryan Park (@ryan_park_72189933fc08ef5).</description>
    <link>https://dev.to/ryan_park_72189933fc08ef5</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3882590%2F2add8567-69ee-41fd-9aa7-3eb4f9bacd2a.png</url>
      <title>DEV Community: Ryan Park</title>
      <link>https://dev.to/ryan_park_72189933fc08ef5</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ryan_park_72189933fc08ef5"/>
    <language>en</language>
    <item>
      <title>I found that one Excel file was eating 85% of my indexing time</title>
      <dc:creator>Ryan Park</dc:creator>
      <pubDate>Thu, 16 Apr 2026 13:39:48 +0000</pubDate>
      <link>https://dev.to/ryan_park_72189933fc08ef5/i-found-that-one-excel-file-was-eating-85-of-my-indexing-time-tags-opensource-dotnet-4ooa</link>
      <guid>https://dev.to/ryan_park_72189933fc08ef5/i-found-that-one-excel-file-was-eating-85-of-my-indexing-time-tags-opensource-dotnet-4ooa</guid>
      <description>&lt;p&gt;I'm building an open-source local file search tool that indexes the &lt;em&gt;inside&lt;/em&gt; of documents — Word, Excel, PDF, and 10+ other formats. Think "Everything Search, but for file contents instead of filenames."&lt;/p&gt;

&lt;p&gt;Last week, I finally sat down to figure out why indexing was so slow. The answer was embarrassing.&lt;/p&gt;

&lt;h2&gt;The setup&lt;/h2&gt;

&lt;p&gt;I tested on a real document library: 6,512 files from actual work — IPO filings, contracts, financial reports, spreadsheets. The kind of messy, organic file collection that real people have on their PCs.&lt;/p&gt;

&lt;p&gt;Indexing rate: &lt;strong&gt;37.9 files/min&lt;/strong&gt;. Total time: &lt;strong&gt;171.6 minutes&lt;/strong&gt;. Almost 3 hours to index 6,500 files. Not great.&lt;/p&gt;

&lt;h2&gt;The diagnosis&lt;/h2&gt;

&lt;p&gt;I instrumented the extraction pipeline to log per-file timing. Sorted by duration. The answer was immediately obvious:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Excel files (.xlsx) consumed 85.7% of total indexing time.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The top 20 slowest files? All Excel. And the #1 offender — a single 45MB spreadsheet — took over &lt;strong&gt;2 hours by itself&lt;/strong&gt;. One file. Two hours. Out of a 3-hour total.&lt;/p&gt;

&lt;p&gt;The parser was dutifully extracting every cell from every sheet, including machine-generated data dumps with hundreds of thousands of rows that no human would ever search through.&lt;/p&gt;
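
&lt;p&gt;The timing pass that surfaced this is nothing exotic. Here is a minimal sketch in Python (the real tool is C#; the function names and the per-extension rollup are mine, not the project's):&lt;/p&gt;

```python
import time
from collections import defaultdict

def profile_extraction(files, extract):
    """Time extract(path) for each file; report slowest-first plus
    each extension's share of total extraction time."""
    per_file = []
    by_ext = defaultdict(float)
    for path in files:
        start = time.perf_counter()
        extract(path)
        elapsed = time.perf_counter() - start
        per_file.append((path, elapsed))
        by_ext[path.rsplit(".", 1)[-1].lower()] += elapsed
    # Slowest first, so the worst offenders surface immediately
    per_file.sort(key=lambda pair: pair[1], reverse=True)
    total = sum(by_ext.values())
    shares = {ext: t / total for ext, t in by_ext.items()}
    return per_file, shares
```

&lt;p&gt;Sorting slowest-first and summing per extension is exactly what makes a number like 85.7% jump off the screen.&lt;/p&gt;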

&lt;h2&gt;The fix (and why the "obvious" approach was the right one)&lt;/h2&gt;

&lt;p&gt;The instinct was to optimize the Excel parser — stream cells instead of loading everything into memory, skip empty rows, parallelize sheet extraction. I could have spent a week on that.&lt;/p&gt;

&lt;p&gt;But I stepped back and asked: &lt;strong&gt;who is searching for row 247,831 of a machine-generated data dump?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Nobody. These aren't documents with prose, paragraphs, or searchable content. They're data exports — ETL outputs, financial models with formula grids, log dumps saved as .xlsx because someone's workflow ends with "Export to Excel."&lt;/p&gt;

&lt;p&gt;The actual content people search for in spreadsheets — column headers, summary sheets, labeled data — fits comfortably under 10MB. A 45MB spreadsheet is almost always programmatically generated bulk data.&lt;/p&gt;

&lt;p&gt;So the fix was deliberately simple:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;10MB size cap&lt;/strong&gt; — files above this get metadata-only 
indexing (filename, path, dates, sheet names), not cell-by-cell 
text extraction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Text-only extraction&lt;/strong&gt; — strip formulas, styling markup, 
and internal references before indexing&lt;/li&gt;
&lt;/ol&gt;
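
&lt;p&gt;Sketched in Python (illustrative only; the project is C#, and &lt;code&gt;SIZE_CAP&lt;/code&gt; and &lt;code&gt;plan_indexing&lt;/code&gt; are names I made up), the gate is a few lines:&lt;/p&gt;

```python
import os

SIZE_CAP = 10 * 1024 * 1024  # 10 MB cap, per the rule above

def plan_indexing(path, size_bytes=None):
    """Return how much of a spreadsheet to index under the size-cap rule."""
    if size_bytes is None:
        size_bytes = os.path.getsize(path)
    if size_bytes > SIZE_CAP:
        # Oversized files are treated as machine-generated dumps:
        # index filename, path, dates, and sheet names only.
        return "metadata_only"
    return "full_text"
```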

&lt;p&gt;I considered more granular approaches — parse only the first N sheets, skip sheets with &amp;gt;10k rows, sample cells instead of extracting them all. But every heuristic added complexity without meaningfully improving search quality. The files people actually search for were already fast. The files that were slow were unsearchable by nature.&lt;/p&gt;

&lt;p&gt;Sometimes the right optimization is to stop doing work that produces no value, not to do the same work faster.&lt;/p&gt;

&lt;h2&gt;The result&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;171.6 minutes → 30.8 minutes. 5.6x faster.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Same 6,512 files. Same hardware. The indexing bottleneck was never the search engine, the database, or the embedding model. It was one format handler doing too much work on files that didn't need it.&lt;/p&gt;

&lt;h2&gt;Other things I fixed while I was in there&lt;/h2&gt;

&lt;p&gt;Once I had the profiling infrastructure, I kept pulling threads:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Search was firing twice on Enter.&lt;/strong&gt; A debounce timer and the keydown handler were both triggering searches. Every Enter key = two identical queries running in parallel.&lt;/p&gt;
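
&lt;p&gt;One way to fix that class of bug is to route every trigger through a single dispatcher that drops duplicates. A hypothetical sketch, not the app's actual code:&lt;/p&gt;

```python
class SearchDispatcher:
    """Coalesce debounce-timer and Enter-key triggers into one query."""

    def __init__(self, run_query):
        self.run_query = run_query
        self.pending = None        # query scheduled by the debounce timer
        self.last_executed = None

    def on_keystroke(self, query):
        self.pending = query       # real code would also (re)arm the timer

    def on_debounce_fired(self):
        if self.pending is not None:
            self._execute(self.pending)

    def on_enter(self, query):
        # Explicit submit wins: cancel the pending debounce so the
        # same query cannot fire twice in parallel.
        self.pending = None
        self._execute(query)

    def _execute(self, query):
        if query == self.last_executed:
            return                 # drop duplicate back-to-back query
        self.last_executed = query
        self.pending = None
        self.run_query(query)
```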

&lt;p&gt;&lt;strong&gt;Korean IME was triggering searches mid-composition.&lt;/strong&gt; Korean characters are composed from multiple keystrokes (ㅎ → 하 → 한). Each keystroke was firing a search before the user finished typing. Fix: require 2+ completed syllables before executing.&lt;/p&gt;
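
&lt;p&gt;Completed Hangul syllables occupy a contiguous Unicode block (U+AC00 to U+D7A3), so "finished typing" is cheap to detect. A sketch of the rule as I understand it (in a real UI you would apply it only while an IME composition is active, otherwise non-Korean queries would never fire):&lt;/p&gt;

```python
HANGUL_BASE = 0xAC00   # first composed syllable, 가
HANGUL_LAST = 0xD7A3   # last composed syllable, 힣

def completed_syllables(text):
    """Count fully composed Hangul syllables; lone jamo like ㅎ don't count."""
    return sum(1 for ch in text
               if ord(ch) >= HANGUL_BASE and HANGUL_LAST >= ord(ch))

def should_search(query):
    # Fire only once the user has composed at least two full syllables.
    return completed_syllables(query) >= 2
```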

&lt;p&gt;&lt;strong&gt;Filename matches dominated results unfairly.&lt;/strong&gt; A file &lt;em&gt;named&lt;/em&gt; &lt;code&gt;report.docx&lt;/code&gt; scored 5x higher than a 50-page document with "report" mentioned dozens of times in the body. Reduced filename boost from 5.0x to 2.5x so body content gets a fair shot.&lt;/p&gt;
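
&lt;p&gt;Since the stack uses SQLite FTS5 for BM25 ranking, a boost like this maps naturally onto the per-column weight arguments of the &lt;code&gt;bm25()&lt;/code&gt; auxiliary function. The schema and data below are illustrative, not the project's:&lt;/p&gt;

```python
import sqlite3

# In-memory FTS5 table: one column for the filename, one for body text.
con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE docs USING fts5(filename, body)")
con.executemany("INSERT INTO docs VALUES (?, ?)", [
    ("report.docx", "quarterly summary"),
    ("minutes.docx", "the report covers the report findings; see report"),
])
# bm25() returns a negative score; more negative means a better match.
# Extra arguments are per-column weights in declaration order:
# filename weighted 2.5, body 1.0, mirroring the reduced boost.
rows = con.execute(
    "SELECT filename FROM docs WHERE docs MATCH 'report' "
    "ORDER BY bm25(docs, 2.5, 1.0)"
).fetchall()
```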

&lt;p&gt;&lt;strong&gt;PDF parser was silently indexing garbage.&lt;/strong&gt; Some PDFs have CMap encoding that produces garbled text when extracted. The parser was happily indexing strings like &lt;code&gt;ÿþ÷ðîñ&lt;/code&gt; as if they were real content. Now it detects garbled text and flags the file as unindexable instead of polluting search results.&lt;/p&gt;
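
&lt;p&gt;A deliberately naive sketch of such a detector (my heuristic, not the project's: it only flags dense runs of Latin-1 supplement characters, and a production check would need to whitelist expected scripts such as Hangul and look at word structure too):&lt;/p&gt;

```python
def looks_garbled(text, threshold=0.30):
    """Flag text dominated by Latin-1 supplement characters (0x80..0xFF),
    the range that CMap-damaged PDF extraction tends to decode into."""
    if not text.strip():
        return True
    suspicious = sum(1 for ch in text if 0xFF >= ord(ch) >= 0x80)
    return suspicious / len(text) > threshold
```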

&lt;p&gt;&lt;strong&gt;AI model download had no integrity check.&lt;/strong&gt; The BGE-M3 embedding model is 2.3 GB. The SHA256 hash fields existed in the code but were empty strings — verification was silently skipped. Fixed.&lt;/p&gt;
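
&lt;p&gt;For a 2.3 GB file the digest has to be computed in streaming fashion. A sketch of the check, with names that are mine rather than the project's:&lt;/p&gt;

```python
import hashlib

def verify_download(path, expected_sha256):
    """Stream-hash a large file and compare against the pinned digest.

    Refuses an empty expected hash outright, so verification can
    never be silently skipped again."""
    if not expected_sha256:
        raise ValueError("no pinned SHA256: refusing to skip verification")
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks; a multi-GB model never fits in one read.
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest() == expected_sha256.lower()
```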

&lt;h2&gt;The lesson&lt;/h2&gt;

&lt;p&gt;The performance problem wasn't where I expected. I was ready to optimize SQLite queries, batch database writes, parallelize the pipeline. The actual fix was: don't parse a 45MB spreadsheet cell by cell.&lt;/p&gt;

&lt;p&gt;Profiling before optimizing sounds obvious. But when you're a solo dev shipping features every week, "I'll profile it later" turns into months of users experiencing a slow tool because you never looked at where the time actually goes.&lt;/p&gt;

&lt;h2&gt;The project&lt;/h2&gt;

&lt;p&gt;The tool is &lt;a href="https://github.com/LocalSynapse/LocalSynapse" rel="noopener noreferrer"&gt;LocalSynapse&lt;/a&gt; — a local file search engine with a built-in MCP server (so AI agents like Claude can search your files too).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;C# / Avalonia (cross-platform desktop)&lt;/li&gt;
&lt;li&gt;SQLite FTS5 for BM25 text ranking&lt;/li&gt;
&lt;li&gt;BGE-M3 via ONNX Runtime for semantic search&lt;/li&gt;
&lt;li&gt;Apache 2.0 license&lt;/li&gt;
&lt;li&gt;Windows + macOS&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you've ever spent 10 minutes digging through folders for a file you &lt;em&gt;know&lt;/em&gt; exists, that's the problem this solves. &lt;a href="https://localsynapse.com" rel="noopener noreferrer"&gt;localsynapse.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Happy to answer questions about the architecture, the profiling setup, or anything else.&lt;/p&gt;

</description>
      <category>dotnet</category>
      <category>opensource</category>
      <category>performance</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
