I'm building an open-source local file search tool that indexes
the inside of documents — Word, Excel, PDF, and 10+ other formats.
Think "Everything Search, but for file contents instead of filenames."
Last week, I finally sat down to figure out why indexing was so slow.
The answer was embarrassing.
The setup
I tested on a real document library: 6,512 files from actual work —
IPO filings, contracts, financial reports, spreadsheets. The kind of
messy, organic file collection that real people have on their PCs.
Indexing rate: 37.9 files/min. Total time: 171.6 minutes.
Almost 3 hours to index 6,500 files. Not great.
The diagnosis
I instrumented the extraction pipeline to log per-file timing.
Sorted by duration. The answer was immediately obvious:
Excel files (.xlsx) consumed 85.7% of total indexing time.
The top 20 slowest files? All Excel. And the #1 offender — a single
45MB spreadsheet — took over 2 hours by itself. One file.
Two hours. Out of a 3-hour total.
The parser was dutifully extracting every cell from every sheet,
including machine-generated data dumps with hundreds of thousands
of rows that no human would ever search through.
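The instrumentation itself was nothing fancy: wrap each extraction call in a timer, tag it with the file's extension, and sort slowest-first. A Python sketch of the idea (the actual tool is C#; names here are illustrative):

```python
import time
from collections import defaultdict

def profile_extraction(files, extract):
    """Time each file's extraction and aggregate totals per extension."""
    timings = []
    by_ext = defaultdict(float)
    for path in files:
        start = time.perf_counter()
        extract(path)
        elapsed = time.perf_counter() - start
        timings.append((path, elapsed))
        by_ext[path.rsplit(".", 1)[-1].lower()] += elapsed
    # Sort slowest-first so the worst offenders surface immediately
    timings.sort(key=lambda t: t[1], reverse=True)
    return timings, dict(by_ext)
```

With per-extension totals in hand, the ".xlsx consumed 85.7%" conclusion falls out of a single sorted list.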
The fix (and why the "obvious" approach was the right one)
The instinct was to optimize the Excel parser — stream cells
instead of loading everything into memory, skip empty rows,
parallelize sheet extraction. I could have spent a week on that.
But I stepped back and asked: who is searching for row
247,831 of a machine-generated data dump?
Nobody. These aren't documents with prose, paragraphs, or
searchable content. They're data exports — ETL outputs,
financial models with formula grids, log dumps saved as .xlsx
because someone's workflow ends with "Export to Excel."
The actual content people search for in spreadsheets —
column headers, summary sheets, labeled data — fits comfortably
under 10MB. A 45MB spreadsheet is almost always programmatically
generated bulk data.
So the fix was deliberately simple:
- 10MB size cap — files above this get metadata-only indexing (filename, path, dates, sheet names), not cell-by-cell text extraction
- Text-only extraction — strip formulas, styling markup, and internal references before indexing
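The dispatch boils down to a one-line size check. A Python sketch (the real handlers are C#; `extract_cells` and `extract_metadata` are stand-ins, not actual function names):

```python
import os

SIZE_CAP_BYTES = 10 * 1024 * 1024  # the 10MB cap

def index_spreadsheet(path, extract_cells, extract_metadata):
    """Full cell extraction under the cap, metadata-only above it."""
    if os.path.getsize(path) > SIZE_CAP_BYTES:
        # Oversized files are almost always machine-generated dumps:
        # index filename, path, dates, and sheet names only.
        return extract_metadata(path)
    return extract_cells(path)
```

The point of keeping it this blunt is that the cap is trivial to explain, trivial to test, and trivial to tune if a user complains.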
I considered more granular approaches: parsing only the first
N sheets, skipping sheets with more than 10k rows, sampling cells
instead of extracting them all. But every heuristic added complexity without
meaningfully improving search quality. The files people actually
search for were already fast. The files that were slow were
unsearchable by nature.
Sometimes the right optimization is to stop doing work that
produces no value, not to do the same work faster.
The result
171.6 minutes → 30.8 minutes. 5.6x faster.
Same 6,512 files. Same hardware. The indexing bottleneck was never
the search engine, the database, or the embedding model. It was one
format handler doing too much work on files that didn't need it.
Other things I fixed while I was in there
Once I had the profiling infrastructure, I kept pulling threads:
Search was firing twice on Enter. A debounce timer and the
keydown handler were both triggering searches. Every Enter key =
two identical queries running in parallel.
Korean IME was triggering searches mid-composition. Korean
characters are composed from multiple keystrokes (ㅎ → 하 → 한).
Each keystroke was firing a search before the user finished typing.
Fix: require 2+ completed syllables before executing.
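The syllable check works because completed Hangul syllables occupy a dedicated Unicode block (U+AC00 to U+D7A3), while in-progress jamo like ㅎ live elsewhere. A Python sketch (the threshold and helper names are illustrative):

```python
def should_search(query, min_syllables=2):
    """Gate search on Korean input until composition has progressed."""
    # Completed syllables: Hangul Syllables block, U+AC00..U+D7A3.
    syllables = sum(1 for ch in query if 0xAC00 <= ord(ch) <= 0xD7A3)
    # In-progress jamo: Hangul Compatibility Jamo block.
    composing = any(0x3131 <= ord(ch) <= 0x318E for ch in query)
    if syllables or composing:
        # Korean input: wait for at least min_syllables finished syllables.
        return syllables >= min_syllables
    return bool(query.strip())  # non-Korean queries fire as usual
```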
Filename matches dominated results unfairly. A file named
report.docx scored 5x higher than a 50-page document with
"report" mentioned dozens of times in the body. Reduced filename
boost from 5.0x to 2.5x so body content gets a fair shot.
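With FTS5, this kind of boost can be expressed as per-column weights to the `bm25()` auxiliary function. A minimal illustrative example (the schema and data are made up, not LocalSynapse's actual index):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE docs USING fts5(filename, body)")
con.executemany("INSERT INTO docs VALUES (?, ?)", [
    ("report.docx", "unrelated body text"),
    ("analysis.docx", "the report mentions report totals and report notes"),
])
# bm25() takes one weight per column; scores are negative, lower is better.
rows = con.execute(
    "SELECT filename, bm25(docs, 2.5, 1.0) AS score "  # filename weighted 2.5x
    "FROM docs WHERE docs MATCH 'report' ORDER BY score"
).fetchall()
```

Dropping the first weight from 5.0 to 2.5 is a one-character change here, which is part of why tuning it against real queries was cheap.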
PDF parser was silently indexing garbage. Some PDFs have CMap
encoding that produces garbled text when extracted. The parser was
happily indexing strings like ÿþ÷ðîñ as if they were real content.
Now it detects garbled text and flags the file as unindexable instead
of polluting search results.
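Detecting "garbled" is necessarily heuristic. One cheap signal, sketched in Python (this is an assumed heuristic for illustration, not necessarily the tool's actual rule): mojibake from broken CMaps tends to decode almost entirely into Latin-1 supplement letters, while real prose uses them sparsely.

```python
def looks_garbled(text, threshold=0.5):
    """Flag text whose letters are mostly Latin-1 supplement chars
    (U+00C0..U+00FF), a common signature of CMap mojibake."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    suspects = sum(1 for ch in letters if 0xC0 <= ord(ch) <= 0xFF)
    return suspects / len(letters) > threshold
```

Accented Western-European prose stays well under the threshold, and Korean or other non-Latin scripts fall entirely outside the suspect range.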
AI model download had no integrity check. The BGE-M3 embedding
model is 2.3 GB. The SHA256 hash fields existed in the code but
were empty strings — verification was silently skipped. Fixed.
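Streaming SHA256 verification is only a few lines; the part that matters is that a missing pinned hash must fail loudly instead of quietly skipping the check, which was the original bug. A Python sketch:

```python
import hashlib

def verify_download(path, expected_sha256):
    """Stream-hash a downloaded file and compare to the pinned digest."""
    if not expected_sha256:
        # An empty pin must be an error, never a silent pass.
        raise ValueError("no pinned SHA256 for this artifact")
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MB chunks
            h.update(chunk)
    return h.hexdigest() == expected_sha256.lower()
```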
The lesson
The performance problem wasn't where I expected. I was ready to
optimize SQLite queries, batch database writes, parallelize the
pipeline. The actual fix was: don't parse a 45MB spreadsheet
cell by cell.
Profiling before optimizing sounds obvious. But when you're a solo
dev shipping features every week, "I'll profile it later" turns into
months of users experiencing a slow tool because you never looked
at where the time actually goes.
The project
The tool is LocalSynapse
— a local file search engine with a built-in MCP server
(so AI agents like Claude can search your files too).
- C# / Avalonia (cross-platform desktop)
- SQLite FTS5 for BM25 text ranking
- BGE-M3 via ONNX Runtime for semantic search
- Apache 2.0 license
- Windows + macOS
If you've ever spent 10 minutes digging through folders for a file
you know exists, that's the problem this solves.
localsynapse.com
Happy to answer questions about the architecture, the profiling
setup, or anything else.