A few days ago I published the core extraction engine of GSTExtract on GitHub, MIT licensed (repo link at the end of this post). It's not the full product. It's a seven-week-old snapshot, and that gap is the whole strategy.
This post is about the decisions I made picking what to open-source and what to keep private. The deciding took longer than writing the code did, which surprised me. Writing it up in case anyone else is staring at the same question on their own project.
Context, Briefly
GSTExtract reads Indian GST invoice PDFs and pulls out the fields (GSTIN, invoice number, amounts, taxes) into Excel. Small businesses and accountants use it at month-end to avoid hand-typing supplier bills into Tally. The stack is Python: pdfplumber for digital PDFs, Tesseract OCR as a fallback, regex plus keyword-anchored extraction, and openpyxl for Excel output. FastAPI wraps it for the hosted version. The hosted tool is free during early access, with a paid tier planned for high-volume users later.
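For anyone who hasn't built one of these, the "digital first, OCR as fallback" idea looks roughly like the sketch below. It is illustrative only, not code from the repo: the function names, the `min_chars` threshold, and the use of pdf2image to rasterise pages are all assumptions made for the example.

```python
from typing import Callable


def read_digital(path: str) -> str:
    """Read the embedded text layer of a PDF with pdfplumber."""
    import pdfplumber  # third-party; imported lazily so the sketch loads without it

    with pdfplumber.open(path) as pdf:
        # extract_text() can come back empty for scanned pages
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def read_with_ocr(path: str) -> str:
    """Rasterise each page and run Tesseract over it (assumed approach)."""
    from pdf2image import convert_from_path  # third-party; needs Poppler installed
    import pytesseract  # third-party; needs the Tesseract binary installed

    return "\n".join(
        pytesseract.image_to_string(image) for image in convert_from_path(path)
    )


def extract_text(
    path: str,
    digital: Callable[[str], str] = read_digital,
    ocr: Callable[[str], str] = read_with_ocr,
    min_chars: int = 50,  # hypothetical threshold for "this PDF has a usable text layer"
) -> str:
    """Prefer the text layer; fall back to OCR when it is missing or too thin."""
    text = digital(path)
    if len(text.strip()) >= min_chars:
        return text
    return ocr(path)
```

Passing the two readers in as arguments keeps the fallback decision testable without real PDFs or a Tesseract install.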
So the question was: how do I open-source without giving away the commercial edge?
What Went Public
The deterministic regex plus keyword extraction pipeline. GSTIN parsing with the Luhn mod-36 checksum. Zone segmentation that splits an invoice into header and footer and routes fields by zone. Tax-line detection with keyword anchoring. Excel export. The test suite. Basically, the commoditised parsing work that anyone doing invoice extraction would eventually re-implement themselves if they cared enough.
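To make the GSTIN part concrete, here is a minimal sketch of the check-character calculation as it is commonly described: a Luhn mod N scheme over a 36-character alphabet, computed on the first 14 characters. It is a standalone illustration rather than the code in the repo, and the names `gstin_check_char` and `is_valid_gstin` are made up for this example.

```python
import re

# 0-9 then A-Z: each character's index is its value in base 36
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"

# Shape check: state code, PAN (5 letters, 4 digits, 1 letter),
# entity code, the fixed "Z", and the check character
GSTIN_SHAPE = re.compile(r"[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z][1-9A-Z]Z[0-9A-Z]")


def gstin_check_char(first14: str) -> str:
    """Compute the 15th character from the first 14 (Luhn mod 36)."""
    total = 0
    factor = 2  # the rightmost payload character is doubled first
    for ch in reversed(first14):
        product = ALPHABET.index(ch) * factor
        # add the two base-36 "digits" of the product
        total += product // 36 + product % 36
        factor = 1 if factor == 2 else 2
    return ALPHABET[(36 - total % 36) % 36]


def is_valid_gstin(candidate: str) -> bool:
    """True if the string has GSTIN shape and a matching check character."""
    candidate = candidate.strip().upper()
    if not GSTIN_SHAPE.fullmatch(candidate):
        return False
    return gstin_check_char(candidate[:14]) == candidate[14]
```

The checksum is what lets an extractor throw away OCR misreads: a single wrong character in a candidate GSTIN fails the check, so it never reaches the spreadsheet.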
If someone wants to understand how you pull GSTINs and invoice totals out of a messy PDF, the code is all there. Forkable, patchable, usable in their own pipelines.
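As a rough idea of what "keyword anchoring" means for tax lines, the sketch below scans each line for a tax keyword and takes the last amount on that line. It is a simplified illustration under my own assumptions (the keyword list, the amount pattern, and the function name `find_tax_lines` are invented here), not the engine's actual logic.

```python
import re
from decimal import Decimal

# Assumed anchors; real invoices use many more spellings
TAX_KEYWORDS = ("CGST", "SGST", "IGST")

# An amount with two decimals, allowing thousands separators, e.g. 1,180.00
AMOUNT = re.compile(r"\d[\d,]*\.\d{2}")


def find_tax_lines(text: str) -> dict[str, Decimal]:
    """Map each tax keyword to the last amount found on a line mentioning it."""
    found: dict[str, Decimal] = {}
    for line in text.splitlines():
        upper = line.upper()
        for keyword in TAX_KEYWORDS:
            if keyword not in upper:
                continue
            amounts = AMOUNT.findall(line)
            if amounts:
                # The rate (e.g. "9%") usually comes first, the tax amount last
                found[keyword] = Decimal(amounts[-1].replace(",", ""))
    return found


sample = "Taxable value 1,000.00\nCGST @ 9% 90.00\nSGST @ 9% 90.00\nTotal 1,180.00"
taxes = find_tax_lines(sample)
```

Anchoring on the keyword first, and only then looking for a number, is what keeps the total or the taxable value from being mistaken for a tax amount.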
What Stayed Private
The seven weeks of engine refinements since the cut point:
- Multi-vendor accuracy tuning for Amazon, Flipkart, Swiggy, Zomato, BookMyShow, RedBus, and Myntra invoices (each one has a slightly different layout that breaks generic extraction)
- Table-based tax extraction for borderless PDFs
- Multi-invoice detection for combined bundles
- Per-field confidence scoring refinements
- Proprietorship invoice edge cases (handwritten-style templates that use bare "FROM" labels instead of "BILL FROM", single-digit invoice serials, and so on)
Plus the entire webapp: FastAPI layer, rate limiting, CSRF, file validation, the invoice validation gate that rejects credit notes and proforma bills, the batch upload flow, the learning-data logging. All of that stays closed.
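To give a sense of what a validation gate does without publishing mine, here is a deliberately naive sketch: look at the top of the document for a title that marks it as something other than a tax invoice. The patterns, the `header_lines` cutoff, and the function name are assumptions for illustration, and the private version does not necessarily work this way.

```python
import re
from typing import Optional

# Hypothetical title patterns for documents that should not be extracted
REJECT_PATTERNS = {
    "credit note": re.compile(r"\bcredit\s+note\b", re.IGNORECASE),
    "proforma invoice": re.compile(r"\bpro[\s-]?forma\b", re.IGNORECASE),
}


def rejection_reason(text: str, header_lines: int = 15) -> Optional[str]:
    """Return why a document should be rejected, or None if it looks acceptable.

    Only the first few lines are checked, so a passing mention of
    "credit note" in the terms at the bottom does not trigger a rejection.
    """
    header = "\n".join(text.splitlines()[:header_lines])
    for label, pattern in REJECT_PATTERNS.items():
        if pattern.search(header):
            return label
    return None
```

Restricting the check to the header is the design choice that matters here: document type is declared in the title, and footers are full of boilerplate that mentions other document types.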
Why the Time-Lag
This is the part I thought about most. Just publishing the latest engine would let anyone stand up a competing extraction site with the same accuracy I have. Publishing a snapshot from seven weeks ago means competitors get a working baseline, but not the current edge. The engineering work I do this month becomes public (maybe) in a few months, not immediately.
Redis, MongoDB, and Elastic have all done versions of this. The open-core pattern: community gets a real, working version of the core. Commercial version stays ahead by some time delta. Nobody feels cheated, but the pricing power stays intact.
For me, the specific cut was commit 0aa5f07 from 2 March 2026, labeled "Phase 12 production hardening" in the private repo. Engine is solid from there. Everything after is refinement.
What I Expect to Get From This
Honestly, the single biggest thing is the backlink. GitHub has domain authority 96. Having a public repo that links to gstextract.com from the README gives my young domain an authority signal it couldn't easily get any other way. On a site that's only a couple of months old and trying to rank for competitive GST queries, that one dofollow link is legitimately meaningful.
After that, in descending likelihood:
- Credibility signal to technical readers who want to see how the extraction works. Some fraction of people evaluating a hosted tool will check if the code is worth trusting.
- Issues and pull requests from people hitting edge cases I haven't seen. GST invoice formats are wilder than you'd think. Someone will inevitably send me a PDF from a vendor I've never heard of that breaks something.
- Forks from people who need self-hosted extraction for their own use cases. Rare, but they become potential collaborators.
What I'm not expecting: hundreds of stars, viral adoption, a community forming around the repo. For a narrow Indian-GST tool, the realistic audience is small. A handful of serious users finding it and either using it or contributing is the bar, not hockey-stick open-source growth.
Honest Tradeoffs
What this might cost me:
- Competitors can clone and study the extraction logic. They would eventually anyway, but I've shortened the path.
- Time maintaining the public version even if nobody uses it. If I ignore it for months, it rots and becomes a bad signal.
- Some fraction of potential paying customers might think "well, the core is free, I'll just self-host" and walk away. I don't think this is many people, but it's nonzero.
What I'm betting:
- The moat is distribution plus the continuous-improvement time-lag, not the code itself.
- Most would-be competitors don't actually want to self-host and maintain a parser, wrangle Tesseract and Poppler, and manage their own uptime. The hosted tool is a service, not just code.
- People who'll pay for the hosted version are paying for the current-version edge, the no-setup convenience, and whatever the product becomes post-launch. The snapshot on GitHub is closer to educational than competitive.
We'll see if that's right. If self-hosting forks start cutting into hosted usage in a measurable way, I'll revisit the cut-point. For now, the calculus feels fine.
Links
- Repo: github.com/ritusmoikaushik/gstextract-core
- Hosted tool: gstextract.com
If you're building something similar and thinking through an open-core cut, happy to compare notes. And if you run into parsing edge cases on unusual invoice formats, open an issue on the repo.