I built a search engine for 3 million Polish businesses — here's what I learned

#python #webdev #startup #programming

Poland has over 3 million registered businesses spread across two separate public registries — KRS (corporations) and CEIDG (sole proprietorships). Finding reliable data about a Polish company used to mean navigating slow government portals, dealing with inconsistent data formats, and manually cross-referencing multiple sources.
So I built nipgo.pl to fix that.

The problem
If you're a B2B salesperson, accountant, or procurement manager in Poland, verifying a contractor means:

Going to the KRS portal — slow, no API-friendly interface
Checking CEIDG separately — different format, different search
Cross-referencing VAT status on the Ministry of Finance whitelist
Manually checking if the company has any public procurement history

This is painful, especially when you need to do it for 50 companies a week.

What nipgo.pl does
nipgo.pl aggregates all of this into one search:

700k+ KRS entities (corporations, partnerships, foundations)
2.6M+ CEIDG entities (sole proprietorships)
VAT status from the Ministry of Finance
Public procurement history (BZP tenders since 2021)
Public subsidies and grants (SUDOP registry)
Contact data scraped from public sources
AI-generated company summaries

Search by company name, NIP (tax ID), REGON, phone number, email, domain, or owner name. Filter by industry (PKD code), region, legal form, registration date, or capital amount.

The data challenge
The hardest part wasn't building the UI — it was the data.
The KRS API masks the names of natural persons with asterisks (a GDPR measure in place since 2023). Getting full names requires authenticated scraping of PDF registry documents, and their layout differs depending on when the company was registered.
CEIDG has ~2.6M records across ~50,000 paginated API pages. A full crawl takes weeks and requires careful rate limit management across multiple API tokens.
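
A minimal sketch of how a long paginated crawl like this could be paced, assuming a round-robin over tokens with a minimum gap between uses of the same token. The names `crawl_pages`, `fetch_page`, and `min_interval_per_token` are hypothetical, and the real CEIDG client and its limits are not shown here:

```python
import itertools
import time
from typing import Callable, Iterator


def crawl_pages(
    fetch_page: Callable[[int, str], list[dict]],  # hypothetical: (page, token) -> records
    tokens: list[str],
    start_page: int,
    last_page: int,
    min_interval_per_token: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[tuple[int, list[dict]]]:
    """Yield (page_number, records), rotating tokens and pacing each one."""
    # Earliest time each token may be used again.
    next_allowed = {token: 0.0 for token in tokens}
    token_cycle = itertools.cycle(tokens)

    for page in range(start_page, last_page + 1):
        token = next(token_cycle)
        wait = next_allowed[token] - clock()
        if wait > 0:
            sleep(wait)
        records = fetch_page(page, token)
        next_allowed[token] = clock() + min_interval_per_token
        # The caller can persist `page` after each yield to make the crawl resumable.
        yield page, records
```

With around 50,000 pages, saving the last yielded page number somewhere durable means a crashed run can restart from `start_page` rather than from page 1.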
PKD codes (Polish industry classification) exist in two formats — pre-2015 companies use a nested array format, newer ones use flat objects. Handling both without crashes took more debugging than I'd like to admit.
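
One way to absorb both shapes is to normalize them into a flat list of code strings before anything else touches the data. This is only a sketch: the exact payloads are not shown in this post, so the `"code"` key and the sample values below are assumptions.

```python
from typing import Any


def normalize_pkd(raw: Any) -> list[str]:
    """Flatten either assumed PKD shape into a de-duplicated list of code strings."""
    codes: list[str] = []

    def walk(node: Any) -> None:
        if node is None:
            return
        if isinstance(node, str):
            code = node.strip()
            if code:
                codes.append(code)
        elif isinstance(node, dict):
            # Hypothetical key name; the real payload may use a different one.
            walk(node.get("code"))
        elif isinstance(node, (list, tuple)):
            for item in node:
                walk(item)
        # Any other type (numbers, booleans) is ignored instead of raising.

    walk(raw)
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return list(dict.fromkeys(codes))
```

Unexpected input comes back as an empty list, which is easy to count and alert on, where an exception would stop the import halfway through.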
The VAT whitelist sits behind an Imperva WAF that limits requests to ~1,400/day from a single IP. Batch endpoints return zero results in practice, so individual lookups are the only option.
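
A per-day cap like this can be enforced client-side with a small counter that resets when the date changes. The sketch below is illustrative only: `DailyBudget` is a hypothetical helper, and the limit you pass in should sit below the ~1,400/day ceiling by whatever margin you are comfortable with.

```python
from datetime import date
from typing import Callable


class DailyBudget:
    """Allow at most `limit` requests per calendar day."""

    def __init__(self, limit: int, today: Callable[[], date] = date.today) -> None:
        self.limit = limit
        self._today = today
        self._day = today()
        self._used = 0

    def try_acquire(self) -> bool:
        """Return True and count the request if today's budget is not exhausted."""
        current = self._today()
        if current != self._day:
            # New day: reset the counter.
            self._day = current
            self._used = 0
        if self._used >= self.limit:
            return False
        self._used += 1
        return True
```

A scraper would call `try_acquire()` before each lookup and park the remaining NIPs in a queue for the next day once it returns `False`.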

What I'd do differently
Start with the data pipeline, not the UI. I spent too much time on the frontend before the data was clean enough to display. A beautiful UI on top of messy data is useless.
Build keyset pagination from day one. OFFSET-based pagination on 2.6M records causes timeout hell at high offsets, because the database still has to read and discard every skipped row. Switching to keyset (cursor-based) pagination was a painful but necessary refactor.
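
For anyone who hasn't seen keyset pagination, here is a minimal, self-contained sketch. It uses an in-memory SQLite database so it runs anywhere, while the site itself runs on PostgreSQL. The `companies` table, its columns, and both function names are hypothetical.

```python
import sqlite3
from typing import Optional


def make_demo_db(n: int) -> sqlite3.Connection:
    """Create an in-memory table with n rows for the demo."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO companies (id, name) VALUES (?, ?)",
        [(i, f"Company {i}") for i in range(1, n + 1)],
    )
    return conn


def fetch_page(
    conn: sqlite3.Connection, after_id: Optional[int], page_size: int
) -> tuple[list[tuple[int, str]], Optional[int]]:
    """Return one page of rows plus the cursor to pass in for the next page."""
    if after_id is None:
        sql = "SELECT id, name FROM companies ORDER BY id LIMIT ?"
        params: tuple = (page_size,)
    else:
        # Seek straight to the row after the cursor, so no skipped rows are scanned.
        sql = "SELECT id, name FROM companies WHERE id > ? ORDER BY id LIMIT ?"
        params = (after_id, page_size)
    rows = conn.execute(sql, params).fetchall()
    # A short page means there is nothing left to fetch.
    next_cursor = rows[-1][0] if len(rows) == page_size else None
    return rows, next_cursor
```

The cursor is just the last `id` of the previous page, so each query is an index seek plus a short scan no matter how deep into the result set you are. The trade-off is that you can no longer jump to an arbitrary page number.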
Monitor everything early. Data quality issues in public registries are invisible until a user hits an edge case — a company registered in 1994 with a completely different JSON structure, a CEIDG record with a null NIP, a PKD code from a deprecated classification system.
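
A cheap way to start is to run every imported record through a few checks and count the failures per batch. The sketch below assumes records are dicts with hypothetical `nip` and `pkd` fields. The NIP check itself is the standard checksum: weights 6, 5, 7, 2, 3, 4, 5, 6, 7 over the first nine digits, summed modulo 11, compared against the tenth digit.

```python
from collections import Counter
from typing import Any, Iterable

NIP_WEIGHTS = (6, 5, 7, 2, 3, 4, 5, 6, 7)


def is_valid_nip(nip: Any) -> bool:
    """Check the NIP format and its mod-11 check digit."""
    if not isinstance(nip, str):
        return False
    digits = nip.replace("-", "").replace(" ", "")
    if len(digits) != 10 or not (digits.isascii() and digits.isdigit()):
        return False
    checksum = sum(w * int(d) for w, d in zip(NIP_WEIGHTS, digits)) % 11
    # A checksum of 10 can never match a single digit, so such numbers are rejected.
    return checksum == int(digits[9])


def find_issues(record: dict) -> list[str]:
    """Return a list of issue labels for one record (field names are assumptions)."""
    issues: list[str] = []
    nip = record.get("nip")
    if nip is None:
        issues.append("nip_missing")
    elif not is_valid_nip(nip):
        issues.append("nip_invalid")
    if not record.get("pkd"):
        issues.append("pkd_missing")
    return issues


def tally_issues(records: Iterable[dict]) -> Counter:
    """Count how often each issue label appears across a batch."""
    counts: Counter = Counter()
    for record in records:
        counts.update(find_issues(record))
    return counts
```

Logging the tally after each import batch makes a sudden jump in `nip_missing` or `pkd_missing` visible long before a user stumbles on it.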

Current state
The platform is live at nipgo.pl with a freemium model:

Free — basic search and registry data
Basic — contact data, CSV export, monitoring, CRM
Pro — financial reports, risk scoring, full history

Still a lot to build — financial statements, ownership graphs, automated change alerts. But the core data is there and it works.

If you're building something similar for another country's business registry, happy to share what I've learned. Drop a comment or reach out at hello@nipgo.pl.

Built with: Next.js, Supabase (PostgreSQL), Python scrapers, Vercel
Data: KRS API, CEIDG API, MF VAT Whitelist, BZP, SUDOP

