Alessandro Binda

Posted on Jun 12

I Built a Free Database of 272M Companies — Here's How

#opensource #database #buildinpublic #career

I'm a solo developer in Milan, and over the past year I built a free database of 272 million companies from 40+ countries. It's live at score.get-scala.com, with a free API, SDKs on npm and PyPI, and a sample dataset on Kaggle.

This is the story of how and why I built it.

The Problem

Business data is absurdly expensive. Want to look up a company's revenue, employee count, or credit rating?

Dun & Bradstreet: $20,000+/year
ZoomInfo: $15,000+/year
Clearbit: $12,000+/year
Bureau van Dijk (Orbis): $30,000+/year

These prices make sense for Fortune 500 companies. But for startups, indie developers, and small agencies? They're a non-starter.

I wanted to build something that makes business intelligence accessible to everyone.

The Data Sources

The secret is that most company data is public. Governments require businesses to register, and many publish this data openly. I built scrapers and parsers for:

France: SIRENE / INSEE (11M+ companies)
UK: Companies House (5M+)
US: SEC EDGAR + state registries
Germany: Handelsregister / Bundesanzeiger
Italy: Camera di Commercio open data
Spain: BORME (Boletín Oficial del Registro Mercantil)
Netherlands: KvK (Kamer van Koophandel)
Sweden: Bolagsverket
Norway: Brønnøysund Register Centre
Denmark: CVR (Central Business Register)
Finland: PRH + YTJ
Belgium: Crossroads Bank for Enterprises
Austria: Firmenbuch extracts
Poland: KRS + CEIDG
Czech Republic: ARES
Plus 25+ more countries...

Each source has its own format, encoding, update frequency, and quirks. French SIRENE gives you a 6GB CSV monthly. UK Companies House has a solid REST API. German Handelsregister requires parsing HTML. US state registries are all different.
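
To give a feel for what handling one of these sources looks like, here is a minimal sketch of streaming a large registry CSV row by row and mapping it onto a common schema. The column names (`id_number`, `company_name`, `created`) and the target fields are hypothetical placeholders, not the real SIRENE headers or the schema used in this project.

```python
import csv
import io
from typing import Iterator, TextIO

# Hypothetical mapping from one source's column names to a common schema.
# These names are illustrative only, not real registry headers.
EXAMPLE_COLUMN_MAP = {
    "id_number": "registration_number",
    "company_name": "name",
    "created": "incorporation_date",
}


def iter_companies(
    stream: TextIO, column_map: dict[str, str], country: str
) -> Iterator[dict[str, str]]:
    """Yield one normalized record per CSV row without loading the whole file."""
    reader = csv.DictReader(stream)
    for row in reader:
        record = {"country": country}
        for source_col, target_col in column_map.items():
            # Missing or empty cells become empty strings; whitespace is trimmed.
            record[target_col] = (row.get(source_col) or "").strip()
        # Skip rows with no registration number, since they cannot be keyed later.
        if record.get("registration_number"):
            yield record


# Small in-memory sample standing in for a multi-gigabyte file.
sample = io.StringIO(
    "id_number,company_name,created\n"
    "123, Example SARL ,2001-05-04\n"
    ",No Id Ltd,2010-01-01\n"
)
records = list(iter_companies(sample, EXAMPLE_COLUMN_MAP, "FR"))
print(records)
```

For a real file you would pass `open(path, newline="", encoding=...)` with whatever encoding that registry uses. Because the function is a generator, memory use stays flat regardless of file size.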

The Tech Stack

I optimized for cost efficiency since I'm bootstrapping this solo:

DuckDB + Parquet: The entire 272M company dataset compresses to ~19GB in Parquet format. DuckDB queries the Parquet files in place, reading only the columns and row groups a query needs instead of loading everything into memory.
Fastify API: Node.js REST API serving search, autocomplete, and company detail endpoints.
PostgreSQL: For user accounts, API keys, usage tracking, and frequently-accessed company data.
Hetzner dedicated server: The whole thing runs on a single Hetzner box at €34/month. That's it. No AWS. No Kubernetes. No $10K/month cloud bill.

Total infrastructure cost: €34/month for 272 million companies.

The Scoring Algorithm

Raw data isn't very useful. You want to know: is this company healthy?

I built a proprietary scoring algorithm that rates every company from 0 to 100 with letter grades:

Grade	Score	Meaning
AA	80-100	Excellent financial health
A	60-79	Good, reliable
BB	40-59	Average, some risk
B	20-39	Below average
C	10-19	Poor, significant risk
D	5-9	Very poor, likely distressed
E	0-4	Critical / default risk

The score factors in: company age, legal status, industry risk, jurisdiction, available financial data, filing history, and more. It's not perfect (no public-data score can be), but it's a solid first-pass filter.
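
The scoring itself is proprietary, but the last step, turning a 0-100 score into a letter grade, follows directly from the table above. A minimal sketch, assuming integer scores and using a function name (`score_to_grade`) invented for illustration:

```python
# Lower bound of each grade band from the table, highest band first.
GRADE_BANDS = [
    (80, "AA"),
    (60, "A"),
    (40, "BB"),
    (20, "B"),
    (10, "C"),
    (5, "D"),
    (0, "E"),
]


def score_to_grade(score: int) -> str:
    """Map a 0-100 score to its letter grade using the published bands."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    for lower_bound, grade in GRADE_BANDS:
        if score >= lower_bound:
            return grade
    # Unreachable: the last band starts at 0 and the range was checked above.
    raise AssertionError("no grade band matched")


for s in (92, 60, 39, 7, 0):
    print(s, score_to_grade(s))
```

Keeping the bands in one ordered list means the thresholds live in a single place if they ever change.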

The Data Pipeline

Every company goes through this pipeline:

Scrape: Pull from government registries (automated, scheduled)
Parse: Normalize names, addresses, legal forms across 40+ formats
Deduplicate: Match entities across sources (fuzzy name matching + registration numbers)
Classify: Assign NACE/SIC industry codes where missing
Score: Run the credit scoring algorithm (0-100)
Grade: Convert score to letter grade (AA to E)
Index: Load into DuckDB/Parquet + PostgreSQL for API serving

The whole pipeline reruns monthly for most sources, weekly for some.
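
The deduplication step is the trickiest one, so here is a rough sketch of the idea: trust registration numbers when both records have one, and fall back to fuzzy name matching otherwise. This is a simplified illustration using Python's standard `difflib`, not the production matcher. The legal-form list, the `0.9` threshold, and the record fields are all assumptions.

```python
from difflib import SequenceMatcher
import re

# Illustrative subset of legal-form suffixes. A real list would be
# per-country and much longer.
LEGAL_FORMS = {"ltd", "limited", "gmbh", "ag", "sa", "sarl", "srl", "spa", "bv", "inc", "llc"}


def normalize_name(name: str) -> str:
    """Lowercase, strip punctuation, and drop trailing legal-form tokens."""
    # Remove dots first so "S.r.l." collapses to "srl" rather than "s r l".
    cleaned = name.lower().replace(".", "")
    tokens = re.sub(r"[^\w\s]", " ", cleaned).split()
    while tokens and tokens[-1] in LEGAL_FORMS:
        tokens.pop()
    return " ".join(tokens)


def same_company(a: dict, b: dict, threshold: float = 0.9) -> bool:
    """Decide whether two records from the same country describe one entity."""
    if a["country"] != b["country"]:
        return False
    reg_a, reg_b = a.get("registration_number"), b.get("registration_number")
    if reg_a and reg_b:
        # Registration numbers are authoritative when both sides have one.
        return reg_a == reg_b
    # Otherwise compare normalized names with a similarity ratio in [0, 1].
    ratio = SequenceMatcher(
        None, normalize_name(a["name"]), normalize_name(b["name"])
    ).ratio()
    return ratio >= threshold


left = {"country": "IT", "name": "Acme Trading S.r.l."}
right = {"country": "IT", "name": "ACME TRADING SRL", "registration_number": "IT001"}
print(same_company(left, right))
```

Pairwise comparison like this does not scale to hundreds of millions of rows on its own. In practice you would first group candidates (for example by country and a name prefix) and only compare within each group.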

The AI Factor

I'll be honest: 95% of the code was written with Claude (Anthropic's AI assistant). I'm a solo developer and there's no way I could have built scrapers for 40+ countries, a scoring algorithm, an API, a dashboard, SDKs, and an MCP server without AI assistance.

This isn't a confession — it's the future of software. One developer with AI can build what used to require a team of 10.

What's Available Now

Free Dashboard

Search any company at score.get-scala.com. No signup required for basic searches.

API (from €19/month)

REST API with search, autocomplete, company details, credit scores, and bulk endpoints.

curl "https://score.get-scala.com/api/v1/companies/search?q=BMW&country=DE" \
  -H "Authorization: Bearer YOUR_API_KEY"

JavaScript SDK

npm install scala-score

import { ScalaScore } from "scala-score";
const client = new ScalaScore({ apiKey: "your-key" });
const results = await client.companies.search({ name: "BMW", country: "DE" });

Python SDK

pip install scala-score

from scala_score import ScalaScore
client = ScalaScore(api_key="your-key")
results = client.companies.search(name="BMW", country="DE")

MCP Server (for AI Agents)

npx scala-mcp-server

Let Claude, GPT, or any MCP-compatible AI agent search company data directly.

Kaggle Sample

A sample of 994K companies is on Kaggle, free to download and explore.

The Vision

Business intelligence shouldn't be locked behind $20K/year enterprise contracts. Every developer, every startup, every small business should be able to look up a potential partner, supplier, or client and get reliable data.

I'm building toward 500 million companies by 2027, covering every country that publishes business registry data.

If you're building something with company data, I'd love to hear about it. The API is free to start: 100 lookups/month, no credit card required.

Built solo in Milan. Powered by public data and Claude. Democratizing business intelligence, one API call at a time.