DEV Community

Alessandro Binda

I Built a Free Database of 272M Companies — Here's How

I'm a solo developer in Milan, and over the past year I built a free database of 272 million companies from 40+ countries. It's live at score.get-scala.com, with a free API, SDKs on npm and PyPI, and a sample dataset on Kaggle.

This is the story of how and why I built it.


The Problem

Business data is absurdly expensive. Want to look up a company's revenue, employee count, or credit rating?

  • Dun & Bradstreet: $20,000+/year
  • ZoomInfo: $15,000+/year
  • Clearbit: $12,000+/year
  • Bureau van Dijk (Orbis): $30,000+/year

These prices make sense for Fortune 500 companies. But for startups, indie developers, and small agencies, they're a non-starter.

I wanted to build something that makes business intelligence accessible to everyone.


The Data Sources

The secret is that most company data is public. Governments require businesses to register, and many publish this data openly. I built scrapers and parsers for:

  • France: SIRENE / INSEE (11M+ companies)
  • UK: Companies House (5M+)
  • US: SEC EDGAR + state registries
  • Germany: Handelsregister / Bundesanzeiger
  • Italy: Camera di Commercio open data
  • Spain: BORME (Boletín Oficial del Registro Mercantil)
  • Netherlands: KvK (Kamer van Koophandel)
  • Sweden: Bolagsverket
  • Norway: Brønnøysund Register Centre
  • Denmark: CVR (Central Business Register)
  • Finland: PRH + YTJ
  • Belgium: Crossroads Bank for Enterprises
  • Austria: Firmenbuch extracts
  • Poland: KRS + CEIDG
  • Czech Republic: ARES
  • Plus 25+ more countries...

Each source has its own format, encoding, update frequency, and quirks. French SIRENE gives you a 6GB CSV monthly. UK Companies House has a solid REST API. German Handelsregister requires parsing HTML. US state registries are all different.
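To give a feel for what handling a multi-gigabyte CSV export looks like, here is a minimal sketch that streams rows one at a time and maps them to a common record shape. It is not the parser used in this project: the column names (siren, denominationUniteLegale, dateCreationUniteLegale) are modeled on SIRENE-style headers and should be treated as assumptions, the output keys are hypothetical, and the sample rows are made up.

```python
import csv
import io
from typing import Iterator, TextIO


def iter_companies(handle: TextIO, country: str) -> Iterator[dict]:
    """Yield one normalized record per CSV row without loading the whole file."""
    reader = csv.DictReader(handle)
    for row in reader:
        # Column names below are assumptions modeled on a SIRENE-style export.
        name = (row.get("denominationUniteLegale") or "").strip()
        if not name:
            continue  # skip rows with no registered name
        yield {
            "country": country,
            "registration_number": row["siren"].strip(),
            "name": name,
            "founded": row.get("dateCreationUniteLegale") or None,
        }


# Tiny in-memory stand-in for a large registry export (made-up rows).
sample = io.StringIO(
    "siren,denominationUniteLegale,dateCreationUniteLegale\n"
    "000000001,EXEMPLE SARL,2001-05-14\n"
    "000000002,,2010-01-01\n"
    "000000003,  DEMO SAS ,\n"
)

records = list(iter_companies(sample, country="FR"))
for record in records:
    print(record)
```

For a real file, the same generator can be fed a handle from open(path, newline="", encoding="utf-8"), so memory use stays flat regardless of file size.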


The Tech Stack

I optimized for cost efficiency since I'm bootstrapping this solo:

  • DuckDB + Parquet: The entire 272M company dataset compresses to ~19GB in Parquet format. DuckDB queries the Parquet files directly and reads only the columns and row groups a query needs, so the full dataset never has to be loaded into memory.
  • Fastify API: Node.js REST API serving search, autocomplete, and company detail endpoints.
  • PostgreSQL: For user accounts, API keys, usage tracking, and frequently-accessed company data.
  • Hetzner dedicated server: The whole thing runs on a single Hetzner box at €34/month. That's it. No AWS. No Kubernetes. No $10K/month cloud bill.

Total infrastructure cost: €34/month for 272 million companies.
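As a rough back-of-the-envelope check on those numbers, the snippet below works out the average on-disk footprint per company and the hosting cost per million companies. It uses only the figures quoted above and assumes "GB" means decimal gigabytes; since the 19GB figure is approximate, so is the result.

```python
# Figures quoted in the post.
COMPANIES = 272_000_000
PARQUET_SIZE_GB = 19  # assumption: decimal gigabytes (10**9 bytes)
SERVER_COST_EUR_PER_MONTH = 34

# Average compressed size of one company record on disk.
bytes_per_company = PARQUET_SIZE_GB * 10**9 / COMPANIES

# Monthly hosting cost attributed to each million companies.
eur_per_million_companies = SERVER_COST_EUR_PER_MONTH / (COMPANIES / 1_000_000)

print(f"{bytes_per_company:.0f} bytes per company on disk")
print(f"EUR {eur_per_million_companies:.3f} per million companies per month")
```

That works out to roughly 70 bytes per company and about €0.125 per million companies per month.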


The Scoring Algorithm

Raw data isn't very useful. You want to know: is this company healthy?

I built a proprietary scoring algorithm that rates every company from 0 to 100 with letter grades:

Grade  Score   Meaning
AA     80-100  Excellent financial health
A      60-79   Good, reliable
BB     40-59   Average, some risk
B      20-39   Below average
C      10-19   Poor, significant risk
D      5-9     Very poor, likely distressed
E      0-4     Critical / default risk
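The grade bands above translate directly into a lookup. This is a minimal sketch of that mapping only, assuming integer scores; the function and constant names are hypothetical, and it says nothing about how the score itself is computed.

```python
# Lower bound of each band, highest first, taken from the grade table.
GRADE_BANDS = [
    (80, "AA"),
    (60, "A"),
    (40, "BB"),
    (20, "B"),
    (10, "C"),
    (5, "D"),
    (0, "E"),
]


def grade_for(score: int) -> str:
    """Return the letter grade for a 0-100 score."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    for lower_bound, grade in GRADE_BANDS:
        if score >= lower_bound:
            return grade
    raise AssertionError("unreachable: the last band starts at 0")


for example_score in (92, 60, 39, 7, 0):
    print(example_score, grade_for(example_score))
```

Keeping the bands in a single ordered list means a threshold change only touches one place.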

The score factors in: company age, legal status, industry risk, jurisdiction, available financial data, filing history, and more. It's not perfect (no public-data score can be), but it's a solid first-pass filter.


The Data Pipeline

Every company goes through this pipeline:

  1. Scrape: Pull from government registries (automated, scheduled)
  2. Parse: Normalize names, addresses, legal forms across 40+ formats
  3. Deduplicate: Match entities across sources (fuzzy name matching + registration numbers)
  4. Classify: Assign NACE/SIC industry codes where missing
  5. Score: Run the credit scoring algorithm (0-100)
  6. Grade: Convert score to letter grade (AA to E)
  7. Index: Load into DuckDB/Parquet + PostgreSQL for API serving

The whole pipeline reruns monthly for most sources, weekly for some.
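To make the Parse and Deduplicate steps more concrete, here is a minimal sketch of name normalization plus a match rule that prefers registration numbers and falls back to fuzzy name similarity. It is an illustration under stated assumptions, not the production matcher: the legal-form list is a small hypothetical subset, the record keys (country, reg_no, name) and the 0.9 threshold are invented for the example, and difflib's SequenceMatcher stands in for whatever fuzzy matcher the pipeline actually uses.

```python
import re
import unicodedata
from difflib import SequenceMatcher

# Illustrative subset of legal-form suffixes; a real list would be far longer.
LEGAL_FORMS = {
    "gmbh", "ag", "sarl", "sas", "sa", "ltd", "limited",
    "plc", "srl", "spa", "bv", "inc", "llc",
}


def normalize_name(name: str) -> str:
    """Lowercase, strip accents and punctuation, and drop trailing legal forms."""
    ascii_name = (
        unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    )
    # Remove dots first so "S.p.A." collapses to "spa" instead of "s p a".
    cleaned = ascii_name.lower().replace(".", "")
    tokens = re.sub(r"[^a-z0-9 ]+", " ", cleaned).split()
    while tokens and tokens[-1] in LEGAL_FORMS:
        tokens.pop()
    return " ".join(tokens)


def same_company(a: dict, b: dict, threshold: float = 0.9) -> bool:
    """Match on registration number when both records have one, else on name."""
    if a["country"] != b["country"]:
        return False
    if a.get("reg_no") and b.get("reg_no"):
        return a["reg_no"] == b["reg_no"]
    ratio = SequenceMatcher(
        None, normalize_name(a["name"]), normalize_name(b["name"])
    ).ratio()
    return ratio >= threshold


# Made-up records for illustration.
first = {"country": "IT", "reg_no": None, "name": "Esempio Società S.p.A."}
second = {"country": "IT", "reg_no": None, "name": "ESEMPIO SOCIETA SPA"}
print(normalize_name(first["name"]))
print(same_company(first, second))
```

Comparing every pair of records is quadratic, so at hundreds of millions of rows a real pipeline would first group candidates (for example by country and a name prefix) and only run the fuzzy comparison within each group.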


The AI Factor

I'll be honest: 95% of the code was written with Claude (Anthropic's AI assistant). I'm a solo developer and there's no way I could have built scrapers for 40+ countries, a scoring algorithm, an API, a dashboard, SDKs, and an MCP server without AI assistance.

This isn't a confession — it's the future of software. One developer with AI can build what used to require a team of 10.


What's Available Now

Free Dashboard

Search any company at score.get-scala.com. No signup required for basic searches.

API (free tier; paid plans from €19/month)

REST API with search, autocomplete, company details, credit scores, and bulk endpoints.

curl "https://score.get-scala.com/api/v1/companies/search?q=BMW&country=DE" \
  -H "Authorization: Bearer YOUR_API_KEY"

JavaScript SDK

npm install scala-score

import { ScalaScore } from "scala-score";
const client = new ScalaScore({ apiKey: "your-key" });
const results = await client.companies.search({ name: "BMW", country: "DE" });

Python SDK

pip install scala-score

from scala_score import ScalaScore
client = ScalaScore(api_key="your-key")
results = client.companies.search(name="BMW", country="DE")

MCP Server (for AI Agents)

npx scala-mcp-server

Let Claude, GPT, or any MCP-compatible AI agent search company data directly.

Kaggle Sample

994K companies on Kaggle — download and explore for free.


The Vision

Business intelligence shouldn't be locked behind $20K/year enterprise contracts. Every developer, every startup, every small business should be able to look up a potential partner, supplier, or client and get reliable data.

I'm building toward 500 million companies by 2027, covering every country that publishes business registry data.

If you're building something with company data, I'd love to hear about it. The API is free to start — 100 lookups/month, no credit card required.


Built solo in Milan. Powered by public data and Claude. Democratizing business intelligence, one API call at a time.