DEV Community

Dmitriy Dmitriy
I built a company data pipeline — here’s what broke in real-world data

I built a Python pipeline that automates company data collection:
searching registries, extracting websites, scraping content, finding phone numbers, and generating summaries using AI.

At first, I thought scraping would be the hard part.

It turned out to be the easiest.

What the pipeline does

The system:

  • finds companies via the DNB API
  • extracts website and metadata
  • visits multiple pages of each site
  • extracts phone numbers from HTML, links and structured data
  • generates summaries using LLMs
  • exports everything into a structured Excel dataset
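At a high level, those stages can be sketched as a chain of small steps over a shared record. This is an illustrative skeleton, not the actual implementation; the step names and the `CompanyRecord` fields are assumptions:

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class CompanyRecord:
    """Accumulates data as a company moves through the pipeline."""
    name: str
    website: str | None = None
    phones: list[str] = field(default_factory=list)
    summary: str | None = None

def run_pipeline(names, steps):
    """Apply each step in order. A failing step skips that stage for
    the record instead of aborting the whole run."""
    records = [CompanyRecord(name=n) for n in names]
    for record in records:
        for step in steps:
            try:
                step(record)
            except Exception:
                continue  # a real pipeline would log the failure here
    return records
```

Keeping each stage independent like this is what makes the failure modes below survivable: one broken site or one failed API call degrades a single record instead of killing the run.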

What actually broke

1. APIs are unstable and rate-limited

The DNB API would intermittently return 429 (Too Many Requests) and 503 (Service Unavailable) errors.

Without retries, the pipeline would fail after a few requests.

I ended up implementing:

  • retry logic
  • exponential backoff
  • random delays between requests

Even then, stability is never guaranteed.
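A minimal version of that retry logic, with exponential backoff and random jitter, might look like this. The status codes and limits are illustrative, and `do_request` stands in for whatever HTTP call the pipeline makes:

```python
import random
import time

RETRYABLE = {429, 503}  # rate-limited / temporarily unavailable

def fetch_with_retries(do_request, max_attempts=5, base_delay=1.0):
    """Call do_request() until it succeeds or attempts run out.
    do_request should return a (status_code, body) tuple."""
    status, body = None, None
    for attempt in range(max_attempts):
        status, body = do_request()
        if status not in RETRYABLE:
            return status, body
        # exponential backoff: base, 2x, 4x, ... plus random jitter
        # so parallel workers don't retry in lockstep
        delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
        time.sleep(delay)
    return status, body
```

The jitter matters more than it looks: without it, every worker that hit the same 429 retries at the same instant and hits it again.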

2. Phone numbers are surprisingly hard

Phone extraction turned out to be one of the hardest parts.

Numbers appear:

  • in different formats (+1 prefixes, parentheses, spaces, dashes)
  • mixed with dates or IDs
  • inside HTML text, links (tel:) and JSON-LD

I had to build logic to:

  • normalize formats
  • filter invalid matches
  • classify numbers by reliability

Without this step, the output was unusable.
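Here is a stripped-down sketch of the normalize-and-filter step using only the standard library. The regex and length rules are illustrative; a production pipeline would more likely lean on the `phonenumbers` library for validation:

```python
import re

# Candidate: optional leading +, then digits with common separators
PHONE_RE = re.compile(r"\+?\d[\d\s\-().]{6,18}\d")

def normalize(raw: str) -> str:
    """Keep only digits, preserving a leading '+'."""
    digits = re.sub(r"[^\d]", "", raw)
    return "+" + digits if raw.strip().startswith("+") else digits

def looks_like_phone(normalized: str) -> bool:
    """Filter out dates, order IDs and other numeric noise by length."""
    n = len(normalized.lstrip("+"))
    return 7 <= n <= 15  # E.164 allows at most 15 digits

def extract_phones(text: str) -> list[str]:
    seen, result = set(), []
    for match in PHONE_RE.finditer(text):
        norm = normalize(match.group())
        if looks_like_phone(norm) and norm not in seen:
            seen.add(norm)
            result.append(norm)
    return result
```

Even this much only handles the "different formats" problem; classifying reliability (tel: link vs. JSON-LD vs. free text) needs to happen where the match was found, not after normalization.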

3. Websites are inconsistent

Every site is different:

  • different HTML structures
  • missing data
  • broken markup
  • content hidden in scripts

There is no universal parser.

Even simple tasks like extracting clean text require handling multiple edge cases.
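As one example of those edge cases: naive text extraction happily returns the contents of `<script>` and `<style>` tags. A minimal stdlib-only extractor that skips them might look like this (a real pipeline would more likely use BeautifulSoup, but it faces the same cases):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style/noscript contents."""
    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def clean_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```

Note that `HTMLParser` tolerates broken markup (unclosed tags, stray brackets) instead of raising, which is exactly the behavior messy real-world sites require.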

4. Matching company data is unreliable

Company names don’t always match exactly across sources.

Small differences (spacing, symbols, legal forms) break naive matching.

This forced me to implement stricter matching logic and fallbacks.
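The shape of that matching logic: normalize first (lowercase, strip punctuation and legal forms), compare exactly, and only then fall back to a fuzzy ratio. The suffix list and threshold here are illustrative assumptions:

```python
import difflib
import re

# Illustrative set of legal-form suffixes to ignore when comparing
LEGAL_FORMS = {"inc", "llc", "ltd", "gmbh", "corp", "co", "sa", "ag"}

def normalize_name(name: str) -> str:
    """Lowercase, drop punctuation, drop legal-form tokens."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(t for t in tokens if t not in LEGAL_FORMS)

def names_match(a: str, b: str, threshold: float = 0.9) -> bool:
    """Exact match on normalized names first, fuzzy ratio as a fallback."""
    na, nb = normalize_name(a), normalize_name(b)
    if na == nb:
        return True
    return difflib.SequenceMatcher(None, na, nb).ratio() >= threshold
```

Keeping the threshold high is deliberate: in a pipeline, a false match silently attaches one company's website and phones to another, which is worse than a missed match.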

5. AI is helpful, but not deterministic

I used LLMs to generate company summaries from scraped text.

But even with the same input:

  • outputs vary
  • rate limits happen
  • some responses fail

To make it usable, I had to:

  • control prompts carefully
  • limit output length
  • add fallback between models
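The fallback itself can stay model-agnostic if the actual API call is injected. Everything here is a sketch: `call_model`, the model names, and the length cap are assumptions, not a specific provider's API:

```python
def summarize_with_fallback(prompt, models, call_model, max_chars=500):
    """Try each model in order; return the first usable summary.
    call_model(model, prompt) should return text or raise on failure."""
    last_error = None
    for model in models:
        try:
            text = call_model(model, prompt)
        except Exception as exc:  # rate limit, timeout, malformed response
            last_error = exc
            continue
        if text and text.strip():
            return text.strip()[:max_chars]  # enforce the length limit
    raise RuntimeError(f"all models failed: {last_error}")
```

Treating an empty or whitespace-only response as a failure (rather than only catching exceptions) is what makes this usable: a model that "succeeds" with nothing is still a failure for the dataset.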

Results

The pipeline can process hundreds of companies per run, replacing hours of manual work.

But the main takeaway:

Scraping is only a small part of the system.

Most of the effort goes into:

  • cleaning data
  • validating results
  • handling edge cases

Conclusion

If you’re starting with web scraping, the hardest part isn’t extracting data.

It’s making that data reliable.

Real-world data is messy, inconsistent and unpredictable.

Building a pipeline is not about scraping —
it’s about turning raw data into something usable.

Happy to share more details if there’s interest.
