I built a Python pipeline that automates company data collection:
searching registries, extracting websites, scraping content, finding phone numbers, and generating summaries using AI.
At first, I thought scraping would be the hard part.
It turned out to be the easiest.
What the pipeline does
The system:
- finds companies via the DNB API
- extracts website and metadata
- visits multiple pages of each site
- extracts phone numbers from HTML, links and structured data
- generates summaries using LLMs
- exports everything into a structured Excel dataset
What actually broke
1. APIs are unstable and rate-limited
The DNB API would intermittently return 429 (Too Many Requests) and 503 (Service Unavailable) errors.
Without retries, the pipeline would fail after a few requests.
I ended up implementing:
- retry logic
- exponential backoff
- random delays between requests
Even then, stability is never guaranteed.
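The retry strategy above can be sketched roughly like this. It's a minimal version, assuming the request function returns an HTTP status code and a payload; `fetch_fn`, the delay constants, and the retried status codes are illustrative, not the pipeline's actual code:

```python
import random
import time

def retry_with_backoff(fetch_fn, max_attempts=5, base_delay=1.0, retry_on=(429, 503)):
    """Call fetch_fn() until it returns a non-retriable status or attempts run out.

    fetch_fn is assumed to return a (status_code, payload) tuple.
    """
    for attempt in range(max_attempts):
        status, payload = fetch_fn()
        if status not in retry_on:
            return payload
        # Exponential backoff (1s, 2s, 4s, ...) plus random jitter,
        # so parallel workers don't retry in lockstep.
        time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
    raise RuntimeError(f"gave up after {max_attempts} attempts")
```

The jitter matters more than it looks: without it, every worker that hit the same 429 retries at the same instant and triggers the next 429.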
2. Phone numbers are surprisingly hard
Phone extraction turned out to be one of the hardest parts.
Numbers appear:
- in different formats (+1, brackets, spaces, dashes)
- mixed with dates or IDs
- inside HTML text, links (tel:) and JSON-LD
I had to build logic to:
- normalize formats
- filter invalid matches
- classify numbers by reliability
Without this step, the output was unusable.
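A simplified sketch of that normalize-filter-classify step, assuming E.164-style length limits; the regex, the length bounds, and the reliability labels are illustrative choices, not the pipeline's exact rules:

```python
import re

# Rough candidate pattern: digits with an optional leading '+', allowing
# spaces, brackets, dots and dashes in between. Illustrative, not exhaustive.
PHONE_RE = re.compile(r"\+?\d[\d\s().\-]{6,16}\d")

def normalize_phone(raw):
    """Drop formatting characters, keeping only digits and a leading '+'."""
    return re.sub(r"[^\d+]", "", raw)

def extract_phones(text, source="text"):
    """Find candidates, filter obvious non-phones, and tag reliability.

    The reliability labels are hypothetical: a number taken from a tel:
    link is treated as more trustworthy than one matched in free text.
    """
    results = []
    for candidate in PHONE_RE.findall(text):
        number = normalize_phone(candidate)
        digits = number.lstrip("+")
        # Reject IDs and dates by length: E.164 numbers have at most
        # 15 digits, and anything under 8 is unlikely to be a phone.
        if not 8 <= len(digits) <= 15:
            continue
        reliability = "high" if source == "tel_link" else "low"
        results.append({"number": number, "reliability": reliability})
    return results
```

For example, `extract_phones("Call +1 (212) 555-0142 or ref 12345 in 2023")` keeps the phone number and drops the reference ID and the year.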
3. Websites are inconsistent
Every site is different:
- different HTML structures
- missing data
- broken markup
- content hidden in scripts
There is no universal parser.
Even simple tasks like extracting clean text require handling multiple edge cases.
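As one illustration of what "extracting clean text" involves, here is a minimal extractor built on the standard library's `html.parser`, which is lenient about broken markup (unclosed tags don't crash it). The post doesn't say which parser the pipeline uses; this is a sketch, not the actual implementation:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script>, <style> and <noscript> content."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # >0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.parts.append(data.strip())

def clean_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)
```

Even this toy version already needs a skip-depth counter and whitespace handling; real pages add encoding issues, content rendered by JavaScript, and markup that no parser fully recovers.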
4. Matching company data is unreliable
Company names don’t always match exactly across sources.
Small differences (spacing, symbols, legal forms) break naive matching.
This forced me to implement stricter matching logic and fallbacks.
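The matching logic can be sketched as normalization first, fuzzy comparison as the fallback. The legal-form list and the similarity threshold here are illustrative assumptions, not the pipeline's actual values:

```python
import difflib
import re

# Common legal-form suffixes to strip before comparing. Illustrative,
# not exhaustive.
LEGAL_FORMS = {"inc", "llc", "ltd", "gmbh", "bv", "co", "corp", "sa"}

def normalize_name(name):
    """Lowercase, drop punctuation, and strip trailing legal-form tokens."""
    collapsed = name.lower().replace(".", "")          # "b.v." -> "bv"
    tokens = re.sub(r"[^\w\s]", " ", collapsed).split()
    while tokens and tokens[-1] in LEGAL_FORMS:
        tokens.pop()
    return " ".join(tokens)

def names_match(a, b, threshold=0.9):
    """Exact match on normalized names first, then a fuzzy-ratio fallback."""
    na, nb = normalize_name(a), normalize_name(b)
    if na == nb:
        return True
    return difflib.SequenceMatcher(None, na, nb).ratio() >= threshold
```

With this, "Acme, Inc." and "ACME Inc" match, while genuinely different companies stay apart; the threshold is the knob that trades false positives against missed matches.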
5. AI is helpful, but not deterministic
I used LLMs to generate company summaries from scraped text.
But even with the same input:
- outputs vary
- rate limits happen
- some responses fail
To make it usable, I had to:
- control prompts carefully
- limit output length
- add fallback between models
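The model fallback can be sketched as a simple cascade. The post doesn't name the models or client library, so `models` here is just a list of hypothetical callables wrapping whatever LLM APIs are in use:

```python
def summarize_with_fallback(text, models, max_retries=2):
    """Try each model in order; move to the next one when a model keeps failing.

    `models` is a list of callables (hypothetical LLM wrappers) that take
    the scraped text and return a summary string, or raise on failure
    (rate limit, timeout, malformed response).
    """
    errors = []
    for model in models:
        for _attempt in range(max_retries):
            try:
                return model(text)
            except Exception as exc:
                errors.append(exc)
    raise RuntimeError(f"all models failed: {errors}")
```

Prompt control and output-length limits live inside each wrapper; the cascade only decides which model gets to answer.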
Results
The pipeline can process hundreds of companies per run,
replacing hours of manual work.
But the main takeaway:
Scraping is only a small part of the system.
Most of the effort goes into:
- cleaning data
- validating results
- handling edge cases
Conclusion
If you’re starting with web scraping, the hardest part isn’t extracting data.
It’s making that data reliable.
Real-world data is messy, inconsistent and unpredictable.
Building a pipeline is not about scraping —
it’s about turning raw data into something usable.
Happy to share more details if there’s interest.