I built a Python pipeline that automates company data collection:
searching registries, extracting websites, scraping content, finding phone numbers, and generating summaries using AI.
At first, I thought scraping would be the hard part.
It turned out to be the easiest.
What the pipeline does
The system:
- finds companies via the DNB API
- extracts website and metadata
- visits multiple pages of each site
- extracts phone numbers from HTML, links and structured data
- generates summaries using LLMs
- exports everything into a structured Excel dataset
What actually broke
1. APIs are unstable and rate-limited
The DNB API would intermittently return 429 (Too Many Requests) and 503 (Service Unavailable) errors.
Without retries, the pipeline would fail after a few requests.
I ended up implementing:
- retry logic
- exponential backoff
- random delays between requests
Even then, stability is never guaranteed.
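The retry strategy above can be sketched roughly like this. It's a minimal version, assuming the request function returns an HTTP status code and a payload; `fetch_fn`, the delay constants, and the retried status codes are illustrative, not the pipeline's actual code:

```python
import random
import time

def retry_with_backoff(fetch_fn, max_attempts=5, base_delay=1.0, retry_on=(429, 503)):
    """Call fetch_fn() until it returns a non-retriable status or attempts run out.

    fetch_fn is assumed to return a (status_code, payload) tuple.
    """
    for attempt in range(max_attempts):
        status, payload = fetch_fn()
        if status not in retry_on:
            return payload
        # Exponential backoff (1s, 2s, 4s, ...) plus random jitter,
        # so parallel workers don't retry in lockstep.
        time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
    raise RuntimeError(f"gave up after {max_attempts} attempts")
```

The jitter matters more than it looks: without it, every worker that hit the same 429 retries at the same instant and triggers the next 429.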
2. Phone numbers are surprisingly hard
Phone extraction turned out to be one of the hardest parts.
Numbers appear:
- in different formats (+1, brackets, spaces, dashes)
- mixed with dates or IDs
- inside HTML text, links (tel:) and JSON-LD
I had to build logic to:
- normalize formats
- filter invalid matches
- classify numbers by reliability
Without this step, the output was unusable.
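A simplified sketch of that normalize-filter-classify step, assuming E.164-style length limits; the regex, the length bounds, and the reliability labels are illustrative choices, not the pipeline's exact rules:

```python
import re

# Rough candidate pattern: digits with an optional leading '+', allowing
# spaces, brackets, dots and dashes in between. Illustrative, not exhaustive.
PHONE_RE = re.compile(r"\+?\d[\d\s().\-]{6,16}\d")

def normalize_phone(raw):
    """Drop formatting characters, keeping only digits and a leading '+'."""
    return re.sub(r"[^\d+]", "", raw)

def extract_phones(text, source="text"):
    """Find candidates, filter obvious non-phones, and tag reliability.

    The reliability labels are hypothetical: a number taken from a tel:
    link is treated as more trustworthy than one matched in free text.
    """
    results = []
    for candidate in PHONE_RE.findall(text):
        number = normalize_phone(candidate)
        digits = number.lstrip("+")
        # Reject IDs and dates by length: E.164 numbers have at most
        # 15 digits, and anything under 8 is unlikely to be a phone.
        if not 8 <= len(digits) <= 15:
            continue
        reliability = "high" if source == "tel_link" else "low"
        results.append({"number": number, "reliability": reliability})
    return results
```

For example, `extract_phones("Call +1 (212) 555-0142 or ref 12345 in 2023")` keeps the phone number and drops the reference ID and the year.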
3. Websites are inconsistent
Every site is different:
- different HTML structures
- missing data
- broken markup
- content hidden in scripts
There is no universal parser.
Even simple tasks like extracting clean text require handling multiple edge cases.
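As one illustration of what "extracting clean text" involves, here is a minimal extractor built on the standard library's `html.parser`, which is lenient about broken markup (unclosed tags don't crash it). The post doesn't say which parser the pipeline uses; this is a sketch, not the actual implementation:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script>, <style> and <noscript> content."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # >0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.parts.append(data.strip())

def clean_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)
```

Even this toy version already needs a skip-depth counter and whitespace handling; real pages add encoding issues, content rendered by JavaScript, and markup that no parser fully recovers.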
4. Matching company data is unreliable
Company names don’t always match exactly across sources.
Small differences (spacing, symbols, legal forms) break naive matching.
This forced me to implement stricter matching logic and fallbacks.
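The matching logic can be sketched as normalization first, fuzzy comparison as the fallback. The legal-form list and the similarity threshold here are illustrative assumptions, not the pipeline's actual values:

```python
import difflib
import re

# Common legal-form suffixes to strip before comparing. Illustrative,
# not exhaustive.
LEGAL_FORMS = {"inc", "llc", "ltd", "gmbh", "bv", "co", "corp", "sa"}

def normalize_name(name):
    """Lowercase, drop punctuation, and strip trailing legal-form tokens."""
    collapsed = name.lower().replace(".", "")          # "b.v." -> "bv"
    tokens = re.sub(r"[^\w\s]", " ", collapsed).split()
    while tokens and tokens[-1] in LEGAL_FORMS:
        tokens.pop()
    return " ".join(tokens)

def names_match(a, b, threshold=0.9):
    """Exact match on normalized names first, then a fuzzy-ratio fallback."""
    na, nb = normalize_name(a), normalize_name(b)
    if na == nb:
        return True
    return difflib.SequenceMatcher(None, na, nb).ratio() >= threshold
```

With this, "Acme, Inc." and "ACME Inc" match, while genuinely different companies stay apart; the threshold is the knob that trades false positives against missed matches.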
5. AI is helpful, but not deterministic
I used LLMs to generate company summaries from scraped text.
But even with the same input:
- outputs vary
- rate limits happen
- some responses fail
To make it usable, I had to:
- control prompts carefully
- limit output length
- add fallback between models
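The model fallback can be sketched as a simple cascade. The post doesn't name the models or client library, so `models` here is just a list of hypothetical callables wrapping whatever LLM APIs are in use:

```python
def summarize_with_fallback(text, models, max_retries=2):
    """Try each model in order; move to the next one when a model keeps failing.

    `models` is a list of callables (hypothetical LLM wrappers) that take
    the scraped text and return a summary string, or raise on failure
    (rate limit, timeout, malformed response).
    """
    errors = []
    for model in models:
        for _attempt in range(max_retries):
            try:
                return model(text)
            except Exception as exc:
                errors.append(exc)
    raise RuntimeError(f"all models failed: {errors}")
```

Prompt control and output-length limits live inside each wrapper; the cascade only decides which model gets to answer.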
Results
The pipeline can process hundreds of companies per run,
replacing hours of manual work.
But the main takeaway:
Scraping is only a small part of the system.
Most of the effort goes into:
- cleaning data
- validating results
- handling edge cases
Conclusion
If you’re starting with web scraping, the hardest part isn’t extracting data.
It’s making that data reliable.
Real-world data is messy, inconsistent and unpredictable.
Building a pipeline is not about scraping —
it’s about turning raw data into something usable.
Happy to share more details if there’s interest.