
Anthony Leignel


I Built a Local Data Collection System in Python

Results interface showing collected prospects with columns for name, website, email, phone, city, country, and a checkbox to mark contacted leads; includes filters and export options

Why I Built It

Most data collection and lead generation tools are delivered as SaaS products.

You create an account, subscribe, send your data to a third-party platform and keep paying every month to continue using the tool.

I wanted a different approach.

I built a fully local data collection system that runs on the user's machine, without subscriptions, without paid APIs and without relying on external platforms.

The goal was simple:

  • Create searches
  • Collect data from multiple sources
  • Clean and validate the results
  • Export usable data
  • Keep everything under the user's control

What the System Does

The application allows users to create and manage searches from a web interface.

For each search, users can define:

  • A search name
  • A keyword
  • A city
  • A country
  • The search engines to use

The system currently supports:

  • DuckDuckGo
  • Bing
  • Qwant

Once a search is executed, the collection engine starts gathering information from the selected sources.
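As a rough illustration, a saved search with the fields listed above could be modeled like this. The class name, the engine identifiers and the `query()` helper are assumptions made for this sketch, not the project's actual code.

```python
from dataclasses import dataclass

# Engines named in the article; the lowercase identifiers are assumptions.
SUPPORTED_ENGINES = {"duckduckgo", "bing", "qwant"}


@dataclass(frozen=True)
class Search:
    """One saved search (hypothetical structure for illustration)."""

    name: str
    keyword: str
    city: str
    country: str
    engines: tuple[str, ...] = ("duckduckgo",)

    def __post_init__(self):
        # Reject engines the system does not support.
        unknown = set(self.engines) - SUPPORTED_ENGINES
        if unknown:
            raise ValueError(f"Unsupported engines: {sorted(unknown)}")
        if not self.engines:
            raise ValueError("Select at least one search engine")

    def query(self) -> str:
        """Join keyword, city and country into the text sent to each engine."""
        parts = (self.keyword, self.city, self.country)
        return " ".join(part for part in parts if part)
```

Validating the engine list when the search is created means a typo fails immediately instead of surfacing later in the middle of a collection run.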

Data Processing Pipeline

Every run follows the same sequence of stages.

Load Configuration
↓
Initialize Environment
↓
Load Searches
↓
Run Collectors
↓
Clean Data
↓
Normalize Data
↓
Validate Records
↓
Remove Duplicates
↓
Generate Exports
↓
Save Results

Each component has a single responsibility.

Collectors collect data.

Processors clean and validate it.

Exporters generate output files.

The main engine orchestrates the workflow.
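To make the separation of responsibilities concrete, here is a minimal sketch of how collectors, processors and an orchestrating function could fit together. Every name in it (`Collector`, `clean`, `normalize`, `is_valid`, `run_pipeline`) and the choice of the website as the deduplication key are assumptions for illustration; the real project may structure this differently.

```python
from typing import Iterable, Protocol

Record = dict[str, str]


class Collector(Protocol):
    """Anything that can turn a query into raw records."""

    def collect(self, query: str) -> Iterable[Record]: ...


def clean(record: Record) -> Record:
    """Strip surrounding whitespace from every value."""
    return {key: value.strip() for key, value in record.items()}


def normalize(record: Record) -> Record:
    """Lowercase the fields that are compared case-insensitively."""
    out = dict(record)
    for key in ("email", "website"):
        if key in out:
            out[key] = out[key].lower()
    return out


def is_valid(record: Record) -> bool:
    """Keep only records that have a website (assumed validation rule)."""
    return bool(record.get("website"))


def run_pipeline(collectors: Iterable[Collector], query: str) -> list[Record]:
    """Collect, clean, normalize, validate and deduplicate records."""
    seen: set[str] = set()
    results: list[Record] = []
    for collector in collectors:
        for raw in collector.collect(query):
            record = normalize(clean(raw))
            if not is_valid(record):
                continue
            if record["website"] in seen:  # duplicate across collectors
                continue
            seen.add(record["website"])
            results.append(record)
    return results
```

Because `Collector` is a structural protocol, adding a new source only requires a class with a `collect` method; the orchestrating function does not change.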

Extracting Business Information

The system doesn't stop at search engine results.

When a website is discovered, the application can visit the site and extract useful information such as:

  • Website URL
  • Email addresses
  • Phone numbers
  • Company name
  • Location information

The collected data is then normalized and validated before being added to the final dataset.
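The extraction step can be sketched with regular expressions over the page text. This version deliberately uses only the standard library so it runs without extra dependencies; the project itself lists BeautifulSoup in its stack, and the patterns and the `extract_contacts` function below are assumptions rather than the actual implementation.

```python
import re
from html import unescape

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Loose phone pattern: optional "+", digits mixed with common separators.
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,18}\d")
TAG_RE = re.compile(r"<[^>]+>")


def extract_contacts(html_text: str) -> dict[str, list[str]]:
    """Return unique emails and phone numbers found in the visible text."""
    # Crude tag stripping; a real HTML parser is more robust.
    text = unescape(TAG_RE.sub(" ", html_text))

    # Lowercase emails and drop duplicates while keeping first-seen order.
    emails = list(dict.fromkeys(m.lower() for m in EMAIL_RE.findall(text)))

    phones: list[str] = []
    for match in PHONE_RE.findall(text):
        digits = re.sub(r"\D", "", match)
        # Keep only plausible lengths (8 to 15 digits).
        if 8 <= len(digits) <= 15:
            number = ("+" if match.startswith("+") else "") + digits
            if number not in phones:
                phones.append(number)

    return {"emails": emails, "phones": phones}
```

Phone numbers are reduced to digits plus an optional leading `+`, so the same number written with different separators is stored only once.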

Managing Contacted Leads

One feature I wanted from the beginning was lead tracking.

Users can mark prospects as contacted directly from the interface.

The information is stored locally and remains available after closing the application.

This makes it easy to tell new prospects apart from those already contacted, without relying on an external CRM.
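One simple way to keep that flag across restarts is a small SQLite file on disk. The sketch below assumes SQLite as the storage backend and the website as the lead key; the article does not say how the project stores this, so `ContactedStore`, the table name and the `leads.db` default are hypothetical.

```python
import sqlite3


class ContactedStore:
    """Persist a 'contacted' flag per website in a local SQLite file."""

    def __init__(self, path: str = "leads.db"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS contacted ("
            "website TEXT PRIMARY KEY, "
            "contacted INTEGER NOT NULL DEFAULT 0)"
        )
        self.conn.commit()

    def mark(self, website: str, contacted: bool = True) -> None:
        """Set or clear the flag; the primary key keeps one row per website."""
        self.conn.execute(
            "INSERT OR REPLACE INTO contacted (website, contacted) VALUES (?, ?)",
            (website, int(contacted)),
        )
        self.conn.commit()

    def is_contacted(self, website: str) -> bool:
        row = self.conn.execute(
            "SELECT contacted FROM contacted WHERE website = ?", (website,)
        ).fetchone()
        return bool(row and row[0])

    def close(self) -> None:
        self.conn.close()
```

Since the data lives in a single local file, the flags survive closing the application and never leave the machine.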

Exporting Data

Once processing is complete, results can be exported as:

  • CSV
  • JSON

The exported files are ready to be imported into other systems or used for further analysis.
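Both formats can be produced with the standard library alone. In this sketch the column list mirrors the columns visible in the results interface, while the function names and the exact field keys are assumptions.

```python
import csv
import json

# Column order for the CSV file (assumed from the results interface).
FIELDS = ["name", "website", "email", "phone", "city", "country", "contacted"]


def export_csv(records: list[dict], path: str) -> None:
    """Write records as CSV; missing fields become empty cells."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(records)


def export_json(records: list[dict], path: str) -> None:
    """Write records as readable, UTF-8 encoded JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(records, handle, ensure_ascii=False, indent=2)
```

Opening the CSV file with `newline=""` follows the `csv` module's recommendation and avoids blank rows on Windows, and `ensure_ascii=False` keeps accented names readable in the JSON output.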

Local First

One of the main design goals was independence.

The system runs locally.

There is:

  • No SaaS
  • No subscription
  • No third-party account
  • No paid API dependency

The user owns the software and the collected data.

Technical Stack

The project is built with:

  • Python
  • Flask
  • Playwright
  • BeautifulSoup
  • Requests

The interface is served locally through Flask and can be accessed from a browser.

Final Thoughts

This project started as a simple data collection tool and gradually evolved into a complete workflow capable of collecting, processing and exporting structured business data.

Building it locally introduced some interesting challenges around browser automation, data normalization, validation and architecture design.

The result is a modular system that can be extended with new collectors, processors and export formats without changing the overall architecture.

For me, the most important aspect remains simple:

The software belongs to the user and the data never has to leave the machine.


If you want to explore the technical implementation, you can find it here:
https://github.com/Palks-Studio/data-collection-system


https://palks-studio.com
