A year ago, I had zero scrapers on Apify. Today, I have 30 published actors, with 9 of them ranking in the top 3 of their niche.
Here is what that journey looked like — and why I think building data tools is one of the most underrated side businesses in 2026.
It started with one frustration
I kept getting the same request from clients: "Can you get me data from [website]?" Every time, I would build a custom scraper from scratch. Proxy rotation, error handling, CAPTCHA detection, pagination — the same boilerplate, over and over.
So I decided to build reusable tools and put them where people could find them.
The first 12 actors taught me the fundamentals
Google Maps was the hardest — it is a single-page app, so standard HTML scraping does not work. I discovered that using Google's local search parameter (tbm=lcl) returns parseable HTML instead of JavaScript bundles. That single insight saved weeks of work:
```python
# Google Maps SPA workaround: use Google local search
# Instead of scraping maps.google.com (SPA, JS-heavy),
# query Google Search with the tbm=lcl parameter.
import requests
from bs4 import BeautifulSoup

def search_local_businesses(query, location):
    url = "https://www.google.com/search"
    params = {
        "q": f"{query} {location}",
        "tbm": "lcl",  # Local search - returns parseable HTML
    }
    headers = {"User-Agent": "Mozilla/5.0 ..."}
    resp = requests.get(url, params=params, headers=headers)
    soup = BeautifulSoup(resp.text, "html.parser")
    # Parse structured results instead of fighting SPAs
    return extract_businesses(soup)
```
Amazon taught me about anti-bot arms races. You need multiple selector fallbacks, proxy rotation, and CAPTCHA detection:
```python
# Multi-selector fallback pattern for fragile pages
PRICE_SELECTORS = [
    "#priceblock_ourprice",
    "#priceblock_dealprice",
    "span.a-price > span.a-offscreen",
    "#corePrice_feature_div span.a-price-whole",
    "#price_inside_buybox",
]

def extract_price(page):
    for selector in PRICE_SELECTORS:
        el = page.query_selector(selector)
        if el:
            return parse_price(el.text_content())
    return None  # All selectors failed - site changed again
```
Meta Threads taught me that modern sites hide data in script tags as JSON. Scraping the visible HTML gives you nothing:
```python
# Modern sites embed data in <script> tags, not in visible HTML
import json
import re

def extract_thread_data(html):
    # Find the JSON blobs in script tags
    pattern = r'<script type="application/json"[^>]*>(.*?)</script>'
    matches = re.findall(pattern, html, re.DOTALL)
    for match in matches:
        try:
            data = json.loads(match)
            # Navigate the nested structure to find posts
            if "require" in data or "props" in data:
                return extract_posts_from_json(data)
        except json.JSONDecodeError:
            continue
    return []
```
Scaling from 12 to 30 revealed the real opportunity
The first 12 were general-purpose tools: social media scrapers, e-commerce data, lead generation. Useful, but competitive.
The next 18 went niche. And that is where it got interesting:
- ESG Report Aggregator — collects corporate sustainability reports and GHG emissions data. Turned out ESG analysts and compliance teams were desperate for automated data collection.
- Public Procurement Hub — scrapes government tenders from TED (EU) and SAM.gov (US). B2B sales teams use this to find contract opportunities before competitors.
- Price Drop Alert — monitors prices across multiple retailers and alerts when thresholds are hit. E-commerce managers love it.
- AI Training Data Curator — structures web data for ML training pipelines. With the AI boom, demand for clean training data is enormous.
- Conference Event Scraper — aggregates event data for marketing teams planning their annual calendars.
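The threshold logic behind a price-drop alert is simple to sketch. This is a minimal illustration, not the actor's real schema: the dict shapes, field names, and the 10% default are all assumptions for the example.

```python
def check_price_drops(current, previous, threshold_pct=10.0):
    """Return items whose price fell by at least threshold_pct percent.

    `current` and `previous` map item IDs to prices (illustrative shape).
    """
    alerts = []
    for item_id, price in current.items():
        old = previous.get(item_id)
        if old is None or old <= 0:
            continue  # new item or unusable old price: nothing to compare
        drop_pct = (old - price) / old * 100
        if drop_pct >= threshold_pct:
            alerts.append({
                "id": item_id,
                "old": old,
                "new": price,
                "drop_pct": round(drop_pct, 1),
            })
    return alerts
```

In the real actor this comparison runs against the previous run's dataset, and anything crossing the threshold triggers the alert.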
The niches nobody thinks about are the ones that make money
General scrapers (Amazon, Google, LinkedIn) have fierce competition. But when you build for ESG compliance, government procurement, or AI training data — you are often the only option.
9 of my 30 actors are now in the top 3 of their niche. Not because they are technically superior, but because nobody else bothered to build for those audiences.
5 things I learned
1. Documentation sells more than features
A clear README with example outputs converts 10x better than a feature list:
```markdown
## Example Output

| Company    | Scope 1 (tCO2e) | Scope 2 (tCO2e) | Report URL             |
|------------|-----------------|-----------------|------------------------|
| Siemens AG | 284,000         | 591,000         | siemens.com/esg-2025   |
| SAP SE     | 45,000          | 122,000         | sap.com/sustainability |
```
2. Reliability beats functionality
Users forgive missing features. They do not forgive broken runs. I invest more time in error handling than in new features.
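Most of that error-handling effort goes into retrying transient failures instead of crashing the run. A minimal sketch of the retry-with-exponential-backoff pattern (the function names and defaults here are illustrative, not code from my actors):

```python
import random
import time

def fetch_with_retries(fetch, url, max_attempts=4, base_delay=1.0):
    """Call `fetch(url)`, retrying transient failures with exponential
    backoff plus jitter. `fetch` is any callable that raises on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the real error
            # 1s, 2s, 4s, ... plus jitter so parallel runs do not
            # hammer the target in lockstep
            delay = base_delay * 2 ** (attempt - 1)
            time.sleep(delay + random.uniform(0, base_delay))
```

Wrapping every network call like this turns a flaky target into an occasionally-slow run instead of a failed one.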
3. Niche beats general
A "B2B Review Intelligence" tool sells better than "Generic Review Scraper." Name your tools for the audience, not the technology.
4. Maintenance is the real product
Websites change constantly. The value is in keeping things working, not in the initial build. I check selector stability weekly.
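The weekly stability check can be as simple as running every CSS selector against a saved copy of the target page and reporting which ones still match. A sketch using BeautifulSoup (the function name is mine, not from any actor):

```python
from bs4 import BeautifulSoup

def audit_selectors(html, selectors):
    """Report which CSS selectors still match the given page HTML.

    Run this against a saved fixture page (or a fresh fetch) to catch
    selectors that silently stopped matching after a site redesign.
    """
    soup = BeautifulSoup(html, "html.parser")
    return {sel: soup.select_one(sel) is not None for sel in selectors}
```

Feeding it a fallback list like `PRICE_SELECTORS` tells you which fallbacks are dead weight and which one is currently carrying the actor.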
5. Data is the new API
Many websites do not have public APIs. If you can reliably extract and structure their data, you have built a de facto API that people will pay for.
Where I am going next
More vertical-specific tools. Real estate analytics, supply chain monitoring, patent tracking. The deeper the niche, the less competition and the more willingness to pay.
If you are a developer thinking about a side project — consider building data tools. The demand is massive, the competition in niches is low, and the business model (pay-per-use) scales without your involvement.
How I Structure and Price Data Products
One thing I got completely wrong in the first six months: treating actors like open-source tools and pricing them at zero or "pay what you want." Turns out free tools attract tyre-kickers, not serious users. Here's the pricing structure that actually works for me:
| Tier | Monthly Price | What It Gets You | Best For |
|---|---|---|---|
| Free trial | $0 | 50 runs / month | Evaluation |
| Starter | $9 | 500 runs, basic output | Individuals |
| Professional | $29 | 5,000 runs, full schema, webhooks | Small teams |
| Enterprise | $99+ | Unlimited, SLA, custom fields | B2B clients |
The key insight: charge for reliability, not features. My ESG aggregator has four fields — company, scope 1, scope 2, report URL. That's it. But it runs at 98.7% uptime and I update selectors within 24 hours of any site change. That reliability is what justifies $29/month for a compliance analyst who needs it for Monday's board report.
One more thing on pricing: always offer a per-compute-unit billing option on Apify in addition to your own subscription tiers. Some users just want to run 10 extractions once and never again. Forcing them into a $9 monthly plan kills the conversion.
The Technical Stack Behind All 30 Actors
People ask me what technology I use. It's boring on purpose:
- requests + BeautifulSoup → 18 of 30 actors (static HTML targets)
- Playwright (Python async) → 9 of 30 actors (SPAs, login walls, scroll events)
- Crawlee → 3 of 30 actors (complex crawl graphs, multi-domain spiders)
The proxy layer is always the same: a residential proxy pool from a paid provider, rotated per request, with a fallback to datacenter proxies for non-sensitive public data. I do not roll my own proxy management — I use Apify's built-in proxy integration and pay for what I use.
For output I always produce two formats: raw JSON (every field, no transformation) and a cleaned CSV (human-readable column names, normalised price formats, deduplicated rows). Users who want the raw pipeline get JSON. Users who open it in Excel get CSV. Supporting both doubles my audience with about 20 minutes of extra code.
```python
import csv
import json

def save_output(records: list[dict], slug: str):
    # Raw JSON - for developers
    with open(f"{slug}.json", "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2, ensure_ascii=False)
    # Cleaned CSV - for analysts
    if records:
        with open(f"{slug}.csv", "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=records[0].keys())
            writer.writeheader()
            writer.writerows(records)
```
What I Would Do Differently Starting Today
Go B2B from day one. My B2C actors (social media, public review scrapers) get lots of runs but low revenue. My B2B tools (procurement, ESG, supplier intelligence) have fewer users but 4-8x higher willingness to pay. The job title on the other end of the contract matters more than the run count in your dashboard.
Invest in a clear changelog. Every time I update selectors or add a field, I post a changelog entry in the actor README. Users see activity, trust goes up, churn goes down. It takes five minutes per update. I skipped this for the first year and lost users who assumed the tool was abandoned.
Build monitoring before you launch. I ship every actor with a lightweight health check: a scheduled run against a known test URL that validates at least one expected field in the output. If the check fails, I get a webhook notification. This has caught three silent selector breaks before any paying user noticed.
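The check described above fits in a few lines. A sketch of the idea, assuming `run_actor` is any callable returning the actor's records; the webhook payload shape is illustrative, not a specific service's format:

```python
import json
import urllib.request

def health_check(run_actor, test_url, required_field, webhook_url=None):
    """Run the actor against a known test URL and verify that at least
    one record carries a non-empty value for the expected field.
    On failure, POST a notification to the webhook (if configured)."""
    try:
        records = run_actor(test_url)
        ok = any(required_field in r and r[required_field] for r in records)
    except Exception:
        ok = False  # a crash counts as a failed check
    if not ok and webhook_url:
        payload = json.dumps({"status": "broken", "url": test_url}).encode()
        req = urllib.request.Request(
            webhook_url, data=payload,
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)
    return ok
```

Schedule it daily against a page whose content you control or know well, and a silent selector break becomes a notification instead of a refund request.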
The compound effect of 30 actors working in parallel — each collecting a small stream of revenue — is more durable than any single big project. Build incrementally, maintain obsessively, and pick niches you can own.
All 30 actors are live on Apify Store.
What data problem would you want automated? Drop a comment.
Want This Data Without Running the Infrastructure?
If you're reading this and thinking "I need this data but I don't want to maintain scrapers" — that's exactly what I built for.
👉 Data Tools on Apify Store — 30+ ready-to-run actors covering ESG data, procurement intelligence, price monitoring, and more. Pay per run, no subscriptions required.
Or if you need a custom data pipeline built for your specific use case — a vertical nobody else has tackled yet:
📩 vhubsystems@gmail.com | Hire on Upwork