yuki yama

Posted on Jul 1 • Originally published at yamachampion.com

The Python Scraper Strategy: How to Automate Market Research and Earn $600+/Month

#ai #automation #python #sidehustle

Please note: All earnings mentioned are illustrative simulations.

If you can extract clean, decision-ready market data week after week, you can sell it reliably at $150 per report. Four buyers per month equals $600. Ten buyers is $1,500. With Python doing the heavy lifting and Manus running your automations on schedule, this becomes a repeatable side hustle with low overhead and scalable output.

Quick proof-of-concept math

Offer: Focused, 10–20 page market research reports targeted at operators who make fast decisions (agencies, ecom founders, indie SaaS, local services).
Price: $150 per report.
Volume: 4–10 customers/month.
Inputs: 300–2,000 public records scraped and cleaned per niche; distilled to a curated list with benchmarks and insights.
Time: 12–20 hours for the first niche build; 2–3 hours/week to refresh and sell.
Tooling budget: $65–$150/month (proxy + minimal cloud + doc workspace + automation). Net margin >80% after the initial build.

If you want a running start, register a Manus account now using this invitation link: https://manus.im/invitation/OJF7FVBNBC14H?utm_source=invitation&utm_medium=social&utm_campaign=copy_link. You’ll use it to schedule scrapes, transform data, and deliver reports automatically.

What you’re selling and who buys it

You’re not selling “raw scraped data.” You’re selling a decision in a box. Your report gives busy operators the short list of opportunities plus the numbers to justify action.

Buyer profiles that convert at $150:

Ecom operators: “Which 150 stores in X niche are growing and what are their average price points and review velocity?”
Agencies/consultants: “Which 200 local businesses look under-optimized for SEO/ads and what’s the outreach angle?”
Indie SaaS founders: “Which 300 companies recently hired for roles that signal they’ll buy software like mine?”

Each report has:

Data: Clean table (CSV + PDF/HTML) of 150–500 curated entities with 8–20 fields.
Benchmarks: Quartiles and medians for price, rating, review count, content frequency, shipping times, etc.
Playbook: Actionable 1–2 page plan based on the data (e.g., “Raise AOV by bundling,” “Category X is underserved,” “Top 10 prospects to pitch”).

Your niche selection framework (10 minutes)

Must have buyers: At least 2,000 reachable operators in your lane and clear monetary upside.
Public, structured data sources: Directories, listings, product pages, company sites, job boards, changelogs, public sitemaps.
Observable signals: Prices, ratings, review counts, release notes, shipping times, ad pixels, schema.org metadata.
Low-to-medium anti-bot: Avoid sites aggressively blocking scrapers or behind logins.
Repeatability: Data should refresh monthly or faster.

Five niches that meet criteria

DTC micro-niches: “Pet supplements,” “Eco-friendly cleaning,” “Pickleball accessories.”
Marketplace sellers: “Top 500 Etsy stores in home decor with 1-year review growth.”
SaaS signals: “B2B companies hiring RevOps; using XYZ tools; ICP for outreach.”
Local services: “Home services in mid-sized cities missing key SEO basics.”
Event ecosystems: “Sponsors and speakers across 20 industry events; vendor targeting intel.”

Define your data blueprint (start with 12–16 fields)

Identifier: URL, name, domain, market segment, country.
Commerce metrics: Price range, number of SKUs, shipping time, discounting frequency, AOV proxy, bundle presence.
Social/public signals: Review count, rating, last 90-day review growth; posting cadence; newsletter existence.
Technical signals: CMS/platform, analytics pixel presence, schema markup presence.
Outreach hooks: Contact email in footer, contact form URL, press page, hiring page, partnerships page.
Calculated fields: Price quartile, growth rank, differentiation score (simple weighted formula).

The stack you’ll use

Rotating proxy/API to reduce blocks: ScraperAPI
Enterprise-grade IP pool alternative: Bright Data
Managed crawler for complex rendering: Zyte
Low-cost VPS for scheduled runners: DigitalOcean
Clean delivery workspace for clients: Notion

Manus will orchestrate the entire pipeline, from scheduled runs to transformations and delivery. Use the invitation link to create your account now: https://manus.im/invitation/OJF7FVBNBC14H?utm_source=invitation&utm_medium=social&utm_campaign=copy_link.

The 7-step pipeline (overview)
1) Source mapping: 3–5 public pages per niche that collectively emit 500–2,000 items.
2) Fetch: Polite, paginated requests with IP rotation and caching.
3) Parse: Extract fields deterministically (CSS/XPath) and store normalized JSON rows.
4) Enrich: Compute your metrics, score opportunities, and tag segments.
5) Validate: Deduplicate, sanity-check ranges, spot-check 30 random rows manually.
6) Publish: Generate a polished table and a 1–2 page commentary; export CSV + PDF/HTML.
7) Sell: Email 100–200 qualified prospects; link to a checkout; deliver instantly post-purchase.

Implementation in detail

1) Map your sources
Pick a concrete scope to keep delivery tight. Example: “Top 300 Shopify pet-supplement stores in the US.”

Find sources that satisfy:

A navigable list you can page through (collections, category pages, directory list pages).
Detail pages with consistent markup.
Public sitemaps or predictable URL patterns.
Robots.txt that does not disallow your intended paths.

Examples of safe starting points:

Vendor directories and public leaderboards.
Category listing pages on ecommerce sites.
Company sites/pages that expose structured metadata (schema.org Product, Organization).
Job boards’ public search result pages with role filters.

2) Build the fetcher
Use Python with httpx or requests. Rotate IPs and user-agents. Cache pages locally to avoid re-fetching unchanged URLs.

Minimal fetcher outline:

HEAD requests to test connectivity.
GET with conditional headers (If-Modified-Since, ETag) if available.
Backoff: exponential backoff if 429/5xx.
Rate limit: 0.5–2 requests/second per domain; jittered sleep.

Sample snippet (simplified):

import os, time, random, json, hashlib
import httpx
from bs4 import BeautifulSoup

PROXY_BASE = os.getenv("PROXY_BASE_URL")  # e.g., https://proxy.example.com?api_key=XXX&url=
UA_POOL = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/124 Safari/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 Version/17.3 Safari/605.1.15"
]

def cache_path(url):
    return "cache/" + hashlib.md5(url.encode()).hexdigest() + ".html"

def fetch(url, use_cache=True):
    cp = cache_path(url)
    if use_cache and os.path.exists(cp):
        with open(cp, "r", encoding="utf-8") as f:
            return f.read()
    headers = {"User-Agent": random.choice(UA_POOL), "Accept-Language": "en-US,en;q=0.9"}
    full_url = f"{PROXY_BASE}{url}" if PROXY_BASE else url
    for attempt in range(5):
        r = httpx.get(full_url, headers=headers, timeout=30)
        if r.status_code == 200 and r.text.strip():
            os.makedirs("cache", exist_ok=True)
            with open(cp, "w", encoding="utf-8") as f: f.write(r.text)
            time.sleep(0.5 + random.random()*1.0)
            return r.text
        if r.status_code in (429, 503): time.sleep(2 ** attempt + random.random())
        else: break
    raise RuntimeError(f"Fetch failed: {url} code={r.status_code}")

Notes:

Set PROXY_BASE_URL to your rotating proxy endpoint from the provider of your choice.
Always check robots.txt and the site’s terms before scraping. Prefer official APIs when available.

3) Parse the listing and detail pages
Create one parser per source to keep code maintainable. Extract only what you need.

def parse_listing(html):
    soup = BeautifulSoup(html, "lxml")
    items = []
    for card in soup.select(".product-card, li.listing, .shop-card"):
        name = (card.select_one(".title") or {}).get_text(strip=True) if card.select_one(".title") else None
        url = card.select_one("a")["href"] if card.select_one("a") else None
        price_text = (card.select_one(".price") or {}).get_text(strip=True) if card.select_one(".price") else ""
        items.append({"name": name, "url": url, "price_text": price_text})
    return items

def parse_detail(html):
    soup = BeautifulSoup(html, "lxml")
    rating = soup.select_one("[itemprop='ratingValue']")
    reviews = soup.select_one("[itemprop='reviewCount']")
    sku_count = len(soup.select("[itemprop='sku'], .product-variant"))
    shipping = soup.find(string=lambda s: s and "shipping" in s.lower())
    return {
        "rating": float(rating.get("content")) if rating and rating.get("content") else None,
        "reviews": int(reviews.get("content")) if reviews and reviews.get("content") else None,
        "sku_count": sku_count,
        "shipping_hint": (shipping or "").strip() if shipping else None,
    }

4) Normalize, enrich, and score
Transform messy strings into typed fields. Compute ranks so your report overtly points to action.

Normalize price_text to price_min/price_max with regexes.
Bucket review_count into quartiles and compute review_growth if you have historical snapshots.
Compute a differentiation score (0–100) combining:
- Price positioning (mid-market vs premium)
- SKU breadth
- Review quartile
- Presence of bundles or subscriptions
Tag sub-categories to segment the market (e.g., “Mobility,” “Calming,” “Nutrition” for pet supplements).

5) Validate quality quickly

Dedup by canonical domain; if “www.” vs root appears, canonicalize.
Sanity checks: rating in [1, 5], reviews >= 0; reject rows missing both name and URL.
Manual sample: Open 30 random entries, verify 3 core fields; fix parsing rules if >10% error rate.

6) Generate the report and assets

A single CSV with 150–500 curated rows for the customer to slice.
A PDF or HTML summary with:
- Executive summary (5–7 bullets).
- Market map (clusters by price or sub-category).
- Top 25 opportunities with rationale.
- Benchmarks (medians/quartiles for pricing, reviews, SKU count).

You can render a clean HTML from a Jinja2 template and print to PDF. Keep it simple: clean table + a chart or two.

7) Orchestrate everything in Manus
You’ll use Manus to run this weekly without babysitting it.

Create your workspace using this invitation link: https://manus.im/invitation/OJF7FVBNBC14H?utm_source=invitation&utm_medium=social&utm_campaign=copy_link.
Add a “Schedule” trigger: every Monday at 02:00 UTC.
Step 1: “Run Script” or “HTTP Request” to start your scraper (containerized or webhook-triggered).
Step 2: “Transform” to run your enrichment script (price normalization, ranking).
Step 3: “Storage” to upload CSV to your cloud bucket and to your docs workspace.
Step 4: “Email/Slack” to notify you with accuracy metrics and a preview link.
Step 5: “Publish” to update a landing page snippet with “This week’s top 25” and a buy link.

This Manus-first approach means you can spin up new niches by cloning the workflow and swapping source URLs and parsing rules. The consistent schedule builds trust with buyers who want monthly refreshes.

Selling the $150 report

Your buyer list

Pull 300–800 prospects per niche from public websites and directories that expose business contact emails or contact forms. Do not scrape personal emails or gated data.
Enrich each with 1–2 custom notes from your dataset (“Noticed you rank in the 75th percentile for price with low review velocity; likely conversion lift from bundle pricing.”).
Segment into A/B messaging angles (growth mindset vs. risk mitigation).

Cold email math

Warm domain + proper setup: 40–60% open.
Relevant, concise pitch: 5–12% reply.
Short demo link (1–2 page sample): 3–8% click-to-view.
Conversion: 2–5% of total emailed buy the $150 report.

If you email 300 operators/month:

45 replies (15%)
12–18 serious conversations
6–12 purchases → $900–$1,800

Email template you can deploy
Subject: 25 low-competition plays in [Niche] from this week’s data

Body:
Hey {FirstName} — I analyzed {Niche} across {N} public listings this week and found 25 opportunities where price positioning and review velocity suggest easy wins.

Example insight: Brands priced in the 2nd quartile with fewer than 3 SKUs are seeing 18–24% higher review growth after introducing bundles.

I packaged everything into a focused report (10 pages + CSV of {M} curated entries). It’s $150. If it’s not useful, I refund in full.

Want the sample summary (2 pages) to see if it’s relevant?

Best,
{You}
{one-sentence credibility}

Attach or link to a 2-page teaser and a blurred screenshot of your top 25 list. When they say “yes,” send checkout link, automate delivery.

Pricing, margins, and costs (illustrative)

Proxy API: $29–$99/month depending on volume.
VPS: $6–$12/month.
Automation/orchestration: start with free/entry plan; scale as you grow.
Docs workspace: $8–$12/month per seat.

Assume $90/month in tooling and 10 hours of work after the initial build. Four sales ($600) yields $510 gross margin, $51/hour. Ten sales ($1,500) yields $1,410 margin.

Compliance and risk management

Respect robots.txt and site terms. Avoid scraping areas that are disallowed or behind logins.
Collect only public business data; avoid personal data (emails tied to individuals, phone numbers of private persons).
Rate-limited, polite crawling. Cache responses to reduce load.
Prefer official APIs and public sitemaps when possible; avoid circumventing technical barriers.
Maintain takedown and redaction process: if an owner requests removal, comply promptly.

Detailed build: first niche in 48 hours

Day 1 (6–8 hours)

Pick niche and define 12–16 fields.
Identify 3–5 sources and sample 200 URLs.
Implement fetch + basic parse for listing pages; save 500–1,000 candidate rows.
Add a single detail-page parser for your most consistent source.
Normalize 6–8 fields and compute 2 simple metrics (price quartile, review quartile).
Manual QA on 30 rows; refine.

Day 2 (6–8 hours)

Implement 2 more detail parsers; dedup and canonicalize domains.
Compute final metrics and scores; export CSV.
Build a 10-page summary (HTML->PDF). Add top 25 picks with 1–2 lines each.
Set up Manus:
- Create schedule, link your runner, wire up transform + publish steps.
- Upload sample output to a public shareable link.
Draft cold outreach assets: teaser PDF, email templates, subject lines, credibility snippet.

By the end, you’re ready to email 100–200 prospects with a real sample.

Example niche walk-through: “Top 300 DTC pet-supplement stores”
Objective: Identify 300 brands with above-median review growth and SKU gaps suggesting bundle potential.

Signals you’ll extract

Price_min, price_max, presence_of_bundle, sku_count.
rating, review_count, 90d_review_change (computed from snapshots over time).
shipping_speed_hint (fast/standard), discount_frequency_hint.
platform (Shopify/Woo/etc.), analytics pixel presence.
contact_email_in_footer_or_form, partnerships_page_url.

Process

Source A: Category listing pages across public directories (structured HTML).
Source B: Product pages for 1–3 top SKUs to compute price ranges and bundle presence.
Source C: Press/updates pages for “new product” recency signals.

Data hygiene

Deduplicate by eTLD+1 domain; if multiple products map to one brand, aggregate.
Drop entries missing both rating and price.
Winsorize outliers on review_count to reduce skew.

Metrics and output

Median price: $28.50; 75th percentile: $39.00.
Median review_count: 184; top decile: 1,200+.
SKU_count median: 6; 40% of brands have 3 or fewer SKUs.
Bundle presence: only 23% offer true bundles.
Opportunity thesis: Sub-$30 brands without bundles, but with 100–300 reviews, could lift AOV by 12–18% via 2-pack and subscription nudges.

Deliverables

CSV with 300 rows.
Summary PDF with market map and top 25 opportunities, each tagged with rationale.
“What to do Monday” section: 5 items any founder can do in 2 hours.

Simulated results (first 30 days)

1,200 prospects compiled; 300 emailed per week; 1,200 total.
51 replies (4.25%); 18 serious; 8 purchases → $1,200.
Tooling spend: $92; net: $1,108; time: 26 hours total → $42/hour baseline, improving with reuse.

Scaling beyond $600/month

Launch new verticals by cloning your Manus workflow and swapping parsers.
Offer subscriptions: $69/month for monthly refreshes; 10 subscribers = $690 MRR plus one-offs.
Bundle upsells: “Top 25 this month” mini-reports at $39; converts well as a lead magnet to the $150 report.
White-label to agencies: Let them resell; you produce weekly refreshes.

Operational guardrails so you don’t stall

Cap your initial field count at 16; depth beats breadth.
Automate enrichment only after core parsing is rock-solid; premature bells and whistles cause breakage.
Track three KPIs weekly: rows scraped, acceptance rate on manual QA, sales emails sent. If any drop to zero, fix the pipeline that week.

Manus workflow details (practical)

Trigger: Cron every Monday 02:00 UTC.
Task 1: Fetcher container (Docker) with environment variables for API keys; returns JSONL.
Task 2: Transform script:
- Normalize fields.
- Compute quartiles using precomputed histograms to avoid loading entire datasets in RAM.
- Select top 25.
Task 3: Publish:
- Upload CSV to your storage.
- Render HTML report using a Jinja2 template and headless print to PDF.
- Post the HTML and PDF to your docs workspace via its API.
Task 4: Notify:
- Send yourself a link to the new report, plus QA stats (random sample URLs with diffs week-over-week).
Task 5: Marketing:
- Update a public “This week’s insights” snippet on your landing page and queue a batch of 50 outreach emails.

Set this up once in Manus using the invitation link: https://manus.im/invitation/OJF7FVBNBC14H?utm_source=invitation&utm_medium=social&utm_campaign=copy_link, then clone for each niche.

How to keep your scrapers alive

Rotate IPs and user-agents; keep request rates conservative; add jitter to sleeps.
Cache aggressively and use conditional GETs where supported.
Split traffic by domain; each domain gets its own rate limiter.
Build selector tests: unit tests that load saved HTML fixtures and assert the parser still finds the fields.
Log structured events: request timings, status, selector failures, dedupe counts.

Packaging and delivery that converts

A single landing page describing one problem, your data-backed solution, and a clean CTA to buy.
Teaser sample: 2 pages of real (but partially blurred) content; don’t publish the full list publicly.
Delivery stack:
- Checkout → automated email with download link.
- Customer portal with prior months’ reports if they subscribe.
- Update cadence communicated clearly (e.g., “new edition every first Monday”).
Support policy: 7-day refund if they feel it’s not useful.

FAQs your buyers will ask (prepare answers)

“Where did the data come from?” Publicly available web pages and directories; we respect robots.txt and site terms; we do not include personal data.
“How fresh is it?” Snapshot updated weekly/monthly; each report states the crawl dates.
“Can you customize for my sub-niche?” Yes; custom cuts start at $300 with 5-day turnaround.

Common pitfalls and fixes

Pitfall: Over-scraping a single complex domain that blocks you. Fix: Diversify sources; slow down; switch to a managed crawler for that subset.
Pitfall: Trying to sell to huge enterprises out of the gate. Fix: Sell to operators who decide quickly (founders, agency owners).
Pitfall: Bloated reports. Fix: Keep 10–20 pages; lead with an executive summary and top 25 table; relegate raw data to a CSV.

Your first 72-hour plan

Hour 0–2: Pick niche; set 12–16 fields; outline thesis.
Hour 2–6: Implement listing parsers; fetch 1,000 candidates.
Hour 6–10: Implement 2 detail parsers; normalize; compute basic metrics.
Hour 10–14: Manual QA; generate v1 CSV; build 2-page teaser.
Hour 14–18: Set up Manus schedule and publish flow.
Hour 18–24: Build landing page and checkout; queue 100 emails.
Hour 24–48: Iterate based on replies; polish full report; deliver first sales.
Hour 48–72: Clone the niche to an adjacent vertical; reuse 80% of the code.

Tool-specific notes (choose one per category)

Rotating proxy: Configure your proxy endpoint and monitor 429 rates. Start with a small plan; scale when success is proven. If you need a larger IP pool, consider the enterprise-grade option listed above.
Managed crawler: If one source heavily relies on client-side rendering or anti-bot measures, offload it to the managed crawler mentioned earlier and merge its JSON output with your own.
VPS: A small VM can run all of this comfortably; keep CPU-bound parsing efficient and write artifacts to a mounted volume with lifecycle policies.
Docs workspace: Publish your deliverables with permalinks. Keep a “Changelog” section so returning customers see monthly progress.

Quality bar for a $150 report

95%+ field completeness on core fields.
<2% duplicate rows.
30+ minutes saved for the buyer’s research.
3–5 specific plays the buyer can execute next week.

Closing the loop: manual improvements that matter

Add 2–3 “insight macros” to your transform step. Examples:
- “Bundle potential” score: sku_count <= 3 and price_min in Q2 → high.
- “Underpriced premium” flag: rating >= 4.6 and price_max in Q2/Q3 → review growth tailwind.
- “Low-hanging SEO” tag: no schema markup + thin category text → likely quick win.
Track which insights buyers mention in replies; amplify those in the executive summary.

Final reminder
This model works because you provide repeatable, truth-based answers faster than your buyers can. Make the data clean, the insights blunt, and the delivery automatic. Clone niches. Keep sending emails. Iterate.

Call to action box
Your automated research business is one Manus workflow away. Create your Manus account with this invitation link: https://manus.im/invitation/OJF7FVBNBC14H?utm_source=invitation&utm_medium=social&utm_campaign=copy_link. Set a weekly schedule, plug in your Python scraper, and start shipping $150 reports on autopilot. Join today and build your first revenue-producing niche this week.

DEV Community

The Python Scraper Strategy: How to Automate Market Research and Earn $600+/Month

Top comments (0)