Vhub Systems
I Built 28 Web Scrapers on Apify — Here's What I Learned

Six months ago, I had a problem: I needed to extract business contact data from Google Maps for a client project. I could have used an existing scraper, but I decided to build my own. That single actor turned into 28 production-ready web scrapers, thousands of lines of code, and a deep understanding of what it takes to build reliable data extraction tools.

This is what I learned along the way.

Why I Built 28 Scrapers

It started with necessity. I was working on a B2B lead generation tool and needed structured data from multiple sources: Google Maps for local businesses, Amazon for product research, real estate sites for property data, and social media for contact information.

I quickly realized that each data source required a different approach. Google Maps is a JavaScript-heavy SPA that fights automation. Amazon rotates their HTML structure and aggressively blocks bots. Instagram requires careful session management and rate limiting. Each platform became a learning opportunity.

The more I built, the more patterns emerged. I started extracting reusable components, building libraries of stealth techniques, and developing a mental framework for approaching new scraping challenges. By actor number 15, I could spin up a new scraper in a few hours instead of days.

The Stack: Crawlee, Playwright, and Apify SDK

Every one of my actors is built on the same foundation:

Crawlee 3.x — Apify's open-source crawling framework. It handles request queuing, retries, session management, and proxy rotation out of the box. The API is clean and the abstractions are at the right level. You focus on data extraction logic, not infrastructure.

Playwright — For anything JavaScript-heavy (which is most modern websites), Playwright is non-negotiable. It gives you a real browser with full rendering, JavaScript execution, and network interception. Yes, it's slower and more expensive than Cheerio, but it actually works.

Apify SDK — The glue that ties everything together. Dataset storage, key-value store for state, actor input schemas with validation, and built-in proxy management. The platform handles scaling, monitoring, and scheduling so I can focus on the scraping logic.

Node.js 20 — Fast and stable, with TypeScript support via the tsx loader. All my actors run on the Node 20 runtime.

Here's what a typical Crawlee setup looks like:

import { PlaywrightCrawler } from 'crawlee';
import { Actor } from 'apify';

const crawler = new PlaywrightCrawler({
    launchContext: {
        launchOptions: {
            headless: true,
            args: [
                '--disable-blink-features=AutomationControlled',
                '--disable-web-security',
            ],
        },
    },
    preNavigationHooks: [async ({ page }) => {
        // Remove the webdriver flag before any page script runs
        await page.addInitScript(() => {
            Object.defineProperty(navigator, 'webdriver', {
                get: () => false,
            });
        });
    }],
    requestHandler: async ({ page, request, enqueueLinks }) => {
        console.log(`Processing: ${request.url}`);

        // Your extraction logic here
        const data = await page.evaluate(() => {
            return {
                title: document.querySelector('h1')?.textContent,
                // ... more fields
            };
        });

        await Actor.pushData(data);
    },
});

That preNavigationHooks pattern is critical. Modern anti-bot systems check for the navigator.webdriver flag. Remove it, and you're already ahead of 80% of basic scrapers.

Top 5 Actors (From 28 Built)

Let me walk you through five of the most interesting scrapers I've built, what they do, and what made them challenging.

1. Google Maps Lead Scraper

This was the first and still the most complex. It extracts business listings from Google Local Search: names, addresses, phone numbers, emails, websites, ratings, review counts, and business hours.

The challenge: the best source for this data is not actually maps.google.com. That's a React SPA that's nearly impossible to scrape reliably. Instead, I use Google Local Search (google.com/search?tbm=lcl), which returns structured HTML with business cards. Still JavaScript-heavy, but predictable.

The scraper handles pagination, deduplication, and falls back gracefully when emails aren't publicly listed. It uses residential proxies to avoid Google's rate limits and rotates user agents on every request.
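The deduplication step can be sketched as a pure function. This is an illustrative version, not the actor's actual code, and the `name`/`address` field names are assumptions:

```javascript
// Deduplicate business listings by a normalized name+address key.
function dedupeListings(listings) {
    const seen = new Set();
    const unique = [];
    for (const listing of listings) {
        const key = `${listing.name}|${listing.address}`
            .toLowerCase()
            .replace(/\s+/g, ' ') // collapse whitespace differences
            .trim();
        if (!seen.has(key)) {
            seen.add(key);
            unique.push(listing);
        }
    }
    return unique;
}
```

Keying on normalized name plus address catches the common case where the same business appears on multiple result pages with slightly different formatting.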

2. Contact Info Scraper

Give it any URL, and it extracts all contact information: emails (even obfuscated ones), phone numbers (with international format detection), social media links, and contact form URLs.

This one taught me about regex hell. Email obfuscation comes in dozens of formats: info [at] example [dot] com, info@example•com, even base64-encoded mailto links. I ended up with a 200-line email extraction function that handles 15+ obfuscation patterns.
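A minimal sketch of the idea (the real function handles far more patterns than these three):

```javascript
// Normalize common email obfuscation patterns, then run a standard
// email regex over the cleaned text.
function deobfuscateEmails(text) {
    const normalized = text
        .replace(/\s*[\[\(]\s*at\s*[\]\)]\s*/gi, '@')  // "info [at] example"
        .replace(/\s*[\[\(]\s*dot\s*[\]\)]\s*/gi, '.') // "example [dot] com"
        .replace(/•/g, '.');                           // "example•com"
    const emailPattern = /[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}/gi;
    return normalized.match(emailPattern) ?? [];
}
```

Normalizing first, then matching, keeps the final regex simple; trying to match every obfuscation variant in one giant pattern is where the regex hell starts.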

Phone number extraction is even worse. International formats, extensions, vanity numbers ("1-800-FLOWERS"), and false positives (product codes, dates) required building a validation layer on top of pattern matching.

3. Instagram Email Scraper

Extracts profile data, follower/following counts, engagement rates, post counts, and business contact information from Instagram profiles.

Instagram is interesting because they actively fingerprint bots through timing patterns. If you scroll too fast, click too precisely, or never move your mouse, you get shadowbanned. The solution: randomized delays, mouse movement simulation, and mimicking human scroll behavior.
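A sketch of the pacing helpers, assuming a Playwright `page` object; step sizes and delay ranges are arbitrary values I'd tune per target:

```javascript
// Return a random delay in [min, max) milliseconds.
function randomDelay(min = 500, max = 2000) {
    return min + Math.random() * (max - min);
}

// Scroll a Playwright page in small, unevenly sized and timed steps,
// rather than one instant jump to the bottom.
async function humanScroll(page, totalPx = 3000) {
    let scrolled = 0;
    while (scrolled < totalPx) {
        const step = 120 + Math.floor(Math.random() * 300);
        await page.mouse.wheel(0, step);
        scrolled += step;
        await page.waitForTimeout(randomDelay(200, 900));
    }
}
```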

The scraper also handles both public and private profiles gracefully, extracting what's available without failing on permission errors.

4. Amazon Product Scraper

Scrapes product data across 9 Amazon marketplaces: US, UK, Germany, France, Italy, Spain, Japan, Canada, and Australia. Extracts prices, ratings, review counts, Best Seller Rank, product features, images, and variant data.

Amazon is a masterclass in anti-scraping. They rotate HTML selectors, serve different markup to different user agents, and have sophisticated bot detection. This actor uses:

  • Residential proxies (datacenter IPs are blocked instantly)
  • Playwright stealth plugin
  • Request throttling (max 2 requests/second per session)
  • Fallback selectors (3-4 CSS paths per data point)
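Crawlee can enforce the rate limit for you via its `maxRequestsPerMinute` option; as a standalone sketch, the per-session throttle gate looks like this:

```javascript
// Minimal per-session throttle: returns an async gate that callers
// await before each request, enforcing a max request rate.
function createThrottle(maxPerSecond = 2) {
    const minIntervalMs = 1000 / maxPerSecond;
    let nextAllowed = 0;
    return async function throttle() {
        const now = Date.now();
        const waitMs = Math.max(0, nextAllowed - now);
        nextAllowed = Math.max(now, nextAllowed) + minIntervalMs;
        if (waitMs > 0) await new Promise((r) => setTimeout(r, waitMs));
    };
}
```

Tracking `nextAllowed` (rather than sleeping a fixed amount) means bursty callers get queued correctly instead of all sleeping the same interval and firing together.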

The BSR (Best Seller Rank) extraction alone took two days to get right. Amazon hides it in inconsistent places depending on category, marketplace, and whether the product is in stock.

5. Real Estate Scraper

Scrapes property listings from Zillow, Realtor.com, and Redfin. Extracts prices, square footage, bed/bath counts, property type, listing status, images, and agent contact info.

Real estate sites are tricky because they're location-dependent. Search results change based on your IP's geolocation. I use sticky sessions with residential proxies in the target region to get consistent results.

The scraper also handles "off-market" listings, foreclosures, and "coming soon" properties differently, with separate extraction logic for each status type.
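The per-status routing can be sketched as a dispatch table. Status values and fields here are illustrative, not the actor's real schema:

```javascript
// One extractor per listing status; unknown statuses fall back to active.
const extractors = {
    active: (card) => ({ status: 'active', price: card.price }),
    'off-market': (card) => ({ status: 'off-market', lastSoldPrice: card.lastSoldPrice }),
    foreclosure: (card) => ({ status: 'foreclosure', auctionDate: card.auctionDate }),
    'coming-soon': (card) => ({ status: 'coming-soon', availableDate: card.availableDate }),
};

function extractListing(card) {
    const extract = extractors[card.status] ?? extractors.active;
    return extract(card);
}
```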

Five Hard-Earned Lessons

1. Playwright > Cheerio for SPAs, But Costs More Compute

I started with Cheerio (jQuery-like HTML parsing) because it's fast and cheap. For static HTML sites, it's perfect. But modern websites are React/Vue/Angular SPAs that render content client-side.

Cheerio sees empty div tags. Playwright sees the fully-rendered page.

The tradeoff: Playwright uses 3-5x more memory and CPU. On Apify, that means higher compute costs. But a scraper that actually works is worth more than a cheap scraper that returns empty datasets.

My rule now: if the site needs JavaScript to display content, use Playwright. Otherwise, Cheerio is fine.
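One quick way to apply that rule: fetch the raw, non-rendered HTML and check whether it already contains meaningful text or is just an empty SPA shell. A sketch; the 200-character threshold is an arbitrary assumption:

```javascript
// Heuristic: if the server-rendered HTML has almost no visible text,
// the content is rendered client-side and the site needs Playwright.
function looksLikeSpaShell(html) {
    const body = html.match(/<body[^>]*>([\s\S]*)<\/body>/i)?.[1] ?? html;
    const withoutScripts = body.replace(/<script[\s\S]*?<\/script>/gi, '');
    const visibleText = withoutScripts.replace(/<[^>]+>/g, '').trim();
    return visibleText.length < 200;
}
```

Run it once against `fetch(url).then(r => r.text())` during development and you know which crawler class to reach for before writing any extraction logic.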

2. Stealth Matters More Than You Think

Basic scrapers get blocked immediately. Sites check:

  • navigator.webdriver flag (dead giveaway)
  • Missing browser plugins (real browsers have PDF viewer, etc.)
  • Canvas/WebGL fingerprinting (GPU rendering patterns)
  • Mouse movement patterns (bots move in straight lines)
  • Request timing (bots are too consistent)

I use playwright-extra with the stealth plugin, which patches most of these. For high-security targets (Google, Amazon, Instagram), I also:

  • Rotate user agents on every session
  • Randomize viewport sizes
  • Add random delays between actions (500-2000ms)
  • Simulate mouse movement before clicks
  • Vary scroll speed and distance

Here's a snippet showing stealth setup:

import { chromium } from 'playwright-extra';
import StealthPlugin from 'puppeteer-extra-plugin-stealth';

chromium.use(StealthPlugin());

const browser = await chromium.launch({
    headless: true,
    args: [
        '--disable-blink-features=AutomationControlled',
        '--disable-features=IsolateOrigins,site-per-process',
    ],
});

const context = await browser.newContext({
    viewport: {
        width: 1920 + Math.floor(Math.random() * 100),
        height: 1080 + Math.floor(Math.random() * 100)
    },
    userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)...',
});

The randomized viewport is subtle but effective. Real users don't all have exactly 1920x1080 screens.
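Mouse movement simulation works the same way: move along a slightly jittered path instead of teleporting straight to the target. A sketch, assuming a Playwright `page`:

```javascript
// Generate intermediate points between two coordinates with small
// random jitter, so movement isn't a perfectly straight line.
function jitteredPath(from, to, steps = 8) {
    const points = [];
    for (let i = 1; i <= steps; i++) {
        const t = i / steps;
        points.push({
            x: from.x + (to.x - from.x) * t + (Math.random() - 0.5) * 10,
            y: from.y + (to.y - from.y) * t + (Math.random() - 0.5) * 10,
        });
    }
    return points;
}

// Walk the mouse to the element's center, then click.
async function humanClick(page, selector) {
    const box = await page.locator(selector).boundingBox();
    if (!box) throw new Error(`No element for ${selector}`);
    const target = { x: box.x + box.width / 2, y: box.y + box.height / 2 };
    for (const p of jitteredPath({ x: 0, y: 0 }, target)) {
        await page.mouse.move(p.x, p.y);
        await page.waitForTimeout(20 + Math.random() * 60);
    }
    await page.mouse.click(target.x, target.y);
}
```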

3. Residential Proxies Are Essential for Big Sites

I learned this the hard way. Google blocked my datacenter proxies within 10 requests. Amazon blocked them within 5. Instagram didn't even load the page.

Residential proxies (real IPs from real ISPs) are expensive but necessary. On Apify, I use their built-in proxy groups:

proxyConfiguration: await Actor.createProxyConfiguration({
    groups: ['RESIDENTIAL'],
    countryCode: 'US',
}),

The cost is roughly $1 per 1000 requests (vs pennies for datacenter proxies), but the success rate goes from 20% to 95%.

For less aggressive sites, datacenter proxies work fine. I use a tiered approach: try datacenter first, fall back to residential on 403/429 errors.
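The escalation logic itself is small. In this sketch, `fetchViaProxy` is a hypothetical helper (on Apify you'd build the two tiers with `Actor.createProxyConfiguration` and different `groups`):

```javascript
// Status codes that signal "blocked, escalate to residential".
const BLOCK_STATUSES = new Set([403, 429]);

// Try the cheap datacenter tier first; retry via residential only
// when the response looks like a block.
async function tieredFetch(url, fetchViaProxy) {
    const first = await fetchViaProxy(url, 'DATACENTER');
    if (!BLOCK_STATUSES.has(first.status)) return first;
    return fetchViaProxy(url, 'RESIDENTIAL');
}
```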

4. Error Handling Is 50% of the Code

A naive scraper is 100 lines. A production scraper is 500 lines, and 250 of them are error handling.

What breaks:

  • Network timeouts (flaky connections)
  • Rate limits (429 errors)
  • Changed HTML structure (selectors stop working)
  • Missing data (optional fields that aren't always present)
  • Bot detection (403/CAPTCHA)
  • Memory leaks (browser contexts not closed)

My error handling strategy:

  • Retries: 3 attempts with exponential backoff
  • Fallback selectors: 2-3 CSS paths for critical data
  • Graceful degradation: partial data is better than no data
  • Logging: detailed errors with context (URL, selector, attempt number)
  • Circuit breakers: stop crawling if error rate > 50%
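The retry piece of that list can be sketched as a small wrapper: 3 attempts, doubling the wait between each.

```javascript
// Run fn, retrying failures with exponential backoff
// (baseDelayMs, then 2x, then 4x, ...). Rethrows the last error.
async function withRetries(fn, { attempts = 3, baseDelayMs = 1000 } = {}) {
    let lastError;
    for (let attempt = 1; attempt <= attempts; attempt++) {
        try {
            return await fn();
        } catch (error) {
            lastError = error;
            if (attempt < attempts) {
                const delay = baseDelayMs * 2 ** (attempt - 1);
                await new Promise((r) => setTimeout(r, delay));
            }
        }
    }
    throw lastError;
}
```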

Example pattern for extracting with fallbacks:

async function extractWithFallback(page, selectors) {
    for (const selector of selectors) {
        try {
            const element = await page.$(selector);
            if (element) {
                return await element.textContent();
            }
        } catch (error) {
            console.log(`Selector ${selector} failed, trying next...`);
        }
    }
    return null; // All fallbacks failed
}

const title = await extractWithFallback(page, [
    'h1.product-title',
    '[data-testid="product-name"]',
    '.title',
]);

5. Your README Is Your Landing Page

I didn't realize this until actor number 10. The README is what users see first in the Apify Store. It's your sales page, documentation, and SEO all in one.

Good READMEs have:

  • Clear title and one-sentence description (for SEO)
  • Feature list with bullets (scannable)
  • Input parameter documentation (with examples)
  • Output format examples (JSON schema or sample data)
  • Use cases (who needs this and why)
  • Screenshots or GIFs (showing input/output)
  • Pricing transparency (cost per 1000 results)

I use a template now with structured sections. It takes an extra hour to write, but the conversion rate is 3x higher than my early "here's the code, figure it out" READMEs.

SEO matters too. Keywords in the title, description, and headings help users discover your actors organically.

What's Next

I'm not done. Here's what I'm working on:

Pricing optimization: Most of my actors are free right now, but I'm testing paid tiers for higher limits and premium features (faster execution, priority support, advanced filters).

More actors: I have 12 more in development — LinkedIn scrapers, job board aggregators, review scrapers for TrustPilot and G2. The goal is 50 by end of year.

API integrations: Some users want scraped data pushed directly to their CRM or database. I'm building webhook support and Zapier/Make integrations.

Maintenance automation: Sites change their HTML structure weekly. I'm building a monitoring system that tests each actor daily and alerts me when selectors break.

The long-term vision: a complete data extraction toolkit where you can get structured data from any major website without writing code.

Final Thoughts

Building 28 web scrapers taught me that scraping is 20% extraction logic and 80% fighting anti-bot systems, handling errors, and maintaining code as sites change.

It's frustrating, rewarding, and endlessly educational. Every new site is a puzzle. Every blocked request is a learning opportunity. Every successful extraction feels like a small victory against the machines trying to stop you.

If you're building scrapers, my advice:

  1. Start with Crawlee and Playwright (don't reinvent the wheel)
  2. Invest in residential proxies early (save yourself the pain)
  3. Build error handling from day one (not after it breaks in production)
  4. Document everything (your future self will thank you)
  5. Ship early and iterate (perfect is the enemy of done)

Check out all 28 actors on my Apify Store profile: https://apify.com/lanky_quantifier

Happy scraping.


Vladimir Sysenko
Developer, data engineer, and builder of things that extract other things from the internet.
