Last month, I launched my first commercial API on RapidAPI after months of wrestling with Shopee's ever-changing anti-bot measures. What started as a simple side project to track product prices for my dropshipping experiment turned into a production-grade API that now serves thousands of requests daily. Here's the brutally honest story of how I built it.
The problem with DIY scrapers
If you've ever tried scraping Shopee, you know the pain. I started with a basic Python script using requests and BeautifulSoup:
```python
# My naive first attempt - DON'T DO THIS
import requests
from bs4 import BeautifulSoup

def scrape_shopee_product(product_url):
    response = requests.get(product_url)
    soup = BeautifulSoup(response.text, 'html.parser')
    # This worked for exactly 3 requests before getting blocked.
    # Note: .find() doesn't take CSS selectors; select_one() does.
    return soup.select_one('.price').text
```
This lasted about 15 minutes before Shopee's bot detection kicked in. CAPTCHAs, IP bans, browser fingerprinting – they had it all. I tried rotating user agents, adding delays, even buying residential proxies. Nothing worked consistently.
The breaking point came when I realized Shopee loads most product data via JavaScript after the initial page render. My beautiful soup was just... soup. Empty soup.
Architecture: Express → Railway → RapidAPI
After weeks of frustration, I rebuilt the entire system from scratch. Here's the architecture that actually works:
API layer: Express.js server
Browser Automation: Playwright with stealth plugins
Hosting: Railway (incredible developer experience)
Distribution: RapidAPI Marketplace
Monitoring: Custom logging + Railway metrics
The key insight was treating this like a real browser session, not just HTTP requests. Playwright lets me:
- Handle JavaScript rendering
- Rotate browser profiles
- Bypass basic bot detection
- Take screenshots for debugging
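Profile rotation sounds fancier than it is: each session just gets a freshly randomized fingerprint surface. Here's a minimal sketch of how randomized context options can be built – the specific user-agent strings and viewports below are illustrative examples, not my production pool:

```javascript
// Sketch: build randomized Playwright context options per session.
// The user agents and viewports here are illustrative placeholders.
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
];

const VIEWPORTS = [
  { width: 1920, height: 1080 },
  { width: 1366, height: 768 },
  { width: 1536, height: 864 }
];

function pick(arr) {
  return arr[Math.floor(Math.random() * arr.length)];
}

function buildContextOptions() {
  return {
    userAgent: pick(USER_AGENTS),
    viewport: pick(VIEWPORTS),
    locale: 'en-US'
  };
}
```

Creating a fresh `browser.newContext(buildContextOptions())` per request keeps sessions from sharing one obvious fingerprint.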
Here's my current deployment setup on Railway:
```dockerfile
# Dockerfile snippet
FROM mcr.microsoft.com/playwright:v1.40.0-focal
WORKDIR /app
COPY package*.json ./
RUN npm ci --only=production
RUN npx playwright install chromium
COPY . .
EXPOSE $PORT
CMD ["npm", "start"]
```
One real endpoint walkthrough with code
Let me show you the /product-details endpoint that gets product info by URL. This is the most popular endpoint, accounting for 60% of API calls:
```javascript
const { chromium } = require('playwright');

app.get('/api/v1/product-details', async (req, res) => {
  const { url, include_reviews = false } = req.query;

  if (!url || !url.includes('shopee.')) {
    return res.status(400).json({
      error: 'Valid Shopee product URL required'
    });
  }

  let browser;
  try {
    // Launch browser with stealth mode
    browser = await chromium.launch({
      headless: true,
      args: ['--no-sandbox', '--disable-setuid-sandbox']
    });

    const context = await browser.newContext({
      userAgent: getRandomUserAgent(),
      viewport: { width: 1920, height: 1080 },
      locale: 'en-US'
    });
    const page = await context.newPage();

    // Navigate and wait for product data to load
    await page.goto(url, { waitUntil: 'networkidle' });
    await page.waitForSelector('[data-testid="pdp-product-title"]', {
      timeout: 10000
    });

    // Extract product data inside the browser context
    const productData = await page.evaluate((includeReviews) => {
      const title = document.querySelector('[data-testid="pdp-product-title"]')?.textContent;
      const price = document.querySelector('.notranslate')?.textContent;
      const rating = document.querySelector('.shopee-product-rating__label')?.textContent;
      const sold = document.querySelector('.aca9mm')?.textContent;

      const result = {
        title: title?.trim(),
        price: price?.trim(),
        rating: rating ? parseFloat(rating) : null,
        sold: sold?.trim(),
        images: Array.from(document.querySelectorAll('.shopee-image-wrapper img'))
          .map(img => img.src).slice(0, 5),
        timestamp: new Date().toISOString()
      };

      if (includeReviews === 'true') {
        result.recent_reviews = Array.from(document.querySelectorAll('.shopee-product-comment-list .item'))
          .slice(0, 3)
          .map(review => ({
            text: review.querySelector('.comment')?.textContent?.trim(),
            rating: review.querySelectorAll('.shopee-rating-stars__lit').length
          }));
      }

      return result;
    }, include_reviews);

    res.json({
      success: true,
      data: productData
    });
  } catch (error) {
    console.error('Scraping error:', error);
    res.status(500).json({
      success: false,
      error: 'Failed to fetch product data',
      message: process.env.NODE_ENV === 'development' ? error.message : undefined
    });
  } finally {
    if (browser) await browser.close();
  }
});
```
The magic happens in that page.evaluate() call – it runs inside the browser context where all the JavaScript has already executed.
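One gotcha the snippet glosses over: fields like price and sold count come back as display strings, not numbers. A hedged sketch of the kind of normalizer I'd pair with it – the input formats (`"₱1,234"`, `"₱199 - ₱299"`, `"1.2k sold"`) are assumptions based on typical Shopee pages, so adjust them to what your region's frontend actually renders:

```javascript
// Sketch: normalize scraped display strings into numbers.
// Input formats are assumed; verify against real Shopee markup.
function parsePrice(text) {
  if (!text) return null;
  // Strip thousands separators, take the first number in a "low - high" range
  const match = text.replace(/,/g, '').match(/(\d+(\.\d+)?)/);
  return match ? parseFloat(match[1]) : null;
}

function parseSoldCount(text) {
  if (!text) return null;
  // Handle abbreviated counts like "1.2k sold"
  const match = text.replace(/,/g, '').match(/(\d+(\.\d+)?)\s*([kK])?/);
  if (!match) return null;
  const base = parseFloat(match[1]);
  return match[3] ? Math.round(base * 1000) : Math.round(base);
}
```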
Pricing strategy for API monetization
Pricing an API is harder than building it. I studied competitors and landed on this freemium model:
- Free tier: 100 requests/month
- Basic: $9.99/month for 5,000 requests
- Pro: $29.99/month for 25,000 requests
- Enterprise: custom pricing for 100k+ requests
The free tier hooks developers, while the Basic plan covers my hosting costs (Railway runs about $20/month for my usage). Pro tier is pure profit.
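The math behind "Basic covers hosting" is rough but worth sanity-checking. The $20/month hosting figure is from my Railway bill above; the 20% marketplace commission is an assumption you should verify against your own RapidAPI plan:

```javascript
// Sketch: break-even check for the Basic tier.
// hostingCost matches my Railway bill; the marketplace fee is an assumption.
function basicTierMargin(subscribers, pricePerMonth = 9.99,
                         marketplaceFee = 0.2, hostingCost = 20) {
  const revenue = subscribers * pricePerMonth * (1 - marketplaceFee);
  return +(revenue - hostingCost).toFixed(2);
}
```

Under these assumptions, roughly three Basic subscribers cover the hosting bill; everything beyond that is margin.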
RapidAPI handles all the billing, authentication, and rate limiting. I just focus on keeping the API fast and reliable.
Lessons learned
1. Don't fight the platform, embrace it: Playwright > raw HTTP requests every time for modern web scraping.
2. Error handling is everything: My first version crashed on any unusual page structure. Now I gracefully handle missing elements and return partial data.
3. Rate limiting saves money: I learned this the hard way when someone hammered my API and my Railway bill jumped 400%.
4. Documentation sells: I spent 3 days writing clear docs with code examples. It's the difference between 10 users and 1,000.
5. Monitor everything: Set up alerts for error rates, response times, and usage spikes before you need them.
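On lesson 3: RapidAPI enforces plan quotas, but I'd still keep a safety-net limiter inside the app so a misconfigured plan can't run up the hosting bill. A minimal in-memory token-bucket sketch – at scale you'd reach for something like express-rate-limit or Redis instead, and the numbers here are illustrative:

```javascript
// Sketch: per-key token bucket. capacity = burst size,
// refillPerSec = sustained requests/second allowed.
class TokenBucket {
  constructor(capacity = 10, refillPerSec = 2) {
    this.capacity = capacity;
    this.refillPerSec = refillPerSec;
    this.buckets = new Map(); // key -> { tokens, last }
  }

  allow(key, now = Date.now()) {
    let b = this.buckets.get(key);
    if (!b) {
      b = { tokens: this.capacity, last: now };
      this.buckets.set(key, b);
    }
    // Refill based on elapsed time, capped at capacity
    const elapsed = (now - b.last) / 1000;
    b.tokens = Math.min(this.capacity, b.tokens + elapsed * this.refillPerSec);
    b.last = now;
    if (b.tokens >= 1) {
      b.tokens -= 1;
      return true;
    }
    return false;
  }
}
```

Wired in as Express middleware, you'd key on the caller's identity and return 429 whenever `allow()` comes back false.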
The API now handles edge cases I never imagined – flash sales, out-of-stock products, regional restrictions. Each failure taught me something new about Shopee's frontend architecture.
Ready to scrape smarter, not harder?
Building reliable scrapers is time-consuming and expensive. If you need Shopee product data for your project, check out the API I built: Shopee Product Scraper on RapidAPI.
It handles all the complexity I've described here – browser automation, anti-bot measures, data parsing, and error handling – so you can focus on building your actual product.
Plus, the free tier gives you 100 requests to test it out. No credit card required.
What's your experience with web scraping? Any war stories or clever solutions? Drop them in the comments below!