DEV Community

wfgsss

How to Scrape Pinduoduo (拼多多) for Product Data: A Complete Guide

Why Scrape Pinduoduo?

Pinduoduo (拼多多) is China's second-largest e-commerce platform with over 900 million active buyers. Unlike Alibaba or JD.com, Pinduoduo focuses on ultra-low-price group buying — making it a goldmine for:

  • Dropshippers looking for the cheapest source prices
  • Market researchers tracking consumer trends in China
  • Wholesale buyers comparing prices across platforms
  • Data analysts studying pricing strategies and sales patterns

But scraping Pinduoduo is significantly harder than scraping most e-commerce sites. This guide covers everything you need to know.

Platform Architecture

Pinduoduo operates across multiple surfaces:

| Surface | Domain | Data Access |
| --- | --- | --- |
| PC Website | pinduoduo.com | Corporate pages only, no product data |
| Mobile H5 | mobile.yangkeduo.com | Requires login |
| Mini Program | WeChat embedded | Not scrapable |
| Native App | iOS/Android | Encrypted traffic |
| Temu (overseas) | temu.com | More accessible |

The mobile H5 site (mobile.yangkeduo.com) is the primary target for web scraping, but it comes with serious challenges.

Key Technical Challenges

1. Mandatory Login Wall

Unlike most e-commerce platforms, Pinduoduo requires phone number + SMS verification for every page:

Search page → Redirects to login
Product detail → Redirects to login
Category page → Redirects to login

There's no guest browsing mode, no email registration, and no third-party OAuth. You need a Chinese phone number.

2. Aggressive Anti-Bot System

Pinduoduo employs multiple layers of protection:

  • API signature encryption: All API calls require a sign parameter generated by obfuscated JavaScript
  • Browser fingerprinting: Canvas fingerprint, WebGL, and navigator property checks
  • Native bridge detection: Checks if running inside the Pinduoduo app via pinbridge
  • Rate limiting: client-side risk interceptors monitor request patterns and flag bursty traffic
  • Cookie rotation: Short-lived session cookies that expire frequently
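The `sign` parameter deserves a closer look. Pinduoduo's actual algorithm is buried in obfuscated JavaScript and is not public, but the general request-signing pattern looks something like this (the salt and hashing scheme here are purely illustrative assumptions, not Pinduoduo's real algorithm):

```python
import hashlib

def sign_params(params: dict, salt: str) -> str:
    """Illustrative signature: MD5 over sorted key=value pairs plus a salt.

    Pinduoduo's real `sign` is produced by obfuscated JS and is NOT this
    algorithm; this only shows the general shape of API request signing.
    """
    canonical = "&".join(f"{k}={params[k]}" for k in sorted(params))
    return hashlib.md5((canonical + salt).encode("utf-8")).hexdigest()

# Every API call carries its parameters plus the derived sign value
payload = {"keyword": "蓝牙耳机", "page": 1}
payload["sign"] = sign_params(payload, salt="demo-salt")
```

Because the server recomputes the signature over the same parameters, any request whose `sign` doesn't match is rejected — which is why replaying captured API calls with modified parameters fails.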

3. Strict robots.txt

User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /
Allow: /poros/h5

Pinduoduo blocks all crawlers, including Googlebot (except one specific path). This is the strictest robots.txt among major Chinese e-commerce platforms.
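If you want a pre-flight check in code, Python's standard-library `robotparser` can evaluate these rules without a network round-trip (the rules string below mirrors the wildcard block shown above):

```python
from urllib.robotparser import RobotFileParser

# The blanket rule from Pinduoduo's robots.txt, parsed directly from text
robots_txt = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Any ordinary crawler is disallowed from every path
allowed = parser.can_fetch("MyScraper/1.0", "https://mobile.yangkeduo.com/goods.html")
print(allowed)  # False
```

Note that Python's `robotparser` applies rules in file order (first match wins), so it won't reproduce Google's longest-match `Allow` semantics for the `/poros/h5` carve-out.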

Available Data Fields

When you do get access, here's what you can extract:

{
  "goods_id": 123456789,
  "title": "iPhone 15 手机壳 透明防摔",
  "price": 3.99,
  "original_price": 15.99,
  "sales_count": "10万+",
  "images": [
    "https://img.pddpic.com/xxx.jpeg"
  ],
  "shop_name": "数码配件旗舰店",
  "shop_rating": 4.8,
  "category": "手机配件",
  "reviews_count": 52000,
  "group_price": 2.99,
  "min_order": 1
}

The group_price field is unique to Pinduoduo — it's the discounted price when buying as part of a group.
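Note that fields like `sales_count` arrive as display strings ("10万+") rather than numbers. A small normalization helper is worth writing up front; this sketch uses the field names from the sample payload above (万 = 10,000, and the trailing "+" marks a lower bound):

```python
import re

def parse_sales_count(raw: str) -> int:
    """Convert Pinduoduo-style sales strings ('10万+', '5000+') to integers.

    万 means 10,000; the trailing '+' marks a lower bound, which we drop.
    """
    match = re.match(r"([\d.]+)(万?)\+?$", raw.strip())
    if not match:
        raise ValueError(f"unrecognized sales count: {raw!r}")
    value = float(match.group(1))
    if match.group(2) == "万":
        value *= 10_000
    return int(value)

def group_discount(price: float, group_price: float) -> float:
    """Percentage saved by buying at the group price instead of list price."""
    return round((price - group_price) / price * 100, 1)
```

For the sample item above, `parse_sales_count("10万+")` yields 100000 and `group_discount(3.99, 2.99)` shows a roughly 25% group-buy saving.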

Approach 1: Playwright Browser Automation (Recommended)

The most reliable method uses a real browser to bypass JavaScript challenges:

import asyncio
from playwright.async_api import async_playwright

async def scrape_pinduoduo(keyword: str, max_items: int = 20):
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=False  # Headed mode recommended
        )

        context = await browser.new_context(
            user_agent=(
                "Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) "
                "AppleWebKit/605.1.15 (KHTML, like Gecko) "
                "Version/17.0 Mobile/15E148 Safari/604.1"
            ),
            viewport={"width": 390, "height": 844},
            device_scale_factor=3,
            is_mobile=True,
        )

        page = await context.new_page()

        # Block unnecessary resources for speed
        await page.route(
            "**/*.{png,jpg,jpeg,gif,svg,woff,woff2}",
            lambda route: route.abort()
        )

        # Navigate to search
        url = f"https://mobile.yangkeduo.com/search_result.html?search_key={keyword}"
        await page.goto(url, wait_until="networkidle")

        # Check if redirected to login
        if "login" in page.url:
            print("Login required - need authenticated session")
            return []

        # Extract product cards
        products = await page.evaluate("""
            () => {
                const cards = document.querySelectorAll('[data-goods-id]');
                return Array.from(cards).map(card => ({
                    goods_id: card.dataset.goodsId,
                    title: card.querySelector('.title')?.textContent?.trim(),
                    price: card.querySelector('.price')?.textContent?.trim(),
                    sales: card.querySelector('.sales')?.textContent?.trim(),
                }));
            }
        """)

        await browser.close()
        return products[:max_items]

# Run
results = asyncio.run(scrape_pinduoduo("蓝牙耳机"))
for item in results:
    print(f"{item['title']} - {item['price']} ({item['sales']})")

Important: This will hit the login wall. You need an authenticated session (see the Authentication section below).

Approach 2: API Interception

A more advanced technique intercepts the API calls the browser makes:

async def intercept_api(keyword: str):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (iPhone; CPU iPhone OS 17_0) Safari/604.1",
            is_mobile=True,
        )
        page = await context.new_page()

        captured_data = []

        # Intercept search API responses
        async def handle_response(response):
            if "/proxy/api/search/goods" in response.url:
                try:
                    data = await response.json()
                    if "goods_list" in data:
                        captured_data.extend(data["goods_list"])
                except Exception:
                    pass  # Non-JSON or partial responses are safe to skip

        page.on("response", handle_response)

        url = f"https://mobile.yangkeduo.com/search_result.html?search_key={keyword}"
        await page.goto(url, wait_until="networkidle")

        # Scroll to trigger more API calls
        for _ in range(5):
            await page.evaluate("window.scrollBy(0, 1000)")
            await page.wait_for_timeout(2000)

        await browser.close()
        return captured_data

This captures the raw JSON from Pinduoduo's internal API, which contains richer data than what's visible on the page.
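One practical wrinkle: successive scrolls can trigger overlapping result pages, so the captured list usually contains duplicates. A minimal post-processing step (assuming the `goods_id` key from the sample payload earlier) keeps the first entry per product:

```python
def dedupe_goods(captured: list) -> list:
    """Keep the first captured entry per goods_id; drop repeats and
    malformed items without an id."""
    seen = set()
    unique = []
    for item in captured:
        gid = item.get("goods_id")
        if gid is None or gid in seen:
            continue
        seen.add(gid)
        unique.append(item)
    return unique
```

Run this over `captured_data` before storing results, so paging overlap doesn't inflate your counts.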

Handling Authentication

Since Pinduoduo requires login, here's how to manage sessions:

import json
from pathlib import Path

COOKIE_FILE = "pdd_cookies.json"

async def save_session(context):
    """Save cookies after manual login"""
    cookies = await context.cookies()
    Path(COOKIE_FILE).write_text(json.dumps(cookies))
    print(f"Saved {len(cookies)} cookies")

async def load_session(context):
    """Restore saved session"""
    if Path(COOKIE_FILE).exists():
        cookies = json.loads(Path(COOKIE_FILE).read_text())
        await context.add_cookies(cookies)
        return True
    return False

Workflow:

  1. Run the browser in headed mode
  2. Manually log in with your phone number
  3. Save the session cookies
  4. Reuse cookies for subsequent scraping runs
  5. Re-authenticate when cookies expire
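Step 5 can be automated with a quick expiry check before each run. Playwright stores cookie expiry as a Unix timestamp under `expires` (with -1 for session cookies), so a sketch like this decides whether the saved file is still worth loading:

```python
import json
import time
from pathlib import Path

def session_usable(cookie_file: str = "pdd_cookies.json") -> bool:
    """Return True if the saved cookie jar still contains at least one
    unexpired persistent cookie; otherwise a fresh login is needed."""
    path = Path(cookie_file)
    if not path.exists():
        return False
    cookies = json.loads(path.read_text())
    now = time.time()
    return any(c.get("expires", -1) > now for c in cookies)
```

Call `session_usable()` at startup: if it returns False, launch in headed mode for a manual login and re-save; otherwise load the cookies and scrape directly.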

Approach 3: Try Temu Instead

If Pinduoduo's login wall is a dealbreaker, consider Temu, Pinduoduo's international version:

  • No mandatory login for browsing
  • English interface
  • Similar product catalog (sourced from same suppliers)
  • Standard e-commerce page structure
  • Prices in USD (not direct factory prices)
  • Different product selection than domestic Pinduoduo

Comparison: Pinduoduo vs Other Chinese Platforms

| Feature | Pinduoduo | Yiwugo | 1688 | DHgate |
| --- | --- | --- | --- | --- |
| Login required | Always | No | Some pages | No |
| Anti-bot level | Extreme | Low | Medium | Medium |
| robots.txt | Block all | Permissive | Partial block | Permissive |
| API encryption | Sign + obfuscation | None | Token-based | None |
| Scraping difficulty | 5/5 | 2/5 | 3/5 | 2/5 |
| Data richness | 5/5 | 3/5 | 4/5 | 3/5 |

If you're looking for an easier starting point, Yiwugo.com offers rich wholesale data with minimal anti-bot protection. There's a ready-to-use Yiwugo Scraper on Apify Store that handles everything out of the box.

For DHgate data, check out the [DHgate Scraper](https://apify.com/jungle_intertwining/dhgate-scraper), another tool we built for wholesale product extraction.

Best Practices

  1. Use residential proxies — Datacenter IPs get blocked instantly
  2. Respect rate limits — 2-5 second delays between requests minimum
  3. Rotate user agents — Mix iPhone and Android mobile UAs
  4. Monitor for CAPTCHAs — Implement detection and graceful retry
  5. Cache aggressively — Don't re-scrape data you already have
  6. Handle cookie expiry — Build automatic re-authentication flows
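Practices 2 and 4 are easy to get wrong with fixed sleeps. A sketch of randomized pacing plus an exponential backoff schedule for CAPTCHA retries (the 2-5 second window matches the guideline above; the retry cap is an assumption):

```python
import random
import time

def polite_delay(base: float = 2.0, spread: float = 3.0) -> float:
    """Sleep a randomized base..base+spread seconds (2-5s by default)
    so request timing doesn't form a detectable pattern."""
    delay = base + random.uniform(0.0, spread)
    time.sleep(delay)
    return delay

def backoff_delays(retries: int = 4, base: float = 2.0, cap: float = 60.0) -> list:
    """Exponential backoff schedule to use after a CAPTCHA or block:
    double the wait each retry, capped at `cap` seconds."""
    return [min(cap, base * (2 ** i)) for i in range(retries)]
```

Calling `polite_delay()` between page fetches and walking `backoff_delays()` after each failed retry covers the two most common ban triggers: fixed intervals and instant retries.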

Legal Considerations

Pinduoduo's robots.txt explicitly disallows all crawling. Before scraping:

  • Use data only for personal research or internal analysis
  • Never scrape personal user data (reviews with usernames, order info)
  • Implement reasonable rate limiting to avoid server impact
  • Check local regulations regarding web scraping

What's Next

We're actively developing a Pinduoduo scraper tool for Apify Store. The main challenge is the mandatory login — we're exploring cookie-based session management and Temu as an alternative data source.

In the meantime, if you need Chinese wholesale product data, the Yiwugo and DHgate scrapers mentioned above are ready to use today.


Have experience scraping Pinduoduo or similar platforms? Drop a comment — I'd love to hear what approaches worked for you.

