DEV Community

wfgsss

How to Scrape Pinduoduo (拼多多) for Product Data: A Complete Guide

Why Scrape Pinduoduo?

Pinduoduo (拼多多) is China's second-largest e-commerce platform with over 900 million active buyers. Unlike Alibaba or JD.com, Pinduoduo focuses on ultra-low-price group buying — making it a goldmine for:

  • Dropshippers looking for the cheapest source prices
  • Market researchers tracking consumer trends in China
  • Wholesale buyers comparing prices across platforms
  • Data analysts studying pricing strategies and sales patterns

But scraping Pinduoduo is significantly harder than scraping most e-commerce sites. This guide covers everything you need to know.

Platform Architecture

Pinduoduo operates across multiple surfaces:

| Surface | Domain | Data Access |
| --- | --- | --- |
| PC Website | pinduoduo.com | Corporate pages only, no product data |
| Mobile H5 | mobile.yangkeduo.com | Requires login |
| Mini Program | WeChat embedded | Not scrapable |
| Native App | iOS/Android | Encrypted traffic |
| Temu (overseas) | temu.com | More accessible |

The mobile H5 site (mobile.yangkeduo.com) is the primary target for web scraping, but it comes with serious challenges.

Key Technical Challenges

1. Mandatory Login Wall

Unlike most e-commerce platforms, Pinduoduo requires phone number + SMS verification for every page:

Search page → Redirects to login
Product detail → Redirects to login
Category page → Redirects to login

There's no guest browsing mode, no email registration, and no third-party OAuth. You need a Chinese phone number.

2. Aggressive Anti-Bot System

Pinduoduo employs multiple layers of protection:

  • API signature encryption: All API calls require a sign parameter generated by obfuscated JavaScript
  • Browser fingerprinting: Canvas fingerprint, WebGL, and navigator property checks
  • Native bridge detection: Checks if running inside the Pinduoduo app via pinbridge
  • Rate limiting: client-side risk interceptors monitor request patterns and flag bursty traffic
  • Cookie rotation: Short-lived session cookies that expire frequently
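The `sign` parameter deserves a closer look. Pinduoduo's actual algorithm is buried in obfuscated JavaScript and is not public, but the general request-signing pattern looks something like this (the salt and hashing scheme here are purely illustrative assumptions, not Pinduoduo's real algorithm):

```python
import hashlib

def sign_params(params: dict, salt: str) -> str:
    """Illustrative signature: MD5 over sorted key=value pairs plus a salt.

    Pinduoduo's real `sign` is produced by obfuscated JS and is NOT this
    algorithm; this only shows the general shape of API request signing.
    """
    canonical = "&".join(f"{k}={params[k]}" for k in sorted(params))
    return hashlib.md5((canonical + salt).encode("utf-8")).hexdigest()

# Every API call carries its parameters plus the derived sign value
payload = {"keyword": "蓝牙耳机", "page": 1}
payload["sign"] = sign_params(payload, salt="demo-salt")
```

Because the server recomputes the signature over the same parameters, any request whose `sign` doesn't match is rejected — which is why replaying captured API calls with modified parameters fails.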

3. Strict robots.txt

User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /
Allow: /poros/h5

Pinduoduo blocks all crawlers, including Googlebot (except one specific path). This is the strictest robots.txt among major Chinese e-commerce platforms.
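If you want a pre-flight check in code, Python's standard-library `robotparser` can evaluate these rules without a network round-trip (the rules string below mirrors the wildcard block shown above):

```python
from urllib.robotparser import RobotFileParser

# The blanket rule from Pinduoduo's robots.txt, parsed directly from text
robots_txt = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Any ordinary crawler is disallowed from every path
allowed = parser.can_fetch("MyScraper/1.0", "https://mobile.yangkeduo.com/goods.html")
print(allowed)  # False
```

Note that Python's `robotparser` applies rules in file order (first match wins), so it won't reproduce Google's longest-match `Allow` semantics for the `/poros/h5` carve-out.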

Available Data Fields

When you do get access, here's what you can extract:

{
  "goods_id": 123456789,
  "title": "iPhone 15 手机壳 透明防摔",
  "price": 3.99,
  "original_price": 15.99,
  "sales_count": "10万+",
  "images": [
    "https://img.pddpic.com/xxx.jpeg"
  ],
  "shop_name": "数码配件旗舰店",
  "shop_rating": 4.8,
  "category": "手机配件",
  "reviews_count": 52000,
  "group_price": 2.99,
  "min_order": 1
}

The group_price field is unique to Pinduoduo — it's the discounted price when buying as part of a group.
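Note that fields like `sales_count` arrive as display strings ("10万+") rather than numbers. A small normalization helper is worth writing up front; this sketch uses the field names from the sample payload above (万 = 10,000, and the trailing "+" marks a lower bound):

```python
import re

def parse_sales_count(raw: str) -> int:
    """Convert Pinduoduo-style sales strings ('10万+', '5000+') to integers.

    万 means 10,000; the trailing '+' marks a lower bound, which we drop.
    """
    match = re.match(r"([\d.]+)(万?)\+?$", raw.strip())
    if not match:
        raise ValueError(f"unrecognized sales count: {raw!r}")
    value = float(match.group(1))
    if match.group(2) == "万":
        value *= 10_000
    return int(value)

def group_discount(price: float, group_price: float) -> float:
    """Percentage saved by buying at the group price instead of list price."""
    return round((price - group_price) / price * 100, 1)
```

For the sample item above, `parse_sales_count("10万+")` yields 100000 and `group_discount(3.99, 2.99)` shows a roughly 25% group-buy saving.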

Approach 1: Playwright Browser Automation (Recommended)

The most reliable method uses a real browser to bypass JavaScript challenges:

import asyncio
from playwright.async_api import async_playwright

async def scrape_pinduoduo(keyword: str, max_items: int = 20):
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=False  # Headed mode recommended
        )

        context = await browser.new_context(
            user_agent=(
                "Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) "
                "AppleWebKit/605.1.15 (KHTML, like Gecko) "
                "Version/17.0 Mobile/15E148 Safari/604.1"
            ),
            viewport={"width": 390, "height": 844},
            device_scale_factor=3,
            is_mobile=True,
        )

        page = await context.new_page()

        # Block unnecessary resources for speed
        await page.route(
            "**/*.{png,jpg,jpeg,gif,svg,woff,woff2}",
            lambda route: route.abort()
        )

        # Navigate to search
        url = f"https://mobile.yangkeduo.com/search_result.html?search_key={keyword}"
        await page.goto(url, wait_until="networkidle")

        # Check if redirected to login
        if "login" in page.url:
            print("Login required - need authenticated session")
            return []

        # Extract product cards
        products = await page.evaluate("""
            () => {
                const cards = document.querySelectorAll('[data-goods-id]');
                return Array.from(cards).map(card => ({
                    goods_id: card.dataset.goodsId,
                    title: card.querySelector('.title')?.textContent?.trim(),
                    price: card.querySelector('.price')?.textContent?.trim(),
                    sales: card.querySelector('.sales')?.textContent?.trim(),
                }));
            }
        """)

        await browser.close()
        return products[:max_items]

# Run
results = asyncio.run(scrape_pinduoduo("蓝牙耳机"))
for item in results:
    print(f"{item['title']} - {item['price']} ({item['sales']})")

Important: This will hit the login wall. You need an authenticated session (see the Authentication section below).

Approach 2: API Interception

A more advanced technique intercepts the API calls the browser makes:

async def intercept_api(keyword: str):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (iPhone; CPU iPhone OS 17_0) Safari/604.1",
            is_mobile=True,
        )
        page = await context.new_page()

        captured_data = []

        # Intercept search API responses
        async def handle_response(response):
            if "/proxy/api/search/goods" in response.url:
                try:
                    data = await response.json()
                    if "goods_list" in data:
                        captured_data.extend(data["goods_list"])
                except Exception:
                    pass  # Non-JSON or partial responses are safe to skip

        page.on("response", handle_response)

        url = f"https://mobile.yangkeduo.com/search_result.html?search_key={keyword}"
        await page.goto(url, wait_until="networkidle")

        # Scroll to trigger more API calls
        for _ in range(5):
            await page.evaluate("window.scrollBy(0, 1000)")
            await page.wait_for_timeout(2000)

        await browser.close()
        return captured_data

This captures the raw JSON from Pinduoduo's internal API, which contains richer data than what's visible on the page.
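One practical wrinkle: successive scrolls can trigger overlapping result pages, so the captured list usually contains duplicates. A minimal post-processing step (assuming the `goods_id` key from the sample payload earlier) keeps the first entry per product:

```python
def dedupe_goods(captured: list) -> list:
    """Keep the first captured entry per goods_id; drop repeats and
    malformed items without an id."""
    seen = set()
    unique = []
    for item in captured:
        gid = item.get("goods_id")
        if gid is None or gid in seen:
            continue
        seen.add(gid)
        unique.append(item)
    return unique
```

Run this over `captured_data` before storing results, so paging overlap doesn't inflate your counts.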

Handling Authentication

Since Pinduoduo requires login, here's how to manage sessions:

import json
from pathlib import Path

COOKIE_FILE = "pdd_cookies.json"

async def save_session(context):
    """Save cookies after manual login"""
    cookies = await context.cookies()
    Path(COOKIE_FILE).write_text(json.dumps(cookies))
    print(f"Saved {len(cookies)} cookies")

async def load_session(context):
    """Restore saved session"""
    if Path(COOKIE_FILE).exists():
        cookies = json.loads(Path(COOKIE_FILE).read_text())
        await context.add_cookies(cookies)
        return True
    return False

Workflow:

  1. Run the browser in headed mode
  2. Manually log in with your phone number
  3. Save the session cookies
  4. Reuse cookies for subsequent scraping runs
  5. Re-authenticate when cookies expire
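Step 5 can be automated with a quick expiry check before each run. Playwright stores cookie expiry as a Unix timestamp under `expires` (with -1 for session cookies), so a sketch like this decides whether the saved file is still worth loading:

```python
import json
import time
from pathlib import Path

def session_usable(cookie_file: str = "pdd_cookies.json") -> bool:
    """Return True if the saved cookie jar still contains at least one
    unexpired persistent cookie; otherwise a fresh login is needed."""
    path = Path(cookie_file)
    if not path.exists():
        return False
    cookies = json.loads(path.read_text())
    now = time.time()
    return any(c.get("expires", -1) > now for c in cookies)
```

Call `session_usable()` at startup: if it returns False, launch in headed mode for a manual login and re-save; otherwise load the cookies and scrape directly.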

Approach 3: Try Temu Instead

If Pinduoduo's login wall is a dealbreaker, consider Temu, Pinduoduo's international version:

  • No mandatory login for browsing
  • English interface
  • Similar product catalog (sourced from same suppliers)
  • Standard e-commerce page structure
  • Prices in USD (not direct factory prices)
  • Different product selection than domestic Pinduoduo

Comparison: Pinduoduo vs Other Chinese Platforms

| Feature | Pinduoduo | Yiwugo | 1688 | DHgate |
| --- | --- | --- | --- | --- |
| Login required | Always | No | Some pages | No |
| Anti-bot level | Extreme | Low | Medium | Medium |
| robots.txt | Block all | Permissive | Partial block | Permissive |
| API encryption | Sign + obfuscation | None | Token-based | None |
| Scraping difficulty | 5/5 | 2/5 | 3/5 | 2/5 |
| Data richness | 5/5 | 3/5 | 4/5 | 3/5 |

If you're looking for an easier starting point, Yiwugo.com offers rich wholesale data with minimal anti-bot protection. There's a ready-to-use Yiwugo Scraper on Apify Store that handles everything out of the box.

For DHgate data, check out the [DHgate Scraper](https://apify.com/jungle_intertwining/dhgate-scraper), another tool we built for wholesale product extraction.

Best Practices

  1. Use residential proxies — Datacenter IPs get blocked instantly
  2. Respect rate limits — 2-5 second delays between requests minimum
  3. Rotate user agents — Mix iPhone and Android mobile UAs
  4. Monitor for CAPTCHAs — Implement detection and graceful retry
  5. Cache aggressively — Don't re-scrape data you already have
  6. Handle cookie expiry — Build automatic re-authentication flows
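Practices 2 and 4 are easy to get wrong with fixed sleeps. A sketch of randomized pacing plus an exponential backoff schedule for CAPTCHA retries (the 2-5 second window matches the guideline above; the retry cap is an assumption):

```python
import random
import time

def polite_delay(base: float = 2.0, spread: float = 3.0) -> float:
    """Sleep a randomized base..base+spread seconds (2-5s by default)
    so request timing doesn't form a detectable pattern."""
    delay = base + random.uniform(0.0, spread)
    time.sleep(delay)
    return delay

def backoff_delays(retries: int = 4, base: float = 2.0, cap: float = 60.0) -> list:
    """Exponential backoff schedule to use after a CAPTCHA or block:
    double the wait each retry, capped at `cap` seconds."""
    return [min(cap, base * (2 ** i)) for i in range(retries)]
```

Calling `polite_delay()` between page fetches and walking `backoff_delays()` after each failed retry covers the two most common ban triggers: fixed intervals and instant retries.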

Legal Considerations

Pinduoduo's robots.txt explicitly disallows all crawling. Before scraping:

  • Use data only for personal research or internal analysis
  • Never scrape personal user data (reviews with usernames, order info)
  • Implement reasonable rate limiting to avoid server impact
  • Check local regulations regarding web scraping

What's Next

We're actively developing a Pinduoduo scraper tool for Apify Store. The main challenge is the mandatory login — we're exploring cookie-based session management and Temu as an alternative data source.

In the meantime, if you need Chinese wholesale product data, the Yiwugo and DHgate scrapers mentioned above are ready to use today.


Have experience scraping Pinduoduo or similar platforms? Drop a comment — I'd love to hear what approaches worked for you.

