Why Scrape Pinduoduo?
Pinduoduo (拼多多) is China's second-largest e-commerce platform with over 900 million active buyers. Unlike Alibaba or JD.com, Pinduoduo focuses on ultra-low-price group buying — making it a goldmine for:
- Dropshippers looking for the cheapest source prices
- Market researchers tracking consumer trends in China
- Wholesale buyers comparing prices across platforms
- Data analysts studying pricing strategies and sales patterns
But scraping Pinduoduo is significantly harder than scraping most e-commerce sites. This guide covers everything you need to know.
Platform Architecture
Pinduoduo operates across multiple surfaces:
| Surface | Domain | Data Access |
|---|---|---|
| PC Website | pinduoduo.com | Corporate pages only, no product data |
| Mobile H5 | mobile.yangkeduo.com | Requires login |
| Mini Program | WeChat embedded | Not scrapable |
| Native App | iOS/Android | Encrypted traffic |
| Temu (overseas) | temu.com | More accessible |
The mobile H5 site (mobile.yangkeduo.com) is the primary target for web scraping, but it comes with serious challenges.
Key Technical Challenges
1. Mandatory Login Wall
Unlike most e-commerce platforms, Pinduoduo requires phone number + SMS verification before showing any page:
Search page → Redirects to login
Product detail → Redirects to login
Category page → Redirects to login
There's no guest browsing mode, no email registration, and no third-party OAuth. You need a Chinese phone number.
2. Aggressive Anti-Bot System
Pinduoduo employs multiple layers of protection:
- API signature encryption: All API calls require a sign parameter generated by obfuscated JavaScript
- Browser fingerprinting: Canvas fingerprint, WebGL, and navigator property checks
- Native bridge detection: Checks if running inside the Pinduoduo app via pinbridge
- Rate limiting: axios-risk-interceptors monitor request patterns
- Cookie rotation: Short-lived session cookies that expire frequently
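The sign parameter deserves a closer look. Its real algorithm lives in obfuscated JavaScript and changes between versions, so there is no stable public spec — but schemes like this commonly follow one pattern: a digest over the sorted request parameters plus a secret salt. A purely illustrative sketch (the salt, parameter names, and choice of MD5 here are hypothetical, not Pinduoduo's actual scheme):

```python
import hashlib

def make_sign(params: dict, salt: str = "HYPOTHETICAL_SALT") -> str:
    """Illustrative only: digest of sorted key=value pairs plus a salt.
    Pinduoduo's real algorithm is obfuscated and differs from this sketch."""
    canonical = "&".join(f"{k}={params[k]}" for k in sorted(params))
    return hashlib.md5((canonical + salt).encode("utf-8")).hexdigest()

sign = make_sign({"keyword": "耳机", "page": 1})
```

The practical takeaway: because the parameter set and salt are hidden in minified JS, replicating sign outside a browser means reverse-engineering that bundle on every update — which is exactly why the browser-automation approaches below are more sustainable.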
3. Strict robots.txt
User-agent: *
Disallow: /
User-agent: Googlebot
Disallow: /
Allow: /poros/h5
Pinduoduo blocks all crawlers, including Googlebot (except one specific path). This is the strictest robots.txt among major Chinese e-commerce platforms.
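You can confirm the blanket ban programmatically with Python's standard-library robots.txt parser, here fed the file's contents inline so the check is self-contained:

```python
from urllib import robotparser

# The robots.txt shown above, inlined
ROBOTS_TXT = """User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /
Allow: /poros/h5
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

allowed = rp.can_fetch("MyScraper", "https://mobile.yangkeduo.com/goods.html")
print(allowed)  # False: every path is disallowed for generic crawlers
```

One subtlety: Python's parser applies rules in file order, so even Googlebot's Allow: /poros/h5 line is shadowed by the preceding Disallow: / here — Google's own longest-match semantics would allow that path.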
Available Data Fields
When you do get access, here's what you can extract:
{
  "goods_id": 123456789,
  "title": "iPhone 15 手机壳 透明防摔",
  "price": 3.99,
  "original_price": 15.99,
  "sales_count": "10万+",
  "images": [
    "https://img.pddpic.com/xxx.jpeg"
  ],
  "shop_name": "数码配件旗舰店",
  "shop_rating": 4.8,
  "category": "手机配件",
  "reviews_count": 52000,
  "group_price": 2.99,
  "min_order": 1
}
The group_price field is unique to Pinduoduo — it's the discounted price when buying as part of a group.
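Note that numeric-looking fields such as sales_count arrive as display strings: "10万+" means 100,000 or more, since 万 is the Chinese unit for 10,000. A small normalizer (the function name is our own) converts these strings to lower-bound integers for analysis:

```python
import re

def parse_sales_count(raw: str) -> int:
    """Convert a sales string like '10万+' to a lower-bound integer."""
    m = re.match(r"(\d+(?:\.\d+)?)(万)?", raw)
    if not m:
        return 0
    value = float(m.group(1))
    if m.group(2):  # 万 = 10,000
        value *= 10_000
    return int(value)

print(parse_sales_count("10万+"))  # 100000
print(parse_sales_count("3500"))   # 3500
```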
Approach 1: Playwright Browser Automation (Recommended)
The most reliable method uses a real browser to bypass JavaScript challenges:
import asyncio
from playwright.async_api import async_playwright

async def scrape_pinduoduo(keyword: str, max_items: int = 20):
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=False  # Headed mode recommended
        )
        context = await browser.new_context(
            user_agent=(
                "Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) "
                "AppleWebKit/605.1.15 (KHTML, like Gecko) "
                "Version/17.0 Mobile/15E148 Safari/604.1"
            ),
            viewport={"width": 390, "height": 844},
            device_scale_factor=3,
            is_mobile=True,
        )
        page = await context.new_page()

        # Block unnecessary resources for speed
        await page.route(
            "**/*.{png,jpg,jpeg,gif,svg,woff,woff2}",
            lambda route: route.abort()
        )

        # Navigate to search
        url = f"https://mobile.yangkeduo.com/search_result.html?search_key={keyword}"
        await page.goto(url, wait_until="networkidle")

        # Check if redirected to login
        if "login" in page.url:
            print("Login required - need authenticated session")
            await browser.close()
            return []

        # Extract product cards
        products = await page.evaluate("""
            () => {
                const cards = document.querySelectorAll('[data-goods-id]');
                return Array.from(cards).map(card => ({
                    goods_id: card.dataset.goodsId,
                    title: card.querySelector('.title')?.textContent?.trim(),
                    price: card.querySelector('.price')?.textContent?.trim(),
                    sales: card.querySelector('.sales')?.textContent?.trim(),
                }));
            }
        """)

        await browser.close()
        return products[:max_items]

# Run
results = asyncio.run(scrape_pinduoduo("蓝牙耳机"))
for item in results:
    print(f"{item['title']} - {item['price']} ({item['sales']})")
Important: This will hit the login wall. You need an authenticated session (see the Authentication section below).
Approach 2: API Interception
A more advanced technique intercepts the API calls the browser makes:
async def intercept_api(keyword: str):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (iPhone; CPU iPhone OS 17_0) Safari/604.1",
            is_mobile=True,
        )
        page = await context.new_page()
        captured_data = []

        # Intercept search API responses
        async def handle_response(response):
            if "/proxy/api/search/goods" in response.url:
                try:
                    data = await response.json()
                    if "goods_list" in data:
                        captured_data.extend(data["goods_list"])
                except Exception:
                    pass  # non-JSON or empty response body

        page.on("response", handle_response)

        url = f"https://mobile.yangkeduo.com/search_result.html?search_key={keyword}"
        await page.goto(url, wait_until="networkidle")

        # Scroll to trigger more API calls
        for _ in range(5):
            await page.evaluate("window.scrollBy(0, 1000)")
            await page.wait_for_timeout(2000)

        await browser.close()
        return captured_data
This captures the raw JSON from Pinduoduo's internal API, which contains richer data than what's visible on the page.
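Because scrolling can re-trigger the same result pages, the captured list usually contains duplicates. A small post-processing pass keyed on goods_id (the identifier from the field list earlier) keeps only the first occurrence of each item:

```python
def dedupe_goods(items: list[dict]) -> list[dict]:
    """Keep the first occurrence of each goods_id, preserving order."""
    seen: set = set()
    unique = []
    for item in items:
        gid = item.get("goods_id")
        if gid is not None and gid not in seen:
            seen.add(gid)
            unique.append(item)
    return unique
```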
Handling Authentication
Since Pinduoduo requires login, here's how to manage sessions:
import json
from pathlib import Path

COOKIE_FILE = "pdd_cookies.json"

async def save_session(context):
    """Save cookies after manual login"""
    cookies = await context.cookies()
    Path(COOKIE_FILE).write_text(json.dumps(cookies))
    print(f"Saved {len(cookies)} cookies")

async def load_session(context):
    """Restore saved session"""
    if Path(COOKIE_FILE).exists():
        cookies = json.loads(Path(COOKIE_FILE).read_text())
        await context.add_cookies(cookies)
        return True
    return False
Workflow:
- Run the browser in headed mode
- Manually log in with your phone number
- Save the session cookies
- Reuse cookies for subsequent scraping runs
- Re-authenticate when cookies expire
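Playwright cookies carry an expires field (a Unix timestamp, or -1 for session-only cookies), so the re-authentication step can be proactive: check the saved jar before each run instead of discovering mid-scrape that it expired. A minimal check, assuming the JSON layout written by the save_session helper above:

```python
import json
import time
from pathlib import Path

def session_still_valid(cookie_file: str = "pdd_cookies.json") -> bool:
    """Return True if the saved cookie jar exists and no cookie has expired."""
    path = Path(cookie_file)
    if not path.exists():
        return False
    cookies = json.loads(path.read_text())
    now = time.time()
    # expires == -1 marks a session cookie with no fixed expiry
    return all(c.get("expires", -1) == -1 or c["expires"] > now
               for c in cookies)
```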
Approach 3: Try Temu Instead
If Pinduoduo's login wall is a dealbreaker, consider Temu — Pinduoduo's international version:
Advantages:
- No mandatory login for browsing
- English interface
- Similar product catalog (sourced from the same suppliers)
- Standard e-commerce page structure

Trade-offs:
- Prices in USD (retail markup, not direct factory prices)
- Different product selection than domestic Pinduoduo
Comparison: Pinduoduo vs Other Chinese Platforms
| Feature | Pinduoduo | Yiwugo | 1688 | DHgate |
|---|---|---|---|---|
| Login required | Always | No | Some pages | No |
| Anti-bot level | Extreme | Low | Medium | Medium |
| robots.txt | Block all | Permissive | Partial block | Permissive |
| API encryption | Sign + obfuscation | None | Token-based | None |
| Scraping difficulty | 5/5 | 2/5 | 3/5 | 2/5 |
| Data richness | 5/5 | 3/5 | 4/5 | 3/5 |
If you're looking for an easier starting point, Yiwugo.com offers rich wholesale data with minimal anti-bot protection. There's a ready-to-use Yiwugo Scraper on Apify Store that handles everything out of the box.
For DHgate data, check out the [DHgate Scraper](https://apify.com/jungle_intertwining/dhgate-scraper) — another tool we built for wholesale product extraction.
Best Practices
- Use residential proxies — Datacenter IPs get blocked instantly
- Respect rate limits — 2-5 second delays between requests minimum
- Rotate user agents — Mix iPhone and Android mobile UAs
- Monitor for CAPTCHAs — Implement detection and graceful retry
- Cache aggressively — Don't re-scrape data you already have
- Handle cookie expiry — Build automatic re-authentication flows
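The rate-limit guideline above is easy to encode as a helper that sleeps a random interval between requests — randomized rather than fixed, so the request cadence doesn't look machine-regular:

```python
import random
import time

def polite_delay(min_s: float = 2.0, max_s: float = 5.0) -> float:
    """Sleep a random interval in [min_s, max_s] seconds and return it."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay
```

Call polite_delay() between every page navigation or API request; widen the bounds further if you start seeing CAPTCHAs.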
Legal Considerations
Pinduoduo's robots.txt explicitly disallows all crawling. Before scraping:
- Use data only for personal research or internal analysis
- Never scrape personal user data (reviews with usernames, order info)
- Implement reasonable rate limiting to avoid server impact
- Check local regulations regarding web scraping
What's Next
We're actively developing a Pinduoduo scraper tool for Apify Store. The main challenge is the mandatory login — we're exploring cookie-based session management and Temu as an alternative data source.
In the meantime, if you need Chinese wholesale product data:
- Yiwugo Scraper — Yiwu wholesale market data, no login needed
- DHgate Scraper — Cross-border wholesale data
- Made-in-China Scraper — Extract B2B product data, supplier info, and MOQ from Made-in-China.com
- GitHub Examples — Sample code and documentation
Have experience scraping Pinduoduo or similar platforms? Drop a comment — I'd love to hear what approaches worked for you.