If you're training Chinese-language models — or multilingual models that need real Chinese coverage, not just translated English — the data problem is the bottleneck. Common Crawl gives you the open web. HuggingFace gives you the curated stuff. But the linguistic patterns that matter most for cultural alignment — slang, memes, code-mixed English-Chinese, regional variations, real-time discourse — those live in places Common Crawl barely touches.
Three platforms that matter most for Chinese training corpora in 2026:
- Weibo (微博) — 580M+ MAU, microblogging, real-time discourse, similar role to X/Twitter
- Bilibili (哔哩哔哩) — 300M+ MAU, video platform, comments + danmaku give you code-mixed natural language at volume
- Xiaohongshu / RedNote (小红书) — 300M+ MAU, lifestyle posts with longer-form content, female-skewed register
This post walks through building a multi-source pipeline that pulls clean, structured data from all three, normalizes it across platforms, and ships it into your training datasets. Code, schema, and economics included.
A note on legal posture: this entire pipeline accesses only publicly visible data — no auth bypass, no captcha solving, no scraping behind login. That matches the standard most AI training teams operate under in 2026, post-NYT-vs-OpenAI. Always consult your legal team for your specific use case and jurisdiction.
Why these three (and not, say, Douyin or Zhihu)
Each platform contributes a different linguistic register:
Weibo posts are short, high-frequency, conversational. Best for:
- Everyday Mandarin patterns
- Trending slang and memes (热搜 reflects what's actually viral right now)
- Public sentiment on news and policy
- Brand-mention contexts
Bilibili comments and danmaku are unique:
- Heavy code-mixing English ↔ Chinese (gaming, tech, anime communities)
- Real-time chat-style language
- Subculture vocabulary (gaming, fandom, two-dimensional culture / 二次元)
- Longer thread discussions on long-form videos
RedNote posts lean longer and more curated:
- Beauty / lifestyle / travel / food vocabulary
- Product-attribute language (skincare ingredients, fashion descriptors)
- Female-skewed register and topics
- Aspirational / descriptive framing
Douyin (Chinese TikTok) and Kuaishou are predominantly video — text data is sparse. Zhihu (Q&A) is great for long-form writing but dominated by a single-author voice. The triad above gives you the best balance of volume, diversity, and accessibility.
Pipeline architecture
The cleanest architecture for an AI training data pipeline:
[Weibo Scraper] →
[Bilibili Scraper] → [Normalize] → [Dedup + Filter] → [JSONL]
[RedNote Scraper] →
Each scraper outputs platform-native JSON. A normalization layer flattens to a common schema. Deduplication on text hash + filtering by min-length / language detection ships clean data into your training format.
Below: I use Apify-hosted scrapers for the extraction layer (they handle anti-bot, rate limiting, and schema stability so you don't have to). The normalization + dedup is your code — straight Python.
Step 1 — Pulling from Weibo
For training data, the high-value combination is:
- Hot search topics (real-time trending — what people are talking about right now)
- Posts under those topics (organic conversation about real issues)
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

def collect_weibo_corpus(target_topics: int = 50, posts_per_topic: int = 100):
    # 1a. Pull current trending topics
    topics_run = client.actor("zhorex/weibo-scraper").call(run_input={
        "mode": "hot_search",
        "maxResults": target_topics,
    })
    topics = list(client.dataset(topics_run["defaultDatasetId"]).iterate_items())

    # 1b. For each topic, pull underlying posts
    corpus = []
    for topic in topics:
        posts_run = client.actor("zhorex/weibo-scraper").call(run_input={
            "mode": "search",
            "searchQuery": topic["title"],
            "maxResults": posts_per_topic,
        })
        for post in client.dataset(posts_run["defaultDatasetId"]).iterate_items():
            corpus.append({
                "platform": "weibo",
                "topic": topic["title"],
                "category": topic.get("category"),
                "text": post.get("text", ""),
                "author": post.get("authorName"),
                "engagement": (post.get("attitudesCount", 0) +
                               post.get("commentsCount", 0) +
                               post.get("repostsCount", 0)),
                "post_url": post.get("postUrl"),
                "scraped_at": post.get("scrapedAt"),
            })
    return corpus
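One call gives you a snapshot of the current conversation; hold onto the result for the normalization step later (the counts here just mirror the defaults above):

weibo_data = collect_weibo_corpus(target_topics=50, posts_per_topic=100)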
Volume math: 50 topics × 100 posts = 5,000 items per snapshot. At $0.005/item that's $25 per pull. Run daily for a year ≈ $9,125.
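If you want to budget other cadences before committing, the same arithmetic generalizes to a small helper (illustrative only, using the $0.005/result price quoted throughout this post):

def estimate_annual_cost(items_per_pull: int,
                         pulls_per_day: float = 1.0,
                         price_per_item: float = 0.005) -> float:
    # 5,000 items/pull x 1 pull/day x $0.005/item x 365 days ≈ $9,125/year
    return items_per_pull * pulls_per_day * price_per_item * 365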
Step 2 — Pulling from Bilibili
Bilibili gives you something the others don't: comments on long-form videos. That's where heavy code-mixing happens (tech tutorials, gaming streams, study-with-me content, drama analysis). For training data, comments are higher-value than video metadata.
def collect_bilibili_comments(category: str = "knowledge",
                              videos: int = 50,
                              comments_per: int = 100):
    # Get popular videos in the category
    popular_run = client.actor("zhorex/bilibili-scraper").call(run_input={
        "mode": "popular",
        "category": category,
        "maxResults": videos,
    })
    items = list(client.dataset(popular_run["defaultDatasetId"]).iterate_items())
    bvids = [v["bvid"] for v in items if v.get("bvid")]

    # Pull comments on each
    corpus = []
    for bvid in bvids:
        comments_run = client.actor("zhorex/bilibili-scraper").call(run_input={
            "mode": "video_comments",
            "videoUrls": [f"https://www.bilibili.com/video/{bvid}"],
            "maxComments": comments_per,
            "sortComments": "hot",
        })
        for c in client.dataset(comments_run["defaultDatasetId"]).iterate_items():
            if c.get("type") != "comment":
                continue
            corpus.append({
                "platform": "bilibili",
                "topic": category,  # keeps the cross-platform schema (Step 4) consistent
                "category": category,
                "text": c.get("text", ""),
                "author": c.get("authorName"),
                "engagement": c.get("likeCount", 0),
                "video_bvid": bvid,
                "scraped_at": c.get("scrapedAt"),
            })
    return corpus
Note: Bilibili throttles comment depth on cloud IPs — expect only the top ~3 comments per video without residential proxies. For training-data scale you don't need every comment, just enough diversity, so the top-N approach is fine and cheaper.
Categories worth pulling for diverse coverage: knowledge, tech, game, life, food, fashion, cars, entertainment.
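Looping the collector over those categories is all it takes to spread coverage (the counts here just mirror the defaults above; scale them to your budget):

categories = ["knowledge", "tech", "game", "life", "food",
              "fashion", "cars", "entertainment"]
bilibili_data = []
for cat in categories:
    bilibili_data.extend(
        collect_bilibili_comments(category=cat, videos=50, comments_per=100)
    )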
Step 3 — Pulling from RedNote
RedNote gives you longer, more curated content — good for training models on aspirational and descriptive Chinese. The seed-query approach lets you control topical distribution, important for avoiding bias toward whatever's trending the day you scrape.
def collect_rednote_corpus(seed_queries: list, posts_per_query: int = 50):
    corpus = []
    for query in seed_queries:
        run = client.actor("zhorex/rednote-xiaohongshu-scraper").call(run_input={
            "mode": "search",
            "searchQuery": query,
            "maxResults": posts_per_query,
        })
        for post in client.dataset(run["defaultDatasetId"]).iterate_items():
            corpus.append({
                "platform": "rednote",
                "topic": query,
                "text": post.get("title", ""),
                "author": (post.get("author") or {}).get("nickname"),
                "engagement": post.get("likes", 0),
                "post_url": post.get("postUrl"),
                "scraped_at": post.get("scrapedAt"),
            })
    return corpus

# Diverse seed queries spread coverage across topics
seeds = [
    "护肤心得",  # skincare experience
    "穿搭",      # outfits
    "美食推荐",  # food recommendations
    "旅行攻略",  # travel guides
    "健身打卡",  # fitness check-in
    "读书笔记",  # reading notes
    "育儿日记",  # parenting diary
    "职场感悟",  # work reflections
]
rednote_data = collect_rednote_corpus(seeds, posts_per_query=100)
For richer body content per post (beyond title), pivot to mode: post_details with the post URLs you want to deep-dive on.
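A sketch of that pivot, reusing the post_url values already collected — note that postUrls as the input field name is my assumption here, so check the Actor's input schema before wiring it in:

def enrich_rednote_posts(posts: list, limit: int = 50):
    # Re-fetch full bodies for posts we already have URLs for
    urls = [p["post_url"] for p in posts if p.get("post_url")][:limit]
    run = client.actor("zhorex/rednote-xiaohongshu-scraper").call(run_input={
        "mode": "post_details",
        "postUrls": urls,  # assumed field name -- verify against the Actor's input schema
    })
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())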
Step 4 — Normalization and dedup
All three scrapers produce platform-specific schemas; the per-step code above already brings them to a common shape:
{
    "platform": "weibo" | "bilibili" | "rednote",
    "topic": str,
    "text": str,
    "author": str,
    "engagement": int,
    "scraped_at": ISO8601,
}
Enough to ship into a JSONL training format. For higher quality, layer in filtering:
import hashlib

def filter_corpus(corpus, min_chars: int = 10, max_chars: int = 5000):
    seen = set()
    out = []
    for item in corpus:
        text = (item.get("text") or "").strip()
        if not (min_chars <= len(text) <= max_chars):
            continue
        h = hashlib.md5(text.encode("utf-8")).hexdigest()
        if h in seen:
            continue
        seen.add(h)
        out.append(item)
    return out
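From there, shipping the JSONL mentioned above is a few lines (weibo_data and bilibili_data being the collector outputs from Steps 1-2, rednote_data from Step 3; the output filename is arbitrary):

import json

def write_jsonl(corpus, path: str):
    # One JSON object per line; keep Chinese characters unescaped
    with open(path, "w", encoding="utf-8") as f:
        for item in corpus:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")

combined = filter_corpus(weibo_data + bilibili_data + rednote_data)
write_jsonl(combined, "zh_social_corpus.jsonl")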
For pretraining-grade quality, also add fastText / langdetect to filter non-Chinese content, and a profanity / PII pass appropriate to your training context.
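If you don't want to pull in a model just to prototype, a character-ratio heuristic already catches most non-Chinese items — a rough sketch only, not a substitute for fastText language ID on a real pretraining run, and the 0.3 threshold is an arbitrary starting point that still admits heavy code-mixing:

def looks_chinese(text: str, min_ratio: float = 0.3) -> bool:
    # Share of CJK Unified Ideographs (U+4E00-U+9FFF) among non-whitespace characters
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    cjk = sum(1 for c in chars if "\u4e00" <= c <= "\u9fff")
    return cjk / len(chars) >= min_ratio

chinese_only = [item for item in combined if looks_chinese(item["text"])]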
Economics at training-corpus scale
A reasonable Chinese-language pretraining contribution might be 10M items across platforms:
| Platform | Items | Cost @ $0.005 |
|---|---|---|
| Weibo | 5M | $25,000 |
| Bilibili | 3M | $15,000 |
| RedNote | 2M | $10,000 |
| Total | 10M items | $50,000 |
Apify free tier ($5/month credit) covers ~1,000 items per actor for prototyping.
For comparison, hiring 2 senior engineers to build and maintain DIY Chinese-platform extraction for 6 months: $150K-300K — and you don't even get the data, just the tooling.
For 100M+ items (real pretraining scale), volume pricing or a custom enterprise contract makes sense. See enterprise section below.
When to build vs buy
Build it yourself if:
- You're scraping 100M+ items per month and have a dedicated team
- You need real-time streaming below 1-second latency (this pipeline is batch)
- Your legal team requires you to own the entire data path
Use the hosted scrapers if:
- You're under 50M items per month per platform
- You want time-to-data measured in hours, not months
- You don't want to maintain three platform-specific scrapers as APIs evolve
The actors
- Weibo Scraper — modes: hot_search, search, post_comments, user_posts
- Bilibili Scraper — modes: search, popular, video_detail, video_comments, user_videos
- RedNote (Xiaohongshu) Scraper — six modes covering posts, profiles, comments, video
All three at $0.005/result. Pure HTTP — no browser, no proxy required for moderate volumes.
Enterprise / training-scale
If you're building actual training corpora (not prototyping), DM me on any actor page or open an Issue with subject "Training data inquiry":
- Custom output schemas matched to your training pipeline (Parquet / Arrow / your dialect of JSONL)
- Volume pricing above 1M items/month per platform
- Dedicated proxy infrastructure for sustained throughput
- Schema stability SLA so your training runs don't break mid-epoch
Issues typically get a response within 48 hours.
FAQ
Is this legal? Each Actor accesses only publicly visible data — no auth, no captcha bypass, no login walls. The same data any anonymous browser user can see. Standard ToS-compliant scraping posture as of 2026. Consult your legal team for jurisdiction-specific guidance.
What about rate limits? The hosted Actors handle rate-limit responses with exponential backoff. For 1M+ items/day per platform, talk to me about dedicated infrastructure.
Can I get historical data? The Actors return what's currently public. For longitudinal datasets, schedule them via Apify Schedules at the cadence you need (hourly / daily / weekly) and version-control your dataset snapshots.
Do you offer streaming / real-time? Not currently. The Actors are pull-based. If you need streaming, that's a custom integration.
Other platforms? I also maintain a RedNote Shop Scraper for Xiaohongshu e-commerce listings — useful if your model needs to reason about products, pricing, or commerce vocabulary.
Other relevant work
If you're building Chinese intelligence at scale, the full suite:
- RedNote Scraper — lifestyle social
- RedNote Shop Scraper — Xiaohongshu e-commerce (product metadata, pricing, vendor info)
- Weibo Scraper — microblogging, hot search, sentiment
- Bilibili Scraper — video creator analytics
If this saved you a quarter of dev time, a 30-second review on any of the Actor pages helps a lot. ⭐
Found a bug or have a feature request? Open an Issue — I usually ship fixes within 48 hours.