Weibo is China's Twitter — the platform where Chinese public opinion forms, brand crises break first, and government statements land. 580M+ monthly active users, mostly mainstream demographics. If you're doing China market intelligence, brand monitoring, or PR analytics, Weibo is one of the platforms you can't skip.
The challenge: Weibo's developer API requires a Chinese business license, has severe rate limits, and exposes very limited data. For Western teams, web scraping is the practical option. The interesting twist is Weibo's Sina Visitor System — an auth flow that makes anonymous access possible for some endpoints but not others. Understanding which is which matters for what you can actually scrape.
This article covers the technical landscape (with real Python code) and points to a hosted scraper if you'd rather skip the maintenance.
## What Weibo serves
A Weibo post is structured similarly to a tweet but with longer character limits and more structured engagement signals:
- Post text (140 to 2,000 characters depending on user level)
- Repost chain — Weibo's quote-tweet equivalent, central to virality tracking
- Engagement metrics — `attitudes_count` (likes), `comments_count`, `reposts_count`
- Hashtags and mentions
- Geolocation if disclosed by user
- Author profile — follower count, verification status, verified reason (e.g., "新浪科技 official Weibo")
- Media — images, videos
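Pulled together, a parsed post might look like the sketch below. The engagement field names (`attitudes_count`, `comments_count`, `reposts_count`) are the ones Weibo's AJAX payloads use; the nesting is this article's own illustrative grouping, not a guaranteed schema:

```python
# Illustrative shape only — verify field names against live AJAX responses.
post = {
    "id": "5012345678901234",
    "text": "人工智能最新突破……",        # post body ("latest AI breakthrough")
    "attitudes_count": 1520,             # likes
    "comments_count": 340,
    "reposts_count": 890,
    "user": {
        "screen_name": "新浪科技",
        "verified": True,
        "verified_reason": "新浪科技官方微博",  # "Sina Tech official Weibo"
        "followers_count": 12_000_000,
    },
    "geo": None,                         # present only if the user disclosed it
}
```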
A Weibo user profile gives you:
- User ID (numeric)
- Screen name (display name)
- Description / bio
- Followers / friends counts
- Statuses count (total posts)
- Verification status with reason text — this is gold for identifying official accounts vs personal vs corporate
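The verification fields are enough for a rough account-type heuristic. A minimal sketch, assuming the profile dict uses Weibo's `verified` / `verified_reason` field names (check them against real payloads):

```python
def classify_account(profile: dict) -> str:
    """Rough account-type bucket from verification fields.
    Field names are assumptions — verify against live profile data."""
    if not profile.get("verified"):
        return "personal"
    reason = profile.get("verified_reason", "")
    # Official org/media accounts usually end in "…官方微博" ("official Weibo")
    if "官方" in reason:
        return "official"
    return "verified_individual"
```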
For monitoring use cases, the metric that matters most depends on your goal:
- Crisis monitoring: track `comments_count` and repost velocity. A spike in either signals viral attention (a minimal velocity sketch follows this list).
- Brand presence: track post frequency from verified accounts in your category.
- KOL identification: filter by `verified=true` + follower count above a threshold.
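To make "repost velocity" concrete, here's a minimal sketch that compares two metric snapshots taken a known interval apart and flags a possible spike. The threshold is a placeholder you'd tune per account size:

```python
def engagement_velocity(prev: dict, curr: dict, hours: float) -> dict:
    """Per-hour growth in comments and reposts between two snapshots."""
    return {
        "comments_per_hour": (curr["comments_count"] - prev["comments_count"]) / hours,
        "reposts_per_hour": (curr["reposts_count"] - prev["reposts_count"]) / hours,
    }

# Snapshots of the same post taken 2 hours apart
prev = {"comments_count": 120, "reposts_count": 40}
curr = {"comments_count": 980, "reposts_count": 410}

velocity = engagement_velocity(prev, curr, hours=2.0)
if velocity["reposts_per_hour"] > 100:  # placeholder threshold
    print("Possible viral spike:", velocity)
```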
## The Sina Visitor System
This is the key technical concept for scraping Weibo without a Chinese business license.
When you visit Weibo without logging in, Sina automatically issues you a "visitor cookie" via what they call the Sina Visitor System (SVS). This cookie lets you access limited public data — specifically:
- Hot search / trending topics: full access
- Post comments: full access for any public post
- Post viewing: limited
For these endpoints, scraping is straightforward — get a visitor cookie, hit the AJAX endpoint, parse JSON.
What the visitor cookie does NOT give you:
- Search by keyword (returns hot timeline as a fallback instead of true search results)
- User posts beyond profile basics (you get the profile, not the user's post history)
For those, you need a real logged-in cookie — specifically the SUB cookie value from a logged-in browser session. We'll get to that.
## Approach 1: Build it yourself
The Sina Visitor System flow looks roughly like this:
```python
import json

import httpx

# Step 1: Hit the visitor system to get a tid (temporary ID)
visitor_url = "https://passport.weibo.com/visitor/genvisitor"
resp = httpx.post(visitor_url, data={
    "cb": "gen_callback",
    "fp": '{"os":"1","browser":"Chrome","fonts":"undefined","screenInfo":"1920*1080*24","plugins":""}',
})

# The response is JSONP: gen_callback({...}) — strip the wrapper,
# parse the JSON, and pull out the tid
body = resp.text
payload = json.loads(body[body.index("(") + 1 : body.rindex(")")])
tid = payload["data"]["tid"]

# Step 2: Use tid to get the SUB visitor cookie
incarnate_url = "https://passport.weibo.com/visitor/visitor"
resp = httpx.get(incarnate_url, params={
    "a": "incarnate",
    "t": tid,
    "w": "2",
    "c": "100",
})

# Response sets cookies — extract SUB and SUBP
sub = resp.cookies["SUB"]
subp = resp.cookies["SUBP"]

# Step 3: Use those cookies to call AJAX endpoints
hot_search_url = "https://weibo.com/ajax/side/hotSearch"
resp = httpx.get(hot_search_url, cookies={"SUB": sub, "SUBP": subp})
data = resp.json()
# data["data"]["realtime"] is the hot search list
```
That's the rough shape. In practice you'll handle:
- Rate limit responses (HTTP 418, 429) with exponential backoff
- Cookie expiration (visitor cookies last hours, not days)
- AJAX endpoint changes (Weibo periodically reshuffles paths)
- Anti-scraping fingerprint checks (less aggressive than RedNote, but still present)
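The first item — backing off on 418/429 — is mechanical enough to sketch. A minimal version with jitter:

```python
import random
import time

import httpx

def get_with_backoff(url: str, cookies: dict, max_retries: int = 5) -> httpx.Response:
    """Retry on Weibo's rate-limit responses (418/429) with exponential backoff."""
    for attempt in range(max_retries):
        resp = httpx.get(url, cookies=cookies)
        if resp.status_code not in (418, 429):
            return resp
        # 1s, 2s, 4s, ... plus jitter so parallel workers don't sync up
        time.sleep(2 ** attempt + random.random())
    raise RuntimeError(f"still rate-limited after {max_retries} retries: {url}")
```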
For the keyword-search and user-posts endpoints, you'll need a real SUB cookie from a logged-in account:
```python
# Get SUB from your browser DevTools → Application → Cookies → weibo.com
# Look for the cookie named "SUB"
sub_cookie = "SUB=_2A25Fxxxxxx..."

resp = httpx.get(
    "https://weibo.com/ajax/side/searchAll",
    params={"q": "人工智能"},  # "artificial intelligence"
    cookies={"SUB": sub_cookie.split("=", 1)[1]},
)
```
Cookies typically last several days before expiring, depending on Weibo's session policies.
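One way to catch expiry before your pipeline silently degrades is to probe a cookie-gated endpoint on a schedule. A hedged sketch — the exact failure signature (a redirect into the passport login flow vs. a non-JSON body) is an assumption you should confirm against live responses:

```python
import httpx

def sub_cookie_is_valid(sub: str) -> bool:
    """Probe the search endpoint; treat a redirect to the login flow
    or a non-JSON body as an expired SUB cookie (assumed signature)."""
    resp = httpx.get(
        "https://weibo.com/ajax/side/searchAll",
        params={"q": "test"},
        cookies={"SUB": sub},
        follow_redirects=False,
    )
    location = resp.headers.get("location", "")
    if resp.status_code in (301, 302) and "passport" in location:
        return False
    try:
        resp.json()
        return True
    except ValueError:
        return False
```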
## DIY cost breakdown
| Cost | Estimate |
|---|---|
| Initial setup (visitor system, hot search, comments) | 4-8 hours |
| User session cookie management | 1-2 hours/week |
| Maintenance when Weibo changes endpoints | 2-4 hours, every 2-3 months |
| No proxy needed for most endpoints | $0 |
Weibo is genuinely the easiest of the major Chinese platforms to scrape if you stay within visitor-system endpoints. RedNote and Bilibili both have more complex auth.
## Approach 2: Use a hosted scraper
If you don't want to maintain visitor-system handling and cookie management, the zhorex/weibo-scraper Apify actor handles it.
```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_API_TOKEN")

# Hot search (no cookie needed)
run = client.actor("zhorex/weibo-scraper").call(run_input={
    "mode": "hot_search",
    "maxResults": 50,
})

for topic in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(f"#{topic['rank']}: {topic['title']} (heat: {topic['hotValue']:,})")
```
Output format:
```json
{
  "rank": 1,
  "title": "人工智能最新突破",
  "category": "科技",
  "hotValue": 2847562,
  "labelName": "热",
  "isHot": true,
  "url": "https://s.weibo.com/weibo?q=...",
  "scrapedAt": "2026-04-25T12:00:00Z"
}
```
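Reusing the `run` handle from the hot-search snippet above, you can turn trending results into a simple alert loop. The watchlist terms here are placeholders:

```python
# Hypothetical watchlist: brand name plus its Chinese transliteration
WATCHLIST = {"cerave", "适乐肤"}

for topic in client.dataset(run["defaultDatasetId"]).iterate_items():
    title = topic["title"].lower()
    if any(term in title for term in WATCHLIST):
        print(f"ALERT: {topic['title']} at rank {topic['rank']} "
              f"(heat {topic['hotValue']:,})")
```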
For brand monitoring, search mode is what you want — though note the search-vs-cookie tradeoff:
```python
# Without cookie: returns hot timeline as a fallback
run = client.actor("zhorex/weibo-scraper").call(run_input={
    "mode": "search",
    "searchQuery": "CeraVe",
    "maxResults": 50,
})

# With cookie: returns true keyword-matched results
run = client.actor("zhorex/weibo-scraper").call(run_input={
    "mode": "search",
    "searchQuery": "CeraVe",
    "maxResults": 50,
    "cookieString": "SUB=your_logged_in_cookie",
})
```
The hosted actor handles the visitor system, exponential backoff, and rate limit recovery internally. Pricing: $5 per 1,000 results.
Honest stats on the actor right now: 4 paying users, 11 free-tier users, 92.5% success rate, 3,768 result extractions to date. Average issue response time when something breaks: a few hours.
## When DIY vs hosted
DIY makes sense when:
- You're processing > 1M posts/month (per-result cost adds up)
- You have ops capacity to refresh `SUB` cookies regularly
- You need to scrape behind login at scale
- You have specific endpoints not covered by hosted scrapers
Hosted makes sense when:
- You don't have a dedicated scraper engineer
- Volume is moderate (< 500k posts/month)
- You want the visitor-system handling to be someone else's problem
- You're prototyping and want to validate the use case before committing
## What you do with the data downstream
Sentiment analysis on Chinese text is the obvious next layer. Off-the-shelf Chinese BERT models work reasonably well for Weibo's discourse style — Weibo posts tend to be more formal than RedNote slang, so general Chinese sentiment models achieve higher accuracy (typically 75-85% on neutral/positive/negative classification).
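A minimal sketch with Hugging Face's `transformers` pipeline — the checkpoint name here is an assumption, so substitute whatever Chinese sentiment model you've actually benchmarked on Weibo text:

```python
from transformers import pipeline

# Assumed checkpoint — validate accuracy on your own labeled Weibo sample.
classifier = pipeline(
    "sentiment-analysis",
    model="uer/roberta-base-finetuned-jd-binary-chinese",
)

posts = [
    "这个产品真的很好用",      # "this product really works well"
    "客服态度太差了，失望",    # "customer service was awful, disappointed"
]
for post, result in zip(posts, classifier(posts)):
    print(post, "->", result["label"], round(result["score"], 3))
```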
For brand crisis detection, the signal you usually want is *velocity* — how fast comments and reposts are growing, not their absolute counts (see the engagement-velocity sketch earlier).