Weibo is China's Twitter — the platform where Chinese public opinion forms, brand crises break first, and government statements land. 580M+ monthly active users, mostly mainstream demographics. If you're doing China market intelligence, brand monitoring, or PR analytics, Weibo is one of the platforms you can't skip.
The challenge: Weibo's developer API requires a Chinese business license, has severe rate limits, and exposes very limited data. For Western teams, web scraping is the practical option. The interesting twist is Weibo's Sina Visitor System — an auth flow that makes anonymous access possible for some endpoints but not others. Understanding which is which matters for what you can actually scrape.
This article covers the technical landscape (with real Python code) and points to a hosted scraper if you'd rather skip the maintenance.
## What Weibo serves
A Weibo post is structured similarly to a tweet but with longer character limits and more structured engagement signals:
- Post text (140 to 2,000 characters depending on user level)
- Repost chain — Weibo's quote-tweet equivalent, central to virality tracking
- Engagement metrics — `attitudes_count` (likes), `comments_count`, `reposts_count`
- Hashtags and mentions
- Geolocation if disclosed by user
- Author profile — follower count, verification status, verified reason (e.g., "新浪科技 official Weibo")
- Media — images, videos
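Pulled together, a parsed post might look like the sketch below. The engagement field names (`attitudes_count`, `comments_count`, `reposts_count`) are the ones Weibo's AJAX payloads use; the nesting is this article's own illustrative grouping, not a guaranteed schema:

```python
# Illustrative shape only — verify field names against live AJAX responses.
post = {
    "id": "5012345678901234",
    "text": "人工智能最新突破……",        # post body ("latest AI breakthrough")
    "attitudes_count": 1520,             # likes
    "comments_count": 340,
    "reposts_count": 890,
    "user": {
        "screen_name": "新浪科技",
        "verified": True,
        "verified_reason": "新浪科技官方微博",  # "Sina Tech official Weibo"
        "followers_count": 12_000_000,
    },
    "geo": None,                         # present only if the user disclosed it
}
```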
A Weibo user profile gives you:
- User ID (numeric)
- Screen name (display name)
- Description / bio
- Followers / friends counts
- Statuses count (total posts)
- Verification status with reason text — this is gold for identifying official accounts vs personal vs corporate
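The verification fields are enough for a rough account-type heuristic. A minimal sketch, assuming the profile dict uses Weibo's `verified` / `verified_reason` field names (check them against real payloads):

```python
def classify_account(profile: dict) -> str:
    """Rough account-type bucket from verification fields.
    Field names are assumptions — verify against live profile data."""
    if not profile.get("verified"):
        return "personal"
    reason = profile.get("verified_reason", "")
    # Official org/media accounts usually end in "…官方微博" ("official Weibo")
    if "官方" in reason:
        return "official"
    return "verified_individual"
```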
For monitoring use cases, the metric that matters most depends on your goal:
- Crisis monitoring: track `comments_count` and repost velocity. A spike in either signals viral attention (a minimal velocity sketch follows this list).
- Brand presence: track post frequency from verified accounts in your category.
- KOL identification: filter by `verified=true` + follower count above a threshold.
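To make "repost velocity" concrete, here's a minimal sketch that compares two metric snapshots taken a known interval apart and flags a possible spike. The threshold is a placeholder you'd tune per account size:

```python
def engagement_velocity(prev: dict, curr: dict, hours: float) -> dict:
    """Per-hour growth in comments and reposts between two snapshots."""
    return {
        "comments_per_hour": (curr["comments_count"] - prev["comments_count"]) / hours,
        "reposts_per_hour": (curr["reposts_count"] - prev["reposts_count"]) / hours,
    }

# Snapshots of the same post taken 2 hours apart
prev = {"comments_count": 120, "reposts_count": 40}
curr = {"comments_count": 980, "reposts_count": 410}

velocity = engagement_velocity(prev, curr, hours=2.0)
if velocity["reposts_per_hour"] > 100:  # placeholder threshold
    print("Possible viral spike:", velocity)
```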
## The Sina Visitor System
This is the key technical concept for scraping Weibo without a Chinese business license.
When you visit Weibo without logging in, Sina automatically issues you a "visitor cookie" via what they call the Sina Visitor System (SVS). This cookie lets you access limited public data — specifically:
- Hot search / trending topics: full access
- Post comments: full access for any public post
- Post viewing: limited
For these endpoints, scraping is straightforward — get a visitor cookie, hit the AJAX endpoint, parse JSON.
What the visitor cookie does NOT give you:
- Search by keyword (returns hot timeline as a fallback instead of true search results)
- User posts beyond profile basics (you get the profile, not the user's post history)
For those, you need a real logged-in cookie — specifically the SUB cookie value from a logged-in browser session. We'll get to that.
## Approach 1: Build it yourself
The Sina Visitor System flow looks roughly like this:
```python
import json

import httpx

# Step 1: Hit the visitor system to get a tid (temporary ID)
visitor_url = "https://passport.weibo.com/visitor/genvisitor"
resp = httpx.post(visitor_url, data={
    "cb": "gen_callback",
    "fp": '{"os":"1","browser":"Chrome","fonts":"undefined","screenInfo":"1920*1080*24","plugins":""}',
})

# The response is JSONP: gen_callback({...}) — strip the wrapper,
# parse the JSON, and pull out the tid
body = resp.text
payload = json.loads(body[body.index("(") + 1 : body.rindex(")")])
tid = payload["data"]["tid"]

# Step 2: Use tid to get the SUB visitor cookie
incarnate_url = "https://passport.weibo.com/visitor/visitor"
resp = httpx.get(incarnate_url, params={
    "a": "incarnate",
    "t": tid,
    "w": "2",
    "c": "100",
})

# Response sets cookies — extract SUB and SUBP
sub = resp.cookies["SUB"]
subp = resp.cookies["SUBP"]

# Step 3: Use those cookies to call AJAX endpoints
hot_search_url = "https://weibo.com/ajax/side/hotSearch"
resp = httpx.get(hot_search_url, cookies={"SUB": sub, "SUBP": subp})
data = resp.json()
# data["data"]["realtime"] is the hot search list
```
That's the rough shape. In practice you'll handle:
- Rate limit responses (HTTP 418, 429) with exponential backoff
- Cookie expiration (visitor cookies last hours, not days)
- AJAX endpoint changes (Weibo periodically reshuffles paths)
- Anti-scraping fingerprint checks (less aggressive than RedNote, but still present)
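The first item — backing off on 418/429 — is mechanical enough to sketch. A minimal version with jitter:

```python
import random
import time

import httpx

def get_with_backoff(url: str, cookies: dict, max_retries: int = 5) -> httpx.Response:
    """Retry on Weibo's rate-limit responses (418/429) with exponential backoff."""
    for attempt in range(max_retries):
        resp = httpx.get(url, cookies=cookies)
        if resp.status_code not in (418, 429):
            return resp
        # 1s, 2s, 4s, ... plus jitter so parallel workers don't sync up
        time.sleep(2 ** attempt + random.random())
    raise RuntimeError(f"still rate-limited after {max_retries} retries: {url}")
```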
For the keyword-search and user-posts endpoints, you'll need a real SUB cookie from a logged-in account:
```python
# Get SUB from your browser DevTools → Application → Cookies → weibo.com
# Look for the cookie named "SUB"
sub_cookie = "SUB=_2A25Fxxxxxx..."

resp = httpx.get(
    "https://weibo.com/ajax/side/searchAll",
    params={"q": "人工智能"},  # "artificial intelligence"
    cookies={"SUB": sub_cookie.split("=", 1)[1]},
)
```
Cookies typically last several days before expiring, depending on Weibo's session policies.
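One way to catch expiry before your pipeline silently degrades is to probe a cookie-gated endpoint on a schedule. A hedged sketch — the exact failure signature (a redirect into the passport login flow vs. a non-JSON body) is an assumption you should confirm against live responses:

```python
import httpx

def sub_cookie_is_valid(sub: str) -> bool:
    """Probe the search endpoint; treat a redirect to the login flow
    or a non-JSON body as an expired SUB cookie (assumed signature)."""
    resp = httpx.get(
        "https://weibo.com/ajax/side/searchAll",
        params={"q": "test"},
        cookies={"SUB": sub},
        follow_redirects=False,
    )
    location = resp.headers.get("location", "")
    if resp.status_code in (301, 302) and "passport" in location:
        return False
    try:
        resp.json()
        return True
    except ValueError:
        return False
```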
## DIY cost breakdown
| Cost | Estimate |
|---|---|
| Initial setup (visitor system, hot search, comments) | 4-8 hours |
| User session cookie management | 1-2 hours/week |
| Maintenance when Weibo changes endpoints | 2-4 hours, every 2-3 months |
| No proxy needed for most endpoints | $0 |
Weibo is genuinely the easiest of the major Chinese platforms to scrape if you stay within visitor-system endpoints. RedNote and Bilibili both have more complex auth.
## Approach 2: Use a hosted scraper
If you don't want to maintain visitor-system handling and cookie management, the zhorex/weibo-scraper Apify actor handles it.
```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_API_TOKEN")

# Hot search (no cookie needed)
run = client.actor("zhorex/weibo-scraper").call(run_input={
    "mode": "hot_search",
    "maxResults": 50,
})

for topic in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(f"#{topic['rank']}: {topic['title']} (heat: {topic['hotValue']:,})")
```
Output format:
```json
{
  "rank": 1,
  "title": "人工智能最新突破",
  "category": "科技",
  "hotValue": 2847562,
  "labelName": "热",
  "isHot": true,
  "url": "https://s.weibo.com/weibo?q=...",
  "scrapedAt": "2026-04-25T12:00:00Z"
}
```
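Reusing the `run` handle from the hot-search snippet above, you can turn trending results into a simple alert loop. The watchlist terms here are placeholders:

```python
# Hypothetical watchlist: brand name plus its Chinese transliteration
WATCHLIST = {"cerave", "适乐肤"}

for topic in client.dataset(run["defaultDatasetId"]).iterate_items():
    title = topic["title"].lower()
    if any(term in title for term in WATCHLIST):
        print(f"ALERT: {topic['title']} at rank {topic['rank']} "
              f"(heat {topic['hotValue']:,})")
```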
For brand monitoring, search mode is what you want — though note the search-vs-cookie tradeoff:
```python
# Without cookie: returns hot timeline as a fallback
run = client.actor("zhorex/weibo-scraper").call(run_input={
    "mode": "search",
    "searchQuery": "CeraVe",
    "maxResults": 50,
})

# With cookie: returns true keyword-matched results
run = client.actor("zhorex/weibo-scraper").call(run_input={
    "mode": "search",
    "searchQuery": "CeraVe",
    "maxResults": 50,
    "cookieString": "SUB=your_logged_in_cookie",
})
```
The hosted actor handles the visitor system, exponential backoff, and rate limit recovery internally. Pricing: $5 per 1,000 results.
Honest stats on the actor right now: 4 paying users, 11 free-tier users, 92.5% success rate, 3,768 result extractions to date. Average issue response time when something breaks: a few hours.
## When DIY vs hosted
DIY makes sense when:
- You're processing > 1M posts/month (per-result cost adds up)
- You have ops capacity to refresh `SUB` cookies regularly
- You need to scrape behind login at scale
- You have specific endpoints not covered by hosted scrapers
Hosted makes sense when:
- You don't have a dedicated scraper engineer
- Volume is moderate (< 500k posts/month)
- You want the visitor-system handling to be someone else's problem
- You're prototyping and want to validate the use case before committing
## What you do with the data downstream
Sentiment analysis on Chinese text is the obvious next layer. Off-the-shelf Chinese BERT models work reasonably well for Weibo's discourse style — Weibo posts tend to be more formal than RedNote slang, so general Chinese sentiment models achieve higher accuracy (typically 75-85% on neutral/positive/negative classification).
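A minimal sketch with Hugging Face's `transformers` pipeline — the checkpoint name here is an assumption, so substitute whatever Chinese sentiment model you've actually benchmarked on Weibo text:

```python
from transformers import pipeline

# Assumed checkpoint — validate accuracy on your own labeled Weibo sample.
classifier = pipeline(
    "sentiment-analysis",
    model="uer/roberta-base-finetuned-jd-binary-chinese",
)

posts = [
    "这个产品真的很好用",      # "this product really works well"
    "客服态度太差了，失望",    # "customer service was awful, disappointed"
]
for post, result in zip(posts, classifier(posts)):
    print(post, "->", result["label"], round(result["score"], 3))
```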
For brand crisis detection, the signal you usually want is *velocity* — how fast comments and reposts are growing, not their absolute counts (see the engagement-velocity sketch earlier).