<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sami</title>
    <description>The latest articles on DEV Community by Sami (@sami_8858131362756585e4f4).</description>
    <link>https://dev.to/sami_8858131362756585e4f4</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3877584%2F63d2c24c-ec4e-457f-8a71-2b79bb969554.png</url>
      <title>DEV Community: Sami</title>
      <link>https://dev.to/sami_8858131362756585e4f4</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sami_8858131362756585e4f4"/>
    <language>en</language>
    <item>
      <title>Your Chinese training data has a provenance problem — and August 2026 makes it urgent</title>
      <dc:creator>Sami</dc:creator>
      <pubDate>Thu, 11 Jun 2026 12:41:16 +0000</pubDate>
      <link>https://dev.to/sami_8858131362756585e4f4/your-chinese-training-data-has-a-provenance-problem-and-august-2026-makes-it-urgent-l95</link>
      <guid>https://dev.to/sami_8858131362756585e4f4/your-chinese-training-data-has-a-provenance-problem-and-august-2026-makes-it-urgent-l95</guid>
      <description>&lt;p&gt;If you train or fine-tune models on Chinese-language web text, there's a date you should have circled: &lt;strong&gt;August 2, 2026&lt;/strong&gt;. The EU AI Act's obligations for general-purpose AI (GPAI) models have formally applied since August 2, 2025, but August 2026 is when the Commission's enforcement powers over GPAI providers kick in and the obligations start to bite. They include the requirement to publish a &lt;strong&gt;sufficiently detailed summary of training data&lt;/strong&gt; and to put in place a policy to &lt;strong&gt;respect TDM (text-and-data-mining) opt-outs&lt;/strong&gt; under the EU Copyright Directive.&lt;/p&gt;

&lt;p&gt;In practice, that means someone on your team will eventually be asked: &lt;em&gt;"For this corpus — where did each document come from, when was it retrieved, and did the source signal an opt-out at the time?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If your Chinese corpus is a pile of JSONL files scraped (or bought) at some point in 2023–2025 with no per-document metadata, the honest answer is: &lt;em&gt;we don't know&lt;/em&gt;. And "we don't know" is becoming an expensive answer — for EU-facing labs directly, and for everyone else indirectly, because data vendors, enterprise customers, and academic review boards are all starting to ask the same questions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Chinese-language corpora are the hardest case
&lt;/h2&gt;

&lt;p&gt;Every web corpus has documentation gaps. Chinese-language corpora have them worse, for structural reasons:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Scarcity drives sloppiness.&lt;/strong&gt; High-quality open Chinese text is scarce relative to English. Common Crawl's Chinese share is small and skews toward SEO spam and mirror farms. Because supply is tight, teams hoard whatever they can get — old dumps, resold datasets, "a folder someone left behind" — and documentation is the first thing sacrificed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Quality variance is extreme.&lt;/strong&gt; The interesting Chinese text lives on platforms — social discussion, video comments, finance commentary, long-form reviews, lifestyle posts. Mixed in with it: boilerplate, ads, bot chatter, template spam. Without per-document quality scoring you either keep the noise or hand-filter at a cost that kills the project.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Near-duplicates are endemic.&lt;/strong&gt; Reposting is the norm on Chinese platforms. The same viral post appears dozens of times with minor edits: added emoji, swapped hashtags, platform watermarks. Exact-hash dedup misses almost all of it. Train on it anyway and you get memorization hot spots and inflated dataset counts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. PII density is high.&lt;/strong&gt; User-generated Chinese text is full of phone numbers, national ID numbers, WeChat/QQ handles, addresses, and real names — often embedded mid-sentence. GDPR doesn't care that the data subject is in Shanghai if you're processing in the EU.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Source documentation is usually zero.&lt;/strong&gt; Most Chinese web datasets in circulation — including academic ones — ship as bare text. No URLs, no timestamps, no record of what the source's robots/opt-out posture was at retrieval time. You cannot retrofit provenance. If it wasn't captured at collection time, it's gone.&lt;/p&gt;
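
&lt;p&gt;To make point 3 concrete, here is a minimal MinHash sketch in plain Python. It is an illustration under stated assumptions, not a production dedup index: the character trigram shingling, the 64 hash functions, and the helper names (&lt;code&gt;shingles&lt;/code&gt;, &lt;code&gt;minhash&lt;/code&gt;, &lt;code&gt;estimated_jaccard&lt;/code&gt;) are all choices made for this example. Character n-grams are used because they need no word segmentation.&lt;/p&gt;

```python
import hashlib

def shingles(text, n=3):
    """Character n-grams over letters and digits only (drops emoji, '#', spaces)."""
    cleaned = "".join(ch for ch in text if ch.isalnum())
    return {cleaned[i:i + n] for i in range(max(1, len(cleaned) - n + 1))}

def minhash(text, num_perm=64):
    """One minimum per salted hash function; similar shingle sets share minima."""
    grams = [g.encode("utf-8") for g in shingles(text)]
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(2, "big")  # a different salt acts as a different hash function
        signature.append(min(
            hashlib.blake2b(g, digest_size=8, salt=salt).digest() for g in grams
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Made-up example posts: a repost with extra emoji and a hashtag, and an unrelated post.
original = "今天的发布会太精彩了，新手机的拍照效果真的惊艳，强烈推荐大家去看看"
repost = original + "😂😂 #数码#"
unrelated = "明天股市开盘前请注意风险控制，不要盲目追高"

exact_match = (
    hashlib.sha256(original.encode("utf-8")).digest()
    == hashlib.sha256(repost.encode("utf-8")).digest()
)
repost_similarity = estimated_jaccard(minhash(original), minhash(repost))
unrelated_similarity = estimated_jaccard(minhash(original), minhash(unrelated))

print("exact hash match:", exact_match)
print("repost similarity:", repost_similarity)
print("unrelated similarity:", unrelated_similarity)
```

&lt;p&gt;The exact hashes differ, so exact-hash dedup keeps both copies, while the MinHash estimate for the repost stays close to 1. At corpus scale you would bucket signatures (for example with locality-sensitive hashing) instead of comparing every pair.&lt;/p&gt;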

&lt;h2&gt;
  
  
  What per-document provenance actually requires
&lt;/h2&gt;

&lt;p&gt;"Provenance" gets used loosely. For training-data documentation purposes, here's the concrete per-document record you want:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Source URL&lt;/strong&gt; — the canonical URL of the original document, not just "Weibo".&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval timestamp&lt;/strong&gt; — when this exact text was collected. Opt-out states change; the timestamp anchors your good-faith record.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Robots / opt-out state at retrieval&lt;/strong&gt; — what the source's machine-readable signals said &lt;em&gt;at the moment of collection&lt;/em&gt;. This is the field everyone is missing and the one TDM-policy questions hinge on.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License hint&lt;/strong&gt; — the best available signal about the source's terms (platform ToS class, page-level license markers). A hint, not a clearance — but a documented hint beats silence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Content hash&lt;/strong&gt; — a stable hash of the normalized text, so you can prove what was in the corpus, detect drift between corpus versions, and answer takedown requests precisely.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pipeline version&lt;/strong&gt; — which version of the collection/cleaning pipeline produced the record, so your documentation is reproducible.&lt;/li&gt;
&lt;/ul&gt;
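
&lt;p&gt;As a rough sketch, the six fields above fit in one small record written alongside each document. Everything named here is an assumption for illustration: the &lt;code&gt;ProvenanceRecord&lt;/code&gt; class, the &lt;code&gt;hash_text&lt;/code&gt; helper, the normalization used before hashing (NFKC plus whitespace collapsing), the example URL, and the string values used for the robots state and license hint.&lt;/p&gt;

```python
import hashlib
import json
import unicodedata
from dataclasses import asdict, dataclass
from datetime import datetime, timezone

PIPELINE_VERSION = "0.1.0"  # hypothetical version label for this sketch

def hash_text(text):
    """SHA-256 over NFKC-normalized, whitespace-collapsed text."""
    normalized = " ".join(unicodedata.normalize("NFKC", text).split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

@dataclass(frozen=True)
class ProvenanceRecord:
    source_url: str
    retrieved_at: str       # ISO 8601, UTC
    robots_state: str       # what the source signaled at collection time
    license_hint: str       # a hint, not a clearance
    content_hash: str
    pipeline_version: str

def make_record(text, source_url, robots_state, license_hint, retrieved_at=None):
    """Build the record at collection time; it cannot be reconstructed later."""
    timestamp = retrieved_at or datetime.now(timezone.utc)
    return ProvenanceRecord(
        source_url=source_url,
        retrieved_at=timestamp.isoformat(),
        robots_state=robots_state,
        license_hint=license_hint,
        content_hash=hash_text(text),
        pipeline_version=PIPELINE_VERSION,
    )

record = make_record(
    text="这是一条示例帖子。",
    source_url="https://example.com/post/123",  # placeholder URL
    robots_state="allowed",                     # hypothetical value
    license_hint="platform-tos",                # hypothetical value
)
print(json.dumps(asdict(record), ensure_ascii=False, indent=2))
```

&lt;p&gt;Hashing the normalized text rather than the raw bytes means cosmetic differences such as full-width spaces do not change the hash, which makes takedown lookups and version diffs more stable.&lt;/p&gt;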

&lt;h2&gt;
  
  
  A practical checklist (works with any tooling)
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Stop ingesting undocumented data now.&lt;/strong&gt; Every undocumented document you add today is a liability you can't repair later.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capture provenance at collection time&lt;/strong&gt; — URL, timestamp, robots/opt-out state, license signal, hash, pipeline version, on every document.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dedup with MinHash/SimHash, not exact hashes.&lt;/strong&gt; Exact hashes only catch byte-identical copies; near-duplicate detection is what catches the lightly edited reposts that dominate Chinese platforms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Score quality per document&lt;/strong&gt; and record the score and threshold, so your filtering is defensible, not vibes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scrub PII before storage&lt;/strong&gt;, and log that scrubbing happened (pipeline version again).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep a manifest per corpus version&lt;/strong&gt; — document counts, source distribution, date ranges — so the "sufficiently detailed summary" is a query, not an archaeology project.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Re-check opt-out signals on refresh.&lt;/strong&gt; Provenance is a snapshot; periodic refresh keeps your record current.&lt;/li&gt;
&lt;/ol&gt;
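
&lt;p&gt;Item 6 of the checklist is cheap once per-document records exist. The sketch below assumes each record is a dict with &lt;code&gt;source_url&lt;/code&gt;, &lt;code&gt;retrieved_at&lt;/code&gt; (ISO 8601 strings in a single timezone, so they sort lexicographically), and &lt;code&gt;pipeline_version&lt;/code&gt;. The function name &lt;code&gt;build_manifest&lt;/code&gt; and the manifest keys are made up for this example.&lt;/p&gt;

```python
import json
from collections import Counter
from urllib.parse import urlparse

def build_manifest(records, corpus_version):
    """Summarize a corpus version from its per-document provenance records."""
    timestamps = sorted(r["retrieved_at"] for r in records)
    hosts = Counter(urlparse(r["source_url"]).netloc for r in records)
    return {
        "corpus_version": corpus_version,
        "document_count": len(records),
        "source_distribution": dict(hosts),
        "retrieved_from": timestamps[0] if timestamps else None,
        "retrieved_to": timestamps[-1] if timestamps else None,
        "pipeline_versions": sorted({r["pipeline_version"] for r in records}),
    }

# Placeholder records; real ones would come from the collection pipeline.
records = [
    {"source_url": "https://weibo.com/1/a", "retrieved_at": "2026-05-01T08:00:00+00:00", "pipeline_version": "0.1.0"},
    {"source_url": "https://weibo.com/2/b", "retrieved_at": "2026-05-03T09:30:00+00:00", "pipeline_version": "0.1.0"},
    {"source_url": "https://www.douban.com/review/3", "retrieved_at": "2026-04-28T12:00:00+00:00", "pipeline_version": "0.1.1"},
]
manifest = build_manifest(records, "2026-05")
print(json.dumps(manifest, ensure_ascii=False, indent=2))
```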

&lt;p&gt;You can build all this in-house. Budget a few engineer-months for the collection layer, the dedup index, the PII pass, and the metadata plumbing — per platform.&lt;/p&gt;

&lt;h2&gt;
  
  
  Or use the turnkey version
&lt;/h2&gt;

&lt;p&gt;I built the &lt;strong&gt;&lt;a href="https://apify.com/zhorex/chinese-corpus-engine" rel="noopener noreferrer"&gt;Chinese AI Training Corpus Engine&lt;/a&gt;&lt;/strong&gt; to do exactly the pipeline above, as a self-serve Apify actor:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Five platforms&lt;/strong&gt;: Weibo, Bilibili, Xueqiu, Douban, RedNote (Xiaohongshu) — social, video, finance, reviews, lifestyle, so your corpus has register diversity, not just one genre.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MinHash near-duplicate detection&lt;/strong&gt; across the batch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-document quality scoring&lt;/strong&gt; with configurable thresholds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PII scrubbing&lt;/strong&gt; (phones, national IDs, emails) before output.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full provenance on every document&lt;/strong&gt;: source URL, retrieval timestamp, robots state at collection, license hint, content hash, pipeline version — the exact fields your EU AI Act documentation needs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Pricing is per &lt;em&gt;validated&lt;/em&gt; document: &lt;strong&gt;$0.025/doc&lt;/strong&gt; (HTTP tier) or &lt;strong&gt;$0.055/doc&lt;/strong&gt; (browser tier). Documents that fail quality checks or turn out to be duplicates are &lt;strong&gt;never charged&lt;/strong&gt;, so you pay for corpus, not for noise. A 10,000-document pilot is $250 on the HTTP tier ($550 on the browser tier); you'll know within an hour whether the output fits your pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest limitations
&lt;/h2&gt;

&lt;p&gt;This is &lt;strong&gt;documentation tooling, not legal clearance&lt;/strong&gt;. The engine records what sources signaled at collection time and structures it so you can document your corpus — it does not grant training rights, license the underlying content, or determine that any given use is lawful in your jurisdiction. License hints are hints. Whether and how you may train on any document remains your decision, ideally made with counsel who knows the EU AI Act, the Copyright Directive, and your specific exposure. What the tool guarantees is that when counsel asks "what do we know about this corpus?", you have a real answer per document instead of a shrug.&lt;/p&gt;

&lt;p&gt;August 2026 is closer than it looks. The teams that win the documentation question will be the ones who captured the metadata while collecting — not the ones reconstructing it afterward.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Questions or pilot requests: &lt;a href="mailto:samimassis2002@gmail.com"&gt;samimassis2002@gmail.com&lt;/a&gt; — or just run the &lt;a href="https://apify.com/zhorex/chinese-corpus-engine" rel="noopener noreferrer"&gt;actor&lt;/a&gt; directly.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>ai</category>
      <category>llm</category>
      <category>machinelearning</category>
      <category>data</category>
    </item>
    <item>
      <title>How to track Weibo hot-search velocity with Python in 2026 — the trending-delta problem and how to handle it</title>
      <dc:creator>Sami</dc:creator>
      <pubDate>Tue, 09 Jun 2026 15:52:00 +0000</pubDate>
      <link>https://dev.to/sami_8858131362756585e4f4/how-to-track-weibo-hot-search-velocity-with-python-in-2026-the-trending-delta-problem-and-how-to-5g5f</link>
      <guid>https://dev.to/sami_8858131362756585e4f4/how-to-track-weibo-hot-search-velocity-with-python-in-2026-the-trending-delta-problem-and-how-to-5g5f</guid>
      <description>&lt;p&gt;If you scrape Weibo's hot-search board you get a snapshot: ~50 trending topics, ranked, right now. That's table stakes — and on its own it's almost useless as a signal. The value isn't &lt;em&gt;what&lt;/em&gt; is trending; it's &lt;em&gt;what's moving&lt;/em&gt;: which topic just jumped 30 places in 20 minutes, which is decaying, which is brand-new this hour. That's &lt;strong&gt;velocity&lt;/strong&gt;, and velocity is where the signal lives — for brand-crisis teams, consumer-trend desks, and anyone modelling attention in China.&lt;/p&gt;

&lt;p&gt;The catch: a single scrape can't tell you velocity. You have to diff the board against its own past, reliably, run after run. That's a &lt;em&gt;stateful&lt;/em&gt; pipeline, and it has a few non-obvious gotchas. Here's the shape of the problem and how to handle it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why a snapshot isn't enough
&lt;/h2&gt;

&lt;p&gt;Rank-right-now tells you nothing about trajectory. "#7" could be a topic on its way to #1 or one fading out of the top 50 — same row, opposite meaning. To act on a trend you need the &lt;em&gt;derivative&lt;/em&gt;: direction, speed, and how long it's been climbing. None of that is in a single pull.&lt;/p&gt;

&lt;h2&gt;
  
  
  The trending-delta problem
&lt;/h2&gt;

&lt;p&gt;Three things make "just diff the board" harder than it looks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Key by identity, not position.&lt;/strong&gt; You can't track a topic by its rank — rank is the thing that changes. Key by the topic itself (its text/keyword) or your deltas are nonsense.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;State has to survive between runs.&lt;/strong&gt; A scheduled scrape is stateless by default — each run starts cold. To compute "this rose 12 places since 30 minutes ago," you must persist the previous board and reload it next run, keyed so independent schedules don't overwrite each other.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The board churns.&lt;/strong&gt; Topics appear, peak, and fall off. You want each tagged &lt;code&gt;new&lt;/code&gt; / &lt;code&gt;rising&lt;/code&gt; / &lt;code&gt;falling&lt;/code&gt; / &lt;code&gt;steady&lt;/code&gt; / &lt;code&gt;dropped&lt;/code&gt;, plus how long it's been on the board and its running peak — none of which exist in the raw snapshot.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  How to handle it (the pattern)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;current&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;pull_board&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;                  &lt;span class="c1"&gt;# [{topic, rank, heat}, ...]
&lt;/span&gt;&lt;span class="n"&gt;previous&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_state&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;               &lt;span class="c1"&gt;# durable store that persists across runs
&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;prev&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;previous&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;         &lt;span class="c1"&gt;# match on identity, not rank
&lt;/span&gt;    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rank_delta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prev&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rank&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;prev&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;heat_delta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;heat&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;prev&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;heat&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;prev&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;classify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prev&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     &lt;span class="c1"&gt;# new / rising / falling / steady
&lt;/span&gt;    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first_seen&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;prev&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first_seen&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;prev&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;peak_rank&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prev&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;peak_rank&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;prev&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rank&lt;/span&gt;

&lt;span class="nf"&gt;emit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;dropped&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;previous&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;   &lt;span class="c1"&gt;# include topics that fell off
&lt;/span&gt;&lt;span class="nf"&gt;save_state&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                      &lt;span class="c1"&gt;# for next run
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
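
&lt;p&gt;The pattern above is pseudocode. Below is one way it could be made runnable, as a sketch under assumptions: state is a JSON file in a local directory standing in for whatever durable store you use, the board is passed in as a list of dicts instead of being scraped, and the names &lt;code&gt;run_once&lt;/code&gt;, &lt;code&gt;load_state&lt;/code&gt;, &lt;code&gt;save_state&lt;/code&gt; and &lt;code&gt;classify&lt;/code&gt; are chosen for this example.&lt;/p&gt;

```python
import json
import tempfile
import time
from pathlib import Path

def load_state(state_dir, key):
    """Return the previous board as {topic: row}, or {} on the first run."""
    path = Path(state_dir) / f"{key}.json"
    return json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}

def save_state(state_dir, key, rows):
    path = Path(state_dir) / f"{key}.json"
    path.write_text(json.dumps(rows, ensure_ascii=False), encoding="utf-8")

def classify(rank_delta):
    if rank_delta is None:
        return "new"
    if rank_delta == 0:
        return "steady"
    # A positive delta means the rank number shrank, i.e. the topic climbed.
    return "rising" if rank_delta == abs(rank_delta) else "falling"

def run_once(state_dir, key, current, now=None):
    """Diff the current board against the stored one, then store the current one."""
    now = time.time() if now is None else now
    previous = load_state(state_dir, key)
    rows = {}
    for item in current:
        prev = previous.get(item["topic"])  # match on identity, not rank
        delta = prev["rank"] - item["rank"] if prev else None
        rows[item["topic"]] = dict(
            item,
            rank_delta=delta,
            heat_delta=item["heat"] - prev["heat"] if prev else None,
            status=classify(delta),
            first_seen=prev["first_seen"] if prev else now,
            peak_rank=min(prev["peak_rank"], item["rank"]) if prev else item["rank"],
        )
    dropped = [
        dict(row, status="dropped", rank_delta=None, heat_delta=None)
        for topic, row in previous.items() if topic not in rows
    ]
    save_state(state_dir, key, rows)  # only live topics are carried to the next run
    return list(rows.values()) + dropped

with tempfile.TemporaryDirectory() as tmp:
    first_board = [{"topic": "A", "rank": 5, "heat": 100}, {"topic": "B", "rank": 2, "heat": 300}]
    second_board = [{"topic": "A", "rank": 1, "heat": 900}, {"topic": "C", "rank": 9, "heat": 50}]
    run_once(tmp, "hourly", first_board, now=0)
    for row in run_once(tmp, "hourly", second_board, now=1800):
        print(row["topic"], row["status"], row["rank_delta"])
```

&lt;p&gt;On the second run in this example, A comes back as rising with a rank delta of 4, C as new, and B as dropped. Because only live topics are saved, a topic that drops off and later returns is treated as new; keep dropped rows in the state if you need continuity across gaps.&lt;/p&gt;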



&lt;p&gt;Schedule that at a cadence that matches the movement you want to see (every few minutes for fast spikes, hourly or daily for slower trends) and every run becomes a &lt;strong&gt;velocity reading&lt;/strong&gt; instead of a snapshot. The hard parts in practice are the durable, per-stream state and stable identity matching; get those wrong and the deltas lie.&lt;/p&gt;

&lt;h2&gt;
  
  
  How this turns into money
&lt;/h2&gt;

&lt;p&gt;Velocity is a &lt;em&gt;leading&lt;/em&gt; indicator, and leading indicators are what people pay for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Brand-crisis alerting&lt;/strong&gt; — catch a topic about your brand spiking &lt;em&gt;before&lt;/em&gt; it peaks: hours of lead time vs. a once-a-day report. That lead time is the product.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consumer-trend alt-data&lt;/strong&gt; — rising-topic velocity is an early read on attention and demand shifts. Trend desks and funds buy exactly this kind of signal; a clean, timestamped delta feed is a sellable input.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Marketing / launch timing&lt;/strong&gt; — ride a topic while it's ascending, not after it's saturated and CPMs have spiked.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're building a product on top, this delta stream &lt;em&gt;is&lt;/em&gt; your signal layer — everything downstream (alerts, scoring, dashboards) hangs off it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The practical path (skip the plumbing)
&lt;/h2&gt;

&lt;p&gt;You can build the stateful diff and session handling yourself, or point a maintained extractor at it. I maintain a &lt;a href="https://apify.com/zhorex/weibo-scraper" rel="noopener noreferrer"&gt;Weibo Scraper&lt;/a&gt; on Apify with a &lt;code&gt;hot_search_delta&lt;/code&gt; mode that does exactly this: it pulls the board, persists state across scheduled runs, and returns the &lt;code&gt;new&lt;/code&gt; / &lt;code&gt;rising&lt;/code&gt; / &lt;code&gt;falling&lt;/code&gt; / &lt;code&gt;dropped&lt;/code&gt; deltas with rank velocity, time-on-board, and peaks. Pay-per-result, runs on a schedule.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;apify_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ApifyClient&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ApifyClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_APIFY_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;actor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zhorex/weibo-scraper&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hot_search_delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deltaStateKey&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hourly&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;# name independent streams (hourly / daily / ...)
&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;defaultDatasetId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;iterate_items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rising&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rankDelta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;  +&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rankDelta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; ranks  (heat &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hotValue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Wire it to an Apify &lt;strong&gt;Schedule&lt;/strong&gt; and you have a rolling Weibo trend-velocity feed without owning the pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it is — and isn't
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Is:&lt;/strong&gt; a stateful, scheduled velocity feed over China's largest real-time attention signal.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Isn't:&lt;/strong&gt; a one-off snapshot (that's the standard hot-search mode) — or a sentiment model. You get structured movement; the modelling on top is yours.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Need a field that isn't there yet, or a different cadence? Open an issue on the &lt;a href="https://apify.com/zhorex/weibo-scraper/issues" rel="noopener noreferrer"&gt;Actor page&lt;/a&gt; — I usually ship small additions within a couple of days. For high-volume or managed feeds, the README has the enterprise contact.&lt;/p&gt;

</description>
      <category>python</category>
      <category>webscraping</category>
      <category>datascience</category>
      <category>china</category>
    </item>
    <item>
      <title>Sourcing clean, multi-platform Chinese-language training data at scale in 2026 — a legal + practical guide for AI teams</title>
      <dc:creator>Sami</dc:creator>
      <pubDate>Wed, 03 Jun 2026 00:11:34 +0000</pubDate>
      <link>https://dev.to/sami_8858131362756585e4f4/sourcing-clean-multi-platform-chinese-language-training-data-at-scale-in-2026-a-legal--35na</link>
      <guid>https://dev.to/sami_8858131362756585e4f4/sourcing-clean-multi-platform-chinese-language-training-data-at-scale-in-2026-a-legal--35na</guid>
      <description>&lt;p&gt;If you're training or fine-tuning a model that needs to understand modern Chinese — consumer slang, product opinions, finance chatter, Gen-Z internet register — you've probably hit the same wall: &lt;strong&gt;the open Chinese corpora are stale, web-heavy, and thin on authentic first-person signal.&lt;/strong&gt; Common Crawl's Chinese slice is noisy and dated; the polished open datasets skew formal/encyclopedic. The &lt;em&gt;living&lt;/em&gt; Chinese-language signal — how real people actually write in 2026 — sits on a handful of social platforms, and getting it cleanly, at scale, and on solid legal footing is its own project.&lt;/p&gt;

&lt;p&gt;This is a practical guide to doing that without standing up (and babysitting) a five-platform scraping operation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Chinese is the hardest major language to source well
&lt;/h2&gt;

&lt;p&gt;Three things make it uniquely painful:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The register you want is platform-locked.&lt;/strong&gt; Formal Chinese is everywhere; &lt;em&gt;colloquial, current, opinion-rich&lt;/em&gt; Chinese lives inside Weibo, RedNote, Bilibili, Douban and Xueqiu — and each gates and structures its public data differently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It's fragmented.&lt;/strong&gt; A model that only sees microblog text misses lifestyle reviews, video-comment register, long-form opinion, and finance vernacular. You need &lt;em&gt;several&lt;/em&gt; platforms to cover the distribution.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It moves.&lt;/strong&gt; Last year's dump is already drifting from how people write today. Good Chinese data is a &lt;em&gt;rolling&lt;/em&gt; requirement, not a one-time pull.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What a good Chinese-language corpus actually needs
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scale&lt;/strong&gt; — hundreds of thousands to millions of records, not a sample.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recency&lt;/strong&gt; — a scheduled, rolling pull, not a one-off snapshot.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Register diversity&lt;/strong&gt; — microblog (Weibo), lifestyle/product reviews (RedNote), video comments + danmaku (Bilibili), long-form reviews/discussion (Douban), retail-finance vernacular (Xueqiu).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clean structure&lt;/strong&gt; — normalized fields, consistent encoding, deduplicated across platforms (the same KOL post reposted three places should collapse to one record, or you bias the model).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Provenance you can defend&lt;/strong&gt; — public surface, no authentication, clear about what it is.&lt;/li&gt;
&lt;/ul&gt;
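
&lt;p&gt;The "clean structure" requirement can be sketched as two steps: map each platform's fields onto one schema, then collapse records whose normalized text is identical. The per-platform field names in &lt;code&gt;FIELD_MAP&lt;/code&gt; are hypothetical (real extractor output will differ), as are the &lt;code&gt;normalize&lt;/code&gt; and &lt;code&gt;collapse&lt;/code&gt; helpers and the example URLs. This only catches copies that match exactly after normalization; edited reposts need near-duplicate detection on top.&lt;/p&gt;

```python
import hashlib
import unicodedata

# Hypothetical per-platform field names, used only for this illustration.
FIELD_MAP = {
    "weibo": {"text": "text", "url": "url", "created": "created_at"},
    "douban": {"text": "content", "url": "link", "created": "time"},
}

def normalize(platform, raw):
    """Map a platform-specific record onto one common schema."""
    fields = FIELD_MAP[platform]
    text = " ".join(unicodedata.normalize("NFKC", raw[fields["text"]]).split())
    return {
        "platform": platform,
        "text": text,
        "url": raw[fields["url"]],
        "created": raw[fields["created"]],
        "text_hash": hashlib.sha256(text.encode("utf-8")).hexdigest(),
    }

def collapse(records):
    """Keep the first record per text hash and note where else it appeared."""
    merged = {}
    for rec in records:
        kept = merged.setdefault(rec["text_hash"], dict(rec, also_seen_on=[]))
        if kept["url"] != rec["url"]:
            kept["also_seen_on"].append(rec["platform"])
    return list(merged.values())

# Made-up inputs: the same post on two platforms, differing only in punctuation width.
raw_inputs = [
    ("weibo", {"text": "这部电影太好看了！", "url": "https://example.com/w/1", "created_at": "2026-05-01"}),
    ("douban", {"content": "这部电影太好看了!", "link": "https://example.com/d/9", "time": "2026-05-02"}),
]
corpus = collapse([normalize(platform, raw) for platform, raw in raw_inputs])
print(len(corpus), corpus[0]["platform"], corpus[0]["also_seen_on"])
```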

&lt;h2&gt;
  
  
  The build-it-yourself trap
&lt;/h2&gt;

&lt;p&gt;You &lt;em&gt;can&lt;/em&gt; wire up five scrapers. The honest cost is what comes after:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Five different access surfaces that &lt;strong&gt;change on their own schedule&lt;/strong&gt;, each breaking independently — that's five maintenance burdens, not one.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;normalization + cross-platform dedup&lt;/strong&gt; layer you now own forever.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;legal/compliance&lt;/strong&gt; posture you have to reason about per platform.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By the time it's robust, you've built a data-engineering team's worth of plumbing before training a single epoch. For most AI teams, that's not the project they want to be in.&lt;/p&gt;

&lt;h2&gt;
  
  
  The legal layer (high-level — not legal advice)
&lt;/h2&gt;

&lt;p&gt;This is the part people skip and regret. The landscape in 2026, briefly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Public, logged-off data sits on firmer ground.&lt;/strong&gt; In &lt;em&gt;Meta v. Bright Data&lt;/em&gt; (N.D. Cal., Jan 2024) a US court held that scraping &lt;strong&gt;publicly available, logged-off&lt;/strong&gt; data — and selling it — did not breach Meta's terms. It's narrow to that case's facts, but the direction is clear: &lt;em&gt;authenticated&lt;/em&gt; scraping is the risky lane; &lt;strong&gt;public, no-login collection is the defensible one.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Personal data has cross-border obligations.&lt;/strong&gt; If your corpus carries personal information, China's cross-border data-transfer rules (tightened for 2026) attach compliance steps above volume thresholds. The pragmatic read: &lt;strong&gt;favor public-post text and aggregate/derived signal over bulk personal profiles.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Marketplaces increasingly demand clean provenance.&lt;/strong&gt; AI-data marketplaces now ask for "legally sourced, non-scraped" guarantees — which is exactly why sourcing &lt;em&gt;your own&lt;/em&gt; public-surface corpus (where you control and document the use) is often cleaner than buying a mystery dataset.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;(None of this is legal advice — run your specific use case past counsel. The point is simply: stay on the public, logged-off, non-PII-heavy lane and document it.)&lt;/em&gt;&lt;/p&gt;
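
&lt;p&gt;One way to make "document it" concrete is to attach a small provenance record to every document at collection time. The sketch below is only an illustration: the field names (&lt;code&gt;source_url&lt;/code&gt;, &lt;code&gt;retrieved_at&lt;/code&gt;, &lt;code&gt;login_required&lt;/code&gt;, &lt;code&gt;opt_out_signal&lt;/code&gt;) and the opt-out vocabulary are my own assumptions, not part of any actor's output schema or of any regulatory template.&lt;/p&gt;

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import hashlib
import json


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical per-document provenance row (field names are assumptions)."""
    source_url: str
    platform: str
    retrieved_at: str      # ISO 8601 timestamp in UTC
    login_required: bool   # False for public, logged-off collection
    opt_out_signal: str    # what was observed at retrieval time
    content_sha256: str    # ties the record to the exact text collected


def make_record(text, source_url, platform, opt_out_signal="none-observed"):
    return ProvenanceRecord(
        source_url=source_url,
        platform=platform,
        retrieved_at=datetime.now(timezone.utc).isoformat(),
        login_required=False,
        opt_out_signal=opt_out_signal,
        content_sha256=hashlib.sha256(text.encode("utf-8")).hexdigest(),
    )


# Example: one JSON line per document, stored next to the corpus.
record = make_record("新能源车销量创新高", "https://example.com/post/1", "weibo")
line = json.dumps(asdict(record), ensure_ascii=False)
```

&lt;p&gt;Writing these as JSON lines alongside the text keeps the "where, when, and under what signal" answer one lookup away when someone asks for it.&lt;/p&gt;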

&lt;h2&gt;
  
  
  The practical path: maintained public-surface extractors
&lt;/h2&gt;

&lt;p&gt;Instead of owning the five-platform treadmill, you point a maintained, &lt;strong&gt;public-surface, no-login&lt;/strong&gt; extractor at each platform and get back clean, structured records — on a schedule, at scale, pay-per-result. I maintain exactly this set on Apify:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://apify.com/zhorex/weibo-scraper" rel="noopener noreferrer"&gt;&lt;strong&gt;Weibo Scraper&lt;/strong&gt;&lt;/a&gt; — microblog posts, hot search, comments (broad public-opinion register)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://apify.com/zhorex/rednote-xiaohongshu-scraper" rel="noopener noreferrer"&gt;&lt;strong&gt;RedNote / Xiaohongshu Scraper&lt;/strong&gt;&lt;/a&gt; — first-person product reviews + lifestyle text&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://apify.com/zhorex/bilibili-scraper" rel="noopener noreferrer"&gt;&lt;strong&gt;Bilibili Scraper&lt;/strong&gt;&lt;/a&gt; — video metadata, comments, danmaku (Gen-Z register)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://apify.com/zhorex/xueqiu-scraper" rel="noopener noreferrer"&gt;&lt;strong&gt;Xueqiu Scraper&lt;/strong&gt;&lt;/a&gt; — retail-investor / cashtag finance vernacular&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://apify.com/zhorex/douban-scraper" rel="noopener noreferrer"&gt;&lt;strong&gt;Douban Scraper&lt;/strong&gt;&lt;/a&gt; — long-form reviews and discussion&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each returns clean JSON you can stream straight into your pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;apify_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ApifyClient&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ApifyClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_APIFY_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;actor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zhorex/weibo-scraper&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;searchQuery&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;新能源车&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# "new energy vehicles"
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maxResults&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;defaultDatasetId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;iterate_items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# → straight into your tokenizer / dedup / corpus store
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you want &lt;strong&gt;all five platforms normalized into one schema and deduplicated across platforms&lt;/strong&gt; (so cross-posts don't inflate your corpus), the &lt;a href="https://apify.com/zhorex/chinese-brand-monitor" rel="noopener noreferrer"&gt;&lt;strong&gt;Chinese Brand Monitor&lt;/strong&gt;&lt;/a&gt; aggregator does that merge in a single call.&lt;/p&gt;
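
&lt;p&gt;If you'd rather see what a cross-platform dedup pass involves before relying on one, here is a minimal exact-match sketch. It is not how the aggregator works internally; it simply fingerprints each text after Unicode NFKC normalization and whitespace removal, and it assumes every record carries a &lt;code&gt;text&lt;/code&gt; field. Near-duplicates (edited reposts, truncated quotes) would need a fuzzier technique on top.&lt;/p&gt;

```python
import hashlib
import re
import unicodedata


def fingerprint(text):
    """SHA-256 of the text after NFKC normalization, lowercasing, and whitespace removal."""
    normalized = unicodedata.normalize("NFKC", text).lower()
    normalized = re.sub(r"\s+", "", normalized)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedup(records):
    """Keep the first record seen for each fingerprint; order is preserved."""
    seen = set()
    unique = []
    for rec in records:
        key = fingerprint(rec["text"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


records = [
    {"platform": "weibo", "text": "新能源车 销量创新高"},
    {"platform": "rednote", "text": "新能源车销量创新高"},  # same post, cross-posted
    {"platform": "douban", "text": "一篇完全不同的长评"},
]
unique = dedup(records)
```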

&lt;h2&gt;
  
  
  Cost at scale
&lt;/h2&gt;

&lt;p&gt;Pricing is pay-per-result, on the order of half a cent per record at the first row's figures ($250 / 50K posts), so a corpus pull is a line item, not a procurement cycle:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pull&lt;/th&gt;
&lt;th&gt;Typical use&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;50K Weibo posts, one-off&lt;/td&gt;
&lt;td&gt;small fine-tune slice&lt;/td&gt;
&lt;td&gt;~$250&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;500K records across 3 platforms&lt;/td&gt;
&lt;td&gt;a real corpus&lt;/td&gt;
&lt;td&gt;low four figures&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scheduled monthly refresh&lt;/td&gt;
&lt;td&gt;rolling recency&lt;/td&gt;
&lt;td&gt;repeats at the same per-record rate&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Compare that to an engineer-month building and maintaining five pipelines.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this is — and isn't
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Is:&lt;/strong&gt; public-surface text, structured, scheduled, at scale — you run it, you own how you use the output.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Isn't:&lt;/strong&gt; authenticated/private content, or a "mystery" dataset of unknown provenance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Isn't:&lt;/strong&gt; a labeling service — you get raw, structured text + metadata; the curation/filtering is yours.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Getting a bulk corpus
&lt;/h2&gt;

&lt;p&gt;For a one-off corpus or a rolling scheduled feed, the actors above run self-serve on Apify's free tier so you can eyeball the output shape before committing. &lt;strong&gt;For high-volume / enterprise&lt;/strong&gt; — millions of records, a custom schema matched to your warehouse, or a managed recurring feed — open an issue titled &lt;strong&gt;"Enterprise inquiry"&lt;/strong&gt; on any actor, or email &lt;strong&gt;&lt;a href="mailto:samimassis2002@gmail.com"&gt;samimassis2002@gmail.com&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If a platform or field you need for your corpus isn't covered yet, say so — I usually turn additions around in a couple of days.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>llm</category>
      <category>datascience</category>
    </item>
    <item>
      <title>How to monitor a brand across 5 Chinese social platforms with Python in 2026 — the cross-platform dedup problem and how to handle it</title>
      <dc:creator>Sami</dc:creator>
      <pubDate>Sun, 31 May 2026 16:31:02 +0000</pubDate>
      <link>https://dev.to/sami_8858131362756585e4f4/how-to-monitor-a-brand-across-5-chinese-social-platforms-with-python-in-2026-the-cross-platform-3lif</link>
      <guid>https://dev.to/sami_8858131362756585e4f4/how-to-monitor-a-brand-across-5-chinese-social-platforms-with-python-in-2026-the-cross-platform-3lif</guid>
      <description>&lt;p&gt;You want to know how a brand is being talked about in China. The catch: the conversation isn't on one platform. It's split across &lt;strong&gt;Weibo&lt;/strong&gt; (microblog), &lt;strong&gt;RedNote / Xiaohongshu&lt;/strong&gt; (product &amp;amp; lifestyle), &lt;strong&gt;Bilibili&lt;/strong&gt; (video), &lt;strong&gt;Douban&lt;/strong&gt; (long-form reviews) and &lt;strong&gt;Xueqiu&lt;/strong&gt; (retail-investor chatter). So you wire up five scrapers — and &lt;em&gt;that's&lt;/em&gt; where the real work starts.&lt;/p&gt;

&lt;h2&gt;
  
  
  The part nobody warns you about
&lt;/h2&gt;

&lt;p&gt;Pulling each platform is the easy 20%. The other 80% is turning five raw feeds into one trustworthy dataset:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Five completely different shapes.&lt;/strong&gt; A "post" on Weibo, a "note" on RedNote, a "video" on Bilibili, a "review" on Douban, a "cashtag comment" on Xueqiu — different fields, different engagement metrics, different date formats. Normalizing them into one table is a chore you redo every time a platform tweaks its response.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Duplicates everywhere.&lt;/strong&gt; A KOL announces a collab and it's reposted across three platforms; creators cross-post the same clip. Count naively and your "mention volume" is inflated 2–3×, which quietly ruins every trend line and alert you build on top of it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Five moving targets.&lt;/strong&gt; Each platform changes how it serves public data on its own schedule. Keeping five pipelines alive is five maintenance burdens, not one — and they break on &lt;em&gt;their&lt;/em&gt; calendar, not yours.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-platform consistency.&lt;/strong&gt; Sentiment and author-reach have to mean the same thing on every platform, or your dashboard lies to you.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By the time you've built normalization + cross-platform dedup + sentiment + reach scoring for five platforms — and signed up to maintain it forever — you've built a data-engineering project before you've answered a single business question.&lt;/p&gt;
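
&lt;p&gt;To give a feel for the normalization chore, here is a small sketch that maps two differently shaped raw records onto one row format. The per-platform field names in &lt;code&gt;FIELD_MAP&lt;/code&gt; are hypothetical placeholders I made up for the example, not the real response fields of any platform or actor.&lt;/p&gt;

```python
from datetime import datetime, timezone

# Hypothetical raw field names per platform (assumptions for illustration only).
FIELD_MAP = {
    "weibo": {"text": "text", "author": "userName", "created": "createdAt"},
    "douban": {"text": "content", "author": "author", "created": "publishedAt"},
}


def normalize(platform, raw):
    """Map one raw record onto a shared schema; epoch timestamps become ISO 8601 UTC."""
    fields = FIELD_MAP[platform]
    created = raw.get(fields["created"])
    if isinstance(created, (int, float)):
        created = datetime.fromtimestamp(created, tz=timezone.utc).isoformat()
    return {
        "platform": platform,
        "text": raw.get(fields["text"], ""),
        "author": raw.get(fields["author"]),
        "created_at": created,
    }


rows = [
    normalize("weibo", {"text": "你好", "userName": "u1", "createdAt": 0}),
    normalize("douban", {"content": "长评", "author": "u2",
                         "publishedAt": "2026-05-30T08:00:00+00:00"}),
]
```

&lt;p&gt;Every time a platform renames a field or switches date formats, a table like &lt;code&gt;FIELD_MAP&lt;/code&gt; is what you end up patching.&lt;/p&gt;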

&lt;h2&gt;
  
  
  The shortcut: one call that returns the merged feed
&lt;/h2&gt;

&lt;p&gt;I maintain &lt;a href="https://apify.com/zhorex/chinese-brand-monitor" rel="noopener noreferrer"&gt;&lt;strong&gt;Chinese Brand Monitor&lt;/strong&gt;&lt;/a&gt; on Apify. You give it a brand keyword; it returns brand mentions across all five platforms &lt;strong&gt;already normalized into one schema, deduplicated to one canonical record per real mention, sentiment-tagged, and reach-scored&lt;/strong&gt; — so the messy 80% is just… done. Pay-as-you-go at &lt;strong&gt;$0.045 per canonical mention&lt;/strong&gt;: no subscription, no seat fee, no annual contract.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;apify_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ApifyClient&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ApifyClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_APIFY_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;actor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zhorex/chinese-brand-monitor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;brandKeyword&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;完美日记&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# Chinese or English
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;platforms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weibo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rednote&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bilibili&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;douban&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;xueqiu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lookbackDays&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sentimentAnalysis&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deduplication&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;defaultDatasetId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;iterate_items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;platform&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sentiment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;polarity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;—&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;contentSnippet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Clean rows — platform, author, follower count, engagement, sentiment, URL — straight into pandas / BigQuery / Snowflake / whatever you already run. No five-pipeline zoo to babysit.&lt;/p&gt;

&lt;h2&gt;
  
  
  What you can &lt;em&gt;build&lt;/em&gt; on top of it (i.e. how this makes you money)
&lt;/h2&gt;

&lt;p&gt;This is the point. Cheap, clean, cross-platform China data is a raw material — and there's real margin in turning it into a product:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Run a China social-listening service.&lt;/strong&gt; Agencies bill brands &lt;strong&gt;monthly&lt;/strong&gt; for "monitor my brand + 3 competitors in China." Your data cost is cents per mention; you sell the insight, the dashboard, and the recurring retainer. The data layer that used to require a &lt;strong&gt;$36K–$50K+/yr&lt;/strong&gt; enterprise tool (Synthesio, Brandwatch, Meltwater) is now a line item — the spread is yours.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sell an alt-data sentiment feed.&lt;/strong&gt; Funds pay for consumer/retail sentiment on Chinese names &lt;em&gt;ahead of the tape&lt;/em&gt;. Pull daily across a basket, build a &lt;strong&gt;7-day sentiment + mention-volume delta&lt;/strong&gt; per brand/ticker, and sell the series. At $0.045 per mention, the 20-ticker row in the cost table below works out to roughly $1.65 per name per day; that replaces a five-figure alt-data subscription.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Productize competitor sweeps.&lt;/strong&gt; One-off "how is brand X perceived vs Y in China, across 5 platforms" reports are high-margin consulting deliverables built on a few dollars of data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Supply AI / LLM teams&lt;/strong&gt; labeled, multi-platform, sentiment-tagged Chinese-language text for training corpora and current-events grounding.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In every one of these, the data is the cheap input and the &lt;em&gt;insight&lt;/em&gt; is what you charge for — gross margin on the data side sits near the 96% the Actor itself runs at.&lt;/p&gt;
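
&lt;p&gt;The "7-day sentiment + mention-volume delta" mentioned above can be sketched in a few lines. This version assumes each mention has already been reduced to a &lt;code&gt;day&lt;/code&gt; and a numeric &lt;code&gt;polarity&lt;/code&gt; between -1 and 1; both field names and the numeric scale are assumptions for the example, so adapt them to whatever the real records contain.&lt;/p&gt;

```python
from datetime import date, timedelta
from statistics import mean


def window_stats(mentions, end_day, days=7):
    """Mention count and mean polarity over the `days` days ending on end_day (inclusive)."""
    window = {end_day - timedelta(days=i) for i in range(days)}
    polarities = [m["polarity"] for m in mentions if m["day"] in window]
    return len(polarities), (mean(polarities) if polarities else 0.0)


def seven_day_delta(mentions, today):
    """Compare the latest 7-day window with the 7 days before it."""
    cur_vol, cur_sent = window_stats(mentions, today)
    prev_vol, prev_sent = window_stats(mentions, today - timedelta(days=7))
    return {
        "volume_delta": cur_vol - prev_vol,
        "sentiment_delta": cur_sent - prev_sent,
    }


mentions = [
    {"day": date(2026, 5, 30), "polarity": 0.5},
    {"day": date(2026, 5, 29), "polarity": 0.5},
    {"day": date(2026, 5, 28), "polarity": 0.5},
    {"day": date(2026, 5, 20), "polarity": -0.5},  # falls in the previous window
]
delta = seven_day_delta(mentions, today=date(2026, 5, 31))
```

&lt;p&gt;Run per brand or ticker each day and the resulting series is the product; the raw mentions are just the input.&lt;/p&gt;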

&lt;h2&gt;
  
  
  Honest comparison (where the big tools still win)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Enterprise (Synthesio / Brandwatch / Meltwater)&lt;/th&gt;
&lt;th&gt;Chinese Brand Monitor&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Managed dashboard + alerting&lt;/td&gt;
&lt;td&gt;✅ Built in&lt;/td&gt;
&lt;td&gt;❌ You bring your own BI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Global TV / podcast / news&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;td&gt;❌ Chinese social only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Account manager / SLA&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;td&gt;❌ Self-serve (issues answered, no SLA)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Price&lt;/td&gt;
&lt;td&gt;$36K–$50K+/yr, annual contract&lt;/td&gt;
&lt;td&gt;$0.045/mention, pay-as-you-go&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Raw data ownership&lt;/td&gt;
&lt;td&gt;Walled-garden export&lt;/td&gt;
&lt;td&gt;✅ Your dataset, full export&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;China platform depth&lt;/td&gt;
&lt;td&gt;Often shallow / add-on&lt;/td&gt;
&lt;td&gt;✅ Five platforms, native&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time to first data&lt;/td&gt;
&lt;td&gt;Sales cycle + onboarding&lt;/td&gt;
&lt;td&gt;Minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you want a turnkey managed platform with global coverage and a team behind it, buy the enterprise tool. If you want the Chinese social data — cheaply, in your own pipeline, with no contract — this is the layer to build on.&lt;/p&gt;

&lt;h2&gt;
  
  
  Realistic cost
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Workflow&lt;/th&gt;
&lt;th&gt;Volume&lt;/th&gt;
&lt;th&gt;Monthly cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;One brand, daily, 7-day lookback&lt;/td&gt;
&lt;td&gt;~3K mentions&lt;/td&gt;
&lt;td&gt;~$135&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5-brand agency, daily, sentiment + dedup&lt;/td&gt;
&lt;td&gt;~15K mentions&lt;/td&gt;
&lt;td&gt;~$675&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20-ticker fund, daily (Xueqiu + Weibo + RedNote)&lt;/td&gt;
&lt;td&gt;~22K mentions&lt;/td&gt;
&lt;td&gt;~$990&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;One-off competitor sweep&lt;/td&gt;
&lt;td&gt;2,500 mentions&lt;/td&gt;
&lt;td&gt;~$112&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each is a fraction of a single enterprise seat — and against what you can bill clients on top, the data cost rounds to noise.&lt;/p&gt;
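
&lt;p&gt;The table is plain multiplication at the $0.045 per-mention price quoted earlier, so you can plug in your own volumes. A quick sketch (the workflow labels are just dictionary keys I picked; the volumes are the ones from the table):&lt;/p&gt;

```python
from decimal import Decimal

PRICE_PER_MENTION = Decimal("0.045")  # per canonical mention, as quoted in this post


def data_cost(mentions):
    """Data cost in USD for a given number of canonical mentions."""
    return Decimal(mentions) * PRICE_PER_MENTION


volumes = {
    "one_brand_daily": 3_000,
    "five_brand_agency": 15_000,
    "twenty_ticker_fund": 22_000,
    "one_off_sweep": 2_500,
}
costs = {name: data_cost(n) for name, n in volumes.items()}

for name, cost in costs.items():
    print(f"{name}: ${cost:.2f}")
```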

&lt;h2&gt;
  
  
  What it's NOT
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Not a managed dashboard.&lt;/strong&gt; It's the data layer; you bring the visualization (that's also where your margin is).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not global coverage.&lt;/strong&gt; Chinese social platforms only — by design.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not real-time streaming.&lt;/strong&gt; Cron-based polling; great for daily/hourly monitoring.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not authenticated/private content.&lt;/strong&gt; Public surface only.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  If you only need one platform
&lt;/h2&gt;

&lt;p&gt;The aggregator is for cross-platform monitoring. If you only need depth on a single platform, the standalone Actors go deeper:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://apify.com/zhorex/weibo-scraper" rel="noopener noreferrer"&gt;Weibo Scraper&lt;/a&gt; — microblog, hot search, KOL posts&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://apify.com/zhorex/rednote-xiaohongshu-scraper" rel="noopener noreferrer"&gt;RedNote / Xiaohongshu Scraper&lt;/a&gt; — lifestyle / product sentiment&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://apify.com/zhorex/bilibili-scraper" rel="noopener noreferrer"&gt;Bilibili Scraper&lt;/a&gt; — video + creator analytics&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://apify.com/zhorex/xueqiu-scraper" rel="noopener noreferrer"&gt;Xueqiu Scraper&lt;/a&gt; — retail-investor / cashtag sentiment&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://apify.com/zhorex/douban-scraper" rel="noopener noreferrer"&gt;Douban Scraper&lt;/a&gt; — long-form reviews&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;Apify's free tier covers a first run, so you can see the output shape before committing a cent. Start here: &lt;a href="https://apify.com/zhorex/chinese-brand-monitor" rel="noopener noreferrer"&gt;&lt;strong&gt;zhorex/chinese-brand-monitor&lt;/strong&gt;&lt;/a&gt;. If a field or platform you need isn't there, open an issue on the Actor page — I usually turn fixes around in a couple of days.&lt;/p&gt;

</description>
      <category>china</category>
      <category>webscraping</category>
      <category>python</category>
      <category>datascience</category>
    </item>
    <item>
      <title>How China-focused funds turn Weibo into alt-data (Python, 2026)</title>
      <dc:creator>Sami</dc:creator>
      <pubDate>Fri, 29 May 2026 22:17:26 +0000</pubDate>
      <link>https://dev.to/sami_8858131362756585e4f4/how-china-focused-funds-turn-weibo-into-alt-data-python-2026-194o</link>
      <guid>https://dev.to/sami_8858131362756585e4f4/how-china-focused-funds-turn-weibo-into-alt-data-python-2026-194o</guid>
      <description>&lt;p&gt;If you run a China book — equities, FX, commodities, or just a macro tilt — you already know the problem: the official numbers are slow and the English-language coverage is downstream of what already moved on Chinese social platforms. By the time a theme reaches Bloomberg, retail Weibo has been talking about it for days.&lt;/p&gt;

&lt;p&gt;Weibo (微博) is where Chinese consumer and retail-investor sentiment shows up first. 580M+ monthly actives, a public hot-search board that turns over hourly, and cashtag-style chatter on every listed name. The catch: there's no official API for international developers, and the data is in Chinese.&lt;/p&gt;

&lt;p&gt;This post walks through how to pull Weibo into a usable alt-data feed with a few lines of Python — hot-search trend tracking, keyword/cashtag sentiment, and KOL post monitoring — using an Apify Actor I maintain, so you don't have to babysit visitor cookies or rate limits.&lt;/p&gt;

&lt;h2&gt;
  
  
  The three signals worth pulling
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Hot search board (the leading indicator).&lt;/strong&gt; Weibo's trending board is the single fastest read on what its hundreds of millions of users are paying attention to. A brand, a policy rumor, a product recall, a CEO quote: it surfaces here first. For a fund, the delta matters more than the snapshot: what entered the board in the last hour, and how fast it's climbing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Keyword / cashtag sentiment.&lt;/strong&gt; Search a ticker's Chinese name, a brand, or a product line and you get the raw retail read — positive, negative, the volume of chatter, and which posts have reach. This is the consumer-demand nowcast that quarterly filings give you 90 days late.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. KOL post monitoring.&lt;/strong&gt; A single finance or consumer KOL with 5M followers moves retail flows in hours. Tracking specific accounts' posts (and their engagement velocity) is a cleaner signal than aggregate noise.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pull the hot-search board
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;apify_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ApifyClient&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ApifyClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_APIFY_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;actor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zhorex/weibo-scraper&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hot_search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maxResults&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;defaultDatasetId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;iterate_items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;heat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run this on a cron every 30-60 minutes and diff consecutive snapshots. A topic that jumps 40 ranks in one hour is the alpha — not its absolute position.&lt;/p&gt;
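
&lt;p&gt;Diffing two snapshots takes only a few lines. The sketch below assumes each snapshot is a list of dicts with the &lt;code&gt;rank&lt;/code&gt; and &lt;code&gt;title&lt;/code&gt; fields printed above; the 40-rank threshold is the example figure from this post, not a tuned parameter.&lt;/p&gt;

```python
def rank_moves(previous, current):
    """Map each title in `current` to ranks climbed since `previous` (None if new)."""
    prev_rank = {t["title"]: t["rank"] for t in previous}
    moves = {}
    for topic in current:
        old = prev_rank.get(topic["title"])
        # Rank 1 is the top, so old minus new is positive when a topic climbs.
        moves[topic["title"]] = None if old is None else old - topic["rank"]
    return moves


def movers(previous, current, min_jump=40):
    """Return (new entries, topics that climbed at least min_jump ranks)."""
    moves = rank_moves(previous, current)
    new_entries = [t for t, m in moves.items() if m is None]
    climbers = [t for t, m in moves.items() if m is not None and m >= min_jump]
    return new_entries, climbers


previous = [{"rank": 45, "title": "话题A"}, {"rank": 2, "title": "话题B"}]
current = [
    {"rank": 3, "title": "话题A"},
    {"rank": 4, "title": "话题B"},
    {"rank": 10, "title": "话题C"},
]
new_entries, climbers = movers(previous, current)
```

&lt;p&gt;Persist each snapshot with its retrieval time so the same diff can be replayed later over any pair of runs.&lt;/p&gt;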

&lt;h2&gt;
  
  
  Keyword sentiment as a consumer nowcast
&lt;/h2&gt;

&lt;p&gt;Say you're long a Chinese EV name and want the retail read before the delivery numbers print:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;apify_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ApifyClient&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ApifyClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_APIFY_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;actor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zhorex/weibo-scraper&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;searchQuery&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;比亚迪&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;# BYD in Chinese — Chinese keywords yield far better recall
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maxResults&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;defaultDatasetId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;iterate_items&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="c1"&gt;# Reach-weight the chatter: a 2M-follower account counts more than a burner.
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reach&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;repostsCount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;commentsCount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;likesCount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reach&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ascending&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)[[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reach&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;createdAt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]].&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pipe the &lt;code&gt;text&lt;/code&gt; field through whatever Chinese sentiment model you already run (or a multilingual LLM), repeat the pull once a day, and you have a daily polarity series per name. Track the 7-day change in mention volume and mean polarity and you've built a sentiment-velocity factor for a few cents per run.&lt;/p&gt;
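
&lt;p&gt;Here's a minimal sketch of that factor, under a few assumptions of mine: you've already added a numeric &lt;code&gt;polarity&lt;/code&gt; column from your own model (the scraper doesn't return one), &lt;code&gt;createdAt&lt;/code&gt; parses with &lt;code&gt;pd.to_datetime&lt;/code&gt;, and you've accumulated posts from more than seven days of runs. The name &lt;code&gt;sentiment_velocity&lt;/code&gt; is illustrative, not part of the Actor.&lt;/p&gt;

```python
import pandas as pd


def sentiment_velocity(posts, window=7):
    """Daily mention volume and mean polarity, plus their change over `window` days.

    `posts` needs a `createdAt` column and a numeric `polarity` column.
    `polarity` is hypothetical: it comes from your own sentiment model.
    """
    daily = (
        posts.assign(day=pd.to_datetime(posts["createdAt"]).dt.floor("D"))
        .groupby("day")
        .agg(mentions=("polarity", "size"), polarity=("polarity", "mean"))
        .asfreq("D")  # make calendar days with no posts explicit
    )
    daily["mentions"] = daily["mentions"].fillna(0)  # no posts means zero mentions
    daily["mentions_delta"] = daily["mentions"].diff(window)
    daily["polarity_delta"] = daily["polarity"].diff(window)
    return daily
```

&lt;p&gt;The last row of the result is the current reading. Days with no posts keep a missing polarity rather than a zero, so a quiet day doesn't masquerade as neutral sentiment.&lt;/p&gt;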

&lt;h2&gt;
  
  
  Build a daily China alt-data job
&lt;/h2&gt;

&lt;p&gt;The two actors that matter together: Weibo for broad consumer + retail sentiment, and the &lt;a href="https://apify.com/zhorex/xueqiu-scraper" rel="noopener noreferrer"&gt;Xueqiu Scraper&lt;/a&gt; for finance-specific cashtag chatter (Xueqiu is China's retail-investor forum — closer to a StockTwits read). Run both on the same cron, join on ticker, and you get consumer sentiment and investor sentiment side by side.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;tickers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BYD&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;比亚迪&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Pop Mart&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;泡泡玛特&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Luckin&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;瑞幸咖啡&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;zh&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tickers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;actor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zhorex/weibo-scraper&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;searchQuery&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;zh&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maxResults&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="n"&gt;items&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;defaultDatasetId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;iterate_items&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mentions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;)})&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;sort_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mentions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ascending&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



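&lt;p&gt;Diff today's mention counts against a trailing 7-day mean and you have a chatter-velocity screen across your whole China book. One caveat: &lt;code&gt;len(items)&lt;/code&gt; can never exceed &lt;code&gt;maxResults&lt;/code&gt;, so set the cap above the volume you expect and count only posts whose &lt;code&gt;createdAt&lt;/code&gt; falls on the current day. Otherwise every busy name saturates at the same number.&lt;/p&gt;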
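
&lt;p&gt;A rough sketch of that screen, assuming you persist each run's counts into a long table with &lt;code&gt;day&lt;/code&gt;, &lt;code&gt;name&lt;/code&gt;, and &lt;code&gt;mentions&lt;/code&gt; columns. That layout and the function name &lt;code&gt;chatter_velocity&lt;/code&gt; are my own choices for illustration, not something the Actor produces.&lt;/p&gt;

```python
import pandas as pd


def chatter_velocity(history, window=7):
    """Latest mention count per name relative to the mean of the prior `window` days.

    `history` is a long table with columns day, name, mentions
    (one row per name per daily run; this layout is an assumption).
    """
    wide = history.pivot(index="day", columns="name", values="mentions").sort_index()
    # shift(1) keeps the latest day out of its own baseline
    baseline = wide.shift(1).rolling(window, min_periods=window).mean()
    screen = pd.DataFrame({
        "today": wide.iloc[-1],
        "trailing_mean": baseline.iloc[-1],
    })
    screen["ratio"] = screen["today"] / screen["trailing_mean"]
    return screen.sort_values("ratio", ascending=False)
```

&lt;p&gt;A ratio above 1 means a name is running hotter than its own recent baseline. Names with fewer than &lt;code&gt;window&lt;/code&gt; prior days of history come back with a missing ratio instead of a misleading one.&lt;/p&gt;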

&lt;h2&gt;
  
  
  Pricing
&lt;/h2&gt;

&lt;p&gt;The Weibo Scraper is pay-per-event — you pay per item returned, no subscription, no seat fee. A 300-post sentiment pull is a few cents. A daily 20-ticker monitoring job across the month lands in the low tens of dollars. Compare that to a Bloomberg China module or a packaged alt-data feed and the math is not close.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Job&lt;/th&gt;
&lt;th&gt;Volume&lt;/th&gt;
&lt;th&gt;Rough monthly cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Hourly hot-search tracker&lt;/td&gt;
&lt;td&gt;~70K topics/mo&lt;/td&gt;
&lt;td&gt;low tens of $&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20-ticker daily sentiment&lt;/td&gt;
&lt;td&gt;~120K posts/mo&lt;/td&gt;
&lt;td&gt;tens of $&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;One-off theme research&lt;/td&gt;
&lt;td&gt;a few K posts&lt;/td&gt;
&lt;td&gt;a few $&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;See the Actor's Pricing tab for the exact per-result rate.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this is NOT
&lt;/h2&gt;

&lt;p&gt;Honest scoping, because sophisticated buyers care more about this than the pitch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Not real-time tick data.&lt;/strong&gt; Cron-based polling; 30-60 min cadence is realistic and plenty for sentiment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not a sentiment model.&lt;/strong&gt; It returns the raw posts + engagement + metadata. You bring (or plug in) the NLP.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not authenticated content.&lt;/strong&gt; Public surface only — hot search, public search results, public profiles. Some modes (user timelines) work better with your own session cookie, which is optional.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not financial advice or a signal in a box.&lt;/strong&gt; It's a data feed. The factor construction is yours.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The broader China stack
&lt;/h2&gt;

&lt;p&gt;If Weibo is the consumer + retail-sentiment layer, the rest of the stack fills in the gaps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://apify.com/zhorex/xueqiu-scraper" rel="noopener noreferrer"&gt;Xueqiu Scraper&lt;/a&gt; — retail-investor forum, cashtag-tagged, the finance-specific sentiment read&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://apify.com/zhorex/rednote-xiaohongshu-scraper" rel="noopener noreferrer"&gt;RedNote / Xiaohongshu Scraper&lt;/a&gt; — consumer-brand and product sentiment, the highest-trust purchase-decision channel in China&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://apify.com/zhorex/bilibili-scraper" rel="noopener noreferrer"&gt;Bilibili Scraper&lt;/a&gt; — Gen-Z video sentiment and creator analytics&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://apify.com/zhorex/chinese-brand-monitor" rel="noopener noreferrer"&gt;Chinese Brand Monitor&lt;/a&gt; — if you'd rather not wire up four scrapers, this aggregates Weibo + RedNote + Bilibili + Douban + Xueqiu into one normalized, deduplicated, sentiment-tagged feed at a per-mention price&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;A small Weibo pull costs cents, and Apify's free tier covers a first run. Start here: &lt;a href="https://apify.com/zhorex/weibo-scraper" rel="noopener noreferrer"&gt;zhorex/weibo-scraper&lt;/a&gt;. If you build a China sentiment factor on top of it, I'd genuinely like to hear how — drop a comment or open an issue on the Actor page.&lt;/p&gt;

</description>
      <category>python</category>
      <category>webscraping</category>
      <category>china</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Synthesio charges $36K+/year for Chinese platform coverage. I built one for $0.045/mention.</title>
      <dc:creator>Sami</dc:creator>
      <pubDate>Wed, 20 May 2026 01:46:55 +0000</pubDate>
      <link>https://dev.to/sami_8858131362756585e4f4/synthesio-charges-36kyear-for-chinese-platform-coverage-i-built-one-for-0045mention-4d1l</link>
      <guid>https://dev.to/sami_8858131362756585e4f4/synthesio-charges-36kyear-for-chinese-platform-coverage-i-built-one-for-0045mention-4d1l</guid>
      <description>&lt;p&gt;Synthesio sells Chinese platform coverage for $36K+/year. Brandwatch and Meltwater sit in roughly the same $24K-80K/year band. I built an Apify Actor that does the equivalent core job — Weibo, RedNote, Bilibili, Douban, Xueqiu — for $0.045 per deduplicated mention, billed pay-as-you-go.&lt;/p&gt;

&lt;p&gt;If you've ever tried to DIY this, you know the math. Five Chinese platforms means five different parsers, five different rate-limit dances, five different schema-drift surprises every couple of weeks, and zero deduplication when a KOL reposts the same content across all of them. By the time you've normalized author identity, follower counts, and timestamps into a usable cross-platform record, you've built a small distributed system that breaks every other Tuesday.&lt;/p&gt;

&lt;p&gt;The pitch for &lt;code&gt;zhorex/chinese-brand-monitor&lt;/code&gt; is simple: one API call, one normalized schema, one pay-per-event (PPE) charge per canonical mention. You pass a brand keyword (Chinese or English) and get back deduplicated records with sentiment scores and reach signals across all five platforms. You don't write per-platform code. You don't run five cron jobs. You don't pay an enterprise floor.&lt;/p&gt;

&lt;p&gt;This post walks through six concrete workflows with runnable Python — brand health, crisis monitoring, KOL discovery, hedge fund alt-data, AI training corpora, and a cross-tool finance signal — so you can decide if this fits your stack.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it does
&lt;/h2&gt;

&lt;p&gt;The Actor takes a single brand keyword (or a list of keywords) and returns deduplicated, sentiment-scored mentions from five Chinese platforms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Weibo&lt;/strong&gt; — China's largest microblog; broad consumer chatter&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RedNote / Xiaohongshu (小红书)&lt;/strong&gt; — lifestyle and product discovery; heavy DTC signal&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bilibili&lt;/strong&gt; — long-form video community; strong Gen-Z signal&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Douban&lt;/strong&gt; — long-form reviews, especially media and lifestyle&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Xueqiu (雪球)&lt;/strong&gt; — retail investor chatter, cashtag-tracked stock sentiment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Actor handles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Single keyword input — Chinese &lt;code&gt;护肤&lt;/code&gt; or English &lt;code&gt;Estée Lauder&lt;/code&gt; both work&lt;/li&gt;
&lt;li&gt;Normalized cross-platform schema — same fields on every record, no per-platform parsing in your downstream code&lt;/li&gt;
&lt;li&gt;Lexicon-based Chinese sentiment scoring per mention (polarity + score)&lt;/li&gt;
&lt;li&gt;Cross-platform deduplication — when the same KOL reposts identical content on Weibo and RedNote, you get one canonical record with &lt;code&gt;crossPlatformReposts&lt;/code&gt; listing the other appearances&lt;/li&gt;
&lt;li&gt;Author identity normalization with follower count for reach-weighted analysis&lt;/li&gt;
&lt;/ul&gt;
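
&lt;p&gt;To make the deduplication idea concrete, here is a simplified sketch of one way to collapse identical content across platforms. It is not the Actor's actual implementation: the normalization steps, the helper names &lt;code&gt;content_key&lt;/code&gt; and &lt;code&gt;dedupe&lt;/code&gt;, and the shape of each &lt;code&gt;crossPlatformReposts&lt;/code&gt; entry are assumptions for illustration. It only catches exact matches after normalization, not paraphrases.&lt;/p&gt;

```python
import hashlib
import re
import unicodedata


def content_key(text):
    """Fingerprint of a post body after light normalization."""
    # NFKC folds full-width letters and ideographic spaces into their plain forms
    folded = unicodedata.normalize("NFKC", text).lower()
    folded = re.sub(r"\s+", "", folded)  # drop all whitespace
    return hashlib.sha256(folded.encode("utf-8")).hexdigest()


def dedupe(mentions):
    """Collapse mentions with identical normalized text into one canonical record.

    The first mention seen for each fingerprint becomes canonical; later ones
    are listed under crossPlatformReposts (entry shape is an assumption).
    """
    canonical = {}
    for m in mentions:
        key = content_key(m["contentSnippet"])
        if key not in canonical:
            canonical[key] = {**m, "crossPlatformReposts": []}
        else:
            canonical[key]["crossPlatformReposts"].append(
                {"platform": m["platform"], "url": m["url"]}
            )
    return list(canonical.values())
```

&lt;p&gt;Feed it mentions sorted by timestamp and the earliest appearance becomes the canonical record, which is usually what you want when tracing where a story started.&lt;/p&gt;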

&lt;p&gt;Engineering choices worth knowing: a browser-grade HTTP client, polite rate limiting, session warming, and a public-data scope that respects each platform's accessible surface. The point is that you don't have to think about any of that — you call the Actor, you get records.&lt;/p&gt;

&lt;h2&gt;
  
  
  Six concrete workflows
&lt;/h2&gt;

&lt;h3&gt;
  
  
  a) Brand health dashboard (~$135/mo)
&lt;/h3&gt;

&lt;p&gt;Daily 8am cron, single brand, 7-day rolling lookback. Push to Looker, Metabase, or a Notion database. Compare this to a $4K/mo Synthesio seat for the same functional coverage.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;apify_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ApifyClient&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ApifyClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_APIFY_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;run_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;brandKeyword&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Estée Lauder&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;platforms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weibo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rednote&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bilibili&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;douban&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;xueqiu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lookbackDays&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maxMentionsPerPlatform&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;600&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sentimentAnalysis&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deduplication&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;actor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zhorex/chinese-brand-monitor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;run_input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;items&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;defaultDatasetId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;iterate_items&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;polarity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sentiment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;polarity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;platform&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;polarity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mentions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mentionId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
           &lt;span class="n"&gt;reach&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;authorFollowerCount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sum&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The grouped DataFrame is what you push to your BI tool. ~3,000 deduplicated mentions/month at this cadence lands around $135 in PPE charges.&lt;/p&gt;
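
&lt;p&gt;The cost arithmetic is simple enough to keep next to your cron config. A small sketch using the $0.045 per-mention price quoted in this post; the helper names are mine, and &lt;code&gt;Decimal&lt;/code&gt; is used so the cents don't drift with float rounding.&lt;/p&gt;

```python
from decimal import Decimal

# USD per deduplicated mention, as quoted in this post
PRICE_PER_MENTION = Decimal("0.045")


def monthly_cost(mentions_per_month):
    """PPE charge in USD for a month's worth of deduplicated mentions."""
    return PRICE_PER_MENTION * mentions_per_month


def mentions_for_budget(budget_usd):
    """How many deduplicated mentions a monthly budget covers, rounded down."""
    return int(Decimal(str(budget_usd)) // PRICE_PER_MENTION)


print(monthly_cost(3000))       # the brand health volume above
print(mentions_for_budget(50))  # what a 50 USD monthly cap would cover
```

&lt;p&gt;Check the Actor's Pricing tab before relying on the constant, since the per-mention rate may change.&lt;/p&gt;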

&lt;h3&gt;
  
  
  b) Crisis monitoring (~$270/mo)
&lt;/h3&gt;

&lt;p&gt;Hourly cron, 1-day lookback, filter for negative polarity from accounts above 10K followers. Slack webhook fires on match. This is the workflow that justifies the spend during a product recall, a CEO quote going viral, or a competitor smear campaign.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;apify_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ApifyClient&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ApifyClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_APIFY_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;SLACK_WEBHOOK&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://hooks.slack.com/services/XXX/YYY/ZZZ&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;actor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zhorex/chinese-brand-monitor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;brandKeyword&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Estée Lauder&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;platforms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weibo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rednote&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bilibili&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;douban&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;xueqiu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lookbackDays&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maxMentionsPerPlatform&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sentimentAnalysis&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;defaultDatasetId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;iterate_items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sentiment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;polarity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;negative&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;authorFollowerCount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SLACK_WEBHOOK&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;platform&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;authorName&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;authorFollowerCount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; followers) — &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sentiment &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sentiment&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;contentSnippet&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hourly runs (24 × 30 = 720 per month) yield ≈ 6,000 deduplicated mentions/month if the brand has steady chatter, roughly $270/mo. Cheap insurance for a comms team.&lt;/p&gt;

&lt;h3&gt;
  
  
  c) KOL identification (~$90/mo)
&lt;/h3&gt;

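&lt;p&gt;Run one category keyword per week, for example &lt;code&gt;护肤&lt;/code&gt; (skincare), &lt;code&gt;球鞋&lt;/code&gt; (sneakers), or &lt;code&gt;保健品&lt;/code&gt; (supplements). Keep verified authors with at least 50K followers, then rank them by engagement. The example below scores engagement as likes + 3 × comments + 5 × shares and keeps each author's highest-scoring post.&lt;br&gt;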
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;apify_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ApifyClient&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ApifyClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_APIFY_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;actor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zhorex/chinese-brand-monitor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;brandKeyword&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;护肤&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;platforms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weibo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rednote&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bilibili&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;douban&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lookbackDays&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maxMentionsPerPlatform&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;defaultDatasetId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;iterate_items&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;engagement&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;engagementMetrics&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;likes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;comments&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;shares&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;candidates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;authorVerified&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;authorFollowerCount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;50000&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;engagement&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ascending&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop_duplicates&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;authorId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;authorName&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;platform&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;authorFollowerCount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;engagement&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kol_candidates.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A weekly cadence on 1-2 category keywords yields ≈ 2,000 mentions/month, roughly $90/mo. The output is a ranked candidate list your social team can contact directly.&lt;/p&gt;
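
&lt;p&gt;The cost estimates in these scenarios work out to about $0.045 per mention (for example $270 / 6,000 and $90 / 2,000). Here is a minimal sketch for budgeting your own cadence, assuming that implied rate holds. &lt;code&gt;PRICE_PER_MENTION&lt;/code&gt;, &lt;code&gt;monthly_cost&lt;/code&gt;, and the scenario labels are hypothetical names for illustration, not part of the actor's API, and the rate is derived from this article's figures rather than from a published price list.&lt;/p&gt;

```python
# Per-mention rate implied by the article's own figures (an assumption,
# not a published price): $270 / 6,000 mentions = $0.045.
PRICE_PER_MENTION = 270 / 6000


def monthly_cost(mentions, price_per_mention=PRICE_PER_MENTION):
    """Estimate spend in dollars for a given mention volume."""
    return round(mentions * price_per_mention, 2)


# Mention volumes quoted in the article; the labels are illustrative.
scenarios = {
    "hourly alerts": 6_000,
    "weekly KOL scan": 2_000,
    "daily alt-data": 22_000,
    "one-shot corpus": 50_000,
}

for name, mentions in scenarios.items():
    cost = monthly_cost(mentions)
    print(f"{name}: {mentions:,} mentions is about ${cost:,.2f}")
```

&lt;p&gt;Swap in your own expected volume to sanity-check a budget before scheduling runs; actual volume depends on how much chatter the keyword generates.&lt;/p&gt;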

&lt;h3&gt;
  
  
  d) Hedge fund alt-data (~$990/mo)
&lt;/h3&gt;

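&lt;p&gt;Daily run across 20 portfolio tickers on Xueqiu + Weibo + RedNote. Build a sentiment-velocity feature: a mention-count delta paired with a polarity shift. The example joins a 1-day pull against a 7-day pull, so velocity is today's mention count minus the 7-day daily average, and polarity shift is today's mean sentiment score minus the 7-day mean.&lt;br&gt;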
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;apify_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ApifyClient&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ApifyClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_APIFY_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;tickers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BABA&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PDD&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;JD&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BIDU&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NIO&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;XPEV&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LI&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MEITUAN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TENCENT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BYD&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LKNCY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TME&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BILI&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VIPS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TAL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YMM&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DIDI&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ZH&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NTES&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FUTU&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;pull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lookback_days&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ticker&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tickers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;actor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zhorex/chinese-brand-monitor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;brandKeyword&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ticker&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;platforms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;xueqiu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weibo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rednote&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lookbackDays&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;lookback_days&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sentimentAnalysis&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;defaultDatasetId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;iterate_items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;today&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;pull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;week&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;pull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sentiment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;brandKeyword&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mentionId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;avg_polarity&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mean&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;today_agg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;today&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;week_agg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;week&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
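&lt;span class="c1"&gt;# Outer join keeps tickers with no mentions in the 1-day pull; count them as 0.
&lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;today_agg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;week_agg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;how&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;outer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lsuffix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_1d&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rsuffix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_7d&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;count_1d&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;count_1d&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;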
&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;velocity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;count_1d&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;count_7d&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;polarity_shift&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;avg_polarity_1d&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;avg_polarity_7d&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;velocity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ascending&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
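
&lt;p&gt;To make the velocity arithmetic concrete, here is a small sketch on made-up counts, using plain Python instead of pandas. The ticker counts are invented for illustration and &lt;code&gt;velocity&lt;/code&gt; is a hypothetical helper. It applies the same formula as the example (1-day count minus the 7-day daily average) and, as an assumption of this sketch, treats a ticker that is missing from the 1-day pull as zero mentions.&lt;/p&gt;

```python
def velocity(count_1d, count_7d):
    """Today's mention count minus the 7-day daily average."""
    return count_1d - count_7d / 7


# Invented counts for illustration only.
counts_7d = {"BABA": 700, "NIO": 140, "JD": 35}
counts_1d = {"BABA": 160, "NIO": 10}  # JD had no mentions today

# A ticker absent from the 1-day pull counts as 0 mentions today.
features = {
    ticker: velocity(counts_1d.get(ticker, 0), total)
    for ticker, total in counts_7d.items()
}

# Rank tickers from fastest-rising to fastest-falling chatter.
ranked = sorted(features.items(), key=lambda kv: kv[1], reverse=True)
for ticker, v in ranked:
    print(f"{ticker}: {v:+.1f}")
```

&lt;p&gt;A positive value means today ran above the weekly average; a negative value means chatter is cooling off, which is why a ticker with no mentions today should still show up in the ranking.&lt;/p&gt;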



&lt;p&gt;20 tickers × daily × 3 platforms ≈ 22K mentions/month, roughly $990/mo. Compare that to a single Bloomberg terminal at ~$28K/year for one analyst.&lt;/p&gt;

&lt;h3&gt;
  
  
  e) AI training corpus (~$2,250 one-shot)
&lt;/h3&gt;

&lt;p&gt;50 brand keywords × up to 1,000 mentions each (5 platforms × a 200-per-platform cap) = 50K labeled Chinese-language records for SFT or RLHF corpora. Every record carries an explicit sentiment polarity, author follower bracket, and platform. Compare that to the $15K-50K academic licensing fees for comparable annotated Chinese sentiment corpora.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;apify_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ApifyClient&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ApifyClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_APIFY_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;brands&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;华为&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;小米&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;比亚迪&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;蔚来&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;理想汽车&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;拼多多&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;美团&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;完美日记&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;花西子&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;钟薛高&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;元气森林&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;瑞幸咖啡&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;海底捞&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;# ... 50 total
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;china_sft_corpus.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;brand&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;brands&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;actor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zhorex/chinese-brand-monitor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;brandKeyword&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;brand&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;platforms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weibo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rednote&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bilibili&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;douban&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;xueqiu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lookbackDays&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maxMentionsPerPlatform&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sentimentAnalysis&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;defaultDatasetId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;iterate_items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;label&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sentiment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;polarity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sentiment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;platform&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;platform&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;brand&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;brandKeyword&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;ensure_ascii&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;50K records × $0.045 = $2,250. One-shot. No annotator contracts, no FTE-month spent labeling.&lt;/p&gt;

&lt;h3&gt;
  
  
  f) Cross-tool finance signal: Xueqiu sentiment × TradingView price
&lt;/h3&gt;

&lt;p&gt;Pair the Chinese Brand Monitor with &lt;a href="https://apify.com/zhorex/tradingview-scraper" rel="noopener noreferrer"&gt;the TradingView Scraper&lt;/a&gt; for a sentiment-vs-price divergence signal. When Xueqiu retail sentiment turns sharply positive while the price stays flat or drifts down, you have a setup worth a closer look.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;apify_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ApifyClient&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ApifyClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_APIFY_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;sent_run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;actor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zhorex/chinese-brand-monitor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;brandKeyword&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BABA&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;platforms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;xueqiu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lookbackDays&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sentimentAnalysis&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;sent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sent_run&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;defaultDatasetId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;iterate_items&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
&lt;span class="n"&gt;sent_score_7d&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sent&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sentiment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;price_run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;actor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zhorex/tradingview-scraper&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;technical_analysis&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;symbols&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NYSE:BABA&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;includeIndicators&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;iter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;price_run&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;defaultDatasetId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;iterate_items&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
&lt;span class="n"&gt;perf_week_pct&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;perfWeek&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;

&lt;span class="c1"&gt;# Positive Xueqiu sentiment minus weekly price return: large positive = retail
# is loud-bullish but the tape hasn't caught up yet.
&lt;/span&gt;&lt;span class="n"&gt;divergence&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sent_score_7d&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;perf_week_pct&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ticker&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BABA&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;xueqiu_sentiment_7d&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sent_score_7d&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tradingview_perfWeek_pct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;perfWeek&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;divergence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;divergence&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



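&lt;p&gt;A large positive divergence reads as "sentiment positive, price flat or down." The sentiment score and the weekly return are not on the same scale, so treat the number as a ranking signal across tickers rather than a calibrated spread. That's the setup quants pay alt-data brokers tens of thousands a year to surface.&lt;/p&gt;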

&lt;h2&gt;
  
  
  Normalized output schema
&lt;/h2&gt;

&lt;p&gt;Every record across every platform has this shape:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mentionId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"rednote_8b3c2f91a4"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"platform"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"rednote"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"brandKeyword"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Estée Lauder"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"brandMatchType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"exact"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"雅诗兰黛小棕瓶用了三个月，肌肤紧致很多..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"contentSnippet"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"雅诗兰黛小棕瓶用了三个月..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"language"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"zh-CN"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"authorId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"rednote_user_4429871"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"authorName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"小琳护肤日记"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"authorFollowerCount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;184230&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"authorVerified"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"publishedAt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-05-18T14:23:11Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"engagementMetrics"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"likes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2104&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"comments"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;187&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"shares"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;56&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"views"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;18430&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://www.xiaohongshu.com/explore/..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mediaUrls"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"https://sns-img-...jpg"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"sentiment"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"polarity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"positive"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.78&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"method"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"lexicon"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"crossPlatformReposts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"platform"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"weibo"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://weibo.com/..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"publishedAt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-05-18T15:02:00Z"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"scrapedAt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-05-20T08:00:01Z"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your downstream code stays platform-agnostic. Pandas, BigQuery, Snowflake, ClickHouse — pick your warehouse and the records load directly.&lt;/p&gt;
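
&lt;p&gt;Most warehouses prefer flat columns, so the nested &lt;code&gt;engagementMetrics&lt;/code&gt; and &lt;code&gt;sentiment&lt;/code&gt; objects usually need flattening before load. Here is a minimal sketch using only the standard library; the &lt;code&gt;flatten&lt;/code&gt; helper and the dotted column names are illustrative choices, not part of the Actor, and the sample is a trimmed copy of the record above.&lt;/p&gt;

```python
import json


def flatten(record, prefix=""):
    """Flatten nested dicts into dotted column names; lists are kept as JSON strings."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse so engagementMetrics.likes becomes its own column.
            flat.update(flatten(value, prefix=f"{name}."))
        elif isinstance(value, list):
            # Keep arrays (reposts, media URLs) as a JSON string column.
            flat[name] = json.dumps(value, ensure_ascii=False)
        else:
            flat[name] = value
    return flat


# Trimmed copy of the sample record shown above.
sample = {
    "mentionId": "rednote_8b3c2f91a4",
    "platform": "rednote",
    "engagementMetrics": {"likes": 2104, "comments": 187, "shares": 56, "views": 18430},
    "sentiment": {"polarity": "positive", "score": 0.78, "method": "lexicon"},
    "crossPlatformReposts": [{"platform": "weibo"}],
}

row = flatten(sample)
for column in sorted(row):
    print(column, "=", row[column])
```

&lt;p&gt;Keeping arrays as JSON strings is one option among several; if you need repost-level analysis, exploding &lt;code&gt;crossPlatformReposts&lt;/code&gt; into its own table may suit you better.&lt;/p&gt;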

&lt;h2&gt;
  
  
  Pricing
&lt;/h2&gt;

&lt;p&gt;$0.045 per canonical mention, billed only after deduplication. If a KOL reposts the same content across Weibo + RedNote + Bilibili, that's one billable mention with the reposts attached, not three.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use case&lt;/th&gt;
&lt;th&gt;Volume&lt;/th&gt;
&lt;th&gt;Monthly cost&lt;/th&gt;
&lt;th&gt;Enterprise alternative&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Single brand, daily, 7-day lookback&lt;/td&gt;
&lt;td&gt;~3K/mo&lt;/td&gt;
&lt;td&gt;~$135&lt;/td&gt;
&lt;td&gt;$4K/mo Synthesio seat&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5-brand agency, daily, sentiment + dedup&lt;/td&gt;
&lt;td&gt;~15K/mo&lt;/td&gt;
&lt;td&gt;~$675&lt;/td&gt;
&lt;td&gt;$24K-80K/year&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20-ticker hedge fund&lt;/td&gt;
&lt;td&gt;~22K/mo&lt;/td&gt;
&lt;td&gt;~$990&lt;/td&gt;
&lt;td&gt;$28K/year Bloomberg seat&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI training corpus one-shot&lt;/td&gt;
&lt;td&gt;50K&lt;/td&gt;
&lt;td&gt;~$2,250 (one-off)&lt;/td&gt;
&lt;td&gt;$15K-50K academic license&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
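
&lt;p&gt;The table is straight multiplication, and a few lines reproduce it for your own volumes. This sketch assumes only the $0.045 per-mention price stated above; the function names are illustrative, and &lt;code&gt;Decimal&lt;/code&gt; is used to keep the dollar figures exact.&lt;/p&gt;

```python
from decimal import Decimal

PRICE_PER_MENTION = Decimal("0.045")  # USD per canonical (deduplicated) mention


def estimate_cost(canonical_mentions: int) -> Decimal:
    """Cost in USD for a given number of canonical mentions."""
    return PRICE_PER_MENTION * canonical_mentions


def mentions_for_budget(budget_usd: str) -> int:
    """How many canonical mentions a budget covers, rounded down."""
    return int(Decimal(budget_usd) // PRICE_PER_MENTION)


# Volumes taken from the table above.
scenarios = {
    "single brand": 3_000,
    "5-brand agency": 15_000,
    "20-ticker hedge fund": 22_000,
    "AI training corpus": 50_000,
}

for name, volume in scenarios.items():
    print(f"{name}: ${estimate_cost(volume):,.2f}")

print("mentions covered by $5:", mentions_for_budget("5"))
```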

&lt;h2&gt;
  
  
  What this Actor does NOT do
&lt;/h2&gt;

&lt;p&gt;Honest scoping matters more than pitch volume:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Not real-time push streaming.&lt;/strong&gt; Cron-based polling, 5-minute minimum interval. If you need sub-second push, this isn't it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not a historical archive.&lt;/strong&gt; Maximum 30-day lookback. For multi-year backfill, you need a different tool.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not authentication-walled content.&lt;/strong&gt; No Zhihu authenticated answers, no private WeChat groups, no closed Weibo Super Topic posts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not a CRM or BI tool.&lt;/strong&gt; This is the data layer. You bring the dashboard.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If those constraints are dealbreakers for your use case, save the credit and don't run it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The broader China stack
&lt;/h2&gt;

&lt;p&gt;The main Actor here is &lt;a href="https://apify.com/zhorex/chinese-brand-monitor" rel="noopener noreferrer"&gt;zhorex/chinese-brand-monitor&lt;/a&gt;, but the rest of the stack exists for cases where you need single-platform depth or a different angle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For deeper single-platform RedNote dives — full creator profiles, comment threads, hashtag networks — reach for &lt;a href="https://apify.com/zhorex/rednote-xiaohongshu-scraper" rel="noopener noreferrer"&gt;the standalone RedNote/Xiaohongshu Scraper&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;For Weibo-only bulk pulls — historical hashtag sweeps, single-account timelines, Super Topic posts — &lt;a href="https://apify.com/zhorex/weibo-scraper" rel="noopener noreferrer"&gt;the Weibo Scraper&lt;/a&gt; is the dedicated tool.&lt;/li&gt;
&lt;li&gt;For Bilibili-only deep pulls — video metadata, danmaku, UP主 channel coverage — use &lt;a href="https://apify.com/zhorex/bilibili-scraper" rel="noopener noreferrer"&gt;the Bilibili Scraper&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;For finance-only sentiment with cashtag granularity and reply trees, &lt;a href="https://apify.com/zhorex/xueqiu-scraper" rel="noopener noreferrer"&gt;the Xueqiu Scraper&lt;/a&gt; goes deeper than the brand-monitor surface.&lt;/li&gt;
&lt;li&gt;For long-form review extraction, especially books, films, and lifestyle, &lt;a href="https://apify.com/zhorex/douban-scraper" rel="noopener noreferrer"&gt;the Douban Scraper&lt;/a&gt; handles the review-thread structure.&lt;/li&gt;
&lt;li&gt;For the cross-tool finance workflow above, &lt;a href="https://apify.com/zhorex/tradingview-scraper" rel="noopener noreferrer"&gt;the TradingView Scraper&lt;/a&gt; provides the price half of the sentiment-vs-price divergence signal.&lt;/li&gt;
&lt;li&gt;If you're tracking brand mentions, you usually also want competitor pricing — &lt;a href="https://apify.com/zhorex/jd-scraper" rel="noopener noreferrer"&gt;the JD Scraper&lt;/a&gt; covers the e-commerce price side of the China stack.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;$5 of Apify free credits cover roughly 110 mentions at $0.045 each, enough to run a single brand for a week and see whether the output shape fits your downstream code. Start here: &lt;a href="https://apify.com/zhorex/chinese-brand-monitor" rel="noopener noreferrer"&gt;zhorex/chinese-brand-monitor&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you build something on top of this — a Looker dashboard, a Slack bot, a Streamlit explorer, a sentiment ETF screen — drop a comment, or open an Issue on the Actor page. Schema customization, missing platforms, follower-bracket additions, new sentiment lexicons — those are the kinds of changes that get prioritized when users ask for them.&lt;/p&gt;

</description>
      <category>python</category>
      <category>webscraping</category>
      <category>china</category>
      <category>analytics</category>
    </item>
    <item>
      <title>Pinnacle odds for $0.01 a snapshot: the OddsJam / Odds API replacement sharp bettors are using in 2026</title>
      <dc:creator>Sami</dc:creator>
      <pubDate>Tue, 19 May 2026 13:54:32 +0000</pubDate>
      <link>https://dev.to/sami_8858131362756585e4f4/pinnacle-odds-for-001-a-snapshot-the-oddsjam-odds-api-replacement-sharp-bettors-are-using-in-3kgl</link>
      <guid>https://dev.to/sami_8858131362756585e4f4/pinnacle-odds-for-001-a-snapshot-the-oddsjam-odds-api-replacement-sharp-bettors-are-using-in-3kgl</guid>
      <description>&lt;p&gt;If you bet sharp lines, the only book that genuinely matters for fair-value is Pinnacle. Every EV model, every CLV report, every "did I beat the close?" check eventually compresses down to one question: what was Pinnacle showing on this market at T-1?&lt;/p&gt;

&lt;p&gt;For years the standard way to get that feed was The Odds API ($249/mo for 15M credits) or OddsJam Gold ($249/mo, $499+ for Pro). For a tipster shop polling 100 fixtures a day that math is tolerable. For a solo bettor running CLV on 20 fixtures it's overspend. For a specials trader it's worse — OddsJam gates futures and yes/no markets behind their highest tier and The Odds API doesn't surface most of them at all.&lt;/p&gt;

&lt;p&gt;There's now an Apify Actor that does the same job pay-per-snapshot:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;zhorex/sports-odds-aggregator&lt;/strong&gt; — Pinnacle h2h + spreads + totals + 5,000+ specials per sport, from $0.01 a snapshot. Datacenter-proxy friendly. No login, no monthly minimum.&lt;/p&gt;

&lt;p&gt;This post is the playbook: four recipes that show exactly how to run it, what each costs, and where the savings show up vs. the SaaS-incumbent pricing.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Quick note on naming: the Actor's title still references Bet365 because Bet365 is the second-book slot, but Bet365's public mobile-web path is under repair as of May 2026. Pinnacle is shipping today, and the moment Bet365 returns, the cross-book best-price flag (&lt;code&gt;isBestPriceAcrossBooks&lt;/code&gt;) and fuzzy event-matching will activate automatically, with no input change on your side.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Pricing in plain English
&lt;/h2&gt;

&lt;p&gt;Four event types, billed pay-per-event (PPE):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Event&lt;/th&gt;
&lt;th&gt;Price&lt;/th&gt;
&lt;th&gt;When it fires&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;odds-snapshot-pre-match&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;$0.01 / snapshot&lt;/td&gt;
&lt;td&gt;One market-outcome from a scheduled (not in-play) fixture&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;odds-snapshot-live&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;$0.02 / snapshot&lt;/td&gt;
&lt;td&gt;One market-outcome from a live (in-play) match&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;odds-snapshot-player-prop&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;$0.04 / snapshot&lt;/td&gt;
&lt;td&gt;One special / future / yes-no / team prop / exact-totals row&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;scheduled-run&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;$0.05 / run&lt;/td&gt;
&lt;td&gt;Once per cron tick — often fully offset by the dedup window&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A typical pre-match fixture with &lt;code&gt;["h2h", "spreads", "totals"]&lt;/code&gt; emits ~7 snapshots (3 h2h outcomes + 2 spreads + 2 totals). Add &lt;code&gt;"specials"&lt;/code&gt; and you get an extra 30–80 rows per fixture — yes/no markets, exact totals, first-team-to-score, winning margin per scoreline, team props.&lt;/p&gt;

&lt;p&gt;The bit that turns this from "interesting" to "actually cheap": the &lt;code&gt;deduplicationWindowSeconds&lt;/code&gt; setting suppresses snapshots when the line hasn't moved. On stable mid-week Premier League pre-match polls you are typically charged for only 5–15% of "naïve" volume. A 60-second cron on a stable line is essentially free.&lt;/p&gt;
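
&lt;p&gt;To make the idea concrete, here is a rough sketch of how a deduplication window of this kind could behave. It is an assumption about the general technique, not the Actor's actual implementation: the &lt;code&gt;DedupWindow&lt;/code&gt; class, its method names, and the rule "suppress when the price is unchanged and the last emitted snapshot is still inside the window" are all hypothetical.&lt;/p&gt;

```python
class DedupWindow:
    """Suppress a snapshot when an outcome repeats an unchanged price inside the window."""

    def __init__(self, window_seconds: float):
        self.window_seconds = window_seconds
        self._last = {}  # outcome key -> (price, time of the last emitted snapshot)

    def should_emit(self, key: str, price: float, now: float) -> bool:
        previous = self._last.get(key)
        if previous is not None:
            last_price, last_time = previous
            unchanged = price == last_price
            inside_window = self.window_seconds > now - last_time
            if unchanged and inside_window:
                return False  # line has not moved: nothing new to emit
        self._last[key] = (price, now)
        return True


# Four polls of one outcome, 60 seconds apart; only the last one moves the price.
window = DedupWindow(window_seconds=300)
polls = [(0, 1.91), (60, 1.91), (120, 1.91), (180, 1.95)]
emitted = [window.should_emit("fixture-1:h2h:home", price, t) for t, price in polls]
print(emitted)
print(f"emitted {sum(emitted)} of {len(polls)} polls")
```

&lt;p&gt;Under this rule, a poll is only emitted when the price moves or the window has expired, which is why a stable line generates so few snapshots.&lt;/p&gt;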




&lt;h2&gt;
  
  
  Recipe 1 — Pinnacle closing-line value (CLV) tracker
&lt;/h2&gt;

&lt;p&gt;The recipe that pays for the Actor in its first weekend.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setup:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;mode: "pre_match_only"&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Trigger at T-30 minutes and T-1 minute per fixture&lt;/li&gt;
&lt;li&gt;Bet your soft book at T-30, log Pinnacle's T-1 close, compute CLV per ticket&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Pinnacle's closing line is the canonical sharp benchmark. If you're consistently beating Pinnacle's close, your edge is real. If you aren't, you can stop pretending — CLV is the ground truth of whether you're a winning bettor or a noise-trader.&lt;/p&gt;
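
&lt;p&gt;The per-ticket calculation is short. This sketch assumes decimal odds and uses the common simplification of comparing your price directly with the raw closing price; it does not strip Pinnacle's margin from the close, which a stricter CLV report would do. The &lt;code&gt;clv&lt;/code&gt; function and the ticket values are made up for illustration.&lt;/p&gt;

```python
def clv(bet_odds: float, closing_odds: float) -> float:
    """Closing-line value as a fraction; positive means the bet beat the close."""
    return bet_odds / closing_odds - 1.0


# Hypothetical tickets: (decimal odds you took at T-30, Pinnacle close at T-1).
tickets = [
    (2.10, 2.00),
    (1.95, 2.00),
    (3.40, 3.25),
]

values = [clv(bet, close) for bet, close in tickets]
for (bet, close), value in zip(tickets, values):
    print(f"bet {bet:.2f} vs close {close:.2f}: CLV {value:+.2%}")

average = sum(values) / len(values)
print(f"average CLV over {len(values)} tickets: {average:+.2%}")
```

&lt;p&gt;A single ticket says little; the average over a few hundred tickets is the number worth tracking.&lt;/p&gt;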

&lt;p&gt;&lt;strong&gt;Cost for 200 fixtures/week&lt;/strong&gt; (h2h+spreads+totals, ~7 snapshots × 2 polls each): &lt;strong&gt;~$65/month&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The Odds API doesn't expose "Pinnacle close at T-1" as a first-class field, so you're paying $249/mo for the feed and still rolling your own snapshot scheduler. Here the snapshot scheduler (Apify cron) and the snapshot itself together come in at ~25% of the price.&lt;/p&gt;




&lt;h2&gt;
  
  
  Recipe 2 — EV-model live edge harvester
&lt;/h2&gt;

&lt;p&gt;The model-on-top use case. If you have a fair-value model and you harvest the moments where &lt;code&gt;book_price × your_fair_value &amp;gt; 1.03&lt;/code&gt; (decimal book price times your model's fair probability, i.e. an expected return of more than 3%), you want a polling firehose during in-play, not an hourly dump.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setup:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"books"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"pinnacle"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"sports"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"basketball"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tennis"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"soccer"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"marketTypes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"h2h"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"spreads"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"totals"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"live_only"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"deduplicationWindowSeconds"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Schedule:&lt;/strong&gt; 60-second cron during target match windows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Volume:&lt;/strong&gt; ~50K live snapshots/month × $0.02 + orchestration ≈ &lt;strong&gt;~$1,200 / month&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That looks pricey until you put it next to OddsJam Pro at $499+/mo for a SaaS API you don't control and that throttles by tier. The trade is: you pay more per request, but you pay only for what you actually consume, you set the cadence, and a stable line costs you nothing.&lt;/p&gt;

&lt;p&gt;The other thing the SaaS won't sell you: every snapshot includes &lt;code&gt;isLive&lt;/code&gt;, &lt;code&gt;matchClock&lt;/code&gt;, and &lt;code&gt;matchScore&lt;/code&gt;. Your model doesn't have to join against a separate scoreboard feed during a live NBA fourth quarter.&lt;/p&gt;
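&lt;p&gt;The edge filter from this recipe might look something like the sketch below. It reads the snapshot fields shown in this post (&lt;code&gt;homeTeam&lt;/code&gt;, &lt;code&gt;awayTeam&lt;/code&gt;, &lt;code&gt;marketType&lt;/code&gt;, &lt;code&gt;outcomeKey&lt;/code&gt;, &lt;code&gt;price&lt;/code&gt;). The &lt;code&gt;fair_probabilities&lt;/code&gt; mapping stands in for your own model's output and is purely hypothetical, as is the &lt;code&gt;find_edges&lt;/code&gt; helper.&lt;/p&gt;

```python
def find_edges(snapshots, fair_probabilities, threshold=1.03):
    """Return (key, ratio) pairs where decimal price times fair probability clears the threshold."""
    edges = []
    for snap in snapshots:
        key = (snap["homeTeam"], snap["awayTeam"], snap["marketType"], snap["outcomeKey"])
        fair = fair_probabilities.get(key)
        if fair is None:
            continue  # the model has no opinion on this outcome
        ratio = snap["price"] * fair
        if ratio > threshold:
            edges.append((key, round(ratio, 4)))
    return edges


# Two live rows for the same fixture, using field names from the snapshot schema.
snapshots = [
    {"homeTeam": "Manchester City", "awayTeam": "Liverpool", "marketType": "h2h",
     "outcomeKey": "home", "price": 1.91, "isLive": True},
    {"homeTeam": "Manchester City", "awayTeam": "Liverpool", "marketType": "h2h",
     "outcomeKey": "away", "price": 4.20, "isLive": True},
]

# Hypothetical model output: fair win probability per outcome.
fair_probabilities = {
    ("Manchester City", "Liverpool", "h2h", "home"): 0.56,
    ("Manchester City", "Liverpool", "h2h", "away"): 0.22,
}

edges = find_edges(snapshots, fair_probabilities)
```

&lt;p&gt;Here only the home price clears the 1.03 bar. The away price multiplies out to less than 1 and is dropped.&lt;/p&gt;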




&lt;h2&gt;
  
  
  Recipe 3 — Specials sniper (the OddsJam gating trick)
&lt;/h2&gt;

&lt;p&gt;For value bettors and exact-totals modellers. This is the recipe where the pricing gap gets embarrassing.&lt;/p&gt;

&lt;p&gt;Pinnacle's &lt;code&gt;withSpecials=true&lt;/code&gt; matchups call returns &lt;strong&gt;~5,000 markets per major sport&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First Team To Score (3-way)&lt;/li&gt;
&lt;li&gt;Win to Nil 1st Half (yes/no)&lt;/li&gt;
&lt;li&gt;Exact Total Goals 1st Half (multi-way)&lt;/li&gt;
&lt;li&gt;Winning Margin per scoreline&lt;/li&gt;
&lt;li&gt;A long tail of team props and player-related markets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are the markets soft books are slowest to sharpen up on — which is where the actual edge lives. OddsJam gates futures and props behind their highest tier. The Odds API doesn't surface most of them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setup:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"books"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"pinnacle"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"sports"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"soccer"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"marketTypes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"specials"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"pre_match_only"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"deduplicationWindowSeconds"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Schedule:&lt;/strong&gt; 4-hour cron during the season.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Volume:&lt;/strong&gt; ~6K specials snapshots + 180 runs ≈ &lt;strong&gt;~$250 / month&lt;/strong&gt; for the segment that powers the largest EV pockets in retail sports betting.&lt;/p&gt;
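&lt;p&gt;The arithmetic behind that figure, as a small sketch using the per-event prices from the pricing table above ($0.04 per specials snapshot, $0.05 per scheduled run). The &lt;code&gt;monthly_cost&lt;/code&gt; helper is just an illustration, and a 30-day month is assumed.&lt;/p&gt;

```python
def monthly_cost(snapshots: int, price_per_snapshot: float,
                 runs: int, price_per_run: float) -> float:
    """Pay-per-event bill: billed snapshots plus scheduled runs, in dollars."""
    return round(snapshots * price_per_snapshot + runs * price_per_run, 2)


# A 4-hour cron fires 6 times a day; assume a 30-day month.
runs_per_month = (24 // 4) * 30

specials_cost = monthly_cost(
    snapshots=6_000,
    price_per_snapshot=0.04,
    runs=runs_per_month,
    price_per_run=0.05,
)
```

&lt;p&gt;That works out to roughly $240 in snapshots plus $9 in runs, which is where the ~$250 comes from.&lt;/p&gt;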

&lt;p&gt;A pattern that works: filter the dataset to &lt;code&gt;marketType == "specials" &amp;amp;&amp;amp; impliedProbability &amp;lt; 0.10&lt;/code&gt;. Pinnacle longshots above 10× implied with sharp money backing are where the soft-book mispricings concentrate.&lt;/p&gt;




&lt;h2&gt;
  
  
  Recipe 4 — Tipster Discord auto-poster
&lt;/h2&gt;

&lt;p&gt;The cheapest one and the easiest to sell to a small operation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setup:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;sports: ["soccer"]&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;leagueFilter: ["UEFA", "EPL", "La Liga"]&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;mode: "pre_match_only"&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Every 6 hours, webhook → Discord&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cost:&lt;/strong&gt; ~$30/month for a daily top-10 spreads + totals digest piped straight into the channel. If you currently screenshot OddsJam into Discord by hand, this is the upgrade.&lt;/p&gt;




&lt;h2&gt;
  
  
  The pay-per-event math in one table
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Workflow&lt;/th&gt;
&lt;th&gt;Volume&lt;/th&gt;
&lt;th&gt;Monthly cost&lt;/th&gt;
&lt;th&gt;Replaces&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Casual bettor — daily 9am pre-match dump, 30 fixtures&lt;/td&gt;
&lt;td&gt;~900 snapshots&lt;/td&gt;
&lt;td&gt;~$11&lt;/td&gt;
&lt;td&gt;$59/mo Odds API tier&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CLV tracker — T-30 + T-1, 80 fixtures/wk&lt;/td&gt;
&lt;td&gt;~3.2K snapshots&lt;/td&gt;
&lt;td&gt;~$65&lt;/td&gt;
&lt;td&gt;$249/mo OddsJam Gold&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tipster shop — 100 fixtures × 7 outcomes, hourly&lt;/td&gt;
&lt;td&gt;~21K snapshots&lt;/td&gt;
&lt;td&gt;~$245&lt;/td&gt;
&lt;td&gt;$249/mo OddsJam Gold (parity)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Specials trader — daily soccer sweep&lt;/td&gt;
&lt;td&gt;~6K snapshots&lt;/td&gt;
&lt;td&gt;~$250&lt;/td&gt;
&lt;td&gt;Highest-tier gate (not available below)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EV model — live NBA + tennis + soccer, 60s cron&lt;/td&gt;
&lt;td&gt;~50K live snapshots&lt;/td&gt;
&lt;td&gt;~$1,200&lt;/td&gt;
&lt;td&gt;OddsJam Pro $499+ + you control cadence&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Flat-rate SaaS wins only once you cross ~150K snapshots/month of stable workload. Below that — which is most solo sharps, most tipster operations, and every specials trader — PPE is just cheaper, and the cost curve is linear in actual usage rather than tier-jumpy.&lt;/p&gt;

&lt;p&gt;The other PPE advantage that quietly compounds: there's no annual contract. Off-season for a sport? Cron stops, billing stops. You don't pay for unused capacity in July when soccer is dead.&lt;/p&gt;




&lt;h2&gt;
  
  
  Three steps to a running cron
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Step 1 — Pick your sports and markets.&lt;/strong&gt;&lt;br&gt;
Defaults are &lt;code&gt;["soccer", "tennis"]&lt;/code&gt; — the two highest-liquidity sharp markets year-round. For CLV add &lt;code&gt;"spreads", "totals"&lt;/code&gt;. For specials sniping add &lt;code&gt;"specials"&lt;/code&gt;. The full sport list is 11 deep (soccer, tennis, basketball, MMA, baseball, hockey, esports, AFL, NFL/college, golf, rugby).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2 — Run once with default input&lt;/strong&gt; and verify Pinnacle returns data for your sport+league pick. Output lands in your Apify dataset.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3 — Save as Task → Schedules → New Schedule&lt;/strong&gt; with the cron string you want:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight conf"&gt;&lt;code&gt;*/&lt;span class="m"&gt;5&lt;/span&gt; * * * *    &lt;span class="n"&gt;pre&lt;/span&gt;-&lt;span class="n"&gt;match&lt;/span&gt; &lt;span class="n"&gt;every&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt; &lt;span class="n"&gt;minutes&lt;/span&gt;
* * * * *      &lt;span class="n"&gt;live&lt;/span&gt; &lt;span class="n"&gt;every&lt;/span&gt; &lt;span class="n"&gt;minute&lt;/span&gt; &lt;span class="n"&gt;during&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt; &lt;span class="n"&gt;windows&lt;/span&gt;
&lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="m"&gt;9&lt;/span&gt; * * *      &lt;span class="n"&gt;daily&lt;/span&gt; &lt;span class="m"&gt;9&lt;/span&gt;&lt;span class="n"&gt;am&lt;/span&gt; &lt;span class="n"&gt;pre&lt;/span&gt;-&lt;span class="n"&gt;match&lt;/span&gt; &lt;span class="n"&gt;dump&lt;/span&gt;
&lt;span class="m"&gt;0&lt;/span&gt; */&lt;span class="m"&gt;6&lt;/span&gt; * * *    &lt;span class="n"&gt;every&lt;/span&gt; &lt;span class="m"&gt;6&lt;/span&gt; &lt;span class="n"&gt;hours&lt;/span&gt;
&lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt; * * &lt;span class="m"&gt;6&lt;/span&gt;     &lt;span class="n"&gt;Saturday&lt;/span&gt; &lt;span class="n"&gt;morning&lt;/span&gt; &lt;span class="n"&gt;weekly&lt;/span&gt; &lt;span class="n"&gt;audit&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Attach a webhook to the schedule and ship the dataset into your EV pipeline, Discord/Slack bot, Sheets workbook, or wherever your model lives.&lt;/p&gt;




&lt;h2&gt;
  
  
  Python in 12 lines
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;apify_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ApifyClient&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ApifyClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_APIFY_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;actor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zhorex/sports-odds-aggregator&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;books&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pinnacle&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sports&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;soccer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tennis&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;marketTypes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;h2h&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spreads&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;totals&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;specials&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pre_match_only&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maxEventsPerSport&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deduplicationWindowSeconds&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;snapshot&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;defaultDatasetId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;iterate_items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;snapshot&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;marketType&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;specials&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;snapshot&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;impliedProbability&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.10&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;evaluate_for_bet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;snapshot&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the whole integration. Every snapshot arrives in a flat per-market-outcome shape with &lt;code&gt;priceAmerican&lt;/code&gt;, &lt;code&gt;priceFractional&lt;/code&gt;, &lt;code&gt;price&lt;/code&gt; (decimal), &lt;code&gt;impliedProbability&lt;/code&gt;, and &lt;code&gt;isBestPriceAcrossBooks&lt;/code&gt; on every row — your model doesn't have to do format gymnastics or join against a separate American-odds conversion table.&lt;/p&gt;
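&lt;p&gt;If you want to sanity-check how those price fields relate to each other, a sketch like this is enough. The helper names are made up for illustration, and the implied probability here is the raw &lt;code&gt;1 / price&lt;/code&gt; with the vig left in, which is what the sample values in this post appear to use.&lt;/p&gt;

```python
def decimal_to_implied(price: float) -> float:
    """Raw implied probability from decimal odds (vig not removed)."""
    return 1.0 / price


def american_to_decimal(american: int) -> float:
    """Convert American odds to decimal odds."""
    if american == 0:
        raise ValueError("American odds cannot be 0")
    if american > 0:
        return 1.0 + american / 100.0
    return 1.0 + 100.0 / abs(american)


def fractional_to_decimal(fractional: str) -> float:
    """Convert a fractional string such as '10/11' to decimal odds."""
    numerator, denominator = fractional.split("/")
    return 1.0 + int(numerator) / int(denominator)


implied = round(decimal_to_implied(1.91), 5)
from_american = round(american_to_decimal(-110), 2)
from_fractional = round(fractional_to_decimal("10/11"), 2)
```

&lt;p&gt;With the values from the sample snapshot, all three formats land on the same decimal price of 1.91 after rounding.&lt;/p&gt;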




&lt;h2&gt;
  
  
  What a snapshot looks like
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"snapshotId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"a1b2c3d4e5f6789012345678"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"book"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"pinnacle"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"sport"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"soccer"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"league"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Premier League"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"homeTeam"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Manchester City"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"awayTeam"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Liverpool"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"commenceTime"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-05-22T19:00:00Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"isLive"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"marketType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"h2h"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"outcomeKey"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"home"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"outcomeLabel"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Home"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.91&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"priceAmerican"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;-110&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"priceFractional"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"10/11"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"impliedProbability"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.52356&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"isBestPriceAcrossBooks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"scrapedAt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-05-18T14:32:00Z"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;snapshotId&lt;/code&gt; is a stable sha1 derived from book+event+market+outcome+timestamp, so it makes a clean primary key if you're persisting into Postgres / DuckDB.&lt;/p&gt;
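&lt;p&gt;A minimal persistence sketch along those lines. It uses Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; as a stand-in for Postgres or DuckDB so it runs anywhere. The table layout and the &lt;code&gt;store&lt;/code&gt; helper are assumptions for illustration, not something the Actor ships.&lt;/p&gt;

```python
import sqlite3

# A trimmed-down row using field names from the sample snapshot.
snapshot = {
    "snapshotId": "a1b2c3d4e5f6789012345678",
    "book": "pinnacle",
    "marketType": "h2h",
    "outcomeKey": "home",
    "price": 1.91,
    "scrapedAt": "2026-05-18T14:32:00Z",
}

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE snapshots (
        snapshot_id TEXT PRIMARY KEY,
        book        TEXT NOT NULL,
        market_type TEXT NOT NULL,
        outcome_key TEXT NOT NULL,
        price       REAL NOT NULL,
        scraped_at  TEXT NOT NULL
    )
    """
)


def store(connection, snap):
    """Insert one snapshot; return True if it was new, False if the id already existed."""
    cursor = connection.execute(
        "INSERT OR IGNORE INTO snapshots VALUES (?, ?, ?, ?, ?, ?)",
        (snap["snapshotId"], snap["book"], snap["marketType"],
         snap["outcomeKey"], snap["price"], snap["scrapedAt"]),
    )
    return cursor.rowcount == 1


first_insert = store(conn, snapshot)    # new row
second_insert = store(conn, snapshot)   # same snapshotId, silently skipped
row_count = conn.execute("SELECT COUNT(*) FROM snapshots").fetchone()[0]
```

&lt;p&gt;Because the id is the primary key, re-ingesting the same dataset twice does not duplicate rows, which keeps webhook retries harmless.&lt;/p&gt;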




&lt;h2&gt;
  
  
  For high-volume operations
&lt;/h2&gt;

&lt;p&gt;If your monthly burn is past 50K snapshots and you need a dedicated polling cadence, custom market types (Asian handicap quarter-lines, derivative props, fancy bets), or a schema SLA for a downstream production pipeline, the Actor page has an "Enterprise inquiry" pointer. Webhook integrations, dedicated proxy pools, and custom dataset views ship in roughly a week. Operations with sustained seven-figure action can ask about a dedicated instance.&lt;/p&gt;

&lt;p&gt;For everyone else the default Apify Proxy works on Pinnacle's guest API — Pinnacle's public surface tolerates datacenter IPs by design (which is why it's on the supported-books list to begin with). If your plan includes datacenter, override &lt;code&gt;apifyProxyGroups: ["DATACENTER"]&lt;/code&gt; and your proxy cost drops to roughly 5% of a residential-default scraper.&lt;/p&gt;




&lt;h2&gt;
  
  
  Things worth knowing before you run it
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Not US bookmakers.&lt;/strong&gt; DraftKings / FanDuel / BetMGM / Caesars / ESPN BET are geo-gated behind Akamai and need a US residential proxy, which kills the per-snapshot economics. Other Apify Actors target those — this one stays out.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Personal-analysis use only.&lt;/strong&gt; Pinnacle's TOS forbids commercial redistribution of raw odds. The architecture is per-buyer-execution — you run it in your own Apify account against your own polling cadence. Don't resell the feed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not a streaming WebSocket feed.&lt;/strong&gt; Poll-based, fastest meaningful cadence ~60s.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bet365 returns when Bet365 returns.&lt;/strong&gt; Cross-book best-price flag and fuzzy event-matching are already in the codebase; the day a second book ships, arb infra activates without an input change.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Where to start
&lt;/h2&gt;

&lt;p&gt;If you currently pay The Odds API or OddsJam Gold $249/mo for the Pinnacle column, the cheapest experiment is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Spin up the Actor with the default input.&lt;/li&gt;
&lt;li&gt;Run it on five of your usual fixtures.&lt;/li&gt;
&lt;li&gt;Compare the snapshots against whatever your incumbent feed gave you for the same fixtures.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The break-even comes faster than you'd expect — most workflows under 150K snapshots/month earn back the SaaS subscription inside the first month, and the dedup window keeps marginal cost near zero on stable lines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Actor link:&lt;/strong&gt; &lt;a href="https://apify.com/zhorex/sports-odds-aggregator" rel="noopener noreferrer"&gt;apify.com/zhorex/sports-odds-aggregator&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If the Actor saves you a month of OddsJam Gold, the single highest-leverage thing you can do back is a 30-second review on the Actor page — it directly funds the next defensive patch when the books shift their schemas.&lt;/p&gt;

&lt;p&gt;Roadmap is public: Smarkets adapter (v0.4) reactivates cross-book arb infra, Pinnacle alternate-lines / period markets (v0.5) opens half/quarter handicap decomposition, Betfair Exchange BYO-credentials (v0.6), WebSocket mode (v0.7), automatic arb finder (v0.8).&lt;/p&gt;

</description>
      <category>apify</category>
      <category>pinnacle</category>
      <category>sportsbetting</category>
      <category>scraping</category>
    </item>
    <item>
      <title>$0.005 per Weibo post — the Chinese social data layer Western teams keep skipping</title>
      <dc:creator>Sami</dc:creator>
      <pubDate>Sun, 17 May 2026 16:54:46 +0000</pubDate>
      <link>https://dev.to/sami_8858131362756585e4f4/0005-per-weibo-post-the-chinese-social-data-layer-western-teams-keep-skipping-699</link>
      <guid>https://dev.to/sami_8858131362756585e4f4/0005-per-weibo-post-the-chinese-social-data-layer-western-teams-keep-skipping-699</guid>
      <description>&lt;p&gt;I shipped a Weibo scraper on Apify eight months ago. Fifteen customers pay me on it now, another thirty-four use the free tier, and in the last sixteen days they pulled 136,400 posts through it. I built it because every Western social-listening tool I evaluated — Synthesio, Brandwatch, Meltwater — quoted four to five figures a year for China coverage that was thinner than what you get from one tuned Apify run.&lt;/p&gt;

&lt;p&gt;The whole pitch is one number: &lt;strong&gt;$0.005 per post.&lt;/strong&gt; Pay only for items you actually take. The Apify free plan covers your first ~1,000 mentions before you spend a cent.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://apify.com/zhorex/weibo-scraper" rel="noopener noreferrer"&gt;Weibo Scraper on Apify Store →&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What's actually in the box
&lt;/h2&gt;

&lt;p&gt;Four modes. All return normalized JSON. No Weibo login. No API key from Weibo. No VPN.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;hot_search&lt;/code&gt; — the live hot-topics list, i.e. what 580M+ monthly active users are looking at right now. The single most-watched signal in Chinese social.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;search&lt;/code&gt; — keyword search across public posts. Brand names, ticker symbols, product launches, Chinese or English.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;post_comments&lt;/code&gt; — every public comment on a given post. Sentiment grenades and viral crises live here.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;user_posts&lt;/code&gt; — full posting history of any public account. KOL vetting, executive watch, competitor monitoring.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Output is flat JSON — post text, author handle, timestamp, repost / comment / like counts, media URLs. Push it straight into a warehouse, a Pandas DataFrame, or a Slack alert with a 30-line script.&lt;/p&gt;

&lt;h2&gt;
  
  
  What people actually pay for
&lt;/h2&gt;

&lt;p&gt;I see what runs every day on this actor. The patterns paying customers settle into:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Equity / sentiment signal on China-listed names — ~$25/day, ~$750/month
&lt;/h3&gt;

&lt;p&gt;A small fund or research desk covering BABA, NIO, PDD, BILI, JD, BEKE, LI, XPEV, KWEB constituents, or any China-exposed Western name. Scheduled &lt;code&gt;search&lt;/code&gt; over 30-50 tickers and brand names, ~5,000 posts a day, fed into a sentiment model. Sentiment shifts on Weibo lead the Hong Kong open by hours. Dedicated enterprise social-listening contracts that even attempt China coverage start near $30K/year, and most don't index Weibo deeply.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Brand monitoring for Western brands in China — ~$15/day, ~$450/month
&lt;/h3&gt;

&lt;p&gt;A consumer brand with China exposure — Apple, Tesla, Nike, Starbucks, LVMH, Lululemon, any DTC brand on Tmall — needs ~3,000 mentions/day on brand and product-line keywords. Comments mode catches crisis posts before they trend. Synthesio / Brandwatch / Talkwalker contracts that include China typically run $30K-$100K/year. The same daily mention stream costs you less than a streaming subscription.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. KOL / influencer due diligence — ~$1 per KOL
&lt;/h3&gt;

&lt;p&gt;Before you wire 50,000-200,000 RMB to a Weibo influencer for a sponsorship, run &lt;code&gt;user_posts&lt;/code&gt; against the handle. Look at posting cadence, real engagement (not vanity follower counts), brand affinity history, controversy flags. One avoided bad deal pays for years of usage.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. AI / LLM training data — ~1M posts = $5,000
&lt;/h3&gt;

&lt;p&gt;Real-world, conversational, dialect-rich Mandarin from public posts. Filtered Weibo subsets sell on data marketplaces for $20K-$50K and ship stale by months. Pull fresh data on the topics and time windows you care about, own the pipeline, and the per-post cost is a small fraction of either marketplace data or annotator-collected datasets.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. M&amp;amp;A and pre-deal diligence on Chinese targets — $200-$500 one-off
&lt;/h3&gt;

&lt;p&gt;A pre-LOI sentiment pull on a Chinese target — employee chatter, customer complaints, founder reputation, Glassdoor-equivalent venting. Boutique diligence firms bill $25K-$75K for the equivalent exercise. As a banker or consultant, even a "$500 in cost, $30K invoice" framing is a 60x markup the client is happy to pay.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Crisis monitoring / hourly brand watch — ~$50/month
&lt;/h3&gt;

&lt;p&gt;Schedule an hourly run on brand keywords. &lt;code&gt;hot_search&lt;/code&gt; catches a viral crisis the moment it crosses into the public consciousness — typically a 4-12 hour head start on Western media coverage. For a brand worth eight figures, that gap is the difference between "managed" and "case study."&lt;/p&gt;

&lt;h3&gt;
  
  
  7. Macro / consumer-trend reading from the hot list — ~$5/day
&lt;/h3&gt;

&lt;p&gt;The hot search list is the cheapest macro signal in Chinese markets. Tariff reactions, regulatory rumblings, viral consumer products, celebrity scandals that wreck brand deals — all surface here first. Hedge fund quants, geopolitical analysts, and morning-brief writers all bake this in.&lt;/p&gt;

&lt;h2&gt;
  
  
  The number that matters: $0.005 per item
&lt;/h2&gt;

&lt;p&gt;You pay per item returned. No subscription, no surprise overage.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1,000 mentions: &lt;strong&gt;$5&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;10,000 mentions: &lt;strong&gt;$50&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;100,000 mentions: &lt;strong&gt;$500&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Apify free plan gives you ~$5/month in platform credit, which covers your first ~1,000 mentions on this actor. You validate the data fits your use case before you spend a cent.&lt;/p&gt;
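&lt;p&gt;If you want to budget a workload before scheduling it, the math is a one-liner. The sketch below assumes the flat $0.005 per item and a 30-day month. &lt;code&gt;cost_usd&lt;/code&gt; is an illustrative helper, not part of the actor.&lt;/p&gt;

```python
PRICE_PER_ITEM = 0.005  # dollars per returned item, as quoted above


def cost_usd(items: int) -> float:
    """Pay-per-item cost in dollars for a given number of returned items."""
    return round(items * PRICE_PER_ITEM, 2)


daily_cost = cost_usd(5_000)                     # the ~5,000 posts/day sentiment workload
monthly_cost = round(daily_cost * 30, 2)         # assuming a 30-day month
free_plan_items = int(round(5 / PRICE_PER_ITEM)) # items covered by ~$5 of platform credit
```

&lt;p&gt;That reproduces the ~$25/day and ~$750/month figures for the equity-sentiment workload, and the ~1,000 mentions the free credit covers.&lt;/p&gt;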

&lt;p&gt;&lt;a href="https://apify.com/zhorex/weibo-scraper" rel="noopener noreferrer"&gt;Start on the free plan →&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick start
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;apify_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ApifyClient&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ApifyClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;YOUR_APIFY_TOKEN&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Pull up to 1,000 posts mentioning Tesla (about $5 at $0.005 per item).
&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;actor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zhorex/weibo-scraper&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;searchQuery&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;特斯拉&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maxResults&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;# Stream the results
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;defaultDatasetId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;iterate_items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;createdAt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;repostsCount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][:&lt;/span&gt;&lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The same shape works for &lt;code&gt;hot_search&lt;/code&gt;, &lt;code&gt;post_comments&lt;/code&gt;, and &lt;code&gt;user_posts&lt;/code&gt;. Swap the &lt;code&gt;mode&lt;/code&gt; and adjust the input keys to match what that mode expects. The exact input schema lives on the actor page.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where the real money lives: recurring runs
&lt;/h2&gt;

&lt;p&gt;One-shot pulls are fine for a diligence assignment. The customers who actually extract serious value from this are the ones running it on a schedule. Apify Schedules takes a cron expression and a saved input — the actor runs forever, the dataset accumulates, and you download it as JSON, CSV, or Excel.&lt;/p&gt;

&lt;p&gt;The math gets compelling fast. Below is what my heaviest recurring customers actually run:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use case&lt;/th&gt;
&lt;th&gt;Cron expression&lt;/th&gt;
&lt;th&gt;Approx. monthly cost&lt;/th&gt;
&lt;th&gt;What it replaces&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Morning hot-search dump for the daily brief&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;0 9 * * 1-5&lt;/code&gt; (Asia/Shanghai)&lt;/td&gt;
&lt;td&gt;~$15&lt;/td&gt;
&lt;td&gt;A junior analyst's 30-min daily task&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Brand mentions, every two hours&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0 */2 * * *&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;~$450&lt;/td&gt;
&lt;td&gt;$30K/yr Brandwatch contract for China only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Equity tickers, hourly (the highest-ROI cron on this list)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0 * * * *&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$750&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A $120K/yr China sentiment analyst, half-replicated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Crisis watch, every 30 minutes&lt;/td&gt;
&lt;td&gt;&lt;code&gt;*/30 * * * *&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;~$1,500&lt;/td&gt;
&lt;td&gt;A 24/7 PR monitoring agency contract&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Overnight KOL sweep on 200 handles&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0 2 * * *&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;~$60&lt;/td&gt;
&lt;td&gt;$5K/mo influencer-vetting subscription&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
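
&lt;p&gt;To sanity-check a row before you commit to a schedule, you can back out how many items per run each monthly figure implies. The sketch below is illustrative: it assumes a 30-day month, 22 weekdays for the weekday-only job, and the $0.005 per-item price; the function name is hypothetical.&lt;/p&gt;

```python
PRICE_PER_ITEM = 0.005  # USD per item, as quoted in this post

# (use case, runs per day, days per month, approx. monthly cost from the table)
SCHEDULES = [
    ("Morning hot-search dump", 1, 22, 15),   # weekdays only; 22 days is an assumption
    ("Brand mentions, every 2h", 12, 30, 450),
    ("Equity tickers, hourly", 24, 30, 750),
    ("Crisis watch, every 30 min", 48, 30, 1500),
    ("Overnight KOL sweep", 1, 30, 60),
]


def implied_items_per_run(runs_per_day: int, days: int, monthly_cost: float) -> int:
    """Back out how many items each run returns for a given monthly bill."""
    runs = runs_per_day * days
    return round(monthly_cost / PRICE_PER_ITEM / runs)


for name, per_day, days, cost in SCHEDULES:
    items = implied_items_per_run(per_day, days, cost)
    print(f"{name}: {per_day * days} runs/month, about {items} items/run")
```

&lt;p&gt;If your own keywords return far more items per run than the implied figure, cap the result count in the saved input or expect a proportionally larger bill.&lt;/p&gt;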

&lt;p&gt;&lt;strong&gt;Set the cron once, walk away, get paid in compounded insight.&lt;/strong&gt; The customers running hourly equity cron jobs have been doing it for months without touching the config — the actor runs, the data lands, the alpha shows up in their dashboards. That's the only mode of use that actually justifies the time you invested learning the schema.&lt;/p&gt;

&lt;p&gt;If you take one thing from this post: &lt;strong&gt;don't run it manually twice — wire the second run into a cron.&lt;/strong&gt; The actor was built for that, the pricing was designed for that, and that's where every customer who renewed went.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reliability and what happens when things break
&lt;/h2&gt;

&lt;p&gt;You pay per item. If the actor returns nothing on a run, you pay nothing. If it returns 327 items, you pay for 327. That alignment is the whole reason I picked per-event pricing instead of a monthly subscription — my incentive to keep the thing working is exactly your incentive that it works.&lt;/p&gt;

&lt;p&gt;I monitor the actor daily. When something upstream changes, I ship a fix within hours, not weeks. The Apify Store rating and issue history on the actor page are public.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I priced it like this
&lt;/h2&gt;

&lt;p&gt;I shipped this eight months ago. By month two it was profitable. The most recent sixteen-day window: fifteen paying customers, 136,400 items returned, $697 revenue, $675 profit, 96.79% margin. The margin isn't there because the work is trivial — it's there because per-event pricing means I only earn when the data is actually delivered.&lt;/p&gt;

&lt;p&gt;If you're evaluating Chinese social tools and the lowest quote you can get is $20K+, run a 1,000-mention probe through this actor first. You'll know inside ten minutes whether the data covers your use case. Worst case, you spend $5. Then wire it into a cron and forget about it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://apify.com/zhorex/weibo-scraper" rel="noopener noreferrer"&gt;Try it on the free plan →&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Other Chinese-platform actors I run
&lt;/h2&gt;

&lt;p&gt;Weibo is the macro signal layer for China. These cover the rest of the surface area:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://apify.com/zhorex/rednote-xiaohongshu-scraper" rel="noopener noreferrer"&gt;Xiaohongshu / RED scraper&lt;/a&gt; — lifestyle, beauty, female-skewing audience. The #1 platform for DTC brand launches in China.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://apify.com/zhorex/bilibili-scraper" rel="noopener noreferrer"&gt;Bilibili scraper&lt;/a&gt; — long-form video, Gen Z, gaming / anime / tech vertical signal.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://apify.com/zhorex/douban-scraper" rel="noopener noreferrer"&gt;Douban scraper&lt;/a&gt; — books, films, music, niche communities. The most "honest" review platform in China.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://apify.com/zhorex/xueqiu-scraper" rel="noopener noreferrer"&gt;Xueqiu scraper&lt;/a&gt; — retail-trader-heavy financial discussion. Equity-desk supplement to Weibo.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://apify.com/zhorex/chinese-brand-monitor" rel="noopener noreferrer"&gt;Chinese Brand Monitor&lt;/a&gt; — composite brand signal across the platforms above.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All use the same pricing model. Pay per item. Schedule freely.&lt;/p&gt;

&lt;h2&gt;
  
  
  Compliance posture
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Only &lt;strong&gt;public&lt;/strong&gt; Weibo posts. No private accounts, no DMs, no content behind a login wall.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No login bypass.&lt;/strong&gt; The actor does not log into Weibo on your behalf, and does not need an account to function.&lt;/li&gt;
&lt;li&gt;Optional cookies are &lt;strong&gt;user-supplied&lt;/strong&gt; and only raise your personal rate limit. They are never required for the actor to work.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your use case requires private data, this actor isn't it — and frankly nothing on the Apify Store will be.&lt;/p&gt;

&lt;p&gt;If you actually run something interesting with it, leave a comment or open an issue on the actor page — I read all of them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://apify.com/zhorex/weibo-scraper" rel="noopener noreferrer"&gt;apify.com/zhorex/weibo-scraper&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>apify</category>
      <category>python</category>
      <category>china</category>
    </item>
    <item>
      <title>Track brand mentions across China's top 5 social platforms in one API call — $0.045 per mention</title>
      <dc:creator>Sami</dc:creator>
      <pubDate>Sat, 16 May 2026 10:36:57 +0000</pubDate>
      <link>https://dev.to/sami_8858131362756585e4f4/i-built-5-single-platform-scrapers-the-one-that-sells-fastest-is-the-aggregator-that-wraps-them-2pli</link>
      <guid>https://dev.to/sami_8858131362756585e4f4/i-built-5-single-platform-scrapers-the-one-that-sells-fastest-is-the-aggregator-that-wraps-them-2pli</guid>
      <description>&lt;p&gt;If your brand competes for Chinese consumers and you're not actively monitoring conversations on Weibo, RedNote, Bilibili, Douban, and Xueqiu, you're flying blind in the world's second-largest consumer market.&lt;/p&gt;

&lt;p&gt;The problem is that the "enterprise" way to do this (Synthesio, Brandwatch, Talkwalker) starts at &lt;strong&gt;$50,000 per year&lt;/strong&gt; for Chinese platform coverage, with annual contracts, locked-in seats, and a sales cycle measured in weeks. So most mid-market teams just… don't. They monitor English-language Twitter for their global brand, see a dip in APAC revenue a quarter later, and have no leading signal explaining why.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://apify.com/zhorex/chinese-brand-monitor" rel="noopener noreferrer"&gt;&lt;strong&gt;Chinese Brand Monitor&lt;/strong&gt;&lt;/a&gt; launches today on the &lt;a href="https://apify.com" rel="noopener noreferrer"&gt;Apify Store&lt;/a&gt; to fix that. One API call, five platforms, normalized output, &lt;strong&gt;$0.045 per mention. No subscription. No annual contract. No minimum spend. Run it once, run it daily, run it hourly — you only pay for the mentions you actually pull.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The 5 platforms in one call
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Platform&lt;/th&gt;
&lt;th&gt;What it captures&lt;/th&gt;
&lt;th&gt;Why it matters for brand monitoring&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Weibo (微博)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Public microblog mentions, KOL posts, hot search trending&lt;/td&gt;
&lt;td&gt;China's Twitter. 580M+ users. Where consumer crises break first and where KOL endorsements reach hundreds of millions in hours.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RedNote / Xiaohongshu (小红书)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Lifestyle and consumer brand notes, first-person product reviews&lt;/td&gt;
&lt;td&gt;300M+ users. The single highest-trust channel for Chinese consumer purchase decisions in beauty, skincare, fashion, food, travel.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Bilibili (B站)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Video titles, descriptions, and creator mentions&lt;/td&gt;
&lt;td&gt;China's YouTube. 300M+ users. Where Gen Z consumer brand affinity is built and where unboxing / review culture lives.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Douban (豆瓣)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Movie / book / music subject mentions, brand tie-ins, soundtracks, branded titles&lt;/td&gt;
&lt;td&gt;200M+ users. Long-form opinion-rich content — the densest source of detailed consumer attitude data outside Zhihu.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Xueqiu (雪球)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Stock cashtag and corporate mentions for listed brands&lt;/td&gt;
&lt;td&gt;20M+ users, financial-grade signal. Critical if your brand is publicly listed (NYSE:BABA, NASDAQ:JD, HK:00700, A-share tickers) — finance KOLs move retail sentiment fast.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These five platforms cover the full spectrum: broad public opinion (Weibo), high-trust consumer reviews (RedNote), Gen-Z video sentiment (Bilibili), long-form opinion (Douban), and investor sentiment (Xueqiu). For most consumer brands, monitoring any 3 of these in real time is a leading indicator that beats your CRM dashboards by 2-6 weeks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why each mention is worth $0.045
&lt;/h2&gt;

&lt;p&gt;It's tempting to look at $0.045 and compare it to "free" Twitter mentions. That's the wrong comparison. The right comparison is: &lt;strong&gt;what would it cost you to get one Chinese consumer mention, normalized, sentiment-tagged, and cross-platform-deduplicated, any other way?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Synthesio / Brandwatch / Talkwalker enterprise seat&lt;/strong&gt;: $50K+/year minimum. Cost per mention at typical 100K mentions/year volume: $0.50. &lt;strong&gt;11× the cost of this Actor&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hire a Chinese-speaking VA to manually check 5 platforms&lt;/strong&gt;: ~$15/hour, ~30 mentions/hour effective. Cost per mention: $0.50. Same 11× cost, plus 24-48 hour latency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build it yourself&lt;/strong&gt;: 5 separate scrapers, 5 different output schemas to parse, dedup logic, sentiment classifier, ongoing maintenance every time a platform changes their frontend. Conservatively 60-100 engineering hours upfront and 10-20 hours/month ongoing. At $150/hr loaded engineer cost, that's $9K-$15K to build + $1.5K-$3K/month to maintain. &lt;strong&gt;Break-even vs this Actor: never, unless you're pulling &amp;gt;100K mentions/month.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
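
&lt;p&gt;The build-it-yourself break-even can be worked out from the figures in the last bullet. This is a rough sketch under one added assumption that is not in the text above: the upfront build cost is spread evenly over 12 months. The function name is hypothetical.&lt;/p&gt;

```python
PRICE_PER_MENTION = 0.045  # USD, the Actor's quoted price
HOURLY_RATE = 150          # USD, loaded engineer cost from the text


def breakeven_mentions_per_month(build_hours: int, maintain_hours: int, amortize_months: int = 12) -> int:
    """Monthly volume at which in-house cost equals the per-mention bill.

    amortize_months is an assumption: the build cost is spread evenly over it.
    """
    monthly_build = build_hours * HOURLY_RATE / amortize_months
    monthly_maintenance = maintain_hours * HOURLY_RATE
    return round((monthly_build + monthly_maintenance) / PRICE_PER_MENTION)


low = breakeven_mentions_per_month(build_hours=60, maintain_hours=10)
high = breakeven_mentions_per_month(build_hours=100, maintain_hours=20)
print(f"Break-even range: {low:,} to {high:,} mentions/month")
```

&lt;p&gt;A shorter amortization window pushes the break-even volume higher, and a longer one pulls it down toward the maintenance-only floor.&lt;/p&gt;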

&lt;p&gt;Every mention you pull at $0.045 buys you: the raw text, the author identity (handle, follower count, verified flag), the engagement metrics (likes, comments, shares), the timestamp, the URL, media URLs, language detection, lexicon-based sentiment scoring, and — if dedup is enabled — a &lt;code&gt;crossPlatformReposts&lt;/code&gt; array showing exactly which other platforms amplified the same content. That's a record your competitive intelligence analyst would gladly take and immediately put into a deck.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The more you run it, the more leverage you get from each run.&lt;/strong&gt; A single weekly snapshot tells you nothing about velocity. A daily run shows you trends. An hourly run during a crisis or product launch shows you the inflection point in real time — which is when one mention is worth $4.50 to your PR team, not $0.045.&lt;/p&gt;

&lt;h2&gt;
  
  
  8 concrete use cases (run it like this)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Daily brand health dashboard
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Workflow&lt;/strong&gt;: Cron daily at 8am local time, single brand keyword, 7-day lookback, sentiment + dedup enabled. Push the canonical mentions to your BI tool (Looker / Metabase / Hex / Sigma) for a stacked-by-platform sentiment chart, follower-weighted reach total, and top-10 highest-engagement mention list. Run for 30 days, you have a baseline. Run for 90 days, you have a leading-indicator dashboard your CMO will check daily.&lt;br&gt;
&lt;strong&gt;Volume&lt;/strong&gt;: 100 mentions/day × 30 days = 3,000/mo. &lt;strong&gt;Cost: ~$135/mo per brand.&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Compare to&lt;/strong&gt;: $4,000/mo for a Synthesio seat covering the same platforms.&lt;/p&gt;
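
&lt;p&gt;Before the data reaches a BI tool, the three dashboard inputs can be computed directly from the normalized records. The sketch below is one possible shape, not the Actor's own code: it treats "follower-weighted reach" as the plain sum of &lt;code&gt;authorFollowerCount&lt;/code&gt; (an assumption), and the function name is hypothetical. Field names follow the sample output shown later in this post.&lt;/p&gt;

```python
from collections import Counter, defaultdict


def summarize(mentions: list[dict], top_n: int = 10) -> dict:
    """Build the three dashboard inputs from one day's normalized mentions."""
    by_platform: dict[str, Counter] = defaultdict(Counter)
    reach = 0
    for m in mentions:
        polarity = (m.get("sentiment") or {}).get("polarity", "unknown")
        by_platform[m["platform"]][polarity] += 1
        reach += m.get("authorFollowerCount") or 0
    # Rank by likes; records without engagement data sort last.
    top = sorted(
        mentions,
        key=lambda m: (m.get("engagementMetrics") or {}).get("likes", 0),
        reverse=True,
    )[:top_n]
    return {
        "sentiment_by_platform": {p: dict(c) for p, c in by_platform.items()},
        "follower_reach": reach,
        "top_mentions": [m["mentionId"] for m in top],
    }


sample = [
    {"mentionId": "weibo_1", "platform": "weibo", "authorFollowerCount": 12500,
     "engagementMetrics": {"likes": 234}, "sentiment": {"polarity": "positive"}},
    {"mentionId": "rednote_1", "platform": "rednote", "authorFollowerCount": 800,
     "engagementMetrics": {"likes": 12}, "sentiment": {"polarity": "negative"}},
]
report = summarize(sample)
print(report)
```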
&lt;h3&gt;
  
  
  2. Crisis monitoring (hourly polling with Slack alerts)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Workflow&lt;/strong&gt;: Cron every hour, 1-day lookback, sentiment-enabled, filter for &lt;code&gt;sentiment.polarity == "negative"&lt;/code&gt; AND &lt;code&gt;authorFollowerCount &amp;gt; 10000&lt;/code&gt;. Pipe matching records to a Slack webhook that pings #pr-alerts. The moment an account with more than 10,000 followers posts a negative mention, your PR team knows within 60 minutes, versus 24-72 hours via Google Alerts or "someone forwarded it to me."&lt;br&gt;
&lt;strong&gt;Volume&lt;/strong&gt;: Most hours return 0-5 mentions. ~200 mentions/day amortized = 6,000/mo. &lt;strong&gt;Cost: ~$270/mo per brand.&lt;/strong&gt; A single prevented PR crisis pays for the entire year.&lt;/p&gt;
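
&lt;p&gt;The filter-and-alert step might look like the sketch below. It uses only the standard library and the Slack incoming-webhook convention of posting a JSON body with a &lt;code&gt;text&lt;/code&gt; key. The function names are hypothetical, the webhook URL is yours to supply, and the field names follow the sample output shown later in this post.&lt;/p&gt;

```python
import json
import urllib.request


def is_alert(mention: dict, min_followers: int = 10_000) -> bool:
    """True for negative mentions from accounts above the follower threshold."""
    polarity = (mention.get("sentiment") or {}).get("polarity")
    followers = mention.get("authorFollowerCount") or 0
    return polarity == "negative" and followers > min_followers


def format_alert(mention: dict) -> str:
    """One-line summary suitable for a Slack message."""
    followers = mention.get("authorFollowerCount") or 0
    return (
        f"[{mention.get('platform', '?')}] {mention.get('authorName', 'unknown')} "
        f"({followers:,} followers): {mention.get('content', '')[:120]} {mention.get('url', '')}"
    )


def post_to_slack(webhook_url: str, text: str) -> None:
    """Send a plain-text message to a Slack incoming webhook."""
    body = json.dumps({"text": text}).encode("utf-8")
    request = urllib.request.Request(
        webhook_url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        response.read()


sample = [
    {"platform": "weibo", "authorName": "A", "authorFollowerCount": 52000,
     "content": "example negative post", "sentiment": {"polarity": "negative"}},
    {"platform": "weibo", "authorName": "B", "authorFollowerCount": 300,
     "content": "example negative post", "sentiment": {"polarity": "negative"}},
]
alerts = [format_alert(m) for m in sample if is_alert(m)]
print(alerts)
# In production: for text in alerts: post_to_slack(WEBHOOK_URL, text)
```

&lt;p&gt;With a 1-day lookback on an hourly schedule, the same mention can show up in consecutive runs, so it is worth remembering which &lt;code&gt;mentionId&lt;/code&gt; values have already been alerted on before posting.&lt;/p&gt;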
&lt;h3&gt;
  
  
  3. Pre-launch competitor intelligence (one-off pull)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Workflow&lt;/strong&gt;: Before launching a new SKU in China, pull 30 days of mentions on each competitor brand keyword across all 5 platforms. Look at: which platforms each competitor over-indexes on, which KOLs are talking about them, what sentiment dominates, what product attributes get the most positive vs negative mentions. Run this once a quarter on 5 competitors and you have the best competitive intel deck in the room.&lt;br&gt;
&lt;strong&gt;Volume&lt;/strong&gt;: 5 competitors × 500 mentions each = 2,500 mentions per pull. &lt;strong&gt;Cost: ~$112 per quarterly pull.&lt;/strong&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  4. KOL identification and vetting
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Workflow&lt;/strong&gt;: Weekly run on your category keyword (e.g. "护肤" for skincare, "球鞋" for sneakers). Pull 500 mentions/week. Filter the output for &lt;code&gt;authorVerified == true&lt;/code&gt; AND &lt;code&gt;authorFollowerCount &amp;gt; 50000&lt;/code&gt;. Sort by &lt;code&gt;engagementMetrics.likes&lt;/code&gt; descending. Top 20 results = your candidate KOL list for the week, scored by actual cultural reach not by paid impressions. Compare against your influencer agency's recommendations.&lt;br&gt;
&lt;strong&gt;Volume&lt;/strong&gt;: 500 mentions/week × 4 weeks = 2,000 mentions/mo. &lt;strong&gt;Cost: ~$90/mo per category.&lt;/strong&gt;&lt;/p&gt;
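
&lt;p&gt;The filter-and-sort step can be sketched as follows. One detail is my own addition rather than something the workflow states: the sketch keeps only each author's best-performing post, so a single prolific account cannot fill several of the 20 slots. The function name is hypothetical; field names follow the sample output shown later in this post.&lt;/p&gt;

```python
def shortlist_kols(mentions: list[dict], min_followers: int = 50_000, top_n: int = 20) -> list[dict]:
    """Verified, high-follower authors ranked by likes, one entry per author."""

    def likes(m: dict) -> int:
        return (m.get("engagementMetrics") or {}).get("likes", 0)

    candidates = [
        m for m in mentions
        if m.get("authorVerified") is True
        and (m.get("authorFollowerCount") or 0) > min_followers
    ]
    seen: set[str] = set()
    shortlist = []
    for m in sorted(candidates, key=likes, reverse=True):
        author = m.get("authorName", "")
        if author in seen:
            continue  # keep only each author's best-performing post
        seen.add(author)
        shortlist.append(m)
    return shortlist[:top_n]


sample = [
    {"authorName": "A", "authorVerified": True, "authorFollowerCount": 1_200_000, "engagementMetrics": {"likes": 900}},
    {"authorName": "B", "authorVerified": True, "authorFollowerCount": 60_000, "engagementMetrics": {"likes": 1500}},
    {"authorName": "C", "authorVerified": False, "authorFollowerCount": 500_000, "engagementMetrics": {"likes": 5000}},
    {"authorName": "A", "authorVerified": True, "authorFollowerCount": 1_200_000, "engagementMetrics": {"likes": 300}},
    {"authorName": "E", "authorVerified": True, "authorFollowerCount": 20_000, "engagementMetrics": {"likes": 9999}},
]
names = [m["authorName"] for m in shortlist_kols(sample)]
print(names)
```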
&lt;h3&gt;
  
  
  5. China-watcher hedge fund alt-data signal
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Workflow&lt;/strong&gt;: Daily run on each portfolio ticker that has Chinese consumer exposure (BABA, JD, PDD, BIDU, NIO, BYD, ANTA, Yum China, POP MART, etc.). Pull mentions from Xueqiu (financial sentiment) + Weibo (consumer sentiment) + RedNote (brand affinity for consumer brands). Build a sentiment-velocity feature: 7-day mention count delta + sentiment polarity shift. Backtest against earnings surprises and brand event days — Chinese consumer sentiment leads Western analyst consensus by 2-6 weeks for most consumer-facing names.&lt;br&gt;
&lt;strong&gt;Volume&lt;/strong&gt;: 20 tickers × 50 mentions/day = 1,000/day × 22 trading days = 22,000/mo. &lt;strong&gt;Cost: ~$990/mo.&lt;/strong&gt; Compare to: a single Bloomberg China consumer alt-data feed subscription, $80K-$200K/year minimum.&lt;/p&gt;
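
&lt;p&gt;One way to compute the sentiment-velocity feature is sketched below. It compares the most recent 7-day window with the 7 days before it. Two assumptions are mine, not the Actor's documentation: &lt;code&gt;sentiment.score&lt;/code&gt; is treated as a number where higher means more positive, and an empty window contributes a mean of 0.0. The function name is hypothetical.&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone
from statistics import mean


def sentiment_velocity(mentions: list[dict], as_of: datetime, window_days: int = 7) -> dict:
    """Compare the latest window against the one immediately before it."""
    window = timedelta(days=window_days)
    recent, prior = [], []
    for m in mentions:
        # publishedAt is an ISO 8601 string with a UTC offset in the sample output.
        age = as_of - datetime.fromisoformat(m["publishedAt"])
        score = (m.get("sentiment") or {}).get("score", 0.0)
        if window > age >= timedelta(0):
            recent.append(score)
        elif 2 * window > age >= window:
            prior.append(score)
    return {
        "count_delta": len(recent) - len(prior),
        "score_shift": (mean(recent) if recent else 0.0) - (mean(prior) if prior else 0.0),
    }


sample = [
    {"publishedAt": "2026-05-15T14:32:11+00:00", "sentiment": {"score": 0.72}},
    {"publishedAt": "2026-05-12T08:00:00+00:00", "sentiment": {"score": 0.28}},
    {"publishedAt": "2026-05-05T09:00:00+00:00", "sentiment": {"score": 0.10}},
]
feature = sentiment_velocity(sample, as_of=datetime(2026, 5, 16, tzinfo=timezone.utc))
print(feature)
```

&lt;p&gt;Pass a timezone-aware &lt;code&gt;as_of&lt;/code&gt;; subtracting an offset-aware timestamp from a naive one raises a &lt;code&gt;TypeError&lt;/code&gt;.&lt;/p&gt;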
&lt;h3&gt;
  
  
  6. AI / LLM training data corpus
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Workflow&lt;/strong&gt;: One-time bulk pull on a diverse set of brand keywords across consumer categories. 50 brands × 1,000 mentions each = 50K Chinese-language consumer text records, each carrying a lexicon-based sentiment label. Drop into your SFT or RLHF pipeline for Chinese-language consumer-domain fine-tuning. This is the densest source of brand-grounded labeled Chinese text outside of paid academic corpora.&lt;br&gt;
&lt;strong&gt;Volume&lt;/strong&gt;: 50,000 mentions one-time. &lt;strong&gt;Cost: ~$2,250 one-time.&lt;/strong&gt; Compare to: licensing a comparable academic corpus from Trinity College Dublin or Tsinghua, $15K-$50K per corpus, single-use license, 6-month delivery.&lt;/p&gt;
&lt;h3&gt;
  
  
  7. Cross-platform virality discovery (run it weekly, look at the dedup array)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Workflow&lt;/strong&gt;: Weekly run, dedup ENABLED, scan the output for canonical records where &lt;code&gt;crossPlatformReposts.length &amp;gt;= 2&lt;/code&gt;. Those are mentions that spread across multiple platforms within 24 hours — the closest thing to a "viral" signal you can extract from raw mention data. Use it to identify breakout moments before they hit mainstream Chinese media.&lt;br&gt;
&lt;strong&gt;Volume&lt;/strong&gt;: 500 mentions/week × 4 weeks = 2,000/mo. &lt;strong&gt;Cost: ~$90/mo per brand.&lt;/strong&gt; Most viral moments cost $50K-$200K in PR services to capitalize on; this is how you find them 48-72 hours earlier than the agency.&lt;/p&gt;
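
&lt;p&gt;The scan itself is a one-line filter. A minimal sketch, assuming the &lt;code&gt;crossPlatformReposts&lt;/code&gt; array has the shape shown in the sample output later in this post (a list of objects with a &lt;code&gt;platform&lt;/code&gt; key); the function name is hypothetical.&lt;/p&gt;

```python
def viral_mentions(mentions: list[dict], min_reposts: int = 2) -> list[dict]:
    """Canonical records whose repost array has at least min_reposts entries."""
    return [
        m for m in mentions
        if len(m.get("crossPlatformReposts") or []) >= min_reposts
    ]


sample = [
    {"mentionId": "weibo_1", "crossPlatformReposts": [{"platform": "rednote"}, {"platform": "bilibili"}]},
    {"mentionId": "weibo_2", "crossPlatformReposts": [{"platform": "rednote"}]},
    {"mentionId": "douban_1"},
]
for m in viral_mentions(sample):
    platforms = sorted(r["platform"] for r in m["crossPlatformReposts"])
    print(m["mentionId"], "also seen on", ", ".join(platforms))
```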
&lt;h3&gt;
  
  
  8. Multi-brand portfolio monitoring (agency workflow)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Workflow&lt;/strong&gt;: One Actor run per client brand, scheduled daily via Apify Schedules. 10 clients × 500 mentions/brand/day = 5,000 mentions/day. Each client gets their own dataset and dashboard. The agency bills $2K-$5K/client/month for "China monitoring," delivers a custom dashboard, and the underlying data cost is ~$675/client/month, which leaves a gross margin of roughly 66% at the $2K tier and 86% at the $5K tier.&lt;br&gt;
&lt;strong&gt;Volume&lt;/strong&gt;: 5,000 mentions/day × 30 days = 150,000/mo. &lt;strong&gt;Cost: ~$6,750/mo for 10 brands.&lt;/strong&gt; Revenue at $3K/client × 10 = $30K/mo. &lt;strong&gt;Gross margin: 77.5%.&lt;/strong&gt;&lt;/p&gt;
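
&lt;p&gt;The margin depends heavily on which fee tier a client sits in, so it is worth computing per tier. A small sketch using only the figures from this use case (500 mentions per brand per day, 30 days, $0.045 per mention); the function name is hypothetical.&lt;/p&gt;

```python
PRICE_PER_MENTION = 0.045  # USD, the Actor's quoted price


def gross_margin(fee_per_client: float, mentions_per_day: int = 500, days: int = 30) -> float:
    """Fraction of the monthly client fee left after paying for the data."""
    data_cost = mentions_per_day * days * PRICE_PER_MENTION
    return (fee_per_client - data_cost) / fee_per_client


for fee in (2_000, 3_000, 5_000):
    print(f"${fee:,}/client/month: {gross_margin(fee):.1%} gross margin")
```

&lt;p&gt;This covers data cost only; analyst time and dashboard hosting would come out of the same margin.&lt;/p&gt;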
&lt;h2&gt;
  
  
  What the output looks like
&lt;/h2&gt;

&lt;p&gt;Every mention is normalized to the same schema regardless of platform. Here's a real Weibo mention from a test run on the Chinese sportswear brand 李宁 (Li-Ning):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"mentionId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"weibo_4923475823745"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"platform"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"weibo"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"brandKeyword"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"李宁"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"brandMatchType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"exact"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"李宁的新款跑鞋质量真不错，比之前的耐克舒服多了！"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"language"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"zh-CN"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"authorName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"运动达人"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"authorFollowerCount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;12500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"authorVerified"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"publishedAt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-05-15T14:32:11+00:00"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"engagementMetrics"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"likes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;234&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"comments"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"shares"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://weibo.com/1234567890/4923475823745"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"sentiment"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"polarity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"positive"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.72&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"crossPlatformReposts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"platform"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"rednote"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://www.xiaohongshu.com/explore/abc123"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three fields buyers tell me they care about most:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;sentiment&lt;/code&gt;&lt;/strong&gt; — lexicon-based Chinese sentiment scoring on every mention. Polarity (positive / neutral / negative) plus a numeric score. Disable it if you have your own pipeline; enabled by default.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;crossPlatformReposts&lt;/code&gt;&lt;/strong&gt; — the same viral post often appears across Weibo and RedNote within hours. The aggregator detects this with SimHash similarity and merges duplicates into the canonical record, with the repost paths preserved. &lt;strong&gt;You don't pay twice for the same mention&lt;/strong&gt;, and you get a free virality signal.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;authorFollowerCount&lt;/code&gt; / &lt;code&gt;authorVerified&lt;/code&gt;&lt;/strong&gt; — the difference between a 200-follower throwaway account and a 1.2M-follower verified KOL is the difference between "ignore this" and "alert the C-suite." Follower-weighting your dashboard is the first thing every serious buyer does with the data.&lt;/li&gt;
&lt;/ul&gt;
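
&lt;p&gt;For readers unfamiliar with SimHash, here is a generic sketch of the technique. It is not the aggregator's actual implementation: the character-bigram tokenization, the MD5-based token hash, and the distance threshold of 3 bits are all illustrative assumptions, and the function names are hypothetical.&lt;/p&gt;

```python
import hashlib

BITS = 64


def _token_hash(token: str) -> int:
    """Stable 64-bit hash of one token (first 8 bytes of its MD5 digest)."""
    return int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")


def simhash(text: str, n: int = 2) -> int:
    """SimHash fingerprint over character n-grams, which suits unsegmented Chinese text."""
    grams = [text[i:i + n] for i in range(max(len(text) - n + 1, 1))]
    weights = [0] * BITS
    for gram in grams:
        h = _token_hash(gram)
        for bit in range(BITS):
            # Each n-gram votes +1 or -1 on every bit position.
            weights[bit] += 1 if (h >> bit) % 2 else -1
    # A bit is set in the fingerprint when its votes are net positive.
    return sum(2 ** bit for bit in range(BITS) if weights[bit] > 0)


def hamming(a: int, b: int) -> int:
    """Number of bit positions where two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: str, b: str, max_distance: int = 3) -> bool:
    """Treat two texts as the same post when their fingerprints nearly match."""
    return max_distance >= hamming(simhash(a), simhash(b))


original = "李宁的新款跑鞋质量真不错，比之前的耐克舒服多了！"
print(hamming(simhash(original), simhash(original)))
print(hamming(simhash(original), simhash("今天天气很好，适合出门散步。")))
```

&lt;p&gt;Similar texts share most of their n-grams, so most bit votes agree and the fingerprints land a small Hamming distance apart. The right threshold depends on text length and how aggressive you want merging to be.&lt;/p&gt;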

&lt;h2&gt;
  
  
  Quick start
&lt;/h2&gt;

&lt;p&gt;The input is brutally simple. This is a complete config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"brandKeyword"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"李宁"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"platforms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"weibo"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"bilibili"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"rednote"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"douban"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"xueqiu"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"maxMentionsPerPlatform"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"lookbackDays"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"sentimentAnalysis"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"deduplication"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One brand keyword and one call return a normalized stream of mentions across all five platforms. The &lt;code&gt;lookbackDays&lt;/code&gt; filter applies per platform, so you only get fresh content; &lt;code&gt;deduplication&lt;/code&gt; collapses cross-platform reposts; &lt;code&gt;sentimentAnalysis&lt;/code&gt; tags every record.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;To run it on a schedule&lt;/strong&gt; (which is where the real value compounds): use &lt;a href="https://docs.apify.com/platform/schedules" rel="noopener noreferrer"&gt;Apify Schedules&lt;/a&gt; and set a cron expression. &lt;code&gt;0 8 * * *&lt;/code&gt; for daily 8am runs, &lt;code&gt;0 * * * *&lt;/code&gt; for hourly, &lt;code&gt;*/15 * * * *&lt;/code&gt; for every 15 minutes during a launch or crisis. Each run hits the same Actor with your saved input config, pushes to the same dataset, and bills only on the new canonical mentions. &lt;strong&gt;The Actor is built to be run thousands of times — that's how you go from "snapshot" to "monitoring system."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you have cookies for any of the platforms (from a logged-in browser session), you can pass them under &lt;code&gt;cookieStrings&lt;/code&gt; to unlock higher recall and rate limits. Cookies are optional; the Actor degrades gracefully without them.&lt;/p&gt;

&lt;p&gt;For deeper single-platform scraping (full comment trees, infinite scroll, profile enrichment), use the dedicated single-platform actors directly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://apify.com/zhorex/weibo-scraper" rel="noopener noreferrer"&gt;Weibo Scraper&lt;/a&gt; — posts, hot search, comments, profiles&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://apify.com/zhorex/rednote-xiaohongshu-scraper" rel="noopener noreferrer"&gt;RedNote (Xiaohongshu) Scraper&lt;/a&gt; — notes, comments, profiles, video&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://apify.com/zhorex/douban-scraper" rel="noopener noreferrer"&gt;Douban Scraper&lt;/a&gt; — long-form reviews and group discussions&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://apify.com/zhorex/xueqiu-scraper" rel="noopener noreferrer"&gt;Xueqiu Scraper&lt;/a&gt; — ticker-tagged posts, KOL tracking&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://apify.com/zhorex/jd-scraper" rel="noopener noreferrer"&gt;JD.com Scraper&lt;/a&gt; — product detail extraction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The aggregator is for &lt;strong&gt;recurring&lt;/strong&gt; cross-platform brand monitoring with normalized output. The single-platform scrapers are for &lt;strong&gt;one-off&lt;/strong&gt; deep extraction inside one platform.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this is NOT
&lt;/h2&gt;

&lt;p&gt;Being honest about scope is more useful than vague promises:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Not real-time streaming.&lt;/strong&gt; Poll-based — 5-15 minute effective refresh is realistic. If you need millisecond latency, this isn't it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not WeChat coverage.&lt;/strong&gt; WeChat has no public scraping interface; trying is a fast way to get accounts banned.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not Douyin (TikTok China).&lt;/strong&gt; Out of scope for v0.1 — under evaluation for the roadmap.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not a Synthesio replacement at the largest enterprise scale.&lt;/strong&gt; Synthesio also covers TV, podcasts, and news, and provides a managed-service layer. This Actor is the data layer; bring your own BI / dashboard / alerting stack. Most teams that pick this over Synthesio do so because they already own their BI stack and just need the raw normalized feed at a sane price.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Compliance posture
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Only public mentions — no private accounts, no DMs, no follower lists.&lt;/li&gt;
&lt;li&gt;No login bypass; cookies are user-supplied for higher rate limits only, and they're stored as a secret in the Apify input schema (encrypted at rest).&lt;/li&gt;
&lt;li&gt;Reviewer / commenter nicknames are partially redacted by the source platforms; this Actor passes through what the platforms display. No additional PII enrichment.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try it free, then scale it
&lt;/h2&gt;

&lt;p&gt;The Apify free plan includes monthly platform credit that covers a meaningful first batch of mentions — enough to validate the data quality on your own brand keyword before any commitment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://apify.com/zhorex/chinese-brand-monitor" rel="noopener noreferrer"&gt;Try Chinese Brand Monitor on Apify →&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The flow most teams follow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Week 1&lt;/strong&gt;: Run it once manually on your main brand keyword. Verify the output quality on a brand you know well — every mention should be one you'd recognize. (~$2-5 spend.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Week 2&lt;/strong&gt;: Wire it into a daily Apify Schedule. Stream the dataset to your BI tool. Watch one week of trend data. (~$25-50 spend.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Week 3&lt;/strong&gt;: Add a second brand keyword (competitor, partner, or category term). Add a Slack webhook for negative-sentiment alerts above a follower threshold. (~$50-100 spend.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Month 2 onward&lt;/strong&gt;: Production. Daily monitoring on your core brand portfolio, hourly during launches and crises, monthly competitive intel pulls. (Typical mid-market team: $200-1,500/mo.)&lt;/li&gt;
&lt;/ol&gt;
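
&lt;p&gt;The Week 3 alert rule (negative sentiment above a follower threshold) could look roughly like the sketch below. The &lt;code&gt;sentiment&lt;/code&gt; and &lt;code&gt;platform&lt;/code&gt; field names and the 100,000-follower cut-off are assumptions for illustration, so check the Actor's actual output schema before relying on them. The payload uses the plain &lt;code&gt;{"text": ...}&lt;/code&gt; body that Slack incoming webhooks accept; the HTTP POST itself is left out.&lt;/p&gt;

```python
import json

FOLLOWER_THRESHOLD = 100_000  # assumed cut-off; tune per brand


def needs_alert(mention, threshold=FOLLOWER_THRESHOLD):
    """True for negative mentions from authors at or above the follower threshold."""
    followers = mention.get("authorFollowerCount") or 0
    # "sentiment" is a hypothetical field name for the sentiment tag.
    return mention.get("sentiment") == "negative" and followers >= threshold


def slack_payload(mention):
    """Build the JSON body for a Slack incoming webhook message."""
    followers = mention.get("authorFollowerCount") or 0
    text = (
        f"Negative mention on {mention.get('platform', 'unknown')} "
        f"from an author with {followers:,} followers"
    )
    return json.dumps({"text": text}, ensure_ascii=False)


# Hypothetical records for illustration.
big = {"platform": "weibo", "sentiment": "negative", "authorFollowerCount": 1_200_000}
small = {"platform": "weibo", "sentiment": "negative", "authorFollowerCount": 200}

for mention in (big, small):
    if needs_alert(mention):
        print(slack_payload(mention))
```

&lt;p&gt;Posting the resulting body to your webhook URL with any HTTP client completes the alert.&lt;/p&gt;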

&lt;p&gt;Pricing is transparent at every step: &lt;strong&gt;$0.045 per canonical mention&lt;/strong&gt;, billed only on the deduplicated records. No subscription, no minimum spend, no annual contract. Pause it for a month, scale it up 10x next week, switch brand keywords mid-run — it all just works.&lt;/p&gt;

&lt;p&gt;The teams getting the most out of this are running it on a schedule, daily or hourly, across multiple brand keywords, piping the normalized output into their existing BI / Slack / dashboard stack. Each run pays for itself the first time it surfaces a mention you would have missed.&lt;/p&gt;

&lt;p&gt;Open an issue on the &lt;a href="https://apify.com/zhorex/chinese-brand-monitor/issues" rel="noopener noreferrer"&gt;Actor page&lt;/a&gt; if you hit any edge case. Typical turnaround on fixes is 48 hours.&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>apify</category>
      <category>china</category>
      <category>marketing</category>
    </item>
    <item>
      <title>JD.com's isJdSelfRun Flag Is the Best Gray-Market Detection Signal in Chinese E-Commerce (Python Scraper Inside)</title>
      <dc:creator>Sami</dc:creator>
      <pubDate>Fri, 15 May 2026 22:08:41 +0000</pubDate>
      <link>https://dev.to/sami_8858131362756585e4f4/jdcoms-isjdselfrun-flag-is-the-best-gray-market-detection-signal-in-chinese-e-commerce-python-3ib3</link>
      <guid>https://dev.to/sami_8858131362756585e4f4/jdcoms-isjdselfrun-flag-is-the-best-gray-market-detection-signal-in-chinese-e-commerce-python-3ib3</guid>
      <description>&lt;p&gt;If your brand sells on JD.com (China's #2 e-commerce platform, ~600M annual active users) — or competes against one that does — there's a gray-market problem you can't see without one specific field in JD's data.&lt;/p&gt;

&lt;p&gt;That field is &lt;code&gt;isJdSelfRun&lt;/code&gt;. It tells you whether a given product listing is fulfilled by JD itself (their warehouses, their warranty, their return logistics) or by a third-party merchant on JD's marketplace. Combined with the seller's &lt;code&gt;sellerType&lt;/code&gt; (flagship / franchise / specialty / self-run), it's the single cleanest signal for detecting unauthorized resellers on Chinese e-commerce — and almost no generic scraper surfaces it.&lt;/p&gt;

&lt;p&gt;This post walks through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why JD's hybrid retail model creates the gray-market detection opportunity&lt;/li&gt;
&lt;li&gt;The exact field signatures (&lt;code&gt;isJdSelfRun&lt;/code&gt;, &lt;code&gt;sellerType&lt;/code&gt;) and what they mean&lt;/li&gt;
&lt;li&gt;Three concrete workflows: brand authorization audit, competitive pricing, gray-market detection&lt;/li&gt;
&lt;li&gt;A 50-line Python integration with the Apify Actor I built around this&lt;/li&gt;
&lt;li&gt;Honest cost math at indie scale and at hedge-fund scale&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you don't want to read the whole thing: the Actor is at &lt;strong&gt;&lt;a href="https://apify.com/zhorex/jd-scraper" rel="noopener noreferrer"&gt;zhorex/jd-scraper&lt;/a&gt;&lt;/strong&gt;, and pricing is &lt;strong&gt;$0.008 per product detail + $0.02 per seller store record&lt;/strong&gt; (pay-per-event, no subscription).&lt;/p&gt;

&lt;h2&gt;
  
  
  The hybrid retail model that creates the signal
&lt;/h2&gt;

&lt;p&gt;JD.com is structurally different from Tmall and Pinduoduo. Tmall is a marketplace — every SKU is sold by a third-party merchant; Alibaba just runs the platform. JD operates a hybrid: a meaningful chunk of its catalog is sold and shipped by JD itself (JD Logistics, JD Plus warranty, JD's own returns), with the rest fulfilled by marketplace merchants.&lt;/p&gt;

&lt;p&gt;That hybrid creates an information asymmetry buyers can exploit. When a consumer searches a brand's SKU on JD, they see all listings, but the &lt;strong&gt;trust signal&lt;/strong&gt; comes from whether a listing is JD self-run or third-party. For a brand monitoring team, the question becomes: of the third-party listings of &lt;em&gt;my&lt;/em&gt; SKU, which are authorized resellers and which are gray-market?&lt;/p&gt;

&lt;p&gt;The data answers it in two fields:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"productId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"100009082476"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"sellerName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Apple产品京东自营旗舰店"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"isJdSelfRun"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"sellerId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1000003566"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;isJdSelfRun: true&lt;/code&gt; means JD is the seller. The other listings — those with &lt;code&gt;isJdSelfRun: false&lt;/code&gt; — are where the gray-market questions live, and where you need the seller's type to decide.&lt;/p&gt;

&lt;h2&gt;
  
  
  The seller type enum
&lt;/h2&gt;

&lt;p&gt;A separate scrape against the seller store endpoint resolves &lt;code&gt;sellerType&lt;/code&gt; to one of four values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"sellerId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1000003566"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"sellerType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"flagship_store"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"serviceScore"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;4.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"logisticsScore"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;4.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"descriptionAccuracyScore"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;4.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;flagship_store&lt;/code&gt;&lt;/strong&gt; (官方旗舰店) — the brand's own JD store. There should be exactly one per brand. If you see multiple, you have a counterfeit-or-impersonator problem.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;franchise_store&lt;/code&gt;&lt;/strong&gt; (品牌专营店) — authorized franchise of the brand. Brands typically maintain a list of these.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;specialty_store&lt;/code&gt;&lt;/strong&gt; (专卖店) — third-party that specializes in selling the brand. Often authorized via distribution agreement; sometimes gray-market.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;jd_self_run&lt;/code&gt;&lt;/strong&gt; (京东自营) — JD's direct retail. Always legitimate.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The canonical gray-market signature: a &lt;code&gt;flagship_store&lt;/code&gt; listing alongside three &lt;code&gt;specialty_store&lt;/code&gt; listings priced 20-40% lower on the same SKU. Those specialty stores are usually moving inventory acquired outside the authorized channel (parallel imports, diverted product, refurbished-as-new). They're flagged the moment your monitoring sees the price gap.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three workflows the data unlocks
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Workflow 1 — Brand authorization audit
&lt;/h3&gt;

&lt;p&gt;Submit your SKU IDs. Get back a record per listing with &lt;code&gt;sellerType&lt;/code&gt; resolved. Filter to entries where &lt;code&gt;isJdSelfRun: false&lt;/code&gt; AND &lt;code&gt;sellerType&lt;/code&gt; is not in your authorized list. That's your unauthorized reseller list, refreshed on whatever cadence you want.&lt;/p&gt;
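
&lt;p&gt;A minimal sketch of that filter, assuming the authorized list is kept as a set of seller IDs (it could equally be a set of allowed &lt;code&gt;sellerType&lt;/code&gt; values). The allow-list contents and the sample listings are hypothetical; only the &lt;code&gt;isJdSelfRun&lt;/code&gt;, &lt;code&gt;sellerId&lt;/code&gt;, and &lt;code&gt;sellerType&lt;/code&gt; field names come from the records shown earlier.&lt;/p&gt;

```python
# Hypothetical allow-list maintained by the brand team.
AUTHORIZED_SELLER_IDS = {"1000003566"}


def unauthorized_listings(listings, authorized_ids=AUTHORIZED_SELLER_IDS):
    """Keep third-party listings whose seller is not on the brand's allow-list."""
    return [
        listing
        for listing in listings
        if not listing.get("isJdSelfRun")
        and listing.get("sellerId") not in authorized_ids
    ]


# Illustrative listings; the seller IDs other than the first are made up.
sample_listings = [
    {"sellerId": "1000003566", "isJdSelfRun": True, "sellerType": "jd_self_run"},
    {"sellerId": "2000000001", "isJdSelfRun": False, "sellerType": "specialty_store"},
    {"sellerId": "1000003566", "isJdSelfRun": False, "sellerType": "flagship_store"},
]

for listing in unauthorized_listings(sample_listings):
    print(listing["sellerId"], listing["sellerType"])
```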

&lt;p&gt;A small brand watching 50 SKUs across 200 listings (an average of 4 sellers per SKU) costs about $4.40 per refresh: 200 seller records × $0.02 + 50 product details × $0.008 = $4.40 ($4.00 if you skip product detail and only check sellers).&lt;/p&gt;

&lt;h3&gt;
  
  
  Workflow 2 — Competitive pricing intelligence
&lt;/h3&gt;

&lt;p&gt;The product detail mode returns a &lt;code&gt;realtimePrice&lt;/code&gt; field that is fetched fresh at scrape time, not parsed from cached HTML. JD runs flash discounts that move prices within hours; cached scrapers miss them entirely.&lt;/p&gt;

&lt;p&gt;Tracking 200 competitor SKUs hourly = 200 × 24 × 30 = 144,000 detail records per month, $1,152 in raw event cost. At hedge-fund-grade refresh rates this is real money, but it's the right order of magnitude for the buyer cohort that already pays $3K-15K/month for alt-data feeds.&lt;/p&gt;

&lt;p&gt;Tracking 200 SKUs &lt;em&gt;daily&lt;/em&gt; (more realistic for a brand team) = 6,000 records × $0.008 = $48/month. Cheap enough to run as a cron.&lt;/p&gt;
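
&lt;p&gt;Once two snapshots exist, spotting a flash discount is a comparison between them. The sketch below assumes each snapshot is a plain dict mapping &lt;code&gt;productId&lt;/code&gt; to a numeric &lt;code&gt;realtimePrice&lt;/code&gt;; the 10% threshold and the sample prices are illustrative, and how you store snapshots between runs is up to you.&lt;/p&gt;

```python
def price_drops(previous, current, min_drop=0.10):
    """Return (productId, old, new) for SKUs whose price fell by at least min_drop."""
    drops = []
    for product_id, new_price in current.items():
        old_price = previous.get(product_id)
        if old_price is None or old_price == 0:
            continue  # no usable baseline to compare against
        if (old_price - new_price) / old_price >= min_drop:
            drops.append((product_id, old_price, new_price))
    return drops


# Hypothetical snapshots: productId -> realtimePrice (CNY).
yesterday = {"100009082476": 6999.0, "100012345678": 100.0}
today = {"100009082476": 5999.0, "100012345678": 99.0, "100099999999": 50.0}

for product_id, old, new in price_drops(yesterday, today):
    print(f"{product_id}: {old} -> {new}")
```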

&lt;h3&gt;
  
  
  Workflow 3 — Gray-market detection at scale
&lt;/h3&gt;

&lt;p&gt;The canonical pattern in code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;product&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;listings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;allListings&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;isJdSelfRun&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
    &lt;span class="n"&gt;cheap_specialty&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;listings&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sellerType&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;specialty_store&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;msrp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.80&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cheap_specialty&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sellerType&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;flagship_store&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;listings&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;alert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cheap_specialty&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the brand-monitoring signal: a real flagship store coexisting with three or more sub-MSRP specialty stores on the same SKU. Brand teams pay agencies five-figure annual contracts to surface exactly this kind of alert; running it yourself on this data feed costs cents per check.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 50-line Python integration
&lt;/h2&gt;

&lt;p&gt;Here's the working integration end-to-end. Replace &lt;code&gt;YOUR_TOKEN&lt;/code&gt; with your Apify API token:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;apify_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ApifyClient&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ApifyClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Scrape product details for your SKU list
&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;actor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zhorex/jd-scraper&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;product_detail&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;productUrls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://item.jd.com/100009082476.html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://item.jd.com/100012345678.html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="n"&gt;unauthorized_sellers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;defaultDatasetId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;iterate_items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;isJdSelfRun&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;unauthorized_sellers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sellerId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Non-self-run listing: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;productTitle&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Seller: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sellerName&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; (id &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sellerId&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Price: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;realtimePrice&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Now drill into those sellers to classify them
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;unauthorized_sellers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;seller_run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;actor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zhorex/jd-scraper&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;seller_store&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sellerUrls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://mall.jd.com/index-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;sid&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;sid&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;unauthorized_sellers&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;seller&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seller_run&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;defaultDatasetId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;iterate_items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;flag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;⚠️ AUDIT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;seller&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sellerType&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;specialty_store&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✅&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;flag&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;seller&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sellerName&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; → &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;seller&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sellerType&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; (service: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;seller&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;serviceScore&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the whole audit. Two API calls, classified output, ready to feed into Slack alerts / spreadsheet exports / BI dashboards.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest pricing — what does this cost in production?
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Workflow&lt;/th&gt;
&lt;th&gt;Volume / month&lt;/th&gt;
&lt;th&gt;Cost / month&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Brand watchlist — 50 SKUs daily&lt;/td&gt;
&lt;td&gt;1,500 product details&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$12&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Brand authorization audit&lt;/td&gt;
&lt;td&gt;500 sellers, monthly&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$10&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Competitive pricing — 200 SKUs daily&lt;/td&gt;
&lt;td&gt;6,000 product details&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$48&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Competitive pricing — 200 SKUs hourly&lt;/td&gt;
&lt;td&gt;144,000 product details&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$1,152&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gray-market sweep — 200 SKUs + 50 sellers&lt;/td&gt;
&lt;td&gt;200 details + 50 sellers&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$2.60&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Indie brand teams typically run the daily/monthly workflows ($10-60/month). Hedge-fund alt-data and agency-scale customers run hourly or 15-minute refreshes (low four figures monthly). Both work on the same Actor with the same event-priced billing.&lt;/p&gt;
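
&lt;p&gt;The table's arithmetic can be reproduced with a short sketch. The per-event rates below ($0.008 per product detail, $0.02 per seller) are back-calculated from the rows above, not quoted from the Actor's price list, so treat them as assumptions and check the Actor page for current rates. The function and constant names are illustrative only.&lt;/p&gt;

```python
# Assumed per-event rates, inferred from the pricing table
# (e.g. $12 / 1,500 product details, $10 / 500 sellers).
PRICE_PER_PRODUCT_DETAIL = 0.008
PRICE_PER_SELLER = 0.02


def details_per_month(skus: int, runs_per_day: int, days: int = 30) -> int:
    """Product-detail events when every SKU is refreshed on each run."""
    return skus * runs_per_day * days


def monthly_cost(product_details: int = 0, sellers: int = 0) -> float:
    """Estimated monthly bill in USD for a given event volume."""
    total = product_details * PRICE_PER_PRODUCT_DETAIL + sellers * PRICE_PER_SELLER
    return round(total, 2)


if __name__ == "__main__":
    daily_watchlist = details_per_month(skus=50, runs_per_day=1)
    hourly_pricing = details_per_month(skus=200, runs_per_day=24)
    print(daily_watchlist, monthly_cost(product_details=daily_watchlist))
    print(hourly_pricing, monthly_cost(product_details=hourly_pricing))
    print(monthly_cost(product_details=200, sellers=50))
```

&lt;p&gt;Plugging in your own SKU count and refresh cadence gives a rough budget before you schedule anything.&lt;/p&gt;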

&lt;h2&gt;
  
  
  What this Actor doesn't do
&lt;/h2&gt;

&lt;p&gt;Two honesty disclosures:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;No search discovery.&lt;/strong&gt; You bring the SKU list. Discovery requires a different scraping pattern that doesn't survive shared residential proxy pools the way product-detail and seller-store requests do.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No review scraping.&lt;/strong&gt; Same reason — JD's WAF gates the review API at the IP-reputation level on shared pools. If you need review sentiment, the Apify Store has other scrapers, or contact me for a premium-proxy integration.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The README on the Actor page documents this in a "Known limitations" section. If your workflow needs either, this Actor isn't the right tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I built this
&lt;/h2&gt;

&lt;p&gt;I run a portfolio of six Chinese-platform scrapers on Apify Store (&lt;a href="https://apify.com/zhorex" rel="noopener noreferrer"&gt;zhorex&lt;/a&gt;). Five of them cover sentiment and content: Weibo for trending, RedNote (Xiaohongshu) for lifestyle, Bilibili for video, Douban for long-form reviews, Xueqiu for stock-cashtag discussion. The JD scraper extends the suite into commerce — the missing layer for buyers who already use the social ones for brand monitoring.&lt;/p&gt;

&lt;p&gt;The six together are a stack. A consumer-electronics brand can track sentiment on Weibo, video reviews on Bilibili, lifestyle unboxings on RedNote, &lt;em&gt;and&lt;/em&gt; gray-market resellers on JD — all on the same vendor, same billing, same API surface.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;The Actor is live: &lt;strong&gt;&lt;a href="https://apify.com/zhorex/jd-scraper" rel="noopener noreferrer"&gt;zhorex/jd-scraper&lt;/a&gt;&lt;/strong&gt;. Pay-per-event billing — no subscription, no setup fee. Run a small evaluation batch (the Apify Free plan includes monthly platform credit you can apply to the run) to confirm output quality on your SKU list before scaling up.&lt;/p&gt;

&lt;p&gt;The rest of the Chinese Digital Intelligence Suite:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://apify.com/zhorex/weibo-scraper" rel="noopener noreferrer"&gt;Weibo Scraper&lt;/a&gt;&lt;/strong&gt; — pair with JD to catch when a SKU trends socially before stock-outs hit&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://apify.com/zhorex/rednote-scraper" rel="noopener noreferrer"&gt;RedNote Scraper&lt;/a&gt;&lt;/strong&gt; — Chinese lifestyle unboxings; useful for fashion, beauty, baby, home brands&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://apify.com/zhorex/bilibili-scraper" rel="noopener noreferrer"&gt;Bilibili Scraper&lt;/a&gt;&lt;/strong&gt; — video reviews; especially valuable for tech and consumer electronics SKUs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://apify.com/zhorex/xueqiu-scraper" rel="noopener noreferrer"&gt;Xueqiu Scraper&lt;/a&gt;&lt;/strong&gt; — Chinese retail-investor sentiment; pair if you trade JD stock (NASDAQ:JD) alongside operational metrics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://apify.com/zhorex/douban-scraper" rel="noopener noreferrer"&gt;Douban Scraper&lt;/a&gt;&lt;/strong&gt; — long-form film / book / music reviews; less relevant for commerce but useful for IP / entertainment teams&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you ship a brand-monitoring workflow on top of any of these, drop a comment with what you're tracking. If this saved you the time of building an integration from scratch, a heart on the post or a follow keeps these writeups coming.&lt;/p&gt;

</description>
      <category>python</category>
      <category>webscraping</category>
      <category>china</category>
      <category>ecommerce</category>
    </item>
    <item>
      <title>Scraping Chinese Social Platforms for LLM Training Data: A Practical Multi-Source Pipeline (Python, 2026)</title>
      <dc:creator>Sami</dc:creator>
      <pubDate>Tue, 12 May 2026 20:02:05 +0000</pubDate>
      <link>https://dev.to/sami_8858131362756585e4f4/scraping-chinese-social-platforms-for-llm-training-data-a-practical-multi-source-pipeline-python-584</link>
      <guid>https://dev.to/sami_8858131362756585e4f4/scraping-chinese-social-platforms-for-llm-training-data-a-practical-multi-source-pipeline-python-584</guid>
      <description>&lt;p&gt;If you're training Chinese-language models — or multilingual models that need real Chinese coverage, not just translated English — the data problem is the bottleneck. Common Crawl gives you the open web. HuggingFace gives you the curated stuff. But the linguistic patterns that matter most for cultural alignment — slang, memes, code-mixed English-Chinese, regional variations, real-time discourse — those live in places Common Crawl barely touches.&lt;/p&gt;

&lt;p&gt;Three platforms matter most for Chinese training corpora in 2026:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Weibo&lt;/strong&gt; (微博) — 580M+ MAU, microblogging, real-time discourse, similar role to X/Twitter&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bilibili&lt;/strong&gt; (哔哩哔哩) — 300M+ MAU, video platform, comments + danmaku give you code-mixed natural language at volume&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Xiaohongshu / RedNote&lt;/strong&gt; (小红书) — 300M+ MAU, lifestyle posts with longer-form content, female-skewed register&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This post walks through how to build a multi-source pipeline that pulls clean structured data from all three, normalizes it across platforms, and ships it into your training datasets. With code, schema, and economics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A note on legal posture&lt;/strong&gt;: this entire pipeline accesses only &lt;strong&gt;publicly visible data&lt;/strong&gt; — no auth bypass, no captcha solving, no scraping behind login. That matches the standard most AI training teams operate under in 2026, post-NYT-vs-OpenAI. Always consult your legal team for your specific use case and jurisdiction.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why these three (and not, say, Douyin or Zhihu)
&lt;/h2&gt;

&lt;p&gt;Each platform contributes a different linguistic register:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Weibo posts&lt;/strong&gt; are short, high-frequency, conversational. Best for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Everyday Mandarin patterns&lt;/li&gt;
&lt;li&gt;Trending slang and memes (热搜 reflects what's actually viral &lt;em&gt;right now&lt;/em&gt;)&lt;/li&gt;
&lt;li&gt;Public sentiment on news and policy&lt;/li&gt;
&lt;li&gt;Brand-mention contexts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Bilibili comments and danmaku&lt;/strong&gt; are unique:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Heavy code-mixing English ↔ Chinese (gaming, tech, anime communities)&lt;/li&gt;
&lt;li&gt;Real-time chat-style language&lt;/li&gt;
&lt;li&gt;Subculture vocabulary (gaming, fandom, two-dimensional culture / 二次元)&lt;/li&gt;
&lt;li&gt;Longer thread discussions on long-form videos&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;RedNote posts&lt;/strong&gt; lean longer and more curated:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Beauty / lifestyle / travel / food vocabulary&lt;/li&gt;
&lt;li&gt;Product-attribute language (skincare ingredients, fashion descriptors)&lt;/li&gt;
&lt;li&gt;Female-skewed register and topics&lt;/li&gt;
&lt;li&gt;Aspirational / descriptive framing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Douyin (Chinese TikTok) and Kuaishou are predominantly video, so text data is sparse. Zhihu (Q&amp;amp;A) is great for long-form text but dominated by a single-author voice. The triad above gives you the best balance of volume, diversity, and accessibility.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pipeline architecture
&lt;/h2&gt;

&lt;p&gt;The cleanest architecture for an AI training data pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Weibo Scraper]    →
[Bilibili Scraper] →  [Normalize]  →  [Dedup + Filter]  →  [JSONL]
[RedNote Scraper]  →
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each scraper outputs platform-native JSON. A normalization layer flattens it to a common schema. Deduplication on a text hash, plus filtering by minimum length and language detection, then ships clean data into your training format.&lt;/p&gt;

&lt;p&gt;Below: I use Apify-hosted scrapers for the extraction layer (they handle anti-bot, rate limiting, and schema stability so you don't have to). The normalization + dedup is your code — straight Python.&lt;/p&gt;
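
&lt;p&gt;As a rough idea of what the dedup-and-filter stage can look like, here is a minimal standard-library sketch. It assumes each normalized record is a dict with a &lt;code&gt;text&lt;/code&gt; field. It also swaps real language detection for a crude CJK-character ratio, and the thresholds and function names are illustrative assumptions, not part of any scraper's output.&lt;/p&gt;

```python
import hashlib
import re

# Matches characters in the main CJK Unified Ideographs block.
CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")


def text_hash(text: str) -> str:
    """Hash whitespace-normalized text so spacing differences still collide."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def cjk_ratio(text: str) -> float:
    """Share of non-whitespace characters that are CJK ideographs."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(1 for ch in chars if CJK_CHAR.match(ch)) / len(chars)


def dedup_and_filter(records, min_chars: int = 10, min_cjk_ratio: float = 0.3):
    """Keep the first copy of each text that is long enough and mostly Chinese."""
    seen = set()
    kept = []
    for record in records:
        text = (record.get("text") or "").strip()
        long_enough = len(text) >= min_chars
        chinese_enough = cjk_ratio(text) >= min_cjk_ratio
        if not (long_enough and chinese_enough):
            continue
        digest = text_hash(text)
        if digest in seen:
            continue  # exact duplicate of a record we already kept
        seen.add(digest)
        kept.append({**record, "text": text, "text_hash": digest})
    return kept
```

&lt;p&gt;A ratio threshold well below 1.0 is deliberate: it lets code-mixed English-Chinese comments through while still dropping English-only text. Exact hashing only catches verbatim duplicates, so near-duplicate detection would need a separate step.&lt;/p&gt;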

&lt;h2&gt;
  
  
  Step 1 — Pulling from Weibo
&lt;/h2&gt;

&lt;p&gt;For training data, the high-value combination is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hot search topics&lt;/strong&gt; (real-time trending — what people are talking about right now)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Posts under those topics&lt;/strong&gt; (organic conversation about real issues)
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;apify_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ApifyClient&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ApifyClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_APIFY_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;collect_weibo_corpus&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target_topics&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;posts_per_topic&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# 1a. Pull current trending topics
&lt;/span&gt;    &lt;span class="n"&gt;topics_run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;actor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zhorex/weibo-scraper&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hot_search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maxResults&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;target_topics&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="n"&gt;topics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topics_run&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;defaultDatasetId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;iterate_items&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="c1"&gt;# 1b. For each topic, pull underlying posts
&lt;/span&gt;    &lt;span class="n"&gt;corpus&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;topics&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;posts_run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;actor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zhorex/weibo-scraper&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;searchQuery&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maxResults&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;posts_per_topic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;posts_run&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;defaultDatasetId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;iterate_items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="n"&gt;corpus&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;platform&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weibo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;topic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;author&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;authorName&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;engagement&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;attitudesCount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
                               &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;commentsCount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
                               &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;repostsCount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;post_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;postUrl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scraped_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scrapedAt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;corpus&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Volume math&lt;/strong&gt;: 50 topics × 100 posts = 5,000 items per snapshot. At $0.005/item that's $25 per pull. Run daily for a year ≈ $9,125.&lt;/p&gt;
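
&lt;p&gt;To see how that figure moves with topic count, post depth, or cadence, the same arithmetic fits in a few lines. This is a sketch with illustrative function names. It uses the $0.005/item figure quoted above and counts only the post items, as the volume math does.&lt;/p&gt;

```python
def snapshot_cost(topics: int, posts_per_topic: int, price_per_item: float) -> float:
    """Cost in USD of one pull: every post item is billed at the same rate."""
    return round(topics * posts_per_topic * price_per_item, 2)


def yearly_cost(topics: int, posts_per_topic: int, price_per_item: float,
                runs_per_year: int = 365) -> float:
    """Cost in USD of repeating the same pull `runs_per_year` times."""
    per_run = snapshot_cost(topics, posts_per_topic, price_per_item)
    return round(per_run * runs_per_year, 2)


if __name__ == "__main__":
    print(snapshot_cost(50, 100, 0.005))   # 25.0
    print(yearly_cost(50, 100, 0.005))     # 9125.0
    # Weekly instead of daily pulls, same depth:
    print(yearly_cost(50, 100, 0.005, runs_per_year=52))
```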

&lt;h2&gt;
  
  
  Step 2 — Pulling from Bilibili
&lt;/h2&gt;

&lt;p&gt;Bilibili gives you something the others don't: &lt;strong&gt;comments on long-form videos&lt;/strong&gt;. That's where heavy code-mixing happens (tech tutorials, gaming streams, study-with-me content, drama analysis). For training data, comments are higher-value than video metadata.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;collect_bilibili_comments&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;knowledge&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                               &lt;span class="n"&gt;videos&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                               &lt;span class="n"&gt;comments_per&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Get popular videos in the category
&lt;/span&gt;    &lt;span class="n"&gt;popular_run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;actor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zhorex/bilibili-scraper&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;popular&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maxResults&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;videos&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="n"&gt;items&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;popular_run&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;defaultDatasetId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;iterate_items&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="n"&gt;bvids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bvid&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bvid&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;

    &lt;span class="c1"&gt;# Pull comments on each
&lt;/span&gt;    &lt;span class="n"&gt;corpus&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;bvid&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;bvids&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;comments_run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;actor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zhorex/bilibili-scraper&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;video_comments&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;videoUrls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.bilibili.com/video/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bvid&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maxComments&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;comments_per&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sortComments&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hot&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;comments_run&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;defaultDatasetId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;iterate_items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;comment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;
            &lt;span class="n"&gt;corpus&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;platform&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bilibili&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;author&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;authorName&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;engagement&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;likeCount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;video_bvid&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;bvid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scraped_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scrapedAt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;corpus&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note: Bilibili throttles comment depth on cloud IPs, so without residential proxies you get roughly the top 3 comments per video. At training-data scale you need diversity rather than every comment, so the top-N approach is sufficient and cheaper.&lt;/p&gt;

&lt;p&gt;Categories worth pulling for diverse coverage: &lt;code&gt;knowledge&lt;/code&gt;, &lt;code&gt;tech&lt;/code&gt;, &lt;code&gt;game&lt;/code&gt;, &lt;code&gt;life&lt;/code&gt;, &lt;code&gt;food&lt;/code&gt;, &lt;code&gt;fashion&lt;/code&gt;, &lt;code&gt;cars&lt;/code&gt;, &lt;code&gt;entertainment&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3 — Pulling from RedNote
&lt;/h2&gt;

&lt;p&gt;RedNote gives you longer, more curated content — good for training models on aspirational and descriptive Chinese. The seed-query approach lets you control topical distribution, important for avoiding bias toward whatever's trending the day you scrape.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;collect_rednote_corpus&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seed_queries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;posts_per_query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;corpus&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;seed_queries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;actor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zhorex/rednote-xiaohongshu-scraper&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;searchQuery&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maxResults&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;posts_per_query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;defaultDatasetId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;iterate_items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="n"&gt;corpus&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;platform&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rednote&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;topic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;author&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;author&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="p"&gt;{}).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nickname&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;engagement&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;likes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;post_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;postUrl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scraped_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scrapedAt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;corpus&lt;/span&gt;

&lt;span class="c1"&gt;# Diverse seed queries spread coverage across topics
&lt;/span&gt;&lt;span class="n"&gt;seeds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;护肤心得&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;# skincare experience
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;穿搭&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;# outfits
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;美食推荐&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;# food recommendations
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;旅行攻略&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;# travel guides
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;健身打卡&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;# fitness check-in
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;读书笔记&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;# reading notes
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;育儿日记&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;# parenting diary
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;职场感悟&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;# work reflections
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;rednote_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;collect_rednote_corpus&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seeds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;posts_per_query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For richer body content per post (beyond title), pivot to &lt;code&gt;mode: post_details&lt;/code&gt; with the post URLs you want to deep-dive on.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4 — Normalization and dedup
&lt;/h2&gt;

&lt;p&gt;All three scrapers produce platform-specific schemas; the per-step code above already brings them to a common shape:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;platform&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weibo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bilibili&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rednote&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;topic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;author&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;engagement&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scraped_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ISO8601&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This shape is enough to serialize as JSONL for training. For higher quality, layer in length filtering and exact-duplicate removal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;filter_corpus&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;corpus&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;min_chars&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_chars&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;seen&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;corpus&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;min_chars&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;max_chars&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;
        &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;md5&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;seen&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;
        &lt;span class="n"&gt;seen&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
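
&lt;p&gt;Once the corpus is filtered, writing it out as JSONL could look like the sketch below. The helper names &lt;code&gt;write_jsonl&lt;/code&gt; and &lt;code&gt;read_jsonl&lt;/code&gt; are hypothetical and not part of any Actor or library; the only assumption is that each item is a plain dict in the common shape shown earlier.&lt;/p&gt;

```python
import json
from pathlib import Path


def write_jsonl(items, path):
    """Write one JSON object per line; return the number of records written."""
    count = 0
    with Path(path).open("w", encoding="utf-8") as fh:
        for item in items:
            # ensure_ascii=False keeps Chinese text readable instead of \uXXXX escapes
            fh.write(json.dumps(item, ensure_ascii=False) + "\n")
            count += 1
    return count


def read_jsonl(path):
    """Read the file back into a list of dicts (useful as a sanity check)."""
    with Path(path).open("r", encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

&lt;p&gt;Newlines inside a post are escaped by &lt;code&gt;json.dumps&lt;/code&gt;, so each record stays on a single line even when the original text spans several.&lt;/p&gt;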



&lt;p&gt;For pretraining-grade quality, also add fastText / &lt;code&gt;langdetect&lt;/code&gt; to filter non-Chinese content, and a profanity / PII pass appropriate to your training context.&lt;/p&gt;
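
&lt;p&gt;If you want a cheap first pass before wiring in a real language-ID model, a character-ratio heuristic is one option. The sketch below is a stand-in, not fastText or &lt;code&gt;langdetect&lt;/code&gt;: it only counts Han ideographs in the basic CJK block, so Japanese text heavy in kanji will also pass. The function names and the 0.5 threshold are assumptions to tune on your own data.&lt;/p&gt;

```python
import re

# CJK Unified Ideographs (basic block) only; extension blocks are ignored in this sketch.
_HAN = re.compile(r"[\u4e00-\u9fff]")


def han_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are Han ideographs."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    han = sum(1 for ch in chars if _HAN.match(ch))
    return han / len(chars)


def looks_chinese(text: str, threshold: float = 0.5) -> bool:
    """Keep items whose Han-character share meets the (assumed) threshold."""
    return han_ratio(text) >= threshold
```

&lt;p&gt;Applied inside a loop like the one in &lt;code&gt;filter_corpus&lt;/code&gt;, this drops items that are mostly Latin text, emoji, or URLs before the more expensive checks run.&lt;/p&gt;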

&lt;h2&gt;
  
  
  Economics at training-corpus scale
&lt;/h2&gt;

&lt;p&gt;A reasonable Chinese-language pretraining contribution might be 10M items across platforms:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Platform&lt;/th&gt;
&lt;th&gt;Items&lt;/th&gt;
&lt;th&gt;Cost @ $0.005/item&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Weibo&lt;/td&gt;
&lt;td&gt;5M&lt;/td&gt;
&lt;td&gt;$25,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bilibili&lt;/td&gt;
&lt;td&gt;3M&lt;/td&gt;
&lt;td&gt;$15,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RedNote&lt;/td&gt;
&lt;td&gt;2M&lt;/td&gt;
&lt;td&gt;$10,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;10M items&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$50,000&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The Apify free tier ($5/month credit) covers about 1,000 items in total at $0.005 per result, which is enough for prototyping.&lt;/p&gt;
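
&lt;p&gt;The arithmetic behind these figures is simple enough to script when you want to try a different platform mix. This is a minimal sketch: the $0.005 price and the 5M / 3M / 2M split come from the table above, while the function names are hypothetical and any platform compute charges are ignored.&lt;/p&gt;

```python
from decimal import Decimal

PRICE_PER_ITEM = Decimal("0.005")  # USD per result, as quoted in the table


def scrape_cost(items: int, price: Decimal = PRICE_PER_ITEM) -> Decimal:
    """Cost in USD for a given number of items."""
    return items * price


def items_for_budget(budget_usd: str, price: Decimal = PRICE_PER_ITEM) -> int:
    """How many items a budget buys at the per-item price."""
    return int(Decimal(budget_usd) / price)


plan = {"Weibo": 5_000_000, "Bilibili": 3_000_000, "RedNote": 2_000_000}
costs = {name: scrape_cost(n) for name, n in plan.items()}
total = sum(costs.values(), Decimal(0))

for name, cost in costs.items():
    print(f"{name}: ${cost:,.0f}")
print(f"Total: ${total:,.0f}")
print(f"Items covered by a $5 credit: {items_for_budget('5'):,}")
```

&lt;p&gt;&lt;code&gt;Decimal&lt;/code&gt; is used instead of floats so that per-item prices with fractional cents multiply out exactly.&lt;/p&gt;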

&lt;p&gt;For comparison, hiring two senior engineers to build and maintain DIY Chinese-platform extraction for six months costs roughly $150K-300K, and that buys only the tooling, not the data.&lt;/p&gt;

&lt;p&gt;For 100M+ items (real pretraining scale), volume pricing or a custom enterprise contract makes more sense. See the "Enterprise / training-scale" section below.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to build vs buy
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Build it yourself if&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're scraping 100M+ items per month and have a dedicated team&lt;/li&gt;
&lt;li&gt;You need real-time streaming below 1-second latency (this pipeline is batch)&lt;/li&gt;
&lt;li&gt;Your legal team requires you to own the entire data path&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use the hosted scrapers if&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're under 50M items per month per platform&lt;/li&gt;
&lt;li&gt;You want time-to-data measured in hours, not months&lt;/li&gt;
&lt;li&gt;You don't want to maintain three platform-specific scrapers as APIs evolve&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The actors
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://apify.com/zhorex/weibo-scraper" rel="noopener noreferrer"&gt;&lt;strong&gt;Weibo Scraper&lt;/strong&gt;&lt;/a&gt; — &lt;code&gt;hot_search&lt;/code&gt;, &lt;code&gt;search&lt;/code&gt;, &lt;code&gt;post_comments&lt;/code&gt;, &lt;code&gt;user_posts&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://apify.com/zhorex/bilibili-scraper" rel="noopener noreferrer"&gt;&lt;strong&gt;Bilibili Scraper&lt;/strong&gt;&lt;/a&gt; — &lt;code&gt;search&lt;/code&gt;, &lt;code&gt;popular&lt;/code&gt;, &lt;code&gt;video_detail&lt;/code&gt;, &lt;code&gt;video_comments&lt;/code&gt;, &lt;code&gt;user_videos&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://apify.com/zhorex/rednote-xiaohongshu-scraper" rel="noopener noreferrer"&gt;&lt;strong&gt;RedNote (Xiaohongshu) Scraper&lt;/strong&gt;&lt;/a&gt; — six modes covering posts, profiles, comments, video&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All three cost $0.005 per result. They run over pure HTTP: no browser, and no proxy required at moderate volumes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Enterprise / training-scale
&lt;/h2&gt;

&lt;p&gt;If you're building actual training corpora (not prototyping), DM me on any actor page or open an Issue with subject &lt;strong&gt;"Training data inquiry"&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Custom output schemas matched to your training pipeline (Parquet / Arrow / your dialect of JSONL)&lt;/li&gt;
&lt;li&gt;Volume pricing above 1M items/month per platform&lt;/li&gt;
&lt;li&gt;Dedicated proxy infrastructure for sustained throughput&lt;/li&gt;
&lt;li&gt;Schema stability SLA so your training runs don't break mid-epoch&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Issues typically get a response within 48 hours.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Is this legal?&lt;/strong&gt; Each Actor accesses only publicly visible data — no auth, no captcha bypass, no login walls. The same data any anonymous browser user can see. Standard ToS-compliant scraping posture as of 2026. Consult your legal team for jurisdiction-specific guidance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What about rate limits?&lt;/strong&gt; The hosted Actors handle rate-limit responses with exponential backoff. For 1M+ items/day per platform, talk to me about dedicated infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I get historical data?&lt;/strong&gt; The Actors return what's currently public. For longitudinal datasets, schedule them via Apify Schedules at the cadence you need (hourly / daily / weekly) and version-control your dataset snapshots.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do you offer streaming / real-time?&lt;/strong&gt; Not currently. The Actors are pull-based. If you need streaming, that's a custom integration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Other platforms?&lt;/strong&gt; I also maintain a &lt;a href="https://apify.com/zhorex/rednote-shop-scraper" rel="noopener noreferrer"&gt;RedNote Shop Scraper&lt;/a&gt; for Xiaohongshu e-commerce listings — useful if your model needs to reason about products, pricing, or commerce vocabulary.&lt;/p&gt;




&lt;h2&gt;
  
  
  Other relevant work
&lt;/h2&gt;

&lt;p&gt;If you're building Chinese intelligence at scale, the full suite:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://apify.com/zhorex/rednote-xiaohongshu-scraper" rel="noopener noreferrer"&gt;RedNote Scraper&lt;/a&gt; — lifestyle social&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://apify.com/zhorex/rednote-shop-scraper" rel="noopener noreferrer"&gt;RedNote Shop Scraper&lt;/a&gt; — Xiaohongshu e-commerce (product metadata, pricing, vendor info)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://apify.com/zhorex/weibo-scraper" rel="noopener noreferrer"&gt;Weibo Scraper&lt;/a&gt; — microblogging, hot search, sentiment&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://apify.com/zhorex/bilibili-scraper" rel="noopener noreferrer"&gt;Bilibili Scraper&lt;/a&gt; — video creator analytics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If this saved you a quarter of dev time, a 30-second review on any of the Actor pages helps a lot. ⭐&lt;/p&gt;

&lt;p&gt;Found a bug or have a feature request? Open an Issue — I usually ship fixes within 48 hours.&lt;/p&gt;

</description>
      <category>python</category>
      <category>webscraping</category>
      <category>china</category>
      <category>ai</category>
    </item>
    <item>
      <title>Influencer Vetting at Scale on Xiaohongshu (RedNote): A Practical Python Guide for Brand Teams 2026</title>
      <dc:creator>Sami</dc:creator>
      <pubDate>Mon, 11 May 2026 16:14:10 +0000</pubDate>
      <link>https://dev.to/sami_8858131362756585e4f4/influencer-vetting-at-scale-on-xiaohongshu-rednote-a-practical-python-guide-for-brand-teams-2026-1946</link>
      <guid>https://dev.to/sami_8858131362756585e4f4/influencer-vetting-at-scale-on-xiaohongshu-rednote-a-practical-python-guide-for-brand-teams-2026-1946</guid>
      <description>&lt;p&gt;RedNote, known in China as Xiaohongshu (小红书, "Little Red Book"), has become the single most consequential platform for influencer marketing in China. With 300M+ monthly active users skewing female and Gen Z, it's where beauty, fashion, lifestyle, and travel brands first place a campaign before going wider. After the TikTok-uncertainty migrations of 2024–2025, RedNote also became the de facto fallback for many Western creators.&lt;/p&gt;

&lt;p&gt;If you're a brand team, agency, or media buyer working in or with China, you need a way to vet RedNote influencers at scale. Manual scrolling doesn't cut it past five creators. Here's the structured approach.&lt;/p&gt;

&lt;h2&gt;
  
  
  What "vetting at scale" actually means
&lt;/h2&gt;

&lt;p&gt;For a single influencer partnership, you typically want answers to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reach&lt;/strong&gt;: how many followers, but more importantly how many people actually see their content (median impressions / followers)?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Engagement quality&lt;/strong&gt;: average likes/comments/saves per post — and the &lt;em&gt;distribution&lt;/em&gt;. A creator with 100K followers and 50 posts averaging 500 likes is very different from one with 100K and a few viral 10K posts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Niche fit&lt;/strong&gt;: do their tags and topics align with your category?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audience hints&lt;/strong&gt;: from bio, location, profile signals — who follows them?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authenticity&lt;/strong&gt;: posting cadence, sponsored-content ratio, content reuse from other platforms.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The data lives in two places on RedNote: &lt;strong&gt;profile metadata&lt;/strong&gt; (followers, bio, location, verified status, total likes received) and &lt;strong&gt;recent posts&lt;/strong&gt; (titles, like counts, content type, publish dates).&lt;/p&gt;

&lt;h2&gt;
  
  
  What you can extract per creator
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"userId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"5d7439b40000000001009f54"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"nickname"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"BeautyBlogger123"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"avatar"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://sns-avatar-qc.xhscdn.com/..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"redId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"100123456"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Skincare reviews, K-beauty translations. Seoul-based."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"followers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;184500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"following"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;320&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"totalLikes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1240000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"gender"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"location"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Seoul, South Korea"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"isVerified"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"tags"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Beauty Blogger"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"K-Beauty"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Skincare"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Per recent post:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"postId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"64be395b0000000010030b56"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"video"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Morning skincare routine for dry skin"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"likes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;15234&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"scrapedAt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-05-06T12:00:00Z"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Combined, you can compute:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Median likes / post&lt;/strong&gt; (use median, not mean — viral outliers skew means)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Engagement rate&lt;/strong&gt; = median likes / followers (RedNote benchmark for healthy: 2–5%, viral creators 8%+)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Post frequency&lt;/strong&gt; (posts per week — burnout warning if dropping)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Content-type ratio&lt;/strong&gt; (video vs image — videos get higher reach in 2026)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Niche concentration&lt;/strong&gt; (% of recent posts matching your category keywords)&lt;/li&gt;
&lt;/ul&gt;
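
&lt;p&gt;To see why the median matters, here is a small worked example with two made-up creators who both have 100K followers. All like counts are hypothetical and chosen only to illustrate the point; &lt;code&gt;summarize&lt;/code&gt; is an illustrative helper, not part of the scraper output.&lt;/p&gt;

```python
import statistics

followers = 100_000
# Hypothetical like counts for two creators, 10 recent posts each
steady = [480, 510, 495, 520, 505, 490, 515, 500, 485, 500]
spiky = [40, 55, 60, 10_000, 45, 50, 65, 12_000, 35, 50]


def summarize(likes, followers):
    """Mean, median, and median-based engagement rate for a list of like counts."""
    median = statistics.median(likes)
    return {
        "mean_likes": statistics.mean(likes),
        "median_likes": median,
        "engagement_rate": median / followers,
    }


for label, likes in (("steady", steady), ("spiky", spiky)):
    print(label, summarize(likes, followers))
```

&lt;p&gt;The spiky creator has the higher mean (2,240 likes against 500) because of two viral posts, but a median of only 52.5, which is much closer to what a typical sponsored post would get.&lt;/p&gt;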

&lt;h2&gt;
  
  
  Practical Python: building a vetting batch
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;apify_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ApifyClient&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;statistics&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ApifyClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_APIFY_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;vet_creator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;profile_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Profile data
&lt;/span&gt;    &lt;span class="n"&gt;profile_run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;actor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zhorex/rednote-xiaohongshu-scraper&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;profile&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;userUrl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;profile_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="c1"&gt;# Fall back to an empty dict if the run produced no profile item
&lt;/span&gt;    &lt;span class="n"&gt;profile&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;profile_run&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;defaultDatasetId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;iterate_items&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt;

    &lt;span class="c1"&gt;# Recent posts
&lt;/span&gt;    &lt;span class="n"&gt;posts_run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;actor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zhorex/rednote-xiaohongshu-scraper&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_posts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;userUrl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;profile_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maxResults&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="n"&gt;posts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;posts_run&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;defaultDatasetId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;iterate_items&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="c1"&gt;# Keep zero-like posts so the median is not biased upward
&lt;/span&gt;    &lt;span class="n"&gt;likes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;likes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;posts&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;likes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;median_likes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;statistics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;median&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;likes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;likes&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="n"&gt;er&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;median_likes&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;followers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;followers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nickname&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nickname&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;followers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;followers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;verified&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;isVerified&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tags&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tags&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;post_count_recent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;posts&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;median_likes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;median_likes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;engagement_rate_pct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;er&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;video_ratio&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;posts&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;video&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;posts&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Vet a list
&lt;/span&gt;&lt;span class="n"&gt;candidates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.xiaohongshu.com/user/profile/USER_ID_1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.xiaohongshu.com/user/profile/USER_ID_2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;vet_creator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;engagement_rate_pct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;nickname&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;?&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;  followers=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;followers&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;  ER=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;engagement_rate_pct&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;%&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That gives you a list of candidates ranked by genuine engagement rather than vanity follower counts, ready to export to a spreadsheet.&lt;/p&gt;

&lt;h2&gt;
  
  
  Red flags to look for
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Engagement rate &amp;lt; 0.5%&lt;/strong&gt; with &amp;gt; 100K followers → followers were likely bought, or the audience has gone stale&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Only image posts when category is video-heavy&lt;/strong&gt; → low reach in 2026's algorithm&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fewer than 4 posts in the last 30 days&lt;/strong&gt; → unreliable for retainer work&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No tags or generic tags&lt;/strong&gt; → low discoverability inside RedNote search&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mean likes 50x or more above the median&lt;/strong&gt; → engagement is driven by a single viral post (risky); a mean within 10x of the median indicates relatively consistent performance (good)&lt;/li&gt;
&lt;/ul&gt;
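
&lt;p&gt;The checks above can be sketched as one function that returns the flags a creator trips. This is an illustration under assumptions, not part of the Actor: the function name, its parameters, and the flag labels are all hypothetical, &lt;code&gt;posts_last_30d&lt;/code&gt; has to be counted by you from post dates, and detecting "generic" tags is left out because it needs a category-specific word list.&lt;/p&gt;

```python
import statistics


def red_flags(followers, engagement_rate_pct, video_ratio, posts_last_30d,
              tags, likes, video_heavy_category=False):
    """Return the red flags a creator trips, using the thresholds listed above."""
    flags = []
    if followers > 100_000 and 0.5 > engagement_rate_pct:
        flags.append("low engagement for audience size")
    if video_heavy_category and video_ratio == 0:
        flags.append("image-only in a video-heavy category")
    if 4 > posts_last_30d:
        flags.append("fewer than 4 posts in 30 days")
    if not tags:
        flags.append("no tags")
    median = statistics.median(likes) if likes else 0
    # A mean far above the median points to one viral post carrying the numbers
    if median > 0 and statistics.mean(likes) / median >= 50:
        flags.append("single-viral-driven likes")
    return flags


risky = red_flags(
    followers=250_000, engagement_rate_pct=0.3, video_ratio=0.0,
    posts_last_30d=2, tags=[], likes=[10, 12, 9, 11, 5000],
    video_heavy_category=True,
)
healthy = red_flags(
    followers=50_000, engagement_rate_pct=3.2, video_ratio=0.6,
    posts_last_30d=8, tags=["skincare"], likes=[900, 1100, 1000],
)
print(len(risky), risky)
print(len(healthy), healthy)
```

&lt;p&gt;Returning a list of labels rather than a single pass/fail value keeps the reasoning visible, which helps when a shortlist is reviewed by someone who did not run the script.&lt;/p&gt;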

&lt;h2&gt;
  
  
  When to skip the DIY approach
&lt;/h2&gt;

&lt;p&gt;RedNote's public surface requires sustained anti-bot engineering to extract reliably at scale. The schema also evolves regularly, so a scraper that worked last month can quietly start returning empty arrays this month without raising any error you'd notice in your pipeline.&lt;/p&gt;
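
&lt;p&gt;If you do run your own scraper, one cheap defence against that silent failure is to refuse empty results explicitly. The helper below is a minimal sketch; &lt;code&gt;require_items&lt;/code&gt; is a hypothetical name, and the right minimum depends on what each run is expected to return.&lt;/p&gt;

```python
def require_items(items, label, min_items=1):
    """Materialize a result set and fail loudly if it is suspiciously small."""
    items = list(items)
    if min_items > len(items):
        raise RuntimeError(
            f"{label}: expected at least {min_items} item(s), got {len(items)}"
        )
    return items


posts = require_items([{"likes": 10}, {"likes": 3}], "user_posts")
print(len(posts))  # 2

try:
    require_items([], "user_posts")
except RuntimeError as exc:
    print(exc)  # user_posts: expected at least 1 item(s), got 0
```

&lt;p&gt;Wrapping every dataset read in a check like this turns a quiet schema change into an error your pipeline will actually surface.&lt;/p&gt;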

&lt;p&gt;That's why I maintain the &lt;a href="https://apify.com/zhorex/rednote-xiaohongshu-scraper" rel="noopener noreferrer"&gt;&lt;strong&gt;RedNote Scraper on Apify&lt;/strong&gt;&lt;/a&gt; — six modes (search, user posts, profiles, post details, comments, video) with consistent output schemas across them. The infrastructure work (session handling, rate limiting, schema stability) is already done.&lt;/p&gt;

&lt;p&gt;Pricing is pay-per-event: &lt;strong&gt;$0.005 per result&lt;/strong&gt;. A typical influencer batch (50 candidates, profile + 50 posts each) costs about &lt;strong&gt;$13&lt;/strong&gt;. The Apify free tier ($5 monthly) covers ~1,000 items.&lt;/p&gt;
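
&lt;p&gt;The arithmetic behind those figures, as a small sketch you can adapt to your own batch size. It assumes every profile and every post is billed as one result at the price quoted above; the function and constant names are illustrative.&lt;/p&gt;

```python
PRICE_PER_RESULT_USD = 0.005
FREE_TIER_USD = 5.00


def batch_cost(candidates, posts_per_candidate):
    """Estimated cost in USD: one profile result plus N post results per creator."""
    results = candidates * (1 + posts_per_candidate)
    return round(results * PRICE_PER_RESULT_USD, 2)


cost = batch_cost(50, 50)                                 # 2,550 results
free_items = round(FREE_TIER_USD / PRICE_PER_RESULT_USD)  # results covered by the free tier

print(cost)        # 12.75
print(free_items)  # 1000
```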

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Is the data from open profiles only?&lt;/strong&gt; Yes. The Actor returns public-facing profile and post data only, the same content any anonymous browser user can see. Private or locked accounts are not accessible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does it work on &lt;code&gt;xhslink.com&lt;/code&gt; short links?&lt;/strong&gt; Yes, those resolve automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I get post-level comments?&lt;/strong&gt; Use &lt;code&gt;mode: comments&lt;/code&gt; on individual post URLs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What about RedNote vs Xiaohongshu vs Little Red Book?&lt;/strong&gt; All the same platform — the Actor handles all three name conventions and URL formats.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is scraping Xiaohongshu legal?&lt;/strong&gt; This Actor accesses publicly visible content only. No authentication is bypassed. Always consult your local laws.&lt;/p&gt;




&lt;h2&gt;
  
  
  Full Chinese intelligence stack
&lt;/h2&gt;

&lt;p&gt;Brand teams running campaigns across Chinese platforms typically pair this with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://apify.com/zhorex/rednote-xiaohongshu-scraper" rel="noopener noreferrer"&gt;RedNote Scraper&lt;/a&gt; — &lt;em&gt;(this one, social side)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://apify.com/zhorex/rednote-shop-scraper" rel="noopener noreferrer"&gt;RedNote Shop Scraper&lt;/a&gt; — Xiaohongshu e-commerce (products, vendors, prices)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://apify.com/zhorex/weibo-scraper" rel="noopener noreferrer"&gt;Weibo Scraper&lt;/a&gt; — microblogging, brand mentions, hot search&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://apify.com/zhorex/bilibili-scraper" rel="noopener noreferrer"&gt;Bilibili Scraper&lt;/a&gt; — video creator analytics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Vetting more than 100 creators per month?&lt;/strong&gt; I offer custom output schemas, dedicated proxy pools, SLA support, and volume discounts. DM me on Apify or open an Issue titled "Enterprise inquiry".&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bug?&lt;/strong&gt; Issues are typically fixed within 48 hours.&lt;/p&gt;

&lt;p&gt;If this saved you time, a 30-second review on the Apify Store helps a lot. ⭐&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>python</category>
      <category>china</category>
      <category>marketing</category>
    </item>
  </channel>
</rss>
