スシロー

Posted on Jun 6

I Built a Python Probe That Scrapes Zenn's Hidden Stats API and Commits PV Numbers to SQLite Every Hour (and Crashes on Purpose)

#python #githubactions #sqlite #zenn

If you ship technical posts on Zenn, you already know the pain: the dashboard shows pageviews, but there is no public stats endpoint and no export button. After three weeks of running this probe, I have an hourly PV time-series for every article in a 40 KB SQLite file, a Discord ping when a post breaks 500 views, and zero manual checking.

By the end of this article you will be able to: (1) call the same private JSON endpoint Zenn's own dashboard uses, (2) write a probe that raises and dies on the first sign of trouble instead of logging a warning and continuing with garbage, and (3) run it from GitHub Actions so a crash auto-restarts on the next cron tick. Everything below is copy-paste runnable.

Why the `/api/me/articles` endpoint beats scraping HTML

My first version parsed the public article page with BeautifulSoup and read the "❤️ N" badge. It survived for nine days, then Zenn shipped a CSS change, the selector returned None, and my KPI table filled with 0s for two days before I noticed. The numbers looked plausible, which is the worst failure mode.

The fix was to stop scraping HTML entirely. When you open your Zenn dashboard with DevTools → Network open, the page fetches https://zenn.dev/api/me/articles?page=1. That JSON includes page_view_count, liked_count, comments_count, and published_at per article — the exact fields the dashboard renders. It needs your session cookie (connect.sid), which you grab once from DevTools → Application → Cookies and store as a secret.

The key design decision: this probe never swallows an error. A missing field, an empty list, an HTTP 401, or a PV count that went down since last run all raise an exception. The probe is meant to be killed and restarted, not to limp along producing fake KPI data. The recovery layer is the scheduler, not a try/except: pass.

The probe: `zenn_probe.py` that commits to SQLite or dies

Here is the full collector. It fetches every page, validates the shape of the response, and writes a row per article per run. If anything is off, it throws — and because the GitHub Actions job is marked failed, the next hourly run is a clean retry.

# zenn_probe.py
import os
import sqlite3
import sys
from datetime import datetime, timezone

import requests

ZENN_API = "https://zenn.dev/api/me/articles"
DB_PATH = os.environ.get("KPI_DB", "kpi.sqlite3")
SESSION_COOKIE = os.environ["ZENN_SESSION"]  # KeyError on purpose if unset


class ProbeError(RuntimeError):
    """Raised when the response is structurally untrustworthy."""


def fetch_all_articles() -> list[dict]:
    headers = {
        "Cookie": f"connect.sid={SESSION_COOKIE}",
        "User-Agent": "kpi-probe/1.0 (+personal-analytics)",
    }
    articles, page = [], 1
    while True:
        resp = requests.get(f"{ZENN_API}?page={page}", headers=headers, timeout=10)
        if resp.status_code == 401:
            raise ProbeError("401: ZENN_SESSION expired — refresh connect.sid")
        resp.raise_for_status()
        payload = resp.json()
        batch = payload.get("articles")
        if batch is None:
            raise ProbeError(f"no 'articles' key on page {page}: {list(payload)[:5]}")
        if not batch:
            break
        articles.extend(batch)
        if not payload.get("next_page"):
            break
        page += 1
        if page > 50:  # guard against an infinite next_page loop
            raise ProbeError("pagination exceeded 50 pages — aborting")
    if not articles:
        raise ProbeError("zero articles returned — auth or API shape changed")
    return articles


def validate(article: dict) -> dict:
    required = ("id", "slug", "title", "page_view_count", "liked_count")
    missing = [k for k in required if k not in article]
    if missing:
        raise ProbeError(f"missing fields {missing} on {article.get('slug', '?')}")
    pv = article["page_view_count"]
    if not isinstance(pv, int) or pv < 0:
        raise ProbeError(f"bogus page_view_count={pv!r} on {article['slug']}")
    return article


def init_db(conn: sqlite3.Connection) -> None:
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS pv_history (
            slug        TEXT NOT NULL,
            title       TEXT NOT NULL,
            page_views  INTEGER NOT NULL,
            likes       INTEGER NOT NULL,
            captured_at TEXT NOT NULL,
            PRIMARY KEY (slug, captured_at)
        )
        """
    )


def last_pv(conn: sqlite3.Connection, slug: str) -> int | None:
    row = conn.execute(
        "SELECT page_views FROM pv_history WHERE slug=? "
        "ORDER BY captured_at DESC LIMIT 1",
        (slug,),
    ).fetchone()
    return row[0] if row else None


def commit_run() -> None:
    now = datetime.now(timezone.utc).isoformat(timespec="seconds")
    articles = [validate(a) for a in fetch_all_articles()]
    conn = sqlite3.connect(DB_PATH)
    try:
        init_db(conn)
        for a in articles:
            prev = last_pv(conn, a["slug"])
            pv = a["page_view_count"]
            # Zenn PV is monotonic. A drop means cache/garbage — fail loud.
            if prev is not None and pv < prev:
                raise ProbeError(f"{a['slug']} PV went {prev}->{pv}, refusing commit")
            conn.execute(
                "INSERT OR IGNORE INTO pv_history VALUES (?,?,?,?,?)",
                (a["slug"], a["title"], pv, a["liked_count"], now),
            )
        conn.commit()
    finally:
        conn.close()
    print(f"committed {len(articles)} articles at {now}")


if __name__ == "__main__":
    try:
        commit_run()
    except ProbeError as exc:
        print(f"PROBE FAILED: {exc}", file=sys.stderr)
        sys.exit(1)  # non-zero -> GitHub Actions marks the run failed

The three validations that have actually saved me, ranked by how often they fired in three weeks:

The monotonic PV check (pv < prev) fired 4 times. Zenn occasionally serves a stale CDN response where a 1,200-view article reports 1,180. Without this guard those dips would have poisoned my day-over-day diff math. Fail loud, skip the run, the next tick is clean.
The 401 check fired twice — my connect.sid expires roughly every 12 days. Because it raises with an actionable message instead of returning [], I knew exactly what to rotate.
The missing fields check has fired zero times so far, which is exactly why it's there: it's the tripwire for the next Zenn API change, the same way the HTML scraper silently broke.

Notice there is no retry loop, no exponential backoff, no logging.warning. The whole recovery strategy is: crash, exit 1, let cron run it again in an hour. For an hourly KPI probe, a missed sample is invisible in the time series; a wrong sample is a landmine.

Computing the hourly diff and pinging Discord at 500 PV

The raw pv_history table is append-only, so "how many views did this post gain since yesterday" is a window query, not application logic. This second script runs right after the probe and posts a summary to a Discord webhook only when something crossed a threshold — I do not want an hourly message saying "nothing changed."

# zenn_report.py
import os
import sqlite3

import requests

DB_PATH = os.environ.get("KPI_DB", "kpi.sqlite3")
WEBHOOK = os.environ["DISCORD_WEBHOOK"]
ALERT_AT = 500  # ping when an article first crosses this PV count

DIFF_SQL = """
WITH ranked AS (
    SELECT slug, title, page_views, captured_at,
           ROW_NUMBER() OVER (PARTITION BY slug ORDER BY captured_at DESC) AS rn
    FROM pv_history
)
SELECT cur.slug, cur.title, cur.page_views,
       cur.page_views - COALESCE(prev.page_views, 0) AS delta
FROM ranked cur
LEFT JOIN ranked prev ON prev.slug = cur.slug AND prev.rn = 2
WHERE cur.rn = 1
ORDER BY delta DESC
"""


def build_report() -> str | None:
    conn = sqlite3.connect(DB_PATH)
    try:
        rows = conn.execute(DIFF_SQL).fetchall()
    finally:
        conn.close()

    movers = [r for r in rows if r[3] > 0]
    crossed = [r for r in rows if r[2] >= ALERT_AT > r[2] - r[3]]
    if not movers and not crossed:
        return None  # stay silent, no Discord spam

    lines = ["**Zenn PV — last hour**"]
    for slug, title, pv, delta in movers[:5]:
        lines.append(f"• +{delta} → {pv:,} | {title[:40]}")
    for slug, title, pv, _ in crossed:
        lines.append(f"🎉 **{title[:40]}** crossed {ALERT_AT} PV ({pv:,})")
    return "\n".join(lines)


def main() -> None:
    report = build_report()
    if report is None:
        print("no movement — skipping Discord")
        return
    resp = requests.post(WEBHOOK, json={"content": report}, timeout=10)
    resp.raise_for_status()
    print("posted report to Discord")


if __name__ == "__main__":
    main()

The ROW_NUMBER() OVER (PARTITION BY slug ...) trick is the part people miss. Because the table stores a full snapshot every hour, rn = 1 is the newest sample and rn = 2 is the previous one, so the self-join gives you the exact hourly delta without storing any derived state. When I first wrote this with a GROUP BY slug, MAX(captured_at) it silently joined the wrong rows whenever two snapshots landed in the same second — the window function fixed it in one line.

Wiring it to GitHub Actions so a crash self-heals

The last piece is the scheduler that turns sys.exit(1) into free retries. This workflow runs hourly, commits the updated SQLite file back to the repo, and — critically — the probe failing does not corrupt the DB because the failed run never reaches conn.commit().

# .github/workflows/zenn-probe.yml
name: zenn-pv-probe
on:
  schedule:
    - cron: "7 * * * *"   # :07 every hour, off the busy :00 spike
  workflow_dispatch: {}

jobs:
  probe:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install requests
      - name: collect PV
        env:
          ZENN_SESSION: ${{ secrets.ZENN_SESSION }}
          KPI_DB: kpi.sqlite3
        run: python zenn_probe.py
      - name: report to discord
        if: success()
        env:
          DISCORD_WEBHOOK: ${{ secrets.DISCORD_WEBHOOK }}
        run: python zenn_report.py
      - name: persist db
        if: success()
        run: |
          git config user.name kpi-bot
          git config user.email kpi-bot@users.noreply.github.com
          git add kpi.sqlite3
          git diff --cached --quiet || git commit -m "pv snapshot"
          git push

One sharp edge that cost me an afternoon: I originally scheduled the cron at 0 * * * *. GitHub's shared scheduler is heavily oversubscribed on the top of the hour, and my runs drifted 5–15 minutes late or got dropped entirely. Moving to 7 * * * * made the firing time reliable to within a minute. If you run anything on GitHub cron, never use :00.

The second edge: committing a binary SQLite file to git grows history forever. After three weeks my .git was 6 MB for a 40 KB DB. If you keep this running for months, push the DB to a release asset or an S3 bucket instead of committing it — or squash the history periodically.

What three weeks of data actually told me

The payoff was not the automation — it was the shape of the curve. My top post gained 80% of its lifetime PV in the first 36 hours, then flatlined. Before I had hourly data I assumed Zenn traffic accumulated steadily; it does not. That single insight changed when I cross-post to Dev.to (now: within 12 hours, while the Zenn curve is still climbing, not three days later when it's already dead).

The probe is ~120 lines, costs nothing on GitHub's free tier (24 runs/day is well under the 2,000-minute limit), and the fail-loud design means I have never once had to ask "is this number real?" If you write on Zenn and care about what's working, an hour of setup buys you a KPI table you can actually trust — and EXPLAIN-able SQL beats squinting at a dashboard every time.

If you want to go deeper on the SQLite window-function patterns and CI observability behind this, Designing Data-Intensive Applications is the one book I keep reopening for the time-series modeling chapters.

DEV Community

I Built a Python Probe That Scrapes Zenn's Hidden Stats API and Commits PV Numbers to SQLite Every Hour (and Crashes on Purpose)

Why the `/api/me/articles` endpoint beats scraping HTML

The probe: `zenn_probe.py` that commits to SQLite or dies

Computing the hourly diff and pinging Discord at 500 PV

Wiring it to GitHub Actions so a crash self-heals

What three weeks of data actually told me

Top comments (0)

Why the /api/me/articles endpoint beats scraping HTML

The probe: zenn_probe.py that commits to SQLite or dies

Computing the hourly diff and pinging Discord at 500 PV

Wiring it to GitHub Actions so a crash self-heals

What three weeks of data actually told me

Why the `/api/me/articles` endpoint beats scraping HTML

The probe: `zenn_probe.py` that commits to SQLite or dies