DEV Community: MORINAGA

Five changes I made after exhausting GitHub Actions free minutes twice

MORINAGA — Tue, 21 Jul 2026 08:05:05 +0000

The billing block hit in May. I cleared it, changed nothing structural, and it hit again in late June: the same five workflows exhausted the same 2,000-minute free tier, only faster, because I'd added more daily jobs.

I could have upgraded to a paid plan. Instead I spent one session auditing where the minutes were actually going, and found the answer wasn't "the pipeline does too much" — it was "the pipeline reinstalls dependencies from scratch on every run, and fires on commits it doesn't need to care about." That's fixable without touching output frequency.

This post is the diff: what I changed across five workflow files, which changes should move the needle most, and one concurrency decision that looked simple but had a hidden trap. The savings figures are estimates from observed job durations and run frequency, not a controlled before-and-after billing measurement.

First: where the minutes were actually going

Before optimizing, I needed to understand the composition. The pipeline runs:

yt-publish: daily Short, ~6-8 minutes (ffmpeg + edge-tts + Playwright for OG)
yt-publish-longform: weekly, ~15-20 minutes (same stack plus mermaid-cli for diagram slides)
publish-articles: daily article drain, ~5-7 minutes (Playwright for OG images)
yt-analytics: daily, ~3-4 minutes
bluesky-queue: daily, ~2-3 minutes
ci: 4-parallel build on every push, ~8-10 minutes per job

The CI job was the quiet killer. Every content commit (articles, yt-queue updates, Bluesky queue refills, trends snapshots) was triggering a full 4-parallel build. Content commits happen 3-5 times per day. At up to 40 minutes per fire (four jobs at 8-10 minutes each), that's roughly 120-200 minutes daily from CI alone, on commits that touched no code.
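The arithmetic behind that estimate can be sketched in a few lines. This is only a back-of-envelope check using the upper-bound job duration from the list above; the function name is hypothetical, and real billing can differ from this estimate.

```python
FREE_TIER_MINUTES = 2_000  # monthly Linux allowance cited in this post


def ci_minutes_per_day(parallel_jobs: int, minutes_per_job: int, fires_per_day: int) -> int:
    """Estimated billed minutes per day; each parallel job is billed separately."""
    return parallel_jobs * minutes_per_job * fires_per_day


# 4 parallel jobs at ~10 minutes each, fired by 3-5 content commits per day.
low = ci_minutes_per_day(4, 10, 3)
high = ci_minutes_per_day(4, 10, 5)

# Days until CI alone would use up the free tier at the high estimate.
days_to_exhaust = FREE_TIER_MINUTES / high

print(low, high, days_to_exhaust)
```

At the high end, CI on content commits alone would consume the whole allowance in about ten days, before any publishing workflow runs.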

The single CI pipeline design made sense for a project this size; the trigger scope did not.

Pip requirements files — the largest estimated saving

Every Python workflow was installing dependencies inline:

- run: pip install --quiet edge-tts Pillow

No lockfile. No explicit pip download cache. On every run, pip may resolve and download packages again. setup-python caches pip's download cache, not an installed virtual environment, so installation still runs; the expected saving is reduced network and resolution work rather than a zero-cost restore.

The fix is two parts. First, extract each workflow's dependencies into a requirements file:

# .github/requirements-yt-publish.txt
edge-tts
Pillow

# .github/requirements-publish-articles.txt
playwright==1.50.0
pyyaml==6.0.1

Second, tell setup-python to cache based on that file:

- uses: actions/setup-python@v5
  with:
    python-version: "3.12"
    cache: "pip"
    cache-dependency-path: ".github/requirements-yt-publish.txt"

- name: Install Python deps
  run: pip install --quiet -r .github/requirements-yt-publish.txt

The cache-dependency-path ties cache invalidation to the dependency file. With pinned requirements as key input, a dependency change produces a new cache key. It still does not guarantee a hit or restore installed packages.
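To make the invalidation behaviour concrete, here is a minimal sketch of a content-hashed cache key. It is an illustration of the idea only: the key layout and the cache_key helper are assumptions, not the format setup-python actually uses.

```python
import hashlib


def cache_key(prefix: str, runner_os: str, requirements_text: str) -> str:
    """Build a key that changes whenever the requirements file content changes."""
    digest = hashlib.sha256(requirements_text.encode("utf-8")).hexdigest()
    return f"{prefix}-{runner_os}-{digest[:16]}"


pinned = "playwright==1.50.0\npyyaml==6.0.1\n"
bumped = "playwright==1.51.0\npyyaml==6.0.1\n"

same_a = cache_key("pip", "Linux", pinned)
same_b = cache_key("pip", "Linux", pinned)
changed = cache_key("pip", "Linux", bumped)

print(same_a == same_b, same_a == changed)
```

An unchanged file maps to the same key on every run, and any edit to a pin produces a different key. An unpinned file keeps the same key even when upstream releases a new version, which is why pinning matters for the key to mean anything.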

Playwright's chromium browser is a special case — pip cache doesn't cover the browser binary download. I added a separate actions/cache@v4 step with a static key:

- name: Cache Playwright chromium
  uses: actions/cache@v4
  with:
    path: ~/.cache/ms-playwright
    key: playwright-chromium-${{ runner.os }}-1.50.0

- name: Install Playwright browser (chromium only)
  run: python -m playwright install --with-deps chromium

The static key means the cache is valid until the Playwright version changes. Chromium downloads ~120MB; caching it saves 3-4 minutes per run on the workflows that use it. The OG image generation and article publishing pipeline both use Playwright — those two workflows' Playwright installs were adding up.

The npm global cache for mermaid-cli

The longform video pipeline uses mermaid-cli to render .mmd diagram files into PNG slides, as I described in the mermaid and matplotlib slide pipeline. Installing it is slow:

npm install -g @mermaid-js/mermaid-cli

@mermaid-js/mermaid-cli pulls puppeteer which downloads a bundled Chromium. On a cold runner: 8-12 minutes. Running once per week, that's 35-50 minutes monthly from one install command.

The fix needs two pieces. First, a package.json for the cache key:

// .github/mermaid-cli-package.json
{
  "dependencies": {
    "@mermaid-js/mermaid-cli": "latest"
  }
}

Then the cache step and guarded install:

- uses: actions/setup-node@v4
  with:
    cache: "npm"
    cache-dependency-path: ".github/mermaid-cli-package.json"

- name: Cache global mermaid-cli
  uses: actions/cache@v4
  with:
    path: ~/.npm-global
    key: mermaid-cli-${{ runner.os }}-${{ hashFiles('.github/mermaid-cli-package.json') }}

- name: Install mermaid-cli (diagram slides)
  run: |
    npm config set prefix ~/.npm-global
    export PATH="$HOME/.npm-global/bin:$PATH"
    echo "$HOME/.npm-global/bin" >> "$GITHUB_PATH"
    if ! command -v mmdc >/dev/null 2>&1; then
      npm install -g @mermaid-js/mermaid-cli
    fi

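The ~/.npm-global path isn't covered by setup-node's npm cache, which stores npm's download cache rather than globally installed packages, so it needs its own actions/cache step with an explicit path. The if ! command -v mmdc guard means a cache hit skips the install entirely. Cold: 10+ minutes. Warm: ~30 seconds. One caveat: with "latest" in the package.json, the hash never changes, so a warm cache keeps serving whatever version was installed first until the key is changed.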

Trigger pruning — the conceptual change

The CI workflow had a push trigger with no path restrictions:

on:
  push:
    branches: [main]

Every commit triggered a 4-parallel build. That includes every content-generation bot commit. Adding paths-ignore was a three-line YAML change:

on:
  push:
    branches: [main]
    paths-ignore:
      - "content/**"
      - "docs/**"

Content and docs changes don't affect build behavior. The four GitHub Actions patterns I've built all avoid mixing content commits with code CI for exactly this reason — they have different failure modes and different stakes.
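One detail of paths-ignore is easy to miss: the workflow is skipped only when every changed file matches an ignored pattern. The sketch below models that rule. It is an approximation, since it uses Python's fnmatch rather than GitHub's own glob matcher, and should_run is a hypothetical helper.

```python
from fnmatch import fnmatchcase

IGNORE_PATTERNS = ["content/**", "docs/**"]


def should_run(changed_files: list[str], ignore_patterns: list[str]) -> bool:
    """Run the workflow if at least one changed file matches no ignore pattern."""
    return any(
        not any(fnmatchcase(path, pattern) for pattern in ignore_patterns)
        for path in changed_files
    )


content_only = should_run(["content/articles/post.md", "docs/notes.md"], IGNORE_PATTERNS)
mixed = should_run(["content/articles/post.md", "src/index.ts"], IGNORE_PATTERNS)

print(content_only, mixed)
```

A bot commit that touches only content is skipped, while a commit that mixes content with a code change still triggers the build.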

The second trigger pruning was on publish-articles. That workflow had both a push trigger (fires when a new .md file is committed) and a daily cron:

on:
  push:
    branches: [main]
    paths:
      - "content/articles/**/*.md"
  schedule:
    - cron: "0 6 * * *"

The cron already handles the cadence correctly. It's idempotent — articles already published to Dev.to or Hashnode are skipped on re-runs. The push trigger was adding ~17 redundant runs per week (roughly matching the article generation cadence), each installing Playwright and pnpm from scratch. I dropped the push trigger and kept only the cron.

The cron scheduling patterns post covers the case for cron-as-cadence-engine in more depth. The short version: push triggers are for reacting to code changes; content pipelines shouldn't depend on them.

Concurrency discipline — and the Bluesky exception

For three regenerable-data workflows — yt-analytics, refresh-content, and trends-fetch — I flipped cancel-in-progress from false to true:

concurrency:
  group: yt-analytics
  cancel-in-progress: true

These jobs fetch data and write it to the repo. If a new run starts while the old one is still running, the old run's output will be overwritten anyway. Canceling it early is just faster.

The Bluesky queue workflow stayed at cancel-in-progress: false. This is the trap I mentioned earlier.

In June I hit a duplicate-post incident caused by a different cancellation pathway: a queue runner posted a Bluesky entry but was interrupted before committing the queue update that marks it as sent. The next run re-read the same entry as unposted and posted it again.

cancel-in-progress: true creates that exact race structurally. Post happens → cancel fires before queue-file commit → next run reposts. For jobs where the side effect is idempotent or reversible, cancel-in-progress is safe. For Bluesky posts, the side effect is a public post that can't be unsent by the pipeline. The 2-3 minutes of overlap cost is less than the cost of a duplicate post.
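The race is small enough to simulate. In this sketch the lists and the run_queue function are stand-ins I made up for illustration: "posted" plays the role of the public Bluesky feed and "committed" the queue file that marks entries as sent.

```python
def run_queue(queue, posted, committed, cancel_before_commit=False):
    """One workflow run: post the first unsent entry, then record it as sent."""
    for entry in queue:
        if entry in committed:
            continue  # already marked as sent, skip it
        posted.append(entry)  # irreversible side effect: the public post
        if cancel_before_commit:
            return  # run cancelled after posting, before the queue commit
        committed.add(entry)  # the commit that marks the entry as sent
        return


queue = ["post-1"]
posted: list[str] = []
committed: set[str] = set()

run_queue(queue, posted, committed, cancel_before_commit=True)  # cancelled run
run_queue(queue, posted, committed)  # next run sees post-1 as unsent

print(posted)
```

The second run has no way to know the first one already posted, so the entry goes out twice. Letting the first run finish closes that window.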

The cron timing bugs post has a section on exactly this pattern — treating cancel-in-progress as a safe default rather than thinking about what happens if a cancel fires at each step.

What worked, what didn't, what I'd do differently

Worked well: The pip requirements files and the mermaid-cli cache together. These are pure wins with no tradeoffs — same output, less time, no new failure modes. The paths-ignore on CI was similarly clean.

Worked with caveats: The static Playwright cache key. It works until Playwright releases a version bump. When it does, the cached chromium won't match. I need to manually update the key in the YAML — which I will forget at some point. A version-aware key would be safer: playwright-chromium-${{ runner.os }}-${{ hashFiles('.github/requirements-publish-articles.txt') }}.

Didn't do well: I didn't instrument before I cut. I estimated minute consumption from first principles rather than pulling GitHub's usage reports. The estimates were directionally correct but I don't know how accurate. GitHub has per-workflow usage reporting in the billing settings — I should check it monthly, not wait until the block hits.

Would do differently from day one: Write requirements files at project start. Writing pip install edge-tts Pillow inline in YAML is fast at the time and creates months of cache miss debt. requirements.txt is three lines of work that pays back immediately.

FAQ

How many free minutes does GitHub Actions give?
2,000 Linux runner-minutes per month, per account (both free-plan personal accounts and free-plan organizations). Linux minutes count 1x against that allowance, Windows minutes count 2x, and macOS minutes count 10x. Linux is almost always the right default for CI. See the GitHub Actions billing docs for current limits.

Does pip caching actually restore correctly?
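Yes, for what it covers: setup-python restores pip's download cache, not installed packages, so pip install still runs but can skip most downloads. Point cache-dependency-path at a pinned requirements file; the cache key hashes that file, so a requirements change produces a new key and a fresh download. Without the path hint, the key is derived from whatever dependency files the action finds by default, which may not be the file this workflow installs from.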

Which workflows should keep cancel-in-progress: false?
Any workflow that writes non-idempotent side effects: posting to external APIs, sending notifications, updating queue state that drives future runs. Map the side effects first. If any step writes something that can't be safely re-done or reverted, keep cancel-in-progress off.

Can I avoid actions/cache for Playwright?
You can use setup-python cache: pip alone, but pip doesn't cache the browser binaries that playwright install downloads. You need the separate actions/cache@v4 step pointed at ~/.cache/ms-playwright to capture the chromium download.

Part of an ongoing 6-month experiment running three AI-curated directory sites. The technical claims here are real; this article was AI-assisted.

Notable this week: Inkling open weights, GitHub Models sunset, Supabase Multigres

MORINAGA — Tue, 21 Jul 2026 08:04:59 +0000

Sunday curated reading. I maintain three AI-curated directory sites and track open-weight model releases and tooling changes for practical reasons: new models land in my Top AI Tools ETL pipeline, and anything touching the GitHub or Supabase ecosystem affects my OSS alternatives directory directly. Five things from this week worth annotating.

1. Inkling — first open-weight model from Thinking Machines Lab

Announced July 15 under Apache 2.0. Thinking Machines Lab is Mira Murati's company, founded after she left OpenAI, and this is their first public model. Inkling is a 975B-parameter MoE where roughly 41B parameters are active per forward pass. It was trained on 45 trillion tokens across text, image, audio, and video, and reasons natively across all four modalities. Weights are on HuggingFace with a 1M context window; the Tinker API drops that to 256K but provides hosted inference.

The Apache 2.0 license is clean — no "non-commercial only" carve-outs, no "you can't compete with us" provisions I can find. What I don't know yet is whether native audio input actually changes anything useful for content generation workflows like mine, or whether "trained on audio" means something different from "usable for audio-adjacent tasks." I'll add Inkling to my AI tools directory listing queue this week and benchmark it against the ETL prompts I currently run on Haiku before drawing any conclusions about model swap feasibility.

2. GitHub Models retires July 30 — ten days left

GitHub announced in June it was closing the free model playground, with the hard shutdown confirmed for July 30. That covers the playground, model catalog, inference API, and bring-your-own-key (BYOK). Brownouts ran July 16; next one is July 23.

I stopped using GitHub Models months ago when I standardized on Claude Haiku via the Anthropic API. But a lot of solo developers I've seen in HN comments relied on it for low-volume prototyping before committing to a billing relationship with any specific provider. The official migration path is Azure AI Foundry. Because GitHub Models spoke the OpenAI-compatible format, the mechanical part of migration is usually just a base URL and key swap — the heavier cost is that Azure's pricing model doesn't have a free tier in the same sense. The product always had a designed off-ramp built in: experiment here, graduate to Azure when you're ready to scale. That arc completed on schedule.

3. Supabase Multigres is now open source

From the Supabase July 2026 developer update: the Multigres Kubernetes operator is now fully open source. It covers direct pod management, zero-downtime rolling upgrades, pgBackRest PITR backups, and OpenTelemetry tracing. The same update includes a TanStack DB alpha integration that syncs collections with Supabase tables over PostgREST and Realtime, and Wrappers v0.6.2 adding MongoDB join support from Postgres.

I already had a Supabase yt-longform video spec queued up in my pipeline before this roundup — the git commit is timestamped before I drafted this — so that connection isn't retrofitted. The Multigres open-source release matters to the OSS alternatives directory because Supabase is listed as an alternative to several managed-Postgres products, and a production-grade Kubernetes operator being freely available changes the self-hosting calculus. I'll watch whether actual self-hosted Supabase deployments increase in the GitHub ETL signal over the next 30 days.

4. CodeQL now detects system prompt injection in JavaScript and TypeScript

From the GitHub Changelog this week: CodeQL added a JS/TS query targeting untrusted values flowing into AI model system prompts without sanitization. The rule catches the pattern where user-controlled or externally-sourced data reaches a system: parameter in LLM API calls directly.

This is the class of vulnerability I've been informally aware of and formally sloppy about. My ETL passes game descriptions, model metadata, and README snippets from Steam, GitHub, and HuggingFace into Claude Haiku prompts. None of that goes through a sanitization pass — my threat model is that the external APIs don't supply adversarial content, which is true until it isn't. The CodeQL query gives me a concrete way to surface which call sites are the highest-risk. I'll run it on the packages that do prompt construction and treat the output as a prioritized list, not necessarily a mandate to fix everything it flags.

5. GitHub Copilot agentic browser tools are generally available

Now GA by default in VS Code, no flag needed. Copilot agents can navigate web pages, inspect DOM content, capture screenshots, and validate web app behavior from within the editor. Parallel agent sessions and visible per-chat cost are also included.

I don't use Copilot — my agentic workflow runs on Claude Code — so this doesn't change my daily setup. What I'm watching is whether browser-navigating agents in VS Code become standard CI tooling or stay in the "advanced feature" tier. For my kind of deployment (Cloudflare Pages, Vercel, static sites with OG images and JSON-LD I want to verify post-deploy), the ability to check a live page from the same session that deployed it is genuinely useful. If Claude Code ships comparable browser tooling, I'd use it. If Copilot's GA adoption makes this the assumed baseline, that changes what readers expect from articles about verification pipelines.

Part of an ongoing 6-month experiment running three AI-curated directory sites. The technical claims here are real; this article was AI-assisted.

One env flag that strips affiliate CTAs for AdSense review — without touching code

MORINAGA — Mon, 20 Jul 2026 08:34:57 +0000

After four AdSense rejections, I did a more careful read of what the ossfind.com pages looked like to a human reviewer. The verdict: affiliate CTAs, an Amazon product widget, and "sister site" cross-links to aiappdex and findindiegame — together, they made the site read as a revenue-motivated content network rather than a standalone editorial resource. That's a rejection signal I had underestimated.

I decided to try for approval on ossfind specifically. But stripping those revenue channels to pass review and then re-adding them after would normally mean multiple code deploys. I didn't want that risk — one botched re-activation and the affiliate links silently don't appear.

The solution was a single environment variable: PUBLIC_REVIEW_MODE=1.

What it hides

The getMonetization() function in packages/shared/src/monetization/index.ts returns an enabled object that each Astro component checks before rendering:

const reviewMode = process.env.PUBLIC_REVIEW_MODE === "1";
return {
  // ...
  enabled: {
    ads: mode === "adsense" && !!adsenseClient,
    amazon: !!amazonTag && !reviewMode,
    affiliate: !reviewMode,
    newsletter: !!newsletterAction,
    ga4: !!ga4Id,
  },
};

When PUBLIC_REVIEW_MODE=1 is set in Cloudflare Pages:

enabled.amazon → false: AmazonRecommend.astro renders nothing. No product widget, no affiliate tag in any URL.
enabled.affiliate → false: All hosting referral CTAs (DigitalOcean, Hetzner, Vultr, RunPod, Vast.ai) render nothing. The GPU affiliate links I wrote about wiring into the AI tools directory are hidden.
Sister-site cross-links: these are controlled separately in templates by checking !reviewMode directly. In review mode the nav and footer drop the "Also see: aiappdex.com / findindiegame.com" links.

Crucially, enabled.ads is not gated by reviewMode. The AdSense script tag needs to be present for account domain verification and for the review itself. Hiding it would defeat the purpose.

Why build it at the environment level, not as a code branch

I could have added a REVIEW_MODE boolean to the TypeScript config object and deployed. But the key property I wanted was zero code change on both entry and exit.

Setting a Cloudflare Pages environment variable triggers an automatic redeploy. Removing it triggers another. No PR, no review, no merge. If I later decide mid-review that I want to re-enable Amazon links to test something, I can do it without touching source. The audit trail lives in Cloudflare's env change log, not in git commits.

The PUBLIC_ prefix matters for Astro: environment variables without it are server-only. PUBLIC_REVIEW_MODE is read at build time by getMonetization() inside packages/shared, which runs during Astro's SSG build. The value bakes into the static HTML — there's no runtime re-evaluation. That's correct behaviour for a static site: the entire site builds clean with all affiliate links removed, and any cached CDN responses reflect that clean state.

Cloudflare Pages environment variable updates trigger an automatic redeploy of the project, which is the mechanism that makes this pattern work without manual CI intervention.

Restoring after approval

One Cloudflare Pages action: remove PUBLIC_REVIEW_MODE from the project's environment variables. On the next redeploy (automatic, triggered by the env change), all affiliate and Amazon paths re-enable wherever the existing code checks enabled.affiliate and enabled.amazon.

Nothing in the TypeScript needs to change. The pattern is closer to a feature flag than a code branch — which is why I used the env layer instead of git checkout -b review-clean.

I don't know yet how long AdSense review takes, or whether they re-crawl after an initial pass. Affiliate monetization is the primary strategy now for two of the three sites regardless, so this review mode only applies to ossfind. I'll publish numbers when there's something real to report.

What I still haven't figured out

Whether AdSense reviews manually or with a bot. The rejection emails have all been form letters with category codes, not specific page URLs. So I don't know whether a reviewer is clicking through a live site or scraping a cached snapshot. If it's the latter, the redeploy timing matters in a way I can't control.

I also don't know if the newsletter field should be gated by reviewMode. Newsletter forms are probably neutral to AdSense, but I'm not certain. Right now they stay visible during review — if there's another rejection I'll check that as well.

Part of an ongoing 6-month experiment running three AI-curated directory sites. The technical claims here are real; this article was AI-assisted.

How I built Bluesky summary cards in Python — YAML frontmatter to 1080×1350 PNG

MORINAGA — Mon, 20 Jul 2026 08:34:52 +0000

The standard article OG image I generate for aiappdex.com is 1200×630 — the landscape aspect ratio that works well on Twitter, Discord, and most link-preview boxes. That system uses Playwright and zero image API calls, which I described earlier.

Bluesky is different. When a post links to an article that has a summary_image URL in its <meta> tags, Bluesky renders that image in portrait format, and the native crop target is 1080×1350, a 4:5 portrait ratio that fills most of a phone screen. A landscape image rendered at that scale ends up letterboxed and small. I wanted something that looked intentional at 1080×1350, so I built a second image pipeline.

The result: generate-summary.py, a Python script that reads a summary_data block from each article's frontmatter, renders one of three visual layouts via an inline HTML template, and screenshots it with Playwright Chromium at the exact pixel dimensions. No external API, no third-party image service.

Why a separate YAML schema instead of using the title alone

My first instinct was to generate the summary image from the article title: large text, dark background, brand mark, done. I already do something like that for cover images, so the code exists.

The problem is that article titles are often too long for a 1080×1350 canvas at a readable font size. "How I built a shared Claude Haiku client with system-prompt caching for batch ETL" is 81 characters. At 76px bold, that overflows. And even when it fits, a wall of title text doesn't communicate structure; it's just a slightly bigger version of the link card's plain text.

So I designed a summary_data YAML block that each article can opt into. The schema has five keys:

summary_data:
  title_html: "YAML → <accent>Bluesky card</accent>\n1080×1350 PNG, zero API cost"
  cards:
    - { icon: "🤖", domain: "aiappdex.com", desc: "AI model directory", stat: "LIVE" }
    - { icon: "🎮", domain: "findindiegame.com", desc: "Indie game search", stat: "LIVE" }
    - { icon: "🔓", domain: "ossfind.com", desc: "OSS alternatives", stat: "LIVE" }
  pipeline:
    - "ETL fetch"
    - "Claude Haiku gen"
    - "Cache to Turso"
    - "Build & deploy"
    - "IndexNow ping"
  stats:
    - { num: "$25", label: "monthly cost" }
    - { num: "3",   label: "sites" }
    - { num: "880", label: "pages daily" }
    - { num: "6mo", label: "horizon" }
  tagline: "An honest 6-month indie experiment"

title_html is the headline. It supports \n for line breaks and <accent>...</accent> tags to highlight specific words in a contrasting colour. All other keys are for the visual zone below the title: cards for a three-column comparison grid, pipeline for a sequential step flow, and stats for a four-number data row. Only one of those three should appear in any given article — the renderer handles whichever key is present and renders nothing for the others.

tagline is the footer line, always shown if present. It falls back to "An honest indie experiment" if absent.

The accent tag and why XSS-paranoia matters even in local tools

The title_html field is rendered in Python into an HTML string, which is then loaded in a headless browser. In theory, this is a local tool with no user input, so markup injection (XSS) isn't a realistic threat. But I've built enough internal tools that "local only" stops being true quickly, and the real reason to do this right is code that future-me can audit easily.

The approach: everything in title_html gets html.escape() by default. The only exception is <accent>...</accent> tags, which the renderer finds with a narrow regex, escapes their inner text separately, and re-injects as <span class="accent">.

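import re
from html import escape

def render_title_html(raw: str) -> str:
    parts: list[str] = []
    cursor = 0
    for m in re.finditer(r"<accent>(.+?)</accent>", raw, re.DOTALL):
        # Text before the accent tag: escape first, then convert newlines.
        parts.append(escape(raw[cursor : m.start()]).replace("\n", "<br/>"))
        # Inner accent text is escaped separately and wrapped in a styled span.
        parts.append(
            f'<span class="accent">{escape(m.group(1)).replace(chr(10), "<br/>")}</span>'
        )
        cursor = m.end()
    # Trailing text after the last accent tag.
    parts.append(escape(raw[cursor:]).replace("\n", "<br/>"))
    return "".join(parts)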

The \n-to-<br/> replacement happens after escaping, not before, so there's no way to inject markup through newlines. The outer text never runs through the browser as raw HTML. If someone puts <script> in their title_html, it renders as literal <script> on screen.
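A quick usage check makes the escaping behaviour visible. The block repeats the function from above so it runs on its own; the sample input is made up for illustration.

```python
import re
from html import escape


def render_title_html(raw: str) -> str:
    """Escape everything, allowing only <accent>...</accent> through as a span."""
    parts: list[str] = []
    cursor = 0
    for m in re.finditer(r"<accent>(.+?)</accent>", raw, re.DOTALL):
        parts.append(escape(raw[cursor : m.start()]).replace("\n", "<br/>"))
        parts.append(
            f'<span class="accent">{escape(m.group(1)).replace(chr(10), "<br/>")}</span>'
        )
        cursor = m.end()
    parts.append(escape(raw[cursor:]).replace("\n", "<br/>"))
    return "".join(parts)


rendered = render_title_html("YAML → <accent>Bluesky card</accent>\n<script>")
print(rendered)
# YAML → <span class="accent">Bluesky card</span><br/>&lt;script&gt;
```

The accent tag becomes a span, the newline becomes a line break, and the script tag arrives in the browser as inert text.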

The three visual modes

The visual zone below the title picks its layout from whichever key appears in summary_data:

Cards — a three-column grid for "three things compared" framing. Each card has an emoji icon, a short domain or label, a 50-80 character description, and an optional UPPERCASE stat tag. I use this for articles that describe all three directory sites, or for comparison articles (tool A vs B vs C).

Pipeline — a five-step horizontal flow with numbered circles and arrows between them. Good for articles that describe sequential processes. The step labels support \n for two-line labels when the text is long.

Stats — a four-number row for articles with memorable metrics. The number gets rendered large in #FCD34D (amber) and the label in smaller uppercase grey. I use this for articles with cost, duration, or count data that's worth highlighting.

If none of these three keys appear, the script generates nothing for that article and the article's Bluesky post falls back to the standard landscape cover_image. The pre-post QC gate doesn't care which image field is used — it only cares that some image is present before the post goes out.

Playwright rendering and the frontmatter backfill

The script reuses a single Playwright browser context across all articles in the batch:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    # One context and page, sized to the target image, reused for every article.
    context = browser.new_context(viewport={"width": 1080, "height": 1350})
    page = context.new_page()
    for path in targets:
        meta, _ = parse_frontmatter(path.read_text())
        summary = meta.get("summary_data")
        if not isinstance(summary, dict):
            continue  # article has not opted in
        html = build_html(summary)
        page.set_content(html, wait_until="networkidle")
        # out_path (the per-article PNG destination) is computed elsewhere in the script.
        page.screenshot(
            path=str(out_path),
            clip={"x": 0, "y": 0, "width": 1080, "height": 1350},
        )

wait_until="networkidle" matters here. The HTML template loads Inter from Google Fonts. If the screenshot fires before the CSS is applied, the fallback system font renders instead and the result looks wrong. networkidle blocks until the font CDN request completes.

The clip parameter is a deliberate safeguard. By default page.screenshot() captures the current viewport, which is already 1080×1350 here, so the clip mostly guards against drift: if the viewport or a full_page setting changes later, the explicit clip still forces exactly the declared dimensions regardless of content height.

After generating the PNG, the script calls update_summary_image_field(), which re-reads the article file and adds a summary_image: line to its frontmatter if it isn't already there:

summary_url = f"{HOST}/og/summary/{slug_base}.png"
new_content = re.sub(
    r"^(---\n)([\s\S]*?)(\n---\n)",
    lambda m: m.group(1) + m.group(2) + f"\nsummary_image: {summary_url}" + m.group(3),
    content,
    count=1,
)

This means I never have to manually add the summary_image: URL to an article I wrote. I add summary_data, run the script, and the field appears. The downstream Bluesky queue refill reads that field when it constructs a post for the article.
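The "if it isn't already there" guard is not visible in the excerpt above, so here is a minimal sketch of how an idempotent version could look. The add_summary_image name and the example URL are assumptions for illustration, not the script's actual code.

```python
import re

FRONTMATTER_RE = re.compile(r"^(---\n)([\s\S]*?)(\n---\n)")


def add_summary_image(content: str, summary_url: str) -> str:
    """Append summary_image to the frontmatter unless the key already exists."""
    m = FRONTMATTER_RE.match(content)
    if m is None:
        return content  # no frontmatter block, leave the file untouched
    if re.search(r"^summary_image:", m.group(2), re.MULTILINE):
        return content  # already backfilled, nothing to do
    return (
        m.group(1)
        + m.group(2)
        + f"\nsummary_image: {summary_url}"
        + m.group(3)
        + content[m.end():]
    )


article = "---\ntitle: Example\n---\nBody text\n"
once = add_summary_image(article, "https://example.com/og/summary/example.png")
twice = add_summary_image(once, "https://example.com/og/summary/example.png")

print(once == twice)
```

Running the backfill twice leaves the file unchanged after the first pass, which matters when the script runs on every CI build.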

The output path is apps/ai-tools/public/og/summary/<slug>.png, which Astro copies to /og/summary/<slug>.png during the static build. The same PNG is then served from aiappdex.com — the same domain that hosts all three sites' OG images, rather than spreading them across three different domains.

The CI pipeline that runs generate-og.py runs generate-summary.py in the same step, immediately after. Both scripts reuse the same Playwright installation, so there's no double browser-download cost in CI.

Choosing the visual mode per article

My decision tree for which summary_data to write:

Pipeline: the article describes a sequential process with clear stages. Most technical "how I built X" articles fall here.
Cards: the article compares or describes exactly three things. The three-site overview article, model comparison articles, tool roundup articles.
Stats: the article has four numbers worth remembering. Cost, page count, API call count, duration. If fewer than four numbers, I either pad with something defensible or skip stats in favour of pipeline.
Omit: the article is a lightweight recap or curated list. The frontmatter doesn't have a clean structured story to tell. The Bluesky URL card uses cover_image instead.

The content quality gate doesn't validate summary_data — that's outside its scope. But the generator script itself validates silently: a malformed YAML block causes meta.get("summary_data") to return a non-dict, and the article is skipped without error. I've been thinking about adding a dry-run validation step there.
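A dry-run check along those lines could be quite small. The sketch below is one possible shape, with rules taken from my reading of the schema described earlier; check_summary_data and its messages are hypothetical.

```python
VISUAL_KEYS = ("cards", "pipeline", "stats")


def check_summary_data(meta: dict) -> list[str]:
    """Return human-readable problems instead of silently skipping the article."""
    summary = meta.get("summary_data")
    if summary is None:
        return []  # article has not opted in, which is fine
    if not isinstance(summary, dict):
        return ["summary_data is not a mapping"]

    problems: list[str] = []
    if not isinstance(summary.get("title_html"), str):
        problems.append("title_html is missing or not a string")

    present = [key for key in VISUAL_KEYS if key in summary]
    if len(present) != 1:
        problems.append(f"expected exactly one visual key, found {present}")
    for key in present:
        if not isinstance(summary[key], list):
            problems.append(f"{key} is not a list")
    return problems


ok = check_summary_data({"summary_data": {"title_html": "Hi", "stats": []}})
bad = check_summary_data({"summary_data": "oops"})

print(ok, bad)
```

Printing these messages in CI without failing the build would surface malformed blocks while keeping the generator's skip-on-error behaviour.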

What I'd do differently

Embed Inter as Base64 instead of fetching from Google Fonts. The wait_until="networkidle" approach works, but it depends on outbound internet access from the CI runner. In a cold environment or a GitHub Actions runner with network restrictions, the font fetch can fail silently — Playwright renders with the fallback font but doesn't raise an exception. I should detect this, either by checking that the font loaded successfully or by embedding the subset as Base64 in the HTML. The YouTube slide renderer uses Pillow with a local .ttf font for exactly this reason.
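The Base64 route could look roughly like this. It is a sketch under assumptions: that a subsetted Inter file in woff2 format is available locally, and that the resulting CSS is pasted into the template's style block. The font_face_css helper is hypothetical.

```python
import base64


def font_face_css(font_bytes: bytes, family: str = "Inter") -> str:
    """Build an @font-face rule that embeds the font as a data URI."""
    encoded = base64.b64encode(font_bytes).decode("ascii")
    return (
        "@font-face {"
        f" font-family: '{family}';"
        f" src: url(data:font/woff2;base64,{encoded}) format('woff2');"
        " }"
    )


# Placeholder bytes stand in for the real font file's contents here.
css = font_face_css(b"abc")
print(css)
```

With the font inlined, the page has no network dependency for fonts, so a blocked or slow CDN can no longer change how the card renders.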

A dead reference at the top. The script defines TEMPLATE_PATH = ROOT / "scripts/summary_template.html" but never uses it — the actual template is HTML_TEMPLATE, an inline constant in the same file. I initially intended to read from the file, then inlined it for portability, and forgot to remove the reference. It's harmless but I notice it every time I open the file.

Async Playwright for batch performance. The current script uses sync_playwright and processes articles sequentially. On a full regeneration pass over 70+ articles, that's noticeable — maybe 3-4 minutes in CI. The Playwright Python library has an async version; batching 10 concurrent page.set_content calls would cut that significantly. I haven't bothered because the script only runs in the og-image CI step, not on the critical path of any content publish.
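The concurrency part of that idea can be sketched without a browser. Here fake_render is a stand-in for a coroutine that would open its own page through playwright.async_api and take the screenshot; the limit of 10 matches the batch size mentioned above, and render_all is a hypothetical helper.

```python
import asyncio


async def render_all(items, render_one, limit: int = 10):
    """Run render_one over all items with at most `limit` in flight at once."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(item):
        async with semaphore:
            return await render_one(item)

    # gather returns results in the same order as the input items.
    return await asyncio.gather(*(guarded(item) for item in items))


async def fake_render(slug: str) -> str:
    """Stand-in for the real work: render one article and return its PNG name."""
    await asyncio.sleep(0)
    return f"{slug}.png"


results = asyncio.run(render_all(["a", "b", "c"], fake_render, limit=2))
print(results)
```

In a real version each task would need its own page, since a single page cannot hold several documents at once.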

FAQ

Why not just use an image generation API?
For 70 articles, an API at even $0.02/image would cost $1.40 per full regeneration pass. In CI that runs daily, that's $42/month. Playwright Chromium is already installed because I use it for OG images and for Bluesky image upload race detection. Zero marginal cost.

What if I want to update the card design across all articles?
Change the HTML_TEMPLATE constant and rerun the script against all articles. The whole batch regenerates in a few minutes. Since the slug-based filenames are stable, the CDN-cached URLs stay the same.

Can I use more than one visual mode per article?
The renderer will display every mode present in summary_data, up to all three. But the canvas is 1350px tall, and three modes together push the tagline below the fold. In practice I pick one. The template was designed for exactly that: cards, pipeline, or stats each fills the middle section on its own. Two modes at once make both feel cramped.

What I'm watching this week: GPT-5.6 Sol, Steam Machine, Krea 2, and two more

MORINAGA — Sun, 19 Jul 2026 07:56:08 +0000

Five things from this week's HN and release feeds that I actually read and thought about. I run three AI-curated directories — Top AI Tools, Find Games Like, Open Alternative To — so "worth watching" means something affected my sourcing, ETL planning, or how I'd categorize things.

GPT-5.6 Sol — government vetting before access

OpenAI announced GPT-5.6 Sol on June 26 (HN: 713 points). The follow-up story got more attention in the comments: the US government will vet users before granting access to Sol, making it the first major frontier model gated by something other than a credit card and API key.

I'm not building with GPT-5.6 Sol. My pipeline runs Claude Haiku 4.5 for ETL because latency and cost matter more than ceiling performance at the volume I operate. But the vetting arrangement creates a categorization problem for aiappdex.com: a model that exists but isn't freely accessible doesn't fit cleanly into the same listing format as one you can call tomorrow with an API key. I'm going to add a "restricted access" flag to the directory schema and see how it affects search behavior. The gap between "exists" and "available" is going to widen.

Steam Machine relaunches

Valve's new Steam Machine hardware line was the highest-scored HN story this week at 1,025 points on June 22. The first run (2015–2018) ended quietly; the new version arrives after the Steam Deck proved Linux-native gaming hardware has real demand. These are SteamOS-native PCs targeting the living room.

For findgameslike.com, Steam-based game discovery is 80% of what I surface. If Steam Machine gains meaningful market share, the couch-gaming segment grows and my existing Steam-first recommendations stay well-aligned. I'm not changing anything yet — a launch announcement isn't a sales number — but it's going in my "things to revisit in Q3" list.

Krea 2 — 12B open-weights image generation

Krea released Krea 2, a 12-billion-parameter image generation model under open weights, distributed via HuggingFace. The benchmark numbers put it near the top of the open-weights image gen category right now, ahead of several larger models.

I don't use ML inference for image generation today. My YouTube slide renderer assembles frames with Pillow and pre-rendered templates, which is faster and cheaper for my CI budget. But I track image gen models because the architectural pattern keeps compressing — 12B this year, probably 4B-distilled next year, probably usable in constrained CI the year after. Krea 2 is the kind of release I add to the AI tools directory and watch for community adoption velocity.

OpenAI's first custom chip, designed with Broadcom

OpenAI announced its first proprietary inference chip on June 24 (HN: 417 points), built with Broadcom on TSMC 3nm. The announcement was thin on specs — inference-optimized, no published performance numbers, no timeline for external access.

The question for builders isn't whether we'll run this chip — we won't. It's what custom silicon means for pricing strategy. Labs that control inference costs can discount selectively: lower prices on models they want to commoditize, keep proprietary tiers expensive. I wrote about why I'm betting on cross-channel distribution over months 1–6; one of the background assumptions in that bet is that frontier API prices stay roughly flat. Custom silicon moves that variable.

OpenKnowledge — open-source alternative to Obsidian and Notion

inkeep/open-knowledge launched as a Show HN on June 25 and reached 151 points. It's a local-first, AI-first personal knowledge base that targets both Obsidian and Notion as alternatives. The differentiator is LLM-assisted retrieval integrated from day one, rather than bolted onto an existing feature set.

This goes directly into ossfind.com. The interesting decision is how to categorize it: as a Notion alternative (collaboration/docs), an Obsidian alternative (personal knowledge), or a new category (AI-native knowledge). I'm currently using a four-signal scoring system for OSS directory decisions; OpenKnowledge clears three of the four — GitHub activity, license clarity, and differentiated framing. The fourth signal (adoption trajectory) I'll check again in 30 days.

Part of an ongoing 6-month experiment running three AI-curated directory sites. The technical claims here are real; this article was AI-assisted.

Notable this week: Open Source AI state report, Comic Chat, LM Studio Bionic, SQLite

MORINAGA — Sun, 19 Jul 2026 07:56:04 +0000

Sunday reading round-up. I run three AI-curated directory sites and check HN a few times a week for things that either affect my stack directly or give me something honest to write about on the content side. This week had five things worth annotating.

1. stateofopensource.ai — a reference document I'll keep returning to

stateofopensource.ai hit 340 HN points July 17. I clicked through expecting marketing copy and got something closer to an actual dataset: license comparisons across the major open-weight releases, training compute estimates, and capability benchmarks organized by release date.

What's useful for me: I'm constantly adding new models to my AI tools directory, and I need a consistent framework for what "open" actually means in each case. This report is specific. It separates "open weights" (download the checkpoint), "open training data" (reproducible pretraining), and "open source" (full stack, permissive license) as distinct categories, and classifies models accordingly. "Open weights" and "open source AI" get conflated constantly — in HN discussions, in model announcements, in my own past writing. Having a reference that draws the line cleanly is something I'll cite when writing model descriptions and when I'm deciding whether a release qualifies for the OSS alternatives directory.

2. Microsoft Comic Chat goes open source

Announced July 16 with 442 HN points. Comic Chat was a 1996 Microsoft IRC client that rendered your chat conversation as a comic strip, with customizable cartoon avatars speaking in panels. The full source is now on GitHub.

The software itself isn't useful in 2026. What I'm watching is the pattern: Microsoft has now open-sourced several legacy internal products in quick succession. Each release follows the same arc — GitHub trending for 48 hours, nostalgia discussion, occasional "what if someone built X on top of this" thread. The comic-strip chat metaphor has genuine modern appeal for specific communities: accessible communication tools, visual-first interfaces for non-English speakers, creative chat formats for games. Whether anyone picks up the rendering code and builds something new is the question I'll watch over the next few months. I won't list Comic Chat itself in the OSS alternatives directory (there's no modern equivalent context to surface it in), but the release announcement is worth adding to the content queue for historical context posts.

3. LM Studio Bionic brings agent mode to local models

lmstudio.ai/blog/introducing-lm-studio-bionic, 79 HN points July 16. LM Studio is the desktop app for running local LLMs with a GUI. Bionic adds an agent layer on top — tool use, multi-step task execution, the same primitives that make Claude Code and similar tools useful.

My ETL pipelines currently run on Claude Haiku 4.5 via API because local models haven't cleared the latency bar for anything I run in CI. But there's a category of tasks — draft-and-review loops, non-time-critical enrichment jobs, one-off data transformations where a slow answer is fine — where "free and local" could beat "$0.0002 per call." If LM Studio Bionic makes it easy to point agent workflows at Qwen or Mistral running locally, that's the first realistic path I've seen toward migrating some workloads off the API. The 79 points tells me the dev community is intrigued but not yet convinced. I'm in the same position: watching, not moving anything yet.

4. Julia Evans on SQLite — and Lobste.rs switching to it in production

Two separate SQLite pieces landed on HN within 24 hours of each other.

Julia Evans' "Learning a few things about running SQLite" (110 points) covers journal modes, WAL behavior under concurrent writes, and practical limits she hit in real usage. It's the kind of post I read twice because it maps directly to tradeoffs I've made without fully understanding them. Lobste.rs, a production social site with real write concurrency, announced they moved from Postgres to SQLite (87 points).

I use Turso (libSQL, a SQLite fork with replication) for all three of my directories. My ETL jobs run concurrent upserts on the games and model tables, which is exactly the workload Julia's post covers. I've been on WAL mode since launch and haven't measured whether it's the correct choice or just the default I haven't questioned. The Lobste.rs migration is the more surprising signal — a site with sustained write load choosing SQLite over Postgres in 2026, not as a cost-cutting move but as an engineering preference. That's worth understanding in detail. I'll write a proper note on this after I pull the actual upsert timings from my own ETL jobs rather than guessing.

5. MoonBASIC — a modern BASIC for 2D and 3D games

github.com/CharmingBlaze/moonbasic, 28 HN points July 17. Not a big launch. 28 points is "interesting to a niche audience," which is exactly the kind of release I track for the indie game discovery site.

BASIC-inspired game development languages have a persistent niche: Pico-8, TIC-80, and GBStudio all prove there's a community for constrained creative tools with learnable syntax. MoonBASIC is targeting both 2D and 3D with what looks like clean syntax inspired by the classic BASIC era. The repo is very early stage — I won't add it to the directory yet since there are no games to catalog. But if it develops a community, my GitHub ETL will pick it up via star growth. The signal worth noting is that this niche keeps attracting new entrants; the demand for "programming language you can learn in a weekend to make a game" hasn't peaked.


Five things I noticed this week: Chinese AI surge, GitHub Agentic Workflows, Copilot CLI

MORINAGA — Sat, 18 Jul 2026 07:25:16 +0000

Another week of more happening than I can reasonably track. I run three AI-curated directory sites — Top AI Tools, Find Games Like, and Open Alternative To — and I keep a loose eye on the surrounding ecosystem because what ships this week tends to land in my ETL pipelines next month. Here are five things that caught my attention.

1. DeepSeek V4.1 Flash hit top trending on HuggingFace within a week of release

DeepSeek V4.1 Flash climbed to the top trending slot on HuggingFace faster than anything I've seen this quarter. The number that sticks with me: Chinese open-weight models now hold five of the top ten slots — the highest concentration on record according to the HuggingFace trending data I've been watching. I don't have a clean read yet on whether V4.1 Flash is meaningfully better than V3.2 for my specific use case (structured JSON generation from unclean source data), but the download velocity alone is a signal worth noting.

For my AI tools directory, this means another batch of model cards I need to ingest. I'm not adding models on hype alone — my ETL has a signal-gate that requires GitHub stars + HuggingFace likes above a threshold before a model gets a detail page. V4.1 Flash cleared it.

2. Six competitive Chinese frontier models shipped within two weeks

Qwen 3.7, DeepSeek V4.1, Hunyuan Large 3, ERNIE 5.1, Doubao Pro, and GLM-6 all arrived inside roughly a two-week window. I started calling this the "Chinese frontier convergence" this week because the cadence stopped looking like independent releases and started looking like a coordinated response to Claude Fable 5 and whatever OpenAI shipped at Build.

The practical effect on my OSS alternatives directory is that the "best open-weight alternatives to [closed model]" comparison pages are going stale faster than my weekly ETL refresh can keep up. I'm going to need to tighten the refresh cadence on those specific pages or accept that they'll be wrong for a few days each cycle.

3. Transformers 5.12.0 added MiniMax-M3-VL and Parakeet-RNNT

Hugging Face shipped Transformers v5.12.0 on June 12 with first-class support for MiniMax-M3-VL, PP-OCRv6, and Parakeet-RNNT. The one I'm watching most is Parakeet-RNNT, a streaming ASR model from NVIDIA. I've been looking for a free, locally runnable alternative to Whisper for my video pipeline, and the RNNT architecture has meaningfully lower latency for short-form audio.

I haven't benchmarked it yet. My current Whisper setup runs fine, and I'm not going to swap a working component for an untested one on a hunch. But I've got a test branch open and I'll run a side-by-side on my standard 90-second video script sample next week.

4. GitHub Agentic Workflows entered public preview

GitHub's Agentic Workflows feature entered public preview this week. The pitch is reasoning-based automation inside GitHub Actions — issue triage, documentation updates, and similar tasks that previously required a human to interpret context before acting. I watched the announcement and immediately thought about my content QC pipeline.

Right now I run a Claude Haiku step in CI that flags potential frontmatter issues and broken internal links before an article gets committed. That's not "agentic" in the GitHub sense — it's a dumb script that calls an API. Agentic Workflows would let me set up something that reads the issue, reads related open PRs, and decides whether a flagged article is actually broken or just pattern-matched incorrectly. I'm interested but not in a hurry. Public preview means the API surface will change.

5. GitHub Copilot CLI redesign went GA, and Claude Fable 5 is inside it

GitHub's redesigned Copilot CLI terminal interface — previewed at Microsoft Build 2026 — went generally available this week. The tabbed UX for issues, pull requests, and gists looks useful, though I'm skeptical I'll use it heavily given how much of my workflow is already automated.

The more interesting detail: Claude Fable 5 from Anthropic is now one of the available models in GitHub Copilot for Pro+ and above. That puts Fable 5 inside VS Code, JetBrains, and Xcode without requiring a separate Anthropic API key. I'm still on Claude Sonnet 4.6 for most pipeline tasks — it's cheaper for high-volume structured generation — but having Fable 5 accessible in the editor changes the calculus for the reasoning-heavy planning steps I currently do manually.

That's five observations. Three of them are going to affect my ETL pipelines before the end of the month. The GitHub Agentic Workflows one I'm treating as interesting-to-watch rather than immediately actionable. I'll come back to it when the API stabilizes.


Five things I noticed this week: Kimi K3, Bonsai 27B on-device, and the Gemini rebrand

MORINAGA — Sat, 18 Jul 2026 07:25:12 +0000

Another week where more shipped than I could realistically process. I run three AI-curated directory sites — Top AI Tools, Find Games Like, and Open Alternative To — so I keep a loose eye on open-weight releases and tooling shifts because what appears on HN this week tends to land in my ETL pipelines next month. Here are five things that caught my attention between July 14 and 17.

1. Kimi K3 hit 945 HN points as another open frontier model

Moonshot AI posted Kimi K3 on July 16 under the headline "Open Frontier Intelligence," and the HN thread immediately climbed past 900 points with 571 comments. The framing is deliberately aggressive — they're not positioning it as a cheap alternative but as a direct peer to frontier closed models.

What I noticed operationally: my HuggingFace ETL sorts by total likes and filters by pipeline tag, but there's no freshness weight. A model sitting at 800 likes from three weeks ago will outscore something that dropped this morning. Kimi K3 is the fourth model in three months where I noticed this stale-ranking problem. I need to add a recency decay factor — probably a half-life of around 14 days applied to the raw like count before sorting. I'll build that into the next ETL update; right now my AI tools directory could be showing K2.7 when K3 is the current version.
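A sketch of the recency decay, assuming an exponential half-life of exactly 14 days. The model ids and like counts are made-up sample data, and decayed_likes is a hypothetical helper name.

```python
from datetime import datetime, timezone

HALF_LIFE_DAYS = 14.0  # assumption: the "around 14 days" half-life from the text

def decayed_likes(likes, released_at, now):
    """Halve the effective like count for every HALF_LIFE_DAYS of model age."""
    age_days = max(0.0, (now - released_at).total_seconds() / 86400)
    return likes * 0.5 ** (age_days / HALF_LIFE_DAYS)

now = datetime(2026, 7, 18, tzinfo=timezone.utc)
models = [  # hypothetical sample rows
    {"id": "older-model", "likes": 800, "released_at": datetime(2026, 6, 27, tzinfo=timezone.utc)},
    {"id": "fresh-model", "likes": 350, "released_at": datetime(2026, 7, 17, tzinfo=timezone.utc)},
]
ranked = sorted(
    models,
    key=lambda m: decayed_likes(m["likes"], m["released_at"], now),
    reverse=True,
)
print([m["id"] for m in ranked])
```

Here the three-week-old model decays from 800 to roughly 283 effective likes, so the day-old release at 350 sorts ahead of it.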

2. Bonsai 27B claims phone-level inference

The July 14 HN thread on Bonsai 27B — a 27-billion-parameter model designed to run on consumer hardware including phones — got 315 points with skeptical but engaged comments. The claims involve aggressive quantization targeting Apple Silicon and Android Snapdragon chips.

The longer-term implication I keep thinking about: if 27B-class models actually achieve useful inference speeds on-device, the category "cloud API vs self-hosted" starts to fracture into three tiers: cloud, server-hosted, and on-device. My directory currently only captures the first two. I'm not adding an on-device filter yet — I don't have user queries confirming people search for it — but I noted the gap. I'll see whether the Bonsai 27B download numbers on HuggingFace actually back up the phone-inference claims in the next few weeks.

3. NotebookLM is now Gemini Notebook

Google rebranded NotebookLM to Gemini Notebook on July 16. The HN thread (195 points, 109 comments) was split between people asking whether anything functional changed and people lamenting the loss of a distinctive name under another Gemini umbrella.

For my AI tools directory this is a concrete data problem. I have "NotebookLM" as a canonical tool entry. The product's name is now different, the landing URL structure changed, and if I don't reconcile that I have a stale entry with a broken or redirected URL. When I checked, I found the same issue with two other tools that quietly rebranded this week. I'm going to add an alias_names field to my tool schema so I can track these transitions without creating duplicate entries. Right now I'd just update the name manually, but I've done that three times this month.
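One way the alias field could work, as a sketch: only alias_names comes from the plan above, while the slug and name keys and the build_name_index helper are assumptions for illustration.

```python
tools = [
    {
        "slug": "notebooklm",              # stable key, unchanged by the rebrand
        "name": "Gemini Notebook",         # current display name
        "alias_names": ["NotebookLM"],     # former names that should still resolve
    },
]

def build_name_index(tools):
    """Map every current and former name (case-insensitive) to one slug."""
    index = {}
    for tool in tools:
        for name in [tool["name"], *tool.get("alias_names", [])]:
            index[name.casefold()] = tool["slug"]
    return index

index = build_name_index(tools)
print(index)
```

Both the old and the new name resolve to the same entry, so an ETL row that still says "NotebookLM" updates the existing record instead of creating a duplicate.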

4. LM Studio added an agent layer called Bionic

LM Studio shipped "Bionic" on July 16 — their framing is "the AI agent for open models." The HN post got 79 points, mostly discussion comparing it to Jan and Ollama for running agentic workflows locally.

What caught my attention: six months ago, LM Studio was a model manager. It's now positioning as an agent platform. The tool category changed even though the name didn't. My OSS alternatives directory has LM Studio categorized under "model management," and that category label is now wrong. I've been hitting this drift problem more frequently — tools that start in one category and migrate. I don't have a good automated signal for category drift yet; I catch it by manually reading changelogs, which doesn't scale.

5. My auto-tuner caught something I need to manually verify

From my own pipeline this week: the YouTube analytics auto-tuner flagged that "product framing" videos are outperforming "build-in-public" videos at roughly 2:1 in first-15-second retention. I adjusted this week's scripting directive. Then I went back and looked at what the classifier put in each bucket.

Three of the "product framing" videos were miscategorized when I initially seeded the training set. So I don't actually know whether the retention signal is real or whether I accidentally built a classifier that's good at predicting its own mislabeled inputs. I've paused the directive update and scheduled a manual audit of the classification labels. I wrote more about the auto-tuner setup in "Three archetype signals the YouTube analytics auto-tuner surfaced after two weeks"; this week's catch is why that article ends with a note about needing human-labeled ground truth before trusting the output.

Five things, three of which are going to change something concrete in my ETL or directory schema before the end of the month. The auto-tuner one is the most uncertain — I'll have a cleaner picture once I've done the manual label audit.


Three archetype signals the YouTube analytics auto-tuner surfaced after two weeks

MORINAGA — Fri, 17 Jul 2026 07:50:48 +0000

The auto-tuner runs daily: scripts/yt-analytics/run.py reads the last 30 uploads from the YouTube Data API v3 videos.list endpoint, groups them by archetype, computes age-normalized views/day (excluding videos under 24 hours old), and rewrites docs/yt-today-directive.md. The generation routine must read that file first—the directive design is covered elsewhere; this is about what the live data actually said after two weeks.
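The grouping step can be sketched as follows. The dictionary keys (archetype, views, published_at) are assumed field names, not the script's actual schema, and the sample rows are invented.

```python
from collections import defaultdict
from datetime import datetime, timezone
from statistics import median

def median_views_per_day(videos, now):
    """Median age-normalized views/day per archetype, skipping videos under 24h old."""
    by_archetype = defaultdict(list)
    for video in videos:
        age_days = (now - video["published_at"]).total_seconds() / 86400
        if age_days < 1:
            continue  # too young for a stable views/day figure
        by_archetype[video["archetype"]].append(video["views"] / age_days)
    return {name: median(rates) for name, rates in by_archetype.items()}

now = datetime(2026, 7, 17, tzinfo=timezone.utc)
sample = [  # hypothetical rows
    {"archetype": "product_findindiegame", "views": 110, "published_at": datetime(2026, 7, 7, tzinfo=timezone.utc)},
    {"archetype": "product_findindiegame", "views": 80, "published_at": datetime(2026, 7, 13, tzinfo=timezone.utc)},
    {"archetype": "build_in_public", "views": 5, "published_at": datetime(2026, 7, 16, 12, tzinfo=timezone.utc)},
]
print(median_views_per_day(sample, now))
```

Dividing by age before taking the median keeps an old video with a large lifetime total from dominating a newer one that is accumulating views faster.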

Signal 1: the gap between archetypes is larger than I expected

Current median views/day by archetype:

product_findindiegame (game-vs-game comparisons): 11 views/day, n=19
product_ossfind (OSS alternative comparisons): 2 views/day, n=1 — too small to trust
build_in_public: 1 view/day, n=2

The 11 views/day median for product_findindiegame comes from a specific pattern in the underlying data, not from the archetype label alone. Named-vs-named game comparisons—where both the indie and the AAA are well-known proper nouns—drove individual Shorts to 162-373 views. Stripping the specific names from the same template (keeping the numeric hook, dropping the game names) collapsed a comparable Short to 77 views. A Short with no recognizable game names reached 5 views.

The original analytics classifier tracked archetype vs non-archetype performance. What the two-week data added is specificity about what inside the archetype drives performance. The auto-tuner now enforces a hard gate on the spec: the title must name two real, recognizable proper nouns before the spec passes the YouTube audit.

Signal 2: build_in_public didn't just underperform—it regressed

When I started the channel, posting what I shipped each week seemed natural. One early video hit 34 views, which looked like signal at the time.

The age-normalized median for that archetype is now 1 view/day. The 34-view video was a month-one outlier; the last two build_in_public Shorts averaged 8 views each with no growth tail in the first two weeks.

The directive hard-bans it. The specific concern isn't just that it underperforms—it's that when the 3-in-a-row guard fires on the winning archetype (see below), the generation routine needs a fallback. Without an explicit ban, it might fall back to build_in_public as a low-resistance option. To prevent that, the Python script maintains DEAD_ARCHETYPES = frozenset({"build_in_public", "meta", "curated", "technical"}) and excludes those explicitly from any fallback path in the directive logic. A fallback into a dead archetype would produce a spec that the analytics engine then counts against the winner's share of the queue—distorting next week's view distribution.
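A sketch of how the ban could apply to the fallback path. DEAD_ARCHETYPES is quoted from the script; pick_fallback and its arguments are hypothetical names for illustration.

```python
DEAD_ARCHETYPES = frozenset({"build_in_public", "meta", "curated", "technical"})

def pick_fallback(ranked_archetypes, blocked):
    """Return the best-ranked archetype that is neither blocked nor dead."""
    for archetype in ranked_archetypes:
        if archetype == blocked or archetype in DEAD_ARCHETYPES:
            continue
        return archetype
    return None  # no safe fallback: better to skip than to ship a dead archetype

ranking = ["product_findindiegame", "build_in_public", "product_ossfind"]
print(pick_fallback(ranking, blocked="product_findindiegame"))
```

Returning None instead of the nearest dead archetype makes the "no valid fallback" case explicit for whatever calls it.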

Signal 3: the 3-in-a-row guard matters even for the winning archetype

Today's directive switched the target to product_ossfind because product_findindiegame appeared in the last two uploads. The 3-in-a-row guard is a rule I hardcoded when I noticed two things colliding.

First, the Jaccard duplicate detection in the spec audit. When the same archetype runs three days in a row, the opening-line similarity score between consecutive specs rises above the 0.82 threshold and the audit starts blocking specs before they reach the TTS step. The guard fires at two-in-a-row specifically to force a rotation before the audit has to catch a near-duplicate. The audit is the last-resort gate; the guard prevents the situation from reaching it.

Second, the practical consequence of the guard: today's video will likely underperform a product_findindiegame spec on the same day. I'm choosing one potentially weaker video to avoid two worse outcomes—a near-duplicate upload that erodes the audience's sense of variety, or a pipeline stall from a failed audit.
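The guard itself is small. A sketch, assuming the upload history is a list of archetype names ordered oldest to newest; guard_blocks is a hypothetical name.

```python
def guard_blocks(recent_archetypes, candidate, run_length=2):
    """True if `candidate` already fills the last `run_length` uploads."""
    tail = recent_archetypes[-run_length:]
    return len(tail) == run_length and all(a == candidate for a in tail)

history = ["product_ossfind", "product_findindiegame", "product_findindiegame"]
print(guard_blocks(history, "product_findindiegame"))  # a third in a row is blocked
```

Checking only the last two uploads is what makes the rule fire before a third consecutive spec is generated, rather than after the audit rejects it.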

The tradeoff is real and I haven't resolved it cleanly. The right answer might be to build a rotation schedule that forces archetype diversity over a 5-day window rather than triggering off consecutive identical archetypes. That would allow product_findindiegame to appear on Monday and Thursday without triggering the guard, rather than being blocked after any two consecutive days regardless of the gap between them.
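One possible shape for that window rule, as a sketch only: the cap of two appearances per 5-day window is an assumed parameter, and window_allows is a hypothetical name. It assumes one upload per day, ordered oldest to newest.

```python
from collections import Counter

def window_allows(recent_archetypes, candidate, window=5, max_per_window=2):
    """Allow `candidate` if today's slot keeps it within its cap for the window."""
    # Today occupies one slot, so only the previous window - 1 uploads count.
    previous = recent_archetypes[-(window - 1):] if window > 1 else []
    return Counter(previous)[candidate] < max_per_window

# Monday through Wednesday; the candidate is Thursday's upload.
week = ["product_findindiegame", "product_ossfind", "product_ossfind"]
print(window_allows(week, "product_findindiegame"))
```

Under this rule the winning archetype can appear on Monday and Thursday, but a third appearance inside the same five days is refused.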

What I don't know yet

product_ossfind has one video. The 2 views/day figure is not actionable data—it's a prior that could easily flip with the second video. The auto-tuner requires n≥3 before an archetype enters the ranked comparison; currently product_ossfind is in an "early data" category and only gets the target slot today because the guard pushed off the clear winner.

The view counts are also raw, not CTR-weighted or subscriber-normalized. A Short that gets 200 views from algorithm distribution with 3% CTR is a different signal than 200 views from a well-trafficked playlist with 0.1% CTR—but I don't have CTR data accessible from the YouTube Data API at the current tier without going through YouTube Studio. Views/day is the proxy, and it's a noisy one for videos under a week old.


What I learned adding Jaccard duplicate detection to a YouTube Shorts spec audit

MORINAGA — Fri, 17 Jul 2026 07:50:43 +0000

Before the spec audit, my CI pipeline could upload a perfectly formatted video—correct word count, all JSON fields present, proper hook variants—that was structurally identical to something uploaded three days ago. After adding Jaccard similarity checks against the last 30 uploaded specs, those near-duplicates fail in under 50ms before any TTS synthesis or ffmpeg step runs. The threshold calibration matters more than the algorithm: 0.76 for titles and 0.82 for opening 28-word windows, derived from manual inspection of uploaded specs rather than theory.

The CI pipeline that generates YouTube Shorts ran for several weeks before I noticed a pattern in the upload history: the titles varied, the game matchups differed, but the opening lines were converging. The AI generating scripts had found a template that worked—open with a hard number, frame an underdog—and was reusing the same sentence skeleton almost verbatim. Not the same video, but close enough that a viewer who'd watched the last five would notice.

Format validation doesn't catch this. You can verify that a spec has a title, a script between 55 and 200 words, at least three hook_variants, and data_panels with HTTPS source URLs—and still upload something that feels like a rerun. The content quality gate I wrote for articles has the same limitation: it catches broken frontmatter and missing required fields, but it can't tell whether you've written essentially the same piece twice with different nouns.

Jaccard similarity is the simplest thing that addresses this. It's used in plagiarism detection and recommendation systems; it's three lines of code with no dependencies; and it's directly interpretable when it fires. The formal definition is intersection over union of two sets. Here's how I applied it to video spec validation.

What Jaccard measures, and what it won't catch

Jaccard similarity between two texts: tokens in both / tokens in either. Tokens are lowercased words longer than two characters, punctuation stripped. The algorithm is order-blind: it scores "Risk of Rain 2 has more reviews than Hollow Knight" against "Hollow Knight has fewer reviews than Risk of Rain 2" at 7/9 (about 0.78), because after tokenizing only "more" and "fewer" differ, even though the word order is reversed. That's a weakness if you're trying to catch intentional paraphrase. It's an acceptable tradeoff here because I'm not trying to catch paraphrase. I'm trying to catch the case where the generation routine reused the same sentence structure and vocabulary because a particular template happened to score well in the analytics classifier.

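// Lowercase, replace punctuation with spaces, keep tokens longer than two characters.
function tokenSet(text) {
  return new Set(
    String(text).toLowerCase()
      .replace(/[^a-z0-9 ]+/g, " ")
      .split(/\s+/)
      .filter((t) => t.length > 2)
  );
}

// Jaccard similarity: shared tokens divided by distinct tokens across both texts.
function jaccard(a, b) {
  const aa = tokenSet(a), bb = tokenSet(b);
  const overlap = [...aa].filter((token) => bb.has(token)).length;
  // Math.max(1, ...) avoids dividing by zero when both token sets are empty.
  return overlap / Math.max(1, new Set([...aa, ...bb]).size);
}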

I compute this twice per new spec—once on the full title, once on the first 28 words of the script—and compare against every spec in the last 30 uploads. Title check fires at 0.76; opening check fires at 0.82.
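The comparison loop can be sketched like this. It is a Python port of the JavaScript tokenizer above, not the audit's actual code: the title and script field names follow the spec description, "fires at" is assumed to mean greater than or equal, and find_duplicates is a hypothetical name.

```python
import re

TITLE_THRESHOLD = 0.76
OPENING_THRESHOLD = 0.82
OPENING_WORDS = 28

def token_set(text):
    """Lowercased alphanumeric tokens longer than two characters."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", str(text).lower())
    return {t for t in cleaned.split() if len(t) > 2}

def jaccard(a, b):
    aa, bb = token_set(a), token_set(b)
    return len(aa & bb) / max(1, len(aa | bb))

def opening(script):
    """First OPENING_WORDS whitespace-separated words of a script."""
    return " ".join(script.split()[:OPENING_WORDS])

def find_duplicates(new_spec, recent_specs):
    """Compare a new spec against the last 30 uploads (list ordered oldest first)."""
    errors = []
    for old in recent_specs[-30:]:
        if jaccard(new_spec["title"], old["title"]) >= TITLE_THRESHOLD:
            errors.append(f"title too close to: {old['title']}")
        if jaccard(opening(new_spec["script"]), opening(old["script"])) >= OPENING_THRESHOLD:
            errors.append(f"opening too close to: {old['title']}")
    return errors
```

An empty list means the spec can proceed to TTS; any entry is a reason to stop before the expensive steps.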

Why separate thresholds for titles vs openings

The right threshold depends on how much vocabulary overlap is natural given the content. Titles in this pipeline almost always name two specific games or software products. Two titles covering different matchups share almost no vocabulary. A threshold of 0.76 on the title means roughly three-quarters of the distinct tokens across the two titles appear in both. That only happens when two specs are targeting the same product comparison with nearly identical phrasing.

Opening lines are harder. The pipeline's best-performing hook pattern leads with a hard number in the first three words: "Risk of Rain 2 has 693,000 Steam reviews while Hollow Knight sits at 400,000." A different spec might open with "Balatro has 380,000 Steam reviews against Final Fantasy XVI's 12,000." These share nearly nothing despite following the same template, because the proper nouns dominate the vocabulary.

The problem arises when the generation routine gets into a rut after several consecutive product_findindiegame specs. I've seen it output "X has N reviews, making it the clear winner" across three consecutive specs with only the game names swapped. The Jaccard score for that pattern reached 0.84 on the opening window—above the 0.82 threshold. At 0.70, genuine numeric hooks on different matchups would have occasionally collided. At 0.90, rut cases would slip through.

Choosing a 28-word window for the opening comparison came from looking at where structurally similar openings diverged from each other. At 20 words, the window was too short to catch real convergence; two specs that opened with different games but the same template still scored below 0.70. At 40 words, unrelated specs that both transitioned into a "here's why this matters" framing started scoring too high due to shared transition vocabulary. Twenty-eight words—roughly the first two sentences of a 55-word Short script—was the inflection point for this corpus.

The claims provenance map: tying every assertion to a source

Jaccard catches structural similarity. The claims[] field catches something different: assertions that are factually grounded but not traceable from the spec itself.

Every item in claims must link a specific factual assertion to the HTTPS URL where I confirmed it, and that URL must also appear in data_panels. The cross-reference constraint prevents a pattern I saw in early generated specs: a claims[] entry citing a Steam page, but data_panels containing a different URL that was the actual source the generation routine used. If those diverge, either the claim was generated from a source not recorded in the panels, or the panels list sources that don't back the specific claims.

"claims": [
  {
    "claim": "Risk of Rain 2 has 693,000 Steam reviews as of 2026-07-10",
    "source": "https://store.steampowered.com/app/632360/Risk_of_Rain_2/"
  }
],
"data_panels": [
  {
    "label": "Steam review counts",
    "url": "https://store.steampowered.com/app/632360/Risk_of_Rain_2/"
  }
]

The audit enforces this: every claim.source must appear in the flat list of URLs extracted from data_panels. A spec that cites source A in claims but lists source B in panels fails the audit regardless of whether either URL is actually correct.

Combined with verified_at (a YYYY-MM-DD date field required on every spec), this creates a paper trail: each claim has a source, each spec records the date I confirmed its sources, and the audit verifies the linkage.
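
The cross-reference rule can be sketched in a few lines of Python. The field names (claims, source, data_panels, url, verified_at) are the ones shown above; the function names and error messages are hypothetical, and the real audit may check more than this.

```python
import re
from datetime import datetime


def panel_urls(spec):
    """Flat set of URLs extracted from data_panels."""
    return {p["url"] for p in spec.get("data_panels", []) if "url" in p}


def audit_provenance(spec):
    """Return a list of error strings; an empty list means the spec passes."""
    errors = []
    urls = panel_urls(spec)
    for i, claim in enumerate(spec.get("claims", [])):
        source = claim.get("source", "")
        if not source.startswith("https://"):
            errors.append(f"claims[{i}]: source is not an HTTPS URL")
        elif source not in urls:
            errors.append(f"claims[{i}]: source not listed in data_panels")

    verified = spec.get("verified_at")
    if not (isinstance(verified, str)
            and re.fullmatch(r"\d{4}-\d{2}-\d{2}", verified)):
        errors.append("verified_at: expected YYYY-MM-DD")
    else:
        try:
            datetime.strptime(verified, "%Y-%m-%d")
        except ValueError:
            errors.append("verified_at: not a real calendar date")
    return errors
```

The check only compares strings. It says nothing about whether a URL actually supports the claim, which is the same limitation the audit itself has.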

Quarantine instead of halting the queue

The pipeline health monitor watches the queue at the workflow level—it fires if nothing has shipped in 36 hours. But a spec that fails quality validation is a narrower problem: one bad spec shouldn't stall the day's queue.

The audit runs as the first step in the CI job, before any compute-intensive work. If it fails, the spec moves to content/yt-queue/rejected/ with the error list written to a companion .txt file, the CI job exits non-zero, and the next spec in the queue is unaffected. The quarantine directory also serves as a record: I've retrieved and manually fixed rejected specs twice, and I've occasionally discovered that a threshold miscalibration was producing false positives before any specs were lost.

This matters more for the two-host longform pipeline than for Shorts. A longform spec takes 8-15 minutes to synthesize and render. A bad spec that passes format validation but fails midway—at the thumbnail generation step, for example—wastes the entire CI run. The three-tier thumbnail fallback pipeline was partly built in response to a spec that failed at thumbnail generation after TTS had already completed; the quarantine model would have caught it before any of that compute ran.

With the audit as a fast first gate, failure is cheap: exit non-zero in under 100ms, commit the rejected spec with its error list, continue with the next item. The health monitor's 36-hour threshold is large enough to absorb several consecutive audit failures without alerting.
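
A minimal sketch of the quarantine step, assuming the audit has already produced a list of error strings. The rejected directory path and the companion .txt file follow the descriptions in this post; the function names are hypothetical, and the commit-and-push step is left out.

```python
import shutil
from pathlib import Path


def quarantine(spec_path, errors, rejected_dir):
    """Move a failed spec into rejected_dir and write its errors beside it."""
    spec_path = Path(spec_path)
    rejected_dir = Path(rejected_dir)
    rejected_dir.mkdir(parents=True, exist_ok=True)

    target = rejected_dir / spec_path.name
    shutil.move(str(spec_path), str(target))

    # Companion file: same base name, .txt suffix, one error per line.
    report = target.with_suffix(".txt")
    report.write_text("\n".join(errors) + "\n", encoding="utf-8")
    return target


def gate(spec_path, errors, rejected_dir="content/yt-queue/rejected"):
    """Return a process exit code: 0 to continue, 1 after quarantining."""
    if not errors:
        return 0
    quarantine(spec_path, errors, rejected_dir)
    return 1
```

Because the failed spec leaves the queue directory, the next run picks up the next spec without any special handling.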

How the archetype directive and the spec audit interact

The analytics-driven archetype directive is the upstream control. If the directive correctly identifies which archetype to produce and the generation routine follows it with sufficient variation in game matchups, the Jaccard audit should rarely fire. Different matchups produce different vocabulary; Jaccard scores for unrelated specs typically stay below 0.40.

The audit becomes necessary when the directive guidance is followed but the generation routine still converges. That can happen when the same archetype runs for several days and the language model finds a phrase pattern that consistently passes the hook quality checks—not because it's copying verbatim, but because it's exploiting the same structural template.

The thumbnail rules derived from 19 videos of analytics have an analogous property: they enforce visual differentiation at the image layer while the spec audit enforces differentiation at the text layer. Neither catches the other's failure mode—a visually distinct thumbnail can accompany a near-duplicate script, and vice versa.

The quality gates article covers why I think fail-closed gates matter at every layer; this post is about one specific gate and how it's calibrated.

Limits I haven't solved

Semantic drift. If the generation routine rotates vocabulary deliberately—"critically acclaimed" instead of "award-winning", "dominating" instead of "winning"—Jaccard scores will be low even if the underlying framing is identical. I don't have a good solution here at the scale of 30 videos. TF-IDF cosine similarity would be more sensitive to rare vocabulary, but the corpus is too small for meaningful IDF weights.

The lookback window ages out. If I publish 30 videos on the same archetype category—roughly seven weeks at current cadence—the oldest specs age out of the window, and a rephrase of the earliest ones could theoretically slip through. I'd need either a longer window or a topic-slug–based deduplication layer that persists beyond the rolling 30.

Threshold drift. Both thresholds are calibrated to the current script style. If I significantly change the hook pattern or switch to a different generation model, the natural base similarity of both titles and openings will shift and the thresholds will need recalibration. I haven't built any automated way to flag threshold drift; it requires noticing the audit's false positive rate climbing.

FAQ

Why Jaccard and not cosine similarity or BLEU?
Jaccard is three lines of code with zero dependencies. Cosine similarity over TF-IDF vectors would be more sensitive to vocabulary frequency, but TF-IDF needs a meaningful corpus for IDF weighting, and 30 specs is too small. BLEU scores a candidate against reference texts, and there are no references here. For this scale, Jaccard is accurate enough and trivially debuggable: when the audit fires, I can compute the overlap by hand.

Does the lookback include longform specs?
Yes, but the similarity checks only compare a new spec against previous specs of the same type. A Short spec is checked against the 30 most recent Short specs; a longform is checked against longform. The isLongform detection is based on whether the spec has a segments array rather than a single script field.

What's actually in the lookback directory?
content/yt-queue/uploaded/ contains one JSON file per uploaded spec, named by upload date. The audit reads all .json files in that directory, sorts by filename, and takes the last 30. It does not read from the YouTube API—just from the committed spec files.
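
A sketch of how that lookback could be loaded, in Python. The directory layout, the filename sort, the limit of 30, and the segments-based longform test are from the answers above. Filtering by spec type before taking the last 30 is an assumption, as are the function names.

```python
import json
from pathlib import Path

LOOKBACK = 30


def is_longform(spec):
    """A spec with a segments array is longform; otherwise it is a Short."""
    return isinstance(spec.get("segments"), list)


def load_lookback(uploaded_dir, new_spec, limit=LOOKBACK):
    """Most recent uploaded specs of the same type as new_spec."""
    specs = []
    # Files are named by upload date, so a filename sort is chronological.
    for path in sorted(Path(uploaded_dir).glob("*.json")):
        specs.append(json.loads(path.read_text(encoding="utf-8")))

    same_type = [s for s in specs if is_longform(s) == is_longform(new_spec)]
    return same_type[-limit:]
```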

How long does the audit take?
On the GitHub Actions runner, about 40ms for a queue with 19 uploaded specs. The hot path is the loop over uploaded specs (O(n) Jaccard computations per new spec, with n capped at 30 by the lookback), and both n and the token set sizes are small enough that this doesn't register in the workflow's timing.

What happens to a rejected spec?
It moves to content/yt-queue/rejected/ with the full error list written to a companion .txt file. The rejected directory is committed and pushed. The generation routine doesn't automatically re-queue rejected specs; I handle those manually or let the next daily run produce a fresh spec for the same topic.

Part of an ongoing 6-month experiment running three AI-curated directory sites. The technical claims here are real; this article was AI-assisted.

Four YouTube Shorts thumbnail rules I hardcoded after 19 videos of analytics

MORINAGA — Thu, 16 Jul 2026 07:52:56 +0000

When I changed my YouTube Shorts titles to name both games in the comparison, the view counts moved. Named-vs-named drove 162–373 views per video; unnamed variants died at 5. That data was enough to start making corresponding changes to the thumbnail design — the visual and the title need to work together or neither works well.

Here are the four rules I've now hardcoded in the thumbnail generation pipeline after running 19 game-comparison Shorts.

Rule 1: Lead with the big number, not the game title

The original thumbnail layout put the game names at the top — icons, titles, branding — and the statistical result at the bottom: "Has 5× more Steam reviews." This mimics review-site hierarchy: identify the subject first, then reveal the verdict.

For a 3-second impression on a Shorts feed, this ordering is backwards. The viewer doesn't know to care about the game until they see the result. They need a reason to stop scrolling before they'll register who the comparison involves.

PR #37 reversed the layout: the comparison number or result goes at the top in large type. Game names and icons move to the lower third. The implicit reading order is now "RESULT → explanation" rather than "subject → result."

I don't have A/B split data on this specific change yet — it shipped this week. But the title-naming data consistently showed specificity is the click trigger. The number is the most specific element on the thumbnail. It should be the first thing the viewer reads.

Rule 2: Wire the cover_image explicitly or the render breaks

The thumbnail generator takes a cover_image field from the YouTube script JSON. When I added the Hades II Short to the queue, I initially omitted the field assuming the generator would fall back to a sensible default.

It did not. It defaulted to a black frame. The ffmpeg overlay then rendered the text and game icons correctly onto a black background. The thumbnail was technically valid and looked completely wrong — indistinguishable from a placeholder.

PR #36 added two fixes: an SOP for how the background image is selected and sourced (platform-appropriate art, no UI chrome, no text overlap zones), and explicit wiring of cover_image in the Hades II queue entry. The SOP was drafted in a code-review session and committed directly to the video generation runbook.

The rule: cover_image is not optional. If you're generating thumbnails programmatically, put a gate on the field before the thumbnail step runs. A missing image will silently produce a broken output, not an error.
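
A minimal sketch of such a gate, assuming the script JSON has been parsed into a dict and that cover_image holds a file path. The exception class, the function name, and the base_dir parameter are hypothetical.

```python
from pathlib import Path


class MissingCoverImage(ValueError):
    """Raised when a spec cannot supply a usable cover image."""


def require_cover_image(spec, base_dir="."):
    """Fail loudly before the thumbnail step instead of rendering on black."""
    value = spec.get("cover_image")
    if not isinstance(value, str) or not value.strip():
        raise MissingCoverImage("cover_image is missing or empty")

    path = Path(base_dir) / value
    if not path.is_file():
        raise MissingCoverImage(f"cover_image not found: {path}")
    return path
```

Calling this before the ffmpeg overlay turns the silent black-frame case into an error that stops the run.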

Rule 3: Export at 16:9, not 9:16, for the preview card

YouTube Shorts play vertically at 9:16, so the obvious choice is to generate vertical thumbnails. The problem is where discovery actually happens.

Thumbnail preview cards in YouTube search results, the main feed, and embed contexts display horizontally. A 9:16 thumbnail that YouTube crops to a 16:9 preview card loses the top and bottom — which is exactly where I'd placed the game title and the result number. The center crop shows only the background art.
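
The loss is easy to quantify. The sketch below assumes a 1080×1920 vertical source and a full-width center crop to a 16:9 card; both are illustrative assumptions, and the function names are hypothetical.

```python
def visible_fraction(src_w, src_h, card_w=16, card_h=9):
    """Fraction of the source height kept by a full-width center crop."""
    kept_h = src_w * card_h / card_w
    return min(kept_h / src_h, 1.0)


def cropped_band(src_w, src_h, card_w=16, card_h=9):
    """Top and bottom pixel rows of the band that survives the crop."""
    kept_h = min(src_w * card_h / card_w, src_h)
    top = (src_h - kept_h) / 2
    return top, top + kept_h


# 1080x1920 source: 1080 * 9 / 16 = 607.5 rows survive out of 1920.
print(round(visible_fraction(1080, 1920), 3))  # 0.316
print(cropped_band(1080, 1920))                # (656.25, 1263.75)
```

Under those assumptions, less than a third of the vertical image survives, and anything placed in the top or bottom 656 rows is cut. A 1280×720 source passes through the same crop untouched.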

My current export is 1280×720 (16:9). YouTube scales it for the Shorts player by adding sidebars. The data in the preview card — where the viewer decides whether to click — is preserved in full.

This rule depends on where your Shorts discovery is actually happening. If most of your traffic comes from within the Shorts vertical feed (the swipe-up interface), 9:16 makes sense and my rule doesn't apply. At my follower count, feed cards are the dominant discovery surface, so the horizontal export wins.

Rule 4: Hardcode the layout constants, don't prompt for them

The pre-PR thumbnail generator had a prompt that described the desired layout: "put the number at the top in large type, game names below." The results were roughly correct but inconsistent across runs — font sizes drifted, number positions shifted, text occasionally clipped by an icon.

Prompting treats the layout as a creative decision the model makes fresh each time. For automation, that's the wrong model. I want identical layout across every thumbnail in a series, with only the game names and numbers changing.

After PR #37, the constraints are constants in the Python generation script, not prose in a prompt:

NUMBER_FONT_SIZE = 96
NUMBER_POSITION = (CENTER_X, TOP_MARGIN)
GAME_NAME_FONT_SIZE = 36
GAME_NAME_POSITION = (CENTER_X, LOWER_THIRD)

These aren't configurable parameters. They're invariants. The model doesn't pick them. The only inputs the generator accepts are game names, the comparison number, and the cover_image path.
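
To make the idea concrete, here is a self-contained sketch of a generator whose only inputs are the three named above. The 1280×720 canvas is the export size from Rule 3; the values of TOP_MARGIN and LOWER_THIRD and the layout_plan function are assumptions for illustration, and the actual drawing step is omitted.

```python
CANVAS_W, CANVAS_H = 1280, 720        # export size from Rule 3
CENTER_X = CANVAS_W // 2
TOP_MARGIN = 60                       # assumed value, not from the article
LOWER_THIRD = CANVAS_H * 2 // 3       # assumed value, not from the article

NUMBER_FONT_SIZE = 96
NUMBER_POSITION = (CENTER_X, TOP_MARGIN)
GAME_NAME_FONT_SIZE = 36
GAME_NAME_POSITION = (CENTER_X, LOWER_THIRD)


def layout_plan(game_names, number, cover_image):
    """Return draw instructions; layout comes from constants, never inputs."""
    if len(game_names) != 2:
        raise ValueError("expected exactly two game names")
    return [
        {"kind": "background", "path": cover_image},
        {"kind": "text", "text": number,
         "size": NUMBER_FONT_SIZE, "xy": NUMBER_POSITION},
        {"kind": "text", "text": " vs ".join(game_names),
         "size": GAME_NAME_FONT_SIZE, "xy": GAME_NAME_POSITION},
    ]
```

Nothing in the function signature lets a caller, or a model, move the number or resize the type.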

The same logic applied earlier on the directive side: once a decision is settled, move it from a prose instruction into code. A prose instruction can be skimmed past or interpreted loosely. A constant cannot.

Part of an ongoing 6-month experiment running three AI-curated directory sites. The technical claims here are real; this article was AI-assisted.

Why I'm betting quality gates protect my AdSense approval better than content volume

MORINAGA — Thu, 16 Jul 2026 07:52:51 +0000

Four AdSense rejections across three sites will change your intuition about content strategy.

The standard advice for programmatic sites is volume: more pages means more potential entry points, more crawl events, faster indexing feedback. I followed that advice. The rejections tracked the page count up. The more content I had, the more there was for a reviewer to find fault with — and they found fault every time.

This article is my argument for the opposite bet: that fail-closed quality gates protect AdSense approval better than content velocity does. It's falsifiable, it has a deadline, and I'll name the conditions that would change my mind.

What "fail-closed" actually means in this codebase

A quality gate that produces a warning but still publishes is not a quality gate. It's a suggestion I'm free to ignore when I'm running a batch job at 7am.

Every article in this project goes through scripts/audit-articles.mjs before it reaches Dev.to, Hashnode, or Bluesky. The script checks:

Required frontmatter keys (title, description, tags, publish_to, and the quality_contract v2 keys)
Tags against an approved pool — anything outside it fails; "seo" fails by name
Cliché phrases (13 patterns: common AI-writing tells and marketing-speak that no editor would let through)
Body word count minimum
Fabricated metric patterns (regexes that catch suspiciously round numbers attached to words like "visits" or "users")

If any check fails, the script exits 1. The publish step in the workflow runs only if the audit exits 0. The file stays in the repo as a staged change but does not distribute.
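
The real script is scripts/audit-articles.mjs; the Python sketch below only illustrates the fail-closed shape. The tag pool, the two cliché patterns (stand-ins for the 13), the word minimum, and the metric regex are all assumed values, and the quality_contract v2 keys are left out because they aren't listed here.

```python
import re
import sys

REQUIRED_KEYS = ("title", "description", "tags", "publish_to")
APPROVED_TAGS = {"python", "automation", "githubactions", "webdev"}  # assumed
CLICHES = (r"\bgame[- ]changer\b", r"\bin today's fast-paced world\b")
MIN_WORDS = 600  # assumed minimum
FABRICATED_METRIC = re.compile(
    r"\b(?:\d{1,3}(?:,000)+|\d+000|\d+[km])\+?\s+"
    r"(?:visits|visitors|users|subscribers)\b",
    re.IGNORECASE,
)


def audit_article(frontmatter, body):
    """Return every rule violation found; an empty list means pass."""
    errors = []
    for key in REQUIRED_KEYS:
        if key not in frontmatter:
            errors.append(f"missing frontmatter key: {key}")
    for tag in frontmatter.get("tags", []):
        if tag not in APPROVED_TAGS:
            errors.append(f"tag not in approved pool: {tag}")
    for pattern in CLICHES:
        if re.search(pattern, body, re.IGNORECASE):
            errors.append(f"cliche phrase matched: {pattern}")
    if len(body.split()) < MIN_WORDS:
        errors.append(f"body under {MIN_WORDS} words")
    if FABRICATED_METRIC.search(body):
        errors.append("suspicious round metric")
    return errors


def main(frontmatter, body):
    """Exit code for the workflow: 1 on any failure, 0 otherwise."""
    errors = audit_article(frontmatter, body)
    for error in errors:
        print(error, file=sys.stderr)
    return 1 if errors else 0
```

The important part is the last line: there is no warning level. Any error produces a non-zero exit, and the publish step never runs.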

The directory pages have a parallel gate. Pages that don't pass the three-tier quality ladder stay noindexed rather than removed — they exist but aren't surfaced to search engines until the content meets the minimum threshold.

The gate rarely fires. Every one of the 107 articles published so far has cleared it. It isn't a band-aid for bad drafting; it's insurance against the tail case where an automated routine produces something that violates a rule I've set. One slipped cliché or one accidentally round visit-count won't make it through.

The volume counterargument, stated honestly

The case for volume-first is real. A site with 1,000 indexed pages has more surface area than a site with 100. Crawl frequency correlates with update frequency. Freshness signals exist.

There's also a subtler version: sentence uniqueness matters more than page count for AdSense review, and volume correlates with uniqueness if you're generating varied content. The argument isn't "more pages = approval" — it's "more diverse content = more signal that this isn't a thin-content farm."

I accepted this counterargument enough to continue generating content. I have 107 articles and three directory sites with multi-thousand-page indexed content. I haven't stopped publishing. The question is what I prioritize at the margin: does the next hour go into another article or into tightening the gate?

The empirical problem with volume-first is that it didn't work for me. Three separate applications, each submitted when the sites had more content than the previous one. The rejection reasons didn't cite "not enough pages." They cited low-value content signals: thin individual pages, template-similar entries, missing editorial context.

That's a quality problem, not a quantity problem.

What the rejection history actually told me

The four AdSense rejections clustered around three categories:

Category 1: Template-similar pages at scale. When the AI generates a hundred "Game X is similar to Game Y" pages using the same sentence structure, they're technically unique by character count but visually and semantically uniform. AdSense's automated review reads this as scaled content abuse, a category Google explicitly names in its spam policies and one that has grown stricter since 2024.

Category 2: Missing editorial voice. The early directory pages had data (game price, Steam review count, similar-game links) but no reason why someone would prefer one option over another. They were information delivery without judgment. A page that only restates facts from a source doesn't add value in AdSense's model.

Category 3: No visible author or purpose. The EEAT transparency pages I eventually built — /about, /methodology, authorship declaration on articles — were absent in the first application cycles. The site looked like a content farm because there was no evidence of a person running it.

None of these failures would have been fixed by adding more pages. Adding more template-similar pages would have made category 1 worse. The fix for each was to improve the quality of what existed, not add to it.

Current gate coverage and what it doesn't catch

For articles, the audit script enforces six constraint classes. For directory pages, I have four separate QC scripts that run before pages enter the live index. For Bluesky posts, the four-gate QC filter rejects posts before they queue.

The pattern that emerged across all three publishing channels is the same: an automated pipeline without a gate will eventually produce something that violates a constraint I care about. A warning that doesn't block is not enforcement.

Here's the honest comparison between the two strategies at the page level:

Dimension	Volume-first	Gates-first
Indexed page count growth	Fast	Slow
Per-page quality floor	Variable	Enforced minimum
AdSense reviewer signal	More entries to find problems in	Fewer entries, each harder to reject
Freshness / crawl signal	Stronger	Weaker
Operator time cost	ETL pipeline time	Gate logic maintenance
AdSense application risk	Higher variance per page	Lower variance per page

I'm betting the right column wins at AdSense review time. The counter to this table is that "lower variance" doesn't mean the quality floor is high enough — it means it's consistently mediocre instead of inconsistently bad. This is the failure mode the gate doesn't catch, and it's the strongest version of the argument against this bet.

The gate tells me nothing is actively bad. It doesn't tell me anything is good. "No prohibited tags and no clichés" is a necessary condition for AdSense approval, not a sufficient one. If the floor is still below the bar, the gate has been clearing content that's uniformly mediocre.

Timeline and what would change my mind

The bet resolves at month 9 of this experiment, which is January 2027. By then, I expect at least one of the three sites to have cleared AdSense review. If none have, one of these conditions is probably responsible:

Condition A: The gate is solving the wrong problem. If rejections continue citing category 1 (template-similar pages) even after I've improved the article gate, the issue is at the directory page level, not the article level. The article gate is working; the page gate needs to be stricter.

Condition B: Volume is the actual constraint. If the rejection cites "insufficient content" for the first time, I've been wrong about what the reviewer is evaluating. I'll need to run the volume experiment I've been avoiding: generate aggressively for 60 days and reapply.

Condition C: EEAT is the constraint, not content quality. Author trust signals — inbound links, author profiles across the web, editorial recognition — might matter more than content quality beyond a floor. If this is the failure mode, the gate is necessary but not sufficient.

I'll write a month-9 update when I have a result. I did the same with the AI overviews vs directories bet from May — that one resolves at the same time horizon.

The affiliate revenue path is already running as a hedge. Affiliate beats AdSense for new directories at low traffic anyway — the gate bet doesn't hinge on AdSense approval for immediate monetization.

FAQ

Do the gates apply to every article, or only new ones?
Only new ones. The audit script runs as a pre-publish check in the CI workflow. Existing published articles aren't retroactively re-checked on each run. I ran a bulk audit once to baseline the existing set, but ongoing enforcement is only at the publish step.

What's the false positive rate on the fabricated metric regex?
Not zero. The pattern catches any round large number followed by words like "visitors" or "subscribers" — so a sentence citing a verified subscriber count would also trip it. In practice, I avoid citing any specific metric I haven't directly measured, so the gate's strictness is aligned with the content policy rather than creating real friction.

If all three sites get rejected at month 9, what's the fallback?
The current plan is affiliate monetization indefinitely, with a sale at month 12 based on traffic and content asset value rather than AdSense approval. The sale doesn't depend on AdSense being live.

Can you fail the gate and still publish manually?
Yes, by running the publish script directly outside CI. The gate is a CI guardrail, not a cryptographic lock. But the automated routine doesn't have a bypass — any failure stops the routine and requires a manual decision. For an automated publishing system, "requires a manual decision" is effectively a block.

What would prove the bet was right, not just lucky?
If the first site to clear AdSense review is the one with the strictest gate (no clichés, no fabricated metrics, no prohibited tags across all articles), rather than the one with the most pages, that's evidence for gates over volume. If the highest-page-count site clears first, I was wrong.

Part of an ongoing 6-month experiment running three AI-curated directory sites. The technical claims here are real; this article was AI-assisted.
