Openverse CC0 images in a monetized YouTube pipeline: four things to check first


I added an image slide type to the YouTube slide renderer for cases where a real photo contextualizes a slide better than a Mermaid diagram. The source is Openverse, the CC-licensed media search maintained by WordPress.org. Before it worked reliably in CI, I hit four issues that are not obvious from reading the Openverse API docs.

1. CC0 and PDM only — not CC-BY

Openverse returns results across multiple Creative Commons license types. The default search includes CC-BY, CC-BY-SA, CC-BY-NC, and others alongside CC0 and PDM. For a monetized YouTube channel, the practical split is:

CC0 and PDM: no attribution required. You can use these in commercial content without a credit line.
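CC-BY and variants: attribution required. On a slide or thumbnail, that means a credit such as "Photo by X via Y" that is legible on the frame itself. The NC variants also prohibit commercial use, which likely rules them out for a monetized channel regardless of attribution.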

For slides in a fast-paced video, on-screen attribution text has to be legible but unobtrusive. CC-BY attribution on a dark-background slide is doable but adds layout complexity: the credit needs to clear the host overlay, stay below the heading, and still be readable at 1080p. More importantly, if a single CC-BY image in a batch-generated video ships without its credit, you are using that image in breach of its license.

The filter I use is license=cc0,pdm — nothing else. Even though CC0 technically does not require attribution, I still add a courtesy credit caption on every image slide:

cap = f"{custom} · {attribution}" if custom else attribution  # always keep credit

The attribution string is formatted like "Title" · Creator · CC0 · Openverse. It appears at the bottom of the slide in muted text. This is not legally required for CC0, but it is still good practice: it credits the creator, makes the image traceable to its source, and visible credit is part of what encourages creators to keep contributing to catalogs like Openverse.
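A minimal sketch of how such a credit string could be assembled. The function names and the title/creator/license inputs are assumptions for illustration, not the renderer's actual code; only the output shape and the caption rule come from the text above.

```python
def format_attribution(title, creator, license_code="cc0", source="Openverse"):
    """Build a credit like: "Title" · Creator · CC0 · Openverse.

    Hypothetical helper: missing title or creator is skipped rather than
    rendered as an empty segment.
    """
    parts = [
        f'"{title}"' if title else None,
        creator or None,
        license_code.upper(),
        source,
    ]
    return " · ".join(p for p in parts if p)


def build_caption(custom, attribution):
    """Prefix the optional custom caption; the credit is always kept."""
    return f"{custom} · {attribution}" if custom else attribution


credit = format_attribution("Server room", "Jane Doe")
caption = build_caption("Grafana + Prometheus in production", credit)
```

Dropping empty segments avoids a dangling separator when a catalog entry has no recorded title or creator.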

2. The 200px minimum filter matters

The Openverse API can return results that include small thumbnails, low-resolution copies, and occasionally images that fail to load entirely. Without filtering, you get unusable frames in your videos.

im = Image.open(io.BytesIO(raw)).convert("RGB")
if min(im.size) < 200:   # skip tiny/thumbnail junk
    continue

The check is against the smaller dimension. An image that is 1800x180 pixels fails it: at 180px tall it is too short to fill a 1920x1080 slide usefully. 200px is a loose floor; in practice most results that pass it are at least 400x400. If the query term is obscure (narrow technical topics often return few results), you may exhaust all 12 results on the page, and the fetch raises a RuntimeError. The fallback in _visual_slide catches this and renders a heading-only card, so the build continues.
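The selection rule can be sketched without Pillow by working on (width, height) tuples. The helper names here are hypothetical; only the 200px floor, the smaller-dimension check, and the RuntimeError on exhaustion come from the text.

```python
MIN_SIDE = 200  # loose floor on the smaller dimension, in pixels


def is_usable(size, min_side=MIN_SIDE):
    """True when the smaller of (width, height) reaches the floor."""
    width, height = size
    return min(width, height) >= min_side


def first_usable(sizes, min_side=MIN_SIDE):
    """Return the index of the first usable candidate, else raise."""
    for index, size in enumerate(sizes):
        if is_usable(size, min_side):
            return index
    raise RuntimeError("no usable image among candidates")


# A thumbnail and a thin banner are skipped; the third candidate wins.
chosen = first_usable([(150, 150), (1800, 180), (640, 480)])
```

In the real loop the size would come from `im.size` after decoding each download, so a decode failure can be treated the same way as a too-small image: skip to the next candidate.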

page_size=12 is set explicitly so the candidate count does not depend on the API default. Twelve results (indices 0-11) give you enough candidates to find at least one usable image for common topics.
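For reference, here is a sketch of building the search URL with both settings applied. The base URL is an assumption and should be checked against the current Openverse API docs; `build_search_url` is a hypothetical name.

```python
from urllib.parse import urlencode

# Assumed endpoint; confirm against the Openverse API documentation.
API_BASE = "https://api.openverse.org/v1/images/"


def build_search_url(query, page_size=12):
    """Build a search URL restricted to CC0 and PDM results."""
    params = {
        "q": query,
        "license": "cc0,pdm",    # nothing that requires attribution
        "page_size": page_size,  # explicit, so the default never matters
    }
    # safe="," keeps the license list readable instead of encoding it as %2C
    return f"{API_BASE}?{urlencode(params, safe=',')}"


url = build_search_url("server rack data center monitoring")
```

Keeping the license filter inside the URL builder, rather than filtering results afterwards, means a CC-BY image never enters the candidate list in the first place.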

3. SSL certificate errors in CI

On Ubuntu GitHub Actions runners, HTTPS requests to Openverse made with a plain ssl.create_default_context() sometimes fail with certificate verification errors. The reliable fix is certifi:

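import ssl

try:
    import certifi  # ships Mozilla's CA bundle
    ctx = ssl.create_default_context(cafile=certifi.where())
except ImportError:
    # certifi not installed: fall back to the system trust store
    ctx = ssl.create_default_context()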

The except branch keeps the code working locally where certifi may not be installed. In CI, certifi is installed in the workflow's pip install step (pip install certifi Pillow), so the primary branch runs.

The same pattern applies to the image download request — both the API call and the image fetch use the same ctx. Without this, roughly one in ten CI runs fails on SSL at the image download step even if the API call succeeds, because the two requests can hit different CDN nodes.
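One way to make the shared context hard to forget is to build it once and pass it to a single download helper. This is a sketch with hypothetical names (`make_ssl_context`, `fetch_bytes`) and a placeholder User-Agent, not the renderer's actual code.

```python
import ssl
import urllib.request


def make_ssl_context():
    """Prefer certifi's CA bundle; fall back to the system trust store."""
    try:
        import certifi
    except ImportError:
        return ssl.create_default_context()
    return ssl.create_default_context(cafile=certifi.where())


def fetch_bytes(url, ctx, timeout=15):
    """Download a URL using the shared context (API call or image fetch)."""
    # The User-Agent value is a placeholder for illustration.
    req = urllib.request.Request(url, headers={"User-Agent": "slide-renderer/0.1"})
    with urllib.request.urlopen(req, timeout=timeout, context=ctx) as r:
        return r.read()


ctx = make_ssl_context()  # build once, reuse for every request
```

Both the search request and each image download would then go through `fetch_bytes(url, ctx)`, so neither can silently fall back to a different trust store.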

4. Retry with backoff is necessary

The Openverse API has rate limits and occasional transient 5xx responses. Without retry logic, a single failed request loses every candidate, and because the slide then falls back to a heading-only card, the failure passes silently in CI and is hard to debug after the fact:

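import json
import time
import urllib.request

# req, timeout, and ctx are set up earlier in the fetch function
for attempt in range(3):
    try:
        with urllib.request.urlopen(req, timeout=timeout, context=ctx) as r:
            data = json.load(r)
        break  # success: stop retrying
    except Exception:
        if attempt == 2:
            raise  # third failure: let the caller's fallback handle it
        time.sleep(1.5 * (attempt + 1))  # linear backoff: 1.5s, then 3.0s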

Three attempts with 1.5s and 3.0s waits cover most transient failures. The timeout on urlopen is 15 seconds: enough for a slow CDN response, not so long that it blocks the entire video build if Openverse is down.

The image download loop (iterating through results) has no retry per-image, because if one image URL fails you just move to the next candidate. The retry logic sits at the API query level, where the cost of a transient failure is losing all candidates at once.
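The same retry schedule can be pulled into a small helper, which also makes it testable without touching the network. This is a sketch; `with_retry` and its parameters are hypothetical names, and the injectable `sleep` exists only so tests do not actually wait.

```python
import time


def with_retry(fn, attempts=3, base_delay=1.5, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * attempt number, then retry.

    With the defaults the waits are 1.5s and 3.0s, and the third failure
    is re-raised to the caller.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (attempt + 1))


# Simulate an API that fails twice, then succeeds.
calls = {"n": 0}
waits = []


def flaky_query():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("transient failure")
    return {"results": []}


data = with_retry(flaky_query, sleep=waits.append)
```

Only the API query would be wrapped this way; the per-image downloads stay unwrapped, matching the design described above.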

What the complete integration looks like

In the slide spec JSON, an image slide looks like:

{
  "kind": "image",
  "heading": "Self-hosted observability stack",
  "query": "server rack data center monitoring",
  "caption": "Grafana + Prometheus in production"
}

The query drives the Openverse search. The caption prefixes the attribution string. When the fetch succeeds, the slide shows the heading, the photo centered in the content area, and the attribution credit at the bottom. When it fails — query too specific, Openverse down, all results too small — the slide shows the heading alone. Either way the video builds.
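The success and failure paths can be sketched as a single function that never raises. `resolve_image_slide` and the shape of the returned dict are assumptions for illustration; the real _visual_slide is not shown in this post.

```python
def resolve_image_slide(spec, fetch_image):
    """Turn an image slide spec into render inputs; never raises.

    fetch_image(query) is assumed to return (image, attribution) or raise.
    """
    slide = {"heading": spec["heading"], "image": None, "caption": None}
    try:
        image, attribution = fetch_image(spec["query"])
    except Exception:
        return slide  # degraded output: heading-only card
    custom = spec.get("caption")
    slide["image"] = image
    slide["caption"] = f"{custom} · {attribution}" if custom else attribution
    return slide


spec = {
    "kind": "image",
    "heading": "Self-hosted observability stack",
    "query": "server rack data center monitoring",
    "caption": "Grafana + Prometheus in production",
}


def failing_fetch(query):
    raise RuntimeError("no usable image")


fallback = resolve_image_slide(spec, failing_fetch)
```

Here `fallback` carries only the heading, which is the same result whether the cause was an obscure query, an outage, or a page of undersized images.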

The design principle is the same one behind the three-tier content quality ladder for the directory ETL: the build never blocks on an enrichment step. Degraded output is better than a failed pipeline.

Part of an ongoing 6-month experiment running three AI-curated directory sites. The technical claims here are real; this article was AI-assisted.