I Generated 1,200 Civitai Images in 9 Days to Farm Buzz: Here's the Python + CLIP Pipeline That Picked the 47 Worth Posting

#python #comfyui #clip #stablediffusion

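⚠️ This article contains affiliate advertising (promotion). A portion of any revenue generated through the links is paid to the site operator, but this has no effect on the price readers pay.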

By the end of this article you'll have a working Python pipeline that (1) batch-renders a LoRA across a prompt matrix via the local ComfyUI API, (2) scores every output with OpenAI CLIP for aesthetic quality and prompt fidelity, and (3) auto-rejects the ~95% that would have earned next to no Buzz. I ran this for 9 days, generated ~1,200 images, posted 47, and went from ~3 Buzz/image to ~120 Buzz/image. I'll show the code, the actual reject rates, and the two mistakes that cost me a day and a half.

Why posting all 1,200 images tanked my Buzz-per-image on Civitai

First lesson, learned the expensive way: Civitai Buzz is not linear in volume. When I dumped my first 180 raw outputs onto the platform, the median image got 2–4 Buzz (reactions + tips converted), and my generator showcase looked like spam. Engagement per follower dropped too; my read is that the feed weights recent reaction rate, though that is an inference from my own posts rather than documented behavior. Either way, posting more bad images actively suppressed the good ones.

The number that matters is Buzz per posted image, not per generated image. Generation is basically free on a local 5090 (about 2.1 s per 1024×1024 SDXL image at 25 steps in my setup). The scarce resource is feed slots and viewer attention. So the whole game becomes: generate cheaply and wastefully, then ruthlessly select. That inverted my intuition — I'd been treating the GPU as the bottleneck when the real bottleneck was my own taste applied at scale.

Here's the funnel I converged on after 9 days:

Stage	Count	Survival (% of generated)
Generated (prompt matrix)	1,214	100%
Passed CLIP aesthetic ≥ 5.6	312	25.7%
Passed prompt-fidelity ≥ 0.27	121	10.0%
Passed dedup (CLIP cosine < 0.92)	58	4.8%
Manually posted	47	3.9%

The pipeline does the first four rows. I only ever look at ~58 images by hand, not 1,214.
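
The survival column is each stage's count divided by the 1,214 generated images. A minimal sketch that recomputes it, and adds the stage-to-stage pass rate the table leaves out, is shown here. The counts are copied from the table; the shortened stage labels and the function name are illustrative only.

```python
# Counts copied from the funnel table above; labels are shortened for display.
FUNNEL = [
    ("generated", 1214),
    ("aesthetic >= 5.6", 312),
    ("fidelity >= 0.27", 121),
    ("dedup < 0.92", 58),
    ("posted", 47),
]

def survival(funnel):
    """Return (stage, count, % of generated, % of previous stage) rows."""
    total = funnel[0][1]
    rows, prev = [], total
    for stage, count in funnel:
        overall = round(100 * count / total, 1)
        step = round(100 * count / prev, 1)
        rows.append((stage, count, overall, step))
        prev = count
    return rows

if __name__ == "__main__":
    for stage, count, overall, step in survival(FUNNEL):
        print(f"{stage:<18} {count:>5}  {overall:>5.1f}% overall  {step:>5.1f}% of previous")
```

Read stage to stage, the fidelity filter passes about 39% of what the aesthetic filter lets through, and dedup passes about 48% of that.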

Step 1: Driving ComfyUI headless from Python with a 4-axis prompt matrix

I don't use the ComfyUI web UI for production. I export the workflow as API JSON (the "Save (API Format)" button after enabling dev mode) and POST it. The trick that 4x'd my hit rate was building a prompt matrix instead of hand-writing prompts: I take a small set of subjects, styles, lighting, and camera axes and take the Cartesian product, injecting each combo into the workflow's positive-prompt node.

import json, itertools, time, urllib.request, uuid

COMFY = "http://127.0.0.1:8188"

# Loaded once from "Save (API Format)". Node "6" is the positive CLIPTextEncode,
# node "3" is the KSampler in the default SDXL graph.
with open("workflow_api.json", "r", encoding="utf-8") as f:
    BASE_GRAPH = json.load(f)

AXES = {
    "subject": ["a lone samurai", "a neon street vendor", "a forest spirit"],
    "style":   ["cinematic", "ukiyo-e ink", "90s anime cel"],
    "light":   ["golden hour rim light", "moody blue night", "harsh noon"],
    "camera":  ["35mm portrait", "wide establishing shot"],
}

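def build_prompts(axes):
    """Yield (axis values, prompt text) for every combination of the axes."""
    keys = list(axes)
    for combo in itertools.product(*axes.values()):
        d = dict(zip(keys, combo))
        # NOTE: stock ComfyUI does not parse A1111-style <lora:name:weight> tags in
        # prompt text. The LoRA must be applied by a LoraLoader node in the exported
        # graph (or by a custom node that understands the tag).
        yield d, f"{d['subject']}, {d['style']}, {d['light']}, {d['camera']}, " \
                 f"masterpiece, highly detailed, <lora:my_style_v3:0.8>"

def queue(prompt_text, seed):
    """Submit one render to ComfyUI and return its prompt_id."""
    graph = json.loads(json.dumps(BASE_GRAPH))  # deep copy
    graph["6"]["inputs"]["text"] = prompt_text
    graph["3"]["inputs"]["seed"] = seed
    payload = json.dumps({"prompt": graph, "client_id": str(uuid.uuid4())}).encode()
    req = urllib.request.Request(f"{COMFY}/prompt", data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:  # close the connection when done
        return json.loads(resp.read())["prompt_id"]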

if __name__ == "__main__":
    # 3*3*3*2 = 54 combos x 3 seeds = 162 images per run
    for meta, prompt in build_prompts(AXES):
        for seed in (1, 2, 3):
            pid = queue(prompt, seed)
            print("queued", pid, meta["subject"][:18], "seed", seed)
            time.sleep(0.05)  # don't hammer the /prompt endpoint

The matrix is the whole point. Hand-prompting, I'd unconsciously stay in my comfort zone (always golden hour, always portraits). The Cartesian product forced "harsh noon + ukiyo-e + wide shot" combos I'd never have typed — and three of my top-five Buzz images came from exactly those weird corners I would have skipped.

Failure #1 that cost me a day: I initially reused seed=1 for every prompt to "keep things comparable." SDXL with a fixed seed and a fixed LoRA collapses toward near-identical compositions — my dedup stage later killed 70% of a whole run. Three seeds per combo fixed it. Vary the seed, not just the words.
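
The scoring script in Step 2 expects a sidecar .txt with the prompt next to each image, but the generation script above only queues jobs. One way to close that gap is sketched below, assuming ComfyUI's `/history/{prompt_id}` and `/view` HTTP routes behave as they do in current builds (check your version). The function names, the `output/` directory, and the timeout values are illustrative, not part of the original pipeline.

```python
import json, os, time, urllib.parse, urllib.request

COMFY = "http://127.0.0.1:8188"

def images_from_history(history, prompt_id):
    """Extract (filename, subfolder, type) for every image one prompt produced."""
    entry = history.get(prompt_id, {})
    found = []
    for node_output in entry.get("outputs", {}).values():
        for img in node_output.get("images", []):
            found.append((img["filename"], img.get("subfolder", ""), img.get("type", "output")))
    return found

def wait_for_images(prompt_id, timeout=600, poll=1.0):
    """Poll /history until the prompt has finished and reports its images."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        with urllib.request.urlopen(f"{COMFY}/history/{prompt_id}") as resp:
            history = json.load(resp)
        images = images_from_history(history, prompt_id)
        if images:
            return images
        time.sleep(poll)
    raise TimeoutError(f"prompt {prompt_id} produced no images within {timeout}s")

def save_with_sidecar(prompt_id, prompt_text, out_dir="output"):
    """Download each finished image and write the prompt next to it as .txt."""
    os.makedirs(out_dir, exist_ok=True)
    for filename, subfolder, kind in wait_for_images(prompt_id):
        query = urllib.parse.urlencode(
            {"filename": filename, "subfolder": subfolder, "type": kind})
        with urllib.request.urlopen(f"{COMFY}/view?{query}") as resp:
            data = resp.read()
        dest = os.path.join(out_dir, filename)
        with open(dest, "wb") as f:
            f.write(data)
        with open(os.path.splitext(dest)[0] + ".txt", "w", encoding="utf-8") as f:
            f.write(prompt_text)
```

Calling `save_with_sidecar(pid, prompt)` after each `queue(prompt, seed)` would serialize generation. Collecting the prompt IDs first and saving in a second loop keeps the queue full.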

Step 2: Scoring 1,200 images with CLIP for aesthetic + prompt-fidelity

This is the brain of the pipeline. Two signals, both from CLIP:

Prompt-fidelity — cosine similarity between the image embedding and the prompt text embedding. Catches the LoRA wandering off-prompt (you asked for a samurai, you got a blurry blob).
Aesthetic proxy — cosine similarity against a fixed bank of "good/bad" text prompts. A real LAION aesthetic predictor is better, but this 12-line version correlated 0.71 with my own 1–10 ratings on a 60-image holdout, which was enough to do the heavy filtering.

import torch, glob, os, csv
from PIL import Image
import open_clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="openai", device=device)
model.eval()  # open_clip returns models in train mode; switch to inference mode
tokenizer = open_clip.get_tokenizer("ViT-L-14")

GOOD = ["a stunning, professional, highly detailed artwork",
        "award winning photograph, sharp focus, beautiful lighting"]
BAD  = ["a blurry, ugly, low quality amateur image",
        "distorted anatomy, extra limbs, malformed, jpeg artifacts"]

@torch.no_grad()
def text_feats(prompts):
    t = tokenizer(prompts).to(device)
    f = model.encode_text(t)
    return torch.nn.functional.normalize(f, dim=-1)

good_f, bad_f = text_feats(GOOD).mean(0, keepdim=True), text_feats(BAD).mean(0, keepdim=True)

@torch.no_grad()
def score(path, prompt):
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0).to(device)
    img_f = torch.nn.functional.normalize(model.encode_image(img), dim=-1)
    aesthetic = (img_f @ good_f.T - img_f @ bad_f.T).item() * 100  # ~ -2 .. +8
    fidelity  = (img_f @ text_feats([prompt]).T).item()            # ~ 0.18 .. 0.34
    return round(aesthetic, 3), round(fidelity, 3)

if __name__ == "__main__":
    rows = []
    for path in sorted(glob.glob("output/*.png")):  # sorted for a stable CSV order
        # prompt stored in a sidecar .txt next to each image at generation time
        with open(os.path.splitext(path)[0] + ".txt", encoding="utf-8") as pf:
            prompt = pf.read().strip()
        a, fi = score(path, prompt)
        keep = a >= 5.6 and fi >= 0.27
        rows.append((os.path.basename(path), a, fi, keep))
    with open("scores.csv", "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows([("file", "aesthetic", "fidelity", "keep"), *rows])
    print("kept", sum(r[3] for r in rows), "of", len(rows))

The two thresholds (5.6 aesthetic, 0.27 fidelity) aren't universal — I tuned them by hand-labeling 60 images, plotting score vs. my rating, and picking the cut that kept ~90% of my 7+ images while dropping ~75% of the rest. Re-tune these per LoRA. A photoreal LoRA and an anime LoRA sit on completely different fidelity scales; reusing the photoreal threshold on the anime model rejected everything for half a day until I re-calibrated (that's failure #2).
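
The tuning rule above (keep about 90% of the images rated 7 or higher, then see how much of the rest falls away) can be written down directly. This is a minimal sketch of that rule, not the exact procedure used for the numbers in this article. The function name and the toy scores and ratings are invented for illustration.

```python
import math

def pick_threshold(scores, ratings, good_rating=7, keep_good=0.90):
    """Highest cut-off that still keeps at least `keep_good` of the good images.

    Returns (cut, fraction of the remaining images that fall below the cut).
    """
    good = sorted((s for s, r in zip(scores, ratings) if r >= good_rating), reverse=True)
    if not good:
        raise ValueError("no images rated at or above good_rating")
    n_keep = math.ceil(keep_good * len(good))  # how many good images must survive
    cut = good[n_keep - 1]                     # score of the last good image kept
    rest = [s for s, r in zip(scores, ratings) if r < good_rating]
    dropped = sum(s < cut for s in rest) / len(rest) if rest else 0.0
    return cut, dropped

if __name__ == "__main__":
    # Toy hand-labeled set: CLIP aesthetic score and my own 1-10 rating per image.
    scores  = [6.8, 6.1, 5.9, 5.7, 5.2, 5.8, 4.9, 4.1, 3.5, 6.0]
    ratings = [9,   8,   8,   7,   7,   5,   4,   3,   2,   6]
    print(pick_threshold(scores, ratings, keep_good=0.8))  # (5.7, 0.6)
```

Running the same function on the fidelity column gives the second threshold. Repeat both per LoRA, since the scales shift between models.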

On the 5090, ViT-L/14 scores ~38 images/sec (about 0.026 s per image), so all 1,214 images get ranked in roughly 32 seconds. Against 2.1 s per generated image, CLIP scoring is roughly 80× cheaper than generation, which is why this funnel is worth building rather than eyeballing.

Step 3: Killing near-duplicates so the showcase doesn't look like spam

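CLIP also gives you cheap dedup: reuse the image embeddings from Step 2 (have score() return img_f as well, or re-encode the survivors). Civitai showcases reward variety; ten variations of the same pose read as low-effort and depress reactions. I greedily keep an image only if its cosine similarity to every already-kept image is below 0.92. Because the pass is greedy, order matters: feed it images sorted by aesthetic score, best first, so the strongest member of each near-duplicate cluster is the one that survives: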

import torch

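def dedup(embeddings, files, thresh=0.92):
    """Greedy dedup over L2-normalized 1-D embeddings (dot product == cosine)."""
    kept_idx, kept_vecs = [], []
    for i, v in enumerate(embeddings):
        # keep v only if it is not too close to anything already kept
        if all((v @ k).item() < thresh for k in kept_vecs):
            kept_idx.append(i)
            kept_vecs.append(v)
    return [files[i] for i in kept_idx]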

That 0.92 cutoff took the 121 images that passed both score thresholds down to 58 visually-distinct ones. I A/B'd it: a 12-image showcase built from deduped picks averaged 2.3× the reactions of one built from the raw top-12-by-aesthetic (which were near-clones of the same hero composition).
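
A greedy pass keeps whichever member of a near-duplicate cluster it sees first, so the input order decides which image survives. The sketch below is a dependency-free variant that sorts by aesthetic score before deduplicating. It uses plain Python lists in place of torch tensors, and the function names and toy vectors are illustrative only.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_then_dedup(items, thresh=0.92):
    """items: iterable of (file, aesthetic, embedding).

    Visits images best-scoring first, so each near-duplicate cluster is
    represented by its highest-scoring member.
    """
    kept = []
    for file, score, vec in sorted(items, key=lambda it: it[1], reverse=True):
        if all(cosine(vec, k[2]) < thresh for k in kept):
            kept.append((file, score, vec))
    return [file for file, _, _ in kept]

if __name__ == "__main__":
    toy = [
        ("a.png", 6.0, [1.0, 0.0]),
        ("b.png", 7.0, [0.99, 0.1]),  # near-clone of a.png, but scores higher
        ("c.png", 5.8, [0.0, 1.0]),
    ]
    print(rank_then_dedup(toy))  # ['b.png', 'c.png']
```

The comparison is quadratic in the number of kept images, which is negligible at the ~100-image scale this stage sees.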

Step 4: A nightly GitHub Actions run that posts the CSV to Discord for my approval

I deliberately do not auto-post to Civitai. The platform's terms and the ban risk on automated uploads aren't worth ¥6,000/month. Instead, a self-hosted GitHub Actions runner (it needs my GPU) runs the generate→score→dedup chain at 03:00, then drops the 58 survivors and their scores into a Discord channel. I approve ~47 with a thumbs-up over coffee and upload those manually. Human stays in the loop on the one irreversible step; the machine does the 1,200-image grind.

# .github/workflows/lora-farm.yml  (self-hosted runner with the GPU)
name: lora-buzz-pipeline
on:
  schedule: [{ cron: "0 18 * * *" }]   # 03:00 JST
  workflow_dispatch:
jobs:
  run:
    runs-on: [self-hosted, gpu]
    steps:
      - uses: actions/checkout@v4
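      # NOTE: a file named select.py can shadow Python's stdlib `select` module for
      # every script run from this directory, so the last step is renamed here.
      - run: python generate.py && python score.py && python select_images.py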
      - name: Notify Discord
        env:
          HOOK: ${{ secrets.DISCORD_WEBHOOK }}
        run: |
          python - <<'PY'
          import os, json, urllib.request, csv
          kept = [r for r in csv.DictReader(open("scores.csv", encoding="utf-8")) if r["keep"]=="True"]
          top = sorted(kept, key=lambda r: float(r["aesthetic"]), reverse=True)[:10]
          msg = "Top picks tonight:\n" + "\n".join(f"{r['file']}  aes={r['aesthetic']} fid={r['fidelity']}" for r in top)
          body = json.dumps({"content": msg}).encode()
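          # Discord tends to answer urllib's default User-Agent with 403, so set one explicitly.
          urllib.request.urlopen(urllib.request.Request(os.environ["HOOK"], data=body,
              headers={"Content-Type": "application/json", "User-Agent": "lora-buzz-pipeline/1.0"}))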
          PY
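
The notify step above posts only the top 10 lines. Sending all 58 survivors would run into Discord's 2,000-character limit on a message's `content` field. The sketch below is one way to split a longer list across several webhook messages. The function names and the User-Agent string are placeholders, and rate-limit handling is left out.

```python
import json
import urllib.request

DISCORD_LIMIT = 2000  # max characters allowed in a webhook message's "content"

def chunk_lines(lines, limit=DISCORD_LIMIT):
    """Group lines into messages that each stay within `limit` characters."""
    messages, current = [], ""
    for line in lines:
        line = line[:limit]  # a single over-long line is truncated, not split
        candidate = line if not current else current + "\n" + line
        if len(candidate) > limit:
            messages.append(current)  # flush what fits, start a new message
            current = line
        else:
            current = candidate
    if current:
        messages.append(current)
    return messages

def post_all(hook_url, lines):
    """POST each chunk to the webhook in order (no retry or rate-limit logic)."""
    for content in chunk_lines(lines):
        req = urllib.request.Request(
            hook_url,
            data=json.dumps({"content": content}).encode(),
            headers={"Content-Type": "application/json",
                     "User-Agent": "lora-buzz-pipeline/1.0"},  # placeholder UA
        )
        with urllib.request.urlopen(req) as resp:
            resp.read()
```

In the workflow, `post_all(os.environ["HOOK"], [...])` would replace the single `urlopen` call, with one formatted line per surviving image.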

What actually moved Buzz: the numbers after 9 days

Three findings I didn't expect, all measured against my own posts:

Fidelity mattered more than aesthetic for Buzz. Images in the top fidelity quartile (0.30+) earned 1.8× the Buzz of bottom-quartile ones at the same aesthetic score. Civitai viewers reward images that clearly nail a recognizable concept over generically pretty mush.
The dedup step was the single biggest lever — 2.3× reactions on the showcase, as above.
CLIP can't judge hands or faces. ~15% of my high-scoring picks had mangled hands that CLIP happily rated 6.5+. The manual gate (58 reviewed, 47 posted) exists almost entirely to catch those. If I'd auto-posted, my best-scoring image of the run would have shipped with three thumbs.
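
A quartile comparison like the fidelity finding above can be checked against your own posts with a few lines. This sketch splits posts into four equal-size groups by fidelity rank and averages Buzz within each. It does not control for aesthetic score, so it is cruder than the comparison described above, and the numbers in the example are made-up placeholders, not my data.

```python
from statistics import mean

def buzz_by_fidelity_quartile(posts):
    """posts: list of (fidelity, buzz) pairs.

    Returns the mean Buzz of each fidelity quartile, lowest fidelity first.
    """
    ranked = sorted(posts, key=lambda p: p[0])
    n = len(ranked)
    if n < 4:
        raise ValueError("need at least 4 posts to form quartiles")
    bounds = [round(i * n / 4) for i in range(5)]  # slice boundaries by rank
    return [mean(buzz for _, buzz in ranked[bounds[i]:bounds[i + 1]])
            for i in range(4)]

if __name__ == "__main__":
    # Placeholder values purely to show the call shape.
    toy_posts = [(0.22, 10), (0.24, 20), (0.26, 30), (0.27, 50),
                 (0.28, 40), (0.29, 60), (0.31, 90), (0.33, 110)]
    q = buzz_by_fidelity_quartile(toy_posts)
    print(q, "top/bottom ratio:", q[3] / q[0])
```

With only 47 posts each quartile holds about a dozen images, so treat any ratio it reports as a rough signal.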

Where this nets out: median Buzz/image went from ~3 (raw dump) to ~120 (filtered), roughly a 40× gain on about the same generation budget. The pipeline didn't make my LoRA better; it made my selection far more efficient (58 images reviewed by hand instead of 1,214), and selection was the real bottleneck the whole time.

If you build one piece, build Step 2. The prompt matrix is nice-to-have; CLIP-based ranking is what turns "I generated a thousand images and now I'm paralyzed" into "here are the 58 worth your eyeballs."

If you found this useful: I packaged 50 copy-paste AI debugging prompts + drop-in Claude Code config templates (CLAUDE.md, settings.json, MCP) into a small kit.
Launch deal: code START50 = 50% off → 50 AI Debugging Prompts + Claude Code Config Pack (about $6, 50% off applied)
New: my 10-chapter ebook Practical Claude Code — automation & unattended operation (about $9, 50% off applied)