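⚠️ This article contains affiliate advertising (promotional content). A portion of any revenue generated through the links is paid to the site operator, but this has no effect whatsoever on the price readers pay.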
By the end of this article you'll have a working Python pipeline that (1) batch-renders a LoRA across a prompt matrix via the local ComfyUI API, (2) scores every output with OpenAI's CLIP weights (loaded through open_clip) for aesthetics and prompt fidelity, and (3) auto-rejects the ~95% that would have earned you next to no Buzz. I ran this for 9 days, generated ~1,200 images, posted 47, and went from ~3 Buzz/image to ~120 Buzz/image. I'll show the code, the actual reject rates, and the two mistakes that cost me three days.
Why posting all 1,200 images tanked my Buzz-per-image on Civitai
First lesson, learned the expensive way: Civitai Buzz is not linear in volume. When I dumped my first 180 raw outputs onto the platform, the median image got 2–4 Buzz (reactions + tips converted), and my generator showcase looked like spam. Engagement per follower dropped, which is consistent with the feed weighting recent reaction rate (my inference; I have no visibility into the algorithm). Posting more bad images actively suppressed the good ones.
The number that matters is Buzz per posted image, not per generated image. Generation is basically free on a local 5090 (about 2.1 s per 1024×1024 SDXL image at 25 steps in my setup). The scarce resource is feed slots and viewer attention. So the whole game becomes: generate cheaply and wastefully, then ruthlessly select. That inverted my intuition — I'd been treating the GPU as the bottleneck when the real bottleneck was my own taste applied at scale.
Here's the funnel I converged on after 9 days:
| Stage | Count | Survival |
|---|---|---|
| Generated (prompt matrix) | 1,214 | 100% |
| Passed CLIP aesthetic ≥ 5.6 | 312 | 25.7% |
| Passed prompt-fidelity ≥ 0.27 | 121 | 10.0% |
| Passed dedup (CLIP cosine < 0.92) | 58 | 4.8% |
| Manually posted | 47 | 3.9% |
The pipeline does the first four rows. I only ever look at ~58 images by hand, not 1,214.
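The survival column is simply each stage's count divided by the 1,214 generated images. A minimal sketch that recomputes it from the counts in the table (the `FUNNEL` name and layout are only for illustration):

```python
# Stage counts copied from the funnel table above.
FUNNEL = [
    ("Generated (prompt matrix)", 1214),
    ("Passed CLIP aesthetic >= 5.6", 312),
    ("Passed prompt-fidelity >= 0.27", 121),
    ("Passed dedup (CLIP cosine < 0.92)", 58),
    ("Manually posted", 47),
]

def survival(funnel):
    """Return (stage, count, percent of generated images) for each stage."""
    total = funnel[0][1]
    return [(name, n, round(100 * n / total, 1)) for name, n in funnel]

if __name__ == "__main__":
    for name, n, pct in survival(FUNNEL):
        print(f"{name:<36} {n:>5} {pct:>5.1f}%")
```

Keeping the counts in one place like this makes it easy to see which stage does most of the rejecting when you re-tune a threshold.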
Step 1: Driving ComfyUI headless from Python with a 4-axis prompt matrix
I don't use the ComfyUI web UI for production. I export the workflow as API JSON (the "Save (API Format)" button after enabling dev mode) and POST it. The trick that 4x'd my hit rate was building a prompt matrix instead of hand-writing prompts: I define a few values along each axis (subject, style, lighting, camera), take the Cartesian product, and inject each combination into the workflow's positive-prompt node.
```python
import copy
import itertools
import json
import time
import urllib.request
import uuid

COMFY = "http://127.0.0.1:8188"

# Loaded once from "Save (API Format)". Node "6" is the positive CLIPTextEncode,
# node "3" is the KSampler in the default SDXL graph.
with open("workflow_api.json", "r", encoding="utf-8") as f:
    BASE_GRAPH = json.load(f)

AXES = {
    "subject": ["a lone samurai", "a neon street vendor", "a forest spirit"],
    "style": ["cinematic", "ukiyo-e ink", "90s anime cel"],
    "light": ["golden hour rim light", "moody blue night", "harsh noon"],
    "camera": ["35mm portrait", "wide establishing shot"],
}

def build_prompts(axes):
    """Yield (axis-value dict, prompt string) for every combination of axis values."""
    keys = list(axes)
    for combo in itertools.product(*axes.values()):
        d = dict(zip(keys, combo))
        # NOTE: stock ComfyUI does not parse A1111-style <lora:...> tags in prompt
        # text. Unless a custom node handles them, apply the LoRA with a
        # LoraLoader node in the exported graph instead.
        yield d, (f"{d['subject']}, {d['style']}, {d['light']}, {d['camera']}, "
                  f"masterpiece, highly detailed, <lora:my_style_v3:0.8>")

def queue(prompt_text, seed):
    """POST one copy of the graph with the given prompt and seed; return its prompt_id."""
    graph = copy.deepcopy(BASE_GRAPH)
    graph["6"]["inputs"]["text"] = prompt_text
    graph["3"]["inputs"]["seed"] = seed
    payload = json.dumps({"prompt": graph, "client_id": str(uuid.uuid4())}).encode()
    req = urllib.request.Request(f"{COMFY}/prompt", data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["prompt_id"]

if __name__ == "__main__":
    # 3*3*3*2 = 54 combos x 3 seeds = 162 images per run
    for meta, prompt in build_prompts(AXES):
        for seed in (1, 2, 3):
            pid = queue(prompt, seed)
            print("queued", pid, meta["subject"][:18], "seed", seed)
            time.sleep(0.05)  # don't hammer the /prompt endpoint
```
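The queueing script returns as soon as ComfyUI accepts each job, while the scoring step later expects a sidecar `.txt` next to every image. One way to close that gap is sketched below. It assumes the graph ends in a standard SaveImage node, that `/history/{prompt_id}` lists saved files under `outputs` once a job has finished, and that images land in an `output` folder. `image_files`, `wait_for_images`, `write_sidecars` and `OUTPUT_DIR` are hypothetical names, not part of the original scripts.

```python
import json
import os
import time
import urllib.request

COMFY = "http://127.0.0.1:8188"
OUTPUT_DIR = "output"  # assumption: the folder the scoring script globs

def image_files(history_entry):
    """Collect saved (non-preview) filenames from one /history entry. No network."""
    files = []
    for node_output in history_entry.get("outputs", {}).values():
        for img in node_output.get("images", []):
            if img.get("type", "output") == "output":  # skip temp previews
                files.append(os.path.join(img.get("subfolder", ""), img["filename"]))
    return files

def wait_for_images(prompt_id, poll_s=1.0, timeout_s=300):
    """Poll /history/<prompt_id> until the job appears, then return its files."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        with urllib.request.urlopen(f"{COMFY}/history/{prompt_id}") as resp:
            history = json.loads(resp.read())
        if prompt_id in history:  # assumption: entry exists only after completion
            return image_files(history[prompt_id])
        time.sleep(poll_s)
    raise TimeoutError(f"prompt {prompt_id} did not finish within {timeout_s}s")

def write_sidecars(files, prompt_text, output_dir=OUTPUT_DIR):
    """Write <image>.txt next to each image so the scorer can recover the prompt."""
    for name in files:
        sidecar = os.path.splitext(os.path.join(output_dir, name))[0] + ".txt"
        with open(sidecar, "w", encoding="utf-8") as f:
            f.write(prompt_text)
```

In the generation loop this would be `write_sidecars(wait_for_images(pid), prompt)` right after `queue(...)`. That waits for each image before queueing the next, which costs little on a single GPU because ComfyUI works through its queue one job at a time.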
The matrix is the point of this step. Hand-prompting, I'd unconsciously stay in my comfort zone (always golden hour, always portraits). The Cartesian product forced "harsh noon + ukiyo-e + wide shot" combos I'd never have typed, and three of my top-five Buzz images came from exactly those weird corners I would have skipped.
Failure #1 that cost me a day: I initially reused seed=1 for every prompt to "keep things comparable." SDXL with a fixed seed and a fixed LoRA collapses toward near-identical compositions — my dedup stage later killed 70% of a whole run. Three seeds per combo fixed it. Vary the seed, not just the words.
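Note that seeds (1, 2, 3) are still shared across every combo. If you want each prompt to get its own reproducible seeds, one option is to derive them from the prompt text. This is a sketch of that idea, not what the original run used; `seeds_for` is a hypothetical helper.

```python
import hashlib

def seeds_for(prompt_text, n=3):
    """Derive n reproducible, prompt-specific 32-bit seeds from SHA-256 digests."""
    seeds = []
    for i in range(n):
        digest = hashlib.sha256(f"{prompt_text}|{i}".encode("utf-8")).digest()
        # The first 4 bytes give an unsigned 32-bit integer in [0, 2**32).
        seeds.append(int.from_bytes(digest[:4], "big"))
    return seeds
```

The inner loop would then read `for seed in seeds_for(prompt):`. Re-running the script reproduces the same images, while different prompts no longer share seeds.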
Step 2: Scoring 1,200 images with CLIP for aesthetic + prompt-fidelity
This is the brain of the pipeline. Two signals, both from CLIP:
- Prompt-fidelity — cosine similarity between the image embedding and the prompt text embedding. Catches the LoRA wandering off-prompt (you asked for a samurai, you got a blurry blob).
- Aesthetic proxy — cosine similarity against a fixed bank of "good/bad" text prompts. A real LAION aesthetic predictor is better, but this 12-line version correlated 0.71 with my own 1–10 ratings on a 60-image holdout, which was enough to do the heavy filtering.
```python
import csv
import glob
import os

import open_clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="openai", device=device)
model.eval()  # inference only
tokenizer = open_clip.get_tokenizer("ViT-L-14")

GOOD = ["a stunning, professional, highly detailed artwork",
        "award winning photograph, sharp focus, beautiful lighting"]
BAD = ["a blurry, ugly, low quality amateur image",
       "distorted anatomy, extra limbs, malformed, jpeg artifacts"]

@torch.no_grad()
def text_feats(prompts):
    """Return L2-normalized CLIP text embeddings, one row per prompt."""
    t = tokenizer(prompts).to(device)
    f = model.encode_text(t)
    return torch.nn.functional.normalize(f, dim=-1)

# Mean embedding of each prompt bank, shape (1, D).
good_f = text_feats(GOOD).mean(0, keepdim=True)
bad_f = text_feats(BAD).mean(0, keepdim=True)

@torch.no_grad()
def score(path, prompt):
    """Return (aesthetic proxy, prompt fidelity) for one image."""
    with Image.open(path) as im:
        img = preprocess(im.convert("RGB")).unsqueeze(0).to(device)
    img_f = torch.nn.functional.normalize(model.encode_image(img), dim=-1)
    aesthetic = (img_f @ good_f.T - img_f @ bad_f.T).item() * 100  # ~ -2 .. +8
    fidelity = (img_f @ text_feats([prompt]).T).item()             # ~ 0.18 .. 0.34
    return round(aesthetic, 3), round(fidelity, 3)

if __name__ == "__main__":
    rows = []
    for path in glob.glob("output/*.png"):
        # prompt stored in a sidecar .txt next to each image at generation time
        with open(os.path.splitext(path)[0] + ".txt", encoding="utf-8") as f:
            prompt = f.read().strip()
        a, fi = score(path, prompt)
        keep = a >= 5.6 and fi >= 0.27
        rows.append((os.path.basename(path), a, fi, keep))
    with open("scores.csv", "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows([("file", "aesthetic", "fidelity", "keep"), *rows])
    print("kept", sum(r[3] for r in rows), "of", len(rows))
```
The two thresholds (5.6 aesthetic, 0.27 fidelity) aren't universal — I tuned them by hand-labeling 60 images, plotting score vs. my rating, and picking the cut that kept ~90% of my 7+ images while dropping ~75% of the rest. Re-tune these per LoRA. A photoreal LoRA and an anime LoRA sit on completely different fidelity scales; reusing the photoreal threshold on the anime model rejected everything for half a day until I re-calibrated (that's failure #2).
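The calibration described above can be sketched as a small search: for each candidate cut, measure what share of the hand-rated 7+ images survives, then take the strictest cut that still keeps the target share. The function name and the `(score, rating)` input layout are assumptions for illustration; the real input is your own hand-labeled set.

```python
def pick_threshold(labeled, good_rating=7, keep_good=0.90):
    """labeled: list of (score, rating) pairs.

    Returns (threshold, share of good images kept, share of the rest dropped),
    choosing the highest cut that keeps at least `keep_good` of the images
    rated `good_rating` or above.
    """
    good = [s for s, r in labeled if r >= good_rating]
    rest = [s for s, r in labeled if r < good_rating]
    if not good:
        raise ValueError("need at least one image rated good")
    best = None
    for t in sorted({s for s, _ in labeled}):  # ascending candidate cuts
        kept = sum(s >= t for s in good) / len(good)
        if kept >= keep_good:
            dropped = sum(s < t for s in rest) / len(rest) if rest else 0.0
            best = (t, kept, dropped)  # later, stricter cuts overwrite earlier ones
    return best
```

Run it once per signal (aesthetic, fidelity) and once per LoRA. The third return value tells you how much junk the cut removes, which is the number to compare against the ~75% figure above.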
On the 5090, ViT-L/14 scores ~38 images/sec, so all 1,214 images get ranked in about half a minute (1,214 / 38 ≈ 32 s). At ~0.026 s per scored image against 2.1 s per generated image, CLIP scoring is roughly 80× cheaper than generation, which is why this funnel is worth building rather than eyeballing.
Step 3: Killing near-duplicates so the showcase doesn't look like spam
CLIP also gives you cheap dedup for free — reuse the image embeddings. Civitai showcases reward variety; ten variations of the same pose read as low-effort and depress reactions. I greedily keep an image only if its cosine similarity to every already-kept image is below 0.92:
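```python
import torch  # embeddings are expected to be torch tensors

def dedup(embeddings, files, thresh=0.92):
    """Greedily keep images whose similarity to every kept image is < thresh."""
    # Assumes L2-normalized 1-D vectors, so the dot product is cosine similarity.
    kept_idx, kept_vecs = [], []
    for i, v in enumerate(embeddings):
        if all((v @ k).item() < thresh for k in kept_vecs):
            kept_idx.append(i)
            kept_vecs.append(v)
    return [files[i] for i in kept_idx]
```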
That 0.92 cutoff took the 121 images that passed both score thresholds down to 58 visually distinct ones. I A/B'd it: a 12-image showcase built from deduped picks averaged 2.3× the reactions of one built from the raw top-12-by-aesthetic (which were near-clones of the same hero composition).
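Greedy dedup depends on visiting order: whichever member of a near-duplicate cluster comes first is the one that survives. A small variant visits images best-first so the top-scoring member of each cluster is kept. It assumes you have one aesthetic score per image and L2-normalized embeddings. `dedup_best_first` is a hypothetical name, and plain Python lists stand in for tensors to keep the sketch dependency-free.

```python
def dot(a, b):
    """Dot product of two equal-length vectors (cosine similarity if normalized)."""
    return sum(x * y for x, y in zip(a, b))

def dedup_best_first(embeddings, files, scores, thresh=0.92):
    """Greedy dedup that visits images in descending score order."""
    order = sorted(range(len(files)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(dot(embeddings[i], embeddings[k]) < thresh for k in kept):
            kept.append(i)
    return [files[i] for i in kept]
```

The returned list is ordered best-first, which also makes it a convenient order for the manual review pass.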
Step 4: A nightly GitHub Actions run that posts the CSV to Discord for my approval
I deliberately do not auto-post to Civitai. The platform's terms and the ban risk on automated uploads aren't worth ¥6,000/month. Instead, a self-hosted GitHub Actions runner (it needs my GPU) runs the generate→score→dedup chain at 03:00, then drops the 58 survivors and their scores into a Discord channel. I approve ~47 with a thumbs-up over coffee and upload those manually. Human stays in the loop on the one irreversible step; the machine does the 1,200-image grind.
```yaml
# .github/workflows/lora-farm.yml (self-hosted runner with the GPU)
name: lora-buzz-pipeline
on:
  schedule: [{ cron: "0 18 * * *" }]  # 18:00 UTC = 03:00 JST
  workflow_dispatch:
jobs:
  run:
    runs-on: [self-hosted, gpu]
    steps:
      - uses: actions/checkout@v4
      # Note: on some Python builds a local select.py can shadow the stdlib
      # `select` module; rename the script if imports misbehave.
      - run: python generate.py && python score.py && python select.py
      - name: Notify Discord
        env:
          HOOK: ${{ secrets.DISCORD_WEBHOOK }}
        run: |
          python - <<'PY'
          import os, json, urllib.request, csv
          with open("scores.csv", encoding="utf-8") as f:
              kept = [r for r in csv.DictReader(f) if r["keep"] == "True"]
          top = sorted(kept, key=lambda r: float(r["aesthetic"]), reverse=True)[:10]
          msg = "Top picks tonight:\n" + "\n".join(
              f"{r['file']} aes={r['aesthetic']} fid={r['fidelity']}" for r in top)
          body = json.dumps({"content": msg}).encode()
          # An explicit User-Agent is set because Discord may answer urllib's
          # default one with HTTP 403 (the agent string itself is arbitrary).
          urllib.request.urlopen(urllib.request.Request(os.environ["HOOK"], data=body,
              headers={"Content-Type": "application/json", "User-Agent": "lora-farm/1.0"}))
          PY
```
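The inline script sends only the top 10 rows. Discord rejects messages whose `content` exceeds 2,000 characters, so sending all ~58 survivors would need chunking. A minimal sketch, where `chunk_lines` is a hypothetical helper that packs lines into as few messages as will fit:

```python
DISCORD_LIMIT = 2000  # max characters in a message's "content" field

def chunk_lines(lines, limit=DISCORD_LIMIT):
    """Pack lines into newline-joined messages of at most `limit` characters."""
    messages, current = [], ""
    for line in lines:
        line = line[:limit]  # a single over-long line is truncated, not split
        candidate = line if not current else current + "\n" + line
        if len(candidate) <= limit:
            current = candidate
        else:
            messages.append(current)
            current = line
    if current:
        messages.append(current)
    return messages
```

Each returned string would be posted as its own `{"content": ...}` payload, in order.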
What actually moved Buzz: the numbers after 9 days
Three findings I didn't expect, all measured against my own posts:
- Fidelity mattered more than aesthetic for Buzz. Images in the top fidelity quartile (0.30+) earned 1.8× the Buzz of bottom-quartile ones at the same aesthetic score. Civitai viewers reward images that clearly nail a recognizable concept over generically pretty mush.
- The dedup step was the single biggest lever — 2.3× reactions on the showcase, as above.
- CLIP can't judge hands or faces. ~15% of my high-scoring picks had mangled hands CLIP happily rated 6.5+. The manual gate over the 58 survivors exists almost entirely to catch those. If I'd auto-posted, my best-scoring image of the run would have shipped with three thumbs.
Where this nets out: median Buzz/image went from ~3 (raw dump) to ~120 (filtered), on roughly the same generation budget. The pipeline didn't make my LoRA better; it raised Buzz per posted image roughly 40× by posting only about 1 in 25 generated images, and selection was the real bottleneck the whole time.
If you build one piece, build Step 2. The prompt matrix is nice-to-have; CLIP-based ranking is what turns "I generated a thousand images and now I'm paralyzed" into "here are the 58 worth your eyeballs."