Every AI image benchmark I've ever read measures the wrong thing.
FID tells you how close your distribution is to ImageNet. CLIP score tells you how well your image matches its caption according to a model that was itself trained on captions. Inception Score tells you the model is confident. Human Preference Score tells you what a crowd of MTurk workers clicked on in 3 seconds.
None of them tell you whether the image made anybody feel anything.
That's the only number that matters to me, so today I'm announcing the Beauty Index — the first open AI image benchmark that measures feelings, not pixels. Live now at zsky.ai/beauty-index.html. Results land April 30, 2026. Methodology is CC-BY 4.0 and I want other labs to run it against their own tools.
This post is the why, the how, and the "please steal this and use it" invitation.
Why current benchmarks are a hall of mirrors
Let me be precise about the problem, because I've heard "benchmarks are bad" a thousand times and it usually collapses into vibes. Here's the actual failure mode.
FID (Fréchet Inception Distance) compares the statistical distribution of generated images to the statistical distribution of a reference dataset (usually a slice of ImageNet or COCO). Lower is "better." But "better" here means "closer to the reference distribution." Which means a model that perfectly reproduces boring, average, forgettable images scores better than a model that makes one surprising, beautiful, strange image. FID rewards regression to the mean. The most beautiful output is, by definition, statistically distant from the mean.
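To make that concrete, here's a minimal sketch of the Fréchet distance at the heart of FID, assuming you've already extracted Inception-v3 feature vectors for the reference and generated sets. Notice that the score only ever sees two means and two covariances; a single extraordinary image barely moves it.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """Fréchet distance between two Gaussians fit to Inception feature sets.

    real_feats, gen_feats: arrays of shape (n_images, feature_dim),
    e.g. 2048-dim Inception-v3 pool features. Only the mean and covariance
    of each set ever enter the score.
    """
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)

    # ||mu_r - mu_g||^2 + Tr(Sigma_r + Sigma_g - 2 * (Sigma_r Sigma_g)^(1/2))
    covmean = sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # numerical noise can add tiny imaginary parts
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```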
CLIP score measures how well an image matches its text prompt, according to CLIP. Which is a model that was itself trained on scraped alt-text from the internet. So you're asking one neural net to grade another neural net's homework, using a rubric written by the internet's worst captioners. Predictably, CLIP thinks "a dog" and "a really good photo of a dog that makes you miss your childhood" score the same. Because it can only see the dog.
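Here's roughly what that grading looks like in practice, using the Hugging Face transformers CLIP checkpoint as one stand-in (the exact model and scaling vary between papers, and the filename is hypothetical). The score is a temperature-scaled cosine similarity between an image embedding and a text embedding; there is no channel for what the image does to a viewer.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("generated_dog.png")  # hypothetical generated image
prompts = [
    "a dog",
    "a really good photo of a dog that makes you miss your childhood",
]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# logits_per_image holds the scaled cosine similarity between the image
# embedding and each prompt embedding; that similarity is the "score".
print(out.logits_per_image)
```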
Inception Score (IS) measures how confident a classifier is that the image belongs to a category. So the top-scoring output is the one that looks maximally like the prototypical version of a thing. Again: regression to the mean, with extra steps.
Human Preference Score (HPS) is closer — it at least involves humans. But the humans are paid crowd workers doing forced A/B choices in ~3 seconds each, on prompts they didn't write, with no emotional stake in the outcome. It measures "which image would you click on in a split-second" which is closer to "which thumbnail would you doomscroll past" than to "which image changed how you saw the world."
Put them together and you get the current state of AI image evaluation: a hall of mirrors where models are graded on how well they imitate other models' ideas of what other models think is average.
Meanwhile, real humans are out there crying at sunsets.
What the Beauty Index measures instead
The Beauty Index asks one question, and only one question, about every image:
"Did this make you feel something?"
Yes / No. That's the whole rubric.
No rating scales. No "aesthetics from 1 to 10." No forced choice between images. No demographic segmenting. No "how well does this match the prompt." Just — did it land for you, the human looking at it, in the moment.
I know this sounds almost embarrassingly simple. That's the point. Every additional axis of "rigor" we've added to image benchmarks has moved us further from the thing we're actually trying to measure.
The methodology
Here's the exact protocol we're running for the first iteration. All of this is open and you can read the full spec at zsky.ai/beauty-index.html.
Judges: 20 humans, not crowd workers. I'm not using MTurk. The panel is hand-picked for one reason: they have a relationship with images.
- 6 working photographers (portrait, landscape, fashion, photojournalism)
- 4 art therapists (people whose literal job is watching humans have feelings about images)
- 4 people with aphantasia (no mind's-eye imagery — so every image is landing fresh, without a reference memory to compare against)
- 3 TBI survivors (images hit the brain differently after trauma; I know this personally)
- 3 visual artists who don't use AI (a skeptic panel, intentionally)
Panel bias disclosure: this is not a statistically representative sample of humanity. It's an expert panel of people who pay attention to images. If you want a general-population study, that's a different benchmark and I encourage somebody to run it. This one is for answering "what do the people who notice notice?"
Prompts: 10 of them. Diverse on purpose:
- "A kitchen on a Sunday morning"
- "Grief, but in color"
- "The moment before something changes"
- "Your mother when she was twenty-four"
- "A place that doesn't exist but you've been there"
- "The last minute of summer"
- "A stranger who feels like home"
- "Joy, unposed"
- "A small miracle"
- "What it felt like when you finally understood"
Notice none of them are "a red car on a beach." Technical prompts test models on their failure modes. Emotional prompts test them on what they're for.
Tools: 10 of them. The biggest names in AI image generation, plus two smaller indie tools I like, plus ZSky. I'm not naming the others in this post because I don't want this to read as a competitive hit piece before results are in — all tool identities will be in the final report, including any that embarrass ZSky. I'm not running a benchmark to win a benchmark. I'm running it to see what's actually true.
Blind: judges never see which tool made which image. Images are presented one at a time. Tool order is randomized per judge. There's no side-by-side comparison. Each image is just an image, evaluated on its own merit, the way a real human encounters an image in the wild.
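As a sketch of what that blinding looks like in practice (identifiers and the seeding scheme are mine, not from the spec): each image gets an opaque token that maps back to its (tool, prompt) pair only in a lookup table the judges never see, and each judge gets their own shuffled presentation order.

```python
import random

def build_presentation_order(image_ids: list[str], judge_id: str, seed: int = 2026) -> list[str]:
    """Return a per-judge shuffled order of anonymized image IDs.

    image_ids are opaque tokens (e.g. "img_0042") that never reveal which
    tool produced the image. Seeding on the judge ID keeps each judge's
    order reproducible for auditing.
    """
    rng = random.Random(f"{seed}-{judge_id}")
    order = list(image_ids)
    rng.shuffle(order)
    return order
```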
Single question, logged with a timestamp. Did this make you feel something? Y/N. Plus an optional 10-second voice memo if the judge wants to say what the feeling was. Voice memos are not used for scoring — they're qualitative color for the final report.
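A single judgment could be stored as something like this (field names are my own illustration, not the spec's):

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class Judgment:
    judge_id: str            # anonymized judge token
    image_id: str            # opaque image token; never reveals the tool
    felt_something: bool     # the single Y/N question
    logged_at: datetime      # timestamp of the response
    voice_memo_path: Optional[str] = None  # optional, not used for scoring

record = Judgment("judge_07", "img_0042", True, datetime.now(timezone.utc))
```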
Score: Percentage of judges who said "yes" per (tool, prompt) pair. Aggregated to a single Beauty Index per tool. That's it.
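A minimal sketch of that aggregation, assuming the per-tool index is the unweighted mean of the per-prompt yes-rates (the post doesn't pin down the aggregation step, so treat that as my assumption, and the sample votes as made-up data):

```python
from collections import defaultdict

# Each vote: (tool, prompt, judge_id, felt_something). Hypothetical sample data.
votes = [
    ("tool_A", "Grief, but in color", "judge_01", True),
    ("tool_A", "Grief, but in color", "judge_02", False),
    ("tool_A", "A small miracle", "judge_01", True),
    ("tool_B", "Grief, but in color", "judge_01", False),
]

def beauty_index(votes):
    """Yes-rate per (tool, prompt) cell, then an unweighted mean across prompts per tool."""
    cell = defaultdict(list)
    for tool, prompt, _judge, felt in votes:
        cell[(tool, prompt)].append(felt)

    yes_rate = {key: sum(v) / len(v) for key, v in cell.items()}

    per_tool = defaultdict(list)
    for (tool, _prompt), rate in yes_rate.items():
        per_tool[tool].append(rate)

    return {tool: sum(rates) / len(rates) for tool, rates in per_tool.items()}

print(beauty_index(votes))  # {'tool_A': 0.75, 'tool_B': 0.0}
```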
What I expect to find (and why I might be wrong)
I have hypotheses. I also know hypotheses are how benchmarks get biased, so I'm writing them down in public before running the study, so you can hold me to them.
H1: Tools that score best on FID will not score best on the Beauty Index. I'd bet ZSky's rack of RTX 5090s on this.
H2: Aphantasia judges will have a higher "yes" rate across the board than judges with mind's-eye imagery. (No internal reference image to disappoint.)
H3: The variance between tools will be smaller than the variance between prompts. Some prompts will hit for everybody, some prompts will flop for everybody. The "which tool is best" question will turn out to be less interesting than "which prompts actually work."
H4: At least one tool that's famously praised by AI Twitter for its "aesthetics" will underperform on the emotional prompts. Aesthetics and feelings are different axes, and I think the field has been conflating them.
H5: ZSky will not win on every prompt. I'm running my own tool in my own benchmark — I know how that looks. Commitment: if ZSky places 5th or worse, I publish that on the homepage. No cherry-picking. The only way to be trustworthy is to report the ugly result when it's ugly.
Why open methodology, why CC-BY
Everything about this benchmark is open under CC-BY 4.0:
- The prompts
- The judge selection criteria (not the judges' identities, for privacy)
- The scoring protocol
- The raw per-image per-judge data (anonymized)
- The final report
You can run it yourself. You can run it on tools I didn't include. You can run it with your own panel. You can fork the methodology, change one variable, and see what happens. The only thing I ask is that you publish your results under the same license, so the field accumulates real evidence instead of marketing claims.
I'm doing this because closed benchmarks are how the AI image field ended up grading itself on imitation. If everybody's using the same FID/CLIP harness from 2021, everybody trains toward it, and the whole field drifts toward "images that look like what the harness expects." Open, forkable, feelings-first benchmarks are how we drift back toward making things people actually love.
The founder note
Quick honest context: I have aphantasia. I can't visualize my own memories. I also had a bad TBI a few years back that took away a lot of what I thought I was. Photography was the first thing that gave me back a relationship with beauty, because the camera did the "seeing" for me.
That's why ZSky exists — a free AI image and video generator that gives everybody a camera for their imagination, especially the people whose imagination is broken. 26,000 creators have used it in 4 months. 3,000+ join every day. No credit card, no rationing, no watermark on the free tier.
When I look at existing benchmarks, I don't see my users in them. I don't see the art therapist whose client generated a picture of the father she never met and cried for an hour. I don't see the teenager with depression whose first generation was just "a place that feels safe." Those moments are what ZSky is for, and they're completely invisible to FID.
The Beauty Index is my attempt to build a measuring stick that's pointed at the thing that actually matters. I might be wrong about the methodology. I'm definitely wrong about some of it — first attempts always are. But I'd rather be wrong in public, in the open, with a license that lets the next person fix my mistakes, than keep pretending the old benchmarks mean anything.
How to participate
- Read the full methodology: zsky.ai/beauty-index.html
- Results land April 30, 2026. I'll publish a post here the same day with links to the full report.
- Want to run it on your own tool? Grab the CC-BY methodology and run it. Email me the results and I'll link them from the main report. Honest results only — I will link to results that beat ZSky.
- Want to be a judge on the next iteration? I'm building a pool of 100+ judges for the next round, especially from underrepresented neurotypes. Sign up at zsky.ai.
The AI image field is four years old. We still haven't agreed on what "good" means. Maybe that's because we've been measuring the wrong thing from the start. Let's find out together.
Cemhan Biricik, Founder, ZSky AI. Building a free AI image and video generator for people whose imagination needs help. Aphantasiac. TBI survivor. Owner of more GPUs than is strictly reasonable. zsky.ai