Why GPT's image generator keeps giving you the same picture

#ai #diffusion #llm #explained

Image generators do not invent imagery. They sample a learned distribution, and the distribution has gravity wells. What reads on r/ChatGPT this week as "GPT keeps producing the same kind of image regardless of prompt" is the model behaving exactly as trained, not malfunctioning. It gets pulled toward the centroid of whatever the training corpus over-represented, and most prompts are not strong enough to escape that pull.

The thread that prompted the question — "I have no idea where gpt gets this imagery from" — sits at 3,276 upvotes and 1,294 comments on r/ChatGPT as of this writing. The poster ran a handful of unrelated prompts through GPT's image tool and got back a sequence of pictures that read as variations of the same image: same warm cast, same soft-focus depth, same painterly haze across portrait, landscape, and abstract requests. The comment thread treats this as a bug, or as evidence the model is "lazy," or as proof that it secretly hates the user. None of those is the explanation.

GPT's image generator, like other models in the CLIP-guided diffusion lineage, is a sampler over a learned image distribution. The distribution is not uniform. It carries the statistical fingerprint of the corpus it was trained on, which is, in all likelihood and by volume, mid-2010s to mid-2020s stock photography, Adobe-pipeline photography, Instagram and Pinterest exports, and the visual outputs of earlier generative models that were themselves trained on the same family of sources. That mix has a center of mass. When a prompt is neutral or under-specified, the sampler does what samplers do: it returns a draw biased toward where the density is highest. The "weird AI aesthetic" everyone keeps complaining about is not a style choice the model is making. It is the visual median of the open web after a decade of algorithmic curation.

The "median aesthetic" attractor

There are two forces pulling the output toward the centroid. The first is the corpus itself. A library that holds ten thousand identical postcards and one rare manuscript will, on average, hand you a postcard. The second is the guidance system. CLIP-guided diffusion uses a separate CLIP encoder to score how well a partial image matches the text prompt at each denoising step, and the gradient of that score steers the next step. CLIP's notion of "matches the prompt" is itself learned from a similar web-scale corpus, which means the guidance gradient points toward the same regions of image space that the corpus over-represented. The architecture stacks two attractor effects on top of each other.
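The stacking can be made concrete with a toy calculation. This is only an illustration, not how a real sampler is implemented: the postcard and manuscript counts come from the analogy above, while the `match_score` values and the `guided_distribution` helper are made up for the example. The prior is reweighted by a hypothetical prompt-match score raised to a guidance strength, then renormalized.

```python
def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights.values())
    return {name: value / total for name, value in weights.items()}


def guided_distribution(prior, match_score, scale):
    """Reweight a prior by match_score**scale and renormalize.

    Toy stand-in for guidance: scale 0 leaves the prior untouched,
    larger scales favor whatever the scorer already prefers.
    """
    return normalize(
        {name: prior[name] * match_score[name] ** scale for name in prior}
    )


# Counts from the library analogy in the text.
prior = normalize({"postcard": 10_000, "manuscript": 1})

# Hypothetical scores: a scorer trained on similar data also
# rates the common item as the better match for a neutral prompt.
match_score = {"postcard": 0.9, "manuscript": 0.6}

for scale in (0, 1, 4):
    share = guided_distribution(prior, match_score, scale)["manuscript"]
    print(f"scale={scale}: manuscript share = {share:.8f}")
```

The manuscript starts rare under the prior and becomes rarer as the scale grows, because both factors lean the same way.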

The published literature points the same way. Studies on mode coverage in diffusion models, including the work by Sehwag and collaborators on sampling from low-density regions and the family of "rare concept" papers that followed, find that for short, neutral prompts a large fraction of outputs cluster in a small number of visual regions. The exact share varies by model and prompt set, but the qualitative result is consistent: a meaningful share of the output mass lands in roughly half a dozen attractor regions, and the rest of image space is reached only when the prompt is long, specific, or uses negative conditioning. The Reddit user's experience is what that statistic looks like from the user side.

The "creativity" knob exposed in some interfaces — temperature, guidance scale, sampler type — is mostly a way of trading off between two failure modes. Low guidance scale: the output drifts away from the prompt and looks generic. High guidance scale: the output sticks tightly to the prompt and over-saturates toward the corpus median. The interesting region is in between, and the interesting region is narrow.
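To see what a guidance scale does mechanically, here is a minimal sketch of the classifier-free guidance update used in many open diffusion pipelines. That is a different mechanism from the CLIP scoring described earlier, and whether GPT's image tool uses either one is an assumption, not something the thread or this article establishes. The function name and the two-element "noise predictions" are hypothetical toy values.

```python
def guided_noise(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: move from the unconditional
    prediction toward (and past) the prompt-conditioned one.

    scale = 0 ignores the prompt, scale = 1 uses the conditional
    prediction as is, scale > 1 extrapolates beyond it.
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy predictions standing in for a denoiser's output tensors.
eps_uncond = [0.0, 1.0]  # prediction with an empty prompt
eps_cond = [1.0, 3.0]    # prediction with the user's prompt

for scale in (0.0, 1.0, 7.5):
    print(scale, guided_noise(eps_uncond, eps_cond, scale))
```

At scale 0 the prompt has no influence, which is the generic-drift failure mode. At large scales the update overshoots the conditional prediction, which is where over-saturated, prompt-literal outputs tend to come from.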

What the convergence tells you about the training corpus

Read the other way around, the convergence is a diagnostic. Every image generator leaks its training distribution through its defaults, and the defaults are legible if you know to look for them. A model that hands back a warm-toned soft-focus picture for almost any prompt was most likely trained on a corpus where warm-toned soft-focus pictures were over-represented. A model that defaults to a clean centered subject on a gradient background was most likely trained on stock photography. A model whose rendered text looks like garbled Latin characters was most likely trained on a corpus where Latin-alphabet text dominated; asked for a non-Latin script, the same model tends to produce gibberish built from the same Latin-like strokes. The output style is corpus archaeology.

This matters for two practical reasons. The first is evaluation. A team shipping a product on top of an image model should treat the centroid as a property of the model, the same way a team shipping on top of an LLM treats the system-prompt default tone as a property of the model. The second is procurement. If two image models produce visually similar outputs on neutral prompts, the most likely explanation is that they were trained on overlapping web crawls, not that they share an architecture. Visual similarity at the output is a proxy for corpus overlap, and corpus overlap is a proxy for IP and licensing exposure.

Two tests are worth running on any image model under evaluation. The first is the neutral-prompt sweep: send the model thirty unrelated, deliberately under-specified prompts ("a photo of a person," "a landscape," "a still life") and measure how visually similar the outputs are. Use a perceptual similarity score like LPIPS or, more simply, compute CLIP-image embeddings for the outputs and look at the pairwise distance distribution. A tight distribution suggests a strong attractor; a wide one suggests the model has been deliberately trained or fine-tuned to spread the prior. The second is the rare-concept stress test: send prompts for things that are likely absent from common web crawls (a specific historical chemistry diagram, a piece of regional folk art, a non-English script). The shape of the failure tells you where the corpus ends.
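The scoring half of the neutral-prompt sweep can be sketched in a few lines. The sketch assumes the image embeddings have already been computed by whatever CLIP implementation is at hand and are passed in as plain lists of floats; the `pairwise_spread` helper and the two-dimensional toy vectors are hypothetical and stand in for real embeddings, which typically have hundreds of dimensions.

```python
import math
from itertools import combinations
from statistics import mean, pstdev


def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def pairwise_spread(embeddings):
    """Summarize the distances between every pair of embeddings."""
    distances = [cosine_distance(a, b) for a, b in combinations(embeddings, 2)]
    return {
        "mean": mean(distances),
        "std": pstdev(distances),
        "max": max(distances),
    }


# Toy stand-ins: one set clustered in a single direction, one spread out.
tight = [[1.0, 0.0], [0.99, 0.1], [0.98, -0.1]]
wide = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

print("tight:", pairwise_spread(tight))
print("wide: ", pairwise_spread(wide))
```

A low mean with a low standard deviation across thirty unrelated prompts is the numerical signature of the attractor. Comparing the same summary across two models, on the same prompt list, gives a rough first read on how similar their defaults are.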

The Reddit thread will reach the front page and the conversation will move on. The mechanism it was bumping up against — that an image generator is a sampler over a non-uniform learned distribution, and that the distribution has a center of mass that pulls every neutral prompt toward it — will keep being the mechanism. Most of what gets discussed online as a quirk of GPT's image tool is, on inspection, a quirk of the open web's visual median in 2026. The model is the messenger.

DEV Community
