DEV Community

Sattyam Jain
Exploring 12 Million of the 2.3 Billion Images Used to Train Stable Diffusion’s Image Generator

Unveiling the Inner Workings of Stable Diffusion's Image Generator

AI models that generate images from text have drawn wide attention for their creative potential, yet most reveal little about where their training data comes from. The team behind Stable Diffusion has taken a more transparent approach and shared details of its model's training data. This post summarizes an analysis by Andy Baio and Simon Willison, who explored over 12 million images used to train Stable Diffusion's image generator. Using Simon's Datasette project, they built a data browser that lets you explore this dataset yourself.

Unraveling the Enigma: The Data Source

Stable Diffusion's training datasets were collected by LAION, a nonprofit whose compute time was largely funded by Stability AI, the owner of Stable Diffusion. LAION drew on Common Crawl, a nonprofit that scrapes billions of webpages, to collect image-text pairs. It then classified these pairs by language, resolution, watermark likelihood, and a predicted "aesthetic" score (an estimate of subjective visual quality) to create several specialized datasets for training Stable Diffusion.

Stable Diffusion was first trained on low-resolution 256x256 images from LAION-2B-EN, a dataset of 2.3 billion English-captioned images. It was then trained on high-resolution images from LAION-High-Resolution, a subset of LAION-5B with 170 million images larger than 1024x1024 (downsampled to 512x512). The model's latest checkpoints were trained on LAION-Aesthetics v2 5+, a dataset of 600 million images with a predicted aesthetic score of 5 or higher, with low-resolution and likely watermarked images filtered out.

The data browser covers the LAION-Aesthetics v2 6+ dataset, which contains 12 million image-text pairs with a predicted aesthetic score of 6 or higher. This is only a fraction of the complete training data, but it offers a useful glimpse of the higher-scoring images that influenced Stable Diffusion's recent checkpoints.
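The threshold-based subsets described above (5+ and 6+) can be pictured as simple filters over a predicted score. The following is a minimal sketch only: the `ImageRecord` class, its field names, and the sample rows are hypothetical and are not taken from LAION's actual schema or tooling.

```python
from dataclasses import dataclass


@dataclass
class ImageRecord:
    # Hypothetical record layout, for illustration only.
    url: str
    caption: str
    aesthetic: float  # predicted aesthetic score


def filter_by_aesthetic(records, min_score):
    """Keep records whose predicted aesthetic score is at least min_score."""
    return [r for r in records if r.aesthetic >= min_score]


# Made-up sample rows.
sample = [
    ImageRecord("https://example.com/a.jpg", "a sunset over the sea", 6.4),
    ImageRecord("https://example.com/b.jpg", "product photo", 5.2),
    ImageRecord("https://example.com/c.jpg", "blurry screenshot", 3.9),
]

five_plus = filter_by_aesthetic(sample, 5.0)  # analogous to a "5+" subset
six_plus = filter_by_aesthetic(sample, 6.0)   # analogous to a "6+" subset
```

Because the 6+ threshold is stricter, every record in `six_plus` also appears in `five_plus`, which mirrors why the 12 million images are a subset of the 600 million.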

Unveiling the Origins: Source Domains

Cataloging the source domain of each of the 12 million images shows how concentrated the dataset is. Nearly half of the images (about 47%) came from just 100 domains. Pinterest led, contributing over a million images (8.5% of the dataset) from its pinimg.com CDN. User-generated content platforms such as WordPress, Blogspot, and DeviantArt were also significant contributors, as were stock image sites like 123RF, Adobe Stock, and Shutterstock.

It's important to note that domain counts alone might not precisely reflect the actual sources of these images. Some images hosted on platforms like Pinterest might have originated from other websites.
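A domain tally of this kind can be sketched in a few lines. This is an illustration under stated assumptions, not the method used in the original analysis: the URLs are made up, and `naive_domain` simply keeps the last two labels of the hostname, which is wrong for suffixes such as `co.uk` (a real analysis would likely consult the Public Suffix List).

```python
from collections import Counter
from urllib.parse import urlparse


def naive_domain(url):
    """Return the last two labels of the hostname (naive; breaks on e.g. co.uk)."""
    host = (urlparse(url).hostname or "").lower()
    return ".".join(host.split(".")[-2:])


def top_share(counts, n):
    """Fraction of all images that come from the n most common domains."""
    total = sum(counts.values())
    top = sum(count for _, count in counts.most_common(n))
    return top / total if total else 0.0


# Made-up sample URLs.
urls = [
    "https://i.pinimg.com/originals/a.jpg",
    "https://i.pinimg.com/736x/b.jpg",
    "https://example.wordpress.com/c.png",
    "https://images.example.org/d.jpg",
]

counts = Counter(naive_domain(u) for u in urls)
share_of_top_domain = top_share(counts, 1)
```

Calling `top_share(counts, 100)` on the full dataset would give the kind of figure quoted above (about 47%). As the caveat notes, this measures where an image is hosted, not where it originated.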

Illuminating the Artists: Creative Minds in the Dataset

To see how various artists are represented, they used a list of over 1,800 artists and counted the images associated with each artist's name. Only three of the top 25 artists in the dataset are still living: Phil Koch, Erin Hanson, and Steve Henderson. The most frequently referenced artist was Thomas Kinkade, known as The Painter of Light™, with 9,268 images linked to his name.

Some artists who are popular in AI image prompting, such as Greg Rutkowski and James Gurney, were less prominent in the dataset than expected. Keep in mind, however, that these images are only a fraction of the full training data.
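One plausible way to count images associated with a name is to search each caption for the name as a whole phrase, ignoring case. The sketch below assumes that approach; the article does not specify the matching rules actually used, and the captions here are invented.

```python
import re
from collections import Counter


def count_name_mentions(captions, names):
    """Count captions that mention each name as a whole phrase, case-insensitively."""
    patterns = {
        name: re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
        for name in names
    }
    counts = Counter()
    for caption in captions:
        for name, pattern in patterns.items():
            if pattern.search(caption):
                counts[name] += 1  # one count per caption, not per occurrence
    return counts


# Invented captions for illustration.
captions = [
    "Thomas Kinkade cottage painting at dusk",
    "Landscape in the style of thomas kinkade",
    "Erin Hanson oil painting of a desert canyon",
    "Stock photo of a coffee cup",
]
names = ["Thomas Kinkade", "Erin Hanson", "Greg Rutkowski"]

mentions = count_name_mentions(captions, names)
```

The same function would work for the celebrity and fictional-character lists. Simple text matching like this can overcount common names and miss misspellings, so the resulting totals are best read as estimates.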

Celebrities in the Spotlight: The Faces We Know

Unlike some AI models, Stable Diffusion does not restrict generating images of well-known people who appear in the dataset. To assess how celebrities and other famous personalities are represented, they compiled a list of nearly 2,000 names and searched the image dataset for each one. Surprisingly, Donald Trump was one of the most cited names, with nearly 11,000 images referencing him. Charlize Theron followed closely with 9,576 images.

A cursory glance at the results suggests a notable gender skew, with many of the most popular names belonging to women. Certain internet personalities, including David Dobrik, Addison Rae, Charli D’Amelio, Dixie D’Amelio, and MrBeast, were absent from the dataset altogether. The reason for their absence is unclear.

Fictional Worlds Brought to Life: Iconic Characters in the Dataset

Fictional characters are popular subjects with users of Stable Diffusion and Craiyon alike, and remain a challenge for other AI models like DALL-E 2. To see how they are represented in the dataset, the analysis used a list of 600 characters from pop culture. Characters from the Marvel Cinematic Universe (MCU) took center stage, with Captain Marvel, Black Panther, and Captain America among the most prevalent.

Unveiling the Sensitivity: NSFW Content

Stable Diffusion distinguishes itself by its ability to handle adult content. The team designed a predictor to estimate the probability that an image contains NSFW material, recorded as a "punsafe" score. The analysis found surprisingly little explicit content: only 222 images (0.002% of the dataset) received a punsafe score of "1", indicating 100% confidence that the content is unsafe. Most images with high punsafe scores did not contain nudity.

Please exercise caution when sorting by the "punsafe" field in the images table, as it may display potentially NSFW images.
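As a quick sanity check on the percentages quoted in this post, the arithmetic can be reproduced as follows. The dataset size is taken as the rounded figure of 12 million from the text, so the results are approximate.

```python
DATASET_SIZE = 12_000_000  # rounded figure from the text, not an exact row count


def share_percent(count, total):
    """Express count as a percentage of total."""
    return 100 * count / total


# 222 images with a punsafe score of 1.
punsafe_share = share_percent(222, DATASET_SIZE)

# Pinterest's 8.5% share, converted back to an approximate image count.
pinterest_images = round(DATASET_SIZE * 8.5 / 100)
```

Here `punsafe_share` is about 0.00185, which rounds to the 0.002% quoted above, and `pinterest_images` comes to roughly 1.02 million, consistent with "over a million images".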

Conclusion: Peering into the Creative Abyss

This exploration of a subset of Stable Diffusion's training data offers a concrete look at what the image generator learned from. Sharing such datasets openly fosters trust and understanding, and helps users appreciate both the capabilities and the limitations of AI models. As AI continues to evolve, openness and transparency about training data will remain important.

Reference: https://waxy.org/2022/08/exploring-12-million-of-the-images-used-to-train-stable-diffusions-image-generator/
