
Luca Sammarco

Posted on • Originally published at sammapix.com


Alt text is no longer optional. It is a ranking factor, an accessibility requirement, and increasingly generated by AI. But how accurate are these AI-generated descriptions, really? I tested three leading models: Google Gemini 2.5 Flash, GPT-4o, and Claude 3.5 Sonnet. I ran all three on 200 real photographs across five categories, and honestly, the results surprised me. Some models consistently misidentified objects, others generated descriptions too generic to have SEO value, and one model stood out for e-commerce product photos.

This is the first public benchmark comparing AI alt text quality with actual accuracy scores, SEO usefulness ratings, and accessibility compliance checks. Every image was scored on four criteria: factual accuracy, SEO keyword inclusion, accessibility usefulness, and appropriate length.

Methodology

I selected 200 photographs split evenly across five categories: portraits (40), landscapes (40), e-commerce products (40), screenshots/UI (40), and food (40). Images were sourced from real production environments, including my own travel photography, client e-commerce catalogs, open-source UI projects, and stock photo libraries.

Each image was processed through all three models using their respective APIs: Google Gemini 2.5 Flash via the Gemini API, GPT-4o via OpenAI's vision endpoint, and Claude 3.5 Sonnet via Anthropic's messages API. Each model received the same prompt: "Generate alt text for this image. The alt text should be concise, descriptive, and suitable for SEO and screen readers."
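To make the comparison concrete, here is roughly how an image plus that prompt gets packaged for one of the three endpoints. The sketch below builds the request body for OpenAI's chat completions endpoint using an inline base64 data URL; the function name `openai_vision_payload` is mine, and the Gemini and Anthropic requests follow the same pattern with their own field names (check each provider's docs for the exact schemas).

```python
import base64

# The exact prompt sent to all three models in the benchmark.
PROMPT = (
    "Generate alt text for this image. The alt text should be concise, "
    "descriptive, and suitable for SEO and screen readers."
)

def openai_vision_payload(image_bytes: bytes, mime: str = "image/jpeg") -> dict:
    """Build a chat-completions request body with an inline base64 image.

    Illustrative sketch: field names follow OpenAI's vision message format;
    the other two providers use the same idea with different schemas.
    """
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": "gpt-4o",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:{mime};base64,{b64}"}},
            ],
        }],
    }
```

You would POST this body to the API with your key. Sending the identical prompt to all three models is what keeps the comparison fair.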

I scored every output on four criteria, each rated 1 to 10:

- **Factual accuracy:** Does it correctly describe what is in the image?
- **SEO value:** Does it include relevant keywords a real user would search for?
- **Accessibility:** Would a screen reader user understand the image?
- **Length:** Is it the right length? Under 10 words is too vague; over 40 words creates clutter.

The overall score is the unweighted average of all four criteria. I scored every single output manually, not with another AI model. I wanted to make sure there was real human judgment on factual accuracy and real-world usefulness.
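In code, the rubric reduces to a four-way average plus the word-count rule from the length criterion. A minimal sketch (the function names are my own, not a published library):

```python
def overall_score(accuracy: float, seo: float,
                  accessibility: float, length: float) -> float:
    """Unweighted average of the four 1-10 criterion scores."""
    return (accuracy + seo + accessibility + length) / 4

def length_in_range(alt_text: str) -> bool:
    """Length rule: under 10 words is too vague, over 40 creates clutter."""
    word_count = len(alt_text.split())
    return 10 <= word_count <= 40
```

For example, `overall_score(8.5, 7.2, 8.1, 7.3)` gives GPT-4o's 7.775, which rounds to the reported 7.8.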

Overall results

Here are the aggregate scores across all 200 images, each criterion rated 1 to 10:

| Model | Accuracy | SEO value | Accessibility | Length | Overall |
| --- | --- | --- | --- | --- | --- |
| Gemini 2.5 Flash | 8.2 | 7.8 | 7.5 | 8.0 | 7.9 |
| GPT-4o | 8.5 | 7.2 | 8.1 | 7.3 | 7.8 |
| Claude 3.5 Sonnet | 8.7 | 6.9 | 8.4 | 6.8 | 7.7 |

I was surprised by how close the overall scores are. Just 0.2 points separate first from last. But the individual criteria tell a very different story. Claude is the most accurate model but scores lowest overall because its descriptions are consistently too long. Gemini wins not because it is the smartest, but because it produces the most practical alt text: the right length, with the right keywords, at the right level of detail.

Results by category

The aggregate scores hide significant differences across image types. For portraits, Claude wins with an accuracy score of 8.9, the highest single-category score in the entire benchmark. Claude excels at detecting emotions, context clues, and even approximate age ranges. The tradeoff is length: Claude averaged 48 words for portraits, which is excessive for alt text.

For landscapes, Gemini wins with the highest overall category score of 8.2. What sets Gemini apart is its ability to identify specific locations. Where GPT-4o might describe "a mountain range with a lake in the foreground," Gemini consistently identified landmarks like "Mount Fuji reflected in Lake Kawaguchi at sunrise."

For e-commerce products, Gemini dominates with an SEO score of 8.4, the highest individual SEO score in the entire benchmark. Gemini naturally includes product-relevant keywords that match actual search queries: material, color, product type, and style descriptors.

For screenshots and UI images, GPT-4o dominates with a category-best accuracy of 8.8. GPT-4o's strength is its ability to read text embedded in images, including button labels, menu items, error messages, and code snippets.

For food, it is a virtual tie between GPT-4o and Gemini at 8.0 versus 7.9 overall. Both models are strong at identifying ingredients and dish types.

5 key findings

Finding one: Gemini generates the most SEO-friendly descriptions. Gemini 2.5 Flash scored 7.8 out of 10 on SEO value, the highest of any model. For product images, Gemini included brand names, materials, and colors 87% of the time.

Finding two: Claude is the most accurate but often too verbose. Claude 3.5 Sonnet achieved an 8.7 accuracy score, 0.5 points above Gemini. However, Claude averaged 45 words per description compared to Gemini's 22 words.

Finding three: All three models fail on culturally specific content. When I tested images of traditional clothing, religious ceremonies, and regional food, all three models showed significant blind spots. Across the full test set, 31% of culturally specific items were misidentified or described too generically.

Finding four: GPT-4o is the best model for screenshots and UI images. GPT-4o scored 8.8 on accuracy for screenshots, the best of the three models in that category. Its advantage is OCR: GPT-4o reads and incorporates text visible in the image.

Finding five: For e-commerce, AI alt text outperforms human-written alt text 73% of the time. I compared AI-generated alt text to the existing human-written alt text for 40 e-commerce product images; the AI version scored higher in 73% of cases. The reason is predictable: humans tend to write alt text that is either too short or stuffed with marketing language. AI models produce descriptive, natural-language alt text that better matches how users actually search.

Which model should you use?

For e-commerce product images, use Gemini 2.5 Flash. Highest SEO value, optimal length, and the fastest and cheapest per image.

For blog and editorial content, use GPT-4o. Best balance of accuracy, SEO value, and readability, averaging 26 words.

For accessibility compliance, use Claude 3.5 Sonnet. Highest accessibility score and factual accuracy, though you may want to trim descriptions to 30 words or fewer.
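If you do trim Claude's output, cutting mid-sentence ruins the description for screen reader users, so trim at sentence boundaries instead. A hypothetical helper, assuming sentences end with `.`, `!`, or `?`:

```python
import re

def trim_alt_text(text: str, max_words: int = 30) -> str:
    """Keep whole sentences until the word budget is reached.

    Hypothetical helper: the 30-word default matches the recommendation
    above, and the first sentence is always kept even if it alone
    exceeds the budget.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept, count = [], 0
    for sentence in sentences:
        words = len(sentence.split())
        if kept and count + words > max_words:
            break
        kept.append(sentence)
        count += words
    return " ".join(kept)
```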

For batch processing at scale, use Gemini 2.5 Flash. Fastest response time at 0.8 seconds per image versus 1.4 seconds for GPT-4o and 1.9 seconds for Claude.
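At 0.8 seconds per image, 200 images still take close to three minutes sequentially, so batch jobs are worth parallelizing. A minimal sketch using Python's standard library, where `generate` stands in for whatever API wrapper you use:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable

def batch_alt_text(paths: Iterable[str],
                   generate: Callable[[str], str],
                   max_workers: int = 8) -> list[str]:
    """Run an alt-text generator over many images concurrently.

    API calls are I/O-bound, so threads are enough; results come back
    in input order. Mind your provider's rate limits when raising
    max_workers.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(generate, paths))
```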

How SammaPix uses AI alt text

SammaPix uses Gemini 2.5 Flash for its AI Alt Text generator, based on the results of this benchmark. The choice was driven by three factors: highest overall score of 7.9, best SEO value of 7.8, and optimal length averaging 22 words. The tool is browser-based, your images are processed locally, and the free tier includes 10 images per day with no account required.



Try it free: SammaPix — 27 browser-based image tools. Compress, resize, convert, remove background, and more. Everything runs in your browser, nothing uploaded.
