DeepSeek Vision Goes Wide: Multimodal AI at 10x Lower Cost Than Claude

#ai #technology #deepseek #opensource

DeepSeek is rolling out native vision capabilities to users worldwide, marking the Chinese AI lab's long-awaited entry into multimodal AI. The company's new image recognition mode lets users upload photos, screenshots, documents, and charts directly into the chat interface — and it costs roughly one-tenth what Claude charges for the same task.

The vision mode first appeared in limited beta on April 29, available to select users of DeepSeek's web and mobile apps. It is now expanding to a broader audience and appears alongside the existing Flash and Expert modes in the chat interface.

What DeepSeek Vision Actually Does

Unlike simple OCR tools, DeepSeek's vision mode understands images holistically. Users can upload an invoice and ask for totals, show a screenshot and request specific data extraction, or present a chart and get trend analysis. The model processes images directly within the same architecture used for text.

The 10x Efficiency Advantage

DeepSeek V4 uses roughly 90 KV cache entries per image, compared with around 870 for Claude 3.5 Sonnet, a nearly 10x compression advantage (870 / 90 ≈ 9.7). Combined with lower per-token pricing, total vision costs come out 10x to 120x lower than competitors'.
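To make the arithmetic behind these figures concrete, here is a minimal Python sketch. The per-image entry counts are the ones quoted above; the function names, the price multipliers, and the idea that cost scales linearly with entries times unit price are illustrative assumptions, not published pricing.

```python
# Per-image KV cache entry counts as quoted in the article.
DEEPSEEK_ENTRIES_PER_IMAGE = 90
CLAUDE_ENTRIES_PER_IMAGE = 870


def compression_ratio(baseline_entries: float, compact_entries: float) -> float:
    """How many times fewer entries the compact model needs per image."""
    return baseline_entries / compact_entries


def cost_ratio(
    baseline_entries: float,
    compact_entries: float,
    baseline_unit_price: float,
    compact_unit_price: float,
) -> float:
    """Relative cost per image.

    Assumption (not from the article): cost is proportional to
    entries * unit price, so the two advantages multiply.
    """
    return (baseline_entries * baseline_unit_price) / (
        compact_entries * compact_unit_price
    )


ratio = compression_ratio(CLAUDE_ENTRIES_PER_IMAGE, DEEPSEEK_ENTRIES_PER_IMAGE)
print(f"Compression advantage: {ratio:.1f}x")

# Hypothetical price gaps, chosen only to show how the range could arise.
for price_multiple in (1.0, 12.0):
    total = cost_ratio(
        CLAUDE_ENTRIES_PER_IMAGE,
        DEEPSEEK_ENTRIES_PER_IMAGE,
        baseline_unit_price=price_multiple,
        compact_unit_price=1.0,
    )
    print(f"Price gap {price_multiple:g}x -> total cost gap {total:.0f}x")
```

Under this simplified model, compression alone accounts for the low end of the quoted range, and a per-token price gap of roughly 12x would be needed to reach the high end.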

Built on DeepSeek V4

The vision capabilities ride on top of DeepSeek V4, which was released in April in two variants: V4 Pro (1.6T total parameters, 49B active) and V4 Flash (284B total, 13B active). Both support 1M-token context windows.
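The gap between total and active parameters is easier to see as a percentage. The short Python sketch below uses only the parameter counts quoted above; reading "active" as the parameters used per token is an assumption, and the names are illustrative.

```python
# Parameter counts in billions, as quoted in the article: (total, active).
V4_VARIANTS = {
    "V4 Pro": (1600, 49),
    "V4 Flash": (284, 13),
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the model's parameters that are active (assumed: per token)."""
    return active_b / total_b


for name, (total_b, active_b) in V4_VARIANTS.items():
    share = active_fraction(total_b, active_b)
    print(f"{name}: {active_b}B of {total_b}B active ({share:.1%})")
```

On these figures, V4 Pro activates about 3.1% of its parameters and V4 Flash about 4.6%.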

Full article: DeepSeek Vision Goes Wide on TekMag