Originally published on rohitraj.tech
DeepSeek turned on vision for V4 this week: image understanding is now available in chat.deepseek.com and through the API, and the launch hit the Hacker News front page on June 18, 2026. The hook for builders is the encoding cost. V4 encodes an ~800×800 image into roughly 90 KV-cache entries, versus ~870 for Claude and ~1,100 for Gemini, which is where the "10x cheaper multimodal" headline comes from. This is the builder read: what actually shipped, the OpenAI-SDK call you can paste today, where DeepSeek vision wins (OCR, documents, charts, UI screenshots), where it still loses to GPT and Gemini, an honest cost-and-capability comparison table, and how I would wire it into production with a fallback so that a single cheap model never becomes a single point of failure.
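To make the fallback idea concrete, here is a minimal sketch of the pattern in Python. It is not the wiring from the full article: `VisionCall`, `AllProvidersFailed`, and `describe_with_fallback` are hypothetical names, and each provider is assumed to be a plain function that takes an image URL and a prompt and returns text (in practice, a thin wrapper around whichever SDK call you use). The sketch tries the cheapest provider first and moves down the list on an exception or an empty answer.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical provider signature: (image_url, prompt) -> answer text.
VisionCall = Callable[[str, str], str]


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain errored or returned nothing."""


def describe_with_fallback(
    image_url: str,
    prompt: str,
    providers: Sequence[Tuple[str, VisionCall]],
) -> Tuple[str, str]:
    """Try providers in order and return (provider_name, answer).

    Order the list cheapest-first so the expensive models are only
    called when the cheap one fails.
    """
    errors: List[str] = []
    for name, call in providers:
        try:
            answer = call(image_url, prompt)
        except Exception as exc:  # broad on purpose: any failure triggers fallback
            errors.append(f"{name}: {exc}")
            continue
        if answer and answer.strip():
            return name, answer
        errors.append(f"{name}: empty response")
    raise AllProvidersFailed("; ".join(errors))
```

Returning the provider name alongside the answer lets you log how often the fallback fires, which is the number to watch before trusting the cheap path with all of your traffic.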
Read the full version with code samples, diagrams, and architecture details: DeepSeek V4 Vision: The Cheapest Multimodal API to Ship in Production (2026)
More engineering notes: rohitraj.tech/en/notes