DEV Community

Devanshu Biswas
Devanshu Biswas

Posted on

Multimodal AI: One Model That Sees, Reads, and Hears

The models you use now don't just read — they see, hear, and read images too. GPT-4o, Gemini, Claude with vision: all multimodal. The trick that makes it work is the same embeddings idea, stretched across senses. Here's how, visualized.

👁️‍🗨️ Watch modalities meet in one space: https://dev48v.infy.uk/ai/days/day16-multimodal.html

One shared space for everything

Each input type gets its own encoder — a vision encoder for images, an audio encoder for sound, the tokenizer for text — and they all project into ONE shared embedding space. Trained on paired data (image + caption), the model learns to put a dog photo near the word "dog." That alignment is what lets it reason across senses (the CLIP idea, scaled up).

What it unlocks

  • Captioning / visual Q&A — "what's in this image?", "is this chart going up?"
  • OCR — read text inside a photo.
  • Audio understanding — transcribe and answer about speech.
  • Any-to-any — increasingly, generate across modalities too.

How the pieces fit

Modality → its encoder → shared representation → one transformer reasons over the mix → output. Text was just the first modality; the architecture didn't have to change much to add the rest.

Caveats

Still hallucinates, costs more (images = many tokens), and modalities can be unevenly strong. But "talk to it with a picture" is now table stakes.

🔨 Build it (call a multimodal model with image + text → caption / answer; transcribe audio) on the page: https://dev48v.infy.uk/ai/days/day16-multimodal.html

Part of AIFromZero. 🌐 https://dev48v.infy.uk

Top comments (0)