The models you use now don't just read — they see, hear, and read images too. GPT-4o, Gemini, Claude with vision: all multimodal. The trick that makes it work is the same embeddings idea, stretched across senses. Here's how, visualized.
👁️🗨️ Watch modalities meet in one space: https://dev48v.infy.uk/ai/days/day16-multimodal.html
One shared space for everything
Each input type gets its own encoder — a vision encoder for images, an audio encoder for sound, the tokenizer for text — and they all project into ONE shared embedding space. Trained on paired data (image + caption), the model learns to put a dog photo near the word "dog." That alignment is what lets it reason across senses (the CLIP idea, scaled up).
What it unlocks
- Captioning / visual Q&A — "what's in this image?", "is this chart going up?"
- OCR — read text inside a photo.
- Audio understanding — transcribe and answer about speech.
- Any-to-any — increasingly, generate across modalities too.
How the pieces fit
Modality → its encoder → shared representation → one transformer reasons over the mix → output. Text was just the first modality; the architecture didn't have to change much to add the rest.
Caveats
Still hallucinates, costs more (images = many tokens), and modalities can be unevenly strong. But "talk to it with a picture" is now table stakes.
🔨 Build it (call a multimodal model with image + text → caption / answer; transcribe audio) on the page: https://dev48v.infy.uk/ai/days/day16-multimodal.html
Part of AIFromZero. 🌐 https://dev48v.infy.uk
Top comments (0)