Jimin Lee

Originally published at Medium

Giving AI Eyes - A Technical Deep Dive into Multi-Modal LLMs

Remember late 2022 when ChatGPT first dropped? It feels like ancient history in tech years, but it’s barely been three years. A lot has happened since then.

Back then, ChatGPT shocked the world with its eloquence and vast knowledge base. But it had one glaring limitation: it was text-only.

Fast forward to today, and the landscape has shifted. With models like Gemini and GPT, it’s no longer a novelty for an AI to understand images, audio, and even video. We have officially entered the era of the Multi-Modal LLM.

But how do these models actually "see" and "hear"? Let's break it down.

What is "Multi-Modal" Anyway?

Think of a "Modality" as a channel of communication. Text is one modality, images are another, and audio is yet another.

"Multi-Modal" simply means the ability to process multiple modalities simultaneously. Humans are inherently multi-modal—we speak, look, and listen simultaneously. An AI that only deals with text can never fully capture the depth of human interaction. The Multi-Modal LLM was born to solve this limitation.

So, how does it work under the hood?


1. United Under the Transformer

Research into multi-modal AI isn't new, but we’re going to focus on the architecture that’s eating the world right now: The Transformer.

Usually, when we talk about Transformers, we talk about LLMs (Large Language Models). But here’s the secret: the Transformer architecture isn't exclusive to language.

Let’s look at the high-level workflow of a standard text LLM (a minimal code sketch follows the list):

  1. Tokenization: Break long input into small chunks (Tokens).

  2. Embedding: Convert those chunks into vectors (numbers the model understands).

  3. Attention (Self-Attention): Figure out relationships between vectors (e.g., understanding that "bank" relates to "money" in a specific sentence).

  4. Transformation: Squash that relationship data into a context-aware vector.

  5. Prediction (Decoding): Predict the next token based on that context.
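
To make those five steps concrete, here is a minimal sketch using Hugging Face transformers and GPT-2 (the model is just a convenient example; any causal LM exposes the same flow):

```python
# pip install torch transformers
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "I deposited money at the bank"

# 1. Tokenization: text -> token IDs
input_ids = tokenizer(text, return_tensors="pt").input_ids

# 2. Embedding: token IDs -> vectors (GPT-2's token embedding table)
token_vectors = model.transformer.wte(input_ids)      # (1, seq_len, 768)

# 3-4. Attention + Transformation happen inside the stacked Transformer blocks
with torch.no_grad():
    logits = model(input_ids).logits                  # (1, seq_len, vocab_size)

# 5. Prediction: pick the most likely next token
next_token_id = logits[0, -1].argmax().item()
print(tokenizer.decode(next_token_id))
```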

Applying the Logic to Images and Audio

While the output remains text (for our purposes), we can apply the same input philosophy to other modalities, as the short sketches below illustrate.

For Images:

  1. Tokenization: Chop the image into small square patches.

  2. Embedding: Convert those patches into vectors.

  3. Attention: Figure out how patch A (a dog's ear) relates to patch B (a dog's tail).

  4. Transformation: Create a vector that understands the whole image.
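
In code, the image version of steps 1–2 can be sketched like this (the patch size and embedding width below are arbitrary ViT-style defaults, not tied to any particular model):

```python
import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)           # (batch, channels, height, width)
patch_size, embed_dim = 16, 768

# 1. "Tokenization" + 2. Embedding: a strided convolution chops the image
#    into 16x16 patches and projects each one to a vector in a single step,
#    as the Vision Transformer (ViT) does.
patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
patches = patch_embed(image)                  # (1, 768, 14, 14)
patch_vectors = patches.flatten(2).transpose(1, 2)    # (1, 196, 768)

# 3-4. From here, the same Transformer blocks used for text run
#      self-attention over the 196 patch vectors.
```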

For Audio:

  1. Tokenization: Slice the audio into short time segments.

  2. Embedding: Convert segments into vectors.

  3. Attention: Analyze the relationship between sound waves over time.

  4. Transformation: Create a vector representing the full audio clip.
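
The audio version follows the same recipe; the frame length and embedding size here are placeholder values (real systems typically start from a mel-spectrogram rather than raw samples, but the idea is the same):

```python
import torch
import torch.nn as nn

waveform = torch.randn(1, 16000)              # 1 second of audio at 16 kHz
frame_len, embed_dim = 400, 512               # 25 ms frames, arbitrary width

# 1. "Tokenization": slice the waveform into short, non-overlapping frames
frames = waveform.unfold(1, frame_len, frame_len)     # (1, 40, 400)

# 2. Embedding: project each frame to a vector
frame_embed = nn.Linear(frame_len, embed_dim)
frame_vectors = frame_embed(frames)                   # (1, 40, 512)

# 3-4. Self-attention over these 40 frame vectors then builds a
#      representation of the whole clip.
```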

The Key Insight: "It's All Just Vectors"

Whether it's a word token, an image patch, or an audio segment, the framework remains the same. Once you get past step 2 (Embedding), the model doesn't really care if the input was originally a pixel or a vowel. To the model, it's just a pile of vectors to process.

This insight inspired domain-specific variants like Vision Transformers (ViT) for images and Conformers for audio, which combine convolution and attention to model temporal signals.

And that brings us to the core idea of Multi-Modal LLMs:

"If it's all vectors, why treat them separately? Let's throw text tokens, image pixels, and audio signals into one pot and let the model figure out the relationships between them all."

This allows the AI to understand mathematically that the pixel vector for "dog" and the text vector for "cute" are deeply related.
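
In code, "throwing it all into one pot" is nothing more exotic than concatenating the embedded sequences before they enter the Transformer. A schematic sketch, reusing the made-up shapes from the sketches above:

```python
import torch

embed_dim = 768

# Pretend these came out of the text, image, and audio embedders sketched above
text_vectors  = torch.randn(1,  12, embed_dim)   # 12 text tokens
image_vectors = torch.randn(1, 196, embed_dim)   # 196 image patches
audio_vectors = torch.randn(1,  40, embed_dim)   # 40 audio frames

# "One pot": a single sequence of 12 + 196 + 40 = 248 vectors.
# Self-attention can now relate any element to any other,
# e.g. a text token to an image patch.
multimodal_sequence = torch.cat([text_vectors, image_vectors, audio_vectors], dim=1)
print(multimodal_sequence.shape)   # torch.Size([1, 248, 768])
```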


2. Two Ways to Build a Multi-Modal LLM

Theory is great, but how do we actually build one? There are two main schools of thought.

Approach A: Natively Multi-Modal

This is the path taken by Google’s Gemini or OpenAI’s GPT. You design the model from day one to be trained on text, images, and audio simultaneously.

Everything is treated as a token from the start. Text is tokenized, images are tokenized, audio is tokenized. They are all mapped to the same embedding space. This is powerful because the model can understand nuance—like realizing a speaker is angry not just by what they said, but by how they sounded.

The Downside:

It’s incredibly expensive and difficult.

You need to answer a hard question: How do you prove to a model that the word "Soccer" and a photo of a soccer match are related? You need massive datasets that explicitly link these different modalities. Collecting that data and training a massive model on it is a luxury reserved for tech giants with bottomless budgets.

Approach B: The Component-Based Assembly

Researchers with tighter budgets asked, "Can't we just assemble existing parts?"

Instead of building a monster from scratch, they combine pre-trained experts:

  1. The Eyes (Vision Encoder): A model great at seeing (e.g., CLIP, SigLIP).

  2. The Brain (LLM): A model great at talking (e.g., Llama, Vicuna).

  3. The Glue (Adapter): A module to connect the two.

You take a frozen Vision Encoder and a frozen LLM. Since they speak different "vector languages," you build a lightweight Adapter to translate the visual signals into something the LLM understands.

This method is efficient, cheaper, and surprisingly effective. It has become the standard for open-source multi-modal models. Let's look at the poster child for this approach: LLaVA.


3. LLaVA: The Open Source Textbook

LLaVA (Large Language and Vision Assistant) made waves by proving you don't need a complex architecture or proprietary data to build a top-tier Multi-Modal LLM.

A Simple Architecture

Before LLaVA, researchers used complex connectors (like Q-Formers). LLaVA did something bold:

"Let's just connect them with a single Linear Layer."

And it worked.

  • Vision Encoder (CLIP ViT-L/14): Turns images into vectors.

  • Projection Layer (Linear): A simple translation layer that turns image vectors into word embedding vectors.

  • LLM (Vicuna): Takes the translated signals and generates a response.
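
Here is a schematic of how those three pieces fit together at inference time. This is a simplified sketch of the wiring, not LLaVA's actual code; the widths loosely follow CLIP ViT-L/14 (1024) and a 7B-parameter LLM like Vicuna (4096), and the dummy tensors stand in for the frozen encoder's output and the embedded text prompt:

```python
import torch
import torch.nn as nn

vision_dim, llm_dim = 1024, 4096     # CLIP ViT-L/14 width, Vicuna-7B width

# The Glue: LLaVA's original adapter really is a single linear layer
projection = nn.Linear(vision_dim, llm_dim)

def build_input_sequence(image_features, text_embeddings):
    """Translate image features into the LLM's embedding space and
    prepend them to the text embeddings as one input sequence."""
    image_tokens = projection(image_features)           # (1, num_patches, 4096)
    return torch.cat([image_tokens, text_embeddings], dim=1)

# Dummy stand-ins for the frozen Vision Encoder output and the embedded prompt
image_features  = torch.randn(1, 256, vision_dim)       # patch features from CLIP
text_embeddings = torch.randn(1,  20, llm_dim)          # "What is in this image?"

sequence = build_input_sequence(image_features, text_embeddings)
print(sequence.shape)    # torch.Size([1, 276, 4096]) -> fed to the LLM as usual
```

The LLM then decodes text from this combined sequence exactly as it would from a normal prompt.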

Data Alchemy

The architecture is simple, but where do you get the training data?

LLaVA’s team used a clever trick. There are plenty of image-caption pairs and bounding box coordinates. Text is the common denominator here.

They fed those captions and coordinates into a text-only GPT-4 and asked it to generate conversations. For example,

Prompt: "I have an image where a person is at [coordinate X] and a car is at [coordinate Y]. The caption is 'A man walking by a car.' Generate a conversation a user might have about this image."

The example above comes from the LLaVA paper. In the paper's accompanying figure, text-only GPT-4 is fed the two context items shown at the top (Context type 1 & 2: the caption and the bounding-box coordinates) and asked to generate the three types of training data listed below them (Response type 1, 2, & 3).

GPT-4 hallucinated a conversation based on the text description, effectively creating labeled multi-modal training data without ever seeing the image itself.

By pairing GPT-4’s generated text responses with existing captioned image datasets, the LLaVA team compiled roughly 158K (Image, Question, Answer) examples.
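
A rough sketch of how such a prompt could be assembled from existing caption and bounding-box annotations. The function name and wording here are made up for illustration; the paper's actual prompt templates are more elaborate:

```python
def build_gpt4_prompt(caption: str, boxes: dict[str, list[float]]) -> str:
    """Turn text-only image annotations into a prompt for a text-only LLM."""
    box_lines = "\n".join(
        f"- {label}: bounding box {coords}" for label, coords in boxes.items()
    )
    return (
        "You are looking at an image described only by the following text.\n"
        f"Caption: {caption}\n"
        f"Objects and coordinates:\n{box_lines}\n\n"
        "Generate a realistic conversation between a user asking about this "
        "image and an assistant answering, as if both could see it."
    )

prompt = build_gpt4_prompt(
    caption="A man walking by a car.",
    boxes={"person": [0.21, 0.30, 0.45, 0.92], "car": [0.50, 0.40, 0.95, 0.88]},
)
# Send `prompt` to a text-only model and pair its output with the real image
# to get (Image, Question, Answer) training examples.
```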


4. The 2-Stage Training Pipeline

So how do you train it? LLaVA uses a 2-Stage Curriculum Learning process.

Stage 1: Feature Alignment

The goal of this stage is to establish a primary connection between visual information and linguistic information. It is a process of teaching the LLM that "this specific chunk of image vectors" is highly correlated with the word "soccer."

During this phase, the Vision Encoder and the LLM are Frozen, and only the Projection (Adapter) is trained. Since the objective is solely to learn the relationship between visual and linguistic data, training just the Projection layer is sufficient.

We use (Image, Caption) pairs for this stage. This allows the Projection layer to learn the relationship between photos composed of pixels and the language that describes them.

Stage 2: Visual Instruction Tuning

After passing Stage 1, the model becomes capable of converting image inputs into a format the LLM can understand. However, it is not yet capable of providing intelligent answers about the image. We need to push the LLM beyond simple understanding so that it can generate intelligent responses.

In this stage, the Vision Encoder remains Frozen. This is because our goal is to train the LLM (the speaker), not the Vision Encoder (the "eye" that understands images).

The LLM becomes a training target because it must learn how to respond based on this new type of input (images). The Projection layer is also trained. Although it learned the basics in Stage 1, we need to fine-tune the "Image Info → Language Info" conversion as the LLM learns to generate actual responses.

For training data, we use the dataset created earlier using the clever GPT-4 method. Since this data consists of (Image, Question, Answer) triples, the LLM finally learns how to provide answers that align with both the image and the specific question.
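
In PyTorch terms, the difference between the two stages boils down to which parameters have gradients enabled. A minimal sketch, with illustrative module names rather than LLaVA's actual attribute names:

```python
import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool) -> None:
    for param in module.parameters():
        param.requires_grad = trainable

def configure_stage(vision_encoder, projection, llm, stage: int) -> None:
    # The Vision Encoder stays frozen in both stages.
    set_trainable(vision_encoder, False)
    if stage == 1:
        # Stage 1 (Feature Alignment): only the Projection learns,
        # from (Image, Caption) pairs.
        set_trainable(projection, True)
        set_trainable(llm, False)
    else:
        # Stage 2 (Visual Instruction Tuning): Projection and LLM both learn,
        # from (Image, Question, Answer) triples.
        set_trainable(projection, True)
        set_trainable(llm, True)
```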


5. Evolution: LLaVA 1.5 and LLaVA-NeXT

The research didn't stop there.

LLaVA 1.5 swapped the single Linear Layer for an MLP (Multi-Layer Perceptron) and upgraded the Vision Encoder. These small tweaks resulted in massive performance gains.

LLaVA-NeXT tackled the resolution problem. Standard models squash every image into a square (e.g., 336x336). This ruins tall smartphone screenshots or wide panoramas—text gets blurry and unrecognizable.

LLaVA-NeXT introduced AnyRes, a strategy that slices the original image into smaller tiles (e.g., slicing a tall image into three vertical squares) and feeds them along with a global view. This allows the model to see high-resolution details without distortion.
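
Conceptually, the tiling step could look like the sketch below. This is a simplification rather than LLaVA-NeXT's actual implementation (which also selects a grid layout from a predefined set of resolutions); the 336-pixel tile size is just an example:

```python
import torch
import torch.nn.functional as F

def anyres_tiles(image: torch.Tensor, tile: int = 336) -> list[torch.Tensor]:
    """Split an image into tile-sized crops plus one downscaled global view."""
    _, height, width = image.shape
    tiles = [
        image[:, top:top + tile, left:left + tile]
        for top in range(0, height, tile)
        for left in range(0, width, tile)
    ]
    # Global view: the whole image squashed to a single tile, for overall context
    global_view = F.interpolate(image.unsqueeze(0), size=(tile, tile), mode="bilinear")
    tiles.append(global_view.squeeze(0))
    return tiles

tall_screenshot = torch.randn(3, 1008, 336)     # a tall phone screenshot
print(len(anyres_tiles(tall_screenshot)))       # 3 vertical tiles + 1 global view = 4
```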


6. The Hurdles That Remain

We’ve come a long way, but it’s not perfect.

Object Hallucination

Hallucinations—where text models invent plausible lies—manifest in even stranger ways in Multi-Modal models. For example, you show it a photo of an empty desk, and it insists, "There’s a cup on the desk." Or it looks at a picture of a single cat and claims, "Several cats are playing together."

Here are a few known causes:

LLMs are trained on massive amounts of text, learning that the word "cup" follows the word "desk" with a very high probability. Even if there is no cup in the actual image, the LLM tends to follow these pre-learned linguistic statistical probabilities.

Sometimes, the Attention mechanism simply misfires. If the model focuses on the wrong part of the image while generating an answer, hallucinations can occur.

Techniques similar to RLHF and preference-based fine-tuning are being explored to reduce visual hallucination—teaching models to prioritize visual evidence over text priors.

The Resolution Dilemma

We take photos in 4K, but the AI model sees them squashed down to the size of a postage stamp. Text, in particular, gets distorted and becomes illegible. It’s hard enough to read the fine print on insurance policies as it is, let alone when it's crushed by the model. Distant road signs face the same issue.

The difficulty of using large images is similar to the challenge of increasing the Context Window in LLMs. Due to the Self-Attention structure, if the input length grows by a factor of N, the computation grows by a factor of N^2. Since images are 2D, the problem is even more severe.

For example, if you double both the width and height of an image from 200×200 to 400×400, the pixel count increases 4x. Consequently, the Self-Attention computation jumps by the square of that: 16x. The computational load explodes.
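
A quick sanity check of that arithmetic, counting 16×16 patches as tokens (using 224 and 448 here so the patches divide evenly):

```python
def attention_pairs(width: int, height: int, patch: int = 16) -> int:
    """Self-attention cost grows with the square of the token (patch) count."""
    tokens = (width // patch) * (height // patch)
    return tokens * tokens

small = attention_pairs(224, 224)    # 196 tokens  ->  38,416 pairs
large = attention_pairs(448, 448)    # 784 tokens  -> 614,656 pairs
print(large / small)                 # 16.0: double the resolution, 16x the attention cost
```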

To solve this, strategies like AnyRes (used in LLaVA-NeXT), which slice the original image into multiple smaller patches for processing, are proving helpful.

The Modality Gap

Whether it's Natively Multi-Modal or a component-based Adapter approach, the ultimate goal of a Multi-Modal LLM is to represent various modalities in a single unified way. However, this process isn't perfect. There is an inevitable, uncrossable river between analog signals (images, audio) and symbols (text).

For instance, if I show a Multi-Modal LLM a photo of myself looking up at the sky with a gloomy expression on a rainy day and ask, "Describe the emotions you feel in this picture," it will likely just respond:

"A man is looking at the sky on a rainy day."

It captures the facts, but misses the feeling.


7. Conclusion

Multi-Modal LLMs are no longer optional; they are the standard.

While we currently use them to chat about photos, these models are the brains of the future. They will power robots that navigate messy living rooms and autonomous cars that interpret complex traffic signals.

The fusion won't stop at images. We are moving toward models that integrate sound, video, and eventually, sensory data from the physical world.

Try it out tonight. Open your fridge, snap a photo, and ask an LLM, "What should I cook for dinner?" You’ll get a taste of the future, right in your kitchen.
