Imagine you're trying to help a friend book a flight over the phone. They send you a screenshot of the airline's website, then a voice memo describing what they want, then a short video they recorded of the confusing pop-up that keeps blocking the booking button. You glance at the picture, listen to the memo, watch the clip, and tell them: Click the small "X" in the upper right of the gray box, then scroll down past the seat selector.
You did something extraordinary in that moment, and you didn't notice. You combined four different streams of information — pixels, sound waves, motion over time, and the memory of the conversation up to that point — and you produced one coherent answer. You did it in maybe ten seconds.
For most of the past decade, getting a computer to do this has been the kind of problem that quietly drives researchers to drink. Not because any single piece is impossible — there are systems that read text well, systems that recognize images, systems that transcribe audio. The hard part is the "all at once." Each of those streams arrives in a completely different format. Pixels are arranged in grids. Sound is a wiggling line over time. Text is a sequence of symbols. Video is pixels arranged in a grid that also changes over time. Forcing one system to understand all of them, and to reason across them coherently, has felt a bit like asking a panel of translators to collaboratively write a poem when none of them speak the same language.
NVIDIA's new paper, Nemotron 3 Nano Omni, is an attempt at that poem. It is one model — one set of digital "brain weights" — that natively accepts text, images, video, and now audio, and tries to reason across them as a single unified mind. The paper is, in a sense, a progress report from the front lines of a research direction that has slowly been changing what computers can do for ordinary people. It's worth understanding, because the limitation it solves is one we've all bumped into.
The Old Way: Four Specialists in Four Rooms
Before models like this one, the standard approach to a multimodal problem was assembly-line specialization. You'd have a model that "saw" the image and described it in words. Another model that transcribed the audio into text. Another that summarized the video. And then you'd hand all those translated descriptions to a fourth model — a language model — and ask it to make sense of the situation.
The analogy is a courtroom where the judge is blindfolded, and four witnesses each describe what they saw, heard, or read. The judge can only work with their secondhand reports. If a witness misses something, or paraphrases away an important nuance, the judge has no way to go back to the original. "He pointed angrily" doesn't capture whether the gesture was at someone in particular or just generally toward the door. The judge is stuck with the summary.
What researchers wanted instead was a judge who could see, hear, and read the evidence directly — the same mind processing all of it.
That's what omni-modal models are. The "omni" part is the ambition: one mind, every channel.
A Crowd of Specialists, but Only the Right One Speaks
The first big architectural choice in Nemotron 3 Nano Omni is something called a Mixture-of-Experts, or MoE, backbone. The label sounds technical, but the idea is intuitive once you put it in the right setting.
Picture a small-town doctor's office where the family physician has to handle everything — broken bones, allergic reactions, mental health, pediatric concerns, geriatric medicine. They do their best, but they're stretched thin. Now picture a hospital with twenty specialists. When you walk in with a sore ankle, the cardiologist, the dermatologist, and the psychiatrist don't bother coming to your appointment. The receptionist routes you to the orthopedist, who handles the case efficiently because that's all they do all day. The hospital, as a whole, has more accumulated expertise than the family doctor — but you only consume the time of the relevant specialist.
Mixture-of-Experts works exactly this way inside the model. The "30B-A3B" name in the paper is a coded way of saying: there are roughly 30 billion total parameters of expertise stored in the model, but for any given chunk of work, only about 3 billion get activated. There's a built-in router — the receptionist — that looks at each piece of incoming data and sends it to the experts most equipped to handle it. The result is a model that has the knowledge of a much larger system but the speed of a smaller one.
This matters enormously for things like video, where the model is processing thousands of small pieces of information per second. If every piece woke up every neuron, the system would crawl. With MoE, only the relevant neurons fire.
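If you like seeing ideas in code, here's a toy version of that receptionist in plain Python and NumPy. Nothing here matches NVIDIA's actual implementation — the expert count, the routing scores, and the weights are all invented for illustration — but the shape of the idea is the same: many experts stored, only a few activated per token.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class MoELayer:
    """Toy Mixture-of-Experts layer: many experts stored, few activated per token."""
    def __init__(self, dim, n_experts=8, top_k=2):
        self.top_k = top_k
        # The router ("receptionist"): one score per expert for each incoming token.
        self.router = rng.normal(size=(dim, n_experts))
        # Each "expert" is just a small weight matrix in this sketch.
        self.experts = [rng.normal(size=(dim, dim)) for _ in range(n_experts)]

    def forward(self, token):
        scores = softmax(token @ self.router)       # receptionist ranks the experts
        chosen = np.argsort(scores)[-self.top_k:]   # wake only the top-k of them
        out = np.zeros_like(token)
        for i in chosen:                            # the other experts stay asleep
            out += scores[i] * (token @ self.experts[i])
        return out, chosen

layer = MoELayer(dim=16)
out, chosen = layer.forward(rng.normal(size=16))
print(f"{len(chosen)} of {len(layer.experts)} experts did any work")
```

The ratio in the sketch — 2 experts active out of 8 stored — is the same principle as the paper's "30B-A3B": roughly a tenth of the parameters wake up for any given token.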
How the Model Sees: Through Two Translators
Before a model can reason about a photograph, the photograph has to be turned into something the model's mathematical machinery can chew on. The Nemotron paper uses two specialized "translators" sitting at the gates: a vision encoder called C-RADIOv4-H and an audio encoder called Parakeet-TDT.
Think of these as people who have spent their entire careers learning how to describe one specific kind of thing. The vision encoder is an art critic who has stared at hundreds of millions of photographs over their training and can convert any new image into a kind of compact poetic shorthand — not English words, exactly, but a vector of numbers that captures the essence of what's in the picture. The audio encoder is the same idea but for sound: a music historian who has listened to millions of recordings and can compress any clip into a similar compact summary.
The cleverness is that both translators write their summaries in the same code. So when the main reasoning model receives the encoded image and the encoded audio, it doesn't see them as foreign objects. They've been pre-translated into the model's native tongue — the language of high-dimensional vectors. The reasoning brain in the middle can then think about the sound of someone saying "look at this" and the image they're pointing at as one continuous thought, instead of two separate problems.
Don't Cut the Painting Into Puzzle Pieces
One of the more elegant changes in this version of the model is in how it handles images of varying sizes — particularly the extremely tall, narrow, or wide images we encounter constantly: a screenshot of a long article, a panoramic photo, a scan of a legal document.
The previous standard approach was called tiling. Imagine you have a massive painting and your eyes can only focus on a small square at a time. Tiling means cutting the painting into many small squares and inspecting each one separately. The problem is obvious: you lose the composition. The relationship between the figure in the lower left and the doorway in the upper right is gone, because you never saw them together. For a document, this might mean the model loses track of how a chart at the top of page three relates to the paragraph that introduces it.
Nemotron 3 replaces tiling with dynamic resolution, which is closer to how a human actually looks at things. Instead of slicing the image into uniform pieces, the model adapts its viewing strategy to the image's actual shape — looking at the whole thing while preserving the relationships between regions. It's the difference between studying a painting through a keyhole, square by square, and stepping back to take it in.
The result, the paper claims, is significant gains on real-world document understanding — exactly the situations where awkward aspect ratios matter most.
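Here's a rough sketch of what "adapting the viewing strategy" can mean in practice: pick a patch grid that matches the image's aspect ratio under a fixed token budget, rather than cutting fixed square tiles. This is an illustration of the general idea, not the paper's exact algorithm.

```python
import math

def dynamic_grid(width, height, patch=16, max_tokens=1024):
    """Choose a patch grid (cols x rows) that keeps the image's aspect ratio
    and fits under a token budget, instead of slicing into fixed square tiles.
    Illustrative only -- not the paper's actual resizing scheme."""
    aspect = width / height
    # Largest grid with cols/rows ~= aspect and cols * rows <= max_tokens.
    rows = max(1, int(math.sqrt(max_tokens / aspect)))
    cols = max(1, min(int(rows * aspect), max_tokens // rows))
    return cols * patch, rows * patch, cols * rows

# A tall document scan keeps its shape: few columns, many rows.
w, h, n = dynamic_grid(800, 3200)
print(w, h, n)  # 256 1024 1024 -- tall grid, full token budget used
```

A tall scan gets a tall grid and a panorama gets a wide one, so the model sees the whole composition at once instead of a stack of disconnected squares.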
Time, Compressed
Video is uniquely difficult because it adds a fourth dimension: time. A one-minute video at thirty frames per second is 1,800 separate images. Naively, the model would have to process all of them — an enormous waste, since most adjacent frames are nearly identical.
Nemotron 3 Nano Omni introduces a technique using something called Conv3D-based temporal compression. The technical name doesn't help; the analogy does. Think of a film editor who has been told to summarize a long take. They watch the eight seconds of an actor walking down a hallway and they don't store all 240 frames — they store the gist. The walk. The pace. The direction. They compress eight seconds of nearly redundant images into a single conceptual summary. A 3D convolution does the mathematical version of this: it slides over short windows of consecutive frames and merges them into a single representation that captures both the visual content and the motion.
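The film-editor analogy can be sketched in a few lines. A real Conv3D uses learned kernels; this toy version just averages each window of consecutive frames, but the compression along time — many frames in, one summary out — is the same move.

```python
import numpy as np

def temporal_compress(frames, window=4):
    """Merge each window of consecutive frames into one representation,
    like a strided 3D convolution along time (sketch: simple averaging
    stands in for learned kernels)."""
    t, h, w = frames.shape
    t_out = t // window
    clipped = frames[: t_out * window]
    # Group frames into windows and collapse each window to one "summary frame".
    return clipped.reshape(t_out, window, h, w).mean(axis=1)

video = np.random.default_rng(2).normal(size=(240, 16, 16))  # 8 s at 30 fps
summary = temporal_compress(video)
print(summary.shape)  # (60, 16, 16): 4x fewer time steps to reason over
```

The window size is a knob: a larger window means more compression and a coarser sense of motion, which is exactly the trade-off the editor makes when deciding how much of the long take to keep.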
The model also uses something called Efficient Video Sampling, which is the more aggressive cousin of the same idea: it identifies sequences of frames where almost nothing is changing — a static shot of a speaker's face, say — and quietly skips over the redundancy. The paper reports that this halves the number of "tokens" (the model's smallest units of attention) it needs to process a video, which translates directly into faster, cheaper responses.
The everyday analogy is reading a long email where someone has copy-pasted the same disclaimer three times. A practiced reader skips the repeats. The model is learning to do the same.
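A toy version of that "skip the repeats" behavior: keep a frame only if it differs enough from the last frame you kept. The real Efficient Video Sampling is more sophisticated than a pixel-difference threshold, but the effect is the same — static stretches collapse to almost nothing.

```python
import numpy as np

def sample_frames(frames, threshold=0.05):
    """Keep a frame only if it differs enough from the last kept frame --
    a toy stand-in for Efficient Video Sampling's 'skip the static parts' idea."""
    kept = [0]
    for i in range(1, len(frames)):
        change = np.abs(frames[i] - frames[kept[-1]]).mean()
        if change > threshold:
            kept.append(i)
    return kept

rng = np.random.default_rng(3)
static = np.repeat(rng.normal(size=(1, 8, 8)), 50, axis=0)  # 50 identical frames
moving = rng.normal(size=(10, 8, 8))                        # 10 changing frames
video = np.concatenate([static, moving])

kept = sample_frames(video)
print(f"kept {len(kept)} of {len(video)} frames")
```

Fifty frames of a static shot survive as a single frame; the moving stretch is kept in full. Applied on top of the temporal compression above, this is how the paper gets away with roughly halving the token count for video.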
A Working Memory You Can Actually Use
Earlier language models had a working memory of around 4,000 to 8,000 tokens — enough for, say, a short essay. The previous version of this model could hold 128,000 tokens, roughly equivalent to a short novel. Nemotron 3 Nano Omni now holds 256,000 tokens — closer to a long novel, or a feature-length film with all of its dialogue, or hours of conversation, or a full software codebase.
The reason this matters is subtle. Once a model can hold an entire long document in mind at once, the nature of what you can ask it changes. You're no longer asking it to summarize chunks; you're asking it to reason across the whole thing. Where in this 200-page contract does the indemnification clause contradict the warranty section? That question requires holding both sections simultaneously, and remembering everything between them. With a small working memory, the model has to be fed pieces sequentially and forgets the start by the time it reaches the end.
A useful comparison: trying to follow the plot of The Wire if every five minutes someone wiped your memory of the previous five minutes. You could describe individual scenes beautifully. You could not, in any meaningful sense, talk about the show.
The Curriculum: Teach Slowly, Don't Erase
Training a single model to handle text, images, video, and audio without it getting confused — and without it suddenly becoming worse at what it used to do well — is a quietly enormous problem. There's a phenomenon researchers call catastrophic forgetting, where teaching a neural network something new causes it to overwrite what it knew before. Imagine a polyglot who learns Italian and then realizes their Spanish has mysteriously evaporated. The brain has limited storage and the new lessons crowded out the old ones.
The Nemotron team's answer is a multi-stage training recipe. They don't dump every modality on the model at once. They start with a strong text-only foundation — language, reasoning, math — and then add modalities in stages, like a curriculum. First text. Then images. Then video. Then audio. Each new stage is introduced gently, with careful balancing, so the older skills aren't trampled.
The closest human analogy is the way medical schools teach. You don't begin a first-year medical student in surgery. They learn anatomy, then physiology, then pathology, then clinical examination, and only after a long ramp do they pick up a scalpel. By the time they're operating, the foundational knowledge is consolidated and won't get knocked loose. The Nemotron training schedule is doing the same thing — protecting older capabilities while layering on new ones.
Quantization, or: How to Shrink the Model Without Lobotomizing It
The paper also releases versions of the model in three numerical formats: BF16, FP8, and FP4. The numbers refer to how many bits are used to store each parameter — basically, how precisely the model's "knowledge" is recorded.
The analogy here is image compression. A high-resolution photograph is sharp and detailed but enormous. Compress it to a smaller file and you keep most of the image, but very fine details get rounded off. Push compression too far and the image becomes blocky and unrecognizable.
Quantization is the same trade-off applied to a model's brain. BF16 is the high-resolution version: precise, but heavy. FP4 is the aggressively compressed version: it fits onto cheaper hardware and runs faster, with some careful loss of nuance. The paper's contribution is showing that, with the right techniques, the FP4 version retains nearly all the capability of the larger version. That matters because it means people without supercomputers can run the model on more modest hardware.
What Becomes Possible
The paper's headline claim — three times higher single-stream throughput than a competitor model, nine times higher per-GPU throughput at a fixed responsiveness target — is the kind of number that doesn't quite land emotionally. So let me translate.
Picture an accessibility tool for blind users: a phone app that processes the live camera feed, recognizes text on signs, identifies the voice of a friend approaching, and quietly narrates the scene in real time. The bottleneck right now isn't whether such a system can be built. It's whether it can run smoothly on a phone, in real time, without melting the battery. The kind of efficiency gains in this paper are what move that experience from a research demo to a product.
Or picture a customer-support agent that watches you struggle with a screenshare, listens to your frustrated voice, reads the error log you've pasted, and walks you through a fix — all without handing you off to a human. The "agentic GUI use" benchmark in the paper, which includes navigating real software environments, is precisely about that capability.
Or picture a model that watches an entire two-hour surgery video, listens to the surgeon's narration, and produces a teaching annotation: here, at minute 47, the tissue is unusual; the surgeon makes an adjustment that would not be in the textbook. Long audio-video comprehension, which the paper highlights, is what makes this realistic rather than fantasy.
The Honest Limits
A few things should temper the excitement.
The benchmarks the paper reports are, by design, snapshots. A model can score brilliantly on document-understanding benchmarks and still trip over a real document with a coffee stain or an unusual layout. Real-world performance always lags benchmark performance, sometimes by a lot.
The paper claims leadership on certain comparisons but is also competing in a fast-moving field. By the time you read this, other research groups will have shipped their own models with their own clever tricks. The architectural choices here — Mixture-of-Experts, dynamic resolution, temporal compression — are widely seen as the right direction, but no one has converged yet on a single winning recipe.
And while the paper's release of model weights and training data is generous, "open" in this corner of AI is always a partial gesture. The compute required to train models at this scale remains the province of a small number of labs. Researchers can fine-tune and study what NVIDIA has released; very few can replicate it from scratch.
Still, the trajectory is clear. We are slowly building machines that can perceive the world the way we do — through every channel at once — and reason about what they perceive without first translating everything into one impoverished medium. The poem is, finally, starting to take shape.
📄 https://arxiv.org/abs/2604.24954
tags: ai, multimodal, nvidia, nemotron
🇰🇷 Korean version on Velog: https://velog.io/@tkdnel1002/xdlpod51