Bongho Tae

When the AI Learns to See and Think at the Same Time

The Problem with Doing Everything in a Line

Picture the last time you organized something genuinely complicated — a move across the country, a wedding, a conference. At some point, you probably realized that doing every task in sequence was killing you. You couldn't wait to finish booking the caterer before calling the venue, and you couldn't wait to confirm the venue before sending invitations. The entire operation required you to hold many threads simultaneously, farming out tasks to different people while you kept track of the whole picture.

Now imagine that the person coordinating all of this could only use a telephone, and could only make one call at a time. That is, roughly, the state of most AI systems today when they face complex, real-world problems. They think in a line. They act in a line. And as tasks grow more intricate — research a topic, then design something, then write code, then verify the result — that single-file approach becomes not just slow but fundamentally inadequate.

A new model from the Chinese AI lab Moonshot AI, called Kimi K2.5, takes direct aim at this constraint. It does so in two ways that, taken together, represent a meaningful shift in how AI systems are designed: it trains the model to genuinely understand both language and images as a single unified skill, rather than grafting vision onto a text-first brain as an afterthought. And it introduces something the researchers call Agent Swarm — a way of multiplying the AI into a small army of specialized workers that tackle sub-problems in parallel, then report back to a coordinating intelligence.

Both ideas sound intuitive. But making them work in practice, and making them work together, turned out to be genuinely hard.

Why Seeing and Reading Have Always Fought Each Other

Most powerful AI models today are, at their core, language machines. They were trained on enormous quantities of text — books, articles, code, conversations — and they learned the deep structure of human reasoning through words. Vision was added later, like fitting a seeing-eye dog with a translation earpiece: technically functional, but not the same as being born with both senses integrated.

The problem with this approach is that the two skills pull against each other during training. Imagine trying to learn French and violin simultaneously, but on a rigid schedule: two hours of French, then two hours of violin, with no mixing allowed. You might get decent at both. But you'd never develop the fluid cross-modal thinking of a musician who hums a tune while writing its lyrics, each skill feeding the other in real time.

The researchers behind K2.5 found something similar. When vision is added to a language model late in training — or when the two modalities are trained in separate phases — the model develops a kind of internal friction. Improving vision sometimes hurts language; improving language sometimes hurts vision. They conflict because they were never taught to speak to each other from the beginning.

K2.5's answer was to insist on early integration. From the very first stages of pre-training — the massive, expensive phase where the model ingests hundreds of billions of words and images — text and vision tokens were mixed together in a constant ratio throughout. Think of it less like learning French and violin on a schedule, and more like growing up bilingual: the two languages don't just coexist in your brain, they shape each other's grammar, expand each other's vocabulary, and ultimately create a richer understanding of both than either would produce alone.
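
To make the mixing idea concrete, here is a minimal sketch of what fixed-ratio interleaving could look like in a data pipeline. Everything here is illustrative: the paper describes the principle of a constant text-to-vision ratio, not this code, and the 4:1 ratio is an arbitrary placeholder. Assume `text_stream` and `vision_stream` are iterators over tokenized samples.

```python
import itertools

def mixed_batches(text_stream, vision_stream, ratio=4):
    """Interleave text and vision samples at a constant ratio for the
    entire pre-training run, instead of adding vision in a later phase."""
    while True:
        batch = list(itertools.islice(text_stream, ratio))  # e.g. 4 text samples...
        batch += list(itertools.islice(vision_stream, 1))   # ...per 1 vision sample
        if len(batch) < ratio + 1:
            return  # one stream is exhausted; stop cleanly
        yield batch
```

The design point is simply that there is never a vision-free phase in which the two modalities could drift apart. It is the "growing up bilingual" idea expressed as a data schedule.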

The Surprising Power of Doing Almost Nothing

Here is one of the counterintuitive findings buried in this paper, and it deserves a moment's attention.

The conventional wisdom in AI training is that if you want a model to do something specific — say, interpret a chart, or follow a visual instruction, or use a tool when prompted by an image — you collect examples of those exact behaviors and train the model on them. You show it thousands of human-designed demonstrations. The model watches, imitates, and learns.

The K2.5 team tried this. And it made things worse.

They call what they actually found "zero-vision SFT," which sounds technical but encodes a beautifully strange insight. SFT stands for supervised fine-tuning — the phase of training where a model is shaped to follow instructions and behave helpfully, using human-labeled examples. "Zero-vision" means: during that phase, show the model no visual examples at all. Just text.

The result was that the model's visual reasoning capabilities activated anyway — and generalized better than when human demonstrations were provided.

Why? The researchers' explanation is elegant. The pre-training phase had already established such deep connections between language and vision that the model had, in effect, already learned to think visually. Human-designed demonstrations of visual reasoning, it turns out, are a kind of straitjacket: they constrain the model to imitate specific patterns rather than applying its own already-rich visual understanding. By withholding those demonstrations, the team let the model draw on what it had already taught itself.

The analogy that comes to mind is a writer who has read thousands of novels and deeply internalized the rhythms of prose. If you then give them a rigid template — "write your opening sentence this way, structure your paragraphs like this" — you may actually produce worse writing than if you'd simply told them the subject and let them work. The template interrupts a fluency they already possess.

Figure 1: Kimi K2.5 main results, comparing performance across benchmark categories against leading proprietary and open-source models.

Figure 2: Vision RL training curves on vision benchmarks, starting from minimal zero-vision SFT. As vision RL FLOPs scale, performance continues to improve, showing that the zero-vision activation generalizes effectively.

The curves in the figure above tell the story numerically: as the model was given more and more practice through reinforcement learning — a technique more like game-playing than imitation, where the model tries things and receives feedback on whether they worked — its visual understanding kept climbing. The message is that practice, not prescription, built the skill.
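
To pin down what "practice, not prescription" means as a training recipe, here is a hedged sketch of the two stages: text-only SFT followed by vision RL with a verifiable reward. All of the names (`supervised_step`, `solve`, `verify`, `policy_update`) are hypothetical stand-ins, not Moonshot AI's actual code.

```python
import random

def zero_vision_sft(model, sft_examples):
    """Supervised fine-tuning on text-only instruction data:
    any example that carries an image is deliberately withheld."""
    for ex in (e for e in sft_examples if not e.images):
        model.supervised_step(ex.prompt, ex.target)  # imitate the demonstration
    return model

def vision_rl(model, tasks, steps=10_000):
    """Reinforcement learning on visual tasks: the model proposes an
    answer and is rewarded only when a verifier accepts it."""
    for _ in range(steps):
        task = random.choice(tasks)
        answer = model.solve(task.image, task.question)
        reward = 1.0 if task.verify(answer) else 0.0
        model.policy_update(task, answer, reward)  # e.g. a policy-gradient step
    return model
```

The asymmetry is the whole finding: imitation is confined to text, while visual skill is built only through trial and feedback.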

When Training One Sense Sharpens the Other

There is something even stranger in the results, and it directly contradicts an assumption that has quietly shaped AI development for years.

When the team applied reinforcement learning to visual tasks — having the model practice interpreting images and graphs and receive feedback on whether it got things right — they found that the model's language performance improved too. Not despite the visual training. Because of it.

This is not obvious. It would be perfectly reasonable to assume that training on images uses up some finite capacity that was previously devoted to language, producing a tradeoff: more vision skill, less text skill. That is, roughly, what people assumed. The K2.5 results suggest the opposite: that genuine cross-modal integration creates a kind of cognitive leverage. Learning to reason carefully about what a chart is actually showing you makes you better at reasoning carefully about what a sentence is actually claiming.

The analogy is cross-training in athletics. A marathon runner who adds strength training doesn't become a worse runner because the weights are "using up" running capacity. Done right, the strength work changes how the body moves, how forces transfer, how fatigue accumulates — and the runner comes back faster. The skills compound rather than compete.

The Orchestra Problem

With the model's visual and linguistic reasoning unified, the team turned to a different and arguably more fundamental problem: the architecture of how an AI tackles a hard task.

Current AI systems, even sophisticated ones, operate sequentially. The model thinks step one, acts on step one, observes the result, thinks step two, acts on step two, and so on. This works. But it scales badly. If a genuinely complex task requires hundreds of steps — researching a topic across dozens of sources, then synthesizing the findings, then designing something based on those findings, then verifying the design — the time required grows linearly with the number of steps. You are waiting, always, for the model to finish its last thought before it can begin its next one.
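
The bottleneck is easiest to see written out. Here is a sketch of that standard loop, with every name illustrative:

```python
def sequential_agent(task, model, max_steps=200):
    """The classic think-act-observe loop: each step must finish before
    the next begins, so wall-clock time grows linearly with step count."""
    context = [task]
    for _ in range(max_steps):
        thought = model.think(context)         # reason about the next move
        action = model.choose_action(thought)  # pick a tool call or a final answer
        if action.is_final:
            return action.answer
        observation = action.execute()         # e.g. a web search; blocks everything
        context += [thought, observation]
```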

This is the telephone-one-call-at-a-time problem from the opening. And Agent Swarm is the solution.

Think of how a large architectural firm tackles the design of a complex building. There is a lead architect who holds the overall vision and makes the decisions that require that vision. But there are also structural engineers, interior designers, environmental consultants, and cost estimators — each working on their own domain, in parallel, reporting back when their piece is complete. The lead architect doesn't wait for the structural drawings before commissioning the interior design study. The pieces develop concurrently and are integrated at the end.

Agent Swarm works on the same principle. A coordinating AI — the orchestrator — receives a complex task and immediately analyzes it for parallelizability: which parts depend on other parts, and which parts can proceed simultaneously without waiting for anything else? It then spins up specialized sub-agents — an AI researcher, a fact-checker, a coder, a visual analyst — and dispatches them to work on their pieces at the same time. The sub-agents are not general intelligences; they are locked-down specialists, given specific tools and specific goals. The orchestrator alone is trained to adapt and coordinate.
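
Here is a minimal sketch of that orchestration pattern, using Python's asyncio for the parallelism. The decomposition call, the `make_specialist` factory, and the synthesis step are assumptions for illustration; in the paper, the orchestrator's behavior is learned, not hand-coded.

```python
import asyncio

async def run_subagent(subtask):
    """A frozen specialist: fixed tools, one goal, and no coordination
    responsibilities of its own."""
    agent = make_specialist(subtask.role, subtask.tools)  # hypothetical factory
    return await agent.solve(subtask.goal)

async def agent_swarm(task, orchestrator):
    # 1. The orchestrator decomposes the task into subtasks that
    #    do not depend on one another.
    subtasks = orchestrator.decompose(task)
    # 2. The specialists run concurrently rather than one at a time.
    results = await asyncio.gather(*(run_subagent(st) for st in subtasks))
    # 3. The orchestrator alone integrates the partial results.
    return orchestrator.synthesize(task, results)
```

The asymmetry is the design choice worth noticing: the sub-agents are frozen and disposable, while all of the adaptive behavior is concentrated in the orchestrator, the only component the paper trains.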

Figure 3: The Agent Swarm architecture. A trainable orchestrator dynamically creates specialized frozen sub-agents and decomposes complex tasks into parallelizable subtasks for efficient distributed execution.

The result, according to the paper's measurements, is up to a 4.5x speedup in task completion time compared to doing the same work sequentially. On complex search and research tasks, Agent Swarm doesn't just speed things up — it also gets better answers, because the parallel workers cover more ground before the orchestrator synthesizes them.

Figure 4: In the parallel-agent reinforcement learning environment, training accuracy increases smoothly as training progresses, and the level of parallelism the model uses gradually increases alongside it.

What is particularly interesting about Figure 4 is that the model learned when to multiply itself. As training proceeded and the model became better at solving hard problems, it spontaneously used more parallel agents. The more capable it became, the more it chose to delegate. A naïve reading might see this as the model becoming lazier; a more accurate reading is that it learned what experienced managers know — that the hardest problems are the ones most worth distributing.

What the Numbers Actually Show

The benchmark results are numerous and the comparisons carefully hedged, as they always are in papers that announce impressive performance. Kimi K2.5 is being compared against GPT-5.2, Claude Opus 4.5, and Gemini 3 Pro — the frontier models from OpenAI, Anthropic, and Google respectively — and the picture is genuinely mixed, which is worth saying plainly.

On agentic tasks — the tasks that require planning, using tools, browsing the web, and synthesizing information — K2.5 does well, particularly when Agent Swarm is engaged. On pure mathematical reasoning benchmarks like AIME and HMMT, it trails GPT-5.2 and Gemini 3 Pro somewhat. On knowledge recall tasks like SimpleQA, it trails Gemini significantly. It leads on several coding and web-browsing tasks, and performs strongly on visual understanding tests.

The honest reading of these numbers is that K2.5 is a genuinely capable model with meaningful innovations, particularly in how it handles vision and how it organizes multi-step work. It is not uniformly ahead of the competition. What it offers that the others do not, as an open-source release, is the ability for researchers and developers to examine and build on its architecture — the Agent Swarm mechanism especially — without waiting for a proprietary API to expose those features.

What Becomes Different

Step back from the benchmarks for a moment and think about what these capabilities, combined, actually change.

Consider a person trying to understand a dense medical report after a diagnosis. Currently, they might copy out the relevant sections and paste them into an AI chat window, painstakingly describing what the charts show. A system that genuinely integrates vision can look at the actual document — the actual graph of their bloodwork over time — and reason about it directly, not through a verbal description.

Or consider a journalist trying to verify a complex claim that involves cross-referencing dozens of documents, each containing a mix of text, images, and data tables. A sequential AI, however smart, takes a long time because it must examine each source one by one. A parallel agent swarm can disperse across those sources simultaneously, fact-checking different claims in different documents at once, then bring the findings back to a central synthesizer.

Or consider a small software team using an AI assistant to debug a complex system. The AI currently reasons through possibilities one at a time. A parallel architecture lets it pursue multiple diagnostic hypotheses simultaneously — testing one while continuing to reason about another — potentially compressing hours of investigation into minutes.

These are not wild speculations. They are the natural extensions of what this paper demonstrates working in controlled conditions.

What Remains Uncertain

There is a limit to how much one research paper can establish, and it is worth naming what this one does not answer.

The Agent Swarm results are measured on benchmarks — structured tests with defined right answers. Real-world tasks are messier. They have ambiguous success criteria, contradictory sources, and edge cases that no benchmark designer anticipated. Whether parallel agent orchestration degrades gracefully when the sub-agents encounter genuinely unexpected situations — rather than simply being slower in the controlled case — is not yet clear.

The "zero-vision SFT" finding is striking, but it is also a finding about a specific model at a specific scale with a specific pre-training recipe. Whether it generalizes — whether other labs could replicate the same counterintuitive benefit by withholding visual demonstrations — is an open question that requires independent verification.

And the cross-modal enhancement claim — that training on vision improves language, and vice versa — is compelling in the aggregate benchmark numbers but harder to scrutinize mechanically. The paper shows that the numbers go up together; it does not fully show why, in a way that would let someone predict when this benefit will appear and when it won't.

None of this diminishes what the paper contributes. It presents a coherent, testable set of ideas about how to build AI systems that handle the full complexity of the world — text and images, sequential reasoning and parallel action — and it releases the trained model for others to examine and extend. In a field where many of the most significant advances stay locked inside proprietary systems, that openness is itself a contribution.

The single-file telephone call, it turns out, was always an artificial constraint. What the architects of K2.5 have shown is that AI, given the right training, can learn to run a switchboard.

📄 https://arxiv.org/abs/2602.02276

tags: artificialintelligence multimodal agenticsystems machinelearning

🇰🇷 Korean version on Velog: https://velog.io/@tkdnel1002/eg28mz6h
