How Thinking Machines built interactivity into the model

A new release from Thinking Machines, dated May 11, 2026, lands at 0.40 seconds end-to-end on the FD-bench V1 turn-taking benchmark: about three times faster than GPT-realtime-2.0 (xhigh) and roughly half the latency of Gemini-3.1-flash-live (high). The latency number is the surface story. The architectural story is what makes it possible: the model processes audio, video, and text in 200-millisecond ticks, with no separate turn-detection component sitting between the user and the weights.

The post that landed at thinkingmachines.ai is a research-preview announcement of a class of models the team is calling interaction models. The framing question worth taking seriously is this: what changes when interactivity is part of the model itself instead of a harness around it? The first three sections below walk through the answer, and the last covers what remains open.

The 200ms tick

A turn-based model receives one complete input, generates one complete output, and waits. An interaction model receives 200ms of input and produces 200ms of output, then 200ms more, then more — input and output streams running concurrently. The model does not see "the user's turn finished, now respond." It sees a continuous interleaved sequence: input chunk, output chunk, input chunk, output chunk, with no artificial turn boundaries to honor.
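The interleaving described above can be sketched in a few lines of Python. This is only an illustration of the sequence layout, not the team's implementation: the Chunk type, the interleave function, and the toy_model_step stand-in are all hypothetical names, and the only value taken from the post is the 200ms tick.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List

TICK_MS = 200  # tick length reported in the post


@dataclass
class Chunk:
    role: str       # "input" or "output"
    start_ms: int   # start of the tick this chunk belongs to
    payload: str    # placeholder for audio/video/text content


def toy_model_step(history: List[Chunk], incoming: Chunk) -> str:
    """Hypothetical stand-in for the model. An empty string means 'stay silent'."""
    return f"ack:{incoming.payload}" if incoming.payload else ""


def interleave(inputs: Iterable[str]) -> Iterator[Chunk]:
    """Yield input and output chunks tick by tick, with no turn boundaries."""
    history: List[Chunk] = []
    for tick, payload in enumerate(inputs):
        start = tick * TICK_MS
        in_chunk = Chunk("input", start, payload)
        # The output for this tick is produced from everything seen so far.
        out_chunk = Chunk("output", start, toy_model_step(history, in_chunk))
        history.extend([in_chunk, out_chunk])
        yield in_chunk
        yield out_chunk


if __name__ == "__main__":
    for chunk in interleave(["hel", "lo", ""]):
        print(chunk)
```

Note that the loop never asks whether the user has finished speaking. Every tick produces an output chunk, and silence is just one possible output.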

What disappears in this design is the voice-activity-detection harness that lives between the user and the model in most real-time speech systems today. Turn-based models cannot tell on their own whether the user is thinking, yielding the floor, or being briefly silent — a separate, smaller component makes that call and passes the model a "go" signal. Thinking Machines argues, citing The Bitter Lesson, that any harness less intelligent than the model itself will eventually be outpaced by the model. So they remove the harness, and the things that harnesses could not express — speaking while listening, reacting to a visual cue without an audio prompt — become things the model can do directly.
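For contrast, here is a minimal sketch of the kind of decision such a harness makes. It is a generic silence-timeout rule written for illustration, not any vendor's actual component: the function name, the energy threshold, and the frame count are all assumptions.

```python
from typing import List, Optional


def end_of_turn_index(
    frame_energies: List[float],
    energy_threshold: float = 0.1,  # assumed cutoff between speech and silence
    silence_frames: int = 3,        # assumed number of quiet frames that ends a turn
) -> Optional[int]:
    """Return the frame index where the harness would send 'go', or None."""
    heard_speech = False
    quiet_run = 0
    for i, energy in enumerate(frame_energies):
        if energy >= energy_threshold:
            heard_speech = True
            quiet_run = 0
        elif heard_speech:
            quiet_run += 1
            if quiet_run >= silence_frames:
                return i
    return None


if __name__ == "__main__":
    # A three-frame thinking pause in the middle of an utterance ends the turn early.
    print(end_of_turn_index([0.5, 0.0, 0.0, 0.0, 0.5]))
```

A rule like this has no way to tell a thinking pause from a finished turn, which is the limitation the paragraph above describes.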

The audio and video paths are deliberately lightweight. Audio comes in as dMel features through a thin embedding layer, not a Whisper-style encoder. Images are split into 40×40 patches encoded by an hMLP. The audio decoder is a flow head. All four components — embedding, image patcher, flow head, and the main transformer — are co-trained from scratch together. The phrase the team uses is encoder-free early fusion, and the practical effect is that there is no separate pre-processing model whose limits cap what the interaction model can do.
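To make the patching step concrete, here is a small sketch of cutting an image into 40×40 tiles. It covers only the splitting: the hMLP that encodes each patch is not shown. The post does not say how image sizes that are not multiples of 40 are handled, so this sketch simply rejects them, and the grayscale list-of-rows image format and the function name are assumptions.

```python
from typing import List

PATCH = 40  # patch side length reported in the post

Image = List[List[int]]  # assumed format: grayscale image as rows of pixel values


def split_into_patches(image: Image, patch: int = PATCH) -> List[Image]:
    """Cut an image into non-overlapping patch x patch tiles, in row-major order."""
    height = len(image)
    width = len(image[0]) if height else 0
    if height == 0 or height % patch or width % patch:
        # Assumption of this sketch: no padding or resizing is attempted.
        raise ValueError("image sides must be non-zero multiples of the patch size")
    patches: List[Image] = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([row[left:left + patch] for row in image[top:top + patch]])
    return patches


if __name__ == "__main__":
    image = [[y * 120 + x for x in range(120)] for y in range(80)]
    print(len(split_into_patches(image)))  # 2 rows of tiles x 3 columns of tiles
```

An 80×120 image yields six patches, each of which would then be passed to the patch encoder.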

Two models, one continuous thread

A 200ms tick is fast enough for conversational presence, but it is not enough time for sustained reasoning, tool use, or longer-horizon work. The system splits those responsibilities across two models. The interaction model — TML-Interaction-Small, a 276-billion-parameter mixture-of-experts with 12B active — holds the live thread, listens, speaks, watches. When the user asks for something that needs deeper work, the interaction model delegates to a background model that runs asynchronously.

The split matters because the interaction model does not freeze while the background model thinks. It keeps the conversation going — answering follow-ups, taking new input, holding the thread — and weaves background results back in when they arrive, at a moment that fits what the user is currently doing rather than as an abrupt context switch. Both models share context, so the background model is not starting cold from a stripped query; it inherits the full conversation.
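The delegation pattern can be sketched with asyncio. This is a toy under stated assumptions, not the system's real interface: background_model, interaction_loop, the "research:" prefix used to trigger delegation, and the sleep durations are all hypothetical, and the shared context is simplified to a snapshot of the conversation at the moment of delegation.

```python
import asyncio
from typing import List


async def background_model(context: List[str], request: str) -> str:
    """Hypothetical slow model. It receives the conversation so far, not a bare query."""
    await asyncio.sleep(0.05)  # stands in for reasoning or tool use
    return f"result for {request!r} (saw {len(context)} context items)"


async def interaction_loop(user_inputs: List[str], tick_s: float = 0.01) -> List[str]:
    """Keep replying every tick while delegated work runs in the background."""
    context: List[str] = []
    transcript: List[str] = []
    pending: List[asyncio.Task] = []
    for text in user_inputs:
        context.append(text)
        if text.startswith("research:"):
            # Delegate without blocking. A snapshot of the context goes along.
            pending.append(asyncio.create_task(background_model(list(context), text)))
            transcript.append("on it, keep talking")
        else:
            transcript.append(f"reply to {text!r}")
        # Weave in any background results that have finished by now.
        for task in [t for t in pending if t.done()]:
            pending.remove(task)
            transcript.append(task.result())
        await asyncio.sleep(tick_s)
    # Conversation input is exhausted: wait for whatever is still running.
    for task in pending:
        transcript.append(await task)
    return transcript


if __name__ == "__main__":
    for line in asyncio.run(interaction_loop(["hi", "research: x", "a", "b"])):
        print(line)
```

Depending on timing, the background result lands between two replies or after the last one. In both cases the foreground loop never stops to wait for it.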

The net effect for the user: planning, tool use, and agentic workflows at the response latency of a non-thinking model. The interaction model on its own is also competitive on intelligence benchmarks — 89.7 on text IFEval, 82.1 on voice IFEval — so it is not a thin front-end that punts everything to the background.

Where the gap shows up

The standard interactivity benchmarks (FD-bench, Audio MultiChallenge) put TML-Interaction-Small ahead of every other non-thinking model on the Pareto frontier of intelligence versus latency. That is a real result. But the more telling numbers are on benchmarks the team built specifically to test what an interaction model can do that a harness-wrapped turn-based model cannot.

On TimeSpeak — which asks the model to initiate speech at user-specified times with the correct content — TML-Interaction-Small scores 64.7 versus 4.3 for GPT-realtime-2.0 (minimal). On CueSpeak, which tests speaking at the appropriate moment in response to a verbal cue, 81.7 versus 2.9. On Charades, a temporal-action-localization task adapted to require the model to say "start" and "stop" at the right moments of a video, the temporal IoU is 32.4 versus 0. On ProactiveVideoQA, where the no-response baseline scores 25.0, TML-Interaction-Small scores 33.5 — a small absolute lift, but a meaningful one, since the baseline is essentially "say nothing and lose no points."
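For readers unfamiliar with the Charades metric, temporal IoU is the overlap between a predicted time interval and the ground-truth interval, divided by their combined extent. The sketch below uses the standard definition. How the benchmark aggregates across clips, and whether 32.4 is this ratio expressed as a percentage, is an assumption not confirmed by the post.

```python
from typing import Tuple

Interval = Tuple[float, float]  # (start_seconds, end_seconds)


def temporal_iou(pred: Interval, truth: Interval) -> float:
    """Intersection over union of two time intervals, in the range [0, 1]."""
    pred_start, pred_end = pred
    truth_start, truth_end = truth
    intersection = max(0.0, min(pred_end, truth_end) - max(pred_start, truth_start))
    union = (pred_end - pred_start) + (truth_end - truth_start) - intersection
    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    # The model says "start" at 2s and "stop" at 6s; the action really spans 4s to 8s.
    print(temporal_iou((2.0, 6.0), (4.0, 8.0)))
```

In this example the intervals overlap for 2 seconds out of a combined 6, giving one third. Under this definition a prediction that never overlaps the true interval scores 0.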

Scores near zero usually mean the benchmark is testing a capability the architecture cannot express. The point is not that GPT-realtime-2.0 is poor at speech. It is that a turn-based model wrapped in a harness has no representation for "speak while listening" or "react to a visual cue without an audio prompt." Time-aligned micro-turns do, and the benchmark gap follows.

What's still open

The post is candid about what is not solved. Very long sessions still need careful context management, because continuous audio and video accumulate context quickly. Low-latency streaming is sensitive to network reliability, and the experience degrades sharply over a flaky connection. TML-Interaction-Small is, as the name suggests, the small model: larger pretrained models exist but are too slow to serve in this regime today, and the team plans to release them later this year. The research preview will open in the coming months, with a wider release to follow.

Source: Interaction Models: A Scalable Approach to Human-AI Collaboration, Thinking Machines Lab, May 11, 2026.