DEV Community

Witold Wozniak

Can Copilot be your AI pair programmer?

My dear teammate decided to ruin my evening yesterday by asking an intriguing question: could you set up Copilot as a real pair programmer? Not the chat box you take turns with, but something that watches your code as you write it and reacts in the moment, the way a person at the next desk would. The short answer is no, not with anything you can install today.

What pairing means

GitHub launched Copilot in 2021 as "your AI pair programmer," and the same announcement described it adapting to you as you type. The thing was named for pairing and built for autocomplete, so it is fair to ask whether it can do the thing on the label. None of what follows is news, but I am spelling it out because the rest of the piece holds the product to these exact terms.

Pair programming is a core practice of Extreme Programming, with a definition that predates Copilot by two decades. The usual articulation gives it two roles. The driver types; the navigator reviews each line as it appears, watches for the mistake before it lands, and thinks a step ahead about design. The navigator works while the driver is mid-line, not after a turn is handed over. The two swap roles every few minutes, and both stay engaged the whole time; a navigator who checks out is not pairing.

The navigator contributes without touching the keyboard. The reviewing, the watching, the thinking ahead: that is the work. The canon is explicit about it: the oldest objection to pairing, that it doubles cost by paying two people for one person's typing, is a named misconception, precisely because programming is not typing. So the navigator, who writes almost nothing, is the harder half to automate. The job is continuous real-time oversight: catching the wrong thing the instant it appears and saying so before it costs anything.

Hold the product against that definition. A tool that completes your line as you type is doing the driver's job, only faster. It fills in keystrokes, the part the canon just told us was never the point. It is not in the loop with you. It never reacts to its own last suggestion, it never waits for yours, and it never takes the navigator's seat: it cannot watch you work and speak up, only wait to be asked.

Why not

From the outside, the model is a function. You hand it the whole context, it returns a complete answer. The only stream is the output, and it is read-only: you watch tokens arrive, but there is no second pipe to push new tokens into the call while it runs. To react to something new, you stop and make another call with the new context folded in. Every reaction is a fresh, full-context request. There is no "while you were typing" — there is only the next turn. (Stateful APIs that "remember" the conversation do not change this. They store the context so you do not have to resend it; the model still consumes it whole, one turn at a time.)
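To make the "function" framing concrete, here is a minimal sketch. Nothing in it is a real API: `fake_model` and `react` are hypothetical names, and the model is a stand-in that only counts what it was given. The point is the shape of the call: reacting to something new means building the full context again and making one more complete request.

```python
from typing import Callable, List

def fake_model(context: List[str]) -> str:
    # Hypothetical stand-in for a hosted model: a pure function of its whole context.
    return f"reply based on {len(context)} context items"

def react(context: List[str], new_event: str,
          model: Callable[[List[str]], str]) -> List[str]:
    """React to something new by making a fresh call with the event folded in."""
    full_context = context + [new_event]  # the whole history goes in again
    answer = model(full_context)          # one complete request, one complete answer
    return full_context + [answer]

history: List[str] = []
history = react(history, "user: rename this function", fake_model)
history = react(history, "user: actually, keep the old name", fake_model)
```

The second call cannot reach into the first one. It can only start over with a longer list.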

Voice is the proof of how solid that wall is. OpenAI's Realtime API and Google's Gemini Live feel like the real thing: they run over a live two-way connection, they talk while listening, they stop the instant you cut in. But underneath, they are still turn-based. Interruption is not the model hearing you mid-sentence and adjusting: it is the server detecting your speech, cancelling the response in flight, discarding the audio you had not heard yet, appending what you said, and starting a fresh turn. Often it happens fast enough that you do feel heard. [1] The industry spent real money building the full-duplex pipe for voice, and what it bought was an excellent imitation of mid-stream reaction, not the thing itself. A kind of "optimistic UI AI", if you will.
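The cancel-and-restart sequence can be sketched in a few lines. This is a toy under stated assumptions, not either vendor's implementation: `stream_reply` and `turn` are hypothetical, the "audio" is a list of strings, and speech detection is reduced to an `interrupt_at` index.

```python
from typing import Iterator, List, Optional

def stream_reply(context: List[str]) -> Iterator[str]:
    # Hypothetical model stream; the context is fixed for the whole call.
    for i in range(5):
        yield f"chunk{i}"

def turn(context: List[str], interrupt_at: Optional[int] = None) -> List[str]:
    """Play one response and return only the chunks the user actually heard."""
    heard: List[str] = []
    stream = stream_reply(context)
    for n, chunk in enumerate(stream):
        if interrupt_at is not None and n == interrupt_at:
            stream.close()  # cancel the response in flight
            break           # chunks not yet played are discarded
        heard.append(chunk)
    return heard

context = ["user: explain this diff"]
heard = turn(context, interrupt_at=2)             # speech detected at chunk 2
context.append("assistant: " + " ".join(heard))   # keep only what was heard
context.append("user: wait, skip the tests")      # append the interruption
second = turn(context)                            # a fresh turn over the new context
```

Note that the interrupted stream never sees the user's words. They only exist for the second call.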

The architecture that genuinely does the thing exists. It is called full duplex, and the clearest example is a research speech model called Moshi, which runs your audio and its own as two parallel streams and conditions on your input every fraction of a second, with no stop and no restart. [2] And because a full-duplex model never stops generating, getting it to shut up when you interrupt is apparently harder, not easier, which is probably why Moshi has not made it into a code editor since its heroic escape from a French AI laboratory.
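For contrast, a toy version of the full-duplex loop described in note [2]. This is not Moshi's architecture: `toy_step` is a made-up rule (pause when the user's previous frame was speech), and the only figure taken from the paper is the 12.5 Hz frame rate. What it illustrates is the control flow: one loop, no restart, and the real user frame replacing the model's guess at every step.

```python
from typing import List, Tuple

FRAME_RATE_HZ = 12.5                # frame rate cited in note [2]
FRAME_MS = 1000 / FRAME_RATE_HZ     # 80.0 ms per step

def toy_step(history: List[Tuple[str, str]]) -> Tuple[str, str]:
    """Toy stand-in for the model: predict (own_token, user_token) for the next frame."""
    last_user = history[-1][1] if history else "silence"
    own = "pause" if last_user == "speech" else "talk"
    return own, "silence"  # the user-stream guess; the caller throws it away

def run(user_frames: List[str]) -> List[Tuple[str, str]]:
    history: List[Tuple[str, str]] = []
    for real_user in user_frames:
        own, _predicted_user = toy_step(history)
        history.append((own, real_user))  # substitute real input for the prediction
    return history

trace = run(["silence", "silence", "speech", "speech", "silence"])
own_tokens = [own for own, _ in trace]
```

In this toy, the model's own stream changes one frame after the user starts speaking, without any call being cancelled or restarted.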

It seems that nobody has built full duplex for text because text does not need it. Speech is a continuous signal with no natural end, so the system has to guess when you are done. Text comes with a turn boundary built in. You press Enter, and the turn is over. No guessing which moment counts as done, because you explicitly say "I am done, your turn." This is the same reason I have to say "over" after asking my mom on my walkie-talkie to bring me another bag of Cheetos — an open channel gives you no way to hear that someone has finished, so the boundary has to be a word. The entire problem full duplex solves does not exist when the input already arrives pre-segmented, which is why nowadays I tend to text my mom on Signal.

How close you can get

Autocomplete is not pairing. It reacts as you type, but you cannot steer it, and it does not answer to your intent. It completes whatever you are writing regardless of what you meant, which makes it a fast stream of guesses with no interaction and no loop.

Agent mode is the real near-miss. It reasons about your work and responds to it, which is the engagement autocomplete lacks. But you have to hand it the turn: you write a prompt, send it, and wait for the burst of work to come back. The trigger does not have to be a prompt you type, though. OpenCode lets you wire it to an external event instead, like a file save or a failing test, through a plugin that injects a turn on its own, so the agent reacts to your activity without being explicitly asked.

Whether a harness can do that comes down to mechanism, not brand. Pretty much every decent harness has hooks, but hooks fire only on the session's own internal lifecycle. Reaching outside the session needs a plugin or extension layer on top, one that can inject a turn. OpenCode has it through plugins, and Copilot CLI through a separate extension system. Claude Code is the exception. It has hooks, but its plugin layer can't inject a turn, so the only external trigger is the SDK, which starts a fresh session instead of continuing the current one.
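The difference between a hook and an injected turn is easier to see in code. The sketch below is a hypothetical harness, not the API of OpenCode, Copilot CLI, or Claude Code: `Session`, `add_hook`, `inject_turn`, and `on_file_saved` are all invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Session:
    """Hypothetical harness session; not any real tool's API."""
    turns: List[str] = field(default_factory=list)
    hooks: Dict[str, List[Callable[[str], None]]] = field(default_factory=dict)

    def add_hook(self, event: str, fn: Callable[[str], None]) -> None:
        self.hooks.setdefault(event, []).append(fn)

    def inject_turn(self, prompt: str) -> None:
        self.turns.append(prompt)                  # the turn joins the same session
        for fn in self.hooks.get("turn_end", []):  # hooks fire only when a turn runs
            fn(prompt)

hook_log: List[str] = []
session = Session()
session.add_hook("turn_end", hook_log.append)

def on_file_saved(path: str) -> None:
    # Lives outside the session, the way a plugin or extension would.
    session.inject_turn(f"{path} was saved; review the change")

session.inject_turn("refactor the parser")  # a turn the user typed
on_file_saved("parser.py")                  # a turn an editor event injected
```

The hook never starts anything by itself. It only runs because something, a person or a plugin, started a turn first.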

Either way it is still bursts, not full-auto. This can look like the AI taking the driver's seat while you navigate, but handing off a task and reviewing what comes back is a swap across a turn boundary, not the fluid trading of seats a shared loop needs.

Is it worth it

Suppose you hacked the agent loop tighter and fired it on every pause instead of every save. You would pay full context on every trigger, eat the latency of a fresh call each time, and drown in suggestions you never asked for. You would approximate continuity badly and noisily, and it would still be bursts. Firing more often does not turn a sequence of turns into a continuous stream. The gap between the near-miss and real pairing is not the cadence of the trigger, so no setting closes it.
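A back-of-the-envelope version of that cost, with numbers that are pure assumptions: a 20,000-token context, 30 saves an hour, 600 pauses an hour. Prompt caching could lower the bill per token, but it does not change the number of turns.

```python
def tokens_read(context_tokens: int, triggers: int) -> int:
    """Input tokens consumed when every trigger is a fresh, full-context call."""
    return context_tokens * triggers

CONTEXT_TOKENS = 20_000   # assumed context size
SAVES_PER_HOUR = 30       # assumed trigger rate when firing on save
PAUSES_PER_HOUR = 600     # assumed trigger rate when firing on every pause

per_save = tokens_read(CONTEXT_TOKENS, SAVES_PER_HOUR)
per_pause = tokens_read(CONTEXT_TOKENS, PAUSES_PER_HOUR)
ratio = per_pause / per_save
```

Under these assumptions, firing on pauses reads twenty times as many input tokens per hour, and the result is still 600 separate turns.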

When real pairing does reach a code editor, it will not be a Copilot setting. It will be a different kind of model. Until then, the honest answer is that my teammate is stuck with me as a pairing partner.


Notes

[1] That interruption is cancel-and-restart, rather than the model editing a generation already in flight, is an inference from how transformer inference works, not a confirmed internal detail. OpenAI and Google document the protocol — voice activity detection, response cancellation, transcript truncation — but not the decode loop itself, so "abort and re-run" is a reasoned conclusion, not a stated one.

[2] Moshi. Défossez et al., Kyutai, 2024. The model processes the user stream and its own stream jointly as two autoregressive token streams at a 12.5 Hz frame rate (one 80 ms frame per step), discarding its prediction of the user stream and substituting the real incoming audio, which carries into context for the next step. That is genuine mid-decode conditioning. Measured latency is around 200 ms, which is inside the range of a normal human conversational gap. The Hugging Face reference implementation is simplified and not real-time; the live full-duplex behavior is in Kyutai's own implementation. It is an open research model.

Sources

  • Pair programming, definition and the driver/navigator roles: the Agile Alliance glossary and Williams and Kessler, "Pair Programming Illuminated" (2002). The "doubles cost" objection is recorded there as a misconception that equates programming with typing; the driver/navigator framing postdates XP's original practice and is not universally accepted.
  • OpenAI, Realtime API documentation (voice activity detection, response cancellation, conversation.item.truncate, WebSocket/WebRTC/SIP transports).
  • Google, Gemini Live API documentation, ai.google.dev/api/live (BidiGenerateContent, interruption and turn handling).
  • OpenAI, streaming responses documentation (single HTTP request, server-sent events as one-directional output).
  • Défossez et al., "Moshi: a speech-text foundation model for real-time dialogue," 2024, arXiv:2410.00037.
  • GitHub Copilot CLI hooks reference and extensions documentation, docs.github.com (lifecycle hook events; the separate extension system over JSON-RPC).
  • Anthropic, Claude Code hooks documentation (internal lifecycle events) and Agent SDK.

Further reading

  • "From turn-taking to synchronous dialogue: a survey of full-duplex spoken language models," arXiv:2509.14515 — where the line between pseudo-full-duplex and genuine full-duplex systems is drawn.
  • Latent Space, "OpenAI Realtime API: the missing manual" — a practitioner walk-through of the event protocol and input-token caching.
