So… what is GitHub Copilot’s "Goldeneye" model and why should devs care?

You might have heard whispers about "Raptor Mini" appearing in VS Code recently. But while the public is testing that, there is something bigger lurking in the internal builds at Microsoft and GitHub.

I’m a Solution Engineer at Microsoft focusing on Dev Tools, which means my GitHub account is part of the MicrosoftCopilot organization. Because of this, my VS Code instance inherits feature flags that aren't public yet.

[Image: copilot-org]

Yesterday, I opened my model picker and saw a new name: "Goldeneye (Preview)."

[Image: golden-eye-in-vscode]

It’s not in the public docs. It’s not in the changelog. But after digging into the debug logs and putting it through its paces, I can tell you this: It is a monster.

Here is the breakdown of what Goldeneye is, the massive hardware running it, and why this internal experiment proves that the future of Copilot is about to get a whole lot bigger.

TL;DR for the busy dev

  • Goldeneye = An internal-only model currently being dogfooded by Microsoft/GitHub employees.
  • The Specs are wild: It features a massive 400k context window and a staggering 128k output limit.
  • The Architecture: It appears to be an "OSWE" (OpenAI Software Engineering) agentic model running on NVIDIA A100s.
  • Why you care: You can't use it yet, but this is the prototype for the next generation of Copilot Agents. It solves the "context rot" problem and proves GitHub is preparing to ship significantly expanded token limits to the public.

If you want the receipts, keep reading.

1. The Discovery (The Insider Part)

Unlike the Raptor rollout, which went to Copilot Pro users, Goldeneye is locked behind the Microsoft Employee curtain.

When I selected it in the chat interface, the latency was different. The quality was different. It didn't feel like a standard GPT-4 variant; it felt... deeper.

So, I did what any engineer would do: I popped open the Copilot Chat debug view in VS Code (the chat panel's "View and More Actions" menu -> Show Chat Debug View) and watched the logs while I chatted with it.

2. The Specs (The "Holy Cow" Part)

Here is the raw configuration object that came back from the backend. Look closely at the limits:

{
    "billing": {
      "is_premium": true,
      "multiplier": 1
    },
    "capabilities": {
      "family": "oswe-vscode-large",
      "limits": {
        "max_context_window_tokens": 400000,
        "max_output_tokens": 128000,
        "max_prompt_tokens": 272000,
        "vision": {
          "max_prompt_image_size": 3145728,
          "max_prompt_images": 1,
          "supported_media_types": [
            "image/jpeg",
            "image/png",
            "image/webp",
            "image/gif"
          ]
        }
      },
      "object": "model_capabilities",
      "supports": {
        "parallel_tool_calls": true,
        "streaming": true,
        "structured_outputs": true,
        "tool_calls": true,
        "vision": true
      },
      "tokenizer": "o200k_base",
      "type": "chat"
    },
    "id": "oswe-agent-b",
    "is_chat_default": false,
    "is_chat_fallback": false,
    "model_picker_category": "powerful",
    "model_picker_enabled": true,
    "name": "Goldeneye (Preview)",
    "object": "model",
    "preview": true,
    "supported_endpoints": [
      "/responses"
    ],
    "vendor": "Azure OpenAI",
    "version": "goldeneye"
}

The big takeaways:

  1. 400k Context Window: This is the largest context window we've seen in a Copilot chat model to date. For reference, GPT-4o tops out at a 128k-token context window. Goldeneye can likely hold your entire src folder in memory.
  2. 128k Output: This is the real game changer. Most models tap out after a few thousand output tokens, i.e. a few hundred lines of code. A 128k output budget means this model can architect entire modules or rewrite massive files in a single pass without cutting off (there's a quick sanity check on these limits in the sketch after this list).
  3. Family: oswe-vscode-large: This points to a bespoke model family, likely "OpenAI Software Engineering," tuned specifically for the IDE.
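
Those numbers also line up with each other in a satisfying way. Here is a tiny back-of-the-envelope sketch; the limits come straight from the config above, but the tokens-per-line figure is my own rough assumption, not anything from the logs:

```typescript
// Limits copied from the capabilities object above.
interface ModelLimits {
  max_context_window_tokens: number;
  max_output_tokens: number;
  max_prompt_tokens: number;
}

const limits: ModelLimits = {
  max_context_window_tokens: 400_000,
  max_output_tokens: 128_000,
  max_prompt_tokens: 272_000,
};

// The prompt and output budgets partition the context window exactly:
// 272,000 + 128,000 = 400,000.
console.log(
  limits.max_prompt_tokens + limits.max_output_tokens ===
    limits.max_context_window_tokens,
); // true

// Assuming very roughly ~10 tokens per line of source code (my assumption,
// it varies a lot by language and style), the prompt budget alone covers
// on the order of ~27k lines.
const TOKENS_PER_LINE = 10;
console.log(Math.floor(limits.max_prompt_tokens / TOKENS_PER_LINE)); // 27200
```

That 272k + 128k = 400k split also matches (minus a few reserved tokens) the maxPromptTokens value you'll see in the request log below.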

3. Decoding the "Goldeneye" String

Let's dive into a model response:

requestType      : ChatResponses
model            : oswe-agent-b
maxPromptTokens  : 271997
maxResponseTokens: undefined
location         : 7
otherOptions     : {"stream":true,"store":false}
reasoning        : {"summary":"detailed"}
intent           : undefined
startTime        : 2025-12-03T23:04:22.839Z
endTime          : 2025-12-03T23:04:33.218Z
duration         : 10379ms
response rate    : 22.35 tokens/s
ourRequestId     : 92becdc5-214b-4b38-a3d8-725c744575b5
requestId        : 92becdc5-214b-4b38-a3d8-725c744575b5
serverRequestId  : 92becdc5-214b-4b38-a3d8-725c744575b5
timeToFirstToken : 6846ms
resolved model   : capi-noe-ptuc-a100-oswe-vscode-large-prime
usage            : {"prompt_tokens":6274,"completion_tokens":232,"total_tokens":6506,"prompt_tokens_details":{"cached_tokens":1792},"completion_tokens_details":{"reasoning_tokens":192,"accepted_prediction_tokens":0,"rejected_prediction_tokens":0}}

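Two of those fields line up with simple arithmetic on the usage block. Here's a quick check; the values are copied from the log above, and the formulas are my own inference rather than documented internals of the debug view:

```typescript
// Values copied from the request log above.
const durationMs = 10379;
const usage = {
  prompt_tokens: 6274,
  completion_tokens: 232,
  total_tokens: 6506,
};

// total_tokens is just prompt + completion.
console.log(usage.prompt_tokens + usage.completion_tokens); // 6506

// "response rate" looks like completion tokens over wall-clock duration.
const tokensPerSecond = usage.completion_tokens / (durationMs / 1000);
console.log(tokensPerSecond.toFixed(2)); // "22.35", matching the log
```

Also worth noting: 192 of the 232 completion tokens are reasoning tokens, and roughly two-thirds of the wall-clock time (6,846 ms of 10,379 ms) went to time-to-first-token. That profile is exactly what you'd expect from a reasoning-style model.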

Deep in the logs, the model resolved to a specific identifier string that tells us exactly what hardware Microsoft is throwing at this problem:

capi-noe-ptuc-a100-oswe-vscode-large-prime

I broke this down to understand the infrastructure behind the model (a small parsing sketch follows the list):

  • capi: Likely "Copilot API" or "Chat API."
  • noe: The "North Europe" Azure region.
  • ptuc: An internal cluster/deployment identifier (possibly a Provisioned Throughput Unit cluster, given how Azure OpenAI capacity is sold).
  • a100: NVIDIA A100 GPUs. This confirms the model runs on dedicated, datacenter-class compute. This isn't a quantized "mini" model; this is heavy iron.
  • oswe: OpenAI Software Engineering (presumably). This is the strongest evidence that OpenAI and GitHub are building models specifically for coding logic, not just general chat.
  • vscode-large: Optimized for the editor, and "large" hints at the parameter-count tier.
  • prime: The flagship version of this model tier.
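
If you want to poke at the string yourself, it splits cleanly on dashes. A trivial sketch, with the same guesses as above encoded as labels (they are guesses, not anything Microsoft documents):

```typescript
const resolved = "capi-noe-ptuc-a100-oswe-vscode-large-prime";

// Labels mirror the breakdown above; most of them are educated guesses.
const labels: Record<string, string> = {
  capi: "Copilot API / Chat API",
  noe: "North Europe Azure region",
  ptuc: "internal cluster / deployment id",
  a100: "NVIDIA A100 GPUs",
  oswe: "OpenAI Software Engineering",
  vscode: "editor-specific tuning",
  large: "size tier",
  prime: "flagship variant",
};

for (const segment of resolved.split("-")) {
  console.log(`${segment.padEnd(6)} -> ${labels[segment] ?? "unknown"}`);
}
```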

4. Why should devs care? (The Future)

Since you can't toggle this on yet, why does it matter?

Because Goldeneye is a crystal ball. It shows us exactly where GitHub Copilot is heading in the next 2 to 4 months.

1. Reducing "I forgot what file you're talking about"

With a 400k context window and a 272k prompt limit, Microsoft and GitHub are testing a world where context rot doesn't exist. You won't have to cherry-pick which files to add to context; the model can simply hold the state of your repo. And as repo-level AI rulesets and instruction files grow, that extra headroom is exactly what's needed.
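
To make "hold your repo in context" concrete, here's a rough, entirely heuristic way to estimate whether a project's source fits in a 272k-token prompt. The ~4 characters per token figure is a common rule of thumb for code, not a real o200k_base tokenization, and the extension list is just an example:

```typescript
import { readdirSync, readFileSync, statSync } from "node:fs";
import { join, extname } from "node:path";

// Heuristic only: ~4 characters per token is a rough rule of thumb,
// not an actual tokenizer run.
const CHARS_PER_TOKEN = 4;
const PROMPT_BUDGET = 272_000;
const SOURCE_EXTS = new Set([".ts", ".js", ".py", ".cs", ".go", ".java"]);

function countChars(dir: string): number {
  let chars = 0;
  for (const entry of readdirSync(dir)) {
    if (entry === "node_modules" || entry === ".git") continue;
    const full = join(dir, entry);
    if (statSync(full).isDirectory()) {
      chars += countChars(full);
    } else if (SOURCE_EXTS.has(extname(entry))) {
      chars += readFileSync(full, "utf8").length;
    }
  }
  return chars;
}

const estimatedTokens = Math.ceil(countChars("./src") / CHARS_PER_TOKEN);
console.log(
  `~${estimatedTokens} tokens; fits in a 272k prompt: ${estimatedTokens <= PROMPT_BUDGET}`,
);
```

Run it against a mid-sized service and you'll see why 272k prompt tokens starts to feel like "the whole src folder" rather than "a handful of files."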

2. Agents are getting "Brain Transplants"

The ID oswe-agent-b is a smoking gun. This isn't just a chatbot - it's a backend for Agents.
Current public agents can sometimes struggle with complex, multi-step tasks because they run out of context or hit output limits. Goldeneye’s specs suggest that the next wave of GitHub Copilot Agents will be able to handle complex, multi-file refactors without needing a human to hold their hand.

3. Speed and Scale

The fact that this is running on A100s and is being dogfooded internally means Microsoft and GitHub are willing to spend serious compute to get code generation right. They aren't trying to make it cheaper; they are trying to make it smarter while moving quickly internally.

5. The Verdict

[Image: octocat-goldeneye]

Goldeneye represents a massive leap in "memory" and "stamina" for AI coding assistants in the GitHub Copilot ecosystem.

It is currently the guinea pig for improving the underlying models of the Copilot platform. By testing this internally, GitHub is validating that bigger is better—better memory, better output limits, and better agentic behavior.

It’s exciting to see Microsoft and GitHub moving this fast. If this is what the employees are using today, get your repos ready—because we’re all going to have this power soon.



I’m Ve Sharma, a Solution Engineer at Microsoft focused on Cloud & AI and working on GitHub Copilot. I help developers become AI-native and optimize the SDLC for their teams. I also make great memes. Find me on LinkedIn or GitHub.

Top comments (3)

Peter Vivo

Interesting. Today I did a lot of work with the Gemini CLI, but after a few hours it stopped at around 5M tokens. I also tried Copilot, but it gave up far earlier.

Ve Sharma

Oh interesting! Typically what moves the needle most on output quality is starting in Plan mode, then moving into Agent or Chat mode with well-defined prompts for the task; that alone can make a significant improvement. Also, sticking to one context window per task and refreshing context between tasks keeps it from sputtering nearly as much!
