- Me: gabrielanhaia
Nineteen Days
On March 25, 2026, GitHub published a blog post. Most developers didn't see it. The ones who did wish they hadn't needed to.
Buried under several paragraphs of cheerful corporate language about "improving the Copilot experience," one detail stood out: starting April 24, 2026, GitHub will use Copilot interaction data from Free, Pro, and Pro+ users to train its AI models. Not "might use." Not "is exploring." Will use. Default: on.
The Hacker News thread went exactly how anyone would expect. The Register picked it up. And now there are roughly 19 days left before this kicks in. Every affected user who doesn't explicitly opt out gets enrolled automatically.
That's bad enough. The specifics are worse.
What "Interaction Data" Actually Means
GitHub chose the term "interaction data" carefully. It sounds harmless. Benign, even. Like telemetry about button clicks. It isn't.
Here's what falls under that label:
Prompts and instructions. Every question typed into Copilot Chat. Every inline comment written to trigger a suggestion. A developer typing "refactor this auth handler to support OAuth2 PKCE flow" — that's collected.
Code snippets sent as context. This is the big one. When Copilot generates suggestions, it reads surrounding code for context. That surrounding code gets sent to GitHub's API as part of the request. Code from private repositories. Collected.
Generated outputs. Every suggestion Copilot produces, accepted or rejected.
Behavioral signals. Which suggestions were accepted, which were dismissed, how developers modified the output after accepting it. A detailed log of how humans interact with AI-generated code.
Here's what a single interaction looks like from a data collection perspective:
{
  "prompt": "Add JWT validation with RS256 to this endpoint",
  "context_snippets": [
    "class AuthService:\n def __init__(self, secret_key, issuer)...",
    "from app.models import User, Session, TokenBlacklist...",
    "ALLOWED_ORIGINS = ['https://internal-dashboard.company.com']..."
  ],
  "suggestion": "def validate_token(self, token: str) -> dict:\n try:\n payload = jwt.decode(token, self.public_key, algorithms=['RS256'])...",
  "accepted": true,
  "modified_after_acceptance": true
}
Notice what's in those context snippets. Internal class names. Import paths that reveal project structure. An internal URL. None of this was pushed to a public repo. It got hoovered up because a developer used the autocomplete in their editor.
One interaction might expose a handful of lines. A full workday of Copilot usage? That's a different story.
40 Copilot interactions per day
× ~200-500 lines of surrounding context each
= 8,000-20,000 lines of private code per day
× 20 working days per month
= 160,000-400,000 lines per month
That math isn't theoretical. It's what happens when someone actually uses Copilot the way GitHub wants them to.
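The arithmetic above is easy to sanity-check. A throwaway script — the figures are this article's assumptions about a heavy Copilot user, not measurements:

```python
# Back-of-envelope estimate of private code lines exposed through
# Copilot context collection. All inputs are assumptions; real
# numbers depend on editor settings and usage patterns.

INTERACTIONS_PER_DAY = 40
CONTEXT_LINES_PER_INTERACTION = (200, 500)  # low / high estimate
WORKING_DAYS_PER_MONTH = 20

daily = tuple(INTERACTIONS_PER_DAY * n for n in CONTEXT_LINES_PER_INTERACTION)
monthly = tuple(n * WORKING_DAYS_PER_MONTH for n in daily)

print(f"per day:   {daily[0]:,} - {daily[1]:,} lines")      # 8,000 - 20,000
print(f"per month: {monthly[0]:,} - {monthly[1]:,} lines")  # 160,000 - 400,000
```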
What GitHub Says It Won't Touch
Credit where it's due: GitHub has been specific about the boundaries. The distinction is real, even if it's thinner than the marketing suggests.
Private repository content "at rest" is not used for training. GitHub does not crawl private repos and feed them into models.
If a developer with a 50,000-line private codebase only ever invoked Copilot on 200 of those lines, only those 200 lines' worth of context entered the pipeline. The other 49,800 lines stay put.
GitHub also confirmed that Business and Enterprise plan users are completely excluded. Their data policies haven't changed. This only hits individual-tier accounts: Free, Pro, and Pro+.
Which creates an absurd situation. A solo developer building a SaaS product on a personal Pro account gets less data protection than a junior dev at a Fortune 500 company using the same tool on the same kind of code. The difference? Corporate lawyers negotiated those Enterprise terms. Individual developers got a blog post and a toggle.
The Full Opt-Out Walkthrough
This takes about 90 seconds. There's no reason to wait.
Step 1: Open the Copilot settings page.
Direct URL:
https://github.com/settings/copilot
Manual path: GitHub profile icon (top right) → Settings → Copilot (left sidebar).
Step 2: Find the training data toggle.
Look for a setting labeled:
"Allow GitHub to use my data for product improvements"
It's on the main Copilot settings page under the data usage section. Simple toggle switch.
Step 3: Turn it off.
Flip the toggle to disabled. The page should auto-save. If there's a save button, click it. Verify it shows as disabled before navigating away.
Step 4: Verify via API.
UI toggles are great. API confirmation is better. Run this:
curl -s \
-H "Authorization: Bearer YOUR_GITHUB_TOKEN" \
-H "Accept: application/vnd.github+json" \
-H "X-GitHub-Api-Version: 2022-11-28" \
https://api.github.com/user/copilot \
| jq '{ ide_chat: .copilot_ide_chat, model_training: .copilot_model_training_opt_in }'
Replace YOUR_GITHUB_TOKEN with a personal access token that has the read:user scope. The copilot_model_training_opt_in field should return false. If it returns true, the opt-out didn't stick — go back and try again.
Don't have jq installed? The raw JSON output works fine too:
curl -s \
-H "Authorization: Bearer YOUR_GITHUB_TOKEN" \
-H "Accept: application/vnd.github+json" \
-H "X-GitHub-Api-Version: 2022-11-28" \
https://api.github.com/user/copilot
Search the output for copilot_model_training_opt_in and confirm it's false.
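The same check can be scripted for a cron job or CI step. A sketch using only the Python standard library — the endpoint and the copilot_model_training_opt_in field are the ones the curl example above uses, so treat both as subject to change on GitHub's side:

```python
import json
import os
import urllib.request

# Scripted version of the curl check above. The endpoint URL and the
# copilot_model_training_opt_in field are taken from this article's
# curl example; they are assumptions, not a stable contract.
API_URL = "https://api.github.com/user/copilot"


def is_opted_out(payload: dict) -> bool:
    """True only when the training field is explicitly false.

    A missing or renamed field counts as NOT opted out, so an API
    change fails loudly instead of reporting false safety.
    """
    return payload.get("copilot_model_training_opt_in") is False


def fetch_settings(token: str) -> dict:
    """Fetch the authenticated user's Copilot settings as a dict."""
    req = urllib.request.Request(
        API_URL,
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
            "X-GitHub-Api-Version": "2022-11-28",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Usage (needs network and a PAT with read:user scope):
#   if not is_opted_out(fetch_settings(os.environ["GITHUB_TOKEN"])):
#       raise SystemExit("WARNING: opt-out not confirmed")
```

Defaulting to "not opted out" when the field is absent is deliberate: a silent API rename should trigger the warning, not suppress it.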
Step 5: Check organization settings.
Anyone who administers a GitHub organization on a Free or Pro plan should also check org-level settings:
https://github.com/organizations/YOUR-ORG/settings/copilot
Organization admins can enforce the opt-out for all members. Worth doing, especially for open source projects where contributors on personal accounts might not update their own settings.
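For admins of several organizations, enumerating the settings pages beats remembering them. A small sketch: GET /user/orgs is a standard GitHub REST endpoint, and the settings URL pattern is the one shown above — both are assumptions about GitHub's current layout:

```python
import json
import os
import urllib.request

# The settings URL pattern matches the path shown earlier in this
# walkthrough; GitHub may change it.
SETTINGS_URL = "https://github.com/organizations/{login}/settings/copilot"


def org_settings_urls(orgs: list) -> list:
    """Build the Copilot settings URL for each org payload from the API."""
    return [SETTINGS_URL.format(login=org["login"]) for org in orgs]


def fetch_orgs(token: str) -> list:
    """List organizations for the authenticated user (GET /user/orgs)."""
    req = urllib.request.Request(
        "https://api.github.com/user/orgs",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Usage (needs network and a token with org read access):
#   for url in org_settings_urls(fetch_orgs(os.environ["GITHUB_TOKEN"])):
#       print(url)  # open each and verify the toggle
```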
Who's Affected and Who Isn't
| Plan | Affected? | Action Needed |
|---|---|---|
| GitHub Free (with Copilot Free) | Yes | Opt out manually |
| GitHub Pro | Yes | Opt out manually |
| GitHub Pro+ | Yes | Opt out manually |
| GitHub Team | Possibly | Admin should verify org settings |
| Copilot Business | No | Existing data policies unchanged |
| Copilot Enterprise | No | Existing data policies unchanged |
The tiering tells on itself. Enterprise customers have legal teams that would shred the contract over a default opt-in to training. Individual developers don't have legal teams. They have Reddit threads and strongly worded blog posts.
The "We're Not Reading Your Repos" Problem
GitHub's position boils down to: "We're only using interaction data, not your repository content." Technically accurate. Practically misleading.
The interaction data contains code from those repositories. Every snippet sent as context originated from a repo the developer intended to keep private. Every prompt that references internal architecture. Every generated output that builds on proprietary logic. It all came from somewhere, and that somewhere was a private repo.
The distinction between "we didn't read your private repo" and "we read the parts of your private repo that flowed through our tool" is real but razor-thin. Especially at scale.
Think about what a typical Copilot power user generates over a month. Tens of thousands of lines of private code fragments sitting in GitHub's "interaction data" corpus. Nobody pushed those lines to a public repo. Nobody checked a box saying "use this for training." The consent happened retroactively, via a default-on toggle that most users will never see.
The Supply Chain Angle Nobody's Talking About
Here's where things get genuinely messy.
A freelancer uses their personal GitHub Pro account. They work on three client projects, all in private repos. They use Copilot constantly because it makes them faster and clients don't care how the code gets written, just that it works.
Starting April 24, fragments of every client's codebase enter the training pipeline. The freelancer might not even know. The clients definitely don't. No NDA was violated intentionally. The developer just used the same IDE plugin they always use.
Now scale that to regulated industries. A contractor doing healthcare work uses their personal Copilot account. Internal API naming conventions that hint at data models, database schemas with field names that reference PHI categories, authentication patterns specific to HIPAA-compliant systems — all of that can end up in interaction data. The contractor didn't do anything wrong. They just didn't opt out of a setting they didn't know existed.
The liability questions here aren't hypothetical anymore. They're calendar events.
What the Developer Community Got Right (and Wrong)
The backlash split into predictable camps.
Some developers called it a bait-and-switch. They adopted Copilot under one set of data policies, integrated it into daily workflows, and now those policies are changing underneath them. This is a fair reading. The terms changed. The habits built on the old terms didn't.
Others argued that interaction data is fundamentally different from repository access and that AI training on usage patterns makes the tool better for everyone. Also fair, but it assumes anonymization is bulletproof and that code fragments aren't reconstructible. That's an assumption doing a lot of heavy lifting.
The most interesting critique came from developers asking why code is treated differently from other creative work. When Spotify uses listening data to improve recommendations, the algorithm learning someone's taste in music doesn't threaten their career. When GitHub uses coding data to improve Copilot, the improved model generates code that competes with the developer who produced the training data. Users improve the product, and the improved product reduces demand for those same users.
Nobody's solved that incentive problem. Most companies aren't even acknowledging it exists.
Alternatives Worth Knowing About
For developers who want AI coding assistance without the training data question hanging overhead, options exist. None are perfect.
Self-hosted models. Running a local code model through Ollama, LM Studio, or llama.cpp means code never leaves the machine. Quality is improving fast but still falls short on large, context-heavy codebases.
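To make the self-hosted option concrete, here's a minimal sketch of requesting a completion from a local model through Ollama's HTTP API on localhost — nothing leaves the machine. The port is Ollama's default, and the model name is an assumption (any pulled code model works, e.g. after `ollama pull qwen2.5-coder`):

```python
import json
import urllib.request

# Minimal client for Ollama's /api/generate endpoint, assumed to be
# running locally on its default port 11434.


def build_request(model: str, prompt: str) -> dict:
    """Assemble a non-streaming generate request body."""
    return {"model": model, "prompt": prompt, "stream": False}


def complete(model: str, prompt: str) -> str:
    """Send a prompt to the local model and return its full response."""
    body = json.dumps(build_request(model, prompt)).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]


# Usage (needs Ollama running with a pulled code model; model name is
# an example, not a recommendation):
#   print(complete("qwen2.5-coder",
#                  "Write a Python function that validates an RS256 JWT."))
```

No token, no telemetry toggle, no retroactive policy change: the trade is raw model quality for total data locality.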
Copilot Business or Enterprise. The "pay more, keep your data" option. GitHub's upsell pitch, working exactly as designed. Cynical but effective.
Competitors with no-training policies. Some AI coding tools have taken a hard stance against training on user data. Read the actual terms of service, not the landing page. Policies change. This exact GitHub situation proves that point.
Local inference in the editor. Some IDEs are building on-device AI features that never phone home. Still early. The direction matters more than the current quality.
What about open source contributors?
For developers working exclusively on public, open-source code, the training data question is less urgent — the code is already public. But interaction data still includes prompts, behavioral signals, and context from private development branches that might not be public yet. Even open source developers should consider opting out if they work on any private branches or repos alongside their public work.
The Actual Checklist
For anyone who scrolled past everything else:
- [ ] Go to github.com/settings/copilot
- [ ] Disable "Allow GitHub to use my data for product improvements"
- [ ] Run the API curl command to verify copilot_model_training_opt_in is false
- [ ] If administering an org, check org-level Copilot settings
- [ ] Tell other developers on the team — especially contractors and freelancers on personal accounts
- [ ] Set a calendar reminder after April 24 to verify the setting didn't reset
- [ ] Review which Copilot features are still enabled and whether the trade-off makes sense
The Default Is the Policy
Here's the uncomfortable truth that makes this more than a settings toggle.
GitHub reported over 150 million developers on the platform. Copilot has millions of active users. Even a generous estimate — say 10% of affected users see the announcement and opt out — leaves millions of developers unknowingly feeding interaction data into training runs. That's not a bug. That's the business model.
Default settings are policy. Blog posts are plausible deniability. The 30-day window between announcement and enforcement is just long enough to say "we gave people time" and just short enough to ensure most people miss it.
GitHub built a tool developers genuinely love. Copilot makes people faster. That's not in dispute. What's in dispute is whether a good product earns the right to change data policies retroactively with a default opt-in that most users won't see.
Nineteen days. Go flip the toggle. Tell someone else to flip theirs.
Top comments (1)
One surprising insight: in our experience, teams often underestimate the impact of feeding their code into AI models. The real challenge isn't just privacy; it's the risk of AI-generated outputs subtly influencing your codebase's style and logic over time. This can lead to less human oversight and unexpected technical debt. It's crucial to weigh these factors when deciding whether to opt in or out. - Ali Muwwakkil (ali-muwwakkil on LinkedIn)