Every AI coding tool you use needs access to your code to function. Copilot reads your files for completions. Cursor indexes your project for context. LangChain traces log your prompts and outputs for observability.
The problem is not that these tools access your code. The problem is that most engineers never ask what happens to that code after the tool processes it. Where does the telemetry go? Who trains on it? Is your proprietary logic ending up in a foundation model's training set?
This week, GitHub's decision to opt all users into AI model training by default made this question impossible to ignore. But GitHub is not the only platform doing this. It is the default pattern across the entire AI tooling stack.
The Default Is Always "Opted In"
Here is how it works at almost every AI tool company: ship the feature, opt everyone in, bury the toggle three levels deep in settings, and wait for someone to notice.
GitHub opted users into training data collection. The setting lives under Settings > Privacy, and you have to disable it manually. Cursor uploads your project files for cloud-based indexing to power its AI features. LangSmith, the observability layer for LangChain, logs your prompts, model outputs, and even API keys that appear in traces by default.
None of this is hidden exactly. It is documented if you know where to look. But documentation is not consent. And the default matters more than the documentation, because most engineers never change the defaults.
The real issue is compounding exposure. Each tool on its own seems manageable. But when you stack Copilot, Cursor, LangSmith, and your CI/CD telemetry together, your entire codebase is being transmitted to four different cloud providers simultaneously. None of them coordinate on data handling. Each has its own retention policy, its own training pipeline, its own definition of "anonymous".
Why This Matters for Production Systems
If you are building AI systems in production, your codebase contains things that should never leave your organization: proprietary algorithms, customer data handling logic, API keys in commit history, infrastructure patterns that reveal your architecture.
When I was building Menthera, our voice AI system handled sensitive mental health conversations. The architecture included multi-LLM orchestration across Claude, GPT, and Gemini, persistent memory via Mem0, and real-time voice processing through WebRTC. If any of that codebase had ended up in a training set, it would have exposed not just our code but the design decisions that gave us our technical edge.
This is the reality for every team shipping AI features in production. Your code is not just code. It is your competitive advantage, your security surface, and your liability.
The 4-Point Audit Every Team Should Run This Week
Here is what I recommend for any team using AI coding tools in production:
1. Inventory every AI tool touching your codebase
List them all: IDE extensions, AI coding assistants, observability platforms, CI/CD integrations. If it processes your code, it goes on the list. Most teams are surprised to find they have 5 or more AI tools with code access.
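As a rough starting point, you can grep your editor's extension list for known AI assistants. This sketch pipes a hard-coded sample list through grep so it runs anywhere; in practice you would feed it the real output of `code --list-extensions` (VS Code's CLI). The extension IDs and the pattern list are illustrative, not exhaustive:

```shell
# Sketch: flag AI coding extensions in an editor extension list.
# Replace the printf with:  code --list-extensions
printf '%s\n' \
  "github.copilot" \
  "ms-python.python" \
  "continue.continue" \
  "esbenp.prettier-vscode" \
| grep -iE 'copilot|cursor|codeium|tabnine|continue'
```

Anything this surfaces goes on your inventory list, alongside CLI tools, observability SDKs, and CI/CD integrations that the editor list will not show.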
2. Check telemetry and data sharing settings for each tool
Go into settings for every tool on your list. Look for "telemetry", "data sharing", "model training", and "usage analytics". Disable anything that sends code content upstream. This takes 20 minutes and could save you from a data leak you never knew was happening.
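For tools built on VS Code, one concrete toggle is the editor-wide telemetry level in settings.json. The key below is a real VS Code setting; other tools use their own keys and dashboards, so treat this as a sketch of the kind of setting you are hunting for:

```json
{
  // VS Code settings.json (JSONC, so comments are allowed):
  // turns off crash reports and usage telemetry editor-wide.
  "telemetry.telemetryLevel": "off"
}
```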
3. Scan your commit history for secrets
Run truffleHog or gitleaks against your repository. Secrets in commit history are the first thing that leaks when your code ends up in a training pipeline. Even if you rotated the key, the old one is still in git history. And git history is exactly the kind of data that gets bulk-ingested for training.
4. Add ignore files for sensitive paths
Create a .cursorignore file to keep Cursor from indexing sensitive directories. For Copilot, use GitHub's content exclusion settings (a Copilot Business and Enterprise feature, configured in repository or organization settings) to block specific paths. Both take minutes to set up and permanently reduce your exposure surface.
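A .cursorignore follows the same pattern syntax as .gitignore (per Cursor's documentation). The paths below are examples; substitute your own repo's sensitive directories:

```
# .cursorignore — same pattern syntax as .gitignore
.env
*.pem
secrets/
customer-data/
infra/
```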
The Bigger Picture
The model powering your AI feature is replaceable. You can swap Claude for GPT for Gemini and your system keeps working. But your proprietary code appearing in someone else's training set is permanent. There is no "undo" for training data.
The engineers who treat their code as a data liability, not just a product, will build more defensible systems in the long run.
Have you ever audited what data your AI coding tools send home? Most engineers I talk to have not. The tools are too useful to question and too convenient to distrust. But convenience is exactly how data leaks become invisible.
This week is a good time to start. Run the audit. Check the settings. Treat your code like the liability it is. The 20 minutes you spend now could prevent a data exposure you would never be able to reverse.