Why Open Source Doesn't Embrace AI

#ai #opensource #licensing #commons

A week ago I would have told you the open source community rejected AI for four obvious reasons. Then I tried defending each one properly. Most of them collapse under scrutiny — and what's left is a much more tractable conversation.

The conventional answer

Every maintainer I know has roughly the same four complaints about AI tooling.

First, nobody can tell where the training data came from, which sits badly with a culture obsessed with provenance. Second, maintainers are drowning in AI-generated PRs and CVE reports — Daniel Stenberg's posts on the curl side of this are now the canonical example. Third, AI shortcuts the old apprenticeship loop where contributors learned a codebase by earning review trust over months. And fourth, the power asymmetry: frontier models need hyperscaler capital, which sits uneasily inside a movement built on "you can read, modify, and redistribute the thing."

This is the standard pitch. Put it in front of any contributor at FOSDEM and you'd get nods all the way down. It's also the position I'd have given you a week ago, before I tried to defend it in detail and realised most of it doesn't actually hold up.

The win-win argument

The most common pushback I hear from developers is straightforward. If AI is trained on open source, and then used to write more open source, surely that's a win for everyone. The argument, basically, is that open source taught the model, and the model now helps people ship more open source.

The first-order answer is that the value doesn't circulate. Code goes into the model for free, gets laundered through training, and comes back out behind a paywall rented from a private company. The contributor whose code shaped GPT-5 or Claude or Copilot doesn't get a cut, doesn't get attribution, and pays the same subscription fee as everyone else. Compare that to how open source normally compounds: I fix a bug in libcurl, the fix lives in libcurl, every downstream user benefits, the project is permanently better. With LLMs, the upstream project gets nothing.

That answer is satisfying. It's also where the argument starts to wobble.

The RHEL problem

The community's case against AI extraction has to reckon with the fact that it has cheerfully tolerated extraction at far larger scale for decades.

Red Hat ships kernel patches upstream, but Satellite, Insights, and Ansible Automation Platform are proprietary. Red Hat made billions off a kernel it didn't write, restricted source distribution to paying customers in 2023, and remains consistently in the top tier of corporate contributors to Linux. The community grumbled and kept using RHEL. Cloud provider extraction hit Elasticsearch and MongoDB hard enough that both re-licensed in self-defence (Elastic aiming directly at AWS, MongoDB preemptively against the whole cloud category), and we still all use AWS. Amazon Linux 2023 is a downstream of Fedora, and Amazon keeps the orchestration plane closed. Nobody is staging a boycott.

If RHEL gets a pass for closed-source Satellite while making billions off a kernel they didn't write, the AI critique can't just be "they made money off the commons." That proves too much. It rules out the entire managed-services industry that open source has come to depend on.

What Valve actually shows

Valve is probably the strongest counter-example the anti-AI position has. The Steam Deck ships KDE Plasma in desktop mode, and Valve funds that work through a consultancy (Techpaladin since 2025, Blue Systems before that), which in turn employs a substantial slice of the KDE Plasma development team, including Nate Graham and David Edmundson. Valve also funds Mesa, kernel graphics, Wine, gamescope, and PipeWire work through similar arrangements. KDE Plasma on Wayland in 2026 is dramatically better than it was in 2022, and a non-trivial chunk of that is Valve money.

But the reason it works matters, because it doesn't generalise the way the community sometimes wants it to.

Valve is redistributing GPL'd binaries on every Deck they ship. The licence compels everything that follows. They have to provide source, they have to comply with copyleft, and funding upstream is partly enlightened self-interest because the alternative is forking the world and maintaining it forever. That's not moral generosity, it's licence physics. The Linux kernel made them play by Linux's rules.

AI vendors don't redistribute the code. They train on it. Whether that triggers any licence at all is the entire legal question — and so far courts have leaned toward "no, it doesn't." Which means the Valve comparison isn't actually telling us what we want it to tell us. It's telling us that when distribution triggers copyleft, copyleft works. It says nothing about what happens when distribution doesn't enter the picture.

Inspiration is not copying

This is the part of the argument I've moved most on, and the part the strong anti-AI position has the most trouble with.

If copyright treated "learning from code" as infringement, the software industry would collapse under its own weight. Every senior engineer I know has spent years reading other people's code and unconsciously carrying patterns forward. I learned how to structure a GTK app by reading Builder's plugin code. Half of what I know about production C came from reading curl. None of that gave libcurl or Builder a claim on what I write next. The uncomfortable part is that LLMs industrialise that process at a scale humans never could — and a lot of the community's gut reaction is to that scale shift, not to the legal question underneath it.

The honest legal question isn't "is training infringement." US courts so far appear sympathetic to transformative-use arguments around training when the training data was lawfully acquired — Anthropic's $1.5B settlement in Bartz v. Anthropic last year over books obtained from piracy sites is the live counter-example, and it carves the question into two pieces. Training on lawfully-licensed text looks defensible; how the corpus was assembled is its own front. And the EU is heading somewhere different again. The remaining interesting question is "is the output infringement when it reproduces training data verbatim." That's at least a problem courts can reason about.

And once you separate "training happened" from "specific expression was reproduced," software starts looking awkwardly unlike most of the creative media copyright was designed around. There are only so many ways to write code.

Copyright law already has a concept for this — it's called merger. If there are only a handful of sensible ways to express something, copyright protection gets very thin. Nobody owns the canonical for loop. Nobody owns the standard shape of a quicksort or a binary search. There's a related doctrine called scènes à faire for stock elements that are unavoidable in a genre. Google v. Oracle (2021) touched adjacent territory — the Court treated Java's declaring code as highly functional and ultimately ruled Google's use fair, with functional constraint as one of the factors that pulled the analysis in Google's direction.

Apply this to the AI debate and a lot of the "look, the model emitted GPL code!" claims get weaker on inspection. Train on enough code and exact matches alone stop telling you very much — most non-trivial codebases share enough surface patterns that something is going to collide. The real legal test isn't "does the output match." It's "does the output match in ways that have no functional justification."

Strong evidence of copying: idiosyncratic variable names, distinctive comment style, unusual algorithmic choices, reproduced bugs, copyright headers emitted verbatim, an author's specific tics. The famous Copilot examples — the Quake III fast inverse square root with the // what the fuck? comment intact, full GPL licence headers — are real and damning. They have no merger defence.

Weak evidence of copying: a quicksort that looks like every other quicksort, a hash function that matches because there are six reasonable ways to write one, boilerplate that converges because boilerplate converges. A court applying merger doctrine would throw most of these out without breaking a sweat.
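The strong/weak split can be sketched in a few lines. This is a toy heuristic, not a legal test: the `COMMON_NAMES` set, the `idiosyncratic_overlap` helper, and both sample functions are invented for illustration, and the regex-based comment stripping will misfire on `#` inside string literals. It only shows the shape of the question: what do two snippets share once the functionally constrained parts are set aside?

```python
import keyword
import re

# Hypothetical stand-in for "names that show up in everyone's code".
COMMON_NAMES = {"lo", "hi", "mid", "arr", "items", "target", "len"}

def comments(source: str) -> set[str]:
    # Text of every '#' comment, stripped of surrounding whitespace.
    return {m.strip() for m in re.findall(r"#\s*(.+)", source)}

def identifiers(source: str) -> set[str]:
    # Names left after dropping comments, keywords, and common names.
    code = re.sub(r"#.*", "", source)
    names = set(re.findall(r"[A-Za-z_]\w*", code))
    return {n for n in names if not keyword.iskeyword(n) and n not in COMMON_NAMES}

def idiosyncratic_overlap(original: str, output: str) -> dict[str, set[str]]:
    # What the two snippets share beyond the standard shape of the algorithm.
    return {
        "comments": comments(original) & comments(output),
        "identifiers": identifiers(original) & identifiers(output),
    }

original = """
def wobble_search(arr, target):
    lo, hi = 0, len(arr) - 1
    while lo <= hi:  # grandma's bisection, hands off
        mid = (lo + hi) // 2
        if arr[mid] == target:
            return mid
        if arr[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1
"""

independent = """
def find(items, target):
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if items[mid] == target:
            return mid
        if items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1
"""

copied = original

weak = idiosyncratic_overlap(original, independent)
strong = idiosyncratic_overlap(original, copied)
print(weak)
print(strong)
```

The two binary searches are structurally identical and share nothing idiosyncratic, which is the merger case. The verbatim copy shares a made-up function name and a comment that no algorithm requires, which is the case with no functional excuse.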

So the actual remaining infringement concern isn't "regurgitation happens." It's targeted regurgitation of idiosyncratic, non-functionally-constrained expression. Exact-match suppression is tractable, and some vendors are already doing it. Detecting transformed or partially memorised output is much harder in practice, and that's where the real engineering work sits. But it's still engineering work, not a philosophical impasse.
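To make the gap between those two problems concrete, here is a minimal sketch of exact-match suppression, assuming a filter that hashes fixed-size token windows from a protected corpus and checks model output against them. I don't know how any vendor actually implements this; the window size, the crude tokenizer, and every name below are my own assumptions.

```python
import hashlib
import re

WINDOW = 8  # tokens per window; an arbitrary choice for this sketch

def tokenize(source: str) -> list[str]:
    # Crude lexer: identifiers, digit runs, or any single non-space character.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", source)

def windows(tokens: list[str], size: int = WINDOW) -> set[str]:
    # Hash every run of `size` consecutive tokens.
    return {
        hashlib.sha256(" ".join(tokens[i:i + size]).encode()).hexdigest()
        for i in range(len(tokens) - size + 1)
    }

def build_index(corpus: list[str]) -> set[str]:
    # Union of window hashes across all protected documents.
    index: set[str] = set()
    for doc in corpus:
        index |= windows(tokenize(doc))
    return index

def has_exact_overlap(output: str, index: set[str]) -> bool:
    # True if any token window of the output appears verbatim in the corpus.
    return not windows(tokenize(output)).isdisjoint(index)

# Hypothetical protected snippet with an idiosyncratic name and comment.
protected = """
def frobnicate_wibble(zorp, quux):
    # magic constant from the 2003 incident, do not touch
    return (zorp * 40503 + quux) % 65521
"""

# Same logic with identifiers renamed and the comment dropped.
renamed = """
def mix(a, b):
    return (a * 40503 + b) % 65521
"""

index = build_index([protected])
print(has_exact_overlap(protected, index))  # True
print(has_exact_overlap(renamed, index))    # False
```

The verbatim copy is caught and the lightly transformed one sails through, even though it computes exactly the same thing. Catching the second case means normalising identifiers, comparing structure, or something heavier, and every step in that direction risks flagging the convergent boilerplate that merger says nobody owns.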

What actually remains

Once you get past the slogans, the objections narrow pretty quickly. None of what remains is wrong. None of it is what the discourse usually sounds like.

The first is auditability. Open source culture runs on receipts — git blame, mailing list archives, CVE databases, sign-off lines — the whole apparatus is built around being able to trace value to its source. RHEL's contribution to the commons is auditable down to the patch. Valve's is auditable down to the developer. AI vendors offer no equivalent. The training corpus is undisclosed, the weights are closed, and there's no way to compute what fraction of a model's capability came from your project versus someone else's. The objection isn't really "you took our code." It's "you took our code and we have no way to audit what happened next." That's a process complaint, but it's a legitimate one, and it could in principle be answered.

The second is targeted regurgitation — the case where models emit idiosyncratic chunks of training data verbatim, not statistical convergence on standard patterns but actual reproduction of distinctive expression. This is a real infringement concern, but it's an engineering one. Exact-match detection is solved. Semantically equivalent memorisation isn't. The work is real, but the problem is bounded.

The third is harder, because it doesn't have a copyright answer at all. A human inspired by reading Linux gets a modest productivity boost on their next project — they still have to write it. An LLM trained on Linux gives every developer using it a substantial productivity boost, sold by subscription, to millions of users. The legal status is the same as the inspired human. The economic transfer is wildly different. Copyright was never designed to police that gap, because until now the gap didn't exist — inspiration was bounded by human cognitive throughput. This is a real concern, but it's a policy question about whether we want a new tax or licence regime for industrial-scale inspiration. Conflating it with copyright, the way most of the community rhetoric does, is sloppy — copyright doesn't cover what we're actually upset about.

Three things, then. Publish training corpora. Suppress idiosyncratic regurgitation. Have a serious conversation about whether productivity transfers at this scale need a policy response. None of it requires rejecting AI.

The reframe

Which brings me to the position I've actually landed on, which is that the community has been judging AI vendors by the wrong standard.

In the open source communities I'm part of (GNOME, the Rust crates ecosystem, the desktop Linux world), AI-assisted development is already flowing back into public codebases at substantial scale. Every Rust crate, every GTK widget, every kernel patch that ships faster because someone had a competent pair programmer in their editor is a contribution to the commons. It just doesn't show up on the AI vendor's ledger. If you measure contribution by what ends up in the commons rather than what the model vendor open-sources, the ledger might already be net positive. We're measuring the wrong column.

The community's current position is, I think, partly a category error. It's judging AI vendors by the standard we apply to forges — GitHub, GitLab, SourceHut — when the closer analogue is compilers and IDEs. Nobody demanded GCC's optimisation heuristics be reproducible from public data, or that JetBrains open-source IntelliJ before you were allowed to write Linux patches in it. The analogy isn't perfect — compilers don't ingest copyrighted corpora the way foundation models do — but culturally we treated those things as productivity tools rather than infrastructure providers. We judged them by what their users produced.

There's a coherent version of the pro-AI argument that says: AI vendors are toolmakers, not platform owners. Judge them by what their users produce. And what those users produce is overwhelmingly more open source, shipped faster.

Most contributors I know already use these tools. The real issue is what would let maintainers admit that publicly without feeling like they're conceding the entire argument — and the answer, when you work it out, turns out to be a fairly short list. Receipts, regurgitation suppression, and an honest policy conversation about productivity transfer. That's all of it.

That's a much shorter list than the rhetoric suggests.