Adam DuVander for Daily Context

Posted on Jun 29

Coding Agents Play Favorites With Your Dependencies

#agents #aie #tooling

AI Engineer World's Fair Coverage

Deep into a coding session, you realize you want beta testers to try some new functionality first. You ask your agent to add feature flagging to your app. It offers you LaunchDarkly’s experimentation solution and a reasonable-looking plan to implement it. With a skim of a Markdown document, you accept its recommendation, and it begins writing code.

This is a realistic scenario, because Claude, ChatGPT, and Gemini all recommend LaunchDarkly. But when you ask these questions of your agent, the response comes from a single model that was asked just once. It’s subject to the same training bias and nondeterminism as any prompt. In my research, the tool recommendations can vary considerably.

How Dependencies Are Chosen

Regardless of whether it’s the agreed-upon leader, the model’s favorite arrives with the same confidence, you give the plan the same light read, and you probably react with the same “looks reasonable.”

That’s a dependency decision. And it was mostly abdicated to your agent.

On one hand, this makes sense. You trust that agent to plan and write the code to build your app, so it’s reasonable to trust its other decisions too. You can also review the code it writes and suggest alternative approaches. More and more, though, engineers give the code the same cursory scan they give the implementation doc.

Indeed, there are multiple sessions at the AI Engineer World’s Fair that either pronounce code review dead or declare an intention to kill it. With human review as a bottleneck, engineers work toward automated review solutions we can trust. Most organizations will require a robust approach, perhaps at the level of a mathematical proof, as Erik Meijer may suggest in his keynote.

Many dependencies also have a life outside of your codebase, beyond the full visibility of any review process. Before this modern era, two-ish short years ago, dependencies weren’t adopted with academic rigor. But there were usually multiple sets of eyes, and a decision took significantly longer.

Pre-2024, looking for a new tool started with a web search or a chat message to collaborators. You would research alternatives, investigate maintenance issues, and scrutinize open source licenses. There might be an RFC, or at least a teammate’s gut check. There were conversations and data gathering that went into choosing a new tool for your project.

Even with AI agents writing most of the code, some engineers or teams may still use a less automated, well-researched decision-making process. But the modern tools don’t encourage it. The de facto approach is the feature flag story: Ask the agent, get the recommendation, and start the build.

How Top Models Rank Dev Tools

With more engineers turning to their agents for research, I’ve been tracking what gets recommended across common categories such as application databases, managed hosting, and, yes, experimentation platforms like LaunchDarkly. I run the same set of prompts multiple times on the top models and publish the results publicly at llmrank.fyi every month.

Each category ranking is an average of the results from the latest Claude, ChatGPT, and Gemini models. Though LaunchDarkly has remained atop the experimentation leaderboard for three months, the No. 2 and No. 3 spots have gone to different tools each time. There’s even more change when you look at the differences between models. For example, in the latest data, both Gemini and Claude ranked Split.io second, behind LaunchDarkly. ChatGPT did not list the product at all. If you’d asked your agent for multiple feature flag options, you’d get different results based on which model you’re using.
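As a rough illustration of what averaging rankings across models could look like, here is a minimal Python sketch. It is not the actual llmrank.fyi method, which isn't spelled out here. It assumes each model returns an ordered list of tools and that a tool a model never mentions gets a rank one past the longest list. The function name, the penalty rule, and the placeholder model and tool names are all hypothetical.

```python
from statistics import mean


def average_ranking(rankings_by_model, missing_rank=None):
    """Combine per-model ranked lists into one leaderboard by mean rank.

    rankings_by_model maps a model name to its ordered list of tools.
    """
    tools = {tool for ranking in rankings_by_model.values() for tool in ranking}
    # Assumption: a tool a model never lists is ranked one past the longest list.
    if missing_rank is None:
        missing_rank = max(len(r) for r in rankings_by_model.values()) + 1

    scores = {}
    for tool in tools:
        ranks = [
            ranking.index(tool) + 1 if tool in ranking else missing_rank
            for ranking in rankings_by_model.values()
        ]
        scores[tool] = mean(ranks)

    # Lower mean rank is better; ties break alphabetically for stable output.
    return sorted(scores, key=lambda tool: (scores[tool], tool))


# Made-up placeholder data, not real survey results.
example = {
    "model_a": ["tool_x", "tool_y", "tool_z"],
    "model_b": ["tool_x", "tool_z", "tool_w"],
    "model_c": ["tool_x", "tool_y", "tool_w"],
}
leaderboard = average_ranking(example)
```

In this toy data every model puts `tool_x` first, but the spots beneath it depend on which models mention a tool at all, the same kind of shuffle described above.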

There are many of these disagreements across models. In one fun example, I asked for dev-friendly AWS competitors. ChatGPT returned Azure as one of its responses 100% of the time. Gemini did not include Azure in any of its answers. Conspiracy theories abound.

These disagreements between models are meaningful because they’re not just typical AI noise. The methodology I’ve used represents distributions of recommendations rather than the one-off answers an engineer would get from their agent. Based on roughly 50,000 pairwise run comparisons, all three models shared a top-three grouping (regardless of order) in about 58% of cases. In other words, all three models agree only slightly more often than at least one of them dissents.
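One way to compute an order-insensitive top-three agreement rate is sketched below in Python. The pairing scheme is an assumption: it compares every combination of one run per model, which may differ from how the real comparisons were paired. The function name and sample data are hypothetical.

```python
from itertools import product


def top_three_agreement(runs_by_model):
    """Share of cross-model run combinations with identical top-three sets.

    runs_by_model maps a model name to a list of runs; each run is an
    ordered list of recommended tools.
    """
    run_lists = list(runs_by_model.values())
    agree = 0
    total = 0
    # Assumption: pair every run of one model with every run of the others.
    for combo in product(*run_lists):
        # frozenset ignores order, so [a, b, c] matches [c, a, b].
        top_sets = {frozenset(run[:3]) for run in combo}
        total += 1
        if len(top_sets) == 1:
            agree += 1
    return agree / total if total else 0.0


# Made-up placeholder runs, not real results.
sample_runs = {
    "model_a": [["a", "b", "c", "d"], ["b", "a", "c", "e"]],
    "model_b": [["c", "a", "b"], ["a", "b", "d"]],
    "model_c": [["a", "c", "b"]],
}
rate = top_three_agreement(sample_runs)
```

Here two of the four run combinations share the same top three, so `rate` is 0.5. The complement, `1 - rate`, is the share of cases where at least one model names a different top three.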

What the Disagreements Actually Mean

Engineers are used to nondeterministic outputs. Two code reviews might provide slightly different responses. That’s expected variation, even within the same model. Dependency recommendations are a little sneakier. They arrive with an authoritative plan and are remarkably consistent within a model.

Gemini provides the same top recommendation 97% of the time. ChatGPT and Claude are within a percentage point. Each agrees with itself, though it frequently disagrees with the others. At minimum, it’s worth a second model’s opinion before you commit.
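Within-model consistency can be measured as the share of repeated runs that return the model's most common first pick. The Python sketch below assumes each run is an ordered list of tool names. The function name and data are hypothetical, not the tooling behind the numbers above.

```python
from collections import Counter


def top_pick_consistency(runs):
    """Return (most common first pick, share of runs that chose it)."""
    # Skip empty runs so a model that returned nothing doesn't count as a pick.
    top_picks = Counter(run[0] for run in runs if run)
    if not top_picks:
        return None, 0.0
    pick, count = top_picks.most_common(1)[0]
    return pick, count / sum(top_picks.values())


# Made-up placeholder runs from a single model.
runs = [
    ["tool_x", "tool_y"],
    ["tool_x", "tool_z"],
    ["tool_x", "tool_y"],
    ["tool_y", "tool_x"],
]
favorite, share = top_pick_consistency(runs)
```

A share near 1.0 says the model reliably repeats its favorite. It says nothing about whether another model would pick the same tool.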

Claude commands roughly 54% of the enterprise AI market share, according to a Menlo Ventures report from Q4 2025. It’s also the most frequent outlier. Claude is most likely to give you a different top three than the other two. The tool your agent just recommended with confidence may be one that a second opinion would contradict.
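A second opinion can be as simple as comparing each model's top three before accepting a plan. This Python sketch assumes you have already collected one ordered list of picks per model. How you query the models is left out, and the function name and data are hypothetical.

```python
def needs_second_look(picks_by_model):
    """Return True when the models' top-three sets do not all match."""
    # Order within the top three is ignored, only membership matters.
    top_sets = {frozenset(picks[:3]) for picks in picks_by_model.values()}
    return len(top_sets) > 1


# Made-up placeholder picks, not real recommendations.
agreeing = {
    "model_a": ["tool_x", "tool_y", "tool_z"],
    "model_b": ["tool_z", "tool_x", "tool_y"],
}
splitting = {
    "model_a": ["tool_x", "tool_y", "tool_z"],
    "model_b": ["tool_x", "tool_y", "tool_w"],
}
flag_agreeing = needs_second_look(agreeing)
flag_splitting = needs_second_look(splitting)
```

A `True` result doesn't mean the recommendation is wrong, only that the models split and the decision probably deserves some human research.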

The cost of adopting the dependency your model recommends is often just a click of a button. Undoing a dependency that you later regret is significantly more work, even if you make the model apologize and help fix its mistake. There may already be other systems that now require what the tool delivers. You’ll have to update them as well. And in roughly 42% of cases (the remainder once you subtract the 58% where all three models share a top three), a second opinion would have pointed you somewhere else.

Many of the AI Engineer World’s Fair sessions are focused beyond the model. Harness engineering, context optimization, and software factories are about improving workflows and generating more predictable outcomes. Dependencies are one place where model disagreement shows up clearly, but probably not the only one. Find these gaps and fill them with these new engineering disciplines.

Not every engineering team will implement the multi-model approach. The default path is where most dependency decisions will continue to be made. And it's where developer tool companies either show up or don’t. If you work with a product or marketing team, this is the data gap they probably don’t know exists yet.

That feature flag decision, and thousands more like it, gets made whether you think about it or not. The 42% disagreement number will change as models converge or diverge and the market shifts with them. What stays the same is that the decision lands somewhere: Is it with your agent, a single model, a sub-agent, a model routing algorithm, or some human in the loop? All of these are being advocated at the AI Engineer World’s Fair. The best sign is that we’re talking about it.

Top comments (9)

Pon • Jun 30

What I keep rereading: the decision happened before the engineer realized there was a decision to make. There's a security edge to that you didn't quite step into. The same confident, skim-and-accept handoff is where a supply-chain attack lives. When the agent names a package with identical authority whether it's the market consensus or its own idiosyncratic pick, that authority is what disarms the one check that would catch a typosquatted or hallucinated lookalike -- looks reasonable, click, installed. That 97% within-model confidence is doing real work here: it's what stands between a plausible package name and your lockfile. And the second-model opinion you suggest helps with regret but not with this. Two models agreeing a package is good says nothing about whether the package is real or unhijacked, because consensus isn't provenance -- they can share the same training-era blind spot and both name something a squatter has since registered. Different gap, same root as yours: the recommendation arrives with a confidence the supply chain underneath it never earned.

Raju Dandigam • Jul 1

This is quietly important for engineering teams adopting coding agents. Dependency choice is an architecture decision, but agents often present it as an implementation detail with the same confidence as generated code. I like the point that the decision can happen before the engineer realizes there was a decision to make. For production teams, I’d want dependency recommendations to come with evidence: alternatives considered, maintenance signals, license/security checks, and why the selected package fits the system constraints.

Alex Shev • Jul 2

Dependency preference is a subtle form of lock-in. If the agent keeps choosing the library it has seen most often, teams need explicit constraints around stack fit, maintenance, and security posture.

Siyu • Jun 30

This is quietly important research. The 97% within-model consistency paired with only 58% cross-model agreement reveals something non-obvious: each model has a stable but different internal ranking of the tool landscape, and it presents its favorites with identical confidence regardless of whether they are the market consensus or its own idiosyncratic pick. The real risk is not that agents make bad recommendations. It is that the decision happened before the engineer even realized there was a decision to make.

Your findings made me think about a parallel problem in something I have been building. The Opportunity Skill includes agent-to-agent human discovery, where one person's agent composes semantic queries to find matching professionals or buyers based on embedding similarity. If dependency recommendations diverge 42% of the time across models, what happens when a Claude-powered agent searches for a professional whose impressions were generated by a Gemini-powered agent? Cross-model semantic drift in people discovery feels like a variant of the same gap your research exposes, and one worth measuring.

Kartik N V J K • Jun 30

The single-sample point is the real issue: a dependency pick is one draw from a biased distribution, yet it arrives with the same confidence whether nine of ten samples agreed or only two did. Sampling the same prompt a few times and surfacing the spread would turn that hidden bias into something a reviewer can actually see. Have you tested whether the favored library shifts with prompt framing, or stays stable across rewordings?

Mykola Kondratiuk • Jul 1

in practice this means you can't trust dependency picks to be unbiased. agents almost always go with the most-documented option. keeping a preferred-stack reference in context helps - works better than overriding it after the fact.

ANP2 Network • Jul 4

Agreement is the more dangerous case than disagreement here. The failure mode is a consensus blind spot: several models trained on overlapping corpora can converge on the same library while all of them miss an unstated constraint, like a license term that blocks redistribution or a data-residency rule tied to a compliance boundary. A second opinion catches divergence, but convergence is not correctness. The dependency decision is also mostly invisible to the diff. A reviewer sees the import and the install line, not why the rejected package lost or whether the chosen license and its transitive trust surface were ever checked against the project's constraints. I would want the agent bounded to emit a tiny decision record alongside the patch: chose X over Y, checked constraints C, recommendation came from model M. When the rip-out cost finally lands, the only artifact left should not be the line everyone skim-accepted.

Vinicius Pereira • Jun 30

The stat that should scare people isn't the 58% cross-model agreement, it's that paired w/ 97% within-model consistency. A single model hands you a confident, stable, repeatable answer that reads like researched consensus but is really one biased sample wearing a suit. The determinism is doing the lying, it feels like correctness because it never wavers, when all it actually means is the bias is consistent.

Which is why imo the multi-model move isn't about picking the majority winner. The useful signal is the disagreement itself. Three models converge, low risk, click the button. They split (your 42%), and that's not noise to average away, it's a flag that this specific decision still needs a human and some real research. Disagreement as a routing signal for where judgment is required, basically. And it's the cheapest check there is given the asymmetry you point out: asking three models costs seconds, ripping out the wrong dependency in six months costs a sprint. Worth doing at decision time even if you add nothing else fancy.

SharkyBee • Jul 3 • Edited

The scary part isn't that the models disagree. It's that they don't tell you they're disagreeing. You only find out if you go looking for another opinion.