Coding Agents Play Favorites With Your Dependencies

claudio ubeda — Tue, 30 Jun 2026 01:50:35 +0000

Deep into a coding session, you realize you want beta testers to try some new functionality first. You ask your agent to add feature flagging to your app. It offers you LaunchDarkly’s experimentation solution and a reasonable-looking plan to implement it. With a skim of a Markdown document, you accept its recommendation, and it begins writing code.

This is a realistic scenario, because Claude, ChatGPT, and Gemini all recommend LaunchDarkly. But when you ask these questions of your agent, the response comes from a single model that was asked just once. It’s subject to the same training bias and nondeterminism as any prompt. In my research, tool recommendations vary considerably across models and from month to month.

How Dependencies Are Chosen
Regardless of whether it’s the agreed-upon leader, the model’s favorite arrives with the same confidence, you give the plan the same light read, and you probably react with the same “looks reasonable.”

That’s a dependency decision. And it was mostly abdicated to your agent.

On one hand, this makes sense. You trust that agent to plan and write the code to build your app, so it’s reasonable to trust its other decisions too. You can also review the code it writes and suggest alternative approaches. But more and more, engineers give the code the same cursory scan they give the implementation doc.

Indeed, there are multiple sessions at the AI Engineer World’s Fair that either pronounce code review dead or declare an intention to kill it. With human review as a bottleneck, engineers work toward automated review solutions we can trust. Most organizations will require a robust approach, perhaps at the level of a mathematical proof, as Erik Meijer may suggest in his keynote.

Many dependencies also have a life outside of your codebase, beyond the full visibility of any review process. Before this modern era, two-ish short years ago, dependencies weren’t adopted with academic rigor. But there were usually multiple sets of eyes, and a decision took significantly longer.

Pre-2024, looking for a new tool started with a web search or a chat message to collaborators. You would research alternatives, investigate maintenance issues, and scrutinize open source licenses. There might be an RFC, or at least a teammate’s gut check. There were conversations and data gathering that went into choosing a new tool for your project.

Even with AI agents writing most of the code, some engineers or teams may still use a less automated, well-researched decision-making process. But the modern tools don’t encourage it. The de facto approach is the feature flag story: Ask the agent, get the recommendation, and start the build.

How Top Models Rank Dev Tools
With more engineers turning to their agents for research, I’ve been tracking what gets recommended across common categories such as application databases, managed hosting, and, yes, experimentation platforms like LaunchDarkly. I run the same set of prompts multiple times on the top models and publish the results publicly at llmrank.fyi every month.

Each category ranking is an average of the results from the latest Claude, ChatGPT, and Gemini models. Though LaunchDarkly has remained atop the experimentation leaderboard for three months, No. 2 and No. 3 have been different each time. There’s even more change when you look at the differences between models. For example, in the latest data both Gemini and Claude ranked Split.io second, behind LaunchDarkly, while ChatGPT did not list the product at all. If you asked your agent for multiple feature flag options, you’d get different results depending on which model you’re using.
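To make the idea of averaging across models concrete, here is a minimal Python sketch. It is not the llmrank.fyi code, and the scoring rule is an assumption: each tool is scored by the share of a model's runs that mention it, and those shares are then averaged over the models. The model names, tool names, and runs are all made-up placeholders.

```python
from collections import Counter


def mention_rates(runs):
    """Fraction of one model's runs in which each tool appears at least once."""
    counts = Counter(tool for run in runs for tool in set(run))
    return {tool: n / len(runs) for tool, n in counts.items()}


def average_across_models(runs_by_model):
    """Average each tool's mention rate over all models (0.0 where a model never lists it)."""
    rates = {model: mention_rates(runs) for model, runs in runs_by_model.items()}
    tools = {tool for per_model in rates.values() for tool in per_model}
    return {
        tool: sum(per_model.get(tool, 0.0) for per_model in rates.values()) / len(rates)
        for tool in tools
    }


# Hypothetical placeholder data: two runs per model, each run a list of recommended tools.
runs_by_model = {
    "model_a": [["tool_x", "tool_y"], ["tool_x", "tool_y"]],
    "model_b": [["tool_x", "tool_y"], ["tool_x", "tool_z"]],
    "model_c": [["tool_x", "tool_z"], ["tool_x"]],
}

scores = average_across_models(runs_by_model)
# Highest average first; ties broken alphabetically so the order is stable.
ranking = sorted(scores, key=lambda tool: (-scores[tool], tool))
print(ranking)
```

In this toy data, tool_y lands second overall even though model_c never mentions it, which is the kind of gap a single ask to a single model would hide.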

There are many of these disagreements across models. In one fun example, I asked for dev-friendly AWS competitors. ChatGPT returned Azure as one of its responses 100% of the time. Gemini did not include Azure in any of its answers. Conspiracy theories abound.

These disagreements between models are meaningful because they’re not just typical AI noise. The methodology I’ve used captures distributions of recommendations rather than the one-offs an engineer would get from their agent. Based on roughly 50,000 pairwise run comparisons, all three models shared a top-three grouping (regardless of order) in about 58% of cases. In other words, full agreement is only somewhat more common than at least one model dissenting, which happens in the remaining 42% or so.
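One way to compute a top-three agreement rate like the one above is sketched below in Python. The comparison scheme is an assumption, not a description of the actual methodology: it takes every combination of one run per model and counts the combinations where all models produce the same set of top-three tools, ignoring order. The model names, tool names, and runs are made-up placeholders.

```python
from itertools import product


def top_three(ranking):
    """Top three tools of one run as an unordered set."""
    return frozenset(ranking[:3])


def agreement_rate(runs_by_model):
    """Share of cross-model run combinations whose top-three sets are all identical."""
    combos = list(product(*runs_by_model.values()))
    agreeing = sum(
        1 for combo in combos if len({top_three(run) for run in combo}) == 1
    )
    return agreeing / len(combos)


# Hypothetical placeholder data: each run is an ordered list of recommendations.
runs_by_model = {
    "model_a": [["a", "b", "c", "d"], ["b", "a", "c"]],
    "model_b": [["c", "a", "b"], ["a", "b", "d"]],
    "model_c": [["a", "c", "b"], ["a", "b", "c"]],
}

rate = agreement_rate(runs_by_model)
print(f"{rate:.0%} of combinations share a top three")
```

Comparing sets rather than ordered lists matches the "regardless of order" condition: a run that ranks the same three tools differently still counts as agreement.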

DEV Community: claudio ubeda
