<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Bap</title>
    <description>The latest articles on DEV Community by Bap (@fernandezbaptiste).</description>
    <link>https://dev.to/fernandezbaptiste</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1076147%2F29ad7dd3-1c6d-4e36-a3e1-f5b6dc9239f9.png</url>
      <title>DEV Community: Bap</title>
      <link>https://dev.to/fernandezbaptiste</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/fernandezbaptiste"/>
    <language>en</language>
    <item>
      <title>Evaluate agent skills, ship 3x better code</title>
      <dc:creator>Bap</dc:creator>
      <pubDate>Thu, 26 Feb 2026 16:51:44 +0000</pubDate>
      <link>https://dev.to/fernandezbaptiste/evaluate-agents-skills-ship-3x-better-code-being-top-5-on-product-hunt-21he</link>
      <guid>https://dev.to/fernandezbaptiste/evaluate-agents-skills-ship-3x-better-code-being-top-5-on-product-hunt-21he</guid>
      <description>&lt;p&gt;Today, I’m excited to announce that you can evaluate your agent skills and optimize them.&lt;/p&gt;

&lt;p&gt;This means you &lt;strong&gt;can stop debugging agent output and start shipping quality code, faster&lt;/strong&gt;: see our &lt;a href="https://www.producthunt.com/products/tessl" rel="noopener noreferrer"&gt;Product Hunt launch&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6jlr9n9545q0yrqbo3fs.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6jlr9n9545q0yrqbo3fs.gif" alt=" " width="480" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Agent skills help agents use your products, build in your codebase and enforce your policies.&lt;/p&gt;

&lt;p&gt;They're the new unit of software for devs, but most are still treated like simple Markdown files copied between repos: no versioning, no quality signal, no updates.&lt;/p&gt;

&lt;p&gt;Without AI evaluations, you can’t tell if a skill helps, provides minimal uplift, or even degrades functionality.&lt;/p&gt;

&lt;p&gt;You spend your time course-correcting agents instead of shipping.&lt;/p&gt;

&lt;p&gt;Tessl is a development platform and package manager for agent skills. With Tessl, we evaluated and optimized ElevenLabs' skills, doubling agents' success in using their APIs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you are building a personal project, maintaining an OSS library, or developing with AI at work&lt;/strong&gt;, you can now evaluate your skill and optimize it to help agents use it properly.&lt;/p&gt;

&lt;p&gt;We’ve launched on &lt;a href="https://www.producthunt.com/products/tessl" rel="noopener noreferrer"&gt;Product Hunt&lt;/a&gt;! If you find it useful, we’d appreciate an upvote, and even more, your feedback in the comments.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>coding</category>
      <category>productivity</category>
    </item>
    <item>
      <title>7 AI Devtools to Watch This December</title>
      <dc:creator>Bap</dc:creator>
      <pubDate>Tue, 09 Dec 2025 19:01:31 +0000</pubDate>
      <link>https://dev.to/fernandezbaptiste/7-ai-devtools-to-watch-this-december-25hl</link>
      <guid>https://dev.to/fernandezbaptiste/7-ai-devtools-to-watch-this-december-25hl</guid>
      <description>&lt;p&gt;If you’re building with AI, make it a habit to explore one new tool each week; you’ll level up your workflow far faster than you expect. And with more tools entering the landscape constantly, it’s worth keeping an eye out for what arrives next.&lt;/p&gt;

&lt;p&gt;Looking at both traffic patterns and our own curation, 7 tools emerged as particularly interesting. Not because they’re the flashiest (or have the biggest marketing budgets), but because they represent distinct approaches to how AI can fit into the development workflow.&lt;/p&gt;

&lt;p&gt;Before diving into specific tools, it’s worth framing them through two lenses: trust and change. Trust indicates how much you need the tool to get it right for it to be valuable. Change indicates how much you need to alter your existing workflow to use it.&lt;/p&gt;

&lt;p&gt;The tools gaining traction typically operate in “high adoption” territory: they fit existing workflows and produce verifiable results. But the most interesting tools push boundaries, asking developers to work differently because the value proposition justifies the friction.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conductor: orchestrating parallel agents locally
&lt;/h2&gt;

&lt;p&gt;Conductor tackles a fundamental problem that emerges as agents become more capable: how do you manage multiple coding agents working simultaneously without losing your mind? The tool runs on your Mac and orchestrates parallel workspaces — each agent gets its own isolated git worktree to work in.&lt;/p&gt;

&lt;p&gt;What makes this interesting is the orchestration layer. Rather than just spawning agents and hoping they don’t conflict, Conductor provides visibility into what each agent is working on and structured review mechanisms for merging their changes. It supports both Claude Code and Codex, working with however you’re already authenticated.&lt;/p&gt;

&lt;p&gt;This sits firmly in attended workflows. You’re not trusting agents to autonomously merge changes; you’re reviewing their work. But the parallel execution model acknowledges a reality: as agents get better at focused tasks, developers will want to run multiple agents simultaneously. Conductor makes that tractable by handling the coordination overhead.&lt;/p&gt;

&lt;p&gt;The trust requirement is moderate because you control the merge step. If you’re already using coding agents, the change requirement is low: Conductor just makes it possible to use them at scale.&lt;/p&gt;
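&lt;p&gt;The isolation model Conductor builds on can be sketched with plain git; the repo path and branch names below are illustrative:&lt;/p&gt;

```shell
# One git worktree per agent: each agent gets its own checkout on its own
# branch, so parallel edits can't clobber each other. (Throwaway demo repo.)
git init -q demo-repo && cd demo-repo
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial commit"
git worktree add -q ../agent-a -b agent/task-a   # isolated checkout for agent A
git worktree add -q ../agent-b -b agent/task-b   # isolated checkout for agent B
git worktree list   # main checkout plus the two agent worktrees
# After reviewing an agent's branch: git merge agent/task-a
```

&lt;p&gt;Conductor’s value is the visibility and review layer it puts on top of this; the isolation itself is standard git.&lt;/p&gt;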

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5zr6j90hcmmu3a30qy3i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5zr6j90hcmmu3a30qy3i.png" alt=" " width="800" height="469"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://conductor.build/" rel="noopener noreferrer"&gt;https://conductor.build/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Graphite: code review reimagined for AI velocity
&lt;/h2&gt;

&lt;p&gt;Graphite is rethinking code review for teams shipping faster with AI. The core insight: traditional PR workflows weren’t designed for the velocity AI enables. When you can generate significant code quickly, the bottleneck shifts to review and merge coordination.&lt;/p&gt;

&lt;p&gt;Graphite introduces stacked PRs — breaking larger changes into sequenced, smaller chunks that can be reviewed independently. This addresses a real pain point: with AI assistance, you can create large features quickly, but reviewing massive PRs is slow and error-prone. Stacking lets you ship incrementally without waiting for each review to complete.&lt;/p&gt;

&lt;p&gt;The platform includes an AI agent that operates directly in your PR page. It can resolve CI failures, suggest fixes, and help you iterate without context switching. The merge queue is stack-aware, landing PRs in order while keeping branches green.&lt;/p&gt;

&lt;p&gt;What’s clever here is recognizing that AI changes both sides of the equation. Developers can write code faster, but reviewers still need time and context. Graphite optimizes for the new bottleneck by making reviews faster and less blocking. The stack-aware merge queue ensures velocity doesn’t compromise stability.&lt;/p&gt;

&lt;p&gt;In other words, because Graphite syncs with GitHub, the process fits existing workflows. The value is immediate if your team already struggles with review velocity.&lt;/p&gt;
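&lt;p&gt;The stacking idea itself can be sketched with plain git: each branch is based on the previous one, so each slice stays small enough to review on its own (branch names are illustrative; Graphite’s tooling automates the restacking and PR retargeting):&lt;/p&gt;

```shell
# A three-slice stack: schema -> API -> UI, each branch built on the last.
git init -q -b main repo && cd repo
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "base"
git checkout -q -b feature/db-schema main   # slice 1: PR targets main
git checkout -q -b feature/api-layer        # slice 2: PR targets feature/db-schema
git checkout -q -b feature/ui               # slice 3: PR targets feature/api-layer
git branch   # the stack: main plus three feature branches
```

&lt;p&gt;The pain this creates by hand is what happens when slice 1 changes under review: every branch above it needs rebasing and every PR needs retargeting, which is the bookkeeping a stack-aware tool takes over.&lt;/p&gt;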

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxysjwgm4s6gp0xw9vn97.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxysjwgm4s6gp0xw9vn97.png" alt=" " width="800" height="364"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://graphite.com/" rel="noopener noreferrer"&gt;https://graphite.com/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Google Code Wiki: automated documentation that stays current
&lt;/h2&gt;

&lt;p&gt;Google launched Code Wiki to tackle documentation’s oldest problem: it becomes outdated the moment you write it. Code Wiki generates comprehensive documentation for repositories and regenerates it automatically after every change.&lt;/p&gt;

&lt;p&gt;The system scans the full codebase, maintains links to every symbol, and creates interactive documentation where you can navigate from high-level explanations to specific code locations. A Gemini-powered chat interface uses the always-current documentation as context for answering questions about the codebase.&lt;/p&gt;

&lt;p&gt;What makes this approach interesting is treating documentation as a continuously regenerated artifact rather than something developers maintain manually. The public preview works for open-source projects. Google is developing a Gemini CLI extension for private repositories — particularly valuable where legacy code is poorly documented and institutional knowledge has eroded.&lt;/p&gt;

&lt;p&gt;The challenge with any auto-documentation system is whether the generated content is accurate enough to trust for important technical decisions. Code Wiki includes the standard disclaimer that Gemini can make mistakes. But the linking to actual code makes verification straightforward.&lt;/p&gt;

&lt;p&gt;This targets a genuine pain point. New contributor onboarding, understanding legacy decisions, and maintaining architectural knowledge are persistent challenges. If AI-generated documentation proves reliable enough, it removes significant friction.&lt;/p&gt;

&lt;p&gt;The question is whether teams will trust it enough to act on it without verification-that determines whether it stays in attended workflows or moves toward autonomy.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2346gabm08c871zo98rj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2346gabm08c871zo98rj.png" alt=" " width="800" height="373"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://codewiki.google/" rel="noopener noreferrer"&gt;https://codewiki.google/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extra!&lt;/strong&gt;&lt;/p&gt;


&lt;p&gt;The broader challenge here is giving agents the right documentation context at the right time. Tessl’s open source &lt;a href="https://tessl.io/registry?utm_source=bap_dev_to" rel="noopener noreferrer"&gt;registry&lt;/a&gt; tackles a related problem: providing version-aware library documentation, coding styleguides, and reusable workflows that agents can reliably pull from — treating documentation as structured context for steering rather than just reference material for humans.&lt;/p&gt;

&lt;h3&gt;
  
  
  Kilo Code: open-source agentic development
&lt;/h3&gt;

&lt;p&gt;Kilo positions itself as the open alternative in the coding agent space. The pitch: switch between 500+ models, bring your own API keys, see exactly what models are being used, and inspect every prompt and tool call.&lt;/p&gt;

&lt;p&gt;The platform includes an orchestrator mode that breaks complex projects into subtasks and coordinates between different agent modes. It features Context7 integration, automatically looking up library documentation to reduce hallucinations. The debug mode systematically traces through your codebase to locate bugs.&lt;/p&gt;

&lt;p&gt;What’s notable is the transparency emphasis. No silent context compression, no hidden model switching, no locked-in providers. This matters for teams that want control over costs and model choice. The open-source plugin under Apache 2.0 license means you can see and modify how Kilo works.&lt;/p&gt;

&lt;p&gt;The parallel mode capability acknowledges that complex projects benefit from multiple agents working simultaneously, similar to Conductor’s insight but integrated at the platform level rather than as a separate orchestration layer.&lt;/p&gt;

&lt;p&gt;This appeals to developers who’ve hit friction with closed platforms. The transparent pricing model (pay exact list price from providers; Kilo makes money on Teams/Enterprise) and bring-your-own-keys approach address concerns about vendor lock-in and hidden costs.&lt;/p&gt;

&lt;p&gt;If you’re already using VS Code or JetBrains, Kilo integrates as a plugin. It fits into existing workflows while offering model flexibility that proprietary tools don’t provide.&lt;/p&gt;

&lt;p&gt;From an adoption standpoint, this is incremental: the trust requirement is moderate because you’re reviewing agent output.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy9esm29wfvm31a8kr8mt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy9esm29wfvm31a8kr8mt.png" alt=" " width="800" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://kilo.ai/" rel="noopener noreferrer"&gt;https://kilo.ai/&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Letta Context-Bench: measuring agentic context engineering
&lt;/h3&gt;

&lt;p&gt;Letta released Context-Bench, a benchmark evaluating how well language models handle “agentic context engineering”: when agents themselves strategically decide what context to retrieve and load.&lt;/p&gt;

&lt;p&gt;The benchmark measures agents’ ability to chain file operations, trace entity relationships, and manage multi-step information retrieval in long-horizon tasks. Questions require multiple tool calls and strategic information management; agents can’t answer correctly without navigating file relationships.&lt;/p&gt;

&lt;p&gt;What makes this valuable is addressing a critical challenge: as agents tackle longer tasks, determining what information should be in the context window at any moment becomes crucial. Too much causes context rot; too little causes hallucinations.&lt;/p&gt;

&lt;p&gt;The findings reveal interesting patterns. Claude Sonnet 4.5 leads at 74% accuracy, demonstrating that models explicitly trained for context engineering excel. But open-weight models are closing the gap: GLM-4.6 achieves 56.83% and Kimi K2 scores 55.13%. Even top models miss 25–30% of questions, indicating substantial room for improvement.&lt;/p&gt;

&lt;p&gt;This benchmark matters because it measures a specific capability that’s critical for production agents but often overlooked: the meta-problem of managing what information you need to solve the actual problem. As tasks extend beyond native context windows, models that excel at context engineering will handle long-horizon work more reliably.&lt;/p&gt;

&lt;p&gt;For teams building with agents, Context-Bench provides data for model selection when context management is critical. For model developers, it highlights a training dimension that differentiates performance on real-world agentic tasks.&lt;/p&gt;

&lt;p&gt;While Context-Bench focuses on measuring raw context utilization, other approaches, like this &lt;a href="https://tessl.io/blog/proposed-evaluation-framework-for-coding-agents?utm_source=bap_dev_to" rel="noopener noreferrer"&gt;specs eval framework&lt;/a&gt;, examine how structured context translates into practical task completion, offering a different lens on the same fundamental question of context value.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frauvjyg1r2cktkr717kc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frauvjyg1r2cktkr717kc.png" alt=" " width="800" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.letta.com/blog/context-bench" rel="noopener noreferrer"&gt;https://www.letta.com/blog/context-bench&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Google Antigravity: agent-first development platform
&lt;/h2&gt;

&lt;p&gt;Google launched Antigravity as an “agent-first” development platform alongside Gemini 3. Rather than treating AI as a sidebar feature, Antigravity gives agents a dedicated workspace with direct access to the editor, terminal, and browser.&lt;/p&gt;

&lt;p&gt;The platform splits into two modes: Editor View for hands-on coding with AI assistance, and Agent Manager for deploying agents that autonomously plan and execute complex tasks. Agents communicate work via Artifacts (screenshots, task lists, implementation plans) that you can review and comment on without stopping execution.&lt;/p&gt;

&lt;p&gt;What’s architecturally interesting is the inverted paradigm. Instead of agents embedded within surfaces, surfaces are embedded into the agent. This reflects a bet that models like Gemini 3 are capable enough to operate across multiple environments simultaneously.&lt;/p&gt;

&lt;p&gt;With Antigravity, you’re trusting agents to plan, execute, and verify work across your development environment. You’re adapting to a task-oriented interaction model rather than line-by-line coding. The Artifacts approach attempts to make agent work reviewable without overwhelming you with tool call details.&lt;/p&gt;

&lt;p&gt;Antigravity represents Google’s vision for agent-era development. Available free in public preview, it’s an ambitious platform play-not just adding AI features to existing tools but reimagining the developer experience around autonomous agents. Time will tell whether devs embrace this paradigm shift or prefer incremental AI assistance in familiar environments.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqkr80a0hk0foew8l5yup.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqkr80a0hk0foew8l5yup.png" alt=" " width="800" height="424"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://antigravity.google/" rel="noopener noreferrer"&gt;https://antigravity.google/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  LocalAI: agentic workflows with MCP
&lt;/h2&gt;

&lt;p&gt;What makes LocalAI architecturally interesting is its positioning as a privacy-first, self-hosted alternative that mimics OpenAI’s API. You run LLMs, generate images, and use agents locally or on-prem with consumer-grade hardware. No GPU required for many use cases.&lt;/p&gt;

&lt;p&gt;The 3.8.0 release significantly upgrades support for agentic workflows via the Model Context Protocol (MCP), making them practical for teams that need to keep data on-premises. It also adds live action streaming: you can watch agents “think” in real time, seeing tool calls, reasoning steps, and intermediate actions as they happen rather than waiting for the final output.&lt;/p&gt;

&lt;p&gt;LocalAI supports multiple backend types (llama.cpp, diffusers, transformers, and more) and can run models from various sources including Hugging Face, Ollama, and standard OCI registries. The platform automatically detects GPU capabilities and downloads appropriate backends.&lt;/p&gt;

&lt;p&gt;This addresses a real deployment constraint. Many organizations can’t send code or data to external APIs but still want to leverage AI capabilities. LocalAI provides the infrastructure to run agents locally while maintaining API compatibility with OpenAI-style integrations.&lt;/p&gt;
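&lt;p&gt;That API compatibility is the whole trick: an OpenAI-style client only needs its base URL pointed at the local server. A minimal sketch, assuming LocalAI’s default port 8080 and an illustrative model name:&lt;/p&gt;

```shell
# Build the same JSON body an OpenAI chat client would send; only the host
# changes. The model name is illustrative.
cat > request.json <<'EOF'
{
  "model": "llama-3.2-1b-instruct",
  "messages": [{"role": "user", "content": "Summarize the failing test."}]
}
EOF
python3 -m json.tool request.json   # sanity-check the payload
# With a LocalAI instance running locally:
# curl http://localhost:8080/v1/chat/completions \
#   -H "Content-Type: application/json" -d @request.json
```

&lt;p&gt;Because the request shape is unchanged, existing OpenAI-style integrations can switch to local inference without code rewrites.&lt;/p&gt;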

&lt;p&gt;The trust requirement is moderate to high depending on your use case. The change requirement is low if you’re already running local inference; LocalAI just makes agentic patterns more practical. For organizations new to self-hosted AI, there’s infrastructure overhead, but the MCP support and live streaming features reduce the gap between local and cloud agent experiences.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ntb6ybbfkiymuxcknux.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ntb6ybbfkiymuxcknux.png" alt=" " width="800" height="432"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://localai.io/" rel="noopener noreferrer"&gt;https://localai.io/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What this selection reveals
&lt;/h2&gt;

&lt;p&gt;Looking across these seven tools, patterns emerge around where current development is focused:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Orchestration and coordination (Conductor, Kilo, Antigravity)&lt;/strong&gt;: As agents become more capable, managing multiple agents and their interactions becomes the bottleneck. Tools are emerging to handle parallel execution, workspace isolation, and coordination complexity.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Code review velocity (Graphite)&lt;/strong&gt;: AI accelerates code generation, but review remains a human bottleneck. Tools that optimize review workflows and reduce blocking will be critical as teams ship faster.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Agent-first platforms (Antigravity)&lt;/strong&gt;: The boldest bets involve reimagining development around autonomous agents rather than incrementally adding AI features to existing tools.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Documentation as infrastructure (Code Wiki)&lt;/strong&gt;: Treating documentation as continuously regenerated rather than manually maintained could solve the staleness problem if AI-generated content proves reliable enough.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Context management (Letta Context-Bench)&lt;/strong&gt;: As tasks extend beyond native context windows, the meta-skill of managing what information to retrieve becomes differentiating for model performance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Transparency and control (Kilo, LocalAI)&lt;/strong&gt;: Not all teams want black-box solutions. Demand exists for open-source alternatives, model flexibility, and visibility into costs and behavior.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We need both incremental tools that improve current workflows and speculative platforms exploring new paradigms. The ecosystem benefits from having options across the trust/change spectrum. What’s less visible but equally important: tools that haven’t gained traction yet but represent useful approaches.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://ainativedev.io/landscape" rel="noopener noreferrer"&gt;AI development landscape&lt;/a&gt; is still taking shape and evolving fast. Our weekly tool spotlight helps you stay ahead by discovering one new AI-native dev tool at a time. This isn’t about sprinkling AI onto existing workflows; it’s about redefining how software is built from the ground up.&lt;/p&gt;

&lt;p&gt;Explore the AI Native Dev Landscape, track the shifts happening across categories, and stay plugged into the tools developers are truly adopting.&lt;/p&gt;

&lt;p&gt;If you’re an AI native developer, make it a habit: explore one &lt;a href="https://ainativedev.io/landscape" rel="noopener noreferrer"&gt;new tool&lt;/a&gt; each week and level up your workflow!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://ainativedev.io/news/last-month-on-the-ai-devtools-landscape-7-tools-in-the-spotlight" rel="noopener noreferrer"&gt;&lt;em&gt;https://ainativedev.io&lt;/em&gt;&lt;/a&gt;&lt;em&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>tooling</category>
    </item>
    <item>
      <title>Does Developer Delight Matter in a CLI? The Case of Charm’s Crush</title>
      <dc:creator>Bap</dc:creator>
      <pubDate>Fri, 12 Sep 2025 08:35:22 +0000</pubDate>
      <link>https://dev.to/fernandezbaptiste/does-developer-delight-matter-in-a-cli-the-case-of-charms-crush-248g</link>
      <guid>https://dev.to/fernandezbaptiste/does-developer-delight-matter-in-a-cli-the-case-of-charms-crush-248g</guid>
      <description>&lt;p&gt;Released in July, and now amassing stars, &lt;a href="https://github.com/charmbracelet?q=&amp;amp;type=all&amp;amp;language=&amp;amp;sort=stargazers" rel="noopener noreferrer"&gt;Crush&lt;/a&gt; is a new open-source command-line AI coding assistant developed by CharmBracelet (the team behind tools like &lt;a href="https://github.com/charmbracelet/glow" rel="noopener noreferrer"&gt;Glow&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;It provides a terminal-based interface for devs to interact with a coding agent (&lt;code&gt;npm install -g @charmland/crush&lt;/code&gt;). Crush works with a wide range of models (via OpenAI, Anthropic, and other APIs) and lets you switch models mid-session while preserving context.&lt;/p&gt;

&lt;p&gt;You can maintain multiple sessions per project, meaning Crush remembers conversation history and file context across runs. This helps when working on larger tasks or hopping between different projects without losing context.&lt;/p&gt;

&lt;p&gt;Crush ties into Language Server Protocol (LSP) servers to inject code-aware context into the AI’s prompts. If you’re unfamiliar with the term, LSP lets editors talk to language servers for code intelligence: a simple example is that as you type Python, the language server suggests completions and flags type errors.&lt;/p&gt;

&lt;p&gt;This means Crush can understand type signatures, function dependencies, and project structure. It also supports MCP (Model Context Protocol) servers for plugging in external tools and context sources.&lt;/p&gt;
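&lt;p&gt;To make the LSP link concrete, here is the shape of a request an editor (or an agent like Crush) can send to a language server. The method name is from the LSP specification; the file URI and position are illustrative:&lt;/p&gt;

```shell
# A JSON-RPC "hover" request: ask the language server what it knows about
# the symbol at line 12, column 4 of main.py.
cat > hover.json <<'EOF'
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "textDocument/hover",
  "params": {
    "textDocument": {"uri": "file:///project/main.py"},
    "position": {"line": 12, "character": 4}
  }
}
EOF
python3 -m json.tool hover.json   # validate the request shape
```

&lt;p&gt;The response (a type signature, a docstring) is the kind of code-aware context Crush can fold into a model prompt.&lt;/p&gt;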

&lt;p&gt;True to &lt;a href="https://charm.land/" rel="noopener noreferrer"&gt;Charm’s&lt;/a&gt; ethos of making the command line “glamorous”, with successful OSS projects like &lt;a href="https://github.com/charmbracelet/bubbletea" rel="noopener noreferrer"&gt;bubbletea&lt;/a&gt;, &lt;a href="https://github.com/charmbracelet/gum" rel="noopener noreferrer"&gt;gum&lt;/a&gt;, and &lt;a href="https://github.com/charmbracelet/lipgloss" rel="noopener noreferrer"&gt;lipgloss&lt;/a&gt;, Crush has a modern and playful text-based UI. It features a split-pane view (with things like a dedicated diff view for code changes) and intuitive keyboard navigation, aiming to feel friendly and futuristic.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4m92ksi0wtxmw6bjckoq.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4m92ksi0wtxmw6bjckoq.gif" width="560" height="315"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you're enjoying this content, you might enjoy this 1-pager newsletter I share with 7,000+ AI native devs. Stay ahead, and get weekly insights &lt;a href="https://ainativedev.co/kxy" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Community: Crush sparks joy, comparisons, and cost questions
&lt;/h3&gt;

&lt;p&gt;The consensus so far: Crush offers a refreshing UX and solid foundation, but it’s one player in a bigger trend. As one observer quipped, the “terminal-based AI coding agents” trend is hot, and everyone is experimenting to see which tool will stick.&lt;/p&gt;

&lt;p&gt;The word “playful” came up frequently in our research — CharmBracelet’s TUI framework pedigree (Bubble Tea, etc.) is well respected, so seeing those slick visuals applied to an AI assistant delighted people.&lt;/p&gt;

&lt;p&gt;Shifting gears, a Hacker News user requested a detailed “comparison between all these new tools” — listing Crush, Claude Code, OpenCode, Aider, and Cortex — because “I just can’t get an easy overview of how each tool works and is different”.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcbf9o7auugvjrueijwwh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcbf9o7auugvjrueijwwh.png" width="800" height="147"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This captures a common reaction: excitement about the tool, paired with the question “How does it stack up against X?”. This sentiment shows both the interest in these AI dev agents and the fragmentation of the ecosystem. It’s not often we see multiple similar tools gain popularity almost simultaneously, so community members are trying to map the space, often through first-hand trials and discussions.&lt;/p&gt;

&lt;p&gt;Some devs also lamented that they feel “in golden handcuffs” with proprietary tools like Claude’s official app, because those offer unlimited usage for a flat rate, whereas using something like Crush with pay-as-you-go APIs could rack up costs. We might see future updates focus on easier integration with subscription-based models or better support for local LLMs to alleviate cost issues.&lt;/p&gt;

&lt;h3&gt;
  
  
  Will AI dev tools win on delight? Thoughts on DevX vs Model Capability
&lt;/h3&gt;

&lt;p&gt;OpenAI’s CFO recently &lt;a href="https://x.com/btibor91/status/1911035936940429631" rel="noopener noreferrer"&gt;described&lt;/a&gt; a vision of an “agentic software engineer” — essentially an AI that could take a high-level project description and autonomously build and iteratively improve software.&lt;/p&gt;

&lt;p&gt;AI is becoming a first-class citizen in dev workflows. Just as version control or Stack Overflow search became ingrained in a dev’s day-to-day, AI assistants (be it in the terminal, editor, or IDE) are heading in the same direction, helping with brainstorming, coding, debugging, and documentation. But who will win devs’ hearts?&lt;/p&gt;

&lt;p&gt;Many of these AI coding assistants rely on the same or similar underlying models. If every tool can hook up to GPT-5, Claude Opus, or the next open-source model, then what sets them apart? I believe the answer lies in how effectively they let devs harness those models and how much delight they bring to the experience.&lt;/p&gt;

&lt;p&gt;I came to a similar conclusion after writing a comparison of Windsurf, Cursor, and Copilot with GPT-5. Building with these tools brought me to the finish line in all cases. But how I got there, and how I felt as a dev, varied. The real differences showed up in workflow ergonomics, UI polish, and how much hand-holding each agent needed.&lt;/p&gt;

&lt;p&gt;History offers instructive analogies: Betamax vs. VHS is often cited — Betamax was arguably the superior video tape technology, but it lost the format war due to practical UX factors (shorter recording time, higher costs, less industry support). Similarly, HD DVD vs. Blu-ray ended with Blu-ray victorious not purely for technical reasons but due to strategic partnerships and consumer perception.&lt;/p&gt;

&lt;p&gt;Conversely, a pleasant, well-integrated tool can win even if under the hood it’s not radically different. Crush’s playful interface and thoughtful touches (like preserving scrollback, offering diff views, etc.) might seem cosmetic, but they significantly impact adoption.&lt;/p&gt;

&lt;p&gt;As one HN user pointed out, even something as simple as syntax highlighting or colorful text can change how we feel about a tool — decades ago, some old-school devs scoffed at such features as unnecessary, yet today we take them for granted as usability must-haves.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7wbl0msgcparvbnff6px.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7wbl0msgcparvbnff6px.png" width="800" height="131"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Looking ahead, AI dev tools will increasingly compete on the design choices that shape how enjoyable and intuitive they feel to use. Delight matters. Still, real hurdles remain: accuracy, reliability, and trust in code generation. Developers will need guardrails (tests, reviews, and structured practices) to confidently fold AI into their workflows.&lt;/p&gt;

&lt;p&gt;At AI Native Dev, we believe one promising path is spec-driven development: anchoring AI contributions in clear, testable specifications that keep humans in the loop and put guardrails around agents. If you’re curious about this concept, you can explore it &lt;a href="https://ainativedev.io/news/the-most-valuable-developer-skill-in-2025-writing-code-specifications" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;
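
&lt;p&gt;As a minimal illustration of that idea (a sketch I wrote for this article, not any specific product’s workflow), a spec can be a small set of executable checks that a human owns while the agent iterates on the implementation:&lt;/p&gt;

```python
# Illustrative sketch: a "spec" expressed as executable checks that gate
# AI-generated code before it is accepted. All names here are invented.

def slugify(title: str) -> str:
    """Candidate implementation, e.g. produced by a coding agent."""
    return "-".join(title.lower().split())

# The spec: small, testable clauses a human writes up front and keeps
# under human control as the agent iterates.
SPEC = [
    ("lowercases input", lambda f: f("Hello World") == "hello-world"),
    ("collapses whitespace", lambda f: f("a   b") == "a-b"),
    ("is idempotent", lambda f: f(f("Some Title")) == f("Some Title")),
]

def check_against_spec(fn):
    """Return the names of spec clauses the implementation violates."""
    return [name for name, rule in SPEC if not rule(fn)]

failures = check_against_spec(slugify)
print(failures)  # an empty list means the candidate satisfies the spec
```

&lt;p&gt;The point is not the toy function but the loop: the agent may rewrite &lt;code&gt;slugify&lt;/code&gt; freely, while the spec stays fixed as the shared source of truth.&lt;/p&gt;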

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://ainativedev.io/news/does-developer-delight-matter-in-a-cli-the-case-of-charm-s-crush" rel="noopener noreferrer"&gt;&lt;em&gt;https://ainativedev.io&lt;/em&gt;&lt;/a&gt;&lt;em&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>aicoding</category>
      <category>devrel</category>
      <category>terminal</category>
    </item>
    <item>
      <title>IDE Comparison with Cursor, Windsurf &amp; Copilot on GPT-5</title>
      <dc:creator>Bap</dc:creator>
      <pubDate>Wed, 03 Sep 2025 13:11:01 +0000</pubDate>
      <link>https://dev.to/fernandezbaptiste/exploring-cursor-windsurf-and-copilot-with-gpt-5-22bl</link>
      <guid>https://dev.to/fernandezbaptiste/exploring-cursor-windsurf-and-copilot-with-gpt-5-22bl</guid>
      <description>&lt;h2&gt;
  
  
  tl;dr
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Over the weekend, I put Cursor, Windsurf, and Copilot (in VS Code) to the test using GPT-5.&lt;/li&gt;
&lt;li&gt;Tested in both greenfield (starting a project, like our &lt;a href="https://ainativedev.io/landscape" rel="noopener noreferrer"&gt;landscape&lt;/a&gt;, from scratch) and brownfield scenarios (working with the existing codebase).&lt;/li&gt;
&lt;li&gt;We leaned into a spec-first development approach (more about this &lt;a href="https://ainativedev.io/news/ai-development-patterns-what-actually-works" rel="noopener noreferrer"&gt;here&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;All 3 IDEs got the job done. The real differences show up in workflow ergonomics, UI polish, and how much hand-holding each agent needs.&lt;/li&gt;
&lt;li&gt;My recommendation is to play with these IDEs yourself, and pick your own winner based on your taste and team norms.&lt;/li&gt;
&lt;li&gt;Think of this as a field journal from a dev, not a lab benchmark (test window: Aug 22–24, 2025).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pricing at a glance
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo3yg8l899jdqlvycf9eu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo3yg8l899jdqlvycf9eu.png" width="800" height="342"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Greenfield exploration: Build a project from scratch with a spec
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fql3vl33sru2cfz37lheg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fql3vl33sru2cfz37lheg.png" width="800" height="433"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I prompted each tool to leverage this &lt;a href="https://github.com/fernandezbaptiste/agentic-IDE-comparison-with-GPT-5/blob/main/greenfield-spec.md" rel="noopener noreferrer"&gt;greenfield-spec.md&lt;/a&gt; (describing the architecture for a basic MERN-stack scaffold), then asked it to implement the spec, iterate, and adjust tests. In my observations, the specs produced by all three were very similar, and all delivered working code.&lt;/p&gt;

&lt;p&gt;On the UI/UX experience, Cursor felt the most professional in flow/explanations, but once refused to auto-build the project from spec. When it did build, it created a clean structure + tests and adjusted correctly as I changed the spec.&lt;/p&gt;

&lt;p&gt;I also appreciated how Cursor usefully shows the % of context consumed (~400k tokens total: 272k input + 128k output). As for inline speed: Cursor ≈ Copilot &amp;gt; Windsurf. Though fast, Copilot's inline refactoring fell short of Windsurf's and Cursor's output.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw2euna49a1wfd640mpxm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw2euna49a1wfd640mpxm.png" width="800" height="482"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When running the spec, Windsurf went full send. It created the whole project structure automatically (folders/files) without extra prompting. Windsurf’s chat pane made it easy to distinguish model “thinking”, command runs, and issues.&lt;/p&gt;

&lt;p&gt;Copilot (VS Code) mostly showed file contents in chat with path hints but didn’t immediately write an on-disk tree. I did appreciate Copilot’s in-IDE browser preview though, which was delightful for quick checks.&lt;/p&gt;

&lt;p&gt;For testing, Windsurf and Cursor’s generated tests passed on first run in my sample. Copilot authored the strongest tests with granular mocking boundaries and edge case coverage. Though I had to wrestle a bit to get them passing, the self-contained modules made failures immediately traceable to specific component behaviors.&lt;/p&gt;

&lt;p&gt;If you're enjoying this content, you might enjoy this 1-pager newsletter I share with 7,000+ AI native devs. Stay ahead, and get weekly insights &lt;a href="https://ainativedev.co/kxy" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  Brownfield exploration: extend a legacy codebase
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Getting started&lt;/strong&gt;: Upon running locally, Copilot got the server up first, followed by Windsurf second and Cursor third — though notably, Cursor created a new .env.local instead of finding the right one, which slowed me down.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Codebase explanation&lt;/strong&gt;: When it came to understanding the existing code, Windsurf explained the codebase exceptionally well, and its response formatting and highlighting made it the best experience of the three. That said, Cursor and Copilot were also able to read and explain legacy code effectively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;New feature&lt;/strong&gt; (asked all to add a tool detail page + schema comparison across tools): All three successfully built from the &lt;a href="https://github.com/fernandezbaptiste/agentic-IDE-comparison-with-GPT-5/blob/main/brownfield-spec.md" rel="noopener noreferrer"&gt;brownfield-spec.md&lt;/a&gt; and produced sensible tests that ran well. This reinforced an important lesson: the clearer your spec, the more concrete the result.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bug hunt&lt;/strong&gt; (subtle &lt;a href="https://github.com/fernandezbaptiste/agentic-IDE-comparison-with-GPT-5/blob/main/posthog-intentional-bug.tsx" rel="noopener noreferrer"&gt;PostHog lazy-init bug&lt;/a&gt;): Here, Windsurf and Cursor diagnosed it faster, while Copilot missed it in my run. It’s important to highlight that no bug comments were added to the file (that would have been too easy 🙃).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Refactoring&lt;/strong&gt; (&lt;a href="https://github.com/fernandezbaptiste/agentic-IDE-comparison-with-GPT-5/blob/main/refactoring-request.md" rel="noopener noreferrer"&gt;scripted request&lt;/a&gt; across multiple files): All 3 executed the steps and staged changes properly, though with different approaches. Copilot required more approvals (e.g., 7 terminal prompts vs. Windsurf’s 3 and Cursor’s 1), which is great for caution but slower for flow. Interestingly, Copilot would sometimes say it’s “been working on this problem for a while” and stall, whereas the others tended to keep iterating until done.&lt;/p&gt;




&lt;h3&gt;
  
  
  UI &amp;amp; DX details that matter
&lt;/h3&gt;

&lt;p&gt;Overall, Windsurf &amp;gt; Cursor &amp;gt; Copilot for terminal/chat integration. Copilot runs commands outside the chat pane, which unfortunately breaks the narrative thread for me. Meanwhile, Cursor's default look and progress indicators feel the most cohesive, Windsurf communicates what's happening the best, and Copilot nails browser-in-IDE and markdown rendering.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On context and memory:&lt;/strong&gt; Windsurf’s “just remembers” feel is particularly strong for memory and context retention. Cursor gets there with rules/notes but can lose the thread in long sessions, while Copilot keeps things simpler but consequently more ephemeral.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Notable touches:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Windsurf can continue proposing changes in the background while you review staged diffs — a nice touch for maintaining flow.&lt;/li&gt;
&lt;li&gt;Cursor’s multi-file edit flow is strong, though I’d love a nudge when I forget I’m in “Ask” vs. “Agent” mode.&lt;/li&gt;
&lt;li&gt;Copilot’s “Ask” mode sometimes defaults to doing rather than explaining; additionally, starting a new chat ends the current edit session.&lt;/li&gt;
&lt;li&gt;I hit one infinite loop with Copilot (malformed ); a restart fixed it, but it made me wonder: could this have been resolved without my intervention?&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Implementation Results and Key Takeaways
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnaud6jmcquv6e0ain9sw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnaud6jmcquv6e0ain9sw.png" width="800" height="939"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  So… who should use what?
&lt;/h3&gt;

&lt;p&gt;All three experiences ride on VS Code — Windsurf and Cursor are forks; Copilot runs inside VS Code. That means capability deltas may be subtle: same keybindings, similar models, similar APIs. The differentiation shows up in micro-interactions: how the agent plans, how much it explains, how the terminal/chat loop flows, and how well the UI helps you stay oriented. If you’re expecting a jaw-dropping, orders-of-magnitude difference, that wasn’t my experience. In my opinion, the small product details and moments of delight are what decide your daily driver.&lt;/p&gt;

&lt;p&gt;That said, I found that each IDE may suit a different use case. This is a lightly opinionated take, not gospel:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cursor&lt;/strong&gt; → You care about polish, tight multi-file edits, and a feeling that the agent understands when to execute vs. chat. Great for folks who want structure and speed in daily flow. This may work best for a senior or full-stack IC at a startup or small team.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Windsurf&lt;/strong&gt; → You value context retention and a workflow-first UI that narrates what’s happening. Strong pick for larger codebases where “not losing the thread” is key. I can see a staff or principal engineer, or a maintainer of a large, long-lived repo, leveraging Windsurf.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Copilot (VS Code)&lt;/strong&gt; → You want trust and a more human-in-the-loop vibe. Company teams already in the Microsoft/GitHub ecosystem will feel right at home. I think this is a fit for a team lead or IC in GitHub-standardized orgs who prefers trusted defaults and governance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It’s important to note that this exercise comes with several limitations, including (and not limited to) the complexity of the builds, the test coverage, the size of the codebase, etc. That’s why I recommend experimenting with these tools yourself: you’ll want to choose the one that best matches your (or your team’s) needs.&lt;/p&gt;

&lt;p&gt;Finally, the current status quo is ephemeral. Each IDE may develop defensibility in how quickly it ships features. Cursor’s own inline edits, as an example, were developed in-house and were clearly felt throughout this experiment. It will be interesting to observe the pace of feature releases from these three tools. When I talk with engineers, the consensus leans toward Cursor demonstrating stronger execution, while Copilot appears slower in comparison. Meanwhile, given the recent tumultuous Windsurf acqui-hire and the accompanying leadership changes, we may see a shift in its release cadence.&lt;/p&gt;




&lt;h3&gt;
  
  
  Closing thoughts: A missing spec-driven feature
&lt;/h3&gt;

&lt;p&gt;Despite their capabilities, none of these IDEs fully embrace spec-driven development as a first-class citizen (unlike the recent release of &lt;a href="https://kiro.dev/" rel="noopener noreferrer"&gt;Kiro&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;While they can generate and work from specs, the workflow remains bolted-on rather than native — you’re still manually adding your specs, losing track of which code implements which requirements, and lacking formal verification that implementations match specifications. Memory and context management, though improving, remains ad hoc across all three; they remember recent edits but struggle to maintain architectural decisions or cross-session constraints.&lt;/p&gt;
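
&lt;p&gt;To make the traceability gap concrete, here is a hypothetical sketch (the decorator and requirement IDs are invented for illustration) of what tagging code with the requirement it implements could look like:&lt;/p&gt;

```python
# Hypothetical sketch of requirement-to-code traceability, one of the gaps
# described above; the decorator and requirement IDs are invented.

REGISTRY = {}

def implements(req_id):
    """Tag a function as implementing a given spec requirement."""
    def wrap(fn):
        REGISTRY.setdefault(req_id, []).append(fn.__name__)
        return fn
    return wrap

@implements("REQ-101")  # e.g. "users can reset their password"
def reset_password(user_id):
    ...

# The spec's full requirement list, kept alongside the code.
SPEC_REQUIREMENTS = {"REQ-101", "REQ-102"}

def unimplemented():
    """Requirements with no code tagged against them."""
    return sorted(SPEC_REQUIREMENTS - REGISTRY.keys())

print(unimplemented())  # REQ-102 has no tagged implementation yet
```

&lt;p&gt;A native spec-driven IDE could maintain this mapping automatically instead of relying on developer discipline.&lt;/p&gt;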

&lt;p&gt;Test and mock generation quality varies wildly depending on how you prompt, with no systematic approach to coverage or edge cases. The gaps become even more glaring: no audit trails for AI-generated code changes, absent cost attribution per dev or project, and no integration with existing governance workflows.&lt;/p&gt;

&lt;p&gt;These tools lack team-wide consistency enforcement — each developer’s AI assistant learns different patterns, creating style drift across the codebase. There’s no way to enforce company-specific architectural patterns, security requirements, or compliance rules at the AI level.&lt;/p&gt;

&lt;p&gt;This points to a non-trivial opportunity for next-gen IDEs where specs aren’t just documentation but executable contracts, where AI agents understand and preserve organizational standards, and where generated code comes with provenance and accountability. If you’re interested in the projects being built around spec-driven development, you can find them on the &lt;a href="https://ainativedev.io/landscape" rel="noopener noreferrer"&gt;Landscape, the guide to the AI development ecosystem&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://ainativedev.io/news/exploring-cursor-windsurf-and-copilot-with-gpt-5" rel="noopener noreferrer"&gt;&lt;em&gt;https://ainativedev.io&lt;/em&gt;&lt;/a&gt;&lt;em&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>cursor</category>
      <category>githubcopilot</category>
      <category>windsurf</category>
      <category>ide</category>
    </item>
    <item>
      <title>Stack Overflow’s 2025 Report is Out: Trends on AI Native Development</title>
      <dc:creator>Bap</dc:creator>
      <pubDate>Thu, 28 Aug 2025 00:15:14 +0000</pubDate>
      <link>https://dev.to/fernandezbaptiste/stack-overflows-2025-report-is-out-trends-on-ai-native-development-2i42</link>
      <guid>https://dev.to/fernandezbaptiste/stack-overflows-2025-report-is-out-trends-on-ai-native-development-2i42</guid>
      <description>&lt;h3&gt;
  
  
  What Happened — Devs appear to use AI more, and believe it less
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://survey.stackoverflow.co/2025" rel="noopener noreferrer"&gt;Stack Overflow's 2025 Developer Survey&lt;/a&gt;, featuring over 49,000 developers from 177 countries, reveals an intriguing paradox in the AI revolution: while 84% of developers now use or plan to use AI tools (up from 76% in 2024), trust in these tools has significantly dropped.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F364ls2w7vlw4b3s975f5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F364ls2w7vlw4b3s975f5.png" width="800" height="379"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Trust in AI accuracy has fallen sharply: only 33% of developers now trust AI-generated output, down from 43% in 2024. That said, there’s a quote worth remembering: “Today’s AI is the worst you are going to get.” So while trust may be declining, the pace of progress means those losing faith might want to revisit their assumptions sooner rather than later.&lt;/p&gt;

&lt;p&gt;Perhaps most telling, 66% of developers report frustration with AI solutions that are "almost right but not quite," often spending more time debugging AI-generated code than writing it themselves. Stack Overflow CEO Prashanth Chandrasekar emphasized that the "growing lack of trust in AI tools stood out" this year, highlighting the need for a trusted "human intelligence layer" to counterbalance inaccuracies.&lt;/p&gt;

&lt;p&gt;If you're enjoying this content, you might enjoy this 1-pager newsletter I share with 7,000+ AI native devs. Stay ahead, and get weekly insights &lt;a href="https://ainativedev.co/kxy" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Community Reactions — Survey Resonance, and Contextual AI Confidence
&lt;/h3&gt;

&lt;p&gt;The developer community largely resonates with the survey results, noting that the findings align closely with their own experiences, with some highlighting that the data appears relatively consistent with trends observed in last year’s report. Across forums and social media, programmers also echoed that AI tools are useful — even indispensable — but it wouldn’t be wise to trust them blindly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0sq2o1xmqcvzlt85c6kd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0sq2o1xmqcvzlt85c6kd.png" width="800" height="327"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A more nuanced view is that trust in AI code gen is context-dependent. “Do I trust AI for a full agentic build? Absolutely not. Do I trust it to scaffold a frontend validation schema? Generally, yes — but I will verify.”&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“LLMs get a bad rep because of individuals who completely outsource thinking.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The community also acknowledges that although the survey suggests 69% of AI agent users agree that agents have increased their productivity, the rise of AI-first development has at times slowed productivity down. Critically, productivity concerns transcend experience: multiple studies, including recent &lt;a href="https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/" rel="noopener noreferrer"&gt;METR research&lt;/a&gt;, suggest even experienced developers are 19% slower when relying on AI tools.&lt;/p&gt;

&lt;p&gt;Ultimately, it all depends on the tools being used, the coding languages, and the codebase complexity, none of which is captured in the survey numbers. Together, these insights point to a core truth: the value of AI in development hinges not just on capability, but on how precisely and intentionally it’s integrated into the workflow.&lt;/p&gt;

&lt;p&gt;Enjoy staying up-to-date with the AI Native Development space (AIND)? For the busy folks, &lt;strong&gt;we send out&lt;/strong&gt; &lt;a href="http://ainativedev.co/8fy" rel="noopener noreferrer"&gt;&lt;strong&gt;a crisp one-pager&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;every week, free and straight to your inbox.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
 PS: I’m helping build the &lt;a href="https://ainativedev.co/xnu" rel="noopener noreferrer"&gt;AIND community&lt;/a&gt;, where we cover the most relevant dev-focused news.&lt;/p&gt;

&lt;h3&gt;
  
  
  The AIND Take: Emerging reality of AI Native Dev and spec-driven development
&lt;/h3&gt;

&lt;p&gt;The issue isn’t that people don’t trust AI — they do. What they don’t trust are the current methods AI uses to generate code, particularly “vibe coding” approaches. As we attempt more complex development tasks, we’re consistently hitting the limits of these methods.&lt;/p&gt;

&lt;p&gt;This creates a predictable cycle: AI improves, we find new limits. Tools evolve, we discover fresh constraints. The critical question isn’t whether AI can help with development — it’s whether we can push AI-assisted development into professional, production-ready environments.&lt;/p&gt;

&lt;p&gt;The current trust setback should thus accelerate more sophisticated integration patterns, where AI code generation’s strengths and limitations are explicitly recognized as core design considerations rather than implementation afterthoughts. For instance, instead of letting AI generate an entire backend service ad hoc, teams might begin designing modular interfaces where AI is only responsible for generating testable utility functions within a predefined contract.&lt;/p&gt;
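
&lt;p&gt;A toy sketch of that contract idea (all names here are illustrative, not from any real codebase): the team owns the interface and its acceptance checks, and the AI only fills in the implementation behind them:&lt;/p&gt;

```python
# Sketch of a "predefined contract": the team owns the interface and its
# acceptance checks; the AI only generates the class that satisfies them.
from typing import Protocol

class PriceFormatter(Protocol):
    """Team-defined contract the generated code must satisfy."""
    def format(self, cents: int) -> str: ...

# AI-generated candidate, constrained to the contract above.
class SimplePriceFormatter:
    def format(self, cents: int) -> str:
        dollars, remainder = divmod(cents, 100)
        return f"${dollars}.{remainder:02d}"

def satisfies_contract(f: PriceFormatter) -> bool:
    """Team-owned acceptance checks; the AI never edits these."""
    return f.format(105) == "$1.05" and f.format(5) == "$0.05"

print(satisfies_contract(SimplePriceFormatter()))  # True
```

&lt;p&gt;The contract and its checks stay stable across regenerations, which is exactly what makes the AI’s output verifiable rather than merely plausible.&lt;/p&gt;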

&lt;p&gt;A robust future for AI native development likely includes clear guardrails and a test-driven approach, as embraced by our community advocating for spec-driven development.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F470zkkt0mvywobiuhy3p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F470zkkt0mvywobiuhy3p.png" width="800" height="913"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Alignment between intent and AI output is emerging as the key challenge, surpassing simple code generation. Specs act as the anchor, offering structured clarity for both AI reasoning and human validation. They provide a shared, testable source of truth crucial in managing AI-generated content’s inherent unpredictability. In a world where AI confidently delivers almost right solutions, specs help safeguard against subtle yet significant errors.&lt;/p&gt;

&lt;p&gt;While AI Native development is advancing rapidly, its ultimate form will likely be nuanced. This year’s survey offers devs a practical reality check, underscoring the need for thoughtful integration, realistic expectations, and human oversight as AI becomes more central to development practices.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://ainativedev.io/news/what-happened-devs-appear-to-use-ai-more-and-believe-it-less" rel="noopener noreferrer"&gt;&lt;em&gt;https://ainativedev.io&lt;/em&gt;&lt;/a&gt;&lt;em&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>softwaredevelopment</category>
      <category>ainativedevelopment</category>
      <category>news</category>
      <category>generativeaitools</category>
    </item>
    <item>
      <title>Is GitHub Copilot Changing the Game?</title>
      <dc:creator>Bap</dc:creator>
      <pubDate>Thu, 05 Jun 2025 14:53:35 +0000</pubDate>
      <link>https://dev.to/fernandezbaptiste/github-unveils-copilot-coding-agent-at-build-2025-3gi2</link>
      <guid>https://dev.to/fernandezbaptiste/github-unveils-copilot-coding-agent-at-build-2025-3gi2</guid>
      <description>&lt;h2&gt;
  
  
  What Happened: GitHub’s Cloud-Based Agent Drafts PRs Autonomously
&lt;/h2&gt;

&lt;p&gt;GitHub announced a major new capability for Copilot:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl1mvxspydwxbux0088rr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl1mvxspydwxbux0088rr.png" alt=" " width="800" height="205"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GitHub has unveiled a significant upgrade to Copilot during &lt;a href="https://github.blog/news-insights/product-news/github-copilot-meet-the-new-coding-agent/" rel="noopener noreferrer"&gt;Microsoft Build 2025&lt;/a&gt;: a cloud-based coding agent capable of drafting and iterating on pull requests directly within GitHub. While previous Copilot experiences focused on in-editor assistance via VSCode, this new agent runs asynchronously in the cloud, leveraging an infrastructure similar to GitHub Actions.&lt;/p&gt;

&lt;p&gt;Developers can now assign tasks directly via GitHub or VSCode:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;@github Open a pull request to refactor this query generator into its own class&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The Copilot agent handles tasks autonomously: creating branches, iterating on PRs based on code review comments, and updating commits until the work is accepted - all without touching protected branches. Crucially, existing CI pipelines, branch protections, and review workflows remain intact, ensuring “trust by design.”&lt;/p&gt;

&lt;p&gt;With the addition of MCP, developers can even grant the agent access to external tools and data by configuring servers directly in the repository settings. This abstracts us from the actual implementation and lets us focus on describing the tasks. It aligns with Codex agents and others that provide compute for agents to run and do the coding, and it resonates with the broader movement toward headless agents: those that run independently, complete tasks, and report back when work is ready for review.&lt;/p&gt;




&lt;h2&gt;
  
  
  Community Reactions: Awe, Caution, Fears, and Questions.
&lt;/h2&gt;

&lt;p&gt;Personally, using GitHub Copilot Agent was an enjoyable experience when asking it to perform low-level tasks. I had a ‘wow’ moment when I performed typical GitHub actions directly inside the PR (or via VSCode’s chat bar). It was impressively fast, and the developer experience felt spot on.&lt;/p&gt;

&lt;p&gt;Digging into the community’s reaction to Copilot’s coding agent, we found a mix of excitement, curiosity, and healthy skepticism. “Time-saving” was the common praise. Users liked that the agent’s draft PR and log let them see exactly what was happening, and that they retained the choice of when/if to merge. Similar to my experience, users found Copilot performed well on low-level tasks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://news.ycombinator.com/item?id=44031432" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2cqbzmw61yzme1hhjygk.png" alt=" " width="800" height="127"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To which GitHub’s Product Lead for Copilot coding agent replied:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://news.ycombinator.com/item?id=44031432" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbgad5383l819ettmb3pa.png" alt=" " width="800" height="151"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A portion of the community also shared a fear that by offloading coding to AI, human work might devolve into janitorial oversight: filing detailed JIRA tickets all day and rubber-stamping AI-generated code. Users also expressed concern about what this means for junior devs. GitHub CEO’s &lt;a href="https://www.linkedin.com/posts/ashtom_microsoft-is-dogfooding-ai-dev-tools-future-activity-7333212347064356866-3Qc3/?utm_source=share&amp;amp;utm_medium=member_desktop&amp;amp;rcm=ACoAAAZ97QEBAduTT2dW-DOPtlayUaE1TT4Pl-U" rel="noopener noreferrer"&gt;take&lt;/a&gt; on this is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In an age where some claim that more AI means fewer opportunities for entry level devs, I believe the opposite is true. There has never been a more exciting time to join our industry.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Building on that concern, security and policy questions were also raised. For example, could the agent inadvertently expose sensitive information, such as including a secret key in a PR? Adding to the complexity, some developers pointed out that the benefits of using AI agents don’t come for free. You only truly reap the rewards if you invest effort upfront, e.g. writing a good issue description with clear acceptance criteria. &lt;/p&gt;




&lt;p&gt;Enjoy staying up to date with the AI Native Development (AIND) space? For the busy folks, &lt;strong&gt;we send out &lt;a href="//ainativedev.co/xd4"&gt;a crisp one-pager&lt;/a&gt; read by 7,000+ AI tinkerers every week - free and straight to your inbox!&lt;/strong&gt;&lt;br&gt;
PS: I’m helping build the &lt;a href="//ainativedev.co/ej9"&gt;AIND community&lt;/a&gt;, where we cover the most relevant dev-focused news. &lt;/p&gt;




&lt;h2&gt;
  
  
  The AIND Take: Abstracting Grunt Work, and Developing New Habits
&lt;/h2&gt;

&lt;p&gt;GitHub is evolving into an AI-enhanced development platform. Much like the historic growth of &lt;a href="https://github.com/marketplace?type=actions" rel="noopener noreferrer"&gt;Actions Marketplace&lt;/a&gt;, we can expect a spike in tools within &lt;a href="https://github.com/marketplace?type=apps&amp;amp;copilot_app=true" rel="noopener noreferrer"&gt;Copilot Extensions&lt;/a&gt;, making Copilot more powerful and covering more aspects of the development workflow.&lt;/p&gt;

&lt;p&gt;Drawing an analogy from history, this feels akin to the DevOps revolution or the shift to cloud/serverless computing: mundane infrastructure work is abstracted away, enabling developers to focus on higher-level logic. Similarly, the Copilot agent can abstract away a chunk of grunt work.&lt;/p&gt;

&lt;p&gt;It’s also a new skill: prompting and supervising AI in coding. Just as CI/CD and cloud created new roles, AI agents could spawn titles like “AI orchestration engineer”. I suspect we’ll see AI-native startups experimenting more and more with having multiple agents collaborate.&lt;/p&gt;

&lt;p&gt;One use case we believe in is the use of task-dependent agents: one agent could generate code while another reviews it, or one agent specialized in front-end and another in back-end, coordinating through issues and PRs. GitHub’s infrastructure could support this kind of multi-agent workflow, especially as MCP allows chaining different AI services.&lt;/p&gt;

&lt;p&gt;From a developer’s perspective, to prepare for this AI-centric future, there are a few concrete steps we’d recommend when playing with Copilot’s Agent. First, writing clear, detailed task/issue/PR descriptions with acceptance criteria will become a valuable skill. It’s good practice even for human collaborators, but for AI it’s essential. This helps build the muscle of communicating with machines about code &lt;a href="https://ainativedev.io/news/the-4-patterns-of-ai-native-dev-overview" rel="noopener noreferrer"&gt;intent&lt;/a&gt;.&lt;/p&gt;
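&lt;p&gt;For example, an agent-ready issue description might read something like the following (the task and criteria here are invented for illustration, not a GitHub-prescribed template):&lt;/p&gt;

```markdown
## Task
Add an author footer to every markdown file under docs/.

## Acceptance criteria
- [ ] Every .md file in docs/ ends with an "Author:" line
- [ ] No other content in the files is modified
- [ ] The markdown lint job in CI passes
```

&lt;p&gt;The point is that each criterion is checkable, so both the agent and the reviewer can verify the PR against it.&lt;/p&gt;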

&lt;p&gt;Second, invest in tests. The prospect of an AI agent that relies on tests to know it didn’t break things is a strong incentive. High test coverage can turn the agent into a reliable contributor, while low coverage turns it into a serious risk. Strengthening your automated tests and CI pipelines can help position your projects to benefit more fully from AI involvement.&lt;/p&gt;

&lt;p&gt;The end game is to become comfortable letting the AI draft something while you guide and refine it. In this new era, the developers who thrive will be those who embrace specification-centric development: guiding AI with precision, shaping code, and validating it with tests.&lt;/p&gt;

</description>
      <category>github</category>
      <category>ai</category>
      <category>news</category>
      <category>githubcopilot</category>
    </item>
    <item>
      <title>GitHub's MCP Server: You Can Now Talk to Your Repos</title>
      <dc:creator>Bap</dc:creator>
      <pubDate>Thu, 22 May 2025 14:10:55 +0000</pubDate>
      <link>https://dev.to/fernandezbaptiste/githubs-mcp-server-you-can-now-talk-to-your-repos-2ho5</link>
      <guid>https://dev.to/fernandezbaptiste/githubs-mcp-server-you-can-now-talk-to-your-repos-2ho5</guid>
      <description>&lt;h2&gt;
  
  
  What Happened - GitHub Released its Model Context Protocol Server
&lt;/h2&gt;

&lt;p&gt;GitHub has released a new &lt;a href="https://github.com/github/github-mcp-server" rel="noopener noreferrer"&gt;open-source&lt;/a&gt; Model Context Protocol (MCP) server as part of its latest GitHub Copilot update. Announced in April 2025, the release marks GitHub’s first implementation of the MCP standard developed by Anthropic. The new server is a complete rewrite in Go, preserving “100% of the old server’s functionality” while adding improvements like customizable tool descriptions, integrated code scanning, and a new get_me function for natural language queries, e.g. “show me my private repos”.&lt;/p&gt;

&lt;p&gt;By releasing its own MCP server, GitHub provides an official gateway for agents to interact with GitHub features (repos, PRs, issues, etc.). Developers can thus automate GitHub workflows and processes, extract and analyze data from repositories, and build AI powered tools and apps that interact with GH’s ecosystem. Notably, Visual Studio Code now has native support for MCP in GitHub Copilot.&lt;/p&gt;
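&lt;p&gt;As a rough illustration, a local setup might register the server in a &lt;code&gt;.vscode/mcp.json&lt;/code&gt; file along these lines (a sketch only: the exact schema, image name, and token handling may differ, so check the github-mcp-server README; &lt;code&gt;YOUR_GITHUB_PAT&lt;/code&gt; is a placeholder):&lt;/p&gt;

```json
{
  "servers": {
    "github": {
      "command": "docker",
      "args": [
        "run", "-i", "--rm",
        "-e", "GITHUB_PERSONAL_ACCESS_TOKEN",
        "ghcr.io/github/github-mcp-server"
      ],
      "env": {
        "GITHUB_PERSONAL_ACCESS_TOKEN": "YOUR_GITHUB_PAT"
      }
    }
  }
}
```

&lt;p&gt;With something like this in place, Copilot’s agent mode can discover the server’s tools and act on repos, issues, and PRs on your behalf.&lt;/p&gt;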

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fleqfnfq7r7tzh6sj8bbk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fleqfnfq7r7tzh6sj8bbk.png" alt="Image description" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Community’s Reactions - Let’s Build!
&lt;/h2&gt;

&lt;p&gt;With close to 14,000 GitHub stars, and over 150 PRs, the new Copilot agent + MCP combo is described as a kind of “awesome sauce” for agentic workflows. Copilot is no longer limited to suggesting code; it can actually take actions and fetch information on a developer’s behalf.&lt;/p&gt;

&lt;p&gt;In day-to-day workflows, developers can now ask Copilot to perform tasks that span beyond the editor. For instance, one could prompt a simple task like: “Find any markdown files in my project missing an author footer, and create a GitHub issue to track adding those”. This evolution has developers excited, not just because Copilot feels more capable, but because they can run more productive workflows and delegate real tasks like never before.&lt;/p&gt;

&lt;p&gt;You can find more news like this on the AI Native Dev community site. &lt;/p&gt;




&lt;p&gt;Do you enjoy being informed about the AI Native Development space (AIND)? I’m helping build the &lt;a href="https://www.ainativedev.co/uvz" rel="noopener noreferrer"&gt;AIND community&lt;/a&gt;, where we cover the most relevant dev-focused news. For the busy folks, we send out &lt;a href="https://www.ainativedev.co/c7a" rel="noopener noreferrer"&gt;a crisp one-pager&lt;/a&gt; every week, free and straight to your inbox.&lt;/p&gt;




&lt;h2&gt;
  
  
  Our AIND Take - GitHub’s MCP: Context and Workflows Reach New Heights
&lt;/h2&gt;

&lt;p&gt;Large language models only perform well when they have access to context that is both timely and accurate. GitHub’s MCP server addresses this core limitation by providing a structured way to integrate directly with public and private repos. This enables AI tools to operate with real-time development context, accessing the latest repository data, issue updates, and pull request activity as it evolves.&lt;/p&gt;

&lt;p&gt;The GitHub MCP server also introduces a workflow that makes it easier for developers to move from identifying issues to generating PRs and conducting code reviews. This is the central value piece in this release: keep chatting and the MCP takes care of both code-gen and workflow actions.&lt;/p&gt;

&lt;p&gt;This end-to-end flow brings a higher level of automation and structure to what are traditionally manual steps in the development lifecycle. A promising trend is thus the emergence of complementary MCP servers that extend support across more parts of the development workflow. For example, GitHub MCP can handle repository tasks while &lt;a href="https://github.com/upstash/context7" rel="noopener noreferrer"&gt;Context7&lt;/a&gt; serves as a dependency management layer, working together to execute broader SDLC actions. When used together, these servers enable development tools to become "workflow-aware" and collaborative in nature. This also raises a key question on accuracy and reliability: how do we ensure the MCP server’s actions, like creating issues or modifying repos, are accurate and aligned with developer intent?&lt;/p&gt;

&lt;p&gt;In an increasingly crowded landscape of MCP server implementations, we believe GitHub’s official MCP server will stand out as one of the best-known and most practically useful. It is also a sign that incumbents are taking MCP seriously. With both OpenAI and Anthropic now able to interact with GitHub, developers are gaining the ability to bring their repository context with them across different models (and platforms). Finally, with Microsoft now &lt;a href="https://www.youtube.com/watch?v=GMmaYUcdMyU" rel="noopener noreferrer"&gt;open-sourcing Copilot for Visual Studio Code&lt;/a&gt;, developers gain further control and customization in their agentic workflows.&lt;/p&gt;

&lt;p&gt;What’s exciting is that AI is no longer limited to code generation via prompts - it’s now extending across the entire software development lifecycle, including GitHub repository management. With standardized protocols like MCP playing a part in the journey where builders’ intent is translated into workflow, GitHub’s release marks an important milestone.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>news</category>
      <category>github</category>
    </item>
    <item>
      <title>IDE Free Tier War: Windsurf’s Push to Win Over Developers ⚔️</title>
      <dc:creator>Bap</dc:creator>
      <pubDate>Fri, 09 May 2025 09:30:42 +0000</pubDate>
      <link>https://dev.to/fernandezbaptiste/ide-free-tier-war-windsurfs-push-to-win-over-developers-3gaj</link>
      <guid>https://dev.to/fernandezbaptiste/ide-free-tier-war-windsurfs-push-to-win-over-developers-3gaj</guid>
      <description>&lt;h1&gt;
  
  
  What's happened: Windsurf's Offering Revamp
&lt;/h1&gt;

&lt;p&gt;In April 2025, Windsurf (formerly Codeium) rolled out significant pricing updates that made waves in the developer community. First, the company made OpenAI's o4-mini and GPT-4.1 (equipped with a 1-million-token context window) freely available to all users for two weeks. Immediately after this free trial period, on April 29, Windsurf followed up with a major revamp of their pricing and offering.&lt;/p&gt;

&lt;p&gt;Free plan users now receive 25 prompt credits per month (up from 5). That is equivalent to 100 prompts with GPT-4.1, o4-mini, and other premium models. Devs will be charged once per prompt, regardless of the number of actions Cascade performs in response. This simplification will make it easier for users to anticipate costs and manage their credit usage effectively.&lt;/p&gt;

&lt;p&gt;Added to that package are unlimited Cascade, Fast Tab Completions, and Command. This is a significant upgrade designed to facilitate a more agentic experience. The free plan now also includes unlimited interactive app Preview and 1 App deploy per day (see Windsurf's Netlify &lt;a href="https://www.netlify.com/blog/netlify-windsurf-bringing-deployments-straight-into-your-ai-powered-editor" rel="noopener noreferrer"&gt;integration&lt;/a&gt;). Compared with Copilot's and Cursor's free tiers, Windsurf's proposal looks rather appealing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv2iy0010zu0r4a798gz6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv2iy0010zu0r4a798gz6.png" alt="Image description" width="800" height="443"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  Who stands to benefit?
&lt;/h1&gt;

&lt;p&gt;The expanded free credits greatly encourage tinkering and exploration. With 100 GPT-4.1 prompts per month, devs have a decent starting point to build and test small-scale projects. Simply put, developers have a cost-free sandbox enabling more experimentation and knowledge-sharing.&lt;/p&gt;

&lt;p&gt;Price-sensitive developers, notably university students and junior devs, also stand to benefit from this change. It's worth noting that though many professional devs can get IDEs like Cursor reimbursed by their companies, many established traditional firms are slower to adopt these new tools. This is an opportunity for professional devs (either at work or in their own time) to dabble with it. Following this, we expect to see a wave of companies switch to the IDE and become part of Windsurf's enterprise customer base.&lt;/p&gt;




&lt;p&gt;Side note, I'm working on &lt;a href="//ainativedev.co/sfo"&gt;The AI Native DevCon&lt;/a&gt; happening on May 13.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Join this one-day virtual event and take home:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Proven agentic workflows from leaders at Netlify, JetBrains, and Qodo&lt;/li&gt;
&lt;li&gt;Live walkthroughs of code‑generation and data‑pipeline tools&lt;/li&gt;
&lt;li&gt;Templates that slash iteration time and ship multimodal models faster&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Don't wait - AI Native DevCon is just around the corner on May 13th, &lt;a href="//ainativedev.co/sfo"&gt;secure your spot today&lt;/a&gt;!
&lt;/h4&gt;




&lt;h1&gt;
  
  
  What's the Community saying?
&lt;/h1&gt;

&lt;p&gt;The reaction across developer communities has largely been enthusiastic. Devs see it as a "breath of fresh air" in a market full of subscription-based tools. The inclusion of unlimited Cascade Base usage and Tab Completions also means the free version no longer feels "crippled" compared to Pro.&lt;/p&gt;

&lt;p&gt;Students voiced excitement that they could use GPT-4.1 in a coding IDE; most enjoyed the pace and quality of the outputs. Many users who had not tried AI coding tools indicated they were now downloading Windsurf to give it a shot, since the usual cost barrier was gone. Developers also appreciated the pricing simplification and expressed relief that the confusing Flow Actions system was addressed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdsbdgqkpspx5svodlntq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdsbdgqkpspx5svodlntq.png" alt="Image description" width="800" height="87"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;People no longer feel they're "burning credits" unknowingly in the middle of a coding session. By charging only for prompts (and not for each action made by the agent), Windsurf earned goodwill. On the whole, Windsurf's strategy succeeded in getting the community's attention, and if you haven't yet tried Windsurf, there is merit in trying it out.&lt;/p&gt;

&lt;p&gt;That said, watching how devs behave in the coming months (do they stick with Windsurf? Convert to paid? Revert to Copilot?) will be the true test to the viability of this pricing structure.&lt;/p&gt;




&lt;h1&gt;
  
  
  Future Outlook: Raising the Bar for "Free" in AI IDEs
&lt;/h1&gt;

&lt;p&gt;With the current subscription fatigue, Windsurf's move marks a shift toward a new free-tier model, and could prompt competitors to adjust. We had already seen signs of this shift in the industry. For instance, GitHub Copilot, long known for not providing a free tier for individuals, has introduced a limited free tier (&lt;a href="https://docs.github.com/en/copilot/managing-copilot/managing-copilot-as-an-individual-subscriber/getting-started-with-copilot-on-your-personal-account/about-individual-copilot-plans-and-benefits" rel="noopener noreferrer"&gt;2,000 completions/month, 50 chat requests&lt;/a&gt;, and more).&lt;/p&gt;

&lt;p&gt;As a developer, you might wonder whether prompt-based metering is the right approach at all. It can indeed be a little stressful. Many feel it should be value-driven, and that a fixed-price model would be the game changer (also who is paying for this? The foundation model folks? Split?). It's worth noting this may only be a temporary reshuffling, as others are likely to follow suit.&lt;/p&gt;

&lt;p&gt;This also raises the bigger question: what would ideal pricing even look like? Free tiers aren't sustainable forever. Will we see the shift toward premium features such as plugins, agents, team plans, or corporate offerings? For developers, this competition is fantastic news! It means more opportunities to try AI tools without committing upfront.&lt;/p&gt;

&lt;p&gt;Perhaps the biggest wildcard in Windsurf's future is the possibility of being acquired by a larger player. Notably, OpenAI is reportedly in talks to acquire Windsurf for ~$3 billion. We could see OpenAI investing in Windsurf's development, providing it with the latest models (at a discounted price?) while keeping it as a product for developers. This would have a similar flavour to Microsoft's acquisition of GitHub, and its decision to let GitHub continue operating as its own platform.&lt;/p&gt;

&lt;p&gt;One sound piece of advice is that developers shouldn't commit just yet. We are still in the beginning phases of the AI Native IDE war, and there is no clear winner. Windsurf's free tier can easily be part of your toolbox, given its generous free offering. However, there's always more to selecting an IDE product than pricing! Over time, and as the space evolves, you might consolidate if one tool leaps ahead, but for now, a hybrid approach hedges against any single platform's cost and capability limits.&lt;/p&gt;

&lt;p&gt;One thing's for sure: the IDE is evolving and becoming increasingly AI Native. As developers, this is our time to have fun exploring these tools and raise the bar in our workflows.&lt;/p&gt;

</description>
      <category>news</category>
      <category>ai</category>
      <category>discuss</category>
      <category>programming</category>
    </item>
    <item>
      <title>🤨 4 Frustrations in AI Native Development [GPT 4.1]</title>
      <dc:creator>Bap</dc:creator>
      <pubDate>Wed, 23 Apr 2025 14:52:54 +0000</pubDate>
      <link>https://dev.to/fernandezbaptiste/4-frustrations-in-ai-native-development-gpt-41-bb0</link>
      <guid>https://dev.to/fernandezbaptiste/4-frustrations-in-ai-native-development-gpt-41-bb0</guid>
      <description>&lt;h2&gt;
  
  
  tl;dr
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Developers often struggle with the applicability of benchmarks, understanding semantic versioning, and identifying a common standard to effectively leverage the &lt;em&gt;right&lt;/em&gt; models.&lt;/li&gt;
&lt;li&gt;These challenges often surface directly in developer workflows—whether it’s wiring up a model inside an IDE or deploying a feature that depends on consistent outputs across model versions.&lt;/li&gt;
&lt;li&gt;Devs also face frustration when trying to improve results—particularly since prompting techniques and optimisation mechanisms are still not fully understood.&lt;/li&gt;
&lt;li&gt;Beyond the cost-to-performance ratio as the primary lens for model selection, the future of AI native developer workflows lies in a call for abstraction layers that simplify multi-model integration.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67wf94926twyjcoumnf7.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67wf94926twyjcoumnf7.gif" alt="Image description" width="480" height="270"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;How does GPT-4.1 measure up?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;GPT-4.1 was recently released, and OpenAI disclosed its 54.6% score on &lt;a href="https://openai.com/index/introducing-swe-bench-verified/" rel="noopener noreferrer"&gt;SWE-bench Verified&lt;/a&gt;, improving by &lt;em&gt;21.4% abs&lt;/em&gt; over GPT‑4o and &lt;em&gt;26.6% abs&lt;/em&gt; over GPT‑4.5—making it a leading model for coding within OpenAI:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4xcj6mx6w9gy6w6n3uiu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4xcj6mx6w9gy6w6n3uiu.png" alt="Image description" width="800" height="384"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Are these results significant? Those figures are slightly under the scores reported by Google and Anthropic for Gemini 2.5 Pro (&lt;a href="https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/" rel="noopener noreferrer"&gt;63.8%&lt;/a&gt;) and Claude 3.7 Sonnet (&lt;a href="https://www.anthropic.com/news/claude-3-7-sonnet" rel="noopener noreferrer"&gt;62.3%&lt;/a&gt;) on the same benchmark. More importantly, the release of this new model is surfacing some frustrations in how developers think about the state of the ecosystem.&lt;/p&gt;




&lt;h2&gt;
  
  
  Wider ecosystem frustrations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Frustration 1: Alignment on Benchmarks
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;“The benchmarks are so whack, how does [GPT-4.1] score so high on SWE bench (20% higher almost) and fall behind deepseek on [Aider Benchmark]? - Reddit user.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Benchmark results (like SWE-bench, &lt;a href="https://aider.chat/docs/leaderboards/" rel="noopener noreferrer"&gt;Aider Polyglot&lt;/a&gt;, etc.) have their own sets of requirements, making it challenging to understand and trust the true high performers within a domain. For example, when switching to a model with superior benchmark scores in your dev tool, it may fail to execute your requirements as effectively as the previous model—despite the improved metrics. Developers are thus struggling to agree on model performance as the results are so specific to the task, codebase and prompts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frvq0a38ap54ptvszm9he.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frvq0a38ap54ptvszm9he.png" alt="Image description" width="800" height="221"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpe6gepi3tp00w67uoa79.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpe6gepi3tp00w67uoa79.png" alt="Image description" width="800" height="96"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Internal benchmarking is emerging as a critical practice, with teams developing custom evaluation frameworks suited to their specific needs. For instance, in its aim to build a secure AI native development platform, Tessl leverages known benchmarks as well as its own to assess the most fitting model for its use cases. While there are valid reasons for enterprises and developers to do this, many devs still just want to get work done—not manage model quirks. &lt;/p&gt;

&lt;p&gt;It’s worth noting, as a side point, that &lt;a href="https://www.notion.so/Hackernoon-Media-Partnership-1d037cf6e97a802fb5dbf76351f9289f?pvs=21" rel="noopener noreferrer"&gt;Michelle Pokrass&lt;/a&gt; @OpenAI mentioned that “&lt;em&gt;[GPT‑4.1] has a focus on real work usage and utility, and less on benchmarks,&lt;/em&gt;” further highlighting the idea that benchmarks shouldn’t be considered the ultimate source of truth.&lt;/p&gt;




&lt;h3&gt;
  
  
  Frustration 2: Lack of Standardisation
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;“4o with scheduled tasks”—why on earth is that a model and not a tool that other models can use?! — HackerNews user&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;There’s a growing outcry from developers over the lack of standardisation across models—in features and interfaces. Feature fragmentation has forced manual model selection for each task. If you’re building a dev assistant that uses image generation, CoT reasoning, and code completion, this often means duct-taping together multiple APIs and writing fallback logic. This is reminiscent of the browser wars before the 2000s—different versions of websites for Internet Explorer, Netscape, and later Firefox, because there were no standard implementations of HTML, CSS, or JavaScript. &lt;/p&gt;

&lt;p&gt;Standardisation often comes not from vendors agreeing, but from users banding together out of shared pain—think of the lock-in and interoperability pain prior to SQL. The good news is that we are seeing this already in a couple of places: think OpenAI’s API as the de facto standard, as well as Anthropic’s MCP. Prompt routers will also be a significant part of the solution. Analogous to &lt;em&gt;HTTP router + HTML standard&lt;/em&gt;, a combination of prompt routing and standardisation may be the path to resolving these frustrations.&lt;/p&gt;




&lt;h3&gt;
  
  
  Frustration 3: &lt;strong&gt;Semantic Drift in Model Versions&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;GPT-4.1’s release highlights naming confusion. There was a lot of hype around GPT-5 coming out, and after the disappointing results from GPT-4.5, OpenAI released 4.1. Why GPT-4.1? The move back toward 4.1 feels like a semantic step backward (GPT-4.1 vs. GPT-4o vs. GPT-4 Turbo—what’s going on here?).&lt;/p&gt;

&lt;p&gt;While the claims and early results show improvement, and the focus of 4.5 and 4.1 differs, it still leaves room for confusion. If you’ve been building repeatable prompting flows, you will at the very least feel mildly overwhelmed by the “model maze”—is higher better? faster? newer? &lt;/p&gt;

&lt;p&gt;That said, this ambiguity hasn’t gone unnoticed. OpenAI has acknowledged these naming confusions, and we should expect future iterations to bring more clarity to the semantics and categorisation framework.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxuah8rcespcykmmxllxi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxuah8rcespcykmmxllxi.png" alt="Image description" width="800" height="255"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Frustration 4: Prompting and Hand-Waving
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp9c8e6nt4x4rvw2xuzoy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp9c8e6nt4x4rvw2xuzoy.png" alt="Image description" width="800" height="68"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;OpenAI publishing a &lt;a href="https://cookbook.openai.com/examples/gpt4-1_prompting_guide#prompting-induced-planning--chain-of-thought" rel="noopener noreferrer"&gt;recommended workflow&lt;/a&gt; is a sign that prompting is becoming systematic, not just vibes. Yet, there’s frustration in the community that these practices still feel more like guesswork than true, quantifiable methods.&lt;/p&gt;

&lt;p&gt;Developers report that small prompt changes lead to big performance differences, revealing the lack of standard prompting guidelines. Also, developers need to prompt different models in different ways—which is tricky to manage. Prompting today thus resembles alchemy before the scientific method—part intuition, part ritual, often surprisingly effective, but lacking universal principles. &lt;/p&gt;

&lt;p&gt;That said, tools like &lt;a href="https://dspy.ai/#__tabbed_2_2" rel="noopener noreferrer"&gt;DSPy&lt;/a&gt; already offer a path toward a framework turning prompt design into a repeatable, testable, and improvable process. If you are a developer reading this, one last piece of advice would be to adopt the mindset of an “Explorer”. We’re in the prototyping phase of a new engineering discipline, and we need to shift our mental models accordingly.&lt;/p&gt;




&lt;h2&gt;
  
  
  Future Implications
&lt;/h2&gt;

&lt;p&gt;The optimal model choice often depends more on practical constraints (budget, speed, use case) than raw coding performance. Cost-to-performance ratio will become the primary lens for model selection—replacing the “latest and greatest” mindset. Indeed, the LLM ecosystem is entering the “infrastructure layer wars,” where optimisation and integration trump novelty. &lt;/p&gt;

&lt;p&gt;For instance, Gemini 2.5, served on &lt;a href="https://www.google.com/search?q=Google%E2%80%99s+TPUs+price&amp;amp;oq=Google%E2%80%99s+TPUs+price&amp;amp;gs_lcrp=EgZjaHJvbWUyBggAEEUYOTIICAEQABgWGB4yCAgCEAAYFhgeMggIAxAAGBYYHjIICAQQABgWGB4yCAgFEAAYFhgeMggIBhAAGBYYHjIKCAcQABiABBiiBDIKCAgQABiABBiiBDIKCAkQABiABBiiBNIBCDI2NDJqMGo3qAIAsAIA&amp;amp;sourceid=chrome&amp;amp;ie=UTF-8" rel="noopener noreferrer"&gt;Google’s TPUs&lt;/a&gt;, is not only leading on Aider’s coding benchmarks, but is also cheaper and faster to serve (&lt;a href="https://aider.chat/docs/leaderboards/" rel="noopener noreferrer"&gt;scored 73% on Aider Polyglot for ~$6, vs. GPT-4.1’s ~$10&lt;/a&gt;). Even if OpenAI swings back with GPT-5, the hardware war, with its latency and cost implications, may ultimately decide which vendor developers and businesses favour. Read between the lines, and this is great news for developers and organisations: we’re benefiting from a race to the bottom on cost and a rise in serving efficiency!&lt;/p&gt;

&lt;p&gt;At the same time, developers are voicing a growing trend: they don’t enjoy hand-picking models; they want abstraction layers to handle this automatically. This shift prioritises developer experience, where developers expect multi-model environments to take on this work. Think of a Cursor that automatically picks the best-fitting model for the task at hand.&lt;/p&gt;

&lt;p&gt;This evolution isn’t just about benchmark performance—it’s about empowering developers to fine-tune their infra: using leaner, cheaper models for routine jobs and automatically switching to more powerful ones when the stakes are higher. The future of AI native development doesn’t lie in manually picking models—it lies in building abstraction layers that make those choices invisible, so developers can focus on building, not juggling APIs. &lt;/p&gt;
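
&lt;p&gt;To make the routing idea concrete, here is a minimal sketch of such an abstraction layer: routine tasks go to a cheap model and high-stakes or large tasks go to a stronger one. The model names, prices, and the word-count heuristic are all invented for illustration, not real vendor APIs.&lt;/p&gt;

```python
# Minimal sketch of a model-routing layer. Model names, token prices,
# and the stakes heuristic are hypothetical placeholders.
from dataclasses import dataclass

@dataclass(frozen=True)
class Model:
    name: str
    cost_per_1k_tokens: float

CHEAP = Model("small-fast-model", 0.0002)
STRONG = Model("large-reasoning-model", 0.0100)

def route(task: str, high_stakes: bool = False) -> Model:
    """Pick a model from a simple stakes/size heuristic."""
    if high_stakes or len(task.split()) > 50:
        return STRONG
    return CHEAP

print(route("rename this variable across the module").name)
print(route("redesign the auth flow", high_stakes=True).name)
```

&lt;p&gt;A real router would weigh latency budgets, per-task eval scores, and fallbacks, but the interface stays the same: the developer states the task, and the layer owns the model choice.&lt;/p&gt;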

&lt;p&gt;All in all, it’s a shift from toolsmith to architect, one that perfectly mirrors the &lt;em&gt;implementation to intent&lt;/em&gt; pattern from Patrick’s &lt;a href="//ainativedev.co/cpm"&gt;4 Patterns of AI Native Development&lt;/a&gt;. If you haven’t explored it yet, it’s a must-read—it offers a compelling blueprint for where we’re headed. &lt;/p&gt;

&lt;p&gt;If you enjoyed this piece, make sure to subscribe to our newsletter to keep up to date with the latest insights in the AI native development space.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>discuss</category>
      <category>news</category>
    </item>
    <item>
      <title>🚀 AI Native Dev Landscape Just Dropped (and it's open source)</title>
      <dc:creator>Bap</dc:creator>
      <pubDate>Wed, 26 Mar 2025 15:58:40 +0000</pubDate>
      <link>https://dev.to/fernandezbaptiste/ai-native-dev-landscape-just-dropped-and-its-open-source-3h3i</link>
      <guid>https://dev.to/fernandezbaptiste/ai-native-dev-landscape-just-dropped-and-its-open-source-3h3i</guid>
      <description>&lt;p&gt;Time lost searching the internet for new tools …. E-TOO-Much. &lt;/p&gt;

&lt;p&gt;Chances are that you are noticing a lot of new AI tools pop up in your social feeds. &lt;/p&gt;

&lt;p&gt;You are really happy when you discover a great new one, but the discovery process feels off and artisanal. &lt;/p&gt;

&lt;p&gt;It encourages people to keep scouting the internet and potentially sharing their own lists among friends and colleagues.&lt;/p&gt;

&lt;p&gt;🎧 Listen to the full conversation with Patrick Debois on the AI Native Dev &lt;a href="https://youtu.be/kMRHuc36AK4" rel="noopener noreferrer"&gt;podcast here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We're excited to announce our brand new AI Native Developer Tool Landscape that keeps track of what is new in the intersection of AI and the whole Software Delivery Lifecycle. &lt;/p&gt;

&lt;p&gt;Ranging from tools that help you with product prototyping, development, quality assurance, DevOps, and security, we've got you covered.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzbery6uas0yy9h9m6bbf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzbery6uas0yy9h9m6bbf.png" alt="Image description" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is, of course, only the beginning. We’ll be enhancing the landscape with reviews (and hot takes), letting it become the single stop to find the tool you need to get your job done. &lt;/p&gt;

&lt;p&gt;Whether you are a developer, a researcher, or even a VC, we got you covered!&lt;/p&gt;

&lt;p&gt;A few words on lessons learned building out the landscape categories:&lt;/p&gt;

&lt;p&gt;Naming is hard, and it turns out naming categories is too. &lt;br&gt;
We decided to stick to the current terminology of the SDLC domain and place each tool first in a general domain. &lt;/p&gt;

&lt;p&gt;Some tools span multiple categories; when that happens, we may list them in a few different sections, but this should be the exception for now.  &lt;/p&gt;

&lt;p&gt;We also looked at grouping tools by features, but we quickly learned that features can be implemented with very different levels of polish and maturity, making it difficult to draw clean distinctions between tools.&lt;/p&gt;

&lt;p&gt;Some categories will be “overcrowded” as they are in full expansion, making them too big to display in full. We will sort tools by a mixture of popularity and newness and cap the number shown by default; you can expand a category to explore all of its tools.&lt;/p&gt;

&lt;p&gt;We are big believers in learning in the open. By making the list public and allowing others to contribute to it, we can share the burden of keeping it up to date. We have vendors participating to keep their information current, but we also count on your help to expand the list and keep it fresh. You can send a &lt;a href="https://github.com/AI-Native-Dev-Community/ai-native-dev-landscape/blob/main/CONTRIBUTING.md" rel="noopener noreferrer"&gt;pull request&lt;/a&gt; to update our repo.&lt;/p&gt;

&lt;p&gt;And now, to stay up to date, you can simply &lt;a href="https://landscape.ainativedev.io/" rel="noopener noreferrer"&gt;visit our website&lt;/a&gt; whenever you want to discover new tools.&lt;/p&gt;

&lt;p&gt;Happy tool sharing!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>opensource</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Plugins and Platforms: v0’s Marketplace Integrations in AI Native Development</title>
      <dc:creator>Bap</dc:creator>
      <pubDate>Tue, 25 Mar 2025 17:42:41 +0000</pubDate>
      <link>https://dev.to/fernandezbaptiste/plugins-and-platforms-v0s-marketplace-integrations-in-ai-native-development-4o48</link>
      <guid>https://dev.to/fernandezbaptiste/plugins-and-platforms-v0s-marketplace-integrations-in-ai-native-development-4o48</guid>
      <description>&lt;h2&gt;
  
  
  v0 Marketplace, what’s New?
&lt;/h2&gt;

&lt;p&gt;Vercel’s v0 is an AI-powered development assistant that helps developers design, iterate, and deploy full-stack applications using natural language prompts. With the recent &lt;a href="https://vercel.com/changelog/vercel-marketplace-integrations-now-available-in-v0" rel="noopener noreferrer"&gt;Marketplace integration&lt;/a&gt;, v0 can now directly incorporate third-party services and infrastructure into generated projects.&lt;/p&gt;

&lt;p&gt;Previously, integrating external backends or APIs required manual setup outside the tool. Now, v0 can provision and configure services automatically, making these services instantly accessible to both the Vercel deployment and the v0 AI during generation. For instance, users can enable a serverless Postgres database (Neon or Supabase) or a Redis-compatible cache (Upstash), allowing v0 to integrate these services into an application without requiring users to leave the browser.&lt;/p&gt;




&lt;h2&gt;
  
  
  A new step towards AI-Native Development
&lt;/h2&gt;

&lt;p&gt;For developers, these integrations eliminate a major friction point. Instead of switching between interfaces and managing API credentials, everything is progressively injected and unified within the UI. This results in faster prototyping and iteration, enabling teams to generate a fully functional application—frontend, backend, and external services—in one go.&lt;/p&gt;

&lt;p&gt;These changes reflect an industry shift, much like Bolt.new integrating Netlify (as opposed to v0 integrating Vercel for deployment), which enhances developer experience by leveraging direct integrations. A hallmark of Bolt is its tight integration with Supabase for backend needs. In fact, Bolt essentially requires a Supabase connection to generate a functional app (see below for comparison with v0).&lt;/p&gt;

&lt;p&gt;This is a notable improvement in the current AI Native development ecosystem. With the refinement of its two-step workflow, developers can prototype on v0, Bolt, or Lovable and then export to Cursor, Cline, or Copilot for further iteration. The rise of customisable integration is part of a broader movement, marking a significant step forward in creating a more productive developer experience.&lt;/p&gt;

&lt;p&gt;It is worth noting that within AI Native development coding platforms, Supabase is a strong contender. Lovable.dev is another platform that benefits from its real-time Postgres capabilities and simple REST/GraphQL APIs. This also brings an exciting race for competitors like Neon and Upstash to introduce one-click integrations, aiming to capture a share of the AI-driven development market.&lt;/p&gt;




&lt;h2&gt;
  
  
  From Implementation to Intent
&lt;/h2&gt;

&lt;p&gt;As noted, AI Native development will dramatically speed up the pace of manual coding—with the very real future of removing a non-trivial amount of it. So, while manual coding is gradually being outsourced, code generation is on the rise, and AI-assisted code writing is growing significantly.&lt;/p&gt;

&lt;p&gt;We’re in a world where you can build a minimum viable product in days, not weeks. This compresses innovation cycles – teams can prototype, get feedback, and refine in a continuous loop with much lower overhead. This evolution very much mirrors the shift from assembly languages to high-level programming—where abstraction unlocked acceleration. Patrick Debois frames this &lt;a href="//ainativedev.co/k8t"&gt;paradigm shift&lt;/a&gt; as a move from developers defining the how to developers defining the what.&lt;/p&gt;

&lt;p&gt;This is already illustrated by Bolt.new removing the need for users to configure Supabase (the how) and letting them directly build the database (the what) when developers ask for it. As Vercel’s CTO highlighted during the &lt;a href="//ainativedev.co/g1h"&gt;AI Native DevCon session&lt;/a&gt; (you can secure your spot at &lt;a href="//ainativedev.co/sou"&gt;AI Native DevCon 2025 here&lt;/a&gt;!), AI will naturally gravitate towards the most seamless tool, removing the need for manual selection. This doesn’t eliminate the need for engineers, but it amplifies individual impact. It also means that product experiments which might have been too costly in developer hours can now be tried easily, shrinking the gap between idea and execution.&lt;/p&gt;




&lt;h2&gt;
  
  
  Deeper Integrations Ahead
&lt;/h2&gt;

&lt;p&gt;The current integrations (databases, auth, third-party APIs) are just the start. Domain-specific integrations could emerge – like AI systems specialised in ecommerce that automatically integrate with Shopify or Stripe, or AI for data science that spins up Jupyter notebooks and connects to Snowflake with one prompt.&lt;/p&gt;

&lt;p&gt;The Marketplace concept has already expanded to include AI model integrations — many platforms already let you swap between models. Imagine a company with a model (or models) that knows its internal architecture and coding conventions, resulting in more tailored code generation. This aligns with what Vercel’s team discussed on &lt;a href="//ainativedev.co/9hp"&gt;our podcast&lt;/a&gt;: the concept of ‘eval-driven development’ — testing model outputs with benchmarks or tests to validate generated code. You might choose a 4o-mini based model for one task, then a faster, cheaper model for another, all orchestrated by the platform (this resonates with our discussions in &lt;a href="//ainativedev.co/0p5"&gt;o3-mini vs. GPT-4.5&lt;/a&gt;).&lt;/p&gt;
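
&lt;p&gt;In its simplest form, eval-driven development means executing a candidate snippet (as a model might generate it) against unit-style checks and only accepting it when every check passes. Here is a toy Python sketch of that gate; the candidate code and cases are invented, and in practice you would sandbox the execution of untrusted model output.&lt;/p&gt;

```python
# Toy sketch of an eval-driven gate for generated code. The candidate
# snippets are stand-ins for model output; real pipelines must sandbox
# exec() since model output is untrusted.
candidate = """
def add(a, b):
    return a + b
"""

buggy = """
def add(a, b):
    return a - b
"""

def passes_evals(code: str, cases) -> bool:
    """Load the generated function and run it against expected outputs."""
    namespace = {}
    exec(code, namespace)
    fn = namespace["add"]
    return all(fn(*args) == expected for args, expected in cases)

cases = [((1, 2), 3), ((-1, 1), 0)]
print(passes_evals(candidate, cases))  # True
print(passes_evals(buggy, cases))      # False
```

&lt;p&gt;Scaling this gate up — richer test suites, regeneration on failure, scoring across models — is essentially what platform-level eval-driven development automates.&lt;/p&gt;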

&lt;p&gt;This will enable highly customisable, non-trivial projects with multiple steps, going beyond simple CRUD apps. Referring back to the above, one tangential hypothesis is that in the medium term, the line between “intent” and “implementation” will blur. Perhaps one day you might say, “Build me a clone of X app with feature tweaks Y and Z,” and the AI will negotiate with developers through all required integrations and code to deliver it. We’re not there yet, but the trajectory is set.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Building of Monopolies?
&lt;/h2&gt;

&lt;p&gt;Building on the above, we need to watch out for concentration of power in AI Native development platforms. Right now, we have several players. If one platform achieves a significantly better AI ecosystem lock-in, it could dominate the market, much like a dominant operating system. This opens the door to potential biases – for example, one could imagine a cloud provider’s AI tool subtly favouring its own database service over a third party’s. We’ve already seen this in practice: we noticed that the Replit agent performs significantly better with its built-in database than with a third-party datastore like Supabase.&lt;/p&gt;

&lt;p&gt;There’s also the data angle: the more users a platform has, the more data it gathers to further improve its AI—creating a network effect. This would lead to monopoly-like dynamics where one or two services become the de facto choice in AI-generated projects. For developers’ sake, it will be important to have interoperability – for example, ensuring you can export your AI-generated code to other platforms, or that an AI suggestion can be adjusted to use a different library if needed.&lt;/p&gt;

&lt;p&gt;As AI-native development becomes common, maintaining openness and transparency in how integration choices are made will be important to avoid locking developers into suboptimal or one-vendor solutions. Otherwise, we risk trading the old cloud lock-in for a new kind of AI Native development lock-in.&lt;/p&gt;




&lt;p&gt;We’ll continue to monitor how platforms like v0, Bolt.new, and others evolve, and how they impact real development workflows. In fact, we are currently reviewing developers’ general sentiment toward AI Native development tools, and would &lt;a href="//ainativedev.co/d7p"&gt;love to get your input&lt;/a&gt;! We are examining how this landscape is affecting engineering teams’ trust and productivity, and are excited about reviewing and sharing the results with you.&lt;/p&gt;

&lt;p&gt;In the meantime, we’d love to hear your experiences: Have you tried building an app with AI tools like v0 or Bolt.new? How has AI changed the way you develop? Let us know! Your insights could be part of the conversation as we all navigate this new era of software development.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>tooling</category>
      <category>development</category>
      <category>aitool</category>
    </item>
    <item>
      <title>o3 vs GPT-4.5: Observations in AI-Native Development</title>
      <dc:creator>Bap</dc:creator>
      <pubDate>Tue, 18 Mar 2025 13:31:03 +0000</pubDate>
      <link>https://dev.to/fernandezbaptiste/o3-vs-gpt-45-observations-in-ai-native-development-51e5</link>
      <guid>https://dev.to/fernandezbaptiste/o3-vs-gpt-45-observations-in-ai-native-development-51e5</guid>
      <description>&lt;p&gt;Hey peeps,&lt;/p&gt;

&lt;p&gt;I sat down with &lt;a href="https://www.linkedin.com/in/amy-heineike/" rel="noopener noreferrer"&gt;Amy Heineike&lt;/a&gt;, AI engineer at Tessl.io, to discuss her work in evaluating which models excel at AI Native development.&lt;/p&gt;

&lt;h2&gt;
  
  
  tl;dr &lt;em&gt;“Why is this important for developers?”&lt;/em&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Building complex AI-generated modules and packages remains a non-trivial challenge, but we are getting closer than ever to true AI Native development.&lt;/li&gt;
&lt;li&gt;o3-mini is a hidden gem that hasn’t gained much traction in developer circles yet. If you’re building AI-powered development tools, this model is a contender worth serious consideration.&lt;/li&gt;
&lt;li&gt;Tessl’s AI Engineering team’s initial findings highlight o3-mini’s superiority in building complex, multi-layered systems.&lt;/li&gt;
&lt;li&gt;We don’t see a single model serving all use cases. Instead, the key lies in identifying and leveraging the strengths of different models at each step of the AI Native development workflow.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The challenge of AI Native development
&lt;/h2&gt;

&lt;p&gt;Building AI Native development is a challenge that demands precision. Success hinges on integrating code understanding, specification-to-code translation, intelligent code generation, and automated testing across multiple modules. Beyond OpenAI’s &lt;a href="https://openai.com/index/openai-o3-mini/" rel="noopener noreferrer"&gt;insights&lt;/a&gt; on o3-mini’s coding performance (see below), there’s a shared sentiment in the developer community about its effectiveness. For example, Simon Willison noted his surprise at o3-mini’s ability to generate programs that solve &lt;a href="https://arcprize.org/blog/oai-o3-pub-breakthrough" rel="noopener noreferrer"&gt;novel tasks&lt;/a&gt;, as well as its strength in producing &lt;a href="https://gist.github.com/simonw/4a13c4b10176d7b8e3d1260f5dcc9de3" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdf2c97rpbt7odt4ntk3k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdf2c97rpbt7odt4ntk3k.png" alt="Image description" width="800" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, within AI Native development, each layer introduces new complexities, requiring models to not only generate functional code but also understand dependencies, adapt to evolving specifications, and self-correct through testing. Traditional approaches struggle with this level of integration, making it clear that AI Native development demands a fundamentally different paradigm. &lt;/p&gt;

&lt;p&gt;Advancements in reasoning models, from the early “chain-of-thought” approach to “&lt;a href="https://docs.anthropic.com/en/docs/about-claude/models/extended-thinking-models?q=hybrid+thinking" rel="noopener noreferrer"&gt;hybrid reasoning&lt;/a&gt;” in Claude 3.7, make this an exciting time for tackling these complex problems. Tessl’s AI engineering team built an evaluation framework that enables continuous testing of new models as they are released. When GPT-4.5 launched with their &lt;a href="https://community.openai.com/t/openai-roadmap-and-characters/1119160" rel="noopener noreferrer"&gt;“last non-chain-of-thought model”&lt;/a&gt;, the team assessed which models best suited their use case.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparing o3-mini vs GPT 4.5—Tessl’s approach
&lt;/h2&gt;

&lt;p&gt;Tessl, focused on AI Native development, initially had its generation process use GPT-4o for most tasks, but transitioned to o3-mini after it demonstrated stronger performance. With the release of GPT-4.5, and its &lt;a href="https://www.theverge.com/news/620021/openai-gpt-4-5-orion-ai-model-release" rel="noopener noreferrer"&gt;claims of producing more accurate responses and less hallucination&lt;/a&gt;, the team conducted a comparative analysis to assess its performance against o3-mini.&lt;/p&gt;

&lt;p&gt;The evaluation process involved testing the model’s ability to generate complete, working, multi-module packages. Each package represented a distinct coding challenge, such as implementing a calculator, performing color transformations, or creating a spreadsheet engine. &lt;/p&gt;

&lt;p&gt;The task focused on key capabilities they were testing within AI Native Development:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Code understanding&lt;/li&gt;
&lt;li&gt;Code generation&lt;/li&gt;
&lt;li&gt;Translating specifications into code&lt;/li&gt;
&lt;li&gt;Debugging and error resolution&lt;/li&gt;
&lt;li&gt;Test case generation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The results provided meaningful insights. &lt;/p&gt;

&lt;p&gt;Initially, the team left GPT-4.5 and o3-mini to generate their own test cases, and o3-mini demonstrated a significantly higher pass rate. However, to ensure a fair comparison, the team standardised the evaluation by using test specifications and cases generated by o3-mini for both models. With this apples-to-apples comparison, o3-mini still proved to be significantly stronger in their internal pass rate benchmarks. &lt;/p&gt;
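
&lt;p&gt;The apples-to-apples setup described above can be sketched in a few lines: both models are scored on the same shared test cases, so differences in self-generated tests can’t skew the comparison. The case names and pass/fail outcomes below are invented for illustration, not Tessl’s actual results.&lt;/p&gt;

```python
# Sketch of a standardised pass-rate comparison: both models run against
# the SAME shared test suite. All outcomes here are hypothetical.
shared_cases = ["calculator", "color_transform", "spreadsheet", "markdown_parser"]

results = {
    "model-a": {"calculator": True, "color_transform": True,
                "spreadsheet": True, "markdown_parser": False},
    "model-b": {"calculator": True, "color_transform": False,
                "spreadsheet": True, "markdown_parser": False},
}

def pass_rate(model: str) -> float:
    """Fraction of the shared suite the model's generated code passed."""
    outcomes = results[model]
    return sum(outcomes[c] for c in shared_cases) / len(shared_cases)

for model in results:
    print(model, f"{pass_rate(model):.0%}")
```

&lt;p&gt;Fixing the test suite first is what makes the resulting pass rates comparable; letting each model grade itself on its own tests measures test-writing style as much as coding ability.&lt;/p&gt;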

&lt;p&gt;Our findings align with OpenAI’s statement that GPT-4.5 is “&lt;a href="https://openai.com/index/introducing-gpt-4-5/" rel="noopener noreferrer"&gt;showing strong capabilities in […] multi-step coding workflows&lt;/a&gt;”. However, in this context, o3-mini ultimately proved to be a better fit for Tessl’s AI Native development use case. These findings resonate fairly well with SWE-bench (a known benchmark for evaluating models on issues collected from GitHub—see below).&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Worth noting: Within Tessl’s benchmark case, the team didn’t see enough evidence to suggest that GPT-4.5 outperformed GPT-4o—an interesting signal given the much higher cost of GPT-4.5.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwer0vg0pgamp92wo76m5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwer0vg0pgamp92wo76m5.png" alt="Image description" width="800" height="331"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Ultimately, the most interesting insights lie between the lines. &lt;/p&gt;

&lt;p&gt;OpenAI has found that the model performed well on some coding benchmarks, suggesting that it might do better on certain subtasks—perhaps those more aligned with the emotional intelligence they tout. Perhaps the fact that GPT-4.5 organically generated more tests hints at this. This raises compelling questions: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Were GPT-4.5 generated tests superior?&lt;/li&gt;
&lt;li&gt;Could GPT-4.5 be better suited for test generation rather than code creation?&lt;/li&gt;
&lt;li&gt;Would it be more effective to leverage GPT-4.5 for specific aspects of AI Native development rather than applying it universally?&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Implication for the Future of AI Native Development
&lt;/h2&gt;

&lt;p&gt;Model advancements are reshaping development workflows, making AI-driven coding a more practical reality. These early results could push more AI-powered dev tools to integrate models like o3-mini into their workflows. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;“o3-mini has really been a step change for us.  It avoids many small compounding errors that have tripped up other models.  It is making well-considered code that suddenly makes it feel like we’re a lot closer to the AI Native future.  It’s especially exciting that this jump has come from post-training because there’s so much opportunity for further gains here.”&lt;/em&gt;  Amy Heineike, AI Engineer at Tessl&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That said, should we explore further model pairing experiments—where one model (potentially o3-mini at this stage) manages the overarching system architecture while another refines the finer details? We believe the future of AI Native development lies in leveraging multiple models, stacked on top of one another, each optimised for a specific stage of the development workflow. Just as a hammer and a screwdriver can both put a screw in place—but with different levels of effort and precision—different models excel in different roles within the development process.&lt;/p&gt;

&lt;p&gt;For instance, GPT-4.5 is known for its human-like writing, while o3-mini excels in coding output. Could o3-mini generate the code while GPT-4.5 refines and explains it in a more natural way? What role does each model play in this complex puzzle? And which pairings create the most effective AI Native development stack? &lt;/p&gt;

&lt;p&gt;We’re still in the early days of AI Native development, and the possibilities ahead are exciting. Let’s explore, build, and learn together. What models are you using? What evals are you running? What insights have you unearthed? We will be looking into this in more detail at the &lt;a href="https://tessl.co/41upVVk" rel="noopener noreferrer"&gt;2025 AI Native DevCon&lt;/a&gt; event. If you’re interested in AI Native development, emerging trends, and how to stay ahead in this fast-moving space, join us!&lt;/p&gt;

&lt;p&gt;Plus, be part of the conversation—join our AI Native Development community on &lt;a href="https://tessl.co/4hEUVYI" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>chatgpt</category>
      <category>tooling</category>
    </item>
  </channel>
</rss>
