<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Patrick Chan</title>
    <description>The latest articles on DEV Community by Patrick Chan (@patrick_chan_0922a197d89d).</description>
    <link>https://dev.to/patrick_chan_0922a197d89d</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1988053%2Fd5c92697-31a3-4286-b315-740f95d92b12.jpeg</url>
      <title>DEV Community: Patrick Chan</title>
      <link>https://dev.to/patrick_chan_0922a197d89d</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/patrick_chan_0922a197d89d"/>
    <language>en</language>
    <item>
      <title>Why Traditional Regression Testing Doesn’t Work for MCP Tools</title>
      <dc:creator>Patrick Chan</dc:creator>
      <pubDate>Fri, 04 Jul 2025 01:24:19 +0000</pubDate>
      <link>https://dev.to/gentoro/why-traditional-regression-testing-doesnt-work-for-mcp-tools-52id</link>
      <guid>https://dev.to/gentoro/why-traditional-regression-testing-doesnt-work-for-mcp-tools-52id</guid>
      <description>&lt;h2&gt;
  
  
  Why MCP Changes the Game for Testing
&lt;/h2&gt;

&lt;p&gt;Modern AI systems increasingly rely on the &lt;a href="https://www.gentoro.com/blog/connecting-agents-to-the-enterprise-with-mcp-tools" rel="noopener noreferrer"&gt;Model Context Protocol (MCP)&lt;/a&gt;, the standardized interface that enables large language models (LLMs) to communicate with structured tools like APIs, databases, and enterprise systems. These MCP Tools allow agents to reason over context and dynamically choose which tool to invoke, with what parameters, and when, all in response to natural language prompts.&lt;/p&gt;

&lt;p&gt;This flexibility is powerful, but it introduces a fundamental problem: &lt;strong&gt;you can’t use traditional regression testing methods to validate these systems&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In this post, we’ll explore what traditional regression testing assumes, how MCP systems violate those assumptions, and what it takes to test MCP workflows reliably.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is Traditional Regression Testing?
&lt;/h2&gt;

&lt;p&gt;In software development, regression testing ensures that functionality that worked before still works after a code change. This is most often implemented as unit tests, which are short, isolated tests that run automatically and compare outputs to expected values. Here’s a basic example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import unittest

def divide(a, b):
    return a / b

class TestMath(unittest.TestCase):
    def test_divide(self):
        self.assertEqual(divide(10, 2), 5)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This approach assumes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Determinism&lt;/strong&gt;: Same input always produces the same output.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clear expectations&lt;/strong&gt;: You know the exact expected result.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test independence&lt;/strong&gt;: One test doesn’t affect others.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Low, fixed cost&lt;/strong&gt;: Tests are fast and cheap to run frequently.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;MCP systems break all of these assumptions.&lt;/p&gt;

&lt;h2&gt;
  
  
  How MCP Breaks Traditional Testing Models
&lt;/h2&gt;

&lt;p&gt;MCP-based systems shift testing from fixed-function code to probabilistic, context-sensitive reasoning. At a high level, this leads to the following issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Behavior is &lt;strong&gt;non-deterministic&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Tools are &lt;strong&gt;interdependent&lt;/strong&gt;, not isolated&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool vs. toolbox testing&lt;/strong&gt; must be separated&lt;/li&gt;
&lt;li&gt;It behaves more like &lt;strong&gt;integration testing&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Testing introduces &lt;strong&gt;non-trivial, recurring costs&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s explore each of these challenges in more depth.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Testing the Toolbox Is Non-Deterministic
&lt;/h2&gt;

&lt;p&gt;MCP systems rely on LLMs making probabilistic decisions. Given the same prompt and toolbox, the model might:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pick different tools on different runs&lt;/li&gt;
&lt;li&gt;Format parameters differently&lt;/li&gt;
&lt;li&gt;Succeed or fail inconsistently&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Traditional systems assume consistent behavior. But in MCP, &lt;strong&gt;a prompt that “passes” once may still be unreliable if it only works 70% of the time&lt;/strong&gt;. And while retries can sometimes succeed, the real question is &lt;strong&gt;how often does it fail&lt;/strong&gt;?&lt;/p&gt;

&lt;p&gt;What matters is the success rate, not whether it passed on a single attempt. If success drops below an acceptable threshold, that’s a signal the prompt, tool, or system needs adjustment.&lt;/p&gt;

&lt;p&gt;Traditional regression tools aren’t built to track success distributions—they treat tests as pass/fail. But with MCP, every test should be measured statistically over time.&lt;/p&gt;
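
&lt;p&gt;A minimal sketch of what that statistical framing can look like in practice. Here &lt;code&gt;run_prompt&lt;/code&gt; is a hypothetical callable standing in for a real agent invocation, and the 90% threshold is an arbitrary example rather than a recommended value:&lt;/p&gt;

```python
def success_rate(run_prompt, runs=50):
    # Execute the same prompt-driven test many times and
    # measure the fraction of runs that pass.
    passes = sum(1 for _ in range(runs) if run_prompt())
    return passes / runs

def regression_check(run_prompt, threshold=0.9, runs=50):
    # Pass/fail becomes a statistical judgment over many runs,
    # not the outcome of a single attempt.
    rate = success_rate(run_prompt, runs)
    return rate >= threshold, rate
```

&lt;p&gt;A suite built this way can flag a prompt whose success rate drifts below the threshold, even while individual runs still sometimes pass.&lt;/p&gt;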

&lt;h2&gt;
  
  
  2. Everything Is Interdependent (aka “You Can’t Test in Isolation”)
&lt;/h2&gt;

&lt;p&gt;In MCP systems, tools and prompts are contextually bound. If you rename a tool, edit its description, or modify example prompts, you may affect how the model behaves across many other prompts or tools. That breaks the assumption of test independence. A fix in one place can lead to unexpected regressions elsewhere.&lt;/p&gt;

&lt;p&gt;This leads to the classic “whack-a-mole” problem, where you fix one thing only to break another. Testing MCP systems requires reasoning about the system as a whole, not about individual unit behaviors.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Tool Testing ≠ Toolbox Testing
&lt;/h2&gt;

&lt;p&gt;A critical distinction:&lt;br&gt;
&lt;strong&gt;Tool testing:&lt;/strong&gt; You directly invoke the tool with parameters and verify the result. This bypasses the model entirely and resembles traditional integration testing.&lt;br&gt;
&lt;strong&gt;Toolbox testing:&lt;/strong&gt; You provide a prompt, and the LLM must:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Understand the user’s intent&lt;/li&gt;
&lt;li&gt;Choose the correct tool&lt;/li&gt;
&lt;li&gt;Fill in the right parameters&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Toolbox testing is about model reasoning, not function correctness. It’s harder, more variable, and must be evaluated semantically.&lt;br&gt;
Also, because MCP tool outputs are often &lt;strong&gt;unstructured&lt;/strong&gt;, you may still need an LLM to verify whether the tool did the right thing, even in tool testing.&lt;/p&gt;
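
&lt;p&gt;The distinction is easiest to see as two differently shaped tests. Everything below is hypothetical: &lt;code&gt;create_ticket&lt;/code&gt; is a made-up tool, and &lt;code&gt;model_choose&lt;/code&gt; is a stub standing in for the LLM’s tool-selection step:&lt;/p&gt;

```python
def create_ticket(title, priority):
    # Hypothetical MCP tool; in reality this would hit a ticketing API.
    return {"id": 1, "title": title, "priority": priority}

def tool_test():
    # Tool testing: call the tool directly with known parameters and
    # verify the result. The model is bypassed entirely.
    result = create_ticket("Printer down", priority="high")
    return result["priority"] == "high"

def model_choose(prompt):
    # Stub for the LLM's reasoning step. A real toolbox test would send
    # the prompt through the MCP client and record which tool was called.
    return {"tool": "create_ticket",
            "args": {"title": "Printer down", "priority": "high"}}

def toolbox_test(prompt):
    # Toolbox testing: did the model understand the intent, pick the
    # right tool, and fill in the right parameters?
    call = model_choose(prompt)
    return call["tool"] == "create_ticket" and call["args"]["priority"] == "high"
```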

&lt;h2&gt;
  
  
  4. MCP Testing Is Closer to Integration Testing
&lt;/h2&gt;

&lt;p&gt;MCP Tools usually connect to live systems such as messaging platforms and ticketing systems. As a result, MCP test cases mutate real data, depend on specific system state, and may span multiple steps or APIs.&lt;/p&gt;

&lt;p&gt;This isn’t simple unit testing… it’s full-on integration testing. Traditionally, integration testing has been handled by human testers, often using scripts and manual verification. But with modern LLMs, we now have the ability to automate much of this reasoning.&lt;/p&gt;

&lt;p&gt;An LLM can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Set up a valid test state&lt;/li&gt;
&lt;li&gt;Interpret messy outputs&lt;/li&gt;
&lt;li&gt;Judge whether the action succeeded&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This enables automated integration testing that was previously impractical.&lt;/p&gt;
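
&lt;p&gt;As a sketch, the judging step might look like the following. &lt;code&gt;llm_judge&lt;/code&gt; stands in for a real model call; a trivial keyword check substitutes for the LLM here so the shape of the test stays visible:&lt;/p&gt;

```python
def llm_judge(goal, raw_output):
    # Stand-in for asking an LLM: "given this goal, does the tool's
    # unstructured output indicate success?" A real implementation
    # would send both strings to a model and parse its verdict.
    text = raw_output.lower()
    return "ticket" in text and "created" in text

def integration_test():
    # 1. Set up a valid test state (elided here).
    # 2. Run the agent or tool and capture its messy, unstructured output.
    raw_output = "Done. Ticket #4521 created for 'Printer down'."
    # 3. Let the judge decide whether the action actually succeeded.
    return llm_judge("create a support ticket", raw_output)
```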

&lt;h2&gt;
  
  
  5. MCP Testing Adds Real Cost
&lt;/h2&gt;

&lt;p&gt;In traditional testing, you pay for the compute (a fixed server or CI pipeline), but you can run your test suite as often as needed. Once it’s set up, additional test runs are nearly free.&lt;/p&gt;

&lt;p&gt;MCP testing is different:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every prompt costs tokens&lt;/li&gt;
&lt;li&gt;Every tool call might invoke real APIs, which may be metered or rate-limited&lt;/li&gt;
&lt;li&gt;Tests that mutate real systems require cleanup&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These costs aren’t theoretical, and they add up quickly. If a test run costs $0.10 and you have 5,000 test cases, that’s $500 per full regression cycle.&lt;/p&gt;

&lt;p&gt;Because of this, MCP testing needs to be cost-aware. Strategies include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prioritizing&lt;/strong&gt; high-risk or frequently failing tests&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sampling&lt;/strong&gt; subsets instead of running everything every time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tracking cost per test&lt;/strong&gt; and optimizing the suite accordingly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can’t treat MCP testing as fire-and-forget. You need to manage it like a resource.&lt;/p&gt;
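
&lt;p&gt;One way to make those strategies concrete is a simple budgeted scheduler. The field names (&lt;code&gt;cost&lt;/code&gt;, &lt;code&gt;recent_failure_rate&lt;/code&gt;) and the greedy scheme are illustrative assumptions, not a prescribed design:&lt;/p&gt;

```python
def plan_run(tests, budget):
    # Rank tests by recent failure rate (highest risk first), then
    # greedily select tests until the dollar budget is exhausted.
    ranked = sorted(tests, key=lambda t: t["recent_failure_rate"], reverse=True)
    selected, spent = [], 0.0
    for t in ranked:
        if spent + t["cost"] > budget:
            continue  # this test would blow the budget; skip it
        selected.append(t)
        spent += t["cost"]
    return selected, spent
```

&lt;p&gt;A scheduler like this runs the riskiest tests on every cycle and rotates the rest through cheaper, sampled runs.&lt;/p&gt;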

&lt;h2&gt;
  
  
  Rethinking Regression Testing in the Era of MCP
&lt;/h2&gt;

&lt;p&gt;Traditional regression testing was built for a different world, one where logic was static, outputs were deterministic, and tools were isolated. MCP invalidates every one of those assumptions. &lt;/p&gt;

&lt;p&gt;MCP systems mark a shift from code execution to reasoning-driven action. They are powerful, flexible, and deeply contextual, but they don’t fit into the testing frameworks we’ve used for decades.&lt;/p&gt;

&lt;p&gt;To test MCP systems effectively, you must:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Measure success rates, not just single outcomes&lt;/li&gt;
&lt;li&gt;Treat tool and toolbox testing differently&lt;/li&gt;
&lt;li&gt;Think holistically to avoid interdependent regressions&lt;/li&gt;
&lt;li&gt;Use LLMs for semantic judgment and integration evaluation&lt;/li&gt;
&lt;li&gt;Monitor and reduce the cost of test execution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At &lt;a href="https://www.gentoro.com/" rel="noopener noreferrer"&gt;Gentoro&lt;/a&gt;, we’re working across the entire MCP Tools lifecycle to help developers build robust, production-grade MCP tooling with built-in observability, success tracking, and cost-aware testing workflows. &lt;a href="https://playground.gentoro.com/signup" rel="noopener noreferrer"&gt;Give the Playground a try&lt;/a&gt; and let us know what you think! &lt;/p&gt;

</description>
      <category>mcp</category>
      <category>mcptools</category>
      <category>developertools</category>
      <category>regression</category>
    </item>
    <item>
      <title>How MCP Leverages OAuth 2.1 and RFC 9728 for Authorization</title>
      <dc:creator>Patrick Chan</dc:creator>
      <pubDate>Sat, 03 May 2025 00:53:41 +0000</pubDate>
      <link>https://dev.to/gentoro/how-mcp-leverages-oauth-21-and-rfc-9728-for-authorization-58dn</link>
      <guid>https://dev.to/gentoro/how-mcp-leverages-oauth-21-and-rfc-9728-for-authorization-58dn</guid>
      <description>&lt;p&gt;Authorization isn’t just a checkbox in agentic systems. It’s a make-or-break layer for security, scalability, and trust. Without a clear way to authenticate dynamically, AI agents either hit brittle, hardcoded APIs or introduce serious security gaps. That’s why authorization has been one of the biggest missing pieces in the Model Context Protocol (MCP).&lt;/p&gt;

&lt;p&gt;When MCP was first introduced, it focused on helping AI agents interact with APIs. But it left out that one critical capability: &lt;strong&gt;authorization&lt;/strong&gt;. For the past six months, the MCP team has been working to change that, building a secure, dynamic authorization layer that can keep up with the unpredictable nature of agentic systems.&lt;/p&gt;

&lt;p&gt;The new MCP authorization model is based on OAuth 2.1, as expected. But it also adds an important enhancement: support for &lt;a href="https://www.rfc-editor.org/rfc/rfc9728" rel="noopener noreferrer"&gt;RFC 9728&lt;/a&gt; (OAuth 2.0 Protected Resource Metadata), a long-overlooked spec that turns out to be essential for the way agentic AI systems actually behave.&lt;/p&gt;

&lt;h2&gt;
  
  
  Agentic Tools Are Not Like Regular Function Calls
&lt;/h2&gt;

&lt;p&gt;In traditional applications, developers directly program which APIs to call. This is known as &lt;strong&gt;direct invocation&lt;/strong&gt;. The software knows exactly which API endpoint it needs ahead of time, and the location of the authorization server can be manually configured.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.gentoro.com/blog/what-are-agentic-ai-tools" rel="noopener noreferrer"&gt;Agentic AI tools&lt;/a&gt; work differently. Instead of following fixed paths, agents are dynamically assigned MCP tools at runtime, and these tools interact with various APIs. At runtime, through &lt;strong&gt;inferred invocation&lt;/strong&gt;, agents determine which tools to use based on goals, user prompts, or evolving context, without knowing beforehand which APIs will be accessed.&lt;/p&gt;

&lt;p&gt;Because of this dynamic behavior, agents cannot predict in advance which APIs they will call or which authorization servers they will need. Hardcoding this information is impractical; agents must be able to discover authorization details automatically as they operate.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Need for Dynamic Authorization
&lt;/h2&gt;

&lt;p&gt;Most early MCP use cases assumed preconfigured trust between clients and tools, an unrealistic assumption in enterprise environments. As agentic architectures became more dynamic and open-ended, the need for runtime authorization discovery became clear.&lt;/p&gt;

&lt;p&gt;OAuth 2.1 provided the base, but it wasn’t enough on its own. Without a way for APIs to advertise how they expect to be accessed, agents still couldn’t operate autonomously. RFC 9728 addresses this need. Over eight years in the making, RFC 9728 was developed long before the rise of large language models and agentic AI. Initially seen as a useful enhancement for traditional apps, it has now become essential for enabling dynamic, AI-driven integrations.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Importance of RFC 9728
&lt;/h2&gt;

&lt;p&gt;RFC 9728 allows APIs to publish information about how they expect to be accessed securely. When an agent tries to call an API without a valid access token, the API sends back a response that includes a link to this information.&lt;/p&gt;

&lt;p&gt;From there, the agent can learn which authorization servers it should contact to get permission, what types of permissions it needs to ask for, and how to make sure the access token it gets will be accepted by the API. This dynamic process allows agents to connect to APIs safely and automatically, even when everything is happening on the fly.&lt;/p&gt;
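
&lt;p&gt;A rough sketch of those two discovery steps, assuming the header and metadata shapes defined by RFC 9728 (the &lt;code&gt;example.com&lt;/code&gt; endpoints are made up):&lt;/p&gt;

```python
import json
import re

def parse_resource_metadata_url(www_authenticate):
    # RFC 9728 lets a 401 challenge carry a resource_metadata parameter
    # pointing at the API's protected-resource metadata document.
    m = re.search(r'resource_metadata="([^"]+)"', www_authenticate)
    return m.group(1) if m else None

def pick_authorization_server(metadata_json):
    # The metadata lists which authorization servers the API trusts;
    # the agent contacts one of them to obtain an access token.
    meta = json.loads(metadata_json)
    return meta["authorization_servers"][0]
```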

&lt;h2&gt;
  
  
  MCP’s New Authorization Model
&lt;/h2&gt;

&lt;p&gt;The updated MCP specification builds on OAuth 2.1, strengthened by RFC 9728. This ensures that AI agents can discover trusted authorization servers, obtain access tokens securely, and interact with APIs they encounter through dynamically assigned tools.&lt;/p&gt;

&lt;p&gt;Without RFC 9728, MCP agents would require hardcoded server details, undermining the goal of dynamic, autonomous operation. With RFC 9728, MCP provides a modern, secure framework for agents to adapt and thrive.&lt;/p&gt;

&lt;h2&gt;
  
  
  A New Era for Authorization
&lt;/h2&gt;

&lt;p&gt;The finalization of RFC 9728 comes at a perfect moment. As agentic AI systems grow more complex, static assumptions about authorization no longer hold. Agents need the ability to discover how APIs expect to be accessed, and respond in real time.&lt;/p&gt;

&lt;p&gt;By adopting OAuth 2.1 alongside RFC 9728, MCP now provides a secure framework for this kind of dynamic behavior. In this new, interconnected landscape, AI agents can request tokens, discover auth servers, and interact with tools on the fly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters for Developers
&lt;/h2&gt;

&lt;p&gt;For developers, the integration challenge is real, especially when working with AI agents that need to access multiple APIs securely and autonomously. Every new API could mean a new token setup or a different auth config.&lt;/p&gt;

&lt;p&gt;With MCP’s updated authorization model, that friction disappears. Agents can now use standardized metadata to securely authenticate at runtime, without advance configuration. That means you can generate, deploy, and call tools without worrying about how your agents will handle auth across different systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s Next
&lt;/h2&gt;

&lt;p&gt;The new MCP authorization model is a major milestone, but it’s just the beginning. The next challenge is making this seamless for developers: turning RFC 9728 from spec into practice, and abstracting away the complexity so agents can authenticate tools without added friction. &lt;/p&gt;

&lt;p&gt;At Gentoro, we’re building toward that future. &lt;a href="https://www.gentoro.com/blog/turn-your-openapi-specs-into-agentic-tools-instantly" rel="noopener noreferrer"&gt;Our OpenAPI tool generation feature&lt;/a&gt; lets you turn any OpenAPI spec into a set of MCP-compliant agentic tools. If you’d like to try it in our playground, &lt;a href="https://beta.gentoro.com/signup" rel="noopener noreferrer"&gt;sign up for the beta&lt;/a&gt; or &lt;a href="https://www.gentoro.com/demo-request" rel="noopener noreferrer"&gt;request a full demo&lt;/a&gt;. We’d love to hear from you!&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Until now, authorization has been a missing piece for MCP&lt;/li&gt;
&lt;li&gt;OAuth 2.1 and RFC 9728 give MCP the dynamic authorization model agents need&lt;/li&gt;
&lt;li&gt;RFC 9728 lets APIs expose how they expect to be accessed, so agents can request tokens in context&lt;/li&gt;
&lt;li&gt;Hardcoding auth flows is no longer necessary&lt;/li&gt;
&lt;li&gt;RFC 9728 is key to enabling scalable, secure AI integrations and OpenAPI-based agentic tools&lt;/li&gt;
&lt;li&gt;Gentoro uses this model to support dynamic, composable tool workflows without vendor lock-in&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>mcp</category>
      <category>oauth</category>
      <category>dynamicauthorization</category>
      <category>rfc9728</category>
    </item>
    <item>
      <title>Turn Your OpenAPI Specs Into Agentic Tools—Instantly</title>
      <dc:creator>Patrick Chan</dc:creator>
      <pubDate>Fri, 25 Apr 2025 16:26:42 +0000</pubDate>
      <link>https://dev.to/gentoro/turn-your-openapi-specs-into-agentic-tools-instantly-27kc</link>
      <guid>https://dev.to/gentoro/turn-your-openapi-specs-into-agentic-tools-instantly-27kc</guid>
      <description>&lt;p&gt;What if your internal APIs could talk to AI agents without manual integration, schema rewrites, or hallucination-prone trial and error?&lt;/p&gt;

&lt;p&gt;Now they can.&lt;/p&gt;

&lt;p&gt;We’re excited to introduce a powerful new feature in &lt;a href="https://www.gentoro.com/demo-request" rel="noopener noreferrer"&gt;Gentoro&lt;/a&gt; that makes it easier than ever to connect your APIs to the world of intelligent agents and agentic workflows—without writing a single line of integration code. &lt;/p&gt;

&lt;p&gt;Widely used public service? Yes. Private internal API? Absolutely. With this release, Gentoro can now take &lt;strong&gt;any&lt;/strong&gt; OpenAPI specification and automatically generate a complete set of &lt;a href="https://www.gentoro.com/blog/introducing-model-context-protocol-mcp-support-for-gentoro" rel="noopener noreferrer"&gt;Model Context Protocol&lt;/a&gt; (MCP)-compliant tools based on that spec. The tools are optimized for LLMs and immediately usable by any agent or framework that supports MCP, including Claude Desktop, Cursor.ai, Goose, &lt;strong&gt;LangChain&lt;/strong&gt;, the &lt;strong&gt;OpenAI SDK&lt;/strong&gt;, and more. &lt;/p&gt;

&lt;h2&gt;
  
  
  From OpenAPI Spec to Agentic Tools, in Minutes
&lt;/h2&gt;

&lt;p&gt;Most APIs weren’t designed with LLMs or AI agents in mind. APIs often involve deeply nested structures, implicit assumptions, and machine-oriented schemas that language models struggle to interpret. &lt;/p&gt;

&lt;p&gt;Gentoro reshapes these APIs into forms that better align with how agents interpret and use tools. More specifically, Gentoro: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ingests, parses, and enhances your API endpoints&lt;/li&gt;
&lt;li&gt;Abstracts complex parameters and structures&lt;/li&gt;
&lt;li&gt;Auto-generates &lt;strong&gt;user-aligned agentic tools&lt;/strong&gt; that match how agents think&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These &lt;a href="http://www.gentoro.com/blog/what-are-agentic-ai-tools" rel="noopener noreferrer"&gt;agentic tools&lt;/a&gt; aren’t just raw endpoint wrappers. They’re ready-to-use, scalable, user-aligned connectors that allow AI agents to safely call your API. &lt;/p&gt;

&lt;h2&gt;
  
  
  Fewer Hallucinations, More Precision
&lt;/h2&gt;

&lt;p&gt;LLMs are sensitive to nuance. A poorly named parameter or an ambiguous description can lead to hallucinations, misuse, errors, or endless retries.&lt;/p&gt;

&lt;p&gt;That’s why Gentoro goes beyond basic OpenAPI fields, drawing on metadata, naming patterns, and natural language signals to produce clear, meaningful descriptions that help agents understand what each tool does and how to use it. Gentoro’s summarization pipeline generates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rich, plain-language descriptions&lt;/li&gt;
&lt;li&gt;Parameter clarifications&lt;/li&gt;
&lt;li&gt;Tool usage guidance based on context, not just syntax&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Structure and description quality reduce risk and &lt;strong&gt;directly improve agent performance&lt;/strong&gt;. In a recent benchmark, Gentoro’s &lt;strong&gt;user-aligned agent tools&lt;/strong&gt; executed 100% of prompts successfully, compared to 74% for optimized, API-aligned functions and just 33% with raw API mappings.&lt;/p&gt;

&lt;p&gt;These performance evaluation results, &lt;a href="https://medium.com/@patc888/api-vs-user-aligned-functions-performance-evaluation-9f4487ce035a" rel="noopener noreferrer"&gt;documented in detail here&lt;/a&gt;, highlight the real-world value of aligning tools to how LLMs reason—not just how APIs are structured.&lt;/p&gt;

&lt;h2&gt;
  
  
  First-Class Support for Internal, Proprietary APIs
&lt;/h2&gt;

&lt;p&gt;Unlike many agent tool platforms, Gentoro isn’t a cloud SaaS product. It runs inside your firewall. That means Gentoro lives where your systems already run, allowing you to generate tools from internal services, custom business logic, and proprietary APIs.&lt;/p&gt;

&lt;p&gt;This is huge. It means your tools stay private. Your data stays local. And your agents can finally integrate with the systems that actually matter to your business. &lt;/p&gt;

&lt;p&gt;The result? Deeper, more accurate integrations tailored to your real-world environment—not a watered-down simulation. No public proxy. No complex workarounds. With Gentoro, your agents can now work directly with the systems that matter most to your business.&lt;/p&gt;

&lt;h2&gt;
  
  
  Out-of-the-box MCP Compatibility
&lt;/h2&gt;

&lt;p&gt;All tools generated by Gentoro are instantly compatible with the Model Context Protocol (MCP), the standard for connecting generative AI agents to external tools and systems.&lt;/p&gt;

&lt;p&gt;MCP standardizes how AI models like Claude or GPT interact with tools, making it easier for agents to discover, understand, and use capabilities like “create ticket,” “fetch customer data,” or “post to Slack” without complex custom wiring.&lt;/p&gt;

&lt;p&gt;For example, an MCP-compatible agent can query Gentoro’s tool server, interpret available functionality, and immediately begin executing tasks using those tools—no manual configuration required.&lt;/p&gt;
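
&lt;p&gt;For a sense of what that discovery looks like on the wire, here is a sketch of an MCP &lt;code&gt;tools/list&lt;/code&gt; exchange (MCP uses JSON-RPC 2.0). The &lt;code&gt;create_ticket&lt;/code&gt; tool is a made-up example, and real MCP responses carry additional fields:&lt;/p&gt;

```python
import json

# The client asks the tool server what it can do...
request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

# ...and the server answers with self-describing tool definitions.
response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {"tools": [{
        "name": "create_ticket",
        "description": "Create a support ticket with a title and priority.",
        "inputSchema": {
            "type": "object",
            "properties": {"title": {"type": "string"},
                           "priority": {"type": "string"}},
        },
    }]},
}

def tool_names(resp):
    # From here, an agent can pick a tool and invoke it; no custom wiring.
    return [t["name"] for t in resp["result"]["tools"]]
```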

&lt;p&gt;That means Gentoro-generated tools can be used immediately by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Claude Desktop&lt;/strong&gt; – for local AI task automation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cursor.ai&lt;/strong&gt; – for in-IDE coding and navigation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Goose&lt;/strong&gt; – for embedding autonomous workflows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LangChain&lt;/strong&gt; – for structured multi-step agent chains&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI SDK&lt;/strong&gt; – for function-calling and tool use&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom-built MCP agents&lt;/strong&gt; – in enterprise or OSS environments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By supporting MCP, Gentoro gives you a &lt;strong&gt;clean separation between agent logic and tool logic&lt;/strong&gt;. You’re not locked into a single platform or vendor—and your tools remain interoperable across a growing ecosystem of LLM runtimes.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works: Gentoro in Action
&lt;/h2&gt;

&lt;p&gt;To begin, you create a new &lt;strong&gt;Toolbox&lt;/strong&gt; in Gentoro and select the AI model you want to use (OpenAI, Claude, or any other supported provider). Then, you choose the &lt;strong&gt;OpenAPI connector&lt;/strong&gt; and provide your API specification. This can be done by pasting a YAML or JSON file directly, or by linking to a hosted spec via URL.&lt;/p&gt;

&lt;p&gt;If your API requires authentication, Gentoro supports common methods like &lt;strong&gt;OAuth&lt;/strong&gt;, &lt;strong&gt;JWT&lt;/strong&gt;, and &lt;strong&gt;API Keys&lt;/strong&gt;, allowing you to securely define access.&lt;/p&gt;

&lt;p&gt;Once submitted, Gentoro &lt;strong&gt;parses the API definition&lt;/strong&gt;, enhances each endpoint with clear descriptions, and abstracts complex schemas into LLM-friendly formats. It then automatically generates a set of &lt;strong&gt;agentic tools&lt;/strong&gt;, each mapped to a specific API function.&lt;/p&gt;

&lt;p&gt;The tools are neatly organized by functional category, so it’s easy to find what you need. Each tool includes enhanced summaries, clarified parameters, and simplified input/output schemas to help LLMs reason more accurately.&lt;/p&gt;

&lt;p&gt;You simply select the tools your agent will use and click &lt;strong&gt;Import&lt;/strong&gt;. Gentoro then generates the underlying implementation code—secure, scalable, and ready to plug into your AI workflows. In just a few clicks, your internal API is transformed into an AI-accessible Toolbox, fully integrated with your agent ecosystem.&lt;/p&gt;
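
&lt;p&gt;To make the idea of reshaping concrete, here is a purely illustrative sketch of the kind of transformation involved. The operation, field names, and output shape are invented for this example and are not Gentoro’s actual format:&lt;/p&gt;

```python
# A terse, machine-oriented OpenAPI operation (invented example)...
openapi_op = {
    "operationId": "postV2TicketsCreate",
    "parameters": [{"name": "prio_lvl", "in": "query"}],
}

def to_agentic_tool(op):
    # ...reshaped into something an LLM can reason about: a plain name,
    # a rich description, and clarified parameter names.
    renames = {"prio_lvl": "priority"}
    return {
        "name": "create_ticket",
        "description": "Create a support ticket. Set priority to "
                       "low, medium, or high.",
        "parameters": [renames.get(p["name"], p["name"])
                       for p in op["parameters"]],
    }
```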

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fywxbjhon69eajxn5gekc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fywxbjhon69eajxn5gekc.png" alt="AI-generated chart showing the Gentoro workflow described above" width="800" height="1200"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 1: How Gentoro generates agentic tools from any OpenAPI spec (created by me with AI)&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Coming Soon: Tool Builder for Customization (No Code Required)
&lt;/h2&gt;

&lt;p&gt;Gentoro’s auto-generated tools are powerful by default, but we know one size doesn’t fit all. That’s why we’re launching &lt;strong&gt;Tool Builder&lt;/strong&gt;, a natural-language interface for customizing, composing, and reshaping tools to fit your workflows.&lt;/p&gt;

&lt;p&gt;With Tool Builder, you’ll be able to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Customize any generated tool (or create one from scratch) to match your unique workflows and domain needs&lt;/li&gt;
&lt;li&gt;Break tools down into modular components&lt;/li&gt;
&lt;li&gt;Combine multiple endpoints into a single composite tool&lt;/li&gt;
&lt;li&gt;Import and further refine third-party MCP tools&lt;/li&gt;
&lt;li&gt;Use simple, natural language to adjust parameters or reshape logic &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Whether you’re streamlining repetitive tasks or orchestrating complex workflows, Tool Builder puts intuitive, flexible control in your hands. No code. No manual scripting. Just powerful tooling, designed by you and shaped by your goals.&lt;/p&gt;

&lt;h2&gt;
  
  
  Gentoro Turns APIs Into Intelligent Interfaces
&lt;/h2&gt;

&lt;p&gt;Connecting LLMs to real systems isn’t just a technical challenge. In many cases, it’s the primary bottleneck to making AI actually useful in enterprise settings.&lt;/p&gt;

&lt;p&gt;You can’t automate a workflow if your agents can’t talk to your tools. You can’t reduce manual overhead if your APIs are too complex for an LLM to reason about. And you can’t scale AI adoption if every integration requires dev resources and a custom wrapper.&lt;/p&gt;

&lt;p&gt;With Gentoro, your internal APIs become intelligent agent interfaces in just a few clicks. No boilerplate, no guesswork… just Gentoro. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gentoro generates &lt;strong&gt;agent-ready tools&lt;/strong&gt; from any OpenAPI-compliant spec&lt;/li&gt;
&lt;li&gt;Zero-code setup! Just drop in a spec and go&lt;/li&gt;
&lt;li&gt;Tools are &lt;strong&gt;user-aligned&lt;/strong&gt; and optimized for LLM agent reasoning&lt;/li&gt;
&lt;li&gt;Enhanced descriptions reduce hallucinations and improve performance&lt;/li&gt;
&lt;li&gt;Gentoro runs &lt;strong&gt;inside your firewall&lt;/strong&gt; for secure, private API access&lt;/li&gt;
&lt;li&gt;Supports proprietary, internal systems, not just public APIs&lt;/li&gt;
&lt;li&gt;Tools are &lt;strong&gt;MCP-compatible&lt;/strong&gt;, instantly usable with Claude, Cursor, LangChain, OpenAI, Goose, and more&lt;/li&gt;
&lt;li&gt;Ideal for prototyping and scaling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool Builder&lt;/strong&gt; (coming soon) enables natural language customization, composition, and third-party tool integration—no code required&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;🎥 &lt;a href="https://youtu.be/66sfR_GJ_zM" rel="noopener noreferrer"&gt;Watch the video&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://beta.gentoro.com/signup" rel="noopener noreferrer"&gt;Sign up for the beta&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.gentoro.com/demo-request" rel="noopener noreferrer"&gt;Request a full demo&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>openapi</category>
      <category>llm</category>
      <category>mcp</category>
      <category>gentoro</category>
    </item>
    <item>
      <title>Navigating the Expanding Landscape of AI Applications</title>
      <dc:creator>Patrick Chan</dc:creator>
      <pubDate>Wed, 16 Apr 2025 15:50:02 +0000</pubDate>
      <link>https://dev.to/gentoro/navigating-the-expanding-landscape-of-ai-applications-mff</link>
      <guid>https://dev.to/gentoro/navigating-the-expanding-landscape-of-ai-applications-mff</guid>
      <description>&lt;h2&gt;
  
  
  From LLM Apps to Agents, Tools, and Interoperability
&lt;/h2&gt;

&lt;p&gt;The past two years in AI have felt like a sprint. We’ve seen powerful new capabilities come online—LLMs that can reason, tools that can be called programmatically, agents that can plan and act autonomously, and emerging protocols that make these pieces work together. But here’s the important part: &lt;strong&gt;This is not a linear evolution.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;LLM-based applications are not “turning into” agents. Tools aren’t replacing apps. Agents aren’t the inevitable next step for everyone. Instead, we’re witnessing the &lt;strong&gt;expansion of the AI architecture toolkit.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;And developers now have more choices than ever.&lt;/p&gt;

&lt;p&gt;Some applications will stay simple—just an LLM plugged into a chat interface. Others will become highly autonomous systems coordinating across dozens of tools. The question isn’t &lt;em&gt;what’s next&lt;/em&gt;, it’s &lt;em&gt;what fits&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;At &lt;a href="https://gentoro.com" rel="noopener noreferrer"&gt;Gentoro&lt;/a&gt;, we’re building infrastructure for this growing complexity. Our focus is on integration—connecting the reasoning capabilities of agents to the systems, APIs, and tools they need to get things done. We do this by standardizing what we call the &lt;strong&gt;&lt;a href="https://www.gentoro.com/blog/what-are-agentic-ai-tools" rel="noopener noreferrer"&gt;agentic tool&lt;/a&gt;&lt;/strong&gt;—a durable, reusable unit that plays well in both traditional LLM apps and next-gen agent ecosystems.&lt;/p&gt;

&lt;p&gt;This post explores the current AI application stack and the architectural choices developers have today. Let’s walk through how protocols like MCP and A2A (and platforms like Gentoro) are helping make sense of it all.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are LLM-based applications?
&lt;/h2&gt;

&lt;p&gt;LLM-based applications are still the most common—and in many cases, the most appropriate—way to use AI.&lt;/p&gt;

&lt;p&gt;These are conventional applications that include a large language model as a component. Think chatbots, Q&amp;amp;A systems, form summarizers, email generators, or retrieval-augmented search interfaces. The model provides language understanding and generation, but the &lt;strong&gt;flow of logic is fixed&lt;/strong&gt;—decided by developers.&lt;/p&gt;

&lt;p&gt;Frameworks like &lt;a href="https://www.gentoro.com/blog/deploying-production-support-ai-agent-langchain-gentoro" rel="noopener noreferrer"&gt;LangChain&lt;/a&gt; have become the go-to choice for building these kinds of systems. LangChain abstracts away prompt management, memory, retrieval, and tool use, letting developers quickly build smart, linear flows using reusable primitives.&lt;/p&gt;

&lt;p&gt;Today, these are often called &lt;strong&gt;agentic workflows&lt;/strong&gt;—even if they don’t involve a fully autonomous “agent.” And that’s okay. Not every AI system needs to plan, reason, or dynamically route actions. Sometimes, simpler is better.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Function Calling in AI?
&lt;/h2&gt;

&lt;p&gt;For systems that do require more flexibility, the next major capability was &lt;strong&gt;function calling&lt;/strong&gt;. First introduced by OpenAI in mid-2023 and rapidly adopted across the ecosystem, function calling allowed models to pick a tool based on the user’s intent.&lt;/p&gt;

&lt;p&gt;Rather than hardcoding every possible user path, developers could register a set of functions or APIs. The model would determine when a function was needed, fill in the parameters, and pass the result to the application for execution.&lt;/p&gt;
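&lt;p&gt;A minimal sketch of the application side of that loop, in Python. The tool, its schema, and the sample tool call are all hypothetical; the schema follows the JSON-schema style used by OpenAI-style function calling:&lt;/p&gt;

```python
import json

# The schema registered with the model. The model sees this description,
# not the Python body below, and uses it to decide when to call the tool.
WEATHER_TOOL = {
    "name": "get_weather",
    "description": "Get the current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

def get_weather(city):
    # Stub: a real tool would call a weather API here.
    return {"city": city, "forecast": "sunny"}

REGISTRY = {"get_weather": get_weather}

def execute_tool_call(tool_call):
    # The model has chosen a function and filled in its arguments as JSON;
    # the application looks the function up and executes it.
    fn = REGISTRY[tool_call["name"]]
    args = json.loads(tool_call["arguments"])
    return fn(**args)

# A tool call as the model might emit it for "Will it rain in Paris?"
print(execute_tool_call({"name": "get_weather", "arguments": '{"city": "Paris"}'}))
```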

&lt;p&gt;This pattern enabled more dynamic behaviors. Applications could now &lt;strong&gt;delegate decisions&lt;/strong&gt; to the model: when to fetch weather data, when to generate a calendar invite, when to submit a form.&lt;/p&gt;

&lt;p&gt;It laid the foundation for &lt;strong&gt;agentic systems&lt;/strong&gt;, where the LLM isn’t just responding—it’s acting.&lt;/p&gt;

&lt;h2&gt;
  
  
  MCP: Standardizing Tool Calling (not Function Calling)
&lt;/h2&gt;

&lt;p&gt;Function calling solved the problem of choosing what to do. But it didn’t solve how to actually &lt;strong&gt;execute&lt;/strong&gt; tools in a standardized way—especially across teams, applications, and enterprises.&lt;/p&gt;

&lt;p&gt;Every developer was forced to implement their own tool-calling layer: HTTP clients, error handling, authentication, retries, output formats. Worse, there was no way to share tools between systems, even if they solved the same problem. The model knew what to do—but developers still had to reinvent how to do it, every time.&lt;/p&gt;

&lt;p&gt;That’s where &lt;strong&gt;MCP (Model Context Protocol)&lt;/strong&gt; comes in.&lt;/p&gt;

&lt;p&gt;Introduced by Anthropic in 2024, MCP doesn’t unify function calling formats (those remain vendor-specific, and frameworks like LangChain already abstract them). Instead, &lt;strong&gt;MCP standardizes how tools are described and called&lt;/strong&gt;—defining a shared protocol for execution, independent of model or vendor.&lt;/p&gt;
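&lt;p&gt;Concretely, an MCP server advertises each tool as a name, a natural-language description, and a JSON Schema for its inputs, and any compliant client can invoke it through the same JSON-RPC call. A sketch of both halves, with a hypothetical &lt;code&gt;create_invoice&lt;/code&gt; tool (field names follow the MCP specification):&lt;/p&gt;

```python
# How a tool is advertised by an MCP server (the shape returned by the
# protocol's tools/list method). The tool itself is hypothetical.
TOOL_DESCRIPTION = {
    "name": "create_invoice",
    "description": "Create a draft invoice for a customer in the billing system",
    "inputSchema": {
        "type": "object",
        "properties": {
            "customer_id": {"type": "string"},
            "amount_cents": {"type": "integer"},
        },
        "required": ["customer_id", "amount_cents"],
    },
}

# The corresponding invocation, expressed as a JSON-RPC request: any
# MCP-compliant client can call any MCP-compliant server this way.
call_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "create_invoice",
        "arguments": {"customer_id": "cust_42", "amount_cents": 9900},
    },
}
```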

&lt;p&gt;This unlocked something powerful: the ability to &lt;strong&gt;separate agent builders from tool builders&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Agent builders now focus on reasoning, planning, and choosing actions. Tool builders focus on secure integrations, compliance, and system execution. And tools—once hard to share—can now be packaged, reused, and invoked reliably across agents, apps, and workflows.&lt;/p&gt;

&lt;p&gt;At Gentoro, this is core to our work. We help organizations define and manage agentic tools—tools that comply with MCP, can be discovered and composed, and can serve both traditional applications and autonomous agents.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are Agentic Tools? Gentoro’s unit of integration
&lt;/h2&gt;

&lt;p&gt;At Gentoro, we believe the &lt;strong&gt;agentic tool&lt;/strong&gt; is the stable unit of integration in this new world.&lt;/p&gt;

&lt;p&gt;It’s reusable. It’s composable. It can be invoked by a simple LLM-based app—or by a chain of agents executing long-running workflows.&lt;/p&gt;

&lt;p&gt;This flexibility matters. The AI stack is still evolving, but tools are not going away. In fact, they’re becoming more important. That’s why Gentoro focuses on tools as first-class objects—not just one-off API calls.&lt;/p&gt;

&lt;p&gt;And unlike most platforms, &lt;strong&gt;Gentoro supports tool composability&lt;/strong&gt;. Sometimes, you don’t want to rebuild a tool—even if you could. Maybe it’s owned by a third party. Maybe it’s tested and trusted. Or maybe you just don’t have the time.&lt;/p&gt;

&lt;p&gt;With Gentoro, developers can compose tools into higher-level capabilities, chain tools across teams, and orchestrate behavior without tightly coupling their systems. This is crucial for scaling responsibly and maintaining flexibility over time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding A2A: enabling Agent-to-Agent interoperability
&lt;/h2&gt;

&lt;p&gt;As systems grow more complex, we’re starting to see the rise of multi-agent systems—where different agents specialize in different roles and collaborate to complete workflows. This could mean sourcing job candidates, scheduling interviews, processing approvals, or managing inventory—each handled by a separate, specialized agent.&lt;/p&gt;

&lt;p&gt;To make this work, agents need to talk to each other.&lt;/p&gt;

&lt;p&gt;This is where the &lt;strong&gt;A2A protocol&lt;/strong&gt; (Agent-to-Agent) comes in. Developed by Google Cloud in partnership with other industry leaders, A2A provides a standard for how agents discover each other, share tasks, and manage collaboration across frameworks and vendors.&lt;/p&gt;

&lt;p&gt;It supports long-running workflows, capability negotiation, and even UI modality adaptation (like embedding videos or forms). It’s a huge step forward in agentic system design.&lt;/p&gt;

&lt;p&gt;But here’s something interesting: &lt;strong&gt;sometimes, an agent is a tool&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Gentoro recognizes this hybrid reality. Our platform is being designed to let agents call not just tools, but other agents, treating them as composable, callable units. This will be critical as systems get more modular, and as agents specialize further.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Gentoro fits into the expanding AI stack
&lt;/h2&gt;

&lt;p&gt;At Gentoro, we’re not just building for one approach to AI. We’re building for &lt;strong&gt;all the approaches that coexist&lt;/strong&gt; in today’s hybrid stack.&lt;/p&gt;

&lt;p&gt;Some teams will continue to build LLM-based applications with fixed flows. Others will explore fully autonomous agents. Many will land somewhere in between—leveraging tools, workflows, and planning in varying degrees.&lt;/p&gt;

&lt;p&gt;Across all of these patterns, we believe one thing stays constant: the value of the &lt;strong&gt;agentic tool&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Gentoro is built to help developers define, manage, and integrate these tools—securely, flexibly, and at scale. Whether you’re working with LangChain, building agents with open protocols like MCP and A2A, or running a mix of all of the above, Gentoro ensures your systems can connect, collaborate, and grow.&lt;/p&gt;

&lt;p&gt;We’re not trying to define what the future must look like. We’re helping build the infrastructure for &lt;strong&gt;whatever it becomes&lt;/strong&gt;. If you’d like to learn more, &lt;a href="https://www.gentoro.com/demo-request" rel="noopener noreferrer"&gt;get in touch with us&lt;/a&gt; today! &lt;/p&gt;

</description>
      <category>llm</category>
      <category>agentictools</category>
      <category>genai</category>
      <category>tooling</category>
    </item>
    <item>
      <title>What Are Agentic AI Tools?</title>
      <dc:creator>Patrick Chan</dc:creator>
      <pubDate>Wed, 26 Feb 2025 16:21:00 +0000</pubDate>
      <link>https://dev.to/gentoro/what-are-agentic-ai-tools-e1m</link>
      <guid>https://dev.to/gentoro/what-are-agentic-ai-tools-e1m</guid>
      <description>&lt;p&gt;A new term has been gaining traction in AI discussions: &lt;strong&gt;agentic tools.&lt;/strong&gt; These tools allow AI agents — such as those powered by large language models (LLMs) — to interact with the real world. While LLMs are remarkably skilled at generating text and responding to complex prompts, they operate within the confines of their training data. They don’t have built-in access to live information or the ability to take real-world actions. They can’t check the weather, send a message, or update a database.&lt;/p&gt;

&lt;p&gt;This is where &lt;a href="https://www.gentoro.com/blog/introducing-model-context-protocol-mcp-support-for-gentoro" rel="noopener noreferrer"&gt;Model Context Protocol (MCP)&lt;/a&gt; has started drawing attention. MCP aims to provide a structured way for AI systems to connect with external tools, bridging the gap between language models and real-world functionality. Its role will be discussed later, but first, it is important to understand what an agentic tool actually is.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Do AI Agents Need Tools?
&lt;/h3&gt;

&lt;p&gt;At first glance, LLMs seem impressively capable. They can draft emails, summarize reports, and even engage in deep philosophical discussions. But outside of pure text generation, they quickly hit a wall. They don’t know what time it is. They can’t send an alert. They can’t update a to-do list or place an order.&lt;/p&gt;

&lt;p&gt;Some might wonder — why not just train the model with all the information it needs? The problem is that knowledge embedded in training is static. Even if a model were trained to always know the current time, it would be outdated a zeptosecond later. The same applies to any dynamic information — prices change, appointments get rescheduled, messages need to be sent. No amount of training can make an AI model inherently aware of, or capable of acting on, real-time events.&lt;/p&gt;

&lt;p&gt;The ability to retrieve live data, such as checking the latest stock prices, and take action, such as sending a notification, happens outside the training process. Agentic tools provide this missing link, allowing AI agents to interface with the world in real-time rather than just reasoning about it in isolation.&lt;/p&gt;

&lt;p&gt;For an AI agent to perform a certain task, the necessary agentic tools must exist first. If an AI assistant needs to book a meeting, there must be a tool that can interact with a calendar system. If it needs to process a payment, there must be a tool capable of securely handling transactions. Before an agent can act, the right tools must be developed and made available to it.&lt;/p&gt;

&lt;p&gt;With this foundation in place, the next step is understanding how an agent determines when and how to use these tools.&lt;/p&gt;

&lt;h3&gt;
  
  
  Explicit vs. Inferred Invocation
&lt;/h3&gt;

&lt;p&gt;The key to understanding agentic tools is recognizing how they are invoked. Traditional software systems rely on explicit invocation, while AI-powered agents rely on inferred invocation.&lt;/p&gt;

&lt;h4&gt;
  
  
  Explicit Invocation (Direct and Controlled)
&lt;/h4&gt;

&lt;p&gt;In most software, a developer explicitly calls a procedure with a specific name and precise parameters. The procedure always does the same job in a predictable way. The computer doesn’t decide whether or not to use it; the developer makes that decision ahead of time.&lt;/p&gt;

&lt;h4&gt;
  
  
  Inferred Invocation (AI Decides When to Act)
&lt;/h4&gt;

&lt;p&gt;An AI agent, on the other hand, infers when it should use a tool based on the user’s request. It reads your natural-language input (“What’s Bob’s phone number?”), recognizes that retrieving contact info is required, and infers that the “find contact info” tool is the right one to use. No human developer has hand-coded that choice at the moment of your request — the AI makes the decision on its own.&lt;/p&gt;

&lt;p&gt;Inferred invocation is what makes agentic tools powerful, but it also introduces new challenges that don’t exist in traditional software.&lt;/p&gt;
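&lt;p&gt;A toy illustration of the difference: below, simple keyword overlap against each tool’s description stands in for the LLM’s reasoning, and the tool names are hypothetical. The point is that no code path hardwires the request to the tool; the match is inferred at runtime from the tools’ descriptions.&lt;/p&gt;

```python
# Toy stand-in for inferred invocation. A real agent asks an LLM to match
# user intent to a tool; here, keyword overlap with each tool's
# description plays that role so the mechanism is visible.
TOOLS = {
    "find_contact_info": "look up a person's phone number or email address",
    "get_upcoming_events": "list meetings and appointments on the calendar",
}

def infer_tool(request):
    words = set(request.lower().replace("?", "").split())
    def overlap(item):
        name, description = item
        return len(words.intersection(description.split()))
    best_name, _ = max(TOOLS.items(), key=overlap)
    return best_name

print(infer_tool("What's Bob's phone number?"))  # prints "find_contact_info"
```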

&lt;h3&gt;
  
  
  Challenges of Inferred Invocation
&lt;/h3&gt;

&lt;p&gt;Inferred invocation allows AI agents to decide autonomously when and how to use tools, making them flexible and adaptive. However, this flexibility introduces significant challenges that traditional, explicitly invoked software does not face. The most fundamental issue is that existing procedures cannot simply be turned into tools. Procedures are designed for structured code execution, while tools must be designed for AI agents that interpret natural language, infer intent, and make autonomous decisions. This difference means that an additional layer must be built — one that reshapes traditional functionality into something AI can effectively understand and invoke. Without this, the entire process of inferred invocation would break down.&lt;/p&gt;

&lt;p&gt;These challenges primarily involve tool design, language interpretation, and scalability, all of which must be carefully managed to ensure AI agents make the right decisions.&lt;/p&gt;

&lt;h4&gt;
  
  
  Tools vs. Traditional Procedures: Why Documentation Matters More
&lt;/h4&gt;

&lt;p&gt;In traditional software, procedures are designed for code, not humans. A procedure can have dozens of parameters with complex structures, and as long as developers call it correctly, the system remains stable. Even if the procedure has no documentation, the software will still execute it as expected.&lt;/p&gt;

&lt;p&gt;Agentic tools flip this relationship. An AI agent decides whether and how to call a tool based entirely on the tool’s name, description, and parameter documentation, so documentation is no longer an afterthought; it is the interface itself. A tool with a vague or missing description is, to the agent, effectively unusable.&lt;/p&gt;

&lt;p&gt;Moreover, while traditional procedures can have many parameters without issue, agentic tools must be designed for simplicity. If a tool has too many parameters or overly complex inputs, the AI may struggle to infer the correct way to use it. Unlike explicit procedure calls, where parameters are manually defined in code, AI agents rely on reasoning to supply the right inputs. The more parameters a tool has, the higher the chance of confusion.&lt;/p&gt;

&lt;h4&gt;
  
  
  Bridging the Naming Gap
&lt;/h4&gt;

&lt;p&gt;Another major challenge is the disconnect between how tools are named and how users describe their needs. Traditional software procedures have precise, predefined names, and developers explicitly call them when needed. AI agents, however, must interpret human language and match it to the appropriate tool.&lt;/p&gt;

&lt;p&gt;For example, a backend system might have a procedure called &lt;code&gt;getUpcomingEvents()&lt;/code&gt;, but users will phrase their requests differently — &lt;em&gt;“What’s on my schedule?”&lt;/em&gt;, &lt;em&gt;“Do I have any meetings today?”&lt;/em&gt;, or &lt;em&gt;“Show my next appointment.”&lt;/em&gt; Unlike conventional software, where developers map specific inputs to predefined procedures, AI agents must infer the connection between human intent and system logic.&lt;/p&gt;

&lt;p&gt;This mismatch extends beyond simple phrasing differences. Backend systems often use technical or generic names that don’t align with user expectations. A hotel booking system might refer to “reserving a room” as “resource allocation,” or a payment system might process a refund under “reverse transaction” rather than simply “issue refund.” If an agentic tool is not named intuitively, AI agents may struggle to use it correctly.&lt;/p&gt;

&lt;p&gt;To &lt;a href="https://www.gentoro.com/blog/building-bridges-connecting-llms-with-enterprise-systems" rel="noopener noreferrer"&gt;bridge this gap&lt;/a&gt;, tool design must prioritize clear, user-friendly naming that aligns with how people naturally phrase requests. Otherwise, AI agents may misinterpret the available tools, leading to errors or inefficient responses.&lt;/p&gt;

&lt;h4&gt;
  
  
  Scalability: Why More Tools Can Lead to Worse Performance
&lt;/h4&gt;

&lt;p&gt;In traditional software, scalability is straightforward — new procedures can be added indefinitely without disrupting existing functionality. Since every procedure is explicitly called when needed, adding more does not create ambiguity.&lt;/p&gt;

&lt;p&gt;With agentic tools, however, scalability is much more delicate. AI agents must infer which tool to use based on user input, meaning the more tools they have access to, the harder it becomes to make the correct choice. If multiple tools have overlapping capabilities, the AI may pick the wrong one or hesitate between options.&lt;/p&gt;

&lt;p&gt;For example, if an AI agent has both a “find contact info” tool and a “retrieve customer details” tool, it might struggle to determine which one to use when asked, “What’s Bob’s phone number?” Unlike traditional systems, where each procedure is explicitly invoked, inferred invocation introduces decision complexity as the toolset grows.&lt;/p&gt;

&lt;p&gt;To prevent this, agentic toolsets must be carefully curated and structured. Rather than adding tools indiscriminately, developers must prioritize clarity, minimize redundancy, and consolidate similar functions. For example, instead of separate tools for “add item to cart,” “remove item from cart,” and “check cart total,” a single “manage shopping cart” tool could handle all three actions dynamically.&lt;/p&gt;
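&lt;p&gt;A sketch of that consolidation in Python; the tool name, parameters, and behavior are illustrative, not a real API:&lt;/p&gt;

```python
# One tool instead of three: an "action" parameter selects the operation,
# so the model chooses among fewer, better-separated tools.
def manage_shopping_cart(cart, action, item=None):
    if action == "add":
        cart.append(item)
    elif action == "remove":
        cart.remove(item)
    elif action == "total":
        return sum(price for _, price in cart)
    else:
        raise ValueError(f"unknown action: {action}")
    return cart

cart = []
manage_shopping_cart(cart, "add", ("coffee", 450))
manage_shopping_cart(cart, "add", ("mug", 1200))
manage_shopping_cart(cart, "remove", ("coffee", 450))
print(manage_shopping_cart(cart, "total"))  # prints 1200
```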

&lt;h3&gt;
  
  
  The NxM Problem (and How MCP Solves It)
&lt;/h3&gt;

&lt;p&gt;As AI becomes more widespread, we’re seeing an explosion of different agents (chatbots, virtual assistants, recommendation engines) and a whole variety of tools (calendars, databases, analytics apps). Without a standard, each agent would need a custom way to talk to each tool — leading to a tangled web of integrations. If you have N agents and M tools, you end up with N x M unique connections to build and maintain.&lt;/p&gt;

&lt;p&gt;MCP provides a common language and structure so that any agent can use any tool following the same rules. This drastically reduces that messy NxM problem into something far simpler. Instead of building a separate connector from every agent to every tool, you build just one connector to MCP on either side. Agents know how to call any MCP-compliant tool, and tools know how to respond to any MCP-compliant agent.&lt;/p&gt;
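&lt;p&gt;The arithmetic makes the payoff concrete. With, say, 10 agents and 20 tools:&lt;/p&gt;

```python
# The integration arithmetic behind the NxM problem: point-to-point
# integrations grow multiplicatively, while a shared protocol grows
# additively (one MCP adapter per agent, one per tool).
agents = 10
tools = 20

point_to_point = agents * tools  # every agent wired to every tool: 200
via_mcp = agents + tools         # one adapter on each side: 30

print(point_to_point, via_mcp)
```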

&lt;h3&gt;
  
  
  Pulling It All Together
&lt;/h3&gt;

&lt;p&gt;Agentic tools are a game-changer because they let AI systems act on the world around them, rather than being locked up in their own training data. The AI relies on inferred invocation — figuring out which tool to use based on what the user wants — while traditional software is built on explicit invocation, with a developer deciding every step in advance.&lt;/p&gt;

&lt;p&gt;By structuring these tools well, AI can interact with real-world applications smoothly. While inferred invocation introduces complexity, it also unlocks &lt;strong&gt;AI that can act rather than just respond&lt;/strong&gt; — creating new opportunities for businesses and users alike.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>mcp</category>
      <category>tooling</category>
    </item>
    <item>
      <title>LangChain: From Chains to Threads</title>
      <dc:creator>Patrick Chan</dc:creator>
      <pubDate>Tue, 25 Feb 2025 16:21:00 +0000</pubDate>
      <link>https://dev.to/gentoro/langchain-from-chains-to-threads-534l</link>
      <guid>https://dev.to/gentoro/langchain-from-chains-to-threads-534l</guid>
      <description>&lt;p&gt;&lt;a href="https://www.langchain.com/" rel="noopener noreferrer"&gt;LangChain&lt;/a&gt; is one of the most exciting frameworks to emerge in AI development. It took what was once a scattered, low-level process — managing LLM calls, chaining outputs, handling retrieval — and wrapped it in a structured, developer-friendly abstraction. It made things easier. Before LangChain, you had to stitch everything together manually, from memory management to prompt templating. Now, with a few lines of code, you could assemble an LLM-powered pipeline in a fraction of the time.&lt;/p&gt;

&lt;p&gt;But as AI applications are evolving beyond simple workflows into fully interactive, real-time applications, cracks are starting to show. LangChain works great when you’re treating an LLM like a function in a structured data flow, but things start to break down when you try to use it for complex, stateful, interactive agents. Developers are running into issues with debugging, performance, and flexibility. Some feel it abstracts too much, taking away control, while others find that it doesn’t abstract enough, requiring manual glue code for critical features like prompt engineering and tool integration.&lt;/p&gt;

&lt;p&gt;Most of these pain points — performance optimizations, lacking documentation, debugging improvements, API flexibility — will be addressed over time. But there’s a deeper issue, one that may be harder to fix.&lt;/p&gt;

&lt;h3&gt;
  
  
  LangChain Was Forged in the Data Pipeline World
&lt;/h3&gt;

&lt;p&gt;LangChain wasn’t designed in isolation — it was built in the data pipeline world, where every data engineer’s tool of choice was &lt;a href="https://jupyter.org/" rel="noopener noreferrer"&gt;Jupyter Notebooks&lt;/a&gt;. Jupyter was an innovative tool, making pipeline programming easy to experiment with, iterate on, and debug. It was a perfect fit for machine learning workflows, where you preprocess data, train models, analyze outputs, and fine-tune parameters — all in a structured, step-by-step fashion.&lt;/p&gt;

&lt;p&gt;When LLMs arrived, they were slotted into this existing pipeline mindset. At first, they were treated just like any other step in an ML workflow. You pass some text into an LLM, get an output, maybe filter or analyze it, and move on to the next step. The integrations were built fast. Now, with LangChain, you could call an LLM inside your Jupyter notebook and easily chain together multiple LLM-powered steps. And at first, it was amazing.&lt;/p&gt;

&lt;p&gt;But LLMs aren’t just another step in a pipeline. They aren’t image classifiers or search indexes. They are interactive reasoning engines, capable of engaging in open-ended conversations, dynamically adapting to inputs, and maintaining context over long interactions. And that’s where the chain model starts to feel limiting.&lt;/p&gt;

&lt;h3&gt;
  
  
  Applications Are Not Built as Chains
&lt;/h3&gt;

&lt;p&gt;Real-world applications don’t follow a simple chain of execution. They aren’t even structured as directed acyclic graphs. Instead, they are event-driven systems, where multiple subsystems interact concurrently and respond dynamically to user input.&lt;/p&gt;

&lt;p&gt;Modern applications — whether it’s an AI-powered chatbot, a search assistant, or a customer service agent — aren’t built like an ML pipeline. Instead, they’re structured around threads, stateful interactions, and concurrent processes. A request comes in, background tasks execute, a database is queried, and new information is surfaced dynamically. These are complex, multi-layered architectures where LLMs are just one component of a larger system.&lt;/p&gt;

&lt;p&gt;LangChain’s chain-based abstraction works well for structured LLM workflows, like document processing or batch summarization. But as soon as an agent needs to track context, handle multiple concurrent tasks, or adapt dynamically to changing inputs, the model becomes a constraint rather than an enabler.&lt;/p&gt;

&lt;p&gt;This is why we’re already seeing developers look for alternatives. &lt;a href="https://block.github.io/goose/" rel="noopener noreferrer"&gt;Goose&lt;/a&gt;, a new agent framework, wasn’t built in LangChain — it was written in &lt;a href="https://www.rust-lang.org/" rel="noopener noreferrer"&gt;Rust&lt;/a&gt;, optimized for speed and scalability, with a focus on real-time AI applications. The fact that developers are already reaching for lower-level, more application-friendly architectures suggests that the limitations of LangChain’s chain model are becoming more apparent.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Would an AI Application Framework Look Like?
&lt;/h3&gt;

&lt;p&gt;I’m not sure. AI development is moving so fast that anything we define today might be obsolete in a few months. But if we step back and think about the challenges developers face when building AI-powered applications, a few key ideas stand out. These aren’t definitive answers, but they point toward the kinds of abstractions that might be needed to move beyond the limitations of chain-based frameworks.&lt;/p&gt;

&lt;h4&gt;
  
  
  LLM-Managed Prompt Optimization
&lt;/h4&gt;

&lt;p&gt;Right now, most AI applications rely on handwritten, static prompt templates, but this approach is fragile. LLMs are absurdly sensitive to minor variations in phrasing, word order, and formatting, often producing wildly different outputs for nearly identical instructions. Developers spend an unreasonable amount of time tweaking prompts manually, even though frameworks like &lt;a href="https://dspy.ai/" rel="noopener noreferrer"&gt;DSPy&lt;/a&gt; already exist that optimize prompts programmatically, iteratively refining a model’s instructions against a task metric. A modern AI framework should treat prompts as dynamic artifacts, allowing LLMs to refine their own instructions over time instead of requiring manual intervention.&lt;/p&gt;
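&lt;p&gt;The core loop is simple to sketch, even if production systems are far more sophisticated. Here both the candidate prompts and the scoring metric are stubs; in practice an LLM proposes candidates and a task metric grades real outputs:&lt;/p&gt;

```python
# A deliberately tiny version of prompt-as-artifact optimization:
# generate candidate prompts, score each, keep the best.
CANDIDATES = [
    "Summarize:",
    "Summarize the text in one sentence.",
    "Summarize the text in one sentence, plainly, for a general reader.",
]

def metric(prompt):
    # Stub for "run the task and grade the output"; here we simply reward
    # more specific instructions.
    return len(prompt.split())

def optimize(candidates):
    return max(candidates, key=metric)

best = optimize(CANDIDATES)
print(best)  # prints the third, most specific candidate
```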

&lt;h4&gt;
  
  
  Seamless LLM-Agnostic Integration
&lt;/h4&gt;

&lt;p&gt;A proper AI framework should be model-agnostic, seamlessly supporting OpenAI, Anthropic, &lt;a href="https://mistral.ai/" rel="noopener noreferrer"&gt;Mistral&lt;/a&gt;, and fine-tuned proprietary models without major architectural changes. The AI ecosystem is evolving too fast for developers to lock themselves into a single provider, and switching between models should require minimal code changes. APIs should be abstracted in a way that makes model selection flexible, allowing applications to test multiple providers and dynamically switch based on cost, latency, or accuracy.&lt;/p&gt;
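&lt;p&gt;A minimal sketch of such an abstraction layer. The provider classes are stubs standing in for real vendor clients; the point is that switching vendors is a configuration change, not an architectural one:&lt;/p&gt;

```python
# The application codes against one interface; providers are swappable
# behind it. Real implementations would wrap the vendors' SDKs.
class OpenAIProvider:
    def complete(self, prompt):
        return f"[openai] {prompt}"

class AnthropicProvider:
    def complete(self, prompt):
        return f"[anthropic] {prompt}"

PROVIDERS = {"openai": OpenAIProvider, "anthropic": AnthropicProvider}

def get_model(name):
    # Model selection by name, e.g. from a config file or an A/B test.
    return PROVIDERS[name]()

model = get_model("anthropic")
print(model.complete("Hello"))  # prints "[anthropic] Hello"
```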

&lt;h4&gt;
  
  
  Built-In Auditing and Observability
&lt;/h4&gt;

&lt;p&gt;AI applications are inherently unpredictable, and developers need deep visibility into how models behave in production. &lt;a href="https://www.langchain.com/langsmith" rel="noopener noreferrer"&gt;LangSmith&lt;/a&gt; already does a great job of providing structured logging, tracing, and analytics for LLM interactions, enabling developers to inspect failure patterns and debug issues effectively. Any AI framework should offer built-in observability tools, capturing input-output mappings, latency metrics, and contextual reasoning logs to ensure model performance can be systematically improved.&lt;/p&gt;

&lt;h4&gt;
  
  
  Robust Tool and API Integration
&lt;/h4&gt;

&lt;p&gt;LLMs are not powerful in isolation — their real value comes from invoking external tools, searching databases, calling APIs, and interfacing with structured knowledge sources. A robust AI framework should provide first-class support for tool integrations, making it easy to define, test, and validate API interactions. Implicit function bindings should be introspectable and debuggable, ensuring that AI agents reliably interact with external systems without requiring excessive boilerplate.&lt;/p&gt;

&lt;h4&gt;
  
  
  Streaming Output as a First-Class Feature
&lt;/h4&gt;

&lt;p&gt;Traditional application frameworks assume a request-response cycle where a function is called, a JSON response is returned, and processing continues. AI applications don’t always work this way — LLMs frequently stream responses token by token, requiring a framework to handle incremental processing rather than blocking execution until the full response is available. Streaming is particularly important for chat interfaces, multi-agent interactions, and real-time summarization tasks, where downstream processes need access to partial results before the full response has completed. Any AI framework should make streaming a core abstraction, ensuring that applications can handle real-time token-by-token processing naturally.&lt;/p&gt;
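&lt;p&gt;A sketch of the idea with Python generators, using a stubbed token source in place of a real streaming LLM response:&lt;/p&gt;

```python
# Streaming as a core abstraction: tokens arrive one at a time, and the
# consumer can act on partial output instead of blocking on the full
# response.
def fake_stream():
    for token in ["The ", "answer ", "is ", "42."]:
        yield token

def consume(stream):
    partial = ""
    for token in stream:
        partial += token
        # A real chat UI would re-render `partial` here on every token.
    return partial

print(consume(fake_stream()))  # prints "The answer is 42."
```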

&lt;h4&gt;
  
  
  Intelligent Retry Mechanisms
&lt;/h4&gt;

&lt;p&gt;Unlike traditional API calls, where retry logic is simple (e.g., if the request fails, retry with the same input), LLM calls are far more complex to retry. Failures in AI applications often result in partially incorrect or low-confidence responses, meaning a simple retry isn’t enough. Instead, applications may need to evaluate the output before deciding to retry, sometimes requiring another LLM to assess whether the result meets certain quality thresholds. A robust AI framework should provide built-in support for failure detection, adaptive retry logic, and self-correcting mechanisms, allowing models to recognize and recover from mistakes dynamically.&lt;/p&gt;
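&lt;p&gt;A sketch of that evaluate-then-retry loop, with a stubbed model and a stubbed evaluator (in practice the evaluator might itself be an LLM judging the output against quality thresholds):&lt;/p&gt;

```python
# Retry driven by output quality, not just transport errors: each
# response is scored by an evaluator before being accepted.
def flaky_model(prompt, attempt):
    # Stub: a weak answer on the first attempt, a better one after.
    return "unsure" if attempt == 0 else f"confident answer to: {prompt}"

def acceptable(output):
    # Stub evaluator; a real one might call a second LLM as a judge.
    return "unsure" not in output

def call_with_retries(prompt, max_attempts=3):
    for attempt in range(max_attempts):
        output = flaky_model(prompt, attempt)
        if acceptable(output):
            return output
    raise RuntimeError("no acceptable output after retries")

print(call_with_retries("What is MCP?"))
```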

&lt;h3&gt;
  
  
  The Real Question: Who Evolves First?
&lt;/h3&gt;

&lt;p&gt;LangChain is at a crossroads. Either it evolves into a true AI application framework, or existing application frameworks will integrate LLM abstractions and render it unnecessary. The next phase of AI development isn’t about chains — it’s about event-driven, stateful, and streaming-capable architectures.&lt;/p&gt;

&lt;p&gt;And the real question is: Who will get there first?&lt;/p&gt;

</description>
      <category>langchain</category>
      <category>llm</category>
      <category>observability</category>
      <category>api</category>
    </item>
    <item>
      <title>Vibe Coding: The New Way We Create and Interact with Technology</title>
      <dc:creator>Patrick Chan</dc:creator>
      <pubDate>Wed, 19 Feb 2025 20:48:56 +0000</pubDate>
      <link>https://dev.to/gentoro/directive-programming-a-new-paradigm-for-genai-2kgd</link>
      <guid>https://dev.to/gentoro/directive-programming-a-new-paradigm-for-genai-2kgd</guid>
      <description>&lt;p&gt;Software development has always been about one thing: enabling humans to communicate their intent to machines. Over time, we’ve seen major shifts in how this is done, each one making it easier and more efficient. From the early imperative approach, where programmers had to spell out every step, to the declarative approach, which allowed them to specify what they wanted without detailing how to get there — each evolution has been about reducing complexity and increasing efficiency.&lt;/p&gt;

&lt;p&gt;Now, a new paradigm is emerging — one that takes this abstraction to its logical extreme. Powered by generative AI, vibe coding allows users to express intent in natural language. Instead of worrying about syntax, implementation details, or system constraints, users simply describe what they want, and the system determines how to make it happen. Companies like &lt;a href="https://www.cursor.com/" rel="noopener noreferrer"&gt;Cursor.ai&lt;/a&gt; and &lt;a href="https://codeium.com/windsurf" rel="noopener noreferrer"&gt;Windsurf&lt;/a&gt; are already pushing the boundaries of this approach, integrating AI-driven directive capabilities into coding environments. These tools enable developers to issue natural language instructions that generate, modify, and refine code, making the process more intuitive and efficient. But the real impact of the vibe coding paradigm extends far beyond programming — it is poised to revolutionize how people interact with all software.&lt;/p&gt;

&lt;h3&gt;
  
  
  Imperative and Declarative Approaches
&lt;/h3&gt;

&lt;p&gt;The imperative approach is the foundation of traditional programming. It requires the programmer to define every step necessary to perform a task, with the system executing operations in a specified sequence. This was the basis of early computing — starting with machine code, then assembly, and later evolving into higher-level languages like C and Python. Although these languages introduced abstractions to simplify programming, the fundamental principle remained the same: the programmer had to explicitly define each operation and control the execution flow.&lt;/p&gt;
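&lt;p&gt;To make the contrast concrete, here is an imperative version of the category count that the SQL example below expresses declaratively. Every loop, lookup, and accumulation must be spelled out by hand (Python, purely illustrative):&lt;/p&gt;

```python
# Imperative version of "count products per category":
# we must spell out iteration, lookup, and accumulation ourselves.
products = [
    {"name": "laptop", "category": "electronics"},
    {"name": "phone", "category": "electronics"},
    {"name": "mug", "category": "kitchen"},
]

totals = {}
for product in products:          # explicit loop
    category = product["category"]
    if category not in totals:    # explicit initialization
        totals[category] = 0
    totals[category] += 1         # explicit accumulation

print(totals)  # {'electronics': 2, 'kitchen': 1}
```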

&lt;p&gt;Declarative programming takes a different approach. Instead of focusing on the step-by-step execution of a task, it allows users to describe the desired outcome, and the system determines how to achieve it. For example, in SQL, a user can write:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT category, COUNT(*) AS total_products
FROM products GROUP BY category;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The database engine then determines the optimal way to execute this query, abstracting away the underlying logic of how the data is retrieved. Declarative programming is widely used in domains where complex logic can be encapsulated by high-level abstractions — such as SQL for databases, HTML for structuring web pages, and functional programming for data transformations. It simplifies code, making it easier to maintain and reason about, but it still requires a solid understanding of the abstractions it provides.&lt;/p&gt;

&lt;h3&gt;
  
  
  Vibe Coding
&lt;/h3&gt;

&lt;p&gt;Vibe coding represents the next step in abstraction. Instead of requiring users to think in structured syntax or system constraints, it allows them to communicate in natural language, with the system handling the implementation details. Users describe what they want, rather than specifying structured commands, and the system interprets intent, acting as an expert that understands the domain. Instead of writing code, structuring queries, or navigating menus, users issue commands in everyday language, focusing entirely on outcomes.&lt;/p&gt;

&lt;p&gt;What makes this so powerful is the shift in who does the translation between human intent and system execution. Previously, programmers were responsible for taking a high-level goal and breaking it down into code, ensuring every piece of logic and structure was explicitly defined. With vibe coding, that responsibility shifts to the AI. Users guide the AI toward the desired outcome through iterative refinement — adjusting their requests as needed while the AI dynamically responds, much like a human collaborator would.&lt;/p&gt;

&lt;p&gt;This transition drastically improves efficiency. A developer no longer needs to spend hours writing boilerplate code, structuring data pipelines, or debugging syntax errors. Instead, they focus on defining intent and refining the result. The AI handles the low-level implementation details, freeing developers to concentrate on high-level problem-solving and design.&lt;/p&gt;

&lt;p&gt;You might think that this would allow non-developers to write software, but that’s not the case. While vibe coding removes the need to manually structure logic, the user must still understand programming concepts, abstractions, and terminology. If someone doesn’t know what an abstract interface is or what refactoring means, they will still struggle to express their intent effectively. Vibe coding does not replace developer knowledge — it simply provides a more powerful, efficient way to execute that knowledge. The learning curve to become a developer remains, but the time and effort required to translate ideas into working software is dramatically reduced.&lt;/p&gt;

&lt;p&gt;This paradigm shift is already transforming coding, but its most profound impact may not be on development at all. The same vibe coding approach that simplifies programming can also revolutionize how people use software applications. &lt;/p&gt;

&lt;h3&gt;
  
  
  Vibe Interaction for Applications: An Entirely New Way to Use Software
&lt;/h3&gt;

&lt;p&gt;One of the key aspects of vibing is that it operates on abstractions. You can’t effectively vibe if you don’t understand the abstraction you’re working with. For example, if you’re using vibe coding to build software, you still need to grasp programming fundamentals or, at the very least, the abstractions provided by a higher-level framework — such as buttons, canvases, or API endpoints. Vibing doesn’t replace knowledge of these abstractions; rather, it provides a more intuitive and efficient way to work with them.&lt;/p&gt;

&lt;p&gt;The fascinating part is that the software being built through vibing is itself another abstraction. Consider a developer using classes and methods to construct a database. While they are leveraging one set of abstractions, the database itself introduces another — tables, triggers, views, queries, and so on. Since vibing inherently works on abstractions, it naturally extends to these newly created layers as well. This means that the benefits of vibe coding are not limited to software creation but also extend to interacting with the software once it’s built.&lt;/p&gt;

&lt;p&gt;As &lt;a href="https://x.com/karpathy/status/1886192184808149383" rel="noopener noreferrer"&gt;Andrej Karpathy has highlighted&lt;/a&gt;, the way we build software today is fundamentally different from traditional methods. But the true power of this shift is that the same transformative efficiencies that make development faster and more intuitive also apply to using the software itself. Instead of navigating complex UIs, memorizing commands, or manually interacting with systems, users will be able to express intent in natural language and have the system respond accordingly.&lt;/p&gt;

&lt;p&gt;This convergence of creation and interaction doesn’t just impact coders — it transforms the experience for every user of any software, anywhere. From professionals using enterprise applications to casual users interacting with everyday tools, technology is no longer a rigid system requiring mastery — it becomes an adaptive, intuitive experience, making software more accessible and powerful than ever before.&lt;/p&gt;


&lt;h3&gt;
  
  
  Enabling Vibing in Applications
&lt;/h3&gt;

&lt;p&gt;For vibing to work effectively, the AI must both understand an application’s abstractions and be able to manipulate them. The first requirement — understanding abstractions — can largely be addressed with high-quality user documentation. If the AI struggles with certain concepts, it often signals gaps in the documentation. Improving these materials not only benefits human users but also enhances the AI’s ability to assist.&lt;/p&gt;

&lt;p&gt;The second requirement — manipulating abstractions — depends on the application’s architecture. If the system has a frontend that communicates with a backend via an API, enabling AI-driven interaction is much easier. By providing the AI with access to API documentation, it can learn to work with the same entities the UI does. However, in many cases, the way data is structured in the API doesn’t perfectly match how users interact with it in the UI. To bridge this gap, additional explanations may be needed to map user-level concepts to their API counterparts. One way to facilitate this mapping is to allow the AI to analyze the frontend source code, ensuring it understands how the application presents data and functions to users.&lt;/p&gt;
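&lt;p&gt;As a rough sketch of what this could look like, here is a hypothetical tool definition whose description maps UI terminology onto API terminology so the AI can bridge the two. All names and fields are invented for illustration:&lt;/p&gt;

```python
# Hypothetical tool definition an application could hand to an LLM.
# The "description" bridges user-level terms (what the UI calls things)
# to API-level terms, so the model can map intent onto the endpoint.
# Every name here is illustrative, not a real API.
archive_project_tool = {
    "name": "archive_project",
    "description": (
        "Archive a project (shown as 'Move to Archive' in the UI). "
        "The UI's 'workspace' corresponds to the API's 'project_id'."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "project_id": {"type": "string", "description": "ID of the workspace/project"},
            "notify_members": {"type": "boolean", "description": "Email team members"},
        },
        "required": ["project_id"],
    },
}
```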

&lt;p&gt;However, the big LLM vendors are now creating agents that can operate a browser — OpenAI’s Operator and Anthropic’s Computer Use. With these agents, an application may not have to do anything at all; the user may be able to vibe through the browser agent.&lt;/p&gt;

&lt;h3&gt;
  
  
  Transitioning to Vibing
&lt;/h3&gt;

&lt;p&gt;Shifting to vibing is an evolution rather than a disruptive overhaul. Unlike past technological shifts — such as the transition to microservices, serverless computing, or event-driven architectures, which required fundamental changes to software design — vibing seamlessly integrates with existing abstractions. Instead of replacing traditional software structures, it enhances them, allowing users to engage with technology at a higher level without altering the underlying foundations.&lt;/p&gt;

&lt;p&gt;The key advantage of vibing is its ability to remove friction from software interactions. Today, using an application often involves navigating complex UIs, memorizing commands, and working within rigid workflows. With vibing, users simply express their intent in natural language, and the system determines how to execute it. For example, instead of manually creating a budget spreadsheet, selecting a template, and configuring formulas, a user can simply say:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Create a budget spreadsheet with categories for income, expenses, and savings. Highlight any category that exceeds its budget.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The system interprets the request, builds the appropriate structure, and applies the necessary logic — eliminating the need for tedious manual setup.&lt;/p&gt;
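&lt;p&gt;Purely for illustration, the structured result a system might derive from that request could look something like this (categories and amounts invented):&lt;/p&gt;

```python
# Illustrative: a structured output a system might derive from the
# natural-language request above. Categories and amounts are made up.
budget = {
    "income":   {"budget": 5000, "actual": 4800},
    "expenses": {"budget": 3000, "actual": 3400},
    "savings":  {"budget": 1000, "actual": 900},
}

# "Highlight any category that exceeds its budget."
highlighted = [name for name, row in budget.items() if row["actual"] > row["budget"]]
print(highlighted)  # ['expenses']
```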

&lt;p&gt;This shift follows a familiar adoption curve seen with past advances in software interaction. When early IDEs introduced features like autocomplete, syntax highlighting, and intelligent refactoring, they were initially seen as optional enhancements. Over time, however, they became essential tools that dramatically improved productivity. Similarly, vibing will begin as a convenience but will ultimately reshape how people interact with technology, reducing complexity and making software more intuitive, adaptable, and aligned with human intent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
Vibe coding is already reshaping software development, making interaction with technology more intuitive and efficient. But as groundbreaking as it is for programming, its true significance lies beyond development. The vibe paradigm represents a fundamental shift in human-computer interaction — one where users no longer need to learn rigid interfaces, navigate complex workflows, or write detailed instructions. Instead, they simply express intent, and the system determines how to execute it.&lt;/p&gt;

&lt;p&gt;Ironically, the coding practices that made AI possible are now being transformed by AI itself. Coding, once the essential bridge between human intent and machine execution, is no longer the primary bottleneck. The real challenge now lies in designing high-quality abstractions — the frameworks that define how AI-driven systems operate. As AI continues to advance, this paradigm isn’t just transforming how we build software — it’s redefining how we interact with, control, and leverage technology itself.&lt;/p&gt;

</description>
      <category>genai</category>
      <category>programming</category>
      <category>cursor</category>
      <category>agents</category>
    </item>
    <item>
      <title>LLM Function-Calling Performance: API- vs User-Aligned</title>
      <dc:creator>Patrick Chan</dc:creator>
      <pubDate>Mon, 17 Feb 2025 16:26:00 +0000</pubDate>
      <link>https://dev.to/gentoro/llm-function-calling-performance-api-vs-user-aligned-1k6e</link>
      <guid>https://dev.to/gentoro/llm-function-calling-performance-api-vs-user-aligned-1k6e</guid>
<description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv4mpq92v3fp49l2w6y7b.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv4mpq92v3fp49l2w6y7b.jpeg" alt="Performance of API-Aligned vs User-Aligned LLM Functions" width="640" height="213"&gt;&lt;/a&gt;Performance of API-Aligned vs User-Aligned LLM Functions&lt;/p&gt;

&lt;p&gt;Large language models (LLMs) have evolved beyond basic text generation to become powerful tools capable of programmatically interacting with external systems through &lt;strong&gt;function-calling mechanisms&lt;/strong&gt;. This capability allows LLMs to execute API calls using predefined functions, unlocking their potential for more advanced and practical applications.&lt;/p&gt;

&lt;p&gt;For example, imagine a traveler wants to modify the return date of a flight. Through function-calling, an LLM could be given access to an airline’s booking API. When prompted with, &lt;em&gt;“Change my return flight from November 25 to December 1,”&lt;/em&gt; the LLM could interpret the request, select the appropriate modifyBooking function, and provide parameters such as the booking ID and new date. It then makes the API call to adjust the itinerary, effectively completing the task without requiring further user intervention.&lt;/p&gt;
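&lt;p&gt;A minimal sketch of what such a function definition and the resulting call might look like, in the JSON-schema style common to mainstream function-calling APIs. The field names and booking ID are assumptions, not any airline’s actual API:&lt;/p&gt;

```python
# Hypothetical function definition for the flight-change example, in the
# JSON-schema style used by mainstream function-calling APIs.
# All names and values are illustrative.
modify_booking = {
    "name": "modifyBooking",
    "description": "Change the return date of an existing flight booking.",
    "parameters": {
        "type": "object",
        "properties": {
            "booking_id": {"type": "string", "description": "Existing booking ID"},
            "new_return_date": {"type": "string", "description": "ISO date, e.g. 2024-12-01"},
        },
        "required": ["booking_id", "new_return_date"],
    },
}

# The LLM's job: map "Change my return flight from November 25 to
# December 1" onto this function with the right arguments, e.g.:
call = {"name": "modifyBooking",
        "arguments": {"booking_id": "BK-1234", "new_return_date": "2024-12-01"}}
```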

&lt;p&gt;The success of such operations relies heavily on how functions are defined. Clear and well-structured functions enable the LLM to interpret prompts accurately and execute the right operations, directly affecting performance outcomes. This makes function design a critical factor in achieving high levels of accuracy and utility.&lt;/p&gt;

&lt;p&gt;This article compares two distinct approaches to defining functions — &lt;strong&gt;API-aligned&lt;/strong&gt; and &lt;strong&gt;user-aligned&lt;/strong&gt;. Through benchmarking, it examines the performance differences between the two methods and explores how these differences arise.&lt;/p&gt;

&lt;h3&gt;
  
  
  API-Aligned Approach
&lt;/h3&gt;

&lt;p&gt;In a GenAI application designed to interact with an enterprise system via an API, a natural strategy is to define LLM functions that directly mirror the required API operations. The LLM is then tasked with mapping the incoming prompts to these API-aligned functions.&lt;/p&gt;

&lt;p&gt;This approach is demonstrated by Composio through a benchmark using a specific set of prompts and eight functions, each representing an endpoint of a project management service API (see &lt;a href="https://composio.dev/blog/gpt-4-function-calling-example/" rel="noopener noreferrer"&gt;https://composio.dev/blog/gpt-4-function-calling-example/&lt;/a&gt;). Initially, without any optimizations, Composio achieved an accuracy rate of 33%. By iteratively refining the function definitions — such as tweaking parameter structures or supplying clearer descriptions — they were able to improve accuracy to 74%.&lt;/p&gt;

&lt;p&gt;The accuracy of this approach partly depends on the quality of the LLM itself. A more capable LLM, much like a more experienced developer, can often navigate complex scenarios better. For instance, LLMs perform more effectively when input parameters are flattened rather than deeply nested. While a more advanced model might handle nested parameters seamlessly, less capable models struggle with such complexity.&lt;/p&gt;
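&lt;p&gt;For example, these two hypothetical parameter schemas describe the same setting. The first nests it three levels deep; the second flattens it into a single key, the shape less capable models tend to bind more reliably:&lt;/p&gt;

```python
# Two equivalent parameter schemas for the same operation (names invented).
# Less capable models bind more reliably to the flattened form.
nested_params = {
    "type": "object",
    "properties": {
        "features": {
            "type": "object",
            "properties": {
                "due_date": {
                    "type": "object",
                    "properties": {"enabled": {"type": "boolean"}},
                }
            },
        }
    },
}

flattened_params = {
    "type": "object",
    "properties": {
        "features_due_date_enabled": {"type": "boolean"},
    },
}
```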

&lt;p&gt;However, some challenges in API-aligned functions go beyond the raw ability of the LLM — even a “genius” developer-equivalent LLM may fail under certain conditions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Implicit Knowledge Requirements:&lt;/strong&gt; The correct usage of an endpoint may rely on context or domain knowledge that cannot be deduced from the function definition alone.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example:&lt;/em&gt; A prompt asks the LLM to “archive a project,” but the API endpoint requires additional hidden parameters, such as whether to notify team members or preserve specific metadata.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Terminology Mismatch:&lt;/strong&gt; The terms used in the prompt may not align with the terminology of the API.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example:&lt;/em&gt; A user prompt refers to a “shopping cart,” while the API defines this concept as an ItemList. The LLM may fail to correctly map between these terms without additional guidance.&lt;/p&gt;

&lt;p&gt;These challenges highlight inherent limitations in the API-aligned approach. While it can be effective with well-defined APIs and straightforward mappings, achieving consistent accuracy requires not only improving the LLM but also addressing ambiguities and gaps in the function definitions and user input alignment.&lt;/p&gt;

&lt;h3&gt;
  
  
  User-Aligned Approach
&lt;/h3&gt;

&lt;p&gt;An alternative to the API-aligned method is to define functions based on the structure of input prompts rather than the API itself. This approach shifts the focus closer to how users naturally phrase their requests. The distinction between the two approaches can be summarized as follows:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;• API-Aligned Functions:&lt;/strong&gt; These are tightly coupled with the API’s structure, requiring the LLM to perform more work to map user prompts to the appropriate functions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;• User-Aligned Functions:&lt;/strong&gt; These are designed to align more closely with user prompts, reducing the burden on the LLM for accurate mapping. However, this approach often requires additional implementation effort, as it may necessitate invoking multiple API endpoints or performing extra computations.&lt;/p&gt;
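&lt;p&gt;A hypothetical user-aligned function might look like this. The LLM sees one function shaped like the user’s request, while the implementation fans out to multiple API-aligned operations (all endpoint names are invented, with stubs standing in for a real API client):&lt;/p&gt;

```python
# Hypothetical user-aligned function. The LLM sees one simple function
# shaped like the user's request; the implementation fans out to several
# API-aligned calls. Endpoint names are illustrative.
def archive_completed_projects(team: str) -> list:
    projects = api_list_projects(team=team)         # API-aligned call 1
    done = [p for p in projects if p["status"] == "done"]
    for p in done:
        api_archive_project(project_id=p["id"])     # API-aligned call 2
    return [p["id"] for p in done]

# Stubs standing in for the real API client:
def api_list_projects(team):
    return [{"id": "p1", "status": "done"}, {"id": "p2", "status": "active"}]

def api_archive_project(project_id):
    pass
```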

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fka44wq73veiq8ja84s8y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fka44wq73veiq8ja84s8y.png" alt="Two Approaches to Function Design" width="277" height="333"&gt;&lt;/a&gt;&lt;/p&gt;
Two Approaches to Function Design



&lt;h3&gt;
  
  
  Generating Functions
&lt;/h3&gt;

&lt;p&gt;API-aligned functions are typically generated directly from the API specification, making them straightforward to implement. User-aligned functions, on the other hand, must be derived from prompts, which requires those prompts to be &lt;strong&gt;entity-complete&lt;/strong&gt; — meaning they must explicitly reference all entities and parameters necessary for defining the function.&lt;/p&gt;

&lt;p&gt;For example, if an API endpoint requires a mode parameter, the prompt used to generate the user-aligned function must specify this parameter explicitly. If the prompt omits it, the resulting function definition will lack it as well, leaving the implementation to pick an arbitrary default value. Ensuring prompts are entity-complete is critical to generating effective user-aligned functions.&lt;/p&gt;
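&lt;p&gt;The effect of an entity-incomplete prompt can be sketched as follows: if "mode" never appears among the entities extracted from the prompt, no generation step can put it into the definition (illustrative code, not Gentoro’s actual generator):&lt;/p&gt;

```python
# Illustrative: deriving a function's parameters from the entities a
# prompt mentions. If the prompt never mentions "mode", the generated
# definition cannot include it.
def derive_parameters(prompt_entities):
    return {
        "type": "object",
        "properties": {name: {"type": "string"} for name in prompt_entities},
        "required": sorted(prompt_entities),
    }

incomplete = derive_parameters({"project_name"})        # prompt omitted "mode"
complete = derive_parameters({"project_name", "mode"})  # entity-complete prompt

print("mode" in incomplete["properties"])  # False
print("mode" in complete["properties"])    # True
```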

&lt;h3&gt;
  
  
  Challenges of User-Aligned Functions
&lt;/h3&gt;

&lt;p&gt;While user-aligned functions simplify the LLM’s task of binding prompts to functions, they also present unique challenges:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Proliferation of Functions:&lt;/strong&gt; A major drawback is the risk of creating too many narrowly tailored functions, each corresponding to a specific variation of input prompts. In the worst-case scenario, this could result in one function per prompt variation, leading to an overly complex and fragmented implementation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Generalization Process:&lt;/strong&gt; To mitigate this, it is essential to consolidate overfitted functions into a smaller set of generalized functions. This process must ensure that generalization does not compromise the LLM’s ability to bind accurately to the functions. A careful balance is required to maintain efficiency and reliability while preserving the alignment with user inputs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Testing Both Approaches Using the Composio Function-Calling Benchmark
&lt;/h3&gt;

&lt;p&gt;The Composio benchmark consists of 50 prompts, each with an expected result corresponding to the LLM binding the input to one of eight predefined functions. The benchmark is designed to measure the performance of API-aligned functions across a variety of optimization techniques. To evaluate the user-aligned approach, these same prompts were used to generate corresponding user-aligned functions, which were then executed and compared to the benchmark’s expected results.&lt;/p&gt;

&lt;p&gt;A notable challenge in this comparison is that the user-aligned functions, by design, differ structurally from the eight predefined API-aligned functions. To address this, a semantic comparison tool was developed, as described below.&lt;/p&gt;

&lt;h4&gt;
  
  
  Evaluation Steps
&lt;/h4&gt;

&lt;p&gt;Each prompt in the benchmark was evaluated using the following process:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Function Generation:&lt;/strong&gt; A function definition was generated using Gentoro’s function-generation capability based on the content and intent of the prompt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Prompt Execution:&lt;/strong&gt; The prompt was submitted to the LLM along with the generated function, and the resulting function call trace was recorded.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Comparison:&lt;/strong&gt; The recorded function call trace was semantically compared to the benchmark’s expected output using the custom semantic comparer.&lt;/p&gt;

&lt;p&gt;The results of this evaluation showed that, using the user-aligned approach, all 50 prompts were successfully executed.&lt;/p&gt;

&lt;h4&gt;
  
  
  Benchmark Execution Details
&lt;/h4&gt;

&lt;p&gt;During the benchmark evaluation, discrepancies arose where the expected function calls included values not explicitly provided in the prompts. In the user-aligned approach, such parameters cannot be included in the generated function call trace because they are absent from the input prompts. To resolve this mismatch, the evaluator was configured to ignore parameters in the expected function calls that were not present in the original prompt. The results of the modified benchmark can be seen here: &lt;a href="https://github.com/gentoro-gh/Composio-Function-Calling-Benchmark" rel="noopener noreferrer"&gt;https://github.com/gentoro-gh/Composio-Function-Calling-Benchmark&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The following system prompt was used to compare the actual and expected results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;System Message:
  Compare two function calls (expected and actual) to determine if they match. 
  Lean toward passing, failing only when explicitly critical mismatches cannot 
  be resolved through creative reasoning.

User Message:
  Steps for Matching

  1. Parse and Normalize:
      - Extract parameters as key-value pairs.
      - Flatten nested structures (e.g., a=(b=(enabled=true)) becomes a.b.enabled=true).

  2. Handle Aliases:
      - Use mappings for semantically equivalent parameters 
        (e.g., due_dates_enabled ↔ features.due_date.enabled).
      - Translate both lists into a unified format.

  3. Exclude Irrelevant or Unreferenced Parameters:
      - Ignore parameters not explicitly mentioned or required by the prompt.
      - Missing parameters in the actual call are ignored unless explicitly critical.

  4. Compare Parameters:
      - Match keys and values exactly, semantically, or creatively 
        (e.g., private=true ↔ privacy_setting=members-only).
      - Resolve ambiguity in favor of passing.

  5. Ignore Extra Parameters:
      - Extra parameters in the actual call do not affect the outcome.

  6. Report Findings:
      - Summarize matched, excluded, and ignored parameters.

  7. Determine Pass/Fail:
      - Pass: Only if all relevant parameters align or plausible matches exist.
      - Fail: If a critical, explicitly referenced parameter cannot align 
              in any way; in this case, do not try to find another reason to pass.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
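&lt;p&gt;Steps 1 and 2 of this procedure, flattening nested structures into dotted keys and rewriting known aliases into a unified form, can be sketched as follows (illustrative, not the actual comparer):&lt;/p&gt;

```python
# Sketch of steps 1-2 of the matching procedure above: flatten nested
# call arguments into dotted keys, then rewrite known aliases into a
# unified form before comparing. The alias mapping is the example one.
ALIASES = {"due_dates_enabled": "features.due_date.enabled"}

def normalize(args, prefix=""):
    flat = {}
    for key, value in args.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(normalize(value, path))   # flatten nested structure
        else:
            flat[ALIASES.get(path, path)] = value  # apply alias mapping
    return flat

expected = normalize({"features": {"due_date": {"enabled": True}}})
actual = normalize({"due_dates_enabled": True})
print(expected == actual)  # True
```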



&lt;h4&gt;
  
  
  Experimental Configuration
&lt;/h4&gt;

&lt;p&gt;The evaluation used the following parameters:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;• Model:&lt;/strong&gt; GPT-4o-2024-08-06&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;• Temperature:&lt;/strong&gt; 0.8&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;• Top-p:&lt;/strong&gt; 0.8&lt;/p&gt;

&lt;h3&gt;
  
  
  Implications, Challenges, and Conclusion
&lt;/h3&gt;

&lt;p&gt;The evaluation demonstrates the potential of aligning function definitions with user-provided inputs to streamline the function-calling process for LLMs. By reducing reliance on rigid API schemas, the user-aligned approach simplifies function binding and minimizes the need for complex prompt adjustments. However, while this method shows promise, more extensive testing and validation are needed to fully assess its effectiveness across diverse scenarios and use cases.&lt;/p&gt;

&lt;p&gt;A notable challenge in the evaluation was the sequential nature of testing, where each tool was created, tested, and discarded individually. Optimizing this workflow to handle multiple prompts in a single evaluation run could significantly enhance efficiency. Additionally, the current benchmarks focus on relatively straightforward scenarios, and expanding them to include more complex, multi-step tasks or higher computational demands would provide deeper insights into the scalability and robustness of the user-aligned approach.&lt;/p&gt;

&lt;p&gt;Another challenge involves balancing specificity and generalization. Aligning functions closely with user prompts simplifies LLM function binding but risks generating an excessive number of narrowly tailored functions. Future work should explore techniques to consolidate overfitted functions into generalized ones without sacrificing the LLM’s ability to accurately map prompts to these broader definitions.&lt;/p&gt;

&lt;p&gt;Hybrid strategies also represent a promising direction for future research. By dynamically adapting both function definitions and prompts, hybrid methods could combine the precision of API-aligned approaches with the flexibility of user-aligned ones, offering a balanced and scalable solution for real-world applications.&lt;/p&gt;

&lt;p&gt;In conclusion, the user-aligned approach provides a compelling direction for improving LLM accuracy in function-calling tasks, but its practical scalability and reliability remain dependent on further refinement and testing. Addressing challenges such as testing efficiency, function generalization, and broader benchmarking will be crucial to unlocking the full potential of this method. By combining these efforts with hybrid strategies, the field can advance toward more versatile and robust LLM function-calling mechanisms.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>api</category>
      <category>genai</category>
      <category>functioncalling</category>
    </item>
    <item>
      <title>Using MCP Server to Integrate LLMs into Your Systems</title>
      <dc:creator>Patrick Chan</dc:creator>
      <pubDate>Thu, 13 Feb 2025 20:42:06 +0000</pubDate>
      <link>https://dev.to/gentoro/using-mcp-server-to-integrate-llms-into-your-systems-171h</link>
      <guid>https://dev.to/gentoro/using-mcp-server-to-integrate-llms-into-your-systems-171h</guid>
      <description>&lt;p&gt;As the enterprise adoption of Large Language Models (LLMs) continues to accelerate, developers face increasingly complex challenges in building production-ready generative AI applications. At Gentoro, we've been tackling these challenges head-on as we build bridges between LLMs and enterprise systems, and we've discovered a game-changing solution in &lt;a href="https://www.anthropic.com/news/model-context-protocol" rel="noopener noreferrer"&gt;Anthropic's recently launched Model Context Protocol (MCP)&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The fundamental challenge in enterprise LLM integration lies in establishing reliable communication between these powerful AI models and existing enterprise infrastructure - databases, applications, and various services. This isn't just about connecting different systems; it's about ensuring these connections are robust, maintainable, and perform consistently at scale. Developers working with LLMs know that the process involves painstaking effort in constructing multiple components, followed by countless hours of fine-tuning to achieve desired behaviors.&lt;/p&gt;

&lt;p&gt;One of the most significant hurdles we encountered while developing our product was the proliferation of agent frameworks. The landscape of LLM tooling has exploded with solutions like LangChain, LlamaIndex, and CrewAI, each offering unique approaches to building LLM-powered applications. The traditional approach would have required building and maintaining custom connectors for each framework - a resource-intensive endeavor that would have diverted our focus from core product development.&lt;/p&gt;

&lt;h3&gt;
  
  
  Model Context Protocol: A Universal Solution
&lt;/h3&gt;

&lt;p&gt;This is where the Model Context Protocol enters the picture. Recently introduced by Anthropic, MCP represents a paradigm shift in how we approach LLM integration. As an open standard, it establishes a universal protocol for connecting LLMs with external data sources, tools, and services. The implications of this standardization are profound - it eliminates the need for fragmented, framework-specific integrations and provides a scalable foundation for building enterprise-grade AI solutions.&lt;/p&gt;

&lt;p&gt;For development teams working with LLMs, the benefits of MCP are immediate and substantial. By implementing server-side MCP support, we at Gentoro transformed what was once a major technical roadblock into a solved problem virtually overnight. This standardization allows developers to redirect their energy from building and maintaining multiple connectors to focusing on more critical challenges in the LLM space - such as optimizing tool interactions and reducing hallucinations, which remain persistent challenges in working with these powerful but sometimes unpredictable models.&lt;/p&gt;

&lt;p&gt;The technical community's response to MCP has been overwhelmingly positive, with growing momentum toward adoption. This enthusiasm isn't just about the immediate benefits; it's about the future potential of a unified ecosystem. As more frameworks align with the MCP standard, we're witnessing the emergence of a more cohesive and efficient development environment for generative AI applications.&lt;/p&gt;

&lt;p&gt;While it's important to note that many current MCP server implementations remain in the demonstration phase, the protocol itself represents a crucial step forward in enterprise AI development. It provides the architectural foundation necessary for building robust, scalable, and efficient generative AI solutions. This foundation is particularly valuable for enterprises looking to implement LLM-powered applications without getting bogged down in integration complexities.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqsrjj0irjjsjv044t0yv.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqsrjj0irjjsjv044t0yv.jpg" alt="Claude using Gentoro tools via MCP" width="800" height="704"&gt;&lt;/a&gt;&lt;/p&gt;
Claude using Gentoro tools via MCP



&lt;h3&gt;
  
  
  The Broader Impact
&lt;/h3&gt;

&lt;p&gt;Looking ahead, the impact of MCP extends far beyond its role as a technical standard. It's emerging as a transformative force in the generative AI landscape, fundamentally changing how developers approach LLM integration. The protocol's ability to simplify complex integrations while maintaining flexibility and scalability positions it as a key enabler for the next generation of enterprise AI applications.&lt;/p&gt;

&lt;p&gt;For developers building with LLMs, MCP servers represent a significant opportunity to streamline development processes and focus on delivering value rather than wrestling with integration challenges. As the ecosystem matures and more production-ready implementations emerge, MCP's role in shaping the future of enterprise AI integration becomes increasingly clear.&lt;/p&gt;

&lt;p&gt;The journey of implementing and working with MCP has reinforced our belief that standardization is crucial for the sustainable growth of enterprise AI applications. As we continue to explore and expand the capabilities of our MCP server implementation, we're excited to be part of this transformative movement in the generative AI space.&lt;/p&gt;

&lt;h3&gt;
  
  
  Getting started with Gentoro's MCP support
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Learn more about the Gentoro MCP server &lt;a href="https://www.gentoro.com/mcp" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Visit this &lt;a href="https://github.com/gentoro-GT/mcp-nodejs-server" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt; for the Gentoro connector's source code.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Follow this &lt;a href="https://www.gentoro.com/docs/howto/Connect_Claude_Desktop" rel="noopener noreferrer"&gt;documentation guide&lt;/a&gt; to connect Claude Desktop with Gentoro.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>mcp</category>
      <category>claude</category>
      <category>anthropic</category>
      <category>llm</category>
    </item>
    <item>
      <title>Building Bridges: Connecting LLMs with Enterprise Systems</title>
      <dc:creator>Patrick Chan</dc:creator>
      <pubDate>Thu, 13 Feb 2025 20:13:17 +0000</pubDate>
      <link>https://dev.to/gentoro/building-bridges-connecting-llms-with-enterprise-systems-3447</link>
      <guid>https://dev.to/gentoro/building-bridges-connecting-llms-with-enterprise-systems-3447</guid>
      <description>&lt;p&gt;Enterprise users want to harness the power of large language models (LLMs) and Generative AI (GenAI) with enterprise resources, but developers tasked with making that happen will face some complex challenges.&lt;/p&gt;

&lt;p&gt;Enterprise users will expect LLMs and GenAI to know and use private and proprietary data to inform and provide context for responses. Retrieval-augmented generation (RAG) can extend the LLM beyond its initial training data, but users already expect the LLM to pull intelligence from complex information services and systems, which may require query execution plans that retrieve specific records rather than non-performant database dumps and full table scans.&lt;/p&gt;

&lt;p&gt;Your users will also want the LLM/GenAI to take actions, which means executing function calls reliably. Complex requests will require complex actions that are difficult to code statically ahead of time and that an LLM can’t execute directly either.&lt;/p&gt;

&lt;p&gt;Several logical layers of code need to be implemented between prompt, execution, and response. Developers need to create a bridge between LLMs and &lt;a href="https://platform.openai.com/docs/guides/function-calling" rel="noopener noreferrer"&gt;function calling&lt;/a&gt;. External APIs must be defined and described as functions (tools in agentic systems) so that the LLM knows when to use them. When the LLM chooses to invoke the function, parameters are passed to your code for function calling and the API response is passed back to the LLM. It sounds easy, but there is a lot more to it. Here’s a list of eleven technical challenges requiring consideration. &lt;/p&gt;
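&lt;p&gt;A minimal sketch of that bridge, in Python, might look like the following. The tool name, schema fields, and &lt;code&gt;dispatch&lt;/code&gt; helper are illustrative, not any particular provider's API:&lt;/p&gt;

```python
import json

# Describe an external API endpoint as a "tool" the LLM can choose to invoke.
# The schema tells the model what the function does and what parameters it takes.
GET_WEATHER_TOOL = {
    "name": "get_weather",
    "description": "Look up the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

def get_weather(city):
    # In a real system this would call an external weather API.
    return {"city": city, "temp_c": 21}

# Map tool names to the code that actually performs the call.
TOOL_REGISTRY = {"get_weather": get_weather}

def dispatch(tool_call):
    """Execute the function the LLM chose and return its result as JSON
    so it can be passed back to the model."""
    fn = TOOL_REGISTRY[tool_call["name"]]
    args = json.loads(tool_call["arguments"])
    return json.dumps(fn(**args))

# Simulated LLM output: the model decided to invoke get_weather.
result = dispatch({"name": "get_weather", "arguments": '{"city": "Paris"}'})
```

&lt;p&gt;The loop itself is simple; as the list below shows, the real complexity lives in everything around it.&lt;/p&gt;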

&lt;h3&gt;
  
  
  1. Learn External API and Specification
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;- Requirement:&lt;/strong&gt; Developers need to fully understand the external system’s API, schema, data models, and object structures. This is often a significant learning curve, especially for complex systems with many endpoints (e.g., Slack with 100+ endpoints).&lt;br&gt;
&lt;strong&gt;- Challenge:&lt;/strong&gt; This step can take days to weeks due to the depth of learning required and the number of external systems that require integration. APIs may be poorly documented, lack examples, or have complex object and data models that need to be understood before building any integration.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Design the Functions
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;- Requirement:&lt;/strong&gt; Developers must design and define how function calling interacts with external APIs and the LLM. This step includes specifying what endpoints to call and how to call them based on different prompts.&lt;br&gt;
&lt;strong&gt;- Challenge:&lt;/strong&gt; The vast array of possible natural language prompts can lead to many unexpected use cases, each requiring a unique endpoint selection. When developers manually code agentic tools, they must try to anticipate new LLM function calling logic triggered by novel prompts to ensure prompt-to-tool mapping is robust and accurate. While frameworks like LangChain can help manage the registration of tools with the LLM, they don't help at all if the LLM does not properly use the registered toolset.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Assess LLM Dependencies
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;- Requirement:&lt;/strong&gt; Developers must understand the LLM's limitations and specific requirements for tool implementation, such as tool bindings, function calls, and how the LLM interprets external calls.&lt;br&gt;
&lt;strong&gt;- Challenge:&lt;/strong&gt; LLMs vary significantly in their function calling models, requiring customization for each system. Functions may be defined by JSON, model-specific API calls, or token-based function call systems. Function calling may be accomplished with strict or structured outputs, prompt engineering, and an unlimited variety of unstructured prompts. Misalignments in function calling models and LLM abstractions can lead to inaccuracies and hallucinations.&lt;/p&gt;
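&lt;p&gt;To make that variation concrete, here is one logical tool rendered into two hypothetical provider formats; the field names below illustrate the kind of divergence, and are not any vendor's actual schema:&lt;/p&gt;

```python
# One logical tool, two hypothetical provider formats. Keeping a single
# source of truth and generating each provider's shape reduces drift
# when a model or provider changes.
CANONICAL = {
    "name": "create_ticket",
    "description": "Open a support ticket.",
    "params": {"title": {"type": "string"}},
}

def to_provider_a(tool):
    # Style A: nested JSON-schema "parameters" object.
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool["description"],
            "parameters": {
                "type": "object",
                "properties": tool["params"],
                "required": list(tool["params"]),
            },
        },
    }

def to_provider_b(tool):
    # Style B: flat "input_schema" field.
    return {
        "name": tool["name"],
        "description": tool["description"],
        "input_schema": {"type": "object", "properties": tool["params"]},
    }

a = to_provider_a(CANONICAL)
b = to_provider_b(CANONICAL)
```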

&lt;h3&gt;
  
  
  4. Implement Tool Code
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;- Requirement:&lt;/strong&gt; Developers must write code that calls external APIs and implements the logic for interacting with these APIs via calling functions. They must manage authentication and any complexity introduced by differentiated access control and privacy mechanisms. They can write or use existing code libraries, sample code, and tools for testing, analytics, monitoring, and debugging. &lt;br&gt;
&lt;strong&gt;- Challenge:&lt;/strong&gt; Writing code for external APIs involves dealing with complex authentication flows, ensuring proper error handling, and managing edge cases. Developers must also ensure scalability and performance, particularly when handling high-throughput API calls. Debugging API interactions can be challenging due to limited visibility into external systems, requiring robust logging and diagnostic tools. Writing test cases to simulate API responses accurately is essential but time-intensive. Ensuring code is secure and complies with enterprise data privacy standards further adds to the complexity.&lt;/p&gt;
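&lt;p&gt;As one small example of the error-handling work involved, a retry wrapper for transient API failures might look like this sketch (the exception type and stub API are hypothetical):&lt;/p&gt;

```python
import time

class TransientAPIError(Exception):
    """Stand-in for a provider's rate-limit or timeout error."""

def call_with_retries(fn, attempts=3, backoff_s=0.01):
    # Retry transient failures with exponential backoff, and surface a
    # structured error to the agent layer instead of crashing it.
    delay = backoff_s
    for attempt in range(attempts):
        try:
            return {"ok": True, "data": fn()}
        except TransientAPIError as exc:
            if attempt == attempts - 1:
                return {"ok": False, "error": str(exc)}
            time.sleep(delay)
            delay = delay * 2

# Stub external call: fails twice with a transient error, then succeeds.
state = {"calls": 0}

def flaky_api():
    state["calls"] += 1
    if state["calls"] in (1, 2):
        raise TransientAPIError("rate limited")
    return "ticket-created"

result = call_with_retries(flaky_api)
```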

&lt;h3&gt;
  
  
  5. Register Functions with LLM API
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;- Requirement:&lt;/strong&gt; Functions must be registered correctly with the LLM’s API, enabling the LLM to recognize and use them. This involves defining tool sets and their capabilities in a way the LLM understands.&lt;br&gt;
&lt;strong&gt;- Challenge:&lt;/strong&gt; Too many registered functions or poorly designed tool sets can slow down the LLM or lead to expensive operations if not optimized. Updates to an existing LLM, or selection of a new one, can completely change function registration, requiring a rebuild of the function calling applications.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Pre-Production Testing and Validation
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;- Requirement:&lt;/strong&gt; Before deployment, developers must test and refine function calls using sample prompts to verify correct behavior, ensure functionality, and validate performance. The system must also scale to at least the expected peak demand, and beyond if warranted.&lt;br&gt;
&lt;strong&gt;- Challenge:&lt;/strong&gt; Pre-production testing is critical and complex, as it must account for a wide variety of prompt types and edge cases. It’s necessary to write automated tests that validate each potential interaction and ensure accurate data flow. Even when you think you have covered every case, the effectively infinite space of possible prompts may force further rounds of testing and validation.&lt;/p&gt;
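&lt;p&gt;One pragmatic starting point is a golden-set test that checks whether canonical prompts still route to the expected tool. The &lt;code&gt;select_tool&lt;/code&gt; router below is a deliberately simple keyword stand-in for a real LLM-backed selector:&lt;/p&gt;

```python
# Golden-set regression check: canonical prompts must keep routing to the
# expected tool. select_tool is a keyword stand-in for an LLM-backed router.
def select_tool(prompt):
    p = prompt.lower()
    if "weather" in p:
        return "get_weather"
    if "ticket" in p or "support" in p:
        return "create_ticket"
    return "fallback"

GOLDEN_CASES = [
    ("What's the weather in Oslo?", "get_weather"),
    ("Please open a support ticket for this bug", "create_ticket"),
    ("Tell me a joke", "fallback"),
]

failures = [(prompt, want, select_tool(prompt))
            for prompt, want in GOLDEN_CASES
            if select_tool(prompt) != want]
# An empty failures list is the bar for shipping a toolset change.
```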

&lt;h3&gt;
  
  
  7. Post-Production Support and Evaluation
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;- Requirement:&lt;/strong&gt; Systems should be designed to self-monitor, detect faults, accept feedback, and revalidate or auto-correct with proper logging and notification for support throughout. Once deployed, developers or support engineers must be able to check performance and troubleshoot any hallucinations or other issues to ensure that the integration works as expected. &lt;br&gt;
&lt;strong&gt;- Challenge:&lt;/strong&gt; Post-production environments often reveal unforeseen issues related to performance, accuracy, or system load. Because prompt-layer inputs can be anything, it’s essential to have tooling in place for ongoing evaluation, performance monitoring, and quick identification of bottlenecks and failures. A prompt-layer monitoring tool can be bought, or built in-house, to log requests, responses, and associated metrics for assessment and evaluation.&lt;/p&gt;
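&lt;p&gt;A sketch of the kind of logging such a tool needs: wrap each tool so every invocation records its inputs, outcome, and latency (the names and stub tool here are illustrative):&lt;/p&gt;

```python
import time

LOG = []

def logged(tool_name, fn):
    # Wrap a tool so every invocation records inputs, outcome, and latency
    # for post-production evaluation and debugging.
    def wrapper(**kwargs):
        start = time.monotonic()
        try:
            out = fn(**kwargs)
        except Exception as exc:  # log the failure, then re-raise
            LOG.append({"tool": tool_name, "args": kwargs,
                        "status": "error", "error": str(exc)})
            raise
        LOG.append({
            "tool": tool_name,
            "args": kwargs,
            "status": "ok",
            "latency_ms": round((time.monotonic() - start) * 1000, 2),
        })
        return out
    return wrapper

# Stub tool standing in for a real external-API call.
lookup = logged("get_weather", lambda city: {"city": city, "temp_c": 18})
lookup(city="Lisbon")
```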

&lt;h3&gt;
  
  
  8. Issue Resolution in Production
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;- Requirement:&lt;/strong&gt; The deployed system should be consistent, reliable, and free of hallucinations, and the quality of the LLM responses should be excellent and as expected. If issues occur, users should be properly notified, and support should have enough information logged to troubleshoot and fix them without introducing regressions or downtime. Developers need to ensure that both the LLM and tools are context-aware and respond correctly to end-user prompt variations.&lt;br&gt;
&lt;strong&gt;- Challenge:&lt;/strong&gt; Handling hallucinations, bugs, or failures in a live environment without breaking other functionality is hard. Anticipate likely failure modes and provide circuit breakers and logging where appropriate to allow for diagnosis and repair, and avoid introducing regressions when changes are made to core tool functions or LLM interactions.&lt;/p&gt;

&lt;h3&gt;
  
  
  9. Handling Latency and Performance Optimizations
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;- Requirement:&lt;/strong&gt; Tools must be optimized for low latency and high performance when interacting with external systems. This is particularly important for real-time applications. The developer may need to impose limits on how the LLM can join tables of information from large data sources. Full table scans or operations that are known to be expensive or consumptive of computing resources may need to be blocked.&lt;br&gt;
&lt;strong&gt;- Challenge:&lt;/strong&gt; System resources are finite. External systems might not be fast enough or may have rate-limiting, which can increase response times for the LLM. Developers must balance caching, data retrieval strategies, and optimization techniques to keep the integration fast and efficient.&lt;/p&gt;
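&lt;p&gt;Caching identical requests is one of the simplest mitigations. A minimal sketch, with a stub fetch function standing in for a rate-limited external service (a production cache would also need a TTL and a size bound):&lt;/p&gt;

```python
class CachingClient:
    # Memoize identical external-API requests so repeated prompts don't
    # burn through rate limits. A production cache would also need a TTL
    # and a size bound.
    def __init__(self, fetch):
        self.fetch = fetch
        self.cache = {}
        self.api_calls = 0

    def get(self, key):
        if key not in self.cache:
            self.api_calls += 1          # only cache misses hit the API
            self.cache[key] = self.fetch(key)
        return self.cache[key]

# Stub fetch standing in for a slow, rate-limited external service.
client = CachingClient(lambda city: {"city": city, "temp_c": 20})
client.get("Oslo")
client.get("Oslo")      # served from cache; no second API call
client.get("Bergen")
```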

&lt;h3&gt;
  
  
  10. Define Security and Privacy Protocols
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;- Requirement:&lt;/strong&gt; Systems need to protect both enterprise and user data. Logins and authentication tokens must allow access to secured resources. LLMs must be able to handle protected information in compliance with GDPR, HIPAA, and CCPA regulations.&lt;br&gt;
&lt;strong&gt;- Challenge:&lt;/strong&gt; The main obstacle is handling user input and API responses that may contain personally identifiable information (PII) or sensitive data without exposing vulnerabilities.&lt;br&gt;
External services may have role-based access control that allows specific users full access, restricted access, or constrained access depending on access tokens, credentials, or permissions based on identity. Developers must implement tools that pass tokens or credentials based on the user’s identity and use returned data securely. Ensuring the API calls are efficient and compliant with privacy laws adds additional complexity.&lt;/p&gt;
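&lt;p&gt;For example, obvious PII can be scrubbed before a prompt or API response is logged or forwarded. The two regexes below are only a sketch of what a real detection layer must cover:&lt;/p&gt;

```python
import re

# Minimal PII patterns: an email address and a US-style SSN.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def redact(text):
    """Scrub obvious PII before text is logged or forwarded.
    Real deployments need far broader detection than two regexes."""
    text = EMAIL.sub("[EMAIL]", text)
    text = SSN.sub("[SSN]", text)
    return text

safe = redact("Contact jane.doe@example.com, SSN 123-45-6789, about the refund.")
```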

&lt;h3&gt;
  
  
  11. Cost
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;- Requirement:&lt;/strong&gt; Systems must be performant and cost-effective. Resource efficiency must allow for scalability.&lt;br&gt;
&lt;strong&gt;- Challenge:&lt;/strong&gt; Developers need to define and register the functions that most efficiently fulfill the SLA requirements. A large set of registered functions becomes expensive if every function definition is loaded for every user prompt, so logic must be written to decide which subset of functions to submit for a given request. Because the LLM determines what to do with the prompt, some part of the system must manage function calling so that the total number of input tokens doesn’t create expensive overhead.&lt;/p&gt;
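&lt;p&gt;One common mitigation is to send the LLM only the subset of tool definitions relevant to the prompt. The keyword router and token counts below are illustrative stand-ins for an embedding-based selector and a real tokenizer:&lt;/p&gt;

```python
# Rather than loading every tool schema on every request, send only the
# subset relevant to the prompt, bounding input-token overhead. Keyword
# matching and the token counts are illustrative stand-ins.
TOOLS = {
    "get_weather": {"keywords": ["weather", "forecast"], "schema_tokens": 120},
    "create_ticket": {"keywords": ["ticket", "support", "bug"], "schema_tokens": 200},
    "run_report": {"keywords": ["report", "sales"], "schema_tokens": 450},
}

def tools_for(prompt):
    p = prompt.lower()
    chosen = [name for name, meta in TOOLS.items()
              if any(k in p for k in meta["keywords"])]
    spent = sum(TOOLS[n]["schema_tokens"] for n in chosen)
    return chosen, spent

chosen, spent = tools_for("Open a support ticket about the sales report")
```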

&lt;p&gt;In conclusion, connecting enterprise systems with LLMs and GenAI presents a range of technical challenges that developers must navigate, from understanding complex external APIs to managing latency and ensuring robust security. The process is far from simple, requiring precise function design, careful tool implementation, and rigorous testing to ensure that the system operates reliably in real-world scenarios. As these systems evolve, developers must also stay agile, adapting to changes in APIs, LLM models, and external services while ensuring that performance, security, and cost-efficiency are maintained.&lt;/p&gt;

</description>
      <category>functioncalling</category>
      <category>llm</category>
      <category>genai</category>
      <category>langchain</category>
    </item>
    <item>
      <title>Rethinking LangChain in the Agentic AI World</title>
      <dc:creator>Patrick Chan</dc:creator>
      <pubDate>Fri, 31 Jan 2025 20:50:36 +0000</pubDate>
      <link>https://dev.to/gentoro/rethinking-langchain-in-the-agentic-ai-world-3j1f</link>
      <guid>https://dev.to/gentoro/rethinking-langchain-in-the-agentic-ai-world-3j1f</guid>
      <description>&lt;p&gt;LangChain is undeniably a powerful tool for building modern AI-powered applications, enabling developers to chain together models, APIs, and tools to create intelligent systems. It powers some of the most advanced agentic AI systems today. Ironically, however, LangChain itself doesn’t take full advantage of the generative AI power it enables. Instead, it relies on traditional programming paradigms, requiring developers to define workflows step by step, meticulously design prompts, and manage integrations manually.&lt;/p&gt;

&lt;p&gt;What would LangChain development look like if it leveraged the same agentic AI power it provides to others?&lt;/p&gt;




&lt;h3&gt;
  
  
  Imperative Programming: The “How”
&lt;/h3&gt;

&lt;p&gt;LangChain operates firmly within the imperative programming paradigm. In this approach, developers explicitly define each step the system must take to achieve a desired outcome. For example, when building a LangChain agent, developers must:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Write code to define workflows step by step.&lt;/li&gt;
&lt;li&gt;Manually design and refine prompt templates.&lt;/li&gt;
&lt;li&gt;Integrate APIs, tools, and data sources manually.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This method is effective but highly detailed and time-intensive, requiring programming expertise. Every crafted detail leaves room for human error and slows down the development process.&lt;/p&gt;

&lt;h3&gt;
  
  
  Declarative Programming: The “What”
&lt;/h3&gt;

&lt;p&gt;Declarative programming shifts the focus from process to outcome. Instead of defining how something is achieved, developers specify what they want, and the system determines the steps. Examples include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SQL:&lt;/strong&gt; Define the data you need, and the database retrieves it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Terraform:&lt;/strong&gt; Specify the desired infrastructure, and the tool provisions it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While declarative systems simplify some tasks, they still require significant domain-specific knowledge and expertise to achieve results.&lt;/p&gt;

&lt;h3&gt;
  
  
  Directive Programming: The “What” in Plain Language
&lt;/h3&gt;

&lt;p&gt;Generative AI introduces a new paradigm: &lt;strong&gt;directive programming&lt;/strong&gt;. Here, users describe goals in natural language, and the system determines how to achieve them. For instance:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Show me a list of customers who made a purchase last month, sorted by spending.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The system generates and executes the query, returning the results. If adjustments are needed, users provide conversational feedback:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Add a chart to visualize the data and group customers by region.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This paradigm eliminates the need for technical expertise, introduces an iterative feedback loop, and makes the process intuitive and adaptive.&lt;/p&gt;




&lt;h3&gt;
  
  
  Analyzing a LangChain Agent
&lt;/h3&gt;

&lt;p&gt;To understand the challenges inherent in LangChain, consider a recent demo example: a &lt;a href="https://blog.langchain.dev/structured-report-generation-blueprint/" rel="noopener noreferrer"&gt;structured report generator&lt;/a&gt;. This LangChain agent produces a comparative review report on a given topic, generating reports with this kind of structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# CPU and GPU Comparison Report

## CPU and GPU Comparison Overview

The computing landscape is evolving with advancements in CPU and GPU architectures. Central Processing Units (CPUs) and Graphics Processing Units (GPUs) are crucial components in modern computing, each with unique strengths and applications. Understanding the context and differences between these two is essential for optimizing performance in various fields, from mobile computing to artificial intelligence and deep learning.

## Context for Comparison

The latest CPU models, such as the Arm Cortex-X925, boast enhanced decode and dispatch widths, improved instruction window sizes, and advanced branch prediction units, significantly boosting computing speed and efficiency. On the other hand, GPUs like the NVIDIA H200 Tensor Core GPU leverage massively parallel architectures to handle complex AI and HPC challenges, with features like enhanced AI inference, power efficiency, and multi-instance GPU technology. This comparison delves into the core features, architectures, and implementations of CPUs and GPUs, exploring their example use cases and the innovations driving their development.

## CPU Architecture and Implementation

The latest advancements in CPU architecture are revolutionizing computing speed and efficiency.

The Arm Cortex-X925, for example, features a 10-wide decode and dispatch width, significantly increasing the number of instructions processed per cycle. This enhancement allows the core to execute more instructions simultaneously, leading to better utilization of execution units and higher overall throughput.

- Key features of the Cortex-X925 include:
  - 10-wide decode and dispatch width.
  - Doubled instruction window size.
  - 2X increase in L1 instruction cache bandwidth.
  - Advanced branch prediction unit.

The Cortex-X925 is designed for high-performance mobile computing and highlights Arm’s focus on efficient PPA (Power, Performance, Area).

## Core Features and Architecture of GPU

The GPU’s massively parallel architecture is its core feature, enabling it to handle thousands of threads simultaneously, which is essential for AI and deep learning applications.

The NVIDIA H200 Tensor Core GPU is designed to tackle complex AI and HPC challenges, with enhancements in memory, processing power, and scalability. Its HBM3e memory offers 141GB at 4.8 TBps, improving performance for large language models.

- Key features include:
  - Enhanced AI inference, power efficiency, and multi-instance GPU technology.
  - Support for a range of precision formats, from FP8 and BF16 to FP32 and INT8.
  - Use in healthcare for analyzing medical images, enabling faster and more accurate diagnoses.

## Conclusion and Recommendations

The choice between CPU and GPU depends on specific use cases and requirements.

| Component | Strengths                                                                 | Weaknesses                              |
|-----------|---------------------------------------------------------------------------|-----------------------------------------|
| CPU       | High-performance mobile computing, efficient PPA, advanced branch prediction | Limited parallel processing capability  |
| GPU       | Massively parallel architecture, enhanced AI inference, power efficiency   | High power consumption for complex computations |

Based on the comparison, CPUs like the Arm Cortex-X925 are ideal for high-performance mobile computing, while GPUs like the NVIDIA H200 Tensor Core are better suited for AI, deep learning, and complex computations. Selecting the appropriate component depends on specific project requirements and considerations like power consumption and parallel processing needs.

## Sources

- Arm Unveils 2024 CPU Core Designs, Cortex X925, A725, and A520
- AMD vs Intel CPU Roadmap 2024-2025: Ryzen AI vs Core Ultra
- M4 vs M3 Chip: These Improvements and Features You May Wonder!
- NVIDIA H200 Tensor Core GPU: Architecture, Features, and Use Cases
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The developer designed this agent to use a six-step process to generate reports:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Generate the report structure and outline based on the topic and user-defined parameters.&lt;/li&gt;
&lt;li&gt;Create targeted search queries for each section using predefined templates.&lt;/li&gt;
&lt;li&gt;Fetch relevant content via the Tavily API for each query.&lt;/li&gt;
&lt;li&gt;Generate individual report sections using an LLM and the gathered research.&lt;/li&gt;
&lt;li&gt;Organize the sections into a cohesive document.&lt;/li&gt;
&lt;li&gt;Add introductory and concluding sections, compiling the final report.&lt;/li&gt;
&lt;/ol&gt;
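&lt;p&gt;The six steps above translate directly into imperative code. The sketch below substitutes stub functions for the LLM and the Tavily API, which makes the rigidity visible: every step, and its ordering, is fixed in the program:&lt;/p&gt;

```python
# Each step of the report pipeline is hard-coded; changing the flow means
# changing the program. Stub functions replace the LLM and the search API.
def plan_outline(topic):                       # step 1: report structure
    return [f"{topic}: Overview", f"{topic}: Comparison", f"{topic}: Conclusion"]

def make_queries(section):                     # step 2: targeted queries
    return [f"search: {section}"]

def fetch_content(query):                      # step 3: stands in for Tavily
    return f"results for '{query}'"

def write_section(section, research):          # step 4: stands in for the LLM
    return f"## {section}\n{research}"

def generate_report(topic):
    sections = []
    for section in plan_outline(topic):
        queries = make_queries(section)
        research = " ".join(fetch_content(q) for q in queries)
        sections.append(write_section(section, research))
    body = "\n\n".join(sections)               # step 5: assemble document
    return f"# {topic} Report\n\n{body}\n\nEnd of report."  # step 6: finalize

report = generate_report("CPU vs GPU")
```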

&lt;p&gt;This process is entirely &lt;strong&gt;imperative&lt;/strong&gt; — every step is explicitly programmed, leaving little room for flexibility or adaptation. While this imperative approach is effective for building sophisticated workflows, it demands significant programming expertise and a lot of manual effort, making it both time- and resource-intensive. This raises the question: Is there a way to reduce complexity and make LangChain more accessible?&lt;/p&gt;

&lt;h3&gt;
  
  
  Is LangGraph the Answer?
&lt;/h3&gt;

&lt;p&gt;LangGraph, an optional tool for LangChain, introduces a visual interface for designing workflows with drag-and-drop functionality. This approach makes LangChain more accessible to non-programmers and provides a clear visualization of workflows, helping users identify errors or inefficiencies more easily. For simple agents, LangGraph can streamline the process by removing the need for direct coding.&lt;/p&gt;

&lt;p&gt;However, LangGraph is not a declarative system. While graphical systems can, in principle, be declarative — allowing users to visually specify the desired outcome without detailing the implementation — LangGraph is strictly imperative. In LangGraph, each node represents a specific step in the workflow, and the edges define the flow of execution from one step to the next. Developers are still responsible for defining the logic, configuring prompts, and troubleshooting at each node. As such, LangGraph does not eliminate complexity but rather shifts it into a visual representation. Its limitations underscore the need for tools that go beyond visualization to genuinely simplify workflow creation.&lt;/p&gt;




&lt;h3&gt;
  
  
  Reimagining LangChain with Directive Programming
&lt;/h3&gt;

&lt;p&gt;Now, let’s consider how the same structured report generator agent could be built using directive programming. Instead of explicitly defining every step, the user would simply describe the desired outcome in natural language, and the system would handle the rest.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 1: The Directive
&lt;/h4&gt;

&lt;blockquote&gt;
&lt;p&gt;“Create a comparative report generator. The report should compare two technologies, highlighting their key differences and similarities. The report should look like this: . Use the Tavily search engine for research, and here is the API key.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Directives can be as simple or as detailed as convenient. With this simple directive, the system would:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Identify and generate the LangChain workflow needed to create the report.&lt;/li&gt;
&lt;li&gt;Generate and manage prompt templates tailored to each section.&lt;/li&gt;
&lt;li&gt;Fetch relevant data, organize it, and structure the report.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach streamlines development, shifting the focus from manual configuration to outcome-oriented instruction. As a demonstration of this paradigm, an OpenAI model generated the following report from the directive above.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# LangChain and LlamaIndex Comparative Report

## Overview

In the rapidly evolving field of Natural Language Processing (NLP), frameworks like LangChain and LlamaIndex have emerged as pivotal tools for developing applications powered by Large Language Models (LLMs). Both frameworks offer unique features tailored to specific use cases, and understanding their distinctions is crucial for developers aiming to optimize their NLP applications.

## Context for Comparison

LangChain and LlamaIndex are designed to facilitate the integration and utilization of LLMs in various applications. While they share the common goal of enhancing LLM-powered solutions, their approaches and core functionalities differ. This report delves into their architectures, key features, and ideal use cases to provide a comprehensive comparison.

## LangChain Architecture and Implementation

LangChain is a versatile framework that provides a modular and flexible set of tools for building a wide range of NLP applications. Its architecture is designed around several key components:

- **Prompts**: Standardized interfaces for creating and managing prompts, allowing customization and reuse across different models and applications.
- **Models**: Unified interfaces for interacting with various LLMs, simplifying integration with models from providers like OpenAI and Anthropic.
- **Memory**: Advanced memory management capabilities that retain information from previous interactions, enabling context-aware and coherent conversations.
- **Chains**: Sequences of operations where the output of one step serves as the input for the next, facilitating the creation of complex workflows.
- **Agents**: Components that determine and execute actions based on input, utilizing LLMs to decide sequences and leveraging various tools to accomplish tasks.

LangChain's design emphasizes flexibility and extensibility, making it suitable for applications such as chatbots, content generation tools, and complex workflow automation systems. ([datacamp.com](https://www.datacamp.com/blog/langchain-vs-llamaindex?utm_source=chatgpt.com))

## LlamaIndex Architecture and Implementation

LlamaIndex is a framework primarily designed for search and retrieval tasks, focusing on data indexing and querying. Its architecture includes:

- **Data Connectors**: Tools for ingesting data from various sources, transforming it into numerical embeddings that capture semantic meaning.
- **Indexing**: Efficient methods for structuring and storing data to facilitate quick retrieval.
- **Query Engines**: Interfaces that allow for the retrieval of relevant information based on semantic similarity to user queries.

LlamaIndex excels in handling large datasets, enabling quick and accurate information retrieval. It is particularly well-suited for applications that require advanced search capabilities and efficient data indexing. ([datacamp.com](https://www.datacamp.com/blog/langchain-vs-llamaindex?utm_source=chatgpt.com))

## Core Features and Comparison

| Feature             | LangChain                                                                                         | LlamaIndex                                                                                         |
|---------------------|---------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------|
| **Primary Focus**   | Building a wide range of NLP applications with modular components.                                | Optimized for search and retrieval tasks with efficient data indexing.                             |
| **Customization**   | Offers extensive customization options, supporting complex workflows and integrations.            | Provides limited customization, focusing on efficient indexing and retrieval.                      |
| **Use Cases**       | Suitable for chatbots, content generation, and applications requiring advanced context retention. | Ideal for internal search, knowledge management, and enterprise solutions requiring quick retrieval. |
| **Performance**     | Efficient in handling complex data structures and workflows.                                      | Optimized for speed and accuracy in retrieval tasks.                                               |
| **Integration**     | Supports integration with various tools and LLMs, offering flexibility in application development.| Primarily focuses on data connectors and indexing, with less emphasis on broader integrations.     |

([datacamp.com](https://www.datacamp.com/blog/langchain-vs-llamaindex?utm_source=chatgpt.com))

## Conclusion and Recommendations

The choice between LangChain and LlamaIndex depends on the specific requirements of your project:

- **LangChain** is ideal for applications that demand flexibility, extensive customization, and complex workflows, such as chatbots and content generation tools.
- **LlamaIndex** is better suited for projects that require efficient data indexing and quick retrieval capabilities, such as internal search systems and knowledge management solutions.

In some cases, combining the strengths of both frameworks might provide the best results, leveraging LangChain's modularity and LlamaIndex's efficient retrieval mechanisms. ([datacamp.com](https://www.datacamp.com/blog/langchain-vs-llamaindex?utm_source=chatgpt.com))

## Sources

- [LangChain vs LlamaIndex: A Detailed Comparison](https://www.datacamp.com/blog/langchain-vs-llamaindex)
- [Differences between Langchain &amp;amp; LlamaIndex](https://stackoverflow.com/questions/76990736/differences-between-langchain-llamaindex)
- [LangChain vs. LlamaIndex: A Comprehensive Comparison for Retrieval-Augmented Generation (RAG)](https://medium.com/@tam.tamanna18/langchain-vs-llamaindex-a-comprehensive-comparison-for-retrieval-augmented-generation-rag-0adc119363fe)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Step 2: Iterative Feedback
&lt;/h4&gt;

&lt;p&gt;In directive programming, the development process is inherently iterative, driven by feedback rather than manual coding or debugging. The user begins by reviewing the system’s current results, evaluating how well they align with the desired outcome, and providing feedback to refine the output.&lt;/p&gt;

&lt;p&gt;For example, after the system generates an initial report based on a directive, the user might request changes such as:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Add a summary section at the top.”&lt;/p&gt;

&lt;p&gt;“Focus more on performance metrics in the comparison.”&lt;/p&gt;

&lt;p&gt;“Make the report more concise.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This feedback is then processed by the system, which dynamically updates its workflows, modifies prompt templates, and generates a refined result. The loop continues until the output converges with the user’s expectations.&lt;/p&gt;

&lt;p&gt;Importantly, the system decides how to implement the updated logic based on the feedback. After several iterations, the workflow may evolve into something resembling the original six-step process — or it may become more streamlined or more complex. The decision about how many steps are necessary, and what they should entail, is entirely left to the generative AI system. This flexibility allows the system to optimize the workflow to suit the strengths and weaknesses of the tools or models it employs.&lt;/p&gt;

&lt;p&gt;In the short term, users may still need to offer suggestions about implementation strategies, especially when the system struggles to achieve the desired results. For example, a user might say, “Why don’t you first determine the sections of the report and then work on each section individually?” This kind of guidance can help the system refine its approach to solving more complex tasks. However, as LLMs become smarter and more capable, they will increasingly be able to infer such strategies on their own, further reducing the need for user input on implementation details.&lt;/p&gt;

&lt;p&gt;By delegating implementation choices to the system, users focus solely on defining high-level goals and identifying areas for improvement. The system adapts dynamically, ensuring an efficient and collaborative process while freeing users from the burden of manually reconfiguring workflows or logic. This iterative feedback mechanism ensures both flexibility and precision, with the system evolving intelligently to meet the user’s requirements.&lt;/p&gt;
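&lt;p&gt;The loop described above can be sketched in a few lines of Python. The &lt;code&gt;generate&lt;/code&gt; function here is a hypothetical placeholder standing in for the generative AI system; in a real system it would call an LLM and rebuild its own workflows and prompt templates internally.&lt;/p&gt;

```python
# Minimal sketch of the directive-programming feedback loop.
# `generate` is a hypothetical stand-in for the generative system;
# here it simply folds each round of feedback into the artifact.

def generate(directive, feedback_history):
    report = [f"Report for: {directive}"]
    report.extend(f"Revision applied: {f}" for f in feedback_history)
    return "\n".join(report)

def directive_loop(directive, feedback_rounds):
    """Refine the output until the user has no further feedback."""
    history = []
    result = generate(directive, history)
    for feedback in feedback_rounds:           # user reviews and responds
        history.append(feedback)
        result = generate(directive, history)  # system decides HOW to apply it
    return result

result = directive_loop(
    "Compare LangChain and LlamaIndex",
    ["Add a summary section at the top", "Make the report more concise"],
)
print(result)
```

&lt;p&gt;The point of the sketch is the shape of the loop, not the placeholder logic: the user supplies only the directive and the feedback, while every implementation decision stays inside &lt;code&gt;generate&lt;/code&gt;.&lt;/p&gt;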

&lt;h3&gt;
  
  
  A Glimpse of the Future
&lt;/h3&gt;

&lt;p&gt;LangChain provides a powerful foundation for building AI systems, but its reliance on imperative programming — whether through code or visual workflows — limits its accessibility and usability. Directive programming offers a transformative vision: enabling users to focus on outcomes rather than implementation details.&lt;/p&gt;

&lt;p&gt;In this new agentic world, where AI can handle complexity for us, it’s time to rethink tools like LangChain. Building complex AI systems could become as simple as having a conversation.&lt;/p&gt;

</description>
      <category>langchain</category>
      <category>genai</category>
      <category>agents</category>
      <category>programming</category>
    </item>
    <item>
      <title>AI and the Monkey’s Paw: Why SaaS Will Endure</title>
      <dc:creator>Patrick Chan</dc:creator>
      <pubDate>Tue, 14 Jan 2025 19:27:00 +0000</pubDate>
      <link>https://dev.to/patrick_chan_0922a197d89d/ai-and-the-monkeys-paw-why-saas-will-endure-516p</link>
      <guid>https://dev.to/patrick_chan_0922a197d89d/ai-and-the-monkeys-paw-why-saas-will-endure-516p</guid>
      <description>&lt;p&gt;In W.W. Jacobs’ The Monkey’s Paw, a family’s wishes are granted in horrifyingly unintended ways. Each wish is fulfilled literally, but without understanding nuance or context, leading to outcomes that are technically correct but tragically wrong. The story underscores a timeless truth: poorly expressed or misinterpreted intent can lead to disastrous results.&lt;/p&gt;

&lt;p&gt;This cautionary tale feels especially relevant to the future of AI and SaaS. Some predict SaaS platforms will become obsolete as AI evolves to interpret user intent directly, dynamically generating workflows, tools, or code. Why rely on predefined systems when AI can seemingly create everything on the fly?&lt;/p&gt;

&lt;p&gt;It’s an exciting idea but overlooks a critical issue: human intent is rarely clear, and generative AI is inherently unpredictable. Without a structure to clarify and align intent, we risk the same problem as the Monkey’s Paw: systems fulfilling requests in ways that miss the mark — or worse, cause harm. This is why SaaS isn’t going anywhere. As automation takes center stage, SaaS provides cognitive frameworks — structured concepts that organize, validate, and align intent with action. These frameworks ensure technology delivers outcomes that match goals and values, even when intent is messy, ambiguous, or incomplete.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the “Direct Intent” Model Falls Apart
&lt;/h2&gt;

&lt;p&gt;The idea of bypassing SaaS entirely sounds appealing: imagine asking AI to “find the most valuable customers” or “optimize system performance” and having it execute seamlessly. No SaaS tools, no predefined workflows — just AI accessing a database directly.&lt;/p&gt;

&lt;p&gt;The problem? Human intent is rarely straightforward. A request like “find the most valuable customers” leaves much open to interpretation. Does “most valuable” mean those who spend the most, have the longest relationships, or offer the best profit margins? Without clarification, the system will choose one interpretation, which might not align with your goals.&lt;/p&gt;

&lt;p&gt;This is the core flaw of a direct intent-to-database model: it assumes perfect clarity in communication, but humans rarely operate that way. As complexity increases, so does the risk of misaligned outcomes. Without structured systems to refine and guide intent, results can easily become misinterpreted, misaligned, or even harmful.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why SaaS Matters: Cognitive Frameworks
&lt;/h2&gt;

&lt;p&gt;SaaS platforms provide more than task automation or data management — they offer cognitive frameworks that shape and refine intent. Consider the spreadsheet. At first glance, it’s just a grid of rows and columns. But its real power lies in how it helps users structure and clarify goals like “track project budgets” or “analyze sales data” into actionable formats with labeled columns, rows of data, and formulas.&lt;/p&gt;

&lt;p&gt;Spreadsheets don’t just store information — they shape how people think about data, forcing users to define what’s needed, how it relates, and what outcomes to measure. More complex SaaS tools extend this principle, acting as intermediaries between messy human intent and precise machine execution.&lt;/p&gt;

&lt;p&gt;Cognitive frameworks refine intent by breaking abstract goals into manageable components, aligning them with best practices, and validating them against known rules. They foster shared understanding, ensuring teams and systems are aligned. Without these frameworks, workflows would devolve into chaos, and AI’s probabilistic outputs would be far less reliable.&lt;/p&gt;
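&lt;p&gt;One way to picture a cognitive framework is as a validation layer that forces ambiguous intent into explicit, checked components before anything executes. The field names and rules in this sketch are illustrative, not drawn from any particular SaaS product:&lt;/p&gt;

```python
# Sketch: a cognitive framework as a validation layer between
# free-form intent and execution. All names here are illustrative.

REQUIRED_FIELDS = {"goal", "metric", "scope"}
KNOWN_METRICS = {"total_spend", "tenure_days", "profit_margin"}

def refine_intent(intent):
    """Reject under-specified intent instead of guessing."""
    missing = REQUIRED_FIELDS - set(intent)
    if missing:
        return {"ok": False, "needs": sorted(missing)}
    if intent["metric"] not in KNOWN_METRICS:
        return {"ok": False, "needs": ["a known metric"]}
    return {"ok": True, "plan": f"rank customers by {intent['metric']}"}

# A vague request is bounced back with questions, not executed.
vague = refine_intent({"goal": "find the most valuable customers"})

# A fully specified request yields an unambiguous plan.
clear = refine_intent({"goal": "find the most valuable customers",
                       "metric": "profit_margin",
                       "scope": "last_quarter"})
```

&lt;p&gt;The contrast between the two calls is the whole argument: the framework turns “fulfill the request literally” into “clarify, then act.”&lt;/p&gt;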

&lt;h2&gt;
  
  
  Why SaaS Will Endure
&lt;/h2&gt;

&lt;p&gt;SaaS platforms provide more than software — they offer scaffolding that bridges the gap between human intent and machine action. By refining, validating, and aligning goals, SaaS ensures systems deliver what users need, not just what they request. As AI becomes more integrated into workflows, SaaS will evolve to harness AI’s dynamic capabilities while retaining its core role as a reliable structure.&lt;/p&gt;

&lt;p&gt;Some who predict the end of SaaS may mean, more narrowly, the disappearance of hand-written code in SaaS. While AI can generate code dynamically, that prediction overlooks a fundamental challenge: there are compelling reasons why prewritten, tested code will endure. AI-generated code, by its probabilistic nature, always carries a degree of unpredictability. No matter how advanced AI becomes, it cannot match the consistency, stability, and reliability of code that has been rigorously tested and refined over time. When predictability and certainty are non-negotiable, traditional prewritten code provides a level of assurance that probabilistic systems inherently cannot.&lt;/p&gt;

&lt;p&gt;The promise of AI is vast, but intent is often more ambiguous than we realize. Without frameworks to refine and structure it, we risk outcomes that technically fulfill our requests but deviate from our true goals or values — a modern-day Monkey’s Paw scenario. SaaS prevents this by offering clarity and structure, ensuring human intent is effectively translated into actionable and meaningful results. It’s not just software — it’s the foundation for aligning intent with desired outcomes.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
