Yaohua Chen for ImagineX


Is AI Agent Development Just About Calling APIs? Where's the Real Difficulty?

The Bottom Line First

Calling APIs is indeed the entirety of Agent development — just like cooking is indeed putting ingredients in a pot. Technically correct, but it perfectly explains why some people produce Michelin-star dishes while others produce culinary disasters.

Saying the conclusion without explanation is meaningless. Let's actually build an Agent and walk through it together. But before diving in, let's take 30 seconds to clarify what an Agent actually is.


What Is an Agent, Exactly?

The original interaction model with large language models (LLMs) was simple: you ask a question, it gives an answer. One question, one answer, done. If you wanted it to do something complex, you had to manually break tasks into small pieces and feed them one round at a time. You were the "orchestrator"; the LLM was just a passive tool that responded on demand.

What an Agent does is fundamentally one thing: it adds a loop to this question-and-answer model. The model no longer just answers you once. Instead, it judges "what else do I need to do," calls external tools to get results, feeds those results back to itself, thinks about what to do next, and repeats until the task is complete. This loop transforms a large model from a "responder" into an "executor."

Agent Execution Loop:

```
User Input → LLM Reasoning → Need to call a tool?
                                      │
                    ┌──── Yes ────────┴──── No ────┐
                    ▼                              ▼
           Select Appropriate Tool           Task Complete?
                    │                              │
                    ▼                         Yes  ▼
           Call External Tool           Return Final Result
         ┌──────────────────┐
         │  Check Emails    │
         │  Check Calendar  │
         │  Create Meeting  │
         └──────────────────┘
                    │
                    ▼
         Get Tool Return Results
                    │
                    ▼
           Update Context ──────► (loop back to LLM Reasoning)
```

Conceptually, it's that simple. A while loop plus tool-calling capability — that's your Agent skeleton. So many people read this and think, "There's no real technical depth here?" True, the skeleton is simple. But making that loop run stably, reliably, and efficiently in the real world — that is the real engineering challenge.
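That skeleton really does fit in a dozen lines. Here's a minimal sketch in Python, where `call_llm` and `run_tool` are illustrative stand-ins for a real model client and tool dispatcher, not any particular SDK's API:

```python
# Minimal Agent loop sketch. `call_llm` and `run_tool` are stand-ins for
# a real LLM client and tool dispatcher; the names are illustrative only.

def run_agent(user_input, call_llm, run_tool, max_steps=10):
    """Drive the reason->act loop until the model returns a final answer."""
    context = [{"role": "user", "content": user_input}]
    for _ in range(max_steps):  # hard cap so a confused model can't loop forever
        reply = call_llm(context)
        if reply.get("tool_call") is None:
            return reply["content"]            # no tool needed: task complete
        result = run_tool(reply["tool_call"])  # call the external tool
        context.append({"role": "assistant", "content": str(reply["tool_call"])})
        context.append({"role": "tool", "content": result})  # feed result back
    return "Stopped: step budget exhausted"
```

The step cap is not optional decoration: without it, a confused model can loop forever, and every iteration costs tokens.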

Let's walk through it for real. Say you want to build an Agent that manages your schedule: read emails, check calendars, arrange meetings. Doesn't sound complicated, right? Let's look at what you encounter at each step.


Step 1: Call the API — Done in 10 Minutes

This step really is easy. Install an SDK, write a few lines of code, pass user input to the model, get back a result. If you've used the OpenAI or Claude API, you could write it blindfolded. You don't even need to write code yourself — open an AI coding tool like Claude Code or Cursor, describe your requirements in natural language, and they'll scaffold the project for you. Define a few tools — check calendar, read emails, create meeting — write the JSON schema, and the model can call them.

It runs. You ask it "what meetings do I have tomorrow?", it calls the calendar tool, gets the result, and reads it back in natural language. Perfect. You think: Agent development isn't that hard, maybe I can ship this in a week.

I've had this feeling before. Twenty years ago when I first learned C# development, I dragged a few controls onto a Windows Forms designer and had a running app — I thought Windows Forms development was no big deal either.

In theory, those AI coding Agents could handle every step ahead for you too. But in practice, every problem you encounter from here on isn't about how to write the code — it's about what code should be written. To really understand where Agent development gets hard, let's keep walking.


Step 2: Connect to Real APIs — The Nightmare Begins

In the demo you used mock data. Now you need to connect to real email and calendar services. Each user might use something different: Outlook, Gmail, Hotmail, and so on. Let's simplify and connect only to Microsoft's Graph API — it's broadly accessible and Outlook is the mainstream choice in enterprise.

The first problem arrives immediately: OAuth. Your users must authorize your application to access their Microsoft account. You need to register an app in Azure AD, handle OAuth redirects, securely store refresh tokens, and auto-refresh when tokens expire. None of this has anything to do with the LLM, but without it, your Agent can't take its first step. Microsoft's permissions model alone (delegated permissions vs. application permissions) can eat half a day of research.
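To make the token plumbing concrete, here's a hedged sketch of the caching-and-refresh logic you end up writing. `refresh_fn` stands in for the real refresh-token exchange against the provider's token endpoint; the class and parameter names are illustrative, not any SDK's API:

```python
import time

class TokenStore:
    """Sketch of access-token caching with proactive refresh.
    `refresh_fn` is a stand-in for the real OAuth refresh-token exchange
    (e.g. a POST to the provider's /token endpoint) and must return
    (access_token, lifetime_seconds)."""

    def __init__(self, refresh_fn, skew=300):
        self.refresh_fn = refresh_fn
        self.skew = skew          # refresh 5 minutes early to avoid races
        self.token = None
        self.expires_at = 0.0

    def get(self):
        # Refresh if we have no token or it's inside the expiry window.
        if self.token is None or time.time() >= self.expires_at - self.skew:
            self.token, lifetime = self.refresh_fn()
            self.expires_at = time.time() + lifetime
        return self.token
```

Every tool call goes through `get()`, so the Agent never sees an expired token mid-task.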

Then come the API edge cases. Microsoft Graph returns email lists paginated — 10 items per page by default, up to 50. Your Agent gets the first page without knowing how many more pages exist, and it will draw a confident-sounding conclusion from just those 10 emails. Ask "did anyone email me last week about Project A?" — the actual email is on page 3, but the Agent confidently tells you "no." You can add a "fetch next page" tool, but then the model has to know when to call it, how many pages to fetch, and when to stop.

Rate limiting is another problem. Microsoft Graph's throttling strategy is complex, with different thresholds per app, per user, and per resource type. If your Agent calls it a dozen times in a complex task, it will easily hit a 429 error. What happens then? The model doesn't know what "429 Too Many Requests" means — it just thinks the tool call failed and starts guessing reasons. And this is only for one provider. To build a real product, every provider (Gmail, Hotmail, etc.) has its own authentication system and API design. The workload multiplies.
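The standard defense is retry with backoff that honors the server's `Retry-After` hint, handled in the tool layer so the model never sees the 429 at all. A simplified sketch — the `(status, retry_after, body)` tuple interface is a stand-in for a real HTTP client:

```python
import random
import time

def call_with_backoff(request_fn, max_retries=5):
    """Retry a Graph-style API call on 429/503, honoring Retry-After.
    `request_fn` returns (status_code, retry_after_seconds_or_None, body);
    it's a simplified stand-in for a real HTTP client call."""
    for attempt in range(max_retries):
        status, retry_after, body = request_fn()
        if status not in (429, 503):
            return body
        # Prefer the server's Retry-After hint; fall back to jittered
        # exponential backoff so concurrent tasks don't retry in lockstep.
        delay = retry_after if retry_after else (2 ** attempt) + random.random()
        time.sleep(delay)
    raise RuntimeError("rate limited: retries exhausted")
```

The jitter matters once multiple Agent tasks run concurrently; without it, they all retry at the same instant and hit the limit again.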

The Tool Design Problem: Connecting the API is only half of the tool-call equation. The other half is how to design the tool itself — and this is trickier than you'd expect.

What should your "search emails" tool look like? If it's too rigid — only supporting sender-based queries — a user saying "find last week's emails about Project A" will fail immediately. So you add keyword search, time range filtering, attachment filtering? The more parameters, the more complex the schema, and the more likely the model is to fill things in wrong or miss fields. Berkeley's Function-Calling benchmark found that the more tools and the more complex the parameters, the worse model accuracy becomes. Smaller models degrade dramatically as tool count grows — BFCL data shows that models like Llama 3.1 8B can handle a modest number of tools but start failing unpredictably once tool count exceeds their capacity threshold.

On the other end, if you design a generic "search" tool that covers everything, the model won't know what to put in it. It might pass calendar query parameters to the email search tool, or call "send email" when it should "create a meeting." There's no right answer for tool granularity — too fine and user needs aren't covered, too coarse and the model can't handle it. The only way is to iterate in your specific context.

Tool description text matters enormously. For the same functionality, a description written as "Search emails" vs. "Search the user's Outlook inbox by keyword, sender, date range, or attachment presence. Returns a list of matching emails sorted by date" produces dramatically different model accuracy. In short, you don't just need to write code to implement a tool — you need to learn to write a manual for the model, and whether that manual is good or bad, you can only verify through repeated testing.
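To make "writing a manual for the model" concrete, here's what a hypothetical `search_emails` tool definition might look like in the JSON-schema style that most function-calling APIs accept. The parameter names and description wording are illustrative, not taken from any real product:

```python
# A hypothetical tool schema in the JSON-schema style used by most
# function-calling APIs. Names and descriptions are illustrative; the
# point is the level of detail the model needs to call the tool well.
SEARCH_EMAILS_TOOL = {
    "name": "search_emails",
    "description": (
        "Search the user's Outlook inbox by keyword, sender, date range, "
        "or attachment presence. Returns a list of matching emails sorted "
        "by date, newest first. Results are paginated; pass `page` to "
        "fetch more results."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "keywords": {"type": "string",
                         "description": "Free-text terms matched in subject or body."},
            "sender": {"type": "string",
                       "description": "Filter by sender email address."},
            "after": {"type": "string",
                      "description": "ISO 8601 date; only emails on or after this date."},
            "before": {"type": "string",
                       "description": "ISO 8601 date; only emails before this date."},
            "has_attachment": {"type": "boolean",
                               "description": "If true, only emails with attachments."},
            "page": {"type": "integer",
                     "description": "Result page number, starting at 1.",
                     "default": 1},
        },
        "required": [],
    },
}
```

Every field description is part of the model's "manual"; each one you leave vague is a parameter the model will eventually fill in wrong.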

Practitioner write-ups keep converging on roughly the same split: in production-grade Agent systems, the model does only about 30% of the work; the remaining 70% is tool engineering. What you think of as "calling an API" is mostly spent on the design and integration work surrounding that API.


Step 3: Multi-step Tasks — Errors Start Snowballing

Good — the API is connected and basically working. Now try a slightly more complex request: "Find a time slot next week when everyone is free, schedule a project review meeting, and then email all attendees."

This task requires: querying multiple people's calendar availability, finding the intersection, creating a meeting invite, drafting an email, and sending it. Five or six steps, each depending on the previous one's result.

Here's the problem. Berkeley's Function-Calling Leaderboard (BFCL) shows that even the best models struggle with tool-calling accuracy — top scores hover around 80% on overall benchmarks, and accuracy drops further as tool count and parameter complexity increase. That means roughly 1 in 5 calls has an error. The probability of a five-step task completing entirely correctly? About 0.8 to the fifth power — less than 33%. Your Agent has roughly a two-thirds chance of going wrong at some step.
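The arithmetic is easy to check:

```python
# Per-step accuracy compounds multiplicatively across a multi-step task.
per_step_accuracy = 0.80
steps = 5
task_success = per_step_accuracy ** steps   # 0.8^5 = 0.32768
print(f"{task_success:.1%}")                # roughly one clean run in three
```

And this assumes the steps fail independently, which in practice they don't: as the next section shows, one early error drags the later steps down with it.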

Worse, Galileo's research found that early small errors amplify through later steps. Say the model misparses a date format in step one and reads Tuesday as Wednesday. Every subsequent step builds on that error. It creates a meeting at the wrong time, then sends everyone an email notification with the wrong time. One small hallucination triggers a cascade of wrong actions.

At this point you realize: you need to add validation logic between each step, rollback mechanisms, and confirmation loops. None of this is taught in any LLM's API documentation.

Step 4: Guardrails — The Invisible Security Risk

And there's a deeper problem lurking here that most people don't think about until it's too late: guardrails. Your scheduling Agent has permissions to send emails, create meetings, and modify calendars. What happens when it hallucinates a participant name and sends a meeting invite to the wrong person? Or confidently deletes a calendar block because it "optimized" your schedule?

OWASP classifies this as "Excessive Agency" (LLM06:2025) — one of the top security threats in LLM applications. It breaks down into three failure modes: excessive functionality (your Agent has access to 50 actions when it only needs 5), excessive permissions (your Agent can modify any calendar, not just the user's), and excessive autonomy (the Agent sends emails and creates meetings without any human confirmation gate).

In practice, you need to separate "read" tools from "write" tools and put explicit approval gates on write operations. High-stakes actions — sending external emails, deleting calendar entries, modifying shared resources — should run in a "dry run" mode where the Agent describes what it would do and waits for human confirmation before executing. You need to design for rapid rollback, because the question isn't if your Agent will take a wrong action — it's when. And you need to enforce the principle of least privilege: your Agent should request only the minimum API permissions it needs, not broad access "just in case."
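A minimal sketch of such a gate, with `confirm` standing in for however your product asks the user (a UI dialog, a Slack message, and so on); tool and parameter names are illustrative:

```python
# Sketch of a human-in-the-loop gate: read tools run freely, write tools
# first describe their plan and execute only after explicit confirmation.
# `confirm` is a stand-in for the product's approval UI.

WRITE_TOOLS = {"send_email", "create_meeting", "delete_event"}

def execute_tool(name, args, tools, confirm):
    if name in WRITE_TOOLS:
        plan = f"About to run {name} with {args}"
        if not confirm(plan):               # explicit approval gate
            return "Cancelled by user."
    return tools[name](**args)
```

The key design choice is that the gate lives in the tool layer, not in the prompt: you cannot prompt a model into never hallucinating a recipient, but you can refuse to send until a human has seen the plan.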

None of this is glamorous engineering. But skip it, and one hallucinated email from your Agent can undo months of user trust.


Step 5: Open It to Real Users — The Bill Scares You Awake

You tested the first three steps in your development environment and things seemed fine. But once you open the Agent to real users, the nightmare comes from a direction you never anticipated: the bill.

You used Claude Sonnet or GPT-4o for development testing — great results, a few cents per complex task, no pain. But with real users, hundreds of requests per day, each averaging four or five tool call rounds, each carrying substantial context — you look at the monthly bill and see a small feature burning thousands of dollars a month. What if user volume grows ten times?

You think: a user saying "what meetings do I have tomorrow?" — does that really need the most powerful model? That's overkill.

So you start thinking about model routing: different tasks use different base models. Simple queries go to cheap small models (Haiku, GPT-4o mini, Gemini Flash); complex multi-step reasoning goes to large models (Claude Sonnet, GPT-4o, Gemini Pro). But who judges complexity?

  • Use a large model to judge? That costs money too.
  • Use a rule engine? Works for simple cases, but user inputs are endlessly variable and rules always have gaps.
  • Use a small model as a classifier? Now you've added another model component that needs tuning and maintenance.
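In practice most teams end up with a hybrid: cheap heuristics catch the obviously simple cases, and everything ambiguous defaults to the stronger model. A toy sketch of that shape — the heuristics and model names are placeholders, not a recommendation:

```python
# Rule-first model router with a safe default. The marker list and model
# names are illustrative placeholders; a real router would be tuned on
# logged traffic, not hand-written guesses.

CHEAP, STRONG = "small-model", "large-model"

def route(user_input, expected_tool_rounds=1):
    simple_markers = ("what", "when", "list", "show")
    if expected_tool_rounds <= 1 and user_input.lower().startswith(simple_markers):
        return CHEAP      # single lookup: a small model is usually enough
    return STRONG         # multi-step or ambiguous: pay for the reasoning
```

Note the asymmetry: mis-routing a simple query to the big model wastes a few cents, while mis-routing a complex task to the small model produces a wrong answer, so the default goes to the stronger side.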

And different models vary enormously in their tool-calling capabilities. A tool schema that works on Claude Sonnet may have parameters filled in wrong on Haiku. JSON that runs perfectly on GPT-4o may fail to parse on open-source models. Every time you swap a model, your carefully tuned prompts and tool descriptions may need to be re-adapted. This is why many teams eventually find that the token money saved doesn't cover the labor cost of multi-model adaptation.

To put concrete numbers on this: Claude Sonnet costs $3/$15 per million input/output tokens, while Claude Haiku costs $0.25/$1.25 — a 12x to 60x difference. GPT-4o vs. GPT-4o mini has a similar spread. Mid-sized Agent deployments easily burn $1K–$5K per month in token costs alone; complex Agents consuming 5–10 million tokens monthly aren't unusual. One underrated optimization: prompt caching. Anthropic's prefix caching can reduce costs by up to 90% and latency by 85% for repeated long prompts — a massive win for Agents that include the same system prompt and tool definitions in every call.

And cost isn't the only scaling problem — latency hits you just as hard. A multi-step scheduling task that checks four people's calendars, finds a common slot, creates a meeting, and sends emails can easily take 30–45 seconds end-to-end. Technically correct, but your users experience it as broken. The biggest UX win is streaming intermediate results: instead of a 45-second black box, show "Checking Alice's calendar... Found 3 available slots... Confirming with Bob..." — the total time is the same, but the perceived wait drops dramatically. Parallelizing independent tool calls (check all four calendars simultaneously instead of sequentially) helps with actual latency. But the hard tradeoff remains: smaller, faster models hallucinate more, so you can't just throw Haiku at everything to speed things up.
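Parallel fan-out is only a few lines once your tool calls are async. A sketch where `fetch_calendar` simulates a network call with a short sleep:

```python
import asyncio

# Sketch: fan out independent calendar lookups concurrently instead of
# sequentially. `fetch_calendar` is an illustrative stand-in for a real
# async API call.

async def fetch_calendar(person):
    await asyncio.sleep(0.01)                 # simulate network latency
    return (person, ["Mon 10:00", "Tue 14:00"])

async def fetch_all(people):
    # gather() runs all lookups concurrently, so total wall time is
    # roughly one call's latency, not len(people) calls.
    return dict(await asyncio.gather(*(fetch_calendar(p) for p in people)))
```

With four attendees, that's roughly one round-trip of latency instead of four — but only for the calls that are genuinely independent; the meeting creation still has to wait for all of them.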

Cost optimization looks like an operations problem, but it's actually an architecture problem. You need to make the model-calling layer pluggable from the very beginning — something most people never think about when writing a demo.


Step 6: Context Management — Your Agent Starts "Forgetting"

After a while, you notice a strange problem: the Agent "drifts" during long tasks. You give it a complex task requiring seven or eight conversation rounds, and by rounds four or five, it starts forgetting the original requirements and constraints.

This is what the industry calls "Agentic Amnesia." Research data is clear: when tasks are split across multiple conversation rounds, model performance degrades significantly — and without memory management strategies, Agents lose track of constraints, requirements, and earlier results as context accumulates.

The reason is that LLM context windows are finite. Every tool call's input and output consumes context space. Query five people's calendars, each returning a large JSON payload, and the context window is mostly full. Spotify's engineering team hit the exact same pitfall building a code Agent: once the context window filled up, the Agent "lost its direction" and forgot the original task after a few rounds.

You need to start doing Context Engineering. Anthropic defines it as "curating exactly what content goes into a limited context window from an ever-changing universe of information." In plain terms, it's the LLM version of memory management: you dynamically decide what the model "sees" at each reasoning step and what it "forgets." Which historical information gets compressed into summaries? Which key constraints must always be preserved? Which tool return values can be discarded?
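A crude sketch of one such curation pass: pin the system prompt and task constraints so they always survive, keep the most recent turns verbatim, and collapse older tool outputs into stubs. Real systems usually summarize with an LLM rather than truncate; the truncation here is just a stand-in for that step:

```python
# Context-curation sketch. Message dicts use a hypothetical "pinned" flag
# for constraints that must never be dropped; truncation stands in for
# real LLM-based summarization.

def curate_context(messages, keep_recent=6, stub_len=80):
    pinned = [m for m in messages if m.get("pinned")]        # constraints
    rest = [m for m in messages if not m.get("pinned")]
    old, recent = rest[:-keep_recent], rest[-keep_recent:]
    # Collapse old tool outputs (usually the bulkiest messages) to stubs;
    # other old turns are dropped entirely in this crude version.
    stubs = [{"role": m["role"],
              "content": m["content"][:stub_len] + " [truncated]"}
             for m in old if m["role"] == "tool"]
    return pinned + stubs + recent
```

Even this crude version encodes the core decisions: what is sacred (pinned constraints), what is fresh (recent turns), and what is safe to forget (stale tool payloads).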

The Manus team rebuilt their entire framework four times to get this right. Four times. They called this process "stochastic gradient descent" — inelegant, but effective.

There's also a subtler trap: research shows context length and hallucination rate are positively correlated. The longer the input, the more likely the model is to hallucinate. For Agent tasks that require large contexts, this is nearly an unresolvable structural paradox.

One emerging solution to this problem is Agent Skills, a mechanism pioneered by Anthropic. Where Context Engineering is about managing what's in the context window, Skills are about not putting things there in the first place. A Skill is a modular package of instructions, workflows, and best practices (typically a SKILL.md file plus optional scripts) that an Agent loads on demand. Think of it as pluggable expertise — a "Tax Compliance Skill" or a "Cloud Migration Skill" that transforms a general-purpose Agent into a domain specialist, without bloating the context window for every other task.

The design uses progressive disclosure: an Agent can have dozens of Skills installed but only loads the 2–3 it needs for any given task. This directly mitigates the context window pressure that causes Agentic Amnesia. Skills also enable composability — combining a code-review Skill with a git-automation Skill produces an Agent that can review and commit code without anyone writing explicit coordination logic.

The impact on the ecosystem has been rapid. OpenAI adopted structurally identical Skills for ChatGPT and Codex CLI. Microsoft's Semantic Kernel implements an equivalent "Plugins" abstraction. Marketplaces like SkillsMP have emerged with hundreds of thousands of community-built Skills. Anthropic has positioned Agent Skills as an open standard — and the convergence across platforms suggests it's becoming the standard abstraction for packaging Agent capabilities, much like MCP became the standard for Agent-to-tool communication.


Step 7: Want to Test It? You Don't Even Know How

At this point, your Agent barely works. But how do you determine whether it's "actually good" vs. "just barely functional"?

Traditional software development has mature testing methodologies: unit tests, integration tests, end-to-end tests — inputs are deterministic, expected outputs are deterministic. But an Agent's input space is open-ended (users can say anything) and its output is non-deterministic (the model generates different text each time). LangChain's blog put it perfectly: "every input is an edge case" — a challenge traditional software has never faced.

You might think to use LLM-as-judge to evaluate LLM outputs. A Hacker News developer explained the problem clearly: using a judge with the same architecture as the system being tested maximizes the probability of systematic bias. The judge and the tested Agent share exactly the same blind spots.

Anthropic's January blog also acknowledged: Agent interactions involving tool calls, state modifications, and behavior adjustments based on intermediate results are precisely the capabilities that make Agents useful — and simultaneously make them almost impossible to evaluate systematically.

The data is stark. LangChain's State of AI Agents survey (1,300+ professionals, 2025) found only about half of organizations run offline evaluations, and fewer than a quarter combine both offline and online evaluations. A multi-dimensional analysis of major Agent benchmarks found a 37% performance gap between lab testing and production environments — with reliability dropping from 60% to 25% in real-world conditions. An Agent that tests great in your dev environment may behave completely differently in users' hands.

Anyone who's done client-side development will understand this pain: your Agent might handle a request perfectly today, and fail on the same request tomorrow. Users can accept missing features — they can't accept inconsistency.

And evaluation is only half the story — the other half is observability in production. Evaluation tests what you expect the Agent to do; observability shows what it actually does with real users. When a user reports "the Agent scheduled my meeting at the wrong time," you need to trace back through every tool call: what calendar data was retrieved, what the LLM reasoned, what meeting parameters were generated, and why the wrong time was selected. Without tool call tracing, latency monitoring, and cost/token budget tracking, you're debugging blind. That "37% performance gap" between lab and production? Observability is how you find it. Tools like LangSmith and Arize have emerged specifically for this, but many teams still discover production failures only when users complain.
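Even a minimal in-process trace beats debugging blind. A sketch of a decorator that records each tool call's name, arguments, result size, and latency — in production this would ship to a tracing backend rather than an in-memory list:

```python
import functools
import time

TRACE = []   # in production: export to a tracing backend, not a list

def traced(fn):
    """Record every tool call's name, arguments, result size, and latency,
    so a wrong final answer can be traced back to the call that caused it."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        TRACE.append({
            "tool": fn.__name__,
            "args": kwargs or args,
            "result_chars": len(str(result)),
            "latency_ms": (time.perf_counter() - start) * 1000,
        })
        return result
    return wrapper
```

When that "wrong meeting time" report comes in, the trace tells you whether the calendar tool returned bad data or the model misread good data — two failures with completely different fixes.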


Step 8: Add Multi-Agent Collaboration? Complexity Explodes

Your scheduling Agent is working well, and you start thinking: could you add more specialized Agents? One for email, one for calendar, one for meeting notes, one for scheduling coordination. Clear division of labor, each handling its domain — sounds reasonable, right?

Microsoft's Azure SRE team went down this path. They initially built a massive system with 100+ tools and 50+ sub-Agents, and hit a pile of unexpected problems: the orchestrator Agent couldn't find the right sub-Agent (the correct one was "buried three hops away"); a buggy sub-Agent didn't just crash itself — it dragged down the entire reasoning chain; Agents kicked responsibility back and forth in infinite loops. They eventually scaled down to 5 core tools and a few general-purpose Agents, and the system became more reliable.

Their core lesson: scaling from one Agent to five doesn't multiply complexity by four — it grows exponentially. UC Berkeley's MAST framework analyzed 1,600+ Agent traces and found that 41–86.7% of multi-Agent systems fail in production, and 79% of problems come from the orchestration and coordination layer, not the technical implementation. How to divide work and how to communicate between Agents is far harder than how to write the code.

There are established orchestration patterns — sequential chains, concurrent fan-out, hierarchical supervisor models — and each has tradeoffs. ICLR 2025 research found that hierarchical architectures (one coordinator delegating to specialists) show only a 5.5% performance drop when individual Agents malfunction, compared to 10.5–23.7% for flatter architectures. This explains why Microsoft eventually simplified to a supervisor model. The practical advice is almost counterintuitive: start with fewer, more capable Agents rather than many specialized ones, and only decompose when a single Agent demonstrably can't handle the workload. The allure of clean role separation is strong, but the coordination overhead will eat you alive.
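The supervisor pattern itself is structurally simple — the hard part is everything around it. A toy sketch with specialist Agents as plain functions; the domain names and escalation string are illustrative:

```python
# Supervisor-pattern sketch: one coordinator owns the task and delegates
# to a small set of specialists; specialists never talk to each other
# directly. Specialist "Agents" are stand-in functions here.

def supervisor(task, specialists):
    """Route each subtask to its specialist; the coordinator owns all state."""
    results = []
    for subtask in task["subtasks"]:
        agent = specialists.get(subtask["domain"])
        if agent is None:
            # No matching specialist: escalate instead of guessing, so the
            # responsibility ping-pong described above can't start.
            results.append((subtask["domain"], "escalate: no specialist"))
            continue
        results.append((subtask["domain"], agent(subtask["goal"])))
    return results
```

Keeping all state in the coordinator is what gives the hierarchy its failure isolation: a broken specialist produces one bad result, not a poisoned shared conversation.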


Step 9: You Start Doubting — Where's the Bottleneck?

After months of work, your engineering gets more refined, but Agent performance always hits a ceiling you can't break through. You realize a harsh truth: all engineering optimization has one prerequisite — the underlying model needs to be capable enough.

An InfoQ interview with Alibaba Cloud's code platform lead captured it honestly: engineering challenges can be overcome, but model capability bottlenecks are far more daunting. An awkward industry reality: nearly every company building general-purpose Agent products uses Claude Sonnet as their first-choice model, because other models lag noticeably on instruction-following in complex tasks. The more instructions a model can follow, the more complex the problems it can handle. When a model can't even do basic instruction-following, no amount of engineering optimization above it helps.

You might think: what about using more powerful reasoning models — o3, o4-mini, DeepSeek R1, Claude Sonnet, Claude Opus? Research finds that reasoning models hallucinate more than base models. The data is striking: OpenAI's o3 has a 33% hallucination rate on person-specific factual questions — double the rate of its predecessor o1. The o4-mini reasoning model hits 48%. The root cause is that RL fine-tuning for chain-of-thought reasoning introduces high-variance gradients and entropy-induced randomness, making models more confident even when wrong. They answer rather than admit uncertainty.

The practical implication for Agents: reasoning models may handle complex task decomposition better, but they trade off reliability on factual tasks. One emerging pattern is to use reasoning models for planning (breaking down what needs to happen) and base models for execution and verification (actually doing it and checking the results). But this adds yet another layer of architectural complexity.

It's like finding your app is laggy, spending days optimizing code logic, and then discovering the bottleneck is hardware performance. Your engineering optimizations have limits, and beyond those limits lies the constraints of underlying capability.


Step 10: You Start Understanding the Framework Wars

At this point, you've definitely wrestled with whether to use LangChain, CrewAI, or similar frameworks. The Hacker News discussion has moved from debate to consensus: frameworks are useful for prototyping; in production they often become a burden.

A CTO shared on Hacker News that he built hundreds of Agents without any framework, using only chat completions plus structured output.

Anthropic's official guidelines also advise caution with frameworks, as they often make underlying prompts and responses opaque and harder to debug.

Here's the practical landscape: LangGraph (by LangChain) uses a graph-based architecture with nodes, edges, and conditional routing — it's powerful for complex multi-step reasoning and is used in production by 400+ companies. CrewAI takes a role-based approach where you define Agents by organizational roles — simpler to set up, adopted by 60% of the Fortune 500 for content generation and analysis workflows. AutoGen (Microsoft) was merged into the Microsoft Agent Framework in late 2025, reflecting a broader trend of frameworks consolidating. Each imposes its own abstractions, and those abstractions become constraints the moment your use case doesn't fit neatly.

There is one thing you genuinely need frameworks for: persistence and state management. Your Agent needs to pause while waiting for user confirmation, recover from checkpoints after errors, and resume long tasks mid-execution. Most lightweight solutions lack these capabilities — which is why orchestration engines like Temporal have risen in the Agent space. Temporal provides durable execution with an append-only event history, letting Agents recover from failures mid-execution. That's genuinely hard to build from scratch.

Perhaps more consequential than any framework is the emerging protocol and abstraction layer — three complementary standards that are reshaping how Agents are built and composed:

Model Context Protocol (MCP), created by Anthropic, standardizes how models interact with external tools and data sources. Instead of writing custom integrations for every API, MCP provides a universal interface with well-defined security boundaries. It's the "USB port" for Agent-to-tool connections.

Agent2Agent (A2A), backed by Google and Microsoft, tackles inter-Agent communication — enabling Agents from different providers and frameworks to discover each other and collaborate via standardized protocols. It's the "HTTP" for Agent-to-Agent interactions.

Agent Skills, pioneered by Anthropic (discussed in Step 6), solve a different problem entirely: domain knowledge and procedural expertise. MCP gives Agents access to tools; Skills give them the knowledge of how to use those tools effectively — modular, on-demand expertise that keeps context windows lean through progressive disclosure.

Together, these three layers — MCP (Agent-to-tool), Agent Skills (Agent knowledge), and A2A (Agent-to-Agent) — form a cohesive architecture. Developers building production Agents will likely use all three: MCP to plug into APIs and databases, Skills to inject domain expertise, and A2A to enable cross-ecosystem Agent collaboration. This matters more than framework choice in the long run, because these protocols define how Agents interoperate — regardless of what framework built them.

The truth is, framework choice isn't the core challenge of Agent development. The real challenges are the nine steps above. Frameworks are just tools. Choosing the wrong tool wastes time, but going in the wrong engineering direction wastes everything.


Conclusion

The ten steps above aren't something I made up sitting here. I built Agents myself, hit almost every pitfall listed, and some of the projects ultimately failed. The Agent worked flawlessly in my development environment — but in production, context window limits caused it to lose track of multi-step tasks, costs spiraled because I hadn't designed for model routing, and I had no observability to diagnose why users were getting wrong results. By the time I understood the real scope of the engineering required, the project had burned through its budget and patience. Looking back, the mindset of "it's just calling an API, how hard can it be?" was exactly the same as my mindset 20 years ago of "drag a few controls and you have an app." In the end, it was that failure that taught me the most.

Walk through these ten steps and you'll find that "calling APIs" accounts for roughly 5% of total Agent development effort. The other 95% is:

  • OAuth, rate limiting, and error handling in the tool layer (Step 2)
  • Getting tool design granularity and descriptions right (Step 2)
  • Validation and rollback for multi-step error cascades (Step 3)
  • Safety guardrails, least-privilege permissions, and human-in-the-loop gates (Step 4)
  • Cost control, prompt caching, model routing, and latency optimization (Step 5)
  • Context Engineering, memory management, and Agent Skills for progressive disclosure (Step 6)
  • Building evaluation and production observability from scratch (Step 7)
  • Complexity control for multi-Agent orchestration and coordination (Step 8)
  • Engineering around model capability ceilings and reasoning model tradeoffs (Step 9)
  • Navigating the framework/protocol landscape — MCP, A2A, and Agent Skills (Step 10)

LangChain calls this emerging discipline "Agent Engineering" — I think that's exactly right. Boston Consulting Group's research shows that only about a quarter of companies achieve significant ROI from their AI initiatives, and Agent projects are no exception. LangChain's survey found that 32% of companies cite "quality below standard" as the top barrier to shipping an Agent. These numbers say it all.

The enormous gap between Agent and Agent doesn't come from who's calling different APIs — it comes from the vastly different quality of the 95% of engineering that happens outside the API call. Calling an API is the entry threshold, something you can cross in a week. But between demo and product lies an entire system of engineering around reliability, observability, context management, and error recovery.

That's where Agent development is truly hard.
