The Evolution of Programming Paradigms
As technology develops, programming paradigms are constantly evolving. Andrej Karpathy, a founding member of OpenAI and the former director of AI and Autopilot Vision at Tesla, has described this evolution as a progression through software eras. In the Software 1.0 era, we programmed computers with programming languages such as Java and Python. In the Software 2.0 era, we program neural networks by adjusting their parameter weights.
After the advent of the large language model (LLM) era, our programming paradigm has undergone profound changes, heralding the arrival of the Software 3.0 era. The target of our programming is now the LLM itself. LLMs run on GPUs, unlike traditional applications that run on CPUs. Our programming languages are no longer Java, Go, or Python; instead, we use prompts, and the prompts we write run on the LLM. In other words, we program the LLM with prompts, and the resulting applications are what we call AI native applications. This shift has changed how we think about development paradigms and application development.
Core Concepts of AI Native Applications
As this is a new concept, many developers have only a vague understanding of what AI native applications are and what their architecture entails. To address this, Alibaba Cloud has defined a panoramic view of AI native application development to help everyone better understand, explore, and practice it. We will walk through these concepts in the following sections.
For an AI Agent to operate, it requires several core capabilities (a minimal code sketch of how they fit together follows this list), including:
● Perception: The agent must perceive its internal and external environment to receive inputs and produce outputs.
● Brain: An LLM that performs the reasoning and decision-making.
● Tools: The agent invokes external tools, including MCP tools, to execute necessary actions.
● Memory: This includes long-term and short-term memory. The context maintained during the application’s execution is critical.
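To make these four capabilities concrete, here is a minimal, framework-agnostic sketch in Java of how they might fit together in a single loop. The `Brain`, `Tool`, and `Agent` types are illustrative names, not APIs from any particular framework.

```java
// A minimal sketch of the four agent capabilities and how they fit together.
// All interface and class names here are illustrative, not from a specific framework.
import java.util.ArrayList;
import java.util.List;

interface Brain { String decide(String goal, List<String> memory); }   // LLM-backed decision making
interface Tool  { String name(); String invoke(String arguments); }    // external action, e.g. an MCP tool

class Agent {
    private final Brain brain;
    private final List<Tool> tools;
    private final List<String> memory = new ArrayList<>();             // short-term memory: the running context

    Agent(Brain brain, List<Tool> tools) { this.brain = brain; this.tools = tools; }

    String run(String userInput) {                                      // perception: receive input from the environment
        memory.add("user: " + userInput);
        String decision = brain.decide(userInput, memory);              // brain: let the LLM plan the next step
        for (Tool tool : tools) {
            if (decision.startsWith(tool.name())) {                     // tools: execute the chosen action
                String observation = tool.invoke(decision);
                memory.add("observation: " + observation);              // memory: keep results for later turns
                return observation;
            }
        }
        return decision;                                                // no tool needed: answer directly
    }
}
```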
After understanding the basic concepts above, how should we start developing an AI Agent?
First, a suitable development framework is needed to build the core of the AI Agent. Mainstream development languages offer many frameworks that help simplify the development steps. At the same time, as AI coding tools such as Tongyi Lingma, Cursor, and Claude Code continue to mature, low-code generation of Agents has also become a new and viable approach.
After an AI Agent is generated, it needs to rely on computing resources to execute tasks. Its runtime environment can be based on Kubernetes (K8s) or other computing paradigms, such as Function Compute. Specifically, the model inference and the operation of the MCP toolchain required for task execution both rely on the resource scheduling capabilities of the underlying runtime environment.
Once the runtime environment is established, the Agent’s underlying architecture relies on general-purpose middleware to support its core services. For example, you can use Nacos to implement unified management of prompts and dynamic registration and discovery of MCP services. You can use an AI gateway to implement a centralized proxy for multiple models and MCP services. At the same time, a message queue can be used to handle long-running, multi-stage tasks asynchronously.
When an AI Agent is built, its runtime observability is a key aspect to ensure system stability and optimization capabilities. Because the running logic of an Agent has a dynamic nature and uncertainty, such as multi-turn inference and event-driven behaviors, its internal state must be monitored in real time using data collection probes. For example, you can use the LoongSuite open source probe to collect data on token consumption, model inputs and outputs, and so on. With this information, we can analyze the performance, cost, and quality of the AI Agent.
The preceding content describes our panoramic view.
Key Issues in AI Agent Development
After explaining the basic concepts, let's discuss the key issues that need attention during the development process of an AI Agent.
Workflow Mode vs. Agentic Mode: When building an Agent, which mode should we use?
The Workflow mode is straightforward: we orchestrate fixed business flows through predefined steps, using low-code or high-code platforms. Its advantage is a high degree of predictability, which makes it a good fit for traditional business processes that demand certainty and leave no room for error. However, the Workflow mode can be inadequate for complex scenarios or tasks. For example, when an Agent must complete a highly uncertain task, the next step at any given stage may not be predetermined. In that scenario, you can use the Agentic mode, in which the LLM decides the next step at runtime and handles both planning and execution. Its advantage is flexibility. Common applications such as Deep Research and coding agents use the Agentic mode.
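The contrast between the two modes can be sketched in a few lines of Java. Everything here is illustrative: the `LlmClient` interface and the OCR/validation steps stand in for whatever model client and business steps you actually use.

```java
// Illustrative contrast between the two modes; LlmClient and the step methods are hypothetical.
import java.util.List;

interface LlmClient { String nextAction(String goal, List<String> history); }

class Modes {
    // Workflow mode: the steps and their order are fixed at design time.
    static String workflow(String invoiceImage) {
        String text   = ocr(invoiceImage);        // step 1: always OCR
        String fields = extractFields(text);      // step 2: always extract fixed fields
        return validate(fields);                  // step 3: always validate against rules
    }

    // Agentic mode: the LLM decides the next step at runtime until it declares the task done.
    static String agentic(String goal, LlmClient llm, List<String> history) {
        while (true) {
            String action = llm.nextAction(goal, history);
            if (action.startsWith("FINISH")) {
                return action;
            }
            history.add(execute(action));         // feed the observation back for the next decision
        }
    }

    static String ocr(String s)           { return s; }
    static String extractFields(String s) { return s; }
    static String validate(String s)      { return s; }
    static String execute(String a)       { return "result of " + a; }
}
```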
In business practice, technology selection often requires a trade-off between accuracy and cost-effectiveness. When a business has strict requirements for result accuracy, such as key field extraction in image recognition or invoice information structuring, you should use a Workflow. A Workflow implements verifiable processing logic through a predefined rule chain, sacrificing flexibility for predictable accuracy. At the same time, for tasks such as complex text information extraction, LLMs offer stronger semantic understanding, but their computation costs are significantly higher than those of traditional solutions. Actual test data shows that the cost of processing such tasks with a GPU cluster can be more than 10 times that of a CPU-based solution. In these situations, we need to weigh the Workflow mode against the Agentic mode, and a hybrid architecture design may ultimately strike the balance.
Single Agent vs. Multiple Agents: Do we need a single agent or multiple agents?
The second topic is single agents versus multiple agents: under which circumstances should you use each? In current practice, we recommend a single agent for simple scenarios with clear targets. Its advantage is that development and maintenance are relatively simple. However, it also has limitations. For example, the model's context window is limited. As the agent becomes more complex and executes step by step, it carries more and more context with each step. When the context approaches the window size, the model may hallucinate or behave unpredictably. At that point, we must ask whether a single agent can still complete the task and consider breaking the task down. The guiding principle is Occam’s razor: “entities should not be multiplied without necessity.” In general, use a single agent whenever you can.
Of course, in scenarios where task execution is clearly found to be very complex and requires intricate collaboration, we recommend using multiple agents to complete the task. Moreover, multi-agent systems offer a distinct advantage. When completing the same coding task with the same model, the collaboration of multiple agents can significantly improve accuracy in complex scenarios, compared with the single-agent mode. This effect has been verified through experiments and various practices.
For example, Deep Research is a typical multi-agent scenario. A Leader agent breaks down the task, assigns specific research and design tasks to sub-agents, and then aggregates the results from the sub-agents and returns them to the user.
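A rough sketch of this leader/sub-agent pattern, assuming hypothetical `SubAgent` and `Leader` types rather than any specific framework's API:

```java
// A minimal sketch of the leader/sub-agent pattern described above.
import java.util.ArrayList;
import java.util.List;

interface SubAgent { String research(String subTask); }

class Leader {
    private final List<SubAgent> workers;
    Leader(List<SubAgent> workers) { this.workers = workers; }

    String handle(String task) {
        List<String> subTasks = decompose(task);                    // break the task into sub-tasks
        List<String> results = new ArrayList<>();
        for (int i = 0; i < subTasks.size(); i++) {                 // each sub-agent works on its own, smaller context
            results.add(workers.get(i % workers.size()).research(subTasks.get(i)));
        }
        return aggregate(results);                                  // merge the partial results for the user
    }

    List<String> decompose(String task) {
        return List.of(task + " - background research", task + " - comparison and synthesis");
    }

    String aggregate(List<String> parts) { return String.join("\n\n", parts); }
}
```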
Prompt Engineering vs. Context Engineering: How should we implement prompt engineering, and should we adopt the increasingly popular context engineering?
The third topic is prompt engineering and context engineering. Prompt engineering was a previously popular concept. It mainly addresses how to interact with a model and ask the right questions so that the model can answer them accurately. The core idea is that the prompt must contain clear context, examples, and the relevant keywords.
However, we have found that the concept of Context Engineering has become increasingly popular recently. The reason is that agents are becoming more complex, there are many uncertainties during an agent's execution process, and the model's context is limited. Therefore, we need to solve the problem of how to provide the most effective information to the model within a limited context window.
In complex scenarios, the model input needs to integrate multi-source information, including prompts, documents retrieved by retrieval-augmented generation (RAG), tool calling results, and the current context status. This process is called context engineering. It requires precise filtering and assembly of relevant information so that the model can execute tasks based on a complete and efficient context; assembling these pieces of information has become a sophisticated art. At the same time, inference efficiency is closely related to the key-value (KV) cache. By placing fixed content (such as general-purpose templates and constant parameters) at the beginning and dynamic data (such as real-time input) at the end, you increase the likelihood of a KV cache hit during inference and reduce repeated computation overhead. This fine-grained management of information hierarchy and caching mechanisms has become a core direction for improving the performance of AI agents, and context engineering now deserves significant attention.
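The following sketch illustrates the assembly order discussed above: stable content goes first so the shared prefix can be reused by the KV cache, and the most volatile content goes last. The class, method, and parameter names are illustrative.

```java
// A sketch of the assembly order: stable content first, volatile content last,
// so that the shared prefix can be reused by the KV cache.
import java.util.List;

class ContextAssembler {
    String assemble(String systemTemplate,          // fixed: system prompt / general-purpose template
                    List<String> fewShotExamples,   // fixed: constant examples
                    List<String> ragDocuments,      // semi-dynamic: retrieved documents
                    List<String> toolResults,       // dynamic: latest tool-calling results
                    String userInput) {             // dynamic: real-time input
        StringBuilder ctx = new StringBuilder();
        ctx.append(systemTemplate).append("\n");                     // stable prefix -> high KV cache hit rate
        fewShotExamples.forEach(e -> ctx.append(e).append("\n"));
        ragDocuments.forEach(d -> ctx.append("[doc] ").append(d).append("\n"));
        toolResults.forEach(t -> ctx.append("[tool] ").append(t).append("\n"));
        ctx.append("[user] ").append(userInput);                     // most volatile content goes last
        return ctx.toString();
    }
}
```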
AI Native Application Reference Architecture
After discussing the preceding three questions, this section introduces the reference architecture for AI native applications. This architecture is centered around an AI Agent, whose operation relies on the collaboration of multiple technical components. The Agent itself can be built by using different development frameworks and deployed in compute instances. It obtains external data to support decision-making by invoking databases or vector databases.

User requests first access the system through an API Gateway and are then forwarded to the Agent module. This module interacts with models through a unified AI gateway. The AI gateway acts as a key proxy and provides general-purpose capabilities, such as protocol transformation for multi-model invocations and token rate limiting, and it smooths over the differences between various model APIs, which is especially valuable when multiple models coexist. The system uses Nacos to implement unified registration for public and private services and dynamic prompt management, which ensures the flexibility and extensibility of model invocations. For asynchronous tasks that involve long-period processing, the system relies on a message event mechanism to perform state management, decoupling task execution from the response flow in an event-driven manner.

All observability data generated by the components, such as performance metrics and invocation traces, is collected based on the standard OpenTelemetry protocol. The data is then uniformly aggregated to an observability platform by a LoongSuite probe and used for system diagnosis, model performance evaluation, and runtime optimization.
The following sections introduce these key components.
Spring AI Alibaba
The first component is Spring AI Alibaba. It is built on the open source Spring AI project and adds support for more capabilities, such as workflow and Agent modes and abstract configurations for single-agent and multi-agent setups. This helps Java developers build AI native applications more easily. On top of it, we have built higher-level business scenarios, such as the general-purpose Agent JManus, which is the Java implementation of Manus, as well as typical vertical Agents such as Deep Research and Data Agent. For Java developers, Spring AI Alibaba is one of the most feature-complete frameworks for developing AI applications and can be used immediately.
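As a flavor of what this looks like in code, here is a minimal controller based on the Spring AI ChatClient API that Spring AI Alibaba builds on. The exact starter, auto-configuration, and model settings (for example, a DashScope model) depend on the version you use, so treat this as a sketch rather than a copy-paste recipe.

```java
// A minimal sketch based on the Spring AI ChatClient API that Spring AI Alibaba builds on.
// The exact starter, auto-configuration, and model properties depend on the version you use.
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

@RestController
class ChatController {

    private final ChatClient chatClient;

    // The ChatClient.Builder is auto-configured for the model you declare
    // (for example, a DashScope model when using the Spring AI Alibaba starter).
    ChatController(ChatClient.Builder builder) {
        this.chatClient = builder
                .defaultSystem("You are a helpful assistant for our business domain.")
                .build();
    }

    @GetMapping("/chat")
    String chat(@RequestParam String query) {
        return chatClient.prompt()
                .user(query)
                .call()
                .content();
    }
}
```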
Nacos
In AI native application scenarios, the role of Nacos as a dynamic configuration management and service registry extends to the field of MCP service governance. When an Agent needs to access traditional microservices or third-party tools, it can convert the services into MCP APIs by using a locally started Local Server, or invoke the traditional services through a remote MCP Server. For MCP services that involve enterprise sensitive data or internal business logic, unified management must be implemented through a privately deployed MCP registry. This approach not only meets the AI Agent's need for flexible invocation of heterogeneous services but also ensures the security and controllability of enterprise-level service governance.
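As an illustration, the sketch below registers a privately deployed MCP server with Nacos using the standard Java naming client, so that agents or an AI gateway can discover it at runtime. The service name and metadata keys are assumptions for illustration; the privately deployed MCP registry mentioned above builds on this kind of registration.

```java
// A sketch of registering a privately deployed MCP server as a Nacos service so that
// agents can discover it dynamically. Service name and metadata keys are illustrative.
import com.alibaba.nacos.api.NacosFactory;
import com.alibaba.nacos.api.exception.NacosException;
import com.alibaba.nacos.api.naming.NamingService;
import com.alibaba.nacos.api.naming.pojo.Instance;

public class McpServiceRegistration {
    public static void main(String[] args) throws NacosException {
        NamingService naming = NacosFactory.createNamingService("127.0.0.1:8848");

        Instance instance = new Instance();
        instance.setIp("10.0.0.12");                       // where the MCP server is running
        instance.setPort(8000);
        instance.getMetadata().put("protocol", "mcp");     // illustrative metadata for tooling
        instance.getMetadata().put("transport", "sse");

        // Register so that agents (or an AI gateway) can discover the tool endpoint at runtime.
        naming.registerInstance("order-query-mcp-server", instance);

        // An agent-side lookup would then resolve a healthy instance before invoking the tool.
        System.out.println(naming.selectOneHealthyInstance("order-query-mcp-server"));
    }
}
```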
Higress
Higress plays the core role of the AI gateway. It provides AI gateway proxy capabilities between our AI applications and models, implementing core AI features such as LLM caching, vector retrieval, and token rate limiting, along with protocol adaptation for multiple OpenAI-compatible model APIs and unified API management. A major recent focus is the MCP proxy capability: uniformly exposing private or public MCP services to the Agent and implementing capabilities such as fine-grained authentication and dynamic discovery. In addition, the AI gateway and the MCP gateway can be used to transform traditional OpenAPI interfaces into the standard MCP protocol.
Apache RocketMQ
In complex interaction scenarios involving AI Agents, Apache RocketMQ uses a message queue mechanism to solve the problems of state recovery and high retry costs in multi-round conversations. When an Agent interacts with a model in multiple stages, intermediate results, such as phased responses and streaming outputs, usually exist as temporary states. If a network interruption or service exception occurs, traditional architectures need to restart the retry flow from the beginning and repeat the GPU computation; the cost of this process can be more than ten times that of a retry in a CPU-era microservice scenario. RocketMQ maps a session in the AI framework to a topic in the message queue and writes all intermediate states to the queue storage in real time. For example, a gateway acts as a consumer that subscribes to this topic and gradually pushes the results to the client. If the current gateway node fails, the system can dynamically switch to a standby consumer node, which obtains the stored intermediate data by subscribing to the same topic and resumes from the breakpoint. This design avoids the repeated consumption of GPU resources and ensures the reliability of long-period tasks through the persistence feature of the message queue.
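A simplified sketch of this session-as-topic idea using the standard RocketMQ Java client follows. The topic, group names, and addresses are placeholders, and a real deployment would handle ordering, acknowledgment, and offset management more carefully.

```java
// Sketch: one conversation session mapped to one topic; the producer persists each
// intermediate chunk, and a (replaceable) consumer pushes results to the client.
import org.apache.rocketmq.client.consumer.DefaultMQPushConsumer;
import org.apache.rocketmq.client.consumer.listener.ConsumeConcurrentlyStatus;
import org.apache.rocketmq.client.consumer.listener.MessageListenerConcurrently;
import org.apache.rocketmq.client.producer.DefaultMQProducer;
import org.apache.rocketmq.common.message.Message;

public class SessionStreamRelay {
    public static void main(String[] args) throws Exception {
        String sessionTopic = "agent-session-42";                  // one session mapped to one topic

        // Producer side: the agent persists every streamed chunk as soon as it arrives.
        DefaultMQProducer producer = new DefaultMQProducer("agent-producer-group");
        producer.setNamesrvAddr("127.0.0.1:9876");
        producer.start();
        producer.send(new Message(sessionTopic, "chunk", "partial answer #1".getBytes()));
        producer.send(new Message(sessionTopic, "chunk", "partial answer #2".getBytes()));
        producer.shutdown();

        // Consumer side: a gateway node subscribes and pushes chunks to the client.
        // If this node fails, a standby consumer in the same group resumes from the stored
        // messages, so the expensive GPU inference does not have to be repeated.
        DefaultMQPushConsumer consumer = new DefaultMQPushConsumer("gateway-consumer-group");
        consumer.setNamesrvAddr("127.0.0.1:9876");
        consumer.subscribe(sessionTopic, "*");
        consumer.registerMessageListener((MessageListenerConcurrently) (msgs, ctx) -> {
            msgs.forEach(m -> System.out.println("push to client: " + new String(m.getBody())));
            return ConsumeConcurrentlyStatus.CONSUME_SUCCESS;
        });
        consumer.start();
    }
}
```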
Observability Solution
The following is an introduction to observability. During the development of AI applications, we have identified three major pain points related to observability. The first is implementation: how to make the application function correctly. The second is cost: how to use it economically. The third is effectiveness: how to ensure high-quality output.
The first problem is that when we build these applications and invoke the model, we find that the inference process is extremely slow and stutters. Errors may occur, but we do not know where the bottleneck is. This problem is about how to make the application work. The second problem is that after using the application for a period of time, the token consumption is very high, but we do not know where the tokens are consumed. This addresses the challenge of managing and reducing operational costs. The third problem is that we are not sure about the quality of the model's responses and need to evaluate it. This addresses the challenge of verifying and improving the quality of the application’s output. These are the three problems to be solved.
To solve these three problems, you must first collect observable data by using collection probes throughout the entire link where the Agent runs. What does this data include? It includes all our trace information. We hope to record what happens in each step, from the client side to the API Gateway, to the AI Agent, to the AI gateway, and then to the inside of our model. This includes the inputs and outputs of invocations, token consumption, the use of tools, and so on. The second step is to collect key metrics that can reflect the current running behavior. The third step is to perform quality analysis and evaluation of the Agent's behavior by using the data collected from the model.
Here, we use the OpenTelemetry open source standard, which includes both an open source SDK and probe solutions. This means that probes are dynamically injected into the AI application. For example, for AI applications built with Java and Python, probes can be attached to the application to dynamically collect the observability data mentioned earlier. In addition, on the model side, many models run on inference acceleration frameworks such as vLLM and SGLang. These frameworks are essentially Python applications, so we can mount probes into them to collect information about the inference flows and details inside the model. At the same time, at the GPU layer, you can also collect information such as GPU utilization. With this data, we can perform the three types of analysis mentioned earlier.
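For intuition, here is what recording such a span looks like with the OpenTelemetry Java API if you were to instrument a model call by hand; in practice, a probe such as LoongSuite captures this automatically. The span name and attribute keys follow the style of the OpenTelemetry GenAI semantic conventions but are illustrative.

```java
// A sketch of manually recording the kind of span and token attributes that a probe
// would normally capture automatically. Attribute names are illustrative.
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class ModelCallTracing {
    private static final Tracer tracer = GlobalOpenTelemetry.getTracer("ai-agent-demo");

    public static String callModel(String prompt) {
        Span span = tracer.spanBuilder("llm.chat").startSpan();
        try (Scope ignored = span.makeCurrent()) {
            String completion = "...model output...";              // the real inference call goes here
            span.setAttribute("gen_ai.request.model", "qwen-max"); // which model was invoked
            span.setAttribute("gen_ai.usage.input_tokens", 812L);  // token consumption for cost analysis
            span.setAttribute("gen_ai.usage.output_tokens", 164L);
            return completion;
        } finally {
            span.end();                                            // exported via OTLP to the backend
        }
    }
}
```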
Here is a brief introduction to the key metrics to monitor. In applications and Agents during the microservice era, our three golden metrics might have been RED: Request, Error, and Duration. In AI applications, however, token consumption has become a more critical metric, so the new three golden metrics are TED: Token, Error, and Duration.
When accelerating model inference, there are two critical metrics to monitor: time to first token (TTFT) and time per output token (TPOT). TTFT is the time from when the context input is provided to the model to when the model outputs the first token; this latency determines the perceived responsiveness of the model. TPOT is measured after the first token is output: it is the time it takes to output all subsequent tokens, divided by the number of those tokens. In other words, TPOT is the average time per token after the first-token latency, which reflects the model's performance in the decode stage. Therefore, you must monitor these two metrics. In key stages of model inference, you also need to watch the KV cache hit ratio, GPU utilization, and throughput. In evaluation scenarios, the main focus is on metrics such as accuracy, bias, and toxicity.
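The two metrics are easy to derive from timestamps observed on a streaming response, as in the sketch below (the numbers are made up for illustration).

```java
// Deriving TTFT and TPOT from timestamps observed on a streaming response.
public class StreamingLatency {
    public static void main(String[] args) {
        long requestSentMs = 0;       // when the full context was sent to the model
        long firstTokenMs  = 420;     // when the first token arrived (end of prefill)
        long lastTokenMs   = 5_420;   // when the final token arrived
        long outputTokens  = 250;     // total tokens generated

        long ttftMs = firstTokenMs - requestSentMs;                                   // time to first token
        double tpotMs = (lastTokenMs - firstTokenMs) / (double) (outputTokens - 1);   // avg time per output token (decode)

        System.out.printf("TTFT = %d ms, TPOT = %.1f ms/token%n", ttftMs, tpotMs);
    }
}
```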
In addition to the metrics just mentioned, another important aspect is the trace. It provides a clear view of the nodes traversed and the real-time operational status within a model inference call. How can you view this? By using the standard OpenTelemetry tracing protocol, you can collect data from each key step. The screenshot shows a workflow built with Dify that invokes a vLLM model. From this call chain, you can see its duration, total token consumption, and its input and output, along with the key node information for each workflow step in Dify and its duration. We can see that the step with the longest duration is the LLM call stage, as well as the token consumption of that stage. Going one level deeper into the model, the full-link tracing capability also reflects the model's internal invocation procedure, so the trace allows you to accurately see the details of each execution.
Finally, there is evaluation. Evaluation is a very important concept in the AI Agent scenario. It is equivalent to regression testing in traditional software development. This is an iterative, cyclical process, not a one-time action.
After we develop an Agent in the development stage, we use tracing to record the model's inputs and outputs and perform a preliminary evaluation of the Agent. There are two types of evaluation. The first is manual evaluation, which is more suitable for the early stages of AI application development: select some fixed cases for which you know the expected model output, and manually check whether the actual results meet expectations. After a certain stable state is reached, we can switch to LLM evaluation, which uses a third-party model to perform the evaluation and improves extensibility and efficiency. In the flow from evaluation completion to online deployment, we continuously track online data, including the metrics, tracing, and logs mentioned earlier, to provide feedback for and optimize our Agent. The Agent is then continuously improved through iterations in a repeating loop.
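As a rough illustration of the LLM evaluation step, the sketch below scores traced input/output pairs with a third-party judge model. The `JudgeModel` interface, the judge prompt, and the 1-to-5 scoring scale are all assumptions for illustration, not part of any specific evaluation product.

```java
// A minimal sketch of LLM-based evaluation ("LLM as judge") over traced input/output pairs.
import java.util.List;

interface JudgeModel { String complete(String prompt); }

record TracedCase(String input, String expected, String actual) {}

class LlmEvaluator {
    private final JudgeModel judge;
    LlmEvaluator(JudgeModel judge) { this.judge = judge; }

    double evaluate(List<TracedCase> cases) {
        double total = 0;
        for (TracedCase c : cases) {
            String prompt = """
                    Score the answer from 1 (wrong) to 5 (matches the reference) and reply with the number only.
                    Question: %s
                    Reference answer: %s
                    Actual answer: %s
                    """.formatted(c.input(), c.expected(), c.actual());
            total += Double.parseDouble(judge.complete(prompt).trim());   // the third-party model does the scoring
        }
        return total / cases.size();                                      // average score fed back into the loop
    }
}
```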
What are the key areas to focus on during evaluation? There are three stages. The first is planning: evaluating how the Agent breaks down a task into sub-tasks (or nodes). We need to determine whether the nodes are split accurately, whether there are duplicate splits, whether the splits are precise enough, and whether the process takes a detour. These are key factors to focus on during evaluation.
Another area is tool invocation. In many cases, a problem with tool invocation makes the model's input unstable: the correct tool may not be selected, or the tool parameters may be extracted inaccurately, causing incorrect information to be passed. These are critical considerations during the evaluation stage. In addition, in the retrieval step of the RAG stage, you need to check whether the retrieved corpus is relevant or contains duplicates.
After we have this data, we can send it to an observability platform. On this platform, we can continuously, automatically, and periodically extract online operational data and evaluate it. Once these evaluation templates are defined, they run continuously and automatically online, and the evaluation results can be scored to support quantitative analysis.
Open Source Project Planning
We recently published the open source project LoongSuite. “Loong” means “Chinese dragon,” and “Suite” refers to a collection of tools. Based on the open source OpenTelemetry community, it aims to provide automated instrumentation for the various frameworks used in AI Agent development, with probes for Java, Go, and Python. These probes automatically capture data from Agents developed in different languages, as mentioned earlier, including metrics, traces, logs, inputs, and outputs. The data can be sent to open source storage solutions and any backend compatible with the OpenTelemetry Protocol (OTLP), such as Jaeger or Elasticsearch, or reported to an Alibaba Cloud service, where the cloud platform handles data storage, display, and hosting. Based on this, you can perform performance, cost, and quality analysis and evaluation.
Finally, I will briefly introduce the planning for several of the open source projects mentioned earlier.
● Spring AI Alibaba: In the future, we will add support for protocols such as A2A and create an evaluation console to improve the overall efficiency of developer testing and evaluation. https://github.com/alibaba/spring-ai-alibaba
● Higress: We will enhance some AI plugins and some RAG plugins. https://github.com/alibaba/higress
● Nacos: Version 3.x will provide support for dynamic prompts and the A2A protocol. https://github.com/alibaba/nacos
● Apache RocketMQ: Some capabilities will also be published to the open source community in the next one or two months. https://github.com/apache/rocketmq
● LoongSuite: We will provide comprehensive support for more mainstream open source frameworks, such as additional Python frameworks. Support for Dify, LangChain, and MCP has been published. We will also add support for the A2A protocol and end-to-end observability. https://github.com/alibaba/loongsuite-python-agent