A Brief Introduction to LLM Application Observability
Thriving AI Application Ecosystem
AI technology is currently reshaping the industrial landscape at an unprecedented speed. Its application ecosystem has developed a multi-dimensional and collaborative evolutionary trend, characterized by breakthroughs in the following three dimensions:
First, foundation model technology has achieved leapfrog development.
Chinese large language models (LLMs), such as DeepSeek and Qwen, are continuously making breakthroughs in core metrics such as parameter size, inference capabilities, and multimodal processing. Through algorithm optimization and computing power upgrades, they are significantly narrowing the gap with leading international models from companies such as OpenAI and Anthropic (Claude). Notably, model research and development has shifted from simply pursuing performance metrics to specializing in vertical domains. This shift has created a dual-track pattern in which general-purpose LLMs and industry-specific models evolve collaboratively.
Second, full-stack frameworks are building the technical foundation.
In terms of technical implementation, the Python language still dominates the AI developer ecosystem. High-code frameworks such as LangChain and LlamaIndex continue to improve their chained processing and knowledge base management capabilities. As the demand for technology popularization grows, cross-language developer systems are maturing rapidly. Frameworks in the Java ecosystem, such as Spring AI Alibaba, have achieved alignment with the core features of the Python ecosystem. At the same time, low-code development platforms such as Dify and Coze provide lightweight solutions for AI application development through visual orchestration and modular components. These platforms are particularly well-suited to the inherently lightweight architectural characteristics of AI applications. Combined with infrastructure support systems such as Machine Learning Operations (MLOps) toolchains and vector databases, they build a complete closed loop for development and operations.
Third, application scenarios are showing a diversified development trend.
The form of AI applications is undergoing a paradigm shift from single-turn interaction to agents. Early customer-service chatbots have evolved into intelligent assistant systems with composite functions such as code assistance (for example, GitHub Copilot) and decision support. Currently, general-purpose agents based on the agentic architecture are becoming a focus of innovation. Their capabilities, such as autonomous decision-making and continuous learning, have given rise to many innovative scenarios in domains such as finance, healthcare, and manufacturing. According to Gartner's latest forecast, by 2025, more than 70% of enterprises will deploy at least one agent. This marks the move of AI applications from the proof-of-concept stage into a new era of large-scale implementation.
Typical Architecture and Observability Requirements of AI Native Applications
The architecture of a typical AI native application is shown in the preceding figure. It can be mainly divided into three major sections. Traffic flows in from the user business side. Users send requests from various endpoints, including browsers, mini programs, Android, and iOS. All traffic is usually proxied by a unified entry point, such as a Higress gateway. The gateway performs unified administration operations at the entry point, such as security protection and traffic shaping. After passing through the gateway, the traffic is sent to the model application layer. The model application layer contains applications that are written in various programming languages by using various programming frameworks, such as Dify, LangChain, and Spring AI Alibaba. These programs invoke different model services, such as hosted or self-deployed Qwen and DeepSeek. For model high availability and model performance comparison, we usually deploy multiple models and switch between them based on certain strategies. For example, load balancing is implemented based on the model invocation cost, importance, and traffic policies. Usually, users also implement this through a unified proxy, such as a Higress AI gateway.
As you can see, the execution trace of an entire AI native application is quite long, and a problem at any node along it can cause the business to fail. Therefore, the first problem to solve is to determine which components a single invocation passes through and to connect them into a single call trace; a minimal tracing sketch after these three requirements illustrates the idea. When a problem occurs with a request, we need to know which link is at fault. Is the problem in the AI application, or in the model's internal inference? Only with these end-to-end traces established can we quickly pinpoint the source of the problem.
Second, in such a complex distributed system, observing from a single dimension will leave users in information silos. We need to build a full-stack observable data platform that can effectively associate all this data, which includes not only traces but also metrics, such as the internal GPU utilization of the model. Through association analysis, we can determine whether the problem is in the application layer or the underlying model layer.
Finally, we also need to use model logs to understand the inputs and outputs of each invocation. We can then use this data to perform evaluations and analyses to more accurately verify and evaluate the quality of the AI application.
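To make the first requirement concrete, here is a minimal sketch of how the stages of one request can be stitched into a single trace. It uses the OpenTelemetry Python SDK directly rather than LoongSuite (which collects the same kind of data without code changes); the service, span, and attribute names are illustrative assumptions.

# Minimal sketch: one request produces a parent span for the application layer
# and a child span for the model call, linked into a single trace.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Print finished spans to the console for demonstration purposes.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("ai-native-app-demo")

def call_model(prompt: str) -> str:
    # Child span: the LLM invocation (in a real system this crosses the AI gateway).
    with tracer.start_as_current_span("llm.chat") as span:
        span.set_attribute("gen_ai.request.model", "qwen-max")  # illustrative name
        return "model answer"

def handle_request(user_input: str) -> str:
    # Parent span: the application layer handling one user request.
    with tracer.start_as_current_span("app.handle_request") as span:
        span.set_attribute("app.framework", "langchain")  # illustrative attribute
        return call_model(user_input)

handle_request("Why did my request fail?")
# Both spans share one trace ID, so the gateway, application, and model stages
# can be viewed as a single end-to-end call trace.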
Observability Is Crucial for LLM Applications
The complexity of LLM applications introduces many observability-related requirements, which makes observability crucial. For example, when an Agent executes a task, it usually consumes a large number of tokens and a lot of time, and every step in the execution process needs to be recorded in detail. This requires complete observability capabilities to collect the execution status of each stage, with very specific requirements for aspects such as model calls, tool usage, and token consumption. Moreover, with the emergence of MCP, an Agent often needs multiple rounds of interaction with the model while executing a task. The token consumption of the final result may not look large, but the actual consumption during the intermediate steps is often surprisingly large; the Agent may even fall into an endless loop, the so-called MCP token black hole. In addition, every time an AI Agent is modified and published, we need to evaluate the results of the Agent's execution, which is equivalent to regression testing that requires a large amount of observability data. Therefore, from the developer testing phase through runtime and O&M, observability is an essential part of the lifecycle.
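As a small illustration of recording per-step cost, the following sketch attaches token usage to the span of each model call so that the consumption of every turn in an Agent or MCP loop becomes visible. The attribute names follow the OpenTelemetry GenAI semantic conventions; the model client and its response fields are hypothetical.

# Record token consumption for every model call inside an Agent's multi-turn loop.
from opentelemetry import trace

tracer = trace.get_tracer("agent-demo")

def traced_model_call(step_name: str, prompt: str, client):
    # One span per model call, named after the Agent step that triggered it.
    with tracer.start_as_current_span(f"agent.step.{step_name}") as span:
        response = client.chat(prompt)  # hypothetical model client
        # Token cost attached as span attributes (OpenTelemetry GenAI conventions).
        span.set_attribute("gen_ai.usage.input_tokens", response.input_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", response.output_tokens)
        return response.text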
The complexity of AI Agents, the uncertainty of dynamic data flows, and the need for real-time inference make it difficult for traditional monitoring methods to meet the requirements. The architecture of an AI Agent is significantly different from that of traditional microservices. A change in the Agent, model, or intermediary can affect the entire system. However, the idea of observability is the same. We need to implement comprehensive end-to-end observability, from the client side to the Agent and to the inside of the model. Through end-to-end full-link observability, we can achieve full coverage of the interaction process between the AI Agent and the model. As an open-source observability data collection suite, LoongSuite provides comprehensive coverage for the development, testing, and evaluation of Agents.
Introduction to LoongSuite
LoongSuite (/lʊŋ swiːt/, pronounced "long sweet") is the core carrier of the next-generation observability technology ecosystem. Its core data collection engine effectively combines host-level probes and process-level instrumentation. The process-level probes implement fine-grained, in-application observability data collection. The host-level probes implement efficient and flexible data processing and reporting, as well as out-of-process data collection through technologies such as extended Berkeley Packet Filter (eBPF).
At the process-level data collection layer, LoongSuite builds enterprise-level observation capabilities for mainstream programming languages such as Java, Go, and Python. Through deep adaptation to language attributes, the collector can automatically capture function call links, parameter passing paths, and resource consumption. This allows for the precise collection of runtime status without modifying business code. This non-intrusive design is particularly suitable for technology environments with frequent dynamic updates because it both ensures the integrity of observable data and avoids interfering with core business logic. When faced with complex pipelines, the system can automatically associate distributed trace contexts to build a complete execution path topology. As the core data collection engine, LoongCollector implements unified processing of multi-dimensional observation data. The entire flow, from raw data collection to structured transformation and then to smart routing distribution, is flexibly orchestrated through a modular architecture. This architecture allows observable data to be connected to open-source analysis platforms for autonomous administration or to be seamlessly connected to managed services to build a cloud-native observation system. In terms of technology ecosystem construction, Alibaba Cloud is deeply involved in the formulation of international open standards, and its core components are compatible with mainstream standards such as OpenTelemetry.
How LoongSuite Non-intrusive Tracking Works
How LoongSuite Non-intrusive Tracking Works in Python Agent
The preceding code shows a simple hello method in a Python program. Python provides a mechanism called monkey patching, which allows a program to dynamically modify a function at runtime, including its attributes and methods. Therefore, we only need to define a wrapper method around it that performs some actions before and after the original method is executed. This is somewhat similar to the decorator pattern.
Here, hello is the original method definition. During the initialization phase of the running Python program, you can replace the reference to the original function. In this way, the method that is actually executed is the replacement wrapper method, and inside the wrapper the original method is still invoked. The final result, as shown in step 4, is that you can insert the desired logic before and after the original method. In this way, we can collect various data without modifying the user's code.
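A minimal, self-contained sketch of this idea is shown below. The hello function and the timing logic are illustrative; the actual Python Agent wraps framework entry points in the same way, without any change to user code.

# Minimal monkey-patching sketch: replace the reference to hello with a wrapper
# that runs extra logic before and after the original call.
import functools
import time

def hello(name):
    return f"hello, {name}"

def instrument(original):
    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        start = time.time()                      # logic before the original call
        result = original(*args, **kwargs)       # invoke the original method
        duration = time.time() - start
        print(f"{original.__name__} took {duration:.6f}s")  # logic after the call
        return result
    return wrapper

# "Patch" at initialization time: the name hello now points to the wrapper.
hello = instrument(hello)

print(hello("LoongSuite"))  # executes the wrapper, which calls the original inside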
How LoongSuite Non-intrusive Tracking Works in Go Agent
The LoongSuite Go Agent was formerly the alibaba/opentelemetry-go-auto-instrumentation project. It provides non-intrusive observability data collection for Go applications through compile-time instrumentation. The compilation flow for a Go application is as follows:
First, the compiler frontend performs lexical and syntactical analysis on the source code and parses it into an abstract syntax tree. Then, the compiler backend optimizes the code and generates machine code to compile an executable binary file.
Unlike Java, which can use bytecode enhancement to mount an agent dynamically at runtime, Go compilation produces a directly executable binary file. To achieve non-intrusive monitoring, you can use the -toolexec mechanism provided by the Go toolchain to insert an intermediate layer between the compiler frontend and backend. In this intermediate layer, you can make arbitrary modifications to the abstract syntax tree generated by the compiler frontend, which allows monitoring tracking points to be injected at compile time.
Through abstract syntax tree (AST) analysis, you can find the tracking points for monitoring. Based on predefined tracking rules, the required monitoring code is inserted before compilation; this code handles tasks such as span creation, metric statistics, and context propagation. This solution requires no code changes, only a modification to the compilation command. In addition, because it goes through the complete Go compilation flow, it naturally supports all Go runtime scenarios and avoids hard-to-predict errors.
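The actual injection happens on the Go AST inside the toolexec intermediate layer. As a language-agnostic illustration of the same AST-rewriting idea, the following sketch uses Python's ast module to prepend a call to every function body before compiling it; it is only an analogy, not the Go Agent's implementation.

# Illustration of AST rewriting: inject a call at the start of every function.
import ast

source = """
def handle(x):
    return x * 2
"""

class InjectEnterLog(ast.NodeTransformer):
    def visit_FunctionDef(self, node):
        # Build print("enter", "<function name>") and prepend it to the body.
        enter = ast.Expr(value=ast.Call(
            func=ast.Name(id="print", ctx=ast.Load()),
            args=[ast.Constant(value="enter"), ast.Constant(value=node.name)],
            keywords=[],
        ))
        node.body.insert(0, enter)
        return node

tree = InjectEnterLog().visit(ast.parse(source))
ast.fix_missing_locations(tree)          # new nodes need source locations
namespace = {}
exec(compile(tree, filename="<instrumented>", mode="exec"), namespace)
print(namespace["handle"](21))           # prints "enter handle", then 42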
LoongSuite provides complete support for microservice applications, including instrumentation for common frameworks and plugins such as http, db, redis, and mq. For AI Agents written in various languages, it also supports common AI Agent development frameworks such as LangChain, MCP, and Dify. This allows you to quickly view the input and output, duration, and number of tokens consumed by each LLM invocation, and to use this data to optimize prompts, improve LLM access efficiency, and reduce token consumption.
Compared with other open-source probes, LoongSuite offers more extensive instrumentation, more comprehensive feature support, and broader language support, providing a solid data foundation for building an intelligent O&M system.
LoongSuite Dify Observability Practice
Introduction to Dify
Dify is an open-source LLM application development platform and one of the most popular LLM application development platforms in China. It integrates the concepts of Backend as a Service and LLMOps, which allows developers to quickly build production-grade generative AI applications. Even non-technical personnel can participate in defining AI applications and managing data operations. Dify has a built-in key technology stack required to build LLM applications, including support for hundreds of models, an intuitive prompt orchestration interface, a high-quality RAG engine, a robust Agent framework, and flexible pipelines. Dify also provides a set of easy-to-use interfaces and APIs, saving developers from reinventing the wheel and allowing them to focus on innovation and business requirements.
Dify Architecture Overview
Users access Dify through a browser. The local Nginx service acts as a unified entry point and accepts all HTTP requests. Nginx forwards traffic to different services based on the request path. Static pages and frontend APIs are served by the WebUI frontend service, which renders the management console interface. Backend API requests are handled by the Dify API service, a Flask application that processes all business APIs, such as those for application configuration, chat conversations, and data management. During its lifecycle, the Dify API service interacts with multiple internal components:
● PostgreSQL database: Stores persistent data, such as application configuration, conversation records, and multi-tenant information. The API service accesses the database through SQLAlchemy.
● Redis cache/queue: Plays a dual role. On one hand, it caches temporary data and session state to improve read and write performance. On the other hand, it serves as the message queue for Celery and stores asynchronous tasks, such as embedding and observability data upload (a minimal sketch of this pattern follows this list).
● Object Storage Service (OSS): Stores content such as uploaded files, knowledge base documents, and required keys and certificates. It can be a local file system, Alibaba Cloud OSS, or another backend.
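As a small illustration of the Redis role described above, the following sketch shows the Redis-as-Celery-broker pattern: the API process enqueues an asynchronous job, and a worker picks it up from Redis. The task name and broker URL are illustrative, not Dify's actual code.

# Minimal Celery-with-Redis sketch: the web process enqueues, a worker executes.
from celery import Celery

app = Celery("dify_like_tasks", broker="redis://localhost:6379/1")

@app.task
def embed_document(document_id: str) -> str:
    # Placeholder for a long-running job such as computing embeddings.
    return f"embedded {document_id}"

# In the API process: enqueue instead of blocking the HTTP request.
# embed_document.delay("doc-42")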
The most significant change since version 1.0 is the introduction of the plugin architecture. The features of dify-plugin-daemon are as follows:
● It starts and manages the plugin lifecycle based on parameters, and invokes started plugins, including invoking LLMs by using the Server-Sent Events (SSE) protocol. It is a stateful service.
● It handles the input and output of all plugins.
● It can dynamically load plugins and connect to plugins locally or remotely for testing.
● It can make reverse calls to Dify's internal services.
Both Models and Tools are migrated to the plugin system, and new plugin types such as Agent Strategies, Extensions, and Bundles have been added. This architecture decouples models and tools from the core platform, which enables independent development and maintenance. In addition, Dify has launched the Dify Marketplace to make it easy for users to discover and manage various plugins.
Dify Observability Hands-on
Note: Some of the capabilities in the hands-on content of this section are not yet fully available in the open source version of LoongSuite. This demo mainly uses the commercial version of LoongSuite. The capabilities included in this demo will be gradually made open source. Please stay tuned.
One-click Launch of the Dify Suite
Step 1. You can follow the tutorials to deploy Dify on Alibaba Cloud Serverless App Engine (SAE) or Container Service for Kubernetes (ACK) with one click.
Non-intrusive Integration with dify-api
First, install the ack-onepilot component according to the documentation. Then, follow the instructions in the documentation to add the relevant toggles to the dify-api deployment to enable the observability capabilities of dify-api:
You only need to add a few extra labels to immediately integrate the dify-api component into the LoongSuite observability system.
Non-intrusive Integration with dify-plugin-daemon
dify-plugin-daemon is a component written in Go. We need to enhance this component at compile time. First, clone the dify-plugin-daemon project locally.
After that, modify its Dockerfile according to the documentation:
FROM golang:1.23-alpine AS builder
ARG VERSION=unknown
# copy project
COPY . /app
# set working directory
WORKDIR /app
# using goproxy if you have network issues
# ENV GOPROXY=https://goproxy.cn,direct
# build
# Add compile prefix to get non-intrusive observability
RUN chmod 777 ./aliyun-go-agent-linux-amd64 && ./aliyun-go-agent-linux-amd64 go build \
-ldflags "\
-X 'github.com/langgenius/dify-plugin-daemon/internal/manifest.VersionX=${VERSION}' \
-X 'github.com/langgenius/dify-plugin-daemon/internal/manifest.BuildTimeX=$(date -u +%Y-%m-%dT%H:%M:%S%z)'" \
-o /app/main cmd/server/main.go
# copy entrypoint.sh
COPY entrypoint.sh /app/entrypoint.sh
RUN chmod +x /app/entrypoint.sh
FROM ubuntu:24.04
WORKDIR /app
# check build args
ARG PLATFORM=local
# Install python3.12 if PLATFORM is local
RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y curl python3.12 python3.12-venv python3.12-dev python3-pip ffmpeg build-essential \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/* \
&& update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.12 1;
# preload tiktoken
ENV TIKTOKEN_CACHE_DIR=/app/.tiktoken
# Install dify_plugin to speedup the environment setup, test uv and preload tiktoken
RUN mv /usr/lib/python3.12/EXTERNALLY-MANAGED /usr/lib/python3.12/EXTERNALLY-MANAGED.bk \
&& python3 -m pip install uv \
&& uv pip install --system dify_plugin \
&& python3 -c "from uv._find_uv import find_uv_bin;print(find_uv_bin());" \
&& python3 -c "import tiktoken; encodings = ['o200k_base', 'cl100k_base', 'p50k_base', 'r50k_base', 'p50k_edit', 'gpt2']; [tiktoken.get_encoding(encoding).special_tokens_set for encoding in encodings]"
ENV UV_PATH=/usr/local/bin/uv
ENV PLATFORM=$PLATFORM
ENV GIN_MODE=release
COPY --from=builder /app/main /app/entrypoint.sh /app/
# run the server, using sh as the entrypoint to avoid process being the root process
# and using bash to recycle resources
CMD ["/bin/bash", "-c", "/app/entrypoint.sh"]
After that, you can build the dify-plugin-daemon image in the root directory of the project by using this Dockerfile:
docker build -t ${actual_image_name} -f docker/local.dockerfile .
Then, replace the image of the running dify-plugin-daemon component with this image, which LoongSuite has non-intrusively enhanced at compile time.
End-to-End Observability
Finally, you can configure a simple chat application on the Dify console and use this simple chat application to have a conversation with the backend LLM:
After that, you can view the monitoring data of both the dify-api application written in Python and the dify-plugin-daemon component written in Golang on the interface at the same time.
In addition, the call chain between the dify-api application and the dify-plugin-daemon component can be successfully linked. This indicates that LoongSuite can provide end-to-end observability for Dify.
Dify Observability Summary
In summary, compared with other Dify observability solutions, LoongSuite provides more comprehensive end-to-end tracing. It addresses the gap where native Dify can only view its own siloed internal traces and plain OpenTelemetry (OTel) can only view platform-level traces. It also has a lower integration cost: a single integration takes effect for all applications, integration is non-intrusive so you do not need to modify each LLM application separately, it is not limited by runtimes such as gevent, and subsequent upgrade and maintenance costs are low. With these advantages, the LoongSuite Agent can provide Dify with single-integration, global observability and make up for the shortcomings of the native solution in multi-dimensional observation, global analysis, and end-to-end tracing.
Future Outlook
In the future, the community will continue to invest mainly in full-stack observability capabilities related to LLMs. This includes observability for model-side inference (for well-known inference acceleration engines such as vLLM and SGLang), profiling capabilities on GPUs, and support for more LLM-related plugins, including features such as the Google Agent Development Kit (ADK) and the A2A protocol.