By Zhi Shao, from the Alibaba Cloud SLS Team
Introduction
AI and Simple Log Service (SLS) create a powerful synergy. We use AI to enhance SLS's capabilities through features like SQL Copilot, and we leverage SLS to address critical challenges in AI development. When Dify's observability gaps led to troubleshooting times exceeding 30 minutes, we turned to SLS to build a more robust data infrastructure.
As large language model (LLM) applications develop rapidly, we often focus on model fine-tuning and feature implementation. However, we tend to overlook a critical question:
How do you effectively monitor, diagnose, and optimize live LLM applications?
This article shares our engineering practices from building the SLS SQL Copilot. It shows how to build a complete data infrastructure for LLM applications using SLS.
1. Background: Observability Challenges in LLM Application Development
1.1 The Rise and Limitations of the Dify Platform
Dify is a popular platform for developing LLM applications. Its visual workflow design and rich widget ecosystem make development much easier. Our team chose Dify to build our SQL Copilot application. The application provides intelligent SQL query generation and analysis services.
However, in our production environment, we found a critical problem: The Dify platform has poor observability. Our team provides observability solutions. We know good monitoring and diagnosis are vital for service quality. Ironically, as an observability team, we couldn't "eat our own dog food." This pushed us to build a better data infrastructure for LLM applications.
1.2 The Business Complexity of SQL Copilot
Our SQL Copilot application has the following features:
• Collaboration between multiple subsystems: Includes subsystems for requirement analysis, SQL generation, quality validation, and SQL diagnosis.
• Complex workflows: Involves multilayer, nested Dify workflows. A single user request can trigger multiple child flows.
• Dynamic content embedding: Uses extensive context embedding and retrieval-augmented generation (RAG). System prompts include large amounts of embedded dynamic context and content retrieved from the knowledge base.
• High concurrency: Must support real-time query generation requests from many concurrent users.
Given these complexities, Dify's built-in observability features were insufficient for our production environment.
2. Pain Point Analysis: Three Major Observability Shortcomings of the Dify Platform
2.1 Limited Query Capabilities
Description: The Dify platform only offers basic account queries. Users can only search conversation history by User ID or session ID.
Our requirements included:
• Keyword search: A user might report, "I asked about order analysis, but the result was wrong." We need to quickly find the relevant session using keywords such as "order."
• Multidimensional queries: Combine conditions across multiple dimensions, including time range, error type, User ID, session ID, and function module.
• Fuzzy matching: Support fuzzy searches for SQL statement fragments and error message snippets.
Pain point: Troubleshooting production issues was inefficient. Locating the specific execution record for a reported problem often took several minutes.
2.2 Lack of Trace Analysis
Description: Dify displays execution logs as a flat list. It lacks hierarchical trace analysis.
This led to several pain points:
• Hard to trace nested workflows: When a main workflow contains a nested child workflow, it is difficult to drill down into the execution details of the child workflow.
• Difficult to trace upstream and downstream data: Viewing and comparing upstream and downstream data requires running separate queries. The entire tracing process is tedious and greatly reduces troubleshooting efficiency.
Example: When a user submits a request to generate SQL, the system executes several steps in order: session memory → smart routing → requirement analysis → SQL generation → syntax validation. If the final result is not as expected, we must check the details of each step to find the root cause. For complex problems, this required frequent context switching, extending the troubleshooting process to over 30 minutes.
2.3 Unformatted Content Display
Description: The Dify log display interface is hard to read when it shows large blocks of text. This is especially true for system prompts that contain many dynamic content embeddings.
Specific problems:
• Poor prompt readability: Prompts with context embeddings and RAG-retrieved content can be thousands of characters long. In the Dify interface, they display as one large block of unformatted text.
• Poor data format parsing: Input and output data often use multiple formats, such as Markdown, JSON, SQL, and plain text. Dify only displays the raw string in its interface. It does not support smart format parsing or reader-friendly displays.
Pain point: When troubleshooting SQL generation problems, it is hard to quickly read and understand the full context. It is also impossible to quickly jump to key sections. This reduces the efficiency of problem diagnosis.
3. Deeper Thinking: Three Infrastructure Challenges behind Observability Pain Points
Analyzing these observability issues revealed deeper architectural challenges.
3.1 The First Challenge: Architectural Scalability
The root cause of the query performance issues was Dify's limited architectural scalability. Dify uses PostgreSQL for its underlying data storage. While PostgreSQL excels in online transactional processing (OLTP) scenarios, it faces major scalability challenges. These challenges relate to storing large-scale log data and running complex query analysis. This impacts storage capacity, computing resources, and write and query performance.
Challenges of growing data volume: As the number of users and usage frequency grow, huge amounts of execution logs, interaction data, and model call records build up quickly. Even with cloud-based PostgreSQL, the system is constrained by instance types. This makes planning and scaling out difficult. This required a data infrastructure with true elasticity, on-demand scaling, and a pay-as-you-go model.
Poor resource elasticity for traffic bursts: During periods of high concurrency, PostgreSQL is limited by its connection count and processing power. Insufficient resources can cause service latency or even timeouts. The interactive nature of LLM applications means traffic has significant peaks and valleys. Continuously running high-specification instances leads to low resource utilization and can result in costs two to three times higher than necessary.
Complexity and risks of upgrades and scale-outs: Cloud-based PostgreSQL supports online scale-outs. However, these operations still require manual planning and evaluation. They also carry risks, such as service restarts or performance fluctuations. Our Copilot service has experienced several stressful online scale-outs.
3.2 The Second Challenge: Data Processing Capability
We found these limitations when we tried to analyze and diagnose production issues: the underlying infrastructure has limited query capabilities for LLM application data, especially for observability workloads.
Multidimensional query performance bottlenecks: LLM applications generate large amounts of natural language text, such as user questions, system prompts, and model responses. While PostgreSQL supports full-text search, its performance degrades significantly when running keyword searches and fuzzy matching on massive volumes of long-text data. It also struggles to support requirements like the multidimensional queries mentioned before.
Poor real-time analysis capabilities: We often need to run ad hoc queries for multidimensional analysis. For example, we might ask, "How many users are affected by the same problem?" or "What is the distribution of different error types?" Dify does not provide these analysis capabilities. It cannot meet the demand for complex and varied ad hoc queries and analysis.
Complex and changing data structures: LLM application data contains complex and nested formats, such as JSON and Markdown. This includes prompts, parameter configurations, and generated results. These data structures change dynamically with each development iteration. Compared to a professional log analysis engine, PostgreSQL (PG) is far less flexible and efficient at querying, fetching, and analyzing complex log data.
3.3 The Third Challenge: Diverse Data Requirements
As we tried to optimize the problem diagnosis process and content rendering, we uncovered a deeper issue:
Increasingly diverse use cases: LLM applications are flourishing, creating diverse needs across different fields. The quality team needs quality assessment and regression testing. The product team needs user analysis and requirements mining. The algorithm team needs model training and performance evaluation. The Operations team (or SRE team) needs trace diagnostics and performance analysis. These needs extend far beyond the functional scope of the Dify platform.
Growing demand for data accessibility: As an LLM application development platform, Dify has access to complete, firsthand data. However, it struggles to meet the diverse needs of various industries and roles. Users want to access these data resources to perform in-depth analysis or custom development based on their business needs. This places higher demands on data accessibility.
Limited data infrastructure capabilities: PG has clear shortcomings for enterprise-level data accessibility. Its number of connections and concurrent processing capabilities are limited. It lacks flexible access control and security management mechanisms, such as data filtering and desensitization. Data consumption methods are not diverse, and it lacks versatile data processing and consumption capabilities. These features could be added, but at a significant development cost.
4. Rebuilding Capabilities: Upgrading with an SLS-based Data Infrastructure
We concluded that patching features would not address the root problem; we needed to completely rebuild the underlying data infrastructure to meet these challenges. As a team that provides observability solutions, we understood the capabilities and potential of SLS and were confident it was the right tool for the job. SLS offered distinct advantages:
Architectural scalability: As a fully managed, cloud-native service, SLS offers seamless online scaling and a pay-as-you-go model. It can dynamically adjust resources without affecting business continuity. This eliminates the need for manual planning or concerns about scale-out risks. It fundamentally solves the scalability bottlenecks faced by PG.
Data processing capabilities: LLM application data is inherently log-like. SLS is designed specifically for log scenarios. It natively supports full-text search, multi-dimensional queries, and flexible aggregate analysis. Its data processing power scales linearly with data volume. It excels at handling the massive amounts of text and JSON data generated by LLM applications.
Meeting diverse needs: SLS provides rich methods for data query, processing, and consumption, such as queries, SQL, APIs, stream processing, and delivery. It supports fine-grained access control and data distribution management. You can use storeviews to control data visibility and security at the row or even field level. It also supports loose schemas and completely unstructured data. This flexibility allows it to meet the data needs of different teams.
4.1 Architectural Redesign: From Dual-write to Capability Rebuild
Based on these technical advantages, we chose SLS as our new data infrastructure. To ensure online service stability and manage risk, we adopted a gradual reconstruction strategy: keep the existing write path to PG and add a new, asynchronous write path to SLS.
This approach has several benefits:
Prioritizing business security: We kept PG as the primary data store for online services. This ensures that Dify's core functions are not affected. All existing queries, statistics, and business logic continue to run on PG.
Data plane decoupling: Writing to SLS asynchronously decouples the data layer. This separates observability, analysis, and monitoring tasks from the main business database. This approach avoids impacting the performance of online services while enhancing data query and processing capabilities.
Functional load separation: PG focuses on handling online execution loads, such as user management, session management, and real-time queries. SLS handles all offline analysis loads, such as log queries, statistical analysis, and alert monitoring.
Progressive architectural evolution: This approach leaves room for the underlying data infrastructure to evolve. As your business grows, you can gradually migrate more features to the new SLS-based data infrastructure.
4.2 Capability Upgrade: Showcasing Core SLS Features
Powered by the robust SLS data processing engine, the new data infrastructure greatly improved full-text search, multi-dimensional queries, and ad hoc analysis, and significantly enhanced our ability to extend the data to new scenarios. The system now easily supports diverse needs, such as regression testing for quality teams, user analysis for product teams, model optimization for algorithm teams, and problem diagnosis for O&M teams.
Support for Diverse Data Formats
SLS is compatible with various data formats, including JSON, Markdown, Text, SQL, and numeric parameters. It also provides readable rendering for common formats such as Markdown and JSON. This feature helps you easily read and understand your data.
(Screenshots: an original log entry shown alongside its Markdown-formatted and JSON-formatted displays.)
Fast Full-text Search
For example, to address the user feedback issue described in Section 2, you can now quickly retrieve the relevant requests by searching for the keyword 'order'.
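As a minimal sketch, such a search looks like the following in the SLS query syntax. The bare keyword performs a full-text search across the indexed log content; the second query narrows it with a field filter, where the status field name is an assumption standing in for whatever your application actually logs:

```sql
-- Full-text search: find every execution log that mentions "order"
order

-- Hypothetical refinement: only failed executions that mention "order"
-- (status is an assumed field name in the indexed log schema)
order and status: failed
```

Note that field-level filters such as status: failed require a field index to be configured for that field in SLS; full-text search only needs the full-text index.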
Multi-dimensional Queries
You can also combine query conditions across multiple dimensions, such as UID, session ID, action, and error type. This lets you precisely retrieve requests that have specific features.
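A sketch of such a combined query in the SLS search syntax is shown below. All field names and values here (uid, session_id, action, error_type) are illustrative assumptions, not the actual schema of our logs:

```sql
-- Hypothetical multi-dimensional filter: one user, one session,
-- one function module, and one error category, within the selected time range
uid: 10023 and session_id: "sess-7f3a" and action: sql_generation and error_type: timeout
```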
Real-time Insight and Analysis
You can take this a step further. Use SQL for flexible aggregation and analysis of data with specific features. For example, you can track the daily trend of user requests containing the word 'include'.
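As a sketch, such an ad hoc analysis can be written as a single SLS query statement: a full-text search before the pipe and analytic SQL after it. The output column names are our own choices, and __time__ is the built-in log timestamp:

```sql
-- Daily trend of user requests whose content contains the keyword "include"
include |
SELECT
  date_trunc('day', from_unixtime(__time__)) AS day,  -- __time__ is the built-in Unix timestamp
  count(*) AS request_count
GROUP BY day
ORDER BY day
```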
Using the SQL analysis and visualization features of SLS, you can easily build key metrics for your business. Customize operational dashboards to gain deep insights and analyze various data points. These include interaction, retention, and conversion rates, daily request trends, error distribution, SQL quality (execution and effective data rates), and user satisfaction.
The powerful and flexible query and analysis features of SLS let you perform deeper analysis of user behavior. You can gain insights for specific business scenarios, such as:
• How many users are affected by the same issue?
• What is the distribution of issue features among different user groups?
• Which SQL patterns are most prone to errors?
• What are the users' operational behavior patterns?
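As one concrete sketch, the first question above can be answered with a single SLS query: search for the error signature, then count distinct users. The uid and status field names are assumptions about the log schema:

```sql
-- How many distinct users hit the "cannot be resolved" error?
"cannot be resolved" and status: failed |
SELECT approx_distinct(uid) AS affected_users
```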
4.3 Scenario Implementation: Building a Lightweight Tracing and Problem Diagnosis System with SLS and OneDay
As mentioned in Section 2, Dify's built-in tracing and content display were inadequate, which made troubleshooting very difficult. With the rise of AI coding and the powerful data infrastructure of SLS, I decided to build a lightweight tracing and problem diagnosis system myself. This also gave me a chance to experience the new AI-driven development model.
I used OneDay, the Group's AI coding platform, to build the frontend. The backend was built directly on the powerful data storage, query, and processing capabilities of SLS.
This resulted in a lightweight yet fully functional diagnosis system:
• A clean, professional user interface with a simple page layout.
• Integration with the company's user identity authentication.
• Full-link tracing and topology for any user query, given only its request ID.
• Execution details for every request, stage, and node, including inputs, processing, and outputs.
• Intelligent detection of multiple data types, including Markdown, JSON, SQL, and plain text, with smart highlighting and clean formatting for each.
• Full display of dynamic content, such as system prompts and user queries, with prompt formatting, collapsible sections, content navigation, and text scaling. This is very useful for diagnosing specific problems, especially for locating details.
(In Dify, a system prompt is essentially a content template. The system retrieves content segments from multiple sources and embeds them into the template, so each request has different prompt content. Reading the full prompt is critical for quickly locating the details of a problem.)
Incredibly, it took only one day to go from code to data to a fully functional service. The entire solution is fully managed. OneDay truly lives up to its name!
Behind this success, the SLS-based data infrastructure is the cornerstone of this minimalist architecture:
• Data layer: SLS provides one-stop capabilities for data storage, queries, and analysis.
• Application layer: A lightweight proxy handles only permissions, security, and essential business logic.
• Presentation layer: OneDay quickly generates a visual interface from requirement descriptions.
It's worth noting: AI coding is truly reshaping the software industry.
Now, all you need is an idea and a reliable, flexible data infrastructure. Let AI handle the rest. It is that simple.
OneDay and SLS are a perfect match. Together, they form a minimalist architecture. OneDay helps you create beautiful pages with just a few lines of code. The powerful data infrastructure of SLS lets you easily query, process, and analyze data. You do not need to configure bulky databases or develop lengthy and complex server-side logic. For security, you only need a proxy layer at most. You also do not need to worry about details like capacity, performance, or instance types.
Using the powerful data infrastructure of SLS, we quickly built a complete system for tracing analysis and issue diagnosis. This verified the value of SLS in real-world business scenarios.
5. Production Practice: A Data-driven Loop for Quality Optimization
5.1 Issue Diagnosis and Quality Optimization Flow
Ultimately, we must return to production practice to support the development of the SQL Copilot product. We established a complete quality feedback loop:
When building and iterating on AI applications, we must fully use all available resources and adopt an AI-first approach.
For example, we combined SLS with DingTalk AI Table. We use the SQL analysis capabilities of SLS to process massive amounts of request data. This helps organize errors and effectively identify error patterns. We detect features using regular expression matching, such as case when msg like '%cannot be resolved%' then 'Column Reference Error' .... Then, we download and export the result data. We import it into DingTalk AI Table, formerly DingTalk Base. This lets us use many features of AI Table, such as grouping and filtering. We can also use forms for manual review and annotation. When diagnosing a specific issue, we can directly link to the diagnosis system mentioned in the previous chapter. This connects requests, issues, and data.
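A sketch of the kind of classification query described above is shown below. Only the first CASE branch comes from our actual practice; the other patterns, and the field names msg, status, and uid, are illustrative assumptions about the log schema:

```sql
-- Group failed requests into coarse error patterns, then count and rank them
status: failed |
SELECT
  CASE
    WHEN msg LIKE '%cannot be resolved%' THEN 'Column Reference Error'  -- pattern from our practice
    WHEN msg LIKE '%mismatched input%'   THEN 'Syntax Error'            -- hypothetical pattern
    WHEN msg LIKE '%timed out%'          THEN 'Timeout'                 -- hypothetical pattern
    ELSE 'Other'
  END AS error_pattern,
  count(*) AS occurrences,
  approx_distinct(uid) AS affected_users
GROUP BY error_pattern
ORDER BY occurrences DESC
```

The result set of a query like this is what we export and then import into DingTalk AI Table for grouping, manual review, and annotation.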
Adopting an "AI-first" approach is not just a slogan. DingTalk is already integrated with many AI platforms, and AI Table itself will also offer AI assistant capabilities in the future. (PS: We have applied for internal testing and look forward to its release. :)) Our standard for choosing tools comes down to a key question: can the data we continuously accumulate, including manual annotations, be used efficiently as training data and a knowledge base for AI? This is the logic behind our strategy.
5.2 Key Metrics and Evaluation System
Quality assessment for Copilot is a key tool. It measures and provides feedback on version iterations and guides our optimization efforts. After much trial and error and multiple iterations, we finalized the following core quality metrics:
• SQL success rate: The percentage of generated SQL statements that have correct syntax and can be executed.
• SQL data return rate: The percentage of SQL executions that return meaningful data.
• Response time: The total time from a user's query to the returned result.
• User satisfaction: A composite score based on user feedback and a reverse evaluation of generation quality using an LLM.
This quality evaluation system is simple, clear, specific, and actionable. It will guide our future iterations and optimizations.
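To make the metric definitions concrete, here is a sketch of how the first two metrics could be computed as a daily SLS analysis. The sql_executable and returned_rows fields are assumptions about what the application logs for each generated statement; the real pipeline may differ:

```sql
-- Daily SQL success rate and data return rate (both as percentages)
-- Assumes sql_executable (text) and returned_rows (long) are indexed fields
-- logged for every generated SQL statement
* |
SELECT
  date_trunc('day', from_unixtime(__time__)) AS day,
  round(100.0 * sum(CASE WHEN sql_executable = 'true' THEN 1 ELSE 0 END) / count(*), 2) AS sql_success_rate,
  round(100.0 * sum(CASE WHEN returned_rows > 0 THEN 1 ELSE 0 END) / count(*), 2) AS data_return_rate
GROUP BY day
ORDER BY day
```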
5.3 Actual Performance Improvements
Over the last three months, we gradually built and deployed this infrastructure. We have since continuously optimized Copilot's quality, achieving significant improvements.
Improved Issue Diagnosis Efficiency:
• Issue identification time: Reduced from an average of 30 minutes to under 5 minutes, an 83% improvement.
• Root cause analysis accuracy: Increased from 60% to 90%, a 50% improvement.
• Issue resolution cycle: Shortened from an average of 3 days to 1 day, a 67% improvement.
Improved Service Quality:
• SQL execution rate: Increased from 75% to 85% (a 10-percentage-point improvement).
• User satisfaction: Increased from 3.2 to 4.3 on a 5-point scale, a 34% increase.
• Issue tracking coverage rate: Increased from 10% with manual checks to 100% with full monitoring.
6. Key Takeaways and Future Outlook
6.1 New Trends in the AI Era
Simplified and Lightweight Architecture:
• Fully managed data infrastructure simplifies the system architecture. It lets your team focus on core business innovation.
• The "Idea + Data + AI" development model is simple and efficient. It unleashes unlimited creativity.
Toolchain Integration Trend:
• SLS, OneDay, and DingTalk AI Table show the huge potential of multi-toolchain collaboration in the AI era.
• With AI, the path from idea to implementation is greatly simplified, and development efficiency improves significantly.
Data-driven Quality Closed Loop:
• The first step in developing LLM applications is to create clear and reliable evaluation metrics based on your business needs.
• A complete feedback loop ensures continuous improvement. This loop covers data collection, analysis, insights, and optimization.
6.2 Future Outlook
Intelligent Upgrades:
• Use historical data to understand user semantics and preferences. This helps intelligently predict the user's true intent.
• Use the data infrastructure and LLM technology to automatically generate diagnostic reports and optimization suggestions.
• Build an intelligent root cause analysis system. This reduces the need for manual analysis.
Ecosystem Partnership:
• We plan to contribute the Dify dual-write implementation, created by Yanhe, to the open-source community. Alibaba Cloud observability capabilities are also integrated into the official Dify platform.
• We look forward to partnering with more leading LLM application platforms beyond Dify. Our goal is to promote the development of cloud-native observability.
Conclusion
The rapid growth of LLM applications is changing how we work and live. But ensuring their stability and quality is a major challenge. Our work on the SQL Copilot project shows this. A complete and powerful data infrastructure with strong observability is essential for successful LLM applications.
We shared our experience in this post on building data infrastructure for LLM applications using SLS. We hope it inspires other developers and teams. AI technology is evolving rapidly. We must pursue innovative features. But we must also focus on building infrastructure. A solid foundation is the only way to ensure steady, long-term progress on the path toward intelligence.
A core principle of engineering is that good tools are essential for good work. Our hope is that every LLM application can be equipped with comprehensive observability. This will ensure its steady and long-term success in the age of intelligence.
This article is based on real-world experience from the Alibaba Cloud SLS SQL Copilot project.