Ensuring Reliable Voice Activation: How Gongniu Murora Rebuilt Its Observability System

This article explains how Gongniu Murora migrated from the open source SkyWalking monitoring solution to Alibaba Cloud Application Real-Time Monitoring Service (ARMS), building a comprehensive observability platform with metrics, tracing, log analysis, and intelligent alerting. Beyond the selection criteria, the article highlights the unique value of ARMS in LLM and IoT integration scenarios. By identifying bottlenecks in speech recognition, optimizing LLM inference performance, and ensuring high-quality speech synthesis, Gongniu successfully moved from reactive response to proactive management.

The end-to-end pipeline under observation works as follows: user local gateway → user speech input → automatic speech recognition (ASR) → multi-agent processing → IoT command execution → response text generation → text-to-speech (TTS) → device response.

1. Company Background and Motivation for Architecture Upgrade
As a leading provider of electrical products and solutions in China, Gongniu Group is committed to delivering safe, intelligent, and reliable power solutions. With the company's ongoing digital transformation and business expansion, its application architecture has evolved from a traditional monolithic system to a microservices-based, cloud-native architecture. While this transformation improved system flexibility and scalability, it also complicated the system topology, increased the frequency of service calls, and generated massive amounts of runtime data, posing significant challenges for systematic observability.

Gongniu initially relied on an open source SkyWalking-based monitoring system, which met basic tracing analysis needs. However, as the number of microservices increased and call relationships became more complex, the system hit performance bottlenecks. The O&M team was often forced into reactive troubleshooting, struggling to quickly identify and analyze the root cause of issues, which affected service stability and user experience. To address these challenges, Gongniu decided to rebuild its observability system as one that integrates metrics, tracing, log analytics, and intelligent alerting, enabling a move from reactive to proactive operations.

2. Technology Selection Criteria and Comparison
SkyWalking, a robust open source application performance monitoring (APM) tool, was initially an ideal choice due to its lightweight deployment and broad support for mainstream microservices frameworks. However, as SkyWalking's limitations in enterprise-grade scenarios became apparent, Gongniu recognized that open source solutions alone could not support its goals of a highly available, intelligent, and automated O&M system. This prompted an evaluation of next-generation enterprise-grade observability solutions.

Building an application monitoring system often requires careful consideration and trade-offs between commercial and open source solutions across several key dimensions, with integration complexity foremost among them. Commercial solutions offer standardized instrumentation, auto-discovery mechanisms, and graphical configuration, significantly lowering the barrier to entry, especially for teams with limited resources or those prioritizing rapid deployment. While open source solutions offer flexibility, they require teams to handle instrumentation integration, data format definition, and collection pipeline setup. In complex architectures with heterogeneous technology stacks, this demands substantial manual effort for adaptation and maintenance.

Furthermore, tracing query capabilities directly impact troubleshooting efficiency. Commercial solutions provide intuitive query languages, visual search criteria, and multi-dimensional filtering, enabling rapid identification of anomalous requests. Open source solutions often require custom development for correlated trace queries spanning multiple systems, with the query experience depending on how much customization is implemented. Drill-down analysis capabilities reflect the depth of insight into root causes: Commercial solutions integrate topology, dependency analysis, and anomaly correlation within a unified platform, enabling seamless drill-down from application performance fluctuations to specific instances, threads, and even slow database queries, creating a closed-loop analysis path. While the open source ecosystem can achieve similar functionality by combining different components, data silos between these components are common. Implementing seamless end-to-end drill-down often requires complex integration work and a unified data model.

The performance overhead of monitoring agents is another crucial factor. Commercial solutions, refined through years of production use, are typically highly optimized in resource consumption, sampling strategies, and data reporting mechanisms, maintaining high collection frequency with minimal overhead. While open source agents offer transparency, they can introduce issues such as high CPU and memory usage and log accumulation under high concurrency. Improper configuration can even negatively impact service stability.

Therefore, the final decision depends not only on technical capabilities but also on the team's O&M expertise, long-term investment considerations, and the specific observability needs of the business. It requires finding a balance among flexibility, efficiency, and sustainability. Based on these considerations, Gongniu Group chose ARMS as its observability platform.

3. Gongniu Group's Journey to Observability: From Adoption to Mastery
3.1 Seamless Migration with Zero Business Disruption
In upgrading its observability system, Gongniu Group prioritized both enhanced functionality and a stable, compatible migration that ensured zero business disruption. The team successfully transitioned from SkyWalking to ARMS while maintaining core business continuity. ARMS enabled a smooth, efficient, and low-risk migration through several mechanisms:

• One-click integration: Users can enable tracing analysis in the cloud service console, immediately gaining access to trace data and significantly reducing instrumentation overhead.

• Automatic instrumentation: ARMS offers optimized agents for popular languages including Java, Go, and Python, improving instrumentation quality, performance, and stability without extensive code changes.

• OpenTelemetry support: Leveraging industry-standard protocols, ARMS provides trace mapping, trace topology visualization, and application dependency analysis for distributed applications.
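For teams reporting data over OpenTelemetry rather than a language agent, integration amounts to pointing an OTLP exporter at the platform. The sketch below shows a minimal Python setup; the endpoint and authentication header are placeholders, since the exact ARMS values depend on region and account.

```python
# Minimal sketch: exporting traces over OTLP with the OpenTelemetry Python SDK.
# The endpoint and Authentication header are placeholders, not real ARMS values.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(
    resource=Resource.create({"service.name": "device-command-service"})
)
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(
            endpoint="<your-otlp-endpoint>",             # placeholder
            headers={"Authentication": "<your-token>"},  # placeholder
        )
    )
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("gongniu.demo")
with tracer.start_as_current_span("handle-device-command"):
    pass  # business logic inside the span is traced and exported automatically
```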

3.2 The Path to Mastery: From Adoption to Optimization
After migrating to ARMS, Gongniu Group achieved end-to-end observability across hundreds of application nodes within its microservices architecture. ARMS's high-performance tracing analysis, real-time metric analysis, intelligent anomaly detection, and integrated alerting significantly deepened the system’s observability. The O&M team gained real-time visibility into service traces, identified performance bottlenecks, anticipated potential risks, and leveraged rich dashboards for multi-dimensional data analysis. As a result, mean time to recovery (MTTR) dropped by more than 60%, and high availability for critical services was ensured. Gongniu successfully transitioned from basic monitoring to proactive management, significantly improving O&M efficiency and system stability.

3.2.1 Proactive Inspection with Trace Views
ARMS’s tracing analysis acts as an intelligent navigator for the O&M team, enabling proactive inspections. Using scatter plots and aggregated trace views, the team can quantify the health of critical traces and quickly identify service nodes with abnormal fluctuations. Visual topology maps clearly depict service dependencies, enabling precise identification of single points of failure (SPOFs) and even revealing hidden circular dependencies, allowing for proactive architectural optimization. The system supports real-time aggregation and drill-down analysis of hundreds of millions of traces, helping the team quickly pinpoint the characteristics of anomalous calls within massive datasets.

3.2.2 Closed-loop Internal Stability Governance
ARMS further demonstrated its value in establishing a closed-loop stability governance process. For a core API endpoint handling an average of 80 million calls per day, the team used ARMS's deep tracing analysis to pinpoint read/write performance bottlenecks within a specific cluster. After code-level optimizations, the endpoint’s average response time was reduced by 45%. ARMS’s root cause analysis capabilities allowed the team to trace issues across the entire system. For example, during an IoT device control command latency event, ARMS identified a bottleneck in the LLM service’s response time, which was resolved through scaling and request queue optimization. Importantly, the team integrated key performance metrics from traces into their CI/CD pipeline, automating the detection of slow API response times and high error rates during pre-release testing, and shifting from manual inspection to automated identification.
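As an illustration of such a pipeline gate (a sketch, not Gongniu's actual implementation), the snippet below fails a pre-release build when the staging P99 latency or error rate exceeds a budget; `fetch_endpoint_stats` is a hypothetical stand-in for a real metrics query against the observability platform.

```python
# Illustrative pre-release performance gate for a CI/CD pipeline.
import sys

P99_LATENCY_BUDGET_MS = 800
ERROR_RATE_BUDGET = 0.01  # 1%

def fetch_endpoint_stats(endpoint: str) -> dict:
    # Hypothetical helper: a real pipeline would query the observability
    # platform's API for trace metrics from the staging environment.
    return {"p99_ms": 620.0, "error_rate": 0.004}  # canned demo values

def gate(endpoint: str) -> None:
    stats = fetch_endpoint_stats(endpoint)
    failures = []
    if stats["p99_ms"] > P99_LATENCY_BUDGET_MS:
        failures.append(f"P99 {stats['p99_ms']}ms exceeds {P99_LATENCY_BUDGET_MS}ms budget")
    if stats["error_rate"] > ERROR_RATE_BUDGET:
        failures.append(f"error rate {stats['error_rate']:.2%} exceeds {ERROR_RATE_BUDGET:.0%} budget")
    if failures:
        print("Release gate failed:", "; ".join(failures))
        sys.exit(1)  # fail the build
    print("Release gate passed")

if __name__ == "__main__":
    gate("/api/device/command")  # example endpoint path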

3.3 Unexpected Benefits: Cross-domain Observability
Beyond upgrading their microservices monitoring, Gongniu discovered unexpected cross-domain benefits from ARMS's observability capabilities, particularly in LLM and IoT integration scenarios, where ARMS delivers powerful end-to-end analysis.

3.3.1 LLM and IoT Pipeline Optimization
ARMS's end-to-end tracing analysis provides comprehensive visibility across the entire pipeline, from LLM inference and command dispatch to IoT device responses, enabling precise analysis of latencies in the 100-200 ms range. Previously, troubleshooting required hours of manual log correlation across multiple systems. With ARMS, the team can now pinpoint anomalies within minutes using unified trace mapping. Furthermore, ARMS offers real-time monitoring of LLM API call quality, latency, and error rates, supplying critical data for model iteration and optimization.
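A minimal sketch of how such a pipeline can be modeled as one trace: a root span per request with a child span per stage, so each stage's share of the end-to-end latency is visible at a glance. The stage functions here are stubs, not Gongniu's actual services.

```python
# Sketch: one root span per voice command, one child span per pipeline stage.
from opentelemetry import trace

tracer = trace.get_tracer("voice.pipeline")

# Stubs standing in for the real stage implementations.
def run_asr(audio: bytes) -> str: return "turn on the living room light"
def parse_intent(text: str) -> dict: return {"device": "light", "action": "on"}
def dispatch(intent: dict) -> str: return "ok"
def synthesize_reply(result: str) -> bytes: return b"\x00"

def handle_voice_command(audio: bytes) -> None:
    with tracer.start_as_current_span("voice-command") as root:
        with tracer.start_as_current_span("asr"):
            text = run_asr(audio)
        with tracer.start_as_current_span("llm-intent-parsing") as llm:
            intent = parse_intent(text)
            llm.set_attribute("llm.prompt_length", len(text))
        with tracer.start_as_current_span("iot-command-dispatch"):
            result = dispatch(intent)
        with tracer.start_as_current_span("tts"):
            synthesize_reply(result)
        root.set_attribute("voice.intent_action", intent["action"])

handle_voice_command(b"<audio bytes>")
```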

3.3.2 LLM-powered Voice Pipeline Monitoring and Performance Optimization
ARMS provides end-to-end pipeline tracing, from user voice input to final device response, encompassing key stages such as ASR, LLM intent parsing, and TTS synthesis, with clear visibility into the duration and status of each stage. Building on its cross-domain observability capabilities, ARMS offers specialized deep analysis features for LLMs, providing crucial support for AI engineering and deployment. Beyond basic metrics such as latency, ARMS leverages contextual analysis to deliver deeper insights. For anomaly governance, ARMS's automated anomaly pattern learning significantly improves issue detection. By analyzing historical trace data, the system automatically identifies typical anomaly patterns such as "model stalls" and "truncated responses" (a simplified sketch of this kind of pattern detection follows the list below). ARMS helps teams balance cost and performance, optimizing resource allocation for maximum business impact. Multi-model version comparison capabilities further support data-driven decision-making for model iterations. These features are instrumental in scenarios such as identifying bottlenecks in speech recognition, optimizing LLM inference performance, and ensuring high-quality speech synthesis.

  1. Identify bottlenecks in speech recognition:
    When users reported slow voice command responses averaging 1.8 seconds, the O&M team used ARMS tracing analysis to quickly identify the root cause. ARMS revealed that the ASR service accounted for 65% of the total latency and suffered from numerous retry requests. The team adjusted ASR service parameters and allocated a dedicated resource pool for dialect recognition, reducing end-to-end latency to 600 ms. This targeted approach avoided unnecessary scaling costs and demonstrated ARMS's value in precisely pinpointing performance bottlenecks.

  2. Optimize LLM inference performance:
    During routine inspection, the O&M team observed significant fluctuations in LLM service response times, with P99 latency reaching 2.5 seconds. ARMS revealed a substantial drop in token generation speed during long-text generation scenarios. Based on this insight, the R&D team implemented dynamic batch processing and optimized caching for long text. These optimizations reduced long-text generation latency by 40% and stabilized P99 latency below 1.6 seconds. ARMS's scatter charts, visualizing the non-linear relationship between text length and latency, were crucial for identifying this key bottleneck within the vast dataset.

  3. Ensure high-quality speech synthesis:
    When users reported decreased TTS quality in specific dialect scenarios, the O&M team used ARMS to identify a rising failure rate in calling the corresponding dialect model as the primary cause. They set up a quality monitoring dashboard for dialect models with automated alerts that trigger when failure rates exceed predefined thresholds, prompting the R&D team to take corrective action. ARMS now enables Gongniu to monitor not only technical metrics but also the direct impact on user experience, shifting quality assurance from reactive response to proactive prevention.
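To make the "model stall" and "truncated response" patterns mentioned above concrete, here is a toy heuristic over streamed token timestamps. The gap and length thresholds are illustrative assumptions, not ARMS's internal logic.

```python
# Toy heuristic for "model stall" and "truncated response" patterns,
# based on per-token timestamps from a streamed LLM response.
STALL_GAP_SECONDS = 2.0   # assumed inter-token gap that counts as a stall
MIN_COMPLETE_TOKENS = 5   # assumed floor below which a reply looks truncated

def classify_stream(token_timestamps: list[float], finished: bool) -> list[str]:
    """Return anomaly labels for one streamed response."""
    anomalies = []
    for earlier, later in zip(token_timestamps, token_timestamps[1:]):
        if later - earlier > STALL_GAP_SECONDS:
            anomalies.append("model-stall")
            break
    if not finished or len(token_timestamps) < MIN_COMPLETE_TOKENS:
        anomalies.append("truncated-response")
    return anomalies

# A stream that paused for 3 seconds mid-generation and never finished:
print(classify_stream([0.0, 0.1, 0.2, 3.2, 3.3], finished=False))
# -> ['model-stall', 'truncated-response']
```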

3.3.3 Improved Development Efficiency
ARMS's observability capabilities have transformed not only O&M processes but also development workflows. The R&D team uses ARMS-driven tracing analysis for targeted code reviews, enabling precise code improvements. New team members can quickly grasp system interactions through intuitive call graphs without needing to navigate extensive documentation. For cross-team collaboration, ARMS breaks down data silos between departments, providing shared access to observability data. This data-driven collaboration has increased cross-departmental troubleshooting efficiency by more than 50%.

4. Gongniu's Thinking and Practices: Making Observability a Core Competency
During their adoption of ARMS, Gongniu actively explored best practices tailored to their specific business needs, focusing on areas such as sampling strategies, alert rule design, and custom metric planning.

4.1 Flexible Observability Configuration Based on Business Needs and Scenarios
4.1.1 Optimize Sampling Strategies to Balance Storage Costs and Observability
Building a highly available and observable application monitoring system requires well-defined configuration strategies. These are crucial not only for successful implementation but also for maintaining system stability and business continuity. Sampling strategies must balance data completeness with resource costs. Indiscriminate full sampling in large-scale microservices environments leads to high storage and transmission overhead. Therefore, a differentiated sampling approach based on business criticality is essential: critical paths such as payment, order processing, and core APIs should employ 100% sampling to ensure full traceability of any anomalies. For lower-volume or non-critical paths, adaptive sampling mechanisms can dynamically adjust the sampling rate based on request volume, error rate, or latency, maintaining observability while effectively managing data volume.
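A minimal sketch of such differentiated head sampling with the OpenTelemetry Python SDK: spans on critical routes are always sampled, while everything else falls back to a low trace-ID ratio. The route prefixes are examples, not Gongniu's actual endpoints.

```python
# Sketch of differentiated head sampling with the OpenTelemetry SDK:
# critical routes are always sampled, everything else at a low ratio.
from opentelemetry.sdk.trace.sampling import (
    Decision, Sampler, SamplingResult, TraceIdRatioBased,
)

CRITICAL_PREFIXES = ("/payment", "/order", "/api/device/command")  # examples

class BusinessAwareSampler(Sampler):
    def __init__(self, default_ratio: float = 0.1):
        self._fallback = TraceIdRatioBased(default_ratio)

    def should_sample(self, parent_context, trace_id, name,
                      kind=None, attributes=None, links=None, trace_state=None):
        route = (attributes or {}).get("http.route", "")
        if any(route.startswith(p) for p in CRITICAL_PREFIXES):
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)
        return self._fallback.should_sample(
            parent_context, trace_id, name, kind, attributes, links, trace_state
        )

    def get_description(self) -> str:
        return "BusinessAwareSampler"
```

The sampler is passed to the tracer provider (`TracerProvider(sampler=...)`); in production it would usually be wrapped in `ParentBased` so that sampling decisions stay consistent across a trace.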

4.1.2 Configure Dynamic Alert Thresholds Based on P99 Latency to Prevent Alert Fatigue
Alerting mechanisms should move beyond static thresholds to prevent "alert fatigue." Fixed thresholds can trigger spurious alerts due to normal fluctuations in traffic patterns or temporary latency increases during peak activity, leading O&M teams to ignore or dismiss even genuine alerts. Dynamic thresholds based on percentiles such as P99 or P95 latency, combined with comparative trend analysis, provide more accurate detection of performance degradations, improving alert sensitivity and accuracy.
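As a sketch of this idea, the snippet below compares the current window's P99 against the median of recent baseline windows instead of a fixed number; the 1.5x margin and window shapes are assumptions to tune per service.

```python
# Illustrative dynamic threshold: alert when the current window's P99 exceeds
# a rolling baseline by a margin, instead of a fixed absolute value.
import statistics

def p99(samples: list[float]) -> float:
    ordered = sorted(samples)  # assumes a non-empty window
    return ordered[max(0, int(len(ordered) * 0.99) - 1)]

def should_alert(current_window_ms: list[float],
                 baseline_windows_ms: list[list[float]],
                 margin: float = 1.5) -> bool:
    baseline = statistics.median(p99(w) for w in baseline_windows_ms)
    return p99(current_window_ms) > baseline * margin

current = [120.0] * 95 + [900.0] * 5     # latest window (ms): P99 spike
history = [[118.0] * 99 + [300.0]] * 7   # same window over recent periods
print(should_alert(current, history))    # True: P99 is far above baseline
```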

4.1.3 Extend Custom Metrics Based on Business Needs
Observability should not be limited to system-level metrics or generic trace data. It should incorporate business-specific context. Custom metrics can bridge the gap between technical performance and business impact. For example, incorporating key business metrics such as IoT device response success rates into trace data provides insights not only into system health but also into business functionality and availability.
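With OpenTelemetry's metrics API, recording such a business metric is a one-liner next to existing trace instrumentation. The metric and attribute names below are illustrative; the success rate itself is derived at query time from the per-outcome counts.

```python
# Sketch: a custom business counter recorded alongside trace instrumentation,
# assuming an OpenTelemetry meter provider is already configured for export.
from opentelemetry import metrics

meter = metrics.get_meter("gongniu.business")
device_responses = meter.create_counter(
    "iot.device.responses",
    unit="1",
    description="IoT device command responses by outcome",
)

def record_device_response(device_type: str, success: bool) -> None:
    device_responses.add(
        1,
        {"device.type": device_type,
         "outcome": "success" if success else "failure"},
    )

record_device_response("smart_switch", success=True)
```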

4.1.4 Model-specific Monitoring for AI Innovation
Given the widespread adoption of LLMs, monitoring strategies must adapt to their unique performance characteristics and user experience sensitivities. This requires specialized configurations:

• Dedicated LLM metrics: Create dedicated metric groups for LLM inference services, focusing on key performance metrics such as token generation speed, time to first token (TTFT), and context length processing efficiency to comprehensively characterize model behavior (see the sketch below).

• Prompt length analysis: Configure dashboards correlating prompt length with response time to understand the impact of input length on inference performance, aiding in model optimization and resource planning.

• Real-time service level agreement (SLA) monitoring: For latency-sensitive scenarios such as speech recognition and synthesis, implement end-to-end SLA monitoring. This involves latency modeling across the entire request lifecycle, from user initiation to audio playback, segmenting key stages (e.g., ASR, NLU, TTS) with stage-specific alerting thresholds to ensure acceptable user experience.

These configurations work in concert within a unified observability platform, creating a closed loop from data collection, storage, and analysis to alerting and visualization, enabling deep insights and proactive governance of complex distributed systems.
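A sketch of how the dedicated LLM metrics in the first bullet could be captured around a streaming call, assuming a client that yields tokens: TTFT is recorded when the first token arrives, and generation speed once the stream ends. `stream_completion` is a stand-in, and the metric names are illustrative.

```python
# Sketch: recording TTFT and token generation speed around a streaming call.
import time
from opentelemetry import metrics

meter = metrics.get_meter("gongniu.llm")
ttft_ms = meter.create_histogram("llm.ttft", unit="ms")
token_rate = meter.create_histogram("llm.token_rate", unit="{token}/s")

def stream_completion(prompt: str):
    # Stand-in generator for any streaming LLM client.
    for tok in ["Turning", "on", "the", "living", "room", "light"]:
        time.sleep(0.05)
        yield tok

def measured_completion(prompt: str) -> str:
    start = time.monotonic()
    first_token_at = None
    tokens = []
    for tok in stream_completion(prompt):
        if first_token_at is None:
            first_token_at = time.monotonic()
            ttft_ms.record((first_token_at - start) * 1000.0)  # time to first token
        tokens.append(tok)
    elapsed = time.monotonic() - (first_token_at or start)
    if tokens and elapsed > 0:
        token_rate.record(len(tokens) / elapsed)  # tokens per second
    return " ".join(tokens)

print(measured_completion("turn on the light"))
```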

4.2 Build an Observability Culture
Deploying monitoring tools is just the first step. The true value of observability lies in its integration into a team's mindset and workflows. When observability shifts from an optional add-on to an ingrained practice, teams move beyond reactive alert handling. They can anticipate problems, analyze trends, and proactively optimize performance, building a modern R&D governance system that is data-driven, collaborative, and stability-focused. Gongniu Group has taken a multifaceted approach to cultivating this observability culture.

Developing tracing analysis skills requires more than ad-hoc knowledge transfer or reactive troubleshooting; it requires systematic, hands-on training. To this end, Gongniu regularly runs ARMS analysis workshops based on real production events. These workshops reconstruct end-to-end traces of typical faults, guiding R&D, O&M, and testing teams through the complete process of identifying and resolving issues, from initial symptom detection to root cause analysis. Participants gain practical experience in contextual analysis, anomaly pattern recognition, and cross-team collaboration, transforming trace data from mere visualized lines on the observability platform into a powerful problem-solving framework.

Furthermore, observability must transcend tooling and become ingrained in the team's culture. Integrating metrics such as trace health, key endpoint P99 latency, and error rates into each R&D team's objectives and key results (OKRs) fosters shared ownership of system stability. This encourages a shift in mindset from "delivering code" to "ensuring system availability." Coupled with a "resolve issues today" approach, teams are empowered to proactively identify and address potential problems within the same day, creating a positive feedback loop of continuous improvement.

This cultural shift influences development practices, encouraging engineers to prioritize observability from the design and coding phases, embracing an "observability-driven development" philosophy. This includes:

• Contextualized tracing: Enriching traces with business-relevant identifiers such as order IDs, user session tokens, or request context tags enables precise attribution across critical paths (see the sketch below).

• Structured logging: Maintaining structured logs with contextual information linked to trace IDs allows for correlated analysis across logs, metrics, and traces.

• Observability by design: Defining key instrumentation points and dashboard prototypes during API design integrates observability as a core component of software delivery, rather than an afterthought.
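A compact sketch of the first two practices, assuming OpenTelemetry instrumentation is in place: the active span is enriched with a business identifier, and a structured log line carries the trace ID so logs and traces can be joined. Field names are illustrative.

```python
# Sketch: enrich the active span with a business ID and emit a structured log
# line carrying the trace ID so logs and traces can be joined.
import json
import logging
from opentelemetry import trace

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("orders")

def process_order(order_id: str) -> None:
    span = trace.get_current_span()
    span.set_attribute("order.id", order_id)  # contextualized tracing
    ctx = span.get_span_context()
    logger.info(json.dumps({                  # structured, trace-linked log
        "event": "order.processed",
        "order_id": order_id,
        "trace_id": format(ctx.trace_id, "032x"),
    }))

process_order("ORD-42")
```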

4.3 Future Plans
As Gongniu's observability practices mature and its culture of observability takes root, the company has set higher expectations for both its observability system and for ARMS's capabilities. Its roadmap focuses on enhanced intelligence, cost control, cross-platform collaboration, and deeper business integration, with the goal of making observability a core pillar of business innovation and system stability. Their journey offers valuable insights for other enterprises.

4.3.1 AI-powered Root Cause Analysis
As system architectures and business complexity continue to grow, observability must evolve from simply "seeing" to "understanding, predicting, and optimizing." AI plays a key role in this evolution. By combining ARMS's end-to-end trace data with the wealth of unstructured information within Simple Log Service (SLS), Gongniu aims to build context-aware root cause analysis models. These models will automatically identify anomaly patterns, correlate performance degradation signals across services, and pinpoint the most probable root causes in complex, multi-faceted scenarios. This will dramatically reduce MTTR, shifting O&M response from manual troubleshooting to AI-powered recommendations and even automated remediation.

4.3.2 Cost Optimization
As data volumes grow exponentially, cost management becomes increasingly important. Simple storage expansion is not sustainable; fine-grained resource governance is essential. Leveraging ARMS's intelligent sampling capabilities, Gongniu can dynamically adjust data collection density based on request importance, error status, or call path, minimizing redundant data while preserving complete visibility into critical paths. Combined with tiered storage strategies—keeping frequently accessed hot data in high-performance storage and migrating archived data to lower-cost options—Gongniu can maintain a balance between performance and cost, ensuring the economic sustainability of its observability system as it scales.

4.3.3 Unified Hybrid Cloud Observability
The rise of hybrid cloud infrastructure introduces new challenges in data siloing. Gongniu plans to deepen integration between ARMS and its public cloud (multi-account) and private cloud environments. By using standardized data ingestion protocols and a unified metadata model, they aim to connect traces, metrics, and event logs across their hybrid cloud deployments, building a unified observability platform with a global view. This will deliver a consistent observability experience and unified alerting regardless of where applications run.

4.3.4 Deepening LLM Observability
As LLM adoption accelerates, observability must evolve from external metrics to deeper internal insights. Future observability efforts will move beyond surface-level metrics such as token generation speed to the intricacies of model inference. This includes capturing granular runtime data such as GPU utilization fluctuations, key-value cache hit ratios, and batch processing efficiency. By combining this data with prompt semantic classification and response quality assessments, Gongniu aims to build correlation models between model performance and input characteristics. This will enable multi-dimensional performance comparisons across model versions, tenants, and business scenarios, providing more precise data for model optimization, resource allocation, and SLA compliance.

5. Summary
By migrating from SkyWalking to ARMS, Gongniu Group built a comprehensive, end-to-end observability system, achieving three key shifts: from reactive response to proactive prevention, from siloed monitoring to holistic insights, and from technical metrics to business value. ARMS not only addressed the shortcomings of their previous observability system but also became a crucial enabler of business innovation, particularly within their LLM and IoT integration scenarios.

For AI-powered voice services, ARMS provides full-pipeline tracing analysis, from voice input to device response, along with specialized metrics tailored for LLMs. This enabled Gongniu to pinpoint performance bottlenecks, optimize user experience, and control service costs. Compared to traditional monitoring tools, ARMS's LLM observability capabilities represent a significant leap, moving from simply identifying issues to understanding and even preventing them.

ARMS serves as the cornerstone of Gongniu's observability strategy, delivering substantial value. In today’s environment of deep AI–business integration, robust observability has become essential infrastructure for enterprises’ digital transformation.
