<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Leo Dev Blog</title>
    <description>The latest articles on DEV Community by Leo Dev Blog (@junjie_qin_512245a2eac9a4).</description>
    <link>https://dev.to/junjie_qin_512245a2eac9a4</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1651343%2F510c48aa-9a7a-46b1-b2fc-81ddb2ff72d3.jpg</url>
      <title>DEV Community: Leo Dev Blog</title>
      <link>https://dev.to/junjie_qin_512245a2eac9a4</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/junjie_qin_512245a2eac9a4"/>
    <language>en</language>
    <item>
      <title>Automated Monitoring and Message Notification System for Payment Channels</title>
      <dc:creator>Leo Dev Blog</dc:creator>
      <pubDate>Mon, 25 Nov 2024 00:20:12 +0000</pubDate>
      <link>https://dev.to/junjie_qin_512245a2eac9a4/automated-monitoring-and-message-notification-system-for-payment-channels-5gn0</link>
      <guid>https://dev.to/junjie_qin_512245a2eac9a4/automated-monitoring-and-message-notification-system-for-payment-channels-5gn0</guid>
      <description>&lt;h1&gt;
  
  
  Building an Automated Monitoring System for Payment Channels
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;When third-party channels experience failures, our awareness is often delayed. Typically, we rely on extensive system alerts or feedback from users and business teams to detect anomalies. As the core system responsible for managing company-wide payment operations, it's insufficient to rely solely on manual maintenance. Thus, building a robust automated monitoring system for payment channels becomes crucial.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  1. Background
&lt;/h2&gt;

&lt;p&gt;To accommodate growing business demands, we have integrated numerous payment channels. However, the stability of third-party systems varies greatly, and channel failures occur frequently. When such anomalies happen, detection often lags, with alerts or user feedback as the primary indicators. For a core payment system aiming to provide stable services upstream, manual maintenance alone is inadequate. This necessitates the establishment of an automated monitoring and management system for payment channels.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Design Goals
&lt;/h2&gt;

&lt;p&gt;Based on our business requirements, the automated payment channel management system should address the following key challenges:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Monitoring capabilities across multiple channels and entities.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Rapid fault detection and precise identification of root causes.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Minimized false positives and missed alerts.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Automatic failover in case of channel failures.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  3. Technology Selection
&lt;/h2&gt;

&lt;p&gt;Given the background, the following technology options were evaluated:&lt;/p&gt;

&lt;h3&gt;
  
  
  3.1 Circuit Breaker
&lt;/h3&gt;

&lt;p&gt;Circuit breakers are commonly associated with fault isolation and fallback mechanisms. We explored mature solutions such as &lt;strong&gt;Hystrix&lt;/strong&gt;, but identified several limitations for our use case:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Circuit breakers operate at the interface level, lacking granularity for channel- or merchant-level fault isolation.&lt;/li&gt;
&lt;li&gt;During traffic recovery, residual issues may persist, and there's no ability to define targeted traffic for testing (e.g., specific users or services), increasing the risk of secondary incidents.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  3.2 Time-Series Database
&lt;/h3&gt;

&lt;p&gt;After ruling out circuit breakers, we turned to developing a custom monitoring system. Time-series databases are often used as the foundation for such systems. Below is an evaluation of popular options:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Firtme0l97w2q47znyce2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Firtme0l97w2q47znyce2.png" alt="Image description" width="800" height="432"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The final contenders were &lt;strong&gt;Prometheus&lt;/strong&gt; and a custom solution built on &lt;strong&gt;Redis&lt;/strong&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Accuracy
&lt;/h4&gt;

&lt;p&gt;Prometheus sacrifices some accuracy in favor of higher reliability, simplicity in architecture, and reduced operational overhead. While this tradeoff is acceptable for traditional monitoring systems, it does not suit high-sensitivity scenarios like automatic channel failover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Missed Spikes:&lt;/strong&gt; Prometheus can miss transient spikes that rise and fall between two scrapes (e.g., with a 15-second scrape interval).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Statistical Estimates:&lt;/strong&gt; Metrics such as QPS, RT, P95, and P99 are approximations and cannot match the precision of raw logs or database records.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1qwpkch5vh5c783xokl0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1qwpkch5vh5c783xokl0.png" alt="Image description" width="800" height="140"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Ease of Integration and Maintenance
&lt;/h4&gt;

&lt;p&gt;Prometheus has a learning curve for business developers and poses challenges in long-term maintenance. Conversely, Redis is already familiar to Java backend developers, offering lower initial learning and ongoing maintenance costs.&lt;/p&gt;

&lt;p&gt;Considering the above factors, we decided to build a custom "time-series database" based on Redis to meet our requirements.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Architecture Design
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Workflow Design
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Transaction Routing:&lt;/strong&gt; For both receiving and making payments, requests are routed through the respective channel router to filter available payment channels.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Order Processing:&lt;/strong&gt; After selecting the channel, the gateway processes the payment or disbursement request and sends it to the third-party provider.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring Data:&lt;/strong&gt; The response from the third-party provider is reported to the monitoring system via a message queue (MQ).&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Monitoring System Workflow
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;The monitoring system listens to MQ messages and stores monitoring data in Redis.&lt;/li&gt;
&lt;li&gt;The data processing module fetches data from Redis, filters it, calculates failure rates for each channel, and triggers alerts based on configured rules.&lt;/li&gt;
&lt;li&gt;Data in Redis is periodically backed up to MySQL for subsequent fault analysis.&lt;/li&gt;
&lt;li&gt;Offline tasks regularly clean Redis data to avoid excessive storage.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Data Visualization
&lt;/h3&gt;

&lt;p&gt;To observe changes in channel metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Metrics are reported to &lt;strong&gt;Prometheus&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grafana&lt;/strong&gt; dashboards display the channel's health status.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu23zaknjzefbxhie2cyl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu23zaknjzefbxhie2cyl.png" alt="Image description" width="800" height="317"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Automated Channel Management
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Initially, only manual channel management (online/offline) is enabled due to the sensitivity of the operation.&lt;/li&gt;
&lt;li&gt;After collecting substantial samples and refining the algorithms, the system will gradually enable automated channel management based on monitoring results.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  5. Implementation Details
&lt;/h2&gt;

&lt;h3&gt;
  
  
  5.1 Data Structure
&lt;/h3&gt;

&lt;p&gt;The data is stored in Redis with a design inspired by time-series databases like InfluxDB:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;InfluxDB&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Redis&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;tags&lt;/td&gt;
&lt;td&gt;set to record monitoring dimensions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;time&lt;/td&gt;
&lt;td&gt;zset to store timestamps (in seconds)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;fields&lt;/td&gt;
&lt;td&gt;hash to store specific values&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tags (Labels):&lt;/strong&gt; Monitored dimensions are stored using Redis sets (&lt;code&gt;SET&lt;/code&gt;), leveraging its deduplication feature.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timestamps:&lt;/strong&gt; Data points are stored using Redis sorted sets (&lt;code&gt;ZSET&lt;/code&gt;) to allow time-based lookups and ordering. Each point represents one second.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fields (Metrics):&lt;/strong&gt; Specific monitoring data is stored in Redis hashes (&lt;code&gt;HASH&lt;/code&gt;). Each key-value pair represents:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Key:&lt;/strong&gt; Result type (e.g., success or failure).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Value:&lt;/strong&gt; Count of occurrences within one second, including specific failure reasons.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example Redis Data Structure:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SET: Tags -&amp;gt; Stores monitored dimensions.
ZSET: Timestamps -&amp;gt; Stores event times.
HASH: Metrics -&amp;gt; Stores success/failure counts and failure reasons.

### Redis Data Structure

1. **Set**
   - Stores the monitored dimensions, specific to the merchant ID.
   - **Key**: `routeAlarm:alarmitems`  
   - **Values**:  
     - `WeChat-Payment-100000111`  
     - `WeChat-Payment-100000112`  
     - `WeChat-Payment-100000113`  
     - ...

2. **ZSet**
   - Stores timestamps (in seconds) for requests from a specific merchant ID. Data for the same second will overwrite previous entries.
   - **Key**: `routeAlarm:alarmitem:timeStore:WeChat-Payment-100000111`  
   - **Scores and Values**:  
     - `score: 1657164225`, `value: 1657164225`  
     - `score: 1657164226`, `value: 1657164226`  
     - `score: 1657164227`, `value: 1657164227`  
     - ...

3. **Hash**
   - Stores the aggregated request results within 1 second for a specific merchant ID.
   - **Key**: `routeAlarm:alarmitem:fieldStore:WeChat-Payment-100000111:1657164225`  
   - **Fields and Values**:  
     - `key: success`, `value: 10` (count)  
     - `key: fail`, `value: 5`  
     - `key: balance_not_enough`, `value: 3`  
     - `key: third_error`, `value: 2`  
     - ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
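&lt;p&gt;As an illustrative sketch (not the production code), the write path for one third-party response can be mimicked with in-memory stand-ins for the three structures above. A real implementation would issue &lt;code&gt;SADD&lt;/code&gt;, &lt;code&gt;ZADD&lt;/code&gt;, and &lt;code&gt;HINCRBY&lt;/code&gt; against Redis using the same &lt;code&gt;routeAlarm:*&lt;/code&gt; key convention; all function and variable names here are hypothetical.&lt;/p&gt;

```python
from collections import defaultdict

# In-memory stand-ins for the three Redis structures described above.
dimensions = set()                                # SET  routeAlarm:alarmitems
timestamps = defaultdict(set)                     # ZSET ...timeStore:dim
metrics = defaultdict(lambda: defaultdict(int))   # HASH ...fieldStore:dim:sec

def record(channel, merchant_id, epoch_second, result):
    """Record one third-party response for a channel/merchant pair."""
    dim = f"{channel}-Payment-{merchant_id}"
    dimensions.add(dim)                  # SADD routeAlarm:alarmitems dim
    timestamps[dim].add(epoch_second)    # ZADD ...timeStore:dim sec sec
    # HINCRBY routeAlarm:alarmitem:fieldStore:dim:sec result 1
    metrics[(dim, epoch_second)][result] += 1

record("WeChat", "100000111", 1657164225, "success")
record("WeChat", "100000111", 1657164225, "fail")
record("WeChat", "100000111", 1657164225, "balance_not_enough")
```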



&lt;h3&gt;
  
  
  5.2 Core Algorithm
&lt;/h3&gt;

&lt;p&gt;To avoid missing short spikes between monitoring intervals and ensure accurate reporting, the algorithm combines &lt;strong&gt;local counting&lt;/strong&gt; with a &lt;strong&gt;global sliding window&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Per-Second Tracking:&lt;/strong&gt; Records the number of successes and failures for each second.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sliding Window Calculation:&lt;/strong&gt; Computes success and failure counts across the entire window duration, ultimately determining the failure rate for each channel.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Example:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Window Duration:&lt;/strong&gt; 1 minute.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring Frequency:&lt;/strong&gt; Every 10 seconds.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmqrzagduma0u3jo1ex3x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmqrzagduma0u3jo1ex3x.png" alt="Image description" width="800" height="223"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Key Factors Affecting Accuracy:
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring Frequency:&lt;/strong&gt; 

&lt;ul&gt;
&lt;li&gt;Too low a frequency lets short-lived spikes slip between two checks, causing underreporting.&lt;/li&gt;
&lt;li&gt;Too high a frequency means each check sees too few new samples, reducing accuracy.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Window Size:&lt;/strong&gt; Must balance sample size against real-time responsiveness.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The frequency and window size are determined based on metrics like daily transaction volume, hourly order frequency, and submission rates for each channel.&lt;/p&gt;
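&lt;p&gt;A minimal sketch of the sliding-window calculation described above, assuming per-second success/fail counts keyed by epoch second (matching the Redis hash layout); function and variable names are illustrative:&lt;/p&gt;

```python
def failure_rate(metrics_by_second, now, window_seconds):
    """Failure rate over the window (now - window_seconds, now],
    aggregating the per-second success/fail counts."""
    success = 0
    fail = 0
    for sec, counts in metrics_by_second.items():
        if sec > now - window_seconds and now >= sec:
            success += counts.get("success", 0)
            fail += counts.get("fail", 0)
    total = success + fail
    return fail / total if total > 0 else 0.0

# 8+2 results at second 100, 5+5 at second 130, 60-second window ending at 130
rate = failure_rate({100: {"success": 8, "fail": 2},
                     130: {"success": 5, "fail": 5}}, now=130, window_seconds=60)
```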




&lt;h3&gt;
  
  
  5.3 Handling Low Traffic
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Challenges with Low Traffic:
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Channel Dimension:&lt;/strong&gt; Handling channels with low daily transaction volumes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Time Dimension:&lt;/strong&gt; Managing off-peak periods with sparse transactions.&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  Solution:
&lt;/h4&gt;

&lt;p&gt;For channels with low traffic or off-peak times:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If there is only one transaction in the monitoring window and it fails, the window size is incrementally expanded:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Initial Window:&lt;/strong&gt; 1 minute.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expanded Window:&lt;/strong&gt; Doubles (e.g., 2 minutes, 4 minutes) up to 10x.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;If the failure rate exceeds the threshold even after expansion, an alert is triggered, as such cases are treated as critical anomalies.&lt;/li&gt;

&lt;/ul&gt;
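&lt;p&gt;The expansion rule above can be sketched as follows. The doubling schedule and the 10x cap come from the text; the helper names and the per-second count layout are illustrative:&lt;/p&gt;

```python
def totals(metrics_by_second, now, window_seconds):
    """Success/fail totals over the window (now - window_seconds, now]."""
    success = fail = 0
    for sec, counts in metrics_by_second.items():
        if sec > now - window_seconds and now >= sec:
            success += counts.get("success", 0)
            fail += counts.get("fail", 0)
    return success, fail

def effective_failure_rate(metrics_by_second, now, base_window=60, max_multiplier=10):
    """If the window holds exactly one transaction and it failed, double the
    window (2x, 4x, ... capped at max_multiplier x base) before deciding."""
    multiplier = 1
    while True:
        success, fail = totals(metrics_by_second, now, base_window * multiplier)
        total = success + fail
        if total == 1 and fail == 1 and max_multiplier > multiplier:
            multiplier = min(multiplier * 2, max_multiplier)
            continue
        return fail / total if total > 0 else 0.0
```

A lone failure in the 1-minute window is thus re-evaluated over a progressively larger window; if the rate still exceeds the threshold at the cap, the alert fires.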




&lt;h2&gt;
  
  
  6. Outcomes
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Ensured accuracy of monitoring and alerting, minimizing missed anomalies.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqujjwm9tjnp2ppn13eba.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqujjwm9tjnp2ppn13eba.png" alt="Image description" width="528" height="344"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Duplicate alarm entries are merged to reduce alert noise.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvpfhemg5vhrfdhhokztr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvpfhemg5vhrfdhhokztr.png" alt="Image description" width="537" height="342"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Channel anomaly recovery is detected and reported.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz7ymqwk3bdm2sv1xx3kv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz7ymqwk3bdm2sv1xx3kv.png" alt="Image description" width="523" height="207"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Future Plans
&lt;/h2&gt;

&lt;p&gt;To further enhance the automated monitoring system:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Continuously optimize monitoring algorithms to achieve alert accuracy of 99% or higher.&lt;/li&gt;
&lt;li&gt;Integrate with the monitoring system to enable automatic channel deactivation upon fault detection.&lt;/li&gt;
&lt;li&gt;Implement automatic fault recovery detection and enable automated channel reactivation.&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>systemdesign</category>
      <category>payment</category>
      <category>eventdriven</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>High Performance Notification System Practices</title>
      <dc:creator>Leo Dev Blog</dc:creator>
      <pubDate>Thu, 21 Nov 2024 04:20:53 +0000</pubDate>
      <link>https://dev.to/junjie_qin_512245a2eac9a4/high-performance-notification-system-practices-1kfp</link>
      <guid>https://dev.to/junjie_qin_512245a2eac9a4/high-performance-notification-system-practices-1kfp</guid>
      <description>&lt;h1&gt;
  
  
  Building a High-Performance Notification System
&lt;/h1&gt;

&lt;h2&gt;
  
  
  1. Service Segmentation
&lt;/h2&gt;


&lt;p&gt;In any company, a notification system is an indispensable component. Each team may develop its own notification modules, but as the company grows, problems like maintenance complexity, issue debugging, and high development costs begin to emerge. For example, in our enterprise WeChat notification system, due to variations in message templates, a single project may use three different components—not even counting other notification functionalities.&lt;/p&gt;

&lt;p&gt;Given this context, there is an urgent need to develop a universal notification system. The key challenge lies in efficiently handling a large volume of message requests while ensuring system stability. This article explores how to build a high-performance notification system.&lt;/p&gt;

&lt;h3&gt;
  
  
  Architecture Overview
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Configuration Layer&lt;/strong&gt;: This layer consists of a backend management system for configuring sending options, including request methods, URLs, expected responses, channel binding and selection, retry policies, and result queries.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Interface Layer&lt;/strong&gt;: Provides external services, supporting both RPC and MQ. Additional protocols like HTTP can be added later as needed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Core Service Layer&lt;/strong&gt;: The business logic layer manages initial and retry message sending, message channel routing, and service invocation encapsulation. This design isolates normal and abnormal service execution to prevent faulty services from affecting normal operations. For instance, if a particular message channel has a high latency, it could monopolize resources, impacting normal service requests. Executors are selected via routing strategies, including both configured routing policies and dynamic fault discovery.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Common Component Layer&lt;/strong&gt;: Encapsulates reusable components for broader use.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Storage Layer&lt;/strong&gt;: Includes a caching layer for storing sending strategies, retry policies, and other transient data, as well as a persistence layer (ES and MySQL). MySQL stores message records and configurations, while ES is used for storing message records for user queries.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  2. System Design
&lt;/h2&gt;

&lt;h3&gt;
  
  
  2.1 Initial Message Sending
&lt;/h3&gt;

&lt;p&gt;When handling message-sending requests, two common approaches are RPC service requests and MQ message consumption. Each has its pros and cons: RPC gives the caller a synchronous result, so delivery failures are known immediately, while MQ provides asynchronous decoupling and load leveling.&lt;/p&gt;

&lt;h4&gt;
  
  
  2.1.1 Idempotency Handling
&lt;/h4&gt;

&lt;p&gt;To prevent processing duplicate message content, idempotency designs are implemented. Common approaches include locking followed by querying or using unique database keys. However, querying the database can become slow with high message volumes. Since duplicate messages usually occur within short intervals, Redis is a practical solution. By checking for duplicate Redis keys and verifying message content, idempotency can be achieved. Note: identical Redis keys with different message content may be allowed, depending on business needs.&lt;/p&gt;
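&lt;p&gt;A sketch of this idempotency check, using a plain dictionary as a stand-in for Redis (a real implementation would use &lt;code&gt;SET key digest NX EX ttl&lt;/code&gt; so entries expire after the short interval in which duplicates occur); the names and the content-digest approach are illustrative:&lt;/p&gt;

```python
import hashlib

# Stand-in for Redis: message id mapped to a digest of its content.
seen = {}

def accept(message_id, content, allow_different_content=False):
    """True if the message should be processed, False if it is a duplicate."""
    digest = hashlib.sha256(content.encode()).hexdigest()
    previous = seen.get(message_id)
    if previous is None:
        seen[message_id] = digest       # SET key digest NX EX ttl in Redis
        return True
    if previous == digest:
        return False                    # same id, same content: duplicate
    return allow_different_content      # same id, new content: business decision
```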

&lt;h4&gt;
  
  
  2.1.2 Faulty Service Dynamic Detector
&lt;/h4&gt;

&lt;p&gt;Routing strategies include both configured routes and dynamic service fault-detection routing. The latter relies on a dynamic service detector to identify faulty channels and reroute execution via a fault-notification executor.&lt;/p&gt;

&lt;p&gt;This functionality uses Sentinel APIs within JVM nodes to track total and failed requests within a time window. If thresholds are exceeded, the service is flagged as faulty. Key methods include &lt;code&gt;loadExecuteHandlerRules&lt;/code&gt; (setting flow control rules, dynamically adjustable via Apollo/Nacos) and &lt;code&gt;judge&lt;/code&gt; (intercepting failed requests to mark services as faulty).&lt;/p&gt;

&lt;p&gt;Faulty services are not permanently flagged. An automatic recovery mechanism includes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Silent Period&lt;/strong&gt;: Requests during this time are handled by the fault executor.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Half-Open Period&lt;/strong&gt;: If sufficient successful requests occur, the service is restored to normal.&lt;/li&gt;
&lt;/ol&gt;
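&lt;p&gt;The recovery lifecycle can be sketched as a small state machine; the state names, silent-period length, and probe threshold below are illustrative, not the production Sentinel configuration:&lt;/p&gt;

```python
class FaultState:
    """Sketch of the lifecycle: FAULTY (silent period, requests go to the
    fault executor) then HALF_OPEN (probing) then NORMAL once enough
    consecutive probes succeed."""
    def __init__(self, silent_seconds=30, probes_needed=3):
        self.state = "NORMAL"
        self.faulted_at = None
        self.silent_seconds = silent_seconds
        self.probes_needed = probes_needed
        self.probe_successes = 0

    def mark_faulty(self, now):
        self.state = "FAULTY"
        self.faulted_at = now
        self.probe_successes = 0

    def on_request(self, now, succeeded):
        if self.state == "FAULTY" and now >= self.faulted_at + self.silent_seconds:
            self.state = "HALF_OPEN"   # silent period over: start probing
        if self.state == "HALF_OPEN":
            # count consecutive successes; any failure resets the probe count
            self.probe_successes = self.probe_successes + 1 if succeeded else 0
            if self.probe_successes >= self.probes_needed:
                self.state = "NORMAL"
        return self.state
```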

&lt;h4&gt;
  
  
  2.1.3 Sentinel Sliding Window Implementation (Circular Array)
&lt;/h4&gt;

&lt;p&gt;Sliding windows are implemented using a circular array. The array size and indices are calculated based on the time window.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;: For a 1-second window with two sub-windows (500ms each):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Window IDs: 0, 1
&lt;/li&gt;
&lt;li&gt;Time ranges: 0–500ms (ID 0), 500–1000ms (ID 1)
&lt;/li&gt;
&lt;li&gt;At 700ms, &lt;code&gt;window ID = (700 / 500) % 2 = 1&lt;/code&gt; and &lt;code&gt;windowStart = 700 - (700 % 500) = 500&lt;/code&gt;.
&lt;/li&gt;
&lt;li&gt;At 1200ms, &lt;code&gt;window ID = (1200 / 500) % 2 = 0&lt;/code&gt; and &lt;code&gt;windowStart = 1000&lt;/code&gt;, so the slot for ID 0 must be reset to reflect the new start time.&lt;/li&gt;
&lt;/ul&gt;
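&lt;p&gt;The index arithmetic is integer division and modulo, as in Sentinel's &lt;code&gt;LeapArray&lt;/code&gt;-style implementation; the function names here are illustrative:&lt;/p&gt;

```python
def window_index(now_ms, window_length_ms, array_size):
    """Slot in the circular array for the given timestamp."""
    return (now_ms // window_length_ms) % array_size

def window_start(now_ms, window_length_ms):
    """Start time of the sub-window containing the given timestamp."""
    return now_ms - now_ms % window_length_ms

idx = window_index(700, 500, 2)    # 1: second sub-window
start = window_start(700, 500)     # 500
```

If the slot's stored start time differs from the computed one (as at 1200ms, where slot 0's start becomes 1000), the slot is stale and its counters are reset before reuse.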

&lt;h4&gt;
  
  
  2.1.4 Dynamic Thread Pool Adjustment
&lt;/h4&gt;

&lt;p&gt;After message processing, a thread pool is used for asynchronous sending. Separate pools exist for normal and faulty services, configured based on task type and CPU cores, with dynamic adjustments informed by performance testing.&lt;/p&gt;

&lt;p&gt;A dynamically adjustable thread pool design leverages tools like Apollo or Nacos to modify parameters at runtime. Since a task queue's capacity cannot be changed after construction without a custom queue implementation, the pool is instead defined with matching core and maximum thread counts and a discard-style rejection policy. When the pool is overloaded, rejected tasks are persisted to MQ for later retry, preventing both memory overflow and message loss.&lt;/p&gt;
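&lt;p&gt;A minimal sketch of the overflow path: a bounded queue stands in for the pool's task queue, and a plain list stands in for MQ. In the real system the thread pool's rejection handler would publish the task to MQ instead; all names here are illustrative:&lt;/p&gt;

```python
import queue

work_queue = queue.Queue(maxsize=2)   # stand-in for the pool's bounded task queue
mq_fallback = []                      # stand-in for MQ persistence

def submit(task):
    """Try the pool first; divert overflow to MQ for retry instead of dropping."""
    try:
        work_queue.put_nowait(task)
        return "pooled"
    except queue.Full:
        mq_fallback.append(task)
        return "queued_for_retry"

submit("task-1")
submit("task-2")
submit("task-3")   # queue full: diverted to the MQ stand-in
```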

&lt;h3&gt;
  
  
  2.2 Retrying Message Sending
&lt;/h3&gt;

&lt;p&gt;Messages failing due to bottlenecks or errors are retried via distributed task scheduling frameworks. Techniques like sharding and broadcasting optimize retry efficiency. Duplicate message control is achieved using locks.&lt;/p&gt;

&lt;h4&gt;
  
  
  Retry Mechanism:
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;Check if the handler’s resources are sufficient. If not, tasks wait in a queue.&lt;/li&gt;
&lt;li&gt;Lock control prevents duplicate processing across nodes.&lt;/li&gt;
&lt;li&gt;Task volume is based on handler settings.&lt;/li&gt;
&lt;li&gt;Retrieved tasks are sent to MQ, then processed via thread pools.&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  2.2.1 ES and MySQL Data Synchronization
&lt;/h4&gt;

&lt;p&gt;For large datasets, Elasticsearch (ES) is used for queries. Data consistency between ES and the database must be maintained.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Synchronization Flow&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Update ES first, then change the database state to "updated."&lt;/li&gt;
&lt;li&gt;If synchronization isn't complete, reset the state to "init."&lt;/li&gt;
&lt;li&gt;Synchronization includes the database &lt;code&gt;update_time&lt;/code&gt; to ensure updates only occur for the latest data.&lt;/li&gt;
&lt;/ol&gt;
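&lt;p&gt;The update-ES-first flow can be sketched with dictionaries standing in for Elasticsearch and MySQL; the freshness check via &lt;code&gt;update_time&lt;/code&gt; mirrors step 3, and all names are illustrative:&lt;/p&gt;

```python
es_store = {}   # stand-in for the Elasticsearch index
db = {}         # stand-in for MySQL: id -> {"state", "update_time", "payload"}

def sync_record(record_id):
    """Write to ES first, guarded by update_time, then mark the row synced."""
    row = db[record_id]
    doc = es_store.get(record_id)
    # Only overwrite ES if this row is fresher than what ES already holds.
    if doc is None or row["update_time"] > doc["update_time"]:
        es_store[record_id] = {"payload": row["payload"],
                               "update_time": row["update_time"]}
    row["state"] = "updated"   # marked only after the ES write path completes

db["msg1"] = {"state": "init", "update_time": 5, "payload": "hello"}
sync_record("msg1")
```

On failure partway through, the row's state would be reset to "init" so a later pass retries; the `update_time` guard keeps that retry from clobbering newer data.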

&lt;p&gt;&lt;strong&gt;ES Index Management&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monthly rolling indices are created.
&lt;/li&gt;
&lt;li&gt;New indices are tagged as "hot," storing new data on high-performance nodes.
&lt;/li&gt;
&lt;li&gt;A scheduled task marks previous indices as "cold," moving them to lower-performance nodes.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  3. Stability Assurance
&lt;/h2&gt;

&lt;p&gt;The designs outlined above focus on high performance, but stability must also be considered. Below are several aspects of stability assurance.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.1 Sudden Traffic Spikes
&lt;/h3&gt;

&lt;p&gt;A two-layer degradation approach is implemented to handle sudden traffic spikes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Gradual Traffic Increase&lt;/strong&gt;: When traffic grows steadily, and the thread pool becomes busy, &lt;strong&gt;MQ&lt;/strong&gt; is used for traffic shaping. Data is asynchronously persisted, and subsequent tasks are scheduled with a 0s delay for processing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rapid Traffic Surge&lt;/strong&gt;: In the case of abrupt spikes, &lt;strong&gt;Sentinel&lt;/strong&gt; directly routes traffic to MQ for shaping and persistence without additional checks. Subsequent processing is delayed until resources become available.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3.2 Resource Isolation for Problematic Services
&lt;/h3&gt;

&lt;p&gt;Why isolate problematic services? Without isolation, problematic services share thread pool resources with normal services. If problematic services experience long processing times, thread releases are delayed, preventing timely processing of normal service requests. Resource isolation creates a separation to ensure problematic services do not impact normal operations.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.3 Protection of Third-Party Services
&lt;/h3&gt;

&lt;p&gt;Third-party services often implement rate-limiting and degradation to prevent overload. For those that lack such mechanisms, the following should be considered:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Avoid overwhelming third-party services due to high request volume.&lt;/li&gt;
&lt;li&gt;Ensure our services are resilient to third-party service failures by using &lt;strong&gt;circuit breakers&lt;/strong&gt; and &lt;strong&gt;graceful degradation&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3.4 Middleware Fault Tolerance
&lt;/h3&gt;

&lt;p&gt;Fault tolerance for middleware is essential. For example, during a scaling operation or upgrade, MQ might experience a few seconds of downtime. The system design must account for such transient failures to ensure service continuity.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.5 Comprehensive Monitoring System
&lt;/h3&gt;

&lt;p&gt;A robust monitoring system should be established to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Detect and mitigate issues before they escalate.&lt;/li&gt;
&lt;li&gt;Provide rapid incident resolution.&lt;/li&gt;
&lt;li&gt;Offer actionable insights for post-incident analysis and optimization.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  3.6 Active-Active Deployment and Elastic Scaling
&lt;/h3&gt;

&lt;p&gt;Operationally, &lt;strong&gt;active-active deployment&lt;/strong&gt; across multiple data centers ensures service availability. Elastic scaling, based on comprehensive service metrics, accommodates traffic variations while optimizing costs.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Conclusion
&lt;/h2&gt;

&lt;p&gt;System design should address service architecture, functionality, and stability assurance comprehensively. Achieving scalability, fault tolerance, and adaptability to dynamic scenarios is an ongoing challenge. There is no universal "silver bullet"; technical designs must be tailored to specific business needs through thoughtful planning and iteration.&lt;/p&gt;

</description>
      <category>design</category>
      <category>devops</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
