This article introduces the RUM Practice for Android, detailing how to optimize network performance through fine-grained metric analysis and connection pool tuning.
1.Overview
In the era of the mobile Internet, network request performance has become a key factor that affects user experience. Statistics show that the conversion rate drops significantly as the page load time increases, and the most common user feedback in mobile applications is related to network performance issues such as "slow load" and "stuttering". However, the complexity of the mobile network environment far exceeds that of the web client:
Diversified network environments
● Multiple network standards such as Wi-Fi, 4G, 5G, 3G, and 2G coexist.
● The signal strength varies, and network transitions are frequent.
● The network quality varies greatly across different regions and carriers.
Critical device fragmentation
● There are many Android device brands and models.
● System versions span a wide range, from Android 5.0 to the latest release.
● The device performance is uneven, which affects the network processing capability.
Difficulty in troubleshooting
● Lack of visibility: Traditional monitoring can only see whether a request succeeded or failed and the total duration, but cannot understand which specific segment the time is spent on.
● Difficult to reproduce: The user feedback is "very slow", but it often cannot be reproduced in the development environment.
● Lack of a quantitative basis: Optimization is driven by gut feeling, and its effect cannot be measured.
● Lack of end-to-end tracking: Client logs are missing and disconnected from server-side monitoring, so a complete trace cannot be formed.
To solve the above pain points, we need to turn the "black box" of the network request into a "transparent box" to clearly see the duration of each segment. Real User Monitoring (RUM) of Cloud Monitor 2.0 for the Android SDK provides mobile network performance monitoring capabilities. Next, we will introduce the resource metric data model collected by the RUM SDK in detail to help you understand the meaning and compute method of each metric.
2.Description of Resource Metric Data
To make each phase of each network request clearly visible and quantifiable, you must first establish a standardized data model. Alibaba Cloud RUM uses resource events as the core data model for network request monitoring.
Resource events are a standardized event type specifically designed for network requests. It is formulated based on the Hypertext Transfer Protocol (HTTP) and the World Wide Web Consortium (W3C) Performance Timing API standard, which ensures the accuracy and comparability of data collection. Because the API is implemented differently across environments (Web, iOS, Android, and HarmonyOS), RUM normalizes and aligns the collected values. This allows developers to see consistent performance data on both the web client and mobile clients, facilitating cross-platform performance comparison and troubleshooting.
Next, we will introduce the property fields and metric fields included in resource events in detail.
2.1 Property Field Description
Resource events contain rich attribute fields that describe the context information of a request:

2.2 Metric Field Description
In addition to property fields, resource events also contain core performance metrics. This part of the data is the core data for us to troubleshoot slow network requests.
Metric Type Description

2.3 Request Duration Phase Description
A complete HTTPS request usually includes the following key phases:
2.4 Compute Method
After understanding the definition of the metric, we will deeply understand the specific compute implementation based on OkHttp3 on the Android client.
2.4.1 OkHttp3 compute method
The following table shows the compute method for the duration of each phase of the Android network resource request, and clearly defines the start and end time points and compute methods of each stage.
You can view the detailed time start points in the resource.timing_data field of the raw data.

Note: The TCP connection duration displayed in the console actually includes the SSL handshake time.
2.4.2 Connection reuse detection
Based on the metric data collected by the RUM SDK, we can detect whether the connection is reused. The judgment basis is as follows:
Judgment basis:
● connectionAcquiredTime > 0: The connection is obtained.
● dnsStartTime ≤ 0: No DNS resolution callback.
● tcpStartTime ≤ 0: No TCP connection callback.
Features when the connection is reused:
● resource.dns_duration = 0
● resource.connect_duration = 0
● resource.ssl_duration = 0
● There is a wait time from callStart to connectionAcquired (connection pool seek time).
This wait time is an important performance metric. If it is too long, it may indicate improper connection pool configuration.
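The judgment basis above can be sketched as a small helper. The parameter names mirror the OkHttp EventListener callback times described above; they are illustrative and not the SDK's actual API:

```java
public class ConnectionReuseDetector {
    /**
     * Returns true when the request reused a pooled connection:
     * a connection was acquired, but no DNS or TCP callbacks fired.
     * Times are nanosecond timestamps; a value <= 0 means "callback never fired".
     */
    public static boolean isReused(long connectionAcquiredTime,
                                   long dnsStartTime,
                                   long tcpStartTime) {
        return connectionAcquiredTime > 0
                && dnsStartTime <= 0
                && tcpStartTime <= 0;
    }
}
```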
2.4.3 Relationship between TCP and SSL connections
For HTTPS requests, connection establishment is divided into two phases:
connectStart (TCP starts)
↓
[TCP three-way handshake]
↓
secureConnectStart (SSL handshake starts)
↓
[SSL/TLS handshake]
↓
secureConnectEnd (SSL handshake ends)
↓
connectEnd (Connection established)
Time relationship:
Total connection time = connectEnd - connectStart
Pure TCP time = secureConnectStart - connectStart (approximate)
SSL time = secureConnectEnd - secureConnectStart
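Under these definitions, the split can be computed directly from the callback timestamps. This is a sketch assuming nanosecond timestamps as reported by OkHttp's EventListener:

```java
public class SslSplit {
    /** Converts a nanosecond interval to milliseconds. */
    public static double ms(long startNs, long endNs) {
        return (endNs - startNs) / 1_000_000.0;
    }

    /** Approximate pure-TCP portion of an HTTPS connect. */
    public static double tcpMs(long connectStartNs, long secureConnectStartNs) {
        return ms(connectStartNs, secureConnectStartNs);
    }

    /** SSL/TLS handshake portion. */
    public static double sslMs(long secureConnectStartNs, long secureConnectEndNs) {
        return ms(secureConnectStartNs, secureConnectEndNs);
    }
}
```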
2.5 View Metrics in the Console
You can log on to the RUM console, select your application, click the API request module, and click specific details to view the duration and duration distribution of each phase of the request.

After understanding the data model and data compute methods, let's look at how to use these metric data to quickly locate performance issues through a real online user case.
3.User Case Analysis
3.1 Case Background
An app received online user complaints, with feedback such as "page load is particularly slow" and "spinning often exceeds 1 second." The developer team immediately troubleshot the backend service, but found a confusing phenomenon:
The client reported that the response time of a core API often exceeded 1 second (some users even reached 2-3 seconds). This problem existed regardless of whether the network environment was Wi-Fi or 4G, and it was random, making it difficult to stably reproduce in the development environment.
However, backend monitoring showed that the server-side processing time of the API was stable at about 400 ms, the database query performance was normal with no slow queries, and the server CPU and memory load were also healthy. The data on both sides did not match. The client reported 1.2 seconds, while the server-side only took 400 ms. Where did the remaining 800 ms go? Without fine-grained monitoring, the team fell into a "blind men and an elephant" dilemma: the client and the server-side blamed each other, and the problem went unresolved for a long time.
By integrating the Alibaba Cloud RUM Android SDK, we collected detailed duration data.
Let's see how the problem was precisely located.
3.2 Raw Timing Data
In the resource.timing_data field, we obtained the raw time points (in nanoseconds) of each phase of the request:
{
"requestHeadersEnd": 1560814315115219,
"responseBodyStart": 1560814719308917,
"requestType": "OkHttp3",
"connectionAcquired": 1560814312934751,
"connectionReleased": 1560814721700948,
"requestBodyEnd": 1560814315850323,
"responseHeadersEnd": 1560814718722250,
"requestHeadersStart": 1560814312975011,
"responseBodyEnd": 1560814719441625,
"requestBodyStart": 1560814315146573,
"callEnd": 1560814721840948,
"duration": 1232825780,
"callStart": 1560813486615845,
"responseHeadersStart": 1560814718314125
}
Key observations:
● No DNS, TCP, or SSL-related callback time points → This indicates that connection pool reuse is used.
● The interval from callStart to connectionAcquired is 826 ms → The connection pool wait time is abnormally long.
● Total duration = 1232.8 ms
There is already a clear clue here: The problem does not lie in DNS, TCP, or SSL handshake, but in the fact that the wait time for the connection pool to assign a connection is too long.
3.3 Detailed Phase Analysis
Based on the raw data and the data calculation methods in section 2.4, we calculate the duration phase by phase to precisely locate performance bottlenecks:
Phase 1: Wait for the connection pool to assign
callStart → connectionAcquired
Time consumed: (1560814312934751-1560813486615845)/1,000,000 = 826.32 ms⚠️
Note:
● The wait time to retrieve an active connection from the connection pool.
● No DNS/TCP callback = Reuse the existing connection.
● This is the biggest bottleneck. It accounts for 67% of the total duration.
Phase 2: Send request headers
requestHeadersStart → requestHeadersEnd
Time consumed: (1560814315115219-1560814312975011)/1,000,000 = 2.14 ms✅
Phase 3: Send the request body
requestBodyStart → requestBodyEnd
Time consumed: (1560814315850323-1560814315146573)/1,000,000 = 0.70 ms✅
Phase 4: Wait for the server response (TTFB)
requestBodyEnd → responseHeadersStart
Time consumed: (1560814718314125-1560814315850323)/1,000,000 = 402.46 ms
Note: The time the server takes to process the request is consistent with the backend log and is within the normal range.
Phase 5: Receive response headers
responseHeadersStart → responseHeadersEnd
Time consumed: (1560814718722250-1560814718314125)/1,000,000 = 0.41 ms✅
Phase 6: Receive the response body
responseBodyStart → responseBodyEnd
Time consumed: (1560814719441625-1560814719308917)/1,000,000 = 0.13 ms✅
Phase 7: Release the connection
responseBodyEnd → connectionReleased
Time consumed: (1560814721700948-1560814719441625)/1,000,000 = 2.26 ms✅
Through this analysis, we can clearly see that the connection pool wait time is a performance bottleneck.
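The phase-by-phase arithmetic above can be reproduced mechanically from the raw timing_data timestamps. A sketch using the values from section 3.2 (phase names are illustrative labels, not SDK fields):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PhaseBreakdown {
    /** Nanosecond interval to milliseconds, rounded to two decimal places. */
    public static double ms(long startNs, long endNs) {
        return Math.round((endNs - startNs) / 10_000.0) / 100.0;
    }

    /** Computes the seven phase durations from raw timing_data timestamps. */
    public static Map<String, Double> breakdown(long callStart, long connectionAcquired,
            long requestHeadersStart, long requestHeadersEnd,
            long requestBodyStart, long requestBodyEnd,
            long responseHeadersStart, long responseHeadersEnd,
            long responseBodyStart, long responseBodyEnd, long connectionReleased) {
        Map<String, Double> phases = new LinkedHashMap<>();
        phases.put("poolWait", ms(callStart, connectionAcquired));        // Phase 1
        phases.put("sendHeaders", ms(requestHeadersStart, requestHeadersEnd)); // Phase 2
        phases.put("sendBody", ms(requestBodyStart, requestBodyEnd));     // Phase 3
        phases.put("ttfb", ms(requestBodyEnd, responseHeadersStart));     // Phase 4
        phases.put("recvHeaders", ms(responseHeadersStart, responseHeadersEnd)); // Phase 5
        phases.put("recvBody", ms(responseBodyStart, responseBodyEnd));   // Phase 6
        phases.put("release", ms(responseBodyEnd, connectionReleased));   // Phase 7
        return phases;
    }
}
```

Feeding in the raw timestamps from section 3.2 reproduces the 826.32 ms pool wait and 402.46 ms TTFB computed above.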
3.4 Issue Diagnosis
Diagnosis of abnormal points
Core issue: The connection pool wait time is too long (826 ms).
Possible causes:
The connection pool is full - All connections are in use, and it is necessary to wait for other requests to release connections.
Serial request queuing - Too many requests are sent to the same host, which is limited by the maxRequestsPerHost configuration.
Connection leaks - Previous requests did not correctly release connections.
Improper connection pool configuration - The maxIdleConnections setting is too small.
Diagnosis steps
Step 1: Check the connection pool configuration
// View the connection pool configuration of the current OkHttpClient.
ConnectionPool connectionPool = okHttpClient.connectionPool();
// Default configurations: A maximum of five idle connections, and keep alive for 5 minutes.
The check shows that the application uses OkHttp's default configuration, which keeps at most five idle connections.
Step 2: Monitor the concurrent request quantity
You can view the quantity of concurrent requests to the same host within this time segment via the RUM console.
Step 3: Check for connection leaks
You can view application logs to confirm that all requests have correctly closed the response body:
Response response = client.newCall(request).execute();
try {
    String body = response.body().string();
    // Process the response
} finally {
    response.close(); // Always release the response
}
Diagnostic conclusion:
The issue is caused by a connection pool that is configured too small. A large number of requests wait for connections to be released, causing a severe performance bottleneck.
After the cause of the issue is identified, we will introduce troubleshooting methods and optimization ideas for common network performance issues.
4.Best Practices for Troubleshooting Common Issues
Through the above case, we have seen how to use RUM data to locate issues. This chapter will systematically introduce four categories of the most common network performance issues and their troubleshooting methods.
4.1 Long Connection Pool Wait Time
Symptom: An abnormal connection acquisition duration is observed in resource.timing_data.
callStart → connectionAcquired duration > 500 ms
Diagnosis steps:
Step 1: View the connection pool configuration
// Check the current configuration.
ConnectionPool pool = okHttpClient.connectionPool();
// Default: five idle connections
Step 2: View the number of concurrent requests
View the number of concurrent requests for the time period through the RUM console:
-- Execute the query in the RUM console
SELECT
COUNT(*) as concurrent_requests
FROM rum_resource
WHERE
timestamp BETWEEN start_time AND end_time
AND resource.url LIKE 'https://api.example.com%'
GROUP BY timestamp
ORDER BY concurrent_requests DESC
Step 3: Check for connection leaks
// Log connection pool status from an application interceptor
ConnectionPool pool = okHttpClient.connectionPool();
OkHttpClient monitored = okHttpClient.newBuilder()
    .addInterceptor(chain -> {
        Log.d("Pool", "Total: " + pool.connectionCount() +
                ", Idle: " + pool.idleConnectionCount());
        return chain.proceed(chain.request());
    })
    .build();
Optimization ideas:
// Solution 1: Increase the connection pool size
.connectionPool(new ConnectionPool(30, 5, TimeUnit.MINUTES))
// Solution 2: Increase the maximum number of concurrent requests per host
Dispatcher dispatcher = new Dispatcher();
dispatcher.setMaxRequestsPerHost(10); // Default: 5
dispatcher.setMaxRequests(64);        // Default: 64
.dispatcher(dispatcher)
// Solution 3: Merge small requests to reduce contention for connections
// Solution 3: Merge requests
4.2 Slow DNS Resolution
Symptom: It is observed in the console that the DNS duration remains high.
resource.dns_duration > 500ms
Diagnosis steps:
Step 1: Confirm that it is a DNS issue
You can check whether resource.dns_duration remains high, and compare different network environments (Wi-Fi vs. 4G).
Step 2: Analyze a specific domain name
// Group by domain name in the RUM console
SELECT
resource.url_host,
AVG(resource.dns_duration) as avg_dns_time,
MAX(resource.dns_duration) as max_dns_time
FROM rum_resource
WHERE resource.dns_duration > 0
GROUP BY resource.url_host
ORDER BY avg_dns_time DESC
Solutions:
// Solution 1: Use a custom DNS
.dns(new CustomDns())
// Solution 2: Use HttpDNS
.dns(new AliHttpDns())
// Solution 3: DNS pre-parsing
DnsPreloader.preload(client);
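One pre-resolution strategy is a small TTL cache in front of the system resolver. A minimal stdlib-only sketch (the `CustomDns` above could wrap something like this behind OkHttp's `Dns` interface; class and field names here are illustrative):

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class DnsCache {
    private static final class Entry {
        final List<InetAddress> addresses;
        final long expiresAtMs;
        Entry(List<InetAddress> addresses, long expiresAtMs) {
            this.addresses = addresses;
            this.expiresAtMs = expiresAtMs;
        }
    }

    private final Map<String, Entry> cache = new ConcurrentHashMap<>();
    private final long ttlMs;

    public DnsCache(long ttlMs) { this.ttlMs = ttlMs; }

    /** Returns cached addresses if still fresh; otherwise resolves and caches. */
    public List<InetAddress> lookup(String host) throws UnknownHostException {
        long now = System.currentTimeMillis();
        Entry e = cache.get(host);
        if (e != null && e.expiresAtMs > now) {
            return e.addresses; // cache hit: no resolver round-trip
        }
        List<InetAddress> resolved = List.of(InetAddress.getAllByName(host));
        cache.put(host, new Entry(resolved, now + ttlMs));
        return resolved;
    }
}
```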
4.3 High SSL Handshake Duration
Symptom: An abnormal SSL handshake duration is observed in the console.
resource.ssl_duration > 1000ms
Diagnosis steps:
Step 1: Confirm the SSL version
// Add a network interceptor to view SSL information
// (chain.connection() is only non-null in network interceptors)
builder.addNetworkInterceptor(chain -> {
    Connection connection = chain.connection();
    if (connection != null) {
        Handshake handshake = connection.handshake();
        if (handshake != null) {
            Log.d("SSL", "Protocol: " + handshake.tlsVersion());
            Log.d("SSL", "Cipher: " + handshake.cipherSuite());
        }
    }
    return chain.proceed(chain.request());
});
Step 2: Check the connection reuse rate
// Query in the RUM console
SELECT
COUNT(CASE WHEN resource.ssl_duration = 0 THEN 1 END) * 100.0 / COUNT(*) as reuse_rate
FROM rum_resource
WHERE resource.url LIKE 'https://%'
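The same reuse rate can be derived client-side from collected events. A sketch matching the query above, where an ssl_duration of 0 is treated as a reused connection:

```java
public class ReuseRate {
    /** Percentage of HTTPS requests that reused a connection (ssl_duration == 0). */
    public static double percent(long[] sslDurationsMs) {
        if (sslDurationsMs.length == 0) return 0.0;
        long reused = 0;
        for (long d : sslDurationsMs) {
            if (d == 0) reused++;
        }
        return reused * 100.0 / sslDurationsMs.length;
    }
}
```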
Optimization ideas:
// Solution 1: Enable SSL session reuse
.sslSocketFactory(SslConfig.createSSLSocketFactory())
// Solution 2: Increase the connection keep-alive time
.connectionPool(new ConnectionPool(30, 10, TimeUnit.MINUTES)) // Extend keep-alive to 10 minutes
// Solution 3: Use certificate pinning
.certificatePinner(certificatePinner)
4.4 Long TTFB
Symptom: The time from when a request is sent to when the first byte is received is excessively long. You can observe a long request response duration in the console.
resource.first_byte_duration > 2000ms
Diagnosis steps:
Step 1: Troubleshoot client issues
Make sure that the following metrics are normal:
● DNS resolution time < 300 ms
● Connection establishment time < 500 ms
● Request sending time < 100 ms
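This triage step can be captured as a simple check. The thresholds are taken from the list above; this is a sketch, not the SDK's logic:

```java
public class TtfbTriage {
    /**
     * Returns true when client-side phases are within the thresholds above,
     * which points the investigation at the server rather than the client.
     */
    public static boolean clientSideHealthy(double dnsMs, double connectMs, double sendMs) {
        return dnsMs < 300 && connectMs < 500 && sendMs < 100;
    }
}
```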
Step 2: Analyze the server response time
TTFB is mainly determined by the server processing time. If the client metrics are normal, you can:
1.Check the server load.
2.Check the database query performance.
3.Check the complexity of the interface business logic.
4.Use an application performance management (APM) tool to track server performance.
Step 3: Network path analysis
// View the TTFB differences across different regions and carriers in the RUM console
SELECT
user.region,
user.isp,
AVG(resource.first_byte_duration) as avg_ttfb
FROM rum_resource
GROUP BY user.region, user.isp
ORDER BY avg_ttfb DESC
Optimization ideas:
// Solution 1: Use CDN for acceleration
// Deploy static resources and APIs to CDN points of presence
// Solution 2: Enable server caches
// Implement a reasonable cache policy on the server-side
// Solution 3: Use data prefetching
// Request data in advance before users might access it
PreloadManager.preload("https://api.example.com/user/profile");
// Solution 4: Manage request priorities
// Use a separate OkHttpClient with its own Dispatcher (and thread pool) for high-priority requests
Dispatcher highPriorityDispatcher = new Dispatcher(Executors.newFixedThreadPool(4));
.dispatcher(highPriorityDispatcher)
5.Case Summary
With the troubleshooting methods for the preceding four categories of common issues, we now have a systematic diagnostic approach. Let's return to the real case in Chapter 3 that troubled the team for days: the 826 ms connection pool wait. By precisely locating the issue with RUM data, we found that the root cause was an undersized connection pool that forced requests to queue. The fix is simple: choose connection pool settings appropriate to the application type.
Configuration suggestions:
For the maxIdleConnections parameter of OkHttpClient (the default value is 5), we recommend that you adjust it based on application characteristics. Based on experience, common configurations are as follows:
● Highly concurrent applications: maxIdleConnections = 30-50.
These applications have large user bases, frequent network requests, and high concurrency, so they need a sufficiently large connection pool.
● General applications: maxIdleConnections = 10-20.
Request frequency and concurrency are moderate; a medium-sized connection pool is sufficient.
● Low-frequency applications: maxIdleConnections = 5-10. Requests are infrequent; keep the default configuration or increase it slightly as needed.
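The tiers above can be captured as a small selection helper. The tier names and the returned midpoints are illustrative choices within the recommended ranges:

```java
public class PoolSizing {
    public enum AppProfile { HIGH_CONCURRENCY, GENERAL, LOW_FREQUENCY }

    /** Suggested maxIdleConnections per the tiers above (midpoint of each range). */
    public static int suggestedMaxIdleConnections(AppProfile profile) {
        switch (profile) {
            case HIGH_CONCURRENCY: return 40;  // range 30-50
            case GENERAL:          return 15;  // range 10-20
            default:               return 8;   // LOW_FREQUENCY, range 5-10
        }
    }
}
```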
From post-event optimization to proactive monitoring:
However, this case also prompts deeper reflection. Performance optimization should not be an after-the-fact remedy. Beyond mastering troubleshooting and optimization methods, establishing a comprehensive performance monitoring system matters more. You can track the application's network performance metrics in real time through the RUM console to shift from "passive firefighting" to "active observation." If necessary, you can also configure custom alert rules on the RUM platform (such as triggering a notification when the connection pool wait time P95 > 500 ms) to further improve response speed.
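The P95-based alert mentioned above can be prototyped as follows. This is a sketch using the nearest-rank percentile method; in practice the RUM platform computes such aggregates server-side:

```java
import java.util.Arrays;

public class P95Alert {
    /** Nearest-rank 95th percentile of the given durations (ms). */
    public static double p95(double[] durationsMs) {
        double[] sorted = durationsMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(0.95 * sorted.length); // nearest-rank method
        return sorted[Math.max(0, rank - 1)];
    }

    /** True when the pool-wait P95 breaches the alert threshold. */
    public static boolean shouldAlert(double[] poolWaitMs, double thresholdMs) {
        return poolWaitMs.length > 0 && p95(poolWaitMs) > thresholdMs;
    }
}
```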
Suggestions for monitoring and alerting configuration
RUM data lets you create custom alerts for real-time monitoring. A well-designed monitoring and alerting system helps you detect and handle problems before they impact users.
Reference for metric-based alerting thresholds
Based on industry practices such as the RAIL model and Google Web Vitals, common threshold references are as follows:

6.Summary
In mobile application development, network request performance directly impacts user experience. By integrating the Alibaba Cloud RUM Android SDK, developers can obtain the following core capabilities:
Accurately locate performance bottlenecks
● Fine-grained phase duration (such as DNS, TCP, SSL, and TTFB) helps quickly detect problems.
● Move from the vague complaint of "slow requests" to the precise diagnosis of "826 ms connection pool wait".
Connection reuse analysis
● Automatically detect the use efficiency of the connection pool
● Detect hidden problems such as connection leaks and improper connection pool configurations
Real user experience monitoring
● Collect data based on the network environments of real users
● Analyze performance differences by dimensions such as region, carrier, and network type
Data-driven optimization
● The comparison before and after optimization is clearly visible
● Establish performance baselines and alerting mechanisms for continuous improvement
Alibaba Cloud RUM implements a non-intrusive monitoring and collection SDK for application performance, stability, and user behavior on the Android client. You can refer to the integration document to experience and use the SDK. In addition to Android, RUM also supports monitoring and analysis on multiple platforms such as web, mini program, iOS, and HarmonyOS. For related questions, you can join the RUM support group (DingTalk group number: 67370002064) for consultation.
