DEV Community

ObservabilityGuy

Android Crash Monitoring: A Complete Troubleshooting Flow for Production Environment Crashes

This article demonstrates a complete production crash troubleshooting flow using Alibaba Cloud RUM, from alerting and stack analysis to user behavior tracing and root cause identification.
I. Background: Why Is Crash Collection Necessary?
Series review: In the previous article, In-depth Analysis of the Principles of Android Crash Capture and a Closed-loop Framework from Crash to Root Cause Identification, we examined how crash collection works internally: from the UncaughtExceptionHandler mechanism in the Java layer, to signal handling and Minidump technology in the native layer, to the symbolication of obfuscated stacks. That article should have given you a clear picture of "how crashes are caught."

However, theory alone is not enough. This article reproduces a production case to show how an Android developer facing an online crash can analyze and locate it using the exception data and context collected by Real User Monitoring (RUM). It walks through the complete troubleshooting flow: receiving alerts, viewing the console, analyzing the stack, tracing user behavior, and locating the root cause.

1.1 Case background
Version 3.5.0 of an app was released, mainly to optimize the loading performance of the product list. On the third day after the release, however, the team began receiving a large number of user complaints about unexpected exits and crashes.

Severity:

● Crash rate increased more than tenfold

● App store ratings dropped

● User uninstallation rate increased

Final solution: The Alibaba Cloud RUM SDK was integrated to collect crash data, and the problem was located within two hours.

II. Complete Troubleshooting Flow: From Alerting to Root Cause Positioning
2.1 🔔 Step 1: Receive crash alerts
Once the data is integrated and alerting is configured, the development team receives an alert notification whenever the online crash rate rises significantly and can follow up on the problem immediately.


app.name: xxx and crash | SELECT diff[1] AS "current value", diff[2] AS "yesterday's value", round(diff[3], 4) AS "ratio" FROM ( SELECT compare(cnt, 86400) AS diff FROM ( SELECT COUNT(*) AS cnt FROM log)) ORDER BY "current value" DESC
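
The query above uses compare(cnt, 86400) to return the current crash count, the count from one day (86,400 seconds) earlier, and the ratio between them. As a rough illustration of how such a ratio could drive an alert decision, here is a minimal Java sketch. The class name, method names, and the threshold of 2.0 are assumptions for illustration and are not part of the RUM SDK or console.

```java
/** Hypothetical helper (not part of the RUM SDK): decides whether a day-over-day crash ratio should alert. */
final class CrashAlertRule {
    private final double ratioThreshold;

    CrashAlertRule(double ratioThreshold) {
        this.ratioThreshold = ratioThreshold;
    }

    /** Ratio of the current crash count to yesterday's; infinite if crashes appear where there were none. */
    static double dayOverDayRatio(long current, long yesterday) {
        if (yesterday == 0) {
            return current == 0 ? 1.0 : Double.POSITIVE_INFINITY;
        }
        return (double) current / yesterday;
    }

    /** True when the ratio reaches the configured threshold. */
    boolean shouldAlert(long current, long yesterday) {
        return dayOverDayRatio(current, yesterday) >= ratioThreshold;
    }

    public static void main(String[] args) {
        CrashAlertRule rule = new CrashAlertRule(2.0); // assumed threshold
        System.out.println(rule.shouldAlert(1200, 100)); // true: 12x yesterday's count
        System.out.println(rule.shouldAlert(110, 100));  // false: only 1.1x
    }
}
```

A ratio on raw counts is sensitive to traffic changes, so a real alert rule would usually also look at the crash rate per session.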

2.2 📊 Step 2: View crash overview - Lock the exception type
Operation path: Console homepage → RUM → Find the corresponding app → Exception statistics

The exception list in the console showed that IndexOutOfBoundsException accounted for the vast majority of crashes and began to appear in large numbers after V3.5.0 was released, which made it the main problem to investigate.

2.3 🔍 Step 3: Analyze the crash stack - Preliminary positioning
Open the IndexOutOfBoundsException details page for in-depth analysis. The details confirmed our suspicion: the crashes occurred in the newly released V3.5.0, on the ProductListActivity page. The corresponding session ID is 98e9ce65-c51a-40c4-9232-4b69849e5985-01, which we use later to analyze user behavior.

View the crash stack and analyze key information:

● The crash occurred on line 50 of the ProductListAdapter.onBindViewHolder() method.

● Cause of the failure: The code tried to access the 6th element (index 5) of the list, but the list has only 5 elements.

● This is a typical RecyclerView data inconsistency problem.
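
The failure described in these points can be reproduced in isolation. The following minimal Java sketch reads index 5 from a five-element list, which is what onBindViewHolder() ends up doing when it is asked for a position that the data no longer has. The class and method names are hypothetical, and the exception message wording differs between runtimes, so the sketch only reports the exception type.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical demo class: reproduces the out-of-bounds read outside of Android. */
final class StaleIndexDemo {
    /** Returns the simple name of the exception thrown when a stale position is read. */
    static String readStalePosition() {
        // The list now holds only five items (valid indexes 0..4).
        List<String> products = new ArrayList<>(Arrays.asList("p0", "p1", "p2", "p3", "p4"));
        int stalePosition = 5; // the 6th item, valid only before the list shrank
        try {
            products.get(stalePosition);
            return "no exception";
        } catch (IndexOutOfBoundsException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(readStalePosition());
    }
}
```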

Preliminary assumptions:

● The data may be updated at the wrong time.

● Multiple threads may be modifying the data concurrently.

● Rapid user operations may trigger the problem.

However, the stack alone cannot determine the root cause. We need to look at the user's actual operation path.

2.4 🎯 Step 4: Track user behavior - Find the trigger path
Operation path: Crash details page → Select the session ID corresponding to the crash → View the session trace of the session ID

Open the session details to view the user's behavior path and relate it to the page where the crash occurred. We identified the following operation path.

Operation path:

● Go to the ProductListActivity page.

● Quickly click the refresh button three times in a row, triggering asynchronous updates of the list. (Note: In production this is a network request. Because we are reproducing the issue locally, an asynchronous update is used instead.)

● Timing of the online requests:

    ○ The first asynchronous request returns several items (at least six), and the user scrolls to the 6th one.

    ○ Subsequent requests return only five items and update the list data.

● RecyclerView is still rendering the 6th position, but the data no longer exists.

● Root cause: Data race caused by multiple asynchronous requests.

2.5 🌐 Step 5: Multidimensional analysis - Validate assumptions
To further confirm the issue, filter and analyze the crash data across multiple dimensions to characterize the failure and confirm the scope of impact.

2.5.1 Crash data structure
The crash data collected by the SDK contains the following core fields:

{
  "session.id": "session_abc123", // The session ID, which is used to associate the user behavior path.
  "timestamp": 169988400000, // The time when the crash occurred, in milliseconds.
  "exception.type": "crash", // The type of the exception.
  "exception.subtype": "java", // The subtype of the exception.
  "exception.name": "java.lang.NullPointerException", // The class name of the exception.
  "exception.message": "Attempt to invoke virtual method on a null object", // The error message.
  "exception.stack": "[{...}]", // The full stack (JSON array).
  "exception.thread_id": 1, // The ID of the crash thread.
  "view.id": "123-abc", // The ID of the page on which the crash occurred.
  "view.name": "NativeCrashActivity", // The name of the page on which the crash occurred.
  "user.tags": "{\"vip\":\"true\"}", // User tags (custom).
  "properties": "{\"version\":\"2.1.0\"}", // Custom properties.
  "net.type": "WIFI", // The network type of the user.
  "net.ip": "192.168.1.100", // The IP address of the client.
  "device.id": "123-1234", // The ID of the user device.
  "os.version": 14, // The version number of the user's system.
  "os.type": "Android" // The system type of the user.
}

2.5.2 Overview of the crash dashboard
Location: RUM > Experience dashboard > Exception analysis

On the exception analysis dashboard, you can view the overall breakdown results of the application, including the total number of exceptions, exception trend, device distribution, exception type, and network distribution.
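
Conceptually, each breakdown on this dashboard groups crash records by one field and reports each group's share. The Java sketch below illustrates this for the net.type field from the structure in 2.5.1. The CrashEvent class and the method name are hypothetical stand-ins, not SDK types, and the dashboard's actual implementation is not described in this article.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of a one-dimensional crash breakdown. */
final class CrashBreakdown {
    /** Minimal stand-in for a crash record; only the "net.type" field is kept. */
    static final class CrashEvent {
        final String netType;

        CrashEvent(String netType) {
            this.netType = netType;
        }
    }

    /** Returns the percentage of crashes per network type, keyed in alphabetical order. */
    static Map<String, Double> shareByNetType(List<CrashEvent> events) {
        Map<String, Integer> counts = new TreeMap<>();
        for (CrashEvent event : events) {
            counts.merge(event.netType, 1, Integer::sum);
        }
        Map<String, Double> shares = new TreeMap<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            shares.put(entry.getKey(), 100.0 * entry.getValue() / events.size());
        }
        return shares;
    }
}
```

A share of crashes is only suggestive on its own. It becomes stronger evidence when compared with the share of overall sessions on the same network type.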

2.5.3 Network type distribution
Because the list update is triggered by the response to a network request, we need to look at the user's network type at the time of each crash in the online data. The crash dashboard shows the network distribution of crashes for V3.5.0.

💡 Conclusion: 90% of crashes occur on 3G/4G networks, and the crash rate on Wi-Fi networks is very low. This confirms that the network (asynchronous request) is the key factor.

2.5.4 Device brand distribution
View the device brand distribution of V3.5.0 crashes on the crash dashboard.

💡 Conclusion: All brands are affected. This is not a model-specific issue but a code logic issue.

2.5.5 Version comparison
In addition to the crash dashboard, we can also run custom SQL analysis on the Log Explorer tab.

app.name: xxx and crash | select "app.version", count(*) from log group by "app.version"

Operation: Compare the crash rates of V3.4.0 and V3.5.0.
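
Note that the query above returns crash counts per version, while a crash rate also needs a denominator such as the number of sessions per version. The Java sketch below shows the arithmetic under that assumption. The class name, the choice of sessions as the denominator, and all numbers are hypothetical and are not taken from the case data.

```java
/** Hypothetical helper: turns per-version crash counts into rates and compares them. */
final class CrashRateComparison {
    /** Crashes per session for one app version. */
    static double crashRate(long crashes, long sessions) {
        if (sessions <= 0) {
            throw new IllegalArgumentException("sessions must be positive");
        }
        return (double) crashes / sessions;
    }

    /** How many times higher the current rate is than the baseline rate. */
    static double foldChange(double baselineRate, double currentRate) {
        if (baselineRate <= 0) {
            throw new IllegalArgumentException("baseline rate must be positive");
        }
        return currentRate / baselineRate;
    }

    public static void main(String[] args) {
        double v340 = crashRate(20, 10_000);  // hypothetical numbers for V3.4.0
        double v350 = crashRate(240, 10_000); // hypothetical numbers for V3.5.0
        System.out.printf("fold change: %.1f%n", foldChange(v340, v350));
    }
}
```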

2.6 💻 Step 6: Locate code issues
View the problematic code
Open ProductListActivity.java and find the refresh logic:

private void loadProducts() {
    // ❌ Change in V3.5.0: load asynchronously to optimize performance.
    new Thread(() -> {
        try {
            // Simulate a network request.
            List<Product> newProducts = ApiClient.getProducts(currentCategory);
            // ❌ Problem 1: The previous request is not canceled.
            // ❌ Problem 2: Data is cleared and updated directly, without considering that RecyclerView may be rendering.
            runOnUiThread(() -> {
                productList.clear(); // 💥 Dangerous operation!
                productList.addAll(newProducts); // 💥 Data update.
                adapter.notifyDataSetChanged(); // 💥 Refresh notification.
            });
        } catch (Exception e) {
            e.printStackTrace();
        }
    }).start();
}

The crash point in ProductListAdapter.onBindViewHolder():

@Override
public void onBindViewHolder(@NonNull ProductViewHolder holder, int position) {
    // 💥 Crash point: position may be outside the range of products.
    Product product = products.get(position); // IndexOutOfBoundsException!
    holder.bind(product);
}

Find the root cause of the problem
Purpose of the V3.5.0 change: Optimize performance by moving network requests to a background thread.

Issues introduced:

● The previous request is not canceled: When the user quickly clicks the refresh button, multiple requests run at the same time.

● Data race: When a later response arrives, the data is cleared and updated.

● Inconsistent UI state: RecyclerView is rendering a position, but the data has shrunk.
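
This article does not show the fix that was shipped. One common way to address the first two issues is to tag each refresh with a token and apply only the response that belongs to the most recent request, replacing the list in one step instead of calling clear() and addAll(). The plain-Java sketch below illustrates the idea under those assumptions. LatestRequestGate and its methods are hypothetical names, and an Android implementation would still need to notify the adapter after a successful delivery.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper: accepts only the response of the most recently started request. */
final class LatestRequestGate<T> {
    private final AtomicInteger latestToken = new AtomicInteger();
    private List<T> items = new ArrayList<>();

    /** Call when a refresh starts; returns a token that identifies this request. */
    int begin() {
        return latestToken.incrementAndGet();
    }

    /** Call when a response arrives; responses from superseded requests are dropped. */
    synchronized boolean deliver(int token, List<T> response) {
        if (token != latestToken.get()) {
            return false; // a newer refresh was started, so this result is stale
        }
        items = new ArrayList<>(response); // swap in a fresh copy in one step
        return true;
    }

    /** Returns a copy of the currently accepted data. */
    synchronized List<T> snapshot() {
        return new ArrayList<>(items);
    }
}
```

With three rapid refreshes, only the response carrying the third token is applied, regardless of the order in which the responses arrive.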
III. Symbolication configuration: Make the stack "speak human language"
Through the troubleshooting process above, we located the root cause of the crash: ProductListAdapter.onBindViewHolder() hits an index out-of-bounds error when the data is updated. But you may wonder: how do we get from an obfuscated stack to the exact line ProductListAdapter.java:50?

In a real production environment, to protect code and optimize package size, release versions published to the app store are obfuscated by ProGuard or R8. This means the crash stack initially seen on the console is as follows:

java.lang.IndexOutOfBoundsException: Index: 5, Size: 5
    at java.util.ArrayList.get(ArrayList.java:437)
    at com.shop.a.b.c.d.a(Proguard:58)

This is the reason why we need symbolication. Next, let's see how to configure symbolication in the RUM console.

3.1 Symbolize Java/Kotlin obfuscation
Step 1: Preserve the mapping.txt file
After the release version is built, the mapping.txt file is located at:

app/build/outputs/mapping/release/mapping.txt

Sample file content:

com.example.ui.MainActivity -> a.b.c.MainActivity:
    void updateUserProfile(com.example.model.User) -> a
    void onClick(android.view.View) -> b

com.example.model.User -> a.b.d.User:
    java.lang.String userName -> a
    void setUserName(java.lang.String) -> a
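
To make the role of this file concrete, the following Java sketch reads the class lines of a mapping in the simplified form shown above and translates an obfuscated class name back to its original name. ClassNameRetracer is a hypothetical name, and the sketch deliberately ignores members, line numbers, and inlining. Real symbolication, such as the retrace tool that ships with R8 or the parsing done by the RUM console, handles far more than this.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified, hypothetical sketch: restores original class names from mapping.txt class lines. */
final class ClassNameRetracer {
    private final Map<String, String> obfuscatedToOriginal = new HashMap<>();

    ClassNameRetracer(String mappingText) {
        for (String line : mappingText.split("\n")) {
            // Class lines are unindented and look like "original.Name -> obfuscated.Name:".
            // Indented lines describe members and are skipped here, as are comments.
            if (line.isEmpty() || Character.isWhitespace(line.charAt(0)) || line.startsWith("#")) {
                continue;
            }
            String trimmed = line.trim();
            int arrow = trimmed.indexOf(" -> ");
            if (arrow < 0 || !trimmed.endsWith(":")) {
                continue;
            }
            String original = trimmed.substring(0, arrow);
            String obfuscated = trimmed.substring(arrow + 4, trimmed.length() - 1);
            obfuscatedToOriginal.put(obfuscated, original);
        }
    }

    /** Returns the original class name, or the input unchanged if it is not in the mapping. */
    String originalClass(String obfuscatedClass) {
        return obfuscatedToOriginal.getOrDefault(obfuscatedClass, obfuscatedClass);
    }
}
```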

Step 2: Upload the mapping file to the console
Log on to the Cloud Monitor 2.0 console.
Go to RUM > Select the application you integrated > Application Settings > File Management
Click the symbol table file > Upload the file
Upload the mapping.txt file

3.2 Symbolize native .so files
After the build is complete, the .so files are located in the following folder:

app/build/intermediates/cxx/release/xxx/obj/
├── arm64-v8a/
│   └── xxx-native.so  ← contains debug symbols
├── armeabi-v7a/
│   └── xxx-native.so
└── x86_64/
    └── xxx-native.so

Step 3: Upload to the console
Similar to the Java mapping file, upload the .so file of the corresponding architecture in the console.

3.3 Verify symbolication
Use the symbol table file for parsing: Open crash details > Exception details > Parse the stack > Select the corresponding symbol table file (use the .so file for a native stack and the .txt file for a Java stack).

Click OK to display the parsed stack.

Signs of successful symbolication:

● Display full class name and method name.

● Show source file path and line number.

● C++ function name restored (non-mangled state).

IV. 📝 Case Summary: Key Value of RUM
What key help does RUM provide in troubleshooting this crash?

1. Complete stack information + symbolication
● Without RUM: For online applications, you can see only the obfuscated stack and do not know where the crash occurred.

● With RUM: After uploading the mapping file, you can accurately pinpoint ProductListAdapter.java:50.

2. User behavior path tracing
● Without RUM: We only know that "the user opens the list and it crashes", but cannot reproduce the crash.

● With RUM: You can view the complete operation timeline and discover that the issue is triggered by "rapidly clicking refresh multiple times."

3. Multi-dimensional data analysis
● Without RUM: You do not know which users are affected or in what environment the crash occurred.

● With RUM:

    ○ You discover that 90% of crashes occur on 3G or 4G networks (network latency is the key).

    ○ All device models are affected (hardware issues are excluded).

    ○ The issue only started to appear in V3.5.0 (version changes are identified).

4. Real-time alerting + quantified impact
● Without RUM: You rely on user complaints, so problems are discovered late.

● With RUM: You receive alerts immediately and start troubleshooting immediately.

Application stability is the cornerstone of user experience. Through systematic crash collection and analysis, developer teams can transform from "passive response" to "proactive prevention," continuously improving application quality and winning user trust. Alibaba Cloud RUM implements a non-intrusive collection SDK for application performance, stability, and user behavior for Android. You can refer to the integration document to experience it. In addition to Android, RUM also supports monitoring analysis for various platforms such as Web, miniapp, iOS, and HarmonyOS. For related questions, you can join the RUM support group (DingTalk group ID: 67370002064) for consultation.
