System stability is the foundation of user experience and the lifeline of brand reputation.
Introduction: Why is System Stability So Important?
Remember the news stories about a well-known phone brand that was flooded with user complaints over frequent system crashes? A crash may last only a few seconds, but the damage to brand reputation can take years to repair.
As an Android Framework engineer, you may often encounter scenarios like:
- A tester rushes over angrily: "The phone crashed again!"
- User feedback: "The app keeps crashing, it's terrible!"
- The boss tags everyone in the group chat: "Massive ANR spike in production, urgent fix needed!"
Behind these problems lies the same theme—system stability.
System stability is like the foundation of a house. If the foundation is unstable, no matter how beautiful the decoration, it's still a castle in the air. The Android system runs dozens of system processes and hundreds of system services. They are like a precise symphony orchestra—if any member has a problem, the entire performance is affected.
This article is the opening piece of the "Android System Stability and Performance Optimization" series. Starting from the Android system architecture, it will help you build a basic mental model for system stability analysis. After reading this article, you will be able to:
- Understand the relationship between Android's layered architecture and stability
- Master the core mechanisms and key components of system stability
- Understand the classification and manifestations of stability issues
- Establish a fundamental analytical framework for system stability analysis
Ready? Let's begin this journey of exploration!
I. Android System Architecture Review
To understand system stability, we must first have a clear understanding of Android's system architecture. If we compare the Android system to a building, it is constructed in layers like this:
1.1 Layered Architecture Overview
This layered design is like a pyramid—the lower the layer, the more stable, but the greater the impact when problems occur.
To give an analogy: if your light bulb breaks (Application layer), just replace it; if the circuit board burns out (Framework layer), the entire room may lose power; if the transformer explodes (Kernel layer), the entire building suffers!
1.2 Impact of Each Layer on Stability
Let's analyze layer by layer to see what stability issues may occur at each level:
Application Layer (Applications)
This is the layer users directly interact with. Stability issues mainly manifest as:
- Java Crash: Uncaught exceptions causing app crashes
- ANR: Main thread blocking causing app unresponsiveness
- Memory Leak: Continuous memory consumption eventually leading to OOM
Impact Scope: Single application
Severity: Moderate (users can reopen the app)
Framework Layer (Application Framework)
This is Android's "brain", providing all services required for app operation:
- System Server Crash: Core system service crashes, causing system restart
- Service Deadlock: Multiple services waiting for each other, system freezes
- Binder Resource Exhaustion: Inter-process communication blocked
Impact Scope: Entire system
Severity: Critical (may cause system restart)
Native Layer (Native Libraries & ART)
This layer is written in C/C++, with high execution efficiency but also more prone to issues:
- Native Crash: Signals like SIGSEGV, SIGABRT causing process termination
- Memory Access Violation: Wild pointers, array out of bounds
- Resource Leak: File descriptor leaks, memory leaks
Impact Scope: Single process to entire system
Severity: Important to Critical
HAL Layer (Hardware Abstraction Layer)
The hardware abstraction layer connects software and hardware:
- HAL Service Crash: Hardware functionality unavailable
- Hardware Timeout: Communication timeout with hardware
Impact Scope: Specific hardware functionality
Severity: Important
Kernel Layer (Linux Kernel)
This is the foundation of the system:
- Kernel Panic: Kernel crash, system directly restarts
- Driver Crash: Specific hardware functionality abnormal
- Deadlock: System completely frozen
Impact Scope: Entire device
Severity: Fatal
1.3 Key Processes and Services
In the Android system, there are several processes that are "vital"—if they crash, the entire system must restart:
Zygote Process
Zygote is Android's "incubator". All application processes are forked from it. Think of it as an "application factory":
Zygote
├── fork() → App1 process
├── fork() → App2 process
├── fork() → App3 process
└── fork() → System Server process
If Zygote crashes, it's like the factory shutting down—new application processes cannot be created, and the system can only restart.
System Server Process
System Server is Android's "butler", running almost all core system services:
| Service Name | Responsibility | What happens if it crashes |
|---|---|---|
| ActivityManagerService | Manages lifecycle of four major components | Cannot start/switch apps |
| WindowManagerService | Manages window display | Screen goes blank |
| PackageManagerService | Manages app installation/queries | Cannot install/find apps |
| PowerManagerService | Manages power state | Cannot sleep/wake |
| InputManagerService | Manages input events | Touch/key input fails |
Once System Server crashes, the entire system restarts (you'll see the boot animation play again).
SurfaceFlinger Process
SurfaceFlinger is Android's "painter", responsible for compositing all app interfaces and displaying them on the screen:
App1 Surface ──┐
App2 Surface ──┼──→ SurfaceFlinger ──→ Display
StatusBar    ──┘
If SurfaceFlinger crashes, the screen either goes black or freezes.
II. Core Stability Mechanisms
The Android system designs multiple mechanisms to ensure stability, like the human immune system. Let's understand these "moats" one by one.
2.1 Process Management Mechanism
Android is a multitasking operating system, but phone memory is limited. How does the system stay smooth with limited resources? The answer is process priority management and the Low Memory Killer (LMK).
Process Priority
Android divides processes into 5 levels based on their importance:
| Priority | Name | oom_score_adj | Description | Example |
|---|---|---|---|---|
| Highest | Foreground | 0 | Process the user is interacting with | Foreground app |
| High | Visible | 100 | Visible but not in the foreground | App showing a floating window |
| Medium | Service | 500 | Process running a background service | Background sync service |
| Low | Cached | 900 | Cached background process | Recently used app |
| Lowest | Empty | 999 | Empty process that can be killed at any time | Exited app |
You can check process priority with the following code:
// Check process importance.
// Note: on recent Android releases a third-party app only sees its own processes here.
ActivityManager am = getSystemService(ActivityManager.class);
List<ActivityManager.RunningAppProcessInfo> processes = am.getRunningAppProcesses();
if (processes != null) { // the list may be null when nothing is reported
    for (ActivityManager.RunningAppProcessInfo info : processes) {
        Log.d(TAG, "Process: " + info.processName +
                ", Importance: " + info.importance +
                ", ImportanceReasonCode: " + info.importanceReasonCode);
    }
}
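The importance value logged above is a plain integer, which is hard to read in a log. As a rough sketch, the helper below turns it into a label. It runs on a plain JVM, so the constants are copied by hand from the public `ActivityManager.RunningAppProcessInfo` values as I understand them. The class name `ImportanceLabels` and the label strings are my own, and less common values are simply folded into the next bucket.

```java
/** Plain-Java helper that names the importance values logged above. */
final class ImportanceLabels {
    // Hand-copied mirrors of RunningAppProcessInfo constants (lower = more important).
    static final int IMPORTANCE_FOREGROUND = 100;
    static final int IMPORTANCE_FOREGROUND_SERVICE = 125;
    static final int IMPORTANCE_VISIBLE = 200;
    static final int IMPORTANCE_PERCEPTIBLE = 230;
    static final int IMPORTANCE_SERVICE = 300;
    static final int IMPORTANCE_CACHED = 400;

    private ImportanceLabels() {}

    /** Maps a raw importance value to the first named bucket at or above it. */
    static String describe(int importance) {
        if (importance <= IMPORTANCE_FOREGROUND) return "foreground";
        if (importance <= IMPORTANCE_FOREGROUND_SERVICE) return "foreground service";
        if (importance <= IMPORTANCE_VISIBLE) return "visible";
        if (importance <= IMPORTANCE_PERCEPTIBLE) return "perceptible";
        if (importance <= IMPORTANCE_SERVICE) return "service";
        if (importance <= IMPORTANCE_CACHED) return "cached";
        return "gone";
    }
}
```

In the logging loop you could then print `ImportanceLabels.describe(info.importance)` next to the raw number.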
Low Memory Killer (LMK)
When system memory is insufficient, LMK steps in to "kill processes". Its working principle is simple: prioritize killing low-priority processes, releasing memory for high-priority processes.
When memory is sufficient: [Foreground App] [Visible App] [Service] [Cache1] [Cache2] [Empty]
↓ Memory pressure
When memory is tight: [Foreground App] [Visible App] [Service] [Cache1] ← Empty and Cache2 killed
↓ Severe memory shortage
Extreme case: [Foreground App] [Visible App] ← Even service processes killed
LMK vs OOM Killer:
| Feature | LMK | OOM Killer |
|---|---|---|
| Layer | Android userspace (lmkd; a kernel driver on older releases) | Linux kernel |
| Trigger Timing | When memory below threshold (proactive) | When memory exhausted (reactive) |
| Process Killing Strategy | Based on oom_score_adj priority | Based on oom_score |
| Purpose | Preventive memory release | Emergency memory release |
LMK is like a "prepared" butler, starting cleanup before memory gets tight; OOM Killer is the "last resort" emergency measure.
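To make the "kill low priority first" idea concrete, here is a toy model of victim selection. It is only a sketch under my own assumptions, not how lmkd is implemented: the class `LmkSketch`, the `Proc` fields, the `minAdj` cut-off, and the rule of breaking ties by larger memory footprint are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of "least important first" victim selection; not the real lmkd logic. */
final class LmkSketch {
    /** Hypothetical process record: name, priority score, resident memory in KB. */
    static final class Proc {
        final String name;
        final int adj;
        final long rssKb;

        Proc(String name, int adj, long rssKb) {
            this.name = name;
            this.adj = adj;
            this.rssKb = rssKb;
        }
    }

    private LmkSketch() {}

    /**
     * Returns the names of processes to kill, highest adj first, until at least
     * neededKb would be freed. Processes with adj below minAdj are never chosen.
     */
    static List<String> pickVictims(List<Proc> procs, int minAdj, long neededKb) {
        List<Proc> candidates = new ArrayList<>();
        for (Proc p : procs) {
            if (p.adj >= minAdj) candidates.add(p);
        }
        // Highest adj (least important) first; among equals, the larger process first.
        candidates.sort((a, b) -> a.adj != b.adj
                ? Integer.compare(b.adj, a.adj)
                : Long.compare(b.rssKb, a.rssKb));

        List<String> victims = new ArrayList<>();
        long freedKb = 0;
        for (Proc p : candidates) {
            if (freedKb >= neededKb) break;
            victims.add(p.name);
            freedKb += p.rssKb;
        }
        return victims;
    }
}
```

Raising `minAdj` protects more of the list, and lowering it under severe pressure lets the selection reach more important processes, which is the progression shown in the diagram above.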
2.2 Exception Monitoring Mechanism
The Android system has a comprehensive "health check" mechanism that can detect and handle various exceptions in time.
ANR Monitoring
ANR (Application Not Responding) is one of the most common stability issues users encounter. When an app doesn't respond within the specified time, the system pops up an "Application Not Responding" dialog.
ANR trigger conditions:
| Type | Timeout | Trigger Scenario |
|---|---|---|
| Input event | 5 seconds | Touch/key event not processed |
| Broadcast | 10s foreground/60s background | Broadcast receiver not completed |
| Service | 20s foreground/200s background | Service start not completed |
| ContentProvider | 10 seconds | Provider publish timeout |
Core ANR monitoring logic:
// Simplified pseudocode of input ANR detection (conceptually what InputDispatcher.cpp does;
// this is not the actual AOSP source)
void InputDispatcher::handleAnr() {
    // 1. Send input event to app
    dispatchEventLocked(event);
    // 2. Set timeout timer (5 seconds)
    mAnrTimer.set(5000ms);
    // 3. Wait for app response
    if (!receivedResponse && timeout) {
        // 4. Trigger ANR
        notifyAnr(connection, reason);
    }
}
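The same "dispatch, start the clock, check for a response" idea can be sketched in plain Java. This is a simplified model under my own assumptions, not the InputDispatcher implementation: the class `InputTimeoutTracker` and its method names are hypothetical, and time is passed in explicitly so the logic is easy to test.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified model of input timeout tracking; names are illustrative only. */
final class InputTimeoutTracker {
    // The 5-second input timeout from the table above.
    static final long INPUT_TIMEOUT_MS = 5_000;

    // Event id -> time (ms) at which the event was handed to the app.
    private final Map<Integer, Long> pendingSince = new HashMap<>();

    /** Called when an event is sent to the app. */
    void onDispatched(int eventId, long nowMs) {
        pendingSince.put(eventId, nowMs);
    }

    /** Called when the app reports that it finished handling the event. */
    void onFinished(int eventId) {
        pendingSince.remove(eventId);
    }

    /** True if any dispatched event has been waiting for the full timeout. */
    boolean isTimedOut(long nowMs) {
        for (long since : pendingSince.values()) {
            if (nowMs - since >= INPUT_TIMEOUT_MS) return true;
        }
        return false;
    }
}
```

The point to notice is that the clock starts at dispatch, not when the app begins processing, so time the main thread spends busy with earlier work counts against the event.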
Crash Monitoring
App crashes are divided into Java Crash and Native Crash:
Java Crash Handling Flow:
// Set a global exception handler
// (CrashReporter is a placeholder for your own reporting code)
Thread.setDefaultUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler() {
    @Override
    public void uncaughtException(Thread t, Throwable e) {
        // 1. Collect crash information
        String stackTrace = Log.getStackTraceString(e);
        // 2. Write to log
        Log.e(TAG, "Uncaught exception in thread " + t.getName() + "\n" + stackTrace);
        // 3. Report to server (optional)
        CrashReporter.report(e);
        // 4. Terminate process
        Process.killProcess(Process.myPid());
        System.exit(10);
    }
});
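One design choice worth considering: rather than killing the process yourself, record the crash and then hand it to whatever handler was installed before yours, so the platform's default handling still runs. The sketch below shows that chaining pattern with standard `java.lang.Thread` APIs only. The class `ChainingCrashHandler` and its in-memory report list are my own illustration, standing in for a real reporter.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Records a crash, then delegates to the previously installed handler. */
final class ChainingCrashHandler implements Thread.UncaughtExceptionHandler {
    private final Thread.UncaughtExceptionHandler previous; // may be null
    private final List<String> reports = Collections.synchronizedList(new ArrayList<>());

    ChainingCrashHandler(Thread.UncaughtExceptionHandler previous) {
        this.previous = previous;
    }

    @Override
    public void uncaughtException(Thread t, Throwable e) {
        try {
            // Stand-in for real reporting: keep a one-line summary in memory.
            reports.add(t.getName() + ": " + e);
        } finally {
            // Always pass the crash on, even if our own reporting failed.
            if (previous != null) previous.uncaughtException(t, e);
        }
    }

    List<String> reports() {
        return reports;
    }

    /** Installs this handler in front of the current default handler. */
    static ChainingCrashHandler install() {
        ChainingCrashHandler handler =
                new ChainingCrashHandler(Thread.getDefaultUncaughtExceptionHandler());
        Thread.setDefaultUncaughtExceptionHandler(handler);
        return handler;
    }
}
```

Calling `ChainingCrashHandler.install()` once at startup is enough. The `finally` block is the important part, since a bug in the reporting code must not swallow the original crash.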
Native Crash Handling Flow:
Native layer crashes are handled through Linux signal mechanism:
Program triggers SIGSEGV (segmentation fault)
↓
Kernel sends signal to process
↓
debuggerd catches signal (crash_dump on Android 8.0 and later)
↓
Collect crash information (registers, stack, maps)
↓
Generate tombstone file
↓
Notify AMS process termination
Watchdog Mechanism
Watchdog is System Server's "watchdog", specifically monitoring whether core threads are running normally.
// Core threads monitored by Watchdog
mMonitorChecker = new HandlerChecker(FgThread.getHandler(), "foreground thread");
mMonitorChecker.addMonitor(mAms); // ActivityManagerService
mMonitorChecker.addMonitor(mWms); // WindowManagerService
mMonitorChecker.addMonitor(mPms); // PowerManagerService
// ...
Watchdog working principle:
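- Every 30 seconds (the check interval), Watchdog posts a check message to each monitored thread's Handler
- If a thread has not responded after 30 seconds (half the default timeout), Watchdog marks it as half-waited, a semi-deadlock state, and dumps thread stacks for the first time
- If there is still no response after the full 60 seconds (default timeout), Watchdog kills System Server, which triggers a system restart
- The stacks of all threads are written to a trace file under /data/anr/ (traces.txt on older releases)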
Normal case:
Watchdog ──send message──→ Core thread ──responds within 30s──→ Watchdog: "All normal"
Abnormal case:
Watchdog ──send message──→ Core thread (stuck) ──no response for 60s──→ Watchdog: "Restart System Server!"
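The decision Watchdog makes on each pass can be summarized as a small state function. The sketch below assumes the AOSP defaults as I understand them (60-second timeout, with the half-way mark at 30 seconds). The class `WatchdogStateSketch` is my own simplification and leaves out the real Handler plumbing.

```java
/** Simplified model of how a watchdog check is classified by elapsed time. */
final class WatchdogStateSketch {
    enum State { COMPLETED, WAITING, WAITED_HALF, OVERDUE }

    // Assumed default timeout; the half-way mark is derived from it.
    static final long TIMEOUT_MS = 60_000;

    private WatchdogStateSketch() {}

    /**
     * @param completed whether the monitored thread has answered the check
     * @param startMs   when the check message was posted
     * @param nowMs     current time
     */
    static State evaluate(boolean completed, long startMs, long nowMs) {
        if (completed) return State.COMPLETED;
        long waitedMs = nowMs - startMs;
        if (waitedMs < TIMEOUT_MS / 2) return State.WAITING;     // still within the first 30s
        if (waitedMs < TIMEOUT_MS) return State.WAITED_HALF;     // 30s to 60s: dump stacks, keep waiting
        return State.OVERDUE;                                    // 60s or more: restart System Server
    }
}
```

The two-stage design matters for debugging: the early dump at the half-way mark gives you a second set of stacks to compare against the final one, which helps tell a slow thread from a truly stuck one.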
2.3 Resource Management Mechanism
Resource leaks are common root causes of stability issues. Android has specialized management mechanisms for several types of critical resources.
Memory Management
Android memory is divided into several regions:
| Memory Type | Description | Typical Issues |
|---|---|---|
| Java Heap | Heap memory managed by Dalvik/ART | Memory leaks causing OOM |
| Native Heap | Memory allocated by C/C++ | Forgetting to free causing leaks |
| Graphics | Graphics memory used by GPU | Large images causing OOM |
| Code | Memory occupied by loaded dex/so | Loading too many libraries |
File Descriptor (FD) Management
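The number of file descriptors each process can open is limited (historically 1024 on Android; newer releases raise the limit considerably, so check the actual value on your device). FD leaks can cause: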
- Cannot open new files
- Cannot create new sockets
- Cannot establish Binder connections
# Check process FD usage (quoted so the pipe runs on the device; other apps' processes need root)
adb shell "ls /proc/<pid>/fd | wc -l"
# Check FD limits
adb shell "grep 'open files' /proc/<pid>/limits"
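If you collect these two numbers automatically, a small parser makes it easy to warn before a process runs out of descriptors. The sketch below reads the text of a Linux `/proc/<pid>/limits` file and pulls out the soft limit from the "Max open files" row. The class `ProcLimits`, its method names, and the idea of a warning ratio are my own assumptions.

```java
import java.util.OptionalLong;

/** Parses the "Max open files" row of /proc/<pid>/limits text. */
final class ProcLimits {
    private ProcLimits() {}

    /** Returns the soft limit, or empty if the row is missing or malformed. */
    static OptionalLong softOpenFilesLimit(String limitsText) {
        for (String line : limitsText.split("\\R")) {
            if (!line.startsWith("Max open files")) continue;
            // Columns: Max, open, files, <soft>, <hard>, <units>
            String[] cols = line.trim().split("\\s+");
            if (cols.length < 5) return OptionalLong.empty();
            if (cols[3].equals("unlimited")) return OptionalLong.of(Long.MAX_VALUE);
            try {
                return OptionalLong.of(Long.parseLong(cols[3]));
            } catch (NumberFormatException e) {
                return OptionalLong.empty();
            }
        }
        return OptionalLong.empty();
    }

    /** True when the open FD count has reached the given fraction of the limit. */
    static boolean nearLimit(long openFds, long softLimit, double ratio) {
        return openFds >= softLimit * ratio;
    }
}
```

A monitoring job could feed it the output of the two commands above and flag any process where `nearLimit(count, limit, 0.8)` is true.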
Binder Resource Management
Binder is Android's most important IPC mechanism, but its resources are also limited:
- Binder Thread Pool: Maximum 16 threads by default
- Binder Buffer: About 1MB per process (regular apps), shared by all transactions in flight for that process
- Binder Proxy Objects: Limited quantity
When Binder resources are exhausted, inter-process communication fails, manifesting as ANR or functional abnormalities.
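Because the buffer is shared, a transaction can fail even when its own payload is well under 1MB. The arithmetic below is a rough sketch of that budget. It assumes 4 KiB pages, a buffer of 1MB minus two pages, and half of the buffer available to one-way (async) calls, which matches my understanding of current defaults but should be treated as an assumption. The class `BinderBudgetSketch` is hypothetical and simplifies in-flight accounting to a single number.

```java
/** Back-of-the-envelope model of the per-process Binder buffer budget. */
final class BinderBudgetSketch {
    static final long PAGE_SIZE = 4096;                           // assumption: 4 KiB pages
    static final long BUFFER_BYTES = 1024 * 1024 - 2 * PAGE_SIZE; // assumed buffer size
    static final long ASYNC_BUDGET_BYTES = BUFFER_BYTES / 2;      // assumed one-way share

    private BinderBudgetSketch() {}

    /**
     * @param payloadBytes  size of the transaction about to be sent
     * @param inFlightBytes bytes already occupied by transactions in progress
     * @param oneway        whether the call is one-way (async)
     */
    static boolean fits(long payloadBytes, long inFlightBytes, boolean oneway) {
        if (payloadBytes < 0 || inFlightBytes < 0) return false;
        long budget = oneway ? ASYNC_BUDGET_BYTES : BUFFER_BYTES;
        return payloadBytes <= budget - inFlightBytes;
    }
}
```

In practice this is why large bitmaps or long lists sent over Binder tend to fail intermittently: whether they fit depends on what else the process is receiving at that moment.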
III. Stability Issue Classification
After understanding stability mechanisms, let's systematically classify various stability issues.
3.1 Classification by Severity
Severity pyramid, from the top (most severe) to the base:
| Level | Typical Issues | Impact |
|---|---|---|
| Fatal | Kernel Panic, System Server Crash | System restart, user data may be lost |
| Critical | Native Crash, Watchdog restart | Function unavailable, need to restart app/system |
| Important | ANR, System service unresponsive | Severe user experience degradation |
| Moderate | App Crash, performance lag | Affects single app, user can recover |
3.2 Classification by Manifestation
Unresponsive Type
| Issue | Manifestation | Common Causes |
|---|---|---|
| ANR | "Application Not Responding" dialog pops up | Main thread blocking, deadlock |
| System Unresponsive | System freezes, keys don't respond | System Server deadlock |
| Black Screen | Screen shows nothing | SurfaceFlinger issue |
Crash Type
| Issue | Manifestation | Common Causes |
|---|---|---|
| Java Crash | App crashes | Null pointer, array out of bounds |
| Native Crash | App crashes, has tombstone | Memory access violation |
| Kernel Crash | System directly restarts | Driver bug, hardware issue |
Abnormal Restart Type
| Issue | Manifestation | Common Causes |
|---|---|---|
| Watchdog Restart | Boot animation appears | System service deadlock |
| Kernel Panic | Sudden restart | Kernel exception |
| Hardware Watchdog | Sudden restart | System completely frozen |
3.3 Common Root Cause Analysis
After analyzing numerous issues, I've summarized common root causes of stability problems:
- Deadlock Issues (~25%)
  - Multiple threads competing for same resource
  - Binder calls forming loops
  - Main thread waiting for child thread
- Resource Exhaustion (~20%)
  - Memory leaks causing OOM
  - FD leaks
  - Binder resource exhaustion
- Memory Access Violations (~15%)
  - Null pointer references
  - Wild pointers
  - Array out of bounds
- Concurrency Race Conditions (~15%)
  - Multi-threaded read/write conflicts
  - Timing issues
- Third-party SDK Issues (~10%)
  - SDK compatibility issues
  - SDK internal bugs
- Others (~15%)
  - Configuration errors
  - Hardware compatibility
  - System bugs
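The largest category, deadlock, has a useful mental picture: draw an arrow from each waiting thread to the thread (or service) it is waiting on, and a deadlock is a loop in that drawing. The sketch below checks for such a loop. It is a minimal illustration, assuming each waiter blocks on exactly one holder, and the class `WaitForGraph` is my own name, not a framework API.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Wait-for graph where each waiter is blocked on at most one holder. */
final class WaitForGraph {
    private final Map<String, String> waitsOn = new HashMap<>();

    /** Records that `waiter` is blocked until `holder` makes progress. */
    void addWait(String waiter, String holder) {
        waitsOn.put(waiter, holder);
    }

    /** True if following the wait arrows from any thread leads back to a visited thread. */
    boolean hasDeadlock() {
        for (String start : waitsOn.keySet()) {
            Set<String> seen = new HashSet<>();
            String current = start;
            // Walk the chain until it ends (null) or revisits a node (cycle).
            while (current != null && seen.add(current)) {
                current = waitsOn.get(current);
            }
            if (current != null) return true;
        }
        return false;
    }
}
```

A long chain without a loop is not a deadlock, only a slow dependency, and the fix is different: speed up or decouple the end of the chain rather than reorder locks.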
IV. Stability Analysis Framework
When encountering stability issues, how to efficiently locate root causes? Here's a practical analytical framework.
4.1 Analysis Workflow
4.2 5W1H Analysis Method
When facing stability issues, I usually use the 5W1H method to organize my thinking:
| Question | Description | How to Obtain |
|---|---|---|
| What | What happened? | Phenomenon description, error type |
| When | When did it happen? | Timestamp, frequency |
| Where | Where did it happen? | Module, file, code line |
| Who | Who had the problem? | Process, thread, component |
| Why | Why did it happen? | Root cause analysis |
| How | How to reproduce? How to solve? | Reproduction steps, fix solution |
4.3 Toolbox Overview
To do a good job, one must first sharpen one's tools. Here are common tools for analyzing stability issues:
Logging Tools
| Tool | Purpose | Command Example |
|---|---|---|
| logcat | View real-time logs | adb logcat -v threadtime |
| DropBox | View system-saved exceptions | adb shell dumpsys dropbox |
| bugreport | Export complete system logs | adb bugreport |
Analysis Tools
| Tool | Purpose | Applicable Scenarios |
|---|---|---|
| Systrace | System-level performance analysis | ANR, lag issues |
| Perfetto | Next-generation trace tool | Complex performance issues |
| addr2line | Native symbolization | Native Crash |
| MAT | Memory analysis | Memory leaks |
Example: Analyzing an ANR
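# 1. Get ANR traces (newer releases write per-ANR files under /data/anr/ instead of
#    traces.txt; on non-rooted devices collect them with adb bugreport)
adb pull /data/anr/traces.txt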
# 2. Check ANR reason
adb shell dumpsys activity anr
# 3. Analyze stack traces in traces.txt
# Focus on main thread state:
# - If BLOCKED, find what lock it's waiting for
# - If WAITING, find what signal it's waiting for
# - If executing a method, analyze why that method is slow
# 4. Use Systrace for further analysis
python systrace.py -o trace.html sched freq idle am wm gfx view
V. Case Study: A Real ANR Analysis
Let's demonstrate how to apply the above knowledge through a real case.
Problem Phenomenon
User feedback: When switching WiFi in Settings, "System UI Not Responding" dialog frequently appears.
Analysis Process
Step 1: Collect Logs
adb bugreport
adb pull /data/anr/traces.txt
Step 2: Determine Problem Type
Found ANR record in DropBox:
Subject: Broadcast of Intent { act=android.net.wifi.WIFI_STATE_CHANGED }
ANR in com.android.systemui
PID: 1234
Reason: Broadcast of Intent { act=android.net.wifi.WIFI_STATE_CHANGED }
waited 10003ms for android.intent.action.BATTERY_CHANGED
This is a Broadcast Timeout ANR.
Step 3: Analyze Stack
Found SystemUI's main thread in traces.txt:
"main" prio=5 tid=1 Blocked
| group="main" sCount=1 dsCount=0 flags=1 obj=0x74e04dd8 self=0x7a4e014c00
| sysTid=1234 nice=-2 cgrp=default sched=0/0 handle=0x7b5a9f49a8
| state=S schedstat=( 123456789 987654321 12345 ) utm=100 stm=50 core=2 HZ=100
at com.android.systemui.BatteryController.update(BatteryController.java:150)
- waiting to lock <0x0a1b2c3d> (a java.lang.Object) held by thread 15
at com.android.systemui.BatteryController.onReceive(BatteryController.java:100)
...
The main thread is waiting for a lock held by thread 15.
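When you have many trace files to go through, this step is easy to script. The sketch below pulls the lock address and the holder's thread id out of a "waiting to lock" line like the one in the dump above. The regular expression is written against that sample line only, so treat the exact format as an assumption. The class `TraceLockParser` is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Extracts lock-wait details from a thread's stack dump text. */
final class TraceLockParser {
    // Matches e.g. "- waiting to lock <0x0a1b2c3d> (a java.lang.Object) held by thread 15"
    private static final Pattern WAITING = Pattern.compile(
            "- waiting to lock <(0x[0-9a-fA-F]+)> \\(a ([^)]+)\\) held by thread (\\d+)");

    private TraceLockParser() {}

    /** Returns the tid of the thread holding the lock, or -1 if no wait line is found. */
    static int holderTid(String threadDump) {
        Matcher m = WAITING.matcher(threadDump);
        return m.find() ? Integer.parseInt(m.group(3)) : -1;
    }

    /** Returns the lock address such as "0x0a1b2c3d", or null if no wait line is found. */
    static String lockAddress(String threadDump) {
        Matcher m = WAITING.matcher(threadDump);
        return m.find() ? m.group(1) : null;
    }
}
```

With the holder's tid in hand, the next step is to search the same file for the thread with that tid and read its stack.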
Step 4: Find Lock-Holding Thread
"BatteryStats" prio=5 tid=15 Native
| group="main" sCount=1 dsCount=0 flags=1 obj=0x13579bdf self=0x7a4e028800
at android.os.BinderProxy.transactNative(Native Method)
at android.os.BinderProxy.transact(Binder.java:1129)
at com.android.internal.os.IBatteryStats$Stub$Proxy.noteWifiOn(IBatteryStats.java:2046)
...
Thread 15 is making a Binder call, waiting for BatteryStatsService response.
Step 5: Root Cause Analysis
Further analysis showed that BatteryStatsService was running a time-consuming statistics operation, so the synchronous Binder call from thread 15 stayed blocked for a long time. This formed the following wait chain:
Main thread waiting for lock → Thread 15 holds lock but waiting for Binder → BatteryStatsService busy
Step 6: Solution
- Move BatteryController's update operation off the main thread to a worker thread, and avoid holding the lock while the Binder call is in progress
- Optimize BatteryStatsService's statistical logic
VI. Preview of Upcoming Articles
As the opening article of this series, this article establishes a fundamental cognitive framework for system stability. In subsequent articles, we will dive deep into each topic:
| Article | Topic | What You'll Learn |
|---|---|---|
| Article 2 | Deep Analysis of ANR Mechanism | Complete flow of ANR trigger, detection, and reporting |
| Article 3 | ANR Troubleshooting Practice | traces.txt analysis techniques, common ANR cases |
| Article 4 | Native Crash Analysis | Tombstone interpretation, addr2line usage |
| Article 5 | Watchdog Mechanism | How system watchdog works |
Stay tuned!
Summary
Starting from Android system architecture, this article introduced core concepts of system stability:
- Layered Architecture: Understanding Android's five-layer architecture and stability impact of each layer
- Key Processes: Zygote, System Server, SurfaceFlinger are the system's "lifelines"
- Core Mechanisms: Process priority, LMK, ANR detection, Watchdog form the stability guarantee system
- Issue Classification: Classifying stability issues by severity and manifestation
- Analysis Framework: 5W1H analysis method and toolbox help efficiently locate problems
System stability cannot be mastered overnight—it requires continuous accumulation in practice. I hope this article helps you establish the correct cognitive framework, so you won't be lost when facing stability issues.
Author Bio: Years of Android system development experience, specializing in system stability and performance optimization. Welcome to follow this series and explore the wonderful world of Android systems together!



