WonderLab

Posted on Mar 10

Android Stability Basics: System Architecture and Key Mechanisms

#android #architecture #anr #crash

System stability is the foundation of user experience and the lifeline of brand reputation.

Introduction: Why is System Stability So Important?

Remember the news about a well-known phone brand that drew a wave of user complaints over frequent system crashes? A crash may last only a few seconds, but the damage to brand reputation can take years to repair.

As an Android Framework engineer, you may often encounter scenarios like:

A tester rushes over angrily: "The phone crashed again!"
User feedback: "The app keeps crashing, it's terrible!"
The boss @s everyone in the group: "Massive ANR incidents online, urgent fix needed!"

Behind these problems lies the same theme—system stability.

System stability is like the foundation of a house. If the foundation is unstable, no matter how beautiful the decoration, it's still a castle in the air. The Android system runs dozens of system processes and hundreds of system services. They are like a precise symphony orchestra—if any member has a problem, the entire performance is affected.

This article is the opening piece of the "Android System Stability and Performance Optimization" series. Starting from the Android system architecture, it builds a basic mental framework for system stability analysis. After reading this article, you will be able to:

Understand the relationship between Android's layered architecture and stability
Master the core mechanisms and key components of system stability
Understand the classification and manifestations of stability issues
Establish a fundamental analytical framework for system stability analysis

Ready? Let's begin this journey of exploration!

I. Android System Architecture Review

To understand system stability, we must first have a clear understanding of Android's system architecture. If we compare the Android system to a building, it is constructed in layers like this:

1.1 Layered Architecture Overview

This layered design is like a pyramid—the lower the layer, the more stable, but the greater the impact when problems occur.

To give an analogy: if your light bulb breaks (Application layer), just replace it; if the circuit board burns out (Framework layer), the entire room may lose power; if the transformer explodes (Kernel layer), the entire building suffers!

1.2 Impact of Each Layer on Stability

Let's analyze layer by layer to see what stability issues may occur at each level:

Application Layer (Applications)

This is the layer users directly interact with. Stability issues mainly manifest as:

Java Crash: Uncaught exceptions causing app crashes
ANR: Main thread blocking causing app unresponsiveness
Memory Leak: Continuous memory consumption eventually leading to OOM

Impact Scope: Single application
Severity: Moderate (users can reopen the app)

Framework Layer (Application Framework)

This is Android's "brain", providing all services required for app operation:

System Server Crash: Core system service crashes, causing system restart
Service Deadlock: Multiple services waiting for each other, system freezes
Binder Resource Exhaustion: Inter-process communication blocked

Impact Scope: Entire system
Severity: Critical (may cause system restart)

Native Layer (Native Libraries & ART)

This layer is written in C/C++, with high execution efficiency but also more prone to issues:

Native Crash: Signals like SIGSEGV, SIGABRT causing process termination
Memory Access Violation: Wild pointers, array out of bounds
Resource Leak: File descriptor leaks, memory leaks

Impact Scope: Single process to entire system
Severity: Important to Critical

HAL Layer (Hardware Abstraction Layer)

The hardware abstraction layer connects software and hardware:

HAL Service Crash: Hardware functionality unavailable
Hardware Timeout: Communication timeout with hardware

Impact Scope: Specific hardware functionality
Severity: Important

Kernel Layer (Linux Kernel)

This is the foundation of the system:

Kernel Panic: Kernel crash, system directly restarts
Driver Crash: Specific hardware functionality abnormal
Deadlock: System completely frozen

Impact Scope: Entire device
Severity: Fatal

1.3 Key Processes and Services

In the Android system, there are several processes that are "vital"—if they crash, the entire system must restart:

Zygote Process

Zygote is Android's "incubator". All application processes are forked from it. Think of it as an "application factory":

Zygote
  ├── fork() → App1 process
  ├── fork() → App2 process
  ├── fork() → App3 process
  └── fork() → System Server process

If Zygote crashes, it's like the factory shutting down: no new application processes can be created. In practice, Zygote's death also takes down System Server and the running apps, so the user sees the framework restart.

System Server Process

System Server is Android's "butler", running almost all core system services:

Service Name	Responsibility	What happens if it crashes
ActivityManagerService	Manages lifecycle of four major components	Cannot start/switch apps
WindowManagerService	Manages window display	Screen goes blank
PackageManagerService	Manages app installation/queries	Cannot install/find apps
PowerManagerService	Manages power state	Cannot sleep/wake
InputManagerService	Manages input events	Touch/key input fails

Once System Server crashes, the entire system restarts (you'll see the boot animation play again).

SurfaceFlinger Process

SurfaceFlinger is Android's "painter", responsible for compositing all app interfaces and displaying them on the screen:

App1 Surface ─┐
App2 Surface ──┼──→ SurfaceFlinger ──→ Display
StatusBar    ─┘

If SurfaceFlinger crashes, the screen either goes black or freezes.

II. Core Stability Mechanisms

The Android system builds in multiple mechanisms to ensure stability, much like the human immune system. Let's go through these lines of defense one by one.

2.1 Process Management Mechanism

Android is a multitasking operating system, but phone memory is limited. How does it keep the system running smoothly with limited resources? The answer is process priority management and the Low Memory Killer (LMK).

Process Priority

Android divides processes into 5 levels based on their importance:

Priority	Name	oom_score_adj	Description	Example
Highest	Foreground	0	Process the user is interacting with	Foreground app
High	Visible	100	Visible but not in foreground	App with floating window in background
Medium	Service	500	Process running a background service (one the user is aware of, such as music playback, is rated higher, at 200)	Background download or sync
Low	Cached	900	Cached background process	Recently used app
Lowest	Empty	999	Empty process, can be killed anytime	Exited app

You can check process priority with the following code:

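// Check process importance (call from a Context such as an Activity)
// Note: on recent Android versions, third-party apps typically only see their own processes here
ActivityManager am = getSystemService(ActivityManager.class);
List<ActivityManager.RunningAppProcessInfo> processes = am.getRunningAppProcesses();
if (processes != null) { // the API returns null rather than an empty list
    for (ActivityManager.RunningAppProcessInfo info : processes) {
        Log.d(TAG, "Process: " + info.processName +
              ", Importance: " + info.importance +
              ", ImportanceReasonCode: " + info.importanceReasonCode);
    }
}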

Low Memory Killer (LMK)

When system memory is insufficient, LMK steps in to "kill processes". Its working principle is simple: prioritize killing low-priority processes, releasing memory for high-priority processes.

When memory is sufficient:  [Foreground App] [Visible App] [Service] [Cache1] [Cache2] [Empty]
                              ↓ Memory pressure
When memory is tight:      [Foreground App] [Visible App] [Service] [Cache1]  ← Empty and Cache2 killed
                              ↓ Severe memory shortage
Extreme case:              [Foreground App] [Visible App]  ← Even service processes killed

LMK vs OOM Killer:

Feature	LMK	OOM Killer
Layer	Android userspace	Linux kernel
Trigger Timing	When memory below threshold (proactive)	When memory exhausted (reactive)
Process Killing Strategy	Based on oom_score_adj priority	Based on oom_score
Purpose	Preventive memory release	Emergency memory release

LMK is like a "prepared" butler, starting cleanup before memory gets tight; OOM Killer is the "last resort" emergency measure.
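The "kill the least important process first" idea can be sketched in a few lines of plain Java. This is a toy model under stated assumptions, not the real lmkd algorithm: the `Proc` record, the memory figures, and the `minAdj` cutoff are all hypothetical and exist only to mirror the diagram above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Toy model of LMK victim selection; not the real lmkd algorithm. */
class LmkSketch {
    /** Hypothetical process record: name, oom_score_adj, and memory in MB. */
    static final class Proc {
        final String name;
        final int oomScoreAdj;
        final int memoryMb;

        Proc(String name, int oomScoreAdj, int memoryMb) {
            this.name = name;
            this.oomScoreAdj = oomScoreAdj;
            this.memoryMb = memoryMb;
        }
    }

    /**
     * Picks victims in descending oom_score_adj order until at least
     * neededMb would be freed. Processes below minAdj are never chosen.
     */
    static List<String> selectVictims(List<Proc> procs, int neededMb, int minAdj) {
        List<Proc> candidates = new ArrayList<>(procs);
        candidates.sort(Comparator.comparingInt((Proc p) -> p.oomScoreAdj).reversed());

        List<String> victims = new ArrayList<>();
        int freedMb = 0;
        for (Proc p : candidates) {
            if (freedMb >= neededMb || p.oomScoreAdj < minAdj) {
                break; // enough memory freed, or only protected processes remain
            }
            victims.add(p.name);
            freedMb += p.memoryMb;
        }
        return victims;
    }

    public static void main(String[] args) {
        List<Proc> procs = Arrays.asList(
                new Proc("foreground", 0, 300),
                new Proc("visible", 100, 200),
                new Proc("cache1", 900, 150),
                new Proc("cache2", 950, 120),
                new Proc("empty", 999, 40));
        // Needs 150 MB and only touches cached/empty processes (adj >= 900)
        System.out.println(selectVictims(procs, 150, 900)); // prints [empty, cache2]
    }
}
```

Lowering `minAdj` in this model corresponds to the "extreme case" row of the diagram, where more important processes become eligible as pressure grows.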

2.2 Exception Monitoring Mechanism

The Android system has a comprehensive "health check" mechanism that can detect and handle various exceptions in time.

ANR Monitoring

ANR (Application Not Responding) is one of the most common stability issues users encounter. When an app doesn't respond within the specified time, the system pops up an "Application Not Responding" dialog.

ANR trigger conditions:

Type	Timeout	Trigger Scenario
Input event	5 seconds	Touch/key event not processed
Broadcast	10s foreground/60s background	Broadcast receiver not completed
Service	20s foreground/200s background	Service start not completed
ContentProvider	10 seconds	Provider publish timeout

Core ANR monitoring logic:

// Simplified, illustrative pseudocode of input ANR detection (the real logic lives in InputDispatcher.cpp)
void InputDispatcher::handleAnr() {
    // 1. Send input event to app
    dispatchEventLocked(event);

    // 2. Set timeout timer (5 seconds)
    mAnrTimer.set(5000ms);

    // 3. Wait for app response
    if (!receivedResponse && timeout) {
        // 4. Trigger ANR
        notifyAnr(connection, reason);
    }
}
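
The same timeout idea can be tried out in plain Java on a desktop JVM. The sketch below is only an analogy, not AOSP code: a single-thread executor stands in for the app's main thread, and the class and method names (`AnrSketch`, `respondsWithin`, `checkWithBusyLooper`) are made up for illustration.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Toy model of timeout-based responsiveness detection; not AOSP code. */
class AnrSketch {
    /**
     * Queues a no-op "ping" behind whatever the looper thread is doing and
     * reports whether it ran within timeoutMs.
     */
    static boolean respondsWithin(ExecutorService looper, long timeoutMs) {
        Future<?> ping = looper.submit(() -> { });
        try {
            ping.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            ping.cancel(false);
            return false; // this is the point where a real system would raise an ANR
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Runs one check against a looper that is first kept busy for busyMs. */
    static boolean checkWithBusyLooper(long busyMs, long timeoutMs) {
        ExecutorService looper = Executors.newSingleThreadExecutor();
        try {
            looper.submit(() -> {
                try {
                    Thread.sleep(busyMs); // simulated slow work on the "main thread"
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            return respondsWithin(looper, timeoutMs);
        } finally {
            looper.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("idle looper responsive: " + checkWithBusyLooper(0, 500));
        System.out.println("busy looper responsive: " + checkWithBusyLooper(1000, 200));
    }
}
```

The point of the design is that the detector never inspects what the thread is doing; it only measures whether queued work gets a turn in time, which is why any long main-thread task can trigger an ANR.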

Crash Monitoring

App crashes are divided into Java Crash and Native Crash:

Java Crash Handling Flow:

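// Set a global exception handler, keeping the previous one so the
// system can still record the crash
final Thread.UncaughtExceptionHandler previous = Thread.getDefaultUncaughtExceptionHandler();
Thread.setDefaultUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler() {
    @Override
    public void uncaughtException(Thread t, Throwable e) {
        // 1. Collect crash information and write it to the log (stack trace included)
        Log.e(TAG, "Uncaught exception in thread " + t.getName(), e);

        // 2. Report to server (optional; CrashReporter stands for your own reporter)
        CrashReporter.report(e);

        // 3. Hand over to the previous handler so the crash is handled as usual
        if (previous != null) {
            previous.uncaughtException(t, e);
        } else {
            // 4. No previous handler: terminate the process ourselves
            Process.killProcess(Process.myPid());
        }
    }
});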

Native Crash Handling Flow:

Native layer crashes are handled through Linux signal mechanism:

Program triggers SIGSEGV (segmentation fault)
        ↓
Kernel sends signal to process
        ↓
debuggerd catches signal
        ↓
Collect crash information (registers, stack, maps)
        ↓
Generate tombstone file
        ↓
Notify AMS process termination

Watchdog Mechanism

Watchdog is System Server's guard, dedicated to monitoring whether its core threads are still running normally.

// Core threads monitored by Watchdog
mMonitorChecker = new HandlerChecker(FgThread.getHandler(), "foreground thread");
mMonitorChecker.addMonitor(mAms); // ActivityManagerService
mMonitorChecker.addMonitor(mWms); // WindowManagerService
mMonitorChecker.addMonitor(mPms); // PowerManagerService
// ...

Watchdog working principle:

Every 30 seconds, Watchdog sends a check message to each monitored thread's Handler
If a thread has not responded after 30 seconds (half of the timeout), it is judged to be in a half-timeout (possible deadlock) state
If it has still not responded after 60 seconds (the default timeout), Watchdog triggers a restart
A traces.txt file is generated, recording all thread stacks

Normal case:
Watchdog ──send message──→ Core thread ──responds within 30s──→ Watchdog: "All normal"

Abnormal case:
Watchdog ──send message──→ Core thread (stuck) ──no response for 60s──→ Watchdog: "Restart System Server!"
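
The timing rules above boil down to a small state classification. The following is a minimal sketch of that idea in plain Java, assuming a 60-second timeout with a half-way warning stage; the class, enum, and method names are chosen for illustration and should not be read as the actual Watchdog source.

```java
/** Toy model of Watchdog check states; simplified from the description above. */
class WatchdogSketch {
    enum State { COMPLETED, WAITING, WAITED_HALF, OVERDUE }

    static final long DEFAULT_TIMEOUT_MS = 60_000;

    /**
     * Classifies a monitored thread by how long its check message has been
     * pending. A completed check is always healthy.
     */
    static State classify(boolean completed, long pendingMs, long timeoutMs) {
        if (completed) {
            return State.COMPLETED;
        }
        if (pendingMs < timeoutMs / 2) {
            return State.WAITING;      // still within the first half of the timeout
        }
        if (pendingMs < timeoutMs) {
            return State.WAITED_HALF;  // suspicious: a good moment to dump stacks
        }
        return State.OVERDUE;          // timeout reached
    }

    /** Only OVERDUE leads to a System Server restart in this model. */
    static boolean shouldRestart(State state) {
        return state == State.OVERDUE;
    }

    public static void main(String[] args) {
        System.out.println(classify(false, 10_000, DEFAULT_TIMEOUT_MS)); // WAITING
        System.out.println(classify(false, 45_000, DEFAULT_TIMEOUT_MS)); // WAITED_HALF
        System.out.println(classify(false, 60_000, DEFAULT_TIMEOUT_MS)); // OVERDUE
    }
}
```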

2.3 Resource Management Mechanism

Resource leaks are common root causes of stability issues. Android has specialized management mechanisms for several types of critical resources.

Memory Management

Android memory is divided into several regions:

Memory Type	Description	Typical Issues
Java Heap	Heap memory managed by Dalvik/ART	Memory leaks causing OOM
Native Heap	Memory allocated by C/C++	Forgetting to free causing leaks
Graphics	Graphics memory used by GPU	Large images causing OOM
Code	Memory occupied by loaded dex/so	Loading too many libraries

File Descriptor (FD) Management

The number of file descriptors each process can open is limited (historically 1024 by default; newer Android versions raise the limit considerably). FD leaks can cause:

Cannot open new files
Cannot create new sockets
Cannot establish Binder connections

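# Check process FD usage (quote the pipeline so wc runs on the device;
# plain ls avoids counting the extra lines that ls -la prints)
adb shell "ls /proc/<pid>/fd | wc -l"

# Check FD limits
adb shell "cat /proc/<pid>/limits | grep 'open files'"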
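
A process can also watch its own FD usage from code. The sketch below is a minimal example assuming a Linux-style `/proc` filesystem (as on Android); the class name, the helper names, and the 80% threshold are illustrative choices, not an established API.

```java
import java.io.File;

/** Minimal self-check of FD usage; assumes a Linux-style /proc filesystem. */
class FdUsageSketch {
    /** Returns the number of open FDs for this process, or -1 if /proc is unavailable. */
    static int countOpenFds() {
        String[] entries = new File("/proc/self/fd").list();
        return entries == null ? -1 : entries.length;
    }

    /** True when a known FD count has reached the given fraction of the limit. */
    static boolean isNearLimit(int openFds, int limit, double threshold) {
        return openFds >= 0 && openFds >= limit * threshold;
    }

    public static void main(String[] args) {
        int open = countOpenFds();
        System.out.println("open FDs: " + open);
        // 1024 is only an example limit; read the real one from /proc/<pid>/limits
        if (isNearLimit(open, 1024, 0.8)) {
            System.out.println("warning: FD usage is above 80% of the limit");
        }
    }
}
```

Logging this count periodically in debug builds makes a slow FD leak visible long before the process actually hits the limit.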

Binder Resource Management

Binder is Android's most important IPC mechanism, but its resources are also limited:

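Binder Thread Pool: Maximum 16 threads per process by default (15 pool threads plus the main Binder thread)
Binder Buffer: About 1MB per process for regular apps (1MB minus 8KB), shared by all in-flight transactions
Binder Proxy Objects: Limited quantity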

When Binder resources are exhausted, inter-process communication fails, manifesting as ANR or functional abnormalities.
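
To see why a full thread pool turns into an ANR on the caller's side, here is a small plain-Java model. It is a sketch under simplifying assumptions: a `Semaphore` stands in for the Binder thread pool, and `BinderPoolSketch` with its methods is a hypothetical name, not part of any Android API.

```java
import java.util.concurrent.Semaphore;

/** Toy model of a bounded Binder thread pool; not the real Binder driver logic. */
class BinderPoolSketch {
    private final Semaphore threads;

    BinderPoolSketch(int maxThreads) {
        threads = new Semaphore(maxThreads);
    }

    /**
     * Tries to start handling an incoming call. A false result means every
     * thread is busy, so a real caller would have to wait in the queue.
     */
    boolean tryBeginCall() {
        return threads.tryAcquire();
    }

    /** Marks one in-flight call as finished, freeing its thread. */
    void endCall() {
        threads.release();
    }

    int idleThreads() {
        return threads.availablePermits();
    }

    public static void main(String[] args) {
        BinderPoolSketch pool = new BinderPoolSketch(16);
        for (int i = 0; i < 16; i++) {
            pool.tryBeginCall(); // 16 slow calls occupy every thread
        }
        System.out.println("17th call served immediately: " + pool.tryBeginCall()); // false
    }
}
```

If the 17th caller happens to be an app's main thread making a synchronous call, its wait in the queue is exactly the kind of blocking that shows up as an ANR.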

III. Stability Issue Classification

After understanding stability mechanisms, let's systematically classify various stability issues.

3.1 Classification by Severity

Severity levels, from most to least severe:

Level	Typical Issues	Impact
Fatal	Kernel Panic, System Server Crash	System restart, user data may be lost
Critical	Native Crash, Watchdog restart	Function unavailable, need to restart app/system
Important	ANR, System service unresponsive	Severe user experience degradation
Moderate	App Crash, performance lag	Affects single app, user can recover

3.2 Classification by Manifestation

Unresponsive Type

Issue	Manifestation	Common Causes
ANR	"Application Not Responding" dialog pops up	Main thread blocking, deadlock
System Unresponsive	System freezes, keys don't respond	System Server deadlock
Black Screen	Screen shows nothing	SurfaceFlinger issue

Crash Type

Issue	Manifestation	Common Causes
Java Crash	App crashes	Null pointer, array out of bounds
Native Crash	App crashes, has tombstone	Memory access violation
Kernel Crash	System directly restarts	Driver bug, hardware issue

Abnormal Restart Type

Issue	Manifestation	Common Causes
Watchdog Restart	Boot animation appears	System service deadlock
Kernel Panic	Sudden restart	Kernel exception
Hardware Watchdog	Sudden restart	System completely frozen

3.3 Common Root Cause Analysis

After analyzing numerous issues, I've summarized common root causes of stability problems:

Deadlock Issues (~25%)
- Multiple threads competing for same resource
- Binder calls forming loops
- Main thread waiting for child thread
Resource Exhaustion (~20%)
- Memory leaks causing OOM
- FD leaks
- Binder resource exhaustion
Memory Access Violations (~15%)
- Null pointer references
- Wild pointers
- Array out of bounds
Concurrency Race Conditions (~15%)
- Multi-threaded read/write conflicts
- Timing issues
Third-party SDK Issues (~10%)
- SDK compatibility issues
- SDK internal bugs
Others (~15%)
- Configuration errors
- Hardware compatibility
- System bugs
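
Deadlocks top the list above, and the classic cause is two threads taking the same locks in opposite order. The sketch below reproduces that on a desktop JVM and detects it with `ThreadMXBean`. Note the assumptions: `java.lang.management` is a standard JVM API that is not part of the Android SDK, so on a device you would read the same information from a thread dump instead; the class and method names are illustrative.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

/** Desktop-JVM demo of a lock-order deadlock and its detection. */
class DeadlockSketch {
    /** Starts two daemon threads that take the same two locks in opposite order. */
    static void startDeadlockedPair() {
        final Object lockA = new Object();
        final Object lockB = new Object();
        final CountDownLatch bothHoldFirstLock = new CountDownLatch(2);
        startWorker("worker-1", lockA, lockB, bothHoldFirstLock);
        startWorker("worker-2", lockB, lockA, bothHoldFirstLock);
    }

    private static void startWorker(String name, Object first, Object second,
                                    CountDownLatch latch) {
        Thread t = new Thread(() -> {
            synchronized (first) {
                latch.countDown();
                try {
                    latch.await(); // wait until the other thread holds its first lock too
                } catch (InterruptedException e) {
                    return;
                }
                synchronized (second) {
                    // never reached: the other thread holds this lock and waits for ours
                }
            }
        }, name);
        t.setDaemon(true); // let the JVM exit despite the stuck threads
        t.start();
    }

    /** Polls the JVM for monitor deadlocks; returns the number of threads involved. */
    static int waitForDeadlock(long timeoutMs) {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            long[] ids = bean.findMonitorDeadlockedThreads();
            if (ids != null) {
                return ids.length;
            }
            try {
                Thread.sleep(20);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return 0;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        startDeadlockedPair();
        System.out.println("deadlocked threads: " + waitForDeadlock(5000));
    }
}
```

The usual fix is to make every thread acquire the locks in one agreed order, which removes the cycle.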

IV. Stability Analysis Framework

When you encounter a stability issue, how do you locate the root cause efficiently? Here's a practical analytical framework.

4.1 Analysis Workflow

4.2 5W1H Analysis Method

When facing stability issues, I usually use the 5W1H method to organize my thinking:

Question	Description	How to Obtain
What	What happened?	Phenomenon description, error type
When	When did it happen?	Timestamp, frequency
Where	Where did it happen?	Module, file, code line
Who	Who had the problem?	Process, thread, component
Why	Why did it happen?	Root cause analysis
How	How to reproduce? How to solve?	Reproduction steps, fix solution

4.3 Toolbox Overview

To do a good job, one must first sharpen one's tools. Here are common tools for analyzing stability issues:

Logging Tools

Tool	Purpose	Command Example
logcat	View real-time logs	`adb logcat -v threadtime`
DropBox	View system-saved exceptions	`adb shell dumpsys dropbox`
bugreport	Export complete system logs	`adb bugreport`

Analysis Tools

Tool	Purpose	Applicable Scenarios
Systrace	System-level performance analysis	ANR, lag issues
Perfetto	Next-generation trace tool	Complex performance issues
addr2line	Native symbolization	Native Crash
MAT	Memory analysis	Memory leaks

Example: Analyzing an ANR

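# 1. Get ANR traces (newer Android versions write one file per ANR under
#    /data/anr/ and may require root or a bugreport to read them)
adb pull /data/anr/traces.txt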

# 2. Check ANR reason
adb shell dumpsys activity anr

# 3. Analyze stack traces in traces.txt
# Focus on main thread state:
# - If BLOCKED, find what lock it's waiting for
# - If WAITING, find what signal it's waiting for
# - If executing a method, analyze why that method is slow

# 4. Use Systrace for further analysis
python systrace.py -o trace.html sched freq idle am wm gfx view

V. Case Study: A Real ANR Analysis

Let's demonstrate how to apply the above knowledge through a real case.

Problem Phenomenon

User feedback: When switching WiFi in Settings, "System UI Not Responding" dialog frequently appears.

Analysis Process

Step 1: Collect Logs

adb bugreport
adb pull /data/anr/traces.txt

Step 2: Determine Problem Type

Found ANR record in DropBox:

Subject: Broadcast of Intent { act=android.net.wifi.WIFI_STATE_CHANGED }
ANR in com.android.systemui
PID: 1234
Reason: Broadcast of Intent { act=android.net.wifi.WIFI_STATE_CHANGED }
        waited 10003ms for android.intent.action.BATTERY_CHANGED

This is a Broadcast Timeout ANR.

Step 3: Analyze Stack

Found SystemUI's main thread in traces.txt:

"main" prio=5 tid=1 Blocked
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x74e04dd8 self=0x7a4e014c00
  | sysTid=1234 nice=-2 cgrp=default sched=0/0 handle=0x7b5a9f49a8
  | state=S schedstat=( 123456789 987654321 12345 ) utm=100 stm=50 core=2 HZ=100
  at com.android.systemui.BatteryController.update(BatteryController.java:150)
  - waiting to lock <0x0a1b2c3d> (a java.lang.Object) held by thread 15
  at com.android.systemui.BatteryController.onReceive(BatteryController.java:100)
  ...

The main thread is waiting for a lock held by thread 15.

Step 4: Find Lock-Holding Thread

"BatteryStats" prio=5 tid=15 Native
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x13579bdf self=0x7a4e028800
  at android.os.BinderProxy.transactNative(Native Method)
  at android.os.BinderProxy.transact(Binder.java:1129)
  at com.android.internal.os.IBatteryStats$Stub$Proxy.noteWifiOn(IBatteryStats.java:2046)
  ...

Thread 15 is making a Binder call, waiting for BatteryStatsService response.

Step 5: Root Cause Analysis

Further analysis showed that BatteryStatsService was performing a time-consuming statistical operation, so the synchronous Binder call blocked for a long time. This formed the following wait chain:

Main thread waiting for lock → Thread 15 holds lock but waiting for Binder → BatteryStatsService busy
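
Steps 3 and 4 (find the blocked thread, then find who holds its lock) are mechanical enough to script. The following is a minimal sketch that assumes the trace layout shown in this case study; the regular expressions and the `TraceSketch` class are illustrative and would need adjusting for other dump formats.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Finds the lock holder for a blocked thread in an ART-style thread dump. */
class TraceSketch {
    // Thread header, e.g.: "main" prio=5 tid=1 Blocked
    private static final Pattern THREAD_HEADER =
            Pattern.compile("^\"([^\"]+)\".*\\btid=\\d+");
    // Lock line, e.g.: - waiting to lock <0x0a1b2c3d> (...) held by thread 15
    private static final Pattern WAITING_TO_LOCK =
            Pattern.compile("waiting to lock <(0x[0-9a-fA-F]+)>.*held by thread (\\d+)");

    /**
     * Returns the tid of the thread holding the lock that the named thread
     * is blocked on, or -1 if that thread is not waiting for a lock.
     */
    static int findLockHolder(String traces, String threadName) {
        boolean inTarget = false;
        for (String line : traces.split("\n")) {
            Matcher header = THREAD_HEADER.matcher(line.trim());
            if (header.find()) {
                inTarget = header.group(1).equals(threadName);
                continue;
            }
            if (inTarget) {
                Matcher lock = WAITING_TO_LOCK.matcher(line);
                if (lock.find()) {
                    return Integer.parseInt(lock.group(2));
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String traces = String.join("\n",
                "\"main\" prio=5 tid=1 Blocked",
                "  at com.android.systemui.BatteryController.update(BatteryController.java:150)",
                "  - waiting to lock <0x0a1b2c3d> (a java.lang.Object) held by thread 15",
                "\"BatteryStats\" prio=5 tid=15 Native",
                "  at android.os.BinderProxy.transactNative(Native Method)");
        System.out.println("main is blocked by tid " + findLockHolder(traces, "main")); // 15
    }
}
```

Calling the helper repeatedly on each holder it returns walks the wait chain until it reaches a thread that is not blocked on a lock, which is usually where the real root cause sits.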

Step 6: Solution

Move BatteryController's update operation off the main thread to a worker thread, so the broadcast receiver returns quickly
Optimize BatteryStatsService's statistical logic

VI. Preview of Upcoming Articles

As the opening article of this series, this article establishes a fundamental cognitive framework for system stability. In subsequent articles, we will dive deep into each topic:

Article	Topic	What You'll Learn
Article 2	Deep Analysis of ANR Mechanism	Complete flow of ANR trigger, detection, and reporting
Article 3	ANR Troubleshooting Practice	traces.txt analysis techniques, common ANR cases
Article 4	Native Crash Analysis	Tombstone interpretation, addr2line usage
Article 5	Watchdog Mechanism	How system watchdog works

Stay tuned!

Summary

Starting from Android system architecture, this article introduced core concepts of system stability:

Layered Architecture: Understanding Android's five-layer architecture and stability impact of each layer
Key Processes: Zygote, System Server, SurfaceFlinger are the system's "lifelines"
Core Mechanisms: Process priority, LMK, ANR detection, Watchdog form the stability guarantee system
Issue Classification: Classifying stability issues by severity and manifestation
Analysis Framework: 5W1H analysis method and toolbox help efficiently locate problems

System stability cannot be mastered overnight—it requires continuous accumulation in practice. I hope this article helps you establish the correct cognitive framework, so you won't be lost when facing stability issues.

Author Bio: Years of Android system development experience, specializing in system stability and performance optimization. Welcome to follow this series and explore the wonderful world of Android systems together!