DEV Community

WonderLab

Android Stability Basics: System Architecture and Key Mechanisms

System stability is the foundation of user experience and the lifeline of brand reputation.

Introduction: Why is System Stability So Important?

You may remember news stories about a well-known phone brand drawing mass user complaints over frequent system crashes. A crash may last only a few seconds, but the damage to brand reputation can take years to repair.

As an Android Framework engineer, you may often encounter scenarios like:

  • A tester rushes over angrily: "The phone crashed again!"
  • User feedback: "The app keeps crashing, it's terrible!"
  • The boss tags everyone in the group chat: "Massive ANR incidents in production, urgent fix needed!"

Behind these problems lies the same theme—system stability.

System stability is like the foundation of a house: if the foundation is unstable, the result is a castle in the air no matter how beautiful the decoration. Android runs dozens of system processes and hundreds of system services. They work like a finely tuned symphony orchestra, where a problem with any one member affects the entire performance.

This article opens the "Android System Stability and Performance Optimization" series. Starting from the Android system architecture, it builds a basic mental framework for stability analysis. After reading it, you will be able to:

  1. Understand the relationship between Android's layered architecture and stability
  2. Master the core mechanisms and key components of system stability
  3. Understand the classification and manifestations of stability issues
  4. Build a basic framework for analyzing stability issues

Ready? Let's begin this journey of exploration!


I. Android System Architecture Review

To understand system stability, we must first have a clear understanding of Android's system architecture. If we compare the Android system to a building, it is constructed in layers like this:

1.1 Layered Architecture Overview

[Figure: Android layered architecture (01-01-android-architecture)]

This layered design is like a pyramid: lower layers tend to be more stable, but a failure there has a wider impact.

To give an analogy: if your light bulb breaks (Application layer), just replace it; if the circuit board burns out (Framework layer), the entire room may lose power; if the transformer explodes (Kernel layer), the entire building suffers!

1.2 Impact of Each Layer on Stability

Let's analyze layer by layer to see what stability issues may occur at each level:

Application Layer (Applications)

This is the layer users directly interact with. Stability issues mainly manifest as:

  • Java Crash: Uncaught exceptions causing app crashes
  • ANR: Main thread blocking causing app unresponsiveness
  • Memory Leak: Continuous memory consumption eventually leading to OOM

Impact Scope: Single application
Severity: Moderate (users can reopen the app)

Framework Layer (Application Framework)

This is Android's "brain", providing all services required for app operation:

  • System Server Crash: Core system service crashes, causing system restart
  • Service Deadlock: Multiple services waiting for each other, system freezes
  • Binder Resource Exhaustion: Inter-process communication blocked

Impact Scope: Entire system
Severity: Critical (may cause system restart)

Native Layer (Native Libraries & ART)

This layer is written in C/C++, with high execution efficiency but also more prone to issues:

  • Native Crash: Signals like SIGSEGV, SIGABRT causing process termination
  • Memory Access Violation: Wild pointers, array out of bounds
  • Resource Leak: File descriptor leaks, memory leaks

Impact Scope: Single process to entire system
Severity: Important to Critical

HAL Layer (Hardware Abstraction Layer)

The hardware abstraction layer connects software and hardware:

  • HAL Service Crash: Hardware functionality unavailable
  • Hardware Timeout: Communication timeout with hardware

Impact Scope: Specific hardware functionality
Severity: Important

Kernel Layer (Linux Kernel)

This is the foundation of the system:

  • Kernel Panic: Kernel crash, system directly restarts
  • Driver Crash: Specific hardware functionality abnormal
  • Deadlock: System completely frozen

Impact Scope: Entire device
Severity: Fatal

1.3 Key Processes and Services

In the Android system, there are several processes that are "vital"—if they crash, the entire system must restart:

Zygote Process

Zygote is Android's "incubator". All application processes are forked from it. Think of it as an "application factory":

Zygote
  ├── fork() → App1 process
  ├── fork() → App2 process
  ├── fork() → App3 process
  └── fork() → System Server process

If Zygote crashes, it's like the factory shutting down—new application processes cannot be created, and the system can only restart.

System Server Process

System Server is Android's "butler", running almost all core system services:

| Service Name | Responsibility | What happens if it crashes |
|---|---|---|
| ActivityManagerService | Manages the lifecycle of the four app components (Activity, Service, BroadcastReceiver, ContentProvider) | Cannot start/switch apps |
| WindowManagerService | Manages window display | Screen goes blank |
| PackageManagerService | Manages app installation/queries | Cannot install/find apps |
| PowerManagerService | Manages power state | Cannot sleep/wake |
| InputManagerService | Manages input events | Touch/key input fails |

Once System Server crashes, the entire system restarts (you'll see the boot animation play again).

SurfaceFlinger Process

SurfaceFlinger is Android's "painter", responsible for compositing all app interfaces and displaying them on the screen:

App1 Surface ─┐
App2 Surface ──┼──→ SurfaceFlinger ──→ Display
StatusBar    ─┘

If SurfaceFlinger crashes, the screen either goes black or freezes.


II. Core Stability Mechanisms

Android builds in multiple mechanisms to protect stability, much like the human immune system. Let's go through these lines of defense one by one.

2.1 Process Management Mechanism

Android is a multitasking operating system, but phone memory is limited. How does it keep the system running smoothly with limited resources? The answer is process priority management and the Low Memory Killer (LMK).

Process Priority

Android divides processes into 5 levels based on their importance:

[Figure: Process priority and LMK (01-02-process-priority-lmk)]

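| Priority | Name | oom_score_adj | Description | Example |
|---|---|---|---|---|
| Highest | Foreground | 0 | Process the user is interacting with | Foreground app |
| High | Visible | 100 | Visible but not in the foreground | App showing a floating window |
| Medium | Service | 500 | Process running a background service | App running a background sync service |
| Low | Cached | 900 | Cached background process | Recently used app |
| Lowest | Empty | 999 | Empty process, can be killed anytime | Exited app |

Note: these are oom_score_adj values as used in recent AOSP releases (older releases used a smaller oom_adj scale). A music player running a foreground service is usually classed as "perceptible" (200), which sits between Visible and Service.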

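You can check process importance with the following code (on recent Android versions, getRunningAppProcesses() generally returns only the caller's own processes for third-party apps):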

// Check process priority
ActivityManager am = getSystemService(ActivityManager.class);
List<ActivityManager.RunningAppProcessInfo> processes = am.getRunningAppProcesses();
for (ActivityManager.RunningAppProcessInfo info : processes) {
    Log.d(TAG, "Process: " + info.processName +
          ", Importance: " + info.importance +
          ", ImportanceReasonCode: " + info.importanceReasonCode);
}

Low Memory Killer (LMK)

When system memory is insufficient, LMK steps in to "kill processes". Its working principle is simple: prioritize killing low-priority processes, releasing memory for high-priority processes.

When memory is sufficient:  [Foreground App] [Visible App] [Service] [Cache1] [Cache2] [Empty]
                              ↓ Memory pressure
When memory is tight:      [Foreground App] [Visible App] [Service] [Cache1]  ← Empty and Cache2 killed
                              ↓ Severe memory shortage
Extreme case:              [Foreground App] [Visible App]  ← Even service processes killed

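The kill order in the diagram above can be sketched in plain Java. This is a toy model under stated assumptions, not lmkd's actual algorithm: the `Proc` class, the `pickVictims` helper, and all adj and memory figures are hypothetical and chosen only for illustration. The one idea taken from the text is that the least important processes (highest adj value) go first.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class LmkSketch {
    /** Hypothetical process descriptor: name, adj score, memory use in MB. */
    static final class Proc {
        final String name;
        final int adj;
        final int memMb;

        Proc(String name, int adj, int memMb) {
            this.name = name;
            this.adj = adj;
            this.memMb = memMb;
        }
    }

    /** Picks victims, highest adj first, until at least neededMb would be freed. */
    static List<String> pickVictims(List<Proc> procs, int neededMb) {
        List<Proc> sorted = new ArrayList<>(procs);
        sorted.sort(Comparator.comparingInt((Proc p) -> p.adj).reversed());
        List<String> victims = new ArrayList<>();
        int freed = 0;
        for (Proc p : sorted) {
            if (freed >= neededMb) {
                break; // enough memory reclaimed, spare the rest
            }
            victims.add(p.name);
            freed += p.memMb;
        }
        return victims;
    }

    public static void main(String[] args) {
        List<Proc> procs = List.of(
                new Proc("foreground", 0, 300),
                new Proc("visible", 100, 200),
                new Proc("service", 500, 150),
                new Proc("cached1", 900, 120),
                new Proc("cached2", 950, 100),
                new Proc("empty", 999, 40));
        System.out.println(pickVictims(procs, 130)); // [empty, cached2]
        System.out.println(pickVictims(procs, 400)); // [empty, cached2, cached1, service]
    }
}
```

With mild pressure only the empty and cached processes are chosen; with severe pressure the service process is reached as well, which mirrors the three rows of the diagram.
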
LMK vs OOM Killer:

| Feature | LMK | OOM Killer |
|---|---|---|
| Layer | Android userspace (lmkd; older releases used a kernel driver) | Linux kernel |
| Trigger Timing | When memory falls below a threshold (proactive) | When memory is exhausted (reactive) |
| Process Killing Strategy | Based on oom_score_adj priority | Based on oom_score |
| Purpose | Preventive memory release | Emergency memory release |

LMK is like a butler who plans ahead and starts cleaning up before memory is exhausted; the OOM Killer is the emergency measure of last resort.

2.2 Exception Monitoring Mechanism

The Android system has a comprehensive "health check" mechanism that can detect and handle various exceptions in time.

ANR Monitoring

ANR (Application Not Responding) is one of the most common stability issues users encounter. When an app doesn't respond within the specified time, the system pops up an "Application Not Responding" dialog.

ANR trigger conditions:

| Type | Timeout | Trigger Scenario |
|---|---|---|
| Input event | 5 seconds | Touch/key event not processed |
| Broadcast | 10s foreground/60s background | Broadcast receiver not completed |
| Service | 20s foreground/200s background | Service start not completed |
| ContentProvider | 10 seconds | Provider publish timeout |

Core ANR monitoring logic:

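// Simplified ANR detection logic, loosely modeled on InputDispatcher.cpp.
// Illustrative pseudocode only: it does not compile and the names are not the real AOSP API.
void InputDispatcher::handleAnr() {
    // 1. Send input event to app
    dispatchEventLocked(event);

    // 2. Record the deadline (5 seconds from now)
    deadline = now() + 5000ms;

    // 3. On later dispatcher loop iterations, check whether the app has responded
    if (!receivedResponse && now() >= deadline) {
        // 4. Trigger ANR
        notifyAnr(connection, reason);
    }
}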

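The same timeout idea can be tried out in plain Java, without any Android classes. The sketch below is an analogy under stated assumptions, not how InputDispatcher is implemented: a single-thread executor stands in for the app's main thread, `dispatchAndWatch` and `sleeping` are hypothetical helper names, and the timeouts are scaled down to milliseconds.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class AnrTimeoutSketch {
    /** Returns true if the handler did not finish within timeoutMs (an "ANR"). */
    static boolean dispatchAndWatch(Runnable handler, long timeoutMs) {
        // A single worker thread plays the role of the app's main thread.
        ExecutorService mainThread = Executors.newSingleThreadExecutor();
        try {
            Future<?> done = mainThread.submit(handler);
            done.get(timeoutMs, TimeUnit.MILLISECONDS);
            return false; // handler responded in time
        } catch (TimeoutException e) {
            return true; // deadline passed without a response
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            mainThread.shutdownNow(); // interrupt a handler that is still running
        }
    }

    /** Builds a handler that simply blocks for the given time. */
    static Runnable sleeping(long ms) {
        return () -> {
            try {
                Thread.sleep(ms);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
    }

    public static void main(String[] args) {
        System.out.println("fast handler ANR? " + dispatchAndWatch(sleeping(10), 1000));
        System.out.println("slow handler ANR? " + dispatchAndWatch(sleeping(2000), 200));
    }
}
```

The watcher never inspects what the handler is doing; it only observes that no response arrived before the deadline, which is why an ANR report needs the stack traces to explain the cause.
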
Crash Monitoring

App crashes are divided into Java Crash and Native Crash:

Java Crash Handling Flow:

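// Set global exception handler, keeping the previous one so the
// platform's default crash handling still runs afterwards
final Thread.UncaughtExceptionHandler previous =
        Thread.getDefaultUncaughtExceptionHandler();
Thread.setDefaultUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler() {
    @Override
    public void uncaughtException(Thread t, Throwable e) {
        // 1. Write crash information (including the stack trace) to the log
        Log.e(TAG, "Uncaught exception in thread " + t.getName(), e);

        // 2. Report to server (optional; CrashReporter is a placeholder, not a platform API)
        CrashReporter.report(e);

        // 3. Hand over to the previous handler, or terminate the process ourselves
        if (previous != null) {
            previous.uncaughtException(t, e);
        } else {
            Process.killProcess(Process.myPid());
            System.exit(10);
        }
    }
});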

Native Crash Handling Flow:

Native layer crashes are handled through Linux signal mechanism:

Program triggers SIGSEGV (segmentation fault)
        ↓
Kernel sends signal to process
        ↓
debuggerd signal handler catches signal (on Android 8.0+ it spawns crash_dump)
        ↓
Collect crash information (registers, stack, maps)
        ↓
Generate tombstone file
        ↓
Notify AMS process termination

Watchdog Mechanism

Watchdog is System Server's own "guard dog": it monitors whether the core threads and services are still making progress.

// Core threads monitored by Watchdog
mMonitorChecker = new HandlerChecker(FgThread.getHandler(), "foreground thread");
mMonitorChecker.addMonitor(mAms); // ActivityManagerService
mMonitorChecker.addMonitor(mWms); // WindowManagerService
mMonitorChecker.addMonitor(mPms); // PowerManagerService
// ...

Watchdog working principle:

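  1. Every 30 seconds, post a check message to each monitored thread's Handler
  2. If a thread has not responded after 30 seconds (half the timeout), record it as a half-timeout, early-warning state
  3. If it still has not responded after 60 seconds (the default timeout), kill System Server, which restarts the framework
  4. Write the stacks of all threads to a trace file under /data/anr/ (traces.txt on older releases)

Normal case: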
Watchdog ──send message──→ Core thread ──responds within 30s──→ Watchdog: "All normal"

Abnormal case:
Watchdog ──send message──→ Core thread (stuck) ──no response for 60s──→ Watchdog: "Restart System Server!"

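The two-phase check can be modeled in plain Java. This is a minimal sketch under stated assumptions rather than the AOSP Watchdog: a single-thread executor stands in for a monitored Handler thread, the `check` and `occupy` helpers are hypothetical, the state names are only loosely borrowed from the idea of a half-timeout warning followed by a full timeout, and the timeouts are scaled down to milliseconds.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class WatchdogSketch {
    enum State { COMPLETED, WAITED_HALF, OVERDUE }

    /** Posts a no-op "ping" to the monitored thread and waits in two half-timeout steps. */
    static State check(ExecutorService monitored, long timeoutMs) {
        CountDownLatch pong = new CountDownLatch(1);
        monitored.execute(pong::countDown); // runs only when the thread is free
        try {
            if (pong.await(timeoutMs / 2, TimeUnit.MILLISECONDS)) {
                return State.COMPLETED;
            }
            // Half the timeout is gone: a real watchdog would log a warning here.
            if (pong.await(timeoutMs / 2, TimeUnit.MILLISECONDS)) {
                return State.WAITED_HALF;
            }
            return State.OVERDUE; // full timeout: the caller would restart the process
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return State.OVERDUE;
        }
    }

    /** Keeps the monitored thread busy for the given time, simulating a stuck service. */
    static void occupy(ExecutorService monitored, long ms) {
        monitored.execute(() -> {
            try {
                Thread.sleep(ms);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
    }

    public static void main(String[] args) {
        ExecutorService healthy = Executors.newSingleThreadExecutor();
        System.out.println("healthy thread: " + check(healthy, 400));
        healthy.shutdownNow();

        ExecutorService stuck = Executors.newSingleThreadExecutor();
        occupy(stuck, 5000);
        System.out.println("stuck thread:   " + check(stuck, 400));
        stuck.shutdownNow();
    }
}
```

Because the ping is queued behind whatever the thread is already doing, a thread that is blocked on a lock or a long call never answers, and the checker escalates from warning to restart.
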
2.3 Resource Management Mechanism

Resource leaks are common root causes of stability issues. Android has specialized management mechanisms for several types of critical resources.

Memory Management

Android memory is divided into several regions:

| Memory Type | Description | Typical Issues |
|---|---|---|
| Java Heap | Heap memory managed by Dalvik/ART | Memory leaks causing OOM |
| Native Heap | Memory allocated by C/C++ | Forgetting to free causing leaks |
| Graphics | Graphics memory used by GPU | Large images causing OOM |
| Code | Memory occupied by loaded dex/so | Loading too many libraries |

File Descriptor (FD) Management

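The number of file descriptors each process can open is limited (historically 1024 by default; recent Android releases raise the limit considerably, so check the actual value on your device). FD leaks can cause: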

  • Cannot open new files
  • Cannot create new sockets
  • Cannot establish Binder connections
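
# Check process FD usage (quote the pipeline so it runs on the device;
# reading another process's /proc entries may require root)
adb shell "ls /proc/<pid>/fd | wc -l"

# Check FD limits
adb shell "cat /proc/<pid>/limits | grep 'open files'"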

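A process can also watch its own FD usage from code. The following is a minimal sketch, assuming a Linux-style procfs is available at /proc/self/fd (as it is on Android); the `openFdCount` and `nearLimit` helpers and the 80% threshold are hypothetical choices for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

class FdCountSketch {
    /** Counts the entries in /proc/self/fd; returns -1 where procfs is unavailable. */
    static long openFdCount() {
        Path fdDir = Paths.get("/proc/self/fd");
        if (!Files.isDirectory(fdDir)) {
            return -1;
        }
        try (Stream<Path> fds = Files.list(fdDir)) {
            return fds.count();
        } catch (IOException | RuntimeException e) {
            return -1; // listing failed, treat the count as unknown
        }
    }

    /** True when a known FD count has reached the given fraction of the limit. */
    static boolean nearLimit(long open, long limit, double ratio) {
        return open >= 0 && open >= limit * ratio;
    }

    public static void main(String[] args) {
        long open = openFdCount();
        System.out.println("open FDs: " + open);
        // 1024 is only an example limit; read the real one from /proc/<pid>/limits.
        System.out.println("near limit: " + nearLimit(open, 1024, 0.8));
    }
}
```

Logging this count periodically, or when it crosses a threshold, makes a slow FD leak visible long before the process actually fails to open a file or socket.
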
Binder Resource Management

Binder is Android's most important IPC mechanism, but its resources are also limited:

  • Binder Thread Pool: Up to 16 threads per process by default (15 pool threads plus the main Binder thread)
  • Binder Buffer: About 1MB per process for regular apps, shared by all in-flight transactions
  • Binder Proxy Objects: Limited quantity

When Binder resources are exhausted, inter-process communication fails, manifesting as ANR or functional abnormalities.


III. Stability Issue Classification

After understanding stability mechanisms, let's systematically classify various stability issues.

3.1 Classification by Severity

[Figure: Stability issue classification (01-03-stability-classification)]

Severity Pyramid:

              /\
             /  \             Fatal: Kernel Panic, System Server Crash
            /    \            → System restart, user data may be lost
           /──────\
          /        \          Critical: Native Crash, Watchdog restart
         /          \         → Function unavailable, need to restart app/system
        /────────────\
       /              \       Important: ANR, System service unresponsive
      /                \      → Severe user experience degradation
     /──────────────────\
    /                    \    Moderate: App Crash, performance lag
   /                      \   → Affects single app, user can recover
  /────────────────────────\

3.2 Classification by Manifestation

Unresponsive Type

| Issue | Manifestation | Common Causes |
|---|---|---|
| ANR | "Application Not Responding" dialog pops up | Main thread blocking, deadlock |
| System Unresponsive | System freezes, keys don't respond | System Server deadlock |
| Black Screen | Screen shows nothing | SurfaceFlinger issue |

Crash Type

| Issue | Manifestation | Common Causes |
|---|---|---|
| Java Crash | App crashes | Null pointer, array out of bounds |
| Native Crash | App crashes, has tombstone | Memory access violation |
| Kernel Crash | System directly restarts | Driver bug, hardware issue |

Abnormal Restart Type

| Issue | Manifestation | Common Causes |
|---|---|---|
| Watchdog Restart | Boot animation appears | System service deadlock |
| Kernel Panic | Sudden restart | Kernel exception |
| Hardware Watchdog | Sudden restart | System completely frozen |

3.3 Common Root Cause Analysis

After analyzing numerous issues, I've summarized common root causes of stability problems:

  1. Deadlock Issues (~25%)

    • Multiple threads competing for same resource
    • Binder calls forming loops
    • Main thread waiting for child thread
  2. Resource Exhaustion (~20%)

    • Memory leaks causing OOM
    • FD leaks
    • Binder resource exhaustion
  3. Memory Access Violations (~15%)

    • Null pointer references
    • Wild pointers
    • Array out of bounds
  4. Concurrency Race Conditions (~15%)

    • Multi-threaded read/write conflicts
    • Timing issues
  5. Third-party SDK Issues (~10%)

    • SDK compatibility issues
    • SDK internal bugs
  6. Others (~15%)

    • Configuration errors
    • Hardware compatibility
    • System bugs

IV. Stability Analysis Framework

When you run into a stability issue, how do you locate the root cause efficiently? Here is a practical analysis framework.

4.1 Analysis Workflow

[Figure: Analysis workflow (01-04-analysis-workflow)]

4.2 5W1H Analysis Method

When facing stability issues, I usually apply the 5W1H method to organize my thinking:

| Question | Description | How to Obtain |
|---|---|---|
| What | What happened? | Phenomenon description, error type |
| When | When did it happen? | Timestamp, frequency |
| Where | Where did it happen? | Module, file, code line |
| Who | Who had the problem? | Process, thread, component |
| Why | Why did it happen? | Root cause analysis |
| How | How to reproduce? How to solve? | Reproduction steps, fix solution |

4.3 Toolbox Overview

To do a good job, one must first sharpen one's tools. Here are common tools for analyzing stability issues:

Logging Tools

| Tool | Purpose | Command Example |
|---|---|---|
| logcat | View real-time logs | adb logcat -v threadtime |
| DropBox | View system-saved exceptions | adb shell dumpsys dropbox |
| bugreport | Export complete system logs | adb bugreport |

Analysis Tools

| Tool | Purpose | Applicable Scenarios |
|---|---|---|
| Systrace | System-level performance analysis | ANR, lag issues |
| Perfetto | Next-generation trace tool | Complex performance issues |
| addr2line | Native symbolization | Native Crash |
| MAT | Memory analysis | Memory leaks |

Example: Analyzing an ANR

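# 1. Get ANR traces (on newer Android releases the files under /data/anr/
#    have timestamped names, and pulling them may require root or a bugreport)
adb pull /data/anr/traces.txt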

# 2. Check ANR reason
adb shell dumpsys activity anr

# 3. Analyze stack traces in traces.txt
# Focus on main thread state:
# - If BLOCKED, find what lock it's waiting for
# - If WAITING, find what signal it's waiting for
# - If executing a method, analyze why that method is slow

# 4. Use Systrace for further analysis
python systrace.py -o trace.html sched freq idle am wm gfx view

V. Case Study: A Real ANR Analysis

Let's demonstrate how to apply the above knowledge through a real case.

Problem Phenomenon

User feedback: when switching WiFi in Settings, a "System UI Not Responding" dialog frequently appears.

Analysis Process

Step 1: Collect Logs

adb bugreport
adb pull /data/anr/traces.txt

Step 2: Determine Problem Type

Found ANR record in DropBox:

Subject: Broadcast of Intent { act=android.net.wifi.WIFI_STATE_CHANGED }
ANR in com.android.systemui
PID: 1234
Reason: Broadcast of Intent { act=android.net.wifi.WIFI_STATE_CHANGED }
        waited 10003ms for android.intent.action.BATTERY_CHANGED

This is a Broadcast Timeout ANR.

Step 3: Analyze Stack

Found SystemUI's main thread in traces.txt:

"main" prio=5 tid=1 Blocked
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x74e04dd8 self=0x7a4e014c00
  | sysTid=1234 nice=-2 cgrp=default sched=0/0 handle=0x7b5a9f49a8
  | state=S schedstat=( 123456789 987654321 12345 ) utm=100 stm=50 core=2 HZ=100
  at com.android.systemui.BatteryController.update(BatteryController.java:150)
  - waiting to lock <0x0a1b2c3d> (a java.lang.Object) held by thread 15
  at com.android.systemui.BatteryController.onReceive(BatteryController.java:100)
  ...

The main thread is waiting for a lock held by thread 15.
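
When a trace file contains hundreds of threads, this lookup is worth automating. Below is a minimal sketch in plain Java that extracts the lock holder from a "waiting to lock" line. The `lockHolder` helper and its regular expression are hypothetical and only assume the line format shown in the excerpt above; real trace files may vary between Android versions.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TraceLockSketch {
    // Matches e.g. "- waiting to lock <0x0a1b2c3d> (a java.lang.Object) held by thread 15"
    private static final Pattern WAITING = Pattern.compile(
            "- waiting to lock <(0x[0-9a-fA-F]+)> \\(a ([^)]+)\\) held by thread (\\d+)");

    /** Returns the tid of the thread holding the lock this dump is blocked on, if any. */
    static Optional<Integer> lockHolder(String threadDump) {
        Matcher m = WAITING.matcher(threadDump);
        return m.find() ? Optional.of(Integer.parseInt(m.group(3))) : Optional.empty();
    }

    public static void main(String[] args) {
        String dump = String.join("\n",
                "\"main\" prio=5 tid=1 Blocked",
                "  at com.android.systemui.BatteryController.update(BatteryController.java:150)",
                "  - waiting to lock <0x0a1b2c3d> (a java.lang.Object) held by thread 15",
                "  at com.android.systemui.BatteryController.onReceive(BatteryController.java:100)");
        System.out.println(lockHolder(dump)); // Optional[15]
    }
}
```

The returned tid is the next thread to look up in the same file, which is how a wait chain is followed one hop at a time.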

Step 4: Find Lock-Holding Thread

"BatteryStats" prio=5 tid=15 Native
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x13579bdf self=0x7a4e028800
  at android.os.BinderProxy.transactNative(Native Method)
  at android.os.BinderProxy.transact(Binder.java:1129)
  at com.android.internal.os.IBatteryStats$Stub$Proxy.noteWifiOn(IBatteryStats.java:2046)
  ...

Thread 15 is making a Binder call and is waiting for BatteryStatsService to respond.

Step 5: Root Cause Analysis

Further analysis showed that BatteryStatsService was performing a time-consuming statistical operation, so the Binder call from thread 15 stayed blocked while it held the lock. This formed the following wait chain:

Main thread waiting for lock → Thread 15 holds lock but waiting for Binder → BatteryStatsService busy

Step 6: Solution

  1. Move BatteryController's update operation to a child thread
  2. Optimize BatteryStatsService's statistical logic
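
The first fix follows a general pattern: run the slow call on a worker thread with no lock held, and take the lock only briefly to publish the result. Here is a minimal sketch of that pattern in plain Java. It is not the actual SystemUI change; the `OffMainThreadSketch` class, its `level` field, and the `refreshAsync` method are hypothetical stand-ins, and an `IntSupplier` plays the role of the slow Binder call.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.IntSupplier;

class OffMainThreadSketch {
    private final Object lock = new Object();
    private int level; // guarded by lock

    /** Runs the slow query on the worker, then publishes the result under a short-held lock. */
    CompletableFuture<Void> refreshAsync(IntSupplier slowQuery, ExecutorService worker) {
        return CompletableFuture
                .supplyAsync(slowQuery::getAsInt, worker) // slow call, no lock held
                .thenAccept(value -> {
                    synchronized (lock) {
                        level = value; // lock held only for the assignment
                    }
                });
    }

    /** Cheap read that never waits on the slow query. */
    int level() {
        synchronized (lock) {
            return level;
        }
    }

    public static void main(String[] args) {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        OffMainThreadSketch controller = new OffMainThreadSketch();
        CompletableFuture<Void> pending = controller.refreshAsync(() -> {
            try {
                Thread.sleep(200); // stands in for a slow remote call
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return 80;
        }, worker);
        System.out.println("level while query runs: " + controller.level());
        pending.join();
        System.out.println("level after query: " + controller.level());
        worker.shutdown();
    }
}
```

The caller's thread returns immediately from both `refreshAsync` and `level()`, so a slow remote service can no longer stall it through a shared lock.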

VI. Preview of Upcoming Articles

As the opening article of the series, this piece lays out a basic framework for thinking about system stability. Subsequent articles will dive deeper into each topic:

| Article | Topic | What You'll Learn |
|---|---|---|
| Article 2 | Deep Analysis of ANR Mechanism | Complete flow of ANR trigger, detection, and reporting |
| Article 3 | ANR Troubleshooting Practice | traces.txt analysis techniques, common ANR cases |
| Article 4 | Native Crash Analysis | Tombstone interpretation, addr2line usage |
| Article 5 | Watchdog Mechanism | How system watchdog works |

Stay tuned!


Summary

Starting from Android system architecture, this article introduced core concepts of system stability:

  1. Layered Architecture: Understanding Android's five-layer architecture and stability impact of each layer
  2. Key Processes: Zygote, System Server, SurfaceFlinger are the system's "lifelines"
  3. Core Mechanisms: Process priority, LMK, ANR detection, Watchdog form the stability guarantee system
  4. Issue Classification: Classifying stability issues by severity and manifestation
  5. Analysis Framework: 5W1H analysis method and toolbox help efficiently locate problems

System stability cannot be mastered overnight—it requires continuous accumulation in practice. I hope this article helps you establish the correct cognitive framework, so you won't be lost when facing stability issues.


Author Bio: Years of Android system development experience, specializing in system stability and performance optimization. Welcome to follow this series and explore the wonderful world of Android systems together!


🎉 Thanks for Following, Let's Explore the Wonderful World of Android Systems Together!
