愿历尽千帆 归来仍少年

一次SystemServer OOM导致的系统重启分析之路

字数统计: 1,282阅读时长: 6 min
2020/11/25

测试步骤
MTBF测试跑出来的机器重启,先解释下何为MTBF?
MTBF测试(Mean Time Between Failure)
主要测试终端在连续工作状态下,出现死机、重启、冻屏、应用强制关闭等严重故障的概率。通过在一个周期内以自动化方式反复执行规定用例,记录测试过程中被测终端出现的故障数

复现概率
1/200

问题现象
机器出现重启,并自动给出源头是OOM导致的结论

初步分析

正常步骤先看bugreport确定下是否是OOM导致的重启,搜索关键字”system_server_crash”或”am_crash”

1
2
3
4
5
6
7
8
9
2020-12-07 10:56:18 system_server_crash (text, 346 bytes)
Process: system_server
java.lang.OutOfMemoryError: Heap of System_Server has allocated 666MB , So trigger OutOfMemoryError crash
at com.android.server.WatchdogInjector.checkOOMState(WatchdogInjector.java:98)
at com.android.server.Watchdog.run(Watchdog.java:776)

12-07 10:56:18.313 1000 1614 1703 I am_crash: [1614,0,system_server,-1,java.lang.OutOfMemoryError,Heap of System_Server has allocated 666MB , So trigger OutOfMemoryError crash,WatchdogInjector.java,98]

12-07 10:56:18.312 1000 1614 1703 D OOMEventManagerFK: dumpheap success : /data/anr/system_server_1614.hprof

上面片段中可以看出当时SystemServer确实发生了OOM,日志中没有打出具体堆栈,这个时候去看下system_server_1614.hprof,这个应该是厂商定制化输出的,比如监测到SystemServer占据内存超过一定的阀值,就dump出这个hprof文件以便于分析

这个文件打开后,最大的一块是有五百多个android.app.assist.AssistStructure实例

接下来顺理成章的就看下android.app.assist.AssistStructure单个实例中的引用链关系了
直接点击Histogram搜索android.app.assist.AssistStructure即可

选择outgoing reference后如下,可以看到有529个

随便点击一个右键同样的选outgoing

每个AssistStructure实例中都包含WifiConfigActivity,WifiConfigActivity是一个什么样的界面呢?
是一个对话框样式的Activity,包含了可以填充数据的编辑框。
其实我们到这里已经大致明白了,罪魁祸首基本可以认定是WifiConfigActivity了

从bugreport中寻找关于WifiConfigActivity的线索,查找system_server OOM之前的片段

1
12-07 10:56:04.383  1000  1614  2419 I WindowManager: Input event dispatching timed out sending to com.android.settings/com.android.settings.wifi.WifiConfigActivity.  Reason: e4a42fb com.android.settings/com.android.settings.wifi.WifiConfigActivity (server) is not responding. Waited 8001ms for FocusEvent(hasFocus=true)

可以看到当时的WifiConfigActivity界面是卡住了,无法响应事件,可能当时系统在不停的GC,再看下WifiConfigActivity的调起记录

1
2
3
4
5
6
7
12-07 10:55:09.375  1000  1614 12880 I ActivityTaskManager: START u0 {flg=0x10000000 cmp=com.android.settings/.wifi.WifiConfigActivity (has extras)} from uid 1000
12-07 10:55:18.151 1000 1614 4490 I ActivityTaskManager: START u0 {flg=0x10000000 cmp=com.android.settings/.wifi.WifiConfigActivity (has extras)} from uid 1000
12-07 10:55:21.185 1000 1614 12880 I ActivityTaskManager: START u0 {flg=0x10000000 cmp=com.android.settings/.wifi.WifiConfigActivity (has extras)} from uid 1000
12-07 10:55:32.439 1000 1614 5665 I ActivityTaskManager: START u0 {flg=0x10000000 cmp=com.android.settings/.wifi.WifiConfigActivity (has extras)} from uid 1000
12-07 10:55:42.120 1000 1614 8959 I ActivityTaskManager: START u0 {flg=0x10000000 cmp=com.android.settings/.wifi.WifiConfigActivity (has extras)} from uid 1000
12-07 10:55:50.421 1000 1614 4063 I ActivityTaskManager: START u0 {flg=0x10000000 cmp=com.android.settings/.wifi.WifiConfigActivity (has extras)} from uid 1000
12-07 10:56:03.812 1000 1614 14712 I ActivityTaskManager: START u0 {flg=0x10000000 cmp=com.android.settings/.wifi.WifiConfigActivity (has extras)} from uid 1000

WifiConfigActivity什么时候会被调起呢,是在收到一个广播后,鉴于其启动模式是singleInstance
查阅代码发现这个文件有重写onNewIntent,所以再多次调用的情况下,onNewIntent会被多次触发
该函数里有一处代码

1
new AccessPoint(this, wifiConfiguration);

这里直接传入了this且该类是singleInstance,所以这个对象是无法被GC的,会导致内存持续增加,其实日志中也有Settings OOM的片段打出来

修改方案

修改方案如下:
传入this的地方改为弱引用,这样在该界面退出时activity可以正常被回收掉,另外新增一个文件继承自DialogInterface.OnDismissListener

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
public class DetachWifiDialogListener implements DialogInterface.OnDismissListener{
private final static String TAG = DetachWifiDialogListener.class.getSimpleName();
private Activity mActivity;
public DetachWifiDialogListener(Activity activity) {
mActivity = activity;
}
@Override
public void onDismiss(DialogInterface dialog) {
Log.d(TAG,"Dialog onDismiss");
}
public void clearOnDetach(Dialog dialog) {
if (dialog.getWindow() == null) {
return;
}
dialog.getWindow()
.getDecorView()
.getViewTreeObserver()
.addOnWindowAttachListener(new ViewTreeObserver.OnWindowAttachListener() {
@Override
public void onWindowAttached() {
Log.d(TAG, "dialog Attached to Window");
}
@Override
public void onWindowDetached() {
Log.d(TAG,"dialog Detached to Window");
if(mActivity != null) {
mActivity.finish();
mActivity = null;
}
}
});
}
}

在弹出Dialog后执行clearOnDetach,及时的将Activity finish掉,并且在onPause中(这里因为此处场景特殊,一般在onDestory中)将Listener置空
目的是剪断这个引用链

1
2
3
4
dismissListener = new DetachWifiDialogListener(mWifiConfigActivity.get());
//....
mDialog.show();
dismissListener.clearOnDetach(mDialog);

本地测试下来,dumpsys meminfo出的Activity实例个数在Dialog消失后会减少,说明不存在内存泄露了

常见示例

再例举一个最常见的非静态内部类Handler泄漏例子,延迟发送一个消息,此时在消息处理之前退出该界面,就会存在该Activity泄漏的情形

此时在onDestory内执行removeCallbacksAndMessages,这样的话每次退出界面后就会清空该handler上所有的callback和消息,这样就不会存在MessageQueue->Handler->Activity这个引用链

CATALOG
  1. 1. 初步分析
  2. 2. 修改方案
  3. 3. 常见示例