愿历尽千帆 归来仍少年

GC超时导致的后台应用崩溃问题分析

字数统计: 2,707阅读时长: 13 min
2021/09/08

写在前面

这个问题之所以会拿出来仔细分析,一方面是因为这个问题不是简单的应用崩溃而是框架层的报错,另一方面是因为希望通过这个问题梳理下后台GC的超时检测机制怎样的,这样我们后面在应用层如果重写finalize方法回收时会考虑的更加全面点。

问题背景

复现概率: 偶现
问题版本: Android R
问题现象: 处于微信界面,突然弹出王者荣耀停止运行

初步分析

拿到问题日志后,先看下报错的堆栈

1
2
3
4
5
6
7
09-02 20:53:26.679  2073  2089 E AndroidRuntime: FATAL EXCEPTION: FinalizerWatchdogDaemon
09-02 20:53:26.679 2073 2089 E AndroidRuntime: Process: com.tencent.tmgp.sgame:xg_vip_service, PID: 2073
09-02 20:53:26.679 2073 2089 E AndroidRuntime: java.util.concurrent.TimeoutException: android.database.BulkCursorToCursorAdaptor.finalize() timed out after 10 seconds
//省略部分堆栈
09-02 20:53:26.679 2073 2089 E AndroidRuntime:at android.database.AbstractCursor.finalize(AbstractCursor.java:524)
09-02 20:53:26.679 2073 2089 E AndroidRuntime:at java.lang.Daemons$FinalizerDaemon.doFinalize(Daemons.java:291)
09-02 20:53:26.679 2073 2089 E AndroidRuntime:at java.lang.Daemons$FinalizerDaemon.runInternal(Daemons.java:278)

单单从这段堆栈看的话,BulkCursorToCursorAdaptor执行finalize超过了10s,导致FinalizerWatchdogDaemon报错,FinalizerWatchdogDaemon字面上看像是监测回收超时的守护线程。
看下FinalizerWatchdogDaemon代码中的作用解释

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
/**
* The watchdog exits the VM if the finalizer ever gets stuck. We consider
* the finalizer to be stuck if it spends more than MAX_FINALIZATION_MILLIS
* on one instance.
*/
private static class FinalizerWatchdogDaemon extends Daemon {
@UnsupportedAppUsage
private static final FinalizerWatchdogDaemon INSTANCE = new FinalizerWatchdogDaemon();

private boolean needToWork = true; // Only accessed in synchronized methods.

private long finalizerTimeoutNs = 0; // Lazily initialized.

FinalizerWatchdogDaemon() {
super("FinalizerWatchdogDaemon");
}

简单解释下就是:如果对象的finalize出现阻塞超时了会导致进程退出

这个问题中对应的是数据库的关闭,当然也可以发生在其它场景下,只要重写了成员函数finalize的对象都有可能会遇到这个问题,所以如果再遇到GC超时的报错,报错堆栈AndroidRuntime:at java.lang.Daemons$上面的内容可能会不一样。
那么对于重写了成员函数finalize的对象,当它们被GC决定要被回收时,会立刻回收吗?
其实不会马上被回收,而是被放入到一个队列中,等待FinalizerDaemon守护线程去调用它们的成员函数finalize后再被回收。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
/**
* This heap management thread moves elements from the garbage collector's
* pending list to the managed reference queue.
*/
private static class ReferenceQueueDaemon extends Daemon {
@UnsupportedAppUsage
private static final ReferenceQueueDaemon INSTANCE = new ReferenceQueueDaemon();

ReferenceQueueDaemon() {
super("ReferenceQueueDaemon");
}

@Override public void runInternal() {
while (isRunning()) {
Reference<?> list;
try {
synchronized (ReferenceQueue.class) {
while (ReferenceQueue.unenqueued == null) {
ReferenceQueue.class.wait();
}
list = ReferenceQueue.unenqueued;
ReferenceQueue.unenqueued = null;
}
} catch (InterruptedException e) {
continue;
} catch (OutOfMemoryError e) {
continue;
}
ReferenceQueue.enqueuePending(list);
}
}
}

超时阈值

1
2
3
4
5
6
7
8
// This used to be final. IT IS NOW ONLY WRITTEN. We now update it when we look at the command
// line argument, for the benefit of mis-behaved apps that might read it. SLATED FOR REMOVAL.
// There is no reason to use this: Finalizers should not rely on the value. If a finalizer takes
// appreciable time, the work should be done elsewhere. Based on disassembly of Daemons.class,
// the value is effectively inlined, so changing the field never did have an effect.
// DO NOT USE. FOR ANYTHING. THIS WILL BE REMOVED SHORTLY.
@UnsupportedAppUsage
private static long MAX_FINALIZE_NANOS = 10L * 1000 * NANOS_PER_MILLI;

注释中对于该值的说明是它很快将被移除,实际这个值在代码中并没有起到真正的作用了,更新它的值是为了方便在外边读取到。
真正的超时阈值是通过VMRuntime.getFinalizerTimeoutMs获取,默认值是10s.

1
2
finalizer_timeout_ms_ = runtime_options.GetOrDefault(Opt::FinalizerTimeoutMs);
RUNTIME_OPTIONS_KEY (unsigned int, FinalizerTimeoutMs, 10000u)

超时检测

通过watchdog机制检测finalizer在超时时间内有没有成功析构回收对象

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
 * The watchdog exits the VM if the finalizer ever gets stuck. We consider
* the finalizer to be stuck if it spends more than MAX_FINALIZATION_MILLIS
* on one instance.
*/
private static class FinalizerWatchdogDaemon extends Daemon {
@UnsupportedAppUsage
private static final FinalizerWatchdogDaemon INSTANCE = new FinalizerWatchdogDaemon();

private boolean needToWork = true; // Only accessed in synchronized methods.

private long finalizerTimeoutNs = 0; // Lazily initialized.

FinalizerWatchdogDaemon() {
super("FinalizerWatchdogDaemon");
}

@Override public void runInternal() {
while (isRunning()) {
if (!sleepUntilNeeded()) { (1)
// We have been interrupted, need to see if this daemon has been stopped.
continue;
}
final Object finalizing = waitForFinalization();(2)
if (finalizing != null && !VMDebug.isDebuggerConnected()) {
finalizerTimedOut(finalizing);(3)
break;
}
}
}

Step1 GC前的检查

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
/**
* Notify daemon that it's OK to sleep until notified that something is ready to be
* finalized.
*/
private synchronized void goToSleep() {
needToWork = false;
}

/**
* Notify daemon that there is something ready to be finalized.
*/
private synchronized void wakeUp() {
needToWork = true;
notify();
}

开启回收之前,needToWork会被置为true,此时sleepUntilNeeded返回的是true,所以线程不会wait

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
@Override public void runInternal() {
// This loop may be performance critical, since we need to keep up with mutator
// generation of finalizable objects.
// We minimize the amount of work we do per finalizable object. For example, we avoid
// reading the current time here, since that involves a kernel call per object. We
// limit fast path communication with FinalizerWatchDogDaemon to what's unavoidable: A
// non-volatile store to communicate the current finalizable object, e.g. for
// reporting, and a release store (lazySet) to a counter.
// We do stop the FinalizerWatchDogDaemon if we have nothing to do for a
// potentially extended period. This prevents the device from waking up regularly
// during idle times.

// Local copy of progressCounter; saves a fence per increment on ARM and MIPS.
int localProgressCounter = progressCounter.get();

while (isRunning()) {
try {
// Use non-blocking poll to avoid FinalizerWatchdogDaemon communication
// when busy.
FinalizerReference<?> finalizingReference = (FinalizerReference<?>)queue.poll();
if (finalizingReference != null) {
finalizingObject = finalizingReference.get();
progressCounter.lazySet(++localProgressCounter);
} else {
finalizingObject = null;
progressCounter.lazySet(++localProgressCounter);
// Slow path; block.
FinalizerWatchdogDaemon.INSTANCE.goToSleep();
finalizingReference = (FinalizerReference<?>)queue.remove();
finalizingObject = finalizingReference.get();
progressCounter.set(++localProgressCounter);
//回收之前先唤醒看门狗线程
FinalizerWatchdogDaemon.INSTANCE.wakeUp();
}
//开始回收的流程
doFinalize(finalizingReference);
} catch (InterruptedException ignored) {
} catch (OutOfMemoryError ignored) {
}
}
}

如果此时线程处于wait,被中断了或者有OOME发生时,这个时候回到开头判断下isRunning(),也就是看下回收对象这个线程是否为空,如果该线程为空的话,这个循环体就没有必要再继续执行下去了。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
/**
* Wait until something is ready to be finalized.
* Return false if we have been interrupted
* See also http://code.google.com/p/android/issues/detail?id=22778.
*/
private synchronized boolean sleepUntilNeeded() {
while (!needToWork) {
try {
wait();
} catch (InterruptedException e) {
// Daemon.stop may have interrupted us.
return false;
} catch (OutOfMemoryError e) {
return false;
}
}
return true;
}

Step2 等待GC完成

这一步是等待回收结束的过程,这个睡眠过程中如果被中断,说明在这个周期内完成了析构,直接返回null

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
/**
* Return an object that took too long to finalize or return null.
* Wait VMRuntime.getFinalizerTimeoutMs. If the FinalizerDaemon took essentially the
* whole time processing a single reference, return that reference. Otherwise return
* null. Only called from a single thread.
*/
private Object waitForFinalization() {
if (finalizerTimeoutNs == 0) {
finalizerTimeoutNs =
NANOS_PER_MILLI * VMRuntime.getRuntime().getFinalizerTimeoutMs();
// Temporary app backward compatibility. Remove eventually.
MAX_FINALIZE_NANOS = finalizerTimeoutNs;
}
long startCount = FinalizerDaemon.INSTANCE.progressCounter.get();
// Avoid remembering object being finalized, so as not to keep it alive.
//如果回收对象没有超时的话,这里会返回null
if (!sleepForNanos(finalizerTimeoutNs)) {
// Don't report possibly spurious timeout if we are interrupted.
return null;
}
if (getNeedToWork() && FinalizerDaemon.INSTANCE.progressCounter.get() == startCount) {
// We assume that only remove() and doFinalize() may take time comparable to
// the finalizer timeout.
// We observed neither the effect of the gotoSleep() nor the increment preceding a
// later wakeUp. Any remove() call by the FinalizerDaemon during our sleep
// interval must have been followed by a wakeUp call before we checked needToWork.
// But then we would have seen the counter increment. Thus there cannot have
// been such a remove() call.
// The FinalizerDaemon must not have progressed (from either the beginning or the
// last progressCounter increment) to either the next increment or gotoSleep()
// call. Thus we must have taken essentially the whole finalizerTimeoutMs in a
// single doFinalize() call. Thus it's OK to time out. finalizingObject was set
// just before the counter increment, which preceded the doFinalize call. Thus we
// are guaranteed to get the correct finalizing value below, unless doFinalize()
// just finished as we were timing out, in which case we may get null or a later
// one. In this last case, we are very likely to discard it below.
Object finalizing = FinalizerDaemon.INSTANCE.finalizingObject;
sleepForNanos(500 * NANOS_PER_MILLI);
// Recheck to make it even less likely we report the wrong finalizing object in
// the case which a very slow finalization just finished as we were timing out.
if (getNeedToWork()
&& FinalizerDaemon.INSTANCE.progressCounter.get() == startCount) {
return finalizing;
}
}
return null;
}

sleepForNanos对应的函数很简单,如果在超时时间内完成GC,就会计算传进来的超时阈值减去当前已经睡眠的时间,如果这个差值小于0,说明睡眠的时间超过了阈值

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
/**
* Sleep for the given number of nanoseconds, or slightly longer.
* @return false if we were interrupted.
*/
private boolean sleepForNanos(long durationNanos) {
// It's important to base this on nanoTime(), not currentTimeMillis(), since
// the former stops counting when the processor isn't running.
long startNanos = System.nanoTime();
while (true) {
long elapsedNanos = System.nanoTime() - startNanos;
long sleepNanos = durationNanos - elapsedNanos;
if (sleepNanos <= 0) {
return true;
}
// Ensure the nano time is always rounded up to the next whole millisecond,
// ensuring the delay is >= the requested delay.
long sleepMillis = (sleepNanos + NANOS_PER_MILLI - 1) / NANOS_PER_MILLI;
try {
Thread.sleep(sleepMillis);
} catch (InterruptedException e) {
if (!isRunning()) {
return false;
}
} catch (OutOfMemoryError ignored) {
if (!isRunning()) {
return false;
}
}
}
}

Step3 GC处理超时

如果第二步中的超时时间内析构没有完成,则返回析构的对象,触发finalizerTimedOut
到了这一步是最不希望看到的结局,此时系统会弹出应用停止运行的报错框。

注意这个时候并没有立刻杀死进程,杀死进程的选择权交给了用户,即通过弹窗展示给用户,但对于用户来说会一头雾水

分析结论

这种问题其实还是比较常见的,特别是低内存的机器上。RootCasue就是对象回收超时了,一般是由于队列中等待FinalizerDaemon线程回收的对象太多导致,或者此时系统资源异常紧张比如CPU负载过高或者低内存环境下。

场景实测

模拟还原现场

通过模拟GC时耗时操作,应用退到后台后10s会弹出报错框,堆栈如下

验证了超时时间的确是10s,同时也验证了GC时耗时的操作确实会可能触发这个现象

对比机情况

在手头的小米note9 pro上进行场景模拟测试,模拟GC耗时100s的情况

在小米的机器上,到了默认的10s后并不会有弹窗,说明小米肯定修改了超时时间,第一次是等待了全部的100s后竟然正常回收,说明超时时间设置的比较大。紧接着下一次在达到了近80s时,进程收到signal 9直接被kill了,此时再点击应用是冷启动。

小米修改了超时阈值(超过100s),通过直接sig 9杀掉了进程,没有报错弹窗,所以用户无感知

测试机情况

同样的在我们的机器上模拟GC耗时100s的情况
退出应用到后台,此时系统触发GC回收,达到十秒钟时,界面上直接弹出停止运行的报错框,此时只有点击了关闭应用,才会去kill进程

修改策略

在GC规定的超时时间内如果没有完成析构,直接sig 9给对应进程

CATALOG
  1. 1. 写在前面
  2. 2. 问题背景
  3. 3. 初步分析
  4. 4. 超时阈值
  5. 5. 超时检测
    1. 5.1. Step1 GC前的检查
    2. 5.2. Step2 等待GC完成
    3. 5.3. Step3 GC处理超时
  6. 6. 分析结论
  7. 7. 场景实测
    1. 7.1. 模拟还原现场
    2. 7.2. 对比机情况
    3. 7.3. 测试机情况
    4. 7.4. 修改策略