首页 > 代码库 > Android ANR分析(2)
Android ANR分析(2)
转自:http://blog.csdn.net/ruingman/article/details/53118202
定义
主线程在特定的时间内没有做完特定的事情
常见的场景
A.input事件超过5S没有处理完成 B.service executing 超时(bind,create,start,unbind等等),前台20s,后台200s C.广播处理超时,前台10S,后台60s D.ContentProvider执行超时,20s
常见的原因
A.耗时操作,如复杂的layout,庞大的for循环,IO等。 B.被Binder 对端block C.被子线程同步锁block D.Binder被占满导致主线程无法和SystemServer通信 E.得不到系统资源(CPU/RAM/IO) 其中ABCD比较好分析,而E比较困难。
如何分析?
指导思想:通过各种线索,寻求主线程在过去一段时间内发生block的可能原因。
线索包括: A.traces.txt/dropbox AMS在ANR发生的时候,dump相关进程(ANR的进程、systemserver、mediaserver,surfaceFinger等)的当前栈到traces.txt。 traces.txt存在的几种形式: 1. adb pull /data/anr/ 2. slog/2012xxxxx/misc/anr/snapshot_xxxx.log 3. slog/dropbox/data_app_anr_xxxxx.tgz 4. slog/dropbox/system_app_anr_xxxx.tgz 需要注意的是,traces.txt是抓取的是超时后(如input超时就是5s后)的snapshot,并不一定能够真实的反应出block的点。 也存在抓到主线程没有block,在idle的情况。 B.Eventlog中的dvm_lock_sample. 在同步锁发生content的时候,虚拟机会将这个竞争记录在eventlog中: dvm_lock_sample: [system_server,1,Binder_7, 22, ActivityManagerService.java,15921,-,9628,4] 进程名 被block线程 block时间(ms) 被block的行号 持有者行号 实现可以参考art/runtime/monitor_android.cc LogContentionEvent函数 如果主线程是被binder对端、被同步锁block,那么eventlog中就很有可能会有dvm_lock_sample的打印。 华为项目上实现了更加强大的功能BlockMonitor,在主线程有耗时操作(如handlemsg、Binder调用耗时)的时候会打印出栈。 C.搜索主线程在发生ANR前后的main,systemlog,结合代码查看可能block在哪里。 现在发生ANR的时候,sprdruntimeinfo在mainlog中会Dump出比较多的信息,其中比较重要的是binder、cpu。 D.ANR发生的时候,打印出的CPU的占用。 注意,并不能因为有进程cpu占用高就果断的去怀疑他。 因为最主要的线索traces.txt的有效性并不是非常高,所以ANR问题分析是存在一定的局限性的。
案例分析
1. Settings 被对端block
首先需要去看的就是traces.txt 中ANR进程的主线程的栈。 以settting的这次ANR为例: "main" prio=5 tid=1 Native | group="main" sCount=1 dsCount=0 obj=0x73ee6470 self=0xb4d76500 | sysTid=22831 nice=0 cgrp=default sched=0/0 handle=0xb6f4bc00 | state=S schedstat=( 0 0 0 ) utm=22 stm=22 core=0 HZ=100 | stack=0xbe283000-0xbe285000 stackSize=8MB | held mutexes= native: #00 pc 000410ac /system/lib/libc.so (__ioctl+8) native: #01 pc 000477e5 /system/lib/libc.so (ioctl+14) native: #02 pc 0001e7c5 /system/lib/libbinder.so (android::IPCThreadState::talkWithDriver(bool)+132) native: #03 pc 0001ee17 /system/lib/libbinder.so (android::IPCThreadState::waitForResponse(android::Parcel*, int*)+38) native: #04 pc 0001efcd /system/lib/libbinder.so (android::IPCThreadState::transact(int, unsigned int, android::Parcel const&, android::Parcel*, unsigned int)+124) native: #05 pc 00019fb7 /system/lib/libbinder.so (android::BpBinder::transact(unsigned int, android::Parcel const&, android::Parcel*, unsigned int)+30) native: #06 pc 00086de9 /system/lib/libandroid_runtime.so (???) native: #07 pc 00d94629 /data/dalvik-cache/arm/system@framework@boot.oat (Java_android_os_BinderProxy_transactNative__ILandroid_os_Parcel_2Landroid_os_Parcel_2I+140) at android.os.BinderProxy.transactNative(Native method) at android.os.BinderProxy.transact(Binder.java:503) at android.net.INetworkPolicyManager$Stub$Proxy.getNetworkPolicies(INetworkPolicyManager.java:410) at android.net.NetworkPolicyManager.getNetworkPolicies(NetworkPolicyManager.java:174) at com.android.settings.net.NetworkPolicyEditor.read(NetworkPolicyEditor.java:57) at com.android.settings.DataUsageSummary.onCreate(DataUsageSummary.java:361) at android.app.Fragment.performCreate(Fragment.java:2202) at android.app.FragmentManagerImpl.moveToState(FragmentManager.java:942) at android.app.FragmentManagerImpl.moveToState(FragmentManager.java:1148) at android.app.BackStackRecord.run(BackStackRecord.java:793) at android.app.FragmentManagerImpl.execPendingActions(FragmentManager.java:1535) at android.app.FragmentManagerImpl.executePendingTransactions(FragmentManager.java:562) at com.android.settings.SettingsActivity.switchToFragment(SettingsActivity.java:1084) at com.android.settings.SettingsActivity.onCreate(SettingsActivity.java:657) at android.app.Activity.performCreate(Activity.java:6251) at android.app.Instrumentation.callActivityOnCreate(Instrumentation.java:1110) at android.app.ActivityThread.performLaunchActivity(ActivityThread.java:2370) at android.app.ActivityThread.handleLaunchActivity(ActivityThread.java:2477) at android.app.ActivityThread.-wrap11(ActivityThread.java:-1) at android.app.ActivityThread$H.handleMessage(ActivityThread.java:1345) at android.os.Handler.dispatchMessage(Handler.java:102) at android.os.Looper.loop(Looper.java:148) at android.app.ActivityThread.main(ActivityThread.java:5438) at java.lang.reflect.Method.invoke!(Native method) at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:762) at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:652) 主线程被对端block,而对端是在systemserver中的NetworkPolicyManager。 接下来继续去查NetworkPolicyManager为何会被block。 通常情况下,可以搜索getNetworkPolicies,一般的对端的函数并不会修改函数名: 看到我们的对端: "Binder_4" prio=5 tid=56 Blocked | group="main" sCount=1 dsCount=0 obj=0x1321a0a0 self=0xad31e200 | sysTid=2491 nice=0 cgrp=default sched=0/0 handle=0x9fd00930 | state=S schedstat=( 0 0 0 ) utm=46441 stm=46582 core=1 HZ=100 | stack=0x9fc04000-0x9fc06000 stackSize=1014KB | held mutexes= at com.android.server.net.NetworkPolicyManagerService.getNetworkPolicies(NetworkPolicyManagerService.java:1696) - waiting to lock <0x07439315> (a java.lang.Object) held by thread 35 at android.net.INetworkPolicyManager$Stub.onTransact(INetworkPolicyManager.java:145) at android.os.Binder.execTransact(Binder.java:453 被Tid=35的人拿住了一把锁(0x07439315),那么继续看tid=35是谁,有两种方法: 1. 搜索tid=35 2. 搜索0x07439315,找到 - locked <0x07439315> (a java.lang.Object) "NetworkPolicy" prio=5 tid=35 TimedWaiting | group="main" sCount=1 dsCount=0 obj=0x12d98940 self=0x9f91f700 | sysTid=2415 nice=0 cgrp=default sched=0/0 handle=0xa0f33930 | state=S schedstat=( 0 0 0 ) utm=7681 stm=7783 core=0 HZ=100 | stack=0xa0e31000-0xa0e33000 stackSize=1038KB | held mutexes= at java.lang.Object.wait!(Native method) - waiting on <0x02580c1b> (a java.lang.Object) at java.lang.Thread.parkFor$(Thread.java:1220) - locked <0x02580c1b> (a java.lang.Object) at sun.misc.Unsafe.park(Unsafe.java:299) at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2053) at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:372) at com.android.server.NativeDaemonConnector$ResponseQueue.remove(NativeDaemonConnector.java:634) at com.android.server.NativeDaemonConnector.executeForList(NativeDaemonConnector.java:426) at com.android.server.NativeDaemonConnector.execute(NativeDaemonConnector.java:345) at com.android.server.NativeDaemonConnector.execute(NativeDaemonConnector.java:340) at com.android.server.NetworkManagementService.setInterfaceQuota(NetworkManagementService.java:1712) - locked <0x0b0f91b8> (a java.lang.Object) at com.android.server.net.NetworkPolicyManagerService.setInterfaceQuota(NetworkPolicyManagerService.java:2421) at com.android.server.net.NetworkPolicyManagerService.updateNetworkRulesLocked(NetworkPolicyManagerService.java:1232) at com.android.server.net.NetworkPolicyManagerService$14.onReceive(NetworkPolicyManagerService.java:1060) - locked <0x07439315> (a java.lang.Object) at android.app.LoadedApk$ReceiverDispatcher$Args.run(LoadedApk.java:881) at android.os.Handler.handleCallback(Handler.java:739) at android.os.Handler.dispatchMessage(Handler.java:95) at android.os.Looper.loop(Looper.java:148) at android.os.HandlerThread.run(HandlerThread.java:61) 可以看到,NetworkPolicy在通过NativeDaemonConnector和netd通信(setInterfaceQuota) 我们结合log来看下是否有有用信息,按照之前的经验,netd在执行完cmd的时候,会打印出slow operation。 在发生ANR的前后,查找netd相关的打印: 06-19 15:29:00.997 1235 1270 I am_anr : [0,22831,com.android.settings,818429509,Input dispatching timed out (Waiting because no window has focus but there is a focused application that may eventually add a window when it finishes starting up.)] 06-19 15:29:05.683 1235 2415 E NetdConnector: NDC Command {55445 bandwidth setiquota seth_w0 9223372036854775807} took too long (4755ms 06-19 15:29:05.723 1235 2491 I dvm_lock_sample: [system_server,1,Binder_4,4919,NetworkPolicyManagerService.java,1696,-,1056,100] 从eventslog中还可以发现,setting出现了两次连续的ANR,而上下文中都有类似上面的dvm_lock_sample NetworkPolicyManagerService相关的告警。 因此虽然15:28的这次ANR并没有打出有用的栈,但是我们还是可以猜测出这两次ANR的原因都是netd的cmd耗时太久导致的(在主线程的调用路径上存在必然的block) 那么,在netd可能无法修改的情况下,我们应该如何去resolve这个问题呢。 将可能存在block的操作放到非UI线程中去做。
2.com.huawei.hwvplayer.youku对端block:
打开dropbox中对应的system_app_anrxxxx: 查看主线程的栈: "main" prio=5 tid=1 Native | group="main" sCount=1 dsCount=0 obj=0x752b0000 self=0xb4276500 | sysTid=25390 nice=-1 cgrp=default sched=3/0 handle=0xb6f18b34 | state=S schedstat=( 0 0 0 ) utm=81 stm=12 core=2 HZ=100 | stack=0xbe78b000-0xbe78d000 stackSize=8MB | held mutexes= kernel: (couldn‘t read /proc/self/task/25390/stack) native: #00 pc 000422d0 /system/lib/libc.so (__ioctl+8) native: #01 pc 00047825 /system/lib/libc.so (ioctl+14) native: #02 pc 0001e835 /system/lib/libbinder.so (_ZN7android14IPCThreadState14talkWithDriverEb+132) native: #03 pc 0001ee93 /system/lib/libbinder.so (_ZN7android14IPCThreadState15waitForResponseEPNS_6ParcelEPi+38) native: #04 pc 0001f049 /system/lib/libbinder.so (_ZN7android14IPCThreadState8transactEijRKNS_6ParcelEPS1_j+124) native: #05 pc 00019fe3 /system/lib/libbinder.so (_ZN7android8BpBinder8transactEjRKNS_6ParcelEPS1_j+30) native: #06 pc 0008a035 /system/lib/libandroid_runtime.so (???) native: #07 pc 00d78869 /data/dalvik-cache/arm/system@framework@boot.oat (Java_android_os_BinderProxy_transactNative__ILandroid_os_Parcel_2Landroid_os_Parcel_2I+140) at android.os.BinderProxy.transactNative(Native method) at android.os.BinderProxy.transact(Binder.java:510) at android.os.storage.IMountService$Stub$Proxy.getVolumeList(IMountService.java:771) at android.os.storage.StorageManager.getVolumeList(StorageManager.java:883) at android.os.Environment$UserEnvironment.getExternalDirs(Environment.java:95) at android.os.Environment.getExternalStorageDirectory(Environment.java:354) at com.huawei.common.utils.PathUtils.<clinit>(PathUtils.java:51) at com.huawei.common.utils.PathUtils.getWorkspacePath(PathUtils.java:80) at com.huawei.common.components.log.Logger.<clinit>(Logger.java:37) at com.huawei.common.components.log.Logger.i(Logger.java:162) at com.huawei.hwvplayer.data.db.DbProvider.attachInfo(DbProvider.java:89) at android.app.ActivityThread.installProvider(ActivityThread.java:5279) at android.app.ActivityThread.installContentProviders(ActivityThread.java:4868) at android.app.ActivityThread.handleBindApplication(ActivityThread.java:4799) at android.app.ActivityThread.access$1600(ActivityThread.java:165) at android.app.ActivityThread$H.handleMessage(ActivityThread.java:1436) at android.os.Handler.dispatchMessage(Handler.java:102) at android.os.Looper.loop(Looper.java:188) at android.app.ActivityThread.main(ActivityThread.java:5578) at java.lang.reflect.Method.invoke!(Native method) at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:794) at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:684) 主线程尝试去调用MountService的 getVolumeList接口,可能没有返回。 查看system_server中相关的,搜索getVolumeList 发现systemserver中有三个Binder线程和主线程被block,我们的对端是哪个Binder线程暂时无法确认,但这并不影响我们继续分析,因为他们被blockd的路径是一致的: "main" prio=5 tid=1 Blocked | group="main" sCount=1 dsCount=0 obj=0x752b0000 self=0xb4276500 | sysTid=22735 nice=-2 cgrp=default sched=0/0 handle=0xb6f18b34 | state=S schedstat=( 0 0 0 ) utm=432 stm=85 core=1 HZ=100 | stack=0xbe78b000-0xbe78d000 stackSize=8MB | held mutexes= at com.android.server.MountService.getVolumeList(MountService.java:2759) - waiting to lock <0x0eeb54f1> (a java.lang.Object) held by thread 40 at android.os.storage.StorageManager.getVolumeList(StorageManager.java:883) at android.os.storage.StorageManager.getVolumeList(StorageManager.java:858) at android.os.storage.StorageManager.getPrimaryVolume(StorageManager.java:906) at com.android.server.usb.UsbDeviceManager.systemReady(UsbDeviceManager.java:327) at com.android.server.usb.UsbService.systemReady(UsbService.java:181) at com.android.server.usb.UsbService$Lifecycle.onBootPhase(UsbService.java:78) at com.android.server.SystemServiceManager.startBootPhase(SystemServiceManager.java:135) at com.android.server.SystemServer$3.run(SystemServer.java:1489) at com.android.server.am.ActivityManagerService.systemReady(ActivityManagerService.java:12417) at com.android.server.am.HwActivityManagerService.systemReady(HwActivityManagerService.java:960) at com.android.server.SystemServer.startOtherServices(SystemServer.java:1485) at com.android.server.SystemServer.run(SystemServer.java:381) at com.android.server.SystemServer.main(SystemServer.java:272) at java.lang.reflect.Method.invoke!(Native method) at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:794) at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:684) "Binder_8" prio=5 tid=76 Blocked | group="main" sCount=1 dsCount=0 obj=0x13bd60a0 self=0x9c1abe00 | sysTid=25191 nice=-1 cgrp=default sched=0/0 handle=0x97158930 | state=S schedstat=( 0 0 0 ) utm=9 stm=7 core=2 HZ=100 | stack=0x9705c000-0x9705e000 stackSize=1014KB | held mutexes= at com.android.server.MountService.getVolumeList(MountService.java:2759) - waiting to lock <0x0eeb54f1> (a java.lang.Object) held by thread 40 at android.os.storage.IMountService$Stub.onTransact(IMountService.java:1634) at android.os.Binder.execTransact(Binder.java:453) "Binder_2" prio=5 tid=8 Blocked | group="main" sCount=1 dsCount=0 obj=0x12cac0a0 self=0xaebf0300 | sysTid=22761 nice=-1 cgrp=default sched=0/0 handle=0xaef7d930 | state=S schedstat=( 0 0 0 ) utm=41 stm=25 core=0 HZ=100 | stack=0xaee81000-0xaee83000 stackSize=1014KB | held mutexes= at com.android.server.MountService.getVolumeList(MountService.java:2759) - waiting to lock <0x0eeb54f1> (a java.lang.Object) held by thread 40 at android.os.storage.IMountService$Stub.onTransact(IMountService.java:1634) at android.os.Binder.execTransact(Binder.java:453) 他们均是被tid=40的人block,按照上面的方法搜索tid=40或者 0x0eeb54f1得到block的人: "MountService" prio=5 tid=40 TimedWaiting | group="main" sCount=1 dsCount=0 obj=0x132c1160 self=0x9ce57400 | sysTid=23512 nice=0 cgrp=default sched=0/0 handle=0x9a239930 | state=S schedstat=( 0 0 0 ) utm=29 stm=2 core=2 HZ=100 | stack=0x9a137000-0x9a139000 stackSize=1038KB | held mutexes= at java.lang.Object.wait!(Native method) - waiting on <0x00fea1f3> (a java.lang.Object) at java.lang.Thread.parkFor$(Thread.java:1235) - locked <0x00fea1f3> (a java.lang.Object) at sun.misc.Unsafe.park(Unsafe.java:299) at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2053) at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:372) at com.android.server.NativeDaemonConnector$ResponseQueue.remove(NativeDaemonConnector.java:777) at com.android.server.NativeDaemonConnector.executeForList(NativeDaemonConnector.java:489) at com.android.server.NativeDaemonConnector.execute(NativeDaemonConnector.java:386) at com.android.server.NativeDaemonConnector.execute(NativeDaemonConnector.java:381) at com.android.server.MountService.resetIfReadyAndConnectedLocked(MountService.java:827) at com.android.server.MountService.handleSystemReady(MountService.java:776) - locked <0x0eeb54f1> (a java.lang.Object) at com.android.server.MountService.access$500(MountService.java:152) at com.android.server.MountService$MountServiceHandler.handleMessage(MountService.java:596) at android.os.Handler.dispatchMessage(Handler.java:102) at android.os.Looper.loop(Looper.java:150) at android.os.HandlerThread.run(HandlerThread.java:61) 和上面netd类似的,mountservice也是通过ndc和vold通信,这里我们需要继续查看是否vold存在异常。 之前提到过,类似这种同步锁block的,dvm_lock_sample一定会有打印,于是先去找eventslog,不过这个是华为的log,是没有eventlog的。 而华为实现了一个blockMonitor的功能,和dvm_lock_sample类似,当某个操作特别耗时的时候,会将其打印出来: 在ANR的附近找到如下: 07-19 10:17:50.739 25271 25271 W BlockMonitor: The binder calling took 55209ms.07-19 10:17:50.739 25271 25271 W BlockMonitor: android.os.BlockMonitor.checkBinderTime(BlockMonitor.java:141)07-19 10:17:50.739 25271 25271 W BlockMonitor: android.os.BinderProxy.transact(Binder.java:511)07-19 10:17:50.739 25271 25271 W BlockMonitor: android.os.storage.IMountService$Stub$Proxy.getVolumeList(IMountService.java:771)07-19 10:17:50.739 25271 25271 W BlockMonitor: android.os.storage.StorageManager.getVolumeList(StorageManager.java:883)07-19 10:17:50.739 25271 25271 W BlockMonitor: android.os.Environment$UserEnvironment.getExternalDirs(Environment.java:95)07-19 10:17:50.739 25271 25271 W BlockMonitor: android.os.Environment.getExternalStorageDirectory(Environment.java:354)07-19 10:17:50.739 25271 25271 W BlockMonitor: android.support.v4.content.FileProvider.parsePathStrategy(FileProvider.java:583)07-19 10:17:50.739 25271 25271 W BlockMonitor: android.support.v4.content.FileProvider.getPathStrategy(FileProvider.java:534)07-19 10:17:50.739 25271 25271 W BlockMonitor: android.support.v4.content.FileProvider.attachInfo(FileProvider.java:352)07-19 10:17:50.739 25271 25271 W BlockMonitor: android.app.ActivityThread.installProvider(ActivityThread.java:5279)07-19 10:17:50.739 25271 25271 W BlockMonitor: android.app.ActivityThread.installContentProviders(ActivityThread.java:4868)07-19 10:17:50.739 25271 25271 W BlockMonitor: android.app.ActivityThread.handleBindApplication(ActivityThread.java:4799)07-19 10:17:50.739 25271 25271 W BlockMonitor: android.app.ActivityThread.access$1600(ActivityThread.java:165)07-19 10:17:50.739 25271 25271 W BlockMonitor: android.app.ActivityThread$H.handleMessage(ActivityThread.java:1436)07-19 10:17:50.739 25271 25271 W BlockMonitor: android.os.Handler.dispatchMessage(Handler.java:102)07-19 10:17:50.739 25271 25271 W BlockMonitor: android.os.Looper.loop(Looper.java:188)07-19 10:17:50.739 25271 25271 W BlockMonitor: android.app.ActivityThread.main(ActivityThread.java:5578)07-19 10:17:50.739 25271 25271 W BlockMonitor: java.lang.reflect.Method.invoke(Native Method)07-19 10:17:50.739 25271 25271 W BlockMonitor: com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:794)07-19 10:17:50.739 25271 25271 W BlockMonitor: com.android.internal.os.ZygoteInit.main(ZygoteInit.java:684)再加上之前的:07-19 10:17:50.729 22735 23512 E NativeDaemonConnector.ResponseQueue: Timeout waiting for response07-19 10:17:50.729 22735 23512 E VoldConnector: timed-out waiting for response to 4 volume reset07-19 10:17:50.729 22735 23512 E VoldConnector: timed-out waiting for response mOutputStream = android.net.LocalSocketImpl$SocketOutputStream@922644, mSocket = vold07-19 10:17:50.731 22735 23512 W MountService: Failed to reset vold07-19 10:17:50.731 22735 23512 W MountService: com.android.server.NativeDaemonTimeoutException: command ‘4 volume reset‘ failed with ‘null‘07-19 10:17:50.731 22735 23512 W MountService: at com.android.server.NativeDaemonConnector.executeForList(NativeDaemonConnector.java:501)07-19 10:17:50.731 22735 23512 W MountService: at com.android.server.NativeDaemonConnector.execute(NativeDaemonConnector.java:386)07-19 10:17:50.731 22735 23512 W MountService: at com.android.server.NativeDaemonConnector.execute(NativeDaemonConnector.java:381)07-19 10:17:50.731 22735 23512 W MountService: at com.android.server.MountService.resetIfReadyAndConnectedLocked(MountService.java:827)07-19 10:17:50.731 22735 23512 W MountService: at com.android.server.MountService.handleSystemReady(MountService.java:776)07-19 10:17:50.731 22735 23512 W MountService: at com.android.server.MountService.access$500(MountService.java:152)07-19 10:17:50.731 22735 23512 W MountService: at com.android.server.MountService$MountServiceHandler.handleMessage(MountService.java:596)07-19 10:17:50.731 22735 23512 W MountService: at android.os.Handler.dispatchMessage(Handler.java:102)07-19 10:17:50.731 22735 23512 W MountService: at android.os.Looper.loop(Looper.java:150)07-19 10:17:50.731 22735 23512 W MountService: at android.os.HandlerThread.run(HandlerThread.java:61)我们有理由去推断vold的状态是不对的,但是又没有vold的栈。那么全局grep vold试试能不能找到线索,在kernellog中发现了vold的异常:07-19 14:22:24.669 <3>[10772.492156] c0 Freezing of tasks failed after 20.008 seconds (1 tasks refusing to freeze, wq_busy=0):07-19 14:22:24.669 <6>[10772.492217] c0 vold R running 0 224 1 0x0000000107-19 14:22:24.669 <4>[10772.492278] c0 [<c05ebecc>] (__schedule+0x38c/0x5bc) from [<c05ea478>] (schedule_timeout+0x18/0x1e8)07-19 14:22:24.669 <4>[10772.492309] c0 [<c05ea478>] (schedule_timeout+0x18/0x1e8) from [<c05eb90c>] (wait_for_common+0x11c/0x164)07-19 14:22:24.669 <4>[10772.492309] c0 [<c05eb90c>] (wait_for_common+0x11c/0x164) from [<c03cd8c8>] (mmc_wait_for_req+0xb4/0xe4)07-19 14:22:24.669 <4>[10772.492339] c0 [<c03cd8c8>] (mmc_wait_for_req+0xb4/0xe4) from [<c03cd95c>] (mmc_wait_for_cmd+0x64/0x74)07-19 14:22:24.669 <4>[10772.492370] c0 [<c03cd95c>] (mmc_wait_for_cmd+0x64/0x74) from [<c03d41f0>] (mmc_send_status+0x6c/0x8c)07-19 14:22:24.670 <4>[10772.492400] c0 [<c03d41f0>] (mmc_send_status+0x6c/0x8c) from [<c03d4504>] (sd_send_status+0x14/0x44)07-19 14:22:24.670 <4>[10772.492431] c0 [<c03d4504>] (sd_send_status+0x14/0x44) from [<c03d491c>] (mmc_lock_unlock_by_buf+0xac/0x168)07-19 14:22:24.670 <4>[10772.492431] c0 [<c03d491c>] (mmc_lock_unlock_by_buf+0xac/0x168) from [<c03dabd8>] (mmc_lockable_store+0x594/0x75c)07-19 14:22:24.670 <4>[10772.492461] c0 [<c03dabd8>] (mmc_lockable_store+0x594/0x75c) from [<c029d560>] (dev_attr_store+0x18/0x24)07-19 14:22:32.070 <4>[10772.492492] c0 [<c029d560>] (dev_attr_store+0x18/0x24) from [<c013b370>] (sysfs_write_file+0x104/0x148)07-19 14:22:32.070 <4>[10772.492522] c0 [<c013b370>] (sysfs_write_file+0x104/0x148) from [<c00eabb4>] (vfs_write+0xd0/0x180)07-19 14:22:32.070 <4>[10772.492553] c0 [<c00eabb4>] (vfs_write+0xd0/0x180) from [<c00eb070>] (SyS_write+0x38/0x68)07-19 14:22:32.071 <4>[10772.492583] c0 [<c00eb070>] (SyS_write+0x38/0x68) from [<c000e840>] (ret_fast_syscall+0x0/0x30)vold一直在这个操作中没有退出来,所以不能响应客户端的请求,从而导致了ANR。这个问题需要mmc的同事进一步去分析,目前怀疑是SD卡发生了错误。
3. setting 主线程耗时操作
"main" prio=5 tid=1 Native | group="main" sCount=1 dsCount=0 obj=0x7425caf8 self=0xb4827800 | sysTid=671 nice=0 cgrp=default sched=0/0 handle=0xb6febbec | state=S schedstat=( 0 0 0 ) utm=1853 stm=50 core=2 HZ=100 | stack=0xbe7bd000-0xbe7bf000 stackSize=8MB | held mutexes= kernel: (couldn‘t read /proc/self/task/671/stack) native: #00 pc 000133cc /system/lib/libc.so (syscall+28) native: #01 pc 000a9a83 /system/lib/libart.so (art::ConditionVariable::Wait(art::Thread*)+82) native: #02 pc 001b16f1 /system/lib/libart.so (art::JNI::NewString(_JNIEnv*, unsigned short const*, int)+640) native: #03 pc 00075887 /system/lib/libandroid_runtime.so (???) native: #04 pc 008570ab /data/dalvik-cache/arm/system@framework@boot.oat (Java_android_database_CursorWindow_nativeGetString__JII+110) at android.database.CursorWindow.nativeGetString(Native method) at android.database.CursorWindow.getString(CursorWindow.java:438) at android.database.AbstractWindowedCursor.getString(AbstractWindowedCursor.java:51) at android.database.CursorWrapper.getString(CursorWrapper.java:114) at com.android.settings.ApnSettings.fillList(ApnSettings.java:259) at com.android.settings.ApnSettings.onResume(ApnSettings.java:208) at android.app.Fragment.performResume(Fragment.java:2096) at android.app.FragmentManagerImpl.moveToState(FragmentManager.java:928) at android.app.FragmentManagerImpl.moveToState(FragmentManager.java:1067) at android.app.FragmentManagerImpl.moveToState(FragmentManager.java:1049) at android.app.FragmentManagerImpl.dispatchResume(FragmentManager.java:1879) at android.app.Activity.performResume(Activity.java:6113) at android.app.ActivityThread.performResumeActivity(ActivityThread.java:3015) at android.app.ActivityThread.handleResumeActivity(ActivityThread.java:3061) at android.app.ActivityThread.handleLaunchActivity(ActivityThread.java:2415) at android.app.ActivityThread.access$800(ActivityThread.java:151) at android.app.ActivityThread$H.handleMessage(ActivityThread.java:1313) at android.os.Handler.dispatchMessage(Handler.java:102) at android.os.Looper.loop(Looper.java:135) at android.app.ActivityThread.main(ActivityThread.java:5345) at java.lang.reflect.Method.invoke!(Native method) at java.lang.reflect.Method.invoke(Method.java:372) at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:947) at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:742) 从ApnSettings.java中可以看到fillList这个可能的耗时操作主线程处理的地方有OnResume和OnReceive。 且代码中存在使用HandlerThread异步处理这个操作的地方,可以借鉴下将这两处都放到HandlerThread中去做。
4. 系统资源紧张造成的ANR
通常情况下,一般这种问题的表现的可能形式是 1. traces.txt中的主线程的栈在一个非常common的操作中,如new一个变量,读取某个文件等 2. ANR的trace中IOW相当高(IOW+CPU sys+usr=100) 3. Kswapd非常活跃 4. Low Ram 配置 5. kernel log中 D状态的用户进程比较多,且都block在内存相关,且内存比较紧张. 6. 连续出现ANR,slog和ams一直在dumptrace,造成比较大的IO压力 7. 前台正在玩一个很大的游戏 8. 某个应用正在安装做dex2oat. 出现这种情况下,我们需要做的是: 1. 针对性的调整lmk参数,减少后台的数量 2. 检查是否有不应该常驻的进程 3. 是否需要修改伪前台的adj,减少常驻内存 4. 是否需要关闭dex2oat 此类问题比较难解决。
5. Binder被占满导致ANR
1. traces.txt中主线程处于IDLE状态,要么是没有抓到现场,要么是出现了其他问题。2. 分析log,找到ANR的时间点如下:后台服务执行超时,这个timeout是200s 07-28 03:37:33.681 1719 1732 I am_anr : [0,15625,com.android.exchange,948452933,executing service com.android.exchange/.service.EasService] ----- pid 15625 at 2016-07-28 03:37:33 -----,dumptrace的时间点。3. 查看是否有主线程的消息: 07-28 03:37:39.373 15625 15625 W BlockMonitor: Package name: com.android.exchange 07-28 03:37:39.373 15625 15625 W BlockMonitor: The Message{ what=115 obj=ServiceArgsData{token=android.os.BinderProxy@49814e2 startId=6 args=Intent { act=com.android.email.EXCHANGE_INTENT flg=0x4 cmp=com.android.exchange/.service.EasService (has extras) }} target=android.app.ActivityThread$H } took 5174ms. 之前有说过,华为项目是没有dvmlocksample但是有blockMonitor的,如果有耗时的binder调用以及mesg处理,那么就会被他抓过来。 从代码中可以看到,what=115是serviceArg消息,是startservice的一部分,而这个消息应该是34秒的时候才接受到并处理的。 那么,为什么33s的时候dumptrace并没有抓到主线程有卡顿?且后续并没有blockmonitor的打印,也就是说200s内主线程和systemserver并没有出现block。 那么还有可能是什么原因导致了前200s并没有收到这个消息? 我们来看下这部分的代码:
private final void sendServiceArgsLocked(ServiceRecord r, boolean execInFg, boolean oomAdjusted) throws TransactionTooLargeException { final int N = r.pendingStarts.size(); if (N == 0) { return; } while (r.pendingStarts.size() > 0) { Exception caughtException = null; ServiceRecord.StartItem si; try { si = r.pendingStarts.remove(0); if (DEBUG_SERVICE) Slog.v(TAG_SERVICE, "Sending arguments to: " + r + " " + r.intent + " args=" + si.intent); if (si.intent == null && N > 1) { // If somehow we got a dummy null intent in the middle, // then skip it. DO NOT skip a null intent when it is // the only one in the list -- this is to support the // onStartCommand(null) case. continue; } si.deliveredTime = SystemClock.uptimeMillis(); r.deliveredStarts.add(si); si.deliveryCount++; if (si.neededGrants != null) { mAm.grantUriPermissionUncheckedFromIntentLocked(si.neededGrants, si.getUriPermissionsLocked()); } bumpServiceExecutingLocked(r, execInFg, "start"); //超时timer开始计算 if (!oomAdjusted) { oomAdjusted = true; mAm.updateOomAdjLocked(r.app); } int flags = 0; if (si.deliveryCount > 1) { flags |= Service.START_FLAG_RETRY; } if (si.doneExecutingCount > 0) { flags |= Service.START_FLAG_REDELIVERY; } r.app.thread.scheduleServiceArgs(r, si.taskRemoved, si.id, flags, si.intent);//通过binder给App发送一个消息,ActivityThread会收到并处理。 } catch (TransactionTooLargeException e) { if (DEBUG_SERVICE) Slog.v(TAG_SERVICE, "Transaction too large: intent=" + si.intent); caughtException = e; } catch (RemoteException e) { // Remote process gone... we‘ll let the normal cleanup take care of this. if (DEBUG_SERVICE) Slog.v(TAG_SERVICE, "Crashed while sending args: " + r); caughtException = e; } catch (Exception e) { Slog.w(TAG, "Unexpected exception", e); caughtException = e; } if (caughtException != null) { // Keep nesting count correct final boolean inDestroying = mDestroyingServices.contains(r); serviceDoneExecutingLocked(r, inDestroying, inDestroying); if (caughtException instanceof TransactionTooLargeException) { throw (TransactionTooLargeException)caughtException; } break; } } }
从上面的代码可以看出,在启动计时器和给app发送消息之间并不存在耗时操作,且这个是在同步锁之内的,如果慢了一定会被blockmonitor抓出来。那么问题的原因只有一种可能情形:200s之前给app发送的消息,APP并没有及时的处理。从经验可以猜测,一般这种情况都是app的Binder被占满导致没有空闲的Binder线程能够处理这个消息导致。而我们的traces正好dump到了这一幕:
app所有的Binder线程都全部在做一件事情,一个依赖于网络的操作,是被另外一个人通过contentprovider.query调过来的,有点不太正常。"Binder_10" prio=5 tid=30 Native | group="main" sCount=1 dsCount=0 obj=0x22dba160 self=0xb2f26700 | sysTid=4613 nice=10 cgrp=bg_non_interactive sched=0/0 handle=0x9eb80930 | state=S schedstat=( 0 0 0 ) utm=0 stm=0 core=3 HZ=100 | stack=0x9ea84000-0x9ea86000 stackSize=1014KB | held mutexes= kernel: (couldn‘t read /proc/self/task/4613/stack) native: #00 pc 000423e8 /system/lib/libc.so (__pselect6+20) native: #01 pc 0001c115 /system/lib/libc.so (select+60) native: #02 pc 0000a279 /system/lib/libjavacrypto.so (???) native: #03 pc 0000b29f /system/lib/libjavacrypto.so (???) native: #04 pc 004bc3fd /data/dalvik-cache/arm/system@framework@boot.oat (Java_com_android_org_conscrypt_NativeCrypto_SSL_1read__JLjava_io_FileDescriptor_2Lcom_android_org_conscrypt_NativeCrypto_00024SSLHandshakeCallbacks_2_3BIII+192) at com.android.org.conscrypt.NativeCrypto.SSL_read(Native method) at com.android.org.conscrypt.OpenSSLSocketImpl$SSLInputStream.read(OpenSSLSocketImpl.java:705) - locked <0x0508e00e> (a java.lang.Object) at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:108) at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:196) at org.apache.http.impl.conn.DefaultResponseParser.parseHead(DefaultResponseParser.java:88) at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:179) at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:185) at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:240) at org.apache.http.impl.conn.AbstractClientConnAdapter.receiveResponseHeader(AbstractClientConnAdapter.java:264) at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:284) at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:126) at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:440) at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:596) at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:517) at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:495) at com.android.exchange.EasResponse.fromHttpRequest(EasResponse.java:91) at com.android.exchange.service.EasServerConnection.executeHttpUriRequest(EasServerConnection.java:384) at com.android.exchange.eas.EasOperation.performOperation(EasOperation.java:382) at com.android.exchange.service.EasService.searchGal(EasService.java:706) at com.android.exchange.provider.ExchangeDirectoryProvider.query(ExchangeDirectoryProvider.java:350) at android.content.ContentProvider.query(ContentProvider.java:1017) at android.content.ContentProvider$Transport.query(ContentProvider.java:238) at android.content.ContentProviderNative.onTransact(ContentProviderNative.java:112) at android.os.Binder.execTransact(Binder.java:453)
sprdruntime在ANR的时候会dump BinderState,搜索当前进程的pid,查看是否能够找到线索:07-28 03:37:33.677 1719 1732 W SprdRuntimeInfo: proc 1562507-28 03:37:33.677 1719 1732 W SprdRuntimeInfo: thread 4537: l 0107-28 03:37:33.677 1719 1732 W SprdRuntimeInfo: incoming transaction 625339: ed9d1ec0 from 30228:4535 to 15625:4537 code 1 flags 10 pri 10 r1 node 222269 size 736:4 data fb600d8407-28 03:37:33.677 1719 1732 W SprdRuntimeInfo: thread 4538: l 0107-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: incoming transaction 625344: c2c47080 from 30228:4536 to 15625:4538 code 1 flags 10 pri 10 r1 node 222269 size 736:4 data fb60109007-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: thread 4563: l 0107-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: incoming transaction 625759: e887e6c0 from 30228:4558 to 15625:4563 code 1 flags 10 pri 10 r1 node 222269 size 740:4 data fb6013f407-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: thread 4564: l 0107-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: incoming transaction 625805: e6223a40 from 30228:4561 to 15625:4564 code 1 flags 10 pri 10 r1 node 222269 size 740:4 data fb60170407-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: thread 4578: l 0107-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: incoming transaction 626144: e96d0cc0 from 30228:4577 to 15625:4578 code 1 flags 10 pri 10 r1 node 222269 size 740:4 data fb601a7007-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: thread 4580: l 0107-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: incoming transaction 626162: d0a97440 from 30228:4579 to 15625:4580 code 1 flags 10 pri 10 r1 node 222269 size 740:4 data fb601db007-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: thread 4591: l 0107-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: incoming transaction 626562: d34d60c0 from 30228:4588 to 15625:4591 code 1 flags 10 pri 10 r1 node 222269 size 744:4 data fb6020ec07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: thread 4592: l 0107-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: incoming transaction 626575: ecb23800 from 30228:4590 to 15625:4592 code 1 flags 10 pri 10 r1 node 222269 size 744:4 data fb60243007-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: thread 4602: l 0107-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: incoming transaction 626867: d3c64580 from 30228:4600 to 15625:4602 code 1 flags 10 pri 10 r1 node 222269 size 744:4 data fb60277007-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: thread 4603: l 0107-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: incoming transaction 626883: e8b8bfc0 from 30228:4601 to 15625:4603 code 1 flags 10 pri 10 r1 node 222269 size 744:4 data fb602ab407-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: thread 4612: l 0107-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: incoming transaction 627170: e8b8bdc0 from 30228:4610 to 15625:4612 code 1 flags 10 pri 10 r1 node 222269 size 744:4 data fb602df407-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: thread 4613: l 0107-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: incoming transaction 627176: e8ded5c0 from 30228:4611 to 15625:4613 code 1 flags 10 pri 10 r1 node 222269 size 744:4 data fb60313807-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: thread 15635: l 0207-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: incoming transaction 623807: edbbfa40 from 30228:1612 to 15625:15635 code 1 flags 10 pri 10 r1 node 222269 size 740:4 data fb60040407-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: thread 15636: l 0107-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: incoming transaction 624820: edc2bd80 from 30228:30240 to 15625:15636 code 1 flags 10 pri 10 r1 node 222269 size 736:4 data fb60071407-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: thread 15658: l 0107-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: incoming transaction 624824: c2c47400 from 30228:30238 to 15625:15658 code 1 flags 10 pri 10 r1 node 222269 size 736:4 data fb600a2007-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: thread 18343: l 0107-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: incoming transaction 623794: d5132380 from 30228:30241 to 15625:18343 code 1 flags 10 pri 10 r1 node 222269 size 740:4 data fb6000f4可以看到我们binder线程被30228给占满了,而同时我们也发现30228也有点不太对:07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: proc 3022807-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: thread 1612: l 1107-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: outgoing transaction 623807: edbbfa40 from 30228:1612 to 15625:15635 code 1 flags 10 pri 10 r1 node 222269 size 740:4 data fb60040407-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: incoming transaction 623738: e994e880 from 2332:4503 to 30228:1612 code 1 flags 10 pri 10 r1 node 538358 size 776:4 data f880009c07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: thread 4535: l 1107-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: outgoing transaction 625339: ed9d1ec0 from 30228:4535 to 15625:4537 code 1 flags 10 pri 10 r1 node 222269 size 736:4 data fb600d8407-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: incoming transaction 625317: e82f4000 from 2332:4556 to 30228:4535 code 1 flags 10 pri 10 r1 node 538358 size 776:4 data f8800d6c07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: thread 4536: l 1107-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: outgoing transaction 625344: c2c47080 from 30228:4536 to 15625:4538 code 1 flags 10 pri 10 r1 node 222269 size 736:4 data fb60109007-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: incoming transaction 625326: ed9d1c40 from 2332:4557 to 30228:4536 code 1 flags 10 pri 10 r1 node 538358 size 776:4 data f88010a007-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: thread 4558: l 1107-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: outgoing transaction 625759: e887e6c0 from 30228:4558 to 15625:4563 code 1 flags 10 pri 10 r1 node 222269 size 740:4 data fb6013f407-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: incoming transaction 625733: e887e5c0 from 2332:4575 to 30228:4558 code 1 flags 10 pri 10 r1 node 538358 size 780:4 data f88013d407-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: thread 4561: l 1107-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: outgoing transaction 625805: e6223a40 from 30228:4561 to 15625:4564 code 1 flags 10 pri 10 r1 node 222269 size 740:4 data fb60170407-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: incoming transaction 625756: e8719880 from 2332:4576 to 30228:4561 code 1 flags 10 pri 10 r1 node 538358 size 780:4 data f880170c07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: thread 4577: l 1107-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: outgoing transaction 626144: e96d0cc0 from 30228:4577 to 15625:4578 code 1 flags 10 pri 10 r1 node 222269 size 740:4 data fb601a7007-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: incoming transaction 626125: d34d6ec0 from 2332:4585 to 30228:4577 code 1 flags 10 pri 10 r1 node 538358 size 780:4 data f8801a4407-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: thread 4579: l 1107-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: outgoing transaction 626162: d0a97440 from 30228:4579 to 15625:4580 code 1 flags 10 pri 10 r1 node 222269 size 740:4 data fb601db007-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: incoming transaction 626131: c3f98200 from 2332:4587 to 30228:4579 code 1 flags 10 pri 10 r1 node 538358 size 780:4 data f8801d7c07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: thread 4588: l 1107-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: outgoing transaction 626562: d34d60c0 from 30228:4588 to 15625:4591 code 1 flags 10 pri 10 r1 node 222269 size 744:4 data fb6020ec07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: incoming transaction 626541: ede63540 from 2332:4598 to 30228:4588 code 1 flags 10 pri 10 r1 node 538358 size 784:4 data f88020b407-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: thread 4590: l 1107-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: outgoing transaction 626575: ecb23800 from 30228:4590 to 15625:4592 code 1 flags 10 pri 10 r1 node 222269 size 744:4 data fb60243007-28 03:37:33.563 1719 1732 W SprdRuntimeInfo: incoming transaction 626546: e9a32b80 from 2332:4599 to 30228:4590 code 1 flags 10 pri 10 r1 node 538358 size 784:4 data f88023f007-28 03:37:33.563 1719 1732 W SprdRuntimeInfo: thread 4600: l 1107-28 03:37:33.563 1719 1732 W SprdRuntimeInfo: outgoing transaction 626867: d3c64580 from 30228:4600 to 15625:4602 code 1 flags 10 pri 10 r1 node 222269 size 744:4 data fb60277007-28 03:37:33.563 1719 1732 W SprdRuntimeInfo: incoming transaction 626849: ecb23e00 from 2332:4608 to 30228:4600 code 1 flags 10 pri 10 r1 node 538358 size 784:4 data f880272c07-28 03:37:33.563 1719 1732 W SprdRuntimeInfo: thread 4601: l 1107-28 03:37:33.563 1719 1732 W SprdRuntimeInfo: outgoing transaction 626883: e8b8bfc0 from 30228:4601 to 15625:4603 code 1 flags 10 pri 10 r1 node 222269 size 744:4 data fb602ab407-28 03:37:33.563 1719 1732 W SprdRuntimeInfo: incoming transaction 626856: e8389f00 from 2332:4609 to 30228:4601 code 1 flags 10 pri 10 r1 node 538358 size 784:4 data f8802a6807-28 03:37:33.563 1719 1732 W SprdRuntimeInfo: thread 4610: l 1107-28 03:37:33.563 1719 1732 W SprdRuntimeInfo: outgoing transaction 627170: e8b8bdc0 from 30228:4610 to 15625:4612 code 1 flags 10 pri 10 r1 node 222269 size 744:4 data fb602df407-28 03:37:33.563 1719 1732 W SprdRuntimeInfo: incoming transaction 627144: c27891c0 from 2332:4618 to 30228:4610 code 1 flags 10 pri 10 r1 node 538358 size 784:4 data f8802da407-28 03:37:33.563 1719 1732 W SprdRuntimeInfo: thread 4611: l 1107-28 03:37:33.563 1719 1732 W SprdRuntimeInfo: outgoing transaction 627176: e8ded5c0 from 30228:4611 to 15625:4613 code 1 flags 10 pri 10 r1 node 222269 size 744:4 data fb60313807-28 03:37:33.563 1719 1732 W SprdRuntimeInfo: incoming transaction 627151: e2b511c0 from 2332:4619 to 30228:4611 code 1 flags 10 pri 10 r1 node 538358 size 784:4 data f88030e0也就是说实际上是2332query了30228,30228才会去query16625导致anr的。那么2332是谁?eventslog中可以看到2332是email,30228是android.process.acore。搜索traces.txt,看到我们在anr的时候有dump到2332的栈:发现了异常,有很多个线程(也就是上面dump到的2332的线程)在查询,"Filter" prio=5 tid=42 Native | group="main" sCount=1 dsCount=0 obj=0x130cc8e0 self=0x98637500 | sysTid=4502 nice=10 cgrp=bg_non_interactive sched=0/0 handle=0x9877e930 | state=S schedstat=( 0 0 0 ) utm=1 stm=0 core=0 HZ=100 | stack=0x9867c000-0x9867e000 stackSize=1038KB | held mutexes= kernel: (couldn‘t read /proc/self/task/4502/stack) native: #00 pc 00017618 /system/lib/libc.so (syscall+28) native: #01 pc 000f5cbd /system/lib/libart.so (_ZN3art17ConditionVariable4WaitEPNS_6ThreadE+80) native: #02 pc 00275df1 /system/lib/libart.so (_ZN3art3JNI18CallBooleanMethodVEP7_JNIEnvP8_jobjectP10_jmethodIDSt9__va_list+592) native: #03 pc 0008617b /system/lib/libandroid_runtime.so (???) native: #04 pc 00089c2b /system/lib/libandroid_runtime.so (???) native: #05 pc 000198dd /system/lib/libbinder.so (_ZN7android7BBinder8transactEjRKNS_6ParcelEPS1_j+60) native: #06 pc 0001ec25 /system/lib/libbinder.so (_ZN7android14IPCThreadState14executeCommandEi+584) native: #07 pc 0001ef73 /system/lib/libbinder.so (_ZN7android14IPCThreadState15waitForResponseEPNS_6ParcelEPi+262) native: #08 pc 0001f049 /system/lib/libbinder.so (_ZN7android14IPCThreadState8transactEijRKNS_6ParcelEPS1_j+124) native: #09 pc 00019fe3 /system/lib/libbinder.so (_ZN7android8BpBinder8transactEjRKNS_6ParcelEPS1_j+30) native: #10 pc 0008a035 /system/lib/libandroid_runtime.so (???) native: #11 pc 00d78919 /data/dalvik-cache/arm/system@framework@boot.oat (Java_android_os_BinderProxy_transactNative__ILandroid_os_Parcel_2Landroid_os_Parcel_2I+140) at android.os.BinderProxy.transactNative(Native method) at android.os.BinderProxy.transact(Binder.java:505) at android.content.ContentProviderProxy.query(ContentProviderNative.java:419) at android.content.ContentResolver.query(ContentResolver.java:502) at android.content.ContentResolver.query(ContentResolver.java:438) at com.huawei.mail.chips.BaseRecipientAdapter.doQuery(BaseRecipientAdapter.java:931) at com.huawei.mail.chips.BaseRecipientAdapter$DirectoryFilter.performFiltering(BaseRecipientAdapter.java:400) at android.widget.Filter$RequestHandler.handleMessage(Filter.java:234) at android.os.Handler.dispatchMessage(Handler.java:102) at android.os.Looper.loop(Looper.java:150) at android.os.HandlerThread.run(HandlerThread.java:61)那么接下来就需要结合代码查看这么多线程创建的原因。解决的思路也比较简单,是否需要采取线程池或者限制下线程的个数,防止由于此操作比较耗时(依赖于网络)导致对端ANR。
- Bug 582473 single touch point test,操作15次以上必现工厂测试anr
从dropbox中找到Validationtools对应的traces,查看主线程的栈,发现主线程在IDLE状态。 "main" prio=5 tid=1 NATIVE | group="main" sCount=1 dsCount=0 obj=0x4155acc0 self=0x414943f0 | sysTid=3287 nice=-1 sched=0/0 cgrp=apps handle=1074065748 | state=S schedstat=( 0 0 0 ) utm=1241 stm=340 core=0 #00 pc 00021764 /system/lib/libc.so (epoll_wait+12) #01 pc 000105e3 /system/lib/libutils.so (android::Looper::pollInner(int)+94) #02 pc 00010811 /system/lib/libutils.so (android::Looper::pollOnce(int, int*, int*, void**)+92) #03 pc 0006b345 /system/lib/libandroid_runtime.so (android::NativeMessageQueue::pollOnce(_JNIEnv*, int)+22) #04 pc 0001e70c /system/lib/libdvm.so (dvmPlatformInvoke+112) #05 pc 0004dc43 /system/lib/libdvm.so (dvmCallJNIMethod(unsigned int const*, JValue*, Method const*, Thread*)+398) #06 pc 00027778 /system/lib/libdvm.so #07 pc 0002e410 /system/lib/libdvm.so (dvmMterpStd(Thread*)+76) #08 pc 0002bac8 /system/lib/libdvm.so (dvmInterpret(Thread*, Method const*, JValue*)+156) #09 pc 00060259 /system/lib/libdvm.so (dvmInvokeMethod(Object*, Method const*, ArrayObject*, ArrayObject*, ClassObject*, bool)+392) #10 pc 000683eb /system/lib/libdvm.so #11 pc 00027778 /system/lib/libdvm.so #12 pc 0002e410 /system/lib/libdvm.so (dvmMterpStd(Thread*)+76) #13 pc 0002bac8 /system/lib/libdvm.so (dvmInterpret(Thread*, Method const*, JValue*)+156) #14 pc 0005ff77 /system/lib/libdvm.so (dvmCallMethodV(Thread*, Method const*, Object*, bool, JValue*, std::__va_list)+338) #15 pc 0004982b /system/lib/libdvm.so #16 pc 0004de13 /system/lib/libandroid_runtime.so #17 pc 0004ecdd /system/lib/libandroid_runtime.so (android::AndroidRuntime::start(char const*, char const*)+532) #18 pc 00001423 /system/bin/app_process #19 pc 0000e403 /system/lib/libc.so (__libc_init+50) #20 pc 00000f34 /system/bin/app_process at android.os.MessageQueue.nativePollOnce(Native Method) at android.os.MessageQueue.next(MessageQueue.java:138) at android.os.Looper.loop(Looper.java:123) at android.app.ActivityThread.main(ActivityThread.java:5294) at java.lang.reflect.Method.invokeNative(Native Method) at java.lang.reflect.Method.invoke(Method.java:515) at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:932) at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:748) at dalvik.system.NativeStart.main(Native Method) 然后发现第二个线程是Binder10 且在block状态,通常情况下,一个不是非常繁忙的需要对外提供服务的app是不许要16个binder线程的。 于是接着查看是否其他binder也存在异常,结果发现所有的binder线程均处于同样的调用栈。"Binder_D" prio=5 tid=21 WAIT | group="main" sCount=1 dsCount=0 obj=0x41d6f468 self=0x555f4898 | sysTid=3344 nice=0 sched=0/0 cgrp=apps handle=1442108096 | state=S schedstat=( 0 0 0 ) utm=0 stm=0 core=3 at java.lang.Object.wait(Native Method) - waiting on <0x41d6f4c0> (a java.lang.VMThread) held by tid=21 (Binder_D) at java.lang.Thread.parkFor(Thread.java:1205) at sun.misc.Unsafe.park(Unsafe.java:325) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:157) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:813) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:846) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1175) at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:180) at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:256) at android.view.SurfaceView$MyWindow.resized(SurfaceView.java:655) at android.view.IWindow$Stub.onTransact(IWindow.java:108) at android.os.Binder.execTransact(Binder.java:427) at dalvik.system.NativeStart.run(Native Method)"Binder_F" prio=5 tid=23 WAIT | group="main" sCount=1 dsCount=0 obj=0x41d4e600 self=0x55cb63f8 | sysTid=3351 nice=0 sched=0/0 cgrp=apps handle=1435714912 | state=S schedstat=( 0 0 0 ) utm=1 stm=1 core=0 at java.lang.Object.wait(Native Method) - waiting on <0x41d4e658> (a java.lang.VMThread) held by tid=23 (Binder_F) at java.lang.Thread.parkFor(Thread.java:1205) at sun.misc.Unsafe.park(Unsafe.java:325) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:157) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:813) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:846) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1175) at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:180) at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:256) at android.view.SurfaceView$MyWindow.resized(SurfaceView.java:655) at android.view.IWindow$Stub.onTransact(IWindow.java:108) at android.os.Binder.execTransact(Binder.java:427) at dalvik.system.NativeStart.run(Native Method)如上面一个例子,在这种Binder满的情况下,systemserver和app之间的通信是被block的,虽然这个是4.4并没有sprdruntimeInfo给我们dump出binder的状态,我们还是可以推断出ANR的直接原因就是此处不合常理的Binder被占满。首先通过代码可以看到,surfaceView中这行对应的代码为如下: @Override public void resized(Rect frame, Rect overscanInsets, Rect contentInsets, Rect visibleInsets, boolean reportDraw, Configuration newConfig) { SurfaceView surfaceView = mSurfaceView.get(); if (surfaceView != null) { if (DEBUG) Log.v( "SurfaceView", surfaceView + " got resized: w=" + frame.width() + " h=" + frame.height() + ", cur w=" + mCurWidth + " h=" + mCurHeight); surfaceView.mSurfaceLock.lock(); try { if (reportDraw) { surfaceView.mUpdateWindowNeeded = true; surfaceView.mReportDrawNeeded = true; surfaceView.mHandler.sendEmptyMessage(UPDATE_WINDOW_MSG); } else if (surfaceView.mWinFrame.width() != frame.width() || surfaceView.mWinFrame.height() != frame.height()) { surfaceView.mUpdateWindowNeeded = true; surfaceView.mHandler.sendEmptyMessage(UPDATE_WINDOW_MSG); } } finally { surfaceView.mSurfaceLock.unlock(); } } } mSurfaceLocklock的路径还有如下: private void updateWindow(boolean force, boolean redrawNeeded) private final Canvas internalLockCanvas(Rect dirty)由于这个是个必现问题,那么问题就简单了,我们可以在所有会lock和unlock的地方都加上log,再复现一次来得到真相。于是乎发现是在SingleTouchPointTest pass的时候忘记unlockCanvs导致。
套路总结
1. 查看traces.txt,如果有block栈则通过栈来找到block原因。 2. 应用自身耗时操作尝试修改异步,被Binder block则进一步确认下对端的情况,如果是依赖于外部状态的可能block操作,应用修改为异步,如果是common接口,转FW 3. 如果不存在栈,那么尽人事听天命,查看ANR点附近是否有主线程block的可能信息,如果有则顺腾摸瓜,没有就Needinfo。
如何避免ANR
A.减少复杂的layout B.主线程尽量不要做和显示无关的事情。 C.如果存在可能会block、耗时的操作,不要放到主线程中,可以使用异步等方式来放到另外一个handlerthread或者asynctask中去做。 需要注意的是,使用handler.post或者sendmessage的时候需要确认清楚handler的looper,如果是主线程,依然可能会ANR。 在HandlerThread中如果需要处理和显示相关的,还需要到主线程中处理(非UI线程不能操作UI,在ViewRoot中有检查)
Android ANR分析(2)
声明:以上内容来自用户投稿及互联网公开渠道收集整理发布,本网站不拥有所有权,未作人工编辑处理,也不承担相关法律责任,若内容有误或涉及侵权可进行投诉: 投诉/举报 工作人员会在5个工作日内联系你,一经查实,本站将立刻删除涉嫌侵权内容。