
Debugging a coredump on Linux: analyzing a call stack whose top frame address is 0

During testing over the past few days I received another coredump report. The call stack was:

(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x0000000000432bb4 in ChargingNode::canProcessed (this=0x7f87b40118e0, maxTimestamp=9000000000) at src/sl/ChargingFile.C:406
#2  0x0000000000445de4 in BucketFileAdapter::checkin (this=0x2192b98, startTime=<value optimized out>) at src/sl/BucketFileAdapter.C:118
#3  0x0000000000446114 in BucketFileAdapter::start (this=0x2192b98) at src/sl/BucketFileAdapter.C:87
#4  0x000000000043560e in file_reader_run (arg=0x2192b98) at src/sl/ChargingFileAdapter.C:234
#5  0x0000003657607851 in start_thread () from /lib64/libpthread.so.0
#6  0x0000003656ee890d in clone () from /lib64/libc.so.6

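The function address in the top frame is 0x0.

This is an interesting symptom; I had not run into this case before.

Looking at the code, the C++ line where the fault occurred (frame 1) is: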

if(chargingFile  && (chargingFile ->canDecoded() == false))

Printing chargingFile in gdb shows that the value was optimized out (the build uses -O2):

(gdb) p chargingFile
$3 = <value optimized out>

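Now consider the C++ code for frame 1. If chargingFile were NULL, the && would short-circuit and nothing would crash. So chargingFile must be non-null, and the failure must come from the call to canDecoded() itself.

The corresponding disassembly: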

(gdb) disas
Dump of assembler code for function ChargingNode::canProcessed(long):
   0x0000000000432b20 <+0>:     mov    %rbx,-0x18(%rsp)
   0x0000000000432b25 <+5>:     lea    0x78(%rdi),%rbx
   0x0000000000432b29 <+9>:     mov    %rbp,-0x10(%rsp)
   0x0000000000432b2e <+14>:    mov    %r12,-0x8(%rsp)
   0x0000000000432b33 <+19>:    mov    %rdi,%rbp
   0x0000000000432b36 <+22>:    sub    $0x18,%rsp
   0x0000000000432b3a <+26>:    mov    %rbx,%rdi
   0x0000000000432b3d <+29>:    mov    %rsi,%r12
   0x0000000000432b40 <+32>:    callq  0x407820 <pthread_mutex_lock@plt>
   0x0000000000432b45 <+37>:    mov    0x18(%rbp),%rcx
   0x0000000000432b49 <+41>:    mov    0x28(%rbp),%rdx
   0x0000000000432b4d <+45>:    mov    0x38(%rbp),%rax
   0x0000000000432b51 <+49>:    sub    0x40(%rbp),%rax
   0x0000000000432b55 <+53>:    sub    %rcx,%rdx
   0x0000000000432b58 <+56>:    sar    $0x3,%rdx
   0x0000000000432b5c <+60>:    sar    $0x3,%rax
   0x0000000000432b60 <+64>:    add    %rax,%rdx
   0x0000000000432b63 <+67>:    mov    0x50(%rbp),%rax
   0x0000000000432b67 <+71>:    sub    0x30(%rbp),%rax
   0x0000000000432b6b <+75>:    sar    $0x3,%rax
   0x0000000000432b6f <+79>:    shl    $0x6,%rax
   0x0000000000432b73 <+83>:    lea    -0x40(%rax,%rdx,1),%rax
   0x0000000000432b78 <+88>:    test   %rax,%rax
   0x0000000000432b7b <+91>:    je     0x432b83 <ChargingNode::canProcessed(long)+99>
   0x0000000000432b7d <+93>:    cmpb   $0x0,0x70(%rbp)
   0x0000000000432b81 <+97>:    je     0x432ba0 <ChargingNode::canProcessed(long)+128>
   0x0000000000432b83 <+99>:    mov    %rbx,%rdi
   0x0000000000432b86 <+102>:   callq  0x407250 <pthread_mutex_unlock@plt>
   0x0000000000432b8b <+107>:   xor    %eax,%eax
   0x0000000000432b8d <+109>:   mov    (%rsp),%rbx
   0x0000000000432b91 <+113>:   mov    0x8(%rsp),%rbp
   0x0000000000432b96 <+118>:   mov    0x10(%rsp),%r12
   0x0000000000432b9b <+123>:   add    $0x18,%rsp
   0x0000000000432b9f <+127>:   retq
   0x0000000000432ba0 <+128>:   cmp    %r12,0x68(%rbp)
   0x0000000000432ba4 <+132>:   jg     0x432bd0 <ChargingNode::canProcessed(long)+176>
   0x0000000000432ba6 <+134>:   mov    (%rcx),%rdi
   0x0000000000432ba9 <+137>:   test   %rdi,%rdi
   0x0000000000432bac <+140>:   je     0x432bb8 <ChargingNode::canProcessed(long)+152>
   0x0000000000432bae <+142>:   mov    (%rdi),%rax
   0x0000000000432bb1 <+145>:   callq  *0x20(%rax)
---Type <return> to continue, or q <return> to quit---
=> 0x0000000000432bb4 <+148>:   test   %al,%al

......

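Note the instruction at +145, callq *0x20(%rax). callq is the x86 function-call instruction, and here its operand is an indirect address: the target is read from memory at 0x20(%rax) rather than encoded as a direct function address.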

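I have seen this form of call in only two situations: (1) in kernel startup code, where it is used to keep the compiler from optimizing the call, and (2) virtual function calls. More generally, any call through a function pointer compiles to this form.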

So is canDecoded a virtual function? Checking the source confirms it: virtual bool canDecoded();

In that case %rax, loaded by mov (%rdi),%rax at +142, holds the object's vptr.

To verify:
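The pair of instructions at +142 and +145 can be modeled in plain C++ with a hand-rolled table of function pointers. This is only a sketch of the mechanism, not the compiler's real layout: Object, Slot, canDecodedImpl, dispatch and both tables are hypothetical names, and slots 0 to 3 are simply left empty here. Unlike the real call, the model checks the slot before jumping so that it can report the problem instead of crashing.

```cpp
#include <cassert>
#include <cstddef>

struct Object;                          // hypothetical stand-in for ChargingFile
using Slot = bool (*)(Object*);         // one table entry: pointer to a function taking `this`

struct Object {
    const Slot* vptr;                   // first word of the object, like _vptr.ChargingFile
    bool decodeFlag;
};

// Hypothetical implementation placed in the slot that the call uses.
inline bool canDecodedImpl(Object* self) { return self->decodeFlag; }

constexpr std::size_t kPointerSize = 8;             // pointer size on x86_64
constexpr std::size_t kSlot = 0x20 / kPointerSize;  // displacement 0x20 selects slot 4
static_assert(kSlot == 4, "0x20 bytes into the table is the fifth entry");

// A healthy table (slots 0-3 unused in this model) and a fully zeroed one.
static const Slot kGoodTable[kSlot + 1] = {nullptr, nullptr, nullptr, nullptr, canDecodedImpl};
static const Slot kZeroedTable[kSlot + 1] = {};

// Models the call sequence; sets slotValid to false instead of jumping to 0.
inline bool dispatch(Object* obj, bool& slotValid) {
    assert(obj != nullptr);             // the source guards this with `chargingFile &&`
    const Slot* table = obj->vptr;      // mov    (%rdi),%rax
    Slot target = table[kSlot];         // operand of callq *0x20(%rax)
    slotValid = (target != nullptr);
    return slotValid ? target(obj) : false;
}
```

With an object whose vptr is kGoodTable, dispatch calls canDecodedImpl. With kZeroedTable the loaded target is a null pointer, which in the real program becomes a jump to address 0 and produces the `0x0000000000000000 in ?? ()` frame.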

(gdb) i r
rax            0x7f87b40c4670   140220818015856
rbx            0x7f87b4011958   140220817283416
rcx            0x7f87b4054478   140220817556600
rdx            0x4c     76
rsi            0x0      0
rdi            0x7f87b40cd480   140220818052224

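Under the x86_64 calling convention, rdi carries the first argument, which for a member function is the this pointer. So rdi holds the value of chargingFile.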

(gdb) p *(ChargingFile*)0x7f87b40cd480
$7 = {_vptr.ChargingFile = 0x7f87b40c4670, nodeName = "", hostName = "bucket", fileName =
    "/incoming4cdrsch/reported/acr/bucket/bucket12/MAS2_-_0000001709.20130704_-_2126+0800.INC", fileType = CDR_FILE_TYPE_ACR, qid = {id =
    -1}, timestamp = 1372944386, bufferedChargingRec = false, chargingNode = 0x7f87b40118e0, static fileStatusDir =
    "/incoming4cdrsch//status", static processedRecNumLimit = 300000, static acrFilesTotalSize = 0, decodeFlag = false, localFileFlag =
    false, fpFile = 0x0, fpStatusFile = 0x0, stopFlag =false, statusFileName = KeyboardInterrupt: Quit
, offset = 4184212, recordNum = 5695, totalRecordNum = 5695, accumNum = 0, static batchSize = 1000}

 

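In the output above, _vptr.ChargingFile is 0x7f87b40c4670, the same value as rax. The analysis so far is correct.

Following 0x0000000000432bb1 <+145>:   callq  *0x20(%rax), let us see what the memory at 0x20(%rax) actually contains: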

%rax=0x7f87b40c4670 

0x20(%rax) = 0x7f87b40c4690
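The address arithmetic can be checked at compile time. The sketch below uses only the register value and displacement quoted above; the helper names are made up for illustration. It also computes which row of an x/x dump started at %rax contains the slot, assuming gdb's usual layout of four 32-bit words (16 bytes) per row.

```cpp
#include <cstdint>

constexpr std::uint64_t kRax = 0x7f87b40c4670ULL;  // vptr value held in %rax
constexpr std::uint64_t kDisp = 0x20;              // displacement in callq *0x20(%rax)
constexpr std::uint64_t kBytesPerRow = 16;         // four 32-bit words per dump row

// Effective address that the indirect call reads its target from.
constexpr std::uint64_t slotAddress(std::uint64_t base, std::uint64_t disp) {
    return base + disp;
}

// Zero-based row of a dump beginning at `start` that contains `addr`.
constexpr std::uint64_t dumpRow(std::uint64_t addr, std::uint64_t start) {
    return (addr - start) / kBytesPerRow;
}

static_assert(slotAddress(kRax, kDisp) == 0x7f87b40c4690ULL,
              "0x20(%rax) resolves to 0x7f87b40c4690");
static_assert(dumpRow(slotAddress(kRax, kDisp), kRax) == 2,
              "the slot is at the start of the third dump row");
```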

Examine the memory:

(gdb)  x/40x 0x7f87b40c4670
0x7f87b40c4670: 0x4100434e      0x2d5f3253      0x00000211      0x00000000
0x7f87b40c4680: 0xb4024f10      0x00007f87      0xb40cd470      0x00007f87
0x7f87b40c4690: 0x00000000      0x00000000      0x00000000      0x00000000

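Look at the row for 0x7f87b40c4690: the slot the call reads contains 0x0. This means the object has been corrupted: what its vptr leads to is no longer a valid virtual table, so the call jumps to address 0.

Reviewing the code showed the root cause: in the multithreaded path the mutex covered the wrong scope, so the object was freed too early.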
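As a side check that goes slightly beyond the analysis above, the first two words of the dump (0x4100434e and 0x2d5f3253) can be decoded as the bytes they occupy in memory on a little-endian machine. The helper names below are hypothetical. The result is NC.AS2_- (with the zero byte shown as a dot), which resembles fragments of the file name in the object dump (it ends in INC and begins with MAS2_-). That suggests, without proving it, that the vptr now points at reused heap data rather than at a real virtual table.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Byte number `index` (0 = lowest address) of a 32-bit little-endian word.
constexpr unsigned char byteAt(std::uint32_t word, int index) {
    return static_cast<unsigned char>((word >> (8 * index)) & 0xffu);
}

// Render dumped words as text; non-printable bytes are shown as '.'.
inline std::string wordsToAscii(const std::vector<std::uint32_t>& words) {
    std::string out;
    for (std::uint32_t word : words) {
        for (int i = 0; i < 4; ++i) {
            const unsigned char b = byteAt(word, i);
            out += (b >= 0x20 && b < 0x7f) ? static_cast<char>(b) : '.';
        }
    }
    return out;
}
```

Calling wordsToAscii({0x4100434e, 0x2d5f3253}) returns the eight-character string quoted above.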
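The article does not show the faulty locking code or the actual fix, so the following is only one possible way to avoid this class of bug, under stated assumptions. File and Node are hypothetical stand-ins for ChargingFile and ChargingNode, and the use of std::shared_ptr is a swapped-in technique, not the author's change. The idea is that the reader copies a shared_ptr while holding the lock, so the object cannot be destroyed while the virtual call is in progress even if another thread removes it from the queue.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <memory>
#include <mutex>
#include <utility>

// Hypothetical stand-in for ChargingFile.
struct File {
    virtual ~File() = default;
    virtual bool canDecoded() const { return decodeFlag; }
    bool decodeFlag = false;
};

// Hypothetical stand-in for ChargingNode: a queue of files guarded by a mutex.
class Node {
public:
    void add(std::shared_ptr<File> file) {
        assert(file != nullptr);
        std::lock_guard<std::mutex> guard(mutex_);
        files_.push_back(std::move(file));
    }

    // Reader: take a reference under the lock, call the virtual function after.
    bool frontCanDecoded() {
        std::shared_ptr<File> front;
        {
            std::lock_guard<std::mutex> guard(mutex_);
            if (files_.empty()) return false;
            front = files_.front();      // keeps the object alive past the lock
        }
        return front->canDecoded();
    }

    // Writer: removal uses the same mutex; the File is destroyed only when
    // the last shared_ptr that refers to it goes away.
    void removeFront() {
        std::lock_guard<std::mutex> guard(mutex_);
        if (!files_.empty()) files_.pop_front();
    }

    std::size_t size() {
        std::lock_guard<std::mutex> guard(mutex_);
        return files_.size();
    }

private:
    std::mutex mutex_;
    std::deque<std::shared_ptr<File>> files_;
};
```

The alternative is to keep raw pointers and hold the same mutex across both the lookup and the call in every thread that reads or frees the object. That keeps ownership simple but lengthens the critical section.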

 

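Summary:

1. When the function address at the top of the call stack is 0, consider that the virtual table may have been corrupted, i.e. that the object itself has been corrupted.

2. Know the common x86_64 calling convention: rdi holds the first function argument.

3. In multithreaded code, pay close attention to lock scope: too narrow a scope may not protect enough, while too wide a scope can hurt performance.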
