Does monitor_child in the OTP supervisor have a bug?

Problem description

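To guard against a naughty child breaking the link from its own end, the OTP supervisor switches from a link to a monitor before shutting a child down: it first sets up a monitor on the child and then calls unlink(Child). After that, the child can no longer cancel the supervisor's supervision of it. This switch is implemented by monitor_child/1: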

    %% Help function to shutdown/2 switches from link to monitor approach
    monitor_child(Pid) ->

        %% Do the monitor operation first so that if the child dies
        %% before the monitoring is done causing a 'DOWN'-message with
        %% reason noproc, we will get the real reason in the 'EXIT'-message
        %% unless a naughty child has already done unlink...
        erlang:monitor(process, Pid),
        unlink(Pid),

        receive
            %% If the child dies before the unlink we must empty
            %% the mail-box of the 'EXIT'-message and the 'DOWN'-message.
            {'EXIT', Pid, Reason} ->
                receive
                    {'DOWN', _, process, Pid, _} ->
                        {error, Reason}
                end
        after 0 ->
            %% If a naughty child did unlink and the child dies before
            %% monitor the result will be that shutdown/2 receives a
            %% 'DOWN'-message with reason noproc.
            %% If the child should die after the unlink there
            %% will be a 'DOWN'-message with a correct reason
            %% that will be handled in shutdown/2.
            ok
        end.
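
The comments in monitor_child/1 say that the ok case is completed in shutdown/2. The following is a minimal sketch of that hand-off, not the OTP source: the module name monitor_sketch and the functions shutdown_sketch/1 and demo/0 are hypothetical names made up for illustration, and only a brutal-kill style shutdown is assumed. monitor_child/1 is copied in so that the module compiles on its own.

```erlang
-module(monitor_sketch).
-export([demo/0]).

%% Same logic as monitor_child/1 in the article: monitor first, then unlink,
%% then drain an 'EXIT' message that may already be in the mailbox.
monitor_child(Pid) ->
    erlang:monitor(process, Pid),
    unlink(Pid),
    receive
        {'EXIT', Pid, Reason} ->
            receive
                {'DOWN', _, process, Pid, _} ->
                    {error, Reason}
            end
    after 0 ->
        ok
    end.

%% Hypothetical, simplified caller (an assumption, not supervisor:shutdown/2).
%% If the child was still alive at the switch, kill it and wait for the
%% 'DOWN' message produced by the monitor set up in monitor_child/1.
shutdown_sketch(Pid) ->
    case monitor_child(Pid) of
        ok ->
            exit(Pid, kill),
            receive
                {'DOWN', _MRef, process, Pid, killed} ->
                    ok;
                {'DOWN', _MRef, process, Pid, OtherReason} ->
                    {error, OtherReason}
            end;
        {error, Reason} ->
            %% The child had already died; both messages were drained.
            {error, Reason}
    end.

%% Start a linked child that blocks forever, then shut it down.
demo() ->
    process_flag(trap_exit, true),
    Child = spawn_link(fun() -> receive stop -> ok end end),
    shutdown_sketch(Child).
```

In this sketch demo/0 takes the ok branch, because the child is alive when the monitor is created, and the reason killed then arrives in the 'DOWN' message rather than in an 'EXIT' message.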

However, something looks odd here. Right after the unlink, monitor_child/1 contains this curious piece of code:

        receive
            %% If the child dies before the unlink we must empty
            %% the mail-box of the 'EXIT'-message and the 'DOWN'-message.
            {'EXIT', Pid, Reason} ->
                receive
                    {'DOWN', _, process, Pid, _} ->
                        {error, Reason}
                end
        after 0 ->

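The comment means that if the child has already died before the unlink, the Reason in the 'EXIT' message is the real one, whereas the 'DOWN' message produced afterwards by monitor/2 carries noproc, because the target process can no longer be found. This seems to leave a gap: because of the `after 0` clause, the receive scans the mailbox once and returns immediately, yet an 'EXIT' message caused by the link before the unlink might not have arrived at that moment. If it arrived later, it would stay in the supervisor's mailbox unhandled.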

Resolution

Does the supervisor really have this problem? The Erlang/OTP documentation describes unlink/1 as follows:

Once unlink(Id) has returned it is guaranteed that the link between the caller and the entity referred to by Id has no effect on the caller in the future (unless the link is setup again). If caller is trapping exits, an {'EXIT', Id, _} message due to the link might have been placed in the caller's message queue prior to the call, though. Note, the {'EXIT', Id, _} message can be the result of the link, but can also be the result of Id calling exit/2. Therefore, it may be appropriate to cleanup the message queue when trapping exits after the call to unlink(Id), as follow:

    unlink(Id),
    receive
        {'EXIT', Id, _} ->
            true
    after 0 ->
            true
    end

Note:

Prior to OTP release R11B (erts version 5.5) unlink/1 behaved completely asynchronous, i.e., the link was active until the "unlink signal" reached the linked entity. This had one undesirable effect, though. You could never know when you were guaranteed not to be effected by the link.

Current behavior can be viewed as two combined operations: asynchronously send an "unlink signal" to the linked entity and ignore any future results of the link.

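The last sentence of the note shows that, in current versions, the semantics of unlink/1 cover more than breaking the link: the caller also ignores any later result of that link, which means it will not receive an 'EXIT' message caused by it. So if the mailbox still holds an 'EXIT' message from the link after unlink/1 returns, that message must have arrived before unlink/1 took effect. In other words, no 'EXIT' message from the link can arrive after unlink/1, so the scenario analysed above, in which a late 'EXIT' message is left behind in the mailbox, cannot occur.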
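
This guarantee can be checked with a small experiment. The module below is a sketch under stated assumptions, not part of OTP: the module name unlink_demo, the function run/0, the exit reason boom and the 100 ms wait are all made up for illustration. It unlinks from a live child, flushes the mailbox as the documentation suggests, then lets the child die and checks that no 'EXIT' message shows up.

```erlang
-module(unlink_demo).
-export([run/0]).

run() ->
    %% Trap exits so that a link-triggered exit would arrive as a message.
    process_flag(trap_exit, true),
    Child = spawn_link(fun() -> receive die -> exit(boom) end end),

    %% Break the link, then flush the mailbox as the documentation suggests.
    unlink(Child),
    receive
        {'EXIT', Child, _} -> true
    after 0 ->
        true
    end,

    %% Use a monitor to learn when the child has really terminated.
    MRef = erlang:monitor(process, Child),
    Child ! die,
    receive
        {'DOWN', MRef, process, Child, Reason} ->
            boom = Reason
    end,

    %% The child is dead now. If the old link still had any effect,
    %% an 'EXIT' message would be delivered to this process.
    receive
        {'EXIT', Child, _} -> exit_leaked
    after 100 ->
        no_exit
    end.
```

Calling unlink_demo:run() is expected to return no_exit: the real exit reason is reported only through the 'DOWN' message, which is the situation the second half of the comments in monitor_child/1 describes.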

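At first I wondered how the Erlang implementation could be so careless. It turns out that although the Erlang code looks simple, the underlying runtime has in fact been designed with many of these issues carefully considered.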