ResourceManager exits on a CapacityScheduler NPE and triggers a failover

I. Problem description

On a YARN 2.0 cluster, the ResourceManager on master2 went down, which triggered a ResourceManager failover.

II. Analysis

1) Check the ResourceManager log on master2

2016-06-26 12:35:41,504 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=warehouse        OPERATION=AM Released Container TARGET=SchedulerApp     RESULT=SUCCESS  APPID=application_1466451117456_12139   CONTAINERID=container_1466451117456_12139_02_000001
2016-06-26 12:35:41,504 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Updating application attempt appattempt_1466451117456_12139_000002 with final state: FAILED, and exit status: -100
2016-06-26 12:35:41,504 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1466451117456_12139_000002 State change from ALLOCATED to FINAL_SAVING
2016-06-26 12:35:41,504 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Unregistering app attempt : appattempt_1466451117456_12139_000002
2016-06-26 12:35:41,504 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type CONTAINER_EXPIRED to the scheduler
java.lang.NullPointerException
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1664)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainer(CapacityScheduler.java:1231)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1117)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:114)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:686)
        at java.lang.Thread.run(Thread.java:724)
2016-06-26 12:35:41,504 INFO org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: Application finished, removing password for appattempt_1466451117456_12139_000002
2016-06-26 12:35:41,504 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
2016-06-26 12:35:41,504 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1466451117456_12139_000002 State change from FINAL_SAVING to FAILED
2016-06-26 12:35:41,504 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: The number of failed attempts is 0. The max attempts is 2
2016-06-26 12:35:41,505 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Registering app attempt : appattempt_1466451117456_12139_000003
2016-06-26 12:35:41,505 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1466451117456_12139_000003 State change from NEW to SUBM

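The log shows that a NullPointerException in the CapacityScheduler, raised while handling a CONTAINER_EXPIRED event, caused the ResourceManager to exit. The exit itself is a safety mechanism: it keeps a scheduler exception from leaving the ResourceManager running but unusable from then on.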
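The fail-fast behavior described above can be sketched roughly as follows. This is a minimal illustration, not the Hadoop source: the class FailFastEventProcessor, its submit method, and the onFatal hook are hypothetical names, and events are reduced to plain strings.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

// Hypothetical sketch of a fail-fast event loop; not the ResourceManager code.
final class FailFastEventProcessor implements Runnable {
    private final BlockingQueue<String> events = new LinkedBlockingQueue<>();
    private final Consumer<String> handler; // stands in for the scheduler
    private final Runnable onFatal;         // what to do when the handler fails
    private volatile boolean stopped = false;

    FailFastEventProcessor(Consumer<String> handler, Runnable onFatal) {
        this.handler = handler;
        this.onFatal = onFatal;
    }

    void submit(String eventType) {
        events.add(eventType);
    }

    @Override
    public void run() {
        while (!stopped) {
            String eventType;
            try {
                eventType = events.take();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            try {
                handler.accept(eventType);
            } catch (Throwable t) {
                // Any failure in the handler is treated as fatal: stop
                // processing and hand control to the exit hook.
                System.err.println("Error in handling event type " + eventType + ": " + t);
                stopped = true;
                onFatal.run();
            }
        }
    }
}
```

In a real daemon the onFatal hook would typically end the process, for example with () -> System.exit(-1), so that a standby instance can take over instead of a half-broken one staying active.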

2) The likely cause is the CapacityScheduler's asynchronous scheduling. The relevant source (org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler) is:

static void schedule(CapacityScheduler cs) {
  // First randomize the start point
  int current = 0;
  Collection<FiCaSchedulerNode> nodes = cs.nodeTracker.getAllNodes();
  int start = random.nextInt(nodes.size());
  // While this loop runs, nodes may already have been modified by another thread
  for (FiCaSchedulerNode node : nodes) {
    if (current++ >= start) {
      cs.allocateContainersToNode(node);
    }
  }
  // Now, just get everyone to be safe
  for (FiCaSchedulerNode node : nodes) {
    cs.allocateContainersToNode(node);
  }
  try {
    Thread.sleep(cs.getAsyncScheduleInterval());
  } catch (InterruptedException e) {}
}
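The hazard named in the comment can be shown with a small standalone example. It assumes the collection being iterated is a live view of a shared map; whether that holds for a given Hadoop version is not verified here. The class LiveViewVsSnapshot and the node names are hypothetical, and the removal that another thread would perform is done inline so the result is deterministic.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration; names are not taken from Hadoop.
final class LiveViewVsSnapshot {
    /** Returns {size of live view, size of snapshot} after one node is removed. */
    static int[] sizesAfterRemoval() {
        Map<String, String> tracker = new ConcurrentHashMap<>();
        tracker.put("node1", "RUNNING");
        tracker.put("node2", "RUNNING");
        tracker.put("node3", "RUNNING");

        // A live view keeps following the map; a copy is frozen at copy time.
        Collection<String> liveView = tracker.values();
        List<String> snapshot = new ArrayList<>(tracker.values());

        // Stands in for another thread removing a node while scheduling runs.
        tracker.remove("node2");

        return new int[] { liveView.size(), snapshot.size() };
    }

    public static void main(String[] args) {
        int[] sizes = sizesAfterRemoval();
        System.out.println("live view: " + sizes[0] + ", snapshot: " + sizes[1]);
    }
}
```

With a live view, a size read before the loop can disagree with what the loop later sees. A snapshot iterates stably, but it may still contain a node that has since been removed, so code that looks up per-node state afterwards has to tolerate a missing entry.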

III. Solution

Edit capacity-scheduler.xml to disable asynchronous scheduling:

    <property>
        <name>yarn.scheduler.capacity.schedule-asynchronously.enable</name>
        <value>false</value>
    </property>


The change takes effect only after the ResourceManager is restarted.
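Before restarting, it can help to confirm that the file really carries the intended value. The following is a rough check using only the JDK's XML parser rather than Hadoop's own configuration classes; the class AsyncSchedulingCheck and its method are hypothetical, and it returns the first matching property without applying any of Hadoop's override rules.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper; reads a Hadoop-style <property> list with the JDK parser.
final class AsyncSchedulingCheck {
    static final String KEY = "yarn.scheduler.capacity.schedule-asynchronously.enable";

    /** Returns the first value configured for name, or null if it is absent. */
    static String propertyValue(String xml, String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList props = doc.getElementsByTagName("property");
            for (int i = 0; i < props.getLength(); i++) {
                Element prop = (Element) props.item(i);
                NodeList names = prop.getElementsByTagName("name");
                NodeList values = prop.getElementsByTagName("value");
                if (names.getLength() > 0 && values.getLength() > 0
                        && name.equals(names.item(0).getTextContent().trim())) {
                    return values.item(0).getTextContent().trim();
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse configuration XML", e);
        }
    }
}
```

Called with the contents of capacity-scheduler.xml and AsyncSchedulingCheck.KEY, it should return "false" once the edit above is in place.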

This article is from the “散人” blog; please keep this source link: http://zouqingyun.blog.51cto.com/782246/1878530
