首页 > 代码库 > HA模式强制手动切换:IPC's epoch [X] is less than the last promised epoch [X+1]

HA模式强制手动切换:IPC's epoch [X] is less than the last promised epoch [X+1]

016-11-30 21:13:14,637 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Remote journal 192.168.58.183:8485 failed to write txns 54560-54560. Will try to write to this JN again after the next log roll.    at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:414)    at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:442)    at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:342)    at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:148)    at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)    at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:975)    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2040)    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2036)    at java.security.AccessController.doPrivileged(Native Method)    at javax.security.auth.Subject.doAs(Subject.java:422)    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1656)    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2034)    at org.apache.hadoop.ipc.Client.call(Client.java:1469)    at org.apache.hadoop.ipc.Client.call(Client.java:1400)    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)    at com.sun.proxy.$Proxy10.journal(Unknown Source)    at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolTranslatorPB.journal(QJournalProtocolTranslatorPB.java:167)    at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:385)    at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:378)    at java.util.concurrent.FutureTask.run(FutureTask.java:266)    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)    at java.lang.Thread.run(Thread.java:745)2016-11-30 21:13:14,800 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Remote journal 192.168.58.181:8485 failed to write txns 54560-54560. Will try to write to this JN again after the next log roll.org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPCs epoch 11 is less than the last promised epoch 12    at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:414)    at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:442)    at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:342)    at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:148)    at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)    at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:975)    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2040)    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2036)    at java.security.AccessController.doPrivileged(Native Method)    at javax.security.auth.Subject.doAs(Subject.java:422)    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1656)    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2034)    at org.apache.hadoop.ipc.Client.call(Client.java:1469)    at org.apache.hadoop.ipc.Client.call(Client.java:1400)    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)    at com.sun.proxy.$Proxy10.journal(Unknown Source)    at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolTranslatorPB.journal(QJournalProtocolTranslatorPB.java:167)    at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:385)    at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:378)    at java.util.concurrent.FutureTask.run(FutureTask.java:266)    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)    at java.lang.Thread.run(Thread.java:745)2016-11-30 21:13:14,812 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Remote journal 192.168.58.182:8485 failed to write txns 54560-54560. Will try to write to this JN again after the next log roll.org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPCs epoch 11 is less than the last promised epoch 12    at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:414)    at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:442)

 一、错误起因

  Active NameNode日志出现异常IPC‘s epoch [X] is less than the last promised epoch [X+1],出现短期的双Active

  我配置的ha自动切换,但是发现STandByNameNode是active,我强制手动切换了三次,STandByNameNode就无法访问了,估计是这个问题。

 二.内部原因

  【HDFS机制】:该问题属于hdfs对于脑列的异常保护,属于正常行为,不影响业务。

  1)ZKFC1对NameNode1(Active)进行健康检查,因为长时间监控不到NN1的回复,认为该NameNode1不健康,主动释放zk中的ActiveStandbyElectorLock,此时NN1还是active(因为zkfc与NameNode1连接异常,不能将其shutdown)。

  

zkfc log:  2014-06-16 02:11:02,720 WARN org.apache.hadoop.ha.HealthMonitor: Transport-level exception trying to monitor health of NameNode at namenode01/172.21.248.14:9005: Call From namenode01/1  72.21.248.14 to namenode02:9005 failed on socket timeout exception: java.net.SocketTimeoutException: 45000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.21.248.14:47271 remote=namenode01/172.21.248.14:9005]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout2014-06-16 02:12:12,825 WARN org.apache.hadoop.ha.FailoverController: Unable to gracefully make NameNode at namenode02/172.21.248.13:9005 standby (unable to connect)java.net.SocketTimeoutException: Call From namenode01/172.21.248.14 to namenode02:9005 failed on socket timeout exception: java.net.SocketTimeoutException: 5000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.21.248.14:59156 remote=namenode02/172.21.248.13:9005]; For more details see:  http://wiki.apache.org/hadoop/SocketTimeout

 

  2)ZKFC2在zk中竞争到ActiveStandbyElectorLock,将NameNode2(原来的Standby)变成Active,同时会更新JN中的epoch使其+1。

  3)NameNode1(原先的Active)再次去操作JournalNode的editlog时发现自己的epoch比JN的epoch小1,促使自己重启,成为Standby NameNode。

NN1 log:2014-08-26 12:20:59,017 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [10.1.1.107:8485, 192.10.1.208:8485,192.10.1.209:8485], stream=QuorumOutputStream starting at txid 22795230)) org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 2/3. 3 exceptions thrown: 192.10.1.208:8485: IPCs epoch 115 is less than the last promised epoch 116 

三.解决方案

  可以在core-site.xml文件中修改ha.health-monitor.rpc-timeout.ms参数值,来扩大zkfc监控检查超时时间。

<property><name>ha.health-monitor.rpc-timeout.ms</name><value>180000</value></property>

  

 四、结束语

  最后设置成手动切换吧...其实可以通过zookeeper来找到那个是active,我先不这么做吧。在hdfs-site.xml。

  但是设置成不自动切换的话,zkfc就取法启动,hbase必须用自己的zookeeper。

HA模式强制手动切换:IPC's epoch [X] is less than the last promised epoch [X+1]