Problems with distcp on hadoop 2.4.0
Recently, while helping a business team migrate data from hadoop 0.20.203 to hadoop 2.4.0, distcp threw several errors. I am recording them here:
1. Permission denied error
15/01/06 10:48:37 ERROR tools.DistCp: Unable to cleanup meta folder: /DistCp
org.apache.hadoop.security.AccessControlException: Permission denied: user=weibo_bigdata_uquality, access=WRITE, inode="/":hadoop:supergroup:drwxr-xr-x
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:274)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:260)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:241)
Solution: change the permissions on the /DistCp directory
hadoop fs -chmod 777 /DistCp
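Since chmod 777 opens /DistCp to every user on the cluster, a slightly narrower variant (just a sketch, assuming you can run commands as the HDFS superuser and that the copying user is the one named in the error message) is to hand the meta folder over to that user instead:
# Run as the HDFS superuser ("hadoop" in the error above).
hadoop fs -chown weibo_bigdata_uquality /DistCp   # let the DistCp user own its meta folder
hadoop fs -chmod 755 /DistCp
hadoop fs -ls / | grep DistCp                     # confirm the new owner and mode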
One thing still puzzles me about this error: I did pass the -log option to distcp and pointed the log directory at a user directory I can write to, yet the error above was still thrown. Judging from the message, the meta folder (/DistCp) is apparently managed separately from the log directory.
2. HTTP 401 error
15/01/06 10:48:37 ERROR tools.DistCp: Exception encountered
java.io.IOException: Server returned HTTP response code: 401 for URL:
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1459)
at org.apache.hadoop.hdfs.web.HftpFileSystem$LsParser.fetchList(HftpFileSystem.java:462)
at org.apache.hadoop.hdfs.web.HftpFileSystem$LsParser.getFileStatus(HftpFileSystem.java:474)
at org.apache.hadoop.hdfs.web.HftpFileSystem.getFileStatus(HftpFileSystem.java:503)
at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:57)
at org.apache.hadoop.fs.Globber.glob(Globber.java:248)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1623)
at org.apache.hadoop.tools.GlobbedCopyListing.doBuildListing(GlobbedCopyListing.java:77)
at org.apache.hadoop.tools.CopyListing.buildListing(CopyListing.java:80)
at org.apache.hadoop.tools.DistCp.createInputFileListing(DistCp.java:327)
at org.apache.hadoop.tools.DistCp.execute(DistCp.java:151)
at org.apache.hadoop.tools.DistCp.run(DistCp.java:118)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.tools.DistCp.main(DistCp.java:373)
This error turned out to be related to the access control setup of our Hadoop clusters; it went away after ugi and extrahosts were configured on the hadoop 1.0 (source) side. A quick listing over hftp, as sketched below, is an easy way to confirm the source answers again before rerunning the full copy.
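This is only a sanity check, not part of the fix; the host, port and path are the same placeholders used in the final command at the end of this post.
# Run from the hadoop 2.4.0 (destination) cluster. If the ugi/extrahosts
# change took effect, this prints the source listing instead of HTTP 401.
hadoop fs -ls hftp://example1:50070/${path}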
3. Checksum mismatch error
With the two errors above out of the way, distcp finally started, but the copy then failed with a checksum mismatch:
2015-01-06 11:32:37,604 ERROR [main] org.apache.hadoop.tools.util.RetriableCommand: Failure in Retriable command: Copying ... to ....
java.io.IOException: Check-sum mismatch between ..... and .....
at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.compareCheckSums(RetriableFileCopyCommand.java:190)
at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doCopy(RetriableFileCopyCommand.java:125)
at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doExecute(RetriableFileCopyCommand.java:95)
at org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:87)
at org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:258)
at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:229)
at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:45)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1550)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
A quick search online turned up the following explanation and workaround:
Running distcp on a CDH4 YARN cluster with a CDH3 hftp source will fail if the CRC checksum type being used is the CDH4 default (CRC32C). This is because the default checksum type was changed in CDH4 from the CDH3 default of CRC32.
You can work around this issue by changing the CRC checksum type on the CDH4 cluster to the CDH3 default, CRC32. To do this set dfs.checksum.type to CRC32 in hdfs-site.xml.
In other words, hadoop 1.0 uses CRC32 as its checksum type, while hadoop 2.0 changed the default to CRC32C, so the checksums computed on the two sides can never match. The way out is to make the hadoop 2.0 side use CRC32 as well.
Source: http://blog.csdn.net/map_lixiupeng/article/details/27542625
Set the dfs.checksum.type parameter to CRC32.
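The property can be set cluster-wide in hdfs-site.xml, as the quoted workaround suggests, or passed per job with -D as in the command below; I used the per-job form. To double-check what the 2.4.0 client is currently using (CRC32C is the hadoop 2.x default), getconf should print the effective value:
# Shows the checksum type the client-side configuration resolves to.
hdfs getconf -confKey dfs.checksum.type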
The final command I submitted:
hadoop distcp -Ddfs.checksum.type=CRC32 -log /user/${user_name}/DistCp hftp://example1:50070/${path} hdfs://example2:8020/${path}
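If changing the checksum type is not possible, distcp also offers a -skipcrccheck switch (it has to be combined with -update) that skips the post-copy checksum comparison altogether. That avoids this error but also gives up the integrity check, and -update changes the copy semantics slightly, so I stuck with the -D approach above. A hypothetical invocation would look like:
hadoop distcp -update -skipcrccheck -log /user/${user_name}/DistCp hftp://example1:50070/${path} hdfs://example2:8020/${path}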