
Hadoop on Mac with IntelliJ IDEA: 1 Fixing the "input path does not exist" problem

This post describes one way to fix Hadoop's "input path does not exist" error when running a job from IntelliJ IDEA.

Environment: Mac OS X 10.9.5, IntelliJ IDEA 13.1.4, Hadoop 1.2.1

Hadoop runs inside a virtual machine, which the host connects to over SSH; the IDE and the data files live on the host.

This is day three of teaching myself Hadoop. I have done a bit of .NET development before, so Mac, IntelliJ IDEA, Hadoop, and CentOS are all quite new to me, and my very first piece of Hadoop code ran into trouble.

The following code is the first listing in Chapter 4 of Hadoop in Action.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJob extends Configured implements Tool {
    public static class MapClass extends MapReduceBase
            implements Mapper<Text, Text, Text, Text> {
        @Override
        public void map(Text key, Text value, OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            output.collect(value, key);
        }
    }

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {
        @Override
        public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            String csv = "";
            while (values.hasNext()) {
                if (csv.length() > 0) {
                    csv += ", ";
                }
                csv += values.next().toString();
            }
            output.collect(key, new Text(csv));
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration configuration = getConf();

        JobConf job = new JobConf(configuration, MyJob.class);

        Path in = new Path(args[0]);
        Path out = new Path(args[1]);

        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);

        job.setJobName("MyJob");
        job.setMapperClass(MapClass.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormat(KeyValueTextInputFormat.class);
        job.setOutputFormat(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.set("key.value.separator.in.input.line", ",");

        JobClient.runJob(job);

        return 0;
    }

    public static void main(String[] args) {
        try {
            int res = ToolRunner.run(new Configuration(), new MyJob(), args);
            System.exit(res);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

The main method adds a try/catch; everything else is identical to the book.

Since I don't know how to package a jar yet, I ran the code directly from IDEA. My data file sits in a different directory than the book's, so the program arguments differ slightly from the original:

/Users/michael/Desktop/Hadoop/HadoopInAction/cite75_99.txt output

The IDEA run configuration is shown in the figure.

The data file path is shown in the figure.

None of the settings above are misspelled. I then happily hit "Run MyJob.main()", ready to wait for the result and keep following the book.

Then disaster: IDEA printed input path does not exist. The input path it complained about was /Users/michael/IdeaProjects/Hadoop/Users/michael/Desktop/Hadoop/HadoopInAction/cite75_99.txt, which is just the working directory with my first argument tacked onto it. What is going on?

In the whole program only the run method deals with Path, so the problem should be there.

I added System.out.println(FileInputFormat.getInputPaths(job)[0].toUri()); after FileOutputFormat.setOutputPath(job, out); and, sure enough, the input path really had been merged under the working directory. No wonder it failed. (Someone on StackOverflow said this error shows up when the data file hasn't been put onto Hadoop beforehand. In truth, I only know how to submit a job; I don't yet know how to make the program read a file after submission, which is a pity.)
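For context, this is where the debug print sits inside run() (the statements themselves are unchanged from the listing above):

    FileInputFormat.setInputPaths(job, in);
    FileOutputFormat.setOutputPath(job, out);
    // debug: print the first input path the job will actually use
    System.out.println(FileInputFormat.getInputPaths(job)[0].toUri());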

At this point the culprit can be narrowed down to FileInputFormat.setInputPaths(job, in);. Let's look at the source to see how it works.

  /**
   * Set the array of {@link Path}s as the list of inputs
   * for the map-reduce job.
   *
   * @param conf Configuration of the job.
   * @param inputPaths the {@link Path}s of the input directories/files
   * for the map-reduce job.
   */
  public static void setInputPaths(JobConf conf, Path... inputPaths) {
    Path path = new Path(conf.getWorkingDirectory(), inputPaths[0]);
    StringBuffer str = new StringBuffer(StringUtils.escapeString(path.toString()));
    for(int i = 1; i < inputPaths.length;i++) {
      str.append(StringUtils.COMMA_STR);
      path = new Path(conf.getWorkingDirectory(), inputPaths[i]);
      str.append(StringUtils.escapeString(path.toString()));
    }
    conf.set("mapred.input.dir", str.toString());
  }
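The key expression is new Path(conf.getWorkingDirectory(), inputPaths[i]), which resolves each input path against the job's working directory. Here is a minimal standalone sketch of that resolution behaviour (my own illustration, not from the book or the Hadoop source; the paths are just example values):

    import org.apache.hadoop.fs.Path;

    // Sketch only: shows how Path resolution glues a parent directory onto a
    // relative child path, which is what setInputPaths does with the job's
    // working directory (example paths, for illustration).
    public class PathResolveDemo {
        public static void main(String[] args) {
            Path workingDir = new Path("/Users/michael/IdeaProjects/Hadoop"); // example working directory
            Path input = new Path("cite75_99.txt");                           // example relative input path
            Path resolved = new Path(workingDir, input);
            // prints: /Users/michael/IdeaProjects/Hadoop/cite75_99.txt
            System.out.println(resolved.toUri());
        }
    }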

As you can see, the very first statement of the source merges the working directory with the input path (the sketch above shows the same behaviour). Since the working directory is what gets merged in, the fix is simply to take it out of the way.

Before FileInputFormat.setInputPaths(job, in);, save the current working directory:

  Path workingDirectoryBak = job.getWorkingDirectory();

Then set the working directory to the root:

  job.setWorkingDirectory(new Path("/"));

After the call, set it back:

  job.setWorkingDirectory(workingDirectoryBak);

Finally, add a print to confirm the result:

  System.out.println(FileInputFormat.getInputPaths(job)[0].toUri());
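Since the setInputPaths source shown above simply writes the comma-separated result into the mapred.input.dir property, an equivalent check (my own variation, not in the original code) would be to read that property back:

    // equivalent check: setInputPaths stores the resolved, comma-separated
    // input paths in "mapred.input.dir"
    System.out.println(job.get("mapred.input.dir"));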

The new code is below. The input method on the Mac is awkward to use, so I wrote the comments directly in rough English.

    public int run(String[] args) throws Exception {
        Configuration configuration = getConf();

        JobConf job = new JobConf(configuration, MyJob.class);

        Path in = new Path(args[0]);
        Path out = new Path(args[1]);

        // backup current directory, namely /Users/michael/IdeaProjects/Hadoop where source located
        Path workingDirectoryBak = job.getWorkingDirectory();
        // set to root dir
        job.setWorkingDirectory(new Path("/"));
        // let it combine root and input path
        FileInputFormat.setInputPaths(job, in);
        // set it back
        job.setWorkingDirectory(workingDirectoryBak);
        // print to confirm
        System.out.println(FileInputFormat.getInputPaths(job)[0].toUri());

        FileOutputFormat.setOutputPath(job, out);

        job.setJobName("MyJob");
        job.setMapperClass(MapClass.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormat(KeyValueTextInputFormat.class);
        job.setOutputFormat(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.set("key.value.separator.in.input.line", ",");

        JobClient.runJob(job);

        return 0;
    }

I tried again and it ran fine, finishing in almost a minute. That's what you get with a low-spec setup.
