Running Your First MapReduce Program on CentOS
Before following the steps in this article you need a working Hadoop environment. For easy experimentation a single-node deployment is enough; for details see: Creating a Single-Node Hadoop 1.2.1 Environment on CentOS 6.5.
Writing the Source Code
The program parses NCDC weather data and finds the year with the highest recorded temperature; it is built with Maven. Only the Mapper, the Reducer, and the main driver are shown below. For the complete project, see:
https://github.com/Eric-aihua/practise/tree/master/hadoopMapper
Mapper

package com.eric.hadoop.map;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MaxTemperatureMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    // NCDC records use 9999 for a missing temperature reading.
    private static final int MISSING = 9999;

    public void map(LongWritable fileOffset, Text lineRecord,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        System.out.println("##Processing Record:" + lineRecord.toString());
        String line = lineRecord.toString();
        // The record is fixed-width: year at columns 15-18, signed temperature
        // (tenths of a degree Celsius) at 87-91, quality code at column 92.
        String year = line.substring(15, 19);
        int temperature;
        if (line.charAt(87) == '+') {
            // Skip the leading '+' sign before parsing.
            temperature = Integer.parseInt(line.substring(88, 92));
        } else {
            temperature = Integer.parseInt(line.substring(87, 92));
        }
        String quality = line.substring(92, 93);
        // Emit (year, temperature) only for valid, non-missing readings.
        if (temperature != MISSING && quality.matches("[01459]")) {
            output.collect(new Text(year), new IntWritable(temperature));
        }
    }
}
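The mapper depends entirely on those fixed-width offsets. As a quick sanity check, here is a small standalone sketch (my own illustration, not part of the project) that builds a synthetic record and applies the same parsing logic:

public class ParseCheck {
    public static void main(String[] args) {
        // Build a synthetic 95-character record with only the fields the mapper reads.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 95; i++) {
            sb.append('0');
        }
        sb.replace(15, 19, "1902");  // year field (columns 15-18)
        sb.replace(87, 92, "+0244"); // signed temperature, tenths of a degree (87-91)
        sb.replace(92, 93, "1");     // quality code (column 92)
        String line = sb.toString();

        // Same parsing as MaxTemperatureMapper.
        String year = line.substring(15, 19);
        int temperature = (line.charAt(87) == '+')
                ? Integer.parseInt(line.substring(88, 92))
                : Integer.parseInt(line.substring(87, 92));
        System.out.println(year + " -> " + temperature); // prints: 1902 -> 244
    }
}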
Reducer
package com.eric.hadoop.reduce;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class MaxTemperatureReduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text year, Iterator<IntWritable> temperatures,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        // Scan all temperatures for this year and keep the maximum.
        int maxTemperature = Integer.MIN_VALUE;
        while (temperatures.hasNext()) {
            maxTemperature = Math.max(maxTemperature, temperatures.next().get());
        }
        output.collect(year, new IntWritable(maxTemperature));
    }
}
Main
package com.eric.hadoop.jobconfig;

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

import com.eric.hadoop.map.MaxTemperatureMapper;
import com.eric.hadoop.reduce.MaxTemperatureReduce;

public class MaxTemperature {

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("Usage: MaxTemperature <input path> <output path>");
            System.exit(-1); // exit with a non-zero status on bad usage
        }
        JobConf conf = new JobConf(MaxTemperature.class);
        conf.setJobName("Get Max Temperature!");
        FileInputFormat.addInputPaths(conf, args[0]);
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        conf.setMapperClass(MaxTemperatureMapper.class);
        conf.setReducerClass(MaxTemperatureReduce.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        JobClient.runJob(conf);
    }
}
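One optional refinement, not in the original project: because taking a maximum is associative and commutative, the reducer can also be registered as a combiner, so each map task pre-aggregates its local maximum before the shuffle (the Combine input records=0 counter in the log below shows no combiner was set). A single extra line in main would enable it:

// Optional: reuse the reducer as a combiner to cut shuffle traffic.
conf.setCombinerClass(MaxTemperatureReduce.class);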
Building the Jar File
Enter the project directory and run:
mvn install
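The authoritative pom.xml is in the GitHub project above. Roughly, it needs a Hadoop dependency and a manifest entry for the driver class, since the hadoop jar invocation below names no main class. A sketch only; the 1.2.1 version is an assumption based on the setup guide linked at the top:

<!-- Sketch; see the GitHub project for the actual pom.xml. -->
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-core</artifactId>
    <version>1.2.1</version> <!-- assumed to match the single-node guide -->
</dependency>

<!-- The jar is launched without an explicit class name, so the manifest
     must declare the driver as Main-Class. -->
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-jar-plugin</artifactId>
    <configuration>
        <archive>
            <manifest>
                <mainClass>com.eric.hadoop.jobconfig.MaxTemperature</mainClass>
            </manifest>
        </archive>
    </configuration>
</plugin>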
A successful build produces a jar named hadoop-0.0.1-SNAPSHOT.jar.
Getting the Test Data
You can use the data from the GitHub project above, or download it from the internet at: https://github.com/tomwhite/hadoop-book/tree/master/input/ncdc/all
Assume the downloaded data file is named 1902 and is to be placed in a testdata directory on HDFS. Note that the job's output directory must not be created in advance: FileOutputFormat fails a job whose output directory already exists, so only testdata is created here.
hadoop dfs -mkdir testdata
hadoop dfs -put 1902 testdata
Running the Job
hadoop jar hadoop-0.0.1-SNAPSHOT.jar testdata/1902 output
Observing the Results
Via the JobTracker web console (by default at http://localhost:50030 in Hadoop 1.x).
Via the command-line output:
[hadoop@localhost ~]$ hadoop jar hadoop-0.0.1-SNAPSHOT.jar testdata/1902 output
Warning: $HADOOP_HOME is deprecated.
14/11/26 13:33:39 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/11/26 13:33:39 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/11/26 13:33:39 WARN snappy.LoadSnappy: Snappy native library not loaded
14/11/26 13:33:39 INFO mapred.FileInputFormat: Total input paths to process : 1
14/11/26 13:33:40 INFO mapred.JobClient: Running job: job_201411261331_0002 # job ID
14/11/26 13:33:41 INFO mapred.JobClient: map 0% reduce 0%
14/11/26 13:33:47 INFO mapred.JobClient: map 100% reduce 0% # map progress
14/11/26 13:33:54 INFO mapred.JobClient: map 100% reduce 33%
14/11/26 13:33:56 INFO mapred.JobClient: map 100% reduce 100% # reduce progress
14/11/26 13:33:57 INFO mapred.JobClient: Job complete: job_201411261331_0002
14/11/26 13:33:57 INFO mapred.JobClient: Counters: 30
14/11/26 13:33:57 INFO mapred.JobClient: Job Counters
14/11/26 13:33:57 INFO mapred.JobClient: Launched reduce tasks=1
14/11/26 13:33:57 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=7744
14/11/26 13:33:57 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
14/11/26 13:33:57 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
14/11/26 13:33:57 INFO mapred.JobClient: Launched map tasks=2
14/11/26 13:33:57 INFO mapred.JobClient: Data-local map tasks=2
14/11/26 13:33:57 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=9008
14/11/26 13:33:57 INFO mapred.JobClient: File Input Format Counters
14/11/26 13:33:57 INFO mapred.JobClient: Bytes Read=890953
14/11/26 13:33:57 INFO mapred.JobClient: File Output Format Counters
14/11/26 13:33:57 INFO mapred.JobClient: Bytes Written=9
14/11/26 13:33:57 INFO mapred.JobClient: FileSystemCounters
14/11/26 13:33:57 INFO mapred.JobClient: FILE_BYTES_READ=72221
14/11/26 13:33:57 INFO mapred.JobClient: HDFS_BYTES_READ=891143
14/11/26 13:33:57 INFO mapred.JobClient: FILE_BYTES_WRITTEN=309368
14/11/26 13:33:57 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=9
14/11/26 13:33:57 INFO mapred.JobClient: Map-Reduce Framework
14/11/26 13:33:57 INFO mapred.JobClient: Map output materialized bytes=72227
14/11/26 13:33:57 INFO mapred.JobClient: Map input records=6565 # number of map input records
14/11/26 13:33:57 INFO mapred.JobClient: Reduce shuffle bytes=72227
14/11/26 13:33:57 INFO mapred.JobClient: Spilled Records=13130
14/11/26 13:33:57 INFO mapred.JobClient: Map output bytes=59085
14/11/26 13:33:57 INFO mapred.JobClient: Total committed heap usage (bytes)=478543872
14/11/26 13:33:57 INFO mapred.JobClient: CPU time spent (ms)=4400 # CPU time
14/11/26 13:33:57 INFO mapred.JobClient: Map input bytes=888978
14/11/26 13:33:57 INFO mapred.JobClient: SPLIT_RAW_BYTES=190
14/11/26 13:33:57 INFO mapred.JobClient: Combine input records=0
14/11/26 13:33:57 INFO mapred.JobClient: Reduce input records=6565 # number of reduce input records
14/11/26 13:33:57 INFO mapred.JobClient: Reduce input groups=1
14/11/26 13:33:57 INFO mapred.JobClient: Combine output records=0
14/11/26 13:33:57 INFO mapred.JobClient: Physical memory (bytes) snapshot=501690368
14/11/26 13:33:57 INFO mapred.JobClient: Reduce output records=1 # number of reduce output records
14/11/26 13:33:57 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2167922688
14/11/26 13:33:57 INFO mapred.JobClient: Map output records=6565 # number of map output records
Checking the Run Results
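With a single reducer the job writes one part file; reading it back should show one tab-separated line per year, holding that year's maximum temperature in tenths of a degree Celsius (the Reduce output records=1 counter above confirms a single line for the 1902 file):

hadoop dfs -cat output/part-00000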
Troubleshooting
Symptom: the map phase completes normally, but the reduce phase hangs at 0% with no progress for a long time. The log reports:
2011-10-03 09:46:13,349 INFO org.apache.hadoop.mapred.JobInProgress: Failed fetch notification #1 for task attempt_201110022127_0003_m_000000_01
Fix: make the hostname in /etc/hosts consistent with the HOSTNAME entry in /etc/sysconfig/network, then reboot the system.
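For illustration (the hostname and address below are made-up example values), the two files should agree like this:

# /etc/hosts -- the name here must match HOSTNAME below
127.0.0.1       localhost
192.168.1.100   hadoop-node1

# /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=hadoop-node1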