Hadoop Reading Notes (10): Understanding combiner reduction through MapReduce counters
Hadoop Reading Notes series: http://blog.csdn.net/caicongyang/article/category/2166855
1.combiner
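Q: What is a combiner?
A: A combiner runs on the mapper side and performs a local reduce on each map task's output, so less data is sent to the reducers, the transfer takes less time, and the job finishes sooner. A combiner cannot work across mappers (only a reducer can receive output from multiple mappers). Not every algorithm is suitable for a combiner. Computing an average is one example, because the mean of the per-mapper means is generally not the mean of all the values.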
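To make the averaging caveat concrete, here is a minimal sketch in plain Java with no Hadoop dependency. The class name, method names, and sample values are all hypothetical and chosen only for illustration. It compares averaging the per-mapper averages (what a naive "average" combiner would produce) with carrying a (sum, count) pair through the combine step and dividing once at the end.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; not part of the WordCount job in this post.
final class CombinerAverageDemo {

    // Mean of one list of values.
    static double mean(List<Long> values) {
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        return (double) sum / values.size();
    }

    // Naive approach: each mapper's combiner emits its local mean,
    // and the reducer averages those means. This is generally wrong.
    static double meanOfMeans(List<List<Long>> perMapper) {
        double total = 0;
        for (List<Long> part : perMapper) {
            total += mean(part);
        }
        return total / perMapper.size();
    }

    // Combiner-friendly approach: each mapper contributes (sum, count),
    // the reducer adds the pairs and divides once.
    static double meanFromSumAndCount(List<List<Long>> perMapper) {
        long sum = 0;
        long count = 0;
        for (List<Long> part : perMapper) {
            for (long v : part) {
                sum += v;
            }
            count += part.size();
        }
        return (double) sum / count;
    }

    public static void main(String[] args) {
        // Assumed values: mapper A sees 1, 2, 3 and mapper B sees 10.
        List<List<Long>> perMapper =
                Arrays.asList(Arrays.asList(1L, 2L, 3L), Arrays.asList(10L));
        System.out.println(meanOfMeans(perMapper));         // prints 6.0
        System.out.println(meanFromSumAndCount(perMapper)); // prints 4.0
    }
}
```

Summation works as a combiner in WordCount because adding partial sums gives the same result as adding all the values at once. An average only behaves that way if the intermediate value carries both the sum and the count.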
2.Code implementation
WordCount.java
package combine;

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

/**
 * Title: WordCount.java
 * Package: combine
 *
 * Description: word count with a combiner. The combiner runs on the mapper
 * side and reduces each map task's output locally, so less data is sent to
 * the reducer. It cannot work across mappers, and not every algorithm is
 * suitable for it (computing an average is one example).
 *
 * @author Tom.Cai
 * @created 2014-11-26 22:47:32
 * @version V1.0
 */
public class WordCount {
    private static final String INPUT_PATH = "hdfs://192.168.80.100:9000/hello";
    private static final String OUT_PATH = "hdfs://192.168.80.100:9000/out";

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fileSystem = FileSystem.get(new URI(INPUT_PATH), conf);
        // Delete the output directory if it already exists, otherwise the job fails.
        Path outPath = new Path(OUT_PATH);
        if (fileSystem.exists(outPath)) {
            fileSystem.delete(outPath, true);
        }
        Job job = new Job(conf, WordCount.class.getSimpleName());
        // 1.1 Set the input path
        FileInputFormat.setInputPaths(job, INPUT_PATH);
        // 1.2 Set the input format
        job.setInputFormatClass(TextInputFormat.class);
        // Set the custom Mapper class and its output types
        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        // 1.3 Set the partitioner
        job.setPartitionerClass(HashPartitioner.class);
        job.setNumReduceTasks(1);
        // 1.4 Sorting and grouping (defaults)
        // 1.5 Combine: comment this line out to run without a combiner
        job.setCombinerClass(MyReducer.class);
        // 2.2 Set the Reducer class and the final output types
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        // 2.3 Set the output path and format
        FileOutputFormat.setOutputPath(job, new Path(OUT_PATH));
        job.setOutputFormatClass(TextOutputFormat.class);
        job.waitForCompletion(true);
    }

    static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Split each line on tabs and emit (word, 1) for every token.
            String[] splited = value.toString().split("\t");
            for (String word : splited) {
                context.write(new Text(word), new LongWritable(1));
            }
        }
    }

    static class MyReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> value, Context context)
                throws IOException, InterruptedException {
            // Sum the counts for one word; used as both combiner and reducer.
            long count = 0L;
            for (LongWritable times : value) {
                count += times.get();
            }
            context.write(key, new LongWritable(count));
        }
    }
}
3.Counters with the combiner enabled:
14/12/01 21:26:41 INFO mapred.JobClient: Counters: 19
14/12/01 21:26:41 INFO mapred.JobClient: File Output Format Counters
14/12/01 21:26:41 INFO mapred.JobClient: Bytes Written=20
14/12/01 21:26:41 INFO mapred.JobClient: FileSystemCounters
14/12/01 21:26:41 INFO mapred.JobClient: FILE_BYTES_READ=346
14/12/01 21:26:41 INFO mapred.JobClient: HDFS_BYTES_READ=40
14/12/01 21:26:41 INFO mapred.JobClient: FILE_BYTES_WRITTEN=128546
14/12/01 21:26:41 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=20
14/12/01 21:26:41 INFO mapred.JobClient: File Input Format Counters
14/12/01 21:26:41 INFO mapred.JobClient: Bytes Read=20
14/12/01 21:26:41 INFO mapred.JobClient: Map-Reduce Framework
14/12/01 21:26:41 INFO mapred.JobClient: Map output materialized bytes=50
14/12/01 21:26:41 INFO mapred.JobClient: Map input records=2
14/12/01 21:26:41 INFO mapred.JobClient: Reduce shuffle bytes=0
14/12/01 21:26:41 INFO mapred.JobClient: Spilled Records=6
14/12/01 21:26:41 INFO mapred.JobClient: Map output bytes=52
14/12/01 21:26:41 INFO mapred.JobClient: Total committed heap usage (bytes)=532807680
14/12/01 21:26:41 INFO mapred.JobClient: SPLIT_RAW_BYTES=97
14/12/01 21:26:41 INFO mapred.JobClient: Combine input records=4
14/12/01 21:26:41 INFO mapred.JobClient: Reduce input records=3
14/12/01 21:26:41 INFO mapred.JobClient: Reduce input groups=3
14/12/01 21:26:41 INFO mapred.JobClient: Combine output records=3
14/12/01 21:26:41 INFO mapred.JobClient: Reduce output records=3
14/12/01 21:26:41 INFO mapred.JobClient: Map output records=4
4.Counters without the combiner
14/12/01 21:35:27 INFO mapred.JobClient: Counters: 19
14/12/01 21:35:27 INFO mapred.JobClient: File Output Format Counters
14/12/01 21:35:27 INFO mapred.JobClient: Bytes Written=20
14/12/01 21:35:27 INFO mapred.JobClient: FileSystemCounters
14/12/01 21:35:27 INFO mapred.JobClient: FILE_BYTES_READ=362
14/12/01 21:35:27 INFO mapred.JobClient: HDFS_BYTES_READ=40
14/12/01 21:35:27 INFO mapred.JobClient: FILE_BYTES_WRITTEN=128090
14/12/01 21:35:27 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=20
14/12/01 21:35:27 INFO mapred.JobClient: File Input Format Counters
14/12/01 21:35:27 INFO mapred.JobClient: Bytes Read=20
14/12/01 21:35:27 INFO mapred.JobClient: Map-Reduce Framework
14/12/01 21:35:27 INFO mapred.JobClient: Map output materialized bytes=66
14/12/01 21:35:27 INFO mapred.JobClient: Map input records=2
14/12/01 21:35:27 INFO mapred.JobClient: Reduce shuffle bytes=0
14/12/01 21:35:27 INFO mapred.JobClient: Spilled Records=8
14/12/01 21:35:27 INFO mapred.JobClient: Map output bytes=52
14/12/01 21:35:27 INFO mapred.JobClient: Total committed heap usage (bytes)=366034944
14/12/01 21:35:27 INFO mapred.JobClient: SPLIT_RAW_BYTES=97
14/12/01 21:35:27 INFO mapred.JobClient: Combine input records=0
14/12/01 21:35:27 INFO mapred.JobClient: Reduce input records=4
14/12/01 21:35:27 INFO mapred.JobClient: Reduce input groups=3
14/12/01 21:35:27 INFO mapred.JobClient: Combine output records=0
14/12/01 21:35:27 INFO mapred.JobClient: Reduce output records=3
14/12/01 21:35:27 INFO mapred.JobClient: Map output records=4
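Comparing two counter dumps by eye is error-prone, so the following is a small plain-Java sketch that parses JobClient log lines into name/value pairs and prints the counters that differ. The class and method names are hypothetical. The parsing assumes the line layout shown in the logs above (`... JobClient: <name>=<value>`), and only three counters from each run are included to keep it short.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; it only understands the log layout shown above.
final class CounterDiffDemo {

    // Turns lines such as "... INFO mapred.JobClient: Spilled Records=6"
    // into a map from counter name to value. Lines without "=" (group
    // headers, "Counters: 19") are skipped.
    static Map<String, Long> parse(String[] logLines) {
        Map<String, Long> counters = new LinkedHashMap<>();
        String marker = "JobClient:";
        for (String line : logLines) {
            int start = line.indexOf(marker);
            int eq = line.lastIndexOf('=');
            if (start < 0 || eq < start) {
                continue;
            }
            String name = line.substring(start + marker.length(), eq).trim();
            long value = Long.parseLong(line.substring(eq + 1).trim());
            counters.put(name, value);
        }
        return counters;
    }

    public static void main(String[] args) {
        // Lines copied from the two dumps above.
        Map<String, Long> withCombiner = parse(new String[] {
            "14/12/01 21:26:41 INFO mapred.JobClient: Map output materialized bytes=50",
            "14/12/01 21:26:41 INFO mapred.JobClient: Spilled Records=6",
            "14/12/01 21:26:41 INFO mapred.JobClient: Reduce input records=3",
        });
        Map<String, Long> withoutCombiner = parse(new String[] {
            "14/12/01 21:35:27 INFO mapred.JobClient: Map output materialized bytes=66",
            "14/12/01 21:35:27 INFO mapred.JobClient: Spilled Records=8",
            "14/12/01 21:35:27 INFO mapred.JobClient: Reduce input records=4",
        });
        // Print every counter whose value changed between the two runs.
        for (Map.Entry<String, Long> e : withoutCombiner.entrySet()) {
            Long after = withCombiner.get(e.getKey());
            if (after != null && !after.equals(e.getValue())) {
                System.out.println(e.getKey() + ": " + e.getValue() + " -> " + after);
            }
        }
    }
}
```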
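5.Summary
Comparing the two counter dumps: with the combiner enabled, Reduce input records drops from 4 to 3, Map output materialized bytes drops from 66 to 50, and Spilled Records drops from 8 to 6. Less data has to be passed from the mapper side to the reducer side, so the transfer takes less time and the overall job time is shortened.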
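The drop from 4 records to 3 can be reproduced without a cluster. The sketch below is plain Java that imitates the map and combine steps of the WordCount job for a single map task. The class name and the two input lines are assumptions for illustration: the post does not show the contents of the `/hello` file, only that it has 2 records producing 4 map outputs in 3 groups.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical simulation of one map task; not Hadoop code.
final class CombinerRecordCountDemo {

    // Imitates MyMapper: split each line on tabs and emit one record per word
    // (the value 1 that accompanies each word is implied).
    static List<String> mapPhase(List<String> lines) {
        List<String> output = new ArrayList<>();
        for (String line : lines) {
            output.addAll(Arrays.asList(line.split("\t")));
        }
        return output;
    }

    // Imitates MyReducer used as a combiner: sum the counts per word
    // within this one map task's output.
    static Map<String, Long> combine(List<String> mapOutput) {
        Map<String, Long> combined = new TreeMap<>();
        for (String word : mapOutput) {
            combined.merge(word, 1L, Long::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        // Assumed input: two tab-separated lines sharing one word.
        List<String> lines = Arrays.asList("hello\tyou", "hello\tme");
        List<String> mapOutput = mapPhase(lines);
        Map<String, Long> combined = combine(mapOutput);

        System.out.println("Map output records=" + mapOutput.size());                 // 4
        System.out.println("Reduce input records (no combiner)=" + mapOutput.size()); // 4
        System.out.println("Reduce input records (combiner)=" + combined.size());     // 3
    }
}
```

The saving comes entirely from keys that repeat within one map task's output. If every word in a task's output were unique, the combiner would emit as many records as it received.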
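Discussion and feedback are welcome!
Take whatever you find useful!
Record and share, so that we can grow together! You are welcome to read my other posts; my blog: http://blog.csdn.net/caicongyang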