首页 > 代码库 > Hadoop读书笔记(十)MapReduce中的从计数器理解combiner归约

Hadoop读书笔记(十)MapReduce中的从计数器理解combiner归约

Hadoop读书笔记系列文章:http://blog.csdn.net/caicongyang/article/category/2166855

1.combiner

问:什么是combiner:

答:Combiner发生在Mapper端,对数据进行归约处理,使传到reducer端的数据变小了,传输时间变端,作业时间变短,Combiner不能夸Mapper执行,(只有reduce可以接受多个Mapper的任务)。 并不是所有的算法都适合归约处理,例如求平均数

2.代码实现

WordCount.java

package combine;

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;
/**
 * 
 * <p> 
 * Title: WordCount.java 
 * Package counter 
 * </p>
 * <p>
 * Description: 
 *	问:什么是combiner:
 *  答:Combiner发生在Mapper端,对数据进行归约处理,使传到reducer端的数据变小了,传输时间变端,作业时间变短,Combiner不能夸Mapper执行,
 *  (只有reduce可以接受多个Mapper的任务)并不是多少的算法都适合归约处理,例如求平均数
 * 
 * <p>
 * @author Tom.Cai
 * @created 2014-11-26 下午10:47:32 
 * @version V1.0 
 *
 */
public class WordCount {
	private static final String INPUT_PATH = "hdfs://192.168.80.100:9000/hello";
	private static final String OUT_PATH = "hdfs://192.168.80.100:9000/out";

	public static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();
		FileSystem fileSystem = FileSystem.get(new URI(INPUT_PATH), conf);
		Path outPath = new Path(OUT_PATH);
		if (fileSystem.exists(outPath)) {
			fileSystem.delete(outPath, true);
		}
		Job job = new Job(conf, WordCount.class.getSimpleName());
		//1.1设定输入文件
		FileInputFormat.setInputPaths(job, INPUT_PATH);
		//1.2设定输入格式
		job.setInputFormatClass(TextInputFormat.class);
		//指定自定义Mapper类
		job.setMapperClass(MyMapper.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(LongWritable.class);
		//1.3设定分区
		job.setPartitionerClass(HashPartitioner.class);
		job.setNumReduceTasks(1);
		//1.4排序分组
		
		//1.5归约
		job.setCombinerClass(MyReducer.class);
		
		//2.2设定Reduce类
		job.setReducerClass(MyReducer.class);

		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(LongWritable.class);
		//2.3指定输出地址
		FileOutputFormat.setOutputPath(job, new Path(OUT_PATH));
		job.setOutputFormatClass(TextOutputFormat.class);

		job.waitForCompletion(true);
	}

	static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
		@Override
		protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
			String[] splited = value.toString().split("\t");
			for (String word : splited) {
				context.write(new Text(word), new LongWritable(1));
			}
		}
	}

	static class MyReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
		@Override
		protected void reduce(Text key, Iterable<LongWritable> value, Context context) throws IOException, InterruptedException {
			long count = 0L;
			for (LongWritable times : value) {
				count += times.get();
			}
			context.write(key, new LongWritable(count));
		}

	}

}

</pre><p></p><pre>

 3.加入Combiner后的计数器:

14/12/01 21:26:41 INFO mapred.JobClient: Counters: 19
14/12/01 21:26:41 INFO mapred.JobClient:   File Output Format Counters 
14/12/01 21:26:41 INFO mapred.JobClient:     Bytes Written=20
14/12/01 21:26:41 INFO mapred.JobClient:   FileSystemCounters
14/12/01 21:26:41 INFO mapred.JobClient:     FILE_BYTES_READ=346
14/12/01 21:26:41 INFO mapred.JobClient:     HDFS_BYTES_READ=40
14/12/01 21:26:41 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=128546
14/12/01 21:26:41 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=20
14/12/01 21:26:41 INFO mapred.JobClient:   File Input Format Counters 
14/12/01 21:26:41 INFO mapred.JobClient:     Bytes Read=20
14/12/01 21:26:41 INFO mapred.JobClient:   Map-Reduce Framework
14/12/01 21:26:41 INFO mapred.JobClient:     Map output materialized bytes=50
14/12/01 21:26:41 INFO mapred.JobClient:     Map input records=2
14/12/01 21:26:41 INFO mapred.JobClient:     Reduce shuffle bytes=0
14/12/01 21:26:41 INFO mapred.JobClient:     Spilled Records=6
14/12/01 21:26:41 INFO mapred.JobClient:     Map output bytes=52
14/12/01 21:26:41 INFO mapred.JobClient:     Total committed heap usage (bytes)=532807680
14/12/01 21:26:41 INFO mapred.JobClient:     SPLIT_RAW_BYTES=97
14/12/01 21:26:41 INFO mapred.JobClient:     Combine input records=4
14/12/01 21:26:41 INFO mapred.JobClient:     Reduce input records=3
14/12/01 21:26:41 INFO mapred.JobClient:     Reduce input groups=3
14/12/01 21:26:41 INFO mapred.JobClient:     Combine output records=3
14/12/01 21:26:41 INFO mapred.JobClient:     Reduce output records=3
14/12/01 21:26:41 INFO mapred.JobClient:     Map output records=4

4.未加入归约之前的计数器

14/12/01 21:35:27 INFO mapred.JobClient: Counters: 19
14/12/01 21:35:27 INFO mapred.JobClient:   File Output Format Counters 
14/12/01 21:35:27 INFO mapred.JobClient:     Bytes Written=20
14/12/01 21:35:27 INFO mapred.JobClient:   FileSystemCounters
14/12/01 21:35:27 INFO mapred.JobClient:     FILE_BYTES_READ=362
14/12/01 21:35:27 INFO mapred.JobClient:     HDFS_BYTES_READ=40
14/12/01 21:35:27 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=128090
14/12/01 21:35:27 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=20
14/12/01 21:35:27 INFO mapred.JobClient:   File Input Format Counters 
14/12/01 21:35:27 INFO mapred.JobClient:     Bytes Read=20
14/12/01 21:35:27 INFO mapred.JobClient:   Map-Reduce Framework
14/12/01 21:35:27 INFO mapred.JobClient:     Map output materialized bytes=66
14/12/01 21:35:27 INFO mapred.JobClient:     Map input records=2
14/12/01 21:35:27 INFO mapred.JobClient:     Reduce shuffle bytes=0
14/12/01 21:35:27 INFO mapred.JobClient:     Spilled Records=8
14/12/01 21:35:27 INFO mapred.JobClient:     Map output bytes=52
14/12/01 21:35:27 INFO mapred.JobClient:     Total committed heap usage (bytes)=366034944
14/12/01 21:35:27 INFO mapred.JobClient:     SPLIT_RAW_BYTES=97
14/12/01 21:35:27 INFO mapred.JobClient:     Combine input records=0
14/12/01 21:35:27 INFO mapred.JobClient:     Reduce input records=4
14/12/01 21:35:27 INFO mapred.JobClient:     Reduce input groups=3
14/12/01 21:35:27 INFO mapred.JobClient:     Combine output records=0
14/12/01 21:35:27 INFO mapred.JobClient:     Reduce output records=3
14/12/01 21:35:27 INFO mapred.JobClient:     Map output records=4

5.总结

从前后两个计数器输出可以看到:加了归约以后 Reduce input records从4变成了3,从Mapper端到Reduce端的作业变少了,传输时间变少了,从而提升了整体的作业时间。



欢迎大家一起讨论学习!

有用的自己收!

记录与分享,让你我共成长!欢迎查看我的其他博客;我的博客地址:http://blog.csdn.net/caicongyang




Hadoop读书笔记(十)MapReduce中的从计数器理解combiner归约