
Hadoop Reading Notes (3): A Deeper Look at MapReduce Sorting and Single-Table Join

  Following the previous post, which covered computing averages and deduplication with MapReduce, this post looks at how MapReduce handles sorting and single-table joins.
  As mentioned in the first post of this MapReduce series, MapReduce is not just a distributed computation method but also a new way of thinking about problems: it cuts what once looked like a single end-to-end process into two halves, Map on one end and Reduce on the other, with Map responsible for splitting the work and Reduce responsible for combining the results.

  1. MapReduce Sorting
  Problem model:
  Given several input data files such as:

  sortfile1.txt

11
13
15
17
19
21
23
25
27
29

  sortfile2.txt

10
12
14
16
18
20
22
24
26
28
30

  sortfile3.txt

1
2
3
4
5
6
7
8
9
10

  The goal is to sort all of this into a result file of the following format:

1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 10
……

  

  With the task in hand, the questions are how to sort, how to implement it, and what issues we still face:
  1. We know MapReduce already has built-in sorting that can be used directly;
  2. If we rely on MapReduce's default sorting, how do we use it, and what differs when the key is an int type versus a String type;
  3. With three input files, or even more, how do we guarantee that the sorted result is globally ordered.

  Now that we have the requirement and the questions, let's address them one by one. MapReduce does have its own sorting mechanism, and we will not set it aside, but we cannot rely on the built-in behavior alone. Keep in mind that MapReduce sorts by key: if the key is an int type, it is sorted by numeric value; if the key is a String type, it is sorted in dictionary order (as text, for example, "10" sorts before "9"). To guarantee global order here, we need to define our own Partition class, which acts as a global dispatcher so that the data handed from the map side to each reduce is already ordered across partitions. Concretely, we divide the maximum value in the input data by the number of partitions and use that quotient as the increment between the boundaries that split the data; the boundaries sit at 1x, 2x, up to (numPartitions-1)x of the quotient, which guarantees the partitioned data is ordered as a whole. Afterwards, for each <key, value-list> received on the Reduce side, the number of elements in the value-list decides how many times the input key is written out as the output value, while the output key is a global counter that records the current rank.
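  To make the boundary arithmetic concrete, here is a minimal standalone sketch (not part of the job below): it assumes the same maximum value 65223 used in the posted code, an assumed four partitions, and a few made-up sample keys, and prints which partition each key falls into. The posted Partition class expresses the same rule with an explicit loop over the boundaries.

public class BoundDemo {
  public static void main(String[] args) {
    int Maxnumber = 65223;        // assumed maximum key in the input, as in the job below
    int numPartitions = 4;        // assumed number of reduce tasks, for illustration only
    int bound = Maxnumber / numPartitions + 1;      // 16306: width of each key range
    int[] sampleKeys = {10, 20000, 40000, 65000};   // made-up keys
    for (int k : sampleKeys) {
      int partition = k / bound;  // keys in [i*bound, (i+1)*bound) land in partition i
      System.out.println("key " + k + " -> partition " + partition);
    }
  }
}

  Because every key in partition i is smaller than every key in partition i+1, concatenating the reducers' outputs in partition order yields a globally sorted file.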
  The full code is as follows:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class Sort {

  // map converts the value from the input into an IntWritable and uses it as the output key
  public static class Map extends Mapper<Object, Text, IntWritable, IntWritable> {

    private static IntWritable data = new IntWritable();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      String line = value.toString();
      System.out.println("line:" + line);
      try {
        data.set(Integer.parseInt(line));
      } catch (Exception e) {
        data.set(1000);
      }
      System.out.println("Map key:" + data.toString());
      context.write(data, new IntWritable(1));
    }
  }

  // reduce copies the input key into the output value; the number of elements in the
  // input value-list decides how many times the key is written out, and the global
  // counter linenum records the key's rank
  public static class Reduce extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {

    private static IntWritable linenum = new IntWritable(1);

    public void reduce(IntWritable key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      for (IntWritable val : values) {
        context.write(linenum, key);
        System.out.println("Reduce key:" + linenum + "\tReduce value:" + key);
        linenum = new IntWritable(linenum.get() + 1);
      }
    }
  }

  // the custom Partition uses the maximum value of the input data and the number of
  // partitions in the MapReduce framework to compute the boundaries that split the
  // input by size, then returns the partition ID that the key falls into
  public static class Partition extends Partitioner<IntWritable, IntWritable> {

    @Override
    public int getPartition(IntWritable key, IntWritable value, int numPartitions) {
      int Maxnumber = 65223;
      int bound = Maxnumber / numPartitions + 1;
      int keynumber = key.get();
      for (int i = 0; i < numPartitions; i++) {
        System.out.println("numPartitions:" + numPartitions);
        if (keynumber < bound * i && keynumber >= bound * (i - 1))
          return i - 1;
      }
      return 0;
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: Sort <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "Sort");
    job.setJarByClass(Sort.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setPartitionerClass(Partition.class);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

  

  Notes:
  1. Be careful when preparing your own test data. For example, sortfile1.txt has 10 lines of data; if a trailing newline leaves the cursor on an 11th, empty line, the map phase throws a number-format exception, which is why the code wraps the parse in a try/catch (an alternative is sketched after these notes).
  2. To see the MapReduce and Partition steps more clearly, print statements are used to trace each stage of execution.
  3. The fallback in the custom Partition class should be "return 0"; the "bible" 《Hadoop实战》(2nd edition) writes return -1 there, which practice shows is wrong.
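  Regarding note 1: if you would rather drop blank or non-numeric lines than map them to the sentinel value 1000, a minimal alternative mapper looks like the sketch below (a variation under that assumption, not the behaviour of the posted code):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Variation on the Map class above: blank or non-numeric lines are skipped
// instead of being written out with the sentinel value 1000.
public class SkippingMap extends Mapper<Object, Text, IntWritable, IntWritable> {

  private final IntWritable data = new IntWritable();

  public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
    String line = value.toString().trim();
    if (line.isEmpty()) {
      return;                                  // ignore empty trailing lines
    }
    try {
      data.set(Integer.parseInt(line));
    } catch (NumberFormatException e) {
      return;                                  // ignore lines that are not integers
    }
    context.write(data, new IntWritable(1));
  }
}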

  When the program runs, it prints the following:

15/01/28 21:19:28 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
15/01/28 21:19:28 WARN mapred.JobClient: No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
15/01/28 21:19:28 INFO input.FileInputFormat: Total input paths to process : 3
15/01/28 21:19:29 INFO mapred.JobClient: Running job: job_local_0001
15/01/28 21:19:29 INFO input.FileInputFormat: Total input paths to process : 3
15/01/28 21:19:29 INFO mapred.MapTask: io.sort.mb = 100
15/01/28 21:19:29 INFO mapred.MapTask: data buffer = 79691776/99614720
15/01/28 21:19:29 INFO mapred.MapTask: record buffer = 262144/327680
line:11
15/01/28 21:19:29 INFO mapred.MapTask: Starting flush of map output
Map key:11
numPartitions:1
line:13
Map key:13
numPartitions:1
line:15
Map key:15
numPartitions:1
line:17
Map key:17
numPartitions:1
line:19
Map key:19
numPartitions:1
line:21
Map key:21
numPartitions:1
line:23
Map key:23
numPartitions:1
line:25
Map key:25
numPartitions:1
line:27
Map key:27
numPartitions:1
line:29
Map key:29
numPartitions:1
15/01/28 21:19:29 INFO mapred.MapTask: Finished spill 0
15/01/28 21:19:29 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
15/01/28 21:19:29 INFO mapred.LocalJobRunner: 
15/01/28 21:19:29 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0' done.
15/01/28 21:19:29 INFO mapred.MapTask: io.sort.mb = 100
15/01/28 21:19:29 INFO mapred.MapTask: data buffer = 79691776/99614720
15/01/28 21:19:29 INFO mapred.MapTask: record buffer = 262144/327680
line:10
Map key:10
numPartitions:1
line:12
Map key:12
numPartitions:1
line:14
Map key:14
numPartitions:1
line:16
Map key:16
numPartitions:1
line:18
Map key:18
numPartitions:1
line:20
Map key:20
numPartitions:1
line:22
Map key:22
numPartitions:1
line:24
Map key:24
numPartitions:1
line:26
Map key:26
numPartitions:1
line:28
Map key:28
numPartitions:1
line:30
Map key:30
numPartitions:1
15/01/28 21:19:29 INFO mapred.MapTask: Starting flush of map output
15/01/28 21:19:29 INFO mapred.MapTask: Finished spill 0
15/01/28 21:19:29 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000001_0 is done. And is in the process of commiting
15/01/28 21:19:29 INFO mapred.LocalJobRunner: 
15/01/28 21:19:29 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000001_0' done.
15/01/28 21:19:29 INFO mapred.MapTask: io.sort.mb = 100
15/01/28 21:19:30 INFO mapred.JobClient:  map 100% reduce 0%
15/01/28 21:19:30 INFO mapred.MapTask: data buffer = 79691776/99614720
15/01/28 21:19:30 INFO mapred.MapTask: record buffer = 262144/327680
line:1
Map key:1
numPartitions:1
line:2
Map key:2
numPartitions:1
line:3
Map key:3
numPartitions:1
line:4
Map key:4
numPartitions:1
line:5
Map key:5
numPartitions:1
line:6
Map key:6
numPartitions:1
line:7
Map key:7
numPartitions:1
line:8
Map key:8
numPartitions:1
line:9
Map key:9
numPartitions:1
line:10
Map key:10
numPartitions:1
15/01/28 21:19:30 INFO mapred.MapTask: Starting flush of map output
15/01/28 21:19:30 INFO mapred.MapTask: Finished spill 0
15/01/28 21:19:30 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000002_0 is done. And is in the process of commiting
15/01/28 21:19:30 INFO mapred.LocalJobRunner: 
15/01/28 21:19:30 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000002_0' done.
15/01/28 21:19:30 INFO mapred.LocalJobRunner: 
15/01/28 21:19:30 INFO mapred.Merger: Merging 3 sorted segments
15/01/28 21:19:30 INFO mapred.Merger: Down to the last merge-pass, with 3 segments left of total size: 316 bytes
15/01/28 21:19:30 INFO mapred.LocalJobRunner: 
Reduce key:1	Reduce value:1
Reduce key:2	Reduce value:2
Reduce key:3	Reduce value:3
Reduce key:4	Reduce value:4
Reduce key:5	Reduce value:5
Reduce key:6	Reduce value:6
Reduce key:7	Reduce value:7
Reduce key:8	Reduce value:8
Reduce key:9	Reduce value:9
Reduce key:10	Reduce value:10
Reduce key:11	Reduce value:10
Reduce key:12	Reduce value:11
Reduce key:13	Reduce value:12
Reduce key:14	Reduce value:13
Reduce key:15	Reduce value:14
Reduce key:16	Reduce value:15
Reduce key:17	Reduce value:16
Reduce key:18	Reduce value:17
Reduce key:19	Reduce value:18
Reduce key:20	Reduce value:19
Reduce key:21	Reduce value:20
Reduce key:22	Reduce value:21
Reduce key:23	Reduce value:22
Reduce key:24	Reduce value:23
Reduce key:25	Reduce value:24
Reduce key:26	Reduce value:25
Reduce key:27	Reduce value:26
Reduce key:28	Reduce value:27
Reduce key:29	Reduce value:28
Reduce key:30	Reduce value:29
Reduce key:31	Reduce value:30
15/01/28 21:19:30 INFO mapred.TaskRunner: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
15/01/28 21:19:30 INFO mapred.LocalJobRunner: 
15/01/28 21:19:30 INFO mapred.TaskRunner: Task attempt_local_0001_r_000000_0 is allowed to commit now
15/01/28 21:19:30 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to hdfs://hadoop:9000/usr/hadoop/output3
15/01/28 21:19:30 INFO mapred.LocalJobRunner: reduce > reduce
15/01/28 21:19:30 INFO mapred.TaskRunner: Task 'attempt_local_0001_r_000000_0' done.
15/01/28 21:19:31 INFO mapred.JobClient:  map 100% reduce 100%
15/01/28 21:19:31 INFO mapred.JobClient: Job complete: job_local_0001
15/01/28 21:19:31 INFO mapred.JobClient: Counters: 14
15/01/28 21:19:31 INFO mapred.JobClient:   FileSystemCounters
15/01/28 21:19:31 INFO mapred.JobClient:     FILE_BYTES_READ=67220
15/01/28 21:19:31 INFO mapred.JobClient:     HDFS_BYTES_READ=261
15/01/28 21:19:31 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=138115
15/01/28 21:19:31 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=168
15/01/28 21:19:31 INFO mapred.JobClient:   Map-Reduce Framework
15/01/28 21:19:31 INFO mapred.JobClient:     Reduce input groups=30
15/01/28 21:19:31 INFO mapred.JobClient:     Combine output records=0
15/01/28 21:19:31 INFO mapred.JobClient:     Map input records=31
15/01/28 21:19:31 INFO mapred.JobClient:     Reduce shuffle bytes=0
15/01/28 21:19:31 INFO mapred.JobClient:     Reduce output records=31
15/01/28 21:19:31 INFO mapred.JobClient:     Spilled Records=62
15/01/28 21:19:31 INFO mapred.JobClient:     Map output bytes=248
15/01/28 21:19:31 INFO mapred.JobClient:     Combine input records=0
15/01/28 21:19:31 INFO mapred.JobClient:     Map output records=31
15/01/28 21:19:31 INFO mapred.JobClient:     Reduce input records=31

  

  From the printed information we can see:
  Map output records=31, Reduce input records=31
  Map runs first, reading the data in line by line; then the Partition step decides which reducer each record is sent to, so that the data arriving at the Reduce phase is already ordered across partitions; finally the Reduce phase runs and completes the global sort.

  The final output file:

1	1
2	2
3	3
4	4
5	5
6	6
7	7
8	8
9	9
10	10
11	10
12	11
13	12
14	13
15	14
16	15
17	16
18	17
19	18
20	19
21	20
22	21
23	22
24	23
25	24
26	25
27	26
28	27
29	28
30	29
31	30

  MapReduce sorting really is that easy: first let every record in and write it to the context in the format the map specifies, then let the Partition act as the global dispatcher that decides where each record goes, and finally let the reduce write out the final sorted result.

 

  2. MapReduce Single-Table Join
  Problem model: given input files as follows:

  table1.txt

大儿子 爸爸
小儿子 爸爸
大女儿 爸爸
小女儿 爸爸
爸爸 爷爷
爸爸 二大爷
爸爸 三大爷 

  table2.txt

二女儿 妈妈
二儿子 妈妈
妈妈 爷爷
妈妈 二大爷
妈妈 三大爷

  The data we finally want looks like:

grandchild	grandparent
二女儿	爷爷
二女儿	二大爷
二女儿	三大爷
二儿子	爷爷
二儿子	二大爷
……

  

  Joining a table with another table, or with itself, in MapReduce is not like a traditional SQL statement where a single left join or right join yields the final table. For this scenario we need a table join with a left table and a right table that are the same table, where the join condition is the left table's parent column equal to the right table's child column, so the whole process is a self-join.
  Our approach is as follows:
  1. The Map side splits each input record into the child and parent columns, then emits parent as the key and child as the value, which we call the left table; for the same child/parent pair it also emits child as the key and parent as the value, which we call the right table;
  2. To tell the left table from the right table, a flag identifying the table is added to the output value;
  3. In the shuffled results that Reduce receives, each key's value-list already contains the grandchild and grandparent relations; we parse each key's value-list, put the child values from the left-table records into one array and the parent values from the right-table records into another, and the Cartesian product of the two arrays is the final result (a minimal sketch of this step follows below).
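  As a standalone illustration of step 3 (not part of the job itself), the sketch below shows the Cartesian product the reducer performs for a single key, here 爸爸; the sample values are taken from the test data above:

// For key 爸爸, the left-table records contribute its children and the
// right-table records contribute its parents; every (grandchild, grandparent)
// pair is written out.
public class CartesianDemo {
  public static void main(String[] args) {
    String[] grandchild  = {"大儿子", "小儿子", "大女儿", "小女儿"};  // children of 爸爸 (left-table values)
    String[] grandparent = {"爷爷", "二大爷", "三大爷"};              // parents of 爸爸 (right-table values)
    for (String c : grandchild) {
      for (String p : grandparent) {
        System.out.println(c + "\t" + p);    // one grandchild-grandparent pair
      }
    }
  }
}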

  The code is as follows:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class STjoin {

  public static int time = 0;

  // map splits the input into child and parent, then outputs it once in the
  // original order as the right table and once reversed as the left table;
  // note that the output value must carry a flag that tells the two tables apart
  public static class Map extends Mapper<Object, Text, Text, Text> {

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      String childname = new String();
      String parentname = new String();
      String relationtype = new String();
      String line = value.toString();
      int i = 0;
      while (line.charAt(i) != ' ') {
        i++;
      }
      String[] values = {line.substring(0, i), line.substring(i + 1)};
      if (values[0].compareTo("child") != 0) {
        childname = values[0];
        parentname = values[1];
        relationtype = "1";  // flag that distinguishes the left table from the right table
        context.write(new Text(values[1]), new Text(relationtype + "+" + childname + "+" + parentname));
        System.out.println("左表 Map  key:" + new Text(values[1]) + "\tvalue:" + (relationtype + "+" + childname + "+" + parentname));  // left table
        relationtype = "2";
        context.write(new Text(values[0]), new Text(relationtype + "+" + childname + "+" + parentname));
        System.out.println("右表 Map  key:" + new Text(values[0]) + "\tvalue:" + (relationtype + "+" + childname + "+" + parentname));  // right table
      }
    }
  }

  public static class Reduce extends Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
      if (time == 0) {   // write the header row once
        context.write(new Text("grandchild"), new Text("grandparent"));
        time++;
      }
      int grandchildnum = 0;
      String grandchild[] = new String[10];
      int grandparentnum = 0;
      String grandparent[] = new String[10];
      Iterator ite = values.iterator();
      while (ite.hasNext()) {
        String record = ite.next().toString();
        int len = record.length();
        int i = 2;
        if (len == 0) continue;
        char relationtype = record.charAt(0);
        String childname = new String();
        String parentname = new String();
        // extract the child from this value of the value-list
        while (record.charAt(i) != '+') {
          childname = childname + record.charAt(i);
          i++;
        }
        i = i + 1;
        // extract the parent from this value of the value-list
        while (i < len) {
          parentname = parentname + record.charAt(i);
          i++;
        }
        // left table: put the child into grandchild
        if (relationtype == '1') {
          grandchild[grandchildnum] = childname;
          grandchildnum++;
        } else {  // right table: put the parent into grandparent
          grandparent[grandparentnum] = parentname;
          grandparentnum++;
        }
      }
      // Cartesian product of the grandchild and grandparent arrays
      if (grandparentnum != 0 && grandchildnum != 0) {
        for (int m = 0; m < grandchildnum; m++) {
          for (int n = 0; n < grandparentnum; n++) {
            context.write(new Text(grandchild[m]), new Text(grandparent[n]));  // emit one result pair
            System.out.println("Reduce 孙子:" + grandchild[m] + "\t 爷爷:" + grandparent[n]);
          }
        }
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: STjoin <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "single table join");
    job.setJarByClass(STjoin.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
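  One limitation of the posted reducer is the fixed-size String[10] arrays: a key with more than ten children or parents would overflow them. A minimal variation, assuming growable lists are acceptable (and omitting the header row for brevity), is sketched below:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Variation on the Reduce class above: grandchildren and grandparents are
// collected in growable lists instead of fixed String[10] arrays.
public class ListReduce extends Reducer<Text, Text, Text, Text> {

  public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
    List<String> grandchild = new ArrayList<String>();
    List<String> grandparent = new ArrayList<String>();
    for (Text val : values) {
      String record = val.toString();                   // format: "<flag>+child+parent"
      if (record.length() < 3) continue;
      String[] parts = record.substring(2).split("\\+");
      if (record.charAt(0) == '1') {
        grandchild.add(parts[0]);                       // left table: keep the child
      } else {
        grandparent.add(parts[1]);                      // right table: keep the parent
      }
    }
    for (String c : grandchild) {                       // Cartesian product, as before
      for (String p : grandparent) {
        context.write(new Text(c), new Text(p));
      }
    }
  }
}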

  The posted STjoin code should be clear enough; print statements were still added to make each MapReduce step visible, and the program's execution output is as follows:

15/01/28 22:06:28 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
15/01/28 22:06:28 WARN mapred.JobClient: No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
15/01/28 22:06:28 INFO input.FileInputFormat: Total input paths to process : 2
15/01/28 22:06:28 INFO mapred.JobClient: Running job: job_local_0001
15/01/28 22:06:28 INFO input.FileInputFormat: Total input paths to process : 2
15/01/28 22:06:28 INFO mapred.MapTask: io.sort.mb = 100
15/01/28 22:06:28 INFO mapred.MapTask: data buffer = 79691776/99614720
15/01/28 22:06:28 INFO mapred.MapTask: record buffer = 262144/327680
左表 Map  key:爸爸	value:1+大儿子+爸爸
右表 Map  key:大儿子	value:2+大儿子+爸爸
左表 Map  key:爸爸	value:1+小儿子+爸爸
右表 Map  key:小儿子	value:2+小儿子+爸爸
左表 Map  key:爸爸	value:1+大女儿+爸爸
右表 Map  key:大女儿	value:2+大女儿+爸爸
左表 Map  key:爸爸	value:1+小女儿+爸爸
右表 Map  key:小女儿	value:2+小女儿+爸爸
左表 Map  key:爷爷	value:1+爸爸+爷爷
右表 Map  key:爸爸	value:2+爸爸+爷爷
左表 Map  key:二大爷	value:1+爸爸+二大爷
右表 Map  key:爸爸	value:2+爸爸+二大爷
左表 Map  key:三大爷 	value:1+爸爸+三大爷 
右表 Map  key:爸爸	value:2+爸爸+三大爷 
15/01/28 22:06:28 INFO mapred.MapTask: Starting flush of map output
15/01/28 22:06:28 INFO mapred.MapTask: Finished spill 0
15/01/28 22:06:28 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
15/01/28 22:06:28 INFO mapred.LocalJobRunner: 
15/01/28 22:06:28 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0' done.
15/01/28 22:06:28 INFO mapred.MapTask: io.sort.mb = 100
15/01/28 22:06:28 INFO mapred.MapTask: data buffer = 79691776/99614720
15/01/28 22:06:28 INFO mapred.MapTask: record buffer = 262144/327680
左表 Map  key:妈妈	value:1+二女儿+妈妈
右表 Map  key:二女儿	value:2+二女儿+妈妈
左表 Map  key:妈妈	value:1+二儿子+妈妈
右表 Map  key:二儿子	value:2+二儿子+妈妈
左表 Map  key:爷爷	value:1+妈妈+爷爷
右表 Map  key:妈妈	value:2+妈妈+爷爷
左表 Map  key:二大爷	value:1+妈妈+二大爷
右表 Map  key:妈妈	value:2+妈妈+二大爷
左表 Map  key:三大爷	value:1+妈妈+三大爷
右表 Map  key:妈妈	value:2+妈妈+三大爷
15/01/28 22:06:28 INFO mapred.MapTask: Starting flush of map output
15/01/28 22:06:28 INFO mapred.MapTask: Finished spill 0
15/01/28 22:06:28 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000001_0 is done. And is in the process of commiting
15/01/28 22:06:28 INFO mapred.LocalJobRunner: 
15/01/28 22:06:28 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000001_0' done.
15/01/28 22:06:28 INFO mapred.LocalJobRunner: 
15/01/28 22:06:28 INFO mapred.Merger: Merging 2 sorted segments
15/01/28 22:06:28 INFO mapred.Merger: Down to the last merge-pass, with 2 segments left of total size: 697 bytes
15/01/28 22:06:28 INFO mapred.LocalJobRunner: 
Reduce 孙子:二女儿	 爷爷:爷爷
Reduce 孙子:二女儿	 爷爷:二大爷
Reduce 孙子:二女儿	 爷爷:三大爷
Reduce 孙子:二儿子	 爷爷:爷爷
Reduce 孙子:二儿子	 爷爷:二大爷
Reduce 孙子:二儿子	 爷爷:三大爷
Reduce 孙子:大儿子	 爷爷:爷爷
Reduce 孙子:大儿子	 爷爷:二大爷
Reduce 孙子:大儿子	 爷爷:三大爷 
Reduce 孙子:小儿子	 爷爷:爷爷
Reduce 孙子:小儿子	 爷爷:二大爷
Reduce 孙子:小儿子	 爷爷:三大爷 
Reduce 孙子:大女儿	 爷爷:爷爷
Reduce 孙子:大女儿	 爷爷:二大爷
Reduce 孙子:大女儿	 爷爷:三大爷 
Reduce 孙子:小女儿	 爷爷:爷爷
Reduce 孙子:小女儿	 爷爷:二大爷
Reduce 孙子:小女儿	 爷爷:三大爷 
15/01/28 22:06:28 INFO mapred.TaskRunner: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
15/01/28 22:06:28 INFO mapred.LocalJobRunner: 
15/01/28 22:06:28 INFO mapred.TaskRunner: Task attempt_local_0001_r_000000_0 is allowed to commit now
15/01/28 22:06:28 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to hdfs://hadoop:9000/usr/hadoop/output4
15/01/28 22:06:28 INFO mapred.LocalJobRunner: reduce > reduce
15/01/28 22:06:28 INFO mapred.TaskRunner: Task 'attempt_local_0001_r_000000_0' done.
15/01/28 22:06:29 INFO mapred.JobClient:  map 100% reduce 100%
15/01/28 22:06:29 INFO mapred.JobClient: Job complete: job_local_0001
15/01/28 22:06:29 INFO mapred.JobClient: Counters: 14
15/01/28 22:06:29 INFO mapred.JobClient:   FileSystemCounters
15/01/28 22:06:29 INFO mapred.JobClient:     FILE_BYTES_READ=50580
15/01/28 22:06:29 INFO mapred.JobClient:     HDFS_BYTES_READ=515
15/01/28 22:06:29 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=103312
15/01/28 22:06:29 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=369
15/01/28 22:06:29 INFO mapred.JobClient:   Map-Reduce Framework
15/01/28 22:06:29 INFO mapred.JobClient:     Reduce input groups=12
15/01/28 22:06:29 INFO mapred.JobClient:     Combine output records=0
15/01/28 22:06:29 INFO mapred.JobClient:     Map input records=12
15/01/28 22:06:29 INFO mapred.JobClient:     Reduce shuffle bytes=0
15/01/28 22:06:29 INFO mapred.JobClient:     Reduce output records=19
15/01/28 22:06:29 INFO mapred.JobClient:     Spilled Records=48
15/01/28 22:06:29 INFO mapred.JobClient:     Map output bytes=645
15/01/28 22:06:29 INFO mapred.JobClient:     Combine input records=0
15/01/28 22:06:29 INFO mapred.JobClient:     Map output records=24
15/01/28 22:06:29 INFO mapred.JobClient:     Reduce input records=24

  The final output file is exactly the output shown in the printed information:

grandchild	grandparent
二女儿	爷爷
二女儿	二大爷
二女儿	三大爷
二儿子	爷爷
二儿子	二大爷
二儿子	三大爷
大儿子	爷爷
大儿子	二大爷
大儿子	三大爷 
小儿子	爷爷
小儿子	二大爷
小儿子	三大爷 
大女儿	爷爷
大女儿	二大爷
大女儿	三大爷 
小女儿	爷爷
小女儿	二大爷
小女儿	三大爷 

  If you found this useful, please give it a like; you are also welcome to join the big data group 413471695 for technical discussion ^_^

  Link to this post: "Hadoop Reading Notes (3): A Deeper Look at MapReduce Sorting and Single-Table Join"

 
