首页 > 代码库 > OutputFormat输出过程的学习
OutputFormat输出过程的学习
花了大约1周的时间,终于把MapReduce的5大阶段的源码学习结束掉了,收获不少,就算本人对Hadoop学习的一个里程碑式的纪念吧。今天花了一点点的时间,把MapReduce的最后一个阶段,输出OutputFormat给做了分析,这个过程跟InputFormat刚刚好是对着干的,二者极具对称性。为什么这么说呢,待我一一分析。
OutputFormat过程的作用就是定义数据key-value的输出格式,给你处理好后的数据,究竟以什么样的形式输出呢,才能让下次别人拿到这个文件的时候能准确的提取出里面的数据。这里,我们撇开这个话题,仅仅我知道的一些定义的数据格式的方法,比如在Redis中会有这样的设计:
[key-length][key][value-length][value][key-length][key][value-length][value]...
或者说不一定非要省空间,直接搞过分隔符
[key] [value]\n
[key] [value]\n
[key] [value]\n
.....
这样逐行读取,再以空格隔开,取出里面的键值对,这么做简单是简单,就是不紧凑,空间浪费得有点多。在MapReduce的OutputFormat的有种格式用的就是这种方式。
首先必须得了解OutputFormat里面到底有什么东西:
public interface OutputFormat<K, V> { /** * Get the {@link RecordWriter} for the given job. * 获取输出记录键值记录 * * @param ignored * @param job configuration for the job whose output is being written. * @param name the unique name for this part of the output. * @param progress mechanism for reporting progress while writing to file. * @return a {@link RecordWriter} to write the output for the job. * @throws IOException */ RecordWriter<K, V> getRecordWriter(FileSystem ignored, JobConf job, String name, Progressable progress) throws IOException; /** * Check for validity of the output-specification for the job. * * <p>This is to validate the output specification for the job when it is * a job is submitted. Typically checks that it does not already exist, * throwing an exception when it already exists, so that output is not * overwritten.</p> * 作业运行之前进行的检测工作,例如配置的输出目录是否存在等 * * @param ignored * @param job job configuration. * @throws IOException when output should not be attempted */ void checkOutputSpecs(FileSystem ignored, JobConf job) throws IOException; }很简单的2个方法,RecordWriter比较重要,后面的key-value的写入操作都是根据他来完成的。但是他是一个接口,在MapReduce中,我们用的最多的他的子类是FileOutputFormat:
/** A base class for {@link OutputFormat}. */ public abstract class FileOutputFormat<K, V> implements OutputFormat<K, V> {他是一个抽象类,但是实现了接口中的第二个方法checkOutputSpecs()方法:
public void checkOutputSpecs(FileSystem ignored, JobConf job) throws FileAlreadyExistsException, InvalidJobConfException, IOException { // Ensure that the output directory is set and not already there Path outDir = getOutputPath(job); if (outDir == null && job.getNumReduceTasks() != 0) { throw new InvalidJobConfException("Output directory not set in JobConf."); } if (outDir != null) { FileSystem fs = outDir.getFileSystem(job); // normalize the output directory outDir = fs.makeQualified(outDir); setOutputPath(job, outDir); // get delegation token for the outDir's file system TokenCache.obtainTokensForNamenodes(job.getCredentials(), new Path[] {outDir}, job); // check its existence if (fs.exists(outDir)) { //如果输出目录以及存在,则抛异常 throw new FileAlreadyExistsException("Output directory " + outDir + " already exists"); } } }就是检查输出目录在不在的操作。在这个类里还出现了一个辅助类:
public static Path getTaskOutputPath(JobConf conf, String name) throws IOException { // ${mapred.out.dir} Path outputPath = getOutputPath(conf); if (outputPath == null) { throw new IOException("Undefined job output-path"); } //根据OutputCommitter获取输出路径 OutputCommitter committer = conf.getOutputCommitter(); Path workPath = outputPath; TaskAttemptContext context = new TaskAttemptContext(conf, TaskAttemptID.forName(conf.get("mapred.task.id"))); if (committer instanceof FileOutputCommitter) { workPath = ((FileOutputCommitter)committer).getWorkPath(context, outputPath); } // ${mapred.out.dir}/_temporary/_${taskid}/${name} return new Path(workPath, name); }就是上面OutputCommiter,里面定义了很多和Task,job作业相关的方法。很多时候都会与OutputFormat合作的形式出现。他也有自己的子类实现FileOutputCommiter:
public class FileOutputCommitter extends OutputCommitter { public static final Log LOG = LogFactory.getLog( "org.apache.hadoop.mapred.FileOutputCommitter"); /** * Temporary directory name */ public static final String TEMP_DIR_NAME = "_temporary"; public static final String SUCCEEDED_FILE_NAME = "_SUCCESS"; static final String SUCCESSFUL_JOB_OUTPUT_DIR_MARKER = "mapreduce.fileoutputcommitter.marksuccessfuljobs"; public void setupJob(JobContext context) throws IOException { JobConf conf = context.getJobConf(); Path outputPath = FileOutputFormat.getOutputPath(conf); if (outputPath != null) { Path tmpDir = new Path(outputPath, FileOutputCommitter.TEMP_DIR_NAME); FileSystem fileSys = tmpDir.getFileSystem(conf); if (!fileSys.mkdirs(tmpDir)) { LOG.error("Mkdirs failed to create " + tmpDir.toString()); } } } ....在Reduce阶段的后面的写阶段,FileOutputFormat是默认的输出的类型:
//获取输出的key,value final RecordWriter<OUTKEY, OUTVALUE> out = new OldTrackingRecordWriter<OUTKEY, OUTVALUE>( reduceOutputCounter, job, reporter, finalName); OutputCollector<OUTKEY,OUTVALUE> collector = new OutputCollector<OUTKEY,OUTVALUE>() { public void collect(OUTKEY key, OUTVALUE value) throws IOException { //将处理后的key,value写入输出流中,最后写入HDFS作为最终结果 out.write(key, value); // indicate that progress update needs to be sent reporter.progress(); } };out就是直接发挥作用的类,但是是哪个Formtat的返回的呢,我们进入OldTrackingRecordWriter继续看:
public OldTrackingRecordWriter( org.apache.hadoop.mapred.Counters.Counter outputRecordCounter, JobConf job, TaskReporter reporter, String finalName) throws IOException { this.outputRecordCounter = outputRecordCounter; //默认是FileOutputFormat文件输出方式 this.fileOutputByteCounter = reporter .getCounter(FileOutputFormat.Counter.BYTES_WRITTEN); Statistics matchedStats = null; if (job.getOutputFormat() instanceof FileOutputFormat) { matchedStats = getFsStatistics(FileOutputFormat.getOutputPath(job), job); } fsStats = matchedStats; FileSystem fs = FileSystem.get(job); long bytesOutPrev = getOutputBytes(fsStats); //从配置中获取作业的输出方式 this.real = job.getOutputFormat().getRecordWriter(fs, job, finalName, reporter); long bytesOutCurr = getOutputBytes(fsStats); fileOutputByteCounter.increment(bytesOutCurr - bytesOutPrev); }果然是我们所想的那样,FileOutputFormat,但是不要忘了它的getRecordWriter()是抽象方法,调用它还必须由它的子类来实现:
public abstract RecordWriter<K, V> getRecordWriter(FileSystem ignored, JobConf job, String name, Progressable progress) throws IOException;在这里我们举出其中在InputFormat举过的例子,TextOutputFormat,SequenceFileOutputFormat,与TextInputFormat,SequenceFileInputFormat对应。也就说说上面2个子类定义了2种截然不同的输出格式,也就返回了不一样的RecordWriter实现类.在TextOutputFormat中,他定义了一个叫LineRecordWriter的定义:
public RecordWriter<K, V> getRecordWriter(FileSystem ignored, JobConf job, String name, Progressable progress) throws IOException { //从配置判断输出是否要压缩 boolean isCompressed = getCompressOutput(job); //配置中获取加在key-value的分隔符 String keyValueSeparator = job.get("mapred.textoutputformat.separator", "\t"); //根据是否压缩获取相应的LineRecordWriter if (!isCompressed) { Path file = FileOutputFormat.getTaskOutputPath(job, name); FileSystem fs = file.getFileSystem(job); FSDataOutputStream fileOut = fs.create(file, progress); return new LineRecordWriter<K, V>(fileOut, keyValueSeparator); } else { Class<? extends CompressionCodec> codecClass = getOutputCompressorClass(job, GzipCodec.class); // create the named codec CompressionCodec codec = ReflectionUtils.newInstance(codecClass, job); // build the filename including the extension Path file = FileOutputFormat.getTaskOutputPath(job, name + codec.getDefaultExtension()); FileSystem fs = file.getFileSystem(job); FSDataOutputStream fileOut = fs.create(file, progress); return new LineRecordWriter<K, V>(new DataOutputStream (codec.createOutputStream(fileOut)), keyValueSeparator); }他以一个内部类的形式存在于TextOutputFormat。而在SequenceFileOutputFormat中,他的形式是怎样的呢:
public RecordWriter<K, V> getRecordWriter( FileSystem ignored, JobConf job, String name, Progressable progress) throws IOException { // get the path of the temporary output file Path file = FileOutputFormat.getTaskOutputPath(job, name); FileSystem fs = file.getFileSystem(job); CompressionCodec codec = null; CompressionType compressionType = CompressionType.NONE; if (getCompressOutput(job)) { // find the kind of compression to do compressionType = getOutputCompressionType(job); // find the right codec Class<? extends CompressionCodec> codecClass = getOutputCompressorClass(job, DefaultCodec.class); codec = ReflectionUtils.newInstance(codecClass, job); } final SequenceFile.Writer out = SequenceFile.createWriter(fs, job, file, job.getOutputKeyClass(), job.getOutputValueClass(), compressionType, codec, progress); return new RecordWriter<K, V>() { public void write(K key, V value) throws IOException { out.append(key, value); } public void close(Reporter reporter) throws IOException { out.close();} }; }关键的操作都在于SequenceFile.Writer中。有不同的RecordWriter就会有不同的写入数据的方式,这里我们举LineRecordWriter的例子。看看他的写入方法:
//往输出流中写入key-value public synchronized void write(K key, V value) throws IOException { //判断键值对是否为空 boolean nullKey = key == null || key instanceof NullWritable; boolean nullValue = http://www.mamicode.com/value == null || value instanceof NullWritable;>在这个方法里,我们就能看出他的存储形式就是我刚刚在上面讲的第二种存储方式。这个方法将会在下面的代码中被执行:OutputCollector<OUTKEY,OUTVALUE> collector = new OutputCollector<OUTKEY,OUTVALUE>() { public void collect(OUTKEY key, OUTVALUE value) throws IOException { //将处理后的key,value写入输出流中,最后写入HDFS作为最终结果 out.write(key, value); // indicate that progress update needs to be sent reporter.progress(); } };过程可以这么理解:collector.collect()------->out.write(key, value)------->LineRecordWriter.write(key, value)------->DataOutputStream.write(key, value).
DataOutputStream是内置于LineRecordWriter的作为里面的变量存在的。这样从Reduce末尾阶段到Output的过程也完全打通了。下面可以看看这上面涉及的完整的类目关系。
下一阶段的学习,可能或偏向于Task,Job阶段的过程分析,更加宏观过程上的一个分析。也可能会分析某个功能块的实现过程,比如Hadoop的IPC过程,据说用了很多JAVA NIO的东西。
OutputFormat输出过程的学习