TF-IDF Document Vectorization with Mahout MapReduce
Source code analysis of Mahout's SparseVectorsFromSequenceFiles
Goal: convert a given collection of sequence files into SparseVectors.
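For context, the whole pipeline is normally driven through the SparseVectorsFromSequenceFiles entry point. Below is a minimal driver sketch; the paths are made up and the flag spellings are assumptions based on the common seq2sparse options, so check them against your Mahout version.

import org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles;

public class Seq2SparseDriver {
  public static void main(String[] args) throws Exception {
    String[] toolArgs = {
        "-i", "/data/seqfiles",   // hypothetical input dir of SequenceFiles (Text docId -> Text content)
        "-o", "/data/vectors",    // hypothetical output dir: dictionary, tf-vectors, tfidf-vectors, ...
        "-wt", "tfidf",           // weighting: tf or tfidf
        "-ng", "1",               // max n-gram size
        "-ow"                     // overwrite existing output
    };
    SparseVectorsFromSequenceFiles.main(toolArgs);
  }
}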
1. Tokenize the documents
1.1) Create an Analyzer with the latest {@link org.apache.lucene.util.Version}; it is used for the tokenization in 1.2 below.
Class<? extends Analyzer> analyzerClass = StandardAnalyzer.class;
if (cmdLine.hasOption(analyzerNameOpt)) {
  String className = cmdLine.getValue(analyzerNameOpt).toString();
  analyzerClass = Class.forName(className).asSubclass(Analyzer.class);
  // try instantiating it, b/c there isn't any point in setting it if
  // you can't instantiate it
  AnalyzerUtils.createAnalyzer(analyzerClass);
}
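As a rough illustration (not Mahout's actual AnalyzerUtils code), creating an Analyzer from a class name by reflection typically tries the no-arg constructor first and falls back to the constructor that takes a Lucene Version; the version constant below is an assumption.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.util.Version;

public final class AnalyzerFactorySketch {
  // Try a no-arg constructor, then fall back to the constructor that takes a Lucene Version.
  public static Analyzer instantiate(Class<? extends Analyzer> clazz) throws Exception {
    try {
      return clazz.getConstructor().newInstance();
    } catch (NoSuchMethodException e) {
      return clazz.getConstructor(Version.class).newInstance(Version.LUCENE_46); // version is an assumption
    }
  }
}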
1.2) Convert the input documents into arrays of tokens held in a {@link StringTuple} (the input documents must be in {@link org.apache.hadoop.io.SequenceFile} format).
DocumentProcessor.tokenizeDocuments(inputDir, analyzerClass, tokenizedPath, conf);
Input: inputDir  Output: tokenizedPath
SequenceFileTokenizerMapper:
// Tokenize each input document with the Analyzer and collect the tokens into a StringTuple
TokenStream stream = analyzer.tokenStream(key.toString(), new StringReader(value.toString()));
CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
stream.reset();
StringTuple document = new StringTuple(); // StringTuple is an ordered List of Strings usable in a Hadoop Map/Reduce job
while (stream.incrementToken()) {
  if (termAtt.length() > 0) {
    document.add(new String(termAtt.buffer(), 0, termAtt.length()));
  }
}
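To see what this loop produces, here is a standalone sketch of the same Lucene tokenization (assuming a Lucene 4.x StandardAnalyzer; the sample text is made up):

import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class TokenizeSketch {
  public static void main(String[] args) throws Exception {
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46); // version is an assumption
    TokenStream stream = analyzer.tokenStream("doc1", new StringReader("Mahout builds TF-IDF vectors"));
    CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
    List<String> tokens = new ArrayList<String>();
    stream.reset();
    while (stream.incrementToken()) {
      tokens.add(new String(termAtt.buffer(), 0, termAtt.length()));
    }
    stream.end();
    stream.close();
    analyzer.close();
    System.out.println(tokens); // lower-cased tokens, e.g. [mahout, builds, tf, idf, vectors]
  }
}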
2. Create TF vectors (Term Frequency Vectors) --- multiple Map/Reduce jobs
DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath, outputDir, tfDirName, conf,
    minSupport, maxNGramSize, minLLRValue, -1.0f, false, reduceTasks, chunkSize,
    sequentialAccessOutput, namedVectors);
2.1) Global word count (TF)
startWordCounting(input, dictionaryJobPath, baseConf, minSupport);
Count the global term frequencies in parallel with Map/Reduce; only the (maxNGramSize == 1) case is considered here.
Input: tokenizedPath  Output: wordCountPath
TermCountMapper:
// Count the term frequencies within a single text document
OpenObjectLongHashMap<String> wordCount = new OpenObjectLongHashMap<String>();
for (String word : value.getEntries()) {
  if (wordCount.containsKey(word)) {
    wordCount.put(word, wordCount.get(word) + 1);
  } else {
    wordCount.put(word, 1);
  }
}
wordCount.forEachPair(new ObjectLongProcedure<String>() {
  @Override
  public boolean apply(String first, long second) {
    try {
      context.write(new Text(first), new LongWritable(second));
    } catch (IOException e) {
      context.getCounter("Exception", "Output IO Exception").increment(1);
    } catch (InterruptedException e) {
      context.getCounter("Exception", "Interrupted Exception").increment(1);
    }
    return true;
  }
});
TermCountCombiner: (same as TermCountReducer)
TermCountReducer:
// Aggregate all words and their weights, summing the weights of identical words
long sum = 0;
for (LongWritable value : values) {
  sum += value.get();
}
if (sum >= minSupport) { // TermCountCombiner does not apply this filter
  context.write(key, new LongWritable(sum));
}
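In effect, only terms whose global count reaches minSupport ever make it into the dictionary. A toy illustration in plain Java (not Mahout code, counts are made up):

import java.util.HashMap;
import java.util.Map;

public class MinSupportSketch {
  public static void main(String[] args) {
    Map<String, Long> globalCounts = new HashMap<String, Long>(); // made-up global counts
    globalCounts.put("mahout", 12L);
    globalCounts.put("vector", 7L);
    globalCounts.put("typo", 1L);

    int minSupport = 2; // same meaning as the reducer's minSupport parameter
    for (Map.Entry<String, Long> e : globalCounts.entrySet()) {
      if (e.getValue() >= minSupport) {
        System.out.println(e.getKey() + "\t" + e.getValue()); // kept: mahout, vector
      }
      // "typo" (count 1) is filtered out and never receives a dictionary id
    }
  }
}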
2.2) Build the dictionary
List<Path> dictionaryChunks;
dictionaryChunks = createDictionaryChunks(dictionaryJobPath, output, baseConf,
    chunkSizeInMegabytes, maxTermDimension);
Read the feature-frequency list produced by the word-count job in 2.1 and assign an id to each feature.
Input: wordCountPath  Output: dictionaryJobPath
/**
 * Read the feature frequency List which is built at the end of the Word Count Job and assign ids to them.
 * This will use constant memory and will run at the speed of your disk read
 */
private static List<Path> createDictionaryChunks(Path wordCountPath,
                                                 Path dictionaryPathBase,
                                                 Configuration baseConf,
                                                 int chunkSizeInMegabytes,
                                                 int[] maxTermDimension) throws IOException {
  List<Path> chunkPaths = Lists.newArrayList();
  Configuration conf = new Configuration(baseConf);
  FileSystem fs = FileSystem.get(wordCountPath.toUri(), conf);

  long chunkSizeLimit = chunkSizeInMegabytes * 1024L * 1024L; // 64 MB by default
  int chunkIndex = 0;
  Path chunkPath = new Path(dictionaryPathBase, DICTIONARY_FILE + chunkIndex);
  chunkPaths.add(chunkPath);

  SequenceFile.Writer dictWriter = new SequenceFile.Writer(fs, conf, chunkPath, Text.class, IntWritable.class);

  try {
    long currentChunkSize = 0;
    Path filesPattern = new Path(wordCountPath, OUTPUT_FILES_PATTERN);
    int i = 0;
    for (Pair<Writable,Writable> record
         : new SequenceFileDirIterable<Writable,Writable>(filesPattern, PathType.GLOB, null, null, true, conf)) {
      if (currentChunkSize > chunkSizeLimit) { // start a new dictionary chunk file
        Closeables.close(dictWriter, false);
        chunkIndex++;

        chunkPath = new Path(dictionaryPathBase, DICTIONARY_FILE + chunkIndex);
        chunkPaths.add(chunkPath);

        dictWriter = new SequenceFile.Writer(fs, conf, chunkPath, Text.class, IntWritable.class);
        currentChunkSize = 0;
      }

      Writable key = record.getFirst();
      int fieldSize = DICTIONARY_BYTE_OVERHEAD + key.toString().length() * 2 + Integer.SIZE / 8;
      currentChunkSize += fieldSize;
      dictWriter.append(key, new IntWritable(i++)); // assign the id
    }
    maxTermDimension[0] = i; // record the total number of terms (the vector dimension)
  } finally {
    Closeables.close(dictWriter, false);
  }

  return chunkPaths;
}
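The chunking decision rests on a cheap per-entry size estimate rather than the real file size. A sketch of the same accounting in isolation (the DICTIONARY_BYTE_OVERHEAD value and the words are assumptions):

public class DictionaryChunkSizeSketch {
  private static final int DICTIONARY_BYTE_OVERHEAD = 4; // assumed value of Mahout's constant

  public static void main(String[] args) {
    long chunkSizeLimit = 64L * 1024 * 1024; // 64 MB default chunk size
    long currentChunkSize = 0;
    int chunkIndex = 0;

    String[] words = {"mahout", "vector", "tfidf"}; // made-up dictionary entries
    for (String word : words) {
      if (currentChunkSize > chunkSizeLimit) { // same check as createDictionaryChunks: roll to a new chunk file
        chunkIndex++;
        currentChunkSize = 0;
      }
      // estimate: overhead + 2 bytes per char of the term + 4 bytes for the int id
      int fieldSize = DICTIONARY_BYTE_OVERHEAD + word.length() * 2 + Integer.SIZE / 8;
      currentChunkSize += fieldSize;
      System.out.println(word + " -> dictionary chunk " + chunkIndex + " (+" + fieldSize + " bytes)");
    }
  }
}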
2.3) Build partial vectors (TF)
int partialVectorIndex = 0;
Collection<Path> partialVectorPaths = Lists.newArrayList();
for (Path dictionaryChunk : dictionaryChunks) {
  Path partialVectorOutputPath = new Path(output, VECTOR_OUTPUT_FOLDER + partialVectorIndex++);
  partialVectorPaths.add(partialVectorOutputPath);
  makePartialVectors(input, baseConf, maxNGramSize, dictionaryChunk, partialVectorOutputPath,
      maxTermDimension[0], sequentialAccess, namedVectors, numReducers);
}
Create a partial vector for the input documents from one chunk of features.
(Because the dictionary is split across several chunk files, each file can only populate part of the full vector; each such part is called a partial vector.)
Input: tokenizedPath  Output: partialVectorPaths
Mapper: (the default Mapper, i.e. an identity pass-through)
TFPartialVectorReducer:
// Load the dictionary chunk file
//MAHOUT-1247
Path dictionaryFile = HadoopUtil.getSingleCachedFile(conf);
// key is word value is id
for (Pair<Writable, IntWritable> record
     : new SequenceFileIterable<Writable, IntWritable>(dictionaryFile, true, conf)) {
  dictionary.put(record.getFirst().toString(), record.getSecond().get());
}
// Convert a document into a sparse vector
StringTuple value = it.next();
Vector vector = new RandomAccessSparseVector(dimension, value.length()); // guess at initial size
for (String term : value.getEntries()) {
  if (!term.isEmpty() && dictionary.containsKey(term)) { // unigram
    int termId = dictionary.get(term);
    vector.setQuick(termId, vector.getQuick(termId) + 1);
  }
}
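Because each reducer only holds one dictionary chunk, terms that live in another chunk are simply skipped here and get contributed by a different partial vector. A toy sketch of that lookup using Mahout's math API (the chunk contents and document are made up):

import java.util.HashMap;
import java.util.Map;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class PartialVectorSketch {
  public static void main(String[] args) {
    Map<String, Integer> dictionaryChunk = new HashMap<String, Integer>(); // one chunk: term -> id
    dictionaryChunk.put("mahout", 0);
    dictionaryChunk.put("vector", 1);

    String[] document = {"mahout", "builds", "vector", "mahout"};
    int dimension = 100; // total dictionary size across all chunks

    Vector vector = new RandomAccessSparseVector(dimension, document.length);
    for (String term : document) {
      Integer termId = dictionaryChunk.get(term);
      if (termId != null) { // "builds" belongs to another chunk (or was pruned), so it is skipped here
        vector.setQuick(termId, vector.getQuick(termId) + 1);
      }
    }
    System.out.println(vector); // non-zeros {0:2.0, 1:1.0} -- a partial TF vector for this chunk
  }
}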
2.4) Merge partial vectors (TF)
Configuration conf = new Configuration(baseConf);
Path outputDir = new Path(output, tfVectorsFolderName);
PartialVectorMerger.mergePartialVectors(partialVectorPaths, outputDir, conf, normPower, logNormalize,
    maxTermDimension[0], sequentialAccess, namedVectors, numReducers);
Merge all the partial {@link org.apache.mahout.math.RandomAccessSparseVector}s into complete {@link org.apache.mahout.math.RandomAccessSparseVector}s.
Input: partialVectorPaths  Output: tfVectorsFolder
Mapper: (the default Mapper, i.e. an identity pass-through)
PartialVectorMergeReducer:
// Merge the partial vectors into a complete TF vector
Vector vector = new RandomAccessSparseVector(dimension, 10);
for (VectorWritable value : values) {
  vector.assign(value.get(), Functions.PLUS); // merge the vectors covering different words into one
}
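A small sketch of what that element-wise merge does, using Mahout's math API directly (the values are made up):

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.function.Functions;

public class MergePartialVectorsSketch {
  public static void main(String[] args) {
    int dimension = 10;

    Vector partial0 = new RandomAccessSparseVector(dimension, 10); // built from dictionary chunk 0
    partial0.setQuick(0, 2.0);
    partial0.setQuick(3, 1.0);

    Vector partial1 = new RandomAccessSparseVector(dimension, 10); // built from dictionary chunk 1
    partial1.setQuick(7, 4.0);

    // Same idea as PartialVectorMergeReducer: element-wise addition of the partial vectors
    Vector merged = new RandomAccessSparseVector(dimension, 10);
    merged.assign(partial0, Functions.PLUS);
    merged.assign(partial1, Functions.PLUS);

    System.out.println(merged); // non-zeros at 0, 3 and 7 -- the complete TF vector
  }
}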
3. Compute document frequencies for IDF (document frequency vectors) --- multiple Map/Reduce jobs
Pair<Long[], List<Path>> docFrequenciesFeatures = null;
// Should document frequency features be processed
if (shouldPrune || processIdf) {
  log.info("Calculating IDF");
  docFrequenciesFeatures =
      TFIDFConverter.calculateDF(new Path(outputDir, tfDirName), outputDir, conf, chunkSize);
}
3.1) Count document frequencies (DF)
Path wordCountPath = new Path(output, WORDCOUNT_OUTPUT_FOLDER);
startDFCounting(input, wordCountPath, baseConf);
Input: tfDir  Output: featureCountPath (the wordCountPath above)
TermDocumentCountMapper:
// For every term in a document emit a count of 1, plus a count of 1 toward the total number of documents
Vector vector = value.get();
for (Vector.Element e : vector.nonZeroes()) {
  out.set(e.index());
  context.write(out, ONE);
}
context.write(TOTAL_COUNT, ONE);
Combiner:(TermDocumentCountReducer)
TermDocumentCountReducer:
// Sum each term's document frequency, and the total document count
long sum = 0;
for (LongWritable value : values) {
  sum += value.get();
}
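Note that the mapper walks nonZeroes() of a TF vector, so a document contributes at most 1 per term no matter how often the term occurs; the sums here are therefore document frequencies, not term frequencies. A tiny sketch of that distinction (values are made up):

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class DfVsTfSketch {
  public static void main(String[] args) {
    Vector tf = new RandomAccessSparseVector(10, 2);
    tf.setQuick(5, 3.0); // term 5 occurred 3 times in this document

    // TermDocumentCountMapper emits 1 per non-zero term id, regardless of the TF value,
    // so term 5 adds exactly 1 to its document frequency for this document.
    for (Vector.Element e : tf.nonZeroes()) {
      System.out.println("df contribution for term " + e.index() + ": 1 (tf was " + e.get() + ")");
    }
  }
}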
3.2) Chunk the DF counts
return createDictionaryChunks(wordCountPath, output, baseConf, chunkSizeInMegabytes);
Split the DF counts across several chunk files, and record the total number of terms and the total number of documents.
Input: featureCountPath  Output: dictionaryPathBase
/**
 * Read the document frequency List which is built at the end of the DF Count Job. This will use constant
 * memory and will run at the speed of your disk read
 */
private static Pair<Long[], List<Path>> createDictionaryChunks(Path featureCountPath,
                                                               Path dictionaryPathBase,
                                                               Configuration baseConf,
                                                               int chunkSizeInMegabytes) throws IOException {
  List<Path> chunkPaths = Lists.newArrayList();
  Configuration conf = new Configuration(baseConf);
  FileSystem fs = FileSystem.get(featureCountPath.toUri(), conf);

  long chunkSizeLimit = chunkSizeInMegabytes * 1024L * 1024L;
  int chunkIndex = 0;
  Path chunkPath = new Path(dictionaryPathBase, FREQUENCY_FILE + chunkIndex);
  chunkPaths.add(chunkPath);
  SequenceFile.Writer freqWriter =
      new SequenceFile.Writer(fs, conf, chunkPath, IntWritable.class, LongWritable.class);

  try {
    long currentChunkSize = 0;
    long featureCount = 0;
    long vectorCount = Long.MAX_VALUE;
    Path filesPattern = new Path(featureCountPath, OUTPUT_FILES_PATTERN);
    for (Pair<IntWritable,LongWritable> record
         : new SequenceFileDirIterable<IntWritable,LongWritable>(filesPattern, PathType.GLOB, null, null, true, conf)) {

      if (currentChunkSize > chunkSizeLimit) {
        Closeables.close(freqWriter, false);
        chunkIndex++;

        chunkPath = new Path(dictionaryPathBase, FREQUENCY_FILE + chunkIndex);
        chunkPaths.add(chunkPath);

        freqWriter = new SequenceFile.Writer(fs, conf, chunkPath, IntWritable.class, LongWritable.class);
        currentChunkSize = 0;
      }

      int fieldSize = SEQUENCEFILE_BYTE_OVERHEAD + Integer.SIZE / 8 + Long.SIZE / 8;
      currentChunkSize += fieldSize;
      IntWritable key = record.getFirst();
      LongWritable value = record.getSecond();
      if (key.get() >= 0) {
        freqWriter.append(key, value);
      } else if (key.get() == -1) { // the key -1 carries the total document count
        vectorCount = value.get();
      }
      featureCount = Math.max(key.get(), featureCount);
    }
    featureCount++;
    Long[] counts = {featureCount, vectorCount}; // number of terms, number of documents
    return new Pair<Long[], List<Path>>(counts, chunkPaths);
  } finally {
    Closeables.close(freqWriter, false);
  }
}
4. Create TF-IDF (Term Frequency-Inverse Document Frequency) vectors
TFIDFConverter.processTfIdf(
new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
outputDir, conf, docFrequenciesFeatures, minDf, maxDF, norm, logNormalize,
sequentialAccessOutput, namedVectors, reduceTasks);
4.1) Generate partial vectors (TFIDF)
int partialVectorIndex = 0;
List<Path> partialVectorPaths = Lists.newArrayList();
List<Path> dictionaryChunks = datasetFeatures.getSecond();
for (Path dictionaryChunk : dictionaryChunks) {
  Path partialVectorOutputPath = new Path(output, VECTOR_OUTPUT_FOLDER + partialVectorIndex++);
  partialVectorPaths.add(partialVectorOutputPath);
  makePartialVectors(input, baseConf, datasetFeatures.getFirst()[0], datasetFeatures.getFirst()[1],
      minDf, maxDF, dictionaryChunk, partialVectorOutputPath, sequentialAccessOutput, namedVector);
}
Create a partial TF-IDF vector from one chunk of features.
Input: tfVectorsFolder  Output: partialVectorOutputPath
DistributedCache.setCacheFiles(new URI[] {dictionaryFilePath.toUri()}, conf); // cache the DF chunk file
Mapper: (the default Mapper, i.e. an identity pass-through)
TFIDFPartialVectorReducer:
// Compute the TF-IDF value of every term in each document
Vector value = it.next().get();
Vector vector = new RandomAccessSparseVector((int) featureCount, value.getNumNondefaultElements());
for (Vector.Element e : value.nonZeroes()) {
  if (!dictionary.containsKey(e.index())) {
    continue;
  }
  long df = dictionary.get(e.index());
  if (maxDf > -1 && (100.0 * df) / vectorCount > maxDf) {
    continue;
  }
  if (df < minDf) {
    df = minDf;
  }
  vector.setQuick(e.index(), tfidf.calculate((int) e.get(), (int) df, (int) featureCount, (int) vectorCount));
}
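The weight itself comes from Mahout's TFIDF helper, which delegates to Lucene-style weighting, roughly sqrt(tf) * (log(numDocs / (df + 1)) + 1). The sketch below is an approximation of what tfidf.calculate(...) computes, not the authoritative implementation:

public class TfIdfWeightSketch {
  // Lucene-style weighting: tf part is sqrt(tf), idf part is ln(N / (df + 1)) + 1. This is an
  // assumption about the underlying similarity; the exact formula depends on the Lucene version.
  static double tfidf(int tf, int df, int numDocs) {
    double tfPart = Math.sqrt(tf);
    double idfPart = Math.log((double) numDocs / (df + 1)) + 1.0;
    return tfPart * idfPart;
  }

  public static void main(String[] args) {
    // term appears 4 times in the document and in 10 of 1000 documents overall
    System.out.println(tfidf(4, 10, 1000)); // 2 * (ln(90.9) + 1) ≈ 11.0
  }
}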
4.2) Merge partial vectors (TFIDF)
Configuration conf = new Configuration(baseConf);
Path outputDir = new Path(output, DOCUMENT_VECTOR_OUTPUT_FOLDER);
PartialVectorMerger.mergePartialVectors(partialVectorPaths, outputDir, baseConf, normPower, logNormalize,
    datasetFeatures.getFirst()[0].intValue(), sequentialAccessOutput, namedVector, numReducers);
Merge all the partial vectors into one complete document vector.
Input: partialVectorOutputPath  Output: outputDir
Mapper: (the default Mapper, i.e. an identity pass-through)
PartialVectorMergeReducer:
// Sum up the partial TF-IDF vectors
Vector vector = new RandomAccessSparseVector(dimension, 10);
for (VectorWritable value : values) {
  vector.assign(value.get(), Functions.PLUS);
}
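The merge step can also normalize the final vectors (normPower / logNormalize). A quick sketch of p-norm normalization with Mahout's math API (values are made up):

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class NormalizeSketch {
  public static void main(String[] args) {
    Vector v = new RandomAccessSparseVector(10, 3);
    v.setQuick(0, 3.0);
    v.setQuick(1, 4.0);

    // L2 normalization (normPower = 2), the usual choice when cosine similarity is the goal
    Vector normalized = v.normalize(2);
    System.out.println(normalized); // non-zeros {0:0.6, 1:0.8}
  }
}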