首页 > 代码库 > Lucene实现自定义分词器(同义词查询与高亮)

Lucene实现自定义分词器(同义词查询与高亮)

今天我们实现一个简单的分词器,仅仅做演示使用功能如下:

1、分词按照空格、横杠、点号进行拆分;

2、实现hi与hello的同义词查询功能;

3、实现hi与hello同义词的高亮显示;

MyAnalyzer实现代码:

public class MyAnalyzer extends Analyzer {
	private int analyzerType;
	
	public MyAnalyzer(int type) {
		super();
		analyzerType = type;
	}

	@Override
	protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
		MyTokenizer tokenizer = new MyTokenizer(fieldName, reader, analyzerType);
		return new TokenStreamComponents(tokenizer);
	}
}

MyTokenizer实现代码:

public class MyTokenizer extends Tokenizer {
	public class WordUnit{
		WordUnit(String word, int start, int length){
			this.word = word;
			this.start = start;
			this.length = length;
//System.out.println("\tWordUnit: " + word + "|" + start + "|" + length);
		}
		
		String word;
		int start;
		int length;
	}
	
	private int analyzerType;
	private int endPosition;
	private Iterator<WordUnit> it;
	private ArrayList<WordUnit> words;
	
	private final CharTermAttribute termAtt;
	private final OffsetAttribute offsetAtt;
	
	public MyTokenizer(String fieldName, Reader in, int type) {
		super(in);
		
		it = null;
		endPosition = 0;
		analyzerType = type;
		offsetAtt = addAttribute(OffsetAttribute.class);
		termAtt = addAttribute(CharTermAttribute.class);
		addAttribute(PayloadAttribute.class);
	}	

	@Override
	public boolean incrementToken() throws IOException {
		clearAttributes();
		
		char[] inputBuf = new char[1024];
		if(it == null) {
			int bufSize = input.read(inputBuf);
			if(bufSize <= 0) return false;

			int beginIndex = 0;
			int endIndex = 0;
			words = new ArrayList<WordUnit>();
			for(endIndex = 0; endIndex < bufSize; endIndex++) {
				if(inputBuf[endIndex] != '-' && inputBuf[endIndex] != ' ' && inputBuf[endIndex] != '.') continue;

				addWord(inputBuf, beginIndex, endIndex);
				beginIndex = endIndex + 1;
			}
			addWord(inputBuf, beginIndex, endIndex);//add the last
			
			if(words.isEmpty()) return false;
			it = words.iterator();
		}
		
		if(it != null && it.hasNext()){
			WordUnit word = it.next();
			termAtt.append(word.word);
			termAtt.setLength(word.word.length());
			
			endPosition = word.start + word.length;
			offsetAtt.setOffset(word.start, endPosition);

			return true;
		}
			
		return false;
	}

	@Override
	public void reset() throws IOException {
		super.reset();
		
		it = null;
		endPosition = 0;
	}

	@Override
	public final void end() {
		int finalOffset = correctOffset(this.endPosition);
		offsetAtt.setOffset(finalOffset, finalOffset);
	}
	
	private void addWord(char[] inputBuf, int begin, int end){
		if(end <= begin) return;

		String word = new String(inputBuf, begin, end - begin);
		words.add(new WordUnit(word, begin, end - begin));
	
	 	if(analyzerType == 0 && word.equals("hi")) words.add(new WordUnit("hello", begin, 2));
	 	if(analyzerType == 0 && word.equals("hello")) words.add(new WordUnit("hi", begin, 5));
	}
}

索引的时候分词器类型:analyzerType=0;

搜索的时候分词器类型:analyzerType=1;

高亮的时候分词器类型:analyzerType=0;

搜索hello时的效果如下:

Score doc 0 hightlight to: look <em>hello</em> on
Score doc 1 hightlight to: I am <em>hi</em> China Chinese
可以看到含有hi的文档也被搜索出来,同样也会高亮。

Lucene实现自定义分词器(同义词查询与高亮)