Apache Commons Codec: the language package
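The language package of Apache Commons Codec is a fairly powerful package, used mainly for processing words in various languages. Its support for Chinese characters, however, is poor. There is very little material about this package online, so for now I can only write some fairly superficial code; if I get the chance to work with it later, I will improve this post.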
Let's first look at some coding examples.
Examples of Soundex Coding

| Name         | Letters Coded | Coding |
|--------------|---------------|--------|
| Allricht     | l, r, c       | A-462  |
| Eberhard     | b, r, r       | E-166  |
| Engebrethson | n, g, b       | E-521  |
| Heimbach     | m, b, c       | H-512  |
| Hanselmann   | n, s, l       | H-524  |
| Henzelmann   | n, z, l       | H-524  |
| Hildebrand   | l, d, b       | H-431  |
| Kavanagh     | v, n, g       | K-152  |
| Lind, Van    | n, d          | L-530  |
| Lukaschowsky | k, s, s       | L-222  |
| McDonnell    | c, d, n       | M-235  |
| McGee        | c             | M-200  |
| O'Brien      | b, r, n       | O-165  |
| Opnian       | p, n, n       | O-155  |
| Oppenheimer  | p, n, m       | O-155  |
| Swhgler      | s, l, r       | S-460  |
| Riedemanas   | d, m, n       | R-355  |
| Zita         | t             | Z-300  |
| Zitzmeinn    | t, z, m       | Z-325  |
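To make it easier to see where the codes in the table come from, here is a minimal sketch of the classic American Soundex rules in plain Java. It is not the Commons Codec implementation; the class name `SoundexSketch` and the `encode` helper are made up for illustration, and the hyphen is added only to match the table's notation (as far as I know, the library's own `Soundex` class returns the code without a hyphen, e.g. `H524`). The rules assumed here: keep the first letter, code the following consonants as digits, treat adjacent letters with the same digit as one (also when they are separated only by H or W), and pad with zeros to three digits.

```java
final class SoundexSketch {
    // Digit for each letter A..Z; '0' marks letters that are not coded (vowels, H, W, Y).
    private static final String MAPPING = "01230120022455012623010202";

    // Hypothetical helper: returns a code such as "H-524", or "" if the input has no ASCII letters.
    static String encode(String name) {
        StringBuilder letters = new StringBuilder();
        for (char c : name.toCharArray()) {
            char u = Character.toUpperCase(c);
            if (u >= 'A' && u <= 'Z') {
                letters.append(u); // drop apostrophes, spaces and other non-letters
            }
        }
        if (letters.length() == 0) {
            return "";
        }
        StringBuilder digits = new StringBuilder();
        char last = MAPPING.charAt(letters.charAt(0) - 'A'); // the first letter's digit also suppresses repeats
        for (int i = 1; i < letters.length() && digits.length() < 3; i++) {
            char ch = letters.charAt(i);
            if (ch == 'H' || ch == 'W') {
                continue; // H and W do not separate letters that share a digit
            }
            char digit = MAPPING.charAt(ch - 'A');
            if (digit != '0' && digit != last) {
                digits.append(digit);
            }
            last = digit; // a vowel resets this to '0', so the same digit may be coded again
        }
        while (digits.length() < 3) {
            digits.append('0'); // pad short codes with zeros
        }
        return letters.charAt(0) + "-" + digits;
    }

    public static void main(String[] args) {
        System.out.println(encode("Hanselmann"));   // H-524
        System.out.println(encode("Henzelmann"));   // H-524
        System.out.println(encode("Lukaschowsky")); // L-222
        System.out.println(encode("O'Brien"));      // O-165
    }
}
```

Hanselmann and Henzelmann are spelled differently but receive the same code, which is exactly the point of a phonetic encoding.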
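In English text, many words sound alike even though they are spelled differently. The following code tries out the methods in this package on such words: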
package test.ffm83.commons.codec;

import org.apache.commons.codec.language.RefinedSoundex;
import org.apache.commons.lang.StringUtils;

/**
 * Uses the language package of Apache Commons Codec to measure how similar two strings sound.
 * difference() returns a value between 0 and the length of the shorter encoding:
 * 0 means the strings are unrelated; 4 or more suggests they may sound fairly similar.
 * @author 范芳铭
 */
public class EasyLanguageDiff {
    private RefinedSoundex stringEncoder = this.createStringEncoder();

    public static void main(String[] args) throws Exception {
        EasyLanguageDiff diff = new EasyLanguageDiff();
        diff.getDifference();
        diff.getEncode();
    }

    protected RefinedSoundex createStringEncoder() {
        return new RefinedSoundex();
    }

    public RefinedSoundex getStringEncoder() {
        return this.stringEncoder;
    }

    // Prints the difference score for several pairs of strings.
    public void getDifference() throws Exception {
        System.out.println(StringUtils.center("String difference", 50, "-"));
        System.out.println(this.getStringEncoder().difference(null, null));
        System.out.println(this.getStringEncoder().difference("", ""));
        System.out.println(this.getStringEncoder().difference(" ", " "));
        System.out.println(this.getStringEncoder().difference("Margaret", "Andrew"));
        System.out.println(this.getStringEncoder().difference("Smith", "Smythe"));
        System.out.println(this.getStringEncoder().difference("Ann", "Andrew"));
        System.out.println(this.getStringEncoder().difference("Green", "Greene"));
        System.out.println(this.getStringEncoder().difference("Smithers", "Smythers"));
        System.out.println();
    }

    // Prints the Refined Soundex encoding of a few words.
    public void getEncode() throws Exception {
        System.out.println(StringUtils.center("Language encoding", 50, "-"));
        System.out.println(this.getStringEncoder().encode("testing"));
        System.out.println(this.getStringEncoder().encode("TESTING"));
        System.out.println(this.getStringEncoder().encode("dogs"));
        System.out.println(RefinedSoundex.US_ENGLISH.encode("dogs"));
        // Chinese characters are currently not supported; passing them in throws an exception.
        // System.out.println(this.getStringEncoder().encode("范芳铭"));
        System.out.println();
    }
}

The output is as follows:

----------------String difference-----------------
0
0
0
1
6
3
5
8
----------------Language encoding-----------------
T6036084
T6036084
D6043
D6043
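To see where these numbers come from, here is a rough plain-Java sketch of what the encoding and the difference score appear to do, written so that it runs without the library. The class `RefinedSoundexSketch` and its two methods are my own illustration, not the Commons Codec source. The assumptions: the mapping string is the US English Refined Soundex table, the first letter is kept and then every letter (including the first) is turned into a digit with consecutive repeats collapsed, and the difference is the number of positions at which the two encodings hold the same character. Unlike the library, this sketch simply skips anything that is not an ASCII letter.

```java
final class RefinedSoundexSketch {
    // Assumed US English Refined Soundex digits for the letters A..Z.
    private static final String MAPPING = "01360240043788015936020505";

    // Hypothetical helper: first letter, then one digit per letter with consecutive repeats collapsed.
    static String encode(String text) {
        if (text == null) {
            return null;
        }
        StringBuilder out = new StringBuilder();
        char last = '*'; // sentinel that never equals a digit
        for (char c : text.toCharArray()) {
            char u = Character.toUpperCase(c);
            if (u < 'A' || u > 'Z') {
                continue; // this sketch ignores non-ASCII letters, digits and spaces
            }
            if (out.length() == 0) {
                out.append(u); // keep the first letter itself
            }
            char digit = MAPPING.charAt(u - 'A');
            if (digit != last) {
                out.append(digit);
            }
            last = digit;
        }
        return out.toString();
    }

    // Counts the positions at which both encodings hold the same character.
    static int difference(String a, String b) {
        String ea = encode(a);
        String eb = encode(b);
        if (ea == null || eb == null) {
            return 0;
        }
        int shortest = Math.min(ea.length(), eb.length());
        int same = 0;
        for (int i = 0; i < shortest; i++) {
            if (ea.charAt(i) == eb.charAt(i)) {
                same++;
            }
        }
        return same;
    }

    public static void main(String[] args) {
        System.out.println(encode("testing"));                // T6036084
        System.out.println(encode("Smith"));                  // S38060
        System.out.println(encode("Smythe"));                 // S38060
        System.out.println(difference("Smith", "Smythe"));    // 6
        System.out.println(difference("Margaret", "Andrew")); // 1
    }
}
```

This also explains why the score can exceed 4: a Refined Soundex code has no fixed length, so "Smithers" and "Smythers" both encode to eight characters and match at all eight positions.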