Lemmatisation & Stemming
Lemmatisation is closely related to stemming. The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech. However, stemmers are typically easier to implement and run faster, and the reduced accuracy may not matter for some applications.

1. Stemmer: extracts the stem or root form of a word (the result is not necessarily a complete, meaningful word).

Porter Stemmer: based on the Porter stemming algorithm.
>>> from nltk.stem.porter import PorterStemmer
>>> porter_stemmer = PorterStemmer()
>>> porter_stemmer.stem('maximum')
'maximum'
>>> porter_stemmer.stem('presumably')
'presum'
>>> porter_stemmer.stem('multiply')
'multipli'
>>> porter_stemmer.stem('provision')
'provis'
>>> porter_stemmer.stem('owed')
'owe'
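To make concrete what "operates on a single word without knowledge of the context" means, here is a minimal sketch of step 1a of the Porter algorithm only (the plural-suffix rules). It is not the full algorithm and not NLTK's implementation; the function name `porter_step_1a` is made up for this illustration.

```python
def porter_step_1a(word: str) -> str:
    """Apply only step 1a of the Porter algorithm (plural suffixes).

    Illustrative sketch: the real algorithm has several further steps.
    """
    # Rules are tried in order; the first (longest) matching suffix wins.
    rules = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))
```

The function only ever looks at the characters of the word itself, which is why a stemmer is fast but can return non-words such as "poni".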
Lancaster Stemmer: based on the Lancaster stemming algorithm.
>>> from nltk.stem.lancaster import LancasterStemmer
>>> lancaster_stemmer = LancasterStemmer()
>>> lancaster_stemmer.stem('maximum')
'maxim'
>>> lancaster_stemmer.stem('presumably')
'presum'
>>> lancaster_stemmer.stem('multiply')
'multiply'
>>> lancaster_stemmer.stem('provision')
'provid'
>>> lancaster_stemmer.stem('owed')
'ow'
Snowball Stemmer: based on the Snowball stemming algorithm.
>>> from nltk.stem import SnowballStemmer
>>> snowball_stemmer = SnowballStemmer("english")
>>> snowball_stemmer.stem('maximum')
'maximum'
>>> snowball_stemmer.stem('presumably')
'presum'
>>> snowball_stemmer.stem('multiply')
'multipli'
>>> snowball_stemmer.stem('provision')
'provis'
>>> snowball_stemmer.stem('owed')
'owe'
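2. Lemmatization: reduces a word in any inflected form to its base (dictionary) form. It works noticeably better when the part of speech of the word is supplied.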
>>> from nltk.stem.wordnet import WordNetLemmatizer
>>> lmtzr = WordNetLemmatizer()
>>> lmtzr.lemmatize('cars')
'car'
>>> lmtzr.lemmatize('feet')
'foot'
>>> lmtzr.lemmatize('people')
'people'
>>> lmtzr.lemmatize('fantasized', pos="v")  # pass the POS tag
'fantasize'
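Why the part of speech matters can be shown with a toy lookup table. This is only a sketch: `TOY_LEXICON` and `toy_lemmatize` are hypothetical names, the four entries are hand-picked for illustration, and a real lemmatizer such as WordNetLemmatizer relies on a full lexicon plus morphological rules. The default of `"n"` (noun) is chosen here to mirror the default of NLTK's `lemmatize`.

```python
# Hypothetical miniature lexicon: (surface form, part of speech) -> lemma.
TOY_LEXICON = {
    ("meeting", "n"): "meeting",
    ("meeting", "v"): "meet",
    ("saw", "n"): "saw",
    ("saw", "v"): "see",
}


def toy_lemmatize(word: str, pos: str = "n") -> str:
    """Return the lemma for (word, pos); fall back to the word itself."""
    return TOY_LEXICON.get((word.lower(), pos), word)


print(toy_lemmatize("meeting", pos="n"))  # noun reading
print(toy_lemmatize("meeting", pos="v"))  # verb reading
print(toy_lemmatize("saw", pos="v"))
```

The same surface form maps to different lemmas depending on the tag, which is exactly the distinction a context-free stemmer cannot make.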
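3. MaxMatch: a greedy longest-match algorithm that is commonly used for word segmentation in Chinese natural language processing.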
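# Requires the NLTK 'words' and 'wordnet' corpora (install via nltk.download).
from nltk.stem import WordNetLemmatizer
from nltk.corpus import words

wordlist = set(words.words())
wordnet_lemmatizer = WordNetLemmatizer()

def max_match(text):
    pos2 = len(text)
    result = ''
    while len(text) > 0:
        # Lemmatize the candidate prefix so inflected forms such as "birds" match.
        word = wordnet_lemmatizer.lemmatize(text[0:pos2])
        # Accept a single character as a last resort so the loop always terminates.
        if word in wordlist or pos2 == 1:
            result = result + text[0:pos2] + ' '
            text = text[pos2:]
            pos2 = len(text)
        else:
            # No match: shorten the candidate prefix by one character.
            pos2 = pos2 - 1
    return result[0:-1]

>>> string = 'theyarebirds'
>>> print(max_match(string))
they are birds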
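Since MaxMatch is mostly associated with Chinese word segmentation, the same greedy idea can be sketched without NLTK, using a small hand-made dictionary. Everything here is an assumption for illustration: the function name `max_match_segment`, the `max_len` parameter, and the contents of `toy_dict` are not taken from any library.

```python
def max_match_segment(text, dictionary, max_len=None):
    """Greedy forward maximum matching over a set of known words."""
    if max_len is None:
        # Longest dictionary entry bounds the candidate window.
        max_len = max((len(w) for w in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a dictionary hit.
        for j in range(min(len(text), i + max_len), i, -1):
            # A single character is always accepted, so unknown
            # characters become one-character tokens.
            if text[i:j] in dictionary or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens


# Hypothetical toy dictionary.
toy_dict = {"我们", "喜欢", "自然", "语言", "自然语言", "处理", "自然语言处理"}
print(max_match_segment("我们喜欢自然语言处理", toy_dict))
```

Because the longest candidate is tried first, "自然语言处理" is kept as one token even though "自然", "语言", and "处理" are also in the dictionary. This greediness is the main design trade-off: it is simple and fast, but it can pick a wrong split when a long match hides the correct shorter ones.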
https://marcobonzanini.com/2015/01/26/stemming-lemmatisation-and-pos-tagging-with-python-and-nltk/
http://blog.csdn.net/baimafujinji/article/details/51069522