
Natural Language Processing: Stemmers

This post introduces two of the ready-made stemmers that ship with NLTK: Porter and Lancaster.
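Before turning to the NLTK classes, it may help to see what a stemmer does in its simplest possible form. The following is only a toy sketch of suffix stripping: the suffix list, the `min_stem` threshold, and the name `naive_stem` are assumptions made for illustration, and this is not the rule set used by Porter or Lancaster.

```python
# Toy suffix list, checked in order (an assumption, not Porter's rules).
SUFFIXES = ("ing", "es", "ed", "s")


def naive_stem(word, min_stem=3):
    """Lowercase `word` and strip the first matching suffix,
    as long as at least `min_stem` characters remain."""
    lower = word.lower()
    for suffix in SUFFIXES:
        if lower.endswith(suffix) and len(lower) - len(suffix) >= min_stem:
            return lower[: -len(suffix)]
    return lower


print([naive_stem(w) for w in ["distributing", "swords", "is", "women"]])
# ['distribut', 'sword', 'is', 'women']
```

Real stemmers apply many more rules and conditions. Porter, for instance, turns `lying` into `lie`, while this sketch leaves it unchanged.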

1. Porter

>>> import nltk
>>> porter = nltk.PorterStemmer()
>>> raw = '''Listen, strange women lying in ponds distributing swords is no basis
... for a system of government. Supreme executive power derives from a mandate from
... the masses, not from some farcical aquatic'''
>>> tokens = nltk.word_tokenize(raw)
>>> [porter.stem(t) for t in tokens]
['listen', ',', u'strang', 'women', u'lie', 'in', u'pond', u'distribut', u'sword',
'is', 'no', u'basi', 'for', 'a', 'system', 'of', u'govern', '.', u'suprem',
u'execut', 'power', u'deriv', 'from', 'a', u'mandat', 'from', 'the', u'mass', ',',
'not', 'from', 'some', u'farcic', u'aquat']

2. Lancaster

>>> lancaster = nltk.LancasterStemmer()
>>> [lancaster.stem(t) for t in tokens]
['list', ',', 'strange', 'wom', 'lying', 'in', 'pond', 'distribut', 'sword', 'is',
'no', 'bas', 'for', 'a', 'system', 'of', 'govern', '.', 'suprem', 'execut', 'pow',
'der', 'from', 'a', 'mand', 'from', 'the', 'mass', ',', 'not', 'from', 'som',
'farc', 'aqu']

3. Lemmatizer: strips an affix only when the resulting word is in its dictionary. A commonly used one is WordNetLemmatizer.

>>> wnl = nltk.WordNetLemmatizer()
>>> [wnl.lemmatize(t) for t in tokens]
['Listen', ',', 'strange', u'woman', 'lying', 'in', u'pond', 'distributing', u'sword',
'is', 'no', 'basis', 'for', 'a', 'system', 'of', 'government', '.', 'Supreme',
'executive', 'power', 'derives', 'from', 'a', 'mandate', 'from', 'the', u'mass', ',',
'not', 'from', 'some', 'farcical', 'aquatic']
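
`lemmatize` takes an optional second argument `pos` that defaults to `'n'` (noun), so in the session above every token is looked up as a noun, which is why `lying` and `derives` come back unchanged. One possible workaround, sketched below, is to try several part-of-speech tags in turn and keep the first lemma that differs from the input. The helper name `lemmatize_any_pos`, the tag order, and the `toy_lemmatize` stand-in with its made-up lookup table are assumptions; the stand-in is there only so the block runs without the WordNet data.

```python
def lemmatize_any_pos(lemmatize, word, pos_tags=("n", "v", "a", "r")):
    """Call lemmatize(word, pos) for each tag in order and return the
    first result that differs from `word`; fall back to `word` itself."""
    for pos in pos_tags:
        lemma = lemmatize(word, pos)
        if lemma != word:
            return lemma
    return word


# Hypothetical stand-in for wnl.lemmatize; the table is illustrative only.
_TOY_LEMMAS = {("women", "n"): "woman", ("lying", "v"): "lie"}


def toy_lemmatize(word, pos="n"):
    return _TOY_LEMMAS.get((word, pos), word)


print(lemmatize_any_pos(toy_lemmatize, "women"))   # woman
print(lemmatize_any_pos(toy_lemmatize, "lying"))   # lie
print(lemmatize_any_pos(toy_lemmatize, "system"))  # system
```

With the real lemmatizer the call would be `lemmatize_any_pos(wnl.lemmatize, 'lying')`. Trying tags blindly can pick the wrong reading for ambiguous words, so a part-of-speech tagger is the more careful choice when one is available.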

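As the results above show, the Porter stemmer does better on this text: it reduces lying to lie, whereas Lancaster leaves lying unchanged and cuts women to wom and power to pow.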
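
To check a comparison like this token by token rather than by eye, one option is to list only the tokens on which two stemmers disagree. This is a minimal sketch: the helper name `stem_disagreements` is hypothetical, and the two lookup tables stand in for `porter.stem` and `lancaster.stem`, filled with a few values copied from the outputs above so that the block runs without NLTK installed.

```python
def stem_disagreements(tokens, stem_a, stem_b):
    """Return (token, stem_a(token), stem_b(token)) for every token
    on which the two stemming functions give different results."""
    rows = []
    for token in tokens:
        a, b = stem_a(token), stem_b(token)
        if a != b:
            rows.append((token, a, b))
    return rows


# Stand-ins for the real stemmers: a few stems copied from the sessions above.
PORTER_SAMPLE = {"women": "women", "lying": "lie", "power": "power", "system": "system"}
LANCASTER_SAMPLE = {"women": "wom", "lying": "lying", "power": "pow", "system": "system"}

sample_tokens = ["women", "lying", "power", "system"]
for row in stem_disagreements(sample_tokens, PORTER_SAMPLE.get, LANCASTER_SAMPLE.get):
    print(row)
```

With the objects from the sessions above, the call would be `stem_disagreements(tokens, porter.stem, lancaster.stem)`.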
