
[Search] The Porter Stemming Algorithm Explained (3)



Continued from:

[Search] The Porter Stemming Algorithm Explained (2)

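Stemming proceeds in five main steps, each applying the replacement conditions introduced earlier. Within a group of rules, only the rule with the longest matching suffix is tried.

Each line shows a rule on the left and, on the right, lowercase examples where the rule applies or fails to apply.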

Step 1

    SSES -> SS                          caresses     ->  caress
    IES  -> I                           ponies       ->  poni
                                        ties         ->  ti
    SS   -> SS                          caress       ->  caress
    S    ->                             cats         ->  cat
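
The four plural rules above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name step1a is made up for this example, and input is assumed to be already lowercased.

```python
def step1a(word: str) -> str:
    """Apply the plural rules; suffixes are tested longest first."""
    if word.endswith("sses"):
        return word[:-2]   # SSES -> SS
    if word.endswith("ies"):
        return word[:-2]   # IES -> I
    if word.endswith("ss"):
        return word        # SS -> SS (stops the plain S rule from firing)
    if word.endswith("s"):
        return word[:-1]   # S -> (nothing)
    return word


for w in ("caresses", "ponies", "ties", "caress", "cats"):
    print(w, "->", step1a(w))
```

The SS -> SS rule looks like a no-op, but because only the longest matching suffix is tried, it is what keeps "caress" from losing its final S.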

    (m>0) EED -> EE                     feed         ->  feed
                                        agreed       ->  agree
    (*v*) ED  ->                        plastered    ->  plaster
                                        bled         ->  bled
    (*v*) ING ->                        motoring     ->  motor
                                        sing         ->  sing

If the second or third rule above (ED or ING) succeeds, the following rules are then applied to the result:

    AT -> ATE                           conflat(ed)  ->  conflate
    BL -> BLE                           troubl(ed)   ->  trouble
    IZ -> IZE                           siz(ed)      ->  size
    (*d and not (*L or *S or *Z))
        -> single letter
                                        hopp(ing)    ->  hop
                                        tann(ed)     ->  tan
                                        fall(ing)    ->  fall
                                        hiss(ing)    ->  hiss
                                        fizz(ed)     ->  fizz
    (m=1 and *o) -> E                   fail(ing)    ->  fail
                                        fil(ing)     ->  file
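
The EED / ED / ING rules and the follow-up repairs can be sketched as below. All function names (cv_pattern, measure, ends_cvc, ends_double_consonant, tidy, step1b) are illustrative choices for this example. The sketch assumes the definitions of m, *v*, *d and *o given in the earlier part of this series, and it treats Y as a vowel when it follows a consonant.

```python
def cv_pattern(stem: str) -> str:
    """Map each letter to C or V; Y after a consonant counts as a vowel."""
    pattern = ""
    for i, ch in enumerate(stem):
        if ch in "aeiou" or (ch == "y" and i > 0 and pattern[-1] == "C"):
            pattern += "V"
        else:
            pattern += "C"
    return pattern


def measure(stem: str) -> int:
    """m in [C](VC)^m[V]: count vowel-to-consonant transitions."""
    return cv_pattern(stem).count("VC")


def ends_cvc(stem: str) -> bool:
    """*o: ends consonant-vowel-consonant, last consonant not W, X or Y."""
    return cv_pattern(stem).endswith("CVC") and stem[-1] not in "wxy"


def ends_double_consonant(stem: str) -> bool:
    """*d: ends with the same consonant twice."""
    return len(stem) >= 2 and stem[-1] == stem[-2] and cv_pattern(stem)[-1] == "C"


def tidy(stem: str) -> str:
    """Repairs applied after ED or ING has been removed."""
    if stem.endswith(("at", "bl", "iz")):
        return stem + "e"                      # AT -> ATE, BL -> BLE, IZ -> IZE
    if ends_double_consonant(stem) and stem[-1] not in "lsz":
        return stem[:-1]                       # hopp -> hop
    if measure(stem) == 1 and ends_cvc(stem):
        return stem + "e"                      # fil -> file
    return stem


def step1b(word: str) -> str:
    if word.endswith("eed"):
        # EED is the longest match, so ED is not tried even if m>0 fails.
        return word[:-1] if measure(word[:-3]) > 0 else word
    for suffix in ("ed", "ing"):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            # (*v*): the stem must contain a vowel.
            return tidy(stem) if "V" in cv_pattern(stem) else word
    return word


for w in ("agreed", "plastered", "bled", "hopping", "falling", "filing"):
    print(w, "->", step1b(w))
```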

    (*v*) Y -> I                        happy        ->  happi
                                        sky          ->  sky
Step 1 thus takes care of plurals and past participles.
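
The last rule of Step 1, (*v*) Y -> I, only needs a vowel test on the stem. The sketch below uses hypothetical names (has_vowel, step1c) and a shortcut that is an assumption of this example: since Y counts as a vowel whenever it follows a consonant, a stem without A, E, I, O or U still contains a vowel if a Y appears anywhere after its first letter.

```python
def has_vowel(stem: str) -> bool:
    """(*v*): true if the stem contains a vowel, counting Y after a consonant."""
    return any(ch in "aeiou" for ch in stem) or "y" in stem[1:]


def step1c(word: str) -> str:
    """Turn a final Y into I when the rest of the word contains a vowel."""
    if word.endswith("y") and has_vowel(word[:-1]):
        return word[:-1] + "i"
    return word


print(step1c("happy"))  # happi
print(step1c("sky"))    # sky
```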

Step 2

    (m>0) ATIONAL ->  ATE           relational     ->  relate
    (m>0) TIONAL  ->  TION          conditional    ->  condition
                                    rational       ->  rational
    (m>0) ENCI    ->  ENCE          valenci        ->  valence
    (m>0) ANCI    ->  ANCE          hesitanci      ->  hesitance
    (m>0) IZER    ->  IZE           digitizer      ->  digitize
    (m>0) ABLI    ->  ABLE          conformabli    ->  conformable
    (m>0) ALLI    ->  AL            radicalli      ->  radical
    (m>0) ENTLI   ->  ENT           differentli    ->  different
    (m>0) ELI     ->  E             vileli         ->  vile
    (m>0) OUSLI   ->  OUS           analogousli    ->  analogous
    (m>0) IZATION ->  IZE           vietnamization ->  vietnamize
    (m>0) ATION   ->  ATE           predication    ->  predicate
    (m>0) ATOR    ->  ATE           operator       ->  operate
    (m>0) ALISM   ->  AL            feudalism      ->  feudal
    (m>0) IVENESS ->  IVE           decisiveness   ->  decisive
    (m>0) FULNESS ->  FUL           hopefulness    ->  hopeful
    (m>0) OUSNESS ->  OUS           callousness    ->  callous
    (m>0) ALITI   ->  AL            formaliti      ->  formal
    (m>0) IVITI   ->  IVE           sensitiviti    ->  sensitive
    (m>0) BILITI  ->  BLE           sensibiliti    ->  sensible
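
Because every Step 2 rule has the same shape, (m>0) SUFFIX -> REPLACEMENT, the step lends itself to a table-driven sketch. The names STEP2_RULES, cv_pattern, measure and step2 are illustrative, and the measure helper assumes the [C](VC)^m[V] definition from the earlier part of this series.

```python
STEP2_RULES = [
    ("ational", "ate"), ("tional", "tion"), ("enci", "ence"), ("anci", "ance"),
    ("izer", "ize"), ("abli", "able"), ("alli", "al"), ("entli", "ent"),
    ("eli", "e"), ("ousli", "ous"), ("ization", "ize"), ("ation", "ate"),
    ("ator", "ate"), ("alism", "al"), ("iveness", "ive"), ("fulness", "ful"),
    ("ousness", "ous"), ("aliti", "al"), ("iviti", "ive"), ("biliti", "ble"),
]


def cv_pattern(stem: str) -> str:
    """Map each letter to C or V; Y after a consonant counts as a vowel."""
    pattern = ""
    for i, ch in enumerate(stem):
        if ch in "aeiou" or (ch == "y" and i > 0 and pattern[-1] == "C"):
            pattern += "V"
        else:
            pattern += "C"
    return pattern


def measure(stem: str) -> int:
    """m in [C](VC)^m[V]: count vowel-to-consonant transitions."""
    return cv_pattern(stem).count("VC")


def step2(word: str) -> str:
    # Longest suffix first, so ATIONAL is tried before TIONAL.
    for suffix, replacement in sorted(STEP2_RULES, key=lambda rule: -len(rule[0])):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            # Only the longest match is considered, even when m>0 fails.
            return stem + replacement if measure(stem) > 0 else word
    return word


for w in ("relational", "conditional", "rational", "vietnamization"):
    print(w, "->", step2(w))
```

"rational" stays unchanged because its longest matching suffix is ATIONAL, which leaves the stem "r" with m = 0; the shorter TIONAL rule is not tried afterwards.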

Step 3

    (m>0) ICATE ->  IC              triplicate     ->  triplic
    (m>0) ATIVE ->                  formative      ->  form
    (m>0) ALIZE ->  AL              formalize      ->  formal
    (m>0) ICITI ->  IC              electriciti    ->  electric
    (m>0) ICAL  ->  IC              electrical     ->  electric
    (m>0) FUL   ->                  hopeful        ->  hope
    (m>0) NESS  ->                  goodness       ->  good
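
Step 3 works the same way with a smaller table. The sketch below stores the rules in a dictionary and picks the longest matching suffix with max; STEP3_RULES, cv_pattern, measure and step3 are names invented for this example, and m is assumed to follow the [C](VC)^m[V] definition from the earlier part.

```python
STEP3_RULES = {
    "icate": "ic", "ative": "", "alize": "al", "iciti": "ic",
    "ical": "ic", "ful": "", "ness": "",
}


def cv_pattern(stem: str) -> str:
    """Map each letter to C or V; Y after a consonant counts as a vowel."""
    pattern = ""
    for i, ch in enumerate(stem):
        if ch in "aeiou" or (ch == "y" and i > 0 and pattern[-1] == "C"):
            pattern += "V"
        else:
            pattern += "C"
    return pattern


def measure(stem: str) -> int:
    """m in [C](VC)^m[V]: count vowel-to-consonant transitions."""
    return cv_pattern(stem).count("VC")


def step3(word: str) -> str:
    # Pick the longest suffix in the table that the word ends with, if any.
    suffix = max((s for s in STEP3_RULES if word.endswith(s)), key=len, default=None)
    if suffix is None:
        return word
    stem = word[: -len(suffix)]
    return stem + STEP3_RULES[suffix] if measure(stem) > 0 else word


for w in ("triplicate", "formative", "electrical", "hopeful", "goodness"):
    print(w, "->", step3(w))
```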

Step 4

    (m>1) AL    ->                  revival        ->  reviv
    (m>1) ANCE  ->                  allowance      ->  allow
    (m>1) ENCE  ->                  inference      ->  infer
    (m>1) ER    ->                  airliner       ->  airlin
    (m>1) IC    ->                  gyroscopic     ->  gyroscop
    (m>1) ABLE  ->                  adjustable     ->  adjust
    (m>1) IBLE  ->                  defensible     ->  defens
    (m>1) ANT   ->                  irritant       ->  irrit
    (m>1) EMENT ->                  replacement    ->  replac
    (m>1) MENT  ->                  adjustment     ->  adjust
    (m>1) ENT   ->                  dependent      ->  depend
    (m>1 and (*S or *T)) ION ->     adoption       ->  adopt
    (m>1) OU    ->                  homologou      ->  homolog
    (m>1) ISM   ->                  communism      ->  commun
    (m>1) ATE   ->                  activate       ->  activ
    (m>1) ITI   ->                  angulariti     ->  angular
    (m>1) OUS   ->                  homologous     ->  homolog
    (m>1) IVE   ->                  effective      ->  effect
    (m>1) IZE   ->                  bowdlerize     ->  bowdler
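
Step 4 deletes suffixes outright, requires m > 1, and has one special case: ION is removed only when the stem ends in S or T. A possible sketch follows; STEP4_SUFFIXES, cv_pattern, measure and step4 are illustrative names, and m is assumed to follow the [C](VC)^m[V] definition from the earlier part.

```python
STEP4_SUFFIXES = [
    "al", "ance", "ence", "er", "ic", "able", "ible", "ant", "ement", "ment",
    "ent", "ion", "ou", "ism", "ate", "iti", "ous", "ive", "ize",
]


def cv_pattern(stem: str) -> str:
    """Map each letter to C or V; Y after a consonant counts as a vowel."""
    pattern = ""
    for i, ch in enumerate(stem):
        if ch in "aeiou" or (ch == "y" and i > 0 and pattern[-1] == "C"):
            pattern += "V"
        else:
            pattern += "C"
    return pattern


def measure(stem: str) -> int:
    """m in [C](VC)^m[V]: count vowel-to-consonant transitions."""
    return cv_pattern(stem).count("VC")


def step4(word: str) -> str:
    matches = [s for s in STEP4_SUFFIXES if word.endswith(s)]
    if not matches:
        return word
    suffix = max(matches, key=len)   # EMENT beats MENT, which beats ENT
    stem = word[: -len(suffix)]
    if measure(stem) <= 1:
        return word                  # (m>1) not satisfied
    if suffix == "ion" and not stem.endswith(("s", "t")):
        return word                  # ION needs a stem ending in S or T
    return stem


for w in ("replacement", "adjustment", "adoption", "opinion", "relate"):
    print(w, "->", step4(w))
```

In this sketch "opinion" and "relate" come back unchanged: the first fails the S/T condition on ION, the second has a stem ("rel") with m = 1.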

After these four steps the suffixes have been removed; the last step only does some tidying up.

Step 5

    (m>1) E     ->                  probate        ->  probat
                                    rate           ->  rate
    (m=1 and not *o) E ->           cease          ->  ceas

    (m>1 and *d and *L) -> single letter
                                    controll       ->  control
                                    roll           ->  roll
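
The two tidy-up rules of Step 5 might be sketched like this. The names cv_pattern, measure, ends_cvc and step5 are illustrative, and the helpers assume the m and *o definitions from the earlier part. For the double-L rule this sketch measures the whole word, which is an implementation choice of the example.

```python
def cv_pattern(stem: str) -> str:
    """Map each letter to C or V; Y after a consonant counts as a vowel."""
    pattern = ""
    for i, ch in enumerate(stem):
        if ch in "aeiou" or (ch == "y" and i > 0 and pattern[-1] == "C"):
            pattern += "V"
        else:
            pattern += "C"
    return pattern


def measure(stem: str) -> int:
    """m in [C](VC)^m[V]: count vowel-to-consonant transitions."""
    return cv_pattern(stem).count("VC")


def ends_cvc(stem: str) -> bool:
    """*o: ends consonant-vowel-consonant, last consonant not W, X or Y."""
    return cv_pattern(stem).endswith("CVC") and stem[-1] not in "wxy"


def step5(word: str) -> str:
    # Drop a final E when m>1, or when m=1 and the stem does not end in *o.
    if word.endswith("e"):
        stem = word[:-1]
        m = measure(stem)
        if m > 1 or (m == 1 and not ends_cvc(stem)):
            word = stem
    # Reduce a final double L to a single L when m>1.
    if word.endswith("ll") and measure(word) > 1:
        word = word[:-1]
    return word


for w in ("probate", "rate", "cease", "controll", "roll"):
    print(w, "->", step5(w))
```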


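Evaluations of Porter's algorithm have found that stemming significantly improves recall. Light stemming has little effect on precision, but heavy stemming hurts precision badly. The evaluators therefore recommend using light stemming first and falling back to heavy stemming only when a query returns too few results.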
