首页 > 代码库 > 倒排索引

倒排索引

倒排索引(inverted index)

常被成为反向索引、置入文档和反向档案,是一种索引方法,被用来存储在全文搜索下某个单词在一个文档

或者一组文档中的存储位置的映射。是文档检索系统中最常用的数据结构。

 

例如:

下面是要被索引的文本:

T0 = "it is what it is"

T1 = "what is it"

T2 = "it is a banana"

生成的倒排索引可以表示为下面所示:

"a" = {(2,2)}

"banana" = {(2,3)}

"is" = {(0,1),(0,4),(1,1),(2,1)}

"it" = {(0,0),(0,3),(1,2),(2,0)}

"what" = {(0,2),(1,0)}

我们可以得到这些完全反向索引,有(文档位置、查询单词所在文档中位置)组成的成对数据。

同样,文档位置、和查询单词所在文档中位置,都从零开始计算。

所以,"banana":{(2,3)}表示 banana在第三个文档中的第四个单词位置。

=====例子如下:

DATA:存储正向索引

word_index:存储倒排索引,每个空格分隔的单词作为key,

      value是list结果,通过list.append方法,依次添加相应单词在文本文件中的位置()。

      单词位置使用(行中index+所在行号)的形式表示。  

#coding:utf-8import sysDATA = {}word_index = {}# query->(line_no,word_index)#using rever_index#使用倒排结果def check_index(sentense):    query = sentense.split( )    for v in query:        if word_index.has_key(v)==True:            #print word_index[v],"####",v            for index_lineno in word_index[v]:  #[‘0.0‘,‘2,1‘,‘2,3‘]                #print index_lineno                print DATA[int(index_lineno.split(.)[1])]if __name__ =="__main__":    # 生成倒排    line_num = 0    for line in sys.stdin:        line = line.strip( \r\n)        fields = line.split( )        DATA[line_num] = line        for i, val in enumerate(fields):            if word_index.has_key(val) == False:                word_index[val] = []            word_index[val].append(".".join(                [str(i), str(line_num)]))        line_num += 1    print word_index    print DATA    print "=====test query"    queries = "it is example"    print ("####input search sentense:%s",queries)    print "####search result is :"    check_index(queries)    print "done=========="    sys.exit(0)

 

=====

input.data 文本文件:

it is what it iswhat is itit is a bananafrom your second exampleWhen I run the algo using some sampleWhat am I doing wrong ?

======运行结果:

{What: [0.5], doing: [3.5], is: [1.0, 4.0, 1.1, 1.2], 
some: [6.4], it: [0.0, 3.0, 2.1, 0.2], sample: [7.4],
second: [2.3], your: [1.3], what: [2.0, 0.1], from: [0.3],
banana: [3.2], ?: [5.5], run: [2.4], I: [1.4, 2.5],
When: [0.4], wrong: [4.5], using: [5.4], a: [2.2],
am: [1.5], algo: [4.4], the: [3.4], example: [3.3]}
{0:
it is what it is, 1: what is it, 2: it is a banana,
  3: from your second example, 4: When I run the algo using some sample,
  5: What am I doing wrong ?}=====test query(####input search sentense:%s, it is example)####search result is :it is what it isit is what it iswhat is itit is a bananait is what it isit is what it iswhat is itit is a bananafrom your second exampledone==========

 

倒排索引