[IR] Index Construction


2024-08-19 02:36:26, 217 views

An inverted index is constructed in three steps:


The steps, including the hardest one:

1. Token sequence
2. Sort by term
3. Dictionary and postings
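
The three steps can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the author's code: the function name `build_inverted_index`, the whitespace tokenizer, and the sample documents are all assumptions made for the example.

```python
from itertools import groupby

def build_inverted_index(docs):
    """docs maps doc_id -> text; returns a dict of term -> sorted doc-ID list."""
    # Step 1: token sequence of (term, doc_id) pairs.
    pairs = [(tok.lower(), doc_id)
             for doc_id, text in docs.items()
             for tok in text.split()]
    # Step 2: sort by term, then by doc_id.
    pairs.sort()
    # Step 3: group into dictionary entries, each with a postings list.
    index = {}
    for term, group in groupby(pairs, key=lambda p: p[0]):
        postings = []
        for _, doc_id in group:
            if not postings or postings[-1] != doc_id:  # drop duplicates
                postings.append(doc_id)
        index[term] = postings
    return index

docs = {1: "new home sales", 2: "home sales rise", 3: "new sales"}
index = build_inverted_index(docs)
```

Everything here fits in memory, which is exactly what stops working once the term list grows to the sizes discussed next.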

The most practical problem in step 2: how do you sort, say, 100 GB of terms?

External Sorting Algorithm

Blocked sort-based indexing (BSBI)


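Notes (numbered as in the original algorithm listing):

4. Read the next block of the document collection.
5. Sort the block.
6. Write the sorted result f_i to disk.
7. Merge the sorted results into a single inverted index. Read each f_i through a small buffer window and feed it into a min-heap;
the top of the heap emits the dictionary part of the inverted index in ascending term order (because every f_i is already sorted).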
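
A rough sketch of steps 4 to 7 follows, assuming the input is already a stream of (term, doc_id) pairs. The names `write_block`, `read_block`, and `bsbi`, the tab-separated file format, and the tiny block size are assumptions for illustration. `heapq.merge` from the standard library plays the role of the min-heap: it holds one pending pair per block file. For brevity the merged index is returned in memory, whereas a real implementation would stream it back to disk.

```python
import heapq
import os
import tempfile

def write_block(pairs, directory, block_no):
    """Sort one in-memory block and write it as 'term<TAB>doc_id' lines."""
    path = os.path.join(directory, f"block_{block_no}.txt")
    with open(path, "w", encoding="utf-8") as f:
        for term, doc_id in sorted(pairs):
            f.write(f"{term}\t{doc_id}\n")
    return path

def read_block(path):
    """Stream (term, doc_id) pairs from a sorted block file, line by line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            term, doc_id = line.rstrip("\n").split("\t")
            yield term, int(doc_id)

def bsbi(pair_stream, block_size, directory):
    """Build term -> postings from a stream of (term, doc_id) pairs."""
    paths, block = [], []
    for pair in pair_stream:
        block.append(pair)
        if len(block) == block_size:          # block full: sort and flush
            paths.append(write_block(block, directory, len(paths)))
            block = []
    if block:                                 # flush the last partial block
        paths.append(write_block(block, directory, len(paths)))
    index = {}
    # Merge all sorted block files; pairs arrive in ascending (term, doc_id) order.
    for term, doc_id in heapq.merge(*(read_block(p) for p in paths)):
        postings = index.setdefault(term, [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
    return index

pairs = [("sales", 1), ("home", 1), ("rise", 2),
         ("home", 2), ("sales", 3), ("new", 3)]
with tempfile.TemporaryDirectory() as tmp:
    bsbi_index = bsbi(iter(pairs), block_size=2, directory=tmp)
```

Only one block plus one line per block file is ever held in memory at a time during the read and merge phases, which is the point of the external sort.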

Single-pass in-memory indexing (SPIMI)


Notes:

Ordering of terms that share the same hash value:
insert-at-back and move-to-front heuristic
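
One way to read this heuristic, sketched under assumptions: the dictionary is a hash table with chaining, a new term is appended to the end of its chain, and a term that is found again is moved to the front of the chain. The class `ChainedDictionary` and its method name are hypothetical, and the single-bucket setup below exists only to make the chain order visible.

```python
class ChainedDictionary:
    """Hash table with chaining; chains use insert-at-back and move-to-front."""

    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]

    def lookup_or_insert(self, term):
        """Return the postings list for term, creating an empty one if needed."""
        chain = self.buckets[hash(term) % len(self.buckets)]
        for i, (t, postings) in enumerate(chain):
            if t == term:
                # Move-to-front: a term that was just seen moves to the chain head.
                chain.insert(0, chain.pop(i))
                return postings
        # Insert-at-back: a term seen for the first time goes to the chain tail.
        postings = []
        chain.append((term, postings))
        return postings

d = ChainedDictionary(num_buckets=1)   # one bucket forces every term to collide
for term, doc_id in [("the", 1), ("cat", 1), ("sat", 1), ("sat", 2)]:
    d.lookup_or_insert(term).append(doc_id)
chain_order = [t for t, _ in d.buckets[0]]
```

After the second occurrence of "sat", that term sits at the head of the chain, so frequently repeated terms are found after fewer comparisons.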

Each block f_i builds a new dictionary.
The expensive sort is eliminated.
The last step is still MergeBlocks(.)
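
The three points above can be sketched as follows. This is an illustration under assumptions, not the original listing: the names `spimi_invert`, `merge_blocks`, and `spimi` are chosen for the example, a block is considered full after `max_postings` postings, and blocks stay in memory instead of being written to disk. Note that only the terms of each block are sorted, never the full list of (term, doc_id) pairs.

```python
import heapq

def spimi_invert(token_stream, max_postings):
    """Consume tokens until the block is full; return a term-sorted block."""
    dictionary = {}                      # a fresh dictionary for every block
    count = 0
    for term, doc_id in token_stream:
        postings = dictionary.setdefault(term, [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)      # append directly, no pair sorting
            count += 1
        if count >= max_postings:
            break
    return sorted(dictionary.items())    # sort terms only

def merge_blocks(blocks):
    """Merge term-sorted blocks into one term -> postings dictionary."""
    merged = {}
    for term, postings in heapq.merge(*blocks, key=lambda entry: entry[0]):
        target = merged.setdefault(term, [])
        for doc_id in postings:
            if not target or target[-1] != doc_id:
                target.append(doc_id)
    return merged

def spimi(tokens, max_postings):
    """Split the token stream into blocks, invert each, then merge."""
    stream = iter(tokens)                # shared iterator across blocks
    blocks = []
    while True:
        block = spimi_invert(stream, max_postings)
        if not block:
            break
        blocks.append(block)
    return merge_blocks(blocks)

tokens = [("new", 1), ("home", 1), ("sales", 1), ("home", 2),
          ("sales", 2), ("rise", 2), ("new", 3), ("sales", 3)]
spimi_index = spimi(tokens, max_postings=3)
```

The sketch assumes tokens arrive in ascending doc-ID order, so appending keeps every postings list sorted without an explicit sort.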

Dynamic Indexing


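Notes:

The main index is on disk; the auxiliary index is in memory.
The scheme (logarithmic merging) can be seen as a compromise between Immediate Merge and No Merge.
Because each posting is processed only once at each of the log₂(T/n) levels, the total index construction time is Θ(T*log₂(T/n)), where T is the total number of postings and n is the size of the in-memory auxiliary index.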
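
The level structure can be checked with a small simulation. It is a sketch under assumptions: only index sizes are tracked, not real postings; level i on disk holds n*2ⁱ postings; and "work" counts every posting touched by a flush or a merge. The function name `logarithmic_merge` is chosen for the example.

```python
def logarithmic_merge(num_postings, n):
    """Simulate index sizes only; return (disk levels, postings touched)."""
    disk = {}        # level i -> size of the on-disk index at that level
    memory = 0       # current size of the in-memory auxiliary index
    work = 0         # postings touched by flushes and merges
    for _ in range(num_postings):
        memory += 1
        if memory < n:
            continue
        # The auxiliary index is full: push it down the levels.
        carry, level = memory, 0
        memory = 0
        work += carry                    # flushing touches n postings
        while level in disk:
            carry += disk.pop(level)     # merge with the index at this level
            work += carry                # the merge touches every posting in it
            level += 1
        disk[level] = carry
    return disk, work

disk_44, _ = logarithmic_merge(44, 4)          # 11 flushes of size 4
disk_64, work_64 = logarithmic_merge(64, 4)    # 16 flushes of size 4
```

With T = 64 and n = 4 the simulation touches 320 postings, which equals T*(1 + log₂(T/n)) and is consistent with the Θ(T*log₂(T/n)) bound.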

Extension:

Find the number of indexes left on disk at the end.
We use |C| to denote the total size of the document collection, and M to denote the memory size.
Let's assume that |C| = h*M.
For i in [0, log₂h] {
    X = [h - (2ⁱ - 1)] mod [2ⁱ⁺¹]
    If X belongs to (0, 2ⁱ],
        the index at level i exists.
    Else,
        it does not exist.
}
The number of levels i whose index exists is the answer.
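
The rule is easy to try out in code. The sketch below is an assumption-labelled illustration (the function name `indexes_on_disk` is made up here) that applies the test for every i from 0 to ⌊log₂h⌋. The condition holds exactly when bit i of h is set, so the count equals the number of 1 bits in the binary form of h, which is what a binary-counter view of logarithmic merging would suggest.

```python
def indexes_on_disk(h):
    """Count the on-disk indexes left after h memory-sized flushes."""
    count = 0
    i = 0
    while 2 ** i <= h:                          # i runs over 0 .. floor(log2 h)
        x = (h - (2 ** i - 1)) % 2 ** (i + 1)
        if 0 < x <= 2 ** i:                     # X in (0, 2^i]: level i exists
            count += 1
        i += 1
    return count

remaining = indexes_on_disk(11)                 # 11 is 1011 in binary
```

For h = 11 the levels 0, 1, and 3 exist, so three indexes remain on disk.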
