首页 > 代码库 > [IR] Information Extraction

[IR] Information Extraction

阶段性总结

Boolean retrieval

单词搜索

【Qword1 and Qword2】               O(x+y)

【Qword1 and Qword2】- 改进: Galloping Search   O(2a*log2(b/a))

【Qword1 and not Qword2】        O(m*log2n) 

【Qword1 or not Qword2】           O(m+n)

【Qword1 and Qword2 and Qword3 and ...】     O(Total_Length * log2k)

 

句子搜索

1. Biword Indexes

2. Positional Index --> Proximity Queries

 

Index Construction

构建过程中的Sort的探索:

  1. 基于块的排序索引方法
  2. 内存式单遍扫描索引构建方法
  3. 动态索引 - Dynamic Indexing

 

Compression

Heaps’ law: M = kTb

Zipf’s law: cfi = K/i

  • 压缩Dictionary 
  • 压缩Posting list

思路:基本查询,构建,然后压缩

 

Tolerant Retrieval & Spelling Correction & Language Model

WILD-CARD QUERIES

  • prefix 
  • suffix
  • "mon*ing"
  • “Permuterm vocabulary"
  •  K-gram indexes

Spelling Correction

(1) Error detection

(2) Error correction

Language Model

查询似然模型 --> 混合模型:Jelinek-Mercer method

QueryMd 中出现的概率,然后Ranking.

 

Probabilistic Model

  • 二值独立模型 - Binary Independence Model

针对一个Query,某Term是否该出现在文档中呢?

一篇New doc出现,遂统计every Term与该doc的关系,得到Ci

 

Link Analysis

In degree i 正比于 1/iα ,  例如: α = 2.1

1. Number of In Degree.

2. "Flow" Model

    • small graphs.
    • large graphs. (Markov渐进性质)
      • Spider traps 
      • Dead Ends

 

Ranking - top k

精确方式:

Consine Similarity: tf-idf

精确加速:

使用Quick Select:n + k * log(k) : "find top k" + "sort top k"

Threshold Methods - MaxScore Method

模糊加速:

Index Elimination (heuristic function)

3 of 4 query terms

Champion List

Cluster Pruning Method

  

Evaluation

无序检索结果的评价方法
有序检索结果的评价方法

 


大目标 --> 小目标

• Text Categorization:
  – Classify an entire document


• Information Extraction (IE):
  – Identify and classify small units within documents

  1. segmentation: 提取Term (NE) 语法
  2. classification: 认识Term (type, Chunking) 语义
  3. association: 聚类Term


Named Entity Extraction (NE):
  – A subset of IE
  – Identify and classify proper names: "People, locations, organizations"

 


 

技术分享

 

Main tasks
Named Entity Recognition
• Relation Extraction

 

Pattern-based Relation Extraction

– Relation extraction and its difficulties

  1. – Use of POS Tags
  2. – Use of Constituent Parse
  3. – Use of Dependency Parse

 

1. 

技术分享

 

2. 

技术分享

 

3.

 技术分享

 

[IR] Information Extraction