[IR] Information Extraction


2024-08-19 17:27:21, 220 reads

Interim summary

Boolean retrieval

Single-word queries

【Qword1 AND Qword2】 O(x+y), where x and y are the lengths of the two postings lists
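The O(x+y) bound comes from walking both sorted postings lists in lockstep. A minimal sketch, assuming postings are plain sorted Python lists of integer docIDs (the function name `intersect` is illustrative, not from the notes):

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists of docIDs in O(x + y) steps."""
    i, j, answer = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # docID occurs in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the pointer behind the other
        else:
            j += 1
    return answer
```

Each loop iteration advances at least one pointer, so the loop runs at most x + y times.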

【Qword1 AND Qword2】, improved: galloping search, O(2a*log₂(b/a)) for list lengths a ≤ b
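Galloping search pays off when one list is much shorter than the other: for each docID of the short list, the position in the long list is bracketed by doubling steps and then pinned down by binary search. The following is a sketch under the assumption that both lists are sorted and duplicate-free; `gallop_intersect` is a hypothetical name.

```python
from bisect import bisect_left

def gallop_intersect(short, long):
    """Intersect sorted lists; intended for len(short) much smaller than len(long)."""
    answer, lo = [], 0
    for doc in short:
        # Gallop: double the step until long[lo + step] >= doc or the list ends.
        step = 1
        while lo + step < len(long) and long[lo + step] < doc:
            step *= 2
        hi = min(lo + step + 1, len(long))
        # Binary search inside the bracket found by galloping.
        lo = bisect_left(long, doc, lo, hi)
        if lo == len(long):
            break                  # every remaining docID in long is smaller
        if long[lo] == doc:
            answer.append(doc)
    return answer
```

The doubling phase and the binary search each cost about log₂ of the distance skipped, which is where the factor 2 in the bound comes from.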

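【Qword1 AND NOT Qword2】 O(m*log₂n): binary-search each of the m postings of Qword1 in the n postings of Qword2 (a linear merge gives O(m+n))

【Qword1 OR NOT Qword2】 O(m+n) to walk both lists, but the result can contain up to N documents (N = collection size), so producing it costs O(N)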

【Qword1 AND Qword2 AND Qword3 AND ...】 O(Total_Length * log₂k) with a k-way merge over the k postings lists

Phrase queries

1. Biword Indexes

2. Positional Index --> Proximity Queries
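A positional index stores, for every term, the positions at which it occurs in each document, which is what makes proximity queries possible. A minimal sketch, assuming whitespace tokenization and a nested-dict layout (both are simplifications chosen here, not something the notes prescribe):

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {doc_id: [positions]} for whitespace-tokenized documents."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def near(index, t1, t2, k, ordered=False):
    """Doc IDs where t1 and t2 occur within k positions of each other.

    With ordered=True, t2 must come after t1, so k=1 is the phrase "t1 t2".
    """
    post1, post2 = index.get(t1, {}), index.get(t2, {})
    hits = []
    for doc_id in sorted(post1.keys() & post2.keys()):
        for p1 in post1[doc_id]:
            if ordered:
                ok = any(0 < p2 - p1 <= k for p2 in post2[doc_id])
            else:
                ok = any(abs(p2 - p1) <= k for p2 in post2[doc_id])
            if ok:
                hits.append(doc_id)
                break
    return hits
```

The position comparison here is quadratic per document for brevity; a production index would merge the two sorted position lists instead.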

Index Construction

Sorting strategies during construction:

Blocked sort-based indexing (BSBI)
Single-pass in-memory indexing (SPIMI)
Dynamic indexing
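The idea behind blocked sort-based indexing can be sketched in a few lines: sort each block of (term, docID) pairs separately, then merge the sorted runs. This sketch keeps the runs in memory for illustration, whereas a real BSBI implementation writes each sorted run to disk; the function name `bsbi_index` is hypothetical.

```python
import heapq
from itertools import groupby

def bsbi_index(blocks):
    """blocks: iterable of lists of (term, doc_id) pairs, each small enough to sort."""
    # Step 1: sort every block on its own (on disk in a real system).
    sorted_runs = [sorted(block) for block in blocks]
    # Step 2: k-way merge of the sorted runs into one sorted stream.
    merged = heapq.merge(*sorted_runs)
    # Step 3: collapse the stream into term -> postings list without duplicates.
    index = {}
    for term, group in groupby(merged, key=lambda pair: pair[0]):
        postings = []
        for _, doc_id in group:
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)
        index[term] = postings
    return index
```

SPIMI differs mainly in that each block builds its own dictionary and postings lists directly, so no global term-to-ID mapping or pair sorting is needed.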

Compression

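Heaps’ law: M = kT^b (M = vocabulary size, T = number of tokens in the collection)

Zipf’s law: cf_i = K/i (cf_i = collection frequency of the i-th most frequent term)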

Dictionary compression

Postings list compression
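One widely used scheme for postings compression is to store gaps between consecutive docIDs and encode each gap with variable byte (VB) codes. The notes do not say which scheme was covered, so the following is only an illustrative sketch; it follows the convention that the last byte of each number has its high bit set.

```python
def vb_encode_number(n):
    """Encode one non-negative integer as a list of byte values."""
    out = []
    while True:
        out.insert(0, n % 128)     # 7 payload bits per byte
        if n < 128:
            break
        n //= 128
    out[-1] += 128                 # high bit marks the final byte of the number
    return out

def vb_encode(doc_ids):
    """Gap-encode a sorted postings list, then VB-encode every gap."""
    out, prev = [], 0
    for doc_id in doc_ids:
        out.extend(vb_encode_number(doc_id - prev))
        prev = doc_id
    return bytes(out)

def vb_decode(data):
    """Invert vb_encode: rebuild the docIDs from the byte stream."""
    doc_ids, n, prev = [], 0, 0
    for byte in data:
        if byte < 128:
            n = 128 * n + byte
        else:
            n = 128 * n + (byte - 128)
            prev += n              # undo the gap encoding
            doc_ids.append(prev)
            n = 0
    return doc_ids
```

Small gaps, which dominate for frequent terms, fit in a single byte, which is where the savings come from.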

Overall approach: basic querying first, then index construction, then compression.

Tolerant Retrieval & Spelling Correction & Language Model

WILD-CARD QUERIES

prefix
suffix
"mon*ing"
"Permuterm vocabulary"
K-gram indexes
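A permuterm vocabulary handles a query such as "mon*ing" by rotating it until the wildcard sits at the end, which turns it into a prefix lookup. A small sketch, assuming exactly one `*` per query and using a linear scan where a real system would use a B-tree or sorted array (all function names are illustrative):

```python
def permuterm_rotations(term):
    """All rotations of term + '$', where '$' marks the end of the term."""
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

def build_permuterm(vocabulary):
    """Map every rotation back to the term it came from."""
    return {rot: term for term in vocabulary for rot in permuterm_rotations(term)}

def wildcard_lookup(permuterm, query):
    """Answer a query with exactly one '*', e.g. 'mon*ing', 'mon*' or '*ing'."""
    head, tail = query.split("*")
    prefix = tail + "$" + head     # rotate the query so '*' ends up last
    return sorted({term for rot, term in permuterm.items() if rot.startswith(prefix)})
```

For "mon*ing" the rotated prefix is "ing$mon", so only terms that end in "ing" and start with "mon" can match.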

Spelling Correction

(1) Error detection

(2) Error correction
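The two steps can be illustrated with isolated-word correction: a word missing from the vocabulary is flagged (detection), and the closest vocabulary term by edit distance is proposed (correction). Edit distance is one common choice and is an assumption here, since the notes do not name a specific method; `correct` and `max_dist` are hypothetical names.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

def correct(word, vocabulary, max_dist=2):
    """Return word if it is known, else the nearest term within max_dist, else None."""
    if word in vocabulary:
        return word                                    # detection: no error found
    best = min(vocabulary, key=lambda t: (edit_distance(word, t), t))
    return best if edit_distance(word, best) <= max_dist else None
```

Scanning the whole vocabulary is slow for large dictionaries; a k-gram index is typically used to shortlist candidates first.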

Language Model

Query likelihood model --> mixture model: Jelinek-Mercer method

Compute the probability of the query under each document model M_d, then rank by it.
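With Jelinek-Mercer smoothing, each query term's probability is a linear mix of its document estimate and its collection estimate, P(t|d) = λ·P(t|M_d) + (1-λ)·P(t|M_c). A minimal sketch, assuming documents are token lists and λ = 0.5 as a default (the function name `jm_score` is illustrative):

```python
from collections import Counter

def jm_score(query, doc, collection, lam=0.5):
    """P(query | doc) under a unigram model with Jelinek-Mercer smoothing."""
    doc_tf, doc_len = Counter(doc), len(doc)
    coll_tf = Counter(t for d in collection for t in d)
    coll_len = sum(len(d) for d in collection)
    score = 1.0
    for t in query:
        p_doc = doc_tf[t] / doc_len        # maximum-likelihood estimate from M_d
        p_coll = coll_tf[t] / coll_len     # estimate from the whole collection
        score *= lam * p_doc + (1 - lam) * p_coll
    return score
```

Documents are then ranked by this score; the collection term keeps a document from scoring zero just because one query term is absent. For long queries, summing logarithms avoids floating-point underflow.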

Probabilistic Model

Binary Independence Model

For a given query, should a particular term appear in a relevant document?

When a new document arrives, examine how each term relates to that document to obtain C_i.
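Assuming C_i here is the usual BIM term weight, the log odds ratio c_i = log[p_i(1-u_i) / (u_i(1-p_i))], a document's score is the sum of c_i over query terms that occur in it. The sketch below further assumes no relevance feedback, so p_i is held at 0.5 and u_i is estimated from document frequency with 0.5 smoothing; these estimation choices are assumptions, not taken from the notes.

```python
import math

def c_weight(df, n_docs, p=0.5):
    """Term weight c_i; u_i is a smoothed estimate of P(term present | non-relevant)."""
    u = (df + 0.5) / (n_docs + 1.0)
    return math.log(p / (1 - p)) + math.log((1 - u) / u)

def rsv(query_terms, doc_terms, df, n_docs):
    """Retrieval status value: sum of c_i over query terms present in the document."""
    shared = set(query_terms) & set(doc_terms)
    return sum(c_weight(df[t], n_docs) for t in shared if t in df)
```

Under these assumptions c_i behaves like an idf weight: rare terms contribute more than common ones.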

Link Analysis

The fraction of pages with in-degree i is proportional to 1/i^α, for example α = 2.1

1. Number of In Degree.

2. "Flow" Model

- small graphs.
- large graphs (asymptotic behavior of Markov chains)
  - Spider traps
  - Dead Ends
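For large graphs the flow equations are solved by power iteration, and the two problem cases are commonly handled with random teleports. A sketch, assuming a teleport parameter `beta` of 0.85 and that a dead end redistributes its rank uniformly (one of several possible conventions):

```python
def pagerank(links, beta=0.85, iters=100):
    """links: {node: [out-neighbours]}. Returns {node: rank}, ranks summing to 1."""
    nodes = sorted(set(links) | {v for outs in links.values() for v in outs})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        # Teleport share: every node receives (1 - beta) / n, which breaks spider traps.
        new = {u: (1.0 - beta) / n for u in nodes}
        for u in nodes:
            outs = links.get(u, [])
            if outs:
                share = beta * rank[u] / len(outs)
                for v in outs:
                    new[v] += share
            else:
                # Dead end: spread its rank over all nodes so no mass leaks out.
                for v in nodes:
                    new[v] += beta * rank[u] / n
        rank = new
    return rank
```

Without the teleport term, a spider trap such as a self-looping page would eventually absorb all the rank.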

Ranking - top k

Exact approach:

Cosine similarity: tf-idf
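Exact scoring computes the cosine between tf-idf vectors of the query and each document. A minimal sketch, assuming log-scaled term frequency and idf = log₁₀(N/df), which is one common weighting variant among several; the function names are illustrative.

```python
import math
from collections import Counter

def build_idf(docs):
    """idf_t = log10(N / df_t) over a list of tokenized documents."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    return {t: math.log10(n / df[t]) for t in df}

def tfidf_vector(tokens, idf):
    """Sparse vector {term: (1 + log10 tf) * idf}; unknown terms are dropped."""
    return {t: (1 + math.log10(tf)) * idf[t]
            for t, tf in Counter(tokens).items() if t in idf}

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero length."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Ranking then means computing `cosine(query_vector, doc_vector)` for every candidate document and sorting by the result.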

Exact, accelerated:

Quick Select: O(n + k*log(k)), i.e. "find top k" + "sort top k"
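The two phases can be sketched as follows: a quickselect pass isolates the k largest scores in expected O(n) time, and only those k are then sorted. This version uses list partitioning rather than in-place swaps for clarity, and assumes scores arrive as (score, doc_id) tuples; both choices are illustrative.

```python
import random

def select_top_k(items, k):
    """Return the k largest items in arbitrary order, expected O(n) time."""
    if k <= 0:
        return []
    if len(items) <= k:
        return list(items)
    pivot = random.choice(items)
    higher = [x for x in items if x > pivot]
    equal = [x for x in items if x == pivot]
    lower = [x for x in items if x < pivot]
    if k <= len(higher):
        return select_top_k(higher, k)
    if k <= len(higher) + len(equal):
        return higher + equal[:k - len(higher)]
    return higher + equal + select_top_k(lower, k - len(higher) - len(equal))

def top_k(scores, k):
    """'Find top k' with quickselect, then 'sort top k' in O(k log k)."""
    return sorted(select_top_k(list(scores), k), reverse=True)
```

A bounded min-heap of size k is a frequent alternative, at O(n log k) cost.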

Threshold Methods - MaxScore Method

Approximate, accelerated:

Index Elimination (heuristic function)

e.g. only score documents that contain at least 3 of the 4 query terms

Champion List

Cluster Pruning Method

Evaluation

Evaluation of unranked retrieval results
Evaluation of ranked retrieval results
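As an illustration of the two families: precision, recall and F1 treat the result as an unordered set, while average precision rewards rankings that place relevant documents early. The notes do not list specific measures, so these are offered as typical examples only.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based measures for an unranked result."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)
    p = tp / len(retrieved) if retrieved else 0.0
    r = tp / len(relevant) if relevant else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def average_precision(ranking, relevant):
    """Mean of precision@rank taken at each rank where a relevant doc appears."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, 1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0
```

Averaging `average_precision` over a set of queries gives mean average precision (MAP).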

Coarse-grained goal --> fine-grained goal

• Text Categorization:
  – Classify an entire document

• Information Extraction (IE):
  – Identify and classify small units within documents

segmentation: extract terms (NE), the syntactic level
classification: recognize what each term is (type, chunking), the semantic level
association: cluster terms

• Named Entity Extraction (NE):
  – A subset of IE
  – Identify and classify proper names: "People, locations, organizations"


Main tasks
• Named Entity Recognition
• Relation Extraction

Pattern-based Relation Extraction

– Relation extraction and its difficulties

– Use of POS Tags
– Use of Constituent Parse
– Use of Dependency Parse
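To make the POS-tag variant concrete, here is a deliberately small sketch that matches the pattern "proper nouns, verb group with optional preposition, proper nouns" over already-tagged tokens. The pattern, the function name `extract_relations` and the hand-tagged input are all hypothetical; Penn Treebank tags (NNP, VB*, IN) are assumed.

```python
def extract_relations(tagged):
    """tagged: list of (word, POS) pairs. Returns (subject, relation, object) triples
    for the pattern NNP+ (VB*|IN)+ NNP+, where the middle part must contain a verb."""
    n = len(tagged)

    def span_end(start, accept):
        end = start
        while end < n and accept(tagged[end][1]):
            end += 1
        return end

    def is_name(tag):
        return tag == "NNP"

    def is_link(tag):
        return tag.startswith("VB") or tag == "IN"

    def words(a, b):
        return " ".join(w for w, _ in tagged[a:b])

    triples, i = [], 0
    while i < n:
        subj_end = span_end(i, is_name)
        if subj_end == i:
            i += 1
            continue
        rel_end = span_end(subj_end, is_link)
        obj_end = span_end(rel_end, is_name)
        has_verb = any(tag.startswith("VB") for _, tag in tagged[subj_end:rel_end])
        if has_verb and obj_end > rel_end:
            triples.append((words(i, subj_end), words(subj_end, rel_end),
                            words(rel_end, obj_end)))
        i = subj_end
    return triples
```

Flat tag patterns like this break down quickly on passives, long modifiers and coordination, which is part of the motivation for using constituent or dependency parses.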
