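Notes on my recent study of keyword extraction
Source: http://blog.csdn.net/caohao2008/article/details/3144639
Summary of earlier material
Goals: First, identify the papers that propose methods and summarize the classic approaches. Second, if we want to use one, which is the most practical or easiest to implement, and which is the most interesting from a research standpoint?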
First, the paper "Finding Advertising Keywords on Web Pages" gives a good and fairly comprehensive introduction to the classic features for keyword extraction.
Concept-based keyword extraction uses concepts and categories to assist extraction. Classic papers: "Discovering Key Concepts in Verbose Queries" and "A study on automatically extracted keywords in text categorization".
Query-log-based keyword extraction: "Using the wisdom of the crowds for keyword generation" and "Keyword Extraction for Contextual Advertisement".
Keyword expansion and keyword generation: "Keyword Generation for Search Engine Advertising using Semantic Similarity", "Using the wisdom of the crowds for keyword generation", and "n-Keyword based Automatic Query Generation".
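Second, commonly used features that earlier researchers have mentioned:
Features mentioned in "Finding Advertising Keywords on Web Pages":
1. Linguistic features: part-of-speech tagging
2. Capitalization of the first letter
3. Whether the keyword appears in hypertext
4. Whether the keyword appears in the meta data
5. Whether the keyword appears in the title
6. Whether the keyword appears in the URL
7. TF, DF
8. Location of the keyword
9. Length of the sentence containing the keyword, and length of the document
10. Length of the candidate phrase
11. Query log
Features I came up with:
1. Information content of the surroundings: the average information content of the nearby words, or even of the whole sentence.
2. Semantic distance, using co-occurrence.
3. NE (named entities). I have used these before in IE (information extraction).
4. Relationships between keywords, i.e. semantic distance. Is a larger divergence better, is a smaller one better, or does it make no difference?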
2.3.2.1 Lin: linguistic features.
The linguistic information used in feature extraction includes: two types of pos tags – noun (NN & NNS) and proper noun (NNP & NNPS), and one type of chunk – noun phrase (NP). The variations used in MoS are: whether the phrase contains these pos tags; whether all the words in that phrase share the same pos tags (either proper noun or noun); and whether the whole candidate phrase is a noun phrase. For DeS, they are: whether the word has the pos tag; whether the word is the beginning of a noun phrase; whether the word is in a noun phrase, but not the first word; and whether the word is outside any noun phrase.
2.3.2.2 C: capitalization.
Whether a word is capitalized is an indication of being part of a proper noun, or an important word. This set of features for MoS is defined as: whether all the words in the candidate phrase are capitalized; whether the first word of the candidate phrase is capitalized; and whether the candidate phrase has a capitalized word. For DeS, it is simply whether the word is capitalized.
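The capitalization features can be sketched in a few lines of Python. This is only an illustration of the three phrase-level checks and the single word-level check described above; the function names and dictionary keys are hypothetical, and the candidate phrase is assumed to arrive as a list of word strings.

```python
def phrase_capitalization_features(phrase_tokens):
    """Phrase-level (MoS-style) features; key names are illustrative."""
    caps = [tok[:1].isupper() for tok in phrase_tokens]
    return {
        "all_capitalized": bool(caps) and all(caps),
        "first_capitalized": bool(caps) and caps[0],
        "any_capitalized": any(caps),
    }


def word_capitalization_feature(token):
    """Word-level (DeS-style) feature: is the first letter upper case?"""
    return token[:1].isupper()
```

For example, the phrase ["New", "York", "city"] has its first word capitalized and contains a capitalized word, but not all of its words are capitalized.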
2.3.2.3 H: hypertext.
Whether a candidate phrase or word is part of the anchor text for a hypertext link is extracted as the following features. For MoS, they are: whether the whole candidate phrase matches exactly the anchor text of a link; whether all the words of the candidate phrase are in the same anchor text; and whether any word of the candidate phrase belongs to the anchor text of a link. For DeS, they are: whether the word is the beginning of the anchor text; whether the word is in the anchor text of a link, but not the first word; and whether the word is outside any anchor text.
2.3.2.4 Ms: meta section features.
The header of an HTML document may provide additional information embedded in meta tags. Although the text in this region is usually not seen by readers, whether a candidate appears in this meta section seems important. For MoS, the feature is whether the whole candidate phrase is in the meta section. For DeS, they are: whether the word is the first word in a meta tag; and whether the word occurs somewhere in a meta tag, but not as the first word.
2.3.2.5 T: title.
The only human readable text in the HTML header is the TITLE, which is usually put in the window caption by the browser. For MoS, the feature is whether the whole candidate phrase is in the title. For DeS, the features are: whether the word is the beginning of the title; and whether the word is in the title, but not the first word.
2.3.2.6 M: meta features.
In addition to TITLE, several meta tags are potentially related to keywords, and are used to derive features. In the MoS framework, the features are: whether the whole candidate phrase is in the meta-description; whether the whole candidate phrase is in the meta-keywords; and whether the whole candidate phrase is in the meta-title. For DeS, the features are: whether the word is the beginning of the meta-description; whether the word is in the meta-description, but not the first word; whether the word is the beginning of the meta-keywords; whether the word is in the meta-keywords, but not the first word; whether the word is the beginning of the meta-title; and whether the word is in the meta-title, but not the first word.
2.3.2.7 U: URL.
A web document has one additional highly useful property – the name of the document, which is its URL. For MoS, the features are: whether the whole candidate phrase is in part of the URL string; and whether any word of the candidate phrase is in the URL string. In the DeS framework, the feature is whether the word is in the URL string.
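A rough sketch of the URL features follows. The excerpt does not say how a phrase is matched against the URL string, so the matching rule here is an assumption: the URL is lowercased and split into alphanumeric tokens, and a phrase counts as "in the URL" when its words appear as a contiguous run of those tokens. The function names and the example URL are hypothetical.

```python
import re


def url_tokens(url):
    """Split a URL into lowercase alphanumeric tokens (an assumed tokenization)."""
    return re.findall(r"[a-z0-9]+", url.lower())


def url_features(phrase_tokens, url):
    """Phrase-level URL features; for a one-word phrase both values coincide."""
    toks = url_tokens(url)
    phrase = [t.lower() for t in phrase_tokens]
    n = len(phrase)
    # Whole phrase: its words occur as a contiguous run of URL tokens.
    whole = n > 0 and any(
        toks[i:i + n] == phrase for i in range(len(toks) - n + 1)
    )
    # Any word: at least one word of the phrase is a URL token.
    any_word = any(t in toks for t in phrase)
    return {"phrase_in_url": whole, "any_word_in_url": any_word}
```

With the URL "http://example.com/digital-camera/reviews", the phrase ["Digital", "Camera"] matches as a whole, while ["camera", "lens"] only triggers the any-word feature.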
2.3.2.8 IR: information retrieval oriented features.
We consider the TF (term frequency) and DF (document frequency) values of the candidate as real-valued features. The document frequency is derived by counting how many documents in our web page collection contain the given term. In addition to the original TF and DF frequency numbers, log(TF + 1) and log(DF + 1) are also used as features. The features used in the monolithic and the decomposed frameworks are basically the same, where for DeS, the “term” is the candidate word.
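As a minimal sketch of these four values, assuming each document is already a list of tokens and the term is a single token (phrase candidates would need n-gram matching, which is omitted here). The function name and dictionary keys are illustrative only.

```python
import math
from collections import Counter


def tf_df_features(term, doc_tokens, collection):
    """Raw and log-scaled TF/DF for one term.

    doc_tokens: token list of the current document.
    collection: list of token lists, one per document in the corpus.
    """
    tf = Counter(doc_tokens)[term]                      # occurrences in this document
    df = sum(1 for doc in collection if term in doc)    # documents containing the term
    return {
        "tf": tf,
        "df": df,
        "log_tf": math.log(tf + 1),
        "log_df": math.log(df + 1),
    }
```

The +1 inside the logarithm keeps the feature defined (and equal to 0) when the raw count is 0.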
2.3.2.9 Loc: relative location of the candidate.
The beginning of a document often contains an introduction or summary with important words and phrases. Therefore, the location of the occurrence of the word or phrase in the document is also extracted as a feature. Since the length of a document or a sentence varies considerably, we take only the ratio of the location instead of the absolute number. For example, if a word appears in the 10th position, while the whole document contains 200 words, the ratio is then 0.05. These features used for the monolithic and decomposed frameworks are the same. When the candidate is a phrase, its first word is used as its location. There are three different relative locations used as features: wordRatio – the relative location of the candidate in the sentence; sentRatio – the location of the sentence containing the candidate divided by the total number of sentences in the document; wordDocRatio – the relative location of the candidate in the document. In addition to these 3 real-valued features, we also use their logarithms as features. Specifically, we used log(1+wordRatio), log(1+sentRatio), and log(1 + wordDocRatio).
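The three ratios and their logarithms could be computed roughly as follows. The sketch assumes the document is a list of sentences, each a list of tokens, and that positions are counted from 1 so that the 10th word of a 200-word document gives 0.05 as in the example above. The function signature is hypothetical; only the feature names wordRatio, sentRatio and wordDocRatio come from the text.

```python
import math


def location_features(sentences, sent_idx, word_idx):
    """Relative-location features for a candidate.

    sentences: list of token lists.
    sent_idx, word_idx: 0-based indices of the candidate's first word.
    Positions are converted to 1-based before taking ratios (an assumption).
    """
    sent_len = len(sentences[sent_idx])
    doc_len = sum(len(s) for s in sentences)
    words_before = sum(len(s) for s in sentences[:sent_idx])
    ratios = {
        "wordRatio": (word_idx + 1) / sent_len,
        "sentRatio": (sent_idx + 1) / len(sentences),
        "wordDocRatio": (words_before + word_idx + 1) / doc_len,
    }
    # Log-scaled copies, log(1 + ratio), as listed in the text.
    logs = {"log_" + name: math.log(1 + value) for name, value in ratios.items()}
    return {**ratios, **logs}
```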
2.3.2.10 Len: sentence and document length.
The length (in words) of the sentence (sentLen) where the candidate occurs, and the length of the whole document (docLen) (words in the header are not included) are used as features. Similarly, log(1+sentLen) and log(1+docLen) are also included.
2.3.2.11 phLen: length of the candidate phrase.
For the monolithic framework, the length of the candidate phrase (phLen) in words and log(1+phLen) are included as features. These features are not used in the decomposed framework.
2.3.2.12 Q: query log.
The query log of a search engine reflects the distribution of the keywords people are most interested in. We use the information to create the following features. For these experiments, unless otherwise mentioned, we used a log file with the most frequent 7.5 million queries. For the monolithic framework, we consider one binary feature – whether the phrase appears in the query log, and two real-valued features – the frequency with which it appears and the log value, log(1 + frequency). For the decomposed framework, we consider more variations of this information: whether the word appears in the query log file as the first word of a query; whether the word appears in the query log file as an interior word of a query; and whether the word appears in the query log file as the last word of a query. The frequency values of the above features and their log values (log(1 + f), where f is the corresponding frequency value) are also used as real-valued features. Finally, whether the word never appears in any query log entries is also a feature.
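A small sketch of the query-log features, under several assumptions that the excerpt does not settle: the log is represented as a dictionary mapping each query string to its count, queries are tokenized on whitespace, positional frequencies are weighted by query count, and a one-word query counts as both the first and the last word. All function and key names are hypothetical.

```python
import math
from collections import Counter


def phrase_query_features(phrase, query_counts):
    """Monolithic-style features: is the phrase a logged query, and how often?"""
    freq = query_counts.get(phrase, 0)
    return {"in_log": freq > 0, "freq": freq, "log_freq": math.log(1 + freq)}


def word_query_features(word, query_counts):
    """Decomposed-style features: where in logged queries does the word occur?"""
    freq = Counter()
    for query, count in query_counts.items():
        tokens = query.split()
        for i, tok in enumerate(tokens):
            if tok != word:
                continue
            is_first = i == 0
            is_last = i == len(tokens) - 1
            if is_first:
                freq["first"] += count
            if is_last:
                freq["last"] += count
            if not is_first and not is_last:
                freq["interior"] += count
    feats = {}
    for pos in ("first", "interior", "last"):
        feats[pos] = freq[pos] > 0
        feats[pos + "_freq"] = freq[pos]
        feats[pos + "_log_freq"] = math.log(1 + freq[pos])
    # True only when the word occurs in no logged query at all.
    feats["never_in_log"] = not any(feats[p] for p in ("first", "interior", "last"))
    return feats
```

For a real log of millions of queries, the positional counts would be precomputed once per word rather than rescanned per candidate; the loop above is kept simple for readability.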
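Summary after a discussion with a senior labmate
Background and applications:
Content-based advertising keyword recommendation systems, e.g. the online advertising systems of Google, Yahoo, eBay, and others
Question answering systems
Keyword substitution and expansion
Trimming, adjusting, and reformulating verbose queries
Assisting classification
Assisting topic tracking
Feature selection:
1. Linguistic features: use POS (part-of-speech) tagging to label each word's part of speech, e.g. noun, verb, adverb, adjective.
2. title: whether the keyword appears in the document's title.
3. position: where the keyword occurs in the document, e.g. whether it appears in the first or last sentence of the whole article, or in the first or last sentence of a paragraph. "Automatic Keyword Extraction Using Linguistic Features" describes this method in detail.
4. TF, IDF: the most basic information-weighting features.
5. NE: whether the keyword is a named entity, such as a person or place name, or date information, such as a year, month, day, or time.
6. Relationships between keywords: the semantic distance between keywords. Is larger better, is smaller better, or does it not matter?
7. Information content of surrounding words: is the information content of the few words near this word high? Put differently, how informative is the sentence containing the word relative to the whole article?
8. Whether the keyword has appeared among other keywords: its probability of occurring as a keyword.
9. Category of the document: see classification-based and concept-based keyword extraction.
10. Whether the word appears in a summarizing sentence.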
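On NE
1. Used in the paper "News-Oriented Automatic Chinese Keyword Indexing".
2. NEs carry very high information content.
3. NEs are highly discriminative.
Issues worth noting and discussing:
1. The definition of a keyword: is it the most discriminative term, or the most informative one?
2. The effects of word segmentation, the granularity of TF, and the problems inherent in segmentation itself. "Chinese keyword extraction based on max-duplicated Strings of the Documents" finds the maximal repeated substrings.
"News-Oriented Automatic Chinese Keyword Indexing" describes Chinese keyword extraction and is a classic paper. It proposes counting character frequencies before word segmentation, which addresses both inaccurate segmentation and segmentation granularity. It also covers methods for filtering keywords, among other things: tag the strings with POS, then filter out words whose part of speech carries little information, such as conjunctions and adverbs.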
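To make the idea of counting on raw characters before segmentation concrete, here is a loose sketch that counts character n-grams directly on unsegmented text and keeps the strings that repeat. It is not the algorithm of either paper cited above; the n-gram length limit, the thresholds, and the function names are all assumptions chosen for illustration.

```python
import re
from collections import Counter


def char_ngram_counts(text, max_n=4):
    """Count character n-grams (1..max_n) without any word segmentation.

    The text is first split on punctuation and whitespace so that
    n-grams never span a sentence or clause boundary.
    """
    counts = Counter()
    for run in re.split(r"\W+", text):
        for n in range(1, max_n + 1):
            for i in range(len(run) - n + 1):
                counts[run[i:i + n]] += 1
    return counts


def repeated_strings(text, min_count=2, min_len=2, max_n=4):
    """Keep n-grams of at least min_len characters seen at least min_count times."""
    counts = char_ngram_counts(text, max_n)
    return {s: c for s, c in counts.items() if c >= min_count and len(s) >= min_len}
```

On the text "关键词提取很重要，关键词提取方法很多。", strings such as "关键词" and "提取" repeat and are kept, while "很重" occurs once and is dropped. A real system would still need to prune substrings of longer repeated strings and apply the POS filter described above.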
On how to pick the most effective features from those selected, see the paper "Multi-Subset Selection for Keyword Extraction and Other Prototype Search Tasks Using Feature Selection Algorithms".