首页 > 代码库 > 自然语言处理(1)之NLTK与PYTHON
自然语言处理(1)之NLTK与PYTHON
自然语言处理(1)之NLTK与PYTHON
题记: 由于现在的项目是搜索引擎,所以不由的对自然语言处理产生了好奇,再加上一直以来都想学Python,只是没有机会与时间。碰巧这几天在亚马逊上找书时发现了这本《Python自然语言处理》,瞬间觉得这对我同时入门自然语言处理与Python有很大的帮助。所以最近都会学习这本书,也写下这些笔记。
1. NLTK简述
NLTK模块及功能介绍
语言处理任务 | NLTK模块 | 功能描述 |
获取语料库 | nltk.corpus | 语料库和词典的标准化接口 |
字符串处理 | nltk.tokenize,nltk.stem | 分词、句子分解、提取主干 |
搭配研究 | nltk.collocations | t-检验,卡方,点互信息 |
词性标示符 | nltk.tag | n-gram,backoff,Brill,HMM,TnT |
分类 | nltk.classify,nltk.cluster | 决策树,最大熵,朴素贝叶斯,EM,k-means |
分块 | nltk.chunk | 正则表达式,n-gram,命名实体 |
解析 | nltk.parse | 图标,基于特征,一致性,概率性,依赖项 |
语义解释 | nltk.sem,nltk.inference | λ演算,一阶逻辑,模型检验 |
指标评测 | nltk.metrics | 精度,召回率,协议系数 |
概率与估计 | nltk.probability | 频率分布,平滑概率分布 |
应用 | nltk.app,nltk.chat | 图形化的关键词排序,分析器,WordNet查看器,聊天机器人 |
语言学领域的工作 | nltk.toolbox | 处理SIL工具箱格式的数据 |
2. NLTK安装
我的Python版本是2.7.5,NLTK版本2.0.4
1 DESCRIPTION 2 The Natural Language Toolkit (NLTK) is an open source Python library 3 for Natural Language Processing. A free online book is available. 4 (If you use the library for academic research, please cite the book.) 5 6 Steven Bird, Ewan Klein, and Edward Loper (2009). 7 Natural Language Processing with Python. O‘Reilly Media Inc. 8 http://nltk.org/book 9 10 @version: 2.0.4
安装步骤跟http://www.nltk.org/install.html 一样
1. 安装Setuptools: http://pypi.python.org/pypi/setuptools
在页面的最下面setuptools-5.7.tar.gz
2. 安装 Pip: 运行 sudo easy_install pip(一定要以root权限运行)
3. 安装 Numpy (optional): 运行 sudo pip install -U numpy
4. 安装 NLTK: 运行 sudo pip install -U nltk
5. 进入python,并输入以下命令
1 192:chapter2 rcf$ python2 Python 2.7.5 (default, Mar 9 2014, 22:15:05) 3 [GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin4 Type "help", "copyright", "credits" or "license" for more information.5 >>> import nltk6 >>> nltk.download()
当出现以下界面进行nltk_data的下载
也可直接到 http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml 去下载数据包,并拖到Download Directory。我就是这么做的。
最后在Python目录运行以下命令以及结果,说明安装已成功
1 from nltk.book import * 2 *** Introductory Examples for the NLTK Book *** 3 Loading text1, ..., text9 and sent1, ..., sent9 4 Type the name of the text or sentence to view it. 5 Type: ‘texts()‘ or ‘sents()‘ to list the materials. 6 text1: Moby Dick by Herman Melville 1851 7 text2: Sense and Sensibility by Jane Austen 1811 8 text3: The Book of Genesis 9 text4: Inaugural Address Corpus10 text5: Chat Corpus11 text6: Monty Python and the Holy Grail12 text7: Wall Street Journal13 text8: Personals Corpus14 text9: The Man Who Was Thursday by G . K . Chesterton 1908
3. NLTK的初次使用
现在开始进入正题,由于本人没学过python,所以使用NLTK也就是学习Python的过程。初次学习NLTK主要使用的时NLTK里面自带的一些现有数据,上图中已由显示,这些数据都在nltk.book里面。
3.1 搜索文本
concordance:搜索text1中的monstrous
1 >>> text1.concordance("monstrous") 2 Building index... 3 Displaying 11 of 11 matches: 4 ong the former , one was of a most monstrous size . ... This came towards us , 5 ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r 6 ll over with a heathenish array of monstrous clubs and spears . Some were thick 7 d as you gazed , and wondered what monstrous cannibal and savage could ever hav 8 that has survived the flood ; most monstrous and most mountainous ! That Himmal 9 they might scout at Moby Dick as a monstrous fable , or still worse and more de10 th of Radney .‘" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l11 ing Scenes . In connexion with the monstrous pictures of whales , I am strongly12 ere to enter upon those still more monstrous stories of them which are to be fo13 ght have been rummaged out of this monstrous cabinet there is no telling . But 14 of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u
similar:查找text1中与monstrous相关的所有词语
1 >>> text1.similar("monstrous")2 Building word-context index...3 abundant candid careful christian contemptible curious delightfully4 determined doleful domineering exasperate fearless few gamesome5 horrible impalpable imperial lamentable lazy loving
dispersion_plot:用离散图判断词在文本的位置即偏移量
1 >>> text4.dispersion_plot(["citizens","democracy","freedom","duties","America"])
3.2 计数词汇
len:获取长度,即可获取文章的词汇个数,也可获取单个词的长度
1 >>> len(text1) #计算text1的词汇个数2 2608193 >>> len(set(text1)) #计算text1 不同的词汇个数4 193175 >>> len(text1[0]) #计算text1 第一个词的长度6 1
sorted:排序
1 >>> sent12 [‘Call‘, ‘me‘, ‘Ishmael‘, ‘.‘]3 >>> sorted(sent1)4 [‘.‘, ‘Call‘, ‘Ishmael‘, ‘me‘]
3.3 频率分布
nltk.probability.FreqDist
1 >>> fdist1=FreqDist(text1) #获取text1的频率分布情况2 >>> fdist1 #text1具有19317个样本,但是总体有260819个值3 <FreqDist with 19317 samples and 260819 outcomes> 4 >>> keys=fdist1.keys() 5 >>> keys[:50] #获取text1的前50个样本
6 [‘,‘, ‘the‘, ‘.‘, ‘of‘, ‘and‘, ‘a‘, ‘to‘, ‘;‘, ‘in‘, ‘that‘, "‘", ‘-‘, ‘his‘, ‘it‘, ‘I‘, ‘s‘, ‘is‘, ‘he‘, ‘with‘, ‘was‘, ‘as‘, ‘"‘, ‘all‘, ‘for‘, ‘this‘, ‘!‘, ‘at‘, ‘by‘, ‘but‘, ‘not‘, ‘--‘, ‘him‘, ‘from‘, ‘be‘, ‘on‘, ‘so‘, ‘whale‘, ‘one‘, ‘you‘, ‘had‘, ‘have‘, ‘there‘, ‘But‘, ‘or‘, ‘were‘, ‘now‘, ‘which‘, ‘?‘, ‘me‘, ‘like‘]
1 >>> fdist1.items()[:50] #text1的样本分布情况,比如‘,‘出现了18713次,总共的词为2608192 [(‘,‘, 18713), (‘the‘, 13721), (‘.‘, 6862), (‘of‘, 6536), (‘and‘, 6024), (‘a‘, 4569), (‘to‘, 4542), (‘;‘, 4072), (‘in‘, 3916), (‘that‘, 2982), ("‘", 2684), (‘-‘, 2552), (‘his‘, 2459), (‘it‘, 2209), (‘I‘, 2124), (‘s‘, 1739), (‘is‘, 1695), (‘he‘, 1661), (‘with‘, 1659), (‘was‘, 1632), (‘as‘, 1620), (‘"‘, 1478), (‘all‘, 1462), (‘for‘, 1414), (‘this‘, 1280), (‘!‘, 1269), (‘at‘, 1231), (‘by‘, 1137), (‘but‘, 1113), (‘not‘, 1103), (‘--‘, 1070), (‘him‘, 1058), (‘from‘, 1052), (‘be‘, 1030), (‘on‘, 1005), (‘so‘, 918), (‘whale‘, 906), (‘one‘, 889), (‘you‘, 841), (‘had‘, 767), (‘have‘, 760), (‘there‘, 715), (‘But‘, 705), (‘or‘, 697), (‘were‘, 680), (‘now‘, 646), (‘which‘, 640), (‘?‘, 637), (‘me‘, 627), (‘like‘, 624)]
1 >>> fdist1.hapaxes()[:50] #text1的样本只出现一次的词2 [‘!\‘"‘, ‘!)"‘, ‘!*‘, ‘!--"‘, ‘"...‘, "‘,--", "‘;", ‘):‘, ‘);--‘, ‘,)‘, ‘--\‘"‘, ‘---"‘, ‘---,‘, ‘."*‘, ‘."--‘, ‘.*--‘, ‘.--"‘, ‘100‘, ‘101‘, ‘102‘, ‘103‘, ‘104‘, ‘105‘, ‘106‘, ‘107‘, ‘108‘, ‘109‘, ‘11‘, ‘110‘, ‘111‘, ‘112‘, ‘113‘, ‘114‘, ‘115‘, ‘116‘, ‘117‘, ‘118‘, ‘119‘, ‘12‘, ‘120‘, ‘121‘, ‘122‘, ‘123‘, ‘124‘, ‘125‘, ‘126‘, ‘127‘, ‘128‘, ‘129‘, ‘130‘]
3 >>> fdist1[‘!\‘"‘]
4 1
1 >>> fdist1.plot(50,cumulative=True) #画出text1的频率分布图
3.4 细粒度的选择词
1 >>> long_words=[w for w in set(text1) if len(w) > 15] #获取text1内样本词汇长度大于15的词并按字典序排序2 >>> sorted(long_words) 3 [‘CIRCUMNAVIGATION‘, ‘Physiognomically‘, ‘apprehensiveness‘, ‘cannibalistically‘, ‘characteristically‘, ‘circumnavigating‘, ‘circumnavigation‘, ‘circumnavigations‘, ‘comprehensiveness‘, ‘hermaphroditical‘, ‘indiscriminately‘, ‘indispensableness‘, ‘irresistibleness‘, ‘physiognomically‘, ‘preternaturalness‘, ‘responsibilities‘, ‘simultaneousness‘, ‘subterraneousness‘, ‘supernaturalness‘, ‘superstitiousness‘, ‘uncomfortableness‘, ‘uncompromisedness‘, ‘undiscriminating‘, ‘uninterpenetratingly‘]4 >>> fdist1=FreqDist(text1) #获取text1内样本词汇长度大于7且出现次数大于7的词并按字典序排序
5 >>> sorted([wforwin set(text5) if len(w) > 7 and fdist1[w] > 7]) 6 [‘American‘, ‘actually‘, ‘afternoon‘, ‘anything‘, ‘attention‘, ‘beautiful‘, ‘carefully‘, ‘carrying‘, ‘children‘, ‘commanded‘, ‘concerning‘, ‘considered‘, ‘considering‘, ‘difference‘, ‘different‘, ‘distance‘, ‘elsewhere‘, ‘employed‘, ‘entitled‘, ‘especially‘, ‘everything‘, ‘excellent‘, ‘experience‘, ‘expression‘, ‘floating‘, ‘following‘, ‘forgotten‘, ‘gentlemen‘, ‘gigantic‘, ‘happened‘, ‘horrible‘, ‘important‘, ‘impossible‘, ‘included‘, ‘individual‘, ‘interesting‘, ‘invisible‘, ‘involved‘, ‘monsters‘, ‘mountain‘, ‘occasional‘, ‘opposite‘, ‘original‘, ‘originally‘, ‘particular‘, ‘pictures‘, ‘pointing‘, ‘position‘, ‘possibly‘, ‘probably‘, ‘question‘, ‘regularly‘, ‘remember‘, ‘revolving‘, ‘shoulders‘, ‘sleeping‘, ‘something‘, ‘sometimes‘, ‘somewhere‘, ‘speaking‘, ‘specially‘, ‘standing‘, ‘starting‘, ‘straight‘, ‘stranger‘, ‘superior‘, ‘supposed‘, ‘surprise‘, ‘terrible‘, ‘themselves‘, ‘thinking‘, ‘thoughts‘, ‘together‘, ‘understand‘, ‘watching‘, ‘whatever‘, ‘whenever‘, ‘wonderful‘, ‘yesterday‘, ‘yourself‘]
3.5 词语搭配和双连词
用bigrams()可以实现双连词
1 >>> bigrams([‘more‘,‘is‘,‘said‘,‘than‘,‘done‘])2 [(‘more‘, ‘is‘), (‘is‘, ‘said‘), (‘said‘, ‘than‘), (‘than‘, ‘done‘)]3 >>> text1.collocations()4 Building collocations list5 Sperm Whale; Moby Dick; White Whale; old man; Captain Ahab; sperm6 whale; Right Whale; Captain Peleg; New Bedford; Cape Horn; cried Ahab;7 years ago; lower jaw; never mind; Father Mapple; cried Stubb; chief8 mate; white whale; ivory leg; one hand
3.6 NLTK频率分类中定义的函数
例子 | 描述 |
fdist=FreqDist(samples) | 创建包含给定样本的频率分布 |
fdist.inc(sample) | 增加样本 |
fdist[‘monstrous‘] | 计数给定样本出现的次数 |
fdist.freq(‘monstrous‘) | 样本总数 |
fdist.N() | 以频率递减顺序排序的样本链表 |
fdist.keys() | 以频率递减的顺序便利样本 |
for sample in fdist: | 数字最大的样本 |
fdist.max() | 绘制频率分布表 |
fdist.tabulate() | 绘制频率分布图 |
fdist.plot() | 绘制积累频率分布图 |
fdist.plot(cumulative=True) | 绘制积累频率分布图 |
fdist1<fdist2 | 测试样本在fdist1中出现的样本是否小于fdist2 |
最后看下text1的类情况. 使用type可以查看变量类型,使用help()可以获取类的属性以及方法。以后想要获取具体的方法可以使用help(),这个还是很好用的。
1 >>> type(text1) 2 <class ‘nltk.text.Text‘> 3 >>> help(‘nltk.text.Text‘) 4 Help on class Text in nltk.text: 5 6 nltk.text.Text = class Text(__builtin__.object) 7 | A wrapper around a sequence of simple (string) tokens, which is 8 | intended to support initial exploration of texts (via the 9 | interactive console). Its methods perform a variety of analyses10 | on the text‘s contexts (e.g., counting, concordancing, collocation11 | discovery), and display the results. If you wish to write a12 | program which makes use of these analyses, then you should bypass13 | the ``Text`` class, and use the appropriate analysis function or14 | class directly instead.15 | 16 | A ``Text`` is typically initialized from a given document or17 | corpus. E.g.:18 | 19 | >>> import nltk.corpus20 | >>> from nltk.text import Text21 | >>> moby = Text(nltk.corpus.gutenberg.words(‘melville-moby_dick.txt‘))22 | 23 | Methods defined here:24 | 25 | __getitem__(self, i)26 | 27 | __init__(self, tokens, name=None)28 | Create a Text object.29 | 30 | :param tokens: The source text.31 | :type tokens: sequence of str32 | 33 | __len__(self)34 | 35 | __repr__(self)36 | :return: A string representation of this FreqDist.37 | :rtype: string38 | 39 | collocations(self, num=20, window_size=2)40 | Print collocations derived from the text, ignoring stopwords.41 | 42 | :seealso: find_collocations43 | :param num: The maximum number of collocations to print.44 | :type num: int45 | :param window_size: The number of tokens spanned by a collocation (default=2)46 | :type window_size: int47 | 48 | common_contexts(self, words, num=20)49 | Find contexts where the specified words appear; list50 | most frequent common contexts first.51 | 52 | :param word: The word used to seed the similarity search53 | :type word: str54 | :param num: The number of words to generate (default=20)55 | :type num: int56 | :seealso: ContextIndex.common_contexts()
4. 语言理解的技术
1. 词意消歧
2. 指代消解
3. 自动生成语言
4. 机器翻译
5. 人机对话系统
6. 文本的含义
5. 总结
虽然是初次接触Python,NLTK,但是我已经觉得他们的好用以及方便,接下来就会深入的学习他们。