
Natural Language Processing (1): NLTK and Python

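Preface: My current project is a search engine, which made me curious about natural language processing. I have also wanted to learn Python for a long time but never had the time or the opportunity. While browsing Amazon for books a few days ago, I came across Natural Language Processing with Python (《Python自然语言处理》) and immediately felt it would help me get started with natural language processing and Python at the same time. I will be working through the book over the coming weeks and writing these notes as I go.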

1. A Brief Overview of NLTK

NLTK modules and their functionality

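| Language processing task | NLTK modules | Functionality |
|---|---|---|
| Accessing corpora | nltk.corpus | Standardized interfaces to corpora and lexicons |
| String processing | nltk.tokenize, nltk.stem | Tokenizers, sentence tokenizers, stemmers |
| Collocation discovery | nltk.collocations | t-test, chi-squared, pointwise mutual information |
| Part-of-speech tagging | nltk.tag | n-gram, backoff, Brill, HMM, TnT |
| Classification | nltk.classify, nltk.cluster | Decision tree, maximum entropy, naive Bayes, EM, k-means |
| Chunking | nltk.chunk | Regular expression, n-gram, named entity |
| Parsing | nltk.parse | Chart, feature-based, unification, probabilistic, dependency |
| Semantic interpretation | nltk.sem, nltk.inference | Lambda calculus, first-order logic, model checking |
| Evaluation metrics | nltk.metrics | Precision, recall, agreement coefficients |
| Probability and estimation | nltk.probability | Frequency distributions, smoothed probability distributions |
| Applications | nltk.app, nltk.chat | Graphical concordancer, parsers, WordNet browser, chatbots |
| Linguistic fieldwork | nltk.toolbox | Manipulate data in SIL Toolbox format |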

2. Installing NLTK

  I am using Python 2.7.5 and NLTK 2.0.4.

DESCRIPTION
    The Natural Language Toolkit (NLTK) is an open source Python library
    for Natural Language Processing.  A free online book is available.
    (If you use the library for academic research, please cite the book.)

    Steven Bird, Ewan Klein, and Edward Loper (2009).
    Natural Language Processing with Python.  O'Reilly Media Inc.
    http://nltk.org/book

    @version: 2.0.4

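The installation steps are the same as those at http://www.nltk.org/install.html:

1. Install Setuptools: http://pypi.python.org/pypi/setuptools

  The download, setuptools-5.7.tar.gz, is at the very bottom of that page.

2. Install Pip: run sudo easy_install pip (this must be run with root privileges)

3. Install Numpy (optional): run sudo pip install -U numpy

4. Install NLTK: run sudo pip install -U nltk

5. Start python and enter the following commands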

192:chapter2 rcf$ python
Python 2.7.5 (default, Mar  9 2014, 22:15:05)
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk
>>> nltk.download()
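Before starting the downloader, it can help to confirm that the packages from steps 3 and 4 are actually importable. The following is only a minimal sketch using the standard library; the helper name `is_installed` is made up for illustration, and it assumes Python 3.4 or later (the session above uses Python 2.7, where `importlib.util.find_spec` is not available).

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(module_name) is not None


# Report the status of the two packages installed in the steps above.
for name in ("nltk", "numpy"):
    status = "installed" if is_installed(name) else "missing"
    print(name, status)
```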

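When the NLTK downloader window appears, use it to download nltk_data.

Alternatively, you can download the data packages directly from http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml and drag them into the Download Directory. That is what I did.

Finally, run the following command in Python. Output like the following means the installation succeeded.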

>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: texts() or sents() to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

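3. First Steps with NLTK

  Now for the main topic. Since I have never learned Python, using NLTK is also how I am learning Python. A first pass through NLTK mostly relies on the sample data that ships with it. As the output above shows, this data lives in nltk.book.

3.1 Searching text

concordance: find every occurrence of "monstrous" in text1, shown with its surrounding context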

>>> text1.concordance("monstrous")
Building index...
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us ,
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney ." CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u
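To get a feel for what a concordance does, here is a rough plain-Python sketch of a keyword-in-context search. It is not NLTK's implementation: the function name `keyword_in_context`, the `width` parameter, and the tiny token list are all made up for illustration. The matching is case-insensitive, as the "Monstrous Pictures" line in the output above suggests.

```python
def keyword_in_context(tokens, word, width=3):
    """Return (left context, match, right context) for each occurrence of word."""
    target = word.lower()
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


# A toy token list standing in for a real text.
tokens = "Of the Monstrous Pictures of Whales and the monstrous stories of them".split()
for left, tok, right in keyword_in_context(tokens, "monstrous"):
    print("%s [%s] %s" % (left, tok, right))
```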

similar: find words in text1 that appear in contexts similar to those of "monstrous"

>>> text1.similar("monstrous")
Building word-context index...
abundant candid careful christian contemptible curious delightfully
determined doleful domineering exasperate fearless few gamesome
horrible impalpable imperial lamentable lazy loving
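The idea behind this kind of similarity is that two words are alike if they occur between the same neighbours. The sketch below illustrates that idea in plain Python; it is a simplification and not NLTK's actual scoring, and the function name `similar_words` and the toy sentence are assumptions made for the example.

```python
from collections import Counter, defaultdict


def similar_words(tokens, word, num=5):
    """Rank other words by how many (previous, next) contexts they share with word."""
    contexts = defaultdict(set)
    # Record the (previous word, next word) pair around every interior token.
    for prev, cur, nxt in zip(tokens, tokens[1:], tokens[2:]):
        contexts[cur.lower()].add((prev.lower(), nxt.lower()))
    target = contexts.get(word.lower(), set())
    scores = Counter()
    for other, ctxs in contexts.items():
        shared = len(ctxs & target)
        if other != word.lower() and shared:
            scores[other] = shared
    return [w for w, _ in scores.most_common(num)]


tokens = ("a monstrous size and a curious size and "
          "a monstrous fable and a curious fable and a whale").split()
print(similar_words(tokens, "monstrous"))
```

Here "curious" is the only word that shares contexts with "monstrous" (both appear between "a" and "size", and between "a" and "fable").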

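dispersion_plot: draw a dispersion plot showing where each word occurs in the text, measured as its offset in words from the beginning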

>>> text4.dispersion_plot(["citizens","democracy","freedom","duties","America"])
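The data behind a dispersion plot is simply the list of positions at which each word occurs. As a minimal sketch (the function name `word_offsets` and the sample sentence are made up, and no plotting is done here), the offsets could be collected like this:

```python
def word_offsets(tokens, words):
    """Map each requested word to the token positions where it occurs."""
    offsets = {w: [] for w in words}
    for position, tok in enumerate(tokens):
        if tok in offsets:
            offsets[tok].append(position)
    return offsets


tokens = "citizens value freedom and citizens have duties to protect freedom".split()
offsets = word_offsets(tokens, ["citizens", "freedom", "duties"])
for word in sorted(offsets):
    print(word, offsets[word])
```

Each position would become one tick mark on the row for that word in the plot.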

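3.2 Counting vocabulary

len: returns a length. It gives the number of tokens (words and punctuation marks) in a text, and also the number of characters in a single word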

>>> len(text1)        # number of tokens in text1
260819
>>> len(set(text1))   # number of distinct tokens in text1
19317
>>> len(text1[0])     # length of the first token in text1
1
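Dividing the first number by the second gives the average number of times each distinct token is used, a rough measure of how repetitive a text is. The helper name `tokens_per_type` and the sample list below are made up for illustration; the last line plugs in the two counts reported above for text1.

```python
def tokens_per_type(tokens):
    """Average number of occurrences per distinct token."""
    return len(tokens) / float(len(set(tokens)))


sample = ["Call", "me", "Ishmael", ".", "Call", "me", "again", "."]
print(len(sample), len(set(sample)), tokens_per_type(sample))

# Using the counts shown above for text1: 260819 tokens, 19317 distinct tokens.
text1_ratio = 260819 / 19317.0
print(round(text1_ratio, 2))
```

For text1 this works out to roughly 13.5 uses per distinct token.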

sorted: sorting

>>> sent1
['Call', 'me', 'Ishmael', '.']
>>> sorted(sent1)
['.', 'Call', 'Ishmael', 'me']

3.3 Frequency distributions

nltk.probability.FreqDist

>>> fdist1 = FreqDist(text1)    # frequency distribution of text1
>>> fdist1                      # text1 has 19317 distinct samples and 260819 outcomes in total
<FreqDist with 19317 samples and 260819 outcomes>
>>> keys = fdist1.keys()
>>> keys[:50]                   # the 50 most frequent samples in text1
[',', 'the', '.', 'of', 'and', 'a', 'to', ';', 'in', 'that', "'", '-', 'his', 'it', 'I', 's', 'is', 'he', 'with', 'was', 'as', '"', 'all', 'for', 'this', '!', 'at', 'by', 'but', 'not', '--', 'him', 'from', 'be', 'on', 'so', 'whale', 'one', 'you', 'had', 'have', 'there', 'But', 'or', 'were', 'now', 'which', '?', 'me', 'like']
>>> fdist1.items()[:50]         # (sample, count) pairs; for example ',' occurs 18713 times out of 260819 tokens
[(',', 18713), ('the', 13721), ('.', 6862), ('of', 6536), ('and', 6024), ('a', 4569), ('to', 4542), (';', 4072), ('in', 3916), ('that', 2982), ("'", 2684), ('-', 2552), ('his', 2459), ('it', 2209), ('I', 2124), ('s', 1739), ('is', 1695), ('he', 1661), ('with', 1659), ('was', 1632), ('as', 1620), ('"', 1478), ('all', 1462), ('for', 1414), ('this', 1280), ('!', 1269), ('at', 1231), ('by', 1137), ('but', 1113), ('not', 1103), ('--', 1070), ('him', 1058), ('from', 1052), ('be', 1030), ('on', 1005), ('so', 918), ('whale', 906), ('one', 889), ('you', 841), ('had', 767), ('have', 760), ('there', 715), ('But', 705), ('or', 697), ('were', 680), ('now', 646), ('which', 640), ('?', 637), ('me', 627), ('like', 624)]
>>> fdist1.hapaxes()[:50]       # samples that occur only once in text1
['!\'"', '!)"', '!*', '!--"', '"...', "',--", "';", '):', ');--', ',)', '--\'"', '---"', '---,', '."*', '."--', '.*--', '.--"', '100', '101', '102', '103', '104', '105', '106', '107', '108', '109', '11', '110', '111', '112', '113', '114', '115', '116', '117', '118', '119', '12', '120', '121', '122', '123', '124', '125', '126', '127', '128', '129', '130']
>>> fdist1['!\'"']
1

>>> fdist1.plot(50, cumulative=True)   # plot the cumulative frequency of the 50 most frequent samples in text1
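The same kinds of questions can be explored on a small scale with `collections.Counter` from the standard library. This is only an analogue for illustration, not NLTK's `FreqDist`, and the toy sentence is made up. It shows the most frequent samples, the hapaxes, and the running totals that a cumulative plot would display.

```python
from collections import Counter

tokens = "the whale and the sea and the ship".split()
counts = Counter(tokens)

# Most frequent samples first.
top_two = counts.most_common(2)
print(top_two)

# Hapaxes: samples that occur exactly once.
hapaxes = sorted(w for w, c in counts.items() if c == 1)
print(hapaxes)

# Running total of counts, taking samples in order of decreasing frequency.
running = 0
cumulative = []
for word, count in counts.most_common():
    running += count
    cumulative.append(running)
print(cumulative)
```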

3.4 Fine-grained selection of words

>>> long_words = [w for w in set(text1) if len(w) > 15]   # distinct words in text1 longer than 15 characters
>>> sorted(long_words)                                    # sorted alphabetically
['CIRCUMNAVIGATION', 'Physiognomically', 'apprehensiveness', 'cannibalistically', 'characteristically', 'circumnavigating', 'circumnavigation', 'circumnavigations', 'comprehensiveness', 'hermaphroditical', 'indiscriminately', 'indispensableness', 'irresistibleness', 'physiognomically', 'preternaturalness', 'responsibilities', 'simultaneousness', 'subterraneousness', 'supernaturalness', 'superstitiousness', 'uncomfortableness', 'uncompromisedness', 'undiscriminating', 'uninterpenetratingly']
>>> fdist1 = FreqDist(text1)
>>> # Words from the vocabulary of text5 that are longer than 7 characters and occur more than
>>> # 7 times in text1, sorted alphabetically. To select from text1 alone, iterate over set(text1).
>>> sorted([w for w in set(text5) if len(w) > 7 and fdist1[w] > 7])
['American', 'actually', 'afternoon', 'anything', 'attention', 'beautiful', 'carefully', 'carrying', 'children', 'commanded', 'concerning', 'considered', 'considering', 'difference', 'different', 'distance', 'elsewhere', 'employed', 'entitled', 'especially', 'everything', 'excellent', 'experience', 'expression', 'floating', 'following', 'forgotten', 'gentlemen', 'gigantic', 'happened', 'horrible', 'important', 'impossible', 'included', 'individual', 'interesting', 'invisible', 'involved', 'monsters', 'mountain', 'occasional', 'opposite', 'original', 'originally', 'particular', 'pictures', 'pointing', 'position', 'possibly', 'probably', 'question', 'regularly', 'remember', 'revolving', 'shoulders', 'sleeping', 'something', 'sometimes', 'somewhere', 'speaking', 'specially', 'standing', 'starting', 'straight', 'stranger', 'superior', 'supposed', 'surprise', 'terrible', 'themselves', 'thinking', 'thoughts', 'together', 'understand', 'watching', 'whatever', 'whenever', 'wonderful', 'yesterday', 'yourself']
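The pattern in both queries is a list comprehension over the vocabulary with one or more conditions. A small self-contained sketch of the same pattern, using a made-up token list and `collections.Counter` in place of `FreqDist` (the thresholds are chosen only to suit the toy data):

```python
from collections import Counter

tokens = ("whale whale whale circumnavigation sea sea "
          "harpooneers harpooneers harpooneers ship").split()
counts = Counter(tokens)

# Condition on length only: distinct words longer than 10 characters.
long_words = sorted(w for w in set(tokens) if len(w) > 10)

# Condition on length and frequency: longer than 7 characters and seen more than twice.
long_and_frequent = sorted(w for w in set(tokens) if len(w) > 7 and counts[w] > 2)

print(long_words)
print(long_and_frequent)
```

Adding the frequency condition drops "circumnavigation", which is long but occurs only once.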

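3.5 Collocations and bigrams

bigrams() returns the bigrams of a word list, that is, its pairs of adjacent words; collocations() lists word pairs that occur together unusually often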

>>> bigrams(['more', 'is', 'said', 'than', 'done'])
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
>>> text1.collocations()
Building collocations list
Sperm Whale; Moby Dick; White Whale; old man; Captain Ahab; sperm
whale; Right Whale; Captain Peleg; New Bedford; Cape Horn; cried Ahab;
years ago; lower jaw; never mind; Father Mapple; cried Stubb; chief
mate; white whale; ivory leg; one hand
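Forming bigrams amounts to pairing each token with its successor. The sketch below does this with `zip` and then counts the pairs; the name `adjacent_pairs` and the toy sentence are made up, and a raw count is only a crude stand-in for the statistical scoring a real collocation finder applies.

```python
from collections import Counter


def adjacent_pairs(tokens):
    """Return the list of (token, next token) pairs."""
    return list(zip(tokens, tokens[1:]))


pairs = adjacent_pairs(["more", "is", "said", "than", "done"])
print(pairs)

# Counting pairs hints at which word combinations recur in a text.
text = "sperm whale and sperm whale or white whale".split()
most_common_pair = Counter(adjacent_pairs(text)).most_common(1)
print(most_common_pair)
```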

3.6 Functions defined for NLTK's frequency distributions

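| Example | Description |
|---|---|
| fdist = FreqDist(samples) | Create a frequency distribution containing the given samples |
| fdist.inc(sample) | Increment the count for this sample |
| fdist['monstrous'] | Count of the number of times a given sample occurred |
| fdist.freq('monstrous') | Frequency of a given sample |
| fdist.N() | Total number of samples |
| fdist.keys() | The samples sorted in order of decreasing frequency |
| for sample in fdist: | Iterate over the samples in order of decreasing frequency |
| fdist.max() | Sample with the greatest count |
| fdist.tabulate() | Tabulate the frequency distribution |
| fdist.plot() | Graphical plot of the frequency distribution |
| fdist.plot(cumulative=True) | Cumulative plot of the frequency distribution |
| fdist1 < fdist2 | Test whether the samples in fdist1 occur less frequently than in fdist2 |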
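To make the difference between a count, a frequency, and the total concrete, here is how the first few rows of the table map onto a plain `collections.Counter`. This is an analogy for illustration only, using made-up samples; it is not how `FreqDist` is implemented.

```python
from collections import Counter

fdist = Counter(["a", "b", "a", "c", "a", "b"])   # roughly FreqDist(samples)
fdist["d"] += 1                                   # roughly fdist.inc("d")

n = sum(fdist.values())                           # roughly fdist.N(): total number of samples
count_a = fdist["a"]                              # roughly fdist["a"]: raw count
freq_a = count_a / float(n)                       # roughly fdist.freq("a"): count divided by total
most_frequent = fdist.most_common(1)[0][0]        # roughly fdist.max(): sample with greatest count

print(n, count_a, round(freq_a, 3), most_frequent)
```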

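Finally, let's look at the class of text1. type() shows the type of a variable, and help() lists a class's attributes and methods. Whenever you need the details of a particular method later on, help() is a handy way to find them.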

>>> type(text1)
<class 'nltk.text.Text'>
>>> help(nltk.text.Text)
Help on class Text in nltk.text:

nltk.text.Text = class Text(__builtin__.object)
 |  A wrapper around a sequence of simple (string) tokens, which is
 |  intended to support initial exploration of texts (via the
 |  interactive console).  Its methods perform a variety of analyses
 |  on the text's contexts (e.g., counting, concordancing, collocation
 |  discovery), and display the results.  If you wish to write a
 |  program which makes use of these analyses, then you should bypass
 |  the ``Text`` class, and use the appropriate analysis function or
 |  class directly instead.
 |
 |  A ``Text`` is typically initialized from a given document or
 |  corpus.  E.g.:
 |
 |  >>> import nltk.corpus
 |  >>> from nltk.text import Text
 |  >>> moby = Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt'))
 |
 |  Methods defined here:
 |
 |  __getitem__(self, i)
 |
 |  __init__(self, tokens, name=None)
 |      Create a Text object.
 |
 |      :param tokens: The source text.
 |      :type tokens: sequence of str
 |
 |  __len__(self)
 |
 |  __repr__(self)
 |      :return: A string representation of this FreqDist.
 |      :rtype: string
 |
 |  collocations(self, num=20, window_size=2)
 |      Print collocations derived from the text, ignoring stopwords.
 |
 |      :seealso: find_collocations
 |      :param num: The maximum number of collocations to print.
 |      :type num: int
 |      :param window_size: The number of tokens spanned by a collocation (default=2)
 |      :type window_size: int
 |
 |  common_contexts(self, words, num=20)
 |      Find contexts where the specified words appear; list
 |      most frequent common contexts first.
 |
 |      :param word: The word used to seed the similarity search
 |      :type word: str
 |      :param num: The number of words to generate (default=20)
 |      :type num: int
 |      :seealso: ContextIndex.common_contexts()

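4. Techniques for language understanding

1. Word sense disambiguation

2. Pronoun (anaphora) resolution

3. Generating language output

4. Machine translation

5. Spoken dialogue systems

6. Textual entailment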

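5. Summary

Although this is my first contact with Python and NLTK, I already find both of them easy and convenient to use, and I plan to study them in more depth next.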