Natural Language Processing (1): NLTK and Python

2024-07-18 02:32:13, 226 views

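Preface: My current project is a search engine, which made me curious about natural language processing. I have also wanted to learn Python for a long time but never had the time or the opportunity. While looking for books on Amazon a few days ago, I came across "Python自然语言处理" (Natural Language Processing with Python), and it struck me right away as a good way to get started with natural language processing and Python at the same time. I will be working through the book over the coming weeks and writing up these notes as I go.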

1. NLTK Overview

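NLTK modules and what they do

Language processing task	NLTK modules	Functionality
Accessing corpora	nltk.corpus	Standardized interfaces to corpora and lexicons
String processing	nltk.tokenize, nltk.stem	Tokenizers, sentence tokenizers, stemmers
Collocation discovery	nltk.collocations	t-test, chi-squared, pointwise mutual information
Part-of-speech tagging	nltk.tag	n-gram, backoff, Brill, HMM, TnT
Classification	nltk.classify, nltk.cluster	Decision tree, maximum entropy, naive Bayes, EM, k-means
Chunking	nltk.chunk	Regular expression, n-gram, named entity
Parsing	nltk.parse	Chart, feature-based, unification, probabilistic, dependency
Semantic interpretation	nltk.sem, nltk.inference	Lambda calculus, first-order logic, model checking
Evaluation metrics	nltk.metrics	Precision, recall, agreement coefficients
Probability and estimation	nltk.probability	Frequency distributions, smoothed probability distributions
Applications	nltk.app, nltk.chat	Graphical concordancer, parsers, WordNet browser, chatbots
Linguistic fieldwork	nltk.toolbox	Manipulating data in SIL Toolbox format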

2. Installing NLTK

My Python version is 2.7.5 and my NLTK version is 2.0.4.

    DESCRIPTION
        The Natural Language Toolkit (NLTK) is an open source Python library
        for Natural Language Processing.  A free online book is available.
        (If you use the library for academic research, please cite the book.)

        Steven Bird, Ewan Klein, and Edward Loper (2009).
        Natural Language Processing with Python.  O'Reilly Media Inc.
        http://nltk.org/book

        @version: 2.0.4

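The installation steps are the same as those at http://www.nltk.org/install.html:

1. Install Setuptools: http://pypi.python.org/pypi/setuptools

   The download, setuptools-5.7.tar.gz, is at the very bottom of that page.

2. Install Pip: run sudo easy_install pip (this must be run with root privileges)

3. Install Numpy (optional): run sudo pip install -U numpy

4. Install NLTK: run sudo pip install -U nltk

5. Start python and enter the following commands: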

    192:chapter2 rcf$ python
    Python 2.7.5 (default, Mar  9 2014, 22:15:05)
    [GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import nltk
    >>> nltk.download()

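When the downloader window appears, use it to download nltk_data.

Alternatively, download the data packages directly from http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml and drag them into the Download Directory. That is what I did.

Finally, run the following command in Python; output like the following shows that the installation succeeded: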

    >>> from nltk.book import *
    *** Introductory Examples for the NLTK Book ***
    Loading text1, ..., text9 and sent1, ..., sent9
    Type the name of the text or sentence to view it.
    Type: 'texts()' or 'sents()' to list the materials.
    text1: Moby Dick by Herman Melville 1851
    text2: Sense and Sensibility by Jane Austen 1811
    text3: The Book of Genesis
    text4: Inaugural Address Corpus
    text5: Chat Corpus
    text6: Monty Python and the Holy Grail
    text7: Wall Street Journal
    text8: Personals Corpus
    text9: The Man Who Was Thursday by G . K . Chesterton 1908
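Before downloading any data, it can help to confirm which NLTK version Python actually sees. The snippet below is a minimal sketch of such a check: nltk_status is a hypothetical helper name of my own, and it relies only on the package's __version__ attribute.

```python
def nltk_status():
    """Return the installed NLTK version string, or None if NLTK cannot be imported."""
    try:
        import nltk
    except ImportError:
        return None
    return nltk.__version__

# nltk_status is a hypothetical helper, not part of NLTK itself.
version = nltk_status()
if version is None:
    print("NLTK is not installed; install it with pip first.")
else:
    print("NLTK version: " + version)
```

If this prints a version but from nltk.book import * still fails, the likely cause is that the nltk_data packages have not been downloaded yet.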

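3. First Steps with NLTK

Now to the main topic. Since I have never learned Python, using NLTK is also how I am learning Python. For a first pass I mainly use the sample data that ships with NLTK; as the output above shows, all of it is available from nltk.book.

3.1 Searching text

concordance: search text1 for every occurrence of "monstrous", shown with its surrounding context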

    >>> text1.concordance("monstrous")
    Building index...
    Displaying 11 of 11 matches:
    ong the former , one was of a most monstrous size . ... This came towards us ,
    ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
    ll over with a heathenish array of monstrous clubs and spears . Some were thick
    d as you gazed , and wondered what monstrous cannibal and savage could ever hav
    that has survived the flood ; most monstrous and most mountainous ! That Himmal
    they might scout at Moby Dick as a monstrous fable , or still worse and more de
    th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
    ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
    ere to enter upon those still more monstrous stories of them which are to be fo
    ght have been rummaged out of this monstrous cabinet there is no telling . But
    of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u
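To get a feel for what a concordance is, here is a minimal sketch in plain Python. It is not NLTK's implementation: the function name, the width parameter (context measured in tokens rather than characters), and the toy token list are all assumptions for illustration. Like the output above, it matches case-insensitively.

```python
def concordance(tokens, word, width=3):
    """Return (left context, match, right context) for each occurrence of `word`."""
    target = word.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

# Toy token list, made up for illustration (not taken from text1).
tokens = "a most monstrous size and a Monstrous fable of the whale".split()
matches = concordance(tokens, "monstrous")
for left, hit, right in matches:
    print(left + " [" + hit + "] " + right)
```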

similar: find the words in text1 that appear in contexts similar to those of "monstrous"

    >>> text1.similar("monstrous")
    Building word-context index...
    abundant candid careful christian contemptible curious delightfully
    determined doleful domineering exasperate fearless few gamesome
    horrible impalpable imperial lamentable lazy loving
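The idea behind similar can be sketched in a few lines: two words count as similar when they occur between the same neighbours. The code below is only an illustration of that idea under my own assumptions (a context is the pair of previous and next token, and words are ranked by the number of shared contexts); NLTK's actual scoring may differ, and similar_words is a hypothetical name.

```python
from collections import defaultdict

def similar_words(tokens, word):
    """Rank other words by how many (previous, next) contexts they share with `word`."""
    contexts = defaultdict(set)  # word -> set of (previous token, next token)
    for prev, current, nxt in zip(tokens, tokens[1:], tokens[2:]):
        contexts[current].add((prev, nxt))
    target = contexts.get(word, set())
    shared = {}
    for other, ctx in contexts.items():
        overlap = len(ctx & target)
        if other != word and overlap:
            shared[other] = overlap
    # Most shared contexts first, ties broken alphabetically.
    return sorted(shared, key=lambda w: (-shared[w], w))

# Toy token list, made up for illustration.
tokens = "a monstrous size . a curious size . a horrible size . the whale".split()
result = similar_words(tokens, "monstrous")
print(result)
```

Here "curious" and "horrible" are returned because, like "monstrous", they appear between "a" and "size".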

dispersion_plot: draw a dispersion plot showing where each word occurs in the text, that is, its offsets from the beginning

    >>> text4.dispersion_plot(["citizens","democracy","freedom","duties","America"])
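The data behind a dispersion plot is just the list of token offsets for each word. The sketch below computes those offsets in plain Python and prints a crude one-row-per-word text version of the plot; word_offsets and the toy token list are assumptions for illustration, not part of NLTK.

```python
def word_offsets(tokens, targets):
    """Map each target word to the token positions (offsets) where it occurs."""
    offsets = {target: [] for target in targets}
    for position, token in enumerate(tokens):
        if token in offsets:
            offsets[token].append(position)
    return offsets

# Toy token list, made up for illustration (not the inaugural address corpus).
tokens = "freedom for citizens and duties for citizens in America".split()
offsets = word_offsets(tokens, ["citizens", "freedom", "democracy"])

# Crude text version of the plot: one row per word, "|" marks an occurrence.
rows = {}
for word, positions in offsets.items():
    rows[word] = "".join("|" if i in positions else "." for i in range(len(tokens)))
    print(word.ljust(10) + " " + rows[word])
```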

3.2 Counting vocabulary

len: returns a length. It gives the number of tokens in a text and also the number of characters in a single word.

    >>> len(text1)        # number of tokens in text1
    260819
    >>> len(set(text1))   # number of distinct tokens in text1
    19317
    >>> len(text1[0])     # length of the first token in text1
    1

sorted: returns the items in sorted order

    >>> sent1
    ['Call', 'me', 'Ishmael', '.']
    >>> sorted(sent1)
    ['.', 'Call', 'Ishmael', 'me']
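The same three calls work on any Python list of strings, which makes it easy to try them without the book corpora. A small sketch on a made-up token list (the list itself is an assumption for illustration):

```python
# Toy token list, made up for illustration.
tokens = ["Call", "me", "Ishmael", ".", "Call", "me", "again", "."]

token_count = len(tokens)            # every token counts, repeats included
type_count = len(set(tokens))        # set() drops the repeats
first_length = len(tokens[0])        # len() of a string counts characters
ordered_types = sorted(set(tokens))  # punctuation, then capitalized, then lowercase

print(token_count, type_count, first_length)
print(ordered_types)
```

The sort order explains the output of sorted(sent1) above: strings are compared character by character, so "." sorts before uppercase letters, and uppercase letters sort before lowercase ones.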

3.3 Frequency distributions

nltk.probability.FreqDist

    >>> fdist1=FreqDist(text1)    # build the frequency distribution of text1
    >>> fdist1                    # text1 has 19317 distinct samples and 260819 outcomes in total
    <FreqDist with 19317 samples and 260819 outcomes>
    >>> keys=fdist1.keys()
    >>> keys[:50]                 # the 50 most frequent samples in text1
    [',', 'the', '.', 'of', 'and', 'a', 'to', ';', 'in', 'that', "'", '-', 'his', 'it', 'I', 's', 'is', 'he', 'with', 'was', 'as', '"', 'all', 'for', 'this', '!', 'at', 'by', 'but', 'not', '--', 'him', 'from', 'be', 'on', 'so', 'whale', 'one', 'you', 'had', 'have', 'there', 'But', 'or', 'were', 'now', 'which', '?', 'me', 'like']

    >>> fdist1.items()[:50]      # sample counts for text1: for example, ',' occurs 18713 times out of 260819 tokens in total
    [(',', 18713), ('the', 13721), ('.', 6862), ('of', 6536), ('and', 6024), ('a', 4569), ('to', 4542), (';', 4072), ('in', 3916), ('that', 2982), ("'", 2684), ('-', 2552), ('his', 2459), ('it', 2209), ('I', 2124), ('s', 1739), ('is', 1695), ('he', 1661), ('with', 1659), ('was', 1632), ('as', 1620), ('"', 1478), ('all', 1462), ('for', 1414), ('this', 1280), ('!', 1269), ('at', 1231), ('by', 1137), ('but', 1113), ('not', 1103), ('--', 1070), ('him', 1058), ('from', 1052), ('be', 1030), ('on', 1005), ('so', 918), ('whale', 906), ('one', 889), ('you', 841), ('had', 767), ('have', 760), ('there', 715), ('But', 705), ('or', 697), ('were', 680), ('now', 646), ('which', 640), ('?', 637), ('me', 627), ('like', 624)]

    >>> fdist1.hapaxes()[:50]   # samples that occur only once in text1
    ['!\'"', '!)"', '!*', '!--"', '"...', "',--", "';", '):', ');--', ',)', '--\'"', '---"', '---,', '."*', '."--', '.*--', '.--"', '100', '101', '102', '103', '104', '105', '106', '107', '108', '109', '11', '110', '111', '112', '113', '114', '115', '116', '117', '118', '119', '12', '120', '121', '122', '123', '124', '125', '126', '127', '128', '129', '130']
    >>> fdist1['!\'"']
    1

    >>> fdist1.plot(50,cumulative=True) # plot the cumulative frequency of the 50 most frequent samples in text1
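A frequency distribution is essentially a counter over tokens, so the calls above can be imitated with collections.Counter from the standard library. The sketch below is written for Python 3 and uses a made-up token list; it illustrates the idea and is not how NLTK 2.0.4 implements FreqDist. As far as I know, newer NLTK releases build FreqDist on top of Counter, so keys() is no longer sorted by frequency there and most_common() is the call to use instead.

```python
from collections import Counter
from itertools import accumulate

# Toy token list, made up for illustration.
tokens = "the whale , the sea , and the ship .".split()

fdist = Counter(tokens)                       # sample -> count
top = fdist.most_common(2)                    # the two most frequent samples
hapaxes = sorted(w for w, c in fdist.items() if c == 1)  # samples seen exactly once

# Running totals over samples in decreasing frequency: the numbers a
# cumulative frequency plot would draw.
cumulative = list(accumulate(count for _, count in fdist.most_common()))

print(top)
print(hapaxes)
print(cumulative)
```

The last cumulative value always equals the total number of tokens, which is why the cumulative plot flattens out as the rarer words are added.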

3.4 Fine-grained selection of words

    >>> long_words=[w for w in set(text1) if len(w) > 15]  # words in text1 longer than 15 characters
    >>> sorted(long_words)        # in alphabetical order
    ['CIRCUMNAVIGATION', 'Physiognomically', 'apprehensiveness', 'cannibalistically', 'characteristically', 'circumnavigating', 'circumnavigation', 'circumnavigations', 'comprehensiveness', 'hermaphroditical', 'indiscriminately', 'indispensableness', 'irresistibleness', 'physiognomically', 'preternaturalness', 'responsibilities', 'simultaneousness', 'subterraneousness', 'supernaturalness', 'superstitiousness', 'uncomfortableness', 'uncompromisedness', 'undiscriminating', 'uninterpenetratingly']
    >>> fdist1=FreqDist(text1)    # next: words longer than 7 characters that occur more than 7 times in text1, in alphabetical order (note that the candidates are drawn from set(text5))
    >>> sorted([w for w in set(text5) if len(w) > 7 and fdist1[w] > 7])
    ['American', 'actually', 'afternoon', 'anything', 'attention', 'beautiful', 'carefully', 'carrying', 'children', 'commanded', 'concerning', 'considered', 'considering', 'difference', 'different', 'distance', 'elsewhere', 'employed', 'entitled', 'especially', 'everything', 'excellent', 'experience', 'expression', 'floating', 'following', 'forgotten', 'gentlemen', 'gigantic', 'happened', 'horrible', 'important', 'impossible', 'included', 'individual', 'interesting', 'invisible', 'involved', 'monsters', 'mountain', 'occasional', 'opposite', 'original', 'originally', 'particular', 'pictures', 'pointing', 'position', 'possibly', 'probably', 'question', 'regularly', 'remember', 'revolving', 'shoulders', 'sleeping', 'something', 'sometimes', 'somewhere', 'speaking', 'specially', 'standing', 'starting', 'straight', 'stranger', 'superior', 'supposed', 'surprise', 'terrible', 'themselves', 'thinking', 'thoughts', 'together', 'understand', 'watching', 'whatever', 'whenever', 'wonderful', 'yesterday', 'yourself']
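Both selections are ordinary list comprehensions with a condition, so the pattern can be tried on any token list. Below is a minimal sketch on made-up data, with collections.Counter standing in for FreqDist; the thresholds are lowered so that the toy text produces a result.

```python
from collections import Counter

# Toy token list, made up for illustration.
tokens = "the whale and the whale and the circumnavigation of the whale".split()
fdist = Counter(tokens)  # stand-in for FreqDist(tokens)

# Select by length only.
long_words = sorted(w for w in set(tokens) if len(w) > 15)

# Select by length and frequency together. Vocabulary and counts come from
# the same token list here, which keeps the result easy to interpret.
frequent_long = sorted(w for w in set(tokens) if len(w) > 4 and fdist[w] > 2)

print(long_words)
print(frequent_long)
```

Iterating over set(tokens) rather than tokens means each word is tested once, so the result contains no duplicates.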

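3.5 Collocations and bigrams

bigrams() returns the bigrams (pairs of adjacent words) of a list of words, and collocations() prints word pairs that occur together unusually often: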

    >>> bigrams(['more','is','said','than','done'])
    [('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
    >>> text1.collocations()
    Building collocations list
    Sperm Whale; Moby Dick; White Whale; old man; Captain Ahab; sperm
    whale; Right Whale; Captain Peleg; New Bedford; Cape Horn; cried Ahab;
    years ago; lower jaw; never mind; Father Mapple; cried Stubb; chief
    mate; white whale; ivory leg; one hand
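Bigrams are simply adjacent pairs, which zip can produce directly. The sketch below rebuilds the bigrams example in plain Python and then counts pairs as a very crude stand-in for collocation finding. The function here is my own, not NLTK's, and the raw counts are an assumption for illustration: collocations() also filters out stopwords and scores the pairs statistically. In newer NLTK releases, as far as I know, nltk.bigrams returns a generator, so it needs to be wrapped in list() to get the output shown above.

```python
from collections import Counter

def bigrams(tokens):
    """Return the list of adjacent token pairs."""
    return list(zip(tokens, tokens[1:]))

pairs = bigrams(["more", "is", "said", "than", "done"])
print(pairs)

# Toy token list, made up for illustration.
tokens = "sperm whale and white whale or sperm whale".split()
pair_counts = Counter(bigrams(tokens))
most_frequent = pair_counts.most_common(1)
print(most_frequent)
```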

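3.6 Functions defined for NLTK frequency distributions

Example	Description
fdist=FreqDist(samples)	Create a frequency distribution containing the given samples
fdist.inc(sample)	Increment the count for this sample
fdist['monstrous']	Number of times the given sample occurred
fdist.freq('monstrous')	Frequency of the given sample
fdist.N()	Total number of samples
fdist.keys()	List of the samples, sorted in order of decreasing frequency
for sample in fdist:	Iterate over the samples in order of decreasing frequency
fdist.max()	Sample with the greatest count
fdist.tabulate()	Tabulate the frequency distribution
fdist.plot()	Plot the frequency distribution
fdist.plot(cumulative=True)	Plot the cumulative frequency distribution
fdist1<fdist2	Test whether the samples in fdist1 occur less frequently than in fdist2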
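To make the meaning of the main entries concrete, here is how each quantity can be computed with collections.Counter in Python 3. This is a sketch of the arithmetic only, on a made-up sample list; the variable names are mine, and the comments name the FreqDist call each line corresponds to.

```python
from collections import Counter

samples = ["a", "b", "a", "c", "a", "b"]   # toy data, made up for illustration
fdist = Counter(samples)                   # FreqDist(samples)

fdist["d"] += 1                            # fdist.inc('d')
count_a = fdist["a"]                       # fdist['a']: how often 'a' occurred
total = sum(fdist.values())                # fdist.N(): total number of samples
freq_a = fdist["a"] / total                # fdist.freq('a'): count divided by total
by_frequency = [s for s, _ in fdist.most_common()]  # fdist.keys() in NLTK 2: most frequent first
top_sample = by_frequency[0]               # fdist.max(): sample with the greatest count

print(count_a, total, freq_a)
print(by_frequency, top_sample)
```

Note that freq() is a proportion between 0 and 1, while indexing returns a raw count; the two are easy to mix up.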

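Finally, a look at the class of text1. type() shows the type of a variable, and help() lists the attributes and methods of a class. Whenever I need the details of a particular method later on, help() is the place to look; it is very handy.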

    >>> type(text1)
    <class 'nltk.text.Text'>
    >>> help('nltk.text.Text')
    Help on class Text in nltk.text:

    nltk.text.Text = class Text(__builtin__.object)
     |  A wrapper around a sequence of simple (string) tokens, which is
     |  intended to support initial exploration of texts (via the
     |  interactive console).  Its methods perform a variety of analyses
     |  on the text's contexts (e.g., counting, concordancing, collocation
     |  discovery), and display the results.  If you wish to write a
     |  program which makes use of these analyses, then you should bypass
     |  the ``Text`` class, and use the appropriate analysis function or
     |  class directly instead.
     |
     |  A ``Text`` is typically initialized from a given document or
     |  corpus.  E.g.:
     |
     |  >>> import nltk.corpus
     |  >>> from nltk.text import Text
     |  >>> moby = Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt'))
     |
     |  Methods defined here:
     |
     |  __getitem__(self, i)
     |
     |  __init__(self, tokens, name=None)
     |      Create a Text object.
     |
     |      :param tokens: The source text.
     |      :type tokens: sequence of str
     |
     |  __len__(self)
     |
     |  __repr__(self)
     |      :return: A string representation of this FreqDist.
     |      :rtype: string
     |
     |  collocations(self, num=20, window_size=2)
     |      Print collocations derived from the text, ignoring stopwords.
     |
     |      :seealso: find_collocations
     |      :param num: The maximum number of collocations to print.
     |      :type num: int
     |      :param window_size: The number of tokens spanned by a collocation (default=2)
     |      :type window_size: int
     |
     |  common_contexts(self, words, num=20)
     |      Find contexts where the specified words appear; list
     |      most frequent common contexts first.
     |
     |      :param word: The word used to seed the similarity search
     |      :type word: str
     |      :param num: The number of words to generate (default=20)
     |      :type num: int
     |      :seealso: ContextIndex.common_contexts()
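The same kind of inspection works on any Python object, not just NLTK classes. The sketch below applies it to a plain list so that it runs without NLTK; inspect.getdoc returns the docstring that help() would display, without opening the interactive pager. The variable names are mine.

```python
import inspect

sentence = ["Call", "me", "Ishmael", "."]

kind = type(sentence)                      # the class of the object
# Public attribute and method names, i.e. those without a leading underscore.
public_methods = [name for name in dir(sentence) if not name.startswith("_")]
sort_doc = inspect.getdoc(list.sort)       # docstring of a single method

print(kind)
print(public_methods)
print(sort_doc)
```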

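4. Techniques for language understanding

1. Word sense disambiguation

2. Anaphora resolution

3. Automatic language generation

4. Machine translation

5. Human-machine dialogue systems

6. Textual entailment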

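5. Summary

Although this is my first contact with Python and NLTK, I already find both of them convenient and easy to use, and I plan to study them in more depth next.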
