A First Try at the LDA Topic Model, Using Python's gensim Package

2024-10-18 04:16:02

http://blog.csdn.net/a_step_further/article/details/51176959

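LDA (Latent Dirichlet Allocation) is a topic model widely used in text mining. Given a large collection of documents, it extracts the keywords that best characterize each topic. For the underlying algorithm, see the related articles on KM. For a business requirement I needed to extract topics from the posts of several Tencent Weibo accounts, so I tried the algorithm out and put together a simple analysis with Python's gensim package.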

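Install jieba, the Python module for Chinese word segmentation.
Install gensim, the Python module for topic modeling of text (official site: https://radimrehurek.com/gensim/). This module depends on a long list of other packages, and they had to be installed patiently, one at a time.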
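Before running anything, it can help to check which of the needed modules are already importable. The following is a minimal sketch using only the standard library. The module list is an assumption (jieba and gensim, plus `numpy`, `scipy` and `smart_open`, which gensim commonly depends on; the exact set may differ between versions), and `missing_modules` is a hypothetical helper name, not part of either package.

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed list: jieba and gensim plus packages gensim commonly depends on.
REQUIRED = ["jieba", "gensim", "numpy", "scipy", "smart_open"]

if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Missing modules: %s" % ", ".join(missing))
    else:
        print("All required modules are importable.")
```

With a recent pip, `pip install jieba gensim` normally resolves these dependencies automatically, so the check is mainly useful on machines where packages are installed by hand.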

[python]

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import sys

import jieba
from gensim import corpora, models


def get_stop_words_set(file_name):
    # Load the stop words (one per line) into a set for fast lookup
    with open(file_name, 'r', encoding='utf-8') as file:
        return set(line.strip() for line in file)


def get_words_list(file_name, stop_word_file):
    stop_words_set = get_stop_words_set(stop_word_file)
    print("Loaded %d stop words" % len(stop_words_set))
    word_list = []
    with open(file_name, 'r', encoding='utf-8') as file:
        for line in file:
            tmp_list = list(jieba.cut(line.strip(), cut_all=False))
            # In Python 3 both the tokens and the stop words are str, so they compare directly.
            # (Under Python 2 each term was unicode and had to be converted with str() first,
            # otherwise the membership test was always false.)
            word_list.append([term for term in tmp_list if term not in stop_words_set])
    return word_list


if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("Usage: %s <raw_msg_file> <stop_word_file>" % sys.argv[0])
        sys.exit(1)
    raw_msg_file = sys.argv[1]
    stop_word_file = sys.argv[2]
    # A list whose elements are themselves lists: the tokens of each input line
    word_list = get_words_list(raw_msg_file, stop_word_file)
    # Build the dictionary, mapping each word to an integer id
    word_dict = corpora.Dictionary(word_list)
    # Count term frequencies and convert each document to a bag-of-words vector
    corpus_list = [word_dict.doc2bow(text) for text in word_list]
    lda = models.ldamodel.LdaModel(corpus=corpus_list, id2word=word_dict, num_topics=10, alpha='auto')
    output_file = './lda_output.txt'
    with open(output_file, 'w', encoding='utf-8') as f:
        for pattern in lda.show_topics():
            print(str(pattern), file=f)
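
To make the dictionary and bag-of-words steps in the script above more concrete, here is a small pure-Python sketch of the idea. It is not gensim's implementation: `build_vocab` and `to_bow` are hypothetical helpers, the toy documents are made up, and the ids gensim's `Dictionary` assigns may differ from the first-seen order used here.

```python
from collections import Counter


def build_vocab(docs):
    """Assign an integer id to every distinct token, in first-seen order."""
    vocab = {}
    for doc in docs:
        for token in doc:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def to_bow(doc, vocab):
    """Convert a token list to a sorted list of (token_id, count) pairs."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())


# Toy, already-tokenized documents (made up for illustration).
docs = [["topic", "model", "topic"], ["model", "text"]]
vocab = build_vocab(docs)
bows = [to_bow(doc, vocab) for doc in docs]
print(vocab)  # {'topic': 0, 'model': 1, 'text': 2}
print(bows)   # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Each document becomes a sparse list of (id, count) pairs, which is the shape of input the LDA model is trained on. Tokens missing from the vocabulary are simply skipped.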

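Each line written to `lda_output.txt` is one item returned by `lda.show_topics()`. As far as I know, recent gensim versions return `(topic_id, topic_string)` tuples whose string looks like `0.050*"word" + 0.030*"other"`; that format is an assumption here and older versions may differ. Under that assumption, a topic string can be split back into word/weight pairs as sketched below. `parse_topic` is a hypothetical helper and the sample string is made up, not real model output.

```python
def parse_topic(topic_str):
    """Split a string like '0.050*"a" + 0.030*"b"' into (word, weight) pairs."""
    pairs = []
    for term in topic_str.split(" + "):
        weight, word = term.split("*", 1)
        # Drop the optional double quotes around the word.
        pairs.append((word.strip().strip('"'), float(weight)))
    return pairs


# Made-up example in the assumed output format.
topic_id, topic_str = (0, '0.050*"weibo" + 0.030*"tencent"')
print(topic_id, parse_topic(topic_str))
# 0 [('weibo', 0.05), ('tencent', 0.03)]
```

gensim also provides `show_topic`, which in recent versions returns word/probability pairs directly, so parsing is only needed when working from the saved text file.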
Some further learning material: https://yq.aliyun.com/articles/26029 "[python] Introductory notes on LDA code for computing document topic distributions"
