Converting PDF to TXT with Python (images not processed)


2024-07-11 15:41:33, 221 views

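The previous article introduced a simple Python crawler that downloads documents from web pages. Most of the downloaded documents are doc or pdf files, though, and those formats put many limits on later data processing, so converting doc/pdf to txt becomes especially important. After going through a lot of material, I found that converting doc to txt on Linux is genuinely difficult, so I decided to start with pdf to txt.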

A senior labmate recommended PDFMiner. I tried it out and the results were indeed good, so I am sharing it here.

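About PDFMiner: "PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data." If you are interested, see the official site (reference 1) for details. The small tool pdf2txt.py that ships with PDFMiner converts a pdf to txt and, in my tests, still keeps the text layout of the original PDF. Very nice!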

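Reading the source of pdf2txt.py shows the concrete steps involved. To be able to process PDF files at scale later on, only the part that converts pdf to txt is extracted here. The code is as follows: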

# -*- coding: utf-8 -*-
#-----------------------------------------------------
#   Purpose:  convert a pdf to txt (images are not processed)
#   Author:   chenbjin
#   Date:     2014-07-11
#   Language: Python 2.7.6
#   Platform: Linux (Ubuntu)
#             PDFMiner20140328 (must be installed)
#   Usage:    python pdf2txt.py file.pdf
#-----------------------------------------------------
import sys
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

# main
def main(argv):
    # Output file name; only a single document is handled, so only argv[1] is used
    outfile = argv[1] + '.txt'
    args = [argv[1]]
    debug = 0
    pagenos = set()
    password = ''
    maxpages = 0
    rotation = 0
    codec = 'utf-8'   # output encoding
    caching = True
    imagewriter = None
    laparams = LAParams()
    #
    PDFResourceManager.debug = debug
    PDFPageInterpreter.debug = debug
    rsrcmgr = PDFResourceManager(caching=caching)
    outfp = open(outfile, 'w')
    # pdf conversion: the device writes the extracted text to outfp
    device = TextConverter(rsrcmgr, outfp, codec=codec, laparams=laparams,
                           imagewriter=imagewriter)
    for fname in args:
        fp = open(fname, 'rb')
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        # Process the content of every page in the document
        for page in PDFPage.get_pages(fp, pagenos,
                                      maxpages=maxpages, password=password,
                                      caching=caching, check_extractable=True):
            page.rotate = (page.rotate + rotation) % 360
            interpreter.process_page(page)
        fp.close()
    device.close()
    outfp.close()
    return

if __name__ == '__main__':
    main(sys.argv)
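
The script above handles one file per run. Since the stated goal is to process PDFs at scale later, here is a minimal sketch of how a batch driver might look. It is not part of pdf2txt.py or of PDFMiner: `find_pdfs` and `convert_all` are hypothetical helper names, and `convert` is an assumed callable that the caller supplies (for example, the body of `main` above refactored to take an input path and an output path). The sketch uses only the standard library and keeps the same output naming as the script (file.pdf becomes file.pdf.txt).

```python
import os


def find_pdfs(root):
    """Yield the path of every .pdf file under root, in a stable order."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subdirectories in a deterministic order
        for name in sorted(filenames):
            if name.lower().endswith('.pdf'):
                yield os.path.join(dirpath, name)


def convert_all(root, convert):
    """Call convert(pdf_path, txt_path) for each pdf found under root.

    `convert` is a hypothetical callable provided by the caller; this
    function only decides which files to process and where the text goes.
    Returns two lists: the txt paths written, and (pdf_path, error) pairs.
    """
    converted, failed = [], []
    for pdf_path in find_pdfs(root):
        txt_path = pdf_path + '.txt'  # same naming as the single-file script
        try:
            convert(pdf_path, txt_path)
            converted.append(txt_path)
        except Exception as exc:
            # One unreadable pdf should not abort the whole batch
            failed.append((pdf_path, str(exc)))
    return converted, failed
```

Collecting failures instead of raising is a design choice for large crawled collections, where some files are likely to be encrypted or damaged; the returned list can be inspected or retried afterwards.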

The next step is to try converting the images inside a pdf; for background, see http://denis.papathanasiou.org/2010/08/04/extracting-text-images-from-pdf-files/ .

References:

1. PDFMiner: http://www.unixuser.org/~euske/python/pdfminer/
