Processing PDFs in Java with PDFBox


2024-07-16 00:23:51, 222 views

At first I thought reading a PDF in Java would be as simple as reading a txt file. Too young, too simple! All I got was garbled text.

After searching online I found Apache PDFBox. Below is the code I used to read a PDF with PDFBox.

Download the jar: http://pdfbox.apache.org/downloads.html#recent

Code for creating a PDF and writing to it is covered on the official site: http://pdfbox.apache.org/cookbook/documentcreation.html

Copied over directly:

Create a blank PDF

This small sample shows how to create a new PDF document using PDFBox.

    // Create a new empty document
    PDDocument document = new PDDocument();

    // Create a new blank page and add it to the document
    PDPage blankPage = new PDPage();
    document.addPage( blankPage );

    // Save the newly created document
    document.save("BlankPage.pdf");

    // finally make sure that the document is properly
    // closed.
    document.close();

Hello World using a PDF base font

This small sample shows how to create a new document and print the text "Hello World" using one of the PDF base fonts.

    // Create a document and add a page to it
    PDDocument document = new PDDocument();
    PDPage page = new PDPage();
    document.addPage( page );

    // Create a new font object selecting one of the PDF base fonts
    PDFont font = PDType1Font.HELVETICA_BOLD;

    // Start a new content stream which will "hold" the to be created content
    PDPageContentStream contentStream = new PDPageContentStream(document, page);

    // Define a text content stream using the selected font, moving the cursor and drawing the text "Hello World"
    contentStream.beginText();
    contentStream.setFont( font, 12 );
    contentStream.moveTextPositionByAmount( 100, 700 ); // note these coordinates: (0,0) is the bottom-left corner of the page
    contentStream.drawString( "Hello World" );
    contentStream.endText();

    // Make sure that the content stream is closed:
    contentStream.close();

    // Save the results and ensure that the document is properly closed:
    document.save( "Hello World.pdf");
    document.close();
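The coordinate note in the code above is easy to trip over if you are used to screen coordinates, where y grows downward. Here is a minimal sketch of the conversion, assuming a US Letter page of 612 x 792 points (as far as I know, that is what new PDPage() creates by default). PageCoords and fromTop are names I made up for illustration; they are not part of PDFBox.

```java
final class PageCoords {
    // Assumption: US Letter page height in points (1 point = 1/72 inch)
    static final float LETTER_HEIGHT = 792f;

    // Converts a distance measured down from the top edge of the page
    // into a PDF y coordinate, which is measured up from the bottom edge.
    static float fromTop(float pageHeight, float distanceFromTop) {
        return pageHeight - distanceFromTop;
    }

    public static void main(String[] args) {
        // 92 points below the top edge of a Letter page
        System.out.println(fromTop(LETTER_HEIGHT, 92f)); // prints 700.0
    }
}
```

So the y value of 700 in the Hello World sample puts the text roughly 92 points below the top edge of such a page.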

Read PDF

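Below is what I put together myself based on code found online; the official site has no concrete example.
The whole process boils down to loading the Document (the PDF) and writing its text out to a TXT file through an IO stream.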

    package tools;

    import java.io.File;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.net.MalformedURLException;
    import java.net.URL;

    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.util.PDFTextStripper; // PDFBox 1.x package

    public class PDFHandler {
        public static void readPDF(String pdfFile) {
            String txtFile = null;
            PDDocument doc = null;
            FileWriter writer = null;
            URL url = null;
            try {
                url = new URL(pdfFile);
            } catch (MalformedURLException e) {
                // Not a valid URL, so treat the argument as a file system path
                url = null;
            }

            try {
                if (url != null) { // URL input
                    String fileName = url.getFile();
                    if (!fileName.endsWith(".pdf")) {
                        return; // check before loading so no document is left open
                    }
                    // Derive the output file name
                    txtFile = new File(fileName.replace(".pdf", ".txt")).getName();
                    doc = PDDocument.load(url); // load the document
                } else { // file system input
                    if (!pdfFile.endsWith(".pdf")) {
                        return;
                    }
                    txtFile = pdfFile.replace(".pdf", ".txt");
                    doc = PDDocument.load(pdfFile);
                }

                // FileWriter uses the platform default encoding
                writer = new FileWriter(txtFile);
                // PDFTextStripper does the actual PDF-to-text extraction
                PDFTextStripper textStripper = new PDFTextStripper();
                // false (the default) keeps the order of the content stream;
                // true sorts text by its position on the page, which costs extra work
                textStripper.setSortByPosition(false);
                textStripper.setStartPage(1); // first page to extract, defaults to page 1
                textStripper.setEndPage(2);   // last page to extract, defaults to the last page
                textStripper.writeText(doc, writer); // the key step: write the text to the txt file
            } catch (IOException e) {
                // Also covers FileNotFoundException, which is a subclass
                e.printStackTrace();
            } finally {
                if (doc != null) {
                    try {
                        doc.close();
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
                if (writer != null) {
                    try {
                        writer.close();
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
            }
        }

        public static void main(String[] args) {
            readPDF("resource/正则表达式.pdf");
        }
    }
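One thing to watch in the code above: replace(".pdf", ".txt") replaces every occurrence of ".pdf" in the path, not only the extension, and endsWith(".pdf") is case-sensitive, so "REPORT.PDF" gets skipped. Below is a small sketch of a stricter way to derive the output name using only the JDK. TxtNames and toTxtName are hypothetical names for illustration, not part of PDFBox or of the code above.

```java
import java.util.Locale;

final class TxtNames {
    // Returns the name with a trailing ".pdf" (any letter case) swapped for ".txt",
    // or null when the name does not end in ".pdf".
    static String toTxtName(String pdfName) {
        String lower = pdfName.toLowerCase(Locale.ROOT);
        if (!lower.endsWith(".pdf")) {
            return null;
        }
        // Cut off only the last four characters, leaving the rest of the path untouched
        return pdfName.substring(0, pdfName.length() - ".pdf".length()) + ".txt";
    }

    public static void main(String[] args) {
        System.out.println(toTxtName("resource/my.pdf.backup.pdf")); // prints resource/my.pdf.backup.txt
        System.out.println(toTxtName("notes.docx"));                 // prints null
    }
}
```

With the plain replace call, the first name would have become "resource/my.txt.backup.txt" instead.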

The need to process PDFs originally came up while I was learning Lucene, but then I spotted this on the official site:

Lucene Integration

Document luceneDocument = LucenePDFDocument.getDocument( ... );

Oh well!
