Processing PDFs in Java with PDFBox
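At first I thought reading a PDF in Java would be as simple as reading a txt file. Too young, too simple! All I got was garbled text.
After some searching online I found Apache PDFBox. Below is the code for reading a PDF with PDFBox.
Download the jar: http://pdfbox.apache.org/downloads.html#recent
Code for creating a PDF and writing to it is covered on the official site: http://pdfbox.apache.org/cookbook/documentcreation.html
I copied it here directly: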
Create a blank PDF
This small sample shows how to create a new PDF document using PDFBox.
// Create a new empty document
PDDocument document = new PDDocument();

// Create a new blank page and add it to the document
PDPage blankPage = new PDPage();
document.addPage( blankPage );

// Save the newly created document
document.save("BlankPage.pdf");

// finally make sure that the document is properly
// closed.
document.close();
Hello World using a PDF base font
This small sample shows how to create a new document and print the text "Hello World" using one of the PDF base fonts.
// Create a document and add a page to it
PDDocument document = new PDDocument();
PDPage page = new PDPage();
document.addPage( page );

// Create a new font object selecting one of the PDF base fonts
PDFont font = PDType1Font.HELVETICA_BOLD;

// Start a new content stream which will "hold" the to be created content
PDPageContentStream contentStream = new PDPageContentStream(document, page);

// Define a text content stream using the selected font, moving the cursor and drawing the text "Hello World"
contentStream.beginText();
contentStream.setFont( font, 12 );
// Note the coordinates: (0,0) is the lower-left corner of the page
contentStream.moveTextPositionByAmount( 100, 700 );
contentStream.drawString( "Hello World" );
contentStream.endText();

// Make sure that the content stream is closed:
contentStream.close();

// Save the results and ensure that the document is properly closed:
document.save( "Hello World.pdf");
document.close();
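Because the origin is the lower-left corner, a y value of 700 sits near the top of the page, not the bottom. If you are used to thinking in "distance from the top", the conversion can be sketched as below. This is only an illustration: the class PdfCoords and the method yFromTop are hypothetical names, not part of PDFBox, and the 792-point height assumes a US Letter page (612 x 792 points).

```java
/** Hypothetical helper (not part of PDFBox) for lower-left-origin page coordinates. */
final class PdfCoords {
    // Assumption: US Letter page height in points (1 point = 1/72 inch)
    static final float LETTER_HEIGHT_PT = 792f;

    /** Returns the PDF y coordinate for a point that lies distanceFromTop below the top edge. */
    static float yFromTop(float pageHeight, float distanceFromTop) {
        return pageHeight - distanceFromTop;
    }

    public static void main(String[] args) {
        // A baseline 92 points below the top edge of a Letter page
        System.out.println(yFromTop(LETTER_HEIGHT_PT, 92f)); // prints 700.0
    }
}
```

The result could then be passed as the second argument of moveTextPositionByAmount in the sample above.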
Read PDF
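The code below is my own attempt, based on code I found online; the official site has no concrete example for this.
The whole process boils down to loading the Document (the PDF) and writing its text to a TXT file through an IO stream.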
package tools;

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper; // PDFBox 1.8.x package; 2.x moved it to org.apache.pdfbox.text

public class PDFHandler {
    public static void readPDF(String pdfFile) {
        String txtFile = null;
        PDDocument doc = null;
        FileWriter writer = null;
        URL url = null;
        try {
            url = new URL(pdfFile);
        } catch (MalformedURLException e) {
            // Not a valid URL, so treat the argument as a file system path
            url = null;
        }

        try {
            // Work out the output name before loading, so an early return leaves nothing open
            if (url != null) { // URL handling
                String fileName = url.getFile();
                if (!fileName.endsWith(".pdf")) {
                    return;
                }
                File outFile = new File(fileName.replace(".pdf", ".txt"));
                txtFile = outFile.getName();
                doc = PDDocument.load(url); // load the document
            } else { // file system handling
                if (!pdfFile.endsWith(".pdf")) {
                    return;
                }
                txtFile = pdfFile.replace(".pdf", ".txt");
                doc = PDDocument.load(pdfFile); // load the document
            }

            writer = new FileWriter(txtFile);
            // PDFTextStripper does the actual PDF-to-text extraction
            PDFTextStripper textStripper = new PDFTextStripper();
            // false (the default) keeps the text in content-stream order instead of
            // sorting it by position on the page
            textStripper.setSortByPosition(false);
            textStripper.setStartPage(1); // first page to extract; defaults to the first page
            textStripper.setEndPage(2);   // last page to extract; defaults to the last page
            textStripper.writeText(doc, writer); // the key step: write the text to the txt file
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (doc != null) {
                try {
                    doc.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            if (writer != null) {
                try {
                    writer.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }

    public static void main(String[] args) {
        readPDF("resource/正则表达式.pdf");
    }
}
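One detail worth a closer look is how the output file name is derived. The code above uses String.replace, which rewrites every ".pdf" in the path, including one inside a directory name. A stricter variant might swap only the trailing extension, roughly as follows. This is a sketch under my own assumptions: TxtNames and toTxtName are hypothetical names, and accepting an upper-case ".PDF" is my choice, not something the original code does.

```java
import java.util.Locale;

/** Hypothetical helper (not part of PDFBox) that derives the .txt name for a .pdf path. */
final class TxtNames {
    /** Returns the path with its trailing ".pdf" swapped for ".txt", or null if it is not a .pdf path. */
    static String toTxtName(String pdfPath) {
        if (pdfPath == null || !pdfPath.toLowerCase(Locale.ROOT).endsWith(".pdf")) {
            return null; // caller should skip anything that is not a PDF
        }
        // Cut off exactly the last four characters (".pdf") and leave the rest untouched
        return pdfPath.substring(0, pdfPath.length() - ".pdf".length()) + ".txt";
    }

    public static void main(String[] args) {
        // Only the final extension changes, even though the directory also contains ".pdf"
        System.out.println(toTxtName("docs.pdf/report.pdf"));
    }
}
```

Returning null for non-PDF input mirrors the early return in readPDF, so the caller can bail out before loading anything.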
The need to process PDFs originally came up while I was learning Lucene, but then I noticed this on the official site:
Lucene Integration
Document luceneDocument = LucenePDFDocument.getDocument( ... );
Oh well!