Creating an index for Lucene search: reading txt/Word/Excel/PowerPoint/PDF content with POI
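Before documents can be searched with Lucene, an index has to be created for each of them. Creating the index means reading the document's information (name, upload time, content, and so on), tokenizing it, and writing the result into the index files. This post summarizes the first of those steps: reading the content of each type of document.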
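The three stages named above (read the fields, tokenize, write to the index) can be sketched with a toy in-memory inverted index. This is not Lucene's API: `ToyIndex`, `Tokenize`, `Add`, and `Search` are hypothetical names chosen for illustration, the tokenizer only splits on non-alphanumeric characters (which would not segment Chinese text properly), and nothing is written to disk.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Toy in-memory inverted index; illustrative only, not Lucene's API.
public class ToyIndex
{
    // term -> names of the documents that contain it
    private readonly Dictionary<string, SortedSet<string>> postings =
        new Dictionary<string, SortedSet<string>>(StringComparer.OrdinalIgnoreCase);

    // Naive tokenizer: every run of letters or digits is one term.
    public static IEnumerable<string> Tokenize(string text)
    {
        StringBuilder token = new StringBuilder();
        foreach (char c in text ?? string.Empty)
        {
            if (char.IsLetterOrDigit(c))
            {
                token.Append(c);
            }
            else if (token.Length > 0)
            {
                yield return token.ToString();
                token.Clear();
            }
        }
        if (token.Length > 0) yield return token.ToString();
    }

    // "Index" a document: record its name under every term in its content.
    public void Add(string documentName, string content)
    {
        foreach (string term in Tokenize(content))
        {
            SortedSet<string> docs;
            if (!postings.TryGetValue(term, out docs))
            {
                docs = new SortedSet<string>(StringComparer.Ordinal);
                postings[term] = docs;
            }
            docs.Add(documentName);
        }
    }

    // Look up the documents containing a single term (case-insensitive).
    public List<string> Search(string term)
    {
        SortedSet<string> docs;
        return postings.TryGetValue(term, out docs) ? docs.ToList() : new List<string>();
    }
}
```

In this sketch the `content` argument of `Add` is exactly what the reader methods in the rest of the post produce; a real index would also store fields such as the upload time.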
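I. A small tool I wrote earlier also had to read Word and Excel content, and it did so through COM components: add a reference to the COM libraries, import the namespaces (using Microsoft.Office.Interop.Word; using Microsoft.Office.Interop.Excel;), and then read the files as follows:
Reading Word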
public string readWORD(object filepath)
{
    // Documents.Open takes its arguments by ref as object, so the path must be boxed
    object filename = Convert.ToString(filepath);
    object isReadOnly = true;
    object missing = System.Reflection.Missing.Value;
    object saveChanges = WdSaveOptions.wdDoNotSaveChanges;

    Microsoft.Office.Interop.Word.Application wordapp = new Microsoft.Office.Interop.Word.Application();
    try
    {
        // Arguments: FileName, ConfirmConversions, ReadOnly
        Microsoft.Office.Interop.Word._Document doc = wordapp.Documents.Open(ref filename, ref missing, ref isReadOnly);
        string content = doc.Content.Text;
        doc.Close(ref saveChanges, ref missing, ref missing);
        return content;
    }
    finally
    {
        // Quit even if opening or reading fails, so no Word process is left running
        wordapp.Quit(ref saveChanges, ref missing, ref missing);
    }
}
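Reading Excel
Reading Excel through COM means starting the Excel application, opening the workbook, getting the worksheet name, and then reading the cells one by one. It is tedious, so the code is omitted.
Alternatively, the Excel file can be read through OleDB, which treats the workbook as a database and returns its content as a DataTable: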
public DataSet ExcelToDS(string Path)
{
    string strConn = "Provider=Microsoft.Jet.OLEDB.4.0;" + "Data Source=" + Path + ";" + "Extended Properties=Excel 8.0;";
    string strExcel = "select * from [sheet1$]";
    DataSet ds = new DataSet();
    // Reuse one connection for the adapter and dispose of it when done
    using (OleDbConnection conn = new OleDbConnection(strConn))
    using (OleDbDataAdapter myCommand = new OleDbDataAdapter(strExcel, conn))
    {
        conn.Open();
        myCommand.Fill(ds, "table1");
    }
    return ds;
}

If the sheet name ([sheet1$]) is not fixed, it can be looked up as follows:

string strConn = "Provider=Microsoft.Jet.OLEDB.4.0;" + "Data Source=" + Path + ";" + "Extended Properties=Excel 8.0;";
string tableName;
using (OleDbConnection conn = new OleDbConnection(strConn))
{
    conn.Open(); // the schema can only be queried on an open connection
    DataTable schemaTable = conn.GetOleDbSchemaTable(System.Data.OleDb.OleDbSchemaGuid.Tables, null);
    tableName = schemaTable.Rows[0][2].ToString().Trim(); // column 2 is TABLE_NAME
}
Reading PowerPoint
public string readPPT(object filepath)
{
    string file = filepath.ToString();
    Microsoft.Office.Interop.PowerPoint.Application pa = new Microsoft.Office.Interop.PowerPoint.Application();
    // Arguments: FileName, ReadOnly, Untitled, WithWindow
    Microsoft.Office.Interop.PowerPoint.Presentation pp = pa.Presentations.Open(file,
        Microsoft.Office.Core.MsoTriState.msoTrue,
        Microsoft.Office.Core.MsoTriState.msoFalse,
        Microsoft.Office.Core.MsoTriState.msoFalse);
    string content = "";
    foreach (Microsoft.Office.Interop.PowerPoint.Slide slide in pp.Slides)
    {
        foreach (Microsoft.Office.Interop.PowerPoint.Shape shape in slide.Shapes)
        {
            // Pictures, lines, and similar shapes have no text frame; skip them
            if (shape.HasTextFrame == Microsoft.Office.Core.MsoTriState.msoTrue)
                content += shape.TextFrame.TextRange.Text;
        }
    }
    // Close the presentation before quitting the application, not after
    pp.Close();
    pa.Quit();
    return content;
}
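Reading through COM is slow, whereas index creation only needs the document text and needs it quickly. COM is therefore a poor fit for building an index. POI provides classes for each document type; you reference only the ones you need, the code is simple, and, most importantly, the text is extracted quickly.
II. Using POI
(1) First download the POI package, either through the "Manage NuGet Packages" tool in the solution or from the Apache website.
(2) The following code reads the content of each document type (txt, Word, Excel, PowerPoint, and PDF).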
/// <summary>
/// Reads the content of a document, choosing the reader by file extension.
/// </summary>
/// <param name="filepath">Path of the document</param>
/// <param name="filename">Name of the document</param>
/// <returns>The extracted text, or null for an unsupported extension</returns>
public string textToreader(string filepath, object filename)
{
    string content = null;
    FileInfo file = new FileInfo(filename.ToString());
    switch (file.Extension.ToLower())
    {
        case ".txt": content = readTXT(filepath); break;
        case ".doc": content = readWORD(filepath); break;
        case ".xls": content = readEXCEL(filepath); break;
        case ".pdf": content = readPDF(filepath); break;
        case ".ppt": content = readPPT(filepath); break;
    }
    return content;
}

/// <summary>
/// Reads a txt file.
/// </summary>
public string readTXT(string filepath)
{
    // The file is assumed to be GB2312-encoded; dispose of the reader to release the file handle
    using (StreamReader st = new StreamReader(filepath, Encoding.GetEncoding("gb2312")))
    {
        return st.ReadToEnd();
    }
}

/// <summary>
/// Reads a Word 2003 (.doc) file.
/// </summary>
public string readWORD(string filepath)
{
    FileInputStream fs = new FileInputStream(filepath);
    try
    {
        HWPFDocument doc = new HWPFDocument(fs);
        return doc.getDocumentText();
    }
    finally
    {
        fs.close();
    }
}

/// <summary>
/// Reads an Excel 2003 (.xls) file.
/// </summary>
public string readEXCEL(object filepath)
{
    string filename = filepath.ToString();
    // Open as a read-only stream, allowing other processes to keep the file open
    using (FileStream fs = new FileStream(filename, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
    {
        POIFSFileSystem ps = new POIFSFileSystem(fs);
        HSSFWorkbook hwb = new HSSFWorkbook(ps);
        ExcelExtractor extractor = new ExcelExtractor(hwb);
        extractor.FormulasNotResults = true; // output formula text rather than computed values
        extractor.IncludeSheetNames = true;  // prefix each sheet's text with its name
        return extractor.Text;
    }
}

/// <summary>
/// Reads a PDF file. PDDocument and PDFTextStripper come from Apache PDFBox, not POI.
/// </summary>
public string readPDF(string filepath)
{
    PDDocument doc = PDDocument.load(filepath);
    try
    {
        PDFTextStripper pdfStripper = new PDFTextStripper();
        return pdfStripper.getText(doc);
    }
    finally
    {
        doc.close();
    }
}

/// <summary>
/// Reads a PowerPoint 2003 (.ppt) file.
/// </summary>
public string readPPT(string filepath)
{
    FileInputStream fs = new FileInputStream(filepath);
    try
    {
        SlideShow ss = new SlideShow(new HSLFSlideShow(fs));
        Slide[] slides = ss.getSlides(); // every slide in the presentation
        StringBuilder content = new StringBuilder();
        for (int i = 0; i < slides.Length; i++)
        {
            TextRun[] t = slides[i].getTextRuns(); // the text runs hold the slide's text
            for (int j = 0; j < t.Length; j++)
            {
                content.Append(t[j].getText());
            }
        }
        return content.ToString();
    }
    finally
    {
        fs.close();
    }
}
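The extension switch above can also be written as a lookup table, which makes it easier to plug in readers for further formats later. The following is a minimal sketch using only the .NET base class library: `ReaderRegistry`, `Register`, `Read`, and `ReadText` are hypothetical names, not part of POI. The text reader here honors a byte-order mark and otherwise falls back to UTF-8, which is an assumption that differs from the GB2312 default used above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper: maps file extensions to reader functions.
public static class ReaderRegistry
{
    // Extension (with leading dot, case-insensitive) -> function returning the file's text
    private static readonly Dictionary<string, Func<string, string>> readers =
        new Dictionary<string, Func<string, string>>(StringComparer.OrdinalIgnoreCase)
        {
            { ".txt", ReadText },
        };

    // Add or replace the reader for an extension such as ".doc" or ".pdf".
    public static void Register(string extension, Func<string, string> reader)
    {
        readers[extension] = reader;
    }

    // Returns the extracted text, or null when no reader is registered for the extension.
    public static string Read(string filepath)
    {
        string extension = Path.GetExtension(filepath ?? string.Empty);
        Func<string, string> reader;
        return readers.TryGetValue(extension, out reader) ? reader(filepath) : null;
    }

    // Plain-text reader: uses the BOM if there is one, otherwise decodes as UTF-8 (assumption).
    public static string ReadText(string filepath)
    {
        using (StreamReader reader = new StreamReader(filepath, Encoding.UTF8, true))
        {
            return reader.ReadToEnd();
        }
    }
}
```

The POI-based methods could then be attached with calls such as `ReaderRegistry.Register(".doc", readWORD)`, and an unregistered extension yields null, the same result the switch gives for an unknown extension.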
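Note: the code above reads Office 2003 files. Other Office versions are read through different POI APIs:
Excel files: the .xls format corresponds to the POI API HSSF; .xlsx is the Office 2007 file format and corresponds to XSSF.
Word files: the .doc format corresponds to the POI API HWPF; .docx corresponds to XWPF.
PowerPoint files: the .ppt format corresponds to the POI API HSLF; .pptx corresponds to XSLF.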
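The mapping in this note can be kept as data, for example to log which POI component a file would need before choosing a reader. This is a small sketch: `PoiComponents` and `ForFile` are hypothetical names, and the table contains only the six formats listed above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical lookup: file extension -> name of the POI component that handles it.
public static class PoiComponents
{
    private static readonly Dictionary<string, string> byExtension =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { ".xls",  "HSSF" }, { ".xlsx", "XSSF" },
            { ".doc",  "HWPF" }, { ".docx", "XWPF" },
            { ".ppt",  "HSLF" }, { ".pptx", "XSLF" },
        };

    // Returns the component name, or null for formats outside the table (txt, pdf, ...).
    public static string ForFile(string fileName)
    {
        string extension = Path.GetExtension(fileName ?? string.Empty);
        string component;
        return byExtension.TryGetValue(extension, out component) ? component : null;
    }
}
```

For instance, `PoiComponents.ForFile("report.DOCX")` returns "XWPF", a format that the `textToreader` switch shown earlier does not yet handle.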