Lucene Search: Creating an Index by Reading txt/Word/Excel/PPT/PDF Content with POI


2024-07-25 05:15:20, 225 views

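Before Lucene can search documents, an index must be created for them. Creating the index means reading each document's information (document name, upload time, document content, and so on), then tokenizing it and writing it to the index files. This post summarizes the step of reading the content of each document type.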
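To make the "read, tokenize, write to the index" sequence concrete, here is a minimal in-memory sketch of an inverted index. It is not Lucene's API: the `ToyIndex` class, its `Tokenize` and `AddDocument` methods, and the letter-or-digit splitting rule are all assumptions made for illustration. Real Chinese text needs a proper analyzer, because this naive tokenizer treats an unbroken run of CJK characters as a single term.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Toy inverted index: maps each term to the set of documents containing it.
// Illustrative only; Lucene's analyzers and on-disk index are far richer.
public static class ToyIndex
{
    // Hypothetical tokenizer: lowercases the text and splits on any
    // character that is not a letter or a digit.
    public static List<string> Tokenize(string text)
    {
        var tokens = new List<string>();
        var current = new StringBuilder();
        foreach (char c in text.ToLowerInvariant())
        {
            if (char.IsLetterOrDigit(c))
            {
                current.Append(c);
            }
            else if (current.Length > 0)
            {
                tokens.Add(current.ToString());
                current.Clear();
            }
        }
        if (current.Length > 0)
        {
            tokens.Add(current.ToString());
        }
        return tokens;
    }

    // Adds every term of the document content to the index under docName.
    public static void AddDocument(
        Dictionary<string, SortedSet<string>> index, string docName, string content)
    {
        foreach (string term in Tokenize(content))
        {
            if (!index.TryGetValue(term, out SortedSet<string> docs))
            {
                docs = new SortedSet<string>();
                index[term] = docs;
            }
            docs.Add(docName);
        }
    }
}
```

The content string passed to `AddDocument` is exactly what the reader functions in the rest of this post are meant to produce.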

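1. An earlier small tool of mine also read Word and Excel content, using COM components. That approach imports the COM libraries and their namespaces (using Microsoft.Office.Interop.Word; using Microsoft.Office.Interop.Excel;) and reads the files as follows: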

Reading Word

    // Requires a reference to Microsoft.Office.Interop.Word
    public string readWORD(object filepath)
    {
        // Documents.Open takes its arguments as ref object, so the file name is boxed
        object filename = Convert.ToString(filepath);
        object isreadonly = true;
        object missing = Type.Missing;
        object saveChanges = WdSaveOptions.wdDoNotSaveChanges;

        Microsoft.Office.Interop.Word.Application wordapp = new Microsoft.Office.Interop.Word.Application();
        Microsoft.Office.Interop.Word._Document doc = wordapp.Documents.Open(ref filename, ref missing, ref isreadonly);
        string content = doc.Content.Text;

        // Close the document, then quit Word, without saving changes
        doc.Close(ref saveChanges, ref missing, ref missing);
        wordapp.Quit(ref saveChanges, ref missing, ref missing);
        wordapp = null;
        return content;
    }


Reading Excel

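Reading Excel through COM means starting the Excel application, opening the workbook, getting the worksheet names, and then reading the cell contents. It is fairly tedious, so the code is omitted.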

Alternatively, an Excel file can be read through OleDB, treating the workbook as a database and returning its content as a DataTable:

    // Requires: using System.Data; using System.Data.OleDb;
    public DataSet ExcelToDS(string Path)
    {
        string strConn = "Provider=Microsoft.Jet.OLEDB.4.0;" + "Data Source=" + Path + ";" + "Extended Properties=Excel 8.0;";
        string strExcel = "select * from [sheet1$]";
        DataSet ds = new DataSet();
        using (OleDbConnection conn = new OleDbConnection(strConn))
        {
            conn.Open();
            OleDbDataAdapter myCommand = new OleDbDataAdapter(strExcel, conn);
            myCommand.Fill(ds, "table1");
        }   // the connection is closed here
        return ds;
    }

If the sheet name ([sheet1$]) in the Excel file is not fixed, it can be obtained as follows:

    string strConn = "Provider=Microsoft.Jet.OLEDB.4.0;" + "Data Source=" + Path + ";" + "Extended Properties=Excel 8.0;";
    string tableName;
    using (OleDbConnection conn = new OleDbConnection(strConn))
    {
        conn.Open();   // the schema can only be queried on an open connection
        DataTable schemaTable = conn.GetOleDbSchemaTable(System.Data.OleDb.OleDbSchemaGuid.Tables, null);
        tableName = schemaTable.Rows[0][2].ToString().Trim();   // column 2 holds the table (sheet) name
    }


Reading PowerPoint

    // Requires references to Microsoft.Office.Interop.PowerPoint and Microsoft.Office.Core
    public string readPPT(object filepath)
    {
        string file = filepath.ToString();
        Microsoft.Office.Interop.PowerPoint.Application pa = new Microsoft.Office.Interop.PowerPoint.Application();
        // Open read-only, titled, without showing a window
        Microsoft.Office.Interop.PowerPoint.Presentation pp = pa.Presentations.Open(file, Microsoft.Office.Core.MsoTriState.msoTrue, Microsoft.Office.Core.MsoTriState.msoFalse, Microsoft.Office.Core.MsoTriState.msoFalse);
        string content = "";
        foreach (Microsoft.Office.Interop.PowerPoint.Slide slide in pp.Slides)
        {
            foreach (Microsoft.Office.Interop.PowerPoint.Shape shape in slide.Shapes)
            {
                // Pictures and similar shapes have no text frame; skip them
                if (shape.HasTextFrame == Microsoft.Office.Core.MsoTriState.msoTrue)
                    content += shape.TextFrame.TextRange.Text;
            }
        }
        pp.Close();   // close the presentation before quitting PowerPoint
        pa.Quit();
        pa = null;
        return content;
    }


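Reading through COM is very inefficient, while indexing only needs the document content and needs it quickly. COM is therefore a poor fit for building an index. POI ships classes for each document type; you add only the ones you need, the code is simple, and, more importantly, it retrieves document content quickly.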

2. Using POI

(1) First download the POI package, either through the "Manage NuGet Packages" tool in the solution or from the Apache website.

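(2) The following code reads the content of each document type (txt, Word, Excel, PPT, PDF). Note that the PDDocument and PDFTextStripper classes used for PDF come from Apache PDFBox rather than POI itself.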

    /// <summary>
    /// Reads the content of a document, dispatching on its file extension
    /// </summary>
    /// <param name="filepath">Document path</param>
    /// <param name="filename">Document name</param>
    /// <returns>The document text, or null for an unsupported extension</returns>
    public string textToreader(string filepath, object filename)
    {
        string content = null;
        FileInfo file = new FileInfo(filename.ToString());
        switch (file.Extension.ToLower())
        {
            case ".txt":
                content = readTXT(filepath);
                break;
            case ".doc":
                content = readWORD(filepath);
                break;
            case ".xls":
                content = readEXCEL(filepath);
                break;
            case ".pdf":
                content = readPDF(filepath);
                break;
            case ".ppt":
                content = readPPT(filepath);
                break;
        }
        return content;
    }

    /// <summary>
    /// Reads a txt file (assumed to be gb2312-encoded)
    /// </summary>
    public string readTXT(string filepath)
    {
        using (StreamReader st = new StreamReader(filepath, Encoding.GetEncoding("gb2312")))
        {
            return st.ReadToEnd();
        }
    }

    /// <summary>
    /// Reads a Word 2003 (.doc) file
    /// </summary>
    public string readWORD(string filepath)
    {
        FileInputStream fs = new FileInputStream(filepath);
        try
        {
            HWPFDocument doc = new HWPFDocument(fs);
            return doc.getDocumentText();
        }
        finally
        {
            fs.close();
        }
    }

    /// <summary>
    /// Reads an Excel 2003 (.xls) file
    /// </summary>
    public string readEXCEL(object filepath)
    {
        string filename = filepath.ToString();
        // Open the file as a read-only stream
        using (FileStream fs = new FileStream(filename, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
        {
            POIFSFileSystem ps = new POIFSFileSystem(fs);
            HSSFWorkbook hwb = new HSSFWorkbook(ps);
            ExcelExtractor extractor = new ExcelExtractor(hwb);
            extractor.FormulasNotResults = true;   // emit formulas rather than their computed results
            extractor.IncludeSheetNames = true;
            return extractor.Text;
        }
    }

    /// <summary>
    /// Reads a PDF file
    /// </summary>
    public string readPDF(string filepath)
    {
        PDDocument doc = PDDocument.load(filepath);
        try
        {
            PDFTextStripper pdfStripper = new PDFTextStripper();
            return pdfStripper.getText(doc);
        }
        finally
        {
            doc.close();
        }
    }

    /// <summary>
    /// Reads a PowerPoint 2003 (.ppt) file
    /// </summary>
    public string readPPT(string filepath)
    {
        FileInputStream fs = new FileInputStream(filepath);
        try
        {
            SlideShow ss = new SlideShow(new HSLFSlideShow(fs));
            Slide[] slides = ss.getSlides();   // get every slide
            string content = "";
            for (int i = 0; i < slides.Length; i++)
            {
                TextRun[] t = slides[i].getTextRuns();   // the text runs hold the slide's text
                for (int j = 0; j < t.Length; j++)
                {
                    content += t[j].getText();
                }
            }
            return content;
        }
        finally
        {
            fs.close();
        }
    }
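The txt reader above hardcodes gb2312, which garbles files saved in another encoding. A minimal sketch of a more tolerant reader follows, assuming it is acceptable to trust a byte-order mark when one is present and to fall back to a caller-supplied encoding otherwise. The `TextFileReader` class and `ReadAllTextWithFallback` method are hypothetical names, not part of POI.

```csharp
using System.IO;
using System.Text;

public static class TextFileReader
{
    // Reads a text file, honoring a byte-order mark if the file has one and
    // otherwise decoding with the supplied fallback encoding.
    public static string ReadAllTextWithFallback(string path, Encoding fallback)
    {
        using (var reader = new StreamReader(path, fallback, detectEncodingFromByteOrderMarks: true))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Passing `Encoding.GetEncoding("gb2312")` as the fallback keeps the original behavior for files without a BOM. On .NET Core and later, that call only works after `Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)` has been run.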


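Note: the code above reads Office 2003 files. Other Office versions map to different POI APIs.

Excel files: the xls format corresponds to the POI API HSSF; xlsx is the Office 2007 file format, for which the POI API is XSSF.

Word files: the doc format corresponds to the POI API HWPF; docx corresponds to XWPF.

PowerPoint files: the ppt format corresponds to the POI API HSLF; pptx corresponds to XSLF.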
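The mapping above can be captured in a small lookup table, which is one way to decide which reader to call before extending the dispatcher to the 2007 formats. This is only a sketch: `PoiFormatMap` and `ApiFor` are hypothetical names, and the method returns the name of the POI component rather than invoking any POI class.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class PoiFormatMap
{
    // Extension -> POI component, exactly as listed in the note above.
    private static readonly Dictionary<string, string> ApiByExtension =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { ".xls", "HSSF" }, { ".xlsx", "XSSF" },
            { ".doc", "HWPF" }, { ".docx", "XWPF" },
            { ".ppt", "HSLF" }, { ".pptx", "XSLF" },
        };

    // Returns the POI component name for the file, or null when the
    // extension is not one of the Office formats listed above.
    public static string ApiFor(string fileName)
    {
        string ext = Path.GetExtension(fileName);
        if (string.IsNullOrEmpty(ext))
        {
            return null;
        }
        return ApiByExtension.TryGetValue(ext, out string api) ? api : null;
    }
}
```

The lookup is case-insensitive, so `ApiFor("report.XLSX")` and `ApiFor("report.xlsx")` resolve to the same component, matching the `ToLower()` call in the dispatcher.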
