
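Implementing OCR on Images with OneNote

Two programs are recommended for OCR:

  Tesseract: an open-source OCR engine maintained by Google.

  OneNote: bundled with Microsoft Office, or installable on its own.

This post covers OCR implemented with OneNote.

Note: OneNote 2010 and later versions all do OCR in much the same way: OneNote converts the page to a specific XML format, and you then extract the node you want. With Office 2007, recognition is simpler: you call the MODI (Microsoft Office Document Imaging) API directly.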
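
The "extract the node you want" step can be sketched on its own, without OneNote installed. This is a minimal sketch, not the post's implementation: the class name OcrXmlSketch, the method name ExtractOcrText and the sample XML are made up for illustration, and the sample only imitates the shape of a OneNote page (an OCRText element somewhere below an Image element).

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class OcrXmlSketch
{
    // Returns the text of every OCRText element found below an Image element,
    // joined by newlines, or null when the page XML holds no OCR result.
    public static string ExtractOcrText(string pageXml)
    {
        XDocument doc = XDocument.Parse(pageXml);
        // Read the namespace from the root so the code does not depend on a schema version.
        XNamespace ns = doc.Root.Name.Namespace;

        var texts = doc.Descendants(ns + "Image")
                       .SelectMany(img => img.Descendants(ns + "OCRText"))
                       .Select(node => node.Value)
                       .ToList();

        return texts.Count == 0 ? null : string.Join(Environment.NewLine, texts);
    }

    public static void Main()
    {
        // Hand-written sample (an assumption), not real OneNote output.
        string sample =
            "<one:Page xmlns:one=\"http://schemas.microsoft.com/office/onenote/2010/onenote\">" +
            "<one:Outline><one:OEChildren><one:OE><one:Image>" +
            "<one:OCRData><one:OCRText><![CDATA[Hello OCR]]></one:OCRText></one:OCRData>" +
            "</one:Image></one:OE></one:OEChildren></one:Outline></one:Page>";

        Console.WriteLine(ExtractOcrText(sample)); // prints "Hello OCR"
    }
}
```

Returning null rather than an empty string lets the caller tell "no OCR result yet" apart from "the image contained no text".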

I have implemented OCR functions for both Office 2007 and Office 2010.

Source code download: Jianguoyun (Nutstore) link

using Microsoft.Office.Interop.OneNote;
using System;
using System.Collections.Generic;
using System.Drawing;
using System.Drawing.Imaging;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Xml;
using System.Xml.Linq;

namespace Extraction.OCR
{
    public class ExtractionOCR
    {
        #region Fields and properties
        private const string ConvertOk = "Conversion succeeded";
        private static readonly ExtractionOCR instance = new ExtractionOCR();
        private static readonly object syncRoot = new object();  // serializes OneNote automation calls
        public static ExtractionOCR Instance { get { return instance; } }
        public static string section_path { get; set; }
        public static int waitTime = 3 * 1000;  // base polling interval in milliseconds
        #endregion

        /// <summary>
        /// OCR through the Office 2007 MODI component.
        /// </summary>
        /// <param name="imgPath">Path of the image to recognize.</param>
        /// <returns>Path of the text file holding the result, or "" on failure.</returns>
        public string Ocr_2007(string imgPath)
        {
            try
            {
                // Take the extension without the dot; a fixed 3-character suffix breaks on "jpeg" and "tiff".
                var imgType = Path.GetExtension(imgPath).TrimStart('.');
                var data = File.ReadAllBytes(imgPath);
                string imgInfo = "";
                int i = 0;
                var localimgFile = AppDomain.CurrentDomain.BaseDirectory + @"\" + Guid.NewGuid().ToString() + "." + imgType;
                // Re-encode the image into a local file, retrying up to 3 times.
                while (!imgInfo.Equals(ConvertOk) && i < 3)
                {
                    ++i;
                    imgInfo = this.GetBase64(data, imgType, localimgFile);
                }
                if (!imgInfo.Equals(ConvertOk))
                {
                    File.AppendAllText(AppDomain.CurrentDomain.BaseDirectory + @"\log.txt", imgInfo);
                    return "";
                }
                MODI.Document doc = new MODI.Document();
                doc.Create(localimgFile);
                doc.OCR(MODI.MiLANGUAGES.miLANG_CHINESE_SIMPLIFIED, true, true);
                MODI.Image image = (MODI.Image)doc.Images[0];
                MODI.Layout layout = image.Layout;
                StringBuilder sb = new StringBuilder();
                sb.Append(layout.Text);
                doc.Close(false);  // release the image file held by MODI
                doc = null;
                var localFilePath = AppDomain.CurrentDomain.BaseDirectory + @"\" + Guid.NewGuid().ToString() + ".txt";
                File.WriteAllText(localFilePath, sb.ToString());
                Console.WriteLine(sb.ToString());
                return localFilePath;
            }
            catch (Exception e)
            {
                File.AppendAllText(AppDomain.CurrentDomain.BaseDirectory + @"\log.txt", e.ToString());
                return "";
            }
            finally
            {
                GC.Collect();
            }
        }

        /// <summary>
        /// OneNote 2010. You must first create a notebook in OneNote and convert it to the OneNote 2007 format.
        /// OneNote 2016 (the personal edition is enough) is recommended; its API is similar to 2010's
        /// (just drop the XMLSchema.xs2007 argument), and the API parameter names are a guide to the rest.
        /// Note 1: the "Embed Interop Types" property of the DLL reference must be turned off.
        /// </summary>
        /// <param name="imgPath">Path of the image to recognize.</param>
        /// <returns>Path of the text file holding the result, or "" on failure.</returns>
        public string Ocr_2010(string imgPath)
        {
            // Hard-coded for this demo; point it at your own .one section file.
            section_path = @"C:\Users\zhensheng\Desktop\打杂\ocr\ocr.one";
            try
            {
                if (string.IsNullOrEmpty(section_path))
                {
                    Console.WriteLine("Please create a notebook first");
                    File.AppendAllText(AppDomain.CurrentDomain.BaseDirectory + @"\log.txt", "Create a notebook in OneNote first, convert it to the OneNote 2007 format, and assign the path of the .one file to section_path");
                    return "";
                }
                var imgType = Path.GetExtension(imgPath).TrimStart('.');
                var data = File.ReadAllBytes(imgPath);
                string pageID = "";
                string pageXml;
                XNamespace ns;

                lock (syncRoot)
                {
                    var onenoteApp = new Microsoft.Office.Interop.OneNote.Application();  // API provided by OneNote
                    if (onenoteApp == null)
                    {
                        File.AppendAllText(AppDomain.CurrentDomain.BaseDirectory + @"\log.txt", "Failed to create Microsoft.Office.Interop.OneNote.Application()");
                        return "";
                    }
                    #region Create a page and get its pageID
                    string sectionID;
                    onenoteApp.OpenHierarchy(section_path, null, out sectionID, CreateFileType.cftSection);
                    // CreateNewPage assigns the ID; its format is {guid}{tab}{??}
                    onenoteApp.CreateNewPage(sectionID, out pageID);
                    #endregion

                    #region Get the XML structure of the OneNote hierarchy
                    string notebookXml;
                    onenoteApp.GetHierarchy(null, HierarchyScope.hsPages, out notebookXml, XMLSchema.xs2007);
                    var doc = XDocument.Parse(notebookXml);
                    ns = doc.Root.Name.Namespace;
                    #endregion

                    #region Insert the image into the page
                    Tuple<string, int, int> imgInfo = this.GetBase64(data, imgType);
                    var page = new XDocument(new XElement(ns + "Page",
                                                    new XElement(ns + "Outline",
                                                    new XElement(ns + "OEChildren",
                                                        new XElement(ns + "OE",
                                                        new XElement(ns + "Image",
                                                            new XAttribute("format", imgType), new XAttribute("originalPageNumber", "0"),
                                                            new XElement(ns + "Position",
                                                                new XAttribute("x", "0"), new XAttribute("y", "0"), new XAttribute("z", "0")),
                                                            new XElement(ns + "Size",
                                                                new XAttribute("width", imgInfo.Item2), new XAttribute("height", imgInfo.Item3)),
                                                            new XElement(ns + "Data", imgInfo.Item1)))))));

                    page.Root.SetAttributeValue("ID", pageID);
                    onenoteApp.UpdatePageContent(page.ToString(), DateTime.MinValue, XMLSchema.xs2007);
                    #endregion

                    #region Poll for the OCR result, at most 6 attempts
                    int fileSize = Convert.ToInt32(data.Length / 1024 / 1024);  // file size in MB
                    int count = 0;
                    do
                    {
                        System.Threading.Thread.Sleep(waitTime * (fileSize > 1 ? fileSize : 1));  // anything under 1 MB counts as 1 MB
                        onenoteApp.GetPageContent(pageID, out pageXml, PageInfo.piBinaryData, XMLSchema.xs2007);
                        ++count;
                    }
                    // The page XML is normally non-empty as soon as the page exists,
                    // so wait for an OCRText element rather than for any content.
                    while ((string.IsNullOrEmpty(pageXml) || !pageXml.Contains("OCRText")) && count < 6);
                    #endregion

                    #region Delete the page
                    onenoteApp.DeleteHierarchy(pageID, DateTime.MinValue);
                    #endregion
                }
                #region Extract the OCR text from the XML
                XmlDocument xmlDoc = new XmlDocument();
                xmlDoc.LoadXml(pageXml);
                XmlNamespaceManager nsmgr = new XmlNamespaceManager(xmlDoc.NameTable);
                nsmgr.AddNamespace("one", ns.ToString());
                XmlNode xmlNode = xmlDoc.SelectSingleNode("//one:Image//one:OCRText", nsmgr);
                if (xmlNode == null)
                {
                    File.AppendAllText(AppDomain.CurrentDomain.BaseDirectory + @"\log.txt", "OCR did not recognize any text");
                    return "";
                }
                #endregion
                var localFilePath = AppDomain.CurrentDomain.BaseDirectory + @"\" + Guid.NewGuid().ToString() + ".txt";
                File.WriteAllText(localFilePath, xmlNode.InnerText);
                Console.WriteLine(xmlNode.InnerText);

                return localFilePath;
            }
            catch (Exception e)
            {
                File.AppendAllText(AppDomain.CurrentDomain.BaseDirectory + @"\log.txt", e.ToString());
                return "";
            }
        }

        // Maps a file extension (without the dot) to a GDI+ image format; null if unsupported.
        private static ImageFormat GetImageFormat(string imgType)
        {
            switch (imgType.ToLower())
            {
                case "jpg":
                case "jpeg": return ImageFormat.Jpeg;
                case "gif": return ImageFormat.Gif;
                case "bmp": return ImageFormat.Bmp;
                case "tif":
                case "tiff": return ImageFormat.Tiff;
                case "png": return ImageFormat.Png;
                case "emf": return ImageFormat.Emf;
                default: return null;
            }
        }

        // Re-encodes the image bytes, writes them to filePath, and returns a status message.
        private string GetBase64(byte[] data, string imgType, string filePath)
        {
            ImageFormat format = GetImageFormat(imgType);
            if (format == null)
            {
                return "Unsupported image format.";
            }
            using (MemoryStream input = new MemoryStream(data))
            using (Bitmap bp = (Bitmap)Image.FromStream(input))
            using (MemoryStream ms = new MemoryStream())
            {
                bp.Save(ms, format);
                File.WriteAllBytes(filePath, ms.ToArray());
                return ConvertOk;
            }
        }

        // Re-encodes the image bytes and returns (Base64 data, width, height).
        private Tuple<string, int, int> GetBase64(byte[] data, string imgType)
        {
            ImageFormat format = GetImageFormat(imgType);
            if (format == null)
            {
                return new Tuple<string, int, int>("Unsupported image format.", 0, 0);
            }
            using (MemoryStream input = new MemoryStream(data))
            using (Bitmap bp = (Bitmap)Image.FromStream(input))
            using (MemoryStream ms = new MemoryStream())
            {
                bp.Save(ms, format);
                return new Tuple<string, int, int>(Convert.ToBase64String(ms.ToArray()), bp.Width, bp.Height);
            }
        }
    }
}
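
The polling step in Ocr_2010 (sleep, fetch the page XML, retry up to a limit) can be isolated and tried without OneNote. The following is only a sketch under assumptions: PollUntil and fakeFetch are hypothetical names, the fetch delegate stands in for GetPageContent, and treating "the XML contains OCRText" as the readiness test is my assumption rather than something the OneNote API promises.

```csharp
using System;
using System.Threading;

public static class PollingSketch
{
    // Calls fetch() up to maxAttempts times, sleeping for `delay` before each call.
    // Returns the first result accepted by isReady, or null if every attempt fails.
    public static string PollUntil(Func<string> fetch, Func<string, bool> isReady,
                                   int maxAttempts, TimeSpan delay)
    {
        for (int attempt = 0; attempt < maxAttempts; attempt++)
        {
            Thread.Sleep(delay);
            string result = fetch();
            if (result != null && isReady(result))
            {
                return result;
            }
        }
        return null;
    }

    public static void Main()
    {
        int calls = 0;
        // Stand-in for GetPageContent: the OCR result "appears" on the third call.
        Func<string> fakeFetch = () =>
            ++calls < 3 ? "<Page/>" : "<Page><OCRText>done</OCRText></Page>";

        string xml = PollUntil(fakeFetch, s => s.Contains("OCRText"), 6, TimeSpan.FromMilliseconds(10));
        Console.WriteLine(xml != null ? "ready after " + calls + " calls" : "timed out");
    }
}
```

Passing the readiness test in as a delegate keeps the retry logic separate from the question of what counts as a finished OCR result.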

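Notes:

  1. Office 2007 requires the Office SP2 patch.
  2. Remember to turn off the "Embed Interop Types" property on the OneNote DLL reference.
  3. If you run Office/OneNote on a server, the Desktop Experience feature must be enabled.
  4. If you use OneNote 2007 (the MODI component) on a server, note that the component is 32-bit, so any program that calls this interface directly must be a 32-bit program.
  5. If you use OneNote 2010 on a server, COM access permissions must be granted. If creating onenote.application throws an exception, give the calling process administrator rights (right-click the process and choose to run it under an administrator account).
  6. For other exceptions, see the Microsoft documentation: https://msdn.microsoft.com/zh-cn/library/jj680117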
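
The 32-bit requirement in note 4 can be checked at startup before any COM call is made. A minimal sketch, assuming .NET Framework 4 or later; PreflightSketch and DescribeBitness are hypothetical names, and the check only reports the bitness of the current process, it does not verify that MODI is installed.

```csharp
using System;

public static class PreflightSketch
{
    // Reports whether the current process can host a 32-bit-only COM component in-process.
    public static string DescribeBitness()
    {
        return Environment.Is64BitProcess
            ? "64-bit process: rebuild as x86 before calling a 32-bit COM component"
            : "32-bit process: OK for a 32-bit COM component";
    }

    public static void Main()
    {
        Console.WriteLine(DescribeBitness());
        Console.WriteLine("64-bit OS: " + Environment.Is64BitOperatingSystem);
    }
}
```

If the check reports a 64-bit process, setting the project's platform target to x86 is the usual way to get a 32-bit process on a 64-bit server.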

