Tesseract Font Training Notes
1. Create the .box file
tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] -l yournewlanguage batch.nochop makebox
2. Run the training
tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] box.train
or
tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] box.train.stderr
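Steps 1 and 2 reuse the same [lang].[fontname].exp[num] base name, so the commands can be generated from its three parts. This is a minimal sketch, not part of Tesseract: the function names train_base and print_box_commands are hypothetical, and the commands are only printed, never executed.

```shell
# Build the shared base name [lang].[fontname].exp[num].
train_base() {
  local lang="$1" font="$2" num="$3"
  printf '%s.%s.exp%s\n' "$lang" "$font" "$num"
}

# Print (do not run) the step 1 and step 2 commands from these notes.
# $1=lang, $2=fontname, $3=num, $4=value passed to -l in step 1.
print_box_commands() {
  local base
  base="$(train_base "$1" "$2" "$3")"
  printf 'tesseract %s.tif %s -l %s batch.nochop makebox\n' "$base" "$base" "$4"
  printf 'tesseract %s.tif %s box.train\n' "$base" "$base"
}
```

For example, print_box_commands eng timesitalic 0 eng prints both commands for eng.timesitalic.exp0.tif, so they can be checked by eye before being run.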
set_unicharset_properties
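I am not sure what this is for; according to the Tesseract training wiki, it is a tool new in 3.03 that adds extra properties to the unicharset.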
training/set_unicharset_properties -U input_unicharset -O output_unicharset --script_dir=training/langdata
font_properties
The font properties file. Each line has the form:
<fontname> <italic> <bold> <fixed> <serif> <fraktur>
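Here <fontname> is a string naming the font, and <italic>, <bold>, <fixed>, <serif> and <fraktur> are simple 0 or 1 flags indicating whether the font has that property.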
Example:
timesitalic 1 0 0 1 0
----In 3.03 there is a default font_properties file covering 3000 fonts (not necessarily accurate) at training/langdata/font_properties.
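A malformed font_properties line is easy to miss, so it can help to check the file before training. This is a minimal sketch, not a Tesseract tool: the function names are hypothetical, and it assumes fields are separated by single spaces as in the example and that the file has no blank lines.

```shell
# Return 0 if the line is: <fontname> followed by five 0/1 flags.
is_valid_font_line() {
  printf '%s\n' "$1" | grep -Eq '^[^[:space:]]+( [01]){5}$'
}

# Check every line of a font_properties file; report bad lines on stderr.
# Returns 0 only if all lines are well formed.
check_font_properties() {
  local file="$1" line n=0 bad=0
  while IFS= read -r line || [ -n "$line" ]; do
    n=$((n + 1))
    if ! is_valid_font_line "$line"; then
      printf 'line %d is malformed: %s\n' "$n" "$line" >&2
      bad=1
    fi
  done < "$file"
  return "$bad"
}
```

Running check_font_properties font_properties prints nothing and returns 0 when every line matches the format.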
Clustering
shapeclustering creates a master shape table by shape clustering and writes it to a file named shapetable.
shapeclustering -F font_properties -U unicharset lang.fontname.exp0.tr lang.fontname.exp1.tr ...
----If you get an error message such as "index >= 0 && index < size_used_:Error:Assert failed in genericvector.h, line 512", add the shapetable file to your language data file.
mftraining -F font_properties -U unicharset -O lang.unicharset lang.fontname.exp0.tr lang.fontname.exp1.tr ...
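The -U file is the unicharset generated earlier by unicharset_extractor, and lang.unicharset is the output unicharset that will be passed to combine_tessdata. mftraining also writes two data files: inttemp (the shape prototypes) and pffmtable (the number of expected features for each character).
Generate the normproto data file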
cntraining lang.fontname.exp0.tr lang.fontname.exp1.tr ...
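The three clustering commands all take the same list of .tr files, so they can be generated together. This is a minimal sketch that only prints the commands in the order used in these notes; the function name print_cluster_commands is hypothetical, and the file names font_properties and unicharset are assumed to sit in the current directory.

```shell
# Print the three clustering commands for a language code ($1)
# and a list of .tr files (all remaining arguments).
print_cluster_commands() {
  local lang="$1"
  shift
  [ "$#" -gt 0 ] || { echo 'need at least one .tr file' >&2; return 1; }
  # "$*" joins the .tr file names with single spaces.
  printf 'shapeclustering -F font_properties -U unicharset %s\n' "$*"
  printf 'mftraining -F font_properties -U unicharset -O %s.unicharset %s\n' "$lang" "$*"
  printf 'cntraining %s\n' "$*"
}
```

For example, print_cluster_commands eng eng.timesitalic.exp0.tr eng.timesitalic.exp1.tr prints one line per tool. Once the output looks right, it can be piped to sh, provided the training tools are on the PATH.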
Dictionary data (optional)
Name | Type | Description |
word-dawg | dawg | A dawg made from dictionary words from the language. |
freq-dawg | dawg | A dawg made from the most frequent words which would have gone into word-dawg. |
punc-dawg | dawg | A dawg made from punctuation patterns found around words. The "word" part is replaced by a single space. |
number-dawg | dawg | A dawg made from tokens which originally contained digits. Each digit is replaced by a space character. |
fixed-length-dawgs | dawg | Several dawgs of different fixed lengths; useful for languages like Chinese. |
bigram-dawg | dawg | A dawg of word bigrams where the words are separated by a space and each digit is replaced by a ?. |
unambig-dawg | dawg | TODO: Describe. |
user-words | text | A list of extra words to add to the dictionary. Usually left empty to be added by users if they require it; see tesseract(1). |
wordlist2dawg frequent_words_list lang.freq-dawg lang.unicharset
wordlist2dawg words_list lang.word-dawg lang.unicharset
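The digit replacement described for number-dawg and bigram-dawg in the table can be illustrated with sed. This sketch only demonstrates the two replacement rules on a sample token; it is not what wordlist2dawg does internally, and the function names are hypothetical.

```shell
# number-dawg style: each digit is replaced by a space.
number_pattern() {
  printf '%s\n' "$1" | sed 's/[0-9]/ /g'
}

# bigram-dawg style: each digit is replaced by a question mark.
bigram_pattern() {
  printf '%s\n' "$1" | sed 's/[0-9]/?/g'
}
```

For example, bigram_pattern 'page 12' prints page ??.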
References:
WIKI
https://code.google.com/p/tesseract-ocr/wiki/FAQ
Introduction
https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3#font_properties_(new_in_3.01)
WORDLIST2DAWG(1) Manual Page
http://tesseract-ocr.googlecode.com/svn-history/trunk/doc/wordlist2dawg.1.html
COMBINE_TESSDATA(1) Manual Page
http://tesseract-ocr.googlecode.com/svn-history/r800/trunk/doc/combine_tessdata.1.html