
A Brief Look at HtmlParser

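  After crawling the pages you need with Heritrix, you still have to classify and otherwise process their content, and this is where htmlparser comes in. Using htmlparser is not that easy, though: documentation is scarce, so developers have to explore and discover many of its features on their own.

  Here is a useful reference site (the htmlparser API): http://tool.oschina.net/apidocs/apidoc?api=HTMLParser. The API documentation is in English, so readers who are not comfortable with English will have to push themselves through it.

  The core of HTMLParser is the org.htmlparser.Parser class, which does the actual work of analyzing an HTML page. The class has the following constructors: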

public Parser ();
public Parser (Lexer lexer, ParserFeedback fb);
public Parser (URLConnection connection, ParserFeedback fb) throws ParserException;
public Parser (String resource, ParserFeedback feedback) throws ParserException;
public Parser (String resource) throws ParserException;
public Parser (Lexer lexer);
public Parser (URLConnection connection) throws ParserException;

and one static factory method:

public static Parser createParser (String html, String charset);

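  Most users initialize a Parser from a URLConnection or from a string holding the page content, or use the static method to create a Parser object. The ParserFeedback code is simple; it exists for debugging and tracing the parsing process and usually does not need to be changed. Using a Lexer is a more advanced topic and is left for a later discussion.
  One interesting point: of the creation methods listed above, the static method is the only one that takes an encoding, so if you need to specify the page encoding when creating the Parser and do not want to use a Lexer, it is the only option. For most Chinese pages this is probably the method that ends up being used most often.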
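
To see why the encoding matters for Chinese pages, here is a minimal sketch that uses only the Java standard library and no HTMLParser classes. It decodes the same bytes once with the charset they were written in and once with a different one. The class name CharsetDemo and the helper decode are hypothetical names chosen for this illustration, and the sketch assumes the GBK charset is available in the running JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetDemo {
    // Decode raw page bytes using the given charset.
    static String decode(byte[] pageBytes, Charset charset) {
        return new String(pageBytes, charset);
    }

    public static void main(String[] args) {
        // The two escaped characters are the title used in index.html.
        String title = "<title>\u767E\u5EA6</title>";
        // Bytes as a server would send them for a UTF-8 page.
        byte[] pageBytes = title.getBytes(StandardCharsets.UTF_8);

        String right = decode(pageBytes, StandardCharsets.UTF_8);
        String wrong = decode(pageBytes, Charset.forName("GBK"));

        System.out.println("decoded as UTF-8 matches: " + right.equals(title));
        System.out.println("decoded as GBK matches:   " + wrong.equals(title));
    }
}
```

The parser faces the same choice when it turns the bytes of a page into characters, which is why a way to state the charset explicitly is useful.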

Below is an example of initializing a Parser (by opening the URL of a web page; the OpenFile method in it is meant for opening a local HTML file).

[Page being loaded: index.html]

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
    <head>
        <meta http-equiv = "Content-Type" content = "text/html; charset = utf-8"/>
        <title>百度</title>
        <link href = "http://www.mamicode.com/a_1.css" rel = "stylesheet" type = "text/css"/>
    </head>
    <body>
        <div  align = "center" class = "photo" >
            <img src = "http://www.mamicode.com/image/baidu.PNG" >
        </div>
        <div align = "center" class = "body">
            <table cellpadding="8">
                <td>
                    <a href = "http://www.mamicode.com/#" target = _blank title = "欢迎来到&#10百度网站">新闻</a>
                </td>
                <td>
                    <font color = "black">网页</font>
                </td>
                <td>
                    <a href = "http://www.mamicode.com/#" target = _blank title = "欢迎来到&#10百度网站">贴吧</a>
                </td>
                <td>
                    <a href = "http://www.mamicode.com/#" target = _blank title = "欢迎来到&#10百度网站">知道</a>
                </td>
                <td>
                    <a href = "http://www.mamicode.com/#" target = _blank title = "欢迎来到&#10百度网站">音乐</a>
                </td>
                <td>
                    <a href = "http://www.mamicode.com/#" target = _blank title = "欢迎来到&#10百度网站">图片</a>
                </td>
                <td>
                    <a href = "http://www.mamicode.com/#" target = _blank title = "欢迎来到&#10百度网站">视频</a>
                </td>
                <td>
                    <a href = "http://www.mamicode.com/#" target = _blank title = "欢迎来到&#10百度网站">地图</a>
                </td>
            </table>
            <input class = "input" >
        </div>
    </body>

</html>

 

[Source: htmlparser_1.java]

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import org.htmlparser.Parser;
import org.htmlparser.visitors.TextExtractingVisitor;

public class Main {
    private static String ENCODE = "GBK";

    // Print msg after encoding it as ENCODE bytes and decoding those bytes
    // with the platform's default file encoding.
    private static void message(String msg) {
        try {
            System.out.println(new String(msg.getBytes(ENCODE), System
                    .getProperty("file.encoding")));
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    /*
     * Open a local file and return its content (empty string on failure).
     */
    public static String OpenFile(String FileName) {
        String mContent = "";
        try {
            File mFile = new File(FileName);
            FileInputStream mFileInputStream = new FileInputStream(mFile);
            InputStreamReader mInputStreamReader = new InputStreamReader(
                    mFileInputStream, ENCODE);
            BufferedReader mBufferedReader = new BufferedReader(
                    mInputStreamReader);
            String mTemp = "";
            while ((mTemp = mBufferedReader.readLine()) != null) {
                mContent += mTemp + "\n";
            }
            mBufferedReader.close();
        } catch (Exception e) {
            e.printStackTrace();
            return "";
        }
        return mContent;
    }

    /*
     * main method
     */
    public static void main(String[] args) {
        // String mContent=OpenFile("");
        try {
            Parser mParser = new Parser((HttpURLConnection) (new URL(
                    "http://127.0.0.1/HtmlParser/index.html")).openConnection());
            // Visit every node and collect only the text content.
            TextExtractingVisitor mExtractingVisitor = new TextExtractingVisitor();
            mParser.visitAllNodesWith(mExtractingVisitor);
            String textInPage = mExtractingVisitor.getExtractedText();
            message(textInPage);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

}

Test output:



        百度









                    新闻


                    网页


                    贴吧


                    知道


                    音乐


                    图片


                    视频


                    地图






 

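  HTMLParser stores the parsed information as a tree. Node is the basic data type in which this information is held.

Here is the definition of Node:
public interface Node extends Cloneable;

The methods in Node fall into several groups: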

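Methods for traversing the tree structure, which are the easiest to understand:

Node getParent (): returns the parent node
NodeList getChildren (): returns the list of child nodes
Node getFirstChild (): returns the first child node
Node getLastChild (): returns the last child node
Node getPreviousSibling (): returns the previous sibling node
Node getNextSibling (): returns the next sibling node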
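
The relationships behind these traversal methods can be sketched with a tiny stand-in tree. SimpleNode below is a hypothetical class written for this illustration, not HTMLParser's Node interface; it only borrows the method names to show how parent, child and sibling lookups relate to one another.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a parse-tree node; not HTMLParser's Node.
class SimpleNode {
    final String name;
    SimpleNode parent;
    final List<SimpleNode> children = new ArrayList<>();

    SimpleNode(String name) {
        this.name = name;
    }

    // Append a child and record this node as its parent.
    SimpleNode add(SimpleNode child) {
        child.parent = this;
        children.add(child);
        return this;
    }

    SimpleNode getFirstChild() {
        return children.isEmpty() ? null : children.get(0);
    }

    SimpleNode getLastChild() {
        return children.isEmpty() ? null : children.get(children.size() - 1);
    }

    // Siblings are looked up through the parent's child list.
    SimpleNode getNextSibling() {
        if (parent == null) return null;
        int i = parent.children.indexOf(this);
        return i + 1 < parent.children.size() ? parent.children.get(i + 1) : null;
    }

    SimpleNode getPreviousSibling() {
        if (parent == null) return null;
        int i = parent.children.indexOf(this);
        return i > 0 ? parent.children.get(i - 1) : null;
    }

    public static void main(String[] args) {
        SimpleNode head = new SimpleNode("head");
        SimpleNode body = new SimpleNode("body");
        SimpleNode html = new SimpleNode("html").add(head).add(body);
        System.out.println(html.getFirstChild().name);  // prints "head"
        System.out.println(head.getNextSibling().name); // prints "body"
    }
}
```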

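Methods for getting the content of a Node:

String getText (): returns the text
String toPlainTextString (): returns the plain text
String toHtml (): returns the HTML (the original HTML)
String toHtml (boolean verbatim): returns the HTML (the original HTML)
String toString (): returns a string describing the node (its type, positions and content)
Page getPage (): returns the Page object this Node belongs to
int getStartPosition (): returns the start position of this Node in the HTML page
int getEndPosition (): returns the end position of this Node in the HTML page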
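
As a rough illustration of what the start and end positions mean, the sketch below slices a page string by character offsets. It assumes the offsets are character indexes with an exclusive end, which is consistent with the sample output later in this article, where the html tag is reported at 123 to 129, six characters for <html>. PositionDemo and slice are hypothetical names, not part of HTMLParser.

```java
class PositionDemo {
    // Return the part of the page between start (inclusive) and end (exclusive).
    static String slice(String page, int start, int end) {
        return page.substring(start, end);
    }

    public static void main(String[] args) {
        String page = "<html>\n<head>";
        // A tag reported at positions 0 to 6 covers exactly "<html>".
        System.out.println(slice(page, 0, 6)); // prints "<html>"
    }
}
```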

Methods used for Filter-based filtering:

void collectInto (NodeList list, NodeFilter filter): tests this node against the filter's condition and puts the nodes that match into list.
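
The idea of collecting matching nodes into a list can be sketched as a simple recursive walk. This is a simplified model under stated assumptions, not HTMLParser's implementation: FilterNode is a hypothetical class, and java.util.function.Predicate stands in for NodeFilter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical model of a node that supports filter-based collection.
class FilterNode {
    final String tagName;
    final List<FilterNode> children = new ArrayList<>();

    FilterNode(String tagName, FilterNode... kids) {
        this.tagName = tagName;
        for (FilterNode kid : kids) {
            children.add(kid);
        }
    }

    // Test this node first, then recurse into the children,
    // so matches are appended in document order.
    void collectInto(List<FilterNode> list, Predicate<FilterNode> filter) {
        if (filter.test(this)) {
            list.add(this);
        }
        for (FilterNode child : children) {
            child.collectInto(list, filter);
        }
    }

    public static void main(String[] args) {
        FilterNode table = new FilterNode("table",
                new FilterNode("td", new FilterNode("a")),
                new FilterNode("td", new FilterNode("font")),
                new FilterNode("td", new FilterNode("a")));
        List<FilterNode> links = new ArrayList<>();
        table.collectInto(links, n -> n.tagName.equals("a"));
        System.out.println(links.size()); // prints 2
    }
}
```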

Methods used for Visitor-based traversal:

void accept (NodeVisitor visitor): applies visitor to this Node
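
The visitor mechanism can be illustrated with a small self-contained model. Everything in it (VisitorDemo, Item, Tag, Text, TextCollector) is a hypothetical name invented for this sketch; it is similar in spirit to collecting text with a visitor, as the first example does with TextExtractingVisitor, but it is not the library's code.

```java
import java.util.Arrays;
import java.util.List;

class VisitorDemo {
    // Callbacks a visitor receives while the tree is walked.
    interface Visitor {
        void visitTag(String name);
        void visitText(String text);
    }

    static abstract class Item {
        abstract void accept(Visitor visitor);
    }

    static class Text extends Item {
        final String text;
        Text(String text) { this.text = text; }
        @Override void accept(Visitor visitor) { visitor.visitText(text); }
    }

    static class Tag extends Item {
        final String name;
        final List<Item> children;
        Tag(String name, Item... children) {
            this.name = name;
            this.children = Arrays.asList(children);
        }
        // Report this tag, then pass the visitor on to every child.
        @Override void accept(Visitor visitor) {
            visitor.visitTag(name);
            for (Item child : children) child.accept(visitor);
        }
    }

    // A visitor that ignores tags and concatenates all text.
    static class TextCollector implements Visitor {
        final StringBuilder out = new StringBuilder();
        @Override public void visitTag(String name) { }
        @Override public void visitText(String text) { out.append(text); }
    }

    static String extractText(Item root) {
        TextCollector collector = new TextCollector();
        root.accept(collector);
        return collector.out.toString();
    }

    public static void main(String[] args) {
        Item cell = new Tag("td", new Text("\n"),
                new Tag("a", new Text("news")), new Text("\n"));
        System.out.println(extractText(cell).trim()); // prints "news"
    }
}
```

The node decides how the walk proceeds, while the visitor decides what to do at each stop, so new kinds of processing can be added without touching the node classes.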

Methods for modifying content, which are used less often:

void setPage (Page page): sets the Page object this Node belongs to
void setText (String text): sets the text
void setChildren (NodeList children): sets the list of child nodes

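Other methods:

void doSemanticAction (): performs the action associated with this Node (only a few Tags have one)
Object clone (): the clone method that Node declares as part of extending Cloneable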

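  In practice HTMLParser is mostly used to process HTML pages, so the Filter- or Visitor-related methods are essential, and the first and second groups of methods are the ones used most. The first group is easy to understand; the example below illustrates the second group.

[Source: htmlparser_2.java]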

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.util.NodeIterator;

public class Main {
    private static String ENCODE = "utf-8";

    // Print msg after encoding it as ENCODE bytes and decoding those bytes
    // with the platform's default file encoding.
    private static void message(String msg) {
        try {
            System.out.println(new String(msg.getBytes(ENCODE), System
                    .getProperty("file.encoding")));
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    /*
     * Open a local file and return its content (empty string on failure).
     */
    public static String OpenFile(String FileName) {
        String mContent = "";
        try {
            File mFile = new File(FileName);
            FileInputStream mFileInputStream = new FileInputStream(mFile);
            InputStreamReader mInputStreamReader = new InputStreamReader(
                    mFileInputStream, ENCODE);
            BufferedReader mBufferedReader = new BufferedReader(
                    mInputStreamReader);
            String mTemp = "";
            while ((mTemp = mBufferedReader.readLine()) != null) {
                mContent += mTemp + "\n";
            }
            mBufferedReader.close();
        } catch (Exception e) {
            e.printStackTrace();
            return "";
        }
        return mContent;
    }

    /*
     * main method
     */
    public static void main(String[] args) {
        // String mContent=OpenFile("");
        try {
            Parser mParser = new Parser((HttpURLConnection) (new URL(
                    "http://127.0.0.1/HtmlParser/index.html")).openConnection());

            // Walk the top-level nodes and print each content accessor in turn.
            for (NodeIterator i = mParser.elements(); i.hasMoreNodes();) {
                Node node = i.nextNode();
                message("getText:"+node.getText());
                message("getPlainText:"+node.toPlainTextString());
                message("toHtml:"+node.toHtml());
                message("toHtml(true):"+node.toHtml(true));
                message("tohtml(false):"+node.toHtml(false));
                message("toString:"+node.toString());
                message("==============================");
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Test output:

getText:!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
getPlainText:
toHtml:<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
toHtml(true):<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
tohtml(false):<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
toString:Doctype Tag : !DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd; begins at : 0; ends at : 121
==============================
getText:

getPlainText:

toHtml:

toHtml(true):

tohtml(false):

toString:Txt (121[0,121],123[1,0]): \n
==============================
getText:html
getPlainText:


        百度









                    新闻


                    网页


                    贴吧


                    知道


                    音乐


                    图片


                    视频


                    地图







toHtml:<html>
    <head>
        <meta http-equiv = "Content-Type" content = "text/html; charset = utf-8"/>
        <title>百度</title>
        <link href = "http://www.mamicode.com/a_1.css" rel = "stylesheet" type = "text/css"/>
    </head>
    <body>
        <div  align = "center" class = "photo" >
            <img src = "http://www.mamicode.com/image/baidu.PNG" >
        </div>
        <div align = "center" class = "body">
            <table cellpadding="8">
                <td>
                    <a href = "http://www.mamicode.com/#" target = _blank title = "欢迎来到&#10百度网站">新闻</a>
                </td>
                <td>
                    <font color = "black">网页</font>
                </td>
                <td>
                    <a href = "http://www.mamicode.com/#" target = _blank title = "欢迎来到&#10百度网站">贴吧</a>
                </td>
                <td>
                    <a href = "http://www.mamicode.com/#" target = _blank title = "欢迎来到&#10百度网站">知道</a>
                </td>
                <td>
                    <a href = "http://www.mamicode.com/#" target = _blank title = "欢迎来到&#10百度网站">音乐</a>
                </td>
                <td>
                    <a href = "http://www.mamicode.com/#" target = _blank title = "欢迎来到&#10百度网站">图片</a>
                </td>
                <td>
                    <a href = "http://www.mamicode.com/#" target = _blank title = "欢迎来到&#10百度网站">视频</a>
                </td>
                <td>
                    <a href = "http://www.mamicode.com/#" target = _blank title = "欢迎来到&#10百度网站">地图</a>
                </td>
            </table>
            <input class = "input" >
        </div>
    </body>

</html>
toHtml(true):<html>
    <head>
        <meta http-equiv = "Content-Type" content = "text/html; charset = utf-8"/>
        <title>百度</title>
        <link href = "http://www.mamicode.com/a_1.css" rel = "stylesheet" type = "text/css"/>
    </head>
    <body>
        <div  align = "center" class = "photo" >
            <img src = "http://www.mamicode.com/image/baidu.PNG" >
        </div>
        <div align = "center" class = "body">
            <table cellpadding="8">
                <td>
                    <a href = "http://www.mamicode.com/#" target = _blank title = "欢迎来到&#10百度网站">新闻</a>
                </td>
                <td>
                    <font color = "black">网页</font>
                </td>
                <td>
                    <a href = "http://www.mamicode.com/#" target = _blank title = "欢迎来到&#10百度网站">贴吧</a>
                </td>
                <td>
                    <a href = "http://www.mamicode.com/#" target = _blank title = "欢迎来到&#10百度网站">知道</a>
                </td>
                <td>
                    <a href = "http://www.mamicode.com/#" target = _blank title = "欢迎来到&#10百度网站">音乐</a>
                </td>
                <td>
                    <a href = "http://www.mamicode.com/#" target = _blank title = "欢迎来到&#10百度网站">图片</a>
                </td>
                <td>
                    <a href = "http://www.mamicode.com/#" target = _blank title = "欢迎来到&#10百度网站">视频</a>
                </td>
                <td>
                    <a href = "http://www.mamicode.com/#" target = _blank title = "欢迎来到&#10百度网站">地图</a>
                </td>
            </table>
            <input class = "input" >
        </div>
    </body>

</html>
tohtml(false):<html>
    <head>
        <meta http-equiv = "Content-Type" content = "text/html; charset = utf-8"/>
        <title>百度</title>
        <link href = "http://www.mamicode.com/a_1.css" rel = "stylesheet" type = "text/css"/>
    </head>
    <body>
        <div  align = "center" class = "photo" >
            <img src = "http://www.mamicode.com/image/baidu.PNG" >
        </div>
        <div align = "center" class = "body">
            <table cellpadding="8">
                <td>
                    <a href = "http://www.mamicode.com/#" target = _blank title = "欢迎来到&#10百度网站">新闻</a>
                </td>
                <td>
                    <font color = "black">网页</font>
                </td>
                <td>
                    <a href = "http://www.mamicode.com/#" target = _blank title = "欢迎来到&#10百度网站">贴吧</a>
                </td>
                <td>
                    <a href = "http://www.mamicode.com/#" target = _blank title = "欢迎来到&#10百度网站">知道</a>
                </td>
                <td>
                    <a href = "http://www.mamicode.com/#" target = _blank title = "欢迎来到&#10百度网站">音乐</a>
                </td>
                <td>
                    <a href = "http://www.mamicode.com/#" target = _blank title = "欢迎来到&#10百度网站">图片</a>
                </td>
                <td>
                    <a href = "http://www.mamicode.com/#" target = _blank title = "欢迎来到&#10百度网站">视频</a>
                </td>
                <td>
                    <a href = "http://www.mamicode.com/#" target = _blank title = "欢迎来到&#10百度网站">地图</a>
                </td>
            </table>
            <input class = "input" >
        </div>
    </body>

</html>
toString:Tag (123[1,0],129[1,6]): html
  Txt (129[1,6],132[2,1]): \n\t
  Tag (132[2,1],138[2,7]): head
    Txt (138[2,7],142[3,2]): \n\t\t
    Tag (142[3,2],216[3,76]): meta http-equiv = "Content-Type" content = "text/ht...
    Txt (216[3,76],220[4,2]): \n\t\t
    Tag (220[4,2],227[4,9]): title
      Txt (227[4,9],229[4,11]): 百度
      End (229[4,11],237[4,19]): /title
    Txt (237[4,19],241[5,2]): \n\t\t
    Tag (241[5,2],302[5,63]): link href = "http://www.mamicode.com/a_1.css" rel = "stylesheet" type = "te...
    Txt (302[5,63],305[6,1]): \n\t
    End (305[6,1],312[6,8]): /head
  Txt (312[6,8],315[7,1]): \n\t
  Tag (315[7,1],321[7,7]): body
    Txt (321[7,7],325[8,2]): \n\t\t
    Tag (325[8,2],365[8,42]): div  align = "center" class = "photo"
      Txt (365[8,42],370[9,3]): \n\t\t\t
      Tag (370[9,3],403[9,36]): img src = "http://www.mamicode.com/image/baidu.PNG"
      Txt (403[9,36],407[10,2]): \n\t\t
      End (407[10,2],413[10,8]): /div
    Txt (413[10,8],417[11,2]): \n\t\t
    Tag (417[11,2],454[11,39]): div align = "center" class = "body"
      Txt (454[11,39],459[12,3]): \n\t\t\t
      Tag (459[12,3],482[12,26]): table cellpadding="8"
        Txt (482[12,26],488[13,4]): \n\t\t\t\t
        Tag (488[13,4],492[13,8]): td
          Txt (492[13,8],499[14,5]): \n\t\t\t\t\t
          Tag (499[14,5],552[14,58]): a href = "http://www.mamicode.com/#" target = _blank title = "欢迎来到&#10百度网站"
            Txt (552[14,58],554[14,60]): 新闻
            End (554[14,60],558[14,64]): /a
          Txt (558[14,64],564[15,4]): \n\t\t\t\t
          End (564[15,4],569[15,9]): /td
        Txt (569[15,9],575[16,4]): \n\t\t\t\t
        Tag (575[16,4],579[16,8]): td
          Txt (579[16,8],586[17,5]): \n\t\t\t\t\t
          Tag (586[17,5],608[17,27]): font color = "black"
          Txt (608[17,27],610[17,29]): 网页
          End (610[17,29],617[17,36]): /font
          Txt (617[17,36],623[18,4]): \n\t\t\t\t
          End (623[18,4],628[18,9]): /td
        Txt (628[18,9],634[19,4]): \n\t\t\t\t
        Tag (634[19,4],638[19,8]): td
          Txt (638[19,8],645[20,5]): \n\t\t\t\t\t
          Tag (645[20,5],698[20,58]): a href = "http://www.mamicode.com/#" target = _blank title = "欢迎来到&#10百度网站"
            Txt (698[20,58],700[20,60]): 贴吧
            End (700[20,60],704[20,64]): /a
          Txt (704[20,64],710[21,4]): \n\t\t\t\t
          End (710[21,4],715[21,9]): /td
        Txt (715[21,9],721[22,4]): \n\t\t\t\t
        Tag (721[22,4],725[22,8]): td
          Txt (725[22,8],732[23,5]): \n\t\t\t\t\t
          Tag (732[23,5],785[23,58]): a href = "http://www.mamicode.com/#" target = _blank title = "欢迎来到&#10百度网站"
            Txt (785[23,58],787[23,60]): 知道
            End (787[23,60],791[23,64]): /a
          Txt (791[23,64],797[24,4]): \n\t\t\t\t
          End (797[24,4],802[24,9]): /td
        Txt (802[24,9],808[25,4]): \n\t\t\t\t
        Tag (808[25,4],812[25,8]): td
          Txt (812[25,8],819[26,5]): \n\t\t\t\t\t
          Tag (819[26,5],872[26,58]): a href = "http://www.mamicode.com/#" target = _blank title = "欢迎来到&#10百度网站"
            Txt (872[26,58],874[26,60]): 音乐
            End (874[26,60],878[26,64]): /a
          Txt (878[26,64],884[27,4]): \n\t\t\t\t
          End (884[27,4],889[27,9]): /td
        Txt (889[27,9],895[28,4]): \n\t\t\t\t
        Tag (895[28,4],899[28,8]): td
          Txt (899[28,8],906[29,5]): \n\t\t\t\t\t
          Tag (906[29,5],959[29,58]): a href = "http://www.mamicode.com/#" target = _blank title = "欢迎来到&#10百度网站"
            Txt (959[29,58],961[29,60]): 图片
            End (961[29,60],965[29,64]): /a
          Txt (965[29,64],971[30,4]): \n\t\t\t\t
          End (971[30,4],976[30,9]): /td
        Txt (976[30,9],982[31,4]): \n\t\t\t\t
        Tag (982[31,4],986[31,8]): td
          Txt (986[31,8],993[32,5]): \n\t\t\t\t\t
          Tag (993[32,5],1046[32,58]): a href = "http://www.mamicode.com/#" target = _blank title = "欢迎来到&#10百度网站"
            Txt (1046[32,58],1048[32,60]): 视频
            End (1048[32,60],1052[32,64]): /a
          Txt (1052[32,64],1058[33,4]): \n\t\t\t\t
          End (1058[33,4],1063[33,9]): /td
        Txt (1063[33,9],1069[34,4]): \n\t\t\t\t
        Tag (1069[34,4],1073[34,8]): td
          Txt (1073[34,8],1080[35,5]): \n\t\t\t\t\t
          Tag (1080[35,5],1133[35,58]): a href = "http://www.mamicode.com/#" target = _blank title = "欢迎来到&#10百...
            Txt (1133[35,58],1135[35,60]): 地图
            End (1135[35,60],1139[35,64]): /a
          Txt (1139[35,64],1145[36,4]): \n\t\t\t\t
          End (1145[36,4],1150[36,9]): /td
        Txt (1150[36,9],1155[37,3]): \n\t\t\t
        End (1155[37,3],1163[37,11]): /table
      Txt (1163[37,11],1168[38,3]): \n\t\t\t
      Tag (1168[38,3],1192[38,27]): input class = "input"
      Txt (1192[38,27],1196[39,2]): \n\t\t
      End (1196[39,2],1202[39,8]): /div
    Txt (1202[39,8],1205[40,1]): \n\t
    End (1205[40,1],1212[40,8]): /body
  Txt (1212[40,8],1216[42,0]): \n\n
  End (1216[42,0],1223[42,7]): /html

==============================

 

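  The content of the first Node corresponds to the first line of the page, <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">. The output also shows the tree structure of the content, or rather a forest: each first-level tag of the Page, such as DOCTYPE and html, forms a top-level Node of its own. (Many readers may be puzzled by the content of the second Node in the output above. It is simply a line break. HTMLParser turns every line break, space and Tab in the HTML page into a text node of its own, which is why such a Node appears. Its content is small, but its level is high.)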

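  toPlainTextString returns everything the user can see. Two things are worth noting. First, the content of the title inside <head> is included in the plain text, presumably because text shown in the title bar also counts as visible. Second, as mentioned above, the line breaks and other whitespace in the HTML also end up in the plain text, which seems logically questionable.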
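
Since the whitespace of the page ends up in the plain text, a common follow-up step is to collapse it before using the text. The sketch below does this with the standard library only; PlainTextCleanup and normalize are hypothetical names for this illustration, not part of HTMLParser, and whether collapsing is appropriate depends on what the text is used for.

```java
class PlainTextCleanup {
    // Collapse every run of whitespace into one space and trim both ends.
    static String normalize(String plainText) {
        return plainText.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String extracted = "\n\t\tBaidu\n\t\n\t\t\tNews\n";
        System.out.println(normalize(extracted)); // prints "Baidu News"
    }
}
```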

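  You may also have noticed that toHtml, toHtml(true) and toHtml(false) produce the same result. That is indeed the case. Tracing through the HTMLParser source shows that AbstractNode, which implements Node, implements toHtml() by calling toHtml(false) directly, and that none of AbstractNode's three subclasses, RemarkNode, TagNode and TextNode, handles the verbatim parameter in its implementation of toHtml(boolean verbatim). The three calls therefore return exactly the same result. Unless you need some special handling of your own, simply using toHtml is enough.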
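
The delegation described above can be pictured with the following minimal sketch. It models only the behavior this article reports and is not copied from the HTMLParser source; SketchTextNode is a hypothetical class.

```java
class VerbatimDemo {
    // Hypothetical node whose toHtml overloads behave as described above.
    static class SketchTextNode {
        final String text;

        SketchTextNode(String text) {
            this.text = text;
        }

        // The no-argument form simply delegates with verbatim = false.
        String toHtml() {
            return toHtml(false);
        }

        // The flag is accepted but never consulted, so both values give the same result.
        String toHtml(boolean verbatim) {
            return text;
        }
    }

    public static void main(String[] args) {
        SketchTextNode node = new SketchTextNode("<b>hi</b>");
        System.out.println(node.toHtml().equals(node.toHtml(true)));       // prints "true"
        System.out.println(node.toHtml(true).equals(node.toHtml(false))); // prints "true"
    }
}
```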

The inheritance hierarchy of the HTML Node classes is shown in the figure below (copied from another article).

[Figure: inheritance hierarchy of the Node classes]

 
