
Qt Notes: Parsing HTML in C++

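When I used to write crawlers, I always parsed pages in a crude way: I just looked for patterns in the raw text. Later I came across gumbo online, tried it, and found that it works really well, so I am recommending it here.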

The following is reposted from: http://blog.csdn.net/whyistao/article/details/37919581

1. C++ does not seem to have many HTML parsing libraries to choose from. In the end I tried integrating htmlcxx into Qt. At first I wrote INCLUDEPATH += <path> in the .pro file, but that had no effect. It turned out that it is enough to add htmlcxx's source (.cc) and header (.h) files to SOURCES and HEADERS, like this:

    SOURCES += main.cpp \
        html/utils.cc \
        html/Uri.cc \
        html/ParserSax.cc \
        html/ParserDom.cc \
        html/Node.cc \
        html/Extensions.cc

    HEADERS += mainwindow.h \
        html/utils.h \
        html/Uri.h \
        html/tree.h \
        html/ParserSax.h \
        html/ParserDom.h \
        html/Node.h \
        html/Extensions.h \
        html/debug.h \
        html/ci_string.h \
        html/wincstring.h \
        html/tld.h

Reference: htmlcxx for qt(mingw)  http://blog.chinaunix.net/uid-21525518-id-1824657.html

2. Parsing with gumbo. The .c and .h files are added to the project the same way as above. Some notes on the commonly used gumbo types:

GumboOutput: the result of parsing the HTML source. output->root is the root node.

    GumboOutput* output = gumbo_parse(htmlString.c_str());
    GumboNode* node = output->root;

GumboNode: a node.

    GumboNode* node;
    node->v.text.text          // the text of a text node
    node->v.element.children   // the child list of an element node
    node->type                 // the type of the node

GumboVector: a container of nodes. For example, the child list of an element node can be obtained like this:

    GumboVector* children = &node->v.element.children;
    (GumboNode*) (children->data[i])   // the i-th node in the list

GumboAttribute: a node attribute.

    GumboAttribute* href;
    if (node->v.element.tag == GUMBO_TAG_A &&
        (href = gumbo_get_attribute(&node->v.element.attributes, "href"))) {
      std::cout << href->value << std::endl;
    }

Node types (as defined by the DOM):

    ELEMENT_NODE: an ordinary element node, such as <html>, <p>, <div>, <span>, <img>
    ATTRIBUTE_NODE: an element attribute
    TEXT_NODE: a text node
    CDATA_SECTION_NODE: i.e. <![CDATA[ ]]>
    ENTITY_REFERENCE_NODE: an entity reference, such as &amp;
    ENTITY_NODE: an entity, such as <!ENTITY copyright "Copyright 2010, impng. All rights reserved">
    PROCESSING_INSTRUCTION_NODE: a PI (processing instruction), such as <?xml version="1.0"?>
    COMMENT_NODE: a comment, <!-- -->
    DOCUMENT_NODE: the root node, i.e. document.nodeType
    DOCUMENT_TYPE_NODE: the DTD / document type, <!DOCTYPE >
    DOCUMENT_FRAGMENT_NODE: a document fragment
    NOTATION_NODE: a notation defined in the DTD

In gumbo code, a node's type is one of the following (usage: node->type == GUMBO_NODE_ELEMENT):

    typedef enum {
      /** Document node.  v will be a GumboDocument. */
      GUMBO_NODE_DOCUMENT,
      /** Element node.  v will be a GumboElement. */
      GUMBO_NODE_ELEMENT,
      /** Text node.  v will be a GumboText. */
      GUMBO_NODE_TEXT,
      /** CDATA node. v will be a GumboText. */
      GUMBO_NODE_CDATA,
      /** Comment node.  v. will be a GumboText, excluding comment delimiters. */
      GUMBO_NODE_COMMENT,
      /** Text node, where all contents is whitespace.  v will be a GumboText. */
      GUMBO_NODE_WHITESPACE
    } GumboNodeType;

Tag types (usage: node->v.element.tag != GUMBO_TAG_SCRIPT):

    typedef enum {
      // http://www.whatwg.org/specs/web-apps/current-work/multipage/semantics.html#the-root-element
      GUMBO_TAG_HTML,
      // http://www.whatwg.org/specs/web-apps/current-work/multipage/semantics.html#document-metadata
      GUMBO_TAG_HEAD, GUMBO_TAG_TITLE, GUMBO_TAG_BASE, GUMBO_TAG_LINK,
      GUMBO_TAG_META, GUMBO_TAG_STYLE,
      // http://www.whatwg.org/specs/web-apps/current-work/multipage/scripting-1.html#scripting-1
      GUMBO_TAG_SCRIPT, GUMBO_TAG_NOSCRIPT, GUMBO_TAG_TEMPLATE,
      // http://www.whatwg.org/specs/web-apps/current-work/multipage/sections.html#sections
      GUMBO_TAG_BODY, GUMBO_TAG_ARTICLE, GUMBO_TAG_SECTION, GUMBO_TAG_NAV,
      GUMBO_TAG_ASIDE, GUMBO_TAG_H1, GUMBO_TAG_H2, GUMBO_TAG_H3,
      GUMBO_TAG_H4, GUMBO_TAG_H5, GUMBO_TAG_H6, GUMBO_TAG_HGROUP,
      GUMBO_TAG_HEADER, GUMBO_TAG_FOOTER, GUMBO_TAG_ADDRESS,
      // http://www.whatwg.org/specs/web-apps/current-work/multipage/grouping-content.html#grouping-content
      GUMBO_TAG_P, GUMBO_TAG_HR, GUMBO_TAG_PRE, GUMBO_TAG_BLOCKQUOTE,
      GUMBO_TAG_OL, GUMBO_TAG_UL, GUMBO_TAG_LI, GUMBO_TAG_DL,
      GUMBO_TAG_DT, GUMBO_TAG_DD, GUMBO_TAG_FIGURE, GUMBO_TAG_FIGCAPTION,
      GUMBO_TAG_MAIN, GUMBO_TAG_DIV,
      // http://www.whatwg.org/specs/web-apps/current-work/multipage/text-level-semantics.html#text-level-semantics
      GUMBO_TAG_A, GUMBO_TAG_EM, GUMBO_TAG_STRONG, GUMBO_TAG_SMALL,
      GUMBO_TAG_S, GUMBO_TAG_CITE, GUMBO_TAG_Q, GUMBO_TAG_DFN,
      GUMBO_TAG_ABBR, GUMBO_TAG_DATA, GUMBO_TAG_TIME, GUMBO_TAG_CODE,
      GUMBO_TAG_VAR, GUMBO_TAG_SAMP, GUMBO_TAG_KBD, GUMBO_TAG_SUB,
      GUMBO_TAG_SUP, GUMBO_TAG_I, GUMBO_TAG_B, GUMBO_TAG_U,
      GUMBO_TAG_MARK, GUMBO_TAG_RUBY, GUMBO_TAG_RT, GUMBO_TAG_RP,
      GUMBO_TAG_BDI, GUMBO_TAG_BDO, GUMBO_TAG_SPAN, GUMBO_TAG_BR,
      GUMBO_TAG_WBR,
      // http://www.whatwg.org/specs/web-apps/current-work/multipage/edits.html#edits
      GUMBO_TAG_INS, GUMBO_TAG_DEL,
      // http://www.whatwg.org/specs/web-apps/current-work/multipage/embedded-content-1.html#embedded-content-1
      GUMBO_TAG_IMAGE, GUMBO_TAG_IMG, GUMBO_TAG_IFRAME, GUMBO_TAG_EMBED,
      GUMBO_TAG_OBJECT, GUMBO_TAG_PARAM, GUMBO_TAG_VIDEO, GUMBO_TAG_AUDIO,
      GUMBO_TAG_SOURCE, GUMBO_TAG_TRACK, GUMBO_TAG_CANVAS, GUMBO_TAG_MAP,
      GUMBO_TAG_AREA,
      // http://www.whatwg.org/specs/web-apps/current-work/multipage/the-map-element.html#mathml
      GUMBO_TAG_MATH, GUMBO_TAG_MI, GUMBO_TAG_MO, GUMBO_TAG_MN,
      GUMBO_TAG_MS, GUMBO_TAG_MTEXT, GUMBO_TAG_MGLYPH, GUMBO_TAG_MALIGNMARK,
      GUMBO_TAG_ANNOTATION_XML,
      // http://www.whatwg.org/specs/web-apps/current-work/multipage/the-map-element.html#svg-0
      GUMBO_TAG_SVG, GUMBO_TAG_FOREIGNOBJECT, GUMBO_TAG_DESC,
      // SVG title tags will have GUMBO_TAG_TITLE as with HTML.
      // http://www.whatwg.org/specs/web-apps/current-work/multipage/tabular-data.html#tabular-data
      GUMBO_TAG_TABLE, GUMBO_TAG_CAPTION, GUMBO_TAG_COLGROUP, GUMBO_TAG_COL,
      GUMBO_TAG_TBODY, GUMBO_TAG_THEAD, GUMBO_TAG_TFOOT, GUMBO_TAG_TR,
      GUMBO_TAG_TD, GUMBO_TAG_TH,
      // http://www.whatwg.org/specs/web-apps/current-work/multipage/forms.html#forms
      GUMBO_TAG_FORM, GUMBO_TAG_FIELDSET, GUMBO_TAG_LEGEND, GUMBO_TAG_LABEL,
      GUMBO_TAG_INPUT, GUMBO_TAG_BUTTON, GUMBO_TAG_SELECT, GUMBO_TAG_DATALIST,
      GUMBO_TAG_OPTGROUP, GUMBO_TAG_OPTION, GUMBO_TAG_TEXTAREA, GUMBO_TAG_KEYGEN,
      GUMBO_TAG_OUTPUT, GUMBO_TAG_PROGRESS, GUMBO_TAG_METER,
      // http://www.whatwg.org/specs/web-apps/current-work/multipage/interactive-elements.html#interactive-elements
      GUMBO_TAG_DETAILS, GUMBO_TAG_SUMMARY, GUMBO_TAG_MENU, GUMBO_TAG_MENUITEM,
      // Non-conforming elements that nonetheless appear in the HTML5 spec.
      // http://www.whatwg.org/specs/web-apps/current-work/multipage/obsolete.html#non-conforming-features
      GUMBO_TAG_APPLET, GUMBO_TAG_ACRONYM, GUMBO_TAG_BGSOUND, GUMBO_TAG_DIR,
      GUMBO_TAG_FRAME, GUMBO_TAG_FRAMESET, GUMBO_TAG_NOFRAMES, GUMBO_TAG_ISINDEX,
      GUMBO_TAG_LISTING, GUMBO_TAG_XMP, GUMBO_TAG_NEXTID, GUMBO_TAG_NOEMBED,
      GUMBO_TAG_PLAINTEXT, GUMBO_TAG_RB, GUMBO_TAG_STRIKE, GUMBO_TAG_BASEFONT,
      GUMBO_TAG_BIG, GUMBO_TAG_BLINK, GUMBO_TAG_CENTER, GUMBO_TAG_FONT,
      GUMBO_TAG_MARQUEE, GUMBO_TAG_MULTICOL, GUMBO_TAG_NOBR, GUMBO_TAG_SPACER,
      GUMBO_TAG_TT,
      // Used for all tags that don't have special handling in HTML.
      GUMBO_TAG_UNKNOWN,
      // A marker value to indicate the end of the enum, for iterating over it.
      // Also used as the terminator for varargs functions that take tags.
      GUMBO_TAG_LAST,
    } GumboTag;

3. While using gumbo I hit the error "RtlWerpReportException failed with status code :-1073741823". At first I thought it was a stack overflow, but it turned out that my own code logic was wrong. It is best to write your code by following the usage in the official demo:

    GumboAttribute* href;
    if (node->v.element.tag == GUMBO_TAG_A &&
        (href = gumbo_get_attribute(&node->v.element.attributes, "href"))) {
      std::cout << href->value << std::endl;
    }

4. Compiling gumbo produced the error "error: for loop initial declarations are only allowed in C99 mode", so these two lines need to be added to the project's .pro file:

    QMAKE_CFLAGS_DEBUG += --std=c99
    QMAKE_CFLAGS_RELEASE += --std=c99
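As background for item 4, here is a minimal sketch in C of the construct that GCC rejects with that message when it compiles in a pre-C99 mode. The function names sum_c99 and sum_c89 are made up for illustration and are not part of gumbo. The sketch only shows the language feature involved and does not reproduce any gumbo source.

```c
#include <stddef.h>

/* C99 style: the loop counter is declared inside the for initializer.
   In C89/C90 mode GCC reports "for loop initial declarations are only
   allowed in C99 mode" for the for line below. */
long sum_c99(const int *values, size_t count) {
    long total = 0;
    for (size_t i = 0; i < count; ++i) {
        total += values[i];
    }
    return total;
}

/* C89-compatible style: the counter is declared at the top of the block,
   so this version compiles in either mode. */
long sum_c89(const int *values, size_t count) {
    long total = 0;
    size_t i;
    for (i = 0; i < count; ++i) {
        total += values[i];
    }
    return total;
}
```

Both functions compute the same sum; only the first one needs C99 or later. Selecting the C99 standard through the compiler flags, as the two .pro lines in item 4 do, lets code written in the first style compile without modification.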

 

If you repost this, please credit: http://www.cnblogs.com/fnlingnzb-learner/p/5835428.html
