
lxml (XPath) in Python

bs4 (BeautifulSoup) really isn't as handy as this; its parse tree is too complicated.

lxml works very well.

Locating elements is excellent.

Detailed explanations are in the code comments.
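To show why XPath locating is so convenient, here is a minimal, self-contained sketch on a made-up inline snippet (the class name `books` and the hrefs are invented for illustration, not from the target site): one expression pinpoints every link under a specific `<ul>`.

```python
from lxml import etree

# A tiny inline document, purely for demonstration.
snippet = ('<div><ul class="books">'
           '<li><a href="/a1/">One</a></li>'
           '<li><a href="/a2/">Two</a></li>'
           '</ul></div>')
root = etree.HTML(snippet)

# One XPath expression grabs every href under the ul with class "books".
hrefs = root.xpath('//ul[@class="books"]/li/a/@href')
print(hrefs)  # ['/a1/', '/a2/']
```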

#!/usr/bin/python3.4
# -*- coding: utf-8 -*-

from lxml import etree
import urllib.request

# Take a look at the target page's HTML
url = "http://www.1kkk.com/manhua589/"
# Fetch the page
data = urllib.request.urlopen(url).read()
# Decode the bytes
html = data.decode("utf-8", "ignore")

page = etree.HTML(html.lower())

# The target markup looks like this:
"""
...
<ul class="sy_nr1 cplist_ullg">
    <li>
      <a href="/vol1-6871/" class="tg">第1卷</a>(96页)</li>
    <li>
      <a href="/vol2-6872/" class="tg">第2卷</a>(90页)</li>
    <li>
      <a href="/vol3-6873/" class="tg">第3卷</a>(95页)</li>
    <li>
      <a href="/vol4-6874/" class="tg">第4卷</a>(94页)</li>
    <li>
      <a href="/vol5-6875/" class="tg">第5卷</a>(95页)</li>
    ...
"""

# Find the href of each <a> inside the <ul>'s <li> items
hrefs = page.xpath('//ul[@class="sy_nr1 cplist_ullg"][2]/li/a/@href')

# Find the text between <a>...</a>
hrefnames = page.xpath('//ul[@class="sy_nr1 cplist_ullg"][2]/li/a/text()')

# Find the page counts
hrefpages = page.xpath('//ul[@class="sy_nr1 cplist_ullg"][2]/li/text()')

for href in hrefs:
    # Print each link
    print(href)
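Since the live site may have changed (or gone away) since this was written, the same three queries can be tried offline against the sample markup quoted above. This sketch reproduces a shortened version of that `<ul>` inline; note the `[2]` index from the original is dropped here because this fragment contains only one matching `<ul>`:

```python
from lxml import etree

# A shortened copy of the sample <ul> quoted above, inlined so the
# queries run without network access.
sample = """
<ul class="sy_nr1 cplist_ullg">
  <li><a href="/vol1-6871/" class="tg">第1卷</a>(96页)</li>
  <li><a href="/vol2-6872/" class="tg">第2卷</a>(90页)</li>
  <li><a href="/vol3-6873/" class="tg">第3卷</a>(95页)</li>
</ul>
"""
page = etree.HTML(sample)

# Same three queries as the script above, minus the [2] index.
hrefs = page.xpath('//ul[@class="sy_nr1 cplist_ullg"]/li/a/@href')
names = page.xpath('//ul[@class="sy_nr1 cplist_ullg"]/li/a/text()')
pages = page.xpath('//ul[@class="sy_nr1 cplist_ullg"]/li/text()')

# Pair each link with its title and page count.
for href, name, count in zip(hrefs, names, pages):
    print(name, href, count)
```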

Output:

/vol1-6871/
/vol2-6872/
/vol3-6873/
/vol4-6874/
/vol5-6875/
/vol6-6876/
/vol7-6877/
/vol8-6878/
/vol9-6879/
/vol10-6880/
/vol11-23456/
/vol12-23457/
/vol13-23695/
/vol14-28326/
/vol15-31740/
/ch145-149-33558/
/ch150-33559/
/ch151-197255/
/ch152-33560/
/ch153-33561/
/ch154-33562/
/ch155-33563/
/ch156-33564/
/ch157-33565/
...

 
