Notes on reading O'Reilly's Web Scraping with Python (June 2015): finding every href in a web page

2024-10-29 14:11:02, 207 views

1. Find every <a> tag, then check whether it has an href attribute. If it does, as in <a href="http://www.mamicode.com/">, extract the value of href.

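from urllib.request import urlopen
from bs4 import BeautifulSoup

# Fetch the page and parse it; naming the parser avoids bs4's "no parser specified" warning.
html = urlopen("http://en.wikipedia.org/wiki/Kevin_Bacon")
bsObj = BeautifulSoup(html, "html.parser")
# Visit every <a> tag and print its href when it has one.
for link in bsObj.findAll("a"):
    if 'href' in link.attrs:
        print(link.attrs['href'])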

Output:

[screenshot not preserved]

Where these links sit in the page source:

[screenshot not preserved]
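
The check described in step 1 can be tried offline, without fetching Wikipedia. The sketch below is not the book's code: it swaps BeautifulSoup for the standard library's html.parser so that it has no third-party dependency, and the names HrefCollector, collect_hrefs and SAMPLE_HTML, as well as the sample markup itself, are made up for illustration.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


# Hypothetical sample markup, not taken from the Wikipedia page.
SAMPLE_HTML = """
<p><a href="/wiki/Footloose">Footloose</a>
<a name="top">anchor without href</a>
<a href="http://example.com/">external</a></p>
"""


def collect_hrefs(html_text):
    """Return the href values of all <a> tags in html_text, in document order."""
    parser = HrefCollector()
    parser.feed(html_text)
    parser.close()
    return parser.hrefs


print(collect_hrefs(SAMPLE_HTML))
```

The second <a> tag has no href, so it is skipped, which is the same decision the `'href' in link.attrs` test makes in the BeautifulSoup version.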

2. Extract only the links that start with /wiki/

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen("http://en.wikipedia.org/wiki/Kevin_Bacon")
bsObj = BeautifulSoup(html,"lxml")
# Look only inside the article body, and keep hrefs that start with /wiki/ and contain no colon.
for link in bsObj.find("div", {"id":"bodyContent"}).findAll("a",href=re.compile("^(/wiki/)((?!:).)*$")):
    if 'href' in link.attrs:
        print(link.attrs['href'])

Output:

[screenshot not preserved]
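
To see what the regular expression in step 2 accepts and rejects, it can be run on a handful of strings by hand. This is only a sketch: the candidate paths are invented for illustration, and WIKI_ARTICLE and article_links are hypothetical names. The pattern itself is the one used above: the href must start with /wiki/, and the negative lookahead (?!:) rejects any href containing a colon, which filters out namespace pages such as Category: or Talk:.

```python
import re

# Same pattern as in the step 2 code.
WIKI_ARTICLE = re.compile("^(/wiki/)((?!:).)*$")

# Hypothetical hrefs of the kinds that show up on a Wikipedia page.
candidates = [
    "/wiki/Kevin_Bacon",
    "/wiki/Category:American_actors",
    "/wiki/Talk:Kevin_Bacon",
    "//wikimediafoundation.org/",
    "#cite_note-1",
    "/wiki/Footloose_(1984_film)",
]


def article_links(hrefs):
    """Keep only the hrefs that the pattern treats as article links."""
    return [h for h in hrefs if WIKI_ARTICLE.search(h)]


for href in article_links(candidates):
    print(href)
```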

3. Hop from page to page, extracting the links that start with /wiki/ on each one

from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import random
import re

# Seed from the current time; recent Python versions reject a datetime object as a seed.
random.seed(datetime.datetime.now().timestamp())

def getLinks(articleUrl):
    # Return the article links (/wiki/ paths with no colon) found in the page body.
    html = urlopen("http://en.wikipedia.org"+articleUrl)
    bsObj = BeautifulSoup(html,"lxml")
    return bsObj.find("div", {"id":"bodyContent"}).findAll("a",href=re.compile("^(/wiki/)((?!:).)*$"))

links = getLinks("/wiki/Kevin_Bacon")
# Keep hopping to a randomly chosen article until a page has no article links.
while len(links) > 0:
    newArticle = links[random.randint(0, len(links)-1)].attrs["href"]
    print(newArticle)
    links = getLinks(newArticle)

Output:

[screenshot not preserved]
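
The loop in step 3 only stops when a page has no article links, so in practice it runs until something fails. A bounded variant might look like the sketch below. Everything here is an assumption for illustration: FAKE_LINKS is a made-up link graph standing in for getLinks (it returns plain strings rather than BeautifulSoup tags), and random_walk and max_steps are hypothetical names, not part of the book's code.

```python
import random

# Hypothetical link graph standing in for getLinks(); keys and values are made up.
FAKE_LINKS = {
    "/wiki/A": ["/wiki/B", "/wiki/C"],
    "/wiki/B": ["/wiki/A", "/wiki/C"],
    "/wiki/C": [],
}


def random_walk(start, get_links, max_steps=10, rng=random):
    """Follow random links from start, stopping at a dead end or after max_steps hops."""
    path = []
    links = get_links(start)
    while links and len(path) < max_steps:
        next_article = rng.choice(links)
        path.append(next_article)
        links = get_links(next_article)
    return path


walk = random_walk("/wiki/A", lambda url: FAKE_LINKS.get(url, []), max_steps=5)
print(walk)
```

Passing the link-fetching function in as a parameter keeps the walk logic testable without network access; a real run would pass a function that wraps getLinks and returns the href strings.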

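After running for a while, the script fails with the error "An existing connection was forcibly closed by the remote host". Is the site refusing the program's connections?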
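
These notes do not establish why the connection was closed. The error only says that the remote end (or something in between) dropped it, which can happen when many requests are sent in quick succession. One possible mitigation, sketched below under that assumption, is to pause and retry when a request fails. The function fetch_with_retry and its parameters (opener, retries, delay, sleep) are hypothetical names introduced for this example; ConnectionError, URLError, urlopen and time.sleep are standard library.

```python
import time
from urllib.error import URLError
from urllib.request import urlopen


def fetch_with_retry(url, opener=urlopen, retries=3, delay=2.0, sleep=time.sleep):
    """Call opener(url), retrying after a growing pause if the connection fails."""
    last_error = None
    for attempt in range(retries):
        try:
            return opener(url)
        except (ConnectionError, URLError) as err:
            # ConnectionResetError (the "forcibly closed" case) is a ConnectionError.
            last_error = err
            if attempt < retries - 1:
                # Wait longer after each failure: delay, 2*delay, ...
                sleep(delay * (attempt + 1))
    raise last_error
```

In the step 3 script, this would be used by replacing the urlopen call inside getLinks with fetch_with_retry. Adding a short pause between successful requests as well may reduce the chance of the connection being dropped in the first place, though that is a guess rather than something verified here.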
