Python Web Crawler (Part 3)


2024-08-29 22:44:14, 216 views

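Crawling an Entire Domain

Six degrees of separation: the theory that any two strangers are separated by no more than six people, which means you can get to know any stranger through at most five intermediaries. Wikipedia works in a similar way: by following links, we can get from one person's article to the article of anyone else we want to reach.

1. Getting all the links on a page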

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://en.wikipedia.org/wiki/Kevin_Bacon")
bsObj = BeautifulSoup(html, "html.parser")
# Print the href of every <a> tag that has one
for link in bsObj.find_all("a"):
    if "href" in link.attrs:
        print(link.attrs["href"])


2. Getting the articles related to the current Wikipedia person

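1. Exclude the links that appear on every page: the sidebar, footer, and header links, as well as category pages and talk pages.

2. Links from the current page to other article pages all share the following traits:

I. They are inside the div whose id is bodyContent.

II. The URL contains no colon and starts with /wiki/.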
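To see what rule II accepts and rejects, the pattern can be tried on a few strings without touching the network. This is only a small sketch: the sample hrefs below are made up for illustration, and only the regular expression itself comes from the crawler code in this post.

```python
import re

# Same pattern as in this post: starts with /wiki/ and has no colon after that
ARTICLE_LINK = re.compile(r"^(/wiki/)((?!:).)*$")

# Hypothetical sample hrefs, invented for this example
samples = [
    "/wiki/Kevin_Bacon",
    "/wiki/Category:American_actors",
    "/wiki/Talk:Kevin_Bacon",
    "//www.wikimedia.org/",
    "#cite_note-1",
]

# Keep only the hrefs that look like links to ordinary articles
article_links = [href for href in samples if ARTICLE_LINK.match(href)]
print(article_links)
```

The colon check is what filters out namespaced pages such as Category: and Talk:, while the /wiki/ prefix drops external links and in-page anchors.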

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen("http://en.wikipedia.org/wiki/Kevin_Bacon")
bsObj = BeautifulSoup(html, "html.parser")
# Only look inside div#bodyContent, and only at hrefs that start with /wiki/ and contain no colon
for link in bsObj.find("div", {"id": "bodyContent"}).find_all("a", href=re.compile("^(/wiki/)((?!:).)*$")):
    if "href" in link.attrs:
        print(link.attrs["href"])


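3. Following links in depth

Finding the links on a single Wikipedia page is not very useful on its own. It becomes much more useful if we start from the current page and keep following links in a loop.

1. Create a simple function that returns the links to all articles on the current page.

2. Create a main function that starts from one page, moves to one randomly chosen link, and then repeats the search from that new page until there are no more links.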

from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup
from random import choice
import re

basename = "http://en.wikipedia.org"

def getLinks(pagename):
    """Return the article links found on the given page (empty list on failure)."""
    url = basename + pagename
    try:
        with urlopen(url) as html:
            bsObj = BeautifulSoup(html, "html.parser")
            links = bsObj.find("div", {"id": "bodyContent"}).find_all("a", href=re.compile("^(/wiki/)((?!:).)*$"))
            return [link.attrs["href"] for link in links if "href" in link.attrs]
    except (HTTPError, AttributeError):
        # Return an empty list rather than None so the loop in main() can stop cleanly
        return []


def main():
    links = getLinks("/wiki/Kevin_Bacon")
    # Keep hopping to a random article until a page has no article links
    while len(links) > 0:
        nextpage = choice(links)
        print(nextpage)
        links = getLinks(nextpage)

main()
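On a site as large as Wikipedia this random walk may effectively never reach a page without links, so a step limit can be useful while experimenting. The following is a minimal sketch of that idea, not part of the original crawler: `random_walk`, `fake_get_links`, `FAKE_LINKS`, and the `max_steps` parameter are hypothetical names, and an in-memory dictionary stands in for real HTTP requests so the example runs offline.

```python
import random

# Hypothetical in-memory link graph standing in for getLinks(); not real Wikipedia data
FAKE_LINKS = {
    "/wiki/A": ["/wiki/B", "/wiki/C"],
    "/wiki/B": ["/wiki/A", "/wiki/C"],
    "/wiki/C": [],  # dead end: no article links
}

def fake_get_links(pagename):
    """Return the outgoing links of a page, or an empty list if the page is unknown."""
    return FAKE_LINKS.get(pagename, [])

def random_walk(start, get_links, max_steps=10, rng=None):
    """Follow randomly chosen links from start until a dead end or max_steps hops."""
    rng = rng or random.Random()
    path = [start]
    links = get_links(start)
    while links and len(path) <= max_steps:
        nextpage = rng.choice(links)
        path.append(nextpage)
        links = get_links(nextpage)
    return path

path = random_walk("/wiki/A", fake_get_links, max_steps=5, rng=random.Random(0))
print(path)
```

Passing the link function in as an argument means the same loop could be pointed at `getLinks` for a real run, with the step limit acting as a safety net.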


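4. Crawling the entire domain

1. Crawling an entire site has to start from the site's home page.

2. Pages that have already been visited must be recorded so that the same address is not visited twice.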

from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup
import re

basename = "http://en.wikipedia.org"
visitedpages = set()  # Use a set to store the addresses of pages already visited

def visitelink(pagename):
    """Recursively visit every article link reachable from the given page."""
    url = basename + pagename
    try:
        with urlopen(url) as html:
            bsObj = BeautifulSoup(html, "html.parser")
        links = bsObj.find("div", {"id": "bodyContent"}).find_all("a", href=re.compile("^(/wiki/)((?!:).)*$"))
        for eachlink in links:
            if "href" in eachlink.attrs:
                if eachlink.attrs["href"] not in visitedpages:
                    # A new page: record it, then crawl it
                    nextpage = eachlink.attrs["href"]
                    print(nextpage)
                    visitedpages.add(nextpage)
                    # Note: deep recursion can exceed Python's recursion limit on a large site
                    visitelink(nextpage)
    except (HTTPError, AttributeError):
        return None

# An empty page name means the site's home page
visitelink("")
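Because the crawl above calls itself once per newly found page, it can run into CPython's default recursion limit (1000 frames) on a large site. One possible alternative, sketched below, replaces the recursion with an explicit queue, which also changes the visiting order from depth-first to breadth-first. The names `crawl`, `get_links`, and `SITE` are hypothetical, and a small dictionary stands in for real HTTP requests so the example runs offline.

```python
from collections import deque

def crawl(start, get_links):
    """Visit every page reachable from start exactly once, breadth-first."""
    visited = {start}          # same role as visitedpages in the recursive version
    queue = deque([start])     # pages discovered but not yet processed
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for link in get_links(page):
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return order

# Hypothetical link graph used in place of real HTTP requests
SITE = {
    "": ["/wiki/A", "/wiki/B"],
    "/wiki/A": ["/wiki/B", "/wiki/C"],
    "/wiki/B": ["/wiki/A"],
    "/wiki/C": [""],
}

order = crawl("", lambda page: SITE.get(page, []))
print(order)
```

In a real run, `get_links` would be a function that downloads the page and returns its article links, in the same way as the `getLinks` function from section 3.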


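5. Collecting useful information from the site

1. Nothing special here: while visiting each page, the crawler prints its h1 and some of its text.

2. A problem that came up when printing:

UnicodeEncodeError: 'gbk' codec can't encode character u'\xa9' in position 24051: illegal multibyte sequence

Fix: call encode('GB18030') on the text before printing it.

Explanation: GB18030 is a superset of GBK, so it can encode characters that GBK cannot. Note that in Python 3, encode returns a bytes object, so print displays it as a bytes literal (b'...') rather than as readable text.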
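The difference between the two encodings can be checked directly on the character from the error message. The snippet below is only an illustrative sketch: the helper `printable` and the sample string are hypothetical, and replacing unencodable characters with `?` is a different technique from the encode('GB18030') fix described above, shown here because it keeps the result as a normal string.

```python
def printable(text, encoding):
    """Return text with every character the encoding cannot represent replaced by '?'."""
    return text.encode(encoding, errors="replace").decode(encoding)

# Hypothetical sample text; \xa9 is the copyright sign from the error message
sample = "Copyright \xa9 Wikipedia"

as_gbk = printable(sample, "gbk")          # lossy: GBK has no copyright sign
as_gb18030 = printable(sample, "gb18030")  # lossless: GB18030 covers all of Unicode

print(as_gbk)
print(as_gb18030)
```

The GBK round trip loses the copyright sign, while the GB18030 round trip returns the original string unchanged.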

from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup
import re

basename = "http://en.wikipedia.org"
visitedpages = set()  # Use a set to store the addresses of pages already visited

def visitelink(pagename):
    """Print the title and first paragraph of a page, then crawl its article links."""
    url = basename + pagename
    try:
        with urlopen(url) as html:
            bsObj = BeautifulSoup(html, "html.parser")
        try:
            print(bsObj.h1.get_text())
            # Encode as GB18030 so that printing does not fail on a GBK console
            print(bsObj.find("div", {"id": "mw-content-text"}).find("p").get_text().encode("GB18030"))
        except AttributeError:
            # The page has no h1 or no paragraph in the expected place
            print("AttributeError")
        links = bsObj.find("div", {"id": "bodyContent"}).find_all("a", href=re.compile("^(/wiki/)((?!:).)*$"))
        for eachlink in links:
            if "href" in eachlink.attrs:
                if eachlink.attrs["href"] not in visitedpages:
                    # A new page: record it, then crawl it
                    nextpage = eachlink.attrs["href"]
                    print(nextpage)
                    visitedpages.add(nextpage)
                    visitelink(nextpage)
    except (HTTPError, AttributeError):
        return None

# An empty page name means the site's home page
visitelink("")

