Building a Simple Web Crawler in Python - Basic Framework

2024-09-04 06:34:36, 217 views

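1. Goal: build a lightweight crawler (pages that require login or load content asynchronously with JavaScript are out of scope)

    Scope: static web pages that can be fetched without logging in

2. Contents:

    2.1 Introduction to crawlers

    2.2 A simple crawler architecture

    2.3 URL manager

    2.4 Web page downloader (urllib2)

    2.5 Web page parser (BeautifulSoup)

    2.6 Complete example: crawling 1000 pages related to the Python entry on Baidu Baike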

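3. Introduction to crawlers: a crawler is a program that automatically fetches information from the internet

    The value of a crawler: it puts internet data to work for you.

4. A simple crawler architecture:

    [Figure: crawler architecture diagram]

    Run flow:

    [Figure: run-flow diagram]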
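
The run flow can be sketched as a single loop that ties the three components from the contents list together: a URL manager, a downloader, and a parser. This is only a minimal sketch in Python 3 under stated assumptions: `crawl`, `fake_download`, `fake_parse`, and the in-memory `FAKE_SITE` are hypothetical names made up for illustration, and the downloader and parser are passed in as plain functions so the loop runs without network access.

```python
from collections import deque


def crawl(root_url, download, parse, max_pages=10):
    """Run a minimal crawl loop and return {url: extracted_data}.

    download(url) -> page text or None; parse(url, text) -> (new_urls, data).
    Both callables are hypothetical stand-ins for the downloader and parser.
    """
    pending = deque([root_url])   # URLs waiting to be fetched
    seen = {root_url}             # every URL ever queued, to avoid repeats and loops
    results = {}

    while pending and len(results) < max_pages:
        url = pending.popleft()
        text = download(url)
        if text is None:          # download failed, skip this page
            continue
        new_urls, data = parse(url, text)
        results[url] = data
        for new_url in new_urls:
            if new_url not in seen:
                seen.add(new_url)
                pending.append(new_url)
    return results


# An in-memory "site" so the sketch runs offline (made-up data).
FAKE_SITE = {
    "/a": ("page A", ["/b", "/c"]),
    "/b": ("page B", ["/a"]),     # links back to /a: a cycle
    "/c": ("page C", []),
}


def fake_download(url):
    # Pretend the page body is just its URL; unknown URLs fail to download.
    return url if url in FAKE_SITE else None


def fake_parse(url, text):
    title, links = FAKE_SITE[url]
    return links, title


pages = crawl("/a", fake_download, fake_parse)
print(sorted(pages))
```

Even though `/b` links back to `/a`, the loop stops once every reachable page has been visited, because a URL is only queued the first time it is seen.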

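5. URL manager: manages the set of URLs waiting to be crawled and the set of URLs already crawled

    - Prevents fetching the same page twice and prevents crawling in cycles

    - Implementation options:

    [Figure: URL manager implementation options]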
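
One possible implementation keeps both collections as in-memory Python sets. That choice is an assumption made here for illustration; the class name `UrlManager` and its method names are hypothetical, not taken from the original notes. A minimal Python 3 sketch:

```python
class UrlManager:
    """Tracks URLs waiting to be crawled and URLs already crawled."""

    def __init__(self):
        self.new_urls = set()   # waiting to be crawled
        self.old_urls = set()   # already crawled

    def add_new_url(self, url):
        # Ignore empty values and anything already known in either set.
        if url and url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        for url in urls or []:
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) > 0

    def get_new_url(self):
        # Move one URL from the pending set to the crawled set and return it.
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url


manager = UrlManager()
manager.add_new_urls(["http://example.com/a", "http://example.com/a"])
url = manager.get_new_url()
manager.add_new_url(url)   # already crawled, so it is not queued again
print(manager.has_new_url())
```

Checking both sets in `add_new_url` is what prevents repeat and cyclic crawling: a URL that has already been handed out can never re-enter the pending set.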

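6. Web page downloader: a tool that downloads the page behind an internet URL to the local machine

    - Types:

    [Figure: types of web page downloaders]

    - Ways to download a page with urllib2:

        1. The simplest way: url ===> urllib2.urlopen(url)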

import urllib2

# Send the request directly
response = urllib2.urlopen('http://www.baidu.com')

# Get the status code; 200 means the request succeeded
print response.getcode()

# Read the content
cont = response.read()

        2. Adding data and HTTP headers: (url, data, header) ===> urllib2.Request ===> urllib2.urlopen(request)

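import urllib, urllib2

# Example URL so the snippet runs on its own
url = 'http://www.baidu.com'

# Create the Request object
request = urllib2.Request(url)

# Add data: add_data takes a single URL-encoded string and turns the request into a POST
request.add_data(urllib.urlencode({'a': '1'}))

# Add an HTTP header
request.add_header('User-Agent', 'Mozilla/5.0')

# Send the request and get the result
response = urllib2.urlopen(request)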

        3. Adding a handler for special scenarios, for example cookies:

import urllib2, cookielib

# Create a cookie container
cj = cookielib.CookieJar()

# Create an opener
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

# Install the opener into urllib2
urllib2.install_opener(opener)

# Fetch the page with the cookie-enabled urllib2
response = urllib2.urlopen("http://www.baidu.com/")
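
The snippets above target Python 2. In Python 3 the same functionality lives in `urllib.request` and `urllib.parse`, so method 2 (data plus headers) might look roughly like the sketch below. The helper name `build_request` and the example URL are assumptions made for illustration, and the request is only built, not sent, so the sketch runs offline.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_request(url, params=None, user_agent="Mozilla/5.0"):
    """Build (but do not send) a request carrying form data and a User-Agent header."""
    # Form data must be URL-encoded bytes; supplying data makes the request a POST.
    data = urlencode(params).encode("ascii") if params else None
    return Request(url, data=data, headers={"User-Agent": user_agent})


request = build_request("http://www.example.com/", {"a": "1"})
print(request.get_method())
print(request.data)
print(request.get_header("User-agent"))
```

Passing the result to `urllib.request.urlopen(request)` would send it. `Request` stores header names in capitalized form, which is why the lookup uses `"User-agent"`.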

7. urllib2 example code:

# -*- coding: utf-8 -*-
"""
Created on Tue Feb 14 10:31:06 2017

@author: Wayne
"""
import urllib2, cookielib

url = "http://www.baidu.com"

print "the 1st method"
response1 = urllib2.urlopen(url)
print response1.getcode()
print len(response1.read())

print "the 2nd method"
request = urllib2.Request(url)
request.add_header("user-agent", "Mozilla/5.0")
response2 = urllib2.urlopen(request)
print response2.getcode()
print len(response2.read())

print "the 3rd method"
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)  # without this, urlopen() does not use the cookie-enabled opener
response3 = urllib2.urlopen(url)
print response3.getcode()
print cj
print response3.read()

8. Web page parser: a tool that extracts valuable data from a web page

    Python's web page parsers:

    [Figure: overview of Python web page parsers]
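
One parser that ships with Python itself is the standard library's `html.parser`. As a rough illustration of what "extracting valuable data" means for a crawler, the Python 3 sketch below pulls every link out of a page and resolves it to an absolute URL. The class name `LinkExtractor`, the sample markup, and the base URL are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the absolute URL of every <a href=...> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the page's own URL.
            self.links.append(urljoin(self.base_url, href))


page = (
    '<p><a href="/view/123.htm">Python</a> '
    '<a href="http://example.com/x">x</a> <a>no href</a></p>'
)
extractor = LinkExtractor("http://example.com/index.html")
extractor.feed(page)
print(extractor.links)
```

The resolved links are exactly what a URL manager would receive as new candidates to crawl.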

    Structured parsing with a DOM (Document Object Model) tree:

    [Figure: DOM tree of an HTML document]
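
To make the tree idea concrete, here is a small Python 3 sketch that loads a document into a DOM with the standard library's `xml.dom.minidom` and prints its element hierarchy. The sample document and the helper `outline` are assumptions for illustration. `minidom` only accepts well-formed XML, so this shows the tree structure itself rather than a way to parse arbitrary real-world HTML.

```python
from xml.dom import minidom

# A small, well-formed document; real-world HTML is often not valid XML,
# which is one reason to use a dedicated HTML parser in practice.
doc = minidom.parseString(
    "<html><head><title>Demo</title></head>"
    "<body><p>Hello <a href='1.html'>Python</a></p></body></html>"
)


def outline(node, depth=0):
    """Return the element tree as a list of indented tag names."""
    lines = []
    if node.nodeType == node.ELEMENT_NODE:
        lines.append("  " * depth + node.tagName)
        depth += 1
    for child in node.childNodes:
        lines.extend(outline(child, depth))
    return lines


print("\n".join(outline(doc)))

# Elements are reachable by tag name; attributes and text hang off the nodes.
link = doc.getElementsByTagName("a")[0]
print(link.getAttribute("href"), link.firstChild.data)
```

Each element becomes a node whose children are its nested elements and text, so a lookup such as "all `a` elements" is a search over the tree.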

9. Web page parser - Beautiful Soup

    9.1 Beautiful Soup

        - A third-party Python library for extracting data from HTML or XML

        - Official site: http://www.crummy.com/software/BeautifulSoup

    9.2 Install and test beautifulsoup4

        - Install: pip install beautifulsoup4

        - Test: import bs4

    9.3 Beautiful Soup syntax

        [Figure: Beautiful Soup syntax]

    9.4 Creating a BeautifulSoup object

from bs4 import BeautifulSoup
# Create a BeautifulSoup object from an HTML document string
soup = BeautifulSoup(
                     html_doc,                     # the HTML document string
                     'html.parser',                # the HTML parser to use
                     from_encoding='utf-8'         # the encoding of the HTML document
                     )

    9.5 Searching for nodes (find_all, find)

# Signature: find_all(name, attrs, string)
# Find all nodes whose tag is a
soup.find_all('a')

# Find all a nodes whose link has the form /view/123.htm
soup.find_all('a', href='/view/123.htm')

    9.6 Accessing node information

# Given a node: <a href='1.html'>Python</a>
node.name, node['href'], node.get_text()  # tag name, href attribute, link text

10. BeautifulSoup example

# -*- coding: utf-8 -*-
"""
Created on Tue Feb 14 11:00:42 2017

@author: Wayne
"""

from bs4 import BeautifulSoup
import re

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser', from_encoding='utf-8')

print '\n## Get all the links'
links = soup.find_all('a')
for link in links:
    print link.name, link['href'], link.get_text()


print '\n## Get the link to "lacie"'
link_node = soup.find('a', href='http://example.com/lacie')
print link_node.name, link_node['href'], link_node.get_text()


print '\n## Get the link whose href matches the regex "ill"'
link_node = soup.find('a', href=re.compile(r"ill"))
print link_node.name, link_node['href'], link_node.get_text()


print '\n## Get the text of the "title" paragraph'
p_node = soup.find('p', class_='title')
print p_node.name, p_node.get_text()
