Web Crawling with Scrapy

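Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a wide range of programs, including data mining, information processing, and archiving historical data.
It was originally designed for page scraping (more precisely, web scraping), but it can also be used to fetch data returned by APIs (such as Amazon Associates Web Services) or as a general-purpose web crawler. Typical uses include data mining, monitoring, and automated testing.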

 

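Scrapy uses the Twisted asynchronous networking library to handle network communication. The overall architecture is roughly as follows:

[Figure: Scrapy architecture diagram]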

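  • Engine (Scrapy)
           Controls the data flow through the whole system and triggers events (the core of the framework).
  • Scheduler
           Accepts requests sent by the engine, pushes them onto a queue, and returns them when the engine asks again. Think of it as a priority queue of URLs (the addresses of the pages to crawl): it decides which URL is crawled next and also removes duplicate URLs.
  • Downloader
           Downloads page content and returns it to the spiders (the Scrapy downloader is built on Twisted's efficient asynchronous model).
  • Spiders
           The spiders do the main work: they extract the information you need, the so-called items, from specific pages. They can also extract links so that Scrapy goes on to crawl the next page.
  • Item Pipeline
           Processes the items that the spiders extract from pages. Its main jobs are persisting items, validating them, and removing unwanted information. After a page is parsed by a spider, its items are sent to the item pipeline and processed by several components in a defined order.
  • Downloader Middlewares
          A layer between the Scrapy engine and the downloader that processes the requests and responses passing between them.
  • Spider Middlewares
          A layer between the Scrapy engine and the spiders that processes the spiders' response input and request output.
  • Scheduler Middlewares
      A layer between the Scrapy engine and the scheduler that processes the requests and responses sent from the engine to the scheduler.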

 

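A Scrapy run proceeds roughly as follows:

  1. The engine takes a link (URL) from the scheduler for the next crawl.
  2. The engine wraps the URL in a request (Request) and passes it to the downloader.
  3. The downloader fetches the resource and wraps it in a response (Response).
  4. The spider parses the Response.
  5. If an item (Item) is parsed out, it is handed to the item pipeline for further processing.
  6. If a link (URL) is parsed out, the URL is handed to the scheduler to wait for crawling.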
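
The six steps above can be sketched without Scrapy at all. The following is a minimal, standard-library-only simulation, not Scrapy's actual implementation: FAKE_WEB, Scheduler, download, parse, and run are all hypothetical names invented for this illustration, and the "downloader" simply reads from an in-memory dictionary.

```python
import heapq

# Hypothetical in-memory "web": URL -> (page title, outgoing links).
FAKE_WEB = {
    "http://example.com/": ("home", ["http://example.com/a", "http://example.com/b"]),
    "http://example.com/a": ("page a", ["http://example.com/b"]),
    "http://example.com/b": ("page b", ["http://example.com/"]),
}


class Scheduler:
    """Priority queue of URLs that also drops duplicates."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = 0  # tie-breaker: equal priorities keep insertion order

    def enqueue(self, url, priority=0):
        if url in self._seen:
            return
        self._seen.add(url)
        heapq.heappush(self._heap, (priority, self._counter, url))
        self._counter += 1

    def next_url(self):
        return heapq.heappop(self._heap)[2] if self._heap else None


def download(url):
    """Stand-in for the downloader: builds a 'response' dict."""
    title, links = FAKE_WEB[url]
    return {"url": url, "title": title, "links": links}


def parse(response):
    """Stand-in for a spider callback: yields one item, then the links."""
    yield {"title": response["title"]}
    for link in response["links"]:
        yield link


def run(start_url):
    scheduler, items = Scheduler(), []
    scheduler.enqueue(start_url)
    while True:
        url = scheduler.next_url()          # step 1
        if url is None:
            break
        response = download(url)            # steps 2 and 3
        for result in parse(response):      # step 4
            if isinstance(result, dict):
                items.append(result)        # step 5: hand the item on
            else:
                scheduler.enqueue(result)   # step 6: queue the link
    return items


crawled = run("http://example.com/")
print(crawled)
```

Because the scheduler remembers every URL it has accepted, the cycle between the three pages terminates after each page has been visited once.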

Installation:

# Scrapy depends on pywin32, pyOpenSSL, Twisted, lxml, and zope.interface. (Read the error messages carefully while installing.)

# Install wheel
pip3 install wheel -i http://pypi.douban.com/simple --trusted-host pypi.douban.com

# Install this dependency first; Twisted cannot be installed without it
pip3 install Incremental -i http://pypi.douban.com/simple --trusted-host pypi.douban.com

# Then install Twisted with pip3. This may still fail with an error (resolve any other dependency problems it reports).
pip3 install Twisted -i http://pypi.douban.com/simple --trusted-host pypi.douban.com

# If it fails, change to the directory that holds the downloaded wheel file and install from that file instead
pip3 install Twisted-17.1.0-cp35-cp35m-win32.whl

# Install Scrapy
pip3 install scrapy -i http://pypi.douban.com/simple --trusted-host pypi.douban.com

# pywin32
Download: https://sourceforge.net/projects/pywin32/files/

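Creating a project:

# Create the project
scrapy startproject xiaohuar


# Enter the project directory
cd xiaohuar


# Create the spider
scrapy genspider xiaohuar xiaohuar.com


# Run the spider (use the spider's name; --nolog suppresses log output)
scrapy crawl xiaohuar --nolog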

Directory layout:

project_name/
   scrapy.cfg
   project_name/
       __init__.py
       items.py
       pipelines.py
       settings.py
       spiders/
           __init__.py

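Explanation:

  • scrapy.cfg   Project configuration, mainly basic settings for the Scrapy command-line tool. (The settings that actually affect crawling live in settings.py.)
  • items.py     Defines the data storage templates used to structure data, similar to a Django Model.
  • pipelines.py Data-processing behavior, typically persisting the structured data.
  • settings.py  The settings file: recursion depth, concurrency, download delay, and so on.
  • spiders      The spider directory, where you create files and write the crawl rules.

Note: spider files are usually named after the domain of the site they crawl.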

Selectors:

#!/usr/bin/env python
# -*- coding:utf-8 -*-
# HtmlXPathSelector is deprecated and missing from current Scrapy releases; Selector covers the same use.
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
html = """<!DOCTYPE html>
<html>
    <head lang="en">
        <meta charset="UTF-8">
        <title></title>
    </head>
    <body>
        <ul>
            <li class="item-"><a id='i1' href="http://www.mamicode.com/link.html">first item</a></li>
            <li class="item-0"><a id='i2' href="http://www.mamicode.com/llink.html">first item</a></li>
            <li class="item-1"><a href="http://www.mamicode.com/llink2.html">second item<span>vv</span></a></li>
        </ul>
        <div><a href="http://www.mamicode.com/llink2.html">second item</a></div>
    </body>
</html>
"""
response = HtmlResponse(url='http://example.com', body=html, encoding='utf-8')
# Uncomment any pair of lines below to try that expression.
# hxs = Selector(response=response)
# print(hxs)
# hxs = Selector(response=response).xpath('//a')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[2]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[@id]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[@id="i1"]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[@href="http://www.mamicode.com/link.html"][@id="i1"]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[contains(@href, "link")]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[starts-with(@href, "link")]')
# print(hxs)
# hxs = Selector(response=response).xpath(r'//a[re:test(@id, "i\d+")]')
# print(hxs)
# hxs = Selector(response=response).xpath(r'//a[re:test(@id, "i\d+")]/text()').extract()
# print(hxs)
# hxs = Selector(response=response).xpath(r'//a[re:test(@id, "i\d+")]/@href').extract()
# print(hxs)
# hxs = Selector(response=response).xpath('/html/body/ul/li/a/@href').extract()
# print(hxs)
# hxs = Selector(response=response).xpath('//body/ul/li/a/@href').extract_first()
# print(hxs)

# ul_list = Selector(response=response).xpath('//body/ul/li')
# for item in ul_list:
#     v = item.xpath('./a/span')
#     # or
#     # v = item.xpath('a/span')
#     # or
#     # v = item.xpath('*/a/span')
#     print(v)
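
To get a feel for the simpler expressions without installing Scrapy, the sketch below runs a few of them through Python's built-in xml.etree.ElementTree, which supports only a limited subset of XPath. It is an approximation, not Scrapy's Selector: the DOC fragment and its example.com URLs are assumptions made for this illustration, and functions such as contains() or re:test() and attribute steps such as /@href are not available here, so attributes are read with .get() instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed fragment modelled on the list in the HTML above.
DOC = """<ul>
  <li class="item-"><a id="i1" href="http://example.com/link.html">first item</a></li>
  <li class="item-0"><a id="i2" href="http://example.com/llink.html">first item</a></li>
  <li class="item-1"><a href="http://example.com/llink2.html">second item<span>vv</span></a></li>
</ul>"""

root = ET.fromstring(DOC)  # root is the <ul> element

# Every <a> below the root, like //a
all_links = root.findall(".//a")

# Only the <a> elements that carry an id attribute, like //a[@id]
with_id = root.findall(".//a[@id]")

# The <a> whose id equals "i1", like //a[@id="i1"]
first = root.find(".//a[@id='i1']")

# Attribute values are read with .get() rather than an /@href step
hrefs = [a.get("href") for a in root.findall("./li/a")]

# Relative lookups from each <li>, like item.xpath('./a/span')
spans = [li.find("./a/span") for li in root.findall("./li")]

print(len(all_links), len(with_id), first.text)
print(hrefs)
print([s.text if s is not None else None for s in spans])
```

Relative paths starting with ./ are evaluated from the current element, which mirrors the loop over ul_list in the listing above.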

Custom extensions:

A custom extension uses signals to register specific operations at specific points in the crawl.

from scrapy import signals


class MyExtension(object):
    def __init__(self, value):
        self.value = value

    @classmethod
    def from_crawler(cls, crawler):
        # Read an integer setting (here named MMMM) from settings.py
        val = crawler.settings.getint('MMMM')
        ext = cls(val)

        # Register the handlers to run when a spider opens and closes
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)

        return ext

    def spider_opened(self, spider):
        print('open')

    def spider_closed(self, spider):
        print('close')
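
The connect-then-dispatch idea behind signals can be illustrated with a toy dispatcher. This is a simplified sketch and not how Scrapy implements signals internally; SignalManager, SPIDER_OPENED, SPIDER_CLOSED, and LoggingExtension are hypothetical names introduced only for this example.

```python
from collections import defaultdict


class SignalManager:
    """Toy dispatcher: maps each signal to the callables connected to it."""

    def __init__(self):
        self._receivers = defaultdict(list)

    def connect(self, receiver, signal):
        self._receivers[signal].append(receiver)

    def send(self, signal, **kwargs):
        # Call every receiver registered for this signal, in connection order
        return [receiver(**kwargs) for receiver in self._receivers[signal]]


# Signals are just unique sentinel objects in this sketch
SPIDER_OPENED = object()
SPIDER_CLOSED = object()


class LoggingExtension:
    def __init__(self):
        self.events = []

    def spider_opened(self, spider):
        self.events.append("open:" + spider)

    def spider_closed(self, spider):
        self.events.append("close:" + spider)


signals = SignalManager()
ext = LoggingExtension()
signals.connect(ext.spider_opened, signal=SPIDER_OPENED)
signals.connect(ext.spider_closed, signal=SPIDER_CLOSED)

# The "framework" fires the signals at the matching points of a run
signals.send(SPIDER_OPENED, spider="demo")
signals.send(SPIDER_CLOSED, spider="demo")
print(ext.events)
```

The extension never calls the framework; it only declares which handler belongs to which signal, and the framework decides when each one runs.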

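Custom duplicate filtering:

By default Scrapy uses scrapy.dupefilters.RFPDupeFilter to filter duplicate requests. The related settings are:

DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'
DUPEFILTER_DEBUG = False
JOBDIR = "directory that stores the log of visited requests, e.g. /root/"  # the final path is /root/requests.seen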

Custom filter:

# Loosely coupled; used to de-duplicate URLs
class RepeatFilter(object):
    def __init__(self):
        self.visited_set = set()
    @classmethod
    def from_settings(cls, settings):
        return cls()
    def request_seen(self, request):
        if request.url in self.visited_set:  # first check whether the URL is already in visited_set
            return True
        self.visited_set.add(request.url)  # if not, add it
        return False
    def open(self):  # called every time a crawl starts
        # print('open')
        pass
    def close(self, reason):  # called every time a crawl ends
        # print('close')
        pass
    def log(self, request, spider):  # called for every duplicate URL caught, so it can be logged
        # print('log....')
        pass
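
The core of such a filter is just set membership, which can be exercised on its own. The sketch below is a variation rather than the class above: it stores an MD5 digest of each URL instead of the URL itself, so every entry has a fixed size. FakeRequest and Md5RepeatFilter are hypothetical names used only for this illustration, and the example.com URLs are placeholders.

```python
import hashlib
from collections import namedtuple

# Hypothetical stand-in for a request object; only the url attribute is used.
FakeRequest = namedtuple("FakeRequest", ["url"])


class Md5RepeatFilter:
    """Remembers a fixed-size digest of every URL it has seen."""

    def __init__(self):
        self.visited = set()

    def request_seen(self, request):
        digest = hashlib.md5(request.url.encode("utf-8")).hexdigest()
        if digest in self.visited:
            return True   # duplicate: the caller should drop the request
        self.visited.add(digest)
        return False      # first time: the caller may schedule it


url_filter = Md5RepeatFilter()
urls = ["http://example.com/1", "http://example.com/2", "http://example.com/1"]
results = [url_filter.request_seen(FakeRequest(u)) for u in urls]
print(results)
```

Returning True means "already seen", which matches the contract of request_seen in the filter above.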

settings:

# 1. Bot (project) name
# BOT_NAME = 'step8_king'

# 2. Module paths of the spiders
# SPIDER_MODULES = ['step8_king.spiders']
# NEWSPIDER_MODULE = 'step8_king.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# 3. Client User-Agent request header
# USER_AGENT = 'step8_king (+http://www.yourdomain.com)'

# Obey robots.txt rules
# 4. Whether to obey the site's robots.txt rules
# ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# 5. Number of concurrent requests
# CONCURRENT_REQUESTS = 4

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# 6. Download delay in seconds
# DOWNLOAD_DELAY = 2


# The download delay setting will honor only one of:
# 7. Concurrent requests per domain; the download delay is also applied per domain
# CONCURRENT_REQUESTS_PER_DOMAIN = 2
# Concurrent requests per IP; if set, CONCURRENT_REQUESTS_PER_DOMAIN is ignored and the download delay is applied per IP
# CONCURRENT_REQUESTS_PER_IP = 3

# Disable cookies (enabled by default)
# 8. Whether cookies are enabled; cookies are handled through a cookiejar
# COOKIES_ENABLED = True
# COOKIES_DEBUG = True

# Disable Telnet Console (enabled by default)
# 9. The Telnet console lets you inspect and control the running crawler
#    Connect with "telnet ip port", then operate it with commands
# TELNETCONSOLE_ENABLED = True
# TELNETCONSOLE_HOST = '127.0.0.1'
# TELNETCONSOLE_PORT = [6023,]

# Override the default request headers:
# 10. Default request headers
# DEFAULT_REQUEST_HEADERS = {
#     'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#     'Accept-Language': 'en',
# }


# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
# 11. Pipelines that process the scraped items
# ITEM_PIPELINES = {
#    'step8_king.pipelines.CustomPipeline': 500,
# }


# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
# 12. Custom extensions, invoked through signals
# EXTENSIONS = {
#     # 'step8_king.extensions.MyExtension': 500,
# }


# 13. Maximum crawl depth; the current depth can be read from the request meta; 0 means no limit
# DEPTH_LIMIT = 3

# 14. Crawl order: 0 means depth-first, LIFO (default); 1 means breadth-first, FIFO

# Last in, first out: depth-first
# DEPTH_PRIORITY = 0
# SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleLifoDiskQueue'
# SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.LifoMemoryQueue'

# First in, first out: breadth-first
# DEPTH_PRIORITY = 1
# SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
# SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'

# 15. Scheduler class
# SCHEDULER = 'scrapy.core.scheduler.Scheduler'
# from scrapy.core.scheduler import Scheduler


# 16. Duplicate filter for visited URLs
# DUPEFILTER_CLASS = 'step8_king.duplication.RepeatUrl'

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
# Turn on automatic throttling
# AUTOTHROTTLE_ENABLED = True
# The initial download delay
# AUTOTHROTTLE_START_DELAY = 10

# The maximum download delay to be set in case of high latencies
# AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to each remote server
# (a target concurrency per remote server, not a per-second rate)
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG = True

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
# HTTPCACHE_ENABLED = True
# HTTPCACHE_EXPIRATION_SECS = 0
# HTTPCACHE_DIR = 'httpcache'
# HTTPCACHE_IGNORE_HTTP_CODES = []
# HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'


# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
# Spider middlewares
SPIDER_MIDDLEWARES = {
   'step8_king.middlewares.MyCustomSpiderMiddleware': 543,
}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# Downloader middlewares
DOWNLOADER_MIDDLEWARES = {
   # 'step8_king.middlewares.MyCustomDownloaderMiddleware': 500,
}
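
The difference between the LIFO and FIFO queues in setting 14 is easiest to see on a small link graph. The sketch below is a standard-library illustration of the two queue disciplines only, not of Scrapy's scheduler; LINKS and crawl_order are hypothetical names, and the page names are made up.

```python
from collections import deque

# Hypothetical site: each page maps to the pages it links to.
LINKS = {
    "root": ["a", "b"],
    "a": ["a1", "a2"],
    "b": ["b1"],
    "a1": [], "a2": [], "b1": [],
}


def crawl_order(start, lifo):
    """Return the visit order when pending pages sit in a LIFO or FIFO queue."""
    queue, order = deque([start]), []
    while queue:
        # LIFO takes the newest entry (stack); FIFO takes the oldest (queue)
        page = queue.pop() if lifo else queue.popleft()
        order.append(page)
        queue.extend(LINKS[page])
    return order


depth_first = crawl_order("root", lifo=True)
breadth_first = crawl_order("root", lifo=False)
print(depth_first)    # follows one branch to the bottom before backtracking
print(breadth_first)  # finishes each level before going deeper
```

With the LIFO queue the most recently discovered link is followed first, so the crawl dives down one branch; with the FIFO queue all pages at one depth are visited before any page at the next depth.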

Custom pipeline

 

A simple spider:

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import os
import re
import urllib.request

import scrapy
from scrapy.http import Request


class XiaoHuarSpider(scrapy.spiders.Spider):
    name = "xiaohuar"
    allowed_domains = ["xiaohuar.com"]
    start_urls = [
        "http://www.xiaohuar.com/list-1-1.html",
    ]

    def parse(self, response):
        # Analyze the page:
        # find the content that matches the rules (the pictures) and save it,
        # then find every <a> tag and follow the links, level by level.
        # response.xpath() replaces the deprecated HtmlXPathSelector(response).select().

        # Only list pages of the form http://www.xiaohuar.com/list-1-\d+.html hold pictures
        if re.match(r'http://www\.xiaohuar\.com/list-1-\d+\.html', response.url):
            items = response.xpath('//div[@class="item_list infinite_scroll"]/div')
            # Iterate over the nodes directly: XPath positions start at 1,
            # so building div[%d] from range(len(items)) would start at div[0] and miss entries.
            for item in items:
                src = item.xpath('.//div[@class="img"]/a/img/@src').extract()
                name = item.xpath('.//div[@class="img"]/span/text()').extract()
                school = item.xpath('.//div[@class="img"]/div[@class="btns"]/a/text()').extract()
                if src:
                    ab_src = "http://www.xiaohuar.com" + src[0]
                    # Python 3 strings are already Unicode, so no .encode() is needed here
                    file_name = "%s_%s.jpg" % (school[0], name[0])
                    file_path = os.path.join("/Users/wupeiqi/PycharmProjects/beauty/pic", file_name)
                    urllib.request.urlretrieve(ab_src, file_path)

        # Collect every URL on the page and keep crawling the list pages among them
        all_urls = response.xpath('//a/@href').extract()
        for url in all_urls:
            if url.startswith('http://www.xiaohuar.com/list-1-'):
                yield Request(url, callback=self.parse)

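The code above saves the pictures from pages that match the rule into the specified directory. It also finds the href attribute of every other a tag in the HTML source and follows those links "recursively" until every page has been visited. What makes this "recursive" visiting possible is that the parse method yields Request objects.

Note: the number of "recursion" levels can be set in settings.py, for example: DEPTH_LIMIT = 1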
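
The effect of a depth limit can be sketched on a small chain of pages. This is a standard-library illustration of the idea only, under the assumption that the start page counts as depth 0 and that a limit of 0 means no limit; PAGES and crawl are hypothetical names, and the page paths are made up.

```python
from collections import deque

# Hypothetical chain of list pages: each page links to the next one.
PAGES = {
    "/list-1": ["/list-2"],
    "/list-2": ["/list-3"],
    "/list-3": ["/list-4"],
    "/list-4": [],
}


def crawl(start, depth_limit):
    """Visit pages breadth-first, following links at most depth_limit levels deep."""
    visited, seen = [], {start}
    queue = deque([(start, 0)])  # the start page is at depth 0
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth_limit and depth >= depth_limit:
            continue  # links found on this page would exceed the limit
        for link in PAGES[url]:
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


print(crawl("/list-1", depth_limit=1))  # the start page plus one level of links
print(crawl("/list-1", depth_limit=0))  # 0 is treated as "no limit"
```

Each queued link carries the depth of the page it was found on plus one, which is the same bookkeeping that lets a crawler stop following links once the limit is reached.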

Getting the cookies from a response:

def parse(self, response):
    from scrapy.http.cookies import CookieJar
    cookieJar = CookieJar()
    cookieJar.extract_cookies(response, response.request)
    print(cookieJar._cookies)

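Structuring the data:

The example above only handles pictures, so it does the work directly in the parse method. To collect more data (prices, product names, QQ numbers, and so on), use Scrapy items to structure the data and hand it to the pipelines for processing.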

import scrapy
 
class JieYiCaiItem(scrapy.Item):
 
    company = scrapy.Field()
    title = scrapy.Field()
    qq = scrapy.Field()
    info = scrapy.Field()
    more = scrapy.Field()

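The class above defines the template. From now on, data taken from the response source is collected in this structure, so the spider needs the following: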

import scrapy
import hashlib
from beauty.items import JieYiCaiItem
from scrapy.http import Request
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class JieYiCaiSpider(scrapy.spiders.Spider):
    count = 0
    url_set = set()

    name = "jieyicai"
    domain = 'http://www.jieyicai.com'
    allowed_domains = ["jieyicai.com"]

    start_urls = [
        "http://www.jieyicai.com",
    ]

    # Note: rules are only honored by CrawlSpider; they are kept here, commented out, for reference.
    rules = [
        # URLs matching the rule below are not scraped for content; only the links on the page are extracted (the URL is fictional, replace it in real use)
        #Rule(SgmlLinkExtractor(allow=(r'http://test_url/test?page_index=\d+'))),
        # URLs matching the rule below are scraped for content (the URL is fictional, replace it in real use)
        #Rule(LinkExtractor(allow=(r'http://www.jieyicai.com/Product/Detail.aspx?pid=\d+')), callback="parse"),
    ]

    def parse(self, response):
        # Hash the URL so each page is processed only once; md5 needs bytes in Python 3
        md5_obj = hashlib.md5()
        md5_obj.update(response.url.encode('utf-8'))
        md5_url = md5_obj.hexdigest()
        if md5_url in JieYiCaiSpider.url_set:
            pass
        else:
            JieYiCaiSpider.url_set.add(md5_url)

            # response.xpath() replaces the deprecated HtmlXPathSelector(response).select()
            if response.url.startswith('http://www.jieyicai.com/Product/Detail.aspx'):
                item = JieYiCaiItem()
                item['company'] = response.xpath('//span[@class="username g-fs-14"]/text()').extract()
                item['qq'] = response.xpath('//span[@class="g-left bor1qq"]/a/@href').re(r'.*uin=(?P<qq>\d*)&')
                item['info'] = response.xpath('//div[@class="padd20 bor1 comard"]/text()').extract()
                item['more'] = response.xpath('//li[@class="style4"]/a/@href').extract()
                item['title'] = response.xpath('//div[@class="g-left prodetail-text"]/h2/text()').extract()
                yield item

            # Follow every site-relative link found on the current page
            current_page_urls = response.xpath('//a/@href').extract()
            for url in current_page_urls:
                if url.startswith('/'):
                    url_ab = JieYiCaiSpider.domain + url
                    yield Request(url_ab, callback=self.parse)

The key points of this code are:

  • The scraped data is wrapped in an Item object.
  • The Item object is yielded (as soon as parse yields an Item, the object is automatically handed to the pipeline classes for processing).

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json
import re

import MySQLdb.cursors
from twisted.enterprise import adbapi

mobile_re = re.compile(r'(13[0-9]|15[012356789]|17[678]|18[0-9]|14[57])[0-9]{8}')
phone_re = re.compile(r'(\d+-\d+|\d+)')

class JsonPipeline(object):

    def __init__(self):
        # Text mode with an explicit encoding, so process_item can write str directly
        self.file = open('/Users/wupeiqi/PycharmProjects/beauty/beauty/jieyicai.json', 'w', encoding='utf-8')


    def process_item(self, item, spider):
        line = "%s  %s\n" % (item['company'][0], item['title'][0])
        self.file.write(line)
        return item

class DBPipeline(object):

    def __init__(self):
        # Asynchronous connection pool; the first argument is the name of the DB-API module
        self.db_pool = adbapi.ConnectionPool('MySQLdb',
                                             db='DbCenter',
                                             user='root',
                                             passwd='123',
                                             cursorclass=MySQLdb.cursors.DictCursor,
                                             use_unicode=True)

    def process_item(self, item, spider):
        query = self.db_pool.runInteraction(self._conditional_insert, item)
        query.addErrback(self.handle_error)
        return item

    def _conditional_insert(self, tx, item):
        # Insert the company only if it is not in the table yet
        tx.execute("select nid from company where company = %s", (item['company'][0], ))
        result = tx.fetchone()
        if result:
            pass
        else:
            phone_obj = phone_re.search(item['info'][0].strip())
            phone = phone_obj.group() if phone_obj else ' '

            mobile_obj = mobile_re.search(item['info'][1].strip())
            mobile = mobile_obj.group() if mobile_obj else ' '

            values = (
                item['company'][0],
                item['qq'][0],
                phone,
                mobile,
                item['info'][2].strip(),
                item['more'][0])
            tx.execute("insert into company(company,qq,phone,mobile,address,more) values(%s,%s,%s,%s,%s,%s)", values)

    def handle_error(self, e):
        print('error', e)
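The pipelines file above contains more than one class, so which one does Scrapy run? It has to be configured first; otherwise Scrapy has no way of knowing.

Add the following to settings.py:

ITEM_PIPELINES = {
    'beauty.pipelines.DBPipeline': 300,
    'beauty.pipelines.JsonPipeline': 100,
}
# The integer after each entry determines the running order: items pass through the pipelines from the lowest number to the highest. By convention the numbers are kept in the 0-1000 range.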
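
The ordering rule can be sketched in a few lines. This is a toy model of the idea, not Scrapy's loader: the class names are shortened, REGISTRY and run_pipelines are hypothetical helpers, and each pipeline records its name in a list instead of writing to a file or database.

```python
# Same priorities as the setting above, with shortened (hypothetical) names.
ITEM_PIPELINES = {
    "DBPipeline": 300,
    "JsonPipeline": 100,
}


class JsonPipeline:
    def process_item(self, item, log):
        log.append("json")  # stand-in for writing a line to a file
        return item         # returning the item passes it to the next pipeline


class DBPipeline:
    def process_item(self, item, log):
        log.append("db")    # stand-in for inserting a row
        return item


# Hypothetical lookup table replacing Scrapy's import-by-dotted-path step
REGISTRY = {"JsonPipeline": JsonPipeline, "DBPipeline": DBPipeline}


def run_pipelines(item, settings):
    """Pass the item through every pipeline, lowest number first."""
    log = []
    for name in sorted(settings, key=settings.get):
        item = REGISTRY[name]().process_item(item, log)
    return log


order = run_pipelines({"company": ["ACME"]}, ITEM_PIPELINES)
print(order)
```

JsonPipeline runs first because 100 is lower than 300, regardless of the order in which the entries appear in the dictionary.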

A small spider:

import hashlib

import scrapy
from scrapy.http.request import Request
from scrapy.http.cookies import CookieJar


class ChouTiSpider(scrapy.Spider):
    # Name of the spider; the crawl command starts it by this name
    name = "chouti"
    # Allowed domains
    allowed_domains = ["chouti.com"]

    cookie_dict = {}
    has_request_set = {}

    def start_requests(self):
        url = 'http://dig.chouti.com/'
        # return [Request(url=url, callback=self.login)]
        yield Request(url=url, callback=self.login)

    def login(self, response):
        # Collect the cookies set by the first response so later requests can send them
        cookie_jar = CookieJar()
        cookie_jar.extract_cookies(response, response.request)
        for k, v in cookie_jar._cookies.items():
            for i, j in v.items():
                for m, n in j.items():
                    self.cookie_dict[m] = n.value

        req = Request(
            url='http://dig.chouti.com/login',
            method='POST',
            headers={'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'},
            body='phone=8615131255089&password=pppppppp&oneMonth=1',
            cookies=self.cookie_dict,
            callback=self.check_login
        )
        yield req

    def check_login(self, response):
        # Revisit the home page with the login cookies; dont_filter bypasses the duplicate filter
        req = Request(
            url='http://dig.chouti.com/',
            method='GET',
            callback=self.show,
            cookies=self.cookie_dict,
            dont_filter=True
        )
        yield req

    def show(self, response):
        # print(response)
        # response.xpath() replaces the deprecated HtmlXPathSelector(response).select()
        news_list = response.xpath('//div[@id="content-list"]/div[@class="item"]')
        for new in news_list:
            # temp = new.xpath('div/div[@class="part2"]/@share-linkid').extract()
            link_id = new.xpath('*/div[@class="part2"]/@share-linkid').extract_first()
            yield Request(
                url='http://dig.chouti.com/link/vote?linksId=%s' % (link_id,),
                method='POST',
                cookies=self.cookie_dict,
                callback=self.do_favor
            )

        page_list = response.xpath(r'//div[@id="dig_lcpage"]//a[re:test(@href, "/all/hot/recent/\d+")]/@href').extract()
        for page in page_list:

            page_url = 'http://dig.chouti.com%s' % page
            # Use the MD5 of the URL as the key so each page is requested only once
            hash = hashlib.md5()
            hash.update(bytes(page_url, encoding='utf-8'))
            key = hash.hexdigest()
            if key in self.has_request_set:
                pass
            else:
                self.has_request_set[key] = page_url
                yield Request(
                    url=page_url,
                    method='GET',
                    callback=self.show
                )

    def do_favor(self, response):
        print(response.text)

The spider above logs in to Chouti automatically and upvotes the posts it finds.