Using Scrapy to batch-download your own cnblogs (博客园) articles


2024-09-12 02:40:17, 215 views

First, define a few fields in items.py to hold the page data (URL, title, and HTML source), as shown below:

import scrapy


class MycnblogsItem(scrapy.Item):
    # Fields that hold the scraped page data
    page_title = scrapy.Field()  # article title
    page_url = scrapy.Field()    # article URL
    page_html = scrapy.Field()   # full HTML source of the page

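The most important part is the spider. Ours inherits from CrawlSpider, which lets us define regular-expression rules that tell the crawler which pages to fetch.

For example: following the "next page" links and crawling each individual article.

In the spider, the parse_item method parses each target page to extract the article's URL, title, and content.

Note: in parse_item we add a base tag to the downloaded HTML source. That way, when a saved HTML file is opened locally, the layout is not broken and the page still uses the cnblogs CSS styles.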
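To see the effect of the base tag in isolation, here is a minimal sketch that uses only the standard library. The helper name add_base_tag and the sample HTML are assumptions made up for illustration and are not part of the original project. Like the spider's plain string replacement, it only recognizes a literal `<head>` tag, and it patches the first occurrence only.

```python
def add_base_tag(html, base_url="http://www.cnblogs.com/"):
    """Insert a <base> tag right after the first literal <head> tag.

    Hypothetical helper for illustration. If the page has no plain
    "<head>" tag (for example "<head lang='en'>"), the HTML is returned
    unchanged.
    """
    base = "<base href='{}'>".format(base_url)
    return html.replace("<head>", "<head>" + base, 1)


# Made-up sample page with a relative stylesheet link
sample = (
    "<html><head><link rel='stylesheet' href='/css/blog.css'></head>"
    "<body></body></html>"
)
patched = add_base_tag(sample)
print(patched)
```

With the base tag in place, a browser resolves the relative link /css/blog.css against http://www.cnblogs.com/ rather than against the local file path.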

The spider source code is as follows:

# -*- coding: utf-8 -*-
from mycnblogs.items import MycnblogsItem
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class CnblogsSpider(CrawlSpider):
    name = "cnblogs"
    allowed_domains = ["cnblogs.com"]
    start_urls = ['http://www.cnblogs.com/hongfei/']
    rules = (
        # Follow the "next page" links; no callback means follow defaults to True
        Rule(LinkExtractor(allow=(r'default.html\?page=\d+',))),
        # Crawl every article and parse it with parse_item to get its URL, title, and content
        Rule(LinkExtractor(allow=(r'hongfei/p/',)), callback='parse_item'),
        Rule(LinkExtractor(allow=(r'hongfei/articles/',)), callback='parse_item'),
        Rule(LinkExtractor(allow=(r'hongfei/archive/\d+/\d+/\d+/\d+.html',)), callback='parse_item'),
    )

    def parse_item(self, response):
        item = MycnblogsItem()
        item['page_url'] = response.url
        item['page_title'] = response.xpath("//title/text()").extract_first()
        html = response.body.decode("utf-8")
        # Add a <base> tag so relative CSS and image links resolve against cnblogs
        html = html.replace("<head>", "<head><base href='http://www.cnblogs.com/'>")
        item['page_html'] = html
        yield item
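Before running a full crawl, it can help to check which URLs the rule patterns would pick up. The sketch below applies the same four patterns with the standard re module. It assumes search semantics (the pattern may match anywhere in the absolute URL) and that the first matching rule wins, which is how I understand LinkExtractor and CrawlSpider to behave. The labels, the classify helper, and the example URLs are all made up for illustration.

```python
import re

# The same patterns as the spider's rules, in the same order.
# The labels are hypothetical names used only in this sketch.
RULES = [
    ("next_page", re.compile(r"default.html\?page=\d+")),
    ("article", re.compile(r"hongfei/p/")),
    ("article", re.compile(r"hongfei/articles/")),
    ("article", re.compile(r"hongfei/archive/\d+/\d+/\d+/\d+.html")),
]


def classify(url):
    """Return the label of the first rule whose pattern is found in url, else None."""
    for label, pattern in RULES:
        if pattern.search(url):
            return label
    return None


# Hypothetical example URLs, not taken from the real blog
examples = [
    "http://www.cnblogs.com/hongfei/default.html?page=2",
    "http://www.cnblogs.com/hongfei/p/1234567.html",
    "http://www.cnblogs.com/hongfei/archive/2012/06/12/2545723.html",
    "http://www.cnblogs.com/hongfei/category/1.html",
]
for url in examples:
    print(classify(url), url)
```

URLs that match no pattern, such as the category page in the last example, are simply not followed by the spider.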

In pipelines.py, the process_item method handles each item returned by the spider:

# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import codecs


class MycnblogsPipeline(object):
    def process_item(self, item, spider):
        # Save each page as ./blogs/<title>.html (the blogs directory must already exist)
        file_name = './blogs/' + item['page_title'] + '.html'
        with codecs.open(filename=file_name, mode='wb', encoding='utf-8') as f:
            f.write(item['page_html'])
        return item
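One practical caveat with the pipeline above: the page title is used directly as the file name, so a title containing a character such as / or ? can make the write fail, and a page without a title yields None. The following is a minimal sketch of one way to guard against that. The helpers safe_filename and build_path are hypothetical names that do not appear in the original project.

```python
import os
import re


def safe_filename(title, default="untitled"):
    """Turn a page title into a string that is safe to use as a file name."""
    if not title:
        return default
    # Replace path separators, characters Windows forbids, and control characters
    cleaned = re.sub(r'[\\/:*?"<>|\x00-\x1f]', "_", title).strip()
    return cleaned or default


def build_path(title, directory="./blogs"):
    """Create the output directory if needed and return the target .html path."""
    os.makedirs(directory, exist_ok=True)
    return os.path.join(directory, safe_filename(title) + ".html")


print(safe_filename("Scrapy/Python: tips?"))
```

Inside process_item, file_name could then be computed as build_path(item['page_title']) instead of by string concatenation. Note that two articles with the same cleaned title would still overwrite each other.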

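Some typical uses of an item pipeline:

Cleaning HTML data
Validating scraped data (checking that the item contains certain fields)
Checking for duplicates (and dropping them)
Storing the scraped results in a database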
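As an illustration of the duplicate-checking use case, here is a minimal sketch of a pipeline that drops items whose page_url has already been seen. The class name DuplicateUrlPipeline is hypothetical and not part of the original project. The fallback DropItem class is only a stand-in so the sketch runs without Scrapy installed; in a real project the exception comes from scrapy.exceptions.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Stand-in so this sketch runs without Scrapy installed (illustration only)
    class DropItem(Exception):
        pass


class DuplicateUrlPipeline(object):
    """Hypothetical pipeline: drop any item whose page_url was already processed."""

    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        url = item['page_url']
        if url in self.seen_urls:
            # Raising DropItem tells Scrapy to stop processing this item
            raise DropItem("Duplicate page: %s" % url)
        self.seen_urls.add(url)
        return item
```

To take effect, such a pipeline would also need its own entry in ITEM_PIPELINES, with a lower number than the pipeline that writes files so that duplicates are dropped before anything is saved.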

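To enable an item pipeline component, you must add its class to the ITEM_PIPELINES setting. The integer value sets the order in which pipelines run (lower values run first; by convention the values fall in the 0-1000 range), as in this example: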

ITEM_PIPELINES = {
    'mycnblogs.pipelines.MycnblogsPipeline': 300,
}

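After the spider runs (for example with scrapy crawl cnblogs from the project directory), all of the articles are saved locally in the blogs directory.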
