Attributes and methods of scrapy.Spider

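Attributes:
name: the spider's name; it must be unique
allowed_domains: the allowed domains, which limit the scope of the crawl
start_urls: the initial URLs
custom_settings: per-spider settings that override the global settings
crawler: the Crawler object this spider is bound to
settings: the Settings instance, holding all configuration values of the project
logger: a logger instance for writing debug messages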

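Methods:
from_crawler(crawler, *args, **kwargs): class method used to create the spider
start_requests(): generates the initial requests
make_requests_from_url(url): builds a request from a single URL; start_requests() calls it for each URL in start_urls (deprecated in newer Scrapy releases)
parse(response): parses the page content
log(message[, level, component]): writes a log entry; prefer the logger attribute instead, e.g. self.logger.info('visited success')
closed(reason): called when the spider is closed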

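Subclasses:
Mainly CrawlSpider
1: the most commonly used spider, for crawling regular web pages
2: it adds two members
1) rules: a set of crawling rules that define how links are followed and which parse function handles each link
2) parse_start_url(response): parses the responses for the initial URLs
Example: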
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ExampleItem(scrapy.Item):
    # Fields are declared here because a bare scrapy.Item() accepts no keys.
    # ExampleItem is a name added for illustration.
    id = scrapy.Field()
    name = scrapy.Field()
    description = scrapy.Field()


class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=(r'category\.php', ), deny=(r'subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow=(r'item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = ExampleItem()
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
        return item

 
