Python crawler (6): scraping Amazon data with the Scrapy framework
Extracting the data with xpath() is fairly straightforward; the troublesome part is handling the URL navigation and recursion. It cost me quite a while. Douban is so much nicer in this respect, with its well-structured URLs, whereas Amazon's URLs are a mess... perhaps I just don't understand them well enough yet.
amazon
├── amazon
│   ├── __init__.py
│   ├── __init__.pyc
│   ├── items.py
│   ├── items.pyc
│   ├── msic
│   │   ├── __init__.py
│   │   └── pad_urls.py
│   ├── pipelines.py
│   ├── settings.py
│   ├── settings.pyc
│   └── spiders
│       ├── __init__.py
│       ├── __init__.pyc
│       ├── pad_spider.py
│       └── pad_spider.pyc
├── pad.xml
└── scrapy.cfg
(1)items.py
from scrapy import Item, Field

class PadItem(Item):
    sno = Field()
    price = Field()
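For context (this snippet is mine, not part of the original project): a Scrapy Item behaves like a dict with a fixed set of keys, so PadItem can be exercised on its own like this:

from amazon.items import PadItem

# A PadItem only accepts the fields declared above (sno, price)
item = PadItem()
item['sno'] = 'B00JWCIJ78'
item['price'] = '¥3199.00'
print(dict(item))   # {'sno': 'B00JWCIJ78', 'price': '¥3199.00'}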
(2)pad_spider.py
# -*- coding: utf-8 -*-
from scrapy import Spider, Selector
from scrapy.http import Request
from amazon.items import PadItem

class PadSpider(Spider):
    name = "pad"
    allowed_domains = ["amazon.cn"]
    start_urls = []
    u1 = 'http://www.amazon.cn/s/ref=sr_pg_'
    u2 = '?rh=n%3A2016116051%2Cn%3A!2016117051%2Cn%3A888465051%2Cn%3A106200071&page='
    u3 = '&ie=UTF8&qid=1408641827'
    for i in range(181):
        url = u1 + str(i+1) + u2 + str(i+1) + u3
        start_urls.append(url)

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//div[@class="rsltGrid prod celwidget"]')
        items = []
        for site in sites:
            item = PadItem()
            item['sno'] = site.xpath('@name').extract()[0]
            try:
                item['price'] = site.xpath('ul/li/div/a/span/text()').extract()[0]
            # an IndexError here means the listing uses the "new product" markup
            except IndexError:
                item['price'] = site.xpath('ul/li/a/span/text()').extract()[0]
            items.append(item)
        return items
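One note on the price extraction: the try/except exists only because new products use a slightly different page layout. As a sketch (untested against Amazon's current markup, and not how the original spider does it), the two paths could also be merged with an XPath union so a single query covers both cases:

# Hypothetical alternative to the try/except above: query both layouts at once
prices = site.xpath('ul/li/div/a/span/text() | ul/li/a/span/text()').extract()
item['price'] = prices[0] if prices else None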
(3)settings.py
# -*- coding: utf-8 -*-

# Scrapy settings for amazon project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#

BOT_NAME = 'amazon'

SPIDER_MODULES = ['amazon.spiders']
NEWSPIDER_MODULE = 'amazon.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'amazon (+http://www.yourdomain.com)'
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'

FEED_URI = 'pad.xml'
FEED_FORMAT = 'xml'
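A side note in case you run this on a recent Scrapy release: FEED_URI / FEED_FORMAT have since been replaced by the FEEDS setting (Scrapy 2.1+). An equivalent export configuration would be something like:

# Equivalent feed export on newer Scrapy versions (2.1 and later)
FEEDS = {
    'pad.xml': {'format': 'xml', 'encoding': 'utf8'},
}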
(4) The results, in pad.xml
<?xml version="1.0" encoding="utf-8"?>
<items>
  <item>
    <sno>B00JWCIJ78</sno>
    <price>¥3199.00</price>
  </item>
  <item>
    <sno>B00E907DKM</sno>
    <price>¥3079.00</price>
  </item>
  <item>
    <sno>B00L8R7HKA</sno>
    <price>¥3679.00</price>
  </item>
  <item>
    <sno>B00IZ8W4F8</sno>
    <price>¥3399.00</price>
  </item>
  <item>
    <sno>B00MJMW4BU</sno>
    <price>¥4399.00</price>
  </item>
  <item>
    <sno>B00HV7KAMI</sno>
    <price>¥3799.00</price>
  </item>
  ...
</items>
(5) Persisting the data to a database
...
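The original post stops here, so the following pipeline is only a sketch of one way to do it (assuming a local SQLite file named pad.db, which is my choice, not the author's): add it to pipelines.py and enable it with ITEM_PIPELINES = {'amazon.pipelines.SQLitePipeline': 300} in settings.py.

# pipelines.py -- hypothetical sketch, not part of the original post
import sqlite3

class SQLitePipeline(object):

    def open_spider(self, spider):
        # pad.db is an assumed filename; swap in whatever database you prefer
        self.conn = sqlite3.connect('pad.db')
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS pad (sno TEXT PRIMARY KEY, price TEXT)')

    def process_item(self, item, spider):
        # upsert on the Amazon item number so reruns do not duplicate rows
        self.conn.execute(
            'INSERT OR REPLACE INTO pad (sno, price) VALUES (?, ?)',
            (item['sno'], item['price']))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()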
-- 2014-08-22 04:12:43