Batch Meme-Image Scraper Script
import os
import re
import time
import multiprocessing
from multiprocessing.pool import ThreadPool

import requests

# Work queues: pages to fetch, images to download, and downloaded URLs for logging.
picqueue = multiprocessing.Queue()
pagequeue = multiprocessing.Queue()
logqueue = multiprocessing.Queue()
picpool = ThreadPool(50)    # 50 download workers
pagepool = ThreadPool(5)    # 5 page-fetch workers
error = []

# The listing spans 837 pages; enqueue them all up front.
for x in range(1, 838):
    pagequeue.put(x)


def getimglist(body):
    # Each thumbnail carries its real URL in data-original and its title in alt.
    imglist = re.findall(
        r'data-original="//ws\d([^"]+)" data-backup="[^"]+" alt="([^"]+)"',
        body)
    for url, name in imglist:
        if name:
            name = name + url[-4:]   # append the file extension (e.g. ".jpg")
            url = "http://ws1" + url
            logqueue.put(url)
            picqueue.put((name, url))
    if len(imglist) == 0:
        print(body)  # no matches: the markup changed or the request was blocked


def savefile():
    http = requests.Session()
    while True:
        name, url = picqueue.get()
        if not os.path.isfile(name):  # skip images already on disk
            req = http.get(url)
            try:
                with open(name, 'wb') as f:
                    f.write(req.content)
            except OSError:
                error.append([name, url])


def getpage():
    http = requests.Session()
    while True:
        pageid = pagequeue.get()
        req = http.get(
            "https://www.doutula.com/photo/list/?page={}".format(pageid))
        getimglist(req.text)
        time.sleep(1)  # throttle to one page per second per worker


for x in range(5):
    pagepool.apply_async(getpage)
for x in range(50):
    picpool.apply_async(savefile)

# Print queue depths once a second to watch progress.
while True:
    print(picqueue.qsize(), pagequeue.qsize(), logqueue.qsize())
    time.sleep(1)
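If the page markup ever drifts, getimglist falls back to printing the entire page body, which is noisy. Before launching all 55 workers, a quicker sanity check is to run the same regex against a single listing page. A minimal standalone sketch (the expected match count is an assumption, not something the original script checks):

import re
import requests

body = requests.get("https://www.doutula.com/photo/list/?page=1").text
matches = re.findall(
    r'data-original="//ws\d([^"]+)" data-backup="[^"]+" alt="([^"]+)"',
    body)
print(len(matches), matches[:3])  # a healthy page yields dozens of (path, name) pairs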
The full crawl of all 837 pages finishes in about 7 minutes.
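Note that the script never exits on its own: the workers block on queue.get() and the main loop prints queue sizes forever, so "finished" means both queues have drained. One way to stop cleanly is sketched below, assuming the queues and pools defined above; the five-quiet-seconds threshold is an arbitrary choice, not from the original:

idle = 0
while idle < 5:  # five consecutive quiet seconds => the crawl has likely drained
    if pagequeue.empty() and picqueue.empty():
        idle += 1
    else:
        idle = 0
    time.sleep(1)
pagepool.terminate()  # workers block forever on .get(), so terminate rather than join
picpool.terminate()
print("finished; failed downloads:", error)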