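More than half of our users turned out to have QQ mailboxes, so I needed a way to fetch their avatars without going through OAuth.
As a Pythonista you have plenty of crawler frameworks to choose from, for example scrapy and pyspider (yes, pyspider comes with Chinese support, a UI, and a scheduler).
A crawler framework does a lot for you: the basics such as parse callbacks and, more importantly, depth-first or breadth-first traversal for things like following "next page" links, and the better ones do more.
But scraping avatars for mailboxes just means fetching each URL once you have it. There is no need for heavy machinery, so requests is enough.
What I do want is for the scraper to restore its previous state after the process dies (I have no wish to start over again and again with several hundred thousand mailboxes).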
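Restoring state can be as simple as logging the index of every processed line and, on restart, skipping everything up to the highest index already logged. A minimal sketch of that idea, assuming log lines that start with an integer index; the helper names `next_index` and `pending` are made up for illustration:

```python
def next_index(log_lines):
    """Return the index to resume from, given lines like '12 SUCCESS a@qq.com ...'."""
    last = -1  # nothing has been processed yet
    for line in log_lines:
        fields = line.split()
        if fields and fields[0].isdigit():
            last = max(last, int(fields[0]))
    return last + 1


def pending(emails, log_lines):
    """Yield (index, email) pairs that do not appear in the log yet."""
    start = next_index(log_lines)
    for index, email in enumerate(emails):
        if index >= start:
            yield index, email.strip()
```

Because the log is append-only, a crash in the middle of a run loses at most the one address that was in flight.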
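The first step is getting the URL. If you don't mind that gravatar may be blocked and that the QQ link may change (it is not a documented address, after all), this is all you need.
gravatar, Python implementation
Parameters to watch: s is the size. gravatar handles this well and serves pretty much any size.
d is the default image. If you don't want a default avatar, pass 404 and gravatar answers with a 404 response. See the documentation for the other parameters.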
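The two parameters fit together as in the sketch below. It is only an illustration: the function name `gravatar_url` is my own, and it relies on gravatar identifying an account by the MD5 of the trimmed, lower-cased address.

```python
import hashlib
from urllib.parse import urlencode


def gravatar_url(email, size=100, default=None):
    """Build a gravatar URL for `email` with the s (size) and d (default) parameters."""
    digest = hashlib.md5(email.strip().lower().encode("utf-8")).hexdigest()
    params = {"s": str(size)}
    if default is not None:
        # "404" asks gravatar for an HTTP 404 instead of a placeholder image.
        params["d"] = default
    return "https://secure.gravatar.com/avatar/{}?{}".format(digest, urlencode(params))
```

With `default="404"` the caller only has to look at the status code to know whether the user ever uploaded an avatar.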
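For QQ, s is the image size as well. Probing it, I found all of these values: 1 2 3 4 5 40 41 100 140 160 240 640
1 to 5 are all valid values; 2 corresponds to 40*40 and 4 to 100*100. Note, however, that not everyone has a 100-pixel image (an avatar uploaded ten years ago and never changed; such users really exist, I know some myself...)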
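A small sketch of how the size codes can be turned into URLs. The endpoint is the one used by the script in this article; it is undocumented and may change. The helper names and the pixel-to-code table are illustrative, and only the two mappings stated above are included.

```python
from urllib.parse import urlencode

# Pixel size -> value of the s parameter, as observed above.
QQ_SIZE_CODES = {40: 2, 100: 4}


def qq_avatar_url(account, pixels=100):
    """Build the qlogo URL for a QQ account at the given pixel size."""
    code = QQ_SIZE_CODES[pixels]
    return "http://q4.qlogo.cn/g?" + urlencode({"b": "qq", "nk": account, "s": code})


def qq_avatar_candidates(account):
    """URLs to try in order: the 100x100 image first, then the 40x40 fallback."""
    return [qq_avatar_url(account, 100), qq_avatar_url(account, 40)]
```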
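The reference describes fetching resources past anti-hotlinking checks with PHP curl. Sadly it is PHP, so I ported it to Python.
Python version. The final implementation did not end up using it (QQ has a link that can be accessed directly, oh yeah), but you never know when it might come in handy.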
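The usual trick against hotlink protection is to send a Referer header that the image host accepts. The sketch below is not the ported code itself, just a minimal illustration of that technique; it assumes the server checks nothing but the Referer, and it uses the standard library's `urllib.request` so it runs without extra packages (with requests, the same dictionary can be passed as `headers=`). The function names are made up.

```python
import urllib.request


def build_request(url, referer, user_agent="Mozilla/5.0"):
    """Prepare a GET request that claims to come from `referer`."""
    headers = {"Referer": referer, "User-Agent": user_agent}
    return urllib.request.Request(url, headers=headers)


def fetch(url, referer, timeout=10):
    """Download `url` and return the raw bytes of the response body."""
    with urllib.request.urlopen(build_request(url, referer), timeout=timeout) as resp:
        return resp.read()
```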
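Below are images at five sizes; I am not sure whether they will display on github, osc, or sf.
As with any crawler, you need to fake the User-Agent (the AGENTS list in the script), otherwise the crawler may get banned: while scraping QQ I noticed that after a while the avatars came back with a size of 0, so something had clearly gone wrong.
As for how many users have a gravatar, the ratio kept dropping, from 1 in 40 to 1 in 60, and by the time I had scraped 60,000 mailboxes it was roughly 1 in 100.
On ignoring default images: for gravatar, just check for the 404, which is simple. QQ takes more work: first download the handful of default images and take the MD5 of each; then, when downloading a QQ avatar, compare its MD5 against them. A match means it is a default image, so skip it.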
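The MD5 comparison comes down to a few lines. A minimal sketch, using the two digests listed in the script in this article; the function name `is_default_avatar` is made up for illustration:

```python
import hashlib

# MD5 digests of QQ's default avatars, taken from QQ_MD5_ESCAPE in the script.
DEFAULT_AVATAR_MD5 = {
    "11567101378fc08988b38b8f0acb1f74",
    "9d11f9fcc1888a4be8d610f8f4bba224",
}


def is_default_avatar(content, known=DEFAULT_AVATAR_MD5):
    """Return True when the downloaded bytes hash to a known default image."""
    return hashlib.md5(content).hexdigest() in known
```

Note that the comparison has to use the hex digest string; comparing the hash object itself against the list never matches.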
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import hashlib
import os
import random
import sys
from functools import partial
from urllib.parse import urlencode

import requests

# User-Agent strings to rotate through so the crawler is less likely to be banned.
AGENTS = [
    "Avant Browser/1.2.789rel1 (http://www.avantbrowser.com)",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/ Safari/532.5",
    "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.310.0 Safari/532.9",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.514.0 Safari/534.7",
    "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/9.0.601.0 Safari/534.14",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/10.0.601.0 Safari/534.14",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.27 (KHTML, like Gecko) Chrome/12.0.712.0 Safari/534.27",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.24 Safari/535.1",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.120 Safari/535.2",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7",
    "Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv: Gecko/2009042316 Firefox/3.0.10",
    "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv: Gecko/2009060215 Firefox/3.0.11 (.NET CLR 3.5.30729)",
    "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv: Gecko/20091201 Firefox/3.5.6 GTB5",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; tr; rv: Gecko/20100722 Firefox/3.6.8 ( .NET CLR 3.5.30729; .NET4.0E)",
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0a2) Gecko/20110622 Firefox/6.0a2",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b4pre) Gecko/20100815 Minefield/4.0b4pre",
]

# MD5 digests of QQ's default avatars.
QQ_MD5_ESCAPE = [
    '11567101378fc08988b38b8f0acb1f74',
    '9d11f9fcc1888a4be8d610f8f4bba224',
]

LOG_FILES = 'scrapy_{}.log'
EMAIL_LIST = 'email_list_{}.json'
AVATAR_PATH = 'avatar/{}{}'

LOG_LEVEL_EXISTS = 'EXISTS'
LOG_LEVEL_NOTSET_OR_ERROR = 'NOTSET_OR_ERROR'
LOG_LEVEL_TYPE_ERROR = 'TYPE_ERROR'
LOG_LEVEL_ERROR = 'ERROR'
LOG_LEVEL_FAIL = 'FAIL'
LOG_LEVEL_SUCCESS = 'SUCCESS'
LOG_LEVEL_IGNORE = 'IGNORE'


def get_gravatar_url(email, default_avatar=None, use_404=False, size=100):
    data = {}
    if default_avatar and default_avatar.startswith('http'):
        data['d'] = default_avatar
    if use_404:
        data['d'] = '404'
    data['s'] = str(size)
    # hashlib needs bytes, so encode the address before hashing.
    digest = hashlib.md5(email.lower().encode('utf-8')).hexdigest()
    gravatar_url = "http://secure.gravatar.com/avatar/" + digest + "?"
    gravatar_url += urlencode(data)
    return gravatar_url


def get_random_headers():
    agent = random.choice(AGENTS)
    headers = {'User-Agent': agent}
    return headers


def check_logfile(part):
    """Return the line number to resume from, based on the existing log."""
    # Start at -1 so that a fresh run begins at line 0 of the email list.
    last_scrapy_line = -1
    if os.path.exists(LOG_FILES.format(part)):
        with open(LOG_FILES.format(part)) as log_read:
            for line in log_read:
                last_scrapy_line = max(last_scrapy_line, int(line.split()[0]))
    print(last_scrapy_line)
    return last_scrapy_line + 1


def get_log_message(log_format='{index} {level} {email} {msg}', index=None, level=None, email=None, msg=None):
    return log_format.format(index=index, level=level, email=email, msg=msg)


SUCCESS_LOG = partial(get_log_message, level=LOG_LEVEL_SUCCESS, msg='scrapyed success')
EXIST_LOG = partial(get_log_message, level=LOG_LEVEL_EXISTS, msg='scrapyed already')
FAIL_LOG = partial(get_log_message, level=LOG_LEVEL_FAIL, msg='scrapyed failed')
NOT_QQ_LOG = partial(get_log_message, level=LOG_LEVEL_TYPE_ERROR, msg='not qq email')
IGNORE_LOG = partial(get_log_message, level=LOG_LEVEL_TYPE_ERROR, msg='ignore email')
EMPTY_SIZE_LOG = partial(get_log_message, level=LOG_LEVEL_ERROR, msg='empty avatar')
UNEXCEPT_ERROR_LOG = partial(get_log_message, level=LOG_LEVEL_ERROR, msg='unexcept error')


def write_log(log, msg):
    log.write(msg)
    log.write('\n')
    log.flush()


def save_avatar_file(filename, content):
    with open(filename, 'wb') as avatar_file:
        avatar_file.write(content)


def scrapy_context(part, suffix='.jpg', rescrapy=False, hook=None):
    """Walk the email list, skip what the log says is done, and call hook per email."""
    last_scrapy_line = check_logfile(part)
    index = last_scrapy_line
    with open(LOG_FILES.format(part), 'a') as log:
        with open(EMAIL_LIST.format(part)) as list_file:
            for linenum, email in enumerate(list_file):
                if linenum < last_scrapy_line:
                    continue
                email = email.strip()
                if not rescrapy:
                    if os.path.exists(AVATAR_PATH.format(email, suffix)):
                        print(EXIST_LOG(index=index, email=email))
                        index += 1
                        continue
                if not hook:
                    raise NotImplementedError()
                try:
                    hook(part, suffix=suffix, rescrapy=rescrapy, log=log, index=index, email=email)
                except Exception:
                    print(UNEXCEPT_ERROR_LOG(index=index, email=email))
                    write_log(log, UNEXCEPT_ERROR_LOG(index=index, email=email))
                    raise
                index += 1


def scrapy_qq_hook(part, suffix='.jpg', rescrapy=False, log=None, index=None, email=None):
    if 'qq.com' not in email.lower():
        print(NOT_QQ_LOG(index=index, email=email))
        write_log(log, NOT_QQ_LOG(index=index, email=email))
        return
    url = 'http://q4.qlogo.cn/g?b=qq&nk={}&s=4'.format(email)
    response = requests.get(url, timeout=10, headers=get_random_headers())
    if response.status_code == 200:
        # Check whether the user has the large avatar; if not, request the small one.
        if hashlib.md5(response.content).hexdigest() in QQ_MD5_ESCAPE:
            url = 'http://q4.qlogo.cn/g?b=qq&nk={}&s=2'.format(email)
            response = requests.get(url, timeout=10, headers=get_random_headers())
    if response.status_code == 200 and not response.content:
        # A zero-length body usually means the crawler is being throttled; don't save it.
        print(EMPTY_SIZE_LOG(index=index, email=email))
        write_log(log, EMPTY_SIZE_LOG(index=index, email=email))
        return
    # Check the status again because the block above may have replaced the response.
    if response.status_code == 200:
        save_avatar_file(AVATAR_PATH.format(email, suffix), response.content)
        print(SUCCESS_LOG(index=index, email=email))
        write_log(log, SUCCESS_LOG(index=index, email=email))
    else:
        print(FAIL_LOG(index=index, email=email))
        write_log(log, FAIL_LOG(index=index, email=email))


def scrapy_gravatar_hook(part, suffix='.jpg', rescrapy=False, ignore_email_suffix=None, log=None, index=None, email=None):
    if ignore_email_suffix and ignore_email_suffix in email.lower():
        print(IGNORE_LOG(index=index, email=email))
        write_log(log, IGNORE_LOG(index=index, email=email))
        return
    # use_404=True makes gravatar answer 404 when the user has no avatar.
    response = requests.get(get_gravatar_url(email, use_404=True), timeout=10, headers=get_random_headers())
    if response.status_code == 200:
        save_avatar_file(AVATAR_PATH.format(email, suffix), response.content)
        print(SUCCESS_LOG(index=index, email=email))
        write_log(log, SUCCESS_LOG(index=index, email=email))
    else:
        print(FAIL_LOG(index=index, email=email))
        write_log(log, FAIL_LOG(index=index, email=email))
    return


scrapy_gravatar = partial(scrapy_context, hook=scrapy_gravatar_hook)
scrapy_qq = partial(scrapy_context, hook=scrapy_qq_hook)

FUNC_MAPPER = {
    'qq': scrapy_qq,
    'gravatar': scrapy_gravatar,
}

if __name__ == '__main__':
    scrapy_type = sys.argv[1]
    part = sys.argv[2]
    if scrapy_type not in FUNC_MAPPER:
        print('type should in [qq | gravatar]')
        sys.exit(1)
    FUNC_MAPPER[scrapy_type](part)
```
pip install requests
mkdir /opt/projects/scripts/avatar
python scrapy_avatar.py gravatar 0
or python scrapy_avatar.py qq 0
When email_list is large, you can split it into several lists so that more processes can work on it,
for example email_list_0.json and email_list_1.json.
Then you can run python scrapy_avatar.py gravatar 0
python scrapy_avatar.py gravatar 1
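Splitting the list can be scripted too. A minimal sketch, assuming the list holds one address per line (which is how the script reads it, despite the .json extension); the helper names `split_emails` and `write_parts` are made up for illustration:

```python
def split_emails(emails, parts):
    """Distribute emails round-robin into `parts` lists of nearly equal length."""
    if parts < 1:
        raise ValueError("parts must be at least 1")
    chunks = [[] for _ in range(parts)]
    for i, email in enumerate(emails):
        chunks[i % parts].append(email)
    return chunks


def write_parts(emails, parts, pattern="email_list_{}.json"):
    """Write email_list_0.json, email_list_1.json, ... with one address per line."""
    for part, chunk in enumerate(split_emails(emails, parts)):
        with open(pattern.format(part), "w") as out:
            out.writelines(email + "\n" for email in chunk)
```

Each part keeps its own log file, so every process can be stopped and resumed independently.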
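Because this is a simple script, I couldn't be bothered to use click for argument handling; it depends only on requests, and I skipped argument validation as well.
The for loop was originally written with a contextmanager and yield, but that produced a strange RuntimeError: generator didn't stop, so I reluctantly replaced the yield with a hook. QQ avatars have some quirks of their own: not everyone has a 100-pixel image, but everyone has a 40-pixel one, so the large image is tried first, which is why the QQ side carries an extra check.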
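The RuntimeError is less strange than it looks: a function decorated with `contextlib.contextmanager` must yield exactly once, and a `yield` inside a for loop yields once per item. The sketch below reproduces the error and shows the hook shape that replaces it; all names in it are illustrative and not taken from the script.

```python
from contextlib import contextmanager


@contextmanager
def each_email(emails):
    # Wrong for this purpose: a context manager must yield exactly once.
    for email in emails:
        yield email


def demo_broken(emails):
    """Return the error message raised when the with-block exits, if any."""
    try:
        with each_email(emails) as email:
            pass  # only the first email is ever bound here
    except RuntimeError as error:
        return str(error)
    return None


def for_each_email(emails, hook):
    """Hook style: the loop lives in one place and calls hook once per item."""
    for index, email in enumerate(emails):
        hook(index=index, email=email)
```

A plain generator without the decorator, consumed with a for loop, would have worked as well; the hook version simply keeps the log handling and the loop together in one function.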
Appendix: a simple way to view the images on a Linux server, Flask + nginx
pip install flask
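Drop app.py into the directory where the images were scraped, change the root of the avatar location in the nginx config, put the config into /etc/nginx/sites-enabled, and reload nginx. Don't forget to add the localtest host (localtest.com) to your hosts file.
Flask code, app.py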
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os

from flask import Flask

app = Flask(__name__)
app.debug = True


@app.route("/")
def hello():
    # List every scraped avatar and render one <img> tag per file.
    avatars = sorted(os.listdir('avatar'))
    html = '\n'.join("<img src='/avatar/{}' />".format(avatar) for avatar in avatars)
    return html


if __name__ == "__main__":
    app.run(host='', port=11111)
```
```nginx
upstream localtest-backend {
    server 127.0.0.1:11111 fail_timeout=0;
}

server {
    listen 80;
    server_name localtest.com;

    location ~ /avatar/(?P<file>.*) {
        root /opt/projects/scripts/new;
        try_files /avatar/$file /avatar/$file =404;
        expires 30d;
        gzip on;
        gzip_types text/plain application/x-javascript text/css application/javascript;
        gzip_comp_level 3;
    }

    location / {
        proxy_pass http://localtest-backend;
    }
}
```