
A Python 3 crawler for jandan.net

My very first crawler ran into a tough target. Either my HTTP headers weren't set up properly or I sent too many requests, but I couldn't keep looping through jandan.net's ooxx (妹子图) pages....

Still, it was fun being a little naughty for once... technology itself is innocent~~~

The few pages of pictures I did manage to pull down are quite easy on the eyes~

import urllib.request
import re
import time

def read_url(url, k):
    user_agent = 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36'
    headers = {'User-Agent': user_agent,
               'Referer': 'http://jandan.net/ooxx/page-2406'}
    r = urllib.request.Request(url, headers=headers)
    req = urllib.request.urlopen(r)
    image_d(req.read().decode('utf-8'), k)

def image_d(data, k):
    print('Crawling images on page %d' % k)
    dirct = 'C:\\Users\\eexf\\Desktop\\jiandan'
    # The image URLs in the page source are protocol-relative (they start with //)
    pattern = re.compile(r'<img src="(//.*?)" /></p>')
    res = re.findall(pattern, data)
    for i in res:
        j = 'http:' + i
        data1 = urllib.request.urlopen(j).read()
        name = re.split('/', j)[-1]  # use the last path segment as the file name
        path = dirct + '/' + name
        with open(path, 'wb') as f:
            f.write(data1)
    print('Page done')

if __name__ == '__main__':
    # Page numbers count down from the newest page; crawl a few pages here
    for k, page in enumerate(range(2406, 2403, -1), start=1):
        url = 'http://jandan.net/ooxx/page-' + str(page)
        read_url(url, k)
        time.sleep(3)  # throttle requests to reduce the chance of being blocked

The basic structure is: request the page source --> match the image URLs with a regular expression, returning a list --> download the content at every URL in the list into one folder.
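The matching step in the middle can be tried offline. Assuming the page wraps each image as `<img src="//…" /></p>` (the same pattern the crawler above uses; the sample HTML and hostnames below are made up for illustration), a minimal sketch:

```python
import re

# Hypothetical snippet mimicking the ooxx page markup
html = '''
<p><img src="//ww1.example.com/large/abc123.jpg" /></p>
<p><img src="//ww2.example.com/large/def456.jpg" /></p>
'''

# Capture the protocol-relative URLs, then prepend the scheme
pattern = re.compile(r'<img src="(//.*?)" /></p>')
urls = ['http:' + m for m in pattern.findall(html)]
print(urls)
```

The non-greedy `(//.*?)` stops each match at the first closing quote, so one regex pass yields one URL per `<img>` tag.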

 

The code is messy; this first crawler is kept here as a memento~~~

Bonus pics attached:

