Building a Local Package Source with an OpenWRT Mirror Crawler

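The crawler you can find online no longer works, but thanks to its author all the same. I'm lazy and don't like rewriting tutorials that other people have already written, so I'm only posting my modifications; for how to use it, read the original tutorial.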

 

Here is my modified version, which crawls correctly:

#!/usr/bin/env python
#coding=utf-8
#
# Openwrt Package Grabber
#
# Copyright (C) 2016 sohobloo.me
#
# Python 2 script (uses urllib2 and print statements).

import urllib2
import re
import os
import time

# the url of package list page, end with "/"
baseurl = 'https://downloads.openwrt.org/snapshots/trunk/ramips/mt7620/packages/'
# which directory to save all the packages, end with "/"
# (a root directory named after the current time)
stamp = time.strftime("%Y%m%d%H%M%S", time.localtime())
savedir = './' + stamp + '/'

# capture href values that do not start with "/" or "?"
pattern = r'<a href="([^/?].*?)">'

def fetch(url, path = ''):
    # create the local directory that mirrors the remote one
    if not os.path.exists(savedir + path):
        os.makedirs(savedir + path)
    print 'fetching package list from ' + url + path
    content = urllib2.urlopen(url + path, timeout=15).read()
    items = re.findall(pattern, content)
    cnt = 0
    for item in items:
        if item == '../':
            # skip the link to the parent directory
            continue
        elif item.endswith('/'):
            # subdirectory: recurse into it
            fetch(url, path + item)
        else:
            cnt += 1
            print 'downloading item %d: ' % (cnt) + path + item
            if os.path.isfile(savedir + path + item):
                print 'file exists, ignored.'
            else:
                rfile = urllib2.urlopen(url + path + item)
                with open(savedir + path + item, "wb") as code:
                    code.write(rfile.read())

fetch(baseurl)
print 'done!'

 

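Changes:

1. Added a top-level root directory named after the current time

2. Changed the regex to filter out invalid links (those starting with a question mark)

3. Switched to crawling the directory structure recursively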
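
The three changes can be tried out without touching the network. The following is only a rough sketch, written for Python 3 (the script itself targets Python 2). `FAKE_PAGES`, `list_files` and `timestamped_root` are made-up names for illustration, and the HTML snippets are invented stand-ins for a directory index page, not real output from downloads.openwrt.org.

```python
import re
import time

# Same pattern as in the script: capture href values that do not
# start with "/" or "?".
PATTERN = r'<a href="([^/?].*?)">'

# Hypothetical directory-index pages keyed by relative path
# (invented for this sketch, not fetched from any server).
FAKE_PAGES = {
    "": '<a href="?C=N;O=D">Name</a><a href="../">../</a>'
        '<a href="base/">base/</a><a href="Packages.gz">Packages.gz</a>',
    "base/": '<a href="../">../</a><a href="/snapshots/">root</a>'
             '<a href="busybox.ipk">busybox.ipk</a>',
}


def list_files(path=""):
    """Recursively collect relative file paths, like fetch() but without I/O."""
    found = []
    for item in re.findall(PATTERN, FAKE_PAGES[path]):
        if item == "../":
            # skip the link to the parent directory
            continue
        elif item.endswith("/"):
            # subdirectory: recurse with the extended relative path
            found.extend(list_files(path + item))
        else:
            found.append(path + item)
    return found


def timestamped_root(now=None):
    """Build a root directory name of the form ./YYYYmmddHHMMSS/."""
    return "./" + time.strftime("%Y%m%d%H%M%S", now or time.localtime()) + "/"


files = list_files()
root = timestamped_root(time.strptime("2016-01-02 03:04:05", "%Y-%m-%d %H:%M:%S"))
print(files)
print(root)
```

The sort links starting with "?" and the absolute link starting with "/" never match the pattern, so only "../" needs an explicit check inside the loop.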

 

Also, I'm glad my Python knowledge has finally come in handy. Hooray.

 


 

I tried to update the screenshot but the upload failed; cnblogs (博客园) looks like it is on its last legs.

 
