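Kanxue (看雪) Featured-Post Spider
Kanxue's built-in search is not very convenient, so I wrote a spider.
It currently supports four features:
1. Crawl all thread links in a given board and save them to a file
2. Automatically pick out the featured posts and save them to a file
3. Save the links whose title contains a given keyword to a separate file (searching all links)
4. Save the links whose title contains a given keyword to a separate file (searching featured-post links only)
GitHub download link: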
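Features 3 and 4 both come down to a keyword filter over collected (title, link) pairs, applied either to all links or only to the featured ones. The following is a minimal sketch in Python 3 (the original script targets Python 2). The names `filter_links` and `save_links`, the tab-separated output format, and the case-insensitive match are assumptions for illustration, not taken from the original script.

```python
import io


def filter_links(links, keyword):
    """Return the (title, url) pairs whose title contains keyword.

    The comparison ignores case; that is an assumption of this sketch.
    """
    needle = keyword.lower()
    return [(title, url) for title, url in links if needle in title.lower()]


def save_links(links, path):
    """Write one 'title<TAB>url' line per link, encoded as UTF-8."""
    with io.open(path, "w", encoding="utf-8") as fh:
        for title, url in links:
            fh.write(u"{}\t{}\n".format(title, url))
```

For example, `save_links(filter_links(all_links, "hook"), "filter_title.txt")` would write only the threads whose title mentions "hook", where `all_links` is whatever list of pairs the crawl step produced.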
https://github.com/bingghost/pediy_spider
The following dependencies need to be installed:
bs4
requests
html5lib
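Before running the spider it can help to confirm that these three packages are importable. A small sketch in Python 3, using only the standard library; `missing_packages` is a hypothetical helper name, not part of the original project.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(["bs4", "requests", "html5lib"])
    if missing:
        print("Missing dependencies: " + ", ".join(missing))
    else:
        print("All dependencies are available.")
```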
The code is as follows (Python 2):
#!/usr/bin/env python
# encoding: utf-8
"""
@author: bingghost
@copyright: 2016 bingghost. All rights reserved.
@contact:
@date: 2016-12-1
@description: Kanxue spider
"""
import re
import time
import requests
import argparse
from bs4 import BeautifulSoup
import sys

reload(sys)
sys.setdefaultencoding('utf8')


class PediySpider:
    def __init__(self, spider_url, specified_title):
        self._url = spider_url
        # output file for each kind of result
        self.file_dict = {"all_title": "all_title.txt",
                          "good_title": "good_title.txt",
                          "filter_title": "filter_title.txt",
                          "filter_good_title": "filter_good_title.txt"}
        # good title
        self.filter_list = ['jhinfo.gif', 'good_3.gif', 'good_2.gif']
        # title specified
        self.specified_title = specified_title
        self.page_count = self.get_page_count()

    def get_page_content(self, page_num):
        rep_data = requests.get(self._url + str(page_num))
        # ... the rest of this method and the methods that follow it
        # (get_page_count and worker among them) are missing from the
        # published listing, up to the body of start_work below

    def start_work(self):
        print "[-] start spider"
        self.worker()
        print "[-] spider okay"


def set_argument():
    # add description
    parser = argparse.ArgumentParser(
        description="A spider for the bbs of pediy's Android security forum, "
                    "also you can modify the url to spider other forum.")
    # add argument: exactly one of the three modes must be given
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument(
        '-a', '--all', action='store_true', help='Get all titles')
    group.add_argument(
        '-f', '--filter', type=str, default=None, help='filter title')
    group.add_argument(
        '-gf', '--gfilter', type=str, default=None, help='filter good title')
    args = parser.parse_args()
    return args


def main():
    args = set_argument()
    spider_dict = {"android": "http://bbs.pediy.com/forumdisplay.php?f=161&order=desc&page=",
                   "ios": "http://bbs.pediy.com/forumdisplay.php?f=166&order=desc&page="}
    pediy_spider = None
    if args.all:
        pediy_spider = PediySpider(spider_dict['android'], None)
    if args.filter:
        pediy_spider = PediySpider(spider_dict['android'], args.filter)
    if args.gfilter:
        pediy_spider = PediySpider(spider_dict['android'], args.gfilter)
    pediy_spider.start_work()


if __name__ == '__main__':
    main()
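The `filter_list` in `__init__` holds three icon file names, which suggests that featured threads are recognised by the icon shown next to them. The method that actually uses the list is not part of the listing as published, so the following is only a sketch of how such a check might look, written in Python 3. It swaps in the standard library's `html.parser` for BeautifulSoup to stay dependency-free, and the row markup, `IconCollector`, and `is_featured` are all hypothetical.

```python
from html.parser import HTMLParser
from posixpath import basename

# Icon file names copied from filter_list in the script.
FEATURED_ICONS = {"jhinfo.gif", "good_3.gif", "good_2.gif"}


class IconCollector(HTMLParser):
    """Collect the file name of every <img src=...> in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.icons = set()

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src") or ""
            # Drop any query string, then keep only the file name.
            self.icons.add(basename(src.split("?", 1)[0]))


def is_featured(row_html, icons=FEATURED_ICONS):
    """Return True if the fragment contains at least one featured-post icon."""
    collector = IconCollector()
    collector.feed(row_html)
    collector.close()
    return bool(collector.icons & set(icons))
```

A crawler could call `is_featured` on the HTML of each thread row and route matching rows to `good_title.txt`; how the original script locates those rows is not recoverable from the listing.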
Result:
[Screenshot: Kanxue featured-post spider]