
Kanxue (pediy) Featured-Post Spider

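I find Kanxue's built-in search not very convenient, so I wrote a spider.

It currently supports four features:

1. Crawl all thread links in a given forum section and save them to a file

2. Automatically pick out the featured posts and save them to a file

3. Save links whose titles contain a specified keyword to a separate file (searching all links)

4. Save links whose titles contain a specified keyword to a separate file (searching featured-post links only)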
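Features 3 and 4 come down to the same step applied to two different link lists: keep the entries whose title contains the keyword and write them out. Below is a minimal sketch of that step, not the author's implementation. It is written for Python 3, the helper names `filter_by_keyword` and `save_links` and the sample data are made up for illustration, and case-insensitive matching is an assumption. Only the output name `filter_title.txt` comes from the post's code.

```python
import os
import tempfile


def filter_by_keyword(links, keyword):
    """Keep the (title, url) pairs whose title contains keyword, ignoring case."""
    needle = keyword.lower()
    return [(title, url) for title, url in links if needle in title.lower()]


def save_links(links, path):
    """Write one "title<TAB>url" line per link, encoded as UTF-8."""
    with open(path, "w", encoding="utf-8") as fh:
        for title, url in links:
            fh.write("{}\t{}\n".format(title, url))


# Made-up sample data; the real spider would collect these from the forum pages.
SAMPLE_LINKS = [
    ("Android packer analysis", "http://example.com/thread/1"),
    ("iOS jailbreak notes", "http://example.com/thread/2"),
    ("Hooking on ANDROID with Xposed", "http://example.com/thread/3"),
]

if __name__ == "__main__":
    hits = filter_by_keyword(SAMPLE_LINKS, "android")
    out_path = os.path.join(tempfile.gettempdir(), "filter_title.txt")
    save_links(hits, out_path)
    print("saved {} links to {}".format(len(hits), out_path))
```

Passing the full link list gives feature 3; passing only the featured-post list gives feature 4.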

 

GitHub download link:

https://github.com/bingghost/pediy_spider

 

The following dependencies need to be installed:

bs4
requests
html5lib
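The names above are import names. On PyPI the `bs4` module is distributed as `beautifulsoup4`, while `requests` and `html5lib` are published under their own names. A small sketch for checking which of the three are still missing before running the spider is shown below; the helper name `missing_modules` is made up for illustration.

```python
import importlib.util

# Import names of the libraries listed above.
REQUIRED = ("bs4", "requests", "html5lib")


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("missing modules: " + ", ".join(missing))
    else:
        print("all dependencies are importable")
```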

 

The code is as follows:

#!/usr/bin/env python
# encoding: utf-8
"""
@author:     bingghost
@copyright:  2016 bingghost. All rights reserved.
@contact:
@date:       2016-12-1
@description: Kanxue (pediy) spider
"""
import re
import time
import requests
import argparse
from bs4 import BeautifulSoup

# Python 2 only: make utf8 the default string encoding
import sys
reload(sys)
sys.setdefaultencoding('utf8')


class PediySpider:
    def __init__(self, spider_url, specified_title):
        self._url = spider_url
        self.file_dict = {"all_title": "all_title.txt",
                          "good_title": "good_title.txt",
                          "filter_title": "filter_title.txt",
                          "filter_good_title": "filter_good_title.txt"}

        # image names that mark a good (featured) title
        self.filter_list = ['jhinfo.gif', 'good_3.gif', 'good_2.gif']

        # keyword that a title must contain (None means no filtering)
        self.specified_title = specified_title

        self.page_count = self.get_page_count()

    def get_page_content(self, page_num):
        rep_data = requests.get(self._url + str(page_num))

    # [...] The listing is truncated here: the rest of get_page_content and the
    # get_page_count and worker methods are missing. The header of the method
    # below is restored from the call to start_work() in main().

    def start_work(self):
        print "[-] start spider"
        self.worker()
        print "[-] spider okay"


def set_argument():
    # add description
    parser = argparse.ArgumentParser(
        description="A spider for the bbs of pediy's Android security forum, "
                    "also you can modify the url to spider other forum.")

    # add arguments: exactly one of -a, -f, -gf must be given
    group = parser.add_mutually_exclusive_group(required=True)

    group.add_argument(
        '-a', '--all',
        action='store_true',
        help='Get all titles')

    group.add_argument(
        '-f', '--filter',
        type=str,
        default=None,
        help='filter title')

    group.add_argument(
        '-gf', '--gfilter',
        type=str,
        default=None,
        help='filter good title')

    args = parser.parse_args()
    return args


def main():
    args = set_argument()

    spider_dict = {"android": "http://bbs.pediy.com/forumdisplay.php?f=161&order=desc&page=",
                   "ios": "http://bbs.pediy.com/forumdisplay.php?f=166&order=desc&page="}

    pediy_spider = None

    if args.all:
        pediy_spider = PediySpider(spider_dict['android'], None)

    if args.filter:
        pediy_spider = PediySpider(spider_dict['android'], args.filter)

    if args.gfilter:
        pediy_spider = PediySpider(spider_dict['android'], args.gfilter)

    pediy_spider.start_work()


if __name__ == '__main__':
    main()
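The listing defines `filter_list` with three image names, but the part that uses them is not shown. The sketch below illustrates one plausible way to separate featured threads, under loudly stated assumptions: each thread sits in its own `<tr>` row, and a featured thread carries an `<img>` whose `src` ends in one of those names. The row layout, the sample HTML, and the names `ThreadRowParser` and `split_good_threads` are hypothetical. It targets Python 3 and uses the standard library's `html.parser` instead of `bs4` so that it runs without extra dependencies.

```python
from html.parser import HTMLParser

# Marker image names taken from filter_list in the listing above.
GOOD_MARKERS = ("jhinfo.gif", "good_3.gif", "good_2.gif")


class ThreadRowParser(HTMLParser):
    """Collect (title, href, is_good) for every <tr> that contains a link."""

    def __init__(self):
        super().__init__()
        self.rows = []            # one tuple per finished row
        self._in_row = False
        self._in_link = False
        self._href = None
        self._title_parts = []
        self._is_good = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "tr":
            # Start a fresh row and reset the per-row state.
            self._in_row = True
            self._href, self._title_parts, self._is_good = None, [], False
        elif self._in_row and tag == "a" and self._href is None and attrs.get("href"):
            # Only the first link in a row is treated as the thread link.
            self._in_link = True
            self._href = attrs["href"]
        elif self._in_row and tag == "img":
            src = attrs.get("src") or ""
            if src.endswith(GOOD_MARKERS):
                self._is_good = True

    def handle_data(self, data):
        if self._in_link:
            self._title_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False
        elif tag == "tr" and self._in_row:
            if self._href is not None:
                title = "".join(self._title_parts).strip()
                self.rows.append((title, self._href, self._is_good))
            self._in_row = False


def split_good_threads(html):
    """Return (all_links, good_links) as lists of (title, href) pairs."""
    parser = ThreadRowParser()
    parser.feed(html)
    parser.close()
    all_links = [(title, href) for title, href, _ in parser.rows]
    good_links = [(title, href) for title, href, good in parser.rows if good]
    return all_links, good_links


# Hypothetical page fragment; the real forum markup may differ.
SAMPLE_HTML = """
<table>
  <tr><td><img src="images/jhinfo.gif"></td>
      <td><a href="http://example.com/thread/1">Unpacking a packed APK</a></td></tr>
  <tr><td></td>
      <td><a href="http://example.com/thread/2">Newbie question</a></td></tr>
</table>
"""

if __name__ == "__main__":
    all_links, good_links = split_good_threads(SAMPLE_HTML)
    print("{} threads, {} featured".format(len(all_links), len(good_links)))
```

Keeping the marker check separate from the link collection means the same parsed rows can feed both `all_title.txt` and `good_title.txt`.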

 

Result:

[Screenshot]

 
