Website Crawler

2024-08-18 03:20:39, 220 views

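# -*- coding: utf-8 -*-
# Python 2 script (it uses urllib2): saves each chapter of a novel to its own .txt file.
import urllib2
import re

def getlist():
    # Download the book's index page and collect the chapter links.
    html = urllib2.urlopen("http://www.quanshu.net/book/0/269/").read()
    # Each match is an (href, title) tuple; the href is later appended to the book URL.
    reg = re.compile(r'<li><a href="(.*?)" title=".*?">(.*?)</a></li>')
    urls = re.findall(reg, html)
    return urls

def getcontent(url):
    # url is a string variable, so it is concatenated outside the quoted base URL.
    html = urllib2.urlopen("http://www.quanshu.net/book/0/269/" + url).read()
    # decode('gbk') converts the GBK bytes to Unicode; encode('utf-8') converts the Unicode to UTF-8.
    html = html.decode('gbk').encode('utf-8')
    # re.S lets '.' match newlines, so the chapter text may span several lines.
    reg = re.compile(r'</script>   &nbsp(.*?)<script type="text/javascript">', re.S)
    content = re.findall(reg, html)[0]
    return content

for i in getlist():
    content = getcontent(i[0])
    content = content.replace('<br /><br />    ', '\r\n')    # \r\n is a line break

    try:
        # 'w' opens the file for writing; 'b' selects binary mode.
        with open(i[1] + '.txt', 'wb') as f:
            f.write(content)
    except Exception:
        # Skip chapters whose file cannot be written and carry on with the next one.
        continue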
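
The two regular expressions and the GBK decoding step can be tried without touching the network. The sketch below is only an illustration: the sample HTML fragments, the file names in them, and the helper names `extract_links` and `extract_body` are made up for this example, and the real pages may be marked up differently. It is written so that it also runs on Python 3, unlike the Python 2 listing above.

```python
import re

# Hypothetical index-page fragment; the real page's markup may differ.
SAMPLE_INDEX = (
    '<li><a href="1001.html" title="Chapter 1, 3000 words">Chapter 1</a></li>\n'
    '<li><a href="1002.html" title="Chapter 2, 2800 words">Chapter 2</a></li>\n'
)

# Same link pattern as getlist(): group 1 is the href, group 2 the link text.
LINK_RE = re.compile(r'<li><a href="(.*?)" title=".*?">(.*?)</a></li>')

def extract_links(html):
    """Return a list of (href, title) tuples, one per chapter link."""
    return LINK_RE.findall(html)

# Hypothetical chapter-page fragment, encoded as GBK as the script assumes.
SAMPLE_PAGE = u'</script>   &nbsp第一章\n正文<script type="text/javascript">'.encode('gbk')

# Same body pattern as getcontent(); re.S lets '.' match the newline in the body.
BODY_RE = re.compile(r'</script>   &nbsp(.*?)<script type="text/javascript">', re.S)

def extract_body(raw_bytes):
    """Decode GBK bytes to text and return what sits between the two script tags."""
    text = raw_bytes.decode('gbk')
    return BODY_RE.findall(text)[0]

links = extract_links(SAMPLE_INDEX)
body = extract_body(SAMPLE_PAGE)
print(links)
print(body.encode('utf-8'))
```

Because the pattern has two capture groups, `re.findall` returns tuples, which is why the main loop reads `i[0]` for the link and `i[1]` for the chapter title.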

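The saving step uses the chapter title directly as the file name, which can fail when a title contains characters such as `/` or `?`. One possible way to handle this is sketched below. The helpers `safe_filename`, `clean_body`, and `save_chapter` are hypothetical names introduced for this example, not part of the original script, and the temporary output directory is an assumption chosen so the sketch leaves no files in the working directory.

```python
import io
import os
import re
import tempfile

def safe_filename(title):
    """Replace characters that are invalid in file names on common systems with '_'."""
    cleaned = re.sub(r'[\\/:*?"<>|\r\n]', '_', title).strip()
    return cleaned or 'untitled'

def clean_body(body):
    """Turn the double <br /> separators into line breaks, as the main loop does."""
    return body.replace('<br /><br />    ', '\r\n')

def save_chapter(directory, title, body):
    """Write one chapter as UTF-8 text and return the path that was written."""
    path = os.path.join(directory, safe_filename(title) + '.txt')
    # newline='' keeps the '\r\n' sequences exactly as they appear in the string.
    with io.open(path, 'w', encoding='utf-8', newline='') as f:
        f.write(clean_body(body))
    return path

# Assumption: write into a fresh temporary directory for the demonstration.
out_dir = tempfile.mkdtemp()
saved = save_chapter(out_dir, u'Chapter 1: What/Why?', u'first<br /><br />    second')
print(saved)
```

Opening the file in text mode with an explicit encoding replaces the manual `encode('utf-8')` call, so the body can stay as Unicode text until it is written.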