Python crawler ---- article crawler (sensibly handling \n, \t, \r, ... in strings)


2024-11-25 23:15:02

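import re
import time
import urllib.request

# Starting article ID, e.g. 20150101000
num = input("Enter the date (20150101000): ")


def openpage(url):
    # Download the page and decode it as GB2312
    html = urllib.request.urlopen(url)
    page = html.read().decode('gb2312')
    return page


def getpassage(page, date, load):
    # First pass: capture the article body
    passage = re.findall(r'<p class="MsoNormal" align="left">([\s\S]*)</FONT>', str(page))
    # Second pass: strip the remaining HTML tags
    passage1 = re.sub(r"</?\w+[^>]*>", "", str(passage))
    # str() of a list leaves literal \r, \n, \t escapes and brackets; convert them back
    passage2 = (passage1.replace('\\r', '\r').replace('\\n', ' \n').replace('\\t', '\t')
                .replace(']', '').replace('[', '').replace('&nbsp;', '   '))
    print(passage2)
    # Append the cleaned article to the output file
    with open(load, 'a', encoding='utf-8') as f:
        f.write("-----------------------------" + "Date" + str(date) + "---------------------------------\n" + passage2 + "----------------------------------------------------\n")


for i in range(1, 32):
    date = int(num) + i
    print(date)
    load = "C:/Users/home/Desktop/新建文本文档.txt"
    url = "http://www.hbuas.edu.cn/news/xyxw/news_" + str(date) + ".htm"
    try:
        page = openpage(url)
        getpassage(page, date, load)
        print("No. " + str(i) + " has an article ---- downloaded")
    except Exception:
        # Any failure (missing page, decode error, ...) is reported as "no article"
        print("No. " + str(i) + " has no article.")
    time.sleep(2)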
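The try/except in the loop treats every failure as "no article", which also hides decode errors and network outages. As a rough sketch (not part of the original script), the loop could distinguish the two `urllib.error` exception types. The names `crawl` and `fetch` are hypothetical; `fetch` is passed in so the loop can be tried without touching the network.

```python
import time
import urllib.error


def crawl(article_ids, fetch, delay=2.0):
    """Return {article_id: page_text} for the IDs that could be fetched.

    `fetch` is any callable that takes an ID and returns the page text
    (a hypothetical hook, injected so the loop is testable offline).
    """
    pages = {}
    for article_id in article_ids:
        try:
            pages[article_id] = fetch(article_id)
        except urllib.error.HTTPError as err:
            # The server answered, but with an error status such as 404.
            print(f"{article_id}: HTTP {err.code}, skipping")
        except urllib.error.URLError as err:
            # No usable response at all (DNS failure, refused connection, ...).
            print(f"{article_id}: network problem ({err.reason}), skipping")
        time.sleep(delay)  # pause between requests, as the original loop does
    return pages
```

With the script's own `openpage`, a call might look like `crawl(range(20150101001, 20150101032), lambda n: openpage("http://www.hbuas.edu.cn/news/xyxw/news_" + str(n) + ".htm"))`. `HTTPError` is caught first because it is a subclass of `URLError`.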

I wrote a crawler that scrapes my school's news site.

It mainly involves regular expressions (re), urllib.request, and writing to a file.
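One detail of the urllib.request step worth a sketch: a strict `decode('gb2312')` raises `UnicodeDecodeError` as soon as a page contains a character outside GB2312. The helper below is only an illustration under that assumption; the name `decode_page` and the list of fallback encodings are not from the original script.

```python
def decode_page(raw, encodings=("gb2312", "gbk", "gb18030")):
    """Decode downloaded bytes, trying progressively larger Chinese charsets."""
    for name in encodings:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue  # try the next, larger character set
    # Last resort: keep what can be decoded, mark the rest with U+FFFD.
    return raw.decode(encodings[0], errors="replace")
```

In the script this would stand in for `html.read().decode('gb2312')`, for example `page = decode_page(html.read())`.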

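When scraping articles, the result usually comes back with a lot of markup and escape codes that spoil the look of the text, as shown below:

[Image: screenshot of the raw scraped output]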
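The literal \r, \n and \t in such output come from calling `str()` on the list that `re.findall` returns: the list is rendered with `repr`-style escapes and square brackets. A minimal demonstration, using a made-up HTML fragment rather than a real page from the site:

```python
import re

# Hypothetical fragment standing in for a downloaded page
html = '<p align="left">First line\r\n\tSecond line</p>'
matches = re.findall(r'<p align="left">([\s\S]*?)</p>', html)

as_repr = str(matches)       # list rendered as text: brackets, quotes, escaped \r \n \t
as_text = "".join(matches)   # the matched text itself, control characters intact

print(as_repr)
print(as_text)
```

The first print shows the brackets and backslash escapes on a single line; the second shows the text with a real line break and tab.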

Optimization:

Two regex passes

passage = re.findall(r'<p class="MsoNormal" align="left">([\s\S]*)</FONT>', str(page))  # first pass: match the article body
passage1 = re.sub(r"</?\w+[^>]*>", "", str(passage))  # second pass: strip the HTML tags
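To see what the two passes do, here is a small sketch on an invented sample (`SAMPLE` is not taken from the site). It also compares the greedy `[\s\S]*` used above with a non-greedy `[\s\S]*?`, which matters when a page has more than one matching paragraph.

```python
import re

# Made-up markup imitating two article paragraphs
SAMPLE = (
    '<p class="MsoNormal" align="left"><FONT size="3">First <b>story</b></FONT></p>'
    '<p class="MsoNormal" align="left"><FONT size="3">Second story</FONT></p>'
)

TAG = re.compile(r"</?\w+[^>]*>")

# Greedy: runs from the first <p ...> to the LAST </FONT>, giving one big match
greedy = re.findall(r'<p class="MsoNormal" align="left">([\s\S]*)</FONT>', SAMPLE)

# Non-greedy: stops at the nearest </FONT>, giving one match per paragraph
lazy = re.findall(r'<p class="MsoNormal" align="left">([\s\S]*?)</FONT>', SAMPLE)

# Second pass: drop whatever tags remain inside each match
cleaned = [TAG.sub("", part) for part in lazy]

print(len(greedy), len(lazy))
print(cleaned)
```

Either variant works for a page with a single article body; the non-greedy form just keeps paragraphs separate.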

Replacement

passage2 = passage1.replace('\\r', '\r').replace('\\n', ' \n').replace('\\t', '\t').replace(']', '').replace('[', '').replace('&nbsp;', '   ')
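An alternative to the replace chain, sketched here as one possible variation rather than the author's method: join the matches instead of calling `str()` on the list, so no escapes or brackets appear in the first place, and let the standard `html.unescape` handle entities such as `&nbsp;`. The function name `clean_matches` is hypothetical, and unlike the original it turns each `&nbsp;` into a single space instead of three.

```python
import html
import re

TAG = re.compile(r"</?\w+[^>]*>")


def clean_matches(matches):
    """Turn the list from re.findall into plain text without repr artifacts."""
    text = "\n".join(matches)      # real \r, \n, \t stay as they are
    text = TAG.sub("", text)       # strip HTML tags
    text = html.unescape(text)     # &nbsp; -> U+00A0, &amp; -> &, ...
    return text.replace("\xa0", " ")  # non-breaking space -> ordinary space
```

Tags are stripped before unescaping so that an escaped `&lt;b&gt;` in the article text is not mistaken for markup.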

The result looks like this:

[Image: screenshot of the cleaned-up output]

Done!
