Downloading Sina Blog Articles and Saving Them as Text Files (Python)

2024-08-08 12:50:04, 219 views

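Today I wrote a downloader in Python for Han Han's Sina blog articles. Its basic features are:

1. Download articles from Sina Blog in bulk and create a file named after each article's title.

2. Format the downloaded articles.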
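The first feature boils down to two steps: collecting article links from a list page, and turning each title into a file name. The sketch below illustrates both in Python 3 under stated assumptions. `extract_article_urls` and `title_to_filename` are hypothetical helper names, the `.txt` suffix and the 80-character limit are illustrative choices, and the sample HTML uses placeholder URLs rather than real Sina pages.

```python
import re


def extract_article_urls(list_html):
    """Return article URLs found in an article-list page (hypothetical helper).

    Assumes each link looks like <a title="..." ... href="http...html">.
    """
    return re.findall(r'<a\s+title.+?href="(http.+?html)', list_html)


def title_to_filename(title, max_length=80):
    """Turn an article title into a file name that is safe to create.

    Runs of path separators, other reserved characters and whitespace
    are collapsed into a single underscore.
    """
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', '_', title).strip('_')
    return (cleaned[:max_length] or 'untitled') + '.txt'


# Placeholder markup standing in for a real list page.
sample = (
    '<a title="" target="_blank" href="http://example.com/s/blog_0001.html">A</a>\n'
    '<a title="" target="_blank" href="http://example.com/s/blog_0002.html">B</a>\n'
)

print(extract_article_urls(sample))
print(title_to_filename('a/b: c?'))
```

Sanitizing the title matters because a title containing `/` would otherwise be interpreted as a directory path when the file is opened.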

Known bug: the formatting of long articles gets garbled.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import os
import re
import urllib.request


def article_format(usock, basedir):
    """Read one article page line by line and save it as a text file."""
    title_flag = True           # still looking for the <title> line
    context_start_flag = True   # still looking for the "正文开始" (body start) marker
    context_end_flag = True     # still looking for the "正文结束" (body end) marker
    for raw_line in usock:
        # The response yields bytes; the markers below assume a UTF-8 page.
        line = raw_line.decode('utf-8', 'ignore')
        if title_flag:
            title = re.findall(r'(<title>.+?<)', line)
            if title:
                title = title[0][7:-1]   # strip "<title>" and the trailing "<"
                filename = basedir + title
                print(filename)
                try:
                    fobj = open(filename, 'w+', encoding='utf-8')
                    fobj.write(title + '\n')
                    title_flag = False
                except IOError as e:
                    print("Open %s error:%s" % (filename, e))
            else:
                # print("Title not found, dropping it")
                pass
        elif context_start_flag:
            results1 = re.findall(r'(<.+?正文开始.+?>)', line)
            if results1:
                context_start_flag = False
        elif context_end_flag:
            results2 = re.findall(r'(<.+?正文结束.+?)', line)
            if results2:
                context_end_flag = False
                fobj.write('\nEND')
                fobj.close()
                break
            else:
                # Skip lines that are mostly layout markup.
                if 'div' in line or 'span' in line or '<p>' in line:
                    pass
                else:
                    # Map full-width punctuation entities to ASCII punctuation.
                    line = re.sub('&#65292;', ',', line)
                    line = re.sub('&#65306;', ':', line)
                    line = re.sub('&#65281;', '!', line)
                    line = re.sub('&#65288;', '(', line)
                    line = re.sub('&#65289;', ')', line)
                    line = re.sub('&#8943;', '...', line)
                    line = re.sub('&#65311;', '?', line)
                    line = re.sub('&#65307;', ';', line)
                    # Drop leftover inline markup.
                    line = re.sub(r'<wbr>', '', line)
                    line = re.sub(r'&nbsp;', '', line)
                    line = re.sub(r'<br\s+?/>', '', line)
                    fobj.write(line)
        else:
            pass


if __name__ == '__main__':
    basedir = '/home/tmyyss/article/'
    if not os.path.exists(basedir):
        os.makedirs(basedir)

    usock = urllib.request.urlopen("http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html")
    context = usock.read().decode('utf-8', 'ignore')
    # print(context)
    # Collect the article links on the list page, then download each one.
    raw_url_list = re.findall(r'(<a\s+title.+?href="http.+?html)', context)
    for url in raw_url_list:
        url = re.findall('(http.+?html)', url)[0]
        article_usock = urllib.request.urlopen(url)
        article_format(article_usock, basedir)
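The chain of `re.sub` calls in the listing handles a fixed set of entities one at a time. As a possible alternative, the standard library's `html.unescape` decodes any numeric or named entity in one call. The Python 3 sketch below uses a hypothetical helper, `clean_line`, and its behavior differs from the listing in three ways that are assumptions on my part: full-width punctuation is kept as full-width characters instead of being mapped to ASCII, `<br />` becomes a newline instead of being removed, and `&nbsp;` becomes a regular space instead of being dropped.

```python
import html
import re


def clean_line(line):
    """Return plain text for one line of article HTML (hypothetical helper)."""
    line = re.sub(r'<br\s*/?>', '\n', line)   # turn explicit breaks into newlines
    line = re.sub(r'<[^>]+>', '', line)       # drop remaining tags such as <wbr>
    line = html.unescape(line)                # decode &#65292;, &nbsp;, and so on
    return line.replace('\xa0', ' ')          # non-breaking space -> regular space


print(repr(clean_line('你好&#65292;世界<wbr>&nbsp;!<br />')))
```

Tags are stripped before entities are decoded so that escaped text such as `&lt;b&gt;` is not mistaken for markup.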
