
Python Web Crawler (News Collection Script)

=====================Crawler Principle=====================

Use Python to fetch the news front page and extract the links in the news ranking list with regular expressions.
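
A minimal sketch of this step, shown in isolation (the URL and pattern mirror the full script below; the site's markup may have changed since this was written):

import re
import urllib.request

# Fetch the front page and decode it (assumes the page is UTF-8)
home = 'http://baijia.baidu.com/'
html = urllib.request.urlopen(home).read().decode('utf8')

# Each ranking entry is a <p class="title"><a href="..."> element;
# capture the href of the first ten entries
patArticle = r'<p\s*class="title"><a\s*href="(.+?)"'
links = re.findall(patArticle, html)[0:10]
print(links)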

Visit each of these links in turn, extract the article information from the page's HTML source, and store it in an Article object.
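
A minimal sketch of this step for a single field (collect_title is a hypothetical helper; the pattern is patTitle from the full script below, and the same re.findall approach applies to the other fields):

import re
import urllib.request

# Container for one article's fields
class Article(object):
    title = None
    author = None
    date = None
    about = None
    content = None

def collect_title(url):
    # Fetch the article page and pull the title out of its HTML
    html = urllib.request.urlopen(url).read().decode('utf8')
    patTitle = r'<div\s*id="page">\s*<h1>(.+)</h1>'
    article = Article()
    article.title = re.findall(patTitle, html)[0]
    return article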

Save the data in the Article object to the database via pymysql (a third-party module).
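
A minimal sketch of the insert, using a parameterized query so pymysql escapes the values itself; the connection parameters match the config in the script below and the placeholder strings stand in for an Article's fields:

import pymysql

connect = pymysql.connect(host='localhost', port=3310, user='woider',
                          passwd='3243', db='python', charset='utf8')
try:
    cursor = connect.cursor()
    sql = 'INSERT INTO news (title, author, date, about, content) VALUES (%s, %s, %s, %s, %s)'
    cursor.execute(sql, ('title', 'author', '2016-05-01', 'about', 'content'))
    connect.commit()
finally:
    connect.close()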

=====================Data Structure=====================

CREATE TABLE `news` (
  `id` int(6) unsigned NOT NULL AUTO_INCREMENT,
  `title` varchar(45) NOT NULL,
  `author` varchar(12) NOT NULL,
  `date` varchar(12) NOT NULL,
  `about` varchar(255) NOT NULL,
  `content` text NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

=====================Script Code=====================

# Collect articles from Baidu Baijia
import re
import urllib.request
import pymysql.cursors

# Database configuration
config = {
    'host': 'localhost',
    'port': 3310,
    'username': 'woider',
    'password': '3243',
    'database': 'python',
    'charset': 'utf8'
}

# Table creation statement
'''
CREATE TABLE `news` (
  `id` int(6) unsigned NOT NULL AUTO_INCREMENT,
  `title` varchar(45) NOT NULL,
  `author` varchar(12) NOT NULL,
  `date` varchar(12) NOT NULL,
  `about` varchar(255) NOT NULL,
  `content` text NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
'''

# Article object
class Article(object):
    title = None
    author = None
    date = None
    about = None
    content = None

# Regular expressions
patArticle = r'<p\s*class="title"><a\s*href="(.+?)"'  # match article link
patTitle = r'<div\s*id="page">\s*<h1>(.+)</h1>'  # match article title
patAuthor = r'<div\s*class="article-info">\s*<a.+?>(.+)</a>'  # match article author
patDate = r'<span\s*class="time">(.+)</span>'  # match publish date
patAbout = r'<blockquote><i\s*class="i\siquote"></i>(.+)</blockquote>'  # match article summary
patContent = r'<div\s*class="article-detail">((.|\s)+)'  # match article content
patCopy = r'<div\s*class="copyright">(.|\s)+'  # match copyright notice
patTag = r'(<script((.|\s)*?)</script>)|(<.*?>\s*)'  # match HTML tags

# Collect article information
def collect_article(url):
    article = Article()
    html = urllib.request.urlopen(url).read().decode('utf8')
    article.title = re.findall(patTitle, html)[0]
    article.author = re.findall(patAuthor, html)[0]
    article.date = re.findall(patDate, html)[0]
    article.about = re.findall(patAbout, html)[0]
    # patContent has two groups, so findall returns tuples;
    # the outer group is the first element
    content = re.findall(patContent, html)[0]
    content = re.sub(patCopy, '', content[0])  # strip the copyright notice
    content = re.sub('</p>', '\n', content)    # keep paragraph breaks
    content = re.sub(patTag, '', content)      # strip remaining tags and scripts
    article.content = content
    return article

# Store article information
def save_article(connect, article):
    message = None
    try:
        cursor = connect.cursor()
        sql = "INSERT INTO news (title, author, date, about, content) VALUES (%s, %s, %s, %s, %s)"
        data = (article.title, article.author, article.date, article.about, article.content)
        cursor.execute(sql, data)
        connect.commit()
    except Exception as e:
        message = str(e)
    else:
        message = article.title
    finally:
        cursor.close()
        return message

# Grab links
home = 'http://baijia.baidu.com/'  # Baidu Baijia front page
html = urllib.request.urlopen(home).read().decode('utf8')  # fetch page source
links = re.findall(patArticle, html)[0:10]  # daily top news

# Connect to the database
connect = pymysql.connect(
    host=config['host'],
    port=int(config['port']),
    user=config['username'],
    passwd=config['password'],
    db=config['database'],
    charset=config['charset']
)

for url in links:
    article = collect_article(url)  # collect article information
    message = save_article(connect, article)  # store article information
    print(message)

connect.close()  # close the database connection
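
To check the result after a run, the stored rows can be read back with the same connection parameters (a sketch, not part of the original script):

import pymysql

connect = pymysql.connect(host='localhost', port=3310, user='woider',
                          passwd='3243', db='python', charset='utf8')
cursor = connect.cursor()
cursor.execute('SELECT id, title, author, date FROM news')
for row in cursor.fetchall():
    print(row)
cursor.close()
connect.close()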

=====================Execution Result=====================

[screenshots of the run results; images not preserved]
