
Python Web Crawler (News Collection Script)

=====================Crawler Principle=====================

Use Python to fetch the news front page and extract the links in the news ranking list with regular expressions.
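
A minimal sketch of this step, shown in isolation (the URL and pattern mirror the full script below; the site's markup may have changed since this was written):

import re
import urllib.request

# Fetch the front page and decode it (assumes the page is UTF-8)
home = 'http://baijia.baidu.com/'
html = urllib.request.urlopen(home).read().decode('utf8')

# Each ranking entry is a <p class="title"><a href="..."> element;
# capture the href of the first ten entries
patArticle = r'<p\s*class="title"><a\s*href="(.+?)"'
links = re.findall(patArticle, html)[0:10]
print(links)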

Visit each of these links in turn, extract the article information from the page's HTML source, and store it in an Article object.
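
A minimal sketch of this step for a single field (collect_title is a hypothetical helper; the pattern is patTitle from the full script below, and the same re.findall approach applies to the other fields):

import re
import urllib.request

# Container for one article's fields
class Article(object):
    title = None
    author = None
    date = None
    about = None
    content = None

def collect_title(url):
    # Fetch the article page and pull the title out of its HTML
    html = urllib.request.urlopen(url).read().decode('utf8')
    patTitle = r'<div\s*id="page">\s*<h1>(.+)</h1>'
    article = Article()
    article.title = re.findall(patTitle, html)[0]
    return article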

Save the data in the Article object to the database via pymysql (a third-party module).
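
A minimal sketch of the insert, using a parameterized query so pymysql escapes the values itself; the connection parameters match the config in the script below and the placeholder strings stand in for an Article's fields:

import pymysql

connect = pymysql.connect(host='localhost', port=3310, user='woider',
                          passwd='3243', db='python', charset='utf8')
try:
    cursor = connect.cursor()
    sql = 'INSERT INTO news (title, author, date, about, content) VALUES (%s, %s, %s, %s, %s)'
    cursor.execute(sql, ('title', 'author', '2016-05-01', 'about', 'content'))
    connect.commit()
finally:
    connect.close()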

=====================Data Structure=====================

CREATE TABLE `news` (
  `id` int(6) unsigned NOT NULL AUTO_INCREMENT,
  `title` varchar(45) NOT NULL,
  `author` varchar(12) NOT NULL,
  `date` varchar(12) NOT NULL,
  `about` varchar(255) NOT NULL,
  `content` text NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

=====================Script Code=====================

# Collect articles from Baidu Baijia
import re
import urllib.request
import pymysql.cursors

# Database configuration
config = {
    'host': 'localhost',
    'port': 3310,
    'username': 'woider',
    'password': '3243',
    'database': 'python',
    'charset': 'utf8'
}

# Table creation statement
'''
CREATE TABLE `news` (
  `id` int(6) unsigned NOT NULL AUTO_INCREMENT,
  `title` varchar(45) NOT NULL,
  `author` varchar(12) NOT NULL,
  `date` varchar(12) NOT NULL,
  `about` varchar(255) NOT NULL,
  `content` text NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
'''

# Article object
class Article(object):
    title = None
    author = None
    date = None
    about = None
    content = None

# Regular expressions
patArticle = r'<p\s*class="title"><a\s*href="(.+?)"'  # match article link
patTitle = r'<div\s*id="page">\s*<h1>(.+)</h1>'  # match article title
patAuthor = r'<div\s*class="article-info">\s*<a.+?>(.+)</a>'  # match article author
patDate = r'<span\s*class="time">(.+)</span>'  # match publish date
patAbout = r'<blockquote><i\s*class="i\siquote"></i>(.+)</blockquote>'  # match article summary
patContent = r'<div\s*class="article-detail">((.|\s)+)'  # match article content
patCopy = r'<div\s*class="copyright">(.|\s)+'  # match copyright notice
patTag = r'(<script((.|\s)*?)</script>)|(<.*?>\s*)'  # match HTML tags

# Collect article information
def collect_article(url):
    article = Article()
    html = urllib.request.urlopen(url).read().decode('utf8')
    article.title = re.findall(patTitle, html)[0]
    article.author = re.findall(patAuthor, html)[0]
    article.date = re.findall(patDate, html)[0]
    article.about = re.findall(patAbout, html)[0]
    # patContent has two groups, so findall returns tuples;
    # the outer group is the first element
    content = re.findall(patContent, html)[0]
    content = re.sub(patCopy, '', content[0])  # strip the copyright notice
    content = re.sub('</p>', '\n', content)    # keep paragraph breaks
    content = re.sub(patTag, '', content)      # strip remaining tags and scripts
    article.content = content
    return article

# Store article information
def save_article(connect, article):
    message = None
    try:
        cursor = connect.cursor()
        sql = "INSERT INTO news (title, author, date, about, content) VALUES (%s, %s, %s, %s, %s)"
        data = (article.title, article.author, article.date, article.about, article.content)
        cursor.execute(sql, data)
        connect.commit()
    except Exception as e:
        message = str(e)
    else:
        message = article.title
    finally:
        cursor.close()
        return message

# Grab links
home = 'http://baijia.baidu.com/'  # Baidu Baijia front page
html = urllib.request.urlopen(home).read().decode('utf8')  # fetch page source
links = re.findall(patArticle, html)[0:10]  # daily top news

# Connect to the database
connect = pymysql.connect(
    host=config['host'],
    port=int(config['port']),
    user=config['username'],
    passwd=config['password'],
    db=config['database'],
    charset=config['charset']
)

for url in links:
    article = collect_article(url)  # collect article information
    message = save_article(connect, article)  # store article information
    print(message)

connect.close()  # close the database connection
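
To check the result after a run, the stored rows can be read back with the same connection parameters (a sketch, not part of the original script):

import pymysql

connect = pymysql.connect(host='localhost', port=3310, user='woider',
                          passwd='3243', db='python', charset='utf8')
cursor = connect.cursor()
cursor.execute('SELECT id, title, author, date FROM news')
for row in cursor.fetchall():
    print(row)
cursor.close()
connect.close()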

=====================Execution Result=====================

[screenshots of the run results; images not preserved]
