Python beginner practice: scraping cnbeta articles and saving them

2024-11-16 19:40:38, 202 reads

A real need is the best motivation for learning.

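Our office has no access to the external internet, which makes reading the news painful, so I tried scraping the pages and saving them for offline reading. Today's target is cnbeta tech news. The listing address is http://m.cnbeta.com/wap/index.htm?page=1, and the first 5 pages are enough. The code is as follows: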

    #!/usr/bin/python3
    # -*- coding: utf-8 -*-
    import time
    import urllib.request

    from bs4 import BeautifulSoup

    # Browser-like User-Agent; many sites reject requests without one.
    headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
    mainurl = 'http://m.cnbeta.com/wap'

    n = 0
    f = open('cnbeta.txt', 'a', encoding='utf-8')  # running index of every article found
    for i in range(1, 6):  # listing pages 1 to 5
        add = mainurl + '/index.htm?page=' + str(i)
        req = urllib.request.Request(add, headers=headers)
        wb = urllib.request.urlopen(req).read()
        soup = BeautifulSoup(wb, 'html.parser')
        # Keep a copy of the listing page itself.
        with open(str(i) + 'cnbetamain.html', 'wb') as page_file:
            page_file.write(wb)
        # Each article link sits inside a <div class="list">.
        elv1ment = soup.find_all('div', {'class': 'list'})
        for elv in elv1ment:
            n = n + 1
            link = elv.find('a', href=True)
            url = link.get('href')
            name = link.get_text()
            print(name + ',' + 'http://m.cnbeta.com' + url)
            f.write(str(n) + ',' + name + ',' + 'http://m.cnbeta.com' + url + '\n')
            try:
                html = urllib.request.urlopen(urllib.request.Request('http://m.cnbeta.com' + url, headers=headers)).read()
                # Save the article in the current folder, named after its title.
                with open(name + '.html', 'wb') as article_file:
                    article_file.write(html)
            except Exception:
                print('NOT FOUND')
            time.sleep(1)  # pause between article requests
    f.close()
    print('OVER')
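As a quick check on the page loop, here is a minimal sketch of how the listing addresses for pages 1 through 5 can be generated. Python's `range` stops before its upper bound, so five pages need `range(1, 6)`. The helper name `listing_urls` is hypothetical and not part of the original script.

```python
def listing_urls(pages=5):
    """Return the cnbeta listing URLs for pages 1..pages (hypothetical helper)."""
    base = 'http://m.cnbeta.com/wap/index.htm?page='
    # range() excludes its upper bound, hence pages + 1.
    return [base + str(i) for i in range(1, pages + 1)]


urls = listing_urls()
print(len(urls))
print(urls[-1])
```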

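The first step is to fetch the pages by looping over the listing addresses. Note that many sites refuse requests from automated clients, so the request has to carry headers. This User-Agent works almost everywhere: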

headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
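A minimal sketch of attaching these headers to a request, assuming Python 3 (where `urllib.request` takes the place of `urllib2`). The helper names `build_request` and `fetch` and the 10-second timeout are assumptions made for illustration. Nothing is downloaded until `fetch` is called.

```python
import urllib.request

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) '
                  'Gecko/20091201 Firefox/3.5.6'
}


def build_request(url):
    """Attach the browser-like headers to a request (hypothetical helper)."""
    return urllib.request.Request(url, headers=HEADERS)


def fetch(url, timeout=10):
    """Download a page and return its raw bytes; the timeout is an assumption."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read()


req = build_request('http://m.cnbeta.com/wap/index.htm?page=1')
# urllib stores header names in capitalized form, i.e. 'User-agent'.
print(req.get_header('User-agent'))
```

Calling `fetch('http://m.cnbeta.com/wap/index.htm?page=1')` would then return the page bytes, provided the site is reachable.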

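Once a listing page has been fetched, the next step is to find the articles it contains and their addresses, using BeautifulSoup to parse the HTML. BeautifulSoup treats the page as a hierarchy of elements, so you can look things up by tag or attribute such as head, body, div, title, or href. Here the one we need is div class="list". After finding an article's address, open the URL and save the page to the current folder, using the article title as the filename.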
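Naming each file after its article title can fail when the title contains characters that are not allowed in filenames, such as `/` or `?`. The following is a minimal sketch of one way to clean a title first. The helper `safe_filename`, the underscore replacement, and the `untitled` fallback are all assumptions, not part of the original script.

```python
import re


def safe_filename(title, ext='.html'):
    """Turn an article title into a filename usable on Windows and Unix."""
    # Replace characters that are invalid in filenames on common systems.
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', '_', title).strip()
    # Fall back to a fixed name if nothing usable is left (assumed default).
    return (cleaned or 'untitled') + ext


print(safe_filename('AMD/Intel: what next?'))
```

In the scraper, the cleaned name would be passed to `open()` in place of the raw title.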
