Web Crawler - Python

2024-08-06 15:13:02

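I had a free weekend, so I wrote a small web crawler. First, what it does: it is a small program for scraping articles, blog posts and the like from web pages. Start by picking what you want to scrape, for example Han Han's Sina blog. Open his article index and note its URL, for example http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html. Every article on that page has its own link, so all we need to do is follow each link and copy the article into a file on your own computer. That's it, the articles are crawled, haha. Enough talk, on to the code.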
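As a quick illustration of the idea before the full listing, here is a minimal Python 3 sketch of just the link-extraction step. It is not the author's code: the function name `extract_article_links`, its `limit` parameter, and the sample markup are all assumptions made for the example, and the real index page may be laid out differently. It scans with `str.find` in the same order the listing does (`<a title=`, then `href=`, then `.html`).

```python
def extract_article_links(page, limit=50):
    """Collect up to `limit` article URLs from an index page's HTML.

    Scans with str.find: locate '<a title=', then the next 'href=',
    then the next '.html', and slice out the text in between.
    """
    links = []
    title = page.find('<a title=')
    while title != -1 and len(links) < limit:
        href = page.find('href=', title)
        if href == -1:
            break
        end = page.find('.html', href)
        if end == -1:
            break
        # href + 6 skips the five characters of 'href=' plus the opening
        # quote; end + 5 keeps the '.html' suffix itself.
        links.append(page[href + 6:end + 5])
        # Resume the search after the link that was just extracted.
        title = page.find('<a title=', end)
    return links


# Hypothetical sample markup, shaped like what the scan expects.
sample = (
    '<a title="First" target="_blank" '
    'href="http://blog.sina.com.cn/s/blog_example01.html">First</a>\n'
    '<a title="Second" target="_blank" '
    'href="http://blog.sina.com.cn/s/blog_example02.html">Second</a>\n'
)

for link in extract_article_links(sample):
    print(link)
```

Returning a list instead of filling a fixed 50-slot array avoids empty entries when the page holds fewer articles. Plain string searching like this is fragile if the markup changes, so treat it as a sketch of the technique rather than a robust parser.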

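# Python 2 code: urllib.urlopen and the print statement are not available in Python 3
import urllib
import time

url = [''] * 50  # room for 50 article links
j = 0
con = urllib.urlopen('http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html').read()  # article index URL
i = 0
title = con.find(r'<a title=')  # position of the first '<a title='
href = con.find(r'href=', title)  # position of the next 'href=' after it
html = con.find(r'.html', href)  # likewise, the '.html' that ends the link
while title != -1 and href != -1 and html != -1 and i < 50:  # the index lists about 50 articles
    url[i] = con[href + 6:html + 5]  # extract each article's link
    print url[i]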