Python for Beginners: Scraping the Sohu Auto Sales Database

2024-11-16 17:54:39, 203 views

I work in the automotive industry, so the first program I wanted to write was a scraper for Sohu Auto's sales database (http://db.auto.sohu.com/cxdata/).

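The database provides monthly sales figures from 2007 to the present, and each model has its own XML file. For example, the sales data for the Sagitar (速腾) is at http://db.auto.sohu.com/xml/sales/model/model1004sales.xml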

The task is to loop over every model and save each record in the form 'model---date---sales'.
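
As a small illustration of the two string operations this task relies on, here is a minimal sketch in Python 3 that builds the per-model URL from a model code and joins the fields of one record. The helper names sales_url and format_record, and the sample date and sales values in the usage note, are made up for illustration; only the URL pattern and the model code 1004 come from the post.

```python
# URL pattern taken from the Sagitar example; {code} is the model code.
SALES_URL = "http://db.auto.sohu.com/xml/sales/model/model{code}sales.xml"
SEPARATOR = "---"


def sales_url(model_code):
    """Return the sales XML URL for one model code (hypothetical helper)."""
    return SALES_URL.format(code=str(model_code).strip())


def format_record(model, date, sales):
    """Join the fields of one record with the separator (hypothetical helper)."""
    return SEPARATOR.join([str(model), str(date), str(sales)])
```

For example, sales_url(1004) gives the Sagitar URL quoted above, and format_record("Sagitar", "2007-01", 1234) joins three illustrative values into one line of output.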

#!/usr/bin/python
# -*- coding: utf-8 -*-
# Python 2 script (urllib2 and the print statement do not exist in Python 3).
import urllib2, re, time

j = 0  # number of models processed
# Raw strings keep the backslashes in the Windows paths literal.
file = open(r'D:\Program Files\Notepad++Portable\App\Notepad++\databasesohu.txt', 'r').read()
f = file.split('\n')
for n in range(0, len(f)):
    # Start the request; blank lines are skipped
    if f[n] != "":
        j = j + 1
        wb = urllib2.urlopen('http://db.auto.sohu.com/xml/sales/model/model' + str(f[n]) + 'sales.xml').read()
        # Get the model name
        code = wb[wb.index('name=') + 6:wb.index('">')]
        model = f[n] + "---" + code
        # print model  # debugging marker
        reg = 'sales date=.(.*?). salesNum=.(.*?)./>'  # regular expression
        list = re.compile(reg).findall(wb)
        # Walk the matches from last to first
        for i in range(len(list), 0, -1):
            lt = list[i - 1]
            lt = lt[0] + "---" + lt[1]
            Mdata = model + "---" + lt
            print Mdata
            file1 = open(r'D:\Program Files\Notepad++Portable\App\Notepad++\save.txt', 'a')
            file1.write(Mdata + '\n')
            file1.close()
        # Delay between requests
        time.sleep(0.5)
    else:
        print 'over'
print j

file = open(r'D:\Program Files\Notepad++Portable\App\Notepad++\databasesohu.txt', 'r').read()
f = file.split('\n')
This opens the full list of model codes and splits it on newlines.
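
The same step can be sketched in Python 3 with a with block, which closes the file automatically and drops blank lines up front. The function name read_model_codes and the UTF-8 encoding are assumptions for illustration; the post does not say how the code list is encoded.

```python
def read_model_codes(path):
    """Return the non-empty, whitespace-stripped lines of the code list."""
    # Assumption: the file holds one model code per line, encoded as UTF-8.
    with open(path, "r", encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]
```

Filtering blank lines here means the main loop no longer needs a separate check for empty entries.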

wb = urllib2.urlopen('http://db.auto.sohu.com/xml/sales/model/model' + str(f[n]) + 'sales.xml').read()
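
In Python 3, urllib2 was merged into urllib.request, so the request line above would look a little different there. The following is a minimal sketch rather than the author's code: the names fetch and fetch_all, the 10-second timeout, and the fetch_one parameter (which lets the download function be swapped out for testing) are all assumptions for illustration. The 0.5-second pause matches the delay used in the script.

```python
import time
import urllib.request


def fetch(url, timeout=10):
    """Download one URL and return the raw response bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, fetch_one=fetch, delay=0.5):
    """Fetch each URL in turn, pausing between requests to go easy on the server."""
    pages = {}
    for url in urls:
        pages[url] = fetch_one(url)
        time.sleep(delay)
    return pages
```

Passing a stand-in for fetch_one makes it possible to try the loop without touching the network.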

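The script then loops over the model codes, fetches each model's XML with urllib2, and extracts the model name into model.
A regular expression pulls out the date and sales figure of each record (an XML parser would work here as well).
Finally, the data is saved to a text file.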
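
The XML-parser alternative mentioned above might look like the following Python 3 sketch using the standard library's xml.etree.ElementTree. The sample document is invented: the element name sales and the attributes date, salesNum and name are inferred from the regular expression and the string slicing in the script, while the root element model, its id attribute, and the sample values are pure assumptions about a feed whose real layout is not shown in the post.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real feed's root element and values may differ.
SAMPLE_XML = """<model id="1004" name="Sagitar">
  <sales date="2007-01" salesNum="100"/>
  <sales date="2007-02" salesNum="120"/>
</model>"""


def parse_sales(xml_text):
    """Return (model_name, [(date, sales_num), ...]) from one sales document."""
    root = ET.fromstring(xml_text)
    # Take the first non-empty name attribute found anywhere in the tree.
    name = next((el.get("name") for el in root.iter() if el.get("name")), "")
    # Collect every <sales> element in document order.
    rows = [(el.get("date"), el.get("salesNum")) for el in root.iter("sales")]
    return name, rows
```

Calling parse_sales(SAMPLE_XML) returns the name together with the rows in document order; the script itself walks its regex matches from last to first, so wrap the rows in reversed() to reproduce that ordering.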
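One thing beginners should watch is how Python opens files: the script uses open(..., 'a'), where 'a' stands for append, so each write is added to the end of the file instead of overwriting what is already there.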
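
To make the difference between the modes concrete, here is a minimal Python 3 sketch that writes to a temporary file, first in 'a' mode and then in 'w' mode. The helper names append_line, read_lines and demo are made up for illustration, and the temporary file stands in for save.txt.

```python
import os
import tempfile


def append_line(path, line):
    """Add one line to the end of the file, creating the file if needed."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(line + "\n")


def read_lines(path):
    """Return the file's lines without their trailing newlines."""
    with open(path, "r", encoding="utf-8") as handle:
        return handle.read().splitlines()


def demo():
    """Show that 'a' keeps earlier content while 'w' discards it."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        append_line(path, "first")
        append_line(path, "second")  # "first" is still in the file
        after_append = read_lines(path)
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("only\n")   # "w" truncates the file before writing
        after_write = read_lines(path)
        return after_append, after_write
    finally:
        os.remove(path)
```

Because 'a' never truncates, running the scraper twice adds a second copy of every record to save.txt, so the output file should be cleared or renamed between runs.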


References: http://www.cnblogs.com/allenblogs/archive/2010/09/13/1824842.html
     http://www.liaoxuefeng.com/wiki/001374738125095c955c1e6d8bb493182103fac9270762a000/001374738281887b88350bd21544e6095d55eaf54cac23f000
