Python from Scratch: A Summary of a Web Crawler Exercise

2024-09-15 08:53:24, 216 views

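A web crawler is a program or script that automatically fetches information from the World Wide Web according to a set of rules.

A crawler mainly has to deal with three things: 1. sending HTTP requests, 2. parsing HTML source, and 3. coping with anti-crawling mechanisms.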

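Crawlers seemed like fun, and I happened to come across a short crawler tutorial that someone had shared on Zhihu: https://zhuanlan.zhihu.com/p/20410446 So I started learning right away!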

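Main steps:

1. Follow the tutorial to download Python, set the environment variables, learn the pip command, and install an IDE: PyCharm.

2. Learn to send a request with Python and fetch a page.

3. Inspect the structure of the page with the Chrome developer tools, then parse the page with BeautifulSoup.

4. Save the page content to a local file.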
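Steps 2 to 4 can be sketched roughly as follows. This is only an illustration written for Python 3, and it swaps in the standard library's `html.parser` and `urllib` in place of BeautifulSoup and requests so that it runs without extra installs. The names `ParagraphCollector`, `fetch`, `extract_paragraphs` and `save_lines` are hypothetical, and the sample HTML is made up; only the `con_nr` class name is taken from the article's own listing.

```python
import os
import tempfile
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class ParagraphCollector(HTMLParser):
    """Collects the text of every <p> found inside <div class="con_nr">."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0      # > 0 while we are inside the target div
        self.in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            if self.div_depth or "con_nr" in classes:
                self.div_depth += 1
        elif tag == "p" and self.div_depth:
            self.in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1
        elif tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.paragraphs[-1] += data


def fetch(url):
    """Step 2: download a page as raw bytes, sending a browser-like header."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=10) as response:
        return response.read()


def extract_paragraphs(html_text):
    """Step 3: return the non-empty paragraph texts of the target div."""
    collector = ParagraphCollector()
    collector.feed(html_text)
    collector.close()
    return [p.strip() for p in collector.paragraphs if p.strip()]


def save_lines(path, lines):
    """Step 4: append the lines to a UTF-8 text file."""
    with open(path, "a", encoding="utf-8") as fp:
        fp.writelines(line + "\n" for line in lines)


# Offline demo on a made-up page, so no network access is needed.
sample = '<div class="con_nr"><p>Question 1</p><p>Question 2</p></div><p>footer</p>'
questions = extract_paragraphs(sample)
out_path = os.path.join(tempfile.mkdtemp(), "questions.txt")
save_lines(out_path, questions)
print(questions)
```

For a real page, the bytes returned by `fetch` would first have to be decoded to text before being passed to `extract_paragraphs`.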

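Main problems encountered:

1. Basic Python syntax: variables, functions, loops, exceptions, conditional statements, creating directories, writing files.

2. Indentation matters in Python. It determines how statements are grouped and nested, so check it especially carefully inside loops.

3. Encodings: from editing the source file, to the content of the web page, to Chinese file names, encoding problems turn up everywhere.

4. Using BeautifulSoup.

5. Scraping rules stop matching. The failing page has to be analysed again and new page features selected.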
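For problem 3, one defensive pattern is to try a short list of candidate encodings and only then fall back to a lossy decode. The sketch below is written for Python 3, where bytes and text are separate types; the function name `to_text` and the candidate list are assumptions for illustration, not part of the original script. GB18030 is listed because it is a superset of GBK and GB2312, so it can decode pages that contain characters outside plain GB2312, which may be one reason a detected "GB2312" page still fails to decode.

```python
def to_text(raw, candidates=("utf-8", "gb18030")):
    """Decode raw bytes by trying each candidate encoding in order."""
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: decode with the first candidate and replace
    # undecodable bytes with U+FFFD rather than raising.
    return raw.decode(candidates[0], errors="replace")


# Bytes as a GB2312 page and a UTF-8 page would deliver them.
gb_bytes = "试题".encode("gb2312")
utf8_bytes = "试题".encode("utf-8")
print(to_text(gb_bytes) == to_text(utf8_bytes))
```

Trying UTF-8 first is a deliberate choice here: text in a GB encoding is rarely valid UTF-8, so a wrong guess tends to fail loudly and move on to the next candidate.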

Practice: code for a crawler that collects exam questions from web pages, following the next-page link automatically:

    # encoding=utf8  (declares the encoding of this .py source file)
    import requests, sys, chardet, os, time, random
    from bs4 import BeautifulSoup

    # Python 2 only: sys must be reloaded before setdefaultencoding can be called
    reload(sys)
    sys.setdefaultencoding("utf8")
    # Prints "utf8 mbcs" on Windows. MBCS (Multi-Byte Character System) is a
    # category of encodings, not the name of one specific encoding.
    print sys.getdefaultencoding(), sys.getfilesystemencoding()

    path = os.getcwd()  # current working directory
    newPath = os.path.join(path, "Computer")
    if not os.path.isdir(newPath):
        os.mkdir(newPath)  # create the output folder
    # Saving as .docx also works, but after editing in Office you always have to
    # use Save As. Wrapping the path in unicode() keeps the Chinese file name
    # from being garbled.
    destFile = unicode(newPath + "/题目.docx", "utf-8")

    # The most common trick: pretend to be a browser by faking the headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36'
    }


    def downLoadHtml(url):
        html = requests.get(url, headers=headers)
        content = html.content
        # detect() can report None as the encoding, so fall back to utf-8
        contentEn = chardet.detect(content).get("encoding") or "utf-8"
        # print contentEn  # GB2312
        try:
            # Convert the page content to the default encoding to avoid garbled Chinese
            tranCon = content.decode(contentEn).encode(sys.getdefaultencoding())
        except Exception:
            # Even with the conversion, a few pages still fail. Why?
            return content
        else:
            # print tranCon
            return tranCon


    def parseHtml(url):
        # print url, "now"
        content = downLoadHtml(url)
        contentEn = chardet.detect(content).get("encoding") or "utf-8"
        # A BeautifulSoup object (soup.name == "[document]") represents the whole document
        soup = BeautifulSoup(content, "html.parser")

        # Find the URL of the next page; every lookup may return None
        preUrl = None
        theUL = soup.find("ul", {"class": "con_updown"})
        theLi = theUL.find("li") if theUL else None
        link = theLi.find("a") if theLi else None
        href = link.get("href") if link else None
        if href:
            print href, "next"
            preUrl = href

        # Find the wanted content
        topics = []
        try:
            divCon = soup.find("div", attrs={"class": "con_nr"})
            if divCon:
                # subjects.__len__ is a method, not an integer; use len(subjects) for the count
                subjects = divCon.find_all("p")
                index = 0  # tracks which <p> we are on (enumerate() would be an alternative)
                for res in subjects:
                    # Skip the unwanted "【导读】" (introduction) line
                    if index == 0 and res.string == "【导读】":
                        index = 1  # still count this element before skipping it
                        continue
                    topic = res.string  # None when res holds child tags as well as text
                    if topic:
                        # Keep only the plain text and save it to the file
                        try:
                            parsed = topic.decode(contentEn).encode("utf8")
                        except Exception:
                            # Note: '%d' % index or str(index) converts a number to a string
                            topics.append("Decoding failed on this page, please check it manually: " + url + "\n")
                            break
                        else:
                            topics.append(parsed + "\n")
                    index = index + 1
                topics.append("\n")
            else:
                topics.append("No questions found on this page, please check it manually: " + url + "\n")
        except Exception:
            topics.append("Parsing failed on this page, please check it manually: " + url + "\n")

        fp = open(destFile, 'a')  # 'a' = append
        fp.writelines(topics)
        fp.close()
        return preUrl


    # Entry point when the .py file is run directly
    if __name__ == '__main__':
        i = 0  # number of pages processed
        next = "http://xxxxx/1.html"  # start page
        # Print the time to see how long the run takes
        print "start time:", time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))
        print next, "start"
        while next and i < 1000:
            next = parseHtml(next)
            i = i + 1
            # sTime = random.randint(3, 8)  # random integer in [3, 8], both ends inclusive
            # time.sleep(sTime)  # pause between requests to avoid anti-crawling measures
        print "end time:", time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))
        print "i =", i, "url:", next
        fp = open(destFile, 'a')  # 'a' = append
        # None and numbers cannot be concatenated to a string with +, hence str()
        fp.writelines(["lastPage: " + str(next) + "\n", "total:" + str(i) + "\n"])
        fp.close()
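The listing leaves the random pause between requests commented out. As a rough sketch of how the main loop could be isolated and exercised without touching the network, the version below (Python 3) takes the page handler as a parameter. The function name `crawl`, its parameters, and the guard against revisiting a URL are assumptions added for illustration; the 1000-page cap and the 3 to 8 second pause come from the listing.

```python
import random
import time


def crawl(start_url, parse_page, max_pages=1000, delay_range=(3, 8), sleep=time.sleep):
    """Follow next-page links until they run out, repeat, or hit max_pages.

    parse_page(url) must process one page and return the next URL or None.
    Returns (pages_processed, url_where_the_loop_stopped).
    """
    url = start_url
    seen = set()
    count = 0
    while url and url not in seen and count < max_pages:
        seen.add(url)
        next_url = parse_page(url)
        count += 1
        if next_url:
            # randint includes both endpoints, so this waits 3 to 8 seconds.
            sleep(random.randint(*delay_range))
        url = next_url
    return count, url


# Demo with a made-up three-page site; delays are recorded, not slept.
fake_site = {"page1": "page2", "page2": "page3", "page3": None}
delays = []
pages_done, last_url = crawl("page1", fake_site.get, sleep=delays.append)
print(pages_done, last_url, len(delays))  # prints: 3 None 2
```

Passing `sleep` in as an argument keeps the loop testable; in a real run the default `time.sleep` would be used and `parse_page` would be a function like `parseHtml` from the listing.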
