Scraping Qiubai (Qiushibaike) with Python


2024-09-01 17:32:43, 217 views

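Goal: print the text content of several Qiubai pages in one run, so you can read as much as you like in one sitting and save time.

Tools: Python 2.7, BeautifulSoup, requests (urllib2 was not used because it is more cumbersome).

Here is the full source first: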

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

page_number = 1
pages = int(raw_input("Enter the total number of pages you want to read:\n"))

while page_number <= pages:
    # Build the URL of the current page of the "hot" list
    myUrl = "http://www.qiubai.com/hot/page/" + str(page_number)
    print 'Page %d:' % page_number
    res = requests.get(myUrl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    # Each post body is an element with class "content"
    content_link = soup.select('.content')
    for clink in content_link:
        print clink.text
    page_number += 1
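The script mixes URL building, downloading and parsing in one loop. A possible way to split these steps, so the parsing can be tried without touching the network, is sketched below. It uses Python 3 syntax rather than the Python 2.7 of the script; the function names and the sample markup are assumptions made for illustration, and only the URL pattern and the `.content` selector come from the script itself.

```python
import requests
from bs4 import BeautifulSoup

BASE_URL = "http://www.qiubai.com/hot/page/"  # URL pattern taken from the script


def page_urls(pages):
    """Return the URLs of pages 1..pages (hypothetical helper)."""
    return [BASE_URL + str(n) for n in range(1, pages + 1)]


def extract_posts(html):
    """Return the stripped text of every element with class "content"."""
    soup = BeautifulSoup(html, "html.parser")
    return [node.get_text(strip=True) for node in soup.select(".content")]


def crawl(pages):
    """Download each page and print its posts (needs network access)."""
    for number, url in enumerate(page_urls(pages), start=1):
        res = requests.get(url, timeout=10)
        res.raise_for_status()  # stop on HTTP errors instead of parsing an error page
        res.encoding = "utf-8"
        print("Page %d:" % number)
        for post in extract_posts(res.text):
            print(post)


# Hypothetical markup, used only to exercise extract_posts offline
sample_html = """
<div class="article"><div class="content"><span>first joke</span></div></div>
<div class="article"><div class="content"><span>second joke</span></div></div>
"""
posts = extract_posts(sample_html)
urls = page_urls(3)
```

Calling `crawl(2)` would then print the first two pages, provided the site still serves that URL pattern and markup.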

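Key points:

1. requests

A package for fetching network resources (URLs).

It improves on the shortcomings of urllib2 and lets you fetch network resources in the simplest possible way.

It can access network resources with the REST operations (POST, PUT, GET, DELETE).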
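As a rough illustration of the four verbs, the sketch below builds one request per method with `requests.Request(...).prepare()` but does not send any of them, so it runs without network access. Reusing the Qiubai URL for all four verbs is only an assumption for the example; the site is not known to accept anything but GET.

```python
import requests

url = "http://qiubai.com/hot/page/1"  # same URL as in the post, reused for illustration

# Build (but do not send) one request per REST verb
prepared = [requests.Request(method, url).prepare()
            for method in ("GET", "POST", "PUT", "DELETE")]
methods = [p.method for p in prepared]

for p in prepared:
    print(p.method, p.url)
```

To actually send them, requests offers a shortcut per verb: `requests.get`, `requests.post`, `requests.put` and `requests.delete`.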

import requests

newsurl = 'http://qiubai.com/hot/page/1'  # Qiubai URL, page 1
res = requests.get(newsurl)
res.encoding = 'utf-8'  # the page content is UTF-8 encoded

# encode converts a unicode string into a byte string in another encoding
# decode converts a byte string in another encoding into unicode

print(res.text)
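The encode/decode remark can be checked with a small round trip. The sketch below is written in Python 3 terms, where the unicode type is `str` and encoded data is `bytes`; the sample string is an assumption. It also shows why `res.encoding` matters: requests uses that value to decode the raw bytes into `res.text`, and decoding with the wrong codec yields garbage.

```python
# -*- coding: utf-8 -*-
text = u"糗事百科"             # a unicode string (sample value)

data = text.encode("utf-8")    # unicode -> UTF-8 bytes
restored = data.decode("utf-8")  # UTF-8 bytes -> unicode

# Decoding the same bytes with the wrong codec does not fail here,
# but produces unreadable characters
wrong = data.decode("latin-1")

print(len(text), len(data))
print(restored == text, wrong == text)
```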

2. BeautifulSoup

from bs4 import BeautifulSoup

soup = BeautifulSoup(res.text, 'html.parser')  # 'html.parser' is the HTML parser; omitting it triggers a warning
# print soup.text  # gets all the text in the page
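To see why the whole-page text is usually not what you want, here is a small offline sketch. The markup is invented for the example; it compares the text of the entire document with the text of just the `.content` elements.

```python
from bs4 import BeautifulSoup

# Hypothetical page: navigation, one post, footer
html_sample = """
<div class="nav">Home | Login</div>
<div class="content">A short joke.</div>
<div class="footer">Contact us</div>
"""

soup = BeautifulSoup(html_sample, "html.parser")

# Text of the whole document, pieces joined by a space
all_text = soup.get_text(" ", strip=True)

# Text of the elements we actually care about
wanted = [tag.get_text(strip=True) for tag in soup.select(".content")]

print(all_text)
print(wanted)
```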

The content we want sits inside specific tags, so we use select to find it.

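# html_sample is a string holding the HTML to parse
soup = BeautifulSoup(html_sample, 'html.parser')
header = soup.select('h1')  # use select to find the h1 elements
print(header)               # select returns a list
print header[0]             # the first matching element
print header[0].text        # extract its inner text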
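The snippet never defines `html_sample`, so here is a self-contained version with a made-up document (Python 3 syntax). The markup is an assumption; the calls are the same ones used in the post.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document; not part of the original post
html_sample = """
<html><body>
<h1 id="title">Hello World</h1>
<a href="/hot" class="link">Hot</a>
<a href="/new" class="link">New</a>
</body></html>
"""

soup = BeautifulSoup(html_sample, "html.parser")
header = soup.select("h1")

print(header)          # the list of matching tags
print(header[0])       # the first tag, markup included
print(header[0].text)  # only the inner text
```

Because select always returns a list, indexing with `[0]` raises an IndexError when nothing matches, so checking the length first is a reasonable precaution on real pages.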

2.1 Getting elements with a specific CSS attribute:

a. Use select to find all elements whose id is title (prefix the id with #)

alink = soup.select('#title')
print(alink)

b. Use select to find all elements whose class is link (prefix the class with a dot .)

soup = BeautifulSoup(html_sample, 'html.parser')
for link in soup.select('.link'):
    print link
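Both selectors can be tried side by side on a small invented document, as sketched below in Python 3 syntax. The markup is an assumption. `select_one` is used as well, since an id normally matches at most one element; it returns the first match or None.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document
html_sample = """
<h1 id="title">Hello World</h1>
<a href="/hot" class="link">Hot</a>
<a href="/new" class="link">New</a>
<a href="/about">About</a>
"""

soup = BeautifulSoup(html_sample, "html.parser")

title_text = [tag.text for tag in soup.select("#title")]  # match by id
link_text = [tag.text for tag in soup.select(".link")]    # match by class

first_title = soup.select_one("#title")    # first match, or None
missing = soup.select_one("#no-such-id")   # nothing matches this id

print(title_text, link_text)
```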

2.2 Getting the links in all a tags

Use select to find all a tags and read the link in each tag's href attribute

alinks = soup.select('a')
for link in alinks:
    print link['href']
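One caveat: `link['href']` raises a KeyError for an a tag that has no href attribute. A possible way around this, sketched below in Python 3 syntax with invented markup, is to select only `a[href]`, or to use `link.get('href')`, which returns None instead. The sketch also resolves relative links against the page URL with `urljoin`.

```python
from urllib.parse import urljoin  # Python 3; in Python 2.7 urljoin lives in urlparse
from bs4 import BeautifulSoup

base_url = "http://www.qiubai.com/hot/page/1"  # page the links are assumed to come from

# Hypothetical sample document
html_sample = """
<a href="/hot/page/2">next</a>
<a name="top">anchor without href</a>
<a href="http://example.com/">external</a>
"""

soup = BeautifulSoup(html_sample, "html.parser")

# Only a tags that carry an href, resolved to absolute URLs
hrefs = [urljoin(base_url, a["href"]) for a in soup.select("a[href]")]

# .get returns None rather than raising when the attribute is absent
raw = [a.get("href") for a in soup.select("a")]

print(hrefs)
print(raw)
```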

#the end

This is a simple crawler written for practice. If you have suggestions or comments, please leave a message. Thanks!
