Python Web Scraping Basics: requests


2024-11-13 15:21:02, 202 views

I. Fetching a web page, anytime and anywhere

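How do you fetch a web page? Anyone familiar with web development knows the flow: the browser requests a URL from the server, and the server responds with HTML, which carries the HTML tags, CSS styles, JavaScript and so on. Previously we did this with urllib from the Python standard library; this time we write the script with Requests, Python's HTTP library. Its slogan is a bold one: "HTTP for Humans".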

II. Basic usage of the Python Requests library

1. GET and POST requests

GET request

    import requests

    payload = {"t": "b", "w": "Python urllib"}
    response = requests.get('http://zzk.cnblogs.com/s', params=payload)
    # print(response.url)  # prints http://zzk.cnblogs.com/s?w=Python+urllib&t=b&AspxAutoDetectCookieSupport=1
    print(response.text)

With requests, a GET request does not require you to urlencode() the dict of parameters and append the result to the URL by hand; get() does both for whatever is passed as params.
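The encoding step can be inspected without sending anything over the network. The following is a minimal sketch rather than part of the original script: it builds the same GET request as a Request object and calls prepare(), which applies the parameter encoding. The variable names prepared and manual_url are illustrative.

```python
import requests
from urllib.parse import urlencode

payload = {"t": "b", "w": "Python urllib"}

# Build the request but do not send it; prepare() performs the URL encoding.
prepared = requests.Request('GET', 'http://zzk.cnblogs.com/s', params=payload).prepare()
print(prepared.url)

# The manual equivalent that requests saves you from writing.
manual_url = 'http://zzk.cnblogs.com/s?' + urlencode(payload)
print(manual_url == prepared.url)
```

The space in "Python urllib" is encoded as "+", the same as in the URL shown in the comment of the GET example.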

POST request

    import requests

    payload = {"t": "b", "w": "Python urllib"}
    response = requests.post('http://zzk.cnblogs.com/s', data=payload)
    print(response.text)  # a str (unicode), not bytes

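Likewise, a POST request does not require you to urlencode() the dict and then encode() the resulting string to bytes before sending it. On the response side, the content attribute returns the body as bytes, while text returns an already decoded unicode string, so no explicit decode() is needed. (The raw attribute exposes the underlying response stream and is mainly useful together with stream=True.)

Compared with Python's urllib, Python requests is simpler and easier to use.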
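To see what requests does with the dict passed as data, the request can be prepared without being sent. This is only a sketch for illustration; the variable name prepared is an assumption, and nothing here contacts the server.

```python
import requests

payload = {"t": "b", "w": "Python urllib"}

# prepare() form-encodes the dict and sets the matching Content-Type header.
prepared = requests.Request('POST', 'http://zzk.cnblogs.com/s', data=payload).prepare()

print(prepared.body)
print(prepared.headers['Content-Type'])
```

The body is the form-encoded string and the Content-Type header is set to application/x-www-form-urlencoded, which is the work urllib would leave to the caller.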

2. Setting request headers

    import requests

    payload = {"t": "b", "w": "Python urllib"}
    # The header name is User-Agent (with a hyphen); a key such as user_agent would not replace the default.
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
    response = requests.get('http://zzk.cnblogs.com/s', params=payload, headers=headers)
    print(response.request.headers)

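Request headers for get() are set by passing a dict as the headers argument. response.headers holds the headers of the server's response, while response.request.headers holds the headers the client actually sent.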
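Header names have to be spelled the way HTTP expects them, although their case does not matter. The sketch below is an illustration under stated assumptions: it prepares two requests through a Session without sending them, and the shortened user-agent string and the names good and bad are made up for the example.

```python
import requests

session = requests.Session()
url = 'http://zzk.cnblogs.com/s'
ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'  # shortened, illustrative value

# Correct header name: replaces the library's default User-Agent.
good = session.prepare_request(requests.Request('GET', url, headers={'User-Agent': ua}))
# Misspelled name: sent as an unrelated extra header, the default User-Agent stays.
bad = session.prepare_request(requests.Request('GET', url, headers={'user_agent': ua}))

print(good.headers['user-agent'])  # lookups are case-insensitive
print(bad.headers['User-Agent'])   # still the python-requests default
```

Printing response.request.headers after a real request shows the same information for a request that has actually been sent.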

3. Setting cookies

    import requests

    cookies = {'cookies_are': 'working'}
    response = requests.get('http://zzk.cnblogs.com/', cookies=cookies)
    print(response.text)

Besides a plain dict, the cookies argument of requests.get() also accepts a RequestsCookieJar object, which lets you specify the domain and path each cookie applies to.

    import requests
    import requests.cookies

    cookieJar = requests.cookies.RequestsCookieJar()
    # domain and path must match the request URL, otherwise the cookie is not sent
    cookieJar.set('cookies_are', 'working', domain='zzk.cnblogs.com', path='/')
    response = requests.get('http://zzk.cnblogs.com/', cookies=cookieJar)
    print(response.text)
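The domain and path on a jar entry decide which requests the cookie is attached to. The following sketch checks this offline by preparing requests and reading the Cookie header that would be sent. The helper cookie_header and the httpbin.org and example.com URLs are assumptions chosen for illustration.

```python
import requests
from requests.cookies import RequestsCookieJar

jar = RequestsCookieJar()
jar.set('cookies_are', 'working', domain='httpbin.org', path='/cookies')


def cookie_header(url):
    """Return the Cookie header requests would send to url, or None."""
    prepared = requests.Request('GET', url, cookies=jar).prepare()
    return prepared.headers.get('Cookie')


matching = cookie_header('http://httpbin.org/cookies')      # domain and path match
wrong_path = cookie_header('http://httpbin.org/elsewhere')  # path does not match
wrong_domain = cookie_header('http://example.com/cookies')  # domain does not match

print(matching, wrong_path, wrong_domain)
```

Only the first URL receives the cookie, so a jar entry whose domain or path does not fit the requested URL is silently left out of the request.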

4. Setting a timeout

    import requests

    # A timeout this small will almost certainly raise requests.exceptions.Timeout.
    response = requests.get('http://zzk.cnblogs.com/', timeout=0.001)
    print(response.text)
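When the timeout expires, requests raises an exception instead of returning a response, so the call usually belongs inside try/except. The sketch below is an assumption-laden demo, not part of the original article: it starts a deliberately slow local server (the hypothetical SlowHandler) so that the behaviour can be reproduced without depending on a remote site, and it passes the timeout as a (connect, read) tuple.

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class SlowHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers only after a one-second delay."""

    def do_GET(self):
        time.sleep(1.0)
        try:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b'late')
        except OSError:
            pass  # the client has already given up

    def log_message(self, *args):
        pass  # keep the demo output quiet


server = HTTPServer(('127.0.0.1', 0), SlowHandler)  # port 0: pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:{}/'.format(server.server_port)

session = requests.Session()
session.trust_env = False  # ignore proxy environment variables for the local demo

try:
    # (connect timeout, read timeout) in seconds
    session.get(url, timeout=(1.0, 0.2))
    outcome = 'ok'
except requests.exceptions.Timeout as err:
    outcome = type(err).__name__

print(outcome)
server.shutdown()
server.server_close()
```

Catching requests.exceptions.Timeout covers both the connect and the read case; here the server accepts the connection but answers too late, so the read timeout is the one that fires.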

III. Advanced usage of the Python Requests library

1. Session objects

    from requests import Session

    s = Session()

    s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
    r = s.get('http://httpbin.org/cookies')

    print(r.text)
    # '{"cookies": {"sessioncookie": "123456789"}}'

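A Session carries cookies from one request to the next, but only to the domain that set them; requests to other domains go out without the cookie. For pages that require a logged-in state, log in through the session once so the login cookies are stored, then make the subsequent requests through the same session.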
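The same-domain rule can be checked without any network traffic. In this sketch the cookie is placed into the session by hand as a stand-in for what a server's Set-Cookie header would store; the example.com URL and the variable names are assumptions for illustration.

```python
import requests

session = requests.Session()
# Stand-in for the cookie a login response would have stored in the session.
session.cookies.set('sessioncookie', '123456789', domain='httpbin.org', path='/')

# prepare_request() merges the session's cookies into each outgoing request.
same_domain = session.prepare_request(requests.Request('GET', 'http://httpbin.org/cookies'))
other_domain = session.prepare_request(requests.Request('GET', 'http://example.com/'))

print(same_domain.headers.get('Cookie'))
print(other_domain.headers.get('Cookie'))
```

The request to httpbin.org carries the cookie, while the request to the other domain has no Cookie header at all.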

2. Prepared Requests

    from requests import Request, Session

    url = 'http://zzk.cnblogs.com/s'
    payload = {"t": "b", "w": "Python urllib"}
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
        'Content-Type': 'application/x-www-form-urlencoded'
    }
    s = Session()
    request = Request('GET', url, headers=headers, data=payload)
    prepped = request.prepare()

    # do something with prepped.headers
    del prepped.headers['Content-Type']
    response = s.send(prepped, timeout=3)
    print(response.request.headers)

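The object returned by the Request object's prepare() method lets you do extra work before the request is sent, such as changing the request body or the request headers.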
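One detail worth knowing is that Request.prepare() works in isolation, whereas Session.prepare_request() also merges in session-level state such as default headers and cookies. The sketch below illustrates the difference offline; the header names X-Demo-Token and X-Request-Note are hypothetical and exist only for this example.

```python
from requests import Request, Session

session = Session()
session.headers.update({'X-Demo-Token': 'abc123'})  # hypothetical session-wide header

request = Request('POST', 'http://zzk.cnblogs.com/s', data={"t": "b", "w": "Python urllib"})

standalone = request.prepare()             # knows nothing about the session
merged = session.prepare_request(request)  # session headers and cookies merged in

# Adjust the prepared request before it would be sent with session.send(merged).
merged.headers['X-Request-Note'] = 'edited-before-send'  # hypothetical header
del merged.headers['Content-Type']

print('X-Demo-Token' in standalone.headers, 'X-Demo-Token' in merged.headers)
```

If the request should behave like one made through the session, session.prepare_request() is the safer starting point; plain prepare() is enough when no session state is involved.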

IV. Practical applications of the Python Requests library

1. Wrapping GET requests

    import requests

    # Method of a downloader class (the class definition is not shown here).
    def do_get_request(self, url, headers=None, timeout=3, is_return_text=True, num_retries=2):
        if url is None:
            return None
        print('Downloading:', url)
        if headers is None:  # default request headers
            headers = {
                'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
        response = None
        try:
            response = requests.get(url, headers=headers, timeout=timeout)

            # raises requests.exceptions.HTTPError on a 4XX client error or 5XX server error
            response.raise_for_status()
            if response.status_code == requests.codes.ok:
                if is_return_text:
                    html = response.text
                else:
                    html = response.json()
            else:
                html = None
        except requests.Timeout as err:
            print('Downloading Timeout:', err.args)
            html = None
        except requests.HTTPError as err:
            print('Downloading HTTP Error,msg:{0}'.format(err.args))
            html = None
            if num_retries > 0:
                if 500 <= response.status_code < 600:
                    # server error: retry (2 times by default), keeping the caller's settings
                    return self.do_get_request(url, headers=headers, timeout=timeout,
                                               is_return_text=is_return_text,
                                               num_retries=num_retries - 1)
        except requests.ConnectionError as err:
            print('Downloading Connection Error:', err.args)
            html = None

        return html
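The recursive retry above can also be delegated to the transport layer. The following is an alternative sketch, not the approach used in this article: it mounts an HTTPAdapter configured with urllib3's Retry class on a Session. The retry budget, status list and backoff value are assumptions picked to mirror num_retries=2.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retry_policy = Retry(
    total=2,                                # same budget as num_retries=2
    status_forcelist=[500, 502, 503, 504],  # retry only on these server errors
    backoff_factor=0.5,                     # wait a little longer before each retry
)

session = requests.Session()
adapter = HTTPAdapter(max_retries=retry_policy)
session.mount('http://', adapter)
session.mount('https://', adapter)

# session.get(url, timeout=3) would now retry on the listed status codes.
print(session.get_adapter('http://zzk.cnblogs.com/s').max_retries.total)
```

By default this policy retries idempotent methods such as GET but not POST, so the hand-written wrapper remains the more direct option when POST requests must be retried too.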

2. Wrapping POST requests

    import requests

    # Method of a downloader class (the class definition is not shown here).
    def do_post_request(self, url, data=None, headers=None, timeout=3, is_return_text=True, num_retries=2):
        if url is None:
            return None
        print('Downloading:', url)
        # if there is no request data, return immediately
        if data is None:
            return None
        if headers is None:
            headers = {
                'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
        response = None
        try:
            response = requests.post(url, data=data, headers=headers, timeout=timeout)

            # raises requests.exceptions.HTTPError on a 4XX client error or 5XX server error
            response.raise_for_status()
            if response.status_code == requests.codes.ok:
                if is_return_text:
                    html = response.text
                else:
                    html = response.json()
            else:
                html = None
        except requests.Timeout as err:
            print('Downloading Timeout:', err.args)
            html = None
        except requests.HTTPError as err:
            print('Downloading HTTP Error,msg:{0}'.format(err.args))
            html = None
            if num_retries > 0:
                if 500 <= response.status_code < 600:
                    # server error: retry (2 times by default), keeping the caller's settings
                    return self.do_post_request(url, data=data, headers=headers, timeout=timeout,
                                                is_return_text=is_return_text,
                                                num_retries=num_retries - 1)
        except requests.ConnectionError as err:
            print('Downloading Connection Error:', err.args)
            html = None

        return html

3. Persisting login cookies

    import pickle
    import requests

    url = 'http://zzk.cnblogs.com/'  # example URL; use the page that sets the login cookies
    filename = 'cookies.pkl'         # example file name

    def save_cookies(requests_cookiejar, filename):
        with open(filename, 'wb') as f:
            pickle.dump(requests_cookiejar, f)

    def load_cookies(filename):
        with open(filename, 'rb') as f:
            return pickle.load(f)

    # save request cookies
    r = requests.get(url)
    save_cookies(r.cookies, filename)

    # load cookies and do a request
    requests.get(url, cookies=load_cookies(filename))
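The save and load round trip can be tried without a live login. This sketch fills a RequestsCookieJar by hand as a stand-in for real login cookies, pickles it to a temporary file, and restores it into a fresh Session; the cookie name, value and file location are assumptions for illustration.

```python
import os
import pickle
import tempfile

import requests
from requests.cookies import RequestsCookieJar

# Stand-in for the cookies a real login response would return.
jar = RequestsCookieJar()
jar.set('sessioncookie', '123456789', domain='httpbin.org', path='/')

path = os.path.join(tempfile.mkdtemp(), 'cookies.pkl')
with open(path, 'wb') as f:
    pickle.dump(jar, f)

with open(path, 'rb') as f:
    restored = pickle.load(f)

# Loading the cookies into a Session reuses them for every later request.
session = requests.Session()
session.cookies.update(restored)
print(session.cookies.get('sessioncookie', domain='httpbin.org'))
```

Because pickle can execute code while loading, only cookie files written by your own program should be loaded this way.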
