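Using python+selenium to scrape the top three questions from Zhihu's hottest-today and hottest-this-month lists, with the first answer to each, and save them to an HTML file
The task: scrape the top three questions from Zhihu's hottest-today and hottest-this-month lists, together with the first answer to each question, and save them to an HTML file. The file should be named like 20160228_zhihu_today_hot.html, that is, the date followed by _zhihu_today_hot.html.
The code is as follows: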
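The naming rule can be sketched on its own as below. The helper name `dated_filename` and its `when` parameter are made up for this illustration; only the `20160228_zhihu_today_hot.html` pattern comes from the task description.

```python
import time


def dated_filename(suffix, when=None):
    """Return '<YYYYMMDD>_<suffix>', e.g. 20160228_zhihu_today_hot.html."""
    if when is None:
        when = time.localtime()  # default to the current local date
    return time.strftime("%Y%m%d", when) + "_" + suffix


# A fixed date makes the result reproducible.
fixed = time.strptime("2016-02-28", "%Y-%m-%d")
print(dated_filename("zhihu_today_hot.html", fixed))  # 20160228_zhihu_today_hot.html
```

Note that `%Y%m%d` gives the compact form used in the example name, while `%Y-%m-%d` would put hyphens into the date.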
from selenium import webdriver
from selenium.webdriver.common.by import By
from time import sleep
import time


class ZhiHu():
    def __init__(self):
        self.dr = webdriver.Chrome()
        self.dr.maximize_window()
        self.today_hot_list = self.get_today_hot()
        self.month_hot_list = self.get_month_hot()

    def get_today_hot(self):
        '''Top 3 hottest Zhihu questions today'''
        today_hot = []
        i = 0
        while i < 3:
            # Reload the list page on every pass: elements found earlier
            # are no longer usable after navigating to a question page.
            self.dr.get('https://www.zhihu.com/explore')
            sleep(3)
            question_title = self.dr.find_elements(By.CSS_SELECTOR, 'div.explore-feed.feed-item>h2>a.question_link')[i].text  # question title
            question_answer_url = self.dr.find_elements(By.CSS_SELECTOR, 'div.explore-feed.feed-item>h2>a.question_link')[i].get_attribute('href')  # URL of the question page
            self.dr.get(question_answer_url)  # open the question page
            sleep(10)
            question_answer_innerhtml = self.dr.find_element(By.CSS_SELECTOR, '.zm-editable-content.clearfix').get_attribute('innerHTML')  # innerHTML of the first answer
            today_hot.append((question_title, question_answer_innerhtml))
            i += 1
        return today_hot

    def write_today_data(self):
        # Date part of the file name, e.g. 20160228
        file_date = time.strftime('%Y%m%d', time.localtime(time.time()))
        self.file = open(file_date + '_zhihu_today_hot' + '.html', 'wb')
        file_line = '**********************************************<br />'  # <br /> is the HTML line break
        for item in self.today_hot_list:
            self.file.write(file_line.encode('gbk'))
            self.file.write(('Question: ' + item[0] + '<br />').encode('gbk'))
            self.file.write(('First answer: ' + item[1] + '<br />').encode('gbk'))
        self.file.close()

    def get_month_hot(self):
        '''Top 3 hottest Zhihu questions this month'''
        month_hot = []
        i = 5  # 5 matching elements precede the hottest-this-month div
        while i < 8:
            self.dr.get('https://www.zhihu.com/explore#monthly-hot')
            sleep(3)
            question_title = self.dr.find_elements(By.CSS_SELECTOR, 'div.explore-feed.feed-item>h2>a.question_link')[i].text  # question title
            question_answer_url = self.dr.find_elements(By.CSS_SELECTOR, 'div.explore-feed.feed-item>h2>a.question_link')[i].get_attribute('href')  # URL of the question page
            self.dr.get(question_answer_url)  # open the question page
            sleep(5)
            question_answer_innerhtml = self.dr.find_element(By.CSS_SELECTOR, '.zm-editable-content').get_attribute('innerHTML')  # innerHTML of the first answer
            month_hot.append((question_title, question_answer_innerhtml))
            i += 1
        return month_hot

    def write_month_data(self):
        file_date = time.strftime('%Y%m%d', time.localtime(time.time()))
        self.file = open(file_date + '_zhihu_month_hot' + '.html', 'wb')
        file_line = '--------------------------------------<br />'
        for item in self.month_hot_list:
            self.file.write(file_line.encode('gbk'))
            self.file.write(('Question: ' + item[0] + '<br />').encode('gbk'))
            self.file.write(('First answer: ' + item[1] + '<br />').encode('gbk'))
        self.file.close()

    def quit(self):
        self.dr.quit()


if __name__ == '__main__':
    zhihu = ZhiHu()
    zhihu.write_today_data()
    zhihu.write_month_data()
    zhihu.quit()
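The two scraping methods above differ mainly in the list URL and the starting index, so they could share one helper. The following is only a sketch of that idea, not the author's code: the name `collect_hot` and its parameters are hypothetical, the `delay` pause stands in for the `sleep` calls, and the string `"css selector"` is used because it is the value of selenium's `By.CSS_SELECTOR` constant. The sketch also reads every title and URL before leaving the list page, so the list page is loaded only once.

```python
import time

CSS = "css selector"  # string value of selenium's By.CSS_SELECTOR
QUESTION_LINKS = "div.explore-feed.feed-item>h2>a.question_link"
FIRST_ANSWER = ".zm-editable-content"


def collect_hot(driver, list_url, start, count=3, delay=0):
    """Return (title, first-answer innerHTML) pairs for `count` questions."""
    driver.get(list_url)
    time.sleep(delay)
    links = driver.find_elements(CSS, QUESTION_LINKS)[start:start + count]
    # Read text and href now: the element handles become stale
    # as soon as the browser navigates to another page.
    targets = [(a.text, a.get_attribute("href")) for a in links]

    results = []
    for title, href in targets:
        driver.get(href)
        time.sleep(delay)
        answer = driver.find_element(CSS, FIRST_ANSWER)
        results.append((title, answer.get_attribute("innerHTML")))
    return results
```

With a real driver this might be called as `collect_hot(dr, 'https://www.zhihu.com/explore', 0, delay=3)` for today's list and with `start=5` for the monthly list, assuming the page structure the original selectors rely on.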
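[Screenshot: the Zhihu explore page]
[Screenshot: the generated HTML file]
Admittedly, the layout of the generated HTML is not great.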
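One way the layout could be tidied up is to write a complete HTML document with a declared charset and a heading per question. This is a minimal sketch under assumptions: the function names `render_page` and `save_page` are invented here, and UTF-8 is a choice made for the example rather than the encoding the script above uses.

```python
import html


def render_page(heading, items):
    """Build a complete HTML document from (title, answer_html) pairs."""
    parts = [
        "<!DOCTYPE html>",
        "<html>",
        "<head>",
        '<meta charset="utf-8">',
        "<title>{}</title>".format(html.escape(heading)),
        "</head>",
        "<body>",
        "<h1>{}</h1>".format(html.escape(heading)),
    ]
    for title, answer_html in items:
        # The title is plain text, so escape it; the answer is already
        # HTML (it came from innerHTML), so it is inserted unchanged.
        parts.append("<h2>{}</h2>".format(html.escape(title)))
        parts.append("<div>{}</div>".format(answer_html))
    parts += ["</body>", "</html>"]
    return "\n".join(parts)


def save_page(path, heading, items):
    """Write the rendered page to `path` as UTF-8 text."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(render_page(heading, items))
```

Declaring the charset in a `<meta>` tag lets the browser decode the file without guessing, and UTF-8 can represent characters that GBK cannot.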
This article comes from the blog "无想法,无成就!" ("No ideas, no achievements!"). Please keep this attribution: http://kemixing.blog.51cto.com/10774787/1885534