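I. Introduction

The previous post introduced the source website that our Python crawler will scrape and the software it needs. In this post we start scraping data.

II. Beautiful Soup, a Handy Scraping Tool

1. Simply put, Beautiful Soup is a Python library whose main purpose is to parse HTML and pull data out of web pages.

2. Installing Beautiful Soup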

pip install beautifulsoup4

Next, install lxml, which the code below uses as the parser:

pip install lxml
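A quick way to confirm that both installs worked is to ask Python whether it can find the modules. The sketch below is only an illustration: the helper name check_installed is hypothetical, and it relies on the fact that the beautifulsoup4 package is imported under the name bs4.

```python
import importlib.util


def check_installed(modules=("bs4", "lxml")):
    """Map each module name to True if Python can find it, else False."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    # Report the status of the two packages installed above.
    for name, found in check_installed().items():
        print(f"{name}: {'installed' if found else 'missing'}")
```

If either line reports "missing", rerun the matching pip command in the same Python environment.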

III. Fetching the Page Source

# coding=utf-8

import urllib.request
from bs4 import BeautifulSoup


def getCityLinks():
    # Home page of the historical weather site
    url = 'http://lishi.tianqi.com/'
    response = urllib.request.urlopen(url, timeout=20)
    result = response.read()
    # Parse the raw bytes with the lxml parser
    soup = BeautifulSoup(result, "lxml")
    print(soup)


getCityLinks()

Running the script prints the parsed HTML source of the page.
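The script above assumes the request always succeeds. As a hedged sketch, the fetch step could be wrapped so that network errors do not crash the program. The helper names build_request and fetch_html and the User-Agent string are assumptions made for illustration; whether this site actually needs a custom User-Agent is not something the original post states.

```python
import urllib.request

# Assumed header value for illustration; some sites reject urllib's default User-Agent.
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; example-crawler)"}


def build_request(url):
    """Create a Request object that carries the headers above."""
    return urllib.request.Request(url, headers=DEFAULT_HEADERS)


def fetch_html(url, timeout=20):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
            return response.read()
    except OSError as exc:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        print(f"Failed to fetch {url}: {exc}")
        return None
```

Inside getCityLinks, result = fetch_html(url) could then replace the urlopen and read lines, with a check for None before the bytes are passed to BeautifulSoup.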


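IV. Extracting City Information

We now have the page's source code. The next step is to extract the information we are interested in.

Our target data is the city information, all of which sits in a tags. After analyzing the HTML structure, we use soup.select("ul > li > a") to extract every a tag that matches this selector.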
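To see what this selector matches, here is a minimal offline sketch. The HTML snippet is made up for illustration and is not the real page. It uses Python's built-in "html.parser" so it runs even without lxml. The ">" combinator selects direct children only, so an a tag nested inside another element within the li is skipped.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; the real page is larger and may differ.
SAMPLE_HTML = """
<ul>
  <li><a href="/beijing/index.html" title="北京历史天气">北京</a></li>
  <li><span><a href="/about.html">关于</a></span></li>
</ul>
<p><a href="/help.html">帮助</a></p>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Only an <a> that is a direct child of an <li>, which is itself
# a direct child of a <ul>, matches this selector.
links = soup.select("ul > li > a")
for a in links:
    print(a.get_text(), a.get("href"))
```

Only the first link is selected: the second is wrapped in a span, and the third is not inside a ul at all.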

# coding=utf-8

import urllib.request
from bs4 import BeautifulSoup


def getCityLinks():
    url = 'http://lishi.tianqi.com/'
    response = urllib.request.urlopen(url, timeout=20)
    result = response.read()
    soup = BeautifulSoup(result, "lxml")
    # Select every <a> that is a direct child of an <li> inside a <ul>
    links = soup.select("ul > li > a")
    for a in links:
        print(a)


getCityLinks()

Running the script again prints every a tag that matches the selector.

Not all of these a tags are data we need, so we filter them once more.

# coding=utf-8

import urllib.request
from bs4 import BeautifulSoup


def getCityLinks():
    url = 'http://lishi.tianqi.com/'
    response = urllib.request.urlopen(url, timeout=20)
    result = response.read()
    soup = BeautifulSoup(result, "lxml")
    links = soup.select("ul > li > a")
    for a in links:
        # Keep only links whose title is the link text followed by
        # '历史天气' ("historical weather")
        if a.get_text() + '历史天气' == a.get('title'):
            city = a.get_text()
            url = a.get('href')
            print(a)


getCityLinks()

Running it once more now prints only the city links we want.
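The filter condition can also be factored into a small helper so that it can be checked without network access. This is a sketch under assumptions: the helper names, the BASE_URL constant, and the sample tuples are all made up for illustration. Whether the site's href values are relative or absolute is not stated here, so urljoin is used because it handles both.

```python
from urllib.parse import urljoin

BASE_URL = "http://lishi.tianqi.com/"
TITLE_SUFFIX = "历史天气"  # "historical weather"


def is_city_link(text, title):
    """Return True when title equals the link text plus the suffix."""
    return title is not None and title == text + TITLE_SUFFIX


def to_city_record(text, href):
    """Pair the city name with an absolute URL (absolute hrefs stay unchanged)."""
    return text, urljoin(BASE_URL, href)


# Made-up (text, title, href) samples, not real data from the site.
samples = [
    ("北京", "北京历史天气", "/beijing/index.html"),
    ("关于我们", None, "/about.html"),
    ("上海", "上海历史天气", "http://lishi.tianqi.com/shanghai/index.html"),
]

cities = [to_city_record(text, href)
          for text, title, href in samples
          if is_city_link(text, title)]
print(cities)
```

In the loop over links, is_city_link(a.get_text(), a.get('title')) would play the same role as the inline comparison, and collecting the records in a list makes them easy to hand to a later step.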

2. Extracting Information from the Historical Weather Home Page
