Developing a Lightweight Crawler in Python

(imooc summary 07: the BeautifulSoup web page parser)

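Downloading and installing BeautifulSoup

Install it with pip. At the command prompt (cmd), run:

    pip install beautifulsoup4

BeautifulSoup syntax

Using BeautifulSoup involves three parts.

First, from the HTML string of the downloaded page, we create a BeautifulSoup object. Creating this object parses the whole document string into a DOM tree.

Next, we search this DOM tree for nodes with one of two methods, find_all and find. find_all returns every node that matches the criteria; find returns only the first matching node. Both methods accept the same search arguments.

Finally, once we have a node, we can access its name, its attributes, and its text. Correspondingly, a search can match on the node name, on the node's attributes, or on the node's text. Take a link on a web page as an example:

    <a href='123.html' class='article_link'> python </a>

Here the node name is a, the attributes are href='123.html' and class='article_link', and the text is python.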
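The three kinds of search described above can be tried against that single link. This is a minimal sketch, not the course's own code: it assumes the link is `<a href='123.html' class='article_link'> python </a>`, uses Python's built-in `html.parser` backend, and the variable names are illustrative only.

```python
import re
from bs4 import BeautifulSoup

# Assumed one-line document holding the example link from the text.
html = "<a href='123.html' class='article_link'> python </a>"
soup = BeautifulSoup(html, "html.parser")

# Search by node name.
by_name = soup.find_all("a")

# Search by attribute. "class" is a reserved word in Python,
# so BeautifulSoup spells the keyword argument class_.
by_href = soup.find("a", href="123.html")
by_class = soup.find("a", class_="article_link")

# Search by text. A regular expression avoids depending on
# the whitespace around the word.
by_text = soup.find("a", string=re.compile("python"))

# Access the name, attributes, and text of the node that was found.
node = by_class
print(node.name, node["href"], node["class"], node.get_text(strip=True))
```

All four searches land on the same node here because the document contains only one link. Note that `node["class"]` is a list, since an HTML element can carry several classes.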

    from bs4 import BeautifulSoup
    import re

    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """

    # Parse the document string into a DOM tree
    soup = BeautifulSoup(html_doc, 'html.parser')

    print('Get all links')
    links = soup.find_all('a')
    for link in links:
        print(link.name, link['href'], link.get_text())

    print('Get the link to lacie')
    link_node = soup.find('a', href='http://example.com/lacie')
    print(link_node.name, link_node['href'], link_node.get_text())

    print('Regular expression match')
    # Matches the first <a> whose href contains "ill" (tillie)
    link_node = soup.find('a', href=re.compile(r"ill"))
    print(link_node.name, link_node['href'], link_node.get_text())

    print('Get the text of the title paragraph')
    # class is a Python keyword, so the argument is spelled class_
    p_node = soup.find('p', class_='title')
    print(p_node.name, p_node.get_text())
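The lookups above all assume that a matching node exists. When nothing matches, find returns None and find_all returns an empty list, so indexing the result of find directly would raise an error. The sketch below shows one way to guard against that; the helper `href_or_default` and the id `link9` are hypothetical names introduced only for illustration.

```python
from bs4 import BeautifulSoup

html = '<p class="story"><a href="http://example.com/elsie" id="link1">Elsie</a></p>'
soup = BeautifulSoup(html, "html.parser")

# No <a> has id="link9": find gives None, find_all gives an empty result list.
missing_one = soup.find("a", id="link9")
missing_all = soup.find_all("a", id="link9")


def href_or_default(soup, link_id, default=""):
    """Hypothetical helper: href of the <a> with this id, or a default value."""
    node = soup.find("a", id=link_id)
    if node is None:
        return default
    # Tag.get returns the default when the attribute itself is absent.
    return node.get("href", default)


print(href_or_default(soup, "link1"))
print(href_or_default(soup, "link9", default="(not found)"))
```

Checking for None before reading attributes is useful in a crawler, where pages often differ from the structure the code expects.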
