A few small Python crawler problems, and dynamic regular expressions in Python

2024-10-18 04:53:02 · 213 views

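1. First, plain urllib no longer works here; import urllib2 instead, plus re for regular expressions. (This is Python 2 code. In Python 3 the same functionality lives in urllib.request.)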

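# coding=utf-8
# import urllib
import urllib2
import re

def getHtml(url):
    # Download the page and return its raw HTML.
    page = urllib2.urlopen(url)
    html = page.read()
    return html



def getCountry(html):
    reg = r'<td>(.*?)</td>'
    # imgre = re.compile(reg)
    # Compiling first led to an error here: re.findall rejects a compiled
    # pattern combined with a flags argument. Either pass the flags to
    # re.compile, or leave the pattern uncompiled as below.
    cells = re.findall(reg, html, re.S|re.M)
    # The inline form (?iLmsux) takes one or more of the letters
    # 'i', 'L', 'm', 's', 'u', 'x'. It matches no characters but sets the
    # corresponding flags: re.I (ignore case), re.L (locale dependent),
    # re.M (multi-line mode), re.S (. matches every character, including
    # newline), re.U (Unicode dependent), re.X (verbose mode).
    return cells

html = getHtml("https://en.wikipedia.org/wiki/List_of_countries_by_electricity_consumption")
print getCountry(html)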

Pay attention to the comments in the code.
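The compile problem mentioned in the comments can be reproduced without any network access. The following is a minimal sketch, not the original script: SAMPLE_HTML, CELL_RE, get_cells and findall_with_late_flags are hypothetical names, and the table contents are placeholder values. It is written so that it also runs on Python 3.

```python
import re

# Hypothetical placeholder markup standing in for the downloaded page.
SAMPLE_HTML = """<table>
<tr><td>Country A</td><td>100
units</td></tr>
<tr><td>Country B</td><td>20</td></tr>
</table>"""

# The flags are given to re.compile, so the compiled pattern can be reused.
CELL_RE = re.compile(r'<td>(.*?)</td>', re.S | re.M)


def get_cells(html):
    """Return the text inside every <td>...</td> pair."""
    return CELL_RE.findall(html)


def findall_with_late_flags(html):
    """Show the failure mode: a compiled pattern plus a flags argument."""
    try:
        return re.findall(CELL_RE, html, re.S)
    except ValueError as exc:
        return "ValueError: %s" % exc


print(get_cells(SAMPLE_HTML))
print(findall_with_late_flags(SAMPLE_HTML))
```

Because re.S is set, the second cell is matched even though it spans two lines. The second call returns a ValueError message rather than a list, which is probably the error the commented-out re.compile line ran into.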

2. Writing a dynamic regular expression in Python:

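import re

# Read up to 1,000,000 characters from the input file.
with open("b.txt") as f:
    ll = f.read(1000000)
print ll
for i in range(1, 220):
    # Build the pattern from the loop counter, e.g. '1'(.*?)'2' when i == 1.
    reg = "'" + str(i) + "'" + '(.*?)' + "'" + str(i + 1) + "'"
    reg2 = re.compile(reg)  # a different pattern is compiled on every pass
    matches = reg2.findall(ll)  # renamed from "list" to avoid shadowing the builtin
    # print i, reg
    print matches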

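Note how the pattern is assembled: it is an ordinary string built by concatenation, so it can change on every iteration before being compiled.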
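The same idea can be wrapped in a small function. This is a sketch under assumptions rather than the author's code: between_markers and the sample string are made up for illustration, re.escape is added in case a marker ever contains regex metacharacters, and re.S is added so a match may span several lines.

```python
import re


def between_markers(text, start, stop):
    """Return what sits between each quoted marker n and marker n + 1."""
    results = {}
    for i in range(start, stop):
        # re.escape changes nothing for digits but protects other marker text.
        left = re.escape("'%d'" % i)
        right = re.escape("'%d'" % (i + 1))
        # A new pattern is compiled for every value of i.
        pattern = re.compile(left + r"(.*?)" + right, re.S)
        results[i] = pattern.findall(text)
    return results


# Hypothetical input in the same '1'...'2'...'3' layout as b.txt.
sample = "'1'alpha'2'beta'3'gamma'4'"
print(between_markers(sample, 1, 4))
```

Returning a dict keyed by the marker number keeps each result tied to the pattern that produced it, which is easier to inspect than a series of printed lists.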
