Using Regular Expressions in Python

2024-07-14 07:45:00

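1. The Backslash Plague

Sometimes the text to match contains a '\' itself, for example '\python'. Because the backslash has a special meaning in regular expressions, it has to be escaped with another '\' to make it literal, so the regular expression is '\\python'. To compile that expression from an ordinary string literal you have to write re.compile('\\\\python'), because Python itself needs '\\' in a string literal to express a single '\'. The result is a flood of backslashes.

The 'r' prefix solves this problem: in a raw string, a '\' is just that character and has no special meaning to Python. For example, r'\n' is the two characters '\' and 'n', whereas '\n' is a single newline character. The pattern above can therefore be written as re.compile(r'\\python').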
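The equivalence of the two spellings can be checked directly. The following is a minimal sketch, assuming Python 3; the variable names and the sample text are illustrative only.

```python
import re

# Both literals produce the same pattern string: two backslashes, then "python".
escaped = '\\\\python'   # ordinary string literal, four backslashes in source
raw = r'\\python'        # raw string literal, two backslashes in source
same_pattern = (escaped == raw)
print(same_pattern)

# The pattern matches text that contains one literal backslash before "python".
text = 'C:' + '\\' + 'python'
m = re.search(raw, text)
print(m.group())
```

The raw form is usually easier to read because what you type is what the regular expression engine receives.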

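2. Matching Methods

match(): determines whether the RE matches at the beginning of the string.

search(): scans the string for a match at any position (not only at the first character).

findall(): finds all substrings that the RE matches and returns them as a list.

finditer(): finds all substrings that the RE matches and returns them as an iterator of match objects.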

    >>> import re
    >>> p = re.compile('[a-z]+')
    >>> print(p.match(""))
    None

    >>> m = p.match('tempo')
    >>> m.group()
    'tempo'
    >>> m.start(), m.end()
    (0, 5)
    >>> m.span()
    (0, 5)

    >>> m = p.search('::: message')
    >>> m.group()
    'message'

    >>> p = re.compile(r'\d+')
    >>> it = p.finditer('12 drummers drumming, 11 pipers piping, 10 lords a-leaping')
    >>> for match in it:
    ...     print(match.span())
    ...
    (0, 2)
    (22, 24)
    (40, 42)
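The session above does not exercise findall(), so here is a short sketch of it next to match() and search(), assuming Python 3. The sample strings are only for illustration.

```python
import re

p = re.compile(r'\d+')
text = '12 drummers drumming, 11 pipers piping, 10 lords a-leaping'

# findall() returns the matched substrings themselves, as a list.
numbers = p.findall(text)
print(numbers)  # ['12', '11', '10']

# match() only succeeds at the start of the string; search() scans ahead.
at_start = p.match('drummers: 12')
anywhere = p.search('drummers: 12')
print(at_start)          # None
print(anywhere.group())  # 12
```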

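3. More Metacharacters

'|': alternation ("or"). A|B matches either A or B.

'^': matches at the beginning of a line. Normally it matches only at the beginning of the string; in MULTILINE mode it also matches at the beginning of each line.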

    >>> print(re.search('^From', 'From Here to Eternity'))
    <re.Match object; span=(0, 4), match='From'>
    >>> print(re.search('^From', 'Reciting From Memory'))
    None
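The effect of MULTILINE on '^' can be seen with a string that spans several lines. This is a small sketch, assuming Python 3; the sample text is made up for the demonstration.

```python
import re

text = 'From Here\nFrom There\nNot From Here'

# Without MULTILINE, '^' anchors only at the very start of the string.
default_hits = re.findall(r'^From', text)
print(default_hits)    # ['From']

# With MULTILINE, '^' also anchors immediately after each newline.
multiline_hits = re.findall(r'^From', text, re.MULTILINE)
print(multiline_hits)  # ['From', 'From']
```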

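'\A': matches only at the beginning of the string.

'$': matches at the end of a line. Normally it matches only at the end of the string (or just before a newline that ends the string); in MULTILINE mode it also matches just before each newline.

'\Z': matches only at the end of the string.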
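The difference between '$' and '\Z' shows up when the string ends with a newline. A minimal sketch, assuming Python 3 and an invented two-line sample:

```python
import re

text = 'first line\nsecond line\n'

# '$' matches at the end of the string and also just before a trailing newline.
dollar_span = re.search(r'line$', text).span()
print(dollar_span)  # (18, 22)

# '\Z' matches only at the very end, so the trailing newline blocks the match.
z_match = re.search(r'line\Z', text)
print(z_match)      # None

# In MULTILINE mode '$' matches before every newline.
line_ends = re.findall(r'\w+$', text, re.MULTILINE)
print(line_ends)    # ['line', 'line']
```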

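'\b': a word boundary, that is, the position at the beginning or end of a word. It is a zero-width assertion: a word ends where whitespace or any other non-alphanumeric character begins.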

    >>> p = re.compile(r'\bclass\b')
    >>> print(p.search('no class at all'))
    <re.Match object; span=(3, 8), match='class'>
    >>> print(p.search('the declassified algorithm'))
    None
    >>> print(p.search('one subclass is'))
    None

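Note that the pattern compiled above needs the 'r' prefix, because in an ordinary Python string '\b' is the backspace character, as shown here: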

    >>> p = re.compile('\bclass\b')
    >>> print(p.search('no class at all'))
    None
    >>> print(p.search('\b' + 'class' + '\b'))
    <re.Match object; span=(0, 7), match='\x08class\x08'>

'\B': the opposite of '\b'; matches only at a position that is not a word boundary.
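As a counterpart to the '\b' sessions above, the following sketch (assuming Python 3) reuses the same sample sentences with '\B' on both sides of the word.

```python
import re

p = re.compile(r'\Bclass\B')

# "class" as a whole word has a boundary on each side, so '\B' rejects it.
whole_word = p.search('no class at all')
print(whole_word)      # None

# Inside "declassified" there is no boundary on either side of "class".
inside_word = p.search('the declassified algorithm')
print(inside_word.group())  # class
```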

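4. Grouping

Parentheses '()' mark a group. For example, (ab)* matches zero or more repetitions of ab.

Groups are numbered. Group 0 is the whole RE, and the remaining groups are numbered from left to right, increasing by one each time; to find a group's number, just count the opening parentheses.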

    >>> p = re.compile('(a(b)c)d')
    >>> m = p.match('abcd')
    >>> m.group(0)
    'abcd'
    >>> m.group(1)
    'abc'
    >>> m.group(2)
    'b'
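Beyond group(n), a match object can return several groups at once. The sketch below, assuming Python 3, applies groups(), multi-argument group() and span(n) to the same pattern.

```python
import re

m = re.match(r'(a(b)c)d', 'abcd')

# groups() returns every numbered group from 1 upward as a tuple.
all_groups = m.groups()
print(all_groups)   # ('abc', 'b')

# group() accepts several group numbers and returns a tuple in that order.
picked = m.group(2, 1, 0)
print(picked)       # ('b', 'abc', 'abcd')

# span(n) gives the start and end offsets of group n.
inner_span = m.span(2)
print(inner_span)   # (1, 2)
```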

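5. Non-capturing Groups

A group whose opening parenthesis is followed by ?: is a non-capturing group. It is typically used when the contents of the group are not needed later.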

    >>> m = re.match("([abc])+", "abc")
    >>> m.groups()
    ('c',)
    >>> m = re.match("(?:[abc])+", "abc")
    >>> m.groups()
    ()

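Adding ?: makes the former group 1 disappear from the results, which is useful for filtering out groups that are not needed. One possibly puzzling detail is why the first m.groups() call returns 'c': when a capturing group is repeated, it keeps only the text from its last repetition, and here the last repetition of [abc] matched 'c'.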
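The 'c' result can be explored directly. This sketch, assuming Python 3, shows that a repeated capturing group retains only its final repetition, and that wrapping the whole repetition in an outer group is one way to capture the full run.

```python
import re

# A repeated capturing group is overwritten on every repetition,
# so only the text from the final repetition survives.
m = re.match(r'([abc])+', 'abc')
whole = m.group(0)
last_only = m.group(1)
print(whole)      # abc
print(last_only)  # c

# Putting the repetition inside an outer capturing group keeps the full run.
m_all = re.match(r'((?:[abc])+)', 'abc')
full_run = m_all.group(1)
print(full_run)   # abc
```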

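6. Named Groups

When there are many groups, their numbers become hard to remember. A group can be given a name, which acts as a label, and later references can use the name instead of the number. The syntax is (?P<name>...).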

    >>> p = re.compile(r'(?P<word>\b\w+\b)')
    >>> m = p.search('(((( Lots of punctuation )))')
    >>> m.group('word')
    'Lots'
    >>> m.group(1)
    'Lots'

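A named group can also be referenced later in the same pattern with (?P=name), for example to find a word that is repeated immediately: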

    >>> p = re.compile(r'(?P<word>\b\w+)\s+(?P=word)')
    >>> p.search('Paris in the the spring').group()
    'the the'
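Named groups can also be retrieved all at once as a dictionary with groupdict(). The following is a minimal sketch, assuming Python 3; the group names first and last and the sample name are invented for the illustration.

```python
import re

# 'first' and 'last' are hypothetical group names chosen for this example.
p = re.compile(r'(?P<first>\w+) (?P<last>\w+)')
m = p.match('Jane Doe')

# groupdict() maps each group name to the text it matched.
by_name = m.groupdict()
print(by_name)  # {'first': 'Jane', 'last': 'Doe'}

# A named group still has a number, so both lookups return the same text.
print(m.group('last'), m.group(2))
```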

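7. Lookahead Assertions

(?=...) is a positive lookahead: it succeeds if the expression inside the parentheses matches at the current position. The match only has to succeed; it does not consume any of the string.

(?!...) is the opposite, a negative lookahead: it succeeds only if the expression inside the parentheses does not match at the current position. It does not consume any of the string either.

For example, to match file names that have an extension while excluding those ending in .bat or .exe, you can use .*[.](?!bat$|exe$)[^.]*$ (the [^.]*$ tail ensures that only the last extension is examined, so a name such as sendmail.cf.bat is still rejected).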
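The negative lookahead can be tried on a handful of file names. This sketch assumes Python 3 and ends the pattern with [^.]*$ rather than .*$, so that only the final extension is tested; the file names are made up for the demonstration.

```python
import re

# Reject names whose last extension is exactly "bat" or "exe".
pattern = re.compile(r'.*[.](?!bat$|exe$)[^.]*$')

names = ['autoexec.bat', 'setup.exe', 'sendmail.cf', 'notes.txt', 'archive.tar.bat']
kept = [name for name in names if pattern.match(name)]
print(kept)  # ['sendmail.cf', 'notes.txt']
```

With a trailing .*$ instead, 'archive.tar.bat' would slip through, because the lookahead could be satisfied at the first dot.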

8. Splitting Strings

    >>> p = re.compile(r'\W+')
    >>> p2 = re.compile(r'(\W+)')
    >>> p.split('This... is a test.')
    ['This', 'is', 'a', 'test', '']
    >>> p2.split('This... is a test.')
    ['This', '... ', 'is', ' ', 'a', ' ', 'test', '.', '']

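If the RE contains capturing parentheses, the text matched by the delimiter is also returned in the list; otherwise it is discarded. In the example above the words are separated on runs of non-alphanumeric characters, and once the parentheses are added those runs appear in the result as well.

.split() also accepts a maxsplit argument that limits the number of splits performed; the rest of the string is returned as the last element of the list. For example: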

    >>> p = re.compile(r'\W+')
    >>> p.split('This is a test, short and sweet, of split().')
    ['This', 'is', 'a', 'test', 'short', 'and', 'sweet', 'of', 'split', '']
    >>> p.split('This is a test, short and sweet, of split().', 3)
    ['This', 'is', 'a', 'test, short and sweet, of split().']
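The same operation is available as the module-level function re.split(), where maxsplit can be passed by keyword. A small sketch, assuming Python 3, that also shows maxsplit counting splits rather than pieces:

```python
import re

# One split produces two pieces: the first word and the untouched remainder.
parts = re.split(r'\W+', 'Words, words, words.', maxsplit=1)
print(parts)       # ['Words', 'words, words.']
print(len(parts))  # 2
```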

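9. Search and Replace

Given p = re.compile('...'), the method is p.sub(replacement, string[, count=0]).

string is the original string to be searched, replacement is what each match is replaced with, and count is the maximum number of replacements. The default count of 0 means that all matches are replaced.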

    >>> p = re.compile('(blue|white|red)')
    >>> p.sub('colour', 'blue socks and red shoes')
    'colour socks and colour shoes'
    >>> p.sub('colour', 'blue socks and red shoes', count=1)
    'colour socks and red shoes'

subn() is used in the same way, but it returns a tuple containing the new string and the number of replacements made.

    >>> p = re.compile('(blue|white|red)')
    >>> p.subn('colour', 'blue socks and red shoes')
    ('colour socks and colour shoes', 2)
    >>> p.subn('colour', 'no colours at all')
    ('no colours at all', 0)

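Named groups, written (?P<name>...), were introduced earlier. In a replacement string, \g<name> inserts the text matched by the group called name, and \2 inserts the text matched by group 2. Writing group 2 followed by a literal 0 as \20 would be read as a reference to group 20, so the unambiguous form \g<2>0 exists for that case. The following example uses all three reference styles to turn every section into a subsection: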

    >>> p = re.compile('section{ (?P<name> [^}]* ) }', re.VERBOSE)
    >>> p.sub(r'subsection{\1}', 'section{First}')
    'subsection{First}'
    >>> p.sub(r'subsection{\g<1>}', 'section{First}')
    'subsection{First}'
    >>> p.sub(r'subsection{\g<name>}', 'section{First}')
    'subsection{First}'
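The \20 ambiguity mentioned above can be reproduced with a pattern that has only two groups. This is a sketch under the assumption of Python 3, where a reference to a nonexistent group in the replacement raises re.error; the pattern and input are invented for the demonstration.

```python
import re

p = re.compile(r'(\d)-(\d)')

# r'\g<2>0' means "the text of group 2, then a literal 0".
with_literal_zero = p.sub(r'\g<2>0', '3-7')
print(with_literal_zero)  # 70

# r'\20' is read as a reference to group 20, which this pattern does not have.
raised = False
try:
    p.sub(r'\20', '3-7')
except re.error as exc:
    raised = True
    print('error:', exc)
```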

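A more flexible approach is to pass a function as the replacement. The function is called with the match object for each match and returns the replacement text. In the example below, which converts decimal numbers to hexadecimal, each matched number is passed to the function, which returns its hexadecimal form.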

    >>> def hexrepl(match):
    ...     "Return the hex string for a decimal number"
    ...     value = int(match.group())
    ...     return hex(value)
    ...
    >>> p = re.compile(r'\d+')
    >>> p.sub(hexrepl, 'Call 65490 for printing, 49152 for user code.')
    'Call 0xffd2 for printing, 0xc000 for user code.'
