Chapter 7: Regular Expressions
Regex substitution
import re
s = '100 BROAD ROAD APT. 3'
re.sub(r'\bROAD$', 'RD.', s) # '100 BROAD ROAD APT. 3'
re.sub(r'\bROAD\b', 'RD.', s) # '100 BROAD RD. APT. 3'
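re.sub performs regex-based substitution. The r prefix marks the string as a raw string, so backslashes need no extra escaping; it is best to always write regexes this way. \b matches a word boundary, and $ matches the end of the string (not the end of a word), which is why the first call above leaves s unchanged.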
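To see why \b and the raw-string prefix matter, here is a minimal sketch. The address '100 BROAD' is an assumed extra input, not one of the examples above.

```python
import re

# Without \b, the pattern also matches the 'ROAD' inside 'BROAD'.
unanchored = re.sub(r'ROAD$', 'RD.', '100 BROAD')

# With \b, 'ROAD' must start at a word boundary, so 'BROAD' is left alone.
bounded = re.sub(r'\bROAD$', 'RD.', '100 BROAD')

# \b on both sides replaces the whole word wherever it appears.
anywhere = re.sub(r'\bROAD\b', 'RD.', '100 BROAD ROAD APT. 3')

# In a normal string '\b' is a single backspace character;
# in a raw string it stays as a backslash followed by 'b'.
plain_len = len('\b')
raw_len = len(r'\b')

print(unanchored, bounded, anywhere, plain_len, raw_len, sep=' | ')
```

The last two values show what the r prefix protects against: without it, the regex engine would receive a backspace character instead of the \b token.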
Regex search
Validating the thousands place of a Roman numeral: M, MM, MMM, or empty
import re
pattern = '^M?M?M?$'
re.search(pattern, '') # <_sre.SRE_Match at 0x103839a58>
re.search(pattern, 'MMMM') # None, so the interpreter displays nothing
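The same check can be wrapped in a small function. This is only a sketch; is_valid_thousands is a hypothetical name introduced here for illustration.

```python
import re

pattern = '^M?M?M?$'

def is_valid_thousands(text):
    # Hypothetical helper: True when text is '', 'M', 'MM', or 'MMM'.
    return re.search(pattern, text) is not None

results = {text: is_valid_thousands(text) for text in ['', 'M', 'MM', 'MMM', 'MMMM']}
print(results)
```

Comparing the result with None avoids relying on how a match object is displayed, which differs between Python versions.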
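Validating the hundreds place, which has the following possibilities:
- 100 = C
- 200 = CC
- 300 = CCC
- 400 = CD
- 500 = D
- 600 = DC
- 700 = DCC
- 800 = DCCC
- 900 = CM
So there are four possible patterns:
- CM
- CD
- Zero to three C characters (zero means the hundreds place is 0)
- D, followed by zero to three C characters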
pattern = '^M?M?M?(CM|CD|D?C?C?C?)$'
# the pattern can be rewritten with the {n,m} syntax as
pattern2 = '^M{0,3}(CM|CD|D?C{0,3})$'
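A quick way to check that the two spellings behave alike on the hundreds place is to run both against a few inputs. The sample strings below are assumptions chosen for illustration.

```python
import re

pattern = '^M?M?M?(CM|CD|D?C?C?C?)$'
pattern2 = '^M{0,3}(CM|CD|D?C{0,3})$'

# Assumed sample inputs: four valid numerals (including the empty string)
# and two invalid ones.
samples = ['MCM', 'MD', 'MMMCCC', '', 'MCMC', 'MDCCCC']

outcomes = {}
for text in samples:
    first = re.search(pattern, text) is not None
    second = re.search(pattern2, text) is not None
    outcomes[text] = (first, second)
    print(repr(text), first, second)
```

'MCMC' fails because nothing may follow CM in the hundreds group, and 'MDCCCC' fails because at most three C characters are allowed after D.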
Adding the tens and ones places:
pattern = '^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$'
pattern2 = '^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$'
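One way to gain confidence that both full patterns accept the same numerals is to run them against every value from 1 to 3999. The sketch below assumes a hypothetical helper, to_roman, which is not part of these notes; the rejected inputs are also assumed examples.

```python
import re

pattern = '^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$'
pattern2 = '^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$'

# Value/numeral pairs in descending order, used by the hypothetical helper.
ROMAN_TABLE = [
    (1000, 'M'), (900, 'CM'), (500, 'D'), (400, 'CD'),
    (100, 'C'), (90, 'XC'), (50, 'L'), (40, 'XL'),
    (10, 'X'), (9, 'IX'), (5, 'V'), (4, 'IV'), (1, 'I'),
]

def to_roman(n):
    # Hypothetical helper: convert an integer in 1..3999 to a Roman numeral.
    parts = []
    for value, numeral in ROMAN_TABLE:
        while n >= value:
            parts.append(numeral)
            n -= value
    return ''.join(parts)

def both_accept(text):
    # True only when both spellings of the pattern match.
    return bool(re.search(pattern, text)) and bool(re.search(pattern2, text))

all_valid = all(both_accept(to_roman(n)) for n in range(1, 4000))

# Assumed invalid inputs; none of them should match.
invalid = ['MMMM', 'CCCC', 'IIII', 'VX', 'MCMC']
rejected = [text for text in invalid if not re.search(pattern2, text)]

print(all_valid, rejected)
```

Note that both patterns also accept the empty string, since every group is optional; a caller that needs a non-empty numeral has to check for that separately.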
Regular expressions with inline comments
Use a verbose regular expression to annotate the pattern; whitespace, line breaks, and comments inside it are all ignored.
pattern = """
    ^                   # beginning of string
    M{0,3}              # thousands - 0 to 3 M's
    (CM|CD|D?C{0,3})    # hundreds - 900 (CM), 400 (CD), 0-300 (0 to 3 C's),
                        #            or 500-800 (D, followed by 0 to 3 C's)
    (XC|XL|L?X{0,3})    # tens - 90 (XC), 40 (XL), 0-30 (0 to 3 X's),
                        #        or 50-80 (L, followed by 0 to 3 X's)
    (IX|IV|V?I{0,3})    # ones - 9 (IX), 4 (IV), 0-3 (0 to 3 I's),
                        #        or 5-8 (V, followed by 0 to 3 I's)
    $                   # end of string
"""
When using it, pass re.VERBOSE to declare that the pattern is a verbose regular expression:
import re
re.search(pattern, 'M', re.VERBOSE)
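Leaving out the flag is an easy mistake to make, so here is a small sketch of what happens with and without it. The pattern is a shortened copy of the one above, and 'MCMLXXXIX' is an assumed test input.

```python
import re

pattern = """
    ^                   # beginning of string
    M{0,3}              # thousands
    (CM|CD|D?C{0,3})    # hundreds
    (XC|XL|L?X{0,3})    # tens
    (IX|IV|V?I{0,3})    # ones
    $                   # end of string
"""

# With re.VERBOSE the whitespace and comments are ignored.
with_flag = re.search(pattern, 'MCMLXXXIX', re.VERBOSE)

# Without the flag, the spaces, newlines, and '#' text are treated as
# literal characters to match, so the numeral is not found.
without_flag = re.search(pattern, 'MCMLXXXIX')

# Compiling once with the flag avoids repeating it on every call.
compiled = re.compile(pattern, re.VERBOSE)

print(with_flag.groups(), without_flag, compiled.search('MMMM'))
```

Python cannot detect on its own that a pattern was written in verbose style, so the failure without the flag is silent: the search simply returns None.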
Matching and extracting content
The phone number formats to recognize include:
- 800-555-1212
- 800 555 1212
- 800.555.1212
- (800) 555-1212
- 1-800-555-1212
- 800-555-1212-1234
- 800-555-1212x1234
- 800-555-1212 ext. 1234
- work 1-(800) 555.1212 #1234
First write a test loop, so that phonePattern can be modified repeatedly and the results checked each time.
import re
phone_numbers = [
    '800-555-1212', '800 555 1212', '800.555.1212', '(800) 555-1212',
    '1-800-555-1212', '800-555-1212-1234', '800-555-1212x1234',
    '800-555-1212 ext. 1234', 'work 1-(800) 555.1212 #1234',
]
phonePattern = re.compile(r'^(\d{3})\D+(\d{3})\D+(\d{4})\D+(\d+)$')
for number in phone_numbers:
    print(number, ":", end=" ")
    match = phonePattern.search(number)  # search once and reuse the result
    if match:
        print(match.groups())
    else:
        print("failed")
Result:
800-555-1212 : failed
800 555 1212 : failed
800.555.1212 : failed
(800) 555-1212 : failed
1-800-555-1212 : failed
800-555-1212-1234 : ('800', '555', '1212', '1234')
800-555-1212x1234 : ('800', '555', '1212', '1234')
800-555-1212 ext. 1234 : ('800', '555', '1212', '1234')
work 1-(800) 555.1212 #1234 : failed
- Numbers without an extension cannot be handled
- Numbers without separators cannot be handled
Change the pattern to:
phonePattern = re.compile(r'^(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')
Result:
800-555-1212 : ('800', '555', '1212', '')
800 555 1212 : ('800', '555', '1212', '')
800.555.1212 : ('800', '555', '1212', '')
(800) 555-1212 : failed
1-800-555-1212 : failed
800-555-1212-1234 : ('800', '555', '1212', '1234')
800-555-1212x1234 : ('800', '555', '1212', '1234')
800-555-1212 ext. 1234 : ('800', '555', '1212', '1234')
work 1-(800) 555.1212 #1234 : failed
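The difference between the two versions comes down to + versus *: \D+ and \d+ demand at least one character, while \D* and \d* accept none. A minimal sketch, using unseparated numbers that are assumed inputs and not part of the list above:

```python
import re

# First version: separators and extension are required.
strict = re.compile(r'^(\d{3})\D+(\d{3})\D+(\d{4})\D+(\d+)$')
# Second version: separators and extension are optional.
relaxed = re.compile(r'^(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')

def groups_or_none(compiled, number):
    # Hypothetical helper: return the captured groups, or None on failure.
    match = compiled.search(number)
    return match.groups() if match else None

for number in ['8005551212', '80055512121234', '800-555-1212']:
    print(number, groups_or_none(strict, number), groups_or_none(relaxed, number))
```

When the extension is missing, the last group of the relaxed pattern is an empty string, not None, because (\d*) still takes part in the match.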
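Numbers that start with a non-digit character still fail to match. Allow optional non-digits right after the start-of-string anchor: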
phonePattern = re.compile(r'^\D*(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')
Result:
800-555-1212 : ('800', '555', '1212', '')
800 555 1212 : ('800', '555', '1212', '')
800.555.1212 : ('800', '555', '1212', '')
(800) 555-1212 : ('800', '555', '1212', '')
1-800-555-1212 : failed
800-555-1212-1234 : ('800', '555', '1212', '1234')
800-555-1212x1234 : ('800', '555', '1212', '1234')
800-555-1212 ext. 1234 : ('800', '555', '1212', '1234')
work 1-(800) 555.1212 #1234 : failed
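Because \D matches only non-digits, numbers with a leading 1 still fail. Whatever comes before the area code does not matter for the match, so drop the start-of-string anchor entirely: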
phonePattern = re.compile(r'(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')
Result:
800-555-1212 : ('800', '555', '1212', '')
800 555 1212 : ('800', '555', '1212', '')
800.555.1212 : ('800', '555', '1212', '')
(800) 555-1212 : ('800', '555', '1212', '')
1-800-555-1212 : ('800', '555', '1212', '')
800-555-1212-1234 : ('800', '555', '1212', '1234')
800-555-1212x1234 : ('800', '555', '1212', '1234')
800-555-1212 ext. 1234 : ('800', '555', '1212', '1234')
work 1-(800) 555.1212 #1234 : ('800', '555', '1212', '1234')
All of the numbers now match.
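The four iterations can also be compared side by side by counting how many of the nine sample numbers each one matches. This is a sketch; count_matches is a hypothetical helper name.

```python
import re

phone_numbers = [
    '800-555-1212', '800 555 1212', '800.555.1212', '(800) 555-1212',
    '1-800-555-1212', '800-555-1212-1234', '800-555-1212x1234',
    '800-555-1212 ext. 1234', 'work 1-(800) 555.1212 #1234',
]

# The four versions of the pattern, in the order they were tried.
versions = [
    r'^(\d{3})\D+(\d{3})\D+(\d{4})\D+(\d+)$',
    r'^(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$',
    r'^\D*(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$',
    r'(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$',
]

def count_matches(pattern_text):
    # Hypothetical helper: number of sample phone numbers the pattern matches.
    compiled = re.compile(pattern_text)
    return sum(1 for number in phone_numbers if compiled.search(number))

counts = [count_matches(pattern_text) for pattern_text in versions]
print(counts)  # [3, 6, 7, 9]
```

The counts agree with the four result listings above: each change makes the pattern accept more of the formats, until all nine are handled.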
With comments added using the verbose regular expression syntax from earlier, it can be written as:
phonePattern = re.compile(r'''
                # don't match beginning of string, number can start anywhere
    (\d{3})     # area code is 3 digits (e.g. '800')
    \D*         # optional separator is any number of non-digits
    (\d{3})     # trunk is 3 digits (e.g. '555')
    \D*         # optional separator
    (\d{4})     # rest of number is 4 digits (e.g. '1212')
    \D*         # optional separator
    (\d*)       # extension is optional and can be any number of digits
    $           # end of string
    ''', re.VERBOSE)
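Once the groups are captured, they can be reassembled into a single canonical form. The sketch below is one possible use; normalize_phone and the output format 'AAA-TTT-NNNN xEXT' are assumptions made for illustration, not something these notes prescribe.

```python
import re

phonePattern = re.compile(r'''
    (\d{3})     # area code
    \D*         # optional separator
    (\d{3})     # trunk
    \D*         # optional separator
    (\d{4})     # rest of number
    \D*         # optional separator
    (\d*)       # optional extension
    $           # end of string
''', re.VERBOSE)

def normalize_phone(text):
    # Hypothetical helper: return 'AAA-TTT-NNNN' plus ' xEXT' when an
    # extension is present, or None when the text does not match.
    match = phonePattern.search(text)
    if match is None:
        return None
    area, trunk, rest, extension = match.groups()
    normalized = '{}-{}-{}'.format(area, trunk, rest)
    if extension:
        normalized += ' x' + extension
    return normalized

print(normalize_phone('work 1-(800) 555.1212 #1234'))
print(normalize_phone('(800) 555-1212'))
```

Because a missing extension is captured as an empty string, a plain truthiness test is enough to decide whether to append it.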
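Summary of matching rules:
- \d matches any digit; \D matches any non-digit character.
- + matches one or more; * matches zero or more; ? matches zero or one.
- ^ matches the start of the string; $ matches the end of the string.
- x{n,m} matches the character x at least n times and at most m times.
- (a|b|c) matches exactly one of a, b, or c.
- (x) is, in general, a remembered group. Its value can be retrieved with the groups() method of the object returned by re.search.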
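Each rule in the summary can be tried in isolation. The inputs in this sketch are assumed examples chosen to keep every check small.

```python
import re

# \d and \D: digits versus everything else.
digits = re.findall(r'\d+', 'ext. 1234, room 56')
non_digits = re.findall(r'\D+', '800-555')

# + needs at least one character; * and ? also accept none.
plus = re.search(r'^a+$', '') is not None
star = re.search(r'^a*$', '') is not None
optional = re.search(r'^a?$', '') is not None

# x{n,m}: between n and m repetitions, tested for 1 to 4 x's.
bounded = [bool(re.search(r'^x{2,3}$', 'x' * n)) for n in range(1, 5)]

# (a|b|c) picks exactly one alternative; each (...) is a remembered group.
match = re.search(r'^(a|b|c)-(\d{3})$', 'b-555')
captured = match.groups()

print(digits, non_digits, plus, star, optional, bounded, captured)
```

groups() returns the remembered groups as a tuple, in the order their opening parentheses appear in the pattern.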