"Dive into Python" Reading Notes: Regular Expressions

Original post

2018-09-04 10:35

Chapter 7: Regular Expressions

Regex substitution

import re
s = '100 BROAD ROAD APT. 3'
re.sub(r'\bROAD$', 'RD.', s)    # '100 BROAD ROAD APT. 3' (ROAD is not at the end, so nothing changes)
re.sub(r'\bROAD\b', 'RD.', s)   # '100 BROAD RD. APT. 3'

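re.sub performs substitution based on a regular expression.
The r prefix marks the string as a raw string, so backslashes need no extra escaping; it is best to always write regular expressions this way.
\b matches a word boundary; $ matches the end of the string (not the end of a word).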
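
To make the raw-string point concrete, here is a minimal sketch. The variable names (not_raw, raw, no_boundary) are only illustrative and not from the book. In a normal string literal, '\b' is the backspace character, so the pattern silently matches nothing:

```python
import re

s = '100 BROAD ROAD APT. 3'

# Non-raw literal: '\b' is the backspace character (\x08), not a word boundary,
# so the pattern never matches and s comes back unchanged.
not_raw = re.sub('\bROAD\b', 'RD.', s)

# Raw literal: the backslash reaches the regex engine, so \b is a word boundary.
raw = re.sub(r'\bROAD\b', 'RD.', s)

# Without \b at all, the ROAD inside BROAD is replaced as well.
no_boundary = re.sub('ROAD', 'RD.', s)

print(not_raw)      # 100 BROAD ROAD APT. 3
print(raw)          # 100 BROAD RD. APT. 3
print(no_boundary)  # 100 BRD. RD. APT. 3
```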

Regex search

Validating the thousands place of a Roman numeral: M, MM, MMM, or empty

import re
pattern = r'^M?M?M?$'
re.search(pattern, '')          # returns a match object: the empty string is accepted
re.search(pattern, 'MMMM')      # None, so the interactive interpreter prints nothing
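
Since re.search returns either a match object or None, the check can be wrapped in a small predicate. This is only a sketch; is_valid_thousands is a hypothetical helper name, not something from the book:

```python
import re

pattern = r'^M?M?M?$'

def is_valid_thousands(s):
    """Return True if s is '', 'M', 'MM' or 'MMM' (hypothetical helper)."""
    return re.search(pattern, s) is not None

results = {c: is_valid_thousands(c) for c in ['', 'M', 'MM', 'MMM', 'MMMM']}
for candidate, ok in results.items():
    print(repr(candidate), ok)
```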

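Validating the hundreds place. The possibilities are:

100 = C
200 = CC
300 = CCC
400 = CD
500 = D
600 = DC
700 = DCC
800 = DCCC
900 = CM

So there are four possible patterns:

CM
CD
Zero to three C characters (zero occurrences means the hundreds place is 0)
D, followed by zero to three C characters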

pattern = '^M?M?M?(CM|CD|D?C?C?C?)$'
# pattern can be rewritten with the {m,n} syntax as
pattern2 = '^M{0,3}(CM|CD|D?C{0,3})$'
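
One way to convince yourself that the two spellings are equivalent is a brute-force comparison. The sketch below is not from the book: it generates every string of length 0 to 5 over the letters M, D and C and checks that both patterns agree on each one.

```python
import re
from itertools import product

pattern = r'^M?M?M?(CM|CD|D?C?C?C?)$'
pattern2 = r'^M{0,3}(CM|CD|D?C{0,3})$'

# Every string of length 0..5 built from the letters M, D and C.
candidates = [''.join(p) for n in range(6) for p in product('MDC', repeat=n)]

# Strings on which the two patterns give different answers (expected: none).
disagreements = [c for c in candidates
                 if bool(re.search(pattern, c)) != bool(re.search(pattern2, c))]

print(len(candidates), disagreements)
```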

Adding the tens and ones places:

pattern = '^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$'
pattern2 = '^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$'
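
A quick way to exercise the full pattern is to run it against a few numerals. This is a sketch with a hypothetical helper name (is_roman) and sample inputs chosen for illustration:

```python
import re

pattern2 = r'^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$'

def is_roman(s):
    """Return True if s matches the Roman numeral pattern (hypothetical helper)."""
    return re.search(pattern2, s) is not None

valid = ['MCMXC', 'MMXXIV', 'MDCCCLXXXVIII', 'IV']
invalid = ['MMMM', 'IIII', 'VX', 'IC']

for numeral in valid + invalid:
    print(numeral, is_roman(numeral))
```

Note that the empty string also matches, because every place in the pattern is optional; a real validator would probably reject it separately.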

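Regular expressions with inline comments

Use a verbose regular expression to add comments to a pattern. Whitespace, line breaks, and comments inside it are all ignored.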

pattern = """
^                   # beginning of string
M{0,3}              # thousands - 0 to 3 M's
(CM|CD|D?C{0,3})    # hundreds - 900 (CM), 400 (CD), 0-300 (0 to 3 C's),
                    # or 500-800 (D, followed by 0 to 3 C's)
(XC|XL|L?X{0,3})    # tens - 90 (XC), 40 (XL), 0-30 (0 to 3 X's),
                    # or 50-80 (L, followed by 0 to 3 X's)
(IX|IV|V?I{0,3})    # ones - 9 (IX), 4 (IV), 0-3 (0 to 3 I's),
                    # or 5-8 (V, followed by 0 to 3 I's)
$                   # end of string
"""

When using the pattern, pass re.VERBOSE to declare that it is a verbose regular expression.

import re
re.search(pattern, 'M', re.VERBOSE)
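
The flag is not optional. The sketch below uses a shortened pattern (thousands place only, named verbose_pattern here for illustration) to show what happens when it is left out: the whitespace and comments are then treated as literal text, so nothing matches.

```python
import re

verbose_pattern = r"""
    ^           # beginning of string
    M{0,3}      # thousands - 0 to 3 M's
    $           # end of string
"""

# With the flag, whitespace and comments are stripped before matching.
with_flag = re.search(verbose_pattern, 'MM', re.VERBOSE)

# Without the flag, the leading newline, the spaces and the '#' text are
# all part of the pattern, so 'MM' cannot match.
without_flag = re.search(verbose_pattern, 'MM')

print(with_flag is not None)   # True
print(without_flag)            # None
```

One consequence of ignoring whitespace: to match a literal space in verbose mode, it has to be escaped or placed in a character class such as [ ].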

Matching and extracting content

The formats that need to be recognized include:

800-555-1212
800 555 1212
800.555.1212
(800) 555-1212
1-800-555-1212
800-555-1212-1234
800-555-1212x1234
800-555-1212 ext. 1234
work 1-(800) 555.1212 #1234

First write a small test harness. The phonePattern expression can then be revised repeatedly and the results re-checked.

import re

phone_numbers = [
    '800-555-1212',
    '800 555 1212',
    '800.555.1212',
    '(800) 555-1212',
    '1-800-555-1212',
    '800-555-1212-1234',
    '800-555-1212x1234',
    '800-555-1212 ext. 1234',
    'work 1-(800) 555.1212 #1234',
]

phonePattern = re.compile(r'^(\d{3})\D+(\d{3})\D+(\d{4})\D+(\d+)$')

for number in phone_numbers:
    match = phonePattern.search(number)  # search once and reuse the result
    print(number, ":", match.groups() if match else "failed")

Result:

800-555-1212 : failed
800 555 1212 : failed
800.555.1212 : failed
(800) 555-1212 : failed
1-800-555-1212 : failed
800-555-1212-1234 : ('800', '555', '1212', '1234')
800-555-1212x1234 : ('800', '555', '1212', '1234')
800-555-1212 ext. 1234 : ('800', '555', '1212', '1234')
work 1-(800) 555.1212 #1234 : failed

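Numbers without an extension fail, because the trailing \D+(\d+) requires a separator followed by at least one digit.
Numbers without separators (for example 8005551212) would fail as well, because \D+ requires at least one non-digit between the groups.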
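
Both failure modes can be reproduced in isolation. In this sketch the name strict and the input '8005551212x1234' are illustrative choices, not from the original notes; the last two lines contrast + and * on a tiny pattern.

```python
import re

strict = re.compile(r'^(\d{3})\D+(\d{3})\D+(\d{4})\D+(\d+)$')

# No extension: the final \D+(\d+) has nothing left to match.
no_extension = strict.search('800-555-1212')

# No separators: \D+ needs at least one non-digit after the area code.
no_separators = strict.search('8005551212x1234')

# The same difference on a smaller pattern: + requires a separator, * does not.
plus = re.search(r'^\d{3}\D+\d{3}$', '800555')
star = re.search(r'^\d{3}\D*\d{3}$', '800555')

print(no_extension, no_separators)      # None None
print(plus, star is not None)           # None True
```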

Change + to * so that the separators and the extension become optional:

phonePattern = re.compile(r'^(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')

Result:

800-555-1212 : ('800', '555', '1212', '')
800 555 1212 : ('800', '555', '1212', '')
800.555.1212 : ('800', '555', '1212', '')
(800) 555-1212 : failed
1-800-555-1212 : failed
800-555-1212-1234 : ('800', '555', '1212', '1234')
800-555-1212x1234 : ('800', '555', '1212', '1234')
800-555-1212 ext. 1234 : ('800', '555', '1212', '1234')
work 1-(800) 555.1212 #1234 : failed

Numbers that start with a non-digit character still fail. Allow optional non-digits right after the start-of-string anchor:

phonePattern = re.compile(r'^\D*(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')

Result:

800-555-1212 : ('800', '555', '1212', '')
800 555 1212 : ('800', '555', '1212', '')
800.555.1212 : ('800', '555', '1212', '')
(800) 555-1212 : ('800', '555', '1212', '')
1-800-555-1212 : failed
800-555-1212-1234 : ('800', '555', '1212', '1234')
800-555-1212x1234 : ('800', '555', '1212', '1234')
800-555-1212 ext. 1234 : ('800', '555', '1212', '1234')
work 1-(800) 555.1212 #1234 : failed

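Because \D matches only non-digits, numbers with a leading 1 still fail. Whatever comes before the area code is irrelevant to the match, so drop the start-of-string anchor and the leading \D* altogether: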

phonePattern = re.compile(r'(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')

Result:

800-555-1212 : ('800', '555', '1212', '')
800 555 1212 : ('800', '555', '1212', '')
800.555.1212 : ('800', '555', '1212', '')
(800) 555-1212 : ('800', '555', '1212', '')
1-800-555-1212 : ('800', '555', '1212', '')
800-555-1212-1234 : ('800', '555', '1212', '1234')
800-555-1212x1234 : ('800', '555', '1212', '1234')
800-555-1212 ext. 1234 : ('800', '555', '1212', '1234')
work 1-(800) 555.1212 #1234 : ('800', '555', '1212', '1234')

All of the numbers now match.
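
Dropping ^ works because re.search scans forward for the first position at which the whole pattern can match. The sketch below inspects where the match starts, and also shows a side effect worth keeping in mind: the unanchored pattern is permissive. The input '18005551212' is an illustrative assumption, not one of the test numbers.

```python
import re

phonePattern = re.compile(r'(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')

m = phonePattern.search('work 1-(800) 555.1212 #1234')
print(m.start())    # 8, the index of the '8' in '(800)'
print(m.group(0))   # 800) 555.1212 #1234

# Caveat: with no separators, the leading 1 is absorbed into the area code
# and the last digit is reported as an extension.
loose = phonePattern.search('18005551212')
print(loose.groups())   # ('180', '055', '5121', '2')
```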
Using the verbose form introduced earlier, the pattern can be written with comments as:

phonePattern = re.compile(r'''
        # don't match beginning of string, number can start anywhere
(\d{3}) # area code is 3 digits (e.g. '800')
\D*     # optional separator is any number of non-digits
(\d{3}) # trunk is 3 digits (e.g. '555')
\D*     # optional separator
(\d{4}) # rest of number is 4 digits (e.g. '1212')
\D*     # optional separator
(\d*)   # extension is optional and can be any number of digits
$       # end of string
''', re.VERBOSE)
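
Once the groups are captured, they can be reassembled into a single canonical form. The following is a sketch only: normalize is a hypothetical helper, and the output format ('800-555-1212 x1234') is an arbitrary choice made for illustration.

```python
import re

phonePattern = re.compile(r'''
(\d{3}) # area code is 3 digits
\D*     # optional separator
(\d{3}) # trunk is 3 digits
\D*     # optional separator
(\d{4}) # rest of number is 4 digits
\D*     # optional separator
(\d*)   # extension is optional
$       # end of string
''', re.VERBOSE)

def normalize(number):
    """Return 'AAA-TTT-NNNN' or 'AAA-TTT-NNNN xEXT', or None if nothing matches."""
    m = phonePattern.search(number)
    if m is None:
        return None
    area, trunk, rest, ext = m.groups()
    base = f'{area}-{trunk}-{rest}'
    return f'{base} x{ext}' if ext else base

print(normalize('(800) 555-1212'))               # 800-555-1212
print(normalize('work 1-(800) 555.1212 #1234'))  # 800-555-1212 x1234
print(normalize('555-1212'))                     # None (only seven digits)
```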

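Summary of matching rules:

\d matches any digit; \D matches any character that is not a digit.
+ matches one or more; * matches zero or more; ? matches zero or one.
^ matches the start of the string; $ matches the end of the string.
x{n,m} matches the character x at least n times and at most m times.
(a|b|c) matches exactly one of a, b, or c.
(x) is in general a remembered group. You can retrieve its value with the groups() method of the object returned by re.search.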
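
As a compact recap, the sketch below touches several of these rules at once. The patterns and inputs are made up for illustration.

```python
import re

# Remembered groups: groups() returns all of them, group(n) returns one.
m = re.search(r'^(\d{3})\D*(\d{4})$', '555-1212')
print(m.groups())   # ('555', '1212')
print(m.group(1))   # 555

# Alternation plus a bounded repeat: one of a, b, c followed by 2 or 3 x's.
alt = re.search(r'^(a|b|c)x{2,3}$', 'bxx')
too_few = re.search(r'^(a|b|c)x{2,3}$', 'bx')
print(alt.group(1))  # b
print(too_few)       # None
```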
