Learning Python Regular Expressions

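Regular expressions are the Swiss Army knife of searching text for specific patterns. They are a large toolbox, and some of their features are often overlooked or underused. Today I will show you some advanced uses of regular expressions.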

For example, here is a regular expression we might use to detect US phone numbers:

r'^(1[-\s.])?(\()?\d{3}(?(2)\))[-\s.]?\d{3}[-\s.]?\d{4}$'

We can add comments and whitespace to make it more readable.

r'^'
r'(1[-\s.])?'  # optional '1-', '1.' or '1 ' prefix
r'(\()?'       # optional opening parenthesis
r'\d{3}'       # the area code
r'(?(2)\))'    # if there was an opening parenthesis, close it
r'[-\s.]?'     # optionally followed by '-', '.' or a space
r'\d{3}'       # first 3 digits
r'[-\s.]?'     # optionally followed by '-', '.' or a space
r'\d{4}$'      # last 4 digits

Let's put it into a code snippet:

import re

numbers = ["123 555 6789",
           "1-(123)-555-6789",
           "(123-555-6789",
           "(123).555.6789",
           "123 55 6789"]

for number in numbers:
    # Adjacent raw-string literals are concatenated into a single pattern.
    pattern = re.match(r'^'
                       r'(1[-\s.])?'  # optional '1-', '1.' or '1 ' prefix
                       r'(\()?'       # optional opening parenthesis
                       r'\d{3}'       # the area code
                       r'(?(2)\))'    # if there was an opening parenthesis, close it
                       r'[-\s.]?'     # optionally followed by '-', '.' or a space
                       r'\d{3}'       # first 3 digits
                       r'[-\s.]?'     # optionally followed by '-', '.' or a space
                       r'\d{4}$',     # last 4 digits
                       number)

    if pattern:
        print('{0} is valid'.format(number))
    else:
        print('{0} is not valid'.format(number))

Output:

123 555 6789 is valid

1-(123)-555-6789 is valid

(123-555-6789 is not valid

(123).555.6789 is valid

123 55 6789 is not valid
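The commented pattern can also be written as a single string by using the re.VERBOSE flag, which makes the regex engine ignore unescaped whitespace and treat # as the start of a comment. The following is a minimal sketch of that idea; the names PHONE_RE and is_valid_phone are made up for illustration and do not come from the original article.

```python
import re

# PHONE_RE and is_valid_phone are hypothetical names used only for this sketch.
PHONE_RE = re.compile(r"""
    ^
    (1[-\s.])?    # optional '1-', '1.' or '1 ' prefix
    (\()?         # optional opening parenthesis
    \d{3}         # the area code
    (?(2)\))      # if there was an opening parenthesis, close it
    [-\s.]?       # optional separator
    \d{3}         # first 3 digits
    [-\s.]?       # optional separator
    \d{4}$        # last 4 digits
""", re.VERBOSE)


def is_valid_phone(number):
    """Return True if the whole string looks like a US phone number."""
    return PHONE_RE.match(number) is not None


for number in ["123 555 6789", "(123-555-6789"]:
    print(number, is_valid_phone(number))
```

Compared with concatenated string literals, the verbose form keeps the pattern and its comments in one place, at the cost of having to escape any literal space or # that the pattern needs.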

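Regular expressions are a great feature of Python, but they are hard to debug and easy to get wrong.

Fortunately, Python can print the parse tree of a regular expression when you pass the re.DEBUG flag (whose integer value is 128) to re.compile or re.match.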

import re

numbers = ["123 555 6789",
           "1-(123)-555-6789",
           "(123-555-6789",
           "(123).555.6789",
           "123 55 6789"]

# Compile once with re.DEBUG so that the parse tree is printed a single time.
phone_re = re.compile(r'^'
                      r'(1[-\s.])?'  # optional '1-', '1.' or '1 ' prefix
                      r'(\()?'       # optional opening parenthesis
                      r'\d{3}'       # the area code
                      r'(?(2)\))'    # if there was an opening parenthesis, close it
                      r'[-\s.]?'     # optionally followed by '-', '.' or a space
                      r'\d{3}'       # first 3 digits
                      r'[-\s.]?'     # optionally followed by '-', '.' or a space
                      r'\d{4}$',     # last 4 digits
                      re.DEBUG)

for number in numbers:
    if phone_re.match(number):
        print('{0} is valid'.format(number))
    else:
        print('{0} is not valid'.format(number))

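Parse tree

The tree below appears to come from a variant of the pattern that also allows optional whitespace (\s*) between the elements, which is why a "max_repeat 0 2147483648" node over category_space follows each part. Literals are shown as character codes: 49 is '1', 45 is '-', 46 is '.', 40 is '(' and 41 is ')'. The exact output format depends on the Python version; recent Python 3 releases print the node names in upper case.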

at_beginning
max_repeat 0 1
  subpattern 1
    literal 49
    in
      literal 45
      category category_space
      literal 46
max_repeat 0 2147483648
  in
    category category_space
max_repeat 0 1
  subpattern 2
    literal 40
max_repeat 0 2147483648
  in
    category category_space
max_repeat 3 3
  in
    category category_digit
max_repeat 0 2147483648
  in
    category category_space
subpattern None
  groupref_exists 2
    literal 41
None
max_repeat 0 2147483648
  in
    category category_space
max_repeat 0 1
  in
    literal 45
    category category_space
    literal 46
max_repeat 0 2147483648
  in
    category category_space
max_repeat 3 3
  in
    category category_digit
max_repeat 0 2147483648
  in
    category category_space
max_repeat 0 1
  in
    literal 45
    category category_space
    literal 46
max_repeat 0 2147483648
  in
    category category_space
max_repeat 4 4
  in
    category category_digit
at at_end
max_repeat 0 2147483648
  in
    category category_space

123 555 6789 is valid

1-(123)-555-6789 is valid

(123-555-6789 is not valid

(123).555.6789 is valid

123 55 6789 is not valid

Greedy and non-greedy

Before I explain this concept, I would like to show an example. We want to find anchor tags in a piece of HTML text:

import re

html = 'Hello <a href="http://pypix.com" title="pypix">Pypix</a>'

m = re.findall(r'<a.*>.*</a>', html)
if m:
    print(m)

The result is as expected:

['<a href="http://pypix.com" title="pypix">Pypix</a>']

Let's change the input and add a second anchor tag:

import re

html = ('Hello <a href="http://pypix.com" title="pypix">Pypix</a>'
        'Hello <a href="http://example.com" title="example">Example</a>')

m = re.findall(r'<a.*>.*</a>', html)
if m:
    print(m)

At first glance the result may look right again, but don't be fooled. With two anchor tags on the same line, the pattern no longer works correctly:

['<a href="http://pypix.com" title="pypix">Pypix</a>Hello <a href="http://example.com" title="example">Example</a>']

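This time the pattern matched the first opening tag, the last closing tag and everything in between, producing one match instead of two separate ones. This is because the default matching mode is "greedy".

In greedy mode, quantifiers (such as * and +) match as many characters as possible.

When you add a question mark after the quantifier (.*?), it becomes "non-greedy" and matches as few characters as possible.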

import re

html = ('Hello <a href="http://pypix.com" title="pypix">Pypix</a>'
        'Hello <a href="http://example.com" title="example">Example</a>')

m = re.findall(r'<a.*?>.*?</a>', html)
if m:
    print(m)

Now the result is correct.

['<a href="http://pypix.com" title="pypix">Pypix</a>', '<a href="http://example.com" title="example">Example</a>']
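The difference between greedy and non-greedy quantifiers is easy to see on a smaller input. The sketch below uses a made-up string and also shows a common alternative, a negated character class, which gives the same result here without relying on lazy matching.

```python
import re

# Illustrative input, not taken from the original article.
text = '<b>bold</b> and <i>italic</i>'

greedy = re.findall(r'<.+>', text)      # runs from the first '<' to the last '>'
lazy = re.findall(r'<.+?>', text)       # stops at the first '>' after each '<'
negated = re.findall(r'<[^>]+>', text)  # cannot cross a '>' at all

print(greedy)   # ['<b>bold</b> and <i>italic</i>']
print(lazy)     # ['<b>', '</b>', '<i>', '</i>']
print(negated)  # ['<b>', '</b>', '<i>', '</i>']
```

The negated class states the intent ("anything except a closing bracket") directly, which often makes the pattern easier to reason about than a lazy quantifier.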

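Lookahead and lookbehind assertions

A lookahead assertion checks what follows the current match position without consuming it. This is easier to explain with an example.

The following pattern first matches foo and then checks whether it is followed by bar: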

import re

strings = ["hello foo",     # prints False
           "hello foobar"]  # prints True

for string in strings:
    pattern = re.search(r'foo(?=bar)', string)
    if pattern:
        print('True')
    else:
        print('False')

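This may not seem very useful, since we could simply check for foobar directly. However, a lookahead can also be negative. The following example matches foo if and only if it is not followed by bar.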

import re

strings = ["hello foo",     # prints True
           "hello foobar",  # prints False
           "hello foobaz"]  # prints True

for string in strings:
    pattern = re.search(r'foo(?!bar)', string)
    if pattern:
        print('True')
    else:
        print('False')
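What sets a lookahead apart from simply writing foobar is that the asserted text is not part of the match. The short sketch below, using the same foo/bar strings plus one made-up input for findall, shows the matched text and its span.

```python
import re

m = re.search(r'foo(?=bar)', 'hello foobar')
print(m.group())  # foo  (the 'bar' is checked but not consumed)
print(m.span())   # (6, 9)

# Because 'bar' was not consumed, the rest of the pattern can still match it.
m2 = re.search(r'foo(?=bar)bar', 'hello foobar')
print(m2.group())  # foobar

# Negative lookahead as a filter: words starting with foo, unless bar follows.
words = re.findall(r'foo(?!bar)\w*', 'foo foobar foobaz')
print(words)  # ['foo', 'foobaz']
```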

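Lookbehind assertions work similarly, but they look at what precedes the current match position. Use (?<= for a positive assertion and (?<! for a negative one.

The following pattern matches a bar that is not preceded by foo.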

import re

strings = ["hello bar",     # prints True
           "hello foobar",  # prints False
           "hello bazbar"]  # prints True

for string in strings:
    pattern = re.search(r'(?<!foo)bar', string)
    if pattern:
        print('True')
    else:
        print('False')
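The positive form (?<=...) works the same way. As a rough sketch with made-up data, the first pattern below extracts numbers that directly follow a dollar sign, and the helper add_thousands_separators (a hypothetical name) combines a lookbehind and a lookahead to insert commas at positions that have a digit before them and a multiple of three digits after them.

```python
import re

# Illustrative text, not from the original article.
text = "apples cost $3, pears cost $12 and plums cost 7 cents"

# Digits that are immediately preceded by '$'; the '$' is not part of the match.
prices = re.findall(r'(?<=\$)\d+', text)
print(prices)  # ['3', '12']


def add_thousands_separators(digits):
    """Insert commas into a string of digits, e.g. '1234567' -> '1,234,567'."""
    # Both assertions are zero-width, so only the empty position is replaced.
    return re.sub(r'(?<=\d)(?=(?:\d{3})+$)', ',', digits)


print(add_thousands_separators('1234567'))  # 1,234,567
```

Note that Python's re module requires the pattern inside a lookbehind to have a fixed width, which is why both lookbehinds here contain a single character.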

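Conditional (if-then-else) patterns

Regular expressions provide conditional matching. In Python's re module the format is:

(?(id/name)yes-pattern|no-pattern)

The condition is the number or name of a previously captured group. The yes-pattern is tried if that group took part in the match; otherwise the optional no-pattern is tried.

For example, we can use this regular expression to check for matching opening and closing angle brackets: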

import re

strings = ["<pypix>",  # prints True
           "<foo",     # prints False
           "bar>",     # prints False
           "hello"]    # prints True

for string in strings:
    pattern = re.search(r'^(<)?[a-z]+(?(1)>)$', string)
    if pattern:
        print('True')
    else:
        print('False')

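In the example above, 1 refers to the group (<), which may also match nothing because it is followed by a question mark. The closing angle bracket is matched if and only if the condition holds, that is, if group 1 matched.

In some other regex flavors, such as PCRE, the condition can also be a lookaround assertion. Python's re module accepts only a group number or name.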
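A conditional can also have an else branch and can refer to a group by name. The sketch below is adapted from the e-mail example in the Python re documentation, rewritten with a named group; EMAIL_RE and matches are names chosen for illustration, and the pattern is deliberately simplistic rather than a real e-mail validator.

```python
import re

# If the optional '<' was matched, require '>' next; otherwise require end of string.
EMAIL_RE = re.compile(r'(?P<open><)?(\w+@\w+(?:\.\w+)+)(?(open)>|$)')


def matches(text):
    """Return True if text starts with a bare or properly bracketed address."""
    return EMAIL_RE.match(text) is not None


for candidate in ['<user@host.com>', 'user@host.com',
                  '<user@host.com', 'user@host.com>']:
    print(candidate, matches(candidate))
```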

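Non-capturing groups

Groups, written in parentheses, are captured into a numbered list so that they can be referenced later. But we can also choose not to capture them.

Let's start with a very simple example: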

import re

string = 'Hello foobar'
pattern = re.search(r'(f.*)(b.*)', string)

print("f* => {0}".format(pattern.group(1)))  # prints f* => foo
print("b* => {0}".format(pattern.group(2)))  # prints b* => bar

Now let's make a small change and add another group (H.*) in front:

import re

string = 'Hello foobar'
pattern = re.search(r'(H.*)(f.*)(b.*)', string)

# The group numbers have shifted, so both labels are now wrong.
print("f* => {0}".format(pattern.group(1)))  # prints f* => Hello (with a trailing space)
print("b* => {0}".format(pattern.group(2)))  # prints b* => foo

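The list of groups has changed, and depending on how our code uses these values, our script may stop working correctly. We now have to find every place in the code that uses a group index and adjust it. If we are really not interested in the content of the newly added group, we can make it non-capturing, like this: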

import re

string = 'Hello foobar'
pattern = re.search(r'(?:H.*)(f.*)(b.*)', string)

print("f* => {0}".format(pattern.group(1)))  # prints f* => foo
print("b* => {0}".format(pattern.group(2)))  # prints b* => bar

By adding ?: at the start of the group, it is no longer captured, so the other groups keep their original numbers.
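The effect on group numbering is easiest to see with match.groups(), which returns all captured groups as a tuple. The sketch below reuses the 'Hello foobar' string; the last part, with a made-up input, shows the other common use of (?:...), which is applying a quantifier to an alternation without creating an extra group.

```python
import re

string = 'Hello foobar'

capturing = re.search(r'(H.*)(f.*)(b.*)', string)
non_capturing = re.search(r'(?:H.*)(f.*)(b.*)', string)

print(capturing.groups())      # ('Hello ', 'foo', 'bar')
print(non_capturing.groups())  # ('foo', 'bar')

# Group an alternation for repetition without capturing it.
m = re.fullmatch(r'(?:ab|cd)+(\d+)', 'abcdab42')
print(m.groups())  # ('42',)
```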

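Named groups

Like the previous technique, this is another way to avoid that pitfall. We can give groups names and then refer to them by name instead of by index. The format is (?P<name>pattern). We can rewrite the previous example like this: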

import re

string = 'Hello foobar'
pattern = re.search(r'(?P<fstar>f.*)(?P<bstar>b.*)', string)

print("f* => {0}".format(pattern.group('fstar')))  # prints f* => foo
print("b* => {0}".format(pattern.group('bstar')))  # prints b* => bar

Now we can add another group without affecting the existing ones:

import re

string = 'Hello foobar'
pattern = re.search(r'(?P<hi>H.*)(?P<fstar>f.*)(?P<bstar>b.*)', string)

print("f* => {0}".format(pattern.group('fstar')))  # prints f* => foo
print("b* => {0}".format(pattern.group('bstar')))  # prints b* => bar
print("h* => {0}".format(pattern.group('hi')))     # prints h* => Hello (with a trailing space)
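Named groups can be used in a few more places than match.group(name). The following sketch, with a made-up date and sentence, shows groupdict(), a back-reference inside the pattern with (?P=name), and a reference in a replacement string with \g<name>.

```python
import re

# Illustrative pattern and inputs, not from the original article.
DATE_RE = r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'

# groupdict() returns all named groups as a dictionary.
m = re.search(DATE_RE, 'Released on 2014-03-27.')
print(m.groupdict())  # {'year': '2014', 'month': '03', 'day': '27'}

# (?P=word) matches the same text that the group 'word' matched earlier.
doubled = re.search(r'\b(?P<word>\w+) (?P=word)\b', 'this is is a test')
print(doubled.group('word'))  # is

# \g<name> refers to a named group inside a replacement string.
reordered = re.sub(DATE_RE, r'\g<day>/\g<month>/\g<year>', '2014-03-27')
print(reordered)  # 27/03/2014
```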

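Using callback functions

In Python, re.sub() accepts a function as the replacement argument, which lets us use a callback for regular expression substitution.

Let's look at an example. Here is an e-mail template: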

import re

template = ("Hello [first_name] [last_name], "
            "Thank you for purchasing [product_name] from [store_name]. "
            "The total cost of your purchase was [product_price] plus [ship_price] for shipping. "
            "You can expect your product to arrive in [ship_days_min] to [ship_days_max] business days. "
            "Sincerely, "
            "[store_manager_name]")

# assume dic has all the replacement data
# such as dic['first_name'], dic['product_price'], etc.
dic = {
    "first_name": "John",
    "last_name": "Doe",
    "product_name": "iphone",
    "store_name": "Walkers",
    "product_price": "$500",
    "ship_price": "$10",
    "ship_days_min": "1",
    "ship_days_max": "5",
    "store_manager_name": "DoeJohn",
}

# Replace only the first placeholder with a fixed string.
# The non-greedy .*? keeps each match inside a single pair of brackets.
pattern = re.compile(r'\[(.*?)\]')
print(pattern.sub('John', template, count=1))

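Notice that all the placeholders have one thing in common: they are enclosed in a pair of square brackets. We can capture them with a single regular expression and let a callback function handle the specific replacement.

So using a callback function is a better approach: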

import re

template = ("Hello [first_name] [last_name], "
            "Thank you for purchasing [product_name] from [store_name]. "
            "The total cost of your purchase was [product_price] plus [ship_price] for shipping. "
            "You can expect your product to arrive in [ship_days_min] to [ship_days_max] business days. "
            "Sincerely, "
            "[store_manager_name]")

# assume dic has all the replacement data
# such as dic['first_name'], dic['product_price'], etc.
dic = {
    "first_name": "John",
    "last_name": "Doe",
    "product_name": "iphone",
    "store_name": "Walkers",
    "product_price": "$500",
    "ship_price": "$10",
    "ship_days_min": "1",
    "ship_days_max": "5",
    "store_manager_name": "DoeJohn",
}

def multiple_replace(dic, text):
    # Build an alternation of all escaped placeholders: \[first_name\]|\[last_name\]|...
    pattern = "|".join(re.escape("[" + key + "]") for key in dic)
    # The callback strips the brackets from the matched text and looks up the value.
    return re.sub(pattern, lambda m: dic[m.group()[1:-1]], text)

print(multiple_replace(dic, template))
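The same idea can be written with one generic placeholder pattern and a named callback instead of an alternation built from the dictionary keys. This is only a sketch of one possible variant: PLACEHOLDER_RE and fill_template are hypothetical names, the pattern assumes placeholder names consist of word characters, and leaving unknown placeholders untouched is a design choice made here, not something the original example does.

```python
import re

# Hypothetical helper names; placeholders are assumed to look like [word_chars].
PLACEHOLDER_RE = re.compile(r'\[(\w+)\]')


def fill_template(template, values):
    """Replace each [name] with values[name]; unknown names are left as they are."""
    def replace(match):
        key = match.group(1)
        # match.group(0) is the whole placeholder, including the brackets.
        return str(values.get(key, match.group(0)))
    return PLACEHOLDER_RE.sub(replace, template)


message = fill_template(
    "Hello [first_name] [last_name], your total is [product_price].",
    {"first_name": "John", "product_price": "$500"},
)
print(message)  # Hello John [last_name], your total is $500.
```

Because the callback's return value is inserted literally, characters such as $ or backslashes in the replacement data need no escaping.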

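Don't reinvent the wheel

Perhaps even more important is knowing when not to use regular expressions. In many cases you can find alternative tools.

Parsing [X]HTML

An answer on Stack Overflow gives a brilliant explanation of why you should not parse [X]HTML with regular expressions.

You should use an HTML parser instead, and Python has many options:

ElementTree is part of the standard library
BeautifulSoup is a popular third-party library
lxml is a full-featured, fast library written in C

The latter two handle even malformed HTML gracefully, which is a blessing given the large number of ugly sites out there.

An example using ElementTree: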

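from xml.etree import ElementTree

# ElementTree requires well-formed markup (for example XHTML).
tree = ElementTree.parse('filename.html')

# iter() visits matching elements at any depth of the tree.
for element in tree.iter('h1'):
    print(ElementTree.tostring(element, encoding='unicode'))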
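To connect this back to the anchor-tag example from the greedy section, here is a rough sketch that extracts the same links with ElementTree instead of a regular expression. It assumes the markup is well-formed and wrapped in a single root element (the <p> wrapper is added here for that reason); extract_links is a name made up for this illustration.

```python
from xml.etree import ElementTree

# The two anchors from the earlier example, wrapped in one root element.
html = ('<p>Hello <a href="http://pypix.com" title="pypix">Pypix</a>'
        'Hello <a href="http://example.com" title="example">Example</a></p>')


def extract_links(markup):
    """Return (href, text) pairs for every <a> element in well-formed markup."""
    root = ElementTree.fromstring(markup)
    return [(a.get('href'), a.text) for a in root.iter('a')]


links = extract_links(html)
print(links)
# [('http://pypix.com', 'Pypix'), ('http://example.com', 'Example')]
```

For real-world pages that are not well-formed, ElementTree.fromstring raises a parse error, which is where the more forgiving third-party parsers mentioned above come in.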

Other

Before reaching for regular expressions, there are many other tools worth considering.

Original link: http://blog.jobbole.com/65605/

倔强不倒翁
