Positive lookahead
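import re

# (?=pattern) is a positive lookahead.
# The pattern below rejects addresses in which one angle bracket is missing.
address = re.compile(
    r'''
    # A name is required, followed by whitespace
    ((?P<name>
       ([\w.,]+\s+)*[\w.,]+
     )
     \s+
    )

    # The lookahead checks the brackets before the address is consumed:
    # they must either come as a pair or be absent, never a single one.
    (?= (<.*>$)         # a matched pair of angle brackets
        |
        ([^<].*[^>]$)   # no angle brackets
    )

    <?   # optional opening angle bracket

    (?P<email>
      [\w\d.+-]+        # user name
      @
      ([\w\d.]+\.)+     # domain name prefix
      (com|org|edu)     # allowed top-level domains
    )

    >?   # optional closing angle bracket
    ''',
    re.UNICODE | re.VERBOSE)

candidates = [
    'First Last <[email protected]>',
    'No Brackets [email protected]',
    'Open Bracket <[email protected]',
    'Close Bracket [email protected]>',
]

for candidate in candidates:
    print('Candidate:', candidate)
    match = address.search(candidate)
    if match:
        print('  Name :', match.groupdict()['name'])
        print('  Email:', match.groupdict()['email'])
    else:
        print('  No match')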
Output
Candidate: First Last <[email protected]>
  Name : First Last
  Email: [email protected]
Candidate: No Brackets [email protected]
  Name : No Brackets
  Email: [email protected]
Candidate: Open Bracket <[email protected]
  No match
Candidate: Close Bracket [email protected]>
  No match
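The bracket check can also be tried in isolation. The following is a minimal sketch, not the pattern above: it drops the name part, and the names BRACKETED_EMAIL and extract_email as well as the sample addresses are hypothetical. It uses match() rather than search() so that the lookahead always inspects the string from its first character; with search(), the engine could skip a lone opening bracket and match the rest.

```python
import re

# Simplified sketch: the lookahead inspects the whole string first,
# then the optional brackets and the address are consumed.
BRACKETED_EMAIL = re.compile(
    r'''
    (?= <[^<>]*>$     # a matched pair of angle brackets
      | [^<>]+$       # or no angle brackets at all
    )
    <?                # optional opening bracket
    (?P<email>[\w.+-]+@[\w.-]+\.(?:com|org|edu))
    >?                # optional closing bracket
    ''',
    re.VERBOSE)


def extract_email(text):
    """Return the address, or None when the brackets are unbalanced."""
    match = BRACKETED_EMAIL.match(text)
    return match.group('email') if match else None


# Hypothetical sample inputs, one per bracket combination.
samples = [
    '<alice@example.com>',
    'alice@example.com',
    '<alice@example.com',
    'alice@example.com>',
]
for sample in samples:
    print(sample, '->', extract_email(sample))
```

Only the first two samples yield an address; the two with a single bracket are rejected by the lookahead before any part of the address is consumed.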
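About lookahead and lookbehind
Given the string: foobarbarfoo
bar(?=bar) finds the first bar (a bar that is followed by another bar).
bar(?!bar) finds the second bar (a bar that is not followed by another bar).
(?<=foo)bar finds the first bar (a bar that is preceded by foo).
(?<!foo)bar finds the second bar (a bar that is not preceded by foo).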
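These four claims are easy to check by printing where each pattern matches. A small sketch follows; the helper name match_starts is made up for this illustration. In foobarbarfoo the first bar starts at offset 3 and the second at offset 6.

```python
import re

text = 'foobarbarfoo'

patterns = [
    r'bar(?=bar)',    # bar followed by bar
    r'bar(?!bar)',    # bar not followed by bar
    r'(?<=foo)bar',   # bar preceded by foo
    r'(?<!foo)bar',   # bar not preceded by foo
]


def match_starts(pattern, text):
    """Return the start offset of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]


for pattern in patterns:
    print('%-12s %s' % (pattern, match_starts(pattern, text)))
```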
The following explanation comes from Stack Overflow:
Look ahead Positive(?=)
Find expression A where expression B follows
A(?=B)
Look ahead Negative(?!)
Find expression A where expression B does not follow
A(?!B)
Look behind Positive(?<=)
Find expression A where expression B precedes
(?<=B)A
Look behind Negative(?<!)
Find expression A where expression B does not precede it
(?<!B)A
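Atomic groups
Note: an atomic group is a special non-capturing group. Once it has matched, the engine discards the backtracking positions inside it, which can be used to improve regular expression performance.
Ordinary group: /\b(engineer|engrave|end)\b/
When this is matched against "engineering", the regex engine first matches "engineer", but the \b that follows fails because the next character is "i", so that attempt is unsuccessful. The engine then backtracks and tries the next alternative, "engrave": it matches "eng", the rest does not line up, and the attempt fails. Finally it tries "end", which fails as well. Look closely and you will see that once "engineer" has matched and the word-boundary check has failed, "engrave" and "end" can no longer succeed.
Both words differ from the text that "engineer" has already matched, so neither can match at this position, and the regex engine should not make these pointless attempts.
Atomic group: /\b(?>engineer|engrave|end)\b/
The alternatives are tried only once: after "engineer" has matched and the following \b has failed, the engine does not backtrack into the group, and the match fails immediately.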
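The /.../ patterns above are not Python syntax. Python's re module accepts (?>...) directly only from version 3.11 on, so the sketch below emulates the atomic group with a well-known trick that works on older versions too: a lookahead captures one alternative and a backreference consumes it. Because a lookahead is never re-entered on backtracking, the chosen alternative is final. The names plain, atomic_like and find_word are made up for this illustration.

```python
import re

# Plain alternation: after \b fails, the engine backtracks into the group
# and tries the remaining alternatives.
plain = re.compile(r'\b(engineer|engrave|end)\b')

# Emulated atomic group: the lookahead captures one alternative and \1
# consumes it. The lookahead is not retried, so the choice is final.
atomic_like = re.compile(r'\b(?=(engineer|engrave|end))\1\b')


def find_word(pattern, text):
    """Return the matched word, or None when nothing matches."""
    match = pattern.search(text)
    return match.group(0) if match else None


for sample in ['engineering', 'an engineer', 'the end']:
    print(sample, '->', find_word(plain, sample), find_word(atomic_like, sample))
```

Both patterns accept and reject the same strings here; the difference is only in how much work the engine does before giving up on "engineering".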
Practice code
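import re

# Despite its name, look_ahead uses a non-capturing group (?:...), which
# consumes the digit; the other two patterns are true lookaheads.
look_ahead = re.compile('python(?:2|3)')
look_ahead_pattern = re.compile('python(?=2)')
look_ahead_not_pattern = re.compile('python(?!2)')

text = 'pythonic python2 python3'


def print_info(re_obj, text=text):
    """Print every match of re_obj in text with its start and end offsets."""
    for match in re_obj.finditer(text):
        print(match.group(), end=' ')
        print('start is %d, end is %d' % (match.start(), match.end()))
    print()  # blank line between patterns


print_info(look_ahead)
print_info(look_ahead_pattern)
print_info(look_ahead_not_pattern)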
Output
python2 start is 9, end is 16
python3 start is 17, end is 24
python start is 9, end is 15
python start is 0, end is 6
python start is 17, end is 23