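Learning Python: the re module

Regular expressions

Finding patterns in text

The most common use of regular expressions is to search a text for matches. For example: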

import re

patterns = ['this', 'that']
text = 'does this text match the patterns?'

for pattern in patterns:
	print('looking for "%s" in "%s" ->' % (pattern, text))
	if re.search(pattern, text):
		print('found a match')
	else:
		print('no match')
        
# search() returns a Match object; if no match is found, it returns None
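
Because search() returns None when nothing matches, calling a method on the result without checking it first raises an AttributeError. A minimal sketch of the usual guard follows; the helper name describe is hypothetical and chosen only for this illustration.

```python
import re

def describe(pattern, text):
    """Return a short description of the first match, or a fallback string."""
    match = re.search(pattern, text)
    if match is None:
        # search() found nothing, so there is no Match object to inspect.
        return 'no match'
    return 'found "%s" at %d:%d' % (match.group(), match.start(), match.end())

text = 'does this text match the patterns?'
print(describe('this', text))
print(describe('that', text))
```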

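The Match object returned by search() holds basic information about the match, including the original input text, the regular expression that was used, and the start and end positions of the match within the text. For example: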

import re

pattern = 'this'
text = 'does this text match the patterns?'

match = re.search(pattern, text)

s = match.start()
e = match.end()

print('found "%s" in "%s" from %d to %d ("%s")' % \
	(match.re.pattern, match.string, s, e, text[s:e]))
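
Besides start() and end(), a Match object offers a few shortcuts that are often handy. The sketch below reads the same kind of information through group() and span(); the pattern th\w+ is an assumption picked for the illustration.

```python
import re

match = re.search(r'th\w+', 'does this text match the patterns?')

matched_text = match.group()      # the substring that matched
start_end = match.span()          # (start, end) as a single tuple
source = match.string             # the original input text
pattern_used = match.re.pattern   # the pattern string of the compiled regex

print(matched_text, start_end)
```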

If a program uses the same expression repeatedly, you can compile it once with re.compile() to obtain a regular expression object, and then run the searches through that object:

import re

regexes = [re.compile(p) for p in ['this', 'that']]
text = 'does this text match the patterns?'

for regex in regexes:
	print('looking for "%s" in "%s" ->' % (regex.pattern, text))
	if regex.search(text):
		print('found a match')
	else:
		print('no match')

Multiple matches

If the text contains several matches, use findall(), which returns a list of all non-overlapping matched strings.

import re

text = 'abbaaabbbbaaaaabbbba'
pattern = 'ab'

for match in re.findall(pattern, text):
	print('found "%s"' % match)
	
print(re.findall(pattern, text))
# returns the list of matched strings: ['ab', 'ab', 'ab']
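
What findall() returns depends on whether the pattern contains capturing groups. The sketch below uses the same text with three patterns (the patterns are assumptions for illustration): with no group it returns whole matches, with one group it returns what the group captured, and with several groups it returns tuples.

```python
import re

text = 'abbaaabbbbaaaaabbbba'

# No groups: a list of the full matched strings.
whole = re.findall(r'ab+', text)

# One group: a list of what that group captured.
one_group = re.findall(r'a(b+)', text)

# Two groups: a list of tuples, one element per group.
two_groups = re.findall(r'(a+)(b+)', text)

print(whole)
print(one_group)
print(two_groups)
```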

Unlike findall(), finditer() returns an iterator that yields a Match object for each match:

import re

text = 'abbaaabbbbaaaaabbbba'
pattern = 'ab'

for match in re.finditer(pattern, text):
	s = match.start()
	e = match.end()
	print('found "%s" at %d:%d' % (text[s:e], s, e))

Pattern syntax

The full regular expression syntax is covered in the standard references; the following are some further examples:

import re

def test_patterns(text, patterns=()):
	"""
	given source text and a list of patterns,
	look for matches for each pattern within the text
	and print them to stdout
	"""
	print()
	print(''.join(str(i%10) for i in range(len(text))))
	print(text)
	
	for pattern in patterns:
		print()
		print('matching "%s"' % pattern)
		for match in re.finditer(pattern, text):
			s = match.start()
			e = match.end()
			print('%2d : %2d = "%s"' % (s, e-1, text[s:e]))
	return

if __name__ == '__main__':
	test_patterns('abbaaabbbaaaabbba', ['ab'])
	print('--------------------------------------')
	test_patterns('abbaaabbbaaaabbba', ['ab*','ab+','ab?','ab{3}','ab{2,3}'])
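
For reference, the quantifiers used above mean: * zero or more repetitions, + one or more, ? zero or one, {m} exactly m, and {m,n} between m and n. Appending ? to a quantifier makes it non-greedy. The sketch below collects the matches for a few of these on the same text, so the difference can be compared directly; the results dictionary is just a convenience for this example.

```python
import re

text = 'abbaaabbbaaaabbba'

# Map each pattern to the list of substrings it matches in the text.
results = {p: re.findall(p, text) for p in ['ab+', 'ab{3}', 'ab{2,3}', 'ab+?']}

for pattern, found in results.items():
    print('%-8s %s' % (pattern, found))
```

Note how the greedy ab+ takes every available b, while the non-greedy ab+? stops after the first one.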

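The difference between search and match

re.match only matches at the beginning of the string: if the start of the string does not fit the pattern, the match fails and the function returns None. re.search, by contrast, scans through the whole string until it finds a match.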
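
A small sketch of the difference, using a sample sentence chosen for illustration. It also shows re.fullmatch(), which goes one step further than match() and requires the pattern to cover the entire string.

```python
import re

text = 'This is some text'

at_start = re.match(r'is', text)      # None: the text does not begin with "is"
anywhere = re.search(r'is', text)     # finds the "is" inside "This"
prefix = re.match(r'This', text)      # matches: the pattern fits at position 0
whole = re.fullmatch(r'This', text)   # None: the pattern must cover the whole string

print(at_start, anywhere.span(), prefix.group(), whole)
```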

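The search() method of a compiled pattern accepts optional pos and endpos arguments that restrict the range of the text that is searched. For example: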

import re

text = 'This is some text -- with anything'
pattern = re.compile(r'\b\w*is\w*\b')

print('text: %s' % text)
print()
pos = 0
while True:
    match = pattern.search(text, pos)
    if not match:
        break
    s = match.start()
    e = match.end()
    print('%2d : %2d = "%s"' % (s, e-1, text[s:e]))
    #move forward in text for the next search
    pos = e
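
The loop above only uses pos. As a hedged sketch of how endpos behaves with the same pattern: the text is treated as if it ended at endpos, so a match that would need characters beyond that index is not found. The index values 4 and 6 are assumptions picked for this example.

```python
import re

text = 'This is some text -- with anything'
pattern = re.compile(r'\b\w*is\w*\b')

first = pattern.search(text)           # no limits: finds 'This'
second = pattern.search(text, 4)       # start looking at index 4: finds 'is'
limited = pattern.search(text, 4, 6)   # endpos=6 cuts the text off in the middle of 'is'

print(first.span(), second.span(), limited)
```

Note that pos and endpos are parameters of the compiled pattern's methods; the module-level re.search() function does not take them.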

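Options

The re module provides optional flags that change how a regular expression behaves. Flags can be combined with the bitwise OR operator (|) and passed to compile(), search(), match(), and similar functions.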
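
As a minimal sketch of combining flags, the example below applies IGNORECASE alone and then together with MULTILINE, and also shows the effect of DOTALL; the individual flags are described in the list that follows. The sample text is an assumption made up for the illustration.

```python
import re

text = 'first line\nSecond line\nsecond try'

# IGNORECASE alone: ^ only matches at the very start of the text.
single = re.findall(r'^second', text, re.IGNORECASE)

# IGNORECASE | MULTILINE: ^ also matches right after each newline.
combined = re.findall(r'^second', text, re.IGNORECASE | re.MULTILINE)

# DOTALL lets "." cross the newline, so one match can span two lines.
dotall = re.search(r'first.*Second', text, re.DOTALL)
without_dotall = re.search(r'first.*Second', text)

print(single, combined, dotall.group(), without_dotall)
```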

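re.IGNORECASE makes matching case-insensitive.

re.MULTILINE makes ^ and $ match at the start and end of every line (just after and just before each newline), not only at the start and end of the whole string.

re.DOTALL makes the dot (.) match a newline as well (by default . does not match newlines).

re.ASCII makes \w, \W, \b, \B, \d, \D, \s and \S perform ASCII-only matching. In Python 3, str patterns match Unicode by default, whereas Python 2 defaulted to ASCII.

re.VERBOSE enables verbose mode: a complex expression can be split over several lines and annotated with comments, which makes it easier to read and understand. Here is an example: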

# Regular expression for e-mail addresses, without verbose mode
import re
address = re.compile(r'[\w\d.+-]+@([\w\d.]+\.)+(com|org|edu)')
candidates = [
    u'[email protected]',
    u'[email protected]',
    u'[email protected]',
    u'[email protected]',
]

for candidate in candidates:
    match = address.search(candidate)
    print('{:<30} {}'.format(candidate, 'match' if match else 'no match'))
    
# With verbose mode
import re

address = re.compile(
	r"""
	[\w\d.+-]+		#username
	@
	([\w\d.]+\.)+	#domain name prefix
	(com|org|edu)	#support more top-level domains
	""",
	re.VERBOSE
)

candidates = [
    u'[email protected]',
    u'[email protected]',
    u'[email protected]',
    u'[email protected]',
]

for candidate in candidates:
    match = address.search(candidate)
    print('{:<30} {}'.format(candidate, 'match' if match else 'no match'))
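
To see the verbose pattern at work on concrete input, here is a self-contained sketch. The addresses are illustrative placeholders invented for this example (they are assumptions, not the original candidates), and the results dictionary is only a convenience for printing.

```python
import re

address = re.compile(
    r"""
    [\w\d.+-]+       # username
    @
    ([\w\d.]+\.)+    # domain name prefix
    (com|org|edu)    # supported top-level domains
    """,
    re.VERBOSE,
)

# Hypothetical sample addresses, for illustration only.
candidates = [
    'first.last@example.com',
    'first.last+category@example.org',
    'valid-address@mail.example.edu',
    'not-valid@example.foo',
]

# Record for each candidate whether the pattern was found in it.
results = {c: bool(address.search(c)) for c in candidates}
for candidate, ok in results.items():
    print('{:<35} {}'.format(candidate, 'match' if ok else 'no match'))
```

In verbose mode, whitespace outside character classes is ignored and everything from # to the end of the line is a comment, which is why the pattern can be laid out this way.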

Flags can also be embedded in the pattern string itself. For example, to turn on case-insensitive matching, add (?i) at the beginning of the expression:

import re
text = 'This is some text -- with anything.'
pattern = r'(?i)\bT\w+'
regex = re.compile(pattern)
print('matches: ', regex.findall(text))

Several flags can be turned on together: for example, (?im) enables both IGNORECASE and MULTILINE. The abbreviation for each flag is:

ASCII: a
IGNORECASE: i
MULTILINE: m
DOTALL: s
VERBOSE: x
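
A short sketch showing that embedded flags behave like the corresponding flag arguments. The two-line sample text is an assumption for the illustration. In recent Python versions a global inline flag group such as (?im) has to appear at the very start of the expression.

```python
import re

text = 'This is some text\nthat spans two lines'

# Flags embedded in the pattern itself.
inline = re.findall(r'(?im)^th\w+', text)

# The same flags passed as an argument.
explicit = re.findall(r'^th\w+', text, re.IGNORECASE | re.MULTILINE)

print(inline, explicit)
```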

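Substitution

Regular expressions can be used not only to search text but also to replace the parts of it that match. This is done with the sub() function, for example: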

import re
bold = re.compile(r'\*{2}(.*?)\*{2}')
text = 'make this **bold**. this **too**.'
print('text: ', text)
print('bold: ', bold.sub(r'<b>\1</b>', text))
# \1 here is a backreference to the text captured by the first group in the pattern

The subn() function works the same way, but returns both the modified string and the number of substitutions made.
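
A minimal sketch of subn() on the same bold-markup text, together with the count argument of sub(), which limits how many matches are replaced. The variable names are chosen for this example only.

```python
import re

bold = re.compile(r'\*{2}(.*?)\*{2}')
text = 'make this **bold**. this **too**.'

# subn() returns a (new_string, number_of_substitutions) tuple.
new_text, count = bold.subn(r'<b>\1</b>', text)

# count=1 replaces only the first match and leaves the rest untouched.
first_only = bold.sub(r'<b>\1</b>', text, count=1)

print(new_text, count)
print(first_only)
```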
