Learning Python Regular Expressions

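Regular expressions are a Swiss Army knife for searching information for specific patterns. They form a huge toolbox, and some of its features are often overlooked or underused. Today I will show you some advanced uses of regular expressions.

For example, here is a regular expression we might use to detect US phone numbers: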

r'^(1[-\s.])?(\()?\d{3}(?(2)\))[-\s.]?\d{3}[-\s.]?\d{4}$'

We can add some comments and whitespace to make it more readable.

r'^'
r'(1[-\s.])?'  # optional '1-', '1.' or '1 '
r'(\()?'       # optional opening parenthesis
r'\d{3}'       # the area code
r'(?(2)\))'    # if there was an opening parenthesis, close it
r'[-\s.]?'     # optionally followed by '-', '.' or whitespace
r'\d{3}'       # first 3 digits
r'[-\s.]?'     # optionally followed by '-', '.' or whitespace
r'\d{4}$'      # last 4 digits
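Python's re.VERBOSE flag lets the comments live inside the pattern itself, because whitespace and # comments in the pattern are then ignored. The following is a minimal sketch of the same phone-number pattern written that way; the names PHONE_RE and is_valid_phone are illustrative and do not come from the original article.

```python
import re

# Same phone-number pattern, written with re.VERBOSE so that
# whitespace and comments inside the pattern are ignored.
PHONE_RE = re.compile(r"""
    ^
    (1[-\s.])?      # optional '1-', '1.' or '1 '
    (\()?           # optional opening parenthesis
    \d{3}           # the area code
    (?(2)\))        # if there was an opening parenthesis, close it
    [-\s.]?         # optionally followed by '-', '.' or whitespace
    \d{3}           # first 3 digits
    [-\s.]?         # optionally followed by '-', '.' or whitespace
    \d{4}           # last 4 digits
    $
""", re.VERBOSE)


def is_valid_phone(number):
    """Return True if number looks like a US phone number."""
    return PHONE_RE.match(number) is not None


print(is_valid_phone("(123).555.6789"))
print(is_valid_phone("(123-555-6789"))
```

Compared with concatenating many small raw strings, a single verbose pattern keeps each comment next to the piece of the pattern it describes.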

Let's put it into a code snippet:

import re

numbers = ["123 555 6789",
           "1-(123)-555-6789",
           "(123-555-6789",
           "(123).555.6789",
           "123 55 6789"]

for number in numbers:
    pattern = re.match(r'^'
                       r'(1[-\s.])?'  # optional '1-', '1.' or '1 '
                       r'(\()?'       # optional opening parenthesis
                       r'\d{3}'       # the area code
                       r'(?(2)\))'    # if there was an opening parenthesis, close it
                       r'[-\s.]?'     # optionally followed by '-', '.' or whitespace
                       r'\d{3}'       # first 3 digits
                       r'[-\s.]?'     # optionally followed by '-', '.' or whitespace
                       r'\d{4}$',     # last 4 digits
                       number)

    if pattern:
        print('{0} is valid'.format(number))
    else:
        print('{0} is not valid'.format(number))

Output:

123 555 6789 is valid

1-(123)-555-6789 is valid

(123-555-6789 is not valid

(123).555.6789 is valid

123 55 6789 is not valid

Regular expressions are a great feature of Python, but they are hard to debug and easy to get wrong.

Fortunately, Python can print the parse tree of a regular expression if you pass the re.DEBUG flag (which is simply the integer 128) to re.compile or re.match.

import re

numbers = ["123 555 6789",
           "1-(123)-555-6789",
           "(123-555-6789",
           "(123).555.6789",
           "123 55 6789"]

for number in numbers:
    pattern = re.match(r'^'
                       r'(1[-\s.])?'  # optional '1-', '1.' or '1 '
                       r'(\()?'       # optional opening parenthesis
                       r'\d{3}'       # the area code
                       r'(?(2)\))'    # if there was an opening parenthesis, close it
                       r'[-\s.]?'     # optionally followed by '-', '.' or whitespace
                       r'\d{3}'       # first 3 digits
                       r'[-\s.]?'     # optionally followed by '-', '.' or whitespace
                       r'\d{4}$',     # last 4 digits
                       number, re.DEBUG)

    if pattern:
        print('{0} is valid'.format(number))
    else:
        print('{0} is not valid'.format(number))

Parse tree (this output comes from an older Python version, and the exact format differs between versions; the repeated "max_repeat 0 2147483648" entries over category_space correspond to \s* between the parts, so the tree appears to have been generated from a variant of the pattern that allows optional whitespace there):

at at_beginning
max_repeat 0 1
  subpattern 1
    literal 49
    in
      literal 45
      category category_space
      literal 46
max_repeat 0 2147483648
  in
    category category_space
max_repeat 0 1
  subpattern 2
    literal 40
max_repeat 0 2147483648
  in
    category category_space
max_repeat 3 3
  in
    category category_digit
max_repeat 0 2147483648
  in
    category category_space
subpattern None
  groupref_exists 2
    literal 41
None
max_repeat 0 2147483648
  in
    category category_space
max_repeat 0 1
  in
    literal 45
    category category_space
    literal 46
max_repeat 0 2147483648
  in
    category category_space
max_repeat 3 3
  in
    category category_digit
max_repeat 0 2147483648
  in
    category category_space
max_repeat 0 1
  in
    literal 45
    category category_space
    literal 46
max_repeat 0 2147483648
  in
    category category_space
max_repeat 4 4
  in
    category category_digit
at at_end
max_repeat 0 2147483648
  in
    category category_space

123 555 6789 is valid

1-(123)-555-6789 is valid

(123-555-6789 is not valid

(123).555.6789 is valid

123 55 6789 is not valid
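Since the text mentions that re.DEBUG can also be passed to re.compile, here is a hedged sketch of that variant. It compiles the phone-number pattern once, so the parse tree is printed at compile time, and then reuses the compiled object for every number. The names PHONE_RE and results are illustrative only.

```python
import re

# Compile once; with re.DEBUG the parse tree is printed at compile time.
PHONE_RE = re.compile(
    r'^(1[-\s.])?(\()?\d{3}(?(2)\))[-\s.]?\d{3}[-\s.]?\d{4}$',
    re.DEBUG,
)

numbers = ["123 555 6789",
           "1-(123)-555-6789",
           "(123-555-6789",
           "(123).555.6789",
           "123 55 6789"]

# Reuse the compiled pattern for every candidate number.
results = {number: PHONE_RE.match(number) is not None for number in numbers}

for number, ok in results.items():
    print('{0} is {1}'.format(number, 'valid' if ok else 'not valid'))
```

Compiling up front keeps the debugging output separate from the matching loop, which makes the tree easier to read.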

Greedy and non-greedy

Before I explain this concept, I want to show an example. We want to find anchor tags in a piece of HTML text:

import re

html = 'Hello <a href="http://pypix.com" title="pypix">Pypix</a>'

m = re.findall(r'<a.*>.*</a>', html)
if m:
    print(m)

The result is as expected:

['<a href="http://pypix.com" title="pypix">Pypix</a>']

Let's change the input and add a second anchor tag:

import re

html = 'Hello <a href="http://pypix.com" title="pypix">Pypix</a>' \
       'Hello <a href="http://example.com" title="example">Example</a>'

m = re.findall(r'<a.*>.*</a>', html)
if m:
    print(m)

At first glance the pattern still looks right. But don't be fooled! When two anchor tags appear on the same line, as they do here, it no longer works correctly:

['<a href="http://pypix.com" title="pypix">Pypix</a>Hello <a href="http://example.com" title="example">Example</a>']

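This time the pattern matched the first opening tag, the last closing tag, and everything in between, producing one match instead of two separate matches. This is because the default matching mode is "greedy".

In greedy mode, quantifiers (such as * and +) match as many characters as possible.

When you add a question mark after the quantifier (.*?), it becomes "non-greedy" and matches as few characters as possible.

import re

html = 'Hello <a href="http://pypix.com" title="pypix">Pypix</a>' \
       'Hello <a href="http://example.com" title="example">Example</a>'

m = re.findall(r'<a.*?>.*?</a>', html)
if m:
    print(m)

Now the result is correct:

['<a href="http://pypix.com" title="pypix">Pypix</a>', '<a href="http://example.com" title="example">Example</a>']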
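As an aside, a non-greedy quantifier is not the only way to stop a match from running too far. The sketch below, which is not from the original article, uses negated character classes instead, so each part of the pattern cannot cross a tag boundary. It assumes the anchor text contains no nested tags.

```python
import re

html = ('Hello <a href="http://pypix.com" title="pypix">Pypix</a>'
        'Hello <a href="http://example.com" title="example">Example</a>')

# Greedy version from the text: one oversized match.
greedy = re.findall(r'<a.*>.*</a>', html)

# [^>]* cannot run past the end of the opening tag, and
# [^<]* cannot run past the start of the closing tag.
anchors = re.findall(r'<a[^>]*>[^<]*</a>', html)

print(len(greedy))   # 1
print(len(anchors))  # 2
```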

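Lookahead and lookbehind assertions

A lookahead assertion checks whether a pattern matches immediately after the current position, without consuming any characters. This is easier to explain with an example.

The following pattern first matches foo, then checks whether it is followed by bar:

import re

strings = ["hello foo",      # returns False
           "hello foobar"]   # returns True

for string in strings:
    pattern = re.search(r'foo(?=bar)', string)
    if pattern:
        print('True')
    else:
        print('False')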

This may not seem very useful, since we could simply check for foobar directly. However, a lookahead can also be negated. The following example matches foo if and only if it is not followed by bar.

import re

strings = ["hello foo",      # returns True
           "hello foobar",   # returns False
           "hello foobaz"]   # returns True

for string in strings:
    pattern = re.search(r'foo(?!bar)', string)
    if pattern:
        print('True')
    else:
        print('False')
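To make the difference from simply matching foobar more concrete, here is a small sketch (the variable names are illustrative). It shows that the lookahead checks bar without including it in the match, and uses a negative lookahead inside a substitution.

```python
import re

m = re.search(r'foo(?=bar)', 'hello foobar')
matched_text = m.group()  # 'foo' only: 'bar' is checked but not consumed
matched_span = m.span()   # (6, 9)

# A negative lookahead in a substitution: replace foo unless bar follows.
replaced = re.sub(r'foo(?!bar)', 'FOO', 'foo foobar foobaz')

print(matched_text, matched_span)
print(replaced)  # FOO foobar FOObaz
```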

Lookbehind assertions are similar, but they look at the pattern before the current position. Use (?<= for a positive assertion and (?<! for a negative assertion.

The following pattern matches a bar that does not follow foo.

import re

strings = ["hello bar",      # returns True
           "hello foobar",   # returns False
           "hello bazbar"]   # returns True

for string in strings:
    pattern = re.search(r'(?<!foo)bar', string)
    if pattern:
        print('True')
    else:
        print('False')
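The text mentions the positive lookbehind but only demonstrates the negative one. As a hedged illustration, the sketch below uses (?<=\$) to pick out numbers that directly follow a dollar sign; the sample sentence is made up for this example. Note that Python's re module requires the pattern inside a lookbehind to have a fixed width.

```python
import re

text = 'The total was $500 plus $10 for shipping, arriving in 5 days'

# Match digits only when they are immediately preceded by a dollar sign.
# The dollar sign itself is not part of the match.
amounts = re.findall(r'(?<=\$)\d+', text)

print(amounts)  # ['500', '10']
```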

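Conditional (if-then-else) patterns

Regular expressions provide conditional matching. In Python's re module the format is:

(?(id/name)then|else)

The condition is the number (or name) of a previously defined capturing group. The then branch is used if that group took part in the match; otherwise the else branch, which may be omitted, is used.

For example, we can use this regular expression to check for matching opening and closing angle brackets:

import re

strings = ["<pypix>",   # returns True
           "<foo",      # returns False
           "bar>",      # returns False
           "hello"]     # returns True

for string in strings:
    pattern = re.search(r'^(<)?[a-z]+(?(1)>)$', string)
    if pattern:
        print('True')
    else:
        print('False')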

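In the example above, 1 refers to the group (<), which may also match nothing because it is followed by a question mark. The closing angle bracket is required if and only if the condition holds, that is, if group 1 matched.

In some other regex flavors (PCRE, for example) the condition can also be a lookaround assertion, written (?(?=regex)then|else). Python's re module does not support that form.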
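The example above omits the else branch. The following sketch, which is an illustration rather than part of the original article, uses a named group as the condition and supplies both branches: if the optional group open matched '<', the string must end with '>', otherwise it must end with '!'. The helper name check is hypothetical.

```python
import re

# If the optional group "open" matched '<', require '>';
# otherwise (else branch) require '!'.
pattern = re.compile(r'^(?P<open><)?[a-z]+(?(open)>|!)$')


def check(s):
    """Return True if s satisfies the conditional pattern."""
    return pattern.search(s) is not None


for s in ['<pypix>', 'hello!', 'hello', '<foo!', 'bar>']:
    print(s, check(s))
```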

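Non-capturing groups

Groups, enclosed in parentheses, are captured into an array, whose items can be referenced later when needed. But we can also choose not to capture them.

Let's start with a very simple example:

import re

string = 'Hello foobar'
pattern = re.search(r'(f.*)(b.*)', string)

print("f* => {0}".format(pattern.group(1)))  # prints f* => foo
print("b* => {0}".format(pattern.group(2)))  # prints b* => bar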

Now let's make a small change and add another group (H.*) at the front:

import re

string = 'Hello foobar'
pattern = re.search(r'(H.*)(f.*)(b.*)', string)

print("f* => {0}".format(pattern.group(1)))  # prints "f* => Hello " instead of foo
print("b* => {0}".format(pattern.group(2)))  # prints "b* => foo" instead of bar

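The array of matches has changed, and depending on how we use those variables in our code, this may break our script. We would now have to find every place where the match array is used and adjust the indices accordingly. If we are not actually interested in the content of the newly added group, we can make it "non-capturing", like this:

import re

string = 'Hello foobar'
pattern = re.search(r'(?:H.*)(f.*)(b.*)', string)

print("f* => {0}".format(pattern.group(1)))  # prints f* => foo
print("b* => {0}".format(pattern.group(2)))  # prints b* => bar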

By adding ?: at the start of the group, we no longer capture it in the match array, so the other values in the array do not need to shift.
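Non-capturing groups are also handy when parentheses are needed only for alternation. In the following sketch (the sample sentence is invented for illustration), re.findall returns just the names when the title group is non-capturing, and tuples when it is capturing.

```python
import re

text = 'Mr. Smith met Dr. Jones and Ms. Lee'

# The title alternation is grouped but not captured,
# so findall returns only the single captured group (the name).
names = re.findall(r'(?:Mr|Ms|Dr)\. (\w+)', text)

# With a capturing group for the title, findall returns tuples instead.
with_titles = re.findall(r'(Mr|Ms|Dr)\. (\w+)', text)

print(names)        # ['Smith', 'Jones', 'Lee']
print(with_titles)  # [('Mr', 'Smith'), ('Dr', 'Jones'), ('Ms', 'Lee')]
```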

Named groups

Like the previous technique, this is another way to keep us out of the trap. We can give groups names and then refer to them by name instead of by array index. The format is (?P<name>pattern). We can rewrite the previous example like this:

import re

string = 'Hello foobar'
pattern = re.search(r'(?P<fstar>f.*)(?P<bstar>b.*)', string)

print("f* => {0}".format(pattern.group('fstar')))  # prints f* => foo
print("b* => {0}".format(pattern.group('bstar')))  # prints b* => bar

Now we can add another group without affecting the existing groups in the match array:

import re

string = 'Hello foobar'
pattern = re.search(r'(?P<hi>H.*)(?P<fstar>f.*)(?P<bstar>b.*)', string)

print("f* => {0}".format(pattern.group('fstar')))  # prints f* => foo
print("b* => {0}".format(pattern.group('bstar')))  # prints b* => bar
print("h* => {0}".format(pattern.group('hi')))     # prints "h* => Hello " (with a trailing space)
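Named groups can also be read back all at once or referenced in a replacement string. The sketch below is an illustration with a made-up date string: groupdict() returns the named groups as a dictionary, and \g<name> refers to a named group inside re.sub.

```python
import re

date_re = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})')
text = 'Published on 2024-05-12'

# All named groups as a dictionary.
parts = date_re.search(text).groupdict()

# Refer to the groups by name in the replacement string.
reordered = date_re.sub(r'\g<day>/\g<month>/\g<year>', text)

print(parts)      # {'year': '2024', 'month': '05', 'day': '12'}
print(reordered)  # Published on 12/05/2024
```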

Using callback functions

In Python, re.sub() accepts a callback function as the replacement for a regular expression match.

Let's look at this example, an e-mail template:

import re

template = ("Hello [first_name] [last_name], "
            "Thank you for purchasing [product_name] from [store_name]. "
            "The total cost of your purchase was [product_price] plus [ship_price] for shipping. "
            "You can expect your product to arrive in [ship_days_min] to [ship_days_max] business days. "
            "Sincerely, "
            "[store_manager_name]")

# assume dic has all the replacement data,
# such as dic['first_name'], dic['product_price'], etc.
dic = {
    "first_name": "John",
    "last_name": "Doe",
    "product_name": "iphone",
    "store_name": "Walkers",
    "product_price": "$500",
    "ship_price": "$10",
    "ship_days_min": "1",
    "ship_days_max": "5",
    "store_manager_name": "DoeJohn",
}

# non-greedy, so that each placeholder is matched on its own
result = re.compile(r'\[(.*?)\]')
# replaces only the first placeholder, [first_name]
print(result.sub('John', template, count=1))

Notice that all the replacements have one thing in common: they are enclosed in a pair of square brackets. We can capture them with a single regular expression and let a callback function handle the specific replacement.

So using a callback function is a better approach:

import re

template = ("Hello [first_name] [last_name], "
            "Thank you for purchasing [product_name] from [store_name]. "
            "The total cost of your purchase was [product_price] plus [ship_price] for shipping. "
            "You can expect your product to arrive in [ship_days_min] to [ship_days_max] business days. "
            "Sincerely, "
            "[store_manager_name]")

# assume dic has all the replacement data,
# such as dic['first_name'], dic['product_price'], etc.
dic = {
    "first_name": "John",
    "last_name": "Doe",
    "product_name": "iphone",
    "store_name": "Walkers",
    "product_price": "$500",
    "ship_price": "$10",
    "ship_days_min": "1",
    "ship_days_max": "5",
    "store_manager_name": "DoeJohn",
}

def multiple_replace(dic, text):
    # build one alternation of all escaped placeholders,
    # e.g. \[first_name\]|\[last_name\]|...
    pattern = "|".join(re.escape("[" + key + "]") for key in dic)
    # the callback strips the brackets and looks up the replacement
    return re.sub(pattern, lambda m: dic[m.group()[1:-1]], text)

print(multiple_replace(dic, template))
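A variation on the same idea, offered as a sketch rather than as the article's method: instead of building the pattern from the dictionary keys, match any bracketed word with a capturing group and let the callback decide what to do. Here unknown placeholders are left untouched. The function name fill_template and the sample text are hypothetical.

```python
import re


def fill_template(text, values):
    """Replace [key] placeholders with values[key]; leave unknown keys as-is."""
    def replace(match):
        key = match.group(1)                    # the text between the brackets
        return values.get(key, match.group(0))  # fall back to the original text
    return re.sub(r'\[(\w+)\]', replace, text)


filled = fill_template(
    'Hello [first_name] [last_name], your [item] has shipped.',
    {'first_name': 'John', 'last_name': 'Doe'},
)
print(filled)  # Hello John Doe, your [item] has shipped.
```

Because the string returned by a callback is inserted literally, values such as "$500" or text containing backslashes need no extra escaping.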

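Don't reinvent the wheel

Perhaps even more important is knowing when not to use regular expressions. In many cases you can find alternative tools.

Parsing [X]HTML

An answer on Stack Overflow gives a brilliant explanation of why we should not parse [X]HTML with regular expressions.

You should use an HTML parser instead, and Python has many options:

ElementTree is part of the standard library
BeautifulSoup is a popular third-party library
lxml is a full-featured, fast, C-based library

The latter two handle even malformed HTML gracefully, which is a blessing given the many ugly sites out there.

An ElementTree example: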

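from xml.etree import ElementTree

# ElementTree is an XML parser, so the file must be well-formed (e.g. XHTML)
tree = ElementTree.parse('filename.html')

# iter() finds h1 elements at any depth, not just direct children of the root
for element in tree.iter('h1'):
    print(ElementTree.tostring(element, encoding='unicode'))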
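To tie this back to the anchor-tag example from the greedy matching section, here is a minimal sketch that extracts the same links with ElementTree instead of a regular expression. It assumes the markup is well-formed, so the snippet is wrapped in a single root element; the variable names are illustrative.

```python
from xml.etree import ElementTree

# ElementTree needs well-formed markup with a single root element.
snippet = ('<p>Hello <a href="http://pypix.com" title="pypix">Pypix</a> '
           'Hello <a href="http://example.com" title="example">Example</a></p>')

root = ElementTree.fromstring(snippet)

# Collect (href, link text) for every anchor element at any depth.
links = [(a.get('href'), a.text) for a in root.iter('a')]

print(links)
```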

Others

There are many other tools worth considering before you reach for a regular expression.

Original link: http://blog.jobbole.com/65605/

倔強不倒翁
