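Python regular expressions: the regex module

Word start, word end, and word boundary positions

regex uses \m to match at the start of a word and \M to match at the end of a word.

\b matches at any word boundary, but it cannot tell whether that boundary is the start or the end of a word.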
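
As a minimal sketch of the difference, assuming the third-party regex package is installed (pip install regex), the following picks out the first and last letter of each word with \m and \M, and lists every position where \b matches. The sample string and variable names are illustrative only.

```python
import regex

text = "hello brave world"

# \m matches only where a word starts, so \m\w yields each word's first letter.
first_letters = regex.findall(r"\m\w", text)

# \M matches only where a word ends, so \w\M yields each word's last letter.
last_letters = regex.findall(r"\w\M", text)

# \b matches at both kinds of boundary; collect every zero-width match position.
boundaries = [m.start() for m in regex.finditer(r"\b", text)]

print(first_letters)  # ['h', 'b', 'w']
print(last_letters)   # ['o', 'e', 'd']
print(boundaries)     # [0, 5, 6, 11, 12, 17]
```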

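Scoped flags

(?flags-flags:...)

In the re module a flag originally applied to the whole expression only (re has supported scoped inline flags only since Python 3.6). In regex a flag can apply to just part of a pattern: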

>>> regex.search(r"<B>(?i:good)</B>", "<B>GOOD</B>")
<regex.Match object; span=(0, 11), match='<B>GOOD</B>'>

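In this example, case-insensitive matching applies only to the word between the tags.

(?i:...) turns case-insensitive matching on, and (?-i:...) turns it off.

Several flags can be written next to each other, as in (?is-f:...): flags to the left of the minus sign are turned on, and flags to its right are turned off.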
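
The behaviour described above can be checked with a short sketch (assuming the regex package is installed; the test strings are made up for illustration). It turns case-insensitivity on for one part of a pattern, then off for one part of an otherwise case-insensitive pattern.

```python
import regex

# Only "good" is matched case-insensitively; the tags stay case-sensitive.
pattern = regex.compile(r"<B>(?i:good)</B>")
upper_word = pattern.search("<B>GOOD</B>")   # the word differs in case: matches
lower_tag = pattern.search("<b>good</b>")    # the tags differ in case: no match

# (?-i:...) switches the flag off again inside a case-insensitive pattern.
strict_inner = regex.compile(r"(?i)<B>(?-i:good)</B>")
tag_only = strict_inner.search("<b>good</b>")    # tags may differ in case
word_case = strict_inner.search("<b>GOOD</b>")   # the word itself may not

print(upper_word is not None, lower_tag is None)  # True True
print(tag_only is not None, word_case is None)    # True True
```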

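Global flags

Besides scoped flags, there are also global inline flags, for example (?si-f)<B>good</B>

The re module supports this too; see the Python documentation.

Writing flags into the expression, rather than passing them as function arguments, is convenient, easy to read, and less error-prone.

Additional features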

Added support for lookaround in conditional pattern (Hg issue 163)

The test of a conditional pattern can now be a lookaround:

>>> regex.match(r'(?(?=\d)\d+|\w+)', '123abc')
<regex.Match object; span=(0, 3), match='123'>
>>> regex.match(r'(?(?=\d)\d+|\w+)', 'abc123')
<regex.Match object; span=(0, 6), match='abc123'>

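This is not quite the same as putting a lookaround in the first branch of a pair of alternatives.

>>> print(regex.match(r'(?:(?=\d)\d+\b|\w+)', '123abc'))   # if branch 1 fails, branch 2 is tried
<regex.Match object; span=(0, 6), match='123abc'>
>>> print(regex.match(r'(?(?=\d)\d+\b|\w+)', '123abc'))    # if branch 1 fails, branch 2 is not tried
None

In the first example the lookaround matches but the rest of the first branch fails, so the second branch is tried. In the second example the lookaround also matches and the first branch also fails, but the second branch is not tried.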
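
A small sketch of the same contrast, using fullmatch instead of \b to force the first branch to fail (assuming the regex package is installed; describe is a hypothetical helper introduced here for illustration):

```python
import regex

# Conditional: the lookahead chooses the branch, and the choice is final.
conditional = regex.compile(r"(?(?=\d)\d+|\w+)")
# Plain alternation: if the first branch fails, the second one is still tried.
alternation = regex.compile(r"(?:(?=\d)\d+|\w+)")

def describe(pattern, text):
    """Return the fully matched text, or None when the whole string does not match."""
    m = pattern.fullmatch(text)
    return m.group() if m else None

print(describe(conditional, "123"))   # '123'  (starts with a digit, all digits)
print(describe(conditional, "abc1"))  # 'abc1' (no leading digit, \w+ branch)
print(describe(conditional, "12ab"))  # None   (digit branch chosen, then fails)
print(describe(alternation, "12ab"))  # '12ab' (falls back to \w+)
```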

Added POSIX matching (leftmost longest) (Hg issue 150)

POSIX matching (leftmost longest) is turned on with the (?p) flag:

>>> # Normal matching.
>>> regex.search(r'Mr|Mrs', 'Mrs')
<regex.Match object; span=(0, 2), match='Mr'>
>>> regex.search(r'one(self)?(selfsufficient)?', 'oneselfsufficient')
<regex.Match object; span=(0, 7), match='oneself'>
>>> # POSIX matching.
>>> regex.search(r'(?p)Mr|Mrs', 'Mrs')
<regex.Match object; span=(0, 3), match='Mrs'>
>>> regex.search(r'(?p)one(self)?(selfsufficient)?', 'oneselfsufficient')
<regex.Match object; span=(0, 17), match='oneselfsufficient'>
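
The practical effect shows up when one alternative is a prefix of another. The following is a sketch, assuming the regex package is installed, with a made-up keyword list:

```python
import regex

keywords = r"in|int|integer"

# Normal matching stops at the first alternative that succeeds.
first_alternative = regex.search(keywords, "integer").group()

# With (?p), the longest of the leftmost matches wins, whatever the order.
longest = regex.search(r"(?p)" + keywords, "integer").group()

print(first_alternative)  # in
print(longest)            # integer
```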

Added (?(DEFINE)...) (Hg issue 152)

If there is no group called "DEFINE", then ... is ignored, except that any groups defined within it can still be called:

>>> regex.search(r'(?(DEFINE)(?P<quant>\d+)(?P<item>\w+))(?&quant) (?&item)', '5 elephants')
<regex.Match object; span=(0, 11), match='5 elephants'>

# Fixed patterns at both ends, with other content in between
>>> regex.search(r'(?(DEFINE)(?P<quant>\d+)(?P<item>\w+))(?&quant)[\u4E00-\u9FA5](?&item)', '123哈哈dog')
<regex.Match object; span=(0, 8), match='123哈哈dog'>
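
One way to use this is to declare a sub-pattern once and call it several times. The sketch below assumes the regex package is installed; the group name num and the version-string format are illustrative choices, not part of the library.

```python
import regex

# The DEFINE block only declares the group; nothing in it is matched directly.
version_pattern = regex.compile(
    r"(?(DEFINE)(?P<num>\d+))"      # reusable piece: one or more digits
    r"(?&num)\.(?&num)\.(?&num)"    # major.minor.patch, calling the piece three times
)

found = version_pattern.search("release 2.10.3 is out")
missing = version_pattern.search("release 2.10 is out")

print(found.group())  # 2.10.3
print(missing)        # None
```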
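
[[a-z]--[aeiou]]

V0: simple sets, compatible with the re module

V1: nested sets, with enhanced behaviour. Here the set contains 'a' to 'z', excluding "a", "e", "i", "o", "u".

For example:

>>> regex.search(r'(?V1)[[a-z]--[aeiou]]+', 'abcde')
<regex.Match object; span=(1, 4), match='bcd'>
>>> regex.search(r'[[a-z]--[aeiou]]+', 'abcde', flags=regex.V1)
<regex.Match object; span=(1, 4), match='bcd'>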

\K

\K keeps the part of the match that follows the position of \K and discards the part matched before it.

>>> m = regex.search(r'(\w\w\K\w\w\w)', 'abcdef')
>>> m    # keeps 'cde', discards 'ab'
<regex.Match object; span=(2, 5), match='cde'>
>>> m[0]
'cde'
>>> m[1]
'abcde'

>>> m = regex.search(r'(?r)(\w\w\K\w\w\w)', 'abcdef')
>>> m    # reverse search: keeps 'bc', discards 'def'
<regex.Match object; span=(1, 3), match='bc'>
>>> m[0]
'bc'
>>> m[1]
'bcdef'
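
A typical use is to require a fixed prefix without including it in the result. This is a sketch assuming the regex package is installed; the input line is made up.

```python
import regex

line = "user=alice id=42"

# "id=" must be present, but \K drops it from the reported match.
m = regex.search(r"id=\K\d+", line)

print(m.group())  # 42
print(m.span())   # (14, 16)
```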

 

expandf

Subscripts can be used to get the individual captures of a repeated capture group.

>>> m = regex.match(r"(\w)+", "abc")
>>> m.expandf("{1}")    # same as m.expandf("{1[-1]}")
'c'
>>> m.expandf("{1[0]} {1[1]} {1[2]}")
'a b c'
>>> m.expandf("{1[-1]} {1[-2]} {1[-3]}")
'c b a'

With a named group:
>>> m = regex.match(r"(?P<letter>\w)+", "abc")
>>> m.expandf("{letter}")
'c'
>>> m.expandf("{letter[0]} {letter[1]} {letter[2]}")
'a b c'
>>> m.expandf("{letter[-1]} {letter[-2]} {letter[-3]}")
'c b a'

>>> m = regex.match(r"(\w+) (\w+)", "foo bar")
>>> m.expandf("{0} => {2} {1}")
'foo bar => bar foo'

>>> m = regex.match(r"(?P<word1>\w+) (?P<word2>\w+)", "foo bar")
>>> m.expandf("{word2} {word1}")
'bar foo'

The same works with match objects returned by search().

   

subf and subfn

subf and subfn are alternatives to sub and subn respectively. When passed a replacement string, they treat it as a format string.

 

>>> regex.subf(r"(\w+) (\w+)", "{0} => {2} {1}", "foo bar")
'foo bar => bar foo'
>>> regex.subf(r"(?P<word1>\w+) (?P<word2>\w+)", "{word2} {word1}", "foo bar")
'bar foo' 
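
As a further sketch (assuming the regex package is installed, with a made-up key=value string), subf replaces every match and subfn additionally reports how many replacements were made:

```python
import regex

text = "a=1 b=2"

# Named groups are referenced by name inside the format string.
swapped = regex.subf(r"(?P<key>\w+)=(?P<val>\w+)", "{val}={key}", text)

# subfn returns a (new_string, number_of_substitutions) tuple, like subn.
result, count = regex.subfn(r"(\w+)=(\w+)", "{2}:{1}", text)

print(swapped)        # 1=a 2=b
print(result, count)  # 1:a 2:b 2
```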

partial

Partial matches: match, search, fullmatch, and finditer all support partial matching, enabled with the partial keyword argument. The match object has a partial attribute, which is True for a partial match and False for a complete match.

>>> regex.search(r'\d{4}', '12', partial=True)
<regex.Match object; span=(0, 2), match='12', partial=True>
>>> regex.search(r'\d{4}', '123', partial=True)
<regex.Match object; span=(0, 3), match='123', partial=True>
>>> regex.search(r'\d{4}', '1234', partial=True)    # complete match: no partial
<regex.Match object; span=(0, 4), match='1234'>
>>> regex.search(r'\d{4}', '12345', partial=True)
<regex.Match object; span=(0, 4), match='1234'>
>>> regex.search(r'\d{4}', '12345', partial=True).partial    # complete match
False
>>> regex.search(r'\d{4}', '145', partial=True).partial      # partial match
True
>>> regex.search(r'\d{4}', '1245', partial=True).partial     # complete match
False
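
Partial matching is useful for validating input while it is still being typed. The following is a sketch under that assumption (regex package installed); classify and the date format are hypothetical names chosen for illustration.

```python
import regex

DATE = regex.compile(r"\d{4}-\d{2}-\d{2}")

def classify(text):
    """Classify input as 'complete', 'incomplete' (could still become valid), or 'invalid'."""
    m = DATE.fullmatch(text, partial=True)
    if m is None:
        return "invalid"
    return "incomplete" if m.partial else "complete"

print(classify("2024-0"))      # incomplete
print(classify("2024-05-17"))  # complete
print(classify("20x4"))        # invalid
```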

capturesdict()

groupdict()

captures()

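capturesdict() is a combination of groupdict() and captures():

groupdict(): returns a dict whose keys are the group names and whose values are the last capture of each group

captures(): returns a list of all the captures of a group

capturesdict(): returns a dict whose keys are the group names and whose values are lists of all the captures of each group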

 

>>> m = regex.match(r"(?:(?P<word>\w+) (?P<digits>\d+)\n)+", "one 1\ntwo 2\nthree 3\n")
>>> m.groupdict()
{'word': 'three', 'digits': '3'}
>>> m.captures("word")
['one', 'two', 'three']
>>> m.captures("digits")
['1', '2', '3']
>>> m.capturesdict()
{'word': ['one', 'two', 'three'], 'digits': ['1', '2', '3']} 
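
Because the lists in capturesdict() line up by repetition, they can be zipped back into rows. A minimal sketch, assuming the regex package is installed and reusing the pattern from the example above:

```python
import regex

m = regex.match(r"(?:(?P<word>\w+) (?P<digits>\d+)\n)+", "one 1\ntwo 2\nthree 3\n")
caps = m.capturesdict()

# Pair the n-th capture of "word" with the n-th capture of "digits".
pairs = list(zip(caps["word"], (int(d) for d in caps["digits"])))

print(pairs)  # [('one', 1), ('two', 2), ('three', 3)]
```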

(?P<name>)

Duplicate group names are allowed, with later captures overwriting earlier ones.

Optional groups:
>>> # Both groups capture, the second capture 'overwriting' the first.
>>> m = regex.match(r"(?P<item>\w+)? or (?P<item>\w+)?", "first or second")
>>> m.group("item")
'second'
>>> m.captures("item")
['first', 'second']

>>> m = regex.match(r"(?P<item>\w+)? or (?P<item>\w+)?", " or second")
>>> m.group("item")
'second'
>>> m.captures("item")
['second']

>>> m = regex.match(r"(?P<item>\w+)? or (?P<item>\w+)?", "first or ")
>>> m.group("item")
'first'
>>> m.captures("item")
['first']

Mandatory groups:
>>> m = regex.match(r"(?P<item>\w*) or (?P<item>\w*)", "first or second")
>>> m.group("item")
'second'
>>> m.captures("item")
['first', 'second']

>>> m = regex.match(r"(?P<item>\w*) or (?P<item>\w*)", " or second")
>>> m.group("item")
'second'
>>> m.captures("item")
['', 'second']

>>> m = regex.match(r"(?P<item>\w*) or (?P<item>\w*)", "first or ")
>>> m.group("item")
''
>>> m.captures("item")
['first', '']
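
Duplicate names are handy when the same field can appear in different positions across alternatives. A sketch, assuming the regex package is installed; the two date layouts and the year_month helper are invented for illustration.

```python
import regex

date = regex.compile(
    r"(?P<year>\d{4})-(?P<month>\d{2})"   # e.g. 2024-05
    r"|"
    r"(?P<month>\d{2})/(?P<year>\d{4})"   # e.g. 05/2024
)

def year_month(text):
    """Return (year, month) whichever layout matched, or None."""
    m = date.fullmatch(text)
    return (m.group("year"), m.group("month")) if m else None

print(year_month("2024-05"))  # ('2024', '05')
print(year_month("05/2024"))  # ('2024', '05')
```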

 

detach_string

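A match object holds a reference to the searched string through its string attribute. The detach_string method "detaches" that string, making it available for garbage collection, which can save valuable memory if the string is very large.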

>>> m = regex.search(r"\w+", "Hello world")
>>> print(m.group())
Hello
>>> print(m.string)
Hello world
>>> m.detach_string()
>>> print(m.group())
Hello
>>> print(m.string)
None

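(?R), (?0), (?1), (?2)

(?R) or (?0) tries to match the entire regex recursively.
(?1), (?2), etc. try to match the relevant capture group (group 1, group 2, and so on). For example, (Tarzan|Jane) loves (?1) is equivalent to (Tarzan|Jane) loves (?:Tarzan|Jane).
(?&name) tries to match the named capture group.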

>>> regex.match(r"(Tarzan|Jane) loves (?1)", "Tarzan loves Jane").groups()
('Tarzan',)
>>> regex.match(r"(Tarzan|Jane) loves (?1)", "Jane loves Tarzan").groups()
('Jane',)

>>> m = regex.search(r"(\w)(?:(?R)|(\w?))\1", "kayak")
>>> m.group(0, 1, 2)
('kayak', 'k', None)
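
Recursion makes it possible to match nested structures. The sketch below (assuming the regex package is installed) checks for balanced parentheses by letting group 1 call itself with (?1); is_balanced is a hypothetical helper name.

```python
import regex

# Group 1: "(" followed by any mix of non-parenthesis characters
# and nested copies of group 1, then ")". \Z anchors at the end.
balanced = regex.compile(r"(\((?:[^()]|(?1))*\))\Z")

def is_balanced(text):
    """True if text is a single, correctly nested parenthesised expression."""
    return balanced.match(text) is not None

print(is_balanced("(a(b)c)"))  # True
print(is_balanced("()"))       # True
print(is_balanced("(a(b c)"))  # False
```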

Fuzzy matching

There are three types of error:

  • Insertion: “i”
  • Deletion: “d”
  • Substitution: “s”

In addition, “e” indicates any type of error.

Examples:

  • foo match “foo” exactly
  • (?:foo){i} match “foo”, permitting insertions
  • (?:foo){d} match “foo”, permitting deletions
  • (?:foo){s} match “foo”, permitting substitutions
  • (?:foo){i,s} match “foo”, permitting insertions and substitutions
  • (?:foo){e} match “foo”, permitting errors

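If a certain type of error is specified, then any type that is not specified is not permitted. The following examples omit the item and show only the fuzziness: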

  • {d<=3} permit at most 3 deletions, but no other types
  • {i<=1,s<=2} permit at most 1 insertion and at most 2 substitutions, but no deletions
  • {1<=e<=3} permit at least 1 and at most 3 errors
  • {i<=2,d<=2,e<=3} permit at most 2 insertions, at most 2 deletions, at most 3 errors in total, but no substitutions

It’s also possible to state the costs of each type of error and the maximum permitted total cost.

Examples:

  • {2i+2d+1s<=4} each insertion costs 2, each deletion costs 2, each substitution costs 1, the total cost must not exceed 4
  • {i<=1,d<=1,s<=1,2i+2d+1s<=4} at most 1 insertion, at most 1 deletion, at most 1 substitution; each insertion costs 2, each deletion costs 2, each substitution costs 1, the total cost must not exceed 4

A test can also be applied to the characters that are substituted or inserted.

Examples:

  • {s<=2:[a-z]} at most 2 substitutions, which must be in the character set [a-z].
  • {s<=2,i<=3:\d} at most 2 substitutions, at most 3 insertions, which must be digits.

By default, fuzzy matching searches for the first match that meets the given constraints. The ENHANCEMATCH flag, (?e), makes it try to improve the fit of the match it has found (that is, to reduce the number of errors).

The BESTMATCH flag makes it search for the best match instead.

  • regex.search("(dog){e}", "cat and dog")[1] returns "cat" because that matches "dog" with 3 errors (an unlimited number of errors is permitted).
  • regex.search("(dog){e<=1}", "cat and dog")[1] returns " dog" (with a leading space) because that matches "dog" with 1 error, which is within the limit.
  • regex.search("(?e)(dog){e<=1}", "cat and dog")[1] returns "dog" (without a leading space) because the fuzzy search matches " dog" with 1 error, which is within the limit, and the (?e) flag then makes it attempt a better fit.
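
A short sketch of these constraints in use, assuming the regex package is installed (the misspelled inputs are made up). It allows at most one error of any type and then asks for the best match with the inline (?b) form of BESTMATCH:

```python
import regex

# "helo" is "hello" with one character deleted: within the limit of 1 error.
close = regex.fullmatch(r"(?:hello){e<=1}", "helo")

# "hxlxo" needs two substitutions: over the limit, so there is no match.
too_far = regex.fullmatch(r"(?:hello){e<=1}", "hxlxo")

# BESTMATCH looks for the match with the fewest errors.
best = regex.search(r"(?b)(dog){e<=1}", "cat and dog")

print(close.fuzzy_counts)  # (0, 0, 1): substitutions, insertions, deletions
print(too_far)             # None
print(best[1])             # dog
```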

The match object has a fuzzy_counts attribute, which gives the total number of substitutions, insertions, and deletions:

>>> # A 'raw' fuzzy match:
>>> regex.fullmatch(r"(?:cats|cat){e<=1}", "cat").fuzzy_counts
(0, 0, 1)
>>> # 0 substitutions, 0 insertions, 1 deletion.

>>> # A better match might be possible if the ENHANCEMATCH flag used:
>>> regex.fullmatch(r"(?e)(?:cats|cat){e<=1}", "cat").fuzzy_counts
(0, 0, 0)
>>> # 0 substitutions, 0 insertions, 0 deletions.

The match object also has a fuzzy_changes attribute, which gives a tuple of the positions of the substitutions, insertions, and deletions:

>>> m = regex.search('(fuu){i<=2,d<=2,e<=5}', 'anaconda foo bar')
>>> m
<regex.Match object; span=(7, 10), match='a f', fuzzy_counts=(0, 2, 2)>
>>> m.fuzzy_changes
([], [7, 8], [10, 11])
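
\L<name>

Named lists

The old way: p = regex.compile(r"first|second|third|fourth|fifth"). If the list is large, parsing the resulting regex can take considerable time, and care must be taken that the strings are properly escaped and properly ordered, for example "cats" before "cat".

The new way: the order of the items is irrelevant, because they are treated as a set.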

>>> option_set = ["first", "second", "third", "fourth", "fifth"]
>>> p = regex.compile(r"\L<options>", options=option_set)

The named_lists attribute:

>>> print(p.named_lists)
# Python 3
{'options': frozenset({'fifth', 'first', 'fourth', 'second', 'third'})}
# Python 2
{'options': frozenset(['fifth', 'fourth', 'second', 'third', 'first'])}
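
The sketch below (assuming the regex package is installed; the word list is invented) shows that neither ordering nor escaping needs attention: "cat" is listed before "cats", and "a.b" contains a regex metacharacter that is taken literally.

```python
import regex

animals = ["cat", "cats", "dog", "a.b"]

# Items are matched as literal strings; the list order does not matter.
p = regex.compile(r"\b\L<animals>\b", animals=animals)

found = p.findall("two cats, one dog and a.b but not axb")

print(found)  # ['cats', 'dog', 'a.b']
```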

Set operators

Version 1 behaviour only.

Set operators have been added, and a set can contain nested sets.

The operators, in order of increasing precedence, are:

  • || for union (“x||y” means “x or y”)
  • ~~ (double tilde) for symmetric difference (“x~~y” means “x or y, but not both”)
  • && for intersection (“x&&y” means “x and y”)
  • -- (double dash) for difference (“x--y” means “x but not y”)

Implicit union, i.e. simple juxtaposition as in [ab], has the highest precedence. Thus, [ab&&cd] is the same as [[a||b]&&[c||d]].

Examples:

  • [ab]  # Set containing ‘a’ and ‘b’
  • [a-z]  # Set containing ‘a’ .. ‘z’
  • [[a-z]--[qw]]  # Set containing ‘a’ .. ‘z’, but not ‘q’ or ‘w’
  • [a-z--qw]  # Same as above
  • [\p{L}--QW]  # Set containing all letters except ‘Q’ and ‘W’
  • [\p{N}--[0-9]]  # Set containing all numbers except ‘0’ .. ‘9’
  • [\p{ASCII}&&\p{Letter}] # Set containing all characters which are ASCII and letter
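
Two of these sets in action, as a sketch assuming the regex package is installed. The (?V1) inline flag enables version 1 behaviour, and the sample strings are made up.

```python
import regex

# Difference: lowercase ASCII letters minus the vowels, i.e. runs of consonants.
consonant_runs = regex.findall(r"(?V1)[[a-z]--[aeiou]]+", "regular expression")

# Intersection: characters that are both ASCII and letters ("\u00e9" is e-acute).
ascii_letters = regex.findall(r"(?V1)[\p{ASCII}&&\p{Letter}]+", "abc123 d\u00e9f")

print(consonant_runs)  # ['r', 'g', 'l', 'r', 'xpr', 'ss', 'n']
print(ascii_letters)   # ['abc', 'd', 'f']
```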
 

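Match objects have additional methods that return information about all the successful matches of a repeated capture group. These methods are: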

  • matchobject.captures([group1, ...])
  • matchobject.starts([group])
  • matchobject.ends([group])
  • matchobject.spans([group])
>>> m = regex.search(r"(\w{3})+", "123456789")
>>> m.group(1)
'789'
>>> m.captures(1)
['123', '456', '789']
>>> m.start(1)
6
>>> m.starts(1)
[0, 3, 6]
>>> m.end(1)
9
>>> m.ends(1)
[3, 6, 9]
>>> m.span(1)
(6, 9)
>>> m.spans(1)
[(0, 3), (3, 6), (6, 9)]

Ways to access groups
(1) By subscripting and slicing:
>>> m = regex.search(r"(?P<before>.*?)(?P<num>\d+)(?P<after>.*)", "pqr123stu")
>>> print(m["before"])
pqr
>>> print(len(m))
4
>>> print(m[:])
('pqr123stu', 'pqr', '123', 'stu')

(2) By name, with group("name"):
>>> m.group('num')
'123'

(3) By group number:
>>> m.group(0)
'pqr123stu'
>>> m.group(1)
'pqr'

?r