I have been writing web crawlers lately, so I went back over the basics of regular expressions in Python.
Basic matching
import re
content = 'Hello 123 4567 World_this is a regex Demo'
print(len(content))
# A raw string (r"...") keeps Python from treating \s, \d and \w as string escapes
result = re.match(r"^Hello\s\d\d\d\s\d{4}\s\w{10}.*Demo$", content)
print(result)
print(result.group())
print(result.span())
Output:
41
<_sre.SRE_Match object; span=(0, 41), match='Hello 123 4567 World_this is a regex Demo'>
Hello 123 4567 World_this is a regex Demo
(0, 41)
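re.match returns None when the pattern does not match at the start of the string, and calling .group() on None raises AttributeError. Here is a minimal sketch of guarding against that. The helper name first_match and the sample strings are made up for illustration:

```python
import re

def first_match(pattern, text):
    """Return the matched text, or None if the pattern does not match at the start."""
    result = re.match(pattern, text)
    # re.match returns None on failure, so check before calling .group()
    return result.group() if result else None

hit = first_match(r"^Hello\s\d{3}", "Hello 123 4567")
miss = first_match(r"^Hello\s\d{3}", "Goodbye 123")
print(hit)   # Hello 123
print(miss)  # None
```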
Generic matching
Use .* to stand in for everything between the parts you care about:
content = 'Hello 123 4567 World_this is a regex Demo'
result = re.match("^Hello.*Demo$", content)
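Capturing a target
To extract 1234567, wrap the part of the pattern you want in parentheses and read it back with group(1):
content = 'Hello 1234567 World_this is a regex Demo'
result = re.match(r"^Hello\s(\d+)\s.*Demo$", content)
print(result.group(1))
Output: 1234567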
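Greedy matching
By default .* consumes as many characters as it can, so it leaves only the last digit for (\d+):
result = re.match(r"^He.*(\d+).*Demo$", content)
print(result.group(1))
# Output: 7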
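Non-greedy matching
Appending ? to a quantifier (as in .*?) makes it match as few characters as possible:
content = 'Hello 1234567 World_this is a regex Demo'
result = re.match(r"^He.*?(\d+).*Demo$", content)
print(result.group(1))
# Output: 1234567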
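The two behaviours are easiest to compare side by side. The sketch below runs both patterns on the same string, and adds one extra case that is not in the original notes: a non-greedy group at the very end of a pattern matches nothing, because nothing after it forces it to expand.

```python
import re

content = 'Hello 1234567 World_this is a regex Demo'

greedy = re.match(r"^He.*(\d+).*Demo$", content)
lazy = re.match(r"^He.*?(\d+).*Demo$", content)
# A trailing non-greedy group has nothing after it, so it stays empty
trailing = re.match(r"^Hello\s(\d+)\s(.*?)", content)

print(greedy.group(1))          # 7
print(lazy.group(1))            # 1234567
print(repr(trailing.group(2)))  # ''
```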
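Match modes
- In the content below there is a line break after World_this.
- By default, . matches any character except a newline.
- Passing re.S as the flags argument lets . match newlines as well.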
content = '''Hello 1234567 World_this
is a regex Demo
'''
result = re.match(r"^He.*?(\d+).*?Demo$", content, re.S)
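To see what re.S changes, the following sketch runs the same pattern on the same multi-line string with and without the flag. The variable names without_flag and with_flag are only for illustration:

```python
import re

content = '''Hello 1234567 World_this
is a regex Demo
'''
pattern = r"^He.*?(\d+).*?Demo$"

# Without re.S, . stops at the newline, so the pattern can never reach Demo
without_flag = re.match(pattern, content)
# With re.S, . also matches the newline
with_flag = re.match(pattern, content, re.S)

print(without_flag)        # None
print(with_flag.group(1))  # 1234567
```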
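Escaping
Characters with a special meaning in a regex, such as $ and ., need a backslash to be matched literally:
content = 'price is $10.00'
result = re.match(r'price is \$10\.00', content)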
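When the literal text comes from a variable, the standard library's re.escape can add the backslashes for you. This goes slightly beyond the original notes. The variable name literal is an assumption for the example:

```python
import re

content = 'price is $10.00'
literal = '$10.00'

# Unescaped, $ asserts end-of-string, so the pattern cannot match
unescaped = re.match('price is ' + literal, content)

# re.escape backslash-escapes the regex metacharacters in the literal text
escaped = re.match('price is ' + re.escape(literal), content)

print(unescaped)        # None
print(escaped.group())  # price is $10.00
```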
re.search
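Scans the whole string and returns the first match. Unlike re.match, the match does not have to start at the beginning of the string.
findall: returns all non-overlapping matches as a list.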
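Here is a small sketch contrasting re.match, re.search and re.findall on one string. The sample sentence is made up for the example:

```python
import re

content = 'Order 66 shipped 3 boxes on day 12'

at_start = re.match(r'\d+', content)     # the string does not start with a digit
first = re.search(r'\d+', content)       # first match anywhere in the string
every = re.findall(r'\d+', content)      # all matches, as a list of strings

print(at_start)       # None
print(first.group())  # 66
print(every)          # ['66', '3', '12']
```

Note that when the pattern contains one capturing group, findall returns the contents of that group rather than the whole match.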
re.sub
String substitution
content = "dimples 54 aaa"
result = re.sub(r'\d+', '', content)
print(result)
Output (the spaces on both sides of the removed number remain):
dimples  aaa
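Reusing the matched text in the replacement
- \1 : inserts the text captured by the first pair of parentheses, so the replacement can build on it
content = "dimples 12345 aaa"
result = re.sub(r'(\d+)', r'\1 6789', content)
print(result)
Output: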
dimples 12345 6789 aaa
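Backreferences can also reorder captured groups. And when a digit has to follow the reference directly, the \g<1> form avoids the ambiguity, since \10 would be read as group 10. A short sketch, with patterns chosen only for illustration:

```python
import re

content = "dimples 12345 aaa"

# \2 and \1 refer to the second and first groups, so this swaps them
swapped = re.sub(r'(\w+) (\d+)', r'\2 \1', content)

# \g<1> is the unambiguous way to write "group 1" before a literal digit
appended = re.sub(r'(\d+)', r'\g<1>0', content)

print(swapped)   # 12345 dimples aaa
print(appended)  # dimples 123450 aaa
```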
Putting it together
Extract all the song titles
html = '''<li class="aaa">
<a href="2.mp3">借</a>
</li>
<li data-view="6">消愁</li>
'''
result1 = re.sub('<a.*?>|</a>', '', html)
print(result1)
result2 = re.findall('<li.*?>(.*?)</li>', result1, re.S)
for result in result2:
    print(result.strip())
Output:
借
消愁
re.compile
Compiles a regex string into a regular expression object so that the pattern can be reused.
content = '''Hello 1234567 World_this
is a regex Demo
'''
pattern = re.compile(r"^He.*?(\d+).*?Demo$", re.S)
# A compiled pattern has its own match method; the flags are already baked in
result = pattern.match(content)
print(result)
print(result.group(1))
# Output: 1234567
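The benefit of compiling shows up when the same pattern is applied to many strings. Here is a minimal sketch. The samples list and its contents are invented for the example:

```python
import re

pattern = re.compile(r"^He.*?(\d+).*?Demo$", re.S)

samples = [
    'Hello 1234567 World_this\nis a regex Demo\n',
    'Hey 42 Demo',
    'no digits here',
]

extracted = []
for text in samples:
    result = pattern.match(text)
    # Keep None for strings that do not match instead of raising an error
    extracted.append(result.group(1) if result else None)

print(extracted)  # ['1234567', '42', None]
```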