Advanced Regular Expression Usage: Lookarounds and Groups

Original

Koorye

2020-06-14 18:52

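Contents

Lookarounds

Groups

I only discovered the grouping features of regular expressions yesterday, so I am writing them down here.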

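Lookarounds

Lookahead: `(?=...)`, `(?!...)`

(?=...) (positive lookahead): matches text that is followed by the given pattern; the pattern itself is not part of the match.

Example: \w+(?=\.com) matches words that are followed by .com, but the returned results do not include .com.

*Note: the re module must be imported (import re); this is not repeated below.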

print(re.findall(r'\w+(?=\.com)', 'baidu.com google.com csdn.net'))

Output:

['baidu', 'google']

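(?!...) (negative lookahead): matches text that is not followed by the given pattern; the pattern itself is not part of the match.

Example: Windows(?!95|98|NT|2000) matches "Windows" only when it is not followed by 95, 98, NT, or 2000; the result contains nothing after "Windows".

print(re.match(r'Windows(?!95|98|NT|2000)', 'Windows95'))
print(re.match(r'Windows(?!95|98|NT|2000)', 'Windows10').group())

Output: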

None
Windows
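Lookaheads are zero-width: they check what follows without consuming it, so the rest of the pattern continues from the same position. The sketch below illustrates this with the Windows example; the version numbers 10 and 11 are chosen purely for illustration.

```python
import re

# Positive lookahead: "Windows" must be followed by 10 or 11,
# but those digits are not part of the match.
m = re.match(r'Windows(?=10|11)', 'Windows10')
print(m.group())  # Windows
print(m.span())   # (0, 7)

# The lookahead consumed nothing, so \d+ still starts right after "Windows".
full = re.match(r'Windows(?=10|11)\d+', 'Windows10')
print(full.group())  # Windows10

# Negative lookahead combined with a consuming pattern.
rejected = re.match(r'Windows(?!95|98)\d+', 'Windows95')
accepted = re.match(r'Windows(?!95|98)\d+', 'Windows10')
print(rejected)          # None
print(accepted.group())  # Windows10
```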

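Lookbehind: `(?<=...)`, `(?<!...)`

(?<=...) (positive lookbehind): matches text that is preceded by the given pattern; the pattern itself is not part of the match.

Example: (?<=www\.)\w+ matches words that are preceded by www., but the returned results do not include that prefix.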

print(re.findall(r'(?<=www\.)\w+', 'www.github.com cn.vuejs.org www.baidu.com'))

Output:

['github', 'baidu']

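(?<!...) (negative lookbehind): matches text that is not preceded by the given pattern; the pattern itself is not part of the match. Note that ?<!= is not a separate operator: (?<!=...) is simply a negative lookbehind whose pattern begins with a literal =.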
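As a minimal sketch of negative lookbehind, the first pattern below reuses the host names from the previous example and keeps only the .com/.org names that are not preceded by www. The second pattern shows that `(?<!=)` just means "not preceded by an equals sign". The `\b` anchors and the sample string `'a=1 b 2'` are assumptions added for the illustration.

```python
import re

text = 'www.github.com cn.vuejs.org www.baidu.com'

# Keep "name.com" / "name.org" only when "name" is not preceded by "www.".
# \b stops the match from starting in the middle of a word.
not_www = re.findall(r'(?<!www\.)\b\w+\.(?:com|org)\b', text)
print(not_www)  # ['vuejs.org']

# (?<!=) is a negative lookbehind for a literal "=":
# the 1 in "a=1" is skipped, the standalone 2 is kept.
not_assigned = re.findall(r'(?<!=)\b\d+', 'a=1 b 2')
print(not_assigned)  # ['2']
```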

Exercise: a small web scraper

Goal: scrape all content wrapped in div tags on douban.com

import requests
import re

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36'}
url = 'https://www.douban.com/'
http = requests.get(url, headers=headers)
reg = r'(?<=<div[^>]*>)\n*\w+\n*(?=</div>)'
result_list = re.findall(reg, http.text)
for result in result_list:
    print(result)

The natural first idea is to surround the content with a lookbehind and a lookahead, (?<=<div ...>)...(?=</div>). The result, however, is:

re.error: look-behind requires fixed-width pattern

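It turns out that lookbehind in Python's re module does not support variable-width patterns, so a better approach is needed. For now, we can stop excluding the div tags and print the whole match:

Change the expression to: reg = r'<div[^>]*>[^<]*</div>'

Output: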

<div id="dale_anonymous_homepage_top_for_crazy_ad"></div>
<div id="dale_anonymous_homepage_right_top"></div>
<div id="dale_homepage_online_activity_promo_1"></div>
<div id="dale_anonymous_homepage_doublemint"></div>
<div class="side"></div>
<div id="dale_anonymous_homepage_movie_bottom" class="extra"></div>
<div class="author">〔日〕多利安助...</div>
<div class="author">〔日〕伊坂幸太...</div>
<div class="author">〔日〕池井戶潤...</div>
<div class="author">吳沚默</div>
<div class="author"></div>
...
<div class="title">你聽過《東京愛情故事》嗎？</div>
<div id="dale_anonymous_home_page_middle_2" class="extra"></div>
<div class="market-topic-pic"
            style="background-image:url(https://img3.doubanio.com/img/files/file-1513305186-3.jpg)">
          </div>
<div class="market-spu-pic"
            style="background-image: url(https://img3.doubanio.com/img/files/file-1546855945-0.jpg)">
          </div>
<div class="market-spu-pic"
            style="background-image: url(https://img3.doubanio.com/img/files/file-1545819571-0.jpg)">
          </div>
<div class="market-spu-pic"
            style="background-image: url(https://img9.doubanio.com/img/files/file-1513305186-4.jpg)">
          </div>
<div class="follow">
          3人關注
        </div>
<div class="datetime">
            4月4日 週六 19:30 - 21:30
        </div>
<div class="follow">
          1人關注
        </div>
<div class="datetime">
            12月21日 週六 - 4月12日 週日
        </div>
<div class="follow">
          5人關注
        </div>
<div id="dale_anonymous_home_page_bottom" class="extra"></div>

The results come back, but they are very messy; that is what we will improve next.
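The lookbehind restriction can be reproduced without a network request. The sketch below uses a made-up HTML string (not douban's markup) and shows that a variable-width lookbehind is rejected when the pattern is compiled, while a fixed-width one compiles but only recognizes the exact opening tag `<div>`.

```python
import re

# Hypothetical sample markup, used only for this demonstration.
html = '<div class="author">Alice</div><div>Bob</div>'

# [^>]* can match any number of characters, so the lookbehind has no
# fixed width and re refuses to compile the pattern.
try:
    re.compile(r'(?<=<div[^>]*>)\w+(?=</div>)')
    compiled = True
except re.error as exc:
    compiled = False
    print('re.error:', exc)

# A fixed-width lookbehind is accepted, but it misses divs with attributes.
fixed = re.findall(r'(?<=<div>)\w+(?=</div>)', html)
print(fixed)  # ['Bob']
```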

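Groups

Plain groups

Parentheses () denote a group.

First, a word on re.match(): it applies the regular expression starting at the beginning of the string and returns a match object if the start of the string matches, or None otherwise. The match object has group() and groups() methods.

group(): returns the entire match
groups(): returns a tuple of the captured groups

print(re.match(r'(\w+)\.\w+\.(\w+)', 'www.baidu.com').group())
print(re.match(r'(\w+)\.\w+\.(\w+)', 'www.baidu.com').groups())

Here the parentheses capture the prefix of the host name as one group and the suffix as another, so groups() returns the prefix and suffix strings.

Output: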

www.baidu.com
('www', 'com')
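Individual groups can also be read by index, and since re.match() returns None on failure it is worth checking the result before calling group(). A minimal sketch, with `'localhost'` as an assumed non-matching input:

```python
import re

pattern = re.compile(r'(\w+)\.\w+\.(\w+)')

m = pattern.match('www.baidu.com')
print(m.group(0))     # www.baidu.com  (same as m.group())
print(m.group(1))     # www
print(m.group(2))     # com
print(m.group(1, 2))  # ('www', 'com')

# No dots in the input, so match() returns None instead of a match object.
no_match = pattern.match('localhost')
if no_match is None:
    print('no match')
```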

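Named groups

As the name suggests, a named group gives each group a name; the captured text can then be looked up by that name, much like a dictionary.

Syntax: (?P<name>...)

Example:

result = re.match(r'(?P<head>\w+)\.\w+\.(?P<tail>\w+)', 'www.baidu.com')
print(result['head'], result['tail'])

With the prefix and suffix named head and tail, the match object can be indexed by group name (supported since Python 3.6).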
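Besides subscripting, the match object offers group('name') and groupdict(), the latter returning an actual dictionary. A short sketch using the same pattern:

```python
import re

m = re.match(r'(?P<head>\w+)\.\w+\.(?P<tail>\w+)', 'www.baidu.com')

print(m.group('head'))  # www
print(m['tail'])        # com
print(m.groupdict())    # {'head': 'www', 'tail': 'com'}

# Named groups are still numbered, so positional access keeps working.
print(m.groups())       # ('www', 'com')
```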

Improving the exercise

The earlier scraper produced results, but they were messy. With groups at our disposal, we can improve it.

import requests
import re

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36'}
url = 'https://www.douban.com/'
http = requests.get(url, headers=headers)
reg = r'<div[^>]*>([^<]*)</div>'
result_list = re.findall(reg, http.text)
index = 1
for result in result_list:
    if str(result).strip():
        print(str(index) + ". " + str(result).strip())
        index += 1

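Only a few changes were made:

Wrap the content between the tags in reg with parentheses (when the pattern contains a group, findall returns a list of the group contents; with no group it returns the full matches)
Call strip() on each result to remove surrounding whitespace
Skip results that are entirely empty, and number the remaining ones

Output: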

1. 喫屎不忘拉屎人的日記
2. 〔日〕多利安助...
3. 〔日〕伊坂幸太...
4. 〔日〕池井戶潤...
5. 吳沚默
6. 免費
7. 免費
8. 免費
9. 免費
10. 「旅行」我想去精靈旅社度個假
11. 歐美丨在Uptown聽Funk修個椰子皮
12. 日本民謠：我的歌，是用時間...
13. 「復古」音樂和情緒都不會過時
14. 給我一段音樂，推開看得見風...
15. 你聽過《東京愛情故事》嗎？
16. 3月25日 週三 19:30 - 21:30
17. 3人關注
18. 3月18日 週三 19:30 - 21:30
19. 3人關注
20. 4月4日 週六 19:30 - 21:30
21. 1人關注
22. 12月21日 週六 - 4月12日 週日
23. 5人關注

This time the results are clean and concise.
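The same technique can be tried offline. The sketch below runs the pattern against a small made-up HTML snippet (a stand-in for the downloaded page, not douban's real markup). It also shows that with more than one group, findall returns a list of tuples; the second pattern is an illustrative variation, not part of the scraper above.

```python
import re

# Hypothetical sample standing in for http.text.
html = '''<div class="author">Alice</div>
<div class="side"></div>
<div class="follow">
  3 followers
</div>'''

pattern = re.compile(r'<div[^>]*>([^<]*)</div>')

# One group: findall returns only the captured text of each match.
texts = [t.strip() for t in pattern.findall(html) if t.strip()]
for index, text in enumerate(texts, start=1):
    print(f'{index}. {text}')

# Two groups: findall returns one (class, content) tuple per match.
pairs = re.findall(r'<div class="(\w+)">([^<]*)</div>', html)
print(pairs[0])  # ('author', 'Alice')
```

Using enumerate(..., start=1) replaces the manual index counter from the original loop.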
