Learning Python Web Scraping 8: Using the re Library

Original

我有两颗糖

2019-08-28 17:44

In the previous post we studied regular expressions. Python has a library, re, dedicated to regular expression matching.

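1. A brief look at the re library

Importing the re library:
re is part of the Python standard library (nothing extra needs to be installed) and is mainly used for string matching.
Usage: import re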

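Writing regular expressions:
raw string: the raw string type
Notation: r'text'
For example: r'[1-9]\d{5}' (the \ here is not treated as an escape character)

raw string: backslashes are not processed as escapes, so there is no need to work out how many \ to write
regular string: more cumbersome, because each literal \ has to be written as \\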
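
To make the difference concrete, here is a minimal sketch that writes the same pattern both ways. The variable names and the sample string 'HNU 342300' are only illustrative.

```python
import re

# The same pattern written as a raw string and as a regular string.
raw_pattern = r'[1-9]\d{5}'
plain_pattern = '[1-9]\\d{5}'   # every backslash has to be doubled

# Both spellings produce the identical sequence of characters.
same = (raw_pattern == plain_pattern)
print(same)

# Either form can be passed to the re functions.
found = re.search(plain_pattern, 'HNU 342300').group(0)
print(found)
```

The raw form is usually easier to read, which is why the rest of this post uses it.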

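2. Main functions of the re library

Function names and what they do:

Function	Description
re.search()	Searches a string for the first position that matches the regular expression and returns a match object
re.match()	Matches the regular expression starting at the beginning of the string and returns a match object
re.findall()	Searches the string and returns all matching substrings as a list
re.split()	Splits a string at each match of the regular expression and returns a list
re.finditer()	Searches the string and returns an iterator over the matches; each element is a match object
re.sub()	Replaces every substring that matches the regular expression and returns the resulting string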

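Parameters:

Parameter	Description
pattern	The regular expression, as a string or raw string
string	The string to be matched
flags	Control flags used by the regular expression
maxsplit	split: maximum number of splits; the remainder is returned as the last element
repl	sub: the string that replaces each match
count	sub: maximum number of replacements

flags	Description
re.I re.IGNORECASE	Ignores case when matching; [A-Z] also matches lowercase letters
re.M re.MULTILINE	The ^ operator matches at the start of each line of the given string
re.S re.DOTALL	The . operator matches every character; by default it matches every character except a newline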
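
The effect of the three flags can be sketched as follows. The two-line sample text and the variable names are made up for illustration.

```python
import re

text = "first line\nSecond LINE"

# re.I: the lowercase pattern also matches uppercase text.
ignore_case = re.findall(r'line', text, flags=re.I)

# re.M: ^ matches at the start of every line, not only at the start of the string.
line_starts = re.findall(r'^\w+', text, flags=re.M)

# re.S: . also matches the newline, so a match can span both lines.
without_dotall = re.search(r'first.*LINE', text)
with_dotall = re.search(r'first.*LINE', text, flags=re.S)

print(ignore_case)           # ['line', 'LINE']
print(line_starts)           # ['first', 'Second']
print(without_dotall)        # None
print(with_dotall.group(0))  # the whole two-line text
```

Several flags can be combined with |, for example flags=re.I | re.M.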

How to call each function:

1. re.search(pattern, string, flags=0)
Note: the returned match may be None, so check it before calling group

import re

match = re.search(r'[1-9]\d{5}', '342100')
if match:
	print(match.group(0))

2. re.match(pattern, string, flags=0)

match = re.match(r'[1-9]\d{5}', "342300 HNU")
if match:
	print(match.group(0))
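
The difference between re.match and re.search shows up when the digits are not at the start of the string. A small sketch, using an illustrative sample string:

```python
import re

text = "HNU 342300"

# re.match only tries the start of the string, where "H" is not a digit.
from_start = re.match(r'[1-9]\d{5}', text)

# re.search scans forward and finds the first match anywhere in the string.
anywhere = re.search(r'[1-9]\d{5}', text)

print(from_start)         # None
print(anywhere.group(0))  # 342300
```

This is why the result of re.match in particular should be checked before calling group.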

3. re.findall(pattern, string, flags=0)

mylist = re.findall(r'[1-9]\d{5}', "HNU496300 342300")
print(mylist)

4. re.split(pattern, string, maxsplit=0, flags=0)
Compare the result with and without the maxsplit parameter:

mylist = re.split(r'[1-9]\d{5}', 'BIT100081 TSU100084')
print(mylist)
mylist = re.split(r'[1-9]\d{5}', 'BIT100081 TSU100084', maxsplit=1)
print(mylist)
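
For reference, here is the same comparison with the two results kept in separate variables (the names are illustrative) and the output shown as comments.

```python
import re

text = 'BIT100081 TSU100084'
pattern = r'[1-9]\d{5}'

# No limit: the string is cut at every match.
all_parts = re.split(pattern, text)

# maxsplit=1: only the first match is used as a separator.
one_split = re.split(pattern, text, maxsplit=1)

print(all_parts)  # ['BIT', ' TSU', '']
print(one_split)  # ['BIT', ' TSU100084']
```

The trailing empty string in the first result appears because the text ends with a match, leaving nothing after the last separator.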

5. re.finditer(pattern, string, flags=0)

for m in re.finditer(r'[1-9]\d{5}', 'BIT100081 TSU100084'):
	if m:
		print(m.group(0))
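
Because each element is a match object, finditer also gives access to where each match was found, which findall does not. A minimal sketch (the list name is illustrative):

```python
import re

# Collect each matched substring together with its (start, end) position.
positions = [
    (m.group(0), m.span())
    for m in re.finditer(r'[1-9]\d{5}', 'BIT100081 TSU100084')
]

print(positions)  # [('100081', (3, 9)), ('100084', (13, 19))]
```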

6. re.sub(pattern, repl, string, count=0, flags=0)

s = re.sub(r'[1-9]\d{5}', ":zipcode", "BIT100081 TSU100084")
print(s)
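
The count parameter limits how many matches are replaced. A short sketch comparing the default with count=1 (variable names are illustrative):

```python
import re

text = "BIT100081 TSU100084"

# Default count=0 replaces every match.
replace_all = re.sub(r'[1-9]\d{5}', ':zipcode', text)

# count=1 replaces only the first match.
replace_first = re.sub(r'[1-9]\d{5}', ':zipcode', text, count=1)

print(replace_all)    # BIT:zipcode TSU:zipcode
print(replace_first)  # BIT:zipcode TSU100084
```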

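3. Using the re library in an object-oriented way

There are two ways to use the re library:
1. Functional style:
rst = re.search(r'[1-9]\d{5}', 'BIT 100084')
2. Object-oriented style:
pat = re.compile(r'[1-9]\d{5}')
rst = pat.search('BIT 100084')
3. What the functional style really does:
If we open the re library source and look at how the functions are defined, we find that match, for example, returns _compile(pattern, flags).match(string). In other words, the functional style indirectly uses the object-oriented style.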

def match(pattern, string, flags=0):
    """Try to apply the pattern at the start of the string, returning
    a Match object, or None if no match was found."""
    return _compile(pattern, flags).match(string)

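The compile function:
Syntax: regex = re.compile(pattern, flags=0)
Purpose: compiles the string form of a regular expression into a regular expression object
Any functional-style call can be replaced by the object-oriented form, for example: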

match = re.match(r'[1-9]\d{5}', "342300 HNU")
if match:
	print(match.group(0))

regex = re.compile(r'[1-9]\d{5}')
match = regex.match("342300 HNU")
if match:
	print(match.group(0))
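
A compiled pattern is convenient when the same expression is applied to many strings. The following is only a sketch; the sample strings and names are invented for illustration.

```python
import re

# Compile once, then reuse the pattern object for every input string.
zipcode = re.compile(r'[1-9]\d{5}')

samples = ["342300 HNU", "HNU 342300", "no code here"]
results = []
for s in samples:
    m = zipcode.search(s)
    # Store the matched text, or None when the string has no match.
    results.append(m.group(0) if m else None)

print(results)  # ['342300', '342300', None]
```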

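4. The match object

Attributes of the match object:

Attribute	Description
.string	The string that was searched
.re	The pattern object (regular expression) used for the match
.pos	Position in the text where the search starts
.endpos	Position in the text where the search ends

Methods of the match object:

Method	Description
.group(0)	Returns the matched string
.start()	Returns the start position of the match in the original string
.end()	Returns the end position of the match in the original string (the index just past the last matched character)
.span()	Returns (.start(), .end())

Using the match object: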

import re
m = re.search(r'[1-9]\d{5}', "BIT100084 TSU100081")
print(m.string)
print(m.re)
print(m.pos)
print(m.endpos)
print(m.group(0))
print(m.span())

Output:

BIT100084 TSU100081
re.compile('[1-9]\\d{5}')
0
19
100084
(3, 9)
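
The .start() and .end() methods are not used in the example above. As a small sketch, they can be used to slice the original string, which gives the same text as .group(0):

```python
import re

text = "BIT100084 TSU100081"
m = re.search(r'[1-9]\d{5}', text)

# start() is the index of the first matched character,
# end() is the index just past the last one.
start, end = m.start(), m.end()
sliced = text[start:end]

print(start, end)             # 3 9
print(sliced)                 # 100084
print(sliced == m.group(0))   # True
```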

That wraps up the eighth Python web scraping study note. Cheers 🍻🍻
