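Compiled vs. non-compiled regular expressions
With a precompiled pattern, the regular expression does not have to be parsed again on every call, so matching is faster.
Loosely speaking, this is the same reason compiled programs tend to run faster than interpreted ones.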
- test_re_nocompile.py
#!/usr/bin/python
# _*_ coding: UTF-8 _*_
from __future__ import print_function
import re
import time

def main():
    time_start = time.time()
    # The pattern is passed as a plain string on every call
    pattern = "[0-9]+"
    with open('test/1.txt') as f:
        for line in f:
            print(re.findall(pattern, line))
    time_end = time.time()
    # Elapsed seconds, centered in a 10-character field padded with "_"
    print('{:_^+10.4f}'.format(time_end - time_start))
    #print(": {:_^+10.4f}".format(3.1415926))

if __name__ == '__main__':
    main()
- test_re_compile.py
#!/usr/bin/python
# _*_ coding: UTF-8 _*_
from __future__ import print_function
import re
import time

def main():
    time_start = time.time()
    pattern = "[0-9]+"
    # Compile once, then reuse the pattern object for every line
    re_obj = re.compile(pattern)
    with open('test/1.txt') as f:
        for line in f:
            print(re_obj.findall(line))
    time_end = time.time()
    print('{:_^10.4f}'.format(time_end - time_start))

if __name__ == '__main__':
    main()
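The two scripts above also time the file reads and the print calls, which can easily outweigh the regex work. A minimal sketch that isolates the matching with timeit is shown below; the sample lines are hypothetical stand-ins for test/1.txt, and the function names are made up for illustration. Keep in mind that the re module caches recently used patterns, so the measured gap may turn out to be small.

```python
import re
import timeit

# Hypothetical sample data standing in for the lines of test/1.txt
LINES = ["order 66 shipped 12 items in 2018", "no digits here", "a1b22c333"] * 1000
PATTERN = r"[0-9]+"
COMPILED = re.compile(PATTERN)


def scan_module_level():
    """Call re.findall with the pattern string on every line."""
    return [re.findall(PATTERN, line) for line in LINES]


def scan_precompiled():
    """Call findall on a pattern object that was compiled once up front."""
    return [COMPILED.findall(line) for line in LINES]


if __name__ == "__main__":
    for func in (scan_module_level, scan_precompiled):
        seconds = timeit.timeit(func, number=20)
        print("{:<20} {:.4f}s".format(func.__name__, seconds))
```

Both functions return the same matches; only the time per call differs, and the numbers depend on the machine and Python version.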
Commonly used re methods
Matching functions
Without further ado, straight to the code (IPython):
In [1]: import re
In [2]: data = "What is the difference between python 2.7.13 and 3.6.0"
In [3]: re.findall(r"[0-9]\.[0-9]\.[0-9]",data)
Out[3]: ['2.7.1', '3.6.0']
In [4]: re.findall(r"\d\.\d\.\d",data)
Out[4]: ['2.7.1', '3.6.0']
In [5]: re.findall(r"Python [0-9]\.[0-9]\.[0-9]", data, flags=re.IGNORECASE)
Out[5]: ['python 2.7.1']
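Here findall returns every match found in data. Each [0-9] (or \d) matches exactly one digit, which is why 2.7.13 shows up as 2.7.1.
Passing flags=re.IGNORECASE makes the match case-insensitive.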
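Since the patterns above only allow one digit per component, here is a small sketch of two variations on the same data: using \d+ to capture full version numbers, and using capture groups, in which case findall returns one tuple per match. The variable names are only for illustration.

```python
import re

data = "What is the difference between python 2.7.13 and 3.6.0"

# \d+ lets each component have more than one digit
versions = re.findall(r"\d+\.\d+\.\d+", data)
print(versions)  # ['2.7.13', '3.6.0']

# With capture groups, findall returns one tuple of groups per match
parts = re.findall(r"(\d+)\.(\d+)\.(\d+)", data)
print(parts)  # [('2', '7', '13'), ('3', '6', '0')]
```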
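match: matching at the start of a string
You probably know str.startswith. match can be seen as an enhanced version of it that adds regular expressions, which makes the matching more flexible.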
In [9]: s = "12345上山打老虎"
In [10]: re.match(r'\d+',s)
Out[10]: <_sre.SRE_Match object; span=(0, 5), match='12345'>
# startswith is of little use in a case like this. The match object returned by re.match also offers some useful methods and attributes:
In [5]: import re
In [6]: s = "12345上山打老虎"
In [7]: r = re.match(r'\d+',s)
In [8]: r.start()
Out[8]: 0
In [9]: r.end()
Out[9]: 5
In [10]: r.string
Out[10]: '12345上山打老虎'
In [11]: r.group()
Out[11]: '12345'
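One thing the session above does not show is that re.match returns None when the pattern does not match at the start, so calling .group() directly would raise an AttributeError. A minimal sketch of a guarded helper follows; leading_digits is a hypothetical name chosen for this example.

```python
import re


def leading_digits(text):
    """Return the run of digits at the start of text, or None if there is none."""
    m = re.match(r"\d+", text)
    return m.group() if m else None


print(leading_digits("12345上山打老虎"))  # 12345
print(leading_digits("上山打老虎12345"))  # None
```

str.startswith only accepts fixed prefixes, so it cannot express "any number of leading digits" the way the pattern does.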
- search deserves a mention too: unlike match, it scans the whole string and returns the first match it finds
In [12]: re.search("山",s)
Out[12]: <_sre.SRE_Match object; span=(6, 7), match='山'>
In [13]: r = re.search("山",s)
In [14]: r.start()
Out[14]: 6
In [15]: r.string
Out[15]: '12345上山打老虎'
In [16]: r.group()
Out[16]: '山'
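To make the difference between match and search concrete, the sketch below runs both on the same string from the session above. The variable names are only for illustration.

```python
import re

s = "12345上山打老虎"

# match only tries position 0, where the string has "1"
at_start = re.match("山", s)
print(at_start)  # None

# search scans forward until it finds the first occurrence
anywhere = re.search("山", s)
print(anywhere.span())  # (6, 7)

# Anchoring search with ^ restricts it to the start, much like match
anchored = re.search("^山", s)
print(anchored)  # None
```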
- There is also finditer, which returns an iterator of match objects
In [17]: data = "What is the difference between python 2.7.13 and Python 3.6.0"
In [18]: r = re.finditer(r"\d+\.\d+\.\d+", data)
In [19]: for it in r:
    ...:     print(it.group(0))
    ...:
2.7.13
3.6.0
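Because finditer yields match objects rather than plain strings, each result also carries its position. The sketch below collects the matched text together with its span; the list name found is just for illustration.

```python
import re

data = "What is the difference between python 2.7.13 and Python 3.6.0"

# Each item from finditer is a match object, so the span is available too
found = [(m.group(), m.span()) for m in re.finditer(r"\d+\.\d+\.\d+", data)]

for text, (start, end) in found:
    # Slicing the original string with the span gives back the matched text
    print(text, start, end, data[start:end] == text)
```

This is handy when the location matters, for example to highlight matches, while findall is enough when only the strings are needed.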
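"Modifying" functions
If you know C++'s std::string, you will know its replace() member function; in Python the counterpart is str.replace. For more customizable, pattern-based substitution there is re.sub.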
In [28]: data = "What is the difference between python 2.7.13 and Python 3.6.0"
In [29]: data.replace("Python","*****")
Out[29]: 'What is the difference between python 2.7.13 and ***** 3.6.0'
In [35]: re.sub(r"\d\.\d\.\d+","*.*.*",data,flags=re.IGNORECASE)
Out[35]: 'What is the difference between python *.*.* and Python *.*.*'
In [36]: re.sub(r"python \d\.\d\.\d+","*.*.*",data,flags=re.IGNORECASE)
Out[36]: 'What is the difference between *.*.* and *.*.*'
- I wanted to pass flags=re.IGNORECASE to replace as well, but the flag belongs to the re module, so str.replace cannot use it
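One possible workaround for the limitation just mentioned is to let re.sub do a literal replacement. The sketch below is an assumption about how one might wrap it, not something from the original session; replace_ignore_case is a hypothetical helper name.

```python
import re


def replace_ignore_case(text, old, new):
    """Replace every occurrence of the literal string old, ignoring case."""
    # re.escape keeps characters such as "." or "+" in old from being read as regex syntax.
    # A function is used as the replacement so backslashes in new are inserted literally.
    return re.sub(re.escape(old), lambda m: new, text, flags=re.IGNORECASE)


data = "What is the difference between python 2.7.13 and Python 3.6.0"
print(replace_ignore_case(data, "Python", "*****"))
# What is the difference between ***** 2.7.13 and ***** 3.6.0
```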
In [40]: text = "Today is 25/4/2018. PyCon starts 5/25/2017"
In [41]: re.sub(r"(\d+)/(\d+)/(\d+)",r'\3-\1-\2', text)
Out[41]: 'Today is 2018-25-4. PyCon starts 2017-5-25'
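The numbered references \3-\1-\2 get hard to read as patterns grow. A sketch of the same reordering with named groups is shown below; it assumes the input is in month/day/year order, and the group names are my own choice.

```python
import re

text = "PyCon starts 5/25/2017"

# Assumption: the date is written as month/day/year
date_re = re.compile(r"(?P<month>\d+)/(?P<day>\d+)/(?P<year>\d+)")

# \g<name> refers to a named group in the replacement string
result = date_re.sub(r"\g<year>-\g<month>-\g<day>", text)
print(result)  # PyCon starts 2017-5-25
```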
- Remember split? Right, it has a regex version too: re.split
In [90]: text = "MySQL slave binlog position: master host '10.173.33.35',filename 'mysql-bin.000002', positon '524993060'"
In [91]: re.split(r"[':,\s]+",text.strip("'"))
Out[91]:
['MySQL',
'slave',
'binlog',
'position',
'master',
'host',
'10.173.33.35',
'filename',
'mysql-bin.000002',
'positon',
'524993060']
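The strip("'") in the call above matters: when the string begins or ends with a delimiter, re.split produces an empty string at that end. The sketch below shows this on a shortened sample of the same text and filters the empty fields out afterwards, which is one alternative to stripping first.

```python
import re

# Shortened sample of the text above; note the trailing quote
text = "master host '10.173.33.35',filename 'mysql-bin.000002'"

raw = re.split(r"[':,\s]+", text)
print(raw)
# ['master', 'host', '10.173.33.35', 'filename', 'mysql-bin.000002', '']

# Drop empty strings left by delimiters at the edges
fields = [field for field in raw if field]
print(fields)
# ['master', 'host', '10.173.33.35', 'filename', 'mysql-bin.000002']
```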
Tip (IPython only): prefixing a shell command with ! lets you capture its output lines in a variable
In [82]: temp = !ls -al | grep test.py
In [83]: temp
Out[83]: ['-rw-r--r-- 1 root root 14 Apr 25 16:53 test.py']
Case-insensitive matching
I won't go over this again: just pass flags=re.IGNORECASE.
Greedy vs. non-greedy matching
In [98]: text = "Beautiful is better than ugly.Explicit is better than implicit."
In [99]: re.findall(r"Beautiful.*\.",text)
Out[99]: ['Beautiful is better than ugly.Explicit is better than implicit.']
In [100]: re.findall(r"Beautiful.*?\.",text)
Out[100]: ['Beautiful is better than ugly.']
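The same greedy versus non-greedy behaviour shows up whenever a pattern has an opening and a closing delimiter. A small sketch on a made-up HTML fragment follows; the fragment and variable names are only for illustration.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .* runs to the last ">" in the string
greedy = re.findall(r"<.*>", html)
print(greedy)  # ['<b>bold</b> and <i>italic</i>']

# Non-greedy: .*? stops at the first ">" that lets the match succeed
lazy = re.findall(r"<.*?>", html)
print(lazy)  # ['<b>', '</b>', '<i>', '</i>']

# A negated character class gives the same result here without relying on laziness
negated = re.findall(r"<[^>]*>", html)
print(negated == lazy)  # True
```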
Bonus: a crawler demo
In [120]: import requests
In [121]: import re
In [122]: r = requests.get('https://news.ycombinator.com')
In [124]: re.findall('"https?://.*?"', str(r.content))
Out[124]:
['"https://news.ycombinator.com"',
'"http://conference.startupschool.org/"',
'"http://norvig.com/spell-correct.html"',
... ...
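To try the link extraction without a network connection, the sketch below applies a similar pattern to a hard-coded string. The HTML fragment is hypothetical, and extract_links is a helper name invented for this example; it also strips the surrounding quotes with a capture group and drops duplicates.

```python
import re

# Hypothetical HTML fragment standing in for the downloaded page
html = (
    '<a href="https://example.com/a">A</a> '
    '<a href="http://example.org/b">B</a> '
    '<a href="https://example.com/a">again</a>'
)

# The capture group returns the URL without the surrounding double quotes
LINK_RE = re.compile(r'"(https?://[^"]*)"')


def extract_links(page):
    """Return the unique http(s) URLs found in double quotes, in first-seen order."""
    seen = []
    for url in LINK_RE.findall(page):
        if url not in seen:
            seen.append(url)
    return seen


print(extract_links(html))
# ['https://example.com/a', 'http://example.org/b']
```

For real pages an HTML parser is usually more robust than a regex, but for a quick look at the links this is often enough.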