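Coroutines:
Coroutines are one way to implement concurrent programming. When concurrency comes up, you probably think first of the multithreading / multiprocessing model, and rightly so: multithreading / multiprocessing is the familiar way to get it. As a baseline, here is a plain synchronous crawler that fetches its pages one after another.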
node2:/root/python/20200524#cat t1.py
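import time

def crawl_page(url):
    print('crawling {}'.format(url))
    # The trailing number in the URL is the simulated fetch time in seconds
    sleep_time = int(url.split('_')[-1])
    time.sleep(sleep_time)  # blocks the whole thread while "fetching"
    print('OK {}'.format(url))

def main(urls):
    for url in urls:
        # Captured once per URL, so both prints below show the start time
        nowTime = time.strftime("%Y-%m-%d %H:%M:%S")
        print(nowTime)
        crawl_page(url)
        print(nowTime)

main(['url_1', 'url_2', 'url_3', 'url_4'])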
node2:/root/python/20200524#time python t1.py
2020-04-16 22:50:45
crawling url_1
OK url_1
2020-04-16 22:50:45
2020-04-16 22:50:46
crawling url_2
OK url_2
2020-04-16 22:50:46
2020-04-16 22:50:48
crawling url_3
OK url_3
2020-04-16 22:50:48
2020-04-16 22:50:51
crawling url_4
OK url_4
2020-04-16 22:50:51
real 0m10.043s
user 0m0.022s
sys 0m0.009s
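The four pages take 1 + 2 + 3 + 4 = 10 seconds in total because each one waits for the previous one to finish. So a simple idea emerges: this kind of crawling can be made fully concurrent. Let's see how to write it with coroutines.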
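import asyncio

async def crawl_page(url):
    print('crawling {}'.format(url))
    sleep_time = int(url.split('_')[-1])
    # Non-blocking sleep: yields control to the event loop while waiting
    await asyncio.sleep(sleep_time)
    print('OK {}'.format(url))

async def main(urls):
    for url in urls:
        # Waits for this page to finish before starting the next one
        await crawl_page(url)

# %time is an IPython/Jupyter magic; in a plain script, call asyncio.run(...) directly
%time asyncio.run(main(['url_1', 'url_2', 'url_3', 'url_4']))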
########## Output ##########
crawling url_1
OK url_1
crawling url_2
OK url_2
crawling url_3
OK url_3
crawling url_4
OK url_4
Wall time: 10 s
node2:/root/python/20200524#time python3 t2.py
2020-04-16 23:42:52
crawling url_1
OK url_1
2020-04-16 23:42:53
2020-04-16 23:42:53
crawling url_2
OK url_2
2020-04-16 23:42:55
2020-04-16 23:42:55
crawling url_3
OK url_3
2020-04-16 23:42:58
2020-04-16 23:42:58
crawling url_4
OK url_4
2020-04-16 23:43:02
real 0m10.095s
user 0m0.070s
sys 0m0.014s
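10 seconds is exactly what we should expect. Recall that await behaves like a synchronous call from the caller's point of view: it suspends the calling coroutine until the awaited one completes. So crawl_page(url) does not trigger the next call until the current call has finished. The result is exactly the same as the version above; in effect, we used an asynchronous interface to write synchronous code.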
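To see this serialization as a number, here is a minimal sketch that times the same loop. The `scale` parameter and the `main_sequential` name are assumptions added for illustration (they are not in the original scripts); `scale` shrinks each delay to a tenth so the run takes about one second instead of ten.

```python
import asyncio
import time


async def crawl_page(url, scale=0.1):
    # Same parsing as in the text: 'url_3' -> 3. `scale` is a hypothetical
    # knob that shortens the simulated fetch so the sketch runs quickly.
    sleep_time = int(url.split('_')[-1]) * scale
    await asyncio.sleep(sleep_time)
    return url


async def main_sequential(urls):
    start = time.perf_counter()
    done = []
    for url in urls:
        # Each await suspends main_sequential until this crawl finishes,
        # so the next crawl_page coroutine is not even created until then.
        done.append(await crawl_page(url))
    return done, time.perf_counter() - start


done, elapsed = asyncio.run(main_sequential(['url_1', 'url_2', 'url_3', 'url_4']))
print(done, 'took about {:.1f} s'.format(elapsed))
```

The elapsed time comes out close to the sum of the scaled delays (0.1 + 0.2 + 0.3 + 0.4), which mirrors the 10-second result above.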
node2:/root/python/20200524#cat t3.py
import asyncio

async def crawl_page(url):
    print('crawling {}'.format(url))
    sleep_time = int(url.split('_')[-1])
    await asyncio.sleep(sleep_time)
    print('OK {}'.format(url))

async def main(urls):
    # create_task schedules every coroutine on the event loop right away
    tasks = [asyncio.create_task(crawl_page(url)) for url in urls]
    for task in tasks:
        # The tasks are already running; this only waits for each to finish
        await task

asyncio.run(main(['url_1', 'url_2', 'url_3', 'url_4']))
node2:/root/python/20200524#time python3 t3
python3: can't open file 't3': [Errno 2] No such file or directory
real 0m0.027s
user 0m0.020s
sys 0m0.007s
node2:/root/python/20200524#time python3 t3.py
crawling url_1
crawling url_2
crawling url_3
crawling url_4
OK url_1
OK url_2
OK url_3
OK url_4
real 0m4.090s
user 0m0.034s
sys 0m0.052s
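The run now takes about 4 seconds, the longest single delay, rather than the 10-second sum. The same fan-out can also be sketched with asyncio.gather instead of an explicit task list. This is an alternative spelling, not the original script: the `scale` parameter and the `main_concurrent` name are assumptions for illustration, and the delays are scaled to a tenth so the sketch finishes in under a second.

```python
import asyncio
import time


async def crawl_page(url, scale=0.1):
    # 'url_3' -> 3, shortened by the hypothetical `scale` factor
    sleep_time = int(url.split('_')[-1]) * scale
    await asyncio.sleep(sleep_time)
    return url


async def main_concurrent(urls):
    start = time.perf_counter()
    # gather wraps each coroutine in a task, runs them concurrently,
    # and returns the results in the same order as its arguments.
    results = await asyncio.gather(*(crawl_page(url) for url in urls))
    return results, time.perf_counter() - start


results, elapsed = asyncio.run(main_concurrent(['url_1', 'url_2', 'url_3', 'url_4']))
print(results, 'took about {:.1f} s'.format(elapsed))
```

Here the elapsed time tracks the largest scaled delay (0.4 s) instead of the sum (1.0 s), the same ratio as the 4-second versus 10-second runs above.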