Python third-party library gevent: a coroutine-based web crawler example

Original

大蛇王

2019-06-20 19:59

Runtime environment: Python 3.6

Third-party library installation: pip install gevent requests

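A web crawler is an I/O-bound application: it spends most of its time waiting for responses, so CPU utilization stays low. To avoid wasting that waiting time, we write the crawler asynchronously.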

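gevent is a third-party Python library that implements coroutines on top of greenlet. The basic idea is:

When a greenlet hits an I/O operation, such as a network request, gevent automatically switches to another greenlet, and switches back at a suitable moment once the I/O has completed. I/O is slow and often leaves a program idle; because gevent switches coroutines for us, some greenlet is almost always running instead of waiting on I/O. Automatic switching only applies to I/O that gevent knows about, such as its own socket implementation or a standard library patched with gevent.monkey.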
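The switching behaviour described above can be observed without any network access. The following is a minimal sketch, not part of the original post: the `worker` function, the names "a" and "b", and the delays are hypothetical, and `gevent.sleep` stands in for a real I/O wait.

```python
import gevent

order = []  # records the sequence in which the greenlets make progress


def worker(name, delay):
    order.append(name + " start")
    # gevent.sleep yields control to other greenlets, simulating an I/O wait
    gevent.sleep(delay)
    order.append(name + " done")


def run_demo():
    order.clear()
    jobs = [
        gevent.spawn(worker, "a", 0.2),  # longer simulated wait
        gevent.spawn(worker, "b", 0.1),  # shorter simulated wait
    ]
    gevent.joinall(jobs)
    return list(order)


if __name__ == "__main__":
    print(run_demo())
```

Greenlet "b" starts while "a" is still waiting and finishes first, which is the same interleaving a crawler benefits from when several requests are in flight.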

Below is a simple example that compares ordinary sequential requests with coroutine-based requests:

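# Patch the standard library before importing requests; otherwise its
# blocking socket calls never yield and the greenlets run one after another.
from gevent import monkey
monkey.patch_all()

import gevent
import requests
import time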

# Build the list of URLs to crawl
url_list = ["https://www.csdn.net", "https://www.jianshu.com", "https://www.baidu.com", "https://blog.csdn.net/t8116189520",
            "https://www.baidu.com", "https://www.baidu.com", "https://ai.youdao.com", "https://www.163.com/",
            "https://fanyi.baidu.com", "https://www.tmall.com/", "https://www.taobao.com/" ]

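# Helper: fetch a URL and return the page source, or None on a non-200 response
def get_HTML(url):
    # The timeout keeps one unresponsive server from stalling the whole run
    resp = requests.get(url, timeout=10)
    # Check whether the status code is 200
    if resp.status_code == 200:
        return resp.text
    else:
        print(url, resp.status_code)
        return None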

# Request every URL in the list one after another and print the elapsed time
def common_method(url_list):
    time_begin = time.time()
    for i in url_list:
        get_HTML(i)
    print(time.time() - time_begin)
    
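# Request every URL in the list with coroutines and print the elapsed time
def gevent_method(url_list):
    time_begin = time.time()
    g_list = []
    for i in url_list:
        # Spawn one greenlet per URL
        s = gevent.spawn(get_HTML, i)
        g_list.append(s)
    # Wait until every greenlet has finished
    gevent.joinall(g_list)
    print(time.time() - time_begin)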
    
print("begin common")
common_method(url_list)

print("begin gevent")
gevent_method(url_list)

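Each method prints its elapsed time in seconds when the script runs.

P.S. Because the list of URLs to crawl is short, the efficiency gain is not very pronounced.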
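Real network timings vary from run to run, so the difference between the two approaches is easier to see with simulated latency. The following is a minimal sketch under stated assumptions: `fake_fetch`, the fixed 0.1 second latency, and the example.com URLs are hypothetical stand-ins for `requests.get` and the real URL list.

```python
import time

import gevent


def fake_fetch(url, latency=0.1):
    # Hypothetical stand-in for requests.get: wait, then report a status code
    gevent.sleep(latency)
    return url, 200


def run_sequential(urls):
    start = time.time()
    results = [fake_fetch(u) for u in urls]
    return results, time.time() - start


def run_concurrent(urls):
    start = time.time()
    jobs = [gevent.spawn(fake_fetch, u) for u in urls]
    gevent.joinall(jobs)
    # Each finished greenlet exposes its return value as .value
    results = [job.value for job in jobs]
    return results, time.time() - start


if __name__ == "__main__":
    urls = ["https://example.com/page/%d" % i for i in range(5)]
    _, sequential_time = run_sequential(urls)
    _, concurrent_time = run_concurrent(urls)
    print("sequential:", sequential_time)
    print("concurrent:", concurrent_time)
```

The sequential run takes roughly the sum of all the waits, while the concurrent run takes roughly as long as the single longest wait, so the gap widens as the URL list grows.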
