[Python Web Crawler Notes D:03] Multithreading and Multiprocessing | Speeding Up Your Crawler

These notes were compiled while studying 唐松's book 《Python 網絡爬蟲:從入門到實踐》, and the code in this post is taken from the author's own source code. This post is for reference only! A quick plug: the book really is a good crawler tutorial! ღ( ´・ᴗ・` )

Three main ways to speed up a crawler:

  1. Multithreaded crawler (concurrent, asynchronous)
    Web crawling is IO-bound. A single thread wastes time waiting on IO, so multithreading improves efficiency: while thread A is blocked waiting on IO, execution switches to thread B, keeping the CPU busy instead of idle.
  2. Multiprocess crawler (parallel, asynchronous)
    Because of the GIL (Global Interpreter Lock) built into Python's design, a multithreaded crawler cannot fully use a multi-core CPU; a multiprocess crawler can run on multiple cores and make full use of the CPU.
  3. Coroutine-based crawler (concurrent, asynchronous) (not covered in these notes)

Concepts: concurrency vs. parallelism, synchronous vs. asynchronous

  • Concurrency and parallelism
    Concurrency and parallelism are two related but distinct concepts.
    Concurrency: several tasks are in progress within the same time period (a single CPU core working on multiple tasks by switching between them).
    Parallelism: several tasks are executed at the same instant (multiple CPU cores each running their own task).
  • Synchronous and asynchronous
    Synchronous: the concurrent or parallel tasks do not run independently; they alternate in a fixed order, and one task only starts after another has finished and produced its result (think of it as sequential execution).
    Asynchronous: the concurrent or parallel tasks run independently, and one task's execution is not affected by another's (think of it as branching execution).

Multithreaded crawler in practice

The threading module provides the Thread class for working with threads, including the following methods:
run(): the method that represents the thread's activity
start(): starts the thread's activity
join([timeout]): waits until the thread terminates; it blocks the calling thread until the thread whose join() method is called terminates, or until the optional timeout expires
isAlive(): returns whether the thread is alive (spelled is_alive() in Python 3; the camelCase alias was removed in Python 3.9)
getName(): returns the thread's name
setName(): sets the thread's name (in Python 3 the name attribute is preferred over getName()/setName())
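
As a minimal sketch of this API (the worker function, its sleep, and the thread name below are made up for illustration, not taken from the book), here is how a Thread is created, started, and joined:

import threading
import time


def worker(task_id):
    # Placeholder IO-bound task: just sleep for a second
    time.sleep(1)
    print(threading.current_thread().name, 'finished task', task_id)


t = threading.Thread(target=worker, args=(1,), name='Demo-Thread')
t.start()                                # run worker() in the new thread
print(t.name, 'alive?', t.is_alive())    # is_alive() is the Python 3 spelling
t.join()                                 # block until the thread terminates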

Simple multithreading (the threading module)

import threading
import requests
import time

# Read the URL list from alexa.txt (one tab-separated "rank<TAB>url" pair per line)
link_list = []
with open('alexa.txt', 'r') as file:
    file_list = file.readlines()
    for eachone in file_list:
        link = eachone.split('\t')[1]
        link = link.replace('\n', '')
        link_list.append(link)

start = time.time()


class myThread(threading.Thread):
    def __init__(self, name, link_range):
        threading.Thread.__init__(self)
        self.name = name
        self.link_range = link_range

    def run(self):
        print("Starting " + self.name)
        crawler(self.name, self.link_range)
        print("Exiting " + self.name)


def crawler(threadName, link_range):
    # Fetch every URL in this thread's half-open range [start, end)
    for i in range(link_range[0], link_range[1]):
        try:
            r = requests.get(link_list[i], timeout=20)
            print(threadName, r.status_code, link_list[i])
        except Exception as e:
            print(threadName, 'Error: ', e)


thread_list = []
# Half-open ranges so the 1000 URLs are split evenly and no index runs past the list
link_range_list = [(0, 200), (200, 400), (400, 600), (600, 800), (800, 1000)]

# Create and start the threads
for i in range(1, 6):
    thread = myThread("Thread-" + str(i), link_range_list[i - 1])
    thread.start()
    thread_list.append(thread)

# Wait for all threads to finish
for thread in thread_list:
    thread.join()

end = time.time()
print('Total time for the simple multithreaded crawler:', end - start)
print("Exiting Main Thread")
# Total time for the simple multithreaded crawler: 457.86393117904663

Queue + threading

With fixed ranges, a thread that happens to get a slow batch of URLs keeps working after the others have finished. Putting all URLs into a shared Queue instead lets every thread pull the next URL as soon as it is free, so the work is balanced dynamically.

import threading
import requests
import time
import queue as Queue

link_list = []
with open('alexa.txt', 'r') as file:
    file_list = file.readlines()
    for eachone in file_list:
        link = eachone.split('\t')[1]
        link = link.replace('\n', '')
        link_list.append(link)

start = time.time()


class myThread(threading.Thread):
    def __init__(self, name, q):
        threading.Thread.__init__(self)
        self.name = name
        self.q = q

    def run(self):
        print("Starting " + self.name)
        while True:
            try:
                crawler(self.name, self.q)
            except Queue.Empty:
                # q.get() timed out: the queue has been drained, so this thread is done
                break
        print("Exiting " + self.name)


def crawler(threadName, q):
    url = q.get(timeout=2)
    try:
        r = requests.get(url, timeout=20)
        print(q.qsize(), threadName, r.status_code, url)
    except Exception as e:
        print(q.qsize(), threadName, url, 'Error: ', e)


threadList = ["Thread-1", "Thread-2", "Thread-3", "Thread-4", "Thread-5"]
workQueue = Queue.Queue(1000)
threads = []

# Create and start the threads (they wait on the queue until it is filled below)
for tName in threadList:
    thread = myThread(tName, workQueue)
    thread.start()
    threads.append(thread)

# Fill the queue
for url in link_list:
    workQueue.put(url)

# Wait for all threads to finish
for t in threads:
    t.join()

end = time.time()
print('Total time for the Queue-based multithreaded crawler:', end - start)
print("Exiting Main Thread")

Multiprocess crawler in practice

To use multiple processes, we need the multiprocessing library.
There are two ways to use multiprocessing: Process + Queue and Pool + Queue.
Note that in multiprocessing the parent process and its child processes cannot share global variables! A short sketch of this point is shown below.
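
A minimal sketch of this point (the counter variable and the increment function below are made up for illustration, not from the book): each child process works on its own copy of the module's globals, so a change made in the child never shows up in the parent.

from multiprocessing import Process

counter = 0  # global variable owned by the parent process


def increment():
    global counter
    counter += 100  # only the child's own copy is modified
    print('in child process, counter =', counter)   # prints 100


if __name__ == '__main__':
    p = Process(target=increment)
    p.start()
    p.join()
    # The parent's global is untouched by the child's change
    print('in parent process, counter =', counter)  # prints 0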

Process+Queue

This works much like the threading version, but note that the Queue used to coordinate work in the threading example came from the standard queue module, whereas multiprocessing ships with its own Queue. In addition, each process can have its attributes set individually; if daemon is set to True, the child process is terminated automatically when the parent process exits.

from multiprocessing import Process, Queue
import time
import requests

link_list = []
with open('alexa.txt', 'r') as file:
    file_list = file.readlines()
    for eachone in file_list:
        link = eachone.split('\t')[1]
        link = link.replace('\n', '')
        link_list.append(link)

start = time.time()


class MyProcess(Process):
    def __init__(self, q):
        Process.__init__(self)
        self.q = q

    def run(self):
        print("Starting ", self.pid)
        while not self.q.empty():
            crawler(self.q)
        print("Exiting ", self.pid)


def crawler(q):
    url = q.get(timeout=2)
    try:
        r = requests.get(url, timeout=20)
        print(q.qsize(), r.status_code, url)
    except Exception as e:
        print(q.qsize(), url, 'Error: ', e)


if __name__ == '__main__':
    workQueue = Queue(1000)

    # Fill the queue
    for url in link_list:
        workQueue.put(url)

    # Start all three worker processes first, then join them, so they actually
    # run in parallel (joining inside the start loop would run them one by one)
    processes = []
    for i in range(0, 3):
        p = MyProcess(workQueue)
        p.daemon = True  # child is terminated automatically if the parent exits
        p.start()
        processes.append(p)
    for p in processes:
        p.join()

    end = time.time()
    print('Total time for the Process + Queue multiprocess crawler:', end - start)
    print('Main process Ended!')

Pool+Queue

Pool maintains a fixed number of worker processes for the user to submit tasks to. When a new task is submitted to the Pool and an idle worker is available, that worker runs it; if all workers in the pool are busy, the task waits until one of them finishes and becomes free.
Before using Pool, it helps to understand blocking and non-blocking calls.
Blocking and non-blocking describe the state of the program while it waits for the result of a call (a message or return value). A blocking call suspends the current process until the result comes back; a non-blocking call returns as soon as the task is submitted, so further tasks can be submitted without waiting for earlier results. A minimal sketch of the two call styles follows.
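
Before the full crawler examples, here is a minimal sketch of the two call styles using a toy square function (the function and its inputs are made up for illustration, not from the book): apply blocks until each call has returned its result, while apply_async submits the task and returns an AsyncResult immediately, whose result is collected later with get().

from multiprocessing import Pool


def square(x):
    return x * x


if __name__ == '__main__':
    with Pool(processes=3) as pool:
        # Blocking: each apply() waits for its result before the next call is made
        blocking_results = [pool.apply(square, args=(i,)) for i in range(5)]
        print('apply       ->', blocking_results)

        # Non-blocking: all tasks are submitted first; get() collects the results afterwards
        async_results = [pool.apply_async(square, args=(i,)) for i in range(5)]
        print('apply_async ->', [r.get() for r in async_results])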

Non-blocking method:

from multiprocessing import Pool, Manager
import time
import requests

link_list = []
with open('alexa.txt', 'r') as file:
    file_list = file.readlines()
    for eachone in file_list:
        link = eachone.split('\t')[1]
        link = link.replace('\n', '')
        link_list.append(link)

start = time.time()


def crawler(q, index):
    Process_id = 'Process-' + str(index)
    while not q.empty():
        url = q.get(timeout=2)
        try:
            r = requests.get(url, timeout=20)
            print(Process_id, q.qsize(), r.status_code, url)
        except Exception as e:
            print(Process_id, q.qsize(), url, 'Error: ', e)


if __name__ == '__main__':
    # A Manager queue can be passed to Pool workers, unlike a plain multiprocessing.Queue
    manager = Manager()
    workQueue = manager.Queue(1000)

    # Fill the queue
    for url in link_list:
        workQueue.put(url)

    pool = Pool(processes=3)
    # apply_async returns immediately, so all four tasks are submitted at once
    for i in range(4):
        pool.apply_async(crawler, args=(workQueue, i))

    print("Started processes")
    pool.close()
    pool.join()

    end = time.time()
    print('Total time for the Pool + Queue multiprocess crawler (non-blocking):', end - start)
    print('Main process Ended!')

Blocking method:

from multiprocessing import Pool, Manager
import time
import requests

link_list = []
with open('alexa.txt', 'r') as file:
    file_list = file.readlines()
    for eachone in file_list:
        link = eachone.split('\t')[1]
        link = link.replace('\n', '')
        link_list.append(link)

start = time.time()


def crawler(q, index):
    Process_id = 'Process-' + str(index)
    while not q.empty():
        url = q.get(timeout=2)
        try:
            r = requests.get(url, timeout=20)
            print(Process_id, q.qsize(), r.status_code, url)
        except Exception as e:
            print(Process_id, q.qsize(), url, 'Error: ', e)


if __name__ == '__main__':
    manager = Manager()
    workQueue = manager.Queue(1000)

    # Fill the queue
    for url in link_list:
        workQueue.put(url)

    pool = Pool(processes=3)
    # apply blocks until each task finishes, so the four calls run one after another;
    # the first call drains the whole queue, making this version effectively serial
    for i in range(4):
        pool.apply(crawler, args=(workQueue, i))

    print("Started processes")
    pool.close()
    pool.join()

    end = time.time()
    print('Total time for the Pool + Queue multiprocess crawler (blocking):', end - start)
    print('Main process Ended!')