These study notes follow 唐松's 《Python 網絡爬蟲:從入門到實踐》; the code below is quoted from the author's source code. This post is for reference only! A quick plug: this crawler tutorial really is good! ღ( ´・ᴗ・` )
Three main ways to speed up a crawler:
- Multithreaded crawling (concurrency, asynchronous)
  Web crawling is IO-bound. A single thread sits idle during every IO wait, wasting time; with multiple threads, the program can switch to thread B while thread A is waiting, so CPU time is not wasted and throughput improves.
- Multiprocess crawling (parallelism, asynchronous)
  Because of the GIL (Global Interpreter Lock) built into Python's design, a multithreaded crawler cannot fully exploit a multi-core CPU; a multiprocess crawler can use all the cores and make full use of the CPU.
- Coroutine-based crawling (concurrency, asynchronous) (not covered in these notes)
Concurrency vs. parallelism, synchronous vs. asynchronous
- Concurrency and parallelism
  Concurrency and parallelism are two related but distinct concepts.
  Concurrency: several tasks make progress within the same time period (a single-core CPU interleaving multiple tasks).
  Parallelism: several tasks run at the same instant (multiple CPU cores each running a task).
- Synchronous and asynchronous
  Synchronous: the concurrent or parallel tasks do not run independently; they follow a fixed order, and one task starts only after another has finished and produced its result (think of it as sequential execution).
  Asynchronous: the concurrent or parallel tasks run independently; one task's progress is not affected by another's (think of it as branching execution).
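The distinction shows up clearly in a toy timing experiment (my sketch, not from the book; `time.sleep` stands in for a network wait):

```python
import threading
import time

def task(delay):
    time.sleep(delay)  # stands in for an IO wait such as a network request

# synchronous: the tasks run one after another
start = time.time()
for _ in range(3):
    task(0.2)
sync_elapsed = time.time() - start

# asynchronous/concurrent: the IO waits overlap across threads
start = time.time()
threads = [threading.Thread(target=task, args=(0.2,)) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
conc_elapsed = time.time() - start

print('sync: %.1fs  concurrent: %.1fs' % (sync_elapsed, conc_elapsed))
```

The three synchronous calls take about 0.6 s in total, while the three concurrent ones finish in roughly 0.2 s, because the waits overlap.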
Multithreaded crawler in practice
The threading module provides the Thread class for working with threads, including the following methods:
- run(): the method representing the thread's activity
- start(): start the thread's activity
- join([timeout]): wait until the thread terminates; blocks the calling thread until the thread whose join() method was called finishes (or the optional timeout elapses)
- is_alive(): return whether the thread is alive (isAlive() is the deprecated older spelling)
- getName() / setName(): get or set the thread name (deprecated; prefer the name attribute)
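A minimal demo of this API (my example, not the book's; the sleep stands in for real work):

```python
import threading
import time

def work():
    time.sleep(0.2)  # stands in for a long-running task

t = threading.Thread(target=work, name="demo-thread")
t.start()                    # launches run() in a new thread
print(t.name)                # the name set above
alive_during = t.is_alive()  # True: work() is still sleeping
t.join()                     # block until the thread finishes
alive_after = t.is_alive()   # False once the thread has terminated
print(alive_during, alive_after)
```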
Using the threading module directly
```python
import threading
import requests
import time

# load the URL list: each line of alexa.txt is "rank<TAB>url"
link_list = []
with open('alexa.txt', 'r') as file:
    file_list = file.readlines()
    for eachone in file_list:
        link = eachone.split('\t')[1]
        link = link.replace('\n', '')
        link_list.append(link)

start = time.time()

class myThread(threading.Thread):
    def __init__(self, name, link_range):
        threading.Thread.__init__(self)
        self.name = name
        self.link_range = link_range

    def run(self):
        print("Starting " + self.name)
        crawler(self.name, self.link_range)
        print("Exiting " + self.name)

def crawler(threadName, link_range):
    for i in range(link_range[0], link_range[1] + 1):
        try:
            r = requests.get(link_list[i], timeout=20)
            print(threadName, r.status_code, link_list[i])
        except Exception as e:
            print(threadName, 'Error: ', e)

thread_list = []
# 1000 URLs occupy indices 0-999, so the last range ends at 999, not 1000
link_range_list = [(0, 200), (201, 400), (401, 600), (601, 800), (801, 999)]

# create the threads
for i in range(1, 6):
    thread = myThread("Thread-" + str(i), link_range_list[i - 1])
    thread.start()
    thread_list.append(thread)

# wait for all threads to finish
for thread in thread_list:
    thread.join()

end = time.time()
print('Total time for the simple multithreaded crawler:', end - start)
print("Exiting Main Thread")
# Total time for the simple multithreaded crawler: 457.86393117904663
```
Queue+thread
```python
import threading
import requests
import time
import queue as Queue

# load the URL list: each line of alexa.txt is "rank<TAB>url"
link_list = []
with open('alexa.txt', 'r') as file:
    file_list = file.readlines()
    for eachone in file_list:
        link = eachone.split('\t')[1]
        link = link.replace('\n', '')
        link_list.append(link)

start = time.time()

class myThread(threading.Thread):
    def __init__(self, name, q):
        threading.Thread.__init__(self)
        self.name = name
        self.q = q

    def run(self):
        print("Starting " + self.name)
        while True:
            try:
                crawler(self.name, self.q)
            except Queue.Empty:  # queue drained: nothing left to crawl
                break
        print("Exiting " + self.name)

def crawler(threadName, q):
    url = q.get(timeout=2)  # raises Queue.Empty after 2s if the queue is empty
    try:
        r = requests.get(url, timeout=20)
        print(q.qsize(), threadName, r.status_code, url)
    except Exception as e:
        print(q.qsize(), threadName, url, 'Error: ', e)

threadList = ["Thread-1", "Thread-2", "Thread-3", "Thread-4", "Thread-5"]
workQueue = Queue.Queue(1000)
threads = []

# fill the queue before starting the workers, so none sees it empty and exits early
for url in link_list:
    workQueue.put(url)

# create the threads
for tName in threadList:
    thread = myThread(tName, workQueue)
    thread.start()
    threads.append(thread)

# wait for all threads to finish
for t in threads:
    t.join()

end = time.time()
print('Total time for the Queue-based multithreaded crawler:', end - start)
print("Exiting Main Thread")
```
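As a side note (not in the book), the standard library's concurrent.futures implements this same thread-pool pattern with much less boilerplate. A sketch, where `fetch` is a hypothetical stand-in for the `requests.get` call above:

```python
from concurrent.futures import ThreadPoolExecutor
import time

# hypothetical stand-in for the requests.get call in the crawler
def fetch(url):
    time.sleep(0.1)  # simulate network IO
    return url, 200

urls = ['http://example.com/%d' % i for i in range(10)]

start = time.time()
# five worker threads pull URLs from the internal queue, like the five myThread workers
with ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(fetch, urls))
elapsed = time.time() - start

print(len(results), 'URLs in %.1fs' % elapsed)
```

The executor owns the queue and the worker lifecycle, so there is no hand-written while loop or Empty handling.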
Multiprocess crawler in practice
To use multiple processes we need the multiprocessing library, which can be used in two ways.
Note: in multiprocessing, parent and child processes do not share global variables!
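A minimal sketch of this point (my example, not the book's): a global modified in the child is invisible to the parent, because each process has its own memory:

```python
import multiprocessing as mp

counter = 0  # global in the parent process

def worker():
    global counter
    counter += 100  # modifies only the child's own copy

if __name__ == '__main__':
    p = mp.Process(target=worker)
    p.start()
    p.join()
    # the parent's global is unchanged: processes do not share memory
    print(counter)  # prints 0
```

This is exactly why the examples below pass a Queue (or a Manager-backed queue) between processes instead of relying on shared globals.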
Process+Queue
This is much like the threading version. Note that the Queue used to manage the work queue in the threading example comes built into multiprocessing. Each process can also have its attributes set individually: if daemon is set to True, the child process is terminated automatically when its parent process exits.
```python
from multiprocessing import Process, Queue
import time
import requests

# load the URL list: each line of alexa.txt is "rank<TAB>url"
link_list = []
with open('alexa.txt', 'r') as file:
    file_list = file.readlines()
    for eachone in file_list:
        link = eachone.split('\t')[1]
        link = link.replace('\n', '')
        link_list.append(link)

start = time.time()

class MyProcess(Process):
    def __init__(self, q):
        Process.__init__(self)
        self.q = q

    def run(self):
        print("Starting ", self.pid)
        while not self.q.empty():
            crawler(self.q)
        print("Exiting ", self.pid)

def crawler(q):
    url = q.get(timeout=2)
    try:
        r = requests.get(url, timeout=20)
        print(q.qsize(), r.status_code, url)
    except Exception as e:
        print(q.qsize(), url, 'Error: ', e)

if __name__ == '__main__':
    workQueue = Queue(1000)
    # fill the queue
    for url in link_list:
        workQueue.put(url)
    # start all three processes first, then join them;
    # calling join() inside the start loop would run the processes one at a time
    processes = []
    for i in range(0, 3):
        p = MyProcess(workQueue)
        p.daemon = True
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
    end = time.time()
    print('Total time for the Process + Queue multiprocess crawler:', end - start)
    print('Main process Ended!')
```
Pool+Queue
A Pool provides a fixed number of worker processes for the user. When a new request is submitted to the Pool and the pool is not yet full, a new process is created to execute it; once the pool holds its maximum number of processes, further requests wait until a process in the pool finishes.
Before using Pool, it helps to understand blocking vs. non-blocking calls.
Blocking and non-blocking describe the program's state while it waits for the result of a call (a message or return value). A blocking call waits until the result comes back, and the current process is suspended until then. A non-blocking call returns as soon as the task is submitted, so further tasks can be submitted without waiting for earlier results.
Non-blocking version:
```python
from multiprocessing import Pool, Manager
import time
import requests

# load the URL list: each line of alexa.txt is "rank<TAB>url"
link_list = []
with open('alexa.txt', 'r') as file:
    file_list = file.readlines()
    for eachone in file_list:
        link = eachone.split('\t')[1]
        link = link.replace('\n', '')
        link_list.append(link)

start = time.time()

def crawler(q, index):
    Process_id = 'Process-' + str(index)
    while not q.empty():
        url = q.get(timeout=2)
        try:
            r = requests.get(url, timeout=20)
            print(Process_id, q.qsize(), r.status_code, url)
        except Exception as e:
            print(Process_id, q.qsize(), url, 'Error: ', e)

if __name__ == '__main__':
    manager = Manager()
    workQueue = manager.Queue(1000)
    # fill the queue
    for url in link_list:
        workQueue.put(url)
    pool = Pool(processes=3)
    for i in range(4):
        # apply_async() returns immediately, so all four tasks are submitted at once
        pool.apply_async(crawler, args=(workQueue, i))
    print("Started processes")
    pool.close()
    pool.join()
    end = time.time()
    print('Total time for the Pool + Queue multiprocess crawler:', end - start)
    print('Main process Ended!')
```
Blocking version:
```python
from multiprocessing import Pool, Manager
import time
import requests

# load the URL list: each line of alexa.txt is "rank<TAB>url"
link_list = []
with open('alexa.txt', 'r') as file:
    file_list = file.readlines()
    for eachone in file_list:
        link = eachone.split('\t')[1]
        link = link.replace('\n', '')
        link_list.append(link)

start = time.time()

def crawler(q, index):
    Process_id = 'Process-' + str(index)
    while not q.empty():
        url = q.get(timeout=2)
        try:
            r = requests.get(url, timeout=20)
            print(Process_id, q.qsize(), r.status_code, url)
        except Exception as e:
            print(Process_id, q.qsize(), url, 'Error: ', e)

if __name__ == '__main__':
    manager = Manager()
    workQueue = manager.Queue(1000)
    # fill the queue
    for url in link_list:
        workQueue.put(url)
    pool = Pool(processes=3)
    for i in range(4):
        # apply() blocks until each task finishes, so the tasks run one after another
        pool.apply(crawler, args=(workQueue, i))
    print("Started processes")
    pool.close()
    pool.join()
    end = time.time()
    print('Total time for the Pool + Queue multiprocess crawler:', end - start)
    print('Main process Ended!')
```