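Python Multithreaded Crawler: Batch-Crawling Dynamically Loaded Movie Information from Douban Movies (a beginner's detailed notes on understanding multithreading)

Comparing single-threaded and multithreaded crawl times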

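I recently took my teacher's advice and started teaching myself multithreaded crawling. Before attempting a real project I prepared in three areas and wrote up what I learned as the blog posts listed below. Have a look if you are interested 😁 and if you spot any mistakes, feel free to comment or message me.

Python Multithreaded Programming (1): Creating, Managing, and Stopping Threads
Python Multithreaded Programming (2): Thread Safety (Critical Resources and Thread Synchronization)
Python: A Detailed Guide to Basic Usage of the Queue Module

This post is a hands-on walkthrough of crawling Douban movie information with a multithreaded crawler.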

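1. Why learn to crawl with multiple threads

All of my earlier posts and crawlers were single-threaded: the code ran entirely in MainThread, Python's main thread. Once the volume of data reaches a certain scale, a single-threaded crawler takes a long time and cannot deliver data as promptly as big-data work requires. Multithreading and multiprocessing are two ways to address this. A crawler spends most of its running time blocked on I/O while waiting for page requests, so starting several threads lets those waits overlap and can greatly improve the crawler's efficiency. (This is my own simplified understanding.)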
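To make the I/O-blocking argument concrete, here is a minimal sketch that replaces the network request with time.sleep. The function names (fake_request, run_sequential, run_threaded) and the 0.2 second delay are made up for illustration; no real site is contacted.

```python
import threading
import time


def fake_request(delay=0.2):
    """Stand-in for a network request: it only waits, like a blocked I/O call."""
    time.sleep(delay)


def run_sequential(n):
    """Issue n fake requests one after another and return the elapsed seconds."""
    start = time.perf_counter()
    for _ in range(n):
        fake_request()
    return time.perf_counter() - start


def run_threaded(n):
    """Issue n fake requests in n threads so that their waits overlap."""
    start = time.perf_counter()
    threads = [threading.Thread(target=fake_request) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == '__main__':
    print('sequential: {:.2f}s'.format(run_sequential(5)))
    print('threaded:   {:.2f}s'.format(run_threaded(5)))
```

The sequential run waits for each delay in turn, while the threaded run waits for all of them at roughly the same time, which is the effect a crawler benefits from.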

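2. The pattern used in this project: producer/consumer (explained in detail in the walkthrough)

(1) One module is responsible for producing data, and another module is responsible for processing it (a "module" here is meant broadly: it can be a class, a function, a thread, a process, and so on).
(2) The module that produces data is called the producer, and the module that processes data is called the consumer.
(3) Identifying a producer and a consumer alone is not yet the producer/consumer pattern.
(4) The pattern also needs a buffer that sits between the producer and the consumer as an intermediary.
(5) The producer puts data into the buffer, and the consumer takes data out of the buffer.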
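The five points above can be sketched with the standard library's queue.Queue as the buffer. This is a generic illustration, not the crawler itself: the names producer, consumer, run_demo and SENTINEL are hypothetical, and the "processing" is just doubling a number. Here the consumers run while the producers are still working, and a sentinel value tells each consumer when to stop.

```python
import threading
from queue import Queue

SENTINEL = None  # hypothetical marker: tells a consumer there is no more work


def producer(buffer, items):
    """Produce data by putting every item into the shared buffer."""
    for item in items:
        buffer.put(item)


def consumer(buffer, results):
    """Take items out of the buffer and process them until a sentinel arrives."""
    while True:
        item = buffer.get()  # blocks until something is available
        if item is SENTINEL:
            break
        results.append(item * 2)  # list.append is thread-safe in CPython


def run_demo():
    buffer = Queue()  # the intermediary between producers and consumers
    results = []
    producers = [
        threading.Thread(target=producer, args=(buffer, range(i * 5, i * 5 + 5)))
        for i in range(2)
    ]
    consumers = [
        threading.Thread(target=consumer, args=(buffer, results))
        for _ in range(3)
    ]
    for t in producers + consumers:
        t.start()
    for t in producers:
        t.join()
    # All real items are queued now; send one sentinel per consumer
    for _ in consumers:
        buffer.put(SENTINEL)
    for t in consumers:
        t.join()
    return sorted(results)


if __name__ == '__main__':
    print(run_demo())
```

Because the queue does its own locking, neither side needs an explicit lock around put() and get().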

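Walkthrough

1. Analyze the page source and construct the request headers (Google Chrome)

Related earlier post: Let Python Help You Get to Know the Person You Like! Crawling Her Weibo Posts (Ajax Data Crawling)

(1) The page (the movie entries are loaded by clicking "Load more", that is, dynamically via Ajax. An earlier post covered how to crawl this kind of data, and the analysis here follows the same method.)

(2) Where the movie data sits in the source (the screenshot is not very clear, but you can see that the data is returned as JSON, under the XHR tab of the panel, loaded dynamically via Ajax)

(3) Analyze the request and construct the headers (the request carries query parameters. Paging is controlled by the page_start parameter, with 20 entries per page. From my own crawling there appear to be roughly 384 movies in total; this post crawls 200 of them as an example.)

The constructed request headers are as follows (see also the post "The HTTP Request Process: Chrome's Network Panel Explained"):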

headers = {
    'Accept': '*/*',
    'Host': 'movie.douban.com',
    'Referer': 'https://movie.douban.com/explore',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
    # Header that Ajax (XMLHttpRequest) requests conventionally send
    'X-Requested-With': 'XMLHttpRequest'
}

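3. Single-threaded crawl and its timing (it is fairly simple, so here are the source and the result directly):

import time
from urllib.parse import urlencode

import pymysql  # only needed by the commented-out save() below
import requests

# Single-threaded crawl of 200 movies took 3.1670422554016113 seconds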
headers = {
    'Accept': '*/*',
    'Host': 'movie.douban.com',
    'Referer': 'https://movie.douban.com/explore',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest'
}

# def save(information):
#        # Store the data in the database
#        # Open the database connection
#        connection = pymysql.connect(host = 'localhost',
#                                     user = 'root',
#                                     password = '123456',
#                                     database = '豆瓣電影信息',
#                                     charset = 'utf8'
#        )
#        # Insert the rows
#        for item in information:
#            try:
#                with connection.cursor() as cursor:
#                    sql = 'insert into 電影信息 (電影名字,電影評分,電影鏈接) values (%s,%s,%s)'
#                    cursor.execute(sql,(item.get('title'),item.get('rate'),item.get('url')))
#                    # Commit the transaction
#                    connection.commit()
#            except pymysql.DatabaseError:
#                # Roll back the transaction
#                connection.rollback()
#
#        connection.close()

# Running total of movies printed (named movie_count to avoid shadowing the built-in sum)
movie_count = 0


def output(subjects):
    # Print every movie in one page of results and update the total
    global movie_count
    for item in subjects:
        movie_count += 1
        print('Rating: {0}, title: {1}, link: {2}'.format(item.get('rate'), item.get('title'), item.get('url')))


def get_json_text(url_list):
    # Request each page in turn and hand its 'subjects' list to output()
    for url in url_list:
        response = requests.get(url, headers=headers)
        json_text = response.json()
        output(json_text.get('subjects'))


def main():
    time1 = time.time()
    # 20 movies per page; 10 pages gives page_start = 0, 20, ..., 180
    page = [i * 20 for i in range(10)]
    url_list = []
    for i in page:
        params = {
            'type': 'movie',
            'tag': '熱門',
            'sort': 'recommend',
            'page_limit': '20',
            'page_start': i
        }
        url = 'https://movie.douban.com/j/search_subjects?' + urlencode(params)
        url_list.append(url)
    get_json_text(url_list)
    time2 = time.time()
    print('Movies crawled: {0}, total time: {1}'.format(movie_count, time2 - time1))


if __name__ == '__main__':
    main()

Result: the single-threaded crawl of 200 movies took 3.1670422554016113 seconds

4. Multithreaded crawl (source below, with detailed comments in the code) (5 producers, 3 consumers):

import threading
import time
from queue import Empty, Queue
from urllib.parse import urlencode

import requests


headers = {
    'Accept': '*/*',
    'Host': 'movie.douban.com',
    'Referer': 'https://movie.douban.com/explore',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest'
}

# The intermediary: a queue that holds the crawled results
result_queue = Queue()


class Threading_product(threading.Thread):
    # Initialise the producer with a name and the shared URL queue
    def __init__(self, name, url_queue):
        threading.Thread.__init__(self)
        self.url_queue = url_queue
        self.name = name

    # Override threading.Thread's run() method
    def run(self):
        print('Producer {0} is crawling...'.format(self.name))
        self.spider()
        print('Producer {0} finished crawling...'.format(self.name))

    # Fetch pages until the URL queue is drained
    def spider(self):
        while True:
            try:
                # get_nowait() avoids the race in which another producer takes
                # the last URL between an empty() check and a blocking get()
                url = self.url_queue.get_nowait()
            except Empty:
                break
            response = requests.get(url, headers=headers)
            json_text = response.json()
            # Put the returned result into the intermediary result queue
            result_queue.put(json_text.get('subjects'))


class Threading_constumer(threading.Thread):
    # Initialise the consumer with a name and the shared result queue
    def __init__(self, name, result_queue):
        threading.Thread.__init__(self)
        self.name = name
        self.result_queue = result_queue

    # Override the run() method
    def run(self):
        print('Consumer {0} is parsing pages and printing movie information:'.format(self.name))
        self.spider()

    # Print results until the result queue is drained
    def spider(self):
        while True:
            try:
                # Same non-blocking pattern as the producer, for the same reason
                response = self.result_queue.get_nowait()
            except Empty:
                break
            for item in response:
                print('Rating: {0}, title: {1}, link: {2}'.format(item.get('rate'), item.get('title'), item.get('url')))


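def main():
    # Record the program's start time
    time3 = time.time()
    # 20 entries per page; build 10 pages
    page = [i * 20 for i in range(10)]
    # Queue that holds the page URLs
    url_queue = Queue()
    # Build the URLs
    for i in page:
        params = {
            'type': 'movie',
            'tag': '熱門',
            'sort': 'recommend',
            'page_limit': '20',
            'page_start': i
        }
        url = 'https://movie.douban.com/j/search_subjects?' + urlencode(params)
        # Put the URL into the URL queue
        url_queue.put(url)
    # List of producer threads
    threading_product_list = []
    # Create 5 producers
    for i in range(5):
        # Create a producer
        threading_product = Threading_product(i, url_queue)
        # Start it
        threading_product.start()
        # Keep track of it in the producer list
        threading_product_list.append(threading_product)
    # Join every producer: block the main thread until each one has finished
    for thread in threading_product_list:
        thread.join()
    # List of consumer threads
    threading_constumer_list = []
    # Create 3 consumers
    for i in range(3):
        # Create a consumer
        threading_constumer = Threading_constumer(i, result_queue)
        # Start it
        threading_constumer.start()
        # Keep track of it in the consumer list
        threading_constumer_list.append(threading_constumer)
    # Join every consumer: block the main thread until each one has finished
    for thread in threading_constumer_list:
        thread.join()
    # Record the time at which everything has finished
    time4 = time.time()
    print('Total time: {0}'.format(time4 - time3))


if __name__ == '__main__':
    main()

Final run result:

In the end the difference is clear:
the multithreaded crawl of 200 movies took 0.661125898361206 seconds,
while the single-threaded crawl took 3.1670422554016113 seconds.
That is a large gap (roughly 4.8 times) for only 200 movies, and the speed advantage of multithreading should become more pronounced as the volume of data grows.
I will keep studying multithreading and share what I learn as I go.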
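One direction for that further study might be the standard library's concurrent.futures.ThreadPoolExecutor, which manages the worker threads and the hand-off of results itself. This is a different technique from the hand-written producer/consumer threads used in this post, and the sketch below does not contact Douban: fetch_page is a hypothetical stand-in that sleeps and returns made-up entries in place of requests.get(...).json().get('subjects').

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_page(page_start):
    """Hypothetical stand-in for one page request; returns 20 fake movies."""
    time.sleep(0.1)  # simulate network latency
    return [{'title': 'movie-{0}'.format(page_start + k)} for k in range(20)]


def crawl(page_starts, workers=5):
    """Fetch all pages with a pool of worker threads and flatten the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in the same order as the inputs
        pages = list(pool.map(fetch_page, page_starts))
    return [movie for page in pages for movie in page]


if __name__ == '__main__':
    start = time.perf_counter()
    movies = crawl([i * 20 for i in range(10)])
    elapsed = time.perf_counter() - start
    print('Movies fetched: {0}, total time: {1:.2f}s'.format(len(movies), elapsed))
```

The pool plays the role of the producers, and the list returned by map() plays the role of the result queue, at the cost of less control over how results are consumed.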
