【Python爬蟲】多線程爬取糗事百科【最新版本】

多線程爬蟲項目示意圖
在這裏插入圖片描述
首先,我們要明確知道多線程以下幾個重點:
1.要等目標線程都結束才能使主線程結束:若主線程先結束,所有子線程都會隨之停止,此時子線程可能還未完全跑完
2.多個線程間要對同一數據進行操作時要添加互斥鎖
3.多個線程之間通信要用隊列(先進先出)

此次多線程爬蟲我們要寫兩個多線程:
1.爬取網頁的多線程 2.解析網頁的多線程

import json
import threading
import time
from queue import Empty, Queue

import requests
from lxml import etree


class Crawlthread(threading.Thread):
    """Worker thread that downloads joke-list pages and feeds the raw HTML
    into ``dataqueue`` for the parser threads.

    Runs until the module-level flag ``a`` is set True by the main thread,
    which signals that ``pagequeue`` has been fully drained.
    """

    def __init__(self, name, pagequeue, dataqueue):
        # Initialise the base Thread first, then attach our own state.
        super(Crawlthread, self).__init__()
        self.name = name
        self.pagequeue = pagequeue  # page numbers still to be fetched
        self.dataqueue = dataqueue  # downloaded HTML handed to the parsers
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"}

    def run(self):
        print("啓動" + self.name)
        # Poll the queue until the main thread flips the shutdown flag.
        while not a:
            try:
                # get(False) raises queue.Empty immediately instead of
                # blocking when the queue is empty.
                page = self.pagequeue.get(False)
            except Empty:
                continue
            url = "https://www.qiushibaike.com/text/page/{}/".format(page)
            try:
                response = requests.get(url, headers=self.headers)
                time.sleep(1)  # throttle so we don't hammer the server
                self.dataqueue.put(response.content.decode("utf8", "ignore"))
            except requests.RequestException:
                # Network error: skip this page rather than kill the worker.
                # (The original bare `except: pass` hid every failure.)
                pass
        print("退出" + self.name)


class Parsethread(threading.Thread):
    """Worker thread that pulls raw HTML pages off ``dataqueue``, extracts
    joke entries with XPath and appends them as JSON lines to the shared
    output file.

    Runs until the module-level flag ``b`` is set True by the main thread,
    which signals that ``dataqueue`` has been fully drained.
    """

    def __init__(self, name, dataqueue, file, lock):
        super(Parsethread, self).__init__()
        self.name = name
        self.dataqueue = dataqueue  # HTML pages produced by the crawlers
        self.file = file            # shared output file handle
        self.lock = lock            # serialises writes to the shared file

    def run(self):
        print("啓動" + self.name)
        # Poll the queue until the main thread flips the shutdown flag.
        while not b:
            try:
                # get(False) raises queue.Empty immediately instead of
                # blocking when the queue is empty.
                html = self.dataqueue.get(False)
            except Empty:
                continue
            self.parse(html)
        print("退出" + self.name)

    def parse(self, html):
        """Extract every joke block from one page and write them out."""
        html = etree.HTML(html)
        node_list = html.xpath("//div[@id='content-left']/div[@class='article block untagged mb15 typs_hot']")
        connect_list = []
        for div in node_list:
            item = {}
            # Author name lives in the <h2>; strip surrounding newlines.
            item["id"] = div.xpath(".//h2/text()")
            item["id"] = [i.replace("\n", "") for i in item["id"]][0]
            item["content"] = div.xpath(".//div[@class='content']/span/text()")
            item["content"] = [i.replace("\n", "") for i in item["content"]][0]
            item["author_image"] = div.xpath(".//div/a/img/@src")
            item["author_image"] = "https:" + item["author_image"][0] if len(item["author_image"]) > 0 else None
            item["age"] = div.xpath(".//div[contains(@class, 'articleGender')]/text()")
            item["age"] = item["age"][0] if len(item["age"]) > 0 else "暫無此信息"
            # Gender is taken from the trailing class token with "Icon"
            # stripped, then truncated to its first character.
            item["author_gender"] = div.xpath(".//div[contains(@class, 'articleGender')]/@class")
            item["author_gender"] = item["author_gender"][0].split(" ")[-1].replace("Icon", "")[0] if len(
                item["author_gender"]) > 0 else "暫無此信息"
            item["stats_vote"] = div.xpath(".//div/span/i/text()")
            item["stats_vote"] = item["stats_vote"][0] if len(item["stats_vote"]) > 0 else "沒有點贊"
            item["user_commons"] = div.xpath(".//span[@class='stats-comments']/a/i/text()")
            # BUG FIX: the fallback must guard on user_commons itself, not
            # stats_vote, otherwise an empty list raises IndexError.
            item["user_commons"] = item["user_commons"][0] if len(item["user_commons"]) > 0 else "沒有評論"
            connect_list.append(item)
        # Hold the lock while writing so lines from concurrent threads
        # never interleave in the output file.
        with self.lock:
            for c in connect_list:
                self.file.write(json.dumps(c, ensure_ascii=False) + "\n")
        print("保存成功")
        
# Shutdown flags shared with the worker threads.  The main thread sets
# `a` True once pagequeue is empty (stops the crawler threads) and `b`
# True once dataqueue is empty (stops the parser threads).
a = False
b = False

def main():
    """Start four crawler and four parser threads, wait for both queues to
    drain, then join every worker and close the shared output file."""
    pagequeue = Queue(10)
    dataqueue = Queue()
    # Pages 1-10 of the text section.
    for page in range(1, 11):
        pagequeue.put(page)
    file = open("duanzi.json", "a", encoding="utf8")
    lock = threading.Lock()

    # Start the four download threads.
    crawl_threads = []
    for crawl_name in ["採1號", "採2號", "採3號", "採4號"]:
        thread = Crawlthread(crawl_name, pagequeue, dataqueue)
        thread.start()
        crawl_threads.append(thread)

    # Start the four parse threads.  (FIX: the name list used to be
    # rebuilt inside the crawler loop above due to a mis-indented line --
    # harmless at runtime, but misleading; it is hoisted out here.)
    parse_threads = []
    for parse_name in ["解1號", "解2號", "解3號", "解4號"]:
        thread = Parsethread(parse_name, dataqueue, file, lock)
        thread.start()
        parse_threads.append(thread)

    # Wait until every page number has been taken off the queue.  Sleep
    # briefly instead of busy-spinning a full CPU core with `pass`.
    while not pagequeue.empty():
        time.sleep(0.1)
    global a
    a = True
    print("page_queue已經爲空了")
    # Block the main thread until all crawler threads have finished.
    for thread in crawl_threads:
        thread.join()

    while not dataqueue.empty():
        time.sleep(0.1)
    global b
    b = True
    for thread in parse_threads:
        thread.join()

    # Every parser has joined, so nothing else holds the file open.
    with lock:
        file.close()


# Run the crawler only when executed as a script, not on import.
if __name__ == '__main__':
    main()

運行結果如下:
在這裏插入圖片描述


> 不急不躁,穩紮穩打。
> 歡迎相互交流,促進進步。

發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章