Dynamically scraping every streamer room on the Panda TV live-streaming platform with the selenium module.

Original

2018-10-09 23:20

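Scraping data from the Panda TV platform again uses an object-oriented approach and the same overall logic as before, which you may find worth borrowing. For the parsing details, see this post of mine: https://blog.csdn.net/qq_39198486/article/details/82950583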

If you are using the selenium module and do not know how to set up the chromedriver file, see this post: https://blog.csdn.net/qq_39198486/article/details/82930025

Below is the complete code. The key parts are commented, so I will not walk through them again outside the code:

# author: aspiring

from selenium import webdriver
from selenium.webdriver.common.by import By
import time
import json


class XiongmaoSpider:
    def __init__(self):
        self.start_url = "https://www.panda.tv/all"  # start_url
        self.driver = webdriver.Chrome()  # start a Chrome browser instance

    def get_content_list(self):  # extract data from the current page
        li_list = self.driver.find_elements(By.XPATH, "//ul[@id='later-play-list']/li")  # one <li> per room
        content_list = []
        for li in li_list:
            item = {}
            item["name"] = li.find_element(By.XPATH, ".//span[@class='video-nickname']").get_attribute("title")
            item["title"] = li.find_element(By.XPATH, ".//span[@class='video-title']").text
            item["room_img"] = li.find_element(By.XPATH, ".//img[@class='video-img video-img-lazy']").get_attribute("data-original")
            item["watch_num"] = li.find_element(By.XPATH, ".//span[@class='video-number']").text
            print(item)
            content_list.append(item)  # append each room's dict to the list

        # look for the "next page" element
        next_url = self.driver.find_elements(By.XPATH, "//a[@class='j-page-next']")
        # find_elements returns an empty list on the last page instead of raising,
        # so next_url becomes None there and ends the while loop in run()
        next_url = next_url[0] if len(next_url) > 0 else None

        return content_list, next_url

    def save_content_list(self, content_list):
        with open("xiongmao.txt", "a", encoding="utf-8") as f:
            for content in content_list:
                f.write(json.dumps(content, ensure_ascii=False, indent=2))  # serialize each record as JSON and write it to the file
                f.write("\n")

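    def run(self):  # main logic
        # 1. start_url
        # 2. send the request and get the response
        self.driver.get(self.start_url)
        # 3. extract data
        content_list, next_url = self.get_content_list()
        # 4. save
        self.save_content_list(content_list)
        # keep clicking the "next page" element until there is none
        while next_url is not None:
            next_url.click()  # go to the next page
            time.sleep(2)  # wait 2 s for the next page to load, so data is not extracted before its elements exist
            # 3. extract data
            content_list, next_url = self.get_content_list()
            # 4. save
            self.save_content_list(content_list)
        self.driver.quit()  # close the browser once the last page is done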


if __name__ == '__main__':
    xiongmao = XiongmaoSpider()
    xiongmao.run()
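
The fixed `time.sleep(2)` in `run()` is either wasted time or, on a slow connection, too short. One alternative is to poll until the page is ready. Selenium ships `WebDriverWait` for exactly this; what follows is only a standard-library sketch of the same polling idea, so it can be tried without a browser. The helper name `wait_until` and its parameters are my own and are not part of selenium or of the code above.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without a truthy result.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll)


if __name__ == "__main__":
    # Stand-in for "the room list has loaded": truthy on the third call.
    calls = []

    def fake_page_ready():
        calls.append(1)
        return ["room"] if len(calls) >= 3 else []

    print(wait_until(fake_page_ready, timeout=2.0, poll=0.1))
```

In the spider, the condition could be a lambda that looks up the `//ul[@id='later-play-list']/li` elements and returns the resulting list. Note that the previous page's list may still be present right after the click, so a real condition would probably also need to check that the content has changed.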

Below is the exported JSON-formatted file; I list the first three records:

{
  "name": "沐慈Kiki",
  "title": "Happy day香檳 啤酒抽獎",
  "room_img": "https://i.h2.pdim.gs/90/c0a8df56c6462e3882782f4fc22602ff/w338/h190.jpg",
  "watch_num": "1.3萬"
}
{
  "name": "芒果魚丶",
  "title": "韓服大師上王者",
  "room_img": "https://i.h2.pdim.gs/90/9c28dff6ad3d6bf1cef25e9062c0257e/w338/h190.jpg",
  "watch_num": "19.4萬"
}
{
  "name": "會旋轉的冬瓜丶",
  "title": "瓜式一刀流 開斬！",
  "room_img": "https://i.h2.pdim.gs/90/9fdc00fe5a2b9252765468ff2cd533dd/w338/h190.jpg",
  "watch_num": "18.0萬"
}
...
...
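
Because `save_content_list` writes one indented JSON object after another, the file as a whole is not a single JSON document, and `json.load` would reject it. Here is a minimal sketch of reading it back with `json.JSONDecoder.raw_decode`, and of turning strings such as "1.3萬" into integers (萬 means 10,000). The function names `iter_json_objects` and `parse_watch_num` are hypothetical, and I am assuming that counts without a 萬 suffix are plain integers, which the sample above does not show.

```python
import json


def iter_json_objects(text):
    """Yield each JSON value from a string of concatenated JSON documents."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so skip it by hand
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def parse_watch_num(s):
    """Convert a viewer count such as '1.3萬' to an int (assumed format)."""
    s = s.strip()
    if s.endswith(("萬", "万")):  # traditional or simplified form
        return round(float(s[:-1]) * 10000)
    return int(s)


if __name__ == "__main__":
    sample = '''{
  "name": "沐慈Kiki",
  "watch_num": "1.3萬"
}
{
  "name": "芒果魚丶",
  "watch_num": "19.4萬"
}
'''
    for room in iter_json_objects(sample):
        print(room["name"], parse_watch_num(room["watch_num"]))
```

To process the real output, pass the contents of `xiongmao.txt` (opened with `encoding="utf-8"`) to `iter_json_objects`. Writing each record on a single line without `indent` would be a simpler format to parse, at the cost of readability.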
