Scraping Articles from the Tencent Cloud Community: Notes from One Attempt

Original

2020-06-14 12:44

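I recently learned some Python web scraping. Many sites are easy to crawl: the link to the second page usually contains something like page=2, so paging through them is straightforward. Some sites, however, such as the Tencent Cloud community discussed here, put some protection in place against crawlers. Below I share how I worked out a way to scrape its content.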

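Analysis

First, search for Python and scroll to the bottom of the results page, where there is a "Click to load more" button.

In the Network tab of the browser's inspect panel, we can see that clicking the button sends a request to /search?action=SearchList.

Opening that request and looking at its Preview shows the content of the next page, so this is almost certainly the request we need.

However, the URL is the same no matter which page is requested, so I guessed the page information was hidden somewhere under Headers. After some digging, I found several properties in the request Payload.

We can guess that the pageNumber property is the page number and the q property is the search term. I tested this in Python: with pageNumber set to 3 and q set to python3, the response was the third page of results for python3. With the hypothesis confirmed, it was time to get to work.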

Adding the Payload

Define the payload directly (i is the page number):

payload = {"action": "SearchList", "payload": {"pageNumber": i, "q": "python", "searchTab": "article"}}

Then pass json=payload to the POST call:

html = requests.post(url=url, json=payload, headers=headers).content.decode("utf-8","ignore")
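
The two lines above can be wrapped into small helpers so that the page number and search term are parameters, as in the pageNumber=3, q=python3 test described earlier. This is only a sketch: build_payload and fetch_page are hypothetical names, and the 10-second timeout is an assumption, not something from the original script.

```python
SEARCH_URL = "https://cloud.tencent.com/developer/services/ajax/search?action=SearchList"


def build_payload(page_number, query, tab="article"):
    """Return the JSON body for one page of search results."""
    return {
        "action": "SearchList",
        "payload": {"pageNumber": page_number, "q": query, "searchTab": tab},
    }


def fetch_page(page_number, query, headers=None, timeout=10):
    """POST the payload and return the response body as text."""
    # Imported here so build_payload can be used without requests installed.
    import requests

    response = requests.post(
        SEARCH_URL,
        json=build_payload(page_number, query),
        headers=headers,
        timeout=timeout,
    )
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.text


# Example: the body for the third page of results for "python3"
print(build_payload(3, "python3"))
```

Calling fetch_page(3, "python3", headers=headers) would then send the same kind of request as the one-liner above, assuming the headers dict from the full script.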

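Writing to a Database (Optional)

Because the search returns a large amount of data, I wrote it to a database so it would be easier to browse later. If you are not familiar with databases, you can generate a CSV file instead; only the database method is covered here. If you do not have a database installed, a very simple way to get one is phpstudy.

After installation, start the WNMP environment; phpMyAdmin can then be used to view and edit databases. Create a database named cloud_tecent and, inside it, a table named article.
Finally, use the pymysql module to write to the database: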

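# Connect to the local cloud_tecent database
connectSql = pymysql.connect(host="127.0.0.1", user="root", password="321369", database="cloud_tecent")
cursor = connectSql.cursor()
# title and link are read at every second match, comment at every match
for j in range(0, len(title), 2):
    # Strip the <em> and </em> highlight tags from the title
    title2 = str(title[j]).replace("<em>", "").replace("</em>", "")
    link1 = "/developer/article/" + link[j]
    comment1 = comment[j // 2]
    # Insert title, link and comment; the %s placeholders let pymysql
    # escape quotes instead of building the SQL by string concatenation
    sql = "insert into article(title,link,comment) values(%s,%s,%s)"
    cursor.execute(sql, (title2, link1, comment1))
# Commit once after all rows are inserted, then release the connection
connectSql.commit()
connectSql.close()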
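
For readers who prefer the CSV route mentioned earlier, a minimal sketch using the standard csv module might look like the following. The helper name write_articles_csv and the sample row are made up for illustration; in the real script the rows would come from the title, link and comment lists.

```python
import csv
import io


def write_articles_csv(rows, fileobj):
    """Write (title, link, comment) tuples to an open text file object."""
    writer = csv.writer(fileobj)
    writer.writerow(["title", "link", "comment"])  # header row
    writer.writerows(rows)


# Made-up sample row, written to an in-memory buffer for demonstration
rows = [("Python basics", "/developer/article/123", "A short summary")]
buffer = io.StringIO()
write_articles_csv(rows, buffer)
print(buffer.getvalue())
```

To write a real file, open it with open("articles.csv", "w", newline="", encoding="utf-8-sig") and pass that object instead of the buffer; the utf-8-sig encoding helps Excel display Chinese text correctly.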

Full Code

import re

import pymysql
import requests


def get_cloudtecent(i):
    url = 'https://cloud.tencent.com/developer/services/ajax/search?action=SearchList'
    headers = {
        'Content-Type': 'application/json;charset=UTF-8',
        'Cookie': 'paste your cookie here',
        'Referer': 'https://cloud.tencent.com/developer/search/article-python',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36',
        'accept-language': 'zh-CN,zh;q=0.9',
        'accept': 'application/json, text/plain, */*'
    }
    # i is the page number to request
    payload = {"action": "SearchList", "payload": {"pageNumber": i, "q": "python", "searchTab": "article"}}

    html = requests.post(url=url, json=payload, headers=headers).content.decode("utf-8", "ignore")
    print(html)
    title = re.compile('"article":{"id":.*?"articleId":.*?"title":"(.*?)"', re.S).findall(html)
    # While testing, print these three lists to verify the regular expressions
    #print(title)
    link = re.compile('"articleId":(.*?),', re.S).findall(html)
    #print(link)
    comment = re.compile('"summary":"(.*?)"', re.S).findall(html)
    #print(comment)
    # Connect to the local cloud_tecent database
    connectSql = pymysql.connect(host="127.0.0.1", user="root", password="321369", database="cloud_tecent")
    cursor = connectSql.cursor()
    # title and link are read at every second match, comment at every match
    for j in range(0, len(title), 2):
        # Strip the <em> and </em> highlight tags from the title
        title2 = str(title[j]).replace("<em>", "").replace("</em>", "")
        link1 = "/developer/article/" + link[j]
        comment1 = comment[j // 2]
        # Insert title, link and comment; %s placeholders let pymysql escape quotes
        sql = "insert into article(title,link,comment) values(%s,%s,%s)"
        cursor.execute(sql, (title2, link1, comment1))
    connectSql.commit()
    connectSql.close()


# Crawl the first five pages (range's upper bound is exclusive)
for i in range(1, 6):
    get_cloudtecent(i)
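
Since the endpoint is requested with an accept header of application/json, an alternative to running regular expressions over the raw text is to parse the response and walk the resulting structure. The sketch below is not the method used in the script above. It assumes only that article objects carry the articleId, title and summary keys that the regular expressions target; the nesting in the sample string is invented for illustration, and iter_articles and strip_em are hypothetical helper names.

```python
import json
import re


def iter_articles(node):
    """Yield every dict in a parsed JSON tree that has articleId and title."""
    if isinstance(node, dict):
        if "articleId" in node and "title" in node:
            yield node
        for value in node.values():
            yield from iter_articles(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_articles(item)


def strip_em(text):
    """Remove the <em> and </em> highlight tags from a title."""
    return re.sub(r"</?em>", "", text)


# Invented sample; the real response layout may differ
sample = (
    '{"list": [{"article": {"id": 1, "articleId": 123, '
    '"title": "<em>Python</em> tips", "summary": "intro"}}]}'
)
rows = [
    (strip_em(a["title"]), "/developer/article/%s" % a["articleId"], a.get("summary", ""))
    for a in iter_articles(json.loads(sample))
]
print(rows)
```

Because the walk does not depend on where the article objects sit in the response, it keeps working if the surrounding layout changes, as long as the key names stay the same.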

Results
