[python] Crawling WeChat Official Account Articles

Requirements

Crawl the articles published by a number of WeChat official accounts.

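Data sources

1. Sogou WeChat Search can search official account articles, but it only shows the ten most recent articles of an account.
2. The material management feature of a personal WeChat official account can be used to look up articles from other official accounts.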

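Steps

1. Manually copy the cookie from the website and use it to log in.
2. Extract the token from the request URL.
3. Build the query parameters and request https://mp.weixin.qq.com/cgi-bin/searchbiz to get the account's fakeid, which is its biz.
4. Build the query parameters and request https://mp.weixin.qq.com/cgi-bin/appmsg? to get the article list.
5. Crawl each article through its URL.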
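Steps 1 and 2 come down to two small pieces of string handling, which can be sketched without touching the network. This is a minimal sketch, not the exact code used further down: the helper names `parse_cookie_string` and `extract_token` are hypothetical, and the sample cookie values and sample URL are made up for illustration.

```python
import re


def parse_cookie_string(cookie_str):
    """Turn a 'k1=v1; k2=v2' cookie string into a dict."""
    cookies = {}
    for pair in cookie_str.split(';'):
        if '=' not in pair:
            continue  # skip empty fragments such as a trailing ';'
        # Split on the first '=' only: values may contain '=' themselves
        key, value = pair.split('=', 1)
        cookies[key.strip()] = value.strip()
    return cookies


def extract_token(url):
    """Return the numeric token from a URL, or None if there is none."""
    match = re.search(r'token=(\d+)', url)
    return match.group(1) if match else None


# Made-up sample data, for illustration only
sample_cookie = 'ua_id=abc123; slave_sid=Zm9vYmFy==; rewardsn=;'
sample_url = 'https://mp.weixin.qq.com/cgi-bin/home?t=home/index&lang=zh_CN&token=123456789'

print(parse_cookie_string(sample_cookie))
print(extract_token(sample_url))
```

Returning `None` when no token is present makes an expired cookie easy to detect before any further request is sent.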

This approach cannot obtain read counts or like counts, because an official account article opened in a web browser does not show them.
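What the article page does provide is the title and the body text. The following is a rough sketch of step 5 that uses only the standard library. It assumes that the body sits in an element with `id="js_content"` and the title in an element with `id="activity-name"`; both ids are assumptions about the page markup and should be checked against a real article. The class `ElementTextExtractor`, the function `extract_text_by_id`, and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class ElementTextExtractor(HTMLParser):
    """Collect the text inside the first element whose id equals target_id."""

    # Tags that never have a closing tag, so they must not change the depth
    VOID_TAGS = {'br', 'img', 'hr', 'input', 'meta', 'link'}

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0       # > 0 while inside the target element
        self.done = False    # True once the target element has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag in self.VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1
        elif dict(attrs).get('id') == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth > 0 and tag not in self.VOID_TAGS:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text_by_id(html, target_id):
    parser = ElementTextExtractor(target_id)
    parser.feed(html)
    parser.close()
    return '\n'.join(parser.chunks)


# Made-up page, shaped like the assumed markup
sample_html = (
    '<html><body>'
    '<h1 id="activity-name"> Sample title </h1>'
    '<div id="js_content"><p>First paragraph.</p><p>Second<br>paragraph.</p></div>'
    '<div id="footer">ignored</div>'
    '</body></html>'
)

print(extract_text_by_id(sample_html, 'activity-name'))
print(extract_text_by_id(sample_html, 'js_content'))
```

In a real run the HTML would come from something like `requests.get(link).text`, where `link` is an article URL taken from the article list.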

Code

GitHub repository link

import requests
import json
import re
import time

class WeChatCrawler():

    def __init__(self, wxList):
        self.wxList = wxList
        self.cookies = self.__getCookiesFromText()
        self.token = self.__getToken()
        self.headers = {
            "HOST": "mp.weixin.qq.com",
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0"
        }
        self.searchBizParam = {
            'action': 'search_biz',
            'token': self.token,
            'lang': 'zh_CN',
            'f': 'json',
            'ajax': '1',
            'query': '',
            'begin': '0',
            'count': '5',
        }
        self.getMsgListParam = {
            'token': self.token,
            'lang': 'zh_CN',
            'f': 'json',
            'ajax': '1',
            'action': 'list_ex',
            'begin': '0',
            'count': '5',
            'query': '',
            'fakeid': '',
            'type': '9'
        }

    def __getCookiesFromText(self):
        # The cookie is copied manually from the browser into cookie.txt
        with open('cookie.txt', 'r', encoding='utf-8') as f:
            cookieStr = f.read().strip()
        # Convert "k1=v1; k2=v2" into a dict. Split on the first '=' only,
        # because cookie values may themselves contain '='.
        cookies = {}
        for pair in cookieStr.split(';'):
            if '=' in pair:
                key, value = pair.split('=', 1)
                cookies[key.strip()] = value.strip()
        return cookies

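    def __getToken(self):
        url = 'https://mp.weixin.qq.com'
        response = requests.get(url=url, cookies=self.cookies)
        # The token is read from the final URL of the request
        match = re.search(r'token=(\d+)', str(response.url))
        if match is None:
            raise RuntimeError('token not found in the response URL; the cookie may have expired')
        return match.group(1)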

    def __getWXFakeid(self, wx):
        searchUrl = 'https://mp.weixin.qq.com/cgi-bin/searchbiz?'
        self.searchBizParam['query'] = wx
        searchResponse = requests.get(searchUrl, cookies=self.cookies, headers=self.headers, params=self.searchBizParam)
        # Take the first search result; guard against an empty or missing list
        bizList = searchResponse.json().get('list') or []
        if not bizList:
            print('no official account found for', wx)
            return None
        return bizList[0].get('fakeid')

    def __getWXMsgCnt(self, fakeId):
        self.getMsgListParam['fakeid'] = fakeId
        appmsgUrl = 'https://mp.weixin.qq.com/cgi-bin/appmsg?'
        appmsgResponse = requests.get(appmsgUrl, cookies=self.cookies, headers=self.headers,
                                      params=self.getMsgListParam)
        wxMsgCnt = appmsgResponse.json().get('app_msg_cnt')
        return wxMsgCnt

    def __getWXMsgList(self, fakeId):
        appmsgUrl = 'https://mp.weixin.qq.com/cgi-bin/appmsg?'
        wxMsgCnt = self.__getWXMsgCnt(fakeId)
        if wxMsgCnt is not None:
            # Round up so that a final partial page is fetched as well
            pages = (int(wxMsgCnt) + 4) // 5
            begin = 0
            for _ in range(pages):
                print('==== next page ====', begin)
                self.getMsgListParam['begin'] = str(begin)
                msgListResponse = requests.get(appmsgUrl, cookies=self.cookies, headers=self.headers,
                                               params=self.getMsgListParam)
                # Fall back to an empty list if the response carries no articles
                msgList = msgListResponse.json().get('app_msg_list') or []
                for item in msgList:
                    # todo more
                    msgLink = item.get('link')
                    print(msgLink)
                    msgTitle = item.get('title')
                    print(msgTitle)
                begin += 5
                # Pause between pages to avoid hammering the server
                time.sleep(3)

    def runCrawler(self):
        fakeIds = list(map(self.__getWXFakeid, self.wxList))
        list(map(self.__getWXMsgList, fakeIds))

if __name__ == '__main__':
    # example
    wxList = ['量子位', ]
    wc = WeChatCrawler(wxList)
    wc.runCrawler()
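
The crawler above only prints each title and link (see the `# todo more` comment). One possible way to collect the results instead is to separate paging from fetching, sketched below. The names `iter_articles`, `fetch_page` and `fake_fetch` are hypothetical, and the fake data stands in for the real appmsg request so that the sketch runs offline.

```python
def iter_articles(fetch_page, total, page_size=5):
    """Yield (title, link) for every article, page by page.

    fetch_page(begin) is expected to return the list of items for the page
    that starts at offset `begin` (the 'app_msg_list' part of a response).
    """
    # Stepping through range() also covers a final, partially filled page
    for begin in range(0, total, page_size):
        for item in fetch_page(begin) or []:
            yield item.get('title'), item.get('link')


# Stand-in for the network call: 12 fake articles, served 5 at a time
fake_items = [{'title': 'post %d' % i, 'link': 'https://example.com/%d' % i}
              for i in range(12)]


def fake_fetch(begin):
    return fake_items[begin:begin + 5]


articles = list(iter_articles(fake_fetch, total=len(fake_items)))
print(len(articles))
print(articles[-1])
```

With a real `fetch_page` that sends the request, the delay between pages would belong inside that function, so the paging logic stays easy to test.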
