Python Web Scraping: Simulating Ajax Requests to Scrape Weibo

Simulating Ajax Requests in Python

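Sometimes, when we fetch a page with requests, the result differs from what the browser shows: the browser displays the page data normally, but the response obtained with requests does not contain it. This is because requests only retrieves the raw, static HTML document, whereas the page in the browser is the result of JavaScript processing data after the document has loaded. That data can come from several sources: it may be loaded through Ajax, generated by JavaScript, and so on.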

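Ajax stands for Asynchronous JavaScript and XML. It lets a page fetch and display new data without being reloaded and without its URL changing. For example, when browsing Weibo you can scroll down to see more content; keep scrolling and a loading animation appears, and shortly afterwards new posts show up. That is Ajax loading at work. During this process the page uses Ajax requests to exchange data with the server in the background, and once the data arrives it uses JavaScript to modify the page, which is how the content gets updated.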

Below, we simulate these Ajax requests to scrape Weibo content.

1. Goal

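Scrape the posts published on a Sina Weibo user's personal home page, including the post text and the numbers of likes, comments and reposts.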

2. Analysis

Open Chrome, go to https://m.weibo.cn/u/2695482785, and scroll to the bottom of the page while watching the requests being sent:

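As shown in the figure above, after the page first loads, open the Network panel, filter by XHR (the request type used for Ajax), and inspect the Headers, Preview and Response tabs of each request.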

Results of the analysis:

Request URL: https://m.weibo.cn/api/container/getIndex?type=uid&value=2695482785&containerid=1076032695482785&page=2
Request method: GET
Request headers: see Request Headers
Request parameters: type, value, containerid and page
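The four request parameters can be assembled into the full URL with the standard library. The following is a minimal sketch that reuses the values observed above; the helper name `build_url` is hypothetical and only serves the illustration.

```python
from urllib.parse import urlencode

BASE_URL = "https://m.weibo.cn/api/container/getIndex?"


def build_url(page, uid="2695482785", containerid="1076032695482785"):
    """Return the Ajax URL for one page of the user's posts."""
    params = {
        "type": "uid",
        "value": uid,
        "containerid": containerid,
        "page": page,
    }
    # urlencode joins the pairs as key=value&key=value in insertion order
    return BASE_URL + urlencode(params)


print(build_url(2))
```

In this example the containerid is the uid prefixed with 107603. Whether that pattern holds for every account is an assumption, so check the actual request in the Network panel before reusing it for another uid.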

Analysis of the response:

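As shown in the figure above, the response body is JSON. Expand the data key, then cards, then an individual entry: it contains an mblog field which, when expanded, holds the information about one post, such as attitudes_count (number of likes), comments_count (number of comments), reposts_count (number of reposts) and text (the post body).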
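The nesting described above (data, then cards, then mblog) can be walked with plain dictionary access. The sketch below uses a hypothetical, heavily trimmed response; real responses carry many more fields, and the second card is included on the assumption that some cards may not contain an mblog entry.

```python
# Hypothetical, trimmed-down response used only to show the nesting.
sample = {
    "data": {
        "cards": [
            {"mblog": {"id": "1", "text": "hello", "attitudes_count": 3,
                       "comments_count": 1, "reposts_count": 0}},
            {"card_type": 11},  # assumed: a card without an mblog entry
        ]
    }
}

# Walk data -> cards, then keep only the cards that carry a post.
cards = sample.get("data", {}).get("cards", [])
posts = [card["mblog"] for card in cards if card.get("mblog")]

for post in posts:
    print(post["id"], post["text"], post["attitudes_count"],
          post["comments_count"], post["reposts_count"])
```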

So each request to the API returns 10 posts, and fetching further pages only requires changing the page parameter.
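Since only the page parameter changes, paging reduces to a loop over page numbers. The sketch below keeps the network call out of the picture: `crawl` and `fake_fetch` are hypothetical names, `fake_fetch` stands in for the real request, and stopping at the first empty page is an assumption about how the API signals the end of the data.

```python
def crawl(fetch_page, max_pages):
    """Yield (page, card) pairs, stopping early when a page has no cards.

    fetch_page is any callable that takes a page number and returns the
    decoded JSON dict, or None on failure.
    """
    for page in range(1, max_pages + 1):
        payload = fetch_page(page)
        cards = ((payload or {}).get("data") or {}).get("cards") or []
        if not cards:
            break  # assumed end-of-data signal
        for card in cards:
            yield page, card


def fake_fetch(page):
    """Stand-in for the real request: two pages of 10 cards, then nothing."""
    if page <= 2:
        return {"data": {"cards": [{"mblog": {"id": f"{page}-{i}"}}
                                   for i in range(10)]}}
    return {"data": {"cards": []}}


collected = list(crawl(fake_fetch, max_pages=5))
print(len(collected))
```

Passing the fetch function in as an argument makes the paging logic easy to try out without touching the network.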

3. Parsing the response

Once we have the JSON data, we read the content under the data key, then the content under cards, and parse each item in cards to extract the fields we want. The parsing code is as follows:

def parse_page(json):
    if json:
        items = json.get('data').get('cards')  # cards is a list of items
        for item in items:  # loop over the list and extract the fields
            item = item.get('mblog')
            if item is None:  # skip cards that do not contain a post
                continue
            weibo = {}
            weibo['id'] = item.get('id')
            weibo['text'] = item.get('text')
            weibo['attitudes'] = item.get('attitudes_count')
            weibo['comments'] = item.get('comments_count')
            weibo['reposts'] = item.get('reposts_count')
            yield weibo  # yield one post as a dict

4. Full code

# -*- coding: utf-8 -*-
# @Time    : 2019-07-12 21:47
# @Author  : xudong
# @email   : [email protected]
# @Site    : 
# @File    : ajaxTest.py
# @Software: PyCharm

from urllib.parse import urlencode
import requests
import json


# Base URL of the Ajax endpoint
base_url = "https://m.weibo.cn/api/container/getIndex?"

# Build the request headers
headers = {
    'Host' : 'm.weibo.cn',
    'Referer' : 'https://m.weibo.cn/u/2695482785',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/52.0.2743.116 Safari/537.36',
    'X-Requested-With' : 'XMLHttpRequest'
}

# Fetch one page of results; the response body is JSON
def get_page(page):
    params = {
        'type': 'uid',
        'value': '2695482785',
        'containerid': '1076032695482785',
        'page': page
    }
    url = base_url + urlencode(params)
    try:
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            return response.json()
    except requests.RequestException as e:
        print('Error', e.args)
    return None  # non-200 status or request failure


# Parse one page of JSON data and yield each post as a dict
def parse_page(json):
    if json:
        items = json.get('data').get('cards')
        for item in items:
            item = item.get('mblog')
            if item is None:  # skip cards that do not contain a post
                continue
            weibo = {}
            weibo['id'] = item.get('id')
            weibo['text'] = item.get('text')
            weibo['attitudes'] = item.get('attitudes_count')
            weibo['comments'] = item.get('comments_count')
            weibo['reposts'] = item.get('reposts_count')
            yield weibo


# Append the parsed data to a file
def write_file(content):
    with open('weibo1.txt', 'a', encoding='utf-8') as file:
        file.write(content)

if __name__ == '__main__':
    for page in range(1, 4):  # crawl only three pages; page numbering is assumed to start at 1
        json1 = get_page(page)
        results = parse_page(json1)
        for result in results:
            print(result)
            write_file(json.dumps(result, ensure_ascii=False) + '\n')

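After running the script, a weibo1.txt file should appear in the current directory. If it contains the scraped post data, the simulated Ajax requests worked and the goal is met.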
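Because each post is written as one JSON object per line, the file can be read back line by line for a quick check. The following is a minimal sketch: `read_posts` is a hypothetical helper, and the demo writes two made-up records to a temporary file instead of relying on a real weibo1.txt.

```python
import json
import os
import tempfile


def read_posts(path):
    """Load one dict per non-empty line from a file of JSON lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Self-contained demo: write two made-up records in the same one-per-line form.
demo_path = os.path.join(tempfile.mkdtemp(), "weibo_demo.txt")
with open(demo_path, "a", encoding="utf-8") as f:
    for record in ({"id": "1", "text": "你好", "attitudes": 2},
                   {"id": "2", "text": "hello", "attitudes": 5}):
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

posts = read_posts(demo_path)
print(len(posts), posts[0]["text"])
```

Pointing `read_posts` at weibo1.txt should work the same way, provided the file was produced by the script above.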

You can, of course, try it with your own Weibo uid.
