Crawl AJAX dynamic web page using Python 2.x and 3.x

Original

2020-02-22 13:34

AJAX is short for Asynchronous JavaScript and XML. It uses the JavaScript XMLHttpRequest object to exchange data between the client's browser and the server in the background, so the page can be updated without being refreshed.

When crawling content loaded by AJAX, it is sometimes easy to identify the requested URL directly. Take IE 11 as an example. First, press F12 to open the developer tools. Select the "Network" tab, click the button on the page that triggers the XMLHttpRequest, and then check the URL column to find the URLs requested by AJAX.
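Once such a URL shows up in the Network tab, it can usually be requested like any other resource. The following is a minimal Python 3 sketch, not a definitive implementation. The endpoint `http://example.com/api/articles`, the `page` parameter, and the helper names `build_get_request` and `fetch_json` are assumptions made up for illustration; substitute whatever the developer tools show for the real page.

```python
import json
import urllib.parse
import urllib.request


def build_get_request(base_url, params, user_agent):
    """Build a GET request for an AJAX URL found in the Network tab."""
    query = urllib.parse.urlencode(params)
    return urllib.request.Request(base_url + '?' + query,
                                  headers={'User-Agent': user_agent})


def fetch_json(request):
    """Send the request and decode the JSON body (performs network I/O)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode('utf-8'))


# Hypothetical endpoint and parameter, for illustration only
request = build_get_request(
    'http://example.com/api/articles',
    {'page': 2},
    'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0)')

# Inspect the request offline before sending it with fetch_json(request)
print(request.get_method(), request.full_url)
```

Building the request separately from sending it makes it possible to check the final URL and headers against the Network tab before touching the server.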

However, sometimes the URL requested by XMLHttpRequest cannot be identified directly. In this case, we have to build the request manually.

1. Identify the URL requested with the POST method.

2. Double-click that URL and copy the value of "User-Agent".

3. Select the "Request body" tab and copy the values.

4. Write the Python code:

Python 2.x

import json
import urllib
import urllib2

url = 'http://www.huxiu.com/v2_action/article_list'
user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0)'
# Form fields copied from the "Request body" tab
data = {'huxiu_hash_code': '63b69ec3342ee8c7e6ec4cab561482c9', 'page': 2, 'last_dateline': 1466664240}
data = urllib.urlencode(data)

# Supplying data makes this a POST request; send the copied User-Agent as a header
request = urllib2.Request(url=url, data=data, headers={'User-Agent': user_agent})
response = urllib2.urlopen(request)

# Parse the JSON response
result = json.loads(response.read())
print result

Python 3.x

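import json
import urllib.parse
import urllib.request

url = 'http://www.huxiu.com/v2_action/article_list'
user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0)'
# Form fields copied from the "Request body" tab
data = {'huxiu_hash_code': '63b69ec3342ee8c7e6ec4cab561482c9', 'page': 2, 'last_dateline': 1466664240}
# The request body must be bytes in Python 3
data = urllib.parse.urlencode(data).encode('utf-8')

# Supplying data makes this a POST request; send the copied User-Agent as a header
request = urllib.request.Request(url, data=data, headers={'User-Agent': user_agent})
response = urllib.request.urlopen(request)

# Parse the JSON response
result = json.loads(response.read().decode('utf-8'))
print(response.status)
print(result)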
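Before sending anything, it can help to check offline what will actually go over the wire and compare it with the values shown in the developer tools. The sketch below reuses the URL, User-Agent, and form fields from the example; the variable names `fields` and `body` are chosen here for illustration. It makes no network call.

```python
import urllib.parse
import urllib.request

url = 'http://www.huxiu.com/v2_action/article_list'
user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0)'
fields = {'huxiu_hash_code': '63b69ec3342ee8c7e6ec4cab561482c9',
          'page': 2,
          'last_dateline': 1466664240}

# URL-encode the form fields and convert them to bytes for the request body
body = urllib.parse.urlencode(fields).encode('utf-8')
request = urllib.request.Request(url, data=body,
                                 headers={'User-Agent': user_agent})

# A Request that carries data is sent as POST
print(request.get_method())
# The encoded body should match the "Request body" tab
print(body.decode('utf-8'))
# urllib stores header names capitalized, hence 'User-agent'
print(request.get_header('User-agent'))
```

If the printed body or User-Agent differs from what the browser sent, the server may answer with an error or an empty result, so this comparison is a cheap first debugging step.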

Yunhe_Feng
