Task: scrape the memes posted under the Zhihu question "你有哪些讓你一秒變開心的表情包" ("Which memes cheer you up in a second?").
Fetching the page
The key concept here is Ajax loading: as you scroll down a page, more content is fetched on demand ("scroll down to see more"). Zhihu answers are loaded this way.
First open the question URL in Chrome and press F12 to open the developer tools. Select the Network tab and filter by XHR (Ajax requests show up as type xhr), then keep scrolling down the page until you catch a request whose name starts with answers?.
Let's analyze this request by clicking the Headers tab.
You can see it is a GET request with base_url = "https://www.zhihu.com/api/v4/questions/302378021/answers?", followed by several parameters such as include and limit. These parameters are listed under the Query String Parameters section.
Now we can write the request code. As usual, set your own referer and user-agent in headers; you can copy them from the Request Headers section.
import requests
from urllib.parse import urlencode

def get_page(offset):
    params = {
        'include' : "data[*].is_normal,admin_closed_comment,reward_info,is_collapsed,annotation_action,annotation_detail,collapse_reason,is_sticky,collapsed_by,suggest_edit,comment_count,can_comment,content,editable_content,voteup_count,reshipment_settings,comment_permission,created_time,updated_time,review_info,relevant_info,question,excerpt,relationship.is_authorized,is_author,voting,is_thanked,is_nothelp,is_labeled,is_recognized,paid_info,paid_info_content;data[*].mark_infos[*].url;data[*].author.follower_count,badge[*].topics",
        'limit' : 5,
        'offset' : offset,
        'platform': 'desktop',
        'sort_by': 'default',
    }
    base_url = "https://www.zhihu.com/api/v4/questions/302378021/answers?"
    url = base_url + urlencode(params)  # build the request URL
    headers = {
        'referer' : "https://www.zhihu.com/question/302378021",
        'user-agent' : "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36",
    }
    try:
        response = requests.get(url, headers=headers)  # send the GET request
        if response.status_code == 200:
            return response.json()  # return the parsed JSON
    except requests.ConnectionError:
        return None

get_page(5)
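As a quick sanity check, urlencode turns the parameter dict into the query string that gets appended to base_url. A minimal demo with a shortened dict (the real include value is omitted for readability):

```python
from urllib.parse import urlencode

# Shortened parameter dict, just to show the encoding
params = {
    'limit': 5,
    'offset': 10,
    'platform': 'desktop',
    'sort_by': 'default',
}
query = urlencode(params)
print(query)  # limit=5&offset=10&platform=desktop&sort_by=default
```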
Parsing the data
Switch to the Preview tab: under data there are 5 entries, one per answer. Expand one of them and you will find the answer body in its content field.
Copy the content out and inspect it. For brevity, only the first figure is shown:
<p>更新一下</p><p class="ztext-empty-paragraph"><br/></p>
<figure data-size="normal">
<noscript><img src="https://pic3.zhimg.com/50/v2-3a5f0b335d4b3e55724b78cc7f2fb0b2_hd.gif" data-rawwidth="240" data-rawheight="240" data-size="normal" data-thumbnail="https://pic3.zhimg.com/50/v2-3a5f0b335d4b3e55724b78cc7f2fb0b2_hd.jpg" class="content_image" width="240"/></noscript><img src="data:image/svg+xml;utf8,<svg xmlns='http://www.w3.org/2000/svg' width='240' height='240'></svg>" data-rawwidth="240" data-rawheight="240" data-size="normal" data-thumbnail="https://pic3.zhimg.com/50/v2-3a5f0b335d4b3e55724b78cc7f2fb0b2_hd.jpg" class="content_image lazy" width="240" data-actualsrc="https://pic3.zhimg.com/50/v2-3a5f0b335d4b3e55724b78cc7f2fb0b2_hd.gif"/>
</figure>
This figure tag contains four URLs:
img src="https://pic3.zhimg.com/50/v2-3a5f0b335d4b3e55724b78cc7f2fb0b2_hd.gif"
data-thumbnail="https://pic3.zhimg.com/50/v2-3a5f0b335d4b3e55724b78cc7f2fb0b2_hd.jpg"
data-thumbnail="https://pic3.zhimg.com/50/v2-3a5f0b335d4b3e55724b78cc7f2fb0b2_hd.jpg"
data-actualsrc="https://pic3.zhimg.com/50/v2-3a5f0b335d4b3e55724b78cc7f2fb0b2_hd.gif"
data-thumbnail is presumably the thumbnail, so we ignore it; we only need the values of the img src or data-actualsrc attributes. This time we use a regular expression to extract them.
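Before dropping the pattern into the crawler, we can test it against the figure snippet above. Note that the image in this particular sample is a gif, so the demo matches gif; the crawler below matches jpg instead:

```python
import re

# An abridged copy of the figure snippet from the answer's content field
content = ('<img class="content_image lazy" width="240" '
           'data-actualsrc="https://pic3.zhimg.com/50/v2-3a5f0b335d4b3e55724b78cc7f2fb0b2_hd.gif"/>')

# Same pattern as in the crawler, with jpg swapped for gif
images = re.findall(r'data-actualsrc=.(https:\S*?gif)', content)
print(images)
# ['https://pic3.zhimg.com/50/v2-3a5f0b335d4b3e55724b78cc7f2fb0b2_hd.gif']
```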
import requests
from urllib.parse import urlencode
import re

def get_page(offset):
    params = {
        'include' : "data[*].is_normal,admin_closed_comment,reward_info,is_collapsed,annotation_action,annotation_detail,collapse_reason,is_sticky,collapsed_by,suggest_edit,comment_count,can_comment,content,editable_content,voteup_count,reshipment_settings,comment_permission,created_time,updated_time,review_info,relevant_info,question,excerpt,relationship.is_authorized,is_author,voting,is_thanked,is_nothelp,is_labeled,is_recognized,paid_info,paid_info_content;data[*].mark_infos[*].url;data[*].author.follower_count,badge[*].topics",
        'limit' : 5,
        'offset' : offset,
        'platform': 'desktop',
        'sort_by': 'default',
    }
    base_url = "https://www.zhihu.com/api/v4/questions/302378021/answers?"
    url = base_url + urlencode(params)
    headers = {
        'referer' : "https://www.zhihu.com/question/302378021",
        'user-agent' : "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36",
    }
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.json()
    except requests.ConnectionError:
        return None

def get_image(json):
    if json and json.get('data'):
        for item in json.get('data'):
            content = item.get('content')
            # extract the data-actualsrc URLs (jpg only here; change jpg to gif for GIFs)
            images = re.findall(r'.*?data-actualsrc=.(https:\S*?jpg)', content)
            for image in images:
                yield image

json = get_page(5)
for image in get_image(json):
    print(image)
Saving the data
Saving is straightforward: request each scraped URL and write the response body to disk.
import os

def save_image(image, cnt):
    path = './images/'
    os.makedirs(path, exist_ok=True)  # make sure the folder exists
    file_path = path + '{0}.jpg'.format(cnt)
    response = requests.get(image)
    with open(file_path, 'wb') as f:
        f.write(response.content)
So far we have only analyzed one request, which carries just 5 answers. Look at next and previous under the paging section in the Preview tab: these are the URLs of the next and previous requests. Comparing them, only the offset value changes, to 10 and 0 respectively, while the current request's offset is 5. The pattern is clear: offset increases by 5 with each request.
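Instead of hard-coding the +5 step, you could also read the offset straight out of paging's next URL. A minimal sketch, assuming next has the same shape as the request URLs above (the helper name next_offset is ours, not part of any API):

```python
from urllib.parse import urlparse, parse_qs

def next_offset(next_url):
    """Extract the offset parameter from a paging 'next' URL."""
    query = parse_qs(urlparse(next_url).query)  # dict mapping param -> list of values
    return int(query['offset'][0])

# A hypothetical 'next' URL in the shape seen in the Preview tab
url = "https://www.zhihu.com/api/v4/questions/302378021/answers?limit=5&offset=10&sort_by=default"
print(next_offset(url))  # 10
```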
Full code:
import requests
from urllib.parse import urlencode
import re
import os
import time

def get_page(offset):
    params = {
        'include' : "data[*].is_normal,admin_closed_comment,reward_info,is_collapsed,annotation_action,annotation_detail,collapse_reason,is_sticky,collapsed_by,suggest_edit,comment_count,can_comment,content,editable_content,voteup_count,reshipment_settings,comment_permission,created_time,updated_time,review_info,relevant_info,question,excerpt,relationship.is_authorized,is_author,voting,is_thanked,is_nothelp,is_labeled,is_recognized,paid_info,paid_info_content;data[*].mark_infos[*].url;data[*].author.follower_count,badge[*].topics",
        'limit' : 5,
        'offset' : offset,
        'platform': 'desktop',
        'sort_by': 'default',
    }
    base_url = "https://www.zhihu.com/api/v4/questions/302378021/answers?"
    url = base_url + urlencode(params)
    headers = {
        'referer' : "https://www.zhihu.com/question/302378021",
        'user-agent' : "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36",
    }
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.json()
    except requests.ConnectionError:
        return None

def get_image(json):
    if json and json.get('data'):
        for item in json.get('data'):
            content = item.get('content')
            images = re.findall(r'.*?data-actualsrc=.(https:\S*?jpg)', content)
            for image in images:
                yield image

def save_image(image, cnt):
    path = './images/'
    os.makedirs(path, exist_ok=True)
    file_path = path + '{0}.jpg'.format(cnt)
    response = requests.get(image)
    with open(file_path, 'wb') as f:
        f.write(response.content)

if __name__ == "__main__":
    cnt = 0
    for i in range(100):
        json = get_page(5 * i)
        for image in get_image(json):
            save_image(image, cnt)
            cnt += 1
            if cnt % 100 == 0:
                print("Saved %d memes so far" % cnt)
        time.sleep(1)