Preface
This article uses Python to scrape street-snap (街拍) photos from Toutiao (今日頭條).
Let's get started.
Development tools
Python version: 3.6.4
Modules used:
requests (third-party);
re and json (standard library);
plus a few other modules that ship with Python.
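Environment setup
Install Python and add it to the PATH environment variable, then use pip to install the required modules (requests is the only third-party one).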
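As a quick sanity check after installation, a sketch like the following reports which of the modules cannot be imported. `missing_modules` is a hypothetical helper introduced here for illustration; it is not part of the original script.

```python
import importlib.util


def missing_modules(names):
    """Return the names from `names` that cannot be found for import."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # An empty list means every module the script needs is available.
    print(missing_modules(["requests", "re", "json"]))
```

If `requests` shows up in the printed list, it still needs to be installed with pip.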
Detailed browser information
Code for fetching the article links:
import json
import re

import requests

# Browser-like User-Agent sent with every request.
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
}


def get_first_data(offset):
    """Fetch one page of search results and return the raw JSON text."""
    params = {
        'offset': offset,
        'format': 'json',
        'keyword': '街拍',  # search term: "street snap"
        'autoload': 'true',
        'count': '20',
        'cur_tab': '1',
        'from': 'search_tab'
    }
    try:
        # The request itself can also fail, so it belongs inside the try block.
        response = requests.get(url='https://www.toutiao.com/search_content/', headers=headers, params=params)
        response.raise_for_status()
        return response.text
    except requests.RequestException:
        print("Failed to fetch the search results")
        return None


def handle_first_data(html):
    """Yield the article URL of every item in the search results."""
    if not html:
        return
    data = json.loads(html)
    if data and "data" in data:
        for item in data.get("data") or []:
            yield item.get("article_url")
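A note on error handling in requests: calling raise_for_status() on the response object raises an exception if the download failed (a 4xx or 5xx status code). The call therefore needs to be wrapped in try/except so the error is handled and the program does not crash.
The requests documentation (Chinese edition) is at: http://cn.python-requests.org/zh_CN/latest/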
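The behaviour of raise_for_status() can be seen in isolation with a minimal sketch. To keep it runnable without network access, it builds Response objects by hand; `text_or_none` and `fake_response` are hypothetical helpers for this illustration, and setting the private `_content` attribute is a shortcut used only for the offline demo.

```python
import requests


def text_or_none(response):
    """Return the body text, or None if the status code signals an error."""
    try:
        response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx
    except requests.HTTPError as exc:
        print("request failed:", exc)
        return None
    return response.text


def fake_response(status_code, body=b""):
    """Build a Response by hand so the sketch runs offline (demo only)."""
    resp = requests.Response()
    resp.status_code = status_code
    resp._content = body  # private attribute, set only for this demo
    resp.encoding = "utf-8"
    return resp


if __name__ == "__main__":
    print(text_or_none(fake_response(200, b"ok")))
    print(text_or_none(fake_response(404)))
```

Note that raise_for_status() only covers bad status codes. Connection problems are raised by requests.get itself, so that call also needs to sit inside the try block; requests.RequestException is the common base class for both kinds of error.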
Code for fetching the image links:
def get_second_data(url):
    """Fetch the HTML of one article page."""
    if url:
        try:
            response = requests.get(url, headers=headers)
            response.raise_for_status()
            return response.text
        except requests.RequestException:
            print("Failed to open the article link")
    return None


def handle_second_data(html):
    """Extract the image URLs from the gallery JSON embedded in the page."""
    if html:
        pattern = re.compile(r'gallery: JSON\.parse\((.*?)\),', re.S)
        result = re.search(pattern, html)
        if result:
            imageurl = []
            # The captured text is a JSON string that itself holds JSON,
            # so it is decoded twice.
            data = json.loads(json.loads(result.group(1)))
            if data and "sub_images" in data:
                sub_images = data.get("sub_images")
                images = [item.get('url') for item in sub_images]
                for image in images:
                    imageurl.append(image)  # append each URL, not the whole list
            return imageurl
        else:
            print("have no result")
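The double json.loads call is easier to follow on a small sample. The sketch below runs the same regular expression against a made-up page fragment; `SAMPLE_HTML`, the example.com URLs, and `extract_image_urls` are invented for illustration and only mimic the structure the code above expects.

```python
import json
import re

# Made-up page fragment: the argument of JSON.parse is a JSON string
# literal whose content is itself JSON (note the escaped quotes).
SAMPLE_HTML = r'''
<script>
var BASE_DATA = {
    gallery: JSON.parse("{\"sub_images\":[{\"url\":\"http://example.com/a.jpg\"},{\"url\":\"http://example.com/b.jpg\"}]}"),
    other: 1
};
</script>
'''

GALLERY_RE = re.compile(r'gallery: JSON\.parse\((.*?)\),', re.S)


def extract_image_urls(html):
    """Return the image URLs found in the embedded gallery data."""
    match = GALLERY_RE.search(html)
    if not match:
        return []
    inner = json.loads(match.group(1))  # first pass: string literal -> str
    gallery = json.loads(inner)         # second pass: str -> dict
    return [item["url"] for item in gallery.get("sub_images", [])]


if __name__ == "__main__":
    print(extract_image_urls(SAMPLE_HTML))
```

The first pass removes the outer quotes and the backslash escapes; only the second pass produces a dictionary that can be searched for "sub_images".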
Code for downloading the images:
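def download_image(imageUrl):
    """Download every image URL in the list and save it to disk."""
    for url in imageUrl:
        try:
            image = requests.get(url).content
        except requests.RequestException:
            # Skip this image; otherwise `image` would be undefined below.
            continue
        # The with block closes the file, so no explicit close() is needed.
        with open("images" + str(url[-10:]) + ".jpg", "wb") as ob:
            ob.write(image)
        print(url[-10:] + " downloaded successfully! " + url)


def main():
    html = get_first_data(0)
    if not html:
        return
    for url in handle_first_data(html):
        page = get_second_data(url)
        if page:
            result = handle_second_data(page)
            if result:
                try:
                    download_image(result)
                except KeyError:
                    print("{0} has a problem, skipping".format(result))
                    continue


if __name__ == '__main__':
    main()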
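main() only requests the first page of results (offset 0). Assuming the search endpoint pages by advancing `offset` in steps of `count` (20 in the request parameters), which is an unverified assumption, further pages could be requested with offsets generated like this. `page_offsets` is a hypothetical helper, not part of the original script.

```python
def page_offsets(pages, count=20):
    """Yield the offset for each page: 0, count, 2 * count, ..."""
    for page in range(pages):
        yield page * count


if __name__ == "__main__":
    # Offsets for the first three pages of 20 results each.
    print(list(page_offsets(3)))
```

Each offset could then be passed to get_first_data in place of the fixed 0.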