I've read a lot of blog posts by people who really know this area, and I realize how little I understand myself; I'm not yet fluent with the aiohttp library. After practicing by following other people's code, I still need to go through the official documentation to fill in the gaps.
Chinese documentation:
https://segmentfault.com/p/1210000013564725
Following someone else's code, I wrote my own aiohttp-based crawler.
Target site:
http://www.ivsky.com/tupian/ziranfengguang/
It simply scrapes photos from ivsky.com (天堂圖片網).
I won't walk through the logic; here is the code.
import os
import time

import aiohttp
import asyncio
from scrapy import Selector

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.119 Safari/537.36'
}


# Fetch a page and return its text
async def fetch(session, url):
    async with session.get(url, headers=headers) as response:
        return await response.text(encoding='utf-8')


# Parse out every image URL on a listing page
async def url_parse(html):
    selector = Selector(text=html)
    url_list = selector.xpath('//ul[@class="ali"]//li//img/@src').extract()
    return url_list


# Download the images one by one
async def down_img(session, url_list):
    for each_url in url_list:  # fixed: the original iterated over an undefined name `img_list`
        print('Now collecting %s' % each_url)
        async with session.get(each_url, headers=headers) as response:
            img_response = await response.read()
            with open('./image/%s.jpg' % time.time(), 'wb') as file:
                file.write(img_response)


# Crawl one listing page from start to finish
async def start(url):
    async with aiohttp.ClientSession() as session:
        html = await fetch(session, url)      # get the page's html
        url_list = await url_parse(html)      # parse the image urls out of it
        await down_img(session, url_list)     # download the images


if __name__ == '__main__':
    os.makedirs('./image', exist_ok=True)  # the output directory must exist before writing
    each_url = "http://www.ivsky.com/tupian/ziranfengguang/index_{page}.html"
    full_urllist = [each_url.format(page=i) for i in range(1, 20)]
    event_loop = asyncio.get_event_loop()
    tasks = [start(url) for url in full_urllist]
    event_loop.run_until_complete(asyncio.wait(tasks))  # wait for all tasks to finish
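One thing worth noting: down_img fetches a page's images one after another, so the pages run concurrently but the images within a page do not. Below is a minimal sketch of a concurrent variant, reusing the headers and imports from the listing above; the helper fetch_one and the limit of 5 are my own assumptions, not something from the original code.

# Sketch only: download one page's images concurrently, capped by a semaphore
# so we don't fire one request per image at the site all at once.
async def down_img_concurrent(session, url_list, limit=5):
    semaphore = asyncio.Semaphore(limit)

    async def fetch_one(index, each_url):  # hypothetical helper, not in the original
        async with semaphore:
            async with session.get(each_url, headers=headers) as response:
                data = await response.read()
        # time.time() alone can collide when downloads finish together,
        # so mix the index into the filename as well
        with open('./image/%s_%s.jpg' % (time.time(), index), 'wb') as file:
            file.write(data)

    await asyncio.gather(*(fetch_one(i, u) for i, u in enumerate(url_list)))

Swapping this in for down_img inside start would leave the rest of the script unchanged.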
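Also, passing bare coroutines to asyncio.wait was deprecated in Python 3.8 and removed in 3.11, so on a newer interpreter the __main__ block can be rewritten with asyncio.run and asyncio.gather. A sketch of the same entry point under that assumption:

async def main():
    each_url = "http://www.ivsky.com/tupian/ziranfengguang/index_{page}.html"
    # gather schedules all the page crawls and waits for them to finish
    await asyncio.gather(*(start(each_url.format(page=i)) for i in range(1, 20)))

if __name__ == '__main__':
    os.makedirs('./image', exist_ok=True)
    asyncio.run(main())  # creates, runs, and closes the event loop itself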
I still need to properly work through this part of the aiohttp library; today I'm just sharing the code.