Link:
http://product.auto.163.com/#DQ2001
Analysis
Inspecting the page shows that clicking an entry in the left-hand panel changes the data shown on the right, so we first need to collect the links for all brands on the left, and then crawl each brand's car-series images one by one.
Looking at the source of the left-hand area: the brand links differ only in their last segment, and the number in that segment matches the id attribute of the parent div.
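The URL pattern just described can be sketched in a few lines; the helper name `brand_url` is made up for illustration, but the mapping from the parent div's id to the brand page URL is exactly what step 1 below automates:

```python
# Sketch of the URL pattern described above: each brand page URL is a
# fixed prefix plus the numeric id taken from the brand's parent div.
BASE = "http://product.auto.163.com/new_daquan/brand/"

def brand_url(div_id):
    # div_id is the id attribute of the brand's parent div, e.g. "1685"
    return BASE + div_id + ".html"

print(brand_url("1685"))
# http://product.auto.163.com/new_daquan/brand/1685.html
```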
Implementation
1. Get all the links (a standalone script, unrelated to Scrapy)
import requests
from bs4 import BeautifulSoup

request = requests.get("http://product.auto.163.com/")
request.encoding = "GBK"  # the page is GBK-encoded
# choose a parser
soup = BeautifulSoup(request.content, 'html.parser')
lists = soup.select(".brand_cont .brand_name")
l = []
for brand in lists:
    data = "http://product.auto.163.com/new_daquan/brand/" + brand["id"] + ".html"
    l.append(data)
print(l)
# inspect the result
for x in l:
    print(x)
Result: (screenshot omitted)
2. Write the Scrapy code
1) items.py — the Item is the container that holds the scraped data:
import scrapy

class CarseriesItem(scrapy.Item):
    name = scrapy.Field()        # car-series name
    image_urls = scrapy.Field()  # car-series image URLs
    brand = scrapy.Field()       # car brand
2) settings.py
Enable the pipeline, turn off ROBOTSTXT_OBEY, and set the request headers.
ROBOTSTXT_OBEY = False

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Host': 'product.auto.163.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36'
}

# item-processing pipeline
ITEM_PIPELINES = {
    'carseries.pipelines.CarseriesPipeline': 200,
}

# directory where images are saved
IMAGES_STORE = "C:/Users/Lance/Desktop/carSeries/"
3) The spider: itcast.py
I won't walk through the parsing here; feel free to analyze it yourself, and message me if you'd like to discuss it.
import scrapy
from carseries.items import CarseriesItem

class ItcastSpider(scrapy.Spider):
    # spider name, must be unique
    name = 'itcast'
    # allowed domains (domain names only, not full URLs)
    allowed_domains = ['product.auto.163.com']
    # put the full link list obtained in step 1 here
    start_urls = ['http://product.auto.163.com/new_daquan/brand/1685.html']

    def parse(self, response):
        type = response.css("div[class='item-cont cur'] .item")
        for t in type:
            # grab the category name
            brand = t.css(".brand-c-title::text").extract()[0]
            lis = t.css("li")
            for li in lis:
                if li.css("img"):
                    item = CarseriesItem()
                    name = li.css("img::attr(title)").extract()[0]
                    img = li.css("img::attr(src)").extract()[0]
                    item['image_urls'] = [img]  # must be a list, or later stages may break when scraping
                    item['brand'] = brand
                    item['name'] = name
                    yield item
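The extraction logic inside parse() can be tried offline, outside Scrapy, with the same BeautifulSoup library used in step 1. The HTML snippet below is a hand-written stand-in that mimics the structure the spider expects, not markup copied from the real page:

```python
from bs4 import BeautifulSoup

# Made-up HTML mimicking the structure parse() targets: an .item block
# with a category title and <li> entries, only some of which hold an <img>.
html = """
<div class="item-cont cur">
  <div class="item">
    <div class="brand-c-title">SomeBrand</div>
    <ul>
      <li><img title="Series A" src="http://example.com/a.jpg"></li>
      <li>no image here</li>
    </ul>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
items = []
for t in soup.select("div.item-cont.cur .item"):
    brand = t.select_one(".brand-c-title").get_text()
    for li in t.select("li"):
        img = li.find("img")
        if img:  # same guard as the spider: skip <li> without an image
            items.append({"brand": brand, "name": img["title"],
                          "image_urls": [img["src"]]})
print(items)
```

Running this prints one dict per image, mirroring the fields of CarseriesItem.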
4) The pipeline: download the images in pipelines.py
import requests
import os
from scrapy.utils.project import get_project_settings  # read the project settings

IMAGES_STORE = get_project_settings().get("IMAGES_STORE")

class CarseriesPipeline(object):
    def process_item(self, item, spider):
        for img in item['image_urls']:
            path = IMAGES_STORE + item['brand'] + "/"
            if not os.path.exists(path):  # create the directory if it doesn't exist
                os.makedirs(path)
            path = path + item['name'] + ".jpg"
            response = requests.get(img)
            with open(path, 'wb') as f:  # the with block closes the file for us
                f.write(response.content)
        return item
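The directory-per-brand layout the pipeline builds can be exercised without any network access by writing a dummy byte string in place of a downloaded image. The helper name `save_image` is invented for this sketch, and a temp directory stands in for IMAGES_STORE:

```python
import os
import tempfile

# Stand-in for IMAGES_STORE: a throwaway temp directory instead of the
# desktop path configured in settings.py.
store = tempfile.mkdtemp()

def save_image(store, brand, name, data):
    # Same layout as the pipeline: <IMAGES_STORE>/<brand>/<name>.jpg
    folder = os.path.join(store, brand)
    os.makedirs(folder, exist_ok=True)  # create the brand folder if missing
    path = os.path.join(folder, name + ".jpg")
    with open(path, "wb") as f:
        f.write(data)  # the real pipeline writes response.content here
    return path

p = save_image(store, "SomeBrand", "Series A", b"fake-image-bytes")
print(os.path.exists(p))  # True
```

`os.makedirs(..., exist_ok=True)` collapses the pipeline's exists-then-create pair into one call and avoids a race if two items for the same brand arrive back to back.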
3. Run the spider
scrapy crawl itcast
4. Result: only part of the output is shown (screenshot omitted)