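The previous article scraped this hero data with nodejs + puppeteer + chromium. This one scrapes the same page with much the same approach, only in a different language, so it can serve as a reference. Personally I still find nodejs handier for crawlers, though maybe my Python is just too weak.
Environment and third-party packages for this example: python3, pycharm, selenium 2.48.0 (I got errors with 3.0+, since newer versions deprecated and later dropped PhantomJS; chrome or firefox also work, but you may need to install a driver separately), scrapy 1.6.0
If you are not yet familiar with scrapy and phantomjs, these two blog posts explain them in detail: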
https://www.cnblogs.com/kongzhagen/p/6549053.html
https://jiayi.space/post/scrapy-phantomjs-seleniumdong-tai-pa-chong
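I won't go into installation. Install scrapy first, then install whatever a later error message asks for; on Windows you usually also need pywin32.
Create the project: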
scrapy startproject loldocument
cd loldocument
scrapy genspider hero lol.qq.com/data/info-heros.shtml
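This generates the project directory:
spiders/hero.py requests the hero list page, collects the detail-page url of each hero, then visits each url in turn and extracts the data we want: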
# -*- coding: utf-8 -*-
import os
import re
import urllib.request

import scrapy

from loldocument.items import LoldocumentItem


class HeroSpider(scrapy.Spider):
    name = 'hero'
    allowed_domains = ['lol.qq.com']
    start_urls = ['https://lol.qq.com/data/info-heros.shtml']

    def parse(self, response):
        heros = response.xpath('//ul[@id="jSearchHeroDiv"]/li')
        # heros = [heros[0], heros[1]]  # uncomment to test with two heroes only
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0'}
        path = r'F:\loldocument\hero_logo'  # directory where hero avatars are saved
        if not os.path.exists(path):
            os.makedirs(path)
        for hero in heros:  # iterate over each <li>
            imgu = 'http:' + hero.xpath("./a/img/@src").extract_first()
            title = hero.xpath("./a/@title").extract_first()
            # Download the avatar (urlopen blocks the crawl while it runs)
            req = urllib.request.Request(url=imgu, headers=headers)
            res = urllib.request.urlopen(req)
            file_name = os.path.join(path, title + '.jpg')
            with open(file_name, 'wb') as fp:
                fp.write(res.read())
            # Request the detail page and pass the title along in meta
            url = 'https://lol.qq.com/data/' + hero.xpath("./a/@href").extract_first()
            request = scrapy.Request(url=url, callback=self.parse_detail)
            request.meta['PhantomJS'] = True
            request.meta['title'] = title
            yield request

    def parse_detail(self, response):
        # Hero details
        item = LoldocumentItem()
        item['title'] = response.meta['title']
        item['DATAname'] = response.xpath('//h1[@id="DATAname"]/text()').extract_first()
        item['DATAtitle'] = response.xpath('//h2[@id="DATAtitle"]/text()').extract_first()
        item['DATAtags'] = response.xpath('//div[@id="DATAtags"]/span/text()').extract()
        infokeys = response.xpath('//dl[@id="DATAinfo"]/dt/text()').extract()
        infovalues = response.xpath('//dl[@id="DATAinfo"]/dd/i/@style').extract()
        item['DATAinfo'] = {}  # hero attributes: label -> bar width
        for i, v in enumerate(infokeys):
            item['DATAinfo'][v] = re.sub(r'width:', "", infovalues[i])
        yield item
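The spider builds its URLs by string concatenation and stores each attribute as the raw remainder of a style string. As a rough sketch of an alternative, the two helpers below use only the standard library. The sample href, the image host and the style value are made up for illustration and are not taken from the real page.

```python
import re
from urllib.parse import urljoin

LIST_URL = "https://lol.qq.com/data/info-heros.shtml"


def absolute_url(base, link):
    """Resolve a relative or protocol-relative link against the page it came from."""
    return urljoin(base, link)


def width_to_percent(style):
    """Extract the number from a style such as 'width:60%'; return None if absent."""
    match = re.search(r"width:\s*([\d.]+)%", style or "")
    return float(match.group(1)) if match else None


# Hypothetical inputs, only to show the behaviour.
print(absolute_url(LIST_URL, "info-defail.shtml?id=Annie"))
print(absolute_url(LIST_URL, "//example.com/images/Annie.png"))
print(width_to_percent("width:60%"))
```

Note that urljoin gives a protocol-relative link the scheme of the base page (https here), whereas the spider prepends 'http:'. Inside a Scrapy callback, response.urljoin(link) should do the same job without passing the base url yourself.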
items.py defines the fields to store in the item:
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class LoldocumentItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    DATAname = scrapy.Field()
    DATAtitle = scrapy.Field()
    DATAtags = scrapy.Field()
    DATAinfo = scrapy.Field()
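Because selenium is used to fetch the asynchronously loaded data, a separate downloader middleware has to be defined. Under /loldocument, create a new Python package named middlewares and add middleware.py to it (this matches the path loldocument.middlewares.middleware used in settings.py):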
import time

from scrapy.http import HtmlResponse
from selenium import webdriver


class JavaScriptMiddleware(object):
    def process_request(self, request, spider):
        print("PhantomJS is starting...")
        driver = webdriver.PhantomJS()  # which browser to use
        # driver = webdriver.Firefox()
        try:
            driver.get(request.url)
            time.sleep(1)
            # Run JS to mimic a user: scroll the page down (scrollTop = 1000)
            js = "var q=document.documentElement.scrollTop=1000"
            driver.execute_script(js)
            time.sleep(1)
            body = driver.page_source
            if 'PhantomJS' in request.meta:
                print("Visiting detail page: " + request.url)
            else:
                print("Visiting: " + request.url)
            # Returning a response here makes Scrapy skip its own download
            return HtmlResponse(driver.current_url, body=body, encoding='utf-8', request=request)
        finally:
            driver.quit()  # close the browser so PhantomJS processes do not pile up
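The middleware waits with fixed time.sleep(1) calls, which can be too short on a slow page and wasted time on a fast one. A common alternative is to poll for a condition with a timeout. The helper below is a minimal standard-library sketch of that idea; wait_until and its parameters are hypothetical names, not part of selenium or scrapy.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.2):
    """Call condition() repeatedly until it returns a truthy value or timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(interval)


# Stand-in for a page whose content appears after a short delay.
ready_at = time.monotonic() + 0.5
print(wait_until(lambda: time.monotonic() >= ready_at, timeout=2.0))
```

In the middleware the condition could be something like lambda: driver.find_elements_by_id("DATAname"), assuming that element only appears once the detail data has loaded. Selenium also ships its own WebDriverWait class in selenium.webdriver.support.ui for the same purpose.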
In settings.py, change three settings:
ROBOTSTXT_OBEY = False
DOWNLOADER_MIDDLEWARES = {
    # key: path to the middleware class; value: the middleware's order
    'loldocument.middlewares.middleware.JavaScriptMiddleware': 543,
    # None disables this built-in middleware
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
ITEM_PIPELINES = {
    'loldocument.pipelines.LoldocumentPipeline': 100,
}
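To make the order numbers and the None value a bit more concrete: as far as I understand it, Scrapy merges DOWNLOADER_MIDDLEWARES with its built-in defaults, drops every entry set to None, and calls process_request in ascending order of the remaining values. The toy function below only imitates that merge. It is not Scrapy code, and the numbers in BASE are an illustrative subset that I am assuming rather than quoting.

```python
def effective_middlewares(base, custom):
    """Merge two {path: order} dicts, drop entries set to None, sort by order."""
    merged = dict(base)
    merged.update(custom)
    enabled = {path: order for path, order in merged.items() if order is not None}
    return sorted(enabled, key=enabled.get)


# Illustrative subset of built-in defaults (the numbers are assumptions).
BASE = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
}
CUSTOM = {
    'loldocument.middlewares.middleware.JavaScriptMiddleware': 543,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}

for path in effective_middlewares(BASE, CUSTOM):
    print(path)
```

With these inputs the user-agent middleware disappears and JavaScriptMiddleware (543) runs before the retry middleware (550).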
Data processing in pipelines.py:
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json


class LoldocumentPipeline(object):
    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False)
        # utf-8 so the Chinese text is written the same way on every platform
        with open('hero_detail.txt', 'a', encoding='utf-8') as txt:
            txt.write(line + '\n')  # one JSON object per line
        return item  # pass the item on to any later pipeline
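The pipeline appends one json.dumps result per item, so hero_detail.txt ends up holding several JSON objects in a row rather than a single JSON document, and json.load cannot read it in one go. One way to read it back is json.JSONDecoder.raw_decode, which parses one value and reports where it stopped. The sketch below assumes nothing but whitespace between objects; iter_json_objects and the sample string are hypothetical.

```python
import json


def iter_json_objects(text):
    """Yield each JSON value from text that holds several values back to back."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # Skip any whitespace (including newlines) between values.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Made-up sample in the same shape as the scraped items.
sample = '{"title": "A", "DATAtags": ["x"]}{"title": "B", "DATAtags": []}\n'
for hero in iter_json_objects(sample):
    print(hero["title"])
```

To use it on the real output, read the whole file first, for example with open('hero_detail.txt', encoding='utf-8').read(), and pass the resulting string in.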
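Run scrapy crawl hero to start the crawl. The console prints details and any exceptions as it runs; add --nolog to suppress that output.
The resulting data file hero_detail.txt and the hero avatar directory hero_logo: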