Preface
This section covers Splash and AJAX.
References:
- "Splash basics for crawlers" (爬蟲之Splash基礎篇)
- "Ajax explained" (Ajax詳解)
1. Splash
Introduction
Splash is a rendering service for JavaScript-heavy pages.
- It has a built-in browser and an HTTP API.
- It is built on Python 3 and the Twisted engine.
- It can process tasks asynchronously.
Installation (Linux and macOS only):
https://splash.readthedocs.io/en/stable/install.html
Docker must be installed first.
- Docker is an open-source container project written in Go, launched in early 2013.
- A container can be loosely thought of as a partition — like cup instant noodles, each serving isolated and self-contained.
Install:
docker pull scrapinghub/splash
Run:
docker run -p 8050:8050 -p 5023:5023 scrapinghub/splash
Visit:
http://localhost:8050
HTTP API
The HTTP endpoints Splash provides.
For fetching pages, the most important one is render.html:
curl 'http://localhost:8050/render.html?url=http://www.baidu.com/&timeout=30&wait=0.5'
# url: required, the page to render
# timeout: optional, overall timeout in seconds
# wait: optional, seconds to wait after the page finishes loading
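Building such a URL by hand gets fragile once the target URL carries its own query string: the target must be percent-encoded, or Splash will mistake the target's parameters for its own. A minimal sketch using only the standard library (the localhost address assumes the local Splash started above; `render_url` is a helper name introduced here for illustration):

```python
from urllib.parse import urlencode

SPLASH_RENDER = 'http://localhost:8050/render.html'  # assumes the local Splash above

def render_url(target, timeout=30, wait=0.5):
    """Build a render.html URL, percent-encoding the target URL."""
    # urlencode escapes ?, = and & inside `target`, so its query string survives
    return SPLASH_RENDER + '?' + urlencode({'url': target, 'timeout': timeout, 'wait': wait})

print(render_url('https://movie.douban.com/subject/26752088/comments?start=0&limit=20'))
```

Equivalently, `requests.get(SPLASH_RENDER, params={...})` performs the same encoding automatically.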
Fetch Toutiao and compare the rendered and un-rendered results:
import requests
from lxml import etree
url = 'http://localhost:8050/render.html?url=https://www.toutiao.com&timeout=30&wait=0.5'
# url = 'https://www.toutiao.com'
response = requests.get(url)
print(response.text)
tree = etree.HTML(response.text)
article_titles = tree.xpath('//div[@class="title-box"]/a/text()')
print(article_titles)
Scrape Douban comments on Dying to Survive (我不是藥神):
import csv
import time
import requests
from lxml import etree
from urllib.parse import quote

fw = open('douban_comments.csv', 'w')
writer = csv.writer(fw)
writer.writerow(['comment_time', 'comment_content'])
for i in range(0, 20):
    target = ('https://movie.douban.com/subject/26752088/comments'
              '?start={}&limit=20&sort=new_score&status=P').format(i * 20)
    # Percent-encode the target so its own query string is not mistaken
    # for Splash parameters; request `target` directly to skip rendering.
    url = 'http://localhost:8050/render.html?url={}&timeout=30&wait=0.5'.format(quote(target, safe=''))
    response = requests.get(url)
    tree = etree.HTML(response.text)
    comments = tree.xpath('//div[@class="comment"]')
    for item in comments:
        comment_time = item.xpath('./h3/span[2]/span[contains(@class,"comment-time")]/@title')[0]
        comment_time = int(time.mktime(time.strptime(comment_time, '%Y-%m-%d %H:%M:%S')))
        comment_content = item.xpath('./p/span/text()')[0].strip()
        print(comment_time)
        print(comment_content)
        writer.writerow([comment_time, comment_content])
fw.close()
Execute a Lua script from Python and scrape JD product listings (searching for 相機, "camera"):
import json
import requests
from lxml import etree
from urllib.parse import quote

lua = '''
function main(splash, args)
    local treat = require("treat")
    local response = splash:http_get("https://search.jd.com/Search?keyword=相機&enc=utf-8")
    return {
        html = treat.as_string(response.body),
        url = response.url,
        status = response.status
    }
end
'''
# For a remote deployment, replace localhost with the server's public
# address (not its internal address).
url = 'http://localhost:8050/execute?lua_source=' + quote(lua)
response = requests.get(url)
html = json.loads(response.text)['html']
tree = etree.HTML(html)
# Single products
products_1 = tree.xpath('//div[@class="gl-i-wrap"]')
for item in products_1:
    try:
        name_1 = item.xpath('./div[@class="p-name p-name-type-2"]/a/em/text()')[0]
        price_1 = item.xpath('./div[@class="p-price"]/strong/@data-price | ./div[@class="p-price"]/strong/i/text()')[0]
        print(name_1)
        print(price_1)
    except IndexError:  # some items lack a name or price node
        pass
# Bundles
products_2 = tree.xpath('//div[@class="tab-content-item tab-cnt-i-selected"]')
for item in products_2:
    name_2 = item.xpath('./div[@class="p-name p-name-type-2"]/a/em/text()')[0]
    price_2 = item.xpath('./div[@class="p-price"]/strong/@data-price | ./div[@class="p-price"]/strong/i/text()')[0]
    print(name_2)
    print(price_2)
scrapy-splash
A library that plugs Splash into Scrapy for dynamic scraping.
Docs:
https://github.com/scrapy-plugins/scrapy-splash
# Install
pip install scrapy-splash
# Create a Scrapy project:
scrapy startproject scrapysplashtest
# Create a spider:
scrapy genspider taobao www.taobao.com
# Edit the settings file:
# Add SPLASH_URL:
SPLASH_URL = 'http://localhost:8050'
# Add the downloader middlewares:
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
# Enable the deduplication spider middleware:
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
# Set the custom deduplication filter class:
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
# Configure the cache storage backend:
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
2. AJAX
- Asynchronous JavaScript and XML.
- With this technique, JS can exchange data with the backend server without reloading the page and display the result in the frontend.
- The advantages: pages open much faster, and from a development standpoint it allows frontend/backend separation, speeding up development.
How AJAX works:
- Send a request: through an API, JS sends an XMLHttpRequest (XHR) to the server.
- Parse the content: when JS receives the response, the content may be HTML or JSON.
- Render the page: JS manipulates the DOM tree, changing node content to update the page.
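From a scraper's point of view, these steps mean the browser can be skipped entirely: find the XHR endpoint in the browser's Network panel, request it directly, and parse what comes back. A minimal sketch of the parsing step, using a made-up JSON body of the kind such endpoints return (the payload is invented for illustration):

```python
import json

# A typical XHR response body; in practice it would come from
# requests.get(<endpoint seen in DevTools>).text
xhr_body = '{"data": {"items": [{"title": "breaking news", "published_at": 1546300800}]}}'

# Step 2: parse the content (JSON here; it can also be an HTML fragment)
data = json.loads(xhr_body)

# Step 3 is done by JS in the browser (patching the DOM);
# a scraper just extracts the fields it needs
for item in data['data']['items']:
    print(item['title'], item['published_at'])
```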
Scrape Dangdang book reviews:
import json
import requests
from lxml import etree

for i in range(1, 5):
    # url = 'http://product.dangdang.com/index.php?r=comment/list&productId=25340451&pageIndex=1'
    url = 'http://product.dangdang.com/index.php?r=comment/list&productId=25340451&categoryPath=01.07.07.04.00.00&mainProductId=25340451&mediumId=0&pageIndex={}'.format(i)
    header = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
    }
    response = requests.get(url, headers=header, timeout=5)
    print(response.text)
    result = json.loads(response.text)
    comment_html = result['data']['list']['html']
    tree = etree.HTML(comment_html)
    comments = tree.xpath('//div[@class="items_right"]')
    for item in comments:
        comment_time = item.xpath('./div[contains(@class,"starline")]/span[1]/text()')[0]
        comment_content = item.xpath('./div[contains(@class,"describe_detail")]/span[1]//text()')[0]
        print(comment_time)
        print(comment_content)
Scrape the Jinse Finance (金色財經) news-flash API:
import requests
import json

header = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}
url = 'https://api.jinse.com/v4/live/list?limit=20&reading=false&flag=up'
response = requests.get(url, headers=header, timeout=5)
result = json.loads(response.text)
print(result)
# JSON inspection tool: http://www.bejson.com/
for item in result['list'][0]['lives']:
    # print(item)
    timestamp = item['created_at']
    content = item['content']
    print(timestamp)
    print(content)
Scrape 36Kr news flashes:
import requests
import json

header = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}
url = 'https://36kr.com/api/newsflash?&per_page=20'
response = requests.get(url, headers=header, timeout=5)
result = json.loads(response.text)  # parse once and reuse
print(result)
data = result['data']
items = data['items']
for item in items:
    item_info = {}
    item_info['title'] = item['title']
    item_info['content'] = item['description']
    item_info['published_time'] = item['published_at']
    print(item_info)
Conclusion
These examples should give you a rough working understanding of Splash and AJAX.