Using Scrapy with random request headers to crawl the Maoyan Movies TOP100, parse the data with XPath, and store it in MongoDB.
**1、**First, create a Scrapy project. Open CMD and run the creation command as shown in the figure below:
**2、**Once the project is created, open maoyanToptutorial in PyCharm. When crawling, we often use various disguises to lower the chance of being blocked by the target site's anti-scraping measures, and rotating the User-Agent at random is one such technique. When a Scrapy project runs, it first reads the configuration in settings.py. The framework also provides a downloader middleware, which is disabled by default in settings.py, so we first enable it by removing the comment:
settings.py

```python
# Rotate the User-Agent at random
DOWNLOADER_MIDDLEWARES = {
    'maoyanToptutorial.middlewares.MaoyantoptutorialDownloaderMiddleware': 543,
}
```
After removing the comment, go to middlewares.py to fill in the related code. That file contains a class MaoyantoptutorialDownloaderMiddleware; in it, define a method whose main job is to build a list of User-Agent strings:
```python
def __init__(self):
    # List of User-Agent strings
    self.USER_AGENT_LIST = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
    ]
```
The MaoyantoptutorialDownloaderMiddleware class also has a process_request method. To make it return a random User-Agent from the list, rewrite it as:
```python
import random  # add this import at the top of middlewares.py

def process_request(self, request, spider):
    request.headers['User-Agent'] = random.choice(self.USER_AGENT_LIST)
```
With that, random User-Agent rotation is set up in Scrapy. Next we analyze the URL request pattern of the target site, then extract the data with XPath and store it in MongoDB.
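The selection logic in process_request can be checked on its own; here is a minimal sketch (the shortened two-entry list is just sample data, not the full list above):

```python
import random

USER_AGENT_LIST = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
]

# random.choice picks one entry uniformly at random,
# exactly what process_request assigns to the request header
ua = random.choice(USER_AGENT_LIST)
print(ua in USER_AGENT_LIST)  # → True
```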
**3、**Open Chrome DevTools, switch to the Network tab, and refresh the page. Among everything the server returns, only the first request URL, https://maoyan.com/board/4, contains the data we need; switching to the Response tab confirms this, as shown in the figure below:
Next, click "next page" to load more entries and find the corresponding URL: https://maoyan.com/board/4?offset=10 . Comparing the URLs shows that the only field that changes is the number after offset, which increases by 10 each time. This gives a simplified URL template: https://maoyan.com/board/4?offset={offset, increasing in steps of 10} . With that, we can start writing the spider.
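As a quick sanity check on this pattern, the board's pages can be enumerated directly (a sketch: TOP100 with 10 movies per page gives offsets 0 through 90):

```python
# Ten pages of ten movies each: offset = 0, 10, ..., 90
urls = ['https://maoyan.com/board/4?offset={}'.format(i * 10) for i in range(10)]

print(urls[0])    # https://maoyan.com/board/4?offset=0
print(urls[-1])   # https://maoyan.com/board/4?offset=90
print(len(urls))  # 10
```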
**4、**From the Response we can see that the data we want lives inside certain nodes. To parse the page cleanly, we use the XPath syntax to select those nodes and then call the appropriate methods to get their text content or attribute values. Chrome's XPath Helper extension is also handy for checking that our matching rules are correct. Taking the movie title as an example, as shown in the figure below:
**5、**Knowing the extraction rules, we can write the corresponding code. First, define the item object in items.py:
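The title rule can be tried out offline with lxml against a fragment that mimics the board's markup (the HTML below is a hand-written approximation of the page structure, not the real response):

```python
from lxml import etree

# Hand-written fragment approximating the Maoyan board markup
html = '''
<div class="main">
  <dl>
    <dd>
      <a class="image-link" title="霸王别姬" href="/films/1203"></a>
      <div class="movie-item-info">
        <p class="star">主演:张国荣,张丰毅,巩俐</p>
        <p class="releasetime">上映时间:1993-01-01</p>
      </div>
    </dd>
  </dl>
</div>
'''

tree = etree.HTML(html)
# Same rule as in the spider: take the title attribute of the image link
titles = tree.xpath('//div[@class="main"]/dl/dd//a[@class="image-link"]/@title')
print(titles)  # → ['霸王别姬']
```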
```python
import scrapy

class MaoyantoptutorialItem(scrapy.Item):
    title = scrapy.Field()        # movie title
    star = scrapy.Field()         # cast
    releasetime = scrapy.Field()  # release date
    score = scrapy.Field()        # rating
```
To store the data in MongoDB, we also add the following code to pipelines.py:
pipelines.py

```python
import pymongo

class MongoPipeline(object):
    def __init__(self, mongo_url, mongo_db):
        self.mongo_url = mongo_url
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection parameters from settings.py
        return cls(
            mongo_url=crawler.settings.get('MONGO_URL'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_url)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        # Use the item class name as the collection name
        name = item.__class__.__name__
        # insert_one replaces the deprecated Collection.insert
        self.db[name].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()
```
Then configure the database connection parameters in settings.py by adding:
settings.py

```python
MONGO_URL = 'localhost'
MONGO_DB = 'MAOYANTOP'
# Enable MongoPipeline
ITEM_PIPELINES = {
    'maoyanToptutorial.pipelines.MongoPipeline': 300,
}
```
**6、**With all of the above in place, go to maoyanTop.py under the spiders directory and implement the main crawling logic:
maoyanTop.py

```python
# -*- coding: utf-8 -*-
import scrapy
from maoyanToptutorial.items import MaoyantoptutorialItem

class MaoyantopSpider(scrapy.Spider):
    name = 'maoyanTop'
    # allowed_domains must be a bare domain, not a full URL
    allowed_domains = ['maoyan.com']
    start_urls = ['https://maoyan.com/board/4/?offset=']
    # Prefix for the next-page URL
    next_base_url = 'https://maoyan.com/board/4'

    def parse(self, response):
        if response:
            # All movie nodes on the current page
            movies = response.xpath('//div[@class="main"]/dl/dd')
            for movie in movies:
                # Create a fresh item for each movie so yielded items are independent
                item = MaoyantoptutorialItem()
                title = movie.xpath('.//a[@class="image-link"]/@title').extract_first()
                star = movie.xpath('.//div[@class="movie-item-info"]/p[@class="star"]/text()').extract_first()
                releasetime = movie.xpath('.//div[@class="movie-item-info"]/p[@class="releasetime"]/text()').extract_first()
                score_1 = movie.xpath('.//div[contains(@class,"movie-item-number")]/p[@class="score"]/i[@class="integer"]/text()').extract_first()
                score_2 = movie.xpath('.//div[contains(@class,"movie-item-number")]/p[@class="score"]/i[@class="fraction"]/text()').extract_first()
                item['title'] = title
                item['star'] = star.strip()[3:]                # drop the "主演:" prefix
                item['releasetime'] = releasetime.strip()[5:]  # drop the "上映時間:" prefix
                item['score'] = score_1 + score_2
                yield item
            # Handle the next page
            next = response.xpath('.').re_first(r'href="(.*?)">下一頁</a>')
            if next:
                next_url = self.next_base_url + next
                # Scrapy deduplicates request URLs; dont_filter=True exempts this one
                yield scrapy.Request(url=next_url, callback=self.parse, dont_filter=True)
```
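The slicing in parse relies on the fixed-length prefixes in the page text; a quick sketch with sample strings (hand-written values approximating the scraped text, not real responses):

```python
# Sample field text as it appears on the page, with surrounding whitespace
star = '\n                主演:张国荣,张丰毅,巩俐\n            '
releasetime = '上映时间:1993-01-01'

print(star.strip()[3:])         # "主演:" is 3 characters → 张国荣,张丰毅,巩俐
print(releasetime.strip()[5:])  # "上映时间:" is 5 characters → 1993-01-01
```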
**7、**The crawl results are shown in the figure below: