Scrapy: random request headers, scraping the Maoyan Movies TOP100, parsing with XPath, and storing in MongoDB

This post sets up a random User-Agent in a Scrapy project, crawls the Maoyan Movies TOP100 board, parses the pages with XPath, and writes the results to MongoDB.

**1.** First create a Scrapy project: open CMD and run `scrapy startproject maoyanToptutorial` (the spider itself, maoyanTop, can then be generated with `scrapy genspider maoyanTop maoyan.com` or written by hand), as shown below:
(Screenshot: creating the Scrapy project from the command line.)
**2.** Open the newly created maoyanToptutorial project in PyCharm. When crawling we usually apply some disguise to lower the chance of being blocked by the target site's anti-scraping measures, and rotating the User-Agent at random is one such measure. When the Scrapy project runs it first reads the configuration in settings.py; the framework provides a downloader-middleware hook there, but the project's own downloader middleware is commented out by default, so enable it first (i.e. remove the comment):
settings.py

# Rotate the User-Agent at random via the downloader middleware
DOWNLOADER_MIDDLEWARES = {
   'maoyanToptutorial.middlewares.MaoyantoptutorialDownloaderMiddleware': 543,
}

With the comment removed, go to middlewares.py to fill in the logic. The file already contains a class named MaoyantoptutorialDownloaderMiddleware; add an __init__ method to it that builds a list of User-Agent strings:

    def __init__(self):
        # Pool of User-Agent strings to choose from
        self.USER_AGENT_LIST = [
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
            "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
        ]

The same class also contains a process_request method; rewrite it so that each outgoing request gets a User-Agent picked at random from the list (this requires `import random` at the top of middlewares.py):

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.USER_AGENT_LIST)
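To confirm the rotation is actually happening, a minimal check (my own sketch, not part of the original project) is to log the header that was just set; Scrapy prints it when the log level allows DEBUG messages:

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.USER_AGENT_LIST)
        # Temporary sanity check: record which UA this particular request will carry.
        spider.logger.debug('Using User-Agent: %s', request.headers['User-Agent'])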

With that, Scrapy now sends a random User-Agent with every request. Next we analyse the target site's URL pattern, then extract the data with XPath and store it in MongoDB.

**3.** Open Chrome's developer tools, switch to the Network tab and refresh the page. Of everything the server returns, only the first request, https://maoyan.com/board/4, contains the data we need; switching to its Response tab shows it, as in the screenshot below:
(Screenshot: the board page response in Chrome DevTools.)
Clicking through to the next page loads more entries, and the corresponding URL is https://maoyan.com/board/4?offset=10 . Comparing the URLs shows that the only part that changes is the number after offset, which grows by 10 per page, so the pages can be summarised as https://maoyan.com/board/4?offset={n*10}, as the small listing below illustrates. With that, we can start writing the spider.
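Since the board holds 100 films at 10 per page, the full set of page URLs can be enumerated directly (a quick illustration of the pattern, not code the project needs):

# Offsets run 0, 10, ..., 90 for the ten pages of the TOP100 board.
page_urls = ['https://maoyan.com/board/4?offset={}'.format(offset)
             for offset in range(0, 100, 10)]
print(page_urls[0])   # https://maoyan.com/board/4?offset=0
print(page_urls[-1])  # https://maoyan.com/board/4?offset=90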

**4.** The Response tells us which nodes hold the data we want. To parse the page we use XPath to select those nodes and then read their text content or attribute values; Chrome's XPath Helper extension is useful for checking that the expressions match as intended. Taking the film title as an example, as shown below:
(Screenshot: verifying the title XPath with XPath Helper.)
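The title rule used later in the spider can be tried out against a stripped-down fragment of the page; the HTML below is a simplified stand-in I have assumed for one board entry, not a copy of the real markup:

from scrapy.selector import Selector

# Simplified stand-in for a single <dd> entry on the board page.
html = '''
<dl class="board-wrapper">
  <dd>
    <a class="image-link" href="/films/1203" title="Example Film"></a>
  </dd>
</dl>
'''
sel = Selector(text=html)
# Same expression the spider uses (relative to each <dd> node) for the film title.
print(sel.xpath('//dd//a[@class="image-link"]/@title').extract_first())  # Example Film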
**5.** With the extraction rules worked out, we can write the code. First define the item in items.py:

import scrapy

class MaoyantoptutorialItem(scrapy.Item):
    title = scrapy.Field()        # film title
    star = scrapy.Field()         # cast
    releasetime = scrapy.Field()  # release date
    score = scrapy.Field()        # rating
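Scrapy items behave much like dictionaries, which is what the MongoDB pipeline in the next step relies on when it calls dict(item); a tiny illustration (the values here are placeholders):

item = MaoyantoptutorialItem()
item['title'] = 'Some Film'
item['score'] = '9.5'
print(dict(item))  # {'title': 'Some Film', 'score': '9.5'}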

To write the data into MongoDB, add the following pipeline to pipelines.py:
pipelines.py

import pymongo

class MongoPipeline(object):
    def __init__(self, mongo_url, mongo_db):
        self.mongo_url = mongo_url
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection parameters from settings.py
        return cls(
            mongo_url=crawler.settings.get('MONGO_URL'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_url)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        # Use the item class name as the collection name.
        name = item.__class__.__name__
        # insert_one replaces Collection.insert, which newer PyMongo removed.
        self.db[name].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()

Then configure the database connection parameters in settings.py by adding:
settings.py

MONGO_URL = 'localhost'
MONGO_DB = 'MAOYANTOP'
# Enable MongoPipeline
ITEM_PIPELINES = {
   'maoyanToptutorial.pipelines.MongoPipeline': 300,
}
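If your MongoDB instance is not running with the defaults, MONGO_URL can also be a full connection URI (an assumption about your deployment, not something the tutorial requires), for example:

MONGO_URL = 'mongodb://localhost:27017'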

**6.** With all of the above in place, implement the crawl logic in maoyanTop.py under the spiders folder:
maoyanTop.py

# -*- coding: utf-8 -*-
import scrapy
from maoyanToptutorial.items import MaoyantoptutorialItem

class MaoyantopSpider(scrapy.Spider):
    name = 'maoyanTop'
    allowed_domains = ['maoyan.com']
    # First page of the board (offset 0)
    start_urls = ['https://maoyan.com/board/4?offset=0']
    # Prefix for building the next-page URL
    next_base_url = 'https://maoyan.com/board/4'

    def parse(self, response):
        if response:
            # One <dd> node per film on the current page
            movies = response.xpath('//div[@class="main"]/dl/dd')
            for movie in movies:
                item = MaoyantoptutorialItem()
                title = movie.xpath('.//a[@class="image-link"]/@title').extract_first()
                star = movie.xpath('.//div[@class="movie-item-info"]/p[@class="star"]/text()').extract_first()
                releasetime = movie.xpath('.//div[@class="movie-item-info"]/p[@class="releasetime"]/text()').extract_first()
                score_1 = movie.xpath('.//div[contains(@class,"movie-item-number")]/p[@class="score"]/i[@class="integer"]/text()').extract_first()
                score_2 = movie.xpath('.//div[contains(@class,"movie-item-number")]/p[@class="score"]/i[@class="fraction"]/text()').extract_first()

                item['title'] = title
                item['star'] = star.strip()[3:]                # drop the "主演:" prefix
                item['releasetime'] = releasetime.strip()[5:]  # drop the "上映时间:" prefix
                item['score'] = score_1 + score_2
                yield item
            # Handle pagination: pull the href of the next-page link ("下一页" on the page)
            next_page = response.xpath('.').re_first(r'href="(.*?)">下一页</a>')
            if next_page:
                next_url = self.next_base_url + next_page
                # Scrapy deduplicates request URLs; dont_filter=True exempts this request
                yield scrapy.Request(url=next_url, callback=self.parse, dont_filter=True)
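After running the spider with scrapy crawl maoyanTop, one quick way to confirm that all 100 entries reached MongoDB is a short PyMongo query; a sketch assuming the local settings above and the collection name the pipeline derives from the item class:

import pymongo

client = pymongo.MongoClient('localhost')
collection = client['MAOYANTOP']['MaoyantoptutorialItem']
print(collection.count_documents({}))       # should print 100
print(collection.find_one({}, {'_id': 0}))  # peek at one stored record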

**7.** The results of the crawl are shown below:
(Screenshot: the crawl results.)
