day070 CrawlSpider

Basic usage of CrawlSpider, plus Request, Response, DownloaderMiddlewares, and Settings

CrawlSpider

  • All custom spider classes inherit from scrapy.Spider; Spider is the most basic spider class
  • CrawlSpider is an extension of the Spider class
  • Compared with Spider, CrawlSpider adds the Rule class, which is used to match and extract links from pages

The Rule class

class scrapy.spiders.Rule(
        link_extractor, 
        callback = None, 
        cb_kwargs = None, 
        follow = None, 
        process_links = None, 
        process_request = None
)
  • The most commonly used parameters (see the sketch after this list):
    • link_extractor: the link-extractor object that defines, via regular expressions, which links to match
    • callback: the callback function, which receives a response as its first argument. Note: when writing crawl rules, avoid using parse as the callback, because CrawlSpider uses the parse() method to implement its own logic; overriding parse() breaks the CrawlSpider.
    • follow: whether to follow links extracted from matched pages. None means the default: True when callback is None, otherwise False.
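
A minimal sketch of how Rule objects are declared in a CrawlSpider subclass (the spider name, URL, and allow pattern here are illustrative):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ExampleSpider(CrawlSpider):
    name = 'example'
    start_urls = ['https://example.com/list?start=0']

    rules = (
        # Follow pagination links and hand each matched page to parse_page
        Rule(LinkExtractor(allow=r'start=\d+'), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        # The callback must not be named parse(); CrawlSpider reserves that name
        yield {'url': response.url}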

The LinkExtractor class

  • LinkExtractor is the class used for the link_extractor argument (the first argument) when creating a Rule object; its single job is to extract links from a web page
class scrapy.linkextractors.LinkExtractor(
    allow = (),
    deny = (),
    allow_domains = (),
    deny_domains = (),
    deny_extensions = None,
    restrict_xpaths = (),
    tags = ('a','area'),
    attrs = ('href',),
    canonicalize = True,
    unique = True,
    process_value = None
)

The most commonly used parameter is allow: a regular expression (or tuple of them) that extracted links must match.
Less commonly used parameters:

deny: URLs matching these regular expressions are never extracted (takes precedence over allow).

allow_domains: only links within these domains are extracted.

deny_domains: links within these domains are never extracted.

restrict_xpaths: XPath expressions that, together with allow, restrict which regions of the page links are extracted from.

How to use the LinkExtractor class:

  1. Create an extractor object, passing the matching rule at initialization: link = LinkExtractor(allow=r'start=\d+')
  2. Call extract_links(response) on that object; it returns a list of the matched links: result = link.extract_links(response) (a fuller sketch follows this list)
  3. Note how similar this is to using a compiled regular expression:
re1 = re.compile('pattern')
result = re1.findall('string to search')
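
A minimal sketch of using a LinkExtractor directly inside a spider callback (the allow pattern is illustrative; extract_links() returns Link objects):

from scrapy.linkextractors import LinkExtractor

# inside a scrapy.Spider subclass:
def parse(self, response):
    # Keep only pagination links; deny or restrict_xpaths could narrow this further
    link = LinkExtractor(allow=r'start=\d+')
    for lnk in link.extract_links(response):
        # Each result is a Link object with .url and .text
        print(lnk.url, lnk.text)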

Example: crawling the Tencent recruitment site

  • Goal: scrape the listing-page information and the detail-page information

tencent.py

# -*- coding: utf-8 -*-
import scrapy

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from TestCrawlSpider.items import TestcrawlspiderListItem, TestcrawlspiderDetailItem

class TencentSpider(CrawlSpider):
    name = 'tencent3'
    allowed_domains = ['tencent.com']
    start_urls = ['https://hr.tencent.com/position.php?keywords=python&tid=87&lid=2156&start=0']

    # rules: the link-extraction rules
    rules = (
        # First, extract the listing pages
        Rule(LinkExtractor(allow=r'start=\d+'), callback='parse_list', follow=True),

        # Then, extract the detail pages
        Rule(LinkExtractor(allow=r'position_detail.php'), callback='parse_detail', follow=False),
    )

    def parse_list(self, response):
        # Parse out the table rows
        tr_list = response.xpath('//tr[@class="even"]|//tr[@class="odd"]')
        for tr in tr_list:
            item = TestcrawlspiderListItem()
            item['work_name'] = tr.xpath('./td[1]/a/text()').extract_first()
            item['work_type'] = tr.xpath('./td[2]/text()').extract_first()
            item['work_count'] = tr.xpath('./td[3]/text()').extract_first()
            item['work_place'] = tr.xpath('./td[4]/text()').extract_first()
            item['work_time'] = tr.xpath('./td[5]/text()').extract_first()

            # Parsed data goes to the engine and then on to the pipelines
            yield item

    def parse_detail(self, response):
        # Parse the detail-page data
        ul_list = response.xpath('//ul[@class="squareli"]')
        item = TestcrawlspiderDetailItem()
        item['work_duty'] = ''.join(ul_list[0].xpath('./li/text()').extract())
        item['work_requir'] = ''.join(ul_list[1].xpath('./li/text()').extract())

        # Parsed data goes to the engine and then on to the pipelines
        yield item
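
To run the spider from the project root (assuming the project is named TestCrawlSpider as above):

scrapy crawl tencent3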

items.py


import scrapy


class TestcrawlspiderListItem(scrapy.Item):
    # Target fields for the listing page
    work_name = scrapy.Field()
    work_type = scrapy.Field()
    work_count = scrapy.Field()
    work_place = scrapy.Field()
    work_time = scrapy.Field()


class TestcrawlspiderDetailItem(scrapy.Item):
    # Target fields for the detail page
    work_duty = scrapy.Field()
    work_requir = scrapy.Field()

pipelines.py

import json
from TestCrawlSpider.items import TestcrawlspiderDetailItem, TestcrawlspiderListItem


class TestcrawlspiderListPipeline(object):
    def open_spider(self, spider):
        self.file = open('list.json', 'w')

    def process_item(self, item, spider):
        # Only store the item if it comes from a listing page
        if isinstance(item, TestcrawlspiderListItem):
            dict_item = dict(item)
            str_item = json.dumps(dict_item) + '\n'
            self.file.write(str_item)
        return item

    def close_spider(self, spider):
        self.file.close()


class TestcrawlspiderDetailPipeline(object):
    def open_spider(self, spider):
        self.file = open('detail.json', 'w')

    def process_item(self, item, spider):
        # Only store the item if it comes from a detail page
        if isinstance(item, TestcrawlspiderDetailItem):
            dict_item = dict(item)
            str_item = json.dumps(dict_item) + '\n'
            self.file.write(str_item)
        return item

    def close_spider(self, spider):
        self.file.close()

middlewares.py

import scrapy
from TestCrawlSpider.settings import USER_AGENT_LIST
import random


# Downloader middleware that sets a random User-Agent
class UserAgentMiddleWares(object):
    def process_request(self, request, spider):
        # Pick a random User-Agent
        user_agent = random.choice(USER_AGENT_LIST)

        # Set the User-Agent header on the request
        request.headers['User-Agent'] = user_agent
        print('***' * 30)
        print(user_agent)


# Downloader middleware that sets a (random) proxy
class ProxyMiddleWares(object):
    def process_request(self, request, spider):
        # Pick a proxy (paid or free); hard-coded here for illustration
        proxy = 'http://162.138.3.1:8888'
        # Set the proxy on the request's meta
        request.meta['proxy'] = proxy

        print('----' * 30)
        print(proxy)

settings.py

BOT_NAME = 'TestCrawlSpider'

SPIDER_MODULES = ['TestCrawlSpider.spiders']
NEWSPIDER_MODULE = 'TestCrawlSpider.spiders'

DOWNLOADER_MIDDLEWARES = {
    'TestCrawlSpider.middlewares.UserAgentMiddleWares': 543,
    'TestCrawlSpider.middlewares.ProxyMiddleWares': 888,
}

ITEM_PIPELINES = {
    'TestCrawlSpider.pipelines.TestcrawlspiderListPipeline': 300,
    'TestCrawlSpider.pipelines.TestcrawlspiderDetailPipeline': 301,
}

USER_AGENT_LIST = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
]

Request & Response

  • A request object is an instance of the Request class: it bundles all the information needed to issue a request.
  • Common parameters:

url: the URL to request and process next

callback: the function that will handle the Response returned for this request

method: usually not needed; defaults to GET; may be set to "GET", "POST", "PUT", etc., and must be uppercase

headers: the request headers; usually not needed. Typical content:
        # These will look familiar if you have ever written a crawler by hand
        Host: media.readthedocs.org
        User-Agent: Mozilla/5.0 (Windows NT 6.2; WOW64; rv:33.0) Gecko/20100101 Firefox/33.0
        Accept: text/css,*/*;q=0.1
        Accept-Language: zh-cn,zh;q=0.8,en-us;q=0.5,en;q=0.3
        Accept-Encoding: gzip, deflate
        Referer: http://scrapy-chs.readthedocs.org/zh_CN/0.24/
        Cookie: _ga=GA1.2.1612165614.1415584110;
        Connection: keep-alive
        If-Modified-Since: Mon, 25 Aug 2014 21:59:35 GMT
        Cache-Control: max-age=0

meta: frequently used; a dict for passing data between requests (a sketch combining meta and errback follows this parameter list)

        request_with_cookies = Request(
            url="http://www.example.com",
            cookies={'currency': 'USD', 'country': 'UY'},
            meta={'dont_merge_cookies': True}
        )

encoding: the default 'utf-8' is usually fine
errback: the function to call if an error occurs while handling the request
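
A minimal sketch, inside a scrapy.Spider subclass, of passing data between callbacks via meta and attaching an errback (URLs, field names, and method names are illustrative):

import scrapy

class SketchSpider(scrapy.Spider):
    name = 'request_sketch'
    start_urls = ['https://example.com/list']

    def parse(self, response):
        for href in response.xpath('//a/@href').extract():
            yield scrapy.Request(
                url=response.urljoin(href),
                callback=self.parse_detail,
                meta={'from_page': response.url},  # carried over to the next callback
                errback=self.on_error,
            )

    def parse_detail(self, response):
        yield {'detail_url': response.url, 'from_page': response.meta['from_page']}

    def on_error(self, failure):
        # failure.request is the Request that failed
        self.logger.error('Request failed: %s', failure.request.url)
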
  • A response object is an instance of the Response class: it contains the information the server sent back to the client.

  • Common attributes:

status: the HTTP status code
body: the response body (bytes)
url: the URL of the response
request: the Request object that produced this response
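
A minimal sketch of reading these attributes in a callback (assumed to live inside a scrapy.Spider subclass):

def parse(self, response):
    # Basic response information
    self.logger.info('status=%s url=%s', response.status, response.url)
    # body is bytes; text is the decoded string form
    html = response.text
    # The Request object that produced this response
    original_request = response.request
    yield {'url': response.url, 'status': response.status, 'length': len(html)}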

DownloaderMiddlewares

  • A quick review of the Scrapy architecture diagram

(Figure: Scrapy architecture diagram, scrapy框架圖.png)

  • Where there are crawlers there are anti-crawler measures, and where there are anti-crawler measures there are counter-measures; the struggle between the two never ends.
  • Scrapy's basic features cover basic crawling tasks; with CrawlSpider you can crawl large numbers of listing and detail pages automatically just by setting a filter condition and follow=True. Most sites, however, are not that easy to crawl. That is when extra work is needed: configuring middleware on top of the basic setup.
  • Scrapy has two kinds of middleware, spider middleware and downloader middleware. The downloader middleware is the more commonly used one: it lets you modify a request before it is actually sent.
  • Common counter-anti-crawler techniques:

    • Simulate user login (obtain cookies automatically) / disable cookies
    • Set a random User-Agent
    • Set a random proxy IP
    • Throttle the request rate
    • Drive a browser with Selenium and extract data from the rendered page
    • ...
  • Example: logging in to GitHub and saving the logged-in page

# -*- coding: utf-8 -*-
import scrapy


class GithubSpider(scrapy.Spider):
    name = 'github2'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/login']

    # Simulated login: Scrapy keeps the session cookies automatically
    def parse(self, response):
        login_url = 'https://github.com/session'

        formdata = {
            "login": '賬戶名',
            'password': '密碼',
            'commit': 'Sign in',
            'utf8': '✓',
            'authenticity_token': response.xpath('//*[@id="login"]/form/input[2]/@value').extract_first()
        }

        # Send the login request (POST)
        yield scrapy.FormRequest(url=login_url, formdata=formdata, callback=self.parse_logined)

    def parse_logined(self, response):

        with open('222githublogin.html', 'wb') as f:
            f.write(response.body)
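
As an alternative, Scrapy's FormRequest.from_response() can pre-fill hidden form fields (such as authenticity_token) from the login page automatically; a minimal sketch of the same login written that way (credentials are placeholders):

    def parse(self, response):
        # from_response() copies the form's hidden inputs for us
        yield scrapy.FormRequest.from_response(
            response,
            formdata={'login': 'your-username', 'password': 'your-password'},
            callback=self.parse_logined,
        )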

Settings

  • The settings module configures the crawler project as a whole. The settings include:
# -*- coding: utf-8 -*-

# Scrapy settings for GitHub project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'GitHub'

SPIDER_MODULES = ['GitHub.spiders']
NEWSPIDER_MODULE = 'GitHub.spiders'

# Log to a file
# LOG_FILE = 'github.log'

# Log level: one of DEBUG, INFO, WARNING, ERROR, CRITICAL

LOG_LEVEL = 'ERROR'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko"


# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Maximum number of concurrent requests (in practice limited by your hardware)
# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs

# Delay between requests
# DOWNLOAD_DELAY = 3

# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies
# Disable cookies (enabled by default)
# COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
# }

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'GitHub.middlewares.GithubSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'GitHub.middlewares.GithubDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'GitHub.pipelines.GithubPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
