Scraping Documents from China Judgments Online with Scrapy + Selenium

A first taste of web crawling with Python, and a few lessons learned. Corrections are very welcome.

url = https://wenshu.court.gov.cn/

Target content: court judgment documents

Stack: the Scrapy framework + Selenium-driven browser simulation

At first I thought I could brute-force the page structure and pull the data straight out. Ha. That showed me exactly where my skill level really was.

I then settled on the pyspider framework and spent four or five days on it. With pyspider, following chains of links from a page worked when I clicked a single link manually in the UI, but running the project from the outside ("run") returned no data. In the end, many blog posts pointed out that pyspider's official documentation has not been updated for a long time, and that companies and real projects generally use Scrapy. The Scrapy architecture is shown below:

(Scrapy architecture diagram: https://upload-images.jianshu.io/upload_images/10170086-062ee2661ec03aa9.png?imageMogr2/auto-orient/strip|imageView2/2/w/740)

In the middle sits the Scrapy engine. On the left, the Item is the entity being scraped; after a `yield item` statement returns, the associated item pipelines (classes in pipelines.py) post-process it, for example by storing it in MongoDB. At the bottom, the Spiders issue the initial page requests, parse the responses, and push follow-up URLs into the Scheduler's request queue at the top; the engine then takes one request at a time from that queue and fetches it from the Internet through the Downloader on the right.
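In miniature, that flow looks like this (a generic illustration, not this project's code): the spider yields an Item, and every pipeline enabled in ITEM_PIPELINES gets to process it in turn.

import scrapy

class QuoteItem(scrapy.Item):
    text = scrapy.Field()

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['https://quotes.toscrape.com/']

    def parse(self, response):
        for q in response.css('div.quote span.text::text').getall():
            yield QuoteItem(text=q)   # handed to the engine, then to the item pipelines

# settings.py:
# ITEM_PIPELINES = {'myproject.pipelines.SaveToMongoPipeline': 300}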

There are two kinds of middleware: Spider Middlewares and Downloader Middlewares. The former are used less often. A downloader middleware sits between the Scheduler and the Internet: before a request goes out it can attach extra parameters to the URL, and before the response is handed back to the spider it can, for example, render dynamic JavaScript with Selenium and simulate browser actions such as clicking through pages.
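The two downloader-middleware hooks have this general shape (an illustrative skeleton, not this project's middleware; it would still need to be enabled in DOWNLOADER_MIDDLEWARES):

class ExampleDownloaderMiddleware(object):
    def process_request(self, request, spider):
        # runs before the request reaches the Downloader, e.g. to attach extra info
        request.headers.setdefault('User-Agent', 'Mozilla/5.0')
        return None            # None lets the request continue down the chain

    def process_response(self, request, response, spider):
        # runs before the response is handed back to the spider, e.g. to swap in
        # a Selenium-rendered page instead of the raw download
        return response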

On the JavaScript rendering just mentioned: I use Selenium here, but Splash is another rendering option. With Splash you have to write Lua scripts inside the spider code, which is less convenient than Selenium for visiting pages, paginating, clicking and so on. On the other hand, for large crawls Selenium drives the browser in a blocking way and adds real time cost to the project, whereas Splash supports distributed rendering.

I went with Selenium first. Because the site's server is slow to respond, I had to set explicit waits for particular page elements to appear, plus a page-load timeout for the whole simulated visit. That meant a lot of parameter tweaking and unit testing, but in the end the page success rate was essentially 100%.
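A minimal sketch of that kind of setup (the 60 s value mirrors SELENIUM_TIMEOUT in the settings below, 'a.caseName' is the link class the spider parses, and the list URL is trimmed to its path):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome()
browser.set_page_load_timeout(60)      # give the slow server up to 60 s per page load
wait = WebDriverWait(browser, 60)      # explicit wait, polled until the condition holds

list_url = 'https://wenshu.court.gov.cn/website/wenshu/181217BMTKHNT2W0/index.html'
browser.get(list_url)
# block until the judgment links have actually been rendered by the page's JS
wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'a.caseName')))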

Out of curiosity, I also tried the same project with Splash, but I could not get the pages to render. From what I could find, the site may block some of the JS functions Splash relies on. (Discussion of the scrapy-splash rendering failure: https://mp.csdn.net/console/editor/html/104331571)

The code crawls the first two pages. The list shows 5 documents per page by default; using Selenium I set the page size to its maximum of 15, which saves a lot of list-page navigation, loading and parsing. With 15 detail links per page plus the 2 document-list pages, that is 32 responses in total, all with status code "200" (success). The screenshot below shows the run:

At this point the database shows the 30 inserted document items; below is the updated data in the corresponding MongoDB collection:

Pitfalls (bugs):

1. When testing the detail-page parser: xpath() or css() on a response returns a "list"-like SelectorList whose elements can be queried further; calling extract() also returns a "list", but of plain strings, on which xpath() or css() can no longer be called (see the first sketch after this list);

2. Driving the browser with Selenium requires explicit waits for certain elements to appear and a page-load timeout for the whole request; finding the right values and the right elements to wait on took some testing;

3. Splitting the body of a judgment into its parts: the parties, the court's reasoning, and the verdict. By counting the div elements and handling a few special cases, the data can be cleaned correctly for the different div layouts, including the special case of documents not authorised for public display;

4. At first I wanted a single downloader middleware to serve both the document-list pages and the document-detail pages. Solution: set a sentinel in the request's meta dict (the 'tag' field in the code below) to tell the two kinds of URL apart;

5. No matter which page number is requested, the URL never changes: clicking through the pages only swaps the content while the address stays the same. The spider therefore has to set dont_filter=True on these requests so that Scrapy does not drop them as duplicates;

6. As a consequence of 5, naively clicking the "next page" link does not work: on page 2 the URL is still the original one, and clicking "next" again lands back on page 1. So the specific page-number button has to be clicked instead; but at most 6 page buttons are visible at first, and the later buttons only appear after clicking page 6. Using that pattern, I wrote a function that returns the sequence of buttons to click (see the second sketch after this list);

7. Because of 6, reaching a high page number would require many simulated button clicks. So the middleware first sets the per-page document count to the maximum of 15, three times the default, which greatly reduces the number of page turns and speeds up the crawl (see the third sketch after this list).
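A few sketches for pitfalls 1, 6 and 7. First, the Selector/extract() difference from pitfall 1, using the list-page link selector from the spider code below:

links = response.xpath('//a[@class="caseName"]')        # SelectorList: elements can be queried further
href = links[0].xpath('./@href').extract_first()        # fine
texts = links.extract()                                 # plain list of str: no more .xpath()/.css()

Second, a hedged guess at the button-sequence helper from pitfall 6; how many new buttons appear per click is an assumption, and 'pageButton' is the class name taken from the Lua script below:

def click_sequence(page):
    """Page-number buttons to click, in order, to reach `page` when only about
    6 buttons are visible at a time. An illustrative guess, not the author's
    actual function."""
    seq = []
    reachable = 6              # pages 1..6 are clickable at the start
    while reachable < page:
        seq.append(reachable)  # click the last visible button to reveal more
        reachable += 4         # assume ~4 more buttons become visible (placeholder)
    seq.append(page)
    return seq

# used from the Selenium middleware, roughly:
# for p in click_sequence(target_page):
#     browser.find_element(By.XPATH, f'//a[@class="pageButton" and text()="{p}"]').click()

Third, the page-size change from pitfall 7, which is presumably what the Select import in the middleware code is for (the dropdown locator is a placeholder; the real element has to be inspected on the site):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.select import Select

# given a Selenium `browser` like the one in the earlier snippet
dropdown = Select(browser.find_element(By.CSS_SELECTOR, 'select.pageSizeSelect'))
dropdown.select_by_visible_text('15')   # 15 documents per list page instead of the default 5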

 

Enough talk; here is the code:

spiders:

# -*- coding: utf-8 -*-
import time

import scrapy
from scrapy import Request
from urllib.parse import urljoin
from scrapy import Selector
from wenshu.items import WenshuItem
import requests
from scrapy import Spider
from scrapy_splash import SplashRequest
from urllib.parse import quote

script_one = """
function main(splash, args)
  splash.html5_media_enabled = true
  splash.plugins_enabled = true
  splash.response_body_enabled = true
  splash.request_body_enabled = true
  splash.js_enabled = true
  splash.resource_timeout = 30
  splash.images_enabled = false
  
  assert(splash:go(args.url))
  assert(splash:wait(args.wait))
  return { html = splash:html(),
           har = splash:har(),
           png = splash:png()
         }
end
"""

script = """
function main(splash, args)
  splash.resource_timeout = 40
  splash.images_enabled = true
  assert(splash:go(args.url))
  assert(splash:wait(args.wait))
  js = string.format("isnext = function(page = %s) { if(page == 1) { return document.html; } document.querySelector('a.pageButton:nth-child(8)').click(); return document.html; }",args.page)
  splash:runjs(js)
  splash:evaljs(js)
  assert(splash:wait(args.wait))
  return splash:html()
end
"""

detail = '''
function main(splash, args)
  splash.resource_timeout = 20
  splash.images_enabled = false
  assert(splash:go(args.url))
  assert(splash:wait(args.wait))
  return splash:html()
end
'''




class WsSpider(scrapy.Spider):
    name = 'ws'
    allowed_domains = ['wenshu.court.gov.cn']
    base_url = 'https://wenshu.court.gov.cn/website/wenshu'
    #start_urls = 'https://wenshu.court.gov.cn/website/wenshu/181217BMTKHNT2W0/index.html?pageId=11c90f39fb379fbf4ab85bb180682ce0&s38=300&fymc=%E6%B2%B3%E5%8C%97%E7%9C%81%E9%AB%98%E7%BA%A7%E4%BA%BA%E6%B0%91%E6%B3%95%E9%99%A2'
    start_urls = 'https://wenshu.court.gov.cn/website/wenshu/181217BMTKHNT2W0/index.html?s38=300&fymc=%E6%B2%B3%E5%8C%97%E7%9C%81%E9%AB%98%E7%BA%A7%E4%BA%BA%E6%B0%91%E6%B3%95%E9%99%A2'
    def start_requests(self):
        for page in range(1, self.settings.get('MAX_PAGE') + 1):
            self.logger.debug('requesting list page %s', page)
            yield Request(url=self.start_urls, callback=self.parse_origin, meta={'page':page,'tag':0}, dont_filter=True)

            #yield SplashRequest(url=self.start_urls, callback=self.parse_origin, endpoint='execute', args={'lua_source':script_one,
            #   'wait':5,'page':page
            #})

    def parse_origin(self, response):
        # list page, already rendered by the Selenium downloader middleware
        self.logger.debug('list page status: %s', response.status)

        urls = response.xpath('//a[@class="caseName"]/@href').extract()
        for url in urls:
            # each href starts with "./" -- strip it and join onto the base wenshu path
            target_url = self.base_url + url[2:]
            self.logger.debug('detail url: %s', target_url)
            # tag=1 marks this as a detail page for the downloader middleware
            yield Request(url=target_url, callback=self.parse_detail, meta={'tag': 1}, dont_filter=False)


    #def parse_detail()
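
The post ends the spider here and leaves parse_detail out, even though the Request above names it as its callback (without it the crawl would fail with an AttributeError). A minimal, hedged stub is sketched below; the selector is a placeholder, not the one used against the real detail pages:

    def parse_detail(self, response):
        # Sketch only: the real field extraction is omitted in the original post.
        item = WenshuItem()
        item['title'] = response.xpath('//h1/text()').extract_first()  # placeholder selector
        yield item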


items:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

from scrapy import Item,Field


class WenshuItem(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    collection = 'wenshu'
    title = Field()     # 标题
    release = Field()   # 发布时间
    views = Field()     # 访问量
    court = Field()     # 审判法院
    type = Field()      # 文书类型
    prelude = Field()   # 首部
    parties = Field()   # 当事人
    justification = Field() # 理由
    end = Field()       # 结果
    chief = Field()     # 审判长
    judge = Field()     # 审判员
    time = Field()      # 审判时间
    assistant = Field() # 法官助理
    clerk = Field()     # 书记员



pipelines:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

import pymongo

from wenshu.items import WenshuItem


class WsMongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls,crawler):
        return cls (
            mongo_uri = crawler.settings.get('MONGO_URI'),
            mongo_db = crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        if isinstance(item, WenshuItem):
            self.db[item.collection].insert_one(dict(item))
            return item


class WenshuPipeline(object):
    def process_item(self, item, spider):
        return item

middlewares:
 

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

import time
from scrapy import signals
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from scrapy.http import HtmlResponse
from logging import getLogger
from selenium.webdriver.support.select import Select

class WenshuSeleniumMiddleware(object):
    def __init__(self, timeout=None, service_args=None):
        self.logger = getLogger(__name__)
        self.timeout = timeout or 60   # fall back to 60 s if no timeout is passed in
        self.browser = webdriver.Chrome()
        self.browser.set_window_size(1200, 600)
        self.browser.set_page_load_timeout(self.timeout)
        self.wait = WebDriverWait(self.browser, self.timeout)


    def __del__(self):
        # quit() shuts the whole driver down; close() would only close the current window
        self.browser.quit()


    #def process_request()
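
    # The post omits process_request. The sketch below is only a guess at what the
    # missing pieces of this middleware typically look like, not the author's code:
    # from_crawler wires in SELENIUM_TIMEOUT from settings.py, and process_request
    # renders the page in Chrome, waits for the elements the spider needs (keyed on
    # the 'tag' meta field the spider sets: 0 = list page, 1 = detail page), and
    # returns the rendered HTML to Scrapy as an HtmlResponse. The page-size trick
    # and the page-button clicking from the pitfalls section are left out here.
    @classmethod
    def from_crawler(cls, crawler):
        return cls(timeout=crawler.settings.get('SELENIUM_TIMEOUT'))

    def process_request(self, request, spider):
        try:
            self.browser.get(request.url)
            if request.meta.get('tag') == 0:
                # list page: wait until the judgment links have been rendered
                self.wait.until(
                    EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'a.caseName')))
            else:
                # detail page: 'div.PDF_pox' is a placeholder selector, not verified
                self.wait.until(
                    EC.presence_of_element_located((By.CSS_SELECTOR, 'div.PDF_pox')))
            time.sleep(1)  # small settle time for late-running JS
            return HtmlResponse(url=request.url, body=self.browser.page_source,
                                request=request, encoding='utf-8', status=200)
        except TimeoutException:
            return HtmlResponse(url=request.url, status=500, request=request)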




class WenshuSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Request, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class WenshuDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

settings:

BOT_NAME = 'wenshu'

SPIDER_MODULES = ['wenshu.spiders']
NEWSPIDER_MODULE = 'wenshu.spiders'


ROBOTSTXT_OBEY = False


ITEM_PIPELINES = {
    'wenshu.pipelines.WsMongoPipeline': 300,
    # 'wenshu.pipelines.MongoPipeline': 302,
}


DOWNLOADER_MIDDLEWARES = {
     'wenshu.middlewares.WenshuSeleniumMiddleware': 543,
    #'scrapy_splash.SplashCookiesMiddleware': 723,
    #'scrapy_splash.SplashMiddleware': 725,
    #'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810
}


MAX_PAGE = 2

# SPLASH_URL = 'http://127.0.0.1:8050'

SELENIUM_TIMEOUT = 60

PHANTOMJS_SERVICE_ARGS = ['--load-images=false', '--disk-cache=true']

# DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
# HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

MONGO_URI = 'localhost'

MONGO_DB = 'test'

 

Bonus: off for a nap, alarm set for four o'clock. Then, in Zhan Jun's words, let's shout:

Liverpool are champions!!!
