【Web Crawler】Scraping Website Data with Scrapy

【Original Link】http://chenqx.github.io/2014/11/09/Scrapy-Tutorial-for-BBSSpider/

 

Scrapy Tutorial

  In what follows we use crawling data from the 飲水思源 BBS (SJTU BBS) as an example to walk through the scraping process; see the bbsdmoz code for details.
  This tutorial will walk you through the following tasks:

1. Create a Scrapy project
2. Define the Items to extract
3. Write a spider that crawls the site and extracts the Items
4. Write an Item Pipeline that stores the extracted Items (i.e. the data)

Creating a project

  Before you start scraping, you have to set up a new Scrapy project. Change into the directory where you want to keep your code and run:

scrapy startproject bbsdmoz

This command creates a bbsdmoz directory with the following contents:

  • scrapy.cfg: the project configuration file
  • bbsdmoz/: the project's Python module. You will add your code here later.
  • bbsdmoz/items.py: the project's item definitions.
  • bbsdmoz/pipelines.py: the project's pipelines.
  • bbsdmoz/settings.py: the project's settings file.
  • bbsdmoz/spiders/: the directory where the spider code lives.

Defining our Item

  Items are the containers that hold the scraped data. They behave much like Python dictionaries, but add extra protection against undefined-field errors caused by typos.
  Much as you would in an ORM (Object Relational Mapping), you define an Item by creating a class that subclasses scrapy.Item and declaring class attributes of type scrapy.Field. (If you are not familiar with ORMs, don't worry: this step is very simple.)
  Start by modelling the item on the data we want to get from the BBS site: the url, the board (forum), the poster, and the post content. Define a field for each of them and edit the items.py file in the bbsdmoz directory:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field

class BbsItem(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    url = Field()
    forum = Field()
    poster = Field()
    content = Field()

  This may look complicated at first, but defining the item lets you conveniently use other Scrapy components, because those components need to know what your item looks like.

Our first Spider

  A Spider is a class you write to scrape data from a single website (or a group of websites).
  It contains an initial list of URLs to download, rules for following links found in the pages, and the logic for parsing page content and extracting items.

Creating a Spider

Selectors

  We use XPath to select the data we need from the HTML source of the page. Here are some example XPath expressions and their meanings:

  • /html/head/title: selects the <title> element inside the <head> of the HTML document
  • /html/head/title/text(): selects the text of the <title> element above
  • //td: selects all <td> elements
  • //div[@class="mine"]: selects all div elements that have a class="mine" attribute

  Take this 飲水思源 BBS page as an example: https://bbs.sjtu.edu.cn/bbstcon?board=PhD&reid=1406973178&file=M.1406973178.A
  Inspect the HTML source of the page and build XPath expressions for the data we need (the url, forum, poster, and content defined above).
  Looking at the source, we can see that the poster is contained in a pre/a tag; here it is userid=jasperstream:

  So the XPath expression that extracts jasperstream is: '//pre/a/text()'

  To work with XPath, Scrapy provides the Selector class as well as shortcuts that save you the trouble of constructing a selector every time you extract data from a response. A Selector has four basic methods:

  • xpath(): takes an XPath expression and returns a list of selectors for all nodes matching the expression.
  • css(): takes a CSS expression and returns a list of selectors for all nodes matching the expression.
  • extract(): serializes the selected nodes and returns a list of unicode strings.
  • re(): extracts data using the given regular expression and returns a list of unicode strings.

  For example, to extract the poster data mentioned above:

sel.xpath('//pre/a/text()').extract()
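
  The easiest way to experiment with such expressions is Scrapy's interactive shell. A minimal sketch, assuming the sample BBS page above is still reachable (response is the object the shell exposes for the downloaded page):

scrapy shell 'https://bbs.sjtu.edu.cn/bbstcon?board=PhD&reid=1406973178&file=M.1406973178.A'
>>> response.xpath('//pre/a/text()').extract()      # should include u'jasperstream'
>>> response.xpath('//center/text()[2]').extract()  # the forum name used later in the spider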

Using the Item

  Item objects are custom Python dicts. You can access the value of each field using standard dictionary syntax (the fields being the attributes we assigned with Field earlier). In general, a Spider returns the data it scrapes as Item objects.
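
  For example, a minimal sketch using the BbsItem defined above:

item = BbsItem()
item['forum'] = 'PhD'
print(item['forum'])   # prints: PhD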

Spider code

  Here is our first Spider; save it as forumSpider.py in the bbsdmoz/spiders directory:

# -*- coding: utf-8 -*-
"""
Created on Fri Jul 20 13:18:58 2018

@author: Administrator
"""

from scrapy.http import Request
from scrapy.spiders import CrawlSpider
from scrapy.loader import ItemLoader
# SGMLParser-based link extractors are unmaintained and their usage is discouraged;
# LxmlLinkExtractor is the recommended replacement. (The old scrapy.contrib.* import
# paths are deprecated aliases of the modules imported here.)
from scrapy.linkextractors import LxmlLinkExtractor
from bbsdmoz.items import BbsItem

class forumSpider(CrawlSpider):
    # name of the spider, used by "scrapy crawl <name>"
    name = 'bbsSpider'
    allowed_domains = ['bbs.sjtu.edu.cn']
    start_urls = ['https://bbs.sjtu.edu.cn/bbsall']
    # note: parse() is overridden below, so the CrawlSpider rules mechanism is not used
    link_extractor = {
        'page':      LxmlLinkExtractor(allow=r'/bbsdoc,board,\w+\.html$'),
        'page_down': LxmlLinkExtractor(allow=r'/bbsdoc,board,\w+,page,\d+\.html$'),
        'content':   LxmlLinkExtractor(allow=r'/bbscon,board,\w+,file,M\.\d+\.A\.html$'),
    }
    _x_query = {
        'page_content': '//pre/text()[2]',
        'poster':       '//pre/a/text()',
        'forum':        '//center/text()[2]',
    }

    def parse(self, response):
        # board index page: follow the link of every board
        for link in self.link_extractor['page'].extract_links(response):
            yield Request(url=link.url, callback=self.parse_page)

    def parse_page(self, response):
        # board page: follow pagination links and links to individual posts
        for link in self.link_extractor['page_down'].extract_links(response):
            yield Request(url=link.url, callback=self.parse_page)
        for link in self.link_extractor['content'].extract_links(response):
            yield Request(url=link.url, callback=self.parse_content)

    def parse_content(self, response):
        # post page: fill a BbsItem with the url, forum, poster and content
        bbsItem_loader = ItemLoader(item=BbsItem(), response=response)
        bbsItem_loader.add_value('url', str(response.url))
        bbsItem_loader.add_xpath('forum', self._x_query['forum'])
        bbsItem_loader.add_xpath('poster', self._x_query['poster'])
        bbsItem_loader.add_xpath('content', self._x_query['page_content'])
        return bbsItem_loader.load_item()

Define Item Pipeline

  After an Item has been collected in a Spider, it is passed to the Item Pipeline, where several components process it in sequence.
  Each item pipeline component (sometimes simply called an "item pipeline") is a Python class that implements a few simple methods. It receives an Item, performs some action on it, and decides whether the Item should continue through the pipeline or be dropped and processed no further.
  Typical uses of item pipelines include:

  • cleansing HTML data
  • validating the scraped data (checking that items contain certain fields)
  • checking for (and dropping) duplicates
  • storing the scraped results, for example in a database or in XML/JSON files

Writing an Item Pipeline

  Writing your own item pipeline is simple. Each item pipeline component is a standalone Python class that must implement the following method:

process_item(item, spider)
  This method is called for every item pipeline component. It must either return an Item (or any subclass) object or raise a DropItem exception; dropped items are not processed by any further pipeline components.
  Parameters: item (Item object) – the Item returned by the parse method
       spider (Spider object) – the spider that scraped this Item
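
  To illustrate returning an item versus dropping it, here is a minimal sketch of a deduplication pipeline (a hypothetical example, not part of this project) that drops any item whose url has already been seen:

from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):
    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        # drop the item if its url was seen before, otherwise pass it on
        if item['url'] in self.seen_urls:
            raise DropItem('duplicate item: %s' % item['url'])
        self.seen_urls.add(item['url'])
        return item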

  In addition, they may also implement the following methods:

open_spider(spider)
  Called when the spider is opened.
  Parameters: spider (Spider object) – the spider that was opened
close_spider(spider)
  Called when the spider is closed; this is the place to do any post-processing of the data after the crawl has finished.
  Parameters: spider (Spider object) – the spider that was closed
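
  For instance, a minimal sketch of a pipeline that implements open_spider/close_spider directly (a hypothetical JSON-lines writer, not the pipeline used in this article) might look like this:

import json

class JsonWriterPipeline(object):
    def open_spider(self, spider):
        self.file = open('items.jl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # write each item as one line of JSON
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item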

  The item pipeline used in this article is shown below; it saves the scraped items to an XML file. Note that instead of implementing open_spider/close_spider directly, it connects to the spider_opened/spider_closed signals via from_crawler, which has the same effect:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

from scrapy import signals
from scrapy.exporters import XmlItemExporter
# from dataProcess import dataProcess  # uncomment together with the call in spider_closed()

class BbsdmozPipeline(object):

    @classmethod
    def from_crawler(cls, crawler):
        # connect the spider_opened/spider_closed signals so the exporter is
        # created and finalized together with the spider
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        self.file = open('bbsData.xml', 'wb')
        self.exporter = XmlItemExporter(self.file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

        # process the crawled data, define and call dataProcess function
        # dataProcess('bbsData.xml', 'text.txt')

    def process_item(self, item, spider):
        # write every item to the XML file and pass it on unchanged
        self.exporter.export_item(item)
        return item

  Write the small dataProcess.py helper tool:

# -*- coding: utf-8 -*-
"""
Created on Fri Jul 20 14:45:01 2018

@author: Administrator
"""

class dataProcess:
    def __init__(self, source_filename, target_filename):
        # placeholder post-processing step: simply copy the crawled XML into
        # the target file (replace this with your own processing).
        fin = open(source_filename, 'r', encoding='utf-8')
        read = fin.read()

        output = open(target_filename, 'w', encoding='utf-8')
        output.write(read)

        fin.close()
        output.close()

Settings (settings.py)

  Scrapy settings let you customize the behaviour of all Scrapy components, including the core, extensions, pipelines, and the spiders themselves.
  The settings provide a global namespace of key-value mappings from which your code can pull configuration values. They can be populated through several mechanisms.
  The settings are also how you select the currently active Scrapy project (in case you have more than one).
  In the settings file you can, for example, set the crawl rate or choose whether progress information is shown while crawling. See the built-in settings reference for details.

  To enable an Item Pipeline component, add its class to ITEM_PIPELINES. The integer value assigned to each class determines the order in which they run: items pass through the pipelines from lower to higher values. By convention these numbers are defined in the 0-1000 range.

# -*- coding: utf-8 -*-

# Scrapy settings for bbsdmoz project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'bbsdmoz'

SPIDER_MODULES = ['bbsdmoz.spiders']
NEWSPIDER_MODULE = 'bbsdmoz.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'bbsdmoz (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'bbsdmoz.middlewares.BbsdmozSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'bbsdmoz.middlewares.BbsdmozDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'bbsdmoz.pipelines.BbsdmozPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

Crawling

  With the spider written, we can run it and start collecting data. From the project root directory bbsdmoz/, start the spider with the following command (the crawl will take a long time):

scrapy crawl bbsSpider
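
  As a quick sanity check you can also skip the XML pipeline and let Scrapy's built-in feed exports write the items directly (an alternative, not what this article does):

scrapy crawl bbsSpider -o items.json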

【Complete Code】

  • dataProcess.py
# -*- coding: utf-8 -*-
"""
Created on Fri Jul 20 14:45:01 2018

@author: Administrator
"""

class dataProcess:
    def __init__(self, source_filename, target_filename):
        # placeholder post-processing step: simply copy the crawled XML into
        # the target file (replace this with your own processing).
        fin = open(source_filename, 'r', encoding='utf-8')
        read = fin.read()

        output = open(target_filename, 'w', encoding='utf-8')
        output.write(read)

        fin.close()
        output.close()
  • settings.py
# -*- coding: utf-8 -*-

# Scrapy settings for bbsdmoz project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'bbsdmoz'

SPIDER_MODULES = ['bbsdmoz.spiders']
NEWSPIDER_MODULE = 'bbsdmoz.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'bbsdmoz (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'bbsdmoz.middlewares.BbsdmozSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'bbsdmoz.middlewares.BbsdmozDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'bbsdmoz.pipelines.BbsdmozPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
  • pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

from scrapy import signals
from scrapy.exporters import XmlItemExporter
# from dataProcess import dataProcess  # uncomment together with the call in spider_closed()

class BbsdmozPipeline(object):

    @classmethod
    def from_crawler(cls, crawler):
        # connect the spider_opened/spider_closed signals so the exporter is
        # created and finalized together with the spider
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        self.file = open('bbsData.xml', 'wb')
        self.exporter = XmlItemExporter(self.file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

        # process the crawled data, define and call dataProcess function
        # dataProcess('bbsData.xml', 'text.txt')

    def process_item(self, item, spider):
        # write every item to the XML file and pass it on unchanged
        self.exporter.export_item(item)
        return item
  • items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field

class BbsItem(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    url = Field()
    forum = Field()
    poster = Field()
    content = Field()
  • forumSpider.py
# -*- coding: utf-8 -*-
"""
Created on Fri Jul 20 13:18:58 2018

@author: Administrator
"""

from scrapy.http import Request
from scrapy.spiders import CrawlSpider
from scrapy.loader import ItemLoader
# SGMLParser-based link extractors are unmaintained and their usage is discouraged;
# LxmlLinkExtractor is the recommended replacement. (The old scrapy.contrib.* import
# paths are deprecated aliases of the modules imported here.)
from scrapy.linkextractors import LxmlLinkExtractor
from bbsdmoz.items import BbsItem

class forumSpider(CrawlSpider):
    # name of the spider, used by "scrapy crawl <name>"
    name = 'bbsSpider'
    allowed_domains = ['bbs.sjtu.edu.cn']
    start_urls = ['https://bbs.sjtu.edu.cn/bbsall']
    # note: parse() is overridden below, so the CrawlSpider rules mechanism is not used
    link_extractor = {
        'page':      LxmlLinkExtractor(allow=r'/bbsdoc,board,\w+\.html$'),
        'page_down': LxmlLinkExtractor(allow=r'/bbsdoc,board,\w+,page,\d+\.html$'),
        'content':   LxmlLinkExtractor(allow=r'/bbscon,board,\w+,file,M\.\d+\.A\.html$'),
    }
    _x_query = {
        'page_content': '//pre/text()[2]',
        'poster':       '//pre/a/text()',
        'forum':        '//center/text()[2]',
    }

    def parse(self, response):
        # board index page: follow the link of every board
        for link in self.link_extractor['page'].extract_links(response):
            yield Request(url=link.url, callback=self.parse_page)

    def parse_page(self, response):
        # board page: follow pagination links and links to individual posts
        for link in self.link_extractor['page_down'].extract_links(response):
            yield Request(url=link.url, callback=self.parse_page)
        for link in self.link_extractor['content'].extract_links(response):
            yield Request(url=link.url, callback=self.parse_content)

    def parse_content(self, response):
        # post page: fill a BbsItem with the url, forum, poster and content
        bbsItem_loader = ItemLoader(item=BbsItem(), response=response)
        bbsItem_loader.add_value('url', str(response.url))
        bbsItem_loader.add_xpath('forum', self._x_query['forum'])
        bbsItem_loader.add_xpath('poster', self._x_query['poster'])
        bbsItem_loader.add_xpath('content', self._x_query['page_content'])
        return bbsItem_loader.load_item()

The end.
