Scrapy in Practice: Crawling a Blog Aggregation Site

Preface

A while ago I read a number of blog posts introducing Scrapy and using it to scrape information from the web. On the whole they carried too little information; for a mature framework, reading blog posts alone is not enough, so I went through the official documentation once.

Having finished it, I wanted a small project to practice on. As it happens, while browsing a while back I came across an aggregation site, built by someone in China, that indexes blogs on a foreign blogging platform. It contains a huge number of blog addresses, and clicking a blog lists every video URL under it; in other words, the site is itself a crawler.

Downloading all of the videos is unrealistic. It is enough to save the blog addresses; later, if needed, another crawler can be written to parse all the images, text and videos under a given blog.

Installing Scrapy

Scrapy can be installed with pip. This exercise uses Python 3.5.2 on 64-bit Windows 7, with Python provided by Anaconda. The official site recommends installing it like this:

conda install -c scrapinghub scrapy

But scrapy startproject failed after installing that way. I then uninstalled Scrapy with pip and reinstalled it with pip, and everything worked; the exact cause is unclear.
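
For the record, the reinstall was just the usual pair of commands:

pip uninstall scrapy
pip install scrapy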

Starting the Project

Open cmd in the directory where the project should live and run the following command (XXX is the project name):

scrapy startproject XXX
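
startproject generates the standard Scrapy skeleton; the files edited in the rest of this post live in the following places (middlewares.py is added by hand later):

XXX/
    scrapy.cfg
    XXX/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py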

Writing the Item

Because the site's structure is fairly simple and each page yields 30 blog addresses, items.py stays minimal; all it needs is a single container for the data:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class XXXItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    Blogs_per_page = scrapy.Field()
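
Nothing more is needed, since a scrapy.Item behaves much like a dict. A quick sketch (the blog names here are made up):

item = XXXItem()
item['Blogs_per_page'] = ['blog-a', 'blog-b']
print(dict(item))   # {'Blogs_per_page': ['blog-a', 'blog-b']}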

Writing the Pipeline

Pipelines are what process items: once an item is produced, Scrapy hands it to the registered pipelines for things such as printing or storage. In this exercise the scraped data is stored in MongoDB.

Installing MongoDB

MongoDB website: https://www.mongodb.com/
Download page: https://www.mongodb.com/download-center#community
After downloading, click through the installer. Then go to C:\Program Files\MongoDB\Server\3.2\bin, open cmd and run mongod, which produces the following output:
Microsoft Windows [Version 6.1.7601]
Copyright (c) 2009 Microsoft Corporation. All rights reserved.


C:\Program Files\MongoDB\Server\3.2\bin>mongod
2016-10-11T12:36:54.234+0800 I CONTROL  [main] Hotfix KB2731284 or later update is not installed, will zero-out data files
2016-10-11T12:36:54.236+0800 I CONTROL  [initandlisten] MongoDB starting : pid=33256 port=27017 dbpath=C:\data\db\ 64-bit host=CJPC160816-051
2016-10-11T12:36:54.236+0800 I CONTROL  [initandlisten] targetMinOS: Windows 7/Windows Server 2008 R2
2016-10-11T12:36:54.236+0800 I CONTROL  [initandlisten] db version v3.2.10
2016-10-11T12:36:54.236+0800 I CONTROL  [initandlisten] git version: 79d9b3ab5ce20f51c272b4411202710a082d0317
2016-10-11T12:36:54.236+0800 I CONTROL  [initandlisten] OpenSSL version: OpenSSL 1.0.1t-fips  3 May 2016
2016-10-11T12:36:54.236+0800 I CONTROL  [initandlisten] allocator: tcmalloc
2016-10-11T12:36:54.237+0800 I CONTROL  [initandlisten] modules: none
2016-10-11T12:36:54.237+0800 I CONTROL  [initandlisten] build environment:
2016-10-11T12:36:54.237+0800 I CONTROL  [initandlisten]     distmod: 2008plus-ssl
2016-10-11T12:36:54.237+0800 I CONTROL  [initandlisten]     distarch: x86_64
2016-10-11T12:36:54.237+0800 I CONTROL  [initandlisten]     target_arch: x86_64
2016-10-11T12:36:54.237+0800 I CONTROL  [initandlisten] options: {}
2016-10-11T12:36:54.239+0800 I -        [initandlisten] Detected data files in C:\data\db\ created by the 'wiredTiger' storage engine, so setting the active storage engine to 'wiredTiger'.
2016-10-11T12:36:54.241+0800 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=1G,session_max=20000,eviction=(threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),
2016-10-11T12:36:55.115+0800 I NETWORK  [HostnameCanonicalizationWorker] Starting hostname canonicalization worker
2016-10-11T12:36:55.115+0800 I FTDC     [initandlisten] Initializing full-time diagnostic data capture with directory 'C:/data/db/diagnostic.data'
2016-10-11T12:36:55.147+0800 I NETWORK  [initandlisten] waiting for connections on port 27017


This means MongoDB is up and listening on local port 27017.
Python talks to MongoDB through the pymongo library, which can be installed directly with pip. A quick way to check that the database is reachable is the short sketch below (it assumes mongod is still running on localhost:27017):
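
# Minimal pymongo smoke test -- assumes mongod is running on localhost:27017
import pymongo

client = pymongo.MongoClient("localhost", 27017)
print(client.server_info()["version"])   # should print the server version, e.g. 3.2.10
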
With that in place, the pipelines.py file looks like this:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo
from .items import XXXItem


class XXXPipeline(object):
    def __init__(self):
        client = pymongo.MongoClient("localhost", 27017)
        db = client["XXX"]
        self.blogs = db["Blogs"]

    def process_item(self, item, spider):
        if isinstance(item, XXXItem):
            self.blogs.insert(dict(item))
        # Return the item so any later pipelines can still process it
        return item
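
As the template comment above notes, a pipeline only runs if it is registered under ITEM_PIPELINES in settings.py. That registration is not shown in the settings listing later in this post, so as a reminder it would look something like this (the number is just the pipeline's priority):

ITEM_PIPELINES = {
    'XXX.pipelines.XXXPipeline': 300,
}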

Writing the Spider

Finally, the crawler itself. Create a new file XXX_spider.py under the XXX/spiders directory. The code is below, with explanations in the comments:
# -*- coding: utf-8 -*-

import scrapy
from scrapy.selector import Selector
from XXX.items import XXXItem


class XXXSpider(scrapy.Spider):
    # The unique name of this spider
    name = 'XXX_spider'
    # The first page of the listing (start_requests below actually generates every request)
    start_urls = ['http://www.XXX.com/blogs.html?page=1&name=']

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Total number of pages to crawl, and the current page counter
        self.page_count = 100
        self.num_count = 1

    def start_requests(self):
        '''Keep building requests, one page at a time'''
        while self.num_count <= self.page_count:
            # Build the request and point it at the parse() callback;
            # meta is used to pass the page number along to the callback
            yield scrapy.Request(url='http://www.XXX.com/blogs.html?page=%s&name=' % str(self.num_count),
                                 meta={'count': self.num_count},
                                 callback=self.parse)
            # Move on to the next page
            self.num_count += 1

    def parse(self, response):
        # meta carries the parameters that were set when the request was built
        print('Starting page {0:d}'.format(response.meta['count']))
        # Parse the page with XPath
        selector1 = Selector(response=response)
        blogs_per_page_list = selector1.xpath('//*[@id="amz-main"]/div/div/table/tbody/tr/td/a/span/text()').extract()
        # Instantiate the item that will hold the data
        XXX_item = XXXItem()
        XXX_item['Blogs_per_page'] = blogs_per_page_list
        # Yield the item so the pipeline can process it further
        yield XXX_item

The page is parsed with XPath, which is faster than bs4 (BeautifulSoup); tutorials are available on w3school.
Chrome makes it easy to obtain an XPath: select the node in the developer tools, right-click, and copy the XPath. The copied XPath usually points at one specific node and contains indexes in square brackets; to select all nodes of the same type, simply remove the brackets and their contents.
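
A convenient way to try out an XPath before putting it into the spider is the Scrapy shell. A sketch, using the URL and XPath from the spider above:

scrapy shell "http://www.XXX.com/blogs.html?page=1&name="
>>> response.xpath('//*[@id="amz-main"]/div/div/table/tbody/tr/td/a/span/text()').extract()

If the list comes back empty, the XPath copied from Chrome may need tweaking; browsers sometimes insert tbody elements that are not present in the raw HTML.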

Configuring Middleware

To keep the crawler from being banned, there are three common approaches:
  • Lower the crawl rate
  • Set a User-Agent so the crawler looks like a browser
  • Use proxy IPs
Create a middlewares.py under the XXX package directory. The code is simple; it just sets the corresponding values on the request object:
import random
from XXX.settings import USER_AGENT_LIST
from XXX.settings import PROXY_LIST

class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        # Pick a random User-Agent from the list defined in settings.py
        ua = random.choice(USER_AGENT_LIST)
        if ua:
            request.headers["User-Agent"] = ua

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # Route this request through a random proxy from the list in settings.py
        ip1 = 'http://' + random.choice(PROXY_LIST)
        request.meta['proxy'] = ip1

Writing the Settings File

Only the settings touched in this exercise are listed here; see the documentation for everything else.
# -*- coding: utf-8 -*-
# Scrapy settings for XXX project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'XXX'
SPIDER_MODULES = ['XXX.spiders']
NEWSPIDER_MODULE = 'XXX.spiders'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1.0
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'XXX.middlewares.RandomUserAgentMiddleware': 300,
    'XXX.middlewares.ProxyMiddleware': 310,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None
}
LOG_LEVEL = 'INFO'
USER_AGENT_LIST = [
    'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
    'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0',
    'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; InfoPath.3; rv:11.0) like Gecko',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
    'Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11',
    'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)']
PROXY_LIST = open('ip.txt','r').read().split('\n')

The User-Agent strings were found online; there are plenty, and picking a handful is enough. Free proxy IPs can also be found online, but to be honest they are not very reliable and requests sometimes fail. If you really want to use proxies, you also need code that validates them and keeps fetching fresh ones, as sketched below.
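
A minimal sketch of such a validator, assuming the same one-"host:port"-per-line ip.txt format read by PROXY_LIST above (the test URL is just an example):

# proxy_check.py -- keep only the proxies in ip.txt that actually answer
import requests

TEST_URL = 'http://httpbin.org/ip'   # any fast, reliable page will do


def working_proxies(path='ip.txt', timeout=5):
    good = []
    with open(path) as f:
        proxies = [line.strip() for line in f if line.strip()]
    for p in proxies:
        try:
            r = requests.get(TEST_URL, proxies={'http': 'http://' + p}, timeout=timeout)
            if r.status_code == 200:
                good.append(p)
        except requests.RequestException:
            pass   # dead or too slow -- skip it
    return good


if __name__ == '__main__':
    print('\n'.join(working_proxies()))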

Finally, a Launch Script

Create a begin.py in the project root:
from scrapy import cmdline

cmdline.execute("scrapy crawl XXX_spider".split())
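
Running begin.py (from an IDE, for example) is equivalent to running scrapy crawl XXX_spider from the project root on the command line; it just saves switching to a terminal when starting or debugging the spider.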

Crawl Results

The crawl ran from 17:22:24 to 22:35:58 and collected data from 7719 pages, 231570 blog names in total.
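
For reference, a quick way to check those numbers against what landed in MongoDB, using the database and collection names from the pipeline above (a sketch; on newer pymongo use count_documents({}) instead of count()):

import pymongo

client = pymongo.MongoClient("localhost", 27017)
blogs = client["XXX"]["Blogs"]

# One document was stored per page, each holding a list of blog names
print(blogs.count())                                            # roughly the number of pages
print(sum(len(doc['Blogs_per_page']) for doc in blogs.find()))  # total number of blog names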

Summary

Overall, the steps for writing a simple Scrapy crawler are:
  1. Analyse the target pages by hand and decide which data to save
  2. Create a new Scrapy project
  3. Write items.py and build the item that will hold the data
  4. Write pipelines.py so that the items actually get processed
  5. Write the spider: the start URLs, the loop that builds requests and follows links, and the parsing of responses
  6. Write middlewares to avoid bans or to add other functionality
  7. In settings.py set the crawl rate, register the middlewares, set the log level and so on
  8. Start the crawler

A few points worth noting:
  1. The actual crawl time was shorter than stated above: I went out for dinner partway through and the computer went to sleep, so subtract roughly 50 minutes.
  2. The real number of pages is higher than 7719; the remote server apparently had problems at some point, so some requests failed and no data came back.
  3. To avoid being banned the crawl was kept deliberately slow; with proxy IPs and a higher request rate the total time would drop considerably.
  4. The site is fairly simple: every request is a plain GET and no login simulation is involved.

To close, here is a diagram of Scrapy's workflow (found online):

