Scrapy + MongoDB Website Crawler

Tools and environment

Language: Python 3.6
Database: MongoDB (install and start it with the commands below)

python3 -m pip install pymongo
brew install mongodb
mongod --config /usr/local/etc/mongod.conf
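
Before moving on, it can help to confirm that mongod is actually listening. The following is a minimal sketch using only the standard library. It assumes the default host and port (localhost:27017), which the config file above may override, and mongo_port_open is a hypothetical helper name. It only checks that the TCP port accepts connections, not that the process behind it is MongoDB.

```python
import socket


def mongo_port_open(host='localhost', port=27017, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        # create_connection raises OSError (refused, timeout, DNS failure) on error
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == '__main__':
    print('mongod reachable:', mongo_port_open())
```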

Framework: Scrapy 1.5.1 (install command below)

python3 -m pip install Scrapy

Create a crawler project with Scrapy

Run the following command in a terminal to create a crawler project named myspider:

scrapy startproject myspider

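This generates a myspider/ directory containing scrapy.cfg and an inner myspider/ package with items.py, middlewares.py, pipelines.py, settings.py and a spiders/ subdirectory.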

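Create a crawl-style spider

Scrapy provides several spider classes for different purposes:

Spider: the base class that all other spiders inherit from
CrawlSpider: the one most commonly used for crawling a whole site (the example below uses it)
XMLFeedSpider
CSVFeedSpider
SitemapSpider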

First, change into the spiders directory on the command line:

cd myspider/myspider/spiders

Then generate a spider from the crawl template:

scrapy genspider -t crawl zgmlxc www.zgmlxc.com.cn

Parameters:

-t crawl selects the spider template (crawl generates a CrawlSpider)

zgmlxc is the name I gave this spider

www.zgmlxc.com.cn is the site I want to crawl

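Flesh out the zgmlxc spider

Open zgmlxc.py and you will see a basic spider template. The next steps configure it so that the spider crawls the information I want.

Configure the link-following rules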

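rules = (
    # Match the listing page www.zgmlxc.com.cn/node/72.jspx; with no callback, its links are followed by default
    Rule(LinkExtractor(allow=r'.72\.jspx')),
    # On the pages reached above, find URLs matching the pattern below, fetch them, and pass each response to parse_item()
    Rule(LinkExtractor(allow=r'./info/\d+\.jspx'), callback='parse_item'),
)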
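
To check which URLs a rule would pick up, the allow patterns can be tried outside Scrapy. The sketch below assumes that LinkExtractor applies each allow pattern as a regular-expression search against the absolute URL. The classify helper is hypothetical, and the sample URLs other than the node/72 page are made up for illustration.

```python
import re

# The two allow patterns from the rules, in the same order
LIST_PAGE = re.compile(r'.72\.jspx')
DETAIL_PAGE = re.compile(r'./info/\d+\.jspx')

# Hypothetical URLs used only for illustration
urls = [
    'http://www.zgmlxc.com.cn/node/72.jspx',
    'http://www.zgmlxc.com.cn/info/1234.jspx',
    'http://www.zgmlxc.com.cn/about.jspx',
]


def classify(url):
    """Return which pattern (if any) matches the given URL."""
    if LIST_PAGE.search(url):
        return 'list'
    if DETAIL_PAGE.search(url):
        return 'detail'
    return None


for url in urls:
    print(url, '->', classify(url))
```

Because the leading dot matches any character, the first pattern would also match a URL such as node/172.jspx; a stricter pattern like r'/node/72\.jspx' avoids that if it matters for the site.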

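One pitfall: when rules contains only a single Rule, it must be followed by a comma. Without the comma, Python treats the parentheses as plain grouping rather than a tuple, and Scrapy fails when it tries to iterate over rules.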

rules = (
    Rule(LinkExtractor(allow=r'./info/\d+\.jspx'), callback='parse_item', follow=True),
)
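
The comma matters because of how Python parses parentheses: (x) is just x, while (x,) is a one-element tuple. The sketch below uses a placeholder string in place of a real Rule object to show the difference.

```python
# A placeholder standing in for Rule(...); any object behaves the same way
rule = 'Rule(...)'

without_comma = (rule)   # parentheses only group, so this is the object itself
with_comma = (rule,)     # the trailing comma makes a one-element tuple

print(type(without_comma).__name__)  # str
print(type(with_comma).__name__)     # tuple
print(len(with_comma))               # 1
```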

Define the fields to extract in items.py

import scrapy

class CrawlspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    content = scrapy.Field()
    piclist = scrapy.Field()
    shortname = scrapy.Field()

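Complete the parse_item function

This step applies extraction rules to the response returned by the previous step and pulls out the fields we want. get() returns a single string, so it is used directly. getall() and extract() return lists, so join flattens them into one string, which keeps every field a plain string when it is inserted into the database.

def parse_item(self, response):
    yield {
        # default='' avoids calling strip() on None when nothing matches
        'title': response.xpath("//div[@class='head']/h3/text()").get(default='').strip(),
        'shortname': response.xpath("//div[@class='body']/p/strong/text()").get(default='').strip(),
        # getall() and extract() return lists, so join them into a single string
        'piclist': ' '.join(response.xpath("//div[@class='body']/p/img/@src").getall()).strip(),
        'content': ' '.join(response.css("div.body").extract()).strip(),
    }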
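
A quick way to see how join behaves on the two kinds of selector results: get() returns one string, while getall() returns a list of strings. Joining a plain string puts the separator between every character, so join is only appropriate for list results. The values below are made-up stand-ins for selector output.

```python
# Hypothetical sample values standing in for selector results
single = 'Hello'                      # shape of a .get() result: one string
many = ['/img/a.jpg', '/img/b.jpg']   # shape of a .getall() result: a list

joined_single = ' '.join(single)
joined_many = ' '.join(many)

print(joined_single)  # H e l l o
print(joined_many)    # /img/a.jpg /img/b.jpg

# .get() returns None when nothing matches, and join cannot handle that
try:
    ' '.join(None)
except TypeError:
    print('join(None) raises TypeError')
```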

PS: two extraction rules I use often, summarized here:

1). Get the src of an img tag: //img[@class='photo-large']/@src

2). Get the article body together with its markup: response.css("div.body").extract()
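
To try the first rule without running a crawl, here is a small sketch that uses Python's standard-library ElementTree as a stand-in for Scrapy's selectors. ElementTree supports only a subset of XPath and has no /@src step, so the attribute is read with .get('src') instead. The HTML fragment is invented for illustration and has to be well-formed XML for this parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment; real pages usually need an HTML parser
html = (
    '<div class="body">'
    '<p><img class="photo-large" src="/upload/a.jpg"/></p>'
    '<p><img class="thumb" src="/upload/b.jpg"/></p>'
    '</div>'
)

root = ET.fromstring(html)

# Select img elements by class, then read the src attribute from each
srcs = [img.get('src') for img in root.findall(".//img[@class='photo-large']")]
print(srcs)  # ['/upload/a.jpg']
```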

Store the data in MongoDB

Open settings.py and add the following:

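# Register the pipeline that connects the spider to the database
# (the module path must match the project package, myspider in this walkthrough)
ITEM_PIPELINES = {
   'myspider.pipelines.MongoDBPipeline': 300,
}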

# Database settings
MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = 'spider_world'
MONGODB_COLLECTION = 'zgmlxc'

# Crawl politely: wait 5 seconds between requests, which is kinder to the site and lowers the risk of being blacklisted
DOWNLOAD_DELAY = 5

In pipelines.py:

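import logging

import pymongo
from scrapy.exceptions import DropItem

logger = logging.getLogger(__name__)


class MongoDBPipeline(object):
    def __init__(self, server, port, db_name, collection_name):
        self.client = pymongo.MongoClient(server, port)
        self.collection = self.client[db_name][collection_name]

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection settings defined in settings.py
        # (replaces the deprecated scrapy.conf import)
        s = crawler.settings
        return cls(s.get('MONGODB_SERVER'), s.getint('MONGODB_PORT'),
                   s.get('MONGODB_DB'), s.get('MONGODB_COLLECTION'))

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Drop the item if any field is empty; check the values, not just the field names
        for field, value in dict(item).items():
            if not value:
                raise DropItem("Missing {0}!".format(field))
        # insert_one replaces the deprecated Collection.insert
        self.collection.insert_one(dict(item))
        logger.debug("Item added to MongoDB database!")
        return item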
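
The empty-field check is easy to get subtly wrong: iterating over a dict-like item yields its field names, which are never empty, so the values have to be inspected instead. A minimal standalone sketch, using a plain dict in place of a Scrapy item and a hypothetical helper named missing_fields:

```python
def missing_fields(item):
    """Return the names of fields whose values are empty (None, '', [] and so on)."""
    return [name for name, value in item.items() if not value]


# Hypothetical scraped data: 'piclist' is empty because the page had no images
item = {
    'title': 'Some title',
    'shortname': 'abc',
    'piclist': '',
    'content': '<div class="body">...</div>',
}

# Iterating the dict directly yields keys, so this check never finds anything
empty_keys = [key for key in item if not key]
print(empty_keys)             # []

# Checking the values finds the empty field
print(missing_fields(item))   # ['piclist']
```

Whether an item with an empty field should be dropped or stored anyway depends on the data; pages without images, for example, would always have an empty piclist.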

Run the spider from a terminal, passing the spider name (not the project name):

scrapy crawl zgmlxc

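Check the stored data in Navicat

Create a new MongoDB connection in Navicat using the settings configured above. If everything went well, the scraped data will be in the zgmlxc collection of the spider_world database.

That completes the whole process, from a custom spider to data stored in the database.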