A Concise Scrapy Tutorial (Part 4): Crawling All Posts by CSDN Blog Experts and Storing Them in MongoDB

First, let's take a look at the CSDN blog experts page: http://blog.csdn.net/experts.html

[Screenshot: the CSDN blog experts page]


The image above shows the page listing all CSDN blog experts. Clicking "next page" changes the content, but the URL never changes, so we inspect the page elements, as shown below:

[Screenshot: inspecting the pagination element in the browser developer tools]

We find that the value attribute lets us build pagination URLs directly: &page=1 is the first page. Here is the constructed first page of the CSDN blog experts list: http://blog.csdn.net/peoplelist.html?channelid=0&page=1. The fancy styling is gone, but that does not affect the data we want to scrape.

[Screenshot: the plain peoplelist.html version of the experts list]

Next, we open a random blogger's homepage and click "next page":

[Screenshot: a blogger's article list with its pagination links]

We can see that the number after list is the page number, so we can use it to build the pagination URLs for all of a blogger's posts, then collect the post URLs on each page, and finally scrape each post's detail page. Now let's write the crawler. For creating the project, see: A Concise Scrapy Tutorial (Part 2): Starting a Scrapy Project. Below we go straight to the other components:
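To make the crawl plan concrete, here is a minimal sketch of the two URL patterns the spider will build (the blogger username is a hypothetical placeholder; the spider reads the real page totals from the pages themselves):

page = 1
blogger_url = 'http://blog.csdn.net/<username>'   # hypothetical blogger homepage URL

# Page `page` of the blog experts list
expert_list_url = 'http://blog.csdn.net/peoplelist.html?channelid=0&page=' + str(page)

# Page `page` of that blogger's post list
article_list_url = blogger_url + '/article/list/' + str(page)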

1. Writing settings.py:

Set the user agent and uncomment ITEM_PIPELINES; you can find your user agent string in the browser developer tools:

USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'

ITEM_PIPELINES = {
    'csdnblog.pipelines.CsdnblogPipeline': 300,
}
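Depending on your Scrapy version and how fast you crawl, two optional settings may also be useful; they are a hedged suggestion and not part of the original steps:

# Optional: recent Scrapy versions obey robots.txt by default; disable only if your crawl requires it
ROBOTSTXT_OBEY = False

# Optional: add a small delay between requests to be gentler on the site
DOWNLOAD_DELAY = 0.5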

2. Defining the fields to extract (items.py):

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class CsdnblogItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    url = scrapy.Field()
    releaseTime = scrapy.Field()
    readnum = scrapy.Field()
    article = scrapy.Field()
    keywords = scrapy.Field()

3. Writing pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import sys
reload(sys)
sys.setdefaultencoding("utf-8")  # Python 2 workaround so UTF-8 text can be handled without encode errors
from pymongo import MongoClient

class CsdnblogPipeline(object):
    def __init__(self):
        # Connect to the local MongoDB instance; use the csdnblog database and collection
        self.client = MongoClient('localhost', 27017)
        mdb = self.client['csdnblog']
        self.collection = mdb['csdnblog']

    def process_item(self, item, spider):
        # Store each scraped item as a document in MongoDB
        data = dict(item)
        self.collection.insert(data)

        return item

    def close_spider(self, spider):
        self.client.close()
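One caveat: Collection.insert() was deprecated in pymongo 3.x and removed in 4.x. If you run this pipeline against a recent pymongo, a minimal adjustment is to use insert_one() instead:

    def process_item(self, item, spider):
        # With pymongo 3.x or newer, use insert_one() instead of the deprecated insert()
        self.collection.insert_one(dict(item))
        return item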

4. Writing the spider file:

# -*- coding: utf-8 -*-

import scrapy
import re
import jieba
import jieba.analyse
from scrapy.http import Request
from csdnblog.items import CsdnblogItem
import sys
reload(sys)
sys.setdefaultencoding('utf-8')  # Python 2 workaround so UTF-8 text can be handled without encode errors

class SpiderCsdnblogSpider(scrapy.Spider):
    name = "spider_csdnblog"
    allowed_domains = ["csdn.net"]
    start_urls = ['http://blog.csdn.net/peoplelist.html?channelid=0&page=1']

    def parse(self, response):
        # Start from the first page of the blog experts list
        data = response.xpath('/html/body/div/span/text()').extract()[0]
        # Extract the total page count
        pages = re.findall('共(.*?)頁', str(data))[0]
        for i in range(0, int(pages)):
            # Build the pagination URL for the experts list
            purl = 'http://blog.csdn.net/peoplelist.html?channelid=0&page=' + str(i+1)
            yield Request(url=purl, callback=self.blogger)

    def blogger(self, response):
        # Collect the blogger homepage URLs on the current page
        bloggers = response.xpath('/html/body/dl/dd/a/@href').extract()
        for burl in bloggers:
            yield Request(url=burl, callback=self.total)

    def total(self, response):
        data = response.xpath('//*[@id="papelist"]/span/text()').extract()[0]
        pages = re.findall('共(.*?)頁',str(data))
        for i in range(0, int(pages[0])):
            # Build the pagination URLs for all of this blogger's posts
            purl = str(response.url) + '/article/list/' + str(i+1)
            yield Request(url= purl, callback=self.article)

    def article(self, response):
        # Collect the post URLs on the blogger's current page
        articles = response.xpath('//span[@class="link_title"]/a/@href').extract()
        for aurl in articles:
            url = "http://blog.csdn.net" + aurl
            yield Request(url=url, callback=self.detail)

    def detail(self, response):
        item = CsdnblogItem()

        # Extract the post detail page information
        item['url'] = response.url
        # The new and the old CSDN theme need different extraction rules
        title = response.xpath('//span[@class="link_title"]/a/text()').extract()
        if not title:
            item['title'] = response.xpath('//h1[@class="csdn_top"]/text()').extract()[0].encode('utf-8')
            item['releaseTime'] = response.xpath('//span[@class="time"]/text()').extract()[0].encode('utf-8')
            item['readnum'] = response.xpath('//button[@class="btn-noborder"]/span/text()').extract()[0]
        else:
            item['title'] = title[0].encode('utf-8').strip()
            item['releaseTime'] = response.xpath('//span[@class="link_postdate"]/text()').extract()[0]
            item['readnum'] = response.xpath('//span[@class="link_view"]/text()').extract()[0].encode('utf-8')[:-9]  # strip the 9-byte UTF-8 suffix after the view count

        # Extract the post body
        data = response.xpath('//div[@class="markdown_views"]|//div[@id="article_content"]')
        item['article'] = data.xpath('string(.)').extract()[0]

        # Use the jieba module to extract keywords from the body; topK=2 keeps the two highest-weighted terms
        tags = jieba.analyse.extract_tags(item['article'], topK=2)
        item['keywords'] = (','.join(tags))

        print "Post title: ", item['title']
        print "Post URL: ", item['url']

        yield item

5. Running the project:

scrapy crawl spider_csdnblog --nolog
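If you also want a local copy of the scraped items in addition to MongoDB, Scrapy's built-in feed export can write them to a file as well (the filename here is just an example):

scrapy crawl spider_csdnblog -o csdnblog.json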

[Screenshot: crawler output in the terminal]

6. Viewing the data in a MongoDB client:

[Screenshot: the stored documents in a MongoDB client]
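If you prefer to check from Python rather than a MongoDB client, a quick sketch with pymongo (using the database and collection names from the pipeline) looks like this:

from pymongo import MongoClient

client = MongoClient('localhost', 27017)
collection = client['csdnblog']['csdnblog']

# Print a couple of the stored posts to confirm the crawl worked
for doc in collection.find().limit(2):
    print doc['title'], doc['url']

client.close()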

Note: if you only want to crawl all posts from one specific blogger, a small change to the spider file is enough (keep the total function and the functions below it), as sketched below.
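For example, a minimal sketch of that modification (the username in start_urls is a hypothetical placeholder, and total, article and detail are reused unchanged from the spider above):

class SpiderSingleBloggerSpider(scrapy.Spider):
    name = "spider_single_blogger"
    allowed_domains = ["csdn.net"]
    # Hypothetical placeholder: the homepage URL of the blogger you want to crawl
    start_urls = ['http://blog.csdn.net/<username>']

    # Reuse total(), article() and detail() from the spider above unchanged;
    # parse() simply hands the blogger's homepage to total()
    def parse(self, response):
        return self.total(response)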

At this point, all posts by the CSDN blog experts have been crawled and stored in the MongoDB database.
