Python Learning Diary: Scrapy Framework 2. Scraping Teacher Information

1. Creating a new project

In Terminal, change into the directory where the project should live and run scrapy startproject <project name>.

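A problem came up:

Solution: run pip install -I cryptography in Terminal and wait for the installation to finish. Then run scrapy startproject Spider again (Spider is a project name of your own choosing) and the project is created.

When Scrapy's confirmation output appears, the project has been created successfully.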

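The generated project contains the following files:

scrapy.cfg: the Scrapy project configuration file

items.py: defines the structure of the scraped data (the format of the content to be scraped)

middlewares.py: defines the middlewares used while crawling

pipelines.py: defines the item pipelines (storing the scraped content)

settings.py: the project settings file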

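2. Deciding what to scrape and writing items.py

This project scrapes teacher information, so each item has a teacher's name, position, and profile.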

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class New2Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Teacher's name
    name = scrapy.Field()
    # Position
    level = scrapy.Field()
    # Profile text
    info = scrapy.Field()

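3. Writing the spider file

3.1 Getting the response

First run scrapy shell example.com and inspect the response object (for example response.status), which is the simplest way to see how the site answers. A status of 200 means the request went through normally; a 403 means the crawler has been blocked. In that case, set a user agent so the request looks like it comes from a browser: scrapy shell example.com -s USER_AGENT="<browser user-agent string>".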
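The same user-agent override can also live in the project's settings.py, so it does not have to be passed on every command. A minimal sketch; the user-agent string below is only an illustrative placeholder and should be replaced with the one your own browser sends:

```python
# settings.py (fragment)
# USER_AGENT is the Scrapy setting that controls the User-Agent header
# sent with the project's requests. The value here is a placeholder.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)
```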

3.2 Creating the spider file

When the project is created with scrapy startproject XX, Scrapy replies:

You can start your first spider with:

cd XX

scrapy genspider example example.com

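Here example is the spider name, which must not be the same as the project name, and example.com is the domain the spider is restricted to. For instance, a spider restricted to www.baidu.com will not scrape content from www.google.com.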
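The idea behind the domain restriction can be sketched in plain Python. This is a simplified illustration, not Scrapy's actual implementation; is_allowed is a hypothetical helper written for this example, and it assumes that subdomains of an allowed domain count as allowed:

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


allowed = ["baidu.com"]
print(is_allowed("https://www.baidu.com/s?wd=scrapy", allowed))  # True
print(is_allowed("https://www.google.com/", allowed))            # False
```

Links that fail a check like this are simply never followed, which is what keeps a spider from wandering off the target site.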

3.3 XPath syntax

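Symbol	Meaning
/	Steps from a node to its direct children (at the start of an expression, starts from the document root)
text()	Selects the text contained in a tag
@	Selects an attribute of a tag
//	Selects matching tags at any depth in the document
[@attribute=value]	Keeps only tags whose attribute satisfies the condition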
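The path steps in the table can be tried without Scrapy. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in; it supports only a subset of XPath (no text() or @attribute steps), so .text and .get() play those roles here. The markup, names, and links are invented for illustration:

```python
import xml.etree.ElementTree as ET

# A tiny, made-up, well-formed fragment (names and links are invented).
html = """
<html><body>
  <div class="nav"><p>Home</p></div>
  <div class="txtk"><p><b>Teacher A</b></p><a href="/a.html">more</a></div>
  <div class="txtk"><p><b>Teacher B</b></p><a href="/b.html">more</a></div>
</body></html>
""".strip()
root = ET.fromstring(html)  # root is the <html> element

# '/' steps from a node to its direct children: html -> body -> div
top_divs = root.findall("./body/div")
# '//' matches at any depth; [@class='txtk'] keeps only matching attributes
cards = root.findall(".//div[@class='txtk']")
# ElementTree has no text() or @attr steps, so .text and .get() stand in
names = [card.find("./p/b").text for card in cards]
links = [card.find("./a").get("href") for card in cards]

print(len(top_divs), len(cards))  # 3 2
print(names)                      # ['Teacher A', 'Teacher B']
print(links)                      # ['/a.html', '/b.html']
```

With a Scrapy response, the names would be selected in a single expression such as response.xpath('//div[@class="txtk"]/p/b/text()').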

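Press F12 in the browser to open the developer tools, locate the information to scrape, and use Copy XPath to obtain the corresponding XPath expression. The code that locates the data then goes in the parse function.

For example, for the name and title of the person shown on the left of the page, Copy XPath gives: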

/html/body/div[5]/div/ul/li[1]/div[2]

/html/body/div[5]/div/ul/li[1]/div[2]/p[1]/b

/html/body/div[5]/div/ul/li[1]/div[2]/p[1]/text()

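The three expressions above correspond to the text block, the name, and the position, respectively. They can be checked by running them against the response obtained with scrapy shell XX.

Once the information to scrape has been located, the spider file can be written.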

# -*- coding: utf-8 -*-
import scrapy

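from new2.items import New2Item


class ItcastSpider(scrapy.Spider):
    name = 'itcast'  # spider name
    allowed_domains = ['me.sjtu.edu.cn']  # domains the spider may crawl
    start_urls = ['http://me.sjtu.edu.cn/academician.html']  # first URL to crawl

    def parse(self, response):
        # Scrapy's built-in XPath support returns a list of Selector objects,
        # one per teacher block
        teacher_list = response.xpath('//div[@class="txtk"]')

        for each in teacher_list:
            # Create a fresh item for every teacher so yielded items do not share state
            item = New2Item()
            # extract_first() returns None instead of raising IndexError when nothing matches.
            # Note: './p/b' yields the <b> element with its tags; append '/text()' for the text only.
            item['name'] = each.xpath('./p/b').extract_first()
            item['level'] = each.xpath('./p/text()').extract_first()
            item['info'] = each.xpath('./div/p/span').extract_first()
            yield item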
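One thing to watch in a parse loop like this is whether the item object is created once or once per record. The sketch below uses plain dicts as a stand-in for Scrapy items (an assumption made so it runs without Scrapy), and both function names are hypothetical:

```python
def reuse_one_dict(names):
    item = {}                 # created once, outside the loop
    for n in names:
        item["name"] = n      # overwrites the same object every time
        yield item


def fresh_dict_each_time(names):
    for n in names:
        item = {}             # a new object per record
        item["name"] = n
        yield item


names = ["A", "B", "C"]
print(list(reuse_one_dict(names)))
# [{'name': 'C'}, {'name': 'C'}, {'name': 'C'}]
print(list(fresh_dict_each_time(names)))
# [{'name': 'A'}, {'name': 'B'}, {'name': 'C'}]
```

Whether the shared object causes trouble in practice depends on how the yielded items are consumed. If each one is fully processed before the next is produced, the problem may go unnoticed, but creating the item inside the loop avoids it either way.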

4. Writing the pipeline file

The pipeline file pipelines.py saves the scraped content to local disk.

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

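import json


class New2Pipeline(object):
    def __init__(self):
        # Create teacher.json locally, opened in binary mode
        self.filename = open('teacher.json', 'wb+')

    def process_item(self, item, spider):
        # Convert the item to a JSON string, one object per line
        text = json.dumps(dict(item), ensure_ascii=False) + '\n'
        # Write it out as UTF-8 bytes
        self.filename.write(text.encode('utf-8'))
        return item

    def close_spider(self, spider):
        # Close the file when the spider finishes
        self.filename.close()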
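As the comment at the top of the file says, the pipeline only runs once it is registered in settings.py. A minimal sketch, assuming the project package is named new2 as in the spider's import; the priority 300 is just a conventional middle value:

```python
# settings.py (fragment)
# Keys are import paths to pipeline classes; values are priorities
# (conventionally 0-1000, lower numbers run first).
# "new2" is assumed to be the project package, as in the spider's import.
ITEM_PIPELINES = {
    "new2.pipelines.New2Pipeline": 300,
}
```

With the pipeline enabled, running scrapy crawl itcast from the project directory writes teacher.json, where each line is one JSON object that can be read back with json.loads.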