I. Scraping Nipic (nipic.com)

Step 1:

1. Create the project

scrapy startproject nituwang

2. Create the spider

scrapy genspider nituwang_spider nipic.com

3. Update the settings

……

Step 2:

1. Create a launcher script for the spider

from scrapy import cmdline

cmdline.execute("scrapy crawl --nolog nituwang_spider".split())  # start the spider with logging disabled

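Step 3:

1. Write items.py:

An Item is the model that holds the data the spider scrapes. It resembles a Django model, except that fields are untyped: there is no distinction between integer fields, character fields, and so on. Every field is declared with scrapy.Field().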

items.py

import scrapy


class NituwangItem(scrapy.Item):
    big_image_url = scrapy.Field()  # URL of the full-size image
    title = scrapy.Field()          # title of the image

Step 4:

1. Write the spider:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from urllib import parse
import re
from ..items import NituwangItem


class NituwangSpiderSpider(scrapy.Spider):
    name = 'nituwang_spider'
    allowed_domains = ['nipic.com']
    start_urls = ['http://www.nipic.com/photo/lvyou/ziran/index.html', ]  # URL the crawl starts from

    def parse(self, response):
        image_urls = response.xpath('//li[contains(@class,"new-works-box")]/a/@href').getall()
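        # Each link points to an image detail page.
        for image_url in image_urls:
            # Request the detail page and parse it in parse_detail.
            yield Request(parse.urljoin(response.url, image_url), callback=self.parse_detail)

        # Read the total page count from the pagination link. On this site the
        # page after the last page is the first page again, so the page URLs
        # are built from the page count.
        end_link = response.xpath('//a[@class="bg-png24 page-btn"][2]/@href').get()
        res = re.match(r'.*?(\d{1,10})', end_link) if end_link else None
        if res is None:
            return  # no pagination link found, nothing more to schedule
        end_num = int(res.group(1))
        for i in range(1, end_num + 1):  # + 1 so that the last page is included
            url = 'http://www.nipic.com/photo/lvyou/ziran/index.html?page={}'.format(i)
            # Request each listing page; Scrapy's duplicate filter drops repeated URLs.
            yield Request(url, callback=self.parse)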

    # Image detail page: extract the full-size image.
    def parse_detail(self, response):
        item = NituwangItem()
        big_image_url = response.xpath('//img[@id="J_worksImg"]/@src').get()  # full-size image URL
        item['big_image_url'] = [big_image_url]  # must be a list: the pipeline iterates over it
        item['title'] = response.xpath('//img[@id="J_worksImg"]/@alt').get()  # image title
        return item
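
The pagination logic in parse can be tried without running a crawl. The following is a minimal sketch using only the standard library. It assumes the pagination href has the same index.html?page=N shape as the URLs the spider builds, and it includes page N itself in the result. The helper name build_page_urls and the sample href and detail link are hypothetical, chosen for illustration.

```python
import re
from urllib import parse

LISTING_URL = 'http://www.nipic.com/photo/lvyou/ziran/index.html'


def build_page_urls(end_link, listing_url=LISTING_URL):
    """Return the listing-page URLs 1..N, where N is the first number in end_link.

    Returns an empty list when end_link is missing or contains no digits.
    """
    if not end_link:
        return []
    match = re.match(r'.*?(\d{1,10})', end_link)
    if match is None:
        return []
    last_page = int(match.group(1))
    return ['{}?page={}'.format(listing_url, i) for i in range(1, last_page + 1)]


# Hypothetical href values, shaped like the URLs the spider requests.
print(build_page_urls('index.html?page=3'))
print(build_page_urls(None))

# urljoin resolves a relative detail-page link against the listing page URL.
print(parse.urljoin(LISTING_URL, '/show/123.html'))
```

Returning an empty list for a missing link keeps the caller simple: a page without pagination just schedules no further listing requests.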

Step 5:

1. Open pipelines.py and process the scraped data. The pipeline methods run in this order:
get_media_requests -> file_path -> item_completed
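
The last method in that chain, item_completed, receives a results argument. In Scrapy this is a list of (success, info) tuples, one per request yielded by get_media_requests; for a successful download, info is a dict that includes the url and the stored path. The sketch below uses only the standard library to show how the stored paths could be picked out of such a list. The helper name stored_paths and the sample data are made up for illustration.

```python
def stored_paths(results):
    """Return the storage paths of the downloads that succeeded."""
    return [info['path'] for ok, info in results if ok]


# Hypothetical sample shaped like the results list. In a real crawl the second
# element of a failed entry is a Twisted Failure; an Exception stands in for it here.
sample_results = [
    (True, {'url': 'http://example.com/a.jpg', 'path': '風景/lake3f9k2m1xqz.jpg', 'checksum': 'abc'}),
    (False, Exception('download failed')),
]

print(stored_paths(sample_results))
```

Inside item_completed, a pipeline could, for example, raise scrapy.exceptions.DropItem when this list is empty, so that items without a downloaded image are discarded.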

from scrapy.pipelines.images import ImagesPipeline
from scrapy.http import Request
import random


class MyImagePipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        '''
        Called before any download request is sent;
        yields the requests that download the images.
        '''
        # meta carries the title along with the request, and file_path reads it from there.
        for image_url in item['big_image_url']:
            yield Request(image_url, meta={'title': item['title']})

    def item_completed(self, results, item, info):
        # results holds the download information for each image after it has been
        # stored under the path from file_path, so files could also be renamed here.
        print('******the results is********:', results)
        return item

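    def file_path(self, request, response=None, info=None, *, item=None):
        '''
        Called when an image is about to be stored, to obtain its storage path.
        Scrapy 2.4 and later pass the item as a keyword argument; it is unused here.
        :param request: the download request, carrying the title in request.meta
        :param response:
        :param info:
        :return: storage path, relative to IMAGES_STORE
        '''
        a = 'abcdefghijklmnopqrstuvwxyz1234567890'
        b = ''.join(random.sample(a, 10))
        # Titles repeat, so append 10 random characters to keep file names unique.
        title = (request.meta['title'] or '').strip() + b
        path = '風景/%s.jpg' % (title)  # "風景" ("scenery") is the subfolder name
        return path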
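
The random suffix in file_path produces a different file name every time the same image is downloaded. One possible alternative, sketched here with only the standard library, derives the suffix from a SHA-1 hash of the image URL (Scrapy's default file_path also names files by a SHA-1 hash of the URL) and replaces characters that are unsafe in file names. The helper build_image_path, its folder parameter, and the sample values are assumptions for illustration, not part of the original pipeline.

```python
import hashlib
import re


def build_image_path(title, url, folder='風景'):
    """Build '<folder>/<clean title>_<10 hex chars of the URL hash>.jpg'."""
    # Replace path separators and other characters that file systems commonly reject.
    clean_title = re.sub(r'[\\/:*?"<>|\s]+', '_', (title or '').strip()) or 'untitled'
    # The same URL always hashes to the same suffix, so reruns reuse the file name.
    suffix = hashlib.sha1(url.encode('utf-8')).hexdigest()[:10]
    return '{}/{}_{}.jpg'.format(folder, clean_title, suffix)


# Hypothetical title and URL.
print(build_image_path(' lake/sunset ', 'http://example.com/img/1.jpg'))
```

In the pipeline, file_path could return build_image_path(request.meta['title'], request.url). Two images with the same title but different URLs still get distinct names.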

Step 6:

1. Configure settings.py

……
import os

ITEM_PIPELINES = {
    # 'nituwang.pipelines.NituwangPipeline': 300,
    # 'scrapy.pipelines.images.ImagesPipeline': 1,
    'nituwang.pipelines.MyImagePipeline': 1,
}
# Build the storage directory from this file's location instead of hard-coding
# an absolute path, so it does not have to be changed on another machine.
path = os.path.abspath(os.path.dirname(__file__))
path = os.path.join(path, 'images')
IMAGES_STORE = path
IMAGES_URLS_FIELD = 'big_image_url'  # item field that holds the image URLs to download
……
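
The IMAGES_STORE value above resolves to an images folder next to settings.py. As a small sketch, the same computation can be wrapped in a function so it can be checked outside a Scrapy project. The function name images_store_for and the sample path are hypothetical.

```python
import os


def images_store_for(settings_file):
    """Return the absolute path of the 'images' directory next to settings_file."""
    base_dir = os.path.abspath(os.path.dirname(settings_file))
    return os.path.join(base_dir, 'images')


# Hypothetical location of settings.py inside the project.
store = images_store_for(os.path.join('nituwang', 'nituwang', 'settings.py'))
print(store)
print(os.path.isabs(store))
```

In settings.py itself the argument would be __file__, which gives the same result as the two path lines in the settings excerpt.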

