I have been learning Scrapy for quite a while now, but one thing after another kept me from writing up this material. OK, time to get down to business.
1. With the (notoriously painful) Scrapy installation out of the way, the next step is to put Scrapy to work in a hands-on exercise.
2. First, create a project from the command line: scrapy startproject projectname (note that project names must contain only letters, numbers and underscores, so a hyphenated name is rejected).
3. After consulting the docs, here is what each generated file is for:
scrapy.cfg----the project's configuration file
projectname/spiders/----the directory for spider code (this is where our own crawlers go later)
projectname/items.py----defines the Item fields, i.e. the data we want to scrape
projectname/pipelines.py----item pipelines; write code here if the scraped data should go into a database (no changes needed if you just save to a local file)
projectname/settings.py----project settings, e.g. the output path and format
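For reference, running the command for a project named heart produces a layout roughly like this (a sketch; minor details vary across Scrapy versions):

```
heart/
    scrapy.cfg
    heart/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
```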
4. Below I write the code inside a project created with the name "heart".
items.py
import scrapy

class HeartItem(scrapy.Item):
    # one Field per piece of data we want to collect
    novname = scrapy.Field()   # novel name
    link = scrapy.Field()      # chapter link
    title = scrapy.Field()     # chapter title
    content = scrapy.Field()   # chapter text
settings.py
BOT_NAME = 'heart'
SPIDER_MODULES = ['heart.spiders']
NEWSPIDER_MODULE = 'heart.spiders'
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:49.0) Gecko/20100101 Firefox/49.0'
# export the scraped items to a local CSV file (note the "file:///" scheme
# and the lowercase exporter name)
FEED_URI = u'file:///D:/Python27/Scripts/heart/nov1.csv'
FEED_FORMAT = 'csv'
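A side note: FEED_URI/FEED_FORMAT work in the Scrapy versions this post targets, but newer releases (Scrapy 2.1 and later) replace the pair with a single FEEDS dictionary. The equivalent setting would look roughly like this (a sketch, reusing the path above):

```python
# settings.py for Scrapy >= 2.1: one FEEDS dict instead of FEED_URI/FEED_FORMAT
FEEDS = {
    'file:///D:/Python27/Scripts/heart/nov1.csv': {'format': 'csv'},
}
```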
qidian.py (an arbitrary name for the spider)
# coding:utf-8
from scrapy.spiders import Spider
from scrapy.http import Request
from scrapy.selector import Selector
import requests
from heart.items import HeartItem

class HeartSpider(Spider):
    # subclass Spider, not CrawlSpider: CrawlSpider reserves parse() for its
    # own rule handling, so overriding it there breaks the spider
    name = "heart"
    # one listing URL per page; a bare for-loop in the class body would keep
    # only its last assignment to start_urls
    start_urls = [
        'http://f.qidian.com/all?size=-1&sign=-1&tag=-1&chanId=-1&subCateId=-1'
        '&orderId=&update=-1&page=%d&month=-1&style=1&action=-1' % i
        for i in range(1, 3)
    ]

    def parse(self, response):
        # links to the individual novels on the listing page
        urls = response.xpath('//div[@class="book-mid-info"]/h4/a/@href').extract()
        for url in urls:
            # the hrefs are protocol-relative, so prepend the scheme
            yield Request("http:" + url, callback=self.parseContent)

    def parseContent(self, response):
        novname = response.xpath('//title/text()').extract()
        links = response.xpath('//ul[@class="cf"]/li/a/@href').extract()
        for link in links:
            item = HeartItem()   # a fresh item per chapter, not one shared item
            item['novname'] = novname
            item['link'] = 'http:' + link
            # fetch the chapter page synchronously with requests; Selector
            # needs the markup as text, not the requests response object
            r = requests.get(item['link'])
            sel = Selector(text=r.text)
            item['title'] = sel.xpath('//title/text()').extract()
            item['content'] = sel.xpath('//div[@class="read-content j_readContent"]/p/text()').extract()
            yield item
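Two small points from the spider, illustrated with the standard library (the book id 12345 is made up for illustration): a list comprehension is the reliable way to build one start URL per listing page, since a class-body for-loop keeps only its last assignment; and urllib.parse.urljoin resolves Qidian's protocol-relative "//..." links against the page URL, which is a sturdier alternative to hand-prefixing "http:".

```python
from urllib.parse import urljoin

# One listing URL per page, page=1 and page=2.
base = ('http://f.qidian.com/all?size=-1&sign=-1&tag=-1&chanId=-1'
        '&subCateId=-1&orderId=&update=-1&page=%d&month=-1&style=1&action=-1')
start_urls = [base % i for i in range(1, 3)]
print(len(start_urls))  # 2

# Resolve a protocol-relative href (hypothetical book id) against the page URL:
# urljoin keeps the page's scheme and takes host and path from the href.
href = '//book.qidian.com/info/12345'
print(urljoin(start_urls[0], href))  # http://book.qidian.com/info/12345
```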
5. Things to watch out for when writing the spider:
***make sure requests is installed; some tutorials on requests:
http://blog.csdn.net/iloveyin/article/details/21444613
http://jingyan.baidu.com/article/b2c186c8f5d219c46ef6ff85.html
6. Having worked through all this, I once again appreciated the saying "one generation plants the trees, the next enjoys the shade". Here are some good Scrapy learning resources, genuinely useful!!!
http://scrapy-chs.readthedocs.io/zh_CN/latest/intro/overview.html
http://blog.csdn.net/yedoubushishen/article/details/50984045