I have been learning Scrapy for quite a while now, but one thing after another kept me from writing up this material. OK, time to get down to business.
1. With the (notoriously painful) Scrapy installation out of the way, the next step is to put Scrapy to work in a hands-on exercise.
2. First, create a project from the command line: scrapy startproject projectname (note that project names must contain only letters, numbers and underscores, so a hyphenated name is rejected).
3. After consulting the docs, here is what each generated file is for:
scrapy.cfg----the project's configuration file
projectname/spiders/----the directory for spider code (this is where our own crawlers go later)
projectname/items.py----defines the Item fields, i.e. the data we want to scrape
projectname/pipelines.py----item pipelines; write code here if the scraped data should go into a database (no changes needed if you just save to a local file)
projectname/settings.py----project settings, e.g. the output path and format
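For reference, running the command for a project named heart produces a layout roughly like this (a sketch; minor details vary across Scrapy versions):

```
heart/
    scrapy.cfg
    heart/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
```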
4. Below I write the code inside a project created with the name "heart".
items.py
import scrapy

class HeartItem(scrapy.Item):
    # one Field per piece of data we want to collect
    novname = scrapy.Field()   # novel name
    link = scrapy.Field()      # chapter link
    title = scrapy.Field()     # chapter title
    content = scrapy.Field()   # chapter text
settings.py
BOT_NAME = 'heart'
SPIDER_MODULES = ['heart.spiders']
NEWSPIDER_MODULE = 'heart.spiders'
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:49.0) Gecko/20100101 Firefox/49.0'
# export the scraped items to a local CSV file (note the "file:///" scheme
# and the lowercase exporter name)
FEED_URI = u'file:///D:/Python27/Scripts/heart/nov1.csv'
FEED_FORMAT = 'csv'
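A side note: FEED_URI/FEED_FORMAT work in the Scrapy versions this post targets, but newer releases (Scrapy 2.1 and later) replace the pair with a single FEEDS dictionary. The equivalent setting would look roughly like this (a sketch, reusing the path above):

```python
# settings.py for Scrapy >= 2.1: one FEEDS dict instead of FEED_URI/FEED_FORMAT
FEEDS = {
    'file:///D:/Python27/Scripts/heart/nov1.csv': {'format': 'csv'},
}
```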
qidian.py (an arbitrary name for the spider)
# coding:utf-8
from scrapy.spiders import Spider
from scrapy.http import Request
from scrapy.selector import Selector
import requests
from heart.items import HeartItem

class HeartSpider(Spider):
    # subclass Spider, not CrawlSpider: CrawlSpider reserves parse() for its
    # own rule handling, so overriding it there breaks the spider
    name = "heart"
    # one listing URL per page; a bare for-loop in the class body would keep
    # only its last assignment to start_urls
    start_urls = [
        'http://f.qidian.com/all?size=-1&sign=-1&tag=-1&chanId=-1&subCateId=-1'
        '&orderId=&update=-1&page=%d&month=-1&style=1&action=-1' % i
        for i in range(1, 3)
    ]

    def parse(self, response):
        # links to the individual novels on the listing page
        urls = response.xpath('//div[@class="book-mid-info"]/h4/a/@href').extract()
        for url in urls:
            # the hrefs are protocol-relative, so prepend the scheme
            yield Request("http:" + url, callback=self.parseContent)

    def parseContent(self, response):
        novname = response.xpath('//title/text()').extract()
        links = response.xpath('//ul[@class="cf"]/li/a/@href').extract()
        for link in links:
            item = HeartItem()   # a fresh item per chapter, not one shared item
            item['novname'] = novname
            item['link'] = 'http:' + link
            # fetch the chapter page synchronously with requests; Selector
            # needs the markup as text, not the requests response object
            r = requests.get(item['link'])
            sel = Selector(text=r.text)
            item['title'] = sel.xpath('//title/text()').extract()
            item['content'] = sel.xpath('//div[@class="read-content j_readContent"]/p/text()').extract()
            yield item
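Two small points from the spider, illustrated with the standard library (the book id 12345 is made up for illustration): a list comprehension is the reliable way to build one start URL per listing page, since a class-body for-loop keeps only its last assignment; and urllib.parse.urljoin resolves Qidian's protocol-relative "//..." links against the page URL, which is a sturdier alternative to hand-prefixing "http:".

```python
from urllib.parse import urljoin

# One listing URL per page, page=1 and page=2.
base = ('http://f.qidian.com/all?size=-1&sign=-1&tag=-1&chanId=-1'
        '&subCateId=-1&orderId=&update=-1&page=%d&month=-1&style=1&action=-1')
start_urls = [base % i for i in range(1, 3)]
print(len(start_urls))  # 2

# Resolve a protocol-relative href (hypothetical book id) against the page URL:
# urljoin keeps the page's scheme and takes host and path from the href.
href = '//book.qidian.com/info/12345'
print(urljoin(start_urls[0], href))  # http://book.qidian.com/info/12345
```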
5. Things to watch out for when writing the spider:
***make sure requests is installed; some tutorials on requests:
http://blog.csdn.net/iloveyin/article/details/21444613
http://jingyan.baidu.com/article/b2c186c8f5d219c46ef6ff85.html
6. Having worked through all this, I once again appreciated the saying "one generation plants the trees, the next enjoys the shade". Here are some good Scrapy learning resources, genuinely useful!!!
http://scrapy-chs.readthedocs.io/zh_CN/latest/intro/overview.html
http://blog.csdn.net/yedoubushishen/article/details/50984045