Scraping Douban movies with the Scrapy framework
Plan:
1. Create the project:
scrapy startproject douban
Generate the spider: scrapy genspider douban_spider movie.douban.com
2. Define the target fields, mainly in items.py.
3. Write the spider: fetch the pages and parse the data.
4. Store the data. Common targets are JSON/CSV/MongoDB/Redis/MySQL; this exercise uses MongoDB.
Project directory structure
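For reference, the layout that `scrapy startproject douban` generates (plus the `main.py` debug script added in step 3 below) typically looks like this; this is a sketch of the standard Scrapy layout, not a listing of the author's exact tree:

```
douban/
├── scrapy.cfg
├── main.py              # debug entry point (step 3)
└── douban/
    ├── __init__.py
    ├── items.py         # field definitions (step 1)
    ├── middlewares.py
    ├── pipelines.py     # MongoDB storage (step 4)
    ├── settings.py      # step 5
    └── spiders/
        ├── __init__.py
        └── douban_spider.py   # step 2
```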
1. items.py
import scrapy

class DoubanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    serial_number = scrapy.Field()   # ranking number
    movie_name = scrapy.Field()      # title
    introduce = scrapy.Field()       # description line (director, year, genre)
    star = scrapy.Field()            # rating
    evaluate = scrapy.Field()        # number of reviews
    describe = scrapy.Field()        # one-line quote
2. douban_spider.py
# -*- coding: utf-8 -*-
import scrapy
from douban.items import DoubanItem

class DoubanSpiderSpider(scrapy.Spider):
    name = 'douban_spider'
    allowed_domains = ['movie.douban.com']
    # entry URL
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        movie_list = response.xpath("//div[@class='article']//ol[@class='grid_view']/li")
        for i_item in movie_list:
            douban_item = DoubanItem()
            douban_item['serial_number'] = i_item.xpath(".//div[@class='item']//em//text()").extract_first()
            douban_item['movie_name'] = i_item.xpath(".//div[@class='info']//div[@class='hd']/a/span[1]/text()").extract_first()
            content = i_item.xpath(".//div[@class='info']//div[@class='bd']/p[1]/text()").extract()
            # collapse the whitespace inside each text node, then join the parts
            douban_item['introduce'] = "".join("".join(part.split()) for part in content)
            douban_item['star'] = i_item.xpath(".//div[@class='info']//div[@class='star']//span[2]/text()").extract_first()
            douban_item['evaluate'] = i_item.xpath(".//div[@class='info']//div[@class='star']//span[4]/text()").extract_first()
            douban_item['describe'] = i_item.xpath(".//div[@class='info']//p[@class='quote']//span/text()").extract_first()
            # extract() returns all matching nodes as a list; extract_first() returns the first match
            # yield the item so it is handed to the pipelines
            yield douban_item
        # next-page rule: grab the href of the "next" link and follow it
        next_link = response.xpath("//span[@class='next']/a/@href").extract()
        if next_link:
            next_link = next_link[0]
            yield scrapy.Request("https://movie.douban.com/top250" + next_link, callback=self.parse)
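The whitespace cleanup applied to `introduce` can be tried on its own. The sample strings below are made-up stand-ins for the text nodes Scrapy would return for the description paragraph, not real response data:

```python
# Hypothetical raw text nodes (for illustration only): Douban wraps the
# description in newlines and runs of spaces.
raw_parts = [
    "\n        导演: 张三   主演: 李四\n        ",
    "\n        1994 / 美国 / 剧情\n    ",
]

# "".join(part.split()) removes every whitespace run inside a node;
# joining the cleaned parts mirrors what the spider stores in 'introduce'.
introduce = "".join("".join(part.split()) for part in raw_parts)
print(introduce)  # → 导演:张三主演:李四1994/美国/剧情
```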
3. main.py — a debug entry point so the spider can be run from an IDE
from scrapy import cmdline

cmdline.execute('scrapy crawl douban_spider'.split())
4. pipelines.py
A pipeline typically:
- cleans HTML data
- validates the scraped fields
- checks for and drops duplicates
- saves the results to the database
import pymongo
from douban.settings import mongo_host, mongo_port, mongo_db_name, mongo_db_collection

class DoubanPipeline(object):
    def __init__(self):
        host = mongo_host
        port = mongo_port
        dbname = mongo_db_name
        sheetname = mongo_db_collection
        client = pymongo.MongoClient(host=host, port=port)
        mydb = client[dbname]
        self.post = mydb[sheetname]

    def process_item(self, item, spider):
        data = dict(item)
        # insert() is deprecated in recent pymongo; use insert_one()
        self.post.insert_one(data)
        return item
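The storage step can be sanity-checked without a running MongoDB by swapping in an in-memory stand-in for the collection (`FakeCollection` below is hypothetical, purely for illustration). `dict(item)` works because a Scrapy Item behaves like a dict; a plain dict stands in for the Item here:

```python
# In-memory stand-in for a pymongo collection (hypothetical, for illustration).
class FakeCollection:
    def __init__(self):
        self.docs = []

    def insert_one(self, doc):
        self.docs.append(doc)

# What process_item does: convert the Item to a plain dict, then insert it.
item = {"serial_number": "1", "movie_name": "肖申克的救赎", "star": "9.7"}
coll = FakeCollection()
coll.insert_one(dict(item))
print(coll.docs[0]["movie_name"])  # → 肖申克的救赎
```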
5. settings.py
# -*- coding: utf-8 -*-
BOT_NAME = 'douban'
SPIDER_MODULES = ['douban.spiders']
NEWSPIDER_MODULE = 'douban.spiders'
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
ROBOTSTXT_OBEY = False
ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 300,
}
mongo_host = '127.0.0.1'
mongo_port = 27017
mongo_db_name = 'douban'
mongo_db_collection = 'douban_movie'
Run results: