Scraping Douban Movies with the Scrapy Framework


Approach:
1. Create the project
scrapy startproject douban
Create the spider: scrapy genspider douban_spider movie.douban.com
2. Define the target fields, mainly in items.py
3. Write the spider: fetch and parse the data
4. Store the data. Common storage formats are JSON/CSV/MongoDB/Redis/MySQL; this exercise uses MongoDB.

Overall project directory structure:
(screenshot omitted)

1. items.py
import scrapy

class DoubanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    serial_number = scrapy.Field()  # rank on the Top 250 list
    movie_name = scrapy.Field()     # movie title
    introduce = scrapy.Field()      # one-line synopsis (director, cast, year)
    star = scrapy.Field()           # rating score
    evaluate = scrapy.Field()       # number of ratings
    describe = scrapy.Field()       # one-line quote
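A `scrapy.Item` behaves like a dict, but assigning a key that was not declared as a `Field` raises a `KeyError`. A minimal pure-Python stand-in (illustration only, no scrapy needed; the real `scrapy.Item` collects its Fields via a metaclass) to show that behaviour:

```python
# Pure-Python stand-in for a scrapy.Item: a dict that only accepts declared fields.
class StrictItem(dict):
    fields = {'serial_number', 'movie_name', 'introduce',
              'star', 'evaluate', 'describe'}

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"StrictItem does not support field: {key}")
        super().__setitem__(key, value)

item = StrictItem()
item['movie_name'] = 'The Shawshank Redemption'
item['star'] = '9.7'
print(dict(item))  # the pipeline later stores exactly this dict

try:
    item['director'] = 'Frank Darabont'  # not a declared field
except KeyError as e:
    print('rejected:', e)
```

This strictness is why `dict(item)` in the pipeline below is safe: only the declared fields can ever be present.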

2. douban_spider.py

# -*- coding: utf-8 -*-

import scrapy
from douban.items import DoubanItem

class DoubanSpiderSpider(scrapy.Spider):
    name = 'douban_spider'
    allowed_domains = ['movie.douban.com']
    # entry URL
    start_urls = ['https://movie.douban.com/top250']
    def parse(self, response):
        movie_list = response.xpath("//div[@class='article']//ol[@class='grid_view']/li")
        for i_item in movie_list:
            douban_item = DoubanItem()
            douban_item['serial_number'] = i_item.xpath(".//div[@class='item']//em//text()").extract_first()
            douban_item['movie_name'] = i_item.xpath(".//div[@class='info']//div[@class='hd']/a/span[1]/text()").extract_first()
            content = i_item.xpath(".//div[@class='info']//div[@class='bd']/p[1]/text()").extract()
            # collapse the whitespace in each line, then join the lines
            douban_item['introduce'] = ";".join("".join(part.split()) for part in content)
            douban_item['star'] = i_item.xpath(".//div[@class='info']//div[@class='star']//span[2]/text()").extract_first()
            douban_item['evaluate'] = i_item.xpath(".//div[@class='info']//div[@class='star']//span[4]/text()").extract_first()
            douban_item['describe'] = i_item.xpath(".//div[@class='info']//p[@class='quote']//span/text()").extract_first()
            # extract() returns a list of all matched nodes; extract_first() returns the first match
            # yield the item so it is passed on to the pipelines
            yield douban_item
        # next-page rule: take the href of the "next" link
        next_link = response.xpath("//span[@class='next']/link/@href").extract()
        if next_link:
            next_link = next_link[0]
            yield scrapy.Request("https://movie.douban.com/top250" + next_link, callback=self.parse)
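The `introduce` field is cleaned with `"".join(s.split())`: calling `split()` with no argument splits on any run of whitespace (spaces, tabs, newlines) and drops empty strings, so joining the pieces strips all whitespace in one pass. A standalone illustration of the trick, independent of scrapy:

```python
# str.split() with no argument splits on any whitespace run, so joining the
# resulting tokens removes every space, tab, and newline at once.
raw = "  導演: 弗蘭克·德拉邦特\t主演: 蒂姆·羅賓斯\n 1994 "
cleaned = "".join(raw.split())
print(cleaned)  # → 導演:弗蘭克·德拉邦特主演:蒂姆·羅賓斯1994
```

For the next-page request, `response.urljoin(next_link)` is the scrapy-idiomatic alternative to concatenating the base URL by hand; it resolves relative links like `?start=25&filter=` correctly against the current page.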

3. main.py (for debugging the spider from Python)
from scrapy import cmdline

cmdline.execute('scrapy crawl douban_spider'.split())
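main.py just forwards the usual shell command to scrapy's `cmdline.execute`, which expects an argv-style list of tokens rather than a single string; `.split()` does that conversion:

```python
# cmdline.execute expects a list of arguments, like sys.argv would hold.
argv = 'scrapy crawl douban_spider'.split()
print(argv)  # → ['scrapy', 'crawl', 'douban_spider']
```

Running main.py inside an IDE therefore behaves the same as typing `scrapy crawl douban_spider` in the project directory, but with breakpoints available.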

4. pipelines.py
Typical pipeline responsibilities:
Clean the HTML data
Validate the scraped data and check the fields
Check for and drop duplicate content
Save the results to a database

import pymongo
from douban.settings import mongo_host, mongo_port, mongo_db_name, mongo_db_collection

class DoubanPipeline(object):
    def __init__(self):
        host = mongo_host
        port = mongo_port
        dbname = mongo_db_name
        sheetname = mongo_db_collection
        client = pymongo.MongoClient(host=host, port=port)
        mydb = client[dbname]
        self.collection = mydb[sheetname]

    def process_item(self, item, spider):
        data = dict(item)
        # insert() is deprecated in recent pymongo; use insert_one() instead
        self.collection.insert_one(data)
        return item
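The third responsibility above (dropping duplicates) is usually handled by a second pipeline that remembers the keys it has already seen. A minimal sketch, written as pure Python here; in a real Scrapy project you would raise `scrapy.exceptions.DropItem` instead of returning `None`, and register the class in `ITEM_PIPELINES` with a lower number than `DoubanPipeline` so it runs first:

```python
class DuplicatesPipeline:
    """Drop items whose serial_number was already seen (hypothetical helper)."""

    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider):
        key = item['serial_number']
        if key in self.seen:
            return None  # real Scrapy code: raise DropItem(f"duplicate: {key}")
        self.seen.add(key)
        return item

pipe = DuplicatesPipeline()
print(pipe.process_item({'serial_number': '1'}, None))  # passes through
print(pipe.process_item({'serial_number': '1'}, None))  # duplicate
```

Keeping deduplication separate from storage keeps each pipeline single-purpose, which is the pattern Scrapy's pipeline chain is designed around.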

5. settings.py

# -*- coding: utf-8 -*-

BOT_NAME = 'douban'

SPIDER_MODULES = ['douban.spiders']
NEWSPIDER_MODULE = 'douban.spiders'

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'

ROBOTSTXT_OBEY = False

ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 300,
}

mongo_host = '127.0.0.1'
mongo_port = 27017
mongo_db_name = 'douban'
mongo_db_collection = 'douban_movie'

Run result:
(screenshot omitted)
