Scraping Douban movies with Scrapy (a Python crawler exercise)

Development environment: Windows + PyCharm + MongoDB + Scrapy

Goal: crawl the Douban Movie Top 250 list and store the data in MongoDB.

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class DoubanItem(scrapy.Item):
    # define the fields for your item here like:
    # movie title
    title = scrapy.Field()
    # basic info (director, cast, year, genre)
    bd = scrapy.Field()
    # rating
    star = scrapy.Field()
    # one-line quote / blurb
    quote = scrapy.Field()
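
A DoubanItem behaves like a dict, which is exactly how the pipeline will consume it later. A quick illustrative sketch of filling one by hand (the values here are made up, not scraped):

from douban.items import DoubanItem

item = DoubanItem()
item['title'] = u'肖申克的救贖'  # illustrative value, not scraped here
item['star'] = u'9.7'           # illustrative value
print(dict(item))               # the pipeline converts items the same way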

The spider file

# -*- coding: utf-8 -*-
import scrapy
from douban.items import DoubanItem


class DoubantopSpider(scrapy.Spider):
    name = "DoubanTop"
    allowed_domains = ["movie.douban.com"]
    offset = 0
    url = "https://movie.douban.com/top250?start="
    start_urls = (
        url + str(offset),
    )

    def parse(self, response):
        movies = response.xpath('//div[@class="info"]')
        for each in movies:
            # create a fresh item for every movie; reusing one item
            # would leak the previous movie's quote when a quote is missing
            item = DoubanItem()
            # movie title
            item['title'] = each.xpath('.//span[@class="title"][1]/text()').extract()[0]
            # basic info (director, cast, year, genre)
            item['bd'] = each.xpath('.//div[@class="bd"]/p/text()').extract()[0]
            # rating
            item['star'] = each.xpath('.//div[@class="star"]/span[@class="rating_num"]/text()').extract()[0]
            # one-line quote (a few movies have none)
            quote = each.xpath('.//p[@class="quote"]/span/text()').extract()
            if quote:
                item['quote'] = quote[0]
            yield item

        if self.offset < 225:
            self.offset += 25
            yield scrapy.Request(self.url + str(self.offset), callback=self.parse)
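
Each `.extract()[0]` raises an IndexError as soon as an XPath matches nothing. A more defensive variant of the loop (a sketch using Scrapy's `extract_first()` with a default; pagination omitted) would be:

    def parse(self, response):
        for each in response.xpath('//div[@class="info"]'):
            item = DoubanItem()
            # extract_first() returns the default instead of raising IndexError
            item['title'] = each.xpath('.//span[@class="title"][1]/text()').extract_first(default='')
            item['bd'] = each.xpath('.//div[@class="bd"]/p/text()').extract_first(default='')
            item['star'] = each.xpath('.//div[@class="star"]/span[@class="rating_num"]/text()').extract_first(default='')
            item['quote'] = each.xpath('.//p[@class="quote"]/span/text()').extract_first(default='')
            yield item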

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo
from scrapy.utils.project import get_project_settings  # scrapy.conf is deprecated

settings = get_project_settings()


class DoubanPipeline(object):
    def __init__(self):
        host = settings["MONGODB_HOST"]
        port = settings["MONGODB_PORT"]
        dbname = settings["MONGODB_DBNAME"]
        sheetname = settings["MONGODB_SHEETNAME"]
        # create the MongoDB connection
        client = pymongo.MongoClient(host=host, port=port)
        # select the database
        mydb = client[dbname]
        # collection that will store the scraped data
        self.post = mydb[sheetname]

    def process_item(self, item, spider):
        data = dict(item)
        # insert() is deprecated in pymongo 3+; use insert_one()
        self.post.insert_one(data)
        return item
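
On current Scrapy versions the idiomatic way to reach settings is the `from_crawler` hook rather than a module-level import. A minimal alternative sketch (the class name `MongoPipeline` is my own, not from the original project):

import pymongo


class MongoPipeline(object):
    def __init__(self, host, port, dbname, sheetname):
        self.client = pymongo.MongoClient(host=host, port=port)
        self.post = self.client[dbname][sheetname]

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook with the running crawler,
        # whose .settings holds everything from settings.py
        s = crawler.settings
        return cls(s['MONGODB_HOST'], s['MONGODB_PORT'],
                   s['MONGODB_DBNAME'], s['MONGODB_SHEETNAME'])

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.post.insert_one(dict(item))
        return item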

settings.py

BOT_NAME = 'douban'

SPIDER_MODULES = ['douban.spiders']
NEWSPIDER_MODULE = 'douban.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'douban.pipelines.DoubanPipeline': 300,
}
# MongoDB host
MONGODB_HOST = '127.0.0.1'
# MongoDB port
MONGODB_PORT = 27017
# database name
MONGODB_DBNAME = "Douban"
# collection that stores the scraped data
MONGODB_SHEETNAME = "doubanmovies"
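
With these settings in place, the crawl starts from the project root with `scrapy crawl DoubanTop`. It can also be driven from a plain Python script (a sketch using Scrapy's CrawlerProcess):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('DoubanTop')  # spider name as defined in DoubantopSpider.name
process.start()             # blocks until the crawl finishes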

Final result: all Top 250 entries end up in the doubanmovies collection in MongoDB.
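
To spot-check the result, the collection can be queried directly with pymongo (host, database, and collection names taken from settings.py above):

import pymongo

client = pymongo.MongoClient('127.0.0.1', 27017)
movies = client['Douban']['doubanmovies']
print(movies.count_documents({}))   # should print 250
for doc in movies.find().limit(3):
    print(doc['title'], doc['star'])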
