Scrapy refresher (saving the Douban Top 250 movies to MySQL)

Preface

Being bored at home lately, I've been rewatching courses I took on iMooc, and took the chance to properly learn Scrapy, a framework I only half understood before. The course: Python最火爬蟲框架Scrapy入門與實踐 ("Python's hottest crawler framework: Scrapy, intro and practice").


Code

First, create the project

In cmd, run scrapy startproject xxx

Second, create the spider

Go into the spiders folder, then run scrapy genspider <your-spider-name> <domain>

Third, define the target

Edit items.py to define the fields you want to save

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy
class DoubanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # serial number, movie name, introduction, star rating, review count, description
    serial_number = scrapy.Field()
    movie_name = scrapy.Field()
    introduce = scrapy.Field()
    star = scrapy.Field()
    evaluate = scrapy.Field()
    depict = scrapy.Field()

Fourth, parse the page and extract the fields to save

# -*- coding: utf-8 -*-
import scrapy
from ..items import DoubanItem

class DoubanSpiderSpider(scrapy.Spider):
    # spider name
    name = 'douban_spider'
    # allowed domains
    allowed_domains = ['movie.douban.com']
    # entry URL
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        movie_list = response.xpath('//div[@class="article"]//ol[@class="grid_view"]/li')
        for i in movie_list:

            douban_item = DoubanItem()
            douban_item['serial_number'] = i.xpath(".//div[@class ='item']//em/text()")\
                .extract_first()
            douban_item['movie_name'] = i.xpath(
                ".//div[@class='info']//div[@class='hd']/a/span[@class = 'title'][1]/text()")\
                .extract_first()
            # the <p> holds several lines (director/cast, year/country/genre);
            # strip all whitespace from each line, then join them so nothing is lost
            content = i.xpath(".//div[@class='info']//div[@class='bd']/p[1]/text()").extract()
            douban_item['introduce'] = ";".join("".join(j.split()) for j in content)
            douban_item['star'] = i.xpath(".//span[@class='rating_num']/text()").extract_first()
            douban_item['evaluate'] = i.xpath(".//div[@class = 'star']//span[4]/text()").extract_first()
            douban_item['depict'] = i.xpath(".//p[@class='quote']/span/text()").extract_first()
            # douban_item['describe'] = douban_item['describe'].strip()
            yield douban_item

        next_link = response.xpath("//span[@class = 'next']/link/@href").extract()
        if next_link:
            next_link = next_link[0]
            yield scrapy.Request('https://movie.douban.com/top250'+next_link,callback=self.parse)
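The whitespace cleanup used for the introduce field is worth calling out: str.split() with no arguments splits on runs of ANY whitespace (spaces, tabs, newlines, even non-breaking spaces from &nbsp;), so joining the pieces back strips all of it in one pass. A minimal standalone illustration, with made-up input:

```python
# "".join(j.split()) collapses every kind of whitespace at once
raw = "   導演: 弗蘭克·德拉邦特 \n\xa0\xa0 1994\xa0/\xa0美國  "
cleaned = "".join(raw.split())
print(cleaned)  # → 導演:弗蘭克·德拉邦特1994/美國
```

This is more thorough than str.strip(), which only trims the ends and would leave the interior newlines and non-breaking spaces behind.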

Fifth, write the MySQL connection

Edit pipelines.py

from pymysql import connect

class MySQLPipeline(object):
    def __init__(self):
        # connect to the database
        self.connect = connect(
            host='0.0.0.0',
            port=3306,
            db='scrapy',
            user='root',
            passwd='xxxxx',
            charset='utf8',
            use_unicode=True)
        # get a cursor for executing statements
        self.cursor = self.connect.cursor()

    def process_item(self, item, spider):
        self.cursor.execute(
            """INSERT INTO douban (serial_number, movie_name, introduce, star, evaluate, depict) VALUES (%s, %s, %s, %s, %s, %s)""",
            (item['serial_number'],
             item['movie_name'],
             item['introduce'],
             item['star'],
             item['evaluate'],
             item['depict']
             ))
        # execute the SQL; the item fields map one-to-one to the table columns
        self.connect.commit()
        # commit the transaction
        return item
        # return the item so any later pipeline can still process it

    def close_spider(self, spider):
        # close the cursor
        self.cursor.close()
        # close the database connection
        self.connect.close()

Note! Column names cannot be describe, desc, etc.: these are MySQL reserved keywords. This tripped me up for a long time! Another reminder to myself to review MySQL...
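For reference, a hypothetical DDL for the douban table the pipeline writes to (the column types here are my guesses, not from the original post). Backtick-quoting every identifier also sidesteps reserved words like describe/desc entirely, though renaming the field to depict as above is the cleaner fix:

```python
# DDL sketch for the table used by MySQLPipeline; run once before crawling,
# e.g. via cursor.execute(CREATE_DOUBAN) on an open pymysql connection.
CREATE_DOUBAN = """
CREATE TABLE IF NOT EXISTS `douban` (
    `id` INT AUTO_INCREMENT PRIMARY KEY,
    `serial_number` INT,
    `movie_name` VARCHAR(255),
    `introduce` TEXT,
    `star` VARCHAR(16),
    `evaluate` VARCHAR(64),
    `depict` VARCHAR(255)
) CHARACTER SET utf8mb4;
"""
```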

Sixth, edit middlewares.py to disguise yourself

import random
# user agent list
USER_AGENT_LIST = [
    'MSIE (MSIE 6.0; X11; Linux; i686) Opera 7.23',
    'Opera/9.20 (Macintosh; Intel Mac OS X; U; en)',
    'Opera/9.0 (Macintosh; PPC Mac OS X; U; en)',
    'iTunes/9.0.3 (Macintosh; U; Intel Mac OS X 10_6_2; en-ca)',
    'Mozilla/4.76 [en_jp] (X11; U; SunOS 5.8 sun4u)',
    'iTunes/4.2 (Macintosh; U; PPC Mac OS X 10.2)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:5.0) Gecko/20100101 Firefox/5.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:9.0) Gecko/20100101 Firefox/9.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:16.0) Gecko/20120813 Firefox/16.0',
    'Mozilla/4.77 [en] (X11; I; IRIX;64 6.5 IP30)',
    'Mozilla/4.8 [en] (X11; U; SunOS; 5.7 sun4u)'
]
# pick one user agent at random (note: this runs once, at import time)
USER_AGENT = random.choice(USER_AGENT_LIST)
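Since the random.choice above is evaluated only once at import, the whole crawl still uses a single user agent. To rotate the agent per request you can write a small downloader middleware instead; a minimal sketch (the class name is my own, and it must be registered in DOWNLOADER_MIDDLEWARES in settings.py to take effect):

```python
import random

# reuse the same pool as above; shortened here for the example
USER_AGENT_LIST = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:9.0) Gecko/20100101 Firefox/9.0',
    'Opera/9.20 (Macintosh; Intel Mac OS X; U; en)',
]

class RandomUserAgentMiddleware(object):
    """Downloader middleware that stamps a random User-Agent on every request."""
    def process_request(self, request, spider):
        # Scrapy calls this for each outgoing request; returning None
        # lets the request continue through the rest of the download chain
        request.headers['User-Agent'] = random.choice(USER_AGENT_LIST)
        return None
```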

Based on the article python爬蟲之scrapy中user agent淺談(兩種方法) ("A brief look at user agents in Scrapy: two methods").

Seventh, edit settings.py

Uncomment the middleware around line 55 and the pipeline around line 67
Set DOWNLOAD_DELAY to 0.5 to throttle requests
Set USER_AGENT (not needed if the middleware above is in use)
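Put together, the settings.py changes from this step might look like the fragment below. The dotted module paths assume the project is named douban; adjust them to your own project name:

```python
# settings.py fragment (module paths assume a project named "douban")
DOWNLOADER_MIDDLEWARES = {
    'douban.middlewares.RandomUserAgentMiddleware': 543,
}
ITEM_PIPELINES = {
    'douban.pipelines.MySQLPipeline': 300,
}
DOWNLOAD_DELAY = 0.5  # seconds between requests; be polite to the site
```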

Next up: studying neo4j knowledge graphs.
