Redis's strengths as a cache database can be used for deduplication, avoiding large-scale redundancy in the scraped data.
1. First, create the Maoyan crawler project.
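The project is generated with Scrapy's startproject command; the project name maoyan_test here is an assumption that matches the imports used later in this post:

scrapy startproject maoyan_test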
2. Enter the project directory and create a spider file.
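One way to do this is with Scrapy's genspider command from inside the project directory; the spider name movie matches the name attribute used in step 4:

cd maoyan_test
scrapy genspider movie maoyan.com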
Once the file is created, decide what to scrape. I will use the title and link of each Maoyan movie as an example (the order is entirely up to you: you can write the spider first and come back to the items file afterwards).
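Before committing to the fields, the XPath can be tried out in Scrapy's interactive shell. A quick sketch (Maoyan may reject Scrapy's default user agent, hence the -s override, and the page layout may have changed since this was written):

scrapy shell -s USER_AGENT="Mozilla/5.0" "http://www.maoyan.com/board/4"
>>> response.xpath('//dl[@class="board-wrapper"]/dd/a/@title').extract_first()
>>> response.xpath('//dl[@class="board-wrapper"]/dd/a/@href').extract_first()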
3. Write the items file.
import scrapy


class MaoyanTestItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    link = scrapy.Field()
4. Write the spider file (decide what you want to scrape, then map it to the fields defined in items).
import scrapy
from fake_useragent import UserAgent
from scrapy.selector import Selector
from maoyan_test.items import MaoyanTestItem

headers = {
    'user-agent': UserAgent(verify_ssl=False).chrome
}


class MovieSpiderSpider(scrapy.Spider):
    name = 'movie'
    # a bare domain, not a full url path
    allowed_domains = ['www.maoyan.com']
    start_urls = ['http://www.maoyan.com/board/4?offset=%s']

    def start_requests(self):
        # the board is paginated with an offset of 10 per page
        for i in range(10):
            url = self.start_urls[0] % str(i * 10)
            # dont_filter=False lets the (redis) dupefilter drop repeated requests
            yield scrapy.Request(url, callback=self.parse, dont_filter=False, headers=headers)

    def parse(self, response):
        sel = Selector(response)
        movie_list = sel.xpath('//dl[@class="board-wrapper"]/dd')
        for movie in movie_list:
            # create a fresh item per movie so earlier yields are not overwritten
            item = MaoyanTestItem()
            item['title'] = movie.xpath('a/@title').extract_first()
            item['link'] = 'https://www.maoyan.com' + movie.xpath('a/@href').extract_first()
            yield item
5. Write the pipeline file: Redis is used here to filter (deduplicate) the data, while the data itself is saved to MySQL.
First, configure the settings file:
# Added by hand: store items in redis via the pipeline shipped with scrapy-redis
ITEM_PIPELINES = {
    'maoyan_test.pipelines.MaoyanTestPipeline': 300,
    'scrapy_redis.pipelines.RedisPipeline': 400,
}

# Enable the redis-backed duplicate filter
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Enable the redis-backed scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Keep the scheduled requests and dedup records when the spider closes
SCHEDULER_PERSIST = True

# Schedule requests through a priority queue (the default)
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderPriorityQueue'

# Redis address and port; add the password if one is set
REDIS_HOST = '127.0.0.1'
REDIS_PORT = '6379'
REDIS_PARAMS = {
    'password': '123456',
}

# SCHEDULER_QUEUE_KEY = '%(spider)s:requests'        # redis key that holds the scheduled requests
# SCHEDULER_SERIALIZER = "scrapy_redis.picklecompat" # serializer for data stored in redis; pickle by default
# SCHEDULER_FLUSH_ON_START = False                   # whether to flush the scheduler and dedup records on start (True = flush)
# SCHEDULER_IDLE_BEFORE_CLOSE = 10                   # max time to wait when the scheduler queue is empty (nothing is fetched if it stays empty)
# SCHEDULER_DUPEFILTER_KEY = '%(spider)s:dupefilter' # redis key that holds the dedup records, e.g. movie:dupefilter for this spider
# SCHEDULER_DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'  # class implementing the dedup rule
# DUPEFILTER_DEBUG = False

# Needed by the extension class mentioned above
MYEXT_ENABLED = True  # enable the extension
IDLE_NUMBER = 10      # idle duration: 10 units, one unit = 5s

# If True, start urls are fetched from redis with 'spop'.
# Every request here carries a timestamp, so lpush is used instead.
# Useful when the start-url list must not contain duplicates; with this
# option on, urls must be added with sadd, otherwise a type error occurs.
# REDIS_START_URLS_AS_SET = True
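What RFPDupeFilter keeps in redis are request fingerprints, not raw urls. A minimal sketch of the idea using Scrapy's own fingerprint helper (not specific to this project; request_fingerprint is deprecated in recent Scrapy versions in favor of a fingerprinter class):

from scrapy import Request
from scrapy.utils.request import request_fingerprint

# two Request objects for the same url reduce to the same fingerprint, so the
# redis set keeps a single entry and the duplicate request is never scheduled
fp1 = request_fingerprint(Request('http://www.maoyan.com/board/4?offset=0'))
fp2 = request_fingerprint(Request('http://www.maoyan.com/board/4?offset=0'))
print(fp1 == fp2)  # True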
After that, the pipeline file saves the actual data to MySQL:
import pymysql


class MaoyanTestPipeline(object):

    def __init__(self):
        self.comments = []  # buffer of rows waiting to be written
        self.conn = pymysql.connect(
            host='localhost',
            user='root',
            passwd='123456',
            port=3306,
            db='spider',
            charset='utf8',
            autocommit=True
        )
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        self.comments.append([item['title'], item['link']])
        # flush the buffer once it reaches the batch size
        # (1 here; raise it to insert in larger batches)
        if len(self.comments) >= 1:
            self.insert_to_sql(self.comments)
            self.comments.clear()
        return item

    def close_spider(self, spider):
        # write whatever is still buffered before shutting down
        if self.comments:
            self.insert_to_sql(self.comments)
        self.conn.close()

    def insert_to_sql(self, data):
        try:
            sql = 'insert into maoyan_movie (title, link) values (%s, %s);'
            # executemany expects the whole list of rows, not a single row
            self.cursor.executemany(sql, data)
        except Exception as e:
            print('insert failed:', e)
            self.conn.rollback()
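After running the spider with scrapy crawl movie, both stores can be spot-checked. A sketch assuming the connection details from the settings above, an existing maoyan_movie table in the spider database, and scrapy-redis's default key names (movie:dupefilter for the fingerprints, movie:items for the RedisPipeline cache):

import pymysql
import redis

# rows written by MaoyanTestPipeline
conn = pymysql.connect(host='localhost', user='root', passwd='123456',
                       port=3306, db='spider', charset='utf8')
cursor = conn.cursor()
cursor.execute('select count(*) from maoyan_movie;')
print('rows in MySQL:', cursor.fetchone()[0])

# keys written by scrapy-redis
r = redis.StrictRedis(host='127.0.0.1', port=6379, password='123456')
print('request fingerprints:', r.scard('movie:dupefilter'))
print('items cached by RedisPipeline:', r.llen('movie:items'))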