Preface
This note continues [Python crawler 7]: https://blog.csdn.net/a__int__/article/details/104762424
1. Paginated crawling on Xinpianchang
Inspect the link of each page in the pagination bar.
Crawl the page links:
pages = response.xpath('//div[@class="page"]/a/@href').extract()
for page in pages:
    yield response.follow(page, self.parse)
We find that at most the first 22 pages can be crawled; the content beyond page 22 requires logging in.
1.1 Simulating login
Click login and capture the cookie.
Copy this Authorization credential.
settings:
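The original settings screenshot is not reproduced here. As a minimal sketch, assuming the captured Authorization value is simply kept in the spider module and attached to every request (the same approach the full discovery.py uses later in this note):

import scrapy

# value captured from the browser after logging in
cookies = dict(Authorization='F8FB7C7E1E8354A671E83548091E835A4181E83528E62CE43391')

class DiscoverySpider(scrapy.Spider):
    name = 'discovery'
    start_urls = ['https://www.xinpianchang.com/channel/index/sort-like?from=navigator']

    def parse(self, response):
        # attach the login cookie when following pagination links
        pages = response.xpath('//div[@class="page"]/a/@href').extract()
        for page in pages:
            yield response.follow(page, self.parse, cookies=cookies)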
Re-run the spider; it stops after crawling a little over 100 items.
1.2 Request limit
Find the PHPSESSID cookie.
Inspecting this string, its length is 26 characters.
We write a PHPSESSID generator (sketched below).
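This is essentially the gen_sessionid() that appears in the full discovery.py below, with the character pool renamed to chars here to avoid shadowing the built-in str:

import random

chars = 'qazwsxedcrfvtgbyhnujmikolp1234567890'

def gen_sessionid():
    # 26 random characters drawn from lowercase letters and digits
    return ''.join(random.choices(chars, k=26))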
Now the crawl can successfully fetch many more records.
2. Crawling creator detail pages
Open one of the crawled video detail pages.
On the detail page we find that each creator's detail-page URL is determined by the data-userid attribute.
Crawl the user ids:
creator_list = response.xpath('//@data-userid').extract()
creator_url = 'https://www.xinpianchang.com/u%s'
print(creator_list)
for creator in creator_list:
    request = response.follow(creator_url % creator, self.parse_composer)
    request.meta['cid'] = creator
    yield request
def parse_composer(self, response):
    banner = response.xpath('//div[@class="banner-wrap"]/@style').get()
    composer = {}
    composer['banner'], = re.findall(r'background-image:url\((.+?)\)', banner)
    composer['name'] = response.xpath('//p[contains(@class,"creator-name")]/text()').get()
    yield composer
After running, everything else works, but the cookie value visit_userid_10037339=1; keeps being incremented.
Visiting the site in a browser, we find this value is set each time we open a creator detail page; it is deleted automatically after thirty-odd seconds, and repeated visits within that window increment it.
After logging out we find that the creator detail pages do not require login at all, so we turn off cookies for these requests, as shown below.
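Scrapy lets an individual request skip the cookie middleware through the dont_merge_cookies meta key; the full discovery.py below does exactly this for creator detail pages:

request = response.follow(creator_url % creator, self.parse_composer)
request.meta['cid'] = creator
# no cookies are attached to or stored from this request, so visit_userid_* is never set
request.meta['dont_merge_cookies'] = True
yield request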
Crawling again, the problem is gone.
3. Storing in MySQL
Create the database xpc: CREATE DATABASE xpc;
Create the composer table:
CREATE TABLE `composer` (
    `cid` bigint(20) NOT NULL,
    `name` varchar(512) DEFAULT NULL,
    `banner` varchar(512) DEFAULT NULL,
    PRIMARY KEY (`cid`) USING BTREE
) ENGINE = InnoDB CHARSET = utf8mb4;
items.py
import scrapy
from scrapy import Field


class ComposerItem(scrapy.Item):
    """Holds creator detail-page info."""
    table_name = 'composer'
    cid = Field()
    banner = Field()
    name = Field()
settings.py
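The original settings screenshot is not shown here. A minimal sketch of the pipeline registration; the priority numbers are an assumption:

ITEM_PIPELINES = {
    # RedisPipeline runs first to deduplicate by name, then MysqlPipeline writes the row
    'xpc.pipelines.RedisPipeline': 300,
    'xpc.pipelines.MysqlPipeline': 301,
}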
pipelines.py
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import pymysql
import redis
from scrapy.exceptions import DropItem


class RedisPipeline(object):
    def open_spider(self, spider):
        self.r = redis.Redis(host='127.0.0.1')

    def process_item(self, item, spider):
        if self.r.sadd(spider.name, item['name']):
            return item
        # sadd() returned 0, so the name is already in the set: drop the item
        # and stop the remaining pipelines from processing it
        raise DropItem


class MysqlPipeline(object):
    def open_spider(self, spider):
        self.conn = pymysql.connect(
            host="127.0.0.1",
            port=3306,
            db="xpc",
            user="root",
            password="root",
            charset="utf8",
        )
        self.cur = self.conn.cursor()

    def close_spider(self, spider):
        self.cur.close()
        self.conn.close()

    def process_item(self, item, spider):
        keys, values = zip(*item.items())
        # upsert: insert a new row, or update the existing one keyed by the primary key
        sql = "insert into {} ({}) values ({})" \
              " ON DUPLICATE KEY UPDATE {};".format(
                  item.table_name,
                  ','.join(keys),
                  ','.join(["'%s'"] * len(values)),
                  ','.join(["{}='%s'".format(k) for k in keys])
              )
        sqlp = sql % (values * 2)
        self.cur.execute(sqlp)
        self.conn.commit()
        # print the SQL statement that was just executed (pymysql private attribute)
        print(self.cur._last_executed)
        return item
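As a usage note, for a composer item the generated statement looks roughly like this (the banner and name values here are purely illustrative), so re-crawling the same cid updates the existing row instead of failing on the duplicate primary key:

insert into composer (cid,banner,name) values ('10037339','https://example.com/banner.jpg','SomeCreator') ON DUPLICATE KEY UPDATE cid='10037339',banner='https://example.com/banner.jpg',name='SomeCreator';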
Modify discovery.py:
# -*- coding: utf-8 -*-
import json
import random
import re

import scrapy
from scrapy import Request

from xpc.items import ComposerItem

# character pool used to generate fake PHPSESSID values
str = 'qazwsxedcrfvtgbyhnujmikolp1234567890'
cookies = dict(Authorization='F8FB7C7E1E8354A671E83548091E835A4181E83528E62CE43391')


def gen_sessionid():
    return ''.join(random.choices(str, k=26))


class DiscoverySpider(scrapy.Spider):
    name = 'discovery'
    allowed_domains = ['xinpianchang.com', 'openapi-vtom.vmovier.com']
    start_urls = ['https://www.xinpianchang.com/channel/index/sort-like?from=navigator']
    page_count = 0

    def parse(self, response):
        self.page_count += 1
        if self.page_count >= 100:
            # swap in a fresh PHPSESSID every 100 pages to stay under the request limit
            cookies.update(PHPSESSID=gen_sessionid())
            self.page_count = 0
        pid_list = response.xpath('//@data-articleid').extract()
        url = "https://www.xinpianchang.com/a%s?from=ArticleList"
        for pid in pid_list:
            yield response.follow(url % pid, self.parse_post)
        pages = response.xpath('//div[@class="page"]/a/@href').extract()
        for page in pages:
            yield response.follow(page, self.parse, cookies=cookies)

    def parse_post(self, response):
        # post = {}
        # get() is the same as extract_first() here
        # post['title'] = response.xpath('//div[@class="title-wrap"]/h3/text()').get()
        # categorys = response.xpath('//span[contains(@class,"cate")]//text()').extract()
        # post['category'] = ''.join([category.strip() for category in categorys])
        # post['created_at'] = response.xpath('//span[contains(@class,"update-time")]/i//text()').get()
        # post['play_counts'] = response.xpath('//i[contains(@class,"play-counts")]/@data-curplaycounts').get()
        # vid, = re.findall('vid: \"(\w+)\",', response.text)
        # video_url = 'https://openapi-vtom.vmovier.com/v3/video/%s?expand=resource&usage=xpc_web'
        # request = Request(video_url % vid, callback=self.parse_video)
        # request.meta['post'] = post
        # yield request
        creator_list = response.xpath('//@data-userid').extract()
        creator_url = 'https://www.xinpianchang.com/u%s'
        for creator in creator_list:
            request = response.follow(creator_url % creator, self.parse_composer)
            request.meta['cid'] = creator
            # creator pages need no login; skip cookies so visit_userid_* never accumulates
            request.meta['dont_merge_cookies'] = True
            yield request

    # def parse_video(self, response):
    #     # post = response.meta['post']
    #     result = json.loads(response.text)
    #     if 'resource' in result['data']:
    #         post['video'] = result['data']['resource']['default']['url']
    #     else:
    #         d = result['data']['third']['data']
    #         post['video'] = d.get('iframe_url', d.get('swf', ''))
    #     yield post

    def parse_composer(self, response):
        banner = response.xpath('//div[@class="banner-wrap"]/@style').get()
        composer = ComposerItem()
        composer['cid'] = int(response.meta['cid'])
        banner, = re.findall(r'background-image:url\((.+?)\)', banner)
        composer['banner'] = banner.replace('\t', '').replace(' ', '')
        name = response.xpath('//p[contains(@class,"creator-name")]/text()').get()
        composer['name'] = name.replace('\t', '').replace(' ', '')
        yield composer
4. Storing in Redis
settings.py
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
REDIS_URL = 'redis://127.0.0.1:6379'
Start redis.
Check the redis keys.
Install scrapy-redis: pip install scrapy-redis
Start the crawler: scrapy crawl discovery
Check redis.
Check what is stored in discovery:dupefilter: srandmember discovery:dupefilter
zrange discovery:requests 0 0 withscores
Parse the value in ipython.
Analysing this data, we find it is a list containing a (member, score) tuple; the member is the request serialized to binary with pickle, so we deserialize it with pickle.loads:
In [1]: import redis
In [2]: r = redis.Redis(host='127.0.0.1')
In [3]: m = r.zrange('discovery:requests',0,0,withscores=True)
In [7]: import pickle
In [9]: (a,b), = m
In [10]: pickle.loads(a)
Store the data into redis (redis is generally not used as the main database).
settings.py
'scrapy_redis.pipelines.RedisPipeline': 301,
Run the crawler.
Check redis.