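Looking back at the code, this project fetches multiple pages by having the parse callback schedule itself recursively. There is probably a more efficient approach, so I plan to come back and study CrawlSpider, proxy pools, and distributed crawling.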
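Requirements:
- Use the Scrapy framework to scrape the name of every movie on the first-level (listing) pages of a movie site
- https://www.55xia.com/
- Scrape the detailed information on each movie's second-level (detail) page
- Use a proxy IP
- Save a log file
- Save the results as a CSV file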
Summary:
1. How extract() behaves in different situations when parsing with XPath
https://blog.csdn.net/nzjdsds/article/details/77278400
2. An XPath detail worth noting:
div[not(contains(@class,"col-xs-12"))]
matches div tags whose class attribute does not contain "col-xs-12"
https://blog.csdn.net/caorya/article/details/81839928?utm_source=blogxgwz1
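3. For the second-level parse, pass the values collected in the first parse to the callback through the meta argument, which takes a dict.
# pass the item to the second-level callback via meta
yield scrapy.Request(url=url, callback=self.parse_detail, meta={'item': item})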
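4. Saving as a CSV file:
import csv
csv.writer
writerow
https://blog.csdn.net/qq_40243365/article/details/83003161
5. To avoid blank lines between rows (the csv module writes its own \r\n line endings, which text mode on Windows would otherwise turn into \r\r\n), open the file with newline='':
self.f=open('./movie.csv','w',newline='', encoding='utf-8')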
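The effect of newline='' can be checked by reading the file back as raw bytes. This is a minimal sketch using only the standard library; the rows and the temporary file path are made up.

```python
import csv
import os
import tempfile

# Made-up sample rows
rows = [['name', 'score'], ['Example Movie', '8.5']]

path = os.path.join(tempfile.mkdtemp(), 'movie.csv')

# newline='' disables newline translation, so csv.writer's own '\r\n'
# line terminator is written exactly once per row
with open(path, 'w', newline='', encoding='utf-8') as f:
    csv.writer(f).writerows(rows)

# Read the raw bytes to inspect the line endings
with open(path, 'rb') as f:
    raw = f.read()

print(raw)  # b'name,score\r\nExample Movie,8.5\r\n'
```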
6. Spoofing the User-Agent, saving the log, and setting the export encoding
Set these in settings:
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36'
FEED_EXPORT_ENCODING = 'utf-8-sig'
LOG_LEVEL = 'ERROR'
LOG_FILE = 'log.txt'
ROBOTSTXT_OBEY = False
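The 'utf-8-sig' codec differs from 'utf-8' only in that it prepends a byte order mark (BOM), which spreadsheet programs such as Excel commonly use to recognise a CSV file as UTF-8. A standard-library sketch with made-up text:

```python
import codecs

# Made-up header line containing non-ASCII text
text = '電影,導演\n'

plain = text.encode('utf-8')
with_bom = text.encode('utf-8-sig')

# 'utf-8-sig' is plain UTF-8 with the 3-byte BOM in front
print(with_bom[:3])  # b'\xef\xbb\xbf'
print(with_bom == codecs.BOM_UTF8 + plain)  # True

# Decoding with 'utf-8-sig' strips the BOM again
round_trip = with_bom.decode('utf-8-sig')
```

As far as I understand, FEED_EXPORT_ENCODING only affects Scrapy's built-in feed exports (for example output written with the -o option). A file opened by hand in a pipeline uses whatever encoding is passed to open(), so that is where 'utf-8-sig' would need to go if the CSV shows garbled characters in Excel.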
7. Proxy IP middleware
class MyMiddleware(object):
    def process_request(self, request, spider):
        # route every request through this proxy
        request.meta['proxy'] = 'https://157.230.150.101:8080'
Settings:
DOWNLOADER_MIDDLEWARES = {
    'movie.middlewares.MyMiddleware': 543,
}
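A single hard-coded proxy stops working as soon as that address goes down. As a first step toward a proxy pool, the middleware could pick a proxy at random for each request. This is only a sketch: RandomProxyMiddleware, PROXY_POOL, and FakeRequest are hypothetical names, and the addresses come from a documentation-only IP range, so they are not working proxies.

```python
import random

# Hypothetical proxy list (203.0.113.0/24 is reserved for documentation)
PROXY_POOL = [
    'http://203.0.113.10:8080',
    'http://203.0.113.11:8080',
]


class RandomProxyMiddleware:
    """Downloader middleware that assigns a random proxy to each request."""

    def __init__(self, proxies=PROXY_POOL):
        self.proxies = list(proxies)

    def process_request(self, request, spider):
        request.meta['proxy'] = random.choice(self.proxies)
        return None  # let Scrapy continue processing the request


class FakeRequest:
    """Stand-in for scrapy.Request so the sketch runs without Scrapy."""

    def __init__(self):
        self.meta = {}


req = FakeRequest()
RandomProxyMiddleware().process_request(req, spider=None)
print(req.meta['proxy'])
```

It would be registered in DOWNLOADER_MIDDLEWARES the same way as MyMiddleware.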
Code:
movieinfo.py
import scrapy
from movie.items import MovieItem


class MovieinfoSpider(scrapy.Spider):
    name = 'movieinfo'
    # allowed_domains = ['www.movie.com']
    start_urls = ['https://www.55xia.com/movie']
    page = 1
    base_url = 'https://www.55xia.com/movie/?page={}'

    # Parse the second-level (detail) page
    def parse_detail(self, response):
        # There may be more than one director, so use extract() instead of
        # extract_first() and join the names into one string
        directors = response.xpath(
            '/html/body/div[1]/div/div/div[1]/div[1]/div[2]/table/tbody/tr[1]/td[2]//a/text()').extract()
        directors = " ".join(directors)
        movieType = response.xpath(
            '/html/body/div[1]/div/div/div[1]/div[1]/div[2]/table/tbody/tr[4]/td[2]/a/text()').extract_first()
        area = response.xpath(
            '/html/body/div[1]/div/div/div[1]/div[1]/div[2]/table/tbody/tr[5]/td[2]//text()').extract_first()
        time = response.xpath(
            '/html/body/div[1]/div/div/div[1]/div[1]/div[2]/table/tbody/tr[7]/td[2]//text()').extract_first()
        score = response.xpath(
            '/html/body/div[1]/div/div/div[1]/div[1]/div[2]/table/tbody/tr[9]/td[2]//a/text()').extract_first()
        # Retrieve the item passed in through meta
        item = response.meta['item']
        print('Detail page:', item['name'])
        item['directors'] = directors
        item['movieType'] = movieType
        item['area'] = area
        item['time'] = time
        item['score'] = score
        yield item

    def parse(self, response):
        """
        Collect the links to the detail pages.
        Detail fields: director, screenwriter, cast, genre, region, language, release date, alias, score
        :param response:
        :return:
        """
        div_list = response.xpath('/html/body/div[1]/div[1]/div[2]/div[not(contains(@class,"col-xs-12"))]')
        for div in div_list:
            name = div.xpath('./div/div/h1/a/text()').extract_first()
            print('Found:', name)
            url = div.xpath('.//div[@class="meta"]/h1/a/@href').extract_first()
            if url is None:
                # no detail link in this block; skip it instead of crashing on None
                continue
            url = "https:" + url
            # Create the item and store the name
            item = MovieItem()
            item['name'] = name
            # Pass the item to the second-level callback via meta
            yield scrapy.Request(url=url, callback=self.parse_detail, meta={'item': item})
        # After finishing this page, request the next one (up to page 3)
        if self.page < 3:
            self.page += 1
            new_url = self.base_url.format(self.page)
            yield scrapy.Request(url=new_url, callback=self.parse)
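Since the number of pages is fixed in advance here, one simpler alternative to the recursive self.page counter is to generate all listing URLs up front, for example to extend start_urls. The helper below is a hypothetical sketch, not part of the original project; build_page_urls is a name introduced only for illustration.

```python
BASE_URL = 'https://www.55xia.com/movie/?page={}'


def build_page_urls(base_url, first_page, last_page):
    """Return the listing URLs for first_page..last_page (inclusive)."""
    return [base_url.format(page) for page in range(first_page, last_page + 1)]


# Pages 2 and 3, matching the "self.page < 3" limit used in the spider
page_urls = build_page_urls(BASE_URL, 2, 3)
print(page_urls)
```

With the URLs known up front, Scrapy can schedule the listing pages concurrently instead of discovering them one after another.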
items.py
import scrapy


class MovieItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    directors = scrapy.Field()
    movieType = scrapy.Field()
    area = scrapy.Field()
    time = scrapy.Field()
    score = scrapy.Field()
middlewares.py
class MyMiddleware(object):
    def process_request(self, request, spider):
        # route every request through this proxy
        request.meta['proxy'] = 'https://157.230.150.101:8080'
pipelines.py
import csv


class MoviePipeline(object):
    def open_spider(self, spider):
        print('Start saving')
        # newline='' prevents blank lines between rows
        self.f = open('./movie.csv', 'w', newline='', encoding='utf-8')
        self.writer = csv.writer(self.f)
        # header row
        self.writer.writerow(['name', 'directors', 'movieType', 'area', 'time', 'score'])

    def process_item(self, item, spider):
        print('Writing row')
        self.writer.writerow([item['name'], item['directors'], item['movieType'], item['area'], item['time'], item['score']])
        return item

    def close_spider(self, spider):
        self.f.close()
        print('Saved')
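A pipeline only runs if it is enabled in settings. The notes above do not show that entry, so the fragment below is an assumption: the dotted path is inferred from the 'movie.middlewares.MyMiddleware' entry and the MoviePipeline class name, and 300 is simply the priority used in Scrapy's project template.

```python
# settings.py (assumed entry, not shown in the original notes)
ITEM_PIPELINES = {
    'movie.pipelines.MoviePipeline': 300,  # lower numbers run earlier
}
```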
Results:
Appendix:
Differences between the Excel and CSV file formats
https://blog.csdn.net/weixin_39198406/article/details/78705016