Contents
1、Scrapy Framework
(1)Installation
- Install Scrapy: pip install scrapy
- Scrapy is an application framework for crawling websites and extracting structured data
(2)Data flow
- 1、The Engine opens a website, finds the Spider that handles it, and asks that Spider for the first URL(s) to crawl;
- 2、The Engine gets the first URL from the Spider and schedules it as a Request via the Scheduler;
- 3、The Engine asks the Scheduler for the next URL to crawl;
- 4、The Scheduler returns the next URL to the Engine, which forwards it through the Downloader Middlewares to the Downloader for download;
- 5、Once the page is downloaded, the Downloader generates a Response for it and sends it back to the Engine through the Downloader Middlewares;
- 6、The Engine receives the Response from the downloader and sends it through the Spider Middlewares to the Spider for processing;
- 7、The Spider processes the Response and returns scraped Items and new Requests to the Engine;
- 8、The Engine sends the scraped Items to the Item Pipelines and the new Requests to the Scheduler;
- 9、Steps 2 to 8 repeat until the Scheduler has no more Requests; the Engine then closes the site and the crawl ends
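The loop above can be sketched with toy objects. This is not real Scrapy internals (the real components are asynchronous); `Scheduler`, `download`, and `parse` below are simplified stand-ins, and the URLs are made up:

```python
from collections import deque

class Scheduler:
    """Toy scheduler: a FIFO queue plus duplicate filtering."""
    def __init__(self):
        self.queue = deque()
        self.seen = set()

    def enqueue(self, url):
        if url not in self.seen:   # duplicate filtering
            self.seen.add(url)
            self.queue.append(url)

    def next_request(self):
        return self.queue.popleft() if self.queue else None

def download(url):                 # toy Downloader: url -> "response"
    return f"<html>{url}</html>"

def parse(response):               # toy Spider: response -> (items, new urls)
    if "page/1" in response:
        return ["item-a"], ["https://example.com/page/2"]
    return ["item-b"], []

def crawl(start_url):
    scheduler, items = Scheduler(), []
    scheduler.enqueue(start_url)                          # steps 1-2
    while (url := scheduler.next_request()) is not None:  # steps 3-4
        response = download(url)                          # step 5
        new_items, new_urls = parse(response)             # steps 6-7
        items.extend(new_items)                           # step 8: items -> pipelines
        for u in new_urls:
            scheduler.enqueue(u)                          # step 8: requests -> scheduler
    return items                                          # step 9: queue drained, crawl ends

# crawl("https://example.com/page/1") → ['item-a', 'item-b']
```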
(3)Parsing basics
- Scrapy provides two selector entry points: response.xpath() and response.css()
- extract() returns all matches; extract_first() returns only the first
- extract_first() accepts a default value, returned whenever the XPath matches nothing;
- Regex extraction: re() returns all matches, re_first() the first one
- Using response.xpath():
response.xpath("//a/text()").extract()
response.xpath("//a/text()").extract_first()
response.xpath("//a/text()").extract_first("Default None")
response.xpath("//a/text()").re(r'Name:\s(.*)')
response.xpath("//a/text()").re_first(r'Name:\s(.*)')
- Using response.css():
response.css("a[href='image.html']::text").extract()
response.css("a[href='image.html']::text").extract_first()
response.css("a[href='image.html'] img::attr(src)").extract_first("Default None")
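As a rough stdlib analogy (ElementTree instead of Scrapy selectors; the HTML snippet is made up), the difference between "all matches", "first match with a default", and regex extraction looks like this:

```python
import re
import xml.etree.ElementTree as ET

html = ('<html><body><a href="image.html">Name: pic one</a>'
        '<a href="other.html">Name: pic two</a></body></html>')
root = ET.fromstring(html)

texts = [a.text for a in root.iter('a')]         # like .extract() / .getall()
first = texts[0] if texts else 'Default None'    # like .extract_first('Default None')
names = [m.group(1) for t in texts
         if (m := re.search(r'Name:\s(.*)', t))]  # like .re(r'Name:\s(.*)')

# texts → ['Name: pic one', 'Name: pic two']
# first → 'Name: pic one'
# names → ['pic one', 'pic two']
```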
2、scrapy genspider workflow (scrapy.Spider)
(1)Create the project and spider
A: Create both from the terminal with the following commands
scrapy startproject [project name]
cd [project directory]
scrapy genspider [spider file name] [domain to crawl]
B: Generated project layout:
* spiders: every spider you write lives in this package
* items.py: holds the data models for the scraped items;
* middlewares.py: holds the various middlewares;
* pipelines.py: defines item pipelines that persist the item models to local disk;
* settings.py: configuration for this crawler (request headers, request interval, IP proxy pool, etc.);
* scrapy.cfg: the Scrapy project configuration file; it records the path of the settings module and deployment information
C: Contents of the generated qsbk.py:
* name: the unique name of the spider within the project, used to tell Spiders apart
* allowed_domains: the domains the spider is allowed to crawl; initial or follow-up request URLs outside these domains are filtered out
* start_urls: the list of URLs the spider starts crawling from; the initial requests are defined by it
* parse: the method responsible for parsing the returned response, extracting data, and generating further requests
* response is a 'scrapy.http.response.html.HtmlResponse' object; you can run 'xpath' and 'css' expressions on it. The extracted result is a 'Selector' or 'SelectorList' object; to get plain strings out of it, use get() or getall()
# -*- coding: utf-8 -*-
import scrapy


class QsbkSpider(scrapy.Spider):
    name = 'qsbk'
    allowed_domains = ['qiushibaike.com']
    start_urls = ['https://www.qiushibaike.com/text/page/1/']

    def parse(self, response):
        pass
(2)Write the spider and start crawling
A、Configure HEADERS and related settings in settings.py
- Change ROBOTSTXT_OBEY to False
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
- Adjust DEFAULT_REQUEST_HEADERS
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537'
                  '.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'
}
B、Define the field models in items.py
- Declare the fields your items need
import scrapy


class HelloItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    author = scrapy.Field()
    content = scrapy.Field()
C、Write the spider in qsbk.py
- Update start_urls and write the parsing logic in the parse method of qsbk.py
- To follow the next page via a callback, use scrapy.Request, which needs two arguments: url and callback
- dont_filter=False lets the scheduler filter out URLs that were already requested; dont_filter=True disables that filtering
import scrapy
from Hello.items import HelloItem


class QsbkSpider(scrapy.Spider):
    name = 'qsbk'
    allowed_domains = ['qiushibaike.com']
    start_urls = ['https://www.qiushibaike.com/text/page/1/']
    base_domain = "https://www.qiushibaike.com"

    def parse(self, response):
        divs = response.xpath('//div[@class="col1 old-style-col1"]/div')
        for div in divs:
            author = div.xpath(".//h2/text()").get().strip()
            # A leading . makes the XPath relative to this element;
            # without it, the query starts from the document root
            content = div.xpath('.//div[@class="content"]//text()').getall()
            content = ''.join(content).strip()
            # Once the model is defined in items.py, replace the two lines below
            # duanzi = {'author': author, 'content': content}
            # yield duanzi
            item = HelloItem(author=author, content=content)
            yield item
        # Handle the next page; also set DOWNLOAD_DELAY = 1 in settings
        next_page_url = response.xpath(
            '//span[@class="next"]/ancestor::a/@href').get()
        if not next_page_url:
            return
        yield scrapy.Request(
            f"{self.base_domain}{next_page_url}",
            callback=self.parse, dont_filter=False)
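The dont_filter behaviour used above can be sketched as a set of request fingerprints. This is a simplification of Scrapy's real dupefilter (`scrapy.dupefilters.RFPDupeFilter` fingerprints the whole request, not just the URL):

```python
import hashlib

class ToyDupeFilter:
    """Simplified stand-in for the scheduler's duplicate filter."""
    def __init__(self):
        self.fingerprints = set()

    def request_seen(self, url):
        fp = hashlib.sha1(url.encode()).hexdigest()
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        return False

    def should_schedule(self, url, dont_filter=False):
        # dont_filter=True bypasses the duplicate check entirely
        return dont_filter or not self.request_seen(url)

df = ToyDupeFilter()
assert df.should_schedule("https://example.com/page/2") is True   # first request
assert df.should_schedule("https://example.com/page/2") is False  # duplicate, filtered
assert df.should_schedule("https://example.com/page/2", dont_filter=True) is True
```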
(3)Store the content (pipelines.py): design pipelines that persist what was scraped
A、Configure ITEM_PIPELINES in settings
- In ITEM_PIPELINES, lower values mean higher priority
ITEM_PIPELINES = {
    'Hello.pipelines.HelloPipeline': 300,
}
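The priority values work like this sketch: pipelines are sorted by their value and each item flows through them in ascending order. The classes and field names here are made up for illustration; the real pipeline manager lives inside Scrapy:

```python
ITEM_PIPELINES = {'CleanPipeline': 100, 'SavePipeline': 300}

class CleanPipeline:
    def process_item(self, item, spider=None):
        item['content'] = item['content'].strip()
        return item

class SavePipeline:
    def process_item(self, item, spider=None):
        item['saved'] = True
        return item

registry = {'CleanPipeline': CleanPipeline, 'SavePipeline': SavePipeline}
# Lower value = earlier in the chain, exactly like ITEM_PIPELINES.
chain = [registry[name]() for name, _ in
         sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])]

item = {'content': '  hello  '}
for pipeline in chain:
    item = pipeline.process_item(item)
# item → {'content': 'hello', 'saved': True}
```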
B、Write pipelines.py
- open_spider() runs when the spider is opened
- process_item() is called whenever the spider yields an item
- close_spider() is called when the spider closes
import json


class HelloPipeline(object):
    def open_spider(self, spider):
        print("Spider started...")

    def process_item(self, item, spider):
        with open('duanzi.json', "a+", encoding="utf-8") as fp:
            item_json = json.dumps(dict(item), ensure_ascii=False)
            fp.write(item_json + '\n')
        return item

    def close_spider(self, spider):
        print("Spider finished...")
- A second way to write the code above: use JsonLinesItemExporter, which writes the item to disk every time export_item is called; the output is one JSON dict per line
from scrapy.exporters import JsonLinesItemExporter


class HelloPipeline(object):
    def __init__(self):
        self.fp = open('duanzi.json', 'wb')
        self.exporter = JsonLinesItemExporter(
            self.fp, ensure_ascii=False, encoding='utf-8')

    def open_spider(self, spider):
        print("Spider started...")

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.fp.close()
        print("Spider finished...")
C、Save to a file straight from the command line, without pipelines
Command | Description |
---|---|
scrapy crawl quotes -o quotes.json | Save as a JSON file |
scrapy crawl quotes -o quotes.jl | One JSON line per item |
scrapy crawl quotes -o quotes.csv | Save as a CSV file |
scrapy crawl quotes -o quotes.xml | Save as an XML file |
scrapy crawl quotes -o quotes.pickle | Save as a pickle file |
scrapy crawl quotes -o quotes.marshal | Save as a marshal file |
scrapy crawl quotes -o ftp://user:[email protected]/path/to/quotes.csv | Remote output over FTP |
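The .jl ("JSON lines") format in the table is the same thing JsonLinesItemExporter produces: one json.dumps per item, newline-separated. A stdlib sketch with made-up items:

```python
import io
import json

items = [{'author': 'a', 'content': 'one'},
         {'author': 'b', 'content': 'two'}]

buf = io.StringIO()
for item in items:                # what `-o quotes.jl` writes
    buf.write(json.dumps(item, ensure_ascii=False) + '\n')

lines = buf.getvalue().splitlines()
# Each line parses back independently, unlike a single JSON array,
# which is why .jl streams well for large crawls.
assert [json.loads(line) for line in lines] == items
```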
D、MongoDB storage
import pymongo


class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        # item.collection assumes the Item class defines a collection name
        self.db[item.collection].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()
E、MySQL storage
import pymysql


class MysqlPipeline(object):
    def __init__(self, host, database, user, password, port):
        self.host = host
        self.database = database
        self.user = user
        self.password = password
        self.port = port

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            host=crawler.settings.get('MYSQL_HOST'),
            database=crawler.settings.get('MYSQL_DATABASE'),
            user=crawler.settings.get('MYSQL_USER'),
            password=crawler.settings.get('MYSQL_PASSWORD'),
            port=crawler.settings.get('MYSQL_PORT'),
        )

    def open_spider(self, spider):
        self.db = pymysql.connect(
            host=self.host, user=self.user, password=self.password,
            database=self.database, charset="utf8", port=self.port)
        self.cursor = self.db.cursor()

    def process_item(self, item, spider):
        # item.table assumes the Item class defines a table name
        data = dict(item)
        keys = ','.join(data.keys())
        values = ','.join(['%s'] * len(data))
        sql = 'insert into %s (%s) values (%s)' % (item.table, keys, values)
        self.cursor.execute(sql, tuple(data.values()))
        self.db.commit()
        return item

    def close_spider(self, spider):
        self.db.close()
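The dynamic INSERT built in process_item can be checked in isolation. The table name 'duanzi' and the field names here are made up for the example:

```python
def build_insert(table, data):
    """Build a parameterized INSERT the way the MySQL pipeline does."""
    keys = ','.join(data.keys())
    placeholders = ','.join(['%s'] * len(data))
    # Only identifiers are interpolated; the values go through %s
    # placeholders so the driver escapes them (avoiding SQL injection).
    sql = 'insert into %s (%s) values (%s)' % (table, keys, placeholders)
    return sql, tuple(data.values())

sql, params = build_insert('duanzi', {'author': 'a', 'content': 'c'})
assert sql == 'insert into duanzi (author,content) values (%s,%s)'
assert params == ('a', 'c')
```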
(4)Two ways to run the spider
- First: cd into the Hello directory, then run scrapy crawl qsbk directly in the terminal
E:\Test_com\Hello>scrapy crawl qsbk
- Second: create a start.py file in the Hello directory, put the command in it, and run start.py
from scrapy import cmdline

cmdline.execute(['scrapy', 'crawl', 'qsbk'])
3、The scrapy shell
- Handy for experimenting with data-extraction code;
- To run scrapy commands, first activate the environment scrapy is installed in;
scrapy shell url
4、scrapy Request / Response objects
(1)Request object
- callback: the function executed after the downloader finishes downloading the response;
- headers: request headers; fixed values can be set once in settings.py, non-fixed values can be passed when sending the request
- meta: shared data, used to pass values between different requests
- dont_filter: tells the scheduler not to filter this request; most useful when deliberately issuing the same request several times
- errback: the function executed when an error occurs
import scrapy

scrapy.Request(url, callback=None, method='GET', headers=None, body=None,
               cookies=None, meta=None, encoding='utf-8', priority=0,
               dont_filter=False, errback=None, flags=None, cb_kwargs=None)
- Sending POST requests: when you need to POST, use FormRequest, a subclass of Request. To send a POST as the spider's very first request, override start_requests(self) in the spider class and issue the POST there instead of using the URLs in start_urls
# Simulated login to renren.com; demo only, the site has changed and this may no longer work
import scrapy


class RenrenSpiderSpider(scrapy.Spider):
    name = 'renren_spider'
    allowed_domains = ['renren.com']
    start_urls = ['http://renren.com/']

    # def parse(self, response):
    #     pass

    def start_requests(self):
        url = "http://www.renren.com/PLogin.do"
        data = {"email": "[email protected]", "password": "pythonspider"}
        request = scrapy.FormRequest(
            url, formdata=data, callback=self.parse_page)
        yield request

    def parse_page(self, response):
        # with open("renren.html", "w", encoding="utf-8") as f:
        #     f.write(response.text)
        request = scrapy.Request(
            url="http://www.renren.com/880151247/profile",
            callback=self.parse_profile)
        yield request

    def parse_profile(self, response):
        with open("renren.html", "w", encoding="utf-8") as f:
            f.write(response.text)
- A second POST example
# Simulated login to douban.com; probably stale (the captcha flow has changed), kept only to show the idea
import scrapy
from urllib import request
from PIL import Image


class DoubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['douban.com']
    start_urls = ['http://douban.com/']
    # Login endpoint; the original snippet used self.login_url without defining it
    login_url = 'https://accounts.douban.com/login'

    def parse(self, response):
        form_data = {
            "source": "None",
            "redir": "https://www.douban.com/",
            "form_email": "[email protected]",
            "form_password": "pythonspider",
            "login": "登录"
        }
        captcha_url = response.css("img#captcha_image::attr(src)").get()
        if captcha_url:
            captcha = self.recognize_captcha(captcha_url)
            form_data['captcha-solution'] = captcha
            captcha_id = response.xpath(
                "//input[@name='captcha-id']/@value").get()
            form_data['captcha-id'] = captcha_id
        yield scrapy.FormRequest(url=self.login_url, formdata=form_data,
                                 callback=self.parse_after_login)

    def parse_after_login(self, response):
        if response.url == "https://www.douban.com/":
            print("Login succeeded")
        else:
            print("Login failed")

    @staticmethod
    def recognize_captcha(image_url):
        """Solve the captcha by eye; the site now uses a slider instead, so this only sketches the idea."""
        request.urlretrieve(image_url, 'captcha.png')
        image = Image.open('captcha.png')
        image.show()
        captcha = input("Enter the captcha: ")
        return captcha
(2)Response object
- meta: the meta dict passed along from the originating request; used to carry data across multiple requests;
- encoding: the encoding used to encode and decode the response;
- text: the returned data as a unicode string;
- body: the returned data as bytes;
- xpath: XPath selector shortcut;
- css: CSS selector shortcut
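The text/body distinction is just bytes vs. str: body holds the raw bytes, text is body decoded with the response encoding. A minimal illustration:

```python
body = '段子'.encode('utf-8')     # what response.body holds (bytes)
encoding = 'utf-8'                # what response.encoding reports
text = body.decode(encoding)      # what response.text returns (str)

assert isinstance(body, bytes) and isinstance(text, str)
assert text == '段子'
```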
5、Scrapy's built-in media download
(1)Overview
- Scrapy ships reusable item pipelines for downloading the files attached to an item (for example, saving the images of a product you scrape). These pipelines share common methods and structure and are collectively called the media pipeline; in practice you use the Files Pipeline or the Images Pipeline;
- Advantages of the built-in file download: avoids duplicate downloads, easy control of the storage path, conversion to a common format, thumbnail generation, width/height checks, and asynchronous downloading with very high efficiency
- Steps:
- Define an item with two fields, image_urls and images; image_urls stores the URLs of the files to download and must be a list;
- When a file finishes downloading, information about the download (storage path, source URL, file checksum, etc.) is stored in the item's images field;
- Set IMAGES_STORE in settings.py; it controls the directory the files are downloaded to;
- Enable the pipeline: add 'scrapy.pipelines.images.ImagesPipeline': 1 to ITEM_PIPELINES
(2)Configuration
- items.py
import scrapy


class ImageItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    category = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
- settings.py
import os

# Enable the images pipeline and comment out the project's own pipeline
ITEM_PIPELINES = {
    # 'bmw.pipelines.BmwPipeline': 300,
    'scrapy.pipelines.images.ImagesPipeline': 1
}
# Image download path, used by the images pipeline
IMAGES_STORE = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'images')
6、Downloader Middlewares
- Downloader middleware sits between the engine and the downloader. In it we can set proxies, swap request headers, and so on to get around anti-crawling measures. A downloader middleware can implement two methods:
- process_request(self, request, spider), executed before the request is sent;
- process_response(self, request, response, spider), executed before the downloaded data reaches the engine
(1)Request-header middleware
- User-Agent reference link
- settings:
DOWNLOADER_MIDDLEWARES = {
    'Hello.middlewares.UserAgentDownloadMiddleware': 543,
}
- middlewares.py
import random


class UserAgentDownloadMiddleware:
    USER_AGENT = [
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36"
    ]

    def process_request(self, request, spider):
        user_agent = random.choice(self.USER_AGENT)
        request.headers['User-Agent'] = user_agent
(2)IP proxy middleware
- Link for checking your current IP
- settings
DOWNLOADER_MIDDLEWARES = {
    'Hello.middlewares.UserAgentDownloadMiddleware': 543,
    'Hello.middlewares.IPProxyDownloadMiddleware': 100,
}
import random


class IPProxyDownloadMiddleware:
    PROXIES = ["http://101.132.190.101:80", "http://175.148.71.139:1133", "http://222.95.144.68:3000"]

    def process_request(self, request, spider):
        proxy = random.choice(self.PROXIES)
        request.meta['proxy'] = proxy
(3)Selenium middleware
- When process_request() returns a Response object, the process_request() and process_exception() methods of lower-priority Downloader Middlewares are no longer called; instead, every Downloader Middleware's process_response() method runs, and the Response object is then handed straight to the Spider
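That short-circuit rule can be sketched with toy middlewares. This is a simplification (the real engine also handles process_exception and asynchronous results); the class names are made up:

```python
class FetchResponse:
    """Toy stand-in for a Response (e.g. a Selenium-rendered HtmlResponse)."""

def run_request_chain(middlewares, request):
    """Call process_request in priority order; stop at the first Response."""
    called, response = [], None
    for mw in middlewares:
        called.append(type(mw).__name__)
        result = mw.process_request(request)
        if result is not None:   # a middleware returned a Response:
            response = result    # lower-priority process_request is skipped
            break
    # every middleware's process_response still runs (in reverse order)
    for mw in reversed(middlewares):
        response = mw.process_response(request, response)
    return called, response

class PassThrough:
    def process_request(self, request):
        return None              # keep going down the chain
    def process_response(self, request, response):
        return response

class ShortCircuit:
    def process_request(self, request):
        return FetchResponse()   # e.g. a page rendered by a browser
    def process_response(self, request, response):
        return response

called, response = run_request_chain(
    [PassThrough(), ShortCircuit(), PassThrough()], request=None)
assert called == ['PassThrough', 'ShortCircuit']  # the third one was skipped
assert isinstance(response, FetchResponse)
```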
from logging import getLogger

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from scrapy.http import HtmlResponse


class SeleniumMiddleware():
    def __init__(self, timeout=None, service_args=None):
        self.logger = getLogger(__name__)
        self.timeout = timeout
        self.browser = webdriver.PhantomJS(service_args=service_args or [])
        self.browser.set_window_size(1400, 700)
        self.browser.set_page_load_timeout(self.timeout)
        self.wait = WebDriverWait(self.browser, self.timeout)

    def __del__(self):
        self.browser.close()

    def process_request(self, request, spider):
        """
        Fetch the page with PhantomJS
        :param request: Request object
        :param spider: Spider object
        :return: HtmlResponse
        """
        self.logger.debug('PhantomJS is Starting')
        page = request.meta.get('page', 1)
        try:
            self.browser.get(request.url)
            if page > 1:
                input = self.wait.until(
                    EC.presence_of_element_located((By.CSS_SELECTOR, '#mainsrp-pager div.form > input')))
                submit = self.wait.until(
                    EC.element_to_be_clickable((By.CSS_SELECTOR, '#mainsrp-pager div.form > span.btn.J_Submit')))
                input.clear()
                input.send_keys(page)
                submit.click()
            self.wait.until(
                EC.text_to_be_present_in_element((By.CSS_SELECTOR, '#mainsrp-pager li.item.active > span'), str(page)))
            self.wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.m-itemlist .items .item')))
            return HtmlResponse(url=request.url, body=self.browser.page_source, request=request, encoding='utf-8',
                                status=200)
        except TimeoutException:
            return HtmlResponse(url=request.url, status=500, request=request)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(timeout=crawler.settings.get('SELENIUM_TIMEOUT'),
                   service_args=crawler.settings.get('PHANTOMJS_SERVICE_ARGS'))
7、scrapy genspider -t crawl workflow (CrawlSpider), suited to sites with lots of data
(1)Create the project and spider
A: Create both from the terminal with the following commands
scrapy startproject [project name]
cd [project directory]
scrapy genspider -t crawl [spider file name] [domain to crawl]
B: Generated project layout:
(2)Write the spider and start crawling
A、Configure HEADERS and related settings in settings.py
- Change ROBOTSTXT_OBEY to False
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
- Adjust DEFAULT_REQUEST_HEADERS
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537'
                  '.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'
}
B、Define the field models in items.py
- Declare the fields your items need
import scrapy


class WxappItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    author = scrapy.Field()
    pub_time = scrapy.Field()
C、Write the spider in wxap_spider.py
- Update start_urls and write the parsing logic in the parse_detail method of wxap_spider.py;
- Use LinkExtractor and Rule; together they determine exactly where the spider goes;
- Write the allow pattern so it matches only the URLs you want; make sure it cannot also match other URLs;
- When to set follow: set it to True if, while crawling a page, URLs that satisfy the current rule should themselves be followed; otherwise set it to False;
- When to specify callback: if a URL's page is only used to harvest more URLs and you do not need its data, omit the callback; if you want the data on the pages a rule matches, specify one
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from wxapp.items import WxappItem


class WxapSpiderSpider(CrawlSpider):
    name = 'wxap_spider'
    allowed_domains = ['www.wxapp-union.com']
    start_urls = ['http://www.wxapp-union.com/portal.php?mod=list&catid=1']
    rules = (
        # list pages
        Rule(LinkExtractor(
            allow=r'.+mod=list&catid=2&page=\d'),
            follow=True),
        # detail pages
        Rule(LinkExtractor(
            allow=r'.+article.+\.html'),
            callback="parse_detail",
            follow=False),
    )

    def parse_detail(self, response):
        title = response.xpath('//h1[@class="ph"]/text()').get()
        author_p = response.xpath('//p[@class="authors"]')
        author = author_p.xpath(".//a/text()").get()
        pub_time = author_p.xpath(".//span/text()").get()
        print(f"title:{title},author:{author},pub_time:{pub_time}")
        item = WxappItem(title=title, author=author, pub_time=pub_time)
        yield item
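The allow patterns used in the rules above can be sanity-checked with plain re before running the crawl (the sample URLs are made up to match the site's URL scheme):

```python
import re

list_rule = re.compile(r'.+mod=list&catid=2&page=\d')
detail_rule = re.compile(r'.+article.+\.html')

urls = [
    'http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=3',
    'http://www.wxapp-union.com/article-4029-1.html',
    'http://www.wxapp-union.com/portal.php?mod=view',
]

# Each URL should match exactly one rule, or neither.
assert list_rule.match(urls[0]) and not detail_rule.match(urls[0])
assert detail_rule.match(urls[1]) and not list_rule.match(urls[1])
assert not list_rule.match(urls[2]) and not detail_rule.match(urls[2])
```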
(3)Store the content (pipelines.py): design pipelines that persist what was scraped
A、Configure ITEM_PIPELINES in settings
- In ITEM_PIPELINES, lower values mean higher priority
ITEM_PIPELINES = {
    'Hello.pipelines.HelloPipeline': 300,
}
B、Write pipelines.py
- Use JsonLinesItemExporter, which writes the item to disk every time export_item is called; the output is one JSON dict per line
from scrapy.exporters import JsonLinesItemExporter


class WxappPipeline(object):
    def __init__(self):
        self.fp = open('wxjc.json', 'wb')
        self.exporter = JsonLinesItemExporter(
            self.fp, ensure_ascii=False, encoding='utf-8')

    def open_spider(self, spider):
        print("Spider started...")

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.fp.close()
        print("Spider finished...")
(4)Run the spider
- Create a start.py file under the wxapp folder, put the command in it, and run start.py
from scrapy import cmdline

cmdline.execute(['scrapy', 'crawl', 'wxap_spider'])