基于Scrapy的爬虫解决方案

原創

2021-06-23 10:43

{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"一、背景介绍"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":"br"}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"笔者在业务中遇到了爬虫需求，由于之前没做过相关的活儿，所以从网上调研了很多内容。但是互联网上的信息比较杂乱，且真真假假，特别不方便，所以完成业务后就想写一篇对初学者友好且较为完整的文章，希望能对阅读者有所帮助。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"由于笔者最近Python用得比较熟练，所以就想用Python语言来完成这个任务。经过一番调研，发现Scrapy框架使用者比较多，文档也比较全，所以选择了使用该框架。（其实Scrapy只做了非常简单的封装，对于普通的爬虫任务，使用requests库和bs4库中的BeautifulSoup类就完全能解决了）。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"首先简单介绍一下爬虫是什么。爬虫就是从一个或多个URL链接开始，使用某种方法（例如requests库中的函数）获取到该URL对应的网页的内容（一般是HTML格式），然后从该网页的内容中提取出需要记录下来的信息和需要继续爬取的URL链接（例如使用上文中提到的BeautifulSoup类）。之后，再对爬取到的URL链接进行上述同样的操作，直到所有URL链接都被爬取完，爬虫程序结束。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Scrapy的官网【1】，英文版官方文档【2】，第三方的汉化文档（较为简陋和过时）【3】提供如下，感兴趣的读者也可以自行查阅。由于本文重点不在这里，就不在此处对Scrapy进行介绍了。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic"}],"text":"【1】：https:\/\/scrapy.org\/"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic"}],"text":"【2】：https:\/\/docs.scrapy.org\/en\/latest\/"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic"}],"text":"【3】：https:\/\/scrapy-chs.readthedocs.io\/zh_CN\/0.24\/index.html"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"二、Scrapy使用方法"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"安装Scrapy库"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"text"},"content":[{"type":"text","text":"pip install scrapy"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"新建一个爬虫项目"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"text"},"content":[{"type":"text","text":"scrapy startproject your_project_name"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"输入该命令后，会在当前目录下新建一个名为your_project_name的文件夹，该文件夹下的文件层级关系如下："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"text"},"content":[{"type":"text","text":"your_project_name\n| scrapy.cfg\n|----your_project_name\n| | __init__.py\n| | items.py\n| | middlewares.py\n| | pipelines.py\n| | settings.py\n| |----spiders\n| | | __init__.py"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"其中，scrapy.cfg是整个项目的配置文件，spiders目录下存放爬虫的逻辑代码，因为该项目刚建立，还没有写具体的爬虫代码，所以该目录下为空。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"生成一个爬虫"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在刚刚新建的项目目录下输入命令："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"text"},"content":[{"type":"text","text":"scrapy genspider example www.qq.com"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"其中example是爬虫的名字，www.qq.com是该爬虫的第一个要爬取的URL链接。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"执行该命令后，Scrapy会在spiders目录下生成一个叫example.py的文件，该文件是一个非常基础的爬虫模板。之后要做的事情就是在该py文件里填入具体的爬虫逻辑代码，然后再执行该爬虫脚本就可以了。example.py文件内的代码如下："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"python"},"content":[{"type":"text","text":"import scrapy\n\n\nclass ExampleSpider(scrapy.Spider):\n name = 'example'\n allowed_domains = ['qq.com']\n start_urls = ['http:\/\/qq.com\/']\n\n def parse(self, response):\n pass"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"代码中的ExampleSpider就是刚才生成的爬虫类。其中，name是爬虫的名字，allowed_domains是对域名的限制（即该爬虫只会爬取该限制下的URL域名），start_urls是爬虫的初始URL链接，这里面的值是刚才创建爬虫时输入的URL链接，parse函数是默认的解析函数。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"运行爬虫"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在项目目录下执行命令:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"scrapy crawl example"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"其中example是要运行的爬虫名字。执行该命令后，该框架就会用example爬虫里定义的初始URL链接和解析函数去爬取网页了。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"调试爬虫"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在写代码的过程中，由于不同网页的源码的组织方式不同，所以需要用一种交互式的方式来访问网页，以此来修改代码。虽然在很多情况下可以通过Chrome浏览器F12的审查模式来查看网页的HTML源码，但是在有些情况下代码中获得的源码和浏览器中看到的却是不一样的，所以交互式访问网页就必不可少了。（也可以通过运行完整爬虫的方式来调试代码，但是效率就有点低下了）。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"要想交互式访问网页，需要在项目目录下执行命令："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"scrapy shell www.qq.com"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"使用体验类似于直接在命令行输入python进入Python的交互式界面。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"完善解析函数"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"解析函数的完善是爬虫的核心步骤。解析函数的初始化如下："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"def parse(self, response): pass\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"其中只有response一个实参，该实参就是访问某个URL链接的返回结果，里面含有该URL链接的HTML源码（该response是对requests.Response类的封装，所以用法类似，但是包含的成员函数更多）。而解析函数parse的作用就是从response中杂乱的HTML源码提取出有价值的信息。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在Scrapy框架中，有两种解析HTML源码的函数，分别是css和xpath。其中css是Scrapy专有的函数，具体用法只能在Scrapy文档中查找，不建议使用；而xpath是一种通用的语言（例如BeautifulSoup类中也能使用），它的一些语法的定义在网上资料更多。xpath的具体用法要讲的话就太多了，所以这里不多做介绍，如果有需要，可以直接去搜索引擎查找相关资料。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果需要在解析过程中遇到了需要解析的URL链接，则可以直接调用："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"yield scrapy.Request(url_str, callback=self.parse)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"其中，url_str是需要解析的URL链接的字符串，self.parse是解析函数，这里我使用的是默认的解析函数，当然这里也能使用自定义的解析函数（自定义解析函数的入参出参类型需要和默认解析函数相同）。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"值得注意的是：scrapy.Request除了以上俩必须的参数外，还能通过meta字段来传递参数，而参数的获取能通过 response.meta 来实现。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"小建议"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"默认情况下，Scrapy会遵守被爬取网站的robots.txt规则（该文件规定了哪些能爬，哪些不能爬），但往往我们想要爬取的内容都被规定为不能爬取的内容。可以将settings.py文件中的 ROBOTSTXT_OBEY = True 改为 ROBOTSTXT_OBEY = False 来避免这种情况的发生。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"三、常见问题"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"动态网页不能正确解析"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"上述的简单操作只能解析静态网页，需要动态加载的网页（例如含有Javascript代码的网页）则无法正常解析，因为response里的HTML源码是动态加载之前的页面的源码，而我们需要的大多是动态加载之后的页面。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"可以通过在Python中调用Chrome浏览器的方式来处理这个问题。除此之外，还能使用Chrome浏览器的headless模式。使用了该模式之后，Chrome浏览器并不会真的被调用，但是Python中能获取到和浏览器相同的返回结果，而浏览器中返回的结果就是动态加载之后的页面。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"不过，要使用这个方法，必须在机器上安装Chrome浏览器和对应版本的Chrome驱动程序。安装完成之后，在middlewares.py文件中加入以下代码："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"python"},"content":[{"type":"text","text":"from selenium import webdriver\nfrom scrapy.http import HtmlResponse\n\n\nclass JavaScriptMiddleware:\n def process_request(self, request, spider):\n option = webdriver.ChromeOptions()\n option.add_argument('--headless')\n option.add_argument('--no-sandbox')\n option.add_argument('--disable-gpu')\n driver = webdriver.Chrome(options=option, executable_path=chrome_driver_path_str)\n driver.get(request.url)\n js = 'var q=document.documentElement.scrollTop=10000'\n driver.execute_script(js)\n body = driver.page_source\n return HtmlResponse(driver.current_url, body=body, encoding='utf-8', request=request)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"除此之外，还要在settings.py文件中加入以下代码："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"\nDOWNLOADER_MIDDLEWARES = {\n 'your_project_name.middlewares.JavaScriptMiddleware': 543,\n}"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"经过这两处修改之后，爬虫脚本里的所有request请求都会通过Chrome headless浏览器包装后再发向要爬取的URL链接。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"防爬虫之修改header"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"很多网站都有各自的反爬虫机制，但是最基础的一种方式是检查请求的HTTP包里面的header是否正常。其中经常检查的一个字段是User-Agent，User-Agent字段指的是浏览器的型号。反爬虫机制会检查该字段是否为普通的浏览器，而普通的爬虫程序是不会修饰该字段的。如果不显式将该字段设为某种浏览器型号，就容易触发反爬虫，从而不能正常地获得数据。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"要想修改Scrapy里的user-agent字段，可以在settings.py文件里添加以下代码："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"USER_AGENT = 'Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_15_6) \nAppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/84.0.4147.89 Safari\/537.36'"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"添加完该代码后，Scrapy在发起request请求时就会将上面的值替换到header中的User-Agent中。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"反爬虫之IP池"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在很多时候，爬取网站时一开始是能正常获得数据的，但是爬着爬着，就不能正常地获得数据了。一个很大的可能是IP被该网站封禁了。每个网站封IP的策略都不一样，但是总体来说其实就是该IP访问该网站的频率太高，网站害怕该访问是恶意攻击或者担心服务器承受不了大量的访问从而直接封禁该IP。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"应对方式也非常粗暴，那就是用代理IP去爬虫。网站封一个IP，我就用另外的IP去访问，只要我IP足够多，就总能获取到我想要的所有数据。而正好互联网上就有服务商提供这种IP服务。网上大致分为免费和付费两种服务，其中免费提供商提供的IP质量非常低，有不小的概率是直接不能用的，所以这里不推荐使用免费服务。至于付费服务商网上有很多家都挺靠谱的，本文里使用的名为“快代理”的服务商，下面提供的代码也是只针对该特定厂家的。不同服务商使用IP池的方式都不一样，具体使用方法还是以各自的官方文档为主。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在“快代理”上购买IP套餐后，在middleware.py文件中添加一下代码："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"python"},"content":[{"type":"text","text":"from w3lib.http import basic_auth_header\nimport requests\n\n\nclass ProxyDownloaderMiddleware:\n username = 'your_username'\n password = 'your_password'\n api_url = 'https:\/\/dps.kdlapi.com\/api\/getdps\/?orderid=your_orderid&num=1&pt=1&dedup=1&sep=1'\n proxy_ip_list = []\n list_max_len = 20\n\n def update_ip(self):\n if len(self.proxy_ip_list) != self.list_max_len:\n ip_str = requests.get('https:\/\/dps.kdlapi.com\/api\/getdps\/?orderid=your_orderid&num={}&pt=1&dedup=1&sep=3'.format(self.list_max_len)).text\n self.proxy_ip_list = ip_str.split(' ')\n while True:\n try:\n proxy_ip = self.proxy_ip_list.pop(0)\n proxies = {\n 'http': 'http:\/\/{}:{}@{}'.format(self.username, self.password, proxy_ip),\n 'https': 'http:\/\/{}:{}@{}'.format(self.username, self.password, proxy_ip)\n }\n requests.get('http:\/\/www.baidu.com', proxies=proxies, timeout=3.05)\n self.proxy_ip_list.append(proxy_ip)\n return\n except Exception as e:\n self.proxy_ip_list.append(requests.get(self.api_url).text)\n\n def process_request(self, request, spider):\n self.update_ip()\n request.meta['proxy'] = 'http:\/\/{}'.format(self.proxy_ip_list[-1])\n # 用户名密码认证\n request.headers['Proxy-Authorization'] = basic_auth_header(self.username, self.password)\n return None"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"其中username，password，order id都是“快代理”中使用IP所要用的参数。上面的代码维护了一个大小为20的IP池，每次要使用时就提取第一个IP并先要检查该IP是否已经失效，如果失效了就丢弃并补充新的IP。Scrapy每次发起request请求时，会经过该proxy层的封装，但要想正常使用，还得在settings.py文件中添加以下代码："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"python"},"content":[{"type":"text","text":"DOWNLOADER_MIDDLEWARES = { 'your_project_name.middlewares.ProxyDownloaderMiddleware': 100,}"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在上文爬取动态页面的相关内容中也修改了这个 DOWNLOADER_MIDDLEWARES 这个字典。该字典中的key和value分别是在middlewares.py文件中添加的类和封装request包的顺序。如果要同时使用动态页面爬取和IP池，那么settings.py文件的该参数应该如下所示："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"python"},"content":[{"type":"text","text":"DOWNLOADER_MIDDLEWARES = { 'your_project_name.middlewares.JavaScriptMiddleware': 543, 'your_project_name.middlewares.ProxyDownloaderMiddleware': 100,}\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"其中100 < 543，代表request请求要先经过代理封装再经过动态加载封装，最后才发送给目标URL链接。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"horizontalrule"},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"头图：Unsplash"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"作者：赵宇航"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"原文：https:\/\/mp.weixin.qq.com\/s\/-jCxnhzo-G9fzZNT-Azp7g"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"原文：基于Scrapy的爬虫解决方案"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"来源：云加社区 - 微信公众号 [ID：QcloudCommunity]"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"转载：著作权归作者所有。商业转载请联系作者获得授权，非商业转载请注明出处。"}]}]}

發表評論

所有評論

還沒有人評論，想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.

相關文章

01 稳定性（一）如何应对事故并做好覆盘？

這幾年，我在跨境電商經歷了業務快速發展的黃金時期，也遇到了一些令人“出乎意料”的事故。最嚴重的一次宕機，直接導致技術組織調整和一位技術大牛的離開，公司也賠了十幾萬；甚至半年後還有銷售經理和我抱怨說：“如果那一天系統不出問題，訂單量肯定會創

2024-05-22 00:17:24

为什么不推荐在Spring Boot中使用@Value加载配置

@Value註解相信很多Spring Boot的開發者都已經有接觸了，通過使用該註解，我們可以快速的把配置信息加載到Spring的Bean中。比如下面這樣，就可以輕鬆的把配置文件中key爲com.didispace.title配置信息加載

2024-05-21 21:46:20

基于多模态信息抽取的菜品知识图谱构建

菜品作爲到店餐飲各相關業務的基石，提供了更細粒度的視角理解餐飲供給，爲到餐精細化運營提供了抓手。美團到店研發平臺/數據智能平臺部與天津大學劉安安教授團隊展開了“基於多模態信息抽取的菜品知識圖譜構建”的科研合作，利用多模態檢索實現圖文食材的

2024-05-21 21:18:07

「实用推荐」如何为桌面 & 移动跨平台应用选择UI框架/APP架构？

DevExpress .NET MAUI UI組件庫提供了用於Android和iOS移動開發的高性能UI組件，該庫包括數據網格、圖表、日程、數據編輯器、CollectionView和選項卡組件。獲取DevExpress .NET MAUI

2024-05-21 12:19:30

用好AppBuilder-SDK，每天都能偷偷早下班

本文主要是對這次AppBuilder-SDK直播課程的文字總結，主題是如何在Python中使用AppBuilder-SDK（使用的IDE 爲 PyCharm社區版）感興趣的朋友也可以去看直播課回放。直播課： AppBuil

2024-05-21 12:12:15

实现“代码可视化”需要了解的前置知识-编译器中端

1. 前言前文實現“代碼可視化”需要了解的前置知識-編譯器前端介紹了編譯器前端知識並附帶了小練習，本文將繼續介紹編譯器中端相關的知識，還是概念+練習的學習方式。中間代碼是用來進行程序分析和實現代碼可視化的關鍵數據，瞭解其生成和優化方式能

2024-05-21 11:56:05

「Kimi」加入微信，最方便的私人财经助手

喜大普奔，「Kimi」大模型現在可以接入微信了！如果說其他大模型應用還略顯繁瑣，需要下載一個新的APP 甚至翻牆，那「Kimi」加入微信後，把大模型應用的體驗成本降低到 0。這意味着我們有了一個免費且好用且穩定且無

2024-05-21 11:48:27

探讨篇（二）：分层架构的艺术 - 打造合理且高效的架构体系

上篇從服務粒度角度進行了探討，本文繼續從服務內的分層角度探討。本文的觀點源自我在學習與實踐過程中的深思熟慮，尚處於不斷探索和驗證的階段。希望能“拋磚引玉”，激發更多的討論與交流。讓我們共同進步，在探討與實證中尋求真知。一、背景應用

2024-05-20 23:55:42

探讨篇（一）：服务粒度的艺术 - 简化架构与避免服务泛滥

一、背景上週小組有個需求上線牽扯9個應用（小組目前維護了26個服務，由於團隊系統業務屬性特徵基於高可用、高性能原則拆分，有些是合理的，有些不是很合理的），同時上週OpsReview的一個微服務濫用典範案例（Promise服務A調用服務B，

2024-05-20 23:55:39

Java常用的JSON序列化与反序列化工具实践

JSON簡介： JSON（Java Script Object Notation）是一種輕量級的數據交換格式，通常用於在不同系統之間傳輸數據。它基於 JavaScript 對象語法，但已成爲一種獨立於語言的格式。JSON 數據以鍵值對的形式

2024-05-20 23:55:38

MQTTX 1.9.10 发布：升级 Faker.js、增强连接与订阅诊断、优化 UI

MQTTX 1.9.10 版本現已發佈。本次更新帶來了重要的 Faker.js 升級、對連接斷開和訂閱問題的深度診斷，並增強了 CLI 與 UI 的多項功能。此次更新旨在簡化用戶操作流程，並提升問題排查效率。點擊此處下載最新版本：htt

2024-05-20 22:10:05

CDH配置Kerberos和Sentry详解

1.安全之Kerberos安全認證 1 Kerberos概述 1.1 什麼是Kerberos Kerberos是一種計算機網絡授權協議，用來在非安全網絡中，對個人通信以安全的手段進行身份認證。這個詞又指麻省理工學院爲這個協議開發的一套計算

2024-05-20 21:36:31

年薪百万的程序员都在用的摸鱼方式……

隨着信息技術的不斷髮展，許多傳統的工作流程正在向自動化方向迅速轉變。在過去，開發人員在完成代碼編寫後，需要通過手動執行一系列操作來將代碼部署到生產環境中。這看似簡單，但在實際操作過程中，不僅容易出現各種人爲失誤，還會消耗開發人員大量的時

2024-05-20 14:05:33

常用第三方库的package.json入口配置

react@^17 內部含有 umd和commonjs的包，沒有esm。只提供了main字段，並沒有browser字段。提供了browserify 配置。總結：構建時需要特殊處理，不能按照統一規則處理。 react-dom@^17 內

2024-05-20 13:41:31

MFC扩展库BCGControlBar Pro v34.1新版亮点：日历和计划表等功能升级

BCGControlBar庫擁有500多個經過全面設計、測試和充分記錄的MFC擴展類。我們的組件可以輕鬆地集成到您的應用程序中，併爲您節省數百個開發和調試時間。 BCGControlBar專業版 v34.1已正式發佈了，這個版本包含了對W

2024-05-20 12:20:52

24小時熱門文章

最新文章

最新評論文章