URLError:

While writing a simple little crawler, I hit the following error when running it from the command line:

Traceback (most recent call last):
  File "E:\Anaconda2\lib\site-packages\boto\utils.py", line 210, in retry_url
    r = opener.open(req, timeout=timeout)
  File "E:\Anaconda2\lib\urllib2.py", line 431, in open
    response = self._open(req, data)
  File "E:\Anaconda2\lib\urllib2.py", line 449, in _open
    '_open', req)
  File "E:\Anaconda2\lib\urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "E:\Anaconda2\lib\urllib2.py", line 1227, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "E:\Anaconda2\lib\urllib2.py", line 1197, in do_open
    raise URLError(err)
URLError: <urlopen error [Errno 10051] >
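For reference, [Errno 10051] is the Windows socket error WSAENETUNREACH ("network is unreachable"): the process tried to open a connection to an address this machine has no route to.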
A Baidu search turned up the cause:

That particular error message is being generated by boto (boto 2.38.0 py27_0), which is used to connect to Amazon S3. Scrapy doesn't have this enabled by default.
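The retry_url call in the traceback is, in all likelihood, boto probing the EC2 instance-metadata service for AWS credentials; on an ordinary Windows machine that address is unreachable, hence the [Errno 10051]. A quick way to confirm which boto installation Scrapy is picking up (a minimal sketch, assuming the same Anaconda environment as in the traceback):

import boto
print(boto.__version__)  # should print e.g. '2.38.0', the version named above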

Solution:

Add the following to settings.py; it disables Scrapy's built-in S3 download handler, the component that pulls in boto:

DOWNLOAD_HANDLERS = {'s3': None,}
If that does not fix it, there is another approach: add the following to the spider.py file:

from scrapy import optional_features
optional_features.remove('boto')

That solved the problem.
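Note that scrapy.optional_features only exists in older Scrapy releases (as far as I know it was removed around Scrapy 1.1), so this second trick applies to Python 2 setups like the one here; on newer Scrapy, stick with the DOWNLOAD_HANDLERS setting.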
Also, here is a very simple crawler for the campus recruitment-talk listings on 海投网 (xjh.haitou.cc):
The items.py file is as follows:
from scrapy.item import Item, Field

class XuanjianghuiItem(Item):
    # define the fields for your item here like:
    title = Field()     # talk title
    holdTime = Field()  # scheduled date/time of the talk
The settings.py file is as follows:
BOT_NAME = 'XuanJiangHui'
SPIDER_MODULES = ['XuanJiangHui.spiders']
NEWSPIDER_MODULE = 'XuanJiangHui.spiders'
DOWNLOAD_HANDLERS = {'s3': None,}  # the boto/S3 fix described above
ITEM_PIPELINES = {
    # 300 is the pipeline's order value (0-1000, lower runs first)
    'XuanJiangHui.pipelines.XuanjianghuiPipeline': 300,
}

The pipelines.py file is as follows:
import codecs

class XuanjianghuiPipeline(object):
    def __init__(self):
        # open the output file once; codecs.open transparently encodes the
        # unicode strings we write (this is a Python 2 project)
        self.file = codecs.open('F://XuanJiangHui.txt', 'wb', encoding='utf-8')

    def process_item(self, item, spider):
        title = item['title'].strip()
        holdTime = item['holdTime']
        # one record per talk: title, then its time, then a blank line
        self.file.write(title + '\n' + holdTime)
        self.file.write('\r\n')
        self.file.write('\r\n')
        return item

    def close_spider(self, spider):
        # called once when the spider finishes; release the file handle
        self.file.close()
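A note on the design: opening the file in __init__ works, but Scrapy pipelines also support open_spider/close_spider hooks, which are a natural place for this kind of file handling; either way, closing the handle (as in close_spider above) makes sure buffered output is flushed.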
The XuanJiangHui.py spider file is as follows:
# -*- coding:utf-8 -*-

from scrapy.spider import Spider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from XuanJiangHui.items import XuanjianghuiItem

class XuanjianghuiSpider(Spider):
    name = "XuanJiangHui"
    download_delay = 1  # throttle to roughly one request per second
    start_urls = [
        "http://xjh.haitou.cc/wh/uni-1",
        "http://xjh.haitou.cc/bj/uni-13",
        "http://xjh.haitou.cc/cd/uni-147",
        "http://xjh.haitou.cc/hf/uni-47",
        "http://xjh.haitou.cc/gz/uni-32",
        "http://xjh.haitou.cc/gz/uni-34",
        "http://xjh.haitou.cc/gz/uni-36"
    ]
    # a desktop Chrome User-Agent, sent with the pagination requests below
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}

    def parse(self, response):
        sel = HtmlXPathSelector(response)
        # each table row under div#w0 is one campus talk
        for tr in sel.xpath('//div[@id="w0"]//tbody/tr'):
            item = XuanjianghuiItem()  # a fresh item per row, not one shared instance
            title = tr.xpath('./td[@class="cxxt-title"]/a/@title')
            holdTime = tr.xpath('./td[@class="text-left cxxt-holdtime"]/span[@class="hold-ymd"]/text()')
            item['title'] = title.extract()[0]
            item['holdTime'] = holdTime.extract()[0]
            yield item
        # follow the "next page" link back into parse() until there is none
        urls = sel.xpath('//*[@id="w0"]/ul/li[@class="next"]/a/@href').extract()
        for url in urls:
            url = "http://xjh.haitou.cc" + url
            yield Request(url, headers=self.header, callback=self.parse)
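To run the spider, use the standard Scrapy command from the project root (the directory containing scrapy.cfg); the pipeline above writes the results to F://XuanJiangHui.txt:

scrapy crawl XuanJiangHui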
