URLError:

While writing a simple little crawler, I hit the following error when running it from the command line:

Traceback (most recent call last):
  File "E:\Anaconda2\lib\site-packages\boto\utils.py", line 210, in retry_url
    r = opener.open(req, timeout=timeout)
  File "E:\Anaconda2\lib\urllib2.py", line 431, in open
    response = self._open(req, data)
  File "E:\Anaconda2\lib\urllib2.py", line 449, in _open
    '_open', req)
  File "E:\Anaconda2\lib\urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "E:\Anaconda2\lib\urllib2.py", line 1227, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "E:\Anaconda2\lib\urllib2.py", line 1197, in do_open
    raise URLError(err)
URLError: <urlopen error [Errno 10051] >
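For reference, [Errno 10051] is the Windows socket error WSAENETUNREACH ("network is unreachable"): the process tried to open a connection to an address this machine has no route to.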
A Baidu search turned up the cause:

That particular error message is being generated by boto (boto 2.38.0 py27_0), which is used to connect to Amazon S3. Scrapy doesn't have this enabled by default.
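The retry_url call in the traceback is, in all likelihood, boto probing the EC2 instance-metadata service for AWS credentials; on an ordinary Windows machine that address is unreachable, hence the [Errno 10051]. A quick way to confirm which boto installation Scrapy is picking up (a minimal sketch, assuming the same Anaconda environment as in the traceback):

import boto
print(boto.__version__)  # should print e.g. '2.38.0', the version named above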

Solution:

Add the following to settings.py; it disables Scrapy's built-in S3 download handler, the component that pulls in boto:

DOWNLOAD_HANDLERS = {'s3': None,}
If that does not fix it, there is another approach: add the following to the spider.py file:

from scrapy import optional_features
optional_features.remove('boto')

That solved the problem.
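Note that scrapy.optional_features only exists in older Scrapy releases (as far as I know it was removed around Scrapy 1.1), so this second trick applies to Python 2 setups like the one here; on newer Scrapy, stick with the DOWNLOAD_HANDLERS setting.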
Also, here is a very simple crawler for the campus recruitment-talk listings on 海投网 (xjh.haitou.cc):
The items.py file is as follows:
from scrapy.item import Item, Field

class XuanjianghuiItem(Item):
    # define the fields for your item here like:
    title = Field()     # talk title
    holdTime = Field()  # scheduled date/time of the talk
The settings.py file is as follows:
BOT_NAME = 'XuanJiangHui'
SPIDER_MODULES = ['XuanJiangHui.spiders']
NEWSPIDER_MODULE = 'XuanJiangHui.spiders'
DOWNLOAD_HANDLERS = {'s3': None,}  # the boto/S3 fix described above
ITEM_PIPELINES = {
    # 300 is the pipeline's order value (0-1000, lower runs first)
    'XuanJiangHui.pipelines.XuanjianghuiPipeline': 300,
}

The pipelines.py file is as follows:
import codecs

class XuanjianghuiPipeline(object):
    def __init__(self):
        # open the output file once; codecs.open transparently encodes the
        # unicode strings we write (this is a Python 2 project)
        self.file = codecs.open('F://XuanJiangHui.txt', 'wb', encoding='utf-8')

    def process_item(self, item, spider):
        title = item['title'].strip()
        holdTime = item['holdTime']
        # one record per talk: title, then its time, then a blank line
        self.file.write(title + '\n' + holdTime)
        self.file.write('\r\n')
        self.file.write('\r\n')
        return item

    def close_spider(self, spider):
        # called once when the spider finishes; release the file handle
        self.file.close()
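A note on the design: opening the file in __init__ works, but Scrapy pipelines also support open_spider/close_spider hooks, which are a natural place for this kind of file handling; either way, closing the handle (as in close_spider above) makes sure buffered output is flushed.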
The XuanJiangHui.py spider file is as follows:
# -*- coding:utf-8 -*-

from scrapy.spider import Spider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from XuanJiangHui.items import XuanjianghuiItem

class XuanjianghuiSpider(Spider):
    name = "XuanJiangHui"
    download_delay = 1  # throttle to roughly one request per second
    start_urls = [
        "http://xjh.haitou.cc/wh/uni-1",
        "http://xjh.haitou.cc/bj/uni-13",
        "http://xjh.haitou.cc/cd/uni-147",
        "http://xjh.haitou.cc/hf/uni-47",
        "http://xjh.haitou.cc/gz/uni-32",
        "http://xjh.haitou.cc/gz/uni-34",
        "http://xjh.haitou.cc/gz/uni-36"
    ]
    # a desktop Chrome User-Agent, sent with the pagination requests below
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}

    def parse(self, response):
        sel = HtmlXPathSelector(response)
        # each table row under div#w0 is one campus talk
        for tr in sel.xpath('//div[@id="w0"]//tbody/tr'):
            item = XuanjianghuiItem()  # a fresh item per row, not one shared instance
            title = tr.xpath('./td[@class="cxxt-title"]/a/@title')
            holdTime = tr.xpath('./td[@class="text-left cxxt-holdtime"]/span[@class="hold-ymd"]/text()')
            item['title'] = title.extract()[0]
            item['holdTime'] = holdTime.extract()[0]
            yield item
        # follow the "next page" link back into parse() until there is none
        urls = sel.xpath('//*[@id="w0"]/ul/li[@class="next"]/a/@href').extract()
        for url in urls:
            url = "http://xjh.haitou.cc" + url
            yield Request(url, headers=self.header, callback=self.parse)
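To run the spider, use the standard Scrapy command from the project root (the directory containing scrapy.cfg); the pipeline above writes the results to F://XuanJiangHui.txt:

scrapy crawl XuanJiangHui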
