Things to watch out for when crawling with the Scrapy framework

Original

疯子～

2018-08-30 00:09

1. URLs in start_urls must be written out in full, including the scheme (http:// or https://).
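
As a quick illustration of what "complete" means here, the sketch below uses the standard library to check that a URL carries both a scheme and a host. The helper name is_complete_url and the candidate URLs are made up for this example; they are not part of Scrapy.

```python
from urllib.parse import urlparse


def is_complete_url(url):
    """Return True if the URL has an http(s) scheme and a host."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


# Hypothetical start_urls candidates
candidates = ["http://pic.netbian.com/", "pic.netbian.com", "//pic.netbian.com/"]

# Only the first entry is usable as-is; the other two lack a scheme.
complete = [u for u in candidates if is_complete_url(u)]
```

A check like this can be run over start_urls before the spider starts, so an incomplete entry is caught early.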

2. When extracting data, we often want to grab a whole block of content first and then pull individual pieces out of that block.

For example:

    def parse(self, response):
        ul_list = response.xpath('//ul[@class="post small-post"]')  # each match is one whole block
        print(ul_list)

        for ul in ul_list:
            title = ul.xpath('.//div[@class="cover"]/@cover-text').extract()  # extract a piece from inside the block

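Note that xpath() always returns a list: a SelectorList whose elements are scrapy.selector.Selector objects.
Because each element is a Selector, you can iterate over the result and call xpath() on each element again to search inside it.

If you append extract() to ul_list above, you get a list of plain strings instead, and the xpath() call inside the loop no longer works.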

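3. To save the scraped data to a file:

    scrapy crawl meikong -o meikong.xml

Here meikong is the pinyin of 美空 (Meikong): the first occurrence is the spider name and meikong.xml is the output file name.

The export format follows the file extension. Four formats are commonly used: JSON, JSON Lines, CSV and XML.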

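4. ValueError: Missing scheme in request url: h

This error appears when the download URL is stored in the item as a plain string. The images pipeline expects a list of URLs, so wrap the URL in a list before putting it into the item and passing the item on to the pipeline.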
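
The stray h at the end of the error message is a hint. When the URL field holds a plain string, iterating over it yields single characters, and the first character of http://... is h. The sketch below reproduces that with plain Python; the URL is a placeholder, and the iteration only imitates how a pipeline walks over the URL field.

```python
img_url = "http://pic.netbian.com/example.jpg"  # placeholder URL

# Iterating a string yields characters, so the first "URL" requested is just 'h'.
first_from_string = next(iter(img_url))

# Wrapping the string in a list makes the iteration yield the whole URL.
first_from_list = next(iter([img_url]))
```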

5. When crawling multiple pages

Pass the URL of the next page to scrapy.Request; the response is then processed by self.parse again.

    if len(next_url) != 0:
        # print(next_url)
        url = 'http://pic.netbian.com' + next_url[0]
        # Pass the URL to scrapy.Request; the response is handled by self.parse again
        yield scrapy.Request(url=url, callback=self.parse)

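callback names the method that will receive the response.

URLs often have to be assembled from parts before they are requested. Watch the form of each part, such as leading and trailing slashes, to avoid malformed URLs.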
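
One way to sidestep slash mistakes is to let the standard library resolve the link against the current page instead of concatenating strings. The sketch below uses urllib.parse.urljoin; the page paths are invented for this example. Inside a spider, response.urljoin(href) does the same thing with response.url as the base.

```python
from urllib.parse import urljoin

# Hypothetical URL of the page currently being parsed
base = "http://pic.netbian.com/4kmeinv/index_2.html"

# A root-relative link and a page-relative link resolve to the same address.
from_root_relative = urljoin(base, "/4kmeinv/index_3.html")
from_page_relative = urljoin(base, "index_3.html")

# Plain concatenation silently breaks when the link has no leading slash.
concatenated = "http://pic.netbian.com" + "index_3.html"
```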

6. Scrapy's built-in pipeline for downloading images

'scrapy.pipelines.images.ImagesPipeline': 1
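
For context, a settings.py fragment that enables this pipeline might look like the sketch below. IMAGES_STORE and IMAGES_URLS_FIELD are existing Scrapy settings, but the directory name is a placeholder, and the img_url field name is an assumption taken from the item used in tip 8. By default the pipeline reads a field called image_urls. The pipeline also needs the Pillow package installed.

```python
# settings.py (fragment)

# Enable the built-in images pipeline; the number is its order among pipelines.
ITEM_PIPELINES = {
    "scrapy.pipelines.images.ImagesPipeline": 1,
}

# Directory where downloaded images are written (placeholder path).
IMAGES_STORE = "images"

# Item field that holds the list of image URLs (assumed to be named img_url here).
IMAGES_URLS_FIELD = "img_url"
```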

7. Parsing JSON text from a response into Python objects

jsonData = json.loads(response.text)
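
A small standalone sketch of what json.loads returns. The JSON string stands in for response.text, and its keys are invented for this example.

```python
import json

# Stand-in for response.text; the structure is made up
text = '{"total": 1, "data": [{"name": "a", "img_url": "http://example.com/a.jpg"}]}'

json_data = json.loads(text)  # JSON objects become dicts, arrays become lists

# Ordinary dict and list access works from here on.
names = [entry["name"] for entry in json_data["data"]]
```

If the text is not valid JSON, json.loads raises json.JSONDecodeError, so it can be worth catching that when an endpoint sometimes returns an HTML error page.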

8. To download or save data locally (img_url, name)

# Import the item class from items.py and fill in its fields
item = Xxxxitem()
item['img_url'] = [img_url]
item['name'] = name

The corresponding fields must be declared in items.py:

img_url = scrapy.Field()
name = scrapy.Field()

9. To drop the first record from the results, use del type_list[0]. Whether this is needed depends on the page being scraped.
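
A short sketch of two ways to skip the first entry, using made-up values. del changes the list in place, while slicing builds a new list and also tolerates an empty result.

```python
type_list = ["header", "landscape", "portrait"]  # made-up values

rest = type_list[1:]  # new list without the first entry; type_list is unchanged here
del type_list[0]      # removes the first entry in place

# On an empty list, del raises IndexError, while slicing just returns [].
empty = []
safe = empty[1:]
```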

10. Ways to get all the text inside a tag (see the previous blog post for details)

    if len(author) != 0:
        # Several ways to get all the text inside a tag:
        # 1. Select the outermost tag, walk all its child tags, and collect each tag's text
        # 2. Strip all tags with a regular expression: re.compile(...).sub()
        # 3. /text() gets the tag's own text; //text() gets the text of the tag and its descendants
        content_list = div.xpath('.//div[@class="d_post_content j_d_post_content "]//text()').extract()
        # 4. Use xpath('string(.)') to get all the text, already concatenated
        content = div.xpath('.//div[@class="d_post_content j_d_post_content "]').xpath('string(.)').extract()[0] + '\n'
        self.f.write(content)
        print(content_list)
        remove = re.compile(r'\s')  # whitespace
        douhao = re.compile(',')    # commas ("douhao" is pinyin for comma)
        content = ''
        for string in content_list:
            string = re.sub(remove, '', string)
            string = re.sub(douhao, '', string)
            # print(string)
            content += string + ','
        print(content)
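
The cleanup loop at the end of that snippet can also be written as a small helper. This is a sketch under two stated differences from the loop: it drops fragments that are empty after cleaning, and it uses str.join, so there is no trailing comma. The name join_fragments and the sample fragments are made up for this example.

```python
import re

# Whitespace and commas, the same characters the loop removes
_STRIP = re.compile(r"[\s,]")


def join_fragments(fragments, sep=","):
    """Clean each text fragment and join the non-empty ones with sep."""
    cleaned = (_STRIP.sub("", fragment) for fragment in fragments)
    return sep.join(fragment for fragment in cleaned if fragment)


# Stand-in for the list returned by //text() followed by extract()
fragments = ["\n   Hello, ", "world", "  \n", " again "]
joined = join_fragments(fragments)
```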

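11. Passing extra values to another callback with meta

         yield scrapy.Request(url=url, meta={'type': catId[0]}, callback=self.get_content_with_url)

    # The response for url, together with the meta values, is delivered to this method
    def get_content_with_url(self, response):
        cat_type = response.meta['type']  # the value attached to the request above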
