An Attempt at Scraping Taobao Data with Scrapy

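I wanted to learn about databases and needed a fairly large amount of data to practise on, so Taobao was naturally the first source that came to mind: it holds a huge amount of product information. Taobao does have quite a few anti-scraping measures, though, and the product detail pages in particular carry annoying dynamic content.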

This example uses the basic spider class (scrapy.Spider) from the Scrapy framework; I still have not quite figured out CrawlSpider = = b

Here is the complete code first:

import scrapy
import re
import csv
import pymongo
from tmail.items import TmailItem
class WeisuenSpider(scrapy.Spider):
    name = 'weisuen'
    # No s= parameter here: start_requests() appends the page offset itself
    start_url = 'https://s.taobao.com/search?q=%E5%B8%BD%E5%AD%90&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.50862.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170817'
    detail_urls=[]
    data=[]
    client=pymongo.MongoClient("localhost",27017)
    db=client.taobao
    db=db.items
    def start_requests(self):
        for i in range(30):  # 30 result pages is plenty
            url=self.start_url+'&s='+str(i*44)
            # A plain GET is enough here, so scrapy.Request replaces FormRequest
            yield scrapy.Request(url=url,callback=self.parse)
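    def url_decode(self,temp):
        # Strip each escape sequence: 7 characters starting at the first backslash
        # (presumably sequences such as \u003d, whose backslash is doubled by str() on the list)
        while '\\' in temp:
            index=temp.find('\\')
            st=temp[index:index+7]
            temp=temp.replace(st,'')

        # Put '=' and '&' back around the known parameter names id, ns and abbucket
        index=temp.find('id')
        temp=temp[:index+2]+'='+temp[index+2:]
        index=temp.find('ns')
        temp=temp[:index]+'&'+'ns='+temp[index+2:]
        index=temp.find('abbucket')
        # The URL is protocol-relative, so prepend the scheme
        temp='https:'+temp[:index]+'&'+'abbucket='+temp[index+8:]
        return temp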
    def parse(self, response):
        # The listing data is embedded in <script> tags, so collect their text
        scripts=response.xpath('//script/text()').extract()
        pat='"raw_title":"(.*?)","pic_url".*?,"detail_url":"(.*?)","view_price":"(.*?)"'
        # str() on the list doubles every backslash, which url_decode() relies on
        urls=re.findall(pat,str(scripts))
        urls=urls[1:]  # skip the first match, as the original pop(0) did
        for url in urls:  # each match is (title, detail URL, price)
            weburl=self.url_decode(temp=url[1])
            item=TmailItem()
            item['name']=url[0]
            item['link']=weburl
            item['price']=url[2]
            # insert_one() replaces the deprecated insert(); store a plain dict copy
            self.db.insert_one(dict(item))
            self.detail_urls.append(weburl)
            self.data.append(item)
            yield item
            # Request the detail page so reviews and other information can be scraped later
            yield scrapy.Request(url=weburl,callback=self.detail)
    def detail(self,response):
        print(response.url)
        # First determine whether the URL comes from Tmall or Taobao
        if 'tmall' in str(response.url):
            pass
        else:
            pass

items.py defines three fields: name, price, and link.

The start page is Taobao's search URL, with the keyword set to "帽子" (hats). To search for something else, just change the value after q= in the URL.
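Changing the keyword by hand means percent-encoding it as UTF-8. A minimal sketch of doing that with the standard library follows; the helper name with_keyword and the shortened URL are made up for this illustration and are not part of the spider.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_keyword(url, keyword):
    """Return url with the value of its q= parameter replaced by keyword."""
    parts = urlsplit(url)
    # Keep every parameter (including empty ones such as imgfile=) in order
    params = [
        (key, keyword if key == 'q' else value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    # urlencode percent-encodes non-ASCII text as UTF-8
    return urlunsplit(parts._replace(query=urlencode(params)))


# Shortened version of the search URL, used only for this example
url = 'https://s.taobao.com/search?q=%E5%B8%BD%E5%AD%90&imgfile=&ie=utf8'
print(with_keyword(url, '鞋子'))
```

Passing '帽子' back in reproduces the original URL, which is a quick way to check that the encoding matches what Taobao uses in the address bar.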

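Because this kind of product has a very large number of listings spread over many pages, I override the start_requests(self) method to fetch the first 30 pages.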

First:

    name = 'weisuen'
    # No s= parameter here: start_requests() appends the page offset itself
    start_url = 'https://s.taobao.com/search?q=%E5%B8%BD%E5%AD%90&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.50862.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170817'
    detail_urls=[]
    data=[]
    client=pymongo.MongoClient("localhost",27017)
    db=client.taobao
    db=db.items

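The class definition opens the MongoDB connection up front. I initially wrote the results to a txt file and a CSV file to check them, and switched to the database once that worked.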

    def start_requests(self):
        for i in range(30):  # 30 result pages is plenty
            url=self.start_url+'&s='+str(i*44)
            # A plain GET is enough here, so scrapy.Request replaces FormRequest
            yield scrapy.Request(url=url,callback=self.parse)

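Looking at the URLs shows that the page is selected by the s=xx parameter at the end of the URL, and that its value equals the zero-based page index multiplied by 44 (s=0 for the first page, s=44 for the second, and so on).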
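The offset rule can be sketched without Scrapy at all. The function name search_urls is hypothetical, and the sketch assumes that q, ie and s alone are enough for the search page to respond, which I have not verified; the spider itself keeps the full query string.

```python
from urllib.parse import urlencode

ITEMS_PER_PAGE = 44  # offset step observed in the search URLs


def search_urls(keyword, pages):
    """Yield one search URL per result page, starting at offset 0."""
    base = 'https://s.taobao.com/search'
    for page in range(pages):
        # Assumption: these three parameters are sufficient
        query = urlencode({'q': keyword, 'ie': 'utf8', 's': page * ITEMS_PER_PAGE})
        yield base + '?' + query


for url in search_urls('帽子', 3):
    print(url)
```

The three URLs printed end in s=0, s=44 and s=88 respectively.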

    def parse(self, response):
        # The listing data is embedded in <script> tags, so collect their text
        scripts=response.xpath('//script/text()').extract()
        pat='"raw_title":"(.*?)","pic_url".*?,"detail_url":"(.*?)","view_price":"(.*?)"'
        # str() on the list doubles every backslash, which url_decode() relies on
        urls=re.findall(pat,str(scripts))
        urls=urls[1:]  # skip the first match, as the original pop(0) did
        for url in urls:  # each match is (title, detail URL, price)
            weburl=self.url_decode(temp=url[1])
            item=TmailItem()
            item['name']=url[0]
            item['link']=weburl
            item['price']=url[2]
            # insert_one() replaces the deprecated insert(); store a plain dict copy
            self.db.insert_one(dict(item))
            self.detail_urls.append(weburl)
            self.data.append(item)
            yield item
            # Request the detail page so reviews and other information can be scraped later
            yield scrapy.Request(url=weburl,callback=self.detail)

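The callback parses the downloaded page. The trouble I ran into here was that response.text raised a 'GBK xxxxx' error. As far as I can tell, Taobao pages are not purely UTF-8 but mix in other encodings, so decoding them this way fails. My workaround is to grab all of the relevant content with XPath first and then extract the fields with a regular expression. Each product URL also contains dynamic content that has to be removed, which is what the url_decode() method does. The code for it is: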

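    def url_decode(self,temp):
        # Strip each escape sequence: 7 characters starting at the first backslash
        # (presumably sequences such as \u003d, whose backslash is doubled by str() on the list)
        while '\\' in temp:
            index=temp.find('\\')
            st=temp[index:index+7]
            temp=temp.replace(st,'')

        # Put '=' and '&' back around the known parameter names id, ns and abbucket
        index=temp.find('id')
        temp=temp[:index+2]+'='+temp[index+2:]
        index=temp.find('ns')
        temp=temp[:index]+'&'+'ns='+temp[index+2:]
        index=temp.find('abbucket')
        # The URL is protocol-relative, so prepend the scheme
        temp='https:'+temp[:index]+'&'+'abbucket='+temp[index+8:]
        return temp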
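For comparison, here is a sketch of a different way to do the same extraction. It is not what the spider above does: it decodes the raw bytes leniently, applies the same regular expression to the text itself rather than to str() of a list, and lets json.loads undo the escapes instead of cutting them out by position. The sample data and the name extract_items are invented for the illustration, and it assumes the escapes in detail_url are ordinary JSON escapes such as \u003d and \u0026.

```python
import json
import re

PATTERN = '"raw_title":"(.*?)","pic_url".*?,"detail_url":"(.*?)","view_price":"(.*?)"'


def extract_items(body):
    """Return (name, price, link) tuples found in the raw page bytes."""
    # Bytes that are not valid UTF-8 are replaced instead of raising an error
    text = body.decode('utf-8', errors='replace')
    results = []
    for title, raw_url, price in re.findall(PATTERN, text):
        # Wrapping the value in quotes makes it a JSON string literal, so
        # json.loads turns the escaped characters back into '=' and '&'
        link = 'https:' + json.loads('"' + raw_url + '"')
        results.append((title, price, link))
    return results


# Hypothetical sample shaped like the data the regular expression expects,
# followed by one invalid byte to exercise the lenient decoding
sample = (
    '{"raw_title":"Straw hat","pic_url":"//img.example/1.jpg",'
    '"detail_url":"//detail.tmall.com/item.htm?id\\u003d123\\u0026ns\\u003d1\\u0026abbucket\\u003d0",'
    '"view_price":"19.90"}'
).encode('utf-8') + b'\xff'

print(extract_items(sample))
```

Because the escapes are decoded rather than removed, this version does not depend on the parameters being named id, ns and abbucket. Whether it also avoids the 'GBK' error described above is something I have not tested against the live site.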

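The URL it returns can be opened directly. The parse callback writes the extracted fields to the database and, to keep the spider easy to extend, also generates requests for the detail pages, so that reviews, ratings, and similar information can be scraped later.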
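The detail callback in the full listing only checks whether 'tmall' appears anywhere in the URL. A slightly stricter check, sketched below, looks at the host name instead; shop_type is a hypothetical helper and the URLs in the usage lines are placeholders.

```python
from urllib.parse import urlsplit


def shop_type(url):
    """Classify a detail-page URL as 'tmall', 'taobao' or 'unknown' by its host."""
    host = urlsplit(url).hostname or ''
    if host == 'tmall.com' or host.endswith('.tmall.com'):
        return 'tmall'
    if host == 'taobao.com' or host.endswith('.taobao.com'):
        return 'taobao'
    return 'unknown'


print(shop_type('https://detail.tmall.com/item.htm?id=1'))
print(shop_type('https://item.taobao.com/item.htm?id=2'))
```

Matching on the host avoids a false positive when the string 'tmall' merely appears in a query parameter of a Taobao URL.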

Database contents:

The CSV file generated earlier:
