Previously, whether I was learning from videos or following along with deep-learning tutorials, the datasets were always ready-made. This time, building a toilet-recognition program from scratch, I need to collect the data myself.
Fortunately, the growth of the internet, and of e-commerce sites in particular, has accumulated plenty of product data.
With Python we can scrape the review photos of a given brand of toilet from Tmall and JD.com and use them as a dataset.
A Baidu search turns up plenty of code for scraping review images; the Tmall scraper in this article is based on the code from the following post: Python Crawler (6): Fetching Tmall Product Review Information.
The tricky part of fetching review images is constructing the right URL. The linked post gives the details; a summary follows.
Here is an example of such a URL:
https://rate.tmall.com/list_detail_rate.htm?itemId=45492997665&spuId=64652363&sellerId=667286523&order=3&currentPage=1&append=0&content=1
The prefix https://rate.tmall.com/list_detail_rate.htm? is fixed.
itemId can be found in the URL of the product detail page. Below is the URL for the model 11170 toilet; the id=45492997665 in it is the itemId:
https://detail.tmall.com/item.htm?spm=a1z10.5-b-s.w4011-601288098.32.218191f88EYy6o&id=45492997665&rn=c00b3253858596ec80a7c4e9431e2848&abbucket=9
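Rather than eyeballing the query string, the id parameter can be pulled out with the standard library. A minimal sketch, using the detail URL quoted above:

```python
# Extract the itemId (the "id" query parameter) from a Tmall detail-page URL
from urllib.parse import urlparse, parse_qs

url = ('https://detail.tmall.com/item.htm?spm=a1z10.5-b-s.w4011-601288098.32.'
       '218191f88EYy6o&id=45492997665&rn=c00b3253858596ec80a7c4e9431e2848&abbucket=9')

# parse_qs maps each parameter name to a list of its values
query = parse_qs(urlparse(url).query)
item_id = query['id'][0]
print(item_id)  # 45492997665
```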
spuId is the shop ID and sellerId is the shop owner's ID; both can be found in the source of the product detail page above,
where the shopId field holds the spuId value.
currentPage is the current page number; stepping through its values retrieves the review images from every page.
The remaining parameters can be left at their defaults. The source code is as follows:
import requests
import json
import urllib.request

url_s = 'https://rate.tmall.com/list_detail_rate.htm?'
url_itemId = 'itemId=45492997665&'
url_spuId = 'spuId=64652363&'
url_sellerId = 'sellerId=667286523&'
url_order = 'order=3&'
url_append = 'append=0&'
count = 0
for pages in range(0, 99):
    url_currentPage = 'currentPage=' + str(pages + 1) + '&'
    url = url_s + url_itemId + url_spuId + url_sellerId + url_order + url_currentPage + url_append + 'content=1'
    req = requests.get(url)
    # The response is JSONP-wrapped; strip the prefix to get the JSON payload
    jsondata = req.text[15:]
    try:
        data = json.loads(jsondata)
    except json.JSONDecodeError:
        continue
    # Print the page number
    print('page:', data['paginator']['page'])
    # Walk the list of reviews and download every attached picture
    for i in data['rateList']:
        for url_image in i['pics']:
            # Left-pad the running counter to five digits for the file name
            if count < 9:
                name = '0000' + str(count + 1) + '.jpg'
            elif count < 99:
                name = '000' + str(count + 1) + '.jpg'
            elif count < 999:
                name = '00' + str(count + 1) + '.jpg'
            elif count < 9999:
                name = '0' + str(count + 1) + '.jpg'
            else:
                name = str(count + 1) + '.jpg'
            # Picture URLs are protocol-relative ("//img..."), so prepend a scheme
            conn = urllib.request.urlopen('http:' + url_image)
            with open(name, 'wb') as f:
                f.write(conn.read())
            count += 1
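The five-branch if/elif ladder above just left-pads the running counter to five digits for the file name; str.zfill does the same in one line:

```python
# Equivalent file naming with str.zfill: pad the counter to width 5
for count in (0, 8, 99, 12345):
    name = str(count + 1).zfill(5) + '.jpg'
    print(name)  # 00001.jpg, 00009.jpg, 00100.jpg, 12346.jpg
```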
Based on the URL analysis above, a Python program can fetch itemId, sellerId, and spuId automatically. That way, simply copying the product page's URL is enough to download its review images. The implementation is as follows:
import requests
import json
import urllib.request
import re

# Extract itemId, sellerId and shopId from the product detail page
def geturl(url_detail):
    req = requests.get(url_detail)
    jsondata = req.text[15:]
    info = re.search('itemId:"[0-9]*",sellerId:"[0-9]*",shopId:"[0-9]*"', jsondata)
    info = info.group(0)
    info = info.split(',')
    itemId = info[0].split(':')[1][1:-1]
    sellerId = info[1].split(':')[1][1:-1]
    shopId = info[2].split(':')[1][1:-1]
    return itemId, sellerId, shopId

# Download the comment images
def getImage(url_detail):
    url_s = 'https://rate.tmall.com/list_detail_rate.htm?'
    itemId, sellerId, shopId = geturl(url_detail)
    url_itemId = 'itemId=' + itemId + '&'
    # The page's shopId field supplies the spuId parameter
    url_spuId = 'spuId=' + shopId + '&'
    url_sellerId = 'sellerId=' + sellerId + '&'
    url_order = 'order=3&'
    url_append = 'append=0&'
    count = 0
    for pages in range(0, 99):
        url_currentPage = 'currentPage=' + str(pages + 1) + '&'
        url = url_s + url_itemId + url_spuId + url_sellerId + url_order + url_currentPage + url_append + 'content=1'
        req = requests.get(url)
        # Strip the JSONP wrapper before decoding
        jsondata = req.text[15:]
        try:
            data = json.loads(jsondata)
        except json.JSONDecodeError:
            continue
        print('page:', data['paginator']['page'])
        for i in data['rateList']:
            for url_image in i['pics']:
                # Left-pad the running counter to five digits for the file name
                if count < 9:
                    name = '0000' + str(count + 1) + '.jpg'
                elif count < 99:
                    name = '000' + str(count + 1) + '.jpg'
                elif count < 999:
                    name = '00' + str(count + 1) + '.jpg'
                elif count < 9999:
                    name = '0' + str(count + 1) + '.jpg'
                else:
                    name = str(count + 1) + '.jpg'
                # Picture URLs are protocol-relative, so prepend a scheme
                conn = urllib.request.urlopen('http:' + url_image)
                with open(name, 'wb') as f:
                    f.write(conn.read())
                count += 1

if __name__ == "__main__":
    url_detail = 'https://detail.tmall.com/item.htm?spm=a1z10.5-b-s.w4011-14601288098.32.218191f88EYy6o&id=45492997665&rn=c00b3253858596ec80a7c4e9431e2848&abbucket=9'
    getImage(url_detail)
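The parsing in geturl can be sanity-checked offline against a hand-written fragment shaped like the Tmall detail-page source. The fragment below is a made-up illustration, not taken from a live page; note that the third field (index 2) of the match is the shopId:

```python
# Verify the regex and string slicing used by geturl on a synthetic fragment
import re

page = 'var cfg = {itemId:"45492997665",sellerId:"667286523",shopId:"64652363"};'
m = re.search('itemId:"[0-9]*",sellerId:"[0-9]*",shopId:"[0-9]*"', page)
fields = m.group(0).split(',')
# split on ':' isolates the quoted value; [1:-1] strips the quotes
itemId = fields[0].split(':')[1][1:-1]
sellerId = fields[1].split(':')[1][1:-1]
shopId = fields[2].split(':')[1][1:-1]
print(itemId, sellerId, shopId)  # 45492997665 667286523 64652363
```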