Previously, whether I was learning from videos or following along with deep-learning tutorials, the datasets were always ready-made. This time, building a toilet-recognition program from scratch, I need to collect the data myself.
Fortunately, the growth of the internet, and of e-commerce sites in particular, has accumulated plenty of product data.
With Python we can scrape the review photos of a given brand of toilet from Tmall and JD.com and use them as a dataset.
A Baidu search turns up plenty of code for scraping review images; the Tmall scraper in this article is based on the code from the following post: Python Crawler (6): Fetching Tmall Product Review Information.
The tricky part of fetching review images is constructing the right URL. The linked post gives the details; a summary follows.
Here is an example of such a URL:
https://rate.tmall.com/list_detail_rate.htm?itemId=45492997665&spuId=64652363&sellerId=667286523&order=3&currentPage=1&append=0&content=1
The prefix https://rate.tmall.com/list_detail_rate.htm? is fixed.
itemId can be found in the URL of the product detail page. Below is the URL for the model 11170 toilet; the id=45492997665 in it is the itemId:
https://detail.tmall.com/item.htm?spm=a1z10.5-b-s.w4011-601288098.32.218191f88EYy6o&id=45492997665&rn=c00b3253858596ec80a7c4e9431e2848&abbucket=9
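Rather than eyeballing the query string, the id parameter can be pulled out with the standard library. A minimal sketch, using the detail URL quoted above:

```python
# Extract the itemId (the "id" query parameter) from a Tmall detail-page URL
from urllib.parse import urlparse, parse_qs

url = ('https://detail.tmall.com/item.htm?spm=a1z10.5-b-s.w4011-601288098.32.'
       '218191f88EYy6o&id=45492997665&rn=c00b3253858596ec80a7c4e9431e2848&abbucket=9')

# parse_qs maps each parameter name to a list of its values
query = parse_qs(urlparse(url).query)
item_id = query['id'][0]
print(item_id)  # 45492997665
```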
spuId is the shop ID and sellerId is the shop owner's ID; both can be found in the source of the product detail page above,
where the shopId field holds the spuId value.
currentPage is the current page number; stepping through its values retrieves the review images from every page.
The remaining parameters can be left at their defaults. The source code is as follows:
import requests
import json
import urllib.request

url_s = 'https://rate.tmall.com/list_detail_rate.htm?'
url_itemId = 'itemId=45492997665&'
url_spuId = 'spuId=64652363&'
url_sellerId = 'sellerId=667286523&'
url_order = 'order=3&'
url_append = 'append=0&'
count = 0
for pages in range(0, 99):
    url_currentPage = 'currentPage=' + str(pages + 1) + '&'
    url = url_s + url_itemId + url_spuId + url_sellerId + url_order + url_currentPage + url_append + 'content=1'
    req = requests.get(url)
    # The response is JSONP-wrapped; strip the prefix to get the JSON payload
    jsondata = req.text[15:]
    try:
        data = json.loads(jsondata)
    except json.JSONDecodeError:
        continue
    # Print the page number
    print('page:', data['paginator']['page'])
    # Walk the list of reviews and download every attached picture
    for i in data['rateList']:
        for url_image in i['pics']:
            # Left-pad the running counter to five digits for the file name
            if count < 9:
                name = '0000' + str(count + 1) + '.jpg'
            elif count < 99:
                name = '000' + str(count + 1) + '.jpg'
            elif count < 999:
                name = '00' + str(count + 1) + '.jpg'
            elif count < 9999:
                name = '0' + str(count + 1) + '.jpg'
            else:
                name = str(count + 1) + '.jpg'
            # Picture URLs are protocol-relative ("//img..."), so prepend a scheme
            conn = urllib.request.urlopen('http:' + url_image)
            with open(name, 'wb') as f:
                f.write(conn.read())
            count += 1
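The five-branch if/elif ladder above just left-pads the running counter to five digits for the file name; str.zfill does the same in one line:

```python
# Equivalent file naming with str.zfill: pad the counter to width 5
for count in (0, 8, 99, 12345):
    name = str(count + 1).zfill(5) + '.jpg'
    print(name)  # 00001.jpg, 00009.jpg, 00100.jpg, 12346.jpg
```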
Based on the URL analysis above, a Python program can fetch itemId, sellerId, and spuId automatically. That way, simply copying the product page's URL is enough to download its review images. The implementation is as follows:
import requests
import json
import urllib.request
import re

# Extract itemId, sellerId and shopId from the product detail page
def geturl(url_detail):
    req = requests.get(url_detail)
    jsondata = req.text[15:]
    info = re.search('itemId:"[0-9]*",sellerId:"[0-9]*",shopId:"[0-9]*"', jsondata)
    info = info.group(0)
    info = info.split(',')
    itemId = info[0].split(':')[1][1:-1]
    sellerId = info[1].split(':')[1][1:-1]
    shopId = info[2].split(':')[1][1:-1]
    return itemId, sellerId, shopId

# Download the comment images
def getImage(url_detail):
    url_s = 'https://rate.tmall.com/list_detail_rate.htm?'
    itemId, sellerId, shopId = geturl(url_detail)
    url_itemId = 'itemId=' + itemId + '&'
    # The page's shopId field supplies the spuId parameter
    url_spuId = 'spuId=' + shopId + '&'
    url_sellerId = 'sellerId=' + sellerId + '&'
    url_order = 'order=3&'
    url_append = 'append=0&'
    count = 0
    for pages in range(0, 99):
        url_currentPage = 'currentPage=' + str(pages + 1) + '&'
        url = url_s + url_itemId + url_spuId + url_sellerId + url_order + url_currentPage + url_append + 'content=1'
        req = requests.get(url)
        # Strip the JSONP wrapper before decoding
        jsondata = req.text[15:]
        try:
            data = json.loads(jsondata)
        except json.JSONDecodeError:
            continue
        print('page:', data['paginator']['page'])
        for i in data['rateList']:
            for url_image in i['pics']:
                # Left-pad the running counter to five digits for the file name
                if count < 9:
                    name = '0000' + str(count + 1) + '.jpg'
                elif count < 99:
                    name = '000' + str(count + 1) + '.jpg'
                elif count < 999:
                    name = '00' + str(count + 1) + '.jpg'
                elif count < 9999:
                    name = '0' + str(count + 1) + '.jpg'
                else:
                    name = str(count + 1) + '.jpg'
                # Picture URLs are protocol-relative, so prepend a scheme
                conn = urllib.request.urlopen('http:' + url_image)
                with open(name, 'wb') as f:
                    f.write(conn.read())
                count += 1

if __name__ == "__main__":
    url_detail = 'https://detail.tmall.com/item.htm?spm=a1z10.5-b-s.w4011-14601288098.32.218191f88EYy6o&id=45492997665&rn=c00b3253858596ec80a7c4e9431e2848&abbucket=9'
    getImage(url_detail)
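The parsing in geturl can be sanity-checked offline against a hand-written fragment shaped like the Tmall detail-page source. The fragment below is a made-up illustration, not taken from a live page; note that the third field (index 2) of the match is the shopId:

```python
# Verify the regex and string slicing used by geturl on a synthetic fragment
import re

page = 'var cfg = {itemId:"45492997665",sellerId:"667286523",shopId:"64652363"};'
m = re.search('itemId:"[0-9]*",sellerId:"[0-9]*",shopId:"[0-9]*"', page)
fields = m.group(0).split(',')
# split on ':' isolates the quoted value; [1:-1] strips the quotes
itemId = fields[0].split(':')[1][1:-1]
sellerId = fields[1].split(':')[1][1:-1]
shopId = fields[2].split(':')[1][1:-1]
print(itemId, sellerId, shopId)  # 45492997665 667286523 64652363
```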