Scraping JD.com Review Images by Keyword

Original

笨小孩哈哈

2020-07-03 21:05

Scraping JD Mall Review Images by Keyword

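Disclaimer: the techniques and code in this article are for learning and discussion only. Do not redistribute them or crawl the site frequently.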

Analysis

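First, open the JD.com home page and type a keyword such as "三明治" (sandwich) into the search box. Press F12 to open Chrome DevTools and select the Network panel. The filter bar above the captured requests offers categories such as All and XHR: All lists every request, while XHR lists only asynchronous requests. Most sites load the bulk of their important data through asynchronous requests, but JD's search request is not one of them, so look under All. Press Enter to send the search request, then go through the captured requests and pick out the one we need. This step matters. The request I found is the "Search" request to search.jd.com.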

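This request returns the product listing for the "sandwich" category. Note that it is protected by Referer-based anti-hotlinking, so send the Referer together with the other request headers. Then parse the returned page to obtain the product IDs.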
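
As a small illustration of the parsing step, the sketch below pulls the data-sku attribute from every li element whose class is "gl-item", which is the same selection the article's XPath makes. It uses only the standard library instead of lxml, and SAMPLE_HTML, SkuCollector and extract_skus are hypothetical names and data invented for this example; the real search page is far larger and its markup may differ.

```python
from html.parser import HTMLParser


class SkuCollector(HTMLParser):
    """Collects the data-sku attribute of every <li class="gl-item"> element."""

    def __init__(self):
        super().__init__()
        self.skus = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # Exact class match, mirroring //li[@class='gl-item']/@data-sku
        if tag == "li" and attributes.get("class") == "gl-item" and "data-sku" in attributes:
            self.skus.append(attributes["data-sku"])


def extract_skus(html_text):
    """Return the product IDs (SKUs) found in the given HTML text."""
    parser = SkuCollector()
    parser.feed(html_text)
    parser.close()
    return parser.skus


# Hypothetical, heavily simplified stand-in for a search result page.
SAMPLE_HTML = """
<ul>
  <li class="gl-item" data-sku="1281063"><a href="#">sandwich A</a></li>
  <li class="gl-item" data-sku="100000001"><a href="#">sandwich B</a></li>
  <li class="other" data-sku="999">not a product</li>
</ul>
"""

print(extract_skus(SAMPLE_HTML))
```

The third list item is skipped because its class is not "gl-item", so only real product entries contribute an ID.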

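Next, click a product to open its detail page, open Chrome DevTools again and select the Network panel. In the product reviews section further down the page, select "曬圖" (photo reviews).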

Click the arrow to the right of the images to page through them, and watch how the request list changes.

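After moving one page, a new request appears in the request list on the left: "https://club.jd.com/discussion/getProductPageImageCommentList.action?productId=1281063&isShadowSku=0&callback=jQuery9833581&page=3&pageSize=10&_=1569503354670". The "productId" parameter should look familiar: it is the product ID we just scraped. "page" is clearly the page number, and "pageSize" is the number of items per page. The remaining parameters are not needed and can be dropped.
Now select the Preview panel: the JSON returned by this request contains the image URLs we are after.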
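
To make the parameter discussion concrete, here is a minimal sketch that reduces the captured URL to the three parameters named above and builds the same URL from scratch. The function names are hypothetical, and whether the server still answers without "callback" and the other parameters is taken from the article's observation rather than verified here.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

COMMENT_IMAGE_ENDPOINT = (
    "https://club.jd.com/discussion/getProductPageImageCommentList.action"
)


def build_comment_image_url(product_id, page, page_size=10):
    """Build the request URL from only the parameters that are needed."""
    query = urlencode({"productId": product_id, "page": page, "pageSize": page_size})
    return f"{COMMENT_IMAGE_ENDPOINT}?{query}"


def keep_needed_params(url, keep=("productId", "page", "pageSize")):
    """Drop every query parameter of a captured URL that is not in `keep`."""
    parts = urlsplit(url)
    pairs = [(name, value) for name, value in parse_qsl(parts.query) if name in keep]
    return parts._replace(query=urlencode(pairs)).geturl()


# The URL captured in DevTools, as quoted in the article.
captured = (
    "https://club.jd.com/discussion/getProductPageImageCommentList.action"
    "?productId=1281063&isShadowSku=0&callback=jQuery9833581"
    "&page=3&pageSize=10&_=1569503354670"
)

print(keep_needed_params(captured))
print(build_comment_image_url("1281063", 3))
```

Both calls print the same URL, which shows that the trimmed request can be rebuilt from a product ID and a page number alone.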

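So all we need to do is build this request from a productId to get the review images we want. We control page (the page number) ourselves, and pageSize stays fixed at 10.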
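
The extraction of image URLs from the returned JSON can be sketched as follows. The code further down uses the jsonpath expression "$..imageUrl"; this standard-library version does a similar recursive search. SAMPLE_RESPONSE is a hypothetical payload: only the "imageUrl" key and the protocol-relative form of its value come from the article's code, and every other field name and the host are invented for illustration.

```python
import json


def collect_values(node, key):
    """Recursively collect every value stored under `key` in nested dicts and lists."""
    found = []
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                found.append(value)
            found.extend(collect_values(value, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_values(item, key))
    return found


# Hypothetical response body; the real structure may differ.
SAMPLE_RESPONSE = json.loads("""
{"imgComments": {"imgList": [
    {"imageId": 1, "imageUrl": "//img.example.com/a.jpg"},
    {"imageId": 2, "imageUrl": "//img.example.com/b.jpg"}
]}}
""")

# The URLs are protocol-relative, so the scheme is prepended before downloading.
image_urls = ["https:" + url for url in collect_values(SAMPLE_RESPONSE, "imageUrl")]
print(image_urls)
```

Searching by key rather than by a fixed path keeps the extraction working even if the nesting around the image list changes.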
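
Summary

The whole crawl consists of two steps:
1. Send a search request with the keyword to obtain the product IDs.
2. Build the request from each product ID to obtain the JSON that contains the review image URLs.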

Code

The complete code is below. Writing this up took some effort, so a like would be appreciated!

import json
import os
import urllib.request

import jsonpath
import requests
from lxml import etree

USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36')


def getProductIdsByKeyword(keyword):
    """Search JD for the keyword and return the product IDs on the result page."""
    # authority/method/scheme/path are HTTP/2 pseudo-headers shown by DevTools,
    # not real request headers, so they are not sent here.
    header = {
        'User-Agent': USER_AGENT,
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        # 'br' is left out: requests only decodes Brotli when the brotli package is installed
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Upgrade-Insecure-Requests': '1',
        # The search request is protected by Referer-based anti-hotlinking
        'Referer': 'https://www.jd.com/',
    }
    # Passing params lets requests URL-encode the (Chinese) keyword
    response = requests.get('https://search.jd.com/Search',
                            params={'keyword': keyword, 'enc': 'utf-8'},
                            headers=header, timeout=10)
    response.encoding = 'utf-8'
    html = etree.HTML(response.text)
    # Each product on the result page is an <li class="gl-item" data-sku="...">
    return html.xpath("//li[@class='gl-item']/@data-sku")


def getJdCommentsImage(startPage, endPage, productId, path):
    """Download the review images on pages startPage..endPage (10 per page) of one product."""
    header = {
        'User-Agent': USER_AGENT,
        'Accept': '*/*',
        'Accept-Encoding': 'gzip, deflate',
        'Referer': 'https://item.jd.com/' + productId + '.html',
    }
    # Silence the warning that verify=False would otherwise trigger on every request
    requests.packages.urllib3.disable_warnings()
    for num in range(startPage, endPage + 1):
        url = ('https://club.jd.com/discussion/getProductPageImageCommentList.action?productId='
               + productId + '&page=' + str(num) + '&pageSize=10')
        # url_2 = 'https://club.jd.com/comment/skuProductPageComments.action?productId='+productId+'&page=' + str( num) + '&pageSize=10'
        # The request observed in DevTools is a GET
        response = requests.get(url, headers=header, verify=False, timeout=10)
        jsonObjs = json.loads(response.text)
        # jsonpath returns False when nothing matches, i.e. there are no more images
        images = jsonpath.jsonpath(jsonObjs, '$..imageUrl')
        if not images:
            break
        for i, image_url in enumerate(images, start=1):
            index = (num - 1) * 10 + i  # running image number across pages
            print('*' * 10 + 'Downloading image ' + str(index) + '*' * 10)
            try:
                # Image URLs in the JSON are protocol-relative ("//...")
                res = urllib.request.urlopen('https:' + image_url, timeout=5).read()
                with open(path + productId + '_' + str(index) + '.jpg', 'wb') as file:
                    file.write(res)
            except Exception as e:
                print('Image ' + str(index) + ' failed to download, error:')
                print(' ' * 10 + str(e))
                print('')

    print('*' * 15 + 'Download finished' + '*' * 15)


if __name__ == '__main__':
    keywords = ['三明治']  # put the search keywords here ("三明治" means sandwich)
    for keyword in keywords:
        productids = getProductIdsByKeyword(keyword)
        # print(len(productids))
        path = 'd:/download/京東買家秀/' + keyword + '/'
        os.makedirs(path, exist_ok=True)
        # First 30 products at most; slicing avoids an IndexError on shorter result lists
        for productId in productids[:30]:
            try:
                getJdCommentsImage(1, 30, productId, path)  # pages 1 to 30, 10 images per page
            except Exception as e:
                print(str(e))
                continue
