A Simple Analysis of Scraping JD.com Comments

1. Define a function that fetches all the comments

import random
import re

import requests

# Pool of User-Agent strings to rotate through (placeholder value; fill in your own)
ua = ["Mozilla/5.0"]


def get_comment(url):
    """
    Fetch and print every comment behind a comment URL, page by page.
    """
    i = 0
    # Keep requesting pages until the regex finds no more comments
    while True:
        # Point the page parameter of the URL at the current page
        page_url = re.sub(r"([?&]page=)\d+", r"\g<1>" + str(i), url)
        headers = {"User-Agent": random.choice(ua)}
        response = requests.get(page_url, headers=headers)
        # List of comments on this page
        comment_list = re.compile(r'"content":"(.*?)"').findall(response.text)
        # Stop condition: a page with no comments
        if len(comment_list) == 0:
            break
        for comment in set(comment_list):
            # Print the comment
            print(comment)
        i += 1
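
The regex above cuts a comment short as soon as the text contains an escaped quote. As an alternative, here is a minimal sketch that parses the response as JSON instead. It is not what the function above does: the helper name extract_comments and the sample payload are made up for illustration, and the only thing assumed about the real response is what the regex already relies on, namely that comment text sits under "content" keys inside a callback wrapper.

```python
import json
import re


def extract_comments(jsonp_text):
    """Return every string stored under a "content" key in a JSONP payload."""
    # Strip the callback wrapper, e.g. fetchJSON_comment98vv6({...});
    match = re.search(r"^[^(]*\((.*)\)\s*;?\s*$", jsonp_text, re.S)
    payload = json.loads(match.group(1) if match else jsonp_text)

    found = []

    def walk(node):
        # Visit dicts and lists recursively, collecting "content" strings.
        if isinstance(node, dict):
            for key, value in node.items():
                if key == "content" and isinstance(value, str):
                    found.append(value)
                else:
                    walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(payload)
    return found


# Made-up sample payload for illustration, not a real JD response.
sample = 'fetchJSON_comment98vv6({"comments":[{"content":"fits \\"well\\""},{"content":"good"}]});'
print(extract_comments(sample))
```

Because the helper walks the whole structure, it does not depend on exactly where the "content" fields are nested.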

2. First, build the URL for your search keyword:

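# keyword is your search term (URL-encoded). Most product categories have about 100 pages, and JD's page parameter takes odd values (1, 3, 5, ...)
for j in range(1, 200, 2):
    url = "https://search.jd.com/Search?keyword=%E7%94%B7%E8%A1%A3&enc=utf-8&page=" + str(j)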
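
If you want to search for something else, the keyword has to be percent-encoded the same way. Below is a small Python 3 sketch of how that could be done with the standard library. The function names build_search_url and first_half_pages are made up for this example, and it assumes the search endpoint accepts only the three query parameters used above.

```python
from urllib.parse import urlencode


def build_search_url(keyword, page):
    """Build a JD search URL for one page of results."""
    # urlencode percent-encodes the keyword as UTF-8
    query = urlencode({"keyword": keyword, "enc": "utf-8", "page": page})
    return "https://search.jd.com/Search?" + query


def first_half_pages(total_pages):
    """Odd page numbers 1, 3, 5, ..., one per visible results page."""
    return [2 * n + 1 for n in range(total_pages)]


# Same keyword as in the URL above
print(build_search_url("男衣", 1))
print(first_half_pages(3))
```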

3. Requesting this URL returns the first 28 items of the product list. The page has 60 items in total, and there are also four ads:

headers = {"User-Agent": random.choice(ua)}
res = requests.get(url, headers=headers)
# Extract the product id fields with a regex
id_list = re.compile(r'J_AD_(\d+)').findall(res.text)
# print(len(id_list))
# List that collects all the ids
str_id = []
for product_id in id_list:
    # URL of the product detail page
    detail = "http://item.jd.com/" + str(product_id) + ".html"
    # Add the id to the list
    str_id.append(product_id)
    # URL for fetching this product's comments
    comment_url = "http://club.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98vv6&productId=" + str(product_id) + "&score=0&sortType=5&page=0&pageSize=10&isShadowSku=0&fold=1"
    # Call the comment-fetching function
    get_comment(comment_url)
# Join the ids with commas
str_id = ",".join(str_id)
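
To see what the id extraction does without hitting the site, here is a small offline sketch. The HTML fragment is made up (real JD markup may look different), and dropping repeated ids is an extra precaution I am assuming is useful, not something the code above does.

```python
import re

AD_ID_PATTERN = re.compile(r"J_AD_(\d+)")


def extract_ids(html):
    """Return product ids in page order, dropping repeats."""
    seen = set()
    ordered = []
    for product_id in AD_ID_PATTERN.findall(html):
        if product_id not in seen:
            seen.add(product_id)
            ordered.append(product_id)
    return ordered


# Made-up HTML fragment for illustration only.
sample_html = '<strong class="J_AD_111"></strong><em class="J_AD_222"></em><i class="J_AD_111"></i>'
ids = extract_ids(sample_html)
print(ids)
print(",".join(ids))
```

The joined string is the comma-separated id list described in the last line of the code above.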

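4. The other 28 items are loaded dynamically: they only appear when you scroll down the page. Loading them requires the ids from the first half of the page plus the paging information: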

# URL the page requests when it loads the rest; the trailing str_id is all the ids collected above, joined with commas
url2 = "https://search.jd.com/s_new.php?keyword=%E7%94%B7%E8%A1%A3&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=%E7%94%B7%E8%A1%A3&page=" + str(j + 1) + "&s=25&scrolling=y&log_id=1504059001.28625&tpl=3_L&show_items=" + str_id
# Referer value for the request headers, found by inspecting the page's requests
headers_page = "https://search.jd.com/Search?keyword=%E7%94%B7%E8%A1%A3&enc=utf-8&page=" + str(j) + "&s=1"
# The request headers must include a Referer field pointing at the previous page; its page value is the page number of the main request
headers_next = {"User-Agent": random.choice(ua),
                "Referer": headers_page}

# Fetch the remaining items
# Send the request
res1 = requests.get(url2, headers=headers_next)
# Extract the list of ids with a regex
id_list2 = re.compile(r'J_AD_(\d+)').findall(res1.text)
for id2 in id_list2:
    # URL of the product detail page
    detail_url = "http://item.jd.com/" + str(id2) + ".html"
    # URL for the comments
    comment_url = "http://club.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98vv6&productId=" + str(id2) + "&score=0&sortType=5&page=0&pageSize=10&isShadowSku=0&fold=1"
    # Call the comment function
    get_comment(comment_url)
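
The URL and the Referer header have to agree on the page number, so it can help to build them in one place. Here is a rough sketch of that. The function name build_lazy_load_request is made up for this example, and the fixed query parameters (s=25, log_id and so on) are copied verbatim from the URL above; whether the site still accepts them is an assumption.

```python
def build_lazy_load_request(keyword_quoted, page, ids, user_agent):
    """Return (url, headers) for the second half of a results page."""
    # The second half is requested as page + 1, with the first-half ids attached
    url = ("https://search.jd.com/s_new.php?keyword=" + keyword_quoted
           + "&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=" + keyword_quoted
           + "&page=" + str(page + 1)
           + "&s=25&scrolling=y&log_id=1504059001.28625&tpl=3_L&show_items="
           + ",".join(ids))
    # The Referer points back at the first-half page itself
    referer = ("https://search.jd.com/Search?keyword=" + keyword_quoted
               + "&enc=utf-8&page=" + str(page) + "&s=1")
    return url, {"User-Agent": user_agent, "Referer": referer}


# Placeholder ids and User-Agent, for illustration only.
url2, headers_next = build_lazy_load_request(
    "%E7%94%B7%E8%A1%A3", 1, ["111", "222"], "test-agent")
print(url2)
print(headers_next["Referer"])
```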
