Scraping JD Product Comments with Python (Part 3)

This post wraps up the previous two and solves the last remaining problem.

In the first post, I covered the basics of writing a crawler in Python: how to analyze the page and its JSON responses, and implemented the basic comment-scraping functionality as the initial version. Link: Scraping JD Product Comments with Python (Part 1)

In the second post, I revised that program heavily, improving the amount of data collected and the degree of automation, and reorganized the code; the result was fairly complete and stable. Link: Scraping JD Product Comments with Python (Part 2)

This post solves a problem left over from Part 2: automatically obtaining the product codes of a phone's different color variants. In the previous version, every product code had to be typed in by hand before the corresponding comments could be scraped. This post automates that step: enter the product's JD page URL once, and the program extracts the codes of the other color variants and scrapes their comments automatically.

Note that this only covers one fixed storage variant at a time: it extracts all of the 64GB color variants, or all of the 128GB ones, but not both at once.

Page analysis:

I won't repeat how to inspect the page here. In the product page's HTML, the product codes sit inside a div block with the id "choose-attr-1". So we first extract that div, and then use a regular expression to pull the product codes out of it.

So we first fetch the page's HTML:

import urllib.request

# Ask for the JD product-page URL and download its HTML
url = input("請輸入網址:")
Req = urllib.request.Request(url=url)
content = urllib.request.urlopen(Req)
content = content.read().decode('utf-8')
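One caveat: JD may serve a login or verification page to clients that do not look like a browser, so if the fetch above returns unexpected HTML, it can help to send browser-like headers (a minimal sketch, reusing the header dict defined in the full program further down):

# Hedged variant of the fetch above, assuming JD rejects bare requests;
# 'header' is the headers dict defined in the full program below
Req = urllib.request.Request(url=url, headers=header)
content = urllib.request.urlopen(Req).read().decode('utf-8')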

The BeautifulSoup library:

This library parses HTML and extracts information from it. The program only uses its find_all function, so I won't go through the rest of its API here. We first use it to extract the div whose id is "choose-attr-1":

from bs4 import BeautifulSoup

soup = BeautifulSoup(content, 'html.parser')    # parse the page HTML
choose = soup.find_all(id='choose-attr-1')      # every div with this id

find_all returns a ResultSet, a list-like collection whose elements are the matching "choose-attr-1" div blocks, and each element supports find_all itself, so it can be searched again.
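A quick self-contained illustration of that nesting, using a throwaway HTML snippet invented purely for demonstration:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<div id="a"><p class="x">1</p><p class="x">2</p></div>', 'html.parser')
blocks = soup.find_all(id='a')             # ResultSet: behaves like a list of Tag objects
print(len(blocks))                         # 1
print(blocks[0].find_all(class_='x'))      # each element supports find_all itself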

From the page analysis above, we saw that each product code is stored in a div with the class "item", so we call find_all once more:

for i in choose:
    j = i.find_all(class_='item')    # the divs that hold the product codes

At this point, j holds all the div blocks that contain product codes.

Regular expressions:

Regular expressions can likewise extract information from text. Once we have the HTML of the divs holding the product codes, one simple regular expression pulls out each code, which we append to a list:

import re

id_list = []
web_regex = re.compile(r"[0-9]+")                # match runs of digits
for i in choose:
    j = i.find_all(class_='item')
    for one in j:
        text = web_regex.findall(str(one))       # every digit run in the div's HTML
        id_list.append(text[0])                  # the first run is the product code
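Putting the pieces together, here is a self-contained sketch of the whole extraction against a hypothetical fragment of the product page (the markup, the SKU numbers, and the data-sku attribute are all invented for illustration; the real page differs in detail):

import re
from bs4 import BeautifulSoup

# Invented markup mimicking the "choose-attr-1" structure described above
html = '''
<div id="choose-attr-1">
  <div class="item" data-sku="100009177424">黑色</div>
  <div class="item" data-sku="100009177440">白色</div>
</div>
'''
id_list = []
web_regex = re.compile(r"[0-9]+")
soup = BeautifulSoup(html, 'html.parser')
for block in soup.find_all(id='choose-attr-1'):
    for item in block.find_all(class_='item'):
        digits = web_regex.findall(str(item))    # all digit runs in the div's HTML
        id_list.append(digits[0])                # the first run is the product code
print(id_list)    # ['100009177424', '100009177440']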

Program code:

After merging this code into the program from Part 2, the full program looks like this:

import urllib.request
from bs4 import BeautifulSoup
import re
import json
import random
import sys
import xlwt
import pandas as pd

start_url = "https://club.jd.com/comment/skuProductPageComments.action?callback=fetchJSON_comment98&productId={}&score={}&sortType=5&page={}&pageSize=10&isShadowSku=0&fold=1"
id_list = []
user_agents = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60',
    'Opera/8.0 (Windows NT 5.1; U; en)',
    'Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
    'Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2 ',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)',
    'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0) ',
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52"
]
header = {
    'User-Agent': random.choice(user_agents),    # the header name is 'User-Agent', not 'User_Agent'
    'Referer': 'https://item.jd.com/100009177424.html'
}

def get_id_list():
    """Ask for a JD product-page URL and collect the product codes of all color variants."""
    url = input("請輸入網址:")
    Req = urllib.request.Request(url=url)
    content = urllib.request.urlopen(Req)
    content = content.read().decode('utf-8')
    soup = BeautifulSoup(content, 'html.parser')
    choose = soup.find_all(id='choose-attr-1')
    web_regex = re.compile(r"[0-9]+")
    for i in choose:
        j = i.find_all(class_='item')
        for one in j:
            text = web_regex.findall(str(one))
            id_list.append(text[0])    # first digit run in the div is the product code

def start_deal(start_url) -> int:
    """Scrape the first comment page and return the total page count (maxPage)."""
    req = urllib.request.Request(url=start_url, headers=header)
    content = urllib.request.urlopen(req)
    if content.getcode() != 200:
        print("爬取起始頁失敗!")
        sys.exit(0)
    content = content.read().decode('gbk', 'replace')    # the comment API responds in GBK
    content = content.strip("fetchJSON_comment98();")    # peel off the JSONP wrapper
    text = json.loads(content)
    max_page = text['maxPage']
    comment = text['comments']
    fp = open('京東.txt', 'a', encoding='utf-8')
    for i in comment:
        text = str(i['content']).replace('\n', ' ')      # one comment per line
        fp.write(text + '\n')
    print('起始頁完成!')
    fp.close()
    return max_page


def deal(id, score, max_page):
    """Scrape the remaining comment pages for one product code and one score level."""
    for i in range(1, max_page):
        url = start_url.format(id, str(score), str(i))
        req = urllib.request.Request(url=url, headers=header)
        content = urllib.request.urlopen(req)
        if content.getcode() != 200:
            print('爬取失敗!')
            sys.exit(0)
        content = content.read().decode('gbk', 'replace')
        content = content.strip('fetchJSON_comment98();')    # peel off the JSONP wrapper
        text = json.loads(content)
        comment = text['comments']
        fp = open('京東.txt', 'a', encoding='utf-8')
        for j in comment:
            text = str(j['content']).replace('\n', ' ')
            fp.write(text + '\n')
        print('第%s頁完成!' % (i + 1))
        fp.close()


def book_save():
    """Dump the comments collected in 京東.txt to an Excel sheet and a CSV file."""
    book = xlwt.Workbook(encoding='utf-8', style_compression=0)
    sheet = book.add_sheet('京東商品評論', cell_overwrite_ok=True)
    sheet.write(0, 0, "內容")
    n = 1
    with open('京東.txt', 'r', encoding='utf-8') as f:
        for i in f.readlines():
            sheet.write(n, 0, i)
            n += 1
    # xlwt can only write the legacy .xls format, so save with a matching extension
    book.save('京東.xls')
    data_xls = pd.read_excel('京東.xls')
    data_xls.to_csv('京東.csv', encoding='utf-8')



if __name__ == '__main__':
    get_id_list()
    for i in range(3, 0, -1):                    # each score level: 3, 2, 1
        for j in id_list:                        # each color variant's product code
            getpage_url = start_url.format(j, i, '0')
            page = start_deal(getpage_url)       # the first page also yields maxPage
            deal(j, i, page)                     # scrape the remaining pages
    book_save()                                  # write the Excel/CSV once at the end
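One detail of the program worth spelling out: the comment API returns JSONP, i.e. the JSON payload arrives wrapped in fetchJSON_comment98(...);. The program removes that wrapper with str.strip, which deletes characters from both ends for as long as they belong to the given character set; this happens to work because the JSON itself begins with { and ends with }, neither of which is in the set. A minimal sketch with an invented response body:

import json

raw = 'fetchJSON_comment98({"maxPage": 42, "comments": []});'    # invented response body
clean = raw.strip('fetchJSON_comment98();')    # strip wrapper characters from both ends
print(clean)                                   # {"maxPage": 42, "comments": []}
print(json.loads(clean)['maxPage'])            # 42

A more defensive alternative would be to slice between the first '{' and the last '}', e.g. content[content.find('{'):content.rfind('}') + 1].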

 
