I scraped Venom reviews a while back, and since then a lot of people have asked me for the source code. My original code was lost, so I rewrote it and am sharing it here; I hope it helps 😀
We will scrape the Douban comments page for the movie 惡人轉: https://movie.douban.com/subject/30211551/comments?start=0&limit=20&sort=new_score&status=P
1. Analyzing the page
You can see from the URL that start=0&limit=20 marks the first page, with 20 comments per page.
start=20 gives the second page, and so on, so we can construct the URL for any page.
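As a quick check of that pagination scheme, here is a minimal sketch (mine, not from the original post) that prints the URLs for the first three pages:
base = 'https://movie.douban.com/subject/30211551/comments?start={}&limit=20&sort=new_score&status=P'
for start in range(0, 60, 20):  # first three pages: start = 0, 20, 40
    print(base.format(start))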
2. Saving all the pages
Here is the download-and-save code; read through it yourselves:
import requests


def gethtml(url):
    """Download one comments page and return its HTML as text."""
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                             'AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/64.0.3282.140 Safari/537.36 Edge/18.17763'}
    resp = requests.get(url, headers=headers, timeout=3)
    resp.encoding = 'utf-8'
    return resp.text


def savehtml(file_name, html):
    """Write the page to <file_name>.html for offline parsing."""
    with open(file_name.replace('/', '_') + ".html", "wb") as f:
        # The file is opened in binary mode, so encode the str to bytes.
        f.write(html.encode('utf-8'))


# Fetch the first 10 pages: 200 comments, 20 per page.
start = 0
while start < 200:
    y = str(start)
    url = ('https://movie.douban.com/subject/30211551/comments?start=' + y
           + '&limit=20&sort=new_score&status=P')
    savehtml('data' + y, gethtml(url))
    start = start + 20
The saved files appear as above. We store the pages locally precisely so that repeated parsing runs do not hammer the site and get us banned.
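If you want to be even gentler on the site, you can pause between requests. Here is a throttled version of the download loop (my addition, reusing gethtml and savehtml from above; the 1-3 second delay is an arbitrary choice):
import random
import time

for start in range(0, 200, 20):
    y = str(start)
    url = ('https://movie.douban.com/subject/30211551/comments?start=' + y
           + '&limit=20&sort=new_score&status=P')
    savehtml('data' + y, gethtml(url))
    time.sleep(random.uniform(1, 3))  # random 1-3 s pause between pages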
3. Processing the data
from bs4 import BeautifulSoup

allname = []        # commenter names
alltime = []        # comment timestamps
alldatapeople = []  # "useful" vote counts
alltalk = []        # comment text

num = 0
while num < 200:
    y = str(num)
    with open('data' + y + '.html', "r", encoding='utf-8') as f:
        soup = BeautifulSoup(f.read(), 'html.parser')
    data = []
    for item in soup.find_all('div', class_="comment-item"):
        data.append(item.text.replace("\n", ''))
    for i in data:
        # The splits below assume each flattened comment reads roughly
        # "<votes>有用<name>看過 <time> <text>", so we cut on the literal
        # markers 有用 / 看過 and on the spaces between the fields.
        alldatapeople.append(i.split(" ")[0].split('有用')[0])
        allname.append(i.split(" ")[0].split('有用')[1].split('看過')[0])
        alltime.append(i.split(" ")[1].replace(' ', ''))
        alltalk.append(i.split(" ")[2])
    num = num + 20

print(alldatapeople)
print(allname)
print(alltime)
print(alltalk)
The code above extracts the 200 comments along with their vote counts, timestamps, and commenter names, and prints them at the end.
Data visualization and data mining on top of this data are left for you to do yourselves; I will not go into detail here.
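As one small example (a sketch of mine, assuming each entry in alltime begins with a YYYY-MM-DD date), you could count how many comments fall on each day:
from collections import Counter

date_counts = Counter(t[:10] for t in alltime)  # group by the date prefix
for date, n in sorted(date_counts.items()):
    print(date, n)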
4. Getting more data
We already have the basic data for visualization and mining, but we can still gather more, for example each commenter's place of residence. The code is below:
import csv

import requests
from bs4 import BeautifulSoup


def get_profile_urls():
    """Collect every commenter's profile URL from the saved pages."""
    alldata = []
    num = 0
    while num < 200:
        y = str(num)
        with open('data' + y + '.html', "r", encoding='utf-8') as f:
            soup = BeautifulSoup(f.read(), 'html.parser')
        for item in soup.find_all('div', class_="comment-item"):
            # The first href in each comment item points at the commenter's
            # profile, e.g. https://www.douban.com/people/128265644/
            flat = str(item).replace("\n", "")
            alldata.append(flat.split('href="')[1].split('"')[0])
        num = num + 20
    return alldata


def getdata(url_list):
    """Visit each profile and record its 常居 (residence) field."""
    alldata = []
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                             'AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/64.0.3282.140 Safari/537.36 Edge/18.17763'}
    for url in url_list:
        resp = requests.get(url, headers=headers, timeout=30)
        resp.encoding = 'utf-8'
        soup = BeautifulSoup(resp.text, 'html.parser')
        for info in soup.find_all('div', class_="user-info"):
            try:
                place = info.text.split("常居: ")[1].split("\n")[0]
                print(place)
                alldata.append(place)
            except IndexError:
                alldata.append("暫無")  # profile does not list a residence
    with open("居住地.csv", 'w+', newline="", encoding='utf-8') as f:
        writer = csv.writer(f)
        for row in alldata:
            # Wrap each string in a list; a bare string would be written
            # one character per column.
            writer.writerow([row])


urls = get_profile_urls()
getdata(urls)
The results are written to 居住地.csv. Because this step hits a lot of pages, it is easy to get your IP banned, so I suggest keeping the data on your own machine, using proxy IPs, and so on. Basically all the code is above. As for data analysis and how to plot charts, look it up yourselves; once you have the data it should be straightforward. You can still contact me if you get stuck: QQ: 626529441.
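For reference, sending requests through a proxy looks roughly like this (a sketch of mine; 127.0.0.1:8080 is a placeholder you must replace with a proxy that actually works):
import requests

proxies = {
    'http': 'http://127.0.0.1:8080',   # placeholder proxy address
    'https': 'http://127.0.0.1:8080',  # placeholder proxy address
}
resp = requests.get('https://movie.douban.com/subject/30211551/',
                    headers={'User-Agent': 'Mozilla/5.0'},
                    proxies=proxies, timeout=10)
print(resp.status_code)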