Python Crawler in Practice (1): Analyzing Reviews of the Latest Movies on Douban and Making a Word Cloud

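Not long after picking up Python I wanted a small project to practice on. The Chinese film Operation Red Sea (《紅海行動》) has been well received lately, so I decided to scrape its latest comments and build a word cloud to see what viewers online think of it. So let's start the analysis. (I leaned on plenty of other bloggers' posts along the way, so forgive me, I'm still a beginner...)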

Environment: Python 3.6, PyCharm 2017.2.3

1. Scraping the page data

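Step one is to request the page you want to scrape and get its content. Python's urllib library handles this. Start with the "Now Playing" page on Douban Movie.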

Operation Red Sea is listed there, and it is the one we want to scrape. The URL of the "Now Playing" page is https://movie.douban.com/cinema/nowplaying/guangzhou/ , so fetch that page first.
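Here is a minimal sketch of that download step using only the standard library. The names fetch_html, opener, and fake_opener are my own for illustration and are not from the original code; the opener hook exists only so the function can be tried without hitting Douban.

```python
import io
from urllib import request

NOWPLAYING_URL = "https://movie.douban.com/cinema/nowplaying/guangzhou/"


def fetch_html(url, opener=request.urlopen):
    """Download a page and return its body decoded as UTF-8.

    `opener` is a hypothetical hook (an assumption of this sketch) so the
    function can be exercised without touching the network.
    """
    with opener(url) as resp:
        return resp.read().decode("utf-8")


# Offline demo: a fake opener that returns canned bytes instead of a response.
def fake_opener(url):
    return io.BytesIO("<title>豆瓣電影</title>".encode("utf-8"))


demo_html = fetch_html(NOWPLAYING_URL, opener=fake_opener)
print(demo_html)

# Real use (sends one HTTP request to Douban):
# html_data = fetch_html(NOWPLAYING_URL)
```

Decoding as UTF-8 matches what the full script at the end of this post does.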

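Step two. That page lists plenty of other movies too, so how do we pick out the one we want? We have to parse the page. With Chrome's developer tools (press F12), locate the movie in the page source to see which tag holds the data we need.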

The data we want starts at the div id="nowplaying" tag, which holds each movie's title, rating, lead actors, and so on. BeautifulSoup's find_all reads this content out of the HTML.
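A small sketch of this find_all step, run on a hand-written HTML fragment rather than the live page. The fragment is an assumption that only imitates the structure described here (the second and third movies are made up); the real Douban markup has many more attributes.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the "now playing" page (assumed structure).
SAMPLE_HTML = """
<div id="nowplaying">
  <ul class="lists">
    <li class="list-item" data-subject="26861685">
      <img src="poster1.jpg" alt="紅海行動">
    </li>
    <li class="list-item" data-subject="00000001">
      <img src="poster2.jpg" alt="Hypothetical Movie">
    </li>
  </ul>
</div>
<div id="upcoming">
  <ul><li class="list-item" data-subject="00000002"><img alt="Not Yet Showing"></li></ul>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
# find_all returns a list, even when only one div matches the id.
nowplaying_movie = soup.find_all("div", id="nowplaying")
# Search only inside that div, so the "upcoming" items are left out.
nowplaying_movie_list = nowplaying_movie[0].find_all("li", class_="list-item")

print(len(nowplaying_movie_list))
```

Searching inside nowplaying_movie[0] instead of the whole soup is what keeps movies from other sections of the page out of the list.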

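Here nowplaying_movie_list is a list. In the page source, each item's data-subject attribute holds the movie's id and the alt attribute of its img tag holds the movie's name, so these two attributes give us the id and the name. (Note: the id is needed to open the movie's short-comment page, which is why we extract it.)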
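The extraction can be sketched like this, again on a made-up fragment (the second movie is hypothetical). Tag attributes in BeautifulSoup are read with dictionary-style indexing.

```python
from bs4 import BeautifulSoup

# Made-up list items that imitate the attributes described above.
html = """
<ul>
  <li class="list-item" data-subject="26861685"><img alt="紅海行動"></li>
  <li class="list-item" data-subject="00000001"><img alt="Hypothetical Movie"></li>
</ul>
"""

items = BeautifulSoup(html, "html.parser").find_all("li", class_="list-item")

nowplaying_list = []
for item in items:
    img = item.find("img")  # first <img> inside this <li>, or None
    nowplaying_list.append({
        "id": item["data-subject"],       # attribute on the <li> itself
        "name": img["alt"] if img else "",  # alt text of the poster image
    })

print(nowplaying_list)
```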

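The list nowplaying_list now holds the id and name of every movie currently showing; print(nowplaying_list) displays it. With the ids and names in hand, the next step is to analyze this movie's comments.
Open the movie's short-comment page at https://movie.douban.com/subject/26861685/comments?start=0&limit=20&sort=new_score&status=P&percent_type= , where 26861685 is the movie's id, start is the offset of the first comment on the page (0 for the first page), and limit is the number of comments per page.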
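To make the paging explicit, here is a sketch that builds the URL for any page. build_comment_url and page_size are names I made up for this example; the query parameters are the ones in the URL above.

```python
from urllib.parse import urlencode


def build_comment_url(movie_id, page_num, page_size=20):
    """Return the short-comment URL for a 1-based page number."""
    if page_num < 1:
        raise ValueError("page_num starts at 1")
    query = urlencode({
        "start": (page_num - 1) * page_size,  # offset of the first comment
        "limit": page_size,                   # comments per page
        "sort": "new_score",
        "status": "P",
        "percent_type": "",
    })
    return f"https://movie.douban.com/subject/{movie_id}/comments?{query}"


print(build_comment_url("26861685", 1))
print(build_comment_url("26861685", 3))
```

Page 1 gives start=0, page 2 gives start=20, page 3 gives start=40, and so on.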

Next we parse the page at that URL. Looking at the HTML of the short-comment page, the comment data sits inside div tags whose class is comment.

So those are the tags to parse.
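A sketch of selecting those div tags, on a hand-written fragment. The wrapper class comment-item is an assumption about the surrounding markup, included only to show that class_="comment" matches the class token exactly.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the short-comment page (assumed structure).
html = """
<div id="comments">
  <div class="comment-item"><div class="comment"><p>first</p></div></div>
  <div class="comment-item"><div class="comment"><p>second</p></div></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# class_ is spelled with an underscore because "class" is a Python keyword.
comment_div_lits = soup.find_all("div", class_="comment")

print(len(comment_div_lits))
```

Only the two inner divs match; comment-item is a different class token, so the wrappers are skipped.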

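At this point the list comment_div_lits holds the HTML of each div with class comment. Inside each one, a p tag holds the viewer's comment text, so we parse the HTML in comment_div_lits one level further to pull that text out.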
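This second pass can be sketched as follows. The fragment is made up; it includes one comment whose p tag has nested markup and one div with no p at all, to show why the None check matters.

```python
from bs4 import BeautifulSoup

html = """
<div class="comment"><h3>user one</h3><p> 很好看 </p></div>
<div class="comment"><h3>user two</h3><p>場面<b>震撼</b></p></div>
<div class="comment"><h3>user three</h3></div>
"""

comment_div_lits = BeautifulSoup(html, "html.parser").find_all("div", class_="comment")

eachcomment = []
for item in comment_div_lits:
    p = item.find("p")
    # .string is None when the <p> contains more than one child node,
    # so comments with nested tags are skipped by this approach.
    if p is not None and p.string is not None:
        eachcomment.append(p.string.strip())

print(eachcomment)
```

If skipping such comments is not acceptable, p.get_text() would return the combined text instead, at the cost of also picking up text from any nested tags.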

2. Data cleaning

To make cleaning easier, we first join the comments in the list into a single string.
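A tiny sketch of the joining step, with two made-up comments standing in for the scraped ones:

```python
# Made-up comments; in the script these come from the scraping step.
eachcomment = [" 很好看！ ", "場面震撼，值得一看。\n"]

# Strip surrounding whitespace from each comment, then concatenate.
comments = "".join(str(c).strip() for c in eachcomment)

print(comments)
```

str.join avoids building a new string on every loop iteration, though for a few hundred comments either way is fine.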

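All the comments are now one string, but it still contains plenty of punctuation. Those symbols are useless for word-frequency counting, so we strip them out with a regular expression. In Python, regular expressions are provided by the re module.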
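The cleaning can be sketched like this, using the same pattern as the full script. The sample string is made up.

```python
import re

comments = "很好看！場面震撼，值得一看。Good 10/10"

# Match runs of characters in the common CJK range U+4E00..U+9FA5.
pattern = re.compile(r"[\u4e00-\u9fa5]+")
filterdata = pattern.findall(comments)
clean_comments = "".join(filterdata)

print(filterdata)
print(clean_comments)
```

Note that this keeps only Chinese characters, so Latin letters and digits are dropped along with the punctuation. That suits a Chinese word cloud, but it would lose English words in mixed-language comments.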

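Next comes the word-frequency count, which requires Chinese word segmentation first. I use jieba here; if it isn't installed, run pip install jieba in a terminal. (Note: pip list shows which libraries are installed.) Our data contains function words (stop words) such as "看", "太", and "的". They are frequent in any context and carry no real meaning, so we remove them. I keep the stop words in a stopwords.txt file and check our data against it. (Note: search online for a stopwords.txt to download.)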
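Here is a sketch of just the stop-word filtering, written in plain Python so it runs without jieba or a stopwords.txt file. The helper names load_stopwords and remove_stopwords are made up, the token list is hard-coded, and this is a plain-set alternative to the pandas approach in the full script.

```python
def load_stopwords(lines):
    """Build a set of stop words from an iterable of lines, one word per line."""
    return {line.strip() for line in lines if line.strip()}


def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not stop words, preserving order."""
    return [t for t in tokens if t not in stopwords]


# Stand-in for the lines of stopwords.txt (assumed: one word per line).
stopwords = load_stopwords(["的\n", "看\n", "太\n", "\n"])

# Stand-in for the output of jieba.lcut(clean_comments).
segment = ["電影", "的", "場面", "太", "震撼"]

filtered = remove_stopwords(segment, stopwords)
print(filtered)
```

In the real pipeline the lines would come from open("stopwords.txt", encoding=...) with whatever encoding your downloaded file uses, and segment would come from jieba.lcut.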

Then count the word frequencies.
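The full script does this with a pandas groupby. As a lighter illustration of the same idea, here is a sketch using collections.Counter from the standard library, on a made-up token list:

```python
from collections import Counter

# Stand-in for the segmented, stop-word-filtered tokens.
segment = ["震撼", "電影", "震撼", "場面", "電影", "震撼"]

# most_common() returns (word, count) pairs, highest count first.
words_stat = Counter(segment).most_common()

print(words_stat)
```

A dict built from these pairs has the word-to-frequency shape that a word cloud library takes as input.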

And then make the word cloud.

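That wraps up the analysis, though I'm still hazy on parts of it myself. The script produced a word cloud when I ran it.
It ran fine the first few times, but when I tried to run it again after finishing this post, it failed with an error.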

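In other words, my IP got blocked. GG. I added header information to the request to pose as a browser, and that didn't help either. I'll leave it for now and just post my full code. (Note: running it once or twice is enough. Run it many times and you'll be detected and have your IP blocked. Going through an IP proxy might work, but I don't know how. GG.)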

from urllib import request
from bs4 import BeautifulSoup
import re
import jieba
import pandas as pd
import numpy
import matplotlib.pyplot as plt
import matplotlib
matplotlib.rcParams['figure.figsize'] = (10.0, 5.0)
from wordcloud import WordCloud
import warnings
warnings.filterwarnings("ignore")

def getmovie_list():
    # Parse the "now playing" page into a list of {'id', 'name'} dicts.
    movie_url = 'https://movie.douban.com/cinema/nowplaying/guangzhou/'
    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
    req = request.Request(url=movie_url, headers=headers)
    resp = request.urlopen(req)
    html_data = resp.read().decode('utf-8')
    #print(html_data)
    soup = BeautifulSoup(html_data, 'html.parser')
    nowplaying_movie = soup.find_all('div', id='nowplaying')
    nowplaying_movie_list = nowplaying_movie[0].find_all('li', class_='list-item')
    #print(nowplaying_movie_list[0])
    # Module-level so the rest of the script can read it.
    global nowplaying_list
    nowplaying_list = []
    for item in nowplaying_movie_list:
        nowplaying_dict = {}
        nowplaying_dict['id'] = item['data-subject']
        for tag_img_item in item.find_all('img'):
            nowplaying_dict['name'] = tag_img_item['alt']
        # Append once per movie, not once per <img>.
        nowplaying_list.append(nowplaying_dict)
    #print(nowplaying_list)
    return nowplaying_list

def get_comment(movieId, pageNum):
    # Fetch one page (20 items) of short comments for the given movie id.
    global eachcomment
    eachcomment = []
    if pageNum > 0:
        start = (pageNum - 1) * 20
    else:
        return eachcomment  # invalid page number: nothing to fetch
    # Use the movieId argument and the computed offset, so each call
    # really requests a different page.
    comment_url = 'https://movie.douban.com/subject/' + movieId + '/comments' + '?' + 'start=' + str(start) + '&limit=20' \
                  + '&sort=new_score' + '&status=P&percent_type='
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
    req = request.Request(url=comment_url, headers=headers)
    resp = request.urlopen(req)
    html_data = resp.read().decode('utf-8')
    soup = BeautifulSoup(html_data, 'html.parser')
    comment_div = soup.find_all('div', class_='comment')
    for item in comment_div:
        p = item.find('p')
        # .string is None when the <p> has nested tags; skip those.
        if p is not None and p.string is not None:
            eachcomment.append(p.string)
    #print(eachcomment)
    return eachcomment

def main():
    # Fetch the first 10 pages of comments for the first movie in the list.
    commentslist = []
    nowplayingmovie_list = getmovie_list()
    for i in range(10):
        num = i + 1
        commentlist_temp = get_comment(nowplayingmovie_list[0]['id'], num)
        commentslist.extend(commentlist_temp)
    # Join the comments from every page into one string.
    comments = ''
    for com in commentslist:
        comments = comments + str(com).strip()

    # Keep only runs of Chinese characters, which drops the punctuation.
    pattern = re.compile(r'[\u4e00-\u9fa5]+')
    filterdata = re.findall(pattern, comments)
    clean_comments = ''.join(filterdata)
    #print(clean_comments)

    segment = jieba.lcut(clean_comments)
    words_df = pd.DataFrame({'segment': segment})

    # Remove stop words. quoting=3 is csv.QUOTE_NONE, so quote characters
    # in the file are read literally.
    stopwords = pd.read_csv("stopwords.txt", index_col=False, quoting=3, sep="\t", names=['stopwords'], encoding='gb2312')
    # Compare against the column of words, not the DataFrame itself.
    words_df = words_df[~words_df.segment.isin(stopwords.stopwords)]

    # Count word frequencies, most frequent first.
    words_stat = words_df.groupby('segment').size().reset_index(name='count')
    words_stat = words_stat.sort_values(by='count', ascending=False)

    # Render the word cloud.
    wordcloud = WordCloud(font_path="C:/windows/fonts/simhei.ttf", background_color="white", max_font_size=80)
    word_frequence = {x[0]: x[1] for x in words_stat.head(1000).values}
    wordcloud = wordcloud.fit_words(word_frequence)
    plt.imshow(wordcloud)
    #plt.savefig("result.jpg")
    plt.axis('off')
    plt.show()

main()

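Yesterday I said my IP had been blocked. Today I added a Referer field to the headers. To explain: Referer tells the site which link I followed to reach it. After adding it I ran the script again, and it worked.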
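A sketch of what adding that header might look like. The Referer value below is my assumption (the movie's own page), not necessarily the value used for the run above, and build_request is a made-up helper name. No request is sent here; the code only builds the Request object.

```python
from urllib import request

USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36')


def build_request(url, referer):
    """Build a Request that carries both a User-Agent and a Referer header."""
    headers = {
        'User-Agent': USER_AGENT,
        'Referer': referer,  # assumed value: the page that links to `url`
    }
    return request.Request(url=url, headers=headers)


req = build_request(
    'https://movie.douban.com/subject/26861685/comments?start=0&limit=20',
    referer='https://movie.douban.com/subject/26861685/',
)
print(req.get_header('Referer'))

# To actually send it: resp = request.urlopen(req)
```

Whether this is enough to avoid a block is up to the site and may change at any time, so keeping the number of requests small is still the safer habit.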

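It finally works, and I'm a little excited. I still don't understand a lot of the underlying principles, but one step at a time. Keep going!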
