Python Web Scraping in Practice (1): Analyzing Reviews of the Latest Movies on Douban and Building a Word Cloud

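I picked up Python not long ago and wanted a small project to practice on. The Chinese film Operation Red Sea (《红海行动》) happened to be getting good reviews, so I decided to scrape its latest comments and turn them into a word cloud to see what viewers think of it. So let's get started. (I leaned on a lot of other bloggers' posts along the way, so forgive me, I'm still a newbie...)

Environment: Python 3.6, PyCharm 2017.2.3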

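1. Fetching the page data

The first step is to request the page you want to scrape and get its content. In Python I use the urllib library for this. Let's start from the "Now Playing" section of Douban Movies.

There it is: Operation Red Sea is the one we want. The page URL is https://movie.douban.com/cinema/nowplaying/guangzhou/ , so let's fetch that page first.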
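Here is a minimal sketch of that fetch step using only urllib. The helper name fetch_html and the timeout value are my own choices for illustration, and the User-Agent string is the one from the full listing further down. Whether Douban accepts the request depends on its anti-scraping rules at the time.

```python
from urllib import request

# URL quoted in the post; fetch_html is a hypothetical helper name.
NOWPLAYING_URL = "https://movie.douban.com/cinema/nowplaying/guangzhou/"
USER_AGENT = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36")


def fetch_html(url, timeout=10):
    """Download a page and return its body decoded as UTF-8 text."""
    req = request.Request(url=url, headers={"User-Agent": USER_AGENT})
    with request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Usage (performs a real network request, so it is left commented out):
# html_data = fetch_html(NOWPLAYING_URL)
# print(html_data[:200])
```

The function returns plain text, which is what BeautifulSoup expects in the next step.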

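Step two. That page lists plenty of other movies too, so how do we pick out the one we want? We have to parse the page. Open Chrome's developer tools (press F12), locate the movie, and you will find the data we need inside one particular tag.

The data we want starts at the div id="nowplaying" tag, which contains each movie's title, rating, lead actors and so on. We use find_all to read this content out of the HTML (see getmovie_list in the full listing below).

Here nowplaying_movie_list is a list. The data-subject attribute of each item holds the movie's id, and the alt attribute of its img tag holds the movie's name, so we use these two attributes to get each movie's id and name. (Note: the id is needed later to open the movie's short-comments page, which is why we parse it out.) This is the loop in getmovie_list in the full listing.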
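To make the parsing step concrete, here is a small sketch that runs on a hand-written HTML fragment instead of the live page. The fragment is an assumption modelled on the structure described above (a div with id nowplaying, li items with class list-item, a data-subject attribute and an img alt); the real page has much more markup. The function name parse_nowplaying and the second sample movie are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the "Now Playing" page (assumed structure).
SAMPLE_HTML = """
<div id="nowplaying">
  <ul class="lists">
    <li class="list-item" data-subject="26861685">
      <img src="poster1.jpg" alt="红海行动">
    </li>
    <li class="list-item" data-subject="00000000">
      <img src="poster2.jpg" alt="Example Movie">
    </li>
  </ul>
</div>
"""


def parse_nowplaying(html):
    """Return a list of {'id': ..., 'name': ...} dicts, one per movie."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", id="nowplaying")
    movies = []
    if container is None:
        return movies
    for item in container.find_all("li", class_="list-item"):
        img = item.find("img")
        movies.append({
            "id": item["data-subject"],          # movie id used in later URLs
            "name": img["alt"] if img else None  # movie title from the poster
        })
    return movies


nowplaying_list = parse_nowplaying(SAMPLE_HTML)
print(nowplaying_list)
```

Returning the list instead of storing it in a global makes the function easier to try out on saved HTML without touching the network.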

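The list nowplaying_list now holds the id and name of each movie currently showing; you can check it with print(nowplaying_list). Now that we have the movie's id and name, the next step is to analyze its comments.
Go to the movie's short-comments page: https://movie.douban.com/subject/26861685/comments?start=0&limit=20&sort=new_score&status=P&percent_type= , where 26861685 is the movie's id, start is the offset of the first comment shown (0 means the first page), and limit is the number of comments per page.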
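Since start is just an offset, the URL for any page can be computed from the page number. The sketch below shows one way to do that; build_comment_url is a hypothetical helper, and the query parameters are copied from the URL above without knowing everything Douban does with them.

```python
from urllib.parse import urlencode


def build_comment_url(movie_id, page, page_size=20):
    """Build the short-comments URL for a 1-based page number."""
    if page < 1:
        raise ValueError("page must be 1 or greater")
    query = urlencode({
        "start": (page - 1) * page_size,  # offset of the first comment
        "limit": page_size,               # comments per page
        "sort": "new_score",
        "status": "P",
        "percent_type": "",
    })
    return "https://movie.douban.com/subject/{}/comments?{}".format(movie_id, query)


print(build_comment_url("26861685", 1))
print(build_comment_url("26861685", 3))
```

Page 1 gives start=0, page 2 gives start=20, and so on.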

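Next we parse the content at this URL. Looking at the HTML of the short-comments page, the comment data sits inside div tags whose class is "comment".

So we parse those tags (see get_comment in the full listing).

At this point the list comment_div_lits (called comment_div in the full listing) holds the HTML of every div tag with class comment. Each viewer's comment text sits in a p tag inside that div, so we parse the HTML in comment_div_lits one step further to pull it out.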
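The sketch below walks through that extraction on a made-up fragment. The markup is an assumption based on the description above (div class comment containing a p), not a copy of Douban's real page. It also shows why the full listing checks for None: .string only returns text when the p tag has a single text child.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the short-comments page (assumed structure).
SAMPLE_COMMENTS_HTML = """
<div class="comment-item">
  <div class="comment">
    <h3><span class="comment-info">user_a</span></h3>
    <p>场面震撼，节奏很紧凑！</p>
  </div>
</div>
<div class="comment-item">
  <div class="comment">
    <p><span>nested</span> text</p>
  </div>
</div>
"""


def extract_comments(html):
    """Return the text of the first <p> in every div with class 'comment'."""
    soup = BeautifulSoup(html, "html.parser")
    comments = []
    for div in soup.find_all("div", class_="comment"):
        p_tags = div.find_all("p")
        # .string is None when the <p> has more than one child node,
        # so such comments are skipped, as in the full listing.
        if p_tags and p_tags[0].string is not None:
            comments.append(str(p_tags[0].string).strip())
    return comments


eachcomment = extract_comments(SAMPLE_COMMENTS_HTML)
print(eachcomment)
```

If you would rather keep comments that contain nested tags, p_tags[0].get_text() is an alternative to .string.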

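2. Data cleaning

To make cleaning easier, we join the comments in the list into one string (the code is in main in the full listing).

Now all the comments are a single string, but it still contains plenty of punctuation and the like. These symbols are useless for word-frequency counting, so we strip them out with a regular expression. In Python, regular expressions are provided by the re module.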
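As a quick illustration of the join and the regex cleanup, the sketch below uses two invented comments. The pattern is the same one as in the full listing: it keeps runs of characters in the range \u4e00-\u9fa5 (common Chinese characters) and drops everything else, including digits and Latin letters, not just punctuation.

```python
import re

# Invented sample comments standing in for the scraped list.
eachcomment = ["场面震撼，节奏很紧凑！", " 10/10 would watch again... 值得一看。"]

# Join all comments into one string, trimming surrounding whitespace.
comments = "".join(c.strip() for c in eachcomment)

# Keep only runs of Chinese characters and glue them back together.
pattern = re.compile(r"[\u4e00-\u9fa5]+")
clean_comments = "".join(pattern.findall(comments))

print(clean_comments)
```

Note that English words in the comments are discarded too, which is fine for a Chinese word cloud but worth knowing.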

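Next comes word-frequency counting, which first requires Chinese word segmentation. I use jieba for this. If you don't have it, install it from the console with pip install jieba. (Note: pip list shows which libraries are installed.) Our data also contains function words (stop words) such as "看", "太" and "的". They are high-frequency in almost any text and carry no real meaning, so we remove them. I keep the stop words in a stopwords.txt file and filter our data against it. (Note: search online for a stopwords.txt to download.)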
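The stop-word filter itself is simple once the text is segmented. The sketch below assumes the tokens have already come out of jieba.lcut (they are written by hand here so the example runs without jieba) and that the stop-word file has one word per line. It uses a plain set instead of the pandas isin call in the full listing; the helper names are mine.

```python
def load_stopwords(lines):
    """Build a set of stop words from an iterable of lines (one word per line)."""
    return {line.strip() for line in lines if line.strip()}


def remove_stopwords(tokens, stopwords):
    """Return the tokens that are not stop words, keeping their order."""
    return [t for t in tokens if t not in stopwords]


# With a real file this would be:
#   with open("stopwords.txt", encoding="utf-8") as f:   # encoding is an assumption
#       stopwords = load_stopwords(f)
stopwords = load_stopwords(["的\n", "看\n", "太\n", "\n"])

# Hand-written tokens standing in for the output of jieba.lcut(clean_comments).
tokens = ["场面", "太", "震撼", "的", "电影", "值得", "看"]
filtered = remove_stopwords(tokens, stopwords)
print(filtered)
```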

Then we count word frequencies.
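The full listing does the counting with pandas. For a feel of what that step produces, here is the same idea with collections.Counter from the standard library, which is a swapped-in technique and not what my script uses. The token list is invented.

```python
from collections import Counter

# Invented tokens standing in for the segmented, stop-word-free comments.
tokens = ["震撼", "电影", "震撼", "值得", "电影", "震撼"]

word_counts = Counter(tokens)            # word -> number of occurrences
top_words = word_counts.most_common(2)   # highest counts first

print(top_words)
```

A dict of word to count like this is also the shape of input that the word cloud step takes.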

And then we build the word cloud.

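That's it for the analysis. I'm still a bit hazy on parts of it, but my run did produce a word cloud.
The first few runs went fine, but when I tried to run it again after finishing this post, the request failed with an error.

In other words, my IP got banned. GG. I then added header information to the request to pose as a browser, and that didn't work either. I'm a newbie, so I'll leave that for now and just post my complete code. (Note: running it once or twice is enough. Run it many times and you'll get noticed and your IP banned. Going through an IP proxy might help, but I don't know how to do that. GG.)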

from urllib import request
from bs4 import BeautifulSoup
import re
import jieba
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import warnings

matplotlib.rcParams['figure.figsize'] = (10.0, 5.0)
warnings.filterwarnings("ignore")

# Pose as a browser; shared by both requests
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}


def getmovie_list():
    # Parse the "now playing" page into a list of {'id': ..., 'name': ...} dicts
    movie_url = 'https://movie.douban.com/cinema/nowplaying/guangzhou/'
    req = request.Request(url=movie_url, headers=HEADERS)
    resp = request.urlopen(req)
    html_data = resp.read().decode('utf-8')
    soup = BeautifulSoup(html_data, 'html.parser')
    nowplaying_movie = soup.find_all('div', id='nowplaying')
    nowplaying_movie_list = nowplaying_movie[0].find_all('li', class_='list-item')
    nowplaying_list = []
    for item in nowplaying_movie_list:
        nowplaying_dict = {}
        nowplaying_dict['id'] = item['data-subject']
        for tag_img_item in item.find_all('img'):
            nowplaying_dict['name'] = tag_img_item['alt']
        # Append once per movie, after the name has been filled in
        nowplaying_list.append(nowplaying_dict)
    return nowplaying_list


def get_comment(movieId, pageNum):
    # Return the short comments on page pageNum (1-based) of the given movie
    if pageNum <= 0:
        return []
    start = (pageNum - 1) * 20
    # Use the movieId argument and the computed offset, not a fixed first page
    comment_url = ('https://movie.douban.com/subject/' + movieId + '/comments'
                   + '?start=' + str(start) + '&limit=20'
                   + '&sort=new_score' + '&status=P&percent_type=')
    req = request.Request(url=comment_url, headers=HEADERS)
    resp = request.urlopen(req)
    html_data = resp.read().decode('utf-8')
    soup = BeautifulSoup(html_data, 'html.parser')
    comment_div = soup.find_all('div', class_='comment')
    eachcomment = []
    for item in comment_div:
        p_tags = item.find_all('p')
        # .string is None when the <p> contains nested tags; skip those
        if p_tags and p_tags[0].string is not None:
            eachcomment.append(p_tags[0].string)
    return eachcomment


def main():
    # Collect the first 20 pages of comments for the first movie in the list
    commentslist = []
    nowplayingmovie_list = getmovie_list()
    for i in range(20):
        num = i + 1
        commentlist_temp = get_comment(nowplayingmovie_list[0]['id'], num)
        commentslist.extend(commentlist_temp)

    # Join every collected comment into one string
    comments = ''
    for com in commentslist:
        comments = comments + str(com).strip()

    # Keep only Chinese characters; this removes punctuation
    pattern = re.compile(r'[\u4e00-\u9fa5]+')
    filterdata = re.findall(pattern, comments)
    clean_comments = ''.join(filterdata)

    segment = jieba.lcut(clean_comments)
    words_df = pd.DataFrame({'segment': segment})

    # Remove stop words (quoting=3 is csv.QUOTE_NONE; the encoding must match your stopwords.txt)
    stopwords = pd.read_csv("stopwords.txt", index_col=False, quoting=3, sep="\t", names=['stopwords'], encoding='gb2312')
    words_df = words_df[~words_df.segment.isin(stopwords.stopwords)]

    # Count word frequencies, most frequent first
    words_stat = words_df.groupby('segment').size().reset_index(name='计数')
    words_stat = words_stat.sort_values(by=['计数'], ascending=False)

    # Show the result as a word cloud (the font path is for Windows)
    wordcloud = WordCloud(font_path="C:/windows/fonts/simhei.ttf", background_color="white", max_font_size=80)
    word_frequence = {x[0]: x[1] for x in words_stat.head(1000).values}
    wordcloud = wordcloud.fit_words(word_frequence)
    plt.imshow(wordcloud)
    #plt.savefig("result.jpg")
    plt.axis('off')
    plt.show()


if __name__ == '__main__':
    main()

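Yesterday I said my IP had been banned. Today I added a Referer entry to the headers. To explain: Referer tells the site which link I followed to reach it. After adding it I ran the script again, and it worked.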
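For reference, adding the header looks roughly like the sketch below. The Referer value shown (the movie's own page) is an assumption for illustration, and build_request is a hypothetical helper; whether a given value gets past the site's checks can change at any time.

```python
from urllib import request

USER_AGENT = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36")


def build_request(url, referer):
    """Create a Request that carries both User-Agent and Referer headers."""
    headers = {
        "User-Agent": USER_AGENT,
        "Referer": referer,  # the page we claim to have come from (assumed value)
    }
    return request.Request(url=url, headers=headers)


req = build_request(
    "https://movie.douban.com/subject/26861685/comments?start=0&limit=20"
    "&sort=new_score&status=P&percent_type=",
    "https://movie.douban.com/subject/26861685/",
)
print(req.header_items())
# resp = request.urlopen(req)  # real network request, left commented out
```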

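It finally works, and I'm a little excited. There's still a lot I don't understand about how it all works, but one step at a time. Keep going!!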
