Kaggle TMDB 5000 Movie Data Analysis and a Movie Recommendation Model

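The data come from the TMDB 5000 movie dataset on Kaggle. This analysis covers visualization of the movie data and a simple movie recommendation model, including:
1. The distribution of movie genres and how it changes over time
2. The relationships among profit, rating, and popularity
3. Which directors' movies sell well or are well received
4. The most prolific cast members
5. Movie keyword analysis
6. Similarity-based movie recommendation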

Data Analysis

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
import json
import warnings
warnings.filterwarnings('ignore')  # ignore warnings
movie = pd.read_csv('tmdb_5000_movies.csv')
credit = pd.read_csv('tmdb_5000_credits.csv')
movie.head(1)
budget genres homepage id keywords original_language original_title overview popularity production_companies production_countries release_date revenue runtime spoken_languages status tagline title vote_average vote_count
0 237000000 [{“id”: 28, “name”: “Action”}, {“id”: 12, “nam… http://www.avatarmovie.com/ 19995 [{“id”: 1463, “name”: “culture clash”}, {“id”:… en Avatar In the 22nd century, a paraplegic Marine is di… 150.437577 [{“name”: “Ingenious Film Partners”, “id”: 289… [{“iso_3166_1”: “US”, “name”: “United States o… 2009-12-10 2787965087 162.0 [{“iso_639_1”: “en”, “name”: “English”}, {“iso… Released Enter the World of Pandora. Avatar 7.2 11800
movie.tail(3)
budget genres homepage id keywords original_language original_title overview popularity production_companies production_countries release_date revenue runtime spoken_languages status tagline title vote_average vote_count
4800 0 [{“id”: 35, “name”: “Comedy”}, {“id”: 18, “nam… http://www.hallmarkchannel.com/signedsealeddel… 231617 [{“id”: 248, “name”: “date”}, {“id”: 699, “nam… en Signed, Sealed, Delivered “Signed, Sealed, Delivered” introduces a dedic… 1.444476 [{“name”: “Front Street Pictures”, “id”: 3958}… [{“iso_3166_1”: “US”, “name”: “United States o… 2013-10-13 0 120.0 [{“iso_639_1”: “en”, “name”: “English”}] Released NaN Signed, Sealed, Delivered 7.0 6
4801 0 [] http://shanghaicalling.com/ 126186 [] en Shanghai Calling When ambitious New York attorney Sam is sent t… 0.857008 [] [{“iso_3166_1”: “US”, “name”: “United States o… 2012-05-03 0 98.0 [{“iso_639_1”: “en”, “name”: “English”}] Released A New Yorker in Shanghai Shanghai Calling 5.7 7
4802 0 [{“id”: 99, “name”: “Documentary”}] NaN 25975 [{“id”: 1523, “name”: “obsession”}, {“id”: 224… en My Date with Drew Ever since the second grade when he first saw … 1.929883 [{“name”: “rusty bear entertainment”, “id”: 87… [{“iso_3166_1”: “US”, “name”: “United States o… 2005-08-05 0 90.0 [{“iso_639_1”: “en”, “name”: “English”}] Released NaN My Date with Drew 6.3 16
movie.info()  # 4,803 samples; some features have missing values
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
budget                  4803 non-null int64
genres                  4803 non-null object
homepage                1712 non-null object
id                      4803 non-null int64
keywords                4803 non-null object
original_language       4803 non-null object
original_title          4803 non-null object
overview                4800 non-null object
popularity              4803 non-null float64
production_companies    4803 non-null object
production_countries    4803 non-null object
release_date            4802 non-null object
revenue                 4803 non-null int64
runtime                 4801 non-null float64
spoken_languages        4803 non-null object
status                  4803 non-null object
tagline                 3959 non-null object
title                   4803 non-null object
vote_average            4803 non-null float64
vote_count              4803 non-null int64
dtypes: float64(3), int64(4), object(13)
memory usage: 750.5+ KB

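There are 4,803 samples, and some features have missing values. homepage and tagline are missing most often, but neither affects the basic analysis; release_date and runtime can be filled in. On closer inspection, some samples have [] as the value of genres, keywords, or production_companies, which needs attention.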

credit.info()

Data Cleaning

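Many features are stored as JSON strings, that is, lists of dictionary-like key-value pairs. To simplify later processing, we convert them into Python str or list form so that the useful information is easy to extract.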
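As a minimal sketch of that conversion, the helper below parses one JSON-encoded cell and keeps only the values under a given key. The function name names_from_json and the toy rows are assumptions made for illustration; they are not part of the dataset or of the original notebook.

```python
import json

def names_from_json(cell, key="name"):
    """Parse a JSON-encoded list of objects and return the values stored under `key`."""
    return [item[key] for item in json.loads(cell)]

# Toy cells that mimic the shape of the `genres` column (illustrative values only).
toy_cells = [
    '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]',
    '[]',  # some rows hold an empty list
]

parsed = [names_from_json(cell) for cell in toy_cells]
print(parsed)
```

On a DataFrame the same helper could be passed to Series.apply, which would replace the explicit nested loops with a single call per column.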

# genres: the movie's genres, useful for grouping
movie['genres']=movie['genres'].apply(json.loads)
# Series.apply calls the given function on every element of the column
movie['genres'].head(1)
0    [{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...
Name: genres, dtype: object
list(zip(movie.index,movie['genres']))[:2]
[(0,
  [{'id': 28, 'name': 'Action'},
   {'id': 12, 'name': 'Adventure'},
   {'id': 14, 'name': 'Fantasy'},
   {'id': 878, 'name': 'Science Fiction'}]),
 (1,
  [{'id': 12, 'name': 'Adventure'},
   {'id': 14, 'name': 'Fantasy'},
   {'id': 28, 'name': 'Action'}])]
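for index,i in zip(movie.index,movie['genres']):
    list1=[]
    for j in range(len(i)):
        list1.append((i[j]['name']))  # 'name' holds the genre label: Action, Adventure, ...
    movie.loc[index,'genres']=str(list1)
movie.head(1)
# The genres column is no longer JSON: the value under each 'name' key (the genre) has been extracted and written back to genres as the string form of a list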
budget genres homepage id keywords original_language original_title overview popularity production_companies production_countries release_date revenue runtime spoken_languages status tagline title vote_average vote_count
0 237000000 [‘Action’, ‘Adventure’, ‘Fantasy’, ‘Science Fi… http://www.avatarmovie.com/ 19995 [{“id”: 1463, “name”: “culture clash”}, {“id”:… en Avatar In the 22nd century, a paraplegic Marine is di… 150.437577 [{“name”: “Ingenious Film Partners”, “id”: 289… [{“iso_3166_1”: “US”, “name”: “United States o… 2009-12-10 2787965087 162.0 [{“iso_639_1”: “en”, “name”: “English”}, {“iso… Released Enter the World of Pandora. Avatar 7.2 11800
# Apply the same approach to the keywords column
movie['keywords'] = movie['keywords'].apply(json.loads)
for index,i in zip(movie.index,movie['keywords']):
    list2=[]
    for j in range(len(i)):
        list2.append(i[j]['name'])
    movie.loc[index,'keywords'] = str(list2)
# Likewise for production_companies, production_countries, and spoken_languages
movie['production_companies'] = movie['production_companies'].apply(json.loads)
for index,i in zip(movie.index,movie['production_companies']):
    list3=[]
    for j in range(len(i)):
        list3.append(i[j]['name'])
    movie.loc[index,'production_companies']=str(list3)
movie['production_countries'] = movie['production_countries'].apply(json.loads)
for index,i in zip(movie.index,movie['production_countries']):
    list3=[]
    for j in range(len(i)):
        list3.append(i[j]['name'])
    movie.loc[index,'production_countries']=str(list3)
movie['spoken_languages'] = movie['spoken_languages'].apply(json.loads)
for index,i in zip(movie.index,movie['spoken_languages']):
    list3=[]
    for j in range(len(i)):
        list3.append(i[j]['name'])
    movie.loc[index,'spoken_languages']=str(list3)
movie.head(1)
budget genres homepage id keywords original_language original_title overview popularity production_companies production_countries release_date revenue runtime spoken_languages status tagline title vote_average vote_count
0 237000000 [‘Action’, ‘Adventure’, ‘Fantasy’, ‘Science Fi… http://www.avatarmovie.com/ 19995 [‘culture clash’, ‘future’, ‘space war’, ‘spac… en Avatar In the 22nd century, a paraplegic Marine is di… 150.437577 [‘Ingenious Film Partners’, ‘Twentieth Century… [‘United States of America’, ‘United Kingdom’] 2009-12-10 2787965087 162.0 [‘English’, ‘Español’] Released Enter the World of Pandora. Avatar 7.2 11800
credit.head(1)
movie_id title cast crew
0 19995 Avatar [{“cast_id”: 242, “character”: “Jake Sully”, “… [{“credit_id”: “52fe48009251416c750aca23”, “de…
credit['cast'] = credit['cast'].apply(json.loads)
for index,i in zip(credit.index,credit['cast']):
    list3=[]
    for j in range(len(i)):
        list3.append(i[j]['name'])
    credit.loc[index,'cast']=str(list3)
credit['crew'] = credit['crew'].apply(json.loads)
# Extract the director from crew and keep it as a director column for later analysis
def director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
credit['crew']=credit['crew'].apply(director)
credit.rename(columns={'crew':'director'},inplace=True)
credit.head(1)
movie_id title cast director
0 19995 Avatar [‘Sam Worthington’, ‘Zoe Saldana’, ‘Sigourney … James Cameron

The id column in movie matches movie_id in credit, so the two tables can be merged to bring all the information into one table.
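Before relying on a merge like this, it can help to check that the keys really line up. The sketch below uses two toy tables (their names and values are made up for illustration) together with the validate and indicator options of pd.merge to confirm that the join is one-to-one and to list rows that found no match.

```python
import pandas as pd

# Toy stand-ins for the two tables; ids and titles are illustrative only.
movie_toy = pd.DataFrame({"id": [1, 2, 3], "title": ["A", "B", "C"]})
credit_toy = pd.DataFrame({"movie_id": [1, 2], "director": ["X", "Y"]})

merged = pd.merge(
    movie_toy, credit_toy,
    left_on="id", right_on="movie_id",
    how="left",
    validate="one_to_one",  # raises MergeError if either key column has duplicates
    indicator=True,         # adds a `_merge` column: 'both' or 'left_only'
)

# Titles from the left table that had no matching row on the right.
unmatched = merged.loc[merged["_merge"] == "left_only", "title"].tolist()
print(unmatched)
```

With how='left' every movie is kept, so unmatched rows show up with missing values in the credit columns rather than disappearing.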

fulldf = pd.merge(movie,credit,left_on='id',right_on='movie_id',how='left')
fulldf.head(1)
budget genres homepage id keywords original_language original_title overview popularity production_companies spoken_languages status tagline title_x vote_average vote_count movie_id title_y cast director
0 237000000 [‘Action’, ‘Adventure’, ‘Fantasy’, ‘Science Fi… http://www.avatarmovie.com/ 19995 [‘culture clash’, ‘future’, ‘space war’, ‘spac… en Avatar In the 22nd century, a paraplegic Marine is di… 150.437577 [‘Ingenious Film Partners’, ‘Twentieth Century… [‘English’, ‘Español’] Released Enter the World of Pandora. Avatar 7.2 11800 19995 Avatar [‘Sam Worthington’, ‘Zoe Saldana’, ‘Sigourney … James Cameron

1 rows × 24 columns

fulldf.shape
(4803, 24)
# Both tables have a title column; after the merge they are automatically named title_x and title_y
fulldf.rename(columns={'title_x':'title'},inplace=True)
fulldf.drop('title_y',axis=1,inplace=True)
# Missing values
NAs = pd.DataFrame(fulldf.isnull().sum())
NAs[NAs.sum(axis=1)>0].sort_values(by=[0],ascending=False)
0
homepage 3091
tagline 844
director 30
overview 3
runtime 2
release_date 1
# Fill in release_date
fulldf.loc[fulldf['release_date'].isnull(),'title']
4553    America Is Still the Place
Name: title, dtype: object
# Filled in with a date looked up online
fulldf['release_date']=fulldf['release_date'].fillna('2014-06-01')
# runtime is the movie's length; fill missing values with the mean
fulldf['runtime'] = fulldf['runtime'].fillna(fulldf['runtime'].mean())
# For convenience, convert release_date (object) to datetime and extract the year and month
fulldf['release_year'] = pd.to_datetime(fulldf['release_date'],format='%Y-%m-%d').dt.year
fulldf['release_month'] = pd.to_datetime(fulldf['release_date'],format='%Y-%m-%d').dt.month

Data Exploration

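# Movie genres
# Given the format, some string processing is needed: first strip the square brackets on both sides
# Remove the spaces (this also removes spaces inside names, so 'Science Fiction' becomes 'ScienceFiction')
# Then remove the single quotes and split on ','
fulldf['genres']=fulldf['genres'].str.strip('[]').str.replace(" ","").str.replace("'","")
# The genres are now separated by ','
fulldf['genres']=fulldf['genres'].str.split(',')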
list1=[]
for i in fulldf['genres']:
    list1.extend(i)
gen_list=pd.Series(list1).value_counts()[:10].sort_values(ascending=False)
gen_df = pd.DataFrame(gen_list)
gen_df.rename(columns={0:'Total'},inplace=True)
fulldf.loc[4801]
  budget                                                                  0
genres                                                                 []
homepage                                      http://shanghaicalling.com/
id                                                                 126186
keywords                                                               []
original_language                                                      en
original_title                                           Shanghai Calling
overview                When ambitious New York attorney Sam is sent t...
popularity                                                       0.857008
production_companies                                                   []
production_countries                ['United States of America', 'China']
release_date                                                   2012-05-03
revenue                                                                 0
runtime                                                                98
spoken_languages                                              ['English']
status                                                           Released
tagline                                          A New Yorker in Shanghai
title                                                    Shanghai Calling
vote_average                                                          5.7
vote_count                                                              7
movie_id                                                           126186
cast                    ['Daniel Henney', 'Eliza Coupe', 'Bill Paxton'...
director                                                      Daniel Hsia
release_year                                                         2012
release_month                                                           5
Name: 4801, dtype: object
plt.subplots(figsize=(10,8))
sns.barplot(y=gen_df.index,x='Total',data=gen_df,palette='GnBu_d')
plt.xticks(fontsize=15)  # set the tick label font size
plt.yticks(fontsize=15)
plt.xlabel('Total',fontsize=15)
plt.ylabel('Genres',fontsize=15)
plt.title('Top 10 Genres',fontsize=20)
plt.show()

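[Figure: bar chart of the top 10 genres]

The ten most common genres include drama, comedy, thriller, and action, which are also the genres most often seen in cinemas today. What lies behind their large numbers?
Let us look at how the number of movies relates to time.

# Deduplicate the genres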
l=[]
for i in list1:
    if i not in l:
        l.append(i)
#l.remove("")  # some movies have an empty genre
len(l)  # l is the deduplicated list of genres
21
year_min = fulldf['release_year'].min()
year_max = fulldf['release_year'].max()

# DataFrame with genres as the index and years as the columns, holding the count of each genre per year; every cell starts at 0
year_genr = pd.DataFrame(0,index=l,columns=range(year_min,year_max+1))


intil_y = np.array(fulldf['release_year'])  # release year of each movie, in row order
z = 0
for i in fulldf['genres']:
    splt_gen = list(i)  # all genres of one movie
    for j in splt_gen:
        year_genr.loc[j,intil_y[z]] = year_genr.loc[j,intil_y[z]]+1  # count this genre in this movie's release year
    z+=1
year_genr = year_genr.sort_values(by=2006,ascending=False)
year_genr = year_genr.iloc[0:10,-49:-1]
year_genr
1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
Drama 7 8 3 6 4 3 2 4 8 5 97 106 122 115 99 79 110 110 95 37
Comedy 3 4 1 3 3 3 3 3 3 1 67 82 97 87 82 80 71 62 52 26
Thriller 3 2 3 1 2 1 1 2 4 5 53 55 59 56 69 58 53 66 67 27
Action 4 4 4 1 2 1 1 2 6 5 44 46 51 49 58 43 56 54 46 39
Romance 2 1 3 0 1 2 1 2 2 3 37 38 57 45 30 39 25 24 23 9
Family 0 0 1 0 0 1 0 1 0 1 20 29 28 29 28 17 22 23 17 9
Crime 3 0 2 3 2 2 0 2 0 0 28 33 32 30 24 27 37 27 26 10
Adventure 2 3 1 2 1 2 2 2 5 4 25 37 36 30 32 25 36 37 35 23
Fantasy 0 0 1 0 0 0 1 0 2 2 19 20 22 21 15 19 21 16 10 13
Horror 0 0 1 1 1 1 1 1 3 4 27 21 30 27 24 33 25 21 33 20

10 rows × 48 columns

plt.subplots(figsize=(10,8))
plt.plot(year_genr.T)
plt.title('Genres vs Time',fontsize=20)
plt.xticks(range(1969,2020,5))
plt.legend(year_genr.T)
plt.show()

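[Figure: line chart of genre counts over time]

From around 1994 onward, movie production boomed and every genre grew substantially, with drama, comedy, thriller, and action growing the most. The large counts for these genres are therefore partly tied to the overall growth of the film industry.

# For convenience, build a new DataFrame with a subset of the features and analyze how they relate to genre.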
partdf = fulldf[['title','vote_average','vote_count','release_year','popularity','budget','revenue']].reset_index(drop=True)
partdf.head(2)
title vote_average vote_count release_year popularity budget revenue
0 Avatar 7.2 11800 2009 150.437577 237000000 2787965087
1 Pirates of the Caribbean: At World’s End 6.9 4500 2007 139.082615 300000000 961000000

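Because a movie can belong to several genres, each genre is added as a column; for each movie the value is 1 if it belongs to that genre and 0 otherwise.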

for per in l:
    partdf[per]=0

    z=0
    for gen in fulldf['genres']:

        if per in list(gen):
            partdf.loc[z,per] = 1
        else:
            partdf.loc[z,per] = 0
        z+=1
partdf.head(2)
title vote_average vote_count release_year popularity budget revenue Action Adventure Fantasy Romance Horror Mystery History War Music Documentary Foreign TVMovie
0 Avatar 7.2 11800 2009 150.437577 237000000 2787965087 1 1 1 0 0 0 0 0 0 0 0 0 0
1 Pirates of the Caribbean: At World’s End 6.9 4500 2007 139.082615 300000000 961000000 1 1 1 0 0 0 0 0 0 0 0 0 0

2 rows × 28 columns
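The 0/1 genre columns built above can also be produced without explicit loops. The following is a sketch on a toy frame (titles and genres are illustrative), assuming each cell already holds a Python list of genre names; it joins each list into a delimited string and lets Series.str.get_dummies create one indicator column per genre.

```python
import pandas as pd

# Toy frame: each row holds a list of genre names (illustrative values only).
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": [["Action", "Drama"], ["Drama"], []],
})

# Join each list with '|' and expand into one 0/1 column per distinct genre.
onehot = toy["genres"].str.join("|").str.get_dummies(sep="|").astype(int)

# Attach the indicator columns to the original rows.
toy_encoded = pd.concat([toy[["title"]], onehot], axis=1)
print(toy_encoded)
```

A row with an empty list simply gets zeros in every genre column, so no separate placeholder genre is needed.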

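Next we want the mean of several features for each genre. We create a new DataFrame with one row per genre and one column per averaged feature, such as rating (vote_average), revenue, and popularity.

mean_gen = pd.DataFrame(l)
# Mean rating
newArray = []
for genre in l:
    newArray.append(partdf.groupby(genre, as_index=True)['vote_average'].mean())
# Each entry of newArray holds the mean for group 0 (not this genre) and group 1 (this genre); we only need group 1.
newArray2 = []
for i in range(len(l)):
    newArray2.append(newArray[i][1])

mean_gen['mean_votes_average']=newArray2
mean_gen.head(2)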
mean_gen.head(2)
0 mean_votes_average
0 Action 5.989515
1 Adventure 6.156962
# Apply the same approach to the other features
# Budget
newArray = []
for genre in l:
    newArray.append(partdf.groupby(genre, as_index=True)['budget'].mean())
newArray2 = []
for i in range(len(l)):
    newArray2.append(newArray[i][1])

mean_gen['mean_budget']=newArray2
# Revenue
newArray = []
for genre in l:
    newArray.append(partdf.groupby(genre, as_index=True)['revenue'].mean())
newArray2 = []
for i in range(len(l)):
    newArray2.append(newArray[i][1])

mean_gen['mean_revenue']=newArray2
# popularity: number of views of the movie's pages
newArray = []
for genre in l:
    newArray.append(partdf.groupby(genre, as_index=True)['popularity'].mean())
newArray2 = []
for i in range(len(l)):
    newArray2.append(newArray[i][1])

mean_gen['mean_popular']=newArray2
# vote_count: count() gives the number of movies in each genre
newArray = []
for genre in l:
    newArray.append(partdf.groupby(genre, as_index=True)['vote_count'].count())
newArray2 = []
for i in range(len(l)):
    newArray2.append(newArray[i][1])

mean_gen['vote_count']=newArray2
mean_gen.rename(columns={0:'genre'},inplace=True)
mean_gen.replace('','none',inplace=True)
# 'none' stands for movies whose genre is missing; there are very few of them, so we drop that row
mean_gen.drop(20,inplace=True)
mean_gen['vote_count'].describe()

count 20.000000
mean 608.000000
std 606.931974
min 8.000000
25% 174.750000
50% 468.500000
75% 816.000000
max 2297.000000
Name: vote_count, dtype: float64

mean_gen['mean_votes_average'].describe()

count 20.000000
mean 6.173921
std 0.278476
min 5.626590
25% 6.009644
50% 6.180978
75% 6.344325
max 6.719797
Name: mean_votes_average, dtype: float64
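The per-genre averages above repeat the same groupby pattern once per feature. A more compact sketch is shown below on a toy frame; the column values, the genre_cols and features lists, and the n_movies column name are assumptions chosen for illustration. For each genre it selects the rows flagged with 1 and averages all requested features at once.

```python
import pandas as pd

# Toy frame with two numeric features and two 0/1 genre columns (illustrative values).
toy = pd.DataFrame({
    "vote_average": [7.0, 6.0, 8.0],
    "revenue": [100, 50, 30],
    "Action": [1, 1, 0],
    "Drama": [0, 1, 1],
})
genre_cols = ["Action", "Drama"]
features = ["vote_average", "revenue"]

# One row per genre: the mean of every feature over the movies in that genre.
summary = pd.DataFrame(
    {g: toy.loc[toy[g] == 1, features].mean() for g in genre_cols}
).T
# Number of movies that carry each genre flag.
summary["n_movies"] = [int(toy[g].sum()) for g in genre_cols]
print(summary)
```

Because a movie with several genres contributes to each of them, the rows of such a summary overlap and should not be added together.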

# sns.factorplot was renamed to catplot and cannot draw onto an existing Axes;
# pointplot is the axes-level equivalent of its default 'point' kind
f,ax1 = plt.subplots(figsize=(10,6))
ax2 = ax1.twinx()
sns.pointplot(x='genre', y='mean_votes_average',data=mean_gen,ax=ax1)
ax1.set_ylabel('votes_average')
ax1.set_ylim((4,7))

sns.pointplot(x='genre',y='mean_popular',data=mean_gen,ax=ax2,color='blue')
ax2.set_ylabel('popularity')
ax2.set_ylim((0,40))
ax1.set_xticklabels(mean_gen['genre'],rotation=90)

plt.show()

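[Figure: mean rating and mean popularity by genre]

The figure shows that foreign films are not popular; their ratings are not low, but that is largely because so few of them were rated. Animation, Science Fiction, Fantasy, and Action are relatively popular and also rate reasonably well. Drama, the most numerous genre, rates highly but has low popularity, possibly because most dramas are not commercial productions.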

mean_gen['profit'] = mean_gen['mean_revenue']-mean_gen['mean_budget']
s = mean_gen['profit'].sort_values(ascending=False)[:10]
pdf = mean_gen.loc[s.index]

plt.subplots(figsize=(10,6))
sns.barplot(x='profit',y='genre',data=pdf,palette='BuGn_r')
plt.xticks(fontsize=15)  # set the tick label font size
plt.yticks(fontsize=15)
plt.xlabel('Profit',fontsize=15)
plt.ylabel('Genres',fontsize=15)
plt.title('Top 10 Profit of Genres',fontsize=20)

plt.show()

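[Figure: bar chart of the top 10 genres by profit]

Animation, adventure, family, and science fiction are the most profitable genres; they suit the big screen and are also popular. Next we look at how the variables relate to one another.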

cordf = partdf.drop(l,axis=1)
cordf.columns  # the features we want to examine
 Index(['title', 'vote_average', 'vote_count', 'release_year', 'popularity',
       'budget', 'revenue'],
      dtype='object')
corrmat = cordf.drop('title',axis=1).corr()  # correlation is only defined for the numeric columns
f, ax = plt.subplots(figsize=(10,7))
sns.heatmap(corrmat,cbar=True, annot=True,vmax=.8, cmap='PuBu',square=True)

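[Figure: correlation heatmap of the selected features]

The heatmap shows a fairly strong relationship between vote count and popularity: the more people watch a movie, the more people rate it. Budget and revenue are also fairly strongly related, as are revenue and both popularity and vote count, which suggests that promoting a movie well matters. Let us look a little further.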

# budget and revenue both contain zeros in the data; we remove these dirty records
partdf = partdf[partdf['budget']>0]
partdf = partdf[partdf['revenue']>0]
partdf = partdf[partdf['vote_count']>3]
plt.subplots(figsize=(6,5))

plt.xlabel('Budget',fontsize=15)
plt.ylabel('Revenue',fontsize=15)
plt.title('Budget vs Revenue',fontsize=20)
sns.regplot(x='budget',y='revenue',data=partdf,ci=None)

[Figure: regression plot of budget vs revenue]

plt.subplots(figsize=(6,5))
plt.xlabel('vote_average',fontsize=15)
plt.ylabel('popularity',fontsize=15)
plt.title('Score vs Popular',fontsize=20)
sns.regplot(x='vote_average',y='popularity',data=partdf)

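[Figure: regression plot of rating vs popularity]

Budget and revenue, and rating and popularity, are roughly linearly related. For low-budget movies, however, budget has little effect on revenue, and highly rated movies are mostly popular as well. Next we check which movies earn the most, are the most popular, and have the best ratings.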

print(partdf.loc[partdf['revenue']==partdf['revenue'].max()]['title'])
print(partdf.loc[partdf['popularity']==partdf['popularity'].max()]['title'])
print(partdf.loc[partdf['vote_average']==partdf['vote_average'].max()]['title'])

0 Avatar
Name: title, dtype: object
546 Minions
Name: title, dtype: object
1881 The Shawshank Redemption
Name: title, dtype: object

partdf['profit'] = partdf['revenue']-partdf['budget']
print(partdf.loc[partdf['profit']==partdf['profit'].max()]['title'])

0 Avatar
Name: title, dtype: object

Minions is the most popular, Avatar earns the most, and The Shawshank Redemption has the best rating.

s1 = cordf.groupby(by='release_year').budget.sum()
s2 = cordf.groupby(by='release_year').revenue.sum()
sdf = pd.concat([s1,s2],axis=1)
sdf = sdf.iloc[-39:-2]
plt.plot(sdf)
plt.xticks(range(1979,2020,5))
plt.legend(sdf)
plt.show()

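[Figure: total budget and revenue per release year]

The film industry is clearly booming, which helps explain why big-budget productions are becoming more common.

Science fiction fans may also want to see which science fiction movies are the most popular:

# Most popular science fiction movies
s = partdf.loc[partdf['ScienceFiction']==1,'popularity'].sort_values(ascending=False)[:10]
sdf = partdf.loc[s.index]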
sns.barplot(x='popularity',y='title',data=sdf)
plt.show()

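[Figure: bar chart of the 10 most popular science fiction movies]

Interstellar is the most popular, followed closely by Guardians of the Galaxy; other genres can be examined the same way. Now let us look at how the people behind the movies influence the market. A good movie depends on everyone working in front of and behind the camera. Here we focus on directors and actors.

# Directors with the highest average revenue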
rev_d = fulldf.groupby('director')['revenue'].mean()
top_rev_d = rev_d.sort_values(ascending=False).head(20)
top_rev_d = pd.DataFrame(top_rev_d)
plt.subplots(figsize=(10,6))
sns.barplot(x='revenue',y=top_rev_d.index,data=top_rev_d,palette='BuGn_r')
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.xlabel('Average Revenue',fontsize=15)
plt.ylabel('Director',fontsize=15)
plt.title('Top 20 Revenue by Director',fontsize=20)
plt.show()

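[Figure: bar chart of the top 20 directors by average revenue]

These are the directors who do well at the box office. Which directors are the most prolific, or are both well reviewed and commercially successful?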

list2 = fulldf[fulldf['director']!=''].director.value_counts()[:10].sort_values(ascending=True)
list2 = pd.Series(list2)
list2

Oliver Stone 14
Renny Harlin 15
Steven Soderbergh 15
Robert Rodriguez 16
Spike Lee 16
Ridley Scott 16
Martin Scorsese 20
Clint Eastwood 20
Woody Allen 21
Steven Spielberg 27
Name: director, dtype: int64

plt.subplots(figsize=(10,6))
ax = list2.plot.barh(width=0.85,color='y')
for i,v in enumerate(list2.values):
    ax.text(.5, i, v,fontsize=12,color='white',weight='bold')
ax.patches[9].set_facecolor('g')
plt.title('Directors with highest movies')
plt.show()

[Figure: bar chart of the 10 directors with the most movies]

top_vote_d = fulldf[fulldf['vote_average']>=8].sort_values(by='vote_average',ascending=False)
top_vote_d = top_vote_d.dropna()
top_vote_d = top_vote_d.loc[:,['director','vote_average']]
tmp = rev_d.sort_values(ascending=False)
vote_rev_d = tmp[tmp.index.isin(list(top_vote_d['director']))]
vote_rev_d = vote_rev_d.sort_values(ascending=False)
vote_rev_d = pd.DataFrame(vote_rev_d)
plt.subplots(figsize=(10,6))
sns.barplot(x='revenue',y=vote_rev_d.index,data=vote_rev_d,palette='BuGn_r')
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.xlabel('Average Revenue',fontsize=15)
plt.ylabel('Director',fontsize=15)
plt.title('Revenue by vote above 8 Director',fontsize=20)
plt.show()

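[Figure: average revenue of directors with a movie rated 8 or above]

Now for the cast. The cast feature lists many people for each movie; fortunately it is ordered by importance, so we can treat the names near the top as the leading actors.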

fulldf['cast']=fulldf['cast'].str.strip('[]').str.replace(' ','').str.replace("'",'').str.replace('"','')
fulldf['cast']=fulldf['cast'].str.split(',')
list1=[]
for i in fulldf['cast']:
    list1.extend(i)
list1 = pd.Series(list1)
list1 = list1.value_counts()[:15].sort_values(ascending=True)
plt.subplots(figsize=(10,6))
ax = list1.plot.barh(width=0.9,color='green')
for i,v in enumerate(list1.values):
    ax.text(.8, i, v,fontsize=10,color='white',weight='bold')
plt.title('Actors with highest appearance')
ax.patches[14].set_facecolor('b')
plt.show()

[Figure: bar chart of the 15 actors with the most appearances]

fulldf['keywords'][2]

“[‘spy’, ‘based on novel’, ‘secret agent’, ‘sequel’, ‘mi6’, ‘british secret service’, ‘united kingdom’]”

from wordcloud import WordCloud, STOPWORDS
import nltk
from nltk.corpus import stopwords
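# If stopwords raises an error because the corpus is not installed, run import nltk; nltk.download() from an Anaconda prompt
# In the window that opens, select Corpora, then stopwords, refresh, and download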
import io
from PIL import Image
plt.subplots(figsize=(12,12))
stop_words=set(stopwords.words('english'))
stop_words.update([',',';','!','?','.','(',')','$','#','+',':','...',' ',''])  # pass one list so each token is added whole

img1 = Image.open('timg1.jpg')
hcmask1 = np.array(img1)
words=fulldf['keywords'].dropna().apply(nltk.word_tokenize)
word=[]
for i in words:
    word.extend(i)
word=pd.Series(word)
word=([i for i in word.str.lower() if i not in stop_words])
wc = WordCloud(background_color="black", max_words=4000, mask=hcmask1,
               stopwords=STOPWORDS, max_font_size= 60)
wc.generate(" ".join(word))


plt.imshow(wc,interpolation="bilinear")
plt.axis('off')
plt.figure()
plt.show()

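[Figure: word cloud of movie keywords]

The word cloud gives a rough picture of the keywords: woman director and independent film appear prominently, which may also reflect a trend in filmmaking.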
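A word cloud shows relative size but not exact counts. As a rough sketch of how the same impression could be checked numerically, the example below counts whole keywords with collections.Counter. The keyword lists are made-up sample data, not values taken from the dataset.

```python
from collections import Counter

# Made-up keyword lists, one per movie (illustrative only).
keyword_lists = [
    ["independent film", "woman director"],
    ["woman director", "based on novel"],
    ["independent film", "woman director", "sequel"],
]

# Count each whole keyword across all movies.
counts = Counter(kw for kws in keyword_lists for kw in kws)

# The two most frequent keywords with their counts.
top = counts.most_common(2)
print(top)
```

Counting whole keywords keeps phrases such as "woman director" together, whereas tokenizing into single words splits them apart.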

Movie Recommendation Model

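Building on the analysis above, we can put together a movie recommender. When searching for a movie, people typically look for movies of the same genre, movies by the same director or with the same actors, or highly rated movies, so the features we need are genres, cast, director, and the rating (vote_average).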

l[:5]

[‘Action’, ‘Adventure’, ‘Fantasy’, ‘ScienceFiction’, ‘Crime’]

Feature Vectorization

genre

def binary(genre_list):
    binaryList = []

    for genre in l:
        if genre in genre_list:
            binaryList.append(1)
        else:
            binaryList.append(0)

    return binaryList
fulldf['genre_vec'] = fulldf['genres'].apply(lambda x: binary(x))
fulldf['genre_vec'][0]

[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

cast

for i,j in zip(fulldf['cast'],fulldf.index):
    list2=[]
    list2=i[:4]
    list2.sort()
    fulldf.loc[j,'cast']=str(list2)
fulldf['cast'][0]
“[‘SamWorthington’, ‘SigourneyWeaver’, ‘StephenLang’, ‘ZoeSaldana’]”
fulldf['cast']=fulldf['cast'].str.strip('[]').str.replace(' ','').str.replace("'",'')
fulldf['cast']=fulldf['cast'].str.split(',')
fulldf['cast'][0]
[‘SamWorthington’, ‘SigourneyWeaver’, ‘StephenLang’, ‘ZoeSaldana’]
castList = []
for index, row in fulldf.iterrows():
    cast = row["cast"]
    for i in cast:
        if i not in castList:
            castList.append(i)
len(castList)
7515
def binary(cast_list):
    binaryList = []

    for genre in castList:
        if genre in cast_list:
            binaryList.append(1)
        else:
            binaryList.append(0)

    return binaryList
fulldf['cast_vec'] = fulldf['cast'].apply(lambda x:binary(x))
fulldf['cast_vec'].head(2)

0 [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
1 [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, …
Name: cast_vec, dtype: object

director

fulldf['director'][0]
‘James Cameron’
def xstr(s):
    if s is None:
        return ''
    return str(s)
fulldf['director']=fulldf['director'].apply(xstr)
directorList=[]
for i in fulldf['director']:
    if i not in directorList:
        directorList.append(i)
def binary(director):
    binaryList = []

    for direct in directorList:
        if direct == director:  # exact match; `in` between two strings would test for a substring
            binaryList.append(1)
        else:
            binaryList.append(0)

    return binaryList
fulldf['director_vec'] = fulldf['director'].apply(lambda x:binary(x))

keywords

fulldf['keywords'][0]

“[‘culture clash’, ‘future’, ‘space war’, ‘space colony’, ‘society’, ‘space travel’, ‘futuristic’, ‘romance’, ‘space’, ‘alien’, ‘tribe’, ‘alien planet’, ‘cgi’, ‘marine’, ‘soldier’, ‘battle’, ‘love affair’, ‘anti war’, ‘power relations’, ‘mind and soul’, ‘3d’]”

#change keywords to type list
fulldf['keywords']=fulldf['keywords'].str.strip('[]').str.replace(' ','').str.replace("'",'').str.replace('"','')
fulldf['keywords']=fulldf['keywords'].str.split(',')
for i,j in zip(fulldf['keywords'],fulldf.index):
    list2=[]
    list2 = i
    list2.sort()
    fulldf.loc[j,'keywords']=str(list2)
fulldf['keywords'][0]

“[‘3d’, ‘alien’, ‘alienplanet’, ‘antiwar’, ‘battle’, ‘cgi’, ‘cultureclash’, ‘future’, ‘futuristic’, ‘loveaffair’, ‘marine’, ‘mindandsoul’, ‘powerrelations’, ‘romance’, ‘society’, ‘soldier’, ‘space’, ‘spacecolony’, ‘spacetravel’, ‘spacewar’, ‘tribe’]”

fulldf['keywords']=fulldf['keywords'].str.strip('[]').str.replace(' ','').str.replace("'",'').str.replace('"','')
fulldf['keywords']=fulldf['keywords'].str.split(',')
words_list = []
for index, row in fulldf.iterrows():
    genres = row["keywords"]

    for genre in genres:
        if genre not in words_list:
            words_list.append(genre)
len(words_list)
9772
def binary(words):
    binaryList = []

    for genre in words_list:
        if genre in words:
            binaryList.append(1)
        else:
            binaryList.append(0)

    return binaryList
fulldf['words_vec'] = fulldf['keywords'].apply(lambda x: binary(x))

Recommendation Model

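We use cosine distance as the similarity measure, computing it between movies from the selected feature vectors; the 10 movies with the smallest distance are returned as recommendations.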
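To make the measure concrete: the cosine distance between two vectors u and v is 1 - (u·v) / (|u| |v|), which is what scipy.spatial.distance.cosine computes. It is 0 for vectors pointing the same way and 1 for vectors with nothing in common. The pure-Python sketch below is for illustration only; the function name cosine_distance and the handling of all-zero vectors are assumptions, not part of the original model.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumption: treat an all-zero vector as maximally distant.
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)

# Two toy genre vectors that share two of their genres.
a = [1, 1, 1, 1, 0]
b = [1, 1, 0, 0, 1]

d_ab = cosine_distance(a, b)                     # partial overlap
d_same = cosine_distance(a, a)                   # identical vectors
d_disjoint = cosine_distance([1, 0], [0, 1])     # no overlap
print(d_ab, d_same, d_disjoint)
```

Since the model sums four such distances, each feature contributes a value between 0 and 1 and all features are weighted equally.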

fulldf=fulldf[(fulldf['vote_average']!=0)] # remove movies with a score of 0 or without a director name
fulldf=fulldf[fulldf['director']!='']
from scipy import spatial

def Similarity(movieId1, movieId2):
    a = fulldf.iloc[movieId1]
    b = fulldf.iloc[movieId2]

    genresA = a['genre_vec']
    genresB = b['genre_vec']
    genreDistance = spatial.distance.cosine(genresA, genresB)

    castA = a['cast_vec']
    castB = b['cast_vec']
    castDistance = spatial.distance.cosine(castA, castB)

    directA = a['director_vec']
    directB = b['director_vec']
    directDistance = spatial.distance.cosine(directA, directB)

    wordsA = a['words_vec']
    wordsB = b['words_vec']
    wordsDistance = spatial.distance.cosine(wordsA, wordsB)
    return genreDistance + directDistance + castDistance + wordsDistance
Similarity(3,160)
2.7958758547680684
columns =['original_title','genres','vote_average','genre_vec','cast_vec','director','director_vec','words_vec']
tmp = fulldf.copy()
tmp =tmp[columns]
tmp['id'] = list(range(0,fulldf.shape[0]))
tmp.head()
original_title genres vote_average genre_vec cast_vec director director_vec words_vec id
0 Avatar [Action, Adventure, Fantasy, ScienceFiction] 7.2 [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … James Cameron [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, … 0
1 Pirates of the Caribbean: At World’s End [Adventure, Fantasy, Action] 6.9 [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, … Gore Verbinski [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … 1
2 Spectre [Action, Adventure, Crime] 6.3 [1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, … Sam Mendes [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … 2
3 The Dark Knight Rises [Action, Crime, Drama, Thriller] 7.6 [1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, … [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, … Christopher Nolan [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … 3
4 John Carter [Action, Adventure, ScienceFiction] 6.1 [1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … Andrew Stanton [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … 4
tmp.isnull().sum()

original_title 0
genres 0
vote_average 0
genre_vec 0
cast_vec 0
director 0
director_vec 0
words_vec 0
id 0
dtype: int64

import operator

def recommend(name):
    # Use the first movie whose title contains `name` (raises IndexError if nothing matches)
    film = tmp[tmp['original_title'].str.contains(name)].iloc[0]
    print('Selected Movie: ', film['original_title'])

    def getNeighbors(baseMovie, k=10):
        # Distance from the base movie to every other movie, using Similarity() defined earlier
        distances = []
        for index, movie in tmp.iterrows():
            if movie['id'] != baseMovie['id']:
                dist = Similarity(baseMovie['id'], movie['id'])
                distances.append((movie['id'], dist))

        # A smaller distance means a more similar movie; keep the k closest
        distances.sort(key=operator.itemgetter(1))
        return distances[:k]

    neighbors = getNeighbors(film)
    print('\nRecommended Movies: \n')

    for movie_id, _ in neighbors:
        row = tmp.iloc[movie_id]  # 'id' was assigned as the row position, so iloc is valid
        print(row['original_title'] + " | Genres: " +
              str(row['genres']).strip('[]').replace(' ', '') + " | Rating: " +
              str(row['vote_average']))

    print('\n')

recommend('Godfather')

Selected Movie: The Godfather: Part III

Recommended Movies:

The Godfather: Part II | Genres: 'Drama','Crime' | Rating: 8.3
The Godfather | Genres: 'Drama','Crime' | Rating: 8.4
The Rainmaker | Genres: 'Drama','Crime','Thriller' | Rating: 6.7
The Outsiders | Genres: 'Crime','Drama' | Rating: 6.9
The Conversation | Genres: 'Crime','Drama','Mystery' | Rating: 7.5
The Cotton Club | Genres: 'Music','Drama','Crime','Romance' | Rating: 6.6
Apocalypse Now | Genres: 'Drama','War' | Rating: 8.0
Twixt | Genres: 'Horror','Thriller' | Rating: 5.0
New York Stories | Genres: 'Comedy','Drama','Romance' | Rating: 6.2
Peggy Sue Got Married | Genres: 'Comedy','Drama','Fantasy','Romance' | Rating: 5.9
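The neighbor search above can also be written with heapq.nsmallest, which avoids sorting the full list. The following is only a minimal sketch on toy data: toy_vectors, cosine_distance and k_nearest are hypothetical names, and cosine distance is an assumed metric that may differ from the Similarity() function used in this article.

```python
import heapq
import math

# Hypothetical toy data: id -> binary feature vector (a stand-in for genre_vec and friends)
toy_vectors = {
    0: [1, 1, 0, 0],
    1: [1, 1, 0, 1],
    2: [0, 0, 1, 1],
    3: [1, 0, 0, 0],
}

def cosine_distance(u, v):
    """1 - cosine similarity. An assumed metric, not necessarily the article's Similarity()."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 if norm == 0 else 1.0 - dot / norm

def k_nearest(base_id, k=2):
    """Return the k (id, distance) pairs closest to base_id, nearest first."""
    candidates = (
        (other_id, cosine_distance(toy_vectors[base_id], vec))
        for other_id, vec in toy_vectors.items()
        if other_id != base_id
    )
    return heapq.nsmallest(k, candidates, key=lambda pair: pair[1])

nearest = k_nearest(0)
print(nearest)
```

Here item 1 shares both features with item 0 and item 3 shares one, so they come back in that order, while item 2 (no overlap) is left out.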

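Notes on the functions used

Handling JSON

JSON is a data-interchange format. An object is a set of key/value pairs, and values can be strings, numbers, booleans, null, arrays, or nested objects.
- json.loads() decodes a JSON string into a Python object (a dict for a JSON object);
- its inverse, json.dumps(), serializes a Python object to a JSON string; to store it in a file this way, call dumps first and then write the string yourself;
- json.dump() serializes a Python object and writes it directly to an open file: json.dump(obj, file);
- json.load() reads JSON from an open file: json.load(file).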

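exam = {'a':'1111','b':'2222','c':'3333','d':'4444'}
file = 'exam.json'
jsobj = json.dumps(exam)
# solution 1: serialize to a string, then write it (the with block closes the file)
with open(file,'w') as f:
    f.write(jsobj)
# solution 2: let json.dump write to the file object directly
with open(file,'w') as f:
    json.dump(exam,f)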
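As a quick check of the four functions, here is a small round-trip sketch. The temporary file path and the raw_genres string are illustrative assumptions; the string only imitates the shape of the dataset's genres column.

```python
import json
import os
import tempfile

exam = {'a': '1111', 'b': '2222'}

# In-memory round trip: dict -> JSON string -> dict
text = json.dumps(exam)
restored = json.loads(text)

# File round trip: json.dump writes to a file object, json.load reads from one
path = os.path.join(tempfile.mkdtemp(), 'exam.json')
with open(path, 'w') as f:
    json.dump(exam, f)
with open(path) as f:
    loaded = json.load(f)

# An illustrative cell shaped like the dataset's genres column
raw_genres = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'
genre_names = [item['name'] for item in json.loads(raw_genres)]
print(genre_names)
```

The same json.loads call followed by a list comprehension is one way to turn such a column into plain lists of names.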

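The zip() operation

  • zip() takes iterables as arguments and packs their corresponding elements into tuples. In Python 3 it returns a lazy iterator over those tuples rather than a list, and it stops at the shortest input.
  • The inverse ("unzipping") is zip(*zipped), for example: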
a = [1,2,3]
b = [4,5,6]
c = [4,5,6,7,8]
zipped = zip(a,b)
for i in zipped:
    print(i)
print('\n')
shor_z = zip(a,c)
for j in shor_z:# stops at the shortest input
    print(j)
(1, 4)
(2, 5)
(3, 6)


(1, 4)
(2, 5)
(3, 6)
z=list(zip(a,b))
z
[(1, 4), (2, 5), (3, 6)]
list(zip(*z))# wrap in list() to see the contents
[(1, 2, 3), (4, 5, 6)]
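Because zip() returns an iterator in Python 3, it can only be consumed once. A minimal sketch of that behaviour, together with the unzip idiom (the variable names are just for illustration):

```python
a = [1, 2, 3]
b = [4, 5, 6]

zipped = zip(a, b)
first_pass = list(zipped)   # consumes the iterator
second_pass = list(zipped)  # nothing is left the second time

# "Unzipping": zip(*pairs) transposes the pairs back into one tuple per input
cols = list(zip(*first_pass))
print(first_pass, second_pass, cols)
```

If the pairs are needed more than once, storing list(zip(a, b)) up front, as in the example with z, avoids the empty second pass.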

pandas merge/rename

pd.merge(): merging on keys

a=pd.DataFrame({'lkey':['foo','foo','bar','bar'],'value':[1,2,3,4]})
a
lkey value
0 foo 1
1 foo 2
2 bar 3
3 bar 4
for index,row in a.iterrows():
    print(index)
    print('*****')
    print(row)
0
*****
lkey     foo
value      1
Name: 0, dtype: object
1
*****
lkey     foo
value      2
Name: 1, dtype: object
2
*****
lkey     bar
value      3
Name: 2, dtype: object
3
*****
lkey     bar
value      4
Name: 3, dtype: object
b=pd.DataFrame({'rkey':['foo','foo','bar','bar'],'value':[5,6,7,8]})
b
rkey value
0 foo 5
1 foo 6
2 bar 7
3 bar 8
pd.merge(a,b,left_on='lkey',right_on='rkey',how='left')
lkey value_x rkey value_y
0 foo 1 foo 5
1 foo 1 foo 6
2 foo 2 foo 5
3 foo 2 foo 6
4 bar 3 bar 7
5 bar 3 bar 8
6 bar 4 bar 7
7 bar 4 bar 8
pd.merge(a,b,left_on='lkey',right_on='rkey',how='inner')
lkey value_x rkey value_y
0 foo 1 foo 5
1 foo 1 foo 6
2 foo 2 foo 5
3 foo 2 foo 6
4 bar 3 bar 7
5 bar 3 bar 8
6 bar 4 bar 7
7 bar 4 bar 8
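In the example above every key in a also appears in b, so how='left' and how='inner' return the same rows. A minimal sketch where the two differ, using made-up frames (left, right and the key 'baz' are assumptions for illustration):

```python
import pandas as pd

left = pd.DataFrame({'lkey': ['foo', 'bar', 'baz'], 'value': [1, 2, 3]})
right = pd.DataFrame({'rkey': ['foo', 'bar'], 'value': [5, 6]})

# how='left' keeps every row of the left frame; unmatched keys get NaN on the right side
left_join = pd.merge(left, right, left_on='lkey', right_on='rkey', how='left')

# how='inner' keeps only keys present in both frames
inner_join = pd.merge(left, right, left_on='lkey', right_on='rkey', how='inner')

print(left_join)
print(inner_join)
```

The 'baz' row survives the left join with missing rkey and value_y, and is dropped by the inner join.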

DataFrame.rename(): renaming rows and columns

dframe= pd.DataFrame(np.arange(12).reshape((3, 4)),
                 index=['NY', 'LA', 'SF'],
                 columns=['A', 'B', 'C', 'D'])
dframe
A B C D
NY 0 1 2 3
LA 4 5 6 7
SF 8 9 10 11
dframe.rename(columns={'A':'alpha'})
alpha B C D
NY 0 1 2 3
LA 4 5 6 7
SF 8 9 10 11
dframe
A B C D
NY 0 1 2 3
LA 4 5 6 7
SF 8 9 10 11
dframe.rename(columns={'A':'alpha'},inplace=True)
dframe
alpha B C D
NY 0 1 2 3
LA 4 5 6 7
SF 8 9 10 11
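The examples above rename only a column. As a small sketch of the row side, rename() also accepts an index mapping, and either argument can be a function instead of a dict (the new label 'New York' is just an illustration):

```python
import numpy as np
import pandas as pd

dframe = pd.DataFrame(np.arange(12).reshape((3, 4)),
                      index=['NY', 'LA', 'SF'],
                      columns=['A', 'B', 'C', 'D'])

# A dict renames selected row labels; a function is applied to every column label
renamed = dframe.rename(index={'NY': 'New York'}, columns=str.lower)
print(renamed)
```

Without inplace=True the original dframe is left unchanged and a new frame is returned.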

pandas datetime format

pd.to_datetime() converts values such as date strings to the datetime format
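A minimal sketch of the conversion, using two date strings in the same format as the release_date column plus one deliberately invalid value (the series itself is made up for illustration):

```python
import pandas as pd

dates = pd.Series(['2009-12-10', '2013-10-13', 'not a date'])

# errors='coerce' turns values that cannot be parsed into NaT instead of raising
parsed = pd.to_datetime(dates, errors='coerce')

# The .dt accessor exposes components such as year and month
years = parsed.dt.year
print(parsed)
print(years)
```

Once a column is in datetime format, grouping by parsed.dt.year is a convenient way to study changes over time.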

Wordcloud

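The wordcloud module:
1. Installation: in a conda command prompt, run conda install -c conda-forge wordcloud
2. Steps: read the background image and the text, instantiate a WordCloud object wc, call wc.generate(text) to build the cloud, and display it with plt.imshow().
Parameters:
mask: the mask image; it defines the shape in which the words are laid out (colors can also be taken from the image, via ImageColorGenerator)
background_color: background color, black by default
max_font_size: the largest font size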

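A brief introduction to nltk

from nltk.corpus import stopwords
If this raises an error saying the stopwords corpus is not installed, run import nltk; nltk.download() from an Anaconda prompt.
In the window that opens, select stopwords on the Corpora tab, refresh, and download.
Likewise, select Punkt Tokenizer Models on the Models tab, refresh, and download; this makes the tokenizers available:
nltk.sent_tokenize(text) # split text into sentences

nltk.word_tokenize(sent) # split a sentence into words

stopwords: as I understand it, words that occur very often, contribute little to the meaning, and can simply be filtered out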
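To make the idea of stopword filtering concrete without needing the nltk downloads, here is a pure-Python sketch. STOP_WORDS is a tiny hypothetical list and keywords() is an illustrative helper; nltk's stopwords.words('english') is much longer, and nltk.word_tokenize is more careful than this regular expression.

```python
import re

# Hypothetical mini stop list, for illustration only
STOP_WORDS = {'the', 'a', 'an', 'of', 'in', 'is', 'to', 'and'}

def keywords(text):
    """Lowercase the text, split it into word tokens, and drop the stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(keywords('Enter the World of Pandora.'))
```

What remains after filtering is what a keyword analysis or a word cloud would count.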

References:

what’s my score
TMDB means per genre


I am still a beginner, so feedback is welcome!
