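The data comes from the TMDB 5000 movie dataset on Kaggle. This analysis covers visualization of the movie data and a simple movie recommendation model, including:
1. The distribution of movie genres and how it changes over time
2. The relationships among profit, rating, and popularity
3. Which directors' movies sell well or are well received
4. The hardest-working cast and crew
5. Movie keyword analysis
6. Similarity-based movie recommendation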

Data analysis

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
import json
import warnings
warnings.filterwarnings('ignore')  # suppress warnings

movie = pd.read_csv('tmdb_5000_movies.csv')
credit = pd.read_csv('tmdb_5000_credits.csv')

movie.head(1)


	budget	genres	homepage	id	keywords	original_language	original_title	overview	popularity	production_companies	production_countries	release_date	revenue	runtime	spoken_languages	status	tagline	title	vote_average	vote_count
0	237000000	[{“id”: 28, “name”: “Action”}, {“id”: 12, “nam…	http://www.avatarmovie.com/	19995	[{“id”: 1463, “name”: “culture clash”}, {“id”:…	en	Avatar	In the 22nd century, a paraplegic Marine is di…	150.437577	[{“name”: “Ingenious Film Partners”, “id”: 289…	[{“iso_3166_1”: “US”, “name”: “United States o…	2009-12-10	2787965087	162.0	[{“iso_639_1”: “en”, “name”: “English”}, {“iso…	Released	Enter the World of Pandora.	Avatar	7.2	11800

movie.tail(3)


	genres	homepage	id	keywords	original_language	original_title	overview	popularity	production_companies	production_countries	release_date	runtime	spoken_languages	status	tagline	title	vote_average	vote_count
4800	[{“id”: 35, “name”: “Comedy”}, {“id”: 18, “nam…	http://www.hallmarkchannel.com/signedsealeddel…	231617	[{“id”: 248, “name”: “date”}, {“id”: 699, “nam…	en	Signed, Sealed, Delivered	“Signed, Sealed, Delivered” introduces a dedic…	1.444476	[{“name”: “Front Street Pictures”, “id”: 3958}…	[{“iso_3166_1”: “US”, “name”: “United States o…	2013-10-13	120.0	[{“iso_639_1”: “en”, “name”: “English”}]	Released	NaN	Signed, Sealed, Delivered	7.0	6
4801	[]	http://shanghaicalling.com/	126186	[]	en	Shanghai Calling	When ambitious New York attorney Sam is sent t…	0.857008	[]	[{“iso_3166_1”: “US”, “name”: “United States o…	2012-05-03	98.0	[{“iso_639_1”: “en”, “name”: “English”}]	Released	A New Yorker in Shanghai	Shanghai Calling	5.7	7
4802	[{“id”: 99, “name”: “Documentary”}]	NaN	25975	[{“id”: 1523, “name”: “obsession”}, {“id”: 224…	en	My Date with Drew	Ever since the second grade when he first saw …	1.929883	[{“name”: “rusty bear entertainment”, “id”: 87…	[{“iso_3166_1”: “US”, “name”: “United States o…	2005-08-05	90.0	[{“iso_639_1”: “en”, “name”: “English”}]	Released	NaN	My Date with Drew	6.3	16

movie.info()  # 4803 samples; some features have missing values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
budget                  4803 non-null int64
genres                  4803 non-null object
homepage                1712 non-null object
id                      4803 non-null int64
keywords                4803 non-null object
original_language       4803 non-null object
original_title          4803 non-null object
overview                4800 non-null object
popularity              4803 non-null float64
production_companies    4803 non-null object
production_countries    4803 non-null object
release_date            4802 non-null object
revenue                 4803 non-null int64
runtime                 4801 non-null float64
spoken_languages        4803 non-null object
status                  4803 non-null object
tagline                 3959 non-null object
title                   4803 non-null object
vote_average            4803 non-null float64
vote_count              4803 non-null int64
dtypes: float64(3), int64(4), object(13)
memory usage: 750.5+ KB

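There are 4803 samples, and some features have missing values. homepage and tagline are missing the most, but neither affects the basic analysis; release_date and runtime can be filled in. On closer inspection, some samples have [] as the value of genres, keywords, or production_companies, which needs attention.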

credit.info()

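Data cleaning

Many features are stored as JSON, that is, as dictionary-like key-value pairs. To make later processing easier, we convert them to a str or list that is convenient to handle in Python, so that the useful information can be extracted.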
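As a minimal sketch of that conversion, the snippet below parses one JSON string shaped like the genres cells shown earlier and keeps only the values under the "name" key. The helper name `extract_names` and the sample string are illustrative assumptions, not part of the original notebook.

```python
import json

# A raw cell shaped like the values in the genres column (sample value for illustration).
raw_genres = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

def extract_names(raw):
    """Parse a JSON array of objects and return the value of each "name" key."""
    return [item["name"] for item in json.loads(raw)]

genre_names = extract_names(raw_genres)
empty_names = extract_names("[]")  # some rows hold an empty JSON array

print(genre_names)
print(empty_names)
```

A helper like this could be passed to `apply` on a column, which would store Python lists directly instead of their string form.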

# movie genres, used for categorization
movie['genres']=movie['genres'].apply(json.loads)
# apply runs a function on each element of the column (or along a row/column axis of a DataFrame)

movie['genres'].head(1)

0    [{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...
Name: genres, dtype: object

list(zip(movie.index,movie['genres']))[:2]

[(0,
  [{'id': 28, 'name': 'Action'},
   {'id': 12, 'name': 'Adventure'},
   {'id': 14, 'name': 'Fantasy'},
   {'id': 878, 'name': 'Science Fiction'}]),
 (1,
  [{'id': 12, 'name': 'Adventure'},
   {'id': 14, 'name': 'Fantasy'},
   {'id': 28, 'name': 'Action'}])]

for index,i in zip(movie.index,movie['genres']):
    list1=[]
    for j in range(len(i)):
        list1.append((i[j]['name']))# name:genres,Action...
    movie.loc[index,'genres']=str(list1)

movie.head(1)
# the genres column is no longer JSON: the value under each name key (the genre) has been extracted and assigned back to genres


	budget	genres	homepage	id	keywords	original_language	original_title	overview	popularity	production_companies	production_countries	release_date	revenue	runtime	spoken_languages	status	tagline	title	vote_average	vote_count
0	237000000	[‘Action’, ‘Adventure’, ‘Fantasy’, ‘Science Fi…	http://www.avatarmovie.com/	19995	[{“id”: 1463, “name”: “culture clash”}, {“id”:…	en	Avatar	In the 22nd century, a paraplegic Marine is di…	150.437577	[{“name”: “Ingenious Film Partners”, “id”: 289…	[{“iso_3166_1”: “US”, “name”: “United States o…	2009-12-10	2787965087	162.0	[{“iso_639_1”: “en”, “name”: “English”}, {“iso…	Released	Enter the World of Pandora.	Avatar	7.2	11800

# apply the same approach to the keywords column
movie['keywords'] = movie['keywords'].apply(json.loads)
for index,i in zip(movie.index,movie['keywords']):
    list2=[]
    for j in range(len(i)):
        list2.append(i[j]['name'])
    movie.loc[index,'keywords'] = str(list2)

# likewise for production_companies
movie['production_companies'] = movie['production_companies'].apply(json.loads)
for index,i in zip(movie.index,movie['production_companies']):
    list3=[]
    for j in range(len(i)):
        list3.append(i[j]['name'])
    movie.loc[index,'production_companies']=str(list3)

movie['production_countries'] = movie['production_countries'].apply(json.loads)
for index,i in zip(movie.index,movie['production_countries']):
    list3=[]
    for j in range(len(i)):
        list3.append(i[j]['name'])
    movie.loc[index,'production_countries']=str(list3)

movie['spoken_languages'] = movie['spoken_languages'].apply(json.loads)
for index,i in zip(movie.index,movie['spoken_languages']):
    list3=[]
    for j in range(len(i)):
        list3.append(i[j]['name'])
    movie.loc[index,'spoken_languages']=str(list3)

movie.head(1)


	budget	genres	homepage	id	keywords	original_language	original_title	overview	popularity	production_companies	production_countries	release_date	revenue	runtime	spoken_languages	status	tagline	title	vote_average	vote_count
0	237000000	[‘Action’, ‘Adventure’, ‘Fantasy’, ‘Science Fi…	http://www.avatarmovie.com/	19995	[‘culture clash’, ‘future’, ‘space war’, ‘spac…	en	Avatar	In the 22nd century, a paraplegic Marine is di…	150.437577	[‘Ingenious Film Partners’, ‘Twentieth Century…	[‘United States of America’, ‘United Kingdom’]	2009-12-10	2787965087	162.0	[‘English’, ‘Español’]	Released	Enter the World of Pandora.	Avatar	7.2	11800

credit.head(1)


	movie_id	title	cast	crew
0	19995	Avatar	[{“cast_id”: 242, “character”: “Jake Sully”, “…	[{“credit_id”: “52fe48009251416c750aca23”, “de…

credit['cast'] = credit['cast'].apply(json.loads)
for index,i in zip(credit.index,credit['cast']):
    list3=[]
    for j in range(len(i)):
        list3.append(i[j]['name'])
    credit.loc[index,'cast']=str(list3)

credit['crew'] = credit['crew'].apply(json.loads)
# extract the director from crew and add a director column for later analysis
def director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
credit['crew']=credit['crew'].apply(director)
credit.rename(columns={'crew':'director'},inplace=True)

credit.head(1)


	movie_id	title	cast	director
0	19995	Avatar	[‘Sam Worthington’, ‘Zoe Saldana’, ‘Sigourney …	James Cameron

The id column in movie matches movie_id in credit, so the two tables can be merged to bring all the information into a single table.
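To illustrate how a left merge on differently named key columns behaves, here is a small sketch on toy tables. The frames `movie_toy` and `credit_toy`, the ids 1 and 2, and the titles other than Avatar are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the movie and credit tables (only id 19995 comes from the data shown above).
movie_toy = pd.DataFrame({
    "id": [19995, 1, 2],
    "title": ["Avatar", "Hypothetical Film A", "Hypothetical Film B"],
})
credit_toy = pd.DataFrame({
    "movie_id": [19995, 1],
    "title": ["Avatar", "Hypothetical Film A"],
    "director": ["James Cameron", "Hypothetical Director"],
})

# how='left' keeps every row of movie_toy; rows without a match get NaN in the credit columns.
merged = pd.merge(movie_toy, credit_toy, left_on="id", right_on="movie_id", how="left")

print(merged.columns.tolist())
print(merged)
```

Both key columns survive the merge, and the shared title column is split into title_x and title_y, which is why one of them is dropped afterwards.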

fulldf = pd.merge(movie,credit,left_on='id',right_on='movie_id',how='left')

fulldf.head(1)


	budget	genres	homepage	id	keywords	original_language	original_title	overview	popularity	production_companies	…	spoken_languages	status	tagline	title_x	vote_average	vote_count	movie_id	title_y	cast	director
0	237000000	[‘Action’, ‘Adventure’, ‘Fantasy’, ‘Science Fi…	http://www.avatarmovie.com/	19995	[‘culture clash’, ‘future’, ‘space war’, ‘spac…	en	Avatar	In the 22nd century, a paraplegic Marine is di…	150.437577	[‘Ingenious Film Partners’, ‘Twentieth Century…	…	[‘English’, ‘Español’]	Released	Enter the World of Pandora.	Avatar	7.2	11800	19995	Avatar	[‘Sam Worthington’, ‘Zoe Saldana’, ‘Sigourney …	James Cameron

1 rows × 24 columns

fulldf.shape

(4803, 24)

# both tables have a title column; after the merge they are automatically named title_x and title_y
fulldf.rename(columns={'title_x':'title'},inplace=True)
fulldf.drop('title_y',axis=1,inplace=True)

# missing values
NAs = pd.DataFrame(fulldf.isnull().sum())
NAs[NAs.sum(axis=1)>0].sort_values(by=[0],ascending=False)


	0
homepage	3091
tagline	844
director	30
overview	3
runtime	2
release_date	1

# fill in release_date
fulldf.loc[fulldf['release_date'].isnull(),'title']

4553    America Is Still the Place
Name: title, dtype: object

# filled in after looking it up online
fulldf['release_date']=fulldf['release_date'].fillna('2014-06-01')

# runtime is the movie length; fill missing values with the mean
fulldf['runtime'] = fulldf['runtime'].fillna(fulldf['runtime'].mean())

# for convenience, convert release_date (object) to datetime and extract the year and month
fulldf['release_year'] = pd.to_datetime(fulldf['release_date'],format='%Y-%m-%d').dt.year
fulldf['release_month'] = pd.to_datetime(fulldf['release_date'],format='%Y-%m-%d').dt.month

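Data exploration

# movie genres
# looking at the format, some str processing is needed: first strip the surrounding square brackets
# remove the spaces between adjacent genres
# then remove the single quotes and split on ','
fulldf['genres']=fulldf['genres'].str.strip('[]').str.replace(" ","").str.replace("'","")

# genres are now separated by ','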
fulldf['genres']=fulldf['genres'].str.split(',')

list1=[]
for i in fulldf['genres']:
    list1.extend(i)
gen_list=pd.Series(list1).value_counts()[:10].sort_values(ascending=False)
gen_df = gen_list.to_frame(name='Total')  # name the count column explicitly, whatever the Series was called

fulldf.loc[4801]

  budget                                                                  0
genres                                                                 []
homepage                                      http://shanghaicalling.com/
id                                                                 126186
keywords                                                               []
original_language                                                      en
original_title                                           Shanghai Calling
overview                When ambitious New York attorney Sam is sent t...
popularity                                                       0.857008
production_companies                                                   []
production_countries                ['United States of America', 'China']
release_date                                                   2012-05-03
revenue                                                                 0
runtime                                                                98
spoken_languages                                              ['English']
status                                                           Released
tagline                                          A New Yorker in Shanghai
title                                                    Shanghai Calling
vote_average                                                          5.7
vote_count                                                              7
movie_id                                                           126186
cast                    ['Daniel Henney', 'Eliza Coupe', 'Bill Paxton'...
director                                                      Daniel Hsia
release_year                                                         2012
release_month                                                           5
Name: 4801, dtype: object

plt.subplots(figsize=(10,8))
sns.barplot(y=gen_df.index,x='Total',data=gen_df,palette='GnBu_d')
plt.xticks(fontsize=15)  # set the tick label font size
plt.yticks(fontsize=15)
plt.xlabel('Total',fontsize=15)
plt.ylabel('Genres',fontsize=15)
plt.title('Top 10 Genres',fontsize=20)
plt.show()

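The ten most common genres include drama, comedy, thriller, and action, which are also the genres most often seen in cinemas today. What lies behind the large number of movies in these genres?
Let's look at how the number of movies relates to time.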

# deduplicate the genres
l=[]
for i in list1:
    if i not in l:
        l.append(i)
#l.remove("")  # some movies have an empty genre
len(l)  # l is the deduplicated list of genres

year_min = fulldf['release_year'].min()
year_max = fulldf['release_year'].max()

year_genr =pd.DataFrame(index=l,columns=range(year_min,year_max+1))  # genres as the index, years as columns; holds the count of each genre per year
year_genr.fillna(value=0,inplace=True)  # initialize to 0


intil_y = np.array(fulldf['release_year'])  # release year of each movie, indexed in step with the loop below
z = 0
for i in fulldf['genres']:
    splt_gen = list(i)  # all genres of one movie
    for j in splt_gen:
        year_genr.loc[j,intil_y[z]] = year_genr.loc[j,intil_y[z]]+1  # count this genre for the given year
    z+=1

year_genr = year_genr.sort_values(by=2006,ascending=False)
year_genr = year_genr.iloc[0:10,-49:-1]
year_genr


	1969	1970	1971	1972	1973	1974	1975	1976	1977	1978	…	2007	2008	2009	2010	2011	2012	2013	2014	2015	2016
Drama	7	8	3	6	4	3	2	4	8	5	…	97	106	122	115	99	79	110	110	95	37
Comedy	3	4	1	3	3	3	3	3	3	1	…	67	82	97	87	82	80	71	62	52	26
Thriller	3	2	3	1	2	1	1	2	4	5	…	53	55	59	56	69	58	53	66	67	27
Action	4	4	4	1	2	1	1	2	6	5	…	44	46	51	49	58	43	56	54	46	39
Romance	2	1	3	0	1	2	1	2	2	3	…	37	38	57	45	30	39	25	24	23	9
Family	0	0	1	0	0	1	0	1	0	1	…	20	29	28	29	28	17	22	23	17	9
Crime	3	0	2	3	2	2	0	2	0	0	…	28	33	32	30	24	27	37	27	26	10
Adventure	2	3	1	2	1	2	2	2	5	4	…	25	37	36	30	32	25	36	37	35	23
Fantasy	0	0	1	0	0	0	1	0	2	2	…	19	20	22	21	15	19	21	16	10	13
Horror	0	0	1	1	1	1	1	1	3	4	…	27	21	30	27	24	33	25	21	33	20

10 rows × 48 columns

plt.subplots(figsize=(10,8))
plt.plot(year_genr.T)
plt.title('Genres vs Time',fontsize=20)
plt.xticks(range(1969,2020,5))
plt.legend(year_genr.T)
plt.show()

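From around 1994, the film industry entered a boom: every genre grew substantially, with drama, comedy, thriller, and action growing the most. The large number of movies in these genres is therefore partly tied to the overall growth of the film industry.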

# for convenience, build a new DataFrame with a subset of features and analyze how they relate to genre
partdf = fulldf[['title','vote_average','vote_count','release_year','popularity','budget','revenue']].reset_index(drop=True)

partdf.head(2)


	title	vote_average	vote_count	release_year	popularity	budget	revenue
0	Avatar	7.2	11800	2009	150.437577	237000000	2787965087
1	Pirates of the Caribbean: At World’s End	6.9	4500	2007	139.082615	300000000	961000000

Because a movie can belong to several genres, we add each genre as a column; for each movie, the value is 1 if it belongs to that genre and 0 otherwise.
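The encoding can be sketched on two movies with plain Python. The dictionary `movies` is a toy stand-in for the genres column, and `indicator` is a hypothetical name for the resulting 0/1 table.

```python
# Toy data: title -> list of genres (genre names written without spaces, as after cleaning).
movies = {
    "Avatar": ["Action", "Adventure", "Fantasy", "ScienceFiction"],
    "Spectre": ["Action", "Adventure", "Crime"],
}

# Collect the distinct genres in order of first appearance.
all_genres = []
for genres in movies.values():
    for g in genres:
        if g not in all_genres:
            all_genres.append(g)

# One 0/1 entry per (movie, genre) pair: 1 if the movie has the genre, else 0.
indicator = {
    title: {g: int(g in genres) for g in all_genres}
    for title, genres in movies.items()
}

print(all_genres)
print(indicator["Spectre"])
```

Each inner dictionary corresponds to one row of the genre columns that are added to partdf.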

for per in l:
    partdf[per]=0

    z=0
    for gen in fulldf['genres']:

        if per in list(gen):
            partdf.loc[z,per] = 1
        else:
            partdf.loc[z,per] = 0
        z+=1
partdf.head(2)


	title	vote_average	vote_count	release_year	popularity	budget	revenue	Action	Adventure	Fantasy	…	Romance	Horror	Mystery	History	War	Music	Documentary	Foreign	TVMovie
0	Avatar	7.2	11800	2009	150.437577	237000000	2787965087	1	1	1	…	0	0	0	0	0	0	0	0	0	0
1	Pirates of the Caribbean: At World’s End	6.9	4500	2007	139.082615	300000000	961000000	1	1	1	…	0	0	0	0	0	0	0	0	0	0

2 rows × 28 columns

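Next we want the mean of some features for each genre. We create a new DataFrame with one row per genre and columns holding feature means such as the rating (vote), revenue, and popularity.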
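As an alternative sketch of the same per-genre aggregation, `DataFrame.explode` can turn a list-valued genres column into one row per (movie, genre) pair, after which a single `groupby` yields the means. This is not the approach used in this post, and the frame `toy` with its values is made up for illustration.

```python
import pandas as pd

# Toy table: each movie carries a list of genres plus two numeric features.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": [["Action", "Drama"], ["Drama"], ["Action"]],
    "vote_average": [6.0, 8.0, 7.0],
    "revenue": [100, 300, 200],
})

# explode repeats a movie once per genre, so a movie counts toward every genre it has.
per_genre = (
    toy.explode("genres")
       .groupby("genres")[["vote_average", "revenue"]]
       .mean()
)

print(per_genre)
```

A movie with several genres contributes to each of them, which matches what the 0/1 genre columns do in the loop-based version.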

mean_gen = pd.DataFrame(l)

# mean of the rating
newArray = []*len(l)
for genre in l:
    newArray.append(partdf.groupby(genre, as_index=True)['vote_average'].mean())
# each element of newArray holds the mean for group 0 (not this genre) and group 1 (this genre); we only need [1]
newArray2 = []*len(l)
for i in range(len(l)):
    newArray2.append(newArray[i][1])

mean_gen['mean_votes_average']=newArray2
mean_gen.head(2)


	0	mean_votes_average
0	Action	5.989515
1	Adventure	6.156962

# apply the same approach to the other features
# budget
newArray = []*len(l)
for genre in l:
    newArray.append(partdf.groupby(genre, as_index=True)['budget'].mean())
newArray2 = []*len(l)
for i in range(len(l)):
    newArray2.append(newArray[i][1])

mean_gen['mean_budget']=newArray2

# revenue
newArray = []*len(l)
for genre in l:
    newArray.append(partdf.groupby(genre, as_index=True)['revenue'].mean())
newArray2 = []*len(l)
for i in range(len(l)):
    newArray2.append(newArray[i][1])

mean_gen['mean_revenue']=newArray2

# popularity: number of views of the related pages
newArray = []*len(l)
for genre in l:
    newArray.append(partdf.groupby(genre, as_index=True)['popularity'].mean())
newArray2 = []*len(l)
for i in range(len(l)):
    newArray2.append(newArray[i][1])

mean_gen['mean_popular']=newArray2

# vote_count (number of ratings): aggregated with count(), which gives the number of movies in each genre
newArray = []*len(l)
for genre in l:
    newArray.append(partdf.groupby(genre, as_index=True)['vote_count'].count())
newArray2 = []*len(l)
for i in range(len(l)):
    newArray2.append(newArray[i][1])

mean_gen['vote_count']=newArray2

mean_gen.rename(columns={0:'genre'},inplace=True)
mean_gen.replace('','none',inplace=True)
# 'none' marks movies whose genre or other features are missing; the count is very small, so we drop that row
mean_gen.drop(20,inplace=True)

mean_gen['vote_count'].describe()

count 20.000000
mean 608.000000
std 606.931974
min 8.000000
25% 174.750000
50% 468.500000
75% 816.000000
max 2297.000000
Name: vote_count, dtype: float64

mean_gen['mean_votes_average'].describe()

count 20.000000
mean 6.173921
std 0.278476
min 5.626590
25% 6.009644
50% 6.180978
75% 6.344325
max 6.719797
Name: mean_votes_average, dtype: float64

f, ax1 = plt.subplots(figsize=(10,6))
ax2 = ax1.twinx()  # second y-axis sharing the same x-axis
# factorplot is figure-level and was later renamed catplot; pointplot draws the same point plot on a given axes
sns.pointplot(x='genre', y='mean_votes_average', data=mean_gen, ax=ax1)
ax1.set_ylabel('votes_average')
ax1.set_ylim((4,7))

sns.pointplot(x='genre', y='mean_popular', data=mean_gen, ax=ax2, color='blue')
ax2.set_ylabel('popularity')
ax2.set_ylim((0,40))
ax1.set_xticklabels(mean_gen['genre'], rotation=90)

plt.show()

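The plot shows that foreign films are not popular; their ratings are not low, but that is because few people rate them. Animation, Science Fiction, Fantasy, and Action are more popular and are also rated reasonably well. Drama, the most numerous genre, is rated highly but is less popular, possibly because most dramas are not commercial films.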

mean_gen['profit'] = mean_gen['mean_revenue']-mean_gen['mean_budget']

s = mean_gen['profit'].sort_values(ascending=False)[:10]
pdf = mean_gen.loc[s.index]

plt.subplots(figsize=(10,6))
sns.barplot(x='profit',y='genre',data=pdf,palette='BuGn_r')
plt.xticks(fontsize=15)  # set the tick label font size
plt.yticks(fontsize=15)
plt.xlabel('Profit',fontsize=15)
plt.ylabel('Genres',fontsize=15)
plt.title('Top 10 Profit of Genres',fontsize=20)

plt.show()

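Animation, adventure, family, and science fiction are the most profitable genres. They suit the cinema experience and are also popular genres. Next, let's look at how the variables relate to each other.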

cordf = partdf.drop(l,axis=1)
cordf.columns  # contains the features we want to examine

 Index(['title', 'vote_average', 'vote_count', 'release_year', 'popularity',
       'budget', 'revenue'],
      dtype='object')

corrmat = cordf.drop('title',axis=1).corr()  # correlation is computed on the numeric columns only
f, ax = plt.subplots(figsize=(10,7))
sns.heatmap(corrmat,cbar=True, annot=True,vmax=.8, cmap='PuBu',square=True)

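The heatmap shows a fairly strong relationship between vote count and popularity, which suggests that the more people watch a movie, the more people rate it. Budget and revenue are also strongly related, and revenue is fairly strongly related to both popularity and vote count, so promoting a movie well matters. Let's look further.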
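For reference, each cell of the heatmap is a Pearson correlation coefficient, which is the default method of `DataFrame.corr`. The sketch below computes it by hand on made-up numbers; the function name `pearson` and the sample lists are illustrative assumptions.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Made-up values: revenue rises in exact proportion to budget, rank falls as budget rises.
budget = [1, 2, 3, 4]
revenue = [2, 4, 6, 8]
rank = [8, 6, 4, 2]

r_up = pearson(budget, revenue)
r_down = pearson(budget, rank)
print(r_up, r_down)
```

Values near 1 or -1 indicate a strong linear relationship and values near 0 a weak one, which is how the colors of the heatmap should be read.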

# budget and revenue both contain zeros in the data; we remove these dirty records
partdf = partdf[partdf['budget']>0]
partdf = partdf[partdf['revenue']>0]
partdf = partdf[partdf['vote_count']>3]
plt.subplots(figsize=(6,5))

plt.xlabel('Budget',fontsize=15)
plt.ylabel('Revenue',fontsize=15)
plt.title('Budget vs Revenue',fontsize=20)
sns.regplot(x='budget',y='revenue',data=partdf,ci=None)

plt.subplots(figsize=(6,5))
plt.xlabel('vote_average',fontsize=15)
plt.ylabel('popularity',fontsize=15)
plt.title('Score vs Popular',fontsize=20)
sns.regplot(x='vote_average',y='popularity',data=partdf)

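Budget and revenue, and rating and popularity, are roughly linearly related. For low-budget movies, however, budget has little effect on revenue. Highly rated movies are mostly popular as well. Now let's see which movies earn the most, are the most popular, and have the best ratings.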

print(partdf.loc[partdf['revenue']==partdf['revenue'].max()]['title'])
print(partdf.loc[partdf['popularity']==partdf['popularity'].max()]['title'])
print(partdf.loc[partdf['vote_average']==partdf['vote_average'].max()]['title'])

0 Avatar
Name: title, dtype: object
546 Minions
Name: title, dtype: object
1881 The Shawshank Redemption
Name: title, dtype: object

partdf['profit'] = partdf['revenue']-partdf['budget']
print(partdf.loc[partdf['profit']==partdf['profit'].max()]['title'])

0 Avatar
Name: title, dtype: object

Minions is the most popular, Avatar earns the most, and The Shawshank Redemption has the best rating.

s1 = cordf.groupby(by='release_year').budget.sum()
s2 = cordf.groupby(by='release_year').revenue.sum()
sdf = pd.concat([s1,s2],axis=1)
sdf = sdf.iloc[-39:-2]
plt.plot(sdf)
plt.xticks(range(1979,2020,5))
plt.legend(sdf)
plt.show()

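The film industry is indeed booming! Big-budget movies are increasingly common, and there seems to be a reason for it.

For sci-fi fans, here are the most popular science fiction movies:

# most popular science fiction movies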
s = partdf.loc[partdf['ScienceFiction']==1,'popularity'].sort_values(ascending=False)[:10]
sdf = partdf.loc[s.index]
sns.barplot(x='popularity',y='title',data=sdf)
plt.show()

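Interstellar is the most popular, followed closely by Guardians of the Galaxy. The same approach works for other genres. Now let's look at how filmmakers influence the market. A good movie depends on the people in front of and behind the camera, and it is these talented filmmakers who bring us movies worth watching. Here we focus on directors and actors.

# directors with the highest average revenue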
rev_d = fulldf.groupby('director')['revenue'].mean()
top_rev_d = rev_d.sort_values(ascending=False).head(20)
top_rev_d = pd.DataFrame(top_rev_d)

plt.subplots(figsize=(10,6))
sns.barplot(x='revenue',y=top_rev_d.index,data=top_rev_d,palette='BuGn_r')
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.xlabel('Average Revenue',fontsize=15)
plt.ylabel('Director',fontsize=15)
plt.title('Top 20 Revenue by Director',fontsize=20)
plt.show()

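The chart shows the directors who do well at the box office. Which directors are the most prolific, or both well reviewed and commercially successful?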

list2 = fulldf[fulldf['director']!=''].director.value_counts()[:10].sort_values(ascending=True)
list2 = pd.Series(list2)
list2

Oliver Stone 14
Renny Harlin 15
Steven Soderbergh 15
Robert Rodriguez 16
Spike Lee 16
Ridley Scott 16
Martin Scorsese 20
Clint Eastwood 20
Woody Allen 21
Steven Spielberg 27
Name: director, dtype: int64

plt.subplots(figsize=(10,6))
ax = list2.plot.barh(width=0.85,color='y')
for i,v in enumerate(list2.values):
    ax.text(.5, i, v,fontsize=12,color='white',weight='bold')
ax.patches[9].set_facecolor('g')
plt.title('Directors with highest movies')
plt.show()

top_vote_d = fulldf[fulldf['vote_average']>=8].sort_values(by='vote_average',ascending=False)
top_vote_d = top_vote_d.dropna()
top_vote_d = top_vote_d.loc[:,['director','vote_average']]

tmp = rev_d.sort_values(ascending=False)
vote_rev_d = tmp[tmp.index.isin(list(top_vote_d['director']))]
vote_rev_d = vote_rev_d.sort_values(ascending=False)
vote_rev_d = pd.DataFrame(vote_rev_d)

plt.subplots(figsize=(10,6))
sns.barplot(x='revenue',y=vote_rev_d.index,data=vote_rev_d,palette='BuGn_r')
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.xlabel('Average Revenue',fontsize=15)
plt.ylabel('Director',fontsize=15)
plt.title('Revenue by vote above 8 Director',fontsize=20)
plt.show()

Now for the cast. The cast feature lists many people for each movie. Fortunately, cast is ordered by importance, so we can treat those listed first as the main actors.

fulldf['cast']=fulldf['cast'].str.strip('[]').str.replace(' ','').str.replace("'",'').str.replace('"','')
fulldf['cast']=fulldf['cast'].str.split(',')

list1=[]
for i in fulldf['cast']:
    list1.extend(i)
list1 = pd.Series(list1)
list1 = list1.value_counts()[:15].sort_values(ascending=True)

plt.subplots(figsize=(10,6))
ax = list1.plot.barh(width=0.9,color='green')
for i,v in enumerate(list1.values):
    ax.text(.8, i, v,fontsize=10,color='white',weight='bold')
plt.title('Actors with highest appearance')
ax.patches[14].set_facecolor('b')
plt.show()

fulldf['keywords'][2]

“[‘spy’, ‘based on novel’, ‘secret agent’, ‘sequel’, ‘mi6’, ‘british secret service’, ‘united kingdom’]”

from wordcloud import WordCloud, STOPWORDS
import nltk
from nltk.corpus import stopwords
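# if stopwords raises an error because it is not installed, run import nltk; nltk.download() in the Anaconda prompt
# in the window that opens, select Corpora, then stopwords, refresh, and download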
import io
from PIL import Image

plt.subplots(figsize=(12,12))
stop_words=set(stopwords.words('english'))
stop_words.update(',',';','!','?','.','(',')','$','#','+',':','...',' ','')

img1 = Image.open('timg1.jpg')
hcmask1 = np.array(img1)
words=fulldf['keywords'].dropna().apply(nltk.word_tokenize)
word=[]
for i in words:
    word.extend(i)
word=pd.Series(word)
word=([i for i in word.str.lower() if i not in stop_words])
wc = WordCloud(background_color="black", max_words=4000, mask=hcmask1,
               stopwords=STOPWORDS, max_font_size= 60)
wc.generate(" ".join(word))


plt.imshow(wc,interpolation="bilinear")
plt.axis('off')
plt.figure()
plt.show()

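The word cloud gives a rough picture of the keywords: female directors and independent films take up a large share, which may also point to a trend in film.

Movie recommendation model

Building on the analysis above, we can put together a movie recommender. When searching for movies, people usually look for movies of the same genre, movies by the same director or actors, or highly rated movies, so the features needed are genres, cast, director, and score.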

l[:5]

[‘Action’, ‘Adventure’, ‘Fantasy’, ‘ScienceFiction’, ‘Crime’]

Feature vectorization

genre

def binary(genre_list):
    binaryList = []

    for genre in l:
        if genre in genre_list:
            binaryList.append(1)
        else:
            binaryList.append(0)

    return binaryList

fulldf['genre_vec'] = fulldf['genres'].apply(lambda x: binary(x))

fulldf['genre_vec'][0]

[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

cast

for i,j in zip(fulldf['cast'],fulldf.index):
    list2=[]
    list2=i[:4]
    list2.sort()
    fulldf.loc[j,'cast']=str(list2)
fulldf['cast'][0]

“[‘SamWorthington’, ‘SigourneyWeaver’, ‘StephenLang’, ‘ZoeSaldana’]”

fulldf['cast']=fulldf['cast'].str.strip('[]').str.replace(' ','').str.replace("'",'')
fulldf['cast']=fulldf['cast'].str.split(',')
fulldf['cast'][0]

[‘SamWorthington’, ‘SigourneyWeaver’, ‘StephenLang’, ‘ZoeSaldana’]

castList = []
for index, row in fulldf.iterrows():
    cast = row["cast"]
    for i in cast:
        if i not in castList:
            castList.append(i)

len(castList)

7515

def binary(cast_list):
    binaryList = []

    for genre in castList:
        if genre in cast_list:
            binaryList.append(1)
        else:
            binaryList.append(0)

    return binaryList

fulldf['cast_vec'] = fulldf['cast'].apply(lambda x:binary(x))
fulldf['cast_vec'].head(2)

0 [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
1 [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, …
Name: cast_vec, dtype: object

director

fulldf['director'][0]

‘James Cameron’

def xstr(s):
    if s is None:
        return ''
    return str(s)
fulldf['director']=fulldf['director'].apply(xstr)

directorList=[]
for i in fulldf['director']:
    if i not in directorList:
        directorList.append(i)

def binary(director):
    binaryList = []

    for direct in directorList:
        # compare whole names: `in` on a str is a substring test, and '' would match every director
        if direct == director:
            binaryList.append(1)
        else:
            binaryList.append(0)

    return binaryList

fulldf['director_vec'] = fulldf['director'].apply(lambda x:binary(x))

keywords

fulldf['keywords'][0]

“[‘culture clash’, ‘future’, ‘space war’, ‘space colony’, ‘society’, ‘space travel’, ‘futuristic’, ‘romance’, ‘space’, ‘alien’, ‘tribe’, ‘alien planet’, ‘cgi’, ‘marine’, ‘soldier’, ‘battle’, ‘love affair’, ‘anti war’, ‘power relations’, ‘mind and soul’, ‘3d’]”

#change keywords to type list
fulldf['keywords']=fulldf['keywords'].str.strip('[]').str.replace(' ','').str.replace("'",'').str.replace('"','')
fulldf['keywords']=fulldf['keywords'].str.split(',')

for i,j in zip(fulldf['keywords'],fulldf.index):
    list2=[]
    list2 = i
    list2.sort()
    fulldf.loc[j,'keywords']=str(list2)
fulldf['keywords'][0]

“[‘3d’, ‘alien’, ‘alienplanet’, ‘antiwar’, ‘battle’, ‘cgi’, ‘cultureclash’, ‘future’, ‘futuristic’, ‘loveaffair’, ‘marine’, ‘mindandsoul’, ‘powerrelations’, ‘romance’, ‘society’, ‘soldier’, ‘space’, ‘spacecolony’, ‘spacetravel’, ‘spacewar’, ‘tribe’]”

fulldf['keywords']=fulldf['keywords'].str.strip('[]').str.replace(' ','').str.replace("'",'').str.replace('"','')
fulldf['keywords']=fulldf['keywords'].str.split(',')

words_list = []
for index, row in fulldf.iterrows():
    genres = row["keywords"]

    for genre in genres:
        if genre not in words_list:
            words_list.append(genre)
len(words_list)

9772

def binary(words):
    binaryList = []

    for genre in words_list:
        if genre in words:
            binaryList.append(1)
        else:
            binaryList.append(0)

    return binaryList

fulldf['words_vec'] = fulldf['keywords'].apply(lambda x: binary(x))

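We use the cosine as the similarity measure and compute the similarity between movies from the selected feature vectors; the 10 movies at the smallest distance are recommended.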
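The quantity returned by `scipy.spatial.distance.cosine` is the cosine distance, that is, 1 minus the cosine similarity, so identical directions give 0 and vectors with nothing in common give 1. A plain-Python sketch, with the helper name `cosine_distance` and the short vectors chosen for illustration:

```python
import math

def cosine_distance(u, v):
    """1 - (u . v) / (|u| * |v|); assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Shortened genre vectors over [Action, Adventure, Fantasy, ScienceFiction, Crime].
avatar = [1, 1, 1, 1, 0]
spectre = [1, 1, 0, 0, 1]

d_partial = cosine_distance(avatar, spectre)           # two shared genres
d_same = cosine_distance(avatar, avatar)               # identical vectors
d_disjoint = cosine_distance([1, 0, 0], [0, 1, 1])     # no shared entries
print(d_partial, d_same, d_disjoint)
```

Because four such distances are summed in the similarity function, the total ranges from 0 (same genres, cast, director, and keywords) to 4 (nothing shared).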

fulldf=fulldf[(fulldf['vote_average']!=0)]  # remove movies with a score of 0 or without a director name
fulldf=fulldf[fulldf['director']!='']

from scipy import spatial

def Similarity(movieId1, movieId2):
    a = fulldf.iloc[movieId1]
    b = fulldf.iloc[movieId2]

    genresA = a['genre_vec']
    genresB = b['genre_vec']
    genreDistance = spatial.distance.cosine(genresA, genresB)

    castA = a['cast_vec']
    castB = b['cast_vec']
    castDistance = spatial.distance.cosine(castA, castB)

    directA = a['director_vec']
    directB = b['director_vec']
    directDistance = spatial.distance.cosine(directA, directB)

    wordsA = a['words_vec']
    wordsB = b['words_vec']
    # the keyword distance must compare the keyword vectors, not the director vectors
    wordsDistance = spatial.distance.cosine(wordsA, wordsB)
    return genreDistance + directDistance + castDistance + wordsDistance

Similarity(3,160)

2.7958758547680684

columns =['original_title','genres','vote_average','genre_vec','cast_vec','director','director_vec','words_vec']
tmp = fulldf.copy()
tmp =tmp[columns]
tmp['id'] = list(range(0,fulldf.shape[0]))
tmp.head()


	original_title	genres	vote_average	genre_vec	cast_vec	director	director_vec	words_vec	id
0	Avatar	[Action, Adventure, Fantasy, ScienceFiction]	7.2	[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …	[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …	James Cameron	[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …	[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …	0
1	Pirates of the Caribbean: At World’s End	[Adventure, Fantasy, Action]	6.9	[1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …	[0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, …	Gore Verbinski	[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …	[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …	1
2	Spectre	[Action, Adventure, Crime]	6.3	[1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …	[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, …	Sam Mendes	[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …	[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …	2
3	The Dark Knight Rises	[Action, Crime, Drama, Thriller]	7.6	[1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, …	[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, …	Christopher Nolan	[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …	[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …	3
4	John Carter	[Action, Adventure, ScienceFiction]	6.1	[1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …	[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …	Andrew Stanton	[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …	[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …	4

tmp.isnull().sum()

original_title 0
genres 0
vote_average 0
genre_vec 0
cast_vec 0
director 0
director_vec 0
words_vec 0
id 0
dtype: int64

import operator
def recommend(name):
    film=tmp[tmp['original_title'].str.contains(name)].iloc[0].to_frame().T
    print('Selected Movie: ',film.original_title.values[0])
    def getNeighbors(baseMovie):
        distances = []
        for index, movie in tmp.iterrows():
            if movie['id'] != baseMovie['id'].values[0]:
                dist = Similarity(baseMovie['id'].values[0], movie['id'])
                distances.append((movie['id'], dist))

        distances.sort(key=operator.itemgetter(1))

        neighbors = []
        for x in range(10):
            neighbors.append(distances[x])
        return neighbors
    neighbors = getNeighbors(film)
    print('\nRecommended Movies: \n')

    for nei in neighbors:  
        print( tmp.iloc[nei[0]][0]+" | Genres: "+
              str(tmp.iloc[nei[0]][1]).strip('[]').replace(' ','')+" | Rating: "
              +str(tmp.iloc[nei[0]][2]))

    print('\n')

recommend('Godfather')

Selected Movie: The Godfather: Part III

Recommended Movies:

The Godfather: Part II | Genres: 'Drama','Crime' | Rating: 8.3
The Godfather | Genres: 'Drama','Crime' | Rating: 8.4
The Rainmaker | Genres: 'Drama','Crime','Thriller' | Rating: 6.7
The Outsiders | Genres: 'Crime','Drama' | Rating: 6.9
The Conversation | Genres: 'Crime','Drama','Mystery' | Rating: 7.5
The Cotton Club | Genres: 'Music','Drama','Crime','Romance' | Rating: 6.6
Apocalypse Now | Genres: 'Drama','War' | Rating: 8.0
Twixt | Genres: 'Horror','Thriller' | Rating: 5.0
New York Stories | Genres: 'Comedy','Drama','Romance' | Rating: 6.2
Peggy Sue Got Married | Genres: 'Comedy','Drama','Fantasy','Romance' | Rating: 5.9
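The neighbor search sorts every distance and then keeps the first ten. As a hedged sketch of the same idea, `heapq.nsmallest` can select the k closest items without sorting the full list. The function `nearest`, the `hamming` stand-in distance, and the toy catalog are assumptions for illustration, not the notebook's own code.

```python
import heapq

def hamming(u, v):
    """Stand-in distance: number of positions where two binary vectors differ."""
    return sum(a != b for a, b in zip(u, v))

def nearest(base, candidates, distance, k=10):
    """Return the k (distance, title) pairs closest to base, nearest first."""
    scored = ((distance(base, vec), title) for title, vec in candidates.items())
    return heapq.nsmallest(k, scored)

# Toy catalog of binary feature vectors keyed by hypothetical titles.
catalog = {
    "Film B": [1, 1, 0],
    "Film C": [0, 0, 1],
    "Film D": [1, 0, 0],
}

top2 = nearest([1, 1, 0], catalog, hamming, k=2)
print(top2)
```

Any distance function with the same two-argument shape, such as the summed cosine distance, could be passed in place of `hamming`.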


References:

what’s my score
TMDB means per genre

I'm still learning, so feedback is welcome!
