Introduction to Data Analysis: Analyzing a Million Hollywood Movie Ratings

Environment: Windows 10, Python 3.7, Jupyter Notebook
Dataset: https://www.lanzous.com/i96rt3e

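Analysis tasks:

  1. Load and integrate the data
  2. Movies with the highest average ratings
  3. Average movie ratings by gender
  4. Movies that divide the genders most
  5. Most-rated (most popular) movies
  6. Movies that divide age groups most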

1. Loading and integrating the data

1.1. Import the required packages

import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import matplotlib.pyplot as plt
%matplotlib inline

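1.2. Load the data

  • There are three .dat data files plus a documentation file (README). You can drag each of them into a browser to view it; for example, I opened the README to look up the column headers of the other three files.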


1.2.1. Read the user data

# UserID::Gender::Age::Occupation::Zip-code
labels = ['UserID', 'Gender', 'Age', 'Occupation', 'Zip-code']
users = pd.read_csv('users.dat', sep='::', header=None, names=labels, engine ='python')
users.shape
(6040, 5)

View the first five rows:

users.head()
   UserID Gender  Age  Occupation Zip-code
0       1      F    1          10    48067
1       2      M   56          16    70072
2       3      M   25          15    55117
3       4      M   45           7    02460
4       5      M   25          20    55455
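
The read_csv call above relies on a two-character separator. As a minimal sketch of why engine='python' is passed, the snippet below parses a small in-memory sample (made-up here, in the same UserID::Gender::Age::Occupation::Zip-code layout). The names sample and sample_users are illustrative only, and the dtype argument is an optional extra that keeps zip codes as text.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same layout as users.dat
sample = io.StringIO("1::F::1::10::48067\n2::M::56::16::70072\n4::M::45::7::02460\n")
labels = ['UserID', 'Gender', 'Age', 'Occupation', 'Zip-code']

# A separator longer than one character is interpreted as a regular expression,
# which only the Python parsing engine supports; passing engine='python'
# explicitly avoids the fallback warning.
sample_users = pd.read_csv(sample, sep='::', header=None, names=labels,
                           engine='python', dtype={'Zip-code': str})

print(sample_users.dtypes)
print(sample_users)
```

Reading the zip code as a string is what preserves the leading zero in values such as 02460.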

1.2.2. Read the movie data

# MovieID::Title::Genres
labels = ['MovieID', 'Title', 'Genres']
movies = pd.read_csv('movies.dat', sep='::', header=None, names=labels, engine='python', encoding='latin-1')  # some titles contain accented characters; latin-1 is assumed here
movies.shape
(3883, 3)

View the first five rows:

movies.head()
   MovieID                               Title                        Genres
0        1                    Toy Story (1995)   Animation|Children's|Comedy
1        2                      Jumanji (1995)  Adventure|Children's|Fantasy
2        3             Grumpier Old Men (1995)                Comedy|Romance
3        4            Waiting to Exhale (1995)                  Comedy|Drama
4        5  Father of the Bride Part II (1995)                        Comedy

1.2.3. Read the rating data

# UserID::MovieID::Rating::Timestamp
labels = ['UserID', 'MovieID', 'Rating', 'Timestamp']
ratings = pd.read_csv('ratings.dat', sep='::', header=None, names=labels, engine ='python')
ratings.shape
(1000209, 4)

View the first five rows:

ratings.head()
   UserID  MovieID  Rating  Timestamp
0       1     1193       5  978300760
1       1      661       3  978302109
2       1      914       3  978301968
3       1     3408       4  978300275
4       1     2355       5  978824291

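1.3. Merge the data

  • The data is spread across three tables and can be merged into a single one; the technical term for this kind of merging is data integration.

Display the three tables:

display(users.head(), movies.head(), ratings.head())

movies and ratings share a column (MovieID), so merge them first: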

df1 = pd.merge(left=movies, right=ratings)
df1.head()
   MovieID             Title                       Genres  UserID  Rating  Timestamp
0        1  Toy Story (1995)  Animation|Children's|Comedy       1       5  978824268
1        1  Toy Story (1995)  Animation|Children's|Comedy       6       4  978237008
2        1  Toy Story (1995)  Animation|Children's|Comedy       8       4  978233496
3        1  Toy Story (1995)  Animation|Children's|Comedy       9       5  978225952
4        1  Toy Story (1995)  Animation|Children's|Comedy      10       5  978226474

Merge df1 with users:

movie_data = pd.merge(df1, users)
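
When pd.merge is called without an `on` argument, it joins on the columns the two frames have in common, which is what the two calls above depend on. The sketch below uses three made-up miniature tables (movies_s, ratings_s, users_s are hypothetical names) to show the same two-step merge with the key and the expected relationship spelled out. It is one possible way to make the intent explicit, not the article's own code.

```python
import pandas as pd

# Hypothetical miniature tables with the same key columns as the real ones
movies_s = pd.DataFrame({'MovieID': [1, 2],
                         'Title': ['Toy Story (1995)', 'Jumanji (1995)']})
ratings_s = pd.DataFrame({'UserID': [1, 2, 2],
                          'MovieID': [1, 1, 2],
                          'Rating': [5, 4, 3]})
users_s = pd.DataFrame({'UserID': [1, 2], 'Gender': ['F', 'M']})

# Naming the key and validating the relationship raises an error early
# if a key is unexpectedly duplicated on the "one" side.
step1 = pd.merge(movies_s, ratings_s, on='MovieID', how='inner',
                 validate='one_to_many')
merged = pd.merge(step1, users_s, on='UserID', how='inner',
                  validate='many_to_one')

print(merged)
```

An inner join keeps only ratings whose movie and user both exist, so the row count of the result equals the number of ratings when every key matches.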

1.4. Inspect the data

1.4.1. Check the shape

movie_data.shape
(1000209, 10)

1.4.2. View the first 5 rows

movie_data.head()
   MovieID                                      Title                                Genres  UserID  Rating  Timestamp  Gender  Age  Occupation  Zip-code
0        1                           Toy Story (1995)           Animation|Children's|Comedy       1       5  978824268       F    1          10     48067
1       48                          Pocahontas (1995)  Animation|Children's|Musical|Romance       1       5  978824351       F    1          10     48067
2      150                           Apollo 13 (1995)                                 Drama       1       5  978301777       F    1          10     48067
3      260  Star Wars: Episode IV - A New Hope (1977)       Action|Adventure|Fantasy|Sci-Fi       1       4  978300760       F    1          10     48067
4      527                    Schindler's List (1993)                             Drama|War       1       5  978824195       F    1          10     48067

1.4.3. Count the distinct titles

movie_data['Title'].unique().size
3706


2. Movies with the highest average ratings

2.1. Build a pivot table

movie_rate_mean = pd.pivot_table(movie_data, values=['Rating'], index=['Title'], aggfunc='mean')
movie_rate_mean.shape
(3706, 1)

2.2. View the first five rows

movie_rate_mean.head()
                                 Rating
Title
$1,000,000 Duck (1971)         3.027027
'Night Mother (1986)           3.371429
'Til There Was You (1997)      2.692308
'burbs, The (1989)             2.910891
...And Justice for All (1979)  3.713568

2.3. Sort

movie_rate_mean.sort_values(by='Rating', ascending=False, inplace=True)

2.4. View the top 10

  • Simply slice the first 10 rows
movie_rate_mean[0: 10]
                                           Rating
Title
Ulysses (Ulisse) (1954)                       5.0
Lured (1947)                                  5.0
Follow the Bitch (1998)                       5.0
Bittersweet Motel (2000)                      5.0
Song of Freedom (1936)                        5.0
One Little Indian (1973)                      5.0
Smashing Time (1967)                          5.0
Schlafes Bruder (Brother of Sleep) (1995)     5.0
Gate of Heavenly Peace, The (1995)            5.0
Baby, The (1973)                              5.0

2.5. View the bottom 10

movie_rate_mean[-10: ]


3. Average movie ratings by gender

  • Use a pivot table to reshape the data

Method 1:

movie_gender_rating_mean = pd.pivot_table(movie_data, values=['Rating'], index=['Title', 'Gender'], aggfunc='mean')
movie_gender_rating_mean.shape #(7152, 1)
movie_gender_rating_mean.head()
                                    Rating
Title                     Gender
$1,000,000 Duck (1971)    F       3.375000
                          M       2.761905
'Night Mother (1986)      F       3.388889
                          M       3.352941
'Til There Was You (1997) F       2.675676

Method 2:

movie_gender_rating_mean = pd.pivot_table(movie_data, values='Rating', index=['Title'], columns=['Gender'], aggfunc='mean')
movie_gender_rating_mean.shape #(3706, 2)
movie_gender_rating_mean.head()
Gender                                F         M
Title
$1,000,000 Duck (1971)         3.375000  2.761905
'Night Mother (1986)           3.388889  3.352941
'Til There Was You (1997)      2.675676  2.733333
'burbs, The (1989)             2.793478  2.962085
...And Justice for All (1979)  3.828571  3.689024
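
The two methods hold the same numbers in different shapes: method 1 is a long table indexed by (Title, Gender), and method 2 spreads Gender across columns. A small sketch on made-up data (toy, long_form and wide_form are hypothetical names) shows that unstacking the long form gives the wide one. It also shows that a title with no ratings from one gender ends up with a missing value in that column.

```python
import pandas as pd

# Hypothetical ratings; title B has no ratings from female users
toy = pd.DataFrame({
    'Title':  ['A', 'A', 'A', 'B', 'B'],
    'Gender': ['F', 'M', 'M', 'M', 'M'],
    'Rating': [4, 2, 3, 5, 3],
})

# Long form: one row per (Title, Gender) pair, as in method 1
long_form = toy.groupby(['Title', 'Gender'])['Rating'].mean()

# Wide form: Gender moved into the columns, as in method 2
wide_form = long_form.unstack('Gender')

print(long_form)
print(wide_form)
```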

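4. Movies that divide the genders most

Approach: subtract the male mean rating from the female mean rating to get the gap between the two.

4.1. Rating gap

# add a column: the difference between female and male mean ratings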
movie_gender_rating_mean['diff'] = movie_gender_rating_mean['F'] - movie_gender_rating_mean['M']
movie_gender_rating_mean.head()
Gender                                F         M      diff
Title
$1,000,000 Duck (1971)         3.375000  2.761905  0.613095
'Night Mother (1986)           3.388889  3.352941  0.035948
'Til There Was You (1997)      2.675676  2.733333 -0.057658
'burbs, The (1989)             2.793478  2.962085 -0.168607
...And Justice for All (1979)  3.828571  3.689024  0.139547

4.2. Sort

movie_gender_rating_mean.sort_values(by='diff', ascending=False, inplace=True)

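4.3. Inspect the gaps

● The largest gaps between female and male users: at the top, diff is positive, so these are the 10 movies that female users rate highest relative to male users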

movie_gender_rating_mean[:10]
Gender                                                              F         M      diff
Title
James Dean Story, The (1957)                                 4.000000  1.000000  3.000000
Spiders, The (Die Spinnen, 1. Teil: Der Goldene See) (1919)  4.000000  1.000000  3.000000
Country Life (1994)                                          5.000000  2.000000  3.000000
Babyfever (1994)                                             3.666667  1.000000  2.666667
Woman of Paris, A (1923)                                     5.000000  2.428571  2.571429
Cobra (1925)                                                 4.000000  1.500000  2.500000
Other Side of Sunday, The (Søndagsengler) (1996)             5.000000  2.928571  2.071429
Theodore Rex (1995)                                          3.000000  1.000000  2.000000
For the Moment (1994)                                        5.000000  3.000000  2.000000
Separation, The (La Séparation) (1994)                       4.000000  2.000000  2.000000

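● The largest gaps between female and male users: at the bottom, diff is negative, so the last 10 rows should be the movies that male users rate highest relative to female users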

movie_gender_rating_mean[-10: ]
Gender                                        F         M  diff
Title
White Boys (1999)                           NaN  1.000000   NaN
Wild Bill (1995)                            NaN  3.146341   NaN
Windows (1980)                              NaN  1.000000   NaN
Wings of Courage (1995)                     NaN  3.000000   NaN
With Byrd at the South Pole (1930)          NaN  2.000000   NaN
With Friends Like These... (1998)           NaN  4.000000   NaN
Wooden Man's Bride, The (Wu Kui) (1994)     NaN  3.000000   NaN
Year of the Horse (1997)                    NaN  3.250000   NaN
Zachariah (1971)                            NaN  3.500000   NaN
Zero Kelvin (Kjærlighetens kjøtere) (1995)  NaN  3.500000   NaN

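Why the NaN values appear: some movies were never rated by any female user, so their F mean, and therefore diff, is missing. sort_values places missing values at the end, so drop the rows with missing values before looking at the bottom of the list.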

movie_gender_rating_mean.dropna()[-10: ]
Gender                                        F         M      diff
Title
Jamaica Inn (1939)                          1.0  3.142857 -2.142857
Flying Saucer, The (1950)                   1.0  3.300000 -2.300000
Rosie (1998)                                1.0  3.333333 -2.333333
In God's Hands (1998)                       1.0  3.333333 -2.333333
Dangerous Ground (1997)                     1.0  3.333333 -2.333333
Killer: A Journal of Murder (1995)          1.0  3.428571 -2.428571
Stalingrad (1993)                           1.0  3.593750 -2.593750
Enfer, L' (1994)                            1.0  3.750000 -2.750000
Neon Bible, The (1995)                      1.0  4.000000 -3.000000
Tigrero: A Film That Was Never Made (1994)  1.0  4.333333 -3.333333
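
The behaviour seen above, where rows with a missing diff collect at the end of the sorted table, comes from the default na_position='last' of sort_values. A minimal sketch on a made-up four-row frame (gap and ranked are hypothetical names):

```python
import numpy as np
import pandas as pd

# Hypothetical gaps; title B has no value because one gender never rated it
gap = pd.DataFrame({'diff': [0.5, np.nan, -2.0, 3.0]},
                   index=['A', 'B', 'C', 'D'])

ranked = gap.sort_values(by='diff', ascending=False)

# The NaN row is placed last, after the most negative value
print(ranked.index.tolist())

# After dropna, the tail of the table holds the genuinely most negative gaps
print(ranked.dropna().index.tolist())
```

Note that dropna removes a row if any column is missing, so movies rated by only one gender disappear from the comparison entirely.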

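4.4. Concatenate the female and male results

# f and m are not defined in the original; assumed here to be the top 10 and bottom 10 rows after dropping NaN
f, m = movie_gender_rating_mean.dropna()[:10], movie_gender_rating_mean.dropna()[-10:]
diff = pd.concat([f, m])

4.5. Analyze the results

# analyze the results with a chart
diff.plot(kind='barh', figsize=(12, 9))  # barh draws horizontal bars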




5. Most-rated (most popular) movies

5.1. Group with pandas

rating_count = movie_data.groupby(['Title']).size()  # count how many times each title appears
rating_count
Title
$1,000,000 Duck (1971)                       37
'Night Mother (1986)                         70
'Til There Was You (1997)                    52
'burbs, The (1989)                          303
...And Justice for All (1979)               199
                                           ... 
Zed & Two Noughts, A (1985)                  29
Zero Effect (1998)                          301
Zero Kelvin (Kjærlighetens kjøtere) (1995)    2
Zeus and Roxanne (1997)                      23
eXistenZ (1999)                             410
Length: 3706, dtype: int64

5.2. Sort

rating_count.sort_values(ascending=False)  # ascending=False sorts in descending order
Title
American Beauty (1999)                                   3428
Star Wars: Episode IV - A New Hope (1977)                2991
Star Wars: Episode V - The Empire Strikes Back (1980)    2990
Star Wars: Episode VI - Return of the Jedi (1983)        2883
Jurassic Park (1993)                                     2672
                                                         ... 
Anna (1996)                                                 1
McCullochs, The (1975)                                      1
Shadows (Cienie) (1988)                                     1
Night Tide (1961)                                           1
Another Man's Poison (1952)                                 1
Length: 3706, dtype: int64
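
Counting rows per title and sorting the counts in descending order can also be written with Series.value_counts, which does both in one step. The sketch below compares the two on a made-up column (toy, by_size and by_counts are hypothetical names); it is an alternative spelling, not a change to the analysis.

```python
import pandas as pd

# Hypothetical title column: A appears 3 times, B twice, C once
toy = pd.DataFrame({'Title': ['A', 'B', 'A', 'C', 'A', 'B']})

# The approach used above: group, count, then sort
by_size = toy.groupby('Title').size().sort_values(ascending=False)

# value_counts counts and sorts in descending order by default
by_counts = toy['Title'].value_counts()

print(by_size)
print(by_counts)
```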


6. Movies that divide age groups most

6.1. Look at the age distribution of the users

Histogram:

movie_data['Age'].plot(kind='hist', bins=20)

Maximum value:

movie_data.Age.max()
56

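6.2. Group user ages with pandas.cut()

# Note: pd.cut uses right-closed bins by default, (0, 10], (10, 20], ..., so an Age value of 50
# lands under the label '40-49'. Pass right=False for left-closed bins such as [50, 60).
# The outputs below were produced with the default.
labels = ['0-9', '10-19', '20-29', '30-39', '40-49', '50-59']
movie_data['Age_range'] = pd.cut(movie_data['Age'], bins=range(0, 61, 10), labels=labels)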
movie_data.head()
   MovieID                                      Title                                Genres  UserID  Rating  Timestamp  Gender  Age  Occupation  Zip-code  Age_range
0        1                           Toy Story (1995)           Animation|Children's|Comedy       1       5  978824268       F    1          10     48067        0-9
1       48                          Pocahontas (1995)  Animation|Children's|Musical|Romance       1       5  978824351       F    1          10     48067        0-9
2      150                           Apollo 13 (1995)                                 Drama       1       5  978301777       F    1          10     48067        0-9
3      260  Star Wars: Episode IV - A New Hope (1977)       Action|Adventure|Fantasy|Sci-Fi       1       4  978300760       F    1          10     48067        0-9
4      527                    Schindler's List (1993)                             Drama|War       1       5  978824195       F    1          10     48067        0-9
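
How pd.cut assigns a value that sits exactly on a bin edge depends on its `right` parameter. The sketch below runs the same bins and labels over a short list of ages, assumed here to be the age codes 1, 18, 25, 35, 45, 50 and 56 described in the MovieLens 1M README (the dataset this appears to be). It compares the default right-closed bins with left-closed ones.

```python
import pandas as pd

# Assumed age codes from the dataset's README
ages = pd.Series([1, 18, 25, 35, 45, 50, 56])
labels = ['0-9', '10-19', '20-29', '30-39', '40-49', '50-59']
bins = range(0, 61, 10)

# Default right=True: intervals are (0, 10], (10, 20], ..., so 50 falls in (40, 50]
right_closed = pd.cut(ages, bins=bins, labels=labels)

# right=False: intervals are [0, 10), [10, 20), ..., so 50 falls in [50, 60)
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False)

print(pd.DataFrame({'Age': ages,
                    'right=True': right_closed,
                    'right=False': left_closed}))
```

Only the value 50 changes group here; with left-closed bins the labels match the intervals they name.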

6.3. Rating counts and rating tendencies per age group

6.3.1. Mean rating per age range

movie_data.groupby('Age_range')['Rating'].mean()
Age_range
0-9      3.549520
10-19    3.507573
20-29    3.545235
30-39    3.618162
40-49    3.673559
50-59    3.766632
Name: Rating, dtype: float64

6.3.2. Number of ratings per age range

movie_data.groupby('Age_range')['Rating'].size()
Age_range
0-9       27211
10-19    183536
20-29    395556
30-39    199003
40-49    156123
50-59     38780
Name: Rating, dtype: int64

6.3.3. Count and mean per age range in one step

movie_data.groupby('Age_range').agg({'Rating': ['size', 'mean']})
           Rating
             size      mean
Age_range
0-9         27211  3.549520
10-19      183536  3.507573
20-29      395556  3.545235
30-39      199003  3.618162
40-49      156123  3.673559
50-59       38780  3.766632
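
The steps in this section summarize each age group but stop short of ranking individual movies. One possible way to continue is sketched below on made-up data. It pivots the mean rating by title and age range, then measures the spread between the highest and lowest group mean. Treating that spread as the measure of controversy is an assumption of this sketch, and toy, age_means and spread are hypothetical names.

```python
import pandas as pd

# Hypothetical ratings: title A splits the two age groups, title B does not
toy = pd.DataFrame({
    'Title':     ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'Age_range': ['10-19', '10-19', '50-59', '50-59'] * 2,
    'Rating':    [5, 5, 1, 2, 3, 4, 3, 4],
})

# One row per title, one column per age range
age_means = pd.pivot_table(toy, values='Rating', index='Title',
                           columns='Age_range', aggfunc='mean')

# Spread between the most and least favourable age group (assumed measure)
age_means['spread'] = age_means.max(axis=1) - age_means.min(axis=1)

print(age_means.sort_values(by='spread', ascending=False))
```

On the full data the same caveat as in section 7 would apply: titles with very few ratings in some age group can produce large but unreliable spreads.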


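7. Refining the analysis for reliable results

Problem: why have we never seen the movies with the highest average ratings, or in some cases never even heard of them? That seems odd: well-regarded films, domestic or foreign, ought to be widely known, so something in the analysis must be off.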

movie_rate_mean[:10]
                                           Rating
Title
Ulysses (Ulisse) (1954)                       5.0
Smashing Time (1967)                          5.0
Baby, The (1973)                              5.0
Gate of Heavenly Peace, The (1995)            5.0
Schlafes Bruder (Brother of Sleep) (1995)     5.0
Lured (1947)                                  5.0
One Little Indian (1973)                      5.0
Song of Freedom (1936)                        5.0
Bittersweet Motel (2000)                      5.0
Follow the Bitch (1998)                       5.0

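Why? Because rating counts vary enormously: a movie seen by only a handful of people can end up with a very high average from those few ratings.

Solution:

  1. Restrict by rating count when analyzing average ratings by gender
  2. Restrict by rating count when analyzing the highest-rated movies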
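
The effect described above can be reproduced on made-up data. In the sketch below, a title with a single 5-star rating outranks a title with several ratings until a minimum count is required. The fixed threshold min_ratings is a hypothetical choice for illustration and is a different restriction from keeping only the most-rated titles.

```python
import pandas as pd

# Hypothetical ratings: one title rated once, another rated four times
toy = pd.DataFrame({
    'Title':  ['Obscure'] + ['Popular'] * 4,
    'Rating': [5, 5, 4, 4, 3],
})

# Count and mean per title in one pass
stats = toy.groupby('Title')['Rating'].agg(['count', 'mean'])
print(stats.sort_values(by='mean', ascending=False))

# Keep only titles with enough ratings before ranking by mean
min_ratings = 3  # hypothetical threshold
reliable = stats[stats['count'] >= min_ratings].sort_values(by='mean', ascending=False)
print(reliable)
```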

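7.1. Average ratings by gender, restricted by rating count

7.1.1. Build the index

# group by Title, count the ratings, sort ascending, reverse, take the first 50 entries, keep their index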
top_movie_title = movie_data.groupby('Title').size().sort_values()[::-1][:50].index
top_movie_title.size
50
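
The chain sort_values()[::-1][:50] picks the 50 largest counts. Series.nlargest expresses the same selection directly, as this sketch on a made-up count Series shows (counts, chained and direct are hypothetical names, and 2 stands in for 50).

```python
import pandas as pd

# Hypothetical rating counts per title
counts = pd.Series({'A': 3, 'B': 10, 'C': 7, 'D': 1})

# The chain used above: sort ascending, reverse, slice
chained = counts.sort_values()[::-1][:2].index

# The same selection with nlargest
direct = counts.nlargest(2).index

print(list(chained))
print(list(direct))
```

When several titles tie at the cut-off, the two spellings may keep different ones, so the results are guaranteed to match only when the counts are distinct.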

7.1.2. Select those 50 rows

flag = movie_gender_rating_mean.index.isin(top_movie_title)
df1 = movie_gender_rating_mean[flag].sort_values(by='diff')
df1.head()
Gender                                    F         M      diff
Title
Airplane! (1980)                   3.656566  4.064419 -0.407854
Godfather: Part II, The (1974)     4.040936  4.437778 -0.396842
Aliens (1986)                      3.802083  4.186684 -0.384601
Terminator 2: Judgment Day (1991)  3.785088  4.115367 -0.330279
Alien (1979)                       3.888252  4.216119 -0.327867

7.1.3. Visualize the data

  • Compare how the 50 most-rated movies are scored by female and male users
df1.plot(kind='barh', figsize=(12, 9))


7.2. Highest-rated movies, restricted by rating count

7.2.1. Build the index

index = movie_data.groupby('Title').size().sort_values()[::-1][:50].index
index.shape
(50,)

7.2.2. Select the matching rows

# movie_rate_mean is the pivot table built in section 2.1
flag = movie_rate_mean.index.isin(index)
# average ratings of the popular movies
movie_rating_top_mean = movie_rate_mean[flag]
movie_rating_top_mean.sort_values(by='Rating', ascending=False)
                                                         Rating
Title
Shawshank Redemption, The (1994)                       4.554558
Godfather, The (1972)                                  4.524966
Usual Suspects, The (1995)                             4.517106
Schindler's List (1993)                                4.510417
Raiders of the Lost Ark (1981)                         4.477725
Star Wars: Episode IV - A New Hope (1977)              4.453694
Sixth Sense, The (1999)                                4.406263
One Flew Over the Cuckoo's Nest (1975)                 4.390725
Godfather: Part II, The (1974)                         4.357565
Silence of the Lambs, The (1991)                       4.351823
Saving Private Ryan (1998)                             4.337354
American Beauty (1999)                                 4.317386
Matrix, The (1999)                                     4.315830
Princess Bride, The (1987)                             4.303710
Star Wars: Episode V - The Empire Strikes Back (1980)  4.292977
Pulp Fiction (1994)                                    4.278213
Blade Runner (1982)                                    4.273333
Fargo (1996)                                           4.254676
Wizard of Oz, The (1939)                               4.247963
Braveheart (1995)                                      4.234957
L.A. Confidential (1997)                               4.219406
Alien (1979)                                           4.159585
Terminator, The (1984)                                 4.152050
Toy Story (1995)                                       4.146846