9. Advanced Processing with Pandas

4.6 Advanced processing: handling missing values

Dataset: https://download.csdn.net/download/weixin_44827418/12548095


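Advanced processing with Pandas
    Handling missing values
    Data discretization
    Merging
    Crosstabs and pivot tables
    Grouping and aggregation
    Case study

4.6 Advanced processing: handling missing values
    1) How to handle missing values
        Two approaches:
            1) Drop the samples that contain missing values
            2) Replace / impute
        4.6.1 How to handle NaN
            1) Check whether the data contains NaN
                pd.isnull(df)
                pd.notnull(df)
            2) Drop the samples that contain missing values
                df.dropna(inplace=False)
            3) Replace / impute
                df.fillna(value, inplace=False)
        4.6.2 Missing values marked with a default symbol instead of NaN
            1) Replace "?" with np.nan
                df.replace(to_replace="?", value=np.nan)
            2) Then follow the steps for handling np.nan
    2) Missing-value handling examples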
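4.7 Advanced processing: data discretization
    sex  age
A    1    23
B    2    30
C    1    18

    species  fur
A    1        2
B    2        1
C    3        1

    male  female  age
A    1     0       23
B    0     1       30
C    1     0       18

    dog  pig  mouse  fur
A    1    0    0      2
B    0    1    0      1
C    0    0    1      1
One-hot encoding & dummy variables
4.7.1 What data discretization is
    Raw height data: 165, 174, 160, 180, 159, 163, 192, 184
4.7.2 Why discretize
4.7.3 How to discretize data
    1) Binning
        Automatic binning: sr = pd.qcut(data, bins)
        Custom binning: sr = pd.cut(data, [])
    2) Convert the binned result to one-hot encoding
        pd.get_dummies(sr, prefix=)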
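4.8 Advanced processing: merging
    numpy
        np.concatenate((a, b), axis=)
        Horizontal stacking
            np.hstack()
        Vertical stacking
            np.vstack()
    1) Concatenating along an axis
        pd.concat([data1, data2], axis=1)
    2) Joining on key columns
        Merging with pd.merge
        pd.merge(left, right, how="inner", on=[key columns])
4.9 Advanced processing: crosstabs and pivot tables
    Find and explore the relationship between two variables
    4.9.1 What crosstabs and pivot tables are for
    4.9.2 Using crosstab
        pd.crosstab(value1, value2)
    4.9.3 pivot_table
4.10 Advanced processing: grouping and aggregation
    4.10.1 What grouping and aggregation are
    4.10.2 Grouping and aggregation API
        dataframe
        sr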

4.6.1 How to handle NaN

import pandas as pd 

movie = pd.read_csv("./datas/IMDB-Movie-Data.csv")
movie
Rank Title Genre Description Director Actors Year Runtime (Minutes) Rating Votes Revenue (Millions) Metascore
0 1 Guardians of the Galaxy Action,Adventure,Sci-Fi A group of intergalactic criminals are forced ... James Gunn Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... 2014 121 8.1 757074 333.13 76.0
1 2 Prometheus Adventure,Mystery,Sci-Fi Following clues to the origin of mankind, a te... Ridley Scott Noomi Rapace, Logan Marshall-Green, Michael Fa... 2012 124 7.0 485820 126.46 65.0
2 3 Split Horror,Thriller Three girls are kidnapped by a man with a diag... M. Night Shyamalan James McAvoy, Anya Taylor-Joy, Haley Lu Richar... 2016 117 7.3 157606 138.12 62.0
3 4 Sing Animation,Comedy,Family In a city of humanoid animals, a hustling thea... Christophe Lourdelet Matthew McConaughey,Reese Witherspoon, Seth Ma... 2016 108 7.2 60545 270.32 59.0
4 5 Suicide Squad Action,Adventure,Fantasy A secret government agency recruits some of th... David Ayer Will Smith, Jared Leto, Margot Robbie, Viola D... 2016 123 6.2 393727 325.02 40.0
... ... ... ... ... ... ... ... ... ... ... ... ...
995 996 Secret in Their Eyes Crime,Drama,Mystery A tight-knit team of rising investigators, alo... Billy Ray Chiwetel Ejiofor, Nicole Kidman, Julia Roberts... 2015 111 6.2 27585 NaN 45.0
996 997 Hostel: Part II Horror Three American college students studying abroa... Eli Roth Lauren German, Heather Matarazzo, Bijou Philli... 2007 94 5.5 73152 17.54 46.0
997 998 Step Up 2: The Streets Drama,Music,Romance Romantic sparks occur between two dance studen... Jon M. Chu Robert Hoffman, Briana Evigan, Cassie Ventura,... 2008 98 6.2 70699 58.01 50.0
998 999 Search Party Adventure,Comedy A pair of friends embark on a mission to reuni... Scot Armstrong Adam Pally, T.J. Miller, Thomas Middleditch,Sh... 2014 93 5.6 4881 NaN 22.0
999 1000 Nine Lives Comedy,Family,Fantasy A stuffy businessman finds himself trapped ins... Barry Sonnenfeld Kevin Spacey, Jennifer Garner, Robbie Amell,Ch... 2016 87 5.3 12435 19.64 11.0

1000 rows × 12 columns

# 1. Check for NaN missing values; True marks a missing value
movie.isnull()
Rank Title Genre Description Director Actors Year Runtime (Minutes) Rating Votes Revenue (Millions) Metascore
0 False False False False False False False False False False False False
1 False False False False False False False False False False False False
2 False False False False False False False False False False False False
3 False False False False False False False False False False False False
4 False False False False False False False False False False False False
... ... ... ... ... ... ... ... ... ... ... ... ...
995 False False False False False False False False False False True False
996 False False False False False False False False False False False False
997 False False False False False False False False False False False False
998 False False False False False False False False False False True False
999 False False False False False False False False False False False False

1000 rows × 12 columns

import numpy as np

# any() returns True as soon as one element is True
# A result of True means the data contains missing values
np.any(movie.isnull())
True
# With notnull, False marks a missing value
pd.notnull(movie)
Rank Title Genre Description Director Actors Year Runtime (Minutes) Rating Votes Revenue (Millions) Metascore
0 True True True True True True True True True True True True
1 True True True True True True True True True True True True
2 True True True True True True True True True True True True
3 True True True True True True True True True True True True
4 True True True True True True True True True True True True
... ... ... ... ... ... ... ... ... ... ... ... ...
995 True True True True True True True True True True False True
996 True True True True True True True True True True True True
997 True True True True True True True True True True True True
998 True True True True True True True True True True False True
999 True True True True True True True True True True True True

1000 rows × 12 columns

# all() returns False as soon as one element is False
# A result of False means the data contains missing values
np.all(pd.notnull(movie))
False
pd.isnull(movie).any()
Rank                  False
Title                 False
Genre                 False
Description           False
Director              False
Actors                False
Year                  False
Runtime (Minutes)     False
Rating                False
Votes                 False
Revenue (Millions)     True
Metascore              True
dtype: bool
pd.notnull(movie).all()
Rank                   True
Title                  True
Genre                  True
Description            True
Director               True
Actors                 True
Year                   True
Runtime (Minutes)      True
Rating                 True
Votes                  True
Revenue (Millions)    False
Metascore             False
dtype: bool
# Handling the missing values
# Method 1: drop the samples that contain missing values
movie_full = movie.dropna()
movie_full.isnull().any()
Rank                  False
Title                 False
Genre                 False
Description           False
Director              False
Actors                False
Year                  False
Runtime (Minutes)     False
Rating                False
Votes                 False
Revenue (Millions)    False
Metascore             False
dtype: bool
# Method 2: replace
movie.head()
Rank Title Genre Description Director Actors Year Runtime (Minutes) Rating Votes Revenue (Millions) Metascore
0 1 Guardians of the Galaxy Action,Adventure,Sci-Fi A group of intergalactic criminals are forced ... James Gunn Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... 2014 121 8.1 757074 333.13 76.0
1 2 Prometheus Adventure,Mystery,Sci-Fi Following clues to the origin of mankind, a te... Ridley Scott Noomi Rapace, Logan Marshall-Green, Michael Fa... 2012 124 7.0 485820 126.46 65.0
2 3 Split Horror,Thriller Three girls are kidnapped by a man with a diag... M. Night Shyamalan James McAvoy, Anya Taylor-Joy, Haley Lu Richar... 2016 117 7.3 157606 138.12 62.0
3 4 Sing Animation,Comedy,Family In a city of humanoid animals, a hustling thea... Christophe Lourdelet Matthew McConaughey,Reese Witherspoon, Seth Ma... 2016 108 7.2 60545 270.32 59.0
4 5 Suicide Squad Action,Adventure,Fantasy A secret government agency recruits some of th... David Ayer Will Smith, Jared Leto, Margot Robbie, Viola D... 2016 123 6.2 393727 325.02 40.0
movie["Revenue (Millions)"].mean()
82.95637614678897
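# Columns that contain missing values (False in the notnull().all() output above):
# Revenue (Millions)    False
# Metascore             False
movie["Revenue (Millions)"] = movie["Revenue (Millions)"].fillna(movie["Revenue (Millions)"].mean())
movie["Revenue (Millions)"].isnull().any()
False
# Assign the result back to the column; calling fillna(inplace=True) on a selected
# column may not update the DataFrame in newer pandas versions
movie["Metascore"] = movie["Metascore"].fillna(movie["Metascore"].mean())
movie["Metascore"].isnull().any()
False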
movie.isnull().any() # All missing values have now been handled
Rank                  False
Title                 False
Genre                 False
Description           False
Director              False
Actors                False
Year                  False
Runtime (Minutes)     False
Rating                False
Votes                 False
Revenue (Millions)    False
Metascore             False
dtype: bool
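
The detect / drop / fill steps above can be condensed into a short sketch. The toy DataFrame and its column names are made up for illustration and only stand in for the movie table. The fill step assigns the result back to the column, because newer pandas releases may warn about, or ignore, `inplace=True` when it is called on a selected column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the movie table
df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "revenue": [10.0, np.nan, 30.0, np.nan],
    "score": [70.0, 80.0, np.nan, 60.0],
})

# 1. Detect: which columns contain at least one NaN?
cols_with_nan = df.columns[df.isnull().any()].tolist()

# 2a. Drop every row that has at least one NaN
dropped = df.dropna()

# 2b. Or fill each affected column with its own mean and assign it back
filled = df.copy()
for col in cols_with_nan:
    filled[col] = filled[col].fillna(filled[col].mean())

print(cols_with_nan)
print(dropped)
print(filled)
```

Dropping is the simplest option but discards whole rows. Mean filling keeps every row at the cost of inventing values, so which one fits depends on how much data is missing.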

Missing values marked with a default symbol instead of NaN

data = pd.read_csv("./datas/GBvideos.csv",encoding="GBK")
data
video_id title channel_title category_id tags views likes dislikes comment_total thumbnail_link date
0 jt2OHQh0HoQ Live Apple Event - Apple September Event 2017 ... Apple Event 28 apple events|apple event|iphone 8|iphone x|iph... 7426393 78240 13548 705 https://i.ytimg.com/vi/jt2OHQh0HoQ/default_liv... 13.09
1 AqokkXoa7uE Holly and Phillip Meet Samantha the Sex Robot ... This Morning 24 this morning|interview|holly willoughby|philli... 494203 2651 1309 0 https://i.ytimg.com/vi/AqokkXoa7uE/default.jpg 13.09
2 YPVcg45W0z4 My DNA Test Results? I'm WHAT?? emmablackery 24 emmablackery|emma blackery|emma|blackery|briti... 142819 13119 151 1141 https://i.ytimg.com/vi/YPVcg45W0z4/default.jpg 13.09
3 T_PuZBdT2iM getting into a conversation in a language you ... ProZD 1 skit|korean|language|conversation|esl|japanese... 1580028 65729 1529 3598 https://i.ytimg.com/vi/T_PuZBdT2iM/default.jpg 13.09
4 NsjsmgmbCfc Baby Name Challenge? Sprinkleofglitter 26 sprinkleofglitter|sprinkle of glitter|baby gli... 40592 5019 57 490 https://i.ytimg.com/vi/NsjsmgmbCfc/default.jpg 13.09
... ... ... ... ... ... ... ... ... ... ... ...
1595 w8fAellnPns Juicy Chicken Breast - You Suck at Cooking (ep... You Suck At Cooking 26 how to|cooking|recipe|kitchen|chicken|chicken ... 788466 31945 945 2274 https://i.ytimg.com/vi/w8fAellnPns/default.jpg 20.09
1596 RsG37JcEQNw Weezer - Beach Boys weezer 10 weezer|pacific daydream|pacificdaydream|beach ... 107927 2435 412 641 https://i.ytimg.com/vi/RsG37JcEQNw/default.jpg 20.09
1597 htSiIA2g7G8 Berry Frozen Yogurt Bark Recipe SORTEDfood 26 frozen yogurt bark|frozen yoghurt bark|frozen ... 109222 4840 35 212 https://i.ytimg.com/vi/htSiIA2g7G8/default.jpg 20.09
1598 ZQK1F0wz6z4 What Do You Want to Eat?? Wong Fu Productions 24 panda|what should we eat|buzzfeed|comedy|boyfr... 626223 22962 532 1559 https://i.ytimg.com/vi/ZQK1F0wz6z4/default.jpg 20.09
1599 DuPXdnSWoLk The Child in Time: Trailer - BBC One BBC 24 BBC|iPlayer|bbc one|bbc 1|bbc1|trailer|the chi... 99228 1699 ? 135 https://i.ytimg.com/vi/DuPXdnSWoLk/default.jpg 20.09

1600 rows × 11 columns

# 1. Replace ? with np.nan
new_data = data.replace(to_replace="?",value=np.nan)
new_data
video_id title channel_title category_id tags views likes dislikes comment_total thumbnail_link date
0 jt2OHQh0HoQ Live Apple Event - Apple September Event 2017 ... Apple Event 28 apple events|apple event|iphone 8|iphone x|iph... 7426393 78240 13548 705 https://i.ytimg.com/vi/jt2OHQh0HoQ/default_liv... 13.09
1 AqokkXoa7uE Holly and Phillip Meet Samantha the Sex Robot ... This Morning 24 this morning|interview|holly willoughby|philli... 494203 2651 1309 0 https://i.ytimg.com/vi/AqokkXoa7uE/default.jpg 13.09
2 YPVcg45W0z4 My DNA Test Results? I'm WHAT?? emmablackery 24 emmablackery|emma blackery|emma|blackery|briti... 142819 13119 151 1141 https://i.ytimg.com/vi/YPVcg45W0z4/default.jpg 13.09
3 T_PuZBdT2iM getting into a conversation in a language you ... ProZD 1 skit|korean|language|conversation|esl|japanese... 1580028 65729 1529 3598 https://i.ytimg.com/vi/T_PuZBdT2iM/default.jpg 13.09
4 NsjsmgmbCfc Baby Name Challenge? Sprinkleofglitter 26 sprinkleofglitter|sprinkle of glitter|baby gli... 40592 5019 57 490 https://i.ytimg.com/vi/NsjsmgmbCfc/default.jpg 13.09
... ... ... ... ... ... ... ... ... ... ... ...
1595 w8fAellnPns Juicy Chicken Breast - You Suck at Cooking (ep... You Suck At Cooking 26 how to|cooking|recipe|kitchen|chicken|chicken ... 788466 31945 945 2274 https://i.ytimg.com/vi/w8fAellnPns/default.jpg 20.09
1596 RsG37JcEQNw Weezer - Beach Boys weezer 10 weezer|pacific daydream|pacificdaydream|beach ... 107927 2435 412 641 https://i.ytimg.com/vi/RsG37JcEQNw/default.jpg 20.09
1597 htSiIA2g7G8 Berry Frozen Yogurt Bark Recipe SORTEDfood 26 frozen yogurt bark|frozen yoghurt bark|frozen ... 109222 4840 35 212 https://i.ytimg.com/vi/htSiIA2g7G8/default.jpg 20.09
1598 ZQK1F0wz6z4 What Do You Want to Eat?? Wong Fu Productions 24 panda|what should we eat|buzzfeed|comedy|boyfr... 626223 22962 532 1559 https://i.ytimg.com/vi/ZQK1F0wz6z4/default.jpg 20.09
1599 DuPXdnSWoLk The Child in Time: Trailer - BBC One BBC 24 BBC|iPlayer|bbc one|bbc 1|bbc1|trailer|the chi... 99228 1699 NaN 135 https://i.ytimg.com/vi/DuPXdnSWoLk/default.jpg 20.09

1600 rows × 11 columns

new_data.isnull().any() # The ? in the dislikes column has been replaced with NaN
video_id          False
title             False
channel_title     False
category_id       False
tags              False
views             False
likes             False
dislikes           True
comment_total     False
thumbnail_link    False
date              False
dtype: bool
new_data.dropna(inplace=True)
new_data.isnull().any()
video_id          False
title             False
channel_title     False
category_id       False
tags              False
views             False
likes             False
dislikes          False
comment_total     False
thumbnail_link    False
date              False
dtype: bool
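
One detail worth checking after this kind of replacement: a column that contained a text marker such as "?" is usually read in as strings, so it may still not be numeric after the rows are dropped. The sketch below uses a made-up two-column table (not the GBvideos file) and assumes `pd.to_numeric` is an acceptable way to restore the dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "?" marks a missing dislike count
raw = pd.DataFrame({
    "video_id": ["a1", "b2", "c3"],
    "dislikes": ["12", "?", "7"],
})

# Swap the marker for NaN so pandas treats it as missing
marked = raw.replace(to_replace="?", value=np.nan)

# Drop incomplete rows, then convert the column back to numbers
clean = marked.dropna().copy()
clean["dislikes"] = pd.to_numeric(clean["dislikes"])

print(clean.dtypes)
```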

4.7 Advanced processing: data discretization

import pandas as pd 

# Prepare the data
data = pd.Series([165,174,160,180,159,163,192,184],index=["No1:165","No2:174","No3:160","No4:180","No5:159","No6:163","No7:192","No8:184"])
data
No1:165    165
No2:174    174
No3:160    160
No4:180    180
No5:159    159
No6:163    163
No7:192    192
No8:184    184
dtype: int64

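Automatic binning

# 1. Binning

# Automatic binning
# qcut(data, number of bins): quantile-based, so each bin holds roughly the same number of samples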
sr = pd.qcut(data,3)
sr
No1:165      (163.667, 178.0]
No2:174      (163.667, 178.0]
No3:160    (158.999, 163.667]
No4:180        (178.0, 192.0]
No5:159    (158.999, 163.667]
No6:163    (158.999, 163.667]
No7:192        (178.0, 192.0]
No8:184        (178.0, 192.0]
dtype: category
Categories (3, interval[float64]): [(158.999, 163.667] < (163.667, 178.0] < (178.0, 192.0]]
# Inspect how many samples fall in each bin
sr.value_counts()
(178.0, 192.0]        3
(158.999, 163.667]    3
(163.667, 178.0]      2
dtype: int64
type(sr)
pandas.core.series.Series
# 2. Convert the binned result to one-hot encoding
# prefix sets the prefix of the column names
pd.get_dummies(sr,prefix="height")
height_(158.999, 163.667] height_(163.667, 178.0] height_(178.0, 192.0]
No1:165 0 1 0
No2:174 0 1 0
No3:160 1 0 0
No4:180 0 0 1
No5:159 1 0 0
No6:163 1 0 0
No7:192 0 0 1
No8:184 0 0 1

Custom binning

# Custom binning
# pd.cut(data, list containing every bin edge)
sr = pd.cut(data,[150,165,180,195])
sr
No1:165    (150, 165]
No2:174    (165, 180]
No3:160    (150, 165]
No4:180    (165, 180]
No5:159    (150, 165]
No6:163    (150, 165]
No7:192    (180, 195]
No8:184    (180, 195]
dtype: category
Categories (3, interval[int64]): [(150, 165] < (165, 180] < (180, 195]]
sr.value_counts()
(150, 165]    4
(180, 195]    2
(165, 180]    2
dtype: int64
pd.get_dummies(sr,prefix="身高")
身高_(150, 165] 身高_(165, 180] 身高_(180, 195]
No1:165 1 0 0
No2:174 0 1 0
No3:160 1 0 0
No4:180 0 1 0
No5:159 1 0 0
No6:163 1 0 0
No7:192 0 0 1
No8:184 0 0 1
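
The interval column names above are precise but awkward to type. As a small variation, `pd.cut` also accepts a `labels` argument. The sketch below reuses the same bin edges. The label names "short", "medium" and "tall" are invented for this illustration.

```python
import pandas as pd

heights = pd.Series([165, 174, 160, 180, 159, 163, 192, 184])

# Same edges as above; intervals are right-closed by default, so 165 falls in (150, 165]
groups = pd.cut(heights, bins=[150, 165, 180, 195], labels=["short", "medium", "tall"])

# One indicator column per bin; cast to int so the result is 0/1 on any pandas version
dummies = pd.get_dummies(groups, prefix="height").astype(int)

print(groups.value_counts())
print(dummies)
```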

4.8 Advanced processing: merging

4.8.1 Merging with pd.concat (concatenating along an axis)

data1 = np.arange(0,20,1).reshape(4,5)
data1 = pd.DataFrame(data1)
data1
0 1 2 3 4
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
data2 = np.arange(100,120,1).reshape(4,5)
data2 = pd.DataFrame(data2)
data2
0 1 2 3 4
0 100 101 102 103 104
1 105 106 107 108 109
2 110 111 112 113 114
3 115 116 117 118 119
# Concatenate data1 and data2 horizontally
data_concat = pd.concat([data1,data2],axis=1)
data_concat
0 1 2 3 4 0 1 2 3 4
0 0 1 2 3 4 100 101 102 103 104
1 5 6 7 8 9 105 106 107 108 109
2 10 11 12 13 14 110 111 112 113 114
3 15 16 17 18 19 115 116 117 118 119
data2.T
0 1 2 3
0 100 105 110 115
1 101 106 111 116
2 102 107 112 117
3 103 108 113 118
4 104 109 114 119
# Concatenate data1 and the transposed data2 vertically
data_concat1 = pd.concat([data1,data2.T],axis=0)
data_concat1
0 1 2 3 4
0 0 1 2 3 4.0
1 5 6 7 8 9.0
2 10 11 12 13 14.0
3 15 16 17 18 19.0
0 100 105 110 115 NaN
1 101 106 111 116 NaN
2 102 107 112 117 NaN
3 103 108 113 118 NaN
4 104 109 114 119 NaN
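
The NaN values and the repeated row labels in the result above both come from how `pd.concat` aligns its inputs: it matches on labels, not positions. A minimal sketch with two made-up frames of different widths shows the same effect. `ignore_index=True` is an optional extra that renumbers the rows.

```python
import numpy as np
import pandas as pd

top = pd.DataFrame(np.arange(6).reshape(2, 3))            # columns 0, 1, 2
bottom = pd.DataFrame(np.arange(100, 104).reshape(2, 2))  # columns 0, 1

# Rows are stacked; column 2 has no counterpart in `bottom`, so it is filled with NaN
# (which also turns that column into floats)
stacked = pd.concat([top, bottom], axis=0, ignore_index=True)

print(stacked)
```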

4.8.2 Merging with pd.merge (joining on key columns)

left=pd.DataFrame({'key1':['K0','K0','K1','K2'],
'key2':['K0','K1','K0','K1'],
'A':['A0','A1','A2','A3'],
'B':['B0','B1','B2','B3']})
left
key1 key2 A B
0 K0 K0 A0 B0
1 K0 K1 A1 B1
2 K1 K0 A2 B2
3 K2 K1 A3 B3
right=pd.DataFrame({'key1':['K0','K1','K1','K2'], 
                    'key2':['K0','K0','K0','K0'], 
                    'C':['Co','C1','C2','C3'],
                    'D':['DO','D1','D2','D3']})
right
key1 key2 C D
0 K0 K0 Co DO
1 K1 K0 C1 D1
2 K1 K0 C2 D2
3 K2 K0 C3 D3
# inner join is the default
# inner keeps only the keys present in both tables
result = pd.merge(left,right,on=['key1','key2'],how="inner")
result
key1 key2 A B C D
0 K0 K0 A0 B0 Co DO
1 K1 K0 A2 B2 C1 D1
2 K1 K0 A2 B2 C2 D2
# left: left join
# Every key in the left table is kept; the merge follows the left table
result_left = pd.merge(left,right,on=['key1','key2'],how="left")
result_left
key1 key2 A B C D
0 K0 K0 A0 B0 Co DO
1 K0 K1 A1 B1 NaN NaN
2 K1 K0 A2 B2 C1 D1
3 K1 K0 A2 B2 C2 D2
4 K2 K1 A3 B3 NaN NaN
# right: right join
# Every key in the right table is kept; the merge follows the right table
result_right = pd.merge(left,right,on=['key1','key2'],how="right")
result_right
key1 key2 A B C D
0 K0 K0 A0 B0 Co DO
1 K1 K0 A2 B2 C1 D1
2 K1 K0 A2 B2 C2 D2
3 K2 K0 NaN NaN C3 D3
# outer: outer join
# Every key from both tables is kept
result_outer = pd.merge(left,right,on=['key1','key2'],how="outer")
result_outer
key1 key2 A B C D
0 K0 K0 A0 B0 Co DO
1 K0 K1 A1 B1 NaN NaN
2 K1 K0 A2 B2 C1 D1
3 K1 K0 A2 B2 C2 D2
4 K2 K1 A3 B3 NaN NaN
5 K2 K0 NaN NaN C3 D3
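
When comparing the four join types it can help to see which table each output row came from. `pd.merge` has an `indicator` parameter for this. The sketch below uses a simplified single-key pair of tables invented for the example, not the `left` and `right` frames above.

```python
import pandas as pd

left_toy = pd.DataFrame({"key": ["K0", "K1", "K2"], "A": ["A0", "A1", "A2"]})
right_toy = pd.DataFrame({"key": ["K0", "K2", "K3"], "C": ["C0", "C2", "C3"]})

# indicator=True adds a _merge column: "both", "left_only" or "right_only"
merged = pd.merge(left_toy, right_toy, on="key", how="outer", indicator=True)

# Map each key to where it was found
origin = dict(zip(merged["key"], merged["_merge"].astype(str)))

print(merged)
```

Rows marked "both" are what an inner join would keep. A left join keeps "both" plus "left_only", and a right join keeps "both" plus "right_only".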

4.9 Advanced processing: crosstabs and pivot tables

  • Used to explore the relationship between two variables

4.9.2 Using crosstab

data = pd.read_excel("./datas/szfj_baoan.xls")
data
district roomnum hall AREA C_floor floor_num school subway per_price
0 baoan 3 2 89.3 middle 31 0 0 7.0773
1 baoan 4 2 127.0 high 31 0 0 6.9291
2 baoan 1 1 28.0 low 39 0 0 3.9286
3 baoan 1 1 28.0 middle 30 0 0 3.3568
4 baoan 2 2 78.0 middle 8 1 1 5.0769
... ... ... ... ... ... ... ... ... ...
1246 baoan 4 2 89.3 low 8 0 0 4.2553
1247 baoan 2 1 67.0 middle 30 0 0 3.8060
1248 baoan 2 2 67.4 middle 29 1 0 5.3412
1249 baoan 2 2 73.1 low 15 1 0 5.9508
1250 baoan 3 2 86.2 middle 32 0 1 4.5244

1251 rows × 9 columns

time = "2020-06-23"
# pandas datetime type
date = pd.to_datetime(time)
date
Timestamp('2020-06-23 00:00:00')
type(date)
pandas._libs.tslibs.timestamps.Timestamp
date.year
2020
date.month
6
data["week"] = date.weekday()  # weekday() is a method; Monday is 0, so 2020-06-23 (a Tuesday) gives 1
data.drop("week",axis=1,inplace=True)
data
district roomnum hall AREA C_floor floor_num school subway per_price
0 baoan 3 2 89.3 middle 31 0 0 7.0773
1 baoan 4 2 127.0 high 31 0 0 6.9291
2 baoan 1 1 28.0 low 39 0 0 3.9286
3 baoan 1 1 28.0 middle 30 0 0 3.3568
4 baoan 2 2 78.0 middle 8 1 1 5.0769
... ... ... ... ... ... ... ... ... ...
1246 baoan 4 2 89.3 low 8 0 0 4.2553
1247 baoan 2 1 67.0 middle 30 0 0 3.8060
1248 baoan 2 2 67.4 middle 29 1 0 5.3412
1249 baoan 2 2 73.1 low 15 1 0 5.9508
1250 baoan 3 2 86.2 middle 32 0 1 4.5244

1251 rows × 9 columns

data["feature"] = np.where(data["per_price"] > 5.0000,1,0)
data
district roomnum hall AREA C_floor floor_num school subway per_price feature
0 baoan 3 2 89.3 middle 31 0 0 7.0773 1
1 baoan 4 2 127.0 high 31 0 0 6.9291 1
2 baoan 1 1 28.0 low 39 0 0 3.9286 0
3 baoan 1 1 28.0 middle 30 0 0 3.3568 0
4 baoan 2 2 78.0 middle 8 1 1 5.0769 1
... ... ... ... ... ... ... ... ... ... ...
1246 baoan 4 2 89.3 low 8 0 0 4.2553 0
1247 baoan 2 1 67.0 middle 30 0 0 3.8060 0
1248 baoan 2 2 67.4 middle 29 1 0 5.3412 1
1249 baoan 2 2 73.1 low 15 1 0 5.9508 1
1250 baoan 3 2 86.2 middle 32 0 1 4.5244 0

1251 rows × 10 columns

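# Crosstab

# Relationship between the number of floors and whether the price per square metre exceeds 50,000
# (per_price appears to be expressed in units of 10,000, hence the 5.0 threshold in the code above)
# The result gives, for each floor count, how many rows are 0 and how many are 1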
data0 = pd.crosstab(data["floor_num"],data["feature"])
data0
feature 0 1
floor_num
1 6 8
3 0 1
4 0 10
6 3 7
7 16 25
8 19 32
9 2 11
10 4 9
11 8 11
12 1 3
13 4 20
14 0 5
15 8 33
16 9 19
17 20 21
18 17 35
19 11 5
20 2 4
21 1 6
22 0 1
23 4 8
24 10 26
25 4 37
26 9 57
27 5 38
28 6 35
29 26 68
30 30 78
31 4 151
32 21 126
33 34 20
34 1 5
35 1 2
36 0 4
37 1 1
38 0 1
39 5 10
40 1 3
43 0 1
44 0 6
45 0 7
47 0 1
50 0 1
51 0 3
52 0 2
53 0 1
data0.sum(axis=1) # Sum across each row
floor_num
1      14
3       1
4      10
6      10
7      41
8      51
9      13
10     13
11     19
12      4
13     24
14      5
15     41
16     28
17     41
18     52
19     16
20      6
21      7
22      1
23     12
24     36
25     41
26     66
27     43
28     41
29     94
30    108
31    155
32    147
33     54
34      6
35      3
36      4
37      2
38      1
39     15
40      4
43      1
44      6
45      7
47      1
50      1
51      3
52      2
53      1
dtype: int64
data0.div(data0.sum(axis=1),axis=0) # Divide each row by its row total
feature 0 1
floor_num
1 0.428571 0.571429
3 0.000000 1.000000
4 0.000000 1.000000
6 0.300000 0.700000
7 0.390244 0.609756
8 0.372549 0.627451
9 0.153846 0.846154
10 0.307692 0.692308
11 0.421053 0.578947
12 0.250000 0.750000
13 0.166667 0.833333
14 0.000000 1.000000
15 0.195122 0.804878
16 0.321429 0.678571
17 0.487805 0.512195
18 0.326923 0.673077
19 0.687500 0.312500
20 0.333333 0.666667
21 0.142857 0.857143
22 0.000000 1.000000
23 0.333333 0.666667
24 0.277778 0.722222
25 0.097561 0.902439
26 0.136364 0.863636
27 0.116279 0.883721
28 0.146341 0.853659
29 0.276596 0.723404
30 0.277778 0.722222
31 0.025806 0.974194
32 0.142857 0.857143
33 0.629630 0.370370
34 0.166667 0.833333
35 0.333333 0.666667
36 0.000000 1.000000
37 0.500000 0.500000
38 0.000000 1.000000
39 0.333333 0.666667
40 0.250000 0.750000
43 0.000000 1.000000
44 0.000000 1.000000
45 0.000000 1.000000
47 0.000000 1.000000
50 0.000000 1.000000
51 0.000000 1.000000
52 0.000000 1.000000
53 0.000000 1.000000
data_percent = data0.div(data0.sum(axis=1),axis=0)
data_percent
feature 0 1
floor_num
1 0.428571 0.571429
3 0.000000 1.000000
4 0.000000 1.000000
6 0.300000 0.700000
7 0.390244 0.609756
8 0.372549 0.627451
9 0.153846 0.846154
10 0.307692 0.692308
11 0.421053 0.578947
12 0.250000 0.750000
13 0.166667 0.833333
14 0.000000 1.000000
15 0.195122 0.804878
16 0.321429 0.678571
17 0.487805 0.512195
18 0.326923 0.673077
19 0.687500 0.312500
20 0.333333 0.666667
21 0.142857 0.857143
22 0.000000 1.000000
23 0.333333 0.666667
24 0.277778 0.722222
25 0.097561 0.902439
26 0.136364 0.863636
27 0.116279 0.883721
28 0.146341 0.853659
29 0.276596 0.723404
30 0.277778 0.722222
31 0.025806 0.974194
32 0.142857 0.857143
33 0.629630 0.370370
34 0.166667 0.833333
35 0.333333 0.666667
36 0.000000 1.000000
37 0.500000 0.500000
38 0.000000 1.000000
39 0.333333 0.666667
40 0.250000 0.750000
43 0.000000 1.000000
44 0.000000 1.000000
45 0.000000 1.000000
47 0.000000 1.000000
50 0.000000 1.000000
51 0.000000 1.000000
52 0.000000 1.000000
53 0.000000 1.000000
# stacked=True stacks the bars for 0 and 1 on top of each other
data_percent.plot(kind="bar",stacked=True)
<matplotlib.axes._subplots.AxesSubplot at 0x24719dd7488>

[Figure output_70_1.png: stacked bar chart of data_percent (image unavailable)]

4.9.3 Using pivot_table

# With a pivot table the whole process becomes simpler
# The result is directly the proportion of rows whose value is 1
data.pivot_table(["feature"],index=["floor_num"])

...

feature
floor_num
1 0.571429
3 1.000000
4 1.000000
6 0.700000
50 1.000000
51 1.000000
52 1.000000
53 1.000000
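
Why does the pivot table return the proportion directly? `pivot_table` aggregates with the mean by default, and the mean of a 0/1 column is the share of 1s. The sketch below checks this on a tiny invented table (the names `floor` and `flag` are placeholders) and also shows `pd.crosstab(..., normalize="index")`, which does the row-wise division in one call.

```python
import pandas as pd

# Hypothetical toy data: `floor` is the group, `flag` is a 0/1 indicator
toy = pd.DataFrame({
    "floor": [1, 1, 1, 2, 2, 3],
    "flag":  [1, 0, 1, 0, 0, 1],
})

# Row-normalised crosstab: share of 0s and 1s within each floor
shares = pd.crosstab(toy["floor"], toy["flag"], normalize="index")

# Default aggregation is the mean, i.e. the share of 1s per floor
means = toy.pivot_table(values="flag", index="floor")

print(shares)
print(means)
```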

4.10 Advanced processing: grouping and aggregation

4.10.2 Grouping and aggregation API

col = pd.DataFrame({'color':['white','red','green','red','green'],
                   'object':["pen","pencil","pencil","ashtray","pen"],
                  'price1':[4.56,4.20,1.30,0.56,2.75],
                  'price2':[4.75,4.12,1.68,0.75,3.15]})
col
color object price1 price2
0 white pen 4.56 4.75
1 red pencil 4.20 4.12
2 green pencil 1.30 1.68
3 red ashtray 0.56 0.75
4 green pen 2.75 3.15
# Group by color and aggregate price1
# Grouping via the DataFrame method
col.groupby(by="color")["price1"].max()
color
green    2.75
red      4.20
white    4.56
Name: price1, dtype: float64
# Grouping via the Series method
col['price1'].groupby(col["color"])
<pandas.core.groupby.generic.SeriesGroupBy object at 0x000002471D178D08>
col['price1'].groupby(col["color"]).max()
color
green    2.75
red      4.20
white    4.56
Name: price1, dtype: float64
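
The grouped object is not limited to one aggregation at a time. As a small extension of the example, `agg` accepts a list of function names and returns one column per aggregation. The sketch rebuilds only the two columns it needs from the table above.

```python
import pandas as pd

col_toy = pd.DataFrame({
    "color": ["white", "red", "green", "red", "green"],
    "price1": [4.56, 4.20, 1.30, 0.56, 2.75],
})

# One row per color, one column per aggregation
summary = col_toy.groupby("color")["price1"].agg(["min", "max", "count"])

print(summary)
```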

4.11 Case study

# 1. Prepare the data
movie = pd.read_csv("./datas/IMDB-Movie-Data.csv")
movie
Rank Title Genre Description Director Actors Year Runtime (Minutes) Rating Votes Revenue (Millions) Metascore
0 1 Guardians of the Galaxy Action,Adventure,Sci-Fi A group of intergalactic criminals are forced ... James Gunn Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... 2014 121 8.1 757074 333.13 76.0
1 2 Prometheus Adventure,Mystery,Sci-Fi Following clues to the origin of mankind, a te... Ridley Scott Noomi Rapace, Logan Marshall-Green, Michael Fa... 2012 124 7.0 485820 126.46 65.0
2 3 Split Horror,Thriller Three girls are kidnapped by a man with a diag... M. Night Shyamalan James McAvoy, Anya Taylor-Joy, Haley Lu Richar... 2016 117 7.3 157606 138.12 62.0
3 4 Sing Animation,Comedy,Family In a city of humanoid animals, a hustling thea... Christophe Lourdelet Matthew McConaughey,Reese Witherspoon, Seth Ma... 2016 108 7.2 60545 270.32 59.0
4 5 Suicide Squad Action,Adventure,Fantasy A secret government agency recruits some of th... David Ayer Will Smith, Jared Leto, Margot Robbie, Viola D... 2016 123 6.2 393727 325.02 40.0
... ... ... ... ... ... ... ... ... ... ... ... ...
995 996 Secret in Their Eyes Crime,Drama,Mystery A tight-knit team of rising investigators, alo... Billy Ray Chiwetel Ejiofor, Nicole Kidman, Julia Roberts... 2015 111 6.2 27585 NaN 45.0
996 997 Hostel: Part II Horror Three American college students studying abroa... Eli Roth Lauren German, Heather Matarazzo, Bijou Philli... 2007 94 5.5 73152 17.54 46.0
997 998 Step Up 2: The Streets Drama,Music,Romance Romantic sparks occur between two dance studen... Jon M. Chu Robert Hoffman, Briana Evigan, Cassie Ventura,... 2008 98 6.2 70699 58.01 50.0
998 999 Search Party Adventure,Comedy A pair of friends embark on a mission to reuni... Scot Armstrong Adam Pally, T.J. Miller, Thomas Middleditch,Sh... 2014 93 5.6 4881 NaN 22.0
999 1000 Nine Lives Comedy,Family,Fantasy A stuffy businessman finds himself trapped ins... Barry Sonnenfeld Kevin Spacey, Jennifer Garner, Robbie Amell,Ch... 2016 87 5.3 12435 19.64 11.0

1000 rows × 12 columns

# Question 1: how do we get information such as the average rating
# and the number of directors from this movie data?
movie["Rating"].mean()
6.723200000000003
movie["Director"]
0                James Gunn
1              Ridley Scott
2        M. Night Shyamalan
3      Christophe Lourdelet
4                David Ayer
               ...         
995               Billy Ray
996                Eli Roth
997              Jon M. Chu
998          Scot Armstrong
999        Barry Sonnenfeld
Name: Director, Length: 1000, dtype: object
# np.unique() removes duplicates, because one director may have directed several movies
np.unique(movie["Director"])
array(['Aamir Khan', 'Abdellatif Kechiche', 'Adam Leon', 'Adam McKay',
       'Adam Shankman', 'Adam Wingard', 'Afonso Poyart', 'Aisling Walsh',
       'Akan Satayev', 'Akiva Schaffer', 'Alan Taylor', 'Albert Hughes',
       'Alejandro Amenábar', 'Alejandro González Iñárritu',
 		...
      'Tomas Alfredson', 'Tony Gilroy', 'Tony Scott', 'Travis Knight',
       'Tyler Shields', 'Wally Pfister', 'Walt Dohrn', 'Walter Hill',
       'Warren Beatty', 'Werner Herzog', 'Wes Anderson', 'Wes Ball',
       'Wes Craven', 'Whit Stillman', 'Will Gluck', 'Will Slocombe',
       'William Brent Bell', 'William Oldroyd', 'Woody Allen',
       'Xavier Dolan', 'Yimou Zhang', 'Yorgos Lanthimos', 'Zack Snyder',
       'Zackary Adler'], dtype=object)
# Number of directors
np.unique(movie["Director"]).size
644
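
pandas can answer the same question without going through NumPy: `Series.nunique()` counts distinct values and `Series.value_counts()` shows how often each one occurs. A minimal sketch on a made-up list of directors:

```python
import pandas as pd

directors = pd.Series(["Ridley Scott", "James Gunn", "Ridley Scott", "David Ayer"])

# Number of distinct directors
n_directors = directors.nunique()

# Number of movies per director, most frequent first
films_per_director = directors.value_counts()

print(n_directors)
print(films_per_director)
```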
# Question 2: how should we present this movie data if we want to see the distribution of rating and runtime?
movie["Rating"].plot(kind="hist",figsize=(20,8),fontsize=40)
<matplotlib.axes._subplots.AxesSubplot at 0x2471ce18708>

[Figure output_86_1.png: histogram of Rating (image unavailable)]

import matplotlib.pyplot as plt

# 1. Create the figure
plt.figure(figsize=(20,8),dpi=100)

# 2. Draw the histogram with 20 bins
plt.hist(movie["Rating"],20)

# Set the x-axis ticks to the 21 bin edges
plt.xticks(np.linspace(movie["Rating"].min(),movie["Rating"].max(),21))

# Add a grid
plt.grid(linestyle="--",alpha=0.5)

# 3. Show the plot
plt.show()

[Figure output_87_0.png: histogram of Rating with 20 bins (image unavailable)]

movie["Rating"]
0      8.1
1      7.0
2      7.3
3      7.2
4      6.2
      ... 
995    6.2
996    5.5
997    6.2
998    5.6
999    5.3
Name: Rating, Length: 1000, dtype: float64
# Question 3: how should we process this movie data if we want statistics on movie genres?

# First find out which genres exist
movie_genre = [i.split(",") for i in movie["Genre"]]
movie_genre
[['Action', 'Adventure', 'Sci-Fi'],
 ['Adventure', 'Mystery', 'Sci-Fi'],
 ['Horror', 'Thriller'],
 ['Animation', 'Comedy', 'Family'],
 ['Action', 'Adventure', 'Fantasy'],
	...

 ['Horror'],
 ['Drama', 'Music', 'Romance'],
 ['Adventure', 'Comedy'],
 ['Comedy', 'Family', 'Fantasy']]
[j for i in movie_genre for j in i]
['Action',
 'Adventure',
 'Sci-Fi',
 'Adventure',
 'Mystery',
 'Sci-Fi',
...

 'Animation',
 'Action',
 'Adventure',
 'Action',
 'Adventure',
 'Drama',
 ...]
movie_class = np.unique([j for i in movie_genre for j in i])
movie_class
array(['Action', 'Adventure', 'Animation', 'Biography', 'Comedy', 'Crime',
       'Drama', 'Family', 'Fantasy', 'History', 'Horror', 'Music',
       'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Sport', 'Thriller',
       'War', 'Western'], dtype='<U9')
len(movie_class) # 20 genres
20
# Count how many movies fall in each genre

# Start with an all-zero DataFrame: one row per movie, one column per genre
count = pd.DataFrame(np.zeros(shape=[1000,20],dtype="int32"),columns=movie_class)
count.head()
Action Adventure Animation Biography Comedy Crime Drama Family Fantasy History Horror Music Musical Mystery Romance Sci-Fi Sport Thriller War Western
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
count.loc[0,movie_genre[0]]
Action       0
Adventure    0
Sci-Fi       0
Name: 0, dtype: int32
movie_genre[0]
['Action', 'Adventure', 'Sci-Fi']
# Fill in the table: mark each movie's genres with 1
for i in range(1000):
    count.loc[i,movie_genre[i]] = 1
count
Action Adventure Animation Biography Comedy Crime Drama Family Fantasy History Horror Music Musical Mystery Romance Sci-Fi Sport Thriller War Western
0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0
3 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
4 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
995 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0
996 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
997 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0
998 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
999 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0

1000 rows × 20 columns

# Sum down each column
count.sum(axis=0)
Action       303
Adventure    259
Animation     49
Biography     81
Comedy       279
Crime        150
Drama        513
Family        51
Fantasy      101
History       29
Horror       119
Music         16
Musical        5
Mystery      106
Romance      141
Sci-Fi       120
Sport         18
Thriller     195
War           13
Western        7
dtype: int64
count.sum(axis=0).sort_values(ascending=False)
Drama        513
Action       303
Comedy       279
Adventure    259
Thriller     195
Crime        150
Romance      141
Sci-Fi       120
Horror       119
Mystery      106
Fantasy      101
Biography     81
Family        51
Animation     49
History       29
Sport         18
Music         16
War           13
Western        7
Musical        5
dtype: int64
count.sum(axis=0).sort_values(ascending=False).plot(kind="bar",fontsize=20,figsize=(20,9),colormap="cool")
<matplotlib.axes._subplots.AxesSubplot at 0x2472450c1c8>
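
The split, zero-table and loop steps above build a genre indicator table by hand. As an alternative sketch, `Series.str.get_dummies(sep=",")` produces the same kind of 0/1 table in one call. The three genre strings below are invented stand-ins for the `Genre` column.

```python
import pandas as pd

genre = pd.Series(["Action,Adventure,Sci-Fi", "Horror,Thriller", "Action,Horror"])

# One 0/1 column per distinct genre, one row per movie
indicator = genre.str.get_dummies(sep=",")

# Movies per genre, largest first
counts = indicator.sum(axis=0).sort_values(ascending=False)

print(indicator)
print(counts)
```

The manual version is still useful for seeing what one-hot encoding does step by step. The one-liner is mainly a convenience once the idea is clear.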
