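The data we analyze comes from many sources, such as web scraping, company databases, and data vendors. Some of the fields in that data are not needed, and the data may also contain duplicate records and missing values.
1. Deleting data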
import pandas as pd
df = pd.read_excel(r'C:\Users\lin-a\Desktop\data\rate.xlsx')
print(df.shape)
print(df.head())
# Output:
(219, 15)
CountryName Country Code 1990 2000 2007 \
0 Afghanistan AFG 101.094930 103.254202 100.000371
1 Albania ALB 61.808311 59.585866 50.862987
2 Algeria DZA 87.675705 62.886169 49.487870
3 American Samoa ASM NaN NaN NaN
4 Andorra ADO NaN NaN NaN
2008 2009 2010 2011 2012 2013 \
0 100.215886 100.060480 99.459839 97.667911 95.312707 92.602785
1 49.663787 48.637067 NaN 46.720288 45.835739 45.247477
2 48.910002 48.645026 48.681853 49.233576 49.847713 50.600697
3 NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN
2014 2015 Change 1990-2015 Change 2007-2015
0 89.773777 86.954464 -14.140466 -13.045907
1 44.912168 44.806973 -17.001338 -6.056014
2 51.536631 52.617579 -35.058127 3.129709
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
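The result contains many NaN values. A NaN means that the corresponding cell in the file is empty; pandas represents such cells as NaN when it reads the file, and this is what we usually call a missing (null) value.
NumPy provides the value nan. If you want to create a missing value yourself, you can import it as follows: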
from numpy import nan as NaN
Note that NaN is special in that it is itself a value of type float:
from numpy import nan as NaN
print(type(NaN))
# Output:
<class 'float'>
When NaN takes part in an arithmetic operation, the result is always NaN.
from numpy import nan as NaN
print(NaN+1)
# Output:
nan
So missing values do affect the results of our calculations.
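To make the propagation rule concrete, here is a minimal sketch. The variable names are made up for illustration. It also shows two related behaviours that are easy to trip over: NaN never compares equal to itself, and pandas reductions such as sum() skip NaN unless told otherwise.

```python
import numpy as np
import pandas as pd

nan_sum = np.nan + 1             # arithmetic with NaN yields NaN
nan_equal = (np.nan == np.nan)   # NaN is not equal to anything, itself included
is_missing = pd.isna(nan_sum)    # so use pd.isna() to test for a missing value

s = pd.Series([4, np.nan, 8])
plain_total = s.sum()                # pandas skips NaN by default (skipna=True)
strict_total = s.sum(skipna=False)   # NaN propagates once skipping is turned off

print(nan_sum, nan_equal, is_missing)
print(plain_total, strict_total)
```

In other words, element-wise arithmetic always propagates NaN, while aggregations only do so when skipna=False is passed.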
For a large Series it is hard to spot missing values by eye, so we first need to filter them out.
import pandas as pd
from numpy import nan as NaN
se = pd.Series([4,NaN,8,NaN,5])
print(se.notnull())
print('='*20)
print(se[se.notnull()])
# Output:
0 True
1 False
2 True
3 False
4 True
dtype: bool
====================
0 4.0
2 8.0
4 5.0
dtype: float64
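The same idea carries over to a DataFrame. The sketch below uses a small made-up table (the column names name, x and y are purely illustrative) to count the missing values in each column and to pull out the rows that contain at least one NaN, which is often a useful check before deleting anything.

```python
import numpy as np
import pandas as pd

# Hypothetical data, for illustration only
demo = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "x": [1.0, np.nan, 3.0, np.nan],
    "y": [np.nan, 2.0, 3.0, 4.0],
})

missing_per_column = demo.isna().sum()              # NaN count for each column
rows_with_missing = demo[demo.isna().any(axis=1)]   # rows holding at least one NaN

print(missing_per_column)
print(rows_with_missing)
```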
For DataFrame data, we usually remove all rows that contain NaN with the dropna() method.
df1 = df.dropna()
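dropna() is the method for removing missing data:
- by default it drops every row that contains at least one NaN;
- to drop only the rows in which every value is missing, pass how='all';
- to drop columns instead of rows, pass the axis parameter: axis=1 means columns, axis=0 means rows;
- the thresh parameter can also be used to select what is dropped: thresh=n keeps only the rows that have at least n non-NaN values.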
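The options listed above can be compared side by side on a tiny table. This is only a sketch: the frame demo and its columns a, b and c are invented for the example and have nothing to do with rate.xlsx.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 2 is entirely NaN, row 3 is complete
demo = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1.0, 2.0, np.nan, 4.0],
    "c": [np.nan, np.nan, np.nan, 4.0],
})

any_rows = demo.dropna()                    # drop rows containing any NaN
all_rows = demo.dropna(how="all")           # drop only rows that are entirely NaN
thresh_rows = demo.dropna(thresh=2)         # keep rows with at least 2 non-NaN values
dense_cols = demo.dropna(axis=1, thresh=3)  # axis=1: keep columns with at least 3 non-NaN values

print(any_rows.index.tolist())      # [3]
print(all_rows.index.tolist())      # [0, 1, 3]
print(thresh_rows.index.tolist())   # [0, 3]
print(dense_cols.columns.tolist())  # ['b']
```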
What if I only want to delete a couple of specific rows? For that you can use the df.drop() method. Here is its signature (abridged):
DataFrame.drop(labels=None,axis=0,index=None,columns=None,inplace=False)
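The parameters are:
- labels: the names of the rows or columns to delete, given as a list
- axis: whether labels refers to rows (axis=0, the default) or columns (axis=1)
- index: directly specifies the rows to delete
- columns: directly specifies the columns to delete
- inplace=False: the default; the deletion does not modify the original data but returns a new DataFrame with the rows or columns removed
- inplace=True: the deletion is performed directly on the original data and cannot be undone afterwards
From these parameters, there are two ways to delete rows or columns:
- combine labels=[...] with axis (axis=0 for rows, axis=1 for columns)
- specify the rows or columns to delete directly with index or columns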
# Method 1
import pandas as pd
df = pd.read_excel(r'C:\Users\lin-a\Desktop\data\rate.xlsx')
# Delete the first two rows (index labels 0 and 1)
df3 = df.drop(labels=[0,1],axis=0)
print(df3)
# Output:
CountryName Country Code 1990 2000 2007 2008 2009 2010 2011 2012 2013 2014 2015 Change 1990-2015 Change 2007-2015
2 Algeria DZA 87.675705 62.886169 49.487870 48.910002 48.645026 48.681853 49.233576 49.847713 50.600697 51.536631 52.617579 -35.058127 3.129709
3 American Samoa ASM NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 Andorra ADO NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 Algeria DZA 87.675705 62.886169 49.487870 48.910002 48.645026 48.681853 49.233576 49.847713 50.600697 51.536631 52.617579 -35.058127 3.129709
6 Angola AGO 101.394722 100.930475 102.563811 102.609186 102.428788 102.035690 102.106756 101.836900 101.315234 100.637667 99.855751 -1.538971 -2.708060
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
214 Virgin Islands (U.S.) VIR 53.546901 52.647183 49.912779 50.459425 51.257336 52.382523 53.953515 55.687666 57.537152 59.399244 61.199651 7.652751 11.286872
215 West Bank and Gaza WBG 102.789182 100.410470 88.367099 86.204936 84.173732 82.333762 80.851747 79.426778 78.118403 76.975462 76.001869 -26.787313 -12.365230
216 Yemen, Rep. YEM 118.779727 105.735754 88.438350 85.899643 83.594489 81.613924 80.193948 78.902603 77.734373 76.644268 75.595147 -43.184581 -12.843203
217 Zambia ZMB 100.485263 97.124776 98.223058 98.277253 98.260795 98.148325 97.854237 97.385802 96.791310 96.122165 95.402326 -5.082936 -2.820732
218 Zimbabwe ZWE 96.365244 84.877487 81.272166 81.024020 80.934968 80.985702 80.740494 80.579870 80.499816 80.456439 80.391033 -15.974211 -0.881133
217 rows × 15 columns
# Method 2
# Delete the column named 1990
df4 = df.drop(columns=1990)
print(df4)
# Output:
CountryName Country Code 2000 2007 2008 2009 2010 2011 2012 2013 2014 2015 Change 1990-2015 Change 2007-2015
0 Afghanistan AFG 103.254202 100.000371 100.215886 100.060480 99.459839 97.667911 95.312707 92.602785 89.773777 86.954464 -14.140466 -13.045907
1 Albania ALB 59.585866 50.862987 49.663787 48.637067 NaN 46.720288 45.835739 45.247477 44.912168 44.806973 -17.001338 -6.056014
2 Algeria DZA 62.886169 49.487870 48.910002 48.645026 48.681853 49.233576 49.847713 50.600697 51.536631 52.617579 -35.058127 3.129709
3 American Samoa ASM NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 Andorra ADO NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
214 Virgin Islands (U.S.) VIR 52.647183 49.912779 50.459425 51.257336 52.382523 53.953515 55.687666 57.537152 59.399244 61.199651 7.652751 11.286872
215 West Bank and Gaza WBG 100.410470 88.367099 86.204936 84.173732 82.333762 80.851747 79.426778 78.118403 76.975462 76.001869 -26.787313 -12.365230
216 Yemen, Rep. YEM 105.735754 88.438350 85.899643 83.594489 81.613924 80.193948 78.902603 77.734373 76.644268 75.595147 -43.184581 -12.843203
217 Zambia ZMB 97.124776 98.223058 98.277253 98.260795 98.148325 97.854237 97.385802 96.791310 96.122165 95.402326 -5.082936 -2.820732
218 Zimbabwe ZWE 84.877487 81.272166 81.024020 80.934968 80.985702 80.740494 80.579870 80.499816 80.456439 80.391033 -15.974211 -0.881133
219 rows × 14 columns
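The two styles of drop() described above give the same result, which the following sketch checks on a small made-up frame (demo and its values are illustrative only). It also shows the practical difference between inplace=False and inplace=True: the in-place call returns None and changes the frame it is called on.

```python
import pandas as pd

# Hypothetical data; the integer column labels mimic the year columns of rate.xlsx
demo = pd.DataFrame(
    {"name": ["a", "b", "c"], 1990: [1.0, 2.0, 3.0], 2000: [4.0, 5.0, 6.0]}
)

rows_by_labels = demo.drop(labels=[0, 1], axis=0)   # labels + axis
rows_by_index = demo.drop(index=[0, 1])             # index keyword
cols_by_labels = demo.drop(labels=[1990], axis=1)   # labels + axis
cols_by_columns = demo.drop(columns=[1990])         # columns keyword

print(rows_by_labels.equals(rows_by_index))    # True
print(cols_by_labels.equals(cols_by_columns))  # True

# inplace=True modifies the object itself and returns None
scratch = demo.copy()
result = scratch.drop(columns=[2000], inplace=True)
print(result, scratch.columns.tolist(), demo.shape)
```

Working on a copy, as scratch does here, is one way to keep the original data recoverable when experimenting with inplace=True.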
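2. Handling missing values
For missing values, we can either delete the whole record or fill the gaps with the fillna() method.
df.fillna(value=None,method=None,axis=None,inplace=False,limit=None,downcast=None,**kwargs)
Note: the method parameter cannot be used together with the value parameter. Recent pandas releases (2.1 and later) deprecate the method parameter in favor of the ffill() and bfill() methods.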
import pandas as pd
df3 = pd.read_excel(r'C:\Users\lin-a\Desktop\data\rate.xlsx')
# Fill with a constant using fillna
print(df3.fillna(0))
# Output:
CountryName Country Code 1990 2000 2007 2008 2009 2010 2011 2012 2013 2014 2015 Change 1990-2015 Change 2007-2015
0 Afghanistan AFG 101.094930 103.254202 100.000371 100.215886 100.060480 99.459839 97.667911 95.312707 92.602785 89.773777 86.954464 -14.140466 -13.045907
1 Albania ALB 61.808311 59.585866 50.862987 49.663787 48.637067 0.000000 46.720288 45.835739 45.247477 44.912168 44.806973 -17.001338 -6.056014
2 Algeria DZA 87.675705 62.886169 49.487870 48.910002 48.645026 48.681853 49.233576 49.847713 50.600697 51.536631 52.617579 -35.058127 3.129709
3 American Samoa ASM 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
4 Andorra ADO 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
214 Virgin Islands (U.S.) VIR 53.546901 52.647183 49.912779 50.459425 51.257336 52.382523 53.953515 55.687666 57.537152 59.399244 61.199651 7.652751 11.286872
215 West Bank and Gaza WBG 102.789182 100.410470 88.367099 86.204936 84.173732 82.333762 80.851747 79.426778 78.118403 76.975462 76.001869 -26.787313 -12.365230
216 Yemen, Rep. YEM 118.779727 105.735754 88.438350 85.899643 83.594489 81.613924 80.193948 78.902603 77.734373 76.644268 75.595147 -43.184581 -12.843203
217 Zambia ZMB 100.485263 97.124776 98.223058 98.277253 98.260795 98.148325 97.854237 97.385802 96.791310 96.122165 95.402326 -5.082936 -2.820732
218 Zimbabwe ZWE 96.365244 84.877487 81.272166 81.024020 80.934968 80.985702 80.740494 80.579870 80.499816 80.456439 80.391033 -15.974211 -0.881133
219 rows × 15 columns
# Fill each numeric column with that column's mean
print(df3.fillna(df3.mean(numeric_only=True)))
# Output:
CountryName Country Code 1990 2000 2007 2008 2009 2010 2011 2012 2013 2014 2015 Change 1990-2015 Change 2007-2015
0 Afghanistan AFG 101.094930 103.254202 100.000371 100.215886 100.060480 99.459839 97.667911 95.312707 92.602785 89.773777 86.954464 -14.140466 -13.045907
1 Albania ALB 61.808311 59.585866 50.862987 49.663787 48.637067 59.623883 46.720288 45.835739 45.247477 44.912168 44.806973 -17.001338 -6.056014
2 Algeria DZA 87.675705 62.886169 49.487870 48.910002 48.645026 48.681853 49.233576 49.847713 50.600697 51.536631 52.617579 -35.058127 3.129709
3 American Samoa ASM 72.706435 66.999367 61.311772 60.652862 60.066010 59.623883 59.297153 59.058143 58.864043 58.726255 58.633697 -14.072738 -2.678076
4 Andorra ADO 72.706435 66.999367 61.311772 60.652862 60.066010 59.623883 59.297153 59.058143 58.864043 58.726255 58.633697 -14.072738 -2.678076
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
214 Virgin Islands (U.S.) VIR 53.546901 52.647183 49.912779 50.459425 51.257336 52.382523 53.953515 55.687666 57.537152 59.399244 61.199651 7.652751 11.286872
215 West Bank and Gaza WBG 102.789182 100.410470 88.367099 86.204936 84.173732 82.333762 80.851747 79.426778 78.118403 76.975462 76.001869 -26.787313 -12.365230
216 Yemen, Rep. YEM 118.779727 105.735754 88.438350 85.899643 83.594489 81.613924 80.193948 78.902603 77.734373 76.644268 75.595147 -43.184581 -12.843203
217 Zambia ZMB 100.485263 97.124776 98.223058 98.277253 98.260795 98.148325 97.854237 97.385802 96.791310 96.122165 95.402326 -5.082936 -2.820732
218 Zimbabwe ZWE 96.365244 84.877487 81.272166 81.024020 80.934968 80.985702 80.740494 80.579870 80.499816 80.456439 80.391033 -15.974211 -0.881133
219 rows × 15 columns
# Fill with the previous value (forward fill, ffill)
print(df3.ffill(axis=0))  # same as fillna(method='ffill', axis=0) in older pandas
# Output:
CountryName Country Code 1990 2000 2007 2008 2009 2010 2011 2012 2013 2014 2015 Change 1990-2015 Change 2007-2015
0 Afghanistan AFG 101.094930 103.254202 100.000371 100.215886 100.060480 99.459839 97.667911 95.312707 92.602785 89.773777 86.954464 -14.140466 -13.045907
1 Albania ALB 61.808311 59.585866 50.862987 49.663787 48.637067 99.459839 46.720288 45.835739 45.247477 44.912168 44.806973 -17.001338 -6.056014
2 Algeria DZA 87.675705 62.886169 49.487870 48.910002 48.645026 48.681853 49.233576 49.847713 50.600697 51.536631 52.617579 -35.058127 3.129709
3 American Samoa ASM 87.675705 62.886169 49.487870 48.910002 48.645026 48.681853 49.233576 49.847713 50.600697 51.536631 52.617579 -35.058127 3.129709
4 Andorra ADO 87.675705 62.886169 49.487870 48.910002 48.645026 48.681853 49.233576 49.847713 50.600697 51.536631 52.617579 -35.058127 3.129709
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
214 Virgin Islands (U.S.) VIR 53.546901 52.647183 49.912779 50.459425 51.257336 52.382523 53.953515 55.687666 57.537152 59.399244 61.199651 7.652751 11.286872
215 West Bank and Gaza WBG 102.789182 100.410470 88.367099 86.204936 84.173732 82.333762 80.851747 79.426778 78.118403 76.975462 76.001869 -26.787313 -12.365230
216 Yemen, Rep. YEM 118.779727 105.735754 88.438350 85.899643 83.594489 81.613924 80.193948 78.902603 77.734373 76.644268 75.595147 -43.184581 -12.843203
217 Zambia ZMB 100.485263 97.124776 98.223058 98.277253 98.260795 98.148325 97.854237 97.385802 96.791310 96.122165 95.402326 -5.082936 -2.820732
218 Zimbabwe ZWE 96.365244 84.877487 81.272166 81.024020 80.934968 80.985702 80.740494 80.579870 80.499816 80.456439 80.391033 -15.974211 -0.881133
219 rows × 15 columns
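A few other filling strategies are worth knowing alongside the three shown above. The sketch below is not from the original data; the Series s and the frame demo are made up. It compares forward fill, backward fill, a fill limited to one consecutive gap, and a per-column fill value passed as a dict.

```python
import numpy as np
import pandas as pd

# Hypothetical data, for illustration only
s = pd.Series([1.0, np.nan, np.nan, 4.0])

forward = s.ffill()          # carry the previous value forward
backward = s.bfill()         # pull the next value backward
limited = s.ffill(limit=1)   # fill at most one consecutive NaN

print(forward.tolist())    # [1.0, 1.0, 1.0, 4.0]
print(backward.tolist())   # [1.0, 4.0, 4.0, 4.0]
print(limited.tolist())    # [1.0, 1.0, nan, 4.0]

# A dict gives each column its own fill value
demo = pd.DataFrame({"x": [1.0, np.nan], "y": [np.nan, 2.0]})
per_column = demo.fillna({"x": 0.0, "y": -1.0})
print(per_column)
```

Which strategy is appropriate depends on the data: forward fill copies values across rows, which is only meaningful when neighbouring rows are related (for example, consecutive dates), and is questionable for a table where each row is a different country.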
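3. Handling duplicate data
Duplicate data lowers both the accuracy and the efficiency of an analysis, so we should remove duplicates when cleaning the data.
- The duplicated() function reports, for every row, whether it is a duplicate (True means it is).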
import pandas as pd
df3 = pd.read_excel(r'C:\Users\lin-a\Desktop\data\rate.xlsx')
# Flag the duplicated rows
print(df3.duplicated())
# Output:
0 False
1 False
2 False
3 False
4 False
...
214 False
215 False
216 False
217 False
218 False
Length: 219, dtype: bool
The result is a Series of bool values: a row is marked True if its values in all columns repeat those of an earlier row, and False otherwise.
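With 219 rows, the truncated printout hides any True values in the middle, so it can help to count them or to display the duplicated rows directly. A minimal sketch, using an invented four-row frame that reuses the CountryName and Country Code column names:

```python
import pandas as pd

# Hypothetical data: the third row repeats the first
demo = pd.DataFrame({
    "CountryName": ["Algeria", "Angola", "Algeria", "Zambia"],
    "Country Code": ["DZA", "AGO", "DZA", "ZMB"],
})

flags = demo.duplicated()                        # first occurrence False, later repeats True
n_duplicates = int(flags.sum())                  # True counts as 1
all_copies = demo[demo.duplicated(keep=False)]   # every member of a duplicate group

print(flags.tolist())   # [False, False, True, False]
print(n_duplicates)     # 1
print(all_copies)
```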
- drop_duplicates() removes the duplicated rows.
df.drop_duplicates()
- We can also look for duplicates in a single column only and delete on that basis.
df.drop_duplicates(['CountryName'],inplace=False)
Here ['CountryName'] means that rows are compared on the CountryName column only, and inplace controls whether the original data is modified directly.
import pandas as pd
df3 = pd.read_excel(r'C:\Users\lin-a\Desktop\data\rate.xlsx')
# Delete rows whose CountryName value is duplicated
print(df3.drop_duplicates(['CountryName'],inplace=False))
# Output:
CountryName Country Code 1990 2000 2007 2008 2009 2010 2011 2012 2013 2014 2015 Change 1990-2015 Change 2007-2015
0 Afghanistan AFG 101.094930 103.254202 100.000371 100.215886 100.060480 99.459839 97.667911 95.312707 92.602785 89.773777 86.954464 -14.140466 -13.045907
1 Albania ALB 61.808311 59.585866 50.862987 49.663787 48.637067 NaN 46.720288 45.835739 45.247477 44.912168 44.806973 -17.001338 -6.056014
2 Algeria DZA 87.675705 62.886169 49.487870 48.910002 48.645026 48.681853 49.233576 49.847713 50.600697 51.536631 52.617579 -35.058127 3.129709
3 American Samoa ASM NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 Andorra ADO NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
214 Virgin Islands (U.S.) VIR 53.546901 52.647183 49.912779 50.459425 51.257336 52.382523 53.953515 55.687666 57.537152 59.399244 61.199651 7.652751 11.286872
215 West Bank and Gaza WBG 102.789182 100.410470 88.367099 86.204936 84.173732 82.333762 80.851747 79.426778 78.118403 76.975462 76.001869 -26.787313 -12.365230
216 Yemen, Rep. YEM 118.779727 105.735754 88.438350 85.899643 83.594489 81.613924 80.193948 78.902603 77.734373 76.644268 75.595147 -43.184581 -12.843203
217 Zambia ZMB 100.485263 97.124776 98.223058 98.277253 98.260795 98.148325 97.854237 97.385802 96.791310 96.122165 95.402326 -5.082936 -2.820732
218 Zimbabwe ZWE 96.365244 84.877487 81.272166 81.024020 80.934968 80.985702 80.740494 80.579870 80.499816 80.456439 80.391033 -15.974211 -0.881133
217 rows × 15 columns
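By default drop_duplicates() keeps the first row of each duplicate group. The keep parameter changes that, as the following sketch shows on an invented three-row frame (the value column is hypothetical). It also resets the index afterwards, since deleting rows leaves gaps in the original labels.

```python
import pandas as pd

# Hypothetical data: "Algeria" appears twice with different values
demo = pd.DataFrame({
    "CountryName": ["Algeria", "Angola", "Algeria"],
    "value": [1, 2, 3],
})

keep_first = demo.drop_duplicates(subset=["CountryName"])               # default keep="first"
keep_last = demo.drop_duplicates(subset=["CountryName"], keep="last")   # keep the last copy
keep_none = demo.drop_duplicates(subset=["CountryName"], keep=False)    # drop every copy

renumbered = keep_last.reset_index(drop=True)   # index becomes 0, 1 again

print(keep_first.index.tolist())   # [0, 1]
print(keep_last.index.tolist())    # [1, 2]
print(keep_none.index.tolist())    # [1]
print(renumbered)
```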
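4. Summary
Deleting data
Handling missing values
Removing duplicate data
5. Exercises
- Using the results of domestic football matches in England since the 2009-2010 season, delete all data for the 2009/2010 season as well as the country column.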
import pandas as pd
# Read the data
df = pd.read_excel(r'C:\Users\lin-a\Desktop\data\sccore_game.xlsx')
# Inspect the shape and the first rows of the data
print(df.shape)
print(df.head())
# Loop over the rows and collect the index labels of the 2009/2010 season
index_list = []
for value,row_data in df.iterrows():
    if row_data['season'] == '2009/2010':
        index_list.append(value)
# Delete the 2009/2010 season data
df1 = df.drop(labels=index_list,axis=0)
print(df1)
# Delete the country column, method 1
df2 = df.drop(labels=['country'],axis=1)
print(df2)
# Delete the country column, method 2
df3 = df.drop(columns='country')
print(df3)
print(df3)
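The row-by-row loop works, but the same exercise can also be solved with a boolean mask, which is usually shorter and faster on large tables. The sketch below is an alternative under stated assumptions, not the original solution: it uses a made-up stand-in for the match data in which only the season and country column names come from the exercise, and home_goals is hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the match data
demo = pd.DataFrame({
    "season": ["2009/2010", "2010/2011", "2009/2010", "2011/2012"],
    "country": ["England"] * 4,
    "home_goals": [1, 2, 0, 3],
})

# Keep the rows whose season is not 2009/2010, then drop the country column
kept = demo[demo["season"] != "2009/2010"].drop(columns="country")

print(kept)
```

Chaining the two steps also produces a single result with both deletions applied.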
- Look at the data and compute the total price (Price) from the unit price (OnePrice) and the purchase quantity (Count).
import pandas as pd
# Read the data
books = pd.read_excel(r'C:\Users\lin-a\Desktop\data\04books.xlsx')
# Inspect the shape and the first rows of the data
print(books.shape)
print(books.head())
#print('{:-^50}'.format('separator'))
# Compute Price for all rows (the columns are aligned on the index and multiplied element-wise)
#books['Price'] = books['OnePrice']*books['Count']
#print(books)
print('{:-^50}'.format('separator'))
# Compute Price only for a contiguous range of rows (positions 5 to 15) with a loop
for i in range(5,16):
    # .loc on the default integer index avoids chained assignment
    books.loc[i,'Price'] = books.loc[i,'OnePrice']*books.loc[i,'Count']
print(books)
# Output:
(20, 5)
ID Name OnePrice Count Price
0 1 Product_001 9.82 5 NaN
1 2 Product_002 11.99 4 NaN
2 3 Product_003 9.62 6 NaN
3 4 Product_004 11.08 8 NaN
4 5 Product_005 7.75 3 NaN
--------------------separator---------------------
ID Name OnePrice Count Price
0 1 Product_001 9.82 5 NaN
1 2 Product_002 11.99 4 NaN
2 3 Product_003 9.62 6 NaN
3 4 Product_004 11.08 8 NaN
4 5 Product_005 7.75 3 NaN
5 6 Product_006 7.34 4 29.36
6 7 Product_007 10.97 6 65.82
7 8 Product_008 11.14 7 77.98
8 9 Product_009 8.98 2 17.96
9 10 Product_010 9.18 3 27.54
10 11 Product_011 8.31 4 33.24
11 12 Product_012 7.29 9 65.61
12 13 Product_013 8.36 5 41.80
13 14 Product_014 9.16 6 54.96
14 15 Product_015 10.31 3 30.93
15 16 Product_016 10.26 6 61.56
16 17 Product_017 11.95 8 NaN
17 18 Product_018 11.22 2 NaN
18 19 Product_019 10.95 4 NaN
19 20 Product_020 8.82 6 NaN
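The partial update can also be written without a loop by selecting the target rows once and assigning the whole slice. This is a sketch of that alternative, assuming the default integer index; the four-row books frame below reuses the first OnePrice and Count values from the output above, and the choice of rows 1 to 2 is arbitrary.

```python
import numpy as np
import pandas as pd

# Small stand-in for 04books.xlsx, for illustration only
books = pd.DataFrame({
    "OnePrice": [9.82, 11.99, 9.62, 11.08],
    "Count": [5, 4, 6, 8],
    "Price": [np.nan] * 4,
})

rows = books.index[1:3]   # index labels of the rows at positions 1 and 2
books.loc[rows, "Price"] = books.loc[rows, "OnePrice"] * books.loc[rows, "Count"]

print(books)
```

Only the selected rows receive a value; the remaining Price cells stay NaN, just as in the loop version.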