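The data we analyze can come from many sources, such as web scraping, company databases, and data vendors. Some of the fields in that data are not needed for the analysis, and the data may also contain duplicate records and missing values.
1. Deleting data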
import pandas as pd
df = pd.read_excel(r'C:\Users\lin-a\Desktop\data\rate.xlsx')
print(df.shape)
print(df.head())
# Output:
(219, 15)
CountryName Country Code 1990 2000 2007 \
0 Afghanistan AFG 101.094930 103.254202 100.000371
1 Albania ALB 61.808311 59.585866 50.862987
2 Algeria DZA 87.675705 62.886169 49.487870
3 American Samoa ASM NaN NaN NaN
4 Andorra ADO NaN NaN NaN
2008 2009 2010 2011 2012 2013 \
0 100.215886 100.060480 99.459839 97.667911 95.312707 92.602785
1 49.663787 48.637067 NaN 46.720288 45.835739 45.247477
2 48.910002 48.645026 48.681853 49.233576 49.847713 50.600697
3 NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN
2014 2015 Change 1990-2015 Change 2007-2015
0 89.773777 86.954464 -14.140466 -13.045907
1 44.912168 44.806973 -17.001338 -6.056014
2 51.536631 52.617579 -35.058127 3.129709
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
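The result contains many NaN values. A NaN means that the corresponding cell in the file is empty; when pandas reads such a cell it represents it as NaN, which is what we usually call a missing (null) value.
NumPy provides the value nan. To create a missing value yourself, you can import it as follows:
from numpy import nan as NaN
Note that NaN is special in that it is itself a value of type float: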
from numpy import nan as NaN
print(type(NaN))
# Output:
<class 'float'>
When NaN takes part in an arithmetic operation, the result is always NaN.
from numpy import nan as NaN
print(NaN+1)
# Output:
nan
So missing values do affect the results of our calculations.
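The propagation described above is easy to check on a few values. The sketch below is only an illustration with made-up numbers; it also shows two related behaviors worth knowing: NaN never compares equal to itself, and pandas reductions such as sum() skip NaN unless told otherwise.

```python
import numpy as np
import pandas as pd

x = np.nan

# Arithmetic with NaN propagates NaN.
total = x + 1

# NaN never compares equal to anything, including itself,
# so use an explicit check such as pd.isna() instead of ==.
equal_to_itself = (x == x)
is_missing = pd.isna(total)

# pandas reductions skip NaN by default (skipna=True).
s = pd.Series([4, np.nan, 8])
skipped_sum = s.sum()
strict_sum = s.sum(skipna=False)

print(total, equal_to_itself, is_missing, skipped_sum, strict_sum)
```

Because reductions silently skip NaN, a column mean or sum can look plausible even when many values are missing, which is one more reason to inspect missing values explicitly.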
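In a large Series it is hard to spot missing values by eye, so we first filter them out.
import pandas as pd
from numpy import nan as NaN
se = pd.Series([4,NaN,8,NaN,5])
# Boolean mask: True where the value is not missing
print(se.notnull())
print('='*20)
# Keep only the non-missing values
print(se[se.notnull()])
# Output: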
0 True
1 False
2 True
3 False
4 True
dtype: bool
====================
0 4.0
2 8.0
4 5.0
dtype: float64
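The same idea carries over to a DataFrame. As a minimal sketch, assuming a small made-up table (the names toy, x, and y are hypothetical and not part of the rate.xlsx data), we can count the missing values per column and pull out the rows that contain any of them:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a larger table.
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "x": [1.0, np.nan, 3.0, np.nan],
    "y": [np.nan, 2.0, 3.0, 4.0],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Rows that contain at least one missing value.
rows_with_missing = toy[toy.isna().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```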
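For a DataFrame, the usual approach is to remove the data that contains NaN with the dropna() method.
df1 = df.dropna()
dropna() removes data that contains missing values:
- By default, it drops every row that contains at least one NaN;
- To drop only the rows in which every value is missing, pass how='all';
- To drop columns instead of rows, pass the axis parameter: axis=1 means columns, axis=0 means rows;
- The thresh parameter can also be used to select what is dropped: thresh=n keeps only the rows that have at least n non-NaN values.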
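The options listed above can be compared side by side on a small table. This is a sketch on made-up data (the frame toy and its columns a, b, c are hypothetical), not on the rate.xlsx file:

```python
import numpy as np
import pandas as pd

# Hypothetical data: column c is entirely missing, row 2 is entirely missing.
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1.0, 2.0, np.nan, 4.0],
    "c": [np.nan, np.nan, np.nan, np.nan],
})

any_nan = toy.dropna()                      # drop rows with at least one NaN
all_nan = toy.dropna(how="all")             # drop rows where every value is NaN
empty_cols = toy.dropna(axis=1, how="all")  # drop columns that are entirely NaN
at_least_two = toy.dropna(thresh=2)         # keep rows with >= 2 non-NaN values

print(len(any_nan), list(all_nan.index), list(empty_cols.columns), list(at_least_two.index))
```

Since every row of this toy table has a NaN in column c, the default dropna() removes all rows, which shows why it is worth checking the shape after calling it.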
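What if we only want to delete two specific rows? For that we can use df.drop(). Its signature is:
DataFrame.drop(labels=None,axis=0,index=None,columns=None,inplace=False)
The parameters are:
- labels: the names of the rows or columns to delete, given as a list
- axis: whether labels refers to rows (axis=0, the default) or columns (axis=1)
- index: the rows to delete, specified directly
- columns: the columns to delete, specified directly
- inplace=False: the default; the original data is left unchanged and a new DataFrame with the deletion applied is returned
- inplace=True: the deletion is performed directly on the original data and cannot be undone
From these parameters, there are two ways to delete rows or columns:
- labels=[...] combined with axis (axis=0 for rows, axis=1 for columns)
- index or columns, which name the rows or columns to delete directly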
# Method 1
import pandas as pd
df = pd.read_excel(r'C:\Users\lin-a\Desktop\data\rate.xlsx')
# Drop the first two rows (index labels 0 and 1)
df3 = df.drop(labels=[0,1],axis=0)
print(df3)
# Output:
CountryName Country Code 1990 2000 2007 2008 2009 2010 2011 2012 2013 2014 2015 Change 1990-2015 Change 2007-2015
2 Algeria DZA 87.675705 62.886169 49.487870 48.910002 48.645026 48.681853 49.233576 49.847713 50.600697 51.536631 52.617579 -35.058127 3.129709
3 American Samoa ASM NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 Andorra ADO NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 Algeria DZA 87.675705 62.886169 49.487870 48.910002 48.645026 48.681853 49.233576 49.847713 50.600697 51.536631 52.617579 -35.058127 3.129709
6 Angola AGO 101.394722 100.930475 102.563811 102.609186 102.428788 102.035690 102.106756 101.836900 101.315234 100.637667 99.855751 -1.538971 -2.708060
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
214 Virgin Islands (U.S.) VIR 53.546901 52.647183 49.912779 50.459425 51.257336 52.382523 53.953515 55.687666 57.537152 59.399244 61.199651 7.652751 11.286872
215 West Bank and Gaza WBG 102.789182 100.410470 88.367099 86.204936 84.173732 82.333762 80.851747 79.426778 78.118403 76.975462 76.001869 -26.787313 -12.365230
216 Yemen, Rep. YEM 118.779727 105.735754 88.438350 85.899643 83.594489 81.613924 80.193948 78.902603 77.734373 76.644268 75.595147 -43.184581 -12.843203
217 Zambia ZMB 100.485263 97.124776 98.223058 98.277253 98.260795 98.148325 97.854237 97.385802 96.791310 96.122165 95.402326 -5.082936 -2.820732
218 Zimbabwe ZWE 96.365244 84.877487 81.272166 81.024020 80.934968 80.985702 80.740494 80.579870 80.499816 80.456439 80.391033 -15.974211 -0.881133
217 rows × 15 columns
# Method 2
# Drop the column labeled 1990; columns= already implies axis=1
df4 = df.drop(columns=1990)
print(df4)
# Output:
CountryName Country Code 2000 2007 2008 2009 2010 2011 2012 2013 2014 2015 Change 1990-2015 Change 2007-2015
0 Afghanistan AFG 103.254202 100.000371 100.215886 100.060480 99.459839 97.667911 95.312707 92.602785 89.773777 86.954464 -14.140466 -13.045907
1 Albania ALB 59.585866 50.862987 49.663787 48.637067 NaN 46.720288 45.835739 45.247477 44.912168 44.806973 -17.001338 -6.056014
2 Algeria DZA 62.886169 49.487870 48.910002 48.645026 48.681853 49.233576 49.847713 50.600697 51.536631 52.617579 -35.058127 3.129709
3 American Samoa ASM NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 Andorra ADO NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
214 Virgin Islands (U.S.) VIR 52.647183 49.912779 50.459425 51.257336 52.382523 53.953515 55.687666 57.537152 59.399244 61.199651 7.652751 11.286872
215 West Bank and Gaza WBG 100.410470 88.367099 86.204936 84.173732 82.333762 80.851747 79.426778 78.118403 76.975462 76.001869 -26.787313 -12.365230
216 Yemen, Rep. YEM 105.735754 88.438350 85.899643 83.594489 81.613924 80.193948 78.902603 77.734373 76.644268 75.595147 -43.184581 -12.843203
217 Zambia ZMB 97.124776 98.223058 98.277253 98.260795 98.148325 97.854237 97.385802 96.791310 96.122165 95.402326 -5.082936 -2.820732
218 Zimbabwe ZWE 84.877487 81.272166 81.024020 80.934968 80.985702 80.740494 80.579870 80.499816 80.456439 80.391033 -15.974211 -0.881133
219 rows × 14 columns
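Both ways of calling drop() can be tried on a small table without the Excel file. The sketch below uses made-up data (the frame toy is hypothetical); it also shows that index and columns can be combined in one call, and that errors="ignore" skips labels that do not exist.

```python
import pandas as pd

# Hypothetical toy frame with integer year labels as column names.
toy = pd.DataFrame(
    {"name": ["a", "b", "c"], 1990: [1.0, 2.0, 3.0], 2000: [4.0, 5.0, 6.0]}
)

rows_dropped = toy.drop(labels=[0, 1], axis=0)      # way 1: labels + axis
cols_dropped = toy.drop(columns=[1990])             # way 2: columns=
both_dropped = toy.drop(index=[0], columns=[2000])  # rows and columns in one call

# errors="ignore" skips missing labels instead of raising KeyError.
tolerant = toy.drop(columns=["missing"], errors="ignore")

# With the default inplace=False the original frame is untouched.
print(toy.shape, rows_dropped.shape, cols_dropped.shape, both_dropped.shape)
```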
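2. Handling missing values
For missing values, we can either delete the whole record or fill the gaps with the fillna() method.
df.fillna(value=None,method=None,axis=None,inplace=False,limit=None,downcast=None,**kwargs)
Note: the method parameter cannot be used together with the value parameter. In recent pandas releases the method parameter is deprecated in favor of the ffill() and bfill() methods.
import pandas as pd
df3 = pd.read_excel(r'C:\Users\lin-a\Desktop\data\rate.xlsx')
# Fill with a constant using fillna
print(df3.fillna(0))
# Output: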
CountryName Country Code 1990 2000 2007 2008 2009 2010 2011 2012 2013 2014 2015 Change 1990-2015 Change 2007-2015
0 Afghanistan AFG 101.094930 103.254202 100.000371 100.215886 100.060480 99.459839 97.667911 95.312707 92.602785 89.773777 86.954464 -14.140466 -13.045907
1 Albania ALB 61.808311 59.585866 50.862987 49.663787 48.637067 0.000000 46.720288 45.835739 45.247477 44.912168 44.806973 -17.001338 -6.056014
2 Algeria DZA 87.675705 62.886169 49.487870 48.910002 48.645026 48.681853 49.233576 49.847713 50.600697 51.536631 52.617579 -35.058127 3.129709
3 American Samoa ASM 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
4 Andorra ADO 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
214 Virgin Islands (U.S.) VIR 53.546901 52.647183 49.912779 50.459425 51.257336 52.382523 53.953515 55.687666 57.537152 59.399244 61.199651 7.652751 11.286872
215 West Bank and Gaza WBG 102.789182 100.410470 88.367099 86.204936 84.173732 82.333762 80.851747 79.426778 78.118403 76.975462 76.001869 -26.787313 -12.365230
216 Yemen, Rep. YEM 118.779727 105.735754 88.438350 85.899643 83.594489 81.613924 80.193948 78.902603 77.734373 76.644268 75.595147 -43.184581 -12.843203
217 Zambia ZMB 100.485263 97.124776 98.223058 98.277253 98.260795 98.148325 97.854237 97.385802 96.791310 96.122165 95.402326 -5.082936 -2.820732
218 Zimbabwe ZWE 96.365244 84.877487 81.272166 81.024020 80.934968 80.985702 80.740494 80.579870 80.499816 80.456439 80.391033 -15.974211 -0.881133
219 rows × 15 columns
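# Fill each column with its own mean (numeric_only=True skips the text columns, which newer pandas versions no longer skip automatically)
print(df3.fillna(df3.mean(numeric_only=True)))
# Output: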
CountryName Country Code 1990 2000 2007 2008 2009 2010 2011 2012 2013 2014 2015 Change 1990-2015 Change 2007-2015
0 Afghanistan AFG 101.094930 103.254202 100.000371 100.215886 100.060480 99.459839 97.667911 95.312707 92.602785 89.773777 86.954464 -14.140466 -13.045907
1 Albania ALB 61.808311 59.585866 50.862987 49.663787 48.637067 59.623883 46.720288 45.835739 45.247477 44.912168 44.806973 -17.001338 -6.056014
2 Algeria DZA 87.675705 62.886169 49.487870 48.910002 48.645026 48.681853 49.233576 49.847713 50.600697 51.536631 52.617579 -35.058127 3.129709
3 American Samoa ASM 72.706435 66.999367 61.311772 60.652862 60.066010 59.623883 59.297153 59.058143 58.864043 58.726255 58.633697 -14.072738 -2.678076
4 Andorra ADO 72.706435 66.999367 61.311772 60.652862 60.066010 59.623883 59.297153 59.058143 58.864043 58.726255 58.633697 -14.072738 -2.678076
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
214 Virgin Islands (U.S.) VIR 53.546901 52.647183 49.912779 50.459425 51.257336 52.382523 53.953515 55.687666 57.537152 59.399244 61.199651 7.652751 11.286872
215 West Bank and Gaza WBG 102.789182 100.410470 88.367099 86.204936 84.173732 82.333762 80.851747 79.426778 78.118403 76.975462 76.001869 -26.787313 -12.365230
216 Yemen, Rep. YEM 118.779727 105.735754 88.438350 85.899643 83.594489 81.613924 80.193948 78.902603 77.734373 76.644268 75.595147 -43.184581 -12.843203
217 Zambia ZMB 100.485263 97.124776 98.223058 98.277253 98.260795 98.148325 97.854237 97.385802 96.791310 96.122165 95.402326 -5.082936 -2.820732
218 Zimbabwe ZWE 96.365244 84.877487 81.272166 81.024020 80.934968 80.985702 80.740494 80.579870 80.499816 80.456439 80.391033 -15.974211 -0.881133
219 rows × 15 columns
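# Forward fill: fill each gap with the previous value in the same column
# (equivalent to fillna(method='ffill',axis=0) in older pandas versions)
print(df3.ffill(axis=0))
# Output: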
CountryName Country Code 1990 2000 2007 2008 2009 2010 2011 2012 2013 2014 2015 Change 1990-2015 Change 2007-2015
0 Afghanistan AFG 101.094930 103.254202 100.000371 100.215886 100.060480 99.459839 97.667911 95.312707 92.602785 89.773777 86.954464 -14.140466 -13.045907
1 Albania ALB 61.808311 59.585866 50.862987 49.663787 48.637067 99.459839 46.720288 45.835739 45.247477 44.912168 44.806973 -17.001338 -6.056014
2 Algeria DZA 87.675705 62.886169 49.487870 48.910002 48.645026 48.681853 49.233576 49.847713 50.600697 51.536631 52.617579 -35.058127 3.129709
3 American Samoa ASM 87.675705 62.886169 49.487870 48.910002 48.645026 48.681853 49.233576 49.847713 50.600697 51.536631 52.617579 -35.058127 3.129709
4 Andorra ADO 87.675705 62.886169 49.487870 48.910002 48.645026 48.681853 49.233576 49.847713 50.600697 51.536631 52.617579 -35.058127 3.129709
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
214 Virgin Islands (U.S.) VIR 53.546901 52.647183 49.912779 50.459425 51.257336 52.382523 53.953515 55.687666 57.537152 59.399244 61.199651 7.652751 11.286872
215 West Bank and Gaza WBG 102.789182 100.410470 88.367099 86.204936 84.173732 82.333762 80.851747 79.426778 78.118403 76.975462 76.001869 -26.787313 -12.365230
216 Yemen, Rep. YEM 118.779727 105.735754 88.438350 85.899643 83.594489 81.613924 80.193948 78.902603 77.734373 76.644268 75.595147 -43.184581 -12.843203
217 Zambia ZMB 100.485263 97.124776 98.223058 98.277253 98.260795 98.148325 97.854237 97.385802 96.791310 96.122165 95.402326 -5.082936 -2.820732
218 Zimbabwe ZWE 96.365244 84.877487 81.272166 81.024020 80.934968 80.985702 80.740494 80.579870 80.499816 80.456439 80.391033 -15.974211 -0.881133
219 rows × 15 columns
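Beyond a single constant, the column mean, or a plain forward fill, a few other filling patterns are often useful. The following is a sketch on made-up data (the frame toy is hypothetical): a different fill value per column via a dict, backward filling, and a forward fill that is limited to one consecutive gap.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in both columns.
toy = pd.DataFrame({
    "x": [1.0, np.nan, np.nan, 4.0],
    "y": [np.nan, 2.0, np.nan, 8.0],
})

# A dict gives each column its own fill value.
per_column = toy.fillna({"x": 0.0, "y": toy["y"].mean()})

# Backward fill: take the next valid value in the same column.
backward = toy.bfill()

# Forward fill at most one consecutive NaN per gap.
forward_limited = toy.ffill(limit=1)

print(per_column)
print(backward)
print(forward_limited)
```

Note that a forward fill cannot fill a gap at the very start of a column, because there is no earlier value to copy.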
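3. Handling duplicate data
Duplicate data lowers both the accuracy and the efficiency of an analysis, so duplicates should be removed when cleaning the data.
- The duplicated() method reports, for each row, whether it is a duplicate (True if it is).
import pandas as pd
df3 = pd.read_excel(r'C:\Users\lin-a\Desktop\data\rate.xlsx')
# Flag the duplicate rows
print(df3.duplicated())
# Output: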
0 False
1 False
2 False
3 False
4 False
...
214 False
215 False
216 False
217 False
218 False
Length: 219, dtype: bool
The result is a Series of bool values: a row is marked True if its values in all columns repeat those of an earlier row, and False otherwise.
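Because the printed mask is truncated for long tables, it helps to count the True values or to list the affected rows. A minimal sketch on made-up data (the frame toy is hypothetical):

```python
import pandas as pd

# Hypothetical toy frame: row 2 repeats row 0 exactly.
toy = pd.DataFrame({
    "country": ["A", "B", "A", "C", "B"],
    "value": [1, 2, 1, 3, 9],
})

# Full-row comparison: only later repeats are flagged.
dup_mask = toy.duplicated()
dup_count = int(dup_mask.sum())

# keep=False flags every member of a duplicate group, including the first.
dup_rows = toy[toy.duplicated(keep=False)]

print(dup_count)
print(dup_rows)
```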
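- drop_duplicates() removes the duplicate rows.
df.drop_duplicates()
- We can also detect duplicates based on a single column and then remove them.
df.drop_duplicates(['CountryName'],inplace=False)
Here ['CountryName'] means that only the CountryName column is compared for duplicates, and inplace controls whether the original data is modified directly.
import pandas as pd
df3 = pd.read_excel(r'C:\Users\lin-a\Desktop\data\rate.xlsx')
# Drop the rows whose CountryName value is duplicated
print(df3.drop_duplicates(['CountryName'],inplace=False))
# Output: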
CountryName Country Code 1990 2000 2007 2008 2009 2010 2011 2012 2013 2014 2015 Change 1990-2015 Change 2007-2015
0 Afghanistan AFG 101.094930 103.254202 100.000371 100.215886 100.060480 99.459839 97.667911 95.312707 92.602785 89.773777 86.954464 -14.140466 -13.045907
1 Albania ALB 61.808311 59.585866 50.862987 49.663787 48.637067 NaN 46.720288 45.835739 45.247477 44.912168 44.806973 -17.001338 -6.056014
2 Algeria DZA 87.675705 62.886169 49.487870 48.910002 48.645026 48.681853 49.233576 49.847713 50.600697 51.536631 52.617579 -35.058127 3.129709
3 American Samoa ASM NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 Andorra ADO NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
214 Virgin Islands (U.S.) VIR 53.546901 52.647183 49.912779 50.459425 51.257336 52.382523 53.953515 55.687666 57.537152 59.399244 61.199651 7.652751 11.286872
215 West Bank and Gaza WBG 102.789182 100.410470 88.367099 86.204936 84.173732 82.333762 80.851747 79.426778 78.118403 76.975462 76.001869 -26.787313 -12.365230
216 Yemen, Rep. YEM 118.779727 105.735754 88.438350 85.899643 83.594489 81.613924 80.193948 78.902603 77.734373 76.644268 75.595147 -43.184581 -12.843203
217 Zambia ZMB 100.485263 97.124776 98.223058 98.277253 98.260795 98.148325 97.854237 97.385802 96.791310 96.122165 95.402326 -5.082936 -2.820732
218 Zimbabwe ZWE 96.365244 84.877487 81.272166 81.024020 80.934968 80.985702 80.740494 80.579870 80.499816 80.456439 80.391033 -15.974211 -0.881133
217 rows × 15 columns
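Which copy survives is controlled by the keep parameter of drop_duplicates(). The sketch below uses made-up data (the frame toy is hypothetical) to compare the three settings and to renumber the index afterwards, since the surviving rows keep their original labels.

```python
import pandas as pd

# Hypothetical toy frame with repeated country values.
toy = pd.DataFrame({
    "country": ["A", "B", "A", "C", "B"],
    "value": [1, 2, 1, 3, 9],
})

first = toy.drop_duplicates(subset=["country"])                # keep first occurrence (default)
last = toy.drop_duplicates(subset=["country"], keep="last")    # keep last occurrence
unique_only = toy.drop_duplicates(subset=["country"], keep=False)  # drop every repeated value

# The surviving rows keep their old labels; reset_index renumbers them.
renumbered = first.reset_index(drop=True)

print(list(first.index), list(last.index), list(unique_only.index))
```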
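4. Summary
Deleting data
Handling missing values
Removing duplicate data
5. Exercises
- From the data on English domestic football match results since the 2009-2010 season, delete all data for the 2009/2010 season as well as the country column.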
import pandas as pd
# Read the data
df = pd.read_excel(r'C:\Users\lin-a\Desktop\data\sccore_game.xlsx')
# Get a first look at the data
print(df.shape)
print(df.head())
# Loop over the rows and collect the index labels of the 2009/2010 season
index_list = []
for value,row_data in df.iterrows():
    if row_data['season'] == '2009/2010':
        index_list.append(value)
# Drop the 2009/2010 season rows
df1 = df.drop(labels=index_list,axis=0)
print(df1)
# Drop the country column, method 1
df2 = df.drop(labels=['country'],axis=1)
print(df2)
# Drop the country column, method 2
df3 = df.drop(columns='country')
print(df3)
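The row loop works, but the same result can usually be obtained with a boolean mask, which avoids iterating over the rows and handles both deletions in one expression. This is a sketch on made-up rows; only the column names season and country come from the exercise, and home_goals is a hypothetical extra column.

```python
import pandas as pd

# Hypothetical rows standing in for the match data.
games = pd.DataFrame({
    "season": ["2009/2010", "2010/2011", "2009/2010", "2011/2012"],
    "country": ["England"] * 4,
    "home_goals": [1, 2, 0, 3],
})

# Keep the rows from other seasons, then drop the country column.
cleaned = games[games["season"] != "2009/2010"].drop(columns="country")

print(cleaned)
```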
- Inspect the data and, from each product's unit price (OnePrice) and purchase quantity (Count), compute the corresponding total price (Price).
import pandas as pd
# Read the data
books = pd.read_excel(r'C:\Users\lin-a\Desktop\data\04books.xlsx')
# Get a first look at the data
print(books.shape)
print(books.head())
#print('{:-^50}'.format('separator'))
# Compute Price for all rows (the columns are aligned on the index, then multiplied)
#books['Price'] = books['OnePrice']*books['Count']
#print(books)
print('{:-^50}'.format('separator'))
# Compute Price only for a contiguous range of rows (rows 5 through 15) with a loop
for i in range(5,16):
    # Assign through .loc (row label, column name) to avoid chained assignment
    books.loc[i,'Price'] = books.loc[i,'OnePrice']*books.loc[i,'Count']
print(books)
# Output:
(20, 5)
ID Name OnePrice Count Price
0 1 Product_001 9.82 5 NaN
1 2 Product_002 11.99 4 NaN
2 3 Product_003 9.62 6 NaN
3 4 Product_004 11.08 8 NaN
4 5 Product_005 7.75 3 NaN
--------------------separator---------------------
ID Name OnePrice Count Price
0 1 Product_001 9.82 5 NaN
1 2 Product_002 11.99 4 NaN
2 3 Product_003 9.62 6 NaN
3 4 Product_004 11.08 8 NaN
4 5 Product_005 7.75 3 NaN
5 6 Product_006 7.34 4 29.36
6 7 Product_007 10.97 6 65.82
7 8 Product_008 11.14 7 77.98
8 9 Product_009 8.98 2 17.96
9 10 Product_010 9.18 3 27.54
10 11 Product_011 8.31 4 33.24
11 12 Product_012 7.29 9 65.61
12 13 Product_013 8.36 5 41.80
13 14 Product_014 9.16 6 54.96
14 15 Product_015 10.31 3 30.93
15 16 Product_016 10.26 6 61.56
16 17 Product_017 11.95 8 NaN
17 18 Product_018 11.22 2 NaN
18 19 Product_019 10.95 4 NaN
19 20 Product_020 8.82 6 NaN
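The loop over a range of rows can also be written as a single vectorized assignment with .loc. The sketch below assumes a default integer index and uses only the first four rows of the table above, typed in by hand; note that .loc slices are label-based and include both endpoints.

```python
import numpy as np
import pandas as pd

# First four rows of the books table, entered by hand for illustration.
books = pd.DataFrame({
    "OnePrice": [9.82, 11.99, 9.62, 11.08],
    "Count": [5, 4, 6, 8],
    "Price": [np.nan] * 4,
})

# Rows 1 and 2 only: the slice 1:2 includes label 2.
books.loc[1:2, "Price"] = books.loc[1:2, "OnePrice"] * books.loc[1:2, "Count"]

print(books)
```

The multiplication aligns the two columns on the index, so only the selected rows receive a value and the others stay NaN.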