Python Data Analysis, Lesson 3: Data Processing (Deleting Data, Handling Null Values and Duplicate Data)

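The data we analyze comes from many sources, for example web scraping, company databases, and data vendors. Some of the fields in that data are ones we do not need, and the data may also contain duplicate records and null values.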

I. Deleting Data

import pandas as pd
df = pd.read_excel(r'C:\Users\lin-a\Desktop\data\rate.xlsx')
print(df.shape)
print(df.head())

# Output:
(219, 15)
      CountryName Country Code        1990        2000        2007  \
0     Afghanistan          AFG  101.094930  103.254202  100.000371   
1         Albania          ALB   61.808311   59.585866   50.862987   
2         Algeria          DZA   87.675705   62.886169   49.487870   
3  American Samoa          ASM         NaN         NaN         NaN   
4         Andorra          ADO         NaN         NaN         NaN   

         2008        2009       2010       2011       2012       2013  \
0  100.215886  100.060480  99.459839  97.667911  95.312707  92.602785   
1   49.663787   48.637067        NaN  46.720288  45.835739  45.247477   
2   48.910002   48.645026  48.681853  49.233576  49.847713  50.600697   
3         NaN         NaN        NaN        NaN        NaN        NaN   
4         NaN         NaN        NaN        NaN        NaN        NaN   

        2014       2015  Change 1990-2015  Change 2007-2015  
0  89.773777  86.954464        -14.140466        -13.045907  
1  44.912168  44.806973        -17.001338         -6.056014  
2  51.536631  52.617579        -35.058127          3.129709  
3        NaN        NaN               NaN               NaN  
4        NaN        NaN               NaN               NaN  

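The result contains many NaN values. A NaN means the corresponding cell in the file has no value; when pandas reads such a cell it represents it as NaN, which is what we usually call a null value.

NumPy provides the nan value. To create a null value yourself, you can use the following code:

from numpy import nan as NaN

Note that NaN is special in that it is itself a value of type float.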

from numpy import nan as NaN
print(type(NaN))

# Output:
<class 'float'>

When NaN takes part in an arithmetic operation, the result is again NaN.

from numpy import nan as NaN
print(NaN+1)

# Output:
nan

So null values affect the results of our calculations.
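
As a small sketch of how this plays out in practice (the variable names are illustrative and not from the original lesson): NaN propagates through plain arithmetic and never compares equal to itself, while pandas reductions such as sum() skip NaN by default unless skipna=False is passed.

```python
import numpy as np
import pandas as pd

nan = np.nan

# NaN propagates through ordinary arithmetic
total = nan + 1
print(np.isnan(total))

# NaN does not compare equal to anything, itself included,
# so test for it with np.isnan / pd.isna instead of ==
print(nan == nan)
print(pd.isna(nan))

# pandas reductions skip NaN by default (skipna=True)
se = pd.Series([4, nan, 8])
skipped = se.sum()               # NaN ignored
strict = se.sum(skipna=False)    # NaN propagates
print(skipped, strict)
```

This is why a column mean can still look plausible when values are missing, and why it is worth checking for NaN explicitly before trusting a result.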

With a large Series it is hard to spot null values by eye, so we first need to filter them out.

import pandas as pd
from numpy import nan as NaN
se = pd.Series([4,NaN,8,NaN,5])
print(se.notnull())
print('='*20)
print(se[se.notnull()])

# Output:
0     True
1    False
2     True
3    False
4     True
dtype: bool
====================
0    4.0
2    8.0
4    5.0
dtype: float64
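
The same idea carries over to a DataFrame. The sketch below uses a small made-up frame (the column names and values are assumptions, not data from rate.xlsx) to count the null values in each column and to pull out the rows that contain at least one.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, not taken from rate.xlsx
demo = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd'],
    'x': [1.0, np.nan, 3.0, np.nan],
    'y': [np.nan, 2.0, 3.0, 4.0],
})

# isnull() gives a True/False frame; summing it counts the NaN values per column
null_counts = demo.isnull().sum()
print(null_counts)

# any(axis=1) is True for a row that has at least one NaN
rows_with_null = demo[demo.isnull().any(axis=1)]
print(rows_with_null)
```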

For a DataFrame, we usually delete the data that contains NaN with the dropna() method.

df1 = df.dropna()

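dropna() is the method for deleting null data:

  • By default it deletes every row that contains a NaN.
  • To delete only the rows whose values are all null, add the how='all' parameter.
  • To delete columns instead, add the axis parameter: axis=1 means columns, axis=0 means rows.
  • The thresh parameter can also select what to delete: thresh=n keeps the rows that have at least n non-NaN values.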
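
The options above can be compared side by side on a tiny frame. This is only a sketch with made-up values (the frame and variable names are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data
demo = pd.DataFrame({
    'a': [1.0, np.nan, np.nan, 4.0],
    'b': [1.0, 2.0, np.nan, 4.0],
    'c': [np.nan, np.nan, np.nan, 4.0],
})

any_rows = demo.dropna()              # drop rows containing any NaN
all_rows = demo.dropna(how='all')     # drop only rows that are entirely NaN
no_nan_cols = demo.dropna(axis=1)     # drop columns containing any NaN
thresh_rows = demo.dropna(thresh=2)   # keep rows with at least 2 non-NaN values

print(list(any_rows.index))
print(list(all_rows.index))
print(list(no_nan_cols.columns))
print(list(thresh_rows.index))
```

In this sample every column contains at least one NaN, so dropna(axis=1) leaves no columns at all, a reminder that the default how='any' can be quite aggressive.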

What if I only want to delete a couple of rows? Use the df.drop() method. Here is the function with its main parameters:

DataFrame.drop(labels=None,axis=0,index=None,columns=None,inplace=False)

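The parameters are:

  • labels: the names of the rows or columns to delete, given as a list
  • axis: which axis labels refers to; axis=0 (the default) means rows, axis=1 means columns
  • index: directly specifies the rows to delete
  • columns: directly specifies the columns to delete
  • inplace=False: the default; the deletion does not change the original data and instead returns a new DataFrame with the deletion applied
  • inplace=True: performs the deletion directly on the original data; it cannot be undone afterwards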

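So, from the parameters, there are two ways to delete rows or columns:

  • combine labels=[...] with axis (axis=0 for rows, axis=1 for columns)
  • use index or columns to specify the rows or columns to delete directly
# Method 1
import pandas as pd 
df = pd.read_excel(r'C:\Users\lin-a\Desktop\data\rate.xlsx')
# Delete the first and second rows (index labels 0 and 1)
df3 = df.drop(labels=[0,1],axis=0)
print(df3)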

# Output:
	CountryName	Country Code	1990	2000	2007	2008	2009	2010	2011	2012	2013	2014	2015	Change 1990-2015	Change 2007-2015
2	Algeria	DZA	87.675705	62.886169	49.487870	48.910002	48.645026	48.681853	49.233576	49.847713	50.600697	51.536631	52.617579	-35.058127	3.129709
3	American Samoa	ASM	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	Andorra	ADO	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
5	Algeria	DZA	87.675705	62.886169	49.487870	48.910002	48.645026	48.681853	49.233576	49.847713	50.600697	51.536631	52.617579	-35.058127	3.129709
6	Angola	AGO	101.394722	100.930475	102.563811	102.609186	102.428788	102.035690	102.106756	101.836900	101.315234	100.637667	99.855751	-1.538971	-2.708060
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
214	Virgin Islands (U.S.)	VIR	53.546901	52.647183	49.912779	50.459425	51.257336	52.382523	53.953515	55.687666	57.537152	59.399244	61.199651	7.652751	11.286872
215	West Bank and Gaza	WBG	102.789182	100.410470	88.367099	86.204936	84.173732	82.333762	80.851747	79.426778	78.118403	76.975462	76.001869	-26.787313	-12.365230
216	Yemen, Rep.	YEM	118.779727	105.735754	88.438350	85.899643	83.594489	81.613924	80.193948	78.902603	77.734373	76.644268	75.595147	-43.184581	-12.843203
217	Zambia	ZMB	100.485263	97.124776	98.223058	98.277253	98.260795	98.148325	97.854237	97.385802	96.791310	96.122165	95.402326	-5.082936	-2.820732
218	Zimbabwe	ZWE	96.365244	84.877487	81.272166	81.024020	80.934968	80.985702	80.740494	80.579870	80.499816	80.456439	80.391033	-15.974211	-0.881133
217 rows × 15 columns
# Method 2
# Delete the column named 1990 (columns= already targets columns, so axis is not needed)
df4 = df.drop(columns=1990)
print(df4)

# Output:
	CountryName	Country Code	2000	2007	2008	2009	2010	2011	2012	2013	2014	2015	Change 1990-2015	Change 2007-2015
0	Afghanistan	AFG	103.254202	100.000371	100.215886	100.060480	99.459839	97.667911	95.312707	92.602785	89.773777	86.954464	-14.140466	-13.045907
1	Albania	ALB	59.585866	50.862987	49.663787	48.637067	NaN	46.720288	45.835739	45.247477	44.912168	44.806973	-17.001338	-6.056014
2	Algeria	DZA	62.886169	49.487870	48.910002	48.645026	48.681853	49.233576	49.847713	50.600697	51.536631	52.617579	-35.058127	3.129709
3	American Samoa	ASM	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	Andorra	ADO	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
214	Virgin Islands (U.S.)	VIR	52.647183	49.912779	50.459425	51.257336	52.382523	53.953515	55.687666	57.537152	59.399244	61.199651	7.652751	11.286872
215	West Bank and Gaza	WBG	100.410470	88.367099	86.204936	84.173732	82.333762	80.851747	79.426778	78.118403	76.975462	76.001869	-26.787313	-12.365230
216	Yemen, Rep.	YEM	105.735754	88.438350	85.899643	83.594489	81.613924	80.193948	78.902603	77.734373	76.644268	75.595147	-43.184581	-12.843203
217	Zambia	ZMB	97.124776	98.223058	98.277253	98.260795	98.148325	97.854237	97.385802	96.791310	96.122165	95.402326	-5.082936	-2.820732
218	Zimbabwe	ZWE	84.877487	81.272166	81.024020	80.934968	80.985702	80.740494	80.579870	80.499816	80.456439	80.391033	-15.974211	-0.881133
219 rows × 14 columns
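
To see the two styles and the inplace flag without needing the Excel file, here is a minimal sketch on a made-up frame (all names and values are assumptions for illustration). It shows that labels with axis=0 and index give the same result, that index and columns can be combined in one call, and that inplace=True returns None while changing the frame itself.

```python
import pandas as pd

# Hypothetical sample data with the same kind of column names as rate.xlsx
demo = pd.DataFrame({
    'name': ['a', 'b', 'c'],
    1990: [1.0, 2.0, 3.0],
    2000: [4.0, 5.0, 6.0],
})

# Two equivalent ways to drop the rows labelled 0 and 1
by_labels = demo.drop(labels=[0, 1], axis=0)
by_index = demo.drop(index=[0, 1])
print(by_labels.equals(by_index))

# Drop a column, and drop a row and a column in one call
no_1990 = demo.drop(columns=[1990])
trimmed = demo.drop(index=[0], columns=[2000])

# inplace=True modifies demo itself and returns None
result = demo.drop(index=[2], inplace=True)
print(result, list(demo.index))
```

Because inplace=True returns None, writing `demo = demo.drop(..., inplace=True)` would overwrite the frame with None; assigning the returned copy (the default) is usually the safer habit.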

II. Handling Null Values

For null values, we can delete the whole record, or we can fill the null values with the fillna() method.

df.fillna(value=None,method=None,axis=None,inplace=False,limit=None,downcast=None,**kwargs)

Note: the method parameter cannot be used together with the value parameter.

import pandas as pd
df3 = pd.read_excel(r'C:\Users\lin-a\Desktop\data\rate.xlsx')

# Fill with a constant using fillna
print(df3.fillna(0))

# Output:
CountryName	Country Code	1990	2000	2007	2008	2009	2010	2011	2012	2013	2014	2015	Change 1990-2015	Change 2007-2015
0	Afghanistan	AFG	101.094930	103.254202	100.000371	100.215886	100.060480	99.459839	97.667911	95.312707	92.602785	89.773777	86.954464	-14.140466	-13.045907
1	Albania	ALB	61.808311	59.585866	50.862987	49.663787	48.637067	0.000000	46.720288	45.835739	45.247477	44.912168	44.806973	-17.001338	-6.056014
2	Algeria	DZA	87.675705	62.886169	49.487870	48.910002	48.645026	48.681853	49.233576	49.847713	50.600697	51.536631	52.617579	-35.058127	3.129709
3	American Samoa	ASM	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
4	Andorra	ADO	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
214	Virgin Islands (U.S.)	VIR	53.546901	52.647183	49.912779	50.459425	51.257336	52.382523	53.953515	55.687666	57.537152	59.399244	61.199651	7.652751	11.286872
215	West Bank and Gaza	WBG	102.789182	100.410470	88.367099	86.204936	84.173732	82.333762	80.851747	79.426778	78.118403	76.975462	76.001869	-26.787313	-12.365230
216	Yemen, Rep.	YEM	118.779727	105.735754	88.438350	85.899643	83.594489	81.613924	80.193948	78.902603	77.734373	76.644268	75.595147	-43.184581	-12.843203
217	Zambia	ZMB	100.485263	97.124776	98.223058	98.277253	98.260795	98.148325	97.854237	97.385802	96.791310	96.122165	95.402326	-5.082936	-2.820732
218	Zimbabwe	ZWE	96.365244	84.877487	81.272166	81.024020	80.934968	80.985702	80.740494	80.579870	80.499816	80.456439	80.391033	-15.974211	-0.881133
219 rows × 15 columns

# Fill with each column's mean (numeric_only=True skips the text columns)
print(df3.fillna(df3.mean(numeric_only=True)))

# Output:
	CountryName	Country Code	1990	2000	2007	2008	2009	2010	2011	2012	2013	2014	2015	Change 1990-2015	Change 2007-2015
0	Afghanistan	AFG	101.094930	103.254202	100.000371	100.215886	100.060480	99.459839	97.667911	95.312707	92.602785	89.773777	86.954464	-14.140466	-13.045907
1	Albania	ALB	61.808311	59.585866	50.862987	49.663787	48.637067	59.623883	46.720288	45.835739	45.247477	44.912168	44.806973	-17.001338	-6.056014
2	Algeria	DZA	87.675705	62.886169	49.487870	48.910002	48.645026	48.681853	49.233576	49.847713	50.600697	51.536631	52.617579	-35.058127	3.129709
3	American Samoa	ASM	72.706435	66.999367	61.311772	60.652862	60.066010	59.623883	59.297153	59.058143	58.864043	58.726255	58.633697	-14.072738	-2.678076
4	Andorra	ADO	72.706435	66.999367	61.311772	60.652862	60.066010	59.623883	59.297153	59.058143	58.864043	58.726255	58.633697	-14.072738	-2.678076
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
214	Virgin Islands (U.S.)	VIR	53.546901	52.647183	49.912779	50.459425	51.257336	52.382523	53.953515	55.687666	57.537152	59.399244	61.199651	7.652751	11.286872
215	West Bank and Gaza	WBG	102.789182	100.410470	88.367099	86.204936	84.173732	82.333762	80.851747	79.426778	78.118403	76.975462	76.001869	-26.787313	-12.365230
216	Yemen, Rep.	YEM	118.779727	105.735754	88.438350	85.899643	83.594489	81.613924	80.193948	78.902603	77.734373	76.644268	75.595147	-43.184581	-12.843203
217	Zambia	ZMB	100.485263	97.124776	98.223058	98.277253	98.260795	98.148325	97.854237	97.385802	96.791310	96.122165	95.402326	-5.082936	-2.820732
218	Zimbabwe	ZWE	96.365244	84.877487	81.272166	81.024020	80.934968	80.985702	80.740494	80.579870	80.499816	80.456439	80.391033	-15.974211	-0.881133
219 rows × 15 columns

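# Fill with the previous value (ffill)
# Note: newer pandas versions deprecate the method argument; df3.ffill(axis=0) is the equivalent call
print(df3.fillna(method='ffill',axis=0))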

# Output:
CountryName	Country Code	1990	2000	2007	2008	2009	2010	2011	2012	2013	2014	2015	Change 1990-2015	Change 2007-2015
0	Afghanistan	AFG	101.094930	103.254202	100.000371	100.215886	100.060480	99.459839	97.667911	95.312707	92.602785	89.773777	86.954464	-14.140466	-13.045907
1	Albania	ALB	61.808311	59.585866	50.862987	49.663787	48.637067	99.459839	46.720288	45.835739	45.247477	44.912168	44.806973	-17.001338	-6.056014
2	Algeria	DZA	87.675705	62.886169	49.487870	48.910002	48.645026	48.681853	49.233576	49.847713	50.600697	51.536631	52.617579	-35.058127	3.129709
3	American Samoa	ASM	87.675705	62.886169	49.487870	48.910002	48.645026	48.681853	49.233576	49.847713	50.600697	51.536631	52.617579	-35.058127	3.129709
4	Andorra	ADO	87.675705	62.886169	49.487870	48.910002	48.645026	48.681853	49.233576	49.847713	50.600697	51.536631	52.617579	-35.058127	3.129709
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
214	Virgin Islands (U.S.)	VIR	53.546901	52.647183	49.912779	50.459425	51.257336	52.382523	53.953515	55.687666	57.537152	59.399244	61.199651	7.652751	11.286872
215	West Bank and Gaza	WBG	102.789182	100.410470	88.367099	86.204936	84.173732	82.333762	80.851747	79.426778	78.118403	76.975462	76.001869	-26.787313	-12.365230
216	Yemen, Rep.	YEM	118.779727	105.735754	88.438350	85.899643	83.594489	81.613924	80.193948	78.902603	77.734373	76.644268	75.595147	-43.184581	-12.843203
217	Zambia	ZMB	100.485263	97.124776	98.223058	98.277253	98.260795	98.148325	97.854237	97.385802	96.791310	96.122165	95.402326	-5.082936	-2.820732
218	Zimbabwe	ZWE	96.365244	84.877487	81.272166	81.024020	80.934968	80.985702	80.740494	80.579870	80.499816	80.456439	80.391033	-15.974211	-0.881133
219 rows × 15 columns
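
A few further filling patterns may be useful beyond the three shown above. The sketch below uses a made-up frame (names and values are assumptions): a dict passed to fillna() sets a different fill value per column, ffill(limit=1) fills at most one consecutive gap, and bfill() fills from the next valid value instead of the previous one.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data
demo = pd.DataFrame({
    'x': [1.0, np.nan, np.nan, 4.0],
    'y': [np.nan, 2.0, np.nan, 8.0],
})

# A different fill value per column: 0 for x, the column mean for y
per_column = demo.fillna({'x': 0, 'y': demo['y'].mean()})

# Forward fill, but fill at most one consecutive NaN
forward = demo.ffill(limit=1)

# Backward fill: take the next valid value
backward = demo.bfill()

print(per_column)
print(forward)
print(backward)
```

Note that forward fill cannot fill a NaN in the first row, since there is no earlier value to copy; the same holds for backward fill and the last row.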

III. Handling Duplicate Data

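Duplicate data not only lowers the accuracy of the analysis but also lowers its efficiency, so we should delete duplicate data when tidying up the dataset.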

  • The duplicated() function returns, for each row, whether it is a duplicate (True if it is).
import pandas as pd
df3 = pd.read_excel(r'C:\Users\lin-a\Desktop\data\rate.xlsx')

# Return whether each row is a duplicate
print(df3.duplicated())

# Output:
0      False
1      False
2      False
3      False
4      False
       ...  
214    False
215    False
216    False
217    False
218    False
Length: 219, dtype: bool

The result is a Series of bool values: True if all the column values in the current row duplicate an earlier row, False otherwise.
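
With 219 rows the printed Series is truncated, so a True value can easily hide in the elided part. A small sketch on made-up data (the values are assumptions; only the CountryName column name comes from the lesson) shows how to count duplicates and how the subset and keep parameters change what is flagged.

```python
import pandas as pd

# Hypothetical sample data
demo = pd.DataFrame({
    'CountryName': ['Algeria', 'Angola', 'Algeria', 'Algeria'],
    'value': [1, 2, 1, 3],
})

# Whole-row comparison: only row 2 repeats an earlier row exactly
full_row = demo.duplicated()
n_dupes = full_row.sum()          # True counts as 1
print(n_dupes)

# Compare on CountryName only: later occurrences are flagged
by_name = demo.duplicated(subset=['CountryName'])

# keep=False flags every member of a duplicate group, including the first
all_marked = demo.duplicated(subset=['CountryName'], keep=False)
print(demo[all_marked])
```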

  • Use drop_duplicates() to delete the duplicate data.
df.drop_duplicates()
  • We can also check for duplicates in a particular column and delete on that basis.
df.drop_duplicates(['CountryName'],inplace=False)

Here, ['CountryName'] means the CountryName column is compared for duplicates, and inplace controls whether the original data is modified directly.

import pandas as pd
df3 = pd.read_excel(r'C:\Users\lin-a\Desktop\data\rate.xlsx')

# Delete rows whose CountryName value is a duplicate
print(df3.drop_duplicates(['CountryName'],inplace=False))

# Output:
	CountryName	Country Code	1990	2000	2007	2008	2009	2010	2011	2012	2013	2014	2015	Change 1990-2015	Change 2007-2015
0	Afghanistan	AFG	101.094930	103.254202	100.000371	100.215886	100.060480	99.459839	97.667911	95.312707	92.602785	89.773777	86.954464	-14.140466	-13.045907
1	Albania	ALB	61.808311	59.585866	50.862987	49.663787	48.637067	NaN	46.720288	45.835739	45.247477	44.912168	44.806973	-17.001338	-6.056014
2	Algeria	DZA	87.675705	62.886169	49.487870	48.910002	48.645026	48.681853	49.233576	49.847713	50.600697	51.536631	52.617579	-35.058127	3.129709
3	American Samoa	ASM	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	Andorra	ADO	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
214	Virgin Islands (U.S.)	VIR	53.546901	52.647183	49.912779	50.459425	51.257336	52.382523	53.953515	55.687666	57.537152	59.399244	61.199651	7.652751	11.286872
215	West Bank and Gaza	WBG	102.789182	100.410470	88.367099	86.204936	84.173732	82.333762	80.851747	79.426778	78.118403	76.975462	76.001869	-26.787313	-12.365230
216	Yemen, Rep.	YEM	118.779727	105.735754	88.438350	85.899643	83.594489	81.613924	80.193948	78.902603	77.734373	76.644268	75.595147	-43.184581	-12.843203
217	Zambia	ZMB	100.485263	97.124776	98.223058	98.277253	98.260795	98.148325	97.854237	97.385802	96.791310	96.122165	95.402326	-5.082936	-2.820732
218	Zimbabwe	ZWE	96.365244	84.877487	81.272166	81.024020	80.934968	80.985702	80.740494	80.579870	80.499816	80.456439	80.391033	-15.974211	-0.881133
217 rows × 15 columns
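
By default drop_duplicates() keeps the first row of each duplicate group. As a sketch on made-up data (values are assumptions), keep='last' keeps the last one instead, and reset_index(drop=True) renumbers the rows afterwards so the index has no gaps.

```python
import pandas as pd

# Hypothetical sample data
demo = pd.DataFrame({
    'CountryName': ['Algeria', 'Angola', 'Algeria', 'Algeria'],
    'value': [1, 2, 1, 3],
})

# Keep the first occurrence of each CountryName (the default)
kept_first = demo.drop_duplicates(subset=['CountryName'])

# Keep the last occurrence instead
kept_last = demo.drop_duplicates(subset=['CountryName'], keep='last')

# The surviving rows keep their old labels; renumber them if needed
renumbered = kept_last.reset_index(drop=True)

print(kept_first)
print(kept_last)
print(renumbered)
```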

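IV. Summary

Deleting data

[Figure: summary of deleting data]

Handling null values

[Figure: summary of handling null values]

Removing duplicate data

[Figure: summary of removing duplicate data]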

V. Exercises

  1. From the data of English domestic football match results since the 2009-2010 season, delete all data for the 2009/2010 season as well as the country column.
import pandas as pd

# Read the data
df = pd.read_excel(r'C:\Users\lin-a\Desktop\data\sccore_game.xlsx')

# Get a feel for the basic shape of the data
print(df.shape)
print(df.head())

# Loop over the rows and collect the index labels of the 2009/2010 season
index_list = []
for value,row_data in df.iterrows():
    if row_data['season'] == '2009/2010':
        index_list.append(value)

# Delete the 2009/2010 season data
df1 = df.drop(labels=index_list,axis=0)
print(df1)

# Delete the country column, method 1
# (applied to df1 so that the result has both deletions)
df2 = df1.drop(labels=['country'],axis=1)
print(df2)


# Delete the country column, method 2
df3 = df1.drop(columns='country')
print(df3)
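
The loop above works, but the same result can usually be reached with a boolean mask, which avoids iterating row by row. The following is a sketch on made-up data: the season and country column names come from the exercise, while home_goals and all the values are assumptions.

```python
import pandas as pd

# Hypothetical sample data standing in for the match results file
demo = pd.DataFrame({
    'season': ['2009/2010', '2010/2011', '2009/2010', '2011/2012'],
    'country': ['England'] * 4,
    'home_goals': [1, 2, 0, 3],   # hypothetical column
})

# Keep the rows whose season is not 2009/2010, then drop the country column
cleaned = demo[demo['season'] != '2009/2010'].drop(columns='country')
print(cleaned)
```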
  2. Inspect the data and compute the total price (Price) from each item's unit price (OnePrice) and purchase quantity (Count).
import pandas as pd

# Read the data
books = pd.read_excel(r'C:\Users\lin-a\Desktop\data\04books.xlsx')
# Get a feel for the basic shape of the data
print(books.shape)
print(books.head())

#print('{:-^50}'.format('separator'))
# Compute Price for all rows (the columns are aligned on the index, then multiplied)
#books['Price'] = books['OnePrice']*books['Count']
#print(books)

print('{:-^50}'.format('separator'))
# Compute only a range of consecutive rows, using a loop
# .loc assigns into books directly; chained books['Price'].iloc[i] = ... may write to a copy
# With the default integer index, label i is the same as position i
for i in range(5,16):
    books.loc[i, 'Price'] = books.loc[i, 'OnePrice']*books.loc[i, 'Count']
print(books)

# Output:
(20, 5)
   ID         Name  OnePrice  Count  Price
0   1  Product_001      9.82      5    NaN
1   2  Product_002     11.99      4    NaN
2   3  Product_003      9.62      6    NaN
3   4  Product_004     11.08      8    NaN
4   5  Product_005      7.75      3    NaN
--------------------separator---------------------
    ID         Name  OnePrice  Count  Price
0    1  Product_001      9.82      5    NaN
1    2  Product_002     11.99      4    NaN
2    3  Product_003      9.62      6    NaN
3    4  Product_004     11.08      8    NaN
4    5  Product_005      7.75      3    NaN
5    6  Product_006      7.34      4  29.36
6    7  Product_007     10.97      6  65.82
7    8  Product_008     11.14      7  77.98
8    9  Product_009      8.98      2  17.96
9   10  Product_010      9.18      3  27.54
10  11  Product_011      8.31      4  33.24
11  12  Product_012      7.29      9  65.61
12  13  Product_013      8.36      5  41.80
13  14  Product_014      9.16      6  54.96
14  15  Product_015     10.31      3  30.93
15  16  Product_016     10.26      6  61.56
16  17  Product_017     11.95      8    NaN
17  18  Product_018     11.22      2    NaN
18  19  Product_019     10.95      4    NaN
19  20  Product_020      8.82      6    NaN
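
The row range can also be filled without a loop. As a sketch on a small made-up frame (the column names follow the exercise, the values are assumptions), a .loc slice selects the rows by label and the multiplication is done for the whole range at once. Unlike range(), a .loc slice includes both end labels.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with the same columns as the exercise
books = pd.DataFrame({
    'OnePrice': [9.82, 11.99, 9.62, 11.08],
    'Count': [5, 4, 6, 8],
    'Price': [np.nan] * 4,
})

# Rows labelled 1 through 2 (both included) get OnePrice * Count
books.loc[1:2, 'Price'] = books.loc[1:2, 'OnePrice'] * books.loc[1:2, 'Count']
print(books)
```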