Python Data Analysis, Lesson 3: Processing Data (Deleting Data, Handling Null Values and Duplicate Data)

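The data we analyze can come from many sources, such as web scraping, company databases, or data vendors. Some of the fields in that data are not needed for the analysis, and the data may also contain duplicate records and null values.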

I. Deleting data

import pandas as pd
df = pd.read_excel(r'C:\Users\lin-a\Desktop\data\rate.xlsx')
print(df.shape)
print(df.head())

# Output:
(219, 15)
      CountryName Country Code        1990        2000        2007  \
0     Afghanistan          AFG  101.094930  103.254202  100.000371   
1         Albania          ALB   61.808311   59.585866   50.862987   
2         Algeria          DZA   87.675705   62.886169   49.487870   
3  American Samoa          ASM         NaN         NaN         NaN   
4         Andorra          ADO         NaN         NaN         NaN   

         2008        2009       2010       2011       2012       2013  \
0  100.215886  100.060480  99.459839  97.667911  95.312707  92.602785   
1   49.663787   48.637067        NaN  46.720288  45.835739  45.247477   
2   48.910002   48.645026  48.681853  49.233576  49.847713  50.600697   
3         NaN         NaN        NaN        NaN        NaN        NaN   
4         NaN         NaN        NaN        NaN        NaN        NaN   

        2014       2015  Change 1990-2015  Change 2007-2015  
0  89.773777  86.954464        -14.140466        -13.045907  
1  44.912168  44.806973        -17.001338         -6.056014  
2  51.536631  52.617579        -35.058127          3.129709  
3        NaN        NaN               NaN               NaN  
4        NaN        NaN               NaN               NaN  

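The output contains many NaN entries. A NaN means that the corresponding cell in the file has no value; when pandas reads the file, it represents such cells as NaN, which is what we usually call a null value.

NumPy provides the nan value. To create a null value yourself, you can use the following code: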

from numpy import nan as NaN

Note that NaN is unusual in that it is itself a value of type float.

from numpy import nan as NaN
print(type(NaN))

# Output:
<class 'float'>
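
One practical consequence of NaN being a float is that it follows floating-point comparison rules: it does not compare equal to anything, including itself. The following is a minimal sketch of how a null check is usually written instead of using ==; the variable names are illustrative only.

```python
import math

import pandas as pd
from numpy import nan as NaN

# NaN never compares equal to anything, not even to itself
equal_to_itself = (NaN == NaN)

# Use a dedicated check instead of ==
checked_with_math = math.isnan(NaN)
checked_with_pandas = pd.isna(NaN)

print(equal_to_itself, checked_with_math, checked_with_pandas)
```

This is why the later examples rely on methods such as notnull() rather than comparing values with NaN directly.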

When NaN takes part in an arithmetic operation, the result is generally NaN as well.

from numpy import nan as NaN
print(NaN+1)

# Output:
nan

So null values affect the results of our calculations.
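
How much a null value affects a result depends on the operation. As a small sketch (the Series below is made up for illustration): element-wise arithmetic propagates NaN, whereas pandas aggregations such as sum() and mean() skip NaN by default unless skipna=False is passed.

```python
import numpy as np
import pandas as pd

s = pd.Series([4, np.nan, 8])

plus_one = s + 1                     # element-wise: the NaN entry stays NaN
total_default = s.sum()              # NaN is skipped: 4 + 8
mean_default = s.mean()              # NaN is skipped: (4 + 8) / 2
total_strict = s.sum(skipna=False)   # NaN propagates into the total

print(plus_one.tolist())
print(total_default, mean_default, total_strict)
```

Silently skipping NaN can be just as misleading as propagating it, which is one reason to deal with null values explicitly before computing statistics.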

With a large Series, it is hard to spot null values by eye, so we first need to filter them out.

import pandas as pd
from numpy import nan as NaN
se = pd.Series([4,NaN,8,NaN,5])
print(se.notnull())
print('='*20)
print(se[se.notnull()])

# Output:
0     True
1    False
2     True
3    False
4     True
dtype: bool
====================
0    4.0
2    8.0
4    5.0
dtype: float64
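
As a complement to the filtering shown above, here is a short sketch (same made-up Series) that counts the null values first and then uses Series.dropna(), which gives the same result as the boolean mask.

```python
import numpy as np
import pandas as pd

se = pd.Series([4, np.nan, 8, np.nan, 5])

null_count = int(se.isnull().sum())   # each True counts as 1 when summed
filtered = se[se.notnull()]           # boolean-mask filtering, as above
dropped = se.dropna()                 # shortcut that yields the same Series

print(null_count)
print(dropped.equals(filtered))
```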

For DataFrame data, rows that contain NaN are commonly removed with the dropna() method.

df1 = df.dropna()

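dropna() is the method for deleting data that contains null values:

  • By default, it deletes every row that contains at least one NaN.
  • To delete only the rows in which every value is null, pass how='all'.
  • To delete columns instead of rows, pass the axis parameter: axis=1 means columns, axis=0 means rows.
  • The thresh parameter can also be used to select what is deleted: thresh=n keeps only the rows that have at least n non-NaN values.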
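
The options listed above can be compared side by side on a tiny DataFrame. This is only a sketch; df_demo and its values are invented for illustration and are not part of rate.xlsx.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'a': [1.0, np.nan, np.nan],
    'b': [2.0, 5.0, np.nan],
    'c': [3.0, 6.0, np.nan],
})

rows_any = df_demo.dropna()              # drop rows with at least one NaN
rows_all = df_demo.dropna(how='all')     # drop only rows that are entirely NaN
cols_any = rows_all.dropna(axis=1)       # drop columns that still contain a NaN
rows_thresh = df_demo.dropna(thresh=2)   # keep rows with at least 2 non-NaN values

print(rows_any.index.tolist())
print(rows_all.index.tolist())
print(cols_any.columns.tolist())
print(rows_thresh.index.tolist())
```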

What if we only want to delete a couple of specific rows? For that we can use the df.drop() method. Here are its main parameters:

DataFrame.drop(labels=None,axis=0,index=None,columns=None,inplace=False)

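The parameters are as follows:

  • labels: the names of the rows or columns to delete, given as a list; used together with axis
  • index: directly specifies the rows to delete
  • columns: directly specifies the columns to delete
  • inplace=False: the default; the deletion does not modify the original data and instead returns a new DataFrame with the deletion applied
  • inplace=True: performs the deletion directly on the original data; it cannot be undone afterwards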
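
The difference between the two inplace settings can be sketched as follows. The DataFrame df_demo is a made-up example used only to show the behavior.

```python
import pandas as pd

df_demo = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})

# inplace=False (the default): a new DataFrame is returned, df_demo is untouched
result = df_demo.drop(index=[0])
shape_after_default = df_demo.shape

# inplace=True: df_demo itself is modified and the call returns None
returned = df_demo.drop(columns=['y'], inplace=True)

print(result.shape, shape_after_default)
print(returned, df_demo.columns.tolist())
```

Because inplace=True returns None, writing df = df.drop(..., inplace=True) would replace the DataFrame with None, which is a common mistake.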

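So, based on these parameters, there are two ways to delete rows or columns:

  • combine labels=[...] with axis (axis=0 for rows, axis=1 for columns)
  • use index or columns to directly specify the rows or columns to delete

# Method 1
import pandas as pd
df = pd.read_excel(r'C:\Users\lin-a\Desktop\data\rate.xlsx')
# Delete the rows with index labels 0 and 1 (the first and second rows)
df3 = df.drop(labels=[0,1],axis=0)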
print(df3)

# Output:
	CountryName	Country Code	1990	2000	2007	2008	2009	2010	2011	2012	2013	2014	2015	Change 1990-2015	Change 2007-2015
2	Algeria	DZA	87.675705	62.886169	49.487870	48.910002	48.645026	48.681853	49.233576	49.847713	50.600697	51.536631	52.617579	-35.058127	3.129709
3	American Samoa	ASM	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	Andorra	ADO	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
5	Algeria	DZA	87.675705	62.886169	49.487870	48.910002	48.645026	48.681853	49.233576	49.847713	50.600697	51.536631	52.617579	-35.058127	3.129709
6	Angola	AGO	101.394722	100.930475	102.563811	102.609186	102.428788	102.035690	102.106756	101.836900	101.315234	100.637667	99.855751	-1.538971	-2.708060
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
214	Virgin Islands (U.S.)	VIR	53.546901	52.647183	49.912779	50.459425	51.257336	52.382523	53.953515	55.687666	57.537152	59.399244	61.199651	7.652751	11.286872
215	West Bank and Gaza	WBG	102.789182	100.410470	88.367099	86.204936	84.173732	82.333762	80.851747	79.426778	78.118403	76.975462	76.001869	-26.787313	-12.365230
216	Yemen, Rep.	YEM	118.779727	105.735754	88.438350	85.899643	83.594489	81.613924	80.193948	78.902603	77.734373	76.644268	75.595147	-43.184581	-12.843203
217	Zambia	ZMB	100.485263	97.124776	98.223058	98.277253	98.260795	98.148325	97.854237	97.385802	96.791310	96.122165	95.402326	-5.082936	-2.820732
218	Zimbabwe	ZWE	96.365244	84.877487	81.272166	81.024020	80.934968	80.985702	80.740494	80.579870	80.499816	80.456439	80.391033	-15.974211	-0.881133
217 rows × 15 columns
# Method 2
# Delete the column named 1990 (columns= already implies the column axis)
df4 = df.drop(columns=1990)
print(df4)

# Output:
	CountryName	Country Code	2000	2007	2008	2009	2010	2011	2012	2013	2014	2015	Change 1990-2015	Change 2007-2015
0	Afghanistan	AFG	103.254202	100.000371	100.215886	100.060480	99.459839	97.667911	95.312707	92.602785	89.773777	86.954464	-14.140466	-13.045907
1	Albania	ALB	59.585866	50.862987	49.663787	48.637067	NaN	46.720288	45.835739	45.247477	44.912168	44.806973	-17.001338	-6.056014
2	Algeria	DZA	62.886169	49.487870	48.910002	48.645026	48.681853	49.233576	49.847713	50.600697	51.536631	52.617579	-35.058127	3.129709
3	American Samoa	ASM	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	Andorra	ADO	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
214	Virgin Islands (U.S.)	VIR	52.647183	49.912779	50.459425	51.257336	52.382523	53.953515	55.687666	57.537152	59.399244	61.199651	7.652751	11.286872
215	West Bank and Gaza	WBG	100.410470	88.367099	86.204936	84.173732	82.333762	80.851747	79.426778	78.118403	76.975462	76.001869	-26.787313	-12.365230
216	Yemen, Rep.	YEM	105.735754	88.438350	85.899643	83.594489	81.613924	80.193948	78.902603	77.734373	76.644268	75.595147	-43.184581	-12.843203
217	Zambia	ZMB	97.124776	98.223058	98.277253	98.260795	98.148325	97.854237	97.385802	96.791310	96.122165	95.402326	-5.082936	-2.820732
218	Zimbabwe	ZWE	84.877487	81.272166	81.024020	80.934968	80.985702	80.740494	80.579870	80.499816	80.456439	80.391033	-15.974211	-0.881133
219 rows × 14 columns
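
To confirm that the two styles are interchangeable, the sketch below applies both to a small invented DataFrame (df_demo, with integer year labels as column names, similar to the rate data) and compares the results.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'name': ['A', 'B', 'C'], 1990: [1.0, 2.0, 3.0], 2000: [4.0, 5.0, 6.0]}
)

# Rows: labels + axis=0 versus index=
rows_v1 = df_demo.drop(labels=[0, 1], axis=0)
rows_v2 = df_demo.drop(index=[0, 1])

# Columns: labels + axis=1 versus columns=
cols_v1 = df_demo.drop(labels=[1990], axis=1)
cols_v2 = df_demo.drop(columns=[1990])

print(rows_v1.equals(rows_v2), cols_v1.equals(cols_v2))
print(cols_v2.columns.tolist())
```

The index= and columns= forms tend to read more clearly because the axis is implied by the keyword.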

II. Handling null values

For null values, we can either delete the whole record or fill in the missing values with the fillna() method.

df.fillna(value=None,method=None,axis=None,inplace=False,limit=None,downcast=None,**kwargs)

Note: the method parameter cannot be used together with the value parameter.

import pandas as pd
df3 = pd.read_excel(r'C:\Users\lin-a\Desktop\data\rate.xlsx')

# Fill null values with a constant
print(df3.fillna(0))

# Output:
CountryName	Country Code	1990	2000	2007	2008	2009	2010	2011	2012	2013	2014	2015	Change 1990-2015	Change 2007-2015
0	Afghanistan	AFG	101.094930	103.254202	100.000371	100.215886	100.060480	99.459839	97.667911	95.312707	92.602785	89.773777	86.954464	-14.140466	-13.045907
1	Albania	ALB	61.808311	59.585866	50.862987	49.663787	48.637067	0.000000	46.720288	45.835739	45.247477	44.912168	44.806973	-17.001338	-6.056014
2	Algeria	DZA	87.675705	62.886169	49.487870	48.910002	48.645026	48.681853	49.233576	49.847713	50.600697	51.536631	52.617579	-35.058127	3.129709
3	American Samoa	ASM	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
4	Andorra	ADO	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
214	Virgin Islands (U.S.)	VIR	53.546901	52.647183	49.912779	50.459425	51.257336	52.382523	53.953515	55.687666	57.537152	59.399244	61.199651	7.652751	11.286872
215	West Bank and Gaza	WBG	102.789182	100.410470	88.367099	86.204936	84.173732	82.333762	80.851747	79.426778	78.118403	76.975462	76.001869	-26.787313	-12.365230
216	Yemen, Rep.	YEM	118.779727	105.735754	88.438350	85.899643	83.594489	81.613924	80.193948	78.902603	77.734373	76.644268	75.595147	-43.184581	-12.843203
217	Zambia	ZMB	100.485263	97.124776	98.223058	98.277253	98.260795	98.148325	97.854237	97.385802	96.791310	96.122165	95.402326	-5.082936	-2.820732
218	Zimbabwe	ZWE	96.365244	84.877487	81.272166	81.024020	80.934968	80.985702	80.740494	80.579870	80.499816	80.456439	80.391033	-15.974211	-0.881133
219 rows × 15 columns

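# Fill each column's null values with that column's mean
# (numeric_only=True skips the text columns; newer pandas versions raise an error without it)
print(df3.fillna(df3.mean(numeric_only=True)))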

# Output:
	CountryName	Country Code	1990	2000	2007	2008	2009	2010	2011	2012	2013	2014	2015	Change 1990-2015	Change 2007-2015
0	Afghanistan	AFG	101.094930	103.254202	100.000371	100.215886	100.060480	99.459839	97.667911	95.312707	92.602785	89.773777	86.954464	-14.140466	-13.045907
1	Albania	ALB	61.808311	59.585866	50.862987	49.663787	48.637067	59.623883	46.720288	45.835739	45.247477	44.912168	44.806973	-17.001338	-6.056014
2	Algeria	DZA	87.675705	62.886169	49.487870	48.910002	48.645026	48.681853	49.233576	49.847713	50.600697	51.536631	52.617579	-35.058127	3.129709
3	American Samoa	ASM	72.706435	66.999367	61.311772	60.652862	60.066010	59.623883	59.297153	59.058143	58.864043	58.726255	58.633697	-14.072738	-2.678076
4	Andorra	ADO	72.706435	66.999367	61.311772	60.652862	60.066010	59.623883	59.297153	59.058143	58.864043	58.726255	58.633697	-14.072738	-2.678076
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
214	Virgin Islands (U.S.)	VIR	53.546901	52.647183	49.912779	50.459425	51.257336	52.382523	53.953515	55.687666	57.537152	59.399244	61.199651	7.652751	11.286872
215	West Bank and Gaza	WBG	102.789182	100.410470	88.367099	86.204936	84.173732	82.333762	80.851747	79.426778	78.118403	76.975462	76.001869	-26.787313	-12.365230
216	Yemen, Rep.	YEM	118.779727	105.735754	88.438350	85.899643	83.594489	81.613924	80.193948	78.902603	77.734373	76.644268	75.595147	-43.184581	-12.843203
217	Zambia	ZMB	100.485263	97.124776	98.223058	98.277253	98.260795	98.148325	97.854237	97.385802	96.791310	96.122165	95.402326	-5.082936	-2.820732
218	Zimbabwe	ZWE	96.365244	84.877487	81.272166	81.024020	80.934968	80.985702	80.740494	80.579870	80.499816	80.456439	80.391033	-15.974211	-0.881133
219 rows × 15 columns

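# Fill with the previous row's value (forward fill)
# Equivalent to df3.fillna(method='ffill',axis=0); recent pandas versions deprecate the method argument
print(df3.ffill(axis=0))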

# Output:
CountryName	Country Code	1990	2000	2007	2008	2009	2010	2011	2012	2013	2014	2015	Change 1990-2015	Change 2007-2015
0	Afghanistan	AFG	101.094930	103.254202	100.000371	100.215886	100.060480	99.459839	97.667911	95.312707	92.602785	89.773777	86.954464	-14.140466	-13.045907
1	Albania	ALB	61.808311	59.585866	50.862987	49.663787	48.637067	99.459839	46.720288	45.835739	45.247477	44.912168	44.806973	-17.001338	-6.056014
2	Algeria	DZA	87.675705	62.886169	49.487870	48.910002	48.645026	48.681853	49.233576	49.847713	50.600697	51.536631	52.617579	-35.058127	3.129709
3	American Samoa	ASM	87.675705	62.886169	49.487870	48.910002	48.645026	48.681853	49.233576	49.847713	50.600697	51.536631	52.617579	-35.058127	3.129709
4	Andorra	ADO	87.675705	62.886169	49.487870	48.910002	48.645026	48.681853	49.233576	49.847713	50.600697	51.536631	52.617579	-35.058127	3.129709
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
214	Virgin Islands (U.S.)	VIR	53.546901	52.647183	49.912779	50.459425	51.257336	52.382523	53.953515	55.687666	57.537152	59.399244	61.199651	7.652751	11.286872
215	West Bank and Gaza	WBG	102.789182	100.410470	88.367099	86.204936	84.173732	82.333762	80.851747	79.426778	78.118403	76.975462	76.001869	-26.787313	-12.365230
216	Yemen, Rep.	YEM	118.779727	105.735754	88.438350	85.899643	83.594489	81.613924	80.193948	78.902603	77.734373	76.644268	75.595147	-43.184581	-12.843203
217	Zambia	ZMB	100.485263	97.124776	98.223058	98.277253	98.260795	98.148325	97.854237	97.385802	96.791310	96.122165	95.402326	-5.082936	-2.820732
218	Zimbabwe	ZWE	96.365244	84.877487	81.272166	81.024020	80.934968	80.985702	80.740494	80.579870	80.499816	80.456439	80.391033	-15.974211	-0.881133
219 rows × 15 columns
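
The three filling strategies above are easier to compare on a small example. The following sketch uses an invented two-column DataFrame (df_demo) and also shows a per-column fill value passed as a dict, which fillna() accepts as well.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'country': ['A', 'B', 'C'],
    'rate': [10.0, np.nan, 30.0],
})

filled_const = df_demo.fillna(0)                               # constant
filled_mean = df_demo.fillna(df_demo.mean(numeric_only=True))  # column mean
filled_ffill = df_demo.ffill()                                 # previous row's value
filled_per_col = df_demo.fillna({'rate': -1.0})                # value chosen per column

print(filled_const['rate'].tolist())
print(filled_mean['rate'].tolist())
print(filled_ffill['rate'].tolist())
print(filled_per_col['rate'].tolist())
```

Which strategy is appropriate depends on the data: forward fill, for example, copies the value of a different row, which makes sense for time series but not for unrelated records such as different countries.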

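III. Handling duplicate data

Duplicate data lowers both the accuracy and the efficiency of an analysis, so duplicate records should be removed when cleaning the data.

  • The duplicated() function returns, for each row, whether it is a duplicate (True if it is).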
import pandas as pd
df3 = pd.read_excel(r'C:\Users\lin-a\Desktop\data\rate.xlsx')

# Return the duplicate check for each row
print(df3.duplicated())

# Output:
0      False
1      False
2      False
3      False
4      False
       ...  
214    False
215    False
216    False
217    False
218    False
Length: 219, dtype: bool

The result is a Series of Boolean values: a row is marked True if its values in all columns repeat those of an earlier row, and False otherwise.
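
Since the printed Series is truncated, it is usually more useful to count the True values. The sketch below uses an invented three-row DataFrame (df_demo) and also shows the keep parameter of duplicated(), which controls which copies are flagged.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'CountryName': ['Algeria', 'Angola', 'Algeria'],
    'Country Code': ['DZA', 'AGO', 'DZA'],
})

flags_first = df_demo.duplicated()             # keep='first' is the default
flags_last = df_demo.duplicated(keep='last')   # flag the earlier copies instead
flags_all = df_demo.duplicated(keep=False)     # flag every copy
duplicate_count = int(flags_first.sum())

print(flags_first.tolist())
print(flags_last.tolist())
print(flags_all.tolist())
print(duplicate_count)
```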

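  • drop_duplicates() deletes the duplicate rows.
df.drop_duplicates()
  • We can also check for duplicates in a particular column and delete on that basis.
df.drop_duplicates(['CountryName'],inplace=False)

Here, ['CountryName'] means that only the CountryName column is compared for duplicates, and inplace controls whether the original data is modified directly.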

import pandas as pd
df3 = pd.read_excel(r'C:\Users\lin-a\Desktop\data\rate.xlsx')

# Delete rows whose CountryName value is a duplicate
print(df3.drop_duplicates(['CountryName'],inplace=False))

# Output:
	CountryName	Country Code	1990	2000	2007	2008	2009	2010	2011	2012	2013	2014	2015	Change 1990-2015	Change 2007-2015
0	Afghanistan	AFG	101.094930	103.254202	100.000371	100.215886	100.060480	99.459839	97.667911	95.312707	92.602785	89.773777	86.954464	-14.140466	-13.045907
1	Albania	ALB	61.808311	59.585866	50.862987	49.663787	48.637067	NaN	46.720288	45.835739	45.247477	44.912168	44.806973	-17.001338	-6.056014
2	Algeria	DZA	87.675705	62.886169	49.487870	48.910002	48.645026	48.681853	49.233576	49.847713	50.600697	51.536631	52.617579	-35.058127	3.129709
3	American Samoa	ASM	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	Andorra	ADO	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
214	Virgin Islands (U.S.)	VIR	53.546901	52.647183	49.912779	50.459425	51.257336	52.382523	53.953515	55.687666	57.537152	59.399244	61.199651	7.652751	11.286872
215	West Bank and Gaza	WBG	102.789182	100.410470	88.367099	86.204936	84.173732	82.333762	80.851747	79.426778	78.118403	76.975462	76.001869	-26.787313	-12.365230
216	Yemen, Rep.	YEM	118.779727	105.735754	88.438350	85.899643	83.594489	81.613924	80.193948	78.902603	77.734373	76.644268	75.595147	-43.184581	-12.843203
217	Zambia	ZMB	100.485263	97.124776	98.223058	98.277253	98.260795	98.148325	97.854237	97.385802	96.791310	96.122165	95.402326	-5.082936	-2.820732
218	Zimbabwe	ZWE	96.365244	84.877487	81.272166	81.024020	80.934968	80.985702	80.740494	80.579870	80.499816	80.456439	80.391033	-15.974211	-0.881133
217 rows × 15 columns
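
The effect of comparing a single column versus whole rows can be sketched on an invented DataFrame (df_demo). The example also uses the keep and ignore_index parameters of drop_duplicates(), which the text above does not cover.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'CountryName': ['Algeria', 'Angola', 'Algeria'],
    'value': [1, 2, 3],
})

# Whole rows differ (value 1 vs 3), so nothing is dropped without a subset
no_subset = df_demo.drop_duplicates()

# Compare only CountryName; the first occurrence is kept by default
keep_first = df_demo.drop_duplicates(subset=['CountryName'])

# keep='last' retains the final occurrence; ignore_index renumbers the rows
keep_last = df_demo.drop_duplicates(
    subset=['CountryName'], keep='last', ignore_index=True
)

print(len(no_subset))
print(keep_first['value'].tolist(), keep_first.index.tolist())
print(keep_last['value'].tolist(), keep_last.index.tolist())
```

Dropping on a subset of columns discards the other columns' values for the removed rows, so it is worth checking that those rows really describe the same record.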

IV. Summary

Deleting data

[image]

Handling null values

[image]

Removing duplicate data

[image]

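V. Exercises

  1. From a dataset of English domestic football match results since the 2009-2010 season, delete all data for the 2009/2010 season as well as the country column.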
import pandas as pd

# Read the data
df = pd.read_excel(r'C:\Users\lin-a\Desktop\data\sccore_game.xlsx')

# Get a feel for the data
print(df.shape)
print(df.head())

# Loop over the rows and collect the index labels of the 2009/2010 season
index_list = []
for idx,row_data in df.iterrows():
    if row_data['season'] == '2009/2010':
        index_list.append(idx)

# Delete the 2009/2010 season rows
df1 = df.drop(labels=index_list,axis=0)
print(df1)

# Delete the country column, method 1 (applied to df1 so both deletions are combined)
df2 = df1.drop(labels=['country'],axis=1)
print(df2)


# Delete the country column, method 2
df3 = df1.drop(columns='country')
print(df3)
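
The iterrows() loop works, but the same rows can be selected with a boolean mask, which is usually shorter and faster. The sketch below uses a hypothetical stand-in for the match data (games, with an invented home_goals column), since the real file is not reproduced here.

```python
import pandas as pd

# Hypothetical stand-in for the match data; only season and country matter here
games = pd.DataFrame({
    'season': ['2009/2010', '2010/2011', '2009/2010', '2011/2012'],
    'country': ['England'] * 4,
    'home_goals': [1, 2, 0, 3],
})

# Keep the rows that are NOT from 2009/2010, then drop the country column
cleaned = games[games['season'] != '2009/2010'].drop(columns='country')

print(cleaned)
```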
  2. Look at the data and, from each product's unit price (OnePrice) and purchase quantity (Count), compute the corresponding total price (Price).
import pandas as pd

# Read the data
books = pd.read_excel(r'C:\Users\lin-a\Desktop\data\04books.xlsx')
# Get a feel for the data
print(books.shape)
print(books.head())

#print('{:-^50}'.format('divider'))
# Compute Price for every row (the columns are aligned on the index, then multiplied)
#books['Price'] = books['OnePrice']*books['Count']
#print(books)

print('{:-^50}'.format('divider'))
# Compute only a contiguous range of rows (positions 5 through 15) with a loop
price_col = books.columns.get_loc('Price')
for i in range(5,16):
    # A single .iloc[row, column] assignment avoids chained indexing,
    # which may fail to update the original DataFrame
    books.iloc[i, price_col] = books['OnePrice'].iloc[i]*books['Count'].iloc[i]
print(books)

# Output:
(20, 5)
   ID         Name  OnePrice  Count  Price
0   1  Product_001      9.82      5    NaN
1   2  Product_002     11.99      4    NaN
2   3  Product_003      9.62      6    NaN
3   4  Product_004     11.08      8    NaN
4   5  Product_005      7.75      3    NaN
---------------------divider----------------------
    ID         Name  OnePrice  Count  Price
0    1  Product_001      9.82      5    NaN
1    2  Product_002     11.99      4    NaN
2    3  Product_003      9.62      6    NaN
3    4  Product_004     11.08      8    NaN
4    5  Product_005      7.75      3    NaN
5    6  Product_006      7.34      4  29.36
6    7  Product_007     10.97      6  65.82
7    8  Product_008     11.14      7  77.98
8    9  Product_009      8.98      2  17.96
9   10  Product_010      9.18      3  27.54
10  11  Product_011      8.31      4  33.24
11  12  Product_012      7.29      9  65.61
12  13  Product_013      8.36      5  41.80
13  14  Product_014      9.16      6  54.96
14  15  Product_015     10.31      3  30.93
15  16  Product_016     10.26      6  61.56
16  17  Product_017     11.95      8    NaN
17  18  Product_018     11.22      2    NaN
18  19  Product_019     10.95      4    NaN
19  20  Product_020      8.82      6    NaN
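
The loop can also be replaced by a single vectorized assignment over a label slice. The following is a sketch on a hypothetical five-row miniature of the table (books_demo), assuming the default integer index; note that .loc slices include both endpoints.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the books table
books_demo = pd.DataFrame({
    'OnePrice': [9.82, 11.99, 9.62, 11.08, 7.75],
    'Count': [5, 4, 6, 8, 3],
    'Price': np.nan,
})

# Fill Price for index labels 1 through 3 only (both ends are included with .loc)
books_demo.loc[1:3, 'Price'] = (
    books_demo.loc[1:3, 'OnePrice'] * books_demo.loc[1:3, 'Count']
)

print(books_demo)
```

For the exercise itself, the equivalent slice would be 5:15, matching range(5,16) in the loop.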