Python Self-Study, Part 16 [pandas: Data Analysis (II): Reading Files + Indexing + Handling NaNs]

Overview: NumPy + SciPy + pandas + matplotlib

Basic pandas functionality

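I. Reading data files / text data

1. pandas: reading data files

The read_xxx family of functions provided by pandas reads the data in a file and returns it as a DataFrame. The most commonly used one is read_csv, which reads delimited text data.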

help(pd.read_csv)

2. Reading a CSV file

First create a file named data1.csv with the following contents:

name,age,source
Peter,18,98.5
Tom,21,78.2
Bob,24,98.5
Wangdachui,20,89.2

Jupyter notebook code:

import numpy as np
from pandas import Series, DataFrame
import pandas as pd
# Read the CSV file; the first line is used as the column header
df=pd.read_csv("data1.csv")
df
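
To try read_csv without creating a file first, the same contents can be passed in through io.StringIO. This is a minimal sketch rather than part of the original notebook; the names csv_text and df_demo are made up for illustration.

```python
import io
import pandas as pd

# Same contents as data1.csv, held in memory (illustrative stand-in for the file)
csv_text = """name,age,source
Peter,18,98.5
Tom,21,78.2
Bob,24,98.5
Wangdachui,20,89.2
"""

# read_csv accepts a file-like object as well as a path
df_demo = pd.read_csv(io.StringIO(csv_text))

print(df_demo.shape)          # (number of rows, number of columns)
print(list(df_demo.columns))  # the first line of the text becomes the header
print(df_demo["age"].mean())  # numeric columns are parsed as numbers
```

Reading from a string keeps the example reproducible; with a real file, only the first argument changes.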

3. Reading a txt file

First create a file named data01.txt with the following contents:

王大錘;18;100;99;98
王大錘;18;100;99;98
王大錘;18;100;99;98

Jupyter notebook code:

import numpy as np
from pandas import Series, DataFrame
import pandas as pd
# Read the text data with ';' as the separator; header=None means the file has no header row
df=pd.read_csv("data01.txt",sep=';',header=None)
df
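
The effect of header=None is easiest to see by reading the same text with and without it. The sketch below uses an in-memory copy of the file contents; the variable names are illustrative and not from the original post.

```python
import io
import pandas as pd

# Three identical lines, as in data01.txt
txt = "王大錘;18;100;99;98\n" * 3

# header=None: every line is data, columns are numbered 0..4
with_header_none = pd.read_csv(io.StringIO(txt), sep=';', header=None)

# Default: the first line is consumed as the header, leaving two data rows
default_header = pd.read_csv(io.StringIO(txt), sep=';')

print(with_header_none.shape)
print(list(with_header_none.columns))
print(default_header.shape)
```

Without header=None the first record silently turns into column names, which is why the original code passes it explicitly.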

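II. Indexing, selection, and data filtering

pandas: filtering and selecting data
DataFrame indexing lets you pick out columns or rows and get back a new DataFrame, which is convenient for later statistical calculations.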
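
As a minimal sketch of the two most common cases (selecting columns and filtering rows), assuming a small frame with the same values as data1.csv; the variable names are illustrative:

```python
import pandas as pd

people = pd.DataFrame({
    "name": ["Peter", "Tom", "Bob", "Wangdachui"],
    "age": [18, 21, 24, 20],
    "source": [98.5, 78.2, 98.5, 89.2],
})

# Column selection: a list of labels returns a new DataFrame
name_and_score = people[["name", "source"]]

# Row filtering: a boolean Series keeps only the rows where it is True
older_than_20 = people[people["age"] > 20]

print(name_and_score)
print(older_than_20)
```

Both operations leave the original frame unchanged and return new objects.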

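1. Setting the header row (column names)

Create a file named data01.txt (the same file as in the previous section) with the following contents: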

王大錘;18;100;99;98
王大錘;18;100;99;98
王大錘;18;100;99;98

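Jupyter notebook code:

import numpy as np
from pandas import Series, DataFrame
import pandas as pd
# Re-read the file so that this cell runs on its own
df=pd.read_csv("data01.txt",sep=';',header=None)
# Column labels: name, age, Chinese (語文), math (數學), English (英語)
columns=['name','age',u'語文',u'數學',u'英語']
df.columns=columns
df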
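
An alternative to assigning df.columns afterwards is to pass the labels at read time through the names parameter of read_csv. The sketch below is an illustration under that assumption and reads from an in-memory copy of the file contents:

```python
import io
import pandas as pd

txt = "王大錘;18;100;99;98\n" * 3
columns = ['name', 'age', '語文', '數學', '英語']

# names= supplies the column labels; header=None says the text has no header line
named = pd.read_csv(io.StringIO(txt), sep=';', header=None, names=columns)

print(named.columns.tolist())
print(named['age'].tolist())
```

Either approach gives the same labels; names= simply avoids the intermediate frame with numbered columns.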

2. Selecting content by slicing

Create a file named data01.txt (the same file as before) with the following contents:

王大錘;18;100;99;98
王大錘;18;100;99;98
王大錘;18;100;99;98

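Jupyter notebook code:

import numpy as np
from pandas import Series, DataFrame
import pandas as pd
columns=['name','age',u'語文',u'數學',u'英語']
# Re-read the file with the column names so that this cell runs on its own
df=pd.read_csv("data01.txt",sep=';',header=None,names=columns)
# Slice the label list to keep only the third column onward (the three score columns)
df=df[columns[2:]]
df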
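
Slicing the list of labels is one way to do this; loc and iloc slice the frame directly. A short sketch with made-up data and English column names (both are assumptions for illustration):

```python
import pandas as pd

scores = pd.DataFrame(
    [["A", 18, 100, 99, 98], ["B", 19, 90, 85, 80], ["C", 20, 70, 75, 60]],
    columns=["name", "age", "chinese", "math", "english"],
)

by_label = scores.loc[:, "chinese":"english"]  # label slice, end label included
by_position = scores.iloc[:, 2:]               # position slice, end excluded
first_two_rows = scores.iloc[:2, 2:]           # rows and columns at once

print(by_label.equals(by_position))
print(first_two_rows)
```

Note the asymmetry: label slices with loc include the end label, while position slices with iloc follow normal Python slicing.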

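III. pandas: handling missing values (NaN)

For NaN values in a DataFrame/Series, the usual approach is either to drop the affected rows/columns or to fill them with a default value.

Code: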

import numpy as np
from pandas import Series, DataFrame
import pandas as pd
df2=DataFrame([
    ['Tom',np.nan,456.67,'M'],
    ['Merry',34,345.56,np.nan],
    ['Gerry',np.nan,np.nan,np.nan],
    ['Jom',np.nan,456.67,'M'],
    ['Jone',18,35.12,'F']],
    columns=['name','age','salary','gender']
)
df2

Result:


name	age	salary	gender
0	Tom	NaN	456.67	M
1	Merry	34.0	345.56	NaN
2	Gerry	NaN	NaN	NaN
3	Jom	NaN	456.67	M
4	Jone	18.0	35.12	F

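1.dropna

Filters (drops) rows or columns according to whether their values contain missing data. The thresh parameter sets how many non-missing values a row or column must have to be kept, which controls the tolerance for missing values.

1.dropna()

Code: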

import numpy as np
from pandas import Series, DataFrame
import pandas as pd
df2=DataFrame([
    ['Tom',np.nan,456.67,'M'],
    ['Merry',34,345.56,np.nan],
    ['Gerry',np.nan,np.nan,np.nan],
    ['Jom',np.nan,456.67,'M'],
    ['Jone',18,35.12,'F']],
    columns=['name','age','salary','gender']
)
df2.dropna()  # By default, drops every row that contains at least one missing value

Result:

name	age	salary	gender
4	Jone	18.0	35.12	F

2.dropna(how='all')

Code:

import numpy as np
from pandas import Series, DataFrame
import pandas as pd
df2=DataFrame([
    ['Tom',np.nan,456.67,'M'],
    ['Merry',34,345.56,np.nan],
    ['Gerry',np.nan,np.nan,np.nan],
    ['Jom',np.nan,456.67,'M'],
    ['Jone',18,35.12,'F']],
    columns=['name','age','salary','gender']
)
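df2.dropna(how='all')  # Drops only the rows in which every value is missing

Result (no row is dropped here, because every row still has a non-missing name):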

	name	age	salary	gender
0	Tom	NaN	456.67	M
1	Merry	34.0	345.56	NaN
2	Gerry	NaN	NaN	NaN
3	Jom	NaN	456.67	M
4	Jone	18.0	35.12	F

3.dropna(axis=1)

Code:

import numpy as np
from pandas import Series, DataFrame
import pandas as pd
df2=DataFrame([
    ['Tom',np.nan,456.67,'M'],
    ['Merry',34,345.56,np.nan],
    ['Gerry',np.nan,np.nan,np.nan],
    ['Jom',np.nan,456.67,'M'],
    ['Jone',18,35.12,'F']],
    columns=['name','age','salary','gender']
)
df2.dropna(axis=1)  # Drops every column that contains at least one missing value

Result:

name
0	Tom
1	Merry
2	Gerry
3	Jom
4	Jone
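
The threshold mentioned in the description of dropna is not exercised by the three calls above. The sketch below shows thresh, together with subset, on the same df2; it is an added illustration, and the result variable names are made up.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame([
    ['Tom', np.nan, 456.67, 'M'],
    ['Merry', 34, 345.56, np.nan],
    ['Gerry', np.nan, np.nan, np.nan],
    ['Jom', np.nan, 456.67, 'M'],
    ['Jone', 18, 35.12, 'F']],
    columns=['name', 'age', 'salary', 'gender'])

# thresh=3: keep only rows that have at least 3 non-missing values
at_least_three = df2.dropna(thresh=3)

# subset: only the listed columns are checked when deciding what to drop
has_age = df2.dropna(subset=['age'])

print(at_least_three['name'].tolist())  # ['Tom', 'Merry', 'Jom', 'Jone']
print(has_age['name'].tolist())         # ['Merry', 'Jone']
```

Gerry has only one non-missing value (the name), so thresh=3 removes that row and keeps the rest.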

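2.fillna

Fills missing data with a specified value or with a fill method such as ffill (forward fill: carry the previous valid value forward) or bfill (backward fill: use the next valid value).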
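
Because the examples in this section use random numbers, the effect of each fill strategy is easier to check on a small fixed frame. The following is a sketch with made-up data; it uses the DataFrame.ffill and DataFrame.bfill methods, since recent pandas versions deprecate fillna(method=...).

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [np.nan, 2.0, np.nan, np.nan],
})

constant = frame.fillna(0)                      # one value for every NaN
per_column = frame.fillna({"a": -1, "b": 0.5})  # a different value per column
forward = frame.ffill()                         # carry the last valid value downward
backward = frame.bfill()                        # pull the next valid value upward

print(per_column)
print(forward)
print(backward)
```

Forward fill leaves the first value of b as NaN because nothing precedes it, and backward fill leaves the last two values of b as NaN because nothing follows them.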

Code:

import numpy as np
from pandas import Series, DataFrame
import pandas as pd

df=DataFrame(np.random.randn(7,3))
df.loc[:4,1]=np.nan
df.loc[:2,2]=np.nan
df

Result:

0	1	2
0	-0.031422	NaN	NaN
1	-0.916141	NaN	NaN
2	0.427765	NaN	NaN
3	0.242490	NaN	0.200289
4	-0.214651	NaN	0.533594
5	0.302438	-0.228859	-0.883538
6	-0.356205	0.154669	-0.448864

1.fillna(0)

Code:

import numpy as np
from pandas import Series, DataFrame
import pandas as pd

df=DataFrame(np.random.randn(7,3))
df.loc[:4,1]=np.nan
df.loc[:2,2]=np.nan
print(df.fillna(0))
print()
print(df.fillna('fill'))

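Result (the two tables contain different random values, so they evidently come from two separate runs of the cell):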

         0         1         2
0 -1.034991  0.000000  0.000000
1 -1.923518  0.000000  0.000000
2 -1.314832  0.000000  0.000000
3  0.167929  0.000000 -0.806852
4  0.900060  0.000000  1.443878
5  0.312364  0.222698 -1.000081
6 -0.291597  1.095243  0.678713

          0         1         2
0  1.093552      fill      fill
1  2.208307      fill      fill
2  0.319327      fill      fill
3 -0.525311      fill  -0.11486
4  1.539547      fill  -1.30771
5 -3.682927  0.588243   1.15384
6  0.255435  -2.06252 -0.808872

2.fillna({1:0.5,2:-1,3:1})

Code:

import numpy as np
from pandas import Series, DataFrame
import pandas as pd

df=DataFrame(np.random.randn(7,3))
df.loc[:4,1]=np.nan
df.loc[:2,2]=np.nan
df.fillna({1:0.5,2:-1,3:1})
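
Result (as printed in the original post; it does not reflect the fill, because with the code above rows 0 to 4 of column 1 would be 0.5 and rows 0 to 2 of column 2 would be -1.0; the key 3 matches no column and is ignored):
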
0	1	2
0	1.317635	-1.574467	1.399366
1	-0.690984	0.823191	-0.138721
2	2.840376	-0.522517	0.347104
3	-0.036683	0.767751	-0.646185
4	0.676305	-1.961409	-1.337382
5	1.752402	-1.192964	-0.057789
6	0.171615	0.554056	-1.322705

3.isnull

Returns an object of booleans, the same shape as the input, in which True marks a value that is missing (NA).

Code:

import numpy as np
from pandas import Series, DataFrame
import pandas as pd
df2=DataFrame([
    ['Tom',np.nan,456.67,'M'],
    ['Merry',34,345.56,np.nan],
    ['Gerry',np.nan,np.nan,np.nan],
    ['Jom',np.nan,456.67,'M'],
    ['Jone',18,35.12,'F']],
    columns=['name','age','salary','gender']
)
df2.isnull()

Result:

	name	age	salary	gender
0	False	True	False	False
1	False	False	False	True
2	False	True	True	True
3	False	True	False	False
4	False	False	False	False
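
A common follow-up is to summarize the boolean mask rather than read it cell by cell. The sketch below counts missing values per column and flags rows with any missing value; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame([
    ['Tom', np.nan, 456.67, 'M'],
    ['Merry', 34, 345.56, np.nan],
    ['Gerry', np.nan, np.nan, np.nan],
    ['Jom', np.nan, 456.67, 'M'],
    ['Jone', 18, 35.12, 'F']],
    columns=['name', 'age', 'salary', 'gender'])

# True counts as 1, so summing the mask counts the missing values per column
missing_per_column = df2.isnull().sum()

# any(axis=1) is True for each row that has at least one missing value
rows_with_missing = df2.isnull().any(axis=1)

print(missing_per_column)
print(rows_with_missing.tolist())
```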

4. notnull

The negation of isnull: True marks a value that is present.

Code:

import numpy as np
from pandas import Series, DataFrame
import pandas as pd
df2=DataFrame([
    ['Tom',np.nan,456.67,'M'],
    ['Merry',34,345.56,np.nan],
    ['Gerry',np.nan,np.nan,np.nan],
    ['Jom',np.nan,456.67,'M'],
    ['Jone',18,35.12,'F']],
    columns=['name','age','salary','gender']
)
df2.notnull()

Result:

name	age	salary	gender
0	True	False	True	True
1	True	True	True	False
2	True	False	False	False
3	True	False	True	True
4	True	True	True	True
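
The notnull mask of a single column can be used directly as a row filter. A minimal sketch on the same df2, with an illustrative variable name:

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame([
    ['Tom', np.nan, 456.67, 'M'],
    ['Merry', 34, 345.56, np.nan],
    ['Gerry', np.nan, np.nan, np.nan],
    ['Jom', np.nan, 456.67, 'M'],
    ['Jone', 18, 35.12, 'F']],
    columns=['name', 'age', 'salary', 'gender'])

# Keep only the rows whose age is present
has_age = df2[df2['age'].notnull()]

print(has_age['name'].tolist())               # ['Merry', 'Jone']
print(df2.notnull().equals(~df2.isnull()))    # True: notnull is the inverse of isnull
```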