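Overview: NumPy + SciPy + pandas + matplotlib
Basic pandas functionality
I.Reading data files / text data
1.pandas: reading data files
The read_xxx family of functions in pandas reads data from a file into a DataFrame. The most commonly used one is read_csv, which reads delimited text data.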
help(pd.read_csv)
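As a minimal sketch of how read_csv turns delimited text into a DataFrame, the example below parses an in-memory string instead of a file on disk. The use of io.StringIO and the two sample rows are assumptions made purely for illustration.

```python
import io
import pandas as pd

# Hypothetical sample data held in memory; a real workflow would pass a file path.
csv_text = "name,age\nPeter,18\nTom,21\n"

# read_csv accepts any file-like object, so StringIO stands in for a file here.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # column names taken from the first line
```

The first line of the text is used as the header by default, which is why no column names need to be passed here.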
2.Reading a csv file
First create a file named data1.csv with the following contents:
name,age,source
Peter,18,98.5
Tom,21,78.2
Bob,24,98.5
Wangdachui,20,89.2
Jupyter notebook code:
import numpy as np
from pandas import Series, DataFrame
import pandas as pd
# Read the csv file
df=pd.read_csv("data1.csv")
df
3.Reading a txt file
First create a file named data01.txt with the following contents:
王大錘;18;100;99;98
王大錘;18;100;99;98
王大錘;18;100;99;98
Jupyter notebook code:
import numpy as np
from pandas import Series, DataFrame
import pandas as pd
# Read the text data, using ";" as the separator and treating no line as a header
df=pd.read_csv("data01.txt",sep=';',header=None)
df
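With header=None the columns are labeled 0, 1, 2, and so on. One possible alternative, sketched below, is to pass the labels at read time through the names parameter. The English column labels and the in-memory text are assumptions for illustration; they are not part of the original file.

```python
import io
import pandas as pd

# Same layout as data01.txt, held in memory for illustration.
text = "王大錘;18;100;99;98\n王大錘;18;100;99;98\n"

# Hypothetical English labels for the five fields.
cols = ["name", "age", "chinese", "math", "english"]

# When `names` is given and `header` is not, pandas treats the first line
# as data rather than as a header row.
df_named = pd.read_csv(io.StringIO(text), sep=";", names=cols)
print(df_named)
```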
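II.Indexing, selection, and data filtering
pandas: filtering and retrieving data
DataFrame methods let you pull specific columns or rows into a new DataFrame, which makes later statistical calculations easier.
NaN values in a DataFrame/Series are usually handled either by dropping the affected rows/columns or by filling in a default value.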
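The two most common ways of deriving a new DataFrame, selecting columns by label and filtering rows by condition, can be sketched as follows. The small table and the threshold of 90 are hypothetical values chosen only for this illustration.

```python
import pandas as pd

# Hypothetical scores table used only for this illustration.
scores = pd.DataFrame({
    "name": ["Peter", "Tom", "Bob"],
    "age": [18, 21, 24],
    "score": [98.5, 78.2, 98.5],
})

# Select a subset of columns by passing a list of labels.
subset = scores[["name", "score"]]

# Keep only the rows whose score exceeds 90 (boolean mask).
high = scores[scores["score"] > 90]

print(subset)
print(high)
```

Both operations return new DataFrame objects and leave the original untouched.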
1.Setting the header (column names)
Create a file named data01.txt with the following contents:
王大錘;18;100;99;98
王大錘;18;100;99;98
王大錘;18;100;99;98
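Jupyter notebook code:
import numpy as np
from pandas import Series, DataFrame
import pandas as pd
# Read the file without a header row, then assign the column names
df=pd.read_csv("data01.txt",sep=';',header=None)
columns=['name','age',u'語文',u'數學',u'英語']  # 語文/數學/英語: Chinese, math, and English scores
df.columns=columns
df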
2.Selecting columns with a slice
Create a file named data01.txt with the following contents:
王大錘;18;100;99;98
王大錘;18;100;99;98
王大錘;18;100;99;98
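Jupyter notebook code:
import numpy as np
from pandas import Series, DataFrame
import pandas as pd
df=pd.read_csv("data01.txt",sep=';',header=None)
columns=['name','age',u'語文',u'數學',u'英語']
df.columns=columns
# Keep only the columns from position 2 onward (the three score columns)
df=df[columns[2:]]
df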
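Slicing the list of column names is one way to pick a range of columns. A comparable result can be obtained with iloc (by position) or loc (by label), as in the sketch below. The sample rows and English column labels are assumptions for illustration.

```python
import pandas as pd

# Hypothetical frame with the same five-field layout as data01.txt.
frame = pd.DataFrame(
    [["A", 18, 100, 99, 98], ["B", 19, 90, 80, 70]],
    columns=["name", "age", "chinese", "math", "english"],
)

# Position-based: every row, columns from position 2 onward.
by_position = frame.iloc[:, 2:]

# Label-based: loc slices include both endpoints.
by_label = frame.loc[:, "chinese":"english"]

print(by_position.equals(by_label))
```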
III.pandas: handling missing values (NaN)
For NaN in a DataFrame/Series, the usual approach is to drop the corresponding rows/columns or to fill in a default value.
Code:
import numpy as np
from pandas import Series, DataFrame
import pandas as pd
df2=DataFrame([
['Tom',np.nan,456.67,'M'],
['Merry',34,345.56,np.nan],
['Gerry',np.nan,np.nan,np.nan],
['Jom',np.nan,456.67,'M'],
['Jone',18,35.12,'F']],
columns=['name','age','salary','gender']
)
df2
Result:
name age salary gender
0 Tom NaN 456.67 M
1 Merry 34.0 345.56 NaN
2 Gerry NaN NaN NaN
3 Jom NaN 456.67 M
4 Jone 18.0 35.12 F
1.dropna
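Filters (drops) axis labels depending on whether their values contain missing data; the thresh parameter adjusts how many missing values are tolerated.
1.dropna()
Code: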
import numpy as np
from pandas import Series, DataFrame
import pandas as pd
df2=DataFrame([
['Tom',np.nan,456.67,'M'],
['Merry',34,345.56,np.nan],
['Gerry',np.nan,np.nan,np.nan],
['Jom',np.nan,456.67,'M'],
['Jone',18,35.12,'F']],
columns=['name','age','salary','gender']
)
df2.dropna()# by default, drops every row that contains at least one missing value (NaN)
Result:
name age salary gender
4 Jone 18.0 35.12 F
2.dropna(how='all')
Code:
import numpy as np
from pandas import Series, DataFrame
import pandas as pd
df2=DataFrame([
['Tom',np.nan,456.67,'M'],
['Merry',34,345.56,np.nan],
['Gerry',np.nan,np.nan,np.nan],
['Jom',np.nan,456.67,'M'],
['Jone',18,35.12,'F']],
columns=['name','age','salary','gender']
)
df2.dropna(how='all')# drops only the rows in which every value is missing
Result:
name age salary gender
0 Tom NaN 456.67 M
1 Merry 34.0 345.56 NaN
2 Gerry NaN NaN NaN
3 Jom NaN 456.67 M
4 Jone 18.0 35.12 F
3.dropna(axis=1)
Code:
import numpy as np
from pandas import Series, DataFrame
import pandas as pd
df2=DataFrame([
['Tom',np.nan,456.67,'M'],
['Merry',34,345.56,np.nan],
['Gerry',np.nan,np.nan,np.nan],
['Jom',np.nan,456.67,'M'],
['Jone',18,35.12,'F']],
columns=['name','age','salary','gender']
)
df2.dropna(axis=1)# drops every column that contains a missing value
Result:
name
0 Tom
1 Merry
2 Gerry
3 Jom
4 Jone
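The tolerance threshold mentioned in the dropna description corresponds to the thresh parameter, which keeps a row only if it has at least that many non-missing values. The sketch below reuses the same df2 data; the value thresh=3 is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(
    [["Tom", np.nan, 456.67, "M"],
     ["Merry", 34, 345.56, np.nan],
     ["Gerry", np.nan, np.nan, np.nan],
     ["Jom", np.nan, 456.67, "M"],
     ["Jone", 18, 35.12, "F"]],
    columns=["name", "age", "salary", "gender"],
)

# Keep rows that have at least 3 non-missing values.
# Only the Gerry row (1 non-missing value) falls below the threshold.
kept = df2.dropna(thresh=3)
print(kept)
```

This sits between the default behavior (drop on any NaN) and how='all' (drop only when everything is NaN).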
2.fillna
Fills missing data with a specified value or with a fill method such as ffill or bfill.
Code:
import numpy as np
from pandas import Series, DataFrame
import pandas as pd
df=DataFrame(np.random.randn(7,3))
df.loc[:4,1]=np.nan
df.loc[:2,2]=np.nan
df
Result:
0 1 2
0 -0.031422 NaN NaN
1 -0.916141 NaN NaN
2 0.427765 NaN NaN
3 0.242490 NaN 0.200289
4 -0.214651 NaN 0.533594
5 0.302438 -0.228859 -0.883538
6 -0.356205 0.154669 -0.448864
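The ffill and bfill methods mentioned above can be sketched on a small frame with fixed values, so the outcome does not depend on random numbers. The frame below is a hypothetical example. It calls the DataFrame.ffill and DataFrame.bfill methods directly, since recent pandas versions deprecate passing method= to fillna.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps at known positions.
gaps = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [np.nan, 2.0, np.nan, np.nan],
})

# Forward fill: carry the last valid value downward.
forward = gaps.ffill()

# Backward fill: pull the next valid value upward.
backward = gaps.bfill()

print(forward)
print(backward)
```

Values with no valid neighbor in the fill direction stay NaN, for example the first entry of column b under forward fill.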
1.fillna(0)
Code:
import numpy as np
from pandas import Series, DataFrame
import pandas as pd
df=DataFrame(np.random.randn(7,3))
df.loc[:4,1]=np.nan
df.loc[:2,2]=np.nan
print(df.fillna(0))
print()
print(df.fillna('fill'))
Result:
0 1 2
0 -1.034991 0.000000 0.000000
1 -1.923518 0.000000 0.000000
2 -1.314832 0.000000 0.000000
3 0.167929 0.000000 -0.806852
4 0.900060 0.000000 1.443878
5 0.312364 0.222698 -1.000081
6 -0.291597 1.095243 0.678713
0 1 2
0 1.093552 fill fill
1 2.208307 fill fill
2 0.319327 fill fill
3 -0.525311 fill -0.11486
4 1.539547 fill -1.30771
5 -3.682927 0.588243 1.15384
6 0.255435 -2.06252 -0.808872
2.fillna({1:0.5,2:-1,3:1})
Code:
import numpy as np
from pandas import Series, DataFrame
import pandas as pd
df=DataFrame(np.random.randn(7,3))
df.loc[:4,1]=np.nan
df.loc[:2,2]=np.nan
df.fillna({1:0.5,2:-1,3:1})
Result (the NaN entries in column 1 become 0.5 and those in column 2 become -1; key 3 matches no column and is ignored):
0 1 2
0 1.317635 0.500000 -1.000000
1 -0.690984 0.500000 -1.000000
2 2.840376 0.500000 -1.000000
3 -0.036683 0.500000 -0.646185
4 0.676305 0.500000 -1.337382
5 1.752402 -1.192964 -0.057789
6 0.171615 0.554056 -1.322705
3.isnull
Returns an object of boolean values that indicate which values are missing (NA).
Code:
import numpy as np
from pandas import Series, DataFrame
import pandas as pd
df2=DataFrame([
['Tom',np.nan,456.67,'M'],
['Merry',34,345.56,np.nan],
['Gerry',np.nan,np.nan,np.nan],
['Jom',np.nan,456.67,'M'],
['Jone',18,35.12,'F']],
columns=['name','age','salary','gender']
)
df2.isnull()
Result:
name age salary gender
0 False True False False
1 False False False True
2 False True True True
3 False True False False
4 False False False False
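Because True counts as 1 when summed, the boolean frame returned by isnull can be reduced to a count of missing values per column or per row. A minimal sketch with the same df2 data:

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(
    [["Tom", np.nan, 456.67, "M"],
     ["Merry", 34, 345.56, np.nan],
     ["Gerry", np.nan, np.nan, np.nan],
     ["Jom", np.nan, 456.67, "M"],
     ["Jone", 18, 35.12, "F"]],
    columns=["name", "age", "salary", "gender"],
)

# Number of missing values in each column.
missing_per_column = df2.isnull().sum()

# Number of missing values in each row.
missing_per_row = df2.isnull().sum(axis=1)

print(missing_per_column)
print(missing_per_row)
```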
4.notnull
The negation of isnull.
Code:
import numpy as np
from pandas import Series, DataFrame
import pandas as pd
df2=DataFrame([
['Tom',np.nan,456.67,'M'],
['Merry',34,345.56,np.nan],
['Gerry',np.nan,np.nan,np.nan],
['Jom',np.nan,456.67,'M'],
['Jone',18,35.12,'F']],
columns=['name','age','salary','gender']
)
df2.notnull()
Result:
name age salary gender
0 True False True True
1 True True True False
2 True False False False
3 True False True True
4 True True True True
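A notnull mask on a single column can also be used to filter rows, which is one way to drop only the rows that lack a particular field. The sketch below keeps the rows of df2 that have an age; the choice of the age column is just an example.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(
    [["Tom", np.nan, 456.67, "M"],
     ["Merry", 34, 345.56, np.nan],
     ["Gerry", np.nan, np.nan, np.nan],
     ["Jom", np.nan, 456.67, "M"],
     ["Jone", 18, 35.12, "F"]],
    columns=["name", "age", "salary", "gender"],
)

# Boolean mask: True where age is present.
has_age = df2["age"].notnull()

# Keep only the rows where the mask is True.
with_age = df2[has_age]
print(with_age)
```

Unlike dropna(), this keeps the Merry row even though its gender is missing, because only the age column is checked.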