import pandas as pd
from pandas import DataFrame
import numpy as np
DataFrame
A DataFrame is a tabular data structure that has both a row index (stored in index) and a column index (stored in columns).
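I. Common attributes of DataFrame objects:
- There are many ways to create a DataFrame (more on this later). The most common is to pass in a dict of equal-length lists or NumPy arrays: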
dict1={"Province":["Guangdong","Beijing","Qinghai","Fujiang"],
"year":[2018]*4,
"pop":[1.3,2.5,1.1,0.7]}
df1=DataFrame(dict1)
df1
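Output (this column order comes from an older pandas release that sorted dict keys alphabetically; recent versions keep the insertion order Province, year, pop):
| | Province | pop | year |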
---|---|---|---|
0 | Guangdong | 1.3 | 2018 |
1 | Beijing | 2.5 | 2018 |
2 | Qinghai | 1.1 | 2018 |
3 | Fujiang | 0.7 | 2018 |
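- As with Series, you can specify the column order and the row index at creation time (a requested column that is missing from the dict is filled with NaN):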
df2=DataFrame(dict1,columns=['year','Province','pop','debt'],index=['one','two','three','four'])
df2
Output:
| | year | Province | pop | debt |
---|---|---|---|---|
one | 2018 | Guangdong | 1.3 | NaN |
two | 2018 | Beijing | 2.5 | NaN |
three | 2018 | Qinghai | 1.1 | NaN |
four | 2018 | Fujiang | 0.7 | NaN |
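The same idea can be sketched with a dict of NumPy arrays instead of lists. The names data and df and the column labels are made up for illustration; this is a minimal sketch, not the only way to do it.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a dict of equal-length NumPy arrays.
data = {"x": np.array([1, 2, 3]), "y": np.array([0.5, 1.5, 2.5])}

# columns fixes the column order; "z" is not a key of data,
# so that whole column is filled with NaN.
df = pd.DataFrame(data, columns=["y", "x", "z"], index=["a", "b", "c"])

print(list(df.columns))       # ['y', 'x', 'z']
print(df["z"].isna().all())   # True
```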
- As with Series, the index and columns of a DataFrame each have a name attribute:
df2
Output:
| | year | Province | pop | debt |
---|---|---|---|---|
one | 2018 | Guangdong | 1.3 | NaN |
two | 2018 | Beijing | 2.5 | NaN |
three | 2018 | Qinghai | 1.1 | NaN |
four | 2018 | Fujiang | 0.7 | NaN |
df2.index.name='English'
df2.columns.name='Province'
df2
Output:
Province | year | Province | pop | debt |
---|---|---|---|---|
English | ||||
one | 2018 | Guangdong | 1.3 | NaN |
two | 2018 | Beijing | 2.5 | NaN |
three | 2018 | Qinghai | 1.1 | NaN |
four | 2018 | Fujiang | 0.7 | NaN |
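As a small sketch on a throwaway frame (the labels "row" and "col" are arbitrary), both names start out as None and can be read back after assignment:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2]}, index=["r1", "r2"])

# Neither axis has a name until one is assigned.
print(df.index.name, df.columns.name)   # None None

df.index.name = "row"
df.columns.name = "col"
print(df.index.name, df.columns.name)   # row col
```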
- The shape attribute gives the number of rows and columns of a DataFrame:
df2.shape
Output:
(4, 4)
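Since shape is an ordinary tuple, it can be unpacked directly. A minimal sketch with a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# shape is (number of rows, number of columns).
n_rows, n_cols = df.shape
print(n_rows, n_cols)   # 3 2

# len() of a DataFrame is its number of rows.
print(len(df))          # 3
```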
- The values attribute returns the DataFrame's data as a two-dimensional ndarray:
df2.values
Output:
array([[2018, 'Guangdong', 1.3, nan],
[2018, 'Beijing', 2.5, nan],
[2018, 'Qinghai', 1.1, nan],
[2018, 'Fujiang', 0.7, nan]], dtype=object)
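The dtype=object above appears because the columns hold different types, so NumPy needs one dtype that can hold all of them. The sketch below contrasts a mixed frame with an all-numeric one; the frame names are hypothetical.

```python
import pandas as pd

mixed = pd.DataFrame({"name": ["a", "b"], "score": [1.5, 2.5]})
numeric = pd.DataFrame({"x": [1, 2], "y": [0.5, 1.5]})

# Strings and floats have no common numeric type, so the array is object.
print(mixed.values.dtype)        # object

# Integers and floats are upcast to a common float type.
print(numeric.values.dtype)      # float64

# to_numpy() is the method form available in newer pandas versions.
print(numeric.to_numpy().shape)  # (2, 2)
```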
- Column labels are also available as attributes of the DataFrame object:
df2.Province
Output:
English
one Guangdong
two Beijing
three Qinghai
four Fujiang
Name: Province, dtype: object
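One caveat worth sketching: attribute access only works when the label is a valid Python identifier that does not clash with an existing DataFrame attribute or method. A column named pop, as in this example data, collides with the DataFrame.pop method, so bracket access is the safe choice there.

```python
import pandas as pd

df = pd.DataFrame({"Province": ["Guangdong", "Beijing"], "pop": [1.3, 2.5]})

# "Province" does not clash with anything, so both spellings agree.
print(df.Province.equals(df["Province"]))   # True

# df.pop is the DataFrame.pop method, not the "pop" column.
print(callable(df.pop))                     # True

# Bracket access always returns the column.
print(df["pop"].tolist())                   # [1.3, 2.5]
```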
II. Common ways to access, assign, and delete data in a DataFrame:
- DataFrame_object[ ] selects by column label: a single label returns a Series, while a list of labels returns a DataFrame:
df2['Province']
Output:
English
one Guangdong
two Beijing
three Qinghai
four Fujiang
Name: Province, dtype: object
df2[['Province','pop']]
Output:
Province | Province | pop |
---|---|---|
English | ||
one | Guangdong | 1.3 |
two | Beijing | 2.5 |
three | Qinghai | 1.1 |
four | Fujiang | 0.7 |
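The return type depends on whether a label or a list is passed, not on how many columns come back. A minimal sketch with a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# A bare label gives a one-dimensional Series.
print(type(df["a"]).__name__)     # Series

# A list, even with one label, gives a two-dimensional DataFrame.
print(type(df[["a"]]).__name__)   # DataFrame
print(df[["a"]].shape)            # (2, 1)
```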
- DataFrame_object.loc[ ] selects rows by row label (a label slice includes both endpoints):
df2.loc['one']
Output:
Province
year 2018
Province Guangdong
pop 1.3
debt NaN
Name: one, dtype: object
df2.loc['one':'three']
Output:
Province | year | Province | pop | debt |
---|---|---|---|---|
English | ||||
one | 2018 | Guangdong | 1.3 | NaN |
two | 2018 | Beijing | 2.5 | NaN |
three | 2018 | Qinghai | 1.1 | NaN |
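The slice above includes the 'three' row. For contrast, the sketch below puts a label slice next to a position slice made with iloc, which is not used elsewhere in this post and follows the usual Python rule of excluding the end position.

```python
import pandas as pd

df = pd.DataFrame({"v": [10, 20, 30, 40]},
                  index=["one", "two", "three", "four"])

# Label slice: both endpoints are included.
by_label = df.loc["one":"three"]
print(list(by_label.index))      # ['one', 'two', 'three']

# Position slice: the end position is excluded.
by_position = df.iloc[0:2]
print(list(by_position.index))   # ['one', 'two']
```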
- loc[ ] can also retrieve a single value, given a row label and a column label:
df2.loc['one','Province']
Output:
'Guangdong'
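The same row-and-column form also works on the left-hand side of an assignment. A short sketch with hypothetical data; at is a related accessor for single values that this post does not otherwise cover.

```python
import pandas as pd

df = pd.DataFrame({"v": [10, 20]}, index=["one", "two"])

# Read one cell by row label and column label.
print(df.loc["one", "v"])   # 10

# Write one cell the same way.
df.loc["one", "v"] = 99

# at is a scalar-only accessor that takes the same pair of labels.
print(df.at["one", "v"])    # 99
```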
- A DataFrame column can be modified by assignment (either a single value or an array of values):
df2["debt"]=np.arange(2,3,0.25)
df2
Output:
Province | year | Province | pop | debt |
---|---|---|---|---|
English | ||||
one | 2018 | Guangdong | 1.3 | 2.00 |
two | 2018 | Beijing | 2.5 | 2.25 |
three | 2018 | Qinghai | 1.1 | 2.50 |
four | 2018 | Fujiang | 0.7 | 2.75 |
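What is assigned matters: a scalar is broadcast to every row, an array must match the number of rows, and a Series is aligned on the row index, with unmatched rows left as NaN. A minimal sketch with made-up values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"pop": [1.3, 2.5, 1.1]}, index=["one", "two", "three"])

# A scalar is repeated for every row.
df["debt"] = 0.0

# An array is assigned by position and must have one value per row.
df["debt"] = np.arange(3)

# A Series is matched by index label; "two" has no match and becomes NaN.
s = pd.Series([9.5, 7.5], index=["three", "one"])
df["debt"] = s
print(df["debt"].tolist())   # [7.5, nan, 9.5]
```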
- Assigning to a column that does not exist creates a new column; a column can be removed with del:
df2['eastern']=df2.Province=='Guangdong'
df2
Output:
Province | year | Province | pop | debt | eastern |
---|---|---|---|---|---|
English | |||||
one | 2018 | Guangdong | 1.3 | 2.00 | True |
two | 2018 | Beijing | 2.5 | 2.25 | False |
three | 2018 | Qinghai | 1.1 | 2.50 | False |
four | 2018 | Fujiang | 0.7 | 2.75 | False |
del df2['eastern']
df2.columns
Output:
Index(['year', 'Province', 'pop', 'debt'], dtype='object', name='Province')
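del modifies the DataFrame in place. When the original should stay untouched, the drop method (not used above) returns a new DataFrame instead. A brief sketch with a hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# del removes the column from df itself.
del df["c"]

# drop returns a new DataFrame and leaves df as it was.
smaller = df.drop(columns=["b"])

print(list(df.columns))        # ['a', 'b']
print(list(smaller.columns))   # ['a']
```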
- A DataFrame can also be transposed:
df2.T
Output:
English | one | two | three | four |
---|---|---|---|---|
Province | ||||
year | 2018 | 2018 | 2018 | 2018 |
Province | Guangdong | Beijing | Qinghai | Fujiang |
pop | 1.3 | 2.5 | 1.1 | 0.7 |
debt | 2 | 2.25 | 2.5 | 2.75 |
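Transposing swaps the row labels and the column labels. The sketch below checks that on a small made-up frame; note that each transposed column now mixes values that came from different original columns.

```python
import pandas as pd

df = pd.DataFrame({"year": [2018, 2018], "name": ["x", "y"]},
                  index=["one", "two"])

t = df.T

# Old column labels become row labels, and vice versa.
print(list(t.index))     # ['year', 'name']
print(list(t.columns))   # ['one', 'two']
print(t.shape)           # (2, 2)
```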
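III. Other ways to create a DataFrame
- Calling DataFrame() converts data in many formats into a DataFrame object. Its three parameters data, index, and columns are the data, the row index, and the column index, respectively. data can be:
1 A two-dimensional array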
df3=pd.DataFrame(np.random.randint(0,10,(4,4)),index=[1,2,3,4],columns=['A','B','C','D'])
df3
Output:
| | A | B | C | D |
---|---|---|---|---|
1 | 9 | 8 | 4 | 6 |
2 | 5 | 7 | 7 | 4 |
3 | 6 | 3 | 0 | 2 |
4 | 4 | 6 | 9 | 8 |
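Because the values above are random, every run gives a different table. A reproducible variant can be sketched with a seeded generator; the seed 0 and the use of np.random.default_rng are choices made here, not part of the original example.

```python
import numpy as np
import pandas as pd

# A seeded generator gives the same integers on every run.
rng = np.random.default_rng(0)
arr = rng.integers(0, 10, size=(4, 4))   # values from 0 to 9

df = pd.DataFrame(arr, index=[1, 2, 3, 4], columns=list("ABCD"))

print(df.shape)           # (4, 4)
print(list(df.columns))   # ['A', 'B', 'C', 'D']
```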
2 A dict
The row index is determined by index, and the column index by the dict's keys.
dict1
Output:
{'Province': ['Guangdong', 'Beijing', 'Qinghai', 'Fujiang'],
'pop': [1.3, 2.5, 1.1, 0.7],
'year': [2018, 2018, 2018, 2018]}
df4=pd.DataFrame(dict1,index=[1,2,3,4])
df4
Output:
| | Province | pop | year |
---|---|---|---|
1 | Guangdong | 1.3 | 2018 |
2 | Beijing | 2.5 | 2018 |
3 | Qinghai | 1.1 | 2018 |
4 | Fujiang | 0.7 | 2018 |
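One detail worth a sketch: with a dict of lists, index simply labels the rows, but with a dict of Series, which already carry their own labels, index selects and reorders by label instead. The names below are hypothetical.

```python
import pandas as pd

# Dict of lists: index labels the rows in order.
from_lists = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["x", "y"])
print(from_lists.loc["x", "a"])    # 1

# Dict of Series: index picks rows by label; "z" is unknown, so it is NaN.
s = pd.Series([1, 2], index=["x", "y"])
from_series = pd.DataFrame({"a": s}, index=["y", "z"])
print(from_series["a"].tolist())   # [2.0, nan]
```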
3 A structured array
The column index is determined by the field names of the structured array.
arr=np.array([('item1',10),('item2',20),('item3',30),('item4',40)],dtype=[("name","S10"),("count",int)])
df5=pd.DataFrame(arr)
df5
Output:
| | name | count |
---|---|---|
0 | b'item1' | 10 |
1 | b'item2' | 20 |
2 | b'item3' | 30 |
3 | b'item4' | 40 |
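The b'...' prefix shows that the name field holds bytes rather than text. If ordinary strings are wanted, one option is to decode the column afterwards. This is a sketch that assumes the bytes are UTF-8 encoded:

```python
import numpy as np
import pandas as pd

# A structured array with a 10-byte string field and an integer field.
arr = np.array([(b"item1", 10), (b"item2", 20)],
               dtype=[("name", "S10"), ("count", int)])

df = pd.DataFrame(arr)

# Field names become column labels.
print(list(df.columns))   # ['name', 'count']

# Decode the bytes column into text (UTF-8 is an assumption here).
df["name"] = df["name"].str.decode("utf-8")
```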
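- In addition, class methods whose names start with from_ convert specific kinds of data into a DataFrame object. For example, from_dict() takes an orient parameter that specifies which axis the dict keys map to; the default is "columns":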
dict2={"a":[1,2,3],"b":[4,5,6]}
df6=pd.DataFrame.from_dict(dict2)
df6
Output:
| | a | b |
---|---|---|
0 | 1 | 4 |
1 | 2 | 5 |
2 | 3 | 6 |
df7=pd.DataFrame.from_dict(dict2,orient="index")
df7
Output:
| | 0 | 1 | 2 |
---|---|---|---|
a | 1 | 2 | 3 |
b | 4 | 5 | 6 |
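With orient="index" the dict keys become row labels and the columns default to 0, 1, 2. from_dict() also accepts a columns argument in that mode to name them; the labels "x", "y", "z" below are made up for this sketch.

```python
import pandas as pd

d = {"a": [1, 2, 3], "b": [4, 5, 6]}

# Default orient="columns": keys become column labels.
by_cols = pd.DataFrame.from_dict(d)
print(by_cols.shape)           # (3, 2)

# orient="index": keys become row labels; columns names the fields.
by_rows = pd.DataFrame.from_dict(d, orient="index", columns=["x", "y", "z"])
print(by_rows.shape)           # (2, 3)
print(by_rows.loc["b", "y"])   # 5
```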
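IV. Converting a DataFrame object to other data formats
- The to_dict() method converts a DataFrame object to a dict; the orient parameter determines the type of the dict's elements: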
df7.to_dict()
Output:
{0: {'a': 1, 'b': 4}, 1: {'a': 2, 'b': 5}, 2: {'a': 3, 'b': 6}}
df7.to_dict(orient="records")
Output:
[{0: 1, 1: 2, 2: 3}, {0: 4, 1: 5, 2: 6}]
df7.to_dict(orient="list")
Output:
{0: [1, 4], 1: [2, 5], 2: [3, 6]}
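The three orientations are easier to compare on a frame with string labels on both axes. A minimal sketch with hypothetical labels:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["r1", "r2"])

# Default: {column: {row label: value}}
print(df.to_dict())
# {'a': {'r1': 1, 'r2': 2}, 'b': {'r1': 3, 'r2': 4}}

# "list": {column: [values]}; row labels are dropped.
print(df.to_dict(orient="list"))
# {'a': [1, 2], 'b': [3, 4]}

# "records": one {column: value} dict per row; row labels are dropped.
print(df.to_dict(orient="records"))
# [{'a': 1, 'b': 3}, {'a': 2, 'b': 4}]
```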
- Similar methods include to_records(), to_csv(), and others.
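A brief sketch of those two methods on a made-up frame: to_records() returns a NumPy record array that includes the index as a field, and to_csv() returns the CSV text as a string when no file path is given.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["r1", "r2"])

# The unnamed index is stored in a field called "index".
rec = df.to_records()
print(rec.dtype.names)   # ('index', 'a', 'b')

# With no path argument, to_csv() returns the text instead of writing a file.
csv_text = df.to_csv()
print(csv_text)
```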
Thank you for reading.
I hope this write-up is helpful to you.
Let's keep improving together!