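[Lesson 2.5] Pandas data structure DataFrame: basic concepts and creation
The "two-dimensional array" DataFrame is a tabular data structure made up of an ordered set of columns; each column can hold a different value type (numeric, string, boolean, and so on).
The data in a DataFrame is stored as one or more two-dimensional blocks, not as a list, dict, or one-dimensional array.
1. DataFrame data structure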
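import numpy as np
import pandas as pd

# A DataFrame is a tabular data structure: a "labeled two-dimensional array".
# A DataFrame has an index (row labels) and columns (column labels).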
data = {'name': ['Jack', 'Tom', 'Mary'],
        'age': [18, 19, 20],
        'gender': ['m', 'm', 'w']}
frame = pd.DataFrame(data)
print(frame)
print(type(frame))
print(frame.index, '\nType:', type(frame.index))
print(frame.columns, '\nType:', type(frame.columns))
print(frame.values, '\nType:', type(frame.values))
# Printing the frame shows the data; its type is DataFrame
# .index returns the row labels
# .columns returns the column labels
# .values returns the values as an ndarray
# Note: the output below comes from an older pandas release, which sorted dict keys alphabetically;
# recent releases keep insertion order (name, age, gender) and expose the index classes under pandas.core.indexes
-----------------------------------------------------------------------
   age gender  name
0   18      m  Jack
1   19      m   Tom
2   20      w  Mary
<class 'pandas.core.frame.DataFrame'>
RangeIndex(start=0, stop=3, step=1)
Type: <class 'pandas.indexes.range.RangeIndex'>
Index(['age', 'gender', 'name'], dtype='object')
Type: <class 'pandas.indexes.base.Index'>
[[18 'm' 'Jack']
 [19 'm' 'Tom']
 [20 'w' 'Mary']]
Type: <class 'numpy.ndarray'>
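The quotes around 'm' and 'Jack' in the printed array hint at something worth checking: each column keeps its own dtype inside the DataFrame, but .values has to fit everything into one array. The sketch below reuses the same sample data; the variable names values and age_values are only illustrative.

```python
import pandas as pd

data = {'name': ['Jack', 'Tom', 'Mary'],
        'age': [18, 19, 20],
        'gender': ['m', 'm', 'w']}
frame = pd.DataFrame(data)

# Inside the DataFrame every column has its own dtype
print(frame.dtypes)

# .values must use one common dtype for the whole 2-D array;
# mixing strings and integers forces the generic object dtype
values = frame.values
print(values.dtype)
print(values.shape)

# Selecting only the numeric column keeps a numeric dtype
age_values = frame[['age']].values
print(age_values.dtype)
```

If the array is meant for numeric work, selecting the numeric columns first avoids ending up with an object array.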
2. DataFrame creation method 1: from a dict of arrays/lists
# Constructor: pandas.DataFrame()
data1 = {'a': [1, 2, 3],
         'b': [3, 4, 5],
         'c': [5, 6, 7]}
data2 = {'one': np.random.rand(3),
         'two': np.random.rand(3)}   # what happens if you try 'two': np.random.rand(4) here?
print(data1)
print(data2)
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
print(df1)
print(df2)
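# Creating a DataFrame from a dict of arrays/lists: the dict keys become the columns, and the index defaults to integer labels
# All dict values must have the same length!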
df1 = pd.DataFrame(data1, columns = ['b','c','a','d'])
print(df1)
df1 = pd.DataFrame(data1, columns = ['b','c'])
print(df1)
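# columns parameter: a list that sets the column order; a name that is not in the data (such as 'd') produces a column of NaN
# The columns list may also contain fewer columns than the original data
df2 = pd.DataFrame(data2, index = ['f1','f2','f3'])   # what happens if you try index = ['f1','f2','f3','f4'] here?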
print(df2)
# index parameter: a list that sets the row labels; its length must match the data
-----------------------------------------------------------------------
{'a': [1, 2, 3], 'c': [5, 6, 7], 'b': [3, 4, 5]}
{'one': array([ 0.00101091, 0.08807153, 0.58345056]), 'two': array([ 0.49774634, 0.16782565, 0.76443489])}
   a  b  c
0  1  3  5
1  2  4  6
2  3  5  7
        one       two
0  0.001011  0.497746
1  0.088072  0.167826
2  0.583451  0.764435
   b  c  a    d
0  3  5  1  NaN
1  4  6  2  NaN
2  5  7  3  NaN
   b  c
0  3  5
1  4  6
2  5  7
         one       two
f1  0.001011  0.497746
f2  0.088072  0.167826
f3  0.583451  0.764435
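The two "what happens if" questions in the code comments can be tried safely inside try/except. This is a minimal sketch; bad_data, good_data, length_error and index_error are names made up for the illustration, and the exact error wording depends on the pandas version, so only the exception type is relied on.

```python
import numpy as np
import pandas as pd

# Case 1: dict values of different lengths cannot form columns
bad_data = {'one': np.random.rand(3),
            'two': np.random.rand(4)}
length_error = None
try:
    pd.DataFrame(bad_data)
except ValueError as err:
    length_error = err
    print('Unequal column lengths:', err)

# Case 2: an index with more labels than there are rows is rejected too
good_data = {'one': np.random.rand(3),
             'two': np.random.rand(3)}
index_error = None
try:
    pd.DataFrame(good_data, index=['f1', 'f2', 'f3', 'f4'])
except ValueError as err:
    index_error = err
    print('Index length mismatch:', err)
```

In both cases pandas refuses to build the frame rather than padding the shorter side with NaN.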
DataFrame creation method 2: from a dict of Series
data1 = {'one': pd.Series(np.random.rand(2)),
         'two': pd.Series(np.random.rand(3))}   # Series without an explicit index
data2 = {'one': pd.Series(np.random.rand(2), index = ['a','b']),
         'two': pd.Series(np.random.rand(3), index = ['a','b','c'])}   # Series with an explicit index
print(data1)
print(data2)
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
print(df1)
print(df2)
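# Creating a DataFrame from a dict of Series: the dict keys become the columns, and the index comes from the Series labels (default integer labels if the Series has none)
# The Series may differ in length; the missing positions in the resulting DataFrame are filled with NaN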
-----------------------------------------------------------------------
{'one': 0    0.892580
1    0.834076
dtype: float64, 'two': 0    0.301309
1    0.977709
2    0.489000
dtype: float64}
{'one': a    0.470947
b    0.584577
dtype: float64, 'two': a    0.122659
b    0.136429
c    0.396825
dtype: float64}
        one       two
0  0.892580  0.301309
1  0.834076  0.977709
2       NaN  0.489000
        one       two
a  0.470947  0.122659
b  0.584577  0.136429
c       NaN  0.396825
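The NaN values above come from label alignment, not from position. A small sketch with fixed values makes this easier to see; the Series s1 and s2 and their labels are invented for the illustration and overlap only on 'b'.

```python
import pandas as pd

# Two Series whose labels only partly overlap
s1 = pd.Series([1, 2], index=['a', 'b'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

df = pd.DataFrame({'one': s1, 'two': s2})
print(df)

# The row index is the union of both sets of labels
print(list(df.index))

# Values are matched by label: row 'b' holds 2 from s1 and 10 from s2
print(df.loc['b'])

# Labels missing from a Series become NaN, which turns integer data into float64
print(df.dtypes)
```

Because matching is by label, the order in which the Series were built does not matter, only their index values.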
3. DataFrame creation method 3: directly from a two-dimensional array
ar = np.random.rand(9).reshape(3,3)
print(ar)
df1 = pd.DataFrame(ar)
df2 = pd.DataFrame(ar, index = ['a', 'b', 'c'], columns = ['one','two','three'])   # try an index or columns list whose length does not match the array
print(df1)
print(df2)
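# Creating a DataFrame directly from a two-dimensional array gives a result of the same shape; if index and columns are not specified, both default to integer labels
# The lengths of index and columns must match the shape of the original array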
-----------------------------------------------------------------------
[[ 0.54492282  0.28956161  0.46592269]
 [ 0.30480674  0.12917132  0.38757672]
 [ 0.2518185   0.13544544  0.13930429]]
          0         1         2
0  0.544923  0.289562  0.465923
1  0.304807  0.129171  0.387577
2  0.251819  0.135445  0.139304
        one       two     three
a  0.544923  0.289562  0.465923
b  0.304807  0.129171  0.387577
c  0.251819  0.135445  0.139304
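To see what a label list of the wrong length does, the following sketch passes two column names for a three-column array. It uses np.arange instead of random numbers so the result is repeatable; shape_error is just a helper name for the example.

```python
import numpy as np
import pandas as pd

ar = np.arange(9).reshape(3, 3)   # fixed 3x3 array for illustration

# Matching label counts: the DataFrame has the same shape as the array
df = pd.DataFrame(ar, index=['a', 'b', 'c'], columns=['one', 'two', 'three'])
print(df.shape == ar.shape)

# Only two column labels for three columns: pandas raises a ValueError
shape_error = None
try:
    pd.DataFrame(ar, index=['a', 'b', 'c'], columns=['one', 'two'])
except ValueError as err:
    shape_error = err
    print('Label count mismatch:', err)
```

Unlike the dict-based constructor, columns cannot be used here to drop or add columns, because an array carries no names to select from.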
4. DataFrame creation method 4: from a list of dicts
data = [{'one': 1, 'two': 2}, {'one': 5, 'two': 10, 'three': 20}]
print(data)
df1 = pd.DataFrame(data)
df2 = pd.DataFrame(data, index = ['a','b'])
df3 = pd.DataFrame(data, columns = ['one','two'])
print(df1)
print(df2)
print(df3)
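# Creating a DataFrame from a list of dicts: each dict becomes one row, the dict keys become the columns, and the index defaults to integer labels if not specified
# The columns and index parameters set the column and row labels respectively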
-----------------------------------------------------------------------
[{'one': 1, 'two': 2}, {'one': 5, 'three': 20, 'two': 10}]
   one  three  two
0    1    NaN    2
1    5   20.0   10
   one  three  two
a    1    NaN    2
b    5   20.0   10
   one  two
0    1    2
1    5   10
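The value 20 shows up as 20.0 in the output because the first dict has no 'three' key: the gap becomes NaN, and NaN is a float. The sketch below checks this and shows one possible way to get integers back; filling with 0 is an assumption made for the example, not something the lesson prescribes.

```python
import pandas as pd

rows = [{'one': 1, 'two': 2}, {'one': 5, 'two': 10, 'three': 20}]
df = pd.DataFrame(rows)

# 'one' and 'two' stay integer; 'three' is float64 because it contains NaN
print(df.dtypes)
print(df['three'].isna().tolist())

# Assumed choice for the example: treat the missing value as 0, then cast back to int
filled = df.fillna({'three': 0}).astype({'three': 'int64'})
print(filled)
```

Whether 0 is a sensible replacement depends on the data; leaving the NaN in place is often the safer default.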
5. DataFrame creation method 5: from a dict of dicts
data = {'Jack': {'math': 90, 'english': 89, 'art': 78},
        'Marry': {'math': 82, 'english': 95, 'art': 92},
        'Tom': {'math': 78, 'english': 67}}
df1 = pd.DataFrame(data)
print(df1)
# Creating a DataFrame from a dict of dicts: the outer keys become the columns, and the inner-dict keys become the index
df2 = pd.DataFrame(data, columns = ['Jack','Tom','Bob'])
df3 = pd.DataFrame(data, index = ['a','b','c'])
print(df2)
print(df3)
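# The columns parameter can add or drop columns; a new column name is filled with NaN
# Here index behaves differently from before: it does not relabel the existing rows but selects among the inner-dict keys, so labels that do not exist there yield NaN (very important!)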
-----------------------------------------------------------------------
         Jack  Marry   Tom
art        78     92   NaN
english    89     95  67.0
math       90     82  78.0
         Jack   Tom  Bob
art        78   NaN  NaN
english    89  67.0  NaN
math       90  78.0  NaN
   Jack  Marry  Tom
a   NaN    NaN  NaN
b   NaN    NaN  NaN
c   NaN    NaN  NaN
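Since index only selects among the inner keys here, the all-NaN frame above is expected. The sketch below shows the selecting behaviour with labels that do exist, plus one possible way to actually relabel the rows afterwards; the a/b/c mapping passed to rename is chosen purely for illustration.

```python
import pandas as pd

data = {'Jack': {'math': 90, 'english': 89, 'art': 78},
        'Marry': {'math': 82, 'english': 95, 'art': 92},
        'Tom': {'math': 78, 'english': 67}}

# index picks rows from the inner-dict keys; Tom has no 'art', so that cell is NaN
subset = pd.DataFrame(data, index=['math', 'art'])
print(subset)

# To give the rows new labels, build the frame first and rename afterwards
relabelled = pd.DataFrame(data).rename(index={'math': 'a', 'english': 'b', 'art': 'c'})
print(relabelled)
```

The same distinction applies to columns: it selects outer keys, which is why 'Bob' produced a NaN column.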