import pandas as pd
from pandas import DataFrame
import numpy as np
DataFrame
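A DataFrame is a tabular data structure that has both a row index (stored in index) and a column index (stored in columns).
I. Common attributes of DataFrame objects:
- There are many ways to create a DataFrame (covered later). The most common is to pass in a dictionary of equal-length lists or NumPy arrays: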
dict1={"Province":["Guangdong","Beijing","Qinghai","Fujiang"],
"year":[2018]*4,
"pop":[1.3,2.5,1.1,0.7]}
df1=DataFrame(dict1)
df1
Output:
| | Province | pop | year |
|---|---|---|---|
| 0 | Guangdong | 1.3 | 2018 |
| 1 | Beijing | 2.5 | 2018 |
| 2 | Qinghai | 1.1 | 2018 |
| 3 | Fujiang | 0.7 | 2018 |
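The bullet above mentions NumPy arrays as well as lists, so here is a minimal sketch that mixes the two. The variable names (`data`, `demo`) are made up for illustration. One thing to be aware of: recent pandas versions on Python 3.7+ keep the dictionary's insertion order for the columns, so the column order can differ from the output shown above, which appears to come from an older setup that sorted the keys.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a dict that mixes an equal-length list and NumPy arrays.
data = {
    "Province": ["Guangdong", "Beijing", "Qinghai", "Fujiang"],
    "year": np.full(4, 2018),
    "pop": np.array([1.3, 2.5, 1.1, 0.7]),
}
demo = pd.DataFrame(data)

row_labels = list(demo.index)       # default integer row labels
column_labels = list(demo.columns)  # column labels come from the dict keys
print(row_labels)
print(column_labels)
```

All values in the dictionary need the same length; otherwise the constructor raises a ValueError.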
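- As with Series, you can specify the column order and the row labels at creation time (a column that is missing from the dictionary is filled with NaN):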
df2=DataFrame(dict1,columns=['year','Province','pop','debt'],index=['one','two','three','four'])
df2
Output:
| | year | Province | pop | debt |
|---|---|---|---|---|
| one | 2018 | Guangdong | 1.3 | NaN |
| two | 2018 | Beijing | 2.5 | NaN |
| three | 2018 | Qinghai | 1.1 | NaN |
| four | 2018 | Fujiang | 0.7 | NaN |
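To check the NaN behaviour for yourself, a small sketch along these lines may help. The two-row dictionary is a shortened, hypothetical version of the data above.

```python
import pandas as pd

data = {"Province": ["Guangdong", "Beijing"], "pop": [1.3, 2.5]}

# "debt" is not a key of the dict, so the whole column is filled with NaN.
demo = pd.DataFrame(
    data,
    columns=["Province", "pop", "debt"],
    index=["one", "two"],
)

all_missing = bool(demo["debt"].isna().all())
print(all_missing)
```

The index list has to have as many entries as the lists in the dictionary.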
- As with Series, the index and columns of a DataFrame each have a name attribute:
df2
Output:
| | year | Province | pop | debt |
|---|---|---|---|---|
| one | 2018 | Guangdong | 1.3 | NaN |
| two | 2018 | Beijing | 2.5 | NaN |
| three | 2018 | Qinghai | 1.1 | NaN |
| four | 2018 | Fujiang | 0.7 | NaN |
df2.index.name='English'
df2.columns.name='Province'
df2
Output:
| Province | year | Province | pop | debt |
|---|---|---|---|---|
| English | | | | |
| one | 2018 | Guangdong | 1.3 | NaN |
| two | 2018 | Beijing | 2.5 | NaN |
| three | 2018 | Qinghai | 1.1 | NaN |
| four | 2018 | Fujiang | 0.7 | NaN |
- Use the shape attribute to get the number of rows and columns of a DataFrame:
df2.shape
Output:
(4, 4)
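- The values attribute returns the DataFrame's data as a two-dimensional ndarray (when the columns have mixed types, the array's dtype is object):
df2.values
Output: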
array([[2018, 'Guangdong', 1.3, nan],
[2018, 'Beijing', 2.5, nan],
[2018, 'Qinghai', 1.1, nan],
[2018, 'Fujiang', 0.7, nan]], dtype=object)
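The dtype=object in the output above comes from mixing numbers and strings in one array. The sketch below compares a mixed frame with an all-numeric one. It uses to_numpy(), which newer pandas documentation recommends over values. The frames `mixed` and `numeric` are made up for this example.

```python
import pandas as pd

mixed = pd.DataFrame({"year": [2018, 2018], "Province": ["Guangdong", "Beijing"]})
numeric = pd.DataFrame({"year": [2018, 2018], "pop": [1.3, 2.5]})

# Numbers plus strings: the only common type is the generic object dtype.
mixed_dtype = mixed.to_numpy().dtype

# Integers plus floats: everything is upcast to a floating-point dtype.
numeric_dtype = numeric.to_numpy().dtype
print(mixed_dtype, numeric_dtype)
```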
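- A column can also be accessed as an attribute of the DataFrame object, provided its label is a valid Python identifier and does not clash with an existing DataFrame attribute or method:
df2.Province
Output: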
English
one Guangdong
two Beijing
three Qinghai
four Fujiang
Name: Province, dtype: object
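Attribute access is convenient but has a pitfall that this very dataset runs into: the column named pop shares its name with the DataFrame.pop method. A minimal sketch, with a made-up two-row frame:

```python
import pandas as pd

demo = pd.DataFrame({"Province": ["Guangdong", "Beijing"], "pop": [1.3, 2.5]})

# "Province" has no name clash, so attribute and bracket access agree.
same = demo.Province.equals(demo["Province"])

# "pop" is also the name of the DataFrame.pop method, so attribute access
# returns the bound method rather than the column.
attr_is_method = callable(demo.pop)
col_is_series = isinstance(demo["pop"], pd.Series)
print(same, attr_is_method, col_is_series)
```

Bracket access works for every column label, which is why it is usually the safer habit.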
II. Common ways to access, assign, and delete data in a DataFrame:
- DataFrame_object[ ] selects by column label. A single label returns a Series; a list of labels returns a DataFrame:
df2['Province']
Output:
English
one Guangdong
two Beijing
three Qinghai
four Fujiang
Name: Province, dtype: object
df2[['Province','pop']]
Output:
| Province | Province | pop |
|---|---|---|
| English | | |
| one | Guangdong | 1.3 |
| two | Beijing | 2.5 |
| three | Qinghai | 1.1 |
| four | Fujiang | 0.7 |
- DataFrame_object.loc[ ] selects rows by row label:
df2.loc['one']
Output:
Province
year 2018
Province Guangdong
pop 1.3
debt NaN
Name: one, dtype: object
df2.loc['one':'three']
Output:
| Province | year | Province | pop | debt |
|---|---|---|---|---|
| English | | | | |
| one | 2018 | Guangdong | 1.3 | NaN |
| two | 2018 | Beijing | 2.5 | NaN |
| three | 2018 | Qinghai | 1.1 | NaN |
- You can also retrieve a single value by giving both a row label and a column label:
df2.loc['one','Province']
Output:
'Guangdong'
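One detail worth spelling out from the loc examples above: a label slice includes its end label, unlike an ordinary Python slice. The sketch below contrasts loc with the position-based iloc on a small, hypothetical frame.

```python
import pandas as pd

demo = pd.DataFrame(
    {
        "year": [2018] * 4,
        "Province": ["Guangdong", "Beijing", "Qinghai", "Fujiang"],
    },
    index=["one", "two", "three", "four"],
)

by_label = demo.loc["one":"three"]  # label slice: the end label "three" is included
by_position = demo.iloc[0:3]        # positional slice: stop position 3 is excluded
single = demo.loc["one", "Province"]
print(len(by_label), len(by_position), single)
```

Both slices select the same three rows here, even though the stop values look different.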
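- A DataFrame column can be modified by assignment (either a single value or a sequence of values whose length matches the number of rows):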
df2["debt"]=np.arange(2,3,0.25)
df2
Output:
| Province | year | Province | pop | debt |
|---|---|---|---|---|
| English | | | | |
| one | 2018 | Guangdong | 1.3 | 2.00 |
| two | 2018 | Beijing | 2.5 | 2.25 |
| three | 2018 | Qinghai | 1.1 | 2.50 |
| four | 2018 | Fujiang | 0.7 | 2.75 |
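The assignment above uses an array. As a rough sketch of the other cases, the example below also assigns a scalar and a Series. The Series case is not covered in the text above; it is included here because it behaves differently, being aligned on the row labels. The values are invented for illustration.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(
    {"pop": [1.3, 2.5, 1.1, 0.7]},
    index=["one", "two", "three", "four"],
)

demo["debt"] = 1.0                    # a scalar is broadcast to every row
demo["debt"] = np.arange(2, 3, 0.25)  # an array must match the number of rows

# A Series is aligned on the row labels; labels it does not contain become NaN.
demo["debt"] = pd.Series([-1.2, -1.5], index=["two", "four"])
debt_values = demo["debt"].tolist()
print(debt_values)
```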
- Assigning to a column that does not exist creates a new column. A column can be removed with del:
df2['eastern']=df2.Province=='Guangdong'
df2
Output:
| Province | year | Province | pop | debt | eastern |
|---|---|---|---|---|---|
| English | | | | | |
| one | 2018 | Guangdong | 1.3 | 2.00 | True |
| two | 2018 | Beijing | 2.5 | 2.25 | False |
| three | 2018 | Qinghai | 1.1 | 2.50 | False |
| four | 2018 | Fujiang | 0.7 | 2.75 | False |
del df2['eastern']
df2.columns
Output:
Index(['year', 'Province', 'pop', 'debt'], dtype='object', name='Province')
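del changes the DataFrame in place. If you would rather keep the original intact, drop() is an alternative that returns a new object. A minimal sketch comparing the two, using a made-up frame:

```python
import pandas as pd

demo = pd.DataFrame({"Province": ["Guangdong", "Beijing"], "pop": [1.3, 2.5]})
demo["eastern"] = demo["Province"] == "Guangdong"

trimmed = demo.drop(columns=["eastern"])  # returns a new DataFrame
still_there = "eastern" in demo.columns   # the original is unchanged

del demo["eastern"]                       # removes the column in place
gone = "eastern" not in demo.columns
print(still_there, gone)
```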
- Of course, a DataFrame can also be transposed:
df2.T
Output:
| English | one | two | three | four |
|---|---|---|---|---|
| Province | | | | |
| year | 2018 | 2018 | 2018 | 2018 |
| Province | Guangdong | Beijing | Qinghai | Fujiang |
| pop | 1.3 | 2.5 | 1.1 | 0.7 |
| debt | 2 | 2.25 | 2.5 | 2.75 |
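III. Multiple ways to create a DataFrame
- Calling DataFrame() converts data in many formats into a DataFrame object. Its three parameters data, index, and columns supply the data, the row index, and the column index respectively. data can be:
1 A two-dimensional array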
df3=pd.DataFrame(np.random.randint(0,10,(4,4)),index=[1,2,3,4],columns=['A','B','C','D'])
df3
Output:
| | A | B | C | D |
|---|---|---|---|---|
| 1 | 9 | 8 | 4 | 6 |
| 2 | 5 | 7 | 7 | 4 |
| 3 | 6 | 3 | 0 | 2 |
| 4 | 4 | 6 | 9 | 8 |
2 A dictionary
The row index is set by the index parameter, and the column index by the dictionary keys
dict1
Output:
{'Province': ['Guangdong', 'Beijing', 'Qinghai', 'Fujiang'],
'pop': [1.3, 2.5, 1.1, 0.7],
'year': [2018, 2018, 2018, 2018]}
df4=pd.DataFrame(dict1,index=[1,2,3,4])
df4
Output:
| | Province | pop | year |
|---|---|---|---|
| 1 | Guangdong | 1.3 | 2018 |
| 2 | Beijing | 2.5 | 2018 |
| 3 | Qinghai | 1.1 | 2018 |
| 4 | Fujiang | 0.7 | 2018 |
3 A structured array
Here the column index is determined by the field names of the structured array
arr=np.array([('item1',10),('item2',20),('item3',30),('item4',40)],dtype=[("name","10S"),("count",int)])
df5=pd.DataFrame(arr)
df5
Output:
| | name | count |
|---|---|---|
| 0 | b'item1' | 10 |
| 1 | b'item2' | 20 |
| 2 | b'item3' | 30 |
| 3 | b'item4' | 40 |
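The b'...' prefix in the output above shows that the name field holds byte strings, because the field was declared with an S (bytes) type. If ordinary text is wanted, one option is to decode the column afterwards, as in this sketch. It uses a shortened array and writes the field type as "S10"; UTF-8 is assumed as the encoding.

```python
import numpy as np
import pandas as pd

arr = np.array(
    [("item1", 10), ("item2", 20)],
    dtype=[("name", "S10"), ("count", int)],
)
demo = pd.DataFrame(arr)

# The "name" column holds bytes objects; decode them to get ordinary strings.
demo["name"] = demo["name"].str.decode("utf-8")
names = demo["name"].tolist()
print(names)
```

Declaring the field with a Unicode type such as "U10" instead avoids the byte strings in the first place.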
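- In addition, class methods whose names begin with from_ convert specific kinds of data into a DataFrame object. One example is from_dict(), whose orient parameter specifies which axis the dictionary keys map to. The default is "columns":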
dict2={"a":[1,2,3],"b":[4,5,6]}
df6=pd.DataFrame.from_dict(dict2)
df6
Output:
| | a | b |
|---|---|---|
| 0 | 1 | 4 |
| 1 | 2 | 5 |
| 2 | 3 | 6 |
df7=pd.DataFrame.from_dict(dict2,orient="index")
df7
Output:
| | 0 | 1 | 2 |
|---|---|---|---|
| a | 1 | 2 | 3 |
| b | 4 | 5 | 6 |
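With orient="index" the columns get the default labels 0, 1, 2, as the output above shows. from_dict() also accepts a columns argument in that mode, so the columns can be named directly. A small sketch, where the labels "x", "y", "z" are arbitrary placeholders:

```python
import pandas as pd

rows = {"a": [1, 2, 3], "b": [4, 5, 6]}

# With orient="index" each key becomes a row label; columns= names the columns.
demo = pd.DataFrame.from_dict(rows, orient="index", columns=["x", "y", "z"])

value = demo.loc["b", "y"]
print(demo)
```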
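IV. Converting a DataFrame to other data formats
- The to_dict() method converts a DataFrame to a dictionary (or, with orient="records", to a list of dictionaries). The orient parameter determines the layout of the result: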
df7.to_dict()
Output:
{0: {'a': 1, 'b': 4}, 1: {'a': 2, 'b': 5}, 2: {'a': 3, 'b': 6}}
df7.to_dict(orient="records")
Output:
[{0: 1, 1: 2, 2: 3}, {0: 4, 1: 5, 2: 6}]
df7.to_dict(orient="list")
Output:
{0: [1, 4], 1: [2, 5], 2: [3, 6]}
- Similar methods include to_records(), to_csv(), and others.
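As a quick illustration of the two methods just mentioned, here is a minimal sketch on a made-up frame. Passing index=False leaves the row labels out of the result, which is an optional choice made for this example.

```python
import pandas as pd

demo = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# A NumPy record array with one record per row and one field per column.
records = demo.to_records(index=False)

# With no file path given, to_csv() returns the CSV text as a string.
csv_text = demo.to_csv(index=False)

print(records.dtype.names)
print(csv_text)
```

Passing a file path as the first argument of to_csv() writes the same text to disk instead.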
Thank you for reading.
I hope this write-up is helpful to you.
Let's keep improving together!