An Introduction to Pandas and Basic Usage of Series

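1. Introduction to Pandas

Pandas (the name comes from "panel data" and "Python data analysis") is a powerful Python data analysis package built on top of NumPy. It computes statistics over data quickly, handles missing data well, and reads and processes csv, Excel, txt, and similar files flexibly. It also has dedicated time-series functionality. For data processing it is more convenient than Excel and can do more.

Where to learn pandas: [link to the official pandas documentation]. I recommend learning [NumPy] first.

Installing pandas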

pip install pandas

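2. Pandas Data Structures

Pandas has two commonly used data structures: Series and DataFrame. Both are built on top of NumPy arrays, which is why they run efficiently. My own way of thinking about them: a Series is a single-column array, that is, one column of data; a DataFrame is a two-dimensional table, much like an Excel sheet made of rows and columns. The difference from Excel is that both rows and columns carry an index, which makes data processing and analysis more convenient and flexible.

2.1 Introduction to Series

A Series is a one-dimensional array object with a labeled index. The data in a Series can be integers, floats, strings, lists, ndarrays, and so on.

A first example of creating a Series with pandas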
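To make the relationship between the two structures concrete, here is a minimal sketch. The fruit labels and numbers are made up for illustration; the point is only that a Series is one labeled column and that selecting a column from a DataFrame gives back a Series.

```python
import pandas as pd

# A Series: one column of values plus an index of labels.
prices = pd.Series([3.5, 2.0, 4.25], index=["apple", "pear", "plum"], name="price")

# A DataFrame: several columns that share one row index.
table = pd.DataFrame(
    {"price": [3.5, 2.0, 4.25], "stock": [10, 0, 7]},
    index=["apple", "pear", "plum"],
)

# Selecting one column of a DataFrame gives back a Series.
column = table["price"]
print(type(column).__name__)   # Series
print(column.equals(prices))   # True
print(table.shape)             # (3, 2)
```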

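# Import pandas
import pandas as pd
data = [1,2]
# The keyword arguments are written out with the defaults listed below.
# fastpath is left out: it is an internal argument, deprecated in recent pandas releases.
pd.Series(data=data, index=None, dtype=None, name=None, copy=False)
0    1
1    2
dtype: int64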

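Parameter reference:

No. | Parameter | Description | Default
1 | data (optional in the signature, but normally supplied) | The data to store in the Series, e.g. a list | data=None
2 | index (optional) | Array-like or Index of the same length as data. Non-unique index values are allowed. Defaults to RangeIndex(0, 1, 2, ..., n) if not provided. If data is a dict and an index is also passed, the index takes precedence over the keys found in the dict | index=None
3 | dtype (optional) | Data type of the output; inferred from the data if not given | dtype=None
4 | name (optional) | Name of the Series | name=None
5 | copy (optional) | Whether to copy the input data | copy=False
6 | fastpath (optional) | Internal fast-path flag; not meant for user code and deprecated in recent pandas releases | fastpath=False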
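The parameters above can be tried out together. The following is a small sketch with made-up values. It shows index, dtype, and name in one call, then the dict-plus-index case from row 2, where the explicit index selects and reorders the dict keys.

```python
import pandas as pd

# index, dtype and name used together.
s = pd.Series([1, 2, 3], index=["a", "b", "c"], dtype="float64", name="demo")
print(s)

# With a dict, the keys become labels. An explicit index then selects and
# reorders them; labels missing from the dict ("z" here) are filled with NaN.
d = pd.Series({"a": 1, "b": 2}, index=["b", "a", "z"])
print(d)
```

Because of the NaN, the second Series is stored as float64 even though the dict held integers.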

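2.2 Creating a Series

From a list or a NumPy array

"""No index specified"""
import numpy as np
import pandas as pd
lst = ["a","b","c"]
arr = np.arange(3)
print(lst,'\t\t',arr)
ds1 = pd.Series(lst) 
ds2 = pd.Series(arr)
# The dtype of ds2 follows NumPy's default integer type:
# int32 in the environment used here, int64 on many other platforms.
print(ds1,'\n',ds2)
['a', 'b', 'c'] 		 [0 1 2]
0    a
1    b
2    c
dtype: object 
 0    0
1    1
2    2
dtype: int32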

From a tuple

# Create a Series from a tuple; np.nan marks a missing value
tup = (1,np.nan,1)
s = pd.Series(tup)
print(s)
0    1.0
1    NaN
2    1.0
dtype: float64

From a dictionary

dic = {"a":[1,2],"b":2,"c":3} 
pd.Series(dic) # the dict keys become the index labels by default
a    [1, 2]
b         2
c         3
dtype: object

From a set

# A set cannot be used: it is unordered and its elements cannot be fetched by position
s = set(range(3))
pd.Series(s)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-55-12e87c61ee70> in <module>()
      1 # A set cannot be used: it is unordered and its elements cannot be fetched by position
      2 s = set(range(3))
----> 3 pd.Series(s)

~\Anaconda3\lib\site-packages\pandas\core\series.py in __init__(self, data, index, dtype, name, copy, fastpath)
    272                 pass
    273             elif isinstance(data, (set, frozenset)):
--> 274                 raise TypeError(f"'{type(data).__name__}' type is unordered")
    275             elif isinstance(data, ABCSparseArray):
    276                 # handle sparse passed here (and force conversion)

TypeError: 'set' type is unordered

From a scalar

# Pass an index to repeat the scalar for every label; without one the Series holds a single value
cc = pd.Series(5,index=["a","b"],name="aa") 
cc
a    5
b    5
Name: aa, dtype: int64

2.3 The Series index

Setting the index

"""Setting the index, method 1"""
tup = (1,np.nan,1)
s = pd.Series(tup,index=["a","b","c"],name="cc")
s
a    1.0
b    NaN
c    1.0
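Name: cc, dtype: float64
"""Setting the index, method 2"""
# Build an Index object and give the index a name
index_name = pd.Index(["a","b","c"],name="abc")
tup = (1,np.nan,1)
s = pd.Series(tup,index=index_name,name="cc",dtype="str")
s # dtype is set explicitly; how the missing value prints under "str" may vary by pandas version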
abc
a      1
b    nan
c      1
Name: cc, dtype: object
"""Setting the index, method 3"""
tup = (1,np.nan,1)
s = pd.Series(tup)
s.index=["a",'2','3']
s
a    1.0
2    NaN
3    1.0
dtype: float64

Renaming the index (changing index.name)

"""Same setup as method 2"""
# Build an Index object and give the index a name
index_name = pd.Index(["a","b","c"],name="abc")
tup = (1,np.nan,1)
s = pd.Series(tup,index=index_name,name="cc",dtype="str")
s.index.name = 'new'  # rename the index itself
s
new
a      1
b    nan
c      1
Name: cc, dtype: object

Inspecting the index

"""Same setup as method 2"""
# Build an Index object and give the index a name
index_name = pd.Index(["a","b","c"],name="abc")
tup = (1,np.nan,1)
s = pd.Series(tup,index=index_name,name="cc",dtype="str")
print(s.index)
print("Index as a list:",s.index.tolist())
Index(['a', 'b', 'c'], dtype='object', name='abc')
Index as a list: ['a', 'b', 'c']

Renaming index labels

"""Same setup as method 2"""
# Build an Index object and give the index a name
index_name = pd.Index(["a","b","c"],name="abc")
tup = (1,np.nan,1)
s = pd.Series(tup,index=index_name,name="cc",dtype="str")
print("Before:",s.index.tolist())
s.rename(index={'a':'aa'},inplace=True)
print("After:",s.index.tolist())
Before: ['a', 'b', 'c']
After: ['aa', 'b', 'c']
"""Same setup as method 2"""
# Build an Index object and give the index a name
index_name = pd.Index(["a","b","c"],name="abc")
tup = (1,np.nan,1)
s = pd.Series(tup,index=index_name,name="cc",dtype="str")
print("Before:",s.index.tolist())
s.index(["1",'2','3'])
print("After:",s.index.tolist())
Before: ['a', 'b', 'c']
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-101-8b84520332ab> in <module>()
      5 s = pd.Series(tup,index=index_name,name="cc",dtype="str")
      6 print("Before:",s.index.tolist())
----> 7 s.index(["1",'2','3'])
      8 print("After:",s.index.tolist())

TypeError: 'Index' object is not callable

The call fails because s.index is an attribute, not a method. To replace every label at once, assign a new list to it (s.index = [...]), as in method 3 above.
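For comparison with the failing call, here is a minimal sketch of two ways that do replace every label. The replacement labels are made up for illustration. Attribute assignment changes the Series in place, while set_axis returns a new Series.

```python
import numpy as np
import pandas as pd

s = pd.Series((1, np.nan, 1), index=["a", "b", "c"], name="cc")

# Assigning to the attribute replaces every label in place.
s.index = ["1", "2", "3"]
print(s.index.tolist())

# set_axis returns a new Series with the new labels and leaves s untouched.
t = s.set_axis(["x", "y", "z"])
print(t.index.tolist())
print(s.index.tolist())
```

The new list must have the same length as the Series, otherwise pandas raises a ValueError.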

Inspecting the values

"""Same setup as method 2"""
# Build an Index object and give the index a name
index_name = pd.Index(["a","b","c"],name="abc")
tup = (1,np.nan,1)
s = pd.Series(tup,index=index_name,name="cc",dtype="str")
print(s.values)
print("Values as a list:",s.values.tolist())
['1' 'nan' '1']
Values as a list: ['1', 'nan', '1']

Inspecting the Series name

"""Same setup as method 2"""
# Build an Index object and give the index a name
index_name = pd.Index(["a","b","c"],name="abc")
tup = (1,np.nan,1)
s = pd.Series(tup,index=index_name,name="cc",dtype="str")
print(s.name)
cc

2.4 Adding, deleting, modifying, and querying Series data

2.4.1 Adding

import pandas as pd
index_name = pd.Index(["a","b","c"],name="abc")
tup = (1,np.nan,1)
s1 = pd.Series(tup,index=index_name,name="cc",dtype="str")
s1
abc
a      1
b    nan
c      1
Name: cc, dtype: object
s1["d"] = 2 # works like adding a key to a dict: a new label is appended at the end
s1 
abc
a      1
b    nan
c      1
d      2
Name: cc, dtype: object
dic = {"a":[1,2],"b":2,"c":3} 
s2 = pd.Series(dic) # the dict keys become the index labels by default
s2
a    [1, 2]
b         2
c         3
dtype: object
pd.concat([s1, s2]) # joins two Series; Series.append was removed in pandas 2.0
a         1
b       nan
c         1
d         2
a    [1, 2]
b         2
c         3
dtype: object
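The joined result above keeps both sets of labels, so a, b, and c each appear twice. The sketch below, using small made-up Series, shows what that means for lookups and how ignore_index=True produces a fresh 0..n-1 index instead.

```python
import pandas as pd

s1 = pd.Series([1, 2], index=["a", "b"])
s2 = pd.Series([3, 4], index=["a", "c"])

joined = pd.concat([s1, s2])                          # keeps labels, so "a" appears twice
renumbered = pd.concat([s1, s2], ignore_index=True)   # fresh 0..n-1 index

print(joined.index.is_unique)      # False
print(joined["a"].tolist())        # [1, 3]
print(renumbered.index.tolist())   # [0, 1, 2, 3]
```

Looking up a duplicated label returns a Series of all matches rather than a single value, which is worth keeping in mind before joining Series that share labels.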

2.4.2 Deleting

import pandas as pd
index_name = pd.Index(["a","b","c"],name="abc")
tup = (1,np.nan,1)
s = pd.Series(tup,index=index_name,name="cc",dtype="str")
display(s)
abc
a      1
b    nan
c      1
Name: cc, dtype: object
# Method 1: del
del s["b"]
print(s)
abc
a    1
c    1
Name: cc, dtype: object
print("Before drop:",s)
# Method 2: drop
a = s.drop("a") 
print("After drop:",s)
Before drop: abc
a    1
c    1
Name: cc, dtype: object
After drop: abc
a    1
c    1
Name: cc, dtype: object
# As shown above, s itself did not change; print a to see the result
print(a)
abc
c    1
Name: cc, dtype: object
# a holds the result we wanted; passing inplace=True modifies s directly instead
print("Before drop:",s)
s.drop("a",inplace=True)  # returns None, so there is nothing useful to assign
print("After drop:",s)
Before drop: abc
a    1
c    1
Name: cc, dtype: object
After drop: abc
c    1
Name: cc, dtype: object
"""Dropping several labels at once with drop"""
import pandas as pd
index_name = pd.Index(["a","b","c"],name="abc")
tup = (1,np.nan,1)
s = pd.Series(tup,index=index_name,name="cc",dtype="str")
print("Before drop:",s)
s.drop(["a","b"],inplace=True)  # returns None, so there is nothing useful to assign
print("After drop:",s)
Before drop: abc
a      1
b    nan
c      1
Name: cc, dtype: object
After drop: abc
c    1
Name: cc, dtype: object
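As an alternative to inplace=True, the result of drop can simply be kept under a new name. The sketch below uses made-up data and also shows errors="ignore", which skips labels that do not exist instead of raising a KeyError.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "c"])

# drop returns a new Series; the original is left as it was.
trimmed = s.drop(["a", "b"])

# errors="ignore" skips labels that are not present instead of raising KeyError.
same = s.drop("zzz", errors="ignore")

print(trimmed.index.tolist())  # ['c']
print(len(s), len(same))       # 3 3
```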

2.4.3 Modifying

# Select a value by label, then assign to it
import pandas as pd
index_name = pd.Index(["a","b","c"],name="abc")
tup = (1,np.nan,1)
s = pd.Series(tup,index=index_name,name="cc",dtype="str")
print("Before:",s)
s["a"] = 2
print("After:",s)
Before: abc
a      1
b    nan
c      1
Name: cc, dtype: object
After: abc
a      2
b    nan
c      1
Name: cc, dtype: object
# Select a value with .loc, then assign to it
import pandas as pd
index_name = pd.Index(["a","b","c"],name="abc")
tup = (1,np.nan,1)
s = pd.Series(tup,index=index_name,name="cc",dtype="str")
print("Before:",s)
# .loc selects by label (or by a boolean array)
s.loc["a"] = 3
print("After:",s)
Before: abc
a      1
b    nan
c      1
Name: cc, dtype: object
After: abc
a      3
b    nan
c      1
Name: cc, dtype: object
import pandas as pd
index_name = pd.Index(["a","b","c"],name="abc")
tup = (1,np.nan,1)
s = pd.Series(tup,index=index_name,name="cc",dtype="str")
print("Before:",s)
# .iloc is purely integer, position-based indexing
s.iloc[2] = 3
print("After:",s)
Before: abc
a      1
b    nan
c      1
Name: cc, dtype: object
After: abc
a      1
b    nan
c      3
Name: cc, dtype: object
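The same assignment style extends to several elements at once. This is a small sketch with made-up float data: .loc accepts a list of labels or a boolean condition on the left-hand side.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"])

s.loc[["a", "b"]] = 0.0   # several labels at once
s.loc[s > 3] = 10.0       # every element that matches a condition

print(s.tolist())  # [0.0, 0.0, 3.0, 10.0]
```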

2.4.4 Querying

Looking up a single value by index label

import pandas as pd
index_name = pd.Index(["a","b","c"],name="abc")
tup = (1,np.nan,1)
s = pd.Series(tup,index=index_name,name="cc",dtype="str")
s["a"]
'1'

Looking up several values by index labels

import pandas as pd
index_name = pd.Index(["a","b","c"],name="abc")
tup = (1,np.nan,1)
s = pd.Series(tup,index=index_name,name="cc",dtype="str")
s[["a","b"]]
abc
a      1
b    nan
Name: cc, dtype: object

Filtering with a boolean mask

import pandas as pd
index_name = pd.Index(["a","b","c","d"],name="num")
tup = (1,2,3,4)
s = pd.Series(tup,index=index_name,name="cc",dtype="float")
s[s>2]
num
c    3.0
d    4.0
Name: cc, dtype: float64
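Conditions can also be combined. The following sketch, on the same four values, assumes nothing beyond standard pandas operators: & means "and", | means "or", ~ negates, and each comparison needs its own parentheses.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"])

mask = (s > 1) & (s < 4)   # parentheses are required around each comparison
inside = s[mask]
outside = s[~mask]         # ~ negates the mask

print(inside.index.tolist())   # ['b', 'c']
print(outside.index.tolist())  # ['a', 'd']
```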

Querying by positional slices and label slices

import pandas as pd
index_name = pd.Index(["a","b","c","d"],name="num")
tup = (1,2,3,4)
s = pd.Series(tup,index=index_name,name="cc",dtype="float")
s[:2] # positional slice: start included, stop excluded
num
a    1.0
b    2.0
Name: cc, dtype: float64
s["a":"c"] # label slice: both endpoints are included
num
a    1.0
b    2.0
c    3.0
Name: cc, dtype: float64
s.iloc[[0,1]] # list of positions; plain s[[0,1]] on a string index is deprecated in recent pandas
num
a    1.0
b    2.0
Name: cc, dtype: float64

Purely integer, position-based selection with iloc

s.iloc[:2]
num
a    1.0
b    2.0
Name: cc, dtype: float64

Label-based (or boolean-array) selection with loc

s.loc["c":]
num
c    3.0
d    4.0
Name: cc, dtype: float64
s.loc[["c","b"]]
num
c    3.0
b    2.0
Name: cc, dtype: float64
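The one difference that tends to surprise people is how the two kinds of slice treat the stop value. A minimal side-by-side sketch on the same four values:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"])

by_position = s.iloc[0:2]   # positions 0 and 1; the stop position is excluded
by_label = s.loc["a":"c"]   # labels a, b and c; the stop label is included

print(by_position.index.tolist())  # ['a', 'b']
print(by_label.index.tolist())     # ['a', 'b', 'c']
```

Spelling out .iloc or .loc, rather than relying on bare s[...], makes it explicit which rule applies.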

Viewing the first and last n rows

import pandas as pd
tup = (1,2,3,4,4,5,6,7,8,9)
s = pd.Series(tup)
print("First 5 rows:",s.head()) # 5 rows by default
print("Last 5 rows:",s.tail()) # 5 rows by default
print("First 2 rows:",s.head(2)) # 2 rows requested
print("Last 2 rows:",s.tail(2))  # 2 rows requested
First 5 rows: 0    1
1    2
2    3
3    4
4    4
dtype: int64
Last 5 rows: 5    5
6    6
7    7
8    8
9    9
dtype: int64
First 2 rows: 0    1
1    2
dtype: int64
Last 2 rows: 8    8
9    9
dtype: int64

2.5 Arithmetic and statistics on Series

Arithmetic on a single Series

import pandas as pd
tup = (1,2,3,4,5,5,6,7,8,9)
s1 = pd.Series(tup[:5])
s1 * 2 # every value is multiplied by 2, i.e. a vectorized operation
0     2
1     4
2     6
3     8
4    10
dtype: int64
s1 +1 # 1 is added at every position
0    2
1    3
2    4
3    5
4    6
dtype: int64

Operations between two Series (same index)

# Addition and subtraction
import pandas as pd
tup = (1,2,3,4,5,5,6,7,8,9)
s1 = pd.Series(tup[:5])
s2 = pd.Series(tup[5:])
print("s1:",s1)
print("s2:",s2)
s1: 0    1
1    2
2    3
3    4
4    5
dtype: int64
s2: 0    5
1    6
2    7
3    8
4    9
dtype: int64
s1 + s2 # values with the same index label are added
0     6
1     8
2    10
3    12
4    14
dtype: int64
s2 - s1 # values with the same index label are subtracted
0    4
1    4
2    4
3    4
4    4
dtype: int64

Operations between two Series (different indexes)

# Addition and subtraction
import pandas as pd
tup = (1,2,3,4,5,5,6,7,8,9)
s1 = pd.Series(tup[:5],index=["a","b",1,2,3])
s2 = pd.Series(tup[5:])
print("s1:",s1)
print("s2:",s2)
s1: a    1
b    2
1    3
2    4
3    5
dtype: int64
s2: 0    5
1    6
2    7
3    8
4    9
dtype: int64
s1 + s2 # labels present in only one Series give NaN
0     NaN
1     9.0
2    11.0
3    13.0
4     NaN
a     NaN
b     NaN
dtype: float64
s1 - s2 # labels present in only one Series give NaN
0    NaN
1   -3.0
2   -3.0
3   -3.0
4    NaN
a    NaN
b    NaN
dtype: float64
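When NaN for unmatched labels is not what you want, the method form of the operator takes a fill_value. The sketch below uses small made-up Series with partly overlapping labels and treats the missing side as 0.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

plain = s1 + s2                     # NaN wherever a label is missing on one side
filled = s1.add(s2, fill_value=0)   # the missing side is treated as 0

print(plain.isna().tolist())   # [True, False, False, True]
print(filled.tolist())         # [1.0, 12.0, 23.0, 30.0]
```

The same keyword is available on sub, mul, and div.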

Summary statistics

import pandas as pd
tup = (1,2,3,4,5,5,6,7,8,9)
s = pd.Series(tup)
s.describe() # quick overview of the summary statistics
count    10.000000
mean      5.000000
std       2.581989
min       1.000000
25%       3.250000
50%       5.000000
75%       6.750000
max       9.000000
dtype: float64
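# Mean
s.mean()
5.0
# Sum
s.sum()
50
# Standard deviation (sample standard deviation, ddof=1)
s.std()
2.581988897471611
# Maximum
s.max()
9
# Minimum
s.min()
1
# Quantiles
print("Lower quartile (Q1):",s.quantile(0.25))
print("Median (Q2):",s.quantile(0.5))
print("Upper quartile (Q3):",s.quantile(0.75))
Lower quartile (Q1): 3.25
Median (Q2): 5.0
Upper quartile (Q3): 6.75
# Cumulative sum
s.cumsum()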
0     1
1     3
2     6
3    10
4    15
5    20
6    26
7    33
8    41
9    50
dtype: int64
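One detail behind the std value above: pandas divides by n - 1 by default (the sample standard deviation), while NumPy's np.std divides by n. A short sketch on the same ten numbers shows both, with ddof as the switch.

```python
import numpy as np
import pandas as pd

s = pd.Series((1, 2, 3, 4, 5, 5, 6, 7, 8, 9))

sample_std = s.std()            # ddof=1: divides by n - 1
population_std = s.std(ddof=0)  # ddof=0: divides by n
numpy_std = np.std(s.to_numpy())  # NumPy's default is ddof=0

print(sample_std)
print(population_std)
print(np.isclose(population_std, numpy_std))  # True
```

The squared deviations from the mean of 5 sum to 60, so the two results are sqrt(60/9) and sqrt(60/10).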