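1. Introduction to Pandas
Pandas (the name comes from "panel data" and "Python data analysis") is a powerful Python data analysis package built on NumPy. It computes statistics over data quickly, handles missing data well, and reads and processes csv, Excel, txt, and similar files flexibly. It also has dedicated time series functionality. For data processing it is more convenient than Excel and can do more.
Where to learn pandas: 【link to the official pandas documentation】. Learning 【NumPy】 first is recommended.
Installing pandas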
pip install pandas
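2. Pandas data structures
Pandas has two commonly used data structures: Series and DataFrame. Both are built on NumPy arrays, so they execute efficiently. The way I think of it, a Series is a single-column array, that is, just one column of data; a DataFrame is a two-dimensional array, like an Excel sheet made up of rows and columns. Unlike Excel, it also carries row and column indexes, which make data processing and analysis more convenient and flexible.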
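A minimal sketch of the two structures side by side. The sample values and the names `prices` and `table` are made up for illustration and do not come from the text above.

```python
import pandas as pd

# A Series: one column of values plus an index of labels.
prices = pd.Series([1.5, 2.0, 3.25], index=["apple", "pear", "plum"], name="price")

# A DataFrame: several columns sharing one row index.
table = pd.DataFrame(
    {"price": [1.5, 2.0, 3.25], "stock": [10, 0, 7]},
    index=["apple", "pear", "plum"],
)

print(prices.ndim, table.ndim)      # number of dimensions of each object
print(table.loc["pear", "stock"])   # look up one cell by row and column label
```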
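2.1 Introduction to Series
A Series is a one-dimensional array object with a name and an index. The data it holds can be integers, floats, strings, lists, ndarrays, and so on.
An introductory example of creating a Series with pandas
# Import the pandas library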
import pandas as pd
data = [1,2]
pd.Series(data = data,index=None, dtype=None, name=None, copy=False, fastpath=False)
0 1
1 2
dtype: int64
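Parameters:
No. | Parameter | Description | Default |
---|---|---|---|
1 | data (required in practice) | The data stored in the Series, e.g. a list | data=None |
2 | index (optional) | Array-like or Index with the same length as data. Non-unique index values are allowed. Defaults to RangeIndex (0, 1, 2, ..., n) if not provided. If a dict and an index sequence are both given, the index overrides the keys found in the dict | index=None |
3 | dtype (optional) | Data type of the Series; inferred from the data if not given | dtype=None |
4 | name (optional) | Name of the Series | name=None |
5 | copy (optional) | Whether to copy the input data | copy=False |
6 | fastpath (optional) | Internal fast-path flag, not intended for user code | fastpath=False |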
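As a rough illustration of the parameters in the table, the sketch below sets `index`, `dtype`, and `name` explicitly and then combines a dict with an index. The sample values are invented for the example.

```python
import pandas as pd

# index, dtype and name given explicitly
s = pd.Series([1, 2], index=["x", "y"], dtype="float64", name="demo")
print(s)

# With a dict plus an index, the index decides which keys are kept
# and in what order; labels missing from the dict become NaN.
d = pd.Series({"a": 1, "b": 2, "c": 3}, index=["b", "c", "d"])
print(d)
```

Because label "d" has no value in the dict, the result holds NaN there and the dtype becomes float64.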
2.2 Creating a Series
From a list or NumPy array
"""No index specified"""
import numpy as np
import pandas as pd
lst = ["a","b","c"]
ndarry = np.arange(3)
print(lst,'\t\t',ndarry)
ds1 = pd.Series(lst)
ds2 = pd.Series(ndarry)
print(ds1,'\n',ds2)
['a', 'b', 'c'] [0 1 2]
0 a
1 b
2 c
dtype: object
0 0
1 1
2 2
dtype: int32
From a tuple
# Create a pandas Series; np.nan stands for a missing value
tup = (1,np.nan,1)
s = pd.Series(tup)
print(s)
0 1.0
1 NaN
2 1.0
dtype: float64
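Since the introduction mentions missing-data handling, here is a small sketch of three common calls on the same tuple: `isna`, `fillna`, and `dropna`. The fill value 0 is an arbitrary choice for the example.

```python
import numpy as np
import pandas as pd

s = pd.Series((1, np.nan, 1))

mask = s.isna()        # boolean Series marking the missing entries
filled = s.fillna(0)   # new Series with NaN replaced by 0
dropped = s.dropna()   # new Series without the missing entries

print(mask.tolist())
print(filled.tolist())
print(dropped.index.tolist())
```

All three return new objects and leave `s` unchanged.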
From a dict
dic = {"a":[1,2],"b":2,"c":3}
pd.Series(dic) # dict keys become the index labels by default
a [1, 2]
b 2
c 3
dtype: object
From a set
# A set cannot be used: it is unordered and its elements cannot be accessed by index
s = set(range(3))
pd.Series(s)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-55-12e87c61ee70> in <module>()
1 # A set cannot be used: it is unordered and its elements cannot be accessed by index
2 s = set(range(3))
----> 3 pd.Series(s)
~\Anaconda3\lib\site-packages\pandas\core\series.py in __init__(self, data, index, dtype, name, copy, fastpath)
272 pass
273 elif isinstance(data, (set, frozenset)):
--> 274 raise TypeError(f"'{type(data).__name__}' type is unordered")
275 elif isinstance(data, ABCSparseArray):
276 # handle sparse passed here (and force conversion)
TypeError: 'set' type is unordered
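One possible workaround, assuming the order of the elements does not matter to you, is to convert the set to an ordered list first. Sorting is just one way to fix an order.

```python
import pandas as pd

values = set(range(3))

# Sorting turns the set into a list with a well-defined order.
s = pd.Series(sorted(values))
print(s)
```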
From a scalar
# An index is needed to repeat the scalar; without one the Series holds a single value
cc = pd.Series(5,index=["a","b"],name="aa")
cc
a 5
b 5
Name: aa, dtype: int64
2.3 Series index
Setting the index
"""Setting the index, method 1"""
tup = (1,np.nan,1)
s = pd.Series(tup,index=["a","b","c"],name="cc")
s
a 1.0
b NaN
c 1.0
Name: cc, dtype: float64
"""Setting the index, method 2"""
# Build an Index and give it a name
index_name = pd.Index(["a","b","c"],name="abc")
tup = (1,np.nan,1)
s = pd.Series(tup,index=index_name,name="cc",dtype="str")
s # the dtype was specified explicitly
abc
a 1
b nan
c 1
Name: cc, dtype: object
"""Setting the index, method 3"""
tup = (1,np.nan,1)
s = pd.Series(tup)
s.index=["a",'2','3']
s
a 1.0
2 NaN
3 1.0
dtype: float64
Changing the name of the index
"""Setting the index, method 2"""
# Build an Index and give it a name
index_name = pd.Index(["a","b","c"],name="abc")
tup = (1,np.nan,1)
s = pd.Series(tup,index=index_name,name="cc",dtype="str")
s.index.name = 'new' # rename the index
s
new
a 1
b nan
c 1
Name: cc, dtype: object
Viewing the index
"""Setting the index, method 2"""
# Build an Index and give it a name
index_name = pd.Index(["a","b","c"],name="abc")
tup = (1,np.nan,1)
s = pd.Series(tup,index=index_name,name="cc",dtype="str")
print(s.index)
print("Index as a list:",s.index.tolist())
Index(['a', 'b', 'c'], dtype='object', name='abc')
Index as a list: ['a', 'b', 'c']
Renaming index labels
"""Setting the index, method 2"""
# Build an Index and give it a name
index_name = pd.Index(["a","b","c"],name="abc")
tup = (1,np.nan,1)
s = pd.Series(tup,index=index_name,name="cc",dtype="str")
print("Before:",s.index.tolist())
s.rename(index={'a':'aa'},inplace=True)
print("After:",s.index.tolist())
Before: ['a', 'b', 'c']
After: ['aa', 'b', 'c']
"""Setting the index, method 2"""
# Build an Index and give it a name
index_name = pd.Index(["a","b","c"],name="abc")
tup = (1,np.nan,1)
s = pd.Series(tup,index=index_name,name="cc",dtype="str")
print("Before:",s.index.tolist())
s.index(["1",'2','3']) # raises TypeError: an Index object cannot be called
print("After:",s.index.tolist())
Before: ['a', 'b', 'c']
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-101-8b84520332ab> in <module>()
5 s = pd.Series(tup,index=index_name,name="cc",dtype="str")
6 print("Before:",s.index.tolist())
----> 7 s.index(["1",'2','3'])
8 print("After:",s.index.tolist())
TypeError: 'Index' object is not callable
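The error comes from calling the index like a function. A sketch of the working form, which matches method 3 earlier in this section, assigns a new list of labels instead:

```python
import numpy as np
import pandas as pd

s = pd.Series((1, np.nan, 1), index=pd.Index(["a", "b", "c"], name="abc"), name="cc")

# Assignment replaces every label; the new list must have the same length.
s.index = ["1", "2", "3"]
print(s.index.tolist())
print(s.index.name)
```

Assigning a plain list builds a fresh Index, so the old index name "abc" is not carried over.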
Viewing the values
"""Setting the index, method 2"""
# Build an Index and give it a name
index_name = pd.Index(["a","b","c"],name="abc")
tup = (1,np.nan,1)
s = pd.Series(tup,index=index_name,name="cc",dtype="str")
print(s.values)
print("Values as a list:",s.values.tolist())
['1' 'nan' '1']
Values as a list: ['1', 'nan', '1']
Viewing the Series name
"""Setting the index, method 2"""
# Build an Index and give it a name
index_name = pd.Index(["a","b","c"],name="abc")
tup = (1,np.nan,1)
s = pd.Series(tup,index=index_name,name="cc",dtype="str")
print(s.name)
cc
2.4 Adding, deleting, modifying, and querying Series data
2.4.1 Adding
import pandas as pd
index_name = pd.Index(["a","b","c"],name="abc")
tup = (1,np.nan,1)
s1 = pd.Series(tup,index=index_name,name="cc",dtype="str")
s1
abc
a 1
b nan
c 1
Name: cc, dtype: object
s1["d"] = 2 # works like adding a key to a dict; the new entry goes at the end
s1
abc
a 1
b nan
c 1
d 2
Name: cc, dtype: object
dic = {"a":[1,2],"b":2,"c":3}
s2 = pd.Series(dic) # dict keys become the index labels by default
s2
a [1, 2]
b 2
c 3
dtype: object
pd.concat([s1, s2]) # joins two Series; Series.append was removed in pandas 2.0
a 1
b nan
c 1
d 2
a [1, 2]
b 2
c 3
dtype: object
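The result above keeps the original labels, so "a", "b", and "c" each appear twice. As a sketch with invented sample data, `pd.concat` can either keep the labels or renumber them with `ignore_index=True`:

```python
import pandas as pd

s1 = pd.Series([1, 2], index=["a", "b"])
s2 = pd.Series([3, 4], index=["a", "c"])

kept = pd.concat([s1, s2])                            # labels kept, "a" appears twice
renumbered = pd.concat([s1, s2], ignore_index=True)   # fresh 0..n-1 index

print(kept.index.tolist())
print(renumbered.index.tolist())
```

With duplicate labels, `kept["a"]` returns a Series of both matching values rather than a single value.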
2.4.2 Deleting
import pandas as pd
index_name = pd.Index(["a","b","c"],name="abc")
tup = (1,np.nan,1)
s = pd.Series(tup,index=index_name,name="cc",dtype="str")
display(s)
abc
a 1
b nan
c 1
Name: cc, dtype: object
# Method 1: del
del s["b"]
print(s)
abc
a 1
c 1
Name: cc, dtype: object
print("Before deletion:",s)
# Method 2: drop
a = s.drop("a")
print("After deletion:",s)
Before deletion: abc
a 1
c 1
Name: cc, dtype: object
After deletion: abc
a 1
c 1
Name: cc, dtype: object
# As shown above, s itself did not change; print a to see the result
print(a)
abc
c 1
Name: cc, dtype: object
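# a holds the result we want; passing inplace=True applies the change to s itself
print("Before deletion:",s)
s.drop("a",inplace=True) # returns None when inplace=True, so there is nothing to assign
print("After deletion:",s)
Before deletion: abc
a 1
c 1
Name: cc, dtype: object
After deletion: abc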
c 1
Name: cc, dtype: object
"""Dropping several labels at once with drop"""
import pandas as pd
index_name = pd.Index(["a","b","c"],name="abc")
tup = (1,np.nan,1)
s = pd.Series(tup,index=index_name,name="cc",dtype="str")
print("Before deletion:",s)
s.drop(["a","b"],inplace=True) # returns None when inplace=True
print("After deletion:",s)
Before deletion: abc
a 1
b nan
c 1
Name: cc, dtype: object
After deletion: abc
c 1
Name: cc, dtype: object
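To summarize the behavior of `drop` in one place, here is a small sketch with invented data. The label "zzz" is deliberately absent, to show the `errors="ignore"` option.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "c"])

# Without inplace=True, drop returns a new Series and leaves s alone.
trimmed = s.drop(["a", "b"])

# errors="ignore" skips labels that are not present instead of raising KeyError.
same = s.drop("zzz", errors="ignore")

print(s.index.tolist(), trimmed.index.tolist(), same.index.tolist())
```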
2.4.3 Modifying
# Select a value, then change it by assignment
import pandas as pd
index_name = pd.Index(["a","b","c"],name="abc")
tup = (1,np.nan,1)
s = pd.Series(tup,index=index_name,name="cc",dtype="str")
print("Before:",s)
s["a"] = 2
print("After:",s)
Before: abc
a 1
b nan
c 1
Name: cc, dtype: object
After: abc
a 2
b nan
c 1
Name: cc, dtype: object
# Select a value, then change it by assignment
import pandas as pd
index_name = pd.Index(["a","b","c"],name="abc")
tup = (1,np.nan,1)
s = pd.Series(tup,index=index_name,name="cc",dtype="str")
print("Before:",s)
# .loc selects by label or boolean array
s.loc["a"] = 3
print("After:",s)
Before: abc
a 1
b nan
c 1
Name: cc, dtype: object
After: abc
a 3
b nan
c 1
Name: cc, dtype: object
import pandas as pd
index_name = pd.Index(["a","b","c"],name="abc")
tup = (1,np.nan,1)
s = pd.Series(tup,index=index_name,name="cc",dtype="str")
print("Before:",s)
# .iloc is purely integer, position-based indexing
s.iloc[2] = 3
print("After:",s)
Before: abc
a 1
b nan
c 1
Name: cc, dtype: object
After: abc
a 1
b nan
c 3
Name: cc, dtype: object
2.4.4 Querying
Looking up a single value by index label
import pandas as pd
index_name = pd.Index(["a","b","c"],name="abc")
tup = (1,np.nan,1)
s = pd.Series(tup,index=index_name,name="cc",dtype="str")
s["a"]
'1'
Looking up several values by index labels
import pandas as pd
index_name = pd.Index(["a","b","c"],name="abc")
tup = (1,np.nan,1)
s = pd.Series(tup,index=index_name,name="cc",dtype="str")
s[["a","b"]]
abc
a 1
b nan
Name: cc, dtype: object
Filtering with a boolean index
import pandas as pd
index_name = pd.Index(["a","b","c","d"],name="num")
tup = (1,2,3,4)
s = pd.Series(tup,index=index_name,name="cc",dtype="float")
s[s>2]
num
c 3.0
d 4.0
Name: cc, dtype: float64
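Boolean conditions can also be combined. The sketch below uses the same values as the example above with `&`, `|`, and parentheses around each condition; the thresholds are chosen only for illustration.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"])

# Each condition needs its own parentheses; use & (and), | (or), ~ (not).
middle = s[(s > 1) & (s < 4)]
edges = s[(s == 1) | (s == 4)]

print(middle.index.tolist())
print(edges.index.tolist())
```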
Querying with positional slices and label slices
import pandas as pd
index_name = pd.Index(["a","b","c","d"],name="num")
tup = (1,2,3,4)
s = pd.Series(tup,index=index_name,name="cc",dtype="float")
s[:2] # positional slice: start included, end excluded
num
a 1.0
b 2.0
Name: cc, dtype: float64
s["a":"c"]
num
a 1.0
b 2.0
c 3.0
Name: cc, dtype: float64
s.iloc[[0,1]] # select by position; plain s[[0,1]] on a labelled index is deprecated in recent pandas
num
a 1.0
b 2.0
Name: cc, dtype: float64
Purely integer, position-based selection with .iloc
s.iloc[:2][:]
num
a 1.0
b 2.0
Name: cc, dtype: float64
Selection by label or boolean array with .loc
s.loc["c":]
num
c 3.0
d 4.0
Name: cc, dtype: float64
s.loc[["c","b"]]
num
c 3.0
b 2.0
Name: cc, dtype: float64
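One difference worth spelling out: positional slices exclude the end, while label slices include it. A short sketch on the same data:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"])

by_position = s.iloc[0:2]   # positions 0 and 1; the end position is excluded
by_label = s.loc["a":"c"]   # labels a, b and c; the end label is included

print(by_position.index.tolist())
print(by_label.index.tolist())
```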
Viewing the first and last n rows
import pandas as pd
tup = (1,2,3,4,4,5,6,7,8,9)
s = pd.Series(tup)
print("First 5 rows:",s.head()) # 5 rows by default
print("Last 5 rows:",s.tail()) # 5 rows by default
print("First 2 rows:",s.head(2)) # 2 rows requested
print("Last 2 rows:",s.tail(2)) # 2 rows requested
First 5 rows: 0 1
1 2
2 3
3 4
4 4
dtype: int64
Last 5 rows: 5 5
6 6
7 7
8 8
9 9
dtype: int64
First 2 rows: 0 1
1 2
dtype: int64
Last 2 rows: 8 8
9 9
dtype: int64
9 9
dtype: int64
2.5 Series arithmetic and statistics
Arithmetic on a single Series
import pandas as pd
tup = (1,2,3,4,5,5,6,7,8,9)
s1 = pd.Series(tup[:5])
s1 * 2 # multiplies every value by 2, i.e. a vectorized operation
0 2
1 4
2 6
3 8
4 10
dtype: int64
s1 +1 # adds 1 at every position
0 2
1 3
2 4
3 5
4 6
dtype: int64
Arithmetic between two Series (same index)
# Addition
import pandas as pd
tup = (1,2,3,4,5,5,6,7,8,9)
s1 = pd.Series(tup[:5])
s2 = pd.Series(tup[5:])
print("s1:",s1)
print("s2:",s2)
s1: 0 1
1 2
2 3
3 4
4 5
dtype: int64
s2: 0 5
1 6
2 7
3 8
4 9
dtype: int64
s1 + s2 # values with the same index label are added
0 6
1 8
2 10
3 12
4 14
dtype: int64
s2 - s1 # values with the same index label are subtracted
0 4
1 4
2 4
3 4
4 4
dtype: int64
Arithmetic between two Series (different indexes)
# Addition
import pandas as pd
tup = (1,2,3,4,5,5,6,7,8,9)
s1 = pd.Series(tup[:5],index=["a","b",1,2,3])
s2 = pd.Series(tup[5:])
print("s1:",s1)
print("s2:",s2)
s1: a 1
b 2
1 3
2 4
3 5
dtype: int64
s2: 0 5
1 6
2 7
3 8
4 9
dtype: int64
s1 + s2 # labels present in only one Series give NaN
0 NaN
1 9.0
2 11.0
3 13.0
4 NaN
a NaN
b NaN
dtype: float64
s1 - s2 # labels present in only one Series give NaN
0 NaN
1 -3.0
2 -3.0
3 -3.0
4 NaN
a NaN
b NaN
dtype: float64
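If NaN for unmatched labels is not what you want, one option is the method form `Series.add` with `fill_value`. This is a sketch with invented data, and treating a missing side as 0 is an assumption that will not suit every dataset.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

plain = s1 + s2                     # NaN where a label exists on one side only
filled = s1.add(s2, fill_value=0)   # a missing side is treated as 0

print(plain)
print(filled)
```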
Summary statistics
import pandas as pd
tup = (1,2,3,4,5,5,6,7,8,9)
s = pd.Series(tup)
s.describe() # quick overview of the statistics
count 10.000000
mean 5.000000
std 2.581989
min 1.000000
25% 3.250000
50% 5.000000
75% 6.750000
max 9.000000
dtype: float64
# Mean
s.mean()
5.0
# Sum
s.sum()
50
# Standard deviation
s.std()
2.581988897471611
# Maximum
s.max()
9
# Minimum
s.min()
1
# Quantiles
print("Lower quartile:",s.quantile(0.25))
print("Median:",s.quantile(0.5))
print("Upper quartile:",s.quantile(0.75))
Lower quartile: 3.25
Median: 5.0
Upper quartile: 6.75
# Cumulative sum
s.cumsum()
0 1
1 3
2 6
3 10
4 15
5 20
6 26
7 33
8 41
9 50
dtype: int64
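A note on the standard deviation shown earlier: `Series.std` uses the sample formula (`ddof=1`) by default, whereas `np.std` defaults to the population formula (`ddof=0`). The sketch below compares the two on the same data.

```python
import numpy as np
import pandas as pd

s = pd.Series((1, 2, 3, 4, 5, 5, 6, 7, 8, 9))

sample_std = s.std()             # pandas default: ddof=1 (divide by n - 1)
population_std = s.std(ddof=0)   # divide by n instead

print(sample_std, population_std)
print(np.isclose(population_std, np.std(s.to_numpy())))  # NumPy defaults to ddof=0
```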