[Python Data Analysis with pandas 03] Basic Functionality of Data Structures, Part 2

Original post

2018-09-03 21:07

Arithmetic and Data Alignment

A powerful feature of pandas is that it can perform arithmetic on objects with different indexes.

import pandas as pd
import numpy as np

s1 = pd.Series([1,2,3,4],index=['a','b','c','d'])
s1
'''
a    1
b    2
c    3
d    4
dtype: int64

'''

s2 = pd.Series([4,78,32,89,61],index=['a','b','e','d','g'])
s2
'''
a     4
b    78
e    32
d    89
g    61
dtype: int64

'''

# Add the two Series
s1+s2
'''
a     5.0
b    80.0
c     NaN
d    93.0
e     NaN
g     NaN
dtype: float64

'''

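As you can see, values at labels present in both objects are added directly. For labels that appear in only one object, NaN is introduced. The result's index is the union of the two indexes, and the dtype becomes float64 so that it can hold NaN.

For a DataFrame, alignment happens on both rows and columns: the arithmetic is performed only where the row label and the column label both exist in both objects. Everywhere else, NaN is introduced.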

data1 = pd.DataFrame(np.arange(9).reshape((3,3)),columns = list('bcd'),index = ['one','two','three'])
print("data1:")
print(data1)
print('\n')
data2 = pd.DataFrame(np.arange(12).reshape((4,3)),columns = list('bed'),index = ['one','second','three','four'])
print("data2:")
print(data2)
print('\n')
# Add the two DataFrames
print("data1+data2:")
print(data1+data2)

'''
Result:
data1:
       b  c  d
one    0  1  2
two    3  4  5
three  6  7  8


data2:
        b   e   d
one     0   1   2
second  3   4   5
three   6   7   8
four    9  10  11


data1+data2:
           b   c     d   e
four     NaN NaN   NaN NaN
one      0.0 NaN   4.0 NaN
second   NaN NaN   NaN NaN
three   12.0 NaN  16.0 NaN
two      NaN NaN   NaN NaN
'''

Often, though, we want numeric values rather than NaN. There are two approaches:

# Use the add method
data1 = pd.DataFrame(np.arange(9).reshape((3,3)),columns = list('bcd'),index = ['one','two','three'])
data2 = pd.DataFrame(np.arange(12).reshape((4,3)),columns = list('bed'),index = ['one','second','three','four'])
print(data1.add(data2,fill_value=0))
'''
Result:
           b    c     d     e
four     9.0  NaN  11.0  10.0
one      0.0  1.0   4.0   1.0
second   3.0  NaN   5.0   4.0
three   12.0  7.0  16.0   7.0
two      3.0  4.0   5.0   NaN
'''

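Clearly, this result still falls short: some NaN values remain. The reason is that fill_value only stands in for a value that is missing from one of the two operands. data1 and data2 each contain labels the other lacks (neither is a subset of the other), so there are positions such as ('four', 'c') and ('two', 'e') that exist in neither object. At those positions both sides are missing, and the result stays NaN.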
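To see exactly which cells stay NaN, and one way to get rid of them, here is a minimal sketch. It reuses data1 and data2 from the example above; the names summed and filled are introduced here for illustration only. Whether 0 is a sensible stand-in for a cell that exists in neither frame depends on the data, so treat the final fillna(0) as an assumption.

```python
import numpy as np
import pandas as pd

data1 = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     columns=list('bcd'), index=['one', 'two', 'three'])
data2 = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     columns=list('bed'), index=['one', 'second', 'three', 'four'])

# fill_value=0 replaces a value that is missing from exactly one operand
summed = data1.add(data2, fill_value=0)

# Cells absent from BOTH frames are still NaN, e.g. ('four', 'c') and ('two', 'e')
print(summed.isna())

# Assumption: treating those remaining cells as 0 is acceptable for this data
filled = summed.fillna(0)
print(filled)
```

If a remaining NaN should instead signal "no data in either source", stop at summed and skip the fillna step.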

# Use the reindex method
data1 = pd.DataFrame(np.arange(9).reshape((3,3)),columns = list('bcd'),index = ['one','two','three'])
data2 = pd.DataFrame(np.arange(12).reshape((4,3)),columns = list('bed'),index = ['one','two','three','four'])
print(data2.reindex(columns = data1.columns,fill_value=0))
'''
       b  c   d
one    0  0   2
two    3  0   5
three  6  0   8
four   9  0  11

'''

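Reindexing fills in labels that the object is missing, but it only conforms the object to the target labels. Any label not in the target (here, column e) is dropped.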
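If the goal is to keep every label from both objects, one option is to reindex both frames onto the union of their labels before adding. The sketch below uses Index.union for that. The names all_rows, all_cols, left, right, and total are illustrative, and filling with 0 is an assumption about what a missing cell should mean.

```python
import numpy as np
import pandas as pd

data1 = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     columns=list('bcd'), index=['one', 'two', 'three'])
data2 = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     columns=list('bed'), index=['one', 'two', 'three', 'four'])

# Collect every row label and every column label that appears in either frame
all_rows = data1.index.union(data2.index)
all_cols = data1.columns.union(data2.columns)

# Conform both frames to the same labels, filling the gaps with 0
left = data1.reindex(index=all_rows, columns=all_cols, fill_value=0)
right = data2.reindex(index=all_rows, columns=all_cols, fill_value=0)

# Both operands now share identical labels, so no NaN is introduced
total = left + right
print(total)
```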

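Operations Between Series and DataFrame

By default, arithmetic between the two matches the Series index against the DataFrame's columns and broadcasts down the rows. If the labels differ, the result is reindexed to the union of the labels.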

import pandas as pd
import numpy as np
data = pd.DataFrame(np.arange(16).reshape(4,4),
                index=['one','two','three','four'],
                columns=['first','second','third','forth'])
series = data.iloc[0]  # first row by position; .ix was removed in pandas 1.0
series
'''
first     0
second    1
third     2
forth     3
Name: one, dtype: int32
'''

data-series

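The result is as follows:

'''
       first  second  third  forth
one        0       0      0      0
two        4       4      4      4
three      8       8      8      8
four      12      12     12     12
'''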

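* One thing I had not noticed before: indexing a DataFrame directly with [] selects columns, so rows have to be selected with an indexer instead. (I hit an error here without tracking down the cause; most likely data[0] was looking for a column labelled 0, which does not exist, and raised a KeyError.) On current pandas, use .loc (by label) or .iloc (by position); the .ix indexer was deprecated in pandas 0.20 and removed in 1.0.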

Sorting and Ranking

To sort by index, use the sort_index method.

obj = pd.Series(range(4),index=['d','b','a','c'])
obj.sort_index()
'''
a    2
b    1
c    3
d    0
dtype: int64

'''

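To sort by values, use the sort_values method (older tutorials call order here):

* When I tried order, the error kept saying that Series has no attribute 'order', which was frustrating. The cause is that Series.order was deprecated in pandas 0.17 and removed in 0.20; sort_values is its replacement.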
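As a minimal sketch of sorting by value with sort_values, the example below uses a small Series whose values are made up for illustration (they are not from the original post):

```python
import pandas as pd

# Illustrative values; the index labels match the sort_index example above
obj = pd.Series([4, 7, -3, 2], index=['d', 'b', 'a', 'c'])

# Ascending order by value (the default)
by_value = obj.sort_values()
print(by_value)

# Descending order by value
by_value_desc = obj.sort_values(ascending=False)
print(by_value_desc)
```

sort_values returns a new Series and leaves obj unchanged.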

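The same applies to a DataFrame, which can be sorted by its row index (axis=0) or by its column index (axis=1).

Sorting is ascending by default. For descending order, pass ascending=False (the default is ascending=True).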

frame = pd.DataFrame(np.arange(8).reshape(2,4),index=['three','one'],columns=['d','a','b','c'])
frame
'''
       d  a  b  c
three  0  1  2  3
one    4  5  6  7
'''

frame.sort_index(axis=1,ascending=False)

The result is:

'''
       d  c  b  a
three  0  3  2  1
one    4  7  6  5
'''

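To sort a DataFrame by the values of one or more columns, pass the by argument to sort_values (older pandas versions accepted sort_index(by=...), which current versions no longer support):

frame = pd.DataFrame({'b':[1,2,4,3],'c':[24,56,95,3]})
print(frame)
print("Sorted by 'b':")
print(frame.sort_values(by='b'))

'''
   b   c
0  1  24
1  2  56
2  4  95
3  3   3
Sorted by 'b':
   b   c
0  1  24
1  2  56
3  3   3
2  4  95
'''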
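The example above sorts by a single column. Sorting by several columns works by passing a list to by, as in the sketch below. The data is made up so that column b contains ties, and the names multi and mixed are illustrative.

```python
import pandas as pd

# Illustrative data with repeated values in 'b'
frame = pd.DataFrame({'b': [2, 1, 2, 1], 'c': [24, 56, 3, 95]})

# Sort by 'b' first, then break ties in 'b' using 'c'
multi = frame.sort_values(by=['b', 'c'])
print(multi)

# A separate direction per column: 'b' ascending, 'c' descending
mixed = frame.sort_values(by=['b', 'c'], ascending=[True, False])
print(mixed)
```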

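Calling rank on a Series or DataFrame returns a new Series or DataFrame of ranks. Its main parameter is method, which controls how ties are broken. The default, method='average', gives tied elements the mean of the ranks they span: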

a = pd.Series([3,5,2,7,3,4,9])
a.rank()

'''
0    2.5
1    5.0
2    1.0
3    6.0
4    2.5
5    4.0
6    7.0
dtype: float64
'''

Ties can also be ranked in order of appearance instead of being averaged:

a = pd.Series([3,5,2,7,3,4,9])
a.rank(method='first')

'''
0    2.0
1    5.0
2    1.0
3    6.0
4    3.0
5    4.0
6    7.0
dtype: float64
'''

Alternatively, every element in a tied group can receive the group's largest rank:

a = pd.Series([3,5,2,7,3,4,9])
a.rank(method='max')

'''
0    3.0
1    5.0
2    1.0
3    6.0
4    3.0
5    4.0
6    7.0
dtype: float64
'''

Using the smallest rank works the same way: just change max to min.
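For completeness, here is a short sketch of method='min' on the same Series. It also shows two options that go beyond the original post: method='dense', which leaves no gaps after a tie, and ascending=False, which ranks from the largest value down. The variable names are illustrative.

```python
import pandas as pd

a = pd.Series([3, 5, 2, 7, 3, 4, 9])

# Every element in a tied group receives the group's smallest rank
rank_min = a.rank(method='min')
print(rank_min)

# Like 'min', but the next distinct value gets the next consecutive rank
rank_dense = a.rank(method='dense')
print(rank_dense)

# Rank from largest to smallest
rank_desc = a.rank(method='min', ascending=False)
print(rank_desc)
```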

Indexes with Duplicate Labels

pandas allows duplicate index labels. Selecting such a label simply returns every element that carries it.

a = pd.Series([1,2,3,4],index=['a','b','a','d'])
a['a']

'''
a    1
a    3
dtype: int64
'''
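One practical consequence is that the type of the result depends on whether the label is duplicated. The sketch below, using the same Series, checks for duplicates with Index.is_unique and Index.duplicated and contrasts the two cases; the variable names are illustrative.

```python
import pandas as pd

a = pd.Series([1, 2, 3, 4], index=['a', 'b', 'a', 'd'])

# is_unique tells you whether any label is repeated
print(a.index.is_unique)

# A duplicated label returns a Series; a unique label returns a scalar
dup = a['a']
single = a['b']
print(type(dup), type(single))

# duplicated() marks the second and later occurrences of each label
dup_mask = a.index.duplicated()
print(dup_mask)
```

Code that expects a scalar from a['a'] will therefore behave differently once a label is repeated, so checking is_unique first can be worthwhile.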
