[pandas] Summary of methods

pandas rolling method (DataFrame.rolling / Series.rolling)

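window: the size of the moving window, given either as an int or as an offset. An int is the number of observations used for each statistic, i.e. the current row plus the rows just before it. An offset (for example '2D') is a time span and requires a datetime-like index or `on` column. See the pandas documentation on offsets for the available values.
min_periods: the minimum number of observations in the window that must have a value; otherwise the result is NaN. For an int window it defaults to the window size. For an offset window it defaults to 1.
freq: deprecated since version 0.18.
center: whether to use the middle of the window as the label; defaults to False. Older pandas releases accept it only when window is an int.
win_type: the window type; defaults to None and usually does not need to be set. See the pandas documentation for the other supported window types.

on: for a DataFrame, the column to roll on when the index should not be used.
closed: which endpoints of the window interval are closed. Offset windows default to 'right', i.e. left-open and right-closed; 'left', 'both', or 'neither' can be specified as needed. Older releases implemented this only for offset windows; according to the pandas documentation, int windows are supported from 1.2.0 on.
axis: the direction (axis) to roll along; almost always 0.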

DataFrame.rolling(window, min_periods=None, freq=None, center=False, win_type=None, on=None, axis=0, closed=None)

The rolling method fixes a window over the data so that you can compute sum, mean, and other statistics inside that window.
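
To make the window idea concrete, here is a minimal sketch that compares rolling(3).sum() with the same sums written out by hand. The names values, window, and manual_sum are illustrative and not part of pandas.

```python
import numpy as np
import pandas as pd

values = pd.Series([0, 1, 2, 3, 4])
window = 3

# Built-in rolling sum over the current row and the two rows before it.
rolling_sum = values.rolling(window).sum()

# The same computation written out by hand (illustrative only).
manual = []
for i in range(len(values)):
    if i + 1 < window:
        # Fewer than `window` observations so far: no result yet.
        manual.append(np.nan)
    else:
        manual.append(values.iloc[i + 1 - window : i + 1].sum())
manual_sum = pd.Series(manual, dtype="float64")

print(rolling_sum.equals(manual_sum))
```

The first window - 1 rows are NaN because min_periods defaults to the window size for an int window.
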
Example

import pandas as pd
import numpy as np

# Create a date index
index = pd.date_range('2019-01-01', periods=5)
# Create a simple DataFrame
data = pd.DataFrame(np.arange(len(index)), index=index, columns=['test'])
print(data)
# Sum over a moving window of 3 values
data['sum'] = data.test.rolling(3).sum()
print('-----')
print(data['sum'])
# Mean over a moving window of 3 values
data['mean'] = data.test.rolling(3).mean()
print('-----')
print(data['mean'])
# Same mean, but a result is produced as soon as 2 values are available
data['mean1'] = data.test.rolling(3, min_periods=2).mean()
print('-----')
print(data['mean1'])

            test
2019-01-01     0
2019-01-02     1
2019-01-03     2
2019-01-04     3
2019-01-05     4
-----
2019-01-01    NaN
2019-01-02    NaN
2019-01-03    3.0
2019-01-04    6.0
2019-01-05    9.0
Freq: D, Name: sum, dtype: float64
-----
2019-01-01    NaN
2019-01-02    NaN
2019-01-03    1.0
2019-01-04    2.0
2019-01-05    3.0
Freq: D, Name: mean, dtype: float64
-----
2019-01-01    NaN
2019-01-02    0.5
2019-01-03    1.0
2019-01-04    2.0
2019-01-05    3.0
Freq: D, Name: mean1, dtype: float64
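
The example above uses an int window. As a hedged sketch of the offset form, the snippet below rolls over a '2D' time span on a datetime column selected with on, and contrasts it with an int window and with closed='left'. The column name when and the irregular dates are made up for illustration.

```python
import pandas as pd

# Hypothetical data with a gap between 2019-01-02 and 2019-01-05.
df = pd.DataFrame({
    "when": pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-05", "2019-01-06"]),
    "test": [0, 1, 2, 3],
})

# Offset window: each row covers the time span (t - 2 days, t].
# min_periods defaults to 1 here, so the first row already has a value.
by_offset = df.rolling("2D", on="when").sum()["test"]

# Int window: each row covers the last 2 observations, whatever their dates.
by_count = df["test"].rolling(2).sum()

# closed='left' shifts the span to [t - 2 days, t), excluding the current row.
left_closed = df.rolling("2D", on="when", closed="left").sum()["test"]

print(pd.DataFrame({"offset": by_offset, "count": by_count, "left": left_closed}))
```

The offset window never reaches across the three-day gap, while the int window simply takes the previous row.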


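pandas.cut method

Purpose

pandas.cut splits a set of values into discrete intervals. For example, given a set of ages, pandas.cut can divide them into age groups and attach a label to each group.

Prototype

pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise')  # pandas 0.23.4

Parameters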

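x: the array-like data to be binned; it must be 1-dimensional (a DataFrame cannot be passed);
bins: the intervals to cut into (also called "buckets" or "bins"). Three forms are accepted: an int scalar, a sequence of scalars (an array), or a pandas.IntervalIndex.

An int scalar
When bins is an int scalar, x is divided into that many equal-width intervals. According to the documentation, the range of x is extended by 0.1% on each side so that the minimum and maximum of x are included.
A sequence of scalars
The sequence defines the edges of each bin; in this case the range of x is not extended.
pandas.IntervalIndex
Defines the exact intervals to use.
right: bool, default True; whether each interval includes its right edge. For example, with bins=[1,2,3], right=True gives the intervals (1,2], (2,3]; right=False gives [1,2), [2,3).
labels: labels for the resulting bins. For example, after cutting ages x into age groups, the groups can be labelled "youth", "middle-aged", and so on. The number of labels must equal the number of intervals: bins=[1,2,3] produces the 2 intervals (1,2], (2,3], so labels must have length 2. With labels=False, the result is the index (starting from 0) of the bin each value falls into.
retbins: bool, default False; whether to also return the bin edges. This is most useful when bins is an int scalar, because it reveals the computed intervals.
precision: the number of decimal places kept in the interval labels; defaults to 3.
include_lowest: bool, default False; whether the first interval is closed on the left. With the default, the left edge of the first interval is open, so a value equal to the lowest edge is not assigned to any bin.
duplicates: what to do when the bin edges are not unique. Two choices: 'raise' (the default) raises a ValueError; 'drop' discards the duplicate edges.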
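
The effect of right, include_lowest, and duplicates is easiest to see on values that sit exactly on the bin edges. The following is a small sketch; the variable names are illustrative, and the integer codes of the returned Categorical are used to show bin membership, with -1 marking a value that fell outside every bin.

```python
import pandas as pd

values = [1, 2, 3]
edges = [1, 2, 3]

# Default: right-closed intervals (1, 2], (2, 3]; the value 1 is in no bin.
right_closed = pd.cut(values, edges)

# include_lowest=True also closes the first interval on the left.
with_lowest = pd.cut(values, edges, include_lowest=True)

# right=False: left-closed intervals [1, 2), [2, 3); the value 3 is in no bin.
left_closed = pd.cut(values, edges, right=False)

print(list(right_closed.codes))
print(list(with_lowest.codes))
print(list(left_closed.codes))

# Repeated edges raise a ValueError unless duplicates='drop' is given.
try:
    pd.cut(values, [1, 2, 2, 3])
    raised = False
except ValueError:
    raised = True
deduped = pd.cut(values, [1, 2, 2, 3], duplicates="drop")
print(raised, list(deduped.codes))
```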

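Return values

out: a pandas.Categorical, Series, or ndarray indicating which bin (interval) each value of x falls into; if labels is specified, the corresponding label is returned instead.
bins: the bin edges, returned only when retbins=True.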
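
Which of the three return types comes back depends on the input and on labels. A minimal sketch, using a shortened made-up ages array and hypothetical edges:

```python
import numpy as np
import pandas as pd

ages = np.array([1, 5, 10, 40])
edges = [0, 5, 20, 50]

from_array = pd.cut(ages, edges)              # ndarray in, Categorical out
from_series = pd.cut(pd.Series(ages), edges)  # Series in, Series (category dtype) out
as_codes = pd.cut(ages, edges, labels=False)  # labels=False, ndarray of bin indices out

print(type(from_array).__name__)
print(type(from_series).__name__, from_series.dtype)
print(type(as_codes).__name__, as_codes)
```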

Example

Grouping ages serves as the example here.

import numpy as np
import pandas as pd

ages = np.array([1,5,10,40,36,12,58,62,77,100])  # age data

Split ages into 5 equal-width intervals

r1 = pd.cut(ages, 5)
print(r1)

Output:

[(0.901, 20.8], (0.901, 20.8], (0.901, 20.8], (20.8, 40.6], (20.8, 40.6], (0.901, 20.8], (40.6, 60.4], (60.4, 80.2], (60.4, 80.2], (80.2, 100.0]]
Categories (5, interval[float64]): [(0.901, 20.8] < (20.8, 40.6] < (40.6, 60.4] < (60.4, 80.2] <
                                    (80.2, 100.0]]

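ages has been split into 5 equal-width intervals. Note that the lowest edge is 0.901 rather than 1: the range was extended slightly so that the minimum value 1 falls inside the first right-closed interval, while the maximum 100 is covered by the closed right edge of the last one.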
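
To inspect the computed edges directly rather than reading them off the interval labels, retbins=True can be combined with an int bins. A short sketch, reusing the ages array from above:

```python
import numpy as np
import pandas as pd

ages = np.array([1, 5, 10, 40, 36, 12, 58, 62, 77, 100])

# With retbins=True, pd.cut returns a (result, edges) pair.
_, edges = pd.cut(ages, 5, retbins=True)

print(edges)           # 6 edges delimit the 5 intervals
print(np.diff(edges))  # widths; only the first interval is slightly wider
```

The first edge sits at about 0.901, i.e. the minimum lowered by 0.1% of the range 100 - 1 = 99.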

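Split ages into 5 equal-width intervals and specify labels

r2 = pd.cut(ages, 5, labels=["infant", "youth", "middle-aged", "prime", "senior"])
print(r2)

Output:

[infant, infant, infant, youth, youth, infant, middle-aged, prime, prime, senior]
Categories (5, object): [infant < youth < middle-aged < prime < senior]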

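Split ages at explicitly specified edges

r3 = pd.cut(ages, [0,5,20,30,50,100], labels=["infant", "youth", "middle-aged", "prime", "senior"])
print(r3)

[infant, infant, youth, prime, prime, youth, senior, senior, senior, senior]
Categories (5, object): [infant < youth < middle-aged < prime < senior]

Here ages is no longer split into equal widths; instead it is divided into the 5 intervals (0, 5], (5, 20], (20, 30], (30, 50], (50, 100].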
Return the bin edges as well
Simply set retbins=True.

r4 = pd.cut(ages, [0,5,20,30,50,100], labels=["infant", "youth", "middle-aged", "prime", "senior"], retbins=True)
print(r4)

([infant, infant, youth, prime, prime, youth, senior, senior, senior, senior]
Categories (5, object): [infant < youth < middle-aged < prime < senior], array([  0,   5,  20,  30,  50, 100]))

Return only the bin index of each value
Simply set labels=False.

r5 = pd.cut(ages, [0,5,20,30,50,100], labels=False)
print(r5)

[0 0 1 3 3 1 4 4 4 4]

The first 0 means that the first age, 1, falls in bin 0.
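
The examples above cover the int and sequence forms of bins. For the third form listed in the parameter section, a pandas.IntervalIndex, a minimal sketch might look like this; the interval boundaries (0, 18, 65, 100) are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

ages = np.array([1, 5, 10, 40, 36, 12, 58, 62, 77, 100])

# from_tuples builds right-closed intervals by default: (0, 18], (18, 65], (65, 100].
bins = pd.IntervalIndex.from_tuples([(0, 18), (18, 65), (65, 100)])

groups = pd.cut(ages, bins)

# Count how many ages landed in each interval.
counts = pd.Series(groups).value_counts(sort=False)
print(counts)
```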

pandas rename method (Series.rename)

Changes the name of a Series, i.e. the name shown for the column.

data = pd.DataFrame(ages, columns=['test'])
r6 = pd.cut(data['test'], [0,5,20,30,50,100], labels=False)
print(r6)
# rename returns a new Series named 's'; r6 itself is unchanged
print(r6.rename('s'))
print(type(r6))

0    0
1    0
2    1
3    3
4    3
5    1
6    4
7    4
8    4
9    4
Name: test, dtype: int64
0    0
1    0
2    1
3    3
4    3
5    1
6    4
7    4
8    4
9    4
Name: s, dtype: int64
<class 'pandas.core.series.Series'>
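
Series.rename with a scalar changes only the name of the Series. To rename columns on a DataFrame itself, DataFrame.rename takes a columns mapping. A brief sketch of both, with made-up data and a hypothetical new column name age:

```python
import pandas as pd

data = pd.DataFrame({"test": [1, 5, 10]})

# Series.rename with a scalar returns a new Series with a different name.
renamed_series = data["test"].rename("s")

# DataFrame.rename with a columns mapping returns a new DataFrame
# with relabelled columns; the original is left untouched.
renamed_frame = data.rename(columns={"test": "age"})

print(renamed_series.name)
print(list(renamed_frame.columns), list(data.columns))
```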


References

https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.cut.html
