DataFrame rolling apply: multiple columns in, multiple columns out

Original post

2020-02-24 23:16

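The apply that follows a pandas DataFrame rolling call only handles one column at a time. Even if you pass several columns in through a lambda, it still cannot return several columns. I considered having the apply function work directly on the outer DataFrame. That would work, but it feels hacky and is probably not efficient.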
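A minimal sketch of that limitation, using a hypothetical two-column frame (toy and probe are illustrative names, not part of the solution below). The built-in rolling apply hands the function one column's window at a time and expects a single scalar back, so the function never sees both columns together:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, only for illustration.
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0, 5.0],
                    "b": [10.0, 20.0, 30.0, 40.0, 50.0]})

seen_shapes = []

def probe(window_values):
    # With raw=True each call receives a 1-D ndarray: one column's window.
    seen_shapes.append(window_values.shape)
    return window_values.sum()  # a single scalar per call

out = toy.rolling(3).apply(probe, raw=True)
print(set(seen_shapes))
print(out)
```

Every recorded shape is one-dimensional, which is why a function that needs C and D at once has to be fed some other way.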

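I ran into this problem while writing a vectorized backtest. It is a fairly niche problem, so if you hit it too, this solution may help. The approach comes from Stack Overflow, with some modifications of my own.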

Compared with the regular rolling, this roll effectively defaults to min_periods = window, and it only supports two-dimensional data.

One more thing to note: the DataFrame passed into the apply function has a MultiIndex.
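To see where that MultiIndex comes from, here is a small sketch with hypothetical toy windows (w1, w2 and the row labels are made up for illustration). Concatenating a dict of DataFrames prepends the dict keys as an outer index level, and each group produced by groupby(level=0) keeps both levels:

```python
import pandas as pd

# Hypothetical toy windows, only for illustration.
w1 = pd.DataFrame({"C": [1.0, 2.0]}, index=["r0", "r1"])
w2 = pd.DataFrame({"C": [2.0, 3.0]}, index=["r1", "r2"])

# pd.concat on a dict adds the dict keys as an outer index level.
stacked = pd.concat({"r1": w1, "r2": w2})

groups = {key: group for key, group in stacked.groupby(level=0)}
for key, group in groups.items():
    # Each group still carries (window label, original row label).
    print(key, group.index.nlevels, list(group.droplevel(0).index))
```

If the apply function needs the original row labels, dropping the outer level with droplevel(0) is one way to recover them.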

import pandas as pd
from numpy.lib.stride_tricks import as_strided as stride

dates = pd.date_range('20130101', periods=13, freq='D')
df = pd.DataFrame({'C': [1.6, 4.1, 2.7, 4.9, 5.4, 1.3, 6.6, 9.6, 3.5, 5.4, 10.1, 3.08, 5.38]}, index=dates)
df.index.name = 'datetime'

def roll(df: pd.DataFrame, window: int, **kwargs):
    """
    Rolling over multiple columns of a 2-D pd.DataFrame.
    * the returned groupby can apply a function that returns a pd.Series with multiple columns

    Reference:
    https://stackoverflow.com/questions/38878917/how-to-invoke-pandas-rolling-apply-with-parameters-from-multiple-column

    :param df: 2-D DataFrame to roll over
    :param window: number of rows per window
    :param kwargs: extra keyword arguments forwarded to groupby
    :return: a groupby object with one group per window, keyed by the window's last index label
    """

    # move the index into the value array as column 0
    v = df.reset_index().values

    dim0, dim1 = v.shape
    stride0, stride1 = v.strides

    # view of shape (n_windows, window, n_columns); reusing stride0 for the
    # first axis makes consecutive windows overlap without copying data
    stride_values = stride(v, (dim0 - (window - 1), window, dim1), (stride0, stride0, stride1))

    rolled_df = pd.concat({
        row: pd.DataFrame(values[:, 1:], columns=df.columns, index=values[:, 0].flatten())
        for row, values in zip(df.index[window - 1:], stride_values)
    })

    return rolled_df.groupby(level=0, **kwargs)

def own_func(df):
    """
    Attention: df has a MultiIndex.
    :param df: one window of rows
    :return: pd.Series holding mean(C) and max(C) + min(D)
    """

    return pd.Series([df["C"].mean(), df["C"].max() + df["D"].min()])

Test run results:

print(df)
                C
datetime
2013-01-01   1.60
2013-01-02   4.10
2013-01-03   2.70
2013-01-04   4.90
2013-01-05   5.40
2013-01-06   1.30
2013-01-07   6.60
2013-01-08   9.60
2013-01-09   3.50
2013-01-10   5.40
2013-01-11  10.10
2013-01-12   3.08
2013-01-13   5.38
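# own_func also reads column D; these values are taken from the printed output below
df["D"] = [5.4, 3.2, 8.8, 3.6, 12.6, 9.3, 11.8, 8.9, 4.6, 1.9, 0.1, 8.02, 3.8]
df[["C_mean", "C+D"]] = roll(df, 5).apply(own_func)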

print(df)
                C      D  C_mean   C+D
datetime
2013-01-01   1.60   5.40     NaN   NaN
2013-01-02   4.10   3.20     NaN   NaN
2013-01-03   2.70   8.80     NaN   NaN
2013-01-04   4.90   3.60     NaN   NaN
2013-01-05   5.40  12.60   3.740   8.6
2013-01-06   1.30   9.30   3.680   8.6
2013-01-07   6.60  11.80   4.180  10.2
2013-01-08   9.60   8.90   5.560  13.2
2013-01-09   3.50   4.60   5.280  14.2
2013-01-10   5.40   1.90   5.280  11.5
2013-01-11  10.10   0.10   7.040  10.2
2013-01-12   3.08   8.02   6.336  10.2
2013-01-13   5.38   3.80   5.492  10.2

Testing shows that stride itself is fast, but concat is slow. pandas operations in general are slow, and I am not sure how this could be optimized.

The pandas concat plus groupby path was unacceptably slow, so I wrote another version in pure NumPy, which is much faster.

import numpy as np  # needed for np.full / np.nan below


def roll_np(df: pd.DataFrame, apply_func: callable, window: int, return_col_num: int, **kwargs):
    """
    Rolling over multiple columns of a 2-D pd.DataFrame.
    * apply_func may return several values per window

    Calls apply_func with a numpy ndarray for each window.
    :param return_col_num: number of columns returned by apply_func
    :param apply_func: function called once per window
    :param df: 2-D DataFrame to roll over
    :param window: number of rows per window
    :param kwargs: extra keyword arguments forwarded to apply_func
    :return: ndarray of shape (len(df), return_col_num), NaN before the first full window
    """

    # move the index into the value array as column 0
    v = df.reset_index().values

    dim0, dim1 = v.shape
    stride0, stride1 = v.strides

    stride_values = stride(v, (dim0 - (window - 1), window, dim1), (stride0, stride0, stride1))

    result_values = np.full((dim0, return_col_num), np.nan)

    for idx, values in enumerate(stride_values, window - 1):
        # values: column 0 is the index, the remaining columns are the data
        result_values[idx,] = apply_func(values, **kwargs)

    return result_values

def own_func_np(narr, **kwargs):
    """
    :param narr: one window as a 2-D ndarray (column 0 is the index)
    :return: mean(C) and max(C) + min(D)
    """

    c = narr[:, 1]
    d = narr[:, 2]
    return np.mean(c), np.max(c) + np.min(d)

Test run results:

return_values = roll_np(df, own_func_np, 3, 2)
df["C_mean_np"] = return_values[:, 0]
df["C+D_np"] = return_values[:, 1]

print(df)
                C      D  C_mean_np  C+D_np
datetime
2013-01-01   1.60   5.40        NaN     NaN
2013-01-02   4.10   3.20        NaN     NaN
2013-01-03   2.70   8.80   2.800000     7.3
2013-01-04   4.90   3.60   3.900000     8.1
2013-01-05   5.40  12.60   4.333333     9.0
2013-01-06   1.30   9.30   3.866667     9.0
2013-01-07   6.60  11.80   4.433333    15.9
2013-01-08   9.60   8.90   5.833333    18.5
2013-01-09   3.50   4.60   6.566667    14.2
2013-01-10   5.40   1.90   6.166667    11.5
2013-01-11  10.10   0.10   6.333333    10.2
2013-01-12   3.08   8.02   6.193333    10.2
2013-01-13   5.38   3.80   6.186667    10.2
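The results agree with the pandas version (this run uses a window of 3 rather than 5, which is why the numbers differ from the earlier table). You do have to slice the index and the results out of the NumPy 2-D array yourself, but that should be a minor matter.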
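One way to avoid slicing the index out by hand is to window only the numeric columns and write the results back by position. The sketch below is an alternative under stated assumptions, not the code above: it swaps as_strided for numpy.lib.stride_tricks.sliding_window_view (available in NumPy 1.20 and later), and roll_windows and toy are hypothetical names. The toy frame reuses the first four rows of C and D from the tables above.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

def roll_windows(frame: pd.DataFrame, window: int) -> np.ndarray:
    """Return a read-only view of shape (n - window + 1, window, n_columns)."""
    values = frame.to_numpy(dtype=float)  # numeric columns only, index left out
    # sliding_window_view appends the window axis last:
    # (n - window + 1, n_columns, window)
    windows = sliding_window_view(values, window, axis=0)
    return windows.transpose(0, 2, 1)

toy = pd.DataFrame({"C": [1.6, 4.1, 2.7, 4.9], "D": [5.4, 3.2, 8.8, 3.6]})
window = 3

result = np.full((len(toy), 2), np.nan)  # NaN until the first full window
for pos, block in enumerate(roll_windows(toy, window), start=window - 1):
    c, d = block[:, 0], block[:, 1]
    result[pos] = c.mean(), c.max() + d.min()

toy["C_mean"] = result[:, 0]
toy["C+D"] = result[:, 1]
print(toy)
```

Because the index never enters the value array, the array keeps a float dtype instead of falling back to object, and the column positions inside the window function start at 0 rather than 1.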

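Looking back at the code, the most important function in it is stride. Strides are also at the core of NumPy and a key reason it is so fast. NumPy stores array data in a contiguous block of memory, so its low-level operations address memory directly; the strides say how many bytes to advance to reach the next row or the next column. Access like this is very fast.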
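A small sketch of what the strides mean in practice, on a hypothetical 4 x 3 float array (the array and the window size are illustrative only). Reusing the row stride for a new leading axis is the same trick the roll functions rely on:

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

a = np.arange(12, dtype=np.float64).reshape(4, 3)  # C-contiguous, 8 bytes per element
print(a.strides)  # (24, 8): next row is 24 bytes away, next column 8 bytes

window = 2
row_step, col_step = a.strides
# The new first axis also advances by one row, so consecutive windows
# overlap by window - 1 rows and no data is copied.
windows = as_strided(
    a,
    shape=(a.shape[0] - window + 1, window, a.shape[1]),
    strides=(row_step, row_step, col_step),
    writeable=False,  # guard against accidental writes through the view
)
print(windows.shape)                 # (3, 2, 3)
print(np.shares_memory(a, windows))  # True
```

as_strided does no bounds checking, so a wrong shape or stride can read memory outside the array; the shape arithmetic has to be right.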

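So when working with NumPy, slicing operates on the original array (it returns a view rather than a copy) and is fast. Avoid creating new arrays where you can, and avoid append-style operations, which copy memory over and over and slow things down.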
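The difference can be checked directly. This sketch uses a hypothetical six-element array to show that a basic slice shares memory with its source while np.append allocates a new array, and that preallocating the output once avoids appending inside a loop:

```python
import numpy as np

base = np.arange(6, dtype=np.float64)

view = base[1:4]              # basic slice: a view onto the same buffer
grown = np.append(base, 6.0)  # np.append always allocates a new array

print(np.shares_memory(base, view))   # True
print(np.shares_memory(base, grown))  # False

# Preallocate once and fill in place instead of appending inside a loop.
out = np.full(base.shape, np.nan)
for i in range(2, len(base)):
    out[i] = base[i - 2:i + 1].mean()  # 3-element rolling mean over a view
print(out)
```

This is the same pattern roll_np uses: one np.full call up front, then in-place assignment per window.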
