[pandas study notes] - Performance differences between column-processing approaches

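This note borrows the test case from the article 《還在抱怨pandas運行速度慢？這幾個方法會顛覆你的看法》 ("Still complaining that pandas is slow? These methods will change your mind").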

https://www.jianshu.com/p/ef690275390c

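Case:
Build a DataFrame covering ten years of data at hourly intervals.
Split the 24 hours of each day into three equal parts (hours 0-7, 8-15 and 16-23) and tag each row with the matching value (1, 2 or 3).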
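
As a quick illustration of the tagging rule, the sketch below applies it to a handful of sample hours. The helper name `tag_for_hour` and the sample values are assumptions made for illustration; integer division by 8 is just one compact way to state the rule and is not one of the variants timed in this note.

```python
def tag_for_hour(hour: int) -> int:
    """Hypothetical helper: map an hour 0-23 to tag 1, 2 or 3 (8 hours per tag)."""
    if not 0 <= hour < 24:
        raise ValueError(f"hour out of range: {hour}")
    return hour // 8 + 1


# Boundary hours of each of the three blocks
sample_hours = [0, 7, 8, 15, 16, 23]
sample_tags = [tag_for_hour(h) for h in sample_hours]
print(dict(zip(sample_hours, sample_tags)))
```

Hours 0 and 7 fall in the first block, 8 and 15 in the second, and 16 and 23 in the third.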

# -*- coding: utf-8 -*-
"""
Created on Tue Feb  4 14:19:45 2020

@author: Administrator
"""

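import functools
import time

import numpy as np
import pandas as pd


def time_elapse(fn):
    """Decorator: print the wall-clock time of a single call to fn."""
    @functools.wraps(fn)
    def _wrapper(*args, **kwargs):
        start = time.perf_counter_ns()
        result = fn(*args, **kwargs)
        print(f"{fn.__name__} cost {(time.perf_counter_ns() - start) / 1_000_000_000} s")
        return result
    return _wrapper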


# One timestamp per hour from 2010-01-01 to 2020-01-01.
# Note: pandas 2.2+ deprecates the 'H' frequency alias in favour of 'h'.
hours = pd.date_range('20100101', '20200101', freq='1H')
df = pd.DataFrame({"Time": hours, "Hour": hours.hour})

def f(hour):
    """Return the tag for an hour: 0-7 -> 1, 8-15 -> 2, 16-23 -> 3 (0 otherwise)."""
    c = 0
    if 0 <= hour < 8:
        c = 1
    elif 8 <= hour < 16:
        c = 2
    elif 16 <= hour < 24:
        c = 3
    return c

# 266s
# Explicit loop: read each row with iloc, write each cell with loc
@time_elapse
def f1():
    df["Tag"] = 0
    for i in range(len(df)):
        h = df.iloc[i]["Hour"]
        df.loc[i, "Tag"] = f(h)

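# 35s (measured with the chained form df.iloc[i]["Tag"] = ..., which
# assigns to a temporary row copy and never writes back to df)
# Explicit loop: read and write with iloc
@time_elapse
def f1_1():
    df["Tag"] = 0
    tag_col = df.columns.get_loc("Tag")  # integer position of the Tag column
    for i in range(len(df)):
        h = df.iloc[i]["Hour"]
        # A single iloc call with (row, column) positions actually updates df
        df.iloc[i, tag_col] = f(h)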

# 8.5s
# iterrows, collecting the results in a list
@time_elapse
def f2():
    df["Tag"] = 0
    c = []
    for index, row in df.iterrows():
        h = row["Hour"]
        c.append(f(h))
    df["Tag"] = c
    
# 0.35s
# itertuples, collecting the results in a list
@time_elapse
def f2_1():
    df["Tag"] = 0
    c = []
    for row in df.itertuples():
        h = row.Hour
        c.append(f(h))
    df["Tag"] = c
    
# 0.035s
# Series.apply
@time_elapse
def f3():
    df["Tag"] = df.Hour.apply(f)
    
# 0.0129084 s
# Boolean masks with column-wise assignment
@time_elapse
def f4():
    index_1 = df.Hour.isin(range(0, 8))
    index_2 = df.Hour.isin(range(8, 16))
    index_3 = df.Hour.isin(range(16, 24))
    
    df.loc[index_1, "Tag"] = 1
    df.loc[index_2, "Tag"] = 2
    df.loc[index_3, "Tag"] = 3
    
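# 0.0051495 s
# pd.cut
@time_elapse
def f5():
    # right=False gives left-closed bins [0, 8), [8, 16), [16, 24), matching f().
    # With the default right-closed bins, hours 8 and 16 would land in the lower tag.
    df["Tag"] = pd.cut(x=df.Hour, bins=[0, 8, 16, 24],
                       right=False, labels=[1, 2, 3]).astype(int)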

# 0.001368 s
# NumPy: np.digitize returns the bin index (0, 1 or 2), used to look up the tag
@time_elapse
def f6():
    c = np.array([1, 2, 3])
    df["Tag"] = c[np.digitize(df.Hour, bins=[8, 16, 24])]
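
The faster variants are only useful if they produce the same tags as `f`, so it can be worth checking them against each other on a small frame. The sketch below is not part of the original benchmark: the names `small`, `reference`, `cut_left`, `cut_default`, `digitized` and `int_div` are made up for illustration, it uses a single day of hours rather than ten years, and the integer-division variant is an extra that this note does not time.

```python
import numpy as np
import pandas as pd


def f(hour):
    # Same rule as the note's f(): 0-7 -> 1, 8-15 -> 2, 16-23 -> 3
    if 0 <= hour < 8:
        return 1
    elif 8 <= hour < 16:
        return 2
    elif 16 <= hour < 24:
        return 3
    return 0


# One day of hours is enough to exercise every bin boundary
small = pd.DataFrame({"Hour": np.arange(24)})
reference = small.Hour.apply(f).to_numpy()

# pd.cut with left-closed bins: [0, 8), [8, 16), [16, 24)
cut_left = pd.cut(small.Hour, bins=[0, 8, 16, 24], right=False,
                  labels=[1, 2, 3]).astype(int).to_numpy()
# pd.cut with the default right-closed bins: [0, 8], (8, 16], (16, 24]
cut_default = pd.cut(small.Hour, bins=[0, 8, 16, 24], include_lowest=True,
                     labels=[1, 2, 3]).astype(int).to_numpy()
# np.digitize returns 0, 1 or 2, which indexes into the tag array
digitized = np.array([1, 2, 3])[np.digitize(small.Hour, bins=[8, 16, 24])]
# Integer division: each block of 8 hours maps to the next tag
int_div = (small.Hour // 8 + 1).to_numpy()

print((cut_left == reference).all())
print((digitized == reference).all())
print((int_div == reference).all())
# Hours where right-closed bins disagree with f()
print(small.Hour[cut_default != reference].tolist())
```

The left-closed `pd.cut`, `np.digitize` and integer-division results all agree with `f`, while the right-closed `pd.cut` call differs at hours 8 and 16, which it places in the lower tag.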

Author: 飛翔的烤雞翅
