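This post builds on the test case from the article "Still complaining that pandas is slow? These methods will change your mind" (《還在抱怨pandas運行速度慢?這幾個方法會顛覆你的看法》):
https://www.jianshu.com/p/ef690275390c
Test case:
Build a DataFrame that covers ten years of data at hourly resolution.
Split the 24 hours of a day into three equal parts (0-7, 8-15, 16-23) and label each row with the corresponding tag.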
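To make the tagging rule concrete, here is a minimal sketch over the 24 hours of a single day. The integer-division shortcut `hour // 8 + 1` and the names `hours` and `tags` are illustrative assumptions, not part of the original benchmark.

```python
# Illustrative only: the tag each hour of the day is expected to receive.
hours = list(range(24))

# Each bucket is 8 hours wide, so integer division by 8 gives 0, 1 or 2.
tags = [h // 8 + 1 for h in hours]  # 0-7 -> 1, 8-15 -> 2, 16-23 -> 3

print(dict(zip(hours, tags)))
```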
# -*- coding: utf-8 -*-
"""
Created on Tue Feb 4 14:19:45 2020
@author: Administrator
"""
import pandas as pd
import numpy as np
import time
def time_elapse(fn):
    """Decorator that prints how long each call to fn takes, in seconds."""
    def _wrapper(*args, **kwargs):
        start = time.perf_counter_ns()
        result = fn(*args, **kwargs)
        print(f"{fn.__name__} cost {(time.perf_counter_ns() - start)/1_000_000_000} s")
        return result
    return _wrapper
# Note: pandas 2.2 and later deprecate the 'H' alias in favor of 'h'.
df = pd.DataFrame({
    "Time": [x for x in pd.date_range('20100101', '20200101', freq='1H')],
    "Hour": [x.hour for x in pd.date_range('20100101', '20200101', freq='1H')]})
def f(hour):
    """Map an hour of the day to its tag: 0-7 -> 1, 8-15 -> 2, 16-23 -> 3."""
    c = 0
    if 0 <= hour < 8:
        c = 1
    elif 8 <= hour < 16:
        c = 2
    elif 16 <= hour < 24:
        c = 3
    return c
# 266 s
# Uses a loop with loc
@time_elapse
def f1():
    df["Tag"] = 0
    for i in range(len(df)):
        h = df.iloc[i]["Hour"]
        df.loc[i, "Tag"] = f(h)
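# 35 s (as reported for the original chained form df.iloc[i]["Tag"] = ...)
# Uses a loop with iloc
@time_elapse
def f1_1():
    df["Tag"] = 0
    tag_col = df.columns.get_loc("Tag")
    for i in range(len(df)):
        h = df.iloc[i]["Hour"]
        # The chained form df.iloc[i]["Tag"] = ... writes to a temporary copy
        # of the row and leaves df unchanged, so index row and column together.
        df.iloc[i, tag_col] = f(h)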
# 8.5 s
# Uses iterrows and a list
@time_elapse
def f2():
    df["Tag"] = 0
    c = []
    for index, row in df.iterrows():
        h = row["Hour"]
        c.append(f(h))
    df["Tag"] = c
# 0.35 s
# Uses itertuples and a list
@time_elapse
def f2_1():
    df["Tag"] = 0
    c = []
    for row in df.itertuples():
        h = row.Hour
        c.append(f(h))
    df["Tag"] = c
# 0.035 s
# Uses apply
@time_elapse
def f3():
    df["Tag"] = df.Hour.apply(f)
# 0.0129084 s
# Uses boolean masks and column-wise assignment
@time_elapse
def f4():
    index_1 = df.Hour.isin(range(0, 8))
    index_2 = df.Hour.isin(range(8, 16))
    index_3 = df.Hour.isin(range(16, 24))
    df.loc[index_1, "Tag"] = 1
    df.loc[index_2, "Tag"] = 2
    df.loc[index_3, "Tag"] = 3
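# 0.0051495 s
# Uses pd.cut
@time_elapse
def f5():
    # right=False makes the bins [0, 8), [8, 16), [16, 24), matching f();
    # the default right-closed bins would put hours 8 and 16 in the lower bucket.
    df["Tag"] = pd.cut(x=df.Hour, bins=[0, 8, 16, 24],
                       right=False, labels=[1, 2, 3]).astype(int)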
# 0.001368 s
# Uses NumPy
@time_elapse
def f6():
    c = np.array([1, 2, 3])
    # np.digitize returns 0, 1 or 2 for the three hour ranges; use it to index c.
    df["Tag"] = c[np.digitize(df.Hour, bins=[8, 16, 24])]
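Because the approaches above are meant to be interchangeable, it is worth checking that they produce the same tags before comparing their timings. The following is a minimal sketch on one day of hours and is not part of the original benchmark; the names `small`, `expected`, `cut_left`, `cut_right` and `digitized` are hypothetical. It also illustrates that `pd.cut` with its default `right=True` assigns the boundary hours 8 and 16 to the lower bucket, so `right=False` is needed to reproduce the rule in `f`.

```python
import numpy as np
import pandas as pd

def f(hour):
    """Reference rule: 0-7 -> 1, 8-15 -> 2, 16-23 -> 3."""
    if 0 <= hour < 8:
        return 1
    if 8 <= hour < 16:
        return 2
    if 16 <= hour < 24:
        return 3
    return 0

# One day of hours is enough to cover every bucket boundary.
small = pd.DataFrame({"Hour": list(range(24))})
expected = small.Hour.apply(f).to_numpy()

# Left-closed bins [0, 8), [8, 16), [16, 24) match the reference rule.
cut_left = pd.cut(small.Hour, bins=[0, 8, 16, 24], right=False,
                  labels=[1, 2, 3]).astype(int).to_numpy()

# Default right-closed bins, shown here for comparison.
cut_right = pd.cut(small.Hour, bins=[0, 8, 16, 24], include_lowest=True,
                   labels=[1, 2, 3]).astype(int).to_numpy()

# np.digitize yields 0, 1 or 2, which indexes into the label array.
digitized = np.array([1, 2, 3])[np.digitize(small.Hour, bins=[8, 16, 24])]

print("pd.cut (right=False) matches f:", (cut_left == expected).all())
print("np.digitize matches f:", (digitized == expected).all())
print("hours where default pd.cut differs:", np.flatnonzero(cut_right != expected))
```

Running a check like this on a small frame is cheap, and it catches boundary mistakes that a timing comparison alone would never reveal.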