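Python: comparing pyarrow with the Parquet and Feather formats

I recently came across the Arrow format and was impressed by its design. I won't go into the details here; instead, let's get a feel for it hands-on.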

1. Preparing the material
I prepared a CSV file with about 590,000 rows and 14 columns, roughly 61 MB in size.

table shape row:  589680
table shape col:  14

With this CSV file in hand, we can load it into a DataFrame and then convert it to Parquet format.

2. The code

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import time as t

# Read the CSV into a DataFrame and build an Arrow table from it
print("write parquet file....")
csv_path = "C:\\Users\\songroom\\Desktop\\test.csv"
time_0 = t.time()
df = pd.read_csv(csv_path)
time_1 =t.time()
print("read csv cost :", time_1-time_0)
print("type of df : ",type(df))
time_2 = t.time()
table = pa.Table.from_pandas(df)
time_3 = t.time()
print("type of table :",type(table))
print("Dataframe convert table:", time_3-time_2)
#
print("write to parquet to disk.....")
time_4 = t.time()
pq_path = "C:\\Users\\songroom\\Desktop\\test.parquet"
pq.write_table(table, pq_path)
time_5 = t.time()
print("write  parquet cost :", time_5-time_4)

print("read parquet file from disk ....")
time_6 = t.time()  # start the timer right before the read
table2 = pq.read_table(pq_path)
time_7 = t.time()
print("read  parquet cost :", time_7-time_6)
print("type of table2 :",type(table2))
print("table2 shape row: ",table2.shape[0])
print("table2 shape col: ",table2.shape[1])

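3. File size comparison

The resulting Parquet file is about 11.3 MB, roughly 20% of the original CSV file (11.3 / 61 ≈ 19%), which is a substantial space saving.

The read and write timings are as follows: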

write parquet file....
read csv cost : 1.0619995594024658
type of df :  <class 'pandas.core.frame.DataFrame'>
type of table : <class 'pyarrow.lib.Table'>
Dataframe convert table: 0.08900094032287598
write to parquet to disk.....
write  parquet cost : 0.3249986171722412
read parquet file from disk ....
read  parquet cost : 0.05600690841674805
type of table2 : <class 'pyarrow.lib.Table'>
table2 shape row:  589680
table2 shape col:  14

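In other words, reading the Parquet file took only about 5% of the time needed to read the CSV (0.056 s versus 1.062 s), and the file is about 20% of the size. Note that pq.read_table returns an Arrow Table rather than a DataFrame, so the two read timings do not measure exactly the same work. These figures will also vary with the runtime environment, so treat them as a rough reference only.

4. Comparison with Feather

Using the same CSV file, let's process it with Feather and compare the read speed. The snippet below is written in Julia and uses the Feather.jl package.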

using DataFrames
using CSV
using Feather

csv_path = raw"C:\Users\songroom\Desktop\test.csv"  # raw string keeps the backslashes literal
println("csv => DataFrame: ")
df = @time CSV.File(csv_path) |> DataFrame;
ft_path = raw"C:\Users\songroom\Desktop\ft.ft"
println("DataFrame=> Feather:")
@time ft_file = Feather.write(ft_path, df)  # write the DataFrame loaded above
println("read Feather")
@time ft = Feather.read(ft_path)