Python: Comparing pyarrow with the Parquet and Feather Formats

Original

2021-01-30 10:13

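I recently came across the Arrow format and was impressed by its design. I won't go into the details here; instead, this post is a hands-on test.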

1. Preparing the data
I prepared a CSV file with about 590,000 rows and 14 columns, roughly 61 MB in size.

table shape row:  589680
table shape col:  14

With this CSV file in hand, we can load it into a DataFrame and then convert it to Parquet.
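The original test.csv is not provided, so to reproduce the experiment one option is to generate a stand-in file of the same shape. The sketch below is only an assumption about what such a file could look like: the helper name make_test_csv, the column names col_0 ... col_13 and the all-numeric content are hypothetical, and the resulting file size will differ from the 61 MB reported above.

```python
import csv
import random


def make_test_csv(path, n_rows=589_680, n_cols=14, seed=0):
    """Write a synthetic CSV with n_rows data rows and n_cols numeric columns.

    The schema is invented for illustration; the real test.csv is unknown.
    Returns the header row that was written.
    """
    rng = random.Random(seed)
    header = [f"col_{i}" for i in range(n_cols)]
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for _ in range(n_rows):
            writer.writerow([round(rng.random(), 6) for _ in range(n_cols)])
    return header
```

Calling make_test_csv("test.csv") writes the full-size file; pass a smaller n_rows for a quick trial.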

2. The code

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import time as t

# Read the CSV into pandas, convert it to an Arrow table, then write Parquet
print("write parquet file....")
csv_path = "C:\\Users\\songroom\\Desktop\\test.csv"
time_0 = t.time()
df = pd.read_csv(csv_path)
time_1 =t.time()
print("read csv cost :", time_1-time_0)
print("type of df : ",type(df))
time_2 = t.time()
table = pa.Table.from_pandas(df)
time_3 = t.time()
print("type of table :",type(table))
print("Dataframe convert table:", time_3-time_2)
#
print("write to parquet to disk.....")
time_4 = t.time()
pq_path = "C:\\Users\\songroom\\Desktop\\test.parquet"
pq.write_table(table, pq_path)
time_5 = t.time()
print("write  parquet cost :", time_5-time_4)

print("read parquet file from disk ....")
time_6 = t.time()  # start the read timer immediately before the read
table2 = pq.read_table(pq_path)
time_7 = t.time()
print("read  parquet cost :", time_7-time_6)
print("type of table2 :",type(table2))
print("table2 shape row: ",table2.shape[0])
print("table2 shape col: ",table2.shape[1])

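3. File size comparison

The resulting Parquet file is about 11.3 MB, roughly 20% of the size of the original CSV file, which is a substantial space saving.

The read and write timings are as follows: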
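The size ratio can be checked directly rather than read off a file manager. This is a small sketch using only the standard library; size_ratio is a hypothetical helper name, and the two throwaway files in the demo merely stand in for test.parquet and test.csv.

```python
import os
import tempfile


def size_ratio(small_path, large_path):
    """Return the size of small_path as a fraction of large_path."""
    return os.path.getsize(small_path) / os.path.getsize(large_path)


# Demo with two throwaway files of 200 and 1000 bytes.
with tempfile.TemporaryDirectory() as tmp:
    small = os.path.join(tmp, "small.bin")
    large = os.path.join(tmp, "large.bin")
    with open(small, "wb") as f:
        f.write(b"\0" * 200)
    with open(large, "wb") as f:
        f.write(b"\0" * 1000)
    demo_ratio = size_ratio(small, large)

print(f"{demo_ratio:.0%}")   # 20%
# The figures quoted in the post: 11.3 MB of Parquet against 61 MB of CSV.
print(f"{11.3 / 61:.1%}")    # 18.5%
```

For the files in this post the call would be size_ratio(pq_path, csv_path).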

write parquet file....
read csv cost : 1.0619995594024658
type of df :  <class 'pandas.core.frame.DataFrame'>
type of table : <class 'pyarrow.lib.Table'>
Dataframe convert table: 0.08900094032287598
write to parquet to disk.....
write  parquet cost : 0.3249986171722412
read parquet file from disk ....
read  parquet cost : 0.05600690841674805
type of table2 : <class 'pyarrow.lib.Table'>
table2 shape row:  589680
table2 shape col:  14

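In other words, reading the Parquet file took about 0.056 s against about 1.06 s for the CSV, which is roughly 5% of the CSV read time and comfortably under 50%, while the file is about 20% of the CSV's size. Note that the Parquet timing covers pq.read_table only, which returns an Arrow Table; converting that table to a pandas DataFrame is not included. These figures will vary with the runtime environment, so treat them as a reference only.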
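Single-shot timings like the ones above are noisy, especially for operations that finish in well under a second. One way to make them steadier is to repeat each operation and keep the best run. The helper below is a sketch of that idea; best_of is a hypothetical name, and the commented usage lines assume the csv_path and pq_path variables from the script above.

```python
import time


def best_of(func, repeats=5):
    """Call func() `repeats` times; return (fastest_seconds, last_result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func()
        best = min(best, time.perf_counter() - start)
    return best, result


# Stand-in workload so the sketch runs without any data files.
best, total = best_of(lambda: sum(range(1000)))
print(f"best of 5: {best:.6f} s, result {total}")

# Possible usage with the files from this post (not executed here):
#   csv_s, df = best_of(lambda: pd.read_csv(csv_path))
#   pq_s, df2 = best_of(lambda: pq.read_table(pq_path).to_pandas())
```

Repeated reads of the same file are usually served from the operating system's cache, so the best-of figure reflects a warm read rather than a first read from disk.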

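4. Comparison with Feather

Using the same CSV file, we run it through Feather (here from Julia, with the Feather.jl package) and compare the read speed.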

using DataFrames
using CSV
using Feather

# raw"..." keeps the Windows backslashes literal
csv_path = raw"C:\Users\songroom\Desktop\test.csv"
println("csv => DataFrame: ")
df = @time CSV.File(csv_path) |> DataFrame;
ft_path = raw"C:\Users\songroom\Desktop\ft.ft"
println("DataFrame=> Feather:")
@time ft_file = Feather.write(ft_path, df)
println("read Feather")
@time ft = Feather.read(ft_path)
