Python: comparing pyarrow with the Parquet and Feather formats

Original

2021-01-30 10:13

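I recently came across the Arrow format and was impressed by its design. I won't go into the details here; instead, I ran a hands-on test to get a feel for it.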

1. Preparing the data
I prepared a CSV file with about 590,000 rows and 14 columns, roughly 61 MB in size.

table shape row:  589680
table shape col:  14

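With this CSV file in hand, we can load it into a pandas DataFrame, convert that to an Arrow table, and write it out in Parquet format.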

2. The code

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import time as t

# Load the CSV, build an Arrow table from it, then write it to Parquet
print("write parquet file....")
csv_path = "C:\\Users\\songroom\\Desktop\\test.csv"
time_0 = t.time()
df = pd.read_csv(csv_path)
time_1 =t.time()
print("read csv cost :", time_1-time_0)
print("type of df : ",type(df))
time_2 = t.time()
table = pa.Table.from_pandas(df)
time_3 = t.time()
print("type of table :",type(table))
print("Dataframe convert table:", time_3-time_2)
#
print("write to parquet to disk.....")
time_4 = t.time()
pq_path = "C:\\Users\\songroom\\Desktop\\test.parquet"
pq.write_table(table, pq_path)
time_5 = t.time()
print("write  parquet cost :", time_5-time_4)

print("read parquet file from disk ....")
time_6 = t.time()  # start the read timer here so the print above is not counted
table2 = pq.read_table(pq_path)
time_7 = t.time()
print("read  parquet cost :", time_7-time_6)
print("type of table2 :",type(table2))
print("table2 shape row: ",table2.shape[0])
print("table2 shape col: ",table2.shape[1])

3. File size comparison

The resulting Parquet file is about 11.3 MB, roughly 20% of the original 61 MB CSV file, which is a substantial space saving.
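The ratio can be checked directly from the byte counts on disk. The sketch below is only an illustration: `size_ratio` is a hypothetical helper name (not part of pyarrow), and the demo uses two throwaway temp files as stand-ins for test.parquet and test.csv.

```python
import os
import tempfile


def size_ratio(small_path, large_path):
    """Return size(small_path) / size(large_path) using on-disk byte counts."""
    return os.path.getsize(small_path) / os.path.getsize(large_path)


# Demo with throwaway files (stand-ins for test.parquet and test.csv).
with tempfile.TemporaryDirectory() as tmp:
    small = os.path.join(tmp, "small.bin")
    large = os.path.join(tmp, "large.bin")
    with open(small, "wb") as f:
        f.write(b"x" * 200)
    with open(large, "wb") as f:
        f.write(b"x" * 1000)
    demo_ratio = size_ratio(small, large)
    print(f"demo ratio: {demo_ratio:.1%}")

# The sizes reported in this post: 11.3 MB of Parquet vs. 61 MB of CSV.
reported_ratio = 11.3 / 61
print(f"reported ratio: {reported_ratio:.1%}")
```

With the sizes quoted in this post, the ratio works out to about 18.5%, consistent with the "roughly 20%" estimate.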

The read and write timings are as follows:

write parquet file....
read csv cost : 1.0619995594024658
type of df :  <class 'pandas.core.frame.DataFrame'>
type of table : <class 'pyarrow.lib.Table'>
Dataframe convert table: 0.08900094032287598
write to parquet to disk.....
write  parquet cost : 0.3249986171722412
read parquet file from disk ....
read  parquet cost : 0.05600690841674805
type of table2 : <class 'pyarrow.lib.Table'>
table2 shape row:  589680
table2 shape col:  14

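In other words, reading the Parquet file took about 0.056 s against about 1.06 s for the CSV, which is roughly 5% of the time (well under 50%), and the file is about 20% of the size. Note that the Parquet figure is the time to get a pyarrow Table, while the CSV figure is the time to get a pandas DataFrame, so the two are not measuring exactly the same work. These numbers will also differ from one environment to another, so treat them as a rough reference only.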
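Since single-shot timings are noisy, one way to get steadier numbers is to repeat each step and keep the fastest run. This is a minimal sketch using only the standard library, not the method used in this post: `best_of` is a hypothetical helper, and the workload is a placeholder for calls such as `pd.read_csv(csv_path)` or `pq.read_table(pq_path)`.

```python
import time


def best_of(fn, repeats=5):
    """Call fn() `repeats` times; return (fastest elapsed seconds, last result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn()
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best, result


# Placeholder workload; swap in e.g. lambda: pq.read_table(pq_path).
elapsed, total = best_of(lambda: sum(range(100_000)))
print(f"best of 5: {elapsed:.6f} s")

# Ratio of the two read timings reported above (Parquet read / CSV read).
read_ratio = 0.05600690841674805 / 1.0619995594024658
print(f"parquet/csv read time: {read_ratio:.1%}")
```

`time.perf_counter()` is used instead of `time.time()` because it is a monotonic, high-resolution clock intended for measuring short intervals.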

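4. Comparison with Feather

Using the same CSV file, let's run it through Feather and compare the read speed. The code below is written in Julia rather than Python.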

using DataFrames
using CSV
using Feather

csv_path = raw"C:\Users\songroom\Desktop\test.csv"  # raw string keeps the backslashes literal
println("csv => DataFrame: ")
df = @time CSV.File(csv_path) |> DataFrame;
ft_path = raw"C:\Users\songroom\Desktop\ft.ft"
println("DataFrame=> Feather:")
@time ft_file = Feather.write(ft_path, df)  # write the DataFrame loaded above
println("read Feather")
@time ft = Feather.read(ft_path)
