1. Use a distributed framework, such as the Spark setup introduced last time
Spark only really shines on a cluster; a local single-machine instance here could use just 8 GB of RAM, so the advantages of RDDs do not materialize. The upside is that you still get multiple partitions and multiple tasks.
2. Use pandas chunks; this is no slower than single-machine Spark
import pandas as pd
df_chunk = pd.read_json('F://total.json', chunksize=1000000, lines=True, encoding='utf-8')
chunk_list = []  # append each chunk df here
i = 1
#%%
# Each chunk is a regular DataFrame
for chunk in df_chunk:
    # perform data filtering
    # chunk_filter = chunk_preprocessing(chunk)
    # Once the filtering is done, append the chunk to the list
    # chunk_list.append(chunk_filter)
    chunk_list.append(chunk)
    print("current chunk: {}".format(i))
    i += 1
# concat the list back into one DataFrame
df_concat = pd.concat(chunk_list)
Each chunk of 1,000,000 rows maxes out 16 GB of RAM. The method above accumulates chunks in a list, so the processed data as a whole still has to fit in your machine's memory, which is a limitation.
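One way around that limitation is to append each processed chunk straight to an output file instead of a list, so memory usage stays bounded by one chunk. A minimal sketch with made-up data; the filter step stands in for the `chunk_preprocessing()` placeholder above:

```python
import io
import pandas as pd

# A small in-memory CSV standing in for the large file
src = io.StringIO("a,b\n" + "\n".join("{},{}".format(i, i * 2) for i in range(10)))

first = True
for chunk in pd.read_csv(src, chunksize=4):
    # Stand-in filter: keep rows where column a is even
    chunk = chunk[chunk['a'] % 2 == 0]
    # Append each processed chunk to disk; write the header only once
    chunk.to_csv('filtered.csv', mode='w' if first else 'a', header=first, index=False)
    first = False

out = pd.read_csv('filtered.csv')
print(len(out))
```

The trade-off is that you re-read the result from disk afterwards, but the peak memory footprint is one chunk rather than the whole dataset.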
3. Use Dask, a distributed pandas
import dask
import dask.dataframe as dd
from dask.distributed import Client
# Local cluster: 4 workers x 4 threads, 12 GB memory limit per worker
client = Client(processes=False, threads_per_worker=4, n_workers=4, memory_limit='12GB')
#%%
# blocksize=25e6 splits the file into ~25 MB partitions
df = dd.read_csv("F://total2.csv", blocksize=25e6, encoding='utf-8', dtype='object')
#%%
for i in df.columns:
    print("{}".format(df.head(1)[i]))
#%%
Running this produced the following dtype warning (log truncated as captured):

Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.

+----------------------------------+--------+----------+
| Column                           | Found  | Expected |
+----------------------------------+--------+----------+
| check.0.reportorphone            | object | float64  |
| damagetypecode                   | object | float64  |
| lossmain.0.handlercode           | object | float64  |
| lossmain.0.repairbrandcode       | object | float64  |
| lossmain.0.repairbrandname       | object | float64  |
| lossmain.0.repairfactorycode     | object | float64  |
| lossmain.0.repairfactoryname     | object | float64  |
| lossthirdparty.0.insurecomcode   | object | float64  |
| lossthirdparty.0.losscarkindname | object | float64  |
| lossthirdparty.0.thirdcarlinker  | object | float64  |
| lossthirdparty.0.vinno           | object | float64  |
| phonenumber                      | object | int64    |
| prplcitemcar.0.brandid           | object | float64  |
| prplcitemcar.0.brandname1 
The fix for the log error above: pass dtype='object' so every column is read as strings instead of letting Dask infer types from a sample.
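The same mismatch can be reproduced in plain pandas: a column that looks numeric in the sampled rows may later contain strings, and forcing dtype='object' sidesteps the inference entirely. A minimal sketch with made-up data:

```python
import io
import pandas as pd

# Made-up sample: phonenumber looks numeric but contains a string value
csv = "phonenumber,damagetypecode\n13800000000,1.0\nunknown,2.0\n"

# With dtype='object' every column is read as strings,
# so no "Found object / Expected float64" mismatch can occur
df = pd.read_csv(io.StringIO(csv), dtype='object')
print(df.dtypes.tolist())
```

Everything comes back as object, so numeric columns must be cast explicitly afterwards (e.g. with pd.to_numeric) before doing arithmetic on them.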