1. Use a distributed framework, such as the Spark setup introduced last time
Spark only really shines on a cluster; a local single-machine instance here could use just 8 GB of RAM, so the advantages of RDDs do not materialize. The upside is that you still get multiple partitions and multiple tasks.
2. Use pandas chunks; this is no slower than single-machine Spark
import pandas as pd
df_chunk = pd.read_json('F://total.json', chunksize=1000000, lines=True, encoding='utf-8')
chunk_list = []  # append each chunk df here
i = 1
#%%
# Each chunk is a regular DataFrame
for chunk in df_chunk:
    # perform data filtering
    # chunk_filter = chunk_preprocessing(chunk)
    # Once the filtering is done, append the chunk to the list
    # chunk_list.append(chunk_filter)
    chunk_list.append(chunk)
    print("current chunk: {}".format(i))
    i += 1
# concat the list back into one DataFrame
df_concat = pd.concat(chunk_list)
Each chunk of 1,000,000 rows maxes out 16 GB of RAM. The method above accumulates chunks in a list, so the processed data as a whole still has to fit in your machine's memory, which is a limitation.
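One way around that limitation is to append each processed chunk straight to an output file instead of a list, so memory usage stays bounded by one chunk. A minimal sketch with made-up data; the filter step stands in for the `chunk_preprocessing()` placeholder above:

```python
import io
import pandas as pd

# A small in-memory CSV standing in for the large file
src = io.StringIO("a,b\n" + "\n".join("{},{}".format(i, i * 2) for i in range(10)))

first = True
for chunk in pd.read_csv(src, chunksize=4):
    # Stand-in filter: keep rows where column a is even
    chunk = chunk[chunk['a'] % 2 == 0]
    # Append each processed chunk to disk; write the header only once
    chunk.to_csv('filtered.csv', mode='w' if first else 'a', header=first, index=False)
    first = False

out = pd.read_csv('filtered.csv')
print(len(out))
```

The trade-off is that you re-read the result from disk afterwards, but the peak memory footprint is one chunk rather than the whole dataset.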
3. Use Dask, a distributed pandas
import dask
import dask.dataframe as dd
from dask.distributed import Client
# Local cluster: 4 workers x 4 threads, 12 GB memory limit per worker
client = Client(processes=False, threads_per_worker=4, n_workers=4, memory_limit='12GB')
#%%
# blocksize=25e6 splits the file into ~25 MB partitions
df = dd.read_csv("F://total2.csv", blocksize=25e6, encoding='utf-8', dtype='object')
#%%
for i in df.columns:
    print("{}".format(df.head(1)[i]))
#%%
Running this produced the following dtype warning (log truncated as captured):

Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.

+----------------------------------+--------+----------+
| Column                           | Found  | Expected |
+----------------------------------+--------+----------+
| check.0.reportorphone            | object | float64  |
| damagetypecode                   | object | float64  |
| lossmain.0.handlercode           | object | float64  |
| lossmain.0.repairbrandcode       | object | float64  |
| lossmain.0.repairbrandname       | object | float64  |
| lossmain.0.repairfactorycode     | object | float64  |
| lossmain.0.repairfactoryname     | object | float64  |
| lossthirdparty.0.insurecomcode   | object | float64  |
| lossthirdparty.0.losscarkindname | object | float64  |
| lossthirdparty.0.thirdcarlinker  | object | float64  |
| lossthirdparty.0.vinno           | object | float64  |
| phonenumber                      | object | int64    |
| prplcitemcar.0.brandid           | object | float64  |
| prplcitemcar.0.brandname1 
The fix for the log error above: pass dtype='object' so every column is read as strings instead of letting Dask infer types from a sample.
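The same mismatch can be reproduced in plain pandas: a column that looks numeric in the sampled rows may later contain strings, and forcing dtype='object' sidesteps the inference entirely. A minimal sketch with made-up data:

```python
import io
import pandas as pd

# Made-up sample: phonenumber looks numeric but contains a string value
csv = "phonenumber,damagetypecode\n13800000000,1.0\nunknown,2.0\n"

# With dtype='object' every column is read as strings,
# so no "Found object / Expected float64" mismatch can occur
df = pd.read_csv(io.StringIO(csv), dtype='object')
print(df.dtypes.tolist())
```

Everything comes back as object, so numeric columns must be cast explicitly afterwards (e.g. with pd.to_numeric) before doing arithmetic on them.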