Big Data Processing Tips (Continuously Updated)

Original

2020-06-28 03:54

Run the code on a small sample of the data first to make sure it has no syntax or logic errors, then run it on the full dataset.
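One way to take such a sample, sketched here under assumptions: the in-memory CSV stands in for a large file on disk, and `process` is a hypothetical placeholder for the real pipeline. The `nrows` parameter of `read_csv` limits how many data rows are parsed.

```python
import io

import pandas as pd

# Hypothetical stand-in for a large CSV file on disk.
csv_data = io.StringIO(
    "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(1000))
)


def process(df):
    # Placeholder for the real pipeline (an assumption for illustration).
    return df.assign(doubled=df["value"] * 2)


# Parse only the first 50 data rows to check the code end to end.
sample = pd.read_csv(csv_data, nrows=50)
result = process(sample)
print(result.tail())
```

Once the sample run looks right, the same call without `nrows` runs the pipeline on everything.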
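When the data is held in a pandas DataFrame, integer and float columns default to int64 and float64, but that much range and precision is often unnecessary. The following function saves memory by downcasting each numeric column to the smallest type that fits its values: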

import numpy as np


def reduce_mem_usage(df):
    """Downcast int and float columns to the smallest dtype that fits their value range."""
    start_mem = df.memory_usage().sum() / (1024 ** 3)
    print('Memory usage before: {:.2f} GB'.format(start_mem))
    for col in df.columns:
        col_type = str(df[col].dtype)

        # Only plain int/float columns are handled; object, bool, datetime, etc. are left as-is.
        if col_type.startswith('int'):
            type_list = [np.int8, np.int16, np.int32, np.int64]
            info = np.iinfo
        elif col_type.startswith('float'):
            # The check below covers range only: float16 keeps about 3 significant
            # decimal digits, so drop it from the list if precision matters.
            type_list = [np.float16, np.float32, np.float64]
            info = np.finfo  # np.iinfo only accepts integer types
        else:
            continue

        min_val = df[col].min()
        max_val = df[col].max()
        for i in type_list:
            if min_val >= info(i).min and max_val <= info(i).max:
                df[col] = df[col].astype(i)
                break

    end_mem = df.memory_usage().sum() / (1024 ** 3)
    print('Memory usage after: {:.2f} GB'.format(end_mem))
    return df
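As an alternative to a hand-written loop, pandas also ships a built-in downcast through `pd.to_numeric`. The sketch below is not the function above; the column names and values are made up for illustration. Note that `downcast="float"` stops at float32 and never picks float16.

```python
import numpy as np
import pandas as pd

# Toy frame with the default 64-bit dtypes (illustrative data only).
df = pd.DataFrame({
    "count": np.array([1, 2, 3], dtype=np.int64),
    "price": np.array([0.5, 1.5, 2.5], dtype=np.float64),
})

small = df.copy()
# Pick the smallest signed integer type that holds the values.
small["count"] = pd.to_numeric(small["count"], downcast="integer")
# Pick the smallest float type, with float32 as the lower limit.
small["price"] = pd.to_numeric(small["price"], downcast="float")

before = df.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(small.dtypes)
print(before, after)
```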

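Reading a large file with pandas read_csv or read_excel can fail partway through with an OOM (out of memory) error. If watching watch -n 0.1 free -hm alongside the fraction of rows read so far suggests that the required memory exceeds the available memory by only about 10%, read the file in chunks by setting chunksize (for example, to 1/10 of the total row count). Note that chunksize is a read_csv option; read_excel does not offer it.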
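A minimal sketch of chunked reading, assuming a small in-memory CSV in place of the real file and a running sum as the per-chunk work. With `chunksize` set, `read_csv` returns an iterator of DataFrames, so only one chunk is held in memory at a time.

```python
import io

import pandas as pd

# Hypothetical stand-in for a large CSV file on disk: 100 data rows.
csv_data = io.StringIO(
    "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(100))
)

total = 0
n_rows = 0
# chunksize=10 is 1/10 of the total row count here.
for chunk in pd.read_csv(csv_data, chunksize=10):
    # Aggregate each chunk and keep only the running result.
    total += int(chunk["value"].sum())
    n_rows += len(chunk)

print(n_rows, total)
```

This only helps if each chunk is reduced or written out before the next one is read; concatenating all chunks back into one DataFrame needs as much memory as reading the file in one go.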
