創建子表並快速測試唯一性的封裝(自定義)函數

從一個大數據中創建子表並快速測試唯一性的輔助函數

player_index = 'playerShort'
player_cols = ['birthday', 'height', 'weight', 'position', 'photoID', 'rater1', 'rater2']

def get_subgroup(dataframe, g_index, g_columns):
    g = dataframe.groupby(g_index).agg({col:'nunique' for col in g_columns})
    if g[g > 1].dropna().shape[0] != 0:
        print("Warning: you probably assumed this had all unique values but it doesn't.")
    return dataframe.groupby(g_index).agg({col:'max' for col in g_columns})

players = get_subgroup(df, player_index, player_cols)
players.head()

在這裏插入圖片描述
保存數據, 並檢測是否一致:

def save_subgroup(dataframe, g_index, subgroup_name, prefix = 'raw_'):
    save_subgroup_filename = ''.join([prefix, subgroup_name, '.csv.gz'])
    dataframe.to_csv(save_subgroup_filename, compression='gzip', encoding = 'UTF-8')
    test_df = pd.read_csv(save_subgroup_filename, compression='gzip', index_col = g_index, encoding='UTF-8')
    if dataframe.equals(test_df):
        print('Test-passed: we recover the equivalent subgroup dataframe.')
    else:
        print('Warning -- equivalence test!!! Double-check.')
        
save_subgroup(players, player_index, 'players')

Test-passed: we recover the equivalent subgroup dataframe.

發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章