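Raw data usually arrives with a lot of dirty records, so it has to be preprocessed into the format we actually want. The most common operations are filtering rows, converting date formats, filling null values, sorting, and removing duplicates. The worked example below shows the basic pandas usage for each of them.
Raw data: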
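What the script does:
- Read the raw data
- Drop rows whose column A (cookie) contains '||' -> filtering
- Drop rows that have 3 or more null values -> filtering
- Remove the '-' from the dates -> date format conversion
- Fill nulls in column E (num) with 1 -> filling null values
- Sort by the date in column D (lead_time), descending -> sorting
- Remove duplicates on column B (phone), keeping the first row -> deduplication
- Save the processed result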
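import pandas as pd
# Read the raw data; the file has no header row, so the column names are supplied here.
data = pd.read_csv('buydata.csv', sep=',', header=None, names=['cookie', 'phone', 'deal_time', 'lead_time', 'num'])
print('raw data:\n', data)
# Drop rows whose cookie contains the literal text '||' (regex=False means the pipes need no escaping).
data = data[~data.cookie.str.contains('||', regex=False, na=False)]
# thresh=3 keeps rows with at least 3 non-null values, so rows with 3 or more nulls are dropped.
data = data.dropna(axis=0, thresh=3)
# Strip the '-' from both date columns, e.g. 2018-10-10 -> 20181010.
data.deal_time = data.deal_time.str.replace('-', '', regex=False)
data.lead_time = data.lead_time.str.replace('-', '', regex=False)
# Fill missing values in num with 1.
data = data.fillna({'num': 1})
# Sort by lead_time, newest first, then keep the first (newest) row for each phone.
data = data.sort_values(by='lead_time', ascending=False).drop_duplicates(['phone'], keep='first')
print('preprocessing data:\n', data)
# Renumber the rows from 0; drop=False keeps the old row labels in a column named 'index'.
data.reset_index(drop=False, inplace=True)
print('reset_index data:\n', data)
# Write only the five data columns; the new 0-based row number is saved under the label 'index'.
data.to_csv('buydata_clear.csv', columns=['cookie', 'phone', 'deal_time', 'lead_time', 'num'], index_label='index')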
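The two filtering steps are the easiest to misread, so here is a minimal sketch of them in isolation. It assumes a small made-up frame (the values are hypothetical; only the column names come from the script above). It shows one way to match '||' as literal text with regex=False, and that thresh=3 counts non-null values rather than nulls.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; only the column names are taken from the script above.
df = pd.DataFrame({
    'cookie': ['a||b', 'cde', 'fgh', 'ijk'],
    'phone': [1.0, 2.0, np.nan, np.nan],
    'deal_time': ['2018-10-10', '2018-10-11', np.nan, '2018-10-12'],
    'lead_time': ['2018-10-05', '2018-10-06', np.nan, '2018-10-07'],
    'num': [1.0, np.nan, np.nan, np.nan],
})

# regex=False treats '||' as plain text, so the pipes do not need escaping.
has_sep = df['cookie'].str.contains('||', regex=False)
no_sep = df[~has_sep]

# thresh=3 keeps a row only if it has at least 3 non-null values.
# With 5 columns that means rows with 3 or more nulls are removed.
kept = no_sep.dropna(axis=0, thresh=3)

print(kept)
```

In this toy frame the row with two nulls survives and the row with four nulls is removed, which is the same rule that drops row 8 in the real data.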
Output:
raw data:
cookie phone deal_time lead_time num
0 asdfawef||asff 1123545.0 2018-10-10 2018-10-05 1.0
1 ghsdrg 4521665.0 2018-10-11 2018-10-06 2.0
2 dfag||adgh 544862.0 2018-10-12 2018-10-07 46.0
3 dfgtntsrg 5588662.0 2018-10-13 2018-10-08 7.0
4 aedfga 1123545.0 2018-10-14 2018-10-09 NaN
5 asdgh 4521665.0 2018-10-15 2018-10-10 2.0
6 ayjsdr 544862.0 2018-10-16 2018-10-11 7.0
7 kjfghjtd 5588662.0 2018-10-17 2018-10-12 3.0
8 kfghjtewert NaN NaN NaN NaN
9 uwrtywqeru 1123545.0 2018-10-11 2018-10-05 8.0
10 jsdfh||adfhs 4521665.0 2018-10-12 2018-10-06 3.0
11 iryuisfdh 544862.0 2018-10-13 2018-10-07 7.0
12 fhjulfy 5588662.0 2018-10-14 2018-10-08 1.0
preprocessing data:
cookie phone deal_time lead_time num
7 kjfghjtd 5588662.0 20181017 20181012 3.0
6 ayjsdr 544862.0 20181016 20181011 7.0
5 asdgh 4521665.0 20181015 20181010 2.0
4 aedfga 1123545.0 20181014 20181009 1.0
reset_index data:
index cookie phone deal_time lead_time num
0 7 kjfghjtd 5588662.0 20181017 20181012 3.0
1 6 ayjsdr 544862.0 20181016 20181011 7.0
2 5 asdgh 4521665.0 20181015 20181010 2.0
3 4 aedfga 1123545.0 20181014 20181009 1.0
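Removing the hyphens leaves the dates as plain strings. Sorting still works here only because every value has the same zero-padded year-month-day layout. As an alternative sketch (not what the script above does), the column could be parsed with pd.to_datetime, sorted as real dates, and formatted back to text at the end. The rows below are hypothetical, and lead_dt is a helper column introduced just for this illustration.

```python
import pandas as pd

# Hypothetical rows shaped like the phone and lead_time columns.
df = pd.DataFrame({
    'phone': [1123545, 1123545, 4521665],
    'lead_time': ['2018-10-09', '2018-10-05', '2018-10-10'],
})

# Parse the strings into datetimes; errors='coerce' turns bad values into NaT.
parsed = pd.to_datetime(df['lead_time'], format='%Y-%m-%d', errors='coerce')

latest = (
    df.assign(lead_dt=parsed)                      # helper column (illustrative name)
      .sort_values('lead_dt', ascending=False)     # newest date first
      .drop_duplicates('phone', keep='first')      # keep the newest row per phone
      .assign(lead_time=lambda d: d['lead_dt'].dt.strftime('%Y%m%d'))
)

print(latest[['phone', 'lead_time']])
```

Parsing first also surfaces malformed dates as NaT instead of letting them sort silently as text.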
Processed data: