Data Processing Stage (Part 1)

Original

Alicia_N

2020-02-21 03:41

This code was run in Jupyter on an Ubuntu virtual machine.

# Import packages

import numpy as np
import pandas as pd
from pandas import Series,DataFrame
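1. Removing duplicates
Use the duplicated() function to detect duplicate rows. It returns a boolean Series with one element per row; an element is True if that row has already appeared earlier (that is, it is not the first occurrence).
# Check for duplicates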
df=DataFrame({'color':['red','white','black','green'],
'value':[2,3,7,5]})
df.duplicated()
0 False
1 False
2 False
3 False
dtype: bool
df.drop_duplicates() # drop duplicate rows (returns a new DataFrame)
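The DataFrame above contains no repeated rows, so every flag is False. As a minimal sketch of what happens when duplicates exist, the following uses a made-up DataFrame (the name dup_df and its values are assumptions for illustration, not part of the original notebook):

```python
import pandas as pd

# Hypothetical data: row 1 repeats row 0, and row 3 repeats row 2
dup_df = pd.DataFrame({'color': ['red', 'red', 'black', 'black', 'green'],
                       'value': [2, 2, 7, 7, 5]})

# True marks a row that repeats an earlier row
dup_mask = dup_df.duplicated()

# Keep only the first occurrence of each row; dup_df itself is unchanged
deduped = dup_df.drop_duplicates()

print(dup_mask.tolist())
print(deduped)
```

Note that drop_duplicates() keeps the original index labels of the surviving rows.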
2. Mapping
What mapping means: create a lookup that binds each value to a specific label or string.
This requires a dictionary:
mapping = {
'label1':'value1',
'label2':'value2',
...
}
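It covers three operations:
replace() function: replace elements
Most important: map() function: create a new column
rename() function: replace index labels
1) replace() function: replacing elements
Use the replace() function to substitute values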
df=DataFrame({'item':['ball','pen','pencil','cup','watch'],
'color':['red','white','yellow','black','green'],
'weight':[2,3,4,5,6]})
df
First define a dictionary whose keys are the old values and whose values are the replacements
new={'yellow':'blue','black':'white'}
new
df.replace(new)
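As a small sketch of how replace() behaves with a dictionary, the example below uses a shortened, made-up table (the names items and new_colors are assumptions for illustration). It shows that replace() returns a new DataFrame, and that a nested dictionary can restrict the replacement to a single column:

```python
import pandas as pd

# Hypothetical, shortened version of the table above
items = pd.DataFrame({'item': ['ball', 'pen', 'cup'],
                      'color': ['red', 'yellow', 'black']})
new_colors = {'yellow': 'blue', 'black': 'white'}

# Replace matching values anywhere in the DataFrame
replaced = items.replace(new_colors)

# Nested dict: only look for these values inside the 'color' column
only_color = items.replace({'color': new_colors})

print(replaced)
print(items)  # the original is left untouched
```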
replace() is also often used to replace NaN elements
s=Series([1,np.nan,3])
s
0 1.0 # output
1 NaN
2 3.0
dtype: float64
s1={np.nan:0}
s.replace(s1)
0 1.0
1 0.0
2 3.0
dtype: float64
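For the specific case of filling in NaN, pandas also provides fillna(). The sketch below, which assumes the same three-element Series as above, suggests the two approaches give the same result here:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, np.nan, 3])

# Dictionary-based replacement, as in the text above
via_replace = s.replace({np.nan: 0})

# Dedicated method for missing values
via_fillna = s.fillna(0)

print(via_replace.equals(via_fillna))
```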

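2) map() function: creating a new column
Use the map() function to generate a new column from an existing one.
It is best suited to processing a single column.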
df = DataFrame({'item':['ball','mug','pen'],
'color':['white','rosso','verde']
})
df
Again, create a dictionary
price = {
'ball':5.56,
'mug':4.20,
'pen':1.30
}
price['ball']
Use the map() function to create a new price column
df['price'] = df['item'].map(price)
df
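One detail worth knowing: when a value has no matching key in the dictionary, map() yields NaN for that row. A minimal sketch, using a made-up extra item ('ashtray') that is deliberately absent from the price dictionary (the name goods is also an assumption for illustration):

```python
import pandas as pd

# 'ashtray' is a hypothetical item with no entry in the price dictionary
goods = pd.DataFrame({'item': ['ball', 'mug', 'pen', 'ashtray']})
price = {'ball': 5.56, 'mug': 4.20, 'pen': 1.30}

# Look up each item in the dictionary; unmatched items become NaN
goods['price'] = goods['item'].map(price)

# Flag the rows whose item had no price
missing = goods['price'].isna()

print(goods)
```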
3) rename() function: replacing index labels
Again, create a dictionary
new_index = {
0:'first',
1:'second',
2:'third'
}
Use the rename() function to replace the row index labels (it returns a new DataFrame)
df.rename(new_index)
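rename() can also relabel columns through its columns argument. The following sketch assumes a small made-up table (table, and the new column name 'object', are chosen only for illustration) and renames rows and columns in one call:

```python
import pandas as pd

# Hypothetical table for illustration
table = pd.DataFrame({'item': ['ball', 'mug', 'pen'],
                      'price': [5.56, 4.20, 1.30]})

renamed = table.rename(index={0: 'first', 1: 'second', 2: 'third'},
                       columns={'item': 'object'})

print(renamed)
print(table.index.tolist())  # the original labels are unchanged
```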
3. Outlier detection and filtering
Use the describe() function to view descriptive statistics for each column
df = DataFrame(np.random.randn(1000,3))
df.head()
df.describe()
Use the std() function to get the standard deviation of each column of the DataFrame
df.std()
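Filter the DataFrame based on each column's standard deviation.
With any(axis=1), a row is selected if the condition holds in at least one of its columns; here the condition is an absolute value greater than 3 standard deviations of that column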
df[(np.abs(df) > (3*df.std())).any(axis=1)]
Drop a specific row by its index label
df.drop(288,inplace=True)

df.shape

df.drop(df[(np.abs(df) > (3*df.std())).any(axis=1)].index,inplace=True)
df.shape
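Because the data above is random, the rows flagged as outliers differ on every run. The sketch below is one reproducible variant under stated assumptions: a fixed seed, a deliberately planted outlier, and a boolean mask instead of an in-place drop (the names rng, data, outlier_mask and cleaned are not from the original notebook). Like the original, it compares absolute values against 3 standard deviations without subtracting the mean, which is reasonable only because the data is centered near zero.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)          # fixed seed so the run is repeatable
data = pd.DataFrame(rng.standard_normal((1000, 3)))
data.iloc[0, 0] = 10.0                  # plant an obvious outlier in row 0

# True for rows where any column exceeds 3 standard deviations in magnitude
outlier_mask = (data.abs() > 3 * data.std()).any(axis=1)

# Keep the remaining rows; data itself is not modified
cleaned = data[~outlier_mask]

print(data.shape, cleaned.shape)
```

Filtering with a mask avoids mutating the original DataFrame, which makes it easier to rerun a notebook cell without dropping rows twice.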
4. Sorting
Use the take() function to reorder rows by position
np.random.permutation() can be used to generate a random order
df = DataFrame(np.arange(25).reshape(5,5))
df
new_order = np.random.permutation(5)
new_order
df.take(new_order)
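To illustrate that take() only reorders rows and keeps their index labels, here is a small sketch with a seeded generator (the names grid, rng, order and shuffled are assumptions for illustration):

```python
import numpy as np
import pandas as pd

grid = pd.DataFrame(np.arange(25).reshape(5, 5))

rng = np.random.default_rng(42)   # seeded so the shuffle is repeatable
order = rng.permutation(5)        # a random ordering of positions 0..4

# Rows are picked by position, in the order given
shuffled = grid.take(order)

# Sorting by index label restores the original table
restored = shuffled.sort_index()
print(restored.equals(grid))
```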
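Random sampling
When the DataFrame is large enough, np.random.randint() can be combined with take() to draw a random sample. Note that randint() draws with replacement, so the same row may be picked more than once.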
sample = np.random.randint(0,len(df),size=3)
sample
df.take(sample)
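If repeated rows are not wanted, sampling without replacement is an alternative. The sketch below contrasts the two, and also shows pandas' own sample() method; the variable names and seeds are assumptions for illustration:

```python
import numpy as np
import pandas as pd

grid = pd.DataFrame(np.arange(25).reshape(5, 5))
rng = np.random.default_rng(0)

# With replacement: positions may repeat
with_repl = grid.take(rng.integers(0, len(grid), size=3))

# Without replacement: every chosen position is distinct
without_repl = grid.take(rng.choice(len(grid), size=3, replace=False))

# pandas' built-in sampler, without replacement by default
builtin = grid.sample(n=3, random_state=0)

print(with_repl.index.tolist())
print(without_repl.index.tolist())
print(builtin.index.tolist())
```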