[Python] Processing City Air-Quality Data (Outlier Handling, Linear Interpolation with interpolate())

1. Source

Course: Big Data Analyst (Session 1) (XuetangX, BUPT, Yang Ya)

Dataset: https://pan.baidu.com/s/1nU29LEfrILve3-ERqccUTQ
Extraction code: 6ptf

2. Study Notes (Guangzhou)

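The 3σ rule: for normally distributed data with mean μ and standard deviation σ,

the probability that a value falls in (μ-σ, μ+σ) is 0.6827

the probability that a value falls in (μ-2σ, μ+2σ) is 0.9545

the probability that a value falls in (μ-3σ, μ+3σ) is 0.9973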
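
The three probabilities above can be checked numerically. The sketch below relies on the identity P(|X - μ| < kσ) = erf(k/√2) for a normal variable; the helper name `prob_within` is made up for illustration and is not part of the course material.

```python
import math

def prob_within(k: float) -> float:
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

# Print the coverage for 1, 2 and 3 standard deviations, rounded to 4 places.
for k in (1, 2, 3):
    print(f"within {k} sigma: {prob_within(k):.4f}")
```

Values outside the 3σ band are rare (about 0.27% under the normal assumption), which is why the scripts below treat them as outliers.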

Data-processing code 1: find the invalid values (negative HUMI) and replace them by linear interpolation


import numpy as np
import pandas as pd

# 1. Read the data
filename = 'GuangzhouPM20100101_20151231.csv'
#df = pd.read_csv(filename,encoding='utf-8',dtype=str)
#df = pd.read_csv(filename,encoding='utf-8')
df = pd.read_csv(filename,encoding='utf-8',usecols=[0,1,2,3,4,5,10])

# 2. Inspect the data
print('head--------------------------------\n',df.head())
print('describe----------------------------\n',df.describe())
# df.info() prints its report itself and returns None, so call it directly
print('info--------------------------------')
df.info()

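# Find the rows where HUMI is below 0 (negative humidity values are invalid)
temp_list = df[df.HUMI < 0].index.tolist()
print(temp_list)

# Mark those rows as missing in a copy of the column.
# .loc avoids the chained assignment df["HUMI_new1"][i] = ..., which pandas
# warns about (SettingWithCopyWarning) and which may not modify df at all.
df["HUMI_new1"] = df["HUMI"]
df.loc[temp_list, "HUMI_new1"] = np.nan
# Fill the gaps by linear interpolation (the default method of interpolate())
df["HUMI_new2"] = df["HUMI_new1"].interpolate()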


# Save the file
df.to_csv('gz1.csv')
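
To make the interpolation step concrete, here is a minimal pure-Python sketch of linear interpolation over equally spaced points, which is how the default `method='linear'` of `interpolate()` treats the values. The function name `interpolate_linear` and the sample readings are made up for illustration; the sketch only fills gaps that have a valid value on both sides and is not pandas' actual implementation.

```python
import math

def interpolate_linear(values):
    """Fill interior NaN runs on a straight line between the nearest valid neighbours."""
    result = list(values)
    valid = [i for i, v in enumerate(result) if not math.isnan(v)]
    # Walk over each pair of consecutive valid positions and fill what lies between.
    for left, right in zip(valid, valid[1:]):
        step = (result[right] - result[left]) / (right - left)
        for j in range(left + 1, right):
            result[j] = result[left] + step * (j - left)
    return result

# Hypothetical humidity readings with two missing values in a row.
humidity = [80.0, float("nan"), float("nan"), 92.0, 90.0]
print(interpolate_linear(humidity))  # [80.0, 84.0, 88.0, 92.0, 90.0]
```

The two missing readings are replaced by 84.0 and 88.0, the evenly spaced points between 80.0 and 92.0.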

Data-processing code 2: find the outliers below the 3σ lower bound (μ-3σ) and correct them

import numpy as np
import pandas as pd

# Read the data
filename = 'gz1.csv'
df = pd.read_csv(filename,encoding='utf-8',usecols=[1,2,3,4,5,6,9])
print('-------------------------head--------------------------\n',df.head())
print('------------------------describe------------------------\n',df.describe())
# df.info() prints its report itself and returns None, so call it directly
print('-------------------------info---------------------------')
df.info()

HUMI_mean = df.HUMI_new2.mean()
HUMI_std = df.HUMI_new2.std()
print(HUMI_mean-3 * HUMI_std, HUMI_mean + 3 *HUMI_std)

# Find the rows of HUMI_new2 that lie below the lower bound mean - 3*std
index_list = df[df.HUMI_new2 < HUMI_mean-3 * HUMI_std].index.tolist()
value_list = df[df.HUMI_new2 < HUMI_mean-3 * HUMI_std]
print("there are {} items:".format(len(index_list)))
print(index_list)
print(value_list)

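# Raise those values to the lower bound mean - 3*std (truncated to an integer).
# .loc is used instead of the chained assignment df["HUMI_new3"][i] = ...
df["HUMI_new3"] = df["HUMI_new2"]
df.loc[index_list, "HUMI_new3"] = int(HUMI_mean-3 * HUMI_std)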

# Save the file
df.to_csv("gz2.csv")
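
The capping step in the script above can be illustrated on a tiny list without pandas. This is a sketch under stated assumptions: the function name `cap_at_lower_bound` and the readings are invented, and unlike the script it keeps the bound as a float instead of truncating it with `int()`.

```python
import statistics

def cap_at_lower_bound(values, k=3.0):
    """Raise every value below mean - k*std up to that bound."""
    mean = statistics.mean(values)
    # Sample standard deviation (n - 1), the same convention as pandas Series.std()
    std = statistics.stdev(values)
    lower = mean - k * std
    return [max(v, lower) for v in values], lower

# Made-up humidity readings: 29 ordinary values and one suspiciously low one.
readings = [80.0] * 29 + [0.0]
capped, lower = cap_at_lower_bound(readings)
print(f"lower bound: {lower:.2f}")
print(f"smallest value after capping: {min(capped):.2f}")
```

Note that the mean and standard deviation are computed with the outlier still included, as in the script, so a single extreme value also widens the band it is tested against.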

3. Assignment (Beijing)


import numpy as np
import pandas as pd

# Read the data
filename = 'BeijingPM20100101_20151231.csv'
df = pd.read_csv(filename,encoding='utf-8')

# Inspect the data
print('-------------------------head--------------------------\n',df.head())
print('------------------------describe------------------------\n',
      df["HUMI"].describe(),df["PRES"].describe(),df["TEMP"].describe())
print('--------------------missing values----------------------\n',
      df.isnull().sum().sort_values(ascending=False))

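# Fill missing values by linear interpolation
df["HUMI"]=df["HUMI"].interpolate()
df["PRES"]=df["PRES"].interpolate()
df["TEMP"]=df["TEMP"].interpolate()
print('--------------------missing values----------------------\n',
      df.isnull().sum().sort_values(ascending=False))
# Cap extreme outliers that exceed mean + 3*std at that bound.
# Only HUMI is shown here; PRES and TEMP work the same way. Judging by the
# describe() output, no value actually exceeds the 3-standard-deviation bound.
HUMI_mean = df.HUMI.mean()
HUMI_std = df.HUMI.std()
index_list = df[df.HUMI > HUMI_mean+3 * HUMI_std].index.tolist()
# .loc is used instead of the chained assignment df["HUMI"][i] = ...
df.loc[index_list, "HUMI"] = int(HUMI_mean+3 * HUMI_std)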


# Cap values above 500 in the PM_Dongsi, PM_Dongsihuan and PM_Nongzhanguan columns at 500
print('------------------------describe(before)------------------------\n',
      df["PM_Dongsi"].describe(),df["PM_Dongsihuan"].describe(),df["PM_Nongzhanguan"].describe())

# A boolean mask with .loc replaces the row-by-row chained assignment
for col in ["PM_Dongsi", "PM_Dongsihuan", "PM_Nongzhanguan"]:
    df.loc[df[col] > 500, col] = 500

print('------------------------describe(after)------------------------\n',
      df["PM_Dongsi"].describe(),df["PM_Dongsihuan"].describe(),df["PM_Nongzhanguan"].describe())

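# Replace every 'cv' in the cbwd column with the value of the following row.
temp_index = df[df.cbwd =='cv' ].index.tolist()
for i in reversed(temp_index):  # iterate backwards so runs of consecutive 'cv' are resolved
    if i + 1 < len(df):  # the last row has no following value to copy
        df.loc[i, "cbwd"] = df.loc[i+1, "cbwd"]
print("after",len(df[df.cbwd =='cv' ].index.tolist()))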

'''# Method 2: iterate over all rows in reverse order
print("before",len(df[df.cbwd =='cv' ].index.tolist()))
for i in reversed(range(len(df) - 1)):  # skip the last row, which has no successor
    if df.loc[i, "cbwd"] =='cv':
        df.loc[i, "cbwd"] = df.loc[i+1, "cbwd"]
print("after",len(df[df.cbwd =='cv' ].index.tolist()))
'''

# Save the file
df.to_csv("bj.csv")
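
As a possible alternative to the reverse loop over 'cv' rows, the same backward fill can be sketched with vectorized pandas calls: `Series.mask` hides the 'cv' entries and `Series.bfill` pulls the next valid value backwards. The sample wind directions are made up for illustration.

```python
import pandas as pd

# Hypothetical wind-direction column containing 'cv' entries.
wind = pd.Series(["NE", "cv", "cv", "SE", "cv", "NW"], name="cbwd")

# Turn 'cv' into missing values, then fill each gap with the next valid value.
filled = wind.mask(wind == "cv").bfill()
print(filled.tolist())  # ['NE', 'SE', 'SE', 'SE', 'NW', 'NW']
```

This is not a drop-in replacement in every case: `bfill()` would also fill values that were already missing in the column, and a 'cv' in the very last row would be left as a missing value because nothing follows it.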