[Study Notes] Reading a Dataset and Manually Handling Outliers: A House Price Prediction Example

Original

2020-06-13 09:50

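Reading the dataset size
First, print the number of samples and the number of attributes in the training and test sets. Note how {} and .format are used together: the value passed to .format replaces the {} placeholder in the string.
The Id attribute is not needed as a feature, so we save it in a variable and then delete the Id column from the original datasets. The drop function deletes rows by default; axis=1 makes it delete columns. The inplace parameter defaults to False, in which case drop returns a new DataFrame and leaves the original untouched; with inplace=True the DataFrame held in memory is modified directly.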

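# Output the size of each set
# (df_train and df_test are assumed to be pandas DataFrames that were loaded earlier)
print('The size of the training set before dropping the Id column is {}'.format(df_train.shape))
print('The size of the test set before dropping the Id column is {}'.format(df_test.shape))

# Save the Id column before removing it
train_ID = df_train['Id']
test_ID = df_test['Id']

# axis=1 drops a column; inplace=True modifies the DataFrame directly
df_train.drop(['Id'], axis=1, inplace=True)
df_test.drop(['Id'], axis=1, inplace=True)

print('\nThe size of the training set after dropping the Id column is {}'.format(df_train.shape))
print('The size of the test set after dropping the Id column is {}'.format(df_test.shape))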
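The behaviour of drop, axis and inplace is easier to see on a small example. The following is only a sketch: the DataFrame `toy` and its values are made up for illustration and are not taken from the house price data.

```python
import pandas as pd

# Hypothetical toy data, not the real house price dataset
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "GrLivArea": [1000, 1500, 2000],
    "SalePrice": [120000, 160000, 210000],
})

# Default axis=0: the labels are row index labels, so this removes the first row
without_first_row = toy.drop([0])

# axis=1: the labels are column names; a new DataFrame is returned, toy is unchanged
without_id = toy.drop(["Id"], axis=1)

# Keep the Id values, then drop the column in place
ids = toy["Id"]
result = toy.drop(["Id"], axis=1, inplace=True)  # inplace=True returns None

print("Shape after dropping Id in place: {}".format(toy.shape))
```

Because inplace=True returns None, writing `toy = toy.drop(..., inplace=True)` would replace the DataFrame with None; use either the in-place form or the assignment form, not both.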

Handling outliers

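First, draw a plot to find the outliers. Calling matplotlib's subplots function returns a fig object and an ax object; usually only the ax object is used. The scatter function takes the data for the x axis and the y axis, and xlabel and ylabel set the text shown on the x axis and the y axis.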

import matplotlib.pyplot as plt

# Scatter plot of living area against sale price
fig, ax = plt.subplots()
ax.scatter(x=df_train['GrLivArea'], y=df_train['SalePrice'])
plt.xlabel('GrLivArea', fontsize=13)
plt.ylabel('SalePrice', fontsize=13)
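As an aside, the same kind of plot can be drawn entirely through the ax object returned by subplots. This is a minimal sketch on made-up numbers (the lists `area` and `price` are hypothetical), and it selects the non-interactive Agg backend only so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt

# Hypothetical values, for illustration only
area = [1000, 1500, 2000, 2500]
price = [120000, 160000, 210000, 260000]

fig, ax = plt.subplots()
points = ax.scatter(area, price)          # draw on this specific Axes
ax.set_xlabel("GrLivArea", fontsize=13)   # Axes-level equivalent of plt.xlabel
ax.set_ylabel("SalePrice", fontsize=13)   # Axes-level equivalent of plt.ylabel
```

Keeping the return value of scatter in a separate name such as `points` avoids overwriting `ax`, which is still needed for the label calls.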

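The plot shows that the two attributes are roughly positively and linearly correlated, so records with a very large GrLivArea but a very low SalePrice are clearly anomalous. These are the two points in the lower-right corner of the plot, and we remove them as outliers.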

# Remove the outliers: very large living area but low sale price
df_train = df_train.drop(df_train[(df_train['GrLivArea'] > 4000) & (df_train['SalePrice'] < 300000)].index)

# Check again after removing the outliers
fig, ax = plt.subplots()
ax.scatter(x=df_train['GrLivArea'], y=df_train['SalePrice'])
plt.xlabel('GrLivArea', fontsize=13)
plt.ylabel('SalePrice', fontsize=13)
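The removal step can also be checked without a plot. The sketch below uses a made-up DataFrame `toy` whose last two rows imitate large but cheap houses; only the thresholds 4000 and 300000 come from the notes above. It shows the drop-by-index form next to an equivalent boolean filter and compares the correlation before and after.

```python
import pandas as pd

# Hypothetical toy data; the last two rows play the role of the outliers
toy = pd.DataFrame({
    "GrLivArea": [1000, 1500, 2000, 2500, 4700, 5600],
    "SalePrice": [120000, 160000, 210000, 260000, 185000, 160000],
})

# Boolean mask: True where both conditions hold
is_outlier = (toy["GrLivArea"] > 4000) & (toy["SalePrice"] < 300000)

# Form used in the notes: drop the rows by their index labels
cleaned = toy.drop(toy[is_outlier].index)

# Equivalent form: keep only the rows where the mask is False
cleaned_alt = toy[~is_outlier]

# Pearson correlation between area and price, before and after removal
corr_before = toy["GrLivArea"].corr(toy["SalePrice"])
corr_after = cleaned["GrLivArea"].corr(cleaned["SalePrice"])

print("Rows removed: {}".format(len(toy) - len(cleaned)))
print("Correlation before: {:.3f}, after: {:.3f}".format(corr_before, corr_after))
```

Printing the number of removed rows is a cheap sanity check: on the real training set it should match the number of points identified in the scatter plot.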
