An Entry-Level Kaggle Competition: House Price Prediction — Data Analysis

This post walks through a classic Kaggle competition: house price prediction. The project is covered in two parts, data analysis and data mining; this is the data analysis part.


Understanding the Competition

Competition Overview

Many factors influence house prices. The dataset for this competition contains 79 variables describing almost every aspect of residential homes in Ames, Iowa, and the task is to predict the final sale price of each home.

Tech Stack

  • Creative feature engineering
  • Advanced regression techniques such as random forest and
    gradient boosting

Goal

Predict the price of each house: for every Id in the test set, submit the corresponding value of the variable SalePrice.

Submission Format

Id,SalePrice
1461,169000.1
1462,187724.1233
1463,175221
etc.
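As a quick sketch (with hypothetical prediction values), a file in exactly this Id,SalePrice layout can be produced with pandas:

```python
import pandas as pd

# Hypothetical predictions for the first three test Ids (1461 onward)
submission = pd.DataFrame({
    'Id': [1461, 1462, 1463],
    'SalePrice': [169000.1, 187724.1233, 175221.0],
})
# index=False keeps the file to exactly the two required columns
submission.to_csv('submission.csv', index=False)
```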

Data Analysis

Data Description

First we import the data and take a look:

import numpy as np
import pandas as pd

train_df = pd.read_csv('./input/train.csv', index_col=0)
test_df = pd.read_csv('./input/test.csv', index_col=0)
train_df.head()


We can see there are 80 columns, that is, 79 features plus the target SalePrice.

Next we merge the training set and the test set. This makes preprocessing more convenient: the features of both sets get transformed into the same format, and once preprocessing is done the two sets are split apart again.

We know that SalePrice, our training target, appears only in the training set and not in the test set, so we need to pull this column out before merging. Before pulling it out, let's first observe it and see what it looks like, i.e. examine its distribution.

prices = pd.DataFrame({'price': train_df['SalePrice'], 'log(price+1)': np.log1p(train_df['SalePrice'])})
prices.hist()


Because the label itself is not smooth (its distribution is heavily right-skewed), we first smooth it toward a normal distribution so the model can learn more accurately. Here I use log1p, i.e. log(x+1). Note that since we smooth the label at this step, we must transform the predictions back when computing the final results; the inverse of log1p() is expm1(), which we will discuss in detail when we use it.
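To see that the transform is reversible, here is a minimal check (with made-up prices) that expm1 undoes log1p:

```python
import numpy as np

prices = np.array([169000.0, 187724.0, 175221.0])  # made-up sale prices
smoothed = np.log1p(prices)     # log(x + 1): compresses the right-skewed tail
recovered = np.expm1(smoothed)  # exp(x) - 1: the exact inverse transform

print(np.allclose(recovered, prices))  # True
```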

Then we pop this column out:

y_train = np.log1p(train_df.pop('SalePrice'))

y_train.head()

Id
1    12.247699
2    12.109016
3    12.317171
4    11.849405
5    12.429220
Name: SalePrice, dtype: float64

Now y_train holds the SalePrice column.

Next we concatenate the two datasets:

df = pd.concat((train_df, test_df), axis=0)

Check the shape:

df.shape

(2919, 79)

df is the merged DataFrame.
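Concatenating on axis=0 stacks the rows, and because the Id index is preserved, the combined frame can later be split back apart on those indices. A toy sketch (with stand-in frames, not the real data):

```python
import pandas as pd

# Stand-in frames: two 'train' rows (Ids 1-2) and two 'test' rows (Ids 1461-1462)
toy_train = pd.DataFrame({'LotArea': [8450, 9600]}, index=[1, 2])
toy_test = pd.DataFrame({'LotArea': [11622, 14267]}, index=[1461, 1462])

combined = pd.concat((toy_train, toy_test), axis=0)

# After preprocessing, recover the two parts via the original indices
back_train = combined.loc[toy_train.index]
back_test = combined.loc[toy_test.index]
print(back_train.shape, back_test.shape)  # (2, 1) (2, 1)
```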


Data Preprocessing

According to the description provided by Kaggle, the features are as follows:

SalePrice - the property's sale price in dollars. This is the target variable that you're trying to predict.
MSSubClass: The building class
MSZoning: The general zoning classification
LotFrontage: Linear feet of street connected to property
LotArea: Lot size in square feet
Street: Type of road access
Alley: Type of alley access
LotShape: General shape of property
LandContour: Flatness of the property
Utilities: Type of utilities available
LotConfig: Lot configuration
LandSlope: Slope of property
Neighborhood: Physical locations within Ames city limits
Condition1: Proximity to main road or railroad
Condition2: Proximity to main road or railroad (if a second is present)
BldgType: Type of dwelling
HouseStyle: Style of dwelling
OverallQual: Overall material and finish quality
OverallCond: Overall condition rating
YearBuilt: Original construction date
YearRemodAdd: Remodel date
RoofStyle: Type of roof
RoofMatl: Roof material
Exterior1st: Exterior covering on house
Exterior2nd: Exterior covering on house (if more than one material)
MasVnrType: Masonry veneer type
MasVnrArea: Masonry veneer area in square feet
ExterQual: Exterior material quality
ExterCond: Present condition of the material on the exterior
Foundation: Type of foundation
BsmtQual: Height of the basement
BsmtCond: General condition of the basement
BsmtExposure: Walkout or garden level basement walls
BsmtFinType1: Quality of basement finished area
BsmtFinSF1: Type 1 finished square feet
BsmtFinType2: Quality of second finished area (if present)
BsmtFinSF2: Type 2 finished square feet
BsmtUnfSF: Unfinished square feet of basement area
TotalBsmtSF: Total square feet of basement area
Heating: Type of heating
HeatingQC: Heating quality and condition
CentralAir: Central air conditioning
Electrical: Electrical system
1stFlrSF: First Floor square feet
2ndFlrSF: Second floor square feet
LowQualFinSF: Low quality finished square feet (all floors)
GrLivArea: Above grade (ground) living area square feet
BsmtFullBath: Basement full bathrooms
BsmtHalfBath: Basement half bathrooms
FullBath: Full bathrooms above grade
HalfBath: Half baths above grade
Bedroom: Number of bedrooms above basement level
Kitchen: Number of kitchens
KitchenQual: Kitchen quality
TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
Functional: Home functionality rating
Fireplaces: Number of fireplaces
FireplaceQu: Fireplace quality
GarageType: Garage location
GarageYrBlt: Year garage was built
GarageFinish: Interior finish of the garage
GarageCars: Size of garage in car capacity
GarageArea: Size of garage in square feet
GarageQual: Garage quality
GarageCond: Garage condition
PavedDrive: Paved driveway
WoodDeckSF: Wood deck area in square feet
OpenPorchSF: Open porch area in square feet
EnclosedPorch: Enclosed porch area in square feet
3SsnPorch: Three season porch area in square feet
ScreenPorch: Screen porch area in square feet
PoolArea: Pool area in square feet
PoolQC: Pool quality
Fence: Fence quality
MiscFeature: Miscellaneous feature not covered in other categories
MiscVal: $Value of miscellaneous feature
MoSold: Month Sold
YrSold: Year Sold
SaleType: Type of sale
SaleCondition: Condition of sale

Now let's analyze the features. Listed above are the target variable SalePrice and 79 features, quite a lot. This analysis step lays the groundwork for the feature engineering to come.
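With this many features, a quick first cut is to separate the categorical columns from the numerical ones. A sketch on a toy frame (the real df works the same way) using select_dtypes:

```python
import pandas as pd

# Toy frame mixing categorical and numerical columns, standing in for df
toy = pd.DataFrame({
    'MSZoning': ['RL', 'RM'],    # categorical (object dtype)
    'LotArea': [8450, 9600],     # numerical
    'Street': ['Pave', 'Pave'],  # categorical
})

numeric_cols = toy.select_dtypes(include='number').columns
object_cols = toy.select_dtypes(include='object').columns
print(list(numeric_cols))  # ['LotArea']
print(list(object_cols))   # ['MSZoning', 'Street']
```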

Let's see which features have missing values:

print(pd.isnull(df).sum())


That output is hard to scan, so let's first look at the 10 features with the most missing values:

df.isnull().sum().sort_values(ascending=False).head(10)


To make this clearer, we use the missing ratio to assess missingness:

df_na = (df.isnull().sum() / len(df)) * 100
df_na = df_na.drop(df_na[df_na == 0].index).sort_values(ascending=False)
missing_data = pd.DataFrame({'Missing Ratio': df_na})
missing_data.head(10)


Visualize it:

import matplotlib.pyplot as plt
import seaborn as sns

f, ax = plt.subplots(figsize=(15,12))
plt.xticks(rotation='90')
sns.barplot(x=df_na.index, y=df_na)
plt.xlabel('Features', fontsize=15)
plt.ylabel('Percent of missing values', fontsize=15)
plt.title('Percent missing data by feature', fontsize=15)


We can see that PoolQC, MiscFeature, Alley, Fence, and FireplaceQu have large fractions of values missing, LotFrontage has a 16.7% missing ratio, and GarageType, GarageFinish, GarageQual, and GarageCond have similar missing ratios. Some of these features are categorical and some are numerical; how to handle their missing values will be covered in the feature engineering part.
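As a preview of the kind of handling involved (a sketch only; the actual choices are made in the feature engineering part), one common scheme fills categorical NAs with an explicit 'None' category and numerical NAs with the column median:

```python
import pandas as pd

# Toy frame with missing values in one categorical and one numerical column
toy = pd.DataFrame({
    'PoolQC': ['Ex', None, None],       # for PoolQC, NA usually means "no pool"
    'LotFrontage': [65.0, None, 80.0],
})

toy['PoolQC'] = toy['PoolQC'].fillna('None')
toy['LotFrontage'] = toy['LotFrontage'].fillna(toy['LotFrontage'].median())
print(toy.isnull().sum().sum())  # 0
```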

Finally, we run a correlation analysis across the features and look at the heatmap:

corrmat = train_df.corr()
plt.subplots(figsize=(15,12))
sns.heatmap(corrmat, vmax=0.9, square=True)


We can see that some features are strongly correlated with each other, which can easily lead to overfitting, so they need to be pruned. In the next post, on data mining, we will process these features and train the models.
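One way to surface the near-duplicate pairs the heatmap hints at (a sketch on toy data; the 0.9 threshold is an arbitrary choice) is to scan the upper triangle of the correlation matrix:

```python
import numpy as np
import pandas as pd

# Toy numeric frame: GarageCars and GarageArea are near-duplicates,
# mimicking the strongly correlated pairs visible in the heatmap
toy = pd.DataFrame({
    'GarageCars': [1, 2, 2, 3, 3],
    'GarageArea': [280, 460, 480, 620, 640],
    'YrSold': [2008, 2006, 2010, 2007, 2009],
})

corrmat = toy.corr()
# Keep the upper triangle only, so each pair is reported once
upper = corrmat.where(np.triu(np.ones(corrmat.shape, dtype=bool), k=1))
high_pairs = [(row, col) for col in upper.columns for row in upper.index
              if pd.notnull(upper.loc[row, col]) and abs(upper.loc[row, col]) > 0.9]
print(high_pairs)  # [('GarageCars', 'GarageArea')]
```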


Corrections and suggestions are welcome.
