An Entry-Level Kaggle Competition: House Price Prediction — Data Analysis

This post walks through a classic Kaggle competition: house price prediction. The project is covered in two parts, data analysis and data mining; this is the data analysis part.


Understanding the Competition

Competition Overview

Many factors influence house prices. The dataset for this competition contains 79 variables describing almost every aspect of residential homes in Ames, Iowa, and the task is to predict the final sale price of each home.

Tech Stack

  • Creative feature engineering
  • Advanced regression techniques such as random forest and
    gradient boosting

Goal

Predict the price of each house: for every Id in the test set, submit the corresponding value of the variable SalePrice.

Submission Format

Id,SalePrice
1461,169000.1
1462,187724.1233
1463,175221
etc.
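As a quick sketch (with hypothetical prediction values), a file in exactly this Id,SalePrice layout can be produced with pandas:

```python
import pandas as pd

# Hypothetical predictions for the first three test Ids (1461 onward)
submission = pd.DataFrame({
    'Id': [1461, 1462, 1463],
    'SalePrice': [169000.1, 187724.1233, 175221.0],
})
# index=False keeps the file to exactly the two required columns
submission.to_csv('submission.csv', index=False)
```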

Data Analysis

Data Description

First we import the data and take a look:

import numpy as np
import pandas as pd

train_df = pd.read_csv('./input/train.csv', index_col=0)
test_df = pd.read_csv('./input/test.csv', index_col=0)
train_df.head()


We can see there are 80 columns, that is, 79 features plus the target SalePrice.

Next we merge the training set and the test set. This makes preprocessing more convenient: the features of both sets get transformed into the same format, and once preprocessing is done the two sets are split apart again.

We know that SalePrice, our training target, appears only in the training set and not in the test set, so we need to pull this column out before merging. Before pulling it out, let's first observe it and see what it looks like, i.e. examine its distribution.

prices = pd.DataFrame({'price': train_df['SalePrice'], 'log(price+1)': np.log1p(train_df['SalePrice'])})
prices.hist()


Because the label itself is not smooth (its distribution is heavily right-skewed), we first smooth it toward a normal distribution so the model can learn more accurately. Here I use log1p, i.e. log(x+1). Note that since we smooth the label at this step, we must transform the predictions back when computing the final results; the inverse of log1p() is expm1(), which we will discuss in detail when we use it.
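To see that the transform is reversible, here is a minimal check (with made-up prices) that expm1 undoes log1p:

```python
import numpy as np

prices = np.array([169000.0, 187724.0, 175221.0])  # made-up sale prices
smoothed = np.log1p(prices)     # log(x + 1): compresses the right-skewed tail
recovered = np.expm1(smoothed)  # exp(x) - 1: the exact inverse transform

print(np.allclose(recovered, prices))  # True
```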

Then we pop this column out:

y_train = np.log1p(train_df.pop('SalePrice'))

y_train.head()

Id
1    12.247699
2    12.109016
3    12.317171
4    11.849405
5    12.429220
Name: SalePrice, dtype: float64

Now y_train holds the SalePrice column.

Next we concatenate the two datasets:

df = pd.concat((train_df, test_df), axis=0)

Check the shape:

df.shape

(2919, 79)

df is the merged DataFrame.
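Concatenating on axis=0 stacks the rows, and because the Id index is preserved, the combined frame can later be split back apart on those indices. A toy sketch (with stand-in frames, not the real data):

```python
import pandas as pd

# Stand-in frames: two 'train' rows (Ids 1-2) and two 'test' rows (Ids 1461-1462)
toy_train = pd.DataFrame({'LotArea': [8450, 9600]}, index=[1, 2])
toy_test = pd.DataFrame({'LotArea': [11622, 14267]}, index=[1461, 1462])

combined = pd.concat((toy_train, toy_test), axis=0)

# After preprocessing, recover the two parts via the original indices
back_train = combined.loc[toy_train.index]
back_test = combined.loc[toy_test.index]
print(back_train.shape, back_test.shape)  # (2, 1) (2, 1)
```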


Data Preprocessing

According to the description provided by Kaggle, the features are as follows:

SalePrice - the property's sale price in dollars. This is the target variable that you're trying to predict.
MSSubClass: The building class
MSZoning: The general zoning classification
LotFrontage: Linear feet of street connected to property
LotArea: Lot size in square feet
Street: Type of road access
Alley: Type of alley access
LotShape: General shape of property
LandContour: Flatness of the property
Utilities: Type of utilities available
LotConfig: Lot configuration
LandSlope: Slope of property
Neighborhood: Physical locations within Ames city limits
Condition1: Proximity to main road or railroad
Condition2: Proximity to main road or railroad (if a second is present)
BldgType: Type of dwelling
HouseStyle: Style of dwelling
OverallQual: Overall material and finish quality
OverallCond: Overall condition rating
YearBuilt: Original construction date
YearRemodAdd: Remodel date
RoofStyle: Type of roof
RoofMatl: Roof material
Exterior1st: Exterior covering on house
Exterior2nd: Exterior covering on house (if more than one material)
MasVnrType: Masonry veneer type
MasVnrArea: Masonry veneer area in square feet
ExterQual: Exterior material quality
ExterCond: Present condition of the material on the exterior
Foundation: Type of foundation
BsmtQual: Height of the basement
BsmtCond: General condition of the basement
BsmtExposure: Walkout or garden level basement walls
BsmtFinType1: Quality of basement finished area
BsmtFinSF1: Type 1 finished square feet
BsmtFinType2: Quality of second finished area (if present)
BsmtFinSF2: Type 2 finished square feet
BsmtUnfSF: Unfinished square feet of basement area
TotalBsmtSF: Total square feet of basement area
Heating: Type of heating
HeatingQC: Heating quality and condition
CentralAir: Central air conditioning
Electrical: Electrical system
1stFlrSF: First Floor square feet
2ndFlrSF: Second floor square feet
LowQualFinSF: Low quality finished square feet (all floors)
GrLivArea: Above grade (ground) living area square feet
BsmtFullBath: Basement full bathrooms
BsmtHalfBath: Basement half bathrooms
FullBath: Full bathrooms above grade
HalfBath: Half baths above grade
Bedroom: Number of bedrooms above basement level
Kitchen: Number of kitchens
KitchenQual: Kitchen quality
TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
Functional: Home functionality rating
Fireplaces: Number of fireplaces
FireplaceQu: Fireplace quality
GarageType: Garage location
GarageYrBlt: Year garage was built
GarageFinish: Interior finish of the garage
GarageCars: Size of garage in car capacity
GarageArea: Size of garage in square feet
GarageQual: Garage quality
GarageCond: Garage condition
PavedDrive: Paved driveway
WoodDeckSF: Wood deck area in square feet
OpenPorchSF: Open porch area in square feet
EnclosedPorch: Enclosed porch area in square feet
3SsnPorch: Three season porch area in square feet
ScreenPorch: Screen porch area in square feet
PoolArea: Pool area in square feet
PoolQC: Pool quality
Fence: Fence quality
MiscFeature: Miscellaneous feature not covered in other categories
MiscVal: $Value of miscellaneous feature
MoSold: Month Sold
YrSold: Year Sold
SaleType: Type of sale
SaleCondition: Condition of sale

Now let's analyze the features. Listed above are the target variable SalePrice and 79 features, quite a lot. This analysis step lays the groundwork for the feature engineering to come.
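With this many features, a quick first cut is to separate the categorical columns from the numerical ones. A sketch on a toy frame (the real df works the same way) using select_dtypes:

```python
import pandas as pd

# Toy frame mixing categorical and numerical columns, standing in for df
toy = pd.DataFrame({
    'MSZoning': ['RL', 'RM'],    # categorical (object dtype)
    'LotArea': [8450, 9600],     # numerical
    'Street': ['Pave', 'Pave'],  # categorical
})

numeric_cols = toy.select_dtypes(include='number').columns
object_cols = toy.select_dtypes(include='object').columns
print(list(numeric_cols))  # ['LotArea']
print(list(object_cols))   # ['MSZoning', 'Street']
```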

Let's see which features have missing values:

print(pd.isnull(df).sum())


That output is hard to scan, so let's first look at the 10 features with the most missing values:

df.isnull().sum().sort_values(ascending=False).head(10)


To make this clearer, we use the missing ratio to assess missingness:

df_na = (df.isnull().sum() / len(df)) * 100
df_na = df_na.drop(df_na[df_na == 0].index).sort_values(ascending=False)
missing_data = pd.DataFrame({'Missing Ratio': df_na})
missing_data.head(10)


Visualize it:

import matplotlib.pyplot as plt
import seaborn as sns

f, ax = plt.subplots(figsize=(15,12))
plt.xticks(rotation='90')
sns.barplot(x=df_na.index, y=df_na)
plt.xlabel('Features', fontsize=15)
plt.ylabel('Percent of missing values', fontsize=15)
plt.title('Percent missing data by feature', fontsize=15)


We can see that PoolQC, MiscFeature, Alley, Fence, and FireplaceQu have large fractions of values missing, LotFrontage has a 16.7% missing ratio, and GarageType, GarageFinish, GarageQual, and GarageCond have similar missing ratios. Some of these features are categorical and some are numerical; how to handle their missing values will be covered in the feature engineering part.
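As a preview of the kind of handling involved (a sketch only; the actual choices are made in the feature engineering part), one common scheme fills categorical NAs with an explicit 'None' category and numerical NAs with the column median:

```python
import pandas as pd

# Toy frame with missing values in one categorical and one numerical column
toy = pd.DataFrame({
    'PoolQC': ['Ex', None, None],       # for PoolQC, NA usually means "no pool"
    'LotFrontage': [65.0, None, 80.0],
})

toy['PoolQC'] = toy['PoolQC'].fillna('None')
toy['LotFrontage'] = toy['LotFrontage'].fillna(toy['LotFrontage'].median())
print(toy.isnull().sum().sum())  # 0
```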

Finally, we run a correlation analysis across the features and look at the heatmap:

corrmat = train_df.corr()
plt.subplots(figsize=(15,12))
sns.heatmap(corrmat, vmax=0.9, square=True)


We can see that some features are strongly correlated with each other, which can easily lead to overfitting, so they need to be pruned. In the next post, on data mining, we will process these features and train the models.
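One way to surface the near-duplicate pairs the heatmap hints at (a sketch on toy data; the 0.9 threshold is an arbitrary choice) is to scan the upper triangle of the correlation matrix:

```python
import numpy as np
import pandas as pd

# Toy numeric frame: GarageCars and GarageArea are near-duplicates,
# mimicking the strongly correlated pairs visible in the heatmap
toy = pd.DataFrame({
    'GarageCars': [1, 2, 2, 3, 3],
    'GarageArea': [280, 460, 480, 620, 640],
    'YrSold': [2008, 2006, 2010, 2007, 2009],
})

corrmat = toy.corr()
# Keep the upper triangle only, so each pair is reported once
upper = corrmat.where(np.triu(np.ones(corrmat.shape, dtype=bool), k=1))
high_pairs = [(row, col) for col in upper.columns for row in upper.index
              if pd.notnull(upper.loc[row, col]) and abs(upper.loc[row, col]) > 0.9]
print(high_pairs)  # [('GarageCars', 'GarageArea')]
```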


Corrections and suggestions are welcome.
