[B4] Lianjia Second-Hand Housing Price Prediction

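“This post shares an entry-level data analysis project that walks through the main stages of a complete project. The data was scraped directly from the Lianjia website, and that part is not covered here. There is still plenty of room for improvement, so feedback is very welcome.”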

1. Data preprocessing
First, import the scientific computing packages.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt

from IPython.display import display
plt.style.use('fivethirtyeight')
sns.set_style({'font.sans-serif':['simhei','Arial']})
%matplotlib inline

Read the data, take a first look at it, check for missing values and outliers, and compute descriptive statistics.

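# Load the data and show the first three rows
# (raw string so the backslashes in the Windows path are not treated as escapes)
lianjia_df = pd.read_csv(r"C:\Jupyter_working_path\Projects\lianjia.csv")
display(lianjia_df.head(3))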

A first look shows 11 feature variables, with Price as the target variable.

# Check for missing values
lianjia_df.info()

The Elevator feature clearly has missing values.
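The `info()` output has to be read by eye. As a minimal sketch, the same check can be done by counting missing values per column. The `toy` frame below is a made-up stand-in for the real data, not the author's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the listings table; all values are made up.
toy = pd.DataFrame({
    "Region": ["Haidian", "Chaoyang", "Xicheng", "Dongcheng"],
    "Size": [60.0, 89.5, 120.0, 45.0],
    "Elevator": ["有電梯", np.nan, np.nan, "無電梯"],
})

# Count missing values per column and keep only the columns that have any.
missing = toy.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```

On the real data, the same two lines applied to `lianjia_df` should list Elevator together with its missing count.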

# Descriptive statistics

lianjia_df.describe()

As shown, the Size feature has a maximum of 1019 square metres and a minimum of 2 square metres.

# Add a price-per-square-metre feature
df = lianjia_df.copy()
df['PerPrice'] = lianjia_df['Price']/lianjia_df['Size']
# Reorder the columns
columns = ['Region','District','Garden','Layout','Floor','Year','Size','Elevator','Direction','Renovation','PerPrice','Price']
df = pd.DataFrame(df, columns = columns)
# Look at the dataset again
display(df.head(3))

2. Feature analysis
Next, the feature variables are analysed one by one.
(1) Region feature analysis

# Group by region to compare the number of listings and the price per square metre
df_house_count = df.groupby('Region')['Price'].count().sort_values(ascending=False).to_frame().reset_index()
df_house_mean = df.groupby('Region')['PerPrice'].mean().sort_values(ascending=False).to_frame().reset_index()

f, [ax1, ax2, ax3] = plt.subplots(3, 1, figsize=(20, 15))

# Mean price per square metre by region
sns.barplot(x='Region', y='PerPrice', palette="Blues_d", data=df_house_mean, ax=ax1)
ax1.set_title('Price per square metre of second-hand homes by Beijing district', fontsize=15)
ax1.set_xlabel('District')
ax1.set_ylabel('Price per square metre')

# Number of listings by region
sns.barplot(x='Region', y='Price', palette="Greens_d", data=df_house_count, ax=ax2)
ax2.set_title('Number of second-hand homes by Beijing district', fontsize=15)
ax2.set_xlabel('District')
ax2.set_ylabel('Count')

# Distribution of total price by region
sns.boxplot(x='Region', y='Price', data=df, ax=ax3)
ax3.set_title('Total price of second-hand homes by Beijing district', fontsize=15)
ax3.set_xlabel('District')
ax3.set_ylabel('Total price')

plt.savefig(r"C:\Jupyter_working_path\Projects\picture")

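The region plots are drawn directly with seaborn. Colours are set through the palette parameter; the palettes are gradients, with lighter shades standing for smaller values.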

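The plots show:
1) Price per square metre: Xicheng is the most expensive at about 110,000 yuan per square metre, followed by Dongcheng at about 100,000 and Haidian at about 85,000; every other district is below 80,000.
2) Number of listings: Haidian and Chaoyang have the most second-hand listings, which suggests demand there is also high.
3) Total price: the median total price in every district is below 10 million yuan, while totals in Xicheng reach as high as 60 million.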
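The three comparisons above can also be collected into one table. The following is a minimal sketch using pandas named aggregation on a made-up `toy` frame; the column names follow the post, but the numbers and the output column names (`mean_per_price`, `listings`, `median_price`) are illustrative assumptions.

```python
import pandas as pd

# Hypothetical listings; the numbers are made up for illustration only.
toy = pd.DataFrame({
    "Region": ["Xicheng", "Xicheng", "Haidian", "Haidian", "Haidian"],
    "Price": [900.0, 1500.0, 600.0, 700.0, 800.0],
    "Size": [80.0, 100.0, 75.0, 70.0, 100.0],
})
toy["PerPrice"] = toy["Price"] / toy["Size"]

# One row per region: mean unit price, listing count, median total price.
summary = (
    toy.groupby("Region")
       .agg(mean_per_price=("PerPrice", "mean"),
            listings=("Price", "count"),
            median_price=("Price", "median"))
       .sort_values("mean_per_price", ascending=False)
)
print(summary)
```

Having the numbers in a table makes it easier to quote exact values than reading them off the bar charts.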

(2) Size feature analysis

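f, [ax1, ax2] = plt.subplots(1, 2, figsize=(15, 5))

# Distribution of home size
# (histplot and fill= replace the deprecated distplot and shade=; needs seaborn 0.11+)
sns.histplot(x=df['Size'], bins=20, stat='density', ax=ax1, color='r')
sns.kdeplot(x=df['Size'], fill=True, ax=ax1)

# Relationship between home size and sale price
sns.regplot(x='Size', y='Price', data=df, ax=ax2)

plt.savefig(r"C:\Jupyter_working_path\Projects\pictures")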

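Exploring the relationship between Size and Price: the regplot scatter plot shows that Size and Price are roughly linearly related, that is, larger homes cost more. There are some obvious outliers, though, that need a closer look: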

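# Listings with an implausibly small size
df.loc[df['Size'] < 10]

# Listings with an implausibly large size
df.loc[df['Size'] > 1000]

# Drop '疊拼別墅' (stacked villa) layouts and listings of 1000 square metres or more
df = df[(df['Layout'] != '疊拼別墅') & (df['Size'] < 1000)]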
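The roughly linear relationship between Size and Price can also be put into a number with the Pearson correlation coefficient. This is a small sketch on made-up values, not a result from the Lianjia data.

```python
import pandas as pd

# Hypothetical sizes and total prices; made up for illustration.
toy = pd.DataFrame({
    "Size": [40.0, 60.0, 80.0, 100.0, 120.0],
    "Price": [260.0, 390.0, 500.0, 640.0, 790.0],
})

# Series.corr computes the Pearson correlation by default.
r = toy["Size"].corr(toy["Price"])
print(f"Pearson r between Size and Price: {r:.3f}")
```

Running the same call on `df` before and after removing the outliers would show how much those few rows affect the fit.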

(3) Layout feature analysis

f, ax1 = plt.subplots(figsize=(20,20))
sns.countplot(y='Layout', data = df, ax=ax1)
ax1.set_title('Home layouts',fontsize=15)
ax1.set_xlabel('Count')
ax1.set_ylabel('Layout')
plt.savefig(r"C:\Jupyter_working_path\Projects\picture1")

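Layouts with 2 bedrooms and 1 living room make up the bulk of the listings, followed by 3 bedrooms and 1 living room, 2 bedrooms and 2 living rooms, and 3 bedrooms and 2 living rooms.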

(4) Renovation feature analysis

df['Renovation'].value_counts()

# Set up the figure
f,[ax1,ax2,ax3] = plt.subplots(1,3,figsize=(20,5))
# Pass the column by keyword; recent seaborn versions no longer accept it positionally
sns.countplot(x='Renovation', data=df, ax=ax1)
sns.barplot(x='Renovation',y='Price', data=df, ax=ax2)
sns.boxplot(x='Renovation',y='Price', data=df, ax=ax3)
plt.savefig(r"C:\Jupyter_working_path\Projects\picture3")


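Fully renovated homes are the most numerous, followed by simply renovated ones. Unfinished (bare-shell) homes have the highest price, followed by fully renovated ones.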
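One caveat when reading the bar chart: `sns.barplot` draws the mean of each group by default, so a handful of very expensive listings can lift a whole category. The sketch below compares mean and median per category. The category labels and prices are made up, so it only shows the kind of check that might be worth running, not a finding about the real data.

```python
import pandas as pd

# Hypothetical labels and prices; one very expensive "rough" listing skews the mean.
toy = pd.DataFrame({
    "Renovation": ["rough", "rough", "rough", "refined", "refined", "refined"],
    "Price": [300.0, 350.0, 2500.0, 600.0, 650.0, 700.0],
})

# Count, mean and median of total price for each renovation category.
by_reno = toy.groupby("Renovation")["Price"].agg(["count", "mean", "median"])
print(by_reno)
```

If the mean and median rank the categories differently on the real data, the box plot is probably the more reliable panel to read.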

(5) Elevator feature analysis

Check the missing values.

# Count the rows where Elevator is missing
misn = df['Elevator'].isnull().sum()
print('Number of missing Elevator values: ' + str(misn))

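With this many missing values, simply dropping the rows is not an option, so they are filled in instead.
Whether a home has an elevator is inferred from the floor: buildings above 6 floors generally have one, and those with 6 floors or fewer generally do not.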

# A few rows hold invalid values; keep only the two valid labels and set the rest to NaN
df['Elevator'] = df['Elevator'].where(df['Elevator'].isin(['有電梯', '無電梯']))

# Fill missing values ('有電梯' = has elevator, '無電梯' = no elevator)
df.loc[(df['Floor']>6)&(df['Elevator'].isnull()), 'Elevator'] ='有電梯'
df.loc[(df['Floor']<=6)&(df['Elevator'].isnull()), 'Elevator'] ='無電梯'
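To check that the fill rule behaves as intended at the boundary (floor 6 versus floor 7), it can be wrapped in a small function and tried on a toy frame. `fill_elevator` and its `threshold` parameter are hypothetical names introduced for this sketch; the labels match those used above.

```python
import numpy as np
import pandas as pd

def fill_elevator(frame, threshold=6):
    """Return a copy with missing Elevator values inferred from Floor (hypothetical helper)."""
    out = frame.copy()
    missing = out["Elevator"].isnull()
    # Above the threshold: assume an elevator; at or below it: assume none.
    out.loc[missing & (out["Floor"] > threshold), "Elevator"] = "有電梯"
    out.loc[missing & (out["Floor"] <= threshold), "Elevator"] = "無電梯"
    return out

# Made-up rows: three missing values, two already labelled.
toy = pd.DataFrame({
    "Floor": [3, 6, 7, 18, 5],
    "Elevator": [np.nan, np.nan, np.nan, "有電梯", "無電梯"],
})
filled = fill_elevator(toy)
print(filled)
```

Existing labels are left untouched; only the rows that were missing are filled, and the input frame is not modified.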

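f, [ax1,ax2] = plt.subplots(1,2,figsize=(20,10))
# Pass the column by keyword; recent seaborn versions no longer accept it positionally
sns.countplot(x='Elevator', data=df, ax=ax1)
ax1.set_title('Number of homes with and without an elevator',fontsize=15)
ax1.set_xlabel('Has elevator')
ax1.set_ylabel('Count')
sns.barplot(x='Elevator',y='Price', data=df, ax=ax2)
ax2.set_title('Price of homes with and without an elevator',fontsize=15)
ax2.set_xlabel('Has elevator')
ax2.set_ylabel('Total price')
plt.show()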

As shown, homes with an elevator are more numerous, and they are also more expensive, which is easy to understand.

(6) Year feature analysis

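# One scatter panel per (Elevator, Renovation) pair; `height` replaces the removed `size` argument
grid = sns.FacetGrid(df, row='Elevator', col='Renovation', palette='seismic', height=4)
grid.map(plt.scatter, 'Year', 'Price')
grid.add_legend()
plt.savefig(r"C:\Jupyter_working_path\Projects\picture6")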

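Using FacetGrid to analyse the Year feature conditioned on Renovation and Elevator shows that:
prices trend upward with the year of construction;
homes built after 2000 are noticeably more expensive than earlier ones;
there are almost no elevator listings from before 1980, which suggests elevators were not yet commonly installed at that time.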
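The trend read off the scatter panels can be checked numerically by bucketing Year into decades and taking the median price per bucket. This is a sketch on made-up values; the `Decade` column is a helper introduced here, not part of the original dataset.

```python
import pandas as pd

# Hypothetical construction years and total prices; made up for illustration.
toy = pd.DataFrame({
    "Year": [1978, 1985, 1992, 1998, 2003, 2008, 2012],
    "Price": [380.0, 400.0, 450.0, 470.0, 620.0, 680.0, 750.0],
})

# Floor each year to the start of its decade, e.g. 1998 -> 1990.
toy["Decade"] = toy["Year"] // 10 * 10

# Median total price per decade of construction.
by_decade = toy.groupby("Decade")["Price"].median()
print(by_decade)
```

Grouping additionally by Elevator would show in which decade elevator listings start to appear.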

(7) Floor feature analysis

f, ax1=plt.subplots(figsize=(20,5))
sns.countplot(x='Floor',data=df, ax=ax1)
ax1.set_title('Number of homes by floor',fontsize=15)
ax1.set_xlabel('Floor')
ax1.set_ylabel('Count')
plt.savefig(r"C:\Jupyter_working_path\Projects\picture8")

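As shown, 6th-floor listings are the most numerous. Following the Chinese saying "seven up, eight down" (七上八下), floor 7 is evidently more popular than floor 8, and floors 4 and 18 are generally unpopular. Too many factors influence the floor feature to analyse them one by one here.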

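That is all for this post. Several of the features could be analysed in more depth, and working through this project has sharpened my data analysis thinking. Feature engineering is a complex task, and I still have a lot to learn.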
