Preface

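I once heard someone say that China's economy is its real estate market, while America's economy is its stock market. China's real estate market is worth more than 400 trillion, as much as the property of the United States, the European Union, and Japan combined, yet its stock market is only 50 trillion, less than one tenth of the US, EU, and Japanese markets. Real estate clearly holds an outsized place in China. For young first-time buyers like us, who can hardly afford a home in a first-tier city, this is a real headache. That is where the idea of analyzing and predicting housing prices came from (I tried this once with R; this time I will use Python).
This analysis uses Beijing housing prices as test data. Later I plan to scrape data for other cities, including Beijing, Shanghai, Guangzhou, and Shenzhen, for further analysis.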

Data

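Thanks to Qichen Qiu for providing the Lianjia 2011-2017 Beijing housing price data, and to Jonathan Bouchet for the approach.
This analysis is based on Python 3; the code will be cleaned up and posted on GitHub later.
The data features are listed below. Kaggle has a detailed description, so I will not repeat it here: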

url: the URL from which the data was fetched (character)
id: the id of the transaction (character)
Lng, Lat: longitude and latitude coordinates, using the BD09 protocol (numerical)
Cid: community id (numerical)
tradeTime: the time of the transaction (character)
DOM: active days on market (numerical)
followers: the number of people following the transaction (numerical)
totalPrice: the total price (numerical)
price: the average price per unit of area (numerical)
square: the area of the house (numerical)
livingRoom: the number of living rooms (character)
drawingRoom: the number of drawing rooms (character)
kitchen: the number of kitchens (numerical)
bathRoom: the number of bathrooms (character)
floor: the height of the house. The Chinese characters will be converted to English in the next version of the dataset. (character)
buildingType: tower (1), bungalow (2), combination of plate and tower (3), plate (4) (numerical)
constructionTime: the time of construction (numerical)
renovationCondition: other (1), rough (2), simplicity (3), hardcover (4) (numerical)
buildingStructure: unknown (1), mixed (2), brick and wood (3), brick and concrete (4), steel (5), steel-concrete composite (6) (numerical)
ladderRatio: the ratio between the number of residents on the same floor and the number of elevators (ladders); it describes how many elevators a resident has on average (numerical)
elevator: has an elevator (1) or does not have one (0) (numerical)
fiveYearsProperty: whether the owner has held the property for less than 5 years (numerical)

EDA

With the data understood, start with exploratory analysis and check for missing values:

url                       0
id                        0
Lng                       0
Lat                       0
Cid                       0
tradeTime                 0
DOM                       0
followers                 0
totalPrice                0
price                     0
square                    0
livingRoom                0
drawingRoom               0
kitchen                   0
bathRoom                  0
floor                     0
buildingType           2021
constructionTime          0
renovationCondition       0
buildingStructure         0
ladderRatio               0
elevator                 32
fiveYearsProperty        32
subway                   32
district                  0
communityAverage        463
get_floor                32
province                  0
dtype: int64
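
A listing like the one above is what pandas prints for per-column null counts. Here is a minimal sketch of that check; the toy frame and its values are made up purely for illustration, and only the name source_data is borrowed from this post.

```python
import pandas as pd

# Toy stand-in for the real data; the values are invented for illustration.
source_data = pd.DataFrame({
    "price": [31680.0, 43436.0, None],
    "buildingType": [1.0, None, None],
    "district": [7, 7, 1],
})

# Count missing values per column; the result is an int64 Series,
# printed in the same "name  count" layout as the listing above.
missing_counts = source_data.isnull().sum()
print(missing_counts)
```

Columns with a non-zero count are the ones that need a fill or drop decision in the next step.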

View the missing values graphically with msno (the missingno library):

msno.matrix(source_data)

Fill in or drop missing values, feature by feature:

import pandas as pd

# makeBuildingType, makeRenovationCondition and makeBuildingStructure are helper
# functions (not shown here) that map the numeric codes to readable labels.
# Fill missing days-on-market with the median.
test_data = test_data.fillna({'DOM': test_data['DOM'].median()})
test_data['buildingType'] = [makeBuildingType(x) for x in test_data['buildingType']]
# Drop rows whose building type is miscoded or missing; copy so later assignments are safe.
test_data = test_data[(test_data['buildingType'] != 'wrong_coded') & (test_data['buildingType'] != 'missing')].copy()
test_data['renovationCondition'] = [makeRenovationCondition(x) for x in test_data['renovationCondition']]
test_data['buildingStructure'] = [makeBuildingStructure(x) for x in test_data['buildingStructure']]
test_data['elevator'] = ['has_elevator' if x==1 else 'no_elevator' for x in test_data['elevator']]
test_data['subway'] = ['has_subway' if x==1 else 'no_subway' for x in test_data['subway']]
test_data['fiveYearsProperty'] = ['owner_less_5y' if x==1 else 'owner_more_5y' for x in test_data['fiveYearsProperty']]
# Drop rows whose construction time is '未知' (unknown); the numeric conversion happens later.
test_data = test_data[test_data['constructionTime'] != '未知'].copy()
# print(test_data['constructionTime'].value_counts())
# astype returns a new Series, so assign it back.
test_data['district'] = test_data['district'].astype('category')
print(test_data['district'].value_counts())
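
The helper functions are not shown in this post, so here is a rough sketch of what makeBuildingType might look like, based on the code list in the Data section. The labels 'wrong_coded' and 'missing' come from the filtering step above; the other label names and the BUILDING_TYPE_LABELS dictionary are my own assumptions.

```python
import math

# Hypothetical mapping from the documented buildingType codes to labels.
# The label spellings are assumptions, not taken from the original code.
BUILDING_TYPE_LABELS = {1: "tower", 2: "bungalow", 3: "plate_tower", 4: "plate"}

def makeBuildingType(x):
    # Missing values (None or NaN) get their own label.
    if x is None or (isinstance(x, float) and math.isnan(x)):
        return "missing"
    # Anything other than the four documented codes is treated as a coding error.
    return BUILDING_TYPE_LABELS.get(x, "wrong_coded")

print([makeBuildingType(v) for v in [1, 4.0, float("nan"), 0.5]])
```

makeRenovationCondition and makeBuildingStructure would presumably follow the same pattern with their own code lists.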

The result is as follows:

Now let's look at the prices.
First, the overall picture:

Then the picture for 2017:

Prices overall follow a positively skewed (right-skewed) distribution.
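
Positive skew can also be checked numerically rather than by eye. The sketch below uses synthetic log-normal numbers only to show the check; they are not the Lianjia prices, and the variable names are my own.

```python
import numpy as np
import pandas as pd

# Synthetic, log-normally distributed values standing in for prices.
# They are generated for illustration and are NOT taken from the dataset.
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=6.0, sigma=0.5, size=10_000))

# Sample skewness: a value above 0 means a longer right tail.
skewness = prices.skew()
# In a right-skewed distribution the mean usually sits above the median.
print(skewness, prices.mean() > prices.median())
```

On the real data the same two lines would presumably be applied to test_data['totalPrice'] or test_data['price'].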
Process the data a little more and look at the correlations:

test_data['tradeTime'] = pd.to_datetime(test_data['tradeTime'])
test_data['constructionTime'] = pd.to_numeric(test_data['constructionTime'])
test_data['livingRoom'] = pd.to_numeric(test_data['livingRoom'])
test_data['drawingRoom'] = pd.to_numeric(test_data['drawingRoom'])
test_data['bathRoom'] = pd.to_numeric(test_data['bathRoom'])
test_data['get_floor'] = pd.to_numeric(test_data['get_floor'])
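
Once the columns are numeric, the correlations with total price could be computed roughly like this. The toy frame below is invented for illustration; on the real data the same calls would presumably run on test_data.

```python
import pandas as pd

# Toy numbers for illustration only.
toy = pd.DataFrame({
    "square": [50.0, 80.0, 100.0, 120.0],
    "totalPrice": [200.0, 330.0, 390.0, 500.0],
    "followers": [10, 3, 8, 1],
})

# Keep numeric columns only, then take the Pearson correlation with totalPrice.
corr_with_total = (
    toy.select_dtypes(include="number")
       .corr()["totalPrice"]
       .sort_values(ascending=False)
)
print(corr_with_total)
```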

Pick a few features to look at:
communityAverage:

square:

Now elevators and districts:

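With housing prices there really is no highest, only higher! In general, analyzing district by district is more intuitive and more accurate. Haidian and Tongzhou, for example, differ considerably.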

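Features such as the number of bedrooms and bathrooms are basically all positively correlated with total price. Records with a unit price or total price of 0 still need checking: they may be missing values, or listings that are already sold or pending. Keep this in mind when scraping prices yourself.
Finally, look at the average price and the count: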

price_data = test_data[['price','year-month']]
price_data.head()
price_group = price_data.groupby(['year-month']).agg(['mean','count'])
price_group.head()
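
The 'year-month' column is not created in any of the code shown here. One way it might be derived from tradeTime is sketched below; this derivation is an assumption on my part, and the toy rows are invented.

```python
import pandas as pd

# Toy transactions for illustration only.
toy = pd.DataFrame({
    "tradeTime": pd.to_datetime(["2017-01-05", "2017-01-20", "2017-02-11"]),
    "price": [60000.0, 62000.0, 65000.0],
})

# Assumed derivation: format the transaction date as 'YYYY-MM'.
toy["year-month"] = toy["tradeTime"].dt.strftime("%Y-%m")

# Same aggregation as above: mean price and number of transactions per month.
price_group = toy[["price", "year-month"]].groupby("year-month").agg(["mean", "count"])
print(price_group)
```

The result has two-level column labels, so the monthly mean is addressed as price_group[('price', 'mean')].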

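A turning point appears in mid-2017; the size of each point represents the count for that price.
At this point I really want to add the 2018 data and take a look!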

Regression

Let's try multiple linear regression:
Train on the data from before 2017 and predict the 2017 prices.
Train the model:

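from sklearn.linear_model import LinearRegression

# temp_train / temp_test hold the pre-2017 and 2017 rows; the remaining columns must be numeric.
x_train = temp_train.drop(['tradeTime','totalPrice','floor','province'], axis=1)
y_train = temp_train[['totalPrice']]
# The `normalize` argument was removed in scikit-learn 1.2; leaving it out matches normalize=False.
model = LinearRegression(fit_intercept=True, copy_X=True, n_jobs=1)
model.fit(x_train, y_train)

x_test = temp_test.drop(['tradeTime','totalPrice','floor','province'], axis=1)
y_test = temp_test[['totalPrice']]
# score() returns the coefficient of determination (R^2) on the test set.
print(model.score(x_test, y_test))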
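
The code above takes temp_train and temp_test as given. A minimal sketch of how the split by year might be built is shown here; the toy rows are invented, and the real frames would presumably come from the cleaned test_data.

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    "tradeTime": pd.to_datetime(["2015-06-01", "2016-11-30", "2017-03-15", "2017-08-02"]),
    "totalPrice": [300.0, 420.0, 510.0, 495.0],
})

trade_year = toy["tradeTime"].dt.year
temp_train = toy[trade_year < 2017]    # everything before 2017 is used for fitting
temp_test = toy[trade_year == 2017]    # 2017 is held out for evaluation
print(len(temp_train), len(temp_test))
```

Splitting by time rather than at random keeps the evaluation honest: the model never sees any 2017 transactions during training.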

The first fit gives a goodness of fit (R²) of 0.7971355163827527.

The predicted values are generally too low.
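
One quick way to put a number on "generally too low" is the mean signed error, sketched below. The function name mean_signed_error is my own, and the sample numbers are made up; they are not results from the model above.

```python
import numpy as np

def mean_signed_error(y_true, y_pred):
    """Mean of (predicted - actual); a negative value means under-prediction on average."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.mean(y_pred - y_true))

# Made-up numbers for illustration only.
actual = [500.0, 600.0, 700.0]
predicted = [450.0, 580.0, 690.0]
bias = mean_signed_error(actual, predicted)
print(bias)
```

With the fitted model this would be called as mean_signed_error(y_test, model.predict(x_test)).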
Later I will try more feature combinations and parameters, and use different methods for prediction.

