數據集下載鏈接：https://pan.baidu.com/s/13OtaUv6j4x8dD7cgD4sL5g
提取碼：7tze

5.10 房租租⾦模型預測

1 項⽬背景

當今社會，房屋租⾦由裝修情況、位置地段、戶型格局、交通便利程度、市場供需量等多⽅⾯因素綜合

決定，對於租房這個相對傳統的⾏業來說，信息嚴重不對稱⼀直存在。

⼀⽅⾯，房東不瞭解租房的市場真實價格，只能忍痛空置⾼租⾦的房屋；

另⼀⽅⾯，租客也找不到滿⾜⾃⼰需求⾼性價⽐房屋，這造成了租房資源的極⼤浪費。

本項⽬將基於租房市場的痛點，提供脫敏處理後的真實租房市場數據。⼤家需要利⽤有⽉租⾦標籤的歷

史數據建⽴模型，實現基於房屋基本信息的住房⽉租⾦預測，爲該城市租房市場提供客觀衡量標準。

2 任務

數據爲某地3個⽉的房屋租賃價格以及房屋的基本信息，我們對數據做了脫敏處理。

⼤家需要利⽤訓練集中的房屋信息和⽉租⾦訓練模型，利⽤測試集中的房屋信息對測試集數據中的房屋

的⽉租⾦進⾏預測。

3 數據

數據分爲兩組，分別是訓練集和測試集。

訓練集爲前3個⽉採集的數據，共150539條。具體數據示例如下圖：

測試集爲第3個⽉採集到的部分數據，相對於訓練集，增加了“id”字段，爲房屋的唯⼀id，且⽆“⽉

租⾦”字段，其它字段與訓練集相同，共46000條。具體數據示例如下圖：

4 評分標準

4.1 評價標準

算法通過計算預測值和真實房租⽉租⾦的均⽅根誤差來衡量回歸模型的優劣。均⽅根誤差越⼩，說明回

歸模型越好。

均⽅根誤差計算公式如下

庫安裝：pip install xgboost

數據初步分析

In [1]:

import matplotlib.pyplot as plt

import pandas as pd

import numpy as np

import seaborn as sns

import warnings

warnings.filterwarnings('ignore')  # 忽略一些警告

導入數據

In [2]:

train=pd.read_csv("data/train.csv")

test=pd.read_csv("data/test.csv")

數據探索

基本信息

In [3]:

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150539 entries, 0 to 150538
Data columns (total 19 columns):
時間          150539 non-null int64
小區名         150539 non-null int64
小區房屋出租數量    149571 non-null float64
樓層          150539 non-null int64
總樓層         150539 non-null float64
房屋面積        150539 non-null float64
房屋朝向        150539 non-null object
居住狀態        15979 non-null float64
臥室數量        150539 non-null int64
廳的數量        150539 non-null int64
衛的數量        150539 non-null int64
出租方式        19576 non-null float64
區           150522 non-null float64
位置          150522 non-null float64
地鐵線路        70180 non-null float64
地鐵站點        70180 non-null float64
距離          70180 non-null float64
裝修情況        14604 non-null float64
月租金         150539 non-null float64
dtypes: float64(12), int64(6), object(1)
memory usage: 21.8+ MB

In [4]:

train.describe()

Out[4]:

	時間	小區名	小區房屋出租數量	樓層	總樓層	房屋面積	居住狀態	臥室數量	廳的數量	衛的數量	出租方式	區	位置	地鐵線路	地鐵站點	距離	裝修情況	月租金
count	150539.000000	150539.000000	149571.000000	150539.000000	150539.000000	150539.000000	15979.000000	150539.000000	150539.000000	150539.000000	19576.000000	150522.000000	150522.000000	70180.000000	70180.000000	70180.000000	14604.000000	150539.000000
mean	1.844871	3233.610035	0.120978	0.955852	0.406459	0.013156	2.722761	2.229854	1.303563	1.223291	0.917705	7.906731	67.937923	3.252707	57.571915	0.551246	3.600110	7.962330
std	0.704477	2020.913396	0.129586	0.851612	0.183616	0.007551	0.669594	0.893350	0.612709	0.487023	0.274820	4.010860	43.515929	1.471257	35.141576	0.246250	2.008348	6.314068
min	1.000000	0.000000	0.007812	0.000000	0.000000	0.000166	1.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1.000000	1.000000	0.001667	1.000000	0.000000
25%	1.000000	1394.000000	0.039062	0.000000	0.290909	0.009268	3.000000	2.000000	1.000000	1.000000	1.000000	4.000000	33.000000	2.000000	23.000000	0.356667	2.000000	4.923599
50%	2.000000	3092.000000	0.082031	1.000000	0.418182	0.012910	3.000000	2.000000	1.000000	1.000000	1.000000	9.000000	61.000000	4.000000	59.000000	0.554167	2.000000	6.621392
75%	2.000000	5199.000000	0.156250	2.000000	0.563636	0.014896	3.000000	3.000000	2.000000	1.000000	1.000000	11.000000	102.000000	5.000000	87.000000	0.745000	6.000000	8.998302
max	3.000000	6627.000000	1.000000	2.000000	1.000000	1.000000	3.000000	11.000000	8.000000	8.000000	1.000000	14.000000	152.000000	5.000000	119.000000	1.000000	6.000000	100.000000

In [5]:

train.shape

Out[5]:

(150539, 19)

In [6]:

test.shape

Out[6]:

(46000, 19)

In [7]:

train.head()

Out[7]:

	時間	小區名	小區房屋出租數量	樓層	總樓層	房屋面積	房屋朝向	居住狀態	臥室數量	廳的數量	衛的數量	出租方式	區	位置	地鐵線路	地鐵站點	距離	裝修情況	月租金
0	1	3072	0.128906	2	0.236364	0.008628	東南	NaN	1	1	1	NaN	11.0	118.0	2.0	40.0	0.764167	NaN	5.602716
1	1	3152	0.132812	1	0.381818	0.017046	東	NaN	1	0	0	NaN	10.0	100.0	4.0	58.0	0.709167	NaN	16.977929
2	1	5575	0.042969	0	0.290909	0.010593	東南	NaN	2	1	2	NaN	12.0	130.0	5.0	37.0	0.572500	NaN	8.998302
3	1	3103	0.085938	2	0.581818	0.019199	南	NaN	3	2	2	NaN	7.0	90.0	2.0	63.0	0.658333	NaN	5.602716
4	1	5182	0.214844	0	0.545455	0.010427	東北	NaN	2	1	1	NaN	3.0	31.0	NaN	NaN	NaN	NaN	7.300509

In [8]:

test.head()

Out[8]:

	id	時間	小區名	小區房屋出租數量	樓層	總樓層	房屋面積	房屋朝向	居住狀態	臥室數量	廳的數量	衛的數量	出租方式	區	位置	地鐵線路	地鐵站點	距離	裝修情況
0	1	3	3882	0.035156	1	0.436364	0.013075	東南	NaN	3	1	1	NaN	8.0	94.0	4.0	76.0	0.383333	NaN
1	2	3	6353	0.078125	1	0.436364	0.012248	東南	NaN	3	1	1	NaN	3.0	33.0	3.0	23.0	1.000000	NaN
2	3	3	1493	0.203125	1	0.381818	0.023006	南	NaN	4	2	2	NaN	12.0	60.0	5.0	115.0	0.945000	NaN
3	4	3	1532	0.414062	1	0.600000	0.019695	東南	NaN	3	2	2	NaN	1.0	40.0	NaN	NaN	NaN	NaN
4	5	3	1251	0.226562	1	0.381818	0.014730	東	NaN	3	1	1	NaN	12.0	52.0	NaN	NaN	NaN	NaN

缺失值比例

In [9]:

# 每列的缺失值個數/總行數

train_missing = (train.isnull().sum()/len(train))*100

# 去掉缺失比例爲0的列

train_missing = train_missing.drop(

    train_missing[train_missing == 0].index).sort_values(ascending=False)

# 構造確實比例統計表

miss_data = pd.DataFrame({'缺失百分比': train_missing})

miss_data

Out[9]:

	缺失百分比
裝修情況	90.298859
居住狀態	89.385475
出租方式	86.996061
距離	53.380851
地鐵站點	53.380851
地鐵線路	53.380851
小區房屋出租數量	0.643023
位置	0.011293
區	0.011293

In [10]:

# 每列的缺失值個數/總行數

train_missing = (train.isnull().sum()/len(test))*100

# 去掉缺失比例爲0的列

train_missing = train_missing.drop(

    train_missing[train_missing == 0].index).sort_values(ascending=False)

# 構造確實比例統計表

miss_data = pd.DataFrame({'缺失百分比': train_missing})

miss_data

Out[10]:

	缺失百分比
裝修情況	295.510870
居住狀態	292.521739
出租方式	284.702174
距離	174.693478
地鐵站點	174.693478
地鐵線路	174.693478
小區房屋出租數量	2.104348
位置	0.036957
區	0.036957

目標值分佈

In [11]:

train['月租金'].head()

Out[11]:

0     5.602716
1    16.977929
2     8.998302
3     5.602716
4     7.300509
Name: 月租金, dtype: float64

In [12]:

plt.figure(figsize=(20, 6))

plt.subplot(221)

plt.title('月租金佔比分佈', fontsize=18)

sns.distplot(train['月租金'])

plt.subplot(222)

plt.title('月租金價格排序圖', fontsize=18)

plt.scatter(range(train.shape[0]), np.sort(train['月租金'].values))

plt.show()

所有特徵分佈

直方圖和柱狀分佈圖

In [13]:

train.head()

Out[13]:

	時間	小區名	小區房屋出租數量	樓層	總樓層	房屋面積	房屋朝向	居住狀態	臥室數量	廳的數量	衛的數量	出租方式	區	位置	地鐵線路	地鐵站點	距離	裝修情況	月租金
0	1	3072	0.128906	2	0.236364	0.008628	東南	NaN	1	1	1	NaN	11.0	118.0	2.0	40.0	0.764167	NaN	5.602716
1	1	3152	0.132812	1	0.381818	0.017046	東	NaN	1	0	0	NaN	10.0	100.0	4.0	58.0	0.709167	NaN	16.977929
2	1	5575	0.042969	0	0.290909	0.010593	東南	NaN	2	1	2	NaN	12.0	130.0	5.0	37.0	0.572500	NaN	8.998302
3	1	3103	0.085938	2	0.581818	0.019199	南	NaN	3	2	2	NaN	7.0	90.0	2.0	63.0	0.658333	NaN	5.602716
4	1	5182	0.214844	0	0.545455	0.010427	東北	NaN	2	1	1	NaN	3.0	31.0	NaN	NaN	NaN	NaN	7.300509

In [14]:

train.hist(figsize=(20,20),bins=50,grid=False)

plt.show()

異常值分析

這裏我們主要分析跟月租金相關性較大的房屋面積的異常值

In [21]:

def plot_reg(xs,y,data):

    n=len(xs)

    for i in range(n):

        plt.figure(figsize=(10,10))

        sns.regplot(x=data[xs[i]],y=data[y])

        plt.show()

In [22]:

reg_cols=['房屋面積']

plot_reg(reg_cols,"月租金",train)

問題數據

房間朝向列有多個值

In [23]:

train["房屋朝向"].head()

Out[23]:

0    東南
1     東
2    東南
3     南
4    東北
Name: 房屋朝向, dtype: object

In [24]:

# 查看房屋朝向列有哪些值

train['房屋朝向'].value_counts()

Out[24]:

南              41769
東南             41439
東              24749
西南             13407
北               7898
西               7559
西北              4066
南 北             3046
東北              2574
東南 南             660
東 東南             646
東 西              560
南 西南             334
東 南              309
東南 西南            175
南 西              158
東南 西北            114
西南 西              91
東 北               74
西 北               66
西 西北              64
東 東北              61
西南 西北             57
東南 東北             57
東南 南 西南           52
北 東北              49
南 西北              45
東南 西              44
南 西南 北            44
西北 北              41
西南 東北             40
東南 北              34
西南 北              32
東 西南              32
東 西北              26
東 南 西 北           24
東 東南 南            18
西北 東北             16
南 東               16
南 東北              14
東南 南 北            10
東 南 北              8
南 西 北              8
東南 西南 西北           8
東 南 西              7
東 東南 西南            6
南 西南 西             5
東 西 北              5
東南 西南 西            4
東 西北 北             4
北 南                2
西 西北 北             2
東 南 西北 北           2
東 西 東北             2
東 東南 北             2
東南 南 西南 西          1
東 東南 南 西南 西        1
西南 西 東北            1
北 西                1
Name: 房屋朝向, dtype: int64

In [25]:

%%time

def split(text,i):

"""

    實現對字符串進行分割,並取出結果中下標i對應的值

"""

    items=text.split(" ")

    if i<len(items):

        return items[i]

    else:

        return np.nan

for i in range(5):

    train['朝向_'+str(i)]=train['房屋朝向'].map(lambda x:split(x,i))

CPU times: user 803 ms, sys: 6.96 ms, total: 810 ms
Wall time: 1.09 s

In [26]:

train.head(20)

Out[26]:

	時間	小區名	小區房屋出租數量	樓層	總樓層	房屋面積	房屋朝向	居住狀態	臥室數量	廳的數量	...	地鐵線路	地鐵站點	距離	裝修情況	月租金	朝向_0	朝向_1	朝向_2	朝向_3	朝向_4
0	1	3072	0.128906	2	0.236364	0.008628	東南	NaN	1	1	...	2.0	40.0	0.764167	NaN	5.602716	東南	NaN	NaN	NaN	NaN
1	1	3152	0.132812	1	0.381818	0.017046	東	NaN	1	0	...	4.0	58.0	0.709167	NaN	16.977929	東	NaN	NaN	NaN	NaN
2	1	5575	0.042969	0	0.290909	0.010593	東南	NaN	2	1	...	5.0	37.0	0.572500	NaN	8.998302	東南	NaN	NaN	NaN	NaN
3	1	3103	0.085938	2	0.581818	0.019199	南	NaN	3	2	...	2.0	63.0	0.658333	NaN	5.602716	南	NaN	NaN	NaN	NaN
4	1	5182	0.214844	0	0.545455	0.010427	東北	NaN	2	1	...	NaN	NaN	NaN	NaN	7.300509	東北	NaN	NaN	NaN	NaN
5	1	1192	0.039062	2	0.309091	0.012579	南	NaN	2	1	...	3.0	59.0	0.495833	NaN	4.923599	南	NaN	NaN	NaN	NaN
6	1	1122	0.125000	0	0.381818	0.010593	南	NaN	3	1	...	2.0	9.0	0.193333	NaN	6.621392	南	NaN	NaN	NaN	NaN
7	1	1251	0.128906	2	0.363636	0.018040	南	NaN	4	2	...	NaN	NaN	NaN	NaN	14.091681	南	NaN	NaN	NaN	NaN
8	1	4718	0.246094	2	0.309091	0.007850	西南	NaN	1	1	...	NaN	NaN	NaN	NaN	4.584041	西南	NaN	NaN	NaN	NaN
9	1	2654	0.218750	2	0.890909	0.020026	東南	NaN	2	1	...	4.0	58.0	0.400000	NaN	39.558574	東南	NaN	NaN	NaN	NaN
10	1	4847	0.042969	2	0.272727	0.010096	南北	NaN	2	2	...	NaN	NaN	NaN	NaN	4.923599	南	北	NaN	NaN	NaN
11	1	3069	0.031250	1	0.272727	0.031034	南	NaN	1	0	...	3.0	57.0	0.692500	NaN	24.278438	南	NaN	NaN	NaN	NaN
12	1	1407	0.015625	2	0.109091	0.020026	東南	NaN	3	2	...	NaN	NaN	NaN	NaN	6.960951	東南	NaN	NaN	NaN	NaN
13	1	623	0.039062	1	0.090909	0.023095	東南	NaN	3	2	...	1.0	86.0	0.125833	NaN	20.882852	東南	NaN	NaN	NaN	NaN
14	1	5814	0.273438	0	0.345455	0.007779	東	NaN	2	1	...	3.0	23.0	0.640833	NaN	5.263158	東	NaN	NaN	NaN	NaN
15	1	1697	0.195312	1	0.581818	0.007448	西南	NaN	1	1	...	NaN	NaN	NaN	NaN	4.923599	西南	NaN	NaN	NaN	NaN
16	1	1691	0.027344	0	0.490909	0.012413	西南	NaN	3	2	...	NaN	NaN	NaN	NaN	5.602716	西南	NaN	NaN	NaN	NaN
17	1	5895	0.031250	1	0.709091	0.014227	東南	NaN	2	1	...	4.0	58.0	0.235000	NaN	29.371817	東南	NaN	NaN	NaN	NaN
18	1	3142	0.007812	2	0.109091	0.016882	南	NaN	2	2	...	3.0	87.0	0.173333	NaN	5.602716	南	NaN	NaN	NaN	NaN
19	1	6181	0.015625	0	0.109091	0.024495	東南	NaN	4	2	...	5.0	17.0	0.927500	NaN	15.789474	東南	NaN	NaN	NaN	NaN

20 rows × 24 columns

In [27]:

names=["朝向_{}".format(i) for i in range(5)]

train[names].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150539 entries, 0 to 150538
Data columns (total 5 columns):
朝向_0    150539 non-null object
朝向_1    7078 non-null object
朝向_2    214 non-null object
朝向_3    28 non-null object
朝向_4    1 non-null object
dtypes: object(5)
memory usage: 5.7+ MB

同一個小區屬於不同的區

In [28]:

train.head()

Out[28]:

	時間	小區名	小區房屋出租數量	樓層	總樓層	房屋面積	房屋朝向	居住狀態	臥室數量	廳的數量	...	地鐵線路	地鐵站點	距離	裝修情況	月租金	朝向_0	朝向_1	朝向_2	朝向_3	朝向_4
0	1	3072	0.128906	2	0.236364	0.008628	東南	NaN	1	1	...	2.0	40.0	0.764167	NaN	5.602716	東南	NaN	NaN	NaN	NaN
1	1	3152	0.132812	1	0.381818	0.017046	東	NaN	1	0	...	4.0	58.0	0.709167	NaN	16.977929	東	NaN	NaN	NaN	NaN
2	1	5575	0.042969	0	0.290909	0.010593	東南	NaN	2	1	...	5.0	37.0	0.572500	NaN	8.998302	東南	NaN	NaN	NaN	NaN
3	1	3103	0.085938	2	0.581818	0.019199	南	NaN	3	2	...	2.0	63.0	0.658333	NaN	5.602716	南	NaN	NaN	NaN	NaN
4	1	5182	0.214844	0	0.545455	0.010427	東北	NaN	2	1	...	NaN	NaN	NaN	NaN	7.300509	東北	NaN	NaN	NaN	NaN

5 rows × 24 columns

In [29]:

train.columns

Out[29]:

Index(['時間', '小區名', '小區房屋出租數量', '樓層', '總樓層', '房屋面積', '房屋朝向', '居住狀態', '臥室數量',
       '廳的數量', '衛的數量', '出租方式', '區', '位置', '地鐵線路', '地鐵站點', '距離', '裝修情況', '月租金',
       '朝向_0', '朝向_1', '朝向_2', '朝向_3', '朝向_4'],
      dtype='object')

In [29]:

neighbors1=train[['小區名','區','位置']]

print(neighbors1.shape)

neighbors1.head()

(150539, 3)

Out[29]:

	小區名	區	位置
0	3072	11.0	118.0
1	3152	10.0	100.0
2	5575	12.0	130.0
3	3103	7.0	90.0
4	5182	3.0	31.0

In [30]:

# 去掉'小區名','位置'兩個列重複值後  有5292個不重複值

neighbors1 = train[['小區名', '位置']].drop_duplicates()

neighbors1.shape

Out[30]:

(5292, 2)

In [31]:

# 去掉'小區名','位置'兩個列重複值 ,同時刪除缺失值  得,有5291個不重複值

neighbors1 = train[['小區名', '位置']].drop_duplicates().dropna()

neighbors1.shape

Out[31]:

(5291, 2)

In [32]:

# neighbors1按照小區名分組後保留分組條數大於1的小區名

count = neighbors1.groupby('小區名')['位置'].count()

ids = count[count > 1].index

ids

Out[32]:

Int64Index([ 284,  385,  418,  701,  783, 2228, 2468, 2513, 3183, 3482, 3645,
            3967, 4054, 4071, 4471, 4767, 4859, 5320, 5699, 5844, 5968, 6122,
            6515, 6626],
           dtype='int64', name='小區名')

In [33]:

# 在原數據中篩選出這些小區的信息

neighbors_has_problem = train[['小區名', '位置']

                              ][train['小區名'].isin(ids)].sort_values(by='小區名')

print(neighbors_has_problem.shape)

neighbors_has_problem.head()

(521, 2)

Out[33]:

	小區名	位置
129747	284	102.0
127972	284	102.0
127314	284	102.0
126698	284	102.0
126496	284	102.0

In [34]:

# 找到每個小區的位置衆數

# 這裏要注意x.mode有可能返回多個衆數，所以用一個np.max拿到最值最大的衆數作爲最終的結果

position_mode_of_neighbors = neighbors_has_problem.groupby(

    '小區名').apply(lambda x: np.max(x['位置'].mode()))

# 位置缺失值就用這個數據來進行填充，

# 對於已有的一個小區位於不同的位置，考慮到可能是因爲小區太大導致，並不能認爲是邏輯錯誤，保持不變

position_mode_of_neighbors.head()

Out[34]:

小區名
284    102.0
385    108.0
418    122.0
701    113.0
783    134.0
dtype: float64

同一個小區地鐵線路不同的問題

In [35]:

# 去掉'小區名','地鐵線路'兩個列重複之後  有3207個不重複值

lines = train[['小區名', '地鐵線路']].drop_duplicates().dropna()

lines.shape

Out[35]:

(3207, 2)

In [36]:

# 而有地鐵的小區名只有3138個不重複值  說明有69個小區有多個地鐵線路

train[train['地鐵線路'].notnull()].drop_duplicates(['小區名']).shape

Out[36]:

(3138, 24)

In [37]:

# lines按照小區名分組後保留分組條數大於1的小區名   最終有多條地鐵的小區有68個

# 這個地鐵線路分位置可能有關係  因爲同一個小區位於不同的位置，地鐵線路也有可能不同

count = lines.groupby('小區名')['地鐵線路'].count()

ids = count[count > 1].index

ids.shape

Out[37]:

(68,)

研究一下位置和地鐵線路的關係

In [38]:

train[['位置', '地鐵線路']].drop_duplicates().dropna().head()

Out[38]:

	位置	地鐵線路
0	118.0	2.0
1	100.0	4.0
2	130.0	5.0
3	90.0	2.0
5	143.0	3.0

In [39]:

# 去掉'位置','地鐵線路'兩個列重複之後  有184個不重複值

pos_lines = train[['位置', '地鐵線路']].drop_duplicates().dropna()

pos_lines.shape

Out[39]:

(184, 2)

In [40]:

#我們在來看一下有地鐵的位置中有多少個不同的   120個

pos_lines['位置'].value_counts().head()

Out[40]:

113.0    4
100.0    4
118.0    3
63.0     3
106.0    3
Name: 位置, dtype: int64

In [41]:

# pos_lines按照位置分組後保留分組條數大於1的位置  最終有多條地鐵的位置有49個

count = pos_lines.groupby('位置')['地鐵線路'].count()

ids = count[count > 1].index

ids.shape

Out[41]:

(49,)

研究一下位置和地鐵站點的關係

In [42]:

# 去掉'位置','地鐵站點'兩個列重複之後  有337個不重複值

pos_stations = train[['位置', '地鐵站點']].drop_duplicates().dropna()

print(pos_stations.shape)

pos_stations.head()

(337, 2)

Out[42]:

	位置	地鐵站點
0	118.0	40.0
1	100.0	58.0
2	130.0	37.0
3	90.0	63.0
5	143.0	59.0

In [43]:

# 我們在來看一下有地鐵的位置中有多少個不同的   120個

pos_stations['位置'].value_counts().head()

Out[43]:

63.0     9
106.0    6
86.0     6
100.0    6
143.0    6
Name: 位置, dtype: int64

In [44]:

# pos_stations按照位置分組後保留分組條數大於1的位置  最終有多個站點的位置有97個

count = pos_stations.groupby('位置')['地鐵站點'].count()

ids = count[count > 1].index

ids.shape

Out[44]:

(97,)

研究一下小區名，位置，地鐵線路，站點的關係

In [45]:

# 去掉"小區名，位置，地鐵線路，站點"四列重複之後  有3356個不重複值

neighbor_pos_stations = train[['小區名', '位置',

                               '地鐵線路', '地鐵站點']].drop_duplicates().dropna()

neighbor_pos_stations.shape

Out[45]:

(3356, 4)

In [46]:

# 看一下是否存在下小區名，位置一樣的情況下，地鐵線路不一樣的情況

# 可以看出：3356-3209=147條小區名，位置，地鐵線路同樣的情況下，地鐵站點不一樣

# 3356-3147=209條小區名，位置一樣，地鐵線路不一樣

# 這種情況可能是因爲數據錯誤，也有可能是實際情況，後面對此我們不做處理

print(neighbor_pos_stations[['小區名', '位置', '地鐵線路']

                            ].drop_duplicates().dropna().shape)

print(neighbor_pos_stations[['小區名', '位置']].drop_duplicates().dropna().shape)

(3209, 3)
(3147, 2)

研究一下是否有換乘站的存在

用站點分組，然後統計地鐵線路數

In [47]:

train[['地鐵線路', '地鐵站點']].head()

Out[47]:

	地鐵線路	地鐵站點
0	2.0	40.0
1	4.0	58.0
2	5.0	37.0
3	2.0	63.0
4	NaN	NaN

In [48]:

train[['地鐵線路', '地鐵站點']].drop_duplicates(

).dropna().groupby('地鐵站點').count().head()

Out[48]:

	地鐵線路
地鐵站點
1.0	1
2.0	1
3.0	1
4.0	1
5.0	1

In [49]:

# 結果說明沒有換乘站點存在，因爲每個站點僅僅屬於一條地鐵線路

train[['地鐵線路', '地鐵站點']].drop_duplicates(

).dropna().groupby('地鐵站點').count().max(0)

Out[49]:

地鐵線路    1
dtype: int64

研究一下每個位置的地鐵線路數和站點數

In [50]:

#每個位置的線路數 這個可以作爲新特徵加入

a=train[['位置','地鐵線路']].drop_duplicates().dropna().groupby('位置').count()

a.head()

Out[50]:

	地鐵線路
位置
0.0	1
1.0	2
2.0	1
3.0	2
4.0	1

In [51]:

# 每個位置的站點數   也可以作爲新特徵加入

b = train[['位置', '地鐵站點']].drop_duplicates().dropna().groupby('位置').count()

b.head()

Out[51]:

	地鐵站點
位置
0.0	1
1.0	3
2.0	1
3.0	4
4.0	1

In [52]:

# 兩者的相關性

al = pd.concat([a, b], axis=1)

al.head()

Out[52]:

	地鐵線路	地鐵站點
位置
0.0	1	1
1.0	2	3
2.0	1	1
3.0	2	4
4.0	1	1

In [53]:

al.corr()

Out[53]:

	地鐵線路	地鐵站點
地鐵線路	1.000000	0.689305
地鐵站點	0.689305	1.000000

研究一下位置缺失的樣本地鐵站點是否也是缺失的

In [54]:

train[["位置", "地鐵站點", "地鐵線路"]].head()

Out[54]:

	位置	地鐵站點	地鐵線路
0	118.0	40.0	2.0
1	100.0	58.0	4.0
2	130.0	37.0	5.0
3	90.0	63.0	2.0
4	31.0	NaN	NaN

In [55]:

# 發現存在地鐵線路爲缺失而位置缺失的情況   說明後面在填充位置缺失值的時候可以用地鐵站點來進行填充

pos_lines = train[['位置', '地鐵站點']].drop_duplicates()

In [56]:

pos_lines.head()

Out[56]:

	位置	地鐵站點
0	118.0	40.0
1	100.0	58.0
2	130.0	37.0
3	90.0	63.0
4	31.0	NaN

In [57]:

pos_lines['位置'].isnull().sum()

Out[57]:

In [58]:

# 每個站點的位置數   也可以作爲新特徵加入

train[['位置', '地鐵站點']].drop_duplicates().dropna().groupby('地鐵站點').count().head()

Out[58]:

	位置
地鐵站點
1.0	4
2.0	1
3.0	5
4.0	1
5.0	5

位置和區的關係校驗

In [59]:

# 查看是否存在一個位置率屬於不同的區

train[['位置', '區']].head()

Out[59]:

	位置	區
0	118.0	11.0
1	100.0	10.0
2	130.0	12.0
3	90.0	7.0
4	31.0	3.0

In [60]:

train[['位置', '區']].drop_duplicates().dropna().groupby('位置').count().head()

Out[60]:

	區
位置
0.0	1
1.0	1
2.0	1
3.0	1
4.0	1

In [61]:

# 說明每個位置僅僅屬於一個區，不存在同一個位置屬於兩個區的現象

train[['位置', '區']].drop_duplicates().dropna().groupby('位置').count().max()

Out[61]:

區    1
dtype: int64

看一下小區名過多的問題

In [62]:

train['小區名'].head()

Out[62]:

0    3072
1    3152
2    5575
3    3103
4    5182
Name: 小區名, dtype: int64

In [63]:

neighbors=train['小區名'].value_counts()

In [64]:

neighbors.head()

Out[64]:

5512    1406
1085     917
5208     847
6221     815
1532     775
Name: 小區名, dtype: int64

In [65]:

# 觀察條目數超過50的小區有多少

(neighbors > 50).sum()

Out[65]:

In [66]:

# 觀察條目數超過100的小區有多少

(neighbors > 100).sum()

Out[66]:

數據清洗

In [30]:

import matplotlib.pyplot as plt

import pandas as pd

import numpy as np

import seaborn as sns

import warnings

warnings.filterwarnings('ignore')  # 忽略一些警告

導入數據

數據基本信息查看

In [31]:

train=pd.read_csv("data/train.csv")

train.head()

Out[31]:

	時間	小區名	小區房屋出租數量	樓層	總樓層	房屋面積	房屋朝向	居住狀態	臥室數量	廳的數量	衛的數量	出租方式	區	位置	地鐵線路	地鐵站點	距離	裝修情況	月租金
0	1	3072	0.128906	2	0.236364	0.008628	東南	NaN	1	1	1	NaN	11.0	118.0	2.0	40.0	0.764167	NaN	5.602716
1	1	3152	0.132812	1	0.381818	0.017046	東	NaN	1	0	0	NaN	10.0	100.0	4.0	58.0	0.709167	NaN	16.977929
2	1	5575	0.042969	0	0.290909	0.010593	東南	NaN	2	1	2	NaN	12.0	130.0	5.0	37.0	0.572500	NaN	8.998302
3	1	3103	0.085938	2	0.581818	0.019199	南	NaN	3	2	2	NaN	7.0	90.0	2.0	63.0	0.658333	NaN	5.602716
4	1	5182	0.214844	0	0.545455	0.010427	東北	NaN	2	1	1	NaN	3.0	31.0	NaN	NaN	NaN	NaN	7.300509

In [32]:

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150539 entries, 0 to 150538
Data columns (total 19 columns):
時間          150539 non-null int64
小區名         150539 non-null int64
小區房屋出租數量    149571 non-null float64
樓層          150539 non-null int64
總樓層         150539 non-null float64
房屋面積        150539 non-null float64
房屋朝向        150539 non-null object
居住狀態        15979 non-null float64
臥室數量        150539 non-null int64
廳的數量        150539 non-null int64
衛的數量        150539 non-null int64
出租方式        19576 non-null float64
區           150522 non-null float64
位置          150522 non-null float64
地鐵線路        70180 non-null float64
地鐵站點        70180 non-null float64
距離          70180 non-null float64
裝修情況        14604 non-null float64
月租金         150539 non-null float64
dtypes: float64(12), int64(6), object(1)
memory usage: 21.8+ MB

In [33]:

train.shape

Out[33]:

(150539, 19)

In [34]:

# 出租方式中有很多缺失值

train["出租方式"].value_counts()

Out[34]:

1.0    17965
0.0     1611
Name: 出租方式, dtype: int64

In [35]:

train["裝修情況"].value_counts()

Out[35]:

2.0    7379
6.0    5862
1.0     906
4.0     339
3.0     103
5.0      15
Name: 裝修情況, dtype: int64

In [36]:

train["居住狀態"].value_counts()

Out[36]:

3.0    13530
1.0     1981
2.0      468
Name: 居住狀態, dtype: int64

設置後面要用的填充量

In [37]:

space_threshold = 0.3

dist_value_for_fill = 2  # 爲什麼是2,因爲距離的最大值是1,沒有地鐵 意味着很遠

line_value_for_fill = 0

station_value_for_fill = 0

state_value_for_fill = 0  # train["居住狀態"].mode().values[0]

decration_value_for_fill = -1  # train["裝修情況"].mode().values[0]

rent_value_for_fill = -1  # train["出租方式"].mode().values[0]

In [38]:

# 拿到每個區的位置衆數

area_value_for_fill = train["區"].mode().values[0]

position_by_area = train.groupby('區').apply(lambda x: x["位置"].mode())

# print(position_by_area)

position_value_for_fill = position_by_area[position_by_area.index ==

                                           area_value_for_fill].values[0][0]

# print(position_value_for_fill)

In [39]:

# 拿到每個小區房屋出租數量的衆數

ratio_by_neighbor = train.groupby('小區名').apply(lambda x: x["小區房屋出租數量"].mode())

index = [x[0] for x in ratio_by_neighbor.index]

ratio_by_neighbor.index = index

ratio_by_neighbor = ratio_by_neighbor.to_dict()

ratio_mode = train["小區房屋出租數量"].mode().values[0]

缺失值處理

缺失值比例

In [40]:

# 缺失值比例

def ratio_of_null():

    train_missing = (train.isnull().sum()/len(train))*100

    train_missing = train_missing.drop(train_missing[train_missing==0].index).sort_values(ascending=False)

    return pd.DataFrame({'缺失百分比':train_missing})

ratio_of_null()

Out[40]:

	缺失百分比
裝修情況	90.298859
居住狀態	89.385475
出租方式	86.996061
距離	53.380851
地鐵站點	53.380851
地鐵線路	53.380851
小區房屋出租數量	0.643023
位置	0.011293
區	0.011293

填充區和位置

尋找位置確實的相應數據

In [41]:

train.head()

Out[41]:

	時間	小區名	小區房屋出租數量	樓層	總樓層	房屋面積	房屋朝向	居住狀態	臥室數量	廳的數量	衛的數量	出租方式	區	位置	地鐵線路	地鐵站點	距離	裝修情況	月租金
0	1	3072	0.128906	2	0.236364	0.008628	東南	NaN	1	1	1	NaN	11.0	118.0	2.0	40.0	0.764167	NaN	5.602716
1	1	3152	0.132812	1	0.381818	0.017046	東	NaN	1	0	0	NaN	10.0	100.0	4.0	58.0	0.709167	NaN	16.977929
2	1	5575	0.042969	0	0.290909	0.010593	東南	NaN	2	1	2	NaN	12.0	130.0	5.0	37.0	0.572500	NaN	8.998302
3	1	3103	0.085938	2	0.581818	0.019199	南	NaN	3	2	2	NaN	7.0	90.0	2.0	63.0	0.658333	NaN	5.602716
4	1	5182	0.214844	0	0.545455	0.010427	東北	NaN	2	1	1	NaN	3.0	31.0	NaN	NaN	NaN	NaN	7.300509

In [42]:

# 檢索後發現,都是小區名爲3269的,"位置"爲NaN

train[train["位置"].isna()]

Out[42]:

	時間	小區名	小區房屋出租數量	樓層	總樓層	房屋面積	房屋朝向	居住狀態	臥室數量	廳的數量	衛的數量	出租方式	區	位置	地鐵線路	地鐵站點	距離	裝修情況	月租金
87169	2	3269	0.136719	1	0.290909	0.014565	西南	3.0	3	2	1	1.0	NaN	NaN	NaN	NaN	NaN	6.0	7.640068
87686	2	3269	0.050781	0	0.290909	0.006455	東	NaN	1	1	1	NaN	NaN	NaN	3.0	59.0	0.390000	NaN	4.244482
89090	2	3269	0.238281	2	0.600000	0.010180	西南	NaN	2	2	1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	11.035654
101618	2	3269	0.082031	1	0.581818	0.020026	南	NaN	4	2	1	1.0	NaN	NaN	NaN	NaN	NaN	NaN	8.998302
102958	2	3269	0.058594	0	0.200000	0.014305	西北	NaN	2	2	1	NaN	NaN	NaN	2.0	70.0	0.950000	NaN	7.300509
105400	2	3269	0.007812	1	0.309091	0.012494	南	NaN	2	1	1	NaN	NaN	NaN	5.0	71.0	0.649167	NaN	5.602716
106243	2	3269	0.070312	1	0.600000	0.012248	南	NaN	2	2	1	NaN	NaN	NaN	2.0	65.0	0.482500	NaN	6.621392
107728	2	3269	0.070312	1	0.309091	0.011255	西	NaN	2	2	1	NaN	NaN	NaN	5.0	27.0	0.294167	NaN	8.998302
108349	2	3269	0.027344	1	0.309091	0.013737	東南	NaN	4	2	2	NaN	NaN	NaN	3.0	59.0	0.491667	NaN	8.319185
113818	2	3269	0.062500	0	0.181818	0.012271	東南	NaN	2	1	1	NaN	NaN	NaN	2.0	55.0	0.400000	NaN	7.300509
119571	2	3269	0.089844	1	0.454545	0.011178	東南	NaN	2	1	1	1.0	NaN	NaN	5.0	29.0	1.000000	NaN	4.584041
127246	3	3269	NaN	1	0.090909	0.011255	東	NaN	2	1	1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	4.584041
132357	3	3269	0.023438	0	0.290909	0.001821	東南	NaN	1	0	1	NaN	NaN	NaN	3.0	88.0	0.325833	NaN	2.886248
137717	3	3269	0.011719	1	0.090909	0.010593	南	NaN	2	1	1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	5.263158
140425	3	3269	0.031250	2	0.581818	0.014234	東	NaN	3	1	1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	6.960951
141042	3	3269	0.316406	0	0.600000	0.007180	西北	3.0	2	1	1	1.0	NaN	NaN	NaN	NaN	NaN	2.0	4.074703
144922	3	3269	0.117188	1	0.600000	0.006455	東南	1.0	1	0	1	1.0	NaN	NaN	NaN	NaN	NaN	2.0	4.923599

In [43]:

# "位置"爲NaN的就那麼幾條,對他們直接刪除處理

train=train[train['小區名']!=3269]

# 此處原文中雖然按照這種模式處理,但是不建議這麼做;可以使用衆數進行替換,如下面註釋代碼.

# test["位置"].fillna(test["位置"].mode()[0], inplace=True)

# test["區"].fillna(test["區"].mode()[0], inplace=True)

In [44]:

ratio_of_null()

Out[44]:

	缺失百分比
裝修情況	90.299757
居住狀態	89.386269
出租方式	86.997914
距離	53.381565
地鐵站點	53.381565
地鐵線路	53.381565
小區房屋出租數量	0.642431

地鐵站點，距離處理

先用每個同名小區名和同位置的地鐵線路,地鐵站點,距離衆數來填充
剩下的地鐵站點，距離，地鐵線路的缺失值作爲一種特徵，表示該房屋附近沒有地鐵

In [45]:

# 先按照小區名和位置分組，然後獲取每組的站點衆數

station_by_nb_pos = train[['小區名', '位置', '地鐵站點', '距離']].drop_duplicates().dropna(

).groupby(['小區名', '位置'])['地鐵站點', '距離'].apply(lambda x: np.max(x.mode()))

station_by_nb_pos.head()

Out[45]:

		地鐵站點	距離
小區名	位置
0	59.0	57.0	0.478333
1	59.0	57.0	0.563333
2	40.0	33.0	0.971667
11	24.0	103.0	0.914167
12	28.0	69.0	0.487500

In [46]:

station_by_nb = train[['小區名', '地鐵站點', '距離']].drop_duplicates().dropna(

).groupby('小區名')['地鐵站點', '距離'].apply(lambda x: np.max(x.mode()))

station_by_nb.head()

Out[46]:

	地鐵站點	距離
小區名
0	57.0	0.478333
1	57.0	0.563333
2	33.0	0.971667
11	103.0	0.914167
12	69.0	0.487500

In [47]:

# 拿到每個站點對應的線路

lines_by_station = train[['地鐵站點', '地鐵線路']].drop_duplicates(

).dropna().groupby('地鐵站點')['地鐵線路'].min()

In [48]:

def fill_stations(line, s_by_np, s_by_n, l_by_s):

"""

    s_by_np:接收station_by_nb_pos

    s_by_n:接收station_by_nb

    l_by_s:接收lines_by_station

"""

    # 首先判斷line行地鐵站點是否缺失

    # 注意這裏最好用pd.isna,不要用np.isnull

    if not pd.isna(line['地鐵站點']):  # 不是空，就直接返回原行

        return line

    # 如果小區名和位置組合在數據索引中，就查找進行填充

    if (line['小區名'], line['位置']) in s_by_np:

        line['地鐵站點'] = s_by_np.loc[(line['小區名'], line['位置']), '地鐵站點']

        line['距離'] = s_by_np.loc[(line['小區名'], line['位置']), '距離']

        line['地鐵線路'] = l_by_s[line['地鐵站點']]

    elif line['小區名'] in s_by_n.index:

        line['地鐵站點'] = s_by_n.loc[line['小區名'], '地鐵站點']  # 用小區衆數填充

        line['距離'] = s_by_n.loc[line['小區名'], '距離']

        line['地鐵線路'] = l_by_s[line['地鐵站點']]

    else:  # 小區名也找不到的情況下  單獨作爲一類，即沒有地鐵

        line['地鐵站點'] = 0

        line['距離'] = 2  # 距離用2填充

        line['地鐵線路'] = 0

    return line

train = train.apply(fill_stations, s_by_np=station_by_nb_pos,

                    s_by_n=station_by_nb, l_by_s=lines_by_station, axis=1)

ratio_of_null()

Out[48]:

	缺失百分比
裝修情況	90.299757
居住狀態	89.386269
出租方式	86.997914
小區房屋出租數量	0.642431

小區房屋出租數量處理

用每個小區的房屋出租數量衆數填充

In [49]:

# 拿到每個小區房屋出租數量的衆數

ratio_by_neighbor = train[['小區名', '小區房屋出租數量']].dropna().groupby(

    '小區名').apply(lambda x: np.mean(x["小區房屋出租數量"].mode()))

ratio_by_neighbor.head()

Out[49]:

小區名
0    0.007812
1    0.011719
2    0.007812
4    0.017578
5    0.007812
dtype: float64

In [50]:

#拿到所有小區的“小區房屋出租數量”衆數

ratio_mode=train["小區房屋出租數量"].mode().values[0]

ratio_mode

Out[50]:

0.015625

In [51]:

def fill_by_key(x,k,v,values,mode):

    if not pd.isna(x[v]):

        return x

    else:

        if x[k] in values.index:

            x[v]=values[x[k]]

        else:

            x[v]=mode

        return x

# train['小區房屋出租數量']=train['小區房屋出租數量'].map()

train=train.apply(fill_by_key,k="小區名",v="小區房屋出租數量",values=ratio_by_neighbor,mode=ratio_mode,axis=1)

In [52]:

ratio_of_null()

Out[52]:

	缺失百分比
裝修情況	90.299757
居住狀態	89.386269
出租方式	86.997914

裝修，居住狀態，出租方式--作爲單獨一類

In [53]:

train["出租方式"]=train["出租方式"].fillna(int(-1))

train["裝修情況"]=train["裝修情況"].fillna(int(-1))

train["居住狀態"]=train["居住狀態"].fillna(int(0))

In [54]:

ratio_of_null()

Out[54]:

清除異常樣本

針對房屋面積存在的異常值，去掉房屋面積異常的樣本

In [55]:

train['房屋面積'].head()

Out[55]:

0    0.008628
1    0.017046
2    0.010593
3    0.019199
4    0.010427
Name: 房屋面積, dtype: float64

In [56]:

print(space_threshold)

[train[train['房屋面積']>space_threshold]]

0.3

Out[56]:

[        時間   小區名  小區房屋出租數量  樓層       總樓層      房屋面積 房屋朝向  居住狀態  臥室數量  廳的數量  \
 100648   2    17  0.335938   0  0.727273  1.000000   東南   0.0     1     1   
 105736   2    17  0.320312   0  0.727273  1.000000   東南   0.0     1     1   
 127221   3    17  0.339844   0  0.727273  1.000000   東南   0.0     1     1   
 150066   3  3946  0.050781   0  0.272727  0.330354    西   0.0     2     1   
 
         衛的數量  出租方式     區     位置  地鐵線路   地鐵站點        距離  裝修情況        月租金  
 100648     1  -1.0  11.0   55.0   5.0  113.0  0.364167  -1.0  18.845501  
 105736     1  -1.0  11.0   55.0   5.0  113.0  0.364167  -1.0  18.845501  
 127221     1  -1.0  11.0   55.0   5.0  113.0  0.364167  -1.0  18.845501  
 150066     1  -1.0   0.0  109.0   0.0    0.0  2.000000  -1.0   5.602716  ]

In [57]:

train=train[train['房屋面積']<space_threshold]

train.shape

Out[57]:

(150518, 19)

糾偏

針對目標值月租金普遍分佈過散，進行對數平滑

In [58]:

train["log_rent"] = np.log1p(train["月租金"])  # np.log1p  log(1+x)

# 參考資料: https://www.cnblogs.com/wqbin/p/10346292.html

In [59]:

# 糾正之前

plt.figure(figsize=(10, 5))

sns.boxplot(x="月租金", data=train, orient='h')

plt.show()

In [60]:

# 糾正之後

plt.figure(figsize=(10, 5))

sns.boxplot(x="log_rent", data=train, orient='h')

plt.show()

In [61]:

train.head()

Out[61]:

	時間	小區名	小區房屋出租數量	樓層	總樓層	房屋面積	房屋朝向	臥室數量	廳的數量	衛的數量	出租方式	區	位置	地鐵線路	地鐵站點	距離	裝修情況	月租金	log_rent
0	1	3072	0.128906	2	0.236364	0.008628	東南	1	1	1	-1.0	11.0	118.0	2.0	40.0	0.764167	-1.0	5.602716	1.887481
1	1	3152	0.132812	1	0.381818	0.017046	東	1	0	0	-1.0	10.0	100.0	4.0	58.0	0.709167	-1.0	16.977929	2.889145
2	1	5575	0.042969	0	0.290909	0.010593	東南	2	1	2	-1.0	12.0	130.0	5.0	37.0	0.572500	-1.0	8.998302	2.302415
3	1	3103	0.085938	2	0.581818	0.019199	南	3	2	2	-1.0	7.0	90.0	2.0	63.0	0.658333	-1.0	5.602716	1.887481
4	1	5182	0.214844	0	0.545455	0.010427	東北	2	1	1	-1.0	3.0	31.0	0.0	0.0	2.000000	-1.0	7.300509	2.116317

問題數據處理

房間朝向列有多個值,這裏我們只要第一個

In [66]:

train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 150518 entries, 0 to 150538
Data columns (total 21 columns):
時間          150518 non-null int64
小區名         150518 non-null int64
小區房屋出租數量    150518 non-null float64
樓層          150518 non-null int64
總樓層         150518 non-null float64
房屋面積        150518 non-null float64
房屋朝向        150518 non-null object
居住狀態        150518 non-null float64
臥室數量        150518 non-null int64
廳的數量        150518 non-null int64
衛的數量        150518 non-null int64
出租方式        150518 non-null float64
區           150518 non-null float64
位置          150518 non-null float64
地鐵線路        150518 non-null float64
地鐵站點        150518 non-null float64
距離          150518 non-null float64
裝修情況        150518 non-null float64
月租金         150518 non-null float64
log_rent    150518 non-null float64
新朝向         150518 non-null object
dtypes: float64(13), int64(6), object(2)
memory usage: 25.3+ MB

In [68]:

train["房屋朝向"].head()

Out[68]:

0    東南
1     東
2    東南
3     南
4    東北
Name: 房屋朝向, dtype: object

In [62]:

def split(text,i):

    items=text.split(" ")

    if i<len(items):

        return items[i]

    else:

        return np.nan

train['新朝向']=train['房屋朝向'].map(lambda x:split(x,0))

In [63]:

train.head()

train['新朝向'].value_counts()

Out[63]:

南     45435
東南    42590
東     26533
西南    13626
北      7950
西      7689
西北     4121
東北     2574
Name: 新朝向, dtype: int64

In [64]:

train.head()

Out[64]:

	時間	小區名	小區房屋出租數量	樓層	總樓層	房屋面積	房屋朝向	臥室數量	廳的數量	...	出租方式	區	位置	地鐵線路	地鐵站點	距離	裝修情況	月租金	log_rent	新朝向
0	1	3072	0.128906	2	0.236364	0.008628	東南	1	1	...	-1.0	11.0	118.0	2.0	40.0	0.764167	-1.0	5.602716	1.887481	東南
1	1	3152	0.132812	1	0.381818	0.017046	東	1	0	...	-1.0	10.0	100.0	4.0	58.0	0.709167	-1.0	16.977929	2.889145	東
2	1	5575	0.042969	0	0.290909	0.010593	東南	2	1	...	-1.0	12.0	130.0	5.0	37.0	0.572500	-1.0	8.998302	2.302415	東南
3	1	3103	0.085938	2	0.581818	0.019199	南	3	2	...	-1.0	7.0	90.0	2.0	63.0	0.658333	-1.0	5.602716	1.887481	南
4	1	5182	0.214844	0	0.545455	0.010427	東北	2	1	...	-1.0	3.0	31.0	0.0	0.0	2.000000	-1.0	7.300509	2.116317	東北

5 rows × 21 columns

存儲數據

In [65]:

train.to_csv("./data/train_data_cleaning.csv",index=None)

特徵工程

In [1]:

import matplotlib.pyplot as plt

import pandas as pd

import numpy as np

import seaborn as sns

import warnings

warnings.filterwarnings('ignore')#忽略一些警告

獲取數據

數據基本信息產看

In [2]:

train=pd.read_csv("./data/train_data_cleaning.csv")

train.head()

Out[2]:

	時間	小區名	小區房屋出租數量	樓層	總樓層	房屋面積	房屋朝向	臥室數量	廳的數量	...	出租方式	區	位置	地鐵線路	地鐵站點	距離	裝修情況	月租金	log_rent	新朝向
0	1	3072	0.128906	2	0.236364	0.008628	東南	1	1	...	-1.0	11.0	118.0	2.0	40.0	0.764167	-1.0	5.602716	1.887481	東南
1	1	3152	0.132812	1	0.381818	0.017046	東	1	0	...	-1.0	10.0	100.0	4.0	58.0	0.709167	-1.0	16.977929	2.889145	東
2	1	5575	0.042969	0	0.290909	0.010593	東南	2	1	...	-1.0	12.0	130.0	5.0	37.0	0.572500	-1.0	8.998302	2.302415	東南
3	1	3103	0.085938	2	0.581818	0.019199	南	3	2	...	-1.0	7.0	90.0	2.0	63.0	0.658333	-1.0	5.602716	1.887481	南
4	1	5182	0.214844	0	0.545455	0.010427	東北	2	1	...	-1.0	3.0	31.0	0.0	0.0	2.000000	-1.0	7.300509	2.116317	東北

5 rows × 21 columns

In [3]:

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150518 entries, 0 to 150517
Data columns (total 21 columns):
時間          150518 non-null int64
小區名         150518 non-null int64
小區房屋出租數量    150518 non-null float64
樓層          150518 non-null int64
總樓層         150518 non-null float64
房屋面積        150518 non-null float64
房屋朝向        150518 non-null object
居住狀態        150518 non-null float64
臥室數量        150518 non-null int64
廳的數量        150518 non-null int64
衛的數量        150518 non-null int64
出租方式        150518 non-null float64
區           150518 non-null float64
位置          150518 non-null float64
地鐵線路        150518 non-null float64
地鐵站點        150518 non-null float64
距離          150518 non-null float64
裝修情況        150518 non-null float64
月租金         150518 non-null float64
log_rent    150518 non-null float64
新朝向         150518 non-null object
dtypes: float64(13), int64(6), object(2)
memory usage: 24.1+ MB

特徵處理

根據房間,廳,衛,房屋面積構造新特徵

In [4]:

train["房+衛+廳"] = train["臥室數量"]+train["廳的數量"]+train["衛的數量"]

train["房/總"] = train["臥室數量"]/(train["房+衛+廳"]+1)  # 加1是爲了防止分母=0出現結果爲inf無窮大的現象

train["衛/總"] = train["衛的數量"]/(train["房+衛+廳"]+1)

train["廳/總"] = train["廳的數量"]/(train["房+衛+廳"]+1)

train['臥室面積'] = train['房屋面積']/(train['臥室數量']+1)

train['樓層比'] = train['樓層']/(train["總樓層"]+1)

train['戶型'] = train[['臥室數量', '廳的數量', '衛的數量']].apply(

    lambda x: str(x['臥室數量'])+str(x['廳的數量'])+str(x['衛的數量']), axis=1)

In [5]:

train.head()

Out[5]:

	時間	小區名	小區房屋出租數量	樓層	總樓層	房屋面積	房屋朝向	臥室數量	廳的數量	...	月租金	log_rent	新朝向	房+衛+廳	房/總	衛/總	廳/總	臥室面積	樓層比	戶型
0	1	3072	0.128906	2	0.236364	0.008628	東南	1	1	...	5.602716	1.887481	東南	3	0.250000	0.250000	0.250000	0.004314	1.617647	111
1	1	3152	0.132812	1	0.381818	0.017046	東	1	0	...	16.977929	2.889145	東	1	0.500000	0.000000	0.000000	0.008523	0.723684	100
2	1	5575	0.042969	0	0.290909	0.010593	東南	2	1	...	8.998302	2.302415	東南	5	0.333333	0.333333	0.166667	0.003531	0.000000	212
3	1	3103	0.085938	2	0.581818	0.019199	南	3	2	...	5.602716	1.887481	南	7	0.375000	0.250000	0.250000	0.004800	1.264368	322
4	1	5182	0.214844	0	0.545455	0.010427	東北	2	1	...	7.300509	2.116317	東北	4	0.400000	0.200000	0.200000	0.003476	0.000000	211

5 rows × 28 columns

構造是否有地鐵

In [6]:

train["有地鐵"]=(train["地鐵站點"]>-1).map(int)

In [7]:

train.head()

Out[7]:

	時間	小區名	小區房屋出租數量	樓層	總樓層	房屋面積	房屋朝向	臥室數量	廳的數量	...	log_rent	新朝向	房+衛+廳	房/總	衛/總	廳/總	臥室面積	樓層比	戶型	有地鐵
0	1	3072	0.128906	2	0.236364	0.008628	東南	1	1	...	1.887481	東南	3	0.250000	0.250000	0.250000	0.004314	1.617647	111	1
1	1	3152	0.132812	1	0.381818	0.017046	東	1	0	...	2.889145	東	1	0.500000	0.000000	0.000000	0.008523	0.723684	100	1
2	1	5575	0.042969	0	0.290909	0.010593	東南	2	1	...	2.302415	東南	5	0.333333	0.333333	0.166667	0.003531	0.000000	212	1
3	1	3103	0.085938	2	0.581818	0.019199	南	3	2	...	1.887481	南	7	0.375000	0.250000	0.250000	0.004800	1.264368	322	1
4	1	5182	0.214844	0	0.545455	0.010427	東北	2	1	...	2.116317	東北	4	0.400000	0.200000	0.200000	0.003476	0.000000	211	1

5 rows × 29 columns

In [8]:

train.columns

Out[8]:

Index(['時間', '小區名', '小區房屋出租數量', '樓層', '總樓層', '房屋面積', '房屋朝向', '居住狀態', '臥室數量',
       '廳的數量', '衛的數量', '出租方式', '區', '位置', '地鐵線路', '地鐵站點', '距離', '裝修情況', '月租金',
       'log_rent', '新朝向', '房+衛+廳', '房/總', '衛/總', '廳/總', '臥室面積', '樓層比', '戶型',
       '有地鐵'],
      dtype='object')

構造地鐵線路數特徵

In [9]:

lines_count1=train[['小區名','地鐵線路']].drop_duplicates().groupby('小區名').count()

lines_count2=train[['位置','地鐵線路']].drop_duplicates().groupby('位置').count()

lines_count2.columns=['位置線路數']

lines_count1.columns=['小區線路數']

In [10]:

train=pd.merge(train,lines_count1,how='left',on=['小區名'])

train=pd.merge(train,lines_count2,how='left',on=['位置'])

In [11]:

train.head()

Out[11]:

	時間	小區名	小區房屋出租數量	樓層	總樓層	房屋面積	房屋朝向	臥室數量	廳的數量	...	房+衛+廳	房/總	衛/總	廳/總	臥室面積	樓層比	戶型	有地鐵	小區線路數	位置線路數
0	1	3072	0.128906	2	0.236364	0.008628	東南	1	1	...	3	0.250000	0.250000	0.250000	0.004314	1.617647	111	1	2	4
1	1	3152	0.132812	1	0.381818	0.017046	東	1	0	...	1	0.500000	0.000000	0.000000	0.008523	0.723684	100	1	1	5
2	1	5575	0.042969	0	0.290909	0.010593	東南	2	1	...	5	0.333333	0.333333	0.166667	0.003531	0.000000	212	1	1	3
3	1	3103	0.085938	2	0.581818	0.019199	南	3	2	...	7	0.375000	0.250000	0.250000	0.004800	1.264368	322	1	1	2
4	1	5182	0.214844	0	0.545455	0.010427	東北	2	1	...	4	0.400000	0.200000	0.200000	0.003476	0.000000	211	1	1	3

5 rows × 31 columns

去掉出現數量較少的小區

In [12]:

neighbors=train['小區名'].value_counts()

neighbors.head()

Out[12]:

5512    1406
1085     917
5208     847
6221     815
1532     775
Name: 小區名, dtype: int64

In [13]:

train['新小區名']=train.apply(lambda x: x['小區名'] if neighbors[x['小區名']]>100 else -1,axis=1)

train['小區條數大於100']=train.apply(lambda x: 1 if neighbors[x['小區名']]>100 else 0,axis=1)

In [14]:

train.head()

Out[14]:

	時間	小區名	小區房屋出租數量	樓層	總樓層	房屋面積	房屋朝向	臥室數量	廳的數量	...	衛/總	廳/總	臥室面積	樓層比	戶型	有地鐵	小區線路數	位置線路數	新小區名	小區條數大於100
0	1	3072	0.128906	2	0.236364	0.008628	東南	1	1	...	0.250000	0.250000	0.004314	1.617647	111	1	2	4	3072	1
1	1	3152	0.132812	1	0.381818	0.017046	東	1	0	...	0.000000	0.000000	0.008523	0.723684	100	1	1	5	-1	0
2	1	5575	0.042969	0	0.290909	0.010593	東南	2	1	...	0.333333	0.166667	0.003531	0.000000	212	1	1	3	-1	0
3	1	3103	0.085938	2	0.581818	0.019199	南	3	2	...	0.250000	0.250000	0.004800	1.264368	322	1	1	2	3103	1
4	1	5182	0.214844	0	0.545455	0.010427	東北	2	1	...	0.200000	0.200000	0.003476	0.000000	211	1	1	3	5182	1

5 rows × 33 columns

轉換類型

In [15]:

train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 150518 entries, 0 to 150517
Data columns (total 33 columns):
時間           150518 non-null int64
小區名          150518 non-null int64
小區房屋出租數量     150518 non-null float64
樓層           150518 non-null int64
總樓層          150518 non-null float64
房屋面積         150518 non-null float64
房屋朝向         150518 non-null object
居住狀態         150518 non-null float64
臥室數量         150518 non-null int64
廳的數量         150518 non-null int64
衛的數量         150518 non-null int64
出租方式         150518 non-null float64
區            150518 non-null float64
位置           150518 non-null float64
地鐵線路         150518 non-null float64
地鐵站點         150518 non-null float64
距離           150518 non-null float64
裝修情況         150518 non-null float64
月租金          150518 non-null float64
log_rent     150518 non-null float64
新朝向          150518 non-null object
房+衛+廳        150518 non-null int64
房/總          150518 non-null float64
衛/總          150518 non-null float64
廳/總          150518 non-null float64
臥室面積         150518 non-null float64
樓層比          150518 non-null float64
戶型           150518 non-null object
有地鐵          150518 non-null int64
小區線路數        150518 non-null int64
位置線路數        150518 non-null int64
新小區名         150518 non-null int64
小區條數大於100    150518 non-null int64
dtypes: float64(18), int64(12), object(3)
memory usage: 39.0+ MB

In [16]:

#將離散特徵轉換成字符串類型

colunms = ['時間', '小區名', '居住狀態', '出租方式', '區','位置','地鐵線路','地鐵站點','裝修情況']

for col in colunms:

    train[col] = train[col].astype(str)

In [17]:

train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 150518 entries, 0 to 150517
Data columns (total 33 columns):
時間           150518 non-null object
小區名          150518 non-null object
小區房屋出租數量     150518 non-null float64
樓層           150518 non-null int64
總樓層          150518 non-null float64
房屋面積         150518 non-null float64
房屋朝向         150518 non-null object
居住狀態         150518 non-null object
臥室數量         150518 non-null int64
廳的數量         150518 non-null int64
衛的數量         150518 non-null int64
出租方式         150518 non-null object
區            150518 non-null object
位置           150518 non-null object
地鐵線路         150518 non-null object
地鐵站點         150518 non-null object
距離           150518 non-null float64
裝修情況         150518 non-null object
月租金          150518 non-null float64
log_rent     150518 non-null float64
新朝向          150518 non-null object
房+衛+廳        150518 non-null int64
房/總          150518 non-null float64
衛/總          150518 non-null float64
廳/總          150518 non-null float64
臥室面積         150518 non-null float64
樓層比          150518 non-null float64
戶型           150518 non-null object
有地鐵          150518 non-null int64
小區線路數        150518 non-null int64
位置線路數        150518 non-null int64
新小區名         150518 non-null int64
小區條數大於100    150518 non-null int64
dtypes: float64(11), int64(10), object(12)
memory usage: 39.0+ MB

保存處理後的數據

In [18]:

# 保存處理後的數據

train.to_csv("./data/onehot_feature.csv")

In [20]:

train.head()

Out[20]:

	時間	小區名	小區房屋出租數量	樓層	總樓層	房屋面積	房屋朝向	臥室數量	廳的數量	...	衛/總	廳/總	臥室面積	樓層比	戶型	有地鐵	小區線路數	位置線路數	新小區名	小區條數大於100
0	1	3072	0.128906	2	0.236364	0.008628	東南	1	1	...	0.250000	0.250000	0.004314	1.617647	111	1	2	4	3072	1
1	1	3152	0.132812	1	0.381818	0.017046	東	1	0	...	0.000000	0.000000	0.008523	0.723684	100	1	1	5	-1	0
2	1	5575	0.042969	0	0.290909	0.010593	東南	2	1	...	0.333333	0.166667	0.003531	0.000000	212	1	1	3	-1	0
3	1	3103	0.085938	2	0.581818	0.019199	南	3	2	...	0.250000	0.250000	0.004800	1.264368	322	1	1	2	3103	1
4	1	5182	0.214844	0	0.545455	0.010427	東北	2	1	...	0.200000	0.200000	0.003476	0.000000	211	1	1	3	5182	1

5 rows × 33 columns

初步建模

In [144]:

import pandas as pd

import numpy as np

from sklearn.linear_model import Ridge

from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.feature_extraction import DictVectorizer

from sklearn.preprocessing import StandardScaler

from sklearn.decomposition import PCA

from sklearn.metrics import mean_squared_error

In [145]:

# 使用初步獲取的數據,嘗試建模,驗證數據階段OK

數據處理

In [146]:

data=pd.read_csv("data/onehot_feature.csv")

data_test = pd.read_csv("./data/onehot_feature_test.csv")

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150518 entries, 0 to 150517
Data columns (total 34 columns):
Unnamed: 0    150518 non-null int64
時間            150518 non-null int64
小區名           150518 non-null int64
小區房屋出租數量      150518 non-null float64
樓層            150518 non-null int64
總樓層           150518 non-null float64
房屋面積          150518 non-null float64
房屋朝向          150518 non-null object
居住狀態          150518 non-null float64
臥室數量          150518 non-null int64
廳的數量          150518 non-null int64
衛的數量          150518 non-null int64
出租方式          150518 non-null float64
區             150518 non-null float64
位置            150518 non-null float64
地鐵線路          150518 non-null float64
地鐵站點          150518 non-null float64
距離            150518 non-null float64
裝修情況          150518 non-null float64
月租金           150518 non-null float64
log_rent      150518 non-null float64
新朝向           150518 non-null object
房+衛+廳         150518 non-null int64
房/總           150518 non-null float64
衛/總           150518 non-null float64
廳/總           150518 non-null float64
臥室面積          150518 non-null float64
樓層比           150518 non-null float64
戶型            150518 non-null int64
有地鐵           150518 non-null int64
小區線路數         150518 non-null int64
位置線路數         150518 non-null int64
新小區名          150518 non-null int64
小區條數大於100     150518 non-null int64
dtypes: float64(18), int64(14), object(2)
memory usage: 39.0+ MB

In [147]:

data_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46000 entries, 0 to 45999
Data columns (total 33 columns):
Unnamed: 0    46000 non-null int64
id            46000 non-null int64
時間            46000 non-null int64
小區名           46000 non-null int64
小區房屋出租數量      46000 non-null float64
樓層            46000 non-null int64
總樓層           46000 non-null float64
房屋面積          46000 non-null float64
房屋朝向          46000 non-null object
居住狀態          46000 non-null float64
臥室數量          46000 non-null int64
廳的數量          46000 non-null int64
衛的數量          46000 non-null int64
出租方式          46000 non-null float64
區             46000 non-null float64
位置            46000 non-null float64
地鐵線路          46000 non-null float64
地鐵站點          46000 non-null float64
距離            46000 non-null float64
裝修情況          46000 non-null float64
新朝向           46000 non-null object
房+衛+廳         46000 non-null int64
房/總           46000 non-null float64
衛/總           46000 non-null float64
廳/總           46000 non-null float64
臥室面積          46000 non-null float64
樓層比           46000 non-null float64
戶型            46000 non-null int64
有地鐵           46000 non-null int64
小區線路數         46000 non-null int64
位置線路數         46000 non-null int64
新小區名          46000 non-null int64
小區條數大於100     46000 non-null int64
dtypes: float64(16), int64(15), object(2)
memory usage: 11.6+ MB

In [148]:

# 將離散特徵轉換成字符串類型

colunms = ['時間', '新小區名', '居住狀態', '出租方式', '區',

           '位置', '地鐵線路', '地鐵站點', '裝修情況', '戶型']

for col in colunms:

    data[col] = data[col].astype(str)

In [149]:

np.any(data_test.isna())

# np.any(data.isna())

Out[149]:

Unnamed: 0    False
id            False
時間            False
小區名           False
小區房屋出租數量      False
樓層            False
總樓層           False
房屋面積          False
房屋朝向          False
居住狀態          False
臥室數量          False
廳的數量          False
衛的數量          False
出租方式          False
區             False
位置            False
地鐵線路          False
地鐵站點          False
距離            False
裝修情況          False
新朝向           False
房+衛+廳         False
房/總           False
衛/總           False
廳/總           False
臥室面積          False
樓層比           False
戶型            False
有地鐵           False
小區線路數         False
位置線路數         False
新小區名          False
小區條數大於100     False
dtype: bool

確定特徵值,目標值

In [150]:

x_columns=['小區房屋出租數量','新小區名', '樓層', '總樓層', '房屋面積','居住狀態', '臥室數量',

       '衛的數量',  '位置',  '地鐵站點', '距離', '裝修情況',

       '新朝向', '房+衛+廳', '房/總', '衛/總', '廳/總', '臥室面積', '樓層比', '戶型','有地鐵','小區線路數','位置線路數','小區條數大於100',]

y_label='log_rent'

x=data[x_columns]

y=data[y_label]

X_TEST = data_test[x_columns]

分割數據集

In [151]:

train_x, test_x, train_y, test_y = train_test_split(

    x, y, test_size=0.25, random_state=12)

特徵工程

In [152]:

# 1.特徵轉換

vector = DictVectorizer(sparse=True)

x_train = vector.fit_transform(train_x.to_dict(orient='records'))

x_test = vector.transform(test_x.to_dict(orient='records'))

X_TEST = vector.transform(X_TEST.to_dict(orient="records"))

In [153]:

print(x_train.shape, x_test.shape, X_TEST.shape)

(112888, 826) (37630, 826) (46000, 826)

In [155]:

# 2.降維

pca=PCA(0.98)

pca_x_train=pca.fit_transform(x_train.toarray())

pca_x_test=pca.transform(x_test.toarray())

PCA_X_TEST = pca.transform(X_TEST.toarray())

In [156]:

print(pca_x_train.shape, pca_x_test.shape, PCA_X_TEST.shape)

(112888, 361) (37630, 361) (46000, 361)

In [157]:

# 3.特徵標準化

trans = StandardScaler()

new_x_train = trans.fit_transform(pca_x_train)

new_x_test = trans.transform(pca_x_test)

NEW_X_TEST = trans.transform(PCA_X_TEST)

In [158]:

print(new_x_train.shape, new_x_test.shape, NEW_X_TEST.shape)

(112888, 361) (37630, 361) (46000, 361)

確定評估函數

In [159]:

def rmse(y_true, y_pred):

    y_pred = np.exp(y_pred)-1  # 轉換成真實的租金

    y_true = np.exp(y_true)-1

    return np.sqrt(mean_squared_error(y_true, y_pred))

模型訓練

構建嶺迴歸模型

In [160]:

%%time

# 1.通過參數搜索,確定最優參數alpha的值

ridge = Ridge()

params = {

    "alpha": [0.005, 0.01, 1, 5, 10, 20, 50]

model1 = GridSearchCV(ridge, param_grid=params, cv=5, n_jobs=-1)

model1.fit(new_x_train, train_y)

model1.best_params_

#{'alpha': 50, 'fit_intercept': True}

CPU times: user 1.54 s, sys: 781 ms, total: 2.32 s
Wall time: 17.5 s

In [161]:

# 利用搜索出的最優參數構建模型

ridge = Ridge(alpha=50)

ridge.fit(new_x_train, train_y)

Out[161]:

Ridge(alpha=50, copy_X=True, fit_intercept=True, max_iter=None, normalize=False,
      random_state=None, solver='auto', tol=0.001)

In [162]:

y_pred_test=ridge.predict(new_x_test)

y_pred_train=ridge.predict(new_x_train)

print("訓練集rmse：",rmse(train_y,y_pred_train))

print("測試集rmse：",rmse(test_y,y_pred_test))

訓練集rmse： 4.096368900367207
測試集rmse： 4.198922171577452

模型保存

In [163]:

from sklearn.externals import joblib

joblib.dump(Ridge, "./data/Ridge.kpl")

Out[163]:

['./data/Ridge.kpl']

提交結果輸出

In [164]:

Y_PRED_TEST = ridge.predict(NEW_X_TEST)

Y_PRED_TEST = np.exp(Y_PRED_TEST)-1

In [165]:

data = range(1, len(Y_PRED_TEST)+1)

In [166]:

Y_PRED = pd.DataFrame(data=Y_PRED_TEST, columns=["月租金"])

In [167]:

Y_PRED["id"] = range(1, Y_PRED.shape[0]+1)

In [168]:

Y_PRED.head()

Out[168]:

	月租金	id
0	5.182775	1
1	4.600273	2
2	8.306692	3
3	7.178559	4
4	5.187525	5

In [171]:

Y_PRED.shape

Out[171]:

(46000, 2)

In [172]:

Y_PRED.to_csv("./data/Y_PRED_RIDGE.csv")

模型融合

In [1]:

from sklearn.linear_model import RidgeCV, LassoCV, Ridge, Lasso

from sklearn.svm import LinearSVR, SVR

from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor

from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.feature_extraction import DictVectorizer

from sklearn.preprocessing import StandardScaler

from sklearn.tree import DecisionTreeRegressor

from sklearn.decomposition import PCA

import pandas as pd

import numpy as np

from sklearn.metrics import mean_squared_error

In [2]:

#沒有用bagging和boosting

#stacking    先用幾個不同的模型做預測  輸出預測值  然後將這幾個模型輸出的預測值作爲特徵來訓練一個新的模型

獲取數據

In [3]:

data=pd.read_csv("data/onehot_feature.csv")

data_test = pd.read_csv("./data/onehot_feature_test.csv")

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150518 entries, 0 to 150517
Data columns (total 34 columns):
Unnamed: 0    150518 non-null int64
時間            150518 non-null int64
小區名           150518 non-null int64
小區房屋出租數量      150518 non-null float64
樓層            150518 non-null int64
總樓層           150518 non-null float64
房屋面積          150518 non-null float64
房屋朝向          150518 non-null object
居住狀態          150518 non-null float64
臥室數量          150518 non-null int64
廳的數量          150518 non-null int64
衛的數量          150518 non-null int64
出租方式          150518 non-null float64
區             150518 non-null float64
位置            150518 non-null float64
地鐵線路          150518 non-null float64
地鐵站點          150518 non-null float64
距離            150518 non-null float64
裝修情況          150518 non-null float64
月租金           150518 non-null float64
log_rent      150518 non-null float64
新朝向           150518 non-null object
房+衛+廳         150518 non-null int64
房/總           150518 non-null float64
衛/總           150518 non-null float64
廳/總           150518 non-null float64
臥室面積          150518 non-null float64
樓層比           150518 non-null float64
戶型            150518 non-null int64
有地鐵           150518 non-null int64
小區線路數         150518 non-null int64
位置線路數         150518 non-null int64
新小區名          150518 non-null int64
小區條數大於100     150518 non-null int64
dtypes: float64(18), int64(14), object(2)
memory usage: 39.0+ MB

In [4]:

# 將離散特徵轉換成字符串類型

colunms = ['時間', '新小區名', '居住狀態', '出租方式', '區',

           '位置', '地鐵線路', '地鐵站點', '裝修情況', '戶型']

for col in colunms:

    data[col] = data[col].astype(str)

In [5]:

x_columns=['小區房屋出租數量','新小區名', '樓層', '總樓層', '房屋面積','居住狀態', '臥室數量',

       '衛的數量',  '位置',  '地鐵站點', '距離', '裝修情況',

       '新朝向', '房+衛+廳', '房/總', '衛/總', '廳/總', '臥室面積', '樓層比', '戶型','有地鐵','小區線路數','位置線路數','小區條數大於100',]

y_label='log_rent'

x=data[x_columns]

y=data[y_label]

X_TEST = data_test[x_columns]

In [6]:

# 2.分割數據集

train_x, test_x, train_y, test_y = train_test_split(

    x, y, test_size=0.25, random_state=12)

In [7]:

# 1.特徵轉換

vector = DictVectorizer(sparse=True)

x_train = vector.fit_transform(train_x.to_dict(orient='records'))

x_test = vector.transform(test_x.to_dict(orient='records'))

X_TEST = vector.transform(X_TEST.to_dict(orient="records"))

In [8]:

print(x_train.shape, x_test.shape, X_TEST.shape)

(112888, 826) (37630, 826) (46000, 826)

In [9]:

# 2.降維

pca=PCA(0.98)

pca_x_train=pca.fit_transform(x_train.toarray())

pca_x_test=pca.transform(x_test.toarray())

PCA_X_TEST = pca.transform(X_TEST.toarray())

In [10]:

print(pca_x_train.shape, pca_x_test.shape, PCA_X_TEST.shape)

(112888, 361) (37630, 361) (46000, 361)

In [68]:

def rmse(y_true,y_pred):

    y_pred=np.exp(y_pred)-1  # 轉換成真實的租金

    y_true=np.exp(y_true)-1

    return np.sqrt(mean_squared_error(y_true,y_pred))

構建子模型

構建嶺迴歸模型

In [69]:

%%time

# 1.通過參數搜索,確定最優參數alpha的值

ridge = Ridge(normalize=True)

params = {

    "alpha": [0.005, 0.01, 1, 5, 10, 20, 50]

model1 = GridSearchCV(ridge, param_grid=params, cv=5, n_jobs=-1)

model1.fit(pca_x_train, train_y)

model1.best_params_

#{'alpha': 50, 'fit_intercept': True}

CPU times: user 1.78 s, sys: 705 ms, total: 2.48 s
Wall time: 21.5 s

In [70]:

# 利用搜索出的最優參數構建模型

ridge = Ridge(alpha=50, normalize=True)

ridge.fit(pca_x_train, train_y)

Out[70]:

Ridge(alpha=50, copy_X=True, fit_intercept=True, max_iter=None, normalize=True,
      random_state=None, solver='auto', tol=0.001)

In [71]:

y_pred_test=ridge.predict(pca_x_test)

y_pred_train=ridge.predict(pca_x_train)

print("訓練集rmse：",rmse(train_y,y_pred_train))

print("測試集rmse：",rmse(test_y,y_pred_test))

訓練集rmse： 6.342657781238426
測試集rmse： 6.493947602276618

構建lasso迴歸

In [72]:

%%time

# 1.參數搜索

lasso = Lasso(normalize=True)

params = {

    "alpha": [0.001, 0.01, 0.05, 0.1, 0.5, 1, 5, 10],

    "fit_intercept": [True, False]

model2 = GridSearchCV(lasso, param_grid=params, cv=5, n_jobs=-1)

model2.fit(pca_x_train, train_y)

print(model2.best_params_)

#{'alpha': 0.001, 'fit_intercept': True}

{'alpha': 0.001, 'fit_intercept': True}
CPU times: user 1.68 s, sys: 551 ms, total: 2.23 s
Wall time: 49.6 s

In [73]:

# 利用搜索出的最優參數構建模型

lasso=Lasso(alpha=0.001, normalize=True)

lasso.fit(pca_x_train,train_y)

Out[73]:

Lasso(alpha=0.001, copy_X=True, fit_intercept=True, max_iter=1000,
      normalize=True, positive=False, precompute=False, random_state=None,
      selection='cyclic', tol=0.0001, warm_start=False)

In [74]:

%%time

y_pred_test=lasso.predict(pca_x_test)

y_pred_train=lasso.predict(pca_x_train)

print("訓練集rmse：",rmse(train_y,y_pred_train))

print("測試集rmse：",rmse(test_y,y_pred_test))

訓練集rmse： 6.385065714494761
測試集rmse： 6.53676743372339
CPU times: user 393 ms, sys: 47.4 ms, total: 440 ms
Wall time: 87.1 ms

構建隨機森林

In [75]:

%%time

# 1.參數搜索

rf = RandomForestRegressor(max_features='sqrt')  # 設置max_features='sqrt'，不然太耗時間

params = {

    "n_estimators": [200],  # [200,500,700],

    "max_depth": [50],  # [40, 50, 60]

    "min_samples_split": [20, 50, 100],

    "min_samples_leaf": [10, 20, 30]

model3 = GridSearchCV(rf, param_grid=params, cv=5, n_jobs=-1, verbose=2)

model3.fit(pca_x_train, train_y)

print(model3.best_params_)

# {'max_depth': 50,

#  'min_samples_leaf': 10,

#  'min_samples_split': 20,

#  'n_estimators': 200}

Fitting 5 folds for each of 9 candidates, totalling 45 fits

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed: 55.7min
[Parallel(n_jobs=-1)]: Done  45 out of  45 | elapsed: 81.1min finished

{'max_depth': 50, 'min_samples_leaf': 10, 'min_samples_split': 20, 'n_estimators': 200}
CPU times: user 10min 4s, sys: 8.96 s, total: 10min 13s
Wall time: 1h 31min 30s

In [76]:

%%time

# 利用搜索出的最優參數構建模型

rf=RandomForestRegressor(n_estimators=200,

                         max_features=0.8,

                         max_depth=50,

                         min_samples_split=20,

                         min_samples_leaf=10,

                         n_jobs=-1)

rf.fit(pca_x_train,train_y)

CPU times: user 3h 34min 3s, sys: 1min 29s, total: 3h 35min 32s
Wall time: 33min 4s

In [77]:

%%time

y_pred_test=rf.predict(pca_x_test)

y_pred_train=rf.predict(pca_x_train)

print("訓練集rmse：",rmse(train_y,y_pred_train))

print("測試集rmse：",rmse(test_y,y_pred_test))

訓練集rmse： 2.133144119124377
測試集rmse： 2.7950254213867094
CPU times: user 24.4 s, sys: 465 ms, total: 24.9 s
Wall time: 4.53 s

構建決策樹

In [78]:

%%time

tree=DecisionTreeRegressor()

params={

    "max_depth":[60],  # [40,50,60,70],

    "min_samples_split":[5],  # [5,10,20,30,40,50]

    "min_samples_leaf":[5], # [2,3,5,7,9,11]

model4=GridSearchCV(tree,param_grid=params,cv=5,n_jobs=-1)

model4.fit(pca_x_train,train_y)

print(model4.best_params_)

# {'max_depth': 60, 'min_samples_leaf': 2, 'min_samples_split': 5}

{'max_depth': 60, 'min_samples_leaf': 5, 'min_samples_split': 5}
CPU times: user 1min 34s, sys: 2.06 s, total: 1min 36s
Wall time: 3min 26s

In [79]:

%%time

from sklearn.tree import DecisionTreeRegressor

#利用搜索出的最優參數構建模型

tree=DecisionTreeRegressor(max_depth=60,min_samples_leaf=2,min_samples_split=5)

tree.fit(pca_x_train,train_y)

CPU times: user 1min 36s, sys: 1.48 s, total: 1min 38s
Wall time: 1min 40s

In [80]:

%%time

y_pred_test=tree.predict(pca_x_test)

y_pred_train=tree.predict(pca_x_train)

print("訓練集rmse：",rmse(train_y,y_pred_train))

print("測試集rmse：",rmse(test_y,y_pred_test))

訓練集rmse： 0.805142479875888
測試集rmse： 2.6702036461919856
CPU times: user 254 ms, sys: 123 ms, total: 377 ms
Wall time: 380 ms

In [81]:

import matplotlib.pyplot as plt

plt.figure(figsize=(10,10),dpi=100)

plt.scatter(test_y,y_pred_test)

plt.xlabel("真實值")

plt.ylabel("預測值")

plt.show()

構建支持向量機

In [ ]:

# %%time

# # 1.參數搜索----數據量大 svm太耗時，調參幾乎不可能

# svr=SVR()

# params={

#     "gamma":[0.001,0.01,0.1,0.5,1,5],

#     "C":[0.001,0.1,0.5,1,5]

# }

# model5=GridSearchCV(svr,param_grid=params,cv=5,n_jobs=-1,verbose=10)

# # verbose：日誌冗長度，int：冗長度，0：不輸出訓練過程，1：偶爾輸出，>1：對每個子模型都輸出。

# model5.fit(pca_x_train,train_y)

# model5.best_params_

In [ ]:

# %%time

# # 隨意選一組參數   --- 耗時太長 放棄該模型

# svr=SVR(gamma=0.1,C=0.5)

# svr.fit(pca_x_train,train_y)

# y_pred=svr.predict(pca_x_test)

# print(rmse(test_y,y_pred))

構建xgboost模型

In [82]:

%%time

import xgboost as xgb

xgbr = xgb.XGBRegressor(objective='reg:linear', learning_rate=0.1, gamma=0.05, max_depth=45,

                 min_child_weight=0.5, subsample=0.6, reg_alpha=0.5, reg_lambda=0.8, colsample_bytree=0.5, n_jobs=-1)

xgbr.fit(pca_x_train, train_y)

y_pred = xgbr.predict(pca_x_test)

print(rmse(test_y,y_pred))

/Users/sherwin/anaconda3/lib/python3.6/site-packages/xgboost/core.py:587: FutureWarning: Series.base is deprecated and will be removed in a future version
  if getattr(data, 'base', None) is not None and \

[12:23:28] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
2.1601162492127104
CPU times: user 28min 30s, sys: 24.2 s, total: 28min 54s
Wall time: 29min 29s

In [83]:

%%time

y_pred_test=xgbr.predict(pca_x_test)

y_pred_train=xgbr.predict(pca_x_train)

print("訓練集rmse：",rmse(train_y,y_pred_train))

print("測試集rmse：",rmse(test_y,y_pred_test))

訓練集rmse： 0.9609658477710833
測試集rmse： 2.1601162492127104
CPU times: user 10 s, sys: 427 ms, total: 10.4 s
Wall time: 10.6 s

In [84]:

import matplotlib.pyplot as plt

plt.figure(figsize=(10,10),dpi=100)

plt.scatter(test_y,y_pred_test)

plt.xlabel("真實值")

plt.ylabel("預測值")

plt.show()

Stacking融合

構建Stacking模型需要的數據

In [86]:

%%time

# 獲取每個子模型的預測結果作爲特徵

# 訓練特徵

train_features=[]

train_features.append(ridge.predict(pca_x_train))  # 將每個模型預測值保存起來

train_features.append(lasso.predict(pca_x_train))

# train_features.append(svr.predict(pca_x_train))  # 這個太慢了  不要了

train_features.append(rf.predict(pca_x_train))

train_features.append(tree.predict(pca_x_train))

train_features.append(xgbr.predict(pca_x_train))

# 測試特徵

test_features=[]

test_features.append(ridge.predict(pca_x_test))

test_features.append(lasso.predict(pca_x_test))

# test_features.append(svr.predict(pca_x_test))

test_features.append(rf.predict(pca_x_test))

test_features.append(tree.predict(pca_x_test))

test_features.append(xgbr.predict(pca_x_test))

# 提交結果特徵

TEST_FEATURES=[]

TEST_FEATURES.append(ridge.predict(PCA_X_TEST))

TEST_FEATURES.append(lasso.predict(PCA_X_TEST))

# TEST_FEATURES.append(svr.predict(PCA_X_TEST))

TEST_FEATURES.append(rf.predict(PCA_X_TEST))

TEST_FEATURES.append(tree.predict(PCA_X_TEST))

TEST_FEATURES.append(xgbr.predict(PCA_X_TEST))

CPU times: user 42.1 s, sys: 1.49 s, total: 43.6 s
Wall time: 20.3 s

In [87]:

train_features

Out[87]:

[array([2.04715431, 2.05232901, 2.04572967, ..., 2.04659472, 2.04508413,
        2.05562638]),
 array([2.05200758, 2.05200758, 2.05200758, ..., 2.05200758, 2.05200758,
        2.05200758]),
 array([1.67325566, 1.94499122, 1.85460452, ..., 1.92275812, 1.76267895,
        2.22438597]),
 array([1.59023952, 1.84714777, 1.85130219, ..., 1.96150612, 1.77317884,
        2.23207518]),
 array([1.6343094, 1.9145248, 1.8356705, ..., 1.9381661, 1.7626299,
        2.2465973], dtype=float32)]

In [88]:

test_features

Out[88]:

[array([2.04925512, 2.04865288, 2.04878586, ..., 2.07295592, 2.05666692,
        2.0560697 ]),
 array([2.05200758, 2.05200758, 2.05200758, ..., 2.05200758, 2.05200758,
        2.05200758]),
 array([1.93842148, 1.71689679, 1.71233925, ..., 3.7684956 , 2.1988801 ,
        2.15518207]),
 array([1.93762954, 1.71991266, 1.59023952, ..., 3.92681962, 2.1296814 ,
        2.08786427]),
 array([1.9394264, 1.6995616, 1.8815998, ..., 3.7348156, 2.2026072,
        2.1582646], dtype=float32)]

In [89]:

# np.vstack:按垂直方向（行順序）堆疊數組構成一個新的數組

mx_train=np.vstack(train_features).T

mx_test=np.vstack(test_features).T

MX_TEST=np.vstack(TEST_FEATURES).T

MX_TEST.shape

Out[89]:

(46000, 5)

Stacking模型訓練

In [90]:

%%time

stack_model=Ridge(fit_intercept=False)

params={

    "alpha":np.logspace(-2,3,20)

model=GridSearchCV(stack_model,param_grid=params,cv=5,n_jobs=-1)

model.fit(mx_train,train_y)

print(model.best_params_)

{'alpha': 0.06158482110660264}
CPU times: user 580 ms, sys: 439 ms, total: 1.02 s
Wall time: 3.47 s

In [91]:

%%time

stack_model=Ridge(alpha=0.379269,fit_intercept=False)

stack_model.fit(mx_train,train_y)

y_pred=stack_model.predict(mx_test)

y_pred_train=stack_model.predict(mx_train)

print("訓練集rmse：",rmse(train_y,y_pred_train))

print("測試集rmse：",rmse(test_y,y_pred))

訓練集rmse： 0.7337935133190991
測試集rmse： 2.3272631885188044
CPU times: user 30.8 ms, sys: 9.28 ms, total: 40.1 ms
Wall time: 13.2 ms

In [92]:

stack_model.coef_

Out[92]:

array([-0.1330147 ,  0.13235901, -0.15773228,  0.6991465 ,  0.45928745])

提交結果輸出

In [96]:

Y_PRED_TEST = stack_model.predict(MX_TEST)

Y_PRED_TEST = np.exp(Y_PRED_TEST)-1

print(Y_PRED_TEST)

data = range(1, len(Y_PRED_TEST)+1)

Y_PRED = pd.DataFrame(data=Y_PRED_TEST, columns=["月租金"])

Y_PRED["id"] = range(1, Y_PRED.shape[0]+1)

Y_PRED.head()

[6.2493489  5.12626054 8.64297508 ... 3.59608672 1.05481017 4.8740706 ]

Out[96]:

	月租金	id
0	6.249349	1
1	5.126261	2
2	8.642975	3
3	8.885262	4
4	4.482541	5

In [97]:

Y_PRED.to_csv("./data/Y_PRED_STACK.csv")

模型保存

In [98]:

from sklearn.externals import joblib

joblib.dump(stack_model, "./data/stack_model.kpl")

Out[98]:

['./data/stack_model.kpl']

測試集結果運行

In [1]:

import matplotlib.pyplot as plt

import pandas as pd

import numpy as np

import seaborn as sns

import warnings

warnings.filterwarnings('ignore')  # 忽略一些警告

from sklearn.linear_model import Ridge

from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.feature_extraction import DictVectorizer

from sklearn.preprocessing import StandardScaler

from sklearn.decomposition import PCA

from sklearn.metrics import mean_squared_error

獲取數據

In [2]:

test=pd.read_csv("data/test.csv")

test.head()

Out[2]:

	id	時間	小區名	小區房屋出租數量	樓層	總樓層	房屋面積	房屋朝向	居住狀態	臥室數量	廳的數量	衛的數量	出租方式	區	位置	地鐵線路	地鐵站點	距離	裝修情況
0	1	3	3882	0.035156	1	0.436364	0.013075	東南	NaN	3	1	1	NaN	8.0	94.0	4.0	76.0	0.383333	NaN
1	2	3	6353	0.078125	1	0.436364	0.012248	東南	NaN	3	1	1	NaN	3.0	33.0	3.0	23.0	1.000000	NaN
2	3	3	1493	0.203125	1	0.381818	0.023006	南	NaN	4	2	2	NaN	12.0	60.0	5.0	115.0	0.945000	NaN
3	4	3	1532	0.414062	1	0.600000	0.019695	東南	NaN	3	2	2	NaN	1.0	40.0	NaN	NaN	NaN	NaN
4	5	3	1251	0.226562	1	0.381818	0.014730	東	NaN	3	1	1	NaN	12.0	52.0	NaN	NaN	NaN	NaN

In [3]:

space_threshold = 0.3

dist_value_for_fill = 2  # 爲什麼是2,因爲距離的最大值是1,沒有地鐵 意味着很遠

line_value_for_fill = 0

station_value_for_fill = 0

state_value_for_fill = 0  # test["居住狀態"].mode().values[0]

decration_value_for_fill = -1  # test["裝修情況"].mode().values[0]

rent_value_for_fill = -1  # test["出租方式"].mode().values[0]

# 拿到每個區的位置衆數

area_value_for_fill = test["區"].mode().values[0]

position_by_area = test.groupby('區').apply(lambda x: x["位置"].mode())

# print(position_by_area)

position_value_for_fill = position_by_area[position_by_area.index ==

                                           area_value_for_fill].values[0][0]

# print(position_value_for_fill)

# 拿到每個小區房屋出租數量的衆數

ratio_by_neighbor = test.groupby('小區名').apply(lambda x: x["小區房屋出租數量"].mode())

index = [x[0] for x in ratio_by_neighbor.index]

ratio_by_neighbor.index = index

ratio_by_neighbor = ratio_by_neighbor.to_dict()

ratio_mode = test["小區房屋出租數量"].mode().values[0]

In [4]:

test.shape

Out[4]:

(46000, 19)

數據清洗

In [5]:

# 缺失值比例

def ratio_of_null():

    test_missing = (test.isnull().sum()/len(test))*100

    test_missing = test_missing.drop(test_missing[test_missing==0].index).sort_values(ascending=False)

    return pd.DataFrame({'缺失百分比':test_missing})

ratio_of_null()

Out[5]:

	缺失百分比
裝修情況	91.547826
居住狀態	90.958696
出租方式	89.882609
距離	53.047826
地鐵站點	53.047826
地鐵線路	53.047826
小區房屋出租數量	0.071739
位置	0.030435
區	0.030435

In [6]:

test["小區名"].mode().values[0]

Out[6]:

In [7]:

test[test['小區名'] == 3269]

Out[7]:

	id	時間	小區名	小區房屋出租數量	樓層	總樓層	房屋面積	房屋朝向	居住狀態	臥室數量	廳的數量	衛的數量	出租方式	區	位置	地鐵線路	地鐵站點	距離	裝修情況
72	73	3	3269	0.093750	2	0.581818	0.008937	南	NaN	2	1	1	NaN	NaN	NaN	5.0	27.0	0.113333	NaN
372	373	3	3269	0.066406	0	0.545455	0.013100	西	1.0	2	1	1	1.0	NaN	NaN	5.0	72.0	0.614167	6.0
481	482	3	3269	0.148438	2	0.618182	0.024992	北	NaN	3	1	2	1.0	NaN	NaN	4.0	7.0	0.094167	NaN
1062	1063	3	3269	0.078125	1	0.272727	0.013903	南	NaN	2	2	2	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3550	3551	3	3269	0.070312	0	0.581818	0.014214	西南	NaN	2	2	1	NaN	NaN	NaN	4.0	15.0	0.578333	NaN
4344	4345	3	3269	0.039062	1	0.181818	0.020689	南	NaN	3	2	2	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4540	4541	3	3269	0.152344	0	0.527273	0.010427	東南	NaN	1	1	1	NaN	NaN	NaN	3.0	22.0	0.420833	NaN
5622	5623	3	3269	0.207031	0	0.527273	0.010758	東	NaN	3	1	1	NaN	NaN	NaN	NaN	NaN	NaN	NaN
6479	6480	3	3269	0.167969	0	0.454545	0.014565	南	NaN	2	1	1	NaN	NaN	NaN	NaN	NaN	NaN	NaN
14515	14516	3	3269	0.109375	0	0.545455	0.027143	東南	NaN	3	2	1	NaN	NaN	NaN	NaN	NaN	NaN	NaN
23976	23977	3	3269	0.015625	1	0.109091	0.017440	東南	NaN	2	2	1	NaN	NaN	NaN	NaN	NaN	NaN	NaN
27098	27099	3	3269	0.328125	0	0.309091	0.007458	東	NaN	1	1	1	NaN	NaN	NaN	1.0	77.0	0.850833	NaN
29168	29169	3	3269	0.035156	0	0.090909	0.002648	東	NaN	1	0	1	NaN	NaN	NaN	1.0	119.0	0.977500	NaN
41927	41928	3	3269	0.148438	1	0.581818	0.013903	東	NaN	3	1	1	NaN	NaN	NaN	NaN	NaN	NaN	NaN

In [8]:

test["位置"].fillna(test["位置"].mode()[0], inplace=True)

test["區"].fillna(test["區"].mode()[0], inplace=True)

test["位置"].mode()

Out[8]:

0    52.0
dtype: float64

In [9]:

test.shape

# test[test["位置"].isna()]

test[test['小區名'] == 3269]

Out[9]:

	id	時間	小區名	小區房屋出租數量	樓層	總樓層	房屋面積	房屋朝向	居住狀態	臥室數量	廳的數量	衛的數量	出租方式	區	位置	地鐵線路	地鐵站點	距離	裝修情況
72	73	3	3269	0.093750	2	0.581818	0.008937	南	NaN	2	1	1	NaN	12.0	52.0	5.0	27.0	0.113333	NaN
372	373	3	3269	0.066406	0	0.545455	0.013100	西	1.0	2	1	1	1.0	12.0	52.0	5.0	72.0	0.614167	6.0
481	482	3	3269	0.148438	2	0.618182	0.024992	北	NaN	3	1	2	1.0	12.0	52.0	4.0	7.0	0.094167	NaN
1062	1063	3	3269	0.078125	1	0.272727	0.013903	南	NaN	2	2	2	NaN	12.0	52.0	NaN	NaN	NaN	NaN
3550	3551	3	3269	0.070312	0	0.581818	0.014214	西南	NaN	2	2	1	NaN	12.0	52.0	4.0	15.0	0.578333	NaN
4344	4345	3	3269	0.039062	1	0.181818	0.020689	南	NaN	3	2	2	NaN	12.0	52.0	NaN	NaN	NaN	NaN
4540	4541	3	3269	0.152344	0	0.527273	0.010427	東南	NaN	1	1	1	NaN	12.0	52.0	3.0	22.0	0.420833	NaN
5622	5623	3	3269	0.207031	0	0.527273	0.010758	東	NaN	3	1	1	NaN	12.0	52.0	NaN	NaN	NaN	NaN
6479	6480	3	3269	0.167969	0	0.454545	0.014565	南	NaN	2	1	1	NaN	12.0	52.0	NaN	NaN	NaN	NaN
14515	14516	3	3269	0.109375	0	0.545455	0.027143	東南	NaN	3	2	1	NaN	12.0	52.0	NaN	NaN	NaN	NaN
23976	23977	3	3269	0.015625	1	0.109091	0.017440	東南	NaN	2	2	1	NaN	12.0	52.0	NaN	NaN	NaN	NaN
27098	27099	3	3269	0.328125	0	0.309091	0.007458	東	NaN	1	1	1	NaN	12.0	52.0	1.0	77.0	0.850833	NaN
29168	29169	3	3269	0.035156	0	0.090909	0.002648	東	NaN	1	0	1	NaN	12.0	52.0	1.0	119.0	0.977500	NaN
41927	41928	3	3269	0.148438	1	0.581818	0.013903	東	NaN	3	1	1	NaN	12.0	52.0	NaN	NaN	NaN	NaN

In [222]:

ratio_of_null()

Out[222]:

	缺失百分比
裝修情況	91.547826
居住狀態	90.958696
出租方式	89.882609
距離	53.047826
地鐵站點	53.047826
地鐵線路	53.047826
小區房屋出租數量	0.071739

In [223]:

# 先按照小區名和位置分組，然後獲取每組的站點衆數

station_by_nb_pos = test[['小區名', '位置', '地鐵站點', '距離']].drop_duplicates().dropna(

).groupby(['小區名', '位置'])['地鐵站點', '距離'].apply(lambda x: np.max(x.mode()))

station_by_nb_pos.head()

station_by_nb = test[['小區名', '地鐵站點', '距離']].drop_duplicates().dropna(

).groupby('小區名')['地鐵站點', '距離'].apply(lambda x: np.max(x.mode()))

station_by_nb.head()

# 拿到每個站點對應的線路

lines_by_station = test[['地鐵站點', '地鐵線路']].drop_duplicates(

).dropna().groupby('地鐵站點')['地鐵線路'].min()

def fill_stations(line, s_by_np, s_by_n, l_by_s):

"""

    s_by_np:接收station_by_nb_pos

    s_by_n:接收station_by_nb

    l_by_s:接收lines_by_station

"""

    # 首先判斷line行地鐵站點是否缺失

    # 注意這裏最好用pd.isna,不要用np.isnull

    if not pd.isna(line['地鐵站點']):  # 不是空，就直接返回原行

        return line

    # 如果小區名和位置組合在數據索引中，就查找進行填充

    if (line['小區名'], line['位置']) in s_by_np:

        line['地鐵站點'] = s_by_np.loc[(line['小區名'], line['位置']), '地鐵站點']

        line['距離'] = s_by_np.loc[(line['小區名'], line['位置']), '距離']

        line['地鐵線路'] = l_by_s[line['地鐵站點']]

    elif line['小區名'] in s_by_n.index:

        line['地鐵站點'] = s_by_n.loc[line['小區名'], '地鐵站點']  # 用小區衆數填充

        line['距離'] = s_by_n.loc[line['小區名'], '距離']

        line['地鐵線路'] = l_by_s[line['地鐵站點']]

    else:  # 小區名也找不到的情況下  單獨作爲一類，即沒有地鐵

        line['地鐵站點'] = 0

        line['距離'] = 2  # 距離用2填充

        line['地鐵線路'] = 0

    return line

test = test.apply(fill_stations, s_by_np=station_by_nb_pos,

                    s_by_n=station_by_nb, l_by_s=lines_by_station, axis=1)

ratio_of_null()

Out[223]:

	缺失百分比
裝修情況	91.547826
居住狀態	90.958696
出租方式	89.882609
小區房屋出租數量	0.071739

In [224]:

# 拿到每個小區房屋出租數量的衆數

ratio_by_neighbor = test[['小區名', '小區房屋出租數量']].dropna().groupby(

    '小區名').apply(lambda x: np.mean(x["小區房屋出租數量"].mode()))

ratio_by_neighbor.head()

#拿到所有小區的“小區房屋出租數量”衆數

ratio_mode=test["小區房屋出租數量"].mode().values[0]

ratio_mode

def fill_by_key(x,k,v,values,mode):

    if not pd.isna(x[v]):

        return x

    else:

        if x[k] in values.index:

            x[v]=values[x[k]]

        else:

            x[v]=mode

        return x

# test['小區房屋出租數量']=test['小區房屋出租數量'].map()

test=test.apply(fill_by_key,k="小區名",v="小區房屋出租數量",values=ratio_by_neighbor,mode=ratio_mode,axis=1)

ratio_of_null()

Out[224]:

	缺失百分比
裝修情況	91.547826
居住狀態	90.958696
出租方式	89.882609

In [225]:

test["出租方式"]=test["出租方式"].fillna(int(-1))

test["裝修情況"]=test["裝修情況"].fillna(int(-1))

test["居住狀態"]=test["居住狀態"].fillna(int(0))

ratio_of_null()

Out[225]:

In [226]:

ratio_of_null()

Out[226]:

Type Markdown and LaTeX: 𝛼2α2

特徵工程

In [227]:

test["房屋朝向"].head()

Out[227]:

0    東南
1    東南
2     南
3    東南
4     東
Name: 房屋朝向, dtype: object

In [228]:

def split(text,i):

    items=text.split(" ")

    if i<len(items):

        return items[i]

    else:

        return np.nan

test['新朝向']=test['房屋朝向'].map(lambda x:split(x,0))

In [229]:

test.shape

Out[229]:

(46000, 20)

In [230]:

test["房+衛+廳"] = test["臥室數量"]+test["廳的數量"]+test["衛的數量"]

test["房/總"] = test["臥室數量"]/(test["房+衛+廳"]+1)  # 加1是爲了防止分母=0出現結果爲inf無窮大的現象

test["衛/總"] = test["衛的數量"]/(test["房+衛+廳"]+1)

test["廳/總"] = test["廳的數量"]/(test["房+衛+廳"]+1)

test['臥室面積'] = test['房屋面積']/(test['臥室數量']+1)

test['樓層比'] = test['樓層']/(test["總樓層"]+1)

test['戶型'] = test[['臥室數量', '廳的數量', '衛的數量']].apply(

    lambda x: str(x['臥室數量'])+str(x['廳的數量'])+str(x['衛的數量']), axis=1

test["有地鐵"]=(test["地鐵站點"]>-1).map(int)

lines_count1=test[['小區名','地鐵線路']].drop_duplicates().groupby('小區名').count()

lines_count2=test[['位置','地鐵線路']].drop_duplicates().groupby('位置').count()

lines_count2.columns=['位置線路數']

lines_count1.columns=['小區線路數']

test=pd.merge(test,lines_count1,how='left',on=['小區名'])

test=pd.merge(test,lines_count2,how='left',on=['位置'])

neighbors=test['小區名'].value_counts()

test['新小區名']=test.apply(lambda x: x['小區名'] if neighbors[x['小區名']]>100 else -1,axis=1)

test['小區條數大於100']=test.apply(lambda x: 1 if neighbors[x['小區名']]>100 else 0,axis=1)

In [231]:

#將離散特徵轉換成字符串類型

colunms = ['時間', '小區名', '居住狀態', '出租方式', '區','位置','地鐵線路','地鐵站點','裝修情況']

for col in colunms:

    test[col] = test[col].astype(str)

In [232]:

test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 46000 entries, 0 to 45999
Data columns (total 32 columns):
id           46000 non-null int64
時間           46000 non-null object
小區名          46000 non-null object
小區房屋出租數量     46000 non-null float64
樓層           46000 non-null int64
總樓層          46000 non-null float64
房屋面積         46000 non-null float64
房屋朝向         46000 non-null object
居住狀態         46000 non-null object
臥室數量         46000 non-null int64
廳的數量         46000 non-null int64
衛的數量         46000 non-null int64
出租方式         46000 non-null object
區            46000 non-null object
位置           46000 non-null object
地鐵線路         46000 non-null object
地鐵站點         46000 non-null object
距離           46000 non-null float64
裝修情況         46000 non-null object
新朝向          46000 non-null object
房+衛+廳        46000 non-null int64
房/總          46000 non-null float64
衛/總          46000 non-null float64
廳/總          46000 non-null float64
臥室面積         46000 non-null float64
樓層比          46000 non-null float64
戶型           46000 non-null object
有地鐵          46000 non-null int64
小區線路數        46000 non-null int64
位置線路數        46000 non-null int64
新小區名         46000 non-null int64
小區條數大於100    46000 non-null int64
dtypes: float64(9), int64(11), object(12)
memory usage: 11.6+ MB

In [233]:

test.head()

Out[233]:

	id	時間	小區名	小區房屋出租數量	樓層	總樓層	房屋面積	房屋朝向	臥室數量	...	衛/總	廳/總	臥室面積	樓層比	戶型	有地鐵	小區線路數	位置線路數	新小區名	小區條數大於100
0	1	3	3882	0.035156	1	0.436364	0.013075	東南	3	...	0.166667	0.166667	0.003269	0.696203	311	1	1	2	-1	0
1	2	3	6353	0.078125	1	0.436364	0.012248	東南	3	...	0.166667	0.166667	0.003062	0.696203	311	1	1	2	-1	0
2	3	3	1493	0.203125	1	0.381818	0.023006	南	4	...	0.222222	0.222222	0.004601	0.723684	422	1	1	2	1493	1
3	4	3	1532	0.414062	1	0.600000	0.019695	東南	3	...	0.250000	0.250000	0.004924	0.625000	322	1	1	2	1532	1
4	5	3	1251	0.226562	1	0.381818	0.014730	東	3	...	0.166667	0.166667	0.003683	0.723684	311	1	1	5	1251	1

5 rows × 32 columns

In [234]:

test.shape

Out[234]:

(46000, 32)

In [235]:

test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 46000 entries, 0 to 45999
Data columns (total 32 columns):
id           46000 non-null int64
時間           46000 non-null object
小區名          46000 non-null object
小區房屋出租數量     46000 non-null float64
樓層           46000 non-null int64
總樓層          46000 non-null float64
房屋面積         46000 non-null float64
房屋朝向         46000 non-null object
居住狀態         46000 non-null object
臥室數量         46000 non-null int64
廳的數量         46000 non-null int64
衛的數量         46000 non-null int64
出租方式         46000 non-null object
區            46000 non-null object
位置           46000 non-null object
地鐵線路         46000 non-null object
地鐵站點         46000 non-null object
距離           46000 non-null float64
裝修情況         46000 non-null object
新朝向          46000 non-null object
房+衛+廳        46000 non-null int64
房/總          46000 non-null float64
衛/總          46000 non-null float64
廳/總          46000 non-null float64
臥室面積         46000 non-null float64
樓層比          46000 non-null float64
戶型           46000 non-null object
有地鐵          46000 non-null int64
小區線路數        46000 non-null int64
位置線路數        46000 non-null int64
新小區名         46000 non-null int64
小區條數大於100    46000 non-null int64
dtypes: float64(9), int64(11), object(12)
memory usage: 11.6+ MB

數據保存

In [237]:

# 保存處理後的數據

test.to_csv("./data/onehot_feature_test.csv")

test_for_each_group

In [43]:

import numpy as np

import pandas as pd

from sklearn.metrics import mean_squared_error

獲取數據

測試集結果

In [44]:

test_r = pd.read_csv("./data/test_result.csv")

In [45]:

test_r.head()

Out[45]:

	id	時間	小區名	小區房屋出租數量	樓層	總樓層	房屋面積	房屋朝向	居住狀態	臥室數量	廳的數量	衛的數量	出租方式	區	位置	地鐵線路	地鐵站點	距離	裝修情況	月租金
0	1	3	3882	0.035156	1	0.436364	0.013075	東南	NaN	3	1	1	NaN	8.0	94.0	4.0	76.0	0.383333	NaN	6.281834
1	2	3	6353	0.078125	1	0.436364	0.012248	東南	NaN	3	1	1	NaN	3.0	33.0	3.0	23.0	1.000000	NaN	6.281834
2	3	3	1493	0.203125	1	0.381818	0.023006	南	NaN	4	2	2	NaN	12.0	60.0	5.0	115.0	0.945000	NaN	23.259762
3	4	3	1532	0.414062	1	0.600000	0.019695	東南	NaN	3	2	2	NaN	1.0	40.0	NaN	NaN	NaN	NaN	2.886248
4	5	3	1251	0.226562	1	0.381818	0.014730	東	NaN	3	1	1	NaN	12.0	52.0	NaN	NaN	NaN	NaN	10.696095

各小組提交結果

In [46]:

students_res = pd.read_csv("./data/第五組_result_11.csv")

# students_res = pd.read_csv("./data/第四組_result_28.csv", encoding="gbk")

a = pd.read_csv("./data/Y_PRED_STACK.csv")

In [47]:

students_res.shape

Out[47]:

(46000, 2)

開始檢測

單個模型檢測

In [48]:

def rmse(y_true,y_pred):

    return np.sqrt(mean_squared_error(y_true,y_pred))

In [49]:

y_true = test_r["月租金"]

In [50]:

# y_pred = students_res["月租金"]

y_pred = a["月租金"]

In [51]:

rmse(y_true, y_pred)

Out[51]:

6.363011257567193

多個模型檢測

In [31]:

for i in range(2, 15):

    str = "./data/第四組/第四組_result_{}.csv".format(i)

    c_4 = pd.read_csv(str, encoding="gbk")

    y_pred = c_4["月租金"]

    ret = rmse(y_true, y_pred)

    print("第{}個數據測試結果是:".format(i), ret)

第2個數據測試結果是: 2.0101892375103816
第3個數據測試結果是: 1.9881682568747705
第4個數據測試結果是: 2.217309210690951
第5個數據測試結果是: 2.1021356120093677
第6個數據測試結果是: 2.112276196225913
第7個數據測試結果是: 2.006692666194838
第8個數據測試結果是: 2.038233947555217
第9個數據測試結果是: 2.065344244978377
第10個數據測試結果是: 2.0763622914485294
第11個數據測試結果是: 2.1008100828126306
第12個數據測試結果是: 2.3086208012888645
第13個數據測試結果是: 2.0620903819477547
第14個數據測試結果是: 2.1388232636828235

Sklearn：房租租⾦模型預測 版本一

庫安裝：pip install xgboost

數據初步分析

導入數據

數據探索

基本信息

缺失值比例

目標值分佈

所有特徵分佈

直方圖和柱狀分佈圖

相關性分析

連續特徵和目標值的散點圖

特徵和目標相關性分析

皮爾森相關性熱力圖

皮爾森相關

繪製箱線圖

異常值分析

問題數據

房間朝向列有多個值

同一個小區屬於不同的區

同一個小區地鐵線路不同的問題

研究一下位置和地鐵線路的關係

研究一下位置和地鐵站點的關係

研究一下小區名，位置，地鐵線路，站點的關係

研究一下是否有換乘站的存在

研究一下每個位置的地鐵線路數和站點數

研究一下位置缺失的樣本地鐵站點是否也是缺失的

位置和區的關係校驗

看一下小區名過多的問題

數據清洗​

導入數據

數據基本信息查看

設置後面要用的填充量

缺失值處理

缺失值比例

填充區和位置

地鐵站點，距離 處理

小區房屋出租數量處理

裝修，居住狀態，出租方式--作爲單獨一類

清除異常樣本

糾偏

問題數據處理

存儲數據

特徵工程

獲取數據

數據基本信息產看

特徵處理

根據房間,廳,衛,房屋面積構造新特徵

構造是否有地鐵

構造地鐵線路數特徵

去掉出現數量較少的小區

轉換類型

保存處理後的數據

初步建模​

數據處理

確定特徵值,目標值

分割數據集

特徵工程

確定評估函數

模型訓練

構建嶺迴歸模型

模型保存

提交結果輸出

模型融合

獲取數據

構建子模型

構建嶺迴歸模型

構建lasso迴歸

構建隨機森林

構建決策樹

構建支持向量機

構建xgboost模型

Stacking融合

構建Stacking模型需要的數據

Stacking模型訓練

提交結果輸出

模型保存

測試集結果運行

獲取數據

數據清洗

Sklearn：房租租⾦模型預測版本一

數據清洗

地鐵站點，距離處理

初步建模