數據集下載鏈接：https://pan.baidu.com/s/13OtaUv6j4x8dD7cgD4sL5g
提取碼：7tze

庫安裝：pip install xgboost

數據初步分析

In [1]:

import matplotlib.pyplot as plt

import pandas as pd

import numpy as np

import seaborn as sns

import warnings

warnings.filterwarnings('ignore')#忽略一些警告

# plt.rcParams['font.sans-serif']=['SimHei']

# plt.rcParams['axes.unicode_minus']=False

導入數據

In [2]:

train=pd.read_csv("data/train.csv")

test=pd.read_csv("data/test.csv")

train.head()

Out[2]:

	時間	小區名	小區房屋出租數量	樓層	總樓層	房屋面積	房屋朝向	居住狀態	臥室數量	廳的數量	衛的數量	出租方式	區	位置	地鐵線路	地鐵站點	距離	裝修情況	月租金
0	1	3072	0.128906	2	0.236364	0.008628	東南	NaN	1	1	1	NaN	11.0	118.0	2.0	40.0	0.764167	NaN	5.602716
1	1	3152	0.132812	1	0.381818	0.017046	東	NaN	1	0	0	NaN	10.0	100.0	4.0	58.0	0.709167	NaN	16.977929
2	1	5575	0.042969	0	0.290909	0.010593	東南	NaN	2	1	2	NaN	12.0	130.0	5.0	37.0	0.572500	NaN	8.998302
3	1	3103	0.085938	2	0.581818	0.019199	南	NaN	3	2	2	NaN	7.0	90.0	2.0	63.0	0.658333	NaN	5.602716
4	1	5182	0.214844	0	0.545455	0.010427	東北	NaN	2	1	1	NaN	3.0	31.0	NaN	NaN	NaN	NaN	7.300509

In [4]:

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 196539 entries, 0 to 196538
Data columns (total 19 columns):
時間          196539 non-null int64
小區名         196539 non-null int64
小區房屋出租數量    195538 non-null float64
樓層          196539 non-null int64
總樓層         196539 non-null float64
房屋面積        196539 non-null float64
房屋朝向        196539 non-null object
居住狀態        20138 non-null float64
臥室數量        196539 non-null int64
廳的數量        196539 non-null int64
衛的數量        196539 non-null int64
出租方式        24230 non-null float64
區           196508 non-null float64
位置          196508 non-null float64
地鐵線路        91778 non-null float64
地鐵站點        91778 non-null float64
距離          91778 non-null float64
裝修情況        18492 non-null float64
月租金         196539 non-null float64
dtypes: float64(12), int64(6), object(1)
memory usage: 28.5+ MB

In [5]:

train.describe()

Out[5]:

	時間	小區名	小區房屋出租數量	樓層	總樓層	房屋面積	居住狀態	臥室數量	廳的數量	衛的數量	出租方式	區	位置	地鐵線路	地鐵站點	距離	裝修情況	月租金
count	196539.000000	196539.000000	195538.000000	196539.000000	196539.000000	196539.000000	20138.000000	196539.000000	196539.000000	196539.000000	24230.000000	196508.000000	196508.000000	91778.000000	91778.000000	91778.000000	18492.000000	196539.000000
mean	2.115229	3224.116562	0.124151	0.955449	0.408711	0.013139	2.725196	2.236635	1.299625	1.223818	0.900289	7.905139	67.945982	3.284850	57.493735	0.551202	3.589228	7.949313
std	0.786980	2023.073726	0.133299	0.851511	0.183100	0.008104	0.667763	0.896961	0.613169	0.487234	0.299621	4.025696	43.522394	1.477147	35.191414	0.247268	1.996912	6.310609
min	1.000000	0.000000	0.007812	0.000000	0.000000	0.000000	1.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1.000000	1.000000	0.001667	1.000000	0.000000
25%	1.000000	1388.000000	0.039062	0.000000	0.290909	0.009268	3.000000	2.000000	1.000000	1.000000	1.000000	4.000000	33.000000	2.000000	23.000000	0.356667	2.000000	4.923599
50%	2.000000	3086.000000	0.082031	1.000000	0.418182	0.012910	3.000000	2.000000	1.000000	1.000000	1.000000	9.000000	61.000000	4.000000	59.000000	0.554167	2.000000	6.621392
75%	3.000000	5199.000000	0.160156	2.000000	0.563636	0.014896	3.000000	3.000000	2.000000	1.000000	1.000000	11.000000	103.000000	5.000000	87.000000	0.745833	6.000000	8.998302
max	3.000000	6627.000000	1.000000	2.000000	1.000000	1.000000	3.000000	11.000000	8.000000	8.000000	1.000000	14.000000	152.000000	5.000000	119.000000	1.000000	6.000000	100.000000

數據探索

基本信息

In [6]:

train.head()

Out[6]:

	時間	小區名	小區房屋出租數量	樓層	總樓層	房屋面積	房屋朝向	居住狀態	臥室數量	廳的數量	衛的數量	出租方式	區	位置	地鐵線路	地鐵站點	距離	裝修情況	月租金
0	1	3072	0.128906	2	0.236364	0.008628	東南	NaN	1	1	1	NaN	11.0	118.0	2.0	40.0	0.764167	NaN	5.602716
1	1	3152	0.132812	1	0.381818	0.017046	東	NaN	1	0	0	NaN	10.0	100.0	4.0	58.0	0.709167	NaN	16.977929
2	1	5575	0.042969	0	0.290909	0.010593	東南	NaN	2	1	2	NaN	12.0	130.0	5.0	37.0	0.572500	NaN	8.998302
3	1	3103	0.085938	2	0.581818	0.019199	南	NaN	3	2	2	NaN	7.0	90.0	2.0	63.0	0.658333	NaN	5.602716
4	1	5182	0.214844	0	0.545455	0.010427	東北	NaN	2	1	1	NaN	3.0	31.0	NaN	NaN	NaN	NaN	7.300509

缺失值比例

In [7]:

train_missing = (train.isnull().sum()/len(train))*100# 每列的缺失值個數/總行數

train_missing = train_missing.drop(train_missing[train_missing==0].index).sort_values(ascending=False)#去掉缺失比例爲0的列

miss_data = pd.DataFrame({'缺失百分比':train_missing})

miss_data

Out[7]:

	缺失百分比
裝修情況	90.591180
居住狀態	89.753688
出租方式	87.671658
距離	53.302907
地鐵站點	53.302907
地鐵線路	53.302907
小區房屋出租數量	0.509314
位置	0.015773
區	0.015773

目標值分佈

In [8]:

plt.figure(figsize=(12,6))

plt.subplot(211)

plt.title('月租金分佈')

sns.distplot(train['月租金'])#價格的數量分佈曲線

plt.subplot(212)

plt.scatter(range(train.shape[0]),np.sort(train['月租金'].values))

plt.show()

所有特徵分佈

直方圖和柱狀分佈圖

In [9]:

train.hist(figsize=(20,20),bins=50,grid=False)

plt.show()

異常值分析

這裏我們主要分析跟月租金相關性較大的房屋面積的異常值

In [20]:

def plot_reg(xs,y,data,cols=1):

n=len(xs)

for i in range(n):

plt.figure(figsize=(10,10))

sns.regplot(x=data[xs[i]],y=data[y])

plt.show()

In [21]:

reg_cols=['房屋面積']

plot_reg(reg_cols,"月租金",train)

問題數據

房間朝向列有多個值

In [22]:

train['房屋朝向'].value_counts()

Out[22]:

南            54770
東南           54359
東            31962
西南           17470
北            10428
             ...  
東南 南 西南 西        1
東 西南 北           1
東南 西 北           1
南 西南 西 西北        1
西南 西 東北          1
Name: 房屋朝向, Length: 64, dtype: int64

In [24]:

def split(text,i):

"""

實現對字符串進行分割,並取出結果中下標i對應的值

"""

items=text.split(" ")

if i<len(items):

return items[i]

else:

return np.nan

for i in range(5):

train['朝向_'+str(i)]=train['房屋朝向'].map(lambda x:split(x,i))

# train['房屋地鐵站點朝向'].map(lambda x:x.split(" ")).map(lambda x:len(x)).max()

In [ ]:

names=["朝向_{}".format(i) for i in range(5)]

train[names].info()

同一個小區屬於不同的區

In [25]:

train.columns

Out[25]:

Index(['時間', '小區名', '小區房屋出租數量', '樓層', '總樓層', '房屋面積', '房屋朝向', '居住狀態', '臥室數量',
       '廳的數量', '衛的數量', '出租方式', '區', '位置', '地鐵線路', '地鐵站點', '距離', '裝修情況', '月租金',
       '朝向_0', '朝向_1', '朝向_2', '朝向_3', '朝向_4'],
      dtype='object')

In [26]:

#去掉'小區名','區','位置'三個列重複之後有5578個不重複值

neighbors1=train[['小區名','區','位置']]

neighbors1.shape

Out[26]:

(196539, 3)

In [27]:

#去掉'小區名','區','位置'三個列重複之後有5577個不重複值

neighbors1=train[['小區名','位置']].drop_duplicates().dropna()

neighbors1.shape

Out[27]:

(5577, 2)

In [28]:

#而有位置的小區名只有5547個不重複值說明有31個小區位於不同的位置

train[train['位置'].notnull()].drop_duplicates(['小區名']).shape

Out[28]:

(5546, 24)

In [29]:

#neighbors1按照小區名分組後保留分組條數大於1的小區名

count=neighbors1.groupby('小區名')['位置'].count()

ids=count[count>1].index

ids

Out[29]:

Int64Index([ 284,  385,  418,  701,  783, 1455, 1870, 2228, 2468, 2513, 2611,
            2916, 3183, 3268, 3482, 3645, 3967, 4054, 4071, 4471, 4767, 4859,
            5320, 5699, 5844, 5968, 6020, 6122, 6515, 6626, 6627],
           dtype='int64', name='小區名')

In [30]:

#在原數據中篩選出這些小區的信息

neighbors_has_problem=train[['小區名','位置']][train['小區名'].isin(ids)].sort_values(by='小區名')

neighbors_has_problem

Out[30]:

	小區名	位置
105800	284	102.0
105988	284	102.0
105228	284	102.0
105076	284	102.0
105074	284	102.0
...	...	...
131530	6627	86.0
158621	6627	136.0
56569	6627	86.0
39956	6627	86.0
161162	6627	136.0

843 rows × 2 columns

In [31]:

#找到每個小區的位置衆數

#這裏要注意x.mode有可能返回多個衆數，所以用一個np.max拿到最值最大的衆數作爲最終的結果

position_mode_of_neighbors=neighbors_has_problem.groupby('小區名').apply(lambda x:np.max(x['位置'].mode()))

#位置缺失值就用這個數據來進行填充，對於已有的一個小區位於不同的位置，考慮到可能是因爲小區太大導致，並不能認爲是邏輯錯誤，保持不變

position_mode_of_neighbors

Out[31]:

小區名
284     102.0
385     108.0
418     122.0
701      92.0
783     134.0
1455     40.0
1870    106.0
2228    101.0
2468     43.0
2513     86.0
2611    112.0
2916     31.0
3183    136.0
3268     86.0
3482     64.0
3645    121.0
3967    100.0
4054      1.0
4071    129.0
4471     15.0
4767     18.0
4859     73.0
5320     95.0
5699    120.0
5844    143.0
5968     40.0
6020    109.0
6122     18.0
6515    130.0
6626     86.0
6627     86.0
dtype: float64

同一個小區地鐵線路不同的問題

In [32]:

#去掉'小區名','地鐵線路'兩個列重複之後有3412個不重複值

lines=train[['小區名','地鐵線路']].drop_duplicates().dropna()

lines.shape

Out[32]:

(3412, 2)

In [33]:

#而有地鐵的小區名只有3330個不重複值說明有112個小區有多個地鐵線路

train[train['地鐵線路'].notnull()].drop_duplicates(['小區名']).shape

Out[33]:

(3330, 24)

In [34]:

#lines按照小區名分組後保留分組條數大於1的小區名最終有多條地鐵的小區有79個

#這個地鐵線路分位置可能有關係因爲同一個小區位於不同的位置，地鐵線路也有可能不同

count=lines.groupby('小區名')['地鐵線路'].count()

ids=count[count>1].index

ids.shape

Out[34]:

(79,)

研究一下位置和地鐵線路的關係

In [35]:

#去掉'位置','地鐵線路'兩個列重複之後有184個不重複值

pos_lines=train[['位置','地鐵線路']].drop_duplicates().dropna()

pos_lines.shape

Out[35]:

(184, 2)

In [36]:

#我們在來看一下有地鐵的位置中有多少個不同的 120個

pos_lines['位置'].value_counts()

Out[36]:

113.0    4
100.0    4
118.0    3
63.0     3
106.0    3
        ..
22.0     1
34.0     1
151.0    1
28.0     1
99.0     1
Name: 位置, Length: 120, dtype: int64

In [37]:

#pos_lines按照位置分組後保留分組條數大於1的位置最終有多條地鐵的位置有49個

count=pos_lines.groupby('位置')['地鐵線路'].count()

ids=count[count>1].index

ids.shape

Out[37]:

(49,)

研究一下位置和地鐵站點的關係

In [38]:

#去掉'位置','地鐵站點'兩個列重複之後有339個不重複值

pos_stations=train[['位置','地鐵站點']].drop_duplicates().dropna()

pos_stations.shape

Out[38]:

(339, 2)

In [39]:

#我們在來看一下有地鐵的位置中有多少個不同的 120個

pos_stations['位置'].value_counts()

Out[39]:

63.0     9
136.0    7
106.0    6
100.0    6
143.0    6
        ..
17.0     1
12.0     1
67.0     1
82.0     1
148.0    1
Name: 位置, Length: 120, dtype: int64

In [40]:

#pos_stations按照位置分組後保留分組條數大於1的位置最終有多個站點的位置有97個

count=pos_stations.groupby('位置')['地鐵站點'].count()

ids=count[count>1].index

ids.shape

Out[40]:

(97,)

研究一下小區名，位置，地鐵線路，站點的關係

In [41]:

#去掉'位置','地鐵站點'兩個列重複之後有3575個不重複值

neighbor_pos_stations=train[['小區名','位置','地鐵線路','地鐵站點']].drop_duplicates().dropna()

neighbor_pos_stations.shape

Out[41]:

(3575, 4)

In [42]:

#看一下是否存在下小區名，位置一樣的情況下，地鐵線路不一樣的情況

#可以看出：3575-3414=161條小區名，位置，地鐵線路同樣的情況下，地鐵站點不一樣

#3414-3342=72條小區名，位置一樣，地鐵線路不一樣

#這種情況可能是因爲數據錯誤，也有可能是實際情況，後面對此我們不做處理

print(neighbor_pos_stations[['小區名','位置','地鐵線路']].drop_duplicates().dropna().shape)

print(neighbor_pos_stations[['小區名','位置']].drop_duplicates().dropna().shape)

(3414, 3)
(3342, 2)

研究一下是否有換乘站的存在

用站點分組，然後統計地鐵線路數

In [43]:

#結果說明沒有換乘站點存在，因爲每個站點僅僅屬於一條地鐵線路

train[['地鐵線路','地鐵站點']].drop_duplicates().dropna().groupby('地鐵站點').count().max(0)

Out[43]:

地鐵線路    1
dtype: int64

研究一下每個位置的地鐵線路數和站點數

In [44]:

#每個位置的線路數這個可以作爲新特徵加入

a=train[['位置','地鐵線路']].drop_duplicates().dropna().groupby('位置').count()

Out[44]:

	地鐵線路
位置
0.0	1
1.0	2
2.0	1
3.0	2
4.0	1
...	...
146.0	2
147.0	2
148.0	1
150.0	1
151.0	1

120 rows × 1 columns

In [45]:

#每個位置的站點數也可以作爲新特徵加入

b=train[['位置','地鐵站點']].drop_duplicates().dropna().groupby('位置').count()

Out[45]:

	地鐵站點
位置
0.0	1
1.0	3
2.0	1
3.0	4
4.0	1
...	...
146.0	3
147.0	2
148.0	1
150.0	4
151.0	1

120 rows × 1 columns

In [46]:

#兩者的相關性

al=pd.concat([a,b],axis=1)

al.corr()

Out[46]:

	地鐵線路	地鐵站點
地鐵線路	1.000000	0.685211
地鐵站點	0.685211	1.000000

研究一下位置缺失的樣本地鐵站點是否也是缺失的

In [47]:

#發現存在地鐵線路爲缺失而位置缺失的情況說明後面在填充位置缺失值的時候可以用地鐵站點來進行填充

pos_lines=train[['位置','地鐵站點']].drop_duplicates()

pos_lines['位置'].isnull().sum()

Out[47]:

In [48]:

#每個站點的位置數也可以作爲新特徵加入

train[['位置','地鐵站點']].drop_duplicates().dropna().groupby('地鐵站點').count()

Out[48]:

	位置
地鐵站點
1.0	4
2.0	1
3.0	5
4.0	1
5.0	5
...	...
115.0	3
116.0	2
117.0	1
118.0	2
119.0	4

118 rows × 1 columns

位置和區的關係校驗

In [49]:

#說明每個位置僅僅屬於一個區，不存在同一個位置屬於兩個區的現象

train[['位置','區']].drop_duplicates().dropna().groupby('位置').count().max()

Out[49]:

區    1
dtype: int64

小區名和位置的關係

In [50]:

train[train['小區名']==6626]

Out[50]:

	時間	小區名	小區房屋出租數量	樓層	總樓層	房屋面積	房屋朝向	居住狀態	臥室數量	廳的數量	...	地鐵線路	地鐵站點	距離	裝修情況	月租金	朝向_0	朝向_1	朝向_2	朝向_3	朝向_4
1513	1	6626	0.050781	0	0.581818	0.009070	南	NaN	1	1	...	1.0	10.0	0.965000	NaN	5.942275	南	NaN	NaN	NaN	NaN
6622	1	6626	0.050781	2	0.545455	0.022840	南	NaN	3	3	...	5.0	16.0	0.974167	NaN	4.244482	南	NaN	NaN	NaN	NaN
10951	1	6626	0.050781	0	0.581818	0.009103	西	NaN	1	1	...	1.0	10.0	0.965000	NaN	5.602716	西	NaN	NaN	NaN	NaN
12327	1	6626	0.050781	1	0.545455	0.014234	東南	NaN	3	1	...	5.0	16.0	0.974167	NaN	6.960951	東南	NaN	NaN	NaN	NaN
14738	1	6626	0.050781	1	0.545455	0.008039	東	NaN	1	1	...	5.0	16.0	0.974167	NaN	5.602716	東	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
177641	3	6626	0.085938	0	0.545455	0.014068	東南	1.0	2	1	...	5.0	16.0	0.974167	2.0	7.300509	東南	NaN	NaN	NaN	NaN
177725	3	6626	0.085938	1	0.545455	0.014234	東南	NaN	3	1	...	5.0	16.0	0.974167	NaN	6.960951	東南	NaN	NaN	NaN	NaN
191611	3	6626	0.085938	1	0.545455	0.014730	東	3.0	3	2	...	5.0	16.0	0.974167	6.0	8.998302	東	NaN	NaN	NaN	NaN
194534	3	6626	0.085938	2	0.581818	0.013846	東南	3.0	2	1	...	1.0	10.0	0.965000	2.0	8.319185	東南	NaN	NaN	NaN	NaN
195236	3	6626	0.085938	0	0.581818	0.012744	東南	NaN	2	1	...	1.0	10.0	0.965000	NaN	5.602716	東南	NaN	NaN	NaN	NaN

105 rows × 24 columns

In [51]:

#在數據清洗的過程中發現一個問題  3269這個小區可能比較特殊

train[train['小區名']==3269].shape

Out[51]:

(31, 24)

In [52]:

#正好位置缺失31條

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 196539 entries, 0 to 196538
Data columns (total 24 columns):
時間          196539 non-null int64
小區名         196539 non-null int64
小區房屋出租數量    195538 non-null float64
樓層          196539 non-null int64
總樓層         196539 non-null float64
房屋面積        196539 non-null float64
房屋朝向        196539 non-null object
居住狀態        20138 non-null float64
臥室數量        196539 non-null int64
廳的數量        196539 non-null int64
衛的數量        196539 non-null int64
出租方式        24230 non-null float64
區           196508 non-null float64
位置          196508 non-null float64
地鐵線路        91778 non-null float64
地鐵站點        91778 non-null float64
距離          91778 non-null float64
裝修情況        18492 non-null float64
月租金         196539 non-null float64
朝向_0        196539 non-null object
朝向_1        9285 non-null object
朝向_2        322 non-null object
朝向_3        45 non-null object
朝向_4        4 non-null object
dtypes: float64(12), int64(6), object(6)
memory usage: 36.0+ MB

我們看一下小區名和位置,地鐵站點的關係

上面可以看出3269這個小區位置缺失，但是我們發現他的地鐵線路和站點非常多，這有些異常，我們對比一下其他小區的情況

In [53]:

#可以看出，正常的小區最多屬於不同的2個位置

train[['位置','小區名']].drop_duplicates().dropna().groupby('小區名').count().max()

Out[53]:

位置    2
dtype: int64

In [54]:

#可以看出，除了3269這個小區外，其他小區最多隻有4個站點相關

#因此可以斷定3269這個小區是統計不祥的數據，可以作爲異常值丟棄

counts=train[['地鐵站點','小區名']].drop_duplicates().dropna().groupby('小區名').count()

counts[counts['地鐵站點']>3]

Out[54]:

	地鐵站點
小區名
602	4
1728	4
3269	14

看一下小區名過多的問題

In [4]:

neighbors=train['小區名'].value_counts()

In [5]:

neighbors

Out[5]:

5512    1880
1085    1155
5208    1136
6221    1066
6011    1020
        ... 
5829       1
1351       1
711        1
327        1
0          1
Name: 小區名, Length: 5547, dtype: int64

In [8]:

#觀察條目數超過50的小區有多少

(neighbors>50).sum()

Out[8]:

In [9]:

#觀察條目數超過100的小區有多少

(neighbors>100).sum()

Out[9]:

數據清洗

In [8]:

import matplotlib.pyplot as plt

import pandas as pd

import numpy as np

import seaborn as sns

import warnings

warnings.filterwarnings('ignore')#忽略一些警告

#顯示所有結果

plt.rcParams['font.sans-serif']=['SimHei']

plt.rcParams['axes.unicode_minus']=False

導入數據

In [66]:

train=pd.read_csv("data/train.csv")

test=pd.read_csv("data/test.csv")

train.head()

Out[66]:

	時間	小區名	小區房屋出租數量	樓層	總樓層	房屋面積	房屋朝向	居住狀態	臥室數量	廳的數量	衛的數量	出租方式	區	位置	地鐵線路	地鐵站點	距離	裝修情況	月租金
0	1	3072	0.128906	2	0.236364	0.008628	東南	NaN	1	1	1	NaN	11.0	118.0	2.0	40.0	0.764167	NaN	5.602716
1	1	3152	0.132812	1	0.381818	0.017046	東	NaN	1	0	0	NaN	10.0	100.0	4.0	58.0	0.709167	NaN	16.977929
2	1	5575	0.042969	0	0.290909	0.010593	東南	NaN	2	1	2	NaN	12.0	130.0	5.0	37.0	0.572500	NaN	8.998302
3	1	3103	0.085938	2	0.581818	0.019199	南	NaN	3	2	2	NaN	7.0	90.0	2.0	63.0	0.658333	NaN	5.602716
4	1	5182	0.214844	0	0.545455	0.010427	東北	NaN	2	1	1	NaN	3.0	31.0	NaN	NaN	NaN	NaN	7.300509

In [67]:

train.shape

Out[67]:

(196539, 19)

In [68]:

# train["出租方式"].value_counts()

# train["裝修情況"].value_counts()

train["居住狀態"].value_counts()

Out[68]:

3.0    17087
1.0     2483
2.0      568
Name: 居住狀態, dtype: int64

In [69]:

train.drop_duplicates(['小區名','地鐵線路'])[['小區名','地鐵線路']].sort_values(by='小區名')

Out[69]:

	小區名	地鐵線路
107337	0	3.0
3620	1	3.0
41286	2	5.0
8211	4	NaN
55962	5	NaN
...	...	...
45347	6625	NaN
6622	6626	5.0
1513	6626	1.0
1727	6627	NaN
2542	6627	1.0

6052 rows × 2 columns

設置後面要用的填充量

In [70]:

space_threshold=0.3

dist_value_for_fill=2#爲什麼是2,因爲距離的最大值是1,沒有地鐵意味着很遠

line_value_for_fill=0

station_value_for_fill=0

area_value_for_fill=train["區"].mode().values[0]

# 拿到每個區的位置衆數

position_by_area=train.groupby('區').apply(lambda x:x["位置"].mode())

position_value_for_fill=position_by_area[position_by_area.index==area_value_for_fill].values[0][0]

state_value_for_fill=0#train["居住狀態"].mode().values[0]

decration_value_for_fill=-1#train["裝修情況"].mode().values[0]

rent_value_for_fill=-1#train["出租方式"].mode().values[0]

In [38]:

#拿到每個小區房屋出租數量的衆數

ratio_by_neighbor=train.groupby('小區名').apply(lambda x:x["小區房屋出租數量"].mode())

index=[x[0] for x in ratio_by_neighbor.index]

ratio_by_neighbor.index=index

ratio_by_neighbor=ratio_by_neighbor.to_dict()

ratio_mode=train["小區房屋出租數量"].mode().values[0]

缺失值處理

缺失值的處理方式有：

刪除帶有缺失值的特徵，最簡單也最浪費信息的方式
用均值，衆數或固定的數等填充，比1好，但仍不夠好
考慮缺失的含義，把缺失值作爲一種信息
用未缺失的數據訓練模型，預測缺失的數據（分類型變量用分類算法，數值型變量用迴歸）

思路：

首先利用沒有缺失值的小區名以及有值的地鐵站點信息，填充區和位置缺失值
利用小區和位置信息填充地鐵線路和地鐵站點和距離
還沒有填充的地鐵站點和地鐵線路用固定值填充，單獨作爲一類（即沒有地鐵），同時對應的距離填充爲2（即距離地鐵較遠）
按照同名小區的小區房屋出租數量的衆數來填充小區房屋出租數量

缺失值比例

In [71]:

# 缺失值比例

def ratio_of_null():

train_missing = (train.isnull().sum()/len(train))*100

train_missing = train_missing.drop(train_missing[train_missing==0].index).sort_values(ascending=False)

return pd.DataFrame({'缺失百分比':train_missing})

ratio_of_null()

Out[71]:

	缺失百分比
裝修情況	90.591180
居住狀態	89.753688
出租方式	87.671658
距離	53.302907
地鐵站點	53.302907
地鐵線路	53.302907
小區房屋出租數量	0.509314
位置	0.015773
區	0.015773

填充區和位置

根據數據初步分析時的情況可以看出，位置缺失的就是3269這個小區的，所以選擇全部丟棄

In [72]:

train=train[train['小區名']!=3269]

In [73]:

ratio_of_null()

Out[73]:

	缺失百分比
裝修情況	90.591732
居住狀態	89.754107
出租方式	87.673275
距離	53.303682
地鐵站點	53.303682
地鐵線路	53.303682
小區房屋出租數量	0.508885

地鐵站點，距離處理

先用每個同名小區名和同位置的地鐵線路,地鐵站點,距離衆數來填充
剩下的地鐵站點，距離，地鐵線路的缺失值作爲一種特徵，表示該房屋附近沒有地鐵

In [92]:

1.#先按照小區名和位置分組，然後獲取每組的站點衆數

station_by_nb_pos=train[['小區名','位置','地鐵站點','距離']].drop_duplicates().dropna().groupby(['小區名','位置'])['地鐵站點','距離'].apply(lambda x:np.max(x.mode()))

station_by_nb_pos

Out[92]:

		地鐵站點	距離
小區名	位置
0	59.0	57.0	0.478333
1	59.0	57.0	0.563333
2	40.0	33.0	0.971667
11	24.0	103.0	0.914167
12	28.0	69.0	0.633333
...	...	...	...
6625	41.0	88.0	0.931667
6626	86.0	16.0	0.974167
6626	136.0	16.0	0.974167
6627	86.0	10.0	0.985000
6627	136.0	10.0	0.985000

3342 rows × 2 columns

In [93]:

station_by_nb=train[['小區名','地鐵站點','距離']].drop_duplicates().dropna().groupby('小區名')['地鐵站點','距離'].apply(lambda x:np.max(x.mode()))

station_by_nb

Out[93]:

	地鐵站點	距離
小區名
0	57.0	0.478333
1	57.0	0.563333
2	33.0	0.971667
11	103.0	0.914167
12	69.0	0.633333
...	...	...
6622	25.0	0.245000
6623	119.0	0.410000
6625	88.0	0.931667
6626	16.0	0.974167
6627	10.0	0.985000

3329 rows × 2 columns

In [84]:

#拿到每個站點對應的線路

lines_by_station=train[['地鐵站點','地鐵線路']].drop_duplicates().dropna().groupby('地鐵站點')['地鐵線路'].min()

In [97]:

def fill_stations(line,s_by_np,s_by_n,l_by_s):

"""

s_by_np:接收station_by_nb_pos

s_by_n:接收station_by_nb

l_by_s:接收lines_by_station

"""

#首先判斷line行地鐵站點是否缺失

#注意這裏最好用pd.isna,不要用np.isnull

if not pd.isna(line['地鐵站點']):#不是空，就直接返回原行

return line

#如果小區名和位置組合在數據索引中，就查找進行填充

if (line['小區名'],line['位置']) in s_by_np:

line['地鐵站點']=s_by_np.loc[(line['小區名'],line['位置']),'地鐵站點']

line['距離']=s_by_np.loc[(line['小區名'],line['位置']),'距離']

line['地鐵線路']=l_by_s[line['地鐵站點']]

elif line['小區名'] in s_by_n.index:

line['地鐵站點']=s_by_n.loc[line['小區名'],'地鐵站點']#用小區衆數填充

line['距離']=s_by_n.loc[line['小區名'],'距離']

line['地鐵線路']=l_by_s[line['地鐵站點']]

else:#小區名也找不到的情況下單獨作爲一類，即沒有地鐵

line['地鐵站點']=0

line['距離']=2#距離用2填充

line['地鐵線路']=0

return line

train=train.apply(fill_stations,s_by_np=station_by_nb_pos,s_by_n=station_by_nb,l_by_s=lines_by_station,axis=1)

ratio_of_null()

Out[97]:

	缺失百分比
裝修情況	90.591732
居住狀態	89.754107
出租方式	87.673275
小區房屋出租數量	0.508885

小區房屋出租數量處理

用每個小區的房屋出租數量衆數填充

In [105]:

#拿到每個小區房屋出租數量的衆數

ratio_by_neighbor=train[['小區名','小區房屋出租數量']].dropna().groupby('小區名').apply(lambda x:np.mean(x["小區房屋出租數量"].mode()))

ratio_by_neighbor

Out[105]:

小區名
0       0.007812
1       0.011719
2       0.007812
4       0.017578
5       0.009766
          ...   
6623    0.011719
6624    0.013672
6625    0.011719
6626    0.076172
6627    0.093750
Length: 5535, dtype: float64

In [99]:

#拿到所有小區的“小區房屋出租數量”衆數

ratio_mode=train["小區房屋出租數量"].mode().values[0]

ratio_mode

Out[99]:

0.01953125

In [106]:

def fill_by_key(x,k,v,values,mode):

if not pd.isna(x[v]):

return x

else:

if x[k] in values.index:

x[v]=values[x[k]]

else:

x[v]=mode

return x

# train['小區房屋出租數量']=train['小區房屋出租數量'].map()

train=train.apply(fill_by_key,k="小區名",v="小區房屋出租數量",values=ratio_by_neighbor,mode=ratio_mode,axis=1)

In [107]:

ratio_of_null()

Out[107]:

	缺失百分比
裝修情況	90.591732
居住狀態	89.754107
出租方式	87.673275

裝修，居住狀態，出租方式--作爲單獨一類

In [108]:

train["出租方式"]=train["出租方式"].fillna(int(-1))

train["裝修情況"]=train["裝修情況"].fillna(int(-1))

train["居住狀態"]=train["居住狀態"].fillna(int(0))

In [109]:

ratio_of_null()

Out[109]:

清除異常樣本

針對房屋面集存在的異常值，去掉房屋面積異常的樣本

In [110]:

train=train[train['房屋面積']<space_threshold]

train.shape

Out[110]:

(196499, 19)

糾偏

針對目標值月租金普遍分佈過散，進行對數平滑

In [111]:

train["log_rent"] = np.log1p(train["月租金"])#np.log1p  log(1+x)   inf

In [112]:

#糾正之前

plt.figure(figsize=(10,5))

sns.boxplot(x="月租金",data=train,orient='h')

plt.show()

In [113]:

#糾正之後

plt.figure(figsize=(10,5))

sns.boxplot(x="log_rent",data=train,orient='h')

plt.show()

In [114]:

train.head()

Out[114]:

	時間	小區名	小區房屋出租數量	樓層	總樓層	房屋面積	房屋朝向	臥室數量	廳的數量	衛的數量	出租方式	區	位置	地鐵線路	地鐵站點	距離	裝修情況	月租金	log_rent
0	1	3072	0.128906	2	0.236364	0.008628	東南	1	1	1	-1.0	11.0	118.0	2.0	40.0	0.764167	-1.0	5.602716	1.887481
1	1	3152	0.132812	1	0.381818	0.017046	東	1	0	0	-1.0	10.0	100.0	4.0	58.0	0.709167	-1.0	16.977929	2.889145
2	1	5575	0.042969	0	0.290909	0.010593	東南	2	1	2	-1.0	12.0	130.0	5.0	37.0	0.572500	-1.0	8.998302	2.302415
3	1	3103	0.085938	2	0.581818	0.019199	南	3	2	2	-1.0	7.0	90.0	2.0	63.0	0.658333	-1.0	5.602716	1.887481
4	1	5182	0.214844	0	0.545455	0.010427	東北	2	1	1	-1.0	3.0	31.0	0.0	0.0	2.000000	-1.0	7.300509	2.116317

問題數據處理

房間朝向列有多個值,這裏我們只要第一個

In [115]:

def split(text,i):

items=text.split(" ")

if i<len(items):

return items[i]

else:

return np.nan

train['新朝向']=train['房屋朝向'].map(lambda x:split(x,0))

In [116]:

train.head()

train['新朝向'].value_counts()

Out[116]:

南     59605
東南    55854
東     34282
西南    17750
北     10490
西      9972
西北     5259
東北     3287
Name: 新朝向, dtype: int64

存儲數據

In [117]:

train.to_csv("data/train_etl.csv",index=None)

In [30]:

import pickle

time_for_fill=train_data['時間'].mode().values[0]

neighbor_for_fill=train_data['小區名'].mode().values[0]

ting_for_fill=train_data['廳的數量'].mode().values[0]

wei_for_fill=train_data['衛的數量'].mode().values[0]

bed_for_fill=train_data['臥室數量'].mode().values[0]

direction_for_fill=train_data['新朝向'].mode().values[0]

mianji_for_fill=train_data['房屋面積'].mean()

louceng_for_fill=train_data['樓層'].mode().values[0]

zonglouceng_for_fill=train_data['總樓層'].mode().values[0]

values={

'距離':dist_value_for_fill,

'地鐵線路':line_value_for_fill,

'地鐵站點':station_value_for_fill,

'區':area_value_for_fill,

'位置':position_value_for_fill,

'居住狀態':state_value_for_fill,

'裝修情況':decration_value_for_fill,

'出租方式':rent_value_for_fill,

'ratio_by_neighbor':ratio_by_neighbor,

'小區房屋出租數量':ratio_mode,

'時間':time_for_fill,

'小區名':neighbor_for_fill,

'廳的數量':ting_for_fill,

'衛的數量':wei_for_fill,

'臥室數量':bed_for_fill,

'新朝向':direction_for_fill,

'房屋面積':mianji_for_fill,

'樓層':louceng_for_fill,

'總樓層':zonglouceng_for_fill,

'所有朝向':list(np.unique(train_data['新朝向']))

}

with open("data/values.pkl",'wb') as f:

pickle.dump(values,f)

In [31]:

values

Out[31]:

{'距離': 2,
 '地鐵線路': 0,
 '地鐵站點': 0,
 '區': 12.0,
 '位置': 52.0,
 '居住狀態': 0,
 '裝修情況': -1,
 '出租方式': -1,
 'ratio_by_neighbor': {0: 0.0078125,
  1: 0.01171875,
  2: 0.0078125,
  4: 0.01953125,
  5: 0.01171875,
  8: 0.0078125,
  9: 0.0234375,
。。。。。。
  1181: 0.0078125,
  1183: 0.01171875,
  ...},
 '小區房屋出租數量': 0.01953125,
 '時間': 3,
 '小區名': 5512,
 '廳的數量': 1,
 '衛的數量': 1,
 '臥室數量': 2,
 '新朝向': '南',
 '房屋面積': 0.013104835639490475,
 '樓層': 0,
 '總樓層': 0.3090909090909091,
 '所有朝向': ['東', '東北', '東南', '北', '南', '西', '西北', '西南']}

特徵工程

In [1]:

import matplotlib.pyplot as plt

import pandas as pd

import numpy as np

import seaborn as sns

import warnings

warnings.filterwarnings('ignore')#忽略一些警告

In [2]:

train=pd.read_csv("data/train_etl.csv")

train.head()

Out[2]:

	時間	小區名	小區房屋出租數量	樓層	總樓層	房屋面積	房屋朝向	臥室數量	廳的數量	...	出租方式	區	位置	地鐵線路	地鐵站點	距離	裝修情況	月租金	log_rent	新朝向
0	1	3072	0.128906	2	0.236364	0.008628	東南	1	1	...	-1.0	11.0	118.0	2.0	40.0	0.764167	-1.0	5.602716	1.887481	東南
1	1	3152	0.132812	1	0.381818	0.017046	東	1	0	...	-1.0	10.0	100.0	4.0	58.0	0.709167	-1.0	16.977929	2.889145	東
2	1	5575	0.042969	0	0.290909	0.010593	東南	2	1	...	-1.0	12.0	130.0	5.0	37.0	0.572500	-1.0	8.998302	2.302415	東南
3	1	3103	0.085938	2	0.581818	0.019199	南	3	2	...	-1.0	7.0	90.0	2.0	63.0	0.658333	-1.0	5.602716	1.887481	南
4	1	5182	0.214844	0	0.545455	0.010427	東北	2	1	...	-1.0	3.0	31.0	0.0	0.0	2.000000	-1.0	7.300509	2.116317	東北

5 rows × 21 columns

In [3]:

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 196499 entries, 0 to 196498
Data columns (total 21 columns):
時間          196499 non-null int64
小區名         196499 non-null int64
小區房屋出租數量    196499 non-null float64
樓層          196499 non-null int64
總樓層         196499 non-null float64
房屋面積        196499 non-null float64
房屋朝向        196499 non-null object
居住狀態        196499 non-null float64
臥室數量        196499 non-null int64
廳的數量        196499 non-null int64
衛的數量        196499 non-null int64
出租方式        196499 non-null float64
區           196499 non-null float64
位置          196499 non-null float64
地鐵線路        196499 non-null float64
地鐵站點        196499 non-null float64
距離          196499 non-null float64
裝修情況        196499 non-null float64
月租金         196499 non-null float64
log_rent    196499 non-null float64
新朝向         196499 non-null object
dtypes: float64(13), int64(6), object(2)
memory usage: 31.5+ MB

根據房間,廳,衛,房屋面積構造新特徵

In [4]:

train["房+衛+廳"]=train["臥室數量"]+train["廳的數量"]+train["衛的數量"]

train["房/總"]=train["臥室數量"]/(train["房+衛+廳"]+1)

train["衛/總"]=train["衛的數量"]/(train["房+衛+廳"]+1)

train["廳/總"]=train["廳的數量"]/(train["房+衛+廳"]+1)

train['臥室面積']=train['房屋面積']/(train['臥室數量']+1)#加1是爲了防止分母=0出現結果爲inf無窮大的現象

train['樓層比']=train['樓層']/(train["總樓層"]+1)#加1是爲了防止分母=0出現結果爲inf無窮大的現象

train['戶型']=train[['臥室數量','廳的數量','衛的數量']].apply(lambda x:str(x['臥室數量'])+str(x['廳的數量'])+str(x['衛的數量']),axis=1)

構建租金平均值特徵

In [5]:

rent_means=train[['小區名','新朝向','地鐵站點','位置','log_rent']].groupby(['小區名','新朝向','地鐵站點','位置'],as_index=False).mean()

rent_means.columns=['小區名','新朝向','地鐵站點','位置','平均值特徵1']

train=pd.merge(train,rent_means,how='left',on=['小區名','新朝向','地鐵站點','位置'])

In [6]:

rent_means2=train[['小區名','log_rent']].groupby(['小區名'],as_index=False).mean()

rent_means2.columns=['小區名','小區平均值特徵']

train=pd.merge(train,rent_means2,how='left',on=['小區名'])

In [7]:

rent_means3=train[['新朝向','log_rent']].groupby(['新朝向'],as_index=False).mean()

rent_means3.columns=['新朝向','朝向平均值特徵']

train=pd.merge(train,rent_means3,how='left',on=['新朝向'])

In [8]:

rent_means4=train[['地鐵站點','log_rent']].groupby(['地鐵站點'],as_index=False).mean()

rent_means4.columns=['地鐵站點','站點平均值特徵']

train=pd.merge(train,rent_means4,how='left',on=['地鐵站點'])

In [9]:

rent_means5=train[['位置','log_rent']].groupby(['位置'],as_index=False).mean()

rent_means5.columns=['位置','位置平均值特徵']

train=pd.merge(train,rent_means5,how='left',on=['位置'])

構造是否有地鐵

In [10]:

train["有地鐵"]=(train["地鐵站點"]>-1).map(int)

In [11]:

train.columns

Out[11]:

Index(['時間', '小區名', '小區房屋出租數量', '樓層', '總樓層', '房屋面積', '房屋朝向', '居住狀態', '臥室數量',
       '廳的數量', '衛的數量', '出租方式', '區', '位置', '地鐵線路', '地鐵站點', '距離', '裝修情況', '月租金',
       'log_rent', '新朝向', '房+衛+廳', '房/總', '衛/總', '廳/總', '臥室面積', '樓層比', '戶型',
       '平均值特徵1', '小區平均值特徵', '朝向平均值特徵', '站點平均值特徵', '位置平均值特徵', '有地鐵'],
      dtype='object')

構造聚類特徵

In [12]:

from sklearn.preprocessing import StandardScaler

from sklearn.cluster import KMeans

features=train[['房屋面積','臥室數量','廳的數量','衛的數量','距離','樓層比']]

trans=StandardScaler()

new_features=trans.fit_transform(features)

kmeans=KMeans(n_clusters=5)

kmeans.fit(new_features)

train['聚類特徵']=kmeans.predict(new_features).astype(str)

#計算每個聚類特徵的月租金平均值

cluster_means=train[['聚類特徵','log_rent']].groupby('聚類特徵',as_index=False).mean()

cluster_means.columns=['聚類特徵','平均值特徵2']

train=pd.merge(train,cluster_means,how='left',on=['聚類特徵'])

保存標準化和聚類模型

In [13]:

import pickle

with open("data/kmeans.pkl",'wb') as f:

pickle.dump({

"std_transer":trans,

"kmeans":kmeans

},f)

構造地鐵線路數特徵

In [14]:

lines_count1=train[['小區名','地鐵線路']].drop_duplicates().groupby('小區名').count()

lines_count2=train[['位置','地鐵線路']].drop_duplicates().groupby('位置').count()

lines_count2.columns=['位置線路數']

lines_count1.columns=['小區線路數']

In [15]:

train=pd.merge(train,lines_count1,how='left',on=['小區名'])

train=pd.merge(train,lines_count2,how='left',on=['位置'])

去掉出現數量較少的小區

In [16]:

neighbors=train['小區名'].value_counts()

train['新小區名']=train.apply(lambda x: x['小區名'] if neighbors[x['小區名']]>100 else -1,axis=1)

train['小區條數大於100']=train.apply(lambda x: 1 if neighbors[x['小區名']]>100 else 0,axis=1)

In [17]:

train.info()

train['新小區名'].value_counts()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 196499 entries, 0 to 196498
Data columns (total 40 columns):
時間           196499 non-null int64
小區名          196499 non-null int64
小區房屋出租數量     196499 non-null float64
樓層           196499 non-null int64
總樓層          196499 non-null float64
房屋面積         196499 non-null float64
房屋朝向         196499 non-null object
居住狀態         196499 non-null float64
臥室數量         196499 non-null int64
廳的數量         196499 non-null int64
衛的數量         196499 non-null int64
出租方式         196499 non-null float64
區            196499 non-null float64
位置           196499 non-null float64
地鐵線路         196499 non-null float64
地鐵站點         196499 non-null float64
距離           196499 non-null float64
裝修情況         196499 non-null float64
月租金          196499 non-null float64
log_rent     196499 non-null float64
新朝向          196499 non-null object
房+衛+廳        196499 non-null int64
房/總          196499 non-null float64
衛/總          196499 non-null float64
廳/總          196499 non-null float64
臥室面積         196499 non-null float64
樓層比          196499 non-null float64
戶型           196499 non-null object
平均值特徵1       196499 non-null float64
小區平均值特徵      196499 non-null float64
朝向平均值特徵      196499 non-null float64
站點平均值特徵      196499 non-null float64
位置平均值特徵      196499 non-null float64
有地鐵          196499 non-null int64
聚類特徵         196499 non-null object
平均值特徵2       196499 non-null float64
小區線路數        196499 non-null int64
位置線路數        196499 non-null int64
新小區名         196499 non-null int64
小區條數大於100    196499 non-null int64
dtypes: float64(24), int64(12), object(4)
memory usage: 61.5+ MB

Out[17]:

-1       72488
 5512     1880
 1085     1155
 5208     1136
 6221     1066
         ...  
 244       102
 6461      102
 5906      102
 1654      102
 5196      101
Name: 新小區名, Length: 512, dtype: int64

轉換類型

In [41]:

#將離散特徵轉換成字符串類型

colunms = ['時間', '小區名', '居住狀態', '出租方式', '區','位置','地鐵線路','地鐵站點','裝修情況']

for col in colunms:

train[col] = train[col].astype(str)

In [14]:

train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 196499 entries, 0 to 196498
Data columns (total 34 columns):
時間          196499 non-null object
小區名         196499 non-null object
小區房屋出租數量    196499 non-null float64
樓層          196499 non-null int64
總樓層         196499 non-null float64
房屋面積        196499 non-null float64
房屋朝向        196499 non-null object
居住狀態        196499 non-null object
臥室數量        196499 non-null int64
廳的數量        196499 non-null int64
衛的數量        196499 non-null int64
出租方式        196499 non-null object
區           196499 non-null object
位置          196499 non-null object
地鐵線路        196499 non-null object
地鐵站點        196499 non-null object
距離          196499 non-null float64
裝修情況        196499 non-null object
月租金         196499 non-null float64
log_rent    196499 non-null float64
新朝向         196499 non-null object
房+衛+廳       196499 non-null int64
房/總         196499 non-null float64
衛/總         196499 non-null float64
廳/總         196499 non-null float64
臥室面積        196499 non-null float64
樓層比         196499 non-null float64
戶型          196499 non-null object
平均值特徵1      196499 non-null float64
有地鐵         196499 non-null int64
聚類特徵        196499 non-null object
平均值特徵2      196499 non-null float64
小區線路數       196499 non-null int64
位置線路數       196499 non-null int64
dtypes: float64(13), int64(8), object(13)
memory usage: 52.5+ MB

In [18]:

#保存處理後的數據

train.to_csv("data/onehot_feature.csv")

In [11]:

x_columns=['時間', '小區名', '小區房屋出租數量', '樓層', '總樓層', '房屋面積','居住狀態', '臥室數量',

'廳的數量', '衛的數量', '出租方式', '區', '位置', '地鐵線路', '地鐵站點', '距離', '裝修情況',

'新朝向', '房+衛+廳', '房/總', '衛/總', '廳/總', '臥室面積', '樓層比', '戶型',

'有地鐵']

x=train[x_columns]

y=train['月租金']

構建數據清洗和特徵工程函數

In [61]:

# 'dist_value_for_fill':dist_value_for_fill,

# 'line_value_for_fill':line_value_for_fill,

# 'station_value_for_fill':station_value_for_fill,

# 'area_value_for_fill':area_value_for_fill,

# 'position_value_for_fill':position_value_for_fill,

# 'state_value_for_fill':state_value_for_fill,

# 'decration_value_for_fill':decration_value_for_fill,

# 'rent_value_for_fill':rent_value_for_fill,

# 'ratio_by_neighbor':ratio_by_neighbor,

# 'ratio_mode':ratio_mode,

# 'time_for_fill':time_for_fill,

# 'neighbor_for_fill':neighbor_for_fill,

# 'ting_for_fill':ting_for_fill,

# 'wei_for_fill':wei_for_fill,

# 'bed_for_fill':bed_for_fill,

# 'direction_for_fill':direction_for_fill,

# 'mianji_for_fill':mianji_for_fill,

# 'louceng_for_fill':louceng_for_fill,

# 'zonglouceng_for_fill':zonglouceng_for_fill

# def process(x,values,models):

# x=x.to_dict()

# keys=['時間', '小區名', '樓層', '總樓層', '房屋面積', '居住狀態', '臥室數量',

# '廳的數量', '衛的數量', '出租方式', '區', '位置', '地鐵線路', '地鐵站點', '距離', '裝修情況']

# #原特徵缺失值填充

# for key in keys:

# if pd.isna(x[key]):

# x[key]=values[key]

# #小區房屋出租數量

# if pd.isna(str(x['小區房屋出租數量'])):

# if x['小區名'] in values['ratio_by_neighbor']:

# x['小區房屋出租數量']=values['ratio_by_neighbor'][x['小區名']]

# else:

# x['小區房屋出租數量']=values['小區房屋出租數量']

# #房屋朝向問題

# # print(x['房屋朝向'])

# if pd.isna(x['房屋朝向']):

# x['新朝向']=values['新朝向']

# else:

# chaoxiang=x['房屋朝向'].split(" ")[0]

# if chaoxiang in values['所有朝向']:

# x['新朝向']=chaoxiang

# else:

# x['新朝向']=values['新朝向']

# #構造特徵

# x["房+衛+廳"]=x["臥室數量"]+x["廳的數量"]+x["衛的數量"]

# x["房/總"]=x["臥室數量"]/x["房+衛+廳"]

# x["衛/總"]=x["衛的數量"]/x["房+衛+廳"]

# x["廳/總"]=x["廳的數量"]/x["房+衛+廳"]

# x['臥室面積']=x['房屋面積']/x['臥室數量']

# x['樓層比']=x['樓層']/(x["總樓層"]+1)#加1是爲了防止分母=0出現結果爲inf無窮大的現象

# x['戶型']=str(x['臥室數量'])+str(x['廳的數量'])+str(x['衛的數量'])

# if x["地鐵站點"]>-1:

# x["有地鐵"]=1

# else:

# x["有地鐵"]=0

# #構造聚類特徵

# features=np.array([[x['房屋面積'],x['臥室數量'],x['廳的數量'],x['衛的數量'],x['距離'],x['樓層比']]])

# models['std_transer'].transform(features)

# x['聚類特徵']=models['kmeans'].predict(new_features).astype(str)[0]

# return pd.Series(x)

In [66]:

test=pd.read_csv('data/test.csv')

test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56279 entries, 0 to 56278
Data columns (total 19 columns):
id          56279 non-null int64
時間          56279 non-null int64
小區名         56279 non-null int64
小區房屋出租數量    56257 non-null float64
樓層          56279 non-null int64
總樓層         56279 non-null float64
房屋面積        56279 non-null float64
房屋朝向        56279 non-null object
居住狀態        4483 non-null float64
臥室數量        56279 non-null int64
廳的數量        56279 non-null int64
衛的數量        56279 non-null int64
出租方式        4971 non-null float64
區           56269 non-null float64
位置          56269 non-null float64
地鐵線路        26494 non-null float64
地鐵站點        26494 non-null float64
距離          26494 non-null float64
裝修情況        4207 non-null float64
dtypes: float64(11), int64(7), object(1)
memory usage: 8.2+ MB

In [46]:

with open("data/kmeans.pkl",'rb') as f:

models=pickle.load(f)

In [47]:

with open("data/values.pkl",'rb') as f:

values=pickle.load(f)

In [65]:

test.iloc[:100,:].apply(process,models=models,values=values,axis=1)

Out[65]:

	id	時間	小區名	小區房屋出租數量	樓層	總樓層	房屋面積	房屋朝向	居住狀態	臥室數量	...	新朝向	房+衛+廳	房/總	衛/總	廳/總	臥室面積	樓層比	戶型	有地鐵	聚類特徵
0	1	4	6011	0.382812	1	0.600000	0.007117	東	3.0	2	...	東	4	0.500000	0.250000	0.250000	0.003558	0.625000	211	1	2
1	2	4	1697	0.152344	1	0.472727	0.007448	東	0.0	2	...	東	4	0.500000	0.250000	0.250000	0.003724	0.679012	211	1	2
2	3	4	754	0.207031	2	0.709091	0.014068	東南	0.0	3	...	東南	7	0.428571	0.285714	0.285714	0.004689	1.170213	322	1	2
3	4	4	1285	0.011719	0	0.090909	0.008937	南	0.0	2	...	南	4	0.500000	0.250000	0.250000	0.004469	0.000000	211	1	2
4	5	4	4984	0.035156	1	0.218182	0.008606	東南	0.0	2	...	東南	4	0.500000	0.250000	0.250000	0.004303	0.820896	211	1	2
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
95	96	4	5239	0.433594	0	0.545455	0.006202	北	0.0	1	...	北	2	0.500000	0.500000	0.000000	0.006202	0.000000	101	1	2
96	97	4	1027	0.046875	0	0.727273	0.007779	南	0.0	2	...	南	4	0.500000	0.250000	0.250000	0.003889	0.000000	211	1	2
97	98	4	300	0.140625	2	0.454545	0.013744	西	0.0	2	...	西	5	0.400000	0.200000	0.400000	0.006872	1.375000	221	1	2
98	99	4	1021	0.269531	0	0.581818	0.013860	東南	0.0	3	...	東南	7	0.428571	0.285714	0.285714	0.004620	0.000000	322	1	2
99	100	4	2644	0.300781	0	0.309091	0.014896	東南	0.0	3	...	東南	6	0.500000	0.166667	0.333333	0.004965	0.000000	321	1	2

100 rows × 29 columns

建模

In [21]:

import matplotlib.pyplot as plt

import pandas as pd

import numpy as np

import seaborn as sns

import warnings

import xgboost as xgb

import copy

from sklearn.model_selection import train_test_split,GridSearchCV

from sklearn.feature_extraction import DictVectorizer

from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor

from sklearn.metrics import mean_squared_error

warnings.filterwarnings('ignore')#忽略一些警告

讀取數據

In [22]:

data=pd.read_csv("data/onehot_feature.csv")

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 196499 entries, 0 to 196498
Data columns (total 41 columns):
Unnamed: 0    196499 non-null int64
時間            196499 non-null int64
小區名           196499 non-null int64
小區房屋出租數量      196499 non-null float64
樓層            196499 non-null int64
總樓層           196499 non-null float64
房屋面積          196499 non-null float64
房屋朝向          196499 non-null object
居住狀態          196499 non-null float64
臥室數量          196499 non-null int64
廳的數量          196499 non-null int64
衛的數量          196499 non-null int64
出租方式          196499 non-null float64
區             196499 non-null float64
位置            196499 non-null float64
地鐵線路          196499 non-null float64
地鐵站點          196499 non-null float64
距離            196499 non-null float64
裝修情況          196499 non-null float64
月租金           196499 non-null float64
log_rent      196499 non-null float64
新朝向           196499 non-null object
房+衛+廳         196499 non-null int64
房/總           196499 non-null float64
衛/總           196499 non-null float64
廳/總           196499 non-null float64
臥室面積          196499 non-null float64
樓層比           196499 non-null float64
戶型            196499 non-null int64
平均值特徵1        196499 non-null float64
小區平均值特徵       196499 non-null float64
朝向平均值特徵       196499 non-null float64
站點平均值特徵       196499 non-null float64
位置平均值特徵       196499 non-null float64
有地鐵           196499 non-null int64
聚類特徵          196499 non-null int64
平均值特徵2        196499 non-null float64
小區線路數         196499 non-null int64
位置線路數         196499 non-null int64
新小區名          196499 non-null int64
小區條數大於100     196499 non-null int64
dtypes: float64(24), int64(15), object(2)
memory usage: 61.5+ MB

In [23]:

#將離散特徵轉換成字符串類型

colunms = ['時間', '新小區名', '居住狀態', '出租方式', '區','位置','地鐵線路','地鐵站點','裝修情況','戶型','聚類特徵']

for col in colunms:

data[col] = data[col].astype(str)

獲取x和y

In [24]:

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 196499 entries, 0 to 196498
Data columns (total 41 columns):
Unnamed: 0    196499 non-null int64
時間            196499 non-null object
小區名           196499 non-null int64
小區房屋出租數量      196499 non-null float64
樓層            196499 non-null int64
總樓層           196499 non-null float64
房屋面積          196499 non-null float64
房屋朝向          196499 non-null object
居住狀態          196499 non-null object
臥室數量          196499 non-null int64
廳的數量          196499 non-null int64
衛的數量          196499 non-null int64
出租方式          196499 non-null object
區             196499 non-null object
位置            196499 non-null object
地鐵線路          196499 non-null object
地鐵站點          196499 non-null object
距離            196499 non-null float64
裝修情況          196499 non-null object
月租金           196499 non-null float64
log_rent      196499 non-null float64
新朝向           196499 non-null object
房+衛+廳         196499 non-null int64
房/總           196499 non-null float64
衛/總           196499 non-null float64
廳/總           196499 non-null float64
臥室面積          196499 non-null float64
樓層比           196499 non-null float64
戶型            196499 non-null object
平均值特徵1        196499 non-null float64
小區平均值特徵       196499 non-null float64
朝向平均值特徵       196499 non-null float64
站點平均值特徵       196499 non-null float64
位置平均值特徵       196499 non-null float64
有地鐵           196499 non-null int64
聚類特徵          196499 non-null object
平均值特徵2        196499 non-null float64
小區線路數         196499 non-null int64
位置線路數         196499 non-null int64
新小區名          196499 non-null object
小區條數大於100     196499 non-null int64
dtypes: float64(17), int64(11), object(13)
memory usage: 61.5+ MB

In [25]:

x_columns=['時間', '新小區名', '小區房屋出租數量', '樓層', '總樓層', '房屋面積','居住狀態', '臥室數量',

'廳的數量', '衛的數量', '出租方式', '區', '位置', '地鐵線路', '地鐵站點', '距離', '裝修情況',

'新朝向', '房+衛+廳', '房/總', '衛/總', '廳/總', '臥室面積', '樓層比', '戶型','平均值特徵1',

'平均值特徵2','有地鐵','小區線路數','位置線路數','小區條數大於100','小區平均值特徵','朝向平均值特徵',

'站點平均值特徵','位置平均值特徵']

y_label='log_rent'

x=data[x_columns]

y=data[y_label]

y.isnull().sum()

Out[25]:

構建訓練函數

In [26]:

def feature_transformer(x,y,test_size=0.3,random_state=12):

"""

負責分割數據集並轉換onehot特徵，返回轉換後的稀疏矩陣和特徵名

"""

#1.重新命名列名

#原列名

cols=['時間', '新小區名', '小區房屋出租數量', '樓層', '總樓層', '房屋面積','居住狀態', '臥室數量',

'廳的數量', '衛的數量', '出租方式', '區', '位置', '地鐵線路', '地鐵站點', '距離', '裝修情況',

'新朝向', '房+衛+廳', '房/總', '衛/總', '廳/總', '臥室面積', '樓層比', '戶型','平均值特徵1',

'平均值特徵2','有地鐵','小區線路數','位置線路數','小區條數大於100','小區平均值特徵','朝向平均值特徵',

'站點平均值特徵','位置平均值特徵']

#新列名

new_cols=[chr(65+s)+str(i) for s in range(len(cols)//10+1) for i in range(10)]

new_cols=new_cols[:len(cols)]

#特徵名映射字典

cols_map={k:v for k,v in zip(cols,new_cols)}

#重新命名列

x.columns=[cols_map[k] for k in x.columns]

#2.分割數據集

train_x,test_x,train_y,test_y=train_test_split(x,y,test_size=test_size,random_state=random_state)

#3.轉換onehot特徵

#返回稀疏矩陣，有兩個優點：

#1.佔用內存大幅減小，讓並行成爲可能，不然並行的話，內存爆掉

#2.可以加速xgboost訓練

vector=DictVectorizer(sparse=True)

x_train=vector.fit_transform(train_x.to_dict(orient='records'))

x_test=vector.transform(test_x.to_dict(orient='records'))

features=vector.get_feature_names()

#4.構建原始特徵對應的新下標字典

#原始特徵名

#離散列特徵下標字典

feature_map={k:[] for k in cols}

for i in range(len(features)):

for col in cols:

if features[i].startswith(cols_map[col]):#如果新特徵以老特徵開頭

feature_map[col].append(i)

break

return (x_train,x_test,train_y,test_y),feature_map,cols_map

def train(cols,data,feature_map,num_round = 500):

"""

負責完成一次xgboost訓練，返回測試集rmse

"""

#1.獲取數據集

train_x,test_x,train_y,test_y=data

#獲取原始特徵對應的新特徵下標

index=[]

for col in cols:

index.extend(feature_map[col])

x_train=train_x[:,index]

x_test=test_x[:,index]

#3.構建數據格式

#構建DMatrix數據，可以有效利用硬盤緩存，減少內存佔用

dtrain = xgb.DMatrix(x_train,train_y)

dtest = xgb.DMatrix(x_test,test_y)

#4.設置訓練參數

param = {'max_depth':5,

'eta':0.01,

'verbosity':1,

'objective':'reg:linear',

'silent': 1,

'gamma': 0.01,

'min_child_weight': 1,

}

#5.模型訓練

bst = xgb.train(param, dtrain, num_round)

#6.模型預測

preds = bst.predict(dtest)

preds=np.exp(preds)-1#轉換成真實的租金

y_true=np.exp(test_y)-1

#7.模型評估

return np.sqrt(mean_squared_error(y_true,preds))

In [27]:

#完成特徵分割和轉換

d,f_map,c_map=feature_transformer(x,y)

模型特徵篩選

In [26]:

# #構造列名類

# class ColData:

# cols=np.array(['時間', '小區名', '小區房屋出租數量', '樓層', '總樓層', '房屋面積','居住狀態', '臥室數量',

# '廳的數量', '衛的數量', '出租方式', '區', '位置', '地鐵線路', '地鐵站點', '距離', '裝修情況',

# '新朝向', '房+衛+廳', '房/總', '衛/總', '廳/總', '臥室面積', '樓層比', '戶型','平均值特徵1',

# '有地鐵','聚類特徵','小區線路數','位置線路數'])

# def __init__(self,ids):

# self.ids=ids

# def include_names(self):

# print(type(ColData.cols))

# return list(ColData.cols[self.ids])

# def exclude_names(self):

# non_ids=list(set(range(ColData.cols.shape[0]))-set(self.ids))

# return list(ColData.cols[non_ids])

In [28]:

#構造篩選特徵函數

def select_features(cols,min_score):

include_features=cols

exclude_features=[]

cols=np.array(cols)

for i in range(cols.shape[0]):

#選中的特徵

features=list(cols[list(set(range(cols.shape[0]))-set([i]))])

print("開始第{}次訓練:".format(i))

print("未選中特徵：",cols[i])

rmse=train(features,d,f_map,2500)

print("開始第{}次訓練測試集成績{}：".format(i,rmse))

print("-"*10)

if rmse<=min_score:

exclude_features=cols[i]

include_features=features

min_score=rmse

return min_score,include_features,exclude_features

In [29]:

min_score=train(x_columns,d,f_map)

In [30]:

features=x_columns

min_score

Out[30]:

2.627438015769018

In [31]:

i=1

while len(features)>0:

print("開始第{}輪篩選:\n".format(i))

ms,include_f,exclude_f=select_features(features,min_score)

if len(exclude_f)<=0:

break

features=include_f

min_score=ms

print("第{}次篩選成績:{}\n".format(i,min_score))

i=i+1

print("保留特徵：",include_f)

print("排除特徵：",exclude_f)

print("\n*************************\n")

print("最終特徵：",features)

print("最好成績：",min_score)

開始第1輪篩選:

開始第0次訓練:
未選中特徵： 時間
開始第0次訓練測試集成績2.3425739100557874：
----------
開始第1次訓練:
未選中特徵： 新小區名
開始第1次訓練測試集成績2.3159098956260884：
----------
開始第2次訓練:
未選中特徵： 小區房屋出租數量
開始第2次訓練測試集成績2.3610726419651975：
----------
開始第3次訓練:
未選中特徵： 樓層
開始第3次訓練測試集成績2.3482224666430156：
----------
開始第4次訓練:
未選中特徵： 總樓層
開始第4次訓練測試集成績2.3548737824509476：
----------
開始第5次訓練:
未選中特徵： 房屋面積
開始第5次訓練測試集成績2.3528631046347734：
----------
開始第6次訓練:
未選中特徵： 居住狀態
開始第6次訓練測試集成績2.343644368107605：
----------
開始第7次訓練:
未選中特徵： 臥室數量
開始第7次訓練測試集成績2.3447264002503947：
----------
開始第8次訓練:
未選中特徵： 廳的數量
開始第8次訓練測試集成績2.3488349025371353：
----------
開始第9次訓練:
未選中特徵： 衛的數量
開始第9次訓練測試集成績2.3455068228561444：
----------
開始第10次訓練:
未選中特徵： 出租方式
開始第10次訓練測試集成績2.347907742954758：
----------
開始第11次訓練:
未選中特徵： 區
開始第11次訓練測試集成績2.3470549409629813：
----------
開始第12次訓練:
未選中特徵： 位置
開始第12次訓練測試集成績2.3442351978396916：
----------
開始第13次訓練:
未選中特徵： 地鐵線路
開始第13次訓練測試集成績2.3457640963554747：
----------
開始第14次訓練:
未選中特徵： 地鐵站點
開始第14次訓練測試集成績2.3514284469336877：
----------
開始第15次訓練:
未選中特徵： 距離
開始第15次訓練測試集成績2.3536807483560542：
----------
開始第16次訓練:
未選中特徵： 裝修情況
開始第16次訓練測試集成績2.3748498709037094：
----------
開始第17次訓練:
未選中特徵： 新朝向
開始第17次訓練測試集成績2.3442905601261614：
----------
開始第18次訓練:
未選中特徵： 房+衛+廳
開始第18次訓練測試集成績2.340930304773355：
----------
開始第19次訓練:
未選中特徵： 房/總
開始第19次訓練測試集成績2.3480559804451464：
----------
開始第20次訓練:
未選中特徵： 衛/總
開始第20次訓練測試集成績2.3465279974923803：
----------
開始第21次訓練:
未選中特徵： 廳/總
開始第21次訓練測試集成績2.348225921086065：
----------
開始第22次訓練:
未選中特徵： 臥室面積
開始第22次訓練測試集成績2.352140741124247：
----------
開始第23次訓練:
未選中特徵： 樓層比
開始第23次訓練測試集成績2.3495753059171336：
----------
開始第24次訓練:
未選中特徵： 戶型
開始第24次訓練測試集成績2.355693315193697：
----------
開始第25次訓練:
未選中特徵： 平均值特徵1
開始第25次訓練測試集成績2.544232443156709：
----------
開始第26次訓練:
未選中特徵： 平均值特徵2
開始第26次訓練測試集成績2.3445018995961795：
----------
開始第27次訓練:
未選中特徵： 有地鐵
開始第27次訓練測試集成績2.3481264617837376：
----------
開始第28次訓練:
未選中特徵： 小區線路數
開始第28次訓練測試集成績2.3481264617837376：
----------
開始第29次訓練:
未選中特徵： 位置線路數
開始第29次訓練測試集成績2.3504238672632107：
----------
開始第30次訓練:
未選中特徵： 小區條數大於100
開始第30次訓練測試集成績2.3481264617837376：
----------
開始第31次訓練:
未選中特徵： 小區平均值特徵
開始第31次訓練測試集成績2.349284024062191：
----------
開始第32次訓練:
未選中特徵： 朝向平均值特徵
開始第32次訓練測試集成績2.3424697168707467：
----------
開始第33次訓練:
未選中特徵： 站點平均值特徵
開始第33次訓練測試集成績2.357579803345613：
----------
開始第34次訓練:
未選中特徵： 位置平均值特徵
開始第34次訓練測試集成績2.370372655988748：
----------
第1次篩選成績:2.3159098956260884

保留特徵： ['時間', '小區房屋出租數量', '樓層', '總樓層', '房屋面積', '居住狀態', '臥室數量', '廳的數量', '衛的數量', '出租方式', '區', '位置', '地鐵線路', '地鐵站點', '距離', '裝修情況', '新朝向', '房+衛+廳', '房/總', '衛/總', '廳/總', '臥室面積', '樓層比', '戶型', '平均值特徵1', '平均值特徵2', '有地鐵', '小區線路數', '位置線路數', '小區條數大於100', '小區平均值特徵', '朝向平均值特徵', '站點平均值特徵', '位置平均值特徵']
排除特徵： 新小區名

*************************

開始第2輪篩選:

開始第0次訓練:
未選中特徵： 時間
開始第0次訓練測試集成績2.319998668753389：
----------
開始第1次訓練:
未選中特徵： 小區房屋出租數量
開始第1次訓練測試集成績2.322355105258432：
----------
開始第2次訓練:
未選中特徵： 樓層
開始第2次訓練測試集成績2.3158954168386106：
----------
開始第3次訓練:
未選中特徵： 總樓層
開始第3次訓練測試集成績2.329094243429236：
----------
開始第4次訓練:
未選中特徵： 房屋面積
開始第4次訓練測試集成績2.3258240532005727：
----------
開始第5次訓練:
未選中特徵： 居住狀態
開始第5次訓練測試集成績2.318009433185204：
----------
開始第6次訓練:
未選中特徵： 臥室數量
開始第6次訓練測試集成績2.317658870332904：
----------
開始第7次訓練:
未選中特徵： 廳的數量
開始第7次訓練測試集成績2.317469293657488：
----------
開始第8次訓練:
未選中特徵： 衛的數量

In [21]:

cols=['小區房屋出租數量','房屋面積', '居住狀態', '出租方式', '位置', '地鐵站點', '距離', '裝修情況', '新朝向', '房+衛+廳', '房/總', '衛/總', '臥室面積', '樓層比', '平均值特徵1', '平均值特徵2', '小區線路數', '位置線路數','小區條數大於100']

train(cols,d,f_map)

Out[21]:

1.6355982447085897

參數搜索

構建交叉驗證和參數搜索函數

In [32]:

from sklearn.model_selection import KFold

#構建交叉驗證函數

def train_cv(data,target,params,num_round=200,k_fold=5,silent=0):

"""

負責完成一種參數組合的情況下k_flod折交叉驗證的平均rmse值

"""

rmses=[]

#數據分割

kfold= KFold(n_splits=k_fold,random_state =None,shuffle=True)

for i,(train_index,val_index) in zip(range(k_fold),kfold.split(data,target)):

train_x,val_x,train_y,val_y=data[train_index,:],data[val_index,:],target[train_index],target[val_index]

#構建DMatrix數據

dtrain = xgb.DMatrix(train_x,train_y)

dtest = xgb.DMatrix(val_x,val_y)

if silent==0:

print("開始第{}/{}折驗證：".format(i,k_fold))

#模型訓練

bst = xgb.train(params, dtrain, num_round)

#6.模型預測

preds = bst.predict(dtest)

preds=np.exp(preds)-1#轉換成真實的租金

y_true=np.exp(val_y)-1

rmse=np.sqrt(mean_squared_error(y_true,preds))

if silent==0:

print("第{}/{}折驗證rmse：{}".format(i,k_fold,rmse))

rmses.append(rmse)

return sum(rmses)/k_fold

def search_params(x,y,params_grid,n_estimators=200,cv=3,silent=0):

min_rmse=9999

best_params=None

params_list=[[]]

#根據參數表格構建所有參數組合

for k,v in params_grid.items():

if isinstance(v,list):

temp=params_list

params_list=[i+[j] for j in v for i in temp]

else:

params_list=[i+[v] for i in params_list]

params_list=[{k:v for k,v in zip(params_grid.keys(),v_list)} for v_list in params_list]

for i,params in zip(range(len(params_list)),params_list):

if silent==0:

print("開始實驗第{}組參數：".format(i),params)

rmse=train_cv(data=x,target=y,params=params,num_round=n_estimators,k_fold=cv)

if silent==0:

print("第{}組參數平均rmse：{}".format(i,rmse))

print("-"*50)

if rmse<min_rmse:

min_rmse=rmse

best_params=params

return min_rmse,best_params

開始搜索

In [34]:

params_dict={

"objective":'reg:linear',

'eta':[0.01,0.1,0.5],

'gamma': [0.01,0.05,0.1],

'silent': 1,

'max_depth':[15,25,35],

'min_child_weight':[0.5,1,3],

}

cols=['小區房屋出租數量', '樓層', '總樓層', '房屋面積','居住狀態', '臥室數量',

'衛的數量', '位置', '地鐵站點', '距離', '裝修情況',

'新朝向', '房+衛+廳', '房/總', '衛/總', '廳/總', '臥室面積', '樓層比', '戶型','平均值特徵1',

'平均值特徵2','有地鐵','小區線路數','位置線路數','小區條數大於100','小區平均值特徵','朝向平均值特徵',

'站點平均值特徵','位置平均值特徵']

#1.獲取數據集

train_x,test_x,train_y,test_y=d

#獲取原始特徵對應的新特徵下標

index=[]

for col in cols:

index.extend(f_map[col])

x_train=train_x[:50000,index]#只用前50000條數據做運算

x_test=test_x[:,index]

#由於要用新下標訪問，所以要重置索引

train_y=train_y.reset_index(drop=True)[:50000]

test_y=test_y.reset_index(drop=True)

search_params(x=x_train,y=train_y,params_grid=params_dict,n_estimators=1000)

開始實驗第0組參數： {'objective': 'reg:linear', 'eta': 0.01, 'gamma': 0.01, 'silent': 1, 'max_depth': 15, 'min_child_weight': 0.5}
開始第0/3折驗證：
第0/3折驗證rmse：2.1063645180292268
開始第1/3折驗證：
第1/3折驗證rmse：2.270438488734099
開始第2/3折驗證：
第2/3折驗證rmse：2.140966941345831
第0組參數平均rmse：2.172589982703052
--------------------------------------------------
開始實驗第1組參數： {'objective': 'reg:linear', 'eta': 0.1, 'gamma': 0.01, 'silent': 1, 'max_depth': 15, 'min_child_weight': 0.5}
開始第0/3折驗證：
第0/3折驗證rmse：2.1029110307375034
開始第1/3折驗證：
第1/3折驗證rmse：2.151089425447841
開始第2/3折驗證：
第2/3折驗證rmse：2.2555141001168315
第1組參數平均rmse：2.169838185434059
--------------------------------------------------
開始實驗第2組參數： {'objective': 'reg:linear', 'eta': 0.5, 'gamma': 0.01, 'silent': 1, 'max_depth': 15, 'min_child_weight': 0.5}
開始第0/3折驗證：
第0/3折驗證rmse：2.4535784391141995
開始第1/3折驗證：
第1/3折驗證rmse：2.424286846886855
開始第2/3折驗證：
第2/3折驗證rmse：2.1266645639199573
第2組參數平均rmse：2.334843283307004
--------------------------------------------------
開始實驗第3組參數： {'objective': 'reg:linear', 'eta': 0.01, 'gamma': 0.05, 'silent': 1, 'max_depth': 15, 'min_child_weight': 0.5}
開始第0/3折驗證：
第0/3折驗證rmse：2.1509258765444517
開始第1/3折驗證：
第1/3折驗證rmse：2.2639212917057883
開始第2/3折驗證：
第2/3折驗證rmse：2.208002017298694
第3組參數平均rmse：2.207616395182978
--------------------------------------------------
開始實驗第4組參數： {'objective': 'reg:linear', 'eta': 0.1, 'gamma': 0.05, 'silent': 1, 'max_depth': 15, 'min_child_weight': 0.5}
開始第0/3折驗證：
第0/3折驗證rmse：2.224230101453484
開始第1/3折驗證：
第1/3折驗證rmse：2.194730061107077
開始第2/3折驗證：
第2/3折驗證rmse：2.104946600682978
第4組參數平均rmse：2.1746355877478463
--------------------------------------------------
開始實驗第5組參數： {'objective': 'reg:linear', 'eta': 0.5, 'gamma': 0.05, 'silent': 1, 'max_depth': 15, 'min_child_weight': 0.5}
開始第0/3折驗證：
第0/3折驗證rmse：2.4173987601952147
開始第1/3折驗證：
第1/3折驗證rmse：2.3193751890787064
開始第2/3折驗證：
第2/3折驗證rmse：2.350550916955072
第5組參數平均rmse：2.362441622076331
--------------------------------------------------
開始實驗第6組參數： {'objective': 'reg:linear', 'eta': 0.01, 'gamma': 0.1, 'silent': 1, 'max_depth': 15, 'min_child_weight': 0.5}
開始第0/3折驗證：
第0/3折驗證rmse：2.150912061182521
開始第1/3折驗證：
第1/3折驗證rmse：2.1842688983958243
開始第2/3折驗證：
第2/3折驗證rmse：2.4129215673670013
第6組參數平均rmse：2.249367508981782
--------------------------------------------------
開始實驗第7組參數： {'objective': 'reg:linear', 'eta': 0.1, 'gamma': 0.1, 'silent': 1, 'max_depth': 15, 'min_child_weight': 0.5}
開始第0/3折驗證：
第0/3折驗證rmse：2.2327447467966457
開始第1/3折驗證：
第1/3折驗證rmse：2.4065138418376923
開始第2/3折驗證：
第2/3折驗證rmse：2.286097658735343
第7組參數平均rmse：2.30845208245656
--------------------------------------------------
開始實驗第8組參數： {'objective': 'reg:linear', 'eta': 0.5, 'gamma': 0.1, 'silent': 1, 'max_depth': 15, 'min_child_weight': 0.5}
開始第0/3折驗證：
第0/3折驗證rmse：2.3313455873087507
開始第1/3折驗證：
第1/3折驗證rmse：2.4397538237904057
開始第2/3折驗證：
第2/3折驗證rmse：2.31328569652659
第8組參數平均rmse：2.361461702541915
--------------------------------------------------
開始實驗第9組參數： {'objective': 'reg:linear', 'eta': 0.01, 'gamma': 0.01, 'silent': 1, 'max_depth': 25, 'min_child_weight': 0.5}
開始第0/3折驗證：
第0/3折驗證rmse：2.351802702595958
開始第1/3折驗證：
第1/3折驗證rmse：2.230930544499273
開始第2/3折驗證：
第2/3折驗證rmse：2.251661755576135
第9組參數平均rmse：2.278131667557122
--------------------------------------------------
開始實驗第10組參數： {'objective': 'reg:linear', 'eta': 0.1, 'gamma': 0.01, 'silent': 1, 'max_depth': 25, 'min_child_weight': 0.5}
開始第0/3折驗證：
第0/3折驗證rmse：2.327051827692085
開始第1/3折驗證：
第1/3折驗證rmse：2.3296903644220945
開始第2/3折驗證：
第2/3折驗證rmse：2.384475189364813
第10組參數平均rmse：2.347072460492998
--------------------------------------------------
開始實驗第11組參數： {'objective': 'reg:linear', 'eta': 0.5, 'gamma': 0.01, 'silent': 1, 'max_depth': 25, 'min_child_weight': 0.5}
開始第0/3折驗證：
第0/3折驗證rmse：2.630278587228796
開始第1/3折驗證：
第1/3折驗證rmse：2.222563695431261
開始第2/3折驗證：
第2/3折驗證rmse：2.2461094587201518
第11組參數平均rmse：2.366317247126736
--------------------------------------------------
開始實驗第12組參數： {'objective': 'reg:linear', 'eta': 0.01, 'gamma': 0.05, 'silent': 1, 'max_depth': 25, 'min_child_weight': 0.5}
開始第0/3折驗證：
第0/3折驗證rmse：2.125414735583022
開始第1/3折驗證：
第1/3折驗證rmse：2.2014928222849846
開始第2/3折驗證：
第2/3折驗證rmse：2.097238703895008
第12組參數平均rmse：2.141382087254338
--------------------------------------------------
開始實驗第13組參數： {'objective': 'reg:linear', 'eta': 0.1, 'gamma': 0.05, 'silent': 1, 'max_depth': 25, 'min_child_weight': 0.5}
開始第0/3折驗證：
第0/3折驗證rmse：2.4197639273877596
開始第1/3折驗證：
第1/3折驗證rmse：2.146503225999798
開始第2/3折驗證：
第2/3折驗證rmse：2.1724123431688636
第13組參數平均rmse：2.2462264988521405
--------------------------------------------------
開始實驗第14組參數： {'objective': 'reg:linear', 'eta': 0.5, 'gamma': 0.05, 'silent': 1, 'max_depth': 25, 'min_child_weight': 0.5}
開始第0/3折驗證：
第0/3折驗證rmse：2.512321338137171
開始第1/3折驗證：
第1/3折驗證rmse：2.4320489841732864
開始第2/3折驗證：
第2/3折驗證rmse：2.283527320349841
第14組參數平均rmse：2.4092992142200997
--------------------------------------------------
開始實驗第15組參數： {'objective': 'reg:linear', 'eta': 0.01, 'gamma': 0.1, 'silent': 1, 'max_depth': 25, 'min_child_weight': 0.5}
開始第0/3折驗證：
第0/3折驗證rmse：2.1677308366792225
開始第1/3折驗證：
第1/3折驗證rmse：2.255620150072001
開始第2/3折驗證：
第2/3折驗證rmse：2.3948376802971194
第15組參數平均rmse：2.272729555682781
--------------------------------------------------
開始實驗第16組參數： {'objective': 'reg:linear', 'eta': 0.1, 'gamma': 0.1, 'silent': 1, 'max_depth': 25, 'min_child_weight': 0.5}
開始第0/3折驗證：
第0/3折驗證rmse：2.347093266421053
開始第1/3折驗證：
第1/3折驗證rmse：2.3134536451552785
開始第2/3折驗證：
第2/3折驗證rmse：2.2327661973592092
第16組參數平均rmse：2.297771036311847
--------------------------------------------------
開始實驗第17組參數： {'objective': 'reg:linear', 'eta': 0.5, 'gamma': 0.1, 'silent': 1, 'max_depth': 25, 'min_child_weight': 0.5}
開始第0/3折驗證：
第0/3折驗證rmse：2.3748013645667525
開始第1/3折驗證：
第1/3折驗證rmse：2.358196569845885
開始第2/3折驗證：
第2/3折驗證rmse：2.4795866533195476
第17組參數平均rmse：2.404194862577395
--------------------------------------------------
開始實驗第18組參數： {'objective': 'reg:linear', 'eta': 0.01, 'gamma': 0.01, 'silent': 1, 'max_depth': 35, 'min_child_weight': 0.5}
開始第0/3折驗證：
第0/3折驗證rmse：2.240512423440991
開始第1/3折驗證：
第1/3折驗證rmse：2.1091848489427965
開始第2/3折驗證：
第2/3折驗證rmse：2.288639830011804
第18組參數平均rmse：2.212779034131864
--------------------------------------------------
開始實驗第19組參數： {'objective': 'reg:linear', 'eta': 0.1, 'gamma': 0.01, 'silent': 1, 'max_depth': 35, 'min_child_weight': 0.5}
開始第0/3折驗證：
第0/3折驗證rmse：2.1775090908414034
開始第1/3折驗證：
第1/3折驗證rmse：2.202914246471916
開始第2/3折驗證：
第2/3折驗證rmse：2.3757522648670806
第19組參數平均rmse：2.2520585340601333
--------------------------------------------------
開始實驗第20組參數： {'objective': 'reg:linear', 'eta': 0.5, 'gamma': 0.01, 'silent': 1, 'max_depth': 35, 'min_child_weight': 0.5}
開始第0/3折驗證：
第0/3折驗證rmse：2.412426495391516
開始第1/3折驗證：
第1/3折驗證rmse：2.3243322412870735
開始第2/3折驗證：
第2/3折驗證rmse：2.3949040510708914
第20組參數平均rmse：2.377220929249827
--------------------------------------------------
開始實驗第21組參數： {'objective': 'reg:linear', 'eta': 0.01, 'gamma': 0.05, 'silent': 1, 'max_depth': 35, 'min_child_weight': 0.5}
開始第0/3折驗證：
第0/3折驗證rmse：2.269681476252998
開始第1/3折驗證：
第1/3折驗證rmse：2.146913481367659
開始第2/3折驗證：
第2/3折驗證rmse：2.2304382113812
第21組參數平均rmse：2.2156777230006193
--------------------------------------------------
開始實驗第22組參數： {'objective': 'reg:linear', 'eta': 0.1, 'gamma': 0.05, 'silent': 1, 'max_depth': 35, 'min_child_weight': 0.5}
開始第0/3折驗證：
第0/3折驗證rmse：2.438955492998712
開始第1/3折驗證：
第1/3折驗證rmse：2.1329319198071146
開始第2/3折驗證：
第2/3折驗證rmse：2.124099501339859
第22組參數平均rmse：2.2319956380485624
--------------------------------------------------
開始實驗第23組參數： {'objective': 'reg:linear', 'eta': 0.5, 'gamma': 0.05, 'silent': 1, 'max_depth': 35, 'min_child_weight': 0.5}
開始第0/3折驗證：
第0/3折驗證rmse：2.503644611663843
開始第1/3折驗證：
第1/3折驗證rmse：2.528994032876117
開始第2/3折驗證：
第2/3折驗證rmse：2.312941463640303
第23組參數平均rmse：2.4485267027267543
--------------------------------------------------
開始實驗第24組參數： {'objective': 'reg:linear', 'eta': 0.01, 'gamma': 0.1, 'silent': 1, 'max_depth': 35, 'min_child_weight': 0.5}
開始第0/3折驗證：
第0/3折驗證rmse：2.237095467623757
開始第1/3折驗證：
第1/3折驗證rmse：2.4029885243008007
開始第2/3折驗證：
第2/3折驗證rmse：2.2734978813307447
第24組參數平均rmse：2.3045272910851007
--------------------------------------------------
開始實驗第25組參數： {'objective': 'reg:linear', 'eta': 0.1, 'gamma': 0.1, 'silent': 1, 'max_depth': 35, 'min_child_weight': 0.5}

開始第0/3折驗證：
第0/3折驗證rmse：2.306825279691586
開始第1/3折驗證：
第1/3折驗證rmse：2.244977467458515
開始第2/3折驗證：
第2/3折驗證rmse：2.1922851419755007
第25組參數平均rmse：2.2480292963752
--------------------------------------------------
開始實驗第26組參數： {'objective': 'reg:linear', 'eta': 0.5, 'gamma': 0.1, 'silent': 1, 'max_depth': 35, 'min_child_weight': 0.5}
開始第0/3折驗證：
第0/3折驗證rmse：2.3549137991155114
開始第1/3折驗證：
第1/3折驗證rmse：2.592401863979216
開始第2/3折驗證：
第2/3折驗證rmse：2.29801882782804
第26組參數平均rmse：2.4151114969742555
--------------------------------------------------
開始實驗第27組參數： {'objective': 'reg:linear', 'eta': 0.01, 'gamma': 0.01, 'silent': 1, 'max_depth': 15, 'min_child_weight': 1}
開始第0/3折驗證：
第0/3折驗證rmse：2.171764054695917
開始第1/3折驗證：
第1/3折驗證rmse：2.2142377501924937
開始第2/3折驗證：
第2/3折驗證rmse：2.284722624617894
第27組參數平均rmse：2.223574809835435
--------------------------------------------------
開始實驗第28組參數： {'objective': 'reg:linear', 'eta': 0.1, 'gamma': 0.01, 'silent': 1, 'max_depth': 15, 'min_child_weight': 1}
開始第0/3折驗證：
第0/3折驗證rmse：2.1858975633007525
開始第1/3折驗證：
第1/3折驗證rmse：2.4068418601897177
開始第2/3折驗證：
第2/3折驗證rmse：2.0969563799576494
第28組參數平均rmse：2.229898601149373
--------------------------------------------------
開始實驗第29組參數： {'objective': 'reg:linear', 'eta': 0.5, 'gamma': 0.01, 'silent': 1, 'max_depth': 15, 'min_child_weight': 1}
開始第0/3折驗證：
第0/3折驗證rmse：2.388813347279593
開始第1/3折驗證：
第1/3折驗證rmse：2.284844842576387
開始第2/3折驗證：
第2/3折驗證rmse：2.4146204135428695
第29組參數平均rmse：2.362759534466283
--------------------------------------------------
開始實驗第30組參數： {'objective': 'reg:linear', 'eta': 0.01, 'gamma': 0.05, 'silent': 1, 'max_depth': 15, 'min_child_weight': 1}
開始第0/3折驗證：
第0/3折驗證rmse：2.2487871387994858
開始第1/3折驗證：
第1/3折驗證rmse：2.1714058400384646
開始第2/3折驗證：
第2/3折驗證rmse：2.184533044019575
第30組參數平均rmse：2.2015753409525085
--------------------------------------------------
開始實驗第31組參數： {'objective': 'reg:linear', 'eta': 0.1, 'gamma': 0.05, 'silent': 1, 'max_depth': 15, 'min_child_weight': 1}
開始第0/3折驗證：
第0/3折驗證rmse：2.2972859501336935
開始第1/3折驗證：
第1/3折驗證rmse：2.2362463997194886
開始第2/3折驗證：
第2/3折驗證rmse：2.2898365113716483
第31組參數平均rmse：2.2744562870749436
--------------------------------------------------
開始實驗第32組參數： {'objective': 'reg:linear', 'eta': 0.5, 'gamma': 0.05, 'silent': 1, 'max_depth': 15, 'min_child_weight': 1}
開始第0/3折驗證：
第0/3折驗證rmse：2.419329912099752
開始第1/3折驗證：
第1/3折驗證rmse：2.6154375686147384
開始第2/3折驗證：
第2/3折驗證rmse：2.230305815806549
第32組參數平均rmse：2.4216910988403466
--------------------------------------------------
開始實驗第33組參數： {'objective': 'reg:linear', 'eta': 0.01, 'gamma': 0.1, 'silent': 1, 'max_depth': 15, 'min_child_weight': 1}
開始第0/3折驗證：
第0/3折驗證rmse：2.200577515646503
開始第1/3折驗證：
第1/3折驗證rmse：2.2299424412995243
開始第2/3折驗證：
第2/3折驗證rmse：2.3140760122142896
第33組參數平均rmse：2.2481986563867724
--------------------------------------------------
開始實驗第34組參數： {'objective': 'reg:linear', 'eta': 0.1, 'gamma': 0.1, 'silent': 1, 'max_depth': 15, 'min_child_weight': 1}
開始第0/3折驗證：
第0/3折驗證rmse：2.57849663446406
開始第1/3折驗證：
第1/3折驗證rmse：2.116087043730732
開始第2/3折驗證：
第2/3折驗證rmse：2.274100944461905
第34組參數平均rmse：2.3228948742188993
--------------------------------------------------
開始實驗第35組參數： {'objective': 'reg:linear', 'eta': 0.5, 'gamma': 0.1, 'silent': 1, 'max_depth': 15, 'min_child_weight': 1}
開始第0/3折驗證：
第0/3折驗證rmse：2.3050644007916503
開始第1/3折驗證：
第1/3折驗證rmse：2.5075464973857162
開始第2/3折驗證：
第2/3折驗證rmse：2.3930588500403567
第35組參數平均rmse：2.4018899160725744
--------------------------------------------------
開始實驗第36組參數： {'objective': 'reg:linear', 'eta': 0.01, 'gamma': 0.01, 'silent': 1, 'max_depth': 25, 'min_child_weight': 1}
開始第0/3折驗證：
第0/3折驗證rmse：2.276957221983997
開始第1/3折驗證：
第1/3折驗證rmse：2.3703205041784905
開始第2/3折驗證：
第2/3折驗證rmse：2.1867592895461
第36組參數平均rmse：2.278012338569529
--------------------------------------------------
開始實驗第37組參數： {'objective': 'reg:linear', 'eta': 0.1, 'gamma': 0.01, 'silent': 1, 'max_depth': 25, 'min_child_weight': 1}
開始第0/3折驗證：
第0/3折驗證rmse：2.2507333499616866
開始第1/3折驗證：
第1/3折驗證rmse：2.2480253735311346
開始第2/3折驗證：
第2/3折驗證rmse：2.2176798693697006
第37組參數平均rmse：2.238812864287507
--------------------------------------------------
開始實驗第38組參數： {'objective': 'reg:linear', 'eta': 0.5, 'gamma': 0.01, 'silent': 1, 'max_depth': 25, 'min_child_weight': 1}
開始第0/3折驗證：
第0/3折驗證rmse：2.2951887068467824
開始第1/3折驗證：
第1/3折驗證rmse：2.3602517398138096
開始第2/3折驗證：
第2/3折驗證rmse：2.3994056671768162
第38組參數平均rmse：2.351615371279136
--------------------------------------------------
開始實驗第39組參數： {'objective': 'reg:linear', 'eta': 0.01, 'gamma': 0.05, 'silent': 1, 'max_depth': 25, 'min_child_weight': 1}
開始第0/3折驗證：
第0/3折驗證rmse：2.1837643392924027
開始第1/3折驗證：
第1/3折驗證rmse：2.2504664193983692
開始第2/3折驗證：
第2/3折驗證rmse：2.135889446314574
第39組參數平均rmse：2.190040068335115
--------------------------------------------------
開始實驗第40組參數： {'objective': 'reg:linear', 'eta': 0.1, 'gamma': 0.05, 'silent': 1, 'max_depth': 25, 'min_child_weight': 1}
開始第0/3折驗證：
第0/3折驗證rmse：2.0869573705948192
開始第1/3折驗證：
第1/3折驗證rmse：2.2127491322380353
開始第2/3折驗證：
第2/3折驗證rmse：2.3164522580246296
第40組參數平均rmse：2.2053862536191615
--------------------------------------------------
開始實驗第41組參數： {'objective': 'reg:linear', 'eta': 0.5, 'gamma': 0.05, 'silent': 1, 'max_depth': 25, 'min_child_weight': 1}
開始第0/3折驗證：
第0/3折驗證rmse：2.188863229863099
開始第1/3折驗證：
第1/3折驗證rmse：2.400519116701612
開始第2/3折驗證：
第2/3折驗證rmse：2.4379628348438347
第41組參數平均rmse：2.3424483938028486
--------------------------------------------------
開始實驗第42組參數： {'objective': 'reg:linear', 'eta': 0.01, 'gamma': 0.1, 'silent': 1, 'max_depth': 25, 'min_child_weight': 1}
開始第0/3折驗證：
第0/3折驗證rmse：2.454450166538962
開始第1/3折驗證：
第1/3折驗證rmse：2.0656417463376475
開始第2/3折驗證：
第2/3折驗證rmse：2.3655598707868464
第42組參數平均rmse：2.295217261221152
--------------------------------------------------
開始實驗第43組參數： {'objective': 'reg:linear', 'eta': 0.1, 'gamma': 0.1, 'silent': 1, 'max_depth': 25, 'min_child_weight': 1}
開始第0/3折驗證：
第0/3折驗證rmse：2.1824157942948537
開始第1/3折驗證：
第1/3折驗證rmse：2.3676364588012406
開始第2/3折驗證：
第2/3折驗證rmse：2.145039852244817
第43組參數平均rmse：2.2316973684469708
--------------------------------------------------
開始實驗第44組參數： {'objective': 'reg:linear', 'eta': 0.5, 'gamma': 0.1, 'silent': 1, 'max_depth': 25, 'min_child_weight': 1}
開始第0/3折驗證：
第0/3折驗證rmse：2.396458431999462
開始第1/3折驗證：
第1/3折驗證rmse：2.4112866277165343
開始第2/3折驗證：
第2/3折驗證rmse：2.383011977510674
第44組參數平均rmse：2.39691901240889
--------------------------------------------------
開始實驗第45組參數： {'objective': 'reg:linear', 'eta': 0.01, 'gamma': 0.01, 'silent': 1, 'max_depth': 35, 'min_child_weight': 1}
開始第0/3折驗證：
第0/3折驗證rmse：2.3398767792743698
開始第1/3折驗證：
第1/3折驗證rmse：2.150834454625684
開始第2/3折驗證：
第2/3折驗證rmse：2.3983698308079955
第45組參數平均rmse：2.296360354902683
--------------------------------------------------
開始實驗第46組參數： {'objective': 'reg:linear', 'eta': 0.1, 'gamma': 0.01, 'silent': 1, 'max_depth': 35, 'min_child_weight': 1}
開始第0/3折驗證：
第0/3折驗證rmse：2.1467337553620087
開始第1/3折驗證：
第1/3折驗證rmse：2.067141831677618
開始第2/3折驗證：
第2/3折驗證rmse：2.246084426708412
第46組參數平均rmse：2.1533200045826795
--------------------------------------------------
開始實驗第47組參數： {'objective': 'reg:linear', 'eta': 0.5, 'gamma': 0.01, 'silent': 1, 'max_depth': 35, 'min_child_weight': 1}
開始第0/3折驗證：
第0/3折驗證rmse：2.291145736550947
開始第1/3折驗證：
第1/3折驗證rmse：2.4679262622671856
開始第2/3折驗證：
第2/3折驗證rmse：2.3334840777758163
第47組參數平均rmse：2.3641853588646495
--------------------------------------------------
開始實驗第48組參數： {'objective': 'reg:linear', 'eta': 0.01, 'gamma': 0.05, 'silent': 1, 'max_depth': 35, 'min_child_weight': 1}
開始第0/3折驗證：
第0/3折驗證rmse：2.2070883532476704
開始第1/3折驗證：
第1/3折驗證rmse：2.0854684885544548
開始第2/3折驗證：
第2/3折驗證rmse：2.190895382039525
第48組參數平均rmse：2.1611507412805504
--------------------------------------------------
開始實驗第49組參數： {'objective': 'reg:linear', 'eta': 0.1, 'gamma': 0.05, 'silent': 1, 'max_depth': 35, 'min_child_weight': 1}
開始第0/3折驗證：
第0/3折驗證rmse：2.2406075908241965
開始第1/3折驗證：
第1/3折驗證rmse：2.1997157111560504
開始第2/3折驗證：
第2/3折驗證rmse：2.3159740377238247
第49組參數平均rmse：2.2520991132346904
--------------------------------------------------
開始實驗第50組參數： {'objective': 'reg:linear', 'eta': 0.5, 'gamma': 0.05, 'silent': 1, 'max_depth': 35, 'min_child_weight': 1}
開始第0/3折驗證：
第0/3折驗證rmse：2.4411125744725406

開始第1/3折驗證：
第1/3折驗證rmse：2.45578081917117
開始第2/3折驗證：
第2/3折驗證rmse：2.2108615863413847
第50組參數平均rmse：2.3692516599950317
--------------------------------------------------
開始實驗第51組參數： {'objective': 'reg:linear', 'eta': 0.01, 'gamma': 0.1, 'silent': 1, 'max_depth': 35, 'min_child_weight': 1}
開始第0/3折驗證：
第0/3折驗證rmse：2.3555601196130684
開始第1/3折驗證：
第1/3折驗證rmse：2.1563837142863638
開始第2/3折驗證：
第2/3折驗證rmse：2.2789944615773194
第51組參數平均rmse：2.2636460984922504
--------------------------------------------------
開始實驗第52組參數： {'objective': 'reg:linear', 'eta': 0.1, 'gamma': 0.1, 'silent': 1, 'max_depth': 35, 'min_child_weight': 1}
開始第0/3折驗證：
第0/3折驗證rmse：2.2950794012987616
開始第1/3折驗證：
第1/3折驗證rmse：2.2213071464839884
開始第2/3折驗證：
第2/3折驗證rmse：2.169220925172176
第52組參數平均rmse：2.2285358243183087
--------------------------------------------------
開始實驗第53組參數： {'objective': 'reg:linear', 'eta': 0.5, 'gamma': 0.1, 'silent': 1, 'max_depth': 35, 'min_child_weight': 1}
開始第0/3折驗證：
第0/3折驗證rmse：2.3781817980568287
開始第1/3折驗證：
第1/3折驗證rmse：2.2231348209756074
開始第2/3折驗證：
第2/3折驗證rmse：2.452517601712933
第53組參數平均rmse：2.3512780735817898
--------------------------------------------------
開始實驗第54組參數： {'objective': 'reg:linear', 'eta': 0.01, 'gamma': 0.01, 'silent': 1, 'max_depth': 15, 'min_child_weight': 3}
開始第0/3折驗證：
第0/3折驗證rmse：2.1697340094788573
開始第1/3折驗證：
第1/3折驗證rmse：2.1152437442840575
開始第2/3折驗證：
第2/3折驗證rmse：2.2024548685544176
第54組參數平均rmse：2.162477540772444
--------------------------------------------------
開始實驗第55組參數： {'objective': 'reg:linear', 'eta': 0.1, 'gamma': 0.01, 'silent': 1, 'max_depth': 15, 'min_child_weight': 3}
開始第0/3折驗證：
第0/3折驗證rmse：2.209310097715091
開始第1/3折驗證：
第1/3折驗證rmse：2.138762610248193
開始第2/3折驗證：
第2/3折驗證rmse：2.2925897483081115
第55組參數平均rmse：2.2135541520904654
--------------------------------------------------
開始實驗第56組參數： {'objective': 'reg:linear', 'eta': 0.5, 'gamma': 0.01, 'silent': 1, 'max_depth': 15, 'min_child_weight': 3}
開始第0/3折驗證：
第0/3折驗證rmse：2.396555278419278
開始第1/3折驗證：
第1/3折驗證rmse：2.363139367149146
開始第2/3折驗證：
第2/3折驗證rmse：2.2318443518447233
第56組參數平均rmse：2.330512999137716
--------------------------------------------------
開始實驗第57組參數： {'objective': 'reg:linear', 'eta': 0.01, 'gamma': 0.05, 'silent': 1, 'max_depth': 15, 'min_child_weight': 3}
開始第0/3折驗證：
第0/3折驗證rmse：2.350916617214958
開始第1/3折驗證：
第1/3折驗證rmse：2.426769178733778
開始第2/3折驗證：
第2/3折驗證rmse：2.1294132232677314
第57組參數平均rmse：2.3023663397388225
--------------------------------------------------
開始實驗第58組參數： {'objective': 'reg:linear', 'eta': 0.1, 'gamma': 0.05, 'silent': 1, 'max_depth': 15, 'min_child_weight': 3}
開始第0/3折驗證：
第0/3折驗證rmse：2.2718710349597204
開始第1/3折驗證：
第1/3折驗證rmse：2.2724245754078067
開始第2/3折驗證：
第2/3折驗證rmse：2.1415978768994695
第58組參數平均rmse：2.228631162422332
--------------------------------------------------
開始實驗第59組參數： {'objective': 'reg:linear', 'eta': 0.5, 'gamma': 0.05, 'silent': 1, 'max_depth': 15, 'min_child_weight': 3}
開始第0/3折驗證：
第0/3折驗證rmse：2.501290093920802
開始第1/3折驗證：
第1/3折驗證rmse：2.381049210488406
開始第2/3折驗證：
第2/3折驗證rmse：2.2750317759288943
第59組參數平均rmse：2.3857903601127006
--------------------------------------------------
開始實驗第60組參數： {'objective': 'reg:linear', 'eta': 0.01, 'gamma': 0.1, 'silent': 1, 'max_depth': 15, 'min_child_weight': 3}
開始第0/3折驗證：
第0/3折驗證rmse：2.3181114399810796
開始第1/3折驗證：
第1/3折驗證rmse：2.2154998493769624
開始第2/3折驗證：
第2/3折驗證rmse：2.4049210337342464
第60組參數平均rmse：2.3128441076974293
--------------------------------------------------
開始實驗第61組參數： {'objective': 'reg:linear', 'eta': 0.1, 'gamma': 0.1, 'silent': 1, 'max_depth': 15, 'min_child_weight': 3}
開始第0/3折驗證：
第0/3折驗證rmse：2.3851664509564388
開始第1/3折驗證：
第1/3折驗證rmse：2.3030068446632947
開始第2/3折驗證：
第2/3折驗證rmse：2.280847797963317
第61組參數平均rmse：2.3230070311943503
--------------------------------------------------
開始實驗第62組參數： {'objective': 'reg:linear', 'eta': 0.5, 'gamma': 0.1, 'silent': 1, 'max_depth': 15, 'min_child_weight': 3}
開始第0/3折驗證：
第0/3折驗證rmse：2.417937402473705
開始第1/3折驗證：
第1/3折驗證rmse：2.3997823968592797
開始第2/3折驗證：
第2/3折驗證rmse：2.380405546022984
第62組參數平均rmse：2.399375115118656
--------------------------------------------------
開始實驗第63組參數： {'objective': 'reg:linear', 'eta': 0.01, 'gamma': 0.01, 'silent': 1, 'max_depth': 25, 'min_child_weight': 3}
開始第0/3折驗證：
第0/3折驗證rmse：2.174178876935193
開始第1/3折驗證：
第1/3折驗證rmse：2.266418809440447
開始第2/3折驗證：
第2/3折驗證rmse：2.045909029982191
第63組參數平均rmse：2.16216890545261
--------------------------------------------------
開始實驗第64組參數： {'objective': 'reg:linear', 'eta': 0.1, 'gamma': 0.01, 'silent': 1, 'max_depth': 25, 'min_child_weight': 3}
開始第0/3折驗證：
第0/3折驗證rmse：2.1185599029996527
開始第1/3折驗證：
第1/3折驗證rmse：2.2976351878015926
開始第2/3折驗證：
第2/3折驗證rmse：2.371055913517158
第64組參數平均rmse：2.262417001439468
--------------------------------------------------
開始實驗第65組參數： {'objective': 'reg:linear', 'eta': 0.5, 'gamma': 0.01, 'silent': 1, 'max_depth': 25, 'min_child_weight': 3}
開始第0/3折驗證：
第0/3折驗證rmse：2.4192265174807486
開始第1/3折驗證：
第1/3折驗證rmse：2.547778062631276
開始第2/3折驗證：
第2/3折驗證rmse：2.453398370601104
第65組參數平均rmse：2.4734676502377098
--------------------------------------------------
開始實驗第66組參數： {'objective': 'reg:linear', 'eta': 0.01, 'gamma': 0.05, 'silent': 1, 'max_depth': 25, 'min_child_weight': 3}
開始第0/3折驗證：
第0/3折驗證rmse：2.3116700347493646
開始第1/3折驗證：
第1/3折驗證rmse：2.142077163214947
開始第2/3折驗證：
第2/3折驗證rmse：2.1577471460239233
第66組參數平均rmse：2.203831447996078
--------------------------------------------------
開始實驗第67組參數： {'objective': 'reg:linear', 'eta': 0.1, 'gamma': 0.05, 'silent': 1, 'max_depth': 25, 'min_child_weight': 3}
開始第0/3折驗證：
第0/3折驗證rmse：2.260847355749237
開始第1/3折驗證：
第1/3折驗證rmse：2.038611778956116
開始第2/3折驗證：
第2/3折驗證rmse：2.5484310808888635
第67組參數平均rmse：2.2826300718647388
--------------------------------------------------
開始實驗第68組參數： {'objective': 'reg:linear', 'eta': 0.5, 'gamma': 0.05, 'silent': 1, 'max_depth': 25, 'min_child_weight': 3}
開始第0/3折驗證：
第0/3折驗證rmse：2.3981763646113885
開始第1/3折驗證：
第1/3折驗證rmse：2.2875108174804377
開始第2/3折驗證：
第2/3折驗證rmse：2.414449111458425
第68組參數平均rmse：2.3667120978500837
--------------------------------------------------
開始實驗第69組參數： {'objective': 'reg:linear', 'eta': 0.01, 'gamma': 0.1, 'silent': 1, 'max_depth': 25, 'min_child_weight': 3}
開始第0/3折驗證：
第0/3折驗證rmse：2.28888986741368
開始第1/3折驗證：
第1/3折驗證rmse：2.2767189333146267
開始第2/3折驗證：
第2/3折驗證rmse：2.33555813056071
第69組參數平均rmse：2.3003889770963393
--------------------------------------------------
開始實驗第70組參數： {'objective': 'reg:linear', 'eta': 0.1, 'gamma': 0.1, 'silent': 1, 'max_depth': 25, 'min_child_weight': 3}
開始第0/3折驗證：
第0/3折驗證rmse：2.1262568065515675
開始第1/3折驗證：
第1/3折驗證rmse：2.2618381987583054
開始第2/3折驗證：
第2/3折驗證rmse：2.1984332633424066
第70組參數平均rmse：2.1955094228840935
--------------------------------------------------
開始實驗第71組參數： {'objective': 'reg:linear', 'eta': 0.5, 'gamma': 0.1, 'silent': 1, 'max_depth': 25, 'min_child_weight': 3}
開始第0/3折驗證：
第0/3折驗證rmse：2.503310188661614
開始第1/3折驗證：
第1/3折驗證rmse：2.3547656436918927
開始第2/3折驗證：
第2/3折驗證rmse：2.305990528979543
第71組參數平均rmse：2.3880221204443495
--------------------------------------------------
開始實驗第72組參數： {'objective': 'reg:linear', 'eta': 0.01, 'gamma': 0.01, 'silent': 1, 'max_depth': 35, 'min_child_weight': 3}
開始第0/3折驗證：
第0/3折驗證rmse：2.121248127368896
開始第1/3折驗證：
第1/3折驗證rmse：2.297606544576462
開始第2/3折驗證：
第2/3折驗證rmse：2.127186689902887
第72組參數平均rmse：2.1820137872827483
--------------------------------------------------
開始實驗第73組參數： {'objective': 'reg:linear', 'eta': 0.1, 'gamma': 0.01, 'silent': 1, 'max_depth': 35, 'min_child_weight': 3}
開始第0/3折驗證：
第0/3折驗證rmse：2.1410764654177217
開始第1/3折驗證：
第1/3折驗證rmse：2.264051683017106
開始第2/3折驗證：
第2/3折驗證rmse：2.272036190864403
第73組參數平均rmse：2.225721446433077
--------------------------------------------------
開始實驗第74組參數： {'objective': 'reg:linear', 'eta': 0.5, 'gamma': 0.01, 'silent': 1, 'max_depth': 35, 'min_child_weight': 3}
開始第0/3折驗證：
第0/3折驗證rmse：2.282223990590916
開始第1/3折驗證：
第1/3折驗證rmse：2.4344388315143215
開始第2/3折驗證：
第2/3折驗證rmse：2.1473976008596702
第74組參數平均rmse：2.2880201409883028
--------------------------------------------------
開始實驗第75組參數： {'objective': 'reg:linear', 'eta': 0.01, 'gamma': 0.05, 'silent': 1, 'max_depth': 35, 'min_child_weight': 3}
開始第0/3折驗證：
第0/3折驗證rmse：2.2557078363373244
開始第1/3折驗證：
第1/3折驗證rmse：2.012167036686711

開始第2/3折驗證：
第2/3折驗證rmse：2.187687495347084
第75組參數平均rmse：2.151854122790373
--------------------------------------------------
開始實驗第76組參數： {'objective': 'reg:linear', 'eta': 0.1, 'gamma': 0.05, 'silent': 1, 'max_depth': 35, 'min_child_weight': 3}
開始第0/3折驗證：
第0/3折驗證rmse：2.2177760911023863
開始第1/3折驗證：
第1/3折驗證rmse：2.06291663889677
開始第2/3折驗證：
第2/3折驗證rmse：2.2830701173735686
第76組參數平均rmse：2.187920949124242
--------------------------------------------------
開始實驗第77組參數： {'objective': 'reg:linear', 'eta': 0.5, 'gamma': 0.05, 'silent': 1, 'max_depth': 35, 'min_child_weight': 3}
開始第0/3折驗證：
第0/3折驗證rmse：2.3170419334417427
開始第1/3折驗證：
第1/3折驗證rmse：2.366394134764698
開始第2/3折驗證：
第2/3折驗證rmse：2.3975321343351474
第77組參數平均rmse：2.3603227341805293
--------------------------------------------------
開始實驗第78組參數： {'objective': 'reg:linear', 'eta': 0.01, 'gamma': 0.1, 'silent': 1, 'max_depth': 35, 'min_child_weight': 3}
開始第0/3折驗證：
第0/3折驗證rmse：2.1943993462539413
開始第1/3折驗證：
第1/3折驗證rmse：2.389702628638741
開始第2/3折驗證：
第2/3折驗證rmse：2.087077836239587
第78組參數平均rmse：2.2237266037107566
--------------------------------------------------
開始實驗第79組參數： {'objective': 'reg:linear', 'eta': 0.1, 'gamma': 0.1, 'silent': 1, 'max_depth': 35, 'min_child_weight': 3}
開始第0/3折驗證：
第0/3折驗證rmse：2.183353747113912
開始第1/3折驗證：
第1/3折驗證rmse：2.209950759672495
開始第2/3折驗證：
第2/3折驗證rmse：2.2352641756192404
第79組參數平均rmse：2.209522894135216
--------------------------------------------------
開始實驗第80組參數： {'objective': 'reg:linear', 'eta': 0.5, 'gamma': 0.1, 'silent': 1, 'max_depth': 35, 'min_child_weight': 3}
開始第0/3折驗證：
第0/3折驗證rmse：2.3471298414272037
開始第1/3折驗證：
第1/3折驗證rmse：2.2745681485301392
開始第2/3折驗證：
第2/3折驗證rmse：2.386511745091769
第80組參數平均rmse：2.3360699116830372
--------------------------------------------------

Out[34]:

(2.141382087254338,
 {'objective': 'reg:linear',
  'eta': 0.01,
  'gamma': 0.05,
  'silent': 1,
  'max_depth': 25,
  'min_child_weight': 0.5})

In [35]:

#1.獲取數據集

train_x,test_x,train_y,test_y=d

#獲取原始特徵對應的新特徵下標

index=[]

for col in cols:

index.extend(f_map[col])

x_train=train_x[:,index]

x_test=test_x[:,index]

#由於要用新下標訪問，所以要重置索引

train_y=train_y.reset_index(drop=True)

test_y=test_y.reset_index(drop=True)

利用搜索的參數訓練模型

In [36]:

# xgb_model = xgb.XGBRegressor(max_depth=5, learning_rate=0.01, n_estimators=500,verbosity=1, objective='reg:linear',random_state=12,)

# xgb_model.fit(new_train_x, train_y, early_stopping_rounds=10, eval_metric="rmse",

# eval_set=[(new_test_x, test_y)])

params={

"objective":'reg:linear',

'eta':0.01,

'gamma': 0.05,

'silent': 1,

'max_depth':25,

'min_child_weight':0.5,

'sub_sample':0.6,

'reg_alpha':0.5,

'reg_lambda':0.8,

'colsample_bytree':0.5

}

dtrain = xgb.DMatrix(x_train,train_y)

dtest = xgb.DMatrix(x_test,test_y)

bst = xgb.train(params, dtrain, num_boost_round=1500)

#6.模型預測

preds = bst.predict(dtest)

preds=np.exp(preds)-1#轉換成真實的租金

y_true=np.exp(test_y)-1

rmse=np.sqrt(mean_squared_error(y_true,preds))

rmse

Out[36]:

1.4322241873197603

In [124]:

# cv_params = {'n_estimators': [400, 500, 600, 700, 800]}

# other_params = {'learning_rate': 0.1, 'n_estimators': 500, 'max_depth': 5, 'min_child_weight': 1, 'seed': 0,

# 'subsample': 0.8, 'colsample_bytree': 0.8, 'gamma': 0, 'reg_alpha': 0, 'reg_lambda': 1}

# model = xgb.XGBRegressor(**other_params)

# optimized_GBM = GridSearchCV(estimator=model, param_grid=cv_params, scoring='r2', cv=5, verbose=1)

# optimized_GBM.fit(new_train_x, train_y)

# evalute_result = optimized_GBM.grid_scores_

# print('每輪迭代運行結果:{0}'.format(evalute_result))

# print('參數的最佳取值：{0}'.format(optimized_GBM.best_params_))

# print('最佳模型得分:{0}'.format(optimized_GBM.best_score_))

rmse

Out[124]:

1.5180160328127699

模型融合

In [1]:

from sklearn.linear_model import RidgeCV,LassoCV,Ridge,Lasso

from sklearn.svm import LinearSVR,SVR

from sklearn.ensemble import RandomForestRegressor,AdaBoostRegressor,GradientBoostingRegressor

from sklearn.neural_network import MLPRegressor

from sklearn.model_selection import train_test_split,GridSearchCV

from sklearn.feature_extraction import DictVectorizer

from sklearn.preprocessing import StandardScaler,PolynomialFeatures

from sklearn.decomposition import PCA

import pandas as pd

import numpy as np

from sklearn.metrics import mean_squared_error

In [ ]:

#沒有用bagging和boosting

#stacking 先用幾個不同的模型做預測輸出預測值然後將這幾個模型輸出的預測值作爲特徵來訓練一個新的模型

獲取數據

In [2]:

data=pd.read_csv("data/onehot_feature.csv")

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 196499 entries, 0 to 196498
Data columns (total 41 columns):
Unnamed: 0    196499 non-null int64
時間            196499 non-null int64
小區名           196499 non-null int64
小區房屋出租數量      196499 non-null float64
樓層            196499 non-null int64
總樓層           196499 non-null float64
房屋面積          196499 non-null float64
房屋朝向          196499 non-null object
居住狀態          196499 non-null float64
臥室數量          196499 non-null int64
廳的數量          196499 non-null int64
衛的數量          196499 non-null int64
出租方式          196499 non-null float64
區             196499 non-null float64
位置            196499 non-null float64
地鐵線路          196499 non-null float64
地鐵站點          196499 non-null float64
距離            196499 non-null float64
裝修情況          196499 non-null float64
月租金           196499 non-null float64
log_rent      196499 non-null float64
新朝向           196499 non-null object
房+衛+廳         196499 non-null int64
房/總           196499 non-null float64
衛/總           196499 non-null float64
廳/總           196499 non-null float64
臥室面積          196499 non-null float64
樓層比           196499 non-null float64
戶型            196499 non-null int64
平均值特徵1        196499 non-null float64
小區平均值特徵       196499 non-null float64
朝向平均值特徵       196499 non-null float64
站點平均值特徵       196499 non-null float64
位置平均值特徵       196499 non-null float64
有地鐵           196499 non-null int64
聚類特徵          196499 non-null int64
平均值特徵2        196499 non-null float64
小區線路數         196499 non-null int64
位置線路數         196499 non-null int64
新小區名          196499 non-null int64
小區條數大於100     196499 non-null int64
dtypes: float64(24), int64(15), object(2)
memory usage: 61.5+ MB

In [3]:

#將離散特徵轉換成字符串類型

colunms = ['時間', '新小區名', '居住狀態', '出租方式', '區','位置','地鐵線路','地鐵站點','裝修情況','戶型','聚類特徵']

for col in colunms:

data[col] = data[col].astype(str)

In [4]:

x_columns=['小區房屋出租數量','新小區名', '樓層', '總樓層', '房屋面積','居住狀態', '臥室數量',

'衛的數量', '位置', '地鐵站點', '距離', '裝修情況',

'新朝向', '房+衛+廳', '房/總', '衛/總', '廳/總', '臥室面積', '樓層比', '戶型','平均值特徵1',

'平均值特徵2','有地鐵','小區線路數','位置線路數','小區條數大於100','小區平均值特徵','朝向平均值特徵',

'站點平均值特徵','位置平均值特徵']

y_label='log_rent'

x=data[x_columns]

y=data[y_label]

In [5]:

#2.分割數據集

train_x,test_x,train_y,test_y=train_test_split(x,y,test_size=0.3,random_state=12)

In [6]:

#3.特徵轉換

vector=DictVectorizer(sparse=True)

x_train=vector.fit_transform(train_x.to_dict(orient='records'))

x_test=vector.transform(test_x.to_dict(orient='records'))

In [7]:

x_train.shape

Out[7]:

(137549, 964)

In [8]:

#4.降維----線性迴歸和svm可以採用降維後的特徵

pca=PCA(0.98)

pca_x_train=pca.fit_transform(x_train.toarray())

pca_x_test=pca.transform(x_test.toarray())

In [9]:

pca_x_train.shape

Out[9]:

(137549, 407)

In [10]:

#5.特徵標準化

trans=StandardScaler()

new_x_train=trans.fit_transform(pca_x_train)

new_x_test=trans.transform(pca_x_test)

In [11]:

new_x_train.shape

Out[11]:

(137549, 407)

In [12]:

# ploy=PolynomialFeatures(degree=2,interaction_only=True)

# ploy_x_train=ploy.fit_transform(new_x_train)

# ploy_x_test=ploy.transform(new_x_test)

In [13]:

# ploy_x_train.shape

In [14]:

def rmse(y_true,y_pred):

y_pred=np.exp(y_pred)-1#轉換成真實的租金

y_true=np.exp(y_true)-1

return np.sqrt(mean_squared_error(y_true,y_pred))

構建子模型

構建嶺迴歸模型

In [15]:

# #1.參數搜索

# ridge=Ridge()

# params={

# "alpha":[0.005,0.01,1,5,10,20,50]

# }

# model1=GridSearchCV(ridge,param_grid=params,cv=5,n_jobs=-1)

# model1.fit(new_x_train,train_y)

# model1.best_params_

# #{'alpha': 10, 'fit_intercept': True}

In [16]:

#利用搜索出的最優參數構建模型

ridge=Ridge(alpha=0.005)

ridge.fit(new_x_train,train_y)

Out[16]:

Ridge(alpha=0.005, copy_X=True, fit_intercept=True, max_iter=None,
normalize=False, random_state=None, solver='auto', tol=0.001)

In [17]:

y_pred_test=ridge.predict(new_x_test)

y_pred_train=ridge.predict(new_x_train)

print("訓練集rmse：",rmse(train_y,y_pred_train))

print("測試集rmse：",rmse(test_y,y_pred_test))

訓練集rmse： 3.162947362147576
測試集rmse： 3.1450837952700987

構建lasso迴歸

In [18]:

# #1.參數搜索

# lasso=Lasso()

# params={

# "alpha":[0.001,0.01,0.05,0.1,0.5,1,5,10],

# "fit_intercept":[True,False]

# }

# model2=GridSearchCV(lasso,param_grid=params,cv=5,n_jobs=-1)

# model2.fit(new_x_train,train_y)

# model2.best_params_

# #{'alpha': 0.001, 'fit_intercept': True}

In [19]:

#利用搜索出的最優參數構建模型

lasso=Lasso(alpha=0.001)

lasso.fit(new_x_train,train_y)

Out[19]:

Lasso(alpha=0.001, copy_X=True, fit_intercept=True, max_iter=1000,
normalize=False, positive=False, precompute=False, random_state=None,
selection='cyclic', tol=0.0001, warm_start=False)

In [20]:

y_pred_test=lasso.predict(new_x_test)

y_pred_train=lasso.predict(new_x_train)

print("訓練集rmse：",rmse(train_y,y_pred_train))

print("測試集rmse：",rmse(test_y,y_pred_test))

訓練集rmse： 3.17469838175572
測試集rmse： 3.1600537035148

構建隨機森林

In [55]:

# #1.參數搜索

# rf=RandomForestRegressor(max_features='sqrt')#設置max_features='sqrt'，不然太耗時間

# params={

# "n_estimators":[200],#[200,500,700],

# "max_depth":[40,50,60],

# "min_samples_split":[20,50,100],

# "min_samples_leaf":[10,20,30]

# }

# model3=GridSearchCV(rf,param_grid=params,cv=5,n_jobs=-1,verbose=2)

# model3.fit(x_train,train_y)

# model3.best_params_

# # {'max_depth': 50,

# # 'min_samples_leaf': 10,

# # 'min_samples_split': 20,

# # 'n_estimators': 200}

Fitting 5 folds for each of 27 candidates, totalling 135 fits

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 20 concurrent workers.
[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 5.1min
[Parallel(n_jobs=-1)]: Done 135 out of 135 | elapsed: 35.5min finished

Out[55]:

{'max_depth': 50,
'min_samples_leaf': 10,
'min_samples_split': 20,
'n_estimators': 200}

In [ ]:

# import time

# print time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())

In [21]:

#利用搜索出的最優參數構建模型

rf=RandomForestRegressor(n_estimators=200,

max_features=0.8,

max_depth=50,

min_samples_split=20,

min_samples_leaf=10,

n_jobs=-1)

rf.fit(x_train,train_y)

Out[21]:

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=50,
                      max_features=0.8, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=10, min_samples_split=20,
                      min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=-1,
                      oob_score=False, random_state=None, verbose=0,
                      warm_start=False)

In [22]:

y_pred_test=rf.predict(x_test)

y_pred_train=rf.predict(x_train)

print("訓練集rmse：",rmse(train_y,y_pred_train))

print("測試集rmse：",rmse(test_y,y_pred_test))

訓練集rmse： 1.8016406346453875
測試集rmse： 2.078690010364972

構建決策樹

In [93]:

# tree=DecisionTreeRegressor()

# params={

# "max_depth":[40,50,60,70],

# "min_samples_split":[5,10,20,30,40,50],

# "min_samples_leaf":[2,3,5,7,9,11]

# }

# model4=GridSearchCV(tree,param_grid=params,cv=5,n_jobs=-1)

# model4.fit(x_train,train_y)

# model4.best_params_

# {'max_depth': 60, 'min_samples_leaf': 2, 'min_samples_split': 5}

Out[93]:

{'max_depth': 60, 'min_samples_leaf': 2, 'min_samples_split': 5}

In [94]:

from sklearn.tree import DecisionTreeRegressor

#利用搜索出的最優參數構建模型

tree=DecisionTreeRegressor(max_depth=60,min_samples_leaf=2,min_samples_split=5)

tree.fit(x_train,train_y)

Out[94]:

DecisionTreeRegressor(criterion='mse', max_depth=60, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=2,
                      min_samples_split=5, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=None, splitter='best')

In [95]:

y_pred_test=tree.predict(x_test)

y_pred_train=tree.predict(x_train)

print("訓練集rmse：",rmse(train_y,y_pred_train))

print("測試集rmse：",rmse(test_y,y_pred_test))

訓練集rmse： 0.7147616009940363
測試集rmse： 1.5876633153093507

In [96]:

import matplotlib.pyplot as plt

plt.rcParams['font.sans-serif']=['SimHei']

plt.rcParams['axes.unicode_minus']=False

plt.figure(figsize=(20,20),dpi=100)

plt.scatter(test_y,y_pred_test)

plt.xlabel("真實值")

plt.ylabel("預測值")

plt.show()

構建支持向量機

In [1]:

# #1.參數搜索----數據量大 svm太耗時，調參幾乎不可能

# svr=SVR()

# params={

# "gamma":[0.001,0.01,0.1,0.5,1,5],

# "C":[0.001,0.1,0.5,1,5]

# }

# model5=GridSearchCV(svr,param_grid=params,cv=5,n_jobs=-1,verbose=10)

# model5.fit(new_x_train,train_y)

# model5.best_params_

In [ ]:

# #隨意選一組參數 --- 耗時太長放棄該模型

# svr=SVR(gamma=0.1,C=0.5)

# svr.fit(new_x_train,train_y)

# y_pred=svr.predict(new_x_test)

# rmse(test_y,y_pred)

構建xgboost模型

In [25]:

import xgboost as xgb

params={

"objective":'reg:linear',

'eta':0.1,

'gamma': 0.05,

'silent': 1,

'max_depth':45,

'min_child_weight':0.5,

'sub_sample':0.6,

'reg_alpha':0.5,

'reg_lambda':0.8,

'colsample_bytree':0.5

}

dtrain = xgb.DMatrix(x_train,train_y)

dtest = xgb.DMatrix(x_test,test_y)

bst = xgb.train(params, dtrain, num_boost_round=2000)

/root/anaconda3/envs/ml/lib/python3.6/site-packages/xgboost/core.py:587: FutureWarning: Series.base is deprecated and will be removed in a future version
if getattr(data, 'base', None) is not None and \

In [66]:

y_pred_test=bst.predict(dtest)

y_pred_train=bst.predict(dtrain)

print("訓練集rmse：",rmse(train_y,y_pred_train))

print("測試集rmse：",rmse(test_y,y_pred_test))

訓練集rmse： 0.8318620679177371
測試集rmse： 1.3412344636800162

In [68]:

import matplotlib.pyplot as plt

plt.figure(figsize=(20,20),dpi=100)

plt.scatter(test_y,y_pred_test)

plt.xlabel("真實值")

plt.ylabel("預測值")

plt.show()

Stacking融合

In [97]:

#獲取每個子模型的預測結果作爲特徵

train_features=[]

train_features.append(ridge.predict(new_x_train))#將每個模型預測值保存起來

train_features.append(lasso.predict(new_x_train))

# train_features.append(svr.predict(new_x_train))#這個太慢了不要了

train_features.append(rf.predict(x_train))

train_features.append(tree.predict(x_train))

train_features.append(bst.predict(dtrain))

test_features=[]

test_features.append(ridge.predict(new_x_test))

test_features.append(lasso.predict(new_x_test))

# test_features.append(svr.predict(new_x_test))

test_features.append(rf.predict(x_test))

test_features.append(tree.predict(x_test))

test_features.append(bst.predict(dtest))

In [98]:

mx_train=np.vstack(train_features).T

mx_test=np.vstack(test_features).T

mx_train.shape

Out[98]:

(137549, 5)

In [110]:

stack_model=Ridge(fit_intercept=False)

params={

"alpha":np.logspace(-2,3,20)

}

model=GridSearchCV(stack_model,param_grid=params,cv=5,n_jobs=-1)

model.fit(mx_train,train_y)

model.best_params_

Out[110]:

{'alpha': 0.20691380811147891}

In [120]:

stack_model=Ridge(alpha=0.206,fit_intercept=False)

stack_model.fit(mx_train,train_y)

y_pred=stack_model.predict(mx_test)

y_pred_train=stack_model.predict(mx_train)

print("訓練集rmse：",rmse(train_y,y_pred_train))

print("測試集rmse：",rmse(test_y,y_pred))

訓練集rmse： 0.6824409675644203
測試集rmse： 1.3588850206739824

In [121]:

stack_model.coef_

Out[121]:

array([-0.01476048, -0.01601146, 0.04143987, 0.55925897, 0.43009737])

模型保存

In [ ]:

import pickle

with open("data/model.pkl","wb") as f:

pickle.dump({

"vector":vector,

"pca":pca,

"sc":trans,

"ridge":ridge,

"lasso":lasso,

"rf":rf,

"tree":tree,

"bst":bst,

"stack":stack_model

},f)

數據處理

In [1]:

import pandas as pd

In [2]:

train = pd.read_csv("./data/train.csv")

In [3]:

test = pd.read_csv("./data/test.csv")

In [4]:

train.head()

Out[4]:

	時間	小區名	小區房屋出租數量	樓層	總樓層	房屋面積	房屋朝向	居住狀態	臥室數量	廳的數量	衛的數量	出租方式	區	位置	地鐵線路	地鐵站點	距離	裝修情況	月租金
0	1	3072	0.128906	2	0.236364	0.008628	東南	NaN	1	1	1	NaN	11.0	118.0	2.0	40.0	0.764167	NaN	5.602716
1	1	3152	0.132812	1	0.381818	0.017046	東	NaN	1	0	0	NaN	10.0	100.0	4.0	58.0	0.709167	NaN	16.977929
2	1	5575	0.042969	0	0.290909	0.010593	東南	NaN	2	1	2	NaN	12.0	130.0	5.0	37.0	0.572500	NaN	8.998302
3	1	3103	0.085938	2	0.581818	0.019199	南	NaN	3	2	2	NaN	7.0	90.0	2.0	63.0	0.658333	NaN	5.602716
4	1	5182	0.214844	0	0.545455	0.010427	東北	NaN	2	1	1	NaN	3.0	31.0	NaN	NaN	NaN	NaN	7.300509

In [5]:

test.head()

Out[5]:

	id	時間	小區名	小區房屋出租數量	樓層	總樓層	房屋面積	房屋朝向	居住狀態	臥室數量	廳的數量	衛的數量	出租方式	區	位置	地鐵線路	地鐵站點	距離	裝修情況
0	1	4	6011	0.382812	1	0.600000	0.007117	東	3.0	2	1	1	1.0	10.0	5.0	NaN	NaN	NaN	6.0
1	2	4	1697	0.152344	1	0.472727	0.007448	東	NaN	2	1	1	NaN	3.0	0.0	NaN	NaN	NaN	NaN
2	3	4	754	0.207031	2	0.709091	0.014068	東南	NaN	3	2	2	NaN	10.0	9.0	4.0	74.0	0.400833	NaN
3	4	4	1285	0.011719	0	0.090909	0.008937	南	NaN	2	1	1	NaN	6.0	96.0	5.0	17.0	0.384167	NaN
4	5	4	4984	0.035156	1	0.218182	0.008606	東南	NaN	2	1	1	NaN	6.0	61.0	3.0	114.0	0.598333	NaN

In [7]:

train.shape

Out[7]:

(196539, 19)

In [8]:

import seaborn as sns

In [11]:

# 圖形可視化,查看數據分佈

import seaborn as sns

import matplotlib.pyplot as plt

sns.countplot(train.時間)

plt.show()

In [14]:

train1 = train[train.時間 == 1]

train1.shape

Out[14]:

(50843, 19)

In [15]:

train2 = train[train.時間 == 2]

train2.shape

Out[15]:

(72206, 19)

In [16]:

train3 = train[train.時間 == 3]

train3.shape

Out[16]:

(73490, 19)

In [21]:

train2.index

Out[21]:

Int64Index([ 50843,  50844,  50845,  50846,  50847,  50848,  50849,  50850,
             50851,  50852,
            ...
            123039, 123040, 123041, 123042, 123043, 123044, 123045, 123046,
            123047, 123048],
           dtype='int64', length=72206)

In [22]:

train3.index

Out[22]:

Int64Index([123049, 123050, 123051, 123052, 123053, 123054, 123055, 123056,
            123057, 123058,
            ...
            196529, 196530, 196531, 196532, 196533, 196534, 196535, 196536,
            196537, 196538],
           dtype='int64', length=73490)

In [31]:

train2.月租金.values

Out[31]:

array([7.64006791, 4.24448217, 6.62139219, ..., 5.60271647, 7.30050934,
       6.96095076])

In [32]:

plt.figure()

plt.plot(train2.index.values, train2.月租金.values)

plt.show()

In [33]:

plt.figure()

plt.plot(train3.index.values, train3.月租金.values)

plt.show()

In [37]:

train_ = train[:150539]

In [38]:

# 圖形可視化,查看數據分佈

import seaborn as sns

import matplotlib.pyplot as plt

sns.countplot(train_.時間)

plt.show()

In [39]:

plt.figure()

plt.plot(train_.index.values, train_.月租金.values)

plt.show()

In [42]:

train_.index.values

Out[42]:

array([ 0, 1, 2, ..., 150536, 150537, 150538])

In [43]:

test_ = train[150539:]

In [44]:

test_.index.values

Out[44]:

array([150539, 150540, 150541, ..., 196536, 196537, 196538])

In [45]:

test_.shape

Out[45]:

(46000, 19)

In [46]:

test_.head()

Out[46]:

	時間	小區名	小區房屋出租數量	樓層	總樓層	房屋面積	房屋朝向	居住狀態	臥室數量	廳的數量	衛的數量	出租方式	區	位置	地鐵線路	地鐵站點	距離	裝修情況	月租金
150539	3	3882	0.035156	1	0.436364	0.013075	東南	NaN	3	1	1	NaN	8.0	94.0	4.0	76.0	0.383333	NaN	6.281834
150540	3	6353	0.078125	1	0.436364	0.012248	東南	NaN	3	1	1	NaN	3.0	33.0	3.0	23.0	1.000000	NaN	6.281834
150541	3	1493	0.203125	1	0.381818	0.023006	南	NaN	4	2	2	NaN	12.0	60.0	5.0	115.0	0.945000	NaN	23.259762
150542	3	1532	0.414062	1	0.600000	0.019695	東南	NaN	3	2	2	NaN	1.0	40.0	NaN	NaN	NaN	NaN	2.886248
150543	3	1251	0.226562	1	0.381818	0.014730	東	NaN	3	1	1	NaN	12.0	52.0	NaN	NaN	NaN	NaN	10.696095

In [50]:

id = [i for i in range(1, 46001)]

test_["id"] = id

In [52]:

test_.head()

Out[52]:

	時間	小區名	小區房屋出租數量	樓層	總樓層	房屋面積	房屋朝向	居住狀態	臥室數量	廳的數量	衛的數量	出租方式	區	位置	地鐵線路	地鐵站點	距離	裝修情況	月租金	id
150539	3	3882	0.035156	1	0.436364	0.013075	東南	NaN	3	1	1	NaN	8.0	94.0	4.0	76.0	0.383333	NaN	6.281834	1
150540	3	6353	0.078125	1	0.436364	0.012248	東南	NaN	3	1	1	NaN	3.0	33.0	3.0	23.0	1.000000	NaN	6.281834	2
150541	3	1493	0.203125	1	0.381818	0.023006	南	NaN	4	2	2	NaN	12.0	60.0	5.0	115.0	0.945000	NaN	23.259762	3
150542	3	1532	0.414062	1	0.600000	0.019695	東南	NaN	3	2	2	NaN	1.0	40.0	NaN	NaN	NaN	NaN	2.886248	4
150543	3	1251	0.226562	1	0.381818	0.014730	東	NaN	3	1	1	NaN	12.0	52.0	NaN	NaN	NaN	NaN	10.696095	5

In [74]:

train_.to_csv("./train.csv", index=False)

In [54]:

test_.to_csv("./test_result.csv")

In [57]:

test_1 = test_.drop(["月租金"], axis=1)

In [59]:

test_1.head()

Out[59]:

	時間	小區名	小區房屋出租數量	樓層	總樓層	房屋面積	房屋朝向	居住狀態	臥室數量	廳的數量	衛的數量	出租方式	區	位置	地鐵線路	地鐵站點	距離	裝修情況	id
150539	3	3882	0.035156	1	0.436364	0.013075	東南	NaN	3	1	1	NaN	8.0	94.0	4.0	76.0	0.383333	NaN	1
150540	3	6353	0.078125	1	0.436364	0.012248	東南	NaN	3	1	1	NaN	3.0	33.0	3.0	23.0	1.000000	NaN	2
150541	3	1493	0.203125	1	0.381818	0.023006	南	NaN	4	2	2	NaN	12.0	60.0	5.0	115.0	0.945000	NaN	3
150542	3	1532	0.414062	1	0.600000	0.019695	東南	NaN	3	2	2	NaN	1.0	40.0	NaN	NaN	NaN	NaN	4
150543	3	1251	0.226562	1	0.381818	0.014730	東	NaN	3	1	1	NaN	12.0	52.0	NaN	NaN	NaN	NaN	5

In [60]:

test_1.to_csv("./test.csv")

In [62]:

test = pd.read_csv("./test.csv")

In [63]:

test.head()

Out[63]:

	Unnamed: 0	id	時間	小區名	小區房屋出租數量	樓層	總樓層	房屋面積	房屋朝向	居住狀態	臥室數量	廳的數量	衛的數量	出租方式	區	位置	地鐵線路	地鐵站點	距離	裝修情況
0	1	1	3	3882	0.035156	1	0.436364	0.013075	東南	NaN	3	1	1	NaN	8.0	94.0	4.0	76.0	0.383333	NaN
1	2	2	3	6353	0.078125	1	0.436364	0.012248	東南	NaN	3	1	1	NaN	3.0	33.0	3.0	23.0	1.000000	NaN
2	3	3	3	1493	0.203125	1	0.381818	0.023006	南	NaN	4	2	2	NaN	12.0	60.0	5.0	115.0	0.945000	NaN
3	150542	4	3	1532	0.414062	1	0.600000	0.019695	東南	NaN	3	2	2	NaN	1.0	40.0	NaN	NaN	NaN	NaN
4	150543	5	3	1251	0.226562	1	0.381818	0.014730	東	NaN	3	1	1	NaN	12.0	52.0	NaN	NaN	NaN	NaN

In [65]:

test = test.drop(["Unnamed: 0"], axis=1)

In [68]:

test.to_csv("./test.csv", index=False)

In [72]:

test = pd.read_csv("./test.csv")

test.head()

Out[72]:

	id	時間	小區名	小區房屋出租數量	樓層	總樓層	房屋面積	房屋朝向	居住狀態	臥室數量	廳的數量	衛的數量	出租方式	區	位置	地鐵線路	地鐵站點	距離	裝修情況
0	1	3	3882	0.035156	1	0.436364	0.013075	東南	NaN	3	1	1	NaN	8.0	94.0	4.0	76.0	0.383333	NaN
1	2	3	6353	0.078125	1	0.436364	0.012248	東南	NaN	3	1	1	NaN	3.0	33.0	3.0	23.0	1.000000	NaN
2	3	3	1493	0.203125	1	0.381818	0.023006	南	NaN	4	2	2	NaN	12.0	60.0	5.0	115.0	0.945000	NaN
3	4	3	1532	0.414062	1	0.600000	0.019695	東南	NaN	3	2	2	NaN	1.0	40.0	NaN	NaN	NaN	NaN
4	5	3	1251	0.226562	1	0.381818	0.014730	東	NaN	3	1	1	NaN	12.0	52.0	NaN	NaN	NaN	NaN

In [73]:

test.shape

Out[73]:

(46000, 19)

In [79]:

train = pd.read_csv("./train.csv")

train.head()

Out[79]:

	時間	小區名	小區房屋出租數量	樓層	總樓層	房屋面積	房屋朝向	居住狀態	臥室數量	廳的數量	衛的數量	出租方式	區	位置	地鐵線路	地鐵站點	距離	裝修情況	月租金
0	1	3072	0.128906	2	0.236364	0.008628	東南	NaN	1	1	1	NaN	11.0	118.0	2.0	40.0	0.764167	NaN	5.602716
1	1	3152	0.132812	1	0.381818	0.017046	東	NaN	1	0	0	NaN	10.0	100.0	4.0	58.0	0.709167	NaN	16.977929
2	1	5575	0.042969	0	0.290909	0.010593	東南	NaN	2	1	2	NaN	12.0	130.0	5.0	37.0	0.572500	NaN	8.998302
3	1	3103	0.085938	2	0.581818	0.019199	南	NaN	3	2	2	NaN	7.0	90.0	2.0	63.0	0.658333	NaN	5.602716
4	1	5182	0.214844	0	0.545455	0.010427	東北	NaN	2	1	1	NaN	3.0	31.0	NaN	NaN	NaN	NaN	7.300509

In [80]:

train.shape

Out[80]:

(150539, 19)

Sklearn：房租租⾦模型預測 版本二

庫安裝：pip install xgboost

數據初步分析

導入數據

數據探索

基本信息

缺失值比例

目標值分佈

所有特徵分佈

直方圖和柱狀分佈圖

相關性分析

連續特徵和目標值的散點圖

特徵和目標相關性分析

皮爾森相關性熱力圖

皮爾森相關

斯皮爾曼相關

離散特徵和月租金關係分析

均值比較

繪製箱線圖

異常值分析

問題數據

房間朝向列有多個值

同一個小區屬於不同的區

同一個小區地鐵線路不同的問題

研究一下位置和地鐵線路的關係

研究一下位置和地鐵站點的關係

研究一下小區名，位置，地鐵線路，站點的關係

研究一下是否有換乘站的存在

研究一下每個位置的地鐵線路數和站點數

研究一下位置缺失的樣本地鐵站點是否也是缺失的

位置和區的關係校驗

小區名和位置的關係

看一下小區名過多的問題

數據清洗

導入數據

設置後面要用的填充量

缺失值處理

缺失值比例

填充區和位置

地鐵站點，距離 處理

小區房屋出租數量處理

裝修，居住狀態，出租方式--作爲單獨一類

清除異常樣本

糾偏

問題數據處理

存儲數據

特徵工程

根據房間,廳,衛,房屋面積構造新特徵

構建租金平均值特徵

構造是否有地鐵

構造聚類特徵

保存標準化和聚類模型

構造地鐵線路數特徵

去掉出現數量較少的小區

轉換類型

構建數據清洗和特徵工程函數

建模

讀取數據

獲取x和y

構建訓練函數

模型特徵篩選

參數搜索

構建交叉驗證和參數搜索函數

開始搜索

利用搜索的參數訓練模型

模型融合

獲取數據

構建子模型

構建嶺迴歸模型

構建lasso迴歸

構建隨機森林

構建決策樹

構建支持向量機

構建xgboost模型

Stacking融合

模型保存

數據處理

Sklearn：房租租⾦模型預測版本二

地鐵站點，距離處理