House Price Prediction: Advanced Regression Techniques with Gradient Boosted Trees and Random Forests

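This is a recent regression competition on Kaggle. The official training set has 1,460 samples and 79 features, so we need to do some feature processing and then pick suitable models for prediction. This post uses two advanced regression techniques, random forests and gradient boosted trees, plus an ensemble of the two.
Competition: https://www.kaggle.com/c/house-prices-advanced-regression-techniques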

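Data processing

Because there are so many features, we start with feature selection. The rules are: drop a string column if a single value accounts for more than 95% of its rows, and drop any column in which more than half of the values are NaN. The remaining string columns are encoded as integer codes 0, 1, 2, 3, ..., and the remaining missing values are filled with 0. The data-processing code: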

import numpy as np
import pandas as pd

houseprice_data = pd.read_csv("./data/train.csv")  # load the training data
datalength = len(houseprice_data)  # number of rows

# Data processing:

houseprice_col = list(houseprice_data.columns)  # column names
print(houseprice_data.columns)
delcol_list = []  # names of the columns to drop

for col in houseprice_col:
    if isinstance(houseprice_data[col].iloc[0], str):  # string column, judged by its first value
        # drop the column if its most frequent value covers more than 95% of the rows
        if houseprice_data[col].value_counts().max() > datalength * 0.95:
            delcol_list.append(col)

for col in houseprice_col:  # drop columns in which more than half of the values are NaN
    if houseprice_data[col].count() < datalength / 2:
        delcol_list.append(col)
delcol_list.append('Id')  # also drop the Id column
delcol_list.append('FireplaceQu')  # special case, dropped by hand
houseprice_data_one = houseprice_data.drop(delcol_list, axis=1)

for col in list(houseprice_data_one.columns):  # walk over the remaining columns
    if isinstance(houseprice_data_one[col].iloc[0], str):
        # encode string columns as integer category codes
        houseprice_data_one[col] = pd.Categorical(houseprice_data_one[col]).codes

houseprice_data_one = houseprice_data_one.fillna(0)  # fill the remaining missing values with 0
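
To make the filtering rules easier to check, here is a minimal sketch that applies them to a tiny hand-made DataFrame. The function name `preprocess`, its parameters, and the toy columns are assumptions for illustration only. Unlike the code above, it decides whether a column is text from its dtype rather than from its first value.

```python
import numpy as np
import pandas as pd


def preprocess(df, dominant_ratio=0.95, min_non_null=0.5):
    """Drop mostly-missing and near-constant text columns, then encode and fill."""
    n_rows = len(df)
    drop_cols = []
    for col in df.columns:
        is_text = not pd.api.types.is_numeric_dtype(df[col])
        if df[col].count() < n_rows * min_non_null:
            drop_cols.append(col)  # more than half of the values are missing
        elif is_text and df[col].value_counts().max() > n_rows * dominant_ratio:
            drop_cols.append(col)  # one value dominates the column
    out = df.drop(columns=drop_cols)
    for col in out.columns:
        if not pd.api.types.is_numeric_dtype(out[col]):
            # integer category codes; a missing value becomes -1
            out[col] = pd.Categorical(out[col]).codes
    return out.fillna(0)


# Toy data (made up for this example, not taken from the competition files).
toy = pd.DataFrame({
    "Street": ["Pave"] * 20,                      # a single value everywhere
    "Alley": ["Grvl"] * 4 + [None] * 16,          # mostly missing
    "Zone": ["RL", "RM"] * 10,                    # ordinary text column
    "LotArea": [8450.0, np.nan] + [9600.0] * 18,  # numeric with one gap
})
clean = preprocess(toy)
print(clean.columns.tolist())
print(clean.head(3))
```

One detail worth knowing: `pd.Categorical(...).codes` encodes a missing value as -1, so in an encoded string column the later `fillna(0)` has nothing left to fill. Only numeric columns actually receive the 0.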

Random forest

With the data processed, we first predict with a random forest:

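import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Train and predict

X = np.array(houseprice_data_one.drop('SalePrice', axis=1))
Y = np.array(houseprice_data_one['SalePrice'])

# hold out 25% of the rows as a test set
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state=1)

clf = RandomForestRegressor()  # random forest with default settings
clf.fit(x_train, y_train)
print("Training score (R^2):", clf.score(x_train, y_train))
print("Test score (R^2):", clf.score(x_test, y_test))


# Training score (R^2): 0.9673565610627388
# Test score (R^2): 0.9035892586277517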

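Cross-validation and normalization

So far we have split the data only once, so the result reflects just one particular subset of the data and is not enough to show that the model works. Next we run cross-validation and also normalize X: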

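import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold
from sklearn.preprocessing import MinMaxScaler

# Cross-validation
kf1 = KFold(n_splits=10, shuffle=False)  # 10-fold cross-validation
# Random forest
clfre = RandomForestRegressor(n_estimators=210, max_depth=30, max_features=30)

scaler = MinMaxScaler()  # min-max normalization
X_scaler = scaler.fit_transform(X)

# Per-fold scores
retrain_scores = []
retest_scores = []
for retrain, retest in kf1.split(X_scaler, Y):
    clfre.fit(X_scaler[retrain], Y[retrain])
    retrain_scores.append(clfre.score(X_scaler[retrain], Y[retrain]))
    retest_scores.append(clfre.score(X_scaler[retest], Y[retest]))

# Results
print("Cross-validation training score:", np.array(retrain_scores).mean())
print("Cross-validation test score:", np.array(retest_scores).mean())


# Cross-validation training score: 0.9818125682413038
# Cross-validation test score: 0.87427725293159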

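With cross-validation, the best mean test score we reach is about 0.874. On the Kaggle platform this model scores better than the one trained without cross-validation, which suggests that it generalizes better.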
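
The loop above fits the scaler on all rows before splitting. A common alternative is to put the scaler and the model in a pipeline so the scaler is refit on the training part of each fold. The following is a minimal sketch of that idea using scikit-learn's `make_pipeline` and `cross_val_score`. The synthetic data from `make_regression`, the smaller forest, and the shuffled 5-fold split are assumptions chosen to keep the example fast, not the settings used in this post.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the house-price matrix (an assumption, not the real data).
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# The scaler is fit only on the training part of each fold.
model = make_pipeline(
    MinMaxScaler(),
    RandomForestRegressor(n_estimators=50, random_state=0),
)
cv = KFold(n_splits=5, shuffle=True, random_state=0)
fold_r2 = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="r2")

print("R^2 per fold:", np.round(fold_r2, 3))
print("Mean R^2:", fold_r2.mean())
```

Tree-based models split on feature thresholds, so they are largely insensitive to this kind of rescaling. The pipeline matters more once a scale-sensitive model is added to the comparison.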

Gradient boosted trees

Next, gradient boosted trees:

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold
from sklearn.preprocessing import MinMaxScaler

train_scoresx = []
test_scoresx = []

kfx = KFold(n_splits=10, shuffle=False)  # 10-fold cross-validation
# Gradient boosted trees
clfx = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05, max_depth=3,
                                 max_features='sqrt', loss='huber', random_state=42)

scalerx = MinMaxScaler()  # min-max normalization
X_scalerx = scalerx.fit_transform(X)

for train, test in kfx.split(X_scalerx, Y):
    clfx.fit(X_scalerx[train], Y[train])
    train_scoresx.append(clfx.score(X_scalerx[train], Y[train]))
    test_scoresx.append(clfx.score(X_scalerx[test], Y[test]))

# Results
print("Cross-validation training score:", np.array(train_scoresx).mean())
print("Cross-validation test score:", np.array(test_scoresx).mean())


# Cross-validation training score: 0.994345303778351
# Cross-validation test score: 0.8911478084451863

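Model ensembling

Gradient boosted trees score noticeably higher in cross-validation than the random forest, reaching 0.891 against 0.874. Next we combine the two models: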

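import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import KFold
from sklearn.preprocessing import MinMaxScaler

# R^2 (coefficient of determination), the same quantity that .score() returns:
def score(y_pred, y_true):
    return 1 - ((y_pred - y_true) ** 2).sum() / ((y_true - y_true.mean()) ** 2).sum()

# Ensemble of gradient boosted trees and random forest
test_scores = []

kfx = KFold(n_splits=10, shuffle=False)  # 10-fold cross-validation

# Gradient boosted trees
clfx = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05, max_depth=3,
                                 max_features='sqrt', loss='huber', random_state=42)
# Random forest
clfr = RandomForestRegressor(n_estimators=210, max_depth=30, max_features=30)

scalerx = MinMaxScaler()  # min-max normalization
X_scalerx = scalerx.fit_transform(X)

for train, test in kfx.split(X_scalerx, Y):
    clfx.fit(X_scalerx[train], Y[train])
    clfr.fit(X_scalerx[train], Y[train])
    # ensemble by averaging the two models' predictions
    y_blend = (clfx.predict(X_scalerx[test]) + clfr.predict(X_scalerx[test])) / 2
    test_scores.append(score(y_blend, Y[test]))

# Results
print("Ensemble score:", np.array(test_scores).mean())

# Ensemble score: 0.889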

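The ensemble scores slightly lower than gradient boosted trees in cross-validation, but it scores higher on the Kaggle platform, so the ensemble adapts better to unseen data. To compare the three models properly, use a dedicated regression metric such as RMSE or MAE. The competition itself ranks submissions by the RMSE between the logarithm of the predicted price and the logarithm of the observed price.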
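
For a regression task like this one, a metric that matches the competition is the RMSE computed on log prices. Below is a minimal sketch of that metric next to R^2, applied to an averaged prediction. The helper names `rmsle` and `r2` and all the price values are made up for illustration. They are not results from the models in this post.

```python
import numpy as np


def r2(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = ((y_true - y_pred) ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    return 1 - ss_res / ss_tot


def rmsle(y_true, y_pred):
    """RMSE between log(predicted) and log(observed); prices must be positive."""
    return np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2))


# Hypothetical observed prices and two hypothetical sets of model predictions.
y_true = np.array([200000.0, 150000.0, 300000.0])
pred_gbr = np.array([210000.0, 140000.0, 310000.0])
pred_rf = np.array([190000.0, 160000.0, 280000.0])

# Averaging ensemble: the mean of the two predictions for each house.
pred_blend = (pred_gbr + pred_rf) / 2

for name, pred in [("gbr", pred_gbr), ("rf", pred_rf), ("blend", pred_blend)]:
    print(name, "R^2:", r2(y_true, pred), "RMSLE:", rmsle(y_true, pred))
```

In this toy case the two models err in opposite directions, so averaging cancels much of the error. Real models will not cancel this neatly, which is why the comparison should be run inside the cross-validation loop.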

Data download:
Link: https://pan.baidu.com/s/16Ej8Q5I8ep1g8QBuStN8dw Password: ix62
