Kaggle Starter Competition: Titanic Survival Prediction (an XGBoost approach)

Original post: https://www.missshi.cn/api/view/blog/5a06a441e519f50d0400035e

In this article we walk through how to use XGBoost (combined with model stacking) to predict which passengers survived the sinking of the Titanic.
The implementation is in Python.

Evaluation criteria

Our goal is to predict which passengers on the Titanic survived.

The evaluation metric is prediction accuracy.

The file to be uploaded has the following format:
a CSV file with 418 data rows plus one header row.
Each row has two columns: the first is the passenger ID, the second indicates whether that passenger survived (1 for survived, 0 otherwise).

For example:

PassengerId,Survived
892,0
893,1
894,0
Etc.

Dataset

https://pan.baidu.com/s/1pxgXW4s075j7zLWQpeoc4w

The dataset contains the following fields:

  1. Survived: whether the passenger survived (1 = survived, 0 = died); this is the label to predict and is present only in the training set.
  2. Pclass: ticket class (1, 2, or 3).
  3. Name: passenger name
  4. Age: passenger age
  5. SibSp: number of siblings/spouses aboard
  6. Parch: number of parents/children aboard
  7. Ticket: ticket number
  8. Fare: ticket fare
  9. Cabin: cabin number
  10. Embarked: port of embarkation

Importing third-party libraries

First, let's look at the third-party libraries we need to import to tackle this problem with XGBoost:

# Load in our libraries
import pandas as pd
import numpy as np
import re
import sklearn
import xgboost as xgb
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
 
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
 
import warnings
warnings.filterwarnings('ignore')
 
# Going to use these 5 base models for the stacking
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.svm import SVC
from sklearn.model_selection import KFold

Here, numpy and pandas are the most commonly used libraries for numerical computation and data analysis.
re is Python's regular-expression library.
sklearn (scikit-learn) is a library dedicated to machine learning.
matplotlib, seaborn, and plotly are plotting libraries.
xgboost is the Python package implementing the XGBoost gradient-boosting algorithm.

Feature analysis and extraction

With traditional machine learning algorithms, we first need to analyze the internal structure of the data and extract informative features from it.

# Load in the train and test datasets
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')
 
# Store our passenger ID for easy access
PassengerId = test['PassengerId']
 
train.head(3)

After loading the CSV files with pandas, the first three rows of the training set look like this:
[Output: first three rows of the training set]
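
Before engineering any features, it is worth checking which columns actually contain missing values, since the steps below impute Age, Embarked, and Fare. A minimal sketch:

# Count missing values per column (Age, Cabin, and Embarked are the usual gaps in this dataset)
print(train.isnull().sum())
print(test.isnull().sum())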

full_data = [train, test]
# Feature engineering: add extra features derived from the existing columns
# 1. Name length
train['Name_length'] = train['Name'].apply(len)
test['Name_length'] = test['Name'].apply(len)
# 2. Whether the passenger has a Cabin entry (missing values are NaN, i.e. float)
train['Has_Cabin'] = train['Cabin'].apply(lambda x: 0 if type(x) == float else 1)
test['Has_Cabin'] = test['Cabin'].apply(lambda x: 0 if type(x) == float else 1)
# 3. Total number of family members aboard
for dataset in full_data:
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1
# 4. Whether the passenger is travelling alone
for dataset in full_data:
    dataset['IsAlone'] = 0
    # loc[row condition, column to set]
    dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1
# 5. Fill missing values in the Embarked (port of embarkation) column
for dataset in full_data:
    dataset['Embarked'] = dataset['Embarked'].fillna('S')
# 6. Fill missing Fare (ticket price) values with the training-set median
for dataset in full_data:
    dataset['Fare'] = dataset['Fare'].fillna(train['Fare'].median())
train['CategoricalFare'] = pd.qcut(train['Fare'], 4)  # split into four quantile bins
# 7. Impute missing ages and create a new CategoricalAge feature
for dataset in full_data:
    age_avg = dataset['Age'].mean()
    age_std = dataset['Age'].std()
    age_null_count = dataset['Age'].isnull().sum()
    # draw random ages within one standard deviation of the mean for the missing entries
    age_null_random_list = np.random.randint(age_avg - age_std, age_avg + age_std, size=age_null_count)
    dataset.loc[np.isnan(dataset['Age']), 'Age'] = age_null_random_list
    dataset['Age'] = dataset['Age'].astype(int)
train['CategoricalAge'] = pd.cut(train['Age'], 5)  # split into five equal-width bins
# 8. Helper that extracts the Title (e.g. Mr, Mrs) from a passenger's name
def get_title(name):
    title_search = re.search(r'([A-Za-z]+)\.', name)
    # if a title is found, extract and return it
    if title_search:
        return title_search.group(1)
    return ""
# 9. Create a new Title feature
for dataset in full_data:
    dataset['Title'] = dataset['Name'].apply(get_title)
# 10. Group rare titles into a single 'Rare' category and normalise spelling variants
for dataset in full_data:
    dataset['Title'] = dataset['Title'].replace(['Lady','Countess','Capt','Col','Don','Dr','Major','Rev','Sir','Jonkheer','Dona'],'Rare')
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms','Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
# 11. Map categorical strings to integers
for dataset in full_data:
    # map Sex to 0/1
    dataset['Sex'] = dataset['Sex'].map({'female':0, 'male':1}).astype(int)

    # map Title to 0-5
    title_mapping = {"Mr":1, "Miss":2, "Mrs":3, "Master":4, "Rare":5}
    dataset['Title'] = dataset['Title'].map(title_mapping)
    dataset['Title'] = dataset['Title'].fillna(0)

    # map Embarked to 0-2
    dataset['Embarked'] = dataset['Embarked'].map({'S':0, 'C':1, 'Q':2}).astype(int)

    # bin Fare into four categories 0-3
    # (boundaries taken from the CategoricalFare quartiles above)
    dataset.loc[dataset['Fare'] <= 7.91, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare'] = 2
    dataset.loc[dataset['Fare'] > 31, 'Fare'] = 3
    dataset['Fare'] = dataset['Fare'].astype(int)

    # bin Age into five categories 0-4
    dataset.loc[dataset['Age'] <= 16, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
    dataset.loc[dataset['Age'] > 64, 'Age'] = 4

Next, we need to drop some features that we cannot use directly:

# 12. Drop features we cannot use directly
drop_elements = ['PassengerId', 'Name', 'Ticket', 'Cabin', 'SibSp']
train = train.drop(drop_elements, axis=1)
train = train.drop(['CategoricalAge', 'CategoricalFare'], axis=1)
test = test.drop(drop_elements, axis=1)

So far, we have engineered, processed, and filtered the features.

Next, let's do some simple visualization of the current data to help with further analysis.

print(train.head(3))

[Output: first three rows of the processed training set]
Next, let's look at the correlations between the current features:

colormap = plt.cm.viridis
plt.figure(figsize=(14,12))
plt.title('Pearson Correlation of Features', y=1.05, size=15)
sns.heatmap(train.astype(float).corr(),linewidths=0.1,vmax=1.0, square=True, cmap=colormap, linecolor='white', annot=True)

The resulting correlation heatmap:
[Figure: Pearson correlation heatmap of the features]
The Pearson correlation coefficient measures the strength of the linear relationship between two variables.

The closer its absolute value is to 1, the stronger the linear relationship between two features; values near 0 indicate little or no linear relationship.
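
Each cell of the heatmap can also be reproduced directly with pandas. A small sketch (column names taken from the processed DataFrame above):

# Pearson correlation between ticket class and the binned fare
print(train['Pclass'].corr(train['Fare']))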

Setting up cross-validation parameters

Next, we compute a few values that will be used during training:

ntrain = train.shape[0]
ntest = test.shape[0]
SEED = 0 # for reproducibility
NFOLDS = 5 # set folds for out-of-fold prediction
kf = KFold(n_splits=NFOLDS, shuffle=True, random_state=SEED)  # model_selection API; shuffle so random_state takes effect
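
As a quick sketch of what this gives us (assuming the standard Titanic training set of 891 rows), each of the NFOLDS splits holds roughly one fifth of the rows as a validation fold:

# Inspect the fold sizes produced by kf.split (illustration only)
for i, (train_index, test_index) in enumerate(kf.split(np.arange(ntrain))):
    print(i, len(train_index), len(test_index))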

Wrapping the classifiers

Next, we wrap the sklearn classifiers in a small helper class so they can all be called in the same way later:

class SklearnHelper(object):
    def __init__(self, clf, seed=0, params=None):
        params['random_state'] = seed
        self.clf = clf(**params)
 
    def train(self, x_train, y_train):
        self.clf.fit(x_train, y_train)
 
    def predict(self, x):
        return self.clf.predict(x)
    
    def fit(self,x,y):
        return self.clf.fit(x,y)
    
    def feature_importances(self, x, y):
        # fit the model, then print and return its feature importances
        importances = self.clf.fit(x, y).feature_importances_
        print(importances)
        return importances

We also define a helper that produces out-of-fold (OOF) predictions for any wrapped classifier. These OOF predictions become the training features for the second-level XGBoost model, and they ensure that no base model ever predicts on rows it was trained on:

def get_oof(clf, x_train, y_train, x_test):
    oof_train = np.zeros((ntrain,))
    oof_test = np.zeros((ntest,))
    oof_test_skf = np.empty((NFOLDS, ntest))
 
    for i, (train_index, test_index) in enumerate(kf.split(x_train)):
        x_tr = x_train[train_index]
        y_tr = y_train[train_index]
        x_te = x_train[test_index]
 
        clf.train(x_tr, y_tr)
 
        oof_train[test_index] = clf.predict(x_te)
        oof_test_skf[i, :] = clf.predict(x_test)
 
    oof_test[:] = oof_test_skf.mean(axis=0)
    return oof_train.reshape(-1, 1), oof_test.reshape(-1, 1)

Next, we use five base models for classification:

Model construction

First, we set up the parameters for each model:

## Model parameter settings
# 1. Random Forest
rf_params = {
    'n_jobs':-1,
    'n_estimators':500,
    'warm_start':True,
    # 'max_features':0.2,
    'max_depth':6,
    'min_samples_leaf':2,
    'max_features':'sqrt',
    'verbose':0
}
# 2. Extra Trees
et_params = {
    'n_jobs':-1,
    'n_estimators':500,
    # 'max_features':0.5,
    'max_depth':8,
    'min_samples_leaf':2,
    'verbose':0
}
# 3. Adaboost
ada_params = {
    'n_estimators': 500,
    'learning_rate':0.75
}
# 4. Gradient Boosting
gb_params = {
    'n_estimators':500,
    'max_features':0.2,
    'max_depth':5,
    'min_samples_leaf':2,
    'verbose':0
}
# 5. SVM
svc_params = {
    'kernel' : 'linear',
    'C':0.025
}

Next, we create the model objects from these parameters:

# Model construction
rf = SklearnHelper(clf=RandomForestClassifier, seed=SEED, params=rf_params)
et = SklearnHelper(clf=ExtraTreesClassifier, seed=SEED, params=et_params)
ada = SklearnHelper(clf=AdaBoostClassifier, seed=SEED, params=ada_params)
gb = SklearnHelper(clf=GradientBoostingClassifier, seed=SEED, params=gb_params)
svc = SklearnHelper(clf=SVC, seed=SEED, params=svc_params)

Next, we convert our data into the NumPy arrays that the models expect:

# Convert the DataFrames to NumPy arrays
y_train = train['Survived'].ravel()
train = train.drop(['Survived'], axis=1)
x_train = train.values # Creates an array of the train data
x_test = test.values # Creates an array of the test data

Next, we generate out-of-fold predictions with each of the five base models; these predictions will later serve as the input features for the second-level XGBoost model:

et_oof_train, et_oof_test = get_oof(et, x_train, y_train, x_test) # Extra Trees
rf_oof_train, rf_oof_test = get_oof(rf,x_train, y_train, x_test) # Random Forest
ada_oof_train, ada_oof_test = get_oof(ada, x_train, y_train, x_test) # AdaBoost 
gb_oof_train, gb_oof_test = get_oof(gb,x_train, y_train, x_test) # Gradient Boost
svc_oof_train, svc_oof_test = get_oof(svc,x_train, y_train, x_test) # Support Vector Classifier
 
print("Training is complete")

We then extract the feature importances from each model (SVC is excluded because it has no feature_importances_ attribute):

rf_features = rf.feature_importances(x_train, y_train)
et_features = et.feature_importances(x_train, y_train)
ada_features = ada.feature_importances(x_train, y_train)
gb_features = gb.feature_importances(x_train, y_train)

[Output: feature importance arrays for each model]
Collecting these importances into a single DataFrame:

cols = train.columns.values
# Create a dataframe with features
feature_dataframe = pd.DataFrame({
    'features': cols,
    'Random Forest feature importances': rf_features,
    'Extra Trees  feature importances': et_features,
    'AdaBoost feature importances': ada_features,
    'Gradient Boost feature importances': gb_features
})

These are easier to read as plots:

trace = go.Scatter(
    y = feature_dataframe['Random Forest feature importances'].values,
    x = feature_dataframe['features'].values,
    mode='markers',
    marker=dict(
        sizemode = 'diameter',
        sizeref = 1,
        size = 25,
#       size= feature_dataframe['AdaBoost feature importances'].values,
        #color = np.random.randn(500), #set color equal to a variable
        color = feature_dataframe['Random Forest feature importances'].values,
        colorscale='Portland',
        showscale=True
    ),
    text = feature_dataframe['features'].values
)
data = [trace]
 
layout= go.Layout(
    autosize= True,
    title= 'Random Forest Feature Importance',
    hovermode= 'closest',
#     xaxis= dict(
#         title= 'Pop',
#         ticklen= 5,
#         zeroline= False,
#         gridwidth= 2,
#     ),
    yaxis=dict(
        title= 'Feature Importance',
        ticklen= 5,
        gridwidth= 2
    ),
    showlegend= False
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig,filename='scatter2010')
 
# Scatter plot 
trace = go.Scatter(
    y = feature_dataframe['Extra Trees  feature importances'].values,
    x = feature_dataframe['features'].values,
    mode='markers',
    marker=dict(
        sizemode = 'diameter',
        sizeref = 1,
        size = 25,
#       size= feature_dataframe['AdaBoost feature importances'].values,
        #color = np.random.randn(500), #set color equal to a variable
        color = feature_dataframe['Extra Trees  feature importances'].values,
        colorscale='Portland',
        showscale=True
    ),
    text = feature_dataframe['features'].values
)
data = [trace]
 
layout= go.Layout(
    autosize= True,
    title= 'Extra Trees Feature Importance',
    hovermode= 'closest',
#     xaxis= dict(
#         title= 'Pop',
#         ticklen= 5,
#         zeroline= False,
#         gridwidth= 2,
#     ),
    yaxis=dict(
        title= 'Feature Importance',
        ticklen= 5,
        gridwidth= 2
    ),
    showlegend= False
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig,filename='scatter2010')
 
# Scatter plot 
trace = go.Scatter(
    y = feature_dataframe['AdaBoost feature importances'].values,
    x = feature_dataframe['features'].values,
    mode='markers',
    marker=dict(
        sizemode = 'diameter',
        sizeref = 1,
        size = 25,
#       size= feature_dataframe['AdaBoost feature importances'].values,
        #color = np.random.randn(500), #set color equal to a variable
        color = feature_dataframe['AdaBoost feature importances'].values,
        colorscale='Portland',
        showscale=True
    ),
    text = feature_dataframe['features'].values
)
data = [trace]
 
layout= go.Layout(
    autosize= True,
    title= 'AdaBoost Feature Importance',
    hovermode= 'closest',
#     xaxis= dict(
#         title= 'Pop',
#         ticklen= 5,
#         zeroline= False,
#         gridwidth= 2,
#     ),
    yaxis=dict(
        title= 'Feature Importance',
        ticklen= 5,
        gridwidth= 2
    ),
    showlegend= False
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig,filename='scatter2010')
 
# Scatter plot 
trace = go.Scatter(
    y = feature_dataframe['Gradient Boost feature importances'].values,
    x = feature_dataframe['features'].values,
    mode='markers',
    marker=dict(
        sizemode = 'diameter',
        sizeref = 1,
        size = 25,
#       size= feature_dataframe['AdaBoost feature importances'].values,
        #color = np.random.randn(500), #set color equal to a variable
        color = feature_dataframe['Gradient Boost feature importances'].values,
        colorscale='Portland',
        showscale=True
    ),
    text = feature_dataframe['features'].values
)
data = [trace]
 
layout= go.Layout(
    autosize= True,
    title= 'Gradient Boosting Feature Importance',
    hovermode= 'closest',
#     xaxis= dict(
#         title= 'Pop',
#         ticklen= 5,
#         zeroline= False,
#         gridwidth= 2,
#     ),
    yaxis=dict(
        title= 'Feature Importance',
        ticklen= 5,
        gridwidth= 2
    ),
    showlegend= False
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig,filename='scatter2010')

[Figures: feature importance scatter plots for Random Forest, Extra Trees, AdaBoost, and Gradient Boosting]
Next, we add a new column holding each feature's mean importance across the four models:

feature_dataframe['mean'] = feature_dataframe.mean(axis= 1) # axis = 1 computes the mean row-wise
# take a look at the current data format
feature_dataframe.head(3)

[Output: first three rows of feature_dataframe]
A bar chart makes it easy to see how important each feature is:

y = feature_dataframe['mean'].values
x = feature_dataframe['features'].values
data = [go.Bar(
    x=x,
    y=y,
    width=0.5,
    marker=dict(
        color=feature_dataframe['mean'].values,
        colorscale='Portland',
        showscale=True,
        reversescale=False
    ),
    opacity=0.6
)]
 
layout= go.Layout(
    autosize= True,
    title= 'Barplots of Mean Feature Importance',
    hovermode= 'closest',
#     xaxis= dict(
#         title= 'Pop',
#         ticklen= 5,
#         zeroline= False,
#         gridwidth= 2,
#     ),
    yaxis=dict(
        title= 'Feature Importance',
        ticklen= 5,
        gridwidth= 2
    ),
    showlegend= False
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='bar-direct-labels')

[Figure: bar plot of mean feature importance]
Let's collect the base models' out-of-fold training predictions and take a look:

base_predictions_train = pd.DataFrame({
    'RandomForest': rf_oof_train.ravel(),
    'ExtraTrees': et_oof_train.ravel(),
    'AdaBoost': ada_oof_train.ravel(),
    'GradientBoost': gb_oof_train.ravel()
})
base_predictions_train.head()

[Output: first rows of base_predictions_train]
Finally, let's look at the correlations between these four models' predictions; base learners whose predictions are less correlated generally add more value when stacked:

data = [
    go.Heatmap(
        z=base_predictions_train.astype(float).corr().values,
        x=base_predictions_train.columns.values,
        y=base_predictions_train.columns.values,
        colorscale='Viridis',
        showscale=True,
        reversescale=True
    )
]
py.iplot(data, filename='labelled-heatmap')

[Figure: correlation heatmap of the base models' predictions]
Finally, we stack the out-of-fold predictions into new feature matrices, train the second-level XGBoost model on them, predict on the test set, and generate the submission CSV file:

x_train = np.concatenate(( et_oof_train, rf_oof_train, ada_oof_train, gb_oof_train, svc_oof_train), axis=1)
x_test = np.concatenate(( et_oof_test, rf_oof_test, ada_oof_test, gb_oof_test, svc_oof_test), axis=1)
 
gbm = xgb.XGBClassifier(
    # learning_rate=0.02,
    n_estimators=2000,
    max_depth=4,
    min_child_weight=2,
    # gamma=1,
    gamma=0.9,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='binary:logistic',
    nthread=-1,
    scale_pos_weight=1).fit(x_train, y_train)
predictions = gbm.predict(x_test)
 
# Generate Submission File 
StackingSubmission = pd.DataFrame({ 'PassengerId': PassengerId,
                            'Survived': predictions })
StackingSubmission.to_csv("StackingSubmission.csv", index=False)

After this runs, we can see that the StackingSubmission.csv file has been generated successfully.

This file is exactly the result format that the Kaggle competition requires!

So, do you have a bit of a feel for Kaggle competitions now? Go ahead and join one!
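
Before actually uploading, you can optionally estimate the second-level model's accuracy locally. A hedged sketch using 5-fold cross-validation on the stacked meta-features (the parameters simply mirror the gbm model above):

from sklearn.model_selection import cross_val_score

cv_model = xgb.XGBClassifier(
    n_estimators=2000, max_depth=4, min_child_weight=2, gamma=0.9,
    subsample=0.8, colsample_bytree=0.8, objective='binary:logistic',
    scale_pos_weight=1)
cv_scores = cross_val_score(cv_model, x_train, y_train, cv=5, scoring='accuracy')
print(cv_scores.mean())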
