Kaggle Titanic Data Modeling and Analysis

1. Data Visualization

Data description on Kaggle: https://www.kaggle.com/c/titanic/data

Data format:

[Figure: titanic_data - preview of the training table]

Read the data and display its summary information:

import pandas as pd

data_train = pd.read_csv("./data/train.csv")
data_train.info()  # info() prints its report directly; no print() wrapper needed

The output is as follows:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object

Field descriptions:

PassengerId => passenger ID
Survived => whether the passenger survived (present only in the training set, not the test set)
Pclass => passenger class (1st/2nd/3rd class)
Name => passenger name
Sex => sex
Age => age
SibSp => number of siblings/spouses aboard
Parch => number of parents/children aboard
Ticket => ticket number
Fare => fare paid
Cabin => cabin number
Embarked => port of embarkation

 

1.1 Survival/Death Counts

[Figure: bar chart of survival/death counts]

import matplotlib.pyplot as plt

# Count survivors vs. deaths
def sur_die_analysis(data_train):
    fig = plt.figure()
    fig.set(alpha=0.2)  # set the figure's alpha
    data_train.Survived.value_counts().plot(kind='bar')  # bar chart
    plt.title("Survival counts (1 = survived)")
    plt.ylabel("Count")
    plt.show()

1.2 Pclass

[Figure: passenger class distribution by survival]

# Pclass
def pclass_analysis(data_train):
    fig = plt.figure()
    fig.set(alpha=0.2)  # set the figure's alpha
    sur_data = data_train.Pclass[data_train.Survived == 1].value_counts()
    die_data = data_train.Pclass[data_train.Survived == 0].value_counts()
    pd.DataFrame({'Survived': sur_data, 'Died': die_data}).plot(kind='bar')
    plt.ylabel("Count")
    plt.title("Passenger class distribution")
    plt.show()

The distribution makes it clear that passengers in Pclass 1 and 2 survived at a much higher rate than those in Pclass 3.
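To put a number on the gap, a one-line sketch (assuming data_train is loaded as at the top) prints the survival rate per class:

# Survival rate per passenger class (Survived is 0/1, so the mean is the rate)
print(data_train.groupby('Pclass')['Survived'].mean())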

1.3 Sex

[Figure: survival by sex]

# Sex
def sex_analysis(data_train):
    no_survived_g = data_train.Sex[data_train.Survived == 0].value_counts()
    survived_g = data_train.Sex[data_train.Survived == 1].value_counts()
    df_g = pd.DataFrame({'Survived': survived_g, 'Died': no_survived_g})
    df_g.plot(kind='bar', stacked=True)
    plt.title('Survival by sex')
    plt.xlabel('Sex')
    plt.ylabel('Count')
    plt.show()

Women survived at a markedly higher rate than men.

1.4 Age

[Figure: survival by age band]

# Age: split ages into ten 10-year bands and count survivors/deaths per band
def age_analysis(data_train):
    data_series = pd.DataFrame(columns=['Survived', 'Died'])
    cols = []
    for num in range(0, 10):
        col = str(num * 10) + "-" + str((num + 1) * 10)
        cols.append(col)
        in_band = (data_train.Age >= 10 * num) & (data_train.Age < 10 * (num + 1))
        sur_cnt = data_train.Age[in_band & (data_train.Survived == 1)].shape[0]
        die_cnt = data_train.Age[in_band & (data_train.Survived == 0)].shape[0]
        data_series.loc[num] = [sur_cnt, die_cnt]
    data_series.index = cols
    data_series.plot(kind='bar', stacked=True)
    plt.ylabel("Count")  # the y axis shows counts, not rates
    plt.grid(True, which='major', axis='y')
    plt.title("Survival by age band")
    plt.show()

The youngest age bands account for a visibly larger share of survivors.
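As an aside, the manual binning loop above can be written more compactly with pd.cut and crosstab; a minimal sketch over 10-year bands:

# Cut ages into 10-year bands and tabulate deaths/survivals per band
age_band = pd.cut(data_train['Age'], bins=range(0, 101, 10))
counts = pd.crosstab(age_band, data_train['Survived'])
counts.columns = ['Died', 'Survived']  # crosstab sorts the columns: 0, 1
counts.plot(kind='bar', stacked=True)
plt.ylabel('Count')
plt.show()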

1.5 Family: SibSp + Parch

Define a Family feature as the number of family members aboard, discretized into three levels:

0: no family members

1: 1-4 members

2: more than 4 members

[Figure: survival by family size level]

# Family: SibSp + Parch, number of family members aboard
def family_analysis(data_train):
    data_train['Family'] = data_train['SibSp'] + data_train['Parch']
    # 0 stays level 0; 1-4 members map to level 1, more than 4 to level 2
    data_train.loc[(data_train.Family >= 1) & (data_train.Family <= 4), 'Family'] = 1
    data_train.loc[data_train.Family > 4, 'Family'] = 2

    no_survived_g = data_train.Family[data_train.Survived == 0].value_counts()
    survived_g = data_train.Family[data_train.Survived == 1].value_counts()
    df_g = pd.DataFrame({'Survived': survived_g, 'Died': no_survived_g})
    df_g.plot(kind='bar', stacked=True)
    plt.title('Survival by family size')
    plt.xlabel('Level: 0 = none, 1 = 1-4, 2 = >4')
    plt.ylabel('Count')
    plt.show()

Since the level sizes are very unbalanced, to see how family size actually relates to the survival rate you can divide each level's counts by its total head count, as sketched below; this is not elaborated further here.
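A minimal sketch of that normalization, assuming df_g is the frame built in family_analysis above:

# Divide each family-size level by its total head count so the bars
# show survival proportions instead of raw, unbalanced counts
df_rate = df_g.div(df_g.sum(axis=1), axis=0)
df_rate.plot(kind='bar', stacked=True)
plt.ylabel('Proportion')
plt.show()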

1.6 Fare

Fare statistics:

[Figure: fare density for survivors vs. deaths]

Once the fare rises past a certain point, survivors outnumber deaths.

[Figure: survival counts by fare band]

 

# Fare
def fare_analysis(data_train):
    # KDE plot of fare for survivors / deaths / everyone:
    # data_train.Fare[data_train.Survived == 1].plot(kind='kde')
    # data_train.Fare[data_train.Survived == 0].plot(kind='kde')
    # data_train["Fare"].plot(kind='kde')
    # plt.legend(('survived', 'died', 'all'), loc='best')
    # plt.show()
    data_train['NewFare'] = data_train['Fare']
    data_train.loc[data_train.Fare < 50, 'NewFare'] = 0
    data_train.loc[(data_train.Fare >= 50) & (data_train.Fare < 100), 'NewFare'] = 1
    data_train.loc[(data_train.Fare >= 100) & (data_train.Fare < 150), 'NewFare'] = 2
    data_train.loc[(data_train.Fare >= 150) & (data_train.Fare < 200), 'NewFare'] = 3
    data_train.loc[data_train.Fare >= 200, 'NewFare'] = 4
    no_survived_g = data_train.NewFare[data_train.Survived == 0].value_counts()
    survived_g = data_train.NewFare[data_train.Survived == 1].value_counts()
    df_g = pd.DataFrame({'Survived': survived_g, 'Died': no_survived_g})
    df_g.plot(kind='bar', stacked=True)
    plt.title('Survival by fare band')
    plt.xlabel('Fare band')
    plt.ylabel('Count')
    plt.show()

It is clear that passengers in the higher fare bands survived at a much higher rate.

Refinement:

The five fare bands above were chosen arbitrarily; how many bands actually fit the data best?

Clustering can be used to search for a good number of bands, after which each fare value is mapped to one of the clusters:

from sklearn.cluster import KMeans

def fare_kmeans_search(data_train):
    for i in range(2, 10):
        clusters = KMeans(n_clusters=i)
        clusters.fit(data_train['Fare'].values.reshape(-1, 1))
        # inertia_ is the within-cluster sum of squared distances; it always
        # shrinks as the cluster count grows, so look for the "elbow"
        print(i, clusters.inertia_)

Printed output:

 2 846932.9762272763
 3 399906.26606199215
 4 195618.50643749788
 5 104945.73652631264
 6 52749.474696547695
 7 35141.316334118805
 8 26030.553497795216
 9 19501.242236941747

The inertia keeps shrinking as the cluster count grows, so the question is where the relative drop flattens out; judging by that elbow, 5 clusters is a reasonable choice, and all fares are mapped onto these five classes.
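To make the elbow easier to judge, the relative drop in inertia between successive cluster counts can be printed directly; a small sketch reusing the loop above (the exact elbow remains a judgment call):

# Proportional inertia drop gained by each extra cluster; without a fixed
# random_state the numbers vary slightly from run to run
inertias = {}
for k in range(2, 10):
    km = KMeans(n_clusters=k).fit(data_train['Fare'].values.reshape(-1, 1))
    inertias[k] = km.inertia_
for k in range(3, 10):
    print(k, 1 - inertias[k] / inertias[k - 1])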

# Cluster the fares; the elbow search above suggests k = 5
def fare_kmeans(data_train):
    clusters = KMeans(n_clusters=5)
    clusters.fit(data_train['Fare'].values.reshape(-1, 1))
    predict = clusters.predict(data_train['Fare'].values.reshape(-1, 1))
    print(predict)
    data_train['NewFare'] = predict
    print(data_train[['NewFare', 'Survived']].groupby(['NewFare'], as_index=False).mean())
    print(clusters.inertia_)

The survival rate for each mapped level is shown below (visibly better separated than the arbitrary bands above):

  NewFare  Survived
0        0  0.319832
1        1  0.647059
2        2  0.606557
3        3  1.000000
4        4  0.757576
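One caveat about the table: KMeans numbers its clusters arbitrarily, so NewFare 0-4 are not ordered from cheapest to most expensive. If ordered labels are wanted, the clusters can be re-mapped by their centers; a minimal sketch, assuming clusters and data_train from the function above:

import numpy as np

# Relabel so that 0 = cheapest fare band and 4 = most expensive
order = np.argsort(clusters.cluster_centers_.ravel())
relabel = {old: new for new, old in enumerate(order)}
data_train['NewFare'] = data_train['NewFare'].map(relabel)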

1.7 Embarked

[Figure: survival by port of embarkation]

# Embarked: port of embarkation
def embarked_analysis(data_train):
    no_survived_g = data_train.Embarked[data_train.Survived == 0].value_counts()
    survived_g = data_train.Embarked[data_train.Survived == 1].value_counts()
    df_g = pd.DataFrame({'Survived': survived_g, 'Died': no_survived_g})
    df_g.plot(kind='bar', stacked=True)
    plt.title('Survival by port of embarkation')
    plt.xlabel('Embarked')
    plt.ylabel('Count')
    plt.show()

As for the port of embarkation, there is no dramatic gap between the three ports; survival at C is slightly higher than at S and Q.

2. Data Preprocessing

As the data summary at the start shows, several columns are partially missing: Age / Cabin / Embarked.

Missing values are handled here with simple imputation (the median, mean, or mode are all reasonable fill choices):

The fare is discretized at the same time.

def dataPreprocess(df):
    df.loc[df['Sex'] == 'male', 'Sex'] = 0
    df.loc[df['Sex'] == 'female', 'Sex'] = 1

    # Embarked has two missing entries; fill them first
    df['Embarked'] = df['Embarked'].fillna('S')
    # Some ages are missing; fill them with the median
    df['Age'] = df['Age'].fillna(df['Age'].median())

    df.loc[df['Embarked'] == 'S', 'Embarked'] = 0
    df.loc[df['Embarked'] == 'C', 'Embarked'] = 1
    df.loc[df['Embarked'] == 'Q', 'Embarked'] = 2

    df['FamilySize'] = df['SibSp'] + df['Parch']
    df['IsAlone'] = 0
    df.loc[df['FamilySize'] == 0, 'IsAlone'] = 1
    # drop() returns a new frame, so the result must be assigned back
    df = df.drop(['FamilySize', 'Parch', 'SibSp'], axis=1)
    return fare_kmeans(df)

def fare_kmeans(data_train):
    clusters = KMeans(n_clusters=5)
    clusters.fit(data_train['Fare'].values.reshape(-1, 1))
    predict = clusters.predict(data_train['Fare'].values.reshape(-1, 1))
    data_train['NewFare'] = predict
    data_train = data_train.drop('Fare', axis=1)
    # print(data_train[['NewFare', 'Survived']].groupby(['NewFare'], as_index=False).mean())
    # print(clusters.inertia_)
    return data_train

Categorical features are encoded here with plain label encoding; one-hot encoding could be used instead, so that every pair of categories is equally far apart.
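A minimal sketch of the one-hot alternative with pd.get_dummies, applied to the raw S/C/Q values, i.e. before the label-encoding lines above (the generated column names depend on those values):

# Replace the single Embarked column with three 0/1 indicator columns,
# so no artificial ordering S < C < Q is imposed
df = pd.get_dummies(df, columns=['Embarked'], prefix='Embarked')
# Produces Embarked_S / Embarked_C / Embarked_Q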

3. Feature Selection

The sections above gave an intuitive feel for how each feature relates to survival. sklearn also provides feature-scoring functions that make each feature's importance easy to read off:

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

predictors = ["Pclass", "Sex", "Age", "NewFare", "Embarked", 'IsAlone']

# Perform feature selection
selector = SelectKBest(f_classif, k=5)
selector.fit(data_train[predictors], data_train["Survived"])

# Take the raw p-values for each feature and turn them into scores
scores = -np.log10(selector.pvalues_)

# Plot the scores to see which features stand out
plt.bar(range(len(predictors)), scores)
plt.xticks(range(len(predictors)), predictors, rotation='vertical')
plt.show()

[Figure: SelectKBest feature scores]

The chart above shows which of the six input features matter most.
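If the bars are hard to compare, the same scores can be printed as a sorted list; a small sketch using the predictors and scores computed above:

# Pair each feature with its score, highest first
for name, score in sorted(zip(predictors, scores), key=lambda p: -p[1]):
    print("%-10s %.3f" % (name, score))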

4. Linear Regression Model

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def linearRegression(df):
    predictors = ['Pclass', 'Sex', 'Age', 'IsAlone', 'NewFare', 'Embarked']
    # predictors = ['Pclass', 'Sex', 'Age', 'IsAlone', 'NewFare', 'EmbarkedS', 'EmbarkedC', 'EmbarkedQ']

    alg = LinearRegression()
    X = df[predictors]
    Y = df['Survived']
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

    # Print the sizes of the training and test sets
    print(X_train.shape)
    print(Y_train.shape)
    print(X_test.shape)
    print(Y_test.shape)

    # Fit the model
    alg.fit(X_train, Y_train)

    print(alg.intercept_)
    print(alg.coef_)

    # Threshold the regression output at 0.5 to get a 0/1 class
    Y_predict = alg.predict(X_test)
    Y_predict[Y_predict >= 0.5] = 1
    Y_predict[Y_predict < 0.5] = 0
    acc = sum(Y_predict == Y_test) / len(Y_predict)
    return acc

Measured prediction accuracy on the test split: 0.79

5. Random Forest Model

Train on the five most valuable features and check the model's performance:

from sklearn.ensemble import RandomForestClassifier

def randomForest(data_train):
    # Pick only the five best features
    predictors = ["Pclass", "Sex", "NewFare", "Embarked", 'IsAlone']
    X_train, X_test, Y_train, Y_test = train_test_split(data_train[predictors], data_train['Survived'], test_size=0.2)
    alg = RandomForestClassifier(random_state=1, n_estimators=50, min_samples_split=8, min_samples_leaf=4)
    alg.fit(X_train, Y_train)
    Y_predict = alg.predict(X_test)
    acc = sum(Y_predict == Y_test) / len(Y_predict)
    return acc

In testing, this model's accuracy is 0.811.

A preliminary explanation: Age is not among the five selected features, and since a large share of Age values are missing, it likely drags on prediction accuracy.
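Since a single 80/20 split gives a fairly noisy accuracy estimate, cross-validation would give a steadier figure; a minimal sketch with sklearn's cross_val_score, reusing the alg and predictors from the function above:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy; the mean is more stable than one split
scores = cross_val_score(alg, data_train[predictors], data_train['Survived'], cv=5)
print(scores.mean(), scores.std())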

 

The code has been pushed to GitHub: https://github.com/lsfzlj/kaggle

Corrections and discussion are welcome.

References:

https://blog.csdn.net/han_xiaoyang/article/details/49797143

https://blog.csdn.net/CSDN_Black/article/details/80309542

https://www.kaggle.com/sinakhorami/titanic-best-working-classifier

 
