A first look at Kaggle: Titanic survival prediction

Continuing my study of data mining, I tried the Titanic survival prediction competition on Kaggle.

Titanic for Machine Learning

Imports and loading the data

# data processing
import numpy as np
import pandas as pd
import re
# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
train = pd.read_csv('D:/data/titanic/train.csv')
test = pd.read_csv('D:/data/titanic/test.csv')
train.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

The features are:
- PassengerId: no particular meaning.
- Pclass: cabin class. Does it affect survival? Did higher-class passengers have a better chance?
- Name: can help us infer sex and approximate age.
- Sex: was the survival rate higher for women?
- Age: do different age groups survive at different rates?
- SibSp and Parch: numbers of siblings/spouses and parents/children aboard. Does having relatives on board raise or lower the survival rate?
- Fare: ticket price. Did a higher fare mean a better chance?
- Cabin and Embarked: cabin and port of embarkation... intuitively these should have no effect on survival.

train.describe()
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
train.describe(include=['O'])#['O'] indicates category feature
Name Sex Ticket Cabin Embarked
count 891 891 891 204 889
unique 891 2 681 147 3
top Hippach, Mrs. Louis Albert (Ida Sophia Fischer) male 1601 C23 C25 C27 S
freq 1 577 7 4 644

The target feature: Survived

survive_num = train.Survived.value_counts()
survive_num.plot.pie(explode=[0,0.1],autopct='%1.1f%%',labels=['died','survived'],shadow=True)
plt.show()

(figure: pie chart of died vs. survived proportions)

x=[0,1]
plt.bar(x,survive_num,width=0.35)
plt.xticks(x,('died','survived'))
plt.show()

(figure: bar chart of died vs. survived counts)

Feature analysis

num_f = [f for f in train.columns if train.dtypes[f] != 'object']
cat_f = [f for f in train.columns if train.dtypes[f]=='object']
print('there are %d numerical features:'%len(num_f),num_f)
print('there are %d category features:'%len(cat_f),cat_f)

there are 7 numerical features: ['PassengerId', 'Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
there are 5 category features: ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']

Feature types:
- numerical
- categorical: ordinal / non-ordinal
- non-ordinal categorical: Sex, Embarked

Categorical features

Sex

train.groupby(['Sex'])['Survived'].count()
Sex
female    314
male      577
Name: Survived, dtype: int64
f,ax = plt.subplots(figsize=(8,6))
fig = sns.countplot(x='Sex',hue='Survived',data=train)
fig.set_title('Sex:Survived vs Dead')
plt.show()

(figure: countplot of Sex split by Survived — "Sex: Survived vs Dead")

train.groupby(['Sex'])['Survived'].sum()/train.groupby(['Sex'])['Survived'].count()
Sex
female    0.742038
male      0.188908
Name: Survived, dtype: float64

There were far more men than women aboard, yet the survival rate for women is around 75%, far above the 18-19% for men. Sex is clearly an important feature.

Embarked

sns.factorplot('Embarked','Survived',data=train)
plt.show()

(figure: factorplot of survival rate by Embarked)

f,ax = plt.subplots(1,3,figsize=(24,6))
sns.countplot('Embarked',data=train,ax=ax[0])
ax[0].set_title('No. Of Passengers Boarded')
sns.countplot(x='Embarked',hue='Survived',data=train,ax=ax[1])
ax[1].set_title('Embarked vs Survived')
sns.countplot('Embarked',hue='Pclass',data=train,ax=ax[2])
ax[2].set_title('Embarked vs Pclass')
#plt.subplots_adjust(wspace=0.2,hspace=0.5)
plt.show()

(figure: three countplots — passengers boarded per port, Embarked vs Survived, Embarked vs Pclass)

#pd.pivot_table(train,index='Embarked',columns='Pclass',values='Fare')
sns.boxplot(x='Embarked',y='Fare',hue='Pclass',data=train)
plt.show()

(figure: boxplot of Fare by Embarked, grouped by Pclass)

From the plots, most passengers boarded at port S, the majority of them in class 3, although S also has the most class 1 passengers of the three ports. Port C has the highest survival rate, about 0.55, because the proportion of class 1 passengers from C is higher; almost all passengers from port Q are class 3. The mean fares for classes 1 and 2 at port C are higher, which may suggest that passengers boarding there had higher social status. Logically, though, the port of embarkation itself should not affect survival, so it can be converted to dummy variables or dropped.
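As a quick check on that 0.55 figure, the survival rate per port can be computed directly (my addition, not part of the original notebook):

# survival rate by port of embarkation
train.groupby('Embarked')['Survived'].mean()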

Pclass

train.groupby('Pclass')['Survived'].value_counts()
Pclass  Survived
1       1           136
        0            80
2       0            97
        1            87
3       0           372
        1           119
Name: Survived, dtype: int64
plt.subplots(figsize=(8,6))
f = sns.countplot('Pclass',hue='Survived',data=train)

(figure: countplot of Pclass split by Survived)

sns.factorplot('Pclass','Survived',hue='Sex',data=train)
plt.show()

(figure: factorplot of survival rate by Pclass and Sex)

Classes 1 and 2 have noticeably higher survival rates: over half of class 1 survived and class 2 is roughly even, while for women in classes 1 and 2 the rate is close to 1. Cabin class clearly has a strong effect on survival.
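To put a number on the "close to 1" observation, a small pivot table (my addition) shows the survival rate for each class/sex combination:

# survival rate for each (Pclass, Sex) combination
pd.pivot_table(train, values='Survived', index='Pclass', columns='Sex')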

SibSp

train[["SibSp", "Survived"]].groupby(['SibSp'], as_index=False).mean().sort_values(by='Survived', ascending=False)
SibSp Survived
1 1 0.535885
2 2 0.464286
0 0 0.345395
3 3 0.250000
4 4 0.166667
5 5 0.000000
6 8 0.000000
sns.factorplot('SibSp','Survived',data=train)
plt.show()

(figure: factorplot of survival rate by SibSp)

#pd.pivot_table(train,values='Survived',index='SibSp',columns='Pclass')
sns.countplot(x='SibSp',hue='Pclass',data=train)
plt.show()

(figure: countplot of SibSp grouped by Pclass)

With no companions the survival rate is around 0.3. Passengers with one sibling/spouse have the highest rate, above 0.5, possibly because a larger share of them are in classes 1 and 2. The rate then falls as the number of companions grows, mainly because passengers with more than 3 siblings/spouses are mostly in class 3, where survival for such large groups is very low.

Parch

#pd.pivot_table(train,values='Survived',index='Parch',columns='Pclass')
sns.countplot(x='Parch',hue='Pclass',data=train)
plt.show()

(figure: countplot of Parch grouped by Pclass)

sns.factorplot('Parch','Survived',data=train)
plt.show()

(figure: factorplot of survival rate by Parch)

The trend is similar to SibSp: travelling alone gives a lower survival rate, having 1-3 parents/children gives a higher rate, and beyond that it drops quickly, since most of those passengers are in class 3.

Age

train.groupby('Survived')['Age'].describe()
count mean std min 25% 50% 75% max
Survived
0 424.0 30.626179 14.172110 1.00 21.0 28.0 39.0 74.0
1 290.0 28.343690 14.950952 0.42 19.0 28.0 36.0 80.0
f,ax = plt.subplots(1,2,figsize=(16,6))
sns.violinplot('Pclass','Age',hue='Survived',data=train,split=True,ax=ax[0])
ax[0].set_title('Pclass Age & Survived')
sns.violinplot('Sex','Age',hue='Survived',data=train,split=True,ax=ax[1])
ax[1].set_title('Sex Age & Survived')
plt.show()

(figure: violin plots of Age by Pclass and by Sex, split by Survived)

In first class the ages of those rescued skew younger overall, and survivors span a wide age range; survival between roughly 20 and 50 is relatively high, perhaps because first-class passengers are older on average. Children around 10 years old show a clear jump in survival in classes 2 and 3, and the same jump appears for boys. Surviving women are concentrated in the young-to-middle-aged range, while deaths are most numerous among passengers aged roughly 20-40.
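A rough check of the point about young children (my addition; the age-10 cutoff is arbitrary):

# survival rate of children under 10 in each class
train[train['Age'] < 10].groupby('Pclass')['Survived'].mean()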

Name

The main use of Name is to help identify sex and to fill missing ages using the mean age of passengers sharing the same title.

# use a regular expression to extract the title (Mr, Mrs, Miss, ...) from each name
def getTitle(data):

    name_sal = []
    for i in range(len(data['Name'])):
        name_sal.append(re.findall(r'.\w*\.',data.Name[i]))

    Salut = []
    for i in range(len(name_sal)):
        name = str(name_sal[i])
        name = name[1:-1].replace("'","")
        name = name.replace(".","").strip()
        name = name.replace(" ","")
        Salut.append(name)

    data['Title'] = Salut

getTitle(train)
train.head(2)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Title
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S Mr
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1 0 PC 17599 71.2833 C85 C Mrs
pd.crosstab(train['Title'],train['Sex'])
Sex female male
Title
Capt 0 1
Col 0 2
Countess 1 0
Don 0 1
Dr 1 6
Jonkheer 0 1
Lady 1 0
Major 0 2
Master 0 40
Miss 182 0
Mlle 2 0
Mme 1 0
Mr 0 517
Mrs 124 0
Mrs,L 1 0
Ms 1 0
Rev 0 6
Sir 0 1

A quick vocabulary note on the rarer titles: Mme (Madame) is the equivalent of Mrs for married or professional women of non-English-speaking nationality; Jonkheer is a Dutch honorific for the gentry; Capt is captain; Lady is a noblewoman; Don is a Spanish honorific for nobles and men of standing; the Countess is a female earl; Ms (or Mz) is used when marital status is unknown; Col is colonel; Major is major; Mlle (Mademoiselle) is Miss; Rev is reverend.

Fare

train.groupby('Pclass')['Fare'].mean()
Pclass
1    84.154687
2    20.662183
3    13.675550
Name: Fare, dtype: float64
sns.distplot(train['Fare'].dropna())
plt.xlim((0,200))
plt.xticks(np.arange(0,200,10))
plt.show()

(figure: distribution of Fare)

Summary of the initial analysis:
- Women have a clearly higher survival rate than men.
- First class has a very high survival rate and third class a very low one; women in classes 1 and 2 survive at a rate close to 1.
- Children around 10 years old again show a clear jump in survival.
- SibSp and Parch behave similarly: travelling alone gives a lower rate, 1-2 siblings/spouses or 1-3 parents/children give a higher rate, and beyond that the rate drops sharply.
- Name and Age can be handled across the full dataset: extract the title from Name and use the per-title mean to fill missing ages.

Data processing

# combine the training and test sets
passID = test['PassengerId']
all_data = pd.concat([train,test],keys=["train","test"])
all_data.shape
#all_data.head()
(1309, 13)
# count missing values in train and test
NAs = pd.concat([train.isnull().sum(),train.isnull().sum()/train.isnull().count(),test.isnull().sum(),test.isnull().sum()/test.isnull().count()],axis=1,keys=["train","percent_train","test","percent"])
NAs[NAs.sum(axis=1)>1].sort_values(by="percent",ascending=False)
train percent_train test percent
Cabin 687 0.771044 327.0 0.782297
Age 177 0.198653 86.0 0.205742
Fare 0 0.000000 1.0 0.002392
Embarked 2 0.002245 0.0 0.000000
# drop features with little value: PassengerId and the mostly-missing Cabin
all_data.drop(['PassengerId','Cabin'],axis=1,inplace=True)

all_data.head(2)
Age Embarked Fare Name Parch Pclass Sex SibSp Survived Ticket Title
train 0 22.0 S 7.2500 Braund, Mr. Owen Harris 0 3 male 1 0.0 A/5 21171 Mr
1 38.0 C 71.2833 Cumings, Mrs. John Bradley (Florence Briggs Th… 0 1 female 1 1.0 PC 17599 Mrs

Handling Age

# first extract the title from Name
getTitle(all_data)
pd.crosstab(all_data['Title'], all_data['Sex'])
Sex female male
Title
Capt 0 1
Col 0 4
Countess 1 0
Don 0 1
Dona 1 0
Dr 1 7
Jonkheer 0 1
Lady 1 0
Major 0 2
Master 0 61
Miss 260 0
Mlle 2 0
Mme 1 0
Mr 0 757
Mrs 196 0
Mrs,L 1 0
Ms 2 0
Rev 0 8
Sir 0 1

all_data['Title'] = all_data['Title'].replace(
    ['Lady','Dr','Dona','Mme','Countess'],'Mrs')
all_data['Title'] =all_data['Title'].replace('Mlle','Miss')
all_data['Title'] =all_data['Title'].replace('Mrs,L','Mrs')
all_data['Title'] = all_data['Title'].replace('Ms', 'Miss')
#all_data['Title'] = all_data['Title'].replace('Mme', 'Mrs')
all_data['Title'] = all_data['Title'].replace(['Capt','Col','Don','Major','Rev','Jonkheer','Sir'],'Mr')
'''
all_data['Title'] = all_data.Title.replace({'Mlle':'Miss','Mme':'Mrs','Ms':'Miss','Dr':'Mrs',
                        'Major':'Mr','Lady':'Mrs','Countess':'Mrs',
                        'Jonkheer':'Mr','Col':'Mr','Rev':'Mr',
                        'Capt':'Mr','Sir':'Mr','Don':'Mr','Mrs,L':'Mrs'})

'''
all_data.Title.isnull().sum()
0
all_data[:train.shape[0]].groupby('Title')['Age'].mean()
Title
Master     4.574167
Miss      21.845638
Mr        32.891990
Mrs       36.188034
Name: Age, dtype: float64
# fill missing ages with the (rounded) training-set mean age for each title
all_data.loc[(all_data.Age.isnull()) & (all_data.Title=='Mr'),'Age']=32
all_data.loc[(all_data.Age.isnull())&(all_data.Title=='Mrs'),'Age']=36
all_data.loc[(all_data.Age.isnull())&(all_data.Title=='Master'),'Age']=5
all_data.loc[(all_data.Age.isnull())&(all_data.Title=='Miss'),'Age']=22
#all_data.loc[(all_data.Age.isnull())&(all_data.Title=='other'),'Age']=46

all_data.Age.isnull().sum()
0
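For reference, the same fill can be written without hard-coding the rounded means; this sketch (my addition) derives them from the training rows and maps them onto the missing entries:

# hypothetical alternative: compute per-title mean ages on the training part and fill with them
title_means = all_data[:train.shape[0]].groupby('Title')['Age'].mean()
all_data['Age'] = all_data['Age'].fillna(all_data['Title'].map(title_means))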
all_data[:train.shape[0]][['Title', 'Survived']].groupby(['Title'], as_index=False).mean()
Title Survived
0 Master 0.575000
1 Miss 0.702703
2 Mr 0.158192
3 Mrs 0.777778
f,ax = plt.subplots(1,2,figsize=(16,6))
sns.distplot(all_data[:train.shape[0]].loc[all_data[:train.shape[0]].Sex=='female','Age'],color='red',ax=ax[0])
sns.distplot(all_data[:train.shape[0]].loc[all_data[:train.shape[0]].Sex=='male','Age'],color='blue',ax=ax[0])

sns.distplot(all_data[:train.shape[0]].loc[all_data[:train.shape[0]].Survived==0,'Age' ],
                 color='red', label='Not Survived', ax=ax[1])
sns.distplot(all_data[:train.shape[0]].loc[all_data[:train.shape[0]].Survived==1,'Age' ],
                 color='blue', label='Survived', ax=ax[1])
plt.legend(loc='best')
plt.show()

(figure: Age distributions by sex and by survival)

- Children up to about 16 have a higher survival rate, and the oldest passenger (80) survived.
- A large number of passengers aged 16-40 did not survive.
- Most passengers are between 16 and 40.
- To help the classifiers, bin the ages into bands as a new feature and add a child indicator.

add isChild

def male_female_child(passenger):
    # unpack age and sex
    age,sex = passenger
    # flag children separately
    if age < 16:
        return 'child'
    else:
        return sex
# create the new feature
all_data['person'] = all_data[['Age','Sex']].apply(male_female_child,axis=1)
# ages range from 0 to 80; split into 3 bands: young, young/middle-aged adults, older passengers

all_data['Age_band']=0
all_data.loc[all_data['Age']<=16,'Age_band']=0
all_data.loc[(all_data['Age']>16)&(all_data['Age']<=40),'Age_band']=1
all_data.loc[all_data['Age']>40,'Age_band']=2

Handling Name

df = pd.get_dummies(all_data['Title'],prefix='Title')
all_data = pd.concat([all_data,df],axis=1)
all_data.drop('Title',axis=1,inplace=True)
#drop name
all_data.drop('Name',axis=1,inplace=True)

fillna Embarked

all_data.loc[all_data.Embarked.isnull()]
Age Embarked Fare Parch Pclass Sex SibSp Survived Ticket Title person Age_band
train 61 38.0 NaN 80.0 0 1 female 0 1.0 113572 2 female 1
829 62.0 NaN 80.0 0 1 female 0 1.0 113572 3 female 2

Fare of 80 and first class: these two passengers most likely boarded at port C.
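A rough way to sanity-check that guess (my addition) is to compare first-class fares across the three ports:

# median fare paid by first-class passengers at each port
train[train['Pclass'] == 1].groupby('Embarked')['Fare'].median()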

all_data['Embarked'].fillna('C',inplace=True)

all_data.Embarked.isnull().any()
False
embark_dummy = pd.get_dummies(all_data.Embarked)
all_data = pd.concat([all_data,embark_dummy],axis=1)
all_data.head(2)
Age Embarked Fare Parch Pclass Sex SibSp Survived Ticket person Age_band Title_Master Title_Miss Title_Mr Title_Mrs C Q S
train 0 22.0 S 7.2500 0 3 male 1 0.0 A/5 21171 male 1 0 0 1 0 0 0 1
1 38.0 C 71.2833 0 1 female 1 1.0 PC 17599 female 1 0 0 0 1 1 0 0

Combine SibSp and Parch

# create two new features: Family_size and alone
all_data['Family_size'] = all_data['SibSp']+all_data['Parch']  # total number of relatives aboard
all_data['alone'] = 0  # default: travelling with family
all_data.loc[all_data.Family_size==0,'alone']=1  # 1 means travelling alone
f,ax=plt.subplots(1,2,figsize=(16,6))
sns.factorplot('Family_size','Survived',data=all_data[:train.shape[0]],ax=ax[0])
ax[0].set_title('Family_size vs Survived')
sns.factorplot('alone','Survived',data=all_data[:train.shape[0]],ax=ax[1])
ax[1].set_title('alone vs Survived')
plt.close(2)
plt.close(3)
plt.show()

(figure: factorplots of Family_size vs Survived and alone vs Survived)

When a passenger travels alone the survival rate is low, around 0.3; with 1-3 family members it rises, but above 4 it drops sharply again.

# then bin Family_size into categories
all_data['Family_size'] = np.where(all_data['Family_size']==0, 'solo',
                                    np.where(all_data['Family_size']<=3, 'normal', 'big'))
sns.factorplot('alone','Survived',hue='Sex',data=all_data[:train.shape[0]],col='Pclass')
plt.show()

(figure: factorplot of alone vs Survived by Sex, one panel per Pclass)

For women in classes 1 and 2, travelling alone makes little difference, but for third-class women the survival rate is actually higher when they travel alone.

all_data['poor_girl'] = 0
all_data.loc[(all_data['Sex']=='female')&(all_data['Pclass']==3)&(all_data['alone']==1),'poor_girl']=1

Filling and binning the continuous variable Fare

# fill missing Fare values with the mean fare of the corresponding class
all_data.loc[(all_data.Fare.isnull()) & (all_data.Pclass==1),'Fare']=84
all_data.loc[(all_data.Fare.isnull()) & (all_data.Pclass==2),'Fare']=21
all_data.loc[(all_data.Fare.isnull()) & (all_data.Pclass==3),'Fare']=14
sns.distplot(all_data[:train.shape[0]].loc[all_data[:train.shape[0]].Survived==0,'Fare' ],
                 color='red', label='Not Survived')
sns.distplot(all_data[:train.shape[0]].loc[all_data[:train.shape[0]].Survived==1,'Fare' ],
                 color='blue', label='Survived')
plt.xlim((0,100))
(0, 100)

(figure: Fare distributions for survivors vs non-survivors)

sns.lmplot('Fare','Survived',data=all_data[:train.shape[0]])
plt.show()

(figure: lmplot of Survived against Fare)

# split Fare into 3 quantile bins and check mean survival in each
all_data['Fare_band'] = pd.qcut(all_data['Fare'],3)

all_data[:train.shape[0]].groupby('Fare_band')['Survived'].mean()
Fare_band
(-0.001, 8.662]    0.198052
(8.662, 26.0]      0.402778
(26.0, 512.329]    0.559322
Name: Survived, dtype: float64
# discretize the continuous Fare into bands based on the qcut edges above

all_data['Fare_cut'] = 0
all_data.loc[all_data['Fare']<=8.662,'Fare_cut'] = 0
all_data.loc[((all_data['Fare']>8.662) & (all_data['Fare']<=26)),'Fare_cut'] = 1
#all_data.loc[((all_data['Fare']>14.454) & (all_data['Fare']<=31.275)),'Fare_cut'] = 2
all_data.loc[((all_data['Fare']>26) & (all_data['Fare']<513)),'Fare_cut'] = 2
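The same discretization can also be taken straight from pd.qcut, since the manual thresholds above are just its bin edges (a sketch of mine, equivalent in outcome):

# hypothetical shortcut: qcut's integer codes give the same three bands
all_data['Fare_cut'] = pd.qcut(all_data['Fare'], 3, labels=False)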

sns.factorplot('Fare_cut','Survived',hue='Sex',data=all_data[:train.shape[0]])
plt.show()

(figure: factorplot of Fare_cut vs Survived by Sex)

Survival increases with fare, and the effect is especially clear for men.

# create a feature flagging wealthy male passengers
all_data['rich_man'] = 0
all_data.loc[((all_data['Fare']>=80) & (all_data['Sex']=='male')),'rich_man'] = 1

Converting categorical features to numbers

all_data.head()
Age Embarked Fare Parch Pclass Sex SibSp Survived Ticket person Title_Mrs C Q S Family_size alone poor_girl Fare_band Fare_cut rich_man
train 0 22.0 S 7.2500 0 3 male 1 0.0 A/5 21171 male 0 0 0 1 normal 0 0 (-0.001, 8.662] 0 0
1 38.0 C 71.2833 0 1 female 1 1.0 PC 17599 female 1 1 0 0 normal 0 0 (26.0, 512.329] 2 0
2 26.0 S 7.9250 0 3 female 0 1.0 STON/O2. 3101282 female 0 0 0 1 solo 1 1 (-0.001, 8.662] 0 0
3 35.0 S 53.1000 0 1 female 1 1.0 113803 female 1 0 0 1 normal 0 0 (26.0, 512.329] 2 0
4 35.0 S 8.0500 0 3 male 0 0.0 373450 male 0 0 0 1 solo 1 0 (-0.001, 8.662] 0 0

5 rows × 24 columns

Features to drop: Embarked (already dummy-encoded), Fare and Fare_band (replaced by Fare_cut), Sex (replaced by person), Age (replaced by Age_band), Ticket, and SibSp/Parch (replaced by Family_size); one of the Embarked dummy columns (C in the code below) is dropped as well.

'''
Drop features we no longer need: Age (replaced by Age_band),
Fare and Fare_band (replaced by Fare_cut),
Ticket (no predictive meaning)
'''
#all_data.drop(['Age','Fare','Fare_band','Ticket'],axis=1,inplace=True)
#all_data.drop(['Age','Fare','Fare_band','Ticket','Embarked','C'],axis=1,inplace=True)
all_data.drop(['Age','Fare','Ticket','Embarked','C','Fare_band','SibSp','Parch'],axis=1,inplace=True)
all_data.head(2)
Pclass Sex Survived person Age_band Title_Master Title_Miss Title_Mr Title_Mrs Q S Family_size alone poor_girl Fare_cut rich_man
train 0 3 male 0.0 male 1 0 0 1 0 0 1 normal 0 0 0 0
1 1 female 1.0 female 1 0 0 0 1 0 0 normal 0 0 2 0
df1 = pd.get_dummies(all_data['Family_size'],prefix='Family_size')
df2 = pd.get_dummies(all_data['person'],prefix='person')
df3 = pd.get_dummies(all_data['Age_band'],prefix='age')
all_data = pd.concat([all_data,df1,df2,df3],axis=1)
all_data.head()
Pclass Sex Survived person Age_band Title_Master Title_Miss Title_Mr Title_Mrs Q rich_man Family_size_big Family_size_normal Family_size_solo person_child person_female person_male age_0 age_1 age_2
train 0 3 male 0.0 male 1 0 0 1 0 0 0 0 1 0 0 0 1 0 1 0
1 1 female 1.0 female 1 0 0 0 1 0 0 0 1 0 0 1 0 0 1 0
2 3 female 1.0 female 1 0 1 0 0 0 0 0 0 1 0 1 0 0 1 0
3 1 female 1.0 female 1 0 0 0 1 0 0 0 1 0 0 1 0 0 1 0
4 3 male 0.0 male 1 0 0 1 0 0 0 0 0 1 0 0 1 0 1 0

5 rows × 25 columns

all_data.drop(['Sex','person','Age_band','Family_size'],axis=1,inplace=True)
all_data.head()
Pclass Survived Title_Master Title_Miss Title_Mr Title_Mrs Q S alone poor_girl rich_man Family_size_big Family_size_normal Family_size_solo person_child person_female person_male age_0 age_1 age_2
train 0 3 0.0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 1 0 1 0
1 1 1.0 0 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 1 0
2 3 1.0 0 1 0 0 0 1 1 1 0 0 0 1 0 1 0 0 1 0
3 1 1.0 0 0 0 1 0 1 0 0 0 0 1 0 0 1 0 0 1 0
4 3 0.0 0 0 1 0 0 1 1 0 0 0 0 1 0 0 1 0 1 0

5 rows × 21 columns

Building models

from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import confusion_matrix  # confusion matrix of predictions vs. true labels
from sklearn.model_selection import cross_val_predict  # cross-validated predictions

from sklearn.model_selection import GridSearchCV
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier 
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
train_data = all_data[:train.shape[0]]
test_data = all_data[train.shape[0]:]
print('train data:'+str(train_data.shape))
print('test data:'+str(test_data.shape))
train data:(668, 21)
test data:(641, 21)

# note: this reuses the names train/test, overwriting the original DataFrames loaded from CSV
train,test = train_test_split(train_data,test_size = 0.25, random_state=0,stratify=train_data['Survived'])
train_x = train.drop('Survived',axis=1)

train_y = train['Survived']

test_x = test.drop('Survived',axis=1)
test_y = test['Survived']
print(train_x.shape)
print(test_x.shape)
(668, 20)
(223, 20)
# define score on train and test data
def cv_score(model):
    cv_result = cross_val_score(model,train_x,train_y,cv=10,scoring = "accuracy")
    return(cv_result)

def cv_score_test(model):
    cv_result_test = cross_val_score(model,test_x,test_y,cv=10,scoring = "accuracy")
    return(cv_result_test)

rbf SVM

# RBF SVM model

param_grid = {'C': [1e3, 5e3, 1e4, 5e4, 1e5],
              'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1], }
clf_svc = GridSearchCV(svm.SVC(kernel='rbf', class_weight='balanced'), param_grid)
clf_svc = clf_svc.fit(train_x, train_y)
print("Best estimator found by grid search:")
print(clf_svc.best_estimator_)
acc_svc_train = cv_score(clf_svc.best_estimator_).mean()
acc_svc_test = cv_score_test(clf_svc.best_estimator_).mean()
print(acc_svc_train)
print(acc_svc_test)
Best estimator found by grid search:
SVC(C=1000.0, cache_size=200, class_weight='balanced', coef0=0.0,
  decision_function_shape=None, degree=3, gamma=0.0001, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
0.826306967835
0.816196122718

Decision tree

#a simple tree

clf_tree = DecisionTreeClassifier()
clf_tree.fit(train_x,train_y)
acc_tree_train = cv_score(clf_tree).mean()
acc_tree_test = cv_score_test(clf_tree).mean()
print(acc_tree_train)
print(acc_tree_test)
0.808216271583
0.811631846414

KNN

#test n_neighbors 

pred = []
for i in range(1,11):
    model = KNeighborsClassifier(n_neighbors=i)
    model.fit(train_x,train_y)
    pred.append(cv_score(model).mean())
n = list(range(1,11))
plt.plot(n,pred)
plt.xticks(range(1,11))
plt.show()  

(figure: KNN cross-validated accuracy vs n_neighbors)

clf_knn = KNeighborsClassifier(n_neighbors=4)
clf_knn.fit(train_x,train_y)
acc_knn_train = cv_score(clf_knn).mean()
acc_knn_test = cv_score_test(clf_knn).mean()
print(acc_knn_train)
print(acc_knn_test)
0.826239790353
0.829653679654

Logistic regression

#logistic regression

clf_LR = LogisticRegression()
clf_LR.fit(train_x,train_y)
acc_LR_train = cv_score(clf_LR).mean()
acc_LR_test = cv_score_test(clf_LR).mean()
print(acc_LR_train)
print(acc_LR_test)
0.838226647511
0.811848296631

Gaussian Naive Bayes



clf_gb = GaussianNB()
clf_gb.fit(train_x,train_y)
acc_gb_train = cv_score(clf_gb).mean()
acc_gb_test = cv_score_test(clf_gb).mean()
print(acc_gb_train)
print(acc_gb_test)
0.794959693511
0.789695087521

Random forest



n_estimators = range(100,1000,100)
grid = {'n_estimators':n_estimators}

clf_forest = GridSearchCV(RandomForestClassifier(random_state=0),param_grid=grid,verbose=True)
clf_forest.fit(train_x,train_y)
print(clf_forest.best_estimator_)
print(clf_forest.best_score_)
#print(cv_score(clf_forest).mean())
#print(cv_score_test(clf_forest).mean())
Fitting 3 folds for each of 9 candidates, totalling 27 fits
[Parallel(n_jobs=1)]: Done 27 out of 27 | elapsed: 32.2s finished
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=200, n_jobs=1, oob_score=False, random_state=0,
            verbose=0, warm_start=False)
0.817365269461
clf_forest = RandomForestClassifier(n_estimators=200)
clf_forest.fit(train_x,train_y)
acc_forest_train = cv_score(clf_forest).mean()
acc_forest_test = cv_score_test(clf_forest).mean()
print(acc_forest_train)
print(acc_forest_test)
0.811178066885
0.811434217956
pd.Series(clf_forest.feature_importances_,train_x.columns).sort_values(ascending=True).plot.barh(width=0.8)
plt.show()

(figure: random forest feature importances)


models = pd.DataFrame({
    'model':['SVM','Decision Tree','KNN','Logistic regression','Gaussion Bayes','Random Forest'],
    'score on train':[acc_svc_train,acc_tree_train,acc_knn_train,acc_LR_train,acc_gb_train,acc_forest_train],
    'score on test':[acc_svc_test,acc_tree_test,acc_knn_test,acc_LR_test,acc_gb_test,acc_forest_test]
})
models.sort_values(by='score on test', ascending=False)
'''
models = pd.DataFrame({
    'model':['SVM','Decision Tree','KNN','Logistic regression','Gaussion Bayes','Random Forest'],
    'score on train':[acc_svc_train,acc_tree_train,acc_knn_train,acc_LR_train,acc_gb_train,acc_forest_train]
})
'''
models.sort_values(by='score on test', ascending=False)
model score on test score on train
2 KNN 0.829654 0.826240
0 SVM 0.816196 0.826307
3 Logistic regression 0.811848 0.838227
1 Decision Tree 0.811632 0.808216
5 Random Forest 0.811434 0.811178
4 Gaussion Bayes 0.789695 0.794960

Ensemble

from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import GradientBoostingClassifier

# bagging, using the tuned SVC as the base estimator
from sklearn.ensemble import BaggingClassifier
bag_tree = BaggingClassifier(base_estimator=clf_svc.best_estimator_,n_estimators=200,random_state=0)
bag_tree.fit(train_x,train_y)
acc_bagtree_train = cv_score(bag_tree).mean()
acc_bagtree_test =cv_score_test(bag_tree).mean()
print(acc_bagtree_train)
print(acc_bagtree_test)
0.82782211935
0.816196122718

AdaBoost

n_estimators = range(100,1000,100)
a = [0.05,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
grid = {'n_estimators':n_estimators,'learning_rate':a}
ada = GridSearchCV(AdaBoostClassifier(),param_grid=grid,verbose=True)
ada.fit(train_x,train_y)
print(ada.best_estimator_)
print(ada.best_score_)
#acc_ada_train = cv_score(ada).mean()
#acc_ada_test = cv_score_test(ada).mean()

#print(acc_ada_train)
#print(acc_ada_test)
Fitting 3 folds for each of 90 candidates, totalling 270 fits


[Parallel(n_jobs=1)]: Done 270 out of 270 | elapsed:  5.4min finished


AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=0.05, n_estimators=200, random_state=None)
0.835329341317
ada = AdaBoostClassifier(n_estimators=200,random_state=0,learning_rate=0.2)
ada.fit(train_x,train_y)

acc_ada_train = cv_score(ada).mean()
acc_ada_test = cv_score_test(ada).mean()

print(acc_ada_train)
print(acc_ada_test)
0.829248144305
0.825719932242
# confusion matrix to inspect the predictions

y_pred = cross_val_predict(ada,test_x,test_y,cv=10)
sns.heatmap(confusion_matrix(test_y,y_pred),cmap='winter',annot=True,fmt='2.0f')
plt.show()

(figure: confusion matrix heatmap of the AdaBoost predictions)

GradientBoosting


n_estimators = range(100,1000,100)
a = [0.05,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
grid = {'n_estimators':n_estimators,'learning_rate':a}
grad = GridSearchCV(GradientBoostingClassifier(),param_grid=grid,verbose=True)
grad.fit(train_x,train_y)
print(grad.best_estimator_)
print(grad.best_score_)
Fitting 3 folds for each of 90 candidates, totalling 270 fits


[Parallel(n_jobs=1)]: Done 270 out of 270 | elapsed:  2.4min finished


GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.05, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_split=1e-07, min_samples_leaf=1,
              min_samples_split=2, min_weight_fraction_leaf=0.0,
              n_estimators=200, presort='auto', random_state=None,
              subsample=1.0, verbose=0, warm_start=False)
0.824850299401
# refit GradientBoosting with the parameters found by the grid search

clf_grad=GradientBoostingClassifier(n_estimators=200,random_state=0,learning_rate=0.05)
clf_grad.fit(train_x,train_y)
acc_grad_train = cv_score(clf_grad).mean()
acc_grad_test = cv_score_test(clf_grad).mean()

print(acc_grad_train)
print(acc_grad_test)
0.818709926304
0.807500470544
from sklearn.metrics import precision_score
# a simple stacking ensemble: the base models' predictions become input features for a logistic-regression meta-model
class Ensemble(object):

    def __init__(self,estimators):
        self.estimator_names = []
        self.estimators = []
        for i in estimators:
            self.estimator_names.append(i[0])
            self.estimators.append(i[1])
        self.clf = LogisticRegression()

    def fit(self, train_x, train_y):
        for i in self.estimators:
            i.fit(train_x,train_y)
        x = np.array([i.predict(train_x) for i in self.estimators]).T
        y = train_y
        self.clf.fit(x, y)

    def predict(self,x):
        x = np.array([i.predict(x) for i in self.estimators]).T
        #print(x)
        return self.clf.predict(x)


    def score(self,x,y):
        s = precision_score(y,self.predict(x))
        return s
ensem = Ensemble([('Ada',ada),('Bag',bag_tree),('SVM',clf_svc.best_estimator_),('LR',clf_LR),('gbdt',clf_grad)])
score = 0
for i in range(0,10):
    ensem.fit(train_x, train_y)
    sco = round(ensem.score(test_x,test_y) * 100, 2)
    score+=sco
print(score/10)
89.83
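VotingClassifier was imported above but never used; for comparison, a majority-vote ensemble over the same base models might look like this (a sketch of mine, not run here):

# hypothetical hard-voting ensemble over the same base models
voting = VotingClassifier(estimators=[('ada', ada), ('bag', bag_tree),
                                      ('svm', clf_svc.best_estimator_),
                                      ('lr', clf_LR), ('gbdt', clf_grad)],
                          voting='hard')
voting.fit(train_x, train_y)
print(cv_score_test(voting).mean())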

Submission

pre = ensem.predict(test_data.drop('Survived',axis=1))  # drop the empty target column before predicting
submission = pd.DataFrame({'PassengerId':passID,'Survived':pre.astype(int)})

Judging from the submission, the ensemble model gives no obvious improvement over the individual models. Possible reasons: the base models are strongly correlated, the training data is limited, or the one-hot encoding may introduce collinearity. Although the training and held-out scores are close, the leaderboard score drops noticeably, probably because the data is small, the training is not sufficient, and the features are few and strongly correlated; bringing in more features would be worth trying.
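On the collinearity point: dummy-encoding a k-level categorical into k columns makes them sum to a constant, which linear models such as logistic regression can be sensitive to. Dropping one level per categorical avoids this; a minimal sketch of mine, shown for Title only:

# hypothetical: drop the first dummy level to avoid the dummy-variable trap
df = pd.get_dummies(all_data['Title'], prefix='Title', drop_first=True)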
