[Kaggle for Beginners] Titanic: Machine Learning from Disaster - Analyzing the Data


This blog series is purely a record of what I am learning about Kaggle, following other people's tutorials step by step.


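Datasets usually contain many attributes, but not every attribute helps when training a model. To find out which ones are useful, the first step is to analyze the data.

First, count the survivors, the passengers in each class, and the passengers at each port of embarkation.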

import pandas as pd
import matplotlib.pyplot as plt

# Assumption: the Kaggle training set was saved as train.csv in the working directory.
data_train = pd.read_csv("train.csv")

fig = plt.figure(figsize=(15, 5))
fig.set(alpha=0.2)

# Left panel: survivors vs. non-survivors
plt.subplot2grid((1, 3), (0, 0))
data_train.Survived.value_counts().plot(kind='bar')
plt.title("Survival counts (1 = survived)")
plt.ylabel("number of passengers")

# Middle panel: passengers per class
plt.subplot2grid((1, 3), (0, 1))
data_train.Pclass.value_counts().plot(kind='bar')
plt.title("Passengers by Pclass")
plt.ylabel("number of passengers")

# Right panel: passengers per port of embarkation
plt.subplot2grid((1, 3), (0, 2))
data_train.Embarked.value_counts().plot(kind='bar')
plt.title("Passengers by port of embarkation")
plt.ylabel("number of passengers")
plt.show()

Survival by age.

fig = plt.figure()
fig.set(alpha=0.2)
plt.scatter(data_train.Survived, data_train.Age)
# Newer matplotlib versions no longer accept b=, so pass the flag positionally
plt.grid(True, which='major', axis='y')
plt.title("Age distribution by survival (1 = survived)")
plt.ylabel("age")
plt.show()
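
The scatter plot makes it hard to compare age groups, so one option is to bin Age and compute the survival rate per bin. This is a minimal sketch, not part of the original analysis: the function name, the bin edges, and the toy rows are all made up for illustration, and only the column names Age and Survived come from the Kaggle data.

```python
import pandas as pd

def survival_rate_by_age_band(df, edges):
    """Survival rate within each age band defined by `edges`."""
    # Rows with a missing Age get no band and are left out of the groupby
    bands = pd.cut(df["Age"], bins=edges)
    # Survived is 0/1, so the mean of a group is its survival rate
    return df.groupby(bands, observed=True)["Survived"].mean()

# Made-up toy rows for illustration only; pass data_train instead
toy = pd.DataFrame({
    "Age":      [4, 9, 25, 30, 35, 62, None],
    "Survived": [1, 1, 0, 1, 0, 0, 1],
})
age_rates = survival_rate_by_age_band(toy, [0, 12, 40, 80])
print(age_rates)
```

On the real data the bin edges are a judgment call, and different edges can change the picture noticeably.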

Age distribution by passenger class.

fig = plt.figure()
fig.set(alpha=0.2)
# kind='kde' draws a kernel density estimate and requires SciPy
data_train.Age[data_train.Pclass == 1].plot(kind='kde')
data_train.Age[data_train.Pclass == 2].plot(kind='kde')
data_train.Age[data_train.Pclass == 3].plot(kind='kde')
plt.title("Age distribution by Pclass")
plt.xlabel("age")
plt.ylabel("density")
plt.legend(("first class", "second class", "third class"), loc='best')
plt.show()

Survival by passenger class.

fig = plt.figure()
fig.set(alpha=0.2)

# Count passengers per class separately for each outcome
survived_0 = data_train.Pclass[data_train.Survived == 0].value_counts()
survived_1 = data_train.Pclass[data_train.Survived == 1].value_counts()
df = pd.DataFrame({'survived': survived_1, 'not survived': survived_0})
# DataFrame.plot opens a new figure unless it is given an Axes, so reuse the current one
df.plot(kind='bar', stacked=True, ax=plt.gca())
plt.title("Survival by Pclass")
plt.xlabel("Pclass")
plt.ylabel("number of passengers")
plt.show()
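
The stacked bars show counts, while the comparison that matters is the share of survivors within each class. A minimal sketch of that calculation follows. The helper name and the toy rows are made up for illustration, and only the column names Pclass and Survived come from the Kaggle data.

```python
import pandas as pd

def survival_rate_by(df, column):
    """Share of survivors within each value of `column`."""
    # Survived is 0/1, so the group mean equals the survival rate
    return df.groupby(column)["Survived"].mean()

# Made-up toy rows for illustration only; pass data_train instead
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})
rates = survival_rate_by(toy, "Pclass")
print(rates)
```

The same helper works for any categorical column, for example `survival_rate_by(data_train, "Sex")`.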

Survival by sex.

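fig = plt.figure()
fig.set(alpha=0.2)

# The rows of df are the Survived values (0/1); the stacked colors are the sexes
survived_m = data_train.Survived[data_train.Sex == 'male'].value_counts()
survived_f = data_train.Survived[data_train.Sex == 'female'].value_counts()
df = pd.DataFrame({'male': survived_m, 'female': survived_f})
# DataFrame.plot opens a new figure unless it is given an Axes, so reuse the current one
df.plot(kind='bar', stacked=True, ax=plt.gca())
plt.title("Survival by sex")
# The x-axis shows the outcome, not the sex
plt.xlabel("survived (0 = no, 1 = yes)")
plt.ylabel("number of passengers")
plt.show()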

Survival by class and sex.

fig = plt.figure(figsize=(15, 5))
fig.set(alpha=0.65)
# suptitle titles the whole figure; plt.title here would first create an extra full-size Axes
fig.suptitle("Survival by Pclass and sex")

# Boolean masks; "high class" means Pclass 1 or 2, "low class" means Pclass 3
female = data_train.Sex == 'female'
male = data_train.Sex == 'male'
low = data_train.Pclass == 3

# value_counts() sorts by frequency, so sort_index() is used to fix the bar order
# to 0, 1 and keep the tick labels below correct in every panel
ax1 = fig.add_subplot(141)
data_train.Survived[female & ~low].value_counts().sort_index().plot(kind='bar', color='#FA2479')
ax1.set_xticklabels(["not survived", "survived"], rotation=0)
plt.legend(["female/high class"], loc='best')

ax2 = fig.add_subplot(142, sharey=ax1)
data_train.Survived[female & low].value_counts().sort_index().plot(kind='bar', color='pink')
ax2.set_xticklabels(["not survived", "survived"], rotation=0)
plt.legend(["female/low class"], loc='best')

ax3 = fig.add_subplot(143, sharey=ax1)
data_train.Survived[male & ~low].value_counts().sort_index().plot(kind='bar', color='lightblue')
ax3.set_xticklabels(["not survived", "survived"], rotation=0)
plt.legend(["male/high class"], loc='best')

ax4 = fig.add_subplot(144, sharey=ax1)
data_train.Survived[male & low].value_counts().sort_index().plot(kind='bar', color='steelblue')
ax4.set_xticklabels(["not survived", "survived"], rotation=0)
plt.legend(["male/low class"], loc='best')

plt.show()
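
The four panels can also be summarized as one small table of survival rates, with one row per sex and one column per class. This is a minimal sketch using `DataFrame.pivot_table`. The toy rows are made up for illustration, and only the column names Sex, Pclass and Survived come from the Kaggle data.

```python
import pandas as pd

def survival_rate_table(df):
    """Survival rate for every (Sex, Pclass) combination."""
    # aggfunc="mean" on a 0/1 column gives the share of survivors per cell
    return df.pivot_table(values="Survived", index="Sex",
                          columns="Pclass", aggfunc="mean")

# Made-up toy rows for illustration only; pass data_train instead
toy = pd.DataFrame({
    "Sex":      ["female", "female", "female", "female",
                 "male", "male", "male", "male"],
    "Pclass":   [1, 1, 3, 3, 1, 1, 3, 3],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})
rate_table = survival_rate_table(toy)
print(rate_table)
```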

Survival counts by port of embarkation.

fig = plt.figure()
fig.set(alpha=0.2)

# Count passengers per port separately for each outcome
survived_0 = data_train.Embarked[data_train.Survived == 0].value_counts()
survived_1 = data_train.Embarked[data_train.Survived == 1].value_counts()
df = pd.DataFrame({"survived": survived_1, "not survived": survived_0})
# DataFrame.plot opens a new figure unless it is given an Axes, so reuse the current one
df.plot(kind='bar', stacked=True, ax=plt.gca())
plt.title("Survival by port of embarkation")
plt.xlabel("embarked")
plt.ylabel("number of passengers")

plt.show()

The Cabin column has a lot of missing values.

data_train.Cabin.value_counts()
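
Note that `value_counts()` skips NaN by default, so it does not show how much is missing. A minimal sketch for measuring the missing share per column follows. The toy rows are made up for illustration, and the column names are taken from the Kaggle data.

```python
import pandas as pd

def missing_share(df):
    """Fraction of missing values per column, largest first."""
    # isnull() gives True/False per cell; the mean of booleans is the share of True
    return df.isnull().mean().sort_values(ascending=False)

# Made-up toy rows for illustration only; pass data_train instead
toy = pd.DataFrame({
    "Age":      [22.0, None, 35.0, 54.0],
    "Cabin":    [None, "C85", None, None],
    "Embarked": ["S", "C", "S", "S"],
})
missing = missing_share(toy)
print(missing)
```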

 

Compare survival for passengers with and without Cabin information.

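fig = plt.figure()
fig.set(alpha=0.2)

# Outcome counts for passengers with and without a recorded Cabin
survived_cabin = data_train.Survived[pd.notnull(data_train.Cabin)].value_counts()
survived_nocabin = data_train.Survived[pd.isnull(data_train.Cabin)].value_counts()
# transpose() puts "have" / "not have" on the x-axis and stacks the 0/1 outcomes
df = pd.DataFrame({"have": survived_cabin, "not have": survived_nocabin}).transpose()
# DataFrame.plot opens a new figure unless it is given an Axes, so reuse the current one
df.plot(kind='bar', stacked=True, ax=plt.gca())
plt.title("Survival by Cabin availability")
plt.xlabel("Cabin recorded or not")
plt.ylabel("number of passengers")

plt.show()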
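
Because the two groups differ a lot in size, it helps to look at the group size and the survival rate side by side. This is a minimal sketch. The function name, the "recorded" / "missing" labels and the toy rows are made up for illustration, and only the column names Cabin and Survived come from the Kaggle data.

```python
import pandas as pd

def survival_by_cabin_presence(df):
    """Group size and survival rate for rows with and without a Cabin value."""
    # Turn "is a Cabin value present?" into a readable label
    presence = df["Cabin"].notnull().map({True: "recorded", False: "missing"})
    return df.groupby(presence)["Survived"].agg(["count", "mean"])

# Made-up toy rows for illustration only; pass data_train instead
toy = pd.DataFrame({
    "Cabin":    ["C85", None, None, "E46", None, None],
    "Survived": [1, 0, 0, 1, 1, 0],
})
cabin_stats = survival_by_cabin_presence(toy)
print(cabin_stats)
```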

SibSp and Parch.

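# Number of passengers for each (SibSp, Survived) combination
g = data_train.groupby(['SibSp', 'Survived'])
df = pd.DataFrame(g.count()['PassengerId'])
print(df)

# Number of passengers for each (Parch, Survived) combination
g = data_train.groupby(['Parch', 'Survived'])
df = pd.DataFrame(g.count()['PassengerId'])
print(df)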
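
The long two-level listing is easier to compare when the outcomes are spread into columns and a rate is added. This is a minimal sketch. The function name and the toy rows are made up for illustration, and the code assumes both outcomes (0 and 1) occur somewhere in the data.

```python
import pandas as pd

def outcome_table(df, column):
    """Counts of Survived = 0 / 1 per value of `column`, plus the survival rate."""
    # size() counts rows per (column, Survived) pair; unstack() moves Survived into columns
    counts = df.groupby([column, "Survived"]).size().unstack(fill_value=0)
    # Assumes both outcome columns 0 and 1 exist
    counts["rate"] = counts[1] / (counts[0] + counts[1])
    return counts

# Made-up toy rows for illustration only; pass data_train instead
toy = pd.DataFrame({
    "SibSp":    [0, 0, 0, 1, 1, 2],
    "Survived": [0, 0, 1, 1, 1, 0],
})
sibsp_table = outcome_table(toy, "SibSp")
print(sibsp_table)
```

Calling `outcome_table(data_train, "Parch")` would give the matching table for Parch.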

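From the charts and tables above, at least the following is clear:

  1. First-class passengers have by far the highest survival rate. Pclass should be a useful attribute.
  2. Women survived at a higher rate than men. Sex should be a useful attribute.

Still uncertain:

  1. Age. Judging by the density plot, first-class passengers skew middle-aged, with most around 40. It is not yet clear whether age affects survival.
  2. Embarked. Most passengers boarded at port S, but those who boarded at port C had the highest survival rate. The relationship between port of embarkation and passenger class is not yet clear.
  3. Cabin. In the available data, passengers with Cabin information appear more likely to have survived. But because so many values are missing, this attribute is probably of limited use for now.

Attributes that seem not very useful:

  1. SibSp
  2. Parch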

As a follow-up, I looked at the ports of embarkation for each passenger class.

fig = plt.figure(figsize=(15, 5))
fig.set(alpha=0.65)
fig.suptitle("Port of embarkation within each Pclass")

# value_counts() sorts by frequency, so reindex() is used to fix the bar order
# to S, C, Q and keep the tick labels correct in every panel
ports = ["S", "C", "Q"]

# The first panel owns the y-axis; the other two share it
ax1 = fig.add_subplot(131)
data_train.Embarked[data_train.Pclass == 1].value_counts().reindex(ports).plot(kind='bar', color='#FA2479')
ax1.set_xticklabels(ports, rotation=0)
plt.legend(["first class"], loc='best')

ax2 = fig.add_subplot(132, sharey=ax1)
data_train.Embarked[data_train.Pclass == 2].value_counts().reindex(ports).plot(kind='bar', color='pink')
ax2.set_xticklabels(ports, rotation=0)
plt.legend(["second class"], loc='best')

ax3 = fig.add_subplot(133, sharey=ax1)
data_train.Embarked[data_train.Pclass == 3].value_counts().reindex(ports).plot(kind='bar', color='lightblue')
ax3.set_xticklabels(ports, rotation=0)
plt.legend(["third class"], loc='best')

plt.show()

fig = plt.figure(figsize=(15, 5))
fig.set(alpha=0.65)
fig.suptitle("Pclass within each port of embarkation")

# sort_index() fixes the bar order to 1, 2, 3 so the tick labels are correct
# The first panel owns the y-axis; the other two share it
ax1 = fig.add_subplot(131)
data_train.Pclass[data_train.Embarked == 'S'].value_counts().sort_index().plot(kind='bar', color='#FA2479')
ax1.set_xticklabels(["1", "2", "3"], rotation=0)
plt.legend(["S"], loc='best')

ax2 = fig.add_subplot(132, sharey=ax1)
data_train.Pclass[data_train.Embarked == 'C'].value_counts().sort_index().plot(kind='bar', color='pink')
ax2.set_xticklabels(["1", "2", "3"], rotation=0)
plt.legend(["C"], loc='best')

ax3 = fig.add_subplot(133, sharey=ax1)
data_train.Pclass[data_train.Embarked == 'Q'].value_counts().sort_index().plot(kind='bar', color='lightblue')
ax3.set_xticklabels(["1", "2", "3"], rotation=0)
plt.legend(["Q"], loc='best')

plt.show()

This does not seem to yield much more information either.
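
The port and class relationship can also be read from a single cross-tabulation, which may be easier to compare than six bar panels. This is a minimal sketch using `pd.crosstab` with row normalization. The function name and the toy rows are made up for illustration, and only the column names Embarked and Pclass come from the Kaggle data.

```python
import pandas as pd

def class_share_by_port(df):
    """For each port, the share of its passengers travelling in each class."""
    # normalize="index" divides every row by its total, so each row sums to 1
    return pd.crosstab(df["Embarked"], df["Pclass"], normalize="index")

# Made-up toy rows for illustration only; pass data_train instead
toy = pd.DataFrame({
    "Embarked": ["S", "S", "S", "S", "C", "C", "Q", "Q"],
    "Pclass":   [1, 3, 3, 3, 1, 1, 3, 3],
})
shares = class_share_by_port(toy)
print(shares)
```

If one port turned out to carry a much larger share of first-class passengers, that alone could explain part of its higher survival rate. That would need checking on the real data.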
