【Kaggle for Beginners】Titanic: Machine Learning from Disaster ---- Analyzing the Data

I am writing this blog series purely to record what I learn about Kaggle, following other people's work step by step.

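A dataset usually contains many attributes, but not all of them help when training a model. To find out which ones are useful, the first step is to analyze the data.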

First, count the survivors, the passengers in each cabin class, and the passengers at each port of embarkation.

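import pandas as pd
import matplotlib.pyplot as plt

# Assumption: the Kaggle training set has been saved locally as train.csv.
data_train = pd.read_csv("train.csv")

fig = plt.figure(figsize=(15, 5))
fig.set(alpha=0.2)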

plt.subplot2grid((1, 3), (0, 0))
data_train.Survived.value_counts().plot(kind='bar')
plt.title("Rescue situation (1 is survived) ")
plt.ylabel("number of person")

plt.subplot2grid((1, 3), (0, 1))
data_train.Pclass.value_counts().plot(kind='bar')
plt.title("Pclass distribution of passengers")
plt.ylabel("number of person")

plt.subplot2grid((1, 3), (0, 2))
data_train.Embarked.value_counts().plot(kind='bar')
plt.title("each embarked board passengers")
plt.ylabel("number of person")
plt.show()
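
The bar charts show raw counts. To read the same information as proportions, something like the following might help. This is only a minimal sketch: `toy` is a made-up stand-in for `data_train` with invented values, so the numbers it prints say nothing about the real dataset.

```python
import pandas as pd

# Toy stand-in for data_train; the values are invented for illustration only.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 1, 0, 0],
    "Pclass":   [3, 1, 3, 3, 2, 1, 3, 2],
    "Embarked": ["S", "C", "S", "Q", "S", "C", "S", "S"],
})

# normalize=True returns fractions instead of raw counts;
# sort_index() orders the result by label (0, then 1) rather than by frequency.
survived_share = toy.Survived.value_counts(normalize=True).sort_index()
pclass_share = toy.Pclass.value_counts(normalize=True).sort_index()

print(survived_share)
print(pclass_share)
```

On the real data, replacing `toy` with `data_train` should give the corresponding shares.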

Survival by age.

fig = plt.figure()
fig.set(alpha=0.2)
plt.scatter(data_train.Survived, data_train.Age)
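# Pass the flag positionally: the b= keyword is no longer accepted by recent Matplotlib releases.
plt.grid(True, which='major', axis='y')
plt.title("Survived distribution by age (1 is survived)")
plt.ylabel("age")
plt.show()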
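
A scatter plot of age against a 0/1 column is hard to read, so one alternative is to bin the ages and compare the survival rate per bin. The sketch below assumes arbitrary bin edges that I picked myself, and `toy` is a made-up table, not the real data.

```python
import pandas as pd

# Toy stand-in for data_train; the values are invented for illustration only.
toy = pd.DataFrame({
    "Age":      [4, 22, 38, 35, 54, 2, 27, 14, 58, 20],
    "Survived": [1, 0, 1, 1, 0, 1, 0, 1, 1, 0],
})

# Hypothetical bin edges; any other split would work the same way.
bins = [0, 12, 18, 40, 60, 80]
age_band = pd.cut(toy.Age, bins=bins)

# The mean of a 0/1 column is the share of ones, i.e. the survival rate per age band.
# observed=False keeps empty bands in the result (their rate is NaN).
rate_by_age = toy.groupby(age_band, observed=False)["Survived"].mean()
print(rate_by_age)
```

Rows with a missing Age fall into no bin and are simply left out of the grouping.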

Age distribution for each cabin class.

fig = plt.figure()
fig.set(alpha=0.2)
data_train.Age[data_train.Pclass == 1].plot(kind='kde')
data_train.Age[data_train.Pclass == 2].plot(kind='kde')
data_train.Age[data_train.Pclass == 3].plot(kind='kde')
plt.title("Pclass of passengers by age")
plt.xlabel("age")
plt.ylabel("density")
plt.legend(("first class", "second class", "third class"), loc='best')
plt.show()

Survival distribution for each cabin class.

fig, ax = plt.subplots()
fig.set(alpha=0.2)

survived_0 = data_train.Pclass[data_train.Survived == 0].value_counts()
survived_1 = data_train.Pclass[data_train.Survived == 1].value_counts()
df = pd.DataFrame({'survived': survived_1, 'not survived': survived_0})
# Draw on the axes created above; without ax=, DataFrame.plot opens a second, separate figure.
df.plot(kind='bar', stacked=True, ax=ax)
plt.title("survived passengers distribution by Pclass")
plt.xlabel("Pclass")
plt.ylabel("number of person")
plt.show()
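
The stacked bars make it easy to compare counts, but the survival rate per class is easier to read as a number. A minimal sketch, using a made-up `toy` table in place of `data_train`:

```python
import pandas as pd

# Toy stand-in for data_train; the values are invented for illustration only.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1, 0, 0],
})

# Survived is 0/1, so its mean within each class is that class's survival rate.
rate_by_class = toy.groupby("Pclass")["Survived"].mean()
print(rate_by_class)
```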

Survival by sex.

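fig, ax = plt.subplots()
fig.set(alpha=0.2)

survived_m = data_train.Survived[data_train.Sex == 'male'].value_counts()
survived_f = data_train.Survived[data_train.Sex == 'female'].value_counts()
df = pd.DataFrame({'male': survived_m, 'female': survived_f})
# Draw on the axes created above; without ax=, DataFrame.plot opens a second, separate figure.
df.plot(kind='bar', stacked=True, ax=ax)
plt.title("survived passengers distribution by sex")
# The x axis holds the Survived value (0 or 1); sex is the stacked series.
plt.xlabel("survived (1 is survived)")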
plt.ylabel("number of person")
plt.show()

Survival by cabin class and sex.

fig = plt.figure(figsize=(15, 5))
fig.set(alpha=0.65)
fig.suptitle("survived passengers distribution by Pclass and sex")

# Both conditions are combined into one boolean mask with &.
# sort_index() fixes the bar order to 0, 1 so the tick labels always match the data.
ax1 = fig.add_subplot(141)
data_train.Survived[(data_train.Sex == 'female') & (data_train.Pclass != 3)].value_counts().sort_index().plot(kind='bar', label='female, high class', color='#FA2479', ax=ax1)
ax1.set_xticklabels(["not survived", "survived"], rotation=0)
ax1.legend(["female/high class"], loc='best')

ax2 = fig.add_subplot(142, sharey=ax1)
data_train.Survived[(data_train.Sex == 'female') & (data_train.Pclass == 3)].value_counts().sort_index().plot(kind='bar', label='female, low class', color='pink', ax=ax2)
ax2.set_xticklabels(["not survived", "survived"], rotation=0)
ax2.legend(["female/low class"], loc='best')

ax3 = fig.add_subplot(143, sharey=ax1)
data_train.Survived[(data_train.Sex == 'male') & (data_train.Pclass != 3)].value_counts().sort_index().plot(kind='bar', label='male, high class', color='lightblue', ax=ax3)
ax3.set_xticklabels(["not survived", "survived"], rotation=0)
ax3.legend(["male/high class"], loc='best')

ax4 = fig.add_subplot(144, sharey=ax1)
data_train.Survived[(data_train.Sex == 'male') & (data_train.Pclass == 3)].value_counts().sort_index().plot(kind='bar', label='male, low class', color='steelblue', ax=ax4)
ax4.set_xticklabels(["not survived", "survived"], rotation=0)
ax4.legend(["male/low class"], loc='best')

plt.show()
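
The four panels can also be condensed into a single table of survival rates. This is a rough sketch with a made-up `toy` table. Note that it keeps each Pclass value as its own column, whereas the plots group classes 1 and 2 together as "high class".

```python
import pandas as pd

# Toy stand-in for data_train; the values are invented for illustration only.
toy = pd.DataFrame({
    "Sex":      ["female", "female", "female", "female", "male", "male", "male", "male"],
    "Pclass":   [1, 1, 3, 3, 1, 1, 3, 3],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# One row per sex, one column per class; each cell is the mean of Survived,
# i.e. the survival rate for that combination.
rate_table = pd.pivot_table(toy, values="Survived", index="Sex",
                            columns="Pclass", aggfunc="mean")
print(rate_table)
```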

Number of survivors for each port of embarkation.

fig, ax = plt.subplots()
fig.set(alpha=0.2)

survived_0 = data_train.Embarked[data_train.Survived == 0].value_counts()
survived_1 = data_train.Embarked[data_train.Survived == 1].value_counts()
df = pd.DataFrame({"survived": survived_1, "not survived": survived_0})
# Draw on the axes created above; without ax=, DataFrame.plot opens a second, separate figure.
df.plot(kind='bar', stacked=True, ax=ax)
plt.title("each embarked passengers survived distribution")
plt.xlabel("embarked")
plt.ylabel("number of person")

plt.show()

The Cabin column, which holds the cabin information, has a lot of missing values.

data_train.Cabin.value_counts()
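
`value_counts()` skips missing values by default, so it does not show how many are missing. Counting the nulls directly might be a more direct check. A minimal sketch on a made-up `toy` table (the cabin labels are invented):

```python
import pandas as pd

# Toy stand-in for data_train; the values are invented for illustration only.
toy = pd.DataFrame({
    "Cabin": [None, "C85", None, "C123", None],
})

# isnull() marks missing entries as True; sum() counts them, mean() gives their share.
missing_count = toy.Cabin.isnull().sum()
missing_share = toy.Cabin.isnull().mean()

# dropna=False makes value_counts() include the missing values as their own row.
counts_with_nan = toy.Cabin.value_counts(dropna=False)

print(missing_count, missing_share)
print(counts_with_nan)
```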

Look at the survival counts for passengers with and without Cabin information.

fig, ax = plt.subplots()
fig.set(alpha=0.2)

survived_cabin = data_train.Survived[pd.notnull(data_train.Cabin)].value_counts()
survived_nocabin = data_train.Survived[pd.isnull(data_train.Cabin)].value_counts()
df = pd.DataFrame({"have": survived_cabin, "not have": survived_nocabin}).transpose()
# Draw on the axes created above; without ax=, DataFrame.plot opens a second, separate figure.
df.plot(kind='bar', stacked=True, ax=ax)
plt.title("survived distribution by cabin")
plt.xlabel("have cabin or not")
plt.ylabel("number of person")

plt.show()
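
The two groups differ a lot in size, so comparing rates rather than counts may be fairer. A small sketch, again on a made-up `toy` table; the `has_cabin` name is my own helper, not a column in the dataset.

```python
import pandas as pd

# Toy stand-in for data_train; the values are invented for illustration only.
toy = pd.DataFrame({
    "Cabin":    ["C85", None, None, "E46", None, None, "B20", None],
    "Survived": [1, 0, 0, 1, 1, 0, 0, 0],
})

# has_cabin is a hypothetical helper flag: True where Cabin is present.
has_cabin = toy.Cabin.notnull().rename("has_cabin")

# Survival rate for passengers with and without a recorded cabin.
rate_by_cabin = toy.groupby(has_cabin)["Survived"].mean()
print(rate_by_cabin)
```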

SibSp and Parch.

g = data_train.groupby(['SibSp', 'Survived'])
df = pd.DataFrame(g.count()['PassengerId'])
print(df)

g = data_train.groupby(['Parch', 'Survived'])
df = pd.DataFrame(g.count()['PassengerId'])
print(df)
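
The grouped counts are printed as one long column. Unstacking them puts "not survived" and "survived" side by side, which I find easier to scan. A minimal sketch with a made-up `toy` table:

```python
import pandas as pd

# Toy stand-in for data_train; the values are invented for illustration only.
toy = pd.DataFrame({
    "SibSp":    [0, 0, 0, 1, 1, 1, 3, 3],
    "Survived": [0, 0, 1, 1, 1, 0, 0, 0],
})

# size() counts rows per (SibSp, Survived) pair; unstack() turns the Survived
# level into columns, and fill_value=0 fills combinations that never occur.
counts = toy.groupby(["SibSp", "Survived"]).size().unstack(fill_value=0)

# The survival rate per SibSp value, for comparison across group sizes.
rate = toy.groupby("SibSp")["Survived"].mean()

print(counts)
print(rate)
```

The same two lines should work for Parch by swapping the column name.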

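From the charts and tables above, at least the following is clear:

First class has clearly the highest proportion of survivors. Pclass should be a useful attribute.
Women survived at a higher rate than men. Sex should be a useful attribute.

What I cannot determine yet:

The Age attribute. Judging from the charts, first-class passengers skew middle-aged, with most of them around 40. It is not yet clear whether age affects survival.
The Embarked attribute. According to the data, port S has the most passengers, but passengers who boarded at port C have the highest survival rate. The relationship between port of embarkation and cabin class is not yet clear.
The Cabin attribute. In the current data, passengers with Cabin information seem more likely to have survived. But since this attribute has so many missing values, it is probably not a reliable reference for now.

Attributes that I think are not very useful:

SibSp
Parch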

As a supplement, I also looked at the ports of embarkation for each cabin class.

fig = plt.figure(figsize=(15, 5))
fig.set(alpha=0.65)
fig.suptitle("Embarked distribution for each Pclass")

# value_counts() orders by frequency, which differs per class, so reindex() pins the
# bar order to S, C, Q and the tick labels always match the data.
ports = ["S", "C", "Q"]

ax1 = fig.add_subplot(131)
data_train.Embarked[data_train.Pclass == 1].value_counts().reindex(ports, fill_value=0).plot(kind='bar', label='first class', color='#FA2479', ax=ax1)
ax1.set_xticklabels(ports, rotation=0)
ax1.legend(["first class"], loc='best')

ax2 = fig.add_subplot(132, sharey=ax1)
data_train.Embarked[data_train.Pclass == 2].value_counts().reindex(ports, fill_value=0).plot(kind='bar', label='second class', color='pink', ax=ax2)
ax2.set_xticklabels(ports, rotation=0)
ax2.legend(["second class"], loc='best')

ax3 = fig.add_subplot(133, sharey=ax1)
data_train.Embarked[data_train.Pclass == 3].value_counts().reindex(ports, fill_value=0).plot(kind='bar', label='third class', color='lightblue', ax=ax3)
ax3.set_xticklabels(ports, rotation=0)
ax3.legend(["third class"], loc='best')

plt.show()

fig = plt.figure(figsize=(15, 5))
fig.set(alpha=0.65)
fig.suptitle("Pclass distribution for each Embarked")

# value_counts() orders by frequency, so reindex() pins the bar order to
# Pclass 1, 2, 3 and the tick labels always match the data.
classes = [1, 2, 3]

ax1 = fig.add_subplot(131)
data_train.Pclass[data_train.Embarked == 'S'].value_counts().reindex(classes, fill_value=0).plot(kind='bar', label='S', color='#FA2479', ax=ax1)
ax1.set_xticklabels(["1", "2", "3"], rotation=0)
ax1.legend(["S"], loc='best')

ax2 = fig.add_subplot(132, sharey=ax1)
data_train.Pclass[data_train.Embarked == 'C'].value_counts().reindex(classes, fill_value=0).plot(kind='bar', label='C', color='pink', ax=ax2)
ax2.set_xticklabels(["1", "2", "3"], rotation=0)
ax2.legend(["C"], loc='best')

ax3 = fig.add_subplot(133, sharey=ax1)
data_train.Pclass[data_train.Embarked == 'Q'].value_counts().reindex(classes, fill_value=0).plot(kind='bar', label='Q', color='lightblue', ax=ax3)
ax3.set_xticklabels(["1", "2", "3"], rotation=0)
ax3.legend(["Q"], loc='best')

plt.show()

These plots do not seem to reveal much more information either.
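
A cross-tabulation is another way to look at the same Embarked/Pclass question in one table instead of six panels. This is only a sketch on a made-up `toy` table, so the shares below are not the real ones.

```python
import pandas as pd

# Toy stand-in for data_train; the values are invented for illustration only.
toy = pd.DataFrame({
    "Embarked": ["C", "C", "C", "C", "S", "S", "S", "S", "Q", "Q"],
    "Pclass":   [1, 1, 1, 3, 1, 3, 3, 2, 3, 3],
})

# normalize='index' turns each row into shares that sum to 1,
# i.e. the class mix of the passengers who boarded at each port.
class_mix = pd.crosstab(toy.Embarked, toy.Pclass, normalize="index")
print(class_mix)
```

If one port turned out to have a much larger first-class share on the real data, that could be one possible explanation for its higher survival rate, though that would still need checking.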
