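[B3] Titanic Data Analysis

This is my first more-or-less complete data analysis project. It covers data acquisition, data cleaning, descriptive statistics, data visualization, and machine learning modeling. It took me two days, I hit plenty of bugs along the way, and the raw data has problems of its own, so there are quite a few flaws. Feedback from more experienced readers is very welcome!
Contents:
1. Data acquisition
2. Data preprocessing
3. Descriptive statistics
4. Variable distributions
5. Exploring relationships between variables
6. Feature processing
7. Machine learning modeling
8. Evaluating model accuracy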

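Step 1: Data acquisition
Fetch the data directly from the internet.

import pandas as pd
# Use pandas' read_csv function to load the Titanic passenger data directly from the internet
titanic = pd.read_csv('http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt')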

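Looking at the first few rows shows that the columns hold several different kinds of data.
Step 2: Data preprocessing

# Check the number of rows and columns
titanic.shape
# Look at the first 3 rows
titanic.head(3)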

Step 3: Summary and descriptive statistics

# Check missing values, data types, and so on
titanic.info()

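Based on the output above, I set the following data-processing tasks:
(1) The age column has only 633 non-null values and needs to be filled in.
(2) sex and pclass are categorical and need to be converted into numeric features (0/1 indicators).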
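The two tasks above can be sketched on a tiny frame. This is only an illustration: the four rows are made up, and only the column names age, sex, and pclass come from this post. A two-valued column such as sex can be mapped to 0/1 directly, while a three-valued column such as pclass needs one indicator column per value.

```python
import pandas as pd

# Made-up rows; only the column names follow the post.
df = pd.DataFrame({
    "age": [29.0, None, 2.0, None],
    "sex": ["female", "male", "male", "female"],
    "pclass": ["1st", "3rd", "2nd", "3rd"],
})

# Task (1): count missing values per column.
missing = df.isna().sum()
print(missing)

# Task (2): a two-valued column can be mapped to 0/1 directly.
df["sex_male"] = df["sex"].map({"female": 0, "male": 1})
print(df[["sex", "sex_male"]])
```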

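# First fill in the missing age values; the mean or the median are both simple strategies that keep the distortion of the model small
# (assigning the result back avoids relying on inplace=True on a column selection)
titanic['age'] = titanic['age'].fillna(titanic['age'].mean())

# Re-inspect the data after filling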
titanic.info()

The missing age values are now filled in.
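To see how the mean and the median differ as fill values, here is a minimal sketch on six made-up ages. The variable names are hypothetical, and keeping a separate "was missing" flag is my own suggestion rather than something the original analysis did.

```python
import pandas as pd

# Made-up ages with two gaps.
ages = pd.Series([2.0, 24.0, 30.0, None, 71.0, None], name="age")

# Record which rows were missing before filling, so the information is not lost.
age_was_missing = ages.isna()

mean_filled = ages.fillna(ages.mean())
median_filled = ages.fillna(ages.median())

print(ages.mean(), ages.median())
print(mean_filled.tolist())
print(median_filled.tolist())
```

The median is less sensitive to a few very old or very young passengers, which is one reason it is sometimes preferred for skewed columns.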

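# Summarize each column: mean, max, min, quantiles, and so on
titanic.describe(include='all')

From the output above: about 34% of the passengers were rescued, and the average passenger age is 31.1.
Step 4: Variable distributions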

# 1. Survival distribution: of 1313 passengers, only 446 survived (34%)
titanic['survived'].value_counts().plot(kind='bar',color='yellow',title='Rescue situation', rot=360)

# 2. Gender distribution: of 1313 passengers, 573 were male (43.64%)
titanic['sex'].value_counts().plot(kind='bar',color='pink',title='Gender distribution', rot=360)

# 3. Cabin class distribution: third class has the most passengers, followed by first class
titanic['pclass'].value_counts().plot(kind='bar',color='green',title='Pclass distribution', rot=360)

# Next, explore the continuous variable age
# 4. Age distribution: mostly concentrated between 20 and 40
titanic['age'].plot(kind='hist',color='pink',title='Age distribution')
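The percentages quoted in the comments above can be computed rather than worked out by hand. A small sketch, using eight made-up 0/1 labels instead of the real survived column:

```python
import pandas as pd

# Made-up labels, only to show the mechanics.
survived = pd.Series([0, 0, 1, 0, 1, 0, 0, 1], name="survived")

counts = survived.value_counts()
shares = survived.value_counts(normalize=True)  # fractions instead of counts

summary = pd.DataFrame({"count": counts, "share": shares.round(3)})
print(summary)

# For a 0/1 column the mean equals the share of ones.
print(survived.mean())
```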

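Step 5: Exploring relationships between variables
1. Relationship between a single variable and survived
Passengers of higher status may have been rescued first, which suggests that age, sex, and pclass could be key factors in survival.

# First, do this with pandas grouping and aggregation (groupby followed by mean)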
titanic[['sex','survived']].groupby(['sex'], as_index=False).mean().sort_values(by='survived', ascending=False)

# The same result can be obtained with a pivot table
sex_pivot = titanic.pivot_table(index='sex',values='survived')
sex_pivot

import matplotlib.pyplot as plt
# Bar chart of the mean of survived by sex
sex_pivot.plot.bar(rot=360)

Clearly, the share of survivors among women is much higher than among men.
Now look at the relationship between pclass and survived:

titanic[['pclass','survived']].groupby(['pclass'], as_index=False).mean().sort_values(by='survived', ascending=False)

class_pivot = titanic.pivot_table(index='pclass', values='survived')
class_pivot

class_pivot.plot.bar(rot=360)

As expected, first class has the highest survival rate, which reflects the inequality in status.
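groupby(...).mean() and pivot_table give the same numbers because pivot_table aggregates with the mean by default. The sketch below checks this on six made-up rows and then adds sex as a second dimension; the values are illustrative and not taken from the real data.

```python
import pandas as pd

# Made-up rows; only the column names follow the post.
df = pd.DataFrame({
    "pclass": ["1st", "1st", "3rd", "3rd", "3rd", "1st"],
    "sex": ["female", "male", "female", "male", "male", "female"],
    "survived": [1, 0, 1, 0, 0, 1],
})

by_group = df.groupby("pclass")["survived"].mean()
by_pivot = df.pivot_table(index="pclass", values="survived")  # aggfunc defaults to mean

# Survival rate for every (pclass, sex) combination in one table.
two_way = df.pivot_table(index="pclass", columns="sex", values="survived")
print(two_way)
```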
2. Relationships between multiple variables and survived

import seaborn as sns
g= sns.FacetGrid(titanic, col='survived')
g.map(plt.hist, 'age', bins=20)

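First, the relationship between age and survival:
Among survivors, ages 20-40 make up the largest share; among those who died, ages 18-30 make up the largest share.

grid=sns.FacetGrid(titanic, col='survived',row='pclass', height=2.2, aspect=1.6)
grid.map(plt.hist, 'age',alpha=0.5,bins=20)

# Relationship between age, cabin class, and survival:
# First class has the most survivors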

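grid=sns.FacetGrid(titanic,row='embarked', height=2.2, aspect=1.6)
# 'sex' is passed as the hue variable so that hue_order has something to apply to;
# order is taken from the data so that it matches the actual pclass values
grid.map(sns.pointplot,'pclass','survived','sex',palette='deep',
         order=sorted(titanic['pclass'].dropna().unique()), hue_order=['male','female'])
grid.add_legend()

In my original run this plot failed to render. The original call set hue_order without passing 'sex' as the hue variable, and its order=[1,2,3] only works if pclass holds exactly those values; the call above addresses both points.
According to the plot: in first class, nearly all women who boarded at ports Q and S were rescued, and men who boarded at port C had a high survival rate.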
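When a plot like this is hard to read or refuses to render, the same information can be checked as a plain table. A minimal sketch on made-up rows, with port codes S and C and class labels chosen only for illustration:

```python
import pandas as pd

# Made-up rows; the column names follow the post, the values are illustrative only.
df = pd.DataFrame({
    "embarked": ["S", "S", "S", "C", "C", "C"],
    "pclass":   ["1st", "1st", "3rd", "1st", "3rd", "3rd"],
    "sex":      ["female", "male", "male", "male", "female", "male"],
    "survived": [1, 0, 0, 1, 1, 0],
})

rates = (
    df.groupby(["embarked", "pclass", "sex"])["survived"]
      .mean()
      .unstack("sex")  # one column per sex, NaN where a combination has no rows
)
print(rates)
```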
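Step 6: Feature processing
1. Extracting titles
The name column contains different forms of address such as Mr and Dr, which indicate different social status, so we pull this attribute out into a column of its own.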

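# Raw string, so that \. reaches the regex engine as a literal period
titanic['Title'] = titanic['name'].str.extract(r'([A-Za-z]+)\.', expand=False)
# Cross-tabulation of title against sex
pd.crosstab(titanic['Title'],titanic['sex'])

Because of flaws in the data I could not take this analysis any further, so I skip to the next step. (Note that the pattern only matches a title that is followed by a period, so names written without one yield no match.)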
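The effect of the name layout on the pattern can be seen on three illustrative strings. The "lenient" pattern below is my own assumption about a possible layout ("Surname, Title Given names"); it has not been checked against the real file.

```python
import pandas as pd

# Illustrative names in two layouts: with and without a period after the title.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Allen, Miss Elisabeth Walton",
    "Smith, Dr John",
])

# Pattern used in the post: letters immediately followed by a period.
strict = names.str.extract(r"([A-Za-z]+)\.", expand=False)

# Alternative (an assumption about the layout): first word after "Surname, ".
lenient = names.str.extract(r",\s*([A-Za-z]+)", expand=False)

print(pd.DataFrame({"name": names, "strict": strict, "lenient": lenient}))
```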
2. Converting age
Use Series.describe() to get an overview of the age column

titanic['age'].describe()

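Passenger ages range from 0.16 to 71.
Since age is a continuous variable, a histogram is a good way to view its distribution.
Use boolean indexing to get separate DataFrames for survivors and non-survivors.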

survived = titanic[titanic['survived']==1]
died = titanic[titanic['survived']==0]
# Overlay two histograms to compare the ages of survivors and non-survivors
survived['age'].plot.hist(alpha=0.5,color='red',bins=50)
died['age'].plot.hist(alpha=0.5,color='blue',bins=50)
plt.legend(['survived','died'])

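The ages of both survivors and victims cluster around 30. A large part of that peak is an artifact of Step 3, where the 680 missing ages (1313 - 633) were filled with the mean of about 31, although flaws in the raw data cannot be ruled out either.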
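Next, use pandas.cut() to split the age field into bands, turning it into a categorical variable.
First define a function that uses Series.fillna() to fill all missing values with -0.5,
then cut age into seven categories (six age bands plus Missing):
Missing, from -1 to 0
Infant, from 0 to 5
Child, from 5 to 12
Teenager, from 12 to 18
Young Adult, from 18 to 35
Adult, from 35 to 60
Senior, from 60 to 100

# Define the age-banding function
def process_age(df, cut_points, label_names):
    # Missing ages become -0.5 so that they fall into the (-1, 0] "Missing" bin;
    # if age was already mean-filled in Step 3, that bin simply stays empty
    df['age'] = df['age'].fillna(-0.5)
    # labels must be passed by keyword: the third positional parameter of pd.cut is `right`
    df['age_categories'] = pd.cut(df['age'], cut_points, labels=label_names)
    return df

cut_points = [-1,0,5,12,18,35,60,100]
label_names=['Missing','Infant','Child','Teenager','Young Adult','Adult','Senior']

# Apply process_age to the dataset
titanic =process_age(titanic, cut_points, label_names)

# Look at the survival rate by age band
pivot=titanic.pivot_table(index="age_categories", values='survived')
pivot.plot.bar(rot=360)
plt.show()

My original version raised an error here. One bug I can point to is that label_names was passed to pd.cut positionally, where it lands in the `right` parameter instead of `labels`; the call above passes it by keyword. If anything else looks wrong, I would be glad to hear about it.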
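pd.cut takes its labels through the `labels` keyword, and its bins include the right edge by default, so an age of exactly 18 lands in Teenager rather than Young Adult. A minimal sketch on seven made-up ages, with the label spelled "Young Adult":

```python
import pandas as pd

# Made-up ages, including one missing value.
ages = pd.Series([0.5, 5, 17, 18, 40, 71, None])
cut_points = [-1, 0, 5, 12, 18, 35, 60, 100]
label_names = ["Missing", "Infant", "Child", "Teenager",
               "Young Adult", "Adult", "Senior"]

# Bins are right-inclusive by default: (-1, 0], (0, 5], (5, 12], ...
bands = pd.cut(ages.fillna(-0.5), cut_points, labels=label_names)
print(bands.tolist())
```

With n + 1 cut points there must be exactly n labels, which is why the eight edges above pair with seven names.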

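Step 7: Machine learning modeling
1. Splitting the data
Split the dataset into two parts: 80% of the rows for training and 20% held out for testing the predictions.
The split is done with sklearn's model_selection.train_test_split().
It takes two main arguments, X (the feature variables) and y (the target variable), and returns four objects in this order: train_X, test_X, train_y, test_y.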
from sklearn.model_selection import train_test_split
# Feature columns to feed into the model.
# These one-hot (dummy) columns must already exist in titanic, e.g. created with
# pd.get_dummies; the code shown so far does not create them.
columns = [
    'pclass_1', 'pclass_2', 'pclass_3', 'sex_female', 'sex_male',
    'age_categories_Missing', 'age_categories_Infant', 'age_categories_Child',
    'age_categories_Teenager', 'age_categories_Young Adult',
    'age_categories_Adult', 'age_categories_Senior'
]
all_X =titanic[columns]

# Target variable
all_y = titanic['survived']
train_X,test_X,train_y,test_y = train_test_split(all_X,all_y, test_size=0.20, random_state=0)
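The indicator columns named in the list can be produced with pd.get_dummies, which names each new column "<column>_<value>". The sketch below uses ten made-up rows; the class labels '1st'/'2nd'/'3rd' are an assumption about how pclass is stored, so the resulting names may differ from the ones listed above. The stratify argument is my addition and keeps the survival ratio similar in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up rows; column names follow the post, values are illustrative.
df = pd.DataFrame({
    "pclass": ["1st", "2nd", "3rd", "3rd", "1st", "3rd", "2nd", "3rd", "1st", "3rd"],
    "sex": ["female", "male", "male", "female", "male",
            "male", "female", "male", "female", "male"],
    "survived": [1, 0, 0, 1, 0, 0, 1, 0, 1, 0],
})

# One indicator column per category, named "<column>_<value>".
features = pd.get_dummies(df[["pclass", "sex"]], dtype=int)
print(features.columns.tolist())

train_X, test_X, train_y, test_y = train_test_split(
    features, df["survived"],
    test_size=0.20, random_state=0,
    stratify=df["survived"],  # keep the survival ratio similar in both parts
)
print(train_X.shape, test_X.shape)
```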

2. Modeling with LogisticRegression

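# Import the LogisticRegression model from sklearn
from sklearn.linear_model import LogisticRegression
# Create a LogisticRegression object
lr = LogisticRegression()

# Train the model with LogisticRegression.fit()
# (this fit uses every row, so it only demonstrates the call; the evaluation step refits on train_X/train_y)
lr.fit(titanic[columns],titanic['survived'])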
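After fitting, the learned coefficients show which features push a prediction toward survival. A minimal sketch on eight made-up passengers where only sex is used as a feature; the data is constructed so that all women survive, which is far cleaner than the real dataset.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up training data: two indicator columns and a 0/1 target.
X = pd.DataFrame({
    "sex_female": [1, 0, 0, 1, 0, 1, 0, 1],
    "sex_male":   [0, 1, 1, 0, 1, 0, 1, 0],
})
y = pd.Series([1, 0, 0, 1, 0, 1, 0, 1], name="survived")

lr = LogisticRegression()
lr.fit(X, y)

# One coefficient per feature; positive values push predictions toward survived == 1.
coefs = pd.Series(lr.coef_[0], index=X.columns)
print(coefs)

# Predicted probability of survival for one female and one male passenger.
probe = pd.DataFrame({"sex_female": [1, 0], "sex_male": [0, 1]})
proba = lr.predict_proba(probe)[:, 1]
print(proba)
```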

Step 8: Evaluating model accuracy

lr = LogisticRegression()
lr.fit(train_X,train_y)
predictions = lr.predict(test_X)
# Evaluate accuracy with metrics.accuracy_score()
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(test_y,predictions)
print('The accuracy of LogisticRegression is:', accuracy)

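The final accuracy should be around 79%, which suggests the model predicts fairly well. For comparison, always predicting "did not survive" would score roughly 66%, given the 34% survival rate found earlier.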
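An accuracy figure is easier to judge next to a baseline and a confusion matrix. The sketch below uses ten made-up labels and predictions, not the real model output, so the numbers it prints say nothing about the 79% above.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for ten held-out passengers (1 = survived).
test_y      = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
predictions = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

accuracy = accuracy_score(test_y, predictions)

# Baseline: always predict the majority class (here 0, "did not survive").
baseline = accuracy_score(test_y, [0] * len(test_y))

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(test_y, predictions)
print(accuracy, baseline)
print(cm)
```

A model is only useful to the extent that it beats the baseline, and the off-diagonal cells of the matrix show which kind of mistake it makes more often.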

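That's it for today's case study. Working through the whole pipeline made me realize how much I still have to learn. I hope to keep improving, and I hope you will too!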

學Python的莫小白
