A Data Science Framework: To Achieve 99% Accuracy
學習data scientist的思考方式,而不是如何編碼。
目錄
A Data Science Framework: To Achieve 99% Accuracy
1 怎樣處理問題
The goal is to not just learn the whats, but the whys
二元事件結果預測一個經典問題。在外行術語中,這意味着,它要麼發生,要麼沒有發生。例如,你贏了或沒有贏,你通過了測試或沒有通過,你被接受或沒有被接受。常見應用是客戶流失或客戶保留。另一個流行用例是,醫療衛生的死亡率或生存率分析。二元事件創建了一個有趣的動態,因爲我們從統計學上知道,一個隨機猜測應該達到50%的準確率,而不需要創建任何算法或編寫一行代碼。然而,就像自動更正拼寫檢查技術一樣,有時我們人類也會因爲太聰明而不能達到自己的目的,實際上不如擲硬幣。
2 數據科學基本框架
2.1 問題定義
如果解決方案是數據科學、大數據、機器學習、預測分析、商業智能或任何其他流行詞,那麼問題是什麼?俗話說,不要把馬車放在馬前面。先問題後需求,先需求後解決,先解決後設計,先設計後技術。在確定我們要解決的實際問題之前,我們常常很快就開始使用新的“閃亮”技術、工具或算法。
項目概要:泰坦尼克號的沉沒是歷史上最臭名昭著的沉船事件之一。1912年4月15日,泰坦尼克號在處女航中與冰山相撞,2224名乘客和船員中有1502人喪生。這場轟動性的悲劇震驚了國際社會,並導致了更好的船舶安全規則。這次海難造成人員傷亡的原因之一是沒有足夠的救生艇供乘客和船員使用。儘管在沉船中倖存下來有一些運氣因素,但有些人比其他人更可能存活下來,如婦女、兒童和上層階級。在這個挑戰中,我們要求您完成對哪些人可能存活的分析。特別是,我們要求您應用機器學習工具來預測哪些乘客在悲劇中倖存下來。
問題:對所有乘客進行生存分析並預測哪些乘客可能倖存。
2.2 數據收集
JohnNaisbitt在1984年出版的《大趨勢》(MegaTrends)一書中寫道,我們“沉溺於數據,卻堅持求知。”因此,很可能數據集已經以某種形式存在於某個地方。它可能是外部的或內部的,結構化的或非結構化的,靜態的或流式的,客觀的或主觀的,等等。你不必重新發明輪子,你只需要知道在哪裏找到它。在下一步中,我們將考慮將“髒數據”轉換爲“乾淨數據”。
數據下載:Kaggle's Titanic: Machine Learning from Disaster
2.3 可用數據準備與數據清洗
這一步驟通常被稱爲data wrangling,是將“野生”數據轉變爲“可管理”數據的必要過程。data wrangling包括爲存儲和處理實現數據架構、爲質量和控制制定數據治理標準、數據提取(即ETL和Web抓取)和數據清理,以識別異常、丟失或異常的數據點。
2.3.1 導入包
# 基於Python 3的常用包
#load packages
import sys #access to system parameters https://docs.python.org/3/library/sys.html
print("Python version: {}". format(sys.version))
import pandas as pd # 數據處理和分析建模
print("pandas version: {}". format(pd.__version__))
import matplotlib #可視化
print("matplotlib version: {}". format(matplotlib.__version__))
import numpy as np #科學計算
print("NumPy version: {}". format(np.__version__))
import scipy as sp #科學計算與高級函數集合
print("SciPy version: {}". format(sp.__version__))
import IPython
from IPython import display #pretty printing of dataframes in Jupyter notebook
print("IPython version: {}". format(IPython.__version__))
import sklearn #機器學習算法庫
print("scikit-learn version: {}". format(sklearn.__version__))
#其他的庫
import random
import time
#ignore warnings
import warnings
warnings.filterwarnings('ignore')
print('-'*25)
# List the files in the data directory
from subprocess import check_output
print(check_output(["ls", "../data"]).decode("utf8"))
import os #作用同上
print('\n'.join(os.listdir("../data")))
print結果:
Python version: 3.7.3 (default, Mar 27 2019, 17:13:21) [MSC v.1915 64 bit (AMD64)]
pandas version: 0.24.2
matplotlib version: 3.0.3
NumPy version: 1.16.4
SciPy version: 1.2.1
IPython version: 7.4.0
scikit-learn version: 0.21.1
--------------------------------------------------
gender_submission.csv
test.csv
train.csv
導入常見的算法包
#Common Model Algorithms
from sklearn import svm, tree, linear_model, neighbors, naive_bayes, ensemble, discriminant_analysis, gaussian_process
from xgboost import XGBClassifier
#Common Model Helpers
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn import feature_selection
from sklearn import model_selection
from sklearn import metrics
#可視化
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import seaborn as sns
from pandas.tools.plotting import scatter_matrix
#設置可視化參數
#%matplotlib inline
mpl.style.use('ggplot')
sns.set_style('white')
pylab.rcParams['figure.figsize'] = 12,8
2.3.2 初見數據
瞭解你的數據,它是什麼樣子的(數據類型和值),是什麼讓它選中(獨立/特徵變量),它在生活中的意義是什麼(依賴/目標變量)。
Survived variable:是結果或因變量。它是一個二元數據類型,1表示生存,0表示不生存。所有其他變量都是潛在預測變量或獨立變量。值得注意的是,預測變量越多不意味着模型越好,只有選擇正確變量才能使模型更好。
Passengerid and Ticket variables:是隨機的唯一標識符,對結果變量沒有影響。因此,它們可能被排除在分析之外。
Pclass variable:是ticket類的順序數據類型,代表社會經濟地位(ses),表示1=上層階級,2=中產階級,3=下層階級。
Name variable:是一個名義數據類型。它可以用於特徵工程中,從頭銜中得出性別,從姓氏中得出家庭規模,從諸如博士或碩士的頭銜中得出社會地位。由於這些變量已經存在,我們將利用它來得到Tilte variable,像master一樣,是否會產生影響。
Sex and Embarked variables:是一個名義的數據類型。它們將轉換爲虛擬變量進行數學計算。
Age and Fare variable:是連續的定量數據類型。
SibSp and Parch variables: SibSp代表船上其兄弟姐妹/配偶的數量,PARCH代表船上其父母/子女的數量。兩者都是離散的定量數據類型。這可用於特徵工程以構建家族大小,並且僅是變量。
Cabin and SES variables:Cabin(座艙)變量是一個名義數據類型,可用於特徵工程中,在事故發生時船上的近似位置和甲板上的SES。但是,由於有許多空值,可供價值不大,因此被排除在分析之外。
#import data from file: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
data_raw = pd.read_csv('../data/train.csv')
#a dataset組成: train, test, and (final) validation 在此作者在後面將train.csv分爲train set和test set兩部分,用於模型構建與測試,test.csv作爲validation set.
data_val = pd.read_csv('../data/test.csv')
#創建一個data副本用於後續分析處理
#remember python assignment or equal passes by reference vs values, so we use the copy function: https://stackoverflow.com/questions/46327494/python-pandas-dataframe-copydeep-false-vs-copydeep-true-vs
data1 = data_raw.copy(deep = True)
#將train set和test set合併處理
data_cleaner = [data1, data_val]
#數據概覽
print (data_raw.info()) #查看各字段的信息https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.info.html
print(data_raw.head(5)) #顯示前5行數據https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html
print(data_raw.tail(5)) #顯示最後5行https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.tail.html
print(data_raw.sample(10)) #隨機顯示10行https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sample.html
print(data_raw.columns) #查看列名https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.columns.html
print(data_raw.shape) #查看數據集行列分佈,幾行幾列https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.shape.html
print(data_raw.describe()) #查看數據的大體情況https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html
data_raw.info結果:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
None
data_raw.sample(10)結果:
data_raw.describe()結果:
2.3.3 數據清洗的4個C:糾正(Correcting)、填補(Completing)、創建(Creating)和轉換(Converting)
1)糾正異常值,2)完成丟失的信息,3)創建新的分析功能,以及4)將字段轉換爲正確的計算格式來清理數據
1)糾正:檢查數據,沒有任何異常或不可接受的數據輸入。
我們發現在年齡和票價上可能存在潛在的異常值。但是,由於它們是合理的值,我們將等到完成探索性分析後再確定是否應該排除在數據集中。如果它們肯定是不合理的值,例如年齡=800,那麼可以直接剔除。但是,當我們從原始值修改數據時要小心,因爲可能需要創建精確的模型。
2)填補:缺少值可能很糟糕,因爲有些算法不能處理空值。而其他的,如決策樹,可以處理空值。因此,在開始建模之前填補空值是很重要的,因爲我們將比較和對比幾個模型。有兩種常見的方法,要麼刪除記錄,要麼使用合理的輸入填充缺少的值。不建議刪除記錄,尤其是大部分記錄,除非它確實代表一個不完整的記錄。相反,最好將缺失的值歸集。定性數據的一種基本方法是使用模式輸入。定量數據的基本方法是使用平均值、中值或平均值+隨機標準差進行輸入。
年齡、船艙和SES字段中存在空值。中間方法是根據具體標準使用基本方法,如按平均年齡或按票價和SES上船港口。然而,在部署之前,有更復雜的方法論,應該將其與基礎模型進行比較,以確定複雜性是否真的有價值。對於這個數據集,年齡將由中間值輸入,座艙屬性將被刪除,登船將由模式輸入。隨後的模型迭代可能會修改這個決定,以確定它是否提高了模型的準確性。
3)創建:特徵工程就是我們使用現有的特徵來創建新的特徵,以確定它們是否提供新的信息來預測結果。對於這個數據集,我們將創建一個Title變量來確定它是否在生存中起到作用。
4)轉換:最後,我們將數據處理格式化。沒有日期或貨幣格式,只有數據類型格式。我們的分類數據作爲對象導入,這使得數學計算變得困難。對於這個數據集,我們將把對象數據類型轉換爲分類虛擬變量。
print('Train columns with null values:\n', data1.isnull().sum())
print("-"*10)
print('Test/Validation columns with null values:\n', data_val.isnull().sum())
print("-"*10)
結果:
Train columns with null values:
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
----------
Test/Validation columns with null values:
PassengerId 0
Pclass 0
Name 0
Sex 0
Age 86
SibSp 0
Parch 0
Ticket 0
Fare 1
Cabin 327
Embarked 0
dtype: int64
----------
Now that we know what to clean, let's execute our code.
a 數據清洗
Developer Documentation:
- pandas.DataFrame
- pandas.DataFrame.info
- pandas.DataFrame.describe
- Indexing and Selecting Data
- pandas.isnull
- pandas.DataFrame.sum
- pandas.DataFrame.mode
- pandas.DataFrame.copy
- pandas.DataFrame.fillna
- pandas.DataFrame.drop
- pandas.Series.value_counts
- pandas.DataFrame.loc
###填充或刪除缺失值
for dataset in data_cleaner:
#complete missing age with median
dataset['Age'].fillna(dataset['Age'].median(), inplace = True)
#complete embarked with mode
dataset['Embarked'].fillna(dataset['Embarked'].mode()[0], inplace = True)
#complete missing fare with median
dataset['Fare'].fillna(dataset['Fare'].median(), inplace = True)
#刪除 Cabin feature/column
drop_column = ['PassengerId','Cabin', 'Ticket']
data1.drop(drop_column, axis=1, inplace = True)
print(data1.isnull().sum())
print("-"*10)
print(data_val.isnull().sum())
結果:
Survived 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Fare 0
Embarked 0
dtype: int64
----------
PassengerId 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 327
Embarked 0
dtype: int64
###3)創建特徵
for dataset in data_cleaner:
#Discrete variables
dataset['FamilySize'] = dataset ['SibSp'] + dataset['Parch'] + 1
dataset['IsAlone'] = 1 #initialize to yes/1 is alone
dataset['IsAlone'].loc[dataset['FamilySize'] > 1] = 0 # now update to no/0 if family size is greater than 1
#從name中提取Title: http://www.pythonforbeginners.com/dictionary/python-split
dataset['Title'] = dataset['Name'].str.split(", ", expand=True)[1].str.split(".", expand=True)[0]
#Fare與Age連續區間劃分; qcut vs cut: https://stackoverflow.com/questions/30211923/what-is-the-difference-between-pandas-qcut-and-pandas-cut
#Fare Bins/Buckets using qcut or frequency bins: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.qcut.html
dataset['FareBin'] = pd.qcut(dataset['Fare'], 4)
#Age Bins/Buckets using cut or value bins: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.cut.html
dataset['AgeBin'] = pd.cut(dataset['Age'].astype(int), 5)
#清除數目較少的Title
print(data1['Title'].value_counts())
stat_min = 10 #公共最小值http://nicholasjjackson.com/2012/03/08/sample-size-is-10-a-magic-number/
title_names = (data1['Title'].value_counts() < stat_min) #this will create a true false series with title name as index
#apply and lambda functions 查找與替換: https://community.modeanalytics.com/python/tutorial/pandas-groupby-and-python-lambda-functions/
data1['Title'] = data1['Title'].apply(lambda x: 'Misc' if title_names.loc[x] == True else x)
print(data1['Title'].value_counts())
print("-"*10)
#preview data again
data1.info()
data_val.info()
data1.sample(10)
結果:
Mr 517
Miss 182
Mrs 125
Master 40
Misc 27
Name: Title, dtype: int64
----------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 14 columns):
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 891 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Fare 891 non-null float64
Embarked 891 non-null object
FamilySize 891 non-null int64
IsAlone 891 non-null int64
Title 891 non-null object
FareBin 891 non-null category
AgeBin 891 non-null category
dtypes: category(2), float64(2), int64(6), object(4)
memory usage: 85.5+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 16 columns):
PassengerId 418 non-null int64
Pclass 418 non-null int64
Name 418 non-null object
Sex 418 non-null object
Age 418 non-null float64
SibSp 418 non-null int64
Parch 418 non-null int64
Ticket 418 non-null object
Fare 418 non-null float64
Cabin 91 non-null object
Embarked 418 non-null object
FamilySize 418 non-null int64
IsAlone 418 non-null int64
Title 418 non-null object
FareBin 418 non-null category
AgeBin 418 non-null category
dtypes: category(2), float64(2), int64(6), object(6)
memory usage: 46.8+ KB
data1.sample(10):
b 格式轉換
Developer Documentation:
- Categorical Encoding
- Sklearn LabelEncoder
- Sklearn OneHotEncoder
- Pandas Categorical dtype
- pandas.get_dummies
#轉換:Label Encoder
label = LabelEncoder()
for dataset in data_cleaner:
dataset['Sex_Code'] = label.fit_transform(dataset['Sex'])
dataset['Embarked_Code'] = label.fit_transform(dataset['Embarked'])
dataset['Title_Code'] = label.fit_transform(dataset['Title'])
dataset['AgeBin_Code'] = label.fit_transform(dataset['AgeBin'])
dataset['FareBin_Code'] = label.fit_transform(dataset['FareBin'])
#define target
Target = ['Survived']
#define x variables
data1_x = ['Sex','Pclass', 'Embarked', 'Title','SibSp', 'Parch', 'Age', 'Fare', 'FamilySize', 'IsAlone']
data1_x_calc = ['Sex_Code','Pclass', 'Embarked_Code', 'Title_Code','SibSp', 'Parch', 'Age', 'Fare']
data1_xy = Target + data1_x
print('Original X Y: ', data1_xy, '\n')
#連續型變量變換
data1_x_bin = ['Sex_Code','Pclass', 'Embarked_Code', 'Title_Code', 'FamilySize', 'AgeBin_Code', 'FareBin_Code']
data1_xy_bin = Target + data1_x_bin
print('Bin X Y: ', data1_xy_bin, '\n')
#虛擬變量
data1_dummy = pd.get_dummies(data1[data1_x])
data1_x_dummy = data1_dummy.columns.tolist()
data1_xy_dummy = Target + data1_x_dummy
print('Dummy X Y: ', data1_xy_dummy, '\n')
data1_dummy.head()
結果:
Original X Y: ['Survived', 'Sex', 'Pclass', 'Embarked', 'Title', 'SibSp', 'Parch', 'Age', 'Fare', 'FamilySize', 'IsAlone']
Bin X Y: ['Survived', 'Sex_Code', 'Pclass', 'Embarked_Code', 'Title_Code', 'FamilySize', 'AgeBin_Code', 'FareBin_Code']
Dummy X Y: ['Survived', 'Pclass', 'SibSp', 'Parch', 'Age', 'Fare', 'FamilySize', 'IsAlone', 'Sex_female', 'Sex_male', 'Embarked_C', 'Embarked_Q', 'Embarked_S', 'Title_Master', 'Title_Misc', 'Title_Miss', 'Title_Mr', 'Title_Mrs']
2.3.4 再次檢查
print('Train columns with null values: \n', data1.isnull().sum())
print("-"*10)
print (data1.info())
print("-"*10)
print('Test/Validation columns with null values: \n', data_val.isnull().sum())
print("-"*10)
print (data_val.info())
print("-"*10)
data_raw.describe(include = 'all')
結果:
Train columns with null values:
Survived 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Fare 0
Embarked 0
FamilySize 0
IsAlone 0
Title 0
FareBin 0
AgeBin 0
Sex_Code 0
Embarked_Code 0
Title_Code 0
AgeBin_Code 0
FareBin_Code 0
dtype: int64
----------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 19 columns):
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 891 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Fare 891 non-null float64
Embarked 891 non-null object
FamilySize 891 non-null int64
IsAlone 891 non-null int64
Title 891 non-null object
FareBin 891 non-null category
AgeBin 891 non-null category
Sex_Code 891 non-null int64
Embarked_Code 891 non-null int64
Title_Code 891 non-null int64
AgeBin_Code 891 non-null int64
FareBin_Code 891 non-null int64
dtypes: category(2), float64(2), int64(11), object(4)
memory usage: 120.3+ KB
None
----------
Test/Validation columns with null values:
PassengerId 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 327
Embarked 0
FamilySize 0
IsAlone 0
Title 0
FareBin 0
AgeBin 0
Sex_Code 0
Embarked_Code 0
Title_Code 0
AgeBin_Code 0
FareBin_Code 0
dtype: int64
----------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 21 columns):
PassengerId 418 non-null int64
Pclass 418 non-null int64
Name 418 non-null object
Sex 418 non-null object
Age 418 non-null float64
SibSp 418 non-null int64
Parch 418 non-null int64
Ticket 418 non-null object
Fare 418 non-null float64
Cabin 91 non-null object
Embarked 418 non-null object
FamilySize 418 non-null int64
IsAlone 418 non-null int64
Title 418 non-null object
FareBin 418 non-null category
AgeBin 418 non-null category
Sex_Code 418 non-null int64
Embarked_Code 418 non-null int64
Title_Code 418 non-null int64
AgeBin_Code 418 non-null int64
FareBin_Code 418 non-null int64
dtypes: category(2), float64(2), int64(11), object(6)
memory usage: 63.1+ KB
None
----------
data_raw.describe(include = 'all')結果:
2.3.5 數據集劃分
如前所述,所提供的測試文件實際上是提交競賽的驗證數據。因此,我們將使用sklearn的train_test_split函數將訓練數據拆分爲兩個數據集:75/25拆分,保證我們的模型不會過擬合。也就是說,該算法特定於一個給定的子集,它不能歸納另一個子集。重要的是,我們的算法沒有用過我們用來測試的子集,所以它不會產生“欺騙性”記憶。在後面的部分中,我們還將使用sklearn的交叉驗證功能,該功能將我們的數據集拆分爲用於數據建模比較的訓練和測試。
#random_state -> seed or control random number generator: https://www.quora.com/What-is-seed-in-random-number-generation
train1_x, test1_x, train1_y, test1_y = model_selection.train_test_split(data1[data1_x_calc], data1[Target], random_state = 0)
train1_x_bin, test1_x_bin, train1_y_bin, test1_y_bin = model_selection.train_test_split(data1[data1_x_bin], data1[Target] , random_state = 0)
train1_x_dummy, test1_x_dummy, train1_y_dummy, test1_y_dummy = model_selection.train_test_split(data1_dummy[data1_x_dummy], data1[Target], random_state = 0)
print("Data1 Shape: {}".format(data1.shape))
print("Train1 Shape: {}".format(train1_x.shape))
print("Test1 Shape: {}".format(test1_x.shape))
train1_x_bin.head()
結果:
Data1 Shape: (891, 19)
Train1 Shape: (668, 8)
Test1 Shape: (223, 8)
train_x_bin.head() 結果:
2.4 探索性分析
任何處理過數據的人都知道,垃圾輸入,垃圾輸出(GIGO)。因此,部署描述性統計和圖形統計以在數據集中查找潛在的問題、模式、分類、相關性和比較是很重要的。此外,數據分類(即定性與定量)對於理解和選擇正確的假設檢驗或數據模型也很重要。
在經過之前的操作之後得到乾淨數據,我們將使用描述性和圖形統計來探索我們的數據,以描述和總結我們的變量。在這個階段,將對特性進行分類,並確定它們與目標變量以及彼此之間的相關性。
#羣體生存率的相關分析: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html
for x in data1_x:
if data1[x].dtype != 'float64' :
print('Survival Correlation by:', x)
print(data1[[x, Target[0]]].groupby(x, as_index=False).mean())
print('-'*10, '\n')
#crosstabs卡方檢驗: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.crosstab.html
print(pd.crosstab(data1['Title'],data1[Target[0]]))
結果:
Survival Correlation by: Sex
Sex Survived
0 female 0.742038
1 male 0.188908
----------
Survival Correlation by: Pclass
Pclass Survived
0 1 0.629630
1 2 0.472826
2 3 0.242363
----------
Survival Correlation by: Embarked
Embarked Survived
0 C 0.553571
1 Q 0.389610
2 S 0.339009
----------
Survival Correlation by: Title
Title Survived
0 Master 0.575000
1 Misc 0.444444
2 Miss 0.697802
3 Mr 0.156673
4 Mrs 0.792000
----------
Survival Correlation by: SibSp
SibSp Survived
0 0 0.345395
1 1 0.535885
2 2 0.464286
3 3 0.250000
4 4 0.166667
5 5 0.000000
6 8 0.000000
----------
Survival Correlation by: Parch
Parch Survived
0 0 0.343658
1 1 0.550847
2 2 0.500000
3 3 0.600000
4 4 0.000000
5 5 0.200000
6 6 0.000000
----------
Survival Correlation by: FamilySize
FamilySize Survived
0 1 0.303538
1 2 0.552795
2 3 0.578431
3 4 0.724138
4 5 0.200000
5 6 0.136364
6 7 0.333333
7 8 0.000000
8 11 0.000000
----------
Survival Correlation by: IsAlone
IsAlone Survived
0 0 0.505650
1 1 0.303538
----------
Survived 0 1
Title
Master 17 23
Misc 15 12
Miss 55 127
Mr 436 81
Mrs 26 99
定量數據matplotlib可視化:
#optional plotting w/pandas: https://pandas.pydata.org/pandas-docs/stable/visualization.html
#we will use matplotlib.pyplot: https://matplotlib.org/api/pyplot_api.html to organize our graphics will use figure: https://matplotlib.org/api/_as_gen/matplotlib.pyplot.figure.html#matplotlib.pyplot.figure
#subplot:https://matplotlib.org/api/_as_gen/matplotlib.pyplot.subplot.html#matplotlib.pyplot.subplot
#and subplotS: https://matplotlib.org/api/_as_gen/matplotlib.pyplot.subplots.html?highlight=matplotlib%20pyplot%20subplots#matplotlib.pyplot.subplots
#定量數據
plt.figure(figsize=[16,12])
plt.subplot(231)
plt.boxplot(x=data1['Fare'], showmeans = True, meanline = True)
plt.title('Fare Boxplot')
plt.ylabel('Fare ($)')
plt.subplot(232)
plt.boxplot(data1['Age'], showmeans = True, meanline = True)
plt.title('Age Boxplot')
plt.ylabel('Age (Years)')
plt.subplot(233)
plt.boxplot(data1['FamilySize'], showmeans = True, meanline = True)
plt.title('Family Size Boxplot')
plt.ylabel('Family Size (#)')
plt.subplot(234)
plt.hist(x = [data1[data1['Survived']==1]['Fare'], data1[data1['Survived']==0]['Fare']],
stacked=True, color = ['g','r'],label = ['Survived','Dead'])
plt.title('Fare Histogram by Survival')
plt.xlabel('Fare ($)')
plt.ylabel('# of Passengers')
plt.legend()
plt.subplot(235)
plt.hist(x = [data1[data1['Survived']==1]['Age'], data1[data1['Survived']==0]['Age']],
stacked=True, color = ['g','r'],label = ['Survived','Dead'])
plt.title('Age Histogram by Survival')
plt.xlabel('Age (Years)')
plt.ylabel('# of Passengers')
plt.legend()
plt.subplot(236)
plt.hist(x = [data1[data1['Survived']==1]['FamilySize'], data1[data1['Survived']==0]['FamilySize']],
stacked=True, color = ['g','r'],label = ['Survived','Dead'])
plt.title('Family Size Histogram by Survival')
plt.xlabel('Family Size (#)')
plt.ylabel('# of Passengers')
plt.legend()
結果:
Seaborn作圖:
#使用Seaborn圖形進行多變量比較: https://seaborn.pydata.org/api.html
#個體特徵圖
fig, saxis = plt.subplots(2, 3,figsize=(16,12))
sns.barplot(x = 'Embarked', y = 'Survived', data=data1, ax = saxis[0,0])
sns.barplot(x = 'Pclass', y = 'Survived', order=[1,2,3], data=data1, ax = saxis[0,1])
sns.barplot(x = 'IsAlone', y = 'Survived', order=[1,0], data=data1, ax = saxis[0,2])
sns.pointplot(x = 'FareBin', y = 'Survived', data=data1, ax = saxis[1,0])
sns.pointplot(x = 'AgeBin', y = 'Survived', data=data1, ax = saxis[1,1])
sns.pointplot(x = 'FamilySize', y = 'Survived', data=data1, ax = saxis[1,2])
結果:
定性變量:
#定性變量: Sex
#we know sex mattered in survival, now let's compare sex and a 2nd feature
fig, qaxis = plt.subplots(1,3,figsize=(14,12))
sns.barplot(x = 'Sex', y = 'Survived', hue = 'Embarked', data=data1, ax = qaxis[0])
axis1.set_title('Sex vs Embarked Survival Comparison')
sns.barplot(x = 'Sex', y = 'Survived', hue = 'Pclass', data=data1, ax = qaxis[1])
axis1.set_title('Sex vs Pclass Survival Comparison')
sns.barplot(x = 'Sex', y = 'Survived', hue = 'IsAlone', data=data1, ax = qaxis[2])
axis1.set_title('Sex vs IsAlone Survival Comparison')
結果:
side-by-side comparisons:
#more side-by-side comparisons
fig, (maxis1, maxis2) = plt.subplots(1, 2,figsize=(14,12))
#how does family size factor with sex & survival compare
sns.pointplot(x="FamilySize", y="Survived", hue="Sex", data=data1,
palette={"male": "blue", "female": "pink"},
markers=["*", "o"], linestyles=["-", "--"], ax = maxis1)
#how does class factor with sex & survival compare
sns.pointplot(x="Pclass", y="Survived", hue="Sex", data=data1,
palette={"male": "blue", "female": "pink"},
markers=["*", "o"], linestyles=["-", "--"], ax = maxis2)
結果:
#how does embark port factor with class, sex, and survival compare facetgrid: https://seaborn.pydata.org/generated/seaborn.FacetGrid.html
e = sns.FacetGrid(data1, col = 'Embarked')
e.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', ci=95.0, palette = 'deep')
e.add_legend()
結果:
#plot distributions of age of passengers who survived or did not survive
a = sns.FacetGrid( data1, hue = 'Survived', aspect=4 )
a.map(sns.kdeplot, 'Age', shade= True )
a.set(xlim=(0 , data1['Age'].max()))
a.add_legend()
結果:
#histogram comparison of sex, class, and age by survival
h = sns.FacetGrid(data1, row = 'Sex', col = 'Pclass', hue = 'Survived')
h.map(plt.hist, 'Age', alpha = .75)
h.add_legend()
結果:
#兩兩比對
pp = sns.pairplot(data1, hue = 'Survived', palette = 'deep', size=1.2, diag_kind = 'kde', diag_kws=dict(shade=True), plot_kws=dict(s=10) )
pp.set(xticklabels=[])
結果:
#correlation heatmap of dataset
def correlation_heatmap(df):
_ , ax = plt.subplots(figsize =(14, 12))
colormap = sns.diverging_palette(220, 10, as_cmap = True)
_ = sns.heatmap(
df.corr(),
cmap = colormap,
square=True,
cbar_kws={'shrink':.9 },
ax=ax,
annot=True,
linewidths=0.1,vmax=1.0, linecolor='white',
annot_kws={'fontsize':12 }
)
plt.title('Pearson Correlation of Features', y=1.05, size=15)
correlation_heatmap(data1)
結果:
2.5 數據建模
與描述性統計和推斷統計一樣,數據建模既可以總結數據,也可以預測未來的結果。您的數據集和預期結果將決定可用的算法。重要的是要記住,算法只是工具,而不是魔法棒。你必須仍然是精通工藝的人,知道如何爲工作選擇合適的工具。一個類比就是讓別人給你一把Philip螺絲刀,然後他們給你一把平頭螺絲刀,或者最糟糕的是一把錘子。最壞的情況是,這使你不能完成這項任務。數據建模也是如此。錯誤的模型會導致糟糕的性能,而最壞的結果是錯誤的結論(這將被用作智能操作)。
數據科學是數學(即統計學、線性代數等)、計算機科學(即編程語言、計算機系統等)和商業管理(即通信、項目管理等)之間的多學科領域。大多數數據科學家來自三個領域中的一個,所以他們傾向於傾向於這一學科。
Machine Learning (ML) is teaching the machine how-to think and not what to think.
機器學習的目的是解決人類的問題。機器學習可分爲:監督學習、無監督學習和強化學習。監督學習是模型有Label。無監督學習沒有。強化學習是前兩種模式的混合,沒有立即給出正確答案,而是在隨後的一系列事件之後,進行強化學習。我們正在進行有監督的機器學習,因爲我們通過向算法提供一組特性及其相應的目標來訓練它。然後,我們希望從同一個數據集中爲它提供一個新的子集,並在預測準確性方面有類似的結果。
有許多機器學習算法,但是它們可以簡化爲四個類別:分類、迴歸、聚類或降維,這取決於你的目標變量和數據建模目標。今天將重點放在分類和迴歸上,我們可以歸納爲連續目標變量需要回歸算法,離散目標變量需要分類算法。另一個需要注意的是,邏輯迴歸雖然名字中有迴歸,但實際上是一種分類算法。由於我們的問題是預測乘客是否倖存,這是一個離散的目標變量。我們將使用sklearn庫中的分類算法開始分析。我們將使用交叉驗證和評分指標(在後面的章節中討論)對算法的性能進行排名和比較。
Machine Learning Selection:
- Sklearn Estimator Overview
- Sklearn Estimator Detail
- Choosing Estimator Mind Map
- Choosing Estimator Cheat Sheet
Machine Learning Classification Algorithms:
- Ensemble Methods
- Generalized Linear Models (GLM)
- Naive Bayes
- Nearest Neighbors
- Support Vector Machines (SVM)
- Decision Trees
- Discriminant Analysis
2.5.1 怎樣選擇算法
當涉及到數據建模時,初學者的問題總是,“什麼是最好的機器學習算法?”對於這一點,初學者必須學習機器學習的無免費午餐定理(No Free Lunch Theorem)。簡言之,NFLT指出,對於所有數據集,沒有一種超級算法在所有情況下都能發揮最佳效果。因此,最好的方法是嘗試多個MLA,對它們進行優化,並根據您的特定場景對它們進行比較。有鑑於此,已經做了一些很好的研究來比較算法,例如Caruana和Niculescu Mizil 2006,觀看MLA比較的視頻講座。Ogutu,2011用於基因組選擇。Fernandez Delgado et al,2014對比17個家庭的179個分類器,Thoma 2016 sklearn對比,還有一個學派認爲,更多的數據勝過更好的算法。
有了這些信息,初學者從哪裏開始呢?我建議從Trees, Bagging, Random Forests, and Boosting開始。它們基本上是決策樹的不同實現,這是最容易學習和理解的概念。在下一節中討論的,它們也比SVC更容易調優。下面,我將簡要介紹如何運行和比較幾個MLA,但是這個kernel的其餘部分將集中在通過決策樹及其派生工具學習數據建模。
#Machine Learning Algorithm (MLA) Selection and Initialization
MLA = [
#Ensemble Methods
ensemble.AdaBoostClassifier(),
ensemble.BaggingClassifier(),
ensemble.ExtraTreesClassifier(),
ensemble.GradientBoostingClassifier(),
ensemble.RandomForestClassifier(),
#Gaussian Processes
gaussian_process.GaussianProcessClassifier(),
#GLM
linear_model.LogisticRegressionCV(),
linear_model.PassiveAggressiveClassifier(),
linear_model.RidgeClassifierCV(),
linear_model.SGDClassifier(),
linear_model.Perceptron(),
#Navies Bayes
naive_bayes.BernoulliNB(),
naive_bayes.GaussianNB(),
#Nearest Neighbor
neighbors.KNeighborsClassifier(),
#SVM
svm.SVC(probability=True),
svm.NuSVC(probability=True),
svm.LinearSVC(),
#Trees
tree.DecisionTreeClassifier(),
tree.ExtraTreeClassifier(),
#Discriminant Analysis
discriminant_analysis.LinearDiscriminantAnalysis(),
discriminant_analysis.QuadraticDiscriminantAnalysis(),
#xgboost: http://xgboost.readthedocs.io/en/latest/model.html
XGBClassifier()
]
#split dataset in cross-validation with this splitter class: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ShuffleSplit.html#sklearn.model_selection.ShuffleSplit
#note: this is an alternative to train_test_split
cv_split = model_selection.ShuffleSplit(n_splits = 10, test_size = .3, train_size = .6, random_state = 0 ) # run model 10x with 60/30 split intentionally leaving out 10%
#create table to compare MLA metrics
MLA_columns = ['MLA Name', 'MLA Parameters','MLA Train Accuracy Mean', 'MLA Test Accuracy Mean', 'MLA Test Accuracy 3*STD' ,'MLA Time']
MLA_compare = pd.DataFrame(columns = MLA_columns)
#create table to compare MLA predictions
MLA_predict = data1[Target]
#index through MLA and save performance to table
row_index = 0
for alg in MLA:
#set name and parameters
MLA_name = alg.__class__.__name__
MLA_compare.loc[row_index, 'MLA Name'] = MLA_name
MLA_compare.loc[row_index, 'MLA Parameters'] = str(alg.get_params())
#score model with cross validation: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html#sklearn.model_selection.cross_validate
cv_results = model_selection.cross_validate(alg, data1[data1_x_bin], data1[Target], cv = cv_split,return_train_score = True)
MLA_compare.loc[row_index, 'MLA Time'] = cv_results['fit_time'].mean()
MLA_compare.loc[row_index, 'MLA Train Accuracy Mean'] = cv_results['train_score'].mean()
MLA_compare.loc[row_index, 'MLA Test Accuracy Mean'] = cv_results['test_score'].mean()
#if this is a non-bias random sample, then +/-3 standard deviations (std) from the mean, should statistically capture 99.7% of the subsets
MLA_compare.loc[row_index, 'MLA Test Accuracy 3*STD'] = cv_results['test_score'].std()*3 #let's know the worst that can happen!
#save MLA predictions - see section 6 for usage
alg.fit(data1[data1_x_bin], data1[Target])
MLA_predict[MLA_name] = alg.predict(data1[data1_x_bin])
row_index+=1
#print and sort table: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html
MLA_compare.sort_values(by = ['MLA Test Accuracy Mean'], ascending = False, inplace = True)
MLA_compare
#MLA_predict
結果:
#barplot using https://seaborn.pydata.org/generated/seaborn.barplot.html
sns.barplot(x='MLA Test Accuracy Mean', y = 'MLA Name', data = MLA_compare, color = 'm')
#prettify using pyplot: https://matplotlib.org/api/pyplot_api.html
plt.title('Machine Learning Algorithm Accuracy Score \n')
plt.xlabel('Accuracy Score (%)')
plt.ylabel('Algorithm')
結果:
2.5.2 模型評估
到此,通過一些基本的數據清理、分析和機器學習算法(MLA),我們能夠以大約82%的精度預測乘客的生存率。
制定最低正確率,我們知道1502/2224或67.5%的人死亡。因此,如果我們只是預測最常見的事件,即100%的人死亡,那麼我們正確率應達到67.5%。所以把68%設爲最差性能,低於此則不加考慮。
#group by or pivot table: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html
pivot_female = data1[data1.Sex=='female'].groupby(['Sex','Pclass', 'Embarked','FareBin'])['Survived'].mean()
print('Survival Decision Tree w/Female Node: \n',pivot_female)
pivot_male = data1[data1.Sex=='male'].groupby(['Sex','Title'])['Survived'].mean()
print('\n\nSurvival Decision Tree w/Male Node: \n',pivot_male)
結果:
Survival Decision Tree w/Female Node:
Sex Pclass Embarked FareBin
female 1 C (14.454, 31.0] 0.666667
(31.0, 512.329] 1.000000
Q (31.0, 512.329] 1.000000
S (14.454, 31.0] 1.000000
(31.0, 512.329] 0.955556
2 C (7.91, 14.454] 1.000000
(14.454, 31.0] 1.000000
(31.0, 512.329] 1.000000
Q (7.91, 14.454] 1.000000
S (7.91, 14.454] 0.875000
(14.454, 31.0] 0.916667
(31.0, 512.329] 1.000000
3 C (-0.001, 7.91] 1.000000
(7.91, 14.454] 0.428571
(14.454, 31.0] 0.666667
Q (-0.001, 7.91] 0.750000
(7.91, 14.454] 0.500000
(14.454, 31.0] 0.714286
S (-0.001, 7.91] 0.533333
(7.91, 14.454] 0.448276
(14.454, 31.0] 0.357143
(31.0, 512.329] 0.125000
Name: Survived, dtype: float64
Survival Decision Tree w/Male Node:
Sex Title
male Master 0.575000
Misc 0.250000
Mr 0.156673
Name: Survived, dtype: float64
#handmade data model using brain power (and Microsoft Excel Pivot Tables for quick calculations)
def mytree(df):
#initialize table to store predictions
Model = pd.DataFrame(data = {'Predict':[]})
male_title = ['Master'] #survived titles
for index, row in df.iterrows():
#Question 1: Were you on the Titanic; majority died
Model.loc[index, 'Predict'] = 0
#Question 2: Are you female; majority survived
if (df.loc[index, 'Sex'] == 'female'):
Model.loc[index, 'Predict'] = 1
#Question 3A Female - Class and Question 4 Embarked gain minimum information
#Question 5B Female - FareBin; set anything less than .5 in female node decision tree back to 0
if ((df.loc[index, 'Sex'] == 'female') &
(df.loc[index, 'Pclass'] == 3) &
(df.loc[index, 'Embarked'] == 'S') &
(df.loc[index, 'Fare'] > 8)
):
Model.loc[index, 'Predict'] = 0
#Question 3B Male: Title; set anything greater than .5 to 1 for majority survived
if ((df.loc[index, 'Sex'] == 'male') &
(df.loc[index, 'Title'] in male_title)
):
Model.loc[index, 'Predict'] = 1
return Model
#model data
Tree_Predict = mytree(data1)
print('Decision Tree Model Accuracy/Precision Score: {:.2f}%\n'.format(metrics.accuracy_score(data1['Survived'], Tree_Predict)*100))
#Accuracy Summary Report with http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html#sklearn.metrics.classification_report
#Where recall score = (true positives)/(true positive + false negative) w/1 being best:http://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html#sklearn.metrics.recall_score
#And F1 score = weighted average of precision and recall w/1 being best: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score
print(metrics.classification_report(data1['Survived'], Tree_Predict))
結果:
Decision Tree Model Accuracy/Precision Score: 82.04%
precision recall f1-score support
0 0.82 0.91 0.86 549
1 0.82 0.68 0.75 342
avg / total 0.82 0.82 0.82 891
#Plot Accuracy Summary
#Credit: http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html
import itertools
def plot_confusion_matrix(cm, classes,
normalize=False,
title='Confusion matrix',
cmap=plt.cm.Blues):
"""
This function prints and plots the confusion matrix.
Normalization can be applied by setting `normalize=True`.
"""
if normalize:
cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
print("Normalized confusion matrix")
else:
print('Confusion matrix, without normalization')
print(cm)
plt.imshow(cm, interpolation='nearest', cmap=cmap)
plt.title(title)
plt.colorbar()
tick_marks = np.arange(len(classes))
plt.xticks(tick_marks, classes, rotation=45)
plt.yticks(tick_marks, classes)
fmt = '.2f' if normalize else 'd'
thresh = cm.max() / 2.
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
plt.text(j, i, format(cm[i, j], fmt),
horizontalalignment="center",
color="white" if cm[i, j] > thresh else "black")
plt.tight_layout()
plt.ylabel('True label')
plt.xlabel('Predicted label')
# Compute confusion matrix
cnf_matrix = metrics.confusion_matrix(data1['Survived'], Tree_Predict)
np.set_printoptions(precision=2)
class_names = ['Dead', 'Survived']
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names,
title='Confusion matrix, without normalization')
# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, normalize=True,
title='Normalized confusion matrix')
結果:
Confusion matrix, without normalization
[[497 52]
[108 234]]
Normalized confusion matrix
[[ 0.91 0.09]
[ 0.32 0.68]]
Cross-Validation (CV)檢測模型性能
we used sklearn cross_validate function to train, test, and score our model performance.
2.5.3 使用超參數調優
class sklearn.tree.DecisionTreeClassifier(criterion=’gini’, splitter=’best’, max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, class_weight=None, presort=False)
使用parametergrid、gridsearchcv和sklearn scoring來調整我們的模型;有關ROC_AUC的更多信息,用graphviz來可視化我們的樹。
#base model
dtree = tree.DecisionTreeClassifier(random_state = 0)
base_results = model_selection.cross_validate(dtree, data1[data1_x_bin], data1[Target], cv = cv_split,return_train_score = True)
dtree.fit(data1[data1_x_bin], data1[Target])
print('BEFORE DT Parameters: ', dtree.get_params())
print("BEFORE DT Training w/bin score mean: {:.2f}". format(base_results['train_score'].mean()*100))
print("BEFORE DT Test w/bin score mean: {:.2f}". format(base_results['test_score'].mean()*100))
print("BEFORE DT Test w/bin score 3*std: +/- {:.2f}". format(base_results['test_score'].std()*100*3))
#print("BEFORE DT Test w/bin set score min: {:.2f}". format(base_results['test_score'].min()*100))
print('-'*10)
#tune hyper-parameters: http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier
param_grid = {'criterion': ['gini', 'entropy'], #scoring methodology; two supported formulas for calculating information gain - default is gini
#'splitter': ['best', 'random'], #splitting methodology; two supported strategies - default is best
'max_depth': [2,4,6,8,10,None], #max depth tree can grow; default is none
#'min_samples_split': [2,5,10,.03,.05], #minimum subset size BEFORE new split (fraction is % of total); default is 2
#'min_samples_leaf': [1,5,10,.03,.05], #minimum subset size AFTER new split split (fraction is % of total); default is 1
#'max_features': [None, 'auto'], #max features to consider when performing split; default none or all
'random_state': [0] #seed or control random number generator: https://www.quora.com/What-is-seed-in-random-number-generation
}
#choose best model with grid_search: #http://scikit-learn.org/stable/modules/grid_search.html#grid-search
#http://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html
tune_model = model_selection.GridSearchCV(tree.DecisionTreeClassifier(), param_grid=param_grid, scoring = 'roc_auc', cv = cv_split)
tune_model.fit(data1[data1_x_bin], data1[Target])
#print(tune_model.cv_results_.keys())
#print(tune_model.cv_results_['params'])
print('AFTER DT Parameters: ', tune_model.best_params_)
#print(tune_model.cv_results_['mean_train_score'])
print("AFTER DT Training w/bin score mean: {:.2f}". format(tune_model.cv_results_['mean_train_score'][tune_model.best_index_]*100))
#print(tune_model.cv_results_['mean_test_score'])
print("AFTER DT Test w/bin score mean: {:.2f}". format(tune_model.cv_results_['mean_test_score'][tune_model.best_index_]*100))
print("AFTER DT Test w/bin score 3*std: +/- {:.2f}". format(tune_model.cv_results_['std_test_score'][tune_model.best_index_]*100*3))
print('-'*10)
BEFORE DT Parameters: {'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'presort': False, 'random_state': 0, 'splitter': 'best'}
BEFORE DT Training w/bin score mean: 89.51
BEFORE DT Test w/bin score mean: 82.09
BEFORE DT Test w/bin score 3*std: +/- 5.57
----------
AFTER DT Parameters: {'criterion': 'gini', 'max_depth': 4, 'random_state': 0}
AFTER DT Training w/bin score mean: 89.35
AFTER DT Test w/bin score mean: 87.40
AFTER DT Test w/bin score 3*std: +/- 5.00
----------
2.5.4 通過特徵調整模型
如前所述,預測變量越多不意味模型越好,但正確的預測變量可以。所以數據建模的另一個步驟是特徵選擇。sklearn有幾個選項,我們將使用遞歸特徵消除(rfe)和交叉驗證(cv)。
#base model
print('BEFORE DT RFE Training Shape Old: ', data1[data1_x_bin].shape)
print('BEFORE DT RFE Training Columns Old: ', data1[data1_x_bin].columns.values)
print("BEFORE DT RFE Training w/bin score mean: {:.2f}". format(base_results['train_score'].mean()*100))
print("BEFORE DT RFE Test w/bin score mean: {:.2f}". format(base_results['test_score'].mean()*100))
print("BEFORE DT RFE Test w/bin score 3*std: +/- {:.2f}". format(base_results['test_score'].std()*100*3))
print('-'*10)
#feature selection
dtree_rfe = feature_selection.RFECV(dtree, step = 1, scoring = 'accuracy', cv = cv_split)
dtree_rfe.fit(data1[data1_x_bin], data1[Target])
#transform x&y to reduced features and fit new model
#alternative: can use pipeline to reduce fit and transform steps: http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
X_rfe = data1[data1_x_bin].columns.values[dtree_rfe.get_support()]
rfe_results = model_selection.cross_validate(dtree, data1[X_rfe], data1[Target], cv = cv_split,return_train_score = True)
#print(dtree_rfe.grid_scores_)
print('AFTER DT RFE Training Shape New: ', data1[X_rfe].shape)
print('AFTER DT RFE Training Columns New: ', X_rfe)
print("AFTER DT RFE Training w/bin score mean: {:.2f}". format(rfe_results['train_score'].mean()*100))
print("AFTER DT RFE Test w/bin score mean: {:.2f}". format(rfe_results['test_score'].mean()*100))
print("AFTER DT RFE Test w/bin score 3*std: +/- {:.2f}". format(rfe_results['test_score'].std()*100*3))
print('-'*10)
#tune rfe model
rfe_tune_model = model_selection.GridSearchCV(tree.DecisionTreeClassifier(), param_grid=param_grid, scoring = 'roc_auc', cv = cv_split)
rfe_tune_model.fit(data1[X_rfe], data1[Target])
#print(rfe_tune_model.cv_results_.keys())
#print(rfe_tune_model.cv_results_['params'])
print('AFTER DT RFE Tuned Parameters: ', rfe_tune_model.best_params_)
#print(rfe_tune_model.cv_results_['mean_train_score'])
print("AFTER DT RFE Tuned Training w/bin score mean: {:.2f}". format(rfe_tune_model.cv_results_['mean_train_score'][tune_model.best_index_]*100))
#print(rfe_tune_model.cv_results_['mean_test_score'])
print("AFTER DT RFE Tuned Test w/bin score mean: {:.2f}". format(rfe_tune_model.cv_results_['mean_test_score'][tune_model.best_index_]*100))
print("AFTER DT RFE Tuned Test w/bin score 3*std: +/- {:.2f}". format(rfe_tune_model.cv_results_['std_test_score'][tune_model.best_index_]*100*3))
print('-'*10)
結果:
BEFORE DT RFE Training Shape Old: (891, 7)
BEFORE DT RFE Training Columns Old: ['Sex_Code' 'Pclass' 'Embarked_Code' 'Title_Code' 'FamilySize'
'AgeBin_Code' 'FareBin_Code']
BEFORE DT RFE Training w/bin score mean: 89.51
BEFORE DT RFE Test w/bin score mean: 82.09
BEFORE DT RFE Test w/bin score 3*std: +/- 5.57
----------
AFTER DT RFE Training Shape New: (891, 6)
AFTER DT RFE Training Columns New: ['Sex_Code' 'Pclass' 'Title_Code' 'FamilySize' 'AgeBin_Code' 'FareBin_Code']
AFTER DT RFE Training w/bin score mean: 88.16
AFTER DT RFE Test w/bin score mean: 83.06
AFTER DT RFE Test w/bin score 3*std: +/- 6.22
----------
AFTER DT RFE Tuned Parameters: {'criterion': 'gini', 'max_depth': 4, 'random_state': 0}
AFTER DT RFE Tuned Training w/bin score mean: 89.39
AFTER DT RFE Tuned Test w/bin score mean: 87.34
AFTER DT RFE Tuned Test w/bin score 3*std: +/- 6.21
----------
2.6 模型驗證及實現
在您根據數據的一個子集對模型進行訓練之後,是時候測試您的模型了。這有助於確保模型沒有過擬合,或者其特定於所選子集,從而使其無法準確地適合同一數據集中的其他子集。在這一步中,我們將使用測試集確定模型是否過度擬合、歸納及欠擬合。
#compare algorithm predictions with each other, where 1 = exactly similar and 0 = exactly opposite
#there are some 1's, but enough blues and light reds to create a "super algorithm" by combining them
correlation_heatmap(MLA_predict)
結果:
#why choose one model, when you can pick them all with voting classifier
#http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html
#removed models w/o attribute 'predict_proba' required for vote classifier and models with a 1.0 correlation to another model
vote_est = [
#Ensemble Methods: http://scikit-learn.org/stable/modules/ensemble.html
('ada', ensemble.AdaBoostClassifier()),
('bc', ensemble.BaggingClassifier()),
('etc',ensemble.ExtraTreesClassifier()),
('gbc', ensemble.GradientBoostingClassifier()),
('rfc', ensemble.RandomForestClassifier()),
#Gaussian Processes: http://scikit-learn.org/stable/modules/gaussian_process.html#gaussian-process-classification-gpc
('gpc', gaussian_process.GaussianProcessClassifier()),
#GLM: http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
('lr', linear_model.LogisticRegressionCV()),
#Navies Bayes: http://scikit-learn.org/stable/modules/naive_bayes.html
('bnb', naive_bayes.BernoulliNB()),
('gnb', naive_bayes.GaussianNB()),
#Nearest Neighbor: http://scikit-learn.org/stable/modules/neighbors.html
('knn', neighbors.KNeighborsClassifier()),
#SVM: http://scikit-learn.org/stable/modules/svm.html
('svc', svm.SVC(probability=True)),
#xgboost: http://xgboost.readthedocs.io/en/latest/model.html
('xgb', XGBClassifier())
]
#Hard Vote or majority rules
vote_hard = ensemble.VotingClassifier(estimators = vote_est , voting = 'hard')
vote_hard_cv = model_selection.cross_validate(vote_hard, data1[data1_x_bin], data1[Target], cv = cv_split,return_train_score = True)
vote_hard.fit(data1[data1_x_bin], data1[Target])
print("Hard Voting Training w/bin score mean: {:.2f}". format(vote_hard_cv['train_score'].mean()*100))
print("Hard Voting Test w/bin score mean: {:.2f}". format(vote_hard_cv['test_score'].mean()*100))
print("Hard Voting Test w/bin score 3*std: +/- {:.2f}". format(vote_hard_cv['test_score'].std()*100*3))
print('-'*10)
#Soft Vote or weighted probabilities
vote_soft = ensemble.VotingClassifier(estimators = vote_est , voting = 'soft')
vote_soft_cv = model_selection.cross_validate(vote_soft, data1[data1_x_bin], data1[Target], cv = cv_split,return_train_score = True)
vote_soft.fit(data1[data1_x_bin], data1[Target])
print("Soft Voting Training w/bin score mean: {:.2f}". format(vote_soft_cv['train_score'].mean()*100))
print("Soft Voting Test w/bin score mean: {:.2f}". format(vote_soft_cv['test_score'].mean()*100))
print("Soft Voting Test w/bin score 3*std: +/- {:.2f}". format(vote_soft_cv['test_score'].std()*100*3))
print('-'*10)
結果:
Hard Voting Training w/bin score mean: 86.59
Hard Voting Test w/bin score mean: 82.39
Hard Voting Test w/bin score 3*std: +/- 4.95
----------
Soft Voting Training w/bin score mean: 87.15
Soft Voting Test w/bin score mean: 82.35
Soft Voting Test w/bin score 3*std: +/- 4.85
----------
#WARNING: Running is very computational intensive and time expensive.
#Code is written for experimental/developmental purposes and not production ready!
#Hyperparameter Tune with GridSearchCV: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
grid_n_estimator = [10, 50, 100, 300]
grid_ratio = [.1, .25, .5, .75, 1.0]
grid_learn = [.01, .03, .05, .1, .25]
grid_max_depth = [2, 4, 6, 8, 10, None]
grid_min_samples = [5, 10, .03, .05, .10]
grid_criterion = ['gini', 'entropy']
grid_bool = [True, False]
grid_seed = [0]
grid_param = [
[{
#AdaBoostClassifier - http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html
'n_estimators': grid_n_estimator, #default=50
'learning_rate': grid_learn, #default=1
#'algorithm': ['SAMME', 'SAMME.R'], #default=’SAMME.R
'random_state': grid_seed
}],
[{
#BaggingClassifier - http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html#sklearn.ensemble.BaggingClassifier
'n_estimators': grid_n_estimator, #default=10
'max_samples': grid_ratio, #default=1.0
'random_state': grid_seed
}],
[{
#ExtraTreesClassifier - http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html#sklearn.ensemble.ExtraTreesClassifier
'n_estimators': grid_n_estimator, #default=10
'criterion': grid_criterion, #default=”gini”
'max_depth': grid_max_depth, #default=None
'random_state': grid_seed
}],
[{
#GradientBoostingClassifier - http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html#sklearn.ensemble.GradientBoostingClassifier
#'loss': ['deviance', 'exponential'], #default=’deviance’
'learning_rate': [.05], #default=0.1 -- 12/31/17 set to reduce runtime -- The best parameter for GradientBoostingClassifier is {'learning_rate': 0.05, 'max_depth': 2, 'n_estimators': 300, 'random_state': 0} with a runtime of 264.45 seconds.
'n_estimators': [300], #default=100 -- 12/31/17 set to reduce runtime -- The best parameter for GradientBoostingClassifier is {'learning_rate': 0.05, 'max_depth': 2, 'n_estimators': 300, 'random_state': 0} with a runtime of 264.45 seconds.
#'criterion': ['friedman_mse', 'mse', 'mae'], #default=”friedman_mse”
'max_depth': grid_max_depth, #default=3
'random_state': grid_seed
}],
[{
#RandomForestClassifier - http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier
'n_estimators': grid_n_estimator, #default=10
'criterion': grid_criterion, #default=”gini”
'max_depth': grid_max_depth, #default=None
'oob_score': [True], #default=False -- 12/31/17 set to reduce runtime -- The best parameter for RandomForestClassifier is {'criterion': 'entropy', 'max_depth': 6, 'n_estimators': 100, 'oob_score': True, 'random_state': 0} with a runtime of 146.35 seconds.
'random_state': grid_seed
}],
[{
#GaussianProcessClassifier
'max_iter_predict': grid_n_estimator, #default: 100
'random_state': grid_seed
}],
[{
#LogisticRegressionCV - http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html#sklearn.linear_model.LogisticRegressionCV
'fit_intercept': grid_bool, #default: True
#'penalty': ['l1','l2'],
'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'], #default: lbfgs
'random_state': grid_seed
}],
[{
#BernoulliNB - http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html#sklearn.naive_bayes.BernoulliNB
'alpha': grid_ratio, #default: 1.0
}],
#GaussianNB -
[{}],
[{
#KNeighborsClassifier - http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier
'n_neighbors': [1,2,3,4,5,6,7], #default: 5
'weights': ['uniform', 'distance'], #default = ‘uniform’
'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']
}],
[{
#SVC - http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC
#http://blog.hackerearth.com/simple-tutorial-svm-parameter-tuning-python-r
#'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
'C': [1,2,3,4,5], #default=1.0
'gamma': grid_ratio, #edfault: auto
'decision_function_shape': ['ovo', 'ovr'], #default:ovr
'probability': [True],
'random_state': grid_seed
}],
[{
#XGBClassifier - http://xgboost.readthedocs.io/en/latest/parameter.html
'learning_rate': grid_learn, #default: .3
'max_depth': [1,2,4,6,8,10], #default 2
'n_estimators': grid_n_estimator,
'seed': grid_seed
}]
]
start_total = time.perf_counter() #https://docs.python.org/3/library/time.html#time.perf_counter
for clf, param in zip (vote_est, grid_param): #https://docs.python.org/3/library/functions.html#zip
#print(clf[1]) #vote_est is a list of tuples, index 0 is the name and index 1 is the algorithm
#print(param)
start = time.perf_counter()
best_search = model_selection.GridSearchCV(estimator = clf[1], param_grid = param, cv = cv_split, scoring = 'roc_auc')
best_search.fit(data1[data1_x_bin], data1[Target])
run = time.perf_counter() - start
best_param = best_search.best_params_
print('The best parameter for {} is {} with a runtime of {:.2f} seconds.'.format(clf[1].__class__.__name__, best_param, run))
clf[1].set_params(**best_param)
run_total = time.perf_counter() - start_total
print('Total optimization time was {:.2f} minutes.'.format(run_total/60))
print('-'*10)
結果:
The best parameter for AdaBoostClassifier is {'learning_rate': 0.1, 'n_estimators': 300, 'random_state': 0} with a runtime of 37.28 seconds.
The best parameter for BaggingClassifier is {'max_samples': 0.25, 'n_estimators': 300, 'random_state': 0} with a runtime of 33.04 seconds.
The best parameter for ExtraTreesClassifier is {'criterion': 'entropy', 'max_depth': 6, 'n_estimators': 100, 'random_state': 0} with a runtime of 68.93 seconds.
The best parameter for GradientBoostingClassifier is {'learning_rate': 0.05, 'max_depth': 2, 'n_estimators': 300, 'random_state': 0} with a runtime of 38.77 seconds.
The best parameter for RandomForestClassifier is {'criterion': 'entropy', 'max_depth': 6, 'n_estimators': 100, 'oob_score': True, 'random_state': 0} with a runtime of 84.14 seconds.
The best parameter for GaussianProcessClassifier is {'max_iter_predict': 10, 'random_state': 0} with a runtime of 6.19 seconds.
The best parameter for LogisticRegressionCV is {'fit_intercept': True, 'random_state': 0, 'solver': 'liblinear'} with a runtime of 9.40 seconds.
The best parameter for BernoulliNB is {'alpha': 0.1} with a runtime of 0.24 seconds.
The best parameter for GaussianNB is {} with a runtime of 0.05 seconds.
The best parameter for KNeighborsClassifier is {'algorithm': 'brute', 'n_neighbors': 7, 'weights': 'uniform'} with a runtime of 5.56 seconds.
The best parameter for SVC is {'C': 2, 'decision_function_shape': 'ovo', 'gamma': 0.1, 'probability': True, 'random_state': 0} with a runtime of 30.49 seconds.
The best parameter for XGBClassifier is {'learning_rate': 0.01, 'max_depth': 4, 'n_estimators': 300, 'seed': 0} with a runtime of 43.57 seconds.
Total optimization time was 5.96 minutes.
----------
#Hard Vote or majority rules w/Tuned Hyperparameters
grid_hard = ensemble.VotingClassifier(estimators = vote_est , voting = 'hard')
grid_hard_cv = model_selection.cross_validate(grid_hard, data1[data1_x_bin], data1[Target], cv = cv_split,return_train_score = True)
grid_hard.fit(data1[data1_x_bin], data1[Target])
print("Hard Voting w/Tuned Hyperparameters Training w/bin score mean: {:.2f}". format(grid_hard_cv['train_score'].mean()*100))
print("Hard Voting w/Tuned Hyperparameters Test w/bin score mean: {:.2f}". format(grid_hard_cv['test_score'].mean()*100))
print("Hard Voting w/Tuned Hyperparameters Test w/bin score 3*std: +/- {:.2f}". format(grid_hard_cv['test_score'].std()*100*3))
print('-'*10)
#Soft Vote or weighted probabilities w/Tuned Hyperparameters
grid_soft = ensemble.VotingClassifier(estimators = vote_est , voting = 'soft')
grid_soft_cv = model_selection.cross_validate(grid_soft, data1[data1_x_bin], data1[Target], cv = cv_split,return_train_score = True)
grid_soft.fit(data1[data1_x_bin], data1[Target])
print("Soft Voting w/Tuned Hyperparameters Training w/bin score mean: {:.2f}". format(grid_soft_cv['train_score'].mean()*100))
print("Soft Voting w/Tuned Hyperparameters Test w/bin score mean: {:.2f}". format(grid_soft_cv['test_score'].mean()*100))
print("Soft Voting w/Tuned Hyperparameters Test w/bin score 3*std: +/- {:.2f}". format(grid_soft_cv['test_score'].std()*100*3))
print('-'*10)
結果:
Hard Voting w/Tuned Hyperparameters Training w/bin score mean: 85.22
Hard Voting w/Tuned Hyperparameters Test w/bin score mean: 82.31
Hard Voting w/Tuned Hyperparameters Test w/bin score 3*std: +/- 5.26
----------
Soft Voting w/Tuned Hyperparameters Training w/bin score mean: 84.76
Soft Voting w/Tuned Hyperparameters Test w/bin score mean: 82.28
Soft Voting w/Tuned Hyperparameters Test w/bin score 3*std: +/- 5.42
----------
#prepare data for modeling
print(data_val.info())
print("-"*10)
#data_val.sample(10)
#decision tree w/full dataset modeling submission score: defaults= 0.76555, tuned= 0.77990
#submit_dt = tree.DecisionTreeClassifier()
#submit_dt = model_selection.GridSearchCV(tree.DecisionTreeClassifier(), param_grid=param_grid, scoring = 'roc_auc', cv = cv_split)
#submit_dt.fit(data1[data1_x_bin], data1[Target])
#print('Best Parameters: ', submit_dt.best_params_) #Best Parameters: {'criterion': 'gini', 'max_depth': 4, 'random_state': 0}
#data_val['Survived'] = submit_dt.predict(data_val[data1_x_bin])
#bagging w/full dataset modeling submission score: defaults= 0.75119, tuned= 0.77990
#submit_bc = ensemble.BaggingClassifier()
#submit_bc = model_selection.GridSearchCV(ensemble.BaggingClassifier(), param_grid= {'n_estimators':grid_n_estimator, 'max_samples': grid_ratio, 'oob_score': grid_bool, 'random_state': grid_seed}, scoring = 'roc_auc', cv = cv_split)
#submit_bc.fit(data1[data1_x_bin], data1[Target])
#print('Best Parameters: ', submit_bc.best_params_) #Best Parameters: {'max_samples': 0.25, 'n_estimators': 500, 'oob_score': True, 'random_state': 0}
#data_val['Survived'] = submit_bc.predict(data_val[data1_x_bin])
#extra tree w/full dataset modeling submission score: defaults= 0.76555, tuned= 0.77990
#submit_etc = ensemble.ExtraTreesClassifier()
#submit_etc = model_selection.GridSearchCV(ensemble.ExtraTreesClassifier(), param_grid={'n_estimators': grid_n_estimator, 'criterion': grid_criterion, 'max_depth': grid_max_depth, 'random_state': grid_seed}, scoring = 'roc_auc', cv = cv_split)
#submit_etc.fit(data1[data1_x_bin], data1[Target])
#print('Best Parameters: ', submit_etc.best_params_) #Best Parameters: {'criterion': 'entropy', 'max_depth': 6, 'n_estimators': 100, 'random_state': 0}
#data_val['Survived'] = submit_etc.predict(data_val[data1_x_bin])
#random foreset w/full dataset modeling submission score: defaults= 0.71291, tuned= 0.73205
#submit_rfc = ensemble.RandomForestClassifier()
#submit_rfc = model_selection.GridSearchCV(ensemble.RandomForestClassifier(), param_grid={'n_estimators': grid_n_estimator, 'criterion': grid_criterion, 'max_depth': grid_max_depth, 'random_state': grid_seed}, scoring = 'roc_auc', cv = cv_split)
#submit_rfc.fit(data1[data1_x_bin], data1[Target])
#print('Best Parameters: ', submit_rfc.best_params_) #Best Parameters: {'criterion': 'entropy', 'max_depth': 6, 'n_estimators': 100, 'random_state': 0}
#data_val['Survived'] = submit_rfc.predict(data_val[data1_x_bin])
#ada boosting w/full dataset modeling submission score: defaults= 0.74162, tuned= 0.75119
#submit_abc = ensemble.AdaBoostClassifier()
#submit_abc = model_selection.GridSearchCV(ensemble.AdaBoostClassifier(), param_grid={'n_estimators': grid_n_estimator, 'learning_rate': grid_ratio, 'algorithm': ['SAMME', 'SAMME.R'], 'random_state': grid_seed}, scoring = 'roc_auc', cv = cv_split)
#submit_abc.fit(data1[data1_x_bin], data1[Target])
#print('Best Parameters: ', submit_abc.best_params_) #Best Parameters: {'algorithm': 'SAMME.R', 'learning_rate': 0.1, 'n_estimators': 300, 'random_state': 0}
#data_val['Survived'] = submit_abc.predict(data_val[data1_x_bin])
#gradient boosting w/full dataset modeling submission score: defaults= 0.75119, tuned= 0.77033
#submit_gbc = ensemble.GradientBoostingClassifier()
#submit_gbc = model_selection.GridSearchCV(ensemble.GradientBoostingClassifier(), param_grid={'learning_rate': grid_ratio, 'n_estimators': grid_n_estimator, 'max_depth': grid_max_depth, 'random_state':grid_seed}, scoring = 'roc_auc', cv = cv_split)
#submit_gbc.fit(data1[data1_x_bin], data1[Target])
#print('Best Parameters: ', submit_gbc.best_params_) #Best Parameters: {'learning_rate': 0.25, 'max_depth': 2, 'n_estimators': 50, 'random_state': 0}
#data_val['Survived'] = submit_gbc.predict(data_val[data1_x_bin])
#extreme boosting w/full dataset modeling submission score: defaults= 0.73684, tuned= 0.77990
#submit_xgb = XGBClassifier()
#submit_xgb = model_selection.GridSearchCV(XGBClassifier(), param_grid= {'learning_rate': grid_learn, 'max_depth': [0,2,4,6,8,10], 'n_estimators': grid_n_estimator, 'seed': grid_seed}, scoring = 'roc_auc', cv = cv_split)
#submit_xgb.fit(data1[data1_x_bin], data1[Target])
#print('Best Parameters: ', submit_xgb.best_params_) #Best Parameters: {'learning_rate': 0.01, 'max_depth': 4, 'n_estimators': 300, 'seed': 0}
#data_val['Survived'] = submit_xgb.predict(data_val[data1_x_bin])
#handmade decision tree - submission score = 0.77990
data_val['Survived'] = mytree(data_val).astype(int)
#hard voting classifier w/full dataset modeling submission score: defaults= 0.75598, tuned = 0.77990
#data_val['Survived'] = vote_hard.predict(data_val[data1_x_bin])
data_val['Survived'] = grid_hard.predict(data_val[data1_x_bin])
#soft voting classifier w/full dataset modeling submission score: defaults= 0.73684, tuned = 0.74162
#data_val['Survived'] = vote_soft.predict(data_val[data1_x_bin])
#data_val['Survived'] = grid_soft.predict(data_val[data1_x_bin])
#submit file
submit = data_val[['PassengerId','Survived']]
submit.to_csv("../working/submit.csv", index=False)
print('Validation Data Distribution: \n', data_val['Survived'].value_counts(normalize = True))
submit.sample(10)
結果:
RangeIndex: 418 entries, 0 to 417
Data columns (total 21 columns):
PassengerId 418 non-null int64
Pclass 418 non-null int64
Name 418 non-null object
Sex 418 non-null object
Age 418 non-null float64
SibSp 418 non-null int64
Parch 418 non-null int64
Ticket 418 non-null object
Fare 418 non-null float64
Cabin 91 non-null object
Embarked 418 non-null object
FamilySize 418 non-null int64
IsAlone 418 non-null int64
Title 418 non-null object
FareBin 418 non-null category
AgeBin 418 non-null category
Sex_Code 418 non-null int64
Embarked_Code 418 non-null int64
Title_Code 418 non-null int64
AgeBin_Code 418 non-null int64
FareBin_Code 418 non-null int64
dtypes: category(2), float64(2), int64(11), object(6)
memory usage: 63.1+ KB
None
----------
Validation Data Distribution:
0 0.633971
1 0.366029
Name: Survived, dtype: float64
2.7 模型優化及策略
模型優化需要重複這個過程,使它變得更好…更強大…比以前更快。作爲一名數據科學家,您的策略應該是將開發人員操作和應用程序管道外包,這樣您就有更多的時間來關注建議和設計。一旦你能夠包裝你的想法,這就成爲你的“兌換貨幣”。