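1. Introduction to sklearn
- A machine learning library for the Python language
- Built on NumPy, SciPy, and matplotlib
- Open source, BSD licensed
sklearn website: http://scikit-learn.org/. It is worth consulting regularly for API details while learning; the documentation and examples it provides are very thorough.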
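2. The sklearn workflow
Data collection -> feature engineering (data cleaning, feature selection, dimensionality reduction) -> model training (choosing a model) -> model testing -> deployment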
3. Machine learning problems covered by sklearn
- Classification
- Regression
- Clustering
- Dimensionality reduction
- Model selection
- Preprocessing
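As a rough orientation, each category in the list above corresponds to one or more sklearn modules. The sketch below picks one representative class or function per category; the picks are illustrative choices, not a recommendation from the original text.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# One representative object per task category (illustrative picks only).
examples = {
    "classification": LogisticRegression,
    "regression": LinearRegression,
    "clustering": KMeans,
    "dimensionality reduction": PCA,
    "model selection": train_test_split,
    "preprocessing": StandardScaler,
}

for task, obj in examples.items():
    # __module__ looks like "sklearn.cluster._kmeans"; keep the public subpackage name.
    subpackage = obj.__module__.split(".")[1]
    print(f"{task:26s} -> sklearn.{subpackage}.{obj.__name__}")
```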
4. Estimator, Transformer, and Pipeline
Estimator: the tool used to train a model. Classifiers and regressors are both estimators.
Core methods:
- fit(X_train, y_train): trains the model
- predict(X_test): predicts on the test set
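A minimal sketch of the fit/predict pattern. The iris dataset and LogisticRegression are stand-ins chosen for illustration; any classifier or regressor follows the same two calls.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Small built-in dataset, used here only to have something to fit.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)      # learn model parameters from the training set
y_pred = clf.predict(X_test)   # predict labels for unseen rows
print(y_pred[:5])
```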
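Transformer: the tool used for data preprocessing.
Core methods:
- fit(X_train): learns the transformer's parameters from the training set
- transform(X_test): applies the learned transformation to the test set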
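A minimal sketch of the fit/transform pattern, using StandardScaler on a tiny hand-made array (the numbers are made up for illustration). The point is that the statistics come from the training data only and are then reused on the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.0, 40.0]])

scaler = StandardScaler()
scaler.fit(X_train)                        # learns per-column mean and std from X_train
print(scaler.mean_)                        # the stored training means
X_test_scaled = scaler.transform(X_test)   # reuses the training statistics
print(X_test_scaled)
```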
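Pipeline:
A typical machine learning workflow:
datasets -> transformer1 -> transformer2 -> transformer3 ... transformerN -> Estimator
A transformer's fit method learns (computes) the relevant parameters and stores them; the transform method can then be called any number of times to convert data. A task may chain several transformers.
A task has exactly one Estimator, and it must come last. Its fit method likewise learns a set of parameters, after which predict can be called any number of times.
A Pipeline wraps several transformers and one estimator into a single object.
A Pipeline runs as follows: calling fit on the pipeline fits each transformer in order and passes the transformed data to the next step, then fits the final estimator; calling predict applies each transformer's transform (without refitting) and then the estimator's predict.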
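To make the mechanism concrete, the sketch below chains the steps by hand and then does the same thing with a Pipeline. The iris dataset, the step names, and the parameter values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Manual chaining: fit_transform on each transformer, fit on the estimator.
scl = StandardScaler()
pca = PCA(n_components=2)
clf = KNeighborsClassifier(n_neighbors=5)
Z_train = pca.fit_transform(scl.fit_transform(X_train))
clf.fit(Z_train, y_train)
# At prediction time only transform is called; nothing is refitted.
manual_pred = clf.predict(pca.transform(scl.transform(X_test)))

# The same steps wrapped in a Pipeline.
pipe = Pipeline([
    ("scl", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("clf", KNeighborsClassifier(n_neighbors=5)),
])
pipe.fit(X_train, y_train)
pipe_pred = pipe.predict(X_test)

print(np.array_equal(manual_pred, pipe_pred))
```

Besides being shorter, the Pipeline version makes it hard to accidentally fit a transformer on test data, since fitting only happens inside pipe.fit.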
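5. Pipeline example
Titanic case study
# Import required packages
import pandas as pd
# Load the data
titanic_df = pd.read_csv('./datasets/titanic/train.csv')
# Drop PassengerId, Name, Ticket, Cabin
titanic_df = titanic_df.drop(['PassengerId','Name','Ticket','Cabin'], axis = 1)
# Build a new indicator variable AgeIsMissing
titanic_df['AgeIsMissing'] = 0
titanic_df.loc[titanic_df['Age'].isnull(), 'AgeIsMissing'] = 1
# Fill missing Age values with the (rounded) mean
age_mean = round(titanic_df['Age'].mean())
# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which recent pandas versions warn about and may not apply to the original frame
titanic_df['Age'] = titanic_df['Age'].fillna(age_mean)
# Fill missing Embarked values with 'S'
titanic_df['Embarked'] = titanic_df['Embarked'].fillna('S')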
# Bin Age using custom cut points
cut_points = [0,18,25,40,60,100]
titanic_df["AgeBin"] = pd.cut(titanic_df.Age, bins=cut_points)
# Bin Fare (ticket price) into equal-frequency (quantile) bins
titanic_df["FareBin"] = pd.qcut(titanic_df.Fare, 5)
# Build the FamilySize variable
titanic_df['FamilySize'] = titanic_df['SibSp'] + titanic_df['Parch'] + 1
# Build a new variable IsAlone (whether the passenger travels alone)
titanic_df['IsAlone'] = 0
titanic_df.loc[titanic_df['FamilySize'] == 1, 'IsAlone'] = 1
# Build a new variable IsMother (whether the passenger is a mother)
titanic_df['IsMother'] = 0
titanic_df.loc[(titanic_df['Sex']=='female') & (titanic_df['Parch']>0) & (titanic_df['Age']>20),
               'IsMother'] = 1
# Combine the Sex and AgeBin features
titanic_df['SexAgeCombo'] = titanic_df['Sex'] + "_" + titanic_df['AgeBin'].astype(str)
# One-hot encode Pclass, Sex, Embarked, AgeBin, FareBin, FamilySize, SexAgeCombo
# dtype=int keeps the dummies as 0/1 integers; pandas >= 2.0 defaults to bool columns,
# which describe() leaves out by default
Pclass = pd.get_dummies(titanic_df.Pclass, prefix='Pclass', dtype=int)
Sex = pd.get_dummies(titanic_df.Sex, prefix='Sex', dtype=int)
Embarked = pd.get_dummies(titanic_df.Embarked, prefix='Embarked', dtype=int)
AgeBin = pd.get_dummies(titanic_df.AgeBin, prefix='AgeBin', dtype=int)
FareBin = pd.get_dummies(titanic_df.FareBin, prefix='FareBin', dtype=int)
FamilySize = pd.get_dummies(titanic_df.FamilySize, prefix='FamilySize', dtype=int)
SexAgeCombo = pd.get_dummies(titanic_df.SexAgeCombo, prefix='SexAgeCombo', dtype=int)
# Concatenate all the variables we need
TrainData = pd.concat([titanic_df[['Survived','AgeIsMissing','IsAlone','IsMother']],
                       Pclass,Sex,Embarked,AgeBin,FareBin,FamilySize,SexAgeCombo], axis=1)
# Descriptive statistics
TrainData.describe().transpose()
feature | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
Survived | 891.0 | 0.383838 | 0.486592 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
AgeIsMissing | 891.0 | 0.198653 | 0.399210 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
IsAlone | 891.0 | 0.602694 | 0.489615 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
IsMother | 891.0 | 0.084175 | 0.277806 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
Pclass_1 | 891.0 | 0.242424 | 0.428790 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
Pclass_2 | 891.0 | 0.206510 | 0.405028 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
Pclass_3 | 891.0 | 0.551066 | 0.497665 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
Sex_female | 891.0 | 0.352413 | 0.477990 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
Sex_male | 891.0 | 0.647587 | 0.477990 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
Embarked_C | 891.0 | 0.188552 | 0.391372 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
Embarked_Q | 891.0 | 0.086420 | 0.281141 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
Embarked_S | 891.0 | 0.725028 | 0.446751 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
AgeBin_(0, 18] | 891.0 | 0.156004 | 0.363063 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
AgeBin_(18, 25] | 891.0 | 0.181818 | 0.385911 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
AgeBin_(25, 40] | 891.0 | 0.493827 | 0.500243 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
AgeBin_(40, 60] | 891.0 | 0.143659 | 0.350940 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
AgeBin_(60, 100] | 891.0 | 0.024691 | 0.155270 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
FareBin_(-0.001, 7.854] | 891.0 | 0.200898 | 0.400897 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
FareBin_(7.854, 10.5] | 891.0 | 0.206510 | 0.405028 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
FareBin_(10.5, 21.679] | 891.0 | 0.193042 | 0.394907 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
FareBin_(21.679, 39.688] | 891.0 | 0.202020 | 0.401733 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
FareBin_(39.688, 512.329] | 891.0 | 0.197531 | 0.398360 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
FamilySize_1 | 891.0 | 0.602694 | 0.489615 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
FamilySize_2 | 891.0 | 0.180696 | 0.384982 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
FamilySize_3 | 891.0 | 0.114478 | 0.318570 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
FamilySize_4 | 891.0 | 0.032548 | 0.177549 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
FamilySize_5 | 891.0 | 0.016835 | 0.128725 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
FamilySize_6 | 891.0 | 0.024691 | 0.155270 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
FamilySize_7 | 891.0 | 0.013468 | 0.115332 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
FamilySize_8 | 891.0 | 0.006734 | 0.081830 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
FamilySize_11 | 891.0 | 0.007856 | 0.088337 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
SexAgeCombo_female_(0, 18] | 891.0 | 0.076319 | 0.265657 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
SexAgeCombo_female_(18, 25] | 891.0 | 0.060606 | 0.238740 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
SexAgeCombo_female_(25, 40] | 891.0 | 0.161616 | 0.368305 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
SexAgeCombo_female_(40, 60] | 891.0 | 0.050505 | 0.219108 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
SexAgeCombo_female_(60, 100] | 891.0 | 0.003367 | 0.057961 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
SexAgeCombo_male_(0, 18] | 891.0 | 0.079686 | 0.270958 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
SexAgeCombo_male_(18, 25] | 891.0 | 0.121212 | 0.326557 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
SexAgeCombo_male_(25, 40] | 891.0 | 0.332211 | 0.471271 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
SexAgeCombo_male_(40, 60] | 891.0 | 0.093154 | 0.290811 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
SexAgeCombo_male_(60, 100] | 891.0 | 0.021324 | 0.144544 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
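The AgeBin and FareBin rows in the table reflect the two binning styles used earlier: pd.cut with fixed edges gives bins of unequal size, while pd.qcut gives bins of roughly equal frequency (each FareBin above holds about 20% of the rows). A small sketch on made-up numbers shows the difference:

```python
import pandas as pd

# Made-up values with one large outlier.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Fixed edges: bin sizes depend on where the data happens to fall.
fixed = pd.cut(values, bins=[0, 5, 10, 100])
# Quantile-based: edges are chosen so each bin holds about the same number of rows.
equal_freq = pd.qcut(values, 2)

print(fixed.value_counts(sort=False))
print(equal_freq.value_counts(sort=False))
```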
# Import required packages
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
# Split into training and test sets
trainData_X = TrainData.drop(['Survived'], axis = 1)
trainData_y = TrainData.Survived
X_train, X_test, y_train, y_test = train_test_split(trainData_X, trainData_y, test_size=0.3, random_state=123456)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
# Define the components of the pipeline (2 Transformers and 1 Estimator)
# scl standardizes the data, pca reduces dimensionality, clf is the k-nearest-neighbors Estimator
estimators = [('scl', StandardScaler()), ('pca', PCA(n_components=20)), ('clf', KNeighborsClassifier(10))]
estimators
[('scl', StandardScaler(copy=True, with_mean=True, with_std=True)),
('pca',
PCA(copy=True, iterated_power='auto', n_components=20, random_state=None,
svd_solver='auto', tol=0.0, whiten=False)),
('clf',
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=10, p=2,
weights='uniform'))]
# Define the pipeline
pipeline_knn = Pipeline(estimators)
# Train the model through the pipeline
pipeline_knn.fit(X_train,y_train)
Pipeline(memory=None,
steps=[('scl', StandardScaler(copy=True, with_mean=True, with_std=True)), ('pca', PCA(copy=True, iterated_power='auto', n_components=20, random_state=None,
svd_solver='auto', tol=0.0, whiten=False)), ('clf', KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=10, p=2,
weights='uniform'))])
# Use the fitted pipeline to predict target labels for the test set
y_test_pred = pipeline_knn.predict(X_test)
# Summary classification report
print(classification_report(y_test,y_test_pred))
precision recall f1-score support
0 0.81 0.81 0.81 166
1 0.70 0.70 0.70 102
avg / total 0.77 0.77 0.77 268
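The run above outputs predicted labels only. If class probabilities are also wanted, a fitted pipeline whose last step supports it exposes predict_proba. The sketch below uses synthetic data from make_classification rather than the Titanic frame, and the feature and component counts are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data standing in for the real features.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

pipe = Pipeline([
    ("scl", StandardScaler()),
    ("pca", PCA(n_components=5)),
    ("clf", KNeighborsClassifier(n_neighbors=10)),
])
pipe.fit(X, y)

# One row per sample, one column per class; for k-NN with uniform weights
# each entry is the fraction of the 10 neighbors belonging to that class.
proba = pipe.predict_proba(X[:3])
print(proba)
```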