Machine Learning Frameworks - sklearn Quick Start


1. Introduction to sklearn

  • A machine learning library for the Python language
  • Built on NumPy, SciPy, and matplotlib
  • Open source, BSD licensed

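Official sklearn site: http://scikit-learn.org/. It is worth consulting regularly for the API reference while learning; the documentation and examples there are also very thorough.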

2. The sklearn workflow

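Data collection -> Feature engineering (data cleaning, feature selection, dimensionality reduction) -> Model training (choosing a model)
-> Model testing -> Deployment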
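
As a rough illustration of the first four stages (deployment is left out), the sketch below runs them on scikit-learn's built-in iris dataset. The dataset, the LogisticRegression model, and all variable names are assumptions chosen for the example, not part of the workflow described above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 1. Data collection: a small built-in dataset stands in for real data.
X, y = load_iris(return_X_y=True)

# Hold out a test set before fitting anything, so the test data stays unseen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# 2. Feature engineering: learn the scaling parameters on the training set only.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 3. Model training.
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)

# 4. Model testing.
y_pred = model.predict(X_test_scaled)
test_accuracy = accuracy_score(y_test, y_pred)
print(f"test accuracy: {test_accuracy:.3f}")
```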

3. Machine learning problems covered by sklearn

  • Classification
  • Regression
  • Clustering
  • Dimensionality reduction
  • Model selection
  • Preprocessing
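
Each of these areas corresponds to one or more sklearn subpackages. The sketch below picks one representative per area (the picks are arbitrary examples, not a complete list) and prints the subpackage it comes from.

```python
from sklearn.cluster import KMeans                    # clustering
from sklearn.decomposition import PCA                 # dimensionality reduction
from sklearn.linear_model import LinearRegression     # regression
from sklearn.linear_model import LogisticRegression   # classification
from sklearn.model_selection import train_test_split  # model selection
from sklearn.preprocessing import StandardScaler      # preprocessing

# One illustrative object per problem area.
examples = {
    "classification": LogisticRegression,
    "regression": LinearRegression,
    "clustering": KMeans,
    "dimensionality reduction": PCA,
    "model selection": train_test_split,
    "preprocessing": StandardScaler,
}

# The second component of __module__ is the public subpackage name.
modules = {task: obj.__module__.split(".")[1] for task, obj in examples.items()}
for task, module in modules.items():
    print(f"{task:26s} -> sklearn.{module}")
```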


4. Estimator, Transformer, and Pipeline

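  1. Estimator: the tool used to train a model. Classifiers and regressors are both estimators. (Strictly, sklearn calls any object with a fit method an estimator; this article uses the term for the final predictive model.)
    Core methods:

    • fit(X_train, y_train): trains the model
    • predict(X_test): predicts on the test set
  2. Transformer: the tool used for data preprocessing.
    Core methods:

    • fit(X_train): learns the transformer's parameters
    • transform(X_test): transforms the test set
  3. Pipeline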

A typical machine learning flow:

datasets -> transformer1 -> transformer2 -> transformer3 ... transformerN -> Estimator
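A transformer's fit method learns (computes) the relevant parameters from the training data and stores them; transform can then be called any number of times to convert data.
A single task can use several transformers.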
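
A minimal sketch of that fit/transform split, assuming StandardScaler as the transformer and tiny made-up arrays: the mean and standard deviation are learned once from the training data and then reused on other data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data, invented for the example.
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[2.0], [4.0]])

scaler = StandardScaler()
scaler.fit(X_train)  # learns and stores mean_ and scale_ from X_train only
print("learned mean:", scaler.mean_, "learned std:", scaler.scale_)

# transform reuses the stored parameters; it can be called repeatedly.
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(X_test_scaled.ravel())
```

Because the test set is scaled with the training mean and standard deviation, its scaled values are not centred on zero.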
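A task has exactly one estimator, and it always comes last. Its fit method likewise learns a set of parameters, after which predict can be called any number of times.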
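
The same pattern for an estimator can be sketched as follows, assuming a KNeighborsClassifier and a made-up one-dimensional dataset: fit once, then predict as often as needed.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data, invented for the example: small values are class 0, large values class 1.
X_train = np.array([[0.0], [1.0], [2.0], [8.0], [9.0], [10.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_test = np.array([[1.5], [8.5]])

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)  # stores what the model needs for prediction

# predict can be called many times on the fitted estimator.
y_pred = clf.predict(X_test)
y_pred_again = clf.predict(np.array([[0.5]]))
print(y_pred, y_pred_again)
```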
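A pipeline wraps several transformers and one estimator into a single object.
It works as follows: calling fit on the pipeline runs fit and transform on each transformer in order, passing the transformed data to the next step, and finally calls fit on the estimator. Calling predict runs only transform on each transformer and then predict on the estimator.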
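
To make the mechanism concrete, the sketch below chains two transformers and an estimator by hand and compares the result with a Pipeline built from the same steps. The synthetic data from make_classification and the specific step settings are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, split by position into train and test parts.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = X[:150], X[150:], y[:150], y[150:]

# Manual chaining: fit_transform each transformer on the training data,
# then fit the estimator on the final transformed features.
scaler = StandardScaler()
pca = PCA(n_components=5, svd_solver="full")
knn = KNeighborsClassifier(n_neighbors=5)
Z_train = pca.fit_transform(scaler.fit_transform(X_train))
knn.fit(Z_train, y_train)
# At prediction time only transform is applied, never fit.
manual_pred = knn.predict(pca.transform(scaler.transform(X_test)))

# The same steps wrapped in a Pipeline.
pipe = Pipeline([
    ("scl", StandardScaler()),
    ("pca", PCA(n_components=5, svd_solver="full")),
    ("clf", KNeighborsClassifier(n_neighbors=5)),
])
pipe.fit(X_train, y_train)
pipe_pred = pipe.predict(X_test)

same = np.array_equal(manual_pred, pipe_pred)
print("pipeline matches manual chaining:", same)
```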

5. Pipeline usage example

Titanic case study

# Import the required packages
import pandas as pd
# Read the data
titanic_df = pd.read_csv('./datasets/titanic/train.csv')
# Drop PassengerId, Name, Ticket and Cabin
titanic_df = titanic_df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)
# Build a new indicator variable AgeIsMissing
titanic_df['AgeIsMissing'] = 0
titanic_df.loc[titanic_df['Age'].isnull(), 'AgeIsMissing'] = 1
# Fill missing Age values with the rounded mean
# (assign the result back instead of using inplace=True on a column selection)
age_mean = round(titanic_df['Age'].mean())
titanic_df['Age'] = titanic_df['Age'].fillna(age_mean)
# Fill missing Embarked values with 'S'
titanic_df['Embarked'] = titanic_df['Embarked'].fillna('S')
# Bin Age using custom bin edges
cut_points = [0, 18, 25, 40, 60, 100]
titanic_df['AgeBin'] = pd.cut(titanic_df.Age, bins=cut_points)
# Bin Fare (ticket price) into equal-frequency bins
titanic_df['FareBin'] = pd.qcut(titanic_df.Fare, 5)
# Build the FamilySize variable
titanic_df['FamilySize'] = titanic_df['SibSp'] + titanic_df['Parch'] + 1
# Build a new variable IsAlone (travelling alone or not)
titanic_df['IsAlone'] = 0
titanic_df.loc[titanic_df['FamilySize'] == 1, 'IsAlone'] = 1
# Build a new variable IsMother (is a mother or not)
titanic_df['IsMother'] = 0
titanic_df.loc[(titanic_df['Sex'] == 'female') & (titanic_df['Parch'] > 0) & (titanic_df['Age'] > 20),
               'IsMother'] = 1
# Combine the Sex and AgeBin features
titanic_df['SexAgeCombo'] = titanic_df['Sex'] + "_" + titanic_df['AgeBin'].astype(str)
# One-hot encode Pclass, Sex, Embarked, AgeBin, FareBin, FamilySize and SexAgeCombo
# (dtype=int keeps the dummy columns as 0/1 integers; newer pandas versions default to bool)
Pclass = pd.get_dummies(titanic_df.Pclass, prefix='Pclass', dtype=int)
Sex = pd.get_dummies(titanic_df.Sex, prefix='Sex', dtype=int)
Embarked = pd.get_dummies(titanic_df.Embarked, prefix='Embarked', dtype=int)
AgeBin = pd.get_dummies(titanic_df.AgeBin, prefix='AgeBin', dtype=int)
FareBin = pd.get_dummies(titanic_df.FareBin, prefix='FareBin', dtype=int)
FamilySize = pd.get_dummies(titanic_df.FamilySize, prefix='FamilySize', dtype=int)
SexAgeCombo = pd.get_dummies(titanic_df.SexAgeCombo, prefix='SexAgeCombo', dtype=int)
# Concatenate all the required variables
TrainData = pd.concat([titanic_df[['Survived', 'AgeIsMissing', 'IsAlone', 'IsMother']],
                       Pclass, Sex, Embarked, AgeBin, FareBin, FamilySize, SexAgeCombo], axis=1)
# Descriptive statistics
TrainData.describe().transpose()
                               count      mean       std   min   25%   50%   75%   max
Survived                       891.0  0.383838  0.486592   0.0   0.0   0.0   1.0   1.0
AgeIsMissing                   891.0  0.198653  0.399210   0.0   0.0   0.0   0.0   1.0
IsAlone                        891.0  0.602694  0.489615   0.0   0.0   1.0   1.0   1.0
IsMother                       891.0  0.084175  0.277806   0.0   0.0   0.0   0.0   1.0
Pclass_1                       891.0  0.242424  0.428790   0.0   0.0   0.0   0.0   1.0
Pclass_2                       891.0  0.206510  0.405028   0.0   0.0   0.0   0.0   1.0
Pclass_3                       891.0  0.551066  0.497665   0.0   0.0   1.0   1.0   1.0
Sex_female                     891.0  0.352413  0.477990   0.0   0.0   0.0   1.0   1.0
Sex_male                       891.0  0.647587  0.477990   0.0   0.0   1.0   1.0   1.0
Embarked_C                     891.0  0.188552  0.391372   0.0   0.0   0.0   0.0   1.0
Embarked_Q                     891.0  0.086420  0.281141   0.0   0.0   0.0   0.0   1.0
Embarked_S                     891.0  0.725028  0.446751   0.0   0.0   1.0   1.0   1.0
AgeBin_(0, 18]                 891.0  0.156004  0.363063   0.0   0.0   0.0   0.0   1.0
AgeBin_(18, 25]                891.0  0.181818  0.385911   0.0   0.0   0.0   0.0   1.0
AgeBin_(25, 40]                891.0  0.493827  0.500243   0.0   0.0   0.0   1.0   1.0
AgeBin_(40, 60]                891.0  0.143659  0.350940   0.0   0.0   0.0   0.0   1.0
AgeBin_(60, 100]               891.0  0.024691  0.155270   0.0   0.0   0.0   0.0   1.0
FareBin_(-0.001, 7.854]        891.0  0.200898  0.400897   0.0   0.0   0.0   0.0   1.0
FareBin_(7.854, 10.5]          891.0  0.206510  0.405028   0.0   0.0   0.0   0.0   1.0
FareBin_(10.5, 21.679]         891.0  0.193042  0.394907   0.0   0.0   0.0   0.0   1.0
FareBin_(21.679, 39.688]       891.0  0.202020  0.401733   0.0   0.0   0.0   0.0   1.0
FareBin_(39.688, 512.329]      891.0  0.197531  0.398360   0.0   0.0   0.0   0.0   1.0
FamilySize_1                   891.0  0.602694  0.489615   0.0   0.0   1.0   1.0   1.0
FamilySize_2                   891.0  0.180696  0.384982   0.0   0.0   0.0   0.0   1.0
FamilySize_3                   891.0  0.114478  0.318570   0.0   0.0   0.0   0.0   1.0
FamilySize_4                   891.0  0.032548  0.177549   0.0   0.0   0.0   0.0   1.0
FamilySize_5                   891.0  0.016835  0.128725   0.0   0.0   0.0   0.0   1.0
FamilySize_6                   891.0  0.024691  0.155270   0.0   0.0   0.0   0.0   1.0
FamilySize_7                   891.0  0.013468  0.115332   0.0   0.0   0.0   0.0   1.0
FamilySize_8                   891.0  0.006734  0.081830   0.0   0.0   0.0   0.0   1.0
FamilySize_11                  891.0  0.007856  0.088337   0.0   0.0   0.0   0.0   1.0
SexAgeCombo_female_(0, 18]     891.0  0.076319  0.265657   0.0   0.0   0.0   0.0   1.0
SexAgeCombo_female_(18, 25]    891.0  0.060606  0.238740   0.0   0.0   0.0   0.0   1.0
SexAgeCombo_female_(25, 40]    891.0  0.161616  0.368305   0.0   0.0   0.0   0.0   1.0
SexAgeCombo_female_(40, 60]    891.0  0.050505  0.219108   0.0   0.0   0.0   0.0   1.0
SexAgeCombo_female_(60, 100]   891.0  0.003367  0.057961   0.0   0.0   0.0   0.0   1.0
SexAgeCombo_male_(0, 18]       891.0  0.079686  0.270958   0.0   0.0   0.0   0.0   1.0
SexAgeCombo_male_(18, 25]      891.0  0.121212  0.326557   0.0   0.0   0.0   0.0   1.0
SexAgeCombo_male_(25, 40]      891.0  0.332211  0.471271   0.0   0.0   0.0   1.0   1.0
SexAgeCombo_male_(40, 60]      891.0  0.093154  0.290811   0.0   0.0   0.0   0.0   1.0
SexAgeCombo_male_(60, 100]     891.0  0.021324  0.144544   0.0   0.0   0.0   0.0   1.0
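
The feature engineering above relies on two different binning calls: pd.cut with explicit edges (bins can hold very different numbers of rows) and pd.qcut with a bin count (bins hold roughly equal numbers of rows). The sketch below shows the difference on small made-up series, followed by one-hot encoding of the result; the values are invented for illustration and are not Titanic data.

```python
import pandas as pd

# Custom bin edges: each bin is a fixed age range, so counts per bin vary.
ages = pd.Series([5, 17, 22, 30, 45, 70], name="Age")
age_bin = pd.cut(ages, bins=[0, 18, 25, 40, 60, 100])

# Equal-frequency binning: edges are chosen from quantiles of the data,
# so the single outlier (100) does not leave the other bins empty.
fares = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], name="Fare")
fare_bin = pd.qcut(fares, 5)
fare_counts = fare_bin.value_counts()

# One-hot encode the age bins; each row gets exactly one 1.
age_dummies = pd.get_dummies(age_bin, prefix="AgeBin", dtype=int)

print(age_dummies.sum())  # rows per custom age bin
print(fare_counts)        # rows per equal-frequency fare bin
```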
# Import the required packages
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
# Split into training and test sets
trainData_X = TrainData.drop(['Survived'], axis=1)
trainData_y = TrainData.Survived
X_train, X_test, y_train, y_test = train_test_split(trainData_X, trainData_y, test_size=0.3, random_state=123456)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
# Define the components of the pipeline (2 transformers and 1 estimator):
# scl standardizes the data, pca reduces dimensionality, clf is the k-nearest-neighbors estimator
estimators = [('scl', StandardScaler()), ('pca', PCA(n_components=20)), ('clf', KNeighborsClassifier(10))]
estimators
[('scl', StandardScaler(copy=True, with_mean=True, with_std=True)),
 ('pca',
  PCA(copy=True, iterated_power='auto', n_components=20, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)),
 ('clf',
  KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
             metric_params=None, n_jobs=1, n_neighbors=10, p=2,
             weights='uniform'))]
# Define the pipeline
pipeline_knn = Pipeline(estimators)
# Train the model through the pipeline
pipeline_knn.fit(X_train, y_train)
Pipeline(memory=None,
     steps=[('scl', StandardScaler(copy=True, with_mean=True, with_std=True)), ('pca', PCA(copy=True, iterated_power='auto', n_components=20, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('clf', KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=10, p=2,
           weights='uniform'))])
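
Once fitted, the individual steps stay accessible through the names given in the step list. The sketch below mirrors the scl/pca/clf step names used above but runs on synthetic data from make_classification, since the Titanic CSV is not bundled here; the dataset shape and the demo_ variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with roughly as many columns as the encoded Titanic features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=40, n_informative=10, random_state=0
)

demo_pipeline = Pipeline([
    ("scl", StandardScaler()),
    ("pca", PCA(n_components=20)),
    ("clf", KNeighborsClassifier(n_neighbors=10)),
])
demo_pipeline.fit(X_demo, y_demo)

# named_steps maps each step name to its fitted object.
pca_step = demo_pipeline.named_steps["pca"]
retained = pca_step.explained_variance_ratio_.sum()
n_neighbors = demo_pipeline.named_steps["clf"].n_neighbors
print(f"variance retained by 20 components: {retained:.3f}")
print(f"neighbors used by the classifier: {n_neighbors}")
```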
# Use the model to predict labels for the test set
y_test_pred = pipeline_knn.predict(X_test)
# Classification summary report
print(classification_report(y_test, y_test_pred))
             precision    recall  f1-score   support

          0       0.81      0.81      0.81       166
          1       0.70      0.70      0.70       102

avg / total       0.77      0.77      0.77       268
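
The example fixes n_components=20 and 10 neighbors by hand. One possible follow-up, not part of the original walkthrough, is to let GridSearchCV tune those values: pipeline hyperparameters are addressed as `<step name>__<parameter>`, and every cross-validation fold refits the whole pipeline on its own training part. The sketch below assumes synthetic data and an arbitrary small grid.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded Titanic features.
X, y = make_classification(
    n_samples=400, n_features=40, n_informative=10, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

pipe = Pipeline([
    ("scl", StandardScaler()),
    ("pca", PCA()),
    ("clf", KNeighborsClassifier()),
])

# Keys follow the "<step name>__<parameter>" convention.
param_grid = {
    "pca__n_components": [10, 20],
    "clf__n_neighbors": [5, 10, 15],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

test_score = search.score(X_test, y_test)
print("best parameters:", search.best_params_)
print(f"held-out accuracy: {test_score:.3f}")
```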