Getting Started with auto-sklearn for Automated Machine Learning

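Once feature engineering is done, the data can be fed into a model for training and prediction. Choosing the model and tuning its parameters depends largely on the analyst's experience. In practice, different analysts working with the same data and the same kind of model often get very different results.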

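Earlier posts covered the principles and typical use cases of several common machine learning methods. Even a developer with no experience can, given enough time to try enough combinations of algorithms and parameters, in theory reach the best training result. A program can do the same thing, and can use an optimization algorithm to search efficiently and find the best model automatically. That is what an automated machine learning (AutoML) framework does.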
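To make the idea concrete, the manual version of that search can be sketched with plain scikit-learn: loop over a few candidate models, grid-search each one's parameters with cross-validation, and keep the best. The candidate models and parameter grids below are illustrative choices only. They are not what auto-sklearn uses, which searches a much larger space and does not rely on an exhaustive grid.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Illustrative candidates: name -> (estimator, parameter grid).
candidates = {
    "knn": (KNeighborsClassifier(), {"n_neighbors": [1, 3, 5]}),
    "tree": (DecisionTreeClassifier(random_state=1), {"max_depth": [5, 10]}),
}

best_name, best_score, best_model = None, -1.0, None
for name, (estimator, grid) in candidates.items():
    # Try every parameter combination with 3-fold cross-validation.
    search = GridSearchCV(estimator, grid, cv=3)
    search.fit(X_train, y_train)
    if search.best_score_ > best_score:
        best_name = name
        best_score = search.best_score_
        best_model = search.best_estimator_

print("Best model:", best_name, "cv accuracy: %.3f" % best_score)
print("Test accuracy: %.3f" % best_model.score(X_test, y_test))
```

Every extra model or parameter multiplies the work in this loop, which is why an AutoML framework automates and optimizes the search.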

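Auto-Sklearn is built on the sklearn machine learning library and is used in much the same way, so developers who know sklearn can switch to it easily. On the model side, it adds xgboost support on top of the machine learning models that sklearn provides; for tuning the overall pipeline, it uses Bayesian optimization.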

System requirements, as listed in the auto-sklearn documentation:
    Linux operating system (for example Ubuntu),
    Python (>=3.5),
    C++ compiler (with C++11 support) and
    SWIG (version 3.0 or later).
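A quick pre-flight check of these requirements can be sketched in Python using only the standard library. The tool names are assumptions (g++ as the C++ compiler, swig as the SWIG executable), and the check only confirms that the tools are on PATH; it does not verify C++11 support or the SWIG version.

```python
import platform
import shutil
import sys


def check_requirements(system, python_version, find_tool):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if system != "Linux":
        problems.append("operating system is %s, not Linux" % system)
    major_minor = tuple(python_version[:2])
    if major_minor < (3, 5):
        problems.append("Python %d.%d is older than 3.5" % major_minor)
    # Assumed tool names: g++ for the C++ compiler, swig for SWIG.
    for tool in ("g++", "swig"):
        if find_tool(tool) is None:
            problems.append("%s not found on PATH" % tool)
    return problems


if __name__ == "__main__":
    issues = check_requirements(platform.system(), sys.version_info, shutil.which)
    print("\n".join(issues) if issues else "Basic requirements look satisfied.")
```

Passing the system name, version and lookup function as arguments keeps the check easy to try out on a machine that does not meet the requirements.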

1. Create a new environment (a conda virtual environment)
conda create --name automl python=3.6
source activate automl


2. Install the required packages
Upgrade pip first:
pip install --upgrade pip
yum install -y libffi-devel python-devel openssl-devel gcc swig
yum install gcc-c++

Please install all dependencies manually with:
curl https://raw.githubusercontent.com/automl/auto-sklearn/master/requirements.txt | xargs -n 1 -L 1 pip install
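(Because of network problems, many downloads in this step may fail, or the .whl packages may download very slowly. It helps to download the wheels locally first and then run pip install ***.whl. Be patient; it took me most of a day.)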

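Install swig to the default location (download a source package of version 3.0 or later from the official site; that download is also slow, so be patient, or consider getting version 3.0 or later from CSDN)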
# ./configure
# make
# make install

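Check the swig version:
#swig -version
If this fails with "swig: error while loading shared libraries: libpcre.so.1",
check whether pcre is installed, and install it if it is not.

If pcre is installed, run
#ldd $(which swig)
The output will include
libpcre.so.1 => not found
Add the link by hand:
#ln -s /usr/local/lib/libpcre.so.1 /lib
Then check the version again:
#swig -version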
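The ldd check can also be scripted. The sketch below scans ldd output for libraries reported as "not found"; the helper names are made up for illustration, and it assumes ldd and the binary are both available on PATH.

```python
import shutil
import subprocess


def missing_libraries(ldd_output):
    """Return the library names that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        if "=>" in line and "not found" in line:
            missing.append(line.split("=>")[0].strip())
    return missing


def check_binary(name="swig"):
    """Run ldd on a binary found on PATH and list its unresolved libraries."""
    path = shutil.which(name)
    if path is None:
        raise FileNotFoundError("%s is not on PATH" % name)
    result = subprocess.run(["ldd", path], stdout=subprocess.PIPE,
                            universal_newlines=True, check=True)
    return missing_libraries(result.stdout)


# The line shown above would be reported like this:
print(missing_libraries("libpcre.so.1 => not found"))
```

An empty list from check_binary() means every shared library swig needs was resolved.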

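Install swig to a user-specified directory
$ ./configure --prefix=/usr/local/bin/swig
$ make
$ sudo make install

Add the path to the profile:
$ vim /etc/profile (requires administrator privileges)

Append this line at the end: PATH=/usr/local/bin/swig/bin:$PATH (make install places the executable in the bin subdirectory of the prefix). Save, then reload the file so that it takes effect:
$ source /etc/profile

Running swig -version now prints the version information.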
    SWIG Version 3.0.11
    Compiled with g++ [x86_64-pc-linux-gnu]
    Configured options: +pcre
    Please see http://www.swig.org for reporting bugs and further information
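Since auto-sklearn asks for SWIG 3.0 or later, the version line can be checked programmatically. This is a small sketch that parses text in the format shown above; the function names are illustrative, and in practice the text would come from running swig -version.

```python
import re


def parse_swig_version(text):
    """Extract (major, minor, patch) from the output of `swig -version`."""
    match = re.search(r"SWIG Version (\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("no SWIG version found in output")
    return tuple(int(part) for part in match.groups())


def is_supported(version, minimum=(3, 0, 0)):
    """Compare against the 3.0 minimum from the requirements list."""
    return version >= minimum


sample = "SWIG Version 3.0.11\nCompiled with g++ [x86_64-pc-linux-gnu]"
version = parse_swig_version(sample)
print(version, is_supported(version))  # (3, 0, 11) True
```

Comparing tuples of integers avoids the string-comparison trap where "3.0.11" would sort before "3.0.2".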

3. Install auto-sklearn
pip install auto-sklearn
pip install statsmodels

4. Test code

import autosklearn.classification
import sklearn.model_selection
import sklearn.datasets
import sklearn.metrics
X, y = sklearn.datasets.load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, random_state=1)
automl = autosklearn.classification.AutoSklearnClassifier()
automl.fit(X_train, y_train)
y_hat = automl.predict(X_test)
print("Accuracy score", sklearn.metrics.accuracy_score(y_test, y_hat))

# Result: Accuracy score 0.9933333333333333

Note that this takes one hour to produce a result. The official documentation says:
This will run for one hour and should result in an accuracy above 0.98.

The following warning appears repeatedly during the run and does not affect execution. As far as I could find out, it means that mean() was taken over an empty slice, that is, one with no non-NaN values.
/home/dm/anaconda3/envs/automl/lib/python3.6/site-packages/autosklearn/evaluation/train_evaluator.py:197: RuntimeWarning: Mean of empty slice
Y_train_pred = np.nanmean(Y_train_pred_full, axis=0)
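The warning itself is easy to reproduce with NumPy alone. The toy array below is made up for illustration; whether auto-sklearn hits exactly this situation internally is an assumption, but the same np.nanmean call on a column that contains only NaN raises the same message.

```python
import warnings

import numpy as np

# Two rows of predictions; the first column is NaN in every row.
preds = np.array([[np.nan, 1.0],
                  [np.nan, 3.0]])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    mean = np.nanmean(preds, axis=0)

print(mean)  # first entry is nan, second is the mean of 1.0 and 3.0
print([str(w.message) for w in caught])
```

The result for the all-NaN column is simply NaN, which is why the run continues after the warning.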

That completes the auto-sklearn environment setup. I hope it helps you get started.

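5. Key parameters

Auto-sklearn supports many parameters. Taking the classifier as an example, the parameters and their default values are shown in the figure below:

The commonly used ones fall into four groups: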

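(1) Controlling training time and memory usage

The default total training time is one hour (3600). The following parameters are typically used to change the limits as needed; times are in seconds.

    time_left_for_this_task: total time budget for training all models
    per_run_time_limit: maximum training time for a single model
    ml_memory_limit: maximum memory usage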
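As a rough sketch, these three limits can be collected in one place and sanity-checked before they are handed to the classifier. The helper names are made up for illustration, the parameter names are the ones listed above (other auto-sklearn releases may name them differently), and the memory unit is assumed to be megabytes.

```python
def time_budget_kwargs(total_seconds, per_run_seconds, memory_mb):
    """Build keyword arguments for the time and memory limits."""
    if per_run_seconds > total_seconds:
        raise ValueError("per_run_time_limit should not exceed the total budget")
    return {
        "time_left_for_this_task": total_seconds,  # whole search, in seconds
        "per_run_time_limit": per_run_seconds,     # one model fit, in seconds
        "ml_memory_limit": memory_mb,              # memory cap, assumed to be MB
    }


def make_classifier(**kwargs):
    """Create the classifier; auto-sklearn is imported only when this is called."""
    import autosklearn.classification
    return autosklearn.classification.AutoSklearnClassifier(**kwargs)


# Example: a 10-minute search with at most 1 minute per model.
budget = time_budget_kwargs(total_seconds=600, per_run_seconds=60, memory_mb=4096)
print(budget)
```

With auto-sklearn installed, make_classifier(**budget) would create a classifier with these limits.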

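(2) Model storage

By default, the temporary directory and the output directory are deleted when training finishes. The following parameters set these directories and control whether they are deleted.

    tmp_folder: temporary directory
    output_folder: output directory
    delete_tmp_folder_after_terminate: whether to delete the temporary directory when training finishes
    delete_output_folder_after_terminate: whether to delete the output directory when training finishes
    shared_mode: whether to share models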
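A minimal sketch of how the storage parameters might be grouped so that both folders survive a run. The base directory and sub-folder names are hypothetical; only the parameter names come from the list above.

```python
import os


def storage_kwargs(base_dir, keep=True):
    """Keyword arguments that place both folders under base_dir."""
    return {
        "tmp_folder": os.path.join(base_dir, "autosklearn_tmp"),
        "output_folder": os.path.join(base_dir, "autosklearn_out"),
        # Deleting is the opposite of keeping.
        "delete_tmp_folder_after_terminate": not keep,
        "delete_output_folder_after_terminate": not keep,
    }


# Hypothetical base directory, chosen only for this example.
storage = storage_kwargs("/tmp/automl_demo", keep=True)
for key, value in storage.items():
    print(key, "=", value)
```

The resulting dictionary can be unpacked into the classifier constructor, for example AutoSklearnClassifier(**storage).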

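(3) Data splitting

The resampling_strategy parameter sets how the data is split into training and test sets, which guards against overfitting. Five-fold cross-validation is set like this:

resampling_strategy='cv',
resampling_strategy_arguments={'folds': 5}

The following splits the data into a training set and a test set, with the training set taking 2/3 of the data:

resampling_strategy='holdout',
resampling_strategy_arguments={'train_size': 0.67}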
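To show what the two strategies mean, here is a plain-Python sketch of how sample indices are divided. It is an illustration of the concept only, not auto-sklearn's implementation: real splitters usually shuffle (and often stratify) the data first, which this version skips.

```python
def kfold_indices(n_samples, folds):
    """Yield (train_idx, valid_idx) pairs for contiguous k-fold splitting."""
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, folds)
    start = 0
    for fold in range(folds):
        # The first `extra` folds take one additional sample.
        size = base + (1 if fold < extra else 0)
        valid = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, valid
        start += size


def holdout_indices(n_samples, train_size):
    """Split once: the first train_size fraction is used for training."""
    cut = int(n_samples * train_size)
    indices = list(range(n_samples))
    return indices[:cut], indices[cut:]


for train, valid in kfold_indices(10, 5):
    print(len(train), "train /", valid, "held out")
print(holdout_indices(9, 0.67))
```

With five folds every sample is held out exactly once, whereas holdout evaluates on a single fixed third of the data.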

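(4) Model selection

You can either name the candidate machine learning models or remove some models from the full set; only one of the two parameters needs to be set.

include_estimators: the models that may be used
exclude_estimators: models to remove from the full set

Besides the models in sklearn, auto-sklearn also supports xgboost. The available models and their names in auto-sklearn can be found in the source code. The supported classification models live in the directory autosklearn/pipeline/components/classification/, which includes adaboost, extra_trees, random_forest, libsvm_svc, xgradient_boosting and others.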
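Browsing that directory can be scripted. The sketch below locates the installed package and lists the module files in the classification components directory; it assumes that each module file name matches the estimator name used in include_estimators, which is consistent with the names quoted above but is not verified here.

```python
import importlib.util
import os


def list_components(directory):
    """Return module names (without .py) found in a components directory."""
    names = []
    for entry in sorted(os.listdir(directory)):
        if entry.endswith(".py") and not entry.startswith("__"):
            names.append(entry[:-3])
    return names


def classifier_directory():
    """Locate autosklearn/pipeline/components/classification on disk."""
    spec = importlib.util.find_spec("autosklearn")
    if spec is None or spec.origin is None:
        raise ImportError("auto-sklearn is not installed")
    package_root = os.path.dirname(spec.origin)
    return os.path.join(package_root, "pipeline", "components", "classification")
```

With auto-sklearn installed, print(list_components(classifier_directory())) shows the candidate names for that installation.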

Example code (pay attention to the data types; auto-sklearn is strict about them):

# -*- coding: utf-8 -*-

import ast

import numpy as np
import sklearn.metrics
import autosklearn.classification
from sklearn.model_selection import train_test_split

# Each line of the input files is expected to hold one Python literal:
# a feature vector in vecs_new.txt and a label in labels_new.txt.
with open('vecs_new.txt') as f:
    datas = [ast.literal_eval(line) for line in f]
with open('labels_new.txt') as f:
    labels = [ast.literal_eval(line) for line in f]

# Drop the last column and cast every feature to int,
# because auto-sklearn is strict about input types.
datas = np.array(datas)[:, :-1].astype(int)
labels = np.array(labels)
print(datas[:2])

x_train, x_test, y_train, y_test = train_test_split(datas, labels, train_size=0.7, random_state=1)
print(len(x_train), len(y_train))

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=120, per_run_time_limit=120,  # two minutes
    include_estimators=["random_forest"])
automl.fit(x_train, y_train)
# print(automl.show_models())
print(x_test[:2])
print(y_test[:2])
y_hat = automl.predict(x_test)
print("Accuracy score", sklearn.metrics.accuracy_score(y_test, y_hat))

Result after two minutes of training:
['/tmp/autosklearn_tmp_29423_5323/.auto-sklearn/ensembles/1.0000000000.ensemble', '/tmp/autosklearn_tmp_29423_5323/.auto-sklearn/ensembles/1.0000000001.ensemble', '/tmp/autosklearn_tmp_29423_5323/.auto-sklearn/ensembles/1.0000000002.ensemble', '/tmp/autosklearn_tmp_29423_5323/.auto-sklearn/ensembles/1.0000000003.ensemble', '/tmp/autosklearn_tmp_29423_5323/.auto-sklearn/ensembles/1.0000000004.ensemble', '/tmp/autosklearn_tmp_29423_5323/.auto-sklearn/ensembles/1.0000000005.ensemble', '/tmp/autosklearn_tmp_29423_5323/.auto-sklearn/ensembles/1.0000000006.ensemble', '/tmp/autosklearn_tmp_29423_5323/.auto-sklearn/ensembles/1.0000000007.ensemble', '/tmp/autosklearn_tmp_29423_5323/.auto-sklearn/ensembles/1.0000000008.ensemble']
[[ 6  0  0  4  5  2  1  3  1  0  0  0  0  0  0  0  1  0  0  0  0  0  0  2
   0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  2  0  0  0  0  0  0  0  0  0  0  0  0 14  0
   0  9  8  4  1  3  1  0  0  0  0  0  0  0  1  0  0  0  0  0  0  2  0  0
   1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0 12 11]
 [ 0  0  0  1  4  0  0  0  0  0  1  1  0  2  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0
   0  1 11  1  1  1  0  0  1  1  0  8  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  9  9]]
[0 0]
Accuracy score 0.5523690773067331

References:
swig and pcre installation
https://blog.csdn.net/shanglianlm/article/details/88797529
https://blog.csdn.net/zhangkzz/article/details/88555830
Getting started with auto-sklearn
https://www.jiqizhixin.com/articles/2019-08-13-8
https://www.jianshu.com/p/cd775730a1ec
auto-sklearn installation
https://automl.github.io/auto-sklearn/master/installation.html
