sklearn2pmml and xgboost: the missing-value (missing) pitfall

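Today a colleague ran into a nasty pitfall while deploying an xgboost PMML model: the online Spark predictions would not match the local Python predictions no matter what. Here is a record of how we tracked it down.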

I looked through my colleague's code, and it seemed fine:

from sklearn2pmml import PMMLPipeline
from sklearn2pmml import sklearn2pmml
from xgboost import XGBClassifier
# Positive-to-negative sample ratio; its inverse becomes scale_pos_weight.
weight = train_y.sum() * 1.0 / (len(train_data) - train_y.sum())
xgb_clf = XGBClassifier(learning_rate=0.1, n_estimators=100, max_depth=3,
                        objective='binary:logistic', seed=1, silent=1,
                        reg_alpha=3, reg_lambda=0.9, scale_pos_weight=1 / weight,
                        missing=-9999,  # sentinel for missing values, not NaN
                        eval_metric='auc')
pipeline = PMMLPipeline([('classifier',xgb_clf)])
pipeline.fit(train_x,train_y)
sklearn2pmml(pipeline,'data/a_card_1.pmml',with_repr=True)

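The first thing I noticed was that, unlike before, the missing values this time were not NaN, which made me suspicious. I retrained the model with the missing values in the samples set to np.nan and with missing left at its default of None. Compared against the online results, the predictions now matched, so it was indeed a missing-value problem.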
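A minimal sketch of that preprocessing step, assuming the features sit in a NumPy-compatible numeric array and that -9999 is the only sentinel in use. `sentinel_to_nan` is a hypothetical helper name, not part of xgboost or sklearn2pmml.

```python
import numpy as np


def sentinel_to_nan(features, sentinel=-9999.0):
    """Return a float copy of `features` with every `sentinel` replaced by NaN."""
    arr = np.array(features, dtype=float)  # copy, so the caller's data is untouched
    arr[arr == sentinel] = np.nan
    return arr


# Toy data for illustration only.
train_x_clean = sentinel_to_nan([[35.0, -9999.0], [-9999.0, 12.0]])
print(train_x_clean)
```

The cleaned array can then be passed to `pipeline.fit` with `missing` left at its default.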

sklearn2pmml does not expose the missing parameter for xgboost, so anyone whose missing is not None can convert the model with https://github.com/jpmml/jpmml-xgboost instead.

xgb_clf.get_booster().dump_model('/tmp/a_card_model.dump.txt')
xgb_clf.get_booster().save_model('/tmp/xgb.model')

java -jar target/jpmml-xgboost-executable-1.3-SNAPSHOT.jar --model-input /tmp/xgb.model --fmap-input /tmp/xgb.fmap --pmml-output xgboost_miss.pmml --missing-value -9999

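The fmap can be generated as follows.

fmap (feature map file): maps each feature id to its feature name
Format of featmap.txt: <featureid> <featurename> <q or i or int>\n

Feature ids start at 0 and run through (number of features - 1), in ascending order.
i means the feature is a binary indicator
q means a quantitative variable, such as age or time. q can be omitted
int means the feature is an integer (when int is hinted, the decision boundary will be integer)
It can be generated with the code below by reading feature_name from the pkl file, or in some other way based on the feature order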


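def create_feature_map(file_name, features):
    # One tab-separated line per feature: <id> <name> q (every feature treated as quantitative).
    with open(file_name, 'w') as outfile:
        for i, feat in enumerate(features):
            outfile.write('{0}\t{1}\tq\n'.format(i, feat))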
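If the i and int flags described above matter for a model, the type could be guessed from the data instead of writing q everywhere. The following is only a sketch under that assumption: `infer_fmap_type` and `write_feature_map` are hypothetical helpers, `is_new_user` is a made-up column, and missing values are assumed to be represented as None.

```python
import os
import tempfile


def infer_fmap_type(values):
    """Guess the fmap type flag from a column's non-missing values."""
    present = [v for v in values if v is not None]
    if present and all(v in (0, 1) for v in present):
        return 'i'    # binary indicator
    if present and all(float(v).is_integer() for v in present):
        return 'int'  # integer-valued feature
    return 'q'        # generic quantitative feature


def write_feature_map(file_name, columns):
    """Write one tab-separated '<id> <name> <type>' line per feature, ids starting at 0."""
    with open(file_name, 'w') as outfile:
        for i, (name, values) in enumerate(columns):
            outfile.write('{0}\t{1}\t{2}\n'.format(i, name, infer_fmap_type(values)))


# Toy columns for illustration only.
columns = [
    ('pas_age', [35, 41, None]),
    ('is_new_user', [0, 1, 1]),
    ('last_30_days_invoice_value', [2407.28, 13.5]),
]
path = os.path.join(tempfile.mkdtemp(), 'xgb.fmap')
write_feature_map(path, columns)
with open(path) as f:
    fmap_text = f.read()
print(fmap_text)
```

Writing q for every feature, as the function above does, stays the simpler choice when no integer or indicator hint is wanted.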

Comparing the two PMML files shows that the only difference is the missing-value setting added to each DataField:

<DataDictionary>
		<DataField name="_target" optype="categorical" dataType="integer">
			<Value value="0"/>
			<Value value="1"/>
		</DataField>
		<DataField name="pas_age" optype="continuous" dataType="float">
			<Value value="-9999" property="missing"/>
		</DataField>
		<DataField name="last_gulf_call_days" optype="continuous" dataType="float">
			<Value value="-9999" property="missing"/>
		</DataField>
		....
</DataDictionary>

Manually adding these entries to the earlier PMML file also solves the problem.
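That manual edit could also be scripted. Below is a minimal sketch using only the standard library, assuming every continuous DataField should receive the same sentinel. `add_missing_value` is a hypothetical helper, and the PMML-4_3 namespace in the sample is an assumption; the code reuses whatever namespace the real file declares.

```python
import xml.etree.ElementTree as ET


def add_missing_value(pmml_text, missing='-9999'):
    """Add <Value value=... property="missing"/> to every continuous DataField."""
    root = ET.fromstring(pmml_text)
    # Reuse the namespace the document declares, if any.
    ns = root.tag[:root.tag.index('}') + 1] if root.tag.startswith('{') else ''
    if ns:
        ET.register_namespace('', ns[1:-1])  # keep the default namespace unprefixed on output
    for field in root.iter(ns + 'DataField'):
        if field.get('optype') != 'continuous':
            continue  # leave the categorical target alone
        already = any(v.get('property') == 'missing'
                      for v in field.findall(ns + 'Value'))
        if not already:
            ET.SubElement(field, ns + 'Value',
                          {'value': missing, 'property': 'missing'})
    return ET.tostring(root, encoding='unicode')


# Tiny stand-in document for illustration only.
sample = (
    '<PMML xmlns="http://www.dmg.org/PMML-4_3" version="4.3">'
    '<DataDictionary>'
    '<DataField name="_target" optype="categorical" dataType="integer">'
    '<Value value="0"/><Value value="1"/></DataField>'
    '<DataField name="pas_age" optype="continuous" dataType="float"/>'
    '</DataDictionary></PMML>'
)
patched = add_missing_value(sample)
print(patched)
```

Running the function on an already patched file leaves it unchanged, since fields that already carry a missing entry are skipped.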

That said, I think the better approach is to use the default, np.nan, which maps to null in Spark and feels very natural.

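I did not really understand how the PMML handles missing values, though, so here is one tree from the native xgboost dump next to its PMML counterpart: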

booster[0]:
0:[last_30_days_invoice_value<2407.28491] yes=1,no=2,missing=2
	1:[last_6_month_finish_count_variation_coefficient<0.61500001] yes=3,no=4,missing=4
		3:[last_6_month_fast_finish_order_max_actual_cost<86.4949951] yes=7,no=8,missing=8
			7:leaf=-0.0717158243
			8:leaf=-0.147665188
		4:[last_1_year_taxi_finish_order_actual_cost<505.25] yes=9,no=10,missing=9
			9:leaf=-0.0261387583
			10:leaf=-0.178924426
	2:[app_system_tools_wifi_category_number_rate<0.0645833313] yes=5,no=6,missing=5
		5:[last_1_year_night_finish_rate<0.0652500018] yes=11,no=12,missing=12
			11:leaf=-0.0177322756
			12:leaf=0.0268170126
		6:[app_stock_sub_category_number_rate<0.0875959098] yes=13,no=14,missing=13
			13:leaf=0.06783209
			14:leaf=-0.0312540941

<Segment id="1">
   <True/>
   <TreeModel functionName="regression" missingValueStrategy="none" noTrueChildStrategy="returnLastPrediction" splitCharacteristic="multiSplit" x-mathContext="float">
       <MiningSchema>
           <MiningField name="last_6_month_fast_finish_order_max_actual_cost"/>
           <MiningField name="last_1_year_night_finish_rate"/>
           <MiningField name="last_30_days_invoice_value"/>
           <MiningField name="app_stock_sub_category_number_rate"/>
           <MiningField name="app_system_tools_wifi_category_number_rate"/>
           <MiningField name="last_1_year_taxi_finish_order_actual_cost"/>
           <MiningField name="last_6_month_finish_count_variation_coefficient"/>
       </MiningSchema>
       <Node score="0.026817013">
           <True/>
           <Node score="-0.026138758">
               <SimplePredicate field="last_30_days_invoice_value" operator="lessThan" value="2407.285"/>
               <Node score="-0.14766519">
                   <SimplePredicate field="last_6_month_finish_count_variation_coefficient" operator="lessThan" value="0.615"/>
                   <Node score="-0.071715824">
                       <SimplePredicate field="last_6_month_fast_finish_order_max_actual_cost" operator="lessThan" value="86.494995"/>
                   </Node>
               </Node>
               <Node score="-0.17892443">
                   <SimplePredicate field="last_1_year_taxi_finish_order_actual_cost" operator="greaterOrEqual" value="505.25"/>
               </Node>
           </Node>
           <Node score="0.06783209">
               <SimplePredicate field="app_system_tools_wifi_category_number_rate" operator="greaterOrEqual" value="0.06458333"/>
               <Node score="-0.031254094">
                   <SimplePredicate field="app_stock_sub_category_number_rate" operator="greaterOrEqual" value="0.08759591"/>
               </Node>
           </Node>
           <Node score="-0.017732276">
               <SimplePredicate field="last_1_year_night_finish_rate" operator="lessThan" value="0.06525"/>
           </Node>
       </Node>
   </TreeModel>
</Segment>

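xgboost states explicitly what to do when it meets a missing value, but the PMML does not seem to. If you can see how it works, please let me know. Many thanks.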
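One possible reading, traced by hand from the two listings above and not confirmed against the converter's source: with missingValueStrategy="none", a comparison against a missing value does not evaluate to true, so that node is skipped and the next sibling is tried; when no child matches, noTrueChildStrategy="returnLastPrediction" returns the score of the node already reached. Under that reading, each split seems to be arranged so that xgboost's missing direction becomes the fall-through. The toy evaluator below is a hypothetical sketch of those two rules, not the JPMML evaluator, and it represents a missing value as an absent key.

```python
import operator

OPS = {'lessThan': operator.lt, 'greaterOrEqual': operator.ge}

# Each node is (score, predicate, children); predicate None stands for <True/>.
TREE = (0.026817013, None, [
    (-0.026138758, ('last_30_days_invoice_value', 'lessThan', 2407.285), [
        (-0.14766519, ('last_6_month_finish_count_variation_coefficient', 'lessThan', 0.615), [
            (-0.071715824, ('last_6_month_fast_finish_order_max_actual_cost', 'lessThan', 86.494995), []),
        ]),
        (-0.17892443, ('last_1_year_taxi_finish_order_actual_cost', 'greaterOrEqual', 505.25), []),
    ]),
    (0.06783209, ('app_system_tools_wifi_category_number_rate', 'greaterOrEqual', 0.06458333), [
        (-0.031254094, ('app_stock_sub_category_number_rate', 'greaterOrEqual', 0.08759591), []),
    ]),
    (-0.017732276, ('last_1_year_night_finish_rate', 'lessThan', 0.06525), []),
])


def matches(predicate, row):
    """A predicate on a missing field never matches (missingValueStrategy="none")."""
    if predicate is None:
        return True
    field, op, threshold = predicate
    value = row.get(field)
    if value is None:
        return False
    return OPS[op](value, threshold)


def score(node, row):
    """Descend into the first matching child; otherwise return this node's score."""
    node_score, _, children = node
    for child in children:
        if matches(child[1], row):
            return score(child, row)
    return node_score  # returnLastPrediction


all_missing = score(TREE, {})
only_invoice = score(TREE, {'last_30_days_invoice_value': 1000.0})
print(all_missing, only_invoice)
```

With every feature missing, the xgboost dump goes 0 -> 2 -> 5 -> 12 and the sketch returns the score of leaf 12. With only last_30_days_invoice_value set to 1000, the dump goes 0 -> 1 -> 4 -> 9 and the sketch returns the score of leaf 9, which is consistent with the reading above.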

We built a configuration-driven way to deploy models on Spark. One model's deployment configuration looks like this:

sparkConf:
  #Spark job name, required
  appName: driverCCardPMML
  #whether to enable Hive support
  enableHiveSupport: true
  #other Spark options, such as memory and the number of shuffle partitions
appConf:
  #when debug is on, every node is persisted
  debug: true
  #number of rows to persist, 0 means all
  limit: 10
  savePath: /user/fbi/model_deploy/
  sourcePath: /user/fbi/model_source/
#a child node has exactly one parent, so a tree is the better fit
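tree:
  #node description
  desc: C-card PMML
  #name, used to identify a component
  name: model_pmml
  #parameters passed to the component, including model hyperparameters and configuration parameters
  parameters:
    #path to the pmml file; only local files are supported for now; upload it with spark-submit --files glm1.pmml
    pmmlPath: zkc_driver_ccard_v1.1.pmml
    #whether to exclude the original columns, default false (keep them)
    excludeOriginColumn: true
    #exceptions to the exclusion, such as uid
    excludeExcept: ["uid"]
  #child nodes merge all results; if children has only one entry, joinType and joinKey can be omitted
  #join type: full (default), inner, left, right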
  children:
    - desc: load raw C-card data
      name: data_source
      parameters:
        #supports hql, hql_file, json
        type: hql_file
        path: datasource.sql
        #to make model validation easier, saveTable can be set to persist the data source; it will be saved to /user/fbi/model_source/year/month/day/model_driver_c_card_source.parquet
        saveTable: model_driver_c_card_source
      transformer:
        - desc: data type conversion
          name: feature_data_type
          parameters:
            #original data types: tinyint, smallint, int, bigint, float, double, string, decimal
            originalType: ["decimal"]
            #target data type
            targetType: double
            #column names to exclude, optional
            #exceptColumn: []
  #transformer nodes, pipeline mode
  transformer:
    - desc: write to table
      name: data_sink
      parameters:
        #whether to create the table automatically
        auto: true
        path: /user/fbi/
        db: riskmanage_dm
        table: model_driver_c_card_v1
        tableName: model-driver-C-card-v1
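The framework that reads this configuration is not shown, so the following is only a guess at the order in which such a tree might run: children first, then the node's own component, then its transformer list as a pipeline. `execution_order` is a hypothetical function, and the dict is a trimmed copy of the configuration above.

```python
def execution_order(node):
    """Yield component names: children first, then the node, then its transformers."""
    for child in node.get('children', []):
        yield from execution_order(child)
    yield node['name']
    for step in node.get('transformer', []):
        yield step['name']


# Trimmed copy of the configuration tree, names only.
tree = {
    'name': 'model_pmml',
    'children': [
        {'name': 'data_source', 'transformer': [{'name': 'feature_data_type'}]},
    ],
    'transformer': [{'name': 'data_sink'}],
}
order = list(execution_order(tree))
print(order)
```

Under that assumption the data is loaded and cast first, scored by the PMML model next, and written to the table last.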

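Lately I have also been wondering whether there is a better way to deploy offline, such as a DSL. For example, spark-sql could implement the whole flow above, although its syntax would need to be extended a little. Is it worth trying?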

How do you all deploy offline models? I would be glad to compare notes.
