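Trying out H2O fully automated machine learning
Download the dataset
The data comes from the Tianchi practice competition "Industrial Steam Volume Prediction". Download the dataset here: https://tianchi.aliyun.com/competition/entrance/231693/introduction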
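Before handing the file to H2O, it can be worth a quick look at the column layout, since the script later passes one type per column. A minimal sketch, assuming the download is a tab-separated text file with a header row; the helper name `read_header` is made up for illustration:

```python
import csv
import os


def read_header(path, sep="\t"):
    """Return the column names from the first line of a delimited text file."""
    with open(path, newline="") as f:
        return next(csv.reader(f, delimiter=sep))


# Only runs if the training file is already in the working directory.
train_path = "zhengqi_train.txt"
if os.path.exists(train_path):
    columns = read_header(train_path)
    print(len(columns), "columns; last column:", columns[-1])
```

The column count printed here is the number that the `col_types` list in the training script has to match.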
Install H2O
H2O requirements:
pip install requests
pip install tabulate
pip install "colorama>=0.3.8"
pip install future
Install H2O:
pip install -f http://h2o-release.s3.amazonaws.com/h2o/latest_stable_Py.html h2o
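After running the commands above, a small check like the following can confirm that everything landed in the active Python environment. This is just a sketch using the standard library; the function name `missing_packages` is hypothetical, and it only checks that each package can be found, not which version is installed:

```python
import importlib.util

# Import names of the packages installed above.
REQUIRED = ("requests", "tabulate", "colorama", "future", "h2o")


def missing_packages(names=REQUIRED):
    """Return the names in `names` that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages()
print("Missing packages:", ", ".join(missing) if missing else "none")
```

Note that H2O also needs a Java runtime to start its local server, which this check does not cover.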
Train the model and predict
import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator
# Initialize H2O
h2o.init()
# Read the datasets
col_types = ["numeric"] * 39  # one entry per column in the training file
data = h2o.import_file('zhengqi_train.txt', sep='\t', col_types=col_types)
out = h2o.import_file('zhengqi_test.txt', sep='\t')
# Split the data into a training set and a hold-out set
train, test = data.split_frame(ratios=[.7], seed=1)
# Set the predictor and response column names
x = train.columns
y = "target"
x.remove(y)
# Train the base models; both use the same fold assignment and keep their
# cross-validation predictions, which the stacked ensemble needs
nfolds = 7
gbm = H2OGradientBoostingEstimator(nfolds=nfolds,
                                   fold_assignment="Modulo",
                                   keep_cross_validation_predictions=True)
gbm.train(x=x, y=y, training_frame=train)
rf = H2ORandomForestEstimator(nfolds=nfolds,
                              fold_assignment="Modulo",
                              keep_cross_validation_predictions=True)
rf.train(x=x, y=y, training_frame=train)
# Stack the two base models and evaluate on the hold-out set
stack = H2OStackedEnsembleEstimator(model_id="ensemble",
                                    base_models=[gbm.model_id, rf.model_id])
stack.train(x=x, y=y, training_frame=train, validation_frame=test)
print(stack.model_performance(test_data=test))
# Predict and save the results for submission, one value per line
result = stack.predict(out)
result = result.as_data_frame()['predict'].to_list()
with open('result_h2o.txt', 'w') as f:
    for i in result:
        f.write("{}\n".format(i))
# h2o.export_file(result,'result_h2o.txt',sep = "\n",parts = 1)
h2o.cluster().shutdown()
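The script above builds the ensemble by hand from a GBM and a random forest. H2O also ships an AutoML interface, `H2OAutoML`, that trains and ranks a set of models on its own. Here is a rough sketch using the same file names. The function names `feature_columns` and `run_automl` and the `max_models=10` budget are illustrative choices of mine, not part of the run described in this post, and the score reported below did not come from this code:

```python
def feature_columns(columns, target):
    """Return all column names except the response column."""
    return [c for c in columns if c != target]


def run_automl(train_path, test_path, target="target", max_models=10, seed=1):
    """Train H2O AutoML on one tab-separated file and predict on another."""
    # Imported here so the helper above stays usable without H2O installed.
    import h2o
    from h2o.automl import H2OAutoML

    h2o.init()
    train = h2o.import_file(train_path, sep="\t")
    out = h2o.import_file(test_path, sep="\t")

    # max_models caps how many models AutoML trains (assumed budget).
    aml = H2OAutoML(max_models=max_models, seed=seed)
    aml.train(x=feature_columns(train.columns, target), y=target,
              training_frame=train)
    print(aml.leaderboard.head())

    # aml.leader is the top-ranked model on the leaderboard.
    preds = aml.leader.predict(out).as_data_frame()["predict"].to_list()
    h2o.cluster().shutdown()
    return preds
```

Calling `run_automl("zhengqi_train.txt", "zhengqi_test.txt")` would return a list of predictions that can be written out one per line, the same way as `result` above.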
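Submit the results
With no feature engineering at all, this submission beat 86% of the teams in the practice competition!
So H2O looks pretty capable. Next I'll try combining it with Spark to run on big data.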