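After training a model with PySpark, you often need to write the training results to a Hive table. This post shows how to save DataFrame data as a Hive table.
To save DataFrame data to Hive you need a HiveContext. Here is how to use it: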
#!/usr/bin/python
# -*- coding: utf-8 -*-
from pyspark.sql import Row
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import LogisticRegression
from pyspark.sql import SparkSession
from pyspark.sql import HiveContext
# Create a session with Hive support
spark = SparkSession \
    .builder \
    .master("yarn") \
    .appName('create_df_test2') \
    .enableHiveSupport() \
    .getOrCreate()
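# Load the training data
trainData = spark.sql("""select * from table""")
# Build the training set: the last column is the label, the other columns are the features
trainingSet = trainData.rdd.map(lambda x: Row(label=x[-1], features=Vectors.dense(x[:-1]))).toDF()
# Train the model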
LR = LogisticRegression(labelCol='label', featuresCol="features") # regParam=0.01,
LRModel = LR.fit(trainingSet)
coef = LRModel.coefficients
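# Collect the training result: three coefficients plus the intercept
# (this assumes the model has exactly three features)
re = [(coef[0], coef[1], coef[2], LRModel.intercept)]
# Convert the list to a DataFrame
df_re = spark.createDataFrame([(float(tup[0]), float(tup[1]), float(tup[2]), float(tup[3])) for tup in re],
                              ['r1', 'r2', 'r3', 'r'])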
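# Create the HiveContext; its constructor takes the SparkContext, not the SparkSession
hive_text = HiveContext(spark.sparkContext)
# Register the DataFrame as a temporary table with registerDataFrameAsTable;
# the table name must match the one used in the queries below
hive_text.registerDataFrameAsTable(df_re, tableName='table_tmp')
data_2 = hive_text.sql("select * from table_tmp")  # run a SQL query against the temporary table
print(data_2.take(2))
print(data_2.collect()[0])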
# First run: table1 does not exist yet, so create it from the query
hive_text.sql('create table table1 as select * from table_tmp')
# If the table has already been created, overwrite its contents instead:
hive_text.sql('insert overwrite table table1 select * from table_tmp')
spark.stop()
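The choice between the create statement and the insert overwrite statement in the script can be made in code instead of by hand. Below is a minimal sketch under stated assumptions: build_save_sql is a hypothetical helper, not part of PySpark, and it only builds the SQL string. Whether the table exists has to be determined separately and passed in.

```python
def build_save_sql(target, source, table_exists):
    """Return the SQL that writes `source` into the Hive table `target`.

    Hypothetical helper for illustration; it only builds the statement
    and does not talk to Spark or Hive.
    """
    if table_exists:
        # The table is already there: replace its contents.
        return "insert overwrite table {} select * from {}".format(target, source)
    # First run: create the table from the query result.
    return "create table {} as select * from {}".format(target, source)


# Show both variants for the table names used in the script.
print(build_save_sql("table1", "table_tmp", table_exists=False))
print(build_save_sql("table1", "table_tmp", table_exists=True))
```

In the script, the returned string could be passed to hive_text.sql(...). One possible way to obtain the flag, as an assumption rather than something the original script does, is to check whether 'table1' appears among the names returned by spark.catalog.listTables().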
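The intermediate output comes from the two print calls: take(2) returns at most two rows, and collect()[0] returns the first row.
Once the job has finished successfully, the table can be queried on the query platform.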
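Because the session is built with enableHiveSupport(), the same table can probably also be written without SQL strings, through the DataFrame writer. This is an alternative technique, not the approach used in the script above. The sketch assumes a hypothetical wrapper named save_df_to_hive; the write.mode(...) and saveAsTable(...) calls are the standard DataFrameWriter methods.

```python
def save_df_to_hive(df, table_name, mode="overwrite"):
    """Persist a Spark DataFrame as a table through the DataFrameWriter API.

    Hypothetical wrapper for illustration. `df` is expected to be a
    pyspark.sql.DataFrame created from a Hive-enabled SparkSession.
    With mode="overwrite" an existing table is replaced; with
    mode="append" the rows are added to it.
    """
    df.write.mode(mode).saveAsTable(table_name)


# Usage inside the script (before spark.stop()), with the result DataFrame:
# save_df_to_hive(df_re, "table1")
```

With this variant there is no need to distinguish the first run from later runs, since the overwrite mode covers both cases. The storage format of the resulting table may differ from the one the SQL statements produce, so it is worth checking the table definition after the first write.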