It has to be said that Spark is an excellent compute engine: it ships with Spark-ML and Spark-GraphX, its machine-learning and graph-computation frameworks. Spark-ML is typically used for offline analysis and mining to produce models.
Suppose we have saved such a model on HDFS and want to use the pre-trained model inside a real-time computation.
There are two options:
1. Convert the model's serialization into a form that can be called from other languages, such as Java or Python.
2. Use the model directly inside Spark Streaming.
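For option 1, it is worth noting that Spark ML's built-in on-disk persistence format is already language-neutral, so the simplest cross-language route is often plain save/load rather than a custom serialization. A minimal sketch (the path below is a placeholder assumption, not from the original article):

```scala
// Spark ML model persistence is language-neutral: a PipelineModel written
// by a Scala job can be read back from PySpark, e.g.
//
//   // Scala (offline training job):
//   pipelineModel.write.overwrite().save(modelPath)
//
//   # Python (consumer of the same directory):
//   from pyspark.ml import PipelineModel
//   model = PipelineModel.load(model_path)
//
// Hypothetical HDFS location used for illustration only:
val modelPath = "hdfs:///models/bayes-pipeline"
```

This avoids maintaining a separate export step as long as both sides run Spark; exporting to a non-Spark runtime would still require a dedicated format.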
Reading the Kafka configuration and guaranteeing exactly-once semantics (EOS) are not covered here; the focus is on how to use a model trained with Spark-ML. The code is as follows:
import com.alibaba.fastjson.JSON
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

val spark = SparkSession.builder()
  .appName("StreamingMLModel")
  .getOrCreate()
import spark.implicits._

val ssc = new StreamingContext(spark.sparkContext, Seconds(2))

val bootstrapServer = ""
val groupId = "E30E62E2-8B73-BBB0-AA8C-1A53E400646F-1"
val kafkaParams = Map(
  ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> bootstrapServer,
  ConsumerConfig.GROUP_ID_CONFIG -> groupId,
  ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
  ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
  ConsumerConfig.AUTO_OFFSET_RESET_CONFIG -> "latest",
  ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG -> "false",
  ConsumerConfig.MAX_POLL_RECORDS_CONFIG -> "100000"
)

val newsTopic = ""
val topicSet = Set(newsTopic)
val kafkaStream = KafkaUtils.createDirectStream(
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](topicSet, kafkaParams)
)

// Load the PipelineModel trained offline by Spark-ML (e.g. from HDFS).
// It is loaded once on the driver and shipped to executors with the closure.
val bayesPipeLineModelPath = ""
val pipeLine = PipelineModel.load(bayesPipeLineModelPath)

val source = kafkaStream.map(_.value()).map(JSON.parseObject)
source.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    // Data other than Weibo
    val data = rdd.map(jsonObj => (jsonObj.getString("id"), jsonObj.getString("content")))
      .toDF("id", "content")
    // Score each micro-batch with the pre-trained model
    val predict = pipeLine.transform(data)
    val filterRdd = predict.select("id", "prediction")
      .map(row => (row.getString(0), row.getDouble(1)))
    filterRdd.foreachPartition { records =>
      val list = records.toList
      val goodNews = list.filter { case (news, bayes) => bayes == 1.0 }
      val badNews = list.filter { case (news, bayes) => bayes == 0.0 }
      // TODO: write out to an external DB, a cache, or Kafka
    }
  }
}

ssc.start()
ssc.awaitTermination()
Two operational notes:
1. When the streaming job starts for the first time, be sure to limit the consumption rate.
2. Add monitoring, either via the Streaming web UI or with a custom class based on StreamingListener.
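For the first note, the direct Kafka stream honors `spark.streaming.kafka.maxRatePerPartition` (records per partition per second). The helper below is only an illustration of how one might size that cap from a target batch size; the numbers are assumptions, not values from the article:

```scala
// Sketch: derive a per-partition rate cap from a target batch size.
// maxRatePerPartition = targetRecordsPerBatch / (batchIntervalSec * partitions)
def maxRatePerPartition(targetRecordsPerBatch: Long,
                        batchIntervalSec: Long,
                        partitions: Int): Long =
  targetRecordsPerBatch / (batchIntervalSec * partitions)

// e.g. at most 100,000 records per 2-second batch spread over 10 partitions:
val rate = maxRatePerPartition(100000L, 2L, 10)

// Then, before building the StreamingContext:
//   sparkConf.set("spark.streaming.kafka.maxRatePerPartition", rate.toString)
//   sparkConf.set("spark.streaming.backpressure.enabled", "true")
```

With back-pressure enabled, Spark adapts the ingestion rate after the first batches, so the hard cap mainly protects the initial catch-up from a large Kafka backlog. For the second note, a class extending `org.apache.spark.streaming.scheduler.StreamingListener` can override `onBatchCompleted` to report scheduling delay and processing time, and is registered with `ssc.addStreamingListener(...)`.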