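Spark Streaming: Using DataFrame and SQL Operations
To use DataFrames and SQL on streaming data, create a SparkSession from the SparkContext that the StreamingContext is using. This has to be done in such a way that the session can be re-created when the application restarts after a driver failure.
This is achieved by creating a lazily instantiated singleton instance of SparkSession. The example application in this post shows the approach: it modifies the earlier word count example to generate word counts with DataFrames and SQL. Each RDD is converted to a DataFrame, registered as a temporary view, and then queried with SQL.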
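As a rough illustration of the "lazily instantiated singleton" idea, here is a minimal plain-Java sketch that runs without Spark. `FakeSession` and `LazySessionHolder` are hypothetical names invented for this sketch and are not Spark classes; in the real application the factory would presumably be a call such as `SparkSession.builder().config(conf).getOrCreate()`.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for an expensive object such as SparkSession. */
final class FakeSession {
    final String appName;

    FakeSession(String appName) {
        this.appName = appName;
    }
}

/** Lazily instantiated singleton holder (illustrative only; not a Spark class). */
final class LazySessionHolder {
    private static FakeSession instance;  // stays null until the first request
    private static int creations = 0;     // counts how often the factory actually ran

    private LazySessionHolder() { }

    /** Builds the instance on the first call; later calls reuse it. */
    static synchronized FakeSession getInstance(Supplier<FakeSession> factory) {
        if (instance == null) {
            instance = factory.get();
            creations++;
        }
        return instance;
    }

    static synchronized int creations() {
        return creations;
    }

    public static void main(String[] args) {
        // Simulate two micro-batches that each ask for the session.
        FakeSession first = getInstance(() -> new FakeSession("StreamingWordCountApp"));
        FakeSession second = getInstance(() -> new FakeSession("StreamingWordCountApp"));
        System.out.println(first == second);  // true: both batches share one instance
        System.out.println(creations());      // 1: the factory ran only once
    }
}
```

Because the instance is created on demand rather than stored ahead of time, a freshly restarted process simply builds a new one the first time a batch asks for it.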
The code is written in Java.
StreamingWordCountApp.java
package com.imooc.spark;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import java.util.Arrays;

/**
 * A Spark Streaming application written in Java.
 */
public class StreamingWordCountApp {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setMaster("local[2]")
                .setAppName("StreamingWordCountApp");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        // Create a DStream from a socket source (hostname + port)
        JavaDStream<String> lines = jssc
                .socketTextStream("192.168.3.173", 9999);

        // Split each line into tab-separated words, as in the earlier word count example.
        // That example then counted with mapToPair(...).reduceByKey(...) and counts.print();
        // here the counting is done with SQL instead.
        JavaDStream<String> words = lines
                .flatMap(line -> Arrays.asList(line.split("\t")).iterator());

        words.foreachRDD((rdd, time) -> {
            // Get the singleton instance of SparkSession (getOrCreate reuses an existing session)
            SparkSession spark = SparkSession.builder().config(rdd.context().getConf()).getOrCreate();

            // Convert JavaRDD<String> to JavaRDD<JavaRow> (a bean class), then to a DataFrame
            JavaRDD<JavaRow> rowRDD = rdd.map(word -> {
                JavaRow record = new JavaRow();
                record.setWord(word);
                return record;
            });
            Dataset<Row> wordsDataFrame = spark.createDataFrame(rowRDD, JavaRow.class);

            // Create a temporary view from the DataFrame
            wordsDataFrame.createOrReplaceTempView("words");

            // Run the word count over the view with SQL and print the result
            Dataset<Row> wordCountsDataFrame =
                    spark.sql("select word, count(*) as total from words group by word");
            wordCountsDataFrame.show();
        });

        jssc.start();
        jssc.awaitTermination();
    }
}
JavaRow.java
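package com.imooc.spark;

/**
 * Java bean holding a single word; its getter and setter define the DataFrame column "word".
 */
public class JavaRow implements java.io.Serializable {
    private String word;

    public String getWord() {
        return word;
    }

    public void setWord(String word) {
        this.word = word;
    }
}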
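To make clear what the SQL statement in the application computes for each batch, the following plain-Java sketch produces the same grouping without Spark. `WordCountSketch` and `countWords` are hypothetical names used only for illustration, and the sample batch is made up; it is a minimal sketch of the semantics of `select word, count(*) as total from words group by word`, not how Spark executes the query.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

final class WordCountSketch {
    private WordCountSketch() { }

    /** Groups equal words and counts each group, like "group by word" with count(*). */
    static Map<String, Long> countWords(List<String> words) {
        return words.stream()
                .collect(Collectors.groupingBy(
                        Function.identity(),    // the grouping key is the word itself
                        TreeMap::new,           // sorted map, so printing is deterministic
                        Collectors.counting()));
    }

    public static void main(String[] args) {
        // One made-up micro-batch of words
        List<String> batch = Arrays.asList("spark", "sql", "spark", "streaming", "spark");
        System.out.println(countWords(batch)); // {spark=3, sql=1, streaming=1}
    }
}
```

In the streaming application this computation is repeated for every 5-second batch, so each printed table reflects only the words received in that interval.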
Run command:
Running the Spark Streaming program produces the following output:
Official documentation: http://spark.apache.org/docs/2.3.0/streaming-programming-guide.html