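Output operations specify what is done with the data produced by the transformations on a stream, for example pushing the results to an external database or printing them to the screen. As with lazy evaluation in RDDs, if neither a DStream nor any DStream derived from it has an output operation applied, none of those DStreams is evaluated. If no output operation is registered on the StreamingContext at all, the context will not start.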
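The role of an output operation can be illustrated with a minimal sketch. It assumes spark-streaming is on the classpath and replaces Kafka with an in-memory `queueStream`, so no broker is needed; the object name `OutputOperationSketch` and the `run` method are illustrative and not part of the original code.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.concurrent.TrieMap
import scala.collection.mutable

object OutputOperationSketch {
  // Runs a tiny word count over one in-memory batch and returns what the
  // output operation collected on the driver.
  def run(): Map[String, Int] = {
    val conf = new SparkConf().setAppName("OutputOperationSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))
    val seen = TrieMap.empty[String, Int]
    try {
      // In-memory input (assumption for this sketch): a queue holding a single RDD.
      val queue = mutable.Queue[RDD[String]]()
      queue += ssc.sparkContext.parallelize(Seq("a b", "b c"))

      // Transformations only describe the computation; nothing runs yet.
      val counts = ssc.queueStream(queue)
        .flatMap(_.split(" "))
        .map(word => (word, 1))
        .reduceByKey(_ + _)

      // Output operation: this is what causes each batch to be evaluated.
      // Without any output operation, start() is expected to fail with an
      // IllegalArgumentException ("No output operations registered ...").
      counts.foreachRDD { rdd =>
        rdd.collect().foreach { case (word, n) => seen.update(word, n) }
      }

      ssc.start()
      ssc.awaitTerminationOrTimeout(5000) // let a few 1-second batches run
    } finally {
      ssc.stop(stopSparkContext = true, stopGracefully = false)
    }
    seen.toMap
  }

  def main(args: Array[String]): Unit = println(run())
}
```

The body of `foreachRDD` runs on the driver once per batch; only the RDD operations called inside it (here `collect`) trigger work on the executors.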
package com.ljpbd.bigdata.spark.streaming

import com.ljpbd.bigdata.spark.Util.JdbcUtil
import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord}
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}

import java.sql.{Connection, PreparedStatement, ResultSet}
import java.text.SimpleDateFormat
import java.util.Date
import scala.collection.mutable.ListBuffer

object SparkStreaming11_Req1BlockList {

  def main(args: Array[String]): Unit = {
    // 1. Create the SparkConf
    val sparkConf: SparkConf = new SparkConf().setAppName("ReceiverWordCount").setMaster("local[*]")

    // 2. Create the StreamingContext with a 3-second batch interval
    val ssc = new StreamingContext(sparkConf, Seconds(3))

    // 3. Define the Kafka parameters
    val kafkaPara: Map[String, Object] = Map[String, Object](
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "linux1:9092,linux2:9092,linux3:9092",
      ConsumerConfig.GROUP_ID_CONFIG -> "atguigu",
      "key.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
      "value.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer"
    )

    // 4. Read from Kafka and create the DStream
    val kafkaDStream: InputDStream[ConsumerRecord[String, String]] =
      KafkaUtils.createDirectStream[String, String](
        ssc,
        LocationStrategies.PreferConsistent,
        ConsumerStrategies.Subscribe[String, String](Set("atguigu"), kafkaPara)
      )

    // 5. Take the value of each message and parse it into an AdClickData
    val adClickData: DStream[AdClickData] = kafkaDStream.map(
      kafkaData => {
        val data: String = kafkaData.value()
        val datas: Array[String] = data.split(" ")
        AdClickData(datas(0), datas(1), datas(2), datas(3), datas(4))
      }
    )

    // Periodically load the blacklist, drop clicks from blacklisted users,
    // and count the remaining clicks per (day, user, ad) for each batch.
    val ds: DStream[((String, String, String), Int)] = adClickData.transform(
      rdd => {
        // This part runs on the driver once per batch: load the blacklist over JDBC
        val blackList = ListBuffer[String]()
        val connection: Connection = JdbcUtil.getConnection
        val pstat: PreparedStatement = connection.prepareStatement("select userid from black_list")
        val rs: ResultSet = pstat.executeQuery()
        while (rs.next()) {
          blackList.append(rs.getString(1))
        }
        rs.close()
        pstat.close()
        connection.close()

        // Keep only clicks from users that are not on the blacklist
        val filterRdd: RDD[AdClickData] = rdd.filter(
          data => {
            !blackList.contains(data.user)
          }
        )

        // Count the clicks of non-blacklisted users within this batch
        filterRdd.map(
          data => {
            val sdf = new SimpleDateFormat("yyyy-MM-dd")
            // ts arrives as a string; it is assumed to hold epoch milliseconds
            val day = sdf.format(new Date(data.ts.toLong))
            val user = data.user
            val ad = data.ad
            ((day, user, ad), 1)
          }
        ).reduceByKey(_ + _)
      }
    )

    // 6. Output operation: write the per-batch counts to the database
    ds.foreachRDD(
      rdd => {
        // rdd.foreach opens one connection per record.
        // foreach is an RDD operator: code outside the operator runs on the driver,
        // code inside it runs on the executors. An object created on the driver and
        // used inside the operator is captured by the closure and must be serialized,
        // but a database connection is not serializable, so this does not work:
        //   val conn: Connection = JdbcUtil.getConnection
        // foreachPartition is the more efficient operator: it allows one connection
        // per partition, which greatly reduces the number of connections created.
        /*
        rdd.foreachPartition(
          iter => {
            val conn: Connection = JdbcUtil.getConnection
            conn.close()
          }
        )
        */
        rdd.foreach {
          case ((day, user, ad), count) => {
            println(s"${day} ${user} ${ad} ${count}")
            if (count >= 30) {
              // The count within this batch already reaches the click threshold:
              // add the user to the blacklist
              val conn: Connection = JdbcUtil.getConnection
              val sql =
                """
                  |insert into black_list(userid) values(?)
                  |on DUPLICATE KEY
                  |UPDATE userid = ?
                  |""".stripMargin
              JdbcUtil.executeUpdate(conn, sql, Array(user, user))
              conn.close()
            } else {
              // Below the threshold: update today's click count for this user and ad
              val conn: Connection = JdbcUtil.getConnection
              val sql =
                """
                  |select * from user_ad_count where dt=? and userid=? and adid=?
                  |""".stripMargin
              // Check whether a row already exists in the count table
              val flag = JdbcUtil.isExist(conn, sql, Array(day, user, ad))
              if (flag) {
                // The row exists: add this batch's count to it
                val sql1 =
                  """
                    |update user_ad_count
                    |set count=count+?
                    |where dt=? and userid=? and adid=?
                    |""".stripMargin
                JdbcUtil.executeUpdate(conn, sql1, Array(count, day, user, ad))

                // If the updated total reaches the threshold, add the user to the blacklist
                val sql2 =
                  """
                    |select * from user_ad_count
                    |where dt=? and userid=? and adid=? and count>=30
                    |""".stripMargin
                val flag1 = JdbcUtil.isExist(conn, sql2, Array(day, user, ad))
                if (flag1) {
                  val sql3 =
                    """
                      |insert into black_list(userid) values(?)
                      |on DUPLICATE KEY
                      |UPDATE userid = ?
                      |""".stripMargin
                  JdbcUtil.executeUpdate(conn, sql3, Array(user, user))
                }
              } else {
                // No row yet: insert a new one
                val sql4 =
                  """
                    |insert into user_ad_count(dt,userid,adid,count) values(?,?,?,?)
                    |""".stripMargin
                JdbcUtil.executeUpdate(conn, sql4, Array(day, user, ad, count))
              }
              conn.close()
            }
          }
        }
      }
    )

    // 7. Start the job
    ssc.start()
    ssc.awaitTermination()
  }

  // One ad-click record
  case class AdClickData(ts: String, area: String, city: String, user: String, ad: String)
}
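The comment inside `foreachRDD` points out that `foreachPartition` lets one connection serve a whole partition instead of one connection per record. The following is a rough sketch of that pattern only, under stated assumptions: partitions are simulated with plain iterators so it runs without Spark or a database, and `PerPartitionConnection`, `writePartition` and `FakeConnection` are hypothetical names that do not appear in the original code.

```scala
import scala.collection.mutable.ListBuffer

object PerPartitionConnection {
  // Hypothetical stand-in for java.sql.Connection so the sketch needs no database.
  final class FakeConnection {
    val statements: ListBuffer[String] = ListBuffer.empty[String]
    var closed: Boolean = false
  }

  // Handles one partition: opens the resource once (and only if the partition
  // has records), writes every record through it, and always closes it.
  // Returns the number of records written.
  def writePartition[T, R](records: Iterator[T])(open: () => R)(close: R => Unit)(write: (R, T) => Unit): Int =
    if (!records.hasNext) 0
    else {
      val resource = open()
      try {
        var written = 0
        records.foreach { record =>
          write(resource, record)
          written += 1
        }
        written
      } finally close(resource)
    }

  // In the Spark job this would be wired roughly as follows (JdbcUtil is the
  // article's own helper; the write body is left out here):
  //   rdd.foreachPartition { iter =>
  //     writePartition(iter)(() => JdbcUtil.getConnection)(_.close()) { (conn, rec) => ... }
  //   }

  def main(args: Array[String]): Unit = {
    type ClickCount = ((String, String, String), Int)
    // Three simulated partitions; the middle one is empty.
    val partitions: Seq[Iterator[ClickCount]] = Seq(
      Iterator((("2021-01-01", "u1", "ad7"), 31), (("2021-01-01", "u2", "ad7"), 2)),
      Iterator.empty,
      Iterator((("2021-01-01", "u3", "ad9"), 5))
    )
    var opened = 0
    val written = partitions.map { part =>
      writePartition(part)(() => { opened += 1; new FakeConnection })((c: FakeConnection) => c.closed = true) {
        (conn: FakeConnection, rec: ClickCount) =>
          val ((day, user, ad), count) = rec
          conn.statements += s"$day $user $ad $count"
      }
    }
    println(s"records written per partition: $written, connections opened: $opened")
  }
}
```

Because the connection is created inside the function passed to `foreachPartition`, it is created on the executor and never has to be serialized. Skipping empty partitions avoids opening connections that would do no work.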