Window Computation in Spark DStream

Basic Concepts

Spark Streaming supports computing over the data that falls within a given time window.
[Figure: a window whose length is 3 micro-batches and whose slide interval is 2 micro-batches]
The figure above depicts a window whose length is 3 micro-batch intervals and whose slide interval is 2 micro-batch intervals. The micro-batches that fall into the same window are merged into one relatively large batch, the window batch.

Spark requires that every window length and every slide interval be an integer multiple of the micro-batch interval.

  1. Sliding window: window length > slide interval; adjacent windows overlap, so elements are shared between windows.
  2. Tumbling window: window length = slide interval; adjacent windows do not overlap.
    Note: a window with window length < slide interval is not supported, because it would cause data loss. A minimal sketch contrasting the two supported window types follows.
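The sketch below contrasts a sliding and a tumbling window; the 1-second batch interval, the local[5] master, and the CentOS:9999 socket source are illustrative assumptions borrowed from the examples later in this article.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowTypes {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[5]").setAppName("windowTypes")
    val ssc = new StreamingContext(conf, Seconds(1)) // micro-batch interval: 1 s

    val lines = ssc.socketTextStream("CentOS", 9999)

    // sliding window: length (4 s) > slide (2 s), adjacent windows share 2 s of data
    lines.window(Seconds(4), Seconds(2)).count().print()

    // tumbling window: length (4 s) = slide (4 s), no overlap between windows
    lines.window(Seconds(4), Seconds(4)).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}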

Time Attributes in Window Computation

Event Time (when the event actually occurred) < Ingestion Time (when the record enters the system) < Processing Time (when the record is actually processed)
Spark DStream currently supports only Processing Time, whereas Spark Structured Streaming also supports Event Time.
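For contrast, here is a minimal Structured Streaming sketch of an event-time window. The socket source, the use of current_timestamp() as a stand-in event-time column, and the 10-second watermark are illustrative assumptions, not part of the original article.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{current_timestamp, explode, split, window}

object EventTimeWindow {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[5]").appName("eventTimeWindow").getOrCreate()
    import spark.implicits._

    val lines = spark.readStream
      .format("socket").option("host", "CentOS").option("port", "9999").load()

    val words = lines
      .withColumn("timestamp", current_timestamp()) // stand-in for a real event-time column
      .select(explode(split($"value", "\\s+")).as("word"), $"timestamp")

    val counts = words
      .withWatermark("timestamp", "10 seconds") // tolerate 10 s of event-time lateness
      .groupBy(window($"timestamp", "4 seconds", "2 seconds"), $"word")
      .count()

    counts.writeStream.outputMode("update").format("console").start().awaitTermination()
  }
}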

Window Operators

| Transformation | Meaning |
| --- | --- |
| window(windowLength, slideInterval) | Return a new DStream which is computed based on windowed batches of the source DStream. |
| countByWindow(windowLength, slideInterval) | Return a sliding window count of elements in the stream. |
| reduceByWindow(func, windowLength, slideInterval) | Return a new single-element stream, created by aggregating elements in the stream over a sliding interval using func. The function should be associative and commutative so that it can be computed correctly in parallel. |
| reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks]) | When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window. Note: By default, this uses Spark's default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks. |
| reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks]) | A more efficient version of the above reduceByKeyAndWindow() where the reduce value of each window is calculated incrementally using the reduce values of the previous window. This is done by reducing the new data that enters the sliding window, and "inverse reducing" the old data that leaves the window. An example would be that of "adding" and "subtracting" counts of keys as the window slides. However, it is applicable only to "invertible reduce functions", that is, those reduce functions which have a corresponding "inverse reduce" function (taken as parameter invFunc). Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument. Note that checkpointing must be enabled for using this operation. |
| countByValueAndWindow(windowLength, slideInterval, [numTasks]) | When called on a DStream of (K, V) pairs, returns a new DStream of (K, Long) pairs where the value of each key is its frequency within a sliding window. Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument. |
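Of these operators, reduceByWindow and countByValueAndWindow are not demonstrated in the examples below, so here is a minimal sketch of both. The object name, the 1-second batch interval, and the checkpoint path (reused from a later snippet) are assumptions; countByValueAndWindow is built on an incremental (inverse) reduce internally, so checkpointing must be enabled.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object OtherWindowOperators {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[5]").setAppName("otherWindowOperators")
    val ssc = new StreamingContext(conf, Seconds(1))
    // required by countByValueAndWindow (incremental reduce under the hood)
    ssc.checkpoint("file:///D:/checkpoints")

    val words = ssc.socketTextStream("CentOS", 9999).flatMap(_.split("\\s+"))

    // reduceByWindow: one aggregated value per window, here the longest word seen in the window
    words.reduceByWindow((a, b) => if (a.length >= b.length) a else b, Seconds(4), Seconds(2)).print()

    // countByValueAndWindow: frequency of each distinct word within the window
    words.countByValueAndWindow(Seconds(4), Seconds(2)).print()

    ssc.start()
    ssc.awaitTermination()
  }
}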

window(windowLength, slideInterval)

package com.hw.demo06

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Milliseconds, Seconds, StreamingContext}

/**
  * @author fql
  * @date 2019/10/4 11:48
  *       Window computation: window length > slide interval (sliding window)
  */
object WindowCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setMaster("local[5]")
    conf.setAppName("windowCount")

    val scc = new StreamingContext(conf, Milliseconds(100)) // micro-batch interval: 100 ms

    scc.socketTextStream("CentOS", 9999, StorageLevel.MEMORY_AND_DISK)
      .flatMap(_.split("\\s+"))
      .map((_, 1))
      .window(Seconds(4), Seconds(2)) // window length > slide interval: sliding window
      .reduceByKey(_ + _)
      .print()

    scc.start()
    scc.awaitTermination()
  }
}

The operators that can follow window include count, reduce, reduceByKey, and countByValue. For convenience, Spark provides combined operators, for example:

window + count is equivalent to countByWindow(windowLength, slideInterval), and window + reduceByKey is equivalent to reduceByKeyAndWindow(func, windowLength, slideInterval).
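As a small illustration of these equivalences (a fragment only: lines is assumed to be a DStream[String] obtained as in the examples above, and countByWindow additionally requires ssc.checkpoint(...) because it uses an incremental inverse reduce internally):

// window + count versus the combined operator
lines.window(Seconds(4), Seconds(2)).count().print()
lines.countByWindow(Seconds(4), Seconds(2)).print()

// window + reduceByKey versus the combined operator
lines.map((_, 1)).window(Seconds(4), Seconds(2)).reduceByKey(_ + _).print()
lines.map((_, 1)).reduceByKeyAndWindow(_ + _, Seconds(4), Seconds(2)).print()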

reduceByKeyAndWindow

package com.hw.demo06

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Milliseconds, Seconds, StreamingContext}

/**
  * @author fql
  * @date 2019/10/4 11:48
  *       Window computation with the combined operator reduceByKeyAndWindow
  */
object WindowCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setMaster("local[5]")
    conf.setAppName("windowCount")

    val scc = new StreamingContext(conf, Milliseconds(100)) // micro-batch interval: 100 ms

    scc.socketTextStream("CentOS", 9999, StorageLevel.MEMORY_AND_DISK)
      .flatMap(_.split("\\s+"))
      .map((_, 1))
      .reduceByKeyAndWindow((v1: Int, v2: Int) => v1 + v2, Seconds(4), Seconds(3)) // 4 s window, 3 s slide
      .print()

    scc.start()
    scc.awaitTermination()
  }
}

If successive windows overlap by more than half, the window result can be computed more efficiently with the incremental form below (checkpointing must be enabled):

val sparkConf = new SparkConf().setAppName("WindowWordCount").setMaster("local[6]")
val ssc = new StreamingContext(sparkConf, Milliseconds(100))
ssc.sparkContext.setLogLevel("FATAL")
ssc.checkpoint("file:///D:/checkpoints") // required by the incremental form
ssc.socketTextStream("CentOS", 9999, StorageLevel.MEMORY_AND_DISK)
  .flatMap(_.split("\\s+"))
  .map((_, 1))
  .reduceByKeyAndWindow( // when windows overlap by more than 50%, the incremental form is more efficient
      (v1: Int, v2: Int) => v1 + v2, // add elements entering the window to the previous window's result
      (v1: Int, v2: Int) => v1 - v2, // subtract elements leaving the window
      Seconds(4),
      Seconds(3),
      filterFunc = (t) => t._2 > 0 // drop keys whose count has fallen to zero
  )
  .print()

ssc.start()
ssc.awaitTermination()

DStream Output

| Output Operation | Meaning |
| --- | --- |
| print() | Prints the first ten elements of every batch of data in a DStream on the driver node running the streaming application. This is useful for development and debugging. |
| foreachRDD(func) | The most generic output operator that applies a function, func, to each RDD generated from the stream. This function should push the data in each RDD to an external system, such as saving the RDD to files, or writing it over the network to a database. Note that the function func is executed in the driver process running the streaming application, and will usually have RDD actions in it that will force the computation of the streaming RDDs. |
val sparkConf = new SparkConf().setAppName("WindowWordCount").setMaster("local[6]")
val ssc = new StreamingContext(sparkConf, Milliseconds(100))
ssc.sparkContext.setLogLevel("FATAL")
ssc.socketTextStream("CentOS", 9999, StorageLevel.MEMORY_AND_DISK)
  .flatMap(_.split("\\s+"))
  .map((_, 1))
  .reduceByKeyAndWindow(
      (v1: Int, v2: Int) => v1 + v2, // aggregate the values of each key within the window
      Seconds(60),
      Seconds(1)
  )
  .filter(t => t._2 > 10) // keep only words appearing more than 10 times in the window
  .foreachRDD(rdd => {
      rdd.foreachPartition(vs => {
          vs.foreach(v => KafkaSink.send2Kafka(v._1, v._2.toString))
      })
  })

ssc.start()
ssc.awaitTermination()
import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

object KafkaSink {

  private def createKafkaProducer(): KafkaProducer[String, String] = {
    val props = new Properties()
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "CentOS:9092")
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer])
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer])
    props.put(ProducerConfig.BATCH_SIZE_CONFIG, "10")  // send a batch once it reaches 10 bytes
    props.put(ProducerConfig.LINGER_MS_CONFIG, "1000") // or after waiting at most 1000 ms
    new KafkaProducer[String, String](props)
  }

  val kafkaProducer: KafkaProducer[String, String] = createKafkaProducer()

  def send2Kafka(k: String, v: String): Unit = {
    val message = new ProducerRecord[String, String]("topic01", k, v)
    kafkaProducer.send(message)
  }

  // flush and close the producer when the JVM shuts down
  Runtime.getRuntime.addShutdownHook(new Thread() {
    override def run(): Unit = {
      kafkaProducer.flush()
      kafkaProducer.close()
    }
  })
}

Note: By default, Spark emits a window's final result only after the window's end time has passed; this output style is usually referred to as clamped (end-of-window) output.
