Spark's Accumulator and AccumulatorV2
1. Overview
An Accumulator lets you count properties of your data accurately while a job runs, e.g. how many sessions match a condition, or how many purchases happened within a given time window.
def accumulator[T](initialValue: T, name: String)
initialValue: the initial value of the accumulator.
name: an optional name for the accumulator; a named accumulator shows up on the Tasks page of the Driver's 4040 Web UI, which helps you see how the job is progressing.
2. Example
import org.apache.spark.{SparkConf, SparkContext}

object SparkRdd {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("SparkRdd")
    val sc = new SparkContext(conf)
    val accum = sc.accumulator(0, "test1")
    val data = sc.parallelize(1 to 9)
    // an action is needed to trigger the execution
    data.foreach(x => accum += 1)
    println(accum.value)
    // prints 9
    sc.stop()
  }
}
3. Caveats
- An action must run before accum.value on the driver reflects the accumulated result.
- To keep the result accurate, trigger only one action on an RDD whose transformations update the accumulator; re-executing the lineage repeats the updates, as the following example shows.
object SparkAccumulator {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("SparkAccumulator")
    val sc = new SparkContext(conf)
    val accum = sc.accumulator(0, "test2")
    val data = sc.parallelize(1 to 8)
    val data2 = data.map { x =>
      if (x % 2 == 0) {
        accum += 1
        0
      } else 1
    }
    // actions trigger the execution
    println(data2.count)  // prints 8
    println(accum.value)  // prints 4
    println(data2.count)  // prints 8
    println(accum.value)  // prints 8
    sc.stop()
  }
}
The second data2.count recomputes data2 from its lineage, so accum += 1 runs a second time for every even element and the accumulator ends up at 8 instead of 4.
Solution: break the dependency between the two actions by caching the RDD with cache or persist, as sketched below.
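A minimal sketch of that fix, reusing the SparkAccumulator setup above: once data2 is cached, the second count reads the cached partitions instead of re-running the map, so the accumulator keeps its correct value.

    val data2 = data.map { x =>
      if (x % 2 == 0) {
        accum += 1
        0
      } else 1
    }.cache()             // materialize the mapped RDD on first use

    println(data2.count)  // prints 8; accumulator updated once per even element
    println(accum.value)  // prints 4
    println(data2.count)  // prints 8, served from the cache; map is not re-executed
    println(accum.value)  // still 4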
4. AccumulatorV2
The old Accumulator API was deprecated in Spark 2.0; from 2.0 onward, use AccumulatorV2.
/**
* The base class for accumulators, that can accumulate inputs of type `IN`,
* and produce output of type `OUT`.
*/
abstract class AccumulatorV2[IN, OUT] extends Serializable { }
Changes compared with Accumulator
- No initial value parameter is needed; accumulation starts from zero by default.
- When creating the accumulator you can give it a name; a named accumulator shows its per-Task values on the Driver's 4040 Web UI (an unnamed accumulator does not appear in the Web UI).
- A new reset method resets the accumulator back to zero (see the short sketch after this list).
- Two ways to obtain an instance:
val accumulator = sc.longAccumulator("test")
or
val accumulator = new LongAccumulator()
sc.register(accumulator, "test")
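A short sketch of the manual registration path and of reset, assuming an existing SparkContext sc (the accumulator name "test" is just illustrative):

    import org.apache.spark.util.LongAccumulator

    val acc = new LongAccumulator()
    sc.register(acc, "test")          // named, so it appears on the 4040 Web UI

    sc.parallelize(1 to 5).foreach(x => acc.add(x))
    println(acc.value)                // 15
    println(acc.avg)                  // 3.0 = sum / count

    acc.reset()                       // back to zero
    println(acc.isZero)               // true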
LongAccumulator source code
/**
* An [[AccumulatorV2 accumulator]] for computing sum, count, and averages for 64-bit integers.
*
* @since 2.0.0
*/
class LongAccumulator extends AccumulatorV2[jl.Long, jl.Long] {
  private var _sum = 0L
  private var _count = 0L

  /**
   * Returns false if this accumulator has had any values added to it or the sum is non-zero.
   * @since 2.0.0
   */
  override def isZero: Boolean = _sum == 0L && _count == 0

  override def copy(): LongAccumulator = {
    val newAcc = new LongAccumulator
    newAcc._count = this._count
    newAcc._sum = this._sum
    newAcc
  }

  override def reset(): Unit = {
    _sum = 0L
    _count = 0L
  }

  /**
   * Adds v to the accumulator, i.e. increment sum by v and count by 1.
   * @since 2.0.0
   */
  override def add(v: jl.Long): Unit = {
    _sum += v
    _count += 1
  }

  /**
   * Adds v to the accumulator, i.e. increment sum by v and count by 1.
   * @since 2.0.0
   */
  def add(v: Long): Unit = {
    _sum += v
    _count += 1
  }

  /**
   * Returns the number of elements added to the accumulator.
   * @since 2.0.0
   */
  def count: Long = _count

  /**
   * Returns the sum of elements added to the accumulator.
   * @since 2.0.0
   */
  def sum: Long = _sum

  /**
   * Returns the average of elements added to the accumulator.
   * @since 2.0.0
   */
  def avg: Double = _sum.toDouble / _count

  override def merge(other: AccumulatorV2[jl.Long, jl.Long]): Unit = other match {
    case o: LongAccumulator =>
      _sum += o.sum
      _count += o.count
    case _ =>
      throw new UnsupportedOperationException(
        s"Cannot merge ${this.getClass.getName} with ${other.getClass.getName}")
  }

  private[spark] def setValue(newValue: Long): Unit = _sum = newValue

  override def value: jl.Long = _sum
}
Example
object MyAccumulator {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("MyAccumulator")
    val sc = new SparkContext(conf)
    val accumulator = sc.longAccumulator("count")
    val rdd1 = sc.parallelize(10 to 100).map { x =>
      if (x % 2 == 0) {
        accumulator.add(1)
      }
    }
    // a single action, so the accumulator is updated exactly once per element
    println("count = " + rdd1.count())            // count = 91
    println("accumulator = " + accumulator.value) // accumulator = 46 (even numbers in 10..100)
    sc.stop()
  }
}
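LongAccumulator covers numeric sums, but AccumulatorV2 can also be subclassed for other output types. Below is a minimal sketch, not from the original text: a hypothetical DistinctLongAccumulator that collects the distinct values added on the executors into a set, overriding the same six methods seen in the LongAccumulator source above.

import java.util.Collections
import org.apache.spark.util.AccumulatorV2

// IN = Long, OUT = java.util.Set[Long]
class DistinctLongAccumulator extends AccumulatorV2[Long, java.util.Set[Long]] {
  // thread-safe backing set
  private val _set = Collections.synchronizedSet(new java.util.HashSet[Long]())

  override def isZero: Boolean = _set.isEmpty

  override def copy(): DistinctLongAccumulator = {
    val newAcc = new DistinctLongAccumulator
    newAcc._set.addAll(_set)
    newAcc
  }

  override def reset(): Unit = _set.clear()

  override def add(v: Long): Unit = _set.add(v)

  override def merge(other: AccumulatorV2[Long, java.util.Set[Long]]): Unit =
    _set.addAll(other.value)

  override def value: java.util.Set[Long] = _set
}

Usage follows the same register pattern as before:

val distinctAcc = new DistinctLongAccumulator
sc.register(distinctAcc, "distinctEvens")
sc.parallelize(1 to 10).foreach(x => if (x % 2 == 0) distinctAcc.add(x))
println(distinctAcc.value) // e.g. [2, 4, 6, 8, 10]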