Spark Core Study Notes

Closures

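Roughly speaking, a closure lets a function access variables defined outside of it. In Spark, the closure is serialized and shipped to the executors, so each task works on its own copy of those variables: changes made to a variable inside the function are not visible outside it, on the driver.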

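First, a closure is a function. Second, and this is the essential part, the function body references (depends on) variables that are neither global nor local to it, but are defined in the enclosing context. Such variables are called "free variables" (one shows up in the examples later on). What makes a closure seem "magical" is that it keeps hold of, or continuously "tracks", the variables it references. At the language-implementation level this means those variables, and the objects they refer to, are not reclaimed by the GC while the closure is still reachable. Put another way: a closure is a function that carries an implicit context "behind" it, and that context stores the free variables the function body refers to.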
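The idea of a free variable can be illustrated with a small plain-Scala sketch that needs no Spark at all. The names `ClosureSketch`, `makeCounter`, `factor` and `triple` are made up for illustration. Note that inside a single JVM the closure and its defining scope share the same variable; it is only when Spark serializes the closure and ships it to executors that each task ends up with its own copy.

```scala
object ClosureSketch {
  // Returns a closure. `count` is a free variable: it is neither a parameter
  // nor a local of the returned function, but lives in the enclosing scope.
  def makeCounter(): () => Int = {
    var count = 0
    () => { count += 1; count }
  }

  def main(args: Array[String]): Unit = {
    val next = makeCounter()
    // makeCounter() has already returned, yet `count` is kept alive by the closure.
    println(next()) // 1
    println(next()) // 2

    val factor = 3                      // free variable captured by `triple`
    val triple = (x: Int) => x * factor
    println(triple(5)) // 15
  }
}
```

Each call to `makeCounter()` creates a fresh `count`, so two counters obtained this way do not interfere with each other.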

mapPartitions

  /**
   * Return a new RDD by applying a function to each partition of this RDD.
   *
   * `preservesPartitioning` indicates whether the input function preserves the partitioner, which
   * should be `false` unless this is a pair RDD and the input function doesn't modify the keys.
   */
  def mapPartitions[U: ClassTag](
      f: Iterator[T] => Iterator[U],
      preservesPartitioning: Boolean = false): RDD[U] = withScope {
    val cleanedF = sc.clean(f)
    new MapPartitionsRDD(
      this,
      (context: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(iter),
      preservesPartitioning)
  }

map: if a partition holds 10,000 records, your function is invoked and evaluated 10,000 times.

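mapPartitions: a task invokes the function only once, and the function receives an iterator over all of the records in that partition. Because it is called once per partition rather than once per record, the per-call overhead is much lower.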

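If the mapping step has to create extra objects frequently, mapPartitions is far more efficient than map. For example, when writing the data of an RDD to a database over JDBC, map has to open a connection for every element, whereas mapPartitions opens one connection per partition.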
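The difference in how often a resource gets created can be sketched as follows. This is a minimal illustration, not production code: `FakeConnection` is a hypothetical stand-in for a real JDBC connection, `countConnections` is a made-up helper, and the accumulators are used only to count how many times each function body opens a connection. It assumes Spark Core is on the classpath and runs in local mode.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MapPartitionsSketch {
  // Hypothetical stand-in for an expensive resource such as a JDBC connection.
  final class FakeConnection {
    def write(x: Int): String = s"stored-$x"
    def close(): Unit = ()
  }

  // Returns (connections opened via map, connections opened via mapPartitions).
  def countConnections(sc: SparkContext): (Long, Long) = {
    val rdd = sc.parallelize(1 to 10, numSlices = 2)
    val openedByMap = sc.longAccumulator("openedByMap")
    val openedByMapPartitions = sc.longAccumulator("openedByMapPartitions")

    // map: the function body runs for every element, so every element opens a connection.
    rdd.map { x =>
      openedByMap.add(1)
      val conn = new FakeConnection
      try conn.write(x) finally conn.close()
    }.collect()

    // mapPartitions: the function body runs once per partition.
    rdd.mapPartitions { iter =>
      openedByMapPartitions.add(1)
      val conn = new FakeConnection
      // The iterator is lazy, so materialize the results before closing the connection.
      val out = iter.map(conn.write).toList
      conn.close()
      out.iterator
    }.collect()

    (openedByMap.sum, openedByMapPartitions.sum)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("mapPartitions-sketch")
    val sc = new SparkContext(conf)
    try println(countConnections(sc)) finally sc.stop()
  }
}
```

With 10 elements in 2 partitions and no task retries, the first count should come out as 10 and the second as 2. Materializing with `toList` keeps the sketch simple but holds a whole partition's results in memory; for large partitions one would instead close the connection when the iterator is exhausted.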

Spark SQL and the DataFrame API apply this kind of mapPartitions optimization to your program by default.

Demo

val standardResultRDD = inputFile.mapPartitions(l => {
  val standardInstalledAppUtilObject = new DeSerializableDimHdfs(obj.value).getStandardInstalledAppUtilObject
  for (e <- l) yield {
    OdsStandardInstlledAppPro.odsStandardInstlledApp(e, standardInstalledAppUtilObject)
  }
}).filter(_ != null)

broadcast

  /**
   * Broadcast a read-only variable to the cluster, returning a
   * [[org.apache.spark.broadcast.Broadcast]] object for reading it in distributed functions.
   * The variable will be sent to each cluster only once.
   */
  def broadcast[T: ClassTag](value: T): Broadcast[T] = {
    assertNotStopped()
    require(!classOf[RDD[_]].isAssignableFrom(classTag[T].runtimeClass),
      "Can not directly broadcast RDDs; instead, call collect() and broadcast the result.")
    val bc = env.broadcastManager.newBroadcast[T](value, isLocal)
    val callSite = getCallSite
    logInfo("Created broadcast " + bc.id + " from " + callSite.shortForm)
    cleaner.foreach(_.registerBroadcastForCleanup(bc))
    bc
  }

A Broadcast variable is read-only, and the data it carries must be serializable.

Demo
val obj = sc.broadcast(objPro)
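
For a fuller picture of how a broadcast value is created on the driver and read inside a distributed function, here is a minimal sketch. The lookup table `dim`, the helper `lookupNames` and the object name `BroadcastSketch` are assumptions made for illustration; only `sc.broadcast`, `Broadcast.value` and `Broadcast.destroy` are Spark API calls. It assumes Spark Core is on the classpath and runs in local mode.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastSketch {
  // Resolves ids against a small lookup table that is broadcast to the executors.
  def lookupNames(sc: SparkContext): Seq[String] = {
    val dim = Map(1 -> "alpha", 2 -> "beta") // hypothetical, serializable lookup table
    val dimBc = sc.broadcast(dim)            // created once on the driver

    val result = sc.parallelize(Seq(1, 2, 3), numSlices = 2)
      // Tasks only read the broadcast through .value; they never modify it.
      .map(id => dimBc.value.getOrElse(id, "unknown"))
      .collect()
      .toSeq

    dimBc.destroy() // release the broadcast once it is no longer needed
    result
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("broadcast-sketch")
    val sc = new SparkContext(conf)
    try println(lookupNames(sc)) finally sc.stop()
  }
}
```

Referencing `dimBc` rather than `dim` inside the function is the point of the exercise: the closure then captures only the lightweight broadcast handle instead of a copy of the whole table for every task.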
